I got a service call from our biggest customer on Sunday. The girl at the check-in desk told me that she couldn’t get to the reservation system, so she couldn’t check customers in or out. She also could not open her email. And there was something about cooling equipment in the engine room that had failed.
That last bit made me worry that a reboot of the workstation might not do the trick this time.
Last Sunday was also Fathers’ day. Not a good day for an emergency. I am happy that the call didn’t come before my kids had the chance to “wake me up” and deliver their congratulations and prezzies. They had really been waiting for it. In fact, i even had the time for a proper breakfast. But the rest of the day looked set to go down the drain. My wife was also on call that weekend, and the kids and i were supposed to show up at my parents-in-law’s at midday. Doom was impending.
I called the customer’s site security manager and got the news. There had been a power failure in a transformer a few blocks away. The on-site UPS was sucked dry, and the generators had failed to start. It was a cascade failure, and it was not good. But hey, they are a big customer. Maybe the servers would come back once power had been restored.
Power was back at about eleven-thirty. I made a bunch of phone calls to the customer’s different sites to ask whether their reservation systems were down or up, while the kids were growing louder. They were all dressed up and ready to leave and did an excellent job of getting on each other’s nerves.
The reports from the sites were contradictory, to say the least. The reservation system was up, no down, no it was up but now it’s not. Email was still down. And the lunch at the in-laws was about to start. So i gave them a call too and said that we were going to be a few minutes late, but that i’d probably have to set up a remote office at their place, make some phone calls and use my computer to open a remote connection to the customer. If all was really bad, i might have to skip lunch and visit the customer’s site, but the kids would be there anyway. And it surely wouldn’t take very long.
I felt the first grain of bad karma fall on me.
From my remote office, i was able to talk with the firewall, but the mail server didn’t respond to pings. And with the site manager on the phone suggesting that i should maybe stop mucking about with remote help and get my servicing arse over there instead, i concurred. Since i don’t have access to the servers’ iLO management system (which works even if the server is off and through which i would be able to switch on the server remotely), i thought i might as well look good in the customer’s eyes and drive downtown to push the damn power button and be back in time for dessert. Or coffee, if it was more than one server.
On the way downtown, i had another chat with the customer’s IT manager and he decided he too would come to the disaster area. At the time, i thought it might be overkill. It was probably just a flick of the switch on a server and we’d be back up and running.
Boy was i wrong.
Things were a bit more silent in the engine room than usual. The air conditioning was okay, which was the first good bit of work-related news for the day. We proceeded to fire up the servers. The domain controller was off. The file server was off. The mail server had hung, or it was off, or just b0rked. The intranet was down. The virtual server server (for lack of a better term) was off, and with it, the virtual servers. The disk array was on but one of the virtual servers could not connect to it. The reservation system was off for this site but up for another. The billing system, it turned out, was off. The orders printer in the kitchen was blown. The applications to operate and monitor telephone calls, wake-ups, keys and (oh!) the mini bars were off. Also, our management PC was off. And to top things off, the console thingy that one would operate half the servers with had suddenly decided that it wanted a password which nobody had. And all this was by no means apparent at a glance. Problems oozed in as others were solved. On site, three fathers: the site security chief, the IT manager, and me. How could things be better?
We started with the most critical systems. By this time, i had mobilized half of the Infra crew, most notably Niko, who got the virtual servers and the disk array in order, and Tero, who was on a beach in Spain and remote-instructed us from there. Had it not been for their expertise, the customer’s systems would probably still be down. Soon, we had the check-in system up, and the three systems that need to run in tandem (trindem?) to take care of billing were slowly back in operation. Email required an extra reboot, but it also came back.
Seldom had i wished more for proper documentation of the system than now. An inventory of equipment and servers and how to get everything running, even for a guy like me who doesn’t spend most of his billable hours at this customer… would have saved the day.
By this time, lunch, dessert and coffee were but a pressing, sad memory. Every hour, i had to tell my wife that this won’t take much longer and we just need this one system back up, after which it turned out that that one system really is a whole bunch of subsystems that first need to be physically located before they can be brought into operation. I felt the bad karma pile up in massive quantities.
At this point i should probably tell you about the third server room on site. The first two are like proper server rooms. There’s loud air conditioning. There are a few more monitors, cables, power supplies, cardboard boxes and junk lying around than there should be. There are racks with loud expensive technical equipment having lots of lights that blink. There’s a crapload of cables going in front of the boxes that blink most, so you can’t really access the equipment without a jungle machete or a lot of patience (the second option is preferred). Many of the servers are tightly crammed because at the time, nobody thought you really would need to get to the other side of the servers. Say, to plug in one of those bulky CRT monitors lying around because the console demands a password which, as i probably mentioned, nobody knew. And you couldn’t use remote desktop, because the stickers on the computers failed to mention the hostname or IP address of the box. And you would need to get to the computer to see if the apps on it are running. And just to really top it off, a few of the machines refused to start without a keyboard plugged in, and since the console was off-line because nobody knew the password, it wasn’t considered a working keyboard, at least not by the computer.
Compared to the two main server rooms, the third server room is a mess. The non-techie people working around there use the room for ad-hoc storage of audiovisual equipment (speakers, cables, microphones, amps, cables, more cables…) and junk. I had to step around a cardboard box of miscellanea just to get into the room. A ghetto blaster was obstructing half of the entrance. A snake pit of cables was lying on the so-called operator table, partly on top of and partly under the keyboard, mouse and KVM switch.
Above the operator table are a few shelves with servers. Well, actually they aren’t servers of the kind you would call servers. They are more like old workstations on server duty, in part because it’s cheaper that way and in part because nobody seems to know whether an application on one “server” will play nice with the application on another. Thus, there is one box per application. Per critical application, i might add. The workstations are five years old or more, and they live in a cramped space on the second-to-top shelf in a room filled with snakes, audiovisual trash and a ghetto blaster. I really should have taken a picture.
Since nobody thought of it at installation time, the “servers” were not set to start automatically once they got power. In fact, this held true for nearly all computers, be they proper servers or workstations working as servers. And even if they had started, many of the critical applications still needed somebody to actually log in to the computer and start the application in question. Here, the computers were not part of any site-wide Windows domain, so we had to guess the passwords, just to keep things interesting.
It was a quarter past four when i headed back towards the remains of the fathers’ day reception. The other guests had looked after our kids, who had been a bit confused by the absence of their father from that fathers’ day reception. I gave my kids a big hug, apologized to the company present, and hoped that i’d never have to see a computer again.
Boy was i wrong.
