Day of Chaos, Part 2

Here’s the long version of what happened Friday regarding the file server. If you’re not interested, by all means just ignore this. It’s for informational purposes only.

I was scheduled to work Thursday night 3rd shift. My plan was to apply security patches, run updates, scan hard drives, etc., on servers. Pretty much the stuff that can’t be done during the school day. Things were going well until about 2AM, when I tried to scan our file server (“Miss Piggy” for those interested in server names). I couldn’t get any response from the keyboard, and I also couldn’t log in via the network. Files were still being served, but for all intents and purposes, the server was locked up. Upon reboot, the computer wouldn’t even recognize there were hard drives at all.

The strange thing is that our file server is very high-end and has built-in alarms that are supposed to go off when there is a problem. If one of the hard drives fails, it sets off an alarm and just keeps purring along thanks to its redundancy. In this case, however, the alarm was apparently non-functional. I have no idea how long it was serving files in “limp mode”, but after trying to repair it, it became quite clear that the RAID controller was completely shot.

It was now about 2:30AM, and I realized there was no way our file server would be ready for Friday morning. Even if a brand new server were in my lap, getting it installed and configured would not be possible in that amount of time. This is mainly because not only were all the user files gone, but the entire operating system was gone. So, my first item of business was to get our remaining servers/workstations at least partially functional. Believe it or not, MANY things rely on the availability of documents… So I spent the next hour and a half or so getting the Windows server, Macs, Linux servers, etc., to function without having the document server at all. I figured if we could have SDS, attendance, Internet, etc., it would be better than having nothing at all.

Then, at 4AM I started the process of rebuilding the file server. That’s where my over-paranoid backup practices finally paid off. :o) I also happened to have (again, paranoia) another RAID controller in a second server that I could transplant. Another good thing, and really the only “good” news in the whole story, is that I had brand new hard drives waiting to upgrade our file server’s storage space. This is NOT the way I wanted to upgrade the server, but I’m trying to focus on how nice it is to have it upgraded. :o)

The server finished restoring at 2:45PM Friday. It seemed absurd to put everything back online that late in the day, so for the remainder of the day (and evening), I ran tests and scans on the server to make sure it was OK for Monday morning. Things finished just as the Varsity football game ended (I walked back and forth, so I got to watch some of the game while the server was scanning), and I went home.

Thank you very much to those who sent supportive emails, etc. To be honest, as far as catastrophic hardware failures go, this repair couldn’t have gone any better. I do wish it had happened on Saturday instead of Friday, but at least things are working on Monday. 🙂