Why the heck not? It’s one less thing to worry about. I had installed the WordPress that came with Fedora Core 7 but have since had some trouble with the i2o disk controllers. I basically lost everything on a RAID-5 array. Nothing aside from the few posts I made was really at risk of being lost; everything else had just been backed up prior to the system shuffling anyways.
The cause of this bizarre failure still escapes me.
I’d been tinkering with some new 100mm fans that I picked up at the local automotive/everything surplus store. The plan was to stuff as many of these as possible throughout both my desktop PC case and the server. 5 fans at $2.49ea… great deal. I started with my desktop PC; in there is an eVGA 8800 GTS that idled in the mid-60s °C and would peak into the 80s under load. I added 2 intake fans to the case, plus a third intake fan mounted on the case door directly over the video card. This did a great job of bringing the idle temps down to the mid-50s and of dissipating the heat under load.
Anyways, this is where it started to go wrong. After completing the fan install on my desktop PC, I shelled into the FC7 box and issued a ‘halt’. Usually, the system crunches around, kills off all the processes and preps the system for shutdown. This usually finishes with a message like “System Halted. Power down”. 15 minutes went by and the shutdown process seemed to have stalled; it wasn’t doing anything. I think this sort of thing happens on all Linux installs occasionally; perhaps some process was killed manually and its PID file stayed behind, making the init script wait for a process that has already ended… It really could be anything, and there is no way of knowing after the fact what it was.
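For the curious, the stale-PID-file theory is at least easy to check while a box is still up. Here is a rough Python sketch, assuming a POSIX system and a PID file that holds nothing but a process ID; the function name `pid_file_is_stale` is made up for illustration and has nothing to do with what the FC7 init scripts actually do.

```python
import os


def pid_file_is_stale(path):
    """Return True if the PID recorded in `path` no longer names a live process.

    POSIX-only sketch: signal 0 performs the existence/permission check
    without delivering anything to the target process.
    """
    with open(path) as f:
        text = f.read().strip()
    try:
        pid = int(text)
    except ValueError:
        return True  # garbage in the file: treat it as stale
    if pid <= 0:
        return True  # 0 and negatives address process groups, not one process
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return True  # no such process: the file was left behind
    except PermissionError:
        return False  # process exists but belongs to another user
    return False
```

One caveat: PIDs get recycled, so a "live" answer only says that some process has that number, not that it is the daemon that wrote the file.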
So I gave it the “Big Red Button”. Powered off, unplugged it and installed a few case fans. The new fans went onto the PSU lead with the other fans, using a full 12-volt line. (There will be a rant about Antec’s “Fan Only” PSU lead in a future post). Checked my work, closed up the case and reconnected everything. Flicked the power switch…
It passed POST as usual, but this time, instead of booting off the Adaptec 2400, it fell through and tried to boot off the CD, then tried to PXE boot off each NIC. “No biggie” I thought; on rare occasions CMOS data can get reset when tampering inside a case… it does happen. So, power on again, mash the [DEL] key. Immediately I recognize that the settings are fine, just as I left them.
Try to boot again… same deal.
Boot again and access the Adaptec hardware configuration; no problem, the array is still there. The tool only lets you manage the array; there is nothing configurable about the card itself, such as boot interrupts or memory addresses.
Alrighty… I don’t quite have the “Oh No!” feeling yet, so I boot off the FC7 DVD to see if the card is still working on the Linux side; if it is, I might be able to get my data off or even repair over the top. So I boot the Recover option from the DVD. The i2o drivers come up, the card is seen, capacity of the drives listed… but NO PARTITION TABLE! WtF? I break out of the installer to the first available console and try ‘fdisk’ for myself… yup, it’s a big old empty partition table.
The “Oh No!” feeling sets in.
Fu-diddly-ck.
Having let FC7 configure the partition table when first installed, I didn’t note the specific partition sizes and disk layout. I’m no expert with the guts of the FC installer so manually invoking the partition tool escapes me. So I began to evaluate the old “diminishing returns” formula… RAID controllers tend to re-initialize when partition tables are written, so either way it looks like this installation is up-the-creek.
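For what it’s worth, a record of the layout is cheap to capture ahead of time. Below is a rough Python sketch, assuming a classic MBR (DOS) partition table sitting in the first 512-byte sector of the disk; the function name `parse_mbr` is made up for illustration, and it only reads the four primary entries (logical partitions inside an extended partition are not followed).

```python
import struct

SECTOR_SIZE = 512
TABLE_OFFSET = 446   # the partition table starts after the boot code
ENTRY_SIZE = 16      # four 16-byte primary entries


def parse_mbr(sector):
    """Decode the four primary partition entries of an MBR sector.

    Returns a list of dicts for the non-empty entries, which can be
    printed or saved somewhere off the machine.
    """
    if len(sector) < SECTOR_SIZE:
        raise ValueError("need at least 512 bytes")
    if sector[510:512] != b"\x55\xaa":
        raise ValueError("missing 0x55AA boot signature")
    entries = []
    for i in range(4):
        start = TABLE_OFFSET + i * ENTRY_SIZE
        raw = sector[start:start + ENTRY_SIZE]
        # status, CHS start, type, CHS end, LBA start, sector count (little-endian)
        status, _chs_first, ptype, _chs_last, lba, count = struct.unpack(
            "<B3sB3sII", raw)
        if ptype == 0:
            continue  # type 0 marks an unused slot
        entries.append({
            "index": i + 1,
            "bootable": status == 0x80,
            "type": ptype,
            "start_lba": lba,
            "sectors": count,
        })
    return entries
```

To use it on a real disk you would read the first 512 bytes of the block device as root and pass them in; the device name depends on the controller and driver, so I won’t guess at it here. An empty list back from a disk that used to boot is exactly the “big old empty partition table” situation described above.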
FC only takes a few minutes to re-install and I’d only be “out” 2 things: my WordPress installation/posts (a bit of searching later yields an XML file. w00t) and my latest configuration of Asterisk, Samba and such. So rather than messing around for hours on end, I quickly re-installed and got things going again.
At this point, I’m entertaining wild theories and ideas about what might have happened here and what I could have done to recover from this. Something mangled the partition table, I know that much. I’d suggest something about the RAID superblocks… but nothing hinted that the array was ever broken; it’s like something just blew away the partition table.
I suppose the lesson is, no matter how young an installation is, it’s never going to be exempt from needing a good backup regime. Back-it-up or else!!