Dance With Grenades | So, Here's What Happened
Alright. The downtime I had for a couple of days earlier this week was entirely my fault. I'm writing up what happened so that someone else can read it and avoid doing the same thing. I'll also go over what I did well, and what I'm doing better this time.
Here's What Happened
Wednesday afternoon. I logged into HomeOne (my local server) via SSH. I forget exactly what I was doing, but it was most likely just to tail Cherokee's error log, since I had PHP error printing turned off. I decided it was time to do an apt upgrade. HomeOne, at this point, was running Debian Squeeze (the testing branch at the time), and it was already pretty messy. As you can imagine, a testing release is not the best choice for a production server. In retrospect, I was banking on Debian's extremism toward stability to carry me along on the testing branch without problems.
Up until this point, the PostgreSQL in the Debian repositories had been 8.1. Suddenly, I saw apt upgrading to 8.4. Normally, this should be fine. Unfortunately, and I'm not entirely sure why, postgres would not upgrade from 8.1 to 8.4. Finally, I removed 8.1, and then 8.4 installed just fine. This would have been just peachy, but absentmindedly, I also purged 8.1 from the system, which deleted all of my data and configs for pg 8.1.
Now, at this point, I figured (stupidly, in retrospect), "Oh, I can probably still recover this data with reiserfsck." (HomeOne had been running ReiserFS since before Hans Reiser was convicted of murder. That's how long it'd been since I originally put Debian on there.) So I remounted /dev/sda1 read-only (it turned out to be my music drive) and executed the following command:
reiserfsck --rebuild-tree -S -l /root/recovery.log /dev/sda1
This ran for a while, and when it finished, I realized I'd run it on my music drive. Luckily, these static files don't change, so nothing bad seems to have happened.
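In hindsight, a quick check of what a device actually holds before pointing reiserfsck at it would have caught the mix-up. Here is a minimal sketch of that idea, not something I had at the time. The function names mount_point_of and confirm_target are made up for illustration, and it assumes the device shows up in /proc/mounts under the same name you type (LVM volumes, for instance, may be listed under /dev/mapper).

```shell
#!/bin/sh
# Sketch: look up where a block device is mounted before running
# anything destructive against it. Function names are hypothetical.

# mount_point_of DEVICE [MOUNTS_FILE]
# Prints the mount point of DEVICE, or nothing if it is not mounted.
mount_point_of() {
    device=$1
    mounts_file=${2:-/proc/mounts}
    # /proc/mounts fields: device, mount point, fs type, options, dump, pass
    awk -v dev="$device" '$1 == dev { print $2; exit }' "$mounts_file"
}

# confirm_target DEVICE EXPECTED_MOUNT_POINT [MOUNTS_FILE]
# Succeeds only if DEVICE is mounted exactly where you think it is.
confirm_target() {
    device=$1
    expected=$2
    mounts_file=${3:-/proc/mounts}
    actual=$(mount_point_of "$device" "$mounts_file")
    if [ "$actual" != "$expected" ]; then
        echo "refusing: $device is mounted at '${actual:-nowhere}', not '$expected'" >&2
        return 1
    fi
}

# Example: only go on to the fsck if the device really is the root volume.
#   confirm_target /dev/mapper/HomeOne-root / && echo "ok, that's the right one"
```

Even with the right device, --rebuild-tree is the heavy hammer. As far as I can tell, the reiserfsck documentation treats it as a last resort to be used after a plain check says it's needed, and ideally on a copy of the partition.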
So I ran it again, but this time on the right partition. This one was an LVM volume, so it was /dev/mapper/HomeOne-root.
This drive was a lot smaller (80GB as opposed to 320GB for the music volume), and was divided into multiple partitions, so HomeOne-root was only 10GB. As you can imagine, this went a lot faster. When it finished, I went to reboot the machine, only to find that the shutdown sequence couldn't complete because /etc/init.d/rcS was no longer fully intact!
At this point, I screamed.
After regaining my composure, I decided the best course of action would be to get a backup system running in a VM on ColonelRhombus (my desktop) to provide DHCP and DNS caching, which HomeOne was unable to provide for the time being. Once that was all set up, I set about finding a working hard drive to put into HomeOne so I could install Ubuntu Server and use it to recover whatever was recoverable from HomeOne-root.
I wasted what seemed like several unending hours trying to get HomeOne to boot from a USB installer, since it has no optical drive, to no avail. I got it to boot a Hardy installer, but could not complete the installation because the installer expected a CD-ROM device to install from, and none was available. Every attempt after that to make a bootable Lucid USB stick resulted in utter failure. At this point, I went to bed.
The next morning, I dug through some of the stuff in the closet here in my office and found an antiquated Lite-On CD-RW drive with a beige bezel, manufactured in 2005. This thing still works? Fortunately, yes, it does. Once that was plugged in, I installed Lucid to HomeOne's new pair of 30.6GB Seagate Barracuda ATA II drives (after sorting through 4 other dead PATA drives, one of which gave me false hope), shut it down, unplugged the optical drive, plugged in the other two necessary drives (HomeOne LVM and music), and booted back up.
I mounted /dev/sdc1 and found that it was the music partition. I made a note and unmounted it. I then tried to mount /dev/sdd's partitions, unsuccessfully. I nearly panicked, until I remembered that it was LVM. After installing the LVM tools (the lvm2 package), the volumes mounted just fine from /dev/mapper.
Fortunately, I keep all of my web files in /srv, which I also always keep on a separate partition, so they were no problem to recover. I was also able to recover my MySQL databases. One table in `modx_dwg` needed to be recovered, but it went smoothly and suffered no data loss. Unfortunately, as I had feared, none of my postgres data could be recovered.
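For anyone who hits the same "why won't /dev/sdd mount" wall, the LVM part boils down to scanning for volume groups, activating the one you want, and mounting the logical volume read-only. This is a rough sketch rather than a transcript of what I typed. The run and mount_lv_readonly helpers and the /mnt/recovery mount point are made up for illustration; the volume group name HomeOne is inferred from /dev/mapper/HomeOne-root. It needs root and the LVM tools when run for real.

```shell
#!/bin/sh
# Sketch: activate an LVM volume group and mount one logical volume
# read-only for recovery. Helper names are hypothetical.

# Set DRY_RUN=1 to print each command instead of running it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# mount_lv_readonly VOLUME_GROUP LOGICAL_VOLUME MOUNT_POINT
mount_lv_readonly() {
    vg=$1
    lv=$2
    target=$3
    run vgscan || return 1                # look for volume groups on attached disks
    run vgchange -ay "$vg" || return 1    # activate the group so its volumes appear
    run mkdir -p "$target" || return 1
    # Read-only, so nothing on the damaged filesystem gets modified.
    run mount -o ro "/dev/mapper/$vg-$lv" "$target"
}

# Example (as root):
#   mount_lv_readonly HomeOne root /mnt/recovery
```

Running it with DRY_RUN=1 first prints the four commands, which is a cheap way to double-check the device path before anything touches the disk.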
So I finished configuring everything, finally got around to fixing the file permissions on music (and the Samba configs to go along; more on that in a future post), installed nginx, PHP, spawn-fcgi, MySQL, PostgreSQL, etc., and got all of my sites back up and running. I do have to rebuild one important PostgreSQL database, but luckily, it wasn't terribly complex yet anyway.
What I Am Doing Better
Well, for one, I'm not running a testing branch of Debian anymore. HomeOne is running Ubuntu 10.04 LTS, and I won't stray from it until the next LTS at least. I'm also using more partitions for better data segregation (rather than just root and srv, this time it's root, home, var, and srv). Additionally, I'm setting up a backup system to back up database and web files from HomeOne to HomeOne-Backup (the aforementioned VM). With a little luck, this will make future disaster recovery a lot less disastrous. Eventually, I'll move the backup duties to a VM on a Xen host, but for now, I don't have that luxury.
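The backup job is still being set up, so take this as a sketch of the shape of it and not the finished thing. It dumps both database servers to a staging directory and pushes the dumps and the web root to the backup VM with rsync. The function name backup_homeone, the staging path, and the remote path are placeholders, and it assumes the account running it can already run mysqldump and pg_dumpall without a password prompt and has SSH access to HomeOne-Backup.

```shell
#!/bin/sh
# Sketch: dump MySQL and PostgreSQL, then rsync the dumps and the web
# root to a backup host. Names and paths here are placeholders.

# backup_homeone STAGE_DIR REMOTE [WEB_ROOT]
#   STAGE_DIR  local directory for the SQL dumps
#   REMOTE     rsync destination, e.g. user@HomeOne-Backup:/some/path
#   WEB_ROOT   directory holding the web files (defaults to /srv)
backup_homeone() {
    stage=$1
    remote=$2
    web_root=${3:-/srv}

    mkdir -p "$stage" || return 1

    # SQL dumps can be restored into a newer server version,
    # which a raw copy of the data directory generally cannot.
    mysqldump --all-databases > "$stage/mysql-all.sql" || return 1
    pg_dumpall > "$stage/postgres-all.sql" || return 1

    # -a keeps permissions, timestamps and symlinks.
    rsync -a "$stage/" "$remote/db/" || return 1
    # --delete makes the remote copy mirror the web root exactly.
    rsync -a --delete "$web_root/" "$remote/srv/"
}

# Example (placeholder paths), suitable for calling from cron:
#   backup_homeone /var/backups/staging backup@HomeOne-Backup:/backups/homeone
```

Stopping at the first failed step is deliberate: a half-written dump should never get synced over a good one from the previous run.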
The main lesson to take away from this, kiddies, is to be very careful with dpkg --purge, especially when purging database packages! Make sure you have the data you need out of any packages you're purging before purging them!
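To make that concrete, here is a minimal sketch of "get the data out first": tar up the package's data and config directories, check that the archive is readable, and only then purge. The function name archive_before_purge and the /srv/backup destination are made up for illustration. The example paths assume the usual Debian layout for PostgreSQL 8.1, and the server should be stopped before copying its data directory.

```shell
#!/bin/sh
# Sketch: archive a package's data and config before purging it.
# The function name and example paths are assumptions.

# archive_before_purge ARCHIVE PATH...
archive_before_purge() {
    archive=$1
    shift
    # Pack every given path into one gzip-compressed tarball.
    tar -czf "$archive" "$@" || return 1
    # Read the archive back to confirm it is intact and listable.
    tar -tzf "$archive" > /dev/null || return 1
    # -s: the file exists and is not empty.
    [ -s "$archive" ] || return 1
    echo "archived $# path(s) to $archive"
}

# Example: the purge only runs if the archive step succeeded.
#   archive_before_purge /srv/backup/pg81-before-purge.tar.gz \
#       /var/lib/postgresql/8.1 /etc/postgresql/8.1 \
#     && dpkg --purge postgresql-8.1
```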