Last week a power grid failure broke my server’s RAID array. I have no UPS (as I’m a skinflint) and no automatic email alerts (because I’m too lazy to set them up). As a result, for 5 days, my 3-disk RAID-5 array was running on only 2 disks until I noticed the issue…
By using a combination of the following commands, I soon became aware of the gravity of the situation:
cat /proc/mdstat
mdadm --examine /dev/sda1
My /dev/sda1 disk had been kicked out of the array, so I did the right thing, which was to re-add it and let the array rebuild:
mdadm /dev/md0 -a /dev/sda1
Then, in an unlucky combination of cosmic ray bombardment, spooky action at a distance and astrological misalignment, halfway through the rebuild (which can take up to 5 hours), another disk failed! It was late, I was tired and utterly worried about losing 1.5 TB of precious data. In such bad shape, I was afraid of making the situation worse. So I decided to shut down the server and sleep on the problem.
The next day I tried to boot my server, only to find it (surprise!) stuck in the middle of the boot process, with the famous message:
hit control-D to continue or give root password to fix manually
This is “normal”, as my server tried to mount the ext3 filesystem from the /dev/md0 device that had just been assembled by mdadm. Of course md0, although assembled and visible to the system, was not running because only one disk out of three was in a clean state.
I’ll skip the epic subplot in which I wasted days searching for a working keyboard, but I’ll let you imagine how such adventures made my week…
Eventually, I was able to analyze the situation in detail. My first reflex? Check that the disks are not physically dead:
fdisk -l /dev/sda
fdisk -l /dev/sdb
fdisk -l /dev/sdc
The “Linux raid partitions” (type code “fd”) are still there. Good. I assumed here that the disks were not physically damaged. Maybe I should have looked at the S.M.A.R.T. data and statistics (via smartmontools). But remember, I’m lazy (and a bit crazy).
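For readers less lazy than me, here is a minimal sketch of the S.M.A.R.T. check I skipped. It assumes ATA drives, for which `smartctl -H` prints a line ending in "test result: PASSED". The function name `smart_passed` is made up for this example, and it only filters text on stdin, so it never touches a disk by itself.

```shell
#!/bin/sh
# smart_passed: reads `smartctl -H` output on stdin and succeeds only if the
# drive reports an overall health result of PASSED.
# Assumption: ATA-style report wording (SCSI/SAS drives word it differently).
smart_passed() {
    grep -q 'test result: PASSED'
}

# Example use on real hardware (needs root and smartmontools):
#   for d in /dev/sda /dev/sdb /dev/sdc; do
#       smartctl -H "$d" | smart_passed || echo "$d: look at the SMART data"
#   done
```

A PASSED result is only a coarse signal; the full attribute table from `smartctl -a` says more about a disk that is slowly dying.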
The next step was to get information about the RAID array itself using:
mdadm --detail /dev/md0
which printed the status table below (probably inaccurate, as I reconstructed it afterwards):
Number   Major   Minor   RaidDevice   State
   0       0       0        0         removed
   1       0       0        1         faulty removed
   2       8      33        2         active sync   /dev/sdc1
   3       8      17        3         spare
What does this table tell us?
- The array is up, but not running. One of its devices (sdc1) was clean and active, but that’s not enough to get a working RAID-5.
- My first attempt to rebuild the array led to an unexpected result: it added sda1 as a spare device (in slot #3).
- It confirms that sdb1 unexpectedly failed and is now in a bad state (“faulty removed”).
Then I stopped the array and fearlessly tried to (re)assemble it using 3 different methods:
mdadm -S /dev/md0
mdadm -A /dev/md0
mdadm --assemble /dev/md0 --verbose /dev/sd[abc]1
mdadm --assemble --force --scan /dev/md0 --verbose
It always failed with messages like:
mdadm: failed to RUN_ARRAY /dev/md0: Input/output error
mdadm: /dev/md0 assembled from 1 drives and 1 spare - not enough to start the array.
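The “not enough to start the array” message boils down to simple arithmetic: RAID-5 tolerates one missing member and RAID-6 two, and spares do not count. A small sketch of that rule follows; the `can_start` helper is hypothetical and only covers those two levels.

```shell
#!/bin/sh
# can_start LEVEL TOTAL ACTIVE
# Succeeds if ACTIVE in-sync members are enough to start an array that was
# created with TOTAL devices. Spares must not be counted in ACTIVE.
can_start() {
    level=$1
    total=$2
    active=$3
    case "$level" in
        5) tolerated=1 ;;   # RAID-5 survives one missing member
        6) tolerated=2 ;;   # RAID-6 survives two
        *) echo "unsupported level: $level" >&2; return 2 ;;
    esac
    [ "$active" -ge $((total - tolerated)) ]
}

# My situation: 3-disk RAID-5 with 1 active member and 1 spare.
#   can_start 5 3 1 || echo "not enough to start the array"
```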
So I examined each drive from mdadm’s point of view:
mdadm -E /dev/sda1
mdadm -E /dev/sdb1
mdadm -E /dev/sdc1
mdadm -E /dev/sd[abc]1 | grep Event
The last command compares the “Events” attribute of all devices. It printed something like:
Events : 0.53120
Events : 0.53108
Events : 0.53120
which indicates that sda1 and sdc1 are more or less in sync (they share the same number) and sdb1 is “late” (lower number).
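With more than three disks, comparing event counters by eye gets tedious. Here is a rough sketch that reads `mdadm -E` output and prints the members lagging behind the highest counter. The name `stale_members` is invented for this example, and it assumes that the old “high.low” notation shown above is two 32-bit halves of one counter, while newer metadata prints a plain integer.

```shell
#!/bin/sh
# stale_members: reads `mdadm -E` output for several devices on stdin and
# prints every device whose Events counter is lower than the highest one.
stale_members() {
    awk '
        # Each device report starts with a header line such as "/dev/sda1:"
        /^\/dev\// { dev = $1; sub(/:$/, "", dev) }
        $1 == "Events" {
            n = split($3, part, ".")
            # Assumption: "high.low" is a counter split in two 32-bit halves
            ev = (n == 2) ? part[1] * 4294967296 + part[2] : part[1] + 0
            count[dev] = ev
            if (ev > max) max = ev
            order[++k] = dev
        }
        END {
            for (i = 1; i <= k; i++)
                if (count[order[i]] < max) print order[i]
        }
    '
}

# Example use (needs root):
#   mdadm -E /dev/sd[abc]1 | stale_members
```

A device listed here is not necessarily broken; it simply stopped receiving updates earlier than the others.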
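Here I got the idea of recreating the RAID array without sdb1, relying only on sda1 and sdc1, by using the “magic” (hence dangerous) --assume-clean option. Used with --create, it writes fresh superblocks but skips the initial resync, so the data already on the disks is left “as is”. This can only work if the devices are listed in their original order and with the original parameters (level, number of devices, chunk size). Here is the command: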
mdadm --create /dev/md0 --assume-clean --level=5 --verbose --raid-devices=3 /dev/sda1 missing /dev/sdc1
And it worked!
I checked the filesystem on the md0 device and mounted it:
fsck.ext3 -v /dev/md0
mount /dev/md0
I updated my mdadm configuration before rebooting my server:
mdadm --detail --scan >> /etc/mdadm/mdadm.conf
vi /etc/mdadm/mdadm.conf
reboot
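Appending with `>>` leaves the old ARRAY line in the file, which is why I had to clean it up in vi afterwards. A possible way to skip the manual edit is sketched below. The helper `refresh_array_lines` is hypothetical, and it assumes every array you care about is currently assembled, since it drops all existing ARRAY lines before adding the fresh ones.

```shell
#!/bin/sh
# refresh_array_lines CONF
# Reads fresh `mdadm --detail --scan` output on stdin and rewrites CONF so
# that old ARRAY lines are replaced instead of duplicated.
# A copy of the previous file is kept as CONF.bak.
refresh_array_lines() {
    conf=$1
    tmp=$(mktemp) || return 1
    # Keep every line that is not an ARRAY definition...
    awk '$1 != "ARRAY"' "$conf" > "$tmp"
    # ...then append the new definitions coming from stdin.
    cat >> "$tmp"
    cp "$conf" "$conf.bak" && cat "$tmp" > "$conf"
    rm -f "$tmp"
}

# Example use (needs root):
#   mdadm --detail --scan | refresh_array_lines /etc/mdadm/mdadm.conf
```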
But history repeats itself, and again the system hung during boot. Except this time I knew what was happening: the boot process detected the remaining sdb1 device as part of the old array (the one before the re-creation I did above) and tried to run it. Remembering my post from last year, I zeroed the superblock of sdb1:
mdadm -S /dev/md0
mdadm --zero-superblock /dev/sdb1
A server reboot proved I was right, and my md0 device was automagically mounted in a degraded state:
localhost:~# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sdb1[3] sda1[0] sdc1[2]
1465143808 blocks level 5, 64k chunk, algorithm 2 [3/2] [U_U]
unused devices: <none>
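The `[3/2] [U_U]` part is what I failed to notice for 5 days: each `_` stands for a missing member. As a stopgap for the email alerts I never set up, a check along these lines could be run from cron. This is only a sketch; `degraded_arrays` is a name I made up, and it just filters text in the /proc/mdstat format.

```shell
#!/bin/sh
# degraded_arrays: reads /proc/mdstat-style text on stdin and prints the name
# of every array whose member map (e.g. [U_U]) contains "_".
degraded_arrays() {
    awk '
        # Remember the array name from lines such as "md0 : active raid5 ..."
        /^md[0-9]+ :/ { name = $1 }
        # The member map is the only bracket group made of U and _ characters
        /\[[U_]+\]/ {
            if (match($0, /\[[U_]+\]/) && index(substr($0, RSTART, RLENGTH), "_"))
                print name
        }
    '
}

# Example use:
#   degraded_arrays < /proc/mdstat
```

If the function prints anything, at least one array is running without all of its members.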
I just had to re-add sdb1 to fill the available slot and update the mdadm configuration to get my array back to its initial state:
mdadm --manage /dev/md0 --add /dev/sdb1
mdadm --detail --scan >> /etc/mdadm/mdadm.conf
vi /etc/mdadm/mdadm.conf

Sounds like I’m always in sync with the headlines! An article discussing “why power outages are bad for your data” was posted on Slashdot some days ago. It concludes with this obvious recommendation: you should buy a UPS!
PHEW! I had a similar problem, only one failed disk but the array wouldn’t start due to an unspecified I/O error.
I knew the other two disks were ok so I went for the --assume-clean option and it worked!
PHEW again. I’d have committed suicide if I had lost that data. You don’t keep unimportant data on a raid array
thanks
My problem was not exactly the same but your case gave me the courage to add the --force flag to mdadm.
It took 10 hours to sync up the faulty drive but no data lost!
Thanks for the great post. About a year ago, we had a similar issue with one of our Dell PowerEdge servers. Long story short… I will never go back to software raid-5. Hardware raid-5 is definitely the way to go in any backup environment.
Cheers
Nick
Useful.
First I am backing up the very devices that make up the raid10 (some 500GB of rather useful data… months of work lost if that raid is gone).
That raid10 is assembled from 6 partitions, sd[a-f]6, on six 200GB drives.
I think how to do it can still be found here.
The commands that I’ll be using will be six commands of the form:
where capital X needs to be replaced by letter a, next command letter b, and so on through f.
E.g., to back up device /dev/sda6, I’ve now issued:
It’ll take a loooong time, but the time and work saved and secured from harm will be far greater than that length and labor in case I mess up that 500-GB-of-data raid10, as I’ll be able to start over from scratch…
In case someone feels it’s the right kind of backup, here you can read more.
I’ll let you know if and when and how I recovered these data, this raid10…
I think the path to a solution that I trod above was the right one…
It’s very likely I didn’t lose any more data…
Namely, while one drive was failing, it corrupted the raid (there was some loss of data, but not much), but (and the following is the exact reason I chose raid10 over raid5) I was left with 4 out of 6 devices working (some two years ago, on raid5, I lost data in a similar situation when one more drive failed).
By backing up the devices as I explained above, I found out that I could smoothly back up (and that the data would be fine as gzipped data) /dev/sda6, /dev/sdc6, /dev/sdd6 and /dev/sdf6 (but an input/output error soon showed up on /dev/sdb6, and sde6 was a partition on a failed disk, which I replaced).
I examined the disks from FreeDOS with SeaTools (they are Seagate and other disks), and the disks showed no errors (long test).
So I did (#myCMNT is my comment, and these are history lines from the Gentoo LiveCD):
I already saw the data are there. And while I am writing this I only have to wait for the raid10 to finish rebuilding onto the last partition.
77 mdadm /dev/md3 -a /dev/sdb6
/dev/md3:
...
Array Size : 606227968 (578.14 GiB 620.78 GB)
Used Dev Size : 151556992 (144.54 GiB 155.19 GB)
...
State : clean, degraded, recovering
Active Devices : 5
Working Devices : 6
Failed Devices : 0
Spare Devices : 1
Chunk Size : 128K
Rebuild Status : 41% complete
...
Number   Major   Minor   RaidDevice   State
   0       8      38        0         active sync   /dev/sdc6
   1       8      70        1         active sync   /dev/sde6
   2       8      86        2         active sync   /dev/sdf6
   3       8      54        3         active sync   /dev/sdd6
   4       8       6        4         active sync   /dev/sda6
   6       8      22        5         spare rebuilding   /dev/sdb6
I’m glad if this helps anybody who turns up seeking a solution for a similar problem…
Cheers!
You saved my bacon!
Thanks so much for posting all this info!
You not only saved my life from my client killing me, but also the more important lives of 5 highly important machines on my network…
THANKS! \o/