Today mdadm send me a mail to warn that one of my hard drive (/dev/hdd1) was ejected from my RAID-5 array. After some manipulations (no writes, just reads on the file system to get information) and reboots, I ended up with a file system in a strange state: the folder structure was totally messed up and lots of files disappeared.
Assuming that this situation was about an inconsistent file index, I decided to reset the superblocks of the remaining physical disks:
1$ mdadm --zero-superblock /dev/hdc1
2$ mdadm --zero-superblock /dev/hdb1
I don’t know why I decided to do so, but it was the stupidest idea of the week. After such a violent treatment, my array refused to start:
1$ mdadm --assemble /dev/md0 --auto --scan --update=summaries --verbose
2mdadm: looking for devices for /dev/md0
3mdadm: no RAID superblock on /dev/hdc1
4mdadm: /dev/hdc1 has wrong raid level.
5mdadm: no RAID superblock on /dev/hdb1
6mdadm: /dev/hdb1 has wrong raid level.
7mdadm: no devices found for /dev/md0
At this moment I was sure that all my data assets were lost. I was desperate. My only alternative was to ask Google. So I did.
I spend several minutes browsing the web without hope. I finally found someone in the same situation as mine (sorry, in french) on debian-user-french mailing list.
The solution was to recreate the RAID array. This sound counter-intuitive: if we recreate a raid array over an existing one, it will be erased! Right? Wrong! As it is said on debian-user-french, mdadm is smart enough to “see” that HDD of the new array were elements of a previous one. Knowing that, mdadm will try to do its best (i.e. if parameters match the previous array configuration) and rebuild the new array upon the previous one in a non-destructive way, by keeping HDD content.
So, here is how I finally recovered my RAID array:
1$ mdadm --create /dev/md0 --verbose --level=5 --raid-devices=3 /dev/hdc1 missing /dev/hdb1
2mdadm: layout defaults to left-symmetric
3mdadm: chunk size defaults to 64K
4mdadm: size set to 312568576K
5mdadm: array /dev/md0 started.
Of course this doesn’t solve my initial problem about the /dev/md0 file system: it is still in an altered state. Maybe it’s too late to recover data. But at least I reverted all my today’s mistakes, and the situation will not deteriorate until I power up my RAID! :)