Sunday, November 6, 2011

Edmundo 750 GB - HP 0 GB

Hi!

Just today I finished an almost perfect data recovery from another RAID5 spat on by its controller (another HP controller, yet again). This time around it was a 6-disk RAID5 (some 130 GB each).

I knew beforehand that my Python library (which I had translated from Java about a year ago) was too slow, and this task was going to take days if I didn't do something about it. So first I took on the task of translating it to C++, which I finished a few days ago.

When faced with the disks, the HP controller reported that one of them was physically dead, another had been hot-swapped, and yet another was about to fail (public administration, don't ask about the details... gruesome). The fact of the matter is that the controller refused to make the array visible, so the server would not start. HP support asked me to dip the array in holy water and forget about it. But that was not going to happen, was it?

The disks were mounted on a separate computer one by one (well, not exactly one by one, but on a separate computer), and an image was made of each of them (only one of the drives refused to be copied). Tips: LiveUSB, ddrescue.
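For what it's worth, here is a minimal sketch of that imaging step, assuming GNU ddrescue is available on the LiveUSB; the device and image names are made up and depend entirely on how the drives enumerate on the imaging machine:

    import subprocess

    # Hypothetical mapping of source devices to image names.
    drives = {"/dev/sdb": "disk0", "/dev/sdc": "disk1", "/dev/sdd": "disk2"}

    for dev, name in drives.items():
        # GNU ddrescue: direct access to the input device, a few retry passes,
        # and a map file so an interrupted copy can be resumed later.
        subprocess.run(
            ["ddrescue", "-d", "-r3", dev, name + ".img", name + ".map"],
            check=True)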

After getting my hands on the images, I started analyzing how the array was built in order to figure out the order of the drives, the chunk size and the algorithm used. Normally the first MBs of each image are enough for this job... after taking a long, cold look at the images, I had figured out the chunk size (64 KiB) by looking at some numbered markers present on the ext2 partition that sat at the beginning of the disks... however, as this was a Linux server with no FAT or NTFS stuff on it, there was no data (especially text) from which to figure out the order... and certainly not crammed at the beginning of the disk (does NTFS still work like this? You've got to be kidding me).
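To give an idea of the kind of scanning involved, here is a rough sketch (not the exact marker trick I used on the ext2 partition, just a crude printable-text heuristic) that prints one character per chunk of a candidate size; with the right chunk size, the text chunks and the binary-looking parity chunks line up into an obvious repeating pattern:

    import string
    import sys

    CHUNK = 64 * 1024          # candidate chunk size to test
    PRINTABLE = set(string.printable.encode())

    def looks_like_text(block, threshold=0.9):
        # A chunk full of logs/config text is mostly printable bytes; a
        # parity chunk (the XOR of several text chunks) is not.
        if not block:
            return False
        return sum(b in PRINTABLE for b in block) / len(block) >= threshold

    # Print a 'T' for text-like chunks and '.' for the rest, 64 per line.
    with open(sys.argv[1], "rb") as f:
        line = []
        while True:
            block = f.read(CHUNK)
            if not block:
                break
            line.append("T" if looks_like_text(block) else ".")
            if len(line) == 64:
                print("".join(line))
                line = []
        if line:
            print("".join(line))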

After thinking about it (and working on it) for a while, I ended up finding whole stripes that were made of text except (most probably) for a single chunk per stripe. Finding these stripes and their sequence took its time, but eventually I was able to break it. The only thing that bothered me was that for many consecutive stripes the parity (checksum) chunk was always on the same disk (which I didn't expect), and the sequence remained the same on those stripes. At first I thought that perhaps it had been built as a RAID4 instead of a RAID5, but after checking many stripes the whole thing became clear: it's a RAID5 in which each row configuration repeats itself a number of times (in this case, 16 times). All in all: left-asymmetric, 64 KiB chunks, with 16 repetitions. This led me to make an adjustment to my C++ utility just for this task, and in a single shot I was able to rebuild the RAID5 from its pieces. Rebuilding took some 14 hours (probably not because of the utility, but because of the bottleneck of having all 5 images on a single drive).
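The rebuild itself boils down to walking the images row by row and writing the data chunks out in the right order, XORing the surviving chunks whenever the row's data chunk lives on the missing disk. This is a simplified sketch of that mapping, not my actual C++ utility; the image names are made up, and it assumes "left-asymmetric with 16 repetitions" means the parity position advances one disk to the left only once every 16 rows:

    import os

    CHUNK = 64 * 1024        # 64 KiB chunks
    NDISKS = 6               # 6-disk RAID5
    DELAY = 16               # each row layout repeats 16 times
    IMAGES = ["disk%d.img" % i for i in range(NDISKS)]   # hypothetical names
    MISSING = None           # index of the unreadable disk, or None

    def parity_disk(row):
        # Left-asymmetric rotation: parity starts on the last disk and moves
        # one position to the left every DELAY rows.
        return (NDISKS - 1) - ((row // DELAY) % NDISKS)

    def xor_all(chunks):
        # XOR equally sized byte strings together (to recover a lost chunk).
        acc = 0
        for c in chunks:
            acc ^= int.from_bytes(c, "little")
        return acc.to_bytes(CHUNK, "little")

    files = [open(p, "rb") if i != MISSING else None
             for i, p in enumerate(IMAGES)]
    rows = os.path.getsize(IMAGES[1 if MISSING == 0 else 0]) // CHUNK

    with open("rebuilt.img", "wb") as out:
        for row in range(rows):
            p = parity_disk(row)
            # Read this row's chunk from every readable disk.
            chunks = {}
            for d, f in enumerate(files):
                if f is not None:
                    f.seek(row * CHUNK)
                    chunks[d] = f.read(CHUNK)
            # Data chunks live on every disk except the parity disk, in
            # increasing disk order (the "asymmetric" part of the layout).
            for d in range(NDISKS):
                if d == p:
                    continue
                if d == MISSING:
                    # The lost chunk is the XOR of everything else in the row.
                    out.write(xor_all(list(chunks.values())))
                else:
                    out.write(chunks[d])

With 5 readable images out of 6, one XOR per row is enough; losing a second disk would have made this trick impossible.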

After rebuilding, I was able to run fdisk -lu on the disk image, do the losetup trick, mount the server's / partition and get the data out of it. There were some IO errors (not that many, fortunately... given that the controller had already reported more than one disk as broken, maybe some pieces of the images were damaged, or the controller had stopped using one of the drives altogether some time before crashing). Other people will take care of getting the whole thing running again, but my task is definitely over.
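The "losetup trick" is just attaching the partition inside the image at the right byte offset. A sketch, assuming 512-byte sectors, a hypothetical start sector and a hypothetical mount point (the real start sector is whatever fdisk -lu printed):

    import subprocess

    IMAGE = "rebuilt.img"     # the reconstructed array image
    START_SECTOR = 63         # hypothetical: the start sector fdisk -lu reported
    SECTOR_SIZE = 512         # 512-byte sectors assumed

    offset = START_SECTOR * SECTOR_SIZE

    # Attach the partition, read-only, to the first free loop device...
    loopdev = subprocess.run(
        ["losetup", "-f", "--show", "-r", "-o", str(offset), IMAGE],
        check=True, capture_output=True, text=True).stdout.strip()

    # ...and mount it read-only to copy the data out.
    subprocess.run(["mount", "-o", "ro", loopdev, "/mnt/recovered"], check=True)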

So, next time you have a RAID5 crash, always remember that breaking it can take some time and thought... but in the end, it can pay off.

I'm starting to feel like the Messi of data recovery. Does anybody out there need a data recovery hacker? I'm willing to travel anywhere!

PS: And before you ask, it's 750 GB (or close to it) for this array plus the one I recovered in 2005.
