I was sitting here last night when suddenly the music stopped... and I spent the next couple of hours troubleshooting minbar, our NFS server. Minbar is a Sun Ultra 30 with a 711 disk pack, running Solaris 9. Along the way I went down many false trails. It was clear that a disk had failed, but since I couldn't see the disk status LEDs on the 711, it wasn't immediately apparent which disk it was. (I finally figured out that this was because, when a disk holding metadb replicas fails, Solaris isn't bright enough to stop trying to update the metadb replicas on the failed disk.)
Anyway, after realizing this, I was able to diagnose the failed disk, but I didn't have a spare, so I went hunting on eBay for replacement disks without finding anything I was happy with. Five minutes after I finally gave up and went to bed, it occurred to me that I could pull a disk out of babylon4 (my Sun E3000) for now, because we're not running babylon4 at present (we can't afford the heat load or the power drain).
So today, I moved stuff around until I could haul out babylon4, pulled one of its ten 9G disks, and popped it into minbar's 711 array in place of the failed disk. I partitioned the disk, created metadb replicas, recreated metadevices, created filesystems, did all that happy stuff, and started restoring all the files... and about 40 minutes into the restore, the replacement drive crapped out with a sense error and a "FAILURE IMMINENT, REPLACE NOW" warning. Aaaaugh!!!
Lather, rinse, repeat. With a second disk borrowed from babylon4, it's now 20 minutes or so into the second restore cycle, and there have been no further hiccups. But now I need two replacement disks...
When I put the second replacement in, I rearranged the drives further: I moved a Barracuda from slot 6 to slot 2 and moved a Cheetah without a heat spreader down from slot 4 to slot 6, so that the second replacement Cheetah could go into slot 4. That puts all three Cheetahs closest to the cooling air intake, just in case the second failure was a thermal problem. (It did seem awfully hot when I took it out.)