A ZFS RAIDZ2 pool without hot spares can survive the simultaneous failure of up to two devices and continue operating in degraded mode.
So, naturally, between when we left for the elementary school this morning at 0755 and when we got back home at about 1015, apparently THREE of the twelve 300GB SATA disks in babylon4's main storage pool failed.
I suppose that’s what I get for trusting Maxtor disks. But they were free-to-me, so I can’t complain too loudly. Unfortunately, I have no spare disks, and can’t spare the money right now to replace them with better (not to mention new) disks, or I would have already done so.
Equally annoyingly, I hadn't gotten backup migration from disk to tape set up yet. I'm more annoyed about the configuration work - principally Apache2 - that I'll now have to redo than about the little data I've lost, which amounts to a couple of minor edits to our recipe book, a dozen or so ISOs I can re-download any time I want, and a dozen or so source-code tarballs.
To add insult to injury, I'd been planning to work on tape migration next, after the full backup that was scheduled to run this morning completed...
Update: As noted in the comments below, a network-wide Full backup ran last night, starting at 03:10 and putting heavy, sustained load on the array. By the look of things, an already-weak disk folded under that load at 04:29:55, increasing the load on the remaining disks and setting up a cascade failure over the next four and a half hours: the second-weakest disk succumbed at 08:49:29, pushing the load on the survivors still higher, and just under eight minutes later, at 08:57:06, the third-weakest disk followed the first two and the entire array went down.
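For anyone who'd like to check their own pools before something like this happens to them: the commands below are a minimal sketch, assuming a Solaris-style system and a pool named tank with placeholder device names, not babylon4's actual layout.

    # Show only pools with problems; a RAIDZ2 vdev with one or two
    # failed disks reports DEGRADED here but keeps serving data:
    zpool status -x

    # Attach a hot spare so ZFS can pull it in automatically the
    # moment a disk drops out (pool and device names are made up):
    zpool add tank spare c2t0d0

    # Once a replacement disk is physically installed, swap it in
    # for the dead one and let the pool resilver:
    zpool replace tank c1t7d0 c2t0d0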
no subject
I've been studying the logs and found some more clues in the kernel log... it looks like c1t7d0 went down at 04:29:55 (degrading the array), then c1t6d0 croaked at 08:49:29, which left the array running with no remaining redundancy, then at 08:57:06 c1t4d0 failed and the array went down. It's probably significant that a full backup of everything on the network, including this array, to another filesystem on the same array began at 03:10, so I'm guessing those three drives were shaky to start with and just folded one after another as the load increased.
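For the curious, a rough sketch of that archaeology, assuming the standard Solaris log location (the device pattern below matches the three disks named above):

    # Pull every kernel-log line mentioning the three suspect disks;
    # /var/adm/messages is the usual Solaris location, with rotated
    # copies included via the wildcard:
    grep 'c1t[467]d0' /var/adm/messages*

    # The fault manager's error log tells the same story from the
    # ZFS/FMA side, with high-resolution timestamps:
    fmdump -e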
no subject
- Once you plan to do backups, that's when your hard drives will instinctively fail.
Honestly, the list of times I've had catastrophic data-death a few days before I was going to get around to doing backups has gotten too long to remember them all.
So now, I never even think about doing backups, so as not to jinx it.