When you run a computer 24/7/365, it tends to wear out. At least this is the case with anything I’ve owned in my life. In the past 30 years of owning hard cards, hard drives, and SSDs I’ve accumulated piles of dead ones. Well, some aren’t dead, they were just replaced, but whatever.
TL;DR – learn from my fail, or point and laugh
I’ve gotten pretty familiar with the idea that all it takes is a couple of wrong events for every piece of digital me to disappear – this is roughly 25 years of data, 19 websites, several blogs, wedding photos, videos of my kids, their births, etc. I’ve also gotten pretty used to burning through a drive with a 5-year warranty in about 1-2 years at my level of use.
A few years back, after my main drive and a backup drive died at the same time (I’m thinking electrical), I discovered that the syncing to the photo storage I was using had also not been working for days. This is pretty standard: when it rains it pours (like when Pocketable’s Vaultpress stopped working 21 days before a worldwide 0-day WordPress theme hack took it out, and they didn’t send any notification that it wasn’t working, and then Site5 deleted our backups, etc.).
When it rains…
So a year or two ago I decided it was time to next level it. No more single points of failure. I loved SSDs, well, now I was going to mirror my SSDs… bam. One LSI MegaRAID 9260-8, a couple of 1TB SSDs – I figured I was untouchable.
A few months back, right at the start of the pandemic, I threw a drive, the controller stopped recognizing the other drive, and I was face to face with a non-booting system. I had to pull the drive out, boot it on a SATA2 controller, and mirror it back to a newly configured and now blank RAID I’d made. Meh. OK, evidently I can’t trust the card either. I’m now using two forms of local backup (one to a computer on our VPN, one to a spinner), two different backup programs (Veeam and Windows), and storing the pictures and video in their own clouded retreat as well.
Oh yeah, during this my boot/toolkit USB also broke – used 3 times, just dead. I’d not planned for that. Luckily I had a laptop and several years of CES USB promo kits I could overwrite.
Last night at 1:51am I was woken by the RAID card. Nothing like a screaming drive failure to pound it into your head, on top of a cold/allergies (yes, it’s probably that, I got soaked and stuck in an air conditioned hell for an hour) and the realization that I could not recall the name of MegaRAID Storage Manager nor how to launch it. Oh well, it can wait till morning, it’s not like that computer’s secondary purpose was to act as an off-site storage dump for encrypted data from work. Oh wait…
Morning comes. I had not slept much because I kept going through how to rebuild the thing from the ground up in case the other drive had damage. I slapped in a spare SSD I had lying about (did I mention this happens regularly enough in my job that I’m prepared at home and work?) and things look manageable – I’ve got a log of a drive that, for reasons unknown, stopped responding to commands last night. It appears to be working now, no errors, no predictive failure warnings, but we don’t play that game. The drive stays replaced until the presumably bad one gets tested. It went offline for some reason, and there’s no reason to think that won’t happen again and cause data loss.
So I replace the 4 year old Mushkin Reactor (which may be fine, we’ll see if it was a power issue, controller issue, etc. soon enough) with a spare SSD and rebuild. Five and a half hours later, after a complete rebuild, I decide that since I have two spares on hand I’m putting the other one in and dedicating it as a hot spare. I carefully wire everything, close up the case, boot the machine and yup, somehow the clip-in, no-way-to-pull-it-out cable had pulled out.
Hour 2 of rebuilding again.
What have we learned?
- Have a link to your RAID management somewhere that doesn’t require you to execute it from a command prompt at 2am
- Even in a mirroring scenario, keep an extra disk in there as a hot spare, so that after you turn off the alarm at 2am the rebuild is done by the time you wake up and all you have to do is remove the presumably bad drive
- Get a job that doesn’t involve having a computer in your bedroom running 24/7
- Potentially don’t start archive integrity checks at midnight, but that’s most likely unrelated
- Don’t close the case of your computer and then boot without paying close attention to the error messages displayed
- Pay someone else to simplify things
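One more thing I’d add, given how many of these stories involve a backup or sync that quietly stopped working days earlier: have something check that the backups are still fresh. Below is a minimal sketch of that idea in Python, not what I actually run. The function names and the 24 hour threshold are made up for illustration, and it assumes your backup destinations are plain folders you can walk. It says nothing about whether the backups are restorable, only whether anything new has landed lately.

```python
import os
import time


def newest_mtime(root):
    """Return the most recent file modification time under root, or None if no files."""
    newest = None
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                mtime = os.path.getmtime(os.path.join(dirpath, name))
            except OSError:
                # File vanished or is unreadable; skip it rather than crash.
                continue
            if newest is None or mtime > newest:
                newest = mtime
    return newest


def stale_backups(targets, max_age_hours=24, now=None):
    """Return a list of (label, problem) for backup folders with nothing recent.

    targets maps a human-readable label to a folder path (both hypothetical).
    A missing or empty folder is reported as "no files found".
    """
    now = time.time() if now is None else now
    problems = []
    for label, path in targets.items():
        newest = newest_mtime(path)
        if newest is None:
            problems.append((label, "no files found"))
            continue
        age_hours = (now - newest) / 3600
        if age_hours > max_age_hours:
            problems.append((label, f"newest file is {age_hours:.1f} hours old"))
    return problems
```

You would call it with something like `stale_backups({"spinner": "D:/Backups", "vpn box": "//otherpc/backups"})` (placeholder paths) from a scheduled task and have it yell at you during daylight hours if the list comes back non-empty.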