I’ve got a computer, and for the past month or so I’ve been pretty much living on the thing as I remote into our network and assist people with their systems, virtually running servers while the building is closed and without power, etc.
TL;DR – RAID dies, rebuilds incorrectly, gets fixed.
It’s also got a hoary host of utilities and projects that would cause me to cry very big tears if it went out (they’re all backed up, but verification of everything could take a day or two.) So on day -1 of the announced city-mandated two-week Coronavirus lockdown, at 5am, something of course happened.
I’ve got an LSI RAID card: two SSDs mirrored, and one that sat idle as a hot spare in the event one of them died. As stated, my computer is pretty important to my work.
It serves as an offsite backup (on a spinner, not the SSDs,) and in the event of a catastrophe at work it takes about 45 minutes to have all the servers back up to the state they were in at the beginning of the day, just at another location. Oh, did I mention a tornado missed us by 800-1,200 feet a couple of weeks back?
It’s what we’ve been able to afford, don’t judge. Company’s reinvented itself multiple times and managed to reach 40+ years old. There are other backups in the cloud, but getting them running is not an expense we can afford.
So, my computer threw a RAID drive in a mirrored set. No big whoop. At 5am I get an alarm that I’m convinced is the UPS next to the computer, because it’s 5am and it’s shrieking, so I shut the computer off and go back to sleep. Nobody’s going to be in the virtual office until 10.
Got up, powered the machine up, see the degraded set, silence the alarm as it’s killing me, and let the machine rebuild the mirror on the new drive. All good so far.
Weird thing is I’m getting a TON of CRC errors on the drive that hadn’t failed. I assumed it was due to the bad drive, and after the entire sync was done I ran a chkdsk, watched the LSI monitor, and saw errors galore.
The SanDisk drive had failed, so I power the system off and remove that drive. When I go back to boot I get a weird error that there’s a foreign disk, and do I want to import it?
That’s a big nope… attempt to boot, nothing. Warnings that the RAID was degraded. No boot devices found. Uhhh…
Plug the bad disk back in, nope. Temp read foreign config. Nope.
Well horse hockey…
I of course have backups, but as with the issue I had with Pocketables, I never expect them to be good. Even if they verified. Even with a report showing they’re good.
Same with this: I had a backup drive, it showed it was good, it showed it copied, and now poof.
I try every combo of getting the RAID to boot, recognize, or whatnot, and in a couple of combos I manage to get it to report the array as working, but it won’t boot.
Called a friend of mine who used to work with more RAIDs than you can shake a stick at. His suggestion was that since it was mirrored, each drive that was rebuilt to, at least on the LSI boards he’d worked with, was probably going to look to any normal system like a regular standalone drive.
Yeah, I came from the era where a drive on a RAID did not look like a normal disk (sit down, Sonny, and lemme tell you about the ’90s and this server software called Novell 3.12.) So I had no clue that a mirrored drive could be yanked, put on a SATA 2 (I know, ick,) port, and boot… but boot it did.
Booted off of the mirrored drive, verified it was good, ran a consistency check and made sure there were no CRC errors – all seemed good.
Ran into work (I was the only one there,) and grabbed the Hynix SSD that was sitting on my desk, because the replacement SSDs I ordered will take 31 days to show up due to the pandemic, and I can handle a mismatched set for a while.
And I’m back in business. Not comfortable with the situation, but I learned something new. Mainly that I no longer trust LSI, my computer, or SSDs, even duplicated.
Next plan is to image the SSD raid regularly to a spinner or something as it’ll be a long time before I get the parts I need in.
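That image-to-a-spinner plan boils down to two steps: copy the volume block-for-block, then verify the copy before trusting it (the lesson of this whole post). A minimal sketch from a Linux-style rescue shell, with everything hypothetical: in real use `src` would be the RAID volume device (something like /dev/sdX) and `img` a file on the spinner; here a scratch file stands in for the volume so the commands are runnable anywhere:

```shell
#!/bin/sh
# Sketch: image a volume to a file, then verify the image.
# Hypothetical paths; a scratch file stands in for the RAID volume.
set -e
src=$(mktemp)   # stand-in for the RAID volume (real use: /dev/sdX)
img=$(mktemp)   # stand-in for the image file on the spinner

head -c 1048576 /dev/urandom > "$src"   # 1 MiB of fake volume data

# Take the image; conv=sync,noerror keeps going past read errors
dd if="$src" of="$img" bs=64K conv=sync,noerror status=none

# Never trust a backup you haven't verified: compare image to source
result=$(cmp -s "$src" "$img" && echo verified)
echo "image $result"

rm -f "$src" "$img"
```

On a Windows box like this one, a disk-imaging tool run on a schedule would do the same job; the point is the verify step, not the specific copy command.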