Every company does a different level of disaster recovery and business continuity planning. That's completely appropriate – smaller companies with less technology dependency (and less process maturity) don't require as thorough a disaster recovery plan.
This past week I’ve been dealing with a very challenging disaster scenario.
The company in question is reasonably technology dependent, but not very sophisticated (lots of data, few automated processes, not a complex technology environment). They have a solid backup system, a proper GFS rotation which includes off-site storage, and plenty of spare hardware in case of a major failure. They’re doing everything correctly for a company of their sophistication.
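For readers unfamiliar with the term, GFS is a Grandfather-Father-Son rotation: daily backups are recycled quickly, weekly backups are kept longer, and monthly backups are kept longest (and typically sent off-site). As a rough illustration of the idea – the tier boundaries below are one common convention, not necessarily this company's actual schedule:

```python
from datetime import date

def gfs_tier(d: date) -> str:
    """Classify a backup date under a simple Grandfather-Father-Son scheme.

    Monthly 'grandfather' on the first of the month, weekly 'father' on
    Sundays, daily 'son' otherwise. Real rotations vary in their cutoffs;
    this is just one common convention.
    """
    if d.day == 1:
        return "grandfather"   # retained longest, typically stored off-site
    if d.weekday() == 6:       # Sunday
        return "father"        # retained for several weeks
    return "son"               # recycled within the week
```

The practical consequence, which matters later in this story, is that the further back you need to restore, the coarser your restore points become.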
What do you do when something falls outside of your disaster recovery plan?
Three weeks ago the company’s main file server (and backup server) reported a failed drive (in a RAID array). The manufacturer happily replaced the drive under warranty. During the support process, they were sent complete logs from the system.
For three weeks the server seemed to operate normally. Then Windows reported data corruption. Upon reboot, the RAID controller reported that the same drive had failed again. Unfortunately, Windows would not boot, so further OS-level investigation was not possible. After further research, the manufacturer discovered that its previous diagnosis was wrong. It wasn't a physical failure but a logical one – the logical RAID drive had failed.
A logical failure means that the logical drive(s) created on the RAID array are potentially corrupted. In this case, they had become so severely corrupted that the server would no longer boot. All three logical drives on the server were corrupted. For a company that processes a lot of data, this is a major disaster. Over 500 gigabytes of data may be affected.
At this point, most people jump up and say "They have full backups, not a problem! Just restore…"
Unfortunately, while that's accurate (they do have backups), when dealing with data corruption – especially three-plus weeks of unknown data corruption – the only backup you can safely restore to is one taken before the corruption began. That was approximately a month ago. For a company with high data traffic, a month is a very long time.
I spoke with several very talented IT people, and they all recommended the same process: restore from your last known good backup (overwriting only files that have not changed since), then manually check every file that did change. Files that are still corrupted can have their individual backups restored (if those backups aren't corrupted as well).
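The first part of that triage – deciding which files are safe to take from the backup and which need human eyes – can be sketched in a few lines. This is a minimal illustration only: it assumes the last known good backup has been restored into one directory tree and the (possibly corrupted) live data copied into another, and the paths and hashing approach are my own assumptions, not the company's actual procedure.

```python
import hashlib
from pathlib import Path

def file_hash(path: Path) -> str:
    """SHA-256 of a file, read in chunks so large files don't exhaust memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def triage(known_good: Path, live: Path):
    """Partition live files into three buckets:
    unchanged - identical to the last known good backup, safe to keep;
    changed   - modified since the backup, needs manual corruption check;
    new       - created after the backup, also needs manual checking.
    """
    unchanged, changed, new = [], [], []
    for live_file in live.rglob("*"):
        if not live_file.is_file():
            continue
        rel = live_file.relative_to(live)
        backup_file = known_good / rel
        if not backup_file.is_file():
            new.append(rel)           # no backup copy exists: inspect by hand
        elif file_hash(live_file) == file_hash(backup_file):
            unchanged.append(rel)     # matches last known good: safe
        else:
            changed.append(rel)       # diverged from backup: inspect by hand
    return unchanged, changed, new
```

Even with a script like this doing the sorting, the "changed" and "new" buckets still have to be reviewed by a person – which is exactly why this kind of recovery is so slow.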
There is no good way to deal with data corruption at this level; it's always going to be a manual, time-consuming job. We've been lucky so far – thanks to the level of planning, good staffing and redundant hardware, we've been able to minimize the impact and give the company the best of a worst-case scenario.