We were working through some systems with a client and came across some interesting findings. Specifically, we were working through the backup and recovery of their systems.
They run a mix of servers, some on-premises, some in the cloud, and some systems that span both. Probably fairly typical of a number of shops out there. We also discovered several systems that had cropped up at a departmental level (ah, the days of Access seemed to rear their head again) and existed in one location or the other.
As is pretty standard from what I’ve seen, those departmental database systems ran the full range from recoverable to “Hail Mary” if something happened. We worked through managing those first, just because the systems were more straightforward.
Then we moved on to the other, more industrial and complex systems.
This is where things got surprisingly complex, really quickly. There were a number of services in play at the cloud provider, and even on-premises. These ranged from the database itself to the systems that supported the applications.
This was not surprising, but the recoverability aspects of it were complex. This is something that is increasingly stinging people as they deploy systems that incorporate these various pieces, parts, and services. The steps to recover them can be complex and intertwined. The LAST place you want to find this out is in the heat of the moment, when you’re trying to pick up the pieces from some failed system or process.
In some cases, when a restore is completed, it goes to a separate database or system. You’re left migrating the information back to your production systems, which can be tedious and extremely error prone. In other cases, it was necessary to roll back to specific periodic backups, which meant reconciling those backups against the transactions and activity that had happened since they were taken. In essence, the systems were at least potentially out of sync after the broken piece was recovered.
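To make that reconciliation point concrete, here’s a minimal sketch. It assumes the restored copy and production are both reachable as SQLite files and that a hypothetical `orders` table carries a `created_at` timestamp (the paths, table, and column names are all illustrative, not anyone’s actual schema). All it does is surface the activity production accumulated after the point the backup captured, which is exactly the gap you’d be reconciling by hand.

```python
import sqlite3

# Hypothetical paths -- a restored copy of the database and the live production file.
RESTORED_DB = "orders_restored.db"
PRODUCTION_DB = "orders_production.db"

def reconciliation_gap():
    """List production rows newer than anything in the restored backup.

    These are the transactions you would have to re-apply (or accept losing)
    if you rolled production back to the restored copy.
    """
    conn = sqlite3.connect(PRODUCTION_DB)
    # ATTACH lets us query both databases in a single statement.
    conn.execute("ATTACH DATABASE ? AS restored", (RESTORED_DB,))

    # The newest activity the backup actually captured.
    cutoff = conn.execute(
        "SELECT MAX(created_at) FROM restored.orders"
    ).fetchone()[0]

    # Everything in production after that point is the reconciliation gap.
    missing = conn.execute(
        "SELECT id, created_at FROM main.orders WHERE created_at > ?",
        (cutoff,),
    ).fetchall()

    conn.close()
    return cutoff, missing

if __name__ == "__main__":
    cutoff, missing = reconciliation_gap()
    print(f"Backup captured activity through {cutoff}")
    print(f"{len(missing)} production rows would need manual reconciliation")
```

Nothing here fixes anything; it just measures the size of the gap so the decision to re-apply, reconcile, or accept the loss is a deliberate one rather than a surprise.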
The details and steps involved in recovering individual pieces can be surprisingly complex, and testing a recovered system or application is critical. We also ended up building an escalation process, one that led to decisions that included pulling all systems back to a known good point in time.
From a data perspective, this is great and everything works. From an operational standpoint, it’s rarely, if ever, acceptable. It’s a worst-case scenario, and one that won’t make for a fun meeting when you’re explaining it to the users and higher-ups.
This all led to the need both to better understand the recovery processes and to figure out how to synchronize systems and shrink the windows of possible information loss (sound familiar?). This is something we’ve been addressing since the beginning of time with databases.
Customer: What can we do to keep our systems online?
Consultant: What can you afford to lose?
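That question is really about your recovery point objective. As a minimal sketch (assuming backups land in a folder of timestamped `.bak` files, and a target you pick yourself; the paths and the 15-minute figure are illustrative), even a check this simple tells you how much you currently stand to lose:

```python
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical values -- point these at your real backup location and target.
BACKUP_DIR = Path("/backups/orders")       # where periodic backups land
RPO_TARGET_MINUTES = 15                    # how much data loss is acceptable

def current_exposure_minutes() -> float:
    """Minutes of activity that would be lost if production failed right now."""
    backups = list(BACKUP_DIR.glob("*.bak"))
    if not backups:
        return float("inf")  # no backups at all: unbounded exposure
    newest = max(b.stat().st_mtime for b in backups)
    age = datetime.now(timezone.utc) - datetime.fromtimestamp(newest, timezone.utc)
    return age.total_seconds() / 60

if __name__ == "__main__":
    exposure = current_exposure_minutes()
    status = "OK" if exposure <= RPO_TARGET_MINUTES else "EXCEEDS TARGET"
    print(f"Potential data loss window: {exposure:.1f} minutes ({status})")
```

In a mixed environment you’d want a check like this for each piece, the database, the supporting services, the cloud-side components, because each one can have its own exposure window.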
The same issues are here too, but sometimes reliance on the service provider gives us misplaced confidence in the recoverability, and certainly in the processes to get disparate systems re-synchronized.
The moral of the story is that it’s extremely important to know how to recover, but also to know how to recover the various pieces that may break, and what that means for your applications and your user base. Mixed, or hybrid, solutions can add unexpected variables to this, and those need to be thought through and tested extensively so you know what has to be done when the time comes.