When things go crazy, that’s when you need a plan. It sounds so obvious…
When things are in an unknown state you want to be executing your plan, not creating it. This is true whether your systems are sitting in your own data center or in someone else’s.
The outage at Amazon no doubt shone a light on many a DR plan that had overlooked the cascading storm of circumstance that hit when S3 went down this week. It shows that those same plans, those same approaches you have built over time for your own in-house systems, apply when you’re using cloud resources too.
Many people are rushing to point out what a bad idea the huge cloud providers are. I don’t think this will surprise you, but I disagree. While the outage was painful, and many things went wrong along the way, the cloud still makes a business-building difference for many companies.
That said, I think there’s a new decision point that needs to be added to the plan.
I keep coming back to calling it “Resolution Escalation” – the point at which you decide it’s time to escalate your issue. Not in the sense of “contact higher-ups at the support group,” but rather “OK, this isn’t resolving the way I need it to; it’s time to turn on plan B.”
With the Amazon issue, and their lack of communication and resolution – at some point in that scenario you have to make some assumptions and take some actions. It might be cutting over to backup servers in a different data center. It might be notifying customers. It might be any number of things.
It could also be to wait for a resolution.
But the key thing, I think, is to have a specific decision point and process so you know what to do with the information you have.
“If systems have been down for 3 hours, and no resolution appears to be on the 2-hour horizon, we start bringing up instances at a different data center.”
“If systems are offline for more than 10 minutes, and we don’t foresee a resolution in the next 5 minutes, we notify customers that we are aware of, and working through, system issues.” — and then what are the follow-up timings?
Basically, absent detailed resolution information (remember, you’re not likely to have it in the “fog of war” situation where the provider is dealing with the issue at hand), what actions do you take at what points?
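To make that concrete, here’s a minimal sketch of what writing those decision points down as a tiny runbook might look like. Every threshold, action description, and function name here is an illustrative assumption of mine, not a prescription from any particular plan or provider.

```python
from datetime import datetime, timedelta

# Hypothetical escalation rules: each pairs an elapsed-downtime threshold
# with the action to take if no resolution is in sight. The timings and
# wording are examples only; every team's numbers will differ.
ESCALATION_RULES = [
    (timedelta(minutes=10), "notify customers we are aware of, and working through, system issues"),
    (timedelta(hours=1), "post a status update and commit to a next check-in time"),
    (timedelta(hours=3), "start bringing up instances at a different data center"),
]

def actions_due(outage_start: datetime, now: datetime, resolution_in_sight: bool) -> list[str]:
    """Return every action whose threshold has been crossed."""
    if resolution_in_sight:
        return []  # waiting for the provider is also a valid, pre-decided choice
    elapsed = now - outage_start
    return [action for threshold, action in ESCALATION_RULES if elapsed >= threshold]

# Example: three and a half hours in, with no credible ETA from the provider.
start = datetime(2017, 2, 28, 9, 30)
for action in actions_due(start, start + timedelta(hours=3, minutes=30), resolution_in_sight=False):
    print(action)
```

The value isn’t in the code itself; it’s that the thresholds and actions were argued about and written down before the outage, so during one you’re reading a list, not holding a debate.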
I think it’s critically important to know these cut points and actions ahead of time – just as critical as knowing how to restore or recover services and bring systems back online after a failure.
And if nothing else, make sure you have some customer communication spots in there. Don’t do what they did and leave people guessing.