Editorials

The Oops That Shook the (Online) World

Amazon has already explained the issues that brought down so many sites and resources, and there are some (not-so) subtle things to learn from the situation.

When the services went down, it was due to human error, plain and simple. Well, not so simple (you can read their long-form summary here), but at its root, it wasn’t a catastrophic event in the data center, nor even some seismic shift in the earth’s crust.

It was, in essence, a typo.

Human error.

Now they’ll correct that, and the routine that caused the issue, so that hopefully this type of thing can’t happen again. But we’ve been talking all week about the impact of this type of situation, from the triage nature of the event as it happens to the response and decisions it forces on you as a cloud customer.

None of this, really, comes down to Amazon, IMHO. I believe it comes down to having managed systems. Systems that are managed by humans. You’ll always have the risk of issues, and that risk remains a planning point in your own use of those systems.

It could well be that the planning point is simply accepting the associated risks. Truly, how many times has S3 caused a problem? The track record is solid and impressive. But it’s a choice to accept that risk.

I believe, too, that if you’re going to accept that risk, there needs to be a decision point. I talked earlier this week about being in the heat of it all (the fog of war, with limited, shifting information) and having to decide whether to recover, to continue, to wait, etc. Having that plan is important. You may only exercise it once (or never) in your work with your provider.

That recovery process, though, is different. Rebuilding a database, pulling up a server in a different data center, failing over to a new system: all of those need to be understood so you can manage responsibly in the context of your provider’s offerings. What’s more, if you work across providers, or move between providers, that plan is going to change. Recovering a database on Azure is different from recovering one on Amazon RDS, which is different again from one running on an Amazon VM, and so on.
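To make that concrete, here’s a minimal sketch of what “rebuilding a database” can mean on just one provider: restoring an RDS instance from a snapshot with the AWS boto3 library. The instance and snapshot names are hypothetical, and the equivalent steps on Azure would use a different SDK and a different sequence entirely, which is exactly the point.

    # Minimal sketch, assuming boto3 is installed and credentials are configured.
    # The identifiers below ("orders-db-recovered", "nightly-snap") are hypothetical.
    import boto3

    rds = boto3.client("rds", region_name="us-west-2")

    # Restore a new instance from an existing snapshot.
    rds.restore_db_instance_from_db_snapshot(
        DBInstanceIdentifier="orders-db-recovered",
        DBSnapshotIdentifier="nightly-snap",
        DBInstanceClass="db.t3.medium",
    )

    # Wait until the restored instance is available before repointing applications.
    waiter = rds.get_waiter("db_instance_available")
    waiter.wait(DBInstanceIdentifier="orders-db-recovered")

Even this small path has provider-specific details (snapshot names, instance classes, waiters) that you want to have worked through before the outage, not during it.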

I think the big takeaway in all of this is that stuff happens. You need to evaluate the risk of that stuff. You need to know when it’s enough pain that you take action. And you need to know what that action is.

I almost feel like I need backup dancers, but the fact is, it’s the “same as it ever was.” Be prepared and think through things before you’re in the moment, even with cloud services.