Editorials

What Did We Learn from the Amazon S3 Outage?

This is not a critique of AWS or S3. Rather, it's a look at the response: what worked and what didn't.

As of this writing, no real specifics of what happened are available. The things we can learn, however, are immediately obvious on a few fronts.

1. Initial Response
First, the initial response was all based on discovery: customers noticed something was up. I'm not sure I expected them to proactively notify us when things went south, but the fact that we had to go looking, and dig around quite a lot after that, proved frustrating.

Developers took to forums. People tweeted. The status board for services showed all green (more on that in a minute) and minor updates were posted to Twitter. It was diagnosis by customer consensus, rather than facts from the source, that confirmed you could stop looking at your own systems for the cause of the outage.

All of this is easy in hindsight. I totally get that. In the "fog of war" of a critical situation like this, when the reality is you don't really know what, if anything, is happening, it's hard to put a response plan into action. But IMHO, that's exactly what we should expect from Amazon: a swift acknowledgement of an issue, even without a solution or an ETA.

What I learned from this: talk early. Acknowledge early. Don’t let others do it for you, but they can do it WITH you. If you start communicating, the *correct* word may spread without your direct involvement.

2. Communication
The status boards showing the status of S3 (and the other services) were themselves dependent on S3 (this is confirmed by Amazon's own post addressing the fact that this had been recognized and corrected). This means no updates were flowing: not to RSS, not to the displayed status table, nothing. The first indication was a box at the top of the page showing there were "errors." Again, fog of war, I get it. I'd rather they spent time fixing issues or troubleshooting. But in an operation the size of Amazon, there HAD to be a flood of calls, tickets and cries for support, and that slows things down. By thinking ahead a bit, and not hosting the status boards on the very technology they were reporting on, a good deal could have been done to stem what must have been an incredible flood of help requests.

What I learned from this: give information on what you know. Tell people a little more if you know it (more than "errors") and keep it updated. As the day was unfolding, there was one simple text update that took more than 45 minutes from the time it was posted (they timestamped it) to the time it actually showed up on the page. Share what you find when you can, and err on the side of over-sharing what you're doing so people don't wonder.

We've talked about this before with systems issues in our work with databases. The more information you can get out there, the more people know you're working on things, the more confident they feel that you have it handled, and the more they'll back off. One of my favorite suggestions, straight from Chris Shaw, is to provide not only updates, but the timing of the NEXT update. It might only be to say "we don't know anything more and are continuing to troubleshoot," but telling people when the next bits of information will be out there really helps calm nerves. This is as true of an in-house issue as it is of a major outage like the one we experienced with Amazon.
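To make the decoupling point concrete, here's a minimal sketch, in Python, of what a status update published from outside the monitored service might look like. The endpoint, file path, and update interval here are hypothetical placeholders I've invented for illustration, not anything Amazon actually runs; the idea is simply that the probe and the published status live on infrastructure that doesn't depend on the service being reported on, and that every update commits to a time for the next one.

    # A minimal, hypothetical sketch: probe a service from the outside and
    # publish a timestamped status update that also says when the next
    # update is due. The URL and output file are placeholders; the point is
    # that the status output lives somewhere the monitored service can't
    # take down with it.
    import datetime
    import json
    import urllib.request

    CHECK_URL = "https://example-service.invalid/health"  # hypothetical health endpoint
    STATUS_FILE = "status.json"  # in practice, hosted on independent infrastructure

    def check_service(url, timeout=5):
        """Return a coarse status string based on a simple HTTP probe."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return "operational" if resp.status == 200 else "degraded"
        except Exception:
            return "outage"

    def publish_status(status, next_update_minutes=30):
        """Write an update that is timestamped and promises a next-update time."""
        now = datetime.datetime.now(datetime.timezone.utc)
        update = {
            "status": status,
            "posted_at": now.isoformat(),
            "next_update_by": (now + datetime.timedelta(minutes=next_update_minutes)).isoformat(),
            "message": "We are investigating and will post again by the time above.",
        }
        with open(STATUS_FILE, "w") as f:
            json.dump(update, f, indent=2)

    if __name__ == "__main__":
        publish_status(check_service(CHECK_URL))

Even something this small captures the two lessons above: the status page can't be hostage to the outage it's reporting on, and readers always know when to check back.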

3. Transparency
Twitter is a great tool for intermittent updates. Talk about what you're doing, what you're seeing. This goes back to communication. You don't have to give away anything damaging. But being transparent about the work going on will save you a lot of grief.

What I learned from this: that people have a great sense of humor (the Twitter posts alone are gold) and that, in the absence of real information about what's being done, they'll make stuff up. Not something you really want or need.

4. Follow-through
It remains to be seen what follow-through will be provided by Amazon: what happened, what is being done to correct it, what customers should do, etc. It would be nice to see a summary, followed by things people should know and things they can do in the future if they're so inclined, covering everything from communications to architecture. Don't just set the indicators to green and clap your hands like it's all over and done.

Talk with people, explain what was done. Give a good review that satisfies people that this wasn't a fluke. Otherwise, it's likely you'll be left with people wondering whether it will happen again and second-guessing what was done to correct it. This is critical.

Summary
All of this is still what I would consider "fog of war" and my opinion. It remains to be seen what the response will be, and what the cause was. I hope we get good information and learn from it all the way around.