Yes, it happens. Systems fail, someone trips over a cable (hopefully not so much these days in major data centers), and so on. What has struck me, though, is how limited your options are for dealing with downtime when you’re hosting in these types of services – and it doesn’t matter whether it’s “stretch” services, hybrid, or entirely cloud.
Consider for a moment what happened when Amazon’s S3 slipped up. Twitter came alive with “the internet is down!” – not far from the truth. So many things are hosted on that single service, from images to entire sites and much more, that sites all over the globe broke. But getting information while the systems are offline – ETAs for resolution, root causes, anything useful – is really quite tough. I’ve seen this with platform as a service, software as a service, and on-premise mixed solutions (or even on-premise cloud, depending on who is running the show). The fact is that these services are often so far removed from your applications and environment that you end up with separate teams running things, and not much triage information flowing when things go wrong.
John brought this up in a recent comment, too: “When they fail, how in-depth can we go to find out why? You really need to spend the time to poke around cloud technologies to be aware of potential gotchas – both technical and fiscal.”
This is a key point when moving to these types of split environments: it’s critical to have a plan in place for dealing with downtime. It’s nearly impossible to imagine every single possibility, and some are more “expensive” than others (in time and/or money) to account for. There may be cases where you’re simply left to wait it out. Even then, however, transparency and communication with your users can be your plan. Set up a Twitter account, an RSS feed, an email list – whatever it takes so people know you’re on it and have at least a rough idea of what’s going on. If you have an estimated time back online, share it. This includes customers, end-users, everyone.
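As a rough illustration, here’s a minimal Python sketch of what “whatever it takes” might look like: one function that pushes the same update through every channel at once. The hostnames, addresses, and file path are placeholders, not a recommendation of any particular stack.

```python
# A minimal sketch of pushing one status update to multiple channels.
# The SMTP host, addresses, and feed path below are placeholders.
import smtplib
from email.message import EmailMessage
from datetime import datetime, timezone

def post_update(summary: str, detail: str, eta: str = "under investigation") -> None:
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    body = f"{stamp} - {summary}\n{detail}\nEstimated time back online: {eta}"

    # Channel 1: append to a plain-text status log served as a static page.
    with open("/var/www/status/updates.txt", "a") as feed:
        feed.write(body + "\n---\n")

    # Channel 2: email the distribution list (customers, end-users, support).
    msg = EmailMessage()
    msg["Subject"] = f"[Status] {summary}"
    msg["From"] = "status@example.com"
    msg["To"] = "status-subscribers@example.com"
    msg.set_content(body)
    with smtplib.SMTP("smtp.example.com") as server:
        server.send_message(msg)

post_update(
    "Image hosting degraded",
    "Our storage provider is reporting elevated error rates. We're on it.",
)
```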
One important point on this – you want that communication mechanism hosted and provided for separately from your core operational systems. The irony was clear when, during one S3 outage, the system status panel – the place you went to see what was happening – was hosted on … S3. Well, S3 was down, so the panel was meaningless. Consider Twitter or other systems for that type of communication, and make sure the account is monitored, particularly during an outage. People may respond or need more information.
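To make that concrete, here’s a minimal sketch of an out-of-band check, with hypothetical URLs: the monitor runs on infrastructure that shares nothing with what it watches, and it checks the status channel too, since a status page that dies alongside the service it reports on is worthless.

```python
# A sketch of an out-of-band health check. Run this from infrastructure
# that shares nothing with the systems it watches. Both URLs are hypothetical.
import urllib.request

def is_up(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers successfully within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except OSError:  # URLError, HTTPError, and socket timeouts all derive from OSError
        return False

core_ok = is_up("https://app.example.com/health")
# The whole point: the status page must not share the core systems' fate.
status_ok = is_up("https://status.example.net/")  # different host, different provider

if not core_ok and not status_ok:
    print("Core AND status page are down - they share a dependency. Fix that.")
elif not core_ok:
    print("Core is down; post updates to the (independent) status page.")
```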
It’s a bit like the Center Core of the Falcon Heavy rocket. All sorts of incredible, amazing things happened with that launch (finally, flying cars!), but for far too long afterwards the internet was buzzing about what happened to the one bit that didn’t work quite right. I’m in no position to coach SpaceX, and I personally think the launch was a huge milestone. As an analogy for this discussion, though, it shows the value of keeping the information flowing, even if it’s just a series of messages:
– We can’t seem to communicate with the Center Core. We’re investigating.
– Still no word about the Center Core. We believe it missed the landing barge and are investigating.
– Hey, 2 out of 3 ain’t bad! After a day with an incredible launch and recovery of 2 boosters, unfortunately, we must report that the Center Core was lost at sea.
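For what it’s worth, a sequence like that maps directly onto the hypothetical post_update sketch from earlier – the wording here is illustrative, not SpaceX’s:

```python
# The same hypothetical post_update function from above, fed the
# sequence of updates as the story unfolds.
post_update("Center Core status unknown",
            "We can't seem to communicate with the Center Core. We're investigating.")
post_update("Center Core still unreachable",
            "We believe it missed the landing barge and are investigating.")
post_update("Center Core lost at sea",
            "2 out of 3 ain't bad: both side boosters recovered; the Center Core was lost.",
            eta="n/a")
```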
You can apply the same approach to your own systems to keep people calmer. Have backup systems. Have communication channels. Know what you can do to keep functioning if something drops offline. It makes for much happier folks all around.
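In that spirit, here’s one last minimal sketch of a fallback path, with placeholder hosts and paths: if the primary store is unreachable, serve from a mirror or a stale local cache rather than failing outright.

```python
# One way to "keep functioning": try the primary store, then a mirror,
# then a stale local cache. All hosts and paths are illustrative placeholders.
import urllib.request

def fetch_asset(path: str) -> bytes:
    sources = [
        f"https://assets-primary.example.com/{path}",  # the usual origin
        f"https://assets-mirror.example.org/{path}",   # a mirror on another provider
    ]
    for url in sources:
        try:
            with urllib.request.urlopen(url, timeout=3.0) as response:
                return response.read()
        except OSError:
            continue  # that source is down; try the next
    # Last resort: a stale local copy beats a broken page.
    with open(f"/var/cache/assets/{path}", "rb") as cached:
        return cached.read()
```

A stale image is a degraded experience; a broken page is an outage.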