When we decided to make our big move to the cloud, we learned… a lot. We learned all sorts of things about options, settings, security, and running on infrastructure that isn't our own custom setup. Quite a bit of it is good-to-know, best-practices material, but some of it was simply a fact of moving the way we did.
This is part 2 of our series on moving to the cloud, covering what we learned and were surprised by in the process. Here's a link to part 1.
Cloud Architecture Can Make Troubleshooting Challenging
This was a huge lesson for us to learn. Basically, you're introducing a large number of variables into what may previously have been a fairly closed ecosystem. Sending email from your app or database? It's not so simple anymore when you're having issues. It could be the server, the process, the email-sending service, the interface, security, or your usage caps.
The way the cloud environment works, you end up with a whole combination of services and components supporting your system. Getting all of these working together is part of what leads to the vendor lock-in issue (once everything is set up around a provider's components, it's harder to move providers). It also means more things that have to be tuned, set up just right, and talking to each other: VPCs, firewalls, security, access controls in general, load balancers, email services, database services… all of it. And because many of those components are individual, separate services, when one of them can't talk to the others, it's definitely a "sit down and take a breath" situation.
One of the things we learned very quickly is that you have to be aware of the system as a whole. If you only have a few components in view, you get a much more limited look at what may be going on. If you can narrow it down definitively, you'll be excluding variables and can get down to work. A few times, particularly during setup, we had accessibility issues that were confusing at first: X couldn't get to Y. But when you eliminate the things that are working, you can get busy fixing the actual issue much faster. While this seems obvious, in the heat of the moment it adds a beat or two to figuring out what may be happening. We were stung by this: we made assumptions based on symptoms, evaluated them based on old approaches and old infrastructure and, frankly, got it wrong.
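To make "eliminate the things that are working" concrete, here's a minimal sketch of the kind of check that helps during those X-can't-get-to-Y moments. It's Python, standard library only, and the hostnames and ports are hypothetical placeholders for your own components; the point is simply to test each dependency in isolation so the healthy pieces can be ruled out before you start digging.

```python
import socket

# Hypothetical endpoints -- substitute the components your system actually
# depends on. Each one is tested independently so working pieces can be
# ruled out quickly.
ENDPOINTS = {
    "database":      ("db.internal.example.com", 5432),
    "smtp relay":    ("smtp.mailservice.example.com", 587),
    "load balancer": ("lb.internal.example.com", 443),
}

def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    for name, (host, port) in ENDPOINTS.items():
        status = "reachable" if can_reach(host, port) else "NOT reachable"
        print(f"{name:15s} {host}:{port:<5d} {status}")
```

A reachable component isn't necessarily a healthy one, of course, but a quick pass like this tells you whether you're chasing a connectivity/security-group problem or something inside the service itself.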
Scaling Spans Components
When you're scaling up to production or addressing new requirements, it's often more complex than expected to bring up capacity (or, conversely, to shut it down). Going back to the whole "many components" thing, you may find, for example, that there are execution caps and throttles on each individual component: database power, connections, processing power at the server(s), loading of interfaces, and so on. All of these pieces come into play when you're scaling, and missing one could impair your systems and the scaling effort.
As an aside on scaling DOWN: this is one area where you have real potential for hard costs for no reason. If you've scaled up and are returning to "normal," but miss some components on the way back down, you'll potentially keep paying for the increased capacity of those components while everything else has scaled down.
Make sure you have a map/plan/outline of what you scale up so you know what to scale down, and double-check that the operations complete as you reduce capacity. It can save you significant expense and surprise.
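One lightweight way to keep that map honest is to treat it as data rather than tribal knowledge. Here's a rough sketch along those lines; the component names and sizes are made up, and in a real setup the "current" values would be pulled from your provider's tooling rather than typed in by hand. The idea is just to flag anything still sitting above its baseline after a scale-down.

```python
# A hypothetical scale-up/scale-down map. Each entry records a component's
# baseline ("normal") setting, what it was raised to for the event, and its
# current setting. Values here are illustrative only.
SCALING_PLAN = [
    {"component": "app servers",    "baseline": 2,         "scaled_to": 8,           "current": 2},
    {"component": "db instance",    "baseline": "m.large", "scaled_to": "m.2xlarge", "current": "m.2xlarge"},
    {"component": "db connections", "baseline": 100,       "scaled_to": 400,         "current": 100},
    {"component": "worker queue",   "baseline": 4,         "scaled_to": 16,          "current": 16},
]

def still_scaled_up(plan):
    """Return the components whose current setting hasn't returned to baseline."""
    return [row for row in plan if row["current"] != row["baseline"]]

if __name__ == "__main__":
    leftovers = still_scaled_up(SCALING_PLAN)
    if leftovers:
        print("Still paying for scaled-up capacity on:")
        for row in leftovers:
            print(f"  - {row['component']}: current={row['current']}, baseline={row['baseline']}")
    else:
        print("Everything is back at baseline.")
```

Even a simple check like this, run after the event is over, catches the "oops, the database is still at the big size" surprises before they show up on the bill.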
Support Is, Indeed, Not The Same as In-House (of course)
At the start of this series of posts, a commenter said they'd never seen provider support care as much as in-house staff does.
This is absolutely true.
Even with outstanding support, they're looking at a much bigger picture. If you get in touch and say "XYZ service is failing," I can almost guarantee one of their first steps is to check whether it's a broader issue rather than just your own use of that service. If it is, it gets escalated, and not just because of your issue, but because it's affecting others as well.
When we first started, there were power issues (power backup issues, to be more specific) under certain situations a couple of times. In addition, we had some early disk issues. BOTH of those times, we got the standard status reports that something was wrong, but any kind of feedback, estimate of time to resolution, or other information, let alone a way to influence the process, push it along, or help, was just non-existent.
This can be extremely hard in the heat of a big issue, and that’s an understatement. It can be maddening.
You are one of many. The good side is that if you're experiencing an outage, the pressure on the provider will be immense. The bad side is that, unless you're a major customer, you're just one in the mix. That means you wait for resolution, are thankful you built failover and recovery into your setup, and are thankful when things return to normal. It's harsh, but it's reality. They'll have great folks working on it, but when things are completely messed up, we all want more information, not less, as the support team turtles up to deal with the issue. It's very difficult.
Summary – So Far
So far, the cloud has been nothing short of amazing for our use. Platform as a service, infrastructure as a service, blah, blah, blah. We've deployed and used just about every combination at this point. It's brought more power, more recoverability, and more control to our environment. But there have definitely been times when we were bruised and battered getting it done.
As a tech guy, it’s been fascinating and I love getting the exposure to it all.
As a manager, I want to kill someone.
As a customer/attendee/reader, I like to think the end-user experience has been improved greatly.
As an accountant, the net-net is that we’re saving money, significant money, and have some additional controls on our spend when it comes to infrastructure.
I have to close, of course, with YMMV. But I chuckle when people say "just move to the cloud and save money." (Yes, I've heard people say that.) It's just not quite that simple a choice or process.