Editorials

Interesting Lessons Learned in the Cloud

As we’ve been going through this conversion of SSWUG.org to a new model and platform, we’ve been converting just about everything along the way: new systems, new databases, new front- and back-end technologies. It’s been a very complex project, by far the most complex I’ve worked through in the entire history of the site.

Along the way, we’ve stumbled into some very important lessons learned about the migration and the things we’ve come up against. I wanted to share a few of those here in the hope that they can save you some work on your own implementations.

First, on databases – as you might imagine, we brought the new site online in test mode. We had been through literally months of testing and load testing, beating up options and approaches on the site. When we initially cut it over, it was an outright failure.

We didn’t realize it at the time, but the failure came down to under-powering a combination of things. The two main culprits turned out to be the database instance size and the front-end server sizing we selected.

I was ready for a “slow” server if traffic hit just so, and we had contingencies in place to deal with that and scale up nearly immediately. I wanted to start at the middle-to-low end of what we thought would be the reasonable range of power – in other words, not “we know this won’t work,” but “it’s not maxed out.”

What we saw, though, was a mix of settings that only showed their true colors when the site went live. The combination of cross-VPC traffic, the database instance capacity and the server capacity conspired to take the site functionally offline. The symptoms, aside from being offline to the world, were 100% CPU on the front-end servers, 100% capacity utilization on the database instance and a LOT of traffic across the “bridge” between resources inside and outside our VPC.

It all looked like the site was stuck in an infinite, capacity-eating loop. It was extremely difficult to diagnose, because the symptoms seemed to suggest opposing issues.
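
If you find yourself chasing a similar tangle, it helps to watch both sides of the jam at once. Here’s a minimal sketch, assuming an AWS environment (the VPC terminology gives that away) and using placeholder instance identifiers, that pulls recent CPU numbers for a front-end instance and the database instance side by side:

    # Minimal monitoring sketch: recent CPU for an EC2 front end and an RDS
    # database instance, so both sides of the jam can be watched together.
    # Instance identifiers below are hypothetical placeholders.
    from datetime import datetime, timedelta, timezone

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    def recent_cpu(namespace, dimension_name, dimension_value):
        """Average CPU utilization over the last 15 minutes, in 1-minute buckets."""
        end = datetime.now(timezone.utc)
        stats = cloudwatch.get_metric_statistics(
            Namespace=namespace,
            MetricName="CPUUtilization",
            Dimensions=[{"Name": dimension_name, "Value": dimension_value}],
            StartTime=end - timedelta(minutes=15),
            EndTime=end,
            Period=60,
            Statistics=["Average"],
        )
        return sorted(stats["Datapoints"], key=lambda d: d["Timestamp"])

    for point in recent_cpu("AWS/EC2", "InstanceId", "i-0123456789abcdef0"):
        print("front end", point["Timestamp"], round(point["Average"], 1))

    for point in recent_cpu("AWS/RDS", "DBInstanceIdentifier", "example-db"):
        print("database ", point["Timestamp"], round(point["Average"], 1))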

It turns out that the biggest components of the fix were the sizing of the front-end instances and an architecture that put everything in a common VPC. Addressing the latter alleviated the traffic pattern issues (everything now lived on the same logical cloud/network segments), and it gave a surprisingly significant boost to overall throughput.
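
A quick way to sanity-check that everything really does live on the same segments is to ask the API directly. This is just a sketch, again assuming AWS, with placeholder instance IDs:

    # Audit sketch: confirm the web tier and supporting instances share a
    # VPC/subnet after consolidation. Instance IDs are placeholders.
    import boto3

    ec2 = boto3.client("ec2")

    response = ec2.describe_instances(
        InstanceIds=["i-0123456789abcdef0", "i-0fedcba9876543210"]
    )
    for reservation in response["Reservations"]:
        for instance in reservation["Instances"]:
            print(
                instance["InstanceId"],
                instance.get("VpcId"),
                instance.get("SubnetId"),
                instance.get("Placement", {}).get("AvailabilityZone"),
            )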

The front-end instance capacity mismatch led to what we’ve identified in retrospect as queueing issues for the database – the database would send results back to the front end, but the front end couldn’t keep up. The result was an outright traffic jam that fed on itself within minutes of being live. It was maddening to troubleshoot: the web processes would spin up and take more and more resources, while the database queue would simply grow to broken proportions.
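
If you want to picture why the jam feeds on itself, a toy simulation makes it obvious – a producer that outruns its consumer just stacks work up without end. This is purely illustrative, not our actual stack:

    # Toy simulation: the "database" produces results faster than an
    # undersized "front end" can consume them, so the backlog only grows.
    import queue
    import threading
    import time

    backlog = queue.Queue()

    def database(results_per_second=50, seconds=5):
        for i in range(results_per_second * seconds):
            backlog.put(f"result-{i}")
            time.sleep(1 / results_per_second)

    def front_end(requests_per_second=10):
        while True:
            backlog.get()
            time.sleep(1 / requests_per_second)  # undersized: too slow to keep up
            backlog.task_done()

    threading.Thread(target=database, daemon=True).start()
    threading.Thread(target=front_end, daemon=True).start()

    for _ in range(5):
        time.sleep(1)
        print("backlog depth:", backlog.qsize())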

Once we increased the front-end capacity, it cleared up nearly instantly. It was like watching a magic trick: the database instance cleared, performance returned and the choir sang.
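
For reference, bumping an instance up a size is a short, scripted affair – with the caveat that the instance has to be stopped before its type can change, so plan for a brief outage on that node. A sketch, assuming AWS EC2, with a hypothetical instance ID and target size:

    # Sketch of scaling a front-end instance up in place. The instance ID
    # and target type are hypothetical placeholders.
    import boto3

    ec2 = boto3.client("ec2")
    instance_id = "i-0123456789abcdef0"
    target_type = "m5.xlarge"

    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    ec2.modify_instance_attribute(
        InstanceId=instance_id,
        InstanceType={"Value": target_type},
    )

    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])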

Boy, was it tough to troubleshoot. For a while we thought there were DNS cutover issues. At another point, it seemed like firewall trouble. Then the queueing and such came into play as well.

Long story short: once again, make sure you look at the simple things first. Make sure you understand, as you set up your configuration, what the implications of your sizing choices are. What happens if they’re not enough? Does the information flow slow down? Stop altogether? It’s very important to understand.
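
One simple way to answer that question before cutover is to lean on a staging copy with increasing concurrency and watch how response times degrade. A bare-bones sketch, using a placeholder staging URL and the common requests library:

    # Bare-bones load check: hammer a staging URL at a few concurrency levels
    # and report how latency degrades. The URL is a placeholder.
    import time
    from concurrent.futures import ThreadPoolExecutor

    import requests

    URL = "https://staging.example.com/"

    def timed_get(_):
        start = time.perf_counter()
        requests.get(URL, timeout=30)
        return time.perf_counter() - start

    for concurrency in (5, 20, 50):
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            latencies = sorted(pool.map(timed_get, range(concurrency * 4)))
        p95 = latencies[int(len(latencies) * 0.95) - 1]
        print(f"concurrency {concurrency}: p95 {p95:.2f}s, worst {latencies[-1]:.2f}s")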

More on this as we continue to tweak and tune, but it’s been a great learning curve!