Editorials

A Tip from the Trenches

Here’s an interesting scenario we all face at one time or another. Your application runs fine until you increase the load. It stands up pretty well with four or five processes, but when you get to that sixth process, things begin to degrade, and each additional process from that point on makes things exponentially worse. At some point nothing is working, and all users are getting database timeout errors.

First you go to the reports built into SQL Server Management Studio. There is no blocking or deadlocking reported; things are just slow. The longest-running queries are the ones that generally run sub-second, but they have now degraded to the point where they hit your application's command timeout setting.
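
If you'd rather query for this than click through the reports, sys.dm_exec_query_stats gives a rough equivalent. This is a minimal sketch, assuming SQL Server 2005 or later, and it only covers plans still in cache:

    -- Queries with the highest average elapsed time among plans still in cache
    SELECT TOP (10)
           qs.total_elapsed_time / qs.execution_count AS avg_elapsed_microseconds,
           qs.execution_count,
           SUBSTRING(st.text, 1, 200) AS query_text
    FROM sys.dm_exec_query_stats AS qs
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
    ORDER BY avg_elapsed_microseconds DESC;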

Now you start digging deeper, looking into the DMVs built into SQL Server. Aha, there’s the answer. You have some very long wait states going on during your stress tests. There is competition for some resource. No single query is blocking any other specific query. They are all simply waiting for each other.
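
Two of the DMVs do most of the work here. A minimal sketch, again assuming SQL Server 2005 or later:

    -- Cumulative waits since the instance started (or since the stats were cleared)
    SELECT TOP (10)
           wait_type,
           waiting_tasks_count,
           wait_time_ms,
           signal_wait_time_ms
    FROM sys.dm_os_wait_stats
    ORDER BY wait_time_ms DESC;

    -- What each active request is waiting on right now
    SELECT session_id,
           status,
           command,
           wait_type,
           wait_time,
           blocking_session_id
    FROM sys.dm_exec_requests
    WHERE session_id > 50   -- skip most system sessions
    ORDER BY wait_time DESC;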

My initial inclination was to try to optimize the query. In this case, though, all the optimization in the world would not have helped: the problem was an application design flaw.

The application selected records it wanted to process and put them into a log table. Then it selected from the log table to process a batch of records. Once the records were processed, it updated the log table to mark them as completed so they would not be included in a future batch.
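
In rough T-SQL, the pattern looked something like the sketch below. The table and column names are hypothetical, since the actual schema isn't part of the story:

    DECLARE @BatchId int = 42;   -- hypothetical batch identifier

    -- Step 1: select the records to process and put them into the log table
    INSERT INTO dbo.ProcessLog (RecordId, BatchId, Completed)
    SELECT r.RecordId, @BatchId, 0
    FROM dbo.SourceRecords AS r
    WHERE r.NeedsProcessing = 1;

    -- Step 2: select from the log table to process one batch of records
    SELECT r.*
    FROM dbo.SourceRecords AS r
    JOIN dbo.ProcessLog AS l
      ON l.RecordId = r.RecordId
    WHERE l.BatchId = @BatchId
      AND l.Completed = 0;

    -- Step 3: mark the batch as completed so it is not picked up again
    UPDATE dbo.ProcessLog
    SET Completed = 1
    WHERE BatchId = @BatchId;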

Since gathering the data to process joined large amounts of data to that log table, there was a lot of contention on that one table. To quickly get something going, the procedure that gathered data for a batch was modified: the records for the batch were first copied from the log table into a table variable, and the table variable was then joined to the large query that gathered the records to process. This simple change was implemented and the stress test executed again, with similarly poor performance.
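
The quick fix amounted to something like the following sketch, using the same hypothetical names as above: pull the keys for the current batch into a table variable first, then join that small set to the large gathering query instead of joining the log table directly.

    DECLARE @BatchId int = 42;   -- hypothetical batch identifier
    DECLARE @Batch TABLE (RecordId int PRIMARY KEY);

    -- Copy just the keys for this batch out of the contended log table
    INSERT INTO @Batch (RecordId)
    SELECT l.RecordId
    FROM dbo.ProcessLog AS l
    WHERE l.BatchId = @BatchId
      AND l.Completed = 0;

    -- Join the small table variable to the large query that gathers records
    SELECT r.*
    FROM dbo.SourceRecords AS r
    JOIN @Batch AS b
      ON b.RecordId = r.RecordId;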

An engineer checking the CPU and memory usage on the application server found that both were pegged at 100%, and paging was out of this world. The database was slow simply because the application server could not consume the large amount of data being returned from the database quickly enough. Adding memory to the application server resolved the issue altogether.

The moral of the story: when performance problems arise, don't blame your database without first checking into what else is going on.

Cheers,

Ben