Editorials

Surprising Performance With # Tables

Last week I concluded with a high-level review of the join techniques defined in the ANSI standards, as implemented in SQL Server. We looked at the Merge, Loop, and Hash join methods, in preparation for today's topic: table variables and # tables.

The problem, bringing back this much-argued topic, is a simple report based entirely on the results of one stored procedure. Here is a contrived example similar to the problem we were resolving.

From the parent table we want a person’s First Name, Last Name and Birthdate.

From a child table of person data we want the Value column from the records for each person's first, second, and third dependents.

In order to produce this report, three queries were used.

  1. Select the persons to be included in the report based on input filters
  2. Select the dependent records using the results from query 1 as a filter, and pivot the results into three columns of Dependent 1, Dependent 2, Dependent 3.
  3. Return the final answer by joining the results from Query 1 and Query 2. You now have a result of Person Id, First Name, Last Name, and Birthdate from Query 1 joined to the pivoted dependent columns from Query 2.
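For illustration, the pivot in step 2 could be sketched with ROW_NUMBER and conditional aggregation. The Dependent table and its DependentNumber ordering column are assumptions here, not the actual schema from the report:

```sql
-- Number each person's dependents, keep the first three,
-- and pivot them into three columns.
SELECT d.PersonId,
       MAX(CASE WHEN d.rn = 1 THEN d.Value END) AS Dependent1,
       MAX(CASE WHEN d.rn = 2 THEN d.Value END) AS Dependent2,
       MAX(CASE WHEN d.rn = 3 THEN d.Value END) AS Dependent3
FROM (
    SELECT PersonId, Value,
           ROW_NUMBER() OVER (PARTITION BY PersonId
                              ORDER BY DependentNumber) AS rn
    FROM Dependent
) d
WHERE d.rn <= 3
GROUP BY d.PersonId
```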

My colleague put together a similar query. The performance was really slow: about 20 seconds to return a table of 20,000 records. When he asked me for help, we made a few adjustments and reduced the execution time to just under a second.

Reviewing the execution plan, we found that the first query was VERY fast, thanks to the indexes already in the database.

The second query was slow. The data we were working with could have thousands of child records. That's where my contrived example falls apart: it would be as if a person could have thousands of dependents, and we only want three specific ones. While that is not a real-world scenario, that is what our data looked like. The final query was the slowest of all.

So what did we change to get the improvements?

The results from the first and second queries had to be kept in a temporary structure to be used as filters and join sets for the final output. The developer initially started with Table Variables. Table variables do not have statistics, and generally can only be indexed through a PRIMARY KEY or UNIQUE constraint on one or more columns declared when you define the variable. This makes it more difficult to join table variable results to materialized database tables that have indexes. Inside a stored procedure, the optimizer has very little to go on when determining a good execution plan.
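The original table-variable version of the first result set might have looked something like this sketch (column names are illustrative, not the actual schema):

```sql
-- A constraint declared with the variable is essentially the only
-- indexing available; the variable never gets statistics.
DECLARE @Query1 TABLE (
    PersonId  INT NOT NULL,
    FirstName NVARCHAR(50),
    LastName  NVARCHAR(50),
    BirthDate DATE,
    UNIQUE (PersonId)
);

INSERT INTO @Query1 (PersonId, FirstName, LastName, BirthDate)
SELECT PersonId, FirstName, LastName, BirthDate
FROM Person
WHERE …  -- input filters
```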

The results of the second query were also in a Table Variable, so the final output joined two Table Variables. In our case we had 20,000 rows. That's not a lot of data for SQL Server; however, since it had no statistics or indexes to work with, the final query was not efficient.

First we tried simplifying all three queries into three sequential Common Table Expressions, since each CTE can reference any preceding expression. Sometimes this works really nicely. In this case it didn't, essentially because CTEs are not materialized any more than Table Variables are.
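A rough sketch of that CTE attempt, with each expression building on the one before it (the filters and pivot body are elided, and the column names are illustrative):

```sql
;WITH Query1 AS (
    SELECT PersonId, FirstName, LastName, BirthDate
    FROM Person
    WHERE …  -- input filters
),
Query2 AS (
    -- pivot of dependent values, limited to the PersonIds in Query1
    SELECT PersonId, Dependent1, Dependent2, Dependent3
    FROM …
)
SELECT q1.PersonId, FirstName, LastName, BirthDate,
       Dependent1, Dependent2, Dependent3
FROM Query1 q1
JOIN Query2 q2 ON q1.PersonId = q2.PersonId
```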

Temp tables, or # Tables as they are often called, can have statistics and indexes because they are materialized. So, we changed the first two queries to write their results to # Tables.

Create the first temporary table:

SELECT PersonId, FirstName, LastName, BirthDate
INTO #Query1
FROM Person
WHERE …

Create an index on PersonId to be used as a filter for Query 2 and a join column for Query 3:

CREATE INDEX idxQuery1 ON #Query1 (PersonId)

Create the second temporary table:

SELECT PersonId, Dependent1, Dependent2, Dependent3
INTO #Query2
FROM …  -- insert pivot query here, based on the PersonIds in #Query1

Create an index on PersonId in the #Query2 results to be used to join these results with #Query1:

CREATE INDEX idxQuery2 ON #Query2 (PersonId)

Return the final results:

SELECT q1.PersonId, FirstName, LastName, BirthDate
     , Dependent1, Dependent2, Dependent3
FROM #Query1 q1 WITH (INDEX = idxQuery1)
JOIN #Query2 q2 WITH (INDEX = idxQuery2)
  ON q1.PersonId = q2.PersonId

With this implementation, the query optimizer chose good join strategies for all three queries, because all of the working data was materialized and ordered efficiently. Because an index existed on the detail table, and there were now statistics on the first query's results, the optimizer chose a Loop join to get the detail records. This was much more efficient than the Hash join used previously. There was an exact one-to-one ratio of records between #Query1 and #Query2, and because the data was ordered the same way in both tables, SQL Server chose a Merge join for the final query. The time required to build the indexes was far less than the time required to process the query without them.

In situations like these, where you know a lot about the data domain you are querying, you can optimize performance using techniques such as these. They are not required, and are often not worth the time.

The point is, Table Variables and # tables do not behave the same. Simply because Table Variables are often described as in-memory does not mean they will perform faster.

Cheers,

Ben