Editorials

Optimizing Group By Queries

Today I want to talk about a common issue where we want to get a distinct list from a parent table, having a filtered value from a child table that may have many records. Here’s an example to explain the scenario, and it’s one that happens quite often.

Let’s say you want to get a list of regions. For each region, you want a distinct count of the customers who purchased any kind of sandpaper from your web site. If you have a normalized database you could start with the products table, and find all of your sand paper products. Then, join that to the order items table. This would be joined to the orders table to get the customer id. This would then return the list of customers and the region that represents them. A group by query of region and distinct customer id from the customer/purchase order table would return your count. Finally you need to join the results of that query to the region table, and get all of the properties for the region.

This query touches five tables nested four levels deep. Region->Customer->Order->OrderItem->Product. This could be written as a single query and may even perform quite well. A simple query could look like the following if you wanted to get the region ID, Description, LeadSalesPerson, State, Country

SELECT
        r.RegionId
        ,r.Description
        ,r.LeadSalesPerson
        ,r.State
        ,r.Country
        ,COUNT(DISTINCt c.CustomerId) AS CustomerCount
FROM	Region		r
JOIN	Customers   c	ON	r.RegionId = c.RegionId
JOIN	Order		o	ON	c.CustomerId = o.CustomerId
JOIN	OrderDetail	od	ON	o.OrderId = od.OrderId
JOIN	Product		p	ON	p.ProductId = od.ProductId
WHERE	p.ProductCategory = ?SANDPAPER?
GROUP BY 
        r.RegionId
        ,r.Description
        ,r.LeadSalesPerson
        ,r.State
        ,r.Country

If you look at this query can you see what you are really aggregating against? You are actually aggregating against the RegionID and Distinct CustomerID from Customer and Order. There other tables are necessary only for filtering or added data attributes desired in the final query. So, a simple optimization of this query is to reduce the number of columns that must be unique in the GROUP BY clause. You can do this by breaking the query down into two parts using an inline query ore a common table expression (CTE). I’ll demonstrate the difference using the CTE.

;WITH Summarized
AS
(
    SELECT 	c.RegionId
            ,COUNT(DISTINCt c.CustomerId) AS CustomerCount
    FROM	Customer	c	
    JOIN	Orders	o	ON	c.CustomerId = o.CustomerId
    JOIN	OrderDetail	od	ON	o.OrderId = od.OrderId
    JOIN	Product		p	ON	p.ProductId = od.ProductId
    WHERE	p.ProductCategory = ?SANDPAPER?
    GROUP BY 
     c.RegionId
)
SELECT
        r.RegionId
        ,r.Description
        ,r.LeadSalesPerson
        ,r.State
        ,r.Country
        ,c,CustomerCount
FROM	Summarized	c
JOIN	Region		r	ON	c.RegionId = s.RegionId

 

Depending on the number of orders and other data, etc. this query will perform much better because it does not have to aggregate for all of the properties from the Region table. It only groups by a single column, RegionId and doesn’t join the region table until the unique distinctions have already been filtered.

Technically, you could break this down even more by aggregating the OrderDetails for the distinct OrderID. However, with this example, it is less likely that a person is going to be purchasing a lot of line items for a single order having sandpaper. If that weren’t the case, then you could use the same principal again at a lower layer of the query. Your knowledge of the data domain will help you determine if the extra work in your query is worth the effort.

Tomorrow we’ll look into how you can perform a similar query using entity framework, using the same principal of grouping the least amount of data, and then joining the details once the grouping has been completed.

Cheers,

Ben