Editorials

Clustered Index Conversation – Part II

Today we continue to interact with a question on Clustered Index posted by Ravi. He writes:


I would want to know your thoughts on the clustered indexes. I would create my clustered index

  • a) on a transactional table
    • column/columns which is heavily queried upon, I would prefer it to be a numeric and sequential most of the time and does not allow nulls.
    • on a date/time column if the table is kind of logging table
  • on a static/lookup table
    • I would primarily do it on a column which is very frequently queried upon and does not allow nulls. I would be okay even if the column/columns are non-numeric.


Because of the Microsoft’s recommendation on ID (auto increment) column on every table, people often considers the ID column to be the column for clustered index because there is no need to worry about fragmentation during insert because it is sequential. But my question is,


  • insert happens only once, what about all the SELECT/UPDATE/DELETE s which needs to seek a row, Do we not want to consider the clustered index being on a heavily queried column on?

  • What about the fill factor we leave on each page to avoid page split just in case if the data is not sequential?


Please provide me your thoughts on the clustered index column selection and correct me if I am wrong.

Microsoft does indeed recommend a clustered index on each table based on a sequential value. I believe they have tuned SQL Server for that purpose. However, I have found that this recommendation works best for OLTP databases where data is constantly being inserted. In my opinion, there is no fill factor less than 100% that makes sense because data is always inserted at the end of the btree, being sequential. This results in a hot spot on the disk, but less head movement.

Of data insertion is the highest priority for your database, then perhaps the sequential column clustered index may be the most efficient method. However, if you report heavily on the data you may find that using other data types as a clustered index provide additional performance. For example, lets say you had a table of customer transactions, and the heaviest queries of the data are based on a single customerID. In this case, a clustered index based on CustomerID, followed by a sequential number can provide additional performance during reads.

Inserts are going to be less performant because data will not always be added at the end of the btree. In this case, a lower fill factor makes sense because it leaves room for additional records to be added without re-balancing the btree when the record is added throughout the index based on the customerId. In this case you sacrifice disk space and insert performance for faster read times.

In a data warehouse, or tables where the data is more static, the same clustered index may be used effectively without the additional fill factor, and perform inserts more efficiently if the data is added in the order it is organized in the clustered index. Adding records by Customer first and then by ID results in all the new records being added to the end of the btree, so the fill factor may be maintained at 100% with no loss. The queries perform more quickly because data is contiguous.

We’ll take a look at some more specifics tomorrow. For now, feel free to join in the conversation by adding your own comments to the discussion.

Cheers,

Ben