Editorials

Comparing Data Warehouse Designs

Today let’s compare the traditional data warehousing database schemas.

Normalized data is not often used today for data warehousing. The key reason is that, while the normalization of the tables results in less redundancy, and higher quality of data, it comes with a cost, because data needs to be joined from multiple tables to produce the results needed for reporting or data mining. You may see this form used in reporting databases because it may be more easily replicated from an OLTP database, or performed against an OLTP database directly.

Data Marts are more common for reporting, and even some data mining. They are the result of joining multiple tables from an OLTP database, and/or other external data sources. The data is granular in a detail mart, or may contain aggregate values in summary data marts. Generally data marts are used independent of other tables, resulting in a narrow focus.

A Star Schema combines multiple narrowly defined data marts. They consist of a fact table at the center, intersected with multiple dimension data marts. I call them data marts because they are de-normalized fact tables consisting of repeating data in rows and/or columns. A star schema lends itself very well to data mining due to the multiple dimensions associated with a fact. Cubes are readily derived from this data structure, and data mining languages, or tools such as powerpivot consume this structure quite easily. The facts and dimensions are easy for end users to understand, and little work is required to setup a mining problem.

The most sophisticated design of all combines normalization, and star schemas. The result is the snowflake schema. It also works well as a data mining source. The benefit is the reduction of replicated data for dimensions. This saves space. However, like reporting from a normalized database, a snowflake schema requires more work on the data engine to process queries due to additional join processes. It also requires a little more sophistication on the part of your consumers. They must be able to understand the additional tables, and how things fit together. The key reason to use a snowflake schema of a star schema is the reduction of repetitive data.

If you’d like to see a more detailed comparison of a star schema to a snowflake schema, you’ll find a good comparison at http://www.diffen.com/difference/Snowflake_Schema_vs_Star_Schema.

Tomorrow we’ll contrast these traditional data warehouse structures to non-structured data mining.

Cheers,

Ben