Editorials

Keeping Data Clean

Data is an asset for almost any company. I’ve worked on systems for companies that actually carry the data they collect and manage as an asset on their books. It is something they valuate, and it is included in the total worth of the company. SSWUG is an example of data being an asset as well. The editorials, articles, scripts, forums, videos, and so on are core assets of our business.

But data can be messy, especially data gathered from multiple sources or accumulated over time. For example, if you were to look at my medical data from the hospital in which I was born, it would not have a Social Security Number (a national tax identification number used in the United States). It wouldn’t have an email address or a phone number. These are common data points in medical records for adults. This demonstrates one kind of situation where data may become incomplete or inaccurate over time. When I was born, these data points were not available. Now that they are, could the original record be updated?

Here’s another example you have probably experienced: broken links on the internet. You go to a page that links somewhere else, but the content has been removed. Would you use a resource where the preponderance of the links were invalid?

These are just a couple of the problems that can be identified with the data validation capabilities built into Master Data Services, which ships with the Enterprise and Developer editions of SQL Server.

Master Data Services is not intended to cleanse your data directly. It identifies anomalies and recommends updates when they are available. It provides a series of tools for validating data. Often the data being validated comes from multiple sources and is checked prior to being imported into a data warehouse. Data issues that are found are logged, and users may correct the data prior to ingesting it.
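That validate-log-fix workflow can be illustrated with a minimal sketch. This is my own illustration in Python, not MDS itself: the rules, field names, and sample records are all assumptions made up for the example. Records are checked against a set of rules, failures are logged for a user to review, and only clean rows proceed to the load.

```python
import re

# Hypothetical rules: each field maps to a predicate that returns
# True when the value is acceptable (None means "not yet collected").
VALIDATION_RULES = {
    "email": lambda v: v is None
        or re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None,
    "ssn": lambda v: v is None
        or re.fullmatch(r"\d{3}-\d{2}-\d{4}", v) is not None,
}

def validate(records):
    """Split records into clean rows and logged issues."""
    clean, issues = [], []
    for rec in records:
        problems = [field for field, ok in VALIDATION_RULES.items()
                    if not ok(rec.get(field))]
        if problems:
            issues.append((rec, problems))  # logged for a user to fix
        else:
            clean.append(rec)               # safe to ingest
    return clean, issues

clean, issues = validate([
    {"name": "Ben", "email": "ben@example.com", "ssn": "123-45-6789"},
    {"name": "Sam", "email": "not-an-email", "ssn": None},
])
# The first record is clean; the second is logged with a bad email.
```

In a real MDS deployment the rules are defined declaratively as business rules against a model, but the shape of the process is the same: nothing questionable reaches the warehouse without a person seeing it first.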

The neat part is that you can apply the validations you design to data coming from multiple sources, and even merge the results. So, a medical record for me when I was born could be merged with a medical record from later in life, resulting in a single, comprehensive record of me as I am known today.
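One simple merge policy for records known to describe the same person is "newer non-empty values win, older values fill the gaps." The sketch below is my own illustration of that idea, not how MDS merges members; the field names and sample records are invented for the example.

```python
def merge_records(older, newer):
    """Merge two records for the same entity: start from the older
    record, then let any non-empty value from the newer record win."""
    merged = dict(older)
    for field, value in newer.items():
        if value is not None:
            merged[field] = value
    return merged

# Hypothetical records: fields unknown at birth are None.
birth_record = {"name": "Ben", "ssn": None,
                "email": None, "blood_type": "O+"}
adult_record = {"name": "Ben", "ssn": "123-45-6789",
                "email": "ben@example.com"}

merged = merge_records(birth_record, adult_record)
# merged keeps the blood type recorded at birth and picks up the
# identifiers that only became available later in life.
```

The hard part in practice is not the merge itself but deciding which records describe the same entity, which is where matching and survivorship rules earn their keep.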

This is just the tip of the iceberg. If you have ever had to identify duplicate data across multiple data sets and generate a single source of truth, this kind of tool will be of great interest. It’s worth looking into anyway.

Cheers,

Ben