All I Want for Christmas is a Clean Database
I was reading an advertisement from Melissa Data regarding tools they have for data cleansing. They have tools for verifying and standardizing all kinds of data about people and organizations. As I was reading the ad, it reminded me of a project I worked on years ago when I first completed college. I would have loved to have tools like Melissa Data provides.
I was working as a consultant for 4 different institutions. They all had separate databases of financial donors and donations. Most of the donors made donations to at least two of the organizations. Some donated to all four.
At the end of the year a final statement of donations was produced separately by each of these organizations for reconciliation and reporting. The donors requested that the statements be consolidated into a single report instead of four separate reports. From the perspective of the donors, these organizations were simply different departments of the same organization, not separate corporations.
I had the challenge of taking and consolidating the data from each of the organizations. I had to use natural data in order to identify a unique donor. The only data I had to work from was Mailing Address, Name, and phone number. I did not have any external identifier such as a social security number.
You can imagine how much fun this was. I found a number of anomalies in the data making it a challenge to reconcile a single instance of an individual. Here are some of the data problems I found:
- Two different people (usually father and son) living at the same address having the same name. If they did not provide a phone number it was difficult to differentiate. If there were two records from a single organization, that also helped identify the fact that there may be two different persons.
- The same individual may have different phone numbers from one database to the next.
- The format of the address was not standardized allowing multiple ways to enter the data.
- One or more databases may have a shortened version of the same name. For example, the name Robert may also be Rob or Bob.
This project made a big impact on my approach to database design and utilization. Well break that out a little bit in future editorials. In the meantime, why not share with us experiences you have had working with dirty data, or cleaning it up? Do you have a favorite tool or service providing value in data cleansing? Drop me an Email at btaylor@sswug.org to share with our readers.
Cheers,
Ben
$$SWYNK$$
Featured White Paper(s)
Go Beyond SCOM Monitoring with Foglight for SQL Server
read more)