Editorials

Keep it Clean

I’m often surprised by what prompts a topic when I write an editorial. Today, my motivation comes from the hard work going into data cleanup on one of my projects.

Many applications let users enter free-form text for an attribute. They may enter anything they wish, without concern for how it affects other data. Now one of my applications is being translated. All of a sudden, the cost increases each time someone enters a term or phrase just a little bit differently.

Generally, my first question is, “Can this be a relationship?” If we can make this a relationship, it helps us clean up the data and keep it clean by removing duplication. The problem in this case is that users could enter text without referring to the text already in use by others. In short, each user did what they liked best. In fact, they would often change something simply because it wasn’t the way they would have done it. When the text has to be translated, every case of two users saying the same thing just a little bit differently adds to the cost.
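To make the “relationship” idea concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The table names (term, item) and the helper get_or_create_term are hypothetical, chosen only for illustration; they are not taken from my application. The idea is that each phrase lives once in a lookup table, and other rows point to it instead of storing their own copy of the text.

```python
import sqlite3

def build_demo():
    """Create an in-memory database with a hypothetical lookup table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE term (
            term_id INTEGER PRIMARY KEY,
            -- NOCASE makes 'Out of stock' and 'out of stock' the same label
            label   TEXT NOT NULL UNIQUE COLLATE NOCASE
        );
        CREATE TABLE item (
            item_id INTEGER PRIMARY KEY,
            name    TEXT NOT NULL,
            term_id INTEGER NOT NULL REFERENCES term(term_id)
        );
    """)
    return conn

def get_or_create_term(conn, label):
    """Return the id of an existing term, inserting it only if it is new."""
    label = " ".join(label.split())  # trim and collapse repeated whitespace
    row = conn.execute(
        "SELECT term_id FROM term WHERE label = ?", (label,)
    ).fetchone()
    if row:
        return row[0]
    return conn.execute(
        "INSERT INTO term (label) VALUES (?)", (label,)
    ).lastrowid

conn = build_demo()
a = get_or_create_term(conn, "Out of stock")
b = get_or_create_term(conn, "out of  stock")  # same phrase, different typing
c = get_or_create_term(conn, "Backordered")
term_count = conn.execute("SELECT COUNT(*) FROM term").fetchone()[0]
```

In this sketch the two spellings of the first phrase resolve to the same row, so only two terms would ever need to be translated. It only catches differences in case and spacing; anything subtler still needs a person, or a pick list in the user interface.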

The second problem I had was how to clean up the existing data. Collecting the unique instances of user-entered data from the last three months produced thousands of entries, each needing review by a human. The business understood the value of reviewing this data and removing duplicates that were just a little bit different. They even made an extra effort to standardize the way the terms were created, which added huge value for the end consumers of this data. Reviewing and cleaning up these terms took dozens of hours, and the business has found the results to be extremely useful.
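A review like this can be made somewhat easier by grouping entries that differ only in trivial ways before a person looks at them. The following is a rough sketch of that idea, not the process we actually used; the function names and the sample phrases are invented for illustration. It normalizes case, punctuation, and spacing, then reports only the groups that contain more than one variant.

```python
import re
from collections import defaultdict

def normalize(text):
    """Reduce a phrase to a comparison key: lowercase, no punctuation, single spaces."""
    text = text.casefold()
    text = re.sub(r"[^\w\s]", " ", text)  # treat punctuation as spacing
    return " ".join(text.split())

def group_variants(entries):
    """Map each comparison key to its distinct spellings, keeping only real duplicates."""
    groups = defaultdict(set)
    for entry in entries:
        groups[normalize(entry)].add(entry)
    return {
        key: sorted(variants)
        for key, variants in groups.items()
        if len(variants) > 1
    }

# Hypothetical sample data, for illustration only
sample = ["Out of stock", "out of stock.", "Out-of-stock", "Backordered"]
review_queue = group_variants(sample)
```

Here review_queue holds a single group with the three “out of stock” spellings, while “Backordered” is left out because it has no variants. Entries that mean the same thing but use different words will not be grouped, so a human still makes the final call.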

The moral of the story: introduce processes early on to keep data clean as it is created. You never know when dirty data is going to impact you, and it takes a lot more work to clean it up than to keep it clean.

Cheers,

Ben