Editorials

Bad Data – How Big Is the Problem?

Bad data produces bad results. Who can argue with that logic? Yet, for one reason or another, it happens frequently. As data professionals, we do all we can to guard against bad data, using techniques such as check constraints, strongly typed data, relationships, and much more. Still, there seem to be those nagging instances of bad data that creep into our systems and undermine our hard work.
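
To make that concrete, here is a minimal sketch of the kind of defenses I mean. The table and column names are hypothetical, but the mechanics are standard T-SQL:

    -- Hypothetical tables showing typed columns, a relationship, and check constraints.
    CREATE TABLE dbo.Customers
    (
        CustomerID int          NOT NULL PRIMARY KEY,
        Email      varchar(255) NOT NULL
    );

    CREATE TABLE dbo.Orders
    (
        OrderID    int           NOT NULL PRIMARY KEY,
        CustomerID int           NOT NULL
            REFERENCES dbo.Customers (CustomerID),    -- relationship: no orphaned orders
        OrderDate  date          NOT NULL,            -- strongly typed: no '2023-02-31' strings
        Quantity   int           NOT NULL
            CONSTRAINT CK_Orders_Quantity CHECK (Quantity > 0),    -- no zero or negative quantities
        UnitPrice  decimal(10,2) NOT NULL
            CONSTRAINT CK_Orders_UnitPrice CHECK (UnitPrice >= 0)  -- no negative prices
    );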

I had the opportunity to work on a number of systems where bad data was the norm rather than the exception. The storage and data-capture mechanisms were poorly designed, allowing large amounts of bad data to accumulate. For one reason or another, instead of fixing the data and the sources producing it, the application was modified to revise the data during reporting, in an attempt to overcome the underlying issues. Fix upon fix was introduced into the code until it was difficult to discern what the code was originally intended to do.
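
If that sounds abstract, here is a stylized sketch of what those reporting-layer patches tend to look like, again with hypothetical names. Each expression papers over a data problem that should have been fixed at the source:

    -- A report query accreting 'fix upon fix' instead of the data being repaired.
    SELECT
        o.OrderID,
        COALESCE(NULLIF(LTRIM(RTRIM(c.Email)), ''),
                 'unknown@example.com')                 AS Email,     -- patch: blank emails
        CASE WHEN o.Quantity <= 0 THEN 1                              -- patch: bad quantities from an old import
             ELSE o.Quantity END                        AS Quantity,
        CASE WHEN o.OrderDate < '2000-01-01' THEN NULL                -- patch: sentinel dates from a legacy feed
             ELSE o.OrderDate END                       AS OrderDate
    FROM dbo.Orders AS o
    JOIN dbo.Customers AS c
        ON c.CustomerID = o.CustomerID;

Multiply those patches across a few dozen reports and a few years of staff turnover, and the original intent is buried.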

I’ve written about this pattern in the past: effort spent resolving the symptoms rather than fixing the root cause of a problem. Bad data is a big root problem.

I’m wondering how big the problem of bad data really is. How about sharing your opinion by leaving a comment? I’m not looking for folks to expose company trade secrets or skeletons in the closet. If you’ve been working with software for a while, you’re going to run into these kinds of scenarios at some point. So, let’s get some idea of how big the problem really is. Share your experience with us. Perhaps an estimate of the percentage of systems that have REALLY bad data? Is this something that needs more attention?

Cheers,

Ben