Editorials

Back in the late ’80s and early ’90s, when data warehousing was a new and growing concept, things were wild and crazy. Even then, using traditional tools such as Cognos, it was important to understand and structure your data well in order to get reliable, performant data mining.

Many data warehousing projects failed because they could not define or acquire the data necessary for mining. Many businesses simply said, “put everything in there, and then we’ll decide what we want.” Usually one of two things happened: 1) they got all of the data in there but never had time to organize it for mining, or 2) the project collapsed on itself because there was too much data available. The industry came up with a term for this: the “data tenement.” The system held a lot of compartmentalized data, but the way it was configured did not serve well as a warehouse.

This was bad because traditional mining tools required some sort of structure in order to find correlations, map trends, and predict probabilities. Even with machine learning, the principle is similar: although machine learning doesn’t require tables or cubes, it still has to be given the data in some consistent, limited form.

What’s really cool is that with Hadoop techniques, you have data prior to schema. The structure is defined as you consume the data, not before you acquire it. Because of this different approach, what was formerly called a data tenement is now considered a gold mine. While it may not be as efficient, much less human intervention is required to find large masses of data or convert them into structured data that many different tools can consume. The results can be analyzed with vector tools such as R, consumed by traditional data warehousing engines, or even fed to a machine learning tool.
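To make the schema-on-read idea concrete, here is a minimal sketch in plain Python rather than any particular Hadoop tool. The file name, field names, and records are purely illustrative: raw, semi-structured events are dumped as JSON lines with no schema defined up front, and a structure is only imposed at the moment the data is read for a specific analysis.

```python
import json

# "Data tenement" stage: dump heterogeneous records as-is, no schema up front.
raw_records = [
    {"ts": "2015-03-01T10:00:00", "user": "alice", "action": "login"},
    {"ts": "2015-03-01T10:05:12", "user": "bob", "action": "purchase", "amount": 42.50},
    {"user": "carol", "action": "purchase", "amount": 19.99},  # no timestamp at all
]

with open("events.jsonl", "w") as f:
    for rec in raw_records:
        f.write(json.dumps(rec) + "\n")

# Schema-on-read stage: which fields matter, their types, and how to handle
# missing values are decided when the data is consumed, not when it is loaded.
def read_purchases(path):
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            if rec.get("action") == "purchase":
                yield {
                    "user": rec.get("user", "unknown"),
                    "amount": float(rec.get("amount", 0.0)),
                }

total = sum(p["amount"] for p in read_purchases("events.jsonl"))
print(f"Total purchase volume: {total:.2f}")
```

A different analysis could project an entirely different schema (say, login counts per user) from the same raw file, which is the point: the raw store stays unstructured, and each consumer shapes it as needed.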

Perhaps now is the time to bring back the data tenement? What do you think? The comments are open for your thoughts.

Cheers,

Ben