Editorials

Big Data Acquisition and Retention

In response to my editorial on designing a data warehouse that is statistically relevant, Sarita asks, “what do you mean by statistically relevant or maintain data? Are you referring SQL server statistics or mathematical statistics?” That is a great question. It looks like I wandered around too much in the editorial and never clarified the primary concept.

If you have a data warehouse whose purpose is to predict when you need to take some action, how current does the data need to be for you to make that decision?

Take an auto insurance actuarial program. To be statistically relevant, a model does not require all of the accidents that happened in the last hour. I would venture to say that even all the accidents in the last day would not alter the findings. It is more important to have data points that help predict loss, such as age, gender, location, commuting distance, health, etc. A company insuring a higher percentage of teenage drivers would, statistically, be assuming more risk.

The point is that most decision engines that crunch large amounts of data to find trends or indicative statistics do not require the most recently acquired data. I would guess there are very few cases where a warehouse benefits from data newer than the last 24 hours. I’ve worked with some where data current as of the last month is adequate.
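
To make that concrete, here is a rough sketch in Python of why one more day of data barely moves a long-run estimate. Everything in it is made up for illustration: the claim probability, the number of policies, the one-year history, and the function names are all my assumptions, not figures from any real insurer. It simulates a year of daily claim counts and compares the estimated claim rate with and without the most recent day.

```python
import random


def simulate_days(n_days, policies_per_day, true_rate, seed=0):
    """Simulate (claims, policies) for each day. All inputs are hypothetical."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    days = []
    for _ in range(n_days):
        # Each policy independently has a claim that day with probability true_rate.
        claims = sum(1 for _ in range(policies_per_day) if rng.random() < true_rate)
        days.append((claims, policies_per_day))
    return days


def claim_rate(days):
    """Claims per policy-day across all of the days supplied."""
    total_claims = sum(claims for claims, _ in days)
    total_exposure = sum(policies for _, policies in days)
    return total_claims / total_exposure


# Assumed values: one year of history, 1,000 policies, 0.1% daily claim chance.
history = simulate_days(n_days=365, policies_per_day=1000, true_rate=0.001)

rate_without_latest = claim_rate(history[:-1])  # warehouse is one day stale
rate_with_latest = claim_rate(history)          # warehouse is fully current

relative_shift = abs(rate_with_latest - rate_without_latest) / rate_without_latest

print(f"Rate without the latest day: {rate_without_latest:.6f}")
print(f"Rate with the latest day:    {rate_with_latest:.6f}")
print(f"Relative shift:              {relative_shift:.2%}")
```

With a year of history, the latest day is roughly 1/365 of the exposure, so under these assumptions the estimate only shifts by a small fraction. A shorter history or a sudden change in behavior would make fresh data matter more, which is exactly the kind of rare case where a more current warehouse earns its keep.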

So, the answer to your question is mathematical statistics. That is, unless you are maintaining a warehouse of SQL Server statistics to mine for mathematical statistics.

Cheers,

Ben