Azure ML has a new tool just out for datasets: monitoring them for data drift. Not the latest fad in racing, but rather the changing of data over time. This is quite important to be aware of on a few different levels as you work with datasets that track behavior (more on that later this week) and usage, along with IoT.
Why?
Because these datasets are subject to morphing, to “sliding” or changing: drifting. I think, too, that they’re subject to data maturation, but that’s something that’s really tough to automate unless we’re looking for it collectively and solving for it.
Back in college (insert yet another old dude sigh here), I remember my statistics professor talking about the fact that stats can be made to say pretty nearly whatever you need them to. This really stuck with me. It’s quite true in many cases.
However, I’ve noticed, too, that the value of data can get fuzzier over time: averages become soft around the edges because there is SO MUCH information behind the numbers. The values become less useful because they account for so much history that they’re too generalized to tell you anything specific.
I call this data maturation, and I think it could really benefit from the whole concept of data drift. In the video I was watching about the new feature (check it out here, on YouTube), they talk about data drift and how it pertains to data modeling and machine learning.
The short version is that you can use this data drift monitoring, along with time series analysis, to review your information in Azure or SQL Server and see when it starts moving in one direction or another and deserves another look. That could mean your original models need to be reviewed or retrained, or that something is up as the data slowly shifts.
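If you want to see what wiring that up looks like, here’s a minimal sketch using the Azure ML Python SDK. It assumes the azureml-datadrift package, an already-configured workspace, and two registered datasets; the dataset and compute names are hypothetical, and the target dataset needs a timestamp column:

```python
# A minimal sketch of setting up a drift monitor with the Azure ML SDK.
from datetime import datetime, timedelta

from azureml.core import Workspace, Dataset
from azureml.datadrift import DataDriftDetector

ws = Workspace.from_config()

# Baseline: the data the model was trained on. Target: what's arriving now.
baseline = Dataset.get_by_name(ws, "sales-training-data")  # hypothetical name
target = Dataset.get_by_name(ws, "sales-scoring-data")     # hypothetical name

monitor = DataDriftDetector.create_from_datasets(
    ws, "sales-drift-monitor", baseline, target,
    compute_target="cpu-cluster",  # hypothetical compute cluster name
    frequency="Week",              # how often to compare target to baseline
    drift_threshold=0.3,           # alert when drift magnitude exceeds this
    latency=24,                    # hours to wait for new data to arrive
)

# Backfill runs the comparison over past data, so you can see how drift
# has developed historically before the monitor was in place.
run = monitor.backfill(datetime.today() - timedelta(weeks=12), datetime.today())
```

Once it’s running, the monitor compares the incoming data to your baseline on the schedule you picked and can alert you when the drift measure crosses the threshold, which is your cue to go take that second look.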
I think, too, that there is another use for this approach, though we probably need the right tools to make the best of it. The idea of data maturation is that more recent information may be more meaningful, or more accurate at representing the “now,” than the whole body of information. For example, if you’re looking at sales figures (to use a really silly-simple example), you might not see a sales trend for something that started with a bang but now has other things coming up on its heels in terms of demand.
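To put some toy numbers on that sales example (everything here is made up), compare an all-time average with an exponentially weighted one that favors recent weeks; the mature, all-time number stays rosy long after the trend has turned:

```python
# Toy illustration with made-up sales figures: a product launches with a
# bang, then demand tapers as competitors arrive.
import pandas as pd

sales = pd.Series([120, 115, 118, 110, 95, 80, 70, 60, 55, 50],
                  index=pd.date_range("2019-01-01", periods=10, freq="W"))

all_time_avg = sales.expanding().mean()    # "matured": every point counts equally
recent_avg = sales.ewm(halflife=2).mean()  # weights recent weeks more heavily

print(pd.DataFrame({"sales": sales,
                    "all_time_avg": all_time_avg.round(1),
                    "recent_avg": recent_avg.round(1)}))
```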
Or perhaps your sensors are picking up temperatures that look alarming, but the reality is that more “stuff” is flowing through the system now, so those readings are actually fine. If you keep judging against the original temperatures, you’ll have a false sense of what’s OK.
Silly examples. But the point is that newer information, if it starts to deviate, may be doing so because it’s more representative of the here and now of your data. “Data maturation.”
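And if you want a quick way to spot that kind of shift yourself, outside of the Azure ML tooling, one rough approach is to compare a recent window of readings against the historical baseline with a two-sample test. A sketch, assuming SciPy and synthetic temperature readings:

```python
# Rough DIY drift check: compare the last 200 sensor readings against the
# historical baseline with a two-sample Kolmogorov-Smirnov test. The data
# is synthetic; in practice you'd pull it from your telemetry store.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
baseline_temps = rng.normal(70.0, 3.0, size=5000)  # temps under original load
recent_temps = rng.normal(74.0, 3.0, size=200)     # temps after throughput grew

stat, p_value = ks_2samp(baseline_temps, recent_temps)
if p_value < 0.01:
    print(f"Distribution shift detected (KS={stat:.3f}, p={p_value:.4f})")
```

Note that a low p-value doesn’t tell you whether the change is a problem or just the new normal; that judgment call is exactly what data maturation is about.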
It’s extremely important, as you work with your SQL Servers and other data tools, that you keep this in mind. Time series analysis, approaches to determining the age-value of information, and the like all need to be supported in the database schemas and solutions being created. Doing it now is a must; it simply isn’t possible after the fact, at some point down the road, to truly rebuild the history behind your data points.
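What does supporting that in the schema look like? At minimum, keep the raw, timestamped observations rather than just the rolled-up numbers. Here’s a minimal sketch, using Python’s built-in sqlite3 as a stand-in for SQL Server (the table and column names are mine):

```python
# Minimal sketch: store raw readings with a timestamp so any window or
# weighting can be recomputed later. If you only store the running
# average, the history behind it is gone for good.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sensor_readings (
        sensor_id   TEXT NOT NULL,
        recorded_at TEXT NOT NULL,  -- ISO-8601 timestamp of the observation
        temp_f      REAL NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO sensor_readings VALUES (?, ?, ?)",
    [("line-1", "2019-06-01T08:00:00", 70.2),
     ("line-1", "2019-06-15T08:00:00", 71.8),
     ("line-1", "2019-07-01T08:00:00", 74.5)],
)

# Because the raw points survive, the "recent" view is just another query.
recent_avg = conn.execute(
    "SELECT AVG(temp_f) FROM sensor_readings WHERE recorded_at >= ?",
    ("2019-06-10T00:00:00",),
).fetchone()[0]
print(f"Average over the recent window: {recent_avg:.1f}")
```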