Many of the public data sets you come across, the ones you could use in your own applications, are about the weather in a location, or traffic data. That type of thing. You have to wonder what you’d actually use that information for in your own applications.
I usually eye these types of resources and think that while they sound good, it’s hard to imagine how we would put external reference databases to use with the applications we work with and support. Still, I’ve seen some pretty amazing things built on surprising combinations of information. The most recent had to do with Disney parks and estimating time in line based on whatever factors influence that process.
They found some intriguing things going on. Length of day, for example, and specifically sunset time, drives how people use the parks. School schedules around the country and advance ticket purchase times factor in as well, and….
When you look at the information gleaned from all of the data points, it’s one of those things where you say “yeah, that makes total sense!” This line wait time application uses so many other factors that it’s really amazing to read and think about.
In your business, or those of your clients, what types of data sets would be interesting? I keep coming back to the same question: how do you know what to ask for in your datasets? Or do you just dump whatever you can find into the mix, then look for correlations?
I suspect that’s likely the best thing to do. It seems like the insights and surprises come from things we can’t predict. Once you have the right machine learning solutions and the right datasets in place, letting the tools discern the connections seems to offer the best payoff in analyzing the data.
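For what it’s worth, here is a rough sketch of what “dump it in and look for correlations” might look like at its very simplest. It is not what the Disney wait time folks did; it just computes a plain Pearson correlation for every pair of columns and ranks the pairs. The column names and numbers are made up for illustration, and the helper names (pearson, rank_correlations) are my own.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)


def rank_correlations(columns):
    """Return (name_a, name_b, r) for every column pair, strongest first."""
    pairs = [
        (a, b, pearson(columns[a], columns[b]))
        for a, b in combinations(columns, 2)
    ]
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)


# Made-up illustrative numbers, not real park data.
columns = {
    "wait_minutes": [30, 45, 50, 70, 80],
    "sunset_hour": [17.5, 18.0, 18.5, 19.5, 20.0],
    "rain_mm": [2, 0, 5, 1, 3],
}

for a, b, r in rank_correlations(columns):
    print(f"{a} vs {b}: r = {r:+.2f}")
```

A scan like this only picks up straight-line relationships, and a strong correlation still doesn’t tell you why two things move together. The more columns you throw in, the more pairs will look related purely by chance.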
Talking through this with myself (hey, it happens), how do you figure out what to feed the beast, and what the best use of resources is going to be for your unique situation? You can’t really just keep dumping in data and hoping for connections.
And how do you know, when you’re using public information, whether it’s complete and provides the insights you are most interested in? And, lastly, how do you stay aware of any bias in the data set? It could be political, it could come from the analysis, or it could simply be a matter of which data is included and which is not. Any of these could have a dramatic and fundamental impact on the way the data is applied.
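As a starting point on the completeness question, here is a minimal sketch of two basic checks you might run before trusting a public dataset: which dates have no records at all, and how the records are spread across groups. The records, the region field and the function names are hypothetical, and passing checks like these doesn’t prove a dataset is unbiased; it only surfaces the obvious holes.

```python
from collections import Counter
from datetime import date, timedelta


def missing_dates(observed, start, end):
    """Dates in [start, end] that have no record at all."""
    seen = set(observed)
    total_days = (end - start).days + 1
    all_days = (start + timedelta(days=d) for d in range(total_days))
    return [day for day in all_days if day not in seen]


def group_shares(groups):
    """Fraction of records per group, to spot lopsided coverage."""
    counts = Counter(groups)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


# Hypothetical records: (observation date, region).
records = [
    (date(2024, 1, 1), "north"),
    (date(2024, 1, 2), "north"),
    (date(2024, 1, 4), "north"),
    (date(2024, 1, 5), "south"),
]

gaps = missing_dates(
    [day for day, _ in records], date(2024, 1, 1), date(2024, 1, 5)
)
shares = group_shares(region for _, region in records)

print("missing dates:", [d.isoformat() for d in gaps])
print("share of records by region:", shares)
```

What counts as “lopsided” depends entirely on what the data is supposed to represent, so the shares are something to eyeball against what you know about the real world rather than a pass or fail test.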
What got me started on this was reading about some public datasets that Microsoft is involved with publishing. (Here’s a link)
Have you used public datasets? What did you look for in qualifying them? Were you successful in applying large dataset tools to your own applications, or did they just confirm things you felt you already knew?