Editorials, Ethics

Statistics Can Lie, Big Data Ethics and More

There were some excellent comments yesterday on the post about statistics — and some great references.

Statistics can produce alternative facts. When you can’t agree on the facts, you have no basis for a rational decision. And once you’re willing to make decisions based on alternative facts, you can tweak the statistics to tell whatever story you want, since there are so many ways to slice the numbers.
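To make that concrete, here is a minimal sketch, using entirely invented numbers, of how the very same data can tell two opposite stories depending on how it is aggregated. This is the classic Simpson’s paradox:

```python
# All numbers below are invented for illustration only.
# Approval counts for two fictional methods (A and B), split by group.
data = {
    # group: (approved_A, total_A, approved_B, total_B)
    "group_1": (80, 100, 9, 10),    # A: 80%, B: 90%  -> B looks better
    "group_2": (2, 10, 30, 100),    # A: 20%, B: 30%  -> B looks better
}

def rate(approved, total):
    return approved / total

# Within every single group, method B has the higher approval rate...
for group, (a_ok, a_n, b_ok, b_n) in data.items():
    assert rate(b_ok, b_n) > rate(a_ok, a_n)

# ...but pooled across groups, method A wins decisively.
a_ok = sum(v[0] for v in data.values())   # 82 approved
a_n  = sum(v[1] for v in data.values())   # of 110
b_ok = sum(v[2] for v in data.values())   # 39 approved
b_n  = sum(v[3] for v in data.values())   # of 110

print(f"A overall: {rate(a_ok, a_n):.0%}")  # 75%
print(f"B overall: {rate(b_ok, b_n):.0%}")  # 35%
```

Neither summary is a fabrication: “B is better in every group” and “A is better overall” are both arithmetically true of the same table. Which one gets presented is a choice, and that choice is exactly where the story gets tweaked.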

Heck, Mike Ditka even showed up with a quote…

I’m not sure what to do about all of the “change the information or data or technique to support your goals” stuff.  What I have been seeing more and more is that an assumption is made before any genuine interest in what the information actually says: an initial hypothesis is formed, then the data is re-analyzed to find support for that hypothesis rather than to test it.
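As a toy illustration of that pattern, here is a short sketch using pure simulated noise, no real data at all. Slice a dataset enough ways and some slice will appear to “support” almost any pre-formed hypothesis:

```python
# A sketch of "hypothesis first, supporting slice second."
# The data here is simulated coin flips -- there is genuinely
# nothing to find, yet a cherry-picked segment still stands out.
import random

random.seed(42)  # fixed seed so the run is repeatable

# 1,000 fair coin flips: the true rate is 50%, full stop.
flips = [random.random() < 0.5 for _ in range(1000)]

# "Re-analyze": carve the data into 20 arbitrary segments and keep
# the most extreme one, the way a motivated analyst might.
segments = [flips[i:i + 50] for i in range(0, 1000, 50)]
rates = [sum(seg) / len(seg) for seg in segments]

overall = sum(flips) / len(flips)
print(f"overall rate:        {overall:.2f}")
print(f"best segment's rate: {max(rates):.2f}")
print(f"worst segment's rate: {min(rates):.2f}")
```

The extreme segments will always sit further from 50% than the overall rate does, so reporting only the segment that fits the hypothesis, while quietly discarding the rest, manufactures evidence out of noise. That is why “fact-based” has to include the slices that cut against the hypothesis too.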

It’s not like this is criminal (in most cases), but it’s counter, I believe, to what we should be fighting for as data folks.  I think it’s critical that we work hard to flap our respective capes in the wind in favor of full information and fact-based analysis.  To me, “fact-based” means the evidence both in favor of, and counter to, the hypothesis.

It’s all really easy to say that we need to protect these processes, but the realities of time limitations and data limitations can be very challenging, to say the least.  My original issue came from the editorial about the data sets that are being made public.  In some cases they’re coming from single sources, and when a single source supplies things like reporting and commentary, the result is inherently open to being presented from one viewpoint or another.  That’s the easy case to see.

But the data itself, from sources that we’re responsible for, can also be “slanted.”  I do believe it’s increasingly going to be on us collectively to recognize incomplete data and hidden assumptions and point them out to those seeking information.  That means data sources, data sets, modifications made to the data, and everything those items entail.

I do get how big of a challenge that is.

However, if we don’t do the hard work to make sure full pictures are presented, we run the danger of invalidating all the hard work put into reporting and analysis, and the overall trust in the conclusions that are drawn.  Sure, once or twice a biased result set will make everyone smile.  At some point, though, someone will speak up and question the information.  Once people figure out that the basis of the decisions and conclusions is flawed or biased or incomplete, the damage will have been done.

It’s critical that we pay attention to, and point out, the pieces and parts of data that are used in our systems.  At the very least we need to state the basis of the data and its assumptions.  Transparency is where it’s at.

It’s not like it’s not going to happen (the molding of data to support presumptions).  It’s a fact of life.  I just think we have to be very careful to be the protectors of transparency and completeness in every way we can.  Essentially, our collective industry depends on it.

We’re not fabricators; we’re analysts and database administrators and database architects and… We have to fight to maintain that credibility.