Chain of Ownership, Data, and AI

Interesting points about the whole artificial intelligence thought process.  It’s come up before that knowing the chain of ownership, and who did what to a piece of information, is critical to trusting that information.  If you don’t know how numbers were derived, or who modified them since they were calculated, how can you trust them?

I’ve felt this uneasiness about all of the distributed systems – distributed in the sense that person X downloads the data set to the worksheet, does their (unknown) magic with the numbers, then sends along the workbook to others to review and make comments on.  Ultimately, person Y sees something in the data, makes changes, does new graphs, etc.  But as the workbook moves along yet again from there, you’re looking at information that’s been interpreted and modified at least twice (in this example) and might not even realize it.

Now consider that AI, or even just a number of different analytic processes, may be making changes too.  It’s easy to see how you can get pretty nervous pretty fast about the pedigree of the information you’re considering trusting.  How do you make that choice?  How do you know when you should actually be questioning the data, rather than trusting it?

I suppose it’s a bit like the “fake news” and modified news that has been all the rage of late.  (Ducks and covers)

You have to be able to trust what you read and know that it’s for real, that it’s accurate, that it’s the full picture.  Otherwise, the value drops in a big way.  It can actually create MORE work for you as you face evaluating the information and the processing behind it.

How do you do this with AI in the mix?  There are algorithms and all sorts of things messing with the information.  The insights may indeed be spot-on, but is there a point where you don’t know how to troubleshoot issues with them, or how to validate the assumptions and learnings being fed into the collective mind that is working with your data?

John chimed in with “My biggest concern with all these machine learning algorithms is that it is insanely difficult to tell when the algorithms are wrong. We’ve got a grab bag of algorithms yet most of us onprem DBAs wouldn’t know if the output of said algorithm is off.”

Quite a valid point.  It goes right along with troubleshooting when there are so many contributing processing parts – Hadoop, SQL Server, R, Power BI, Excel, AI processes, etc.  All of them could be brought to bear on a single bit of information, and the pedigree of that information is going to be increasingly difficult to ascertain.

Even if we did have some sort of record of what had touched a bit of information, it would be hard to follow the thought processes behind it.  It could be an AI process that leads to another, which is then analyzed in Excel, charted, put into a Word doc and interpreted there by a human.
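To make the "record of what had touched a bit of information" idea a little more concrete, here is a minimal sketch in Python.  Everything in it is an assumption for illustration: the ProvenanceLog class, its field names, and the sample actors are hypothetical, not any product's actual feature.  Each entry stores a hash of the data as it looked after that step, plus a hash of the previous entry, so a later edit to the history shows up when you verify the chain.

```python
import hashlib
import json
from dataclasses import dataclass, field


def _hash_entry(body: dict) -> str:
    # Hash a log entry deterministically (sorted keys give a stable JSON string).
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


@dataclass
class ProvenanceLog:
    """Hypothetical append-only record of who or what touched a piece of data."""
    entries: list = field(default_factory=list)

    def record(self, actor: str, action: str, data: str) -> dict:
        # Link this entry to the previous one so the history cannot be quietly rewritten.
        prev_hash = self.entries[-1]["entry_hash"] if self.entries else ""
        entry = {
            "actor": actor,
            "action": action,
            "data_hash": hashlib.sha256(data.encode("utf-8")).hexdigest(),
            "prev_hash": prev_hash,
        }
        entry["entry_hash"] = _hash_entry(entry)
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        # Recompute every hash and confirm each entry points at the one before it.
        prev_hash = ""
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "entry_hash"}
            if entry["prev_hash"] != prev_hash or _hash_entry(body) != entry["entry_hash"]:
                return False
            prev_hash = entry["entry_hash"]
        return True

    def pedigree(self) -> list:
        # Human-readable list of every step, oldest first.
        return [f'{e["actor"]}: {e["action"]}' for e in self.entries]


# Illustrative data only: the workbook example, with an AI step in the middle.
log = ProvenanceLog()
log.record("person X", "downloaded data set to worksheet", "100,200,300")
log.record("AI process", "adjusted values", "100,210,300")
log.record("person Y", "changed numbers and redid graphs", "100,210,310")
for step in log.pedigree():
    print(step)
print("history intact:", log.verify())
```

Note what this does and doesn't buy you.  It tells you who touched the data and in what order, and whether that history has been altered.  It says nothing about why a person or an algorithm made a given change, which is the harder problem.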