
Does the Cloud Force Data Silos?

As we’ve been working with different people’s systems, and, embarrassingly, our own, I’ve noticed a trend.  It’s been bugging me for a while now, but I haven’t really been able to put words to it.  I mentioned in the prior editorial that being multi-platform is important.  Basically, use the right tool for the job at hand.

And then I noticed this post that started to bring it a bit more into focus:  *Data integration is one thing the cloud makes worse*.

This is a real issue.  There is such specialization in the database platform tools (never mind the rest of the options in the hosting space) that you end up cherry-picking to solve a specific problem.  But in that cherry-picking, in many cases, you’re creating a heck of a system of silos.

*Gets out old-database-guy cane…*

In the old days, you’d figure out how to make your database platform of choice do what you needed, or get as close to that goal as possible, and then deploy.  It made absolutely no sense to go putting in a new database engine and learning to support it and use it.  The resource cost just didn’t make sense.  Besides, how many times did we (OK, *I*) preach about standardization as the way to leverage your time and resources?  It was a constant fight.  Heck, there was even plenty of hand-wringing about the woes of SQL Server Express popping up at the department level.

*Puts cane away*

This can be a significant issue, particularly if you embrace platform selection on a task-by-task or application-by-application basis.  It’s so easy to roll out those other resources to answer a specific report or functionality need.  But the side effect is clear: you still have to support all of it.  It’ll make your hiring more challenging (do you know all of these platforms?), and getting assistance and applying best practices will be more difficult.

In all my talk about data pedigree, this portion really gave me pause:

But without a data integration strategy and technology, a single source of data truth is not possible. Systems become islands of automation unto themselves, and it doesn’t matter if they are in the public cloud or not.

We have to push back on the idea that a data “truth” structure can’t be maintained.  It can be done, though it certainly gets more and more challenging as the variables multiply.  Documenting the data flows, making sure sources are known, expected, and audited, and managing how the data is used will all come into play.  Without that kind of thought, it will be quite difficult to really ascertain the “correctness” of the information in those stores.  Developing audits and checks on the information flowing between the pieces will go a long way toward making sure things are OK.
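Just to make that a little more concrete, here’s a minimal sketch of what one of those flow audits could look like.  The flow names, counts, and tolerance are all hypothetical; in practice the counts would be pulled from each platform (SQL Server, RDS, the reporting store, and so on) rather than hard-coded.

```python
# A rough sketch of a flow audit, with made-up flow names, counts, and tolerances.
# Real counts would come from each platform rather than being hard-coded here.

from dataclasses import dataclass


@dataclass
class FlowCheck:
    name: str          # documented data flow, e.g. "orders -> reporting copy"
    source_count: int  # row count observed at the system of record
    target_count: int  # row count observed at the downstream silo
    tolerance: float   # allowed relative drift before we raise a flag


def audit(checks):
    """Return human-readable findings for any flow that drifted past tolerance."""
    findings = []
    for c in checks:
        if c.source_count == 0:
            findings.append(f"{c.name}: source is empty; verify the feed is running")
            continue
        drift = abs(c.source_count - c.target_count) / c.source_count
        if drift > c.tolerance:
            findings.append(
                f"{c.name}: target is off by {drift:.1%} "
                f"({c.target_count} vs {c.source_count})"
            )
    return findings


if __name__ == "__main__":
    # Made-up numbers standing in for counts gathered from each platform.
    checks = [
        FlowCheck("orders -> reporting copy", 10_000, 9_985, tolerance=0.01),
        FlowCheck("customers -> analytics store", 2_500, 2_100, tolerance=0.01),
    ]
    for finding in audit(checks):
        print(finding)
```

Even something this simple, run regularly against the documented flows, at least tells you when a silo has drifted away from the system of record.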

Much easier said than done, of course.  How do you audit a data flow for expected values?  What counts as an outlier that requires attention in the data infrastructure, and what’s a data element that’s crying out for analysis in the normal flow of reporting?
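One rough way to think about that split: treat anything that violates the documented contract for the flow as an infrastructure problem, and treat valid-but-unusual values as candidates for analysis.  A hypothetical sketch, with made-up thresholds and numbers:

```python
# A rough sketch of that split, assuming "needs infrastructure attention" means
# a value that violates the documented contract for the flow, while "worth
# analyzing" means a value that is valid but statistically unusual.

from statistics import mean, stdev


def classify(values, low, high, z_threshold=2.0):
    """Split observations into contract violations and statistical outliers."""
    violations = [v for v in values if not (low <= v <= high)]  # broken feed or bad load
    in_contract = [v for v in values if low <= v <= high]

    outliers = []
    if len(in_contract) > 1 and stdev(in_contract) > 0:
        mu, sigma = mean(in_contract), stdev(in_contract)
        outliers = [v for v in in_contract if abs(v - mu) / sigma > z_threshold]

    return violations, outliers


if __name__ == "__main__":
    # Hypothetical daily order totals: -1 looks like a load problem,
    # 900 looks like a business spike someone should dig into.
    daily_totals = [120, 115, 130, -1, 125, 118, 900, 122]
    broken, interesting = classify(daily_totals, low=0, high=10_000)
    print("Infrastructure attention:", broken)
    print("Worth analyzing:", interesting)
```

That’s nowhere near a full answer, but it shows the two buckets are at least separable if the expected ranges are written down somewhere.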

Very challenging indeed.