Distributed Databases – Jeremiah Responds

Be careful what you ask for, you just might get it. Well, today I have a real treasure to share. I have been looking into alternative data stores myself for a few months due to my engagement in systems that require great scale as well as economy.

While this does not preclude distributed data in structured storage such as a relational database, many of the current economical distributed data stores are not relational.

Personally, I am a huge fan of MS SQL Server, having worked with it for two decades. But the current MS tools for building a distributed database are not economical; they are cost prohibitive for small and even some medium-sized businesses. Yet in the last ten years I have worked with three companies, each with fewer than 50 employees, that really need a distributed database system that can scale out economically and provide failover.

Let me say here that I highly value your responses to editorials. Today I have the opportunity to share a gem with you from Jeremiah.

Jeremiah Says:

Distributed data and NoSQL are interesting because they tend to get lumped into one bucket, even though there is no "standard" NoSQL database. I’ll hit up your questions as you listed them:


Do these cost effective tools have the performance and failover characteristics we have come to expect?

This depends on the database you’re talking about. However, Hadoop’s HDFS, Cassandra, and Riak all use a redundant system of reads and writes to ensure data consistency. Every time a record is written to disk, it’s written on three separate systems (the number is configurable).

When we’re reading from the database, we can tell the cluster how many separate servers have to respond before a read is considered successful. The same goes for writes, of course. Out of the box, these databases give us better failover recovery than SQL Server does. If a server fails, we remove it from the cluster and add a new server in. There are a few things that you have to do in the background, but that can be handled fairly simply once a process is in place.
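To make the read and write counts above concrete, here is a minimal sketch in plain Python. It is a toy, in-memory model and not the API of Cassandra, Riak, or HDFS; the names `Replica`, `Cluster`, `write_quorum`, and `read_quorum` are made up for illustration. It assumes three replicas and that each write carries a version number so a read can pick the newest copy.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One hypothetical storage node holding key -> (version, value)."""
    name: str
    up: bool = True
    data: dict = field(default_factory=dict)


class Cluster:
    """Toy cluster: every record is sent to all replicas that are up."""

    def __init__(self, replicas, write_quorum, read_quorum):
        self.replicas = replicas
        self.write_quorum = write_quorum  # servers that must acknowledge a write
        self.read_quorum = read_quorum    # servers that must answer a read

    def write(self, key, value, version):
        acks = 0
        for node in self.replicas:
            if node.up:
                node.data[key] = (version, value)
                acks += 1
        # The write only counts as successful if enough servers acknowledged it.
        return acks >= self.write_quorum

    def read(self, key):
        responders = [node for node in self.replicas if node.up]
        if len(responders) < self.read_quorum:
            raise RuntimeError("not enough replicas responded")
        answers = [node.data[key] for node in responders if key in node.data]
        if not answers:
            return None
        # Return the value with the highest version among the responders.
        return max(answers, key=lambda pair: pair[0])[1]


nodes = [Replica("a"), Replica("b"), Replica("c")]
cluster = Cluster(nodes, write_quorum=2, read_quorum=2)
cluster.write("user:1", "Ben", version=1)

nodes[0].up = False                              # one server fails
cluster.write("user:1", "Benjamin", version=2)   # still succeeds with 2 acks

nodes[0].up, nodes[1].up = True, False           # the stale server returns, another fails
print(cluster.read("user:1"))                    # the newer copy on "c" wins
```

With three replicas, requiring two acknowledgements for writes and two answers for reads means every read overlaps at least one server that saw the latest successful write, which is why the stale copy on server "a" does not win.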

Is this a technology that is coming to the Small to Medium Business, or is it only for the Massively intensive uses?

This is a technology that every business should be considering. Distributed databases provide a lot of flexibility in how we work with data. Hadoop MapReduce makes it possible to do serious number crunching on a massive number of commodity servers. You don’t have to have the volume of Amazon, Yahoo, or eBay to take advantage of distributed data processing.
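For readers who have not seen the MapReduce pattern, the sketch below shows its three steps (map, group by key, reduce) as a word count in plain Python. It runs in a single process and does not use Hadoop's actual API; the function names `map_phase`, `shuffle`, `reduce_phase`, and `run_job` are hypothetical. In a real cluster, the map and reduce calls would be spread across many servers.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = run_job(["scale out not up", "scale out economically"])
print(counts["scale"], counts["out"], counts["up"])
```

Because each line is mapped independently and each key is reduced independently, the work can be split across as many machines as are available, which is what makes the approach a fit for commodity servers.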

Are they secure?

They are securable, but security isn’t always the top priority. These products are being built to solve specific problems, not the general problems that relational databases are solving. Some databases have security built in, some need to be secured through external means. Like any piece of software, it’s possible to secure it. The bigger question is – what are you securing and who are you securing it from?

How do they compare in the cost per performance to other storage methods?

Typically, you don’t use a SAN with a distributed/NoSQL database. That takes storage costs down to about 1/10th of what you’d need to house the data in a SAN.

What kind of skills are going to be needed to support new storage mediums?

This is really broad and it depends on the database that you’re talking about. That being said, I think similar skill sets will be important for both the relational DBA and the non-relational DBA. The mentality and approach to problem solving are more important than the specific skills being used.

What tools are available to mine data from these kinds of storage? Is it all custom code?

Again, this is hugely dependent on the underlying database. With Hive you can use HiveQL, which is very similar to SQL-92 syntax. Other databases have their own query formats, but the techniques are all very similar.

Are ETL Tools available or is this custom code?

A few tools are starting to appear on the market, but it still seems like most code is custom. The industry is still very young. As time goes on I think that more and more tools will support these databases.

Can I live without my relational database system or cubes?

That really depends on your system. There are a number of businesses that are moving various pieces of functionality over to non-relational/NoSQL/distributed databases. It depends entirely on what you’re doing with the data. However, I think that the relational database has an important role to play in business and isn’t going to go away.

How does all of this interact with Cloud implementations?

I think that’s the subject of an entire article if not a book.

Thanks Jeremiah for your detailed response. If there are others out there with experience in this area, it would be great if you’d share your experience. Send your comments to btaylor@sswug.org.

Cheers,

Ben