NoSQL Data Stores, Readers Want More
First a little housekeeping. I think it’s interesting that many of the newer data storage engines are labeled as “NoSQL”. Before these engines came along there have been a number of highly successful NoSQL databases that are difficult to beat for performance.
For example, IDMS or CMS running on the IBM Mainframe platform has a solid history. ISAM files have been around forever. Mid Range computers also have their engines. Object oriented databases such as Cache have been around for decades, and dominate some industries due to their performance.
Enough Ranting. NoSQL is the label that has stuck because most of these engines have been designed to replace SQL Implementations of one sort or another. Interestingly, many of the NoSQL solutions often end up in a Relational data store eventually.
Hadoop is a engine wrapped around the Map/Reduce pattern made popular by Google. You’ll find it is frequently integrated with both relational and NoSQL data persistence.
I found the DBMS2 blog to be an interesting read when looking at different persistence technologies, both Relational and Non-Relational, as well as optimization engines.
One key difference in NoSQL implementations from Relational Databases is the distribution (sharding) of data. Data is not stored in a centralized data engine. As a result, instead of having multiple processors pulling/pushing data from a single store for processing, the processing is distributed to multiple stores based on the distribution of the data.
The evolution of software looks something like this:
- Single machine performs all processing and data persistence
- Processing is distributed and links to a centralized data persistence source
- Data is Distributed, and processing is moved close to the persistence
David Shares with us some of the Characteristics of different NoSQL engines, and how those characteristics may be put to your advantage…:
I’d like to start off by pointing to a blog post I read recently that provides a good overview of the differences between NoSQL databases along with a bit of why these choices were made: http://huanliu.wordpress.com/2011/01/21/dimensions-to-use-to-compare-nosql-data-stores/
The blog post mentions that there are four types of NoSQL Databases: Key-Value “blob”, Column Store, Document Store, and Graph Store. SQL’s Relational Store can reproduce all of these, but not as quickly. For an SQL-er to get the “gist” of these databases, I’m going to write some example queries that fit well with a particular DB type below:
If most of your queries look like this:
select * from myTable
where myKey = 'key'
(Always using a single primary key to access the table)
Then a Key-Value NoSQL DB will give you a performance boost.
If most of your queries look like this:
select MAX(val) from myTable
where person = 'Bob Smith'
(Performing operations across many, many rows with very few columns)
Then a Column Store DB will give you a performance boost.
If most of your queries look like this:
select * from myTable
inner join mySubTable on myTable.subTableKey = mySubTable.subTableKey
inner join mySecondTable on myTable.secondTableKey = mySecondTable.secondTableKey
left outer join myThirdTable on mySecondTable.thirdTableKey = myThirdTable.thirdTableKey
where myTable.name = 'Frank Williams' and mySubTable.location = 'White House'
(Always joining tables the same way every time, and querying as a whole entity)
Then a Document DB will give you a performance boost.
If most of your queries look like this:
select distinct friendsOfFriends.name, friendsOfFriends.homeAddress, friendsOfFriends.relationshipStatus from person as firstPerson
inner join person as friends on firstPerson.friendId = friends.personId
inner join person as friendsOfFriends on friends.friendId = friendsOfFriends.personId
where firstPerson.name = 'John Doe'
(Performing many different kinds of joins to build relationships within/between relatively simple tables)
Then a Graph DB will give you a performance boost.
Key-Value stores are very simple, and it is impossible to reconstruct the power of a relational database within its language constructs. In fact, I think it has just as much to do with Message Queues as it does with Databases, straddling them both, and being grouped with the other “NoSQL” databases only confuses things, as the other three are fully capable of doing everything an SQL Relational Database can do, but with different performance characteristics.
Column-Store DBs, in particular, are the “closest” to SQL, as they are designed for Data Warehousing situations, primarily (large dumps of data into the DB and aggregate reporting operations done on it). MonetDB, for instance, actually implements an SQL interface, too: http://www.monetdb.org/Documentation/SQLreference This means that select * from is a very bad idea on these databases, and insert into should be called once for many records instead of one record at a time.
Document Databases, such as MongoDB, are basically like pre-joined tables with duplicated column references removed (a 1:Many join keeps the one reference on the “1” side and the many references on the “Many” side, while in SQL the values on the “1” side would be duplicated “Many” times). It is possible to perform joins in these databases and do relational work, but the performance advantage is lost in these situations.
Graph Databases are all about joining data. Neo4j, one such graph database, boasts its join performance on the level of millions of joins per second with full ACID compliance (something NoSQL has a reputation for dropping support of). Its downside is ad-hoc querying can be incredibly slow – anything you would ever put into a where clause must be indexed or you’ll perform not a table scan, but essentially a database scan, as the concept of tables itself has essentially been eliminated (though neo4j describes certain usage patterns to reconstruct a relational database, document database, or key-value store).
I’ve worked with SQL and three of these four NoSQL DBs: MS SQL Server, PostgreSQL (SQL), Redis (Key-Value), MongoDB (Document), and Solr (a search-engine server, which I argue is a type of Column Store DB with very advanced querying functions).
For a web app, Redis is good for storing a user’s session object, MongoDB for storing/presenting user-entered data and doing fixed reports (Map-Reduce functions of MongoDB), and Solr for ad-hoc querying (of a very high-level sort, good for end-user requests and figuring out what kinds of reports to write).
Doing, this, however, does have the trade-off of roughly doubling the size of the memory footprint relative to SQL (data in a fixed relationship on MongoDB and then mostly duplicated into a “fuzzy” relationship on Solr) for faster performance (real world metric here: 2ms response time of a document object query [pre-joined and searched by indexed “fields”] from MongoDB and 1ms/6ms response time warm/cold from Solr on anything containing “foo” [the Solr DB containing the primary key of the MongoDB objects, a few values to show in a preview, and a special search column that contains every value concatenated as a string]).
As memory scales faster than processing power, I can see this sort of trade-off becoming more common.
Do you have more questions or have insight or experience to share with our readers? Send your comments to btaylor@sswug.org.
Cheers,
Ben
$$SWYNK$$
Featured Article(s)
Tips for using indexes in SQL Server 2008 (Part 3)
In this article, you can find some helpful tips to performance tune and optimize SQL Server 2008 indexes.
Featured White Paper(s)
Structuring the Unstructured: How to Dimensionalize Semi-Structured Business Data
Written by Interactive Edge
The Business Intelligence industry … (read more)