Editorials

I Have BLOBs in My Database…Now What?

SSWUGtv
With Stephen Wynkoop
Just because you are in the Cloud doesn’t mean you don’t need Disaster Recovery Solutions. Do you have a plan worked out? In this episode of SSWUGtv, Doron Pinhas shares some great tips.
Watch the Show

We Need Your Help!
(…and you can win a prize!) We’re working with Infochimps to learn more about big data projects – YOUR big data projects! We’re looking for your feedback in our quick survey about big data and your CIO. Please take a few minutes and let us know your thoughts – we have free memberships and other goodies in the prize drawing, and we’ll be talking about results here on the site, but we need your help!

> The survey is only open this week
> Take the survey here – it’ll only take a few minutes.

I Have BLOBs in My Database…Now What?
Today Ryan writes in with a question regarding database design with BLOBs, in response to yesterday’s editorial on storing Big Data Types in SQL Server. Ryan writes:

I was interested in your article on Big Data. We are developing a document management system and, for a number of reasons, including the ability to search file content, we have decided to use the Filestream data type to store the files. We have separated the data into two databases: the metadata database, which contains all the data for the document record (security, categories, history, etc.), and the file database, containing the files and a minimal amount of data to link to the record data. Once loaded, our file database will initially be over 350GB and will grow quickly in the first 6 months.

As you mentioned, backup windows are a concern. Tests and rough calculations show that a full backup could take over 24 hours. Another option would be to skip backing up the Filestream data and only back up the associated data. Since the files in the database will never be deleted or updated (for historical purposes), the actual files could be backed up manually by copying any new files to a second location. We haven’t tested this concept yet and were wondering if you had any experience or thoughts on how to handle backing up this large volume of physical data.

Any advice would be appreciated.

This is a great example of why I brought up the question. The two-database solution has merit. As you stated, the database grows continuously: at a rapid pace starting out, and then through the life of the application. In that case, I think table partitioning works really well.

You might consider 52 weekly partitions, partitioning on the week of the year for the current date. You would then also have 52 filegroups, each with its own database file. Since the partition function determines where the data lands, you would only have to back up one partition’s filegroup each week, as the other partitions are stagnant. This assumes no updates occur on the old records.
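As a rough sketch of that setup (the object names, the FileDb database, and the backup path are all hypothetical, and the week number is assumed to be computed by the loading process):

    -- For brevity this sketch uses 4 weekly partitions; the real scheme would
    -- list boundaries 2 through 52 and filegroups FGWeek01 through FGWeek52.
    CREATE PARTITION FUNCTION pfWeekOfYear (int)
    AS RANGE RIGHT FOR VALUES (2, 3, 4);   -- 3 boundaries => 4 partitions

    CREATE PARTITION SCHEME psWeekOfYear
    AS PARTITION pfWeekOfYear
    TO (FGWeek01, FGWeek02, FGWeek03, FGWeek04);

    CREATE TABLE dbo.FileStore
    (
        FileId     bigint         NOT NULL,
        WeekOfYear int            NOT NULL,   -- e.g. DATEPART(iso_week, load date)
        FileData   varbinary(max) NOT NULL    -- a FILESTREAM column would also need
                                              -- a ROWGUIDCOL and FILESTREAM_ON clause
    ) ON psWeekOfYear (WeekOfYear);

    -- The weekly backup then only touches the active filegroup:
    BACKUP DATABASE FileDb FILEGROUP = 'FGWeek04'
    TO DISK = N'X:\Backup\FileDb_FGWeek04.bak';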

Since you don’t allow deletes or updates, your older files are literally stagnant once the week passes. However, if you wanted to allow deletes, you could defer that work to a larger maintenance window by simply marking the data as deleted in your relational database, then going back periodically to purge/back up historical partitions.
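A minimal sketch of the deferred delete (DocumentRecord and its columns are hypothetical names):

    -- Mark the record deleted in the metadata database; the file itself stays put.
    UPDATE dbo.DocumentRecord
    SET IsDeleted = 1,
        DeletedOn = GETUTCDATE()
    WHERE DocumentId = 42;

    -- Later, in a maintenance window: back up the historical partition, then
    -- purge the file rows that were flagged above.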

If you’re not interested in paying for the Enterprise edition of SQL Server, which is what allows you to use table partitioning, you could still do the same thing with the Standard edition. Instead of using a table partitioning function to separate the data into different files, you would have different physical tables. Your relational database would then have to store not only the record id for the required file, but also the table in which it is stored.
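A sketch of that lookup, assuming the metadata row stores a hypothetical StorageTable name alongside the file’s key:

    -- One physical table per period (names are illustrative).
    CREATE TABLE dbo.FileStore_2012W14
    (
        FileId   bigint         NOT NULL,
        FileData varbinary(max) NOT NULL
    );

    -- Fetch a file when the metadata says it lives in FileStore_2012W14.
    DECLARE @StorageTable sysname = N'FileStore_2012W14';
    DECLARE @FileId       bigint  = 42;

    DECLARE @sql nvarchar(max) =
        N'SELECT FileData FROM dbo.' + QUOTENAME(@StorageTable) +
        N' WHERE FileId = @FileId;';

    EXEC sp_executesql @sql, N'@FileId bigint', @FileId = @FileId;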

If you are going to use the separate table idea, it might even make sense to use separate databases instead, and actually shard the data. This would allow your application to really scale in that the databases could literally be served by multiple database servers. That would be something rather easy to roll on your own, and I might consider that idea as well. It certainly simplifies your backup and restore strategies in that each file database is autonomous.
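One way to keep track of which file database holds what is a small shard map in the metadata database; everything below (table, column, and server names) is a hypothetical sketch:

    -- Each shard is an autonomous file database, possibly on its own server.
    CREATE TABLE dbo.ShardMap
    (
        ShardId      int     NOT NULL PRIMARY KEY,
        ServerName   sysname NOT NULL,
        DatabaseName sysname NOT NULL
    );

    INSERT INTO dbo.ShardMap (ShardId, ServerName, DatabaseName)
    VALUES (1, N'FILESRV01', N'FileDb01'),
           (2, N'FILESRV02', N'FileDb02');

    -- Each document record then carries a ShardId, so the application knows
    -- which connection to open when it goes after the file.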

If you are including full text indexing of the file contents in your database, sharded databases become even more practical due to the overhead of maintaining indexes on lots of data.

Indexing is an interesting question for the file database design. If you use separate databases or partitioned tables, then I would consider NOT having a clustered index. I am assuming two columns: a key column and the BLOB contents. In that case, who cares where the data is stored? File data would be the only data in the table, so clustering adds nothing but overhead. A non-clustered index on the key column will suffice, and it may be defragmented with a rebuild without having to move all the file data around. Put the non-clustered index in a separate filegroup so that it can be defragmented easily and won’t be interwoven with the file data in the heap.
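In concrete terms, the layout might look like this (reusing the hypothetical FileStore name; FGFileData and FGIndex are assumed filegroup names):

    -- No primary key / clustered index: the table stays a heap.
    CREATE TABLE dbo.FileStore
    (
        FileId   bigint         NOT NULL,
        FileData varbinary(max) NOT NULL
    ) ON FGFileData;

    -- The key lookup lives on its own filegroup, so a rebuild touches only
    -- the small index pages and never shuffles the BLOB heap.
    CREATE UNIQUE NONCLUSTERED INDEX IX_FileStore_FileId
        ON dbo.FileStore (FileId)
        ON FGIndex;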

The biggest issue you are going to face with your file database(s) is backups. If you use the simple recovery model, you’ll need to manage your recovery interval so checkpoints keep up. If you use full recovery, then you will have to be sure to establish transaction log backups so your transaction logs don’t grow too big.
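For the full recovery case, the moving parts are just these (the database name and paths are placeholders):

    -- One time: put the file database in full recovery and take a full backup.
    ALTER DATABASE FileDb SET RECOVERY FULL;
    BACKUP DATABASE FileDb TO DISK = N'X:\Backup\FileDb_full.bak';

    -- Scheduled (every 15 minutes, say): log backups let the log truncate
    -- instead of growing without bound.
    BACKUP LOG FileDb TO DISK = N'X:\Backup\FileDb_log.trn';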

Having separate databases allows full database backups to run concurrently on the different databases. Imagine the performance boost if these were hosted on separate servers. In fact, the servers themselves wouldn’t have to be as big.

This might also be a great fit for SQL Azure if you sharded the databases. Even if you host the two databases yourself, one design decision I would make is to separate access to the metadata database from access to the file database. Don’t use heterogeneous queries (joining data from one database to the other in a single TSQL query). Find the files you need to look up in one query to the first database; then go get the files in a second query from the file database. SQL Azure and many other database engines don’t support heterogeneous queries. Also, because of this separation, you can replace the file database with a different data storage technique at a later time with much less effort.
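A sketch of the two-query pattern, reusing the hypothetical names from the earlier sketches; each query runs on its own connection, and the application carries the key between them:

    -- Query 1, against the metadata database: find the file’s key and shard.
    SELECT d.FileId, d.ShardId
    FROM dbo.DocumentRecord AS d
    WHERE d.DocumentId = 42;

    -- Query 2, on a separate connection to that shard’s file database:
    SELECT f.FileData
    FROM dbo.FileStore AS f
    WHERE f.FileId = 42;          -- value returned by query 1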

In short, your file data could easily be hosted in a NoSQL database such as MongoDB, in any server file system, in databases supporting BLOB data types, or even in some sort of memory cache. As a result, this is one of those times I would be sure to use a strategy pattern, allowing you to implement different storage types by simply defining a new persistence module unique to each data store. Now your software can grow beyond your current perspective.

That’s a bunch of thoughts running through my mind. There are lots of others out there much smarter than I am; SharePoint DBAs probably have a lot of experience in this area for sure. Why not write in with your thoughts by sending your email to btaylor@sswug.org?

Cheers,

Ben


Featured Article(s)
SELECT from DELETE, UPDATE, and MERGE
A nice and relatively new SQL feature of DB2 for z/OS offers the ability to SELECT from DELETE, UPDATE, and MERGE statements. This was introduced in Version 9 and is similar to the SELECT from INSERT feature that was introduced with DB2 V8. So, before looking at the new V9 features, let’s review the V8 feature.

Featured White Paper(s)
Migrate to SQL Server’s next release with ease and optimize the database workload at any time.
(read more)

Featured Script
dba3_fn_SimpleParseString_Demo
Simple string parse demo, e.g. to parse ‘aaaa bbbb’ into ‘aaaa’ and ‘bbbb’… (read more)