SelecTViews
With Stephen Wynkoop
Ryan Adams joins the show today, bringing information about disaster recovery and planning. Also on the show: recent news, the SQL Server tip of the day and much more.
[Watch the Show]
Integrating BLOB Data With Normalized Data
As promised, we are moving on today from the topic of database normalization, at least from the editor’s perspective. I will still be sharing additional comments received yesterday.
Today I’d like to return to the Three Non-Votes from David for items he felt should not be included as candidates for the TOP (10) Worst Things to Do in a Database. He likes the idea of combining BLOB data with structured data in the same table or database.
David Writes:
-1 vote for #4 “Embed non-structured data with structured data in the same table and/or database”
This statement is way too broad. Sometimes it is highly desirable to embed non-structured data. In the old days (think pre-SQL 2005) the cost of doing so was much higher than it is today. But now, if you have a Person record, why not have a varchar(MAX) field for Biography? Why not have other non-structured data that is directly related, especially if it frequently needs to be accessed along with the structured data? Agreed: you don’t want to needlessly retrieve large amounts of non-structured data… but by the same token you don’t want to needlessly retrieve large amounts of structured data either. Unless you are just in love with SELECT * FROM … (grin) this should not be a problem.
Editor’s Comments:
One of the key problems with storing BLOB (Binary Large Object) data in a table with structured data is the need to separate that data out when the large object is not required. David rightly points out in his very last line that this shouldn’t be a problem “Unless you are just in love with SELECT * FROM”. SELECT * FROM was another candidate for the TOP 10 Bad Things to Do in a Database.
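A quick illustration of the point (the table and column names here are hypothetical): if a Person table carries a large Biography column, SELECT * drags the BLOB along on every call, while an explicit column list leaves it on disk.

-- SELECT * pulls the large Biography column back every time:
SELECT * FROM dbo.Person WHERE PersonID = 42;

-- An explicit column list touches only the structured data:
SELECT PersonID, FirstName, LastName
FROM dbo.Person
WHERE PersonID = 42;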
Understanding that one size does not fit all, let me share some technical issues I have experienced that made me wish the BLOB data had been stored elsewhere.
1) Large XML Data Killing Performance. I worked on one system where a lot of structured data was stored in a single table. The table contained four XML columns that frequently held large amounts of data. That single table had 2 million records and required 60 GB of space; 99.5% of the data in the table was stored in the XML columns.
I experienced several problems with this design. Re-indexing the table to resolve fragmentation, a nightly task required to maintain the performance of the system, demanded either a huge amount of additional space in the database or a larger TempDB (a sketch of the rebuild follows this list of problems).
We were forced to use simple recovery because transaction log backups took too long.
The system was originally designed to maintain state as the data within the XML was processed through various steps. Columns in the table were updated as each stage of processing completed. As a result, a lot of data was being deleted and re-inserted because it could not fit in the originally allocated space.
Many simple report queries were performed frequently against the normalized data, requiring a number of covering indexes to enable reasonable performance.
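For what it is worth, here is a sketch of the kind of nightly rebuild involved (the table and index names are hypothetical). On SQL Server 2005 and later, the SORT_IN_TEMPDB option moves the sort work for the rebuild out of the user database, which is exactly the trade-off we faced between extra database space and a larger TempDB.

-- Nightly rebuild of the fragmented clustered index.
-- SORT_IN_TEMPDB = ON performs the sort in TempDB rather than
-- requiring free space in the user database.
ALTER INDEX PK_ProcessQueue ON dbo.ProcessQueue
REBUILD WITH (SORT_IN_TEMPDB = ON);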
2) Large Binary Data Killing Performance. I worked on another system where a good amount of the core data was replaced daily, or sometimes even more often. The core documents driving the ETL process that performed the data modification were Excel files, PDFs or screen captures of web sites.
The documents were stored in the database as artifacts: point-in-time source data. The database stored these artifacts in the same table as some of the parsed contents of the source and other processing metadata.
The system experienced the same performance and maintenance troubles as a result of the integrated data.
This table required transactions in order to maintain referential integrity with other supporting tables. Because the BLOB data was included in those same transactions, and the records were so large, the system experienced a considerable amount of blocking.
It also consumed a great deal of our most expensive, fastest SAN disk to hold what was essentially archival provenance data.
There are some options open to you in these kinds of situations.
1) Create a table with a one-to-zero-or-one relationship to your normalized table, and store your BLOB data in that table. Since the data is in a separate table, you can place that table on a completely different disk, reducing contention. Separating the data also allows you to maintain it in a separate transaction, reducing contention for the operating data stored in your normalized table.
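Here is a minimal sketch of this option, with hypothetical names and a BLOB_FG filegroup assumed to already exist on a separate disk.

-- The normalized table stays lean.
CREATE TABLE dbo.Person (
    PersonID  int IDENTITY(1,1) PRIMARY KEY,
    FirstName nvarchar(50) NOT NULL,
    LastName  nvarchar(50) NOT NULL
);

-- One-to-zero-or-one table for the BLOB data, on its own filegroup.
-- TEXTIMAGE_ON directs the large object pages there as well.
CREATE TABLE dbo.PersonBiography (
    PersonID  int NOT NULL PRIMARY KEY
              REFERENCES dbo.Person (PersonID),
    Biography varchar(MAX) NOT NULL
) ON [BLOB_FG] TEXTIMAGE_ON [BLOB_FG];

Making PersonID both the primary key and the foreign key is what enforces the one-to-zero-or-one relationship.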
2) Create a different database altogether and store your BLOB data there. This gives you the option of switching that database to the simple recovery model, allowing for a smaller transaction log and removing the need for transaction log backups of the BLOB data.
It also allows transactions against the normalized data to run fully independently.
A third benefit is that you can more quickly index, back up, maintain statistics on, and insert/update/delete data in your normalized tables. De-fragmentation becomes a non-issue, and blocking is alleviated.
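A sketch of this option (the database and table names are illustrative). One caveat worth labeling: SQL Server does not enforce foreign keys across databases, so the link back to the normalized data must be maintained by the application or by triggers.

-- A separate database dedicated to BLOB data.
CREATE DATABASE BlobStore;
GO

-- Simple recovery keeps the log small; no log backups needed here.
ALTER DATABASE BlobStore SET RECOVERY SIMPLE;
GO

USE BlobStore;
GO

-- Keyed to the Person table in the main, fully-logged database.
CREATE TABLE dbo.PersonBiography (
    PersonID  int NOT NULL PRIMARY KEY,
    Biography varchar(MAX) NOT NULL
);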
3) Don’t put the BLOB data into the database in the first place. As David notes, though, since SQL Server 2005 keeping it out isn’t always the better choice. With SQL Server 2008 and the FILESTREAM capability you have a powerful tool that gives you both worlds: disk-based storage and database ACID assurances.
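A minimal FILESTREAM sketch for SQL Server 2008 (this assumes FILESTREAM has already been enabled at the instance level; the file paths and names are hypothetical):

-- The database needs a filegroup declared to contain FILESTREAM data.
CREATE DATABASE DocumentStore
ON PRIMARY
    (NAME = DocStore_data, FILENAME = 'C:\Data\DocStore.mdf'),
FILEGROUP DocStore_fsfg CONTAINS FILESTREAM
    (NAME = DocStore_fs, FILENAME = 'C:\Data\DocStoreFS')
LOG ON
    (NAME = DocStore_log, FILENAME = 'C:\Data\DocStore.ldf');
GO

USE DocumentStore;
GO

-- FILESTREAM tables require a uniqueidentifier ROWGUIDCOL column.
-- The Contents column lives on the NTFS file system, yet still
-- participates in SQL Server transactions and backups.
CREATE TABLE dbo.Artifact (
    ArtifactID uniqueidentifier ROWGUIDCOL NOT NULL UNIQUE
               DEFAULT NEWID(),
    FileName   nvarchar(260) NOT NULL,
    Contents   varbinary(MAX) FILESTREAM NULL
);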
The questions I would ask myself when embedding BLOB data with relational data are as follows:
- Will the relational data and the BLOB data remain essentially static?
- Will integrating the BLOB data cause maintenance of the table or database to exceed its maintenance windows, or degrade performance?
- Do I want to use my most expensive, highest-performance storage resource for this kind of data?
- Do I need to include BLOB data as a part of a comprehensive database backup?
The scenario David presents is of a different nature, and again I agree that the statement may be too broad. A biography could easily belong in a database, and could easily be considered a BLOB (especially if I am writing it). Would I put it into a separate table? Most likely, depending on the database. If I frequently accessed other data about a contact and rarely accessed the biography, I would definitely put the biography into another table. I’d still use the questions above when considering this option.
Readers’ Comments – Database Normalization
Brian Writes:
I’ve been working with relational databases starting with a beta version of DEC’s Rdb, with some Oracle, and a lot of SQL Server (starting with a rolling upgrade from v4.2 to v6.0).
I’m convinced that the design of a good database is half technology and half art. The technical part is knowing how to fully normalize a structure. The art is applying that knowledge judiciously to the situation at hand.
The DBAs I’ve worked with who designed the most usable databases, with the fewest operational issues, are the ones with a lot of non-DBA experience. They’ve also been application developers, business analysts, middle-tier developers, operational support DBAs, etc.
David Writes:
Well, I needed to chime in on this because it is one of the most misunderstood concepts in database design.
I have seen places where they coerce every database to use the same level of normalization, regardless of function or need.
One place even mandated third normal form on everything, on the grounds that it would save drive space. What is lost in translation is that every level of normalization has a purpose, and every level takes more time and complexity to use. In one extreme case, they had a foreign key into a table holding just the values Yes, No and Unknown. It caused all sorts of problems because the developers there did not understand the concept of a correlated subquery.
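To make that concrete, here is a hypothetical reconstruction of the pattern: a lookup table and foreign key where a simple constrained column carries the same integrity guarantee.

-- The mandated design: a lookup table just for Yes/No/Unknown.
CREATE TABLE dbo.AnswerValue (
    AnswerValueID tinyint     PRIMARY KEY,
    AnswerText    varchar(10) NOT NULL  -- 'Yes', 'No', 'Unknown'
);

CREATE TABLE dbo.SurveyResponse (
    ResponseID    int     PRIMARY KEY,
    AnswerValueID tinyint NOT NULL
                  REFERENCES dbo.AnswerValue (AnswerValueID)
);

-- The simpler alternative: a CHECK constraint provides the same
-- integrity without the extra table and a join on every read.
CREATE TABLE dbo.SurveyResponseSimple (
    ResponseID int     PRIMARY KEY,
    Answer     char(1) NOT NULL
               CHECK (Answer IN ('Y', 'N', 'U'))
);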
Normalization is a good thing when used properly, but not very many folks understand how to implement it properly. I was once told by my mentor that a good rule of thumb is to go to third normal form, and then roll back some of the elements to make the design more efficient. While there is some merit in this statement, the real point is that we should not doggedly stick to any level of normalization; we need to normalize to meet the data integrity requirements, balanced against the operational cost of the design in the queries that will be needed to extract or manipulate the data.
I saw a query that needed 14 derived tables in the FROM clause to account for the over-normalization. The developer who wrote it wondered why it took so long: over 2 hours. It also happened to be working with very large sets of records in the source tables. I redesigned the table structure and populated it with the data in second normal form plus. I also did a better job of indexing my design, and the equivalent query ran in 6 minutes with my implementation. The output matched the other system exactly, but my solution was dropped because it was insufficiently normalized. I am so glad I do not work at that company anymore!
Normalization is a powerful tool, but in the hands of folks who do not properly understand how to implement it, or its overall performance costs, it is a data disaster waiting to happen.
Ok, I feel better now.
Thanks to all who have written in. It sure makes things a lot more interesting and beneficial to us all. Comments are always welcome. Send them to me, Ben Taylor, at btaylor@sswug.org.
Cheers,
Ben
Featured Article(s)
Troubleshooting SQL Blocking Scenarios with Activity Monitor
A demonstration of how to troubleshoot a SQL Server blocking scenario using just Activity Monitor.
Featured White Paper(s)
Upgrading to Microsoft SQL Server 2008 R2 from Microsoft SQL Server 2008, SQL Server 2005, and SQL Server 2000
More than ever, organizations rely on data storage and analysis for business operations. Companies need the ability to deploy… (read more)
Featured Script
dba3_DBMSVolumeConfigurationComplianceDocumentor
Checks a DBMS for file location compliance with respect to standards recorded in a table (where different general configuration… (read more)