Reverse Sharding Data
When you shard data, you pick a differentiation factor in the data that determines how it is distributed. Today I am going to finish up this series on merging data from distributed data stores into a single data store through replication. The issue we have been discussing is how to generate primary keys in remote data stores so that the data can be replicated without conflicting with data from another regional data store.
I have used a few of these techniques myself and read about a number of others. Some were also shared by our readers yesterday. Let's see where we end up. Here are some methods you might use.
Use a GUID. GUID is short for Globally Unique Identifier. GUIDs are often generated using hardware and software characteristics to produce values unique to each computer. Some people don't like GUIDs as keys because they take 16 bytes, potentially reducing performance. I ran a test four years ago on a 32-bit machine and found that the performance advantage of an integer key over a GUID cannot be perceived until you get into millions of rows. On today's 64-bit machines I would imagine the difference to be even smaller.
Since MS Replication requires a GUID column anyway, why not use it for the primary key as well?
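As a sketch of what a GUID primary key might look like (the table and column names here are hypothetical, and this assumes SQL Server 2005 or later where NEWSEQUENTIALID() is available):

```sql
-- Hypothetical table using a GUID primary key.
-- NEWSEQUENTIALID() generates ordered GUIDs, which reduces
-- index fragmentation compared to the random values from NEWID().
CREATE TABLE dbo.Customer
(
    CustomerID   UNIQUEIDENTIFIER NOT NULL
        CONSTRAINT DF_Customer_ID DEFAULT NEWSEQUENTIALID()
        CONSTRAINT PK_Customer PRIMARY KEY,
    CustomerName NVARCHAR(100) NOT NULL
);
```

If memory serves, merge replication wants a uniqueidentifier column marked ROWGUIDCOL on each published table; marking your GUID primary key column with ROWGUIDCOL should let replication reuse it rather than adding a second GUID column.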
If you prefer identity values, there are a couple of techniques for implementing identity keys (auto-numbered sequences from the database engine). First, you can use a composite key consisting of a row identifier, which may be duplicated across regions, combined with a region value; together the two form a unique identifier.
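A minimal sketch of the composite-key approach (table and column names are my own invention):

```sql
-- Hypothetical composite key: the IDENTITY value may repeat across
-- regions, but the pair (RegionID, OrderID) is globally unique.
CREATE TABLE dbo.RegionalOrder
(
    RegionID  INT NOT NULL,              -- fixed per regional database
    OrderID   INT IDENTITY(1, 1) NOT NULL,
    OrderDate DATETIME NOT NULL,
    CONSTRAINT PK_RegionalOrder PRIMARY KEY (RegionID, OrderID)
);
```

Each regional database inserts rows with its own RegionID, so identity values can collide freely across regions without ever colliding in the merged store.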
Another technique for using identity-assigned values without duplication is to set the identity seed at a different number for each region. Region 1 may start at 1,000,000, Region 2 at 2,000,000, and so on. As you can see, this has the potential to run out of unique keys because it greatly reduces the number of values available to any single database. Using a BIGINT data type you could use a larger distribution factor (say 10,000,000 or 100,000,000) and still have reasonable performance on today's 64-bit machines.
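The seed-per-region technique is just a different IDENTITY declaration in each regional database; a sketch with hypothetical names:

```sql
-- Region 1 database: identity values start at 1,000,000.
CREATE TABLE dbo.Customer
(
    CustomerID   BIGINT IDENTITY(1000000, 1) NOT NULL
        CONSTRAINT PK_Customer PRIMARY KEY,
    CustomerName NVARCHAR(100) NOT NULL
);

-- The Region 2 database would instead declare:
--   CustomerID BIGINT IDENTITY(2000000, 1) NOT NULL ...
-- leaving each region a block of 1,000,000 keys before collision.
```

With BIGINT you could widen the per-region block to 10,000,000 or more and still have an enormous number of regions available.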
Another method is to assign remote keys centrally. If you assign primary keys from your centralized database using a sequence or even a key table, each regional database makes a request of the central database for key assignment, assuring unique keys across all databases. With centralized key assignment you are not restricted to numeric sequences; the central database could also assign GUIDs, giving you 100% unique keys.
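One way to sketch centralized assignment, assuming SQL Server 2012 or later where SEQUENCE objects are available (the object names are hypothetical; on earlier versions a one-row key table updated inside a transaction serves the same purpose):

```sql
-- Central database: one sequence per replicated table.
CREATE SEQUENCE dbo.CustomerKeySequence
    AS BIGINT
    START WITH 1
    INCREMENT BY 1;
GO

-- Regional databases call this procedure (for example over a
-- linked server) to obtain a guaranteed-unique key.
CREATE PROCEDURE dbo.GetNextCustomerKey
    @NextKey BIGINT OUTPUT
AS
    SET @NextKey = NEXT VALUE FOR dbo.CustomerKeySequence;
GO
```

The trade-off is a network round trip to the central server for each key (or block of keys), in exchange for never worrying about collisions.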
Another interesting benefit of centralized key assignment is that you can assign keys to your data before persisting it in your database. This can result in a performance boost because you don't have to save parent records one at a time just to get a database-assigned key. Instead, you can assign all parent and foreign key values up front and persist everything with fewer round trips to your database.
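On SQL Server 2012 or later you can reserve a whole block of keys in one round trip with sys.sp_sequence_get_range; a sketch assuming a hypothetical central sequence named dbo.CustomerKeySequence:

```sql
-- Reserve 100 keys in a single call, then assign parent and
-- foreign key values in application memory before saving.
DECLARE @FirstKey SQL_VARIANT;

EXEC sys.sp_sequence_get_range
     @sequence_name     = N'dbo.CustomerKeySequence',
     @range_size        = 100,
     @range_first_value = @FirstKey OUTPUT;

-- The application now owns keys @FirstKey through @FirstKey + 99
-- and can insert parents and children together in one batch.
```

Because the keys are known before any INSERT runs, parent and child rows can be written in a single batch instead of a save-parent, read-key, save-child loop.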
That pretty much wraps up this topic of merging data from multiple data stores. We’ll pick up a new topic tomorrow. Feel free to write in with topics you’d like to see discussed in our editorials by writing to btaylor@sswug.org.
Cheers,
Ben
SSWUGtv
With Stephen Wynkoop
Did you miss our show yesterday? Stephen says, "before you go and buy the Surface tablet, listen to Steve’s thoughts on it — it might surprise you!"
Watch the Show
Featured Article(s)
Dealing With Sparse Date Data
Generating a row per day within a specified date range is often necessary when you need to create a report showing daily results for each day in the range, regardless of whether there was any activity on a particular day.
Featured White Paper(s)
Top 10 Tips for Optimizing SQL Server Performance
(read more)
Featured Script
Save / append logfile (ASCII)
This generic proc appends/creates an ASCII (log) file to an alternate place…. (read more)