A Sharding Lesson in Replication
Here is an idea I gleaned from a comment on a blog post about sharding, which I found through the Cassandra web site. It turns out that NoSQL systems have similar difficulties when it comes to sharding techniques. The post was about handling high scalability for social network sites: http://highscalability.com/blog/2009/10/13/why-are-facebook-digg-and-twitter-so-hard-to-scale.html
The key to sharding is to distribute data across multiple servers so that the work can be spread across those servers as well. For a sharding strategy to be effective in the end, some data is duplicated across multiple shards so that joins and processing can happen locally.
In the comments, a group posted a link to their white paper about optimizing data storage for social network sharding, http://bit.ly/18Gqmx, using a technique they call One Hop Replication.
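To make the basic idea concrete, here is a minimal sketch of hash-based sharding in Python. The function name shard_for and the shard count are my own illustrative assumptions, not something taken from the blog post or the white paper; real systems often use more elaborate schemes.

```python
import hashlib

def shard_for(user_id: str, num_shards: int) -> int:
    """Pick a shard for a user by hashing the user id.

    Hypothetical helper: a stable hash keeps the same user
    on the same shard every time it is looked up.
    """
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Spread a few users across three shards.
users = ["ann", "bob", "cy", "dee"]
assignment = {user: shard_for(user, 3) for user in users}
```

Notice that a plain hash pays no attention to who is friends with whom, which is exactly the weakness the technique discussed here tries to address.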
As I understand their technique, they find natural groupings of individuals and direct them to a server.
When I was in college we made diagrams for social networks in a Psychological Measurement class. We had each person write on a card the names of three friends. Then we took the cards and produced a chart, placing each person’s name in a circle. Then we drew lines from each person to each of the names they wrote on the card. Popular individuals looked like pin cushions with all the lines connecting to them. Powerful relationships were identified when two individuals chose each other.
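The card exercise can be sketched in a few lines of Python. This is only an illustration of the classroom chart, with made-up names and a hypothetical sociogram function: it counts how many lines point at each person and finds the pairs who chose each other.

```python
from collections import Counter

def sociogram(cards):
    """cards maps each person to the list of friends they wrote down.

    Returns (popularity, mutual): how often each name was chosen,
    and the set of pairs who chose each other.
    """
    popularity = Counter(
        friend for friends in cards.values() for friend in friends
    )
    mutual = {
        frozenset((person, friend))
        for person, friends in cards.items()
        for friend in friends
        if person != friend and person in cards.get(friend, ())
    }
    return popularity, mutual

# Made-up example data.
cards = {"Ann": ["Bob", "Cy"], "Bob": ["Ann"], "Cy": ["Bob"]}
popularity, mutual = sociogram(cards)
```

In this small example Bob is chosen twice, making him the "pin cushion", and Ann and Bob form the one mutual pair.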
One Hop Replication uses this social relationship diagramming technique for replication. It identifies the individuals who like each other and puts them on the same server. Individuals who look like pin cushions become bridges to different servers. They are the glue tying multiple small networks together.
All of a bridge individual's data is duplicated on multiple servers. This results in less overhead because only the updated data crosses server boundaries, instead of queries crossing the network for each of the separate social networks to which this individual belongs.
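Here is a rough sketch of how I picture the replication step, in Python. It reflects my reading of the idea and is not the white paper's actual algorithm: the function replica_placement, the home mapping, and the sample names are all assumptions made for illustration. Each person has a home server, and anyone with a friend on another server also gets a copy placed there.

```python
def replica_placement(home, friendships):
    """Decide which servers hold a copy of each person's data.

    home: maps person -> home server number (assumed already chosen
          by grouping friends together).
    friendships: iterable of (person_a, person_b) pairs.
    """
    placement = {person: {server} for person, server in home.items()}
    for a, b in friendships:
        # Copy each friend's data onto the other friend's home server,
        # so a "me and my friends" query never leaves one server.
        placement[a].add(home[b])
        placement[b].add(home[a])
    return placement

# Made-up example: two small groups joined through Bob and Cy.
home = {"Ann": 0, "Bob": 0, "Cy": 1, "Dee": 1}
friendships = [("Ann", "Bob"), ("Bob", "Cy"), ("Cy", "Dee")]
placement = replica_placement(home, friendships)

# Bridge individuals are the ones stored on more than one server.
bridges = sorted(p for p, servers in placement.items() if len(servers) > 1)
```

In this example Bob and Cy come out as the bridges. Reads for any one person and their friends stay on a single server, while a change to Bob's or Cy's data has to be sent to both servers.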
Sharding is a growing art, more in demand as BI mining grows in popularity. It is used extensively in both NoSQL and relational solutions. Perhaps these blogs will provide some insight into data tasks you are doing now or will do in the future.
Be sure to leave comments on how sharding is working for you, or questions you may wish to have answered. You can always send your comments to btaylor@sswug.org.
Reader’s Comments on Encryption
Be sure to read the comments on Encryption by Patrick Townsend on the previous Encryption editorial.
Cheers,
Ben