Editorials

The Cost of Sharding

I have talked a lot about using a sharded database structure as a method to extend the performance and/or the capacity of a database. Sharding does not come without additional cost. Today I want to look at some of the things contributing to additional cost.

If you use database sharding that requires additional instances of your database engine. Those additional instances can have additional license fees for both the engine, and the host on which the database instance operates. If you are running MySql on Linux, the additional cost may be limited to none.

Regardless of licensing fees, you still have to purchase additional hardware. Even if you are using virtual machines, the resources being utilized by each instance add cost to the hosting hardware.

Redundancy is an additional cost in resources and disk consumption. When you shard, each separate instance requires its own copy of reference data. The fact that no data is shared across different shards directly, meaning you can’t write a query that accesses multiple shard instances, means you have to duplicate all reference data if it is needed for joins. This has an additional cost in disk space, and in the overhead to configure, and monitor replication of reference data. If any shard can create new reference data, then you have multi-directional replication, resulting in more overhead, and thus, more cost.

Finally, there can be a cost to application performance. When a query must be executed on more than one shard, because the desired data is not contained on a single instance, there is overhead in gathering and pulling everything together. Those of you using parallel programming patterns recognize this may result in faster actual times. But, it requires more resources somewhere. A parallel query, to really be parallel, must have more than one Core or CPU.

Welcome to sharding. It may not always be the solution you require. But, it is nice to have when the need arises.

Cheers,

Ben