Programming is hard by Stephan Schmidt

Sharding destroys the goals of your relational database

Sharding does destroy your relational database - which is a good thing. The idea behind sharding is to distribute data across several databases based on certain criteria. This could, for example, be the primary key: all entities whose keys begin with 1 go to one database, those beginning with 2 to another, and so on (often modulo functions on the key are used, or groups based on business data such as customer location or function). Several reasons exist for sharding, the two main ones being better performance and a lower impact of crashed databases: if customers are sharded by name, only people whose names start with S are affected when that shard crashes.
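A minimal sketch of such a routing function, in Python. The shard count and the function names are assumptions made for illustration, not taken from any particular system. String keys use CRC32 here because Python's built-in hash() for strings is not stable between processes.

```python
import zlib

NUM_SHARDS = 4  # hypothetical number of shard databases


def shard_for_key(key: int, num_shards: int = NUM_SHARDS) -> int:
    """Route a numeric primary key to a shard with a modulo function."""
    return key % num_shards


def shard_for_name(name: str, num_shards: int = NUM_SHARDS) -> int:
    """Route a string key: stable hash first (CRC32), then modulo."""
    return zlib.crc32(name.encode("utf-8")) % num_shards


def shard_for_first_letter(name: str) -> str:
    """Group by business data instead: one shard per first letter."""
    return name[:1].upper()
```

With four shards, key 10 lands on shard 2 and key 7 on shard 3. Note that changing the shard count changes where most keys land, which is one reason real setups often add a lookup table or consistent hashing on top.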

Relational databases were the tool of choice for several decades when it comes to data storage. But they do more than store data. Even reading operations can be split into several functions. There are at least three kinds of database read queries:

  • Data graph building queries: With these you get your data out of the database, customers together with addresses etc.
  • Aggregation queries: How many orders have been stored in August, aggregated by product category
  • Search queries: Give me all customers who live in New York

Sharding now does away with the second and third kind of query and reduces databases to data storage. Because the shards are different databases on different systems, you can’t aggregate across them (unlike in a cluster) without custom code, and you cannot search with one query (only with several, one for each database). Databases have led to the notion that search and retrieval are linked and should be dealt with together. Most people think of retrieval and search as the same thing. This has blocked the development of technologies. Sharding, S3, Dynamo and Memcached have changed this perception recently. I’ve written about splitting search and retrieval in “The unholy legacy of databases”. There I quote Rickard of Qi4j fame:
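To make the "custom code across systems" concrete, here is a rough sketch in Python. Two in-memory SQLite databases stand in for shards on different machines; the orders table, its columns and the function names are invented for this example. Aggregation and search both turn into one query per shard plus merging in application code.

```python
import sqlite3
from collections import Counter


def make_shard(rows):
    """Create one in-memory database standing in for a shard."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, category TEXT, city TEXT)"
    )
    db.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    return db


def count_by_category(shards):
    """Aggregation: same GROUP BY on every shard, partial counts merged here."""
    totals = Counter()
    query = "SELECT category, COUNT(*) FROM orders GROUP BY category"
    for db in shards:
        for category, n in db.execute(query):
            totals[category] += n
    return dict(totals)


def search_by_city(shards, city):
    """Search: one query per shard, results concatenated in application code."""
    ids = []
    for db in shards:
        rows = db.execute("SELECT id FROM orders WHERE city = ?", (city,))
        ids.extend(row[0] for row in rows)
    return sorted(ids)


# Orders split over two shards by id modulo 2 (hypothetical data).
shards = [
    make_shard([(2, "books", "New York"), (4, "music", "Berlin")]),
    make_shard([(1, "books", "Berlin"), (3, "books", "New York")]),
]
```

Counts merge easily because they can simply be added up. Averages, distinct counts or sorted and paged results need more care, which is part of why these queries tend to move to a separate system.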

Entities are really cool. We have decided to split the storage from the indexing/querying, sort of like how the internet works with websites vs Google, which makes it possible to implement really simple storages. Not having to deal with queries makes things a whole lot easier.

and have concluded

Free your mind! Storage and search are two different things, if you split them, you gain flexibility.

People have talked about splitting storage and search for some time now. Search engines like Lucene have driven searching out of databases. But the notion of store & search is still prevalent. Sharding, as a mechanism for more performance and lower risk, will move into many web companies, reduce databases to a storage mechanism and drop the aggregation (data warehouse and reporting) and search parts. Those can be better filled by real data warehouse servers like Mondrian and by search services based on Lucene or semantic engines like Sesame. And storage might move from databases to simple storage systems like Amazon Elastic Block Storage or JDBM.
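As a toy illustration of the split, in Python: a store that only knows put and get by key, and a separate index that only knows how to find keys. Both classes are made up for this sketch and are stand-ins for a simple storage and a search service, not the API of Lucene or any other product.

```python
from collections import defaultdict


class Store:
    """Plain key-value storage: put and get by key, nothing else."""

    def __init__(self):
        self._data = {}

    def put(self, key, entity):
        self._data[key] = entity

    def get(self, key):
        return self._data[key]


class SearchIndex:
    """Separate inverted index: maps (field, value) pairs to keys."""

    def __init__(self):
        self._index = defaultdict(set)

    def add(self, key, entity):
        for field, value in entity.items():
            self._index[(field, value)].add(key)

    def find(self, field, value):
        return sorted(self._index.get((field, value), ()))


store, index = Store(), SearchIndex()
customers = {
    1: {"name": "Smith", "city": "New York"},
    2: {"name": "Schmidt", "city": "Berlin"},
    3: {"name": "Jones", "city": "New York"},
}
for key, customer in customers.items():
    store.put(key, customer)   # storage side
    index.add(key, customer)   # search side, fed separately

# Search returns keys only; retrieval is a second, independent step.
hits = index.find("city", "New York")
names = [store.get(key)["name"] for key in hits]
```

Because the index hands back keys only, the storage side can be sharded or swapped out without the search side noticing, and the other way round.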

Thanks for listening, and think about your databases.

Filed under: Amazon EBS, Amazon Elastic Block Storage, Amazon S3, Database, Java, Sharding, Shards

About the author: Stephan has worked as a head of development and CTO. He has 20 years of experience with different technologies, including Java, Rails and Python. Stephan's main fields of interest are maintainability and productivity in software development. All views are only his own.

Comments

The reason for this is the CAP theorem: of Consistency, Availability and Partition tolerance, you can only have two of the three. Traditionally, DBMSs take the first two. Shards mean taking the last two.

Arnon

stephan

@Arnon: I did know about CAP, but didn’t see it in the case of shards. Interesting, thanks.

todd

I think sharding and CAP come in because each shard can independently fail. So if 1 shard out of 10 fails your other 9 are still available, which increases the availability of your overall system at the expense of consistency.

Well, you just need some new layer to let you do summary queries across all your shards… Something like Map/Reduce. I can see LINQ going in this direction in the long term.

stephan

@website: You’re right, as I’ve said “without custom code across”. You always can write new code on top.

“Free your mind! Storage and search are two different things, if you split them, you gain flexibility”

What a powerful concept.

Your post is worth reading just for that quote.

Thanks for other wonderful ideas, too.

stephan

Thanks.
