Programming is hard by Stephan Schmidt

The unholy legacy of databases

When reading about the status of Qi4j on Rickard’s blog, I stumbled upon this:

Entities are really cool. We have decided to split the storage from the indexing/querying, sort of like how the internet works with websites vs Google, which makes it possible to implement really simple storages. Not having to deal with queries makes things a whole lot easier.

We had the same experience when we developed the SnipSnap wiki application several years ago. We split storage and search, each part with its own Java interface (a component could implement both, of course). This way we could have Lucene, database and in-memory search, and database and file (XML, plain text) storage. We were very flexible with storage and search this way, and people could easily implement different storage backends because they were freed from the search implementation. Rickard seems to have had the same experience:

We have one EntityStore based on JDBM (persistent binary hashmap), one on JGroups (replicated cluster hashmap), one on Amazon S3 (for global storage), and one on iBatis (for RDBMS storage)

So today SnipSnap would easily be able to supply an S3 backend because of the split, whereas others that rely on the storage/search combination have a much harder time supporting a storage-only backend. So they have problems supporting S3 or WebDAV out of the box.
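To make the split concrete, here is a minimal sketch of what two separate interfaces could look like. All names (`Storage`, `Search`, `Wiki` and the in-memory classes) are made up for illustration; this is not SnipSnap’s or Qi4j’s actual API, and the substring search merely stands in for something like Lucene.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical names throughout; not the real SnipSnap or Qi4j API.

/** Storage only knows how to keep and fetch content by name. */
interface Storage {
    void store(String name, String content);
    Optional<String> load(String name);
}

/** Search only knows how to index content and answer queries. */
interface Search {
    void index(String name, String content);
    List<String> find(String term);
}

/** Simplest possible storage: a map in memory. A file or S3 backend would implement the same two methods. */
class InMemoryStorage implements Storage {
    private final Map<String, String> data = new LinkedHashMap<>();
    public void store(String name, String content) { data.put(name, content); }
    public Optional<String> load(String name) { return Optional.ofNullable(data.get(name)); }
}

/** Naive case-insensitive substring search; a stand-in for a real search engine. */
class InMemorySearch implements Search {
    private final Map<String, String> indexed = new LinkedHashMap<>();
    public void index(String name, String content) { indexed.put(name, content.toLowerCase()); }
    public List<String> find(String term) {
        List<String> hits = new ArrayList<>();
        String needle = term.toLowerCase();
        for (Map.Entry<String, String> e : indexed.entrySet()) {
            if (e.getValue().contains(needle)) hits.add(e.getKey());
        }
        return hits;
    }
}

/** The application wires one Storage and one Search together and keeps them in sync. */
class Wiki {
    private final Storage storage;
    private final Search search;
    Wiki(Storage storage, Search search) { this.storage = storage; this.search = search; }
    void save(String name, String content) {
        storage.store(name, content);   // persist
        search.index(name, content);    // make it findable
    }
    Optional<String> read(String name) { return storage.load(name); }
    List<String> find(String term) { return search.find(term); }
}

class SplitDemo {
    public static void main(String[] args) {
        Wiki wiki = new Wiki(new InMemoryStorage(), new InMemorySearch());
        wiki.save("start", "Welcome to the wiki");
        wiki.save("storage", "Storage and search are two different things");
        System.out.println(wiki.find("search"));                    // prints [storage]
        System.out.println(wiki.read("start").orElse("(missing)")); // prints Welcome to the wiki
    }
}
```

With this shape, a new backend only has to implement the two methods of `Storage`; it never needs to know how queries are answered.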

Why don’t more people split the problem of storage into storage and search? After some contemplation on the topic, I think it is perhaps the unholy legacy of databases. Databases make it easy to solve the search/storage problem with only one technology. After 30 years of databases the two problems have merged to the point that most developers think of them as one problem. By splitting the problem again, projects will be freed for better backends and better search solutions. Open Source projects will emerge which address each of the problems better than current databases do.

This of course breaks the DAO pattern and the use of the EntityManager as a DAO replacement; both should be replaced by a Storage and Search pattern. Free your mind! Storage and search are two different things; if you split them, you gain flexibility.

Thanks for listening.

Filed under: Amazon S3, Databases, Java, RDBMS

About the author: Stephan has been working as a head of development and CTO. He has 20 years of experience with different technologies, including Java, Rails and Python. Stephan’s main field of interest is maintainability and productivity in software development. All views are only his own.

Comments

I have tried this sort of thing as well and I really really like the concept. The only thing that stops my poor, slow, small brain from really seeing it through to its logical conclusion is the sometimes-requirement to join attributes from one Thing stored with one Storage mechanism with the attributes of another Thing stored with another Storage mechanism. Do you have suggestions here?

Obviously, a Compass/Lucene-type search handles a huge number of cases–people tend to like to search by keywords, and so a coarse-grained search/locate strategy like that makes a lot of sense. But in some of the applications I work on, careful targeted queries that join bits of two entities together–a classic SQL join–are also needed.

Have you found a convenient way to expose a *common* SQL-like query mechanism across items that use different Search implementations?

stephan

@Laird: I’m not sure if this is possible. You’re giving stuff up when you try this approach. But you also gain something. I guess it depends on the application you have. If flexibility in the backend is needed, then this is a good approach. If an RDBMS is all you need, then this approach is overengineered.

Have you read the transaction apostate paper and the Amazon Dynamo paper? Sometimes it isn’t even possible to have all the data on one machine in order to join it.

http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html

http://www-db.cs.wisc.edu/cidr/cidr2007/papers/cidr07p15.pdf

But if you have new insights and a solution to the problem of data joining of disparate stores, please drop me a line.

I surely don’t. :-) My main problem is that I need both–the ability to join LDAP information (Student information, say), as it happens, with additional information relevant to it from a database (classes they’re taking).

The decidedly brute-force, ugly, smelly, hairy, nasty and yet strangely cool approach I took at one point was to make a kind of query builder that, in conjunction with some simple lookup and filtering facilities implemented on Storage instances–but again, not true searches–whittled down the sets of items from each Storage to be combined (so the Student Storage was able to filter using a simple where clause/predicate, and the Class Storage was able to do the same). Then I loaded those sets into a temporary database (H2–obviously could have been anything) and did the more complicated joining there (minus the simple where clauses/filters that were used to get me the candidate sets). (A tip of my hat to a former colleague for first exploring this approach.) The result, of course, was dog slow and could not be used on enormous datasets, but performance was never a priority and the client understood that they would pay dearly in performance costs for this approach. I was hoping that someone somewhere smarter than me had figured out how to bridge querying disparate systems in a better way.
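A rough sketch of the filter-then-join idea described above, under some loud assumptions: every class and method name here is hypothetical, the two stores are plain in-memory lists rather than LDAP and a database, and a simple in-memory hash join takes the place of the temporary H2 database.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// All names are hypothetical. The hash join below replaces the temporary H2 database.

class Student {
    final String id; final String name; final int year;
    Student(String id, String name, int year) { this.id = id; this.name = name; this.year = year; }
}

class Enrollment {
    final String studentId; final String course;
    Enrollment(String studentId, String course) { this.studentId = studentId; this.course = course; }
}

/** A storage that offers simple filtering only, not real queries. */
interface FilterableStorage<T> {
    List<T> filter(Predicate<T> where);
}

/** Stand-in for any backend (LDAP, RDBMS, files): a list held in memory. */
class ListStorage<T> implements FilterableStorage<T> {
    private final List<T> items;
    ListStorage(List<T> items) { this.items = new ArrayList<>(items); }
    public List<T> filter(Predicate<T> where) {
        List<T> out = new ArrayList<>();
        for (T item : items) {
            if (where.test(item)) out.add(item);
        }
        return out;
    }
}

class CrossStoreJoin {
    /** Step 1: whittle down each store on its own. Step 2: join the candidate sets on the student id. */
    static List<String> coursesByStudent(FilterableStorage<Student> students, Predicate<Student> studentWhere,
                                         FilterableStorage<Enrollment> enrollments, Predicate<Enrollment> enrollmentWhere) {
        Map<String, Student> byId = new HashMap<>();
        for (Student s : students.filter(studentWhere)) byId.put(s.id, s);

        List<String> rows = new ArrayList<>();
        for (Enrollment e : enrollments.filter(enrollmentWhere)) {
            Student s = byId.get(e.studentId);   // probe the hash table built from the first store
            if (s != null) rows.add(s.name + " takes " + e.course);
        }
        return rows;
    }

    public static void main(String[] args) {
        FilterableStorage<Student> ldap = new ListStorage<>(List.of(
            new Student("s1", "Ada", 2), new Student("s2", "Linus", 1)));
        FilterableStorage<Enrollment> db = new ListStorage<>(List.of(
            new Enrollment("s1", "Databases"), new Enrollment("s2", "Compilers"), new Enrollment("s1", "Logic")));
        // prints [Ada takes Databases, Ada takes Logic]
        System.out.println(coursesByStudent(ldap, s -> s.year >= 2, db, e -> true));
    }
}
```

The same cost applies as in the description: both candidate sets are pulled into memory before joining, so this only works while the filtered sets stay small.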

Thanks for the links to the papers; very interesting reading.

stephan

@Laird: Your query builder doesn’t sound too ugly, but I haven’t seen the code :-)

“I was hoping that someone somewhere smarter than me had figured out how to bridge querying disparate systems in a better way.”

Uh, smarter? Then I’m most probably not the right person.

Perhaps the joins are only needed for reporting. If that is the case, it would be best to also write the data into an OLAP system for reports.

Like http://mondrian.pentaho.org/

Part of the point of splitting storage and query is that it becomes easier to do cross-storage queries. If you store objects in many places, but index/query them in one (again, the website vs Google analogy), it becomes supertrivial to query stuff in different places. The LDAP database problem you outline is one of the cases I had in mind when I designed these APIs in Qi4j, because I want to be able to do the same thing. In Qi4j our primary indexer is going to be Sesame2 (i.e. RDF), with SPARQL as the main query language (although it’s usually hidden under a domain-oriented Java API). Will be very interesting to see how it works out.
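As an illustration of “store in many places, index in one”, here is a small sketch. It is not Qi4j code: the names `EntityStore`, `MapStore` and `Indexer` are invented for this example, and a plain keyword index stands in for the RDF/SPARQL indexer mentioned above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical names; a keyword index is used here instead of RDF/SPARQL.

/** A store only puts and gets entities by id. */
interface EntityStore {
    void put(String id, String text);
    String get(String id);
}

class MapStore implements EntityStore {
    private final Map<String, String> data = new HashMap<>();
    public void put(String id, String text) { data.put(id, text); }
    public String get(String id) { return data.get(id); }
}

/** One index for entities living in many stores. It keeps references ("store:id"), not the entities. */
class Indexer {
    private final Map<String, Set<String>> refsByWord = new HashMap<>();

    void index(String storeName, String id, String text) {
        for (String word : text.toLowerCase().split("\\W+")) {
            if (word.isEmpty()) continue;
            refsByWord.computeIfAbsent(word, w -> new LinkedHashSet<>()).add(storeName + ":" + id);
        }
    }

    /** Returns references in the order they were indexed. */
    List<String> query(String word) {
        return new ArrayList<>(refsByWord.getOrDefault(word.toLowerCase(), Collections.<String>emptySet()));
    }
}

class OneIndexManyStores {
    public static void main(String[] args) {
        Map<String, EntityStore> stores = new HashMap<>();
        stores.put("ldap", new MapStore());
        stores.put("db", new MapStore());
        Indexer indexer = new Indexer();

        // Each write goes to one store and to the shared index.
        stores.get("ldap").put("s1", "Ada Lovelace student");
        indexer.index("ldap", "s1", "Ada Lovelace student");
        stores.get("db").put("c7", "Databases course taken by Ada");
        indexer.index("db", "c7", "Databases course taken by Ada");

        // One query, hits from both stores; each hit is then loaded from its own store.
        for (String ref : indexer.query("ada")) {
            String[] parts = ref.split(":", 2);
            System.out.println(ref + " -> " + stores.get(parts[0]).get(parts[1]));
        }
    }
}
```

The query never touches the stores directly; it resolves to references first, which is what lets the stores stay as simple as a hashmap.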
