An overview of the NoSQL world

Let's back up the Internet

A few sessions at Disruptive Code 2010 were dedicated to the trendy topic of “NoSQL solutions”, and I have to admit I was really looking forward to what Adam Skogman from SpringSource and Eric Evans from Rackspace had to say on the subject.

In the last 5 years the amount of data produced worldwide (texts, images, audio, …) has drastically increased from 161 exabytes to 988 exabytes—one EB being one million TB or one billion GB—and with that come some new challenges (storage capacity, availability, …) that cannot be entirely solved by SQL solutions and relational databases.

Not entirely, because “NoSQL” does not mean no SQL at all but rather Not Only SQL, and depending on the needs, the final solution to a problem will probably be a mixed architecture.

What are the problems?

The amount of data is growing at an exponential rate, and a relational database (like MySQL) is not really a distributed solution. Even though reads can be performed on slaves, writes most likely have to be done on a master (where would the consistency be otherwise?), which becomes a bottleneck in a transaction-intensive system.
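To make the bottleneck concrete, here is a minimal in-memory sketch of the read/write split described above. It is not how MySQL replication is implemented: the `ReplicatedStore` class and its methods are hypothetical names made up for illustration, and replication is shown as synchronous for simplicity (real replication is typically asynchronous).

```python
import itertools


class ReplicatedStore:
    """Toy model of one master and several read-only replicas (all in memory)."""

    def __init__(self, replica_count):
        self.master = {}
        self.replicas = [{} for _ in range(replica_count)]
        self._next_replica = itertools.cycle(range(replica_count))
        self.master_writes = 0

    def write(self, key, value):
        # Every write goes through the single master: this is the bottleneck.
        self.master[key] = value
        self.master_writes += 1
        # Assumption: changes are copied to the replicas immediately.
        for replica in self.replicas:
            replica[key] = value

    def read(self, key):
        # Reads are spread over the replicas in round-robin order.
        replica = self.replicas[next(self._next_replica)]
        return replica.get(key)


store = ReplicatedStore(replica_count=3)
store.write("order:1", "paid")
status = store.read("order:1")
```

Adding replicas increases read capacity in this model, but `write` still touches the one master every time, so write throughput does not scale the same way.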

Relational models are rather static, and once the model has been defined you had better not have to change it. From my personal experience I can say that adding a column, without downtime, to a table that gets roughly 10 million extra rows a day is a rather complex and costly operation (just in the human resources involved). And when you need to keep 5 years of data in that same table (which would be about 18.3 billion rows) and still get good performance on reads and writes, it gets even more complex.
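The row count above is simple arithmetic; a quick check, assuming a constant 10 million rows a day and an average year of 365.25 days (the variable names are just for illustration):

```python
ROWS_PER_DAY = 10_000_000   # rough daily growth quoted above
DAYS_PER_YEAR = 365.25      # assumption: average year, leap days included
YEARS = 5

total_rows = ROWS_PER_DAY * DAYS_PER_YEAR * YEARS
print(f"{total_rows / 1e9:.1f} billion rows")  # prints "18.3 billion rows"
```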

When you need to store, as fast as possible, large quantities of data whose structure has to be somewhat flexible, you’ll definitely want to have a look at the following solutions.

What are the solutions?

There are 4 kinds of NoSQL models at the moment:

  • key/value
  • column
  • document
  • graph

The design of a key/value store is domain driven. Entities that are tightly coupled go to the same bucket (a customer and her shopping cart for instance) but different instances (customer A and customer B) don’t have to be in the same bucket. With a key/value store like Redis a throughput of 110,000 database operations per second can be achieved, whereas MySQL shows a good 15,000 (good because it’s still pretty good).
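The bucket idea can be sketched in a few lines. This is only an in-memory illustration under my own assumptions, not the API of Redis or any other product: `BucketedStore` and its methods are hypothetical, and picking the bucket with a CRC32 hash of the owner's identifier is just one possible placement rule.

```python
import zlib


class BucketedStore:
    """Toy key/value store that groups an owner's entities in one bucket."""

    def __init__(self, bucket_count):
        self.buckets = [{} for _ in range(bucket_count)]

    def bucket_index(self, owner_id):
        # A stable hash of the owning entity decides the bucket, so
        # everything that belongs to one customer lands together.
        return zlib.crc32(owner_id.encode("utf-8")) % len(self.buckets)

    def put(self, owner_id, kind, value):
        bucket = self.buckets[self.bucket_index(owner_id)]
        bucket[f"{owner_id}:{kind}"] = value

    def get(self, owner_id, kind):
        bucket = self.buckets[self.bucket_index(owner_id)]
        return bucket.get(f"{owner_id}:{kind}")


store = BucketedStore(bucket_count=4)
store.put("customer-A", "profile", {"name": "Alice"})
store.put("customer-A", "cart", [{"sku": "book-42", "qty": 1}])
store.put("customer-B", "profile", {"name": "Bob"})

cart_a = store.get("customer-A", "cart")
```

Customer A's profile and cart always share a bucket because both keys hash on `customer-A`, while customer B may or may not end up in the same one. That is what allows the buckets to live on different machines.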

Most key/value stores provide an indexing mechanism and/or a search engine, usually based on Lucene, Solr or Elasticsearch.

As for the other solutions, we can mention Apache Cassandra (originally developed at Facebook), Google Bigtable and Hadoop HBase (column models), CouchDB and MongoDB (document models) and Neo4j (graph database).

Graph models seem like a pretty interesting topic, even though performance is not what you can expect from a key/value store for instance, and I will definitely have a closer look at Neo4j.