Cache < Data Grid < Database

I would like to clarify definitions for the following technologies:

  • In-Memory Distributed Cache
  • In-Memory Data Grid
  • In-Memory Database

Surprisingly, these three terms are often used interchangeably, yet technically and historically they represent very different products and serve different, sometimes very different, use cases.

It’s also important to note that there are no specifications or industry standards on what a cache, a data grid or a database should be (unlike Java application servers and Java EE, for example). There was and still is an attempt to standardize caching via JSR 107, but it has been years (almost a decade) in the making and it is hopelessly outdated by now (I’m on the expert group).

Tricycle vs. Bike vs. Motorcycle

First of all, let me clarify that I am discussing caches, data grids and databases in the context of in-memory, distributed architectures. Traditional disk-based databases and non-distributed in-memory caches or databases are out of scope for this article.

Chronologically, caches, data grids and databases were developed in that order: starting from simple caching, moving to more complex data grids and finally to distributed in-memory databases. The first distributed caches appeared in the late 1990s, data grids emerged around 2002-2003 and in-memory databases have really come to the forefront in the last 5 years.

All of these technologies have enjoyed a significant boost in interest over the last couple of years thanks to the explosive growth of in-memory computing in general, fueled by a 30% YoY price reduction for DRAM and cheaper Flash storage.

Although I believe that distributed caching is rapidly going away, I still think it’s important to place it in its proper historical and technical context along with data grids and databases.

In-Memory Distributed Caching

The primary use case for caching is to keep frequently accessed data in process memory to avoid constantly fetching it from disk. This makes the data immediately available to the application running in that process space (hence, “in-memory” caching).

Most caches were built as distributed in-memory key/value stores that supported a simple set of ‘put’ and ‘get’ operations and, optionally, some sort of read-through and write-through behavior for reading and writing values from and to underlying disk-based storage such as an RDBMS. Depending on the product, additional features like ACID transactions, eviction policies, replication vs. partitioning, active backups, etc. also became available as the products matured.
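To make the ‘put’/‘get’ plus read-through/write-through idea concrete, here is a minimal single-process sketch in Python. It is not taken from any particular product: the `BackingStore` and `Cache` classes and all of their method names are hypothetical, and distribution, eviction and transactions are deliberately left out.

```python
from typing import Dict, Optional


class BackingStore:
    """Hypothetical stand-in for disk-based storage such as an RDBMS."""

    def __init__(self) -> None:
        self.rows: Dict[str, str] = {}
        self.reads = 0  # counts how often the "disk" is actually hit

    def load(self, key: str) -> Optional[str]:
        self.reads += 1
        return self.rows.get(key)

    def save(self, key: str, value: str) -> None:
        self.rows[key] = value


class Cache:
    """Key/value cache with read-through and write-through behavior."""

    def __init__(self, store: BackingStore) -> None:
        self._store = store
        self._entries: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        if key in self._entries:
            return self._entries[key]      # hit: served from memory
        value = self._store.load(key)      # miss: read through to storage
        if value is not None:
            self._entries[key] = value
        return value

    def put(self, key: str, value: str) -> None:
        self._store.save(key, value)       # write through to storage first
        self._entries[key] = value
```

With this sketch, a second `get` for the same key is answered from memory and does not touch the backing store again, which is the whole point of the cache.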

These fundamental data management capabilities of distributed caches formed the foundation for the technologies that came later and were built on top of them such as In-Memory Data Grids.

In-Memory Data Grid

The feature that distinguishes data grids from distributed caches is their ability to co-locate computations with data in a distributed context and, consequently, to move computation to the data. This capability was the key innovation that addressed the demands of rapidly growing data sets, which made moving data to the application layer increasingly impractical. Most data grids provided at least some basic capabilities of this kind.
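The idea of moving computation to data can be sketched in a few lines of Python. This is an illustration only: the `Node` and `Grid` classes, the CRC-based key-to-node mapping and the method names are assumptions made for the example, and the "nodes" are plain objects in one process rather than machines in a cluster.

```python
import zlib
from typing import Callable, Dict, List, TypeVar

T = TypeVar("T")
Task = Callable[[Dict[str, int]], T]


class Node:
    """Hypothetical grid node holding one partition of the data."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.data: Dict[str, int] = {}

    def execute(self, task: Task) -> T:
        # The task runs where the data lives and sees only local entries.
        return task(self.data)


class Grid:
    """Hypothetical grid that partitions keys across a fixed set of nodes."""

    def __init__(self, node_count: int) -> None:
        self.nodes: List[Node] = [Node(f"node-{i}") for i in range(node_count)]

    def owner(self, key: str) -> Node:
        # Deterministic key-to-node mapping (an assumption of this sketch).
        return self.nodes[zlib.crc32(key.encode()) % len(self.nodes)]

    def put(self, key: str, value: int) -> None:
        self.owner(key).data[key] = value

    def run_on_owner(self, key: str, task: Task) -> T:
        # Ship the task to the node that owns the key, not the data to the caller.
        return self.owner(key).execute(task)

    def broadcast(self, task: Task) -> List[T]:
        # Run the task on every node; only the small results travel back.
        return [node.execute(task) for node in self.nodes]


if __name__ == "__main__":
    grid = Grid(3)
    for i in range(100):
        grid.put(f"order-{i}", i)
    partial_sums = grid.broadcast(lambda local: sum(local.values()))
    print(sum(partial_sums))  # 4950
```

The caller never pulls the 100 entries across the network; it receives one partial sum per node and adds them up.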

This new and very disruptive capability also marked the start of the evolution of in-memory computing away from simple caching to a bona fide modern system of record. This evolution culminated in today’s In-Memory Databases.

In-Memory Database

The feature that distinguishes in-memory databases from data grids is the addition of distributed MPP (massively parallel processing) based on standard SQL and/or MapReduce, which allows computations over data stored in memory across the cluster.

Just as data grids were developed in response to rapidly growing data sets and the necessity to move computations to data, in-memory databases were developed in response to the growing complexity of data processing. It was no longer enough to have simple key-based access or RPC-type processing. Distributed SQL, complex indexing and MapReduce-based processing across TBs of data in memory are necessary tools for today’s demanding data processing.
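As a rough illustration of what MapReduce-style processing over partitioned in-memory data looks like, the Python sketch below computes the equivalent of a SQL `SELECT city, SUM(amount) ... GROUP BY city`. The row schema and the function names are hypothetical, and the map step runs sequentially in one process here, whereas a real cluster would run it in parallel on each node.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, int]  # (city, amount); a hypothetical schema for this sketch


def map_partition(rows: Iterable[Row]) -> Counter:
    """Map step: aggregate one partition using only its local rows."""
    partial: Counter = Counter()
    for city, amount in rows:
        partial[city] += amount
    return partial


def reduce_partials(partials: Iterable[Counter]) -> Dict[str, int]:
    """Reduce step: merge the per-partition aggregates into one result."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)  # adds the amounts for matching cities
    return dict(total)


def total_by_city(partitions: List[List[Row]]) -> Dict[str, int]:
    # Roughly: SELECT city, SUM(amount) FROM sales GROUP BY city
    return reduce_partials(map_partition(rows) for rows in partitions)
```

Only the small per-partition aggregates are merged centrally; the raw rows stay where they are stored.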

Adding distributed SQL and/or MapReduce-type processing required a complete rethinking of data grids, as the focus shifted from pure data management to hybrid data and compute management.

Full Disclosure

GridGain develops an In-Memory Database product as part of its end-to-end in-memory computing suite.
