PDC Theorem – The Underlying Principle of Real-Time Distributed Processing

Few months ago I blogged about real-time processing in enterprise systems of all stripes, and since then I’ve talked quite a bit about the underlying principle of such processing. I call it a PDC theorem as in “Processing, Data, and Co-Location”.

The idea behind PDC is pretty simple. Real-time response in a highly distributed system is not achievable unless the following 3 rules are followed:

  • Processing must be distributable for in-memory computation
  • Data storage must be distributable (i.e. partitioned) for in-memory storage
  • Co-location must be ensured between processing and data units to provide locality of remote operations

Few important notes:

  • We, of course, are talking about business or perceptual real-time (a.k.a. Near Real-Time or nRT) and not about hardware real-time. Perceptual real-time response is not well defined but you can conceptually visualize it as the time the user of the system willing to wait for the response that he or she expects right away… In most cases it means few seconds or less. In rarer cases like FOREX trading, for example, the real-time would mean microseconds.
  • It is critically important that your processing supports algorithmic parallelization. Not all tasks can be parallelized and therefore not all tasks can be optimized for real-time processing. However, many of the typical business and social graph tasks can be split into multiple sub-tasks executing in parallel – and therefore are trivially parallelizable.
  • Data have to be partitioned and stored in-memory. Any outside calls to get data from NoSQL storage, file systems like HDFS, or traditional SQL storage renders any real-time attempts useless in most cases. This is one of the most critical design element and it is often overlooked. In other words – in no time the remote processing should escape the boundaries of the local JVM it is executing on.
  • Co-location of the processing and data (a.k.a affinity-based routing referring to the fact that should be an affinity between the computation and the data this computation needs) is the main mechanism to ensure that there is no noise data transfer between remote nodes where a task is being processed. Such unnecessary data transfer will violate the locality principle of the remote operations making real-time processing often unachievable.

It’s also quite obvious that PDC theorem doesn’t guarantee the real-time processing – it merely states these three rules are necessary but not enough on their own for a real-time response. Latencies of the atomic remote operations will often dictate whether or not real-time response is achievable in practice.

It is interesting to note that combination of PDC and CAP theorems really defines the fundamentals of high performance distributed programming today.

3 responses

  1. We agree that to scale, you need your data to be available for each node that need it to perform its computation. On the cases you can’t have a full copy on each node, you’ll want to distribute it so each node need only the data it own, otherwise communication would kill the performance.

    But that data can be in the processor cache (fast), in memory (moderately fast), or worse on the file system (slow).

    This would limit the speed for executing one sequential task on a node. But if this execution is fast enough in term of latency you can simply run more of them in parallel (that’s what distributed mean no?) by using more nodes or by growing your raid array inside each node.

    I agree that data in memory is faster for an individual task even through we can have SSD raid array that manage to provide troughtput > 1GB/s without too much investment.

    This doesn’t mean it would not work if you need file system access, and I would not call it a theorem, except if you provide a mathematical proof of it.

  2. CAP is a theorem, it has a published peer reviewed mathematical proof. Before it was a theorem, it was a conjecture from Eric Brewer. I think this is a conjecture, and to be mathematically meaningful, you have to take out technology centric statements like “in-memory” and vague un-testable notions like “locality”.

    If you could make this precise and more importantly actually prove it, you’d have something important.

    A big limitation of CAP is that it doesn’t actually speak to latency. Abidi’s PACELC stuff tries to introduce latency into the engineering judgements but again leaves out the mathematical rigor.

    • Totally agree – I would even call it just an observation. However, in the common sense “in-memory” and “locality” are well understood. And although CAP has somewhat mathematical proof to it – there are number of successful attempts in “dis-proving” it by slightly relaxing certain guarantees (Pardon’s, etc.). So, formal mathematical rigor in this case doesn’t play a significant role (IMHO) as there are number of practical approaches in how to (almost) work around CAP limitations.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s

%d bloggers like this: