Few months ago I blogged about real-time processing in enterprise systems of all stripes, and since then I’ve talked quite a bit about the underlying principle of such processing. I call it a PDC theorem as in “Processing, Data, and Co-Location”.
The idea behind PDC is pretty simple. Real-time response in a highly distributed system is not achievable unless the following 3 rules are followed:
- Processing must be distributable for in-memory computation
- Data storage must be distributable (i.e. partitioned) for in-memory storage
- Co-location must be ensured between processing and data units to provide locality of remote operations
Few important notes:
- We, of course, are talking about business or perceptual real-time (a.k.a. Near Real-Time or nRT) and not about hardware real-time. Perceptual real-time response is not well defined but you can conceptually visualize it as the time the user of the system willing to wait for the response that he or she expects right away… In most cases it means few seconds or less. In rarer cases like FOREX trading, for example, the real-time would mean microseconds.
- It is critically important that your processing supports algorithmic parallelization. Not all tasks can be parallelized and therefore not all tasks can be optimized for real-time processing. However, many of the typical business and social graph tasks can be split into multiple sub-tasks executing in parallel – and therefore are trivially parallelizable.
- Data have to be partitioned and stored in-memory. Any outside calls to get data from NoSQL storage, file systems like HDFS, or traditional SQL storage renders any real-time attempts useless in most cases. This is one of the most critical design element and it is often overlooked. In other words – in no time the remote processing should escape the boundaries of the local JVM it is executing on.
- Co-location of the processing and data (a.k.a affinity-based routing referring to the fact that should be an affinity between the computation and the data this computation needs) is the main mechanism to ensure that there is no noise data transfer between remote nodes where a task is being processed. Such unnecessary data transfer will violate the locality principle of the remote operations making real-time processing often unachievable.
It’s also quite obvious that PDC theorem doesn’t guarantee the real-time processing – it merely states these three rules are necessary but not enough on their own for a real-time response. Latencies of the atomic remote operations will often dictate whether or not real-time response is achievable in practice.
It is interesting to note that combination of PDC and CAP theorems really defines the fundamentals of high performance distributed programming today.