On Myth of MapReduce Complexity…

I’ve read another post claiming that MapReduce (in Java) is a fairly complex paradigm and that hacking scripts with Pig or something similar is therefore simpler.

Let me say it again:

Hammering a nail is a painful and bloody experience too if you do it with your bare hands… Right tools for the right job!

Here’s the complete source code of a MapReduce application in Java that counts the non-space characters in its argument string on a cloud or grid of any size. It scales up and down with the size of your topology, works on one, two or thousands of nodes, and performs advanced load balancing and failover. And you won’t have to deploy a single thing – it uses the zero deployment & provisioning provided by GridGain.

It does all that and a few dozen other things under the hood, but you only need to write a few lines of code:

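// Imports assume the GridGain 3.x package layout (typedef aliases G, F, CIX1).
import org.gridgain.grid.*;
import org.gridgain.grid.typedef.*;

import java.util.Arrays;

import static org.gridgain.grid.GridClosureCallMode.SPREAD;

public class SimpleMapReduce {
    // main must be static for the JVM to use it as an entry point.
    public static void main(final String[] args) throws GridException {
        G.in(new CIX1<Grid>() {
            @Override public void applyx(Grid g) throws GridException {
                System.out.println("Length of input argument is " + g.reduce(
                    SPREAD,                               // spread the jobs across the nodes
                    F.<String, Integer>cInvoke("length"), // map: call length() on each word
                    Arrays.asList(args[0].split(" ")),    // split: one word per job
                    F.sumIntReducer()                     // reduce: sum the word lengths
                ));
            }
        });
    }
}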
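
To make the plumbing concrete without a grid at hand, here is a minimal single-JVM sketch of the same three steps (split, map each word to its length, reduce by summing) using only the JDK. The class and method names are hypothetical and no GridGain API is involved; on a grid, the map step is the part that would be spread across nodes.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical local illustration of the split/map/reduce steps; not part of GridGain.
public class LocalCharCount {
    // Split on single spaces, map every word to its length, then sum the lengths.
    public static int countNonSpace(String input) {
        List<String> words = Arrays.asList(input.split(" ")); // split
        return words.stream()
            .mapToInt(String::length)                         // map
            .sum();                                           // reduce
    }

    public static void main(String[] args) {
        // Fall back to a sample string when no argument is given.
        String input = args.length > 0 ? args[0] : "hello grid world";
        System.out.println("Length of input argument is " + countNonSpace(input));
    }
}
```

This version runs in one thread on one machine, so it shows only what is being computed, not the load balancing, failover or deployment that the grid version handles.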

If that is not simple enough to get started – I don’t know what is.

Right tool for the right job…

6 responses

  1. I don’t see how such an example would be of use in the real world.

    But as I understand it, splitting the string on the space separator is done in one thread on one machine. The only thing then done in the MapReduce system is counting the number of characters.

    Basically this code does the same as:
    SQL: SELECT SUM(LEN(S)) FROM SOURCE
    Clojure: (apply + (map #(.length %) list))
    Java:
    int length = 0;
    for (String current : list) {
        length += current.length();
    }

    Seeing how all three of these versions are simpler, it is logical that people say MapReduce is difficult.

    Of course your code does more. But it is still complex and ugly.

  2. So, you created a local version of the MapReduce and it got a few lines shorter… And your point is?

    My point was that a few extra lines of code are nothing compared to the benefits you get from distributed execution – and the code is absolutely not more complex than the snippets you have posted.

    If you look at the Hadoop example for the same task – you’ll see what a *complex* solution would look like…

    • My point is that if Oracle can do it without any special syntax or burden and scale to dozens of nodes, it should be possible for other software too.

      Complexity: one line for the SQL or Clojure version, 10 lines for the GridGain version, which even bundles in some introspection (for the length call). 10X more lines of code and some use of the reflection API is not just “a few more lines of code” for me.

      You wonder why people don’t want to give it a try? The simplistic code example you showed requires 2X more code than standard Java code and 10X more code than standard SQL or a Lisp flavour. Hey, it even manages to bring in some use of introspection…

      How could people look at that and like it?

      • Can you please provide me with an implementation in SQL or Clojure that would run on 100s of nodes with built-in:
        – failover
        – load balancing
        – collision resolution
        – dynamic topology resolution (adding and removing nodes from grid/cloud at runtime)
        – zero deployment (no copying/FTP-ing of anything what-so-ever)

        Once I see an implementation that is shorter and simpler than those 10 lines in Java – I’ll come back to your comments.

        BTW, you can remove reflection – but it will add one more line to the code… :)

        Wise up,
        Nikita.
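
The reflection trade-off mentioned in this reply can be illustrated with plain JDK code, assuming nothing about how GridGain implements `cInvoke` internally: one mapper resolves `length` by name at runtime, the other calls it directly. The class and method names below are hypothetical.

```java
import java.lang.reflect.Method;
import java.util.function.Function;

// Hypothetical illustration only; it does not reproduce GridGain's implementation.
public class ReflectionVsDirect {
    // Reflective mapper: looks up a public no-arg method on String by name.
    public static Function<String, Integer> byName(String methodName) {
        return s -> {
            try {
                Method m = String.class.getMethod(methodName);
                return (Integer) m.invoke(s);
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException(e);
            }
        };
    }

    // Direct mapper: the call is checked by the compiler, no reflection involved.
    public static Function<String, Integer> direct() {
        return String::length;
    }

    public static void main(String[] args) {
        System.out.println(byName("length").apply("grid"));
        System.out.println(direct().apply("grid"));
    }
}
```

Both mappers return the same value for the same word; the direct form trades the convenience of naming a method as a string for compile-time checking.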

      • Btw, I assumed the obvious, but let me reiterate it in case you missed it: the example of counting chars on the cloud is obviously silly and only serves the point of showing the plumbing (and NOT the “usefulness” of counting chars in a distributed manner).

        I hope you did get this idea…

  3. Can you please provide me the implementation on SQL or Clojure that would run on 100s of nodes with built-in

    -> Oracle RAC does it for SQL.
