Micro cloud in your JVM: code example…

A few days ago I blogged about how GridGain easily supports starting many GridGain nodes in a single JVM – which is a huge productivity boost during development. I’ve received a lot of requests to show the code – so here it is. This is an example that we are shipping with the upcoming 4.3 release (entire source code):


import org.gridgain.grid.*;
import org.gridgain.grid.spi.discovery.tcp.*;
import org.gridgain.grid.spi.discovery.tcp.ipfinder.*;
import org.gridgain.grid.spi.discovery.tcp.ipfinder.vm.*;
import org.gridgain.grid.typedef.*;

import javax.swing.*;
import java.util.concurrent.*;

public class GridJvmCloudExample {
  /** Number of nodes to start. */
  private static final int NODE_COUNT = 5;

  /**
   * Starts multiple nodes in the same JVM.
   */
  public static void main(String[] args) throws Exception {
    try {
      ExecutorService exe = new ThreadPoolExecutor(
        NODE_COUNT, 
        NODE_COUNT, 
        0, 
        TimeUnit.MILLISECONDS,
        new LinkedBlockingQueue<Runnable>()
      );

      // Shared IP finder for in-VM node discovery.
      final GridTcpDiscoveryIpFinder ipFinder = 
        new GridTcpDiscoveryVmIpFinder(true);

      for (int i = 0; i < NODE_COUNT; i++) {
        final String nodeName = "jvm-node-" + i;

        // Start nodes concurrently (it's faster).
        exe.submit(new Callable<Object>() {
          @Override public Object call() throws Exception {
            // All defaults.
            GridConfigurationAdapter cfg = new GridConfigurationAdapter();

            cfg.setGridName(nodeName);

            // Configure in-VM TCP discovery so we don't
            // interfere with other grids running on the same network.
            GridTcpDiscoverySpi discoSpi = new GridTcpDiscoverySpi();

            discoSpi.setIpFinder(ipFinder);

            cfg.setDiscoverySpi(discoSpi);

            G.start(cfg);

            return null;
          }
        });
      }

      exe.shutdown();

      exe.awaitTermination(20, TimeUnit.SECONDS);

      // Get first node.
      Grid g = G.grid("jvm-node-0");

      // Print out number of nodes in topology.
      X.println("Number of nodes in the grid: " + g.nodes().size());

      // Wait until Ok is pressed.
      JOptionPane.showMessageDialog(
        null,
        new JComponent[] {
          new JLabel("GridGain JVM cloud started."),
          new JLabel("Press OK to stop all nodes.")
        },
        "GridGain",
        JOptionPane.INFORMATION_MESSAGE);
    }
    finally {
      G.stopAll(true);
    }
  }
}

That’s all there is to it. Enjoy your in-JVM micro cloud!
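
One detail in the listing is worth a second look: `awaitTermination` returns `false` if the timeout elapses before all submitted tasks finish, and an exception thrown inside a `Callable` only surfaces through its `Future`. The sketch below uses plain `java.util.concurrent` with stand-in tasks instead of real node startups. The class and method names (`ConcurrentStartSketch`, `startAll`) are hypothetical and not part of GridGain; this is just one way to check both conditions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class ConcurrentStartSketch {
  /**
   * Runs all tasks concurrently and returns how many completed successfully.
   * Throws if the timeout elapses or if any task failed.
   */
  public static int startAll(List<Callable<String>> tasks, long timeoutSec) throws Exception {
    ExecutorService exe = Executors.newFixedThreadPool(Math.max(1, tasks.size()));
    List<Future<String>> futures = new ArrayList<>();

    try {
      for (Callable<String> task : tasks)
        futures.add(exe.submit(task));

      exe.shutdown();

      // awaitTermination returns false when the timeout expires first.
      if (!exe.awaitTermination(timeoutSec, TimeUnit.SECONDS))
        throw new IllegalStateException("Tasks did not finish within " + timeoutSec + "s");

      int started = 0;

      for (Future<String> f : futures) {
        f.get(); // Rethrows a task failure as ExecutionException.
        started++;
      }

      return started;
    }
    finally {
      exe.shutdownNow(); // No-op if everything already finished.
    }
  }

  public static void main(String[] args) throws Exception {
    List<Callable<String>> tasks = new ArrayList<>();

    for (int i = 0; i < 5; i++) {
      final String nodeName = "jvm-node-" + i;

      // Stand-in for a real node start: just returns the node name.
      tasks.add(() -> nodeName);
    }

    System.out.println("Tasks completed: " + startAll(tasks, 20));
  }
}
```

In the real example the body of each task would be the node startup call; the checks around it stay the same.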

Micro cloud in your JVM with GridGain.

One of the features in GridGain’s In-Memory Data Platform that often goes unmentioned is the ability to launch multiple GridGain nodes in a single JVM.

Now, as trivial as it sounds… can you start multiple 100% independent JBoss or WebLogic or Infinispan or GigaSpaces or Coherence or (gulp) Hadoop runtimes in a single JVM? The answer is no. Even for a simple test run you’ll have to start multiple instances on your computer (or on multiple computers), and debug them via a remotely connected debugger, different log windows, different configurations, etc. In one word – awkward…

Not so with GridGain. As I mentioned – you can launch as many GridGain nodes as you need in a single JVM. As a matter of fact – this is exactly how we debug complex internal algorithms here at GridGain Systems when we develop our product. We launch an entire micro cloud of 5-10 nodes in a single JVM (right from our JUnit tests), put breakpoints into different nodes and walk through the complex distributed execution path in the debugger… never leaving the convenience of our local IDE.

Naturally, all functionality (your tasks, access to data grid, clustering, messaging, CEP, etc.) works exactly the same way in a single JVM as it would on separate physical computers. You don’t have to adjust or reconfigure anything!

Now that’s a productivity feature that most of us can appreciate.

GridGain presents and sponsors Scalathon 2012

Scalathon 2012
July 27-29, 2012
Philadelphia, Pennsylvania

GridGain is a Gold Sponsor of Scalathon 2012 (#scalathon), one of the most exciting Scala developer conferences around. I will also be giving a presentation titled “Real-Time Streaming MapReduce.” As always – few slides and plenty of live MapReduce and in-memory distributed processing coding.

Visit the Scalathon website for more information, or register via their Meetup page.

You’re invited: 7/26 – Live Coding In-Memory Big Data with GridGain


GridGain will be hosting a webinar Thursday, July 26, 2012 at 3pm EST / 12pm PST during which GridGain’s CTO, Dmitriy Setrakyan, will be live coding GridGain examples in Java.

Dmitriy will cover the following examples:

  • How to distribute simple units of work to the grid.
  • Collocation of computation and data.
  • A full Streaming MapReduce example that performs SQL queries on streaming in-memory data.

This is a fantastic opportunity to see how easy it is to get started with GridGain, so register now and join me for “Live Coding In-Memory Big Data with GridGain.”

Five telltale “words” that your analytics/BI strategy is rotten…

Over the last 12 months I’ve accumulated plenty of “conversations” where we’ve discussed big data analytics and BI strategies with our customers and potential users. The 5 points below represent some of the key takeaways about the current state of the analytics/BI field: why it is by and large a sorry state of affairs, and what some of the obvious telltale signs of the decay are.

Beware: some measure of hyperbole is used below to sharpen the contrast…

“Batch”

This is probably getting obvious to most industry insiders but is still worthwhile to mention. If you have a “batch” process in your big data analytics – you are not processing live data and you are not processing it in a real time context. Period.

That means that you are analyzing stale data and your competitors that are more agile and smart are running circles around you since they CAN analyze and process live (streaming) data in real time and make appropriate operational BI decisions based on real time analytics.

Having “batch” in your system design is like running your database off a tape drive. Would you do that when everyone around you is using disk?

“Data Scientist”

A bit controversial. But… if you need one – your analytics/BI are probably not driving your business since you need a human body between your data and your business. Having humans (that sadly need to eat and sleep) in the loop saddles any process with huge latency and non real time characteristics.

In most cases it simply means:

  • Data you are collecting and the system that is collecting it are so messed up that you need a Data Scientist (i.e. Statistician/Engineer below 30) to clean up this mess
  • Your process is hopelessly slow and clunky for real automation
  • Your analytics/BI is outdated by definition (i.e. analyzing stale data with no meaningful BI impact on daily operations)

Now, sometimes you need a domain expert to understand the data and come up with some modeling – but I’ve yet to see a case complex enough that someone with a 4-year engineering degree in CS could not solve. Most of the time it is overreaction/over-hiring as the result of not understanding the problem in the first place.

“Overnight”

It’s a little brother of “Batch”. It is essentially a built-in failure for any analytics or BI. In the world of hyper local advertising, geo locations, and up-to-the-second updates on Twitter or Facebook or LinkedIn – you are the proverbial grandma driving a 1966 Buick on the highway with the turn signal blinking and everyone speeding past you…

There’s simply no excuse today to have any type of overnight processing (except for some rare legacy financial applications). Overnight processing is not only technical laziness, it is often a built-in organizational tenet – and that’s what makes it even more appalling.

“ETL”

This is a little brother of “Overnight”. ETL is what many people blame for overnight processing… “Look – we’ve got to move this Oracle into Hadoop and it takes 6 hours, and we can only do it at night when no one is online”.

Well, I can count only two or three clients of ours where no one is online during the night. This is 2012 for god’s sake!!! Most businesses (even smallish startups) are 24/7 operations these days.

ETL is the clearest sign of significant technical debt accumulation. It is for the most part a manifestation of a defensive and lazy approach to system design. It is especially troubling to see this approach in newer, younger companies that don’t have 25 years of legacy to deal with.

And it is equally invigorating to see it being steadily removed in companies with 50 years of IT history.

“Petabyte”

This is a bit controversial again… But I’m getting a bit tired of hearing that “We must design to process Petabytes of data” line from 20-person companies.

Let me break it down:

  • 99.99% of companies will NEVER need Petabyte scale
  • If your business “needs” to process Petabytes of data for its operations – you are likely doing something really wrong
  • Most of the “working sets” that we’ve seen, i.e. the data you really need to process, measure in the low teens of terabytes for the absolute majority of use cases
  • Given how frequently data is changing (in its structure, content, usefulness, freshness, etc.) I don’t expect that “working set” size will grow nearly as fast (if at all) – the overall data amount will grow but not the actual “window” that we need to process…

Yes – there are some companies and government organizations that will need to store petabytes and exabytes of data – but in all of those rare cases it’s for historical, archival and backup reasons, and likely never for frequent processing.

GridGain 4.2 Released!

I’m happy to announce that GridGain 4.2 is released!

This release includes several exciting new features as well as a host of performance optimizations. It is 100% backward compatible with the 4.x product line and we recommend that anyone on a 4.x version update as soon as possible.

Now – let’s talk about new features…

Delayed Preloading

In GridGain 4.2 we’ve introduced support for delayed preloading. Dmitriy Setrakyan wrote an excellent blog post detailing this new functionality. Essentially, whenever a new node joins the grid or an existing node leaves the grid, cluster repartitioning happens. This means that, in the case of a new node, it has to take responsibility for some of the data cached on other nodes, and in the case of a node leaving the grid, other nodes have to take responsibility for the data cached on that node. The result is data movement between data grid nodes. The picture below illustrates how keys get partitioned among caching data nodes (shared-nothing architecture):

Now imagine that you need to bring multiple nodes up concurrently. The 1st node that comes up will take responsibility for some portion of the data cached on other nodes and will start loading that portion of the data from other nodes. When a 2nd node comes up, it will also take responsibility for some portion of the data, including some data from the 1st node that was just started, and now a portion of the data that was moved to the 1st node will have to be moved to the 2nd node. Ouch – wouldn’t it be more efficient to wait till the 2nd node comes up to start data preloading? The same happens when nodes 3, 4, etc… come up. So the most efficient way to preload keys, and to avoid the extra network traffic caused by moving data between newly started nodes, is to delay preloading until the last node starts.

Delayed preloading lets you start preloading after a delay or manually, either from the API or from the Visor DevOps Console.
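
To put rough numbers on the argument, here is a toy simulation. It is only a sketch under stated assumptions: it assigns keys to nodes with rendezvous (highest-random-weight) hashing, which is a stand-in and not GridGain’s actual partitioning algorithm, and all names (`PreloadSketch`, `movedEager`, `movedDelayed`) are hypothetical. It counts how many key transfers happen when the cluster rebalances after every node join versus once after the last node has joined.

```java
public class PreloadSketch {
  /** SplitMix64 finalizer, used here as a cheap deterministic hash. */
  static long mix(long z) {
    z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
    z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
    return z ^ (z >>> 31);
  }

  /** Owner of a key among nodes 0..nodes-1: the node with the highest weight for that key. */
  public static int owner(int key, int nodes) {
    int best = 0;
    long bestWeight = Long.MIN_VALUE;

    for (int n = 0; n < nodes; n++) {
      long w = mix(key * 0x9E3779B97F4A7C15L + n);

      if (w > bestWeight) {
        bestWeight = w;
        best = n;
      }
    }

    return best;
  }

  /** Number of keys whose owner changes when the cluster grows from 'from' to 'to' nodes in one step. */
  public static int moved(int keys, int from, int to) {
    int cnt = 0;

    for (int k = 0; k < keys; k++)
      if (owner(k, from) != owner(k, to))
        cnt++;

    return cnt;
  }

  /** Eager preloading: rebalance after every single node join and sum up the transfers. */
  public static int movedEager(int keys, int from, int to) {
    int total = 0;

    for (int n = from; n < to; n++)
      total += moved(keys, n, n + 1);

    return total;
  }

  /** Delayed preloading: rebalance once, after the last node has joined. */
  public static int movedDelayed(int keys, int from, int to) {
    return moved(keys, from, to);
  }

  public static void main(String[] args) {
    int keys = 10_000;

    // Grow a 3-node cluster to 6 nodes.
    System.out.println("Eager transfers:   " + movedEager(keys, 3, 6));
    System.out.println("Delayed transfers: " + movedDelayed(keys, 3, 6));
  }
}
```

In this model the eager count can never be lower than the delayed one: any key that moves to an early new node and then again to a later one is transferred twice when rebalancing eagerly, and only once when preloading is delayed.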

JDBC Driver

One of the biggest additions to GridGain in the 4.2 release is the inclusion of a JDBC driver that you can use to query your data in GridGain’s In-Memory Data Grid. Now, this is a pretty big deal. No custom languages: standard SQL and hundreds of existing tools can be used to query & examine the data in your data grid.

Here’s a quick example. Notice how the Java code looks 100% identical to what you would write against a standard SQL database – yet you are working with an in-memory data platform:

// Register JDBC driver.
Class.forName("org.gridgain.jdbc.GridJdbcDriver");

// Open JDBC connection.
Connection conn = DriverManager.getConnection(
    "jdbc:gridgain://localhost/" + CACHE_NAME, 
    configuration()
);

// Create prepared statement.
PreparedStatement stmt = conn.prepareStatement(
    "select name, age from Person where age >= ?"
);

// Configure prepared statement.
stmt.setInt(1, minAge);

// Get result set.
ResultSet rs = stmt.executeQuery();

Fields Querying

As part of our work on the JDBC Driver we’ve added the capability to query not just full objects from the Data Grid but individual fields as well. As you saw in the JDBC example above – you can query just the fields you need (namely 'name' and 'age' in this example):

...
// Create prepared statement.
PreparedStatement stmt = conn.prepareStatement(
    "select name, age from Person where age >= ?"
);
...

Indexing SPI

In GridGain 4.2 we’ve overhauled our indexing and added an Indexing SPI that allows you to plug in any indexing implementation you would like – an extremely powerful feature and unique among data platforms. Our current default implementation is based on H2 but you can quite easily plug in your own (e.g. special bitmap indexing) or reuse one from your existing RDBMS.
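
To give a feel for what “pluggable indexing” means in general, here is a small self-contained sketch. It is NOT the GridGain Indexing SPI: the `FieldIndex` interface and the `SortedIndex` class are hypothetical names invented for this illustration. It shows a minimal index contract and one sorted-tree implementation that can answer a range predicate such as `age >= ?` without scanning every entry.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeMap;

public class IndexSketch {
  /** Hypothetical index contract (not the GridGain Indexing SPI). */
  public interface FieldIndex<V extends Comparable<V>, K> {
    /** Registers a cache key under the given field value. */
    void put(V fieldValue, K cacheKey);

    /** Removes a cache key previously registered under the given field value. */
    void remove(V fieldValue, K cacheKey);

    /** Returns cache keys whose field value is >= bound, ordered by field value. */
    List<K> greaterOrEqual(V bound);
  }

  /** One possible implementation, backed by a sorted tree. */
  public static class SortedIndex<V extends Comparable<V>, K> implements FieldIndex<V, K> {
    private final NavigableMap<V, Set<K>> tree = new TreeMap<>();

    @Override public void put(V fieldValue, K cacheKey) {
      tree.computeIfAbsent(fieldValue, v -> new LinkedHashSet<>()).add(cacheKey);
    }

    @Override public void remove(V fieldValue, K cacheKey) {
      Set<K> keys = tree.get(fieldValue);

      // Drop the tree entry once its last key is gone.
      if (keys != null && keys.remove(cacheKey) && keys.isEmpty())
        tree.remove(fieldValue);
    }

    @Override public List<K> greaterOrEqual(V bound) {
      List<K> res = new ArrayList<>();

      // tailMap(bound, true) is a view of all entries with value >= bound.
      for (Set<K> keys : tree.tailMap(bound, true).values())
        res.addAll(keys);

      return res;
    }
  }

  public static void main(String[] args) {
    FieldIndex<Integer, String> ageIdx = new SortedIndex<>();

    ageIdx.put(30, "alice");
    ageIdx.put(25, "bob");
    ageIdx.put(41, "carol");

    System.out.println(ageIdx.greaterOrEqual(30));
  }
}
```

A different implementation (a bitmap index, say) could sit behind the same interface without callers changing, which is the general idea behind making indexing a plug-in point.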

Visor Object Viewer

With the introduction of the JDBC Driver and full SQL support, GridGain’s Visor DevOps Console gained the capability to view objects in a GridGain cluster using a convenient GUI Object Browser:

You can select one or multiple nodes, examine cache metadata and automatically construct an SQL query. The result set gets displayed in paginated form and you can sort results on every page. Object Viewer supports a tabulated SQL editor and a convenient history of previously run SQL queries.

So, what are you waiting for? Go ahead and download it!

“BigData Distributed HPC with GridGain and Scala”, Boston, MIT, Thursday, July 12, 2012

I will be presenting at the Boston Area Scala Meetup on “BigData Distributed HPC with GridGain and Scala”. Expect a good discussion about in-memory data platforms in general, BigData, and what role in-memory technology plays in BigData. As always – plenty of live coding, developing live MapReduce apps in front of the audience.

All details are on the Boston Area Scala Meetup website.

Hope to see you there tomorrow!