Monday, July 15, 2013

Parallel Async Functions execution in JavaScript

The event-driven programming model of Node.js makes it somewhat tricky to coordinate program flow.

Simple sequential execution gets turned into nested callbacks, which is easy enough (though a bit convoluted to write down).
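For example, reading three files one after another (the file names here are just placeholders) nests like this:

var fs = require('fs');

// each read starts only after the previous one has finished
fs.readFile('file1', function (err1, data1) {
  fs.readFile('file2', function (err2, data2) {
    fs.readFile('file3', function (err3, data3) {
      // only at this point are all three results available
    });
  });
});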

But how about parallel execution? Say you have three tasks A,B,C that can run in parallel and when they are done, you want to send their results to task D.

With a fork/join model this would be

    fork A
    fork B
    fork C
    join A,B,C, run D

How do I write that in Node.js? Are there any best practices or cookbooks? Do I have to hand-roll a solution every time, or is there some library with helpers for this?

Nothing is truly parallel in Node.js since it is single-threaded. However, multiple events can be scheduled and run in a sequence you can't determine beforehand. And some things like database access are actually "parallel" in that the database queries themselves run in separate threads but are re-integrated into the event stream when completed.

So, how do you schedule a callback on multiple event handlers? Well, this is one common technique used in animations in browser-side JavaScript: use a variable to track the completion.

This sounds like a hack, and it is. It also sounds potentially messy, leaving a bunch of global variables around to do the tracking, and in a lesser language it would be. But in JavaScript we can use closures:

function fork (async_calls, shared_callback) {
  // the closure over `counter` tracks how many async calls are still pending
  var counter = async_calls.length;
  var callback = function () {
    counter--;
    if (counter === 0) {
      // every async call has completed, so run the shared callback
      shared_callback();
    }
  };

  for (var i = 0; i < async_calls.length; i++) {
    async_calls[i](callback);
  }
}

// usage:
fork([A, B, C], D);

In the example above we keep the code simple by assuming the async and callback functions require no arguments. You can of course modify the code to pass arguments to the async functions and have the callback function accumulate results and pass them to the shared_callback function.
Additional answer:

Actually, even as is, that fork() function can already pass arguments to the async functions using a closure:

fork([
  function (callback) { A(1, 2, callback); },
  function (callback) { B(1, callback); },
  function (callback) { C(1, 2, callback); }
], D);

The only thing left to do is to accumulate the results from A, B, and C and pass them on to D.
Even more additional answer:

I couldn't resist. Kept thinking about this during breakfast. Here's an implementation of fork() that accumulates results (usually passed as arguments to the callback function):

function fork (async_calls, shared_callback) {
  var counter = async_calls.length;
  var all_results = [];

  function makeCallback (index) {
    return function () {
      counter--;
      var results = [];
      // we use the arguments object here because some callbacks
      // in Node pass in multiple arguments as result.
      for (var i = 0; i < arguments.length; i++) {
        results.push(arguments[i]);
      }
      // store the results in this call's own slot, so ordering is
      // preserved no matter which call completes first
      all_results[index] = results;
      if (counter === 0) {
        shared_callback(all_results);
      }
    };
  }

  for (var i = 0; i < async_calls.length; i++) {
    async_calls[i](makeCallback(i));
  }
}

That was easy enough. This makes fork() fairly general-purpose: it can be used to synchronize multiple non-homogeneous events.

Example usage in Node.js:

// Read 3 files in parallel and process them together:

var fs = require('fs');

function A (c) { fs.readFile('file1', c); }
function B (c) { fs.readFile('file2', c); }
function C (c) { fs.readFile('file3', c); }
function D (results) {
  // each entry is the [err, data] pair fs.readFile passed to its callback
  var file1data = results[0][1];
  var file2data = results[1][1];
  var file3data = results[2][1];

  // process the files together here
}

fork([A, B, C], D);
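The tasks don't even have to be the same kind of event source; fork() only cares that each one eventually invokes its callback. Here's a minimal sketch mixing a file read with a timer (the file name is a placeholder, and the timer wraps its callback to mimic Node's (err, result) convention):

fork([
  // a file read; fs.readFile passes (err, data) to its callback
  function (callback) { fs.readFile('config.json', callback); },
  // a timer; setTimeout passes nothing, so we supply an (err, result) pair ourselves
  function (callback) { setTimeout(function () { callback(null, 'tick'); }, 500); }
], function (results) {
  console.log(results[0]); // [err, data] from the file read
  console.log(results[1]); // [null, 'tick'] from the timer
});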

"Nothing is truly parallel in node.js since it is single threaded." Not true. Everything that does not use the CPU (such as waiting for network I/O) runs in parallel.

It is true, for the most part. Waiting for I/O in Node doesn't block other code from running, but when code does run, it runs one piece at a time. The only true parallel execution in Node comes from spawning child processes, but then that could be said of nearly any environment.
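Incidentally, the same fork() helper can coordinate that kind of genuinely parallel work, since child_process.exec also takes an (error, stdout, stderr)-style callback. A rough sketch, with arbitrary shell commands standing in for real workloads:

var child_process = require('child_process');

fork([
  function (callback) { child_process.exec('ls -l', callback); },
  function (callback) { child_process.exec('uname -a', callback); }
], function (results) {
  // each entry is [err, stdout, stderr] from exec
  console.log(results[0][1]);
  console.log(results[1][1]);
});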

Usually we describe code that is not using the CPU as not running, and if it isn't running, it can't be "running" in parallel.

The implication of this is that with events, because we know they definitely cannot run in parallel, we don't have to worry about semaphores and mutexes, while with threads we have to lock shared resources. That is exactly why the bare counter-- in fork() above is safe: no two callbacks can ever execute at the same time, so the counter needs no lock.

Am I correct in saying that these functions are not executing in parallel, but are (at best) executing in an undetermined sequence, with the code not progressing to D until each async call completes?
