Fully Loaded Node – A Node.JS Holiday Season, part 2

Episode 2 in the A Node.JS Holiday Season series from Mozilla’s Identity team searches for an optimal server application architecture for computation heavy workloads.

This is a prose version of a short talk given by Lloyd Hilaiel at Node Philly 2012 with the same title.

A Node.JS process runs almost entirely on a single processing core; because of this, building scalable servers requires special care.
With the ability to write native extensions and a robust set of APIs for managing processes, there are many different ways to design a Node.JS application that executes code in parallel. In this post we’ll evaluate these possible designs.

This post also introduces the compute-cluster module: a small Node.JS library that makes it easy to manage a collection of processes to distribute computation.

The Problem

We chose Node.JS for Mozilla Persona, where we built a server that could handle a large number of requests with mixed characteristics.
Our “Interactive” requests have low computational cost to execute and need to get done fast to keep the UI feeling responsive, while “Batch” operations require about a half second of processor time and can be delayed a bit longer without detriment to user experience.

To find a great application design, we looked at the type of requests our application had to process, thought long and hard about usability and scaling cost, and came up with four key requirements:

  • saturation: The solution will be able to use every available processor.
  • responsiveness: Our application’s UI should remain responsive. Always.
  • grace: When overwhelmed with more traffic than we can handle, we should serve as many users as we can, and promptly display a clear error to the remainder.
  • simplicity: The solution should be easy to incrementally integrate into an existing server.

Armed with these requirements, we can meaningfully contrast the approaches:

Approach 1: Just do it on the main thread.

When computation is performed on the main thread, the results are terrible:
You cannot saturate multiple computation cores, and with repeated half second starvation of interactive requests you can be neither responsive nor graceful.
The only thing this approach has going for it is simplicity:

function myRequestHandler(request, response) {
  // Let's bring everything to a grinding halt for half a second.
  var results = doComputationWorkSync(request.somesuch);
  response.end(JSON.stringify(results));
}

Synchronous computation in a Node.JS program that is expected to serve more than one request at a time is a bad idea.

Approach 2: Do it Asynchronously.

Using asynchronous functions that run in the background will improve things, right?
Well, that depends on what, precisely, “the background” means:
If the computation function is implemented in such a way that it actually performs the computation in JavaScript or native code on the main thread, then performance is no better than with the synchronous approach.
Have a look:

function doComputationWork(input, callback) {
  // Because the internal implementation of this asynchronous
  // function is itself synchronously run on the main thread,
  // you still starve the entire process.
  var output = doComputationWorkSync(input);
  process.nextTick(function() {
    callback(null, output);
  });
}

function myRequestHandler(request, response) {
  // Even though this *looks* better, we're still bringing everything
  // to a grinding halt.
  doComputationWork(request.somesuch, function(err, results) {
    // ... do something with results ...
  });
}

The key point is that using an asynchronous API in NodeJS does not necessarily yield an application that can use multiple processors.

Approach 3: Do it Asynchronously with Threaded Libraries!

Given a library that is written in native code and cleverly implemented, it is possible to execute work in different threads from within NodeJS.
Many examples exist, one being the excellent bcrypt library from Nick Campbell.

If you test this out on a four core machine, what you will see will look fantastic! Four times the throughput, leveraging all computation resources! If you perform the same test on a 24 core machine, you won’t be as happy: you will see four cores fully utilized while the rest sit idle.

The problem here is that the library is using NodeJS’s internal threadpool for a problem that it was not designed for, and this threadpool has a hardcoded upper bound of 4.

Deeper problems exist with this approach, beyond these hardcoded limits:

  • Flooding NodeJS’s internal threadpool with computation work can starve network or file operations, which hurts responsiveness.
  • There’s no good way to control the backlog – If you have 5 minutes of computation work already sitting in your queue, do you really want to pile more on?

Libraries that are “internally threaded” in this manner fail to saturate multiple cores, adversely affect responsiveness, and limit the application’s ability to degrade gracefully under load.
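
To make the starvation problem concrete, here is a minimal sketch (the bcrypt call signature varies between versions, so treat this as illustrative): once the four threadpool threads are busy hashing, even a cheap file operation queues behind the computation.

var bcrypt = require('bcrypt');
var fs = require('fs');

// Kick off eight expensive hashes. With a threadpool of four, four
// run immediately and the other four wait in the queue.
for (var i = 0; i < 8; i++) {
  bcrypt.hash('s3kr1t', 8, function(err, hash) {
    console.log('hash completed at', Date.now());
  });
}

// fs.stat is serviced by the *same* threadpool, so this cheap file
// operation now sits in line behind queued computation work.
fs.stat(__filename, function(err, stats) {
  console.log('stat completed at', Date.now());
});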

Approach 4: Use node’s cluster module!

NodeJS 0.6.x and up offer a cluster module, which allows for the creation of processes that “share a listening socket” to balance load across some number of child processes.
What if you were to combine cluster with one of the approaches described above?

The resultant design would simply inherit the shortcomings of synchronous or internally threaded solutions, which are not responsive and lack grace.
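
For illustration, here is a minimal sketch of combining cluster with the synchronous approach (doComputationWorkSync is the same hypothetical function from earlier): every worker is still a single-threaded Node.JS process, so a half second of computation still stalls all interactive requests routed to that worker.

var cluster = require('cluster');
var http = require('http');
var numCPUs = require('os').cpus().length;

if (cluster.isMaster) {
  // Fork one worker per core; the workers share the listening socket.
  for (var i = 0; i < numCPUs; i++) cluster.fork();
} else {
  http.createServer(function(request, response) {
    // The inherited shortcoming: this blocks the whole worker, so any
    // interactive requests assigned to it are starved in the meantime.
    var results = doComputationWorkSync(request.url);
    response.end(JSON.stringify(results));
  }).listen(8000);
}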

Simply spinning new application instances is not always the right answer.

Approach 5: Introducing compute-cluster

Our current solution to this problem in Persona is to manage a cluster of single-purpose processes for computation.
We’ve generalized this solution in the compute-cluster library.

compute-cluster spawns and manages processes for you, giving you a programmatic means of running work on a local cluster of child processes.
Usage is thus:

const computecluster = require('compute-cluster');

// allocate a compute cluster
var cc = new computecluster({ module: './worker.js' });

// run work in parallel
cc.enqueue({ input: "foo" }, function (error, result) {
  console.log("foo done", result);
});
cc.enqueue({ input: "bar" }, function (error, result) {
  console.log("bar done", result);
});

The file worker.js should respond to message events to handle incoming work:

process.on('message', function(m) {
  // Do lots of work here; we don't care that we're blocking the main
  // thread, because this process is intended to do one thing at a time.
  var output = doComputationWorkSync(m.input);
  process.send(output);
});

It is possible to integrate compute-cluster behind an existing asynchronous API without modifying the caller, and to start really performing work in parallel across multiple processors with minimal code change.
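
As a sketch of what that looks like in practice (reusing the doComputationWork signature from approach 2), the wrapper below keeps the caller’s asynchronous API intact while moving the work into child processes:

var computecluster = require('compute-cluster');
var cc = new computecluster({ module: './worker.js' });

// Same signature as before, but the heavy lifting now happens in a
// pool of child processes rather than on this process's main thread.
function doComputationWork(input, callback) {
  cc.enqueue({ input: input }, callback);
}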

So how does this approach achieve the four criteria?

saturation: Multiple worker processes use all available processing cores.

responsiveness: Because the managing process is doing nothing more than process spawning and message passing, it stays mostly idle and can spend the bulk of its time handling interactive requests.
Even if the machine is loaded, the operating system scheduler can help prioritize the management process.

simplicity: Integration into an existing project is easy. By hiding the details of compute-cluster behind a simple asynchronous API, calling code remains happily oblivious to how the work gets done.

Now what about gracefully degrading during overwhelming bursts of traffic?
Again, the goal is to run at maximum efficiency during bursts, and serve as many requests as possible.

compute-cluster enables a graceful design by managing a bit more than just process spawning and message passing.
It keeps track of how much work is running, and how long work takes to complete on average.
With this state it becomes possible to reliably predict how long work will take to complete before it is enqueued.
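
As a rough sketch of the arithmetic involved (the library’s actual heuristic may be more refined): with 4 worker processes, an average job time of 500ms, and 20 jobs already queued, a newly enqueued job can be expected to complete in roughly (20 / 4 + 1) × 500ms, or about 3 seconds.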

Combining this knowledge with a client supplied parameter, max_request_time, makes it possible to preemptively fail on requests that are likely to take longer than allowable.

This feature lets you easily map a user experience requirement into your code: “The user should not have to wait more than 10s to log in” translates into a max_request_time of about 7 seconds (with padding for network time).
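
Here is a hedged sketch of what that might look like (the option placement, units, and error handling are assumptions on my part; consult the compute-cluster README for specifics):

var computecluster = require('compute-cluster');

// "No more than 10s to log in" becomes roughly 7s of compute budget,
// leaving padding for network time. (Milliseconds are assumed here;
// check the module's documentation for the actual unit.)
var cc = new computecluster({
  module: './worker.js',
  max_request_time: 7000
});

cc.enqueue({ input: "foo" }, function(error, result) {
  if (error) {
    // Work predicted to exceed max_request_time fails preemptively,
    // so the user sees a clear error right away instead of a long wait.
    return console.log("server is busy, please try again shortly");
  }
  console.log("foo done", result);
});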

In load testing the Persona service, the results so far are promising.
Under times of extreme load, we are able to allow authenticated users to continue using the service, and to block a portion of unauthenticated users right up front with a clear error message.

Next Steps

Application level parallelization using processes works well for a single tier deployment architecture – an arrangement where you have only one type of node and you simply add more of them to support scale.
As applications get more complex, however, it is likely the deployment architecture will evolve to have different application tiers to support performance or security goals.

In addition to multiple deployment tiers, high availability and scale often require application deployment in multiple colocation facilities. Finally, cost-effective scaling of a computationally bound application can be achieved by leveraging on-demand cloud-based computation resources.

Multiple tiers in multiple colos with demand-spun cloud servers change the parameters of the scaling problem considerably, while the goals remain the same.

The future of compute-cluster may involve the ability to distribute work over multiple different tiers to maximally saturate available computation resources in times of load.
This may work cross-colo to support geographically asymmetric bursts.
This may involve the ability to leverage new hardware that’s demand spun…

Or we may solve the problem a different way! If you have thoughts on an elegant way to enable compute-cluster to distribute work over the network while preserving the properties it has thus far, I’d love to hear them!

Thanks for reading. You can join the discussion and learn more about current scaling challenges and approaches in Persona on our email list.


16 comments

  1. Johan Sundström

    Reader friendliness suggestion: have each article lead-in link prominently to the first, previous and next article in the series.

    Great work, otherwise!

    November 20th, 2012 at 11:18

    1. Lloyd Hilaiel

      Fantastic idea, it would keep the continuity of the series. I’ll update this post.

      November 20th, 2012 at 13:00

  2. Ryan Doherty

    As much as I really, really like Node.js, when I read articles like this I wonder if either there’s something wrong with its architecture or we’re trying to fit a square peg into a round hole.

Were there benefits to keeping everything in Node.js for these computationally expensive operations? Could another platform/environment have achieved your four requirements better? I’m not a backend expert, but at some point, when you’re creating workarounds like this, it should hopefully make you question your decision.

    I wasn’t there for your decisions, so it’s entirely possible you guys did analyze other environments and/or had long discussions about the problem. Would love to hear about it! :)

    I only ask these questions because I worry about new Node.js devs searching around for solutions to these types of problems. If they are not aware of all the complexity and intricacies you took into account they may not make the best decision.

    That being said this article was a great read and I like how you tied user requirements to technical ones. I’m looking forward to more in depth Node.js articles!

    November 20th, 2012 at 12:38

    1. Lloyd Hilaiel

      Yo Ryan,

      Thanks for the thoughts. It’s worthwhile to reflect on our language choices and to ask whether we really did make work for ourselves. It is also very hard for me to be unbiased in my evaluation of node.js, but here goes.

      The computation that we’re specifically talking about here was bcrypting passwords – hashing them using an intentionally expensive work function to make it so that we never store them in plaintext and even upon server compromise an attacker would have an (often intractable) amount of work to do in order to turn a hash into a plaintext password. The code that does this is written in C.

Now, regardless of your server platform, if you need to run a massive number of parallel computation operations that take half a second on your hardware, you’ve *got* to put thought into it. What pleased me about Node.JS here is that binding native code into a javascript runtime is easy (V8’s embedder’s API is grand, comparable to js-ctypes in Firefox or Ruby’s binding interface). That, combined with a very simple and robust API for managing processes, AND quick process startup (30ish ms), AND minimal process memory overhead (12ish MB), made realizing the solution easy, so all of the work went into designing it – which didn’t feel to me like a workaround. It felt to me like the performance tuning that you have to do for any application.

      Contrasted to other server platforms that I’ve worked on, I cannot think of one where it would be significantly easier to solve this particular problem.

      So I dunno – we haven’t really had a moment where we’ve regretted choosing node.js. But as with any server platform, you do have to invest time to understand and apply the tool you choose.

To your final point, one of our main goals in this blog post series is to raise awareness of some of the stickier problems that arise when you attempt to build world class services on top of node.js – our hope is that, by providing and advocating libraries like compute-cluster, we can contribute to the community by promoting application design patterns that “just work”.

      In short, I agree that node.js isn’t a silver bullet that runs fast always without you having to think about it. But I don’t feel like there’s an alternative that is significantly better in this regard.

      More in depth Node.JS articles? You got it! We have at least another dozen node.js articles coming your way, every couple weeks!

      Thanks for reading!

      November 20th, 2012 at 13:46

      1. Ryan Doherty

        Awesome, thanks Lloyd! That’s exactly the details I was looking for and found very educational. Process startup time and memory overhead is something I wouldn’t have thought about.

        Good point too about how anything CPU intensive is going to be difficult to architect around :)

        November 20th, 2012 at 15:20

  3. NN

    wrong bracket:

    > function myRequestHandler(request, response) [

    should be {

    November 20th, 2012 at 17:23

    1. Lloyd Hilaiel

      doh! nice catch.

      January 31st, 2013 at 10:09

  4. Andrew Chilton

    Hi Lloyd,

    I think another thing to add to the list would be to start up a bunch of ‘worker’ servers which can accept pieces of work and return the results. The good thing about this is that the workers can be local or remote and therefore you can scale horizontally too.

    Not sure why, but I like the idea of a lot of little servers/workers all talking to each other. I guess like your solution which has to pool and keep track of the workers, this would also need to keep track of where the workers live and what they do.

    Am going to play with dnode and see if that works.

    * https://github.com/substack/dnode

    Thanks for the start of a great series of posts. :)

    Cheers,
    Andy

    November 20th, 2012 at 18:54

    1. Lloyd Hilaiel

      I’m very interested in this. I was looking for a zero config means of building a mesh of processes that could be resilient to failure and integrate behind the API exposed by compute cluster. Thus far I haven’t found anything that is really a great fit.

      A fun project would be to build (or contribute to) a little library that makes it trivial to build this mesh.

      In my initial look at dnode I found it not to be a *perfect* match – but maybe I should look again?

      January 31st, 2013 at 09:45

  5. alessioalex

    Great articles, keep up the good work.

    November 23rd, 2012 at 07:58

  6. dotnetcarpenter

    This sounds like a pre-fork model, like SilkJS.
I always thought that I would use SilkJS for expensive tasks and node.js for the rest. But then I became an app dev… Curious how you guys see the two environments coexisting, if at all.

    November 28th, 2012 at 15:09

  7. Ralf Mueller

It seems you made vertical scaling of the application possible.
    I am wondering how you deal with horizontal scaling.
    What do you do if you want to offload heavy computation to a different machine?

    January 29th, 2013 at 11:55

    1. Lloyd Hilaiel

      For us, we run multiple “request processing” processes on a single machine, and those processes handle all computation in children. Then we have this pattern replicated across multiple machines behind load balancers.

      So specific to our application we want things to break in this order as load gets overwhelming:

      1. computation requests begin failing (for us, this translates into blocking users at sign-in – which is the best we can do)
2. all other requests begin failing in some proportion such that the remainder complete in reasonable time (users who are served experience the service as responsive).

      But directly to your question, we do have other machines on the network that are not CPU bound and it’s an open question whether we’ll build distribution across the network into compute cluster, OR just reconfigure our deployments to scale the individual tiers in reasonable proportions. The former is sexy and could end up making the application more flexible and have less configuration required – the latter is easy. :)

      January 31st, 2013 at 09:50

  8. gggeek

    As a complete node ignorant, this looks a lot like reinventing the wheel.

    Apache has had a multiprocess execution model since forever, where the app developer does not need to care about managing “workers” because the webserver does it for him.

    Using FCGI, the (php) developer again gets the benefits of a worker-process-management solution which has been extensively tested.

    And I think that in Java-lands there are hundreds of well-known solutions to this problem, full of knobs to adjust and thread pools to tune.

    My question would thus be: does the magical promise of node.js being hyper-fast and hyper-scalable come in fact with nasty strings attached? How many real world web apps can be developed using purely async libraries? And how many of those async libraries do actually scale? How many node.js enthusiasts are even aware of the issue at hand?

    January 30th, 2013 at 13:04

    1. Lloyd Hilaiel

      You ask some hard questions. I offer personal biases in response and hope that they are useful:

      Apache: node requires drastically fewer resources per request.

      FCGI/php: exacerbates the resource requirement per request.

Java and Threadpools: The debate of async code vs. threaded code is deep. I personally prefer the former after extensive experience with the latter – the thing I like about JavaScript is that, for implementing async code, it is the most readable and natural language I’ve used. I might not want to type more than that; this is a minefield of opinions!

      “And how many of those async libraries do actually scale?” and “How many node.js enthusiasts are even aware of the issue at hand?”

See approach 3 in the post – some prominent library authors, I believe, are misguided in exposing async implementations that use libuv’s threadpool. Also, in the few node.js conferences I’ve spoken at, not everyone in the room is aware of the precise challenges of handling computation at scale in node.js – nor could they present viable approaches to address them. But we all start somewhere, get excited, and learn more. Don’t we?

      I don’t see a fundamental problem here. For now, Node.JS is a fun, stable, fast environment to build software – it has an awesome community, and solves the problems I’m working on as well as any other technology I’ve applied. So. Let’s play!

      January 31st, 2013 at 10:08

  9. Jorge

    What you need is Threads à gogo

    March 31st, 2013 at 21:52

Comments are closed for this article.