A Taste of JavaScript’s New Parallel Primitives

Author’s note: Since this post was written, the API of postMessage has changed slightly. When sending a SharedArrayBuffer with postMessage, the buffer should no longer be in the transfer list argument of the postMessage call. Thus, if sab is a SharedArrayBuffer object and w is a worker, w.postMessage(sab) sends the buffer to the worker.

You can visit MDN’s SharedArrayBuffer documentation for more detail.

TL;DR – We’re extending JavaScript with a primitive API that lets programmers use multiple workers and shared memory to implement true parallel algorithms in JavaScript.

Multicore computation

JavaScript (JS) has grown up, and it works so well that virtually every modern web page contains large amounts of JS code that we don’t ever worry about — it just runs as a matter of course. JS is also being used for more demanding tasks: Client-side image processing (in Facebook and Lightroom) is written in JS; in-browser office packages such as Google Docs are written in JS; and components of Firefox, such as the built-in PDF viewer, pdf.js, and the language classifier, are written in JS. In fact, some of these applications are in the form of asm.js, a simple JS subset that is a popular target language for C++ compilers; game engines originally written in C++ are being recompiled to JS to run on the web as asm.js programs.

The routine use of JS for these and many other tasks has been made possible by the spectacular performance improvements resulting from the use of Just-in-Time (JIT) compilers in JS engines, and by ever faster CPUs.

But JS JITs are now improving more slowly, and CPU performance improvement has mostly stalled. Instead of faster CPUs, all consumer devices — from desktop systems to smartphones — now have multiple CPUs (really CPU cores), and except at the low end they usually have more than two. A programmer who wants better performance for her program has to start using multiple cores in parallel. That is not a problem for “native” applications, which are all written in multi-threaded programming languages (Java, Swift, C#, and C++), but it is a problem for JS, which has very limited facilities for running on multiple CPUs (web workers, slow message passing, and few ways to avoid data copying).

Hence JS has a problem: if we want JS applications on the web to continue to be viable alternatives to native applications on each platform, we have to give JS the ability to run well on multiple CPUs.

Building Blocks: Shared Memory, Atomics, and Web Workers

Over the last year or so, Mozilla’s JS team has been leading a standards initiative to add building blocks for multicore computation to JS. Other browser vendors have been collaborating with us on this work, and our proposal is going through the stages of the JS standardization process. Our prototype implementation in Mozilla’s JS engine has helped inform the design, and is available in some versions of Firefox as explained below.

In the spirit of the Extensible Web we have chosen to facilitate multicore computation by exposing low-level building blocks that restrict programs as little as possible. The building blocks are a new shared-memory type, atomic operations on shared-memory objects, and a way of distributing shared-memory objects to standard web workers. These ideas are not new; for the high-level background and some history, see Dave Herman’s blog post on the subject.

The new shared memory type, called SharedArrayBuffer, is very similar to the existing ArrayBuffer type; the main difference is that the memory represented by a SharedArrayBuffer can be referenced from multiple agents at the same time. (An agent is either the web page’s main program or one of its web workers.) The sharing is created by sending the SharedArrayBuffer from one agent to another using postMessage:

let sab = new SharedArrayBuffer(1024)
let w = new Worker("...")
w.postMessage(sab)   // Send the (shared) buffer to the worker; per the author’s note, no transfer list

The worker receives the SharedArrayBuffer in a message:

let mem;
onmessage = function (ev) { mem = ev.data; }

This leads to the following situation where the main program and the worker both reference the same memory, which doesn’t belong to either of them:

[Figure: the main program and the worker both referencing the same shared memory]

Once a SharedArrayBuffer is shared, every agent that shares it can read and write its memory by creating TypedArray views on the buffer and using standard array access operations on the view. Suppose the worker does this:

let ia = new Int32Array(mem);
ia[0] = 37;

Then the main program can read the cell that was written by the worker, and if it waits until after the worker has written it, it will see the value “37”.
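
For example, the main program can create its own view on the same buffer and read the cell (a minimal sketch, reusing sab from the earlier snippet):

let view = new Int32Array(sab);   // the main program's view on the shared memory
console.log(view[0]);             // prints 37 once the worker's write is visible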

It’s actually tricky for the main program to “wait until after the worker has written the data”. If multiple agents read and write the same locations without coordinating access, then the result will be garbage. New atomic operations, which guarantee that program operations happen in a predictable order and without interruption, make such coordination possible. The atomic operations are present as static methods on a new top-level Atomics object.
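
As a minimal sketch of such coordination (the cell assignments here are illustrative), the worker can write the data and then atomically set a flag, and the main program can atomically check the flag before reading:

// In the worker: publish the data, then set the ready flag in cell 1.
ia[0] = 37;
Atomics.store(ia, 1, 1);

// In the main program: read the data only after the flag has been set.
if (Atomics.load(view, 1) === 1)
  console.log(view[0]);   // guaranteed to observe the worker's 37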

Speed and responsiveness

The two performance aspects we can address with multicore computation on the web are speed, i.e., how much work we can get done per unit of time, and responsiveness, i.e., the extent to which the user can interact with the browser while it’s computing.

We improve speed by distributing work onto multiple workers that can run in parallel: If we can divide a computation into four and run it on four workers that each get a dedicated core, we can sometimes quadruple the speed of the computation. We improve responsiveness by moving work out of the main program and into a worker, so that the main program is responsive to UI events even if a computation is ongoing.

Shared memory turns out to be an important building block for two reasons. First, it removes the cost of copying data. For example, if we render a scene on many workers but have to display it from the main program, the rendered scene must be copied to the main program, adding to rendering time and reducing the responsiveness of the main program. Second, shared memory makes coordination among the agents very cheap, much cheaper than postMessage, and that reduces the time that agents sit idle while they are waiting for communication.

No free lunch

It is not always easy to make use of multiple CPU cores. Programs written for a single core must often be significantly restructured and it is often hard to establish the correctness of the restructured program. It can also be hard to get a speedup from multiple cores if the workers need to coordinate their actions frequently. Not all programs will benefit from parallelism.

In addition, there are entirely new types of bugs to deal with in parallel programs. If two workers end up waiting for each other by mistake the program will no longer make progress: the program deadlocks. If workers read and write to the same memory cells without coordinating access, the result is sometimes (and unpredictably, and silently) garbage: the program has data races. Programs with data races are almost always incorrect and unreliable.
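
For instance (a contrived sketch), if two workers increment the same shared cell with plain array accesses, one of the updates can be silently lost; the atomic version cannot lose updates:

// Racy: both workers may read the same old value, so one increment is lost.
ia[0] = ia[0] + 1;

// Race-free: the read-modify-write is performed as one indivisible operation.
Atomics.add(ia, 0, 1);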

An example

NOTE: To run the demos in this post you’ll need Firefox 46 or later. You must also set the preference javascript.options.shared_memory to true in about:config unless you are running Firefox Nightly.

Let’s look at how a program can be parallelized across multiple cores to get a nice speedup. We’ll look at a simple Mandelbrot set animation that computes pixel values into a grid and displays that grid in a canvas, at increasing zoom levels. (Mandelbrot computation is what’s known as “embarrassingly parallel”: it is very easy to get a speedup. Things are usually not this easy.) We’re not going to do a technical deep dive here; see the end for pointers to deeper material.

The reason the shared memory feature is not enabled in Firefox by default is that it is still being considered by the JS standards body. The standardization process must run its course, and the feature may change along the way; we don’t want code on the web to depend on the API yet.

Serial Mandelbrot

Let’s first look briefly at the Mandelbrot program without any kind of parallelism: the computation is part of the main program of the document and renders directly into a canvas. (When you run the demo below you can stop it early, but later frames are slower to render so you only get a reliable frame rate if you let it run to the end.)

If you’re curious, in outline the serial kernel looks something like the following sketch (with illustrative names and constants, not the demo’s exact code):
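
const MAXIT = 200;   // iteration cutoff (an illustrative value)

// Iteration count for one point (cx, cy) of the complex plane.
function iterate(cx, cy) {
  let zx = 0, zy = 0, it = 0;
  while (it < MAXIT && zx*zx + zy*zy < 4) {
    let nx = zx*zx - zy*zy + cx;
    zy = 2*zx*zy + cy;
    zx = nx;
    it++;
  }
  return it;
}

// Fill a height-by-width grid of iteration counts for the given viewport.
function mandelbrot(grid, height, width, xmin, ymin, xmax, ymax) {
  for (let row = 0; row < height; row++) {
    for (let col = 0; col < width; col++) {
      let cx = xmin + (xmax - xmin) * (col / width);
      let cy = ymin + (ymax - ymin) * (row / height);
      grid[row * width + col] = iterate(cx, cy);
    }
  }
}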

Parallel Mandelbrot

Parallel versions of the Mandelbrot program will compute the pixels in parallel into a shared memory grid using multiple workers. The adaptation from the original program is conceptually simple: the mandelbrot function is moved into a web worker program, and we run multiple web workers, each of which computes a horizontal strip of the output. The main program will still be responsible for displaying the grid in the canvas.
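
In a sketch (the worker script name, grid dimensions, and message format below are all illustrative, not the demo’s exact code), the main program’s side of that adaptation might look like this:

const WIDTH = 800, HEIGHT = 600, WORKERS = 4;   // illustrative sizes

// One shared grid of 32-bit iteration counts, visible to every agent.
let sab = new SharedArrayBuffer(WIDTH * HEIGHT * 4);
let grid = new Int32Array(sab);

let pending = WORKERS;
for (let i = 0; i < WORKERS; i++) {
  let w = new Worker("mandel-worker.js");   // hypothetical worker script
  w.onmessage = function () {
    if (--pending === 0)
      displayGrid(grid);                    // draw the grid into the canvas (not shown)
  };
  // Send the shared buffer plus this worker's strip of rows.
  w.postMessage({ sab: sab,
                  from: i * (HEIGHT / WORKERS),
                  to: (i + 1) * (HEIGHT / WORKERS) });
}

And the corresponding worker (also a sketch):

// mandel-worker.js: compute the assigned strip, then report back.
onmessage = function (ev) {
  let grid = new Int32Array(ev.data.sab);
  for (let row = ev.data.from; row < ev.data.to; row++)
    computeRow(grid, row);                  // the serial kernel, one row at a time
  postMessage("done");
};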

We can plot the frame rate (Frames per Second, FPS) for this program against the number of cores used, to get the plot below. The computer used in the measurements is a late-2013 MacBook Pro, with four hyperthreaded cores; I tested with Firefox 46.0.

[Figure: frames per second plotted against the number of cores used]

The program speeds up almost linearly as we go from one to four cores, increasing from 6.9 FPS to 25.4 FPS. After that, the increases are more modest as the program starts running not on new cores but on the hyperthreads on the cores that are already in use. (The hyperthreads on the same core share some of the resources on the core, and there will be some contention for those resources.) But even so the program speeds up by three to four FPS for each hyperthread we add, and with 8 workers the program computes 39.3 FPS, a speedup of 5.7 over running on a single core.

This kind of speedup is very nice, obviously. However, the parallel version is significantly more complicated than the serial version. The complexity has several sources:

  • For the parallel version to work properly it needs to synchronize the workers and the main program: the main program must tell the workers when (and what) to compute, and the workers must tell the main program when to display the result. Data can be passed both ways using postMessage, but it is often better (i.e., faster) to pass data through shared memory, and doing that correctly and efficiently is quite complicated.
  • Good performance requires a strategy for how to divide the computation among the workers, to make the best use of the workers through load balancing. In the example program, the output image is therefore divided into many more strips than there are workers (one such strategy is sketched after this list).
  • Finally, there is clutter that stems from shared memory being a flat array of integer values; more complicated data structures in shared memory must be managed manually.
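
One common way to implement that load balancing, sketched here with illustrative names, is to keep a shared counter of the next unclaimed strip and let each worker claim strips with an atomic increment until none remain:

// `ctl` is an Int32Array on shared memory; cell 0 holds the next strip index.
function workLoop(ctl, numStrips) {
  for (;;) {
    let strip = Atomics.add(ctl, 0, 1);   // atomically claim one strip (returns the old value)
    if (strip >= numStrips)
      break;                              // all strips have been claimed
    computeStrip(strip);                  // hypothetical per-strip kernel
  }
}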

Consider synchronization: The new Atomics object has two methods, wait and wake, which can be used to send a signal from one worker to another: one worker waits for a signal by calling Atomics.wait, and the other worker sends that signal using Atomics.wake. However, these are flexible low-level building blocks; to implement synchronization, the program will additionally have to use atomic operations such as Atomics.load, Atomics.store, and Atomics.compareExchange to read and write state values in shared memory.
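
As a minimal sketch of that signaling (the choice of cell is illustrative), one worker can sleep until a shared cell changes while another updates the cell and then wakes it:

// Worker A: sleeps if ia[0] is still 0, until another agent wakes it.
Atomics.wait(ia, 0, 0);
console.log(ia[0]);        // runs after B's store and wake

// Worker B: publish a value, then wake up to one agent waiting on cell 0.
Atomics.store(ia, 0, 42);
Atomics.wake(ia, 0, 1);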

Adding further to that complexity, the main thread of a web page is not allowed to call Atomics.wait because it is not good for the main thread to block. So while workers can communicate among themselves using Atomics.wait and Atomics.wake, the main thread must instead listen for an event when it is waiting, and a worker that wants to wake the main thread must post that event with postMessage.
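
A sketch of that pattern (the message content is illustrative):

// In a worker: publish the result in shared memory, then nudge the main thread.
Atomics.store(ia, 0, result);   // 'result' was computed earlier
postMessage("ready");           // an event, since the main thread cannot call Atomics.wait

// In the main program: react to the event instead of blocking.
w.onmessage = function (ev) {
  if (ev.data === "ready")
    console.log(Atomics.load(ia, 0));
};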

(Those rushing out to test that should know that wait and wake are called futexWait and futexWake in Firefox 46 and Firefox 47. See the MDN page for Atomics for more information.)

It is possible to build good libraries to hide much of the complexity, and if a program — or usually, an important part of a program — can perform significantly better when running on multiple cores rather than on one, then the complexity can really be worth it. However, parallelizing a program is not a quick fix for poor performance.

With the disclaimers above, the full code for the parallel version follows the structure sketched earlier.

Further information

For reference material on the available APIs, read the proposed specification, which is largely stable now. The Github repository for the proposal also has some discussion documents that might be helpful.

Additionally, the Mozilla Developer Network (MDN) has documentation for SharedArrayBuffer and Atomics.

About Lars T Hansen

I'm a JavaScript compiler engineer at Mozilla. Previously I worked on ActionScript3 at Adobe and on JavaScript and other browser things at Opera.



26 comments

  1. Grant Stevens

    Excellent article!

    This is also my first introduction to Atomics. Thanks for the write-up!

    May 5th, 2016 at 13:16

  2. Sendil

    Hi Hansen,

    I am unable to enable javascript.options.shared_memory in my Chrome and Firefox browsers. What shall I do to enable it? Please add it to your article, as it will be useful for other viewers as well.

    May 5th, 2016 at 21:09

    1. Lars T Hansen

      In Firefox 46 or later (not Chrome) you can create a new tab, and in the address field of that tab you type in “about:config”. You may have to click through a warning page after that. On the page that contains all the options that you then come to, enter “shared_memory” into the search field, and it will find this option. Double-click the option to change its value from false to true.

      May 5th, 2016 at 23:06

  3. Jonathan Raoult

    Very exciting things. Can’t wait :)

    One small thing regarding the “Speed and responsiveness” that could be a little bit incomplete: we already have a way today to transfer binary data with postMessage in a “no copy” way (based on ownership) thanks to Transferable.

    May 5th, 2016 at 21:42

    1. Lars T Hansen

      Yes, transferring works to avoid copying in some cases. But the program often has to be restructured radically to make that work, whereas with shared memory the program can more often retain (more of) its original shape. More important is that there are many patterns of memory access that are possible with shared memory but not with transferring; typically, these are cases where multiple workers use the same input data to produce individual slices of the output. With transferring, no sharing is possible. Another case is when data are processed first by row, then by column. We could use transferring to split a grid by rows and transfer the rows to workers, but to split by columns and then transfer columns later we’d need to make a copy of the grid.

      May 5th, 2016 at 23:11

  4. Yossi Kreinin

    I’m curious about 2 things:

    1. How is the accidental modification of random JS objects from multiple threads prevented – that is, how is the communication restricted to explicitly shared memory? Is it done by using OS process underneath?

    2. Exposing atomics greatly diminishes the effectiveness of automated race detection tools. Is there a specific rationale for not exposing an interface along the lines of Cilk instead – say, a parallel for loop and a parallel function call that can be waited for? The mandelbrot example looks like it could be handled just fine (meaning, just as efficiently and with a bit less code) with a parallel for loop with what OpenMP calls a dynamic scheduling policy (so an atomic counter hidden in its guts.)

    There do exist tasks which can be handled more efficiently using raw atomics than using a Cilk-like interface, but in my experience they are the exception rather than the rule; on the other hand parallelism bugs are the rule rather than the exception, and so effective automated debugging tools are a godsend.

    Cilk comes with great race detection tools and these can be developed for any system with a similar interface; the thing enabling this is that a Cilk program’s task dependency graph is a fork-join graph, whereas with atomics it’s a generic DAG and the number of task orderings an automated debugging tool has to try with a DAG is potentially very large, whereas with a fork-join graph it’s always just two orderings. I wrote about it here http://yosefk.com/blog/checkedthreads-bug-free-shared-memory-parallelism.html – my point though isn’t to plug my own Cilk knock-off that I present in that post but to elaborate on the benefits of a Cilk-like interface relatively to raw atomics.

    (This is a duplicate of my HN comment at https://news.ycombinator.com/item?id=11642394)

    May 6th, 2016 at 01:03

    1. Lars T Hansen

      The protection against accidental modification of “random” JS objects is that you can only share the memory of a SharedArrayBuffer. No other JS objects in one agent can be modified by another agent. And of course, modifications to the memory of a SharedArrayBuffer are range checked, so you can’t scribble outside that memory.

      There is indeed a specific rationale for exposing this very low level API. We see it as both a substrate for building higher-level abstractions for JS on the one hand (for example, one of the first things I built was a data-parallel framework) and as a compilation target for C/C++ (as part of asm.js) on the other hand. The current API is somewhat of a compromise between the needs of the two use cases.

      But even if we didn’t worry about the compilation target use case it’s not obvious what form of higher-level abstraction would be appropriate. We actually spent a lot of effort on a prototype of such an abstraction in the form of Parallel JavaScript (PJS, see https://bugzil.la/801869 or http://smallcultfollowing.com/babysteps/blog/2012/01/09/parallel-javascript/), but that effort failed for reasons that are mostly well understood (extensive discussion in this thread: https://groups.google.com/forum/#!topic/mozilla.dev.tech.js-engine/H-YEsejE6DA).

      The Mandelbrot example is just a taster and not really representative of the types of programs we want to write. The more interesting use cases require more control, take a look at the publications linked from http://halide-lang.org. (I know Cilk would be a good fit for some of those, but a Cilk API would not be appropriate for asm.js, which remains a necessary use case.)

      May 7th, 2016 at 03:50

  5. Prem

    But CPU usage is high and my laptop temperature reaches 81°C (is this normal?)

    May 6th, 2016 at 10:17

    1. Lars T Hansen

      My CPU gets warmer too, when I run this example on all the available cores; this is certainly expected… Hard to say more than that without knowing more about the problem you’re seeing.

      May 7th, 2016 at 03:52

      1. Prem

        Heat is the only problem. The example runs correctly.

        May 7th, 2016 at 06:36

  6. James G.

    Excellent article Lars!

    I was going to point out the same thing as Prem did. Heat is a problem without a doubt..

    May 12th, 2016 at 09:19

    1. Lars T Hansen

      Clearly when you’re computing 100% on all cores the chip will be drawing a lot of power and generating heat, but it’s designed for that. Is there something beyond that that is a concern?

      May 12th, 2016 at 09:39

      1. Daniel Earwicker

        Strange to see feedback complaining about CPU getting hot.

        One way to describe this feature is that it is designed to allow you to maximise the temperature! That is what computation does.

        May 12th, 2016 at 16:42

  7. Slava

    Yeah! Deadlocks and semaphores in browser!

    May 12th, 2016 at 12:32

    1. Lars T Hansen

      It’s a sharp tool and it’s not without its problems. Several things can make the problem less acute:

      – Not being able to block on the main thread ensures that the browser does not lock up.
      – Workers can still lock up, so dev tools will have to keep up here and allow deadlocks and other problems to be diagnosed. (Performance problems too.)
      – Again, using frameworks for common patterns of parallelism will reduce the chance of cutting yourself. I’ve built some simple frameworks myself but this is where I’d expect to see the JS community do some good work.

      May 13th, 2016 at 00:03

  8. Jace A Mogill

    For those who find this sort of thing interesting and want to experiment now, Extended Memory Semantics (https://github.com/SyntheticSemantics/ems and https://www.npmjs.com/package/ems) adds shared memory parallelism to Node.js, works on JSON objects, not just integers, and supports persistence, transactions, and other shared memory parallel programming capabilities.

    [Disclaimer: I am the author of EMS]

    -J

    May 12th, 2016 at 20:39

    1. Lars T Hansen

      Interesting work! I will be looking at that in some more detail…

      May 13th, 2016 at 00:08

  9. mahem

    Eagerly waiting for this to be shipped in browsers.

    Very useful for optimising and moving expensive tasks to workers via shared memory.

    May 12th, 2016 at 22:19

  10. mahem

    Excellent article. waiting to be shipped in browsers.

    May 12th, 2016 at 22:21

  11. Nidin Vinayakan

    For all shared memory supporters, here is my Global Illumination (path tracing) renderer implementation using SharedArrayBuffer
    Demo : https://01alchemist.com/projects/xrenderer/example.html
    Source: https://github.com/nidin/xrenderer

    Heat Warning!

    May 13th, 2016 at 09:48

  12. Nils

    Hey,
    nice and interesting work!
    Is there a way yet to determine from a web page how many cores a computer has available? This is a very necessary thing, because otherwise one is always guessing (4? 8?) and thus either has too few workers (CPUs remain idle) or too many workers (switching costs reduce the speedup).

    May 14th, 2016 at 13:10

    1. Lars T Hansen

      Try navigator.hardwareConcurrency: https://developer.mozilla.org/en-US/docs/Web/API/NavigatorConcurrentHardware/hardwareConcurrency.

      May 17th, 2016 at 23:34

      1. Nils

        Thanks, that looks helpful. Even though browser support is not the best, still a huge improvement in the browsers it is available in.

        May 18th, 2016 at 04:47

  13. Max Franz

    Great article!

    Shared memory is definitely a step in the right direction. I would like to see support for shared objects (i.e. JSON) for increased ease of use. While sharing arrays of numbers is nice for straightforward numerical computation, I can see it being awkward for some more complex cases (where you’d have to come up with object-SharedArrayBuffer serialisation/deserialisation, in which case you may as well just copy).

    I’ll be adding SharedArrayBuffer support to Weaver.js (http://weaver.js.org). The ems project looks like it could be useful on the Node.js side — thanks for the link, Jace.

    May 24th, 2016 at 08:47

    1. Lars T Hansen

      Clearly we will want some kind of object support in shared memory. It’s possible to construct user-level object systems that go some way toward that, I’ve built a preprocessor (https://github.com/lars-t-hansen/flatjs) and used it to build a ray tracer with a scene graph in shared memory, for example. (FlatJS is not nearly production quality but suggestive of one not very dynamic type system suitable for high performance programs.)

      More elaborate object systems will need object management – reference counting or garbage collection, not an easy problem to solve in user code and something that will really need proper built-in support.

      May 24th, 2016 at 12:58

      1. Daniel Earwicker

        The dream would be for JS to gain a specific built-in concept of a “record” (an immutable object), which is only able to reference other records and primitives, so that a tree of records can be shared between workers.

        And also a neat syntax for creating modified clones, a la the Ocaml `with` keyword:

        let b = { a with firstName = "Homer" }

        or Haskell’s:

        let b = a { firstName = "Homer" }

        (This would also lend itself to being type-checked in TypeScript.)

        May 24th, 2016 at 15:11
