Introducing SIMD.js

SIMD stands for Single Instruction Multiple Data, and is the name for performing operations on multiple data elements together. For example, a SIMD add instruction can add multiple values, in parallel. SIMD is a very popular technique for accelerating computations in graphics, audio, codecs, physics simulation, cryptography, and many other domains.
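To make the idea concrete, here is a minimal sketch in plain JavaScript that models what a 4-wide SIMD add computes. The helper name `add4` is hypothetical, and the loop only emulates the semantics; on SIMD hardware the four additions would be carried out by a single instruction.

```javascript
// Emulates the semantics of a 4-lane SIMD add using typed arrays.
// `add4` is a hypothetical helper for illustration, not part of any API.
function add4(a, b) {
  const out = new Float32Array(4);
  for (let lane = 0; lane < 4; lane++) {
    out[lane] = a[lane] + b[lane]; // hardware would handle all lanes at once
  }
  return out;
}

const a = Float32Array.of(1, 2, 3, 4);
const b = Float32Array.of(10, 20, 30, 40);
const sum = add4(a, b);
console.log(Array.from(sum)); // lanes: 11, 22, 33, 44
```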

In addition to delivering performance, SIMD also reduces power usage, as it uses fewer instructions to do the same amount of work.

SIMD.js

SIMD.js is a new API being developed by Intel, Google, and Mozilla for JavaScript which introduces several new types and functions for doing SIMD computations. For example, the Float32x4 type represents 4 float32 values packed up together. The API contains functions to operate on those values together, including all the basic arithmetic operations, and operations to rearrange, load, and store such values. The intent is for browsers to implement this API directly, and provide optimized implementations that make use of SIMD instructions in the underlying hardware.
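The kinds of operations described above (arithmetic, rearranging, loading, and storing) can be sketched with a plain-JavaScript stand-in. Everything in the `f32x4` object below is a hypothetical emulation built on `Float32Array`; it is not the SIMD.js API itself and gains no hardware acceleration, but it shows the shape of code written against a packed 4 x float32 type.

```javascript
// Plain-JavaScript stand-in for a packed 4 x float32 value.
// All names here (f32x4, load, store, add, mul, swizzle) are hypothetical
// helpers that mirror the categories of operations described in the text.
const f32x4 = {
  load: (arr, i) => arr.slice(i, i + 4),            // read 4 lanes from an array
  store: (arr, i, v) => arr.set(v, i),              // write 4 lanes back
  add: (a, b) => a.map((x, lane) => x + b[lane]),   // lane-wise add
  mul: (a, b) => a.map((x, lane) => x * b[lane]),   // lane-wise multiply
  swizzle: (a, l0, l1, l2, l3) =>                   // rearrange lanes
    Float32Array.of(a[l0], a[l1], a[l2], a[l3]),
};

const data = Float32Array.of(1, 2, 3, 4, 5, 6, 7, 8);
const first = f32x4.load(data, 0);                   // lanes 1, 2, 3, 4
const second = f32x4.load(data, 4);                  // lanes 5, 6, 7, 8
const reversed = f32x4.swizzle(second, 3, 2, 1, 0);  // lanes 8, 7, 6, 5
f32x4.store(data, 0, f32x4.add(first, reversed));    // data[0..3] is now 9, 9, 9, 9
console.log(Array.from(data));
```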

The focus is currently on supporting both x86 platforms with SSE and ARM platforms with NEON. We’re also interested in the possibility of supporting other platforms, potentially including MIPS, Power, and others.

SIMD.js is originally derived from the Dart SIMD specification, and it is rapidly evolving to become a more general API, and to cover additional use cases such as those that require narrower integer types, including Int8x16 and Int16x8, and saturating operations.
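As an illustration of what a saturating operation on a narrow integer type means, the following sketch emulates a saturating add over sixteen int8 lanes and contrasts it with ordinary wrapping arithmetic. The helper name `addSaturate` is hypothetical and the code is a plain-JavaScript model, not the API under development.

```javascript
// Saturating add for sixteen int8 lanes: results clamp to [-128, 127]
// instead of wrapping around. `addSaturate` is a hypothetical helper name.
function addSaturate(a, b) {
  const out = new Int8Array(16);
  for (let lane = 0; lane < 16; lane++) {
    const sum = a[lane] + b[lane];                   // ordinary JS number, no overflow
    out[lane] = Math.max(-128, Math.min(127, sum));  // clamp to the int8 range
  }
  return out;
}

const x = new Int8Array(16).fill(100);
const y = new Int8Array(16).fill(100);
const saturated = addSaturate(x, y);
// For comparison: storing 200 into an Int8Array wraps around.
const wrapped = Int8Array.from(x, (v, i) => v + y[i]);
console.log(saturated[0], wrapped[0]); // 127 -56
```

Saturation is often what image and audio code wants, since a value that overflows should stay at the maximum rather than flip sign.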

SIMD.js is a fairly low-level API, and it is expected that libraries will be written on top of it to expose higher-level functionality such as matrix operations, transcendental functions, and more.

In addition to being usable in regular JS, work is also underway to add SIMD.js to asm.js, so that it can be used from asm.js programs such as those produced by Emscripten. In Emscripten, SIMD can be achieved through the built-in autovectorization, the generic SIMD extensions, or the new (and still growing) Emscripten-specific API. Emscripten will also be implementing subsets of popular headers such as <xmmintrin.h> with wrappers around the SIMD.js APIs, as additional ways to ease porting SIMD code in some situations.

SIMD.js Today

The SIMD.js API itself is in active development. The ecmascript_simd GitHub repository currently serves as a provisional specification and also provides a polyfill implementation that offers the functionality, though of course not the accelerated performance, of the SIMD API on existing browsers. It also includes some benchmarks which double as examples of basic SIMD.js usage.

To see SIMD.js in action, check out the demo page accompanying the IDF2014 talk on SIMD.js.

The API has been presented to TC39, which has approved it for stage 1 (Proposal). Work is proceeding in preparation for subsequent stages, which will involve proposing something closer to a finalized API.

The SIMD.js implementation in Firefox Nightly is in active development. Internet Explorer has listed SIMD.js as “under consideration”. There is also a prototype implementation in a branch of Chromium.

Short SIMD and Long SIMD

One of the uses of SIMD is to accelerate processing of large arrays of data. If you have an array of N elements, and you want to do roughly the same thing to every element in the array, you can divide N by whatever SIMD size the platform makes available and run that many instances of your SIMD subroutine. Since N can be very large, I call these kinds of problems long SIMD problems.
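A long SIMD loop of this kind can be sketched as follows. The names `WIDTH` and `scaleArray` are hypothetical, and the inner 4-lane loop is a plain-JavaScript stand-in for what would be a single SIMD multiply. The sketch also handles the case where N is not a multiple of the SIMD width by finishing with a scalar tail loop.

```javascript
// Scales every element of `input` by `factor`, four elements per step.
// `WIDTH` and `scaleArray` are hypothetical names; the inner 4-lane loop
// stands in for a single SIMD multiply.
const WIDTH = 4;

function scaleArray(input, factor) {
  const out = new Float32Array(input.length);
  const vectorEnd = input.length - (input.length % WIDTH);
  let i = 0;
  for (; i < vectorEnd; i += WIDTH) {
    for (let lane = 0; lane < WIDTH; lane++) {
      out[i + lane] = input[i + lane] * factor;
    }
  }
  for (; i < input.length; i++) {
    out[i] = input[i] * factor; // scalar tail for the leftover elements
  }
  return out;
}

const scaled = scaleArray(Float32Array.of(1, 2, 3, 4, 5, 6), 2);
console.log(Array.from(scaled)); // 2, 4, 6, 8, 10, 12
```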

Another use of SIMD is to accelerate processing of clusters of data. RGB or RGBA pixels, XYZW coordinates, or 4×4 matrices are all examples of such clusters, and I call problems which are expressed in these kinds of types short SIMD problems.

SIMD is a broad domain, and the boundary between short and long SIMD isn’t always clear, but at a high level, the two styles are quite different. Even the terminology used to describe them features a split: In the short SIMD world, the operation which copies a scalar value into every element of a vector value is called a “splat”, while in the long vector world the analogous operation is called a “broadcast”.
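For a short SIMD example, consider dimming a single RGBA pixel. The sketch below emulates a splat followed by a lane-wise multiply; `splat` and `mul4` are hypothetical helper names written in plain JavaScript, chosen only to illustrate the terminology.

```javascript
// `splat` and `mul4` are hypothetical helpers emulating 4-lane operations.
const splat = (s) => Float32Array.of(s, s, s, s);        // copy a scalar into every lane
const mul4 = (a, b) => a.map((v, lane) => v * b[lane]);  // lane-wise multiply

const pixel = Float32Array.of(0.5, 0.25, 1.0, 1.0); // one cluster: R, G, B, A
const half = splat(0.5);                            // lanes: 0.5, 0.5, 0.5, 0.5
const dimmed = mul4(pixel, half);
console.log(Array.from(dimmed)); // 0.25, 0.125, 0.5, 0.5
```

Here the four lanes hold the components of one pixel, whereas in the long SIMD style the four lanes would typically hold the same component of four different pixels.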

SIMD.js is primarily a “short” style API, and is well suited for short SIMD problems. SIMD.js can also be used for long SIMD problems, and it will still deliver significant speedups over plain scalar code. However, its fixed-length types aren’t going to achieve maximum performance of some of today’s CPUs, so there is still room for another solution to be developed to take advantage of that available performance.

Portability and Performance

There is a natural tension in many parts of SIMD.js between the desire to have an API which runs consistently across all important platforms, and the desire to have the API run as fast as possible on each individual platform.

Fortunately, there is a core set of operations which are very consistent across a wide variety of platforms. These operations include most of the basic arithmetic operations and form the core of SIMD.js. In this set, little to no overhead is incurred because many of the corresponding SIMD.js functions map directly to individual hardware instructions.

However, there are also many operations that perform well on one platform and poorly on others. These can lead to surprising performance cliffs. The current approach of the SIMD.js API is to focus on the things that can be done well with as few performance cliffs as possible. It is also focused on providing portable behavior. In combination, the aim is to ensure that a program which runs well on one platform will likely run, and run well, on another.

In future iterations of SIMD.js, we expect to expand the scope and include more capabilities as well as mechanisms for querying capabilities of the underlying platform. Similar to WebGL, this will allow programs to determine what capabilities are available to them so they can decide whether to fall back to more conservative code, or disable optional functionality.
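The querying mechanism itself has not been designed yet, so the following is only a sketch of the general pattern, under the assumption that an implementation or polyfill exposes a global object named `SIMD`. The function names `queryCapabilities` and `chooseEffectQuality` are hypothetical. The global object is passed in as a parameter so the logic can be exercised without a SIMD-capable browser.

```javascript
// Hypothetical capability check. Assumes SIMD support shows up as an
// object named `SIMD` on the global; finer-grained queries are not modeled.
function queryCapabilities(globalObject) {
  const simd = globalObject.SIMD;
  return { simd: typeof simd === "object" && simd !== null };
}

// The application decides how to react: use the fast path when available,
// otherwise fall back to conservative settings.
function chooseEffectQuality(caps) {
  return caps.simd ? "high" : "low";
}

const caps = queryCapabilities(globalThis);
console.log("SIMD available:", caps.simd, "quality:", chooseEffectQuality(caps));
```

A program would typically run such a check once at startup and select its code paths accordingly, much as WebGL applications probe for extensions before enabling optional effects.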

The overall vision

SIMD.js will accelerate a wide range of demanding applications on the web today, including games, video and audio manipulation, scientific simulations, and more. Applications will be able to use the SIMD.js API directly, libraries will be able to use SIMD.js to expose higher-level interfaces that applications can use, and Emscripten will compile C++ with popular SIMD idioms into optimized SIMD.js code.

Looking forward, SIMD.js will continue to grow, to provide broader functionality. We hope to eventually accompany SIMD.js with a long-SIMD-style API as well, in which the two APIs can cooperate in a manner very similar to the way that OpenCL combines explicit vector types with the implicit long-vector parallelism of the underlying programming model.

I've worked at Cray, Apple, and Google on several different compilers in a variety of contexts. I'm currently a member of the Mozilla Research team primarily working on asm.js, SIMD.js, and Emscripten.

10 comments

Peter Jensen

Great post. One minor correction: TC39 approved this for stage 1 (Proposal) at the July-2014 meeting

October 30th, 2014 at 09:37
Peter Jensen

Crosswalk (https://crosswalk-project.org/), an HTML5 web-runtime for building hybrid apps, supports the SIMD.js API as well. Crosswalk is available via Intel’s XDK (xdk.intel.com)

October 30th, 2014 at 10:08
AlejandroG

I think this is cool and useful but why not implementing real important things like the standard DEC64 for floating point proposed by Douglas Crockford, I mean, still creating roads on the mud.

October 30th, 2014 at 12:12
1. Rick Waldron
  
  First of all, Doug Crockford is not a member of TC39 anymore. Secondly, what you’re asking for is orthogonal to SIMD. Value Types and Typed Objects are already in progress for ES7: http://www.slideshare.net/BrendanEich/value-objects http://wiki.ecmascript.org/doku.php?id=harmony:typed_objects https://bugzilla.mozilla.org/show_bug.cgi?id=578700
  
  October 31st, 2014 at 10:57
Zac Bowling

How about threading.js or pthread.js.

October 30th, 2014 at 15:48
1. Luke
  
  You mean web workers? Or being able to use them in asm.js?
  
  October 30th, 2014 at 20:04
Ningxin Hu

Excellent overview!
For information, you can find source of v8 SIMD.js prototype in https://github.com/crosswalk-project/v8-crosswalk. And you can download latest Chromium SIMD.js build at https://drive.google.com/folderview?id=0B9RVWZYRtYFeOVlSMm1GdmZxM0k&usp=sharing

October 30th, 2014 at 17:43
Jonathan Ragan-Kelley

It is good to see this sort of thing progress in real web runtimes, but one part of this seriously concerns me: your stated view of “short” vs. “long” SIMD suggests incorrect assumptions about how even an explicitly-sized short-vector programming model should be most efficiently used on current hardware. Specifically, it has not been the case for years that vectorization across structure components of application-level short vectors is actually the most efficient vectorization strategy for most code. Instead, the fastest vectorization strategies nearly always rely on applying a “long SIMD” view, vectorizing the innermost loop over a large data parallel dimension (like the x coordinate of a pixel array) using individual short-vector-at-a-time iterations to consume 4 or 8 pixels per loop iteration with more operations to compute on R/G/B, rather than a single pixel per loop iteration to compute RGB at once.

This parallels the “scalarization” movement in GPU shader implementations almost a decade ago. GPU shading languages expose 3- and 4-vector types natively, as well as dot products on these, etc., because these are extremely common operations in graphics code. They *do not*, however, exploit this for “short SIMD” execution, because this is almost never as efficient as simply scalarizing this style of code (turning a dot(vec3, vec3) operation into 3 scalar MULs and 2 scalar ADDs, or a scalar MUL and 2 scalar MADs, of the individual components) and vectorizing across the larger data-parallel dimension (separate pixels).

This isn’t specific to GPUs—it applies equally to short-vector SIMD architectures like NEON and SSE, to say nothing of >4-element short-vector SIMD architectures like AVX (or NEON/SSE when applied to partial-precision types). This is the execution strategy around which Intel’s ISPC is based, and for good reason: even on SSE, which was in no small part designed to process float32x4 vectors, this is the right strategy the vast majority of the time.

The reason why this is nearly always faster is easiest to show graphically, but is just a function of occupancy in these short vectors. Even if code vectorized across structure dimensions like RGBA and XYZW fully occupies the vectors on a given architecture, not all operations will be over full 4-vectors. Instead, real code—even doing geometric transforms in 3D homogeneous coordinates—involves some mix of 1-, 2-, 3-, and 4-component vectors. If we explode these into scalar operations, and then simply compute each scalar operation across 4 neighboring records simultaneously, we always fully utilize the 4 vector lanes for all operations (modulo branching), while the “short SIMD” version will only be partially utilized for any operations affecting only 1, 2, or 3 components. This is related to the fact that vectorizing across the larger data-parallel dimensions tends to yield more homogeneous operations, which take better advantage of vector instructions (it is much more often the case that your algorithm wants to treat 4 neighboring red values the same way, than that it wants to treat all of red, green, blue, and alpha the same way within a single pixel). The end result: the “short SIMD” view you emphasize gives significantly lower utilization for the significant majority of code, at least in the graphics cases where 4-vectors even are very common application data types.

Starting your design from this view of the world concerns me because it leads to considering almost exclusively examples of vectorization strategies which will not be the most efficient. It also leads you to focus on different features: operating on vector condition/mask values and doing blending/predication is essential for the “scalarized” form, but rarely matters as much when vectors only encode XYZ/RGB-type tuples; similarly, if you imagine people mostly want to pack XYZ/RGB vectors, you will see little use in >4-element vector types, while it might be perfectly reasonable for most code to use logically, e.g., 512-bit vectors which then each get compiled to some constant number of the underlying machine ops.

Said with all respect and optimism, because I’d really love to be able to target this!

October 30th, 2014 at 17:57
1. Dan Gohman
  
  Indeed. In a hypothetical future where SIMD.js is running within a SPMD context, we would very likely switch implementation strategies. SIMD.js might then play the role that types like vec3 play in graphics languages today, where the JIT may choose to scalarize it. In such a system, extending SIMD.js to >4-element types would be very plausible.
  
  We don’t know if SPMD will be the answer, or even if so, if the SPMD kernel language will even be JS at all. There are a lot of possibilities. What we know is that SIMD.js has several important use cases today, and that there are several directions it can evolve in to serve the needs of a variety of futures.
  
  October 30th, 2014 at 19:56
Tom

Games are slow on my Asus tablet with Intel inside.
I think this is part of the motivation here and I’m encouraged.

November 4th, 2014 at 10:29

By Dan Gohman, Robert Nyman [Editor emeritus]
