Making WebAssembly even faster: Firefox’s new streaming and tiering compiler

People call WebAssembly a game changer because it makes it possible to run code on the web faster. Some of these speedups are already present, and some are yet to come.

One of these speedups is streaming compilation, where the browser compiles the code while the code is still being downloaded. Up until now, this was just a potential future speedup. But with the release of Firefox 58 next week, it becomes a reality.

Firefox 58 also includes a new 2-tiered compiler. The new baseline compiler compiles code 10–15 times faster than the optimizing compiler.

Combined, these two changes mean we compile code faster than it comes in from the network.

On a desktop, we compile 30–60 megabytes of WebAssembly code per second. That’s faster than the network delivers the packets.

If you use Firefox Nightly or Beta, you can give it a try on your own device. Even on a pretty average mobile device, we can compile at 8 megabytes per second, which is faster than the average download speed for pretty much any mobile network.

This means your code executes almost as soon as it finishes downloading.
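
To get a feel for what compiling faster than the network buys you, here is a back-of-the-envelope sketch. The 30 MB/s compile rate is the low end of the desktop figure above. The 10 MB module size and the 5 MB/s network speed are made-up values for illustration, and the model idealizes streaming as a perfect overlap between downloading and compiling.

```javascript
// Rough model of when a module is ready to run.
// moduleMB and networkMBps are hypothetical inputs; compileMBps can be taken
// from the throughput figures quoted in the article.
function readyTimes({ moduleMB, networkMBps, compileMBps }) {
  const downloadSeconds = moduleMB / networkMBps;
  const compileSeconds = moduleMB / compileMBps;
  return {
    // Without streaming: compiling only starts after the download finishes.
    sequential: downloadSeconds + compileSeconds,
    // With streaming (idealized): the two overlap, so the slower one dominates.
    streaming: Math.max(downloadSeconds, compileSeconds),
  };
}

const estimate = readyTimes({ moduleMB: 10, networkMBps: 5, compileMBps: 30 });
console.log(estimate);
```

With these inputs the download takes 2 seconds and dominates the streaming case, so the compile time is effectively hidden behind the network.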

Why is this important?

Web performance advocates get prickly when sites ship a lot of JavaScript. That’s because downloading lots of JavaScript makes pages load slower.

This is largely because of the parse and compile times. As Steve Souders points out, the old bottleneck for web performance used to be the network. But the new bottleneck for web performance is the CPU, and particularly the main thread.

Old bottleneck, the network, on the left. New bottleneck, work on the CPU such as compiling, on the right

So we want to move as much work off the main thread as possible. We also want to start it as early as possible so we’re making use of all of the CPU’s time. Even better, we can do less CPU work altogether.

With JavaScript, you can do some of this. You can parse files off of the main thread, as they stream in. But you’re still parsing them, which is a lot of work, and you have to wait until they are parsed before you can start compiling. And for compiling, you’re back on the main thread. This is because JS is usually compiled lazily, at runtime.

Timeline showing packets coming in on the main thread, then parsing happening simultaneously on another thread. Once parse is done, execution begins on main thread, interrupted occasionally by compiling

With WebAssembly, there’s less work to start with. Decoding WebAssembly is much simpler and faster than parsing JavaScript. And this decoding and the compilation can be split across multiple threads.

This means multiple threads will be doing the baseline compilation, which makes it faster. Once it’s done, the baseline compiled code can start executing on the main thread. It won’t have to pause for compilation, like the JS does.

Timeline showing packets coming in on the main thread, and decoding and baseline compiling happening across multiple threads simultaneously, resulting in execution starting faster and without compiling breaks.

While the baseline compiled code is running on the main thread, other threads work on making a more optimized version. When the more optimized version is done, it can be swapped in so the code runs even faster.

This changes the cost of loading WebAssembly to be more like decoding an image than loading JavaScript. And think about it… web performance advocates do get prickly about JS payloads of 150 kB, but an image payload of the same size doesn’t raise eyebrows.

Developer advocate on the left tsk tsk-ing about large JS file. Developer advocate on the right shrugging about large image.

That’s because load time is so much faster with images, as Addy Osmani explains in The Cost of JavaScript, and decoding an image doesn’t block the main thread, as Alex Russell discusses in Can You Afford It?: Real-world Web Performance Budgets.

This doesn’t mean that we expect WebAssembly files to be as large as image files. While early WebAssembly tools created large files because they included lots of runtime, there’s currently a lot of work to make these files smaller. For example, Emscripten has a “shrinking initiative”. In Rust, you can already get pretty small file sizes using the wasm32-unknown-unknown target, and there are tools like wasm-gc and wasm-snip which can optimize this even more.

What it does mean is that these WebAssembly files will load much faster than the equivalent JavaScript.

This is big. As Yehuda Katz points out, this is a game changer.

Tweet from Yehuda Katz saying it's possible to parse and compile wasm as fast as it comes over the network.

So let’s look at how the new compiler works.

Streaming compilation: start compiling earlier

If you start compiling the code earlier, you’ll finish compiling it earlier. That’s what streaming compilation does: it makes it possible to start compiling the .wasm file as soon as possible.

When you download a file, it doesn’t come down in one piece. Instead, it comes down in a series of packets.

Before, as each packet in the .wasm file was being downloaded, the browser network layer would put it into an ArrayBuffer.

Packets coming in to network layer and being added to an ArrayBuffer

Then, once that was done, it would move that ArrayBuffer over to the Web VM (aka the JS engine). That’s when the WebAssembly compiler would start compiling.

Network layer pushing array buffer over to compiler

But there’s no good reason to keep the compiler waiting. It’s technically possible to compile WebAssembly line by line. This means you should be able to start as soon as the first chunk comes in.

So that’s what our new compiler does. It takes advantage of WebAssembly’s streaming API.

WebAssembly.instantiateStreaming call, which takes a response object with the source file. This has to be served using MIME type application/wasm.

If you give WebAssembly.instantiateStreaming a response object, the chunks will go right into the WebAssembly engine as soon as they arrive. Then the compiler can start working on the first chunk while the next one is still being downloaded.

Packets going directly to compiler
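
A minimal sketch of calling the streaming API, with a fallback to the older download-then-compile path for engines that lack it. The helper name `instantiateStreamed`, the file name `module.wasm`, and the export `run` are placeholders for illustration, not names from the article.

```javascript
// Instantiate a module from a Response (or a Promise for one), streaming
// when the engine supports it.
async function instantiateStreamed(responseOrPromise, imports = {}) {
  if (typeof WebAssembly.instantiateStreaming === 'function') {
    // Chunks are handed to the engine as they arrive. The response must be
    // served with the MIME type application/wasm.
    return WebAssembly.instantiateStreaming(responseOrPromise, imports);
  }
  // Fallback: wait for the whole download, then compile the ArrayBuffer.
  const response = await responseOrPromise;
  const bytes = await response.arrayBuffer();
  return WebAssembly.instantiate(bytes, imports);
}

// Browser usage ('module.wasm' and the export name `run` are hypothetical):
// instantiateStreamed(fetch('module.wasm')).then(({ instance }) => {
//   instance.exports.run();
// });
```

Passing the promise returned by `fetch` directly, rather than awaiting it first, lets the engine start work as soon as the response headers and first bytes are available.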

Besides being able to download and compile the code in parallel, there’s another advantage to this.

The code section of the .wasm module comes before any data (which will go in the module’s memory object). So by streaming, the compiler can compile the code while the module’s data is still being downloaded. If your module needs a lot of data, the data can be megabytes, so this can be significant.

File split between small code section at the top, and larger data section at the bottom
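
To see that ordering in an actual binary, here is a small sketch that walks the section headers of a .wasm file. It relies on the binary format's layout (an 8-byte header, then sections that each start with a one-byte id and a LEB128-encoded size). The helper names and the tiny hand-assembled sample module are assumptions made for this example.

```javascript
// Section ids defined by the WebAssembly binary format (index = id).
const SECTION_NAMES = [
  'custom', 'type', 'import', 'function', 'table', 'memory',
  'global', 'export', 'start', 'element', 'code', 'data',
];

// Decode an unsigned LEB128 integer starting at `offset`.
function readU32Leb(bytes, offset) {
  let result = 0;
  let shift = 0;
  let pos = offset;
  for (;;) {
    const byte = bytes[pos++];
    result |= (byte & 0x7f) << shift;
    if ((byte & 0x80) === 0) break;
    shift += 7;
  }
  return { value: result >>> 0, next: pos };
}

// List the sections of a module in the order they appear in the file.
function listSections(bytes) {
  const sections = [];
  let pos = 8; // skip the magic number and version
  while (pos < bytes.length) {
    const id = bytes[pos];
    const { value: size, next } = readU32Leb(bytes, pos + 1);
    sections.push({ name: SECTION_NAMES[id] || `unknown(${id})`, offset: pos, size });
    pos = next + size; // jump over the section body
  }
  return sections;
}

// Hand-assembled sample: one empty function, one memory, one data byte.
const sample = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00, // header
  0x01, 0x04, 0x01, 0x60, 0x00, 0x00,             // type section
  0x03, 0x02, 0x01, 0x00,                         // function section
  0x05, 0x03, 0x01, 0x00, 0x01,                   // memory section
  0x0a, 0x04, 0x01, 0x02, 0x00, 0x0b,             // code section
  0x0b, 0x07, 0x01, 0x00, 0x41, 0x00, 0x0b, 0x01, 0x2a, // data section
]);

console.log(listSections(sample).map((s) => s.name));
```

In the sample, the code section is listed before the data section, so a streaming compiler has everything it needs to compile the functions before the data bytes arrive.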

With streaming, we start compiling earlier. But we can also make compiling faster.

Tier 1 baseline compiler: compile code faster

If you want code to run fast, you need to optimize it. But performing these optimizations while you’re compiling takes time, which makes compiling the code slower. So there’s a tradeoff.

We can have the best of both of these worlds. If we use two compilers, we can have one that compiles quickly without too many optimizations, and another that compiles the code more slowly but creates more optimized code.

This is called a tiered compiler. When code first comes in, it’s compiled by the Tier 1 (or baseline) compiler. Then, after the baseline compiled code starts running, a Tier 2 compiler goes through the code again and compiles a more optimized version in the background.

Once it’s done, it hot-swaps the optimized code in for the previous baseline version. This makes the code execute faster.

Timeline showing optimizing compiling happening in the background.
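
The hot-swap idea can be sketched as a plain JavaScript analogy. This is not how the engine implements tiering internally. It only shows the shape of the technique: callers keep using one entry point while the implementation behind it is replaced. The name `makeTiered` and its parameters are hypothetical.

```javascript
// Returns a function that runs `baselineFn` until `optimizedFnPromise`
// resolves, and the optimized function from then on.
function makeTiered(baselineFn, optimizedFnPromise) {
  let current = baselineFn;
  // The "background compile" finishing is modeled as a promise resolving.
  optimizedFnPromise.then((optimizedFn) => {
    current = optimizedFn; // the swap: later calls use the optimized tier
  });
  return (...args) => current(...args);
}

// Usage (both implementations are stand-ins that compute the same result):
// const square = makeTiered(
//   (x) => x * x,                 // quick-to-produce baseline version
//   Promise.resolve((x) => x * x) // optimized version, ready later
// );
// square(3); // runs the baseline version until the swap has happened
```

Because both tiers compute the same result, callers cannot observe the swap except through speed.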

JavaScript engines have been using tiered compilers for a long time. However, JS engines will only use the Tier 2 (or optimizing) compiler when a bit of code gets “warm”… when that part of the code gets called a lot.

In contrast, the WebAssembly Tier 2 compiler will eagerly do a full recompilation, optimizing all of the code in the module. In the future, we may add more options for developers to control how eagerly or lazily optimization is done.

This baseline compiler saves a lot of time at startup. It compiles code 10–15 times faster than the optimizing compiler. And the code it creates is, in our tests, only 2 times slower.

This means your code will be running pretty fast even in those first few moments, when it’s still running the baseline compiled code.
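
A rough way to see why this tradeoff pays off is to compare how much work gets done by a given time under each strategy. In the sketch below, only the 10x compile speedup and the 2x slowdown come from the figures above. The 2-second optimizing compile time is a made-up input, and the model assumes the optimizing compile starts once the baseline code is ready.

```javascript
// Work completed by time t (seconds), measured in seconds of execution at
// optimized speed. All inputs are hypothetical except the default ratios.
function workDone(t, { optCompileSeconds, compileSpeedup = 10, baselineSlowdown = 2 }) {
  const baselineReady = optCompileSeconds / compileSpeedup;
  // Assumption: the optimizing compile begins when baseline code is ready.
  const optimizedReady = baselineReady + optCompileSeconds;

  // Optimizing compiler only: nothing runs until the full compile is done.
  const optimizingOnly = Math.max(0, t - optCompileSeconds);

  // Tiered: run baseline code at reduced speed, then optimized code.
  const baselinePhase = Math.max(0, Math.min(t, optimizedReady) - baselineReady);
  const optimizedPhase = Math.max(0, t - optimizedReady);
  const tiered = baselinePhase / baselineSlowdown + optimizedPhase;

  return { optimizingOnly, tiered };
}

console.log(workDone(1, { optCompileSeconds: 2 }));
console.log(workDone(3, { optCompileSeconds: 2 }));
```

Under these assumptions, the tiered setup has already made progress at the 1-second mark, while the optimizing-only setup is still compiling.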

Parallelize: make it all even faster

In the article on Firefox Quantum, I explained coarse-grained and fine-grained parallelization. We use both for compiling WebAssembly.

I mentioned above that the optimizing compiler will do its compilation in the background. This means that it leaves the main thread available to execute the code. The baseline compiled version of the code can run while the optimizing compiler does its recompilation.

But on most computers that still leaves multiple cores unused. To make the best use of all of the cores, both of the compilers use fine-grained parallelization to split up the work.

The unit of parallelization is the function. Each function can be compiled independently, on a different core. This is so fine-grained, in fact, that we actually need to batch these functions up into larger groups of functions. These batches get sent to different cores.
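
As an illustration of batching, here is one simple way to group functions so that each batch is big enough to be worth handing to a core. The article does not say how batches are formed. The byte-size threshold used here, and the names `batchFunctions` and `minBatchBytes`, are assumptions for this sketch.

```javascript
// Group function indices into batches whose combined body size reaches
// at least `minBatchBytes` (the last batch may be smaller).
function batchFunctions(functionSizes, minBatchBytes) {
  const batches = [];
  let current = [];
  let currentBytes = 0;
  functionSizes.forEach((size, index) => {
    current.push(index);
    currentBytes += size;
    if (currentBytes >= minBatchBytes) {
      batches.push(current); // this batch is large enough to dispatch
      current = [];
      currentBytes = 0;
    }
  });
  if (current.length > 0) batches.push(current); // leftover functions
  return batches;
}

// Hypothetical function body sizes in bytes:
console.log(batchFunctions([10, 20, 100, 5, 5, 50], 30));
```

Each resulting batch could then be sent to a different worker thread, so that tiny functions do not each pay the overhead of being scheduled on their own.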

… then skip all that work entirely by caching it implicitly (future work)

Currently, decoding and compiling are redone every time you reload the page. But if you have the same .wasm file, it should compile to the same machine code.

This means that most of the time, this work could be skipped. And in the future, this is what we’ll do. We’ll decode and compile on first page load, and then cache the resulting machine code in the HTTP cache. Then when you request that URL, it will pull out the precompiled machine code.

This makes load time disappear for subsequent page loads.

Timeline showing all work disappearing with caching.
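
The planned cache is implicit and lives inside the browser, storing machine code that page scripts cannot reach. Still, the lookup-by-URL idea can be illustrated from JavaScript with an in-memory map of compiled `WebAssembly.Module` objects. This is an analogy under stated assumptions, not the browser's mechanism. The names `moduleCache`, `compileCached`, and `loadBytes` are hypothetical.

```javascript
// URL -> Promise<WebAssembly.Module>. Lives only as long as the page does.
const moduleCache = new Map();

// Compile the module for `url` once; later requests reuse the result.
// `loadBytes` is a caller-supplied async function returning the .wasm bytes.
function compileCached(url, loadBytes) {
  if (!moduleCache.has(url)) {
    const compiled = loadBytes(url).then((bytes) => WebAssembly.compile(bytes));
    moduleCache.set(url, compiled);
  }
  return moduleCache.get(url);
}

// Usage ('app.wasm' is a placeholder URL):
// const load = (url) => fetch(url).then((r) => r.arrayBuffer());
// compileCached('app.wasm', load).then((module) => WebAssembly.instantiate(module, {}));
```

The second request for the same URL skips both the download callback and the compile, which mirrors the effect described above on a much smaller scale.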

The groundwork is already laid for this feature. We’re caching JavaScript byte code like this in the Firefox 58 release. We just need to extend this support to caching the machine code for .wasm files.

About Lin Clark

Lin is an engineer on the Mozilla Developer Relations team. She tinkers with JavaScript, WebAssembly, Rust, and Servo, and also draws code cartoons.



22 comments

  1. yoshua wuyts

    Ohh, this is super neat! Is there any indication of preload tags / headers becoming available for this? Also curious what the interaction would be with HTTP2 Push.

    I can imagine that if the WASM stream can be initiated right after the initial handshake, performance could really be turned up to 100! ✨

    Thanks!

    January 17th, 2018 at 11:39

    1. Lin Clark

      I believe you should be able to use a link tag with as="fetch". I’m not sure about the interaction with HTTP2 Push. I think it would just make the bytes ready and available for the instantiate call, but I don’t think it would trigger compilation.

      January 19th, 2018 at 12:10

  2. Dawid

    Great article, thank you for sharing your knowledge in such an accessible way.

    January 17th, 2018 at 12:47

  3. Thomas E Enebo

    I was with you until the last section.

    Caching JS, as in your startup code, can probably just execute from a cached copy because you can either implicitly trust it is always the same code or you can fingerprint the contents to verify nothing changed (you already have it after all).

    For JS from the network you can at best stream compile while loading and then when it is finished decide that the stream is not needed because it happens to already be compiled in a past session. I guess in this sense the overhead time is just the streaming time but you still are speculatively stream-compiling because you don’t know yet you already have it.

    Right?

    January 17th, 2018 at 13:55

    1. Eric

      I thought of the same issue, so I’m assuming they’ll determine fingerprints based off functions, just like they’d do for core selection.

      January 17th, 2018 at 15:18

    2. George Mauer

      Well your browser already decides whether to load any given asset from cache or not. I read that part as *if* we were going to load from cache anyways, then skip compilation since the machine code is what is cached

      January 17th, 2018 at 17:18

  4. Dalin William s

    Fantastic presentation, and awesome new tech! I am also curious about the interactions with HTTP2 Push. How would this assist content loaded over something like GRPC-Web?

    Time to do all the research!

    January 17th, 2018 at 14:43

  5. Omar

    I felt like reading Sci-fi.
    Magnificent

    January 17th, 2018 at 18:02

  6. Raahul Kumar

    Glad to see that Ubuntu is already shipping with Firefox Quantum. Just wondering, how do you benchmark these speed increases? And instead of running 3 different Firefox versions, the stable one offered by Ubuntu, the developer edition, and nightly, can I get by with just the nightly alone? Or is it too unstable and bug prone to serve as the daily browser?

    57.0.4
    Firefox Release

    January 4, 2018
    Version 57.0.4, first offered to Release channel users on January 4, 2018

    January 17th, 2018 at 20:29

  7. ilya

    A well written good read. Thank you for sharing.

    January 17th, 2018 at 23:40

  8. Mark Entingh

    Great way to explain such complex ideas with hand drawn images. Thank you for putting so much effort into this post :)

    January 18th, 2018 at 02:51

  9. Willian

    Thanks for this excellent article. Does it work in the same fashion for JS compilation? If the answer is no, do you have plans to implement it for JS as well? Can we have it once binary AST is in place?

    January 18th, 2018 at 05:28

  10. Matt Cheung

    This is extremely well written. As someone new to this stuff, I feel like I can actually start to get a grasp of it all. Thank you for this post.

    January 18th, 2018 at 06:50

  11. oanchasa

    Verygood

    January 18th, 2018 at 06:53

  12. asgs

    Amazing innovation (or improvement). Thanks for writing a crystal clear article!

    January 18th, 2018 at 14:05

  13. Sandra Jane Kays

    I found this to be very interesting as well as educational.

    January 18th, 2018 at 20:50

  14. Bruno Santos

    Mozilla is doing an amazing job improving the web.

    This is not only a game changer, this will be how we will all soon be developing apps through our UIs using Kotlin, Rust, Clojure, C# or whatever language we prefer to get the job done.

    Can’t wait!

    January 19th, 2018 at 07:40

  15. John Paul Barbagallo

    Great article and explanation of the future of WA, definitely caught my interest!

    January 19th, 2018 at 15:47

  16. Klas Š.

    Is new WA compiler written in Rust?

    January 24th, 2018 at 02:42

  17. Ian Vickers

    Lin Clark, your articles are like XKCD for Mozilla! I love it!

    January 24th, 2018 at 19:40

  18. Brian Gaucher

    I know this is a little off topic, but if the mapping thing to debug the code in a “human” language arrives soon, I suppose debugging will only use the baseline compiler, since the optimized code would be somewhat less “human-readable” organised.
    Or would the baseline code only be used in the specific sections which need to be debugged, while the non-debugging sections would still run with the optimised version.

    January 25th, 2018 at 09:39

  19. Alec

    It is great!!! Is it possible to find somewhere tests that were used to measure performance improvement after using WebAssembly.{compile|instantiate}Streaming?
    I’m preparing a tech talk about WebAssembly and streaming compilation, so I need some figures that reflect performance improvements in compilation.

    January 28th, 2018 at 11:34
