Generational Garbage Collection in Firefox

Generational garbage collection (GGC) has now been enabled in the SpiderMonkey JavaScript engine in Firefox 32. GGC is a performance optimization only, and should have no observable effects on script behavior.

So what is it? What does it do?

GGC is a way for the JavaScript engine to collect short-lived objects faster. Say you have code similar to:

function add(point1, point2) {
    return [ point1[0] + point2[0], point1[1] + point2[1] ];
}

Without GGC, you will have high overhead for garbage collection (from here on, just “GC”). Each call to add() creates a new Array, and it is likely that the old arrays that you passed in are now garbage. Before too long, enough garbage will pile up that the GC will need to kick in. That means the entire JavaScript heap (the set of all objects ever created) needs to be scanned to find the stuff that is still needed (“live”) so that everything else can be thrown away and the space reused for new objects.
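
In case it helps to picture what that scan means, here is a heavily simplified sketch of the marking step in plain JavaScript. It is not the engine's actual implementation, just the idea of tracing reachability from the roots; the name markLive and the representation of objects are made up for illustration.

// Illustrative only: find the live set by tracing references from the roots.
function markLive(roots) {
    var live = new Set();
    var stack = roots.slice();              // roots: globals, the stack, etc.
    while (stack.length > 0) {
        var obj = stack.pop();
        if (live.has(obj))
            continue;                       // already visited
        live.add(obj);
        for (var key in obj) {              // follow every reference this object holds
            if (obj[key] !== null && typeof obj[key] === "object")
                stack.push(obj[key]);
        }
    }
    return live;                            // everything not in this set can be reclaimed
}

The important point is that the time spent here is proportional to the number of live objects, no matter how little garbage there is among them.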

If your script does not keep very many total objects live, this is totally fine. Sure, you’ll be creating tons of garbage and collecting it constantly, but the scan of the live objects will be fast (since not much is live). However, if your script does create a large number of objects and keep them alive, then the full GC scans will be slow, and the performance of your script will be largely determined by the rate at which it produces temporary objects — even when the older objects aren’t changing, and you’re just re-scanning them over and over again to discover what you already knew. (“Are you dead?” “No.” “Are you dead?” “No.” “Are you dead?”…)

Generational collector, Nursery & Tenured

With a generational collector, the penalty for temporary objects is much lower. Most objects will be allocated into a separate memory region called the Nursery. When the Nursery fills up, only the Nursery will be scanned for live objects. The majority of the short-lived temporary objects will be dead, so this scan will be fast. The survivors will be promoted to the Tenured region.

The Tenured heap will also accumulate garbage, but usually at a far lower rate than the Nursery. It will take much longer to fill up. Eventually, we will still need to do a full GC, but under typical allocation patterns these should be much less common than Nursery GCs. To distinguish the two cases, we refer to Nursery collections as minor GCs and full heap scans as major GCs. Thus, with a generational collector, we split our GCs into two types: mostly fast minor GCs, and fewer slower major GCs.
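
To make the split concrete, here is a deliberately simplified sketch in plain JavaScript. It is not how the engine is actually written (the names nursery, tenured, allocate and minorGC, and the use of Sets, are purely illustrative), but it captures the shape of the idea:

// Toy model of a two-generation heap (illustrative only, not engine code).
var nursery = new Set();    // newly created objects start here
var tenured = new Set();    // objects that survived at least one minor GC

function allocate(obj) {
    nursery.add(obj);       // new allocations go into the Nursery
    return obj;
}

// A minor GC scans only the Nursery: survivors are promoted, the rest is garbage.
function minorGC(liveNurseryObjects) {
    for (var obj of liveNurseryObjects)
        tenured.add(obj);   // promote the survivors to the Tenured region
    nursery.clear();        // everything left behind is reclaimed in one step
}

In this picture, a major GC is the one that also has to look through tenured.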

GGC Overhead

While it might seem like we should have always been doing this, it turns out to require quite a bit of infrastructure that we previously did not have, and it also incurs some overhead during normal operation. Consider the question of how to figure out whether some Nursery object is live. It might be pointed to by a live Tenured object — for example, if you create an object and store it into a property of a live Tenured object.

How do you know which Nursery objects are being kept alive by Tenured objects? One alternative would be to scan the entire Tenured heap to find pointers into the Nursery, but this would defeat the whole point of GGC. So we need a way of answering the question more cheaply.

Note that these Tenured ⇒ Nursery edges in the heap graph won’t last very long, because the next minor GC will promote all survivors in the Nursery to the Tenured heap. So we only care about the Tenured objects that have been modified since the last minor (or major) GC. That won’t be a huge number of objects, so we make the code that writes into Tenured objects check whether it is writing any Nursery pointers, and if so, record the cross-generational edges in a store buffer.

In technical terms, this is known as a write barrier. Then, at minor GC time, we walk through the store buffer and mark every target Nursery object as being live. (We actually use the source of the edge at the same time, since we relocate the Nursery object into the Tenured area while marking it live, and thus the Tenured pointer into the Nursery needs to be updated.)
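
Continuing the toy model from the earlier sketch (and again using made-up names such as storeBuffer and writeProperty rather than anything from the real engine), the write barrier and the store-buffer walk might look roughly like this:

// Illustrative sketch only: record Tenured-to-Nursery edges created by writes.
var storeBuffer = [];

// The write barrier: every write into a Tenured object checks whether it is
// storing a Nursery pointer, and if so, remembers the source slot.
function writeProperty(obj, key, value) {
    if (tenured.has(obj) && nursery.has(value))
        storeBuffer.push({ source: obj, key: key });
    obj[key] = value;
}

// At minor GC time, walk the store buffer: each recorded slot may point at a
// Nursery object that has to be kept alive (and, in the real engine, relocated,
// with the Tenured slot updated to the object's new address).
function processStoreBuffer() {
    for (var edge of storeBuffer) {
        var target = edge.source[edge.key];
        if (nursery.has(target)) {
            nursery.delete(target);
            tenured.add(target);    // the survivor is promoted
        }
    }
    storeBuffer.length = 0;         // the buffer is emptied after every minor GC
}

The check at the top of writeProperty is the per-write cost discussed in the next paragraph.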

With a store buffer, the time for a minor GC is dependent on the number of newly-created edges from the Tenured area to the Nursery, not just the number of live objects in the Nursery. Also, keeping track of the store buffer records (or even just the checks to see whether a store buffer record needs to be created) does slow down normal heap access a little, so some code patterns may actually run slower with GGC.

Allocation Performance

On the flip side, GGC can speed up object allocation. The pre-GGC heap needs to be fully general. It must track in-use and free areas and avoid fragmentation. The GC needs to be able to iterate over everything in the heap to find live objects. Allocating an object in a general heap like this is surprisingly complex. (GGC’s Tenured heap has pretty much the same set of constraints, and in fact reuses the pre-GGC heap implementation.)

The Nursery, on the other hand, just grows until it is full. You never need to delete anything, at least until you free up the whole Nursery during a minor GC, so there is no need to track free regions. Consequently, the Nursery is perfect for bump allocation: to allocate N bytes you just check whether there is space available, then increment the current end-of-heap pointer by N bytes and return the previous pointer.

There are even tricks to optimize away the “space available” check in many cases. As a result, objects with a short lifespan never go through the slower Tenured heap allocation code at all.
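
As a rough sketch of what bump allocation amounts to (simplified, and written in JavaScript purely for illustration; the real Nursery hands out raw memory inside the engine, and the size used below is an arbitrary assumption):

// Minimal bump-allocation sketch over a fixed-size buffer (illustrative only).
var NURSERY_BYTES = 16 * 1024 * 1024;           // assumed Nursery size
var nurseryBuffer = new ArrayBuffer(NURSERY_BYTES);
var bumpPointer = 0;                            // current end-of-heap offset

function bumpAllocate(nbytes) {
    if (bumpPointer + nbytes > NURSERY_BYTES)   // the "space available" check
        return null;                            // Nursery is full: time for a minor GC
    var offset = bumpPointer;                   // remember the old pointer...
    bumpPointer += nbytes;                      // ...bump it past the new object...
    return new Uint8Array(nurseryBuffer, offset, nbytes);   // ...and hand the space back
}

// After a minor GC the whole Nursery is reused by simply resetting the pointer.
function resetNursery() {
    bumpPointer = 0;
}

Resetting one pointer is all it takes to reuse the entire region, which is why Nursery allocation can be so much cheaper than the general-purpose Tenured path.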

Timings

I wrote a simple benchmark to demonstrate the various possible gains of GGC. The benchmark is sort of a “vector Fibonacci” calculation, where it computes a Fibonacci sequence for both the x and y components of a two-dimensional vector. The script allocates a temporary object on every iteration. It first times the loop with the (Tenured) heap nearly empty, then it constructs a large object graph intended to be placed into the Tenured portion of the heap, and times the loop again.

On my laptop, the benchmark shows huge wins from GGC. The average time for an iteration through the loop drops from 15 nanoseconds (ns) to 6ns with an empty heap, demonstrating the faster Nursery allocation. It also shows the independence from the Tenured heap size: without GGC, populating the long-lived heap slows down the mean time from 15ns to 27ns. With GGC, the speed stays flat at 6ns per iteration; the Tenured heap simply doesn’t matter.

Note that this benchmark is intended to highlight the improvements possible with GGC. The actual benefit depends heavily on the details of a given script. In some scripts, the time to initialize an object is significant and may exceed the time required to allocate the memory. A higher percentage of Nursery objects may get tenured. When running inside the browser, we force enough major GCs (e.g., after a redraw) that the benefits of GGC are less noticeable.

Also, the description above implies that we will pause long enough to collect the entire heap, which is not the case — our incremental garbage collector dramatically reduces pause times on many Web workloads already. (The incremental and generational collectors complement each other — each attacks a different part of the problem.)

Benchmark Code

function bigHeap(N) {
    // Build a large, long-lived object graph; each element links to the previous one.
    var result = [];
    for (var i = 0; i < N; i++)
        result.push({ 'number': i, 'prev': result[i - 1] });
    return result;
}

function add(a, b) {
    return [a[0] + b[0], a[1] + b[1]];
}

function vecfib(n) {
    var v1 = [0, 0];
    var v2 = [1, 1];
    for (var i = 0; i < n; i++) {
        var v = add(v1, v2);
        v1 = v2;
        v2 = v;
    }
    return v1;
}

var t = {};
var iters = 10000000;
t.smallheap_start = Date.now();
var dummy1 = vecfib(iters);
t.smallheap_end = Date.now();
var H = bigHeap(10000000);   // keep a reference so the big object graph stays live
t.bigheap_start = Date.now();
var dummy2 = vecfib(iters);
t.bigheap_end = Date.now();

print("Small heap: " + ((t.smallheap_end - t.smallheap_start) / iters) * 1000000 + " ns/iter");
print("Big heap: " + ((t.bigheap_end - t.bigheap_start) / iters) * 1000000 + " ns/iter");

About Steve Fink

SpiderMonkey JavaScript engine developer, working mainly on the garbage collector and static analysis. Also moonlights working on release engineering, mercurial customization, and generally making a nuisance of himself. Author of the 'mrgiggles' bot on IRC.


About Robert Nyman [Editor emeritus]

Technical Evangelist & Editor of Mozilla Hacks. Gives talks & blogs about HTML5, JavaScript & the Open Web. Robert is a strong believer in HTML5 and the Open Web, and has been working with front-end development for the web since 1999, in Sweden and in New York City. He also blogs regularly at http://robertnyman.com and loves to travel and meet people.



14 comments

  1. StyMaar

    Well done, guys!
    Thanks to this technique, Firefox is currently 4 times quicker than Google Chrome and 40 times quicker than IE11 with your benchmark code. Impressive!

    But I think this garbage collection mechanism is not quite perfect yet, because it does not release the memory after your script has finished (it’s more visible if you use numbers ten times bigger, like I did here: http://jsbin.com/yipihi/1/edit)

    I just filed a bug here : https://bugzilla.mozilla.org/show_bug.cgi?id=1073499

    September 26th, 2014 at 05:52

    1. Robert Nyman [Editor]

      Thanks for the input and for filing the bug!

      September 26th, 2014 at 07:49

  2. Ed Murphy

    This makes my Conway’s Life cellular automaton run as smooth as silk at 60 fps. Previous versions would be hampered by the GC pauses. I would imagine many games will benefit from this improvement. Thanks.

    September 29th, 2014 at 09:07

  3. Thorsten

    “GGC is a performance optimization only, and should have no observable effects on script behavior.”

    Oh, but it does. My little attempt at a game runs observably more smoothly. Thank you very much.

    September 30th, 2014 at 06:12

  4. Dawson Loudon

    SDK News is stealing your content and stating that they wrote it: http://www.sdknews.com/firefox-os/generational-garbage-collection-in-firefox

    October 1st, 2014 at 07:38

    1. Robert Nyman [Editor]

      Thanks for the heads-up. This has been worked out over Twitter, and they will publish a proper reference to the original article(s).

      October 1st, 2014 at 08:01

  5. eaglgenes101

    If you can have 2 levels of GC, what prevents you from having 3? Or 4? At what point does having more garbage collection levels hurt throughput more than it helps latency?

    October 5th, 2014 at 15:54

    1. Steve Fink

      Everything you suggest in that comment is exactly right. You can definitely add more levels of GC, and many systems do. But in all of these schemes, you have to keep track of the inter-generational pointers, which can be added (and removed) any time you write a GC pointer, which means you need to implement more and more complex write barriers (ie, extra code you have to run *on every write* to check whether the pointer you’re writing points into a different generation). You’ll also have to store the record of all these inter-generational pointers somewhere. So there are diminishing returns, and whether it makes sense or not depends on the workload. We have definitely looked into it, but for now it feels like the gain is very unclear and it adds a lot of complexity. For now, we have more than enough to do just fixing things up to properly take advantage of what we’ve already implemented. If we eventually get to the point where we feel like our bottleneck, at least some of the time, is stuff that could be fixed by adding more generations, then we’ll take another look. (We also have someone who did a thesis where he showed that in the system he was studying, adding this sort of complexity was a net loss. His results may not map perfectly to our system, but it’s a big cautionary red flag.)

      For now, we’re more concerned with getting more parallelism and concurrency, and better scheduling.

      October 6th, 2014 at 09:50

  6. Naman Patel

    Really nice article explaining the purpose of GGC. But are the Nursery and Tenured heaps allocated per tab so that they can be GCed in parallel? (Just a further optimization)

    October 12th, 2014 at 03:45

    1. Steve Fink

      Somewhat. It’s not exactly per-tab, but objects are allocated into a particular “compartment”, and compartments are grouped into “zones”, which roughly correspond to tabs. We can and do collect zones separately (one at a time or any subset of zones), but the Gecko embedding doesn’t fully track things on a per-zone basis, so we still need to do some global collections to be sure to clean up cycles through C++ objects. And deciding which zones to collect is another wrinkle in our general GC scheduling, which is… uh… let’s just say it has room for improvement. :-)

      October 13th, 2014 at 09:13

      1. Naman Patel

        Thank you for this valuable information. What I understand from your reply is that per-zone GGC is/can be carried out by a thread mapped to a tab (and which zones to collect can be optimized further), while the cycle collector has to run on a global basis.

        And just on the allocation front, does that mean that allocation in the Nursery heap does not require JEMalloc, while allocation for the Tenured heap still uses JEMalloc, which requires all the complex mapping of used and free Arenas?

        October 13th, 2014 at 10:44

        1. Terrence Cole

          That’s not quite right.

          The GGC heap is per-runtime, not per-tab. Only major GCs can be done per-tab. There are not really any threads involved here: we are still a stop-the-world collector.

          We do not use JEMalloc for the GC heap; we mmap/VirtualAlloc blocks of memory directly from the OS and have our own management structures tuned for JS’s needs. However, objects in the GC heap may utilize JEMalloc to store more data if they become large.

          October 17th, 2014 at 11:21

        2. Steve Fink

          I wasn’t talking about threads. With the e10s project, you might have separate threads, but currently even that is only a single thread for all tabs. (Splitting the tabs into multiple threads is apparently a relatively easy follow-up, but right now they’re just getting >1 thread to work.) But it’s still very useful to be able to collect a subset of Zones — better latency and throughput both. You don’t waste time marking or sweeping a bunch of inactive zones.

          (And I neglected to submit this comment, so Terrence replied about allocation in the meantime. Yay!)

          October 17th, 2014 at 11:43

          1. Naman Patel

            @Terrence: Thanks, sir, for correcting me on the per-tab GC and the heap allocation. The heap fragmentation that I have read about somewhere happens in the Tenured heap, which is allocated by mmap/VirtualAlloc, and thus has to be external fragmentation. No fragmentation can happen in the Nursery, since after a minor GC the Nursery can be cleared. Just curious to know: what happens when we need more memory for the heap than we initially acquired from the OS? How are the sizes of the Nursery and the Tenured heap decided?
            @Steve: Does that mean that the heap will be divided into zones and then further into compartments? Which flavor of GGC does Firefox employ?

            Thank you for sparing your time.

            October 21st, 2014 at 05:22
