Performance Articles

  1. Compacting Garbage Collection in SpiderMonkey

    Overview

    Compacting is a new feature of our garbage collector, released in Firefox 38, that allows us to reduce external fragmentation in the JavaScript heap. The aim is to use less memory in general and to be able to recover from more out-of-memory situations. So far, we have only implemented compacting for JavaScript objects, which are one of several kinds of garbage-collected cells in the heap.

    The problem

    The JavaScript heap is made up of 4K blocks of memory called arenas, each of which is divided into fixed-size cells. Different arenas are used to allocate different kinds of cells; each arena only contains cells of the same size and kind.

    The heap contains various kinds of cell, including those for JavaScript objects, strings, and symbols, as well as several internal kinds such as scripts (used to represent units of JS code), shapes (used to determine the layout of object properties in memory), and jitcode (compiled JIT code). Of these, object cells usually take up the most memory.

    An arena cannot be freed while it contains any live cells. Cells allocated at the same time may have different lifetimes and so a heap may end up in a state where there are many arenas that contain only a few cells. New cells of the same kind can be allocated into this space, but the space cannot be used for cells of a different kind or returned to the operating system if memory is low.

    Here is a simplified diagram of some data on the heap showing arenas containing two different kinds of cell:

    [Diagram: simplified heap layout showing arenas containing two kinds of cells]

    Note that if the free space in arena 3 were used to hold the cells in arena 5, we could free up a whole arena.

    Measuring wasted heap space

    You can see how much memory these free cells are taking up by navigating to about:memory and hitting the ‘Measure’ button. The totals for the different kinds of cell are shown under the section js-main-runtime-gc-heap-committed/unused/gc-things. (If you’re not used to interpreting the about:memory reports, there is some documentation here).

    Here’s a screenshot of the whole js-main-runtime-gc-heap-committed section with compacting GC disabled, showing the difference between ‘used’ and ‘unused’ sizes:

    [Screenshot: the js-main-runtime-gc-heap-committed section with compacting GC disabled]

    I made some rough measurements of my normal browsing profile with and without compacting GC (details of how to do this are below at the end of the post). The profile consisted of Google Mail, Calendar, many bugzilla tabs and various others (~50 tabs total), and I obtained the following readings:

                       Total explicit allocations   Unused cells
    Before compacting  1,324.46 MiB                 69.58 MiB
    After compacting   1,296.28 MiB                 40.18 MiB

    This shows a reduction of 28.18 MiB in explicit allocations and 29.4 MiB (mebibytes) in unused cells. That’s only about 2% of total allocations, but it accounts for over 8% of the space taken up by the JS heap.

    How does compacting work?

    To free up this space we have to allow the GC to move cells between arenas. That way it can consolidate the live cells in fewer arenas and reuse the unused space. Of course, this is easier said than done, as every pointer to a moved cell must be updated. Missing a single one is a sure-fire way to make the browser crash!

    Also, this is a potentially expensive operation as we have to scan many cells to find the pointers we need to update. Therefore the idea is to compact the heap only when memory is low or the user is inactive.

    The algorithm works in three phases:

    1. Select the cells to move.
    2. Move the cells.
    3. Update the pointers to those cells.

    Selecting the cells to move

    We want to move the minimum amount of data and we want to do it without allocating any more memory, as we may be doing this when we don’t have any free memory. To do this, we take all the arenas with free space in them and put them in a list arranged in decreasing order of the number of free cells they contain. We split this list into two parts at the first point at which the preceding arenas have enough free cells to contain the used cells in the subsequent arenas. We will move all the cells out of the subsequent arenas.
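
    This selection step can be sketched in JavaScript as follows. This is an illustrative sketch only (SpiderMonkey's real implementation is in C++), and the arenas array of { freeCells, usedCells } records is a hypothetical representation:

    // Illustrative sketch; SpiderMonkey's real implementation is in C++.
    // 'arenas' is a hypothetical array of { freeCells, usedCells } records.
    function selectArenasToRelease(arenas) {
      // Arrange arenas in decreasing order of free cells.
      var sorted = arenas.slice().sort(function (a, b) {
        return b.freeCells - a.freeCells;
      });

      var freeBefore = 0; // free cells in the preceding arenas
      var usedAfter = sorted.reduce(function (sum, a) {
        return sum + a.usedCells;
      }, 0); // used cells in the subsequent arenas

      for (var i = 0; i < sorted.length; i++) {
        if (freeBefore >= usedAfter) {
          return sorted.slice(i); // move all cells out of these arenas
        }
        freeBefore += sorted[i].freeCells;
        usedAfter -= sorted[i].usedCells;
      }
      return []; // not enough free space to release any arenas
    }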

    Moving the cells

    We allocate a new cell from one of the arenas we are not moving. The previous step ensures there is always enough space for this. Then we copy the data over from the original location.

    In some cases, we know the cell contains pointers to itself and these are updated at this point. The browser may have external references to some kinds of object and so we also call an optional hook here to allow these to be updated.

    When we have moved a cell, we update the original location with a forwarding pointer to the new location, so we can find it later. This also marks the cell as moved, so the GC knows to redirect pointers to it in the next phase.
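
    The forwarding mechanism can be sketched with plain JavaScript objects. Again, this only illustrates the idea and is not SpiderMonkey code:

    // Illustrative sketch: a moved cell leaves behind a forwarding pointer.
    function moveCell(cell) {
      var newCell = { data: cell.data, moved: false, forward: null };
      cell.moved = true;      // mark the original location as moved...
      cell.forward = newCell; // ...and leave a forwarding pointer
      return newCell;
    }

    // In the update phase, any reference to a moved cell is redirected
    // by following its forwarding pointer.
    function updated(ref) {
      return ref.moved ? ref.forward : ref;
    }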

    Updating pointers to moved cells

    This is the most demanding part of the compacting process. In general, we don’t know which cells may contain pointers to cells we have moved, so it seems we have to iterate through all cells in the heap. This would be very expensive.

    We cut down this cost in a number of ways. Firstly, note that the heap is split into several zones (there is a zone per browser tab, and others for system use). Compacting is performed per-zone, since in general cells do not have cross-zone pointers (these are handled separately). Compacting per zone allows us to spread the total cost over many incremental slices.

    Secondly, not every kind of cell can contain pointers to every other kind of cell (indeed not all kinds of cells can contain pointers) so some kinds of cell can be excluded from the search.

    Finally, we can parallelise this work and make use of all CPU resources available.

    It’s important to note that this work was enabled by our shift to exact stack rooting, described in this blog post. It is only possible to move objects if we know which stack locations are roots, otherwise we could overwrite unrelated data on the stack if it happened to look like a moved cell pointer.

    Scheduling heap compaction

    As mentioned earlier, compacting GC doesn’t run every time we collect. Currently it is triggered on three events:

    • We ran out of memory and we’re performing a last-ditch attempt to free up some space
    • The OS has sent us a memory pressure event
    • The user has been inactive for some length of time (currently 20 seconds)

    The first two should allow us to avoid some out-of-memory situations, while the last aims to free up memory without affecting the user’s browsing experience.

    Conclusion

    Hopefully this has explained the problem compacting GC is trying to solve, and how it’s done.

    One unexpected benefit of implementing compacting GC is that it showed us a couple of places where we weren’t correctly tracing cell pointers. Errors like this can cause hard-to-reproduce crashes or potential security vulnerabilities, so this was an additional win.

    Ideas for future work

    The addition of compacting is an important step in improving our GC, but it’s not the end by any means. There are several ways in which we can continue to develop this:

    Currently we only compact cells corresponding to JavaScript objects, but there are several other kinds of cells in the heap. Moving these would bring greater memory savings.

    Is it possible to determine in advance which cells contain pointers to cells we want to move? If we had this information we could cut the cost of compacting. One possibility is to scan the heap in the background to determine this information, but we would need to be able to detect changes made by the mutator.

    The current algorithm mixes together cells allocated at different times. Cells with similar lifetimes are often allocated at the same time, so this may not be the best strategy.

    If compacting can be made fast enough, we might be able to do it whenever the collector sees a certain level of fragmentation in the heap.

    How to measure heap space freed up by compacting

    To measure roughly how much space is freed by compacting, you can perform the following steps:

    1. Disable compacting by navigating to about:config and setting javascript.options.mem.gc_compacting to false.
    2. It is easiest to also disable multiprocess Firefox at this point. This can be done from the main Preferences page.
    3. Restart the browser and open some tabs. I used ‘Reload all tabs’ to open all my pages from last time. Wait for everything to load.
    4. Open about:memory and force a full GC by clicking ‘Minimize memory usage’ and then click ‘Measure.’ Since memory usage can take a while to settle down, I repeated this a few times until I got a consistent number.
    5. Note the total ‘explicit’ size and that of js-main-runtime-gc-heap-committed/unused/gc-things.
    6. Enable compacting again by setting javascript.options.mem.gc_compacting to true. There is no need to restart for this to take effect.
    7. Click ‘Minimize memory usage’ again and then ‘Measure.’
    8. Compare the new readings to the previous.

    This does not give precise readings as all kinds of things might be happening in the background, but it can provide a good ballpark figure.

  2. How fast are web workers?

    The next version of Firefox OS, the mobile operating system, will unleash the power of devices by taking full advantage of their multi-core processors. Classically, JavaScript has been executed on a single thread, but web workers offer a way to execute code in parallel. Doing so frees the browser of anything that may get in the way of the main thread so that it can smoothly animate the UI.

    A brief introduction to web workers

    There are several types of web workers. They each have specific properties, but share a similar design: the code running in a worker executes in its own separate thread, in parallel with the main thread and other workers, and the different types of workers all share a common interface.

    Web workers

    Dedicated web workers are instantiated by the main process and they can communicate only with it.

    Shared workers

    Shared workers can be reached by all processes running on the same origin (different browser tabs, iframes or other shared workers).

    Service workers

    Service workers have gained a lot of attention recently. They make it possible to proxy a web server programmatically to deliver specific content to the requester (e.g. the main thread). One use case for service workers is to serve content while offline. Service workers are a very new API, not fully implemented in all browsers, and are not covered in this article.

    In order to verify that web workers make Firefox OS run faster, we need to validate their speed by benchmarking them.

    The cost of creating web workers

    This article focuses on Firefox OS. All measurements are made on a Flame device, powered by mid-range hardware.

    The first set of benchmarks will look at the time it takes to create web workers. To do that, we set up a script that instantiates a web worker and sends a minimal message, to which the worker replies immediately. Once the response is received by the main thread, the time that the operation takes is calculated. The web worker is destroyed and the operation is repeated enough times to get a good idea of how long it takes on average to get a functional web worker. Instantiating a web worker is as easy as:

    // Start a worker.
    var worker = new Worker('worker-script.js');
     
    // Terminate a worker.
    worker.terminate();
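
    A minimal sketch of the benchmark loop described above might look like this (the actual benchmark app is more elaborate); it assumes worker-script.js echoes any message straight back:

    // Sketch: time how long it takes to get a functional web worker.
    function timeWorkerCreation(done) {
      var start = performance.now();
      var worker = new Worker('worker-script.js');
      worker.onmessage = function () {
        done(performance.now() - start);
        worker.terminate();
      };
      worker.postMessage('ping');
    }

    timeWorkerCreation(function (ms) {
      console.log('Functional worker after ' + ms.toFixed(2) + ' ms');
    });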

    The same method is applied to the creation of a broadcast channel:

    // Open a broadcast channel.
    var channel = new window.BroadcastChannel('channel-name');
     
    // Close a broadcast channel.
    channel.close();

    Shared workers can’t really be benchmarked here because once they are created, the developer can’t destroy them. The browser is entirely responsible for their lifetime. For that reason, we can’t create and destroy shared workers at will to get a meaningful benchmark.

    Web workers take about 40 ms to be instantiated. Also, this time is pretty stable with variations of only a few milliseconds. Setting up a broadcast channel is usually done within 1 ms.

    Under normal circumstances, the browser UI is refreshed at a rate of 60 frames per second. This means that no JavaScript code should run longer than the time needed by a frame, i.e., 16.66ms (60 frames per second). Otherwise, you may introduce jankiness and lag in your application.

    Instantiating web workers is pretty efficient, but still may not fit in the time allocated for a single frame. That’s why it’s important to create as few web workers as possible and reuse them.

    Message latency

    A critical aspect of web workers is having fast communication between your main thread and the workers. There are two different ways the main browser thread can communicate with a web worker.

    postMessage

    This API is the default and preferred way to send and receive messages from a web worker. postMessage() is easy to use:

    // Send a message to the worker.
    worker.postMessage(myMessage);
     
    // Listen to messages from the worker.
    worker.onmessage = evt => {
      var message = evt.data;
    };
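
    For completeness, the worker side of this exchange could be as simple as this hypothetical worker-script.js:

    // worker-script.js (hypothetical): echo every message back to the sender.
    self.onmessage = function (evt) {
      self.postMessage(evt.data);
    };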

    Broadcast Channel

    This is a newly implemented API, only available in Firefox at the time of this writing. It lets us broadcast messages to all contexts sharing the same origin. All browser tabs, iframes, or workers served from the same origin can emit and receive messages:

    // Send a message to the broadcast channel.
    channel.postMessage(myMessage);
     
    // Listen to messages from the broadcast channel.
    channel.onmessage = evt => {
      var message = evt.data;
    };

    To benchmark this, we use a script similar to the one described above, except that the web worker is not destroyed but reused for each operation. The time to get a round-trip response is divided by two to estimate the one-way latency.

    As you might expect, the simple postMessage is fast. It usually takes between 0 and 1 ms to send a message, whether to a web or shared worker. The Broadcast Channel API takes about 1 to 2 ms.

    Under normal circumstances, exchanging messages with workers is fast and you should not feel too concerned about speed here. However, larger messages can take longer.

    The size of messages

    There are 2 ways to send messages to web workers:

    • Copying the message
    • Transferring the message

    In the first case, the message is serialized, copied, and sent over. In the second, ownership of the data is transferred, which means the original sender can no longer use it once sent. Transferring data is almost instantaneous, so there is no real point in benchmarking that. However, only ArrayBuffer objects are transferable.
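
    For example, transferring an ArrayBuffer is done by listing it in the second argument of postMessage:

    var buffer = new ArrayBuffer(1024 * 1024); // 1 MiB of binary data

    // Transfer the buffer instead of copying it.
    worker.postMessage(buffer, [buffer]);

    console.log(buffer.byteLength); // 0: the sender can no longer use it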

    As expected, serializing, copying, and de-serializing data adds significant overhead to the message transmission. The bigger the message, the longer it takes to be sent.

    The benchmark here sends a typed array to a web worker. Its size is progressively increased at each iteration. There is a linear correlation between size of the message and transfer time. For each measurement, we can divide the size (in kilobytes) by the time (in milliseconds) to get the transfer speed in kb/ms.

    Typically, on a Flame, the transfer speed is 80 kB/ms for postMessage and 12 kB/ms using broadcast channel. This means that if you want your message to fit in a single frame (80 kB/ms × 16.6 ms ≈ 1,300 kB), you should keep it under 1,300 kB with postMessage and under 200 kB when using the broadcast channel. Otherwise, you may introduce dropped frames in your application.

    In this benchmark, we use typed arrays, because it makes it possible to determine their size in kilobytes precisely. You can also transfer JavaScript objects, but due to the serialization process, they take longer to post. For small objects, this doesn’t really matter, but if you need to send huge objects, you may as well serialize them to a binary format. You can use something similar to Protocol Buffers.

    Web workers are fast if used correctly

    Here is a quick summary of various benchmarks related to web workers, as measured on a Flame:

    Operation                                     Value
    Instantiation of a web worker                 40 ms
    Instantiation of a broadcast channel          1 ms
    Communication latency with postMessage        0.5 ms
    Communication latency with broadcast channel  1.5 ms
    Communication speed with postMessage          80 kB/ms
    Communication speed with broadcast channel    12 kB/ms
    Maximum message size with postMessage         1,300 kB
    Maximum message size with broadcast channel   200 kB

    Benchmarking is the only way to make sure that the solution you are implementing is fast. This process takes much of the guesswork out of web development.

    If you want to run these benchmarks on a specific device, the app I built to make these measurements, web workers benchmark, is open source. You are also welcome to contribute by submitting new types of benchmarks.

  3. Performance Testing Firefox OS With Raptor

    When we talk about performance for the Web, a number of familiar questions may come to mind:

    • Why does this page take so long to load?
    • How can I optimize my JavaScript to be faster?
    • If I make some changes to this code, will that make this app slower?

    I’ve been working on making these types of questions easier to answer for Gaia, the UI layer for Firefox OS, a completely web-centric mobile device OS. Writing performant web pages for the desktop has its own idiosyncrasies, and writing native applications using web technologies takes the challenge up an order of magnitude. I want to introduce the challenges I’ve faced in making performance an easier topic to tackle in Firefox OS, as well as document my solutions and expose holes in the Web’s APIs that need to be filled.

    From now on, I’ll refer to web pages, documents, and the like as applications, and while web “documents” typically don’t need the performance attention I’m going to give here, the same techniques could still apply.

    Fixing the lifecycle of applications

    A common question I get asked in regard to Firefox OS applications:

    How long did the app take to load?

    Tough question, as we can’t be sure we are speaking the same language. Based on UX and my own research at Mozilla, I’ve tried to adopt this definition for determining the time it takes an application to load:

    The amount of time it takes to load an application is measured from the moment a user initiates a request for the application to the moment the application appears ready for user interaction.

    On mobile devices, this is generally from the time the user taps on an icon to launch an app, until the app appears visually loaded; when it looks like a user can start interacting with the application. Some of this time is delegated to the OS to get the application to launch, which is outside the control of the application in question, but the bulk of the loading time should be within the app.

    So window load right?

    With SPAs (single-page applications), Ajax, script loaders, deferred execution, and friends, window load doesn’t hold much meaning anymore. If we could merely measure the time it takes to hit load, our work would be easy. Unfortunately, there is no way to infer the moment an application is visually loaded in a predictable way for everyone. Instead we rely on the apps to imply these moments for us.

    For Firefox OS, I helped develop a series of conventional moments that are relevant to almost every application for implying its loading lifecycle (also documented as a performance guideline on MDN):

    navigation loaded (navigationLoaded)

    The application designates that its core chrome or navigation interface exists in the DOM and has been marked as ready to be displayed, e.g. the element is not hidden with display: none or any other functionality that would affect its visibility.

    navigation interactive (navigationInteractive)

    The application designates that the core chrome or navigation interface has its events bound and is ready for user interaction.

    visually loaded (visuallyLoaded)

    The application designates that it is visually loaded, i.e., the “above-the-fold” content exists in the DOM and has been marked as ready to be displayed, again not display: none or other hiding functionality.

    content interactive (contentInteractive)

    The application designates that it has bound the events for the minimum set of functionality to allow the user to interact with “above-the-fold” content made available at visuallyLoaded.

    fully loaded (fullyLoaded)

    The application has been completely loaded, i.e., any relevant “below-the-fold” content and functionality have been injected into the DOM, and marked visible. The app is ready for user interaction. Any required startup background processing is complete and should exist in a stable state barring further user interaction.

    The important moment is visually loaded. This correlates directly with what the user perceives as “being ready.” As an added bonus, using the visuallyLoaded metric pairs nicely with camera-based performance verifications.

    Denoting moments

    With a clearly-defined application launch lifecycle, we can denote these moments with the User Timing API, available in Firefox OS starting with v2.2:

    window.performance.mark( string markName )
    

    Specifically during a startup:

    performance.mark('navigationLoaded');
    performance.mark('navigationInteractive');
    ...
    performance.mark('visuallyLoaded');
    ...
    performance.mark('contentInteractive');
    performance.mark('fullyLoaded');
    

    You can even use the measure() method to create a measurement between the navigation start and another mark, or even between two other marks:

    // Denote point of user interaction
    performance.mark('tapOnButton');
    
    loadUI();
    
    // Capture the time from now (sectionLoaded) to tapOnButton
    performance.measure('sectionLoaded', 'tapOnButton');
    
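    measure() also accepts an explicit end mark; for instance, to measure between two of the moments defined above (the measure name chromeInteractive is made up):

    // Time from navigationLoaded to navigationInteractive.
    performance.measure('chromeInteractive', 'navigationLoaded', 'navigationInteractive');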

    Fetching these performance metrics is pretty straightforward with getEntries, getEntriesByName, or getEntriesByType, which fetch a collection of the entries. The purpose of this article isn’t to cover the usage of User Timing though, so I’ll move on.

    Armed with the moment an application is visually loaded, we know how long it took the application to load because we can just compare it to—oh, wait, no. We don’t know the moment of user intent to launch.

    While desktop sites may be able to easily procure the moment at which a request was initiated, doing this on Firefox OS isn’t as simple. In order to launch an application, a user will typically tap an icon on the Homescreen. The Homescreen lives in a process separate from the app being launched, and we can’t communicate performance marks between them.

    Solving problems with Raptor

    The platform doesn’t yet provide the APIs or interaction mechanisms to overcome this and other difficulties, so we’ve built tools to help. This is how the Raptor performance testing tool originated. With it, we can gather metrics from Gaia applications and answer the performance questions we have.

    Raptor was built with a few goals in mind:

    • Performance test Firefox OS without affecting performance. We shouldn’t need polyfills, test code, or hackery to get realistic performance metrics.
    • Utilize web APIs as much as possible, filling in gaps through other means as necessary.
    • Stay flexible enough to cater to the many different architectural styles of applications.
    • Be extensible for performance testing scenarios outside the norm.

    Problem: Determining moment of user intent to launch

    Given two independent applications — Homescreen and any other installed application — how can we create a performance marker in one and compare it to a marker in the other? Even if we could send a performance mark from one app to another, the values would be incomparable. According to the High Resolution Time specification, the values produced are monotonically increasing numbers measured from each page’s time origin, which is different in each page context. These values represent the amount of time passed from one moment to another, not an absolute moment.

    The first breakdown in existing performance APIs is that there’s no way to associate a performance mark in one app with any other app. Raptor takes a simplistic approach: log parsing.

    Yes, you read that correctly. Every time Gecko receives a performance mark, it logs a message (i.e., to adb logcat) and Raptor streams and parses the log looking for these log markers. A typical log entry looks something like this (we will decipher it later):

    I/PerformanceTiming( 6118): Performance Entry: clock.gaiamobile.org|mark|visuallyLoaded|1074.739956|0.000000|1434771805380
    

    The important thing to notice in this log entry is its origin: clock.gaiamobile.org, or the Clock app; here the Clock app created its visually loaded marker. In the case of the Homescreen, we want to create a marker that is intended for a different context altogether. This is going to need some additional metadata to go along with the marker, but unfortunately the User Timing API does not yet have that ability. In Gaia, we have adopted an @ convention to override the context of a marker. Let’s use it to mark the moment of app launch as determined by the user’s first tap on the icon:

    performance.mark('appLaunch@' + appOrigin)
    

    Launching the Clock from the Homescreen and dispatching this marker, we get the following log entry:

    I/PerformanceTiming( 5582): Performance Entry: verticalhome.gaiamobile.org|mark|appLaunch@clock.gaiamobile.org|80081.169720|0.000000|1434771804212
    

    With Raptor we change the context of the marker if we see this @ convention.

    Problem: Incomparable numbers

    The second breakdown in existing performance APIs deals with the incomparability of performance marks across processes. Using performance.mark() in two separate apps will not produce meaningful numbers that can be compared to determine a length of time, because their values do not share a common absolute time reference point. Fortunately there is an absolute time reference that all JS can access: the Unix epoch.

    Date.now() at any given moment returns the number of milliseconds that have elapsed since January 1st, 1970. Raptor had to make an important trade-off: abandon the precision of high-resolution time for the comparability of the Unix epoch. Looking at the previous log entry, let’s break down its output. Notice the correlation of certain pieces to their User Timing counterparts:

    • Log level and tag: I/PerformanceTiming
    • Process ID: 5582
    • Base context: verticalhome.gaiamobile.org
    • Entry type: mark, but could be measure
    • Entry name: appLaunch@clock.gaiamobile.org, the @ convention overriding the mark’s context
    • Start time: 80081.169720
    • Duration: 0.000000, this is a mark, not a measure
    • Epoch: 1434771804212

    For every performance mark and measure, Gecko also captures the epoch of the mark, and we can use this to compare times from across processes.
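
    As an illustration (this is not Raptor’s actual parser), extracting these fields and comparing epochs across processes could look like this:

    // Sketch: parse a PerformanceTiming log line into its fields.
    function parseEntry(line) {
      var match = line.match(/Performance Entry: (.*)$/);
      if (!match) return null;
      var f = match[1].split('|');
      var entry = {
        context: f[0],
        entryType: f[1],
        name: f[2],
        startTime: parseFloat(f[3]),
        duration: parseFloat(f[4]),
        epoch: parseInt(f[5], 10)
      };
      var at = entry.name.indexOf('@'); // the @ convention overrides the context
      if (at !== -1) {
        entry.context = entry.name.slice(at + 1);
        entry.name = entry.name.slice(0, at);
      }
      return entry;
    }

    // Launch time: visuallyLoaded epoch (Clock) minus appLaunch epoch (Homescreen).
    var launch = parseEntry('I/PerformanceTiming( 5582): Performance Entry: verticalhome.gaiamobile.org|mark|appLaunch@clock.gaiamobile.org|80081.169720|0.000000|1434771804212');
    var loaded = parseEntry('I/PerformanceTiming( 6118): Performance Entry: clock.gaiamobile.org|mark|visuallyLoaded|1074.739956|0.000000|1434771805380');
    console.log((loaded.epoch - launch.epoch) + ' ms'); // 1168 ms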

    Pros and Cons

    Everything is a game of tradeoffs, and performance testing with Raptor is no exception:

    • We trade high-resolution times for millisecond resolution in order to compare numbers across processes.
    • We trade JavaScript APIs for log parsing so we can access data without injecting custom logic into every application, which would affect app performance.
    • We currently trade a high-level interaction API, Marionette, for low-level interactions using Orangutan behind the scenes. While this provides us with transparent events for the platform, it also makes writing rich tests difficult. There are plans to improve this in the future by adding Marionette integration.

    Why log parsing

    You may be a person that believes log parsing is evil, and to a certain extent I would agree with you. While I do wish for every solution to be solvable using a performance API, unfortunately this doesn’t exist yet. This is yet another reason why projects like Firefox OS are important for pushing the Web forward: we find use cases which are not yet fully implemented for the Web, poke holes to discover what’s missing, and ultimately improve APIs for everyone by pushing to fill these gaps with standards. Log parsing is Raptor’s stop-gap until the Web catches up.

    Raptor workflow

    Raptor is a Node.js module built into the Gaia project that enables the project to do performance tests against a device or emulator. Once you have the project dependencies installed, running performance tests from the Gaia directory is straightforward:

    1. Install the Raptor profile on the device; this configures various settings to assist with performance testing. Note: this is a different profile that will reset Gaia, so keep that in mind if you have particular settings stored.
      make raptor
    2. Choose a test to run. Currently, tests are stored in tests/raptor in the Gaia tree, so some manual discovery is needed. There are plans to improve the command-line API soon.
    3. Run the test. For example, you can performance test the cold launch of the Clock app using the following command, specifying the number of runs to launch it:
      APP=clock RUNS=5 node tests/raptor/launch_test
    4. Observe the console output. At the end of the test, you will be given a table of test results with some statistics about the performance runs completed. Example:
    [Cold Launch: Clock Results] Results for clock.gaiamobile.org
    
    Metric                            Mean     Median   Min      Max      StdDev  p95
    --------------------------------  -------  -------  -------  -------  ------  -------
    coldlaunch.navigationLoaded       214.100  212.000  176.000  269.000  19.693  247.000
    coldlaunch.navigationInteractive  245.433  242.000  216.000  310.000  19.944  274.000
    coldlaunch.visuallyLoaded         798.433  810.500  674.000  967.000  71.869  922.000
    coldlaunch.contentInteractive     798.733  810.500  675.000  967.000  71.730  922.000
    coldlaunch.fullyLoaded            802.133  813.500  682.000  969.000  72.036  928.000
    coldlaunch.rss                    10.850   10.800   10.600   11.300   0.180   11.200
    coldlaunch.uss                    0.000    0.000    0.000    0.000    0.000   n/a
    coldlaunch.pss                    6.190    6.200    5.900    6.400    0.114   6.300
    

    Visualizing Performance

    Access to raw performance data is helpful for a quick look at how long something takes, or to determine if a change you made causes a number to increase, but it’s not very helpful for monitoring changes over time. Raptor has two methods for visualizing performance data over time, in order to improve performance.

    Official metrics

    At raptor.mozilla.org, we have dashboards for persisting the values of performance metrics over time. In our automation infrastructure, we execute performance tests against devices for every new build generated by mozilla-central or b2g-inbound (Note: The source of builds could change in the future.) Right now this is limited to Flame devices running at 319MB of memory, but there are plans to expand to different memory configurations and additional device types in the very near future. When automation receives a new build, we run our battery of performance tests against the devices, capturing numbers such as application launch time and memory at fullyLoaded, reboot duration, and power current. These numbers are stored and visualized many times per day, varying based on the commits for the day.

    Looking at these graphs, you can drill down into specific apps, focus or expand your time query, and do advanced query manipulation to gain insight into performance. Watching trends over time, you can even pick out regressions that have sneaked into Firefox OS.

    Local visualization

    The very same visualization tool and backend used by raptor.mozilla.org is also available as a Docker image. After running the local Raptor tests, data will report to your own visualization dashboard based on those local metrics. There are some additional prerequisites for local visualization, so be sure to read the Raptor docs on MDN to get started.

    Performance regressions

    Building pretty graphs that display metrics is all well and fine, but finding trends in data or signal within noise can be difficult. Graphs help us understand data and make it accessible for others to easily communicate around the topic, but using graphs for finding regressions in performance is reactive; we should be proactive about keeping things fast.

    Regression hunting on CI

    Rob Wood has been doing incredible work in our pre-commit continuous integration efforts surrounding the detection of performance regressions in prospective commits. With every pull request to the Gaia repository, our automation runs the Raptor performance tests against the target branch with and without the patch applied. After a certain number of iterations for statistical accuracy, we have the ability to reject patches from landing in Gaia if a regression is too severe. For scalability purposes we use emulators to run these tests, so there are inherent drawbacks such as greater variability in the metrics reported. This variability limits the precision with which we can detect regressions.

    Regression hunting in automation

    Luckily we have post-commit automation in place to run performance tests against real devices; this is where the dashboards receive their data. Based on the excellent Python tool from Will Lachance, we query our historical data daily, attempting to discover any smaller regressions that could have crept into Firefox OS in the previous seven days. Any performance anomalies found are promptly reported to Bugzilla, and relevant bug component watchers are notified.

    Recap and next steps

    Raptor, combined with User Timing, has given us the know-how to ask questions about the performance of Gaia and receive accurate answers. In the future, we plan on improving the API of the tool and adding higher-level interactions. Raptor should also be able to work more seamlessly with third-party applications, something that is not easily done right now.

    Raptor has been an exciting tool to build, while at the same time helping us drive the Web forward in the realm of performance. We plan on using it to keep Firefox OS fast, and to stay proactive about protecting Gaia performance.

  4. New Performance Tools in Firefox Developer Edition 40

    Today Mozilla is pleased to announce the availability of Firefox Developer Edition 40 (DE 40) featuring all-new performance tools! In this post we will cover some of DE 40’s new developer tools, fixes, and improvements made to existing tools. In addition, a couple of videos showcase some of these features.

    Note: Many of the new features were introduced in May, in an earlier Mozilla Hacks post.

    Introducing the new performance tools

    Firefox Developer Edition features a new performance tool that gives developers a better understanding of what is happening from a performance standpoint within their applications. Web developers can use these tools to profile performance in any kind of website, app, or game; for a fun insight into how these tools can be used to optimize HTML5 games, check out our post about the “Power Surge” game right after you’re done here.

    All performance tools can now be found grouped together under the Performance tab, for easier usage. Performance is all about timing, so you can view browser events in the context of a timeline, which in turn can be extended to include a number of detailed views based on the metrics you choose to monitor.

    In the following video, Dan Callahan demonstrates how to use the new performance tools.


    The Performance tab contains the new timeline, which includes a Waterfall view, a Call Tree view, and a Flame Chart view.

    All of the views above provide details of application performance that can be correlated with a recorded timeline overview. The timeline displays a compressed view of the Waterfall, minimum, maximum, and average frame rates, and a graphical representation of the frame rate. Left-clicking on the view and dragging to the desired range allows you to zoom into this timeline. This also simultaneously updates all three new views to represent a particular selected range.

    The recording view gives developers a quick way to zoom into areas where frame rate problems are occurring.
    [Screenshot: the recording view]

    The Waterfall view provides a graphical timeline of events occurring within the application. These events include markers for occurrences such as reflows, restyles, JavaScript calls, garbage collection, and paint operations. Using a simple filter button you can select the events you want to display in the Waterfall.

    [Screenshot: the Waterfall filter menu]

    You can use console commands like console.timeStamp() to indicate, with a marker on the Waterfall, when a specific event occurs. Also, you can graphically show timespans using the console.time() and console.timeEnd() functions.
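
    For example (renderScene is a hypothetical function to be timed):

    // Drop a single marker on the Waterfall.
    console.timeStamp('data loaded');

    // Show a timespan on the Waterfall.
    console.time('render');
    renderScene();
    console.timeEnd('render');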

    [Screenshot: console.timeStamp() markers on the Waterfall]

    The Call Tree view shows the results of the JavaScript profiler for the specified range. Using this view you can see the approximate time spent in a function. The table displays total time spent within a function call or the self-time that a particular function call is using. The total time encapsulates all time spent in the function and includes time spent in nested function calls. The self-time only includes time spent in the particular function, excluding nested calls. This view can be very helpful when trying to locate functions that are consuming a large portion of processing time. This view has been available in previous iterations of Firefox, and should be familiar to developers who have used the tool in the past.

    [Screenshot: the Call Tree view]
    The Flame Chart view is similar to the Call Tree in that it graphically illustrates the call stack for a selected range. For example, in the screenshot below the drawCirc() function is taking over 25 milliseconds (ms) to complete, which is larger than the allotted time for frame generation to produce 60 frames per second.
    [Screenshot: the Flame Chart view]

    Performance profiles can be created, saved, imported, or deleted. In addition, multiple profiles can be opened at once to contrast and compare performance statistics between runs. Profiles can be created programmatically or using the console, by entering console.profile('NameOfProfile') to start a profile and console.profileEnd('NameOfProfile') to stop it. This allows you to fine-tune when profiling starts and stops within your code.
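
    For example (initApp stands in for whatever code you want to profile):

    console.profile('startup');
    initApp();
    console.profileEnd('startup');
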
    [Screenshot: console profiling commands in the Web Console]
    You can find complete docs for the performance tools on MDN. These include a tour of the UI, reference pages for each of the main tools, and some examples in which we use the tools to diagnose performance problems in CSS animations and JavaScript-heavy pages.

    Additional features and improvements

    In addition to the new Performance tools we’ve also implemented many new convenience features — mostly inspired by direct feedback from developers via our UserVoice channel — and over ninety bug fixes, representing a ton of hard work over the last eight weeks from Firefox Developer Tools staff as well as many contributors. Please continue to submit your feedback.

    This video from Matthew “Potch” Claypotch shows off some of the most requested feature implementations for Developer Edition 40.

    Network Monitor improvements

    As seen in the video above, the Network Monitor includes many improvements, such as data collection while the Network tab is not active, and the ability to quickly see when an asset is loaded from the cache as opposed to the network.
    [Screenshot: cached assets in the Network Monitor]
    It is now possible to copy post data, URL parameters, and Request and Response headers using the context menu when selecting a row entry.
    [Screenshot: copying post data from the context menu]

    CSS docs integration

    Firefox Developer Tools now support integration with MDN documentation for CSS properties, providing more information for developers while they are debugging web app styling and layout. To access this feature, you can right-click (Ctrl + click on Mac) on CSS properties within the Inspector, and select “Show MDN Docs” from the context menu.
    [Screenshot: the “Show MDN Docs” context menu item]

    Improved Inspector layout

    In the Inspector, whitespace in text node layout is cleaned up, providing a better view of your markup.
    [Screenshot: whitespace handling in the Inspector]

    Additional fixes

    Many additional fixes are also included, such as improvements to the Animation Inspector, “scroll into view” context menu support, and Inspector search improvements. To see all the bugs addressed in this release, have a look at the list in Bugzilla.

    We’d like to send a gigantic special thank you to all the contributors and individuals who reported bugs, tested patches, and spent many hours working to make Firefox Developer Tools impressive.

  5. Let’s get charged: Updates to the Battery Status API

    Web APIs provide a way for Open Web Apps to access device hardware, data and sensors through JavaScript, and open the doors to a number of possibilities especially for mobile devices, TVs, interactive kiosks, and Internet of Things (IoT) applications.

    Knowing the battery status of a device can be useful in a number of situations or use cases. Here are some examples:

    • Utility apps that collect statistics on battery usage or simply inform the user if the device is charged enough to play a game, watch a movie, or browse the Web.
    • High-quality apps that optimize battery consumption: for example, an email client may check the server for new email less frequently if the device is low on battery.
    • A word processor could save changes automatically before the battery runs out in order to prevent data loss.
    • A system that checks whether an interactive kiosk or TV installed in a showroom or at an event is charging, or whether something has gone wrong with the cables.
    • A module that checks the battery status of a drone in order to make it come back to the base before it runs out of power.

    This article looks at a standardized way to manage energy consumption: The Battery Status API.

    The Battery Status API

    Open Web Apps can retrieve battery status information thanks to the Battery Status API, a W3C Recommendation supported by Firefox since version 16. The API is also supported by Firefox OS, and recently by Chrome, Opera, and the Android browser, so now it can be used in production across many major platforms.

    Also, the W3C Recommendation has recently been improved, introducing Promise Objects, and the ability to handle multiple batteries installed on the same device.

    At the time of writing, this W3C update has not yet been implemented by Firefox: please check the following bugs for implementation updates or in case you want to contribute to Gecko development:

    • [1050749] Expose BatteryManager via getBattery() returning a Promise instead of a synchronous accessor (navigator.battery)
    • [1050752] BatteryManager: specify the behavior when a host device has more than one battery

    Below we will look at using the Battery Status API in an instant messaging app running on Firefox OS and all the browsers that currently support the API.

    Demo: Low Energy Messenger

    Low Energy Messenger is an instant messaging demo app that pays close attention to battery status. The app has been built with HTML, CSS, and JavaScript (no libraries) and uses static data. It does not include web services running on the Internet, but it includes real integration with the Battery Status API and has a realistic look and feel.

    You’ll find a working demo of Low Energy Messenger, along with the demo code on GitHub, and an MDN article called Retrieving Battery status information that explains the code step-by-step.

    [Screenshot: Low Energy Messenger]

    Low Energy Messenger has the following features:

    • A battery status bar, containing battery status information.
    • A chat section, containing all the messages received or sent.
    • An action bar, containing a text field, a button to send a message, a button to take a photo, and a button to install the app on Firefox OS
    • In order to preserve battery life when the power level is low, the app doesn’t allow users to take photos when the device is running out of battery.

    The visual representation of the battery, in the app’s status bar, changes depending on the charge level. For example:

    13% Discharging: 0:23 remaining
    40% Discharging: 1:19 remaining
    92% Charging: 0:16 until full

    Low Energy Messenger includes a module called EnergyManager.js that uses the Battery Status API to get the information displayed above and perform checks.

    The battery object, of type BatteryManager, is provided by the navigator.getBattery method, using Promises, or by the deprecated navigator.battery property, part of a previous W3C specification and currently used by Firefox and Firefox OS. As mentioned above, follow this bug for implementation updates or if you want to contribute to Gecko development.

    The EnergyManager.js module eliminates this difference in API implementation in the following way:

    /* EnergyManager.js */
    init: function(callback) {
        var _self = this;
        /* Initialize the battery object */
        if (navigator.getBattery) {
            navigator.getBattery().then(function(battery) {
                _self.battery = battery;
                callback();
            });
        } else if (navigator.battery || navigator.mozBattery) { // deprecated battery objects
            _self.battery = navigator.battery || navigator.mozBattery;
            callback();
        }
    }

    The navigator.getBattery method returns a battery promise, which resolves to a BatteryManager object providing events you can handle to monitor the battery status. The deprecated navigator.battery attribute returns the BatteryManager object directly; the implementation above also checks for vendor prefixes, to cover even older, experimental API implementations shipped by Mozilla in earlier stages of the specification.

    Logging into the Web Console of a browser is a useful way to understand how the Battery Status API actually works:

    /* EnergyManager.js */
    log: function(event) {
        if (event) {
            console.warn(event);
        }
        console.log('battery.level: ' + this.battery.level);
        console.log('battery.charging: ' + this.battery.charging);
        console.log('battery.chargingTime: ' + this.battery.chargingTime);
        console.log('battery.dischargingTime: ' + this.battery.dischargingTime);
    }

    Here is how the logs appear on the Web Console:

    [Screenshot: battery logs in the Web Console]

    Every time an event (dischargingtimechange, levelchange, etc.) gets fired, the BatteryManager object provides updated values that can be used by the application for any purpose.
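
    For example, an app could subscribe to these events as follows (assuming battery was initialized as in EnergyManager.js above):

    battery.addEventListener('levelchange', function () {
      console.log('Battery level: ' + Math.round(battery.level * 100) + '%');
    });

    battery.addEventListener('chargingchange', function () {
      console.log(battery.charging ? 'Plugged in' : 'On battery');
    });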

    Conclusions

    The Battery Status API is a standardized way to access the device hardware and is ready to be used in production, even if at the time of writing some compatibility checks still have to be performed on Firefox. Also, the W3C specification is generic enough to be used in different contexts, thus the API covers a good number of real-world use cases.

  6. Firefox OS, Animations & the Dark Cubic-Bezier of the Soul

    I’ve been using Firefox OS daily for a couple of years now (wow, time flies!). While performance has steadily improved with efforts like Project Silk, I’ve often noticed delays in the user interface. I assumed the delays were because the hardware was well below the “flagship” hardware I’ve become accustomed to with Android and iOS devices.

    Last year, I built Firefox OS for a Nexus 4 and started using that as my daily phone. Quickly I realized that even with better hardware, I sometimes had to wait on Firefox OS for basic interactions, even when the task wasn’t computationally intensive. I moved on to a Nexus 5 and then a Sony Z3 Compact, both with better specs than the Nexus 4, and experienced the same thing.

    Time passed. Frustration grew. Whispers of a nameless fear…

    Running the numbers

    While reading Ralph Thomas’s post about creating animations based on physical models, I wondered about the implementation of animations in Firefox OS, and how that might be involved in this problem. I performed an audit of the number of instances of different animations, grouped by their duration. I removed progress indicators and things like the boot shutdown animation. Here are the animation and transition durations in Firefox OS, grouped by duration, for transitional interactions like scaling, opening, closing and sliding:

    • 0.1s: 15
    • 0.2s: 57
    • 0.3s: 79
    • 0.4s: 40
    • 0.5s: 78
    • 0.6s: 8

    A couple of things stand out. First, we have a pretty wide distribution of animation durations. Second, the vast majority of the animations are more than 300ms long!

    In fact, in more than 80 animations we are making the user wait more than half a second. These slow animations are dragging us down, resulting in a poorer overall experience of Firefox OS.

    How did we get here?

    The Firefox OS UX and interaction designers didn’t huddle in a room and design each interaction to be intentionally slow. The engineers who implemented these animations didn’t ever think to themselves “this feels really responsive… let’s make it slower!”

    My theory is that interactions like these don’t feel slow while you’re designing and implementing them, because you’re working with a single interaction at a time. When designing and developing an animation, I look for fluidity of motion, the aesthetics of that single action and how the visual impact enhances the task at hand, and then I iterate on duration and effects until it feels right.

    We do have guidelines for responsiveness and user-perceived performance in Firefox OS, written up by Gordon Brander, which you can see in the screenshot below. (Click the image for a larger, more readable version.) However, those guidelines don’t cover the sub-second period between the initial perception of cause and effect and the next actionable state of the user interface.

    [Screenshot: Firefox OS responsiveness guidelines]

    Users have an entirely different experience than we do as developers and designers. Users make their way through our animations while hurriedly sending a text message, trying to capture that perfect moment on camera, entering their username and password, or arduously uploading a bunch of images one at a time. People are trying to get from point A to point B. They want to complete a task… well, actually not just one: Smartphone users are trying to complete 221 tasks every day, according to a study in the UK last October by Tecmark. All those animations add up! I assert that the aggregate of those 203 animations in Gaia that are 300ms and longer contributes to the frustrating feeling of slowness I was experiencing before digging into this.

    Making it feel fast

    So I tested this theory, by changing all animation durations in Gaia to 200ms, as a starting point. The result? Firefox OS feels far more responsive. Moving through tasks and navigating around the OS felt quick but not abrupt. The camera snaps to readiness. Texting feels so much more fluid and snappy. Apps pop up, instead of slowly hauling their creaky bones out of bed. The Rocketbar gets closer to living up to its name (though I still think the keyboard should animate up while the bar becomes active).

    Here’s a demo of some of our animations side by side, before and after this patch:

    [Video: animations before and after the duration change]

    There are a couple of things we can do about this in Gaia:

    1. I filed a bug to get this change landed in Gaia. The 200ms duration is a first stab at this until we can do further testing. Better to err on the snappy side instead of the sluggish side. We’ve got the thumbs-up from most of the 16 developers who had to review the changes, and are now working with the UX team to sign off before it can land. Kevin Grandon helped by adding a CSS variable that we can use across all of Gaia, which will make it easier to implement these types of changes OS-wide in the future as we learn more.
    2. I’m working with the Firefox OS UX team to define global and consistent best-practices for animations. These guidelines will not be correct 100% of the time, but can be a starting point when implementing new animations, ensuring that the defaults are based on research and experience.

    If you are a Firefox OS user, report bugs if you experience anything that feels slow. By reporting a bug, you can make change happen and help improve the user experience for everyone on Firefox OS.

    If you are a developer or designer, what are your animation best-practices? What user feedback have you received on the animations in your Web projects? Let us know in the comments below!

  7. Optimising SVG images

    SVG is a vector image format based on XML. It has great advantages, most notably it is lightweight. Since SVG is a text format, it can be viewed and modified using a simple text editor, and applying GZIP compression produces excellent results.

    It’s critical for a website to provide assets that are as lightweight as possible, especially on mobile where bandwidth can be very limited. You want to optimise your SVG files to have your app load and display as quickly as possible.

    This article will show how to use dedicated tools to optimise SVG images. You will also learn how the markup works so you can go the extra mile to produce the lightest possible images.

    Introducing svgo

    Optimising SVG is very similar to minifying CSS or other text-based formats such as JavaScript or HTML. It is mainly about removing useless whitespace and redundant characters.

    The tool I recommend to reduce the size of SVG images is svgo. It is written for node.js. To install it, just do:

    $ npm install -g svgo

    In its basic form, you’ll use a command line like this:

    $ svgo --input img/graph.svg --output img/optimised-graph.svg

    Please make sure to specify an --output parameter if you want to keep the original image. Otherwise svgo will replace it with the optimised version.

    svgo will apply several changes to the original file—stripping out useless comments, tags, and attributes, reducing the precision of numbers in path definitions, and sorting attributes for better GZIP compression.

    This works with no surprises for simple images. However, in more complex cases, the image manipulation can result in a garbled file.

    svgo plugins

    svgo is very modular thanks to a plugin-based architecture.

    When optimising complex images, I’ve noticed that the main issues are caused by two svgo plugins:

    • convertPathData
    • mergePaths

    Deactivating these will ensure you get a correct result in most cases:

    $ svgo --disable=convertPathData --disable=mergePaths -i img/a.svg

    convertPathData will convert the path data using relative and shorthand notations. Unfortunately, some environments won’t fully recognise this syntax and you’ll get something like:

    [Screenshot: Gnome Image Viewer displaying an original SVG image (left) and a version optimised via svgo (right)]

    Please note that the optimised image will display correctly in all browsers. So you may still want to use this plugin.

    The other plugin that can cause you trouble—mergePaths—will merge together shapes of the same style to reduce the number of <path> tags in the source. However, this might create issues if two paths overlap.

    [Image: rendering issue caused by mergePaths]

    In the image on the right, note the rendering differences around the character’s neck and hand, and note the Twitter logo as well. The outline view shows 3 overlapping paths that make up the character’s head.

    My suggestion is to first try svgo with all plugins activated, then if anything is wrong, deactivate the two mentioned above.

    If the result is still very different from your original image, then you’ll have to deactivate the plugins one by one to detect the one which causes the issue. Here is a list of svgo plugins.

    Optimising even further

    svgo is a great tool, but in some specific cases, you’ll want to compress your SVG images even further. To do so, you have to dig into the file format and do some manual optimisations.

    In these cases, my favourite tool is Inkscape: it is free, open source and available on most platforms.

    If you want to use the mergePaths plugin of svgo, you must combine overlapping paths yourself. Here’s how to do it:

    Open your image in Inkscape and identify the paths that share the same style (fill and stroke). Select them all (hold Shift for multiple selection), then click on the Path menu and select Union. You’re done: all three paths have been merged into a single one.

    Merge paths technique

    The 3 different paths that create the character’s head are merged, as shown by the outline view on the right.



    Repeat this operation for all paths of the same style that are overlapping and then you’re ready to use svgo again, keeping the mergePaths plugin.

    There are all sorts of different optimisations you can apply manually:

    • Convert strokes to paths so they can be merged with paths of similar style.
    • Cut paths manually to avoid using clip-path.
    • Exclude an underlying path from an overlapping path and merge it with a similar path to avoid layering issues. (In the image above, see the character’s hair: the side hair path is under his head, but the top hair is above it, so you can’t merge the 3 hair paths as is.)

    Final considerations

    These manual optimisations can take a lot of time for meagre results, so think twice before starting!

    A good rule of thumb when optimising SVG images is to make sure the final file has only one path per style (same fill and stroke) and uses no <g> tags to group paths into objects.

    In Firefox OS, we use an icon font, gaia-icons, generated from SVG glyphs. I noticed that optimising them resulted in a significantly lighter font file, with no visual differences.

    Whether you use SVG to embed images in an app or to create a font file, always remember to optimise. It will make your users happier!

  8. Project Silk

    Editor’s Note: An earlier version of this post appeared on Mason Chang’s personal blog.

    For the past few months, I’ve been working on Project Silk, which improves smoothness across the browser. Very much like Project Butter for Android, part of it is finally live on Firefox OS. Silk does three things:

    1. Align Painting with hardware vsync
    2. Resample touch input events based on hardware vsync
    3. Align composites with hardware vsync

    What is vsync, why vsync, and why does it matter at all?

    Vertical synchronization (vsync) occurs when the hardware display shows a new frame on the screen. The rate is set by the display hardware; most common displays refresh 60 times a second, or once every 16.6 ms (milliseconds). This is where you hear about 60 frames per second: one frame each time the hardware display refreshes. What this means in practice is that no matter how many frames are produced in software, the hardware display will still only show at most 60 unique frames per second.

    Currently in Firefox, we mimic 60 frames per second and therefore vsync with a software timer that schedules rendering every 16.6 ms. However, the software scheduler has two problems: (a) it’s noisy and (b) it can be scheduled at bad times relative to vsync.

    With regard to noise, software timers are much noisier than hardware timers. This creates micro-jank for a number of reasons. First, many animations are keyed off timestamps generated by the software scheduler to update the position of the animation. If you’ve ever used requestAnimationFrame, you get a timestamp from a software timer. If you want smooth animations, the timestamps provided to requestAnimationFrame should be uniform; non-uniform timestamps will create non-uniform, janky animations. Here is a graph showing software versus hardware vsync timer uniformity:

    timer

    Wow! Big improvement with a hardware timer. We get a much more uniform, and therefore smoother, timestamp to key animations off of. So that addresses problem (a), noisy timers in software versus hardware.
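
    You can observe this noise yourself with a few lines of JavaScript. This sketch logs the spacing between successive requestAnimationFrame timestamps; with a uniform, vsync-driven timer the deltas should cluster tightly around 16.6 ms:

    // Log the delta between successive requestAnimationFrame timestamps.
    var last = null;
    function measure(timestamp) {
      if (last !== null) {
        console.log('frame delta: ' + (timestamp - last).toFixed(2) + ' ms');
      }
      last = timestamp;
      requestAnimationFrame(measure);
    }
    requestAnimationFrame(measure);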

    As for problem (b), software timers can be scheduled at bad times relative to vsync. Regardless of what the software does, the hardware display will refresh on its own clock. If our rendering pipeline finishes producing a frame before the next vsync, the display is updated with the new content. If we fail to finish in time, the previous frame is displayed again, causing jankiness. Rendering that starts too close to a vsync tends to spill over into the next interval, which introduces extra latency since the frame won’t be displayed on the screen until the vsync after that. Let’s look at this in graphic form:

    frames

    At time 0, we start producing frames. For example, let’s say all frames take a constant time of 10 ms. Our frame budget is 16.6 ms because we only have to finish producing a frame before the next hardware vsync occurs. Since frame 1 is finished 6 ms before the next vsync (time t=16 ms), everything is successful and life is good. The frame is produced in time and the hardware display will be refreshed with the updated content.

    Now let’s look at Frame 2. Since software timers are noisy, we start producing a frame 9 ms from the next vsync (time t=32). Since our frame takes 10 ms to produce, we actually finish producing this frame at 1 ms AFTER the next vsync. That means at vsync number 2 (t=32), there is no new frame to display, so the display still shows the previous frame. In addition, the frame just produced won’t be shown until vsync 3 (t=48), because that’s when the hardware updates itself. This creates jank since now the display will have skipped one frame and will try to catch up in the upcoming frames. This also produces one extra frame of latency, which is terrible for games.

    Vsync addresses both of these problems since we get a much more uniform timer and the maximum amount of frame budget time to produce a new frame. Now that we know what vsync is, we can finally go on to what Project Silk is and how it helps create smooth experiences in Firefox.

    The Rendering Pipeline

    In super simplified terms, Gecko’s rendering pipeline does three things:

    1. Paint / draw the new frame on the main thread.
    2. Send the updated content to the Compositor via a LayerTransaction.
    3. Composite the new content.

    In an ideal world, we’d be able to do all three steps within 16.6 ms, but that’s not the case most of the time. Both steps (1) and (3) occur on independent software timers. Thus, there is no real synchronizing clock between the three steps; they are all ad hoc. They also have no relation to vsync, so the timing of the pipeline isn’t related to when the display actually updates the screen with content. With Silk, we replace both independent software timers with the hardware vsync timer. For our purposes, (2) doesn’t really affect the outcome, but is presented here for completeness.

    Align Painting with Hardware Vsync

    Aligning the timer used to tick the refresh driver with vsync creates smoothness in a couple of ways. First, many animations are still done on the main thread, which means any animation using timestamps to set its position should be smoother. This includes requestAnimationFrame animations! The other nice thing is that we now have a very strict ordering of when rendering is kicked off: instead of (1) and (3) running off separate, independently scheduled timers, we start rendering at a specific time.

    Resample Touch Input Events Based on Vsync

    With Silk, we can enable touch resampling, which improves smoothness while tracking your finger. Since I’ve already blogged about touch resampling quite a bit, I’ll keep this short. With Silk, we can finally enable it!

    Align Composites with Hardware Vsync

    Finally, the last part of Silk is about aligning composites with hardware vsync. Compositing takes all the painted content and merges it together to create the single image you see on the display. With Silk, all composites start right after a hardware vsync occurs. This has actually produced a rather nice side benefit — the reduced composite times seen here:

    compositeTimes

    Within the device driver on a Flame device, there’s a global lock that’s grabbed close to vsync intervals. This lock can take 5-6 ms to acquire, greatly increasing composite times. However, when we start a composite right after a vsync, there is little contention for the lock, so we avoid the wait and reduce composite times quite a bit. Not only do we get smoother animations, we also get shorter composite times, and therefore better battery life. What a nice win!

    With all three pieces, we now have a nice strict ordering of the rendering pipeline. We paint and send the updated content to the Compositor within 16.6 ms. At the next vsync, we composite the updated content. At the vsync after that, the frame should have gone through the rendering pipeline and will be displayed on the screen. Keeping this order reduces jank because we reduce the chance that the timers will schedule each step at a bad time. In a best-case scenario without Silk, a frame could be painted and composited within a single 16.6 ms interval, which is great. However, if the next frame instead takes two intervals to get through the pipeline, we’ve just created extra jank, even though no stage in the pipeline was really slow. Aligning the whole pipeline to create a strict sequence of events reduces the chance that we mis-schedule a frame.

    master

    Here’s a picture of the rendering pipeline without Silk. We have Composites (3) at the bottom of this profile. We have painting (1) in the middle, where you see Styles, Reflow, Displaylist, and Rasterize. We have Vsync, represented by those small orange boxes at the top. Finally we have Layer Transactions (2) at the bottom. At first, when we start, compositing and painting are not aligned, so animations are at different positions depending on whether they are on the main thread or the compositor thread. Second, we see long composites because the compositor is waiting on a global lock in the device driver. Lastly, it’s difficult to read any ordering or see if there is a problem without deep knowledge of why / when things should be happening.

    silk

    Here is a picture of the same pipeline with Silk. Composites are a little shorter, and the whole pipeline only starts at vsync intervals. Composite times are reduced because we start composites exactly at vsync intervals. There is a clear ordering of when things should happen. Both composites and painting are keyed off the same timestamp, ensuring smoother animations. Finally, there is a clear indicator that as long as everything finishes before the next Vsync, things will be smooth.

    Ultimately, Silk aims to create a smoother experience across Firefox and the Web. Numerous people contributed to the project. Thanks to Jerry Shih, Boris Chou, Jeff Hwang, Mike Lee, Kartikaya Gupta, Benoit Girard, Michael Wu, Ben Turner, and Milan Sreckovic for their help in making Silk happen.

  9. An easier way of using polyfills

    Polyfills are a fantastic way to enable the use of modern code even while supporting legacy browsers, but currently using polyfills is too hard, so at the FT we’ve built a new service to make it easier. We’d like to invite you to use it, and help us improve it.

    Image from https://www.flickr.com/photos/hamur0w0/6984884135

    More pictures, they said. So here’s a unicorn, which is basically a horse with a polyfill.

    The challenge

    Here are some of the issues we are trying to solve:

    • Developers do not necessarily know which features need to be polyfilled. You load your site in some old version of IE beloved by a frustratingly large number of your users, see that the site doesn’t work, and have to debug it to figure out which feature is causing the problem. Sometimes the culprit is obvious, but often not, especially when legacy browsers also lack good developer tools.
    • There are often multiple polyfills available for each feature. It can be hard to know which one most faithfully emulates the missing feature.
    • Some polyfills come as a big bundle with lots of other polyfills that you don’t need, to provide comprehensive coverage of a large feature set, such as ES6. It should not be necessary to ship all of this code to the browser to fix something very simple.
    • Newer browsers don’t need the polyfill, but typically the polyfill is served to all browsers. This reduces performance in modern browsers in order to improve compatibility with legacy ones. We don’t want to make that compromise. We’d rather serve polyfills only to browsers that lack a native implementation of the feature.

    Our solution: polyfills as a service

    To solve these problems, we created the polyfill service. It’s a similar idea to going to an optometrist, having your eyes tested, and getting a pair of glasses perfectly designed to correct your particular vision problem. We are doing the same for browsers. Here’s how it works:

    1. Developers insert a script tag into their page, which loads the polyfill service endpoint.
    2. The service analyses the browser’s user-agent header and a list of requested features (or uses a default list of everything polyfillable) and builds a list of polyfills that are required for this browser.
    3. The polyfills are ordered using a graph sort to place them in the right dependency order.
    4. The bundle is minified and served through a CDN (for which we’re very grateful to Fastly for their support).

    Do we really need this solution? Well, consider this: Modernizr is a big grab bag of feature detects, and all sensible use cases benefit from a custom build, but a large proportion of Modernizr users just use the default build, often from cdnjs.com or as part of html5boilerplate. Why include Modernizr if you aren’t using its feature detects? Maybe you misunderstand the purpose of the library and just think that Modernizr “fixes stuff”? I have to admit, I did, when I first heard the name, and I was mildly disappointed to find that rather than doing any actual modernising, Modernizr actually just defines modernness.

    The polyfill service, on the other hand, does fix stuff. There’s really nothing wrong with not wanting to spend time gaining intimate knowledge of all the foibles of legacy browsers. Let someone figure it out once, and then we can all benefit from it without needing or wanting to understand the details.

    How to use it

    The simplest use case is:

    <script src="//cdn.polyfill.io/v1/polyfill.min.js" async defer></script>

    This includes our default polyfill set. The default set is a manually curated list of features that we think are most essential to modern web development, and where the polyfills are reasonably small and highly accurate. If you want to specify which features you want to polyfill though, go right ahead:

    <!-- Just the Array.from polyfill -->
    <script src="//cdn.polyfill.io/v1/polyfill.min.js?features=Array.from" async defer></script>
     
    <!-- The default set, plus the geolocation polyfill -->
    <script src="//cdn.polyfill.io/v1/polyfill.min.js?features=default,Navigator.prototype.geolocation" async defer></script>

    If it’s important that you have loaded the polyfills before parsing your own code, you can remove the async and defer attributes, or use a script loader (one that doesn’t require any polyfills!).
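
    As a sketch of that last option, here is a minimal loader built only from basic DOM calls, so it needs no polyfills itself (the callback wiring is invented for illustration, and very old IE would need onreadystatechange instead of onload):

    // Load the polyfill bundle first, then start code that relies on it.
    function loadScript(src, done) {
      var script = document.createElement('script');
      script.src = src;
      script.onload = done;
      document.getElementsByTagName('head')[0].appendChild(script);
    }
    loadScript('//cdn.polyfill.io/v1/polyfill.min.js?features=Array.from', function () {
      // The polyfill has loaded, so Array.from is now safe to use.
      console.log(Array.from('abc')); // ["a", "b", "c"]
    });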

    Testing and documenting feature support

    This table shows the polyfill service’s effect for a number of key web technologies and a range of popular browsers:

    Polyfill service support grid

    The full list of features we support is shown on our feature matrix. To build this grid we use Sauce Labs’ test automation platform, which runs each polyfill through a barrage of tests in each browser, and documents the results.

    So, er, user-agent sniffing? Really?

    Yes. There are several reasons why UA analysis wins out over feature detection for us:

    • In some cases, we have multiple polyfills for the same feature, because some browsers offer a non-compliant implementation that just needs to be bashed into shape, while others lack any implementation at all. With UA detection you can choose to serve the right variant of the polyfill.
    • With UA detection, the first HTTP request can respond directly with polyfill code. If we used feature detection, the first request would serve feature-detect code, and then a second one would be needed to fetch specific polyfills.

    Almost all websites with significant scale do UA detection. This isn’t to say the stigma attached to it is entirely without merit: it’s easy to write bad UA detection rules, and hard to write good ones. And we’re not ruling out making a way of using the service via feature-detects (in fact there’s an issue in our tracker for it).

    A service for everyone

    The service part of the app is maintained by the FT, and we are working on expanding and improving the tools, documentation, testing and service features all the time. The source is freely available on GitHub so you can easily host it yourself, but we also host an instance of the service on cdn.polyfill.io which you can use for free, and our friends at Fastly are providing free CDN distribution and SSL.

    We’ve made a platform. We need the community’s help to populate it. We already serve some of the best polyfills from Jonathan Neal, Mathias Bynens and others, but we’d love to be more comprehensive. Bring your polyfills, improve our tests, and make this a resource that can help move the web forward!

  10. Generational Garbage Collection in Firefox

    Generational garbage collection (GGC) has now been enabled in the SpiderMonkey JavaScript engine in Firefox 32. GGC is a performance optimization only, and should have no observable effects on script behavior.

    So what is it? What does it do?

    GGC is a way for the JavaScript engine to collect short-lived objects faster. Say you have code similar to:

    function add(point1, point2) {
        return [ point1[0] + point2[0], point1[1] + point2[1] ];
    }

    Without GGC, you will have high overhead for garbage collection (from here on, just “GC”). Each call to add() creates a new Array, and it is likely that the old arrays that you passed in are now garbage. Before too long, enough garbage will pile up that the GC will need to kick in. That means the entire JavaScript heap (the set of all objects ever created) needs to be scanned to find the stuff that is still needed (“live”) so that everything else can be thrown away and the space reused for new objects.

    If your script does not keep very many total objects live, this is totally fine. Sure, you’ll be creating tons of garbage and collecting it constantly, but the scan of the live objects will be fast (since not much is live). However, if your script does create a large number of objects and keep them alive, then the full GC scans will be slow, and the performance of your script will be largely determined by the rate at which it produces temporary objects — even when the older objects aren’t changing, and you’re just re-scanning them over and over again to discover what you already knew. (“Are you dead?” “No.” “Are you dead?” “No.” “Are you dead?”…)

    Generational collector, Nursery & Tenured

    With a generational collector, the penalty for temporary objects is much lower. Most objects will be allocated into a separate memory region called the Nursery. When the Nursery fills up, only the Nursery will be scanned for live objects. The majority of the short-lived temporary objects will be dead, so this scan will be fast. The survivors will be promoted to the Tenured region.

    The Tenured heap will also accumulate garbage, but usually at a far lower rate than the Nursery. It will take much longer to fill up. Eventually, we will still need to do a full GC, but under typical allocation patterns these should be much less common than Nursery GCs. To distinguish the two cases, we refer to Nursery collections as minor GCs and full heap scans as major GCs. Thus, with a generational collector, we split our GCs into two types: mostly fast minor GCs, and fewer slower major GCs.

    GGC Overhead

    While it might seem like we should have always been doing this, it turns out to require quite a bit of infrastructure that we previously did not have, and it also incurs some overhead during normal operation. Consider the question of how to figure out whether some Nursery object is live. It might be pointed to by a live Tenured object — for example, if you create an object and store it into a property of a live Tenured object.

    How do you know which Nursery objects are being kept alive by Tenured objects? One alternative would be to scan the entire Tenured heap to find pointers into the Nursery, but this would defeat the whole point of GGC. So we need a way of answering the question more cheaply.

    Note that these Tenured ⇒ Nursery edges in the heap graph won’t last very long, because the next minor GC will promote all survivors in the Nursery to the Tenured heap. So we only care about the Tenured objects that have been modified since the last minor (or major) GC. That won’t be a huge number of objects, so we make the code that writes into Tenured objects check whether it is writing any Nursery pointers, and if so, record the cross-generational edges in a store buffer.

    In technical terms, this is known as a write barrier. Then, at minor GC time, we walk through the store buffer and mark every target Nursery object as being live. (We actually use the source of the edge at the same time, since we relocate the Nursery object into the Tenured area while marking it live, and thus the Tenured pointer into the Nursery needs to be updated.)
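
    To make the bookkeeping concrete, here is a toy model in JavaScript of the write barrier and store buffer described above. The real implementation is C++ inside SpiderMonkey; everything here is invented for illustration:

    // Toy model of a generational write barrier (illustrative only).
    var storeBuffer = []; // records Tenured -> Nursery edges

    function writeProperty(obj, key, value) {
      // The barrier: if a Tenured object is about to point at a Nursery
      // object, record the edge so the next minor GC can find it without
      // scanning the whole Tenured heap.
      if (obj.isTenured && value && value.isNursery) {
        storeBuffer.push({ source: obj, key: key });
      }
      obj[key] = value;
    }

    function minorGC() {
      // Mark every Nursery object reachable from the store buffer as live.
      // (The real collector also relocates it into the Tenured area and
      // updates the source pointer at the same time.)
      storeBuffer.forEach(function (edge) {
        var target = edge.source[edge.key];
        target.isNursery = false; // "promote" the survivor in this model
        target.isTenured = true;
      });
      storeBuffer.length = 0; // the buffer is emptied after each minor GC
    }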

    With a store buffer, the time for a minor GC is dependent on the number of newly-created edges from the Tenured area to the Nursery, not just the number of live objects in the Nursery. Also, keeping track of the store buffer records (or even just the checks to see whether a store buffer record needs to be created) does slow down normal heap access a little, so some code patterns may actually run slower with GGC.

    Allocation Performance

    On the flip side, GGC can speed up object allocation. The pre-GGC heap needs to be fully general. It must track in-use and free areas and avoid fragmentation. The GC needs to be able to iterate over everything in the heap to find live objects. Allocating an object in a general heap like this is surprisingly complex. (GGC’s Tenured heap has pretty much the same set of constraints, and in fact reuses the pre-GGC heap implementation.)

    The Nursery, on the other hand, just grows until it is full. You never need to delete anything, at least until you free up the whole Nursery during a minor GC, so there is no need to track free regions. Consequently, the Nursery is perfect for bump allocation: to allocate N bytes you just check whether there is space available, then increment the current end-of-heap pointer by N bytes and return the previous pointer.

    There are even tricks to optimize away the “space available” check in many cases. As a result, objects with a short lifespan never go through the slower Tenured heap allocation code at all.
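
    Bump allocation is simple enough to sketch in a few lines of JavaScript (purely illustrative; the real Nursery is native code inside the engine):

    // Toy bump allocator over a fixed-size buffer (illustrative only).
    var NURSERY_SIZE = 1024 * 1024; // a 1 MiB nursery
    var nursery = new ArrayBuffer(NURSERY_SIZE);
    var cursor = 0; // current end-of-heap offset

    function bumpAlloc(nbytes) {
      if (cursor + nbytes > NURSERY_SIZE) {
        return null; // nursery full: time for a minor GC instead
      }
      var offset = cursor; // the previous pointer is the result
      cursor += nbytes;    // bump the end-of-heap pointer by N bytes
      return new Uint8Array(nursery, offset, nbytes);
    }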

    Timings

    I wrote a simple benchmark to demonstrate the various possible gains of GGC. The benchmark is sort of a “vector Fibonacci” calculation, where it computes a Fibonacci sequence for both the x and y components of a two dimensional vector. The script allocates a temporary object on every iteration. It first times the loop with the (Tenured) heap nearly empty, then it constructs a large object graph, intended to be placed into the Tenured portion of the heap, and times the loop again.
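
    The original benchmark source isn’t reproduced here, but from the description it was along these lines (a sketch with invented names):

    // Sketch of a "vector Fibonacci" microbenchmark: every iteration
    // allocates a fresh two-element array, producing a steady stream of
    // short-lived garbage. The numeric values saturate quickly, but that
    // doesn't matter; we only care about the allocation behaviour.
    function vectorFib(n) {
      var prev = [0, 0], curr = [1, 1];
      for (var i = 0; i < n; i++) {
        var next = [prev[0] + curr[0], prev[1] + curr[1]]; // temporary object
        prev = curr;
        curr = next;
      }
      return curr;
    }

    var N = 10000000;
    var start = Date.now();
    vectorFib(N);
    var elapsedMs = Date.now() - start;
    console.log('mean time per iteration: ' +
                (elapsedMs / N * 1e6).toFixed(1) + ' ns');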

    On my laptop, the benchmark shows huge wins from GGC. The average time for an iteration through the loop drops from 15 nanoseconds (ns) to 6 ns with an empty heap, demonstrating the faster Nursery allocation. It also shows the independence from the Tenured heap size: without GGC, populating the long-lived heap slows the mean time from 15 ns to 27 ns. With GGC, the speed stays flat at 6 ns per iteration; the Tenured heap simply doesn’t matter.

    Note that this benchmark is intended to highlight the improvements possible with GGC. The actual benefit depends heavily on the details of a given script. In some scripts, the time to initialize an object is significant and may exceed the time required to allocate the memory. A higher percentage of Nursery objects may get tenured. When running inside the browser, we force enough major GCs (e.g., after a redraw) that the benefits of GGC are less noticeable.

    Also, the description above implies that we will pause long enough to collect the entire heap, which is not the case — our incremental garbage collector dramatically reduces pause times on many Web workloads already. (The incremental and generational collectors complement each other — each attacks a different part of the problem.)

    Continued…