Mozilla

Firefox 4: the HTML5 parser – inline SVG, speed and more

This is a guest post from Henri Sivonen, who has been working on Firefox’s new HTML5 parser. The HTML parser is one of the most complicated and sensitive pieces of a browser. It controls how your HTML source is turned into web pages and, as such, changes to it are rare and need to be well-tested. While most of Gecko has been rebuilt since its inception in the late 90s, the parser stood out as still being “original.” The new parser replaces that code: it is faster, complies with the new HTML5 standard and enables a lot of new functionality as well.

A project to replace Gecko’s old HTML parser, dating from 1998, has been ongoing for some time now. The parser was just turned on by default on the trunk, so you can now try it out by simply downloading a nightly build without having to flip any configuration switch.

There are four main things that improve with the new HTML5 parser:

  • You can now use SVG and MathML inline in HTML5 pages, without XML namespaces.
  • Parsing is now done off Firefox’s main UI thread, improving overall browser responsiveness.
  • The speed of innerHTML calls has improved by about 20%.
  • With the landing of the new parser, we’ve fixed dozens of long-standing parser-related bugs.

Try the demo with a Firefox Nightly or another HTML5-ready browser.

What Is It?

The HTML5 parser in Gecko turns a stream of bytes into a DOM tree according to the HTML5 parsing algorithm.

HTML5 is the first specification that tells implementors, in detail, how to parse HTML. Before HTML5, HTML specifications didn’t say how to turn a stream of bytes into a DOM tree. In theory, HTML before HTML5 was supposed to be defined in terms of SGML. This implied a certain relationship between the source of valid HTML documents and the DOM. However, parsing wasn’t well-defined for invalid documents (and Web content most often isn’t valid HTML4), and there were SGML constructs that were in theory part of HTML but that popular browsers never actually implemented.

The lack of a proper specification led to browser developers filling in the blanks on their own and reverse engineering the browser with the largest market share (first Mosaic, then Netscape, then IE) when in doubt about how to get compatible behavior. This led to a lot of unwritten common rules but also to different behavior across browsers.

The HTML5 parsing algorithm standardizes well-defined behavior that browsers and other applications that consume HTML can converge on. By design, the HTML5 parsing algorithm is suitable for processing existing HTML content, so applications don’t need to continue maintaining their legacy parsers for legacy content. Concretely, in the trunk nightlies, the HTML5 parser is used for all text/html content.

How Is It Different?

The HTML5 parsing algorithm has two major parts: tokenization and tree building. Tokenization is the process of splitting the source stream into tags, text, comments and attributes inside tags. The tree building phase takes the tags and the interleaving text and comments and builds the DOM tree. The tokenization part of the HTML5 parsing algorithm is closer to what Internet Explorer does than to what Gecko used to do. Internet Explorer has had the majority market share for a while, so sites have generally been tested not to break when subjected to IE’s tokenizer. The tree building part is close to what WebKit does already. Of the major browser engines, WebKit had the most reasonable tree building solution prior to HTML5.
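To make the tokenization step more concrete, here is a deliberately simplified sketch in JavaScript. It is not the HTML5 tokenizer (which is a large state machine with precise error recovery), and the names toyTokenize and parseAttributes are made up for this illustration. It only shows the general idea of splitting a source string into start tags, end tags, text and comments, with attributes collected inside tags.

```javascript
// Toy illustration only: NOT the HTML5 tokenization algorithm.
// It ignores doctypes, character references, script/style content,
// ">" inside quoted attribute values, and all error recovery.

// Collect name/value pairs from the raw text that follows a tag name.
function parseAttributes(raw) {
  const attributes = {};
  // name, optionally followed by = and a "quoted", 'quoted' or unquoted value
  const pattern = /([^\s=\/]+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s]+)))?/g;
  let match;
  while ((match = pattern.exec(raw)) !== null) {
    const value = match[2] ?? match[3] ?? match[4] ?? "";
    // The HTML tokenizer lowercases attribute names; the toy does the same.
    attributes[match[1].toLowerCase()] = value;
  }
  return attributes;
}

// Split a source string into comment, startTag, endTag and text tokens.
function toyTokenize(source) {
  const tokens = [];
  // Alternatives, tried in order: a comment, a tag, or a run of text.
  const pattern = /<!--([\s\S]*?)-->|<(\/?)([A-Za-z][^\s\/>]*)([^>]*)>|([^<]+)/g;
  let match;
  while ((match = pattern.exec(source)) !== null) {
    if (match[1] !== undefined) {
      tokens.push({ type: "comment", data: match[1] });
    } else if (match[3] !== undefined) {
      tokens.push({
        type: match[2] === "/" ? "endTag" : "startTag",
        name: match[3].toLowerCase(),
        attributes: parseAttributes(match[4]),
      });
    } else {
      tokens.push({ type: "text", data: match[5] });
    }
  }
  return tokens;
}

console.log(toyTokenize("<p class=note>Hi<!-- c --></p>"));
```

A tree builder would then consume these tokens one at a time and decide where each element, text node or comment goes in the DOM. That second phase is where most of the HTML5 algorithm's special cases live, and this sketch does not attempt it.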

Furthermore, the new HTML5 parser parses network streams off the main thread. Traditionally, browsers have performed most tasks on the main thread. Radical changes like off-the-main-thread parsing are made possible by the more maintainable code base of the HTML5 parser compared to Gecko’s old HTML parser.

What’s In It for Web Developers?

The changes mentioned above are mainly of interest to browser developers. A key feature of the HTML5 parser is that you don’t notice that anything has changed.

However, there is one big new Web developer-facing feature, too: inline MathML and SVG. HTML5 parsing liberates MathML and SVG from XML and makes them available in the main file format of the Web.

This means that you can include typographically sophisticated math in your HTML document without having to recast the entire document as XHTML or, more importantly, without having to retrofit the software that powers your site to output well-formed XHTML. For example, you can now include the solution for quadratic equations inline in HTML:

<math>
  <mi>x</mi>
  <mo>=</mo>
  <mfrac>
    <mrow>
      <mo>&minus;</mo>
      <mi>b</mi>
      <mo>&PlusMinus;</mo>
      <msqrt>
        <msup>
          <mi>b</mi>
          <mn>2</mn>
        </msup>
        <mo>&minus;</mo>
        <mn>4</mn>
        <mo>&InvisibleTimes;</mo>
        <mi>a</mi>
        <mo>&InvisibleTimes;</mo>
        <mi>c</mi>
      </msqrt>
    </mrow>
    <mrow>
      <mn>2</mn>
      <mo>&InvisibleTimes;</mo>
      <mi>a</mi>
    </mrow>
  </mfrac>
</math>

Likewise, you can include scalable inline art as SVG without having to recast your HTML as XHTML. As screen sizes and pixel densities become more varied, making graphics look crisp at all zoom levels becomes more important. Although it has previously been possible to use SVG graphics in HTML documents by reference (using the object element), putting SVG inline is more convenient in some cases. For example, an icon such as a warning sign can now be included inline instead of being loaded from an external file.

<svg height=86 width=90 viewBox='5 9 90 86' style='float: right;'>
  <path stroke=#F53F0C stroke-width=10 fill=#F5C60C stroke-linejoin=round d='M 10,90 L 90,90 L 50,14 Z'/>
  <line stroke=black stroke-width=10 stroke-linecap=round x1=50 x2=50 y1=45 y2=75 />
</svg>

Make yourself a page that starts with <!DOCTYPE html> and put these two pieces of code in it and it should work with a new nightly.

In general, if you have a piece of MathML or SVG as XML, you can just copy and paste the XML markup inline into HTML (omitting the XML declaration and the doctype if any). There are two caveats: The markup must not use namespace prefixes for elements (i.e. no svg:svg or math:math) and the namespace prefix for the XLink namespace has to be xlink.

In the MathML and SVG snippets above, you’ll see that the inline MathML and SVG pieces are more HTML-like and less crufty than merely XML pasted inline. There are no namespace declarations, and unnecessary quotes around attribute values have been omitted. The quote omission works because the tags are tokenized by the HTML5 tokenizer, not by an XML tokenizer. The namespace declaration omission works because the HTML5 tree builder doesn’t use attributes looking like namespace declarations to assign MathMLness or SVGness to elements. Instead, <svg> establishes a scope of elements that get assigned to the SVG namespace in the DOM and <math> establishes a scope of elements that get assigned to the MathML namespace in the DOM. You’ll also notice that the MathML example uses named character references that weren’t previously supported in HTML.
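One way to see the namespace assignment for yourself is to inspect the namespaceURI property of the parsed elements from script. The following is a minimal sketch: the helper name namespaceLabel is made up for this illustration, while the three namespace URIs are the standard ones for HTML, SVG and MathML. The demonstration at the bottom uses a plain object as a stand-in for a DOM element so that the snippet also runs outside a browser.

```javascript
// Standard namespace URIs, mapped to short labels for display.
const NAMESPACE_LABELS = {
  "http://www.w3.org/1999/xhtml": "HTML",
  "http://www.w3.org/2000/svg": "SVG",
  "http://www.w3.org/1998/Math/MathML": "MathML",
};

// Hypothetical helper: report which namespace an element ended up in.
// Works on anything that exposes a namespaceURI property.
function namespaceLabel(element) {
  return NAMESPACE_LABELS[element.namespaceURI] ?? "unknown";
}

// Stand-in object so the sketch runs without a DOM.
console.log(namespaceLabel({ namespaceURI: "http://www.w3.org/2000/svg" })); // SVG

// In a browser, on a page containing the snippets above, you could call:
//   namespaceLabel(document.querySelector("svg"));
//   namespaceLabel(document.querySelector("math"));
```

With the HTML5 parser enabled, the two browser calls should report SVG and MathML respectively, even though the markup contains no xmlns attributes.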

Here’s a quick summary of inline MathML and SVG parsing for Web authors:

  • <svg></svg> is assigned to the SVG namespace in the DOM.
  • <math></math> is assigned to the MathML namespace in the DOM.
  • foreignObject and annotation-xml (and various less important elements) establish a nested HTML scope, so you can nest SVG, MathML and HTML as you’d expect to be able to nest them.
  • The parser case-corrects markup so <SVG VIEWBOX='0 0 10 10'> works in HTML source.
  • The DOM methods and CSS selectors behave case-sensitively, so you need to write your DOM calls and CSS selectors using the canonical case, which is camelCase for various parts of SVG such as viewBox.
  • The syntax <foo/> opens and immediately closes the foo element if it is a MathML or SVG element (i.e. not an HTML element).
  • Attributes are tokenized the same way they are tokenized in HTML, so you can omit quotes in the same situations where you can omit quotes in HTML (i.e. when the attribute value is not the empty string and does not contain whitespace, ", ', `, <, =, or >).
  • Warning: the two above features do not combine well due to the reuse of legacy-compatible HTML tokenization. If you omit quotes on the last attribute value, you must have a space before the closing slash. <circle fill=green /> is OK but <circle fill=red/> is not.
  • Attributes starting with xmlns have absolutely no effect on what namespace elements or attributes end up in, so you don’t need to use attributes starting with xmlns.
  • Attributes in the XLink namespace must use the prefix xlink (e.g. xlink:href).
  • Element names must not have prefixes or colons in them.
  • The content of SVG script elements is tokenized the way it is in XML, not the way the content of HTML script elements is tokenized.
  • When an SVG or MathML element is open, <![CDATA[]]> sections work the way they do in XML. You can use this to hide text content from older browsers that don’t support SVG or MathML in text/html.
  • The MathML named characters are available for use in named character references everywhere in the document (also in HTML content).
  • To deal with legacy pages where authors have pasted partial SVG fragments into HTML (who knows why) or used a <math> tag for non-MathML purposes, attempts to nest various common HTML elements as children of SVG elements (without foreignObject) will immediately break out of SVG or MathML context. This may make some typos have surprising effects.
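The quote-omission rule and the warning about the closing slash in the list above can be sketched as a small JavaScript helper for code that generates inline SVG or MathML source. The function names (canOmitQuotes, attributeSource, selfClosingTag) are made up for this illustration and are not part of any browser API. The sketch always emits a space before the closing slash, so an unquoted last attribute value never swallows the slash.

```javascript
// Characters that force quoting, per the rule in the list above:
// whitespace, ", ', `, <, = and >.
const NEEDS_QUOTES = /[\s"'`<=>]/;

// True when an attribute value may be written without quotes.
function canOmitQuotes(value) {
  return value !== "" && !NEEDS_QUOTES.test(value);
}

// Serialize one attribute, quoting only when required.
function attributeSource(name, value) {
  const escaped = value.replace(/&/g, "&amp;");
  if (canOmitQuotes(value)) {
    return `${name}=${escaped}`;
  }
  return `${name}="${escaped.replace(/"/g, "&quot;")}"`;
}

// Serialize a self-closing SVG or MathML tag. The space before "/>"
// is always emitted, which keeps <circle fill=green /> safe.
function selfClosingTag(tagName, attributes) {
  const parts = Object.entries(attributes).map(
    ([name, value]) => attributeSource(name, value)
  );
  return `<${[tagName, ...parts].join(" ")} />`;
}

console.log(selfClosingTag("circle", { fill: "green" }));
// <circle fill=green />
console.log(selfClosingTag("svg", { viewBox: "5 9 90 86" }));
// <svg viewBox="5 9 90 86" />
```

Quoting every value unconditionally would be simpler and equally valid; the conditional form is shown only to mirror the rule as stated.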

66 comments

Comments are now closed.

  1. WebManWalking wrote on October 29th, 2010 at 14:57:

    P.S.: When I use a for .. in loop on the style property of the SVG elements in question, every property seems to be null. If I’ve assigned a property value it shows up as if in an array. (For example, when I set opacity, it showed up as style[0], even though style.opacity is where I would’ve expected it to be.) Also, when I tried to alert(this.width), as opposed to this.style.width, I got “[object SVGAnimationLength]” or something like that.

    This was with Firefox 4 beta 6.

  2. Boris wrote on October 30th, 2010 at 19:14:

    > Apparently the x, y, height and width attributes don’t correspond to the CSS and JavaScript DOM
    > properties left, top, height and width, respectively.

    That’s correct. They’re defined differently in SVG.

    > On the other hand, I **can** animate opacity

    Yes, SVG defines opacity in a CSS-compatible way.

    > why can’t I modify left, top, height and width?

    In brief, because the SVG spec says one thing and jquery assumes another thing. File a bug on jquery?

  3. WebManWalking wrote on November 2nd, 2010 at 07:54:

    Thanks for your quick reply, Boris.

    I’ve contributed 2 jQuery plug-ins before. I think it would be less whiny of me to simply figure it out and contribute another, rather than classifying it as a jQuery bug for someone else to fix.

    Does Firefox and/or HTML5 expose an Inline SVG JavaScript API with which I can accomplish the same thing as CSS property modification?

    1. Boris wrote on November 2nd, 2010 at 08:25:

      You should be able to use setAttribute to change the x/y/width/height attributes.

      Alternately, for animation stuff, you could try to use SMIL to animate the values…
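      A minimal sketch of the setAttribute approach mentioned in this reply, for readers who want something concrete. The helper names setGeometry and interpolateGeometry are made up for this illustration; only setAttribute itself is a real DOM method. The element argument can be any SVG element that takes x, y, width and height attributes, such as a rect.

```javascript
const GEOMETRY_ATTRIBUTES = ["x", "y", "width", "height"];

// Write whichever of x/y/width/height are present in `geometry`
// onto the element as attributes (not as CSS properties).
function setGeometry(element, geometry) {
  for (const name of GEOMETRY_ATTRIBUTES) {
    if (geometry[name] !== undefined) {
      element.setAttribute(name, String(geometry[name]));
    }
  }
}

// Linear interpolation between two geometry objects, with t in [0, 1].
function interpolateGeometry(from, to, t) {
  const result = {};
  for (const name of Object.keys(from)) {
    result[name] = from[name] + (to[name] - from[name]) * t;
  }
  return result;
}

// In a browser, a script-driven animation step might look like:
//   const rect = document.querySelector("rect");
//   setGeometry(rect, interpolateGeometry({ x: 0 }, { x: 100 }, 0.5));
```

      Calling the step repeatedly with an increasing t (for example from a timer) moves the element by rewriting its attributes, which sidesteps the assumption that left, top, height and width are CSS properties.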

  4. Trackback from Quora on December 27th, 2010 at 02:34:

    When will MathML be adopted widely enough to replace images for mathematical notation on the web?…

    Gecko and WebKit both support mathML. Like SVG, I think it wasn’t well positioned for widespread use until recently because it was difficult to include in an HTML document. With HTML5, both SVG and mathML can be included inline in HTML pages. See http…

  5. Jukka K. Korpela wrote on March 9th, 2011 at 10:53:

    I’m puzzled: is the HTML5 parser now part of Firefox released versions, as I’ve understood? In versions that have the HTML5 parser, is it _the_ parser, or is it an alternative to the old parser? If they coexist, does the doctype declaration trigger the choice? Are there practical major implications, such as some legacy tags or attributes not recognized?

    1. Boris wrote on March 9th, 2011 at 16:06:

      In Firefox 4, the HTML5 parser and the old parser are both present, but which one is used is controlled by a hidden preference, with the default being the HTML5 parser. A page can’t control which parser will be used.

      In Firefox 5, the old parser will be removed.

  6. ybochatay wrote on August 29th, 2011 at 01:53:

    Météo-France now works with a full web application, based on firefox and svg. You can find demos with inline SVG at http://ybochatay.fr/Galerie
    Please have a look if you have time.
