How fast is PDF.js?

Hi, my name is Thorben and I work at Opera Software in Oslo, not at Mozilla. So, how did I end up writing for Mozilla Hacks? Maybe you know that there is no default PDF viewer in the Opera Browser, something we would like to change. But how to include one? Buy it from Adobe or Foxit? Start our own?

Introducing PDF.js

While investigating our options we quickly stumbled upon PDF.js. The project aims to create a full-featured PDF viewer in the browser using JavaScript and Canvas. Yeah, it sounds a bit crazy, but it makes sense: browsers need to be good at processing text, images, fonts, and vector graphics, which are exactly the things a PDF viewer has to be good at. The drawing commands in PDFs are derived from PostScript, and they are not so different from what Canvas offers. Security is also virtually no issue: using PDF.js is as secure as opening any other website.

Working on PDF.js

So Christian Krebs, Mathieu Henri and I began looking at PDF.js in more detail and were impressed: it’s well designed, seems fast, and big parts of the code are just wow!

But we also discovered some problems, mainly with performance on very large or graphics-heavy PDFs. We decided that the best way to get to know PDF.js better and to push the project further was to help out and address the major issues we found. This gave us a pretty good understanding of the project and its high potential. We were also very impressed by how much the performance of PDF.js improved while we worked on it. This is an active and well-managed project.

Benchmarking PDF.js

Of course, our tests gave us a skewed impression of performance. We tried to find super large, awkward and hard-to-render PDFs, but that is not what most people want to view. Most PDFs you actually want to view in PDF.js are fine. But how do you test that?

Well, you could check the most popular PDFs on the Internet – as these are the ones you probably want to view – and benchmark them. A snapshot of 5 to 10k PDFs should be enough … but how do you get them?

I figured that search engines would be my friend. If you tell them to search for PDFs only, they give you the most relevant PDFs for that keyword, which in turn are probably the most popular ones. And if you use the most searched keywords you end up with a good approximation.

Benchmarking that many PDFs is a big task. So I got myself a small cluster of old computers and built a nice server application that supplied them with tasks. The current repository has almost 7000 PDFs and benchmarking one version of PDF.js takes around eight hours.
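
As a rough illustration of the idea (not the actual server application), handing out one task per PDF to a pool of workers could look like the Python sketch below. The names run_benchmark and fake_process are hypothetical, threads stand in for the separate machines, and the timing function is a made-up placeholder for rendering a PDF with PDF.js.

```python
import queue
import threading


def run_benchmark(pdf_names, worker_count, process_pdf):
    """Hand out one task per PDF to the workers and collect results by name."""
    tasks = queue.Queue()
    for name in pdf_names:
        tasks.put(name)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                name = tasks.get_nowait()
            except queue.Empty:
                return  # no work left for this worker
            try:
                outcome = ("ok", process_pdf(name))
            except Exception as error:
                # A crash on one PDF must not stop the whole run.
                outcome = ("crashed", str(error))
            with lock:
                results[name] = outcome

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


def fake_process(name):
    # Hypothetical stand-in for rendering a PDF and measuring the time.
    if name == "broken.pdf":
        raise ValueError("damaged xref table")
    return len(name) * 10.0  # pretend milliseconds


results = run_benchmark(
    ["a.pdf", "broken.pdf", "map.pdf"], worker_count=2, process_pdf=fake_process
)
```

Recording a crash as a result, rather than letting it end the run, is what makes it possible to report the share of crashing PDFs separately from the timings.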

The results

Let’s skip to the interesting part with the pretty pictures. This graph gives us almost all the interesting results at a glance:

[Figure: histogram]

It is a histogram of the time it took to process the pages in the PDFs, relative to the average time it takes to process a page of the Tracemonkey Paper (the default PDF you see when opening PDF.js). The user experience when viewing the Tracemonkey Paper is good, and from my tests even 3 to 4 times slower is still okay. That means that over 96% of all benchmarked pages (excluding PDFs that crashed) will translate to a good user experience. That is really good news! Or, to use a very simple pie chart (in % of pages):

[Figure: overview pie chart]
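
To make the measure concrete, here is a minimal Python sketch of how page times can be expressed relative to a baseline page and summarised. The numbers are made up for illustration and are not taken from the benchmark; the function names are hypothetical, and the threshold of 4 simply mirrors the "3 to 4 times slower is still okay" rule of thumb.

```python
def relative_times(page_times_ms, baseline_ms):
    """Express each page time as a multiple of the baseline page time."""
    return [t / baseline_ms for t in page_times_ms]


def share_within(relative, max_slowdown):
    """Fraction of pages that are at most max_slowdown times the baseline."""
    ok = sum(1 for r in relative if r <= max_slowdown)
    return ok / len(relative)


def histogram(relative, bin_width):
    """Count pages per bin, keyed by the lower edge of each bin."""
    counts = {}
    for r in relative:
        bin_start = int(r // bin_width) * bin_width
        counts[bin_start] = counts.get(bin_start, 0) + 1
    return dict(sorted(counts.items()))


# Made-up page timings in milliseconds; 100 ms plays the role of the
# average Tracemonkey Paper page.
page_times_ms = [40.0, 80.0, 120.0, 200.0, 900.0]
baseline_ms = 100.0

rel = relative_times(page_times_ms, baseline_ms)
share = share_within(rel, 4.0)
hist = histogram(rel, 1.0)
```

With these sample values, four of the five pages fall within the threshold, and the one page at nine times the baseline lands in the long tail of the histogram.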

You probably already noticed the small catch: around 0.8% of the PDFs crashed PDF.js when we tested them. We had a closer look at most of them and at least a third are actually so heavily damaged that probably no PDF viewer could ever display them.

And this leads us to another good point: we have to keep in mind that these results stand on their own, without comparison to other viewers. There are some PDFs on the Internet that are so complex that there is no hope that even native PDF viewers could display them nicely and quickly. The slowest tested PDF is an incredibly detailed vector map of the public transport system of Lisbon. Try to open it in Adobe Reader; it’s not fun!

Conclusion

From these results we concluded that PDF.js is a strong candidate to be used as the default PDF viewer in the Opera Browser. There is still a lot of work to do to integrate PDF.js nicely into it, but we are working right now on integrating it behind an experimental flag. (BTW: There is an extension that adds PDF.js with the default Mozilla viewer. The “nice” integration I am talking about would be deeper and include a brand new viewer.) Thanks Mozilla! We are looking forward to working on PDF.js together with you guys!

PS: Both the code of the computational system and the results are publicly available. Have a look and tell us if you find them useful!

PPS: If anybody works at a big search engine company and could give me a list with the actual 10k most used PDFs, that would be awesome :)

Appendix: What’s next?

The corpus and the computational framework I described could be used to do all kinds of interesting things. In the next step, we hope to classify PDFs by the font formats, image formats and the like that they use, so you can quickly get PDFs to test a new feature with. We also want to look at how frequently each drawing instruction is used in the PDFs, so we can better optimise for the very common ones, like we did with HTML in browsers. Let’s see what we can actually do ;)
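
As a sketch of what counting drawing instructions could look like, the Python snippet below tallies the operators in a PDF content stream. It is a simplified illustration under several assumptions: the stream has already been decompressed into text, it contains no inline images (whose binary data would confuse this tokenizer), and the function name count_operators is hypothetical. It is not the analysis tooling from the benchmark.

```python
from collections import Counter

WHITESPACE = " \t\r\n\f\0"
DELIMITERS = "()<>[]{}/%"
KEYWORDS = {"true", "false", "null"}  # operands, not operators


def count_operators(content):
    """Count operator tokens in a decompressed content stream (simplified)."""
    counts = Counter()
    i, n = 0, len(content)
    while i < n:
        ch = content[i]
        if ch in WHITESPACE or ch in "[]{})":
            i += 1
        elif ch == "%":  # comment: skip to the end of the line
            while i < n and content[i] not in "\r\n":
                i += 1
        elif ch == "(":  # literal string: may nest and contain escapes
            depth = 0
            while i < n:
                if content[i] == "\\":
                    i += 1  # skip the escaped character
                elif content[i] == "(":
                    depth += 1
                elif content[i] == ")":
                    depth -= 1
                    if depth == 0:
                        i += 1
                        break
                i += 1
        elif ch == "<":
            if content[i:i + 2] == "<<":  # dictionary start
                i += 2
            else:  # hex string: skip to the closing '>'
                end = content.find(">", i)
                i = n if end == -1 else end + 1
        elif ch == ">":
            i += 2 if content[i:i + 2] == ">>" else 1
        else:
            start = i
            i += 1  # consume the first character (may be '/')
            while i < n and content[i] not in WHITESPACE and content[i] not in DELIMITERS:
                i += 1
            token = content[start:i]
            if token.startswith("/") or token in KEYWORDS:
                continue  # names and keywords are operands
            try:
                float(token)  # numbers are operands too
            except ValueError:
                counts[token] += 1
    return counts


sample = "q 1 0 0 1 72 720 cm BT /F1 12 Tf (Hello \\(world\\)) Tj ET 0 0 m 10 10 l 20 0 l S Q"
frequencies = count_operators(sample)
```

Summing such counters over every page in a corpus would give a ranking of the most common operators, which is the kind of information that could guide optimisation work.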

About Robert Nyman [Editor emeritus]

Technical Evangelist & Editor of Mozilla Hacks. Gives talks & blogs about HTML5, JavaScript & the Open Web. Robert is a strong believer in HTML5 and the Open Web and has been working since 1999 with Front End development for the web - in Sweden and in New York City. He also blogs regularly at http://robertnyman.com and loves to travel and meet people.

About Thorben Bochenek

JavaScript Engineer at Opera Software in Oslo, Norway. Thorben learned to like the lean nature of JavaScript when he interned and became the only JS-developer in a hardcore Java company in Zurich. He studied computer science at ETH Zurich, loves to travel and only writes blog posts when he is asked to do so.


37 comments

  1. Robert

    Great, more people helping PDF.js. Don’t forget printing: displaying is important, but many simple PDFs on websites are generated as a substitute for the lack of direct printing APIs in browsers. PDF.js on some platforms (Linux, not sure about OS X) sends the printer big print jobs because it renders to a bitmap (sometimes the result is low resolution and blurry).

    May 8th, 2014 at 09:11

    1. Mariano Semelman

      +1

      May 8th, 2014 at 11:16

    2. David

      Good point Robert. Printing from pdf.js often doesn’t go so well:
      https://bugzilla.mozilla.org/show_bug.cgi?id=932313
      https://bugzilla.mozilla.org/show_bug.cgi?id=932289

      May 8th, 2014 at 13:30

    3. Jorge Jenson

      I work at a Fortune 100 company and we found OS X impossible to use for anything related to in-browser PDF viewing. Linux was our only option. Since switching, development production has gone up 30%.

      May 8th, 2014 at 13:38

    4. Matěj Cepl

      That’s weird (or I misunderstood you) … on Linux (and Mac OS X) the default printing service is CUPS and that uses PDF as its internal printing format. Why wouldn’t pdf.js just send the PDF down to CUPS without any modification?

      May 9th, 2014 at 05:06

      1. Robert

        @Matěj, true, at least PDF.js should be able to generate a reduced PDF from the original data, users should be able to print page ranges

        May 9th, 2014 at 06:13

  2. Zubair Quraishi

    We actually use Pdf.js as the internal viewer for CVs on nemCV.com. We did this because we needed people to be able to view their CV, which is a PDF, from many devices and web browsers, but having to download Adobe Reader was a real pain, as many devices were locked down, or in the case of Apple devices it was not available at all.

    We had some rendering issues on fonts with Pdf.js, but I think they may have been resolved in the newer versions of pdf.js which we still need to test. But overall pdf.js is the leader in the marketplace, and is even better than google docs viewer, which often cannot render hi res images on a pdf document.

    May 8th, 2014 at 10:21

  3. Steve

    The labels for the axes on your graph should be switched.

    May 8th, 2014 at 10:29

    1. Robert Nyman [Editor]

      Ha! :-) Thanks for the heads-up, updated now.

      May 8th, 2014 at 15:16

  4. Marcy Sutton

    I’m also curious about the accessibility of this solution. Canvas is known to be an inaccessible black box. Does PDF.js provide any Shadow DOM to accompany the visual implementation? What would be the security implications of that?

    May 8th, 2014 at 10:36

    1. Thorben Bochenek

      Yes, PDF.js has such a “shadow” DOM, so accessibility should be good.
      The performance implication is that with complex text documents, this can be a bottleneck. We are working on it.

      May 9th, 2014 at 01:54

  5. Zane

    Impressive results, you could use PDFs on the go from now on. Even if it’s slower with bigger pages/images it’s still worth a shot; after all, you are doing this client side so you end up saving tons of resources on the server.
    Thanks for sharing this on HN !

    May 8th, 2014 at 11:14

  6. Damien

    I gladly welcome work on PDF.js, I’m using it in chromium instead of chrome for a true FLOSS browser.

    I did test the Lisbon metro map on various viewers I had: okular (poppler library), firefox with PDF.js, chromium with PDF.js and chromium with google pdf plugin.
    firefox with PDF.js is by far the slowest, okular and chromium+PDF.js give more or less a similar user experience, and the google pdf plugin is the fastest of the four. I have a quite new computer with a lot of CPU and RAM so I guess that’s why it works, it would clearly be too slow on older hardware but you should keep it for benchmarking in the future ;).

    May 8th, 2014 at 11:17

  7. Henrik Gemal

    What about adding some telemetry to PDF.js that sends URL + timings to a central server? Mozilla has the infrastructure.

    May 8th, 2014 at 11:24

    1. async5

      Here you go http://telemetry.mozilla.org/#beta/29/PDF_VIEWER_USED

      May 8th, 2014 at 14:54

  8. Aranjedeath

    Joepie91 wrote a pdf-sharing site a la scribd (minus everything that makes scribd horrendous) that uses pdf.js [1]. Perhaps in some time you can approach him for the top accessed list? :D

    [1]: http://pdf.yt/

    May 8th, 2014 at 16:38

  9. DB

    PDFs of many scanned pages are common and wait times can be several minutes. Worse yet, if your plan is to print the PDF you have to wait for the PDF.js to completely load the document before you can click the download button to open the PDF in Adobe.

    May 8th, 2014 at 18:52

    1. Thorben Bochenek

      The way we will print a PDF in Opera will be different from the FF way. We will send the PDF directly to an already existing PDF-Printing-API, so there will be no wait

      May 9th, 2014 at 01:51

  10. Aaron Boxer

    I see you guys are optimizing the jpeg 2000 and jpeg codecs in pdf.js.
    Awesome!!!!

    May 8th, 2014 at 18:57

    1. Thorben Bochenek

      I improved things for JPG, but JBIG2 and JPX are mostly the work of Mathieu – p01 – Henri

      May 9th, 2014 at 01:53

  11. jm

    1) “We tried to find super large, awkward and hard-to-render PDFs, but that is not what most people want to view.”
    2) “We had a closer look at most of them and at least a third are actually so heavily damaged that probably no PDF viewer could ever display them.”
    3) “If you tell them to search for PDFs only, they give you the most relevant PDFs for that keyword, which in turn are probably the most popular ones.”

    ‘most’, ‘probably’, ‘most’, ‘probably’,… come on, this is supposed to be a benchmark with at least some statistical significance. All of this suggests you’re benchmarking text-only pdfs.

    1) a lot of people using pdfs are doing so because they are not text-only. I’m a phd student and several times I had PDF.js struggling to render two pages with half math, half text.
    2) not sure I understand, how did you have a closer look, if not with a PDF viewer?
    3) search engines favor text pdfs, of course.

    May 9th, 2014 at 02:49

    1. Thorben Bochenek

      Thanks for starting a discussion on how the data was generated. I would agree that the way we collected PDFs is not perfect. As I said, this only gives us an approximation of what people view. This is not a scientific paper; the “most” and “probably” are there to tell readers what assumptions we made to arrive at the approximation. This is just informal language. I still think these assumptions are reasonable. If you have a better way to crawl PDFs, please tell me or make a pull request on [1] directly. We are more than happy to incorporate better ideas.

      As to your three points
      1) I guess as a phd student you are special and look at things that most users will not look at. And it’s fair to say that most math papers are not “popular”. As an anecdote: When I read papers I prefer to download them and view them with a native viewer, so I can view them anytime, offline or online and I am sure that I “have” them.
      2) You can look at the content of a pdf with a text editor of your choice, e.g. sublime, and analyse it. The format is kind of nice and it’s not impossible for a human to read. I also built a tool to help me analyse PDFs, available here [2].
      To give you an example of a way in which pdf files can be damaged: PDFs include a table called the xref-table, which contains references to the objects in the document. When this table is damaged, it can be impossible to find some objects.
      3) I am not sure how true your hypothesis is that “search engines favor text pdfs”; there are many very graphics-intensive PDFs in the downloaded corpus. If you have a page about architecture that links to a pdf with just pictures, I guess you can find this pdf when you look for “architecture”.

      [1] https://github.com/bthorben/pdfRepo
      [2] https://github.com/bthorben/pdf-analyser

      May 9th, 2014 at 03:47

  12. Mathieu ‘p01’ Henri

    Hello,

    PDF.js has a transparent text layer ( DOM elements with the actual text ). This is useful for two things: accessibility and text selection.

    We – Opera – also tested various plugins. While we could not accurately measure the performance, it gave us a good idea of where PDF.js stands. This testing phase showed that the rendering quality and overall user experience of PDF.js was on par with native solutions. On top of that, unlike many plugins, PDF.js renders incrementally, which provides a much better experience despite being slower.

    As Thorben and Aaron said, we are working on the performance of PDF.js and have already contributed to font caching, color conversions, and the JBIG2, JPG and JPEG 2000 decoders, to name a few things.

    Opera is working with Mozilla to make PDF.js faster for everyone.

    May 9th, 2014 at 02:53

  13. theor

    Hello,

    Canadian Immigration and Citizenship agency have nice twisted pdfs :
    http://www.cic.gc.ca/english/pdf/kits/forms/IMM0008ENU_2D.pdf

    Fancy stuff.

    May 9th, 2014 at 07:33

  14. Jim

    PDF.js has a speed advantage over any type of plugin because it starts rendering right away; with Adobe you have to wait for a plugin to load.

    Joepie91 is a self taught coder, I have always been impressed with him.

    May 9th, 2014 at 15:04

    1. Charlie

      Speed and perceived speed are not the same thing. I have PDFs that quite happily do nothing for 30s with PDF.js and are loaded in 5s (including spawning) with Acrobat Reader.

      May 12th, 2014 at 02:32

      1. Thorben Bochenek

        Are some of these pdfs public? Could you file a bug report at PDF.js Issues? We will have a look.

        May 12th, 2014 at 02:57

  15. Luke

    Anyone else noticing that the Mozilla PDF.js viewer doesn’t allow dual-page view?

    If I remember right, you can easily change that in the inspector by setting something to inline-block instead of block. It would be easier to make an extension for this, but it’s html, not extensible xul (maybe that could be added in a future release?)

    May 9th, 2014 at 21:59

    1. Thorben Bochenek

      There is a half-finished PR that will address this in Firefox:
      https://github.com/mozilla/pdf.js/pull/3723

      As we already said, Opera will have a different viewer.

      May 12th, 2014 at 03:00

  16. Kurt Pfeifle

    > In the next step, we hope to classify PDFs by used fonts formats, image formats and the like.

    I’m very much interested in this. It partly aligns with a private pet project of mine: I’m trying to create a repository of at least 1.000.000 PDF files downloaded from the internet and analyze all of them, putting their various properties into a SQLite3 database.

    So far I’ve based it all on Shell scripting (that’s the only “programming” I know). My current trove is just above 100.000 files. The scripts used commandline tools like `pdfinfo`, `pdffonts`, `pdfimages`, `pdfresurrect`, `md5sum`, `pdfid.py`, `pdf-parser.py` and some more to extract available statistical data from the files. My SQLite3 DB file is now around 8 GBytes of size (it’s not optimized in any way, and I’m also currently learning SQL[ite3] as I go along).

    My purpose for the collection of this info is exactly to have a way to create sets of PDF files (+ their URLs) that match certain criteria which may be required in order to test specific features of PDF processing software, or to run regression tests, etc.pp.

    So anything that you may be doing in this respect is potentially useful for all software projects which have to handle PDF files.

    May 10th, 2014 at 00:56

    1. Thorben Bochenek

      > So anything that you may be doing in this respect is potentially useful for all software projects which have to handle PDF files.

      I hope so. If you publish anything from your work about this, I would be happy to hear about it :)

      May 12th, 2014 at 02:58

  17. Charlie

    FWIW I’ve found PDF.js in the latest LTS version of Firefox to be disappointingly slow for anything more than a couple of pages. This can be frustrating because I have to check PDFs with it. Spawning a new instance of Acrobat feels much faster, possibly because it runs in the background.

    Nevertheless, PDF.js is a very interesting development.

    May 10th, 2014 at 08:37

  18. Ryan

    Do the PDF.JS improvements made by Opera also show up in Firefox or only for Opera?

    May 12th, 2014 at 13:37

    1. Thorben Bochenek

      We contribute directly to https://github.com/mozilla/pdf.js. So all improvements will benefit both browsers. And as long as we don’t have a viewer, only Firefox actually ;)

      May 14th, 2014 at 09:53

      1. Adam

        @Thorben, since Opera is based on Chromium, why don’t you build a native PDF reader as a PPAPI/PNaCl plugin, possibly based on the open source Evince Browser Plugin? It should run faster than JavaScript.

        Next, are you willing to support viewing Open Document Format, using WebODF?

        May 21st, 2014 at 07:25

  19. momo

    When I open a PDF file(e.g. 15 pages long), the first page gets rendered. Then I switch to another Tab/Window and get back later(e.g. 3 minutes) to the PDF file (it was still open at page 1). Now I am going to page 7 but again it needs to get rendered.

    My expectation:
    When I goto/scroll to page 7 it should already be (pre-)rendered, since Firefox had enough time and the computer enough resources to do that.

    Keep up your great browser!

    May 23rd, 2014 at 20:07

  20. Ian

    Now there is an open source PDF viewer from Chromium called pdfium.

    May 24th, 2014 at 17:47