Firefox and the Web Speech API

Speech synthesis and recognition are powerful tools to have available on computers, and they have become quite widespread in this modern age — look at tools like Cortana, Dictation and Siri on popular modern OSes, and accessibility tools like screen readers.

But what about the Web? Being able to issue voice commands directly to a web page, and have the browser read text content aloud, would be very useful.

Fortunately, some intelligent folk have been at work on this. The Web Speech API has been around for quite some time, the spec having been written in around 2014, with no significant changes made since. As of late 2015, Firefox (44+ behind a pref, and Firefox OS 2.5+) has implemented Web Speech, with Chrome support available too!

In this article we’ll explore how this API works, and what kind of fun you can already have.

How does it work?

You might be thinking “functionality like Speech Synthesis is pretty complex to implement.” Well, you’d be right. Browsers tend to use the speech services available on the operating system by default, so for example you’ll be using the Mac Speech service when accessing speech synthesis on Firefox or Chrome for OS X.

The recognition and synthesis parts of the Web Speech API sit in the same spec, but operate independently of one another. There is nothing to stop you from implementing an app that recognizes a spoken voice command and then speaks it back to the user, but apart from that their functionality is separate.
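As a sketch of that combination, here is a small helper that speaks back whatever the recognizer hears. The name `echoBack` and the dependency-passing style are ours, not from the spec or the demos; passing the objects in means the same code runs against the real browser objects or simple stand-ins:

```javascript
// Wire a recognizer to a synthesizer: speak back the top transcript of
// each recognition result. (echoBack is a hypothetical helper name.)
function echoBack(recognition, synth, Utterance) {
  recognition.onresult = function(event) {
    var heard = event.results[0][0].transcript;
    synth.speak(new Utterance(heard));
  };
  recognition.start();
}
```

In a browser, you would call `echoBack(new SpeechRecognition(), window.speechSynthesis, SpeechSynthesisUtterance);`.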

Each one has a series of interfaces defining their functionality, at the center of which sits a controller interface — called (predictably) SpeechRecognition and SpeechSynthesis. In the coming sections we’ll explore how to use these interfaces to build up speech-enabled apps.

Browser support in more detail

As mentioned above, the two browsers that have implemented Web Speech so far are Firefox and Chrome. Chrome/Chrome mobile have supported synthesis and recognition since version 33, the latter with webkit prefixes.

Firefox on the other hand has support for both parts of the API without prefixes, although there are some things to bear in mind:

  • Even though recognition is implemented in Gecko, it is not currently usable in desktop/Android because the UX/UI to allow users to grant an app permission to use it is not yet implemented.
  • Speech synthesis does not work in Android yet.
  • To use the recognition and synthesis parts of the spec in Firefox (desktop/Android), you’ll need to enable the media.webspeech.recognition.enable and media.webspeech.synth.enabled flags in about:config.
  • In Firefox OS, for an app to use speech recognition it needs to be privileged, and include the audio-capture and speech-recognition permission (see here for a suitable manifest example)
  • Firefox does not currently support the continuous property
  • The onnomatch event handler is currently of limited use — it doesn’t fire because the speech recognition engine Gecko has integrated, Pocketsphinx, does not support a confidence measure for each recognition. So it doesn’t report back “sorry that’s none of the above” — instead it says “of the choices you gave me, this looks the best”.

Note: Chrome does not appear to deal with specific grammars; instead it just returns all results, and you can deal with them as you want. This is because Chrome’s server-side speech recognition has more processing power available than the client-side solution Firefox uses. There are advantages to each approach.

Demos

We have written two simple demos to allow you to try out speech recognition and synthesis: Speech color changer and Speak easy synthesis. You can find both of these on GitHub.

To run them live:

Speech Recognition

Let’s look quickly at the JavaScript powering the Speech color changer demo.

Chrome support

As mentioned earlier, Chrome currently supports speech recognition with prefixed properties, so we start our code with this, to make sure each browser gets fed the right object (nom nom.)

var SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
var SpeechGrammarList = window.SpeechGrammarList || window.webkitSpeechGrammarList;
var SpeechRecognitionEvent = window.SpeechRecognitionEvent || window.webkitSpeechRecognitionEvent;

The grammar

The next line defines the grammar we want our app to recognize:

var grammar = '#JSGF V1.0; grammar colors; public <color> = aqua | azure | beige | bisque | black | [LOTS MORE COLOURS] ;';

The grammar format used is JSpeech Grammar Format (JSGF).
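For readability, the same grammar can be written across several lines. A JSGF document declares a version, a grammar name, and one or more rules; the public rule (named `<color>` here) is what the recognizer matches utterances against. The abbreviated color list below is illustrative:

```
#JSGF V1.0;
grammar colors;
public <color> = aqua | azure | beige | bisque | black ;
```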

Plugging the grammar into our speech recognition

The next thing to do is define a speech recognition instance to control the recognition for our application. This is done using the SpeechRecognition() constructor. We also create a new speech grammar list to contain our grammar, using the SpeechGrammarList() constructor.

var recognition = new SpeechRecognition();
var speechRecognitionList = new SpeechGrammarList();

We add our grammar to the list using the SpeechGrammarList.addFromString() method. Its parameters are the grammar we want to add, plus optionally a weight value that specifies the importance of this grammar in relation to other grammars available in the list (can be from 0 to 1 inclusive.) The added grammar is available in the list as a SpeechGrammar object instance.

speechRecognitionList.addFromString(grammar, 1);

We then add the SpeechGrammarList to the speech recognition instance by setting it to the value of the SpeechRecognition grammars property.
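That wiring is a single property assignment. The sketch below uses plain stand-in objects, since the real constructors are browser-only; the lang, interimResults and maxAlternatives values are illustrative defaults, not something the spec requires:

```javascript
// Stand-ins for the browser objects; in the demo, `recognition` and
// `speechRecognitionList` come from the constructors shown above.
var recognition = {};
var speechRecognitionList = { length: 1 };

recognition.grammars = speechRecognitionList; // plug the grammar list in
recognition.lang = 'en-US';                   // illustrative values
recognition.interimResults = false;
recognition.maxAlternatives = 1;
```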

Starting the speech recognition

Now we implement an onclick handler so that when the screen is tapped/clicked, the speech recognition service will start. This is achieved by calling SpeechRecognition.start().

var diagnostic = document.querySelector('.output');
var bg = document.querySelector('html');

document.body.onclick = function() {
  recognition.start();
  console.log('Ready to receive a color command.');
}

Receiving and handling results

Once the speech recognition is started, there are many event handlers that can be used to retrieve results and other pieces of surrounding information (see the SpeechRecognition event handlers list.) The most common one you’ll probably use is SpeechRecognition.onresult, which is fired once a successful result is received:

recognition.onresult = function(event) {
  var color = event.results[0][0].transcript;
  diagnostic.textContent = 'Result received: ' + color + '.';
  bg.style.backgroundColor = color;
  console.log('Confidence: ' + event.results[0][0].confidence);
}

The second line here is a bit complex-looking, so let’s explain it step by step. The SpeechRecognitionEvent.results property returns a SpeechRecognitionResultList object containing one or more SpeechRecognitionResult objects. It has a getter so it can be accessed like an array — so the first [0] returns the SpeechRecognitionResult at position 0.

Each SpeechRecognitionResult object contains SpeechRecognitionAlternative objects that contain individual recognized words. These also have getters so they can be accessed like arrays — the second [0] therefore returns the SpeechRecognitionAlternative at position 0. We then use its transcript property to get a string containing the recognized result, set the background color to that color, and report the recognized color as a diagnostic message in the UI.
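To make the double indexing concrete, here is the shape of event.results mocked up with plain arrays and objects (the transcript and confidence values are invented):

```javascript
// A mock of the nested structure event.results returns, so the
// [0][0] indexing in the handler above is easy to follow:
var event = {
  results: [                                    // SpeechRecognitionResultList
    [                                           // SpeechRecognitionResult 0
      { transcript: 'azure', confidence: 0.92 } // SpeechRecognitionAlternative 0
    ]
  ]
};

var color = event.results[0][0].transcript;      // 'azure'
var confidence = event.results[0][0].confidence; // 0.92
```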

You can find more detail about this demo on MDN.

Speech Synthesis

Now let’s quickly review how the Speak easy synthesis demo works.

Setting variables

First of all, we capture a reference to Window.speechSynthesis. This is the API’s entry point — it returns an instance of SpeechSynthesis, the controller interface for web speech synthesis. We also create an empty array to store the available system voices (see the next step.)

var synth = window.speechSynthesis;

  ...

var voices = [];

Populating the select element

To populate the <select> element with the different voice options the device has available, we’ve written a populateVoiceList() function. We first invoke SpeechSynthesis.getVoices(), which returns a list of all the available voices, represented by SpeechSynthesisVoice objects. We then loop through this list — for each voice we create an <option> element and set its text content to display the name of the voice (grabbed from SpeechSynthesisVoice.name) and the language of the voice (grabbed from SpeechSynthesisVoice.lang), appending “-- DEFAULT” if the voice is the default voice for the synthesis engine (checked by seeing if SpeechSynthesisVoice.default returns true.)

function populateVoiceList() {
  voices = synth.getVoices();

  for(var i = 0; i < voices.length; i++) {
    var option = document.createElement('option');
    option.textContent = voices[i].name + ' (' + voices[i].lang + ')';

    if(voices[i].default) {
      option.textContent += ' -- DEFAULT';
    }

    option.setAttribute('data-lang', voices[i].lang);
    option.setAttribute('data-name', voices[i].name);
    voiceSelect.appendChild(option);
  }
}

When we come to run the function, we do the following. This is because Firefox doesn’t support the SpeechSynthesis.onvoiceschanged event, and will just return a list of voices when SpeechSynthesis.getVoices() is called. With Chrome, however, you have to wait for the event to fire before populating the list, hence the if statement seen below.

populateVoiceList();
if (speechSynthesis.onvoiceschanged !== undefined) {
  speechSynthesis.onvoiceschanged = populateVoiceList;
}

Speaking the entered text

Next, we create an event handler to start speaking the text entered into the text field. We are using an onsubmit handler on the form so that the action happens when Enter/Return is pressed. We first create a new SpeechSynthesisUtterance() instance using its constructor — this is passed the text input's value as a parameter.

Next, we need to figure out which voice to use. We use the HTMLSelectElement selectedOptions property to return the currently selected <option> element. We then use this element's data-name attribute, finding the SpeechSynthesisVoice object whose name matches this attribute's value. We set the matching voice object to be the value of the SpeechSynthesisUtterance.voice property.

Finally, we set the SpeechSynthesisUtterance.pitch and SpeechSynthesisUtterance.rate to the values of the relevant range form elements. Then, with all necessary preparations made, we start the utterance being spoken by invoking SpeechSynthesis.speak(), passing it the SpeechSynthesisUtterance instance as a parameter.

inputForm.onsubmit = function(event) {

  event.preventDefault();

  var utterThis = new SpeechSynthesisUtterance(inputTxt.value);
  var selectedOption = voiceSelect.selectedOptions[0].getAttribute('data-name');
  for(var i = 0; i < voices.length; i++) {
    if(voices[i].name === selectedOption) {
      utterThis.voice = voices[i];
    }
  }
  utterThis.pitch = pitch.value;
  utterThis.rate = rate.value;
  synth.speak(utterThis);

Finally, we call blur() on the text input. This is mainly to hide the keyboard on Firefox OS.

inputTxt.blur();
}

You can find more detail about this demo on MDN.

About Chris Mills

Chris Mills is a senior tech writer at Mozilla, where he writes docs and demos about open web apps, Firefox OS, and related subjects. He loves tinkering around with HTML, CSS, JavaScript and other web technologies, and gives occasional tech talks at conferences and universities. He used to work for Opera and W3C, and enjoys playing heavy metal drums and drinking good beer. He lives near Manchester, UK, with his good lady and three beautiful children.

More articles by Chris Mills…


26 comments

  1. Brett Zamir

    It is great to see momentum building on this. Is there a rough ETA on desktop support?

    January 21st, 2016 at 10:09

    1. Chris Mills

      synthesis works on desktop, albeit behind a flag (which I’ve added details about now!)

      recognition needs some engineering work…I’ll ask the engineers to weigh in on this.

      January 22nd, 2016 at 00:28

  2. vince

    The demos don’t work on Firefox Android

    January 21st, 2016 at 10:47

    1. Chris Mills

      Apologies — I shoulda been clearer on the prefs. I’ve added a line about the prefs you need to enable in the notes near the top. Recognition won’t work at the moment in desktop or Android.

      January 21st, 2016 at 12:40

  3. Aurelio De Rosa

    Hi Chris.

    I’ve been following this amazing API since the beginning and I’ve also done several demos with it. While reading your article I was quite surprised by the mention of support for grammars in Firefox. The specification hasn’t been updated in a long time, and the latest note about grammars is:

    “Editor note: The group is currently discussing options for which grammar formats should be supported, how builtin grammar types are specified, and default grammars when not specified.”
    source: https://dvcs.w3.org/hg/speech-api/raw-file/tip/webspeechapi.html#speechreco-speechgrammar

    My question is: is the JSpeech Grammar Format the format adopted by all the browsers (thus, the specification needs to be updated) or is it just Firefox? If it’s the agreed format, do you know when it was agreed (i.e. the date), and do you know of any link/article/resource I can look at?

    Thank you

    January 21st, 2016 at 14:35

  4. Michael Müller

    I have two questions I wanted to test for quite some time myself, but it’s probably faster just to ask:

    1. Since when is SpeechSynthesis.speak() supported on Firefox OS? https://developer.mozilla.org/en-US/docs/Web/API/SpeechSynthesis/speak and several other pages say 2.5, but I can confirm it works on my device with version 2.0.
    2. Is SSML supported, and to what degree? https://developer.mozilla.org/en-US/docs/Web/API/SpeechSynthesisUtterance/text says: “The text may be provided as plain text, or a well-formed SSML document. The SSML tags will be stripped away by devices that don’t support SSML.”

    January 22nd, 2016 at 00:09

    1. Chris Mills

      I’ve only tested it on 2.5; I was given the impression that this was when support started. If it works on 2.0, then I’ll have to update the support tables ;-)

      I don’t know to exactly what degree we support SSML. I’ll try to get a dev to answer this.

      January 26th, 2016 at 02:13

  5. Kelly Davis

    Brett Zamir, we don’t have an ETA for desktop, sorry.

    January 22nd, 2016 at 02:51

  6. Kelly Davis

    Aurelio De Rosa

    The JSpeech Grammar Format is, as far as I know, only adopted by Firefox.

    Generally, grammars are used to limit the universe of possible utterances to a manageable, finite set and thus simplify recognition computationally.

    Other browsers, for example Chrome, don’t use grammars at all. They use server based recognition which is able to tap into vast CPU cycle resources and thus do away with the grammar requirement.

    We do recognition on-device. Thus, we require grammars. Our target device, be it a Flame or something even more resource constrained, can not be guaranteed to have the CPU resources resident in Google’s cloud.

    January 22nd, 2016 at 03:08

  7. Nick Tulett

    How do you test that the spoken output is correct?

    January 25th, 2016 at 09:33

    1. Chris Mills

      Do you mean “test the output from the synthesiser to make sure it is correct before it is then played to the user” ? I’m not sure if this is really possible — the synthesis uses the platform’s default synthesis tool, so I think it is assumed that it just works. From looking at the spec, I think the closest we have is https://developer.mozilla.org/en-US/docs/Web/API/SpeechSynthesisUtterance/onstart (you could pause the speech on start, check whether the utterance contains the correct text to be spoken, then resume again if everything is ok. But I’m not sure if this is what you meant.)

      January 26th, 2016 at 02:11

  8. Stebs

    With Desktop Nightly on linux, the demos do not work (enabled both settings in about:config).
    Ok, speech recognition not working in Desktop, but what about synthesis, does it need espeak or something?

    January 25th, 2016 at 09:58

    1. Chris Mills

      It will need a speech synthesis engine of some kind to be present on the platform. Does Linux (or certain flavours of it) not have one by default?

      January 26th, 2016 at 02:12

  9. Nick Tulett

    I’m thinking from the point of view of system/regression testing.

    I can check text output easily and graphical output not so easily but well enough.

    Where do you start with audio? Do you just confirm the result manually and then use the encoded request to the speech engine as your regression test data?

    January 26th, 2016 at 04:57

  10. Matěj Cepl

    Concerning Linux … Orca is working, screen reading working, but when I open
    http://mdn.github.io/web-speech-api/speech-color-changer/ page, I get error

    ReferenceError: webkitSpeechRecognition is not defined

    That’s with Firefox 45 (Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0)

    January 26th, 2016 at 06:36

    1. Chris Mills

      Yeah, I’m afraid that the recognition part of the spec doesn’t work on Firefox desktop yet; only Firefox OS so far.

      January 27th, 2016 at 01:51

  11. Eitan

    Hey Nick,

    That’s a good question! In Firefox, we use a dummy speech service for testing. You could check out our tests for some examples: https://dxr.mozilla.org/mozilla-central/source/dom/media/webspeech/synth/test

    How would a web developer do automated testing? The speech synthesis backend is extendable. A developer can create a speech service extension and have access to the actual output on the other end. For an example of a speech service extension, you can see: http://blog.monotonous.org/2015/04/29/espeak-web-speech-api-addon/

    January 26th, 2016 at 09:39

  12. Stebs

    Ok, got speech synthesis to work on Linux, I installed espeak (often not installed by default, but present in the package manager of most distributions).
    Reboot needed though.

    January 28th, 2016 at 12:22

    1. Chris Mills

      Ah, this is good news – thanks for the positive report!

      January 29th, 2016 at 03:30

  13. Jefbinomed

    I tried to run it on Android but it still doesn’t work…. I don’t understand: I activated the flags on beta/nightly. Looking at what I get on remote debug, for beta the SpeechRecognition object is not present at all. And for nightly, when I try to use the SpeechRecognition object, even if it seems to be present, it says “SpeechRecognition not defined”….

    I restarted both applications but it didn’t change anything… What have I missed?

    I’m on a nexus 4 on lolipop.

    regards

    January 29th, 2016 at 01:41

    1. Chris Mills

      The recognition part of the API won’t currently work on Android (as stated in the above line — “Even though recognition is implemented in Gecko, it is not currently usable in desktop/Android because the UX/UI to allow users to grant an app permission to use it is not yet implemented.”)

      However, the synthesis part should work on Firefox Android, but the demo currently seems to fail. I’m investigating this currently, and will get back to you asap with an answer.

      January 29th, 2016 at 03:51

    2. Chris Mills

      Ok, it turns out that the speech synthesis part does not work on Android yet. Sorry about this. I’ll update the article as appropriate.

      February 1st, 2016 at 01:55

  14. Dmitri Levitin

    On Firefox desktop, if setting the about:config flags, I am able to get synth to work after a Firefox restart, but not recognition. Cannot find webkitSpeechRecognition or SpeechRecognition. Do you have any idea when this will be fixed?

    January 31st, 2016 at 18:55

    1. Chris Mills

      As said in the article, recognition does not yet work on desktop, as some of the UX/UI is not yet sorted out. There is currently no ETA for fixing this.

      February 1st, 2016 at 01:58

  15. Michael Gorham

    +1 for getting SpeechRecognition into Firefox desktop. Is there a Bugzilla report we can track this feature request?

    (If you put SpeechRecognition behind a flag, why would the permissions UI need to be sorted?)

    February 16th, 2016 at 10:06

  16. Mido Basim

    Thanks for the post Chris. I was looking forward to seeing the progress on this.
    +1 to what Michael Gorham said, but I think what Chris meant is the UI for enabling the microphone is not done (i.e. the part where the browser asks permission to use the microphone from the user).
    I tracked down two tickets for this: https://bugzilla.mozilla.org/show_bug.cgi?id=1244237 and https://bugzilla.mozilla.org/show_bug.cgi?id=1248897
    They are both not in progress as of writing this. I hope the firefox team works on this soon :D.
    Cheers.

    February 18th, 2016 at 02:24

Comments are closed for this article.