A Journey to <10% Word Error Rate

At Mozilla, we believe speech interfaces will be a big part of how people interact with their devices in the future. Today we are excited to announce the initial release of our open source speech recognition model so that anyone can develop compelling speech experiences.

The Machine Learning team at Mozilla Research has been working on an open source Automatic Speech Recognition engine modeled after the Deep Speech papers (1, 2) published by Baidu. One of the major goals from the beginning was to achieve a Word Error Rate in the transcriptions of under 10%. We have made great progress: Our word error rate on LibriSpeech’s test-clean set is 6.5%, which not only achieves our initial goal, but gets us close to human level performance.

This post is an overview of the team’s efforts and ends with a more detailed explanation of the final piece of the puzzle: the CTC decoder.

The architecture

Deep Speech is an end-to-end trainable, character-level, deep recurrent neural network (RNN). In less buzzwordy terms: it’s a deep neural network with recurrent layers that gets audio features as input and outputs characters directly — the transcription of the audio. It can be trained using supervised learning from scratch, without any external “sources of intelligence”, like a grapheme to phoneme converter or forced alignment on the input.

This animation shows how data flows through the network: from the audio input to the feature computation, through three initial feed forward layers, then through a bidirectional RNN layer, and finally through the final softmax layer, where a character is predicted. In practice, instead of processing slices of the audio input individually, we process all slices at once.

The network has five layers: the input is fed into three fully connected layers, followed by a bidirectional RNN layer, and finally a fully connected layer. The hidden fully connected layers use the ReLU activation. The RNN layer uses LSTM cells with tanh activation.

The output of the network is a matrix of character probabilities over time. In other words, for each time step the network outputs one probability for each character in the alphabet, which represents the likelihood of that character corresponding to what’s being said in the audio at that time. The CTC loss function (PDF link) considers all alignments of the audio to the transcription at the same time, allowing us to maximize the probability of the correct transcription being predicted without worrying about alignment. Finally, we train using the Adam optimizer.
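The CTC loss works because many different frame-by-frame alignments collapse to the same transcription, and it sums probability over all of them. The collapse rule itself (merge repeated characters, then drop the blank symbol) can be sketched in a few lines; here "-" stands for the CTC blank:

```python
def ctc_collapse(path, blank="-"):
    """Collapse a CTC alignment: merge repeated characters, then drop blanks."""
    out, prev = [], None
    for ch in path:
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return "".join(out)

# Several distinct alignments, one transcription:
assert ctc_collapse("C-AA-TT") == "CAT"
assert ctc_collapse("CCA-T--") == "CAT"
# A blank is what separates a genuine double letter:
assert ctc_collapse("CC-C") == "CC"
```

Because every one of these paths counts toward "CAT", maximizing their summed probability trains the network without ever needing a frame-level alignment.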

The data

Supervised learning requires data, lots and lots of it. Training a model like Deep Speech requires thousands of hours of labeled audio, and obtaining and preparing this data can be as much work as, if not more than, implementing the network and the training logic.

We started by downloading freely available speech corpora like TED-LIUM and LibriSpeech, as well as acquiring paid corpora like Fisher and Switchboard. We wrote importers in Python for the different data sets that convert the audio files to WAV, split the audio into chunks, and clean the transcriptions of unneeded characters like punctuation and accents. Finally, we stored the preprocessed data in CSV files that can be used to feed data into the network.
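The transcription cleanup step can be sketched like this; the function below is illustrative, not the project's actual importer code, and assumes an English alphabet of lowercase letters, apostrophes, and spaces:

```python
import re
import unicodedata

def clean_transcript(text):
    """Normalize a transcript the way a corpus importer might:
    strip accents, lowercase, and drop everything outside the alphabet."""
    text = unicodedata.normalize("NFKD", text)            # decompose accented chars
    text = "".join(c for c in text if not unicodedata.combining(c))
    text = text.lower()
    text = re.sub(r"[^a-z' ]", " ", text)                 # punctuation -> space
    return re.sub(r"\s+", " ", text).strip()              # collapse whitespace

print(clean_transcript("Café, déjà-vu!"))  # cafe deja vu
```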

Using existing speech corpora allowed us to quickly start working on the model. But in order to achieve excellent results, we needed a lot more data. We had to be creative. We thought that maybe this type of speech data would already exist out there, sitting in people’s archives, so we reached out to public TV and radio stations, language study departments in universities, and basically anyone who might have labeled speech data to share. Through this effort, we were able to more than double the amount of training data we had to work with, which is now enough for training a high-quality English model.

Having a high-quality voice corpus publicly available not only helps advance our own speech recognition engine. It will eventually allow for broad innovation, because developers, startups and researchers around the world can train and experiment with different architectures and models for different languages. It could help democratize access to Deep Learning for those who can’t afford to pay for thousands of hours of training data (almost everyone).

To build a speech corpus that’s free, open source, and big enough to create meaningful products with, we worked with Mozilla’s Open Innovation team and launched the Common Voice project to collect and validate speech contributions from volunteers all over the world. Today, the team is releasing a large collection of voice data into the public domain. Find out more about the release on the Open Innovation Medium blog.

The hardware

Deep Speech has over 120 million parameters, and training a model this large is a very computationally expensive task: you need lots of GPUs if you don’t want to wait forever for results. We looked into training on the cloud, but it doesn’t work financially: dedicated hardware pays for itself quite quickly if you do a lot of training. The cloud is a good way to do fast hyperparameter explorations though, so keep that in mind.

We started with a single machine running four Titan X Pascal GPUs, and then bought another two servers with 8 Titan XPs each. We run the two 8 GPU machines as a cluster, and the older 4 GPU machine is left independent to run smaller experiments and test code changes that require more compute power than our development machines have. This setup is fairly efficient, and for our larger training runs we can go from zero to a good model in about a week.

Setting up distributed training with TensorFlow was an arduous process. Although it has the most mature distributed training tools of the available deep learning frameworks, getting things to actually work without bugs and to take full advantage of the extra compute power is tricky. Our current setup works thanks to the incredible efforts of my colleague Tilman Kamp, who endured long battles with TensorFlow, Slurm, and even the Linux kernel until we had everything working.

Putting it all together

At this point, we have two papers to guide us, a model implemented based on those papers, the data we gathered, and the hardware required for the training process. It turns out that replicating the results of a paper isn’t that straightforward. The vast majority of papers don’t specify all the hyperparameters they use, if they specify any at all. This means you have to spend a whole lot of time and energy doing hyperparameter searches to find a good set of values. Our initial results, with values chosen through a mix of randomness and intuition, weren’t even close to those reported in the papers, probably due to small differences in the architecture — for one, we used LSTM (Long Short-Term Memory) cells instead of GRU (Gated Recurrent Unit) cells. We spent a lot of time doing a binary search on dropout ratios, reduced the learning rate, changed the way the weights were initialized, and experimented with the size of the hidden layers as well. All of those changes got us pretty close to our desired target of <10% Word Error Rate, but not all the way there.

One piece missing from our code was an important optimization: integrating our language model into the decoder. The CTC (Connectionist Temporal Classification) decoder works by taking the probability matrix that is output by the model and walking over it looking for the most likely text sequence according to the probability matrix. If at time step 0 the letter “C” is the most likely, and at time step 1 the letter “A” is the most likely, and at time step 2 the letter “T” is the most likely, then the transcription given by the simplest possible decoder will be “CAT”. This strategy is called greedy decoding.
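Greedy decoding over a toy probability matrix can be sketched as follows (the alphabet and numbers are made up, just to make the "CAT" example concrete):

```python
ALPHABET = ["C", "A", "T", "-"]  # toy alphabet; "-" is the CTC blank

def greedy_decode(probs):
    """Pick the most likely character at each time step,
    merge repeats, and drop blanks."""
    out, prev = [], None
    for row in probs:
        idx = max(range(len(row)), key=row.__getitem__)  # argmax
        if idx != prev and ALPHABET[idx] != "-":
            out.append(ALPHABET[idx])
        prev = idx
    return "".join(out)

# One row per time step, one column per character.
probs = [
    [0.80, 0.10, 0.05, 0.05],  # "C" wins
    [0.10, 0.70, 0.10, 0.10],  # "A" wins
    [0.05, 0.10, 0.80, 0.05],  # "T" wins
]
print(greedy_decode(probs))  # CAT
```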

A cat with icons of pause buttons where its paws would be.

This is a pretty good way of decoding the probabilities output by the model into a sequence of characters, but it has one major flaw: it only takes into account the output of the network, which means it only takes into account the information from audio. When the same audio has two equally likely transcriptions (think “new” vs “knew”, “pause” vs “paws”), the model can only guess at which one is correct. This is far from optimal: if the first four words in a sentence are “the cat has tiny”, we can be pretty sure that the fifth word will be “paws” rather than “pause”. Answering those types of questions is the job of a language model, and if we could integrate a language model into the decoding phase of our model, we could get way better results.

When we first tried to tackle this issue, we ran into a couple of blockers in TensorFlow: first, it doesn’t expose its beam scoring functionality in the Python API (probably for performance reasons); and second, the log probabilities output by the CTC loss function were (are?) invalid.

We decided to work around the problem by building something like a spell checker instead: go through the transcription and see if there are any small modifications we can make that increase the likelihood of that transcription being valid English, according to the language model. This did a pretty good job of correcting small spelling mistakes in the output, but as we got closer and closer to our target error rate, we realized that it wasn’t going to be enough. We’d have to bite the bullet and write some C++.
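The idea behind that workaround can be sketched as a Norvig-style spell checker. This is a sketch of the approach, not the team's actual code; the toy word-frequency table stands in for a real language model:

```python
import string

# Toy "language model": word scores standing in for real LM probabilities.
LM = {"paws": 0.7, "pause": 0.3, "cat": 1.0}

def edits1(word):
    """All strings one edit (delete, replace, insert) away from `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + replaces + inserts)

def correct(word):
    """Keep the word unless a one-edit neighbor scores higher under the LM."""
    candidates = {word} | {w for w in edits1(word) if w in LM}
    return max(candidates, key=lambda w: LM.get(w, 0.0))

print(correct("pawz"))  # paws
```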

Beam scoring with a language model

Integrating the language model into the decoder involves querying the language model every time we evaluate an addition to the transcription. Going back to the previous example, when looking into whether we want to choose “paws” or “pause” for the next word after “the cat has tiny”, we query the language model and use that score as a weight to sort the candidate transcriptions. Now we get to use information not just from the audio but also from our language model to decide which transcription is more likely. The algorithm is described in this paper by Hannun et al.
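A toy version of that weighting looks like this; all the numbers and the LM weight are made up for illustration, and real decoders tune the weight as a hyperparameter:

```python
import math

# Hypothetical log-probabilities for the word after "the cat has tiny":
acoustic = {"pause": math.log(0.50), "paws": math.log(0.50)}  # audio is ambiguous
lm       = {"pause": math.log(0.02), "paws": math.log(0.60)}

ALPHA = 0.75  # language model weight

def score(word):
    """Combined score used to rank candidate beam extensions."""
    return acoustic[word] + ALPHA * lm[word]

print(max(acoustic, key=score))  # paws
```

The audio alone cannot separate the two candidates, but the language model term breaks the tie in favor of “paws”.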

Luckily, TensorFlow does have an extension point on its CTC beam search decoder that allows the user to supply their own beam scorer. This means all you have to do is write the beam scorer that queries the language model and plug that in. For our case, we wanted that functionality to be exposed to our Python code, so we also exposed it as a custom TensorFlow operation that can be loaded using tf.load_op_library.

Getting all of this to work with our setup required quite a bit of effort, from fighting with the Bazel build system for hours, to making sure all the code could handle Unicode input in a consistent way, to debugging the beam scorer itself. The system requires quite a few pieces to work together:

  • The language model itself (we use KenLM for building and querying).
  • A trie of all the words in our vocabulary.
  • An alphabet file that maps integer labels output by the network into characters.
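A minimal vocabulary trie (a Python sketch of the idea, not the decoder's actual C++ data structure) might look like this:

```python
def build_trie(words):
    """Nested-dict trie; the "$" key marks the end of a complete word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root

def is_prefix(trie, text):
    """True if `text` is a prefix of some vocabulary word,
    which lets the decoder prune beams that can never become valid words."""
    node = trie
    for ch in text:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = build_trie(["paws", "pause", "cat"])
print(is_prefix(trie, "pau"), is_prefix(trie, "pax"))  # True False
```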

Although adding this many moving parts does make our code harder to modify and apply to different use cases (like other languages), it brings great benefits: Our word error rate on LibriSpeech’s test-clean set went from 16% to 6.5%, which not only achieves our initial goal, but gets us close to human level performance (5.83% according to the Deep Speech 2 paper). On a MacBook Pro, using the GPU, the model can do inference at a real-time factor of around 0.3x, and around 1.4x on the CPU alone. (A real-time factor of 1x means you can transcribe 1 second of audio in 1 second.)
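For reference, the Word Error Rate here is the standard metric: the word-level edit distance between reference and hypothesis, divided by the number of words in the reference. A minimal implementation:

```python
def wer(reference, hypothesis):
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat has tiny paws", "the cat has tiny pause"))  # 0.2
```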

It has been an incredible journey to get to this place: the initial release of our model! In the future we want to release a model that’s fast enough to run on a mobile device or a Raspberry Pi.

If this type of work sounds interesting or useful to you, come check out our repository on GitHub and our Discourse channel. We have a growing community of contributors and we’re excited to help you create and publish a model for your language.

About Reuben Morais

Reuben is an engineer in the Machine Learning group at Mozilla Research.



21 comments

  1. Hugo

    Very nice article, thanks a lot for developing this speech recognition model.
    It’s quite creepy to send all our voice to Google/Apple/Microsoft servers; hopefully we’ll be able to start building software that doesn’t rely on them, thanks to this framework. This is a very important piece of technology and I’m so glad we now have an open source solution thanks to you!

    November 29th, 2017 at 10:49


  2. Charles Pritchard

    Powerful summary of the development of “Project DeepSpeech” an open source implementation of speech-to-text, and the Common Voice project, a public domain corpus of voice recognition data.

    The Mozilla Research Machine Learning team storyline starts with an architecture that uses existing modern machine learning software, then trains a deep recurrent neural network, wrangling annotated speech corpora from anywhere they can and subsequently creating Common Voice project to make this step easier for developers.

    By implementing research from several papers on speech-to-text, Mozilla has demonstrated to others what data science work looks like, and has made it accessible to a wide range of people, summarizing a year of development into a documented and reproducible experiment for any developer interested in machine learning software.

    I really appreciate the conscientious development choices made, around building a readable repository, while supporting a variety of hardware; this project, and this article are a great study.

    November 29th, 2017 at 21:37


  3. hdnzkwk

    Hello. I have a nice idea: you can generate the marked speech with webspeech on user’s devices using large variety of speech synthesis engines available to the users (especially on Android). You don’t need to install and/or buy them, just implement the api allowing to process the webspeech with webaudio and a page requesting the text from the server, speaking it, saving audio with timing and sending back to the server.

    November 30th, 2017 at 03:40


    1. Reuben Morais

      That’s certainly an area worthy of exploration. One has to be careful about feeding artificial data in scenarios like this, because the model could fail to generalize from the artificial data to real human speech.

      December 4th, 2017 at 07:58


  4. joe

    Really neat stuff. I’m going to share this article with coworkers so they understand better the amount of work necessary. I’m curious how long this took to do with what size of staff? Thanks for sharing!

    November 30th, 2017 at 17:53


    1. Reuben Morais

      It took our team of four (five for the first six months) slightly less than a year of work.

      December 4th, 2017 at 07:56


  5. Jaromir

    Is it possible to use voices from Tatoeba?
    It is good data.

    November 30th, 2017 at 22:34


  6. Vivek Tyagi

    Bravo, Speech team Mozilla. You are modern day pioneers for Speech Technology and for bringing it to masses. Thank you very much for your open source speech corpus!

    December 1st, 2017 at 22:44


  7. Mike Nelson

    Agreed. Absolutely great work! The alternatives are all cloud based meaning not only are we relying on Google/Apple/Microsoft/IBM but in most cases there are fees. It would be amazing if high quality speech recognition could be run offline on a mobile device using an open source free model.

    December 2nd, 2017 at 12:03


    1. Lesław

      I am working on a project like that. Should have a high quality acoustic and language model ready for Julius in next 3 to 6 months. All thanks to Mozilla :)

      December 6th, 2017 at 01:46


  8. Ben

    So, I read all this, much of which was incomprehensible to me at the moment as it is hugely jargon-laden, and I find I have two unanswered questions:

    First, can employing this result in a one-machine, not-going-to-the-LAN-or-WAN speech recognition engine?

    Second, when you say it’s in “Python”, unfortunately, that no longer is definitive. Python 2 and Python 3 are source incompatible. Which Python did you use?

    If this is a stand-alone engine, or the core of one that can arise from same, then I’m super excited, thank you so very much. If not… well, there are already send-your-data-elsewhere engines… and I’m not. :)

    December 2nd, 2017 at 13:48


    1. Reuben Morais

      Yes, this model can be used to do offline speech recognition. As of now, our code needs Python 2.7 for training, but if you just want to use the pre-trained models, we have packages for Python 2.7, 3.4, 3.5 and 3.6.

      December 4th, 2017 at 07:54


      1. Ben

        The only word that comes to mind is “fabulous.” :)

        December 4th, 2017 at 15:36


  9. foucault vincent

    Thanks a lot for your hard work, Reuben
    Good job !!

    December 3rd, 2017 at 11:35


  10. Praveen Yadav

    Great work Mozilla, specially Reuben.

    Just curious: how many epochs did it take for the LibriSpeech training to get to 6% WER?

    December 5th, 2017 at 10:20


    1. Reuben Morais

      I don’t remember it off the top of my head. Something between 10 and 15 epochs.

      December 5th, 2017 at 10:49


  11. Michaela Merz

    Great first (second .. third ?) step. I’ve been working to “liberate” smart devices from their “bridge” servers to be able to control them directly within my WiFi environment. As most devices use ESP8266 chips that’s not too complicated. So far I still have to use Alexa to do the voice recognition – but that will hopefully not be necessary for too long.

    A few suggestions for the future:

    — The ability to train words for an application. We sure don’t need the whole data base to recognize the words commonly used in IoT environments;
    — Direct microphone recording;
    — Wake word integration. I am currently using kit.ai which can be trained and is free for developers;

    Again – congrats. I was playing around with the current prototype and can’t wait for this development reach some early production mile stone.

    Happy Holidays to y’all :D

    Michaela

    December 7th, 2017 at 16:08


  12. Amit

    Hello Reuben,

    Great work by you and your team. Can you also share the volume of content used by your team for training?

    December 11th, 2017 at 03:15


    1. Reuben Morais

      About 3000 hours of speech were used.

      December 11th, 2017 at 04:27


  13. Borislav Zlatanov

    Thank you for making this incredible contribution, both with the algorithm and the open data.

    I am wondering what procedure exactly you use to calculate the Word Error Rate. It seems that Word Error Rate percentages are being cited in various articles (like, Google achieved this percentage and Microsoft achieved that percentage), but nobody specifies their procedure for calculating it. No steps or tools are specified.

    It will be very helpful if you could provide your step-by-step procedure for calculating the WER. Thanks!

    December 11th, 2017 at 14:31


    1. Reuben Morais

      Here it is: https://github.com/mozilla/DeepSpeech/blob/9c338b76dba4f8816f129e0c5219aacace93b022/util/text.py#L85-L108

      December 11th, 2017 at 15:39

