Common Voice dataset tops 20,000 hours

The latest Common Voice dataset, released today, has achieved a major milestone: More than 20,000 hours of open-source speech data that anyone, anywhere can use. The dataset has nearly doubled in the past year.

Why should you care about Common Voice?

  • Do you have to change your accent to be understood by a virtual assistant? 
  • Are you worried that so many voice-operated devices are collecting your voice data for proprietary Big Tech datasets?
  • Are automatic subtitles unavailable for you in your language?

Automatic Speech Recognition plays an important role in the way we can access information, however, of the 7,000 languages spoken globally today only a handful are supported by most products.

Mozilla’s Common Voice seeks to change the language technology ecosystem by supporting communities to collect voice data for the creation of voice-enabled applications for their own languages. 

Common Voice Dataset Release 

This release wouldn’t be possible without our contributors — from voice donations to initiating their language in our project, to opening new opportunities for people to build voice technology tools that can support every language spoken across the world.

Access the dataset: https://commonvoice.mozilla.org/datasets

Access the metadata: https://github.com/common-voice/cv-dataset 

Highlights from the latest dataset:

  • The new release also features six new languages: Tigre, Taiwanese (Minnan), Meadow Mari, Bengali, Toki Pona and Cantonese.
  • Twenty-seven languages now have at least 100 hours of speech data. They include Bengali, Thai, Basque, and Frisian.
  • Nine languages now have at least 500 hours of speech data. They include Kinyarwanda (2,383 hours), Catalan (2,045 hours), and Swahili (719 hours).
  • Nine languages now all have at least 45% of their gender tags as female. They include Marathi, Dhivehi, and Luganda.
  • The Catalan community fueled major growth. The Catalan community’s Project AINA — a collaboration between Barcelona Supercomputing Center and the Catalan Government — mobilized Catalan speakers to contribute to Common Voice. 
  • Supporting community participation in decision making yet. The Common Voice language Rep Cohort has contributed feedback and learnings about optimal sentence collection, the inclusion of language variants, and more. 

 Create with the Dataset 

How will you create with the Common Voice Dataset?

Take some inspiration from technologists who are creating conversational chatbots, spoken language identifiers, research papers and virtual assistants with the Common Voice Dataset by watching this talk: 

https://mozilla.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=6492f3ae-3a0d-4363-99f6-adc00111b706 

Share with us how you are using the dataset on social media using #CommonVoice or sharing on our Community discourse. 

 

About Melissa Thermidor

Melissa is a Senior Internet Advertising Specialist at Mozilla

More articles by Melissa Thermidor…