High-performance speech recognition with no supervision at all
“Whether it’s giving directions, answering questions, or carrying out requests, speech recognition makes life easier in countless ways. But today the technology is available for only a small fraction of the thousands of languages spoken around the globe. This is because high-quality systems need to be trained with large amounts of transcribed speech audio. This data simply isn’t available for every language, dialect, and speaking style. Transcribed recordings of English-language novels, for example, will do little to help machines learn to understand a Basque speaker ordering food off a menu or a Tagalog speaker giving a business presentation.
This is why we developed wav2vec Unsupervised (wav2vec-U), a way to build speech recognition systems that require no transcribed data at all. It rivals the performance of the best supervised models from only a few years ago, which were trained on nearly 1,000 hours of transcribed speech. We’ve tested wav2vec-U with languages such as Swahili and Tatar, which do not currently have high-quality speech recognition models available because they lack extensive collections of labeled training data.
Wav2vec-U is the result of years of Facebook AI’s work in speech recognition, self-supervised learning, and unsupervised machine translation. It is an important step toward building machines that can solve a wide range of tasks just by learning from their observations. We think this work will bring us closer to a world where speech technology is available for many more people…”