HuBERT: Self-supervised representation learning for speech recognition, generation, and compression
“The north star for many AI research programs has been continuously learning to better recognize and understand speech simply through listening and interacting with others, similar to how babies learn their first language. This requires not only analyzing the words that someone speaks but also many other cues from how those words are delivered, e.g., speaker identity, emotion, hesitation, and interruptions. Furthermore, to completely understand a situation as a person would, the AI system must distinguish and interpret noises that overlap with the speech signal, e.g., laughter, coughing, lip-smacking, background vehicles, or birds chirping.
To open the door for modeling these types of rich lexical and nonlexical information in audio, we are releasing HuBERT, our new approach for learning self-supervised speech representations. HuBERT matches or surpasses the SOTA approaches for speech representation learning for speech recognition, generation, and compression…”