The first high-performance self-supervised algorithm that works for speech, vision, and text
“Self-supervised learning — where machines learn by directly observing the environment rather than being explicitly taught through labeled images, text, audio, and other data sources — has powered many significant recent advances in AI. But while people appear to learn in a similar way regardless of how they get information — whether they use sight or sound, for example — there are currently big differences in the way self-supervised learning algorithms learn from images, speech, text, and other modalities.
This discrepancy has been a significant barrier to applying advances in self-supervised learning more broadly. Because a powerful algorithm designed for, say, understanding images can’t be directly applied to another modality, such as text, it is difficult to push several modalities ahead at the same rate.
This is why Meta AI developed and is excited to announce data2vec, the first high-performance self-supervised algorithm that works for multiple modalities. We apply data2vec separately to speech, images and text and it outperformed the previous best single-purpose algorithms for computer vision and speech and it is competitive on NLP tasks. It also represents a new paradigm of holistic self-supervised learning, where new research improves multiple modalities rather than just one. It also does not rely on contrastive learning or reconstructing the input example. In addition to helping accelerate progress in AI, data2vec brings us closer to building machines that learn seamlessly about different aspects of the world around them. It will enable us to develop more adaptable AI, which we believe will be able to perform tasks beyond what today’s systems can do…”