Turing Bletchley: A Universal Image Language Representation model by Microsoft – Microsoft Research
Language and vision are inherently linked. When we hear the statement “a beautiful sunset on the beach,” we imagine an image similar to the one above. Models that focus only on language fail to capture this link. To these models, sentences are no more than grammatically correct sequences of words.
Furthermore, vision is a global modality. The same beach sunset can be described in any language (“una hermosa puesta de sol en la playa”, “un beau coucher de soleil sur la plage”, “Matahari terbenam yang indah di pantai”, etc.) without changing the corresponding visual representation. Traditional multi-modal models tie vision to a particular language (most commonly English) and therefore fail to capture this universal property of vision.
With T-Bletchley, we address both of these shortcomings. We take a multi-modal approach that advances a computer’s ability to understand language as well as to understand images natively, just from pixels. Additionally, we take a universal-first approach to the language modality when developing the model. The result is a one-of-a-kind universal multi-modal model that understands images and text across 94 different languages, enabling some impressive capabilities. For example, by utilizing a common image-language vector space, without using any metadata or extra information such as surrounding text, T-Bletchley can retrieve images that match a text description provided in any language. It can also find images that answer text-based questions in any language, or images that are semantically similar to another image…
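The retrieval described above reduces to nearest-neighbor search in the shared vector space: captions in any language and images are encoded into the same space, and matches are ranked by cosine similarity. A minimal sketch of that mechanic, using dummy random embeddings in place of T-Bletchley's encoders (which are not exposed in this excerpt; `DIM`, the toy vectors, and the noise level are all illustrative assumptions):

```python
import numpy as np

def normalize(v):
    # Project vectors onto the unit sphere so a dot product equals
    # cosine similarity.
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
DIM = 64  # illustrative embedding size, not the model's actual dimension

# Stand-in image embeddings: in the real model these come from the
# image encoder; here they are random unit vectors.
image_embs = normalize(rng.normal(size=(3, DIM)))

# A universal model maps a caption in any language ("a beautiful sunset
# on the beach", "una hermosa puesta de sol en la playa", ...) near its
# image's embedding; simulate that by lightly perturbing image 1's vector.
query = normalize(image_embs[1] + 0.05 * rng.normal(size=DIM))

# Retrieval: rank images by cosine similarity to the text query.
scores = image_embs @ query
best = int(np.argmax(scores))
print(best)  # retrieves image 1 regardless of the caption's language
```

Because both modalities live in one space, the same dot-product ranking answers text-to-image, question-to-image, and image-to-image queries without any language-specific machinery.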