CCMatrix: A billion-scale bitext data set for training translation models
CCMatrix is the largest data set of high-quality, web-based bitexts for training translation models. With more than 4.5 billion parallel sentences in 576 language pairs pulled from snapshots of the CommonCrawl public data set, CCMatrix is more than 50 times larger than the WikiMatrix corpus that we shared last year. Gathering a data set of this size required modifying our previous bitext mining approach used for WikiMatrix, assuming that the translation of one sentence could be found anywhere on CommonCrawl, which functions as an open archive of the internet. To address the significant computational challenges posed by comparing billions of sentences to determine which ones are mutual translations, we used massively parallel processing, as well as our highly efficient FAISS library for fast similarity searches.