Teaching AI to translate 100s of spoken and written languages in real-time

February 23, 2022

For people who understand languages like English, Mandarin, or Spanish, it may seem like today’s apps and web tools already provide the translation technology we need. But billions of people are being left out — unable to easily access most of the information on the internet or connect with most of the online world in their native language. Today’s machine translation (MT) systems are improving rapidly, but they still rely heavily on learning from large amounts of textual data, so they generally do not work well for low-resource languages (languages that lack training data) or for languages that don’t have a standardized writing system.

Eliminating language barriers would be profound, making it possible for billions of people to access information online in their native or preferred languages. Advances in MT won’t just help people who don’t speak one of the languages that dominate the internet today; they’ll also fundamentally change the way people around the world connect and share ideas.

Imagine, for example, people in a marketplace who speak different languages being able to communicate with one another in real time using a phone, watch, or glasses. Or multimedia content on the web that’s accessible to anyone in the world in their preferred language. In the not-too-distant future, when emerging technologies like virtual and augmented reality bring the digital and physical worlds together in the metaverse, translation tools will enable people to do everyday activities — hosting a book club or collaborating on a work project — with anyone, anywhere, just as they would with someone next door.

Meta AI is announcing a long-term effort to build language and MT tools that will include most of the world’s languages. This includes two new projects. The first is No Language Left Behind, where we are building a new advanced AI model that can learn from languages with fewer examples to train from, and we will use it to enable expert-quality translations in hundreds of languages, ranging from Asturian to Luganda to Urdu. The second is Universal Speech Translator, where we are designing novel approaches to translating from speech in one language to another in real time so we can support languages without a standard writing system as well as those that are both written and spoken.

It will take much more work to provide everyone around the world with truly universal translation tools. But we believe the efforts described here are an important step forward. Sharing details and open-sourcing our code and models in the future means that others can build on our work and bring us closer to achieving this important goal.

The challenges of translating every language

The AI translation systems of today are not designed to serve the thousands of languages used around the world, or to provide real-time speech-to-speech translation. To truly serve everyone, the MT research community will need to overcome three important challenges. We will need to overcome data scarcity by acquiring more training data in more languages as well as finding new ways to leverage the data already available today. We’ll also need to overcome the modeling challenges that arise as models grow to serve many more languages. And we will need to find new ways to evaluate and improve on their results.

Data scarcity remains one of the biggest hurdles to expanding translation tools across more languages. MT systems for text translations typically rely on learning from millions of sentences of annotated data. Because of this, MT systems capable of high-quality translations have been developed for only the handful of languages that dominate the web. Expanding to other languages means finding ways to acquire and use training examples from languages with sparse web presences.

For direct speech-to-speech translation, the challenge of acquiring data is even more severe. Most speech MT systems use text as an intermediary step, meaning speech in one language is first converted to text, then translated to text in the target language, and then finally input into a text-to-speech system to generate audio. This makes speech-to-speech translations dependent on text in ways that limit their efficiency and make them difficult to scale to languages that are primarily oral.

Direct speech-to-speech translation models can enable translations for languages that don’t have standardized writing systems. This speech-based approach could also lead to much faster, more efficient translation systems, since they won’t require the additional steps of converting speech to text, translating it, and then generating speech in the target language.
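The efficiency argument above can be made concrete with a toy calculation. The 90 percent per-stage figure below is purely illustrative (not a measured number): the point is that in a cascade, per-stage errors and latencies compound, while a direct model has a single learned stage.

```python
# Toy illustration: why a direct speech-to-speech model can beat a cascade.
# Errors (and latencies) compound across cascaded stages; the per-stage
# accuracy of 0.9 below is a made-up number used only for illustration.

def cascade_quality(stage_accuracies):
    """If each stage preserves a fraction of the meaning intact,
    the cascade retains roughly the product of those fractions."""
    quality = 1.0
    for accuracy in stage_accuracies:
        quality *= accuracy
    return quality

# ASR -> text MT -> TTS, each hypothetically preserving 90% of the signal:
cascade = cascade_quality([0.9, 0.9, 0.9])   # ≈ 0.729
direct = 0.9                                  # one learned end-to-end stage
print(round(cascade, 3), direct)              # 0.729 0.9
```

The same compounding applies to latency: three sequential models must each finish before the next starts, whereas a direct model produces target speech in one pass.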

In addition to needing suitable training data in thousands of languages, MT systems today are simply not designed to scale to meet the needs of everyone around the globe. Many MT systems are bilingual, meaning there is a separate model for each language pair, such as English-Russian or Japanese-Spanish. This approach is extraordinarily difficult to scale to dozens of language pairs, let alone to all the languages in use around the world. Imagine needing to create and maintain many thousands of different models for every combination from Thai-Lao to Nepali-Assamese. Many experts have suggested that multilingual systems might help here. But it has been tremendously difficult to incorporate many languages into a single efficient, high-performance multilingual model that has the capacity to represent all languages.
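A quick calculation (our own illustration, not a figure from above) shows how fast the one-model-per-pair approach blows up: N languages require N × (N − 1) directed translation models.

```python
# Quadratic blowup of one-model-per-pair translation: for N languages there
# are N * (N - 1) directed pairs, versus a single multilingual model.

def directed_pairs(n_languages: int) -> int:
    """Number of separate bilingual models needed to cover every direction."""
    return n_languages * (n_languages - 1)

print(directed_pairs(10))    # 90
print(directed_pairs(101))   # 10100 directions for a 101-language system
```

A single multilingual model replaces all of those pairwise systems at once, which is why the capacity and routing challenges described above are worth solving.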

Real-time speech-to-speech MT models face many of the same challenges as text-based models but also need to overcome latency — the lag that occurs when one language is being translated to another — before they can be effectively used to enable real-time translations. The main challenge comes from the fact that a sentence can be spoken in different word orders in different languages. Even professional simultaneous interpreters lag behind the original speech by around three seconds. Consider a sentence in German, “Ich möchte alle Sprachen übersetzen,” and its equivalent in Spanish, “Quisiera traducir todos los idiomas.” Both mean “I would like to translate all languages.” But translating from German to English in real time would be more challenging because the verb “übersetzen” (“translate”) appears at the end of the German sentence, while the word order in Spanish and English is similar.
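One standard way the research community makes this latency trade-off explicit is a wait-k policy: read k source tokens before emitting each target token, then alternate reading and writing. The sketch below illustrates that generic scheduling idea (the target tokens are supplied by hand, since no model is involved); it is not a description of the systems discussed in this post.

```python
# Minimal wait-k simultaneous translation schedule: a standard technique in
# simultaneous MT, sketched with hand-supplied target tokens (no real model).
# A larger k means more context before committing, but more lag.

def wait_k_schedule(src_tokens, tgt_tokens, k):
    """Return (action, token) steps: READ until k tokens ahead of what has
    been written, then WRITE one target token, and repeat."""
    steps, read, written = [], 0, 0
    while written < len(tgt_tokens):
        # Read until we are k source tokens ahead (or the source is exhausted).
        while read < len(src_tokens) and read < written + k:
            steps.append(("READ", src_tokens[read]))
            read += 1
        steps.append(("WRITE", tgt_tokens[written]))
        written += 1
    return steps

src = "Ich möchte alle Sprachen übersetzen".split()
tgt = "I would like to translate all languages".split()
for action, token in wait_k_schedule(src, tgt, k=3):
    print(action, token)
```

With verb-final German, a translator must either increase the lag or commit to the English verb early; wait-k makes exactly that trade-off tunable through k.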

Finally, as we scale to more and more languages, we also need to develop new ways of evaluating the work produced by MT models. There are already resources to evaluate the quality of translations from, say, English to Russian, but what about from Amharic to Kazakh? As we expand the number of languages our MT models can translate, we’ll also have to develop new approaches to training data and measurement to cover more languages. Besides evaluating the performance of MT systems for accuracy, it’s also important to make sure that translations are being done responsibly. We’ll need to find ways to make sure that MT systems preserve cultural sensitivities and do not create or intensify biases. As we describe in the sections below, Meta AI is tackling each of these three challenges.

Training low-resource and direct speech-to-speech translation systems

To enable translations for low-resource languages and to create the building blocks for future translations of more languages no matter how widely written or spoken, we’re expanding our automatic data set creation techniques. One such technique is LASER, an open source toolkit that now encompasses more than 125 languages written in 28 different scripts.

LASER converts sentences of various languages into a single multilingual representation. Then we use large-scale multilingual similarity search to identify sentences that have a similar representation, i.e., are likely to have the same meaning in different languages. We used LASER to build systems like ccMatrix and ccAligned, which are capable of finding parallel texts on the internet. Because low-resource languages have little data available, we created a new teacher-student training method that enables LASER to focus on specific language subgroups — such as Bantu languages — and learn from much smaller data sets. This allows LASER to operate effectively at scale across languages. Each of these advances will allow us to cover more languages as we work toward scaling, improving, and expanding them to support mining for hundreds of languages, and eventually to every language with a writing system.

We recently extended LASER to also work with speech: By building representations for speech and text in the same multilingual space, we are able to extract translations between speech in one language and text in another — or even direct speech-to-speech translations. With this method, we have already identified nearly 1,400 hours of aligned speech in French, German, Spanish, and English.
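The mining step can be sketched as follows. The three-dimensional “embeddings” here are hand-made stand-ins; a real pipeline would use a trained multilingual encoder such as LASER, margin-based scoring, and nearest-neighbor search at a vastly larger scale.

```python
# Sketch of LASER-style bitext mining: embed sentences from two languages into
# one shared space, then pair each source sentence with its nearest neighbor
# by cosine similarity. Embeddings below are hand-made toy vectors.

import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mine_pairs(src, tgt, threshold=0.9):
    """src/tgt: lists of (sentence, embedding). Return likely translations."""
    pairs = []
    for s_text, s_vec in src:
        best_text, best_vec = max(tgt, key=lambda t: cosine(s_vec, t[1]))
        if cosine(s_vec, best_vec) >= threshold:
            pairs.append((s_text, best_text))
    return pairs

english = [("The weather is nice", [0.9, 0.1, 0.2]),
           ("I like music",        [0.1, 0.9, 0.3])]
french  = [("J'aime la musique",   [0.12, 0.88, 0.31]),
           ("Il fait beau",        [0.88, 0.12, 0.2])]
print(mine_pairs(english, french))
# [('The weather is nice', 'Il fait beau'), ('I like music', "J'aime la musique")]
```

Because the representation space is shared across languages (and, with the speech extension, across modalities), the same nearest-neighbor idea works for speech-to-text and speech-to-speech alignment.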

Text data is important but not sufficient for building translation tools to serve everyone’s needs. Speech translation benchmark data was previously available for only a handful of languages, so we created CoVoST 2, which covers 22 languages and 36 language directions with different resource conditions. Large amounts of audio in many languages are also difficult to find, so we built VoxPopuli, which contains 400,000 hours of speech in 23 languages and enables large-scale semisupervised and self-supervised learning for speech applications such as speech recognition and speech translation. VoxPopuli was subsequently used to build the largest open and universal pretrained model for 128 languages and speech tasks, including speech translation. This model improved the previous state of the art for speech-to-text translation from 21 languages into English by 7.4 BLEU on the CoVoST 2 data set.
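BLEU, the metric cited above, scores machine output against reference translations by n-gram overlap. A simplified sentence-level version looks like this; real evaluations use corpus-level BLEU with standardized tokenization (e.g., via sacreBLEU).

```python
# Simplified sentence-level BLEU: geometric mean of modified n-gram precisions
# (n = 1..4) times a brevity penalty. A teaching sketch, not a drop-in
# replacement for standard corpus-level BLEU implementations.

import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_counts & ref_counts).values())  # clipped counts
        total = max(sum(hyp_counts.values()), 1)
        if overlap == 0:
            return 0.0  # any zero precision zeroes the geometric mean
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: discourage translations shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(sum(log_precisions) / max_n)

print(round(bleu("the cat sat on the mat", "the cat sat on the mat"), 2))  # 1.0
```

A gain of 7.4 BLEU, as reported above, is a large jump by the standards of MT benchmarks, where improvements of one point are often considered meaningful.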

Building models that work across many languages and different modalities

Besides producing more data for training MT systems and making them available to other researchers, we’re working to improve model capacity in order to handle translations between a much wider range of languages. MT systems today often work within a single modality and across a limited set of languages. If the model is too small to represent many languages, its performance might suffer, introducing inaccuracies with both text and speech translations. Innovations in modeling will help us create a future where translations move quickly and seamlessly across modalities, going from speech to text, text to speech, text to text, or speech to speech in a multitude of languages.

To achieve improved performance for our MT models, we invested heavily in creating models that train efficiently despite large capacity, focusing on sparsely gated mixture-of-expert models. By increasing model size and learning an automatic routing function so different tokens use different expert capacity, we were able to balance high-resource and low-resource translation performance.
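The routing idea behind sparsely gated mixture-of-experts models can be sketched as follows. The router scores here are hand-made stand-ins for a learned gating network, and this shows the generic top-1 routing technique rather than the exact architecture described above.

```python
# Sketch of sparsely gated mixture-of-experts (MoE) routing: a router scores
# each token, and only the top-scoring expert processes it, so total model
# capacity grows without a matching growth in per-token compute.

def route_tokens(tokens, router_scores, num_experts):
    """Assign each token to the expert with the highest router score (top-1)."""
    assignments = {}
    for token, scores in zip(tokens, router_scores):
        expert = max(range(num_experts), key=lambda e: scores[e])
        assignments.setdefault(expert, []).append(token)
    return assignments

tokens = ["Ich", "möchte", "alle", "Sprachen", "übersetzen"]
# Hypothetical router scores (one row per token, one column per expert);
# in practice these come from a learned gating network, trained with
# load-balancing objectives so that no expert is overloaded.
scores = [[0.7, 0.1], [0.2, 0.8], [0.6, 0.3], [0.1, 0.9], [0.4, 0.5]]
print(route_tokens(tokens, scores, num_experts=2))
# {0: ['Ich', 'alle'], 1: ['möchte', 'Sprachen', 'übersetzen']}
```

Because each token activates only one expert, adding experts increases capacity for low-resource and high-resource languages alike without proportionally increasing inference cost.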

To scale text-based MT to 101 languages, we created the first multilingual text translation system that is not English-centric. Bilingual systems usually work by first translating from the source language into English and then from English into the target language. To make these systems more efficient and higher quality, we eliminated English as an intermediary so that languages could be translated directly into one another. Eliminating English increased the demands on the model’s capacity, and multilingual models had previously been unable to reach the same level of quality as customized bilingual systems. Recently, though, our multilingual translation system won the Workshop on Machine Translation competition, outperforming even the best bilingual models.

We aim for our technology to be inclusive: It should support both written languages and languages without a standard writing system. With that in mind, we are developing a speech-to-speech translation system that does not rely on generating an intermediate textual representation during inference. This approach has been demonstrated to be faster than a traditional cascaded system that combines separate speech recognition, machine translation, and speech synthesis models. With improved efficiency and a simpler architecture, direct speech-to-speech could unlock near human-quality real-time translation for future devices, like AR glasses. Finally, in order to create spoken translations that preserve the expressiveness and character in everyone’s speech, we are working to include some aspects of the input audio, such as intonation, in the generated audio translations.

Measuring success across hundreds of languages

Developing large-scale models that can translate between many more languages brings up an important question: How can we determine whether we have developed better data or better models? Evaluating a large-scale, multilingual model’s performance is tricky, because it requires on-the-ground expertise in all the languages that the model covers — a requirement that is time-consuming, resource intensive, and often impractical.

We have created FLORES-101, the first multilingual translation evaluation data set covering 101 languages, which enables researchers to rapidly test and improve upon multilingual translation models. Unlike existing data sets, FLORES-101 allows researchers to quantify the performance of systems in any language direction — not just translating into and out of English. For the millions of people worldwide who live in places with dozens of official languages, this enables the creation of translation systems that serve important real-world needs.

Using FLORES-101, we have collaborated with other leaders in the AI research community to advance multilingual low-resource translation. At the 2021 Workshop on Machine Translation, we hosted a shared task to collectively make progress in this domain. Researchers from all over the world participated, many focusing on languages that were personally relevant to them. We look forward to continuing to expand FLORES to cover hundreds of languages.

As we make tangible progress toward universal translation, we’re also focused on doing this work responsibly. We’re working with linguists to help us understand the challenges of producing accurate data set collections, and networks of evaluators to help us make sure that translations are accurate. We’re also conducting case studies with speakers of more than 20 languages to understand what translation features are important to people from different backgrounds and how they will be using the translations our AI models produce. There are many more aspects to developing universal translation responsibly, including mitigating bias and toxicity and preserving cultural sensitivities as information passes from one language to another. Achieving our long-term translation goals will require not just expertise in AI but also the sustained input of numerous experts, researchers, and individuals from around the world.

What’s next?

If No Language Left Behind and Universal Speech Translator, combined with the efforts of the MT research community, succeed in creating translation technologies that include everyone in the world, that success will open up the digital and physical worlds in ways previously not possible. We’re already making advances in enabling translations for low-resource languages, a significant barrier to universal translation for most of the world’s population. By advancing and open-sourcing our work in corpus creation, multilingual modeling, and evaluation, we hope that other researchers can build on this work and bring real-world uses of translation systems closer to reality.

Our ability to communicate is one of the most fundamental aspects of being human. Technologies — from the printing press to video chat — have often transformed our ways of communicating and sharing ideas. The power of these and other technologies will be extended when they can work in the same way for billions of people around the world — giving them similar access to information and letting them communicate with a much wider audience, regardless of the languages they speak or write. As we strive for a more inclusive and connected world, it’ll be even more important to break down existing barriers to information and opportunity by empowering people in their chosen languages.

This work is being undertaken by a multidisciplinary team that includes Yossi Adi, Bapi Akula, Pierre Andrews, Shruti Bhosale, Brian Bui, Onur Celebi, Peng-Jen Chen, James Cross, Ning Dong, Maha Elbayad, Gustavo Gandia Rivera, Cynthia Gao, Hongyu Gong, Vedanuj Goswami, Jiatao Gu, Kenneth Heafield, Kevin Heffernan, Wei-Ning Hsu, Semarley Jarrett, Kyle Johnson, Justine Kao, Elahe Kalbassi, Philipp Koehn, Janice Lam, Ann Lee, Daniel Licht, Xutai Ma, Jean Maillard, Brian O’Horo, Adam Polyak, Sravya Popuri, Christophe Ropers, Dirk Rowe, Safiyyah Saleem, Anna Sun, Chau Tran, Holger Schwenk, Shannon Spruit, Yun Tang, Changhan Wang, Jeff Wang, Guillaume Wenzek, and Al Youngblood.