Multilingual, Not Monolithic: Why Diverse Language Data Matters for Generative AI

June 3, 2024

APCO intern Simone Nikitina authored this piece.

Generative artificial intelligence (AI) is drawing more attention as its prevalence and capabilities grow, and it is increasingly being integrated into office workflows: nearly two-thirds of office workers currently use generative AI at work or plan to. While many companies opt for commercial models like Anthropic’s Claude or OpenAI’s ChatGPT, a growing number are developing proprietary generative AI models to meet their specific business needs. To harness the full potential of these AI tools, companies must prioritize training their models on data from a range of languages.

Many of the most cutting-edge capabilities of generative AI, such as speeding up life-saving medication development, are powered by transformers. Transformers are neural networks that discern patterns in huge amounts of data, from which they learn contextual information that they apply to future tasks. Despite advancements in transformer-based models, a significant performance gap exists between “high-resource” and “low-resource” languages.

Of the world’s more than 7,000 languages, high-resource languages are those with abundant, high-quality text data. Because of the large amount of data available to learn from, generative AI models perform exceptionally well in high-resource languages. English is the highest-resource language because of the copious amount of English-language text on the Internet, where it is readily available to engineers. Low-resource languages, on the other hand, have only small quantities of quality data on the Internet, which leads to poor performance by generative AI in these languages.

Because the Internet contains so much English-language data, sophisticated generative AI models can be developed using English exclusively. However, these models may struggle to perform accurately and reliably when receiving prompts and producing outputs in low-resource languages. By contrast, diverse language training data helps AI models generalize better across languages, leading to more accurate and reliable performance, especially in multilingual environments. Multilingual models have been shown to adeptly synthesize data from high- and low-resource languages for overall performance that is greater than the sum of its parts.
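For teams building their own models, one common way to keep English from crowding out smaller languages is to rebalance how often each language is sampled during training. The sketch below is illustrative only: the function name, the corpus sizes and the smoothing value of 0.3 are assumptions made for this example, not figures from this piece or from any particular model. It raises each language's share of the data to a power below one, which shifts sampling weight toward low-resource languages while still leaving high-resource languages with the largest share.

```python
def smoothed_language_weights(token_counts, alpha=0.3):
    """Return per-language sampling probabilities.

    Each language's raw share of the corpus is raised to the power `alpha`
    and the results are renormalized to sum to 1. With alpha = 1 the raw
    proportions are kept; smaller values of alpha move weight toward
    languages with less data. (alpha = 0.3 is an assumed example value.)
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in the range (0, 1]")
    total = sum(token_counts.values())
    if total <= 0:
        raise ValueError("token_counts must contain at least one positive count")

    scaled = {lang: (count / total) ** alpha for lang, count in token_counts.items()}
    norm = sum(scaled.values())
    return {lang: value / norm for lang, value in scaled.items()}


# Hypothetical corpus sizes (in tokens), chosen only for illustration.
corpus = {"en": 1_000_000, "sw": 10_000, "is": 1_000}
weights = smoothed_language_weights(corpus)

for lang, weight in weights.items():
    raw_share = corpus[lang] / sum(corpus.values())
    print(f"{lang}: raw share {raw_share:.4f} -> sampling weight {weight:.4f}")
```

In this hypothetical mix, the two smaller languages are sampled far more often than their raw share of the data would suggest, while English remains the most frequently sampled. Choosing the smoothing value is a judgment call: if it is set too low, the model sees the same small corpora over and over.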

There are many benefits of using AI tools with strong and diverse multilingual capabilities in corporate settings:

Improved Communication and Collaboration: Generative AI tools trained using a mix of high-resource and low-resource languages can significantly improve communication within multinational corporations by overcoming language barriers. When AI models can accurately translate and interpret a wide range of languages, they facilitate smoother interactions among colleagues who speak different languages. For example, real-time translation and language generation can make communication clearer and more accurate during meetings and over email. This allows colleagues to better exchange constructive feedback, brainstorm, form new professional relationships and resolve disagreements. Collaboration isn’t only about working with colleagues on a specific deliverable, however: collaborative environments can also increase employees’ intrinsic motivation and their sense of engagement. Companies that prioritize collaboration may see long-term benefits such as lower turnover rates and absenteeism and higher profits.
Greater Sense of Inclusivity and Belonging: Linguistically diverse generative AI models can also promote inclusivity and belonging. Intentionally seeking to train tools on a range of languages can communicate to employees that their languages and cultural identities are acknowledged and supported in the workplace, making them feel more empowered to express themselves without fear of exclusion or misunderstanding. Generative AI models trained on diverse language data are also less likely to produce biased and discriminatory content. Studies show that more than 95% of employees say it is important to feel respected at work. Employees who can bring their authentic selves to work are shown to be happier, more motivated and less likely to quit. Companies that promote a culture of respect, inclusivity and belonging in the office may see greater employee satisfaction and higher retention.
Enhanced Learning and Development: In global companies, essential information may be shared between regional teams and in a variety of languages. Linguistically diverse generative AI tools can quickly and faithfully translate and categorize this content in several languages, making what teams have learned accessible to a wider audience within the organization. By enabling employees to access and leverage knowledge resources in their preferred language, AI can foster company-wide learning and development. Companies that cultivate strong learning and development initiatives can see increased profits, higher employee retention and engagement and greater efficiency and productivity.

While sourcing and curating training datasets that include lower-resource languages require effort and intentionality, the long-term benefits of using multilingual AI tools in the workplace are profound. Companies should use linguistically diverse training data when developing their own AI tools to realize generative AI’s full potential to improve internal and external operations. Within organizations, linguistically diverse generative AI tools can improve communication, promote inclusivity and belonging among employees and enable knowledge-sharing. By embracing multilingual AI, businesses can pave the way for a more inclusive, efficient and globally connected workforce.