Meet CulturaX: A New Multilingual Dataset for Training AI Models in 167 Languages

Mike Young

The field of artificial intelligence (AI) is advancing at a breakneck pace. From voice assistants to self-driving cars, AI systems are becoming integrated into more aspects of our lives. However, much of this progress has centered on the English language. AI systems still struggle when dealing with other languages spoken by billions of people worldwide. But a groundbreaking new dataset called CulturaX aims to change all that.

Subscribe or follow me on Twitter for more content like this!

In this in-depth look, we’ll cover how CulturaX could democratize AI and spread its benefits to diverse communities across the planet.

The Limitations of AI Today

Many of today’s most advanced AI systems are powered by neural networks trained on massive datasets. But for most languages beyond English, publicly available training data has been scarce. This has led to two major limitations:

First, AI systems tend to perform much worse in languages other than English. Translation tools like Google Translate used to be notoriously error-prone outside English. Voice assistants struggle with accurate speech recognition and natural responses in many languages. Even fundamental tasks like identifying the language of a text snippet are less reliable.

Second, lack of data stifles progress in improving AI systems for non-English languages. With limited data to train on, fewer researchers bother focusing on these languages. English ends up dominating research. This creates a vicious cycle where other languages get left further and further behind.

Driving Democratization Through Data

To democratize access to quality training data, researchers at the University of Oregon and Adobe Research have constructed a new resource called CulturaX, described in their paper. This dataset provides:

  • Text data for a whopping 167 languages

  • Over 6 trillion tokens in total

  • Extensive cleaning and deduplication

  • Completely free and open availability

With quality data now available for so many more languages, researchers worldwide can develop better AI systems for their own communities. No longer limited by lack of training data, progress in languages beyond English may accelerate rapidly.

The open nature of CulturaX also allows any issues around bias and fairness for specific languages to be identified and addressed. With more equal access to data, the democratization and benefits of AI can be shared across diverse linguistic groups.

Merging Two Massive Multilingual Datasets

To construct CulturaX, the researchers combined two existing large-scale multilingual datasets - mC4 and OSCAR. Together, these provided an initial 13.5 billion documents in over 100 languages.

While a great starting point, these datasets had some limitations. mC4 relied on cld3, a weaker language identification tool, which introduced labeling errors. Neither dataset was comprehensively deduplicated at the document level. The text also included untranslated snippets and other noise.

Cleaning up and merging such a vast corpus of text required ingenious methods. Let's look at how the researchers transformed these raw datasets into a quality resource.

Refining a Truly Massive Amount of Text

To produce the CulturaX dataset, the raw content from mC4 and OSCAR underwent extensive processing. The key steps included:

Identifying Languages Accurately

  • Replaced mC4's language detector with FastText, which the researchers found to be more accurate

  • Removed mC4 docs without confident language ID

  • Ensured all 167 languages are fully supported
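
The confidence filter in the second bullet can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' code: `toy_predict` is a hypothetical stand-in for a trained language-ID model, and the 0.5 threshold is an arbitrary value chosen for the example. In a real pipeline the `predict` argument would wrap an actual classifier such as FastText.

```python
from typing import Callable, Iterable, List, Tuple

# A predictor maps text to (language_code, confidence in [0, 1]).
Predictor = Callable[[str], Tuple[str, float]]


def toy_predict(text: str) -> Tuple[str, float]:
    """Hypothetical stand-in for a real language-ID model.

    Scores a text by the share of its words found in a tiny per-language
    word list. A real pipeline would call a trained classifier instead.
    """
    vocab = {
        "en": {"the", "and", "is", "of"},
        "es": {"el", "la", "y", "es", "de"},
    }
    words = text.lower().split()
    if not words:
        return ("und", 0.0)
    scores = {
        lang: sum(w in known for w in words) / len(words)
        for lang, known in vocab.items()
    }
    best = max(scores, key=scores.get)
    return (best, scores[best])


def keep_confident(
    docs: Iterable[str], predict: Predictor, threshold: float = 0.5
) -> List[Tuple[str, str]]:
    """Return (language, text) pairs whose top prediction clears the threshold."""
    kept = []
    for text in docs:
        lang, confidence = predict(text)
        if confidence >= threshold:
            kept.append((lang, text))
    return kept


docs = ["the cat is of the house", "la casa es de el rey", "zzz qqq xxx"]
result = keep_confident(docs, toy_predict)
print(result)
```

The third document matches no known language with any confidence, so it is dropped, while the other two are kept with their predicted language codes.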

Filtering Out Harmful Content

  • Used a URL blacklist to remove pages from known toxic websites

  • Eliminated documents with abuse, porn, hate speech, etc.
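
As a rough sketch of how a URL blacklist filter can work, the Python below checks each document's host, and every parent domain of that host, against a set of blocked domains. The domain names and record layout are made up for illustration; the actual list and data format used for CulturaX are not described in this post.

```python
from urllib.parse import urlparse

# Hypothetical blacklist entries; a real list would hold many more domains.
BLOCKED_DOMAINS = frozenset({"badsite.example", "spam.example"})


def is_blocked(url: str, blocked: frozenset = BLOCKED_DOMAINS) -> bool:
    """True if the URL's host, or any parent domain of it, is blacklisted."""
    host = (urlparse(url).hostname or "").lower()
    parts = host.split(".")
    # Check "a.b.c", then "b.c", then "c" so subdomains are caught too.
    return any(".".join(parts[i:]) in blocked for i in range(len(parts)))


records = [
    {"url": "https://news.example/story", "text": "a news story"},
    {"url": "http://cdn.badsite.example/page", "text": "unwanted page"},
    {"url": "not a url", "text": "record with a malformed URL"},
]
kept_urls = [r["url"] for r in records if not is_blocked(r["url"])]
print(kept_urls)
```

Matching on parent domains means a single entry covers every subdomain of a blocked site. Records with malformed URLs pass through here; a stricter pipeline might drop them instead.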

Catching Noisy Documents

  • Computed metrics like repetition to find low-quality docs

  • Filtered out statistical outliers for each language
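
To make the idea concrete, here is a minimal Python sketch that scores each document with one simple repetition metric and drops outliers separately for each language. Both the metric (share of repeated words) and the cutoff (an interquartile-range fence, Q3 + 1.5 × IQR) are assumptions picked for illustration; the researchers' exact metrics and thresholds may differ.

```python
from statistics import quantiles


def repetition_ratio(text: str) -> float:
    """Share of words that repeat an earlier word (0.0 means all unique)."""
    words = text.lower().split()
    if not words:
        return 1.0  # treat empty documents as maximally uninformative
    return 1.0 - len(set(words)) / len(words)


def upper_fence(values: list, k: float = 1.5) -> float:
    """Upper outlier fence Q3 + k * (Q3 - Q1); no fence for tiny samples."""
    if len(values) < 2:
        return float("inf")
    q1, _, q3 = quantiles(values, n=4)
    return q3 + k * (q3 - q1)


def drop_repetitive(docs_by_lang: dict) -> dict:
    """Filter each language on its own, so thresholds adapt per language."""
    kept = {}
    for lang, docs in docs_by_lang.items():
        ratios = [repetition_ratio(d) for d in docs]
        fence = upper_fence(ratios)
        kept[lang] = [d for d, r in zip(docs, ratios) if r <= fence]
    return kept


corpus = {
    "en": [
        "the quick brown fox jumps",
        "a stitch in time saves nine",
        "birds fly south every winter",
        "rain fell softly on tin roofs",
        "the cat saw the dog",
        "a dog bit a man",
        "to be or not to be",
        "buy buy buy buy buy buy buy buy",
    ]
}
cleaned = drop_repetitive(corpus)
print(len(cleaned["en"]))
```

Computing the fence per language matters because typical repetition levels vary with a language's morphology and writing system, so a single global threshold would over-filter some languages and under-filter others.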

Cleaning Individual Documents

  • Stripped irrelevant content like code snippets

  • Normalized formatting across documents
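
A document-level cleaner along these lines might look like the Python sketch below. The rules are hypothetical and deliberately crude: a small regular expression flags lines that resemble leftover script code, whitespace is collapsed, and short trailing lines are trimmed on the assumption that they are footers or navigation links.

```python
import re

# Hypothetical heuristic for lines that look like leftover script code.
CODE_PATTERN = re.compile(r"\bfunction\s*\(|\bvar\s+\w+\s*=|[{};]\s*$")


def clean_document(text: str, min_footer_words: int = 3) -> str:
    """Drop code-like lines, normalize whitespace, trim short trailing lines."""
    lines = []
    for raw in text.splitlines():
        line = re.sub(r"\s+", " ", raw).strip()  # collapse runs of whitespace
        if not line:
            continue  # skip blank lines
        if CODE_PATTERN.search(line):
            continue  # skip lines that look like code
        lines.append(line)
    # Short lines at the very end are often footers or navigation links.
    while lines and len(lines[-1].split()) < min_footer_words:
        lines.pop()
    return "\n".join(lines)


raw = (
    "CulturaX   covers 167 languages.\n"
    "var x = document.getElementById('a');\n"
    "\n"
    "It merges mC4 and OSCAR.\n"
    "Home About\n"
)
cleaned_text = clean_document(raw)
print(cleaned_text)
```

Heuristics like these will occasionally remove legitimate text, which is why thresholds such as `min_footer_words` are best tuned on samples from each language.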

Deduplicating Similar Entries

  • Used MinHash for near-deduplication within each language

  • Compared document URLs to remove duplicates
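
Both deduplication steps can be sketched with the Python standard library alone. The code below is an illustrative toy, not the pipeline the researchers ran: it builds a MinHash signature from word 3-grams, estimates Jaccard similarity as the share of matching signature slots, and compares every new document against all kept ones. The shingle size, 64 hash functions, 0.8 threshold and record layout are all assumptions, and a production system would use locality-sensitive hashing rather than pairwise comparison.

```python
import hashlib


def shingles(text: str, n: int = 3) -> set:
    """Set of overlapping word n-grams; short texts yield a single shingle."""
    words = text.lower().split()
    if len(words) <= n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def item_hash(item: str, seed: int) -> int:
    """64-bit hash of an item under the hash function selected by seed."""
    salt = seed.to_bytes(4, "big")
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt)
    return int.from_bytes(digest.digest(), "big")


def minhash(items: set, num_perm: int = 64) -> tuple:
    """Signature: the minimum hash over all items, per hash function."""
    return tuple(
        min(item_hash(item, seed) for item in items) for seed in range(num_perm)
    )


def estimated_jaccard(sig_a: tuple, sig_b: tuple) -> float:
    """Share of signature slots that agree approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(records: list, threshold: float = 0.8) -> list:
    """Keep the first record per URL, then drop near-duplicate texts."""
    kept, seen_urls, signatures = [], set(), []
    for record in records:
        if record["url"] in seen_urls:
            continue  # URL-based deduplication
        seen_urls.add(record["url"])
        sig = minhash(shingles(record["text"]))
        if any(estimated_jaccard(sig, s) >= threshold for s in signatures):
            continue  # near-duplicate of a document already kept
        signatures.append(sig)
        kept.append(record)
    return kept


records = [
    {"url": "https://a.example/1",
     "text": "CulturaX merges mC4 and OSCAR into one cleaned corpus"},
    {"url": "https://a.example/1",
     "text": "a later crawl of the same page with different text"},
    {"url": "https://b.example/copy",
     "text": "culturax  merges MC4 and oscar into one cleaned corpus"},
    {"url": "https://c.example/2",
     "text": "an unrelated article about speech recognition for tonal languages"},
]
unique_urls = [r["url"] for r in deduplicate(records)]
print(unique_urls)
```

The second record is dropped because its URL was already seen, and the third because its text differs from the first only in casing and spacing, which produces an identical signature.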

Through each stage, the goal was to prune away low-quality, repetitive and potentially harmful content. This leaves a refined dataset optimized for training AI systems.

CulturaX by the Numbers

The final CulturaX corpus contains:

  • 167 languages - From Afrikaans to Zulu, encompassing global diversity

  • 6.3 trillion tokens - What remains after cleaning and deduplicating the merged mC4 and OSCAR text

  • About 7 billion documents - Roughly half of the 13.5 billion in the raw data, still providing a huge variety of text sources

  • 27 terabytes of text - Requiring new methods to process efficiently

This makes CulturaX one of the largest and most diverse cleaned multilingual datasets openly available today. The scale finally begins approaching that of private datasets used by tech giants.

Multilingual AI's Potential

With CulturaX now available for researchers worldwide, what could be achieved through its use?

Some possibilities include:

  • Training universal translation models - to accurately translate text, audio and video between virtually any language pair with a single model

  • Building culturally-aware chatbots - that can converse naturally in hundreds of languages and with local knowledge

  • Developing truly global voice assistants - able to understand and respond fluently across languages and accents

  • Enabling nuanced multilingual search - so that web content of all types can be precisely understood and retrieved

  • Advancing language-specific assets - like improved speech recognition for tonal languages such as Mandarin Chinese

Of course, models will need to be carefully developed and tested to avoid issues like bias amplification. But the possibilities are endless!

Next Steps to Spread Benefits Broadly

Releasing CulturaX removes a major obstacle to equal access to the benefits of AI globally. But many challenges remain to develop and apply this technology thoughtfully.

Next steps could include:

  • Involving under-resourced communities - in identifying potential harms from AI systems built using their languages and data.

  • Filling remaining data gaps - for languages with limited text data, creative solutions can help generate training data.

  • Exploring multimodal learning - using images, videos and speech data alongside text may improve versatility.

  • Testing rigorously for biases - across languages, cultures, demographics and tasks before deployment.

  • Investing in two-way open research - so insights from global researchers continuously improve shared models and data.

Through collaborative efforts guided by inclusiveness, AI's remarkable potential can be shared broadly. The CulturaX dataset opens the door to that future for all languages.

Conclusion

With its unprecedented linguistic breadth and meticulous construction, CulturaX represents a historic advance for multilingual AI. This post has only scratched the surface of the countless possibilities now opened up by democratizing access to quality training data across the world's languages and cultures. It's time to imagine - then build - an AI future enriched by humanity in all its diversity.
