The Evolving Landscape of LLM Training Data


Read more of my articles on the Alibaba Cloud blog

Introduction

Datasets are the lifeblood of artificial intelligence, especially in training large language models (LLMs) that power everything from chatbots to content generators. These datasets form the foundation upon which AI models learn and develop their capabilities. However, as the demand for more advanced AI systems grows, so does the need for high-quality, diverse, and extensive datasets. This article delves into the history of dataset usage, the types of data required at various stages of LLM training, and the challenges faced in sourcing and utilizing these datasets.

A Brief History of Dataset Usage in AI

In the early days of AI research, datasets were meticulously curated from various sources, such as encyclopedias, parliamentary transcripts, phone call recordings, and weather forecasts. Each dataset was tailored to address specific tasks, ensuring relevance and quality. However, with the advent of transformers in 2017—a neural network architecture pivotal to modern language models—the focus shifted toward sheer volume, marking a significant change in the AI research approach. Researchers realized that the performance of LLMs improved significantly with larger models and datasets, leading to indiscriminate data scraping from the internet.

By 2018, the internet had become the dominant source for all data types, including audio, images, and video. This trend has continued, resulting in a significant gap between internet-sourced data and manually curated datasets. The demand for scale also led to the widespread use of synthetic data—data generated by algorithms rather than collected from real-world interactions.

Types of Data Needed for LLM Training

Pre-training

Pre-training is the initial phase, where the model is exposed to vast amounts of text data to learn general language patterns and structures. During this stage, the model requires:

  • Diverse Text Sources: Data should come from a wide range of topics and languages to ensure broad understanding, a crucial factor in AI model development.

  • High Volume: Billions of tokens are needed to train the model effectively.

  • Quality Control: While quantity is crucial, maintaining a baseline level of quality is equally important, as it helps prevent the model from learning incorrect or biased information. Sources often include web pages, books, articles, and other publicly available texts.

However, ethical considerations arise when using copyrighted materials without permission.
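
To make the quality-control point above concrete, here is a minimal sketch of the kind of heuristic filtering often applied to raw web text before pre-training. The thresholds, rules, and toy corpus are illustrative assumptions, not a production pipeline.

```python
import hashlib
import re

def keep_document(text: str, seen_hashes: set, min_words: int = 50,
                  max_symbol_ratio: float = 0.1) -> bool:
    """Illustrative heuristic filter for raw web text; all thresholds are arbitrary."""
    words = text.split()
    if len(words) < min_words:                           # drop very short pages
        return False
    symbols = sum(1 for c in text if not (c.isalnum() or c.isspace()))
    if symbols / max(len(text), 1) > max_symbol_ratio:   # drop markup/boilerplate-heavy pages
        return False
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    digest = hashlib.md5(normalized.encode()).hexdigest()
    if digest in seen_hashes:                            # exact-duplicate removal
        return False
    seen_hashes.add(digest)
    return True

seen = set()
corpus = ["A sufficiently long example document about weather patterns. " * 20, "too short"]
filtered = [doc for doc in corpus if keep_document(doc, seen)]
```

Real pipelines layer many more signals (language identification, toxicity classifiers, near-duplicate detection), but the shape is the same: keep only documents that clear a set of inexpensive checks.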

Continuous Pre-training

Continuous pre-training involves updating the model with new data to keep it current and improve its knowledge base. This phase requires:

  • Recent Data: To incorporate the latest information and trends.

  • Domain-Specific Data: Depending on the industry's needs, specialized datasets (e.g., medical journals for healthcare applications) may be necessary.
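
In practice, continuous pre-training often boils down to mixing a fresh general corpus with a specialized one at chosen proportions. The sketch below uses the Hugging Face datasets library; the file names and the 70/30 mix are placeholder assumptions.

```python
from datasets import load_dataset, interleave_datasets

# Placeholder corpora: a recent web crawl and a domain-specific collection.
recent = load_dataset("text", data_files={"train": "recent_crawl.txt"})["train"]
domain = load_dataset("text", data_files={"train": "medical_journals.txt"})["train"]

# Weight recent general-purpose text more heavily than the specialized corpus.
mixed = interleave_datasets([recent, domain], probabilities=[0.7, 0.3], seed=42)
# `mixed` can then be tokenized and fed to the same training loop used for pre-training.
```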

Fine-tuning

Fine-tuning adapts the pre-trained model to specific tasks or domains. It typically uses smaller, more targeted datasets that are carefully labeled and curated. For example:

  • Task-Specific Data: Sentiment analysis might require annotated reviews, while question-answering systems need pairs of questions and answers.

  • Domain Adaptation: Legal documents, scientific papers, or technical manuals for specialized applications.

Below are examples of datasets and methods used in this process.

Example of a Fine-Tuning Dataset

  • Task-Specific Data: For sentiment analysis, the Stanford Sentiment Treebank (SST-2) is a widely used dataset containing annotated movie reviews labeled as positive or negative. Similarly, question-answering systems often use SQuAD (Stanford Question Answering Dataset), which pairs questions with context-based answers.

  • Domain Adaptation: Legal applications employ the CaseLaw Corpus, a collection of annotated judicial rulings, while medical models could use PubMed Abstracts for scientific literature analysis.
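
Both datasets mentioned above are available on the Hugging Face Hub, so a fine-tuning run can typically start with something as simple as the sketch below (shown with the datasets library; splits and fields follow the standard published versions).

```python
from datasets import load_dataset

sst2 = load_dataset("glue", "sst2")   # sentences labeled 0 (negative) or 1 (positive)
squad = load_dataset("squad")         # question / context / answers triples

print(sst2["train"][0])               # {'sentence': ..., 'label': ..., 'idx': ...}
print(squad["train"][0]["question"])
```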

Key Fine-Tuning Methods

  1. Parameter-Efficient Fine-Tuning (PEFT): PEFT techniques, such as LoRA (Low-Rank Adaptation) or Adapter Layers, update only a small subset of the model's parameters, reducing computational costs while maintaining performance. For instance, LoRA freezes the original model weights and adds trainable low-rank matrices to specific layers.

  2. Instruction Fine-Tuning: This method involves training the model on task-specific instructions paired with input-output examples. For example, a model fine-tuned on instructions like "Classify the sentiment of this review: [text]" learns to follow explicit commands, improving usability in real-world applications.

  3. Transfer Learning: Pre-trained models are adapted to new domains by fine-tuning on domain-specific corpora. For example, a general-purpose LLM can be fine-tuned on financial reports from EDGAR SEC Filings to specialize in stock market analysis.

By combining curated datasets with advanced methods like PEFT, researchers and developers can optimize LLMs for niche applications while addressing resource constraints and scalability challenges.
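
For example, LoRA as described in method 1 can be applied with the peft library roughly as follows. The model name and target module names are placeholder assumptions that vary by architecture; this is a sketch, not a full training script.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

model_name = "Qwen/Qwen2.5-0.5B"  # placeholder; any causal LM checkpoint works
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Freeze the original weights and inject trainable low-rank matrices into attention layers.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                   # rank of the low-rank update matrices
    lora_alpha=16,                         # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections; names differ per model
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # typically well under 1% of all parameters
```

The wrapped model can then be passed to a standard training loop or the transformers Trainer, with only the LoRA parameters receiving gradient updates.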

Reinforcement Learning

Reinforcement learning from human feedback (RLHF) involves training the model to align better with human preferences. This stage needs:

  • Human Feedback: Ratings or corrections provided by humans to guide the model's behavior.

  • Interactive Data: Real-time interactions where the model receives immediate feedback.

Below are examples of datasets and methods central to RLHF:

Example of an RLHF Dataset

Preference Datasets: RLHF begins with collecting human-labeled preference data, where humans rank or rate model outputs. For instance, OpenAI's early RLHF experiments used datasets where annotators compared multiple model-generated responses to the same prompt, labeling which ones were more helpful, truthful, or aligned with ethical guidelines. These datasets often include nuanced examples, such as distinguishing between factual and biased answers in sensitive topics like politics or healthcare.
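
A single record in such a preference dataset is often little more than a prompt plus a ranked pair of responses. The example below is hypothetical and only illustrates the shape of the data.

```python
# Hypothetical preference record: one prompt, two model responses, and the human choice.
preference_example = {
    "prompt": "Is it safe to take two different painkillers at the same time?",
    "chosen": "It depends on the drugs involved; some combinations are safe and others are not. "
              "Check the active ingredients and consult a pharmacist or doctor.",
    "rejected": "Yes, painkillers are generally safe to combine.",
}
```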

Key RLHF Methods

  1. Reward Model Training: A reward model is trained on human preference data to predict which outputs humans prefer. This model acts as a proxy for human judgment during reinforcement learning. For example, Alibaba Cloud's Qwen series uses reward models to penalize harmful or unsafe outputs while rewarding clarity and coherence.

  2. Proximal Policy Optimization (PPO): PPO is a reinforcement learning algorithm that fine-tunes the LLM's policy (output generation) to maximize rewards from the trained reward model. This method ensures stable updates, preventing drastic deviations from the desired behavior. For example, PPO is used to iteratively refine chatbot responses in systems like Qwen.

  3. Interactive Feedback Loops: Real-time human feedback is integrated into training pipelines. For example, AI assistants like Google's Gemini may deploy beta versions to collect user ratings (e.g., thumbs-up/down) on responses, which are fed back into the RLHF pipeline to improve future outputs.

  4. Safety-Critical Filtering: Specialized datasets focus on high-stakes scenarios, such as medical advice or legal queries, where errors could have serious consequences. These datasets often involve domain experts annotating outputs for accuracy and safety, ensuring the model adheres to strict guidelines.
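
To make the reward-modeling step (method 1 above) concrete, here is a minimal PyTorch sketch of the pairwise loss commonly used to train reward models on preference data. The scores are toy numbers standing in for a reward model's outputs, not any specific production system.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(reward_chosen: torch.Tensor,
                         reward_rejected: torch.Tensor) -> torch.Tensor:
    """Push the reward of the human-preferred response above the rejected one."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy scores a reward model might assign to a batch of (chosen, rejected) pairs.
r_chosen = torch.tensor([1.2, 0.4, 2.0])
r_rejected = torch.tensor([0.3, 0.6, -0.5])
print(pairwise_reward_loss(r_chosen, r_rejected).item())
```

The trained reward model then supplies the reward signal that PPO (method 2) maximizes when updating the LLM's policy.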

Challenges in RLHF Datasets

  • Scalability of Human Feedback: Collecting high-quality preference data is labor-intensive and expensive. Scaling this process requires balancing automation (e.g., synthetic feedback) with human oversight to avoid bias.

  • Cultural and Ethical Bias: Preference datasets often reflect the values of annotators from specific regions (e.g., Western-centric perspectives), risking biased outputs in global applications.

By combining preference datasets, reward modeling, and iterative human feedback, RLHF ensures LLMs evolve from generic text generators to systems prioritizing safety, relevance, and human alignment.

Challenges in Sourcing Data

Exhaustion of Available Data

One of the most pressing issues today is the exhaustion of readily available textual data. Major tech players have reportedly indexed almost all accessible text data from the open and dark web, including pirated books, movie subtitles, personal messages, and social media posts. With fewer new sources to tap into, the industry faces a bottleneck in further advancements.

Figure: cumulative amount of data (logarithmic scale for text, hours for speech/video) from each source category, across all modalities. Source categories in the legend are ordered in descending order of quantity.

Cultural Asymmetry

Most datasets originate from Europe and North America, reflecting a Western-centric worldview. Less than 4% of analyzed datasets come from Africa, highlighting a significant cultural imbalance. This bias can lead to skewed perceptions and reinforce stereotypes, particularly in multimodal models that generate images and videos.

Centralization of Power

Large corporations dominate the acquisition and control of influential datasets. Platforms like YouTube provide over 70% of video data used in AI training, concentrating immense power in the hands of a few entities. This centralization hinders innovation and creates barriers for smaller players who lack access to these resources.

Dataset Collections

The following table shows the sources of text collections. Properties include the number of datasets, tasks, languages, and text domains. The Source column indicates the content of the collection: human-generated text from the web, language model output, or both. The final column indicates the collection's licensing status: blue for commercial use, red for non-commercial and academic research, and yellow for unclear licensing. The OAI column indicates collections that include generations from OpenAI models. The datasets are sorted chronologically to emphasize trends over time. Source here

Table: collection of the text data.

Table: collection of the video data.

Table: collection of the audio data.

Solutions and Future Directions

Leveraging Untapped Data Sources

Despite the apparent depletion of easily accessible data, numerous untapped sources remain:

  • Archival Data: Libraries, periodicals, and historical records offer rich, unexplored content.

  • Enterprise Data: Companies sit on vast troves of unused data, such as equipment telemetry, meteorological reports, system logs, and marketing statistics.

Advanced LLMs can help structure and utilize these latent datasets for future training.

Federated Learning

Federated learning allows models to be trained on sensitive data without transferring it outside secure environments. This method is ideal for industries dealing with confidential information, such as healthcare, finance, and telecommunications. By keeping data localized, federated learning ensures privacy while enabling collaborative model improvement.
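
The core idea can be sketched in a few lines: each client trains on its own private data, and only the resulting weights are averaged centrally (federated averaging). The toy model, random data, and single local step below are illustrative assumptions.

```python
import copy
import torch
import torch.nn as nn

def local_update(model: nn.Module, data, targets, lr: float = 0.01) -> dict:
    """One local training step on a client's private data; only weights leave the client."""
    local = copy.deepcopy(model)
    opt = torch.optim.SGD(local.parameters(), lr=lr)
    loss = nn.functional.mse_loss(local(data), targets)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return local.state_dict()

def federated_average(states: list) -> dict:
    """Average client weights to form the new global model (FedAvg)."""
    avg = copy.deepcopy(states[0])
    for key in avg:
        avg[key] = torch.stack([s[key] for s in states]).mean(dim=0)
    return avg

global_model = nn.Linear(4, 1)
clients = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(3)]  # three private datasets
client_states = [local_update(global_model, x, y) for x, y in clients]
global_model.load_state_dict(federated_average(client_states))
```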

Synthetic Data and Augmentation

Synthetic data generation and data augmentation present promising avenues for expanding training datasets:

  • Synthetic Data: Generated by algorithms, synthetic data can fill gaps in real-world data but must be handled cautiously to avoid compounding errors.

  • Data Augmentation: Modifying existing data through techniques like flipping images, altering colors, or adjusting contrast maintains realism while increasing diversity.
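
For images, the augmentations mentioned above map directly onto standard library transforms. The sketch below uses torchvision; the file path and parameter values are placeholder assumptions.

```python
from torchvision import transforms
from PIL import Image

# Label-preserving augmentations: flips, color/contrast changes, small rotations.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.RandomRotation(degrees=10),
])

image = Image.open("example.jpg")                          # placeholder path
augmented_variants = [augment(image) for _ in range(5)]    # five new samples from one original
```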

Conclusion

As the field of AI continues to evolve, the role of datasets remains paramount. While the exhaustion of readily available data poses a challenge, it's crucial that we, as AI researchers and enthusiasts, are aware of and take responsibility for addressing issues of cultural asymmetry and centralization. Innovative solutions like leveraging untapped sources, federated learning, and synthetic data generation offer pathways forward. By combining these strategies, we can ensure equitable and diverse AI development, paving the way for more sophisticated and inclusive artificial intelligence systems.
