How to Create Synthetic Data for Large Language Models
Large Language Models (LLMs) typically require extensive datasets for both pre-training and fine-tuning. Although human-curated data is ideal, many online sources added anti-scraping protections once they recognized the value of their data. Consequently, relying solely on human annotators is often insufficient to produce the vast quantities of data needed to adequately train these models, given their sheer size and complexity.
This often necessitates collaborating with companies that possess the data or leveraging state-of-the-art models such as ChatGPT, Gemini, Claude, or OpenHermes to generate our own data. In this article, our focus will be on exploring the latter approach.
My Experience
Why should you read my article? What experience do I have? I bring extensive experience in generating datasets tailored for both pre-training and fine-tuning, spanning multiple languages including English and several Indic languages.
Some examples of my work include:
Bhandara: A dataset specifically designed for India-specific pre-training in Hindi.
PhilosophiseMe: A dataset focused on philosophy, curated for article writing, which can also be used for pre-training.
Gooftagoo: A fine-tuning dataset aimed at generating conversational data in Hindi and Hinglish across various day-to-day topics.
Parlar: A fine-tuning dataset for English, focusing on generating conversational data related to STEM topics.
Navigating Stages of Training
The data generation process varies depending on the stage of training. Pre-training typically involves feeding the model with unstructured text data, such as paragraphs of large corpora. Fine-tuning, on the other hand, adapts the model to specific tasks or structures that are required for generating desired outputs.
It's commonly understood that pre-training plays a crucial role in transferring knowledge into the model, while fine-tuning tailors the model to respond accurately to questions or tasks within the bounds of its existing knowledge.
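To make the distinction concrete, here is a minimal sketch of what a record might look like at each stage. The field names are illustrative conventions, not a fixed standard; match them to whatever training framework you use.

```python
# Pre-training: unstructured text, one document (or chunk) per record.
pretraining_record = {
    "text": "Vedic philosophy centres on the Upanishadic idea that ..."
}

# Fine-tuning: structured prompt/response turns in a chat layout.
finetuning_record = {
    "messages": [
        {"role": "user", "content": "Explain non-dualism in simple terms."},
        {"role": "assistant", "content": "Non-dualism holds that ..."},
    ]
}
```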
For dedicated enthusiasts of Large Language Models (LLMs), there exists another phase beyond fine-tuning that employs Reinforcement Learning (RL) techniques, often referred to as RLHF (Reinforcement Learning from Human Feedback). These techniques encompass various algorithms such as DPO (Direct Preference Optimization), PPO (Proximal Policy Optimization), KTO (Kahneman-Tversky Optimization), and others. A separate section below covers the methods and strategies for generating and accumulating data specific to these RL techniques.
Main Types of Data Generation
Generating Based on a Large Text Corpus
One effective method for generating a substantial amount of data involves leveraging large corpora, such as extensive Wikipedia articles focused on specific topics or specialized books.
To perform this type of data generation, one approach is the direct ingestion of such content into the model, allowing it to generate responses based on the input data—a straightforward but potentially less refined method. Alternatively, a more efficient strategy involves utilizing a knowledge-based system like Retrieval-Augmented Generation (RAG). This approach indexes and organizes the data, enabling not only better initial questions but also more coherent follow-up queries.
The key advantage of using a knowledge-based system like RAG is its ability to ground generated data firmly in factual accuracy derived from the text. While the former method may be preferable for creating narratives or articles featuring fantastical elements like stories, the latter excels in contexts requiring precise STEM or factual data generation.
When preparing data for pre-training, you have the option to either feed all available data into the model or apply more effective filtering to ensure a higher quality dataset. In contrast, for fine-tuning, leveraging specific portions of the data becomes crucial for generating targeted outputs such as conversation data or question-answer sets.
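As a rough sketch of the direct-ingestion approach, the snippet below chunks a corpus and asks a model for one grounded question-answer pair per chunk. The `call_llm` helper is a hypothetical wrapper around whichever chat model you use; a RAG pipeline would instead retrieve the most relevant chunks per question before prompting.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around whichever chat model you use
    (ChatGPT, Claude, a local model, ...)."""
    raise NotImplementedError

def chunk(corpus: str, size: int = 2000) -> list[str]:
    # Naive fixed-size chunking; swap in sentence- or section-aware
    # splitting for cleaner boundaries.
    return [corpus[i:i + size] for i in range(0, len(corpus), size)]

def generate_qa_from_corpus(corpus: str) -> list[dict]:
    pairs = []
    for passage in chunk(corpus):
        prompt = (
            "Using ONLY the passage below, write one question a curious "
            "reader might ask, followed by an answer grounded in the passage.\n\n"
            f"Passage:\n{passage}\n\n"
            "Format:\nQ: <question>\nA: <answer>"
        )
        pairs.append({"passage": passage, "qa": call_llm(prompt)})
    return pairs
```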
Generating Based on Key Phrases
If access to a large corpus of data is unavailable, you may need to rely on alternative methods. One approach involves manually crafting phrases or generating key phrases based on topic headings. Starting with broad topics, you can progressively refine them into more specific subtopics. Once these subtopics are sufficiently detailed, you can generate pre-training or fine-tuning data focused on both the main topic and its associated tree of subtopics.
However, a significant challenge is the potential for substantial hallucination if the model lacks specialization in the given topic and has no robust knowledge base to reference. This severely restricts the range of topics the models can be effectively queried about and, consequently, which models are suitable for the task.
For example, if you aim to generate data on the topic of "Philosophy," you can create subtopics such as "Eastern Philosophy" and "Western Philosophy." Within "Eastern Philosophy," you can further delineate into categories like "Vedic Philosophy," "Buddhist Philosophy," and "Taoist Philosophy." Similarly, "Western Philosophy" can be subdivided into "Greek Philosophy," "Modern Philosophy," and "Post-Modern Philosophy."
Continuing this hierarchical approach, you can refine these subtopics even further. For instance, under "Vedic Philosophy," you might explore topics like "Tat Twam Asi and non-dualism," while under "Modern Philosophy," you could delve into Descartes' "I think, therefore I am." These fine-grained topics can serve as the basis for generating large text corpora for pre-training data or for crafting conversation and question-answer sets for fine-tuning data.
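A minimal sketch of this hierarchical expansion, reusing the hypothetical `call_llm` helper from the earlier snippet: the model is asked for subtopics recursively until the leaves are specific enough to generate data from.

```python
def expand_topic(topic: str, depth: int = 0, max_depth: int = 2) -> list[str]:
    """Recursively ask the model for subtopics until they are specific
    enough to generate data from; max_depth controls the granularity."""
    if depth == max_depth:
        return [topic]
    prompt = (
        f"List 3 to 5 specific subtopics of '{topic}', "
        "one per line, with no numbering or commentary."
    )
    subtopics = [s.strip() for s in call_llm(prompt).splitlines() if s.strip()]
    leaves = []
    for sub in subtopics:
        leaves.extend(expand_topic(sub, depth + 1, max_depth))
    return leaves

# e.g. "Philosophy" -> "Eastern Philosophy" -> "Vedic Philosophy" -> ...
leaf_topics = expand_topic("Philosophy")
```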
Generating Data for RL
It's standard practice to impose constraints on models to keep them on-topic and prevent them from generating inappropriate responses during conversations. Typically, this involves providing the model with pairs of responses: one expected/correct response and one incorrect response (with varying weights in some algorithms to emphasize correctness).
To align their model's behavior with that of superior models, many practitioners take ChatGPT's response as the expected response and their own model's response as the incorrect one. It's important to note that this conditioning method doesn't impart 'knowledge' to the model but rather guides it away from certain probabilistic paths.
Anecdotal evidence suggests that using the same model's responses for both expected and incorrect responses, possibly with slight modifications, can establish a known benchmark for the model to navigate. This approach helps the model discern which paths to avoid and which to pursue.
Note - We refer to probabilistic paths because Large Language Models (LLMs) operate as next-token generators. This means they predict the most probable token based on the sequence of tokens they have encountered so far. Essentially, LLMs analyze the context provided by preceding tokens to determine the likelihood of subsequent tokens in a sequence.
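To make the pair format concrete, here is a sketch of assembling one preference record in the (prompt, chosen, rejected) layout that DPO-style trainers commonly expect; `call_strong_llm` and `call_own_llm` are hypothetical wrappers around the reference model and your own model.

```python
def build_preference_record(prompt: str) -> dict:
    """One record in the (prompt, chosen, rejected) layout that
    DPO-style preference trainers commonly expect."""
    chosen = call_strong_llm(prompt)   # e.g. a frontier model's answer
    # Per the anecdote above, `rejected` could instead be a lightly
    # perturbed output of the same strong model.
    rejected = call_own_llm(prompt)    # your current model's answer
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```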
Some Key Considerations
Control the Controllables
When aiming to cover a wide array of possibilities, it's crucial to ensure that the data remains tightly focused on the desired topics. For instance, when generating data using phrases, specificity is key. The more precise and narrow the phrases are, the denser the information within each data point, enhancing its relevance to a specific topic.
The goal is to not only keep individual data points specific but also to maintain the entire dataset centered around a topic. This approach optimizes the model's learning process by providing concentrated and targeted information.
Alternatively, employing retrieval-augmented generation (RAG) with specific questions aligned closely with existing knowledge proves to be highly effective when utilizing large corpora. This method ensures that generated responses are grounded in accurate and relevant information, thereby enhancing the model's understanding and performance.
For example,
When generating fine-tuning conversation data on a topic like Lacanian theory's "The Real, the Imaginary, and the Symbolic," there are several factors besides the main topic that you can control to ensure the quality and specificity of the data:
Participant Dynamics: Specify the relationship dynamics between participants, such as "Between Teacher and Student," "Between a professor and his doctoral student," or "Between a speaker and a confused audience member." This defines the roles and expectations within the conversation.
Emotional Context: Introduce emotional states and moods into the conversation, such as "between a confused and arrogant person," "between a skeptical and enthusiastic debater," or "between a supportive mentor and an uncertain mentee." These emotional nuances shape the tone and direction of the dialogue.
Conversation Structure: Define the structure or format of the conversation, such as Q&A sessions, debates, lectures, or informal discussions. Each structure influences how information is exchanged and processed during the conversation.
Specific Scenarios: Craft specific scenarios or contexts within which the conversation takes place, such as "during a seminar on Lacanian theory," "at a conference panel discussion," or "in a one-on-one mentoring session." These settings provide situational context that grounds the conversation in practical or academic contexts.
Knowledge Level: Specify the level of understanding or expertise of the participants regarding the topic. For instance, conversations can be tailored for beginners exploring basic concepts or for advanced learners discussing nuanced theories.
By controlling these factors in addition to the core topic itself, you ensure that the generated conversation data is rich, diverse, and aligned with specific learning objectives or research goals. This approach also provides a structured framework that guides the model's responses, enhancing its coherence and relevance in generating conversational data for fine-tuning purposes.
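One way to operationalize this, sketched below, is a prompt template that takes each controlled factor as an explicit argument (all names and phrasings here are illustrative):

```python
def conversation_prompt(topic, dynamic, mood, structure, scenario, level):
    """Assemble a generation prompt from explicitly controlled factors."""
    return (
        f"Write {structure} between {dynamic}, {mood}, {scenario}. "
        f"The discussion should cover '{topic}' at a {level} level "
        "and stay strictly on topic."
    )

prompt = conversation_prompt(
    topic="Lacan's Real, Imaginary, and Symbolic",
    dynamic="a professor and a doctoral student",
    mood="with a supportive but demanding tone",
    structure="a Q&A session",
    scenario="in a one-on-one mentoring session",
    level="advanced",
)
```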
Allow the Others to Run Free
Balancing control over key topics and conversation direction while allowing for spontaneity and creativity is crucial in generating diverse and engaging data for models like ChatGPT. While grounding conversations in factual accuracy and essential topics is essential, introducing variability and unpredictability in other aspects can enhance the model's ability to generate novel and interesting responses.
Here’s why this balance is important:
Ensuring Accuracy: Grounding conversations in factual knowledge or core topics maintains the integrity and reliability of the generated data. This ensures that the model learns from accurate information and maintains coherence in its responses.
Encouraging Creativity: Allowing some degree of randomness or variability in non-essential aspects of the conversation fosters creativity. This enables the model to explore different perspectives, generate diverse responses, and adapt to varying contexts or scenarios.
Avoiding Overfitting: Strictly controlling every aspect of the conversation can lead to overfitting, where the model becomes too specialized and rigid in its responses. Allowing for randomness helps prevent this by exposing the model to a broader range of inputs and contexts.
Enhancing Engagement: Introducing unpredictability makes the generated conversations more engaging and realistic. It mimics the natural flow of human dialogue, which is often spontaneous and varied in tone, mood, and direction.
In practice, this approach involves selectively grounding conversations in structured topics or factual information while leaving room for the model to explore and innovate within predefined boundaries. This balance not only enriches the dataset used for training but also improves the model’s ability to handle diverse conversational scenarios effectively.
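Building on the `conversation_prompt` sketch above, one way to implement this balance is to pin the topic while sampling the non-essential attributes from small pools (the pools here are illustrative):

```python
import random

# The topic stays fixed (controlled); everything else is sampled from
# small pools to keep the dataset varied.
MOODS = ["skeptical", "enthusiastic", "confused", "playful"]
STRUCTURES = ["a Q&A session", "a debate", "an informal chat"]
SCENARIOS = ["at a conference panel", "over coffee", "during office hours"]

def varied_prompt(topic: str) -> str:
    return conversation_prompt(
        topic=topic,
        dynamic="a student and a professor",
        mood=f"in a {random.choice(MOODS)} mood",
        structure=random.choice(STRUCTURES),
        scenario=random.choice(SCENARIOS),
        level=random.choice(["beginner", "intermediate", "advanced"]),
    )
```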
For example,
When generating fine-tuning data for a conversation between "a student and a professor" discussing "Zizek's perspective on Desire and Lack," it's important to structure prompts in a way that guides the conversation towards mutual understanding and meaningful exchange. However, the manner in which you prompt can influence the flow of conversation, especially in scenarios where one party may dominate with questions rather than contributing substantive dialogue.
In essence, the effectiveness of the generated fine-tuning data hinges on aligning prompts with the desired functionality of the model, whether it's for robust conversational AI or specific task-oriented dialogue systems. By emphasizing balanced interaction and meaningful contributions from both parties, the model can better learn to navigate and enrich conversations.
Hyperparameters
When generating data, two key parameters to consider are temperature and top_p. It's widely observed that lower values of temperature and top_p are often used for generating factual responses, while slightly higher values are preferred for generating stories and opinion pieces.
Numerous existing articles and posts delve into how these parameters influence data generation, so there is no need to repeat those details here.
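As a rule-of-thumb starting point (the exact values vary by model and task and should be tuned):

```python
# Illustrative sampling presets -- tune per model and task.
FACTUAL = {"temperature": 0.2, "top_p": 0.90}    # grounded Q&A, STEM data
CREATIVE = {"temperature": 0.9, "top_p": 0.95}   # stories, opinion pieces

# Assuming the hypothetical call_llm helper accepts sampling parameters:
# answer = call_llm(question_prompt, **FACTUAL)
# story = call_llm(story_prompt, **CREATIVE)
```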
What about Data Quality?
All the steps discussed so far have centered around prompt engineering and parameter adjustment, which is crucial. However, the paramount aspect of data generation is ensuring and maintaining data quality, as the quality of the data used to train a model is pivotal to its effectiveness.
Let's break down some of the common pitfalls:
Problem: Very shallow and out-of-depth data
Consequences:
Responses lack depth and expertise, especially on complex topics.
Users may receive superficial or unhelpful answers in production, leading to frustration.
Fix:
Make sure that the provided prompts are detailed and require the model to delve deeper into the topic.
Keep prompts within the scope of the model's training so that it leverages existing knowledge rather than hallucinating.
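To illustrate the difference, compare a vague prompt with one that forces depth (both examples are illustrative):

```python
# A vague prompt tends to produce shallow data...
shallow = "Tell me about Buddhist philosophy."

# ...while a constrained, specific prompt forces the model to go deeper.
deep = (
    "Explain the Buddhist concept of dependent origination: define each "
    "of the twelve links, give one everyday example per link, and "
    "contrast the whole chain with the Vedic notion of a permanent self."
)
```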
Problem: Similar Data
Consequences:
Redundancy in training data, leading to inefficiencies and longer training times.
Limits the model's exposure to diverse data points and perspectives.
Fix:
Use embedding and clustering techniques to identify similar data points.
New data points can then be generated using some form of augmentation technique.
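A minimal deduplication sketch using sentence-transformers embeddings and pairwise cosine similarity; the model name and threshold are illustrative choices, and the O(n²) comparison only suits modest dataset sizes:

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

def deduplicate(texts: list[str], threshold: float = 0.9) -> list[str]:
    """Keep only data points whose embedding is not near-identical to an
    already-kept one. Model and threshold are illustrative choices."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    sims = cosine_similarity(model.encode(texts))
    kept = []
    for i in range(len(texts)):
        if all(sims[i][j] < threshold for j in kept):
            kept.append(i)
    return [texts[i] for i in kept]
```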
Problem: Wrong Data
Consequences:
Incorporation of factually incorrect information into the model.
Decreased model reliability and trustworthiness in real-world applications.
Fix:
Keep prompts within the scope of the model's training so that it leverages existing knowledge rather than hallucinating.
Manually identify a few data points containing incorrect information and trace their origins to eliminate similar inaccuracies using embedding and clustering methods.
If the above method doesn't work, you might be forced to manually check the data.
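Using the same embedding setup, here is a sketch of tracing bad data from a few manually identified seeds: flag everything whose embedding sits close to a known-bad example for review (the threshold is illustrative):

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

def flag_similar_to_bad(texts: list[str], bad_seeds: list[str],
                        threshold: float = 0.8) -> list[int]:
    """Return indices of data points whose embedding sits close to any
    manually identified bad example, so they can be reviewed or dropped."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    sims = cosine_similarity(model.encode(texts), model.encode(bad_seeds))
    return [i for i, row in enumerate(sims) if row.max() >= threshold]
```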
Many repositories, such as Datatrove, offer tools for data cleaning and standardization. The choice of tool often depends on the specific requirements and type of data you need to work with.
Next Actions
The steps outlined above provide a concise introduction to data generation techniques, drawing inspiration from diverse sources and discussions within the field. Valuable insights have been gleaned from resources such as Airoboros, EvolInstruct, the "All You Need" series (including Phi models), as well as engaging discussions on platforms like Twitter and Discord.
I hope you find these insights helpful and can apply some of these learnings to enhance your own projects.