LLM Pre-Training in GenAI


In this series, we’ll explore everything about pre-training in the Generative AI (GenAI) pipeline — including how models are trained based on specific objectives or goals, and how to design an effective pre-training pipeline.
But before diving into pre-training, let’s first understand:
What does "training" mean from (ML/DL) perspective?
And how is it different when viewed from a GenAI perspective? Is it the same or different?
In the Machine Learning / Deep Learning context, “training” refers to the process of learning a function f(x) that maps input data x to an output y by minimising a loss function over a dataset.
This involves the following steps (a minimal code sketch follows the list):
Model Initialisation: Initialise learnable parameters (e.g., weights and biases) randomly or with heuristics.
Forward Pass: Compute the output of the model given the input.
Loss Computation: Compare the predicted output with the ground truth using a loss function (e.g., cross-entropy, MSE).
Backward Pass (Backpropagation): Compute gradients of the loss with respect to parameters using the chain rule.
Parameter Optimisation: Update parameters using an optimiser (e.g., GD, SGD, Adam) to reduce the loss.
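Putting these steps together, here is a minimal training-loop sketch in PyTorch; the tiny regression model, synthetic data, and hyperparameters are illustrative choices, not part of any particular GenAI pipeline:

```python
import torch
import torch.nn as nn

# Synthetic data for illustration: learn y = 3x + 2 with a little noise
x = torch.randn(256, 1)
y = 3 * x + 2 + 0.1 * torch.randn(256, 1)

# 1. Model initialisation: parameters start out random
model = nn.Linear(1, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(100):
    # 2. Forward pass: compute predictions from the inputs
    y_pred = model(x)
    # 3. Loss computation: compare predictions with the ground truth
    loss = loss_fn(y_pred, y)
    # 4. Backward pass: gradients of the loss w.r.t. the parameters
    optimizer.zero_grad()
    loss.backward()
    # 5. Parameter optimisation: update the weights to reduce the loss
    optimizer.step()
```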
So the goal is to minimise the training error while generalising well to unseen data.
In Generative AI, the foundational process is still ML/DL training, but it differs in terms of scale, objective, architecture, and data handling:
1. Training Objective
Traditional ML: Learn a discriminative or predictive function (classification, regression).
GenAI: Learn a generative function that models the joint or conditional probability of sequences, as in language models. The objective is typically next-token prediction (formalised right after this list).
2. Data Scale & Type
Traditional ML: Often task-specific, labelled datasets (e.g., MNIST, CIFAR-10).
GenAI: Uses large-scale, unlabeled, and often noisy datasets (e.g., Common Crawl, The Pile, LAION for vision).
3. Learning Technique
Traditional ML: Primarily supervised or semi-supervised.
GenAI: Mostly self-supervised — predicting parts of the input from other parts (masked tokens, next-token, image patches, etc.).
4. Model Architecture
Traditional ML: Various architectures — CNNs, RNNs, gradient boosting, etc.
GenAI: Almost exclusively Transformers (autoregressive or encoder-decoder variants).
5. Output Modality
Traditional ML: Predict discrete labels or continuous values.
GenAI: Generates new data — text, images, audio, code — often autoregressively.
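To make point 1 concrete: for a token sequence x₁, …, x_T, an autoregressive language model factorises the sequence probability and is trained to minimise the negative log-likelihood of each next token. This is the standard formulation, not tied to any single model:

```latex
% Autoregressive factorisation of the sequence probability
p_\theta(x_1, \dots, x_T) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t})

% Next-token prediction objective: minimise the negative log-likelihood
\mathcal{L}(\theta) = - \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})
```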
To summarise:
| Aspect | Traditional ML/DL | Generative AI |
| --- | --- | --- |
| Objective | Prediction / classification / regression | Content generation via sequence modeling |
| Data | Task-specific, often labelled | Web-scale, unlabeled, multi-modal |
| Learning | Supervised learning, transfer learning | Self-supervised |
| Models | CNNs, RNNs, classical NLP models | Transformers (GPT, BERT, etc.) |
| Output | Numeric or categorical | New data: text, images, audio, code |
| Optimization | Gradient descent on well-defined prediction targets (e.g., the correct label) | Gradient descent on generative objectives (e.g., predict the next token or image patch) |
What is Pre-Training?
Pre-training is the initial training phase where a model learns general-purpose representations from large-scale data before being adapted (fine-tuned) to a specific task.
Pretraining typically involves training on unlabeled data to learn useful features, which can then be leveraged for more specific, downstream tasks that require supervision (e.g., classification, generation, etc.).
Pretraining can use different levels of supervision, depending on the availability and type of labels: self-supervised, weakly-supervised, supervised, and semi-supervised.
Self-supervised pre-training is the most popular of these. Why?
No need for labeled data: labeling data is time-consuming, expensive, and often impractical, especially for large-scale tasks.
Scalable and Generalizable: Self-supervised methods allow models to learn general representations that can be fine-tuned for a wide variety of tasks (e.g., classification, question answering, summarization) with minimal labeled data.
Versatility Across Domains: It works across different modalities (e.g., text, images, audio), making it highly flexible for multimodal models.
The other methods still have their place, in different situations:
Weakly-Supervised:
Used for multimodal tasks like text-to-image generation (e.g., DALL·E), where weak labels such as descriptive text, captions, or hashtags assist in aligning textual and visual data.
Supervised:
Applied in tasks requiring high-quality labeled data for specific purposes, like sentiment analysis or natural language inference (NLI), with models like BERT fine-tuned on task-specific datasets.
Semi-Supervised:
Useful when only a small amount of labeled data is available, often in domains like healthcare, where models such as TinyBERT are trained on a mix of labeled and unlabeled data.
Some common self-supervised techniques used in pretraining LLMs:
1. Causal Language Modeling (CLM)
In Causal Language Modeling, the model learns to predict the next token in a sequence based on previous tokens. This approach assumes a left-to-right structure where the model is trained to predict each word/token from its context.
Examples:
GPT (Generative Pretrained Transformer)
LLaMA (Large Language Model Meta AI)
Use Case: Language models that generate coherent and contextually relevant text (e.g., writing, code generation, chatbots).
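To see what "predict the next token" means in practice, here is a minimal sketch of the CLM loss; the random logits stand in for the output of a decoder-only model, and all shapes are illustrative:

```python
import torch
import torch.nn.functional as F

# Toy shapes; in practice logits come from a decoder-only Transformer
batch, seq_len, vocab = 2, 8, 1000
input_ids = torch.randint(0, vocab, (batch, seq_len))
logits = torch.randn(batch, seq_len, vocab)  # stand-in for model(input_ids)

# Shift by one: the prediction at position t is scored against token t+1
shift_logits = logits[:, :-1, :]   # predictions for positions 0..T-2
shift_labels = input_ids[:, 1:]    # targets are the following tokens

# Standard next-token cross-entropy loss
loss = F.cross_entropy(
    shift_logits.reshape(-1, vocab),
    shift_labels.reshape(-1),
)
print(loss.item())
```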
2. Masked Language Modeling (MLM)
In Masked Language Modeling, some tokens in the input text are randomly masked, and the model learns to predict the missing tokens. This approach helps the model learn to understand the context of the text by inferring the masked words.
The input text is masked (e.g., replacing certain words with [MASK]), and the model must predict the original masked words based on the surrounding context.
Example: BERT (Bidirectional Encoder Representations from Transformers)
Use Case: Tasks that involve understanding text, such as sentence classification, named entity recognition (NER), and question answering.
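A rough sketch of the masking step, assuming a mask-token ID and a 15% masking rate (real BERT pretraining also replaces some selected tokens with random tokens or leaves them unchanged, which is omitted here):

```python
import torch

vocab, mask_token_id, ignore_index = 1000, 103, -100  # illustrative IDs
input_ids = torch.randint(5, vocab, (2, 16))           # toy batch of token IDs

# Pick roughly 15% of positions to mask
mask = torch.rand(input_ids.shape) < 0.15

# Labels: the loss is computed only at masked positions, the rest are ignored
labels = input_ids.clone()
labels[~mask] = ignore_index

# Corrupt the inputs: replace the selected positions with the [MASK] token
masked_inputs = input_ids.clone()
masked_inputs[mask] = mask_token_id

# masked_inputs feed the encoder; the loss (e.g. F.cross_entropy with
# ignore_index=-100) then scores only the positions that were masked.
```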
3. Span Corruption / Infilling
In Span Corruption, spans (or chunks) of text are masked or removed, and the model is trained to reconstruct the missing spans. This is similar to MLM but with larger chunks of text instead of individual tokens.
The model learns to predict entire spans of missing text based on the surrounding words or context, making it learn longer-range dependencies.
Examples:
T5 (Text-to-Text Transfer Transformer)
UL2 (Unifying Language Learning)
Use Case: Tasks requiring both text comprehension and generation, such as summarization, translation, and text infilling.
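A toy illustration of the input/target format, assuming word-level tokens and T5-style sentinel markers; actual span corruption operates on subword tokens with configurable span lengths and corruption rates:

```python
# Original sentence (word-level tokens for illustration)
original = "the quick brown fox jumps over the lazy dog".split()

# Suppose the spans ["quick", "brown"] and ["lazy"] are corrupted.
# Each removed span is replaced by a sentinel token in the input,
# and the target lists the spans prefixed by their sentinels.
corrupted_input = ["the", "<extra_id_0>", "fox", "jumps", "over",
                   "the", "<extra_id_1>", "dog"]
target_output = ["<extra_id_0>", "quick", "brown",
                 "<extra_id_1>", "lazy", "<extra_id_2>"]

print(" ".join(original))
print(" ".join(corrupted_input))  # the <extra_id_0> fox jumps over the <extra_id_1> dog
print(" ".join(target_output))    # <extra_id_0> quick brown <extra_id_1> lazy <extra_id_2>
```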
4. Permutation Language Modeling (PLM)
Instead of predicting tokens in a fixed left-to-right order (like CLM), the model predicts tokens in a randomly permuted factorisation order, learning bidirectional context without relying on the typical MLM masking objective.
Example: XLNet
Use Case: Language understanding with bidirectional context while avoiding the limitations of masking in MLM.
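A tiny sketch of the permuted-factorisation idea; the real thing (XLNet) implements this with two-stream attention rather than literally reordering the text, so treat this only as intuition:

```python
import torch

seq_len = 6
# Sample a random factorisation order over positions 0..5
order = torch.randperm(seq_len).tolist()  # e.g. [3, 0, 5, 1, 4, 2]

# Under this order the log-likelihood factorises as
#   log p(x) = sum_t log p(x[order[t]] | x[order[:t]])
# so each token is predicted from whichever tokens precede it in the
# sampled order, which gives the model bidirectional context on average.
for t, pos in enumerate(order):
    context_positions = order[:t]
    print(f"predict position {pos} given positions {context_positions}")
```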
5. Denoising Autoencoding
Corrupt input sequences (e.g., through deletion, shuffling, masking), then train the model to reconstruct the original input. Similar to span corruption but more generalized.
Example: BART (Bidirectional and Auto-Regressive Transformers)
Use Case: Seq2Seq tasks like summarization, translation, and question answering.
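A rough sketch of the kinds of noising functions involved, using word-level tokens for readability; BART itself works on subwords and also uses text infilling and sentence permutation:

```python
import random

random.seed(0)
tokens = "the quick brown fox jumps over the lazy dog".split()

def delete_tokens(toks, p=0.15):
    # Randomly drop tokens; the model must recover them from context
    return [t for t in toks if random.random() > p]

def shuffle_tokens(toks):
    # Permute the token order; the model must restore the original order
    shuffled = toks[:]
    random.shuffle(shuffled)
    return shuffled

def mask_tokens(toks, p=0.15, mask="[MASK]"):
    # Replace some tokens with a mask symbol
    return [mask if random.random() < p else t for t in toks]

corrupted = mask_tokens(shuffle_tokens(delete_tokens(tokens)))
# The seq2seq model reads `corrupted` and is trained to reconstruct `tokens`.
print(corrupted)
```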
6. Prefix Language Modeling (Prefix LM) / Infilling
A hybrid of CLM and MLM where the model has access to a prefix (the unmasked left part of the sequence) and predicts masked spans that follow. In other words, it involves filling in masked portions of text after a visible prefix.
Example: UL2
Use Case: Flexible modeling for both generation and understanding tasks.
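One way to see the difference from plain CLM is the attention mask: prefix positions attend to each other bidirectionally, while the remaining positions attend causally. A minimal sketch with an illustrative sequence length and prefix length:

```python
import torch

seq_len, prefix_len = 8, 3

# Causal (lower-triangular) mask: query position i may see key positions <= i
mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Prefix LM: in addition, every position may attend to the whole prefix,
# which makes the prefix fully bidirectional to itself.
mask[:, :prefix_len] = True

print(mask.int())
# Rows = query positions, columns = key positions, 1 = attention allowed.
```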
7. Multi-Task Language Modeling
Combines multiple pretraining objectives (e.g., MLM + span corruption + sequence classification) to create a more robust model.
Example: T5, UL2
Use Case: Versatile models capable of handling a wide variety of downstream tasks.
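One common recipe is to sample a pretraining objective per batch (or weight their losses). A hypothetical sketch with stub loss functions, just to show the mixing logic; the weights and objective names are illustrative, not taken from T5 or UL2:

```python
import random

random.seed(0)

# Stand-in loss functions; a real pipeline would run the model under
# each objective and return its actual loss.
def clm_loss(batch):
    return 1.0

def span_corruption_loss(batch):
    return 2.0

def prefix_lm_loss(batch):
    return 3.0

objectives = [clm_loss, span_corruption_loss, prefix_lm_loss]
mixing_weights = [0.5, 0.25, 0.25]  # how often each objective is sampled

def training_step(batch):
    # Pick one pretraining objective for this batch according to the weights
    objective = random.choices(objectives, weights=mixing_weights, k=1)[0]
    return objective(batch)

print(training_step(batch=["dummy", "batch"]))
```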
Summary:
Discussed training from a traditional ML and a GenAI perspective
Different types of Pre-Training methods
Types of Self-supervised Techniques
In the next article, we will look at pre-training types in more detail, based on their objectives.
Until then, keep learning!
Make sure to like and subscribe to the newsletter if you find this content useful!