How to Achieve Full Fine-Tuning: A Step-by-Step Guide

Vikas Srinivasa
14 min read

Introduction

In this blog, we’ll explore full fine-tuning, a supervised fine-tuning technique that updates all model parameters to adapt a pretrained transformer to a specific task. We’ll discuss how full fine-tuning works, its advantages, and how to implement it using PyTorch and Hugging Face.

Before diving into full fine-tuning, let’s first understand supervised fine-tuning and its different approaches.

What is Supervised Fine-Tuning?

Supervised fine-tuning involves training a model on labeled data, where input-output pairs are explicitly defined. It is the most commonly used fine-tuning method in NLP and can be categorized into different approaches:

  • Full Fine-Tuning → Updates all model parameters.

  • Layer-Specific Fine-Tuning → Fine-tunes only certain layers of the model.

  • Parameter-Efficient Fine-Tuning (PEFT) → Modifies only a subset of parameters while keeping most of the model frozen, using methods like:

    • Adapters (Houlsby Adapters, Compacter, BitFit)

    • LoRA (Low-Rank Adaptation)

    • QLoRA (Quantized LoRA)

    • IA³ (Infused Adapter by Inhibiting and Amplifying Inner Activations)

    • Prefix-Tuning, Prompt-Tuning, and P-Tuning
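To make the difference between full and layer-specific fine-tuning concrete, here is a minimal PyTorch sketch (the model and the choice of which layers to unfreeze are purely illustrative) that toggles requires_grad and compares trainable-parameter counts; PEFT methods go further by training only small added modules such as LoRA matrices.

from transformers import BertForQuestionAnswering

model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")

def count_trainable(m):
    return sum(p.numel() for p in m.parameters() if p.requires_grad)

# Full fine-tuning: every parameter is trainable (the default after loading).
print(f"full fine-tuning: {count_trainable(model):,} trainable parameters")

# Layer-specific fine-tuning: freeze everything, then unfreeze only the last
# two encoder layers and the QA head (an arbitrary illustrative choice).
for param in model.parameters():
    param.requires_grad = False
for layer in model.bert.encoder.layer[-2:]:
    for param in layer.parameters():
        param.requires_grad = True
for param in model.qa_outputs.parameters():
    param.requires_grad = True
print(f"layer-specific: {count_trainable(model):,} trainable parameters")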

Now that we have an overview of supervised fine-tuning, let’s take a closer look at full fine-tuning.

What is Full Fine-Tuning?

Full fine-tuning is a transfer learning technique where a pretrained model is further trained on a new dataset by updating all its parameters. Unlike selective fine-tuning techniques that modify only a small portion of the model (such as adapter layers or the final classifier head), full fine-tuning allows the model to completely adapt to a new task.

In simple terms, full fine-tuning repurposes a general-purpose model (such as BERT, GPT, T5, or LLaMA) for a specific task by adjusting every layer during training.

Why Not Just Train a New Model from Scratch?

A common question is: If we are updating all parameters, why not train a model from scratch instead?

The reason is simple—training from scratch is extremely resource-intensive:

  • Requires terabytes of data for effective generalization.

  • Demands massive computational resources (thousands of GPUs and weeks of training).

  • Needs extensive hyperparameter tuning for convergence.

In contrast, full fine-tuning leverages a pretrained model that already has general language understanding, requiring significantly fewer resources while still adapting effectively to a new domain.

However, full fine-tuning isn’t always the most efficient approach. Parameter-Efficient Fine-Tuning (PEFT) techniques, such as LoRA and adapters, allow adaptation with fewer trainable parameters, reducing computational costs.

On the other hand, full fine-tuning tends to enhance coherence and fluency: since all layers are updated, the model aligns more closely with the domain-specific data, much like a PhD graduate who develops deep expertise in a subject.

Overview of the Full Fine-Tuning Process

The key steps involved in full fine-tuning are:

  1. Load a pretrained model (e.g., bert-base-uncased).

  2. Unfreeze all layers to allow training across the entire network.

  3. Fine-tune the model using task-specific data (e.g., classification, summarization, or text generation).

  4. Optimize the entire model based on the new dataset, enabling it to learn domain-specific patterns.


Implementing Full Fine-Tuning

For this tutorial, we’ll be fine-tuning BERT for the task of Extractive Question Answering (QA). This means we’re using a version of BERT where the final layer has been adapted specifically to predict the start and end positions of an answer within a given context. Since BERT is an encoder-only model, it is non-generative by design.

So, what does that mean?

Generative vs. Non-Generative Models

A generative model is capable of producing new content. It doesn't just memorize—it learns the structure and semantics of the data it’s trained on. When asked a question, it responds by generating a new answer in natural language, much like a smart student who has internalized the material and can explain it in their own words.

In contrast, a non-generative model learns a direct mapping from input to output. It doesn’t generate new content—it simply selects the most likely output from existing input. Think of it like a student who has memorized answers and reproduces them word-for-word during the exam.

BERT is a non-generative model, and in this case, we’re fine-tuning it for Extractive QA—a task where the model identifies the exact answer span within a provided context. It doesn't generate a response from scratch; instead, it points to the answer in the given text.

What Does Extractive QA Look Like?

The model is given two inputs:

  • A question

  • A context (a passage that contains the answer)

Its job is to extract the answer directly from the context.

Here’s an example of one entry from our dataset:

{
  "question": "What is the capital of the UK?",
  "context": "London is the capital of the United Kingdom.",
  "answer_text": "London",
  "answer_start_index": 0,
  "answer_end_index": 6
}

In this case, the model learns that the correct answer ("London") starts at character index 0 and ends at character index 6 (exclusive) in the context.

Fine-tuning then teaches the model to predict the token position where the answer starts in the provided context and the token position where it ends.
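To see the indexing convention in action, here is a quick check using the example above (the span behaves like a Python slice: start inclusive, end exclusive):

context = "London is the capital of the United Kingdom."

# answer_start_index is inclusive, answer_end_index is exclusive, like a Python slice
print(context[0:6])  # -> London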

Install The Dependencies

%%capture
%pip install transformers datasets peft accelerate torch
  • What it does: Installs all the necessary Python packages using pip.

  • Key libraries:

    • transformers: From Hugging Face, provides pretrained transformer models.

    • datasets: Hugging Face's library for loading and processing datasets.

    • peft: For parameter-efficient fine-tuning techniques.

    • accelerate: Makes model training device-agnostic (CPU, GPU, TPU).

    • torch: PyTorch, the core deep learning framework used for training.

  • %%capture: Suppresses cell output (to keep the notebook clean).

Import the required modules

import json
import torch
from transformers import BertTokenizerFast, BertForQuestionAnswering, Trainer, TrainingArguments, pipeline

Imports essential classes and functions.

  • BertTokenizerFast: Efficient tokenizer from Hugging Face for BERT.

  • BertForQuestionAnswering: Pretrained BERT model with a QA head.

  • Trainer, TrainingArguments: Hugging Face’s high-level training API.

  • pipeline: Provides a simple interface for inference.

  • torch: For tensor computations and model operations.

Preprocess the Data

In the preprocess_data function, we iterate through each entry in our dataset and prepare it for training with BERT’s question answering model.

# Initialize the tokenizer
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

def preprocess_data(data):
    tokenized_data = []
    for item in data:
        # Tokenize the question and context together with offset mapping
        inputs = tokenizer(
            item['question'],
            item['context'],
            max_length=512,
            truncation=True,
            padding='max_length',
            return_offsets_mapping=True,  # This is crucial
            return_tensors='pt'
        )

        offset_mapping = inputs.pop('offset_mapping')  # Extract offset mapping
        sequence_ids = inputs.sequence_ids(0)  # None = special/padding token, 0 = question, 1 = context
        input_ids = inputs['input_ids'].squeeze()  # Remove batch dimension

        # Convert character indices (relative to the context) to token indices for the answer
        start_char = item['answer_start_index']
        end_char = item['answer_end_index']

        start_token_idx, end_token_idx = None, None

        for i, (start, end) in enumerate(offset_mapping.squeeze().tolist()):
            if sequence_ids[i] != 1:
                continue  # Only context tokens can contain the answer
            if start_char >= start and start_char < end:
                start_token_idx = i
            if end_char > start and end_char <= end:
                end_token_idx = i
                break  # Stop once the end position is found

        # Ensure valid token indices
        if start_token_idx is None or end_token_idx is None:
            continue  # Skip this example if the answer could not be located

        tokenized_data.append({
            'input_ids': input_ids,
            'attention_mask': inputs['attention_mask'].squeeze(),
            'token_type_ids': inputs['token_type_ids'].squeeze(),
            'start_positions': torch.tensor([start_token_idx]),
            'end_positions': torch.tensor([end_token_idx])
        })

    return tokenized_data

Let’s now walk through each step of the preprocessing function in depth.

To understand it better, let’s consider the following example:

{
  "question": "What is the capital of the UK?",
  "context": "London is the capital of the United Kingdom.",
  "answer_text": "London",
  "answer_start_index": 0,
  "answer_end_index": 6
}

Now we’ll walk through the preprocessing using this input.

Step 1: Tokenization

inputs = tokenizer(
    item['question'],
    item['context'],
    max_length=512,
    truncation=True,
    padding='max_length',
    return_offsets_mapping=True,
    return_tensors='pt'
)
  • This takes both the question and context and combines them into:

      [CLS] What is the capital of the UK? [SEP] London is the capital of the United Kingdom. [SEP]
    
  • return_offsets_mapping=True adds a mapping between each token and its character span in the raw text it came from (offsets for question tokens are relative to the question string, and offsets for context tokens are relative to the context string).

Example of offset mapping for the context part:

# For the context tokens only (after the first [SEP] token),
# the offset_mapping might look like:
[
  (0, 6),   # "London"
  (7, 9),   # "is"
  (10, 13), # "the"
  (14, 21), # "capital"
  (22, 24), # "of"
  (25, 28), # "the"
  (29, 35), # "United"
  (36, 43), # "Kingdom"
  (43, 44)  # "."
]

So "London" spans characters 0–6, and that's how we know which token it maps to.

Essentially, for every question-context-answer triplet, we tokenize the question and context together using BertTokenizerFast from Hugging Face. This is important because BERT expects the input in a specific format: [CLS] question [SEP] context [SEP].

We set max_length to 512 (the maximum sequence length BERT can handle), and apply truncation or padding accordingly. The crucial part here is return_offsets_mapping=True, which tells the tokenizer to return a mapping between each token and its original character span in the raw text. This allows us to later locate the answer’s position at the token level. Since we want the tokenizer to return PyTorch tensors, we set return_tensors="pt".
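If you want to see this alignment for yourself, here is a small sketch that reuses the tokenizer defined above and prints each token with its offsets (question-token offsets are relative to the question string, context-token offsets to the context string, and special tokens get the span (0, 0)):

encoding = tokenizer(
    "What is the capital of the UK?",
    "London is the capital of the United Kingdom.",
    return_offsets_mapping=True,
)
tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"])
for token, (start, end) in zip(tokens, encoding["offset_mapping"]):
    print(f"{token:>10}  ({start}, {end})")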


Step 2: Extract Offsets and Remove Batch Dimension

offset_mapping = inputs.pop('offset_mapping')
sequence_ids = inputs.sequence_ids(0)  # None for special/padding tokens, 0 for question tokens, 1 for context tokens
input_ids = inputs['input_ids'].squeeze()
  • inputs is the BatchEncoding returned by the tokenizer. It behaves like a dictionary of PyTorch tensors with the keys input_ids, attention_mask, token_type_ids, and offset_mapping, and (because we use a fast tokenizer) it also exposes sequence_ids(), which tells us whether each token comes from the question, the context, or is a special or padding token.

  • Let’s look at what each of these fields denotes in detail (the short sketch after this list inspects them for our example):

    • input_ids: This tensor contains the tokenized representation of the input text. Each token (like words or subwords) is mapped to its corresponding integer ID from the BERT vocabulary. The typical format is:
      [CLS] question tokens [SEP] context tokens [SEP].

      squeeze() is used to go from shape (1, 512) to (512,), since you’re only processing one example at a time.

    • attention_mask: This indicates which tokens should be attended to by the model. It has the same shape as input_ids and contains 1s for real tokens and 0s for padding tokens. This helps the model ignore padded positions during attention computation.

    • token_type_ids (also known as segment ids): This helps BERT distinguish between the two segments in the input: the question and the context. Tokens from the question are usually marked with 0, and tokens from the context with 1.

    • We've already discussed what offset_mapping represents in the previous step. Here, we remove it from inputs because it's only required during preprocessing — it's not used during model training.
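Here is a small sketch that re-tokenizes the same example so you can inspect these fields; the comments describe what you should expect for this input:

enc = tokenizer(
    "What is the capital of the UK?",
    "London is the capital of the United Kingdom.",
    max_length=512,
    truncation=True,
    padding='max_length',
    return_tensors='pt',
)
print(enc['input_ids'].shape)                # torch.Size([1, 512]) -> (512,) after squeeze()
print(int(enc['attention_mask'].sum()))      # number of real (non-padding) tokens
print(enc['token_type_ids'].squeeze()[:25])  # 0s for [CLS] + question + first [SEP], 1s for the context and final [SEP], then 0s for padding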

Step 3: Map Character Indices to Token Indices

start_char = item['answer_start_index']  # 0
end_char = item['answer_end_index']      # 6

Now you find which tokens these character spans correspond to:

for i, (start, end) in enumerate(offset_mapping.squeeze().tolist()):
    if sequence_ids[i] != 1:
        continue  # skip [CLS], the question tokens, [SEP], and padding
    if start_char >= start and start_char < end:
        start_token_idx = i
    if end_char > start and end_char <= end:
        end_token_idx = i
        break

This loop finds:

  • The context token whose character span contains the start of the answer (character 0).

  • The context token whose character span contains the end of the answer (character 6).

The sequence_ids check matters because question tokens also have offsets starting at character 0 (offsets are relative to each sequence), so without it the loop could latch onto a question token instead of the answer inside the context.

In this case, both start_char=0 and end_char=6 fall inside the first context token, "london", which sits right after [CLS], the question tokens, and the first [SEP]. So:

start_token_idx == end_token_idx  # both point at the "london" token's position in the full input sequence
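As a sanity check, a fast tokenizer can also do the character-to-token lookup for you (assuming a transformers version where BatchEncoding.char_to_token accepts a sequence_index argument); decoding the resulting span should give back the answer:

enc = tokenizer(
    "What is the capital of the UK?",
    "London is the capital of the United Kingdom.",
    return_tensors='pt',
)
start_token_idx = enc.char_to_token(0, 0, sequence_index=1)    # first character of the answer, looked up in the context
end_token_idx = enc.char_to_token(0, 6 - 1, sequence_index=1)  # last character of the answer
answer_ids = enc['input_ids'].squeeze()[start_token_idx:end_token_idx + 1]
print(start_token_idx, end_token_idx, tokenizer.decode(answer_ids))  # both indices are equal; decodes to "london"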

Step 4: Append to Final Tokenized Dataset

You prepare the input dict BERT expects:

{
  'input_ids': ...,             # Tokens
  'attention_mask': ...,        # 1s and 0s (padding info)
  'token_type_ids': ...,        # Segment IDs (0 for question, 1 for context)
  'start_positions': tensor([start_token_idx]),
  'end_positions': tensor([end_token_idx])
}

Why This Matters

  • BERT QA models are trained to predict the start and end token positions of the answer within the input sequence.

  • Since the dataset gives character indices, you need to map them to token positions — and offset_mapping is what makes that possible.

Once token indices are identified, we prepare a dictionary containing input_ids, attention_mask, token_type_ids, and the computed start_positions and end_positions. This format is exactly what BertForQuestionAnswering expects during training. Overall, this preprocessing step is essential for aligning the dataset’s character-based labels with BERT’s token-based structure.


Load the Dataset

def load_dataset(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        data = json.load(file)
    return data

data = load_dataset('qa_london_data.json')
tokenized_datasets = preprocess_data(data)
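For reference, qa_london_data.json (available in the GitHub repo linked at the end) is assumed here to be a plain JSON list of records in the format shown earlier. A minimal stand-in file could be created like this:

import json

# A minimal stand-in for qa_london_data.json; the real dataset has many more entries.
sample = [
    {
        "question": "What is the capital of the UK?",
        "context": "London is the capital of the United Kingdom.",
        "answer_text": "London",
        "answer_start_index": 0,
        "answer_end_index": 6
    }
]
with open("qa_london_data.json", "w", encoding="utf-8") as f:
    json.dump(sample, f, indent=2)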



Train the Model

1. Load a Pretrained Model with a Question Answering (QA) Head

model = BertForQuestionAnswering.from_pretrained('bert-base-uncased')
  • BertForQuestionAnswering is a class from the transformers library.

  • It takes a pretrained BERT model (bert-base-uncased) and adds a QA head on top of it.

    • The QA head is a small feedforward layer that predicts:

      • The start token of the answer span.

      • The end token of the answer span.

  • This head outputs a pair of logits for each token in the input:

    • One for the probability of being the start of the answer.

    • One for the probability of being the end.

This model is trained to minimize a loss function that compares predicted token positions to ground-truth start/end positions from your dataset.
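To see what the QA head actually returns, here is an illustrative forward pass that reuses the tokenizer and model defined in this tutorial (the char_to_token call is just a convenient way to get the ground-truth token index for our toy example):

enc = tokenizer(
    "What is the capital of the UK?",
    "London is the capital of the United Kingdom.",
    return_tensors='pt',
)
answer_token_idx = enc.char_to_token(0, 0, sequence_index=1)  # token position of "london"
outputs = model(
    **enc,
    start_positions=torch.tensor([answer_token_idx]),
    end_positions=torch.tensor([answer_token_idx]),
)
print(outputs.start_logits.shape)  # (batch_size, sequence_length): one start logit per token
print(outputs.end_logits.shape)    # (batch_size, sequence_length): one end logit per token
print(outputs.loss)                # average cross-entropy over the true start and end positions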

2. Define Training Configuration

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="no",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)
  • This sets up the training environment. TrainingArguments is a utility class from transformers that defines:

Key Parameters:

| Parameter | What it Controls |
| --- | --- |
| output_dir="./results" | Where to save checkpoints and logs. |
| evaluation_strategy="no" | Disables evaluation during training. You can set "steps" or "epoch" to evaluate periodically. |
| learning_rate=2e-5 | Small step size for gradient descent (common for fine-tuning). |
| per_device_train_batch_size=16 | Number of samples per batch per device (GPU/CPU). |
| num_train_epochs=3 | Number of times the model will see the entire training data. |
| weight_decay=0.01 | Regularization to prevent overfitting by shrinking weights slightly. |

This configuration is crucial because transformers are sensitive to hyperparameters—especially learning rate and batch size.

3. Initialize the Trainer API

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets,
    tokenizer=tokenizer,
)
  • Trainer is Hugging Face's high-level training loop wrapper.

  • It combines:

    • The model to be trained (BertForQuestionAnswering).

    • The training arguments (defined above).

    • The tokenized dataset (your preprocessed QA examples).

    • The tokenizer, used internally for padding, batching, etc.

4. Start Training

trainer.train()
  • This begins the actual fine-tuning.

  • Each training step:

    1. Feeds a batch of tokenized questions & contexts through the model.

    2. Computes predictions for start and end positions.

    3. Compares predictions to ground-truth positions using cross-entropy loss.

    4. Updates the model’s parameters using backpropagation.

  • Runs for 3 full epochs over the dataset (as per your config).

Once this is done, your model is fine-tuned to your domain-specific QA task (in this case, answering questions based on tourism activities in London).


Save the Model

model.save_pretrained("./fine_tuned_bert")
tokenizer.save_pretrained("./fine_tuned_bert")
  • Saves both the fine-tuned model weights and the tokenizer config locally to ./fine_tuned_bert.

  • Using save_pretrained (from transformers) lets you easily reload the model later for inference or sharing.


Test the Model

qa_pipeline = pipeline("question-answering", model="./fine_tuned_bert", tokenizer="./fine_tuned_bert")
  • Loads the saved model using Hugging Face’s pipeline.

    Training can be extremely time-consuming and resource-intensive if you don’t have a GPU; in that case, you can access my fine-tuned model from Hugging Face instead.

  • pipeline("question-answering") abstracts away all preprocessing and postprocessing steps.

result = qa_pipeline({
    "question": "...",
    "context": "..."
})

Passes a dictionary with question and context.

The model returns:

  • answer: Extracted text span.

  • score: Confidence score.

  • start and end: Character-level positions of the answer within the context.

print(result)

Displays the model’s predicted answer.


Inference: Testing the Fine-Tuned Model

Once the model is fine-tuned, we can use it to answer questions using the Hugging Face pipeline API. Here's an example:

result = qa_pipeline({
    "question": "To which category does the Christmas Lights by Night Open-Top Bus Tour belong?",
    "context": "Christmas Lights by Night Open-Top Bus Tour is an activity of type guided tour. It lasts 1.5 hours..."
})

The model runs on the GPU (cuda:0) and returns:

{
  'score': 0.9995,
  'start': 67,
  'end': 78,
  'answer': 'guided tour'
}

The output obtained can be interpreted as follows:

  • Answer: guided tour — the model correctly identifies the category.

  • Score: 0.9995 — a high confidence score, meaning the model is very certain about its prediction.

  • Start/End: These are character-level positions that map the extracted answer back to the original context, as shown below.

This shows that the fine-tuned model can successfully extract relevant information from domain-specific data (in this case, tourism activities in London).
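A quick way to verify that start and end are character offsets is to slice the context with them (using the example output above, with the context truncated as shown):

context = "Christmas Lights by Night Open-Top Bus Tour is an activity of type guided tour. It lasts 1.5 hours..."
result = {'score': 0.9995, 'start': 67, 'end': 78, 'answer': 'guided tour'}

# Slicing the context with start/end recovers the extracted answer.
print(context[result['start']:result['end']])  # -> guided tour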


Drawbacks of Full Fine-Tuning

While full fine-tuning allows for complete adaptation to a downstream task, it also comes with some important trade-offs:

  • Computationally Expensive: Updating all model parameters requires more memory and compute. For large models like BERT or LLaMA, this can be prohibitive without access to powerful GPUs or TPUs.

  • Slower Training: Since every layer is updated, training takes significantly longer compared to more lightweight techniques like PEFT.

  • Risk of Overfitting: With small datasets, full fine-tuning can easily lead to overfitting if not regularized properly.

  • Low Reusability: The resulting model is heavily specialized for the fine-tuning task, making it harder to reuse for other tasks without another round of fine-tuning.

Because of these limitations, full fine-tuning is best suited when:

  • You have enough computational resources.

  • Your dataset is relatively large and domain-specific.

  • You need the model to deeply align with your data.


Conclusion

In this blog, we explored full fine-tuning, a powerful supervised learning technique where all parameters of a pretrained transformer are updated to specialize in a new task. We walked through the intuition behind transfer learning, the benefits of leveraging pretrained models, and how to implement full fine-tuning using Hugging Face and PyTorch—from preprocessing your dataset to saving and testing your model.

While full fine-tuning offers deep domain adaptation, it's not always the most efficient choice. As models grow in size, fine-tuning every parameter can become expensive and impractical. That’s why in the next blog, we’ll shift our focus to Parameter-Efficient Fine-Tuning (PEFT). These methods allow you to achieve strong performance with just a fraction of the compute and memory requirements—making them ideal for deploying large models at scale.

Stay tuned as we continue our journey into The Intuitive Fine-Tuning Series, where we break down cutting-edge methods in an approachable and hands-on way.

Check out the full code and dataset in my GitHub Repo - Full Fine Tuning


Written by

Vikas Srinivasa

My journey into AI has been unconventional yet profoundly rewarding. Transitioning from a professional cricket career, a back injury reshaped my path, reigniting my passion for technology. Seven years after my initial studies, I returned to complete my Bachelor of Technology in Computer Science, where I discovered my deep fascination with Artificial Intelligence, Machine Learning, and NLP, particularly their applications in the finance sector.