How to Achieve Full Fine-Tuning: A Step-by-Step Guide


Introduction
In this blog, we’ll explore full fine-tuning, a supervised fine-tuning technique that updates all model parameters to adapt a pretrained transformer to a specific task. We’ll discuss how full fine-tuning works, its advantages, and how to implement it using PyTorch and Hugging Face.
Before diving into full fine-tuning, let’s first understand supervised fine-tuning and its different approaches.
What is Supervised Fine-Tuning?
Supervised fine-tuning involves training a model on labeled data, where input-output pairs are explicitly defined. It is the most commonly used fine-tuning method in NLP and can be categorized into different approaches:
Full Fine-Tuning → Updates all model parameters.
Layer-Specific Fine-Tuning → Fine-tunes only certain layers of the model.
Parameter-Efficient Fine-Tuning (PEFT) → Modifies only a subset of parameters while keeping most of the model frozen, using methods like:
Adapters (e.g., Houlsby adapters, Compacter) and BitFit (bias-only tuning)
LoRA (Low-Rank Adaptation)
QLoRA (Quantized LoRA)
IA³ (Infused Adapter by Inhibiting and Amplifying Inner Activations)
Prefix-Tuning, Prompt-Tuning, and P-Tuning
Now that we have an overview of supervised fine-tuning, let’s take a closer look at full fine-tuning.
What is Full Fine-Tuning?
Full fine-tuning is a transfer learning technique where a pretrained model is further trained on a new dataset by updating all its parameters. Unlike selective fine-tuning techniques that modify only a small portion of the model (such as adapter layers or the final classifier head), full fine-tuning allows the model to completely adapt to a new task.
In simple terms, full fine-tuning repurposes a general-purpose model (such as BERT, GPT, T5, or LLaMA) for a specific task by adjusting every layer during training.
Why Not Just Train a New Model from Scratch?
A common question is: If we are updating all parameters, why not train a model from scratch instead?
The reason is simple—training from scratch is extremely resource-intensive:
Requires terabytes of data for effective generalization.
Demands massive computational resources (thousands of GPUs and weeks of training).
Needs extensive hyperparameter tuning for convergence.
In contrast, full fine-tuning leverages a pretrained model that already has general language understanding, requiring significantly fewer resources while still adapting effectively to a new domain.
However, full fine-tuning isn’t always the most efficient approach. Parameter-Efficient Fine-Tuning (PEFT) techniques, such as LoRA and adapters, allow adaptation with fewer trainable parameters, reducing computational costs.
On the other hand, full fine-tuning typically gives the deepest adaptation: because every layer is updated, the model aligns closely with the domain-specific data, much like how a PhD graduate develops deep expertise in a subject.
Overview of the Full Fine-Tuning Process
The key steps involved in full fine-tuning are:
Load a pretrained model (e.g., bert-base-uncased).
Unfreeze all layers to allow training across the entire network (a minimal sketch of this follows below).
Fine-tune the model using task-specific data (e.g., classification, summarization, or text generation).
Optimize the entire model on the new dataset, enabling it to learn domain-specific patterns.
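To make the "unfreeze all layers" step concrete, here is a minimal sketch (assuming the transformers library and the same bert-base-uncased checkpoint used later in this tutorial). It loads the model and confirms that every parameter will receive gradient updates:

from transformers import BertForQuestionAnswering

model = BertForQuestionAnswering.from_pretrained('bert-base-uncased')

# Weights loaded with from_pretrained are trainable by default; setting
# requires_grad explicitly makes the "unfreeze everything" step visible.
for param in model.parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} / {total:,}")  # every parameter is updated during full fine-tuning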
Implementing Full Fine-Tuning
For this tutorial, we’ll be fine-tuning BERT for the task of Extractive Question Answering (QA). This means we’re using a version of BERT where the final layer has been adapted specifically to predict the start and end positions of an answer within a given context. Since BERT is an encoder-only model, it is non-generative by design.
So, what does that mean?
Generative vs. Non-Generative Models
A generative model is capable of producing new content. It doesn't just memorize—it learns the structure and semantics of the data it’s trained on. When asked a question, it responds by generating a new answer in natural language, much like a smart student who has internalized the material and can explain it in their own words.
In contrast, a non-generative model learns a direct mapping from input to output. It doesn’t generate new content—it simply selects the most likely output from existing input. Think of it like a student who has memorized answers and reproduces them word-for-word during the exam.
BERT is a non-generative model, and in this case, we’re fine-tuning it for Extractive QA—a task where the model identifies the exact answer span within a provided context. It doesn't generate a response from scratch; instead, it points to the answer in the given text.
What Does Extractive QA Look Like?
The model is given two inputs:
A question
A context (a passage that contains the answer)
Its job is to extract the answer directly from the context.
Here’s an example of one entry from our dataset:
{
"question": "What is the capital of the UK?",
"context": "London is the capital of the United Kingdom.",
"answer_text": "London",
"answer_start_index": 0,
"answer_end_index": 6
}
In this case, the model learns that the correct answer ("London") starts at character index 0 and ends at character index 6 (exclusive) in the context.
Through fine-tuning, we then teach the model to predict the token position where the answer starts and the token position where it ends within the provided context.
Install The Dependencies
%%capture
%pip install transformers datasets peft accelerate torch
What it does: Installs all the necessary Python packages using pip.
Key libraries:
transformers: From Hugging Face, provides pretrained transformer models.
datasets: Hugging Face's library for loading and processing datasets.
peft: For parameter-efficient fine-tuning techniques (not needed for full fine-tuning itself, but useful for the PEFT follow-up).
accelerate: Makes model training device-agnostic (CPU, GPU, TPU).
torch: PyTorch, the core deep learning framework used for training.
%%capture: Suppresses cell output (to keep the notebook clean).
Import the required modules
import json
import torch
from transformers import BertTokenizerFast, BertForQuestionAnswering, Trainer, TrainingArguments, pipeline
Imports the essential classes and functions:
BertTokenizerFast: Efficient tokenizer from Hugging Face for BERT.
BertForQuestionAnswering: Pretrained BERT model with a QA head.
Trainer, TrainingArguments: Hugging Face's high-level training API.
pipeline: Provides a simple interface for inference.
torch: For tensor computations and model operations.
Preprocess the Data
In the preprocess_data function, we iterate through each entry in our dataset and prepare it for training with BERT's question-answering model.
# Initialize the tokenizer
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

def preprocess_data(data):
    tokenized_data = []
    for item in data:
        # Tokenize the question and context together with offset mapping
        inputs = tokenizer(
            item['question'],
            item['context'],
            max_length=512,
            truncation=True,
            padding='max_length',
            return_offsets_mapping=True,  # This is crucial
            return_tensors='pt'
        )
        offset_mapping = inputs.pop('offset_mapping')  # Extract offset mapping
        sequence_ids = inputs.sequence_ids(0)          # None = special token, 0 = question, 1 = context
        input_ids = inputs['input_ids'].squeeze()      # Remove batch dimension

        # Convert character indices to token indices for the answer
        start_char = item['answer_start_index']
        end_char = item['answer_end_index']
        start_token_idx, end_token_idx = None, None
        for i, (start, end) in enumerate(offset_mapping.squeeze().tolist()):
            if sequence_ids[i] != 1:
                continue  # Only consider context tokens (question offsets also start at character 0)
            if start <= start_char < end:
                start_token_idx = i
            if start < end_char <= end:
                end_token_idx = i
                break  # Stop once the end position is found

        # Ensure valid token indices
        if start_token_idx is None or end_token_idx is None:
            continue  # Skip this example if indices are not found

        tokenized_data.append({
            'input_ids': input_ids,
            'attention_mask': inputs['attention_mask'].squeeze(),
            'token_type_ids': inputs['token_type_ids'].squeeze(),
            'start_positions': torch.tensor([start_token_idx]),
            'end_positions': torch.tensor([end_token_idx])
        })
    return tokenized_data
Let's now walk through each step of the preprocessing function in depth. To make things concrete, we'll use the following example:
{
"question": "What is the capital of the UK?",
"context": "London is the capital of the United Kingdom.",
"answer_text": "London",
"answer_start_index": 0,
"answer_end_index": 6
}
Now we’ll walk through the preprocessing using this input.
Step 1: Tokenization
inputs = tokenizer(
    item['question'],
    item['context'],
    max_length=512,
    truncation=True,
    padding='max_length',
    return_offsets_mapping=True,
    return_tensors='pt'
)
This takes both the question and context and combines them into:
[CLS] What is the capital of the UK? [SEP] London is the capital of the United Kingdom. [SEP]
return_offsets_mapping=True adds, for each token, its character span in the raw text it came from. Offsets for question tokens are relative to the question string, offsets for context tokens are relative to the context string, and special tokens such as [CLS] and [SEP] get the span (0, 0).
Example of offset mapping for the context part:
# For the context only (after the [SEP] token)
offset_mapping for context might look like:
[
(0, 6), # "London"
(7, 9), # "is"
(10, 13), # "the"
(14, 21), # "capital"
(22, 24), # "of"
(25, 28), # "the"
(29, 35), # "United"
(36, 43), # "Kingdom"
(43, 44) # "."
]
So "London"
spans characters 0–6, and that's how we know which token it maps to.
So what we essentially did is ,for every question-context-answer triplet, we tokenize the question and context together using the BertTokenizerFast
from Hugging Face. This is important because BERT expects the input in a specific format: [CLS] question [SEP] context [SEP]
.
We set max_length
to 512 (the maximum sequence length BERT can handle), and apply truncation or padding accordingly. The crucial part here is return_offsets_mapping=True
, which tells the tokenizer to return a mapping between each token and its original character span in the raw text. This allows us to later locate the answer’s position at the token level. Since we want the tokenizer to return PyTorch tensors, we set return_tensors="pt"
.
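If you want to see this mapping for yourself, here is a small inspection snippet (illustrative only; it reuses the tokenizer initialized above) that prints each token together with its character span and the segment it came from:

# Inspect each token's character span and which segment it belongs to
enc = tokenizer(
    "What is the capital of the UK?",
    "London is the capital of the United Kingdom.",
    return_offsets_mapping=True,
)
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"])
for token, (start, end), seg in zip(tokens, enc["offset_mapping"], enc.sequence_ids(0)):
    # seg is None for special tokens, 0 for question tokens, 1 for context tokens
    print(f"{token:10s} span=({start}, {end}) segment={seg}")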
Step 2: Extract Offsets and Remove Batch Dimension
offset_mapping = inputs.pop('offset_mapping')
sequence_ids = inputs.sequence_ids(0)
input_ids = inputs['input_ids'].squeeze()
inputs is a dictionary of PyTorch tensors containing the keys input_ids, attention_mask, token_type_ids, and offset_mapping. Let's look at what each of these denotes:
input_ids: The tokenized representation of the input text. Each token (a word or subword) is mapped to its integer ID in the BERT vocabulary, in the format [CLS] question tokens [SEP] context tokens [SEP]. .squeeze() is used to go from shape (1, 512) → (512,), since we're only processing one example at a time.
attention_mask: Indicates which tokens the model should attend to. It has the same shape as input_ids and contains 1s for real tokens and 0s for padding tokens, which lets the model ignore padded positions during attention computation.
token_type_ids (also known as segment ids): Helps BERT distinguish between the two segments in the input, the question and the context. Tokens from the question are marked with 0, and tokens from the context with 1.
We've already discussed what offset_mapping represents in the previous step. Here we remove it from inputs because it's only needed during preprocessing; it is not a model input during training.
inputs.sequence_ids(0) additionally tells us, for every token position, whether it belongs to the question (0), the context (1), or is a special or padding token (None). We use it in the next step to restrict the answer search to context tokens, because the question's character offsets also start at 0.
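As a quick sanity check (a hypothetical snippet that assumes the inputs dictionary produced by the tokenizer call in Step 1 for our example), you can print the shapes and segment ids directly:

print(inputs['input_ids'].shape)                # torch.Size([1, 512]): includes the batch dimension
print(inputs['input_ids'].squeeze().shape)      # torch.Size([512]): batch dimension removed
print(inputs['attention_mask'].sum().item())    # number of real (non-padding) tokens
print(inputs['token_type_ids'].squeeze()[:15])  # 0s over the question segment, then 1s once the context starts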
Step 3: Map Character Indices to Token Indices
start_char = item['answer_start_index'] # 0
end_char = item['answer_end_index'] # 6
Now you find which tokens these character spans correspond to:
for i, (start, end) in enumerate(offset_mapping.squeeze().tolist()):
    if sequence_ids[i] != 1:
        continue  # skip [CLS], question tokens, [SEP], and padding
    if start <= start_char < end:
        start_token_idx = i
    if start < end_char <= end:
        end_token_idx = i
        break
This loop finds:
The context token whose character span contains the start of the answer (character 0).
The context token whose character span contains the end of the answer (character 6).
The sequence_ids check is what keeps the search inside the context: the question's character offsets also begin at 0, so without it the loop could match a question token instead of the answer span.
In this case, both start_char=0 and end_char=6 fall inside the span (0, 6) of the first context token, "london", which sits right after the question tokens and the first [SEP].
So:
start_token_idx == end_token_idx, and both point at that single token.
Step 4: Append to Final Tokenized Dataset
You prepare the input dict BERT expects:
{
'input_ids': ..., # Tokens
'attention_mask': ..., # 1s and 0s (padding info)
'token_type_ids': ..., # Segment IDs (0 for question, 1 for context)
'start_positions': tensor([start_token_idx]),
'end_positions': tensor([end_token_idx])
}
Why This Matters
BERT QA models are trained to predict the start and end token positions of the answer within the input sequence.
Since the dataset gives character indices, you need to map them to token positions, and offset_mapping is what makes that possible.
Once token indices are identified, we prepare a dictionary containing input_ids, attention_mask, token_type_ids, and the computed start_positions and end_positions. This is exactly the format BertForQuestionAnswering expects during training. Overall, this preprocessing step is essential for aligning the dataset's character-based labels with BERT's token-based structure.
Load the Dataset
def load_dataset(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        data = json.load(file)
    return data
data = load_dataset('qa_london_data.json')
tokenized_datasets = preprocess_data(data)
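With the data loaded and preprocessed, a quick optional sanity check (illustrative; it assumes the preprocessing above succeeded for the first example) is to decode the labelled span and confirm it matches the answer text:

example = tokenized_datasets[0]
start = example['start_positions'].item()
end = example['end_positions'].item()
print(tokenizer.decode(example['input_ids'][start:end + 1]))  # should match the answer text, lowercased by the uncased tokenizer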
Train the Model
1. Load a Pretrained Model with a Question Answering (QA) Head
model = BertForQuestionAnswering.from_pretrained('bert-base-uncased')
BertForQuestionAnswering is a class from the transformers library.
It takes a pretrained BERT model (bert-base-uncased) and adds a QA head on top of it.
The QA head is a small linear layer that predicts:
The start token of the answer span.
The end token of the answer span.
This head outputs a pair of logits for each token in the input:
One for the probability of being the start of the answer.
One for the probability of being the end.
The model is trained to minimize a loss function that compares the predicted token positions to the ground-truth start/end positions from your dataset.
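To see what the QA head produces, here is a small illustrative forward pass using the model and the preprocessed examples defined above (this is not part of the training loop itself, just a peek at the outputs):

# Illustrative forward pass with a single preprocessed example (batch size 1)
example = tokenized_datasets[0]
outputs = model(
    input_ids=example['input_ids'].unsqueeze(0),          # add a batch dimension: (1, 512)
    attention_mask=example['attention_mask'].unsqueeze(0),
    token_type_ids=example['token_type_ids'].unsqueeze(0),
    start_positions=example['start_positions'],           # ground-truth start token index
    end_positions=example['end_positions'],               # ground-truth end token index
)
print(outputs.start_logits.shape)  # torch.Size([1, 512]): one start score per token
print(outputs.end_logits.shape)    # torch.Size([1, 512]): one end score per token
print(outputs.loss)                # average of the start and end cross-entropy losses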
2. Define Training Configuration
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="no",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)
This sets up the training environment. TrainingArguments is a utility class from transformers that defines the training hyperparameters and bookkeeping options:
Key Parameters:
| Parameter | What it Controls |
| --- | --- |
| output_dir="./results" | Where to save checkpoints and logs. |
| evaluation_strategy="no" | Disables evaluation during training. You can set "steps" or "epoch" to evaluate periodically. |
| learning_rate=2e-5 | Small step size for gradient descent (common for fine-tuning). |
| per_device_train_batch_size=16 | Number of training samples per batch per device (GPU/CPU). |
| per_device_eval_batch_size=16 | Batch size used during evaluation (unused here, since evaluation is disabled). |
| num_train_epochs=3 | Number of times the model will see the entire training data. |
| weight_decay=0.01 | Regularization that prevents overfitting by shrinking weights slightly. |
This configuration is crucial because transformers are sensitive to hyperparameters—especially learning rate and batch size.
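For reference, if you do want periodic evaluation (as noted in the table), a lightly modified configuration could look like the sketch below. It assumes you have a separately preprocessed evaluation split, here called tokenized_eval_datasets, which is hypothetical; for this tutorial we keep evaluation disabled.

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",       # evaluate at the end of every epoch instead of never
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)
# Then pass eval_dataset=tokenized_eval_datasets (a hypothetical held-out split) to the Trainer below.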
3. Initialize the Trainer API
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets,
    tokenizer=tokenizer,
)
Trainer is Hugging Face's high-level training loop wrapper. It combines:
The model to be trained (BertForQuestionAnswering).
The training arguments (defined above).
The tokenized dataset (your preprocessed QA examples).
The tokenizer, used internally for padding, batching, etc.
4. Start Training
trainer.train()
This begins the actual fine-tuning.
Each training step:
Feeds a batch of tokenized questions & contexts through the model.
Computes predictions for start and end positions.
Compares predictions to ground-truth positions using cross-entropy loss.
Updates the model’s parameters using backpropagation.
Runs for 3 full epochs over the dataset (as per your config).
Once this is done, your model is fine-tuned to your domain-specific QA task (in this case, answering questions based on tourism activities in London).
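Purely for intuition, here is a stripped-down sketch of roughly what each Trainer step does under the hood. This is an illustration under simplifying assumptions, not the actual Trainer internals, and it is not needed if you call trainer.train():

# Illustrative only: the real Trainer also handles learning-rate scheduling,
# device placement, gradient clipping, checkpointing, and more.
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader

def collate(batch):
    # Stack the per-example tensors into batched tensors
    return {key: torch.stack([ex[key].squeeze() for ex in batch]) for key in batch[0]}

loader = DataLoader(tokenized_datasets, batch_size=16, shuffle=True, collate_fn=collate)
optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)

model.train()
for epoch in range(3):
    for batch in loader:
        outputs = model(**batch)   # forward pass: start/end logits plus cross-entropy loss
        outputs.loss.backward()    # backpropagate through every layer (full fine-tuning)
        optimizer.step()           # update all parameters
        optimizer.zero_grad()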
Save the Model
model.save_pretrained("./fine_tuned_bert")
tokenizer.save_pretrained("./fine_tuned_bert")
Saves both the fine-tuned model weights and the tokenizer files locally to ./fine_tuned_bert, so you can easily reload the model later for inference or sharing.
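Reloading the saved model and tokenizer later is symmetric (a small sketch using the same classes imported earlier):

# Reload the fine-tuned model and tokenizer from the local directory
model = BertForQuestionAnswering.from_pretrained("./fine_tuned_bert")
tokenizer = BertTokenizerFast.from_pretrained("./fine_tuned_bert")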
Test the Model
qa_pipeline = pipeline("question-answering", model="./fine_tuned_bert", tokenizer="./fine_tuned_bert")
Loads the saved model using Hugging Face's pipeline. pipeline("question-answering") abstracts away all of the preprocessing and postprocessing steps.
Note: training can be extremely time-consuming and resource-intensive if you don't have a GPU; in that case, you can load my fine-tuned model from Hugging Face instead.
result = qa_pipeline({
"question": "...",
"context": "..."
})
Passes a dictionary with question and context keys.
The model returns:
answer: The extracted text span.
score: The confidence score.
start and end: Character positions of the answer within the context.
print(result)
Displays the model’s predicted answer.
Inference: Testing the Fine-Tuned Model
Once the model is fine-tuned, we can use it to answer questions using the Hugging Face pipeline
API. Here's an example:
result = qa_pipeline({
"question": "To which category does the Christmas Lights by Night Open-Top Bus Tour belong?",
"context": "Christmas Lights by Night Open-Top Bus Tour is an activity of type guided tour. It lasts 1.5 hours..."
})
The model runs on the GPU (cuda:0
) and returns:
{
'score': 0.9995,
'start': 67,
'end': 78,
'answer': 'guided tour'
}
The output can be interpreted as follows:
Answer: guided tour — the model correctly identifies the category.
Score: 0.9995 — a high confidence score, meaning the model is very certain about its prediction.
Start/End: Character positions that map the extracted answer back to the original context.
This shows that the fine-tuned model can successfully extract relevant information from domain-specific data (in this case, tourism activities in London).
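Because start and end are character offsets into the context, you can recover the answer by slicing the context string directly (a quick illustrative check against the output above):

context = ("Christmas Lights by Night Open-Top Bus Tour is an activity of type "
           "guided tour. It lasts 1.5 hours...")
print(context[result['start']:result['end']])  # prints "guided tour"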
Drawbacks of Full Fine-Tuning
While full fine-tuning allows for complete adaptation to a downstream task, it also comes with some important trade-offs:
Computationally Expensive: Updating all model parameters requires more memory and compute. For large models like BERT or LLaMA, this can be prohibitive without access to powerful GPUs or TPUs.
Slower Training: Since every layer is updated, training takes significantly longer compared to more lightweight techniques like PEFT.
Risk of Overfitting: With small datasets, full fine-tuning can easily lead to overfitting if not regularized properly.
Low Reusability: The resulting model is heavily tied to the fine-tuning task, making it harder to reuse for other tasks without another round of fine-tuning.
Because of these limitations, full fine-tuning is best suited when:
You have enough computational resources.
Your dataset is relatively large and domain-specific.
You need the model to deeply align with your data.
Conclusion
In this blog, we explored full fine-tuning, a powerful supervised learning technique where all parameters of a pretrained transformer are updated to specialize in a new task. We walked through the intuition behind transfer learning, the benefits of leveraging pretrained models, and how to implement full fine-tuning using Hugging Face and PyTorch—from preprocessing your dataset to saving and testing your model.
While full fine-tuning offers deep domain adaptation, it's not always the most efficient choice. As models grow in size, fine-tuning every parameter can become expensive and impractical. That’s why in the next blog, we’ll shift our focus to Parameter-Efficient Fine-Tuning (PEFT). These methods allow you to achieve strong performance with just a fraction of the compute and memory requirements—making them ideal for deploying large models at scale.
Stay tuned as we continue our journey into The Intuitive Fine-Tuning Series, where we break down cutting-edge methods in an approachable and hands-on way.
Check out the full code and dataset in my GitHub Repo - Full Fine Tuning