Fine Tuning LLMs - Part 1

Review of Building Blocks
Let's quickly review the basic building blocks that lead up to LLMs:
“Self Attention” : The technique that computes an attention score for every token against every other token. It is used throughout the transformer architecture [Encoders, Decoders] (see the sketch after this list).
“Parallel Processing and Multi Stack Ranking” : The input text is tokenized and every token is processed in parallel through the encoders; techniques like multi-head attention are then applied to identify the context of the key tokens.
“Pre-Training” : Pre-training a model is performed up front on massive, unlabelled corpora by large companies like Meta, OpenAI, etc. It is computationally very expensive.
“Fine Tuning” : Organizations that want to build custom LLMs on a labelled corpus can do so by fine-tuning a publicly available LLM. We will deep-dive on this over the next two parts.
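As a quick refresher, here is a minimal sketch of (single-head) scaled dot-product self-attention in PyTorch; the tensor shapes are toy values chosen purely for illustration:

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.size(-1)  # dimensionality of the key vectors
    # Similarity of every query with every key, scaled to keep gradients stable
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
    weights = F.softmax(scores, dim=-1)  # attention score of each token over all tokens
    return weights @ V  # context-aware representation of each token

# Toy example: a sequence of 4 tokens, each embedded in 8 dimensions
x = torch.randn(4, 8)
out = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V
print(out.shape)  # torch.Size([4, 8])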
What is Fine-Tuning?
One of the analogies that helped me understand the key concept of fine-tuning:
For the first 12-15 years of a child's life, parents and schools provide an education that teaches the child to understand and speak a language and to make sense of the world through diverse topics like history, science, and geography. Similarly, LLMs are built by large enterprises with billions of parameters to become language experts.
But how does the child become a domain expert? A degree college spends 4 years turning the same kid into an expert in healthcare, engineering, etc. - and yes, fine-tuning is what turns a general-purpose LLM into that domain expert.
You might want to fine-tune a pre-trained LLM to:
Adapt to domain-specific data [Healthcare, e-Commerce, Customer Support, Enterprise, etc.]
Improve performance on specific use cases
For example - a regular pre-trained LLM might misunderstand “SOB” or miss the medical implication of “elevated troponin levels”, whereas a healthcare fine-tuned LLM like BioGPT understands the clinical implications of “elevated troponin levels” and correctly expands medical abbreviations.
Fine Tuning = Domain Expertise + Precision + Clarity
When NOT to Fine-Tune?
Fine-tuning is itself an expensive process, so before getting to “How to fine-tune?”, let's understand when not to fine-tune and which use cases can be solved instead through:
Prompt Engineering + One/Few shot inference
When the LLM is already an expert in a topic and has the knowledge base to address the input text and produce an inference, but you want that inference customized for your use case, Prompt Engineering + One/Few-shot inference is usually enough.
For example: you want your AI agent to become a movie buff in “Westerns”, able to classify a movie and provide sentiment analysis.
Most movie data would already be publicly available within the pre-training datasets of the LLM. In this case you can use prompting or few-shot inference to hint the LLM to generate sentiment in a certain format, biased towards “Westerns”.
This is free and easy to change or update.
Quick Example: Writing movie reviews that align with your taste
import os
import requests

# Groq's OpenAI-compatible chat completions endpoint; read the API key from the environment
GROQ_API_URL = "https://api.groq.com/openai/v1/chat/completions"
GROQ_API_KEY = os.environ["GROQ_API_KEY"]

# Define the prompt to provide context for the model
# to infer in the same tone, verbiage, sentiment and format
post_prompt = """You are my assistant to help me write movie reviews.
Keep it concise. Use simple language.
Use the following guidelines.
General Structure
Start with a one-line summary or hook (keep it punchy and emotional).
Include a brief plot overview without major spoilers.
Highlight main themes, tone, and genre (e.g., dystopian sci-fi, coming-of-age drama).
Conclude with a personal verdict or recommendation.
Critical Evaluation
Comment on direction, writing, and pacing (e.g., was the story engaging or slow?).
Evaluate acting performances, especially standout roles or disappointments.
Assess cinematography, music, and visual style.
Mention if the film had originality or felt derivative.
Emotional and Intellectual Impact
Describe how the film makes the audience feel — moved, thrilled, confused, nostalgic, etc.
Discuss if the movie raises any social, moral, or philosophical questions.
Contextual Insight
Mention the director's style or previous works if relevant.
Compare it to similar films or influences (e.g., "Fans of Inception will love this.").
Indicate if the movie met or defied expectations.
Optional Enhancements
Add notable quotes or scenes (without spoilers).
Include audience suitability (family-friendly, mature themes, etc.).
Provide a rating (e.g., ⭐️ 4/5 or "Highly Recommended").
Follow my taste for movies of different genres based on my past reviews
SciFi : https://www.newspapers.com/article/the-journal-news-the-empire-strikes-back/169909270/
SciFi : https://abcnews.go.com/GMA/Culture/review-alto-knights-struggles-earn-place-pantheon-gangster/story?id=119998938
Drama : https://abcnews.go.com/GMA/Culture/review-jenna-ortega-death-unicorn-beating-heart-amid/story?id=120237255
"""
# Create the request payload
payload = {
    "model": "llama3-8b-8192",  # You can also try "llama3-70b-8192" for a larger model
    "messages": [
        {"role": "system", "content": "You are an AI assistant that writes movie reviews adapted to my taste. Write a review for the movie - John Doe."},
        {"role": "user", "content": post_prompt}
    ],
    "temperature": 0.7,  # Higher value for more creative output
    "max_tokens": 300  # Limit the response length
}

# Set headers
headers = {
    "Authorization": f"Bearer {GROQ_API_KEY}",
    "Content-Type": "application/json"
}

# Make the API request
response = requests.post(GROQ_API_URL, json=payload, headers=headers)

# Extract and print the generated movie review
if response.status_code == 200:
    movie_review = response.json()["choices"][0]["message"]["content"]
    print("\n🔹 **Generated Movie Review:**\n")
    print(movie_review)
else:
    print("Error:", response.json())
RAG + Enterprise / Customer-specific inference
If you are looking to personalize the model's output based on a patient's personal data, a customer's purchase history for recommendations, and similar private records, Retrieval-Augmented Generation (RAG) is usually the better fit: the data stays outside the model and is retrieved at inference time.
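A minimal retrieval sketch, assuming the sentence-transformers library; the embedding model, records, and query below are illustrative placeholders. The idea: fetch the most relevant private record at query time and prepend it to the prompt, leaving the LLM itself untouched:

from sentence_transformers import SentenceTransformer, util

# Illustrative private records that never appeared in the LLM's pre-training data
documents = [
    "Patient 123: allergic to penicillin, last visit 2024-01-15.",
    "Customer 456: purchased trail running shoes, size 10, March 2024.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small open-source embedding model
doc_embeddings = embedder.encode(documents, convert_to_tensor=True)

query = "What is patient 123 allergic to?"
query_embedding = embedder.encode(query, convert_to_tensor=True)

# Retrieve the most similar record by cosine similarity
best = util.cos_sim(query_embedding, doc_embeddings).argmax().item()

# Prepend the retrieved context; the unmodified LLM answers from it
prompt = f"Context: {documents[best]}\n\nQuestion: {query}"
print(prompt)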
Parameters for Fine-Tuning
Now let's understand some key parameters of model training, which also serve as the hyperparameters in fine-tuning:
- Learning Rate : I won't go into the details of factors like training convergence, but think about yourself learning Calculus - you work through Limits, Derivatives, Continuity, etc. You go through the chapters, write a test, check your errors (the Loss Function in model-training terms), learn some more, and then test again.
The size of the correction you make after each test is the “Learning Rate”.
If the learning rate / the size of each correction is too high, you overreact to every mistake and your understanding never settles - in model terms, training becomes unstable and can overshoot a good solution.
If the learning rate / the size of each correction is too low, you risk slowing down your actual learning - the model converges very slowly.
Since fine-tuning starts from an already-trained model, it is safer to keep the learning rate low to avoid large updates that overwrite what the model already knows.
- Batch Size : The number of training examples processed in a single iteration - typically 8-128 for LLMs.
- Number of Epochs : The number of times the entire training dataset is passed through the model - too many epochs again invite overfitting; for fine-tuning, 2 to 5 epochs are typical.
- Gradient Accumulation Steps : When fine-tuning LLMs on limited hardware, a large batch can be broken into mini-batches whose gradients are accumulated before each optimizer update (see the sketch after this list).
- Weight Decay (L2 Regularization) : We already learnt about overfitting (when training accuracy >> validation accuracy); weight decay adds a penalty (a small positive number) to the loss function so the model favors smaller weights, improving generalization to unseen data.
- Another parameter related to regularization is the Dropout Rate, which randomly drops a percentage of neurons during training to prevent overfitting.
- Learning Rate Scheduler : As the name suggests, this adjusts the learning rate dynamically during training. Common strategies are linear decay and cosine annealing schedules.
- Attention Heads and Layer Count : We already spent a series understanding multi-head attention; more attention heads and layers improve learning but are computationally more expensive - for fine-tuning, keeping the existing model architecture is preferred.
- Max Sequence Length : The maximum number of tokens the model accepts as input during fine-tuning. Longer sequences provide more context but are more computationally intensive; fine-tuning often uses 512 tokens or fewer.
- Optimizer : The optimization algorithm, such as Adam or AdamW, that adjusts the model parameters during training.
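To see how several of these knobs interact, here is a minimal PyTorch training-loop sketch combining a low learning rate, weight decay, gradient clipping, gradient accumulation, and a linear warmup scheduler. The tiny linear model, random data, and step counts are placeholders for illustration, not a recommended setup:

import torch
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

# Placeholder model and data; in a real run these come from your fine-tuning setup
model = torch.nn.Linear(10, 2)
loss_fn = torch.nn.CrossEntropyLoss()

optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)  # low LR + weight decay

accumulation_steps = 4                      # simulate a 4x larger effective batch
total_updates = 100 // accumulation_steps   # one optimizer update per 4 mini-batches
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=5, num_training_steps=total_updates
)

for step in range(100):
    inputs = torch.randn(8, 10)             # mini-batch of 8 examples
    labels = torch.randint(0, 2, (8,))
    loss = loss_fn(model(inputs), labels)
    (loss / accumulation_steps).backward()  # accumulate scaled gradients

    if (step + 1) % accumulation_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
        optimizer.step()                    # apply the accumulated update
        scheduler.step()                    # warm up, then linearly decay the learning rate
        optimizer.zero_grad()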
Simple Example of Fine-Tuning a BERT-based LLM for Sentiment Analysis
Now we will apply the above parameters and show, using a very simple dataset, how fine-tuning works.
Feel free to run or create a copy of the Google Colab here: https://colab.research.google.com/drive/1OZltaPSvlj2Y6p40lnbZVbAxOaQ4Do1E?usp=sharing
import torch
from torch.utils.data import Dataset
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from torch.optim import AdamW

# Load a pre-trained BERT checkpoint and its tokenizer (2 labels: positive/negative)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Simplified training data for demonstration
train_texts = ["I love programming!", "I hate bugs."]
train_labels = [1, 0]  # 1 = positive sentiment, 0 = negative sentiment

# Tokenize the texts (convert text to numerical format accepted by the model)
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=128)
# Custom Dataset class to handle tokenization and labels
class TextDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings  # Encoded inputs (input_ids, attention_mask)
        self.labels = labels  # Corresponding labels for the texts

    def __len__(self):
        return len(self.labels)  # Number of data points in the dataset

    def __getitem__(self, idx):
        # Return the necessary input items and labels as tensors
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])  # Add labels to the dictionary
        return item
# Create the Dataset object for training
train_dataset = TextDataset(train_encodings, train_labels)
# Hyperparameters for fine-tuning
learning_rate = 2e-5 # Learning rate: The step size for each update during training.
batch_size = 8 # Batch size: The number of training samples processed together in each step.
num_epochs = 3 # Number of epochs: The number of complete passes through the training dataset.
warmup_steps = 500 # Warmup steps: Gradually increase the learning rate for the first few steps to prevent instability.
weight_decay = 0.01 # Weight decay: Regularization term that helps prevent overfitting by penalizing large weights.
max_grad_norm = 1.0 # Max gradient norm: Clipping value for gradients to avoid exploding gradients.
# Set up the optimizer (AdamW is commonly used for fine-tuning transformer-based models)
optimizer = AdamW(model.parameters(), lr=learning_rate, weight_decay=weight_decay)
# Training arguments (These configure various aspects of the training process)
training_args = TrainingArguments(
    output_dir='./results',  # Output directory to store model checkpoints and results
    # Note: evaluation_strategy was renamed to eval_strategy in newer transformers versions
    eval_strategy="epoch",  # Evaluation strategy: perform evaluation after every epoch
    learning_rate=learning_rate,  # Learning rate hyperparameter used in optimization
    per_device_train_batch_size=batch_size,  # Batch size per device (e.g., GPU/CPU)
    num_train_epochs=num_epochs,  # Number of epochs to train the model
    weight_decay=weight_decay,  # Weight decay used to regularize the model
    warmup_steps=warmup_steps,  # Number of steps for learning rate warmup
    max_grad_norm=max_grad_norm,  # Maximum gradient norm for gradient clipping
    logging_dir='./logs',  # Directory to store logs during training
    logging_steps=10,  # Log training details every 10 steps
    report_to="none",  # Disable reporting to external trackers like Weights & Biases
)
# Trainer setup (This is a convenient wrapper around the training loop)
trainer = Trainer(
    model=model,  # The model that will be fine-tuned
    args=training_args,  # The training arguments with all hyperparameters
    train_dataset=train_dataset,  # The training dataset
    eval_dataset=train_dataset,  # For demo only; use a separate eval dataset in practice
    optimizers=(optimizer, None),  # The optimizer to use; a scheduler can be passed as the second element
)
# Start the fine-tuning process
trainer.train()
# Save the fine-tuned model after training
model.save_pretrained("./fine_tuned_bert") # Save model weights and configuration for later use
# Sample prediction using the fine-tuned model
inputs = tokenizer("I enjoy coding very much!", return_tensors="pt", padding=True, truncation=True, max_length=128)
outputs = model(**inputs) # Pass the inputs through the model
logits = outputs.logits # Extract logits from the model's output
prediction = torch.argmax(logits, dim=1) # Convert logits to predicted label
print("Predicted Label:", prediction.item()) # Output the predicted label (1 = positive, 0 = negative)
Fine-Tuning BERT for the Bio-Medical Domain
Now let's deep-dive into a very specific fine-tuning task for the bio-medical domain.
Imagine we have a labelled medical dataset like:
{
  "qas": [
    {
      "id": "56bc751eac7ad10019000013_001",
      "question": "Name synonym of Acrokeratosis paraneoplastica.",
      "answers": [
        {
          "text": "Bazex syndrome",
          "answer_start": 31
        }
      ]
    }
  ],
  "context": "Acrokeratosis paraneoplastica (Bazex syndrome): report of a case associated with small cell lung carcinoma and review of the literature. Acrokeratosis paraneoplastic (Bazex syndrome) is a rare, but distinctive paraneoplastic dermatosis characterized by erythematosquamous lesions located at the acral sites and is most commonly associated with carcinomas of the upper aerodigestive tract. We report a 58-year-old female with a history of a pigmented rash on her extremities, thick keratotic plaques on her hands, and brittle nails. Chest imaging revealed a right upper lobe mass that was proven to be small cell lung carcinoma. While Bazex syndrome has been described in the dermatology literature, it is also important for the radiologist to be aware of this entity and its common presentations."
}
Reference of the dataset : https://www.kaggle.com/datasets/maverickss26/bioasq-dataset
We want to fine-tune a BERT model on the above dataset.
Most of the code is self-explanatory; a condensed sketch of the core training step follows.
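Here is a condensed, illustrative sketch of the core extractive-QA fine-tuning step on one record; the bert-base-uncased checkpoint, hyperparameters, and single-example loop are assumptions for illustration, not the exact notebook code:

import torch
from torch.optim import AdamW
from transformers import BertTokenizerFast, BertForQuestionAnswering

# One simplified record in the BioASQ-style format shown above
context = ("Acrokeratosis paraneoplastica (Bazex syndrome): report of a case "
           "associated with small cell lung carcinoma and review of the literature.")
question = "Name synonym of Acrokeratosis paraneoplastica."
answer_text, answer_start = "Bazex syndrome", 31  # character offset into the context

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")

# Encode question + context together; offsets map each token back to characters
enc = tokenizer(question, context, truncation=True, max_length=384,
                return_offsets_mapping=True, return_tensors="pt")
offsets = enc.pop("offset_mapping")[0].tolist()
seq_ids = enc.sequence_ids(0)  # None = special token, 0 = question, 1 = context

# Convert the character-level answer span into start/end token indices
answer_end = answer_start + len(answer_text)
start_tok = next(i for i, (s, e) in enumerate(offsets)
                 if seq_ids[i] == 1 and s <= answer_start < e)
end_tok = next(i for i, (s, e) in enumerate(offsets)
               if seq_ids[i] == 1 and s < answer_end <= e)

# One gradient step: the model learns to predict the answer's start and end tokens
optimizer = AdamW(model.parameters(), lr=3e-5)
model.train()
loss = model(**enc, start_positions=torch.tensor([start_tok]),
             end_positions=torch.tensor([end_tok])).loss
loss.backward()
optimizer.step()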
After training, a question like
Question: What are the symptoms of COVID-19?
produces an answer drawn from the biomedical corpus, e.g. Answer: mitochondrial fission
Hopefully this was helpful - in the next article we will build an agentic application using everything we've learned, with RAG, LangChain, and LLaMA.