#4 - Fine Tuning An LLM

Base Models & Fine Tuned Models

In the GenAI world, terms like GPT-3.5, GPT-4, Gemma, and LLaMA have become increasingly common. You might be familiar with them. These are what we refer to as base models – powerful Large Language Models (LLMs) that have been extensively pre-trained on vast datasets from the internet, up to a specific knowledge cutoff date. For instance, the instruction-tuned Gemma variants are fine-tuned versions of the Gemma base model, and ChatGPT is a fine-tuned version of its underlying GPT base model.

At their core, these base models are sophisticated transformers designed just to predict the next token in a structured sequence. While incredibly versatile, their initial training is general-purpose. They weren't specifically built for the nuanced, conversational interactions we frequently engage in today. This is where fine-tuning becomes essential. It's the process of adapting a pre-trained base model to perform specific tasks or adhere to particular styles. Think of it as tailoring a general-purpose tool to fit a specialized need.

The progression typically looks like this:

Base Model (Pre-trained Transformer) → [Full Parameter Fine-Tuning] → 
Fine-Tuned Model (Inference-able) → [Domain-Specific Fine-Tuning] → 
Coding Agent, Article Writer, etc.

This sequential process allows us to transform a broadly capable LLM into a highly specialized application, such as a coding assistant or a content generator. Developing such models from the ground up, like training a foundational GPT, demands substantial computational resources, particularly immense GPU power, and full parameter fine-tuning is similarly demanding. When it comes to customizing an LLM, there are two primary methodologies:

  1. Full Parameter Fine-Tuning

  2. LoRA (Low-Rank Adaptation) Fine-Tuning

In this article, we will take a look behind the scenes at how these two types of fine-tuning work.

Full Parameter Fine-Tuning: Deep Customization


Full parameter fine-tuning involves directly modifying the ‘weights’ of the LLM. These weights are numerical values that dictate the strength of connections between the neurons or nodes within the model's neural network. During full parameter fine-tuning, these values are adjusted based on your specific dataset.
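
To make the idea of weights concrete, here is a small sketch (using the openly available "gpt2" model purely as a stand-in) that prints a few of a model's weight tensors and their shapes – it is exactly these numerical values that full parameter fine-tuning adjusts:

# Weights are just large tensors of numbers attached to the model's layers.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")   # small stand-in model
for name, param in list(model.named_parameters())[:3]:
    print(name, tuple(param.shape))                    # layer name and its weight shape
print(sum(p.numel() for p in model.parameters()), "parameters in total")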

Consider an illustrative example:

Suppose you present an image to an LLM, expecting it to identify "Cat."

  • Initially, the LLM might return an incorrect or nonsensical output.

  • Upon observing this discrepancy between the desired output ("Cat") and the actual output, you would then update the model's weights.

  • This adjustment aims to increase the likelihood of the LLM accurately identifying "Cat" in similar future instances.

  • This adjustment is driven by backpropagation, which works out how much each weight contributed to the error so that the model's internal parameters can be nudged in the right direction.

  • The cycle of feeding input, evaluating output, and adjusting weights is repeated until the LLM consistently provides the expected "Cat" output (a minimal code sketch of this loop follows the list).
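
The same feed-evaluate-adjust cycle can be written down in a few lines of PyTorch. This is only a toy sketch – a tiny linear model and made-up tensors standing in for a real image model – but the loop is structurally the same one used in full parameter fine-tuning:

import torch
import torch.nn as nn

model = nn.Linear(4, 2)                                   # stand-in for a much larger network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(1, 4)                                     # stand-in for the encoded image
expected = torch.tensor([0])                              # class 0 plays the role of "Cat"

for step in range(100):
    logits = model(x)                                     # feed the input, get the current output
    loss = loss_fn(logits, expected)                      # gap between actual and expected output
    optimizer.zero_grad()
    loss.backward()                                       # backpropagation: compute weight gradients
    optimizer.step()                                      # adjust the weights to make "Cat" more likely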

Now one may wonder - Is System Prompting Fine-Tuning?

As in, can we consider providing a system prompt to the LLM like "You are a helpful AI assistant…" or "Act like a Java Expert…" a form of fine-tuning?

The answer is no. System prompting instructs the model on how to behave for a particular interaction without altering its underlying architecture or weights. It's akin to giving specific instructions to a professional, rather than fundamentally changing their skills or knowledge base. No changes are made to the model's weights during system prompting.
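
As a concrete illustration, the sketch below sends a system prompt with an ordinary chat request (OpenAI-style Python client; the model name is just a placeholder). The instruction lives only in the request – the model's weights are untouched:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": "Act like a Java expert."},    # instruction, not training
        {"role": "user", "content": "How do I read a file in Java?"},
    ],
)
print(response.choices[0].message.content)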

For individuals without access to high-end computing resources, Google Colab offers a viable solution for fine-tuning. It provides access to GPUs (free and paid), making it an excellent platform for experimentation and development in this domain.

Fine-Tuning Proprietary Models

Models like ChatGPT can indeed be fine-tuned, typically as a paid service offered by the model provider. The process usually involves submitting your custom dataset, often in a .jsonl file (JSON Lines format), which contains pairs of inputs and their corresponding desired outputs. The provider then trains the model on your data, making the fine-tuned version accessible via their API. However, direct access to the base model for independent training is generally not provided.
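
As a rough sketch, a .jsonl training file holds one JSON object per line, each containing an input/output pair – here in the chat-message layout many providers accept (field names follow OpenAI's fine-tuning format; check your provider's documentation for the exact schema):

import json

examples = [
    {"messages": [
        {"role": "system", "content": "You are a polite customer-support agent."},
        {"role": "user", "content": "Where is my order?"},
        {"role": "assistant", "content": "Sorry for the delay! Could you share your order ID?"},
    ]},
]

with open("train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")   # one JSON object per line = JSON Lines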

It's important to reiterate that fine-tuning, especially full parameter fine-tuning, is resource-intensive. This demand has led to the emergence of specialized services, such as 'Replicate,' which rent out GPU resources for intensive model training tasks.


LoRA: An Efficient Fine-Tuning Alternative

Given the resource demands of full parameter fine-tuning, LoRA (Low-Rank Adaptation) fine-tuning presents an efficient alternative. In LoRA, the majority of the base model's parameters (weights) are "frozen." Instead of updating the entire model, LoRA introduces small, low-rank matrices that capture the necessary changes. These matrices are then combined with the frozen base model during inference.

The key advantage of LoRA is significantly reduced GPU and compute requirements for training, though the frozen base model still has to be held in memory alongside the adapter matrices. The trade-off for this efficiency is that LoRA fine-tuning may result in slightly lower accuracy compared to full parameter fine-tuning. It represents a pragmatic balance between performance and resource consumption.

To build an intuition for how this works, let's imagine we want to LoRA fine-tune an LLM: we ask "what is 2 + 2?" and the LLM answers "100". Here the error is -96 (expected minus actual, i.e. 4 - 100 = -96). Because the model is "frozen" in LoRA, we cannot change its weights; instead we learn a small correction that sits alongside the frozen model, so that whenever the user asks "what is 2 + 2?" and the frozen model would again produce "100", adding the learned correction of -96 yields the right answer, 4 (100 + (-96) = 4). In the actual technique the corrections are not added to the output but stored as low-rank matrices that are added to the frozen weights; the intuition is the same, and it also explains why both the full base model and the extra adapter matrices have to be kept in memory.
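
In practice, LoRA is usually applied with a library such as Hugging Face's peft. The sketch below (the model name and target modules are assumptions – adjust them for the model you actually use) freezes the base model and attaches small trainable adapter matrices:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("google/gemma-2b")   # placeholder base model

lora_config = LoraConfig(
    r=8,                                    # rank of the low-rank update matrices
    lora_alpha=16,                          # scaling factor applied to the update
    target_modules=["q_proj", "v_proj"],    # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)   # base weights frozen, adapters trainable
model.print_trainable_parameters()                # typically well under 1% of all parameters
# Train `model` as usual; gradients flow only into the adapter matrices.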


RAG vs. Fine-Tuning: Choosing the Right Approach

A critical decision in GenAI development is discerning when to employ Retrieval Augmented Generation (RAG) or “AI agents” versus fine-tuning.

There is no clear-cut rule for this, but:

  • RAG is most effective when working with real-time or frequently updated data. It enables an LLM to retrieve pertinent information from an external knowledge base at the time of inference and then generate a response based on this retrieved data. This is ideal for scenarios requiring access to the most current information.

  • Fine-tuning, conversely, is preferred when working with static documents or when the objective is to imbue the model with a specific style, tone, or domain-specific knowledge embedded within your dataset.

In many advanced applications, combining both RAG and fine-tuning can yield superior results. Also, many LLM providers now expose OpenAI-compatible APIs, simplifying the integration of different models into diverse applications.
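
Because of that compatibility, the same client code can often talk to different models just by changing the endpoint. A minimal sketch (the base_url and model name below are placeholders, not real endpoints):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")
response = client.chat.completions.create(
    model="my-fine-tuned-model",   # whatever model the server exposes
    messages=[{"role": "user", "content": "Summarise this release note in two lines."}],
)
print(response.choices[0].message.content)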


Illustrative Pseudocode for Full Parameter Fine-Tuning

To provide a clearer understanding of the practical steps involved, here's a simplified pseudocode for full parameter fine-tuning:

// Step 1: Verify GPU availability and activation
Check GPU is active 

// Step 2: Install the necessary libraries, especially from Hugging Face (the Hub is like a 'GitHub for LLM models')
Install HuggingFace’s transformers library 

// Step 3: Obtain authentication token for model access
Get a token from HuggingFace (for model retrieval) 

// Step 4: Configure environment variables
Import OS to set the HuggingFace token as an environment variable 
Create a device (for tokenizer and model operations) 

// Step 5: Import essential LLM components
Import ‘AutoTokenizer’ and ‘AutoModelForCausalLM’ libraries 

// Step 6: Select the pre-trained model for fine-tuning
Select and set the base model to fine-tune (e.g., Gemma) 

// Step 7: Load the model's tokenizer
Using ‘AutoTokenizer’ library, pull the tokenizer corresponding to the selected model 

// Step 8: Define a chat-specific input format
Setup a chat template (to convert chat-like queries into the model's expected tokenized format) 

// Step 9: Load the model onto the designated device
Pull the model to device from HuggingFace 

// Step 10: Prepare your custom training dataset
Set the dataset (comprising input and expected output pairs) 

// Step 11: Construct full conversation sequences for training
For each example in the dataset:
    full_conversation = input_tokens + output_tokens + tokenizer.eos_token 

// Step 12: Set the model to training mode
Put the model in training mode 

// Step 13: Execute the training loop
[In a loop, for a specified number of epochs or until desired performance is reached]:
    Feed the input to the model
    Obtain the model's generated output
    Calculate the loss by comparing the expected output with the actual output
    Perform backpropagation to adjust the model's weights based on the calculated loss

// Step 14: Save and share the fine-tuned model
Once training is complete, push the fine-tuned model to your own repository in HuggingFace 

// Step 15: The fine-tuned model is now ready for use
Fine Tuned Model is ready
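
For reference, the steps above condense into roughly the following Python (a minimal sketch, not production code: the model name, dataset, and hyperparameters are placeholders, and a real run needs a GPU with enough memory for the chosen model):

import os
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

os.environ["HF_TOKEN"] = "<your-huggingface-token>"              # Steps 3-4: auth + environment
device = "cuda" if torch.cuda.is_available() else "cpu"          # Step 1: check the GPU

model_name = "google/gemma-2b"                                   # Step 6: base model (placeholder)
tokenizer = AutoTokenizer.from_pretrained(model_name)            # Step 7: tokenizer
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)   # Step 9: model on device

# Steps 10-11: a tiny dataset of (input, expected output) pairs joined into full sequences
# (Step 8's chat template is skipped here to keep the sketch short)
dataset = [("What is the capital of France?", "Paris")]
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()                                                    # Step 12: training mode
for epoch in range(3):                                           # Step 13: training loop
    for prompt, answer in dataset:
        text = prompt + " " + answer + tokenizer.eos_token
        batch = tokenizer(text, return_tensors="pt").to(device)
        outputs = model(**batch, labels=batch["input_ids"])      # loss vs. expected tokens
        outputs.loss.backward()                                  # backpropagation
        optimizer.step()                                         # adjust the weights
        optimizer.zero_grad()

model.push_to_hub("your-username/your-fine-tuned-model")         # Step 14: share the result
tokenizer.push_to_hub("your-username/your-fine-tuned-model")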

Conclusion

Fine-tuning LLMs offers a powerful avenue for customizing these advanced models to meet specific requirements, transforming them from general-purpose tools into highly specialized assets. Whether through comprehensive full parameter fine tuning or the more efficient LoRA fine tuning, the ability to tailor LLMs opens up significant possibilities in AI application development.


Reference

  1. GenAI cohort by Hitesh Chaudhary and Piyush Garg