How LoRA Makes Model Fine-Tuning Cheaper and Faster

As large language models continue to grow in size, fine-tuning them for specific tasks demands increasingly more VRAM. LoRA (Low-Rank Adaptation) offers a solution to this problem. Instead of retraining the entire model, LoRA injects small, trainable low-rank matrices that adjust the model’s behavior with minimal overhead and no extra inference latency. In this post, we will break down how LoRA works. The method described here is based on the original paper: LoRA: Low-Rank Adaptation of Large Language Models.
Low-rank Decomposition
The core of LoRA is low-rank decomposition. Let’s say the pretrained weight matrix W has a shape of d × k. During fine-tuning, instead of updating W directly, LoRA models the update to W (denoted as ∆W) as the product of two much smaller matrices:
$$\Delta W = BA$$
B is a matrix of shape d × r
A is a matrix of shape r × k
r is the rank of the LoRA module (typically much smaller than d and k)
In this setup, only A and B are trainable parameters during fine-tuning, while the original weights W remain frozen. This significantly reduces the number of trainable parameters.
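To see how large the savings are, here is a quick back-of-the-envelope calculation; the dimensions below are made up purely for illustration:

```python
# Hypothetical dimensions, chosen only to illustrate the parameter savings
d, k, r = 4096, 4096, 8

full_update = d * k          # updating W directly: 16,777,216 parameters
lora_update = d * r + r * k  # B (d x r) plus A (r x k): 65,536 parameters

print(full_update / lora_update)  # 256.0 -> roughly 256x fewer trainable parameters
```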
At inference time, the modified forward pass looks like:
$$Wx + \Delta W x = Wx + BAx$$
LoRA also applies a scaling factor α / r to the update to control its impact:
$$\Delta W = \frac{\alpha}{r} BA$$
This makes it easy to tune how strongly LoRA affects the model’s behavior.
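Putting the pieces together, here is a minimal sketch of a LoRA-augmented linear layer in PyTorch. This is an illustration of the math above, not the actual peft implementation; following the paper, A is initialized with small random values and B with zeros, so ∆W starts at zero and the model’s behavior is unchanged when fine-tuning begins.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA layer: y = Wx + (alpha / r) * B A x, with W frozen."""
    def __init__(self, d_in, d_out, r=8, alpha=16):
        super().__init__()
        self.W = nn.Linear(d_in, d_out, bias=False)
        self.W.weight.requires_grad_(False)                   # pretrained weight stays frozen
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)    # A: r x d_in, small random init
        self.B = nn.Parameter(torch.zeros(d_out, r))          # B: d_out x r, zero init
        self.scaling = alpha / r

    def forward(self, x):
        # frozen path + scaled low-rank update
        return self.W(x) + self.scaling * (x @ self.A.T @ self.B.T)
```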
Benefits
The key benefits of LoRA are its reduced memory and storage requirements. By using a low rank r, it significantly decreases the number of trainable parameters, which in turn reduces training time and GPU memory usage. Another advantage is the ability to easily switch between tasks by simply swapping out the LoRA weights, which makes deployment more efficient and cost-effective, especially when handling multiple downstream tasks.
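For example, peft lets you attach saved adapters to a single copy of the base model and switch between them at runtime. In the sketch below, the adapter directories and adapter names are hypothetical placeholders:

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# One copy of the frozen base model
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")

# Attach a first task-specific adapter, then register a second one
# ("./adapter-task-a" and "./adapter-task-b" are hypothetical output directories)
model = PeftModel.from_pretrained(base, "./adapter-task-a", adapter_name="task_a")
model.load_adapter("./adapter-task-b", adapter_name="task_b")

# Switching tasks is just a matter of activating a different adapter
model.set_adapter("task_b")
```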
Applying to Transformer models
Where to Apply LoRA in Transformers
According to the experiments in the LoRA paper, adapting multiple attention weight matrices with a lower rank (e.g., r = 4) performs better than adapting a single matrix with a higher rank (e.g., r = 8). The best results came from applying LoRA to both the query and value weight matrices, suggesting that wider adaptation across weights is more effective than deeper adaptation of just one.
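If you want to mirror that setup with peft, you can restrict target_modules to the query and value projections. Note that the module names below ("q_proj", "v_proj") are an assumption that holds for Llama/Qwen-style models; other architectures may name these layers differently:

```python
from peft import LoraConfig

# Adapt only the query and value projection matrices, as in the paper's best-performing setup.
# NOTE: "q_proj" / "v_proj" are the names used by Llama/Qwen-style models;
# check your model's architecture for the actual module names.
peft_config = LoraConfig(
    r=4,
    lora_alpha=8,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
```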
Optimal r for LoRA
The paper also explores how different rank values affect performance. Validation accuracy on WikiSQL and MultiNLI shows that smaller ranks (e.g., r = 4) can achieve competitive or even superior results compared to higher ranks. This highlights LoRA's efficiency: good performance with minimal trainable parameters.
LoRA in Action: Training with SFTTrainer
Below is an example of how to fine-tune a model using LoRA with Hugging Face’s SFTTrainer and peft (the code is from Hugging Face’s TRL guide). In this setup, we load the trl-lib/Capybara dataset and fine-tune the Qwen2.5-0.5B model using low-rank adaptation.
```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer
from peft import LoraConfig

# Load the instruction-tuning dataset
dataset = load_dataset("trl-lib/Capybara", split="train")

# LoRA configuration: only the injected low-rank matrices (plus the modules
# listed in modules_to_save) are trained; the base weights stay frozen.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules="all-linear",
    modules_to_save=["lm_head", "embed_token"],
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    "Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,
    args=SFTConfig(output_dir="Qwen2.5-0.5B-SFT"),
    peft_config=peft_config,
)
trainer.train()
```
The LoraConfig specifies how LoRA is applied:
r=16: The rank of the low-rank matrices A and B, controlling the capacity of the LoRA module.
lora_alpha=32: A scaling factor that adjusts the strength of the LoRA updates.
lora_dropout=0.05: Dropout applied to the LoRA layers during training, helping prevent overfitting.
target_modules="all-linear": Applies LoRA to all linear layers in the model.
modules_to_save=["lm_head", "embed_token"]: Ensures these components are kept trainable and saved alongside the adapted weights.
task_type="CAUSAL_LM": Specifies the task type for LoRA, in this case causal language modeling.
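A practical note related to the "no extra inference latency" point from the introduction: once training is done (and assuming the adapter has been saved, for example with trainer.save_model()), the low-rank update can be merged back into the base weights so inference uses a single matrix multiply. In peft this is exposed as merge_and_unload():

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
# "Qwen2.5-0.5B-SFT" is the output_dir from the example above, assuming the adapter was saved there
model = PeftModel.from_pretrained(base, "Qwen2.5-0.5B-SFT")

merged = model.merge_and_unload()  # folds W + (alpha/r) * B A into a single weight matrix
merged.save_pretrained("Qwen2.5-0.5B-merged")
```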
Conclusion
LoRA is a powerful and efficient technique for fine-tuning large language models, especially when working under memory or compute constraints. By introducing a low-rank update mechanism that leaves the original weights frozen, LoRA drastically reduces the number of trainable parameters while maintaining strong performance across tasks. Its modularity also makes task-switching and deployment lightweight and practical.
As shown in both the theory and practical example above, LoRA enables scalable, cost-effective fine-tuning — making it an essential tool for anyone working with large-scale models. For further details, refer to the original paper: LoRA: Low-Rank Adaptation of Large Language Models.