Optimize Your AI Models: A Deep Dive into LoRA Techniques


Imagine you have a pre-trained model that works well on task A. Now you are given some similar tasks, B, C, and D. How do you fine-tune your base model for these new tasks? Can we update only a small subset of parameters for each task and run those training jobs in parallel? LoRA (Low-Rank Adaptation) makes this possible. So how does it work?
How does it work?
Consider a weight matrix W in a transformer (e.g., in a self-attention layer).
Instead of updating W, LoRA adds a low-rank decomposition:
W′ = W + ΔW = W + AB
W: the frozen original weights (say d × d)
A (d × r) and B (r × d): trainable matrices, where the rank r is small compared to d (e.g., 4 or 8), so that AB has the same shape as W.
During training, only A and B are updated. The rest of the model stays unchanged.
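To make this concrete, here is a minimal PyTorch sketch of a LoRA-style linear layer. The class name LoRALinear, the initialization, and the α/r scaling are illustrative choices of mine, not a specific library's implementation:
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Minimal sketch: a frozen base weight W plus a trainable low-rank update,
    # so the effective weight is W' = W + AB (scaled by alpha / r).
    def __init__(self, d_in, d_out, r=4, alpha=8.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad = False              # W: frozen
        self.A = nn.Parameter(torch.randn(d_in, r) * 0.01)  # A: trainable, d_in x r
        self.B = nn.Parameter(torch.zeros(r, d_out))        # B: trainable, r x d_out; zero init so AB starts at 0
        self.scale = alpha / r

    def forward(self, x):
        # Frozen path plus the low-rank path x -> xA -> xAB
        return self.base(x) + (x @ self.A) @ self.B * self.scale
Only A and B receive gradients; the base weight is never touched, which is exactly the property the next section exploits.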
What are the benefits for training and inference?
During fine-tuning, backpropagation only needs to compute gradients for the low-rank matrices A and B, which cuts the computational and memory load and speeds up training.
During inference, the low-rank update AB can be merged into the pre-trained weight (W′ = W + AB), so the deployed model has exactly the same size and structure as the original, and LoRA introduces no additional inference latency.
Moreover, this merge is a simple matrix addition that can be performed in parallel across all adapted weight matrices, and it costs less than a single forward pass, so switching between task-specific adapters is cheap.
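As a sketch of that merge, reusing the illustrative LoRALinear class above (not a specific library API):
def merge_lora(layer: LoRALinear) -> None:
    # Fold the low-rank update into the frozen weight in place:
    # W <- W + scale * (AB)^T, so inference runs a single plain matmul.
    with torch.no_grad():
        layer.base.weight += (layer.A @ layer.B).T * layer.scale
        # Zero out B afterwards so the low-rank path no longer adds anything
        # and the update is not applied twice in forward().
        layer.B.zero_()
Because the merge is just an addition of two equally sized matrices, it can also be undone (by subtracting the same term) when you want to swap in a different adapter.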
Other benefits of LoRA
If the model underperforms, you have a clear direction for improving the result: increase the rank r. With other parameter-efficient methods such as prefix-tuning, BitFit, or adapters, there is no comparably clear knob to turn after a disappointing training run.
Reduction of checkpoint size. For example, for GPT-3 175B the LoRA paper reports the per-task checkpoint shrinking from roughly 350 GB to about 35 MB, a reduction of roughly 10,000×.
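As a back-of-the-envelope check on why the savings are so large, assume a single square d × d projection with GPT-3's hidden size d = 12,288 and rank r = 4 (the numbers are illustrative):
d, r = 12288, 4
full_ft_params = d * d      # ~151M parameters updated by full fine-tuning for this one matrix
lora_params = 2 * d * r     # 98,304 parameters in A and B
print(f"{100 * lora_params / full_ft_params:.3f}% of the matrix is trained")  # ~0.065%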
Best practices in engineering
1. Cache as many LoRA modules as possible in RAM during deployment, so that model switching only involves transferring data between RAM and VRAM. Since RAM is usually much larger than VRAM, we can cache thousands of LoRA modules and never worry about reading from disk again (e.g., using the peft library from Hugging Face, as in the snippet below).
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load the shared, frozen base model once
base_model = AutoModelForCausalLM.from_pretrained("base-model")

# Path to a LoRA adapter that is already cached locally
adapter_name = "cached/lora_adapter_math"

# Attach the LoRA adapter to the base model
model = PeftModel.from_pretrained(base_model, adapter_name)
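If several adapters have already been pulled into memory, switching between them does not require reloading the base model. Assuming peft's multi-adapter API (load_adapter / set_adapter), and with hypothetical adapter paths, this looks roughly like:
# Attach more cached adapters under distinct names (paths are hypothetical)
model.load_adapter("cached/lora_adapter_code", adapter_name="code")
model.load_adapter("cached/lora_adapter_chat", adapter_name="chat")

# Switching tasks is now a cheap in-memory swap, no disk read involved
model.set_adapter("code")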
2. Train multiple LoRA modules in parallel, each on its own task. This is achieved by sharing the same frozen base model and routing different inputs within a single batch through different LoRA modules; see the sketch after this paragraph. Batching multiple LoRA jobs together this way keeps the GPU fully utilized.
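A minimal sketch of that routing idea (shapes and function name are my own; production systems such as S-LoRA and mLoRA implement this far more efficiently):
import torch

def batched_lora_linear(x, W, A, B, adapter_ids):
    # x: (batch, d_in) inputs from different tasks mixed into one batch
    # W: (d_out, d_in) shared frozen base weight
    # A: (num_adapters, d_in, r) and B: (num_adapters, r, d_out) per-task LoRA weights
    # adapter_ids: (batch,) index of the adapter each example belongs to
    base = x @ W.T                                      # one shared matmul for the whole batch
    xa = torch.einsum("bi,bir->br", x, A[adapter_ids])  # per-example projection into rank-r space
    delta = torch.einsum("br,bro->bo", xa, B[adapter_ids])
    return base + delta                                 # frozen path + task-specific low-rank update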
References
LoRA GitHub repository
Project for running LoRA modules in parallel: S-LoRA
Project for running LoRA modules in parallel: BLoRA
Project for running LoRA modules in parallel: mLoRA