Everything You Need to Know About Fine-Tuning with Adapters

Table of contents
- What Are Adapters?
- How Do Adapters Work?
- When Are Adapters Used?
- Why Are Adapters Used?
- When Not to Use Adapters
- Pros and Cons of Adapters
- What Does an Adapter Layer Constitute?
- Mathematical Intuition Behind Adapter Layers in NLP
- Why Do Adapters Use a Bottleneck Structure?
- Where Should You Add Adapter Layers—and How Many?
- General Best Practices by Model Type
- Conclusion

In this blog, we’ll learn about adapters, a Parameter-Efficient Fine-Tuning (PEFT) technique in which small, trainable neural layers are inserted between the frozen layers of a pretrained model.
What Are Adapters?
Adapters are small, trainable neural layers inserted between the frozen layers of a pretrained transformer model. Instead of modifying the entire model, adapters enable parameter-efficient fine-tuning by updating only a small set of task-specific parameters. This drastically reduces memory usage and computational overhead.
Originally introduced by Houlsby et al. (2019), adapters are designed to capture task-specific knowledge while keeping the core transformer layers intact. This makes them an ideal choice for scenarios where compute resources are limited or model generalization needs to be preserved.
How Do Adapters Work?
Adapters introduce lightweight bottleneck layers (typically a down-projection followed by a non-linearity and an up-projection) within each transformer block. During fine-tuning:
- The original model parameters remain frozen.
- Only the adapter layers are trained, making the process more efficient and modular.
This structure ensures that the base model retains its pretrained capabilities, while the adapters specialize the model for new tasks.
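To make this concrete, below is a minimal PyTorch sketch of the training setup. It assumes a hypothetical `model` whose adapter submodules carry "adapter" in their parameter names; that naming convention and the optimizer settings are illustrative, not part of any particular library.

```python
import torch

def mark_adapter_params_trainable(model: torch.nn.Module) -> None:
    """Freeze the pretrained backbone; leave only adapter parameters trainable.

    Assumes adapter submodules contain "adapter" in their parameter names.
    """
    for name, param in model.named_parameters():
        param.requires_grad = "adapter" in name

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"Training {trainable:,} of {total:,} parameters "
          f"({100 * trainable / total:.2f}%)")

# Only the trainable (adapter) parameters are handed to the optimizer:
# optimizer = torch.optim.AdamW(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-4)
```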
When Are Adapters Used?
Adapters are particularly useful in situations where efficient, scalable fine-tuning is needed. Common use cases include:
- Large-Scale Model Fine-Tuning: Fine-tune large models like BERT, GPT, or LLaMA for specific NLP tasks without updating billions of parameters.
- Multi-Task Learning: Use a single base model with multiple adapter modules, each tuned for a different task (e.g., translation, classification, summarization), allowing quick task switching without retraining.
- Domain-Specific Adaptation: Train adapters on specialized domains such as finance, legal, or medical texts while preserving the general language understanding of the base model.
- Efficient Deployment: Adapters introduce only a small number of parameters, making models lightweight and easier to store, load, and deploy compared to fully fine-tuned models.
- Low-Resource Training: Since only the adapter layers are trained, fine-tuning can be performed on single GPUs or even CPU-based environments, making adapters ideal for low-resource settings.
Example Scenarios
- Sentiment Analysis with BERT: Attach an adapter to BERT to fine-tune it for sentiment classification while preserving its broad language understanding.
- Multi-Lingual Translation with T5: Use different adapters for each target language, enabling a single T5 model to support multiple translation tasks.
- Task-Switching in Chatbots: Equip a multi-domain chatbot with task-specific adapters (e.g., customer support, FAQs, product recommendations), allowing seamless switching between domains (see the sketch below).
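As a rough illustration of the chatbot scenario, the sketch below keeps one frozen backbone and a dictionary of per-task bottleneck adapters. The hidden size, bottleneck width, and task names are assumptions made for the example; in a real system the hidden states would come from a pretrained transformer.

```python
import torch
import torch.nn as nn

hidden_size, bottleneck = 768, 64  # illustrative dimensions

def make_adapter() -> nn.Sequential:
    return nn.Sequential(
        nn.Linear(hidden_size, bottleneck),  # down-projection
        nn.GELU(),                           # non-linearity
        nn.Linear(bottleneck, hidden_size),  # up-projection
    )

# One adapter per task; each is trained separately and stored on disk,
# then swapped in at inference time without touching the frozen backbone.
task_adapters = nn.ModuleDict({
    "customer_support": make_adapter(),
    "faq": make_adapter(),
    "recommendation": make_adapter(),
})

def apply_task_adapter(hidden_states: torch.Tensor, task: str) -> torch.Tensor:
    # Residual connection keeps the backbone's representation intact.
    return hidden_states + task_adapters[task](hidden_states)

# Example: route the same hidden states through different task adapters.
h = torch.randn(1, 10, hidden_size)
out_support = apply_task_adapter(h, "customer_support")
out_faq = apply_task_adapter(h, "faq")
```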
Why Are Adapters Used?
Adapters offer a compelling alternative to full or partial fine-tuning, especially when working with large-scale language models. Here’s why they’re widely adopted:
- Memory & Computational Efficiency: Fine-tuning large transformer models (like GPT-3 with 175B parameters) is resource-intensive. Adapters drastically reduce the number of trainable parameters, making fine-tuning feasible on consumer GPUs without a major drop in performance.
- Retains Pretrained Knowledge: Full fine-tuning updates all weights, which can lead to catastrophic forgetting, where the model loses its general language understanding. Adapters preserve the original pretrained knowledge while injecting task-specific capabilities.
- Efficient Task Switching: In full fine-tuning, switching tasks typically requires loading a separate model for each task. With adapters, only the lightweight task-specific modules need to be swapped in, enabling multi-task learning without retraining.
- Reduces Overfitting on Small Datasets: When labeled data is limited, full fine-tuning can overfit. Adapters, by training fewer parameters, promote better generalization in low-resource scenarios.
- Faster Training & Seamless Inference: Adapter-based models train faster and remain nearly as fast as the base model during inference, making them practical for real-world deployment.
When Not to Use Adapters
While adapters are efficient and flexible, they may not be suitable in every situation:
- When You Need Maximum Accuracy: If achieving the highest possible task-specific accuracy is critical (e.g., in high-stakes applications like medical diagnostics), full fine-tuning may outperform adapters.
- When the Task Is Very Different from Pretraining: Adapters build on the assumptions of the pretrained model. If your task is radically different (e.g., using BERT, trained on natural language, for DNA sequence classification), adapters may not be expressive enough.
- When Working with Small Models: Adapters are primarily designed for large models. If you're using compact models like DistilBERT, full fine-tuning is often computationally affordable and more effective.
- When Ultra-Low Latency Is Required: Adapters add a small amount of inference overhead. In latency-sensitive environments, such as real-time systems, full fine-tuning or model distillation might be better suited.
Pros and Cons of Adapters
| Aspect | Pros | Cons |
| --- | --- | --- |
| Memory Efficiency | Requires far less memory than full fine-tuning. | Introduces slight inference overhead due to additional layers. |
| Computational Cost | Fewer trainable parameters lead to faster training. | May not match full fine-tuning performance on complex tasks. |
| Knowledge Retention | Preserves pretrained knowledge while adapting to new tasks. | Less effective when the task is vastly different from pretraining. |
| Multi-Task Learning | Easily switch between tasks by swapping adapters. | Managing multiple adapters for many tasks adds implementation overhead. |
| Scalability for LLMs | Supports fine-tuning of large models (GPT, BERT, T5, etc.) without touching core weights. | Not necessary for small models where full fine-tuning is viable. |
| Overfitting Reduction | Limits trainable parameters, helping avoid overfitting in low-data settings. | Can restrict the model’s ability to deeply learn highly task-specific features. |
What Does an Adapter Layer Constitute?
Adapter layers introduce task-specific transformations into a frozen transformer model. The core idea is to add a bottleneck structure that allows learning a small number of additional parameters, while keeping the main model unchanged.
An adapter layer typically includes three components:
1. Down-Projection Layer
   - Reduces the dimensionality of the incoming hidden representation (e.g., from 768 to 64 dimensions).
   - This bottleneck design minimizes the number of trainable parameters.
2. Non-Linear Activation Function
   - Applies a non-linear transformation (commonly ReLU or GELU) to introduce complexity into the representation.
   - Helps the model learn task-specific features beyond linear transformations.
3. Up-Projection Layer
   - Restores the transformed representation back to the original hidden size.
   - Ensures the modified output aligns dimensionally with the rest of the transformer layers.
Each of these layers contains trainable weights and biases, but their parameter count is significantly smaller compared to full model fine-tuning.
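To see just how small that parameter count is, here is a quick back-of-the-envelope calculation using BERT-base-like dimensions (hidden size 768, 12 layers) and a typical bottleneck of 64; the bottleneck width is an assumption, not a fixed rule.

```python
d, m = 768, 64                 # hidden size and bottleneck dimension
down = d * m + m               # down-projection weights plus bias
up = m * d + d                 # up-projection weights plus bias
per_adapter = down + up        # trainable parameters in one adapter
per_model = per_adapter * 12   # one adapter per layer in a 12-layer encoder

print(per_adapter)   # 99136   (~99K parameters)
print(per_model)     # 1189632 (~1.2M, roughly 1% of BERT-base's ~110M weights)
```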
Mathematical Intuition Behind Adapter Layers in NLP
The working of an adapter layer can be described in a step-by-step manner:
Step 1: Input Representation
Let the hidden representation from a transformer layer be denoted by:
$$h \in \mathbb{R}^{d}$$
Where \(d\) is the hidden size (e.g., 768 in BERT-base).
A hidden representation (or hidden state) refers to the internal vector that captures what the model has learned about the input sequence at a certain layer. It contains encoded information such as word meanings, syntactic roles, and contextual relationships derived from the input tokens. These representations evolve across the layers of a transformer and serve as the foundation for predictions and task-specific outputs.
Step 2: Down-Projection
This step reduces the dimensionality from \(d\) to a smaller bottleneck dimension \(m\):
$$z = W_d h + b_d$$
- \(W_d \in \mathbb{R}^{m \times d}\) is a learnable weight matrix.
- \(b_d \in \mathbb{R}^{m}\) is the bias term.
- \(z \in \mathbb{R}^{m}\) is the compressed representation.
This operation significantly reduces the number of trainable parameters, making fine-tuning efficient.
Step 3: Apply Non-Linearity
A non-linear activation function is applied to capture more complex relationships:
$$z' = \text{ReLU}(z) \quad \text{or} \quad z' = \text{GELU}(z)$$
This helps the adapter learn patterns beyond what linear operations can represent.
Step 4: Up-Projection
The non-linear output \(z'\) is projected back to the original hidden size:
$$h' = W_u z' + b_u$$
- \(W_u \in \mathbb{R}^{d \times m}\) is the up-projection weight matrix.
- \(b_u \in \mathbb{R}^{d}\) is the corresponding bias.
- \(h' \in \mathbb{R}^{d}\) matches the original dimensionality.
Step 5: Residual Connection
To preserve the original information from the pretrained model, a residual connection is used:
$$h_{\text{adapter}} = h + \lambda h'$$
Where \(\lambda \) is a scaling factor (often set to 1). This step ensures that the adapter modifies the representation without discarding the general language knowledge encoded in \(h\).
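The five steps above translate directly into a small PyTorch module. The following is a minimal sketch: the default dimensions, the choice of GELU, and the scaling factor of 1.0 mirror the description above rather than any specific published implementation.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter implementing Steps 1-5 above (sketch)."""

    def __init__(self, d: int = 768, m: int = 64, scale: float = 1.0):
        super().__init__()
        self.down = nn.Linear(d, m)  # Step 2: z = W_d h + b_d
        self.act = nn.GELU()         # Step 3: z' = GELU(z)
        self.up = nn.Linear(m, d)    # Step 4: h' = W_u z' + b_u
        self.scale = scale           # Step 5: scaling factor lambda

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        h_prime = self.up(self.act(self.down(h)))  # Steps 2-4
        return h + self.scale * h_prime            # Step 5: residual connection

# Usage: hidden states from a transformer layer, shape (batch, seq_len, d).
adapter = Adapter()
h = torch.randn(2, 16, 768)  # Step 1: input representation
h_adapted = adapter(h)       # same shape as h
```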
Why Do Adapters Use a Bottleneck Structure?
A defining characteristic of adapter layers is their bottleneck architecture—a design in which high-dimensional input representations are first compressed to a lower-dimensional space, transformed, and then expanded back to their original size. This structure is not arbitrary; it serves several critical purposes in the context of parameter-efficient fine-tuning:
Parameter Efficiency
By compressing the hidden representations into a lower-dimensional space, the number of trainable parameters is drastically reduced. Instead of updating the entire set of weights in a large-scale model (which may span hundreds of millions or even billions of parameters), adapters constrain learning to a much smaller subset. This allows for efficient fine-tuning of models like GPT-3, LLaMA, and T5, even on consumer-grade hardware.
Preservation of Pretrained Knowledge
The restricted capacity of the bottleneck inherently limits how much new, task-specific information the adapter can encode. This constraint is intentional—it reduces the risk of catastrophic forgetting, where task-specific fine-tuning could otherwise overwrite the general-purpose language understanding learned during pretraining. The adapter instead operates as a lightweight specialization layer that complements the frozen backbone.
Reduced Memory Footprint and Faster Training
Since the bottleneck structure limits the number of parameters being updated, it leads to significantly lower memory consumption during training. This also translates to faster optimization, as fewer gradients need to be computed and stored. Compared to full fine-tuning, adapter-based training offers a substantial reduction in both computational cost and time-to-convergence.
Where Should You Add Adapter Layers—and How Many?
Once you understand how adapter layers function, the natural next questions are:
- Where should they be inserted within the transformer architecture?
- How many adapter layers are necessary for effective adaptation?
These design decisions are crucial, as they directly affect both model performance and fine-tuning efficiency.
Where Do You Add Adapter Layers?
There are two key decisions when placing adapter layers:
1. Adapter Placement: Before or After Transformer Blocks?
Adapter layers can be inserted in different positions within each transformer block. The most common placement options are:
a) Before the Self-Attention Block
In this configuration, the adapter modifies the input representations before they enter the self-attention mechanism. This may be beneficial in scenarios where early adjustment of token embeddings is important for the downstream task.
- Use case: Tasks requiring input re-encoding or re-weighting of token importance.
- Risk: Interfering with self-attention dynamics too early may lead to unstable training.
b) After the Feedforward Block (Default Setting)
This is the most widely adopted configuration. The adapter modifies the hidden representation after both the self-attention and feedforward sublayers have processed the input.
- Rationale: The adapter captures task-specific adjustments without altering the attention mechanism.
- Stability: Empirically, this configuration yields more stable and robust fine-tuning across tasks and models.
Note:
Most adapter-based architectures, including the one proposed by Houlsby et al. (2019), insert adapters after the feedforward block because this placement is simple and performs consistently well (Houlsby et al. additionally place an adapter after the self-attention sublayer).
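For orientation, here is a deliberately simplified transformer block with the adapter in the default position, after the feedforward sublayer. It reuses the `Adapter` module sketched above; the post-LayerNorm ordering and sublayer sizes are illustrative assumptions, not the exact layout of BERT or GPT.

```python
import torch
import torch.nn as nn

class BlockWithAdapter(nn.Module):
    """Simplified post-LayerNorm transformer block with an adapter
    inserted after the feedforward sublayer (the default placement)."""

    def __init__(self, d: int = 768, n_heads: int = 12, m: int = 64):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.norm1 = nn.LayerNorm(d)
        self.norm2 = nn.LayerNorm(d)
        self.adapter = Adapter(d, m)  # Adapter class from the sketch above

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)     # self-attention sublayer (frozen)
        x = self.norm2(x + self.ffn(x))  # feedforward sublayer (frozen)
        return self.adapter(x)           # task-specific adapter (trainable)
```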
2. Inserting Adapters in All vs. Some Layers
The number of adapter layers you insert can vary depending on task requirements, model architecture, and available resources. This decision is commonly framed as a choice between:
a) Shallow Placement (Adapters in the Last Few Layers)
- Only the upper transformer layers are augmented with adapters.
- Best for: Simple tasks like sentiment analysis or intent classification, where the base model already performs well.
- Advantages: Reduces training time and memory footprint.
Note:
Shallow placement is efficient but may limit performance when deeper semantic adjustments are needed.
b) Deep Placement (Adapters in All Transformer Layers)
- Every transformer block receives an adapter, allowing learning to be distributed throughout the model’s hierarchy.
- Best for: Complex or domain-shifted tasks like medical NLP, multilingual translation, or scientific text generation.
- Advantages: Enables the model to learn detailed task-specific transformations at all levels of representation.
Note:
The default recommendation—especially for large language models—is to insert adapters in every transformer block unless the task is trivially close to the pretraining objective.
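As a hedged sketch of shallow versus deep placement, the snippet below uses a Hugging Face BERT-style encoder whose layer stack lives at `model.encoder.layer` (other architectures name this differently) and attaches the `Adapter` modules defined earlier via forward hooks, so the pretrained layer code itself stays untouched.

```python
import torch
from transformers import AutoModel  # assumes the Hugging Face transformers library

model = AutoModel.from_pretrained("bert-base-uncased")
for p in model.parameters():
    p.requires_grad = False  # freeze the entire backbone

num_layers = len(model.encoder.layer)                # 12 for BERT-base
deep_layers = range(num_layers)                      # adapters in every block
shallow_layers = range(num_layers - 4, num_layers)   # adapters in the last 4 blocks

# Swap `shallow_layers` for `deep_layers` to adapt every block.
adapters = torch.nn.ModuleDict(
    {str(i): Adapter(model.config.hidden_size) for i in shallow_layers}
)
for i, adapter in adapters.items():
    # The hook passes the layer's hidden states through the adapter,
    # leaving the frozen layer itself unmodified.
    model.encoder.layer[int(i)].register_forward_hook(
        lambda module, inputs, output, ad=adapter: (ad(output[0]),) + output[1:]
    )

# Only the adapter parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(adapters.parameters(), lr=1e-4)
```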
How Many Adapter Layers Should Be Added?
The number of adapter layers to insert into a transformer model is not fixed—it depends on several factors, including task complexity, model size, and available computational resources. The choice involves a trade-off between fine-tuning efficiency and task-specific performance.
1. Task Complexity
The nature of the target task plays a significant role in determining how many adapters are needed:
- Simple Tasks (e.g., sentiment analysis, topic classification): Shallow placement, adding adapters only to the final few layers, is often sufficient. These tasks typically rely on high-level semantic representations already captured by the pretrained model.
- Complex Tasks (e.g., machine translation, text summarization, domain-specific NLP like biomedical or legal text): Such tasks demand more extensive adaptation across layers. Adding adapters to all transformer layers allows the model to develop hierarchical, task-specific representations at every level of abstraction.
Note:
When the target task significantly differs from the model’s pretraining data or domain, deeper adapter integration across all layers is generally recommended.
2. Model Size
The scale of the underlying model also influences how many adapter layers are appropriate:
- Small Models (e.g., DistilBERT, MiniLM): These models have fewer layers and often perform well with adapters added to only a subset of those layers.
- Large Models (e.g., GPT-3, LLaMA-65B, T5-11B): Larger models contain more depth and capacity. To leverage this, adapters are typically inserted into most or all layers, enabling richer and deeper task-specific fine-tuning.
Note:
Adapter-based fine-tuning scales particularly well with large language models, allowing customization without modifying the full parameter set.
3. Computational Constraints
Resource availability imposes practical limits on how many adapter layers can be trained:
- Limited Resources: When memory or compute is constrained (e.g., single-GPU setups), using fewer adapters, typically in the upper layers, can reduce both training cost and inference latency.
- Abundant Resources: With access to high-performance hardware (e.g., multiple GPUs or TPUs), inserting adapters in all layers maximizes task adaptability and performance.
Note:
There is a clear trade-off: increasing the number of adapters can improve task-specific performance, but at the cost of additional memory and compute.
General Best Practices by Model Type
| Model Type | Where to Insert Adapters | Recommended Adapter Count |
| --- | --- | --- |
| BERT-like (encoder-only) | After the feedforward block in selected or all layers | 3–6 layers for simple tasks; all layers for complex or domain-specific tasks |
| GPT-like (decoder-only) | After the feedforward block (standard), optionally before self-attention | Typically all layers for robust task adaptation |
| T5-like (encoder-decoder) | After the feedforward blocks in both encoder and decoder | Adapter count depends on whether full or partial fine-tuning is desired |
Conclusion
Adapters represent a powerful and elegant solution to the challenge of fine-tuning large transformer models in a resource-efficient and modular way. By introducing small, task-specific layers into an otherwise frozen architecture, adapters enable rapid adaptation to new tasks without the computational cost, memory demands, or risk of catastrophic forgetting associated with full model fine-tuning.
In this article, we explored:
- What adapters are and how they integrate into transformer models
- Why they are used, especially in scenarios involving limited compute, multi-task learning, or domain adaptation
- How they work, including the mathematical underpinnings of their bottleneck architecture
- Where and how many adapter layers to insert, based on task complexity, model size, and resource constraints
Adapters have become a foundational component of the modern Parameter-Efficient Fine-Tuning (PEFT) landscape. They strike a pragmatic balance between performance and efficiency, making large-scale language models more accessible and adaptable across a range of real-world applications.
As transformer models continue to scale, and as the demand for domain-specific and multi-task deployment grows, adapters will remain a central technique in the toolkit of NLP practitioners and researchers alike.
In the next installment of the Intuitive Fine-Tuning series, we’ll dive deeper into other fine-tuning techniques.