Everything You Need to Know About Fine-Tuning with Adapters

Table of contents
- What Are Adapters?
- How Do Adapters Work?
- When Are Adapters Used?
- Why Are Adapters Used?
- When Not to Use Adapters
- Pros and Cons of Adapters
- What Does an Adapter Layer Constitute?
- Mathematical Intuition Behind Adapter Layers in NLP
- Why Do Adapters Use a Bottleneck Structure?
- Where Should You Add Adapter Layers—and How Many?
- General Best Practices by Model Type
- Conclusion

In this blog, we’ll learn about adapters, a Parameter-Efficient Fine-Tuning (PEFT) technique in which small, trainable neural layers are inserted between the frozen layers of a pretrained model.
What Are Adapters?
Adapters are small, trainable neural layers inserted between the frozen layers of a pretrained transformer model. Instead of modifying the entire model, adapters enable parameter-efficient fine-tuning by updating only a small set of task-specific parameters. This drastically reduces memory usage and computational overhead.
Originally introduced by Houlsby et al. (2019), adapters are designed to capture task-specific knowledge while keeping the core transformer layers intact. This makes them an ideal choice for scenarios where compute resources are limited or model generalization needs to be preserved.
How Do Adapters Work?
Adapters introduce lightweight bottleneck layers (typically a down-projection followed by a non-linearity and an up-projection) within each transformer block. During fine-tuning:
- The original model parameters remain frozen.
- Only the adapter layers are trained, making the process more efficient and modular.
This structure ensures that the base model retains its pretrained capabilities, while the adapters specialize the model for new tasks.
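To make this concrete, below is a minimal PyTorch sketch of the training setup. It assumes a hypothetical `model` whose adapter submodules carry "adapter" in their parameter names; that naming convention and the optimizer settings are illustrative, not part of any particular library.

```python
import torch

def mark_adapter_params_trainable(model: torch.nn.Module) -> None:
    """Freeze the pretrained backbone; leave only adapter parameters trainable.

    Assumes adapter submodules contain "adapter" in their parameter names.
    """
    for name, param in model.named_parameters():
        param.requires_grad = "adapter" in name

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"Training {trainable:,} of {total:,} parameters "
          f"({100 * trainable / total:.2f}%)")

# Only the trainable (adapter) parameters are handed to the optimizer:
# optimizer = torch.optim.AdamW(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-4)
```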
When Are Adapters Used?
Adapters are particularly useful in situations where efficient, scalable fine-tuning is needed. Common use cases include:
- Large-Scale Model Fine-Tuning: Fine-tune large models like BERT, GPT, or LLaMA for specific NLP tasks without updating billions of parameters.
- Multi-Task Learning: Use a single base model with multiple adapter modules, each tuned for a different task (e.g., translation, classification, summarization), allowing quick task switching without retraining.
- Domain-Specific Adaptation: Train adapters on specialized domains such as finance, legal, or medical texts while preserving the general language understanding of the base model.
- Efficient Deployment: Adapters introduce only a small number of parameters, making models lightweight and easier to store, load, and deploy compared to fully fine-tuned models.
- Low-Resource Training: Since only the adapter layers are trained, fine-tuning can be performed on single GPUs or even CPU-based environments, making adapters ideal for low-resource settings.
Example Scenarios
- Sentiment Analysis with BERT: Attach an adapter to BERT to fine-tune it for sentiment classification while preserving its broad language understanding.
- Multi-Lingual Translation with T5: Use different adapters for each target language, enabling a single T5 model to support multiple translation tasks.
- Task-Switching in Chatbots: Equip a multi-domain chatbot with task-specific adapters (e.g., customer support, FAQs, product recommendations), allowing seamless switching between domains (see the sketch below).
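As a rough illustration of the chatbot scenario, the sketch below keeps one frozen backbone and a dictionary of per-task bottleneck adapters. The hidden size, bottleneck width, and task names are assumptions made for the example; in a real system the hidden states would come from a pretrained transformer.

```python
import torch
import torch.nn as nn

hidden_size, bottleneck = 768, 64  # illustrative dimensions

def make_adapter() -> nn.Sequential:
    return nn.Sequential(
        nn.Linear(hidden_size, bottleneck),  # down-projection
        nn.GELU(),                           # non-linearity
        nn.Linear(bottleneck, hidden_size),  # up-projection
    )

# One adapter per task; each is trained separately and stored on disk,
# then swapped in at inference time without touching the frozen backbone.
task_adapters = nn.ModuleDict({
    "customer_support": make_adapter(),
    "faq": make_adapter(),
    "recommendation": make_adapter(),
})

def apply_task_adapter(hidden_states: torch.Tensor, task: str) -> torch.Tensor:
    # Residual connection keeps the backbone's representation intact.
    return hidden_states + task_adapters[task](hidden_states)

# Example: route the same hidden states through different task adapters.
h = torch.randn(1, 10, hidden_size)
out_support = apply_task_adapter(h, "customer_support")
out_faq = apply_task_adapter(h, "faq")
```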
Why Are Adapters Used?
Adapters offer a compelling alternative to full or partial fine-tuning, especially when working with large-scale language models. Here’s why they’re widely adopted:
- Memory & Computational Efficiency: Fine-tuning large transformer models (like GPT-3 with 175B parameters) is resource-intensive. Adapters drastically reduce the number of trainable parameters, making fine-tuning feasible on consumer GPUs without a major drop in performance.
- Retains Pretrained Knowledge: Full fine-tuning updates all weights, which can lead to catastrophic forgetting, where the model loses its general language understanding. Adapters preserve the original pretrained knowledge while injecting task-specific capabilities.
- Efficient Task Switching: In full fine-tuning, switching tasks typically requires loading a separate model for each task. With adapters, only the lightweight task-specific modules need to be swapped in, enabling multi-task learning without retraining.
- Reduces Overfitting on Small Datasets: When labeled data is limited, full fine-tuning can overfit. Adapters, by training fewer parameters, promote better generalization in low-resource scenarios.
- Faster Training & Seamless Inference: Adapter-based models train faster and remain nearly as fast as the base model during inference, making them practical for real-world deployment.
When Not to Use Adapters
While adapters are efficient and flexible, they may not be suitable in every situation:
- When You Need Maximum Accuracy: If achieving the highest possible task-specific accuracy is critical (e.g., in high-stakes applications like medical diagnostics), full fine-tuning may outperform adapters.
- When the Task Is Very Different from Pretraining: Adapters build on the assumptions of the pretrained model. If your task is radically different (e.g., using BERT, trained on natural language, for DNA sequence classification), adapters may not be expressive enough.
- When Working with Small Models: Adapters are primarily designed for large models. If you're using compact models like DistilBERT, full fine-tuning is often computationally affordable and more effective.
- When Ultra-Low Latency Is Required: Adapters add a small amount of inference overhead. In latency-sensitive environments, such as real-time systems, full fine-tuning or model distillation might be better suited.
Pros and Cons of Adapters
| Aspect | Pros | Cons |
| --- | --- | --- |
| Memory Efficiency | Requires far less memory than full fine-tuning. | Introduces slight inference overhead due to additional layers. |
| Computational Cost | Fewer trainable parameters lead to faster training. | May not match full fine-tuning performance on complex tasks. |
| Knowledge Retention | Preserves pretrained knowledge while adapting to new tasks. | Less effective when the task is vastly different from pretraining. |
| Multi-Task Learning | Easily switch between tasks by swapping adapters. | Managing multiple adapters for many tasks adds implementation overhead. |
| Scalability for LLMs | Supports fine-tuning of large models (GPT, BERT, T5, etc.) without touching core weights. | Not necessary for small models where full fine-tuning is viable. |
| Overfitting Reduction | Limits trainable parameters, helping avoid overfitting in low-data settings. | Can restrict the model’s ability to deeply learn highly task-specific features. |
What Does an Adapter Layer Constitute?
Adapter layers introduce task-specific transformations into a frozen transformer model. The core idea is to add a bottleneck structure that allows learning a small number of additional parameters, while keeping the main model unchanged.
An adapter layer typically includes three components:
1. Down-Projection Layer
   - Reduces the dimensionality of the incoming hidden representation (e.g., from 768 to 64 dimensions).
   - This bottleneck design minimizes the number of trainable parameters.
2. Non-Linear Activation Function
   - Applies a non-linear transformation (commonly ReLU or GELU) to introduce complexity into the representation.
   - Helps the model learn task-specific features beyond linear transformations.
3. Up-Projection Layer
   - Restores the transformed representation back to the original hidden size.
   - Ensures the modified output aligns dimensionally with the rest of the transformer layers.
Each of these layers contains trainable weights and biases, but their parameter count is significantly smaller compared to full model fine-tuning.
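To see just how small that parameter count is, here is a quick back-of-the-envelope calculation using BERT-base-like dimensions (hidden size 768, 12 layers) and a typical bottleneck of 64; the bottleneck width is an assumption, not a fixed rule.

```python
d, m = 768, 64                 # hidden size and bottleneck dimension
down = d * m + m               # down-projection weights plus bias
up = m * d + d                 # up-projection weights plus bias
per_adapter = down + up        # trainable parameters in one adapter
per_model = per_adapter * 12   # one adapter per layer in a 12-layer encoder

print(per_adapter)   # 99136   (~99K parameters)
print(per_model)     # 1189632 (~1.2M, roughly 1% of BERT-base's ~110M weights)
```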
Mathematical Intuition Behind Adapter Layers in NLP
The working of an adapter layer can be described in a step-by-step manner:
Step 1: Input Representation
Let the hidden representation from a transformer layer be denoted by:
$$h \in \mathbb{R}^{d}$$
Where \(d\) is the hidden size (e.g., 768 in BERT-base).
A hidden representation (or hidden state) refers to the internal vector that captures what the model has learned about the input sequence at a certain layer. It contains encoded information such as word meanings, syntactic roles, and contextual relationships derived from the input tokens. These representations evolve across the layers of a transformer and serve as the foundation for predictions and task-specific outputs.
Step 2: Down-Projection
This step reduces the dimensionality from \(d\) to a smaller bottleneck dimension \(m\):
$$z = W_d h + b_d$$
- \(W_d \in \mathbb{R}^{m \times d}\) is a learnable weight matrix.
- \(b_d \in \mathbb{R}^{m}\) is the bias term.
- \(z \in \mathbb{R}^{m}\) is the compressed representation.
This operation significantly reduces the number of trainable parameters, making fine-tuning efficient.
Step 3: Apply Non-Linearity
A non-linear activation function is applied to capture more complex relationships:
$$z' = \text{ReLU}(z) \quad \text{or} \quad z' = \text{GELU}(z)$$
This helps the adapter learn patterns beyond what linear operations can represent.
Step 4: Up-Projection
The non-linear output \(z'\) is projected back to the original hidden size:
$$h' = W_u z' + b_u$$
- \(W_u \in \mathbb{R}^{d \times m}\) is the up-projection weight matrix.
- \(b_u \in \mathbb{R}^{d}\) is the corresponding bias.
- \(h' \in \mathbb{R}^{d}\) matches the original dimensionality.
Step 5: Residual Connection
To preserve the original information from the pretrained model, a residual connection is used:
$$h_{\text{adapter}} = h + \lambda h'$$
Where \(\lambda \) is a scaling factor (often set to 1). This step ensures that the adapter modifies the representation without discarding the general language knowledge encoded in \(h\).
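The five steps above translate directly into a small PyTorch module. The following is a minimal sketch: the default dimensions, the choice of GELU, and the scaling factor of 1.0 mirror the description above rather than any specific published implementation.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter implementing Steps 1-5 above (sketch)."""

    def __init__(self, d: int = 768, m: int = 64, scale: float = 1.0):
        super().__init__()
        self.down = nn.Linear(d, m)  # Step 2: z = W_d h + b_d
        self.act = nn.GELU()         # Step 3: z' = GELU(z)
        self.up = nn.Linear(m, d)    # Step 4: h' = W_u z' + b_u
        self.scale = scale           # Step 5: scaling factor lambda

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        h_prime = self.up(self.act(self.down(h)))  # Steps 2-4
        return h + self.scale * h_prime            # Step 5: residual connection

# Usage: hidden states from a transformer layer, shape (batch, seq_len, d).
adapter = Adapter()
h = torch.randn(2, 16, 768)  # Step 1: input representation
h_adapted = adapter(h)       # same shape as h
```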
Why Do Adapters Use a Bottleneck Structure?
A defining characteristic of adapter layers is their bottleneck architecture—a design in which high-dimensional input representations are first compressed to a lower-dimensional space, transformed, and then expanded back to their original size. This structure is not arbitrary; it serves several critical purposes in the context of parameter-efficient fine-tuning:
Parameter Efficiency
By compressing the hidden representations into a lower-dimensional space, the number of trainable parameters is drastically reduced. Instead of updating the entire set of weights in a large-scale model (which may span hundreds of millions or even billions of parameters), adapters constrain learning to a much smaller subset. This allows for efficient fine-tuning of models like GPT-3, LLaMA, and T5, even on consumer-grade hardware.
Preservation of Pretrained Knowledge
The restricted capacity of the bottleneck inherently limits how much new, task-specific information the adapter can encode. This constraint is intentional—it reduces the risk of catastrophic forgetting, where task-specific fine-tuning could otherwise overwrite the general-purpose language understanding learned during pretraining. The adapter instead operates as a lightweight specialization layer that complements the frozen backbone.
Reduced Memory Footprint and Faster Training
Since the bottleneck structure limits the number of parameters being updated, it leads to significantly lower memory consumption during training. This also translates to faster optimization, as fewer gradients need to be computed and stored. Compared to full fine-tuning, adapter-based training offers a substantial reduction in both computational cost and time-to-convergence.
Where Should You Add Adapter Layers—and How Many?
Once you understand how adapter layers function, the natural next questions are:
- Where should they be inserted within the transformer architecture?
- How many adapter layers are necessary for effective adaptation?
These design decisions are crucial, as they directly affect both model performance and fine-tuning efficiency.
Where Do You Add Adapter Layers?
There are two key decisions when placing adapter layers:
1. Adapter Placement: Before or After Transformer Blocks?
Adapter layers can be inserted in different positions within each transformer block. The most common placement options are:
a) Before the Self-Attention Block
In this configuration, the adapter modifies the input representations before they enter the self-attention mechanism. This may be beneficial in scenarios where early adjustment of token embeddings is important for the downstream task.
- Use case: Tasks requiring input re-encoding or re-weighting of token importance.
- Risk: Interfering with self-attention dynamics too early may lead to unstable training.
b) After the Feedforward Block (Default Setting)
This is the most widely adopted configuration. The adapter modifies the hidden representation after both the self-attention and feedforward sublayers have processed the input.
- Rationale: The adapter captures task-specific adjustments without altering the attention mechanism.
- Stability: Empirically, this configuration yields more stable and robust fine-tuning across tasks and models.
Note:
Most adapter-based architectures, including the one proposed by Houlsby et al. (2019), insert adapters after the feedforward block because this placement is simple and performs consistently well (Houlsby et al. additionally place an adapter after the self-attention sublayer).
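For orientation, here is a deliberately simplified transformer block with the adapter in the default position, after the feedforward sublayer. It reuses the `Adapter` module sketched above; the post-LayerNorm ordering and sublayer sizes are illustrative assumptions, not the exact layout of BERT or GPT.

```python
import torch
import torch.nn as nn

class BlockWithAdapter(nn.Module):
    """Simplified post-LayerNorm transformer block with an adapter
    inserted after the feedforward sublayer (the default placement)."""

    def __init__(self, d: int = 768, n_heads: int = 12, m: int = 64):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.norm1 = nn.LayerNorm(d)
        self.norm2 = nn.LayerNorm(d)
        self.adapter = Adapter(d, m)  # Adapter class from the sketch above

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)     # self-attention sublayer (frozen)
        x = self.norm2(x + self.ffn(x))  # feedforward sublayer (frozen)
        return self.adapter(x)           # task-specific adapter (trainable)
```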
2. Inserting Adapters in All vs. Some Layers
The number of adapter layers you insert can vary depending on task requirements, model architecture, and available resources. This decision is commonly framed as a choice between:
a) Shallow Placement (Adapters in the Last Few Layers)
- Only the upper transformer layers are augmented with adapters.
- Best for: Simple tasks like sentiment analysis or intent classification, where the base model already performs well.
- Advantages: Reduces training time and memory footprint.
Note:
Shallow placement is efficient but may limit performance when deeper semantic adjustments are needed.
b) Deep Placement (Adapters in All Transformer Layers)
- Every transformer block receives an adapter, allowing learning to be distributed throughout the model’s hierarchy.
- Best for: Complex or domain-shifted tasks like medical NLP, multilingual translation, or scientific text generation.
- Advantages: Enables the model to learn detailed task-specific transformations at all levels of representation.
Note:
The default recommendation—especially for large language models—is to insert adapters in every transformer block unless the task is trivially close to the pretraining objective.
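As a hedged sketch of shallow versus deep placement, the snippet below uses a Hugging Face BERT-style encoder whose layer stack lives at `model.encoder.layer` (other architectures name this differently) and attaches the `Adapter` modules defined earlier via forward hooks, so the pretrained layer code itself stays untouched.

```python
import torch
from transformers import AutoModel  # assumes the Hugging Face transformers library

model = AutoModel.from_pretrained("bert-base-uncased")
for p in model.parameters():
    p.requires_grad = False  # freeze the entire backbone

num_layers = len(model.encoder.layer)                # 12 for BERT-base
deep_layers = range(num_layers)                      # adapters in every block
shallow_layers = range(num_layers - 4, num_layers)   # adapters in the last 4 blocks

# Swap `shallow_layers` for `deep_layers` to adapt every block.
adapters = torch.nn.ModuleDict(
    {str(i): Adapter(model.config.hidden_size) for i in shallow_layers}
)
for i, adapter in adapters.items():
    # The hook passes the layer's hidden states through the adapter,
    # leaving the frozen layer itself unmodified.
    model.encoder.layer[int(i)].register_forward_hook(
        lambda module, inputs, output, ad=adapter: (ad(output[0]),) + output[1:]
    )

# Only the adapter parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(adapters.parameters(), lr=1e-4)
```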
How Many Adapter Layers Should Be Added?
The number of adapter layers to insert into a transformer model is not fixed—it depends on several factors, including task complexity, model size, and available computational resources. The choice involves a trade-off between fine-tuning efficiency and task-specific performance.
1. Task Complexity
The nature of the target task plays a significant role in determining how many adapters are needed:
- Simple Tasks (e.g., sentiment analysis, topic classification): Shallow placement, adding adapters only to the final few layers, is often sufficient. These tasks typically rely on high-level semantic representations already captured by the pretrained model.
- Complex Tasks (e.g., machine translation, text summarization, domain-specific NLP like biomedical or legal text): Such tasks demand more extensive adaptation across layers. Adding adapters to all transformer layers allows the model to develop hierarchical, task-specific representations at every level of abstraction.
Note:
When the target task significantly differs from the model’s pretraining data or domain, deeper adapter integration across all layers is generally recommended.
2. Model Size
The scale of the underlying model also influences how many adapter layers are appropriate:
- Small Models (e.g., DistilBERT, MiniLM): These models have fewer layers and often perform well with adapters added to only a subset of those layers.
- Large Models (e.g., GPT-3, LLaMA-65B, T5-11B): Larger models contain more depth and capacity. To leverage this, adapters are typically inserted into most or all layers, enabling richer and deeper task-specific fine-tuning.
Note:
Adapter-based fine-tuning scales particularly well with large language models, allowing customization without modifying the full parameter set.
3. Computational Constraints
Resource availability imposes practical limits on how many adapter layers can be trained:
- Limited Resources: When memory or compute is constrained (e.g., single-GPU setups), using fewer adapters, typically in the upper layers, can reduce both training cost and inference latency.
- Abundant Resources: With access to high-performance hardware (e.g., multiple GPUs or TPUs), inserting adapters in all layers maximizes task adaptability and performance.
Note:
There is a clear trade-off: increasing the number of adapters can improve task-specific performance, but at the cost of additional memory and compute.
General Best Practices by Model Type
| Model Type | Where to Insert Adapters | Recommended Adapter Count |
| --- | --- | --- |
| BERT-like (encoder-only) | After the feedforward block in selected or all layers | 3–6 layers for simple tasks; all layers for complex or domain-specific tasks |
| GPT-like (decoder-only) | After the feedforward block (standard), optionally before self-attention | Typically all layers for robust task adaptation |
| T5-like (encoder-decoder) | After the feedforward blocks in both encoder and decoder | Adapter count depends on whether full or partial fine-tuning is desired |
Conclusion
Adapters represent a powerful and elegant solution to the challenge of fine-tuning large transformer models in a resource-efficient and modular way. By introducing small, task-specific layers into an otherwise frozen architecture, adapters enable rapid adaptation to new tasks without the computational cost, memory demands, or risk of catastrophic forgetting associated with full model fine-tuning.
In this article, we explored:
- What adapters are and how they integrate into transformer models
- Why they are used, especially in scenarios involving limited compute, multi-task learning, or domain adaptation
- How they work, including the mathematical underpinnings of their bottleneck architecture
- Where and how many adapter layers to insert, based on task complexity, model size, and resource constraints
Adapters have become a foundational component of the modern Parameter-Efficient Fine-Tuning (PEFT) landscape. They strike a pragmatic balance between performance and efficiency, making large-scale language models more accessible and adaptable across a range of real-world applications.
As transformer models continue to scale, and as the demand for domain-specific and multi-task deployment grows, adapters will remain a central technique in the toolkit of NLP practitioners and researchers alike.
In the next installment of the Intuitive Fine-Tuning series, we’ll dive deeper into other fine-tuning techniques.