Whisper Optimization: Precision Tuning the Encoder and Decoder Separately


Whisper is computationally intensive, and running it efficiently at scale or on constrained devices is a real engineering challenge.
To make Whisper more suitable for production deployment, I explored two independent and hardware-aware optimization strategies — one for the encoder, one for the decoder.
In this post, I’ll walk through how I:
Quantized the encoder to INT8 for efficient CPU inference.
Converted the decoder to FP16 for faster GPU inference.
Each optimization is tailored to the component’s numerical behavior and role in Whisper’s architecture. The results show that, with careful precision tuning, you can achieve real-world performance gains without sacrificing transcription quality.
Why Optimize Encoder and Decoder Separately?
Whisper isn’t a single block. It’s composed of:
An encoder: processes the input audio into hidden representations.
A decoder: generates text token-by-token using those representations.
These components are structurally and numerically different — and that’s why optimizing them requires different approaches.
| Component | Nature | Optimization Strategy | Target Platform |
| --- | --- | --- | --- |
| Encoder | Feed-forward | INT8 quantization | CPU |
| Decoder | Autoregressive | FP16 mixed precision | GPU |
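To make the split concrete, here is a quick inspection snippet of my own (not from the original post) that loads the same checkpoint used later in this article and reports the parameter count of each half:
from transformers import WhisperModel

model = WhisperModel.from_pretrained("openai/whisper-small")

# The encoder and decoder are separate submodules, so they can be
# inspected (and later optimized) independently.
encoder_params = sum(p.numel() for p in model.encoder.parameters())
decoder_params = sum(p.numel() for p in model.decoder.parameters())
print(f"Encoder parameters: {encoder_params / 1e6:.1f}M")
print(f"Decoder parameters: {decoder_params / 1e6:.1f}M")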
Part 1 — Optimizing the Encoder with INT8 Quantization
Why INT8 Works for the Encoder
The encoder processes the full audio sequence in a single pass, with no dependencies between time steps, and it operates on relatively stable audio feature representations, which makes it tolerant of lower precision. INT8 quantization can therefore be applied here with only minor accuracy degradation. This makes the encoder:
Stable under aggressive quantization.
Ideal for static inference on CPUs.
A prime candidate for reducing memory and compute via 8-bit weights.
PyTorch Dynamic Quantization
To identify quantizable layers, I inspected the encoder structure:
from transformers import WhisperModel
model = WhisperModel.from_pretrained("openai/whisper-small")
print(model.encoder)
This revealed several Conv1d and Linear layers, which are natural targets for quantization.
WhisperEncoder(
  (conv1): Conv1d(80, 768, kernel_size=(3,), stride=(1,), padding=(1,))
  (conv2): Conv1d(768, 768, kernel_size=(3,), stride=(2,), padding=(1,))
  (layers): ModuleList(
    (0-11): 12 x WhisperEncoderLayer(
      ...
      (fc1): Linear(in_features=768, out_features=3072, bias=True)
      (fc2): Linear(in_features=3072, out_features=768, bias=True)
      ...
    )
  )
)
We use dynamic quantization, a technique where the model's floating-point weights are converted to 8-bit integers (INT8). During inference, the activations are quantized on-the-fly. This is highly effective for CPU-based inference.
Here’s how I applied INT8 quantization. The key operation is torch.ao.quantization.quantize_dynamic: we quantize the original FP32 encoder and swap the quantized version back into the model:
import torch
from torch.ao.quantization import quantize_dynamic
from transformers import WhisperForConditionalGeneration

# model_path points to the Whisper checkpoint being optimized.
model_fp32 = WhisperForConditionalGeneration.from_pretrained(model_path)

# Quantize the encoder's weights to INT8; activations are quantized on the fly.
# Note: dynamic quantization primarily targets Linear layers, so depending on
# your PyTorch version the Conv1d modules may be left in FP32.
quantized_encoder = quantize_dynamic(
    model_fp32.model.encoder,
    {torch.nn.Linear, torch.nn.Conv1d},
    dtype=torch.qint8,
)

# Swap the quantized encoder back into the full model.
model_fp32.model.encoder = quantized_encoder
This stores the encoder's weights as 8-bit integers, while activations are quantized on the fly at inference time.
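As a quick sanity check (my own addition, not from the linked repo), you can print one of the feed-forward layers; after quantization it should appear as a dynamically quantized module:
# The fc1/fc2 layers should now print as DynamicQuantizedLinear modules
# with dtype=torch.qint8 instead of plain Linear layers.
print(model_fp32.model.encoder.layers[0].fc1)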
Benchmark: INT8 Encoder on CPU
I benchmarked the INT8 encoder against the original FP32 model using 100 test samples from Common Voice (Azerbaijani).
| Metric | FP32 Encoder | INT8 Encoder | Change |
| --- | --- | --- | --- |
| WER (%) | 53.39 | 55.38 | +1.99 pts |
| Real-Time Factor | 0.192 | 0.168 | -12.5% |
| Memory Usage (MB) | 970.08 | 37.23 | -96.2% |
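For reference, here is a minimal sketch of how WER and real-time factor can be computed over a set of samples. It assumes the jiwer library, a WhisperProcessor, and a list of (waveform, reference_text) pairs prepared from the Common Voice split; the helper name and structure are my own, not the original benchmark code:
import time
import jiwer

def benchmark(model, processor, samples, sampling_rate=16000):
    # samples: list of (waveform, reference_text) pairs
    references, hypotheses = [], []
    total_proc, total_audio = 0.0, 0.0
    for waveform, reference in samples:
        inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
        start = time.perf_counter()
        pred_ids = model.generate(inputs.input_features)
        total_proc += time.perf_counter() - start
        total_audio += len(waveform) / sampling_rate
        hypotheses.append(processor.batch_decode(pred_ids, skip_special_tokens=True)[0])
        references.append(reference)
    wer = jiwer.wer(references, hypotheses) * 100  # word error rate in %
    rtf = total_proc / total_audio                 # real-time factor
    return wer, rtf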
Conclusion: Despite a minor WER increase, the quantized encoder delivers major efficiency gains — perfect for CPU inference or deployment on memory-constrained systems.
Part 2 — Optimizing the Decoder with FP16 Mixed Precision
Why FP16 Works for the Decoder
Our second experiment targets the more sensitive decoder. The goal is to reduce latency without compromising the model's ability to make fine-grained, token-by-token decisions. Because the decoder is autoregressive, any quantization error compounds as generation proceeds, so aggressive INT8 quantization is risky here.
Instead, I used a less aggressive technique: 16-bit floating-point (FP16) precision, which keeps a much larger dynamic range for numerical stability and maps well onto modern GPUs with Tensor Cores. Converting the decoder to FP16:
Cuts memory usage in half.
Increases inference speed on supported GPUs.
Retains high numerical fidelity for generation.
Half Precision Decoder
I kept the encoder in FP32 and converted only the decoder to FP16:
from transformers import WhisperForConditionalGeneration
model_fp16 = WhisperForConditionalGeneration.from_pretrained(model_path)
model_fp16.model.decoder = model_fp16.model.decoder.half()
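A quick way to confirm that only the decoder was converted (my own check, not part of the original post) is to inspect the parameter dtypes of both halves:
encoder_dtypes = {p.dtype for p in model_fp16.model.encoder.parameters()}
decoder_dtypes = {p.dtype for p in model_fp16.model.decoder.parameters()}
print("Encoder dtypes:", encoder_dtypes)  # expected: {torch.float32}
print("Decoder dtypes:", decoder_dtypes)  # expected: {torch.float16}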
During inference on a GPU, we use torch.autocast to create a mixed-precision context. This allows PyTorch to automatically run operations in FP16 for speed while using FP32 for certain sensitive operations to maintain stability.
import torch

model_fp16 = model_fp16.to("cuda")  # FP32 encoder and FP16 decoder both move to the GPU
with torch.autocast(device_type="cuda", dtype=torch.float16):
    pred_ids = model_fp16.generate(input_features.to("cuda"))[0]
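To turn the predicted token IDs back into text, you can decode them with the matching processor. This step is implied but not shown in the snippet above; it assumes a WhisperProcessor loaded from the same checkpoint:
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained(model_path)
transcription = processor.decode(pred_ids, skip_special_tokens=True)
print(transcription)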
Benchmark: FP16 Decoder on GPU
| Metric | FP32 Decoder | FP16 Decoder | Change |
| --- | --- | --- | --- |
| WER (%) | 53.39 | 52.99 | -0.40 pts |
| Real-Time Factor | 0.037 | 0.031 | -16.2% |
| Memory Usage (MB) | 133.32 | 33.00 | -75.2% |
Conclusion: The FP16 decoder improves inference time and reduces memory — with no loss in accuracy. This is ideal for real-time GPU inference or batched processing.
Final Comparison
| Component | Strategy | Precision | WER Change | Speed Gain | Memory Change |
| --- | --- | --- | --- | --- | --- |
| Encoder | Dynamic Quantization | INT8 | +1.99 pts | +12.5% | -96.2% |
| Decoder | Mixed Precision | FP16 | -0.40 pts | +16.2% | -75.2% |
Conclusion
These results demonstrate that Whisper’s encoder and decoder can be optimized independently and safely using precision-aware strategies.
This makes it possible to deploy Whisper in environments that previously would have struggled with its computational demands.
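As a closing sketch, the two strategies could be folded into a single loading helper that picks the precision per component based on the target device. This is a hypothetical convenience wrapper of my own and is not part of the linked repository:
import torch
from torch.ao.quantization import quantize_dynamic
from transformers import WhisperForConditionalGeneration

def load_optimized_whisper(model_path: str, device: str = "cpu"):
    """Load Whisper with an INT8 encoder for CPU or an FP16 decoder for GPU."""
    model = WhisperForConditionalGeneration.from_pretrained(model_path)
    if device == "cpu":
        # CPU target: dynamically quantize the encoder's Linear layers to INT8.
        model.model.encoder = quantize_dynamic(
            model.model.encoder, {torch.nn.Linear}, dtype=torch.qint8
        )
    else:
        # GPU target: keep the encoder in FP32 and cast the decoder to FP16.
        model.model.decoder = model.model.decoder.half()
        model = model.to(device)
    return model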
Full codes: https://github.com/NijatZeynalov/whisper-experiments
If you found this useful or want to dive deeper, feel free to connect or reach out.