Whisper Optimization: Precision Tuning the Encoder and Decoder Separately


Whisper is computationally intensive, and running it efficiently at scale or on constrained devices is a real engineering challenge.
To make Whisper more suitable for production deployment, I explored two independent and hardware-aware optimization strategies — one for the encoder, one for the decoder.
In this post, I’ll walk through how I:
Quantized the encoder to INT8 for efficient CPU inference.
Converted the decoder to FP16 for faster GPU inference.
Each optimization is tailored to the component’s numerical behavior and role in Whisper’s architecture. The results show that, with careful precision tuning, you can achieve real-world performance gains without sacrificing transcription quality.
Why Optimize Encoder and Decoder Separately?
Whisper isn’t a single block. It’s composed of:
An encoder: processes the input audio into hidden representations.
A decoder: generates text token-by-token using those representations.
These components are structurally and numerically different — and that’s why optimizing them requires different approaches.
| Component | Nature | Optimization Strategy | Target Platform |
| --- | --- | --- | --- |
| Encoder | Feed-forward | INT8 quantization | CPU |
| Decoder | Autoregressive | FP16 mixed precision | GPU |
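To make the split concrete, here is a quick inspection snippet of my own (not from the original post) that loads the same checkpoint used later in this article and reports the parameter count of each half:
from transformers import WhisperModel

model = WhisperModel.from_pretrained("openai/whisper-small")

# The encoder and decoder are separate submodules, so they can be
# inspected (and later optimized) independently.
encoder_params = sum(p.numel() for p in model.encoder.parameters())
decoder_params = sum(p.numel() for p in model.decoder.parameters())
print(f"Encoder parameters: {encoder_params / 1e6:.1f}M")
print(f"Decoder parameters: {decoder_params / 1e6:.1f}M")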
Part 1 — Optimizing the Encoder with INT8 Quantization
Why INT8 Works for the Encoder
The encoder processes the full audio sequence in a single pass, with no dependencies between time steps, and it operates on relatively stable audio feature representations, which makes it tolerant of lower precision. INT8 quantization can therefore be applied here with only minor accuracy degradation. This makes the encoder:
Stable under aggressive quantization.
Ideal for static inference on CPUs.
A prime candidate for reducing memory and compute via 8-bit weights.
PyTorch Dynamic Quantization
To identify quantizable layers, I inspected the encoder structure:
from transformers import WhisperModel
model = WhisperModel.from_pretrained("openai/whisper-small")
print(model.encoder)
This revealed several Conv1d and Linear layers, which are natural targets for quantization.
WhisperEncoder(
  (conv1): Conv1d(80, 768, kernel_size=(3,), stride=(1,), padding=(1,))
  (conv2): Conv1d(768, 768, kernel_size=(3,), stride=(2,), padding=(1,))
  (layers): ModuleList(
    (0-11): 12 x WhisperEncoderLayer(
      ...
      (fc1): Linear(in_features=768, out_features=3072, bias=True)
      (fc2): Linear(in_features=3072, out_features=768, bias=True)
      ...
    )
  )
)
We use dynamic quantization, a technique where the model's floating-point weights are converted to 8-bit integers (INT8). During inference, the activations are quantized on-the-fly. This is highly effective for CPU-based inference.
Here’s how I applied INT8 quantization. The key operation is torch.ao.quantization.quantize_dynamic: we quantize the original FP32 encoder and swap the quantized version back into the model:
import torch
from torch.ao.quantization import quantize_dynamic
from transformers import WhisperForConditionalGeneration

# model_path points to the Whisper checkpoint being optimized.
model_fp32 = WhisperForConditionalGeneration.from_pretrained(model_path)

# Quantize the encoder's weights to INT8; activations are quantized on the fly.
# Note: dynamic quantization primarily targets Linear layers, so depending on
# your PyTorch version the Conv1d modules may be left in FP32.
quantized_encoder = quantize_dynamic(
    model_fp32.model.encoder,
    {torch.nn.Linear, torch.nn.Conv1d},
    dtype=torch.qint8,
)

# Swap the quantized encoder back into the full model.
model_fp32.model.encoder = quantized_encoder
This stores the encoder's weights as 8-bit integers, while activations are quantized on the fly at inference time.
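As a quick sanity check (my own addition, not from the linked repo), you can print one of the feed-forward layers; after quantization it should appear as a dynamically quantized module:
# The fc1/fc2 layers should now print as DynamicQuantizedLinear modules
# with dtype=torch.qint8 instead of plain Linear layers.
print(model_fp32.model.encoder.layers[0].fc1)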
Benchmark: INT8 Encoder on CPU
I benchmarked the INT8 encoder against the original FP32 model using 100 test samples from Common Voice (Azerbaijani).
| Metric | FP32 Encoder | INT8 Encoder | Change |
| --- | --- | --- | --- |
| WER (%) | 53.39 | 55.38 | +1.99 pts |
| Real-Time Factor | 0.192 | 0.168 | -12.5% |
| Memory Usage (MB) | 970.08 | 37.23 | -96.2% |
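For reference, here is a minimal sketch of how WER and real-time factor can be computed over a set of samples. It assumes the jiwer library, a WhisperProcessor, and a list of (waveform, reference_text) pairs prepared from the Common Voice split; the helper name and structure are my own, not the original benchmark code:
import time
import jiwer

def benchmark(model, processor, samples, sampling_rate=16000):
    # samples: list of (waveform, reference_text) pairs
    references, hypotheses = [], []
    total_proc, total_audio = 0.0, 0.0
    for waveform, reference in samples:
        inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
        start = time.perf_counter()
        pred_ids = model.generate(inputs.input_features)
        total_proc += time.perf_counter() - start
        total_audio += len(waveform) / sampling_rate
        hypotheses.append(processor.batch_decode(pred_ids, skip_special_tokens=True)[0])
        references.append(reference)
    wer = jiwer.wer(references, hypotheses) * 100  # word error rate in %
    rtf = total_proc / total_audio                 # real-time factor
    return wer, rtf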
Conclusion: Despite a minor WER increase, the quantized encoder delivers major efficiency gains — perfect for CPU inference or deployment on memory-constrained systems.
Part 2 — Optimizing the Decoder with FP16 Mixed Precision
Why FP16 Works for the Decoder
Our second experiment targets the more sensitive decoder. The goal is to reduce latency without compromising the model's ability to make fine-grained, token-by-token decisions. Because the decoder is autoregressive, any quantization error compounds as generation proceeds, so aggressive INT8 quantization is risky here.
Instead, I used a less aggressive technique: 16-bit floating-point (FP16) precision, which keeps a much larger dynamic range for numerical stability and maps well onto modern GPUs with Tensor Cores. Converting the decoder to FP16:
Cuts memory usage in half.
Increases inference speed on supported GPUs.
Retains high numerical fidelity for generation.
Half Precision Decoder
I kept the encoder in FP32 and converted only the decoder to FP16:
from transformers import WhisperForConditionalGeneration
model_fp16 = WhisperForConditionalGeneration.from_pretrained(model_path)
model_fp16.model.decoder = model_fp16.model.decoder.half()
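A quick way to confirm that only the decoder was converted (my own check, not part of the original post) is to inspect the parameter dtypes of both halves:
encoder_dtypes = {p.dtype for p in model_fp16.model.encoder.parameters()}
decoder_dtypes = {p.dtype for p in model_fp16.model.decoder.parameters()}
print("Encoder dtypes:", encoder_dtypes)  # expected: {torch.float32}
print("Decoder dtypes:", decoder_dtypes)  # expected: {torch.float16}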
During inference on a GPU, we use torch.autocast to create a mixed-precision context. This allows PyTorch to automatically run operations in FP16 for speed while using FP32 for certain sensitive operations to maintain stability.
import torch

model_fp16 = model_fp16.to("cuda")  # FP32 encoder and FP16 decoder both move to the GPU
with torch.autocast(device_type="cuda", dtype=torch.float16):
    pred_ids = model_fp16.generate(input_features.to("cuda"))[0]
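To turn the predicted token IDs back into text, you can decode them with the matching processor. This step is implied but not shown in the snippet above; it assumes a WhisperProcessor loaded from the same checkpoint:
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained(model_path)
transcription = processor.decode(pred_ids, skip_special_tokens=True)
print(transcription)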
Benchmark: FP16 Decoder on GPU
| Metric | FP32 Decoder | FP16 Decoder | Change |
| --- | --- | --- | --- |
| WER (%) | 53.39 | 52.99 | -0.40 pts |
| Real-Time Factor | 0.037 | 0.031 | -16.2% |
| Memory Usage (MB) | 133.32 | 33.00 | -75.2% |
Conclusion: The FP16 decoder improves inference time and reduces memory — with no loss in accuracy. This is ideal for real-time GPU inference or batched processing.
Final Comparison
| Component | Strategy | Precision | WER Change | Speed Gain | Memory Change |
| --- | --- | --- | --- | --- | --- |
| Encoder | Dynamic Quantization | INT8 | +1.99 pts | +12.5% | -96.2% |
| Decoder | Mixed Precision | FP16 | -0.40 pts | +16.2% | -75.2% |
Conclusion
These results demonstrate that Whisper’s encoder and decoder can be optimized independently and safely using precision-aware strategies.
This makes it possible to deploy Whisper in environments that previously would have struggled with its computational demands.
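As a closing sketch, the two strategies could be folded into a single loading helper that picks the precision per component based on the target device. This is a hypothetical convenience wrapper of my own and is not part of the linked repository:
import torch
from torch.ao.quantization import quantize_dynamic
from transformers import WhisperForConditionalGeneration

def load_optimized_whisper(model_path: str, device: str = "cpu"):
    """Load Whisper with an INT8 encoder for CPU or an FP16 decoder for GPU."""
    model = WhisperForConditionalGeneration.from_pretrained(model_path)
    if device == "cpu":
        # CPU target: dynamically quantize the encoder's Linear layers to INT8.
        model.model.encoder = quantize_dynamic(
            model.model.encoder, {torch.nn.Linear}, dtype=torch.qint8
        )
    else:
        # GPU target: keep the encoder in FP32 and cast the decoder to FP16.
        model.model.decoder = model.model.decoder.half()
        model = model.to(device)
    return model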
Full codes: https://github.com/NijatZeynalov/whisper-experiments
If you found this useful or want to dive deeper, feel free to connect or reach out.