Controlling Whisper Output with Inference-Time Token Constraints


The Whisper model by OpenAI is a powerful multilingual, multitask ASR system. However, this generality can become a liability when adapting Whisper to a specific language and task, such as transcribing Azerbaijani speech without translation or language drift.
While experimenting with fine-tuning OpenAI's Whisper for Azerbaijani, I quickly ran into a key challenge: the decoder often ignored my intentions.
Despite being fine-tuned on Azerbaijani speech, it occasionally switched languages, repeated outputs, or even hallucinated translations. So I dug deep into decoder optimization.
When fine-tuning Whisper on a low-resource language like Azerbaijani, the decoder's flexibility needs to be strategically constrained. Without such constraints, the decoder may:
- Default to higher-resource languages like English during decoding.
- Switch to translation mode when only transcription is desired.
- Generate incorrect or repetitive tokens due to multilingual pretraining.
In this post, I’m sharing the practical experiments I used to make Whisper’s decoder behave predictably — and the actual code I used to implement these constraints.
This post is part of my ongoing experiments with speech models. I’ll keep sharing things I apply as I go.
Why Whisper Needs Decoder Constraints
The Whisper decoder is autoregressive, meaning it generates tokens one by one, each conditioned on previously generated tokens. It uses special tokens such as `<|en|>`, `<|az|>`, `<|transcribe|>`, and `<|translate|>` to set the language and task.
By default, the model may not reliably pick the right tokens during decoding unless explicitly instructed. These issues are amplified during fine-tuning, where only a subset of the model’s parameters are adapted.
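To make this concrete, here is a quick way to inspect those control tokens and their IDs with the Hugging Face WhisperProcessor. This is a minimal sketch; the `processor` variable and the `openai/whisper-small` checkpoint are illustrative choices, not necessarily the exact setup I used:

```python
from transformers import WhisperProcessor

# Illustrative checkpoint; any multilingual Whisper checkpoint exposes these tokens.
processor = WhisperProcessor.from_pretrained("openai/whisper-small")

# Look up the IDs of the control tokens Whisper uses for language and task.
for token in ["<|startoftranscript|>", "<|en|>", "<|az|>", "<|transcribe|>", "<|translate|>"]:
    print(token, "->", processor.tokenizer.convert_tokens_to_ids(token))
```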
Goal:
Force Whisper to always:
- Transcribe (not translate).
- Transcribe in Azerbaijani only.
- Start generation with the correct prefix tokens.
This is achieved without changing model weights, using the following mechanisms:
Three Key Decoder Controls
| Constraint Type | Purpose | Method |
| --- | --- | --- |
| Token Suppression | Prevent generation of wrong language/task tokens | `suppress_tokens` |
| Forced Decoder IDs | Enforce the right task + language context | `forced_decoder_ids` |
| Decoder Configuration | Apply limits like beam search and max length | `generation_config` |
Silencing Unwanted Outputs
Whisper’s tokenizer includes hundreds of special tokens like `<|en|>`, `<|fr|>`, `<|translate|>`, etc. Unless suppressed, the model may generate these during decoding.
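As a quick illustration of just how many of these control tokens exist, here is a sketch assuming the same `processor` as before and that the tokenizer exposes them via `additional_special_tokens` (which I believe it does):

```python
# Whisper registers its <|...|> control tokens as additional special tokens.
special = processor.tokenizer.additional_special_tokens
print(len(special), "special tokens, e.g.:", special[:8])
```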
The following code implements dynamic token suppression by scanning the tokenizer for all language and task tokens except Azerbaijani and transcription:
language_tokens = []
for token_id in range(len(processor.tokenizer)):
    token = processor.tokenizer.convert_ids_to_tokens(token_id)
    # Keep every <|...|> control token except <|az|>
    if token and token.startswith("<|") and token.endswith("|>") and token != "<|az|>":
        code = token[2:-2]
        if code.isalpha() and code.islower() and len(code) <= 3:  # likely a language tag, e.g. <|en|>, <|yue|>
            language_tokens.append(token_id)

# Also suppress <|translate|> so the model cannot switch to translation mode
task_translate_id = processor.tokenizer.convert_tokens_to_ids("<|translate|>")
if task_translate_id is not None:
    language_tokens.append(task_translate_id)

suppress_tokens = language_tokens
Suppression Effect:
All tokens in `suppress_tokens` are assigned zero probability during generation. This ensures that no other languages or the translation task can be selected, regardless of the model’s pretraining biases.
model.generation_config.suppress_tokens = suppress_tokens
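As a quick sanity check (same assumptions as above), you can confirm that tokens like `<|en|>` and `<|translate|>` made it into the suppression list while `<|az|>` did not:

```python
en_id = processor.tokenizer.convert_tokens_to_ids("<|en|>")
az_id = processor.tokenizer.convert_tokens_to_ids("<|az|>")

assert en_id in suppress_tokens, "<|en|> should be suppressed"
assert task_translate_id in suppress_tokens, "<|translate|> should be suppressed"
assert az_id not in suppress_tokens, "<|az|> must remain available"
print(f"Suppressing {len(suppress_tokens)} tokens")
```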
Forced Decoder IDs: Controlling the Generation Prefix
Whisper models expect a prefix of special tokens at the beginning of generation, typically:
<|startoftranscript|><|language|><|task|>
For our use case, the correct prefix is:
<|startoftranscript|><|az|><|transcribe|>
These are injected at fixed decoder positions using `forced_decoder_ids`:
az_token_id = processor.tokenizer.convert_tokens_to_ids("<|az|>")
transcribe_token_id = processor.tokenizer.convert_tokens_to_ids("<|transcribe|>")

model.generation_config.forced_decoder_ids = [
    (1, az_token_id),          # position 1: <|az|>
    (2, transcribe_token_id),  # position 2: <|transcribe|>
]
This guarantees:
- Every generation begins with the Azerbaijani language context.
- The task is transcription, not translation.
- The decoder never "guesses" the initial tokens; it is forced into the correct mode.
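For reference, the Hugging Face processor can build an equivalent prefix for you via `get_decoder_prompt_ids`; as far as I know it returns the same `(position, token_id)` pairs (plus a `<|notimestamps|>` entry by default), so treat this as an optional shortcut rather than the method used above:

```python
# Equivalent prefix built by the processor itself.
forced_ids = processor.get_decoder_prompt_ids(language="az", task="transcribe")
print(forced_ids)  # typically [(1, az_id), (2, transcribe_id), (3, notimestamps_id)]

model.generation_config.forced_decoder_ids = forced_ids
```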
Enhanced Decoder Configuration
Beyond these constraints, we apply a few other key decoding configurations:
model.generation_config.use_cache = True
model.generation_config.max_length = 225
model.generation_config.num_beams = 5
- `max_length`: Prevents overly long generations.
- `num_beams`: Enables beam search for better decoding quality.
- `use_cache`: Speeds up decoding by reusing cached key/value pairs from the attention layers.
Together, these settings ensure fast, high-quality, task-specific decoding.
Verifying Constraints Work as Expected
Before training, we should validate that constraints are properly enforced:
import torch

test_sample = common_voice["test"][0]
input_features = torch.tensor(test_sample["input_features"]).unsqueeze(0).to(device)

with torch.no_grad():
    generated_ids = generate_with_constraints(model, input_features)

decoded = processor.tokenizer.batch_decode(generated_ids, skip_special_tokens=False)[0]

if "<|az|>" in decoded and "<|transcribe|>" in decoded:
    print("Language and task constraints working")
else:
    print("Constraints may not be working properly")
This step is crucial for catching decoder configuration issues early, before wasting GPU time on a misconfigured training run.
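One note: the snippet above calls a `generate_with_constraints` helper that isn't defined in this post. A minimal sketch of such a helper, assuming the constraints have already been written to `model.generation_config` as shown earlier, could be as simple as:

```python
import torch

def generate_with_constraints(model, input_features):
    # model.generate() automatically picks up suppress_tokens, forced_decoder_ids,
    # num_beams, and max_length from model.generation_config.
    model.eval()
    with torch.no_grad():
        return model.generate(input_features)
```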
Metrics: Accurate WER with Aligned Outputs
Since we are evaluating with `predict_with_generate=True`, decoding is active during validation. The suppressed tokens and forced prefixes ensure the decoded outputs are in the correct format for computing WER (Word Error Rate):
def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # -100 marks padded label positions; restore the pad token id so the
    # labels can be decoded back to text.
    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id

    pred_str = processor.tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = processor.tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)
    return {"wer": wer}
By aligning the decoder configuration with the task, we get realistic, comparable WER scores.
Conclusion: Guardrails for Stable, Task-Aligned Generation
Our approach puts decoder-level guardrails around Whisper's powerful but general-purpose generation engine.
- Token Suppression avoids accidental mode/language drift.
- Forced Decoder IDs eliminate ambiguity at generation start.
- Constrained Decoding Config provides speed, precision, and clarity.
These techniques don’t alter the model weights, yet they substantially change decoding behavior. That makes them ideal for production deployments, domain adaptation, and fine-tuning on low-resource languages.
This is part of a broader series of experiments I'm running with Whisper and speech models in general. If you're working on fine-tuning Whisper, especially for low-resource languages, I hope this helps you avoid some of the common decoding issues.
You can check out the code here: https://github.com/NijatZeynalov/whisper-experiments
Follow along — I’ll keep sharing my findings as I go!