Controlling Whisper Output with Inference-Time Token Constraints

Nijat Zeynalov

The Whisper model by OpenAI is a powerful multilingual, multitask ASR system. However, this generality can become a liability when adapting Whisper to a specific language and task, such as transcribing Azerbaijani speech without translation or language drift.

While experimenting with fine-tuning OpenAI's Whisper for Azerbaijani, I quickly ran into a key challenge: the decoder often ignored my intentions.

Despite fine-tuning it on Azerbaijani speech, it occasionally switched languages, repeated outputs, or even hallucinated translations. So I dug into decoder-level optimization.

When fine-tuning Whisper on a low-resource language like Azerbaijani, the decoder's flexibility needs to be strategically constrained. Without such constraints, the decoder may:

  • Default to higher-resource languages like English during decoding.

  • Switch to translation mode when only transcription is desired.

  • Generate incorrect or repetitive tokens due to multilingual pretraining.

In this post, I’m sharing the practical experiments I used to make Whisper’s decoder behave predictably — and the actual code I used to implement these constraints.

This post is part of my ongoing experiments with speech models. I’ll keep sharing things I apply as I go.


Why Whisper Needs Decoder Constraints

The Whisper decoder is autoregressive, meaning it generates tokens one by one, each conditioned on previously generated tokens. It uses special tokens such as <|en|>, <|az|>, <|transcribe|>, and <|translate|> to set the language and task.
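If you want to see these control tokens concretely, you can look up their ids in the tokenizer. A minimal sketch, assuming the multilingual openai/whisper-small checkpoint (any multilingual size behaves the same way):

from transformers import WhisperProcessor

# load the processor; its tokenizer knows all of Whisper's special control tokens
processor = WhisperProcessor.from_pretrained("openai/whisper-small")

# look up the ids of the tokens used throughout this post
for tok in ["<|startoftranscript|>", "<|az|>", "<|en|>", "<|transcribe|>", "<|translate|>"]:
    print(tok, processor.tokenizer.convert_tokens_to_ids(tok))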

By default, the model may not reliably pick the right tokens during decoding unless explicitly instructed. These issues are amplified during fine-tuning, where only a subset of the model’s parameters are adapted.

Goal:
Force Whisper to always:

  1. Transcribe (not translate).

  2. Transcribe in Azerbaijani only.

  3. Start generation with the correct prefix tokens.

This is achieved without changing model weights, using the following mechanisms:


Three Key Decoder Controls

| Constraint Type | Purpose | Method |
| --- | --- | --- |
| Token Suppression | Prevent generation of wrong language/task | suppress_tokens |
| Forced Decoder IDs | Enforce the right task + language context | forced_decoder_ids |
| Decoder Configuration | Apply limits like beam search and max length | generation_config |

Token Suppression: Silencing Unwanted Outputs

Whisper’s tokenizer includes hundreds of tokens like <|en|>, <|fr|>, <|translate|>, etc. Unless suppressed, the model may generate these during decoding.

The following code implements dynamic token suppression by scanning the tokenizer for every language and task token except Azerbaijani and transcription:

# collect the ids of every language tag except <|az|>
language_tokens = []
for token_id in range(len(processor.tokenizer)):
    token = processor.tokenizer.convert_ids_to_tokens(token_id)
    if token and token.startswith("<|") and token.endswith("|>") and token != "<|az|>":
        code = token[2:-2]
        if 2 <= len(code) <= 3 and code.isalpha():  # likely a language tag, e.g. <|en|> or <|yue|>
            language_tokens.append(token_id)

# Also suppress <|translate|>
task_translate_id = processor.tokenizer.convert_tokens_to_ids("<|translate|>")
if task_translate_id is not None:
    language_tokens.append(task_translate_id)

suppress_tokens = language_tokens

Suppression Effect:

All tokens in suppress_tokens have their logits pushed to negative infinity during generation, so they effectively receive zero probability. This ensures that no other language or the translation task can be selected, regardless of the model's pretraining biases.

model.generation_config.suppress_tokens = suppress_tokens
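Before training, it's worth a quick sanity check on what actually landed in the suppression list. A small sketch, reusing the processor from above:

# spot-check: the suppressed ids should decode to language tags and <|translate|>
print(processor.tokenizer.convert_ids_to_tokens(suppress_tokens[:10]))

# and the tokens we want to keep must not be in the list
assert processor.tokenizer.convert_tokens_to_ids("<|az|>") not in suppress_tokens
assert processor.tokenizer.convert_tokens_to_ids("<|transcribe|>") not in suppress_tokens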

Forced Decoder IDs: Controlling the Generation Prefix

Whisper models expect a prefix of special tokens at the beginning of generation, typically:

<|startoftranscript|><|language|><|task|>

For our use case, the correct prefix is:

<|startoftranscript|><|az|><|transcribe|>

These are injected at fixed decoder positions using forced_decoder_ids:

az_token_id = processor.tokenizer.convert_tokens_to_ids("<|az|>")
transcribe_token_id = processor.tokenizer.convert_tokens_to_ids("<|transcribe|>")

model.generation_config.forced_decoder_ids = [
    (1, az_token_id),           # position 1: <|az|>
    (2, transcribe_token_id)    # position 2: <|transcribe|>
]

This guarantees:

  • Every generation begins with Azerbaijani language context.

  • The task is transcription, not translation.

  • The decoder never "guesses" the initial tokens — it's forced into the correct mode.
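An equivalent, slightly more convenient route is the processor's built-in helper, which additionally forces <|notimestamps|> at position 3. A sketch, assuming the same processor as above:

# returns [(1, <|az|>), (2, <|transcribe|>), (3, <|notimestamps|>)]
forced_ids = processor.get_decoder_prompt_ids(language="az", task="transcribe")
model.generation_config.forced_decoder_ids = forced_ids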


Enhanced Decoder Configuration

Beyond these constraints, a few other generation settings matter:

model.generation_config.use_cache = True
model.generation_config.max_length = 225
model.generation_config.num_beams = 5

  • max_length: Prevents overly long generations.

  • num_beams: Enables beam search for better decoding quality.

  • use_cache: Speeds up decoding using cached key/value pairs from attention layers.

Together, these settings ensure fast, high-quality, task-specific decoding.
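Because these values live on the generation config, any later model.generate call picks them up automatically, and individual settings can still be overridden per call. A quick sketch, assuming input_features is a preprocessed batch:

# inherits suppress_tokens, forced_decoder_ids, num_beams, max_length and use_cache
ids_beam = model.generate(input_features)

# per-call override: greedy decoding for a faster smoke test
ids_greedy = model.generate(input_features, num_beams=1)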


Verifying Constraints Work as Expected

Before training, we should validate that constraints are properly enforced:

import torch

# grab one preprocessed example; common_voice, model, processor and device
# come from the earlier data-prep and model-loading steps
test_sample = common_voice["test"][0]
input_features = torch.tensor(test_sample["input_features"]).unsqueeze(0).to(device)

with torch.no_grad():
    generated_ids = generate_with_constraints(model, input_features)
    decoded = processor.tokenizer.batch_decode(generated_ids, skip_special_tokens=False)[0]

    if "<|az|>" in decoded and "<|transcribe|>" in decoded:
        print("Language and task constraints working")
    else:
        print("Constraints may not be working properly")

This step is crucial for debugging decoder configuration issues early, before wasting GPU time on a misaligned training run.
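The generate_with_constraints helper called above isn't shown in this post; a minimal sketch of one way it could look, assuming the constraints were stored on model.generation_config as described earlier:

def generate_with_constraints(model, input_features):
    # the suppression list, forced prefix, beam search and max_length are all
    # read from model.generation_config, so a plain generate() call is enough
    return model.generate(input_features)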


Metrics: Accurate WER with Aligned Outputs

Since we are evaluating with predict_with_generate=True, decoding is active during validation. The suppressed tokens and forced prefixes ensure the decoded outputs are in the correct format for computing WER (Word Error Rate):

def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids
    # -100 marks padding positions ignored by the loss; restore the real pad
    # token id so the tokenizer can decode the labels
    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id

    pred_str = processor.tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = processor.tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)
    return {"wer": wer}
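The metric object used here is assumed to be the WER metric from Hugging Face's evaluate library, loaded once before training:

import evaluate

# Word Error Rate metric consumed by compute_metrics above
metric = evaluate.load("wer")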

By aligning the decoder configuration with the task, we get realistic, comparable WER scores.


Conclusion: Guardrails for Stable, Task-Aligned Generation

Our approach puts decoder-level guardrails around Whisper's powerful but general-purpose generation engine.

  • Token Suppression avoids accidental mode/language drift.

  • Forced Decoder IDs eliminate ambiguity at generation start.

  • Constrained Decoding Config provides speed, precision, and clarity.

These techniques don’t alter model weights, yet they substantially change the model's behavior. This makes them ideal for production-ready deployments, domain adaptation, or fine-tuning on low-resource languages.

This is part of a broader series of experiments I'm running with Whisper and speech models in general. If you're working on fine-tuning Whisper, especially for low-resource languages, I hope this helps you avoid some of the common decoding issues.

You can check out the code here: https://github.com/NijatZeynalov/whisper-experiments

Follow along — I’ll keep sharing my findings as I go!
