BetterWhisperX: Enhancing Speech Recognition with Speed and Precision

Introduction

Welcome to a deep dive into BetterWhisperX, an open-source project that's pushing the boundaries of automatic speech recognition (ASR). This is a fork of the well-known WhisperX by Max Bain, which has been further enhanced to offer faster, more accurate transcription services.

What is BetterWhisperX?

BetterWhisperX is a tool designed to improve upon the original WhisperX, which builds on OpenAI's Whisper model. The original WhisperX has since been abandoned by its author, but BetterWhisperX keeps the spirit alive, aiming to provide:

  • Lightning-fast Transcription: Achieve up to 70x real-time speed with the large-v2 model, making it ideal for transcribing long audio files quickly.

  • Reduced GPU Memory: With the integration of the faster-whisper backend, BetterWhisperX requires less than 8GB of GPU memory, even when using the large-v2 model with a beam_size of 5.

  • Accurate Word Timestamps: Utilizing wav2vec2 for forced alignment, this fork ensures timestamps are accurate down to the word level, significantly improving upon the utterance-level timestamps of the original Whisper.

  • Multi-Speaker Recognition: Speaker diarization is incorporated, allowing for the identification of different speakers within an audio file thanks to the integration with pyannote-audio.

  • Voice Activity Detection (VAD): This preprocessing step reduces transcription errors or "hallucinations" by only focusing on segments where there's actual speech, without degrading the word error rate (WER).

Architecture and Improvements

  • Batched Inference: BetterWhisperX leverages batched inference for transcription, which enables up to 70x real-time speed with the large-v2 model. Transcription is performed without timestamps, with VAD-cut audio segments batched so that each segment needs only a single forward pass, although this might cause some discrepancies compared to the default Whisper output.

  • Forced Phoneme Alignment: The tool uses wav2vec2 for alignment, providing word-level timestamps. This contrasts with the original Whisper's utterance-level timestamps, enhancing the accuracy of when each word is spoken. The alignment process (combined with the other stages in the Python sketch after this list) involves:

    • Phoneme ASR Models: These are language-specific and are automatically selected based on the detected language of the audio file. Currently, models are provided for languages like English, French, German, Spanish, Italian, Japanese, Chinese, Dutch, Ukrainian, and Portuguese.

    • Forced Alignment: This process aligns the orthographic transcription with the audio, generating precise phone-level segmentation which is then used to pinpoint word timings.

  • Voice Activity Detection (VAD): Utilizes VAD to segment audio, focusing only on parts with speech, which reduces errors (hallucinations) and allows for more efficient batch processing without degrading transcription quality.

  • Speaker Diarization: Employs pyannote-audio for speaker diarization, which partitions audio into segments by speaker identity. This is crucial for multi-speaker audio files. Diarization involves:

    • Segmentation: Dividing the audio into segments where speech occurs.

    • Speaker Labeling: Assigning speaker IDs to these segments. Users can specify a known range of speakers with --min_speakers and --max_speakers flags to improve accuracy.

      [Figure: WhisperX pipeline architecture]
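
Put together, the three stages above can be sketched in Python roughly as follows. This is a minimal sketch assuming the fork keeps the upstream WhisperX Python API; "audio.mp3", YOUR_HF_TOKEN, and the speaker counts are placeholders you would replace with your own values:

      import whisperx

      device = "cuda"
      audio = whisperx.load_audio("audio.mp3")

      # 1. Batched transcription with the faster-whisper backend
      model = whisperx.load_model("large-v2", device, compute_type="float16")
      result = model.transcribe(audio, batch_size=16)

      # 2. Forced phoneme alignment for word-level timestamps
      align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
      result = whisperx.align(result["segments"], align_model, metadata, audio, device)

      # 3. Speaker diarization via pyannote-audio, then assign speaker labels to words
      # (in some whisperx versions the class lives at whisperx.diarize.DiarizationPipeline)
      diarize_model = whisperx.DiarizationPipeline(use_auth_token="YOUR_HF_TOKEN", device=device)
      diarize_segments = diarize_model(audio, min_speakers=2, max_speakers=4)
      result = whisperx.assign_word_speakers(diarize_segments, result)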

Computational Requirements and Optimization

  • GPU Memory Management: BetterWhisperX is optimized to work with less than 8GB of GPU memory for the large-v2 model, which is significant for users with limited GPU resources (the options below are combined in a short sketch after this list):

    • Batch Size Reduction: Users can adjust the batch size to conserve memory; for example, --batch_size 4.

    • Model Size Selection: Using smaller models like --model base can also reduce memory usage, albeit at the cost of some accuracy.

    • Compute Type: Switching to int8 computation can help manage memory at the expense of potential accuracy loss.

  • CUDA and cuDNN: GPU acceleration depends on CUDA and cuDNN, so users must be aware of version compatibility:

    • Warning for Google Colab Users: ctranslate2 4.5.0 requires cuDNN 9; on environments that still provide cuDNN 8.x, such as Google Colab, users need to install ctranslate2 version 4.4.0 instead.
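
As a rough illustration, the memory-saving options above can be combined on the command line (the flag names follow the upstream WhisperX CLI), and the Colab workaround is a simple version pin:

      # smaller model + smaller batch + int8 compute: lower GPU memory, some accuracy loss
      whisperx audio.mp3 --model base --batch_size 4 --compute_type int8

      # Google Colab (cuDNN 8.x): pin ctranslate2 before transcribing
      pip install ctranslate2==4.4.0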

Python API and CLI Usage

  • Python Interface: Provides detailed control over the transcription process:

      import whisperx
      import gc  # handy for freeing GPU memory between pipeline stages

      device = "cuda"
      audio_file = "audio.mp3"
      batch_size = 16          # reduce if you run out of GPU memory
      compute_type = "float16" # switch to "int8" to lower memory use further

      # transcribe with the batched faster-whisper backend
      model = whisperx.load_model("large-v2", device, compute_type=compute_type)
      audio = whisperx.load_audio(audio_file)
      result = model.transcribe(audio, batch_size=batch_size)

      # optionally free the transcription model before loading the aligner:
      # del model; gc.collect()

      # ... further steps for alignment and diarization ...
    
  • Command Line Interface: Offers a straightforward way to transcribe with various parameters:

      whisperx --model large-v2 --language de examples/sample_de_01.wav --diarize --highlight_words True
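
Note that --diarize relies on gated pyannote models, so a Hugging Face access token must be supplied; in the upstream WhisperX CLI this is passed with --hf_token, and known speaker counts can be hinted as well (YOUR_HF_TOKEN and the speaker numbers below are placeholders):

      whisperx examples/sample_de_01.wav --model large-v2 --diarize --hf_token YOUR_HF_TOKEN --min_speakers 2 --max_speakers 3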
    

Limitations

  • Limited Language Support: BetterWhisperX currently supports phoneme alignment for only a select few languages: {en, fr, de, es, it, ja, zh, nl, uk, pt}. If the audio's language is not within this list, users must manually find or provide a compatible phoneme-based ASR model from resources like the Hugging Face Model Hub (see the sketch after this list). This limitation can significantly impact the tool's usability for less common languages or dialects.

  • Unaligned Words: Words or symbols not represented in the alignment model's phoneme dictionary (e.g., numbers like "2014", currency like "£13.60") cannot be aligned and thus do not receive timestamps.

  • Overlapping Speech: The system does not handle overlapping speech well, where multiple speakers talk simultaneously, which can lead to inaccuracies in transcription or speaker identification.

  • Diarization Accuracy: While BetterWhisperX includes speaker diarization, the process is not infallible. Errors in speaker labeling can occur, particularly in scenarios with many speakers or when speech characteristics are similar.

  • Language-Specific Models: For each language, a specific wav2vec2 model is needed, which means that expanding language support requires additional model development or discovery.
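
For a language outside the supported set, a phoneme-level wav2vec2 model from the Hugging Face Hub can be supplied explicitly. A minimal sketch, assuming the fork keeps the upstream WhisperX --align_model option (the model name below is only an illustrative example of the expected format):

      whisperx polish_audio.wav --language pl --align_model "jonatasgrosman/wav2vec2-large-xlsr-53-polish"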

Contributing to BetterWhisperX

The project welcomes contributions, especially in expanding the language support for phoneme ASR models beyond the currently supported languages. If you can help find or train models for other languages or if you spot bugs, your pull requests would be invaluable.

Future Plans

The roadmap includes:

  • Enhancing examples with diarization and word highlighting.

  • Reintroducing subtitle .ass output.

  • Adding benchmarking capabilities.

  • Improving diarization at the word level, which has proven to be a complex task.

Citation

If you're using BetterWhisperX for research, please cite the original WhisperX paper:

    @article{bain2022whisperx,
      title={WhisperX: Time-Accurate Speech Transcription of Long-Form Audio},
      author={Bain, Max and Huh, Jaesung and Han, Tengda and Zisserman, Andrew},
      journal={INTERSPEECH 2023},
      year={2023}
    }

In The End

BetterWhisperX stands as a testament to the power of open-source collaboration, making speech recognition technology more accessible, faster, and more precise. Whether you're a researcher, developer, or content creator, this tool could revolutionize how you handle audio transcription. Give it a try, and join the community in making speech recognition even better. 😃😃 If you have any questions, you can ask them in the comments.
