Unlocking Audio Insights: Speaker Diarization with WhisperX for "Who Said What"

Victor Nduti

Introduction: The Power of Knowing "Who Said What"

Ever tried analyzing a customer call, a lengthy meeting, or a podcast, only to realize that a raw audio transcript is just a jumbled mess of words? The real challenge isn't just what was said, but who said it, and when. This is where Speaker Diarization comes in, transforming raw audio into structured insights by identifying individual speakers.

This crucial preprocessing step lays the foundation for downstream NLP tasks like sentiment analysis, emotion detection, or rule-based intent extraction. In this post, I’ll walk through how I built my preprocessing stage using WhisperX – a powerful tool that brings "who spoke when" to your audio data.

Understanding OpenAI Whisper: The ASR Foundation

OpenAI's Whisper model revolutionized Automatic Speech Recognition (ASR). Trained on a massive 680,000 hours of multilingual audio, it quickly became the go-to for its capabilities:

  • State-of-the-art accuracy: Among the most accurate open ASR models.

  • Multilingual: Out-of-the-box support for ~100 languages.

  • No commercial restrictions: Fully open source.

  • Simple API: Just install and call .transcribe().

For many transcription tasks (e.g., podcasts, lectures), Whisper is often the default go-to solution due to its impressive performance and ease of use.
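
To ground that claim, here is roughly what plain Whisper usage looks like (a minimal sketch using the openai-whisper package; the audio file name is a placeholder):

Python

import whisper  # the openai-whisper package (pip install -U openai-whisper)

model = whisper.load_model("base")
result = model.transcribe("meeting.wav")  # placeholder audio path
print(result["text"])  # one continuous transcript, with no speaker labels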

❌ Where Whisper Falls Short: Diarization

Despite its prowess, Whisper has a key limitation for multi-speaker conversations: it does not natively label or separate speakers. Diarization—figuring out who spoke when—is not part of its core functionality.

💡 Workarounds (and Why They’re Limited)

Before WhisperX emerged, some teams tried "hacking" diarization onto Whisper. The typical approach involved:

  1. Running a separate diarization library (e.g., pyAudioAnalysis, pyannote.audio) to get speaker timestamps.

  2. Then manually trying to align Whisper’s transcription timestamps with these diarization timestamps.
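
To make the fragility concrete, here's a hypothetical sketch of that manual stitching, with toy segment-level data standing in for real Whisper and diarizer output:

Python

# Toy Whisper output: text with only segment-level timestamps.
whisper_segments = [
    {"start": 0.0, "end": 4.2, "text": "Thanks for joining the call."},
    {"start": 4.2, "end": 7.9, "text": "Happy to be here."},
]

# Toy output from a separate diarization library: (start, end, speaker) turns.
speaker_turns = [
    (0.0, 4.5, "SPEAKER_00"),
    (4.5, 8.0, "SPEAKER_01"),
]

def assign_speaker(seg, turns):
    """Label a segment with the speaker whose turn overlaps it the most."""
    best_speaker, best_overlap = None, 0.0
    for start, end, speaker in turns:
        overlap = min(seg["end"], end) - max(seg["start"], start)
        if overlap > best_overlap:
            best_speaker, best_overlap = speaker, overlap
    return best_speaker

labeled = [{**seg, "speaker": assign_speaker(seg, speaker_turns)} for seg in whisper_segments]
print(labeled)  # segment-level labels only; individual words are never attributed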

This approach was often fraught with problems:

  • Very fragile: Different timestamp resolutions and lack of direct integration.

  • Requires lots of custom logic: Complex code to stitch everything together.

  • Hard to keep in sync: Especially on longer audio files.

  • No word-level speaker attribution: You'd only get speaker segments, not which specific words were spoken by which speaker.

This is precisely why tools like WhisperX became essential: to bridge that gap cleanly and efficiently.

🧠 Introducing WhisperX: The "Who Said What" Enhancer

WhisperX is a powerful extension to OpenAI’s Whisper, created by Max Bain, a researcher from the University of Oxford. It adds crucial capabilities that Whisper lacks:

✅ Word-level alignment: Precise timestamps for every word.

✅ Speaker diarization: Identifying and labeling individual speakers.

✅ Better timing accuracy: Beyond Whisper's segment-level timestamps.

✅ Seamless integration: Works directly with Hugging Face's pyannote.audio diarization models.

🧪 How WhisperX Works Under the Hood

WhisperX isn't a brand new ASR model; instead, it orchestrates specialized models to enhance Whisper's output:

  1. Whisper transcribes the audio into coarse, utterance-level segments.

  2. A dedicated phoneme-based alignment model (like Wav2Vec2) refines these timestamps, providing highly accurate word-level timings.

  3. A robust diarization model from pyannote.audio analyzes the audio to split it into distinct speaker segments.

  4. Finally, WhisperX intelligently combines the word-level transcription with the speaker segments, assigning a speaker ID to each word and phrase with exact timing.

It's this clever orchestration that provides the rich, structured output we're looking for. If you’re interested in the deeper technical details, you can read the WhisperX research paper here.


🧑‍💻 How to Use WhisperX: Getting Started

You can run WhisperX in two main environments:

✅ Google Colab: The easiest way to get started, offering free GPU access.

✅ Your local machine: Provides more control for larger projects or sensitive data.

Prerequisites for WhisperX: Hugging Face Authentication

WhisperX uses pyannote.audio for speaker diarization. These are powerful research models hosted on Hugging Face that require authentication.

You’ll need:

  1. A Hugging Face account (free to create).

  2. A User Access Token (often called a "Read Token").

📜 Accept Model Terms on Hugging Face

Before you can use the diarization models in your code, you must accept the Terms and Conditions for each one on its Hugging Face model page. For the pipeline WhisperX currently uses, these are:

  • pyannote/speaker-diarization-3.1

  • pyannote/segmentation-3.0

You'll be asked for details like your email or organization. Once accepted (which you only need to do once per model per account), these models can be pulled automatically by your notebook/script using your token.

🎟️ Getting Your User Access Token

  1. Go to your Hugging Face account page: 👉 https://huggingface.co/settings/tokens

  2. Click "New Token" → choose "Read" access.

  3. Copy the token string (it will look like hf_xxx...).

⚙️ Why the Hugging Face Token and Model Terms?

Your Hugging Face User Access Token acts as your programmatic login credential. When WhisperX tries to download pyannote.audio models, it authenticates with your token.

These models, while often open-source, are gated to ensure users acknowledge and agree to their specific licensing terms (e.g., privacy considerations for data they were trained on, or research-only usage).

Your token signals that you've accepted these terms.
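
If you'd rather not pass the token around explicitly, one option (assuming the huggingface_hub package, which is installed alongside the diarization dependencies) is to log in once per environment so model downloads are authenticated automatically:

Python

# Optional: authenticate once instead of passing hf_token to every call.
# The token value below is a placeholder; paste your own read token,
# or call login() with no arguments to be prompted interactively.
from huggingface_hub import login

login(token="hf_xxx")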


Setting Up Your Environment

Installation on Google Colab

Colab is the easiest way to get started because it provides GPU support without any local setup.

!pip install -U whisperx

import whisperx
import pandas as pd
from google.colab import userdata

device = "cuda"
audio_file = "/content/F_0101_13y1m_1.wav"
compute_type = "float16"  # precision for model computations; use "int8" on CPU or low-memory GPUs
hf_token = userdata.get('HF_TOKEN')

Code Explanation (Setup Section):

  • !pip install -U whisperx: This command (prefixed with ! in Colab to run as a shell command) installs or updates the whisperx library. This is our core tool for all the magic that follows.

  • import whisperx: Makes the whisperx library functions available in our Python script.

  • import pandas as pd: Imports the pandas library, aliased as pd. Pandas is indispensable for working with tabular data (like the final transcription results we'll generate into DataFrames).

  • from google.colab import userdata: This line is specific to Google Colab. It allows us to securely retrieve sensitive information, like our Hugging Face token, from Colab's "Secrets" feature. This is much better than hardcoding your token directly in the script!

  • device = "cuda" (or "cpu"): This crucial variable tells whisperx where to perform its heavy computations (a small auto-detection snippet follows this list).

    • "cuda": Leverages your NVIDIA GPU, providing significantly faster processing times, especially for longer audio or larger models. Highly recommended if you have a GPU runtime.

    • "cpu": Uses your computer's main processor. Slower, but works everywhere without special hardware.

  • audio_file = "/content/F_0101_13y1m_1.wav": This defines the path to the audio file we want to process. In Colab, files uploaded directly to the session typically reside in the /content/ directory. Make sure your .wav file is there!

  • compute_type = "float16": This specifies the numerical precision for the model's internal calculations.

    • "float16": The usual choice on modern GPUs (those with Tensor Cores), offering a good balance between speed, memory usage, and accuracy.

    • "int8": Fastest and lowest memory, with at most a minor accuracy trade-off; also the sensible choice on CPU.

    • "int16": A CPU-oriented quantization mode; most GPUs don't support it efficiently, which is why it isn't used here.

  • hf_token = userdata.get('HF_TOKEN'): Retrieves your Hugging Face read token. This token is required to access certain gated models, specifically the pyannote/speaker-diarization model that WhisperX uses internally. Ensure you have accepted the terms and conditions for this model on Hugging Face Hub.
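
One small convenience (not required by WhisperX, purely my own hedge against accidentally running on a CPU runtime): instead of hard-coding device and compute_type as above, you can pick them based on what's actually available:

Python

import torch

# Choose device and precision automatically: GPU -> float16, CPU -> int8.
if torch.cuda.is_available():
    device, compute_type = "cuda", "float16"
else:
    device, compute_type = "cpu", "int8"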

Step 1: Transcribing the Audio with Whisper

Our journey begins by converting speech to text using the powerful Whisper model.

Python

print(f"Loading Whisper model ('base') on {device}...")
model = whisperx.load_model("base", device, compute_type=compute_type)

print(f"Transcribing audio: {audio_file}")
result = model.transcribe(audio_file)

Quick Explanation:

  • whisperx.load_model("base", ...): Loads the base Whisper ASR model. You can choose larger models like "medium" or "large" for higher accuracy at the cost of speed/memory.

  • model.transcribe(audio_file): Executes the transcription. The result dictionary stores the initial transcribed segments and the detected audio language.
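
As an aside, WhisperX can also work from a pre-loaded audio array, which lets you batch the transcription and reuse the waveform in later steps (this mirrors the upstream README; the batch size is just a reasonable default):

Python

# Optional variant: load the audio once and transcribe it in batches.
audio = whisperx.load_audio(audio_file)          # 16 kHz mono waveform as a NumPy array
result = model.transcribe(audio, batch_size=16)  # larger batches speed up long files on GPU

print(result["language"])             # detected language code, e.g. "en"
print(result["segments"][0]["text"])  # first transcribed segment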

Step 2: Accurate Word-Level Alignment

Whisper's initial output provides segment-level timestamps. For precise diarization, we need accurate timestamps for every single word. WhisperX achieves this with a dedicated alignment model.

Python

print(f"Loading alignment model for language: {result['language']}...")
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)

print("Performing word-level alignment...")
result = whisperx.align(result["segments"], model_a, metadata, audio_file, device, return_char_alignments=False)

print("\n--- Sample Aligned Segment ---")
print(result["segments"][0])

Quick Explanation:

  • whisperx.load_align_model(...): Loads a specialized model to align transcribed words to their exact positions in the audio waveform. It uses the detected language (result["language"]) to load the correct model.

  • whisperx.align(...): Applies the alignment model to our result["segments"], enriching each segment with a words list containing precise start and end timestamps for every word.
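
To see what the alignment actually adds, it's worth printing a few of the per-word entries (the keys below follow WhisperX's aligned output):

Python

# Each aligned segment now carries a "words" list with per-word timing.
for word in result["segments"][0]["words"][:5]:
    print(word)  # e.g. {'word': 'Hello', 'start': 0.52, 'end': 0.78, 'score': 0.91}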


Step 3: Speaker Diarization - Who Said What?

This is where we assign speaker labels. WhisperX integrates with powerful diarization models to identify distinct voices in the audio.

Python

diarize_model = whisperx.diarize.DiarizationPipeline(use_auth_token=hf_token, device=device)

diarize_segments = diarize_model(audio_file)

with_speakers = whisperx.assign_word_speakers(diarize_segments, result)

Quick Explanation:

  • whisperx.diarize.DiarizationPipeline(...): Loads the diarization model. Crucially, your hf_token is required here to access the gated pyannote/speaker-diarization model.

  • diarize_model(audio_file): Executes the diarization process, identifying distinct speaker turns in the audio. You can pass min_speakers and max_speakers if you know the approximate number of speakers, which can improve accuracy (see the example after this list).

  • whisperx.assign_word_speakers(...): This function is the bridge! It intelligently combines the speaker turns from diarize_segments with the word-level timestamps from result, assigning a speaker label (e.g., SPEAKER_00, SPEAKER_01) to each word.
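
For example, if you already know the recording is a two-person conversation, constraining the pipeline tends to improve the speaker clustering (two speakers here is just an assumption for illustration):

Python

# Re-run diarization with known speaker bounds (assumed: exactly two speakers).
diarize_segments = diarize_model(audio_file, min_speakers=2, max_speakers=2)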


Step 4: Structuring Results with Pandas

To easily view, analyze, and export our speaker-attributed transcription, we'll convert the results into user-friendly Pandas DataFrames.

Python

# Extract and convert to DataFrames
full_segments_df = pd.DataFrame(with_speakers['segments'])
word_segments_df = pd.DataFrame(with_speakers['word_segments'])

print(full_segments_df.head(5))

print(word_segments_df.head(5))

Quick Explanation:

  • pd.DataFrame(with_speakers['segments']): Creates a DataFrame where each row represents a transcribed segment, now including its assigned speaker and a list of its words with individual timestamps and speakers.

  • pd.DataFrame(with_speakers['word_segments']): Creates a more granular DataFrame. Each row here is a single word, showing its text, precise start/end times, and the speaker who uttered it. This is incredibly useful for detailed analysis.
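
From here it's straightforward to turn the DataFrames into something human-readable, or to persist them for the downstream NLP steps mentioned in the introduction (column names follow the DataFrames built above; segments the diarizer couldn't attribute may lack a speaker value):

Python

# 1) Print a readable "who said what" transcript from the segment-level table.
for _, row in full_segments_df.iterrows():
    speaker = row.get("speaker", "UNKNOWN")
    print(f"[{row['start']:.1f}s-{row['end']:.1f}s] {speaker}: {row['text'].strip()}")

# 2) Save the word-level table for later analysis (sentiment, intent, etc.).
word_segments_df.to_csv("diarized_words.csv", index=False)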

Running WhisperX Locally: A Quick Setup Guide

While Google Colab is fantastic for quick starts, you might prefer running WhisperX on your local machine for larger projects or privacy. The Python code remains identical, but you'll need to handle a few prerequisites:

  1. Python Version & Virtual Environment: WhisperX officially requires Python 3.9 or higher (up to 3.12). Python 3.10 and 3.11 are often good, stable choices. Always use a virtual environment (like venv) to keep your project dependencies clean and isolated.

    Bash

     python3 -m venv my_whisperx_env
     source my_whisperx_env/bin/activate
    
  2. FFmpeg Installation: WhisperX relies on FFmpeg for robust audio handling. This is a separate command-line tool, not a Python package. Install it for your operating system (e.g., brew install ffmpeg on macOS, sudo apt install ffmpeg on Linux, or download/add to PATH on Windows). Verify with ffmpeg -version in your terminal.

  3. PyTorch Installation (for GPU): If you plan to use your NVIDIA GPU (device = "cuda"), you must install PyTorch with CUDA support correctly. Visit the official PyTorch website to get the precise pip install command for your specific system and CUDA version. For CPU-only, the pip install torch command is simpler.
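
One small difference from the Colab snippet above: outside Colab there's no userdata helper, so a common pattern is to read the token from an environment variable instead (HF_TOKEN is just the variable name used earlier, not anything WhisperX requires):

Python

import os

# Replace the Colab-specific `userdata.get('HF_TOKEN')` line with something like:
hf_token = os.environ.get("HF_TOKEN")  # e.g. export HF_TOKEN=hf_xxx before running
if hf_token is None:
    raise RuntimeError("Set HF_TOKEN to your Hugging Face read token.")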

Conclusion: Your Audio, Now Fully Understood

You've just performed sophisticated speech-to-text transcription and speaker diarization with WhisperX! From raw audio, you now have:

  • Accurate Transcriptions: What was said.

  • Precise Timestamps: Exactly when each word was spoken.

  • Speaker Attribution: Who spoke each word.

This powerful combination unlocks new levels of insight for meetings, interviews, podcasts, and more. Experiment with different audio files, Whisper model sizes, and the min_speakers/max_speakers parameters to optimize results for your specific use cases.
