Unlocking Audio Insights: Speaker Diarization with WhisperX for "Who Said What"


Introduction: The Power of Knowing "Who Said What"
Ever tried analyzing a customer call, a lengthy meeting, or a podcast, only to realize that a raw audio transcript is just a jumbled mess of words? The real challenge isn't just what was said, but who said it, and when. This is where Speaker Diarization comes in, transforming raw audio into structured insights by identifying individual speakers.
This crucial preprocessing step lays the foundation for all downstream NLP tasks like sentiment analysis, emotion detection, or rule-based intent extraction. In this post, I’ll walk through how I built my preprocessing stage using WhisperX, a powerful tool that brings "who spoke when" to your audio data.
Understanding OpenAI Whisper: The ASR Foundation
OpenAI's Whisper model revolutionized Automatic Speech Recognition (ASR). Trained on a massive 680,000 hours of multilingual audio, it quickly became the go-to for its capabilities:
✅ State-of-the-art accuracy: Among the most accurate open ASR models.
✅ Multilingual: Out-of-the-box support for ~100 languages.
✅ No commercial restrictions: Fully open source.
✅ Simple API: Just install and call .transcribe().
For many transcription tasks (e.g., podcasts, lectures), Whisper is often the default go-to solution due to its impressive performance and ease of use.
❌ Where Whisper Falls Short: Diarization
Despite its prowess, Whisper has a key limitation for multi-speaker conversations: it does not natively label or separate speakers. Diarization—figuring out who spoke when—is not part of its core functionality.
💡 Workarounds (and Why They’re Limited)
Before WhisperX emerged, some teams tried "hacking" diarization onto Whisper. The typical approach involved:
Running a separate diarization library (e.g., pyAudioAnalysis or pyannote.audio) to get speaker timestamps.
Then manually trying to align Whisper’s transcription timestamps with these diarization timestamps.
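For context, a rough sketch of that manual stitching might look like the following. This assumes you already have Whisper segments and diarizer speaker turns as plain timestamped records; the helper names are hypothetical, not part of any library.
Python
# Hypothetical sketch of the pre-WhisperX "hack": assign each Whisper segment
# to whichever diarizer speaker turn it overlaps the most.

def overlap(a_start, a_end, b_start, b_end):
    # Length of the time overlap between two intervals (0 if disjoint).
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def label_segments(whisper_segments, speaker_turns):
    # whisper_segments: [{"start": float, "end": float, "text": str}, ...]
    # speaker_turns: [(start, end, "SPEAKER_00"), ...] from a separate diarizer
    labeled = []
    for seg in whisper_segments:
        best = max(
            speaker_turns,
            key=lambda turn: overlap(seg["start"], seg["end"], turn[0], turn[1]),
            default=(None, None, "UNKNOWN"),
        )
        labeled.append({**seg, "speaker": best[2]})
    return labeled
Even at its best, this kind of glue code only gives you segment-level attribution, and it falls apart when a single segment spans two speakers.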
This approach was often fraught with problems:
Very fragile: Different timestamp resolutions and lack of direct integration.
Requires lots of custom logic: Complex code to stitch everything together.
Hard to keep in sync: Especially on longer audio files.
No word-level speaker attribution: You'd only get speaker segments, not which specific words were spoken by which speaker.
This is precisely why tools like WhisperX became essential: to bridge that gap cleanly and efficiently.
🧠 Introducing WhisperX: The "Who Said What" Enhancer
WhisperX is a powerful extension to OpenAI’s Whisper, created by Max Bain, a researcher from the University of Oxford. It adds crucial capabilities that Whisper lacks:
✅ Word-level alignment: Precise timestamps for every word.
✅ Speaker diarization: Identifying and labeling individual speakers.
✅ Better timing accuracy: Beyond Whisper's segment-level timestamps.
✅ Seamless integration: Works directly with Hugging Face's pyannote.audio diarization models.
🧪 How WhisperX Works Under the Hood
WhisperX isn't a brand new ASR model; instead, it orchestrates specialized models to enhance Whisper's output:
Whisper transcribes the audio into coarse, utterance-level segments.
A dedicated phoneme-based alignment model (like Wav2Vec2) refines these timestamps, providing highly accurate word-level timings.
A robust diarization model from pyannote.audio analyzes the audio to split it into distinct speaker segments.
Finally, WhisperX intelligently combines the word-level transcription with the speaker segments, assigning a speaker ID to each word and phrase with exact timing.
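To make that concrete, the end result is word-level data shaped roughly like the example below. This is an illustrative sketch with made-up values, not actual output from a specific file.
Python
# Illustrative shape of a speaker-attributed, word-aligned segment (values made up).
example_segment = {
    "start": 12.34,
    "end": 15.02,
    "text": "Thanks for joining the call today.",
    "speaker": "SPEAKER_00",
    "words": [
        {"word": "Thanks", "start": 12.34, "end": 12.61, "speaker": "SPEAKER_00"},
        {"word": "for", "start": 12.61, "end": 12.74, "speaker": "SPEAKER_00"},
        # ... one entry per word, each with its own timing and speaker label
    ],
}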
It's this clever orchestration that provides the rich, structured output we're looking for. If you’re interested in the deeper technical details, you can read the WhisperX research paper here.
🧑💻 How to Use WhisperX: Getting Started
You can run WhisperX in two main environments:
✅ Google Colab: The easiest way to get started, offering free GPU access.
✅ Your local machine: Provides more control for larger projects or sensitive data.
Prerequisites for WhisperX: Hugging Face Authentication
WhisperX uses pyannote.audio for speaker diarization. These are powerful research models hosted on Hugging Face that require authentication.
✅ You’ll need:
A Hugging Face account (free to create).
A User Access Token (often called a "Read Token").
📜 Accept Model Terms on Hugging Face
Before you can use the diarization models in your code, you must accept the Terms and Conditions for each on their respective Hugging Face website pages. Typically, these are:
pyannote/speaker-diarization-3.1
pyannote/segmentation-3.0
(The exact set of gated models can vary with the pipeline version, so accept the terms for any model the download step asks you about.)
You'll be asked for details like your email or organization. Once accepted (which you only need to do once per model per account), these models can be pulled automatically by your notebook/script using your token.
🎟️ Getting Your User Access Token
Go to your Hugging Face account page: 👉 https://huggingface.co/settings/tokens
Create a new token with "Read" access if you don't already have one.
Copy the token string (it will look like hf_xxx...).
⚙️ Why the Hugging Face Token and Model Terms?
Your Hugging Face User Access Token acts as your programmatic login credential. When WhisperX tries to download pyannote.audio models, it authenticates with your token.
These models, while often open-source, are gated to ensure users acknowledge and agree to their specific licensing terms (e.g., privacy considerations for data they were trained on, or research-only usage).
Your token signals that you've accepted these terms.
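Outside Colab you won't have the userdata helper, so a common pattern is to read the token from an environment variable, or log in once with the huggingface_hub library. The sketch below assumes you have exported HF_TOKEN in your shell yourself.
Python
import os
from huggingface_hub import login  # usually installed alongside WhisperX's dependencies

# Assumes you have set HF_TOKEN in your shell, e.g. export HF_TOKEN=hf_xxx...
hf_token = os.environ.get("HF_TOKEN")

# Optional: caches the credential locally so gated pyannote models can be pulled later.
if hf_token:
    login(token=hf_token)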
Setting Up Your Environment
Installation on Google Colab
Colab is the easiest way to get started because it provides GPU support without any local setup.
!pip install -U whisperx  # the ! runs this as a shell command in Colab
import whisperx
import pandas as pd
from google.colab import userdata  # Colab Secrets, used to fetch the Hugging Face token securely
device = "cuda"  # switch to "cpu" if no GPU runtime is available
audio_file = "/content/F_0101_13y1m_1.wav"  # path to the uploaded audio file
compute_type = "int16"  # numerical precision used for inference
hf_token = userdata.get('HF_TOKEN')  # Hugging Face read token stored as a Colab secret
Code Explanation (Setup Section):
!pip install -U whisperx: This command (prefixed with ! in Colab to run as a shell command) installs or updates the whisperx library. This is our core tool for all the magic that follows.
import whisperx: Makes the whisperx library functions available in our Python script.
import pandas as pd: Imports the pandas library, aliased as pd. Pandas is indispensable for working with tabular data (like the final transcription results we'll generate into DataFrames).
from google.colab import userdata: This line is specific to Google Colab. It allows us to securely retrieve sensitive information, like our Hugging Face token, from Colab's "Secrets" feature. This is much better than hardcoding your token directly in the script!
device = "cuda" (or "cpu"): This crucial variable tells whisperx where to perform its heavy computations.
"cuda": Leverages your NVIDIA GPU, providing significantly faster processing times, especially for longer audio or larger models. Highly recommended if you have a GPU runtime.
"cpu": Uses your computer's main processor. Slower, but works everywhere without special hardware.
audio_file = "/content/F_0101_13y1m_1.wav": This defines the path to the audio file we want to process. In Colab, files uploaded directly to the session typically reside in the /content/ directory. Make sure your .wav file is there!
compute_type = "int16": This specifies the numerical precision for the model's internal calculations.
"int16": A good default, offering a balance between speed, memory usage, and accuracy.
"float16": Can be even faster on modern GPUs (those with Tensor Cores), but might use slightly more memory.
"int8": Fastest, lowest memory, but might come with a minor accuracy trade-off.
hf_token = userdata.get('HF_TOKEN'): Retrieves your Hugging Face read token. This token is required to access certain gated models, specifically the pyannote/speaker-diarization model that WhisperX uses internally. Ensure you have accepted the terms and conditions for this model on Hugging Face Hub.
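If you're not sure which runtime you're on, a small sketch like this picks the device automatically. It assumes PyTorch is available, which the whisperx install pulls in.
Python
import torch

# Fall back to CPU automatically when no CUDA-capable GPU is visible.
device = "cuda" if torch.cuda.is_available() else "cpu"
# float16 works well on GPU; int8 keeps memory usage low on CPU.
compute_type = "float16" if device == "cuda" else "int8"
print(f"Using device={device}, compute_type={compute_type}")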
Step 1: Transcribing the Audio with Whisper
Our journey begins by converting speech to text using the powerful Whisper model.
Python
print(f"Loading Whisper model ('base') on {device}...")
model = whisperx.load_model("base", device, compute_type=compute_type)
print(f"Transcribing audio: {audio_file}")
result = model.transcribe(audio_file)
Quick Explanation:
whisperx.load_model("base", ...)
: Loads the base Whisper ASR model. You can choose larger models like"medium"
or"large"
for higher accuracy at the cost of speed/memory.model.transcribe(audio_file)
: Executes the transcription. Theresult
dictionary stores the initial transcribed segments and the detected audio language.
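As a quick sanity check, you can peek at what came back. The keys used here are the ones the transcription result contains; the actual text will of course depend on your audio file.
Python
# Inspect the raw transcription before alignment.
print("Detected language:", result["language"])
print("Number of segments:", len(result["segments"]))
print("First segment:", result["segments"][0]["text"])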
Step 2: Accurate Word-Level Alignment
Whisper's initial output provides segment-level timestamps. For precise diarization, we need accurate timestamps for every single word. WhisperX achieves this with a dedicated alignment model.
Python
print(f"Loading alignment model for language: {result['language']}...")
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
print("Performing word-level alignment...")
result = whisperx.align(result["segments"], model_a, metadata, audio_file, device, return_char_alignments=False)
print("\n--- Sample Aligned Segment ---")
print(result["segments"][0])
Quick Explanation:
whisperx.load_align_model(...): Loads a specialized model to align transcribed words to their exact positions in the audio waveform. It uses the detected language (result["language"]) to load the correct model.
whisperx.align(...): Applies the alignment model to our result["segments"], enriching each segment with a words list containing precise start and end timestamps for every word.
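To see those word-level timings, you can loop over the first aligned segment's words list. This sketch guards against the occasional word that comes back without timestamps.
Python
# Print each word of the first aligned segment with its start/end time.
for w in result["segments"][0]["words"]:
    # Some words may occasionally lack timestamps; guard with .get().
    print(f"{w['word']:>15}  {w.get('start', '?'):>7}  {w.get('end', '?'):>7}")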
Step 3: Speaker Diarization - Who Said What?
This is where we assign speaker labels. WhisperX integrates with powerful diarization models to identify distinct voices in the audio.
Python
diarize_model = whisperx.diarize.DiarizationPipeline(use_auth_token=hf_token, device=device)
diarize_segments = diarize_model(audio_file)
with_speakers = whisperx.assign_word_speakers(diarize_segments, result)
Quick Explanation:
whisperx.diarize.DiarizationPipeline(...): Loads the diarization model. Crucially, your hf_token is required here to access the gated pyannote/speaker-diarization model.
diarize_model(audio_file): Executes the diarization process, identifying distinct speaker turns in the audio. You can pass min_speakers and max_speakers if you know the approximate number of speakers to improve accuracy (see the sketch after this list).
whisperx.assign_word_speakers(...): This function is the bridge! It intelligently combines the speaker turns from diarize_segments with the word-level timestamps from result, assigning a speaker label (e.g., SPEAKER_00, SPEAKER_01) to each word.
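For example, if you know the recording is a two-person interview, you can constrain the diarizer like this. The values here are just an example; adjust them to your audio.
Python
# Constrain the diarizer when the speaker count is roughly known (example values).
diarize_segments = diarize_model(audio_file, min_speakers=2, max_speakers=2)
with_speakers = whisperx.assign_word_speakers(diarize_segments, result)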
Step 4: Structuring Results with Pandas
To easily view, analyze, and export our speaker-attributed transcription, we'll convert the results into user-friendly Pandas DataFrames.
Python
# Extract and convert to DataFrames
full_segments_df = pd.DataFrame(with_speakers['segments'])
word_segments_df = pd.DataFrame(with_speakers['word_segments'])
print(full_segments_df.head(5))
print(word_segments_df.head(5))
Quick Explanation:
pd.DataFrame(with_speakers['segments']): Creates a DataFrame where each row represents a transcribed segment, now including its assigned speaker and a list of its words with individual timestamps and speakers.
pd.DataFrame(with_speakers['word_segments']): Creates a more granular DataFrame. Each row here is a single word, showing its text, precise start/end times, and the speaker who uttered it. This is incredibly useful for detailed analysis.
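From here it's straightforward to export or aggregate the results. A small sketch, assuming the word-level DataFrame contains speaker and word columns as produced above, might save everything to CSV and count words per speaker.
Python
# Persist the word-level transcript for downstream NLP work.
word_segments_df.to_csv("transcript_words.csv", index=False)

# Quick look at how much each speaker talked (word counts).
print(word_segments_df.groupby("speaker")["word"].count())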
Running WhisperX Locally: A Quick Setup Guide
While Google Colab is fantastic for quick starts, you might prefer running WhisperX on your local machine for larger projects or privacy. The Python code remains identical, but you'll need to handle a few prerequisites:
Python Version & Virtual Environment: WhisperX officially requires Python 3.9 or higher (up to 3.12). Python 3.10 and 3.11 are often good, stable choices. Always use a virtual environment (like
venv
) to keep your project dependencies clean and isolated.Bash
python3 -m venv my_whisperx_env source my_whisperx_env/bin/activate
FFmpeg Installation: WhisperX relies on FFmpeg for robust audio handling. This is a separate command-line tool, not a Python package. Install it for your operating system (e.g.,
brew install ffmpeg
on macOS,sudo apt install ffmpeg
on Linux, or download/add to PATH on Windows). Verify withffmpeg -version
in your terminal.PyTorch Installation (for GPU): If you plan to use your NVIDIA GPU (
device = "cuda"
), you must install PyTorch with CUDA support correctly. Visit the official PyTorch website to get the precisepip install
command for your specific system and CUDA version. For CPU-only, thepip install torch
command is simpler.
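Once everything is installed, a quick check like this (just a sketch) confirms that whisperx imports cleanly and that PyTorch can see your GPU before you run the full pipeline:
Python
import torch
import whisperx  # should import without errors after installation

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))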
Conclusion: Your Audio, Now Fully Understood
You've just performed sophisticated speech-to-text transcription and speaker diarization with WhisperX! From raw audio, you now have:
Accurate Transcriptions: What was said.
Precise Timestamps: Exactly when each word was spoken.
Speaker Attribution: Who spoke each word.
This powerful combination unlocks new levels of insight for meetings, interviews, podcasts, and more. Experiment with different audio files, Whisper model sizes, and the min_speakers/max_speakers parameters to optimize results for your specific use cases.
Further Resources:
WhisperX GitHub Repo: https://github.com/m-bain/whisperx
WhisperX Research Paper: WhisperX: Time-Accurate Speech Transcription via Large ASR Models
PyTorch Installation Guide: https://pytorch.org/get-started/locally/
Written by
Victor Nduti
Data enthusiast. Curious about all things data.