[Paper Review] Revolutionizing Speaker Recognition and Diarization: A Novel Methodology in Speech Analysis


This is my first paper review on Speech-to-Text (STT) methodology. After reading the paper, I dug into the world of sound and waveforms to learn how digitized audio is transformed into a form humans can understand.
My company is currently building Speech-to-Text capabilities and trying to catch up with state-of-the-art techniques to provide better-quality services. We still lack research experience and a solid background in STT, so I picked several research papers, including this one. Beyond transcription, we also need diarization to identify who is speaking. This paper helped me understand what the architecture and algorithms should look like before building such a service.
1. Abstract
The paper tackles Speech-to-Text together with speaker recognition, i.e. diarization: it leverages the Whisper model for transcription, ECAPA-TDNN for speaker embeddings, and Agglomerative Hierarchical Clustering to group the embeddings and identify who spoke when.
2. Introduction
- Why does the paper do this research?
Meeting transcription and audio processing are important in modern society, yet the accuracy of audio processing still needs to improve before it reaches service level.
- Objectives
identification and segmentation of speakers
content comprehension
- With what?
speaker embeddings
various acoustic features within speech
used to discern between speakers
Models and Packages
In this paper, the research was conducted using the Whisper model for transcription, the Pyannote toolkit for speaker embeddings, and Agglomerative Hierarchical Clustering for grouping similar embeddings. The key features, in short, are listed below.
- Whisper
Whisper was trained on a large multilingual supervised dataset, so it can handle different languages, linguistic nuances, and accents. It supports many languages; details follow below.
- Pyannote
Pyannote is the toolkit used to extract and manipulate speaker embeddings, which encapsulate the unique vocal characteristics of each speaker. These embeddings are the key to diarization.
- Agglomerative Hierarchical Clustering
The embeddings are then grouped into clusters with this algorithm, and each cluster can be labeled as a speaker.
In summary:
Whisper: transcription of the audio data
Pyannote: extraction of embeddings from acoustic features
Agglomerative Hierarchical Clustering: grouping the embeddings to reveal which segments belong to which speaker
3. Related work
RNN and LSTM models were used in earlier diarization research to capture the sequential features of audio, but they struggled with long sequences. CNNs were also applied: they excelled at extracting hierarchical spatial features and patterns from spectrograms, but were limited to fixed input sizes and so handled variable-length data poorly. Once Transformers were released, they were adopted for diarization research and applications because they performed well across many tasks.
Embeddings
Much research has been conducted on how to represent the features used for diarization. X-vectors are one such representation: they are extracted with a time delay neural network (TDNN) and were introduced as the state of the art for speaker verification.
Proposal
To overcome the challenges and limitations stated above, this paper proposes the following methodology:
Emphasized Channel Attention, Propagation, and Aggregation in TDNN (ECAPA-TDNN)
ECAPA-TDNN
an advanced iteration of TDNN
it uses
an attention mechanism
multi-layer feature aggregation (MFA)
squeeze-and-excitation modules (a minimal SE-block sketch follows this list)
residual blocks
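To make the squeeze-and-excitation idea concrete, here is a minimal sketch of an SE block in PyTorch, assuming 1D feature maps of shape (batch, channels, time) as used in TDNN-style models. The channel count and reduction ratio are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SEBlock1d(nn.Module):
    """Squeeze-and-excitation over the channel dimension of (batch, channels, time) features."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Squeeze: global average over time -> one descriptor per channel.
        s = x.mean(dim=2)                      # (batch, channels)
        # Excitation: two small fully connected layers produce per-channel weights in (0, 1).
        w = torch.sigmoid(self.fc2(torch.relu(self.fc1(s))))
        # Re-scale each channel of the original feature map.
        return x * w.unsqueeze(2)              # (batch, channels, time)

# Toy usage: 80-dim features over 300 frames.
feats = torch.randn(4, 80, 300)
print(SEBlock1d(80)(feats).shape)              # torch.Size([4, 80, 300])
```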
k-means
k-means is efficient on large datasets, but it struggles when one speaker dominates the audio; agglomerative hierarchical clustering can handle such imbalances (see the comparison sketch below)
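As a small illustration of that difference, here is a sketch using scikit-learn on toy embeddings: k-means must be told the number of speakers up front, while agglomerative clustering with a distance threshold leaves the number of clusters open. The data and threshold are made up for the example.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

rng = np.random.default_rng(0)
# Toy speaker embeddings: speaker A dominates with 40 segments, speaker B has only 5.
emb_a = rng.normal(loc=0.0, scale=0.3, size=(40, 16))
emb_b = rng.normal(loc=3.0, scale=0.3, size=(5, 16))
embeddings = np.vstack([emb_a, emb_b])

# k-means needs the number of speakers in advance.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)

# Agglomerative clustering can instead stop at a distance threshold,
# so the number of speakers falls out of the data.
ahc = AgglomerativeClustering(n_clusters=None, distance_threshold=5.0, linkage="average")
ahc_labels = ahc.fit_predict(embeddings)

print("k-means clusters:", np.unique(kmeans_labels))
print("AHC clusters:    ", np.unique(ahc_labels))
```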
4. Methodology
Whisper
One notable point about the Whisper model is that it is built on the transformer architecture with an encoder and a decoder. It supports 99 languages with a word error rate (WER) of 4.2%. Korean is more complex to evaluate: because of its spacing rules and syllable-block writing system, WER is a poor fit, so the character error rate (CER) is used to check performance instead (a minimal WER/CER sketch follows).
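As a quick reference, both WER and CER are edit-distance rates: (substitutions + deletions + insertions) divided by the length of the reference, computed over words or characters respectively. A minimal sketch (my own illustration, not code from the paper):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, deletions, insertions)."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)]

def wer(reference: str, hypothesis: str) -> float:
    return edit_distance(reference.split(), hypothesis.split()) / len(reference.split())

def cer(reference: str, hypothesis: str) -> float:
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

# Korean example: one wrong syllable out of five characters.
print(wer("안녕하세요 반갑습니다", "안녕하세요 반갑습니다"))  # 0.0
print(cer("안녕하세요", "안넝하세요"))                        # 0.2
```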
4.2% WER
99 languages
680,000h of audio (collected from the web)
563,000h English
117,000h other languages
robustness against accents, ambient disturbances
currently offers large-v3 and turbo models
utilizes an encoder-decoder transformer
encoder: derives a latent representation from the speech
decoder: generates text based on the latent representation
Other Speech-to-Text models
Suggested algorithm
The audio file is processed as 16 kHz PCM, normalized to the range -1 to 1. The signal is then converted into an 80-channel Mel spectrogram, since 80 channels is the most common choice and has proven itself in practice (a minimal preprocessing sketch follows the parameter list below).
Encoder, decoder
conversion involves
window size: 25ms
stride: 10ms
segments: 30s
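A minimal sketch of that front-end, assuming librosa for loading and the Mel filter bank; Whisper's own implementation uses its internal log-Mel routine, so padding and scaling details may differ slightly.

```python
import numpy as np
import librosa

SAMPLE_RATE = 16000
N_MELS = 80
WINDOW = int(0.025 * SAMPLE_RATE)   # 25 ms window -> 400 samples
HOP = int(0.010 * SAMPLE_RATE)      # 10 ms stride -> 160 samples
SEGMENT = 30 * SAMPLE_RATE          # 30-second segments

def log_mel_segment(path: str) -> np.ndarray:
    # librosa loads audio as float32 in [-1, 1] and resamples to 16 kHz.
    audio, _ = librosa.load(path, sr=SAMPLE_RATE, mono=True)
    # Pad or trim to exactly 30 seconds, as Whisper processes fixed-length segments.
    audio = np.pad(audio, (0, max(0, SEGMENT - len(audio))))[:SEGMENT]
    mel = librosa.feature.melspectrogram(
        y=audio, sr=SAMPLE_RATE, n_fft=WINDOW, hop_length=HOP, n_mels=N_MELS
    )
    return librosa.power_to_db(mel)  # (80, ~3000) log-Mel frames

# Usage (hypothetical file name):
# features = log_mel_segment("meeting.wav")
```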
the encoder operates on each 30-second segment to extract features
it involves two GELU-activated convolutions
- filter size of 3 for the input embeddings
positional embedding uses a sine function
- further processing is performed by the transformer blocks
decoder
calculates token probabilities based on the latent representation
token determination via Greedy Search or Beam Search (a toy decoding sketch follows this list)
output: a maximum of 224 tokens per 30-second segment
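To illustrate the difference between the two decoding strategies, here is a toy sketch over a hand-made conditional probability table (not Whisper's actual decoder): greedy search takes the locally best token at each step, while beam search keeps several partial hypotheses and can recover the globally better sequence.

```python
import math

# Toy conditional distribution: P(next token | previous token).
# Designed so the locally best first token ("the") leads to a worse overall sentence.
PROBS = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "</s>": 0.2},
    "a":   {"dog": 0.9, "</s>": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}

def greedy_decode():
    seq, logp, tok = [], 0.0, "<s>"
    while tok != "</s>":
        tok, p = max(PROBS[tok].items(), key=lambda kv: kv[1])
        seq.append(tok)
        logp += math.log(p)
    return seq, logp

def beam_decode(beam_size=2, max_len=5):
    beams = [(["<s>"], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, logp in beams:
            if seq[-1] == "</s>":          # finished hypotheses are carried over
                candidates.append((seq, logp))
                continue
            for tok, p in PROBS[seq[-1]].items():
                candidates.append((seq + [tok], logp + math.log(p)))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]
    return beams[0]

print(greedy_decode())  # "the cat </s>", probability 0.30
print(beam_decode())    # "a dog </s>",  probability 0.36
```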
The process in short (an end-to-end sketch follows this list):
transcription
- spoken content (Whisper)
speaker embeddings
- embeddings extracted from the audio, capturing the unique vocal features of individuals
- the basis for the following analysis
clustering
- Agglomerative Hierarchical Clustering groups the embeddings based on similarity
output
- shows who spoke when
- Whisper's output carries time information along with the transcription
- the audio is cut according to that time information, and each piece is assigned to a speaker
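A rough end-to-end sketch of that flow, assuming the openai-whisper package and scikit-learn, with a crude mean-MFCC stand-in for the speaker-embedding step (the paper uses a pyannote / ECAPA-TDNN model there); the file name and clustering threshold are placeholders, not the paper's setup.

```python
import numpy as np
import librosa
import whisper
from sklearn.cluster import AgglomerativeClustering

def embed_segment(audio: np.ndarray, sr: int, start: float, end: float) -> np.ndarray:
    """Crude stand-in for a speaker embedding: mean MFCCs over the segment.
    A trained pyannote / ECAPA-TDNN model would be used here in practice."""
    clip = audio[int(start * sr):int(end * sr)]
    mfcc = librosa.feature.mfcc(y=clip, sr=sr, n_mfcc=20)
    return mfcc.mean(axis=1)

def diarize(path: str):
    # 1) Transcription: Whisper returns segments with start/end timestamps and text.
    model = whisper.load_model("base")
    segments = model.transcribe(path)["segments"]

    # 2) Speaker embeddings: one vector per transcribed segment.
    audio, sr = librosa.load(path, sr=16000, mono=True)
    embeddings = np.stack([embed_segment(audio, sr, s["start"], s["end"]) for s in segments])

    # 3) Clustering: group similar embeddings; the distance threshold is a placeholder.
    labels = AgglomerativeClustering(
        n_clusters=None, distance_threshold=50.0, linkage="average"
    ).fit_predict(embeddings)

    # 4) Output: who spoke when.
    for seg, label in zip(segments, labels):
        print(f"[{seg['start']:7.2f}-{seg['end']:7.2f}] SPEAKER_{label}: {seg['text'].strip()}")

# diarize("meeting.wav")  # hypothetical input file
```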
Dataset
The paper used the VoxCeleb1 and VoxCeleb2 datasets, which were collected from YouTube videos of celebrities. VoxCeleb is a good fit for many studies because it contains a wide range of voices, from clean recordings to voices with background noise.
Reflection
It was indeed intriguing and informative as a starting point in the STT field. What was lacking, though, was detail on how the digitized sound is transformed into a Mel spectrogram and how the data is processed; overall, the explanation of the process was a bit of a let-down.
Secondly, the paper presents results comparing two models, yet it does not report the error rates on the training and test data, so it is hard to tell whether the model was overfitted.
ECAPA-TDNN will be a good choice for diarization if the number of speakers is fixed in every situation. In real life, however, there are always exceptions: in a meeting, another member may come in and an existing member may leave. I personally think the embeddings and clustering should account for situations where the embedding information can change at any time.
With that realization, the next paper to read will be the standard paper written by the Oxford researchers who established the VoxCeleb dataset, or something else if another interesting paper comes along.
Reference
[1] R. D. Shankar, R. B. Manjula, and R. C. Biradar, "Revolutionizing Speaker Recognition and Diarization: A Novel Methodology in Speech Analysis," SN Computer Science, vol. 6, no. 87, 2025. https://doi.org/10.1007/s42979-024-03509-6