[Paper Review] Revolutionizing Speaker Recognition and Diarization: A Novel Methodology in Speech Analysis

Ramhee Yeon

This is my first paper review on Speech-to-Text methodology. After reading this paper, I dug into the world of sound and waves to learn how digitized audio data is transformed into a form that humans can understand.

My company is currently working on Speech-to-Text features and trying to catch up with state-of-the-art techniques to provide better-quality services. However, we lacked research and a solid background in STT, so I picked several research papers, including this one. Besides transcription, we also need diarization to identify the speaker. This paper helped me understand what the architecture and algorithms should look like before building such a service.

1. Abstract

The paper aims to perform Speech-to-Text with speaker recognition, so-called diarization, leveraging the Whisper model for transcription, ECAPA-TDNN for speaker embeddings, and Agglomerative Hierarchical Clustering for grouping speakers to identify who spoke when.

2. Introduction

  • why did the paper conduct this research?

In modern society, meeting transcription and audio processing are important. Yet the accuracy of audio processing still needs improvement before it reaches service level.

  • objectives

    • identification and segmentation of speakers

    • content comprehension

  • with what?

    • speaker embeddings

      • various acoustic features within speech

      • discern between speakers

Models and Packages

In this paper, the research was conducted using the Whisper model for transcription, the Pyannote library for speaker embeddings, and Agglomerative Hierarchical Clustering for grouping similar embeddings. The key features are briefly listed below.

  • Whisper

Whisper has been trained on a multilingual supervised dataset so that it can differentiate languages, linguistic nuances, and accents. It supports many languages. Details are covered further below.

  • Pyannote

Pyannote is a model for extracting and manipulating speaker embeddings, which encapsulate the unique acoustic characteristics of each speaker. These embeddings are the key input for diarization.

  • Agglomerative Hierarchical Clustering

The embedding vectors are then grouped into clusters using this algorithm, so each cluster can be labeled as a speaker.

In summary:

  • Whisper: transcribes the audio data

  • Pyannote: extracts embeddings from acoustic features (see the sketch below)

  • Agglomerative Hierarchical Clustering: reveals the relationships between speakers
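
As a rough illustration of the embedding step, below is a minimal sketch using pyannote.audio. The checkpoint name, access token, and file path are my own assumptions, not taken from the paper; the point is simply that each excerpt of speech is mapped to a vector, and vectors from the same speaker end up close together.

```python
# Minimal sketch (not the paper's code): embed two excerpts with pyannote
# and compare them. "HF_TOKEN" and "meeting.wav" are placeholders.
from pyannote.audio import Model, Inference
from pyannote.core import Segment
from scipy.spatial.distance import cdist

model = Model.from_pretrained("pyannote/embedding", use_auth_token="HF_TOKEN")
inference = Inference(model, window="whole")

emb1 = inference.crop("meeting.wav", Segment(0.0, 5.0))    # first excerpt
emb2 = inference.crop("meeting.wav", Segment(30.0, 35.0))  # later excerpt

# A small cosine distance suggests the two excerpts share the same speaker.
distance = cdist(emb1.reshape(1, -1), emb2.reshape(1, -1), metric="cosine")[0, 0]
print(distance)
```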

3. Related work

RNN and LSTM algorithms were used in diarization research to capture sequential features from audio. However, they struggled with long sequences. CNNs were also used for this purpose; they excelled at extracting hierarchical spatial features and patterns from spectrograms but were limited by variable input lengths. Later, when the Transformer was introduced, it was adapted for diarization research and applications, as it performed well on many tasks.

Embeddings

Much research has been conducted on extracting features for diarization. X-vectors, based on a time-delay neural network (TDNN), are one such approach and were introduced as a state-of-the-art method for speaker verification.

Proposal

  • to overcome the challenges and limitations stated above, this paper proposes Emphasized Channel Attention, Propagation and Aggregation in TDNN (ECAPA-TDNN)

    • ECAPA-TDNN

      • an advanced iteration of TDNN

      • it uses

        • attention mechanism

        • multilayer feature aggregation (MFA)

        • squeeze-and-excitation modules (sketched below)

        • residual blocks
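
The paper lists these components but does not show them in code, so here is a toy PyTorch sketch of just the squeeze-and-excitation idea that ECAPA-TDNN stacks inside its residual blocks; the channel count and reduction factor are arbitrary choices of mine.

```python
import torch
import torch.nn as nn

class SqueezeExcite1d(nn.Module):
    """Minimal 1-D squeeze-and-excitation block (illustrative, not the paper's code)."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, x):                  # x: (batch, channels, time)
        s = x.mean(dim=2)                  # squeeze: average over time
        s = torch.relu(self.fc1(s))
        s = torch.sigmoid(self.fc2(s))     # excitation: per-channel weights
        return x * s.unsqueeze(2)          # rescale each channel

x = torch.randn(4, 512, 200)               # (batch, channels, frames)
print(SqueezeExcite1d(512)(x).shape)        # torch.Size([4, 512, 200])
```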

k-means

k-means is efficient on large datasets but struggles when one speaker dominates; agglomerative hierarchical clustering handles such imbalances better.
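
To make the contrast concrete, here is a small scikit-learn sketch on toy "embeddings" of my own (not the paper's data): agglomerative clustering can be cut at a distance threshold instead of being given a fixed cluster count, which is convenient when one speaker dominates or the number of speakers is not known in advance.

```python
# Toy comparison: k-means needs the cluster count up front, while AHC can cut
# its dendrogram at a distance threshold and discover the count by itself.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

rng = np.random.default_rng(0)
# Imbalanced toy embeddings: 50 vectors from speaker A, 5 from speaker B.
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(50, 16)),
    rng.normal(loc=3.0, scale=0.3, size=(5, 16)),
])

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

ahc = AgglomerativeClustering(n_clusters=None, distance_threshold=5.0,
                              linkage="average")
ahc_labels = ahc.fit_predict(X)

print("k-means clusters found:", np.unique(kmeans_labels).size)   # fixed at 2
print("AHC clusters found:    ", np.unique(ahc_labels).size)      # discovered
```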

4. Methodology

Whisper

The notable point about the Whisper model is that it is built on the Transformer architecture with an encoder and a decoder. It supports 99 languages with a word error rate (WER) of 4.2%. Korean is more complex to evaluate and cannot be measured well by WER, since its spacing and letter-combination system differ; the character error rate (CER) is used instead to check performance.

  • 4.2% WER

  • 99 languages

  • 680,000h audio (online platform)

    • 563,000h english

    • 117,000h other languages

  • robustness against accents, ambient disturbances

  • currently has large-v3 and turbo models

  • utilizes encoder-decoder transformer

    • encoder: derives a latent representation from speech

    • decoder: generates text, based on the latent representation

Other Speech-to-Text models

Suggested algorithm

The audio file is processed as 16 kHz PCM, normalized to the range -1 to 1. The signal is then converted into an 80-channel Mel spectrogram, as 80 channels is the most common choice and has proven effective in practice.
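
Below is a minimal torchaudio sketch of that front end, using the 16 kHz sample rate and 80 Mel channels just described, plus the 25 ms window and 10 ms stride listed in the next section. The file name is a placeholder, and Whisper's own preprocessing differs in details (e.g. padding to 30 seconds).

```python
# Sketch: 16 kHz mono audio -> 80-channel log-Mel spectrogram.
import torch
import torchaudio

waveform, sr = torchaudio.load("meeting.wav")                      # placeholder file
waveform = torchaudio.functional.resample(waveform, sr, 16000)
waveform = waveform.mean(dim=0, keepdim=True)                      # mono
waveform = waveform / waveform.abs().max()                         # scale to [-1, 1]

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,
    n_fft=400,          # 25 ms window at 16 kHz
    hop_length=160,     # 10 ms stride
    n_mels=80,          # 80 Mel channels
)(waveform)

log_mel = torch.log10(mel.clamp(min=1e-10))
print(log_mel.shape)    # (1, 80, num_frames), roughly 100 frames per second
```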

Encoder, decoder

  • conversion involves

    • window size: 25ms

    • stride: 10ms

    • segments: 30s

  • encoder operates per 30-second segment, to extract features

    • it involves two GELU-activated convolutions

      • filter size of 3 for input embeddings
    • positional embedding uses a sine function

      • performed by transformer
  • decoder

    • calculates token probabilities based on the latent representation

    • token determination via Greedy Search or Beam Search (see the sketch after this list)

    • output: maximum 224 tokens per 30-second segment
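
For a concrete sense of the decoding options, here is a short sketch with the openai-whisper package; the model name "turbo" and the file path are my own assumptions. The same model can decode greedily, which is the default, or with beam search.

```python
import whisper

model = whisper.load_model("turbo")                      # or "large-v3", etc.

greedy = model.transcribe("meeting.wav")                 # default: greedy search
beam   = model.transcribe("meeting.wav", beam_size=5)    # beam search with 5 beams

print(greedy["text"])
print(beam["text"])
```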

The process in short (a rough end-to-end sketch follows this list):

  1. transcription

    • spoken content (Whisper)
  2. speaker embeddings

    • speaker embeddings from audio (unique features of individuals)

      • bases for analysis
  3. clustering

    • clustering using Agglomerative hierarchical clustering method based on similarity
  4. output

    • the output shows who spoke when

      • Whisper's output includes timestamps along with the transcription

      • the audio is cut according to those timestamps and the speaker of each segment is determined
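
Putting the four steps together, here is a rough end-to-end sketch under my own assumptions: openai-whisper for transcription, SpeechBrain's pretrained ECAPA-TDNN for the per-segment embeddings, scikit-learn for the clustering, and a known number of speakers. The paper's actual implementation may differ, and very short segments are not handled here.

```python
import numpy as np
import torchaudio
import whisper
from sklearn.cluster import AgglomerativeClustering
from speechbrain.pretrained import EncoderClassifier  # speechbrain.inference in newer versions

AUDIO = "meeting.wav"   # placeholder input file
N_SPEAKERS = 2          # assumed to be known in advance

# 1. transcription: Whisper returns segments with start/end timestamps
asr = whisper.load_model("large-v3")
result = asr.transcribe(AUDIO)

# 2. speaker embeddings: one ECAPA-TDNN embedding per Whisper segment
wav, sr = torchaudio.load(AUDIO)
wav = torchaudio.functional.resample(wav, sr, 16000).mean(dim=0)   # mono, 16 kHz
encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

embeddings = []
for seg in result["segments"]:
    start, end = int(seg["start"] * 16000), int(seg["end"] * 16000)
    chunk = wav[start:end].unsqueeze(0)                            # (1, time)
    emb = encoder.encode_batch(chunk).squeeze().detach().numpy()
    embeddings.append(emb)
embeddings = np.stack(embeddings)

# 3. clustering: group the segment embeddings into speakers with AHC
labels = AgglomerativeClustering(n_clusters=N_SPEAKERS).fit_predict(embeddings)

# 4. output: who spoke when
for seg, spk in zip(result["segments"], labels):
    print(f"[{seg['start']:7.2f}-{seg['end']:7.2f}] SPEAKER_{spk}: {seg['text'].strip()}")
```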

Dataset

The paper used the VoxCeleb1 and VoxCeleb2 datasets, which were collected from YouTube videos of celebrities. VoxCeleb is a popular choice for much of this research because it covers a wide range of voices, from clear speech to speech with background noise.

Reflection

The paper was intriguing and a useful starting point for me in the STT field. What was lacking, however, was detail on how the digitized sound is transformed into a Mel spectrogram and how the data is processed; overall, the explanation of the process was a bit of a let-down.

Secondly, the paper presents results comparing two models, yet it does not report the training versus test error rates, so it is hard to tell whether the model was overfitted.

ECAPA-TDNN is a good choice for diarization if the number of speakers is fixed in every situation. But in real life there are always exceptions: in a meeting, a new member may come in and an existing member may leave. I personally think that the embedding and clustering stages should account for such situations, where the set of speakers can change at any time.

With that in mind, the next paper I plan to read is the foundational paper by the Oxford researchers who created the VoxCeleb dataset, or something else if another interesting paper comes up.

Reference

[1] R. D. Shankar, R. B. Manjula, and R. C. Biradar, "Revolutionizing Speaker Recognition and Diarization: A Novel Methodology in Speech Analysis," SN Computer Science, vol. 6, no. 87, 2025. https://doi.org/10.1007/s42979-024-03509-6
