One time, somebody asked me if it was possible to transcribe call center conversations to text so they could later be analyzed for quality purposes. This is exactly what we will be trying to do here in the simplest way: audio transcription.

Fortunately, companies like Facebook and OpenAI have already invested millions of dollars in producing these technologies, and many of the resulting models are published on a platform called "Hugging Face".

For this post, we will be using the pre-trained model "distil-whisper/distil-small.en," which transcribes audio arrays into text. The Distil-Whisper family includes several models and versions that differ in their number of parameters. A model with more parameters might produce better transcriptions, but it also takes longer to run, so the final transcription might take a lot of time, depending on how long the audio file is.

For our case, we will be using the smallest model, which has 166 million parameters, to transcribe a sample audio file that is 1 minute and 51 seconds long.

Download the Sample Audio File

Please download this sample audio file from this link. This is an mp3 file, but you are welcome to use your own file. Rename the audio file to "speech.mp3" and place it at the root of your Colab directory.

Nicely done! Now, it's time to code. Surprisingly, the code required for this is very simple but requires some tweaking. Let's get our hands dirty.

Import Libraries

To reproduce this code, you can use Google Colab. We will need to install the following dependencies:

  !pip install transformers
  !pip install soundfile
  !pip install librosa

Transformers: This is the main library that allows us to play with thousands of pre-trained models and download them directly from Hugging Face.
Soundfile: This library will help us load wav, flac or mp3 files.
Librosa: It's Librosa, not LeviOsA. This library will help us with some pre-processing we need to do to the mp3 file, such as converting stereo files to mono and resampling an audio file to another sampling rate.

Great! Now let's load all the libraries we need:

from transformers.utils import logging
logging.set_verbosity_error()

import soundfile as sf
import io
import numpy as np
import librosa

from transformers import pipeline

Create the Transcription Pipeline

A pipeline is an object that loads a pre-trained model and sets up all the required processing so we don't have to do it ourselves. This is very simple: we just need to create the pipeline object and name the model we will be using for the transcription.

asr = pipeline("automatic-speech-recognition", 
model="distil-whisper/distil-small.en")

This single line of code will download the model and store the resulting pipeline in the asr variable. We are almost ready to use it. The only thing we need to do is make sure the audio file is compatible with the model, and for that, we will need to transform the audio file with Librosa.

Transforming the Audio File

The "distil-small.en" model has been trained on audio with a specific sampling rate. Concretely, this model works with audio sampled at 16 kHz (16,000 Hz). We can find this out by checking the sampling rate expected by the model's feature extractor:

print(asr.feature_extractor.sampling_rate)

output: 16000

Now, it's time to check the sampling rate of the audio file "speech.mp3":

audio, sampling_rate = sf.read('speech.mp3')
print(sampling_rate)

output: 44100

No problem. We need to do two things with this audio file: 1) make sure the audio is mono, as these models work with single-channel audio, and 2) downsample the audio from 44.1 kHz to 16 kHz. Let's do exactly that in this single block of code:

audio_transposed = np.transpose(audio)
audio_mono = librosa.to_mono(audio_transposed)

audio_16KHz = librosa.resample(audio_mono,
                               orig_sr=sampling_rate,
                               target_sr=asr.feature_extractor.sampling_rate)

Soundfile returns stereo audio as an array of shape (samples, channels), while Librosa expects (channels, samples), so we first transpose the array. The transposed array is then converted to mono using the Librosa to_mono function, which averages the channels. That's it; now the audio is in a single channel.
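To see what the transpose and mono steps do to the array shapes, here is a minimal sketch that uses only NumPy. The tiny stereo array is a made-up stand-in for what sf.read returns, and the channel average is my understanding of what to_mono computes, not a replacement for the Librosa call above.

```python
import numpy as np

# Hypothetical stand-in for a stereo clip as returned by sf.read:
# shape (frames, channels), here 8 frames and 2 channels.
frames = 8
left = np.linspace(-1.0, 1.0, frames)
right = 0.5 * left
audio = np.stack([left, right], axis=1)
print(audio.shape)

# Transposing puts channels first: (channels, frames),
# the layout Librosa works with.
audio_transposed = np.transpose(audio)
print(audio_transposed.shape)

# Assumption: averaging over the channel axis mirrors what
# librosa.to_mono does for a two-channel signal.
audio_mono = audio_transposed.mean(axis=0)
print(audio_mono.shape)
```

If the file is already mono, sf.read returns a one-dimensional array and the transpose leaves it unchanged, so the same code path works for both cases.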

The second part is also very simple. Librosa has a resample function that changes the sampling rate of an audio signal. The audio_16KHz object now contains the pre-processed audio, which is compatible with the Hugging Face model for audio transcription.
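As a rough sanity check on the resampling step, the arithmetic below estimates how many samples the clip has before and after the conversion. It assumes the clip is exactly 1:51 long and that the resampled length is the original length scaled by the ratio of the two rates and rounded up; the helper name resampled_length is made up for this illustration, and the real file will differ by a few samples.

```python
import math


def resampled_length(n_samples: int, orig_sr: int, target_sr: int) -> int:
    """Estimate the sample count after resampling (scaled and rounded up)."""
    return math.ceil(n_samples * target_sr / orig_sr)


duration_s = 1 * 60 + 51            # assumed clip length: 1:51
orig_sr, target_sr = 44_100, 16_000

n_orig = duration_s * orig_sr       # samples at the original rate
n_new = resampled_length(n_orig, orig_sr, target_sr)

print(n_orig)               # samples before resampling
print(n_new)                # samples after resampling
print(n_new / target_sr)    # duration in seconds, unchanged by resampling
```

Comparing len(audio_16KHz) against an estimate like this is a quick way to confirm that the resampling changed the rate and not the duration.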