Implementing Speech-to-Text: A Quick Guide

A few weeks back, I got a request from a student at UG Legon who needed help with their project. They were having trouble integrating speech-to-text into their workflow using the GhanaNLP API, and its automatic speech recognition (ASR) just wasn’t cutting it. I took on the challenge, but after testing their setup on my own system, I quickly realized it was time to explore better alternatives.

That’s when I discovered Faster Whisper, a reimplementation of OpenAI’s Whisper model built on CTranslate2 that offers both speed and accuracy while being relatively easy to set up. Unlike many other solutions that require extensive fine-tuning or are too slow for practical use, Faster Whisper stood out for its performance. In this post, I’ll guide you through how I set it up and why it’s a solid choice if you’re working on similar projects.

Before diving into the setup, one note: if you prefer a simpler ASR solution, check out the popular SpeechRecognition library on PyPI. It's a more beginner-friendly approach and can handle basic transcription tasks without much hassle.

Exploring Alternatives: AssemblyAI

Now, not everyone wants to go through the setup process involved with self-hosted models like Faster Whisper. If you’re looking for a plug-and-play solution, AssemblyAI might be the way to go. It’s a cloud-based service that provides developers with $50 in free credits to get started. Its Speech-to-Text API comes with a bunch of useful features:

Speech recognition
Speaker diarization
Custom spelling and vocabulary
Profanity filtering, auto punctuation, and casing

You can integrate AssemblyAI into your project using the SDKs available directly on your dashboard once you sign up. Here’s a quick example of setting it up in Python:

# Start by making sure the `assemblyai` package is installed.
# If not, you can install it by running the following command:
# pip install -U assemblyai
#
# Note: Some macOS users may need to use `pip3` instead of `pip`.

import assemblyai as aai

# Replace with your API key
aai.settings.api_key = "YOUR_API_KEY"

# URL of the file to transcribe
FILE_URL = "https://github.com/AssemblyAI-Community/audio-examples/raw/main/20230607_me_canadian_wildfires.mp3"

# You can also transcribe a local file by passing in a file path
# FILE_URL = './path/to/file.mp3'

transcriber = aai.Transcriber()
transcript = transcriber.transcribe(FILE_URL)

if transcript.status == aai.TranscriptStatus.error:
    print(transcript.error)
else:
    print(transcript.text)

With AssemblyAI, you get access to features like speaker diarization and custom vocabulary out of the box, which can be a massive time-saver if you don’t want to spend hours setting up your own infrastructure.
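As a rough sketch of what switching on one of those features looks like, the snippet below enables speaker diarization through `aai.TranscriptionConfig(speaker_labels=True)` and turns each utterance into a labelled line. The helper names `format_utterances` and `transcribe_with_speakers` are my own, not part of the SDK, and the SDK import sits inside the function only so the formatting helper can be tried without the package installed. Check the option names against the current AssemblyAI docs before relying on them.

```python
def format_utterances(utterances):
    """Turn diarized utterances into 'Speaker X: text' lines.

    Each utterance is assumed to expose `.speaker` and `.text`.
    """
    return [f"Speaker {u.speaker}: {u.text}" for u in utterances]


def transcribe_with_speakers(audio_source, api_key):
    """Transcribe a URL or local file path with speaker labels enabled."""
    # Imported here so format_utterances works without the SDK installed
    import assemblyai as aai

    aai.settings.api_key = api_key
    config = aai.TranscriptionConfig(speaker_labels=True)
    transcript = aai.Transcriber().transcribe(audio_source, config=config)

    if transcript.status == aai.TranscriptStatus.error:
        raise RuntimeError(transcript.error)
    return format_utterances(transcript.utterances)
```

Calling `transcribe_with_speakers(FILE_URL, "YOUR_API_KEY")` should return one line per utterance, which you can print or store as needed.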

Setting Up Faster Whisper for Speech-to-Text

For those of us who like the control that comes with managing our own setups, Faster Whisper is a great option. Here’s a step-by-step guide on how I got it working.

The repo: https://github.com/PhidLarkson/whisper-stt-api

Step 1: Install the Required Libraries

Start by installing the libraries you need. You can use the requirements.txt from the repo linked above or install them manually:

pip install faster-whisper Flask sounddevice scipy requests
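If you want to confirm the install worked before moving on, a small check like the one below can help. It is only a sketch: `missing_modules` is a hypothetical helper of mine, and the list holds the import names I would expect for the packages above (note that `faster-whisper` is imported as `faster_whisper`).

```python
import importlib.util

# Import names for the packages installed above
REQUIRED_MODULES = ["faster_whisper", "flask", "sounddevice", "scipy", "requests"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```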

Step 2: Building the Flask API

I set up a simple Flask API that handles audio input, processes it using Faster Whisper, and returns the transcription as JSON. Here’s the code:

from flask import Flask, request, jsonify
from faster_whisper import WhisperModel
import os
import tempfile

app = Flask(__name__)

model_size = "distil-large-v3"

# Load the model once at startup; reloading it on every request is slow
model = WhisperModel(model_size, device="cpu", compute_type="int8")

@app.route('/transcribe', methods=['POST'])
def transcribe():
    # Get the uploaded audio file from the request
    audio_file = request.files.get('audio')
    if audio_file is None:
        return jsonify({'error': "No 'audio' file in the request"}), 400

    # Save the audio to a unique temporary file so concurrent requests don't clash
    with tempfile.NamedTemporaryFile(suffix='.wav', delete=False) as f:
        f.write(audio_file.read())
        temp_path = f.name

    try:
        # Transcribe the audio
        output = transcribe_audio(temp_path)
    finally:
        # Remove the temporary audio file, even if transcription fails
        os.remove(temp_path)

    # Return the transcription as JSON
    return jsonify({'transcription': output})

def transcribe_audio(audio_file):
    segments, info = model.transcribe(audio_file, beam_size=5, language="en", condition_on_previous_text=False)
    # segments is a generator; the transcription runs as it is consumed
    return "".join(segment.text for segment in segments).strip()

if __name__ == '__main__':
    # debug=True is meant for local development only
    app.run(debug=True)
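The API above returns only the joined text, but each segment that Faster Whisper yields also carries `start` and `end` times in seconds. If you want timestamps in the response, one possible sketch is below. `format_timestamp` and `segments_to_dicts` are hypothetical helpers of mine, not part of the library, and the `MM:SS.mmm` format is just one choice.

```python
def format_timestamp(seconds):
    """Format a time in seconds as MM:SS.mmm."""
    total_ms = int(round(seconds * 1000))
    minutes, ms = divmod(total_ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{minutes:02d}:{secs:02d}.{ms:03d}"


def segments_to_dicts(segments):
    """Convert segments (objects with .start, .end, .text) to JSON-friendly dicts."""
    return [
        {
            "start": format_timestamp(seg.start),
            "end": format_timestamp(seg.end),
            "text": seg.text.strip(),
        }
        for seg in segments
    ]
```

Inside `transcribe_audio`, you could pass the `segments` generator to `segments_to_dicts` and return the resulting list through `jsonify` alongside the plain transcription.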

Step 3: Recording and Sending Audio to the API

To test this setup, I wrote a Python script that records audio using sounddevice, converts it to WAV format, and sends it to the Flask API for transcription:

import sounddevice as sd
from scipy.io.wavfile import write
import requests

# Set the API endpoint URL. This is Flask's default local address for the API above;
# replace it if your STT service runs elsewhere.
api_url = "http://127.0.0.1:5000/transcribe"

# Set the recording parameters
sample_rate = 16000
duration = 10  # seconds

def record_audio():
    """Record audio for the specified duration"""
    print("Recording audio...")
    audio = sd.rec(int(duration * sample_rate), samplerate=sample_rate, channels=1)
    sd.wait()  # Wait until recording is finished
    print("Recording complete.")
    return audio

def send_audio_to_api(audio):
    """Send the recorded audio to the Whisper STT API"""
    # Convert the audio to a WAV file
    write('recording.wav', sample_rate, audio)
    with open('recording.wav', 'rb') as f:
        files = {'audio': f}
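        response = requests.post(api_url, files=files, timeout=120)
        if response.status_code == 200:
            return response.json()['transcription']
        else:
            # Include the status code so failures are easier to diagnose
            return f"Error: Failed to transcribe audio (HTTP {response.status_code})"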

def main():
    audio = record_audio()
    transcription = send_audio_to_api(audio)
    print(f"Transcription: {transcription}")

if __name__ == "__main__":
    main()
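The script above writes recording.wav to disk before uploading it. If you would rather skip the intermediate file, a variation is to build the WAV in memory with Python's standard `wave` module and post the bytes directly. This is a sketch under a few assumptions: the audio is raw 16-bit PCM, the API expects the `audio` form field as in Step 2, and `pcm16_to_wav_bytes` and `post_wav` are hypothetical helper names.

```python
import io
import wave


def pcm16_to_wav_bytes(pcm_bytes, sample_rate, channels=1):
    """Wrap raw 16-bit PCM samples in a WAV container, entirely in memory."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm_bytes)
    return buffer.getvalue()


def post_wav(api_url, wav_bytes, timeout=120):
    """POST WAV bytes to the transcription endpoint under the 'audio' form field."""
    # Third-party import kept local so the WAV helper stays stdlib-only
    import requests

    files = {"audio": ("recording.wav", wav_bytes, "audio/wav")}
    response = requests.post(api_url, files=files, timeout=timeout)
    response.raise_for_status()
    return response.json()["transcription"]
```

With sounddevice, you would record with `dtype='int16'` and pass `audio.tobytes()` as `pcm_bytes`.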

Step 4: Performance and Final Thoughts

The performance of Faster Whisper has been impressive. Even on a basic CPU-only setup with int8 quantization, as configured in Step 2, it handles transcription quickly, and the accuracy is solid. For anyone building speech-to-text features into their apps, this combination of Faster Whisper and Flask is both practical and efficient.
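To put a number on "quickly" for your own machine, you can time a transcription and compare it to the length of the audio. The sketch below computes a real-time factor (processing time divided by audio duration); `timed` and `real_time_factor` are hypothetical helpers of mine, and I am not quoting any measured figures here.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def real_time_factor(processing_seconds, audio_seconds):
    """Processing time divided by audio length; below 1.0 means faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds
```

For example, `text, elapsed = timed(transcribe_audio, 'recording.wav')` followed by `real_time_factor(elapsed, 10)` would give the factor for the 10-second clip from Step 3. Because `transcribe_audio` consumes the segment generator, the timing covers the actual transcription work.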

If you prefer the DIY route, Faster Whisper is an excellent choice, but if you’re looking for something quicker to set up with extra features like custom vocabulary, definitely check out AssemblyAI.

Happy transcribing!