How I used Voice AI to bring imaginary characters to life!

Kevin Naidoo

Most evenings, when tucking my daughter in (she’s 6 years old), I read her a bedtime story, but not from your conventional bedtime storybooks, no. One of the most fun things we love doing is making up our own imaginary characters and stories.

One such character we created is Zazu, a brown monkey toy that we “reimagined”. Zazu has some interesting friends and crazy adventures. While making up the voices is fun, I am terrible at it! I am no voice actor, but at least she gets a good laugh 😂

Anyway, to get back to the point of this article: it got me thinking, wouldn’t it be cool if I could create Zazu for real, so that she doesn’t have to listen to my silly voice acting all the time? Plus, it would be so much fun for her to actually have a “play” conversation with it.

I am no Geppetto, nor any good with sketch art or making movies, but I can code and know a thing or two about AI.

So, I decided to create a voice AI agent of sorts that is set up with a system prompt telling the model that it should mimic Zazu as best as it can. In this article, I will give you a general overview of how to build such a system, and a code walkthrough of how I achieved it.

WebRTC VoiceAI: How will this work?

We will use WebRTC to stream audio from the browser to a Python WebSocket server built on Flask. The backend will then orchestrate the AI chain: transcribe the audio, prompt the model, and finally convert the model’s response back to audio and stream it back to the browser.

Now, since this is a fun app, I wasn’t going to spend a lot of time building my own transcription model, or TTS model for that matter. Instead, we’ll just use Assembly AI for transcription and OpenAI for everything else.

I initially tried to build this with VOSK, a decent open-source speech-to-text library, instead of AssemblyAI; however, the transcription quality wasn’t that great, and it wasn’t worth the time and effort to train a custom VOSK model. AssemblyAI has really good accuracy and costs just $0.37 per hour, which is not bad at all.

You can view the full source code here: https://github.com/kevincoder-co-za/zazu-voiceai

⚠️ I omitted “try except” blocks throughout to make the code more readable; however, for a real production app, you should catch and handle errors properly.

Setting up our WebSocket server

For the backend, we’ll use Flask (a bit old school, right?). Flask has very good support for WebSockets and makes it fairly easy to spin up a quick prototype. To get started, let’s install a few pip packages:

# requirements.txt
flask
flask-sockets
gevent
gevent-websocket
assemblyai
openai
ffmpeg-python

pip install -r requirements.txt

In addition, you will need to install FFmpeg itself for your operating system (the ffmpeg-python package is just a thin wrapper around the FFmpeg binary). FFmpeg is a powerful audio/video processing tool used by most modern audio tools on the market, and we’ll use it to alter the audio generated by OpenAI so that it sounds more cartoonish.
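Throughout the code that follows, you’ll see “import settings”. That file isn’t shown in this walkthrough, so here is a minimal sketch of the values the rest of the code expects. Treat the model names, voice, URL, and prompt as example values of my own rather than anything official:

# settings.py -- example values only; adjust for your own setup.
import os

# Never hard-code API keys; the OpenAI client also reads OPENAI_API_KEY
# from the environment on its own, so it needs no entry here.
ASSEMBLYAI_API_KEY = os.environ["ASSEMBLYAI_API_KEY"]

# 16 kHz, to keep the sample rate consistent across the whole pipeline.
SAMPLE_RATE = 16000

# Where dialer.html should open its WebSocket (the page appends the secret token).
SOCKET_URL = "ws://localhost:5000/websocket/stream"

OPENAI_GPT_MODEL = "gpt-4o-mini"
OPENAI_TTS_MODEL = "tts-1"
OPENAI_TTS_VOICE = "nova"

SYSTEM_AI_PROMPT = (
    "You are Zazu, a cheeky brown toy monkey who goes on crazy adventures "
    "with his friends. You are talking to a six-year-old, so keep your answers "
    "short, playful, and kind."
)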

Next, you’ll want to create a Python file: “server.py” or whatever you prefer:

from flask import Flask, render_template
from flask_sockets import Sockets
from gevent import pywsgi
from geventwebsocket.handler import WebSocketHandler
from audio_handlers import AudioHandler
from llm_adapters import OpenAILLM
import settings

app = Flask(__name__)
sockets = Sockets(app)

@app.route("/dialer")
def dialer():
    return render_template("dialer.html", 
        sample_rate = settings.SAMPLE_RATE,
        socket_url = settings.SOCKET_URL
    )

# Since we didn't add any API authentication,
# we use a slightly less secure method, just for the purposes
# of prototyping: we add a secret token at the end of the URL.

@sockets.route("/websocket/stream/<token>")
def audio_stream(ws, token):
    # In a real app, validate "token" here before accepting any audio.
    audio_handler = None
    llm = None

    while not ws.closed:
        message = ws.receive()

        if message is None:
            continue
        if isinstance(message, str):
            if "start call" in message:
                print("Call started", flush=True)
                llm = OpenAILLM()
                audio_handler = AudioHandler(llm, ws)
            elif "end call" in message and audio_handler:
                audio_handler.stop()
                audio_handler = None
                llm = None

        elif isinstance(message, (bytes, bytearray)) and audio_handler:
            audio_handler.stream(bytes(message))

if __name__ == "__main__":
    WEBSOCKET_PORT = 5000
    server = pywsgi.WSGIServer(("", WEBSOCKET_PORT), app, handler_class=WebSocketHandler)
    print(f"Server listening on ws://0.0.0.0:{WEBSOCKET_PORT}")
    server.serve_forever()

Let’s go through this step by step. First, we set up our socket server and two routes:

app = Flask(__name__)
sockets = Sockets(app)

@app.route("/dialer")
def dialer():
    return render_template("dialer.html", 
        sample_rate = settings.SAMPLE_RATE,
        socket_url = settings.SOCKET_URL
    )

@sockets.route("/websocket/stream/<token>")
def audio_stream(ws, token):

The dialer route just returns an HTML page: “dialer.html”. This contains our JavaScript WebRTC phone implementation. We’ll get to that in a bit.

The “/websocket/stream/<token>” route is where all the magic happens. When the browser initiates a call, this endpoint is triggered and a socket is opened. The socket remains open until either party terminates the call, so we can constantly stream data back and forth through it.
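If you’d like to poke at this endpoint without opening the browser at all, a tiny Python client does the job. This is purely a testing sketch and not part of the app: it assumes the websockets package is installed, that you have a 16 kHz, mono, 16-bit PCM WAV file to hand (raw PCM is what AssemblyAI’s real-time API expects, ideally with a second or two of trailing silence so the end of the utterance gets detected), and that “my-secret-token” stands in for whatever token you put in the URL:

import asyncio
import wave

import websockets  # pip install websockets

async def fake_call(wav_path, url="ws://localhost:5000/websocket/stream/my-secret-token"):
    async with websockets.connect(url) as ws:
        await ws.send("start call")                # same event string the dialer sends

        with wave.open(wav_path, "rb") as wav:     # 16 kHz, mono, 16-bit PCM
            while chunk := wav.readframes(3200):   # roughly 200ms of audio per message
                await ws.send(chunk)
                await asyncio.sleep(0.2)           # pace it like a live microphone

        reply = await ws.recv()                    # MP3 bytes produced by the server
        with open("zazu_reply.mp3", "wb") as f:
            f.write(reply)

        await ws.send("end call")

asyncio.run(fake_call("hello_zazu.wav"))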

@sockets.route("/websocket/stream/<token>")
def audio_stream(ws, token):
    # In a real app, validate "token" here before accepting any audio.
    audio_handler = None
    llm = None

    while not ws.closed:
        message = ws.receive()

        if message is None:
            continue
        if isinstance(message, str):
            if "start call" in message:
                print("Call started", flush=True)
                llm = OpenAILLM()
                audio_handler = AudioHandler(llm, ws)
            elif "end call" in message and audio_handler:
                audio_handler.stop()
                audio_handler = None
                llm = None

        elif isinstance(message, (bytes, bytearray)) and audio_handler:
            audio_handler.stream(bytes(message))

In our “audio_stream” handler, we check whether the incoming message is raw bytes or a string. Raw bytes mean we are receiving audio data; a string means we are receiving an event such as “start call” or “end call”.

When the call starts, we set up an audio_handler and an llm. The “audio_handler” will transcribe the incoming audio and stream the generated voice response back to the user.

The “llm” will communicate with GPT-4o mini and generate an appropriate response to the user based on the transcription.

Building the LLM Adapter

The LLM adapter has two tasks it needs to perform:

  • Prompt the text model with the transcription and get back a generated response (invoke method).

  • Take the generated response and convert it to audio so that we can stream it back to the browser (text_to_audio method).

from openai import OpenAI
import settings

class OpenAILLM:
    def __init__(self):
        self.llm = OpenAI()
        # Seed the conversation with the Zazu persona prompt.
        self.messages = [
            ("system", settings.SYSTEM_AI_PROMPT)
        ]

    def invoke(self, message):
        # Append the user's message and send the whole history so the
        # model has context for the entire conversation.
        self.messages.append(("user", message))
        messages = self.messages.copy()

        chat_completion = self.llm.chat.completions.create(
            model=settings.OPENAI_GPT_MODEL,
            messages=[{"role": m[0], "content": m[1]} for m in messages]
        )

        response = chat_completion.choices[0].message.content
        self.messages.append(("assistant", response))

        return response

    def text_to_audio(self, text, audio_file_path):
        response = self.llm.audio.speech.create(
            model=settings.OPENAI_TTS_MODEL,
            voice=settings.OPENAI_TTS_VOICE,
            input=text
        )

        response.stream_to_file(audio_file_path)

Note: we also build in some memory so that the LLM has context for the entire conversation. To maintain memory, we simply append every message to the “messages” list and then send the whole conversation to the Chat Completions API on each turn.
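As a quick sanity check, you can exercise the adapter on its own before wiring up the sockets (this assumes OPENAI_API_KEY is exported in your environment and that settings defines the model and voice names):

llm = OpenAILLM()

# First turn: the system prompt plus this message are sent to the model.
print(llm.invoke("Hi Zazu! What did you and your friends get up to today?"))

# Second turn: the earlier user and assistant messages go along too,
# so the model remembers the first part of the conversation.
print(llm.invoke("And then what happened?"))

# Turn a reply into speech and write it to an MP3 file.
llm.text_to_audio("We climbed the tallest mango tree in the jungle!", "/tmp/zazu.mp3")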

Building the Audio Manager

The Audio Manager can be found in: “audio_handlers.py”:

import os
import uuid
import settings
import assemblyai as aai
import ffmpeg

aai.settings.api_key = settings.ASSEMBLYAI_API_KEY

class AudioHandler(aai.RealtimeTranscriber):
    def __init__(self, llm, ws):
        super().__init__(
            on_data=self.on_data,
            sample_rate=settings.SAMPLE_RATE,
            on_error=lambda x : print(x, flush=True),
            disable_partial_transcripts=True
        )
        self.llm = llm
        self.ws = ws
        self.connect()

    def stop(self):
        self.close()

    def on_data(self, transcript: aai.RealtimeTranscript):
        if isinstance(transcript, aai.RealtimeFinalTranscript) and transcript.text:
            response = self.llm.invoke(transcript.text)
            if response != "":
                self.respond_to_user_prompt(response)

    def respond_to_user_prompt(self, response_text):
        tmp_id = str(uuid.uuid4())
        tmp_file = f"/tmp/_{tmp_id}.mp3"
        cartoonified_version = f"/tmp/modified_{tmp_id}.mp3"

        self.llm.text_to_audio(response_text, tmp_file)

        # Resample to a higher rate to pitch the voice up and make it sound cartoonish.
        ffmpeg.input(tmp_file).filter('asetrate', 33050).output(cartoonified_version).run()

        audio_data = None

        with open(cartoonified_version, "rb") as f:
            audio_data = f.read()

        if os.path.exists(tmp_file):
            os.remove(tmp_file)
        if os.path.exists(cartoonified_version):
            os.remove(cartoonified_version)

        if audio_data:
            self.ws.send(audio_data)

The audio manager class extends AssemblyAI’s RealtimeTranscriber. We override the “on_data” method to grab the finished transcription, and then use OpenAI’s TTS model plus FFmpeg to generate a cartoonish voice.

    def __init__(self, llm, ws):
        super().__init__(
            on_data=self.on_data,
            sample_rate=settings.SAMPLE_RATE,
            on_error=lambda x : print(x, flush=True),
            disable_partial_transcripts=True
        )
        self.llm = llm
        self.ws = ws
        self.connect()

In the constructor, we set up a few configs:

  • on_data: This method will allow us to receive the full transcription and perform any other post-processing needed.

  • sample_rate: Measured in hertz, this is the number of audio samples carried per second. AssemblyAI supports a wide variety of sample rates, but we want to keep a consistent rate across the different services (Whisper-style speech models work on 16 kHz audio, hence our sample rate is 16000).

  • disable_partial_transcripts: for this application, we don’t need partial phrases as they come in; all we care about is the final transcript of each utterance. Turning this setting on minimizes the back-and-forth with AssemblyAI, thus improving performance.

Once we get a transcription back from Assembly AI, we then need to:

  • Prompt the LLM to generate a response.

  • Send the response to the OpenAI TTS model.

  • Receive the audio from OpenAI TTS and then use FFmpeg to alter it so that it sounds like a cartoonish voice. OpenAI does not support custom voices, so we are stuck with its human voices and therefore need to raise the pitch to get something closer to a cartoon character (more on the pitch shift below).
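One thing worth knowing about the asetrate trick used in respond_to_user_prompt: resampling to a higher rate raises the pitch, but it also speeds the playback up by the same factor. If that bothers you, the atempo filter can compensate. Here is a rough sketch; the 24000 assumes the TTS MP3 comes back at 24 kHz (swap in your file’s actual rate), and the pitch factor is just a value I found sounded okay:

import ffmpeg

def cartoonify(in_path, out_path, pitch_factor=1.4, base_rate=24000):
    # asetrate pretends the audio was recorded at a higher rate (raising the pitch),
    # aresample brings it back to a standard rate, and atempo slows it down again
    # so the speech lasts roughly as long as the original.
    (
        ffmpeg
        .input(in_path)
        .filter("asetrate", int(base_rate * pitch_factor))
        .filter("aresample", base_rate)
        .filter("atempo", 1 / pitch_factor)
        .output(out_path)
        .overwrite_output()
        .run(quiet=True)
    )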

Silent pauses are a pain!

One of the major pain points is the “silent pause” that occurs for a few milliseconds (or seconds in some instances) after the user stops speaking. This is the time taken for the transcriber to detect the end of a sentence, plus all the AI queries back and forth to generate the final response for the caller.

This can be annoying since it’s not a fluid conversation that you would typically have with a real person.

Nonetheless, it works pretty decently if you can stomach the annoying pauses.

A few approaches I will look at to improve the code when I get some more time:

  • Rebuild in a faster language. I have actually already done this in Golang; it does make a difference, but it could still do with more optimizations.

  • Run my own model near the WebSocket server. Phi-3 is fairly good and fast; running this model with Ollama and then fine-tuning it could really improve both the speed and accuracy of the AI queries (see the sketch after this list).

  • Train my own Whisper model, similar to the above. I might be better off using my own speech-to-text model instead of AssemblyAI, and running it on a local network to minimize the latency between servers.
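The second idea is easier than it sounds, because the OpenAI Python client can point at any OpenAI-compatible server, and Ollama exposes exactly that. A rough sketch of the swap, assuming Ollama is running locally with a phi3 model pulled (the base_url and model name come from Ollama’s defaults, not from this project, and text-to-speech would still have to go to OpenAI since Ollama doesn’t do TTS):

from openai import OpenAI

# Talk to a local Ollama server through its OpenAI-compatible endpoint.
# The api_key is required by the SDK but ignored by Ollama.
local_llm = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

completion = local_llm.chat.completions.create(
    model="phi3",  # run `ollama pull phi3` first
    messages=[
        {"role": "system", "content": "You are Zazu, a cheeky toy monkey."},
        {"role": "user", "content": "What did you get up to today, Zazu?"},
    ],
)
print(completion.choices[0].message.content)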

Conclusion

Machine learning and LLMs have opened up a whole new world of applications that are now possible to build. Furthermore, it’s kind of fun to fiddle with and explore these possibilities while expanding my knowledge at the same time.

While this “toy app” is not much at this stage, hopefully, it’s of some value to you and you’ve learned a thing or two, or at the very least, it’s piqued your interest in machine learning, AI, and Voice AI.


Written by

Kevin Naidoo

I am a South African-born tech leader with 15+ years of experience in software development including Linux servers and machine learning. My passion is web development and teaching. I often love experimenting with emerging technologies and use this blog as an outlet to share my knowledge and adventures. Learn about Python, Linux servers, SQL, Golang, SaaS, PHP, Machine Learning, and loads more.