Building a Conversational Voice Chatbot: Integrating OpenAI's Speech-to-Text & Text-to-Speech

Ved Vekhande

Introduction

Welcome to an engaging tutorial where we'll develop a voice-responsive chatbot utilizing OpenAI's advanced speech-to-text and text-to-speech services, all integrated within a Streamlit web application. This project is not just about textual interactions; it's about enabling a natural, voice-based dialogue with a chatbot.

For those who might not be familiar with OpenAI's capabilities in handling speech, I recommend watching my detailed video (watch here). It provides an excellent introduction to the speech-to-text and text-to-speech functionalities that are central to our project.

In this blog, we will walk through the entire process of setting up the development environment, incorporating OpenAI services into our application, and crafting a chatbot that can seamlessly converse with users using voice inputs and outputs.

Setting Up the Environment

To begin building our voice-responsive OpenAI chatbot, it's essential to set up the right development environment. This involves installing necessary libraries and configuring API access. Here's how you can get started:

1. Install Required Libraries

Your chatbot relies on several Python libraries, as listed in the requirements.txt file. These libraries include Streamlit for the web interface, OpenAI for accessing speech processing services, and others for specific functionalities like audio recording. Install them by running the following command in your project directory:

pip install -r requirements.txt

Here's a quick breakdown of the key libraries; a sample requirements.txt sketch follows the list:

  • streamlit: For building and running the web app.

  • openai: To access OpenAI's API for speech-to-text and text-to-speech services.

  • audio_recorder_streamlit: To record audio within the Streamlit app.

  • streamlit-float: Provides floating elements in the Streamlit interface.
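For reference, a requirements.txt covering these dependencies might look like the sketch below. The python-dotenv entry is my assumption for loading the .env file described in the next step, and version pins are omitted since the original file isn't shown:

streamlit
openai
audio-recorder-streamlit
streamlit-float
python-dotenv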

2. Set Up the .env File

Sensitive information such as your OpenAI API key should be stored in a .env file. This approach keeps your credentials secure. Create a .env file in the root of your project and include your OpenAI API key like this:

OPENAI_API_KEY='your_openai_api_key_here'

Ensure that this file is never shared publicly: add it to your .gitignore, especially if you are pushing your code to a public repository.
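For the utils.py code shown later to work, the OpenAI client has to be initialized with this key somewhere. Here is a minimal sketch, assuming the python-dotenv package; the client variable it creates is the one the speech and chat functions below rely on:

import os

from dotenv import load_dotenv
from openai import OpenAI

# Load variables from the .env file into the environment
load_dotenv()

# Create the client used by the speech-to-text, text-to-speech,
# and chat completion functions in utils.py
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))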

3. Understanding the Project Structure

Your project primarily consists of two Python files:

  • app.py: This file contains the Streamlit web application logic. It's where you define the user interface and manage the flow of input/output for the chatbot.

  • utils.py: This file includes functions for processing speech-to-text and text-to-speech, as well as generating chatbot responses.
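Putting this together with the files from the setup steps, the project layout looks roughly like this:

your-project/
├── app.py            # Streamlit interface and conversation flow
├── utils.py          # OpenAI speech and chat helper functions
├── requirements.txt  # Python dependencies
└── .env              # OPENAI_API_KEY (keep out of version control)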

With your environment set up and a basic understanding of your project's structure, you're now ready to start building the chatbot's functionalities.

Building the Chatbot: Streamlit Interface (app.py)

In this section, we dive into the construction of our chatbot, focusing on how the Streamlit interface is set up and how voice inputs are handled and processed in app.py.

Streamlit Interface Setup

Streamlit is a powerful tool that allows us to quickly build interactive web applications for our chatbot. In app.py, the Streamlit application is initialized and configured to handle user interactions:

import os  # used later to clean up temporary audio files

import streamlit as st
from utils import get_answer, text_to_speech, autoplay_audio, speech_to_text
from audio_recorder_streamlit import audio_recorder
from streamlit_float import *

# Initialize floating features for the interface
float_init()

# Initialize session state for managing chat messages
def initialize_session_state():
    if "messages" not in st.session_state:
        st.session_state.messages = [{"role": "assistant", "content": "Hi! How may I assist you today?"}]

initialize_session_state()

st.title("OpenAI Conversational Chatbot 🤖")

In this setup, we initialize the Streamlit app, import necessary functions from utils.py, and set up the session state to track and manage chat messages. The float_init() function from streamlit_float is used to create floating elements, enhancing the user interface.
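One detail worth adding here: a Streamlit chat app normally replays the stored history on every rerun so that earlier messages stay visible. A minimal sketch of that loop (my addition for completeness; the code in the repository may differ) looks like this:

# Replay the conversation history on each Streamlit rerun
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.write(message["content"])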

Handling Voice Inputs

The core functionality of our chatbot is its ability to handle voice inputs. This is achieved using the audio_recorder_streamlit library, which allows us to record audio directly in the Streamlit interface:

# Create a container for the microphone and audio recording
footer_container = st.container()
with footer_container:
    audio_bytes = audio_recorder()
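As written, this container behaves like any other Streamlit block; the point of streamlit-float is that you can then pin it in place. Assuming the float() helper that float_init() attaches to containers, keeping the recorder fixed at the bottom of the page looks like this:

# Pin the recorder container to the bottom of the page
# (float() is made available on containers by float_init())
footer_container.float("bottom: 0rem;")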

The audio_recorder() function captures audio input from the user. Once the audio is recorded, it's processed to extract the spoken text:

if audio_bytes:
    with st.spinner("Transcribing..."):
        # Write the recorded audio bytes to a temporary file
        audio_file_path = "temp_audio.mp3"
        with open(audio_file_path, "wb") as f:
            f.write(audio_bytes)

        # Convert the audio to text using the speech_to_text function
        transcript = speech_to_text(audio_file_path)
        if transcript:
            st.session_state.messages.append({"role": "user", "content": transcript})
            with st.chat_message("user"):
                st.write(transcript)
        # Clean up the temporary file whether or not transcription succeeded
        os.remove(audio_file_path)

Here, we write the recorded audio to a file and then use the speech_to_text function from utils.py to convert it into text. The transcribed text is then added to the session state for the chatbot to process.

Chatbot Response Processing

Once a user's voice input is converted to text, the chatbot processes this input to generate a response:

# Respond only when the most recent message came from the user
if st.session_state.messages[-1]["role"] != "assistant":
    with st.chat_message("assistant"):
        with st.spinner("Thinking🤔..."):
            final_response = get_answer(st.session_state.messages)
        with st.spinner("Generating audio response..."):
            # Convert the text reply to speech and play it back
            audio_file = text_to_speech(final_response)
            autoplay_audio(audio_file)
        st.write(final_response)
        st.session_state.messages.append({"role": "assistant", "content": final_response})
        # Remove the temporary audio file once playback has been triggered
        os.remove(audio_file)

In this part of the code, the get_answer function is used to generate a text response based on the user's input. This response is then converted to speech using the text_to_speech function, and the audio is played back to the user.

Integrating OpenAI's Services (utils.py)

In utils.py, we have defined key functions that integrate OpenAI's speech-to-text and text-to-speech services, along with the logic for generating chatbot responses. Let's explore these functions in detail.

speech_to_text Function

The speech_to_text function is responsible for converting the audio input from the user into text. This is a critical step in enabling the chatbot to understand and process user queries:

def speech_to_text(audio_data):
    # Send the recorded audio file to OpenAI's Whisper transcription endpoint;
    # response_format="text" returns the transcript as a plain string
    with open(audio_data, "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            response_format="text",
            file=audio_file
        )
    return transcript

In this function, the audio file captured from the user is opened and sent to OpenAI's speech-to-text service. The service transcribes the audio into text using the Whisper model, which is known for its high accuracy in speech recognition. The transcribed text is then returned for further processing by the chatbot.

text_to_speech Function

Conversely, the text_to_speech function takes the chatbot's textual response and converts it into an audio format, allowing the chatbot to 'speak' back to the user:

def text_to_speech(input_text):
    response = client.audio.speech.create(
        model="tts-1",
        voice="nova",
        input=input_text
    )
    # Save the generated speech to a temporary MP3 file
    audio_file_path = "temp_audio_play.mp3"
    response.stream_to_file(audio_file_path)
    return audio_file_path

Here, the chatbot's response text is converted into speech using OpenAI's text-to-speech service. The output is saved as an audio file, which is then played back to the user, creating an audio response.

get_answer Function

The get_answer function generates the chatbot's responses to user inputs. It uses OpenAI's language models to create contextually appropriate and conversational replies:

def get_answer(messages):
    # Prepend a system message that defines the chatbot's role
    system_message = [{"role": "system", "content": "You are a helpful AI chatbot that answers questions asked by the user."}]
    messages = system_message + messages
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        messages=messages
    )
    return response.choices[0].message.content

In this function, the conversation history is combined with a system message defining the chatbot's role. This data is then sent to OpenAI's conversational AI model, which generates a response based on the input and context.
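One helper that app.py imports but this walkthrough hasn't shown is autoplay_audio. A minimal sketch of such a function (a common approach of my own; the repository's version may differ) embeds the MP3 file as a base64 data URI inside an autoplaying HTML audio tag, since Streamlit has no built-in autoplay widget:

import base64

import streamlit as st

def autoplay_audio(file_path: str):
    # Read the audio file and encode it as base64
    with open(file_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    # Embed it in an autoplaying HTML <audio> tag
    st.markdown(
        f'<audio autoplay><source src="data:audio/mp3;base64,{b64}" '
        f'type="audio/mp3"></audio>',
        unsafe_allow_html=True,
    )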

Chatbot Interaction Flow

The interaction flow of the chatbot, as orchestrated in app.py, is a seamless integration of these functionalities. When a user speaks to the chatbot, the audio is recorded and converted to text using speech_to_text. The chatbot then processes this input with get_answer to generate a response. Finally, this response is converted back into speech using text_to_speech, allowing the chatbot to audibly communicate with the user. This flow creates a natural and interactive conversational experience, showcasing the potential of integrating advanced AI and speech processing technologies in a user-friendly application.
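With app.py and utils.py in place and your .env file configured, you can try the chatbot locally using Streamlit's standard run command, which opens the app in your browser:

streamlit run app.py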

Conclusion

As we wrap up our exploration of building a voice-responsive OpenAI chatbot with Streamlit, let's reflect on what we've accomplished and the potential for further development.

Reflecting on the Project

This project demonstrates the power and versatility of integrating advanced AI services into a user-friendly application. By combining OpenAI's speech-to-text and text-to-speech capabilities with Streamlit, we've created a chatbot that can understand spoken language and respond in kind. The key functionalities we've implemented, such as handling voice inputs, generating intelligent responses, and speaking back to the user, exemplify how AI can be used to create more natural and engaging user interfaces.

Additional Resources

For a detailed walkthrough of this project and a practical demonstration, make sure to watch my YouTube video. Also, you can access the complete code and documentation on my GitHub repository.

If you're curious about the latest in AI technology, I invite you to visit my project, AI Demos, at aidemos.com. It's a rich resource offering a wide array of video demos showcasing the most advanced AI tools. My goal with AI Demos is to educate and illuminate the diverse possibilities of AI.

For even more in-depth exploration, be sure to visit my YouTube channel at youtube.com/@aidemos.futuresmart. Here, you'll find a wealth of content that delves into the exciting future of AI and its various applications.


Written by

Ved Vekhande

I am a Data Science Intern at FutureSmart AI, where I work on projects involving LangChain, LlamaIndex, OpenAI, and more. I am a machine learning enthusiast with a passion for data, currently in my pre-final year pursuing a Bachelor's in Computer Science at IIIT Vadodara ICD.