Develop Your Own Multi-Modal AI Telegram Chatbot: A Beginner's Guide - Part 2

Jonas Gebhardt

In Part 1, we built a basic Telegram Therapist AI Chatbot. In this part, we'll enhance and expand that initial code. If you missed Part 1, don't worry: everything you need is covered here.

Specifically, we'll add two new features: a Retrieval-Augmented Generation (RAG) system and voice capabilities. The RAG system combines the chat history with a carefully crafted prompt to make responses more accurate and relevant. On top of that, OpenAI's Whisper-1 and TTS-1 models let the bot transcribe incoming voice messages and reply with natural-sounding voice messages of its own, making the whole experience a lot more engaging.


Install the necessary libraries

pip install langchain-openai langchain ffmpeg-python openai python-dotenv

Note that ffmpeg-python is only a thin wrapper around the ffmpeg command-line tool, so the ffmpeg binary itself must also be installed on your system. If you skipped Part 1, install python-telegram-bot as well.

Setup

RAG (Retrieval-Augmented Generation) is an AI framework that combines traditional information retrieval systems (like databases) with generative large language models (LLMs). RAG systems make chatbots better by pulling in the right info to improve response quality and keep conversations on point. This results in more accurate replies and personalized chats based on user queries and past interactions.

Let's move beyond the basic dictionary approach from Part 1 and incorporate a RAG system. First, we need to do the basic setup. This time, we'll read the API keys from environment variables, so make sure you have a .env file in the same folder as your .py file and set OPENAI_API_KEY and TELEGRAM_BOT_TOKEN to their respective keys.
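
For reference, the .env file only needs these two entries (the values below are placeholders, not real keys):

OPENAI_API_KEY=sk-your-openai-key
TELEGRAM_BOT_TOKEN=123456789:your-telegram-bot-token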

import os
import subprocess
import tempfile
import uuid
from dotenv import load_dotenv
from telegram import Update
from telegram.ext import ApplicationBuilder, CommandHandler, MessageHandler, filters, CallbackContext
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage
from langchain_core.chat_history import InMemoryChatMessageHistory, BaseChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from openai import OpenAI
import ffmpeg

# Load environment variables
load_dotenv()
openai_api_key = os.getenv("OPENAI_API_KEY")
telegram_api_key = os.getenv("TELEGRAM_BOT_TOKEN")

# Adding ChatOpenAI to use langchain for the RAG system
model = ChatOpenAI(model="gpt-3.5-turbo")
# Instantiate OpenAI for TTS-1 and Whisper-1
client = OpenAI(api_key=openai_api_key)
# Store will keep our conversation history
store = {}

Enhancing Chatbot Intelligence with Retrieval-Augmented Generation (RAG)

This time, we are using the langchain library to implement the RAG system. Therefore, we need to reconfigure how we set up the chatbot and adjust the handler functions accordingly.

# Define the prompt template for the chatbot, including instructions for handling messages, voice, and video
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "assistant",
            """You will play the role of a human CBT (Cognitive Behavioral Therapy) therapist without a name,
               who mimics the popular AI program Eliza and has to treat me as a therapist patient.
               Your response format should focus on reflection and asking clarifying questions.
               Always use the informal you form. You may ask intermediate questions or ask further questions after the initial greeting.
               Exercise patience, but allow yourself to get frustrated when the same topics are brought up again and again.
               You may apologize if the discussion becomes offensive or overly emotional.
               Decide on a name and stick to it. Start by welcoming me to your office
               and ask for my name. Wait for my response. Then ask how you can help.
               Stay in your role. Do not make up patient responses: Only treat input as patient responses.
               You can recognize voice messages as well as videos. If the patient sends you a video,
               the prompt will start with "The patient has sent you a video". Then simply react to the content
               and pretend you can really see the video. You can also recognize images.
            """
        ),
        MessagesPlaceholder(variable_name="messages"),
    ]
)

# Create a processing chain that combines the prompt template with the model
chain = prompt | model

# Function to retrieve or initialize the chat history for a given user session
async def get_session_history(session_id: str) -> BaseChatMessageHistory:
    if session_id not in store:
        store[session_id] = InMemoryChatMessageHistory()  # Initialize if not present
    return store[session_id]

# Function to generate a response based on user message and session history
async def generate_response(user_id, user_message) -> str:
    config = {"configurable": {"session_id": user_id}}  # Configuration including session ID
    session_history = await get_session_history(user_id)  # Retrieve session history
    with_message_history = RunnableWithMessageHistory(chain, lambda: session_history)  # Prepare to handle message history
    response = with_message_history.invoke([HumanMessage(content=user_message)], config=config)  # Generate response
    return response.content

# Handler for /start command to initiate interaction with the chatbot
async def start(update: Update, context: CallbackContext) -> None:
    user_id = update.message.from_user.id  # Extract user ID
    user_message = update.message.text  # Extract user message
    response = await generate_response(user_id, user_message)  # Generate chatbot response
    await update.message.reply_text(response)  # Send response back to user

# Handler for general text messages to generate and send chatbot responses
async def handle_message(update: Update, context: CallbackContext) -> None:
    user_id = update.message.from_user.id  # Extract user ID
    user_message = update.message.text  # Extract user message
    response = await generate_response(user_id, user_message)  # Generate chatbot response
    await update.message.reply_text(response)  # Send response back to user

Here’s how it works: get_session_history fetches (or creates) the chat history for each user, so the bot keeps track of previous conversations. In generate_response, RunnableWithMessageHistory wraps the chain and prepends that history to the new message before it reaches the model, so the prompt template, the conversation so far, and the latest user input are all combined into a single request; after the model answers, the exchange is written back into the store. The result is responses that are more relevant and coherent, making the conversation feel more natural and connected.
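
You can sanity-check this pipeline without Telegram at all. The snippet below is just a quick local test (the user ID and messages are made up); it assumes the code above is in the same module and should be run on its own, before wiring up the Telegram handlers:

import asyncio

async def quick_test():
    # Two messages from the same "user" — the second reply should reflect the first
    print(await generate_response("demo-user", "Hi, I'm feeling a bit overwhelmed at work."))
    print(await generate_response("demo-user", "Can you remind me what I just told you?"))

asyncio.run(quick_test())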

Integrating Voice Features: From Speech to Text and Back Again

First, we need to convert the voice message to text using OpenAI's Whisper-1 model. This model transcribes the audio, allowing us to process the content of the voice message effectively.

def speech_to_text_conversion(file_path):
    # Open the audio file specified by file_path in binary read mode
    with open(file_path, 'rb') as file_like:
        # Use OpenAI's Whisper-1 model to convert speech in the audio file to text
        transcription = client.audio.transcriptions.create(
            model="whisper-1",
            file=file_like
        )
    # Return the transcribed text from the audio file
    return transcription.text

Next, we create a function that turns the text response from OpenAI into a voice message. We convert the audio to .ogg with the Opus codec because that is what Telegram expects for voice notes; other formats can cause glitches such as an incorrect message duration being displayed. The unique_id gives each voice message its own file names, so concurrent requests from different users can't overwrite each other, and we delete the intermediate MP3 file to keep things tidy.

async def text_to_speech_conversion(text) -> str:
    # Generate a unique ID for temporary file names
    unique_id = uuid.uuid4().hex
    mp3_path = f'{unique_id}.mp3'  # Path for temporary MP3 file
    ogg_path = f'{unique_id}.ogg'  # Path for final OGG file

    # Convert the input text to speech and save it as an MP3 file
    with client.audio.speech.with_streaming_response.create(
        model="tts-1",       # Use the text-to-speech model
        voice="nova",        # Specify the voice model to use
        input=text           # Text to convert to speech
    ) as response:
        # Write the streamed audio data to the MP3 file
        with open(mp3_path, 'wb') as f:
            for chunk in response.iter_bytes():
                f.write(chunk)

    # Convert the MP3 file to OGG format with OPUS codec using ffmpeg
    ffmpeg.input(mp3_path).output(ogg_path, codec='libopus').run(overwrite_output=True)

    # Remove the temporary MP3 file as it is no longer needed
    os.remove(mp3_path)

    # Return the path to the final OGG file
    return ogg_path
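
To try both helpers together outside Telegram, a minimal round trip could look like this (sample_voice_note.ogg is a placeholder filename for a short audio file on disk; the snippet assumes both functions above are defined in the same module):

import asyncio

async def round_trip_demo():
    # Transcribe a local voice note, then synthesize the transcript back into speech
    transcript = speech_to_text_conversion("sample_voice_note.ogg")
    reply_path = await text_to_speech_conversion(transcript)
    print(f"Transcript: {transcript}")
    print(f"Synthesized audio written to: {reply_path}")

asyncio.run(round_trip_demo())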

With these functions in place, we can now implement the process_voice_message function. This function handles the conversion of voice messages to text and then generates a response in voice format. We also need a function to send the generated voice message to the user. Although combining these functions into one is possible, keeping them separate simplifies handling additional features like video processing later.

async def process_voice_message(update: Update, context: CallbackContext):
    user_id = update.effective_user.id  # Get the ID of the user sending the message

    # Download and save the voice message from Telegram
    file = await update.message.voice.get_file()  # Fetch the voice file
    file_bytearray = await file.download_as_bytearray()  # Download the file as a byte array

    # Save the byte array to a temporary OGG file
    with tempfile.NamedTemporaryFile(suffix=".ogg", delete=False) as temp_ogg:
        temp_ogg.write(file_bytearray)  # Write byte data to the file
        temp_ogg_path = temp_ogg.name  # Get the file path

    # Convert the temporary OGG file to WAV format
    wav_path = temp_ogg_path.replace('.ogg', '.wav')
    subprocess.run(['ffmpeg', '-i', temp_ogg_path, wav_path], check=True)  # Use ffmpeg for conversion

    # Convert the WAV file to text using speech-to-text conversion
    text = speech_to_text_conversion(wav_path)

    # Remove the temporary audio files now that transcription is done
    os.remove(temp_ogg_path)
    os.remove(wav_path)

    # Generate a response based on the text and convert it to speech
    response = await generate_response(user_id, text)
    audio_path = await text_to_speech_conversion(response)

    # Send the generated speech response as a voice message
    await send_voice_message(update, context, audio_path)


async def send_voice_message(update: Update, context: CallbackContext, audio_path: str):
    # Open the audio file and send it as a voice message
    with open(audio_path, 'rb') as audio_data:
        await update.message.reply_voice(voice=audio_data)

    # Remove the OGG file from the server after sending it
    if os.path.exists(audio_path):
        os.remove(audio_path)

Finally, we add a handler to our main function to recognize voice messages and link them to the process_voice_message function. This integration ensures that the bot can handle incoming voice messages and respond appropriately.

def main() -> None:
    # Create the Telegram bot application using the provided token
    application = ApplicationBuilder().token(telegram_api_key).build()

    # Add handler for the /start command, which triggers the 'start' function
    application.add_handler(CommandHandler('start', start))

    # Add handler for text messages (excluding commands), which triggers the 'handle_message' function
    application.add_handler(MessageHandler(filters.TEXT & ~filters.COMMAND, handle_message))

    # Add handler for voice messages, which triggers the 'process_voice_message' function
    application.add_handler(MessageHandler(filters.VOICE, process_voice_message))

    # Start polling for new messages and handle them as they arrive
    application.run_polling()

if __name__ == '__main__':
    main()  # Run the main function if this script is executed directly

Wrapping Up: Elevating Your AI Chatbot with Advanced Features

We've significantly enhanced our AI Telegram chatbot by incorporating a Retrieval-Augmented Generation (RAG) system and voice features. The RAG system improves response accuracy and relevance, while the voice capabilities let the bot understand and answer voice messages, enriching the user experience. With these upgrades, your chatbot can engage users in far more dynamic ways. Stay tuned for the next part, where we'll extend these features to handle video, picture, and URL processing.

See Part 3 where we add image and video recognition capabilities to our Chatbot.

I'm Jonas, and you can connect with me on LinkedIn or follow me on Twitter @jonasjeetah.

