Develop Your Own Multi-Modal AI Telegram Chatbot: A Beginner's Guide - Part 2
In Part 1, we built a basic Telegram Therapist AI Chatbot. In this part, we're going to enhance that initial code with two new features: a Retrieval-Augmented Generation (RAG) system and voice capabilities. If you missed Part 1, don't worry: we'll cover everything you need to know here. The RAG system uses chat history and smart prompts to make responses more accurate and relevant. And with OpenAI's Whisper-1 and TTS-1 models, the bot can turn voice messages into text and create natural-sounding voice replies, making the whole experience a lot more engaging.
Install the necessary libraries
pip install langchain-openai langchain python-telegram-bot python-dotenv ffmpeg-python openai
Note that ffmpeg-python is only a wrapper: the ffmpeg binary itself must also be installed on your system (e.g. via apt or brew).
Setup
RAG (Retrieval-Augmented Generation) is an AI framework that combines traditional information retrieval systems (like databases) with generative large language models (LLMs). RAG systems make chatbots better by pulling in the right info to improve response quality and keep conversations on point. This results in more accurate replies and personalized chats based on user queries and past interactions.
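To make the idea concrete, here is RAG in miniature: retrieve the snippets most relevant to a query, then prepend them to the prompt sent to the model. Naive keyword overlap stands in for a real vector store, and documents, retrieve, and build_prompt are illustrative names for this sketch, not from any library:

```python
documents = [
    "CBT focuses on identifying and reframing negative thought patterns.",
    "Sleep hygiene means keeping a regular bedtime and avoiding screens at night.",
    "Breathing exercises can reduce acute anxiety within minutes.",
]

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    # Score each document by how many words it shares with the query
    words = set(query.lower().split())
    scored = sorted(docs, key=lambda d: len(words & set(d.lower().split())), reverse=True)
    return scored[:k]

def build_prompt(query: str) -> str:
    # Prepend the retrieved context so the model answers with it in view
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}"

prompt = build_prompt("How does CBT handle negative thought patterns?")
assert "reframing negative thought patterns" in prompt
```

In our bot, the "retrieved" context will simply be the user's chat history rather than a document store, but the principle is the same.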
Let's move beyond the basic dictionary approach from Part 1 and incorporate a RAG system. First, we need to install some dependencies and do the basic setup. This time, we'll read the API keys from environment variables, so make sure you have a .env file in the same folder as your .py file, with OPENAI_API_KEY and TELEGRAM_BOT_TOKEN set to their respective keys.
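For reference, the .env file is just two key/value lines, one per variable; a minimal example with placeholder values (substitute your real keys):

```
OPENAI_API_KEY=sk-your-openai-key
TELEGRAM_BOT_TOKEN=123456789:your-telegram-bot-token
```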
import os
import subprocess
import tempfile
import uuid
from dotenv import load_dotenv
from telegram import Update
from telegram.ext import ApplicationBuilder, CommandHandler, MessageHandler, filters, CallbackContext
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage
from langchain_core.chat_history import InMemoryChatMessageHistory, BaseChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from openai import OpenAI
import ffmpeg
# Load environment variables
load_dotenv()
openai_api_key = os.getenv("OPENAI_API_KEY")
telegram_api_key = os.getenv("TELEGRAM_BOT_TOKEN")
# Adding ChatOpenAI to use langchain for the RAG system
model = ChatOpenAI(model="gpt-3.5-turbo")
# Instantiate OpenAI for TTS-1 and Whisper-1
client = OpenAI(api_key=openai_api_key)
# Store will keep our conversation history
store = {}
Enhancing Chatbot Intelligence with Retrieval-Augmented Generation (RAG)
This time, we are using the langchain library to implement the RAG system. Therefore, we need to reconfigure how we set up the chatbot and adjust the handler functions accordingly.
# Define the prompt template for the chatbot, including instructions for handling messages, voice, and video
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            """You will play the role of a human CBT (Cognitive Behavioral Therapy) therapist without a name,
            who mimics the popular AI program Eliza and has to treat me as a therapist patient.
            Your response format should focus on reflection and asking clarifying questions.
            Always use the informal you form. You may ask intermediate questions or ask further questions after the initial greeting.
            Exercise patience, but allow yourself to get frustrated when the same topics are brought up again and again.
            You may apologize if the discussion becomes offensive or overly emotional.
            Decide on a name and stick to it. Start by welcoming me to your office
            and ask for my name. Wait for my response. Then ask how you can help.
            Stay in your role. Do not make up patient responses: Only treat input as patient responses.
            You can recognize voice messages as well as videos. If the patient sends you a video,
            the incoming message starts with "The patient has sent you a video". Then simply react to the content
            and pretend you can really see the video. You can also recognize images.
            """
        ),
        MessagesPlaceholder(variable_name="messages"),
    ]
)
# Create a processing chain that combines the prompt template with the model
chain = prompt | model
# Function to retrieve or initialize the chat history for a given user session
# Function to retrieve or initialize the chat history for a given user session
def get_session_history(session_id: str) -> BaseChatMessageHistory:
    if session_id not in store:
        store[session_id] = InMemoryChatMessageHistory()  # Initialize if not present
    return store[session_id]

# Wrap the chain so it reads and writes the per-session history automatically
with_message_history = RunnableWithMessageHistory(chain, get_session_history)

# Function to generate a response based on user message and session history
async def generate_response(user_id, user_message) -> str:
    config = {"configurable": {"session_id": str(user_id)}}  # The session ID selects the right history
    response = with_message_history.invoke([HumanMessage(content=user_message)], config=config)  # Generate response
    return response.content
# Handler for /start command to initiate interaction with the chatbot
async def start(update: Update, context: CallbackContext) -> None:
    user_id = update.message.from_user.id  # Extract user ID
    user_message = update.message.text  # Extract user message
    response = await generate_response(user_id, user_message)  # Generate chatbot response
    await update.message.reply_text(response)  # Send response back to user

# Handler for general text messages to generate and send chatbot responses
async def handle_message(update: Update, context: CallbackContext) -> None:
    user_id = update.message.from_user.id  # Extract user ID
    user_message = update.message.text  # Extract user message
    response = await generate_response(user_id, user_message)  # Generate chatbot response
    await update.message.reply_text(response)  # Send response back to user
Here’s how it works: the get_session_history function fetches the chat history for each user, which helps the bot keep track of previous conversations. The RAG system then feeds this history, together with the smart prompt template in chain, back into the model, so every response is generated with the full user context, making the conversation feel more natural and connected.
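Stripped of the LangChain wrappers, the per-session memory is just a dictionary of message lists keyed by session ID. This library-free sketch (plain tuples instead of LangChain message objects, with remember as an illustrative helper) shows why each user's conversation stays separate:

```python
store = {}

def get_session_history(session_id):
    # First message from this user: create an empty history
    if session_id not in store:
        store[session_id] = []
    return store[session_id]

def remember(session_id, role, text):
    # Append one (role, text) turn to the user's history
    get_session_history(session_id).append((role, text))

# Two users talk to the bot; each gets an isolated thread
remember("alice", "human", "I feel stressed.")
remember("alice", "ai", "What do you think triggers that stress?")
remember("bob", "human", "Hello!")

assert len(get_session_history("alice")) == 2
assert get_session_history("bob") == [("human", "Hello!")]
```

RunnableWithMessageHistory does exactly this bookkeeping for us, using InMemoryChatMessageHistory objects instead of plain lists.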
Integrating Voice Features: From Speech to Text and Back Again
First, we need to convert the voice message to text using OpenAI's Whisper-1 model. This model transcribes the audio, allowing us to process the content of the voice message effectively.
def speech_to_text_conversion(file_path):
    # Open the audio file specified by file_path in binary read mode
    with open(file_path, 'rb') as file_like:
        # Use OpenAI's Whisper-1 model to convert speech in the audio file to text
        transcription = client.audio.transcriptions.create(
            model="whisper-1",
            file=file_like
        )
    # Return the transcribed text from the audio file
    return transcription.text
Next, we create a function to generate a voice message from the text response provided by OpenAI. We use the .ogg format because it displays correctly on Telegram, unlike other formats, where issues such as an incorrect message-length display may occur. The unique_id ensures that each voice message is uniquely handled, preventing any overwriting of messages from different users. We also remove leftover files to keep things tidy.
async def text_to_speech_conversion(text) -> str:
    # Generate a unique ID for temporary file names
    unique_id = uuid.uuid4().hex
    mp3_path = f'{unique_id}.mp3'  # Path for temporary MP3 file
    ogg_path = f'{unique_id}.ogg'  # Path for final OGG file
    # Convert the input text to speech and save it as an MP3 file
    with client.audio.speech.with_streaming_response.create(
        model="tts-1",  # Use the text-to-speech model
        voice="nova",   # Specify the voice model to use
        input=text      # Text to convert to speech
    ) as response:
        # Write the streamed audio data to the MP3 file
        with open(mp3_path, 'wb') as f:
            for chunk in response.iter_bytes():
                f.write(chunk)
    # Convert the MP3 file to OGG format with OPUS codec using ffmpeg
    ffmpeg.input(mp3_path).output(ogg_path, codec='libopus').run(overwrite_output=True)
    # Remove the temporary MP3 file as it is no longer needed
    os.remove(mp3_path)
    # Return the path to the final OGG file
    return ogg_path
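A note on why the unique_id matters: with fixed file names, two users hitting text_to_speech_conversion at the same moment would overwrite each other's audio. This small sketch isolates the naming scheme; make_audio_paths is a hypothetical helper for illustration, not part of the bot code:

```python
import os
import uuid

def make_audio_paths(directory: str = ".") -> tuple[str, str]:
    # Build a matching .mp3/.ogg pair from one uuid so file
    # names from concurrent users can never collide
    unique_id = uuid.uuid4().hex
    mp3_path = os.path.join(directory, f"{unique_id}.mp3")
    ogg_path = os.path.join(directory, f"{unique_id}.ogg")
    return mp3_path, ogg_path

# Two independent calls yield distinct names, so one user's reply
# can never overwrite another's half-written file
first = make_audio_paths()
second = make_audio_paths()
assert first != second
```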
With these functions in place, we can now implement the process_voice_message function. It converts an incoming voice message to text and then generates a response in voice format. We also need a function to send the generated voice message to the user. Although combining these functions into one is possible, keeping them separate simplifies handling additional features like video processing later.
async def process_voice_message(update: Update, context: CallbackContext):
    user_id = update.effective_user.id  # Get the ID of the user sending the message
    # Download and save the voice message from Telegram
    file = await update.message.voice.get_file()  # Fetch the voice file
    file_bytearray = await file.download_as_bytearray()  # Download the file as a byte array
    # Save the byte array to a temporary OGG file
    with tempfile.NamedTemporaryFile(suffix=".ogg", delete=False) as temp_ogg:
        temp_ogg.write(file_bytearray)  # Write byte data to the file
        temp_ogg_path = temp_ogg.name  # Get the file path
    # Convert the temporary OGG file to WAV format
    wav_path = temp_ogg_path.replace('.ogg', '.wav')
    subprocess.run(['ffmpeg', '-i', temp_ogg_path, wav_path], check=True)  # Use ffmpeg for conversion
    # Convert the WAV file to text using speech-to-text conversion
    text = speech_to_text_conversion(wav_path)
    # Clean up the temporary audio files once they have been transcribed
    os.remove(temp_ogg_path)
    os.remove(wav_path)
    # Generate a response based on the text and convert it to speech
    response = await generate_response(user_id, text)
    audio_path = await text_to_speech_conversion(response)
    # Send the generated speech response as a voice message
    await send_voice_message(update, context, audio_path)

async def send_voice_message(update: Update, context: CallbackContext, audio_path: str):
    # Open the audio file and send it as a voice message
    with open(audio_path, 'rb') as audio_data:
        await update.message.reply_voice(voice=audio_data)
    # Remove the OGG file from the server after sending it
    if os.path.exists(audio_path):
        os.remove(audio_path)
Finally, we add a handler to our main function that recognizes voice messages and routes them to the process_voice_message function. This ensures the bot can handle incoming voice messages and respond appropriately.
def main() -> None:
    # Create the Telegram bot application using the provided token
    application = ApplicationBuilder().token(telegram_api_key).build()
    # Add handler for the /start command, which triggers the 'start' function
    application.add_handler(CommandHandler('start', start))
    # Add handler for text messages (excluding commands), which triggers the 'handle_message' function
    application.add_handler(MessageHandler(filters.TEXT & ~filters.COMMAND, handle_message))
    # Add handler for voice messages, which triggers the 'process_voice_message' function
    application.add_handler(MessageHandler(filters.VOICE, process_voice_message))
    # Start polling for new messages and handle them as they arrive
    application.run_polling()

if __name__ == '__main__':
    main()  # Run the main function if this script is executed directly
Wrapping Up: Elevating Your AI Chatbot with Advanced Features
Wrapping up, we've significantly enhanced our AI Telegram chatbot by incorporating a Retrieval-Augmented Generation (RAG) system and voice features. The RAG system improves response accuracy and relevance, while the voice capabilities allow the bot to process and respond to voice messages, enriching the user experience. With these upgrades, your chatbot can now engage users in more dynamic ways. Stay tuned for the next part, where we’ll extend these features to handle video, picture, and URL processing.
See Part 3 where we add image and video recognition capabilities to our Chatbot.
I'm Jonas, and you can connect with me on LinkedIn or follow me on Twitter @jonasjeetah.