TL;DR ✨

In this easy-to-follow tutorial, you will learn how to build your own Siri-like voice assistant with the LLAMA-3 AI model. 😎

What you will learn: 👀

Learn how to set up TTS in a Python project using OpenAI TTS / Pyttsx3 / gTTS.
Learn to generate chat responses using Groq with the LLAMA-3 model.
Learn to capture webcam images and process them using Google Generative AI.
Learn to automate all the manual tasks with shell scripting.

Let's build this thing together! 😵‍💫

Setting up the Environment 🛠️

Create a folder to keep all your source code for the project:

mkdir siri-voice-llama3
cd siri-voice-llama3

Create a few new subfolders where we will store the source code, shell scripts, logs and chat history:

mkdir -p src logs src/scripts data/ai_response data/chat_history

Now that the initial folder structure is set up, it's time to create a new virtual environment and install all the modules we will be using in our project.

Run these commands to create and activate a new virtual environment in the root of our project:

python3 -m venv .venv
source .venv/bin/activate # If you are using fish shell, change the activate binary to activate.fish

Run this command to install all the necessary modules we will be using in our project:

pip3 install SpeechRecognition opencv-python openai google-generativeai gTTS pyttsx3 groq faster-whisper numpy python-dotenv pyperclip pydub PyAudio pillow

⚠️ NOTE: Installing packages this way can lead to issues if they change in the future. For exact versions, please find my requirements.txt file here. Copy the contents to create this file in your project's root directory.

To install, run: pip3 install -r requirements.txt

Here is what each module is used for:

SpeechRecognition: Enables speech recognition from audio files or streams.
opencv-python: Used for processing webcam images.
groq: A library for working with Groq. Used to generate response from LLAMA-3.
google-generativeai: Used in image processing to provide context.
faster-whisper: A faster implementation of Whisper for speech recognition.
python-dotenv: For reading key-value pairs from a .env file.
pyperclip: Facilitates clipboard operations (copying and pasting).
pydub: Handles audio manipulation tasks.
PyAudio: Manages audio input/output.
numpy: Supports numerical computing and efficient array handling.
pillow: A fork of the Python Imaging Library (PIL) for image processing.

Optional Modules:

ℹ️ Among these, only one is required.

openai: Enables Text-to-Speech using OpenAI's streaming audio.
gTTS: Google Text-to-Speech library for generating speech from text.
pyttsx3: A Python Text-to-Speech library for offline speech synthesis.

Let's Code It 💻

Setting Up Chat History Support 📋

💡 We are going to add chat history support, with a separate log file for each day.

Inside the src directory, add a file named utils.py. This file will hold all the helper functions our program needs. Start with the following code:

# 👇 siri-voice-llama3/src/utils.py

import os
import sys
from datetime import datetime
from pathlib import Path
from typing import Literal, NoReturn, Optional

import pyperclip
from PIL import ImageGrab

import utils


def get_log_file_for_today(project_root_folder_path: Path) -> Path:
    """
    Retrieves the log file path for today's date, ensuring that the necessary
    directories are created. If the log file for the current day does not exist,
    it creates an empty log file.

    Args:
        project_root_folder_path (Path): The root folder of the project, where the 'data'
        directory resides.

    Returns:
        Path: The absolute path to the log file for today's date.
    """

    today = datetime.today()

    # The year is always 4 digit and the month, day is always 2 digit using this format.
    year = today.strftime("%Y")
    month = today.strftime("%m")
    day = today.strftime("%d")

    base_folder = os.path.join(
        project_root_folder_path, "data", "chat_history", year, month
    )

    os.makedirs(base_folder, exist_ok=True)
    chat_log_file = os.path.join(base_folder, f"{day}.log")

    Path(chat_log_file).touch(exist_ok=True)

    return Path(os.path.abspath(chat_log_file))


def log_chat_message(
    log_file_path: Path,
    user_message: Optional[str] = None,
    ai_message: Optional[str] = None,
) -> None:
    """
    Logs user and assistant chat messages to the provided log file, along with
    a timestamp. Either the user message or the assistant message (or both) can
    be provided.

    Args:
        log_file_path (Path): The absolute path to the log file where messages will be logged.
        user_message (Optional[str]): The message sent by the user. Defaults to None.
        ai_message (Optional[str]): The message generated by the assistant. Defaults to None.

    Returns:
        None: This function appends the messages to the log file in a readable format
        with a timestamp. It does not return anything.
    """

    # If neither of the message is given, return.
    if not user_message and not ai_message:
        return

    timestamp = datetime.now().strftime("[%H : %M : %S]")

    with open(log_file_path, "a") as log_file:
        if user_message:
            log_file.write(f"{timestamp} - USER: {user_message}\n")

        if ai_message:
            log_file.write(f"{timestamp} - ASSISTANT: {ai_message}\n")

        log_file.write("\n")

The get_log_file_for_today function takes in a path to the project root folder, which is usually where our main.py file will be located.

It constructs a path to today's log file stored in data/chat_history/{year}/{month}/{day}.log. If the file doesn't exist, it creates an empty file. Either way, it returns the absolute path to the file.

The log_chat_message function takes the path to the log file, the user message, and the AI message, then logs the received messages with a specific timestamp.
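
To make the folder layout concrete, here is a minimal, self-contained sketch that mirrors the path logic of get_log_file_for_today, using a temporary directory in place of the real project root. The helper name log_path_for and the fixed date are assumptions made for illustration; they are not part of the project code.

```python
import tempfile
from datetime import date
from pathlib import Path


def log_path_for(project_root: Path, day: date) -> Path:
    """Build data/chat_history/<year>/<month>/<day>.log and make sure it exists."""
    # %Y is always four digits; %m and %d are always zero-padded to two digits.
    folder = project_root / "data" / "chat_history" / f"{day:%Y}" / f"{day:%m}"
    folder.mkdir(parents=True, exist_ok=True)

    log_file = folder / f"{day:%d}.log"
    log_file.touch(exist_ok=True)
    return log_file.resolve()


# Hypothetical date, chosen only to show the zero padding.
with tempfile.TemporaryDirectory() as tmp:
    path = log_path_for(Path(tmp), date(2024, 9, 5))
    relative_parts = path.parts[-5:]
    print(relative_parts)
```

Grouping logs by year and month keeps each folder small, and the zero-padded names sort chronologically in a file browser.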

API Keys Configuration 🔑

For this project, we will need a few API keys: a Groq key, a Google Generative AI key, and optionally an OpenAI key.

Create a new file .env in the root of the project and populate it with the API keys.

# Required
GROQ_API_KEY=
GOOGLE_GENERATIVE_AI_API_KEY=

# Optional
OPENAI_API_KEY=

After populating the .env file with the API keys, it's time to get access to it in our Python code.

Inside the src directory, create a new file setup.py with the following code:

# 👇 siri-voice-llama3/src/setup.py

import os

from dotenv import load_dotenv

import utils


def get_credentials() -> tuple[str, str, str | None]:
    """
    Load API keys from environment variables and return them as a tuple.

    This function loads environment variables from a `.env` file using `dotenv`.
    It retrieves the Groq API key, Google Generative AI API key, and OpenAI API key.
    If any of the required keys are missing, it exits the program with an error message.

    Returns:
        tuple[str, str, str | None]: A tuple containing the Groq API key, Google Generative AI API key,
                              and OpenAI API key.

    Raises:
        SystemExit: If any of the required API keys are not found, the program exits with an error message.
    """
    load_dotenv()

    groq_api_key: str | None = os.getenv("GROQ_API_KEY")
    google_gen_ai_api_key: str | None = os.getenv("GOOGLE_GENERATIVE_AI_API_KEY")
    openai_api_key: str | None = os.getenv("OPENAI_API_KEY")

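    # An entry such as `GROQ_API_KEY=` loads as an empty string, so check for
    # falsy values rather than only `None`.
    if not groq_api_key or not google_gen_ai_api_key: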
        return utils.exit_program(
            status_code=1,
            message="Missing required API key(s). Make sure to set them in `.env` file. If you are using the OpenAI approach, then populate the OpenAI api key as well.",
        )

    return groq_api_key, google_gen_ai_api_key, openai_api_key

The get_credentials function loads API keys from environment variables using the dotenv library and returns them as a tuple.

If either the Groq or Google API keys are missing, the function exits the program with an error message, prompting the user to set the necessary keys in a .env file. The OpenAI key is returned as optional and can be None if not set.
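
As a rough illustration of the required-versus-optional check, the sketch below validates a plain mapping instead of the real environment, so it runs without a .env file. The helper find_missing_keys and the placeholder values are hypothetical, and it assumes that a blank value should count as missing.

```python
from typing import Mapping, Optional

# The two keys the tutorial treats as required; OPENAI_API_KEY stays optional.
REQUIRED_KEYS = ("GROQ_API_KEY", "GOOGLE_GENERATIVE_AI_API_KEY")


def find_missing_keys(env: Mapping[str, Optional[str]]) -> list[str]:
    """Return the required keys that are absent or blank in the given mapping."""
    return [key for key in REQUIRED_KEYS if not env.get(key)]


# A blank value, as left behind by an unfilled `.env` template line.
example_env = {
    "GROQ_API_KEY": "placeholder-value",
    "GOOGLE_GENERATIVE_AI_API_KEY": "",
}
missing = find_missing_keys(example_env)
print(missing)
```

In a real program you could pass os.environ to the same helper after calling load_dotenv(), and report the missing names in the error message.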

Defining Additional Helper Functions 👷

Above in the get_credentials function in setup.py, we are using utils.exit_program, but we have not defined it yet.

Let's work on that and add some more helper functions that we will need in the project.

Inside the utils.py file in the src directory, add the following lines of code:

# 👇 siri-voice-llama3/src/utils.py

# Rest of the code...

def exit_program(status_code: int = 0, message: str = "") -> NoReturn:
    """
    Exit the program with an optional error message.

    Args:
        status_code (int): The exit status code. Defaults to 0 (success).
        message (str): An optional error message to display before exiting.
    """

    if message:
        print(f"ERROR: {message}\n")
    sys.exit(status_code)

def get_path_to_folder(folder_type: Literal["webcam", "screenshot"]) -> Path:
    """
    Get the path to the specified folder type (webcam or screenshot).

    Args:
        folder_type (Literal["webcam", "screenshot"]): The type of folder to retrieve the path for.

    Returns:
        Path: The path to the specified folder.

    Raises:
        ValueError: If the folder_type is not valid.
    """

    base_path = Path(os.path.join(Path.home(), "Pictures", "llama3.1"))
    folder_map = {
        "screenshot": Path(os.path.join(base_path, "Screenshots")),
        "webcam": Path(os.path.join(base_path, "Webcam")),
    }

    if folder_type not in folder_map:
        raise ValueError(
            f"ERROR: Invalid folder_type: {folder_type}. Expected 'webcam' or 'screenshot'."
        )

    return folder_map[folder_type]

The exit_program function, as the name suggests, exits the program with a specified status code and an optional error message. If a message is provided, it prints it before exiting.

The get_path_to_folder function constructs and returns the folder path for a specified folder type, either "webcam" or "screenshot." It combines the user's home directory with a predefined base path ("Pictures/llama3.1") and appends the relevant folder name. We will use this function to store images in the appropriate folder, either for webcam or screenshots.

Now, we will define a few more functions that deal with capturing and removing screenshots and getting clipboard text.

# 👇 siri-voice-llama3/src/utils.py

# Rest of the code...

def capture_screenshot() -> Path:
    """
    Captures a screenshot and saves it to the designated folder.

    Returns:
        Path: The file path of the saved screenshot.
    """

    screenshot_folder_path = get_path_to_folder(folder_type="screenshot")

    os.makedirs(screenshot_folder_path, exist_ok=True)

    screen = ImageGrab.grab()

    time_stamp = datetime.now().strftime("%Y_%m_%d_%H_%M_%S")
    rgb_screenshot = screen.convert("RGB")

    image_filename = f"screenshot_{time_stamp}.png"
    image_file_path = Path(os.path.join(screenshot_folder_path, image_filename))

    rgb_screenshot.save(image_file_path, quality=20)

    return image_file_path


def remove_last_screenshot() -> None:
    """
    Remove the most recent screenshot file from the designated screenshots folder.

    The function checks if the folder exists and if there are any .png files. If
    found, it deletes the most recently created screenshot.
    """

    folder_path = get_path_to_folder(folder_type="screenshot")

    if not os.path.exists(folder_path):
        return

    files = [
        file
        for file in os.listdir(folder_path)
        if os.path.isfile(os.path.join(folder_path, file)) and file.endswith(".png")
    ]
    if not files:
        return

    most_recent_file = max(
        files, key=lambda f: os.path.getctime(os.path.join(folder_path, f))
    )

    os.remove(os.path.join(folder_path, most_recent_file))


def get_clipboard_text() -> str:
    """
    Retrieves the current text content from the system clipboard.

    This function uses the `pyperclip` module to access the clipboard. If the clipboard
    content is a valid string, it returns the content. If the content is not a string,
    it returns an empty string.

    Returns:
        str: The text content from the clipboard, or an empty string if the content is
        not a string or the clipboard is empty.
    """

    clipboard_content = pyperclip.paste()

    if isinstance(clipboard_content, str):
        return clipboard_content

    return ""

The capture_screenshot function captures the current screen using the ImageGrab module, saves it as a PNG file in a designated screenshots folder, and returns the full file path of the saved screenshot. 📸

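It constructs the filename with a timestamp to ensure uniqueness and passes quality=20 when saving. The intent is to lower the image quality so the image is faster to process later. Note, however, that Pillow's PNG writer ignores the quality option (it applies to lossy formats such as JPEG), so the PNG is still saved losslessly. To actually shrink the file, you would need to downscale the image or save it as a JPEG instead.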

The remove_last_screenshot function identifies and deletes the most recently created screenshot from the designated folder. It first checks if the folder exists and looks for .png files within it. If files are found, it uses the creation time to determine the most recent file before removing it. 🚮

The get_clipboard_text function uses the pyperclip module to access the clipboard. If the clipboard content is a valid string, it returns the content. If the content is not a string, it returns an empty string.
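
The "find the newest file" step in remove_last_screenshot can be sketched on its own with a temporary folder. This version is an illustration rather than the tutorial's code: the helper name newest_png is hypothetical, and it sorts by modification time (st_mtime) instead of os.path.getctime, because on Unix getctime reports the last metadata change rather than the creation time.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def newest_png(folder: Path) -> Optional[Path]:
    """Return the most recently modified .png file in folder, or None."""
    candidates = [p for p in folder.glob("*.png") if p.is_file()]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    empty_result = newest_png(folder)  # No files yet, so this is None.

    for offset, name in enumerate(["old.png", "new.png", "notes.txt"]):
        file_path = folder / name
        file_path.touch()
        # Set explicit timestamps so the ordering is deterministic.
        os.utime(file_path, (1_700_000_000 + offset, 1_700_000_000 + offset))

    newest_name = newest_png(folder).name
    print(newest_name)
```

The .txt file has the latest timestamp but is filtered out by the glob, so only screenshots are ever candidates for deletion.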

Integrating Webcam Support 📸

To capture images from the webcam, we first need to add webcam support.

Inside the src directory, create a new file webcam.py and add the following lines of code:

# 👇 siri-voice-llama3/src/webcam.py

import os
from datetime import datetime
from pathlib import Path
from typing import NoReturn, Union

import cv2

import utils


def get_available_webcam() -> cv2.VideoCapture | None:
    """
    Checks for available webcams and returns the first one that is opened.

    This function attempts to open the first 10 webcam indices. If a webcam is found
    and successfully opened, it returns a VideoCapture object. If no webcams are found,
    it exits the program with an error message.

    Returns:
        cv2.VideoCapture: The opened webcam object.
        None: If no webcam is found, the program exits with an error message.
    """

    # Assuming that we are checking the first 10 webcams.
    for index in range(10):
        web_cam = cv2.VideoCapture(index)
        if web_cam.isOpened():
            return web_cam
        # Free the handle for indices that did not open.
        web_cam.release()

    return utils.exit_program(status_code=1, message="No webcams found.")


def capture_webcam_image() -> Union[Path, NoReturn]:
    """
    Captures an image from the available webcam and saves it to the specified folder.

    This function first checks for an available webcam using `get_available_webcam`.
    If a webcam is successfully opened, it creates a folder for saving the images if
    it does not already exist, generates a timestamped filename, captures a frame,
    and saves the image to the specified folder. The function then releases the webcam.

    Returns:
        Path: The file path of the saved image.
        NoReturn: If there was an error capturing the image, the program exits with an error message.
    """

    webcam = get_available_webcam()
    if webcam is None or not webcam.isOpened():
        return utils.exit_program(
            status_code=1, message="There was an error capturing the image."
        )

    webcam_folder_path = utils.get_path_to_folder(folder_type="webcam")

    os.makedirs(webcam_folder_path, exist_ok=True)

    timestamp = datetime.now().strftime("%Y_%m_%d_%H_%M_%S")
    image_filename = f"webcam_{timestamp}.png"

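    # read() returns a success flag along with the frame; check it before saving.
    success, frame = webcam.read()
    if not success:
        webcam.release()
        return utils.exit_program(
            status_code=1, message="There was an error capturing the image."
        )

    image_file_path_str = os.path.join(webcam_folder_path, image_filename)

    cv2.imwrite(image_file_path_str, frame)

    webcam.release()

    return Path(image_file_path_str)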

The get_available_webcam function checks for available webcams, assuming the first ten indices. If a webcam is successfully opened, it returns the corresponding cv2.VideoCapture object. 🎥 If no webcams are found, it exits the program with an error message.

The capture_webcam_image function captures an image from the available webcam. It first calls our previously written helper function get_available_webcam to get an available webcam. If successful, it creates a folder for saving images (if it doesn’t already exist), generates a timestamped filename, captures a frame, and saves the image. Finally, it releases the webcam and returns the path of the saved image. 🖼️

Implementing the Main Program Logic 😵‍💫

Now that we have coded all the utilities we will need for the project, let's start with the main program logic.

Inside the src directory, create a new file siri.py and add the following lines of code:

# 👇 siri-voice-llama3/src/siri.py

import os
import re
import time
from pathlib import Path
from typing import List

import google.generativeai as genai
import pyaudio
import pyttsx3
import speech_recognition as sr
from faster_whisper import WhisperModel
from groq import Groq
from groq.types.chat import ChatCompletionMessageParam
from gtts import gTTS
from openai import OpenAI
from PIL import Image
from pydub import AudioSegment
from pydub.playback import play

import utils
import webcam


class Siri:
    """
    A multi-modal AI voice assistant that responds to user prompts
    by processing voice commands and context from images or clipboard content.
    """

    def __init__(
        self,
        log_file_path: Path,
        project_root_folder_path: Path,
        groq_api_key: str,
        google_gen_ai_api_key: str,
        openai_api_key: str | None,
    ) -> None:
        """
        Initializes the Siri assistant with API clients for Groq, OpenAI, and Google Generative AI.

        Args:
            log_file_path (Path): Path to the log file.
            project_root_folder_path (Path): Root folder of the project.
            groq_api_key (str): API key for Groq.
            google_gen_ai_api_key (str): API key for Google Generative AI.
            openai_api_key (str | None): API key for OpenAI. Optional.
        """
        self.log_file_path = log_file_path
        self.project_root_folder_path = project_root_folder_path

        self.pyttsx3_engine = pyttsx3.init()

        self.groq_client = Groq(api_key=groq_api_key)
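        # The OpenAI key is optional, so only create the client when a key is set.
        self.openai_client = (
            OpenAI(api_key=openai_api_key) if openai_api_key else None
        )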

        # Configure Google Generative AI model
        genai_generation_config = genai.GenerationConfig(
            temperature=0.7, top_p=1, top_k=1, max_output_tokens=2048
        )
        genai.configure(api_key=google_gen_ai_api_key)

        self.genai_model = genai.GenerativeModel(
            "gemini-1.5-flash-latest",
            generation_config=genai_generation_config,
            safety_settings=[
                {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_NONE"},
                {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_NONE"},
                {
                    "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT",
                    "threshold": "BLOCK_NONE",
                },
                {
                    "category": "HARM_CATEGORY_DANGEROUS_CONTENT",
                    "threshold": "BLOCK_NONE",
                },
            ],
        )

        # Initialize conversation context for the AI
        self.conversation: List[ChatCompletionMessageParam] = [
            {
                "role": "user",
                "content": (
                    "You are a multi-modal AI voice assistant. Your user may have attached a photo (screenshot or webcam capture) "
                    "for context, which has already been processed into a detailed text prompt. This will be attached to their transcribed "
                    "voice input. Generate the most relevant and factual response by carefully considering all previously generated text "
                    "before adding new information. Do not expect or request additional images; use the provided context if available. "
                    "Please do not include newlines in your response. Keep it all in one paragraph. "
                    "Ensure your responses are clear, concise, and relevant to the ongoing conversation, avoiding any unnecessary verbosity."
                ),
            }
        ]

        total_cpu_cores = os.cpu_count() or 1

        # Initialize the audio transcription model
        self.audio_transcription_model = WhisperModel(
            device="cpu",
            compute_type="int8",
            model_size_or_path="base",
            # Use half of the cores, but never fewer than one.
            cpu_threads=max(1, total_cpu_cores // 2),
            num_workers=max(1, total_cpu_cores // 2),
        )

        # Initialize speech recognition components
        self.speech_recognizer = sr.Recognizer()
        self.mic_audio_source = sr.Microphone()
        self.wake_word = "siri"

The Siri class sets up a multi-modal AI voice assistant that not only handles voice commands but also processes images or clipboard content to add context.

It takes a few key parameters. The log_file_path and project_root_folder_path are used for logging conversations and for storing AI responses as mp3 files when using gTTS. You'll also need API keys for Groq, Google Generative AI, and optionally OpenAI. 🤖

The class sets up clients for Groq, OpenAI, and Google Generative AI. For Google GenAI, it uses the gemini-1.5-flash model and tweaks some of the safety settings.

An initial conversation prompt is built in to guide the AI on how to respond based on the user’s voice commands or any processed images. 💬

For audio transcription, it uses the Faster Whisper model, which runs on the CPU with specific settings for performance. It also sets up speech recognition using Recognizer and Microphone, with the assistant listening for the wake word "siri" to start taking commands.

Now that the initial configuration is done, let's add some methods to the class. We will define a few more to handle audio transcription, extract the user's prompt, listen for the prompt, and generate chat responses using Groq. 🛠️

Add the following methods in the siri.py file inside the src directory.

# 👇 siri-voice-llama3/src/siri.py

# Rest of the code...

def transcribe_audio_to_text(self, audio_file_path: Path) -> str:
    """
    Transcribes audio from a file to text.

    Args:
        audio_file_path (Path): Path to the audio file.

    Returns:
        str: The transcribed text from the audio.
    """

    segments, _ = self.audio_transcription_model.transcribe(
        audio=str(audio_file_path)
    )
    return "".join(segment.text for segment in segments)

def extract_prompt(self, transcribed_text: str) -> str | None:
    """
    Extracts the user's prompt from the transcribed text after the wake word.

    Args:
        transcribed_text (str): The transcribed text from audio input.

    Returns:
        str | None: The extracted prompt if found, otherwise None.
    """

    pattern = rf"\b{re.escape(self.wake_word)}[\s,.?!]*([A-Za-z0-9].*)"
    regex_match = re.search(
        pattern=pattern, string=transcribed_text, flags=re.IGNORECASE
    )

    if regex_match is None:
        return None

    return regex_match.group(1).strip()

def listen(self) -> None:
    """
    Starts listening for the wake word and processes audio input in the background.
    """

    with self.mic_audio_source as mic:
        self.speech_recognizer.adjust_for_ambient_noise(source=mic, duration=2)

    self.speech_recognizer.listen_in_background(
        source=self.mic_audio_source, callback=self.handle_audio_processing
    )

    while True:
        time.sleep(0.5)

def generate_chat_response_with_groq(
    self, prompt: str, image_context: str | None
) -> str:
    """
    Generates a response from the Groq model based on user input and optional image context.

    Args:
        prompt (str): The user's prompt.
        image_context (str | None): Optional image context for the response.

    Returns:
        str: The generated response from the assistant.
    """

    if image_context:
        prompt = f"USER_PROMPT: {prompt}\n\nIMAGE_CONTEXT: {image_context}"

    self.conversation.append({"role": "user", "content": prompt})

    completion = self.groq_client.chat.completions.create(
        messages=self.conversation, model="llama-3.1-8b-instant"
    )

    # Fall back to a default message so that None is never stored in the history.
    ai_response = (
        completion.choices[0].message.content
        or "Sorry, I'm not sure how to respond to that."
    )

    self.conversation.append({"role": "assistant", "content": ai_response})

    return ai_response

The method transcribe_audio_to_text takes the path to an audio file and transcribes its content into text. It uses the WhisperModel to process the audio file in segments and returns a string that concatenates the transcribed text from all segments. 🎧

The method extract_prompt extracts the user’s spoken prompt from the transcribed text, specifically after the wake word (e.g., "siri"). It uses a regular expression to find and capture the prompt following the wake word, returning the cleaned-up prompt or None if no prompt is found. 🗣️
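
To see how the wake-word pattern behaves, here is a small standalone sketch that uses the same regular expression outside the class. The module-level names WAKE_WORD and PATTERN and the sample sentences are assumptions made for this example.

```python
import re
from typing import Optional

WAKE_WORD = "siri"
# Same pattern as in the class: wake word, optional punctuation or spaces,
# then a prompt that must start with a letter or digit.
PATTERN = re.compile(
    rf"\b{re.escape(WAKE_WORD)}[\s,.?!]*([A-Za-z0-9].*)", re.IGNORECASE
)


def extract_prompt(transcribed_text: str) -> Optional[str]:
    """Return the text after the wake word, or None if there is no prompt."""
    match = PATTERN.search(transcribed_text)
    return match.group(1).strip() if match else None


examples = [
    "Hey Siri, what's the weather like?",
    "siri take a screenshot",
    "Just talking to myself here.",
    "Siri!",
]
results = [extract_prompt(text) for text in examples]
for text, result in zip(examples, results):
    print(f"{text!r} -> {result!r}")
```

One quirk worth knowing: the pattern has no word boundary after the wake word, so a word that merely starts with it also matches. For example, "Sirius is a star" yields "us is a star". Adding \b right after the wake word would be one way to tighten this.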

The method listen continuously listens for the wake word and processes audio input. It first adjusts for ambient noise, then starts listening in the background using a callback (handle_audio_processing). The method enters an infinite loop, pausing briefly in each iteration to keep listening. 🔄

The method generate_chat_response_with_groq generates a response using the Groq model based on the user's prompt and optional image context. It formats the prompt with the image context (if any), appends it to the ongoing conversation, and sends the full conversation to the model. It then appends the AI's response to the conversation and returns it, or a default message if no response is generated. 💬
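
The way the conversation history grows from turn to turn can be sketched without calling Groq at all. In the example below, fake_complete is a hypothetical stand-in for the API call, and chat_turn and build_user_content are illustrative helpers rather than methods from the tutorial.

```python
from typing import Callable, Dict, List, Optional

Message = Dict[str, str]


def build_user_content(prompt: str, image_context: Optional[str]) -> str:
    """Attach the image description to the prompt when one is available."""
    if image_context:
        return f"USER_PROMPT: {prompt}\n\nIMAGE_CONTEXT: {image_context}"
    return prompt


def chat_turn(
    conversation: List[Message],
    prompt: str,
    image_context: Optional[str],
    complete: Callable[[List[Message]], Optional[str]],
) -> str:
    """Run one user/assistant exchange and record both sides in the history."""
    conversation.append(
        {"role": "user", "content": build_user_content(prompt, image_context)}
    )
    reply = complete(conversation) or "Sorry, I'm not sure how to respond to that."
    conversation.append({"role": "assistant", "content": reply})
    return reply


def fake_complete(messages: List[Message]) -> Optional[str]:
    """Stand-in for the model call: reports how many messages it received."""
    return f"I have seen {len(messages)} message(s)."


history: List[Message] = []
first = chat_turn(
    history, "What is on my screen?", "A code editor with Python open.", fake_complete
)
second = chat_turn(history, "Thanks!", None, fake_complete)
print(first)
print(second)
```

Because the whole history is sent on every turn, the model sees earlier exchanges as context. The history also grows without bound, so a long-running assistant may eventually want to trim older messages.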

Text-to-Speech Generation 🗣️

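For text-to-speech generation, we will implement three different approaches: pyttsx3, OpenAI, and gTTS (Google Text-to-Speech). You are free to choose whichever approach fits your requirements. All three define a method with the same name, text_to_speech, so keep only one of them in the Siri class.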

Pyttsx3 Approach

Here, for text-to-speech generation, we will use pyttsx3, a popular Python module that works offline.

In the siri.py file, add the following method.

# 👇 siri-voice-llama3/src/siri.py

# Rest of the code...

# Pyttsx3 Approach (Weaker Audio Quality)
def text_to_speech(self, text: str) -> None:
    """
    Converts text to speech using Pyttsx3's text-to-speech API.

    Args:
        text (str): The text to convert to speech.
    """

    self.pyttsx3_engine.setProperty("volume", 1.0)
    self.pyttsx3_engine.setProperty("rate", 125)

    voices = self.pyttsx3_engine.getProperty("voices")

    # Use the first installed voice. Which voice this is depends on the platform.
    self.pyttsx3_engine.setProperty("voice", voices[0].id)

    self.pyttsx3_engine.say(text)
    self.pyttsx3_engine.runAndWait()

    self.pyttsx3_engine.stop()

The text_to_speech method uses the pyttsx3_engine we initialized inside the Siri class to set some of the properties for the engine and finally say the text that we provide to it.

OpenAI Approach

For this approach, we will be using OpenAI's audio speech API with streaming. It offers the best overall experience of the three approaches, but it requires an OpenAI API key and an account with some OpenAI credits.

In the siri.py file, add the following method.

# 👇 siri-voice-llama3/src/siri.py

# Rest of the code...

# OpenAI Approach (Best Quality Audio with multiple voice available).
def text_to_speech(self, text: str) -> None:
    """
    Converts text to speech using OpenAI's text-to-speech API.

    Args:
        text (str): The text to convert to speech.
    """

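    audio_interface = pyaudio.PyAudio()
    stream = audio_interface.open(
        format=pyaudio.paInt16, channels=1, rate=24000, output=True
    )
    stream_start = False

    try:
        with self.openai_client.audio.speech.with_streaming_response.create(
            model="tts-1", voice="nova", response_format="pcm", input=text
        ) as openai_response:
            # Chunks are raw bytes, so max(chunk) is an int in the range 0-255.
            silence_threshold = 0.1
            for chunk in openai_response.iter_bytes(chunk_size=1024):
                if stream_start:
                    stream.write(chunk)

                elif max(chunk) > silence_threshold:
                    stream.write(chunk)
                    stream_start = True
    finally:
        # Release the audio device once playback has finished.
        stream.stop_stream()
        stream.close()
        audio_interface.terminate()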

The text_to_speech method converts text into speech using OpenAI's text-to-speech (TTS) API.

It starts by opening an audio stream using PyAudio, which is configured to output audio at a sample rate of 24,000 Hz with 16-bit resolution. The method then calls the OpenAI TTS API with the provided text, specifying the model ("tts-1"), voice ("nova"), and response format ("pcm"). The audio data is streamed in real time. 🚀

You can change the voices if you like. For a list of available options, visit here.

Within the loop, the method checks the audio chunks returned by the OpenAI API. If the audio exceeds a certain silence threshold, the stream begins playing the audio chunks, ensuring the text is spoken only when meaningful sound is detected. This prevents the stream from starting with silence. 🔊
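
The method above compares raw byte values against the threshold. As an alternative sketch, the helpers below decode each chunk into signed 16-bit samples and skip chunks until one is loud enough. The names peak_amplitude and skip_leading_silence, the threshold of 500, and the little-endian sample layout are all assumptions made for illustration.

```python
import struct
from typing import Iterable, Iterator


def peak_amplitude(chunk: bytes) -> int:
    """Return the largest absolute sample value in a 16-bit little-endian chunk."""
    usable = len(chunk) - (len(chunk) % 2)  # Ignore a trailing odd byte.
    samples = struct.iter_unpack("<h", chunk[:usable])
    return max((abs(sample[0]) for sample in samples), default=0)


def skip_leading_silence(
    chunks: Iterable[bytes], threshold: int = 500
) -> Iterator[bytes]:
    """Drop quiet chunks at the start, then pass every later chunk through."""
    started = False
    for chunk in chunks:
        if not started and peak_amplitude(chunk) <= threshold:
            continue
        started = True
        yield chunk


# Synthetic audio: two silent chunks, one loud chunk, then silence again.
silent = struct.pack("<4h", 0, 0, 0, 0)
loud = struct.pack("<4h", 0, 1200, -3000, 10)
played = list(skip_leading_silence([silent, silent, loud, silent]))
print(len(played), peak_amplitude(loud))
```

Only the leading silence is removed. Quiet chunks after playback has started are kept, so natural pauses in the speech are preserved.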

gTTS Approach

For this approach, we will be using the Google Text-to-Speech engine. This approach is fairly slow, and it requires saving the AI's response as an mp3 file and then playing that audio file.

In the siri.py file, add the following method.

# 👇 siri-voice-llama3/src/siri.py

# Rest of the code...

def text_to_speech(self, text: str) -> None:
    """
    Converts text to speech using Google's text-to-speech API.

    Args:
        text (str): The text to convert to speech.
    """

    tts = gTTS(text=text, lang="en", slow=False)

    response_folder_path = Path(
        os.path.abspath(
            os.path.join(self.project_root_folder_path, "data", "ai_response")
        )
    )

    os.makedirs(response_folder_path, exist_ok=True)

    response_audio_file_path = Path(
        os.path.join(response_folder_path, "ai_response_audio.mp3")
    )

    tts.save(response_audio_file_path)

    response_audio = AudioSegment.from_mp3(response_audio_file_path)
    play(response_audio)

    # After the audio is played, delete the audio file.
    if os.path.exists(response_audio_file_path):
        os.remove(response_audio_file_path)

The text_to_speech method converts text to speech using Google's TTS API. It first generates speech from the given text in English, with slow=False so the text is read at normal speed rather than slowly. The method then builds the path to the "data/ai_response" directory, where the response audio file is stored. After ensuring the directory exists, it saves the speech as an mp3 file.

Once the mp3 file is saved, it loads the audio using AudioSegment and plays it. After playing the audio, the method deletes the mp3 file to clean up.

Now that we've worked on the text_to_speech method, we need to write a few more methods that analyze the image prompt when the user attaches image context, select the relevant assistant action, and process the audio to take the appropriate action.

Add the following code to the siri.py file in the src directory.

# 👇 siri-voice-llama3/src/siri.py

# Rest of the code...

def analyze_image_prompt(self, prompt: str, image_path: Path) -> str:
    """
    Analyzes an image based on the user prompt to extract semantic information.

    Args:
        prompt (str): The user's prompt related to the image.
        image_path (Path): The path to the image file.

    Returns:
        str: The analysis result from the image based on the prompt.
    """

    image = Image.open(image_path)
    prompt = (
        "You are an image analysis AI tasked with extracting semantic meaning from images to assist another AI in "
        "generating a user response. Your role is to analyze the image based on the user's prompt and provide all relevant, "
        "objective data without directly responding to the user. Focus solely on interpreting the image in the context of "
        f"the user’s request and relay that information for further processing. \nUSER_PROMPT: {prompt}"
    )
    genai_response = self.genai_model.generate_content([prompt, image])
    return genai_response.text


def select_assistant_action(self, prompt: str) -> str:
    """
    Determines the appropriate action for the assistant to take based on user input.

    Args:
        prompt (str): The user's prompt.

    Returns:
        str: The selected action for the assistant.
    """

    system_prompt_message = (
        "You are an AI model tasked with selecting the most appropriate action for a voice assistant. Based on the user's prompt, "
        "choose one of the following actions: ['extract clipboard', 'take screenshot', 'delete screenshot', 'capture webcam', 'generic']. "
        "Assume the webcam is a standard laptop webcam facing the user. Provide only the action without explanations or additional text. "
        "Respond strictly with the most suitable option from the list."
    )
    function_conversation: List[ChatCompletionMessageParam] = [
        {"role": "system", "content": system_prompt_message},
        {"role": "user", "content": prompt},
    ]

    completion = self.groq_client.chat.completions.create(
        messages=function_conversation, model="llama-3.1-8b-instant"
    )

    ai_response = completion.choices[0].message.content

    return ai_response or "generic"


def handle_audio_processing(self, recognizer: sr.Recognizer, audio: sr.AudioData):
    """
    Callback function to process audio input once recognized.

    Args:
        recognizer (sr.Recognizer): The speech recognizer instance.
        audio (sr.AudioData): The audio data captured by the microphone.
    """

    data_folder_path = Path(os.path.abspath(os.path.join(".", "data")))
    os.makedirs(data_folder_path, exist_ok=True)

    audio_prompt_file_path = Path(
        os.path.abspath(os.path.join(data_folder_path, "user_audio_prompt.wav"))
    )
    with open(audio_prompt_file_path, "wb") as f:
        f.write(audio.get_wav_data())

    transcribed_text = self.transcribe_audio_to_text(
        audio_file_path=audio_prompt_file_path
    )
    parsed_prompt = self.extract_prompt(transcribed_text=transcribed_text)

    if parsed_prompt:
        utils.log_chat_message(
            log_file_path=self.log_file_path, user_message=parsed_prompt
        )
        skip_response = False

        selected_assistant_action = self.select_assistant_action(
            prompt=parsed_prompt
        )

        if "capture webcam" in selected_assistant_action:
            image_path = webcam.capture_webcam_image()
            image_analysis_result = self.analyze_image_prompt(
                prompt=parsed_prompt, image_path=image_path
            )

        elif "take screenshot" in selected_assistant_action:
            image_path = utils.capture_screenshot()
            image_analysis_result = self.analyze_image_prompt(
                prompt=parsed_prompt, image_path=image_path
            )

        elif "delete screenshot" in selected_assistant_action:
            utils.remove_last_screenshot()
            image_analysis_result = None
            ai_response = "Screenshot deleted successfully."
            self.text_to_speech(text=ai_response)

            utils.log_chat_message(
                log_file_path=self.log_file_path, ai_message=ai_response
            )

            skip_response = True

        elif "extract clipboard" in selected_assistant_action:
            clipboard_content = utils.get_clipboard_text()
            parsed_prompt = (
                f"{parsed_prompt}\n\nCLIPBOARD_CONTENT: {clipboard_content}"
            )
            image_analysis_result = None

        else:
            image_analysis_result = None

        # If the response is not supposed to be skipped, then generate the response and speak it out.
        if not skip_response:
            response = self.generate_chat_response_with_groq(
                prompt=parsed_prompt, image_context=image_analysis_result
            )
            utils.log_chat_message(
                log_file_path=self.log_file_path, ai_message=response
            )
            self.text_to_speech(text=response)

    # Remove the user prompt audio after the response is generated.
    if os.path.exists(audio_prompt_file_path):
        os.remove(audio_prompt_file_path)

The analyze_image_prompt method analyzes an image based on a user’s prompt to extract semantic information. It starts by opening the specified image file using the PIL library. The method then constructs a prompt that instructs the image analysis AI to focus on extracting relevant data from the image without responding directly to the user.

The method sends the constructed prompt and the image to the Google Generative AI model for processing. Finally, it returns the image analysis result as text. 📝

The select_assistant_action method determines the appropriate action for the assistant based on the user's input. 🤔 It starts by creating a system prompt that instructs the AI model to choose from a predefined list of actions: 'extract clipboard', 'take screenshot', 'delete screenshot', 'capture webcam', or 'generic'.

Next, the method constructs a conversation list that includes the system prompt and the user’s prompt. It then sends this conversation to the Groq client to generate a response using the specified model. The response should be one of the items from the predefined list of actions; if the model returns nothing, the method falls back to "generic".
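The model is only asked to answer with one of the listed actions, so it can still reply with extra quotes, punctuation, or different casing. Here is a minimal sketch of how the raw reply could be mapped back onto a known action. The normalize_action helper and the KNOWN_ACTIONS tuple are hypothetical names for illustration and are not part of the project code.

```python
from typing import Optional

# Action names copied from the system prompt in select_assistant_action.
KNOWN_ACTIONS = (
    "extract clipboard",
    "take screenshot",
    "delete screenshot",
    "capture webcam",
    "generic",
)


def normalize_action(raw_response: Optional[str]) -> str:
    """Hypothetical helper: map a raw model reply onto one known action."""
    if not raw_response:
        # Empty or missing reply: fall back to the catch-all action.
        return "generic"

    cleaned = raw_response.strip().lower()
    for action in KNOWN_ACTIONS:
        # Substring match tolerates quotes, trailing periods, and so on.
        if action in cleaned:
            return action

    # Anything unrecognized is treated as a plain chat request.
    return "generic"


print(normalize_action("'Take Screenshot'."))
print(normalize_action(None))
```

This follows the same idea as the substring checks in handle_audio_processing, which is why a reply with stray characters around the action name still ends up in the right branch.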

The handle_audio_processing method processes audio input after it’s recognized by the assistant. First, it saves the captured audio as a .wav file in the "data" folder. It then transcribes the audio to text using the transcribe_audio_to_text method and extracts the user’s prompt from the text with extract_prompt.

If a prompt is found, it logs the user’s message and determines the appropriate assistant action using select_assistant_action. Depending on the action, it may capture a webcam image, take a screenshot, delete a screenshot, or extract clipboard content. For image-based actions, it uses analyze_image_prompt to analyze the image. 🔍

The skip_response variable is used to control whether the assistant should skip generating and speaking a response after performing certain actions. It is initially set to False, meaning the response generation is expected.

For example, when the action is "delete screenshot," the method deletes the screenshot and directly provides a pre-defined response ("Screenshot deleted successfully.") via text-to-speech. In this case, skip_response is set to True to prevent the assistant from generating a separate response for the user prompt, as the action itself is sufficient. ✅
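To see the skip_response flag in isolation, here is a stripped-down sketch of the same control flow with the model and the speaker replaced by stand-ins. The respond function, fake_reply, and the spoken list are hypothetical and exist only for this illustration.

```python
from typing import Callable, List, Optional


def respond(
    action: str,
    prompt: str,
    generate_reply: Callable[[str, Optional[str]], str],
    speak: Callable[[str], None],
) -> None:
    """Speak either a fixed confirmation or a generated reply, never both."""
    skip_response = False
    image_context: Optional[str] = None

    if "delete screenshot" in action:
        # The action is its own answer, so speak a fixed confirmation.
        speak("Screenshot deleted successfully.")
        skip_response = True
    elif "capture webcam" in action or "take screenshot" in action:
        # Placeholder for the text that analyze_image_prompt would return.
        image_context = "stand-in image analysis"

    if not skip_response:
        speak(generate_reply(prompt, image_context))


def fake_reply(prompt: str, context: Optional[str]) -> str:
    # Stand-in for generate_chat_response_with_groq.
    return f"reply to {prompt!r} (context: {context})"


spoken: List[str] = []
respond("delete screenshot", "remove that screenshot", fake_reply, spoken.append)
respond("generic", "tell me a joke", fake_reply, spoken.append)
print(spoken)
```

Each call adds exactly one entry to spoken: the fixed confirmation for the first call and a generated reply for the second.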

For other actions, it generates a response using the generate_chat_response_with_groq method and converts the response to speech. After the response is generated, the method deletes the user's prompt audio file. 🚮
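One thing to keep in mind: the cleanup only runs if nothing raises along the way. If you want the audio file removed even when transcription or the API call fails, a try/finally block is one option. This is a sketch under that assumption, not how the project code is written; process_with_temp_audio and the handle callback are hypothetical names.

```python
import tempfile
from pathlib import Path
from typing import Callable


def process_with_temp_audio(
    wav_bytes: bytes, handle: Callable[[Path], str], folder: Path
) -> str:
    """Write the audio to disk, process it, and always delete it afterwards."""
    folder.mkdir(parents=True, exist_ok=True)
    audio_path = folder / "user_audio_prompt.wav"
    audio_path.write_bytes(wav_bytes)

    try:
        return handle(audio_path)
    finally:
        # Runs whether handle() returned normally or raised.
        if audio_path.exists():
            audio_path.unlink()


with tempfile.TemporaryDirectory() as tmp_dir:
    data_folder = Path(tmp_dir) / "data"
    # Stand-in for transcription: just report the file size in bytes.
    result = process_with_temp_audio(
        b"RIFFdemo", lambda path: str(path.stat().st_size), data_folder
    )
    leftover = (data_folder / "user_audio_prompt.wav").exists()

print(result, leftover)
```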

Writing the `main.py` File 🧑‍💻

This is going to be the entry point for our program. It performs the setup and initialization needed for the assistant to function.

Create a new file called main.py inside the root of the project and add the following lines of code:

# 👇 siri-voice-llama3/main.py

import os
import sys
from pathlib import Path

# Add the src directory to the module search path
sys.path.insert(0, os.path.join(os.path.dirname(os.path.abspath(__file__)), "src"))

from src import setup, siri, utils

"""
Main entry point for the AI llama3 siri voice assistant.

This script loads the necessary API credentials from environment variables,
initializes the Siri assistant with the provided keys, and starts listening
for user input. The program will exit if any of the required API keys are
missing.

To run the application, execute this script in an environment where the
`.env` file is properly configured with the required API keys.
"""

if __name__ == "__main__":
    # Determine the current directory of the script
    project_root_folder_path = Path(os.path.dirname(os.path.abspath(__file__)))

    chat_log_file_path = utils.get_log_file_for_today(
        project_root_folder_path=project_root_folder_path
    )

    all_api_keys = setup.get_credentials()
    groq_api_key, google_gen_ai_api_key, openai_api_key = all_api_keys

    siri = siri.Siri(
        log_file_path=chat_log_file_path,
        project_root_folder_path=project_root_folder_path,
        groq_api_key=groq_api_key,
        google_gen_ai_api_key=google_gen_ai_api_key,
        openai_api_key=openai_api_key,
    )

    siri.listen()

👀 Notice that we are inserting the path to our src directory using the sys.path.insert() method to ensure that Python can locate and import modules from the src directory.
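If you are curious what sys.path.insert() actually changes, here is a small self-contained sketch. It creates a throwaway src folder in a temporary directory, so the demo_helper module and its GREETING constant are made up for the demonstration and have nothing to do with the real project files.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical module standing in for a file inside src/.
    module_dir = Path(tmp_dir) / "src"
    module_dir.mkdir()
    (module_dir / "demo_helper.py").write_text("GREETING = 'hello from src'\n")

    # Position 0 means this folder is searched before every other entry.
    sys.path.insert(0, str(module_dir))
    importlib.invalidate_caches()
    try:
        demo_helper = importlib.import_module("demo_helper")
        greeting = demo_helper.GREETING
    finally:
        # Undo the changes so the demo leaves no trace behind.
        sys.path.remove(str(module_dir))
        sys.modules.pop("demo_helper", None)

print(greeting)
```

Without the sys.path.insert() call, the import would fail with ModuleNotFoundError, because Python has no reason to look inside that folder.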

The main block starts by determining the project root folder, then gets the daily log file path using utils.get_log_file_for_today to log the chat messages.

Next, we retrieve the API keys (for Groq, Google Generative AI, and OpenAI) using the setup.get_credentials function that we wrote earlier along with the other helper functions.
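As a reminder of the "exit if a key is missing" behavior mentioned in the docstring, here is a rough sketch of that kind of check. It is not the setup.get_credentials code itself: read_credentials is a hypothetical function, and the environment variable names are assumptions, so use whatever names your .env file defines.

```python
import sys
from typing import Mapping, Tuple

# Assumed variable names, for illustration only.
REQUIRED_KEYS = (
    "GROQ_API_KEY",
    "GOOGLE_GENERATIVE_AI_API_KEY",
    "OPENAI_API_KEY",
)


def read_credentials(env: Mapping[str, str]) -> Tuple[str, str, str]:
    """Return the three API keys, or exit with a message naming the missing ones."""
    missing = [name for name in REQUIRED_KEYS if not env.get(name)]
    if missing:
        # sys.exit with a string prints it to stderr and exits with status 1.
        sys.exit(f"Missing required API keys: {', '.join(missing)}")

    groq_key, google_key, openai_key = (env[name] for name in REQUIRED_KEYS)
    return groq_key, google_key, openai_key


demo_env = {
    "GROQ_API_KEY": "groq-demo",
    "GOOGLE_GENERATIVE_AI_API_KEY": "google-demo",
    "OPENAI_API_KEY": "openai-demo",
}
print(read_credentials(demo_env))
```

In a real run you would pass os.environ after loading the .env file. Returning the keys in a fixed order is what lets main.py unpack them into three variables on one line.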

We then create an instance of the Siri class, passing the log file path, project root folder path, and API keys.

Finally, the siri.listen method is called, which starts the assistant and listens for user input.

By now, you should have a working version of your own Voice Assistant. 🥂

Optional: Building a Shell Script 🧰

🤔 Why is there a need to write a shell script?

Well, there is no need to. I wrote this shell script thinking that I could automate running the assistant on system reboot through scheduling tools like Linux services or cron jobs. However, I couldn't get that to work: the program needs access to hardware components (like the mic), so it never spoke out the response (do let me know if you find a fix). Still, this shell script comes in really handy if you want to automate all the manual steps, like creating a virtual environment, installing dependencies, and finally running the program. You could also add the script to your PATH via a symlink and run it from anywhere on your system. 😉

Create a new file start_siri_llama3.sh inside the src/scripts directory with the following lines of code:

💁 If you are using the fish shell, you can find the same code with the fish syntax here. Create a new file called start_siri_llama3.fish inside the src/scripts directory, and add the code from the link.

# 👇 siri-voice-llama3/src/scripts/start_siri_llama3.sh

#!/usr/bin/env bash
# Writing the shebang this way can have some security concerns.
# See this Stack Overflow thread: https://stackoverflow.com/a/72332845
# I want this script to be portable for most users, so I use env instead of hardcoding a path like '#!/usr/bin/bash'.

# Determine the script directory, virtual environment directory, and log file.
# These are set first because some of the error messages below expand them.
SCRIPT_DIR="$(dirname "$(realpath "$0")")"
VENV_DIR="$(realpath "$SCRIPT_DIR/../../.venv")"
LOG_FILE="$(realpath "$SCRIPT_DIR/../../logs/shell-error-bash.log")"
REQUIREMENTS_FILE_PATH="$(realpath "$SCRIPT_DIR/../../requirements.txt")"

ERROR_USAGE="ERROR: Usage: start_siri_llama3.sh {path_to_main.py}"
ERROR_FILE_NOT_FOUND="ERROR: The main.py file does not exist or is not a valid file."
ERROR_PYTHON_NOT_FOUND="ERROR: No suitable Python executable found."
ERROR_BASH_NOT_INSTALLED="ERROR: Bash shell is not installed. Please install Bash."
ERROR_ACTIVATE_NOT_FOUND="ERROR: activate file not found in '$VENV_DIR/bin'"
ERROR_UNSUPPORTED_SHELL="ERROR: Unsupported shell: '$SHELL'"
ERROR_REQUIREMENTS_NOT_FOUND="ERROR: requirements.txt file not found at '$REQUIREMENTS_FILE_PATH'"

log_and_exit() {
  local message="$1"

  echo "[$(date +"%Y-%m-%d %H:%M:%S")] $message" | tee -a "$LOG_FILE"
  exit 1
}

# Check if the main.py file is provided as an argument
if [ $# -ne 1 ]; then
  log_and_exit "$ERROR_USAGE"
fi

# Function to check if a file exists and has the correct extension
check_file() {
    local file_path="$1"
    local expected_extension="$2"

    if [ ! -f "$file_path" ]; then
        log_and_exit "$ERROR_FILE_NOT_FOUND"
    fi

    if ! [[ "$file_path" == *".$expected_extension" ]]; then
        log_and_exit "The file '$file_path' must be a '.$expected_extension' file."
    fi
}

# Validate the provided main.py file
check_file "$1" "py"

# Extract and validate arguments
MAIN_FILE_PATH="$(realpath "$1")"

# Find the appropriate Python executable
PYTHON_EXEC="$(command -v python3 || command -v python)"

# Ensure that the Python executable is available before creating the virtual environment
if [ ! -d "$VENV_DIR" ]; then
    if [ -z "$PYTHON_EXEC" ]; then
        log_and_exit "$ERROR_PYTHON_NOT_FOUND"
    fi

    "$PYTHON_EXEC" -m venv "$VENV_DIR"

    # Activate the virtual environment after creating it
    if [ -f "$VENV_DIR/bin/activate" ]; then
        source "$VENV_DIR/bin/activate"
    else
        log_and_exit "$ERROR_ACTIVATE_NOT_FOUND"
    fi

    # No spaces around '=' in a Bash assignment
    PIP_EXEC_VENV="$(command -v pip3 || command -v pip)"

    # Check if requirements.txt exists and install dependencies
    if [ -f "$REQUIREMENTS_FILE_PATH" ]; then
        "$PIP_EXEC_VENV" install -r "$REQUIREMENTS_FILE_PATH"
    else
        log_and_exit "$ERROR_REQUIREMENTS_NOT_FOUND"
    fi
fi

# Ensure that the Bash shell is installed.
if ! command -v bash &> /dev/null; then
    log_and_exit "$ERROR_BASH_NOT_INSTALLED"
fi

# Activate the virtual environment based on the shell type
if [[ "$SHELL" == *"/bash" ]]; then
    # Check if the activate file exists before sourcing it
    if [ -f "$VENV_DIR/bin/activate" ]; then
        source "$VENV_DIR/bin/activate"
    else
        log_and_exit "$ERROR_ACTIVATE_NOT_FOUND"
    fi
else
    log_and_exit "$ERROR_UNSUPPORTED_SHELL"
fi

# Set the python executable to the one from the virtual environment
PYTHON_EXEC="$(command -v python3 || command -v python)"

# Run the main.py file
"$PYTHON_EXEC" "$MAIN_FILE_PATH"

This script is designed to automate the setup and execution of a Python program, ensuring the necessary environment is prepared before running main.py. ⚙️

First, it checks whether a valid Python file (main.py) is passed as an argument. If not, it logs an error and exits. It also verifies the file exists and has the correct extension (.py). 🐍

The script then searches for a Python executable (python3 or python), and if a virtual environment (venv) doesn't exist, it creates one using Python's venv module. Once the venv is created, it activates it, and installs dependencies from requirements.txt if found. The script ensures that both Python and Bash are installed on the system, as it only supports the Bash shell.

If the user's shell isn't Bash, it logs an error and exits. Otherwise, it runs the Python script (main.py) inside the virtual environment using the Python executable found.

Now, to actually be able to run this script from anywhere on your system, you can add it to your PATH with a symlink. 🔗

Usually, /usr/local/bin is the place where we add our custom-built scripts. First, make sure that it is in your PATH by running the command:

echo $PATH

If it is not, add it to your PATH. Then you can add this script as a symlink in /usr/local/bin with the following command:

ln -s {absolute_path_to_script_sh/fish} /usr/local/bin/start_siri_llama3

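After running this command, you should be able to run this program from anywhere on your system by passing the path to main.py as its only argument. 🎉 The script file needs execute permission (chmod +x) for this to work, and depending on your system you may need sudo to create the link in /usr/local/bin.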
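The symlink works because the script resolves its own location with realpath before building the paths to .venv, logs, and requirements.txt. Here is a small Python sketch of the same idea, using os.path.realpath on a symlink created in a temporary directory. The folder layout is a stand-in for the real project, and it assumes a system where creating symlinks is allowed.

```python
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    root = Path(tmp_dir).resolve()

    # Stand-in for the real script inside the project.
    script = root / "project" / "src" / "scripts" / "start_siri_llama3.sh"
    script.parent.mkdir(parents=True)
    script.write_text("#!/usr/bin/env bash\n")

    # Stand-in for /usr/local/bin with a symlink to the script.
    bin_dir = root / "bin"
    bin_dir.mkdir()
    link = bin_dir / "start_siri_llama3"
    link.symlink_to(script)

    # Following the link lands in src/scripts, not in the bin folder.
    resolved_dir = Path(os.path.realpath(link)).parent
    points_at_scripts = resolved_dir == script.parent
    naive_dir_is_bin = link.parent == bin_dir

print(points_at_scripts, naive_dir_is_bin)
```

If the script used the directory of the link itself, it would look for .venv relative to the bin folder and fail, which is why resolving the link first matters.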

Conclusion ⚡

Wow! 😮‍💨 We've done a lot together! If you've made it this far, give yourself a well-deserved pat on the back. By now, you’ve successfully built a personal SIRI voice assistant using the LLAMA-3 AI model.

The entire documented source code for this article is available here:

https://github.com/shricodev/siri-voice-llama3.git

Thank you so much for reading! 🎉 🫡

Drop your thoughts in the comment section below. 👇

Follow me on Socials 🐥: https://linktr.ee/shricodev

Build your own personal SIRI with LLAMA-3 like a PRO! 🧙‍♂️ 🪄

Shrijal Acharya
