Building a Real-Time Object Detection and Text Recognition System with YOLOS, TTS, and OCR

Shivani Yadav

In the ever-evolving world of artificial intelligence, integrating computer vision with real-time functionalities is opening doors to smarter, interactive applications. In this blog, we will build a Real-Time Object Detection and Text Recognition System using Python libraries such as Hugging Face's Transformers, OpenCV, Pyttsx3 for Text-to-Speech (TTS), and EasyOCR. This project combines object detection and optical character recognition (OCR) with voice feedback for a highly interactive user experience.


Key Components of the System

  • YOLOS (You Only Look at One Sequence): For real-time object detection.

  • EasyOCR: For extracting text from images.

  • Pyttsx3: To convert text into speech for feedback.

  • OpenCV: For capturing webcam frames and visualization.

  • Threading: To enable asynchronous speech synthesis without blocking the main execution.


Prerequisites

Before diving into the code, ensure you have the following Python libraries installed. Note that you need the standard opencv-python build rather than opencv-python-headless, because the script opens a display window with cv2.imshow:

pip install transformers pillow torch torchvision pyttsx3 easyocr opencv-python
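
To verify the environment before running the full script, a quick sanity check like this (my addition, not part of the original post) confirms the core libraries import cleanly and reports whether PyTorch can see a GPU:

import cv2
import torch
import transformers

print("transformers:", transformers.__version__)
print("OpenCV:", cv2.__version__)
print("CUDA available:", torch.cuda.is_available())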

The Code: A Deep Dive

Here's the complete implementation of the real-time detection and recognition system:


1. Import Required Libraries

from transformers import YolosImageProcessor, YolosForObjectDetection
from PIL import Image
import torch
import cv2
import pyttsx3  # For TTS
import easyocr   # For OCR
import threading  # To run TTS asynchronously
  • Transformers: Imports the YOLOS model (YolosForObjectDetection) and its corresponding image processor (YolosImageProcessor).

  • Pillow (PIL): Used to handle image conversion from OpenCV format.

  • Torch: Enables GPU acceleration for YOLOS operations, making the system faster.

  • OpenCV: Captures webcam input and handles visualization (e.g., drawing bounding boxes).

  • Pyttsx3: Provides a text-to-speech engine for audible feedback.

  • EasyOCR: Handles text recognition in real-time.

  • Threading: Ensures that TTS operations run asynchronously, preventing lag in detection.


2. Initialize Models and Devices

# Initialize YOLOS model and image processor
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # Use GPU if available
model = YolosForObjectDetection.from_pretrained('hustvl/yolos-tiny').to(device)
image_processor = YolosImageProcessor.from_pretrained('hustvl/yolos-tiny')

# Initialize TTS engine
tts = pyttsx3.init()
tts.setProperty('rate', 150)  # Set speech rate

# Initialize OCR reader
reader = easyocr.Reader(['en'])  # English language OCR
  • Device Selection: Automatically switches to GPU if available; otherwise, uses the CPU.

  • YOLOS Initialization: The yolos-tiny model, pre-trained on COCO, is lightweight enough for efficient real-time object detection.

  • Image Processor: Prepares images for YOLOS by resizing, normalizing, and converting them into tensors (see the sketch after this list).

  • TTS Engine: Configures pyttsx3 to provide voice feedback with a speech rate of 150 words per minute.

  • OCR Reader: Initializes EasyOCR to recognize English text.
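
To see concretely what the image processor does, here is a small sketch (my addition, assuming the processor from above is initialized) that runs a stand-in image through it and inspects the output tensor:

from PIL import Image

sample = Image.new("RGB", (640, 480))  # Stand-in for a webcam frame

inputs = image_processor(images=sample, return_tensors="pt")
# The processor resizes and normalizes the image into a batched float
# tensor, e.g. torch.Size([1, 3, H, W])
print(inputs["pixel_values"].shape)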


3. Access the Webcam

cap = cv2.VideoCapture(0)

# State variables for mode toggling and speech control
mode = "object_detection"  # Default mode
tts_thread = None  # To manage TTS in a separate thread
stop_tts = False  # Flag to stop speech
  • Webcam Access: cv2.VideoCapture(0) initializes the first available webcam.

  • State Variables:

    • mode: Toggles between "object_detection" and "text_recognition".

    • tts_thread: Manages TTS operations in a separate thread to prevent overlaps.

    • stop_tts: Used to stop ongoing speech.


4. Function to Handle Text-to-Speech Asynchronously

def speak(text):
    global tts_thread, stop_tts
    if tts_thread and tts_thread.is_alive():
        return  # Prevent overlapping speech
    stop_tts = False

    def _run():
        tts.say(text)
        tts.runAndWait()  # Blocks this worker thread only, not the main loop

    tts_thread = threading.Thread(target=_run, daemon=True)
    tts_thread.start()

def stop_speech():
    global stop_tts
    stop_tts = True  # Signal any stop-aware speech loop (see the note below)
    tts.stop()       # Ask pyttsx3 to cut off queued speech
  • speak: Queues the text on a background daemon thread; if speech is already in progress, the new request is simply dropped rather than overlapped.

  • Threading: Allows TTS to run independently of the main detection loop, preventing lag.

  • stop_speech: Interrupts the TTS engine when the user requests it (e.g., by pressing a shortcut key). Note that the stop_tts flag only matters if the speaking code checks it, as in the sketch below.
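
As written, the stop_tts flag is set but never read by the speaking thread, so interruption relies entirely on tts.stop(). One way to make the flag meaningful is to speak word by word and check it between words. A sketch of that variant (my addition, not part of the original code):

def speak_interruptible(text):
    global tts_thread, stop_tts
    if tts_thread and tts_thread.is_alive():
        return  # Prevent overlapping speech
    stop_tts = False

    def _run():
        for word in text.split():
            if stop_tts:  # Set by stop_speech(); abort mid-sentence
                break
            tts.say(word)
            tts.runAndWait()

    tts_thread = threading.Thread(target=_run, daemon=True)
    tts_thread.start()

Speaking word by word sounds slightly choppier, but it lets stop_speech() take effect almost immediately instead of only between full announcements.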


5. Main Loop for Detection and Recognition

while True:
    ret, frame = cap.read()
    if not ret:
        break

    # Convert frame to PIL image
    frame_pil = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
  • Frame Capture: Reads webcam input frame by frame using cap.read().

  • Pillow Conversion: Converts OpenCV frames (BGR format) to RGB for compatibility with YOLOS.


6. Object Detection Mode

if mode == "object_detection":
    inputs = image_processor(images=frame_pil, return_tensors="pt").to(device)
    outputs = model(**inputs)

    results = image_processor.post_process_object_detection(
        outputs, threshold=0.9, 
        target_sizes=torch.tensor([frame_pil.size[::-1]]).to(device)
    )[0]

    detected_objects = []
    for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
        box = [round(i) for i in box.tolist()]
        object_name = model.config.id2label[label.item()]
        detected_objects.append(object_name)
        cv2.rectangle(frame, (box[0], box[1]), (box[2], box[3]), (0, 255, 0), 2)
        text = f"{object_name}: {round(score.item(), 2)}"
        cv2.putText(frame, text, (box[0], box[1] - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)

    if detected_objects:
        speak(", ".join(detected_objects))
  • Image Processing: Converts the frame into tensors for YOLOS input.

  • Post-Processing: Extracts detections above the 0.9 confidence threshold, along with their labels and bounding boxes; numeric label IDs map to class names via model.config.id2label (see the sketch after this list).

  • Drawing Bounding Boxes: Uses OpenCV to visualize detected objects on the frame.

  • Speech Feedback: Announces detected objects using TTS.
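
If you are curious which class names the detector can announce, they live in the checkpoint's id2label mapping (COCO-style classes for hustvl/yolos-tiny). A quick inspection sketch, assuming the model from step 2 is loaded:

# Map of numeric label IDs to class names, e.g. 'person', 'bicycle', 'car'
labels = model.config.id2label
print(len(labels), "label entries")
print([labels[i] for i in sorted(labels)][:10])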


7. Text Recognition Mode

# (Also inside the main while loop)
elif mode == "text_recognition":
    ocr_results = reader.readtext(frame)  # EasyOCR accepts OpenCV's numpy frames directly
    recognized_text = " ".join([res[1] for res in ocr_results])

    if recognized_text:
        cv2.putText(frame, recognized_text, (50, 50), cv2.FONT_HERSHEY_SIMPLEX, 1, (255, 0, 0), 2)
        speak(recognized_text)
  • OCR Execution: Extracts text from the webcam frame using EasyOCR; each result is a (bounding box, text, confidence) triple (see the sketch after this list).

  • Text Display: Displays recognized text on the video feed.

  • Speech Feedback: Announces the recognized text via TTS.
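
Because each entry that reader.readtext returns is a (bounding box, text, confidence) triple, you can filter out low-confidence reads before speaking them. A small sketch of that idea; the 0.5 threshold is an arbitrary choice of mine, not part of the original code:

ocr_results = reader.readtext(frame)

# Keep only reasonably confident detections before building the sentence
recognized_text = " ".join(
    text for (bbox, text, conf) in ocr_results if conf > 0.5
)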


8. Keyboard Shortcuts for Mode Switching

# (Still inside the main while loop)
cv2.putText(frame, "'o' - Object Detection | 'r' - Text Recognition | 's' - Stop Speech | 'q' - Quit", (10, 20),
            cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 255, 255), 1)
cv2.imshow("Real-Time Detection", frame)  # Display the annotated frame; without this, no window appears and key presses are never received

key = cv2.waitKey(1) & 0xFF
if key == ord('o'):
    mode = "object_detection"
elif key == ord('r'):
    mode = "text_recognition"
elif key == ord('s'):
    stop_speech()
elif key == ord('q'):
    break
  • Key Bindings:

    • 'o': Switches to object detection mode.

    • 'r': Switches to text recognition mode.

    • 's': Stops ongoing speech.

    • 'q': Exits the program.


9. Cleanup

cap.release()
cv2.destroyAllWindows()
  • Resource Release: Closes the webcam and destroys all OpenCV windows to ensure a clean exit. An optional TTS cleanup step is sketched below.
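
If you want the program to finish speaking before it exits, you can also wait briefly for the TTS thread; this is an optional addition, not part of the original code:

# Give any in-flight speech a moment to finish before exiting
if tts_thread and tts_thread.is_alive():
    tts_thread.join(timeout=5)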

Highlights of the System

  1. Object Detection: Leverages YOLOS to identify objects in real time, display bounding boxes, and provide audible feedback.

  2. Text Recognition: Utilizes EasyOCR to extract and vocalize text from webcam frames.

  3. Voice Feedback: Asynchronous TTS ensures non-blocking interaction, making the system responsive.

  4. Dynamic Mode Switching: Users can toggle between object detection and text recognition using keyboard shortcuts.


Applications

  • Assistive Technology: For visually impaired individuals, offering real-time insights into their surroundings.

  • Smart Surveillance: Detect objects and read texts in monitored areas.

  • Interactive Learning Tools: Enhance learning experiences with real-time visuals and audio feedback.


Conclusion

This project showcases the power of combining various AI technologies to create a versatile real-time system. By leveraging YOLOS, EasyOCR, and Pyttsx3, developers can build applications that bridge the gap between vision and interaction. Whether you're building assistive tools or exploring innovative AI applications, this system offers a solid foundation.

Start building your own real-time detection and recognition system today and unlock the endless possibilities of AI-driven applications!
