✋ Sign Language to Text using AI: Real-Time Gesture Recognition with Mediapipe & LSTMs

Khushal Jhaveri

So this one came from a very real problem — how do we help someone who communicates using sign language get their message across instantly, in text?

I thought, what if we could build a system that just looks at hand gestures, understands them, and outputs text in real time — kind of like a translator for the hearing-impaired. That’s what this project was about: building a Sign Language Recognition System using computer vision + deep learning.


🧩 What I Was Trying to Solve

Sign language isn’t just a static set of hand poses. It's a sequence — there’s motion, timing, and orientation involved. So just using image classification wasn’t enough. I needed something that could understand sequences of gestures.

So my idea was to use:

  • MediaPipe for fast, real-time hand tracking

  • LSTM to capture the sequence of hand movements

  • A model that could take raw webcam input and translate signs into text


🔧 What I Used to Build It

  • Python

  • MediaPipe Hands API for 21-point hand landmark detection

  • OpenCV to process real-time webcam input

  • TensorFlow / Keras to build the LSTM model

  • NumPy / Pandas for dataset creation

  • TensorBoard for visualizing training performance


🛠️ How I Built It

1️⃣ Data Collection

  • Used MediaPipe to track hand keypoints in each frame (21 landmarks per hand, each with x, y, z)

  • Recorded videos of different signs performed repeatedly to capture variation in speed and angle

  • Built a labeled dataset where each sample is a full sequence of keypoints over time (a rough sketch of this extraction step is below)
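
Here's a minimal sketch of what that collection step can look like with the standard MediaPipe Hands API; the 30-frame sequence length and the `hello_000.npy` file name are illustrative assumptions, not the exact setup from my scripts.

```python
import cv2
import numpy as np
import mediapipe as mp

mp_hands = mp.solutions.hands
SEQUENCE_LEN = 30  # frames per sample (illustrative choice)

def extract_keypoints(results):
    """Flatten 21 (x, y, z) hand landmarks into a 63-dim vector; zeros if no hand is detected."""
    if results.multi_hand_landmarks:
        hand = results.multi_hand_landmarks[0]
        return np.array([[lm.x, lm.y, lm.z] for lm in hand.landmark]).flatten()
    return np.zeros(21 * 3)

cap = cv2.VideoCapture(0)
sequence = []
with mp_hands.Hands(max_num_hands=1, min_detection_confidence=0.5) as hands:
    while cap.isOpened() and len(sequence) < SEQUENCE_LEN:
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB; OpenCV captures BGR
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        sequence.append(extract_keypoints(results))
cap.release()

np.save("hello_000.npy", np.array(sequence))  # one labeled sample for the "hello" class
```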

2️⃣ Preprocessing

  • Normalized landmark values to avoid scale/position issues

  • Applied smoothing to remove jitter in frame transitions

  • Stored the final training set as .npy arrays for each class (e.g., "hello", "thanks", "yes"); a sketch of this preprocessing is below
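
The post doesn't spell out the exact normalization and smoothing, so here is one plausible version: re-centering each frame on the wrist landmark, scaling by the hand's spread, and an exponential moving average to damp jitter. Treat the specifics as assumptions.

```python
import numpy as np

def normalize_sequence(seq):
    """seq: (frames, 63). Re-center each frame on the wrist (landmark 0) and
    scale by the hand's spread so absolute position and camera distance drop out."""
    frames = seq.reshape(len(seq), 21, 3)
    wrist = frames[:, 0:1, :]                              # (frames, 1, 3)
    centered = frames - wrist
    spread = np.linalg.norm(centered, axis=2).max(axis=1)  # (frames,)
    centered /= (spread[:, None, None] + 1e-6)
    return centered.reshape(len(seq), 63)

def smooth_sequence(seq, alpha=0.6):
    """Exponential moving average over frames to damp landmark jitter."""
    out = np.copy(seq).astype(float)
    for t in range(1, len(seq)):
        out[t] = alpha * seq[t] + (1 - alpha) * out[t - 1]
    return out
```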

3️⃣ Model Architecture

  • Used a stacked LSTM model with dropout to handle sequential hand landmark data

  • Each sequence was fed in as a tensor of shape (frames, 63), since each frame had 21 keypoints × 3 coords (see the model sketch below)
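
A sketch of that kind of stacked LSTM in Keras is below; the layer widths, dropout rate, and 30-frame window are placeholders rather than the exact configuration I trained.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense

SEQUENCE_LEN = 30   # frames per sequence (placeholder)
NUM_CLASSES = 12    # number of trained signs (placeholder for the 10–12 signs)

model = Sequential([
    LSTM(64, return_sequences=True, input_shape=(SEQUENCE_LEN, 63)),  # 21 keypoints × 3 coords
    Dropout(0.3),
    LSTM(128, return_sequences=False),
    Dropout(0.3),
    Dense(64, activation="relu"),
    Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["categorical_accuracy"])
```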

4️⃣ Training & Testing

  • Trained on 10–12 common signs with 100+ sequences each

  • Accuracy reached roughly 90% after tuning the learning rate and batch size

  • Also plotted confusion matrices; most of the confusion was between signs that look very similar (a training/evaluation sketch follows below)
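
Training and evaluation would look roughly like the following; the epoch count, split ratio, and the combined `X.npy` / `y.npy` arrays are placeholders assumed for the sketch.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from tensorflow.keras.callbacks import TensorBoard

# Hypothetical pre-stacked arrays built from the per-class .npy sequences:
# X has shape (samples, SEQUENCE_LEN, 63), y is one-hot with shape (samples, NUM_CLASSES)
X, y = np.load("X.npy"), np.load("y.npy")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y.argmax(axis=1))

model.fit(X_train, y_train,
          epochs=200,                      # placeholder; tuned alongside learning rate / batch size
          batch_size=32,
          validation_data=(X_test, y_test),
          callbacks=[TensorBoard(log_dir="logs")])   # inspect curves in TensorBoard

y_pred = model.predict(X_test).argmax(axis=1)
print(confusion_matrix(y_test.argmax(axis=1), y_pred))
```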

5️⃣ Real-Time Deployment

  • Built a simple OpenCV webcam interface

  • Detected gestures in real time and overlaid the recognized text on the video feed (the inference loop is sketched below)
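
A rough version of that loop, reusing the `extract_keypoints` helper, `SEQUENCE_LEN`, and `model` from the sketches above; the 0.8 confidence threshold and the truncated sign list are assumptions.

```python
import cv2
import numpy as np
import mediapipe as mp
from collections import deque

ACTIONS = ["hello", "thanks", "yes", "no"]   # trained classes in training order (only four shown here)
window = deque(maxlen=SEQUENCE_LEN)          # rolling buffer of per-frame keypoints

cap = cv2.VideoCapture(0)
with mp.solutions.hands.Hands(max_num_hands=1) as hands:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        window.append(extract_keypoints(results))  # same helper as in data collection
        # (in practice, apply the same normalization/smoothing used at training time)

        if len(window) == SEQUENCE_LEN:
            probs = model.predict(np.expand_dims(window, axis=0), verbose=0)[0]
            idx = int(probs.argmax())
            if probs[idx] > 0.8 and idx < len(ACTIONS):   # confidence gate; guard for truncated label list
                cv2.putText(frame, ACTIONS[idx], (20, 40),
                            cv2.FONT_HERSHEY_SIMPLEX, 1.2, (0, 255, 0), 2)

        cv2.imshow("Sign to Text", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break

cap.release()
cv2.destroyAllWindows()
```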


📈 Results

  • Real-time sign-to-text achieved at ~10–15 fps on a standard laptop

  • Accuracy: ~90% on common signs (like “hello,” “thank you,” “yes,” “no”)

  • Smooth experience with minimal lag using MediaPipe + lightweight LSTM


💡 What I Learned

  • The combination of traditional CV (MediaPipe) and DL (LSTM) works really well when you want both speed and accuracy

  • Small things like gesture smoothing and timing consistency made a huge difference in accuracy

  • For real-world use, you need a much bigger vocabulary and maybe even transformers for longer gesture phrases


🤝 Why It Matters

This kind of project can help build better tools for people who rely on gestures for communication — not just in day-to-day life but also in hospitals, public spaces, or learning environments.

And it touches the same space as behavioral AI tools — if your system understands motion and intent through vision, that’s a major part of human behavior modeling.


✉️ Let’s Collaborate

I'm still improving this; I may add more signs and switch to a transformer model later. If this space excites you, I'd love to connect.

📩 LinkedIn | 🔗 GitHub
