🎭 How I Built a Multimodal Emotion Recognition System Using BERT, HuBERT, EfficientNet, and Fusion Techniques

Emotions are messy. They don't just live in our words — they hide in our tone, our facial twitches, and the way we pause mid-sentence. That’s exactly why unimodal emotion recognition (just text or just audio) often falls short. So, for my most recent project, I went all-in on multimodal learning — and built an AI system that reads text, listens to tone, and watches facial cues to figure out what someone’s really feeling.
🧩 The Goal
Build a system that can detect 7 emotions (Joy, Anger, Sadness, Fear, Disgust, Surprise, Neutral) from video clips using:
Text from transcripts
Audio (speech tone)
Video (facial expressions)
I used the MELD dataset — a goldmine of emotionally rich dialogues from the show Friends, complete with aligned video, audio, and text.
🔧 The Pipeline
Step 1: Extract the 3 Modalities
Text: Transcribed audio using OpenAI Whisper, then embedded utterances with BERT, including speaker info and previous utterances (context-aware).
Audio: Extracted WAV files and passed them through HuBERT, a self-supervised model that picks up tone, rhythm, and pitch patterns.
Video: Cropped faces from each frame using MTCNN, then passed them into EfficientNet-B3 to classify expression over time.
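In code, the extraction step looks roughly like the sketch below — a minimal, per-utterance version assuming the clips have already been split into individual .wav/.mp4 files (paths and checkpoint names are illustrative):

```python
# Minimal sketch of per-utterance modality extraction (paths are illustrative).
import whisper                      # openai-whisper
import torchaudio
import cv2
from PIL import Image
from facenet_pytorch import MTCNN

# --- Text: transcribe the utterance audio with Whisper ---
asr = whisper.load_model("base")
transcript = asr.transcribe("utterance.wav")["text"]

# --- Audio: load the waveform and resample to 16 kHz mono for HuBERT ---
wav, sr = torchaudio.load("utterance.wav")
wav = torchaudio.functional.resample(wav, sr, 16_000).mean(dim=0)

# --- Video: read frames and crop faces with MTCNN ---
mtcnn = MTCNN(image_size=224)       # returns a 3x224x224 face tensor per frame (or None)
cap = cv2.VideoCapture("utterance.mp4")
faces = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    face = mtcnn(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    if face is not None:
        faces.append(face)
cap.release()
```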
🧠 Training the Models
📝 Text – BERT Fine-Tuning
Used Hugging Face's bert-base-uncased.
Tried the CLS token, max pooling, and mean+max pooling for utterance embeddings.
Trained with CrossEntropyLoss and AdamW, reaching ~45% macro F1 after data augmentation.
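A stripped-down version of the text classifier (speaker tokens, dialogue context, and data loading omitted; hyperparameters and the toy batch are illustrative):

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

class BertEmotionClassifier(nn.Module):
    """BERT encoder + a small head over mean+max pooled token embeddings."""
    def __init__(self, num_classes: int = 7):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.head = nn.Linear(2 * 768, num_classes)   # mean-pool + max-pool concatenated

    def forward(self, input_ids, attention_mask):
        hidden = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        mask = attention_mask.unsqueeze(-1)           # ignore padding tokens when pooling
        mean_pool = (hidden * mask).sum(1) / mask.sum(1)
        max_pool = hidden.masked_fill(mask == 0, -1e9).max(1).values
        return self.head(torch.cat([mean_pool, max_pool], dim=-1))

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertEmotionClassifier()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a toy batch.
batch = tokenizer(["Oh my god, that's amazing!"], return_tensors="pt", padding=True, truncation=True)
labels = torch.tensor([0])                            # e.g. 0 = joy
loss = criterion(model(batch["input_ids"], batch["attention_mask"]), labels)
loss.backward(); optimizer.step(); optimizer.zero_grad()
```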
🔊 Audio – HuBERT Embeddings
Used pre-trained HuBERT-base to get 768-d audio vectors.
Trained an MLP classifier on top.
Audio was weakest alone, but crucial in fusion (especially for fear and disgust).
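The audio branch boils down to something like this, assuming 16 kHz mono waveforms (the MLP size is an illustrative choice):

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2FeatureExtractor, HubertModel

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

def utterance_embedding(waveform: torch.Tensor) -> torch.Tensor:
    """Mean-pool HuBERT's last hidden states into a single 768-d vector."""
    inputs = extractor(waveform.numpy(), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        hidden = hubert(**inputs).last_hidden_state   # (1, frames, 768)
    return hidden.mean(dim=1).squeeze(0)              # (768,)

# Small MLP classifier trained on top of the frozen embeddings.
mlp = nn.Sequential(
    nn.Linear(768, 256), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(256, 7),                                # 7 emotion classes
)

emb = utterance_embedding(torch.randn(16_000))        # 1 second of dummy audio
logits = mlp(emb)
```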
🎥 Video – EfficientNet on Cropped Faces
Converted video frames into tensor batches.
Trained EfficientNet-B3 on stacked face crops from each utterance.
Got strong results on neutral and joy classes post-augmentation.
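A sketch of the video branch — here I simply average the per-frame logits into one utterance-level prediction, which is one straightforward aggregation choice for the stacked face crops:

```python
import torch
import torch.nn as nn
from torchvision.models import efficientnet_b3, EfficientNet_B3_Weights

# Pretrained EfficientNet-B3 with the classifier head swapped for 7 emotions.
weights = EfficientNet_B3_Weights.IMAGENET1K_V1
model = efficientnet_b3(weights=weights)
model.classifier[1] = nn.Linear(model.classifier[1].in_features, 7)

preprocess = weights.transforms()                     # resize/normalize to what B3 expects

def utterance_logits(face_crops: torch.Tensor) -> torch.Tensor:
    """face_crops: (num_frames, 3, H, W) face tensors from one utterance.
    Averages per-frame logits into a single prediction."""
    batch = torch.stack([preprocess(f) for f in face_crops])
    return model(batch).mean(dim=0)

logits = utterance_logits(torch.rand(8, 3, 224, 224))  # 8 dummy face crops
```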
🧪 The Fusion Strategy
Instead of relying on a single modality, I combined them using:
Early Fusion: Concatenated all features before classification (didn't perform well).
Late Fusion (Unweighted): Averaged predictions from the 3 models.
Late Fusion (Weighted): Gave more weight to text (50%), then audio (30%) and video (20%) — this performed the best, with 60% accuracy and 58% weighted F1.
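The weighted late fusion itself is just a convex combination of the per-modality class probabilities, with the weights above:

```python
import torch
import torch.nn.functional as F

def weighted_late_fusion(text_logits, audio_logits, video_logits,
                         weights=(0.5, 0.3, 0.2)):
    """Combine per-modality predictions by weighting their softmax probabilities."""
    probs = [F.softmax(l, dim=-1) for l in (text_logits, audio_logits, video_logits)]
    fused = sum(w * p for w, p in zip(weights, probs))
    return fused.argmax(dim=-1), fused

# Dummy logits for a batch of 2 utterances over 7 emotion classes.
pred, probs = weighted_late_fusion(torch.randn(2, 7), torch.randn(2, 7), torch.randn(2, 7))
```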
🦾 Bonus: Fine-Tuning an LLM for Emotion Explanation
I didn’t stop at classification.
I also fine-tuned a merged DeepSeek LLM using facial Action Units (from OpenFace), teaching the model to explain emotions.
E.g., “AU12 (lip corner puller) + AU06 (cheek raiser)” → “This person seems genuinely happy.”
Post-training, the model could explain emotions like a therapist with a face-reading superpower.
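To give a flavor of the fine-tuning data, here is roughly how an AU-to-explanation instruction pair can be assembled from an OpenFace output row — the AU name map, intensity threshold, and prompt template are illustrative, not the exact ones I used:

```python
# Illustrative: turn OpenFace AU intensities into an instruction-tuning pair.
AU_NAMES = {"AU06": "cheek raiser", "AU12": "lip corner puller"}   # small subset, for illustration

def build_example(au_row: dict, explanation: str) -> dict:
    """au_row: e.g. {"AU06_r": 2.4, "AU12_r": 3.1} from OpenFace's per-frame CSV."""
    active = [f"{au} ({AU_NAMES[au]})" for au in AU_NAMES
              if au_row.get(f"{au}_r", 0.0) > 1.0]                  # simple intensity threshold
    prompt = "Facial action units detected: " + " + ".join(active) + ". What emotion is shown?"
    return {"prompt": prompt, "response": explanation}

example = build_example({"AU06_r": 2.4, "AU12_r": 3.1},
                        "This person seems genuinely happy.")
```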
💻 Real-Time App with Streamlit
I deployed the whole pipeline in a clean Streamlit app:
Upload a video
Auto-processes text/audio/video
Outputs the predicted emotion
Bonus: Shows LLM-generated explanation based on facial AUs
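The app itself is only a few dozen lines. A bare-bones sketch — `predict_emotion` and `explain_with_llm` are hypothetical wrapper names standing in for the pipeline described above:

```python
import tempfile
import streamlit as st

st.title("Multimodal Emotion Recognition")

uploaded = st.file_uploader("Upload a video", type=["mp4", "mov", "avi"])
if uploaded is not None:
    # Persist the upload so the extraction pipeline can read it from disk.
    with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as tmp:
        tmp.write(uploaded.read())
        video_path = tmp.name

    st.video(video_path)
    with st.spinner("Running text / audio / video pipelines..."):
        emotion, probs = predict_emotion(video_path)   # hypothetical wrapper around the fusion model
        explanation = explain_with_llm(video_path)     # hypothetical wrapper around the AU + LLM step

    st.subheader(f"Predicted emotion: {emotion}")
    st.bar_chart(probs)                                # per-class probabilities, e.g. a dict of class -> prob
    st.write(explanation)
```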
📈 Results Recap
Modality | Accuracy (post-augmentation)
Text | 59% (best: BERT + CLS + context)
Audio | 38% (HuBERT + augmentation)
Video | 47% (EfficientNet, augmented)
Fusion (weighted) | 60%
💡 What I Learned
Emotion detection needs context — both temporal (what happened before) and cross-modal (what was said vs. how it was said).
Audio is tough to get right, but HuBERT helps a lot.
Weighted late fusion > early fusion, especially when modalities vary in reliability.
Building end-to-end AI apps makes your models feel real — Streamlit is perfect for this.
🎯 What’s Next?
I’d love to:
Integrate real-time face/audio capture
Explore multimodal transformers (like MME Transformer or CLIP-based fusions)
Improve few-shot LLM fine-tuning on emotion grounding