🎭 How I Built a Multimodal Emotion Recognition System Using BERT, HuBERT, EfficientNet, and Fusion Techniques

Emotions are messy. They don't just live in our words — they hide in our tone, our facial twitches, and the way we pause mid-sentence. That’s exactly why unimodal emotion recognition (just text or just audio) often falls short. So, for my most recent project, I went all-in on multimodal learning — and built an AI system that reads text, listens to tone, and watches facial cues to figure out what someone’s really feeling.
🧩 The Goal
Build a system that can detect 7 emotions (Joy, Anger, Sadness, Fear, Disgust, Surprise, Neutral) from video clips using:
Text from transcripts
Audio (speech tone)
Video (facial expressions)
I used the MELD dataset — a goldmine of emotionally rich dialogues from the show Friends, complete with aligned video, audio, and text.
🔧 The Pipeline
Step 1: Extract the 3 Modalities
Text: Transcribed audio using OpenAI Whisper, then embedded utterances with BERT, including speaker info and previous utterances (context-aware).
Audio: Extracted WAV files and passed them through HuBERT, a self-supervised model that picks up tone, rhythm, and pitch patterns.
Video: Cropped faces from each frame using MTCNN, then passed them into EfficientNet-B3 to classify expression over time.
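In code, the extraction step looks roughly like the sketch below — a minimal, per-utterance version assuming the clips have already been split into individual .wav/.mp4 files (paths and checkpoint names are illustrative):

```python
# Minimal sketch of per-utterance modality extraction (paths are illustrative).
import whisper                      # openai-whisper
import torchaudio
import cv2
from PIL import Image
from facenet_pytorch import MTCNN

# --- Text: transcribe the utterance audio with Whisper ---
asr = whisper.load_model("base")
transcript = asr.transcribe("utterance.wav")["text"]

# --- Audio: load the waveform and resample to 16 kHz mono for HuBERT ---
wav, sr = torchaudio.load("utterance.wav")
wav = torchaudio.functional.resample(wav, sr, 16_000).mean(dim=0)

# --- Video: read frames and crop faces with MTCNN ---
mtcnn = MTCNN(image_size=224)       # returns a 3x224x224 face tensor per frame (or None)
cap = cv2.VideoCapture("utterance.mp4")
faces = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    face = mtcnn(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    if face is not None:
        faces.append(face)
cap.release()
```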
🧠 Training the Models
📝 Text – BERT Fine-Tuning
Used Hugging Face's bert-base-uncased.
Tried the CLS token, max pooling, and mean+max pooling for utterance embeddings.
Trained with CrossEntropyLoss and AdamW, reaching ~45% macro F1 after data augmentation.
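A stripped-down version of the text classifier (speaker tokens, dialogue context, and data loading omitted; hyperparameters and the toy batch are illustrative):

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

class BertEmotionClassifier(nn.Module):
    """BERT encoder + a small head over mean+max pooled token embeddings."""
    def __init__(self, num_classes: int = 7):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.head = nn.Linear(2 * 768, num_classes)   # mean-pool + max-pool concatenated

    def forward(self, input_ids, attention_mask):
        hidden = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        mask = attention_mask.unsqueeze(-1)           # ignore padding tokens when pooling
        mean_pool = (hidden * mask).sum(1) / mask.sum(1)
        max_pool = hidden.masked_fill(mask == 0, -1e9).max(1).values
        return self.head(torch.cat([mean_pool, max_pool], dim=-1))

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertEmotionClassifier()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a toy batch.
batch = tokenizer(["Oh my god, that's amazing!"], return_tensors="pt", padding=True, truncation=True)
labels = torch.tensor([0])                            # e.g. 0 = joy
loss = criterion(model(batch["input_ids"], batch["attention_mask"]), labels)
loss.backward(); optimizer.step(); optimizer.zero_grad()
```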
🔊 Audio – HuBERT Embeddings
Used pre-trained HuBERT-base to get 768-d audio vectors.
Trained an MLP classifier on top.
Audio was weakest alone, but crucial in fusion (especially for fear and disgust).
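The audio branch boils down to something like this, assuming 16 kHz mono waveforms (the MLP size is an illustrative choice):

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2FeatureExtractor, HubertModel

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

def utterance_embedding(waveform: torch.Tensor) -> torch.Tensor:
    """Mean-pool HuBERT's last hidden states into a single 768-d vector."""
    inputs = extractor(waveform.numpy(), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        hidden = hubert(**inputs).last_hidden_state   # (1, frames, 768)
    return hidden.mean(dim=1).squeeze(0)              # (768,)

# Small MLP classifier trained on top of the frozen embeddings.
mlp = nn.Sequential(
    nn.Linear(768, 256), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(256, 7),                                # 7 emotion classes
)

emb = utterance_embedding(torch.randn(16_000))        # 1 second of dummy audio
logits = mlp(emb)
```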
🎥 Video – EfficientNet on Cropped Faces
Converted video frames into tensor batches.
Trained EfficientNet-B3 on stacked face crops from each utterance.
Got strong results on neutral and joy classes post-augmentation.
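A sketch of the video branch — here I simply average the per-frame logits into one utterance-level prediction, which is one straightforward aggregation choice for the stacked face crops:

```python
import torch
import torch.nn as nn
from torchvision.models import efficientnet_b3, EfficientNet_B3_Weights

# Pretrained EfficientNet-B3 with the classifier head swapped for 7 emotions.
weights = EfficientNet_B3_Weights.IMAGENET1K_V1
model = efficientnet_b3(weights=weights)
model.classifier[1] = nn.Linear(model.classifier[1].in_features, 7)

preprocess = weights.transforms()                     # resize/normalize to what B3 expects

def utterance_logits(face_crops: torch.Tensor) -> torch.Tensor:
    """face_crops: (num_frames, 3, H, W) face tensors from one utterance.
    Averages per-frame logits into a single prediction."""
    batch = torch.stack([preprocess(f) for f in face_crops])
    return model(batch).mean(dim=0)

logits = utterance_logits(torch.rand(8, 3, 224, 224))  # 8 dummy face crops
```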
🧪 The Fusion Strategy
Instead of relying on a single modality, I combined them using:
Early Fusion: Concatenated all features before classification (didn't perform well).
Late Fusion (Unweighted): Averaged predictions from the 3 models.
Late Fusion (Weighted): Gave more weight to text (50%), then audio (30%) and video (20%) — this performed the best, with 60% accuracy and 58% weighted F1.
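The weighted late fusion itself is just a convex combination of the per-modality class probabilities, with the weights above:

```python
import torch
import torch.nn.functional as F

def weighted_late_fusion(text_logits, audio_logits, video_logits,
                         weights=(0.5, 0.3, 0.2)):
    """Combine per-modality predictions by weighting their softmax probabilities."""
    probs = [F.softmax(l, dim=-1) for l in (text_logits, audio_logits, video_logits)]
    fused = sum(w * p for w, p in zip(weights, probs))
    return fused.argmax(dim=-1), fused

# Dummy logits for a batch of 2 utterances over 7 emotion classes.
pred, probs = weighted_late_fusion(torch.randn(2, 7), torch.randn(2, 7), torch.randn(2, 7))
```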
🦾 Bonus: Fine-Tuning an LLM for Emotion Explanation
I didn’t stop at classification.
I also fine-tuned a merged DeepSeek LLM using facial Action Units (from OpenFace), teaching the model to explain emotions.
E.g., “AU12 (lip corner puller) + AU06 (cheek raiser)” → “This person seems genuinely happy.”
Post-training, the model could explain emotions like a therapist with a face-reading superpower.
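To give a flavor of the fine-tuning data, here is roughly how an AU-to-explanation instruction pair can be assembled from an OpenFace output row — the AU name map, intensity threshold, and prompt template are illustrative, not the exact ones I used:

```python
# Illustrative: turn OpenFace AU intensities into an instruction-tuning pair.
AU_NAMES = {"AU06": "cheek raiser", "AU12": "lip corner puller"}   # small subset, for illustration

def build_example(au_row: dict, explanation: str) -> dict:
    """au_row: e.g. {"AU06_r": 2.4, "AU12_r": 3.1} from OpenFace's per-frame CSV."""
    active = [f"{au} ({AU_NAMES[au]})" for au in AU_NAMES
              if au_row.get(f"{au}_r", 0.0) > 1.0]                  # simple intensity threshold
    prompt = "Facial action units detected: " + " + ".join(active) + ". What emotion is shown?"
    return {"prompt": prompt, "response": explanation}

example = build_example({"AU06_r": 2.4, "AU12_r": 3.1},
                        "This person seems genuinely happy.")
```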
💻 Real-Time App with Streamlit
I deployed the whole pipeline in a clean Streamlit app:
Upload a video
Auto-processes text/audio/video
Outputs the predicted emotion
Bonus: Shows LLM-generated explanation based on facial AUs
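The app itself is only a few dozen lines. A bare-bones sketch — `predict_emotion` and `explain_with_llm` are hypothetical wrapper names standing in for the pipeline described above:

```python
import tempfile
import streamlit as st

st.title("Multimodal Emotion Recognition")

uploaded = st.file_uploader("Upload a video", type=["mp4", "mov", "avi"])
if uploaded is not None:
    # Persist the upload so the extraction pipeline can read it from disk.
    with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as tmp:
        tmp.write(uploaded.read())
        video_path = tmp.name

    st.video(video_path)
    with st.spinner("Running text / audio / video pipelines..."):
        emotion, probs = predict_emotion(video_path)   # hypothetical wrapper around the fusion model
        explanation = explain_with_llm(video_path)     # hypothetical wrapper around the AU + LLM step

    st.subheader(f"Predicted emotion: {emotion}")
    st.bar_chart(probs)                                # per-class probabilities, e.g. a dict of class -> prob
    st.write(explanation)
```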
📈 Results Recap
Modality | Accuracy (post-augmentation)
Text | 59% (best: BERT + CLS + context)
Audio | 38% (HuBERT + augmentation)
Video | 47% (EfficientNet, augmented)
Fusion (weighted) | 60%
💡 What I Learned
Emotion detection needs context — both temporal (what happened before) and cross-modal (what was said vs. how it was said).
Audio is tough to get right, but HuBERT helps a lot.
Weighted late fusion > early fusion, especially when modalities vary in reliability.
Building end-to-end AI apps makes your models feel real — Streamlit is perfect for this.
🎯 What’s Next?
I’d love to:
Integrate real-time face/audio capture
Explore multimodal transformers (like MME Transformer or CLIP-based fusions)
Improve few-shot LLM fine-tuning on emotion grounding