🎭 How I Built a Multimodal Emotion Recognition System Using BERT, HuBERT, EfficientNet, and Fusion Techniques

Khushal Jhaveri

Emotions are messy. They don't just live in our words — they hide in our tone, our facial twitches, and the way we pause mid-sentence. That’s exactly why unimodal emotion recognition (just text or just audio) often falls short. So, for my most recent project, I went all-in on multimodal learning — and built an AI system that reads text, listens to tone, and watches facial cues to figure out what someone’s really feeling.


🧩 The Goal

Build a system that can detect 7 emotions (Joy, Anger, Sadness, Fear, Disgust, Surprise, Neutral) from video clips using:

  • Text from transcripts

  • Audio (speech tone)

  • Video (facial expressions)

I used the MELD dataset — a goldmine of emotionally rich dialogues from the show Friends, complete with aligned video, audio, and text.


🔧 The Pipeline

Step 1: Extract the 3 Modalities

  • Text: Transcribed audio using OpenAI Whisper, then embedded utterances with BERT, including speaker info and previous utterances (context-aware).

  • Audio: Extracted WAV files and passed them through HuBERT, a self-supervised model that picks up tone, rhythm, and pitch patterns.

  • Video: Cropped faces from each frame using MTCNN, then passed them into EfficientNet-B3 to classify expressions over time. (A sketch of this extraction step follows the list.)
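
Here's a minimal sketch of that extraction step, assuming MELD's utterance-level clips are on disk. The helper names are placeholders; only the Whisper and MTCNN calls are real library APIs.

```python
# Illustrative extraction helpers. Function names are placeholders,
# not the exact pipeline code.
import whisper                     # pip install openai-whisper
from facenet_pytorch import MTCNN  # pip install facenet-pytorch
from PIL import Image

asr = whisper.load_model("base")
face_detector = MTCNN(image_size=224, margin=20)

def extract_text(clip_path: str) -> str:
    # Whisper pulls the audio track straight out of the video via ffmpeg.
    return asr.transcribe(clip_path)["text"]

def extract_face(frame: Image.Image):
    # Returns a cropped face tensor (3 x 224 x 224), or None if no face is found.
    return face_detector(frame)
```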


🧠 Training the Models

📝 Text – BERT Fine-Tuning

  • Used HuggingFace’s BERT-base-uncased.

  • Tried CLS token, max pooling, and mean+max pooling for embeddings.

  • Trained with CrossEntropyLoss and AdamW, observing ~45% macro F1 after data augmentation (sketch below).
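
A minimal sketch of the CLS-pooling variant (the one that ended up best, per the results table further down); the learning rate and head are illustrative, not my exact settings.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class BertEmotionClassifier(nn.Module):
    def __init__(self, num_classes: int = 7):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # 7-way emotion head on top of the pooled utterance embedding
        self.head = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # CLS-token pooling
        return self.head(cls)

model = BertEmotionClassifier()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # illustrative lr
```

Swapping `cls` for mean or mean+max pooling over `last_hidden_state` gives the other variants I compared.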

🔊 Audio – HuBERT Embeddings

  • Used pre-trained HuBERT-base to get 768-d audio vectors.

  • Trained an MLP classifier on top (sketched after this list).

  • Audio was weakest alone, but crucial in fusion (especially for fear and disgust).
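
A sketch of this branch, assuming the public `facebook/hubert-base-ls960` checkpoint; the MLP sizes are my illustration, only the 768-d HuBERT output is fixed.

```python
import torch
import torch.nn as nn
import torchaudio
from transformers import HubertModel, Wav2Vec2FeatureExtractor

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

@torch.no_grad()
def embed_utterance(wav_path: str) -> torch.Tensor:
    wav, sr = torchaudio.load(wav_path)
    wav = torchaudio.functional.resample(wav, sr, 16_000).mean(dim=0)  # mono, 16 kHz
    inputs = extractor(wav.numpy(), sampling_rate=16_000, return_tensors="pt")
    # Mean-pool frame-level hidden states into one 768-d utterance vector.
    return hubert(**inputs).last_hidden_state.mean(dim=1).squeeze(0)

# Small MLP classifier on top of the frozen embeddings.
mlp = nn.Sequential(
    nn.Linear(768, 256), nn.ReLU(), nn.Dropout(0.3), nn.Linear(256, 7)
)
```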

🎥 Video – EfficientNet on Cropped Faces

  • Converted video frames into tensor batches.

  • Trained EfficientNet-B3 on stacked face crops from each utterance (sketch after this list).

  • Got strong results on neutral and joy classes post-augmentation.
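
A sketch of the video head, using torchvision's EfficientNet-B3 with ImageNet weights; averaging per-frame logits is one simple aggregation choice, not necessarily my exact scheme.

```python
import torch
import torch.nn as nn
from torchvision import models

video_model = models.efficientnet_b3(weights="IMAGENET1K_V1")
# Swap the 1000-way ImageNet head for a 7-way emotion head.
video_model.classifier[1] = nn.Linear(video_model.classifier[1].in_features, 7)

def utterance_logits(face_crops: torch.Tensor) -> torch.Tensor:
    # face_crops: (num_frames, 3, 300, 300) stacked crops from one utterance
    frame_logits = video_model(face_crops)  # (num_frames, 7)
    return frame_logits.mean(dim=0)         # average over time
```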


🧪 The Fusion Strategy

Instead of relying on a single modality, I combined them using:

  • Early Fusion: Concatenated all features before classification (didn't perform well).

  • Late Fusion (Unweighted): Averaged predictions from the 3 models.

  • Late Fusion (Weighted): Gave more weight to text (50%), then audio (30%) and video (20%) — this performed the best, with 60% accuracy and 58% weighted F1 (see the sketch below).
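
The weighted variant is tiny in code. A sketch with the weights from above:

```python
import torch
import torch.nn.functional as F

WEIGHTS = {"text": 0.5, "audio": 0.3, "video": 0.2}

def fuse(logits: dict) -> int:
    # Each entry is a (7,)-dim logit vector from one unimodal model.
    probs = sum(w * F.softmax(logits[m], dim=-1) for m, w in WEIGHTS.items())
    return int(probs.argmax())  # index of the predicted emotion
```

One plausible reason early fusion lagged: a single classifier over the concatenated features lets the noisier modalities drown out the reliable ones.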


🦾 Bonus: Fine-Tuning an LLM for Emotion Explanation

I didn’t stop at classification.

I also fine-tuned a merged DeepSeek LLM using facial Action Units (from OpenFace), teaching the model to explain emotions.
E.g., “AU12 (lip corner puller) + AU06 (cheek raiser)” → “This person seems genuinely happy.”

Post-training, the model could explain emotions like a therapist with a face-reading superpower.
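
To make the format concrete, one illustrative training pair might look like this; the exact prompt template is an assumption, the AU-to-explanation mapping is the idea.

```python
# One illustrative instruction-tuning example (the template is hypothetical).
example = {
    "instruction": "Explain the emotion suggested by these facial Action Units.",
    "input": "AU06 (cheek raiser) + AU12 (lip corner puller)",
    "output": "Raised cheeks plus pulled lip corners form a Duchenne smile: "
              "this person seems genuinely happy.",
}
```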


💻 Real-Time App with Streamlit

I deployed the whole pipeline in a clean Streamlit app (skeleton sketched after this list):

  • Upload a video

  • Auto-processes text/audio/video

  • Outputs the predicted emotion

  • Bonus: Shows LLM-generated explanation based on facial AUs
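
The skeleton is only a few lines. A minimal sketch, where `run_pipeline` is a stand-in for the full extract-classify-fuse-explain chain:

```python
import streamlit as st

def run_pipeline(path: str):
    # Placeholder: the real version runs the three extractors, the unimodal
    # models, weighted late fusion, and the AU-explainer LLM.
    return "joy", "AU06 + AU12 suggest a genuine smile."

st.title("Multimodal Emotion Recognition")

uploaded = st.file_uploader("Upload a video clip", type=["mp4"])
if uploaded is not None:
    with open("clip.mp4", "wb") as f:
        f.write(uploaded.read())
    emotion, explanation = run_pipeline("clip.mp4")
    st.subheader(f"Predicted emotion: {emotion}")
    st.write(explanation)  # LLM-generated, AU-based explanation
```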


📈 Results Recap

| Modality | Accuracy (Post-Aug) |
| --- | --- |
| Text | 59% (best: BERT + CLS + context) |
| Audio | 38% (post HuBERT + augmentation) |
| Video | 47% (EfficientNet, augmented) |
| Fusion (Weighted) | 60% |

💡 What I Learned

  • Emotion detection needs context — both temporal (what happened before) and cross-modal (what was said vs. how it was said).

  • Audio is tough to get right, but HuBERT helps a lot.

  • Weighted late fusion > early fusion, especially when modalities vary in reliability.

  • Building end-to-end AI apps makes your models feel real — Streamlit is perfect for this.


🎯 What’s Next?

I’d love to:

  • Integrate real-time face/audio capture

  • Explore multimodal transformers (like MME Transformer or CLIP-based fusions)

  • Improve few-shot LLM fine-tuning on emotion grounding
