🎧 Decoding Audio with MFCCs: A Visual & Mathematical Journey

🔍 Introduction
You’ve probably heard how AI models can recognize accents, emotions, or even impersonations. But how does a machine see sound? In this post, we’ll use MFCCs—Mel-Frequency Cepstral Coefficients—to compare two audio signals side by side, and even split a single sample to analyze how its features evolve over time.
Rather than walking through MFCCs as a linear pipeline, this post takes a hands-on, comparative approach, using visual analysis to reveal differences between languages or emotional states.
🧠 What Are MFCCs (Mel-Frequency Cepstral Coefficients)?
MFCCs (Mel-Frequency Cepstral Coefficients) are a compact, information-rich way to represent the timbre of audio: the texture of a sound, as opposed to its pitch or loudness.
They're inspired by how human hearing works, which is why they're so widely used in audio analysis. Our ears don't respond linearly to frequency; we're more sensitive to some ranges than others. MFCCs capture this nonlinear behavior by modeling log-spaced frequency bands.
📐 MFCC Workflow (Mathematical View):
1. Pre-Emphasis (optional): boost the high frequencies.
   $$y[n] = x[n] - \alpha x[n-1]$$
2. Framing & Windowing: split the signal into short frames and apply a Hamming window $w[n]$ to each frame.
3. Short-Time Fourier Transform (STFT):
   $$X(k) = \sum_{n=0}^{N-1} x[n] \cdot w[n] \cdot e^{-j 2\pi k n / N}$$
4. Mel Filter Bank: apply a bank of filters spaced on the Mel scale:
   $$m = 2595 \cdot \log_{10}\!\left(1 + \frac{f}{700}\right)$$
5. Logarithmic Compression: take the log of the filter-bank energies, since human ears perceive loudness logarithmically.
6. Discrete Cosine Transform (DCT): decorrelate the features into cepstral coefficients:
   $$c[n] = \sum_{k=1}^{K} \log(E_k) \cos\!\left[\frac{\pi n}{K}\left(k - 0.5\right)\right]$$
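To make these steps concrete, here's a minimal sketch of the same pipeline built from NumPy, SciPy, and librosa's low-level pieces. The frame size, hop length, filter count, and pre-emphasis coefficient are illustrative choices, and the values will differ slightly from `librosa.feature.mfcc` (which uses a decibel scale for the log step), but the structure is the same.

```python
import numpy as np
import scipy.fftpack
import librosa

y, sr = librosa.load(librosa.example('trumpet'))

# 1. Pre-emphasis: y[n] = x[n] - alpha * x[n-1], with alpha = 0.97 here
y_pre = np.append(y[0], y[1:] - 0.97 * y[:-1])

# 2-3. Framing, Hamming windowing, and the FFT all happen inside the STFT
n_fft, hop = 2048, 512
S = np.abs(librosa.stft(y_pre, n_fft=n_fft, hop_length=hop, window='hamming')) ** 2

# 4. Mel filter bank: 40 triangular filters spaced on the Mel scale
mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=40)
mel_energies = mel_fb @ S

# 5. Logarithmic compression (small constant avoids log(0))
log_mel = np.log(mel_energies + 1e-10)

# 6. DCT to decorrelate; keep the first 13 cepstral coefficients
mfccs = scipy.fftpack.dct(log_mel, type=2, axis=0, norm='ortho')[:13]
print(mfccs.shape)  # (13, number of frames)
```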
🧠 The Pipeline (What Happens Under the Hood)
Let’s break it down step by step:
| Step | What it does | Human analogy |
| --- | --- | --- |
| 1️⃣ Pre-emphasis | Boosts high frequencies | Like adjusting treble in your speakers |
| 2️⃣ Framing | Splits audio into 20–40 ms chunks | Like snapshots in time |
| 3️⃣ Windowing | Smooths the edges of chunks | Prevents edge distortion |
| 4️⃣ FFT | Converts to the frequency domain | Tells what frequencies are present |
| 5️⃣ Mel filterbank | Applies filters spaced like the human ear | Focuses on what we hear best |
| 6️⃣ Log energy | Turns amplitudes into decibels | Log scale = how we perceive loudness |
| 7️⃣ DCT (Discrete Cosine Transform) | Decorrelates features | Summarizes the "shape" of the sound spectrum |
Why 13 MFCCs?
Out of all the coefficients you could compute, only the first few capture most of the relevant information:
| Coefficient # | Represents |
| --- | --- |
| 1 (MFCC[0]) | Overall energy of the signal |
| 2–4 | Broad spectral envelope (vowel-like) |
| 5–13 | Fine details like consonants, textures |
👉 Think of MFCCs as a sound fingerprint – the first few coefficients capture general sound shape, later ones add texture.
You can extract more than 13, but 13 is a default sweet spot (historically from speech recognition).
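Because the DCT coefficients are computed independently and then truncated, asking for more coefficients doesn't change the first 13; it just appends finer ones. A quick check, using librosa's bundled trumpet example:

```python
import numpy as np
import librosa

y, sr = librosa.load(librosa.example('trumpet'))

mfcc_13 = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
mfcc_20 = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)

# The first 13 rows of the 20-coefficient version should match the 13-coefficient version
print(np.allclose(mfcc_20[:13], mfcc_13))  # expected: True
```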
🧪 Why MFCC Plots Seem Meaningless (Until Now)
| Coefficient Range | What It Represents | Visual Pattern | Interpretation |
| --- | --- | --- | --- |
| 1st–2nd | Broad energy & pitch contour | Smooth, high magnitude | Vowels / sustained notes |
| 3rd–6th | Mid-level structure & formants | Slight wobbles | Speech articulation (e.g. "ka", "ma") |
| 7th–13th | Fine noise and texture (timbre details) | Rapid, jagged changes | Fricatives, noisy textures like "shhh", "fff" |
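One way to see these three bands in your own audio is to average each group per frame and plot the three trajectories: the low group should look smooth and large in magnitude, the high group small and jagged. A minimal sketch follows; the group boundaries mirror the table above and are a reading aid, not a standard.

```python
import librosa
import matplotlib.pyplot as plt

y, sr = librosa.load(librosa.example('trumpet'))
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Group the coefficients as in the table above (boundaries are illustrative)
groups = {
    "1-2: energy & pitch contour": mfccs[0:2],
    "3-6: formants / articulation": mfccs[2:6],
    "7-13: fine texture": mfccs[6:13],
}

fig, axs = plt.subplots(3, 1, figsize=(10, 6), sharex=True)
for ax, (label, block) in zip(axs, groups.items()):
    ax.plot(block.mean(axis=0))       # average trajectory of the group per frame
    ax.set_ylabel(label, fontsize=8)
axs[-1].set_xlabel("Time (frames)")
plt.tight_layout()
plt.show()
```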
🔍 How MFCCs Reflect Emotions
| Emotion | MFCC Pattern | What It Means |
| --- | --- | --- |
| Happy | High variation in the higher coefficients (6–13); more harmonics; smooth curve; consistent energy in the first few coefficients | Brighter tone, stronger high frequencies, energetic articulation |
| Angry | Stronger values in the lower coefficients (1–5); sharp, jittery transitions, especially in the higher coefficients | Harsh tone, higher energy, sharp pitch/tempo shifts |
| Sad | Flatter, more stable curves overall, especially in the mid coefficients; lower energy; smooth but decaying coefficients (soft delivery) | Lower pitch, softer articulation, slower transitions |
| Neutral | Balanced curves, smooth across all coefficients | Even tone, no extreme shifts in pitch or texture |
MFCCs compress the spectral shape of audio, and emotions modulate pitch, tempo, and articulation, which in turn change that spectral shape. Different coefficients pick up different parts of it:
- Lower coefficients: the overall spectral slope, i.e. energy and pitch.
- Higher coefficients: finer details, i.e. roughness and consonants.
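Since this post doesn't ship emotion-labelled clips, the sketch below assumes two hypothetical recordings, happy.wav and sad.wav (swap in your own), and compares how much the higher coefficients vary over time, which the table above associates with brighter, more energetic speech.

```python
import librosa

def mfcc_variability(path, n_mfcc=13):
    """Per-coefficient standard deviation across time."""
    y, sr = librosa.load(path)
    mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfccs.std(axis=1)

# Hypothetical file names -- replace with your own recordings
happy_std = mfcc_variability("happy.wav")
sad_std = mfcc_variability("sad.wav")

# Higher coefficients (7-13): more variation tends to mean brighter, busier articulation
print("Happy, coefficients 7-13 mean std:", happy_std[6:].mean())
print("Sad,   coefficients 7-13 mean std:", sad_std[6:].mean())
```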
🧠 MFCC Interpretation Tips (What do the 13 coefficients mean?)
When reading an MFCC plot:
1st Coefficient (energy / overall spectral shape): Stable in vowels, fluctuates in dynamic or stressed speech.
2nd–5th Coefficients (formants): Reflect resonance patterns; can shift depending on emotion and vowel articulation.
6th–13th Coefficients (detail): Capture high-frequency noise and nuances. More chaotic in angry or stressed tones.
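If you want numbers rather than eyeballed judgments, frame-to-frame deltas are a handy summary: a coefficient that "fluctuates" has large deltas, a "stable" one has small deltas. A small sketch using librosa's delta features on the trumpet example:

```python
import numpy as np
import librosa

y, sr = librosa.load(librosa.example('trumpet'))
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# First-order deltas approximate the frame-to-frame rate of change
deltas = librosa.feature.delta(mfccs)

for i in range(13):
    print(f"MFCC {i+1:2d}: mean={mfccs[i].mean():8.2f}   mean |delta|={np.abs(deltas[i]).mean():6.2f}")
```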
🛠️ Setup & Libraries
!pip install librosa matplotlib ipywidgets --quiet
🔍 Side-by-Side MFCC Coefficients Interpretation
# MFCC Demystified: Colab-ready script
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np
# Load an audio file. By default this uses librosa's bundled trumpet example;
# in Colab you can upload your own recording and point audio_path at it instead,
# e.g. audio_path = "my_recording.wav"
audio_path = librosa.example('trumpet')
y, sr = librosa.load(audio_path)
# Extract MFCCs
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
# MFCC coefficient interpretations
coef_tips = {
    0: "Low freq. energy - pitch, loudness",
    1: "General spectral slope",
    2: "Changes in pitch",
    3: "Energy variation",
    4: "Middle freq. textures",
    5: "Consonant shape hints",
    6: "Voicing info",
    7: "Noise / fricatives",
    8: "Tonal purity",
    9: "Roughness or grain",
    10: "Energy fluctuation",
    11: "High-freq dynamics",
    12: "Noise floor variations"
}
# Plot each MFCC with its label
fig, axs = plt.subplots(13, 1, figsize=(12, 20), sharex=True)
for i in range(13):
    axs[i].plot(mfccs[i])
    axs[i].set_ylabel(f"MFCC {i+1}\n{coef_tips[i]}", rotation=0, labelpad=50, fontsize=9, va='center')
    axs[i].grid(True)
axs[-1].set_xlabel("Time (frames)")
plt.suptitle("MFCC Coefficient Interpretations (with Labels)", fontsize=16)
plt.tight_layout(rect=[0, 0, 1, 0.96])
plt.show()
📈 Applications of MFCC Comparisons
| Use Case | Description |
| --- | --- |
| 🎙️ Speaker Identification | Every speaker has a unique MFCC "fingerprint". |
| 😃 Emotion Detection | MFCCs reflect energy, pitch, and prosody, which are core to emotion. |
| 🌍 Language ID | Languages have distinct phonetic and frequency patterns. |
| 🧠 NeuroAI | MFCCs serve as proxies for cognitive auditory models. |
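As a taste of the speaker-identification row, a common baseline is to average the MFCCs over time into one "fingerprint" vector per recording and compare fingerprints with cosine similarity. The file names below are placeholders for your own recordings:

```python
import numpy as np
import librosa

def mfcc_fingerprint(path, n_mfcc=13):
    """Mean MFCC vector over time: a crude but useful recording fingerprint."""
    y, sr = librosa.load(path)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).mean(axis=1)

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder file names -- swap in two recordings you want to compare
fp_a = mfcc_fingerprint("speaker_a.wav")
fp_b = mfcc_fingerprint("speaker_b.wav")
print("Fingerprint similarity:", cosine_similarity(fp_a, fp_b))
```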
🔬 Why This Matters: The Human Side of Sound
MFCCs bridge math and meaning. They help us visualize human traits—emotion, tone, origin—through simple curves. With this playground, you're not just seeing sounds—you're seeing people.
Note: While I originally planned to include MFCC comparisons across emotions (like happy, sad, and angry) or languages, I've chosen not to showcase those directly in this blog. However, you can find those comparisons in my GitHub repository, where I've uploaded notebooks and examples. I encourage you to explore them and try generating your own. Now that you understand how to compute and interpret all 13 MFCC coefficients, you're well-equipped to draw meaningful insights from sound.