🎧 Decoding Audio with MFCCs: A Visual & Mathematical Journey

Hamna Kaleem

🔍 Introduction

You’ve probably heard how AI models can recognize accents, emotions, or even impersonations. But how does a machine see sound? In this post, we’ll use MFCCs—Mel-Frequency Cepstral Coefficients—to compare two audio signals side by side, and even split a single sample to analyze how its features evolve over time.

Rather than walking through the MFCC pipeline in the abstract, this post goes hands-on with comparison, using visual analysis to reveal differences between languages or emotional states.


🧠 What Are MFCCs (Mel-Frequency Cepstral Coefficients)?

MFCCs are a compact, information-rich way to represent the timbre of audio: the quality that gives a sound its texture, as distinct from its pitch or loudness.

They’re inspired by how human hearing works.

MFCCs are widely used in audio analysis because they mimic how humans perceive sound. Our ears don't respond linearly to frequencies; we're more sensitive to certain ranges. MFCCs capture this nonlinear behavior by modeling log-spaced frequency bands.
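For example, equal steps in hertz are not equal steps on the Mel scale. Here's a quick sanity check using librosa's converter (with htk=True, which applies the same 2595·log10 formula shown in step 4 below):

import librosa

# Equal 1000 Hz steps shrink on the Mel scale as frequency rises,
# mirroring the ear's decreasing sensitivity at high frequencies
for f_hz in [500, 1500, 2500, 3500]:
    m = librosa.hz_to_mel(f_hz, htk=True)  # htk=True uses 2595 * log10(1 + f/700)
    print(f"{f_hz} Hz -> {m:.1f} mel")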

📐 MFCC Workflow (Mathematical View):

  1. Pre-Emphasis (Optional):

    y[n] = x[n] - \alpha x[n - 1]

  2. Framing & Windowing:

    • Split signal into short frames.

    • Apply a Hamming window w[n].

  3. Short-Time Fourier Transform (STFT):

    X(k) = \sum_{n=0}^{N-1} x[n] \cdot w[n] \cdot e^{-j 2\pi kn/N}

  4. Mel Filter Bank:

    • Apply a bank of filters spaced on the Mel scale:

m = 2595 \cdot \log_{10}\left(1 + \frac{f}{700}\right)

  5. Logarithmic Compression:

    • Human ears perceive loudness logarithmically, so take the log of each filter-bank energy.
  6. Discrete Cosine Transform (DCT):

    • Decorrelate the features (a runnable NumPy sketch of all six steps follows this list):

c[n] = \sum_{k=1}^{K} \log(E_k) \cos\left[\frac{\pi n}{K}\left(k - 0.5\right)\right]
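To make the math above concrete, here is a minimal sketch of all six steps in NumPy/SciPy, using librosa only for loading audio and building the Mel filter bank. The 25 ms frames, 10 ms hop, 1024-point FFT, and 40 Mel filters are common defaults, not the only valid choices:

import numpy as np
import librosa
from scipy.fft import dct

y, sr = librosa.load(librosa.example('trumpet'))

# 1. Pre-emphasis: y[n] = x[n] - alpha * x[n-1]
alpha = 0.97
y_pre = np.append(y[0], y[1:] - alpha * y[:-1])

# 2. Framing (25 ms frames, 10 ms hop) & Hamming windowing
frame_len, hop = int(0.025 * sr), int(0.010 * sr)
frames = librosa.util.frame(y_pre, frame_length=frame_len, hop_length=hop)
frames = frames * np.hamming(frame_len)[:, None]

# 3. STFT: power spectrum of each windowed frame
n_fft = 1024
power = np.abs(np.fft.rfft(frames, n=n_fft, axis=0)) ** 2

# 4. Mel filter bank: 40 triangular filters spaced on the Mel scale
mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=40)
mel_energies = mel_fb @ power

# 5. Logarithmic compression (small epsilon avoids log(0))
log_mel = np.log(mel_energies + 1e-10)

# 6. DCT to decorrelate; keep the first 13 coefficients
mfccs = dct(log_mel, axis=0, norm='ortho')[:13]
print(mfccs.shape)  # (13, n_frames)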


🧠 The Pipeline (What Happens Under the Hood)

Let’s break it down step by step:

| Step | What it does | Human analogy |
| --- | --- | --- |
| 1️⃣ Pre-emphasis | Boosts high frequencies | Like adjusting treble in your speakers |
| 2️⃣ Framing | Splits audio into 20–40 ms chunks | Like snapshots in time |
| 3️⃣ Windowing | Smooths the edges of chunks | Prevents edge distortion |
| 4️⃣ FFT | Converts to frequency domain | Tells what frequencies are present |
| 5️⃣ Mel filterbank | Applies filters spaced like the human ear | Focuses on what we hear best |
| 6️⃣ Log energy | Turns amplitudes into decibels | Log scale = how we perceive loudness |
| 7️⃣ DCT (Discrete Cosine Transform) | Decorrelates features | Summarizes the "shape" of the sound spectrum |

Why 13 MFCCs?

Of all the coefficients you could compute, the first few capture most of the relevant information:

| Coefficient # | Represents |
| --- | --- |
| 1 (MFCC[0]) | Overall energy of the signal |
| 2–4 | Broad spectral envelope (vowel-like) |
| 5–13 | Fine details like consonants, textures |

👉 Think of MFCCs as a sound fingerprint – the first few coefficients capture general sound shape, later ones add texture.

You can extract more than 13, but 13 is a default sweet spot (historically from speech recognition).
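In librosa, the number of coefficients is just the n_mfcc parameter, so trying more than 13 is a one-line change:

import librosa

y, sr = librosa.load(librosa.example('trumpet'))
mfcc_13 = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # the classic default
mfcc_20 = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)  # extra coefficients add finer texture detail
print(mfcc_13.shape, mfcc_20.shape)  # (13, n_frames) (20, n_frames)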

🧪 Why MFCC Plots Seem Meaningless (Until Now)

| Coefficient Range | What It Represents | Visual Pattern | Interpretation |
| --- | --- | --- | --- |
| 1st–2nd | Broad energy & pitch contour | Smooth, high magnitude | Vowels / sustained notes |
| 3rd–6th | Mid-level structure & formants | Slight wobbles | Speech articulation (e.g. "ka", "ma") |
| 7th–13th | Fine noise and texture (timbre details) | Rapid jagged changes | Fricatives, noisy textures like "shhh", "fff" |
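One way to see these three regimes for yourself is to plot each band of coefficients separately; the lower bands should look smooth while the 7th–13th look jagged. A sketch using the same trumpet example as the script below:

import librosa
import matplotlib.pyplot as plt

y, sr = librosa.load(librosa.example('trumpet'))
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Group coefficients into the three ranges from the table above
bands = {
    "1st-2nd: energy & pitch contour": mfccs[0:2],
    "3rd-6th: structure & formants": mfccs[2:6],
    "7th-13th: fine noise & texture": mfccs[6:13],
}

fig, axs = plt.subplots(3, 1, figsize=(10, 8), sharex=True)
for ax, (name, band) in zip(axs, bands.items()):
    ax.plot(band.T)  # one line per coefficient in the band
    ax.set_title(name, fontsize=10)
axs[-1].set_xlabel("Time (frames)")
plt.tight_layout()
plt.show()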

🔍 How MFCCs Reflect Emotions

| Emotion | MFCC Pattern | What It Means |
| --- | --- | --- |
| Happy | High variation in higher coefficients (6–13); more harmonics; smooth curve; consistent energy in first few coefficients | Brighter tone, stronger high frequencies, energetic articulation |
| Angry | Stronger values in lower coefficients (1–5); sharp, jittery transitions, especially in higher coefficients | Harsh tone, higher energy, sharp pitch/tempo shifts |
| Sad | Flatter, more stable curves overall, especially in mid coefficients; lower energy; smooth but decaying coefficients (soft delivery) | Lower pitch, softer articulation, slower transitions |
| Neutral | Balanced curves, smooth across all coefficients | Even tone, no extreme shifts in pitch or texture |
  • MFCCs compress the spectral shape of audio.

  • Emotions modulate pitch, tempo, and articulation, which in turn change the spectral shape.

  • MFCC coefficients are sensitive to different frequency bands (a rough numeric check is sketched after this list):

    • Lower coefficients: overall spectral slope (energy and pitch)

    • Higher coefficients: finer details (roughness and consonants)
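If you have recordings of the same phrase in different emotions, a rough numeric check of these claims is to compare per-coefficient variance over time. The file names here are placeholders for your own clips:

import librosa

def mfcc_variance_profile(path):
    """Variance of each MFCC coefficient over time: a crude measure of expressiveness."""
    y, sr = librosa.load(path)
    mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return mfccs.var(axis=1)

# Placeholder file names -- substitute recordings of your own
var_happy = mfcc_variance_profile("happy.wav")
var_angry = mfcc_variance_profile("angry.wav")

# Per the table above, expect happy speech to vary more in the higher
# coefficients (6-13) and angry speech to dominate the lower ones (1-5)
print("happy, coefs 6-13 variance:", var_happy[5:].mean())
print("angry, coefs 6-13 variance:", var_angry[5:].mean())
print("angry, coefs 1-5 variance: ", var_angry[:5].mean())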


🧠 MFCC Interpretation Tips (What do the 13 coefficients mean?)

When reading an MFCC plot:

  • 1st Coefficient (energy / overall spectral shape): Stable in vowels, fluctuates in dynamic or stressed speech.

  • 2nd–5th Coefficients (formants): Reflect resonance patterns; can shift depending on emotion and vowel articulation.

  • 6th–13th Coefficients (detail): Capture high-frequency noise and nuances. More chaotic in angry or stressed tones.


🛠️ Setup & Libraries

!pip install librosa matplotlib ipywidgets --quiet

🔍 Side-by-Side MFCC Coefficients Interpretation

# MFCC Demystified: Colab-ready script
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# Load your own audio file here (or use the bundled trumpet example)
# In Colab, upload a file via the Files sidebar and point audio_path at it:
# audio_path = "/content/your_audio.wav"
audio_path = librosa.example('trumpet')
y, sr = librosa.load(audio_path)

# Extract MFCCs
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Loose, intuitive readings of each coefficient (not formal definitions)
coef_tips = {
    0: "Low freq. energy - pitch, loudness",
    1: "General spectral slope",
    2: "Changes in pitch",
    3: "Energy variation",
    4: "Middle freq. textures",
    5: "Consonant shape hints",
    6: "Voicing info",
    7: "Noise / fricatives",
    8: "Tonal purity",
    9: "Roughness or grain",
    10: "Energy fluctuation",
    11: "High-freq dynamics",
    12: "Noise floor variations"
}

# Plot each MFCC with its label
fig, axs = plt.subplots(13, 1, figsize=(12, 20), sharex=True)
for i in range(13):
    axs[i].plot(mfccs[i])
    axs[i].set_ylabel(f"MFCC {i+1}\n{coef_tips[i]}", rotation=0, labelpad=50, fontsize=9, va='center')
    axs[i].grid(True)

axs[-1].set_xlabel("Time (frames)")
plt.suptitle("MFCC Coefficient Interpretations (with Labels)", fontsize=16)
plt.tight_layout(rect=[0, 0, 1, 0.96])
plt.show()
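And to get the side-by-side comparison promised in the introduction, the same extraction can be run on two clips and shown as heatmaps. Both clips here are librosa's bundled examples; 'brahms' is just a stand-in for your own second file:

import librosa
import librosa.display
import matplotlib.pyplot as plt

y1, sr1 = librosa.load(librosa.example('trumpet'))
y2, sr2 = librosa.load(librosa.example('brahms'))  # stand-in for your second clip

mfcc1 = librosa.feature.mfcc(y=y1, sr=sr1, n_mfcc=13)
mfcc2 = librosa.feature.mfcc(y=y2, sr=sr2, n_mfcc=13)

# Plot the two MFCC matrices side by side as heatmaps
fig, axs = plt.subplots(1, 2, figsize=(14, 4))
for ax, mfcc, sr, title in [(axs[0], mfcc1, sr1, "Trumpet"),
                            (axs[1], mfcc2, sr2, "Brahms")]:
    img = librosa.display.specshow(mfcc, sr=sr, x_axis='time', ax=ax)
    ax.set_title(title)
    ax.set_ylabel("MFCC coefficient")
fig.colorbar(img, ax=axs, format="%+2.f")  # reference colorbar (scaled to the right panel)
plt.show()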


📈 Applications of MFCC Comparisons

| Use Case | Description |
| --- | --- |
| 🎙️ Speaker Identification | Every speaker has a unique MFCC "fingerprint". |
| 😃 Emotion Detection | MFCCs reflect energy, pitch, and prosody, which are core to emotion. |
| 🌍 Language ID | Languages have distinct phonetic and frequency patterns. |
| 🧠 NeuroAI | MFCCs serve as proxies for cognitive auditory models. |

🔬 Why This Matters: The Human Side of Sound

MFCCs bridge math and meaning. They help us visualize human traits—emotion, tone, origin—through simple curves. With this playground, you're not just seeing sounds—you're seeing people.

Note: While I originally planned to include MFCC comparisons across emotions (like happy, sad, and angry) or languages, I've chosen not to showcase those directly in this blog. However, you can find those comparisons in my GitHub repository, where I've uploaded notebooks and examples. I encourage you to explore them and try generating your own — now that you understand how to compute and interpret all 13 MFCC coefficients, you're well-equipped to draw meaningful insights from sound.
