Understanding Speech Features: Fourier Transform, Spectrograms, and MFCCs

Hamna Kaleem · 5 min read

Speech processing is a critical part of AI and audio technology. In this article, we’ll explore fundamental techniques used for speech feature extraction:

  • Fourier Transform & Spectrograms (Visualizing speech in the frequency domain)

  • Mel-Frequency Cepstral Coefficients (MFCCs) (Key features for modeling the human speech signal)

1️⃣ Fourier Transform & Spectrograms

What is the Fourier Transform?

Sound waves are typically represented in the time domain (waveforms), but analyzing their frequency components is crucial. The Fourier Transform (FT) converts a time-domain signal into its frequency components.
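
For intuition, here is a minimal sketch (using a synthetic 440 Hz sine wave rather than real speech, with NumPy and Matplotlib) of how the Fourier Transform reveals a signal's frequency content; the resulting plot shows a single sharp peak at 440 Hz:

import numpy as np
import matplotlib.pyplot as plt

# Synthesize 1 second of a 440 Hz sine wave (a stand-in for real audio)
sr = 16000                          # sample rate in Hz
t = np.arange(sr) / sr              # time axis: 0 to 1 s
y = np.sin(2 * np.pi * 440 * t)

# Fourier Transform: time domain -> frequency domain
spectrum = np.abs(np.fft.rfft(y))          # magnitude per frequency bin
freqs = np.fft.rfftfreq(len(y), d=1 / sr)  # bin frequencies in Hz

plt.plot(freqs, spectrum)
plt.xlim(0, 1000)
plt.xlabel("Frequency (Hz)")
plt.ylabel("Magnitude")
plt.title("FFT of a 440 Hz sine wave")
plt.show()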

The Short-Time Fourier Transform (STFT) is commonly used in speech processing to create spectrograms, which display how frequencies change over time.

Spectrograms

A spectrogram is a visual representation of a signal's frequency content as it varies with time. Unlike waveforms, which show amplitude over time, spectrograms reveal which frequencies are present and how strong they are.

Types of Spectrograms

  • Short-Time Fourier Transform (STFT) Spectrogram – Standard time-frequency representation.

  • Log-Scaled Spectrogram – Amplitude shown in decibels (log scale), which matches human loudness perception more closely.

  • Mel Spectrogram – Based on the Mel scale, which mimics human hearing.

  • MFCC Spectrogram – A compressed representation derived from the Mel spectrogram, commonly used in speech recognition.

  • CQT (Constant-Q Transform) Spectrogram – Good for musical applications.

  • Chromagram – Highlights pitch classes (useful in music & speech emotion detection).

🔹 How to generate a spectrogram in Python?

import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt

def plot_waveform_and_spectrograms(audio_path):
    # Load audio file
    y, sr = librosa.load(audio_path, sr=None)

    # Compute different spectrograms
    D = np.abs(librosa.stft(y))  # Short-Time Fourier Transform (STFT) magnitude
    D_db = librosa.amplitude_to_db(D, ref=np.max)  # convert amplitude to dB (log scale)

    mel_spectrogram = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    mel_spectrogram_db = librosa.power_to_db(mel_spectrogram, ref=np.max)  # melspectrogram returns power, so use power_to_db

    mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

    cqt = np.abs(librosa.cqt(y, sr=sr, hop_length=512))
    cqt_db = librosa.amplitude_to_db(cqt, ref=np.max)

    chroma = librosa.feature.chroma_stft(y=y, sr=sr, hop_length=512)

    # Create subplots
    fig, ax = plt.subplots(3, 2, figsize=(15, 12))

    # Waveform
    ax[0, 0].plot(np.linspace(0, len(y) / sr, num=len(y)), y, color='b')
    ax[0, 0].set_title("Waveform")
    ax[0, 0].set_xlabel("Time (s)")
    ax[0, 0].set_ylabel("Amplitude")

    # STFT Spectrogram
    librosa.display.specshow(D_db, sr=sr, x_axis='time', y_axis='log', ax=ax[0, 1])
    ax[0, 1].set_title("STFT Spectrogram")

    # Mel Spectrogram
    librosa.display.specshow(mel_spectrogram_db, sr=sr, x_axis='time', y_axis='mel', ax=ax[1, 0])
    ax[1, 0].set_title("Mel Spectrogram")

    # MFCCs
    librosa.display.specshow(mfccs, sr=sr, x_axis='time', ax=ax[1, 1])
    ax[1, 1].set_title("MFCCs")

    # Constant-Q Transform (CQT)
    librosa.display.specshow(cqt_db, sr=sr, x_axis='time', y_axis='cqt_note', ax=ax[2, 0])
    ax[2, 0].set_title("Constant-Q Transform (CQT)")

    # Chromagram
    librosa.display.specshow(chroma, sr=sr, x_axis='time', y_axis='chroma', ax=ax[2, 1])
    ax[2, 1].set_title("Chromagram")

    plt.tight_layout()
    plt.show()

# Example Usage
audio_path = "audio.wav"  # Replace with your audio file
plot_waveform_and_spectrograms(audio_path)

🔍 How to Interpret a Spectrogram

A spectrogram is a 2D image that encodes three dimensions of an audio signal:

  • X-axis (Time) → Represents time progression.

  • Y-axis (Frequency) → Shows different frequency components.

  • Color Intensity (Amplitude/Power) → Represents the energy (loudness) of each frequency at a given time.

import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt

# Load an audio file
y, sr = librosa.load("audio.wav", sr=None)

# Compute the Short-Time Fourier Transform (STFT)
D = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)

# Display the Spectrogram
plt.figure(figsize=(10, 4))
librosa.display.specshow(D, sr=sr, x_axis='time', y_axis='log')
plt.colorbar(format='%+2.0f dB')
plt.title("Spectrogram")
plt.show()

📊 Example Spectrogram Interpretation

Scenario: Suppose you have a spectrogram of someone saying "Hello".

  • Low-frequency bands (below 300 Hz) → Represent voiced sounds (e.g., vowels like "e" in "hello").

  • Higher frequency components (above 3000 Hz) → Capture fricatives (e.g., "h" sound).

  • Vertical dark bands → Indicate plosive bursts (e.g., "p", "t", "k"; note that "hello" itself contains no plosives).

  • Continuous smooth areas → Represent vowels, which have a longer duration.

📌 Observing Key Features

  1. Formants: Dark horizontal bands that indicate resonant frequencies of speech sounds.

  2. Harmonics: Parallel lines visible in periodic sounds like vowels.

  3. Silences/Gaps: Regions with little energy at any frequency indicate pauses.

Example:
🚀 A word like "Hello" will show broadband noise energy at the start (the "h"), followed by distinct formant bands for the vowels "e" and "o".

2️⃣ Mel-Frequency Cepstral Coefficients (MFCCs)

What are MFCCs?

MFCCs are widely used in speech recognition because they mimic how humans perceive sound. The human ear is more sensitive to some frequencies than others, so MFCCs use the Mel scale to focus on perceptually important features, giving a compact representation of the speech signal's spectral properties.

Why are MFCCs Important?

  • The human ear is more sensitive to certain frequencies.

  • MFCCs use the Mel scale, which compresses high frequencies (see the quick check after this list).

  • Unlike raw spectrograms, MFCCs capture speech-relevant information efficiently.
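
To see this compression concretely: the widely used (HTK) Mel formula is mel = 2595 · log10(1 + f / 700), and librosa exposes it via librosa.hz_to_mel (passing htk=True selects that exact formula). The same step in Hz shrinks on the Mel axis at higher frequencies:

import librosa

# Mel scale (HTK formula): mel = 2595 * log10(1 + f / 700)
for lo, hi in [(100, 500), (4000, 4400)]:
    step = librosa.hz_to_mel(hi, htk=True) - librosa.hz_to_mel(lo, htk=True)
    print(f"{lo}-{hi} Hz spans {step:.0f} mel")

# The same 400 Hz step covers ~457 mel at low frequencies
# but only ~92 mel near 4 kHz: high frequencies are squeezed together.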

How MFCCs are Computed

  1. Pre-emphasis – Boost high frequencies.

  2. Framing – Divide speech into small time windows.

  3. Windowing – Apply a Hamming window to reduce discontinuities.

  4. FFT (Fast Fourier Transform) – Convert to the frequency domain.

  5. Mel Filter Bank – Apply triangular filters to mimic human hearing.

  6. Logarithm & Discrete Cosine Transform (DCT) – Extract compact feature vectors.
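
The six steps above can be sketched directly with NumPy, SciPy, and librosa's filter-bank helper. This is an illustrative walkthrough, not a drop-in replacement for librosa.feature.mfcc (the frame length, hop size, and filter counts below are common but arbitrary choices, and librosa's defaults differ slightly):

import numpy as np
import scipy.fft
import librosa

y, sr = librosa.load("audio.wav", sr=None)

# 1. Pre-emphasis: boost high frequencies
y = np.append(y[0], y[1:] - 0.97 * y[:-1])

# 2. Framing: 25 ms windows with a 10 ms hop (common choices)
frame_len, hop = int(0.025 * sr), int(0.010 * sr)
n_frames = 1 + (len(y) - frame_len) // hop
frames = np.stack([y[i * hop : i * hop + frame_len] for i in range(n_frames)])

# 3. Windowing: a Hamming window reduces edge discontinuities
frames *= np.hamming(frame_len)

# 4. FFT: power spectrum of each frame
n_fft = 1 << (frame_len - 1).bit_length()   # next power of two >= frame length
power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2

# 5. Mel filter bank: 26 triangular filters spaced on the Mel scale
mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=26)
mel_energies = power @ mel_fb.T

# 6. Logarithm + DCT: compact, decorrelated coefficients; keep the first 13
log_mel = np.log(mel_energies + 1e-10)
mfccs = scipy.fft.dct(log_mel, type=2, axis=1, norm="ortho")[:, :13]

print(mfccs.shape)  # (n_frames, 13); note librosa returns (13, n_frames)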

🔹 Extracting MFCCs using Librosa:

import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("audio.wav", sr=None)

mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
plt.figure(figsize=(10, 4))
librosa.display.specshow(mfccs, sr=sr, x_axis='time')
plt.colorbar()
plt.title("MFCCs")
plt.show()

🔍 How to Interpret MFCCs

MFCCs are compact representations of the spectrogram, emphasizing speech-relevant features while ignoring unnecessary frequency information.

📊 Reading MFCCs

  • X-axis (Time): Shows the progression of speech over time.

  • Y-axis (MFCC Coefficients): The coefficient index. Each coefficient summarizes the spectral envelope at a different level of detail; it is derived from Mel-scaled frequency bands, so it is related to, but not the same as, a spectrogram's frequency axis.

  • Color Intensity: Represents energy in that frequency band.

📌 Understanding MFCC Components

  • Lower MFCCs (1-2): Represent the overall spectral shape and energy (related to loudness).

  • Middle MFCCs (3-8): Capture important phoneme information (helps in speech recognition).

  • Higher MFCCs (9+): Represent fine spectral details, which are often ignored in ASR models.

📌 Key Observations

  1. Different words produce different MFCC patterns.

    • Vowel sounds → Show stable, smooth coefficient variations.

    • Consonants → Produce sharper changes.

    • Fricatives (like “s” or “sh”) → Appear as high-frequency variations.

  2. MFCCs are useful for distinguishing speakers and speech emotions.

    • Different speakers will have slightly different MFCC patterns due to vocal tract differences.

    • Emotions affect energy distribution, leading to varying MFCC patterns.
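
As a minimal sketch of this idea (the file names are placeholders, and averaging MFCCs over time is only the simplest possible summary; real speaker and emotion models use much richer statistics), you can reduce each utterance to a mean MFCC vector and compare them:

import numpy as np
import librosa

def mean_mfcc(path, n_mfcc=13):
    # Utterance-level feature: MFCCs averaged over time
    y, sr = librosa.load(path, sr=None)
    mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfccs.mean(axis=1)  # one n_mfcc-dimensional vector per file

# Hypothetical recordings of two different speakers
a = mean_mfcc("speaker_a.wav")
b = mean_mfcc("speaker_b.wav")

# Smaller distance -> more similar spectral characteristics
print("Euclidean distance:", np.linalg.norm(a - b))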

🎯 Conclusion

  • The Fourier Transform helps analyze frequencies in speech.

  • Spectrograms visualize how frequencies change over time.

  • MFCCs are key features for human speech models.
