In the previous days, we gradually built the foundation for understanding how speech is represented for machines.

Day 1 introduced how sound becomes digital:
Air pressure vibration → microphone → sampled waveform x[n]
This waveform is simply air pressure values measured over time.
Day 2 showed how we analyze this waveform using the Short-Time Fourier Transform (STFT). Instead of only seeing amplitude over time, we discovered how to reveal which frequencies exist at each moment, producing the spectrogram.
X(t,f)
A spectrogram shows which frequencies are present at each moment in time, and with what energy.
Day 3 introduced an important insight:
The spectrogram describes physical sound, but human hearing does not perceive sound linearly.
So we modified the representation using the Mel frequency scale and logarithmic amplitude compression.
This gave us the Log-Mel Spectrogram, which already resembles how the human ear organizes sound.
But in many speech systems, we go one step further.
We convert this representation into something called MFCCs (Mel-Frequency Cepstral Coefficients).
MFCCs are one of the most influential feature representations in speech processing. They take the spectral information of speech and compress it into a small set of numbers that capture the important characteristics of the vocal tract.
To understand why MFCCs work and how they are computed, we need to carefully walk through the entire MFCC pipeline.
A raw waveform contains enormous detail.
For example, if speech is sampled at 16,000 Hz, that means:
1 second of audio → 16,000 samples
If we feed raw samples directly into classical machine learning systems, two problems arise:
1. Redundancy: Neighboring samples are highly correlated. Much of the information is repeated.
2. Irrelevant detail: Speech recognition does not need every microscopic vibration of air. What matters more is the shape of the frequency spectrum, which is determined by how the vocal tract filters sound.
MFCCs aim to represent exactly this.
They compress speech into a compact representation that captures the spectral envelope of the signal, which carries phonetic information.
Instead of thousands of samples per second, we may only use 13–40 coefficients per frame.
But how do we get from the waveform to those coefficients?
We do it through a sequence of carefully designed steps.
The first operation slightly amplifies high frequencies.
The pre-emphasis filter is defined as:
y[n] = x[n] − αx[n−1]
where α is a constant, typically around 0.95–0.97.
This equation subtracts a fraction of the previous sample from the current sample.
This acts like a high-pass filter, which boosts higher frequencies relative to lower ones.
This step is needed because natural speech has a spectral tilt.
Due to the physics of sound production in the vocal tract, lower frequencies carry more energy than higher ones: the spectrum rolls off as frequency increases.
If we analyze speech without correcting this, the low-frequency components dominate the spectrum.
Pre-emphasis partially balances the spectrum so that higher-frequency information (which can be important for distinguishing consonants) becomes more visible.
Conceptually:
Original spectrum → dominated by low frequencies
After pre-emphasis → more balanced spectrum
This prepares the signal for better frequency analysis.
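As a concrete illustration, here is a minimal NumPy sketch of the pre-emphasis filter. The helper name `pre_emphasis` and the toy two-tone test signal are purely illustrative; α = 0.97 is a typical choice, not a value fixed by the text.

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    """Apply the pre-emphasis filter y[n] = x[n] - alpha * x[n-1]."""
    # The first sample has no predecessor, so it is kept as-is.
    return np.append(x[0], x[1:] - alpha * x[:-1])

# A toy signal: a slow (low-frequency) tone plus a weak fast (high-frequency) ripple.
t = np.arange(400) / 16000.0
x = np.sin(2 * np.pi * 100 * t) + 0.1 * np.sin(2 * np.pi * 6000 * t)
y = pre_emphasis(x)
```

After filtering, the high-frequency ripple makes up a noticeably larger share of the signal's energy, which is exactly the spectral rebalancing described above.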
Speech is constantly changing.
If we analyze a long segment of speech all at once, the signal is not stationary. Its statistical properties change over time as phonemes are produced.
However, over very short intervals, speech behaves approximately stationary.
So we split the signal into small overlapping segments called frames.
Typical parameters are a frame length of 25 ms and a frame shift (hop) of 10 ms.
For example:
|-----25ms-----|
      |-----25ms-----|
            |-----25ms-----|
Each frame overlaps substantially with the next one: with a 25 ms frame and a 10 ms shift, consecutive frames share 15 ms.
Why overlap?
Because speech transitions smoothly. Overlapping frames ensure we do not lose important information between boundaries.
Each frame is analyzed independently.
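The framing step can be sketched with NumPy indexing. The `frame_signal` helper name is hypothetical, and trailing samples that do not fill a whole frame are simply dropped, which is one common convention among several.

```python
import numpy as np

def frame_signal(x, frame_len, hop_len):
    """Split a 1-D signal into overlapping frames of length frame_len,
    advancing hop_len samples per frame (incomplete tail frames are dropped)."""
    n_frames = 1 + (len(x) - frame_len) // hop_len
    # Build a (n_frames, frame_len) index matrix, then gather in one step.
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    return x[idx]

sr = 16000
frame_len = int(0.025 * sr)   # 25 ms -> 400 samples
hop_len = int(0.010 * sr)     # 10 ms -> 160 samples
x = np.random.randn(sr)       # 1 second of noise as a stand-in signal
frames = frame_signal(x, frame_len, hop_len)
```

With these parameters, one second of audio yields 98 frames of 400 samples each.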
When we cut a frame from the waveform, we introduce sharp edges at the boundaries.
These abrupt edges create artificial frequency components when we perform a Fourier transform. This phenomenon is called spectral leakage.
To reduce this problem, we multiply each frame by a window function.
The most common window used in speech processing is the Hamming window.
w[n] = 0.54 − 0.46 cos(2πn/(N−1))
The window gradually reduces the amplitude near the edges of the frame.
Instead of abruptly cutting the signal, the window gently tapers it toward zero.
So the frame transitions smoothly from zero → full amplitude → zero.
This reduces artificial frequencies in the spectral analysis.
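Here is the Hamming window written out directly from its formula, as a quick sketch; the frame length of 400 samples is just the 25 ms example from above.

```python
import numpy as np

N = 400                     # frame length (25 ms at 16 kHz)
n = np.arange(N)
# Hamming window: w[n] = 0.54 - 0.46 * cos(2*pi*n / (N - 1))
w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))

frame = np.random.randn(N)  # a stand-in frame
windowed = frame * w        # taper the frame edges toward (near) zero
```

The endpoints of the window are 0.54 − 0.46 = 0.08 rather than exactly zero; this is a deliberate property of the Hamming window, and NumPy's built-in `np.hamming` computes the same values.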
Now we analyze the frequency content of each frame.
We apply the Fast Fourier Transform (FFT), which efficiently computes the Discrete Fourier Transform.
X[k] = Σ_{n=0}^{N−1} x[n] e^{−j2πkn/N}
This transformation converts the signal from the time domain into the frequency domain.
The result tells us how strongly each frequency is present in the frame, along with its phase.
Usually we compute the power spectrum:
P[k] = |X[k]|^2
This gives the energy present at each frequency.
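To make this concrete, here is a minimal NumPy sketch computing the power spectrum of one windowed frame. The 512-point FFT size and the 1 kHz test tone are illustrative choices, not values from the text.

```python
import numpy as np

sr = 16000
N = 512                                   # FFT size (400-sample frame is zero-padded)
t = np.arange(400) / sr
frame = np.sin(2 * np.pi * 1000 * t)      # a pure 1 kHz tone
frame = frame * np.hamming(400)           # windowing, as in the previous step

X = np.fft.rfft(frame, n=N)               # one-sided spectrum: N//2 + 1 bins
P = np.abs(X) ** 2                        # power spectrum P[k] = |X[k]|^2

freqs = np.fft.rfftfreq(N, d=1.0 / sr)    # bin centre frequencies in Hz
peak = freqs[np.argmax(P)]                # the strongest bin sits near 1000 Hz
```

Because the signal is real-valued, `rfft` returns only the non-negative frequencies, which is all we need for the power spectrum.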
At this point we have a spectrum for each frame.
However, the frequencies are still linearly spaced in Hertz, which does not reflect how humans perceive pitch.
This is where we reconnect with Day 3.
Human pitch perception is not linear.
The difference between 200 Hz and 300 Hz feels much larger than the difference between 5200 Hz and 5300 Hz, even though both differences are 100 Hz.
This happens because the earβs frequency resolution is approximately logarithmic.
To reflect this, we transform the frequency axis into the Mel scale, which approximates perceived pitch.
The Mel scale is defined as:
Mel(f) = 2595 log10(1 + f/700)
Using this scale, we create a bank of triangular filters spaced evenly in Mel units.
Each filter captures energy from a small frequency region.
The energy in the (m)-th Mel band is computed as:
Mm = Σk P[k] Hm[k]
where Hm[k] is the (m)-th filter.
Instead of hundreds of FFT frequency bins, we now obtain around 20–40 Mel band energies.
This step does two things: it compresses hundreds of frequency bins into a few dozen band energies, and it spaces those bands according to perceived pitch rather than raw Hertz.
Next we apply a logarithm to each Mel band energy:
Lm = log(Mm)
This step is motivated by how humans perceive loudness.
The ear does not perceive loudness linearly.
Instead, loudness perception roughly follows a logarithmic relationship with physical sound intensity.
For example, doubling the physical intensity of a sound does not make it sound twice as loud; perceived loudness grows much more slowly than intensity.
The log operation also compresses the dynamic range of the signal, making features more stable and easier for models to learn from.
At this stage we have the Log-Mel spectrum.
The final step converts the Log-Mel spectrum into MFCC coefficients.
Adjacent Mel bands are strongly correlated because the spectrum changes smoothly.
The Discrete Cosine Transform (DCT) decorrelates these values.
cn = Σ_{m=1}^{M} Lm cos(πn(m − 0.5)/M)
The DCT transforms the spectral information into a new space where the coefficients are largely uncorrelated and most of the envelope information is concentrated in the first few coefficients.
Usually we keep only 12–13 coefficients.
These numbers are the Mel-Frequency Cepstral Coefficients.
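Using SciPy's DCT-II, the last two steps (log compression followed by the DCT) can be sketched as follows. The random stand-in energies and the choice of 26 bands are illustrative; `norm='ortho'` selects the orthonormal scaling, which matches the formula above up to a constant factor.

```python
import numpy as np
from scipy.fftpack import dct

n_bands = 26
M_energies = np.random.rand(n_bands) + 1e-3   # stand-in Mel band energies (positive)
L = np.log(M_energies)                        # log compression from the previous step

# DCT-II of the log energies. Coefficient 0 reflects overall energy;
# coefficients 1..13 describe the shape of the spectral envelope.
c = dct(L, type=2, norm='ortho')
mfcc = c[1:14]                                # keep 13 coefficients
```

Some systems keep the zeroth coefficient as an energy feature and some discard it; dropping it, as here, is one common convention.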
If we connect all the steps, the transformation looks like this:
Raw waveform
↓
Pre-emphasis
↓
Framing
↓
Windowing
↓
FFT → Power Spectrum
↓
Mel Filterbank
↓
Log Compression
↓
Discrete Cosine Transform
↓
MFCC Feature Vector
Each frame of speech becomes a small vector such as:
[c1, c2, c3, ..., c13]
These vectors are what classical speech recognition systems use as input.
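Putting it all together, here is a compact end-to-end sketch of the pipeline above in NumPy/SciPy. It is illustrative rather than production code: the parameter values (25 ms frames, 10 ms hop, 26 filters, 13 coefficients) are typical defaults, and details such as filter normalization and whether to keep the zeroth coefficient vary between implementations.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_pipeline(x, sr=16000, n_fft=512, n_filters=26, n_coeffs=13,
                  alpha=0.97, frame_ms=25, hop_ms=10):
    """End-to-end MFCC sketch following the steps above (for illustration,
    not a drop-in replacement for a tested library implementation)."""
    # 1. Pre-emphasis: y[n] = x[n] - alpha * x[n-1]
    x = np.append(x[0], x[1:] - alpha * x[:-1])

    # 2. Framing into overlapping frames
    flen, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    n_frames = 1 + (len(x) - flen) // hop
    idx = np.arange(flen)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx]

    # 3. Hamming window
    frames = frames * np.hamming(flen)

    # 4. FFT -> power spectrum
    P = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2

    # 5. Mel filterbank: triangular filters evenly spaced in Mel
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = np.floor((n_fft + 1) *
                     imel(np.linspace(0, mel(sr / 2), n_filters + 2)) / sr).astype(int)
    H = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        l, c, r = edges[m - 1], edges[m], edges[m + 1]
        H[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        H[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    M = P @ H.T

    # 6. Log compression (small floor avoids log(0))
    L = np.log(np.maximum(M, 1e-10))

    # 7. DCT; keep the first n_coeffs coefficients (including c0 here)
    return dct(L, type=2, norm='ortho', axis=1)[:, :n_coeffs]

# One second of a synthetic two-harmonic, vowel-like signal.
sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 150 * t) + 0.5 * np.sin(2 * np.pi * 450 * t)
feats = mfcc_pipeline(x, sr)
```

Each row of `feats` is one frame's MFCC vector, so one second of audio becomes a small (frames × 13) matrix instead of 16,000 raw samples.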
Speech sounds are produced when the vocal cords generate a sound source, which is then filtered by the shape of the vocal tract.
The resulting sound contains a spectral envelope that reflects the configuration of the mouth, tongue, and throat.
MFCCs are designed specifically to capture this envelope while ignoring irrelevant details like pitch harmonics.
Because of this, MFCCs became a standard feature representation in speech recognition, speaker identification, and many other audio processing systems.
Even many modern deep learning systems still use Log-Mel or MFCC features as their input representation.
Think of the MFCC pipeline as a series of transformations that move from raw physics to perceptual abstraction.
First we observe the physical signal.
Then we reshape it according to how humans hear.
Finally, we compress it into a compact numerical description of speech structure.
In other words:
Air vibration
↓
Waveform
↓
Spectral analysis
↓
Perceptual frequency scaling
↓
Compact speech features
MFCCs are essentially a mathematical approximation of the early auditory processing performed by the human ear.