Speech AI: Day 2 - Time & Frequency Analysis

We use cookies to improve your browsing experience and to analyze our website traffic. By clicking “Accept All” you agree to our use of cookies. Privacy policy.

Yesterday we established that:

Sound = vibration of air pressure over time

Digitization = sampling + quantization + encoding

After digitization, we have a sequence of numbers x[n], This is the waveform.

It represents Amplitude vs Time, but speech is not just amplitude changes. Speech is a combination of frequencies. So today we answer: “How do we see the frequencies inside a waveform?”

Understanding the Simplest Sound: A Single Sine Wave which is the simplest possible signal.

A sine wave Repeats perfectly, Has one frequency, Has one amplitude, Has one phase

If one cycle takes 0.01 seconds: f=1//0.01 = Hz

This means: 100 cycles per second.A single sine wave contains exactly one frequency..Nothing else.

You can look at it as the “atom” of vibration.

Waveform Properties (Time Domain): When you plot digitial audio, you see:

X-axis - Time
Y-axis - Amplitude
This is called the time domain representation because it answers how signal change over time

What Can We See in Time Domain?

Amplitude: How strong the vibration is. Higher amplitude → louder sound.
Period: If the wave repeats, the time between repetitions. Example: If one full cycle takes 0.01 seconds, that’s the period (T).
Frequency f=1/T, if T=0.01 seconds, f=1/0.0.1=100Hz. This means wave repeats 100 times per second.

Limitation of the Time Domain

If someone plays 200 Hz tone, 400 Hz tone and 800 Hz tone at the same time the waveform becomes complicated and difficult to interpret even though it is just three simple sine waves added together (superposition) you cannot easily see “What frequencies are present?” “How strong is each one?” You only see a messy curve. We need a way to separate the ingredients because speech it self is not a single sine wave.

Core Idea Behind the Fourier Transform

Here is the fundamental truth any complex periodic signal can be expressed as a sum of sine and cosine waves. This is Fourier’s Theorem. The Fourier Transform is a mathematical tool that transforms a signal from the time domain (how it changes over time) to the frequency domain (how much of each frequency is present in the signal). That means Speech. Music, Noise can all be decomposed into pure sine waves. Since sine waves contain one frequency each, If we can measure how much of each sine wave exists then we can understand the signal’s frequency content. That is exactly what the Fourier Transform does.

Why is the Fourier Transform Useful?

The reason the Fourier Transform is so powerful is that many complex signals, like sound, can be represented as a sum of simpler, pure sine waves. Each sine wave has a specific frequency, amplitude, and phase.

So, instead of analyzing a complicated signal directly, we can analyze it by looking at the individual sine waves that make it up.

For example, if you're listening to a song, there are many different frequencies (like bass, mid-range, and treble). The Fourier Transform helps us separate and identify these different frequencies.

How Fourier Transform works

Given a digital signal: x[n], The Fourier Transform asks “How similar is this signal to a sine wave at frequency f?”

(picture of the fouirer transform formular)

For each frequency:

Generate a sine wave at that frequency
Multiply it with the signal
Sum over time

If that frequency exists in the signal: The result is large

If it does not: The result is near zero

Repeating this for many frequencies gives: X(f)

This is the frequency spectrum.

Now we plot:

X-axis - Frequency
Y-axis - Amplitude (or energy)

We have moved from Amplitude vs Time to Amplitude vs Frequency.

Fourier Transform is essentially a Frequency detector, It scans through possible frequencies and measures: “How strong is this frequency inside the signal?”

Example: Fourier Transform of a Pure Tone

Let’s take an example of a pure sine wave of 100 Hz. In the time domain, it looks like this: A smooth, oscillating wave (it repeats every 0.01 seconds, because 100 Hz = 1/0.01).

If you apply the Fourier Transform to this pure sine wave, the result will show: A single peak at 100 Hz in the frequency domain, indicating that the signal contains only one frequency: 100 Hz.

This is simple because a pure sine wave has just one frequency.

Real-World Application

But, in real-life signals like speech, the sound is made up of many frequencies, so the Fourier Transform is essential to analyze all of them. For example, if you listen to speech, you don’t hear a single pure tone. Instead, you hear a combination of frequencies that change over time (the vowels, consonants, etc.).

The Fourier Transform helps separate these frequencies and analyze how they evolve, making it much easier to study the structure of the sound.

Discrete Fourier Transform (DFT)

In digital systems, we use the Discrete Fourier Transform (DFT) because our signal is sampled.

If we have N samples: The DFT gives N frequency bins where each bin represents: A specific frequency component. Computing DFT directly costs: O(N2) but using the Fast Fourier Transform (FFT) algorithm: O(NlogN). This efficiency makes real-time speech processing possible.

The Big Problem: Speech Changes Over Time

The standard Fourier Transform analyzes the entire signal at once. That assumes the signal is: Stationary (its frequency content does not change). But speech is non-stationary. When you say: “hello”. The frequency content of “he” is different from “llo”.

If we analyze the whole word at once we know what frequencies exist overall but we lose information about when they occur.

To fix this, we assume speech is approximately stationary over short windows (20–25 ms): This is Short Fourier Transform.

So we:

Slice the waveform into short overlapping frames
Apply DFT to each frame
Slide forward
Repeat

This produces: X(t, f), Now we have frequency information at each moment in time.

Important parameters:

Window size (e.g., 25 ms): How long each slice of the signal is. Example: 25 ms at 16 kHz sampling rate: 0.025 x 16000 = 400 samples
Hop size (e.g., 10 ms)
Window function (e.g., Hamming window)

The window function reduces spectral leakage by smoothing edges.

Spectrogram: Making It Visual

The magnitude of the STFT can be displayed as an image. This is called spectogram.

(image of spectrogram)

Axes:

X-axis → Time
Y-axis → Frequency
Color → Energy

Quick Links

Understanding Fourier Transform: https://www.youtube.com/watch?v=iOsGkk63NfE