In the previous articles, we explored how speech signals are transformed into meaningful representations.
We began with the physics of sound, examining how speech exists as pressure waves in air. We then moved through frequency analysis, learning how the Fourier Transform reveals the frequency components of speech. After that, we explored mel-scale representations, which approximate how humans perceive pitch, and MFCCs, which have been a foundational feature representation in speech recognition for decades.
Most recently, we examined pitch, formants, and prosody, which capture important characteristics of how speech evolves over time.
All of these techniques share a common idea: we design features manually based on our understanding of speech production and perception.
For many years, speech recognition systems relied heavily on these carefully engineered features.
However, modern speech AI systems increasingly rely on a different paradigm: Self-Supervised Learning (SSL).
Instead of relying on handcrafted features, these models learn representations directly from raw audio by training on massive amounts of unlabeled speech.
This shift has fundamentally transformed speech technology.
Training traditional speech recognition systems requires labeled datasets.
A typical training example looks like this:
Audio recording → “The quick brown fox jumps over the lazy dog.”
Each audio clip must be paired with its correct text transcription.
Creating such datasets is extremely expensive and time-consuming because the transcription must be performed manually by humans.
For many languages around the world, especially in Africa and other underrepresented regions, this presents a serious challenge. Common issues include the high cost of annotation, a shortage of trained transcribers, and limited written resources for the language. At the same time, there is an enormous amount of unlabeled speech data available, such as radio broadcasts, podcasts, and everyday recorded conversations.
Self-supervised learning allows models to learn from this raw audio without requiring human annotations.
Self-supervised learning is a training paradigm in which a model creates its own supervision signal from the input data.
Instead of learning from labels provided by humans, the model learns by solving pretext tasks derived from the structure of the data itself.
In speech processing, these tasks typically involve forcing the model to predict or reconstruct parts of the signal.
Examples include predicting masked portions of the audio, reconstructing parts of the signal from surrounding context, and distinguishing the true continuation of a signal from distractors.
By solving these tasks, the model learns general-purpose speech representations.
These learned representations can later be adapted for downstream tasks such as speech recognition, speaker identification, emotion recognition, and language identification.
Once the model has learned these representations, only a small amount of labeled data is required to fine-tune it for a specific application.
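As a rough illustration of the masked-prediction idea, here is a minimal NumPy sketch. The sequence length, feature dimension, 30% mask rate, and the trivial "predictor" are all illustrative stand-ins, not the configuration of any real model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "feature sequence": 20 frames of 8-dimensional acoustic features.
T, D = 20, 8
features = rng.standard_normal((T, D))

# Mask roughly 30% of the frames: the model sees a mask token there instead.
mask = rng.random(T) < 0.3
mask_token = np.zeros(D)                        # stand-in for a learned embedding
model_input = np.where(mask[:, None], mask_token, features)

# The supervision signal comes from the data itself: the hidden frames.
targets = features[mask]

# Trivial stand-in "predictor": the mean of all visible frames.
prediction = model_input[~mask].mean(axis=0)
loss = np.mean((targets - prediction) ** 2)     # reconstruction error to minimize
print(f"masked {mask.sum()} of {T} frames, baseline loss = {loss:.3f}")
```

A real model replaces the mean-of-visible-frames baseline with a deep network, but the shape of the task is the same: no human label appears anywhere, yet there is a well-defined error to minimize.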
One of the most influential self-supervised speech models is wav2vec 2.0, introduced by researchers at Meta AI.
Unlike earlier speech systems that relied on features like MFCCs or mel spectrograms, wav2vec 2.0 operates directly on raw waveforms.
The architecture consists of three major components.
The first stage is a convolutional neural network that processes the raw waveform.
Speech signals are continuous time-series data, and convolutional layers are well-suited for capturing local temporal patterns.
This network transforms the waveform into a sequence of latent acoustic representations.
These latent features capture patterns such as signal energy, pitch cues, and local spectral shape.
In effect, the model learns its own internal representation of low-level acoustic features.
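The feature encoder can be sketched in NumPy. The kernel sizes and strides below follow the published wav2vec 2.0 configuration (an overall downsampling factor of 320, about 49 feature frames per second of 16 kHz audio), but the filters here are random and untrained, and the channel width of 16 is illustrative (the real model uses 512):

```python
import numpy as np

rng = np.random.default_rng(1)

def conv1d(x, w, stride):
    """'Valid' 1-D convolution. x: (T, C_in), w: (C_out, K, C_in) -> (T_out, C_out)."""
    T, _ = x.shape
    C_out, K, _ = w.shape
    T_out = (T - K) // stride + 1
    out = np.empty((T_out, C_out))
    for t in range(T_out):
        window = x[t * stride : t * stride + K]              # (K, C_in)
        out[t] = np.tensordot(w, window, axes=([1, 2], [0, 1]))
    return out

# One second of fake 16 kHz waveform, shaped (samples, channels).
x = rng.standard_normal((16000, 1))

# Stack of strided convolutions, using wav2vec 2.0's kernel/stride schedule.
for k, s in [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]:
    w = rng.standard_normal((16, k, x.shape[1])) * 0.1       # random filters
    x = np.maximum(conv1d(x, w, s), 0)                       # ReLU nonlinearity

print(x.shape)  # a short sequence of latent feature vectors
```

Each layer shortens the sequence and widens the receptive field, so one second of raw audio collapses into a few dozen latent vectors, each summarizing roughly 20-25 ms of signal.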
The latent acoustic features are then passed into a Transformer encoder.
Transformers are sequence models that rely on self-attention mechanisms.
Self-attention allows each time step in the speech signal to incorporate information from other time steps.
In speech processing, this enables the model to capture relationships such as coarticulation between neighboring sounds, patterns that span whole words, and longer-range dependencies across a phrase.
Through multiple layers of self-attention and feed-forward transformations, the network builds increasingly context-aware speech representations.
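A single self-attention step fits in a few lines of NumPy. This is one head with random, untrained weights and toy dimensions, purely to show the mechanics:

```python
import numpy as np

rng = np.random.default_rng(2)

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over a sequence x: (T, D)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[1])          # (T, T) frame-to-frame scores
    scores -= scores.max(axis=1, keepdims=True)     # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over time steps
    return weights @ v, weights

T, D = 6, 8                                         # 6 latent frames from the CNN
latents = rng.standard_normal((T, D))
wq, wk, wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))
out, attn = self_attention(latents, wq, wk, wv)

# Row t of `attn` says how strongly output frame t draws on every input frame,
# so each output vector mixes information from the whole sequence.
print(attn.shape)
```

Because every row of the attention matrix sums to one, each output frame is a weighted average over all time steps, which is exactly what lets distant context flow into a local representation.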
During training, spans of the latent representation sequence are masked before they enter the Transformer. For each masked position, the model must identify the correct latent representation among a set of distractor candidates sampled from other time steps. This objective is known as contrastive learning. By forcing the model to distinguish the correct acoustic representation from incorrect ones, training encourages the network to learn meaningful speech structure.
Importantly, this learning occurs without any text transcripts.
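The contrastive objective itself is compact. The sketch below scores a context vector against the true latent and a set of distractors using cosine similarity; the dimensions, temperature, and data are illustrative (wav2vec 2.0 additionally quantizes the candidate latents, which is omitted here):

```python
import numpy as np

rng = np.random.default_rng(3)

def contrastive_loss(context, true_latent, distractors, temperature=0.1):
    """Negative log-probability of picking the true latent among distractors."""
    candidates = np.vstack([true_latent[None, :], distractors])
    # Cosine similarity between the context vector and every candidate.
    sims = candidates @ context / (
        np.linalg.norm(candidates, axis=1) * np.linalg.norm(context)
    )
    logits = sims / temperature
    m = logits.max()
    log_z = m + np.log(np.exp(logits - m).sum())    # log of the softmax normalizer
    return -(logits[0] - log_z)                     # true candidate is index 0

D = 16
true_latent = rng.standard_normal(D)
distractors = rng.standard_normal((9, D))           # latents from other time steps

good_context = true_latent + 0.1 * rng.standard_normal(D)   # close to the truth
bad_context = rng.standard_normal(D)                        # unrelated

print(contrastive_loss(good_context, true_latent, distractors))  # small
print(contrastive_loss(bad_context, true_latent, distractors))   # larger
```

A context vector that resembles the true latent gets a low loss; an unrelated one gets a high loss. Minimizing this loss is what pushes the Transformer's outputs toward informative predictions of the masked content.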
Several other architectures have extended the ideas introduced by wav2vec 2.0.
HuBERT (Hidden-Unit BERT) uses a different training strategy.
Instead of directly predicting masked acoustic representations, it predicts cluster assignments derived from acoustic features.
The process works in stages: frames are first clustered (initially using simple acoustic features such as MFCCs) to produce discrete pseudo-labels; the model is trained to predict the cluster assignments of masked frames; the clustering is then re-run on the model's own learned representations, and training repeats with these improved targets.
Even though the cluster labels are imperfect, they provide enough structure for the model to learn meaningful speech representations.
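The first clustering stage can be sketched with a minimal k-means over synthetic MFCC-like frames. The 13-dimensional features, the choice of 8 clusters, and the random data are all illustrative (HuBERT uses far more frames and clusters):

```python
import numpy as np

rng = np.random.default_rng(4)

def kmeans(x, k, iters=20):
    """Minimal k-means: returns centroids and a cluster id for every frame."""
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    labels = np.zeros(len(x), dtype=int)
    for _ in range(iters):
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)           # nearest centroid per frame
        for j in range(k):
            if (labels == j).any():
                centroids[j] = x[labels == j].mean(axis=0)
    return centroids, labels

# Synthetic stand-in for MFCC frames: 200 frames, 13 coefficients each.
frames = rng.standard_normal((200, 13))

# Stage 1: cluster the frames; the cluster ids become discrete pseudo-labels.
centroids, pseudo_labels = kmeans(frames, k=8)

# Stage 2 (not shown): train the model to predict `pseudo_labels` at masked
# frames, then re-cluster the model's own representations and repeat.
print(np.bincount(pseudo_labels, minlength=8))  # frames per pseudo-label
```

The cluster ids act like a noisy, automatically generated phone-like vocabulary: crude, but consistent enough to serve as prediction targets.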
WavLM further extends the SSL framework by learning representations that capture both the spoken content and the characteristics of the speaker.
This makes WavLM particularly effective across a wide range of speech tasks, including speech recognition, speaker verification, and speaker diarization.
An interesting research question is:
What types of information are represented in different layers of self-supervised speech models?
Researchers investigate this using probing experiments.
A typical probing experiment involves freezing the pretrained model, extracting activations from one layer at a time, and training a small classifier to predict a property of interest, such as phoneme identity or speaker identity. If a simple classifier succeeds, the layer must encode that property.
These studies reveal a consistent pattern.
Early layers capture low-level acoustic features, such as energy, pitch, and spectral shape.
These representations resemble the information captured by traditional signal processing techniques.
Middle layers tend to represent phonetic information, including phoneme identity and the boundaries between speech sounds.
This is often where the model's representations are most useful for speech recognition tasks.
Higher layers capture more abstract linguistic structure, including word identity and aspects of meaning.
In effect, the model builds increasingly abstract representations as information flows through the network.
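A probing classifier is easy to sketch. Below, synthetic "frozen activations" stand in for a real model's hidden states (two classes made linearly separable plus noise), and a plain logistic-regression probe is trained on them; all sizes and the data itself are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic "frozen activations": 200 frames, 16 dims, two classes (e.g. two
# phonemes) separated along one direction in the representation space.
n, d = 200, 16
labels = rng.integers(0, 2, size=n)
direction = rng.standard_normal(d)
direction /= np.linalg.norm(direction)
reps = rng.standard_normal((n, d)) + np.where(labels[:, None] == 1, 2.0, -2.0) * direction

# The probe: logistic regression trained by gradient descent on frozen vectors.
w, b = np.zeros(d), 0.0
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(reps @ w + b)))   # predicted P(class 1)
    w -= 0.5 * (reps.T @ (p - labels)) / n
    b -= 0.5 * (p - labels).mean()

accuracy = (((reps @ w + b) > 0) == (labels == 1)).mean()
print(f"probe accuracy: {accuracy:.2f}")
```

The logic of the experiment lives in what is *not* trained: the representations stay fixed, so high probe accuracy can only mean the information was already present in that layer.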

To better understand how self-supervised models organize speech internally, I ran a small experiment using the pretrained wav2vec 2.0 base model.
For a small dataset of Igbo speech recordings, I passed each clip through the pretrained model, extracted its hidden representations, averaged them over time to obtain a single embedding per recording, and projected the embeddings into two dimensions with PCA.
This allows us to visualize how the model groups speech signals in its learned representation space.
https://colab.research.google.com/drive/1q9q3Vi5H3z_mhrHqH2KerhINEJtB1CT_?usp=sharing
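Since the notebook depends on downloading the pretrained model, here is a self-contained sketch of the final projection step. The 768-dimensional "embeddings" below are synthetic stand-ins for mean-pooled wav2vec 2.0 hidden states (the word names match the experiment, but the vectors are fabricated for illustration), and PCA is done directly with an SVD:

```python
import numpy as np

rng = np.random.default_rng(6)

# Stand-ins for mean-pooled hidden states: one 768-dim vector per recording.
# Recordings of the same word are modeled as noisy copies of a word centroid.
words = ["akwa", "akwa", "akwa", "gini", "gini", "kedu", "mba"]
centroid = {w: rng.standard_normal(768) for w in set(words)}
embeddings = np.stack([centroid[w] + 0.3 * rng.standard_normal(768) for w in words])

# PCA by hand: center the vectors, then project onto the top-2 principal axes.
centered = embeddings - embeddings.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
coords = centered @ vt[:2].T                 # (n_recordings, 2) points to plot

for w, (x, y) in zip(words, coords):
    print(f"{w:5s} ({x:+7.2f}, {y:+7.2f})")  # same-word points land near each other
```

Scatter-plotting `coords` reproduces the kind of picture described below: recordings of the same word fall close together, while acoustically distinct words separate.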
Initially, I attempted to visualize embeddings using only multiple variations of the same word.
However, this produced weak separation in the PCA visualization. This outcome is not surprising: recordings of the same word often share very similar acoustic characteristics, especially when spoken by the same speaker.
To make the experiment more informative, I expanded the dataset to include several different words, while still including multiple recordings of the same word.
This allowed clearer patterns to emerge in the embedding space.
The PCA projection revealed a meaningful structure in the learned embeddings.
Several recordings of the word “akwa” appeared clustered together, indicating that the model captured their shared phonetic structure.
Similarly, the two recordings of “gini” appeared close together in the embedding space.
The word “kedu” appeared farther from these clusters, while “mba” appeared isolated in another region of the visualization.
These patterns suggest that the model organizes speech embeddings based on acoustic and phonetic similarity.
Words that share similar sound structures tend to appear closer together, while acoustically distinct words are placed farther apart.
Even with a small dataset, this experiment provides an intuitive glimpse into how self-supervised speech models internally represent spoken language.
Self-supervised learning is particularly important for languages with limited annotated datasets.
Because these models can be trained on unlabeled audio, it becomes possible to build speech systems using large collections of recordings such as radio archives, community media, and everyday conversations.
Only a relatively small labeled dataset is then required to fine-tune the system for tasks like speech recognition.
For many underrepresented languages, this approach may be the most practical path toward building robust speech technology.
Across this series, we have gradually moved from signal-level representations toward neural representation learning.
Earlier techniques relied on carefully designed features such as Fourier spectra, mel spectrograms, MFCCs, and measurements of pitch, formants, and prosody.
Self-supervised learning shifts this responsibility to the model itself.
Instead of manually designing features, we allow the model to discover them automatically from large collections of speech data.