
# Environmental Sound Classification Based on Vision Transformers


## Environmental Sound Classification: Vision Transformers and CNNs in Action

🌿🎡 Welcome to Environmental Sound Classification (ESC), a project that applies transformer-based architectures and Convolutional Neural Networks (CNNs) to the unique challenges of environmental sound classification. Using models such as Vision Transformers (ViT) and the Audio Spectrogram Transformer (AST), this project explores new ways to process environmental audio data.


## 🎯 Project Highlights

- **State-of-the-Art Models**: Demonstrates the potential of ViT and AST for environmental sound classification.
- **Superior Accuracy**: AST achieved 88.35% validation accuracy, outperforming both the CNN and ViT baselines.
- **Sustainability Applications**: Supports ecological monitoring, biodiversity research, and urban soundscape analysis.

## 🧪 Problem Statement

Environmental sound classification presents unique challenges:

- Environmental sounds are often polyphonic and lack stable temporal structure.
- Traditional CNN-based approaches struggle to capture long-range dependencies.

This project tests the hypothesis that transformer-based models can match or outperform CNNs in accuracy and efficiency on ESC tasks.


## 📊 Key Results

### Model Performance

| Model | Validation Accuracy |
| --- | --- |
| CNN (ResNet-50) | 60% |
| Vision Transformer (ViT) | 40% |
| Audio Spectrogram Transformer (AST) | 88.35% |

### Visualized Results

Training curves for each model are available in the repository:

- CNN training loss
- ViT validation accuracy
- AST training and validation loss


## 🚀 Methodology

### 🛠 Data Preparation

- **Dataset**: Bird sound recordings from 20 species.
- **Preprocessing**: Audio converted into mel spectrograms using the Librosa library.
- **Augmentation**: Normalization and scaling to enhance features.

### 🧠 Model Architectures

1. **CNN (ResNet-50)**:
   - Fine-tuned for ESC from pretrained weights.
   - Showed overfitting after 60 epochs.
2. **Vision Transformer (ViT)**:
   - Adapted for spectrogram inputs.
   - Struggled on the small dataset due to its lack of inductive biases.
3. **Audio Spectrogram Transformer (AST)**:
   - Pretrained on AudioSet, then fine-tuned for the spectrogram classification task.
   - Leveraged overlapping patches for fine-grained feature extraction.

πŸ” Metrics and Validation

  • Validation Accuracy: Assessed model performance.
  • Training/Validation Loss Trends: Evaluated convergence and overfitting.
  • Statistical Testing: Welch's t-tests validated AST's superior performance.
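Welch's t-test (which does not assume equal variances between groups) is available in SciPy via `ttest_ind(..., equal_var=False)`. The accuracy values below are hypothetical placeholders for per-run validation accuracies, not the project's actual measurements.

```python
from scipy import stats

# Hypothetical per-run validation accuracies (illustrative only).
ast_acc = [0.883, 0.879, 0.886, 0.881, 0.884]
cnn_acc = [0.601, 0.594, 0.605, 0.598, 0.602]

# Welch's t-test: unequal-variance variant of the two-sample t-test.
t_stat, p_value = stats.ttest_ind(ast_acc, cnn_acc, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.2e}")
```

A small p-value here would indicate the accuracy gap is unlikely to arise from run-to-run noise alone.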

## 🌟 Innovations

### 1️⃣ Transformer Models for Sound

- AST demonstrated superior capability in audio classification through its spectrogram-specific design.

### 2️⃣ Sustainable Solutions

- Enables non-invasive environmental monitoring and public-health applications.

### 3️⃣ Multimodal Applications

- Potential to combine audio and visual data for richer ESC analytics.

## 📬 Engage with Us!

Interested in our work? Questions, ideas, and contributions are welcome; feel free to open an issue or a pull request.

🌟 Thank you for exploring Environmental Sound Classification! Together, let's redefine audio analytics for a better future. 🐦🎧✨