We present a project that applies AI tools, in particular computer vision, to analyze and better understand diabetic retinopathy. The disease is especially concerning: it affects many diabetic patients and is the leading cause of blindness before age 65 in France.
The project's initiative is both scientific and personal: having relatives with diabetes motivates us to propose a concrete solution to help detect and monitor retinopathy through medical imaging.
We detail how we use a transformer, specifically the Swin Transformer, to identify different forms of retinopathy from fundus images, and how attention map extraction and segmentation using Segment Anything Model (SAM) help better localize and understand lesions.
Diabetic retinopathy is a major complication of diabetes, affecting nearly 50% of type 2 diabetic patients. It primarily results from progressive damage to the small blood vessels irrigating the retina. In France, it is the leading cause of blindness in people under 65.
Detection and classification of retinopathy, particularly through fundus image analysis, is both time-consuming and technically challenging. To address this challenge, we propose a pipeline based on computer vision models to assist healthcare professionals in diagnosing and monitoring this disease.
Beyond technological research, this project is particularly meaningful to us, as family members have diabetes and risk developing ocular complications. This personal dimension motivates us to achieve high performance and make our solution accessible and practical in clinical settings.
The project repository is linked here
Our proposed methodology addresses diabetic retinopathy detection and analysis through a three-step pipeline:
(1) Classification,
(2) Attention Visualization,
and (3) Lesion Segmentation.
Below is a concise overview.
The model outputs probabilities for five classes:
0 - No DR
1 - Mild
2 - Moderate
3 - Severe
4 - Proliferative DR
By analyzing these probabilities, we gauge the image’s likelihood of belonging to each severity level.
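As a minimal sketch of this step, the five raw scores produced by the classifier can be turned into probabilities with a softmax and mapped back to the severity labels above (the function names below are illustrative, not the project's actual API):

```python
import numpy as np

DR_CLASSES = ["No DR", "Mild", "Moderate", "Severe", "Proliferative DR"]

def severity_probs(logits):
    """Convert the model's raw 5-class logits into probabilities via softmax."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max()            # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def predicted_grade(logits):
    """Return (class index, label, probability) for the most likely grade."""
    p = severity_probs(logits)
    i = int(np.argmax(p))
    return i, DR_CLASSES[i], float(p[i])
```

Inspecting the full probability vector, rather than only the argmax, also makes borderline cases visible (e.g., probability mass split between "Mild" and "Moderate").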
This three-step approach—Swin-based classification, attention-based interpretability, and SAM-driven segmentation—offers an end-to-end solution for early, transparent, and accurate diabetic retinopathy diagnosis. It not only enhances clinical decision-making but also builds trust by showing exactly where and how the model detects disease-related changes.
Training uses the dataset from the APTOS 2019 Blindness Detection challenge. It contains train and test images; we use only 90% of the training images (about 3,300).
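A 90/10 split of the APTOS training set could be done as follows. This is a sketch, assuming the held-out 10% serves as a validation set and that labels live in the challenge's `diagnosis` column; stratifying keeps all five severity grades represented in both subsets:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_aptos(df, val_frac=0.10, seed=42):
    """Hold out `val_frac` of the images, stratified by DR grade so that
    every severity level appears in both subsets."""
    return train_test_split(df, test_size=val_frac,
                            stratify=df["diagnosis"], random_state=seed)

# train_df, val_df = split_aptos(pd.read_csv("train.csv"))
```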
We apply the following data augmentation:

```python
from torchvision import transforms

train_transforms = transforms.Compose([
    transforms.Resize((300, 300)),
    transforms.RandomCrop((224, 224)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2,
                           saturation=0.2, hue=0.1),
    transforms.GaussianBlur(kernel_size=(3, 3)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```
Note: We avoid flips and rotations because retinal images have a natural orientation; flipping could destroy or confuse anatomically meaningful cues such as the relative position of the optic disc and macula.
Here are the different results obtained by our solution:
This is the overall attention as explained above. But sometimes it is not enough to detect hot areas, so we also visualize the individual attention maps of each layer.
Per-layer attention maps
SAM can also be parameterized to detect only small lesions, and the size of each detected lesion can then be computed automatically.
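Concretely, SAM's `SamAutomaticMaskGenerator` exposes knobs such as `points_per_side` and `min_mask_region_area`, and each mask it returns is a dict carrying an integer `"area"` field (its size in pixels). A hypothetical post-processing helper for the size computation might look like this; the `px_to_mm2` conversion factor depends on the fundus camera and is an assumption:

```python
def small_lesions(masks, max_area_px, px_to_mm2=None):
    """Keep only small SAM mask proposals (candidate lesions), attaching a size.

    `masks` follows SamAutomaticMaskGenerator's output format: a list of
    dicts, each with an integer "area" (mask size in pixels). If the
    pixel-to-mm^2 factor of the camera is known, sizes are reported in
    mm^2; otherwise they stay in pixels.
    """
    lesions = []
    for m in masks:
        if m["area"] <= max_area_px:
            size = m["area"] * px_to_mm2 if px_to_mm2 is not None else m["area"]
            lesions.append({**m, "size": size})
    return lesions
```

Raising `min_mask_region_area` in the generator suppresses noise speckles, while the `max_area_px` threshold above discards large anatomical structures (optic disc, vessels) so that only lesion-sized regions remain.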
Diabetic retinopathy is a pathology that can lead to blindness when not detected and managed in time. Through this project, we've developed a complete solution, from classification to segmentation, supported by interpretation capabilities offered by attention maps.