End-to-End Vietnamese Text Recognition (OCR)
Overview
This project presents an end-to-end OCR system specifically designed for Vietnamese scene text recognition. It processes the MC_OCR dataset and implements a two-stage architecture:
- Text Detection using PaddleOCR
- Text Recognition using a custom CNN + Transformer model built in PyTorch
Key Features
- Two-Stage OCR Pipeline: separates detection from recognition for modularity and accuracy
- Custom Recognition Model: CNN Encoder + Transformer Decoder
- Vietnamese Dataset Support: parses and trains on MC_OCR (polygon-based labels)
- Evaluation: uses Character Error Rate (CER) and Sequence Accuracy
- Visual Output: draws bounding boxes and predicted text on the original image
Model Architecture
Encoder - CNN
- Inspired by CRNN
- Converts text-line images into sequential visual features
Decoder - Transformer
- Uses self-attention over previously generated characters and cross-attention over the encoder features
- Autoregressively generates characters one by one
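
The character-by-character generation described above can be illustrated with a minimal greedy decoding loop. This is only a sketch, not the notebook's code: `greedy_decode`, `toy_scorer`, and the `<sos>`/`<eos>` tokens are hypothetical names, and the toy scorer stands in for the real Transformer decoder, which would score the next character from the CNN features and the characters generated so far.

```python
from typing import Callable, Dict, List

SOS, EOS = "<sos>", "<eos>"  # hypothetical special tokens, not taken from the notebook


def greedy_decode(
    next_token_scores: Callable[[List[str]], Dict[str, float]],
    max_len: int = 32,
) -> str:
    """Generate characters one at a time until EOS is chosen or max_len is reached."""
    tokens = [SOS]
    for _ in range(max_len):
        scores = next_token_scores(tokens)   # scores conditioned on the prefix so far
        best = max(scores, key=scores.get)   # greedy choice: highest-scoring token
        if best == EOS:
            break
        tokens.append(best)
    return "".join(tokens[1:])               # drop the start token


def toy_scorer(target: str) -> Callable[[List[str]], Dict[str, float]]:
    """Stand-in for a decoder: always prefers the next character of `target`."""
    vocab = set(target) | {EOS}

    def scores(prefix: List[str]) -> Dict[str, float]:
        pos = len(prefix) - 1                # number of characters generated so far
        nxt = target[pos] if pos < len(target) else EOS
        return {tok: (1.0 if tok == nxt else 0.0) for tok in vocab}

    return scores


print(greedy_decode(toy_scorer("Tổng cộng")))
```

Greedy selection is the simplest decoding strategy; whether the notebook uses greedy or beam search is not stated here.
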
Steps
- Open the notebook in Google Colab
- Connect to GPU runtime
- Mount Drive and install required libraries
- Run the notebook sequentially:
  - Preprocessing: parses MC_OCR CSV with polygon annotations
  - Model: defines CNN + Transformer OCRModel
  - Training: saves best model based on validation CER
  - Evaluation: computes final CER & Sequence Accuracy
  - Inference: visualizes prediction on sample images
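
As an illustration of the preprocessing step, a polygon label can be reduced to an axis-aligned box for cropping a text line before recognition. This is a sketch under assumptions: the flat `[x1, y1, x2, y2, ...]` layout and the helper name `polygon_to_bbox` are hypothetical, and the notebook's actual parsing of the MC_OCR CSV may differ.

```python
import math
from typing import List, Tuple


def polygon_to_bbox(polygon: List[float]) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) enclosing a flat [x1, y1, x2, y2, ...] polygon."""
    if len(polygon) < 6 or len(polygon) % 2 != 0:
        raise ValueError("expected a flat list of at least three (x, y) points")
    xs, ys = polygon[0::2], polygon[1::2]    # even indices are x, odd indices are y
    # Round outwards so the box never cuts into the annotated region.
    return (
        math.floor(min(xs)),
        math.floor(min(ys)),
        math.ceil(max(xs)),
        math.ceil(max(ys)),
    )


box = polygon_to_bbox([10.5, 20.0, 110.2, 22.0, 109.0, 48.7, 9.8, 47.0])
print(box)  # (9, 20, 111, 49)
```

The `(left, top, right, bottom)` order matches what Pillow's `Image.crop` expects, so the tuple could be passed to it directly if Pillow is used for cropping.
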
Evaluation Results
| Metric | Score |
|---|---|
| Test Loss | 0.4042 |
| Character Error Rate (CER) | 0.1502 |
| Sequence Accuracy | 59.03% |
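
For reference, the two metrics can be computed roughly as follows. This is a sketch, not the notebook's evaluation code: it assumes CER is the total Levenshtein distance divided by the total number of reference characters (pooled over the test set) and that strings are NFC-normalized first, since Vietnamese diacritics can be encoded in composed or decomposed form. The function names are hypothetical.

```python
import unicodedata
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def _nfc(text: str) -> str:
    # Assumption: compare composed forms so that "o" + combining accent equals "ó".
    return unicodedata.normalize("NFC", text)


def cer(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Character Error Rate: total edits divided by total reference characters."""
    refs = [_nfc(r) for r in refs]
    hyps = [_nfc(h) for h in hyps]
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return total_edits / total_chars if total_chars else 0.0


def sequence_accuracy(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Fraction of predictions that match the reference exactly."""
    if not refs:
        return 0.0
    matches = sum(_nfc(r) == _nfc(h) for r, h in zip(refs, hyps))
    return matches / len(refs)
```

Under this definition a single wrong character makes the whole sequence count as incorrect, which is why Sequence Accuracy can be much lower than `1 - CER`.
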