Jul 15, 2025 ● No License

OCR system for Vietnamese text recognition

Deep Learning
NLP
OCR
Transformer

Pham Huynh Tin

🧾 End-to-End Vietnamese Text Recognition (OCR)

🧠 Overview

This project is an end-to-end OCR system for Vietnamese scene text recognition. It is trained and evaluated on the MC_OCR dataset and uses a two-stage architecture:

Text Detection using PaddleOCR
Text Recognition using a custom CNN + Transformer model built in PyTorch
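The two stages above can be wired together roughly as follows. This is a minimal sketch, not the notebook's code: `detect_fn` and `recognize_fn` are hypothetical callables standing in for a PaddleOCR detection wrapper and the custom recognition model, and cropping by the polygon's axis-aligned bounding rectangle is an assumption about how text lines are cut out.

```python
import numpy as np

def crop_polygon(image, polygon):
    """Crop the axis-aligned bounding rectangle of a polygon from an H x W (x C) image."""
    xs = [int(round(x)) for x, _ in polygon]
    ys = [int(round(y)) for _, y in polygon]
    h, w = image.shape[:2]
    # Clamp to the image bounds so slightly out-of-range detections do not fail.
    x0, x1 = max(min(xs), 0), min(max(xs), w)
    y0, y1 = max(min(ys), 0), min(max(ys), h)
    return image[y0:y1, x0:x1]

def run_ocr(image, detect_fn, recognize_fn):
    """Stage 1 finds text polygons; stage 2 reads each cropped region."""
    results = []
    for polygon in detect_fn(image):
        crop = crop_polygon(image, polygon)
        if crop.size == 0:  # skip degenerate boxes
            continue
        results.append((polygon, recognize_fn(crop)))
    return results

# Stand-ins (hypothetical) so the sketch runs without any model weights.
def fake_detector(image):
    return [[(2, 1), (8, 1), (8, 4), (2, 4)]]

def fake_recognizer(crop):
    return f"crop {crop.shape[1]}x{crop.shape[0]}"

image = np.zeros((10, 12), dtype=np.uint8)
results = run_ocr(image, fake_detector, fake_recognizer)
print(results[0][1])  # crop 6x3
```

Keeping the two stages behind plain callables means either one can be swapped or tested on its own, which is the modularity benefit the pipeline is aiming for.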

✨ Key Features

✅ Two-Stage OCR Pipeline: separates detection from recognition for modularity and accuracy
🧠 Custom Recognition Model: CNN Encoder + Transformer Decoder
📦 Vietnamese Dataset Support: parses and trains on MC_OCR (polygon-based labels)
📊 Evaluation: uses Character Error Rate (CER) & Sequence Accuracy
🎯 Visual Output: draws bounding boxes and predicted text on the original image

🧱 Model Architecture

🔹 Encoder – CNN

Inspired by CRNN
Converts text-line images into sequential visual features
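As an illustration of how a CRNN-style encoder turns an image into a sequence, here is a minimal PyTorch sketch. The layer sizes, the class name `CNNEncoder`, and the grayscale 32 x 128 input are all assumptions made for the example, not values taken from the notebook. The key idea is that height is collapsed while width is kept, so each remaining column becomes one time step.

```python
import torch
import torch.nn as nn

class CNNEncoder(nn.Module):
    """Hypothetical CRNN-style encoder: image -> sequence of column features."""

    def __init__(self, in_channels=1, d_model=256):
        super().__init__()

        def block(c_in, c_out, pool):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(pool),
            )

        self.features = nn.Sequential(
            block(in_channels, 64, (2, 2)),
            block(64, 128, (2, 2)),
            block(128, d_model, (2, 1)),  # pool height only, keep width resolution
        )
        # Average over whatever height remains; None keeps the width unchanged.
        self.pool_height = nn.AdaptiveAvgPool2d((1, None))

    def forward(self, images):
        x = self.features(images)   # (B, d_model, H/8, W/4)
        x = self.pool_height(x)     # (B, d_model, 1, W/4)
        x = x.squeeze(2)            # (B, d_model, W/4)
        return x.permute(0, 2, 1)   # (B, W/4, d_model): one vector per image column

encoder = CNNEncoder().eval()
with torch.no_grad():
    features = encoder(torch.randn(2, 1, 32, 128))
print(features.shape)  # torch.Size([2, 32, 256])
```

The output has the `(batch, sequence, feature)` layout that a Transformer decoder with `batch_first=True` can attend over directly.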

🔹 Decoder – Transformer

Uses self-attention over previously generated characters and cross-attention over the encoder features
Generates characters autoregressively, one at a time
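The autoregressive decoding described above can be sketched with PyTorch's built-in `nn.TransformerDecoder`. Everything specific here is an assumption for illustration: the class name `CharDecoder`, the learned positional embedding, the model sizes, the special token ids, and the use of greedy decoding (the notebook may use a different strategy).

```python
import torch
import torch.nn as nn

class CharDecoder(nn.Module):
    """Hypothetical character-level Transformer decoder."""

    def __init__(self, vocab_size, d_model=64, nhead=4, num_layers=2, max_len=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)  # learned positions (assumption)
        layer = nn.TransformerDecoderLayer(
            d_model, nhead, dim_feedforward=4 * d_model, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, memory):
        # tokens: (B, T) character ids; memory: (B, S, d_model) encoder features
        T = tokens.size(1)
        positions = torch.arange(T, device=tokens.device)
        x = self.embed(tokens) + self.pos(positions)
        # Causal mask: -inf above the diagonal blocks attention to future characters.
        causal = torch.triu(
            torch.full((T, T), float("-inf"), device=tokens.device), diagonal=1
        )
        x = self.decoder(x, memory, tgt_mask=causal)
        return self.out(x)  # (B, T, vocab_size)

@torch.no_grad()
def greedy_decode(decoder, memory, sos_id, eos_id, max_len=32):
    """Start from <sos> and append the most likely next character each step."""
    B = memory.size(0)
    tokens = torch.full((B, 1), sos_id, dtype=torch.long, device=memory.device)
    finished = torch.zeros(B, dtype=torch.bool, device=memory.device)
    for _ in range(max_len - 1):
        logits = decoder(tokens, memory)
        next_token = logits[:, -1].argmax(dim=-1)
        # Sequences that already ended keep emitting <eos> as padding.
        next_token = next_token.masked_fill(finished, eos_id)
        tokens = torch.cat([tokens, next_token.unsqueeze(1)], dim=1)
        finished |= next_token == eos_id
        if finished.all():
            break
    return tokens

decoder = CharDecoder(vocab_size=20).eval()
memory = torch.randn(2, 10, 64)  # stand-in for CNN encoder output
decoded = greedy_decode(decoder, memory, sos_id=1, eos_id=2, max_len=8)
print(decoded.shape)
```

During training the same `forward` is typically called once on the whole shifted target sequence (teacher forcing), and the causal mask is what keeps that consistent with step-by-step inference.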

📋 Steps

Open the notebook in Google Colab
Connect to a GPU runtime
Mount Google Drive and install the required libraries
Run the notebook sequentially:
- 📊 Preprocessing: parses MC_OCR CSV with polygon annotations
- 🧠 Model: defines CNN + Transformer OCRModel
- 🔁 Training: saves the best model based on validation CER
- 🧪 Evaluation: computes final CER & Sequence Accuracy
- 🖼️ Inference: visualizes predictions on sample images
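For the preprocessing step, a minimal sketch of turning polygon annotations into per-line training samples might look like the following. The column names (`img_id`, `anno_polygons`, `anno_texts`), the `segmentation` key, and the `|||` text separator are assumptions about the CSV layout and should be checked against the actual file. The sample row is made up for the example.

```python
import ast
import csv
import io

# Made-up row in the assumed layout: polygons as a stringified list of dicts,
# transcriptions joined by "|||" in the same order.
SAMPLE_CSV = '''img_id,anno_polygons,anno_texts
demo.jpg,"[{'segmentation': [[10, 20, 110, 20, 110, 50, 10, 50]]}, {'segmentation': [[10, 60, 90, 60, 90, 85, 10, 85]]}]",CỬA HÀNG|||Tổng cộng
'''

def polygon_to_bbox(flat_coords):
    """Convert [x1, y1, x2, y2, ...] into (x_min, y_min, x_max, y_max)."""
    xs = flat_coords[0::2]
    ys = flat_coords[1::2]
    return min(xs), min(ys), max(xs), max(ys)

def parse_rows(csv_file):
    """Yield one sample (image name, bounding box, text) per annotated text line."""
    samples = []
    for row in csv.DictReader(csv_file):
        # literal_eval safely parses the Python-literal string without executing code.
        polygons = ast.literal_eval(row["anno_polygons"])
        texts = row["anno_texts"].split("|||")
        for poly, text in zip(polygons, texts):
            coords = poly["segmentation"][0]
            samples.append(
                {"image": row["img_id"], "bbox": polygon_to_bbox(coords), "text": text}
            )
    return samples

samples = parse_rows(io.StringIO(SAMPLE_CSV))
for sample in samples:
    print(sample)
```

With a real file, `io.StringIO(SAMPLE_CSV)` would be replaced by `open(path, newline="", encoding="utf-8")`, where UTF-8 matters for preserving Vietnamese diacritics.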

📊 Evaluation Results

Metric	Score
Test Loss	0.4042
Character Error Rate (CER)	0.1502
Sequence Accuracy	59.03%
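For reference, the two metrics in the table can be computed as sketched below. This is one common convention, assumed here rather than taken from the notebook: CER is the total Levenshtein edit distance divided by the total number of reference characters, and Sequence Accuracy is the fraction of lines predicted exactly. Normalizing to Unicode NFC first is an added assumption that keeps a Vietnamese letter with diacritics counted as one character.

```python
import unicodedata

def _nfc(text):
    # Composed form, so "ổ" is one code point instead of a base letter plus marks.
    return unicodedata.normalize("NFC", text)

def edit_distance(ref, hyp):
    """Levenshtein distance (insertions, deletions, substitutions) via dynamic programming."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def character_error_rate(refs, hyps):
    """Total edits over total reference characters (corpus-level CER)."""
    refs = [_nfc(r) for r in refs]
    hyps = [_nfc(h) for h in hyps]
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return total_edits / max(total_chars, 1)

def sequence_accuracy(refs, hyps):
    """Fraction of lines where the prediction matches the reference exactly."""
    exact = sum(_nfc(r) == _nfc(h) for r, h in zip(refs, hyps))
    return exact / max(len(refs), 1)

refs = ["Tổng cộng", "12.000"]
hyps = ["Tong cộng", "12.000"]  # one missing diacritic in the first line
print(round(character_error_rate(refs, hyps), 4))  # 0.0667
print(sequence_accuracy(refs, hyps))               # 0.5
```

The example also shows why the two numbers can diverge: a single wrong diacritic costs only one character in CER but makes the whole line count as incorrect for Sequence Accuracy.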
