Real-Time Automatic License Plate Recognition Using RT-DETR v2

Author: Chidambara Raju G
Repository: https://github.com/ChidambaraRaju/real-time-license-plate-detection-ocr
Dataset: https://huggingface.co/datasets/justjuu/traffic-accident-cctv-object-detection
Demo: https://huggingface.co/spaces/justjuu/license-plate-recognition-rtdetr

Abstract

Automatic License Plate Recognition (ALPR) is a cornerstone of modern intelligent transportation systems, yet balancing real-time performance with high accuracy remains a challenge. This project presents a robust, end-to-end ALPR pipeline that combines the state-of-the-art RT-DETR v2 (Real-Time DEtection TRansformer) for object detection with an optimized EasyOCR implementation for text extraction.

The detector was fine-tuned on a custom dataset, and the final system is deployed as an interactive web application on Hugging Face Spaces.

1. Introduction

License plate recognition systems traditionally rely on a two-stage pipeline:

License plate localization
Text recognition from the localized plate

While many approaches focus heavily on OCR, real-world performance is often bottlenecked by inaccurate or missed detections. This project emphasizes the opposite philosophy: optimize detection quality and recall first, then apply OCR only where it matters.

Recent advances in Transformer-based object detectors have made it possible to achieve both high accuracy and real-time inference. RT-DETR v2 is one such model, offering end-to-end detection without traditional post-processing such as non-maximum suppression (NMS). This makes it particularly suitable for low-latency applications when paired with appropriate filtering strategies.

2. System Overview

flowchart TD
    A([Input Image]) --> B[Pre-processing]
    B --> C[RT-DETR v2 Detection]
    C --> D{Found Plate?}
    
    D -- No --> E([No Output])
    D -- Yes --> F[Crop & Pad]
    
    F --> G[EasyOCR]
    G --> H([Final Output])

    %% Styling
    style C fill:#0f766e,color:#fff,stroke-width:0px
    style G fill:#1e40af,color:#fff,stroke-width:0px
    style H fill:#065f46,color:#fff,stroke-width:0px
    style A fill:#333,color:#fff,stroke-width:0px
    style E fill:#333,color:#fff,stroke-width:0px
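
The flow above can be sketched as a small function. This is a minimal illustration rather than the project's actual code: `detect` and `read_text` are hypothetical callables standing in for the RT-DETR v2 detector and EasyOCR, and the 10% padding ratio is an assumed value.

```python
from typing import Any, Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


def pad_box(box: Box, width: int, height: int, pad_ratio: float = 0.1) -> Box:
    """Grow a box by pad_ratio of its size on each side, clamped to the image."""
    x0, y0, x1, y1 = box
    pad_x = int((x1 - x0) * pad_ratio)
    pad_y = int((y1 - y0) * pad_ratio)
    return (
        max(0, x0 - pad_x),
        max(0, y0 - pad_y),
        min(width, x1 + pad_x),
        min(height, y1 + pad_y),
    )


def run_pipeline(
    image: Any,
    size: Tuple[int, int],
    detect: Callable[[Any], Sequence[Box]],
    read_text: Callable[[Any, Box], str],
) -> List[str]:
    """Detect plates, then run OCR only on the padded plate regions."""
    width, height = size
    boxes = detect(image)
    if not boxes:
        return []  # "Found Plate? No" branch: nothing to read
    return [read_text(image, pad_box(box, width, height)) for box in boxes]
```

Passing the detector and recognizer in as callables keeps the control flow easy to test without loading either model.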

3. Dataset Preparation

3.1 Source Dataset

The dataset was sourced from Roboflow and provided in COCO format. It contains vehicle images annotated with license plate bounding boxes.

3.2 Dataset Cleaning

The original dataset contained an unused category and non-contiguous class indices. To ensure compatibility with modern detection frameworks, the dataset was cleaned using the following steps:

Removed unused category entries
Remapped license_plate class ID to 0
Converted annotations into Hugging Face Dataset format
Excluded images without license plates
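
A minimal sketch of the filtering and remapping steps, operating on a COCO annotation dictionary already loaded from JSON. The function name and the exact category name `license_plate` in the raw export are assumptions; the conversion to Hugging Face Dataset format is not shown.

```python
from typing import Any, Dict


def clean_coco(coco: Dict[str, Any], keep_name: str = "license_plate") -> Dict[str, Any]:
    """Keep one category, remap its ID to 0, and drop images left without boxes."""
    # IDs of the category to keep (assumed to be identified by name).
    keep_ids = {c["id"] for c in coco["categories"] if c["name"] == keep_name}

    # Keep only matching annotations and rewrite their class ID to 0.
    annotations = [
        {**ann, "category_id": 0}
        for ann in coco["annotations"]
        if ann["category_id"] in keep_ids
    ]

    # Exclude images that no longer have any license plate annotation.
    image_ids = {ann["image_id"] for ann in annotations}
    images = [img for img in coco["images"] if img["id"] in image_ids]

    return {
        "images": images,
        "annotations": annotations,
        "categories": [{"id": 0, "name": keep_name}],
    }
```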

3.3 Final Dataset

Task: Single-class object detection
Class: license_plate
Annotation format: COCO [x, y, width, height]
Splits: Train / Validation / Test
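
COCO boxes store the top-left corner plus width and height, whereas cropping usually needs two corners. The helpers below are an illustrative sketch of that conversion; the function names are my own and not part of the project.

```python
from typing import Sequence, Tuple


def coco_to_xyxy(bbox: Sequence[float]) -> Tuple[float, float, float, float]:
    """Convert COCO [x, y, width, height] to (x_min, y_min, x_max, y_max)."""
    x, y, w, h = bbox
    return (x, y, x + w, y + h)


def xyxy_to_coco(box: Sequence[float]) -> Tuple[float, float, float, float]:
    """Convert (x_min, y_min, x_max, y_max) back to COCO [x, y, width, height]."""
    x0, y0, x1, y1 = box
    return (x0, y0, x1 - x0, y1 - y0)
```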

4. Solution Architecture

The pipeline consists of three distinct components:

4.1 The Detector: RT-DETR v2

I selected RT-DETR v2 (ResNet-50 backbone) over traditional YOLO models because of its transformer-based architecture. Unlike CNN-based detectors that rely heavily on NMS (Non-Maximum Suppression) post-processing to remove duplicate boxes, RT-DETR pairs an efficient hybrid encoder with a transformer decoder and predicts the final set of objects directly.

Backbone: ResNet-50 (starting from detector weights pre-trained on COCO)
Fine-tuning: Adapted for the single class license_plate.
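
A minimal sketch of how a single-class RT-DETR v2 detector might be loaded and run with Hugging Face Transformers. The checkpoint name `PekingU/rtdetr_v2_r50vd`, the 0.5 score threshold, and the function names are assumptions; the source does not state which checkpoint or threshold was used. Imports are kept inside the functions so the module can be imported without the heavy dependencies installed.

```python
# Assumed public checkpoint; replace with the fine-tuned weights in practice.
CHECKPOINT = "PekingU/rtdetr_v2_r50vd"
ID2LABEL = {0: "license_plate"}
LABEL2ID = {"license_plate": 0}


def load_detector(checkpoint: str = CHECKPOINT):
    """Load the image processor and a model with a single-class head."""
    from transformers import AutoImageProcessor, AutoModelForObjectDetection

    processor = AutoImageProcessor.from_pretrained(checkpoint)
    model = AutoModelForObjectDetection.from_pretrained(
        checkpoint,
        id2label=ID2LABEL,
        label2id=LABEL2ID,
        # The COCO classification head has a different size, so it is re-initialised.
        ignore_mismatched_sizes=True,
    )
    return processor, model


def detect_plates(image, processor, model, threshold: float = 0.5):
    """Return [((x_min, y_min, x_max, y_max), score), ...] for a PIL image."""
    import torch

    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Rescale predicted boxes to the original image size (height, width).
    target_sizes = torch.tensor([(image.height, image.width)])
    result = processor.post_process_object_detection(
        outputs, target_sizes=target_sizes, threshold=threshold
    )[0]
    return [
        (tuple(round(v) for v in box.tolist()), float(score))
        for box, score in zip(result["boxes"], result["scores"])
    ]
```

No NMS step appears here: the only filtering is the score threshold applied during post-processing.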

4.2 The Recognizer: Optimized EasyOCR

For text extraction, I integrated EasyOCR. While powerful, EasyOCR can be slow on CPU. I applied optimizations such as setting canvas_size=512, which caps the image size EasyOCR processes, to reduce inference time to under 5 seconds on standard hardware.
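
A minimal sketch of an EasyOCR call using the canvas_size=512 setting mentioned above. The text normalization (uppercasing and keeping only A-Z and 0-9) and the function names are my own assumptions, not something the project is known to do. EasyOCR is imported lazily so the helper can be used with any object exposing a compatible `readtext` method.

```python
import re
from typing import Iterable


def normalize_plate_text(fragments: Iterable[str]) -> str:
    """Join OCR fragments, uppercase them, and keep only A-Z and 0-9 (assumed rule)."""
    joined = "".join(fragments).upper()
    return re.sub(r"[^A-Z0-9]", "", joined)


def read_plate(plate_crop, reader=None) -> str:
    """Run OCR on a cropped plate (NumPy array or file path) and clean the text."""
    if reader is None:
        import easyocr

        # Creating a Reader loads the models, so reuse one instance across calls.
        reader = easyocr.Reader(["en"], gpu=False)

    # canvas_size limits the maximum image size processed; detail=0 returns text only.
    fragments = reader.readtext(plate_crop, canvas_size=512, detail=0)
    return normalize_plate_text(fragments)
```

In a long-running app, the reader would typically be created once at startup and passed in on every call.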

4.3 The Interface: Gradio

The model is wrapped in a Gradio interface, allowing users to upload images and visualize bounding boxes and extracted text in real time.

5. Training Setup

Framework: Hugging Face Transformers
Task: Object Detection
Classes: 1 (license_plate)
Training strategy: Fine-tuning from pretrained RT-DETR v2 weights
Objective: High recall and accurate localization

6. Evaluation Methodology

6.1 Metrics

The detection model was evaluated using standard object detection metrics:

Mean Average Precision (mAP)
Mean Average Recall (MAR)

Given the ALPR use case, mAP@0.5 was chosen as the primary metric, as OCR only requires reasonably tight localization rather than pixel-perfect bounding boxes.
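
To make the 0.5 threshold concrete, the sketch below computes Intersection over Union (IoU) for two corner-format boxes and checks whether a prediction would count as a match at that threshold. It illustrates the matching criterion only and is not the evaluation code used for the reported results.

```python
from typing import Sequence


def iou(a: Sequence[float], b: Sequence[float]) -> float:
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b

    # Width and height of the overlap region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h

    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def is_match(pred: Sequence[float], gt: Sequence[float], threshold: float = 0.5) -> bool:
    """A prediction matches a ground-truth box when IoU >= threshold."""
    return iou(pred, gt) >= threshold
```

For example, a 100x30 plate box shifted 10 pixels sideways still overlaps its ground truth with an IoU of about 0.82, comfortably above 0.5, which is why moderately loose boxes remain usable for OCR.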

6.2 Results

Evaluation on the held-out test split yielded the following results:

Metric	Value
mAP@[0.5:0.95]	0.97
Recall (MAR@100)	0.98
mAP@0.5	0.97
mAP (Small Objects)	0.88
mAP (Medium Objects)	0.99
mAP (Large Objects)	1.00

These results indicate that the model reliably detects license plates across different object sizes, with particularly strong recall — a crucial property for downstream OCR.
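
The size breakdown can be read with the COCO evaluation convention in mind, where objects are bucketed by pixel area: small below 32x32, medium up to 96x96, and large above that. The sketch below assumes the reported figures follow that convention, which the evaluation setup here does not explicitly state; boundary handling is simplified.

```python
SMALL_MAX_AREA = 32 * 32    # COCO convention: small objects have area < 32^2
MEDIUM_MAX_AREA = 96 * 96   # COCO convention: medium objects have area up to 96^2


def size_bucket(width: float, height: float) -> str:
    """Classify a box as 'small', 'medium', or 'large' by its pixel area."""
    area = width * height
    if area < SMALL_MAX_AREA:
        return "small"
    if area <= MEDIUM_MAX_AREA:
        return "medium"
    return "large"
```

Under this convention, a distant plate of roughly 40x20 pixels falls in the small bucket, which is where the model scores lowest.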

7. Demo & Results

The final application is hosted on Hugging Face Spaces. It features a custom-styled visualization engine that renders detections with a "glass-morphism" effect for better readability.

[Figure: screenshot of the demo on Hugging Face Spaces (hf demo.png)]

Try the Demo: https://huggingface.co/spaces/justjuu/license-plate-recognition-rtdetr
View the Code: https://github.com/ChidambaraRaju/real-time-license-plate-detection-ocr

8. Limitations

OCR accuracy depends on image quality and plate visibility.
Night-time, motion-blurred, or low-resolution plates may require OCR fine-tuning.
The dataset is single-class and biased toward license plate–containing images.
The dataset lacks environmental diversity, as most images were captured in controlled settings such as parking areas.
Due to limited scene variation, the detection model may not generalize optimally to unconstrained environments such as highways, crowded urban streets, or surveillance footage captured under diverse weather and lighting conditions.

9. Conclusion

This project demonstrates the power of combining transformer-based detection with lightweight, optimized OCR. By fine-tuning RT-DETR v2 and solving practical engineering hurdles related to image processing and inference speed, I built a reliable system capable of real-world license plate recognition.