DocShield – Automated Redaction for Secure Document Handling is a multi-modal system designed to automatically identify and redact sensitive information from scanned documents. By leveraging both Optical Character Recognition (OCR) and Named Entity Recognition (NER) techniques, DocShield addresses the increasing need for privacy preservation and compliance with data protection regulations (e.g., GDPR, HIPAA).
The DocShield pipeline begins with a computer vision module that performs OCR on scanned documents (PDF, JPG, or PNG), extracting textual content together with the bounding box coordinates of each recognized word. Preprocessing steps, such as image deskewing, noise reduction, and binarization, are applied first to improve recognition accuracy. An NER model (e.g., spaCy or a fine-tuned transformer from Hugging Face) then analyzes the extracted text to identify key entity categories such as personal names, addresses, and financial identifiers (e.g., credit card or social security numbers).
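As an illustration of the binarization step mentioned above, the following is a minimal sketch using Otsu's method on a grayscale image stored as nested lists. Otsu's method is an assumption chosen for illustration; the description does not say which binarization technique DocShield uses, and the function names `otsu_threshold` and `binarize` are hypothetical.

```python
from typing import List, Optional

Image = List[List[int]]  # grayscale pixels, each value in 0..255


def otsu_threshold(pixels: Image) -> int:
    """Return the grey level that maximises the between-class variance."""
    hist = [0] * 256
    for row in pixels:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_threshold, best_variance = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for level in range(256):
        weight_bg += hist[level]
        if weight_bg == 0:
            continue  # no pixels at or below this level yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break  # every pixel is already in the dark class
        sum_bg += level * hist[level]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarize(pixels: Image, threshold: Optional[int] = None) -> Image:
    """Map pixels above the threshold to white (255), the rest to black (0)."""
    if threshold is None:
        threshold = otsu_threshold(pixels)
    return [[255 if value > threshold else 0 for value in row] for row in pixels]


# Tiny hypothetical "page": dark ink (about 30) on a light background (about 200).
page = [[200, 210, 30],
        [205, 25, 35]]
clean_page = binarize(page)
```

In practice this step would more likely be delegated to an imaging library; the pure-Python version only makes the logic explicit.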
To integrate the results of these submodules, DocShield maps each detected entity back to its corresponding location in the source document image. This mapping enables redaction at the pixel level by overlaying black bars, blurring, or pixelating the text. The system also supports optional user-defined rules and pattern matching (e.g., regular expressions for emails or phone numbers) to capture domain-specific data not covered by the NER model.
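The mapping from detected entities back to image regions can be sketched as follows. This is a simplified illustration, not DocShield's actual implementation: it assumes the OCR step yields one bounding box per word, that the NER step reports character offsets into the space-joined text, and that a single email regex stands in for user-defined rules. The names `OcrWord`, `join_words`, and `redaction_boxes` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels
Span = Tuple[int, int]           # (start, end) character offsets, end exclusive


@dataclass
class OcrWord:
    text: str
    box: Box


# Example user-defined pattern; deliberately simple, not a full email grammar.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def join_words(words: List[OcrWord]) -> Tuple[str, List[Span]]:
    """Join words with single spaces and record each word's character span."""
    spans, pos = [], 0
    for word in words:
        spans.append((pos, pos + len(word.text)))
        pos += len(word.text) + 1  # +1 for the separating space
    return " ".join(word.text for word in words), spans


def redaction_boxes(words: List[OcrWord], entity_spans: List[Span]) -> List[Box]:
    """Return the boxes of every word overlapping an entity or a regex match."""
    text, word_spans = join_words(words)
    targets = list(entity_spans) + [m.span() for m in EMAIL_RE.finditer(text)]
    boxes: List[Box] = []
    for start, end in targets:
        for word, (w_start, w_end) in zip(words, word_spans):
            overlaps = w_start < end and w_end > start
            if overlaps and word.box not in boxes:
                boxes.append(word.box)
    return boxes


# Hypothetical OCR output for one text line.
words = [
    OcrWord("Contact", (10, 10, 80, 30)),
    OcrWord("Jane", (85, 10, 125, 30)),
    OcrWord("Doe", (130, 10, 165, 30)),
    OcrWord("at", (170, 10, 190, 30)),
    OcrWord("jane.doe@example.com", (195, 10, 400, 30)),
]
# Pretend the NER model tagged "Jane Doe" (characters 8..16) as a person.
boxes = redaction_boxes(words, entity_spans=[(8, 16)])
```

Each returned box can then be filled with black, blurred, or pixelated by whatever imaging library the pipeline uses.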
Evaluation of DocShield involves measuring OCR accuracy (via character or word error rates) and NER performance (using precision, recall, and F1 scores). Public real-world datasets, such as SROIE for receipt OCR and CoNLL-2003 for general-purpose NER, are used to demonstrate DocShield’s reliability. In addition to running as a standalone command-line or web-based prototype, DocShield can be integrated into enterprise workflows to support large-scale document processing. By reducing the manual effort required to redact sensitive information, DocShield helps organizations mitigate privacy risks, maintain regulatory compliance, and streamline document handling.
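For reference, the metrics named above can be computed as in the sketch below. It uses the standard definitions (edit distance divided by reference length for the error rates, exact-match counting for precision, recall, and F1). The function names and the sample values are illustrative assumptions, not results from DocShield.

```python
from typing import Sequence, Set, Tuple


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (characters or words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edits divided by number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def precision_recall_f1(gold: Set, predicted: Set) -> Tuple[float, float, float]:
    """Exact-match scores over sets of (start, end, label) entity tuples."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative values only: one misread character, and two of four entities found.
sample_cer = cer("Total 42.00", "Total 4Z.00")
sample_wer = wer("Total 42.00", "Total 4Z.00")
gold = {(0, 8, "PER"), (20, 40, "EMAIL"), (50, 60, "ADDR"), (70, 86, "CARD")}
pred = {(0, 8, "PER"), (20, 40, "EMAIL")}
scores = precision_recall_f1(gold, pred)
```

For a redaction system, recall on sensitive entities is usually the figure to watch, since a missed entity leaks data while a false positive only over-redacts.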