LLM Privacy Shield: Privacy-Preserving NLP Pipeline for Secure LLM Interactions

The Problem
Users unknowingly share sensitive information with AI chatbots every day:
"Hi, I'm John Smith. Email me at john.smith@company.com about the contract."
This data gets sent directly to LLM servers, where it may be logged, used for training, or exposed in security breaches. For healthcare, finance, and enterprise applications, this creates serious compliance risks under HIPAA, PCI-DSS, and GDPR.

The Solution
Privacy Shield acts as an intelligent middleware layer between applications and LLM APIs. It automatically detects personally identifiable information (PII), replaces it with safe tokens, and restores the original values after the LLM responds.
Here's how it works:
User Input: "Hi, I'm John Smith at john.smith@company.com"
Masked: "Hi, I'm {{PERSON_1}} at {{EMAIL_1}}"
↓
LLM API (never sees real PII)
↓
LLM Response: "Nice to meet you {{PERSON_1}}!"
Final Output: "Nice to meet you John Smith!"
The LLM never sees actual sensitive data, yet the conversation flows naturally.
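
The round trip above can be sketched in a few lines of Python. This is a minimal illustration, not Privacy Shield's actual API: `mask`, `restore`, and the `known_names` parameter are hypothetical names, and person names are passed in by the caller as a stand-in for the NER step.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
TOKEN_RE = re.compile(r"\{\{[A-Z_]+_\d+\}\}")


def mask(text, known_names=()):
    """Replace emails and caller-supplied names with {{TYPE_N}} tokens.

    Hypothetical helper: a real pipeline would find names with NER
    instead of taking them as an argument.
    """
    mapping = {}   # token -> original value
    counters = {}  # entity type -> last index used

    def new_token(kind, value):
        # Reuse the existing token when the same value appears again.
        for token, original in mapping.items():
            if original == value:
                return token
        counters[kind] = counters.get(kind, 0) + 1
        token = "{{%s_%d}}" % (kind, counters[kind])
        mapping[token] = value
        return token

    text = EMAIL_RE.sub(lambda m: new_token("EMAIL", m.group()), text)
    for name in known_names:
        if name in text:
            text = text.replace(name, new_token("PERSON", name))
    return text, mapping


def restore(text, mapping):
    """Swap known tokens back to their original values; leave others as-is."""
    return TOKEN_RE.sub(lambda m: mapping.get(m.group(), m.group()), text)


masked, mapping = mask(
    "Hi, I'm John Smith at john.smith@company.com",
    known_names=["John Smith"],
)
reply = "Nice to meet you {{PERSON_1}}!"  # stand-in for the LLM response
final = restore(reply, mapping)
```

Only `masked` would leave the process; `mapping` stays local and is needed again only when the response comes back.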

Key Features

Multi-layer Detection: Combines regex patterns, spaCy NER, and optional transformer models, reaching roughly 95% accuracy with all three layers enabled (see Performance Benchmarks)
Context-Aware: Maintains natural conversation flow while protecting privacy
Reversible Anonymization: Seamlessly restores original values in responses
Selective Privacy Controls: Choose which data types remain masked
Production-Ready: Includes error handling, logging, and edge-case management
Zero Configuration: Works out of the box with sensible defaults

What Gets Protected
Supported Sensitive Data Types

Emails · Phone Numbers · Names · Organizations · Locations
Credit Cards · SSNs · IP Addresses · Dates of Birth

Configurable rules for HIPAA, GDPR, and PCI-DSS compliance

Quick Start

1. Install dependencies

`pip install -r requirements.txt`

2. Set your API key

Create a `.env` file:

`OPENAI_API_KEY=sk-your-actual-key-here`

3. Run the Streamlit app

`streamlit run app.py`

4. Try it out

Enter text containing PII, optionally select data types to keep masked, and watch Privacy Shield protect the data in real time.



Architecture

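The system uses a three-stage pipeline: detect and mask PII, send the masked text to the LLM, then restore the original values in the response: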
User Input
     ↓
PII Detection (Regex + spaCy + Transformers)
     ↓
Masked Text Sent to LLM
     ↓
Masked Response
     ↓
PII Restored for User

Detection combines three methods:

Regex patterns: Fast matching for structured data (emails, phones, SSNs)
spaCy NER: Context-aware detection for names, organizations, locations
Transformer models (optional): BERT-based models for maximum accuracy
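
As a rough sketch of the regex layer only, the snippet below scans text for a few structured types and returns labeled spans. The patterns, the `Detection` tuple, and `detect_structured` are illustrative assumptions, not the patterns shipped with the project; they are deliberately simple (US-style phone and SSN formats, no range check on IP octets). An NER layer could emit spans in the same shape so that all layers can be merged afterwards.

```python
import re
from typing import NamedTuple


class Detection(NamedTuple):
    start: int
    end: int
    label: str
    text: str


# Illustrative patterns only; production patterns need more care.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "IP_ADDRESS": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def detect_structured(text):
    """Return every pattern match as a Detection, ordered by position."""
    found = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            found.append(Detection(m.start(), m.end(), label, m.group()))
    return sorted(found)


sample = "Call 555-123-4567 or mail jane@example.org. SSN 123-45-6789."
hits = detect_structured(sample)
```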

When detections from different layers overlap, they are deduplicated and only the highest-confidence match is kept.
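
One simple way to implement this kind of deduplication is a greedy pass: rank detections by confidence, then keep each one only if it does not overlap something already kept. The function name, the tuple layout, and the tie-breaking rule (longer span wins) are assumptions made for this sketch; the project's actual rules may differ.

```python
def resolve_overlaps(detections):
    """Keep the highest-confidence detection among overlapping spans.

    detections: iterable of (start, end, label, confidence) tuples,
    where end is exclusive.
    """
    # Highest confidence first; ties go to the longer span (an assumption).
    ranked = sorted(detections, key=lambda d: (-d[3], -(d[1] - d[0])))
    kept = []
    for cand in ranked:
        # Two half-open spans overlap when each starts before the other ends.
        overlaps = any(cand[0] < k[1] and k[0] < cand[1] for k in kept)
        if not overlaps:
            kept.append(cand)
    return sorted(kept)  # back to document order


candidates = [
    (8, 18, "PERSON", 0.80),   # e.g. from NER
    (8, 12, "ORG", 0.55),      # weaker guess inside the same span
    (22, 44, "EMAIL", 0.99),   # e.g. from regex
    (27, 32, "PERSON", 0.40),  # false positive inside the email
]
resolved = resolve_overlaps(candidates)
```

Here the two low-confidence spans are dropped because each overlaps a stronger detection.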

Performance Benchmarks
Tested on typical PII-heavy text containing names, emails, and phone numbers:
| Method | Accuracy | Speed |
|---|---|---|
| Regex only | ~85% | ~0.5ms |
| + spaCy | ~90% | ~15ms |
| + Hugging Face | ~95% | ~150ms |
Configuration can be adjusted based on accuracy vs. speed requirements.

Demo:
![4.png](4.png)![3.png](3.png)![2.png](2.png)![1.png](1.png)

Use Cases
Healthcare: Protect patient data in medical chatbots and telehealth platforms
Customer Support: Screen customer information in support tools and ticket systems
HR Systems: Anonymize candidate details during automated screening
Financial Services: Safeguard account numbers and personal data in banking applications
Enterprise: Enable GDPR-compliant AI assistants for internal tools

Advanced Features

Multi-turn conversations: Maintains PII mappings across entire conversation history
Selective masking: Configure which data types remain masked even in final output
Custom patterns: Easily extend with domain-specific PII patterns
Cost monitoring: Track token usage and API costs
Multiple LLM support: Works with OpenAI, Anthropic, local models, and more
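
To illustrate how multi-turn mappings and selective masking might fit together, here is a small sketch of a session object that hands out stable tokens across turns and skips restoration for chosen types. `MaskingSession`, `keep_masked`, and the `(start, end, label)` span format are hypothetical; detection is assumed to have happened already and to produce non-overlapping spans.

```python
import re

TOKEN_RE = re.compile(r"\{\{([A-Z_]+)_\d+\}\}")


class MaskingSession:
    """Hypothetical per-conversation store of token <-> value mappings."""

    def __init__(self, keep_masked=()):
        self.keep_masked = set(keep_masked)  # labels never restored
        self._token_for = {}  # (label, value) -> token
        self._value_for = {}  # token -> value
        self._counts = {}     # label -> last index used

    def token(self, label, value):
        key = (label, value)
        if key not in self._token_for:
            self._counts[label] = self._counts.get(label, 0) + 1
            tok = "{{%s_%d}}" % (label, self._counts[label])
            self._token_for[key] = tok
            self._value_for[tok] = value
        return self._token_for[key]

    def mask(self, text, detections):
        """detections: non-overlapping (start, end, label) spans."""
        out, cursor = [], 0
        for start, end, label in sorted(detections):
            out.append(text[cursor:start])
            out.append(self.token(label, text[start:end]))
            cursor = end
        out.append(text[cursor:])
        return "".join(out)

    def restore(self, text):
        def swap(m):
            if m.group(1) in self.keep_masked:
                return m.group(0)  # leave this type masked in the output
            return self._value_for.get(m.group(0), m.group(0))
        return TOKEN_RE.sub(swap, text)


session = MaskingSession(keep_masked={"EMAIL"})
turn1 = "I'm John Smith, john.smith@company.com"
name, email = "John Smith", "john.smith@company.com"
spans1 = [
    (turn1.index(name), turn1.index(name) + len(name), "PERSON"),
    (turn1.index(email), turn1.index(email) + len(email), "EMAIL"),
]
masked1 = session.mask(turn1, spans1)

turn2 = "Reply to John Smith"
spans2 = [(turn2.index(name), turn2.index(name) + len(name), "PERSON")]
masked2 = session.mask(turn2, spans2)  # same person, same token as turn 1

shown = session.restore("Hello {{PERSON_1}}, I emailed {{EMAIL_1}}")
```

Because the same value maps to the same token in every turn, the LLM can refer back to `{{PERSON_1}}` consistently across the conversation history.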


Security Considerations
1. API keys are loaded from .env (gitignored by default)
2. Token mappings are cleared after each session
3. No PII is logged to disk
4. Only masked data is sent to the LLM
Note: While Privacy Shield significantly reduces risk, some context leakage may occur through conversational patterns. Always review specific compliance requirements.
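
Clearing token mappings at the end of a session could be enforced with a context manager, so the mapping is emptied even when a request fails midway. This is a sketch under that assumption; `pii_session` is a hypothetical name and the project may handle cleanup differently.

```python
from contextlib import contextmanager


@contextmanager
def pii_session():
    """Yield a fresh token mapping and always empty it on exit."""
    mapping = {}
    try:
        yield mapping
    finally:
        # Runs on normal exit and on exceptions alike.
        mapping.clear()


with pii_session() as mapping:
    mapping["{{PERSON_1}}"] = "John Smith"
    # ... mask the input, call the LLM, restore the response ...
    size_during = len(mapping)
size_after = len(mapping)
```

Note that `dict.clear()` only drops references; it does not overwrite the underlying memory, so this limits how long values stay reachable rather than guaranteeing they are wiped.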

Limitations

Context clues may indirectly reveal information
New PII types require custom pattern definitions
Currently optimized for English text only
Accuracy depends on input text quality and model selection


Built With
spaCy • Hugging Face Transformers • OpenAI Python SDK • Streamlit