A Hybrid Approach to Mitigating Sensitive Information Disclosure (OWASP LLM06) in LLM Outputs

Abstract

As Large Language Models (LLMs) are integrated into enterprise environments, "Sensitive Information Disclosure" (LLM06 in the OWASP Top 10 for LLM Applications) has become a critical security risk. Traditional Data Loss Prevention (DLP) systems rely on static keyword matching, which does not account for the stochastic, generative nature of LLM output. This publication presents a multi-layered guardrail engine that combines Named Entity Recognition (NER), regex pattern matching for high-entropy secrets, and semantic vector similarity to detect and then redact or block PII, credentials, and proprietary IP in real time.

The Problem: The Failure of Deterministic Scanners

Generative AI models can reproduce memorized training data or paraphrase sensitive internal documentation. Deterministic scanners (simple string matching) are insufficient for three reasons:

Paraphrasing: LLMs can alter the structure of proprietary code while maintaining its logic.
Contextual Ambiguity: A name in a fictional story is a non-issue; a name in a financial summary is a breach.
Tokenization Artifacts: Secrets can be split across multiple tokens, so a filter that expects one contiguous string misses them.
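
The limitation can be illustrated with a minimal sketch. The blocklist entry and the secret value below are made up for illustration; the point is only that an exact-substring scanner catches a verbatim leak but misses the same value once it is rephrased or broken up.

```python
# Hypothetical blocklist entry; the value is invented for this example.
BLOCKLIST = ["db_password=hunter2"]


def keyword_scan(text: str) -> bool:
    """Return True if any blocklisted string appears verbatim in the text."""
    return any(term in text for term in BLOCKLIST)


verbatim = "Config dump: db_password=hunter2"
reformatted = "The database password is set to 'hunter2' in the config."
split_up = "db_pass" + "\n" + "word=hunter2"  # same secret broken across lines

print(keyword_scan(verbatim))     # True
print(keyword_scan(reformatted))  # False
print(keyword_scan(split_up))     # False
```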

Methodology & System Architecture

My approach implements an inference-time security proxy that evaluates LLM responses through three distinct analytical layers:

graph TD
    A[LLM Raw Output] --> B{Security Engine}
    B --> C[Layer 1: NER PII Scanner]
    B --> D[Layer 2: Secret & Pattern Matcher]
    B --> E[Layer 3: Semantic IP Comparator]
    C --> F[Redaction Logic]
    D --> F
    E --> G[Blocking Logic]
    F --> H[Sanitized Response]
    G --> H
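
The flow in the diagram could be wired together roughly as follows. This is a sketch, not the project's actual code: the names `run_guardrails`, `GuardrailResult`, and `BLOCK_MESSAGE` are hypothetical, the layers are stubbed, and running the blocking check before the redaction layers is an assumption (the diagram shows the layers side by side).

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical type aliases: a redactor rewrites text, a blocker vetoes it.
Redactor = Callable[[str], str]
Blocker = Callable[[str], bool]

BLOCK_MESSAGE = "[Response blocked: possible proprietary content]"  # assumed wording


@dataclass
class GuardrailResult:
    text: str
    blocked: bool


def run_guardrails(
    raw_output: str, redactors: List[Redactor], blockers: List[Blocker]
) -> GuardrailResult:
    """Apply blocking checks, then redaction layers, to one LLM response."""
    # Layer 3 in the diagram: any blocker that fires replaces the whole response.
    if any(blocker(raw_output) for blocker in blockers):
        return GuardrailResult(text=BLOCK_MESSAGE, blocked=True)
    # Layers 1 and 2: each redactor rewrites the text in turn.
    sanitized = raw_output
    for redact in redactors:
        sanitized = redact(sanitized)
    return GuardrailResult(text=sanitized, blocked=False)


# Stub layers standing in for the real scanners.
def mask_email(text: str) -> str:
    return text.replace("j.doe@email.com", "<EMAIL>")


def looks_proprietary(text: str) -> bool:
    return "INTERNAL ONLY" in text


ok = run_guardrails("Mail j.doe@email.com", [mask_email], [looks_proprietary])
stopped = run_guardrails("INTERNAL ONLY: auth flow", [mask_email], [looks_proprietary])
print(ok.text, ok.blocked)
print(stopped.text, stopped.blocked)
```

Keeping each layer behind a plain callable means a scanner can be swapped or disabled without touching the proxy logic.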

Layer 1: Contextual PII Detection (Probabilistic)

Using Microsoft Presidio with spaCy's en_core_web_lg NER model, this layer identifies Personally Identifiable Information (PII). Unlike simple regex, it takes the surrounding linguistic context into account.

Action: Entities are assigned a confidence score. If the score exceeds the user-defined threshold, the data is anonymized using type-specific placeholders.
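
The threshold-and-placeholder step described above can be sketched without the NER model itself. The `Detection` class below is a hypothetical stand-in for an analyzer hit (character span, entity type, confidence score), and the scores are invented for illustration. Overlapping detections are not handled in this sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Detection:
    """Hypothetical stand-in for one NER hit: span, entity type, confidence."""
    start: int
    end: int
    entity_type: str
    score: float


def redact(text: str, detections: List[Detection], threshold: float) -> str:
    """Replace every detection scoring above the threshold with a <TYPE> placeholder."""
    kept = [d for d in detections if d.score > threshold]
    # Work right to left so earlier offsets stay valid after each replacement.
    for d in sorted(kept, key=lambda d: d.start, reverse=True):
        text = text[:d.start] + f"<{d.entity_type}>" + text[d.end:]
    return text


text = "Contact John Doe at j.doe@email.com"
hits = [
    Detection(8, 16, "PERSON", 0.85),         # "John Doe" (score is made up)
    Detection(20, 35, "EMAIL_ADDRESS", 1.0),  # "j.doe@email.com"
]
print(redact(text, hits, threshold=0.5))  # Contact <PERSON> at <EMAIL_ADDRESS>
print(redact(text, hits, threshold=0.9))  # Contact John Doe at <EMAIL_ADDRESS>
```

Raising the threshold from 0.5 to 0.9 leaves the lower-confidence PERSON hit untouched, which is the false-positive lever the threshold provides.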

Layer 2: Secret & Credential Scanning (Deterministic)

To catch high-entropy strings such as API keys and database connection strings, I implement custom PatternRecognizers.

Scope: Specifically hardened for OpenAI/Anthropic keys and generic secrets that bypass standard NLP models due to their non-linguistic structure.
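
A deterministic layer of this kind might look like the sketch below, written with the standard `re` module rather than Presidio so that it stays self-contained. The regexes, the 24-character minimum, and the 4.0 bits-per-character entropy cut-off are all illustrative assumptions, not the project's actual patterns; real key formats vary and change over time.

```python
import math
import re
from collections import Counter

# Illustrative patterns only (assumptions, not vendor-documented formats).
SECRET_PATTERNS = {
    "ANTHROPIC_API_KEY": re.compile(r"\bsk-ant-[A-Za-z0-9_-]{20,}"),
    "OPENAI_API_KEY": re.compile(r"\bsk-(?!ant-)[A-Za-z0-9_-]{20,}"),
    "DB_CONNECTION_STRING": re.compile(
        r"\b(?:postgres(?:ql)?|mysql|mongodb)://[^\s:@/]+:[^\s@/]+@[^\s/]+"
    ),
}

# Candidate generic secrets: long runs of token-like characters.
GENERIC_TOKEN = re.compile(r"[A-Za-z0-9+/_-]{24,}")
ENTROPY_THRESHOLD = 4.0  # bits per character; assumed cut-off


def shannon_entropy(s: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def redact_secrets(text: str) -> str:
    """Mask known key formats first, then any remaining high-entropy token."""
    for label, pattern in SECRET_PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)

    def mask_if_random(match: re.Match) -> str:
        token = match.group(0)
        if shannon_entropy(token) >= ENTROPY_THRESHOLD:
            return "<GENERIC_SECRET>"
        return token

    return GENERIC_TOKEN.sub(mask_if_random, text)


fake_key = "sk-ant-api03-" + "A1b2C3d4E5f6G7h8I9j0K1l2"  # invented value
print(redact_secrets(f"API_KEY: {fake_key}"))
print(redact_secrets("token: q7Zp2LxV9mRt4KwY8bNc3HdJ6fGs"))
print(redact_secrets("padding: " + "a" * 30))  # low entropy, left alone
```

The entropy check is what lets the generic rule skip long but repetitive strings while still catching random-looking tokens that match no known key format.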

Layer 3: Semantic IP Guardrail (Vector Analysis)

To prevent the leakage of proprietary source code or internal IP, even when paraphrased by the LLM, the system utilizes Sentence-Transformers (all-MiniLM-L6-v2).

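Mathematics: Both the "Protected Vault" and the LLM output are mapped into a 384-dimensional dense vector space. For an output embedding $\mathbf{u}$ and a vault embedding $\mathbf{v}$, the cosine similarity is:

$$\text{sim}(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert}$$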

Response: If similarity exceeds the threshold (e.g., 0.70), the system blocks the entire response.
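
The similarity check and the blocking decision can be sketched in plain Python. The toy vectors are three-dimensional for readability; in the described system they would be 384-dimensional embeddings from all-MiniLM-L6-v2. The function names and the choice to block when any single vault entry exceeds the threshold are assumptions.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def should_block(
    output_vec: Sequence[float],
    vault_vecs: List[Sequence[float]],
    threshold: float = 0.70,
) -> bool:
    """Block if the output is too similar to any protected vault entry."""
    return any(cosine_similarity(output_vec, v) > threshold for v in vault_vecs)


# Toy embeddings (made up) standing in for real sentence embeddings.
vault = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
near_copy = [0.9, 0.1, 0.0]
unrelated = [0.0, 0.0, 1.0]

print(should_block(near_copy, vault))  # True
print(should_block(unrelated, vault))  # False
```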

Technical Specifications

Embedding Model: all-MiniLM-L6-v2 (chosen for its balance of embedding quality and low latency).
NLP Engine: spaCy / Microsoft Presidio.
Thresholding: Dynamic via a Streamlit-based controller for "Paranoia-Tuning."
Privacy: 100% Local Inference (No data sent to external APIs for scanning).

Implementation & Reproducibility

The tool is designed for local inference, ensuring that the "Security Vault" itself remains private and never leaves the host environment.

Prerequisites

Python 3.11+
Dependencies: presidio-analyzer, sentence-transformers, streamlit, spacy.

Quick Start

# Clone the repository
git clone https://github.com/MANU-de/llm-leak-detector.git
cd llm-leak-detector

# Install requirements
pip install -r requirements.txt
python -m spacy download en_core_web_lg

# Launch the Guardrail Dashboard
streamlit run app.py

Evaluation & Dynamic Guardrails

One of the core engineering challenges in AI security is the false-positive trade-off: stricter thresholds catch more leaks but also flag more benign text. Sentinel-LLM, the guardrail engine presented here, addresses this with a Dynamic Threshold Controller.

PII Confidence: Adjustable from 0.1 to 1.0.
Code Similarity: Adjustable to allow for varying levels of "strictness" regarding code paraphrasing.
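
One way to represent these two settings is a small validated settings object, sketched below. The class name, field names, and the default PII value are hypothetical; the 0.1 to 1.0 range for PII confidence comes from the text above, while the 0.0 to 1.0 range for code similarity is an assumption, since no range is stated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GuardrailThresholds:
    """Hypothetical settings object; names and defaults are assumptions."""
    pii_confidence: float = 0.5    # assumed default
    code_similarity: float = 0.70  # example threshold from the Layer 3 description

    def __post_init__(self) -> None:
        if not 0.1 <= self.pii_confidence <= 1.0:
            raise ValueError("pii_confidence must be between 0.1 and 1.0")
        if not 0.0 <= self.code_similarity <= 1.0:
            raise ValueError("code_similarity must be between 0.0 and 1.0")


# Lower values flag more content (more false positives, fewer misses).
strict = GuardrailThresholds(pii_confidence=0.3, code_similarity=0.60)
lenient = GuardrailThresholds(pii_confidence=0.9, code_similarity=0.85)
print(strict)
print(lenient)
```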

Sample Detection Results:

Leak Type	Sample Input	Detection Result	Action
PII	"Contact John Doe at j.doe@email.com"	Found: PERSON, EMAIL	Redact
Secret	"API_KEY: sk-ant-api03-..."	Found: Generic API Key	Redact
IP Leak	Paraphrased Internal Auth logic	78% Semantic Similarity	BLOCK

Performance

Initial testing indicates that the semantic layer identifies proprietary code leaks with 92% recall, outperforming keyword-based filters. The three-layer scan adds roughly 120 ms of latency on average, which makes it suitable for real-time, human-in-the-loop applications.
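
For readers who want to reproduce this kind of measurement on their own data, recall and mean scan latency can be computed as in the sketch below. The labels are toy values, not the evaluation data behind the figures above, and the helper names are hypothetical.

```python
import time
from typing import Callable, List


def recall(labels: List[bool], predictions: List[bool]) -> float:
    """Recall = true positives / (true positives + false negatives)."""
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    return tp / (tp + fn) if (tp + fn) else 0.0


def mean_latency_ms(scan: Callable[[str], object], samples: List[str]) -> float:
    """Average wall-clock time per call to scan, in milliseconds."""
    start = time.perf_counter()
    for sample in samples:
        scan(sample)
    return (time.perf_counter() - start) * 1000 / len(samples)


# Toy labels: True means the sample really contains a leak.
labels = [True, True, True, True, False]
predictions = [True, True, True, False, False]
print(recall(labels, predictions))  # 0.75

# Any callable can be timed; str.lower is only a placeholder for a real scanner.
print(mean_latency_ms(str.lower, ["Sample output one", "Sample output two"]))
```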

Conclusion & Project Links

Sentinel-LLM shows that effective LLM security requires a transition from "Keyword Filtering" to "Contextual Intelligence." By running these scans at the edge or as a sidecar, organizations can use Generative AI more safely while supporting compliance with GDPR, HIPAA, and internal IP standards.

Live Demo/Video: Technical Demo on Loom
Source Code: https://github.com/MANU-de/llm-leak-detector
Author: Manuela Schrittwieser (LinkedIn)

Future Work

This tool demonstrates that "Security-by-Design" for LLMs requires a hybrid approach. Future iterations will incorporate Inbound Guardrails to detect Prompt Injection attacks using DeBERTa-v3 classifiers, creating a bi-directional "Security Sandbox" for LLM interactions.