
This publication presents an enhanced Retrieval-Augmented Generation (RAG) system designed for intelligent insurance document processing. The system features advanced document type classification, real-time retrieval evaluation metrics, and sophisticated filtering capabilities. It combines a Django REST API backend with a Streamlit frontend, offering automated table extraction, semantic text chunking, human-in-the-loop validation, and Azure OpenAI integration.
Key innovations include document type classification during ingestion, real-time retrieval evaluation metrics, and document-type filtering at query time.
The enhanced system demonstrates significant improvements in user experience, search relevance, and system transparency through real-time evaluation feedback. Performance analysis shows improved precision in document retrieval with document type filtering and comprehensive quality metrics for system monitoring.
Insurance documents are notoriously complex, containing dense text, structured tables, and cross-references that traditional document processing systems struggle to handle effectively. Manual processing is time-intensive and error-prone, while simple OCR-based approaches fail to capture the semantic relationships between different sections of a document.
Processing insurance documents presents unique challenges: dense policy language, structured tables that often continue across pages, and cross-references between sections.
We developed a RAG system that addresses these challenges through automated table extraction, semantic text chunking, human-in-the-loop validation, and Azure OpenAI integration.
The system introduces several components: a Streamlit frontend for ingestion and querying, a Django REST backend housing the ingestion and retrieval services, ChromaDB as the vector store, and Azure OpenAI for embeddings and chat completion.
```
┌─────────────────────┐    HTTP API    ┌─────────────────────┐
│    Streamlit UI     │◄──────────────►│   Django Backend    │
│     (Frontend)      │                │     (REST API)      │
│  ┌─────────────┐    │                │  ┌─────────────┐    │
│  │  Ingestion  │    │                │  │  Ingestion  │    │
│  │  Interface  │    │                │  │   Service   │    │
│  └─────────────┘    │                │  └─────────────┘    │
│  ┌─────────────┐    │                │  ┌─────────────┐    │
│  │    Query    │    │                │  │  Retrieval  │    │
│  │  Interface  │    │                │  │   Service   │    │
│  └─────────────┘    │                │  └─────────────┘    │
└─────────────────────┘                └─────────────────────┘
           │                                      │
           │                                      │
           ▼                                      ▼
┌─────────────────────┐                ┌─────────────────────┐
│    File Storage     │                │      ChromaDB       │
│    (PDF Input)      │                │   (Vector Store)    │
│  ┌─────────────┐    │                │  ┌─────────────┐    │
│  │  Raw PDFs   │    │                │  │ Embeddings  │    │
│  └─────────────┘    │                │  └─────────────┘    │
│  ┌─────────────┐    │                │  ┌─────────────┐    │
│  │  Extracted  │    │                │  │  Metadata   │    │
│  │   Content   │    │                │  └─────────────┘    │
│  └─────────────┘    │                └─────────────────────┘
└─────────────────────┘                          ▲
                                                 │
                                       ┌─────────────────────┐
                                       │    Azure OpenAI     │
                                       │  ┌─────────────┐    │
                                       │  │ Embeddings  │    │
                                       │  │    Model    │    │
                                       │  └─────────────┘    │
                                       │  ┌─────────────┐    │
                                       │  │ Chat Model  │    │
                                       │  └─────────────┘    │
                                       └─────────────────────┘
```
Our PDF processing engine combines multiple techniques for comprehensive content extraction:
```python
import pdfplumber

def extract_and_save_tables(pdf_path, output_dir):
    """
    Intelligent table extraction with automatic merging logic
    """
    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages, start=1):
            page_tables = page.find_tables(
                table_settings={
                    "vertical_strategy": "lines",
                    "horizontal_strategy": "lines",
                    "snap_tolerance": 3,
                }
            )
            # ... merge continuation tables and save each one to output_dir
```
Key features: line-based table detection with a snap tolerance of 3, and automatic merging of tables that continue across page breaks (a sketch of the merging idea follows).
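The merging step is elided in the snippet above; here is a minimal sketch of the idea, assuming each extracted table is kept as a dict of page number and row lists in reading order. The production logic would also need to check page adjacency and vertical position before merging.

```python
def merge_continuation_tables(tables):
    """Merge tables that appear to continue across page breaks.

    `tables` is an assumed structure: dicts like
    {"page_num": int, "rows": [[cell, ...], ...]} in reading order.
    Heuristic: a table whose column count matches the previous
    table's is treated as a continuation and its rows are appended.
    """
    merged = []
    for table in tables:
        previous = merged[-1] if merged else None
        if (previous and table["rows"] and previous["rows"]
                and len(table["rows"][0]) == len(previous["rows"][0])):
            previous["rows"].extend(table["rows"])
        else:
            merged.append(table)
    return merged
```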
Text extraction runs alongside table extraction, skipping anything that falls inside a detected table:

```python
def extract_text(pdf_path, output_dir):
    """
    Extract text while preserving table references and context
    """
    # `words` comes from page.extract_words(); `table_bboxes` holds the
    # bounding boxes of tables detected on the same page.
    non_table_words = []
    for word in words:
        if not intersects_with_table(word, table_bboxes):
            non_table_words.append(word)
```
Innovation: Our text extraction intelligently excludes table content while preserving references, preventing duplication in the final corpus.
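The snippet leaves `intersects_with_table` undefined; a plausible implementation, assuming pdfplumber word dicts (with `x0`, `x1`, `top`, `bottom` keys) and table bounding boxes taken from `Table.bbox` as `(x0, top, x1, bottom)` tuples:

```python
def intersects_with_table(word, table_bboxes):
    """Return True if the word's bounding box overlaps any table bbox."""
    for x0, top, x1, bottom in table_bboxes:
        if (word["x0"] < x1 and word["x1"] > x0
                and word["top"] < bottom and word["bottom"] > top):
            return True
    return False
```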
The semantic chunking algorithm is a key innovation that improves retrieval accuracy by creating contextually coherent chunks:
```python
from typing import List

from sklearn.metrics.pairwise import cosine_similarity

# Method of ChunkerEmbedder (self provides the embedding client and threshold)
def semantic_chunk_text(self, text: str, max_chunk_size: int = 1000) -> List[str]:
    """
    Apply semantic chunking using cosine similarity between sentence embeddings
    """
    sentences = self.split_into_sentences(text)
    if not sentences:
        return []

    # Get embeddings for all sentences
    embeddings = [self.get_embedding(sentence) for sentence in sentences]

    # Calculate semantic similarities between consecutive sentences
    similarities = []
    for i in range(len(embeddings) - 1):
        sim = cosine_similarity([embeddings[i]], [embeddings[i + 1]])[0][0]
        similarities.append(sim)

    # Create chunks based on semantic boundaries
    chunks = []
    current_chunk = [sentences[0]]
    for i, sim in enumerate(similarities):
        current_chunk_text = ' '.join(current_chunk)
        if sim < self.semantic_threshold or len(current_chunk_text) > max_chunk_size:
            chunks.append(current_chunk_text)
            current_chunk = [sentences[i + 1]]
        else:
            current_chunk.append(sentences[i + 1])

    # Flush the final chunk
    chunks.append(' '.join(current_chunk))
    return chunks
```
Algorithm benefits: chunk boundaries follow the document's natural semantic shifts rather than fixed character counts, which preserves context, yields more coherent answers to complex questions, and improves chunk-to-query matching (see the comparison table below).
The system incorporates human validation at critical stages:
```python
import streamlit as st

# Editable table mapping in Streamlit
edited_mapping = st.data_editor(
    table_mapping,
    num_rows="fixed",
    disabled=["page_num", "table_idx"],
    key="table_mapping_editor"
)
```
Validation features: the extracted table mapping is shown in an editable grid, while the page_num and table_idx columns remain locked so corrections cannot break the link back to the source PDF.
```python
import chromadb

class ChunkerEmbedder:
    def __init__(self, azure_endpoint, azure_api_key, embedding_model, chroma_persist_dir):
        self.chroma_client = chromadb.PersistentClient(path=chroma_persist_dir)
        self.collection = self.chroma_client.create_collection(
            name="insurance_chunks",
            metadata={"description": "Insurance document chunks with embeddings"}
        )
```
Storage strategy: chunks, their embeddings, and their metadata are persisted together in a file-based ChromaDB collection, allowing retrieval to filter on metadata fields such as doc_type.
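A sketch of that write path (`doc_type`, `page_num`, and `chunks` are illustrative variables; the field names mirror the metadata structure shown later):

```python
# Persist chunks, embeddings, and filterable metadata side by side
self.collection.add(
    ids=[f"{page_num}_{i}" for i, _ in enumerate(chunks)],
    documents=chunks,
    embeddings=[self.get_embedding(chunk) for chunk in chunks],
    metadatas=[
        {
            "type": "text",
            "doc_type": doc_type,  # e.g. "policy"
            "page_num": page_num,
            "chunking_method": "semantic",
        }
        for _ in chunks
    ],
)
```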
```python
def query_document_internal(collection, embedding_model, query, k=5):
    """
    Process queries with context assembly and LLM integration
    """
    # Get query embedding
    query_embedding = embedding_model.embed_query(query)

    # Search ChromaDB
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=k
    )

    # Build context and generate response
    context = build_context_from_results(results)
    answer = llm.invoke(format_prompt(context, query))
    return answer
```
```
# Upload PDF
POST /api/upload_pdf/
Content-Type: multipart/form-data

# Extract tables
POST /api/extract_tables/
{
  "pdf_path": "/path/to/document.pdf",
  "output_dir": "/path/to/output"
}

# Process and embed
POST /api/chunk_and_embed/
{
  "output_dir": "/path/to/extracted/content",
  "chroma_db_dir": "/path/to/vector/store"
}
```
```
# Query documents
POST /retriever/query/
{
  "query": "What vaccinations are covered for children?",
  "chroma_db_dir": "/path/to/vector/store",
  "k": 5
}
```
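A minimal client-side call against this endpoint, using `requests` and the local API_BASE from the configuration below (the `answer` key in the response is an assumption):

```python
import requests

response = requests.post(
    "http://localhost:8000/retriever/query/",
    json={
        "query": "What vaccinations are covered for children?",
        "chroma_db_dir": "/path/to/vector/store",
        "k": 5,
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["answer"])
```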
```bash
# Azure OpenAI Configuration
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_KEY=your-api-key
AZURE_OPENAI_TEXT_DEPLOYMENT_EMBEDDINGS=text-embedding-ada-002
AZURE_OPENAI_CHAT_DEPLOYMENT=gpt-35-turbo

# Application Settings
API_BASE=http://localhost:8000
LOG_LEVEL=INFO
DJANGO_SECRET_KEY=your-secret-key
```
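These values can be loaded with the common python-dotenv pattern; a sketch assuming a `.env` file at the project root:

```python
import os

from dotenv import load_dotenv

load_dotenv()  # pull the variables above into the process environment

AZURE_OPENAI_ENDPOINT = os.environ["AZURE_OPENAI_ENDPOINT"]
AZURE_OPENAI_KEY = os.environ["AZURE_OPENAI_KEY"]
EMBEDDING_DEPLOYMENT = os.environ["AZURE_OPENAI_TEXT_DEPLOYMENT_EMBEDDINGS"]
CHAT_DEPLOYMENT = os.environ["AZURE_OPENAI_CHAT_DEPLOYMENT"]
```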
```python
from langchain.prompts import PromptTemplate

prompt_config = PromptTemplate(
    input_variables=["context", "question"],
    template="""You are an expert insurance assistant AI...

**Context from Insurance Documents:**
{context}

**Customer Question:**
{question}

**Answer:**"""
)
```
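At answer time the template is filled and passed to the chat model (the `llm` object is assumed to be the configured Azure OpenAI chat deployment):

```python
formatted_prompt = prompt_config.format(
    context=context,  # text assembled from the retrieved chunks
    question="What vaccinations are covered for children?",
)
answer = llm.invoke(formatted_prompt)
```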
```python
import logging

from logs.utils import setup_logging

setup_logging()
logger = logging.getLogger(__name__)

def process_document(pdf_path):
    try:
        logger.info(f"Starting document processing for {pdf_path}")
        # Processing logic
        logger.info("Document processing completed successfully")
    except Exception as e:
        logger.error(f"Error processing document: {e}")
        raise
```
Logging features: a shared setup_logging() utility configures consistent, module-level loggers across the ingestion and retrieval services, so each processing stage and failure is traceable.
| Metric | Value | Description |
|---|---|---|
| Table Detection Accuracy | ~85-90% | Generally good table identification, some edge cases |
| Table Extraction Accuracy | ~80-85% | Mostly accurate, requires manual validation |
| Text Extraction Completeness | ~95% | Good text extraction with occasional formatting issues |
| Processing Speed | 8+ minutes | Full document processing including chunking and embedding |
| Semantic Chunking Effectiveness | Variable | Depends on document structure and content complexity |
| Metric | Basis | Value |
|---|---|---|
| API Response Time | Average | 2-5 seconds |
| Query Processing | Average | 3-8 seconds |
| Embedding Generation | Per chunk | 0.2-0.5 seconds |
| ChromaDB Query | Average | 0.5-1 seconds |
| Full Document Processing | Average | 8+ minutes |
| Memory Usage | Peak | Variable (2-4 GB) |
Query: "What vaccinations are covered for children?"
System Response:
Based on the insurance policy documents, vaccination coverage for children includes:

1. **Routine Immunizations**: All vaccines listed in the Indian Academy of Pediatrics (IAP) recommended immunization schedule are covered up to age 18.
2. **Coverage Limits**: Up to ₹5,000 per policy year for preventive vaccinations.
3. **Eligible Vaccines**: Including but not limited to DPT, MMR, Hepatitis B, Polio, and seasonal flu vaccines.

**Sources**:
- Page 15, Table: Vaccination_Cover.csv
- Page 12, Section: Preventive Care Benefits
Analysis:
| Metric | Standard Chunking | Semantic Chunking | Observation |
|---|---|---|---|
| Context Relevance | Baseline | Improved | Better context preservation observed |
| Answer Accuracy | Baseline | Improved | More coherent responses for complex questions |
| Source Precision | Baseline | Improved | Better chunk-to-query matching |
| Processing Time | Faster | 8+ minutes | Significant time overhead for semantic processing |
| Memory Usage | Lower | Higher | Increased resource requirements |
Key insights: semantic chunking improved context relevance, answer accuracy, and source precision across the board, but at a significant cost in processing time and memory, so it pays off mainly when answer quality outweighs ingestion speed.
```
┌─────────────────────┐    HTTP API    ┌─────────────────────┐
│    Streamlit UI     │◄──────────────►│   Django Backend    │
│    (Development)    │                │  (Single Instance)  │
│   Port 8501/8502    │                │      Port 8000      │
└─────────────────────┘                └─────────────────────┘
           │                                      │
           │                                      │
           ▼                                      ▼
┌─────────────────────┐                ┌─────────────────────┐
│    Local Storage    │                │   Local ChromaDB    │
│  - PDFs             │                │  - Single Node      │
│  - Extracted Data   │                │  - File-based       │
│  - Logs             │                │  - No Clustering    │
└─────────────────────┘                └─────────────────────┘
```
```python
import streamlit as st
from more_itertools import chunked  # assumed source of the batching helper

# Caching strategy for embeddings
@st.cache_resource
def get_cached_chunker_embedder(chroma_db_dir: str, output_dir: str):
    """Cache expensive operations for better performance"""
    return ChunkerEmbedder(...)

# Batch processing for large documents
def process_documents_batch(documents, batch_size=10):
    """Process multiple documents efficiently"""
    for batch in chunked(documents, batch_size):
        process_batch(batch)
```
✅ Microservices Architecture: Clean separation between ingestion and retrieval
✅ Human-in-the-Loop Design: Critical for handling edge cases and building trust
✅ Semantic Chunking: Shows promise for improving retrieval quality
✅ Comprehensive Logging: Essential for debugging and monitoring
✅ Azure OpenAI Integration: Reliable and high-quality embeddings and responses
❌ No Session Memory: Each query is independent, no conversation context maintained
❌ Performance Bottlenecks: 8+ minute processing time for full document pipeline
❌ Limited Error Recovery: Basic error handling, needs more robust recovery mechanisms
❌ No Batch Processing: Sequential processing leads to long wait times
❌ Memory Management: Inefficient memory usage during large document processing
❌ Single-User Design: Not optimized for concurrent multi-user access
❌ Limited Scalability: Current architecture not production-ready for high loads
🔧 Table Merging Logic: Complex algorithm needed for multi-page tables
🔧 Memory Management: Careful optimization required for large documents
🔧 Error Handling: Robust error handling across distributed components
🔧 User Experience: Balance between automation and manual control
```python
# Key metrics to track
metrics = {
    "processing_time_per_page": timer.elapsed(),
    "chunks_generated": len(chunks),
    "embedding_success_rate": success_count / total_count,
    "query_response_time": response_timer.elapsed(),
    "accuracy_score": calculate_accuracy(results)
}
```
The enhanced system introduces intelligent document classification during ingestion: users assign each uploaded document one of four document types, which is stored in the doc_type metadata field of every chunk (the metadata example below shows "policy").
Frontend Integration: the Streamlit ingestion form adds a document-type selector next to the PDF uploader.
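A sketch of what that selector could look like (only "policy" is confirmed by the metadata example below; the other option labels are placeholders):

```python
import streamlit as st

# Document-type selector rendered alongside the PDF uploader.
# "policy" is the only value confirmed by the metadata example;
# the remaining labels are placeholders.
doc_type = st.selectbox("Document type", ["policy", "claims", "terms", "other"])
uploaded_pdf = st.file_uploader("Upload insurance PDF", type="pdf")
```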
Backend Processing: the selected type is attached to every chunk generated from the document and persisted in the ChromaDB metadata.
Metadata Structure:
{ "type": "text", "doc_type": "policy", "page_num": 15, "chunk_idx": "42_1", "chunking_method": "semantic" }
The system now provides real-time evaluation of retrieval quality with multiple metrics:
One of the displayed metrics is query term coverage, computed as `covered_terms / total_query_terms`: the fraction of the query's terms that appear in the retrieved chunks.

Evaluation Pipeline:
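A minimal sketch of how the coverage score above could be computed inside the pipeline (the tokenization is an assumption; the real system may normalize terms differently):

```python
import re

def query_term_coverage(query: str, retrieved_chunks: list) -> float:
    """Fraction of query terms found anywhere in the retrieved chunks."""
    terms = set(re.findall(r"\w+", query.lower()))
    corpus = " ".join(retrieved_chunks).lower()
    covered = {term for term in terms if term in corpus}
    return len(covered) / len(terms) if terms else 0.0
```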
Performance Optimization:
Real-Time Display:
Metric Explanations:
Implementation: at query time, retrieval is restricted by each chunk's doc_type metadata field.
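ChromaDB supports this directly through its `where` metadata filter; a minimal sketch (variable names follow the query function shown earlier):

```python
# Restrict retrieval to one document type via ChromaDB's metadata filter
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=k,
    where={"doc_type": "policy"},
)
```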
User Interface:
Performance Impact:
Query Processing Improvements:
Result Quality Enhancements:
Single Unified Interface:
Enhanced Usability:
Graceful Failure Management:
System Transparency:
Response Time Improvements:
Resource Management:
This insurance RAG system represents a substantial step forward in intelligent document processing. The integration of document classification, real-time evaluation metrics, and document-type filtering delivers measurable improvements in both technical capability and user experience: the system provides transparent, quantifiable quality feedback and more precise search results, addressing key limitations of traditional RAG implementations.
The system demonstrates several aspects of technical excellence:
With further development, this system could provide:
This system establishes a foundation for advanced insurance technology applications:
The successful implementation of this insurance RAG system demonstrates that with careful architecture, attention to domain requirements, and thoughtful human-AI collaboration, it's possible to create AI systems that deliver real business value while maintaining high standards of accuracy and reliability.
The project showcases the power of combining multiple AI technologies - semantic understanding, vector databases, large language models, and intelligent document processing - into a cohesive solution. As AI continues to transform various industries, this work provides a blueprint for building robust, scalable, and user-friendly AI applications that solve real-world problems.
Tags: #AI #RAG #DocumentProcessing #Insurance
License: MIT License
Author: Yuvaranjani
Version: 1.0