Retrieval-Augmented Generation (RAG) systems address a fundamental limitation of Large Language Models (LLMs): their inability to reliably ground responses in user-provided data.
This publication presents RAG Engine, a near production-ready RAG system designed to transform static documents into an interactive, grounded knowledge base. The system emphasizes accuracy, efficiency, and reproducibility through deterministic document processing, semantic retrieval, and strict prompt grounding. A key contribution of this work is a smart caching mechanism based on content hashing and deterministic identifiers, enabling zero-duplication vector storage and significant performance gains for repeated document uploads.
Large Language Models are powerful but inherently limited by static training data and a tendency to hallucinate when answering document-specific questions. Retrieval-Augmented Generation (RAG) mitigates these issues by retrieving relevant document context at query time and constraining generation to that context.
RAG Engine was developed to explore how RAG systems can be engineered beyond proof-of-concept demos into systems that resemble real-world, production-oriented applications. The project focuses on practical challenges such as document reuse, efficient vector storage, session management, and controllable retrieval behavior. It tackles hallucination by grounding answers in facts retrieved directly from the uploaded document at query time.
The system supports querying custom PDF and TXT documents without requiring model fine-tuning, allowing knowledge updates through simple document uploads. It is designed to be usable both as a persistent local application and as a cloud-deployable demo, highlighting tradeoffs between functionality, security, and persistence.
The RAG Engine follows a modular pipeline consisting of document ingestion, preprocessing, vector storage, retrieval, and generation. Uploaded documents are validated, parsed, chunked, embedded, and stored in a persistent ChromaDB vector database. Document metadata and session mappings are maintained using a relational SQLite database.
A single document is processed only once. Multiple user sessions can reference the same document without duplicating embeddings, enabling efficient reuse and consistent retrieval behavior.
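A minimal sketch of this shared-storage layout, assuming a persistent ChromaDB collection and a SQLite metadata schema (the names, tables, and helper function below are illustrative, not the project's actual code; `doc_id` and `chunks` come from the chunking and hashing steps described next):

```python
import sqlite3
import chromadb

# Illustrative store names and schema; RAG Engine's actual identifiers may differ.
client = chromadb.PersistentClient(path="./chroma_store")
collection = client.get_or_create_collection(
    name="documents",
    metadata={"hnsw:space": "cosine"},  # cosine similarity, matching the retrieval step
)
db = sqlite3.connect("metadata.db")
db.execute("CREATE TABLE IF NOT EXISTS documents (doc_id TEXT PRIMARY KEY, filename TEXT)")
db.execute("CREATE TABLE IF NOT EXISTS session_documents (session_id TEXT, doc_id TEXT)")

def ingest(session_id: str, filename: str, doc_id: str, chunks: list[str]) -> None:
    """Store a document's chunks once; sessions only reference the shared doc_id."""
    already_processed = db.execute(
        "SELECT 1 FROM documents WHERE doc_id = ?", (doc_id,)
    ).fetchone()
    if not already_processed:
        # First time this content is seen: embed and persist the chunks.
        # (Chroma's default embedder is used here; the real system may call an external API.)
        collection.add(
            ids=[f"{doc_id}:{i}" for i in range(len(chunks))],
            documents=chunks,
            metadatas=[{"doc_id": doc_id}] * len(chunks),
        )
        db.execute("INSERT INTO documents VALUES (?, ?)", (doc_id, filename))
    # Either way, link the current session to the (possibly pre-existing) document.
    db.execute("INSERT INTO session_documents VALUES (?, ?)", (session_id, doc_id))
    db.commit()
```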
Text is segmented using a Recursive Chunking Strategy with a default chunk size of 1500 characters and a 150-character overlap. This approach preserves semantic coherence while preventing context loss at chunk boundaries. The chunking process prioritizes natural language structure by splitting first on paragraphs and sentences before falling back to smaller units.
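A sketch of such a splitter, here using LangChain's `RecursiveCharacterTextSplitter` as a stand-in (whether RAG Engine uses this library is an assumption; the parameters mirror the defaults stated above):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Separators are tried in order: paragraphs, lines, sentences, words, characters.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,    # maximum characters per chunk
    chunk_overlap=150,  # overlap to avoid losing context at chunk boundaries
    separators=["\n\n", "\n", ". ", " ", ""],
)

def chunk_document(text: str) -> list[str]:
    """Split raw document text into overlapping, semantically coherent chunks."""
    return splitter.split_text(text)
```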
To eliminate redundant processing, the system computes a SHA-256 hash from the raw document bytes and generates a deterministic UUID5 identifier. This identifier uniquely represents the document content regardless of filename changes. Before processing a new upload, the system checks whether the document identifier already exists in the database. If found, the existing embeddings and chunks are reused immediately, without reprocessing or additional embedding API calls, which substantially reduces upload time.
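This identifier scheme can be expressed directly with Python's standard `hashlib` and `uuid` modules; the namespace choice below is an assumption made for illustration:

```python
import hashlib
import uuid

def deterministic_doc_id(raw_bytes: bytes) -> str:
    """Derive a content-based identifier that is stable across filename changes."""
    content_hash = hashlib.sha256(raw_bytes).hexdigest()
    # UUID5 is deterministic: the same hash always maps to the same identifier.
    # NAMESPACE_DNS is used here only for illustration; any fixed namespace works.
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, content_hash))

def is_cached(db, doc_id: str) -> bool:
    """Check whether this exact content has already been processed and embedded."""
    return db.execute(
        "SELECT 1 FROM documents WHERE doc_id = ?", (doc_id,)
    ).fetchone() is not None
```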
At query time, the system embeds the user query and performs cosine similarity search against the stored document vectors. A configurable number of top-ranked chunks are retrieved and passed into a strictly grounded prompt template. The LLM is explicitly instructed to answer only within the provided context, reducing hallucinations and improving response reliability. The retrieved chunks sent to the LLM are also surfaced to the user, making it transparent which parts of the uploaded document grounded the generated answer.
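A sketch of the retrieval and grounding step, reusing the ChromaDB collection from the earlier sketch; the prompt wording and parameter names are illustrative, not the project's exact template:

```python
GROUNDED_PROMPT = """Answer the question using ONLY the context below.
If the answer is not contained in the context, say that you don't know.

Context:
{context}

Question: {question}
Answer:"""

def retrieve_and_prompt(question: str, doc_id: str, top_k: int = 4) -> tuple[str, list[str]]:
    """Fetch the top-k most similar chunks for this document and build the grounded prompt."""
    results = collection.query(
        query_texts=[question],    # the query is embedded with the collection's embedder
        n_results=top_k,
        where={"doc_id": doc_id},  # restrict retrieval to the session's document
    )
    chunks = results["documents"][0]
    prompt = GROUNDED_PROMPT.format(
        context="\n\n---\n\n".join(chunks),
        question=question,
    )
    return prompt, chunks  # the chunks are also shown to the user for transparency
```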
The system supports multiple LLM providers, enabling flexibility in performance, cost, and inference speed.
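Provider flexibility can be captured with a thin dispatch layer like the hypothetical sketch below; the registry and function names are assumptions, not RAG Engine's actual configuration mechanism:

```python
from typing import Callable

# Hypothetical registry mapping provider names to generation callables.
# Real entries would wrap the corresponding SDK client or a local model.
PROVIDERS: dict[str, Callable[[str], str]] = {}

def register_provider(name: str, generate_fn: Callable[[str], str]) -> None:
    """Register an LLM backend under a short name."""
    PROVIDERS[name] = generate_fn

def generate_answer(prompt: str, provider: str) -> str:
    """Route the grounded prompt to whichever backend the user selected."""
    if provider not in PROVIDERS:
        raise ValueError(f"Unknown LLM provider: {provider}")
    return PROVIDERS[provider](prompt)
```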
We created two UI versions, corresponding to the deployment modes described above: a persistent local application and a cloud-deployable demo.
Both versions were tested using a variety of document types, including technical documentation, policy files, and long-form text documents. Experiments focused on measuring document upload latency, retrieval accuracy, and the performance impact of the caching mechanism.
Repeated uploads of identical documents were used to evaluate cache effectiveness. Retrieval parameters such as chunk size, overlap, and number of retrieved chunks were adjusted to observe their impact on response quality and latency.
Initial uploads of medium-sized documents (40–60 pages) required approximately 7–10 seconds for full processing, including chunking and embedding. Subsequent uploads of identical documents completed in under one second due to the smart caching mechanism, representing up to a 10× speed improvement.
The caching mechanism eliminated redundant embedding calls entirely for previously processed documents, resulting in significant cost savings and reduced network usage. Retrieval quality remained consistent across sessions, confirming that deterministic identifiers and shared vector storage do not degrade system behavior.
Query response times averaged 3–4 seconds, with most latency attributed to LLM inference rather than retrieval or vector search.
RAG Engine demonstrates how Retrieval-Augmented Generation systems can be engineered with production-oriented concerns such as efficiency, determinism, and scalability in mind. By combining semantic retrieval, strict prompt grounding, and a content-based caching strategy, the system delivers accurate, reproducible, and cost-efficient document-based question answering.
This project highlights the importance of architectural decisions in RAG systems and provides a practical reference implementation for teams seeking to move beyond experimental prototypes toward real-world deployments.