
Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding and generation. However, despite their fluency, they suffer from a fundamental limitation: their knowledge is static, bounded by a training cutoff, and prone to hallucination when asked about information outside their learned parameters. This project addresses that limitation by implementing a Retrieval-Augmented Generation (RAG) system that grounds model responses in an external, curated knowledge base.
The primary objective of this project is to transform a generic conversational AI into a domain-aware, evidence-backed research assistant capable of answering questions strictly based on provided documents. Rather than relying on parametric memory alone, the system retrieves relevant information on demand and uses it to generate accurate, transparent, and verifiable responses. This work demonstrates a complete, production-oriented RAG pipeline.
Even with perfect conversational memory, an LLM remains “frozen in time” at its training cutoff. As a result, it may confidently provide outdated, incomplete, or entirely fabricated information. This behavior is particularly problematic in technical, scientific, and research-oriented settings where accuracy, traceability, and source credibility are essential.
The motivation behind this project is to mitigate these risks by:
• Enabling real-time access to external knowledge
• Reducing hallucinations through relevance-based filtering
• Enforcing grounded answers and explicit refusals when information is missing
By doing so, the system demonstrates how RAG acts as the foundational memory layer for agentic AI systems, enabling safer and more reliable reasoning.
The knowledge base for this assistant consists of multiple structured .txt documents covering a range of technical domains, including:
• Artificial Intelligence
• Biotechnology
• Quantum Computing
• Sustainable Energy
• Space Exploration and related scientific topics
The system currently supports plain .txt documents only. The default dataset consists of the sample files provided in the official ReadyTensor Project template repository, and users can modify the assistant’s knowledge base by adding or removing .txt files from the /data/ directory followed by re-ingestion.
These documents were intentionally selected/chosen to simulate a realistic, multi-domain research corpus. The diversity of topics allows for evaluation of retrieval precision under mixed-domain conditions and highlights the importance of relevance filtering. Each document is treated as a first-class data source and preserved with metadata (such as filename, chunk identifiers, and source references) to support traceability, citation, and transparency.
To prepare documents for semantic retrieval, the system applies a Recursive Character Text Splitter, which breaks long documents into smaller, overlapping chunks. Overlapping is critical to ensure that important contextual information is not lost at chunk boundaries, particularly in technical explanations that span multiple paragraphs.
Each chunk is embedded using a Sentence Transformer model (all-MiniLM-L6-v2), producing dense vector representations that capture semantic meaning rather than surface-level keywords. These embeddings are then stored in a persistent ChromaDB vector database, allowing the knowledge base to survive application restarts and enabling efficient similarity search.
This methodology directly reflects best practices emphasized in the course material, including:
• Appropriate chunk sizing and overlap
• Semantic embedding over keyword-based search
• Persistent vector storage for production-readiness
The assistant follows a classic two-phase RAG architecture:
5.1 Knowledge Ingestion (Insertion Phase)
During ingestion, documents are:
Loaded from a local data directory
Split into overlapping semantic chunks
Embedded into vector representations
Stored in ChromaDB alongside metadata


This phase builds the searchable knowledge base and is designed to be repeatable as new documents are added.

5.2 Retrieval and Generation (Inference Phase)
When a user submits a query:
The query is embedded into the same vector space
A similarity search retrieves the most relevant chunks using cosine distance
Results are filtered using a relevance threshold
Filtered context is passed to a large language model for answer generation


A key design decision is the use of relevance thresholding. Queries that do not meet the threshold are rejected, preventing off-topic questions from triggering hallucinated responses. Additionally, prompt hardening enforces strict grounding rules, instructing the model to refuse speculation and respond with “I don’t know based on the provided documents” when necessary.
5.3 System Architecture Diagram

The diagram above illustrates the modular separation between retrieval and generation within the system. User queries are first embedded and matched against stored document vectors in ChromaDB using cosine similarity. Retrieved chunks are then filtered through a relevance threshold before being assembled into grounded context for the LLM. Importantly, the language model never interacts directly with embeddings or the vector database; it operates strictly on retrieved text. This architectural separation ensures that factual accuracy is driven by retrieval quality, while generation remains controlled, transparent, and resistant to hallucination.
The system is designed to support multiple large language model providers: OpenAI (ChatOpenAI), Groq (ChatGroq), and Google Gemini (ChatGoogleGenerativeAI). The decision to support multiple providers was intentional and aligned with architectural flexibility, benchmarking capability, and real-world deployment considerations.
6.1 Why Multi-Provider Support?
In production-grade AI systems, model flexibility is critical. Different providers offer varying trade-offs in latency, cost, model size, reasoning ability, reliability, and availability. By abstracting the LLM layer through LangChain, this system decouples retrieval logic from the underlying model provider.
This design enables:
• Provider interchangeability without architectural modification
• Cost optimization depending on deployment constraints
• Performance benchmarking across models under identical retrieval conditions
• Resilience against provider outages or API limitations
• Reduced vendor lock-in
Because the retrieval pipeline remains constant, generation quality can be compared objectively across providers while holding retrieval fixed. This allows controlled evaluation of model reasoning quality within the same RAG framework.
Such abstraction aligns with best practices in agentic system design, where modularity, fault tolerance, and adaptability are essential characteristics of production-ready AI systems.
6.2 OpenAI (ChatOpenAI)
OpenAI models are widely regarded as strong general-purpose reasoning models with robust instruction-following behavior. In a RAG setting, instruction adherence is particularly critical when enforcing:
• Strict grounding in retrieved context
• Explicit refusal behavior
• Resistance to prompt injection
OpenAI models were included to provide high-quality, reliable generation performance and strong compliance with system-level constraints, making them well-suited for controlled RAG environments where hallucination prevention is prioritized.
6.3 Groq (Llama 3 via ChatGroq)
Groq provides optimized inference for open-weight models such as Llama 3, offering:
• Lower latency responses
• Cost efficiency
• High-throughput inference
Including Groq enables experimentation with open-weight models within the same RAG pipeline. This supports comparative benchmarking and evaluation of cost-performance trade-offs while maintaining identical retrieval logic.
Groq integration also demonstrates architectural flexibility, ensuring the system is not dependent on a single proprietary model ecosystem.
6.4 Google Gemini (ChatGoogleGenerativeAI)
Google Gemini models provide strong reasoning performance and enterprise-grade API infrastructure. Including Gemini ensures cross-provider compatibility and further demonstrates that the RAG architecture is provider-agnostic.
Supporting Gemini reinforces the system’s modularity and prepares the architecture for potential future multimodal extensions, even though the current implementation focuses on text-only retrieval.
6.5 Alignment with RAG Architecture
Importantly, the retrieval pipeline remains constant regardless of the selected LLM provider. The LLM is responsible only for synthesizing responses from retrieved context.
This separation ensures that:
• Retrieval quality drives factual accuracy
• The model does not rely on parametric memory for domain knowledge
• Grounded generation is enforced through prompt hardening and system-level constraints
• By intentionally decoupling retrieval and generation, the architecture reflects production-ready AI system design principles, ensuring scalability, adaptability, and controlled model behavior.
| Feature | Function | Use |
|---|---|---|
| Semantic Retrieval | Information is retrieved based on meaning rather than keyword matching. | Enables accurate context selection by matching user queries to semantically similar document chunks in vector space. |
| Relevance Thresholding | Low-confidence or off-topic queries are automatically rejected. | Prevents hallucinations by filtering out results whose cosine distance exceeds the defined similarity threshold. |
| Prompt Hardening | System prompts override user attempts to elicit unsupported or speculative answers. | Enforces grounded generation by ensuring the LLM only uses retrieved context and refuses unsupported queries. |
| Transparency | Retrieved sources and vector distance scores are exposed to users. | Provides traceability and confidence indicators, allowing users to assess answer reliability. |
| Persistent Storage | ChromaDB ensures durability across application restarts. | Maintains embedded document vectors between sessions, enabling consistent retrieval without re-ingestion. |
| Multi-Model Compatibility | The system supports OpenAI, Groq (Llama 3), and Google Gemini backends. | Allows flexible deployment, benchmarking, and reduced vendor lock-in through modular LLM abstraction. |


8.1 Retrieval Quality
Retrieval quality is evaluated using vector distance scores returned by ChromaDB. Lower distances indicate higher semantic similarity and are surfaced to the user to enable confidence assessment. The system’s behavior is evaluated across three primary scenarios:
• High-relevance queries within the dataset
• Ambiguous or cross-domain queries
• Completely out-of-scope queries
This evaluation framework prioritizes controlled refusal over incorrect answers, reflecting real-world safety and reliability requirements.



8.2 Automated Retrieval Evaluation (Top-K Accuracy Analysis)
To quantitatively evaluate retrieval performance, an automated evaluation script (evaluation.py) was implemented. This script measures Top-K retrieval accuracy before and after relevance threshold filtering, allowing objective validation of retrieval effectiveness.
Evaluation Configuration
• Top-K = 5
• Relevance Threshold = 1.3
• Test Cases = 3 domain-aligned queries
• Metric = Top-K Hit Rate
A hit is defined as the expected source document appearing within the Top-K retrieved results.
Example Evaluation Output

Interpretation of Results
Raw Retrieval Top-K Hit Rate: 100%
This indicates that for all evaluation queries, the correct source document was retrieved within the Top-5 results prior to relevance filtering.
Post-Threshold Top-K Hit Rate: 100%
After applying the cosine distance threshold (1.3), no valid results were incorrectly rejected. This confirms that:
• The relevance threshold does not degrade retrieval recall
• Threshold gating preserves in-domain answers
• Off-topic queries can be safely rejected without harming valid retrieval
Distance Analysis
Observed best cosine distances ranged between:
• 0.31 (high semantic similarity)
• 0.50 (moderate similarity but still well below threshold 1.3)
Since lower cosine distance = higher relevance in ChromaDB’s HNSW implementation, these values indicate strong semantic alignment between queries and indexed document chunks.
| Metric | Value |
|---|---|
| Top-K | 5 |
| Raw Top-K Hit Rate | 100% |
| Post-Threshold Top-K Hit Rate | 100% |
| Relevance Threshold | 1.3 |
| Evaluation Queries | 3 |
| Best Distance Range | 0.31 – 0.50 |
Why This Matters
This evaluation demonstrates:
• High retrieval precision
• Effective threshold tuning
• Stability of semantic embeddings
• Controlled gating without recall degradation
By implementing measurable retrieval metrics, the system moves beyond qualitative demonstration and provides quantitative validation of retrieval accuracy — aligning with production-grade RAG evaluation practices.
When valid, domain-aligned queries are issued, the assistant produces concise, source-cited responses derived exclusively from retrieved document chunks. When queries fall outside the dataset scope (e.g., current events or unrelated topics), the system consistently responds with a transparent refusal rather than hallucinating.
These results validate the effectiveness of semantic retrieval, relevance gating, and prompt hardening as complementary mechanisms for building trustworthy AI systems.
While the system is demonstrated using a multi-domain technical dataset, the architectural design supports a wide range of real-world applications.
10.1 Enterprise Knowledge Management
Organizations often maintain internal documentation such as technical specifications, policy manuals, compliance documents, and operational procedures. Traditional search systems rely on keyword matching, which can fail when terminology differs from the query phrasing.
By leveraging dense semantic embeddings and vector similarity search, this RAG assistant can retrieve contextually relevant internal documentation even when lexical overlap is minimal. Relevance thresholding ensures that if no sufficiently similar content exists, the system refuses to fabricate answers — a critical requirement in compliance-driven industries such as finance, healthcare, and legal services.

10.2 Research and Academic Assistance
In research environments, assistants must provide responses grounded in verified source material. The modular retrieval-generation separation ensures that the model synthesizes responses strictly from indexed documents rather than relying on potentially outdated parametric knowledge.
This makes the system suitable for:
• Literature review assistance
• Technical concept clarification
• Domain-specific Q&A over curated research corpora
• Cross-domain exploratory querying
Because all outputs are source-cited and traceable to specific document chunks, the system promotes academic integrity and transparency.

10.3 Internal Technical Support Systems
Technical teams frequently rely on internal documentation for troubleshooting and onboarding. A RAG-based assistant can act as a contextual help system that retrieves relevant procedure steps, configuration guidelines, or historical issue resolutions.
The relevance gating mechanism prevents the assistant from responding to unrelated queries, reducing the risk of misleading technical guidance. This is particularly important in high-stakes environments such as infrastructure management or DevOps operations.

10.4 Regulated and High-Trust Environments
In regulated industries, generative AI systems must demonstrate:
• Traceable reasoning
• Controlled output behavior
• Hallucination resistance
• Deterministic refusal policies
This architecture explicitly enforces these properties through:
• Prompt hardening
• Relevance threshold filtering
• Citation requirements
• Retrieval-generation decoupling
As a result, the system design aligns with emerging best practices for trustworthy and governable AI systems.

While effective, system performance depends on several factors:
• Quality and coverage of source documents
• Chunking strategy and overlap configuration
• Embedding model selection
For production deployment, additional considerations would include monitoring, logging, access control, automated re-ingestion pipelines, and periodic re-embedding to maintain data freshness.
This project demonstrates how Retrieval-Augmented Generation enables grounded reasoning, improved safety, and domain specificity in agentic AI systems. It highlights RAG’s role as the architectural backbone for intelligent assistants that must reason over external knowledge.
Future enhancements include:
• Support for additional document formats (PDFs, HTML)
• Adaptive relevance thresholds
• Comparative embedding model evaluation
• Real-time ingestion and update pipelines
• Enhanced evaluation metrics and benchmarking
This project is grounded in established research and modern production tooling within the fields of retrieval-augmented generation, semantic search, and vector database systems.
Foundational Research
• Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., & Kiela, D. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems (NeurIPS).
https://arxiv.org/abs/2005.11401
This seminal paper introduced the RAG framework, combining parametric language models with non-parametric external memory via dense retrieval. The architecture implemented in this project reflects these principles: retrieval of semantically relevant context followed by grounded generation.
• Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.
https://arxiv.org/abs/1908.10084
Sentence-BERT introduced efficient semantic similarity search using dense vector embeddings. The embedding model used in this project (all-MiniLM-L6-v2) follows the Sentence Transformer paradigm for high-performance semantic retrieval.
• Karpukhin, V., Oguz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., & Yih, W. (2020). Dense Passage Retrieval for Open-Domain Question Answering.
https://arxiv.org/abs/2004.04906
Dense Passage Retrieval (DPR) introduced a dual-encoder architecture for learning dense vector representations of questions and passages, enabling highly efficient semantic retrieval using inner-product similarity. Unlike traditional sparse retrieval methods such as BM25, DPR leverages transformer-based encoders to map queries and documents into the same embedding space, allowing retrieval based on semantic meaning rather than lexical overlap.
The retrieval component implemented in this project aligns with the principles introduced in DPR. By embedding both document chunks and user queries into a shared vector space and performing cosine similarity search within ChromaDB, the system applies dense retrieval techniques foundational to modern Retrieval-Augmented Generation architectures. This approach enables robust cross-domain retrieval and improved recall for semantically related but lexically diverse queries.
Technical Documentation and Frameworks
• LangChain Documentation
https://python.langchain.com/
LangChain provides orchestration tools for building LLM-powered pipelines, including prompt management, retrieval chains, and model abstraction layers used in this system.
• ChromaDB Documentation
https://docs.trychroma.com/
• ChromaDB serves as the persistent vector database layer, enabling efficient cosine similarity search and durable embedding storage.
• Sentence-Transformers Documentation
https://www.sbert.net/
Provides implementation details and best practices for dense embedding models used in semantic search systems.
Full setup and usage instructions can also be found in the attached repository.
This project requires Python 3.11 or newer.
cd path/to/your/project-root
python -m venv .venv
source .venv/bin/activate # Linux/macOS
.venv\Scripts\activate # Windows
pip install -r requirements.txt
The assistant currently supports plain .txt files only.
Place your .txt files inside:
data/
Important: After adding or removing any files, re-run ingestion from the project root:
python src/app.py --mode ingest
This project supports these providers:
OpenAI (ChatOpenAI)
Groq (ChatGroq)
Google Gemini (ChatGoogleGenerativeAI)
Create a .env file in the project root and set only ONE of the following keys:
OPENAI_API_KEY=your_key_here
or
GROQ_API_KEY=your_key_here
or
GOOGLE_API_KEY=your_key_here
Important: If multiple provider keys are set, the app may choose a provider you did not intend. Keep only the key for the provider you want to use.
All commands below should be executed from the project root directory:
CLI(Chat):
python src/app.py
Ingest only:
python src/app.py --mode ingest
Chat mode:
python src/app.py --mode chat
Streamlit UI:
python -m streamlit run src/streamlit_app.py
Retrieval Evaluation
python src/evaluation.py
Conclusion
By integrating semantic retrieval, relevance filtering, and grounded generation, this project successfully transforms a general-purpose LLM into a reliable, domain-aware research assistant. The system exemplifies how RAG can be applied in practice to build secure, scalable, and production-ready agentic AI systems—meeting both academic rigor and applied industry standards.