
This project implements a Retrieval-Augmented Generation (RAG) based AI Assistant that answers questions using only the content inside a set of domain documents. The system loads text files, splits them into smaller chunks, generates embeddings using Google Gemini Embeddings, stores them in a ChromaDB vector database, and retrieves relevant context when a user asks a question. An LLM (Gemini 2.5 Flash) then produces a grounded answer based strictly on the retrieved text. When the documents do not contain the answer, the model is forced to respond: “I don’t have enough information from the documents.” This project was completed as part of the Ready Tensor Agentic AI Developer Certification – Module 1.
Modern AI systems increasingly rely on Retrieval-Augmented Generation (RAG) to provide accurate, grounded responses with far fewer hallucinations. Traditional language models often generate fluent answers without clear ties to verified sources, which limits transparency and reliability. RAG addresses this by combining embedding-based retrieval with controlled generation, ensuring that every answer is supported by document context.
This project implements a lightweight but robust RAG pipeline using Google Generative AI embeddings, Gemini LLM, and ChromaDB for vector storage. The system ingests domain-specific text files, chunks them with overlap to preserve context, embeds those chunks, and retrieves top-K matches during question-answering. The model strictly answers using only the retrieved context, ensuring factual grounding and preventing unsupported claims.
This publication explains the architecture, design decisions, safety measures, evaluation approach, and real execution results. It demonstrates how a simple yet well-engineered RAG system can deliver reliable, context-aware answers suitable for real-world applications such as documentation assistants, knowledge explorers, and project-based AI tools.
The system follows a standard RAG workflow: load the documents → split them into chunks → embed the chunks with text-embedding-004 → store them in ChromaDB → retrieve the top-K chunks for each query → generate a grounded answer with Gemini 2.5 Flash.
If the answer is not found in the retrieved context, the model returns:
“I don't have enough information from the documents.”
For this project, I prepared three small domain documents, stored in the data/ directory, which act as the custom knowledge base for retrieval.
All documents were manually drafted and curated for clarity, converted to UTF-8 text format, and placed in the data folder for ingestion.
No external proprietary datasets were used, ensuring full reproducibility.
The three documents are summarized below:
| File Name | Description | Approx. Length |
|---|---|---|
| vaes_intro.txt | What VAEs are and common applications | ~250 words |
| vaes_vs_autoencoders.txt | Difference between VAEs and traditional autoencoders | ~220 words |
| transformers_basics.txt | Explanation of transformers and self-attention | ~260 words |
All files are plain text and fully included in this repository.
| Component | Purpose |
|---|---|
| Google Gemini 2.5 Flash (LLM) | Answer generation |
| Google text-embedding-004 | Chunk and query embeddings |
| ChromaDB | Vector store for chunk embeddings |
| LangChain | Prompt templating + pipeline orchestration |
| Python 3.9+ | Runtime |
🔹 vectordb.py
- Acts as a lightweight wrapper around ChromaDB for vector storage and similarity search (a minimal sketch follows this list).
- Splits documents into smaller chunks to improve search accuracy.
- Generates embeddings using Google Generative AI (text-embedding-004).
- Stores each chunk with metadata (source file, chunk index, length).
- Performs fast similarity search and returns the most relevant chunks.
- Includes retry logic to handle temporary API errors (e.g., 504 timeouts).
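To make this concrete, here is a minimal sketch of what such a wrapper could look like. It is illustrative only, not the repository's actual code: the class name DocumentStore and the collection name are assumptions, and it relies on the chromadb and langchain-google-genai packages with GOOGLE_API_KEY available in the environment.

```python
# Illustrative sketch of the vectordb.py wrapper (class and collection names are
# assumptions). Requires the chromadb and langchain-google-genai packages and a
# GOOGLE_API_KEY in the environment.
import chromadb
from langchain_google_genai import GoogleGenerativeAIEmbeddings


class DocumentStore:
    """Thin wrapper around a persistent ChromaDB collection."""

    def __init__(self, persist_dir: str = "chroma_db"):
        self.client = chromadb.PersistentClient(path=persist_dir)
        self.collection = self.client.get_or_create_collection("rag_chunks")
        self.embedder = GoogleGenerativeAIEmbeddings(model="models/text-embedding-004")

    def add_chunks(self, chunks: list[str], source: str) -> None:
        # Embed each chunk and store it with metadata (source file, chunk index, length).
        vectors = self.embedder.embed_documents(chunks)
        self.collection.upsert(  # upsert so re-ingesting a file does not duplicate chunks
            ids=[f"{source}-{i}" for i in range(len(chunks))],
            documents=chunks,
            embeddings=vectors,
            metadatas=[{"source": source, "chunk_index": i, "length": len(c)}
                       for i, c in enumerate(chunks)],
        )

    def search(self, query: str, k: int = 3) -> dict:
        # Return the k most similar chunks together with their metadata and distances.
        query_vec = self.embedder.embed_query(query)
        return self.collection.query(query_embeddings=[query_vec], n_results=k)
```

Keeping the source file, chunk index, and length as metadata is what later lets app.py cite which document an answer came from.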
🔹 app.py
- Loads .txt documents and sends them for ingestion.
- Builds a RAG prompt that forces answers to rely only on retrieved context.
- Runs the full RAG pipeline: retrieve → format context → call LLM → output answer + sources (a pipeline sketch follows this list).
- Gracefully handles missing info by returning: “I don't have enough information from the documents.”
- Uses .env for secure API key management and .env.example for reproducibility.
- Leverages ChromaDB persistence to avoid re-embedding on every run.
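The question-answering flow can be sketched roughly as follows. Again, this is an illustrative outline rather than the repository's exact implementation; it reuses the hypothetical DocumentStore above and assumes the langchain-google-genai and python-dotenv packages.

```python
# Illustrative sketch of the app.py question-answering flow (not the exact code).
from dotenv import load_dotenv
from langchain_core.prompts import ChatPromptTemplate
from langchain_google_genai import ChatGoogleGenerativeAI

load_dotenv()  # read GOOGLE_API_KEY from the local .env file

FALLBACK = "I don't have enough information from the documents."

PROMPT = ChatPromptTemplate.from_template(
    "Answer the question using ONLY the context below.\n"
    "If the context does not contain the answer, reply exactly:\n"
    f"{FALLBACK}\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash")


def answer_question(store, question: str, k: int = 3) -> str:
    # Retrieve the top-k chunks, format them as context, and ask the LLM.
    results = store.search(question, k=k)
    docs = results["documents"][0]
    if not docs:
        return FALLBACK
    sources = sorted({m["source"] for m in results["metadatas"][0]})
    context = "\n\n".join(docs)
    reply = llm.invoke(PROMPT.format_messages(context=context, question=question))
    return f"{reply.content}\n(sources: {', '.join(sources)})"
```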
This project is released under the MIT License, enabling anyone to use, modify, and distribute the code with proper attribution. The complete license text is included in the GitHub Repository.
ChromaDB was selected because:
- It is lightweight, fast, and easy to integrate.
- It works locally without requiring external hosting.
- It supports metadata, filtering, and similarity search.
- It is ideal for small- and medium-scale RAG systems.
- Its simple API makes it well suited to certification learning projects.
For this project, ChromaDB offered a balance of speed, simplicity, and flexibility, making it the optimal choice.
This project uses a baseline chunk strategy:
- Chunk size: ~500 characters
- Chunk overlap: 0 (default)
This decision was intentional to keep the system simple and deterministic.
However, chunk overlap—typically 20–50 characters—is often used in production to keep semantic connections across sentences. The implementation has been updated to allow easy overlap configuration for improved retrieval accuracy.
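As a rough illustration, an overlap-aware splitter can be as small as the function below; the parameter names are assumptions, with defaults matching the 500-character chunks and 40-character overlap used during evaluation.

```python
# Rough sketch of an overlap-aware splitter; parameter names are assumptions,
# defaults follow the 500-character chunks and 40-character overlap used here.
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 40) -> list[str]:
    """Split text into fixed-size character chunks that share `overlap` characters,
    so sentences spanning a chunk boundary are not lost."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```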
The RAG assistant was designed with strong safety and hallucination-prevention principles.
Instead of allowing the model to guess or fabricate information, the system forces the LLM to rely exclusively on the retrieved document context. Whenever a query cannot be answered based on the available text, the assistant clearly states:
“I don't have enough information from the documents.”
This explicit fallback ensures full transparency.
A carefully constructed prompt template restricts the model from using external knowledge or creative reasoning beyond the provided text. Additionally, the system sanitizes user inputs, checks for empty or invalid queries, and gracefully handles missing or incomplete content. These measures collectively create a controlled, predictable environment where the model remains grounded and the risk of hallucinations is minimized.
To ensure high-quality, consistent retrieval, the system incorporates lightweight but meaningful evaluation practices.
Each incoming query is preprocessed by converting it to lowercase, trimming unnecessary whitespace, and removing non-text noise so the embeddings represent the clean semantic meaning of the question. This helps stabilize retrieval scores and reduce mismatches caused by formatting or punctuation.
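A minimal sketch of this preprocessing step is shown below; the exact noise-removal pattern is an assumption rather than the repository's implementation.

```python
# Minimal sketch of the query clean-up; the noise-removal pattern is an assumption.
import re


def preprocess_query(raw: str) -> str:
    """Lowercase, trim whitespace, and strip non-text noise before embedding."""
    query = raw.lower().strip()
    query = re.sub(r"[^a-z0-9?\-\s]", " ", query)  # drop punctuation noise
    return re.sub(r"\s+", " ", query).strip()      # collapse repeated whitespace
```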
Once results are retrieved, they are evaluated based on contextual relevance and their alignment with the expected semantic meaning. A short manual relevance check—comparing retrieved content against the intended query—helps confirm the RAG pipeline is functioning correctly. Distances from the vector store are also observed to understand how confidently a chunk matches the user query.
These simple yet effective evaluation steps allow the system to assess retrieval quality without requiring heavy external frameworks.
Following reviewer recommendations, the system was strengthened to behave more consistently even in challenging conditions.
Embedding API calls are wrapped with a retry mechanism so that temporary network delays or timeouts do not break the workflow. Chunking logic was refined to maintain sentence boundaries and preserve context across segments, reducing retrieval fragmentation. Error handling now ensures that unexpected situations—such as empty documents, malformed queries, or failed embeddings—do not cause the system to crash.
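The retry behaviour can be approximated with a small wrapper like the one below; the retry count and backoff schedule are illustrative assumptions.

```python
# Approximate sketch of the retry wrapper; retry count and backoff are assumptions.
import time


def embed_with_retry(embedder, texts, retries: int = 3, backoff: float = 2.0):
    """Retry transient embedding failures (e.g. 504 timeouts) with exponential backoff."""
    for attempt in range(1, retries + 1):
        try:
            return embedder.embed_documents(texts)
        except Exception:
            if attempt == retries:
                raise                       # give up after the final attempt
            time.sleep(backoff ** attempt)  # wait 2s, 4s, ... before retrying
```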
These improvements make the end-to-end RAG pipeline significantly more reliable and resilient during repeated use.
The system was evaluated using real queries executed against the stored documents.
An output was considered correct if:
✅ The answer used only the retrieved context
✅ The model cited the correct document
✅ If no answer existed, it returned the fallback sentence
To evaluate the Retrieval-Augmented AI Assistant, I performed controlled tests focusing on retrieval quality, grounding accuracy, and overall system reliability.
All experiments were run locally on a machine with the following configuration:
| Component | Specification |
|---|---|
| CPU | Intel Core i3 (local laptop) |
| RAM | 8 GB (sufficient for ChromaDB local operations) |
| GPU | Not required (embeddings + LLM served by Google) |
| OS | Windows 10 / 11 |
| Environment | Python virtual environment (.venv) |
| Embedding Model | Google Generative AI – text-embedding-004 |
| Vector Store | ChromaDB (persistent mode enabled) |
| LLM | Gemini 2.5 Flash |
| Chunk Size / Overlap | 500 characters, 40-character overlap |
| Top-k Retrieval | k = 3 |
The evaluation focused on understanding how well the system retrieves relevant text and how accurately the LLM grounds its answers in that context.
Queries were tested across multiple topics and documents. Retrieved chunks were manually inspected for relevance, coherence, and their ability to answer the question. The model’s responses were then verified against these chunks to ensure no external information was introduced.
The impact of chunk overlap was examined by comparing overlapping versus non-overlapping retrieval. Overlap provided smoother continuity and significantly reduced meaning loss across boundaries. The system was also tested with malformed queries, unrelated questions, empty inputs, and simulated network failures to evaluate its robustness.
Each of these tests helped confirm that the system behaves predictably under normal and adverse conditions.
Although the project did not require formal benchmark metrics, practical and informative indicators were used to assess the pipeline’s performance.
Manual relevance scoring was applied to retrieved chunks, verifying whether they aligned with the user’s question. Grounding accuracy was measured by checking whether the model’s answer was entirely supported by the retrieved text. Overlap continuity was assessed to determine whether combined chunks preserved the natural flow of information.
Robustness was validated through retry success rates and a “zero hallucination” expectation. These metrics provided a clear understanding of system performance without overcomplicating the evaluation.
| Query | Expected Result | System Output | Result |
|---|---|---|---|
| “What are VAEs used for?” | Generation, anomaly detection, imputation | ✅ Correct & cited | ✅ Pass |
| “Difference between VAEs and autoencoders?” | Comparison exists in documents | ✅ Correct & cited | ✅ Pass |
| “How do transformers model long-range dependencies?” | Self-attention mechanism | ✅ Correct & cited | ✅ Pass |
✅ Accuracy: 100% (3/3 queries grounded in retrieved text)
| Query | Screenshot |
|---|---|
| What are VAEs used for? | ![]() |
| What is the difference between VAEs and autoencoders? | ![]() |
| How do transformers model long-range dependencies? | ![]() |
| "What are VAEs used for?" --dump-context | ![]() |
(Images stored in /assets/ folder inside the repo)
The system currently works only with plain .txt files, and does not yet support PDFs or HTML content. Retrieval is based purely on semantic vector similarity without additional ranking mechanisms such as keyword weighting or RRF fusion.
The assistant processes one question at a time and does not maintain dialogue memory across turns. Additionally, evaluations were performed on a relatively small corpus. The system is functional but not yet optimized for enterprise-scale datasets or fully deployed environments.
Several enhancements are planned to extend the system’s capabilities.
PDF ingestion, along with OCR processing, will allow the assistant to handle more diverse document formats. A Streamlit or FastAPI interface can turn the solution into an interactive chat assistant.
Introducing conversational memory will support multi-turn reasoning, while integrating cloud-based vector stores like Pinecone or Weaviate can improve scalability. Automated evaluation with more queries will help build stronger confidence in performance.
The system is simple to deploy and can run entirely on a local machine using a Python virtual environment.
It can also be containerized through Docker for reproducibility or extended into a web interface via Streamlit. Proper management of API keys through .env files ensures secure use during deployment.
These options make the assistant suitable for both experimentation and lightweight production environments.
General-purpose LLMs often fail to answer domain-specific questions reliably because they lack the specific context required. This system fills that gap by grounding responses directly in user-provided documents, ensuring that answers are both accurate and contextually aligned.
By constraining the LLM and tightly integrating retrieval, the approach dramatically reduces hallucinations and improves trustworthiness.
Maintaining the system is simple and requires minimal overhead.
Users can add or update documents by placing new text files into the data/ folder and re-running ingestion. Switching to a remote vector database enables persistent storage and multi-user scaling. Periodically refreshing embeddings ensures improved retrieval as documents evolve.
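Re-ingestion can then be a short script along these lines, reusing the hypothetical DocumentStore and chunk_text helpers sketched earlier; upserting by chunk ID keeps repeated runs from duplicating entries.

```python
# Usage sketch: re-ingest everything in data/ after adding or updating files
# (DocumentStore and chunk_text are the illustrative helpers sketched earlier).
from pathlib import Path

store = DocumentStore()
for path in Path("data").glob("*.txt"):
    text = path.read_text(encoding="utf-8")
    store.add_chunks(chunk_text(text), source=path.name)
```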
Overall, maintenance remains lightweight while supporting long-term usability.
To show the benefit of using a RAG approach, I compared responses from:
✅ RAG pipeline (Gemini + ChromaDB)
❌ LLM-only prompt (no document retrieval)
| User Question | LLM-Only (No RAG) | RAG System Output |
|---|---|---|
| "What is the difference between VAEs and autoencoders?" | Gives a generic answer based on prior training, even if incorrect | Correctly responds: "I don't have enough information from the documents." |
| "How do transformers model long-range dependencies?" | Generic theoretical response, no citation | Uses retrieved chunks and cites (source: transformers_basics.txt) |
Key Result:
This demonstrates that RAG improves reliability compared to a plain LLM prompt, which can guess or hallucinate.
This RAG-based assistant demonstrates how organizations can build reliable, private knowledge tools grounded in their internal content.
Its architecture provides a strong foundation for document search systems, internal QA assistants, onboarding tools, and customer support knowledge bases.
With further enhancements such as PDF ingestion and semantic evaluation, the system can be adapted into a full-scale enterprise retrieval assistant.
Many enterprises already use RAG systems to ground large language models in proprietary knowledge.
Systems like customer support, medical record search, and legal document lookup rely on RAG for factual accuracy and auditability.
This project demonstrates a complete working RAG assistant using Gemini and ChromaDB. It retrieves context, grounds responses, cites sources, handles missing information gracefully, and follows production-aligned design principles.
👤 Author
Suraj Mahale
AI & Salesforce Developer
GitHub: https://github.com/sbm-11-SFDC