
Every organization faces the same challenge: mountains of valuable information trapped in documents that are difficult to search and impossible to query conversationally. You know the answer exists somewhere in your 500-page technical manual, but finding it requires reading through irrelevant sections. Meanwhile, AI chatbots can answer almost anything—except questions about your specific data.
This project bridges that gap by building a Retrieval-Augmented Generation (RAG) system that combines vector search with multiple Large Language Model (LLM) providers, creating an intelligent assistant that answers questions based solely on your documents while preventing the notorious problem of AI hallucinations.
Retrieval-Augmented Generation represents a fundamental shift in how we deploy AI systems for practical applications. Rather than relying solely on the knowledge baked into an LLM during training, RAG systems dynamically retrieve relevant information from external sources and provide this context to the model before generating responses. This approach transforms generic chatbots into specialized assistants that can answer questions about your specific domain, documents, or proprietary information.
When you ask a regular LLM about your company's return policy, it faces an impossible task. The model has never seen your policy document, yet it will often confidently generate a plausible-sounding answer based on patterns it learned from training data. This response might sound professional and well-structured, but it's completely fabricated. This phenomenon, known as hallucination, makes standalone LLMs unsuitable for scenarios where accuracy is critical.
A RAG system approaches the same question differently. First, it searches through your actual documents to find relevant information about your return policy. Once it locates the correct passages, it provides these as context to the LLM along with the user's question. Now the model isn't inventing an answer from scratch but rather synthesizing and presenting information from your verified sources. The result is accurate, grounded, and trustworthy.
Understanding the RAG pipeline requires following the journey of a user's question through multiple processing stages. Each stage plays a crucial role in transforming a natural language query into an accurate, contextually grounded answer.
┌─────────────────────────┐
│      User Question      │
└────────────┬────────────┘
             │
             ▼
┌─────────────────────────┐
│    Convert to Vector    │
│        Embedding        │
└────────────┬────────────┘
             │
             ▼
┌─────────────────────────┐
│    Search Vector DB     │
│       (ChromaDB)        │
│   Find Similar Chunks   │
└────────────┬────────────┘
             │
             ▼
┌─────────────────────────┐
│     Retrieve Top N      │
│   Relevant Documents    │
└────────────┬────────────┘
             │
             ▼
┌─────────────────────────┐
│    Combine Context +    │
│     Question → LLM      │
│   (GPT/Llama/Gemini)    │
└────────────┬────────────┘
             │
             ▼
┌─────────────────────────┐
│   Grounded, Accurate    │
│         Answer          │
└─────────────────────────┘
The document ingestion layer serves as the entry point for your knowledge base. This component scans designated directories for text and markdown files, reads their contents while handling various encoding formats, and validates that the content is suitable for processing. When errors occur during file reading, the system logs them gracefully and continues processing other documents rather than failing entirely.
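A minimal sketch of such a loader, assuming a data directory of .txt and .md files; the function name and error handling are illustrative, not the project's exact code:

from pathlib import Path

def load_documents(data_dir="data"):
    """Read every .txt and .md file under data_dir, skipping files that cannot be read."""
    documents = {}
    for path in Path(data_dir).rglob("*"):
        if path.suffix.lower() not in {".txt", ".md"}:
            continue
        try:
            # errors="replace" tolerates unexpected encodings instead of crashing
            documents[str(path)] = path.read_text(encoding="utf-8", errors="replace")
        except OSError as exc:
            # Log the failure and keep going rather than aborting the whole run
            print(f"Skipping {path}: {exc}")
    return documents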
At the heart of the retrieval system lies ChromaDB, a vector database optimized for similarity search. When documents enter the system, an embedding model converts their text into high-dimensional numerical vectors called embeddings, which ChromaDB stores and indexes. These embeddings capture semantic meaning rather than keywords, enabling the system to recognize that "automobile" and "car" are related concepts even though the words look nothing alike. The index structure allows lightning-fast similarity searches across millions of documents.
The retrieval engine processes user queries through a similar transformation, converting natural language questions into vector representations that exist in the same semantic space as the document embeddings. By calculating the distance between the query vector and document vectors, the system identifies the most semantically similar chunks. The default configuration returns the top three results, though this parameter can be adjusted based on the complexity of queries your system handles.
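Here is a minimal sketch of this store-and-retrieve cycle using ChromaDB's Python client. The collection name, IDs, and sample texts are placeholders, and the client embeds text with its default embedding function unless you supply one:

import chromadb

client = chromadb.Client()          # in-memory; use chromadb.PersistentClient(path="db") to keep data on disk
collection = client.get_or_create_collection("knowledge_base")

# Ingestion: each text is embedded by the collection's embedding function and stored
collection.add(
    ids=["returns-1", "returns-2", "shipping-1"],
    documents=[
        "Items may be returned within 30 days of delivery.",
        "Refunds are issued to the original payment method within 5 business days.",
        "Standard shipping takes 3 to 7 business days.",
    ],
)

# Retrieval: the question is embedded into the same space and the closest chunks are returned
results = collection.query(
    query_texts=["How long do I have to send something back?"],
    n_results=3,
)
print(results["documents"][0])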
One of the most distinctive aspects of this implementation is its multi-LLM integration layer. Rather than locking users into a single AI provider, the system automatically detects which API keys are available in the environment and initializes the appropriate model. This flexibility provides several strategic advantages including cost optimization through model selection, protection against vendor lock-in, and the ability to experiment with different providers to find the optimal balance of speed, cost, and quality for your specific use case.
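The detection itself can be a simple check of environment variables. The sketch below assumes the LangChain integration packages for each provider and uses illustrative default model names; it mirrors the precedence described later (OpenAI first) but is not the project's exact code:

import os
from langchain_openai import ChatOpenAI
from langchain_groq import ChatGroq
from langchain_google_genai import ChatGoogleGenerativeAI

def init_llm():
    """Return an LLM client based on which API key is set; OpenAI wins if several are present."""
    if os.getenv("OPENAI_API_KEY"):
        return ChatOpenAI(model="gpt-4o-mini")                   # model names are illustrative defaults
    if os.getenv("GROQ_API_KEY"):
        return ChatGroq(model="llama-3.1-8b-instant")
    if os.getenv("GOOGLE_API_KEY"):
        return ChatGoogleGenerativeAI(model="gemini-1.5-flash")
    raise RuntimeError("No LLM API key found; add at least one provider key to your .env file")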
The generation layer combines retrieved context with the user's original question, feeding both to the selected LLM. This is where the magic happens as the model synthesizes information from your documents into coherent, natural language responses. However, raw LLM output can be unpredictable, which is why the system implements comprehensive guardrails that constrain the model's behavior and ensure responses remain accurate, appropriate, and grounded in the provided context.
Building a production-ready RAG system requires more than just connecting a vector database to an LLM. Without proper constraints, even well-designed systems can produce unreliable outputs. Guardrails serve as the safety mechanisms that keep AI responses accurate, appropriate, and trustworthy.
The first and most fundamental guardrail enforces context constraint. The system explicitly instructs the LLM to answer only based on the documents provided in the context. This simple rule has profound implications because it transforms the LLM from a creative generator into a faithful synthesizer. Rather than drawing on its vast training data to craft plausible responses, the model must work exclusively with the information retrieved from your knowledge base.
Clarity serves as the second pillar of our guardrail strategy. The system requires that all responses be clear, polite, and free from ambiguous language. This requirement addresses a common issue in AI systems where models hedge their statements with phrases like "it might be" or "possibly" even when the source material is definitive. By demanding clarity, we ensure users receive confident, actionable answers when the information is available in the documents.
Perhaps the most user-friendly guardrail is the explicit unavailability response. Rather than allowing the system to guess or deflect when information isn't available, we require it to state clearly that it doesn't have the requested data. This honest admission builds trust with users who learn they can rely on the system to acknowledge its limitations rather than inventing answers to seem helpful.
Response length control represents a subtle but important guardrail. By limiting responses to under 80 words, the system accomplishes multiple objectives. First, it prevents the LLM from rambling or adding unnecessary elaboration that might introduce errors. Second, it forces the model to focus on the most relevant information from the retrieved context. Third, it ensures users receive concise answers that respect their time.
The anti-hallucination directive serves as an explicit final instruction that reinforces all other guardrails. While the constraint to use only provided context should theoretically prevent hallucinations, explicitly stating this requirement creates an additional layer of protection. Combined with the other guardrails, this directive significantly reduces the risk of the model generating false information.
The implementation of these guardrails occurs through careful prompt engineering. The system uses a template that begins by establishing the AI's role as a helpful assistant, then immediately lays out the five core rules that govern its behavior. This template structure ensures that every interaction starts with clear constraints:
template = """You are a helpful AI assistant. Use the following context to answer the question.
You must obey the following rules:
1. Do not answer any question outside the documents below.
2. Answer questions in a clear and polite manner.
3. If you are asked any question outside this document, kindly say "I do not have such data with me".
4. Make your response less than 80 words.
5. Never hallucinate.

Context: {context}

Question: {question}

Answer:"""
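To put the template to work, it can be wrapped in a prompt object and handed to a retrieval chain. The sketch below uses LangChain's classic RetrievalQA with the "stuff" chain type; it assumes an llm object like the one returned by the provider-detection sketch above and a LangChain vector store named vectordb, both of which are placeholders here:

from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA

prompt = PromptTemplate(template=template, input_variables=["context", "question"])

# "stuff" inserts the retrieved chunks directly into {context};
# the prompt above imposes the five guardrail rules on every call
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectordb.as_retriever(search_kwargs={"k": 3}),
    chain_type="stuff",
    chain_type_kwargs={"prompt": prompt},
)

answer = qa_chain.invoke({"query": "What is our return policy?"})
print(answer["result"])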
This approach to guardrails creates a system where users can trust the responses they receive. When the assistant provides an answer, users know it's grounded in actual documents. When it admits ignorance, users know to seek information elsewhere rather than acting on fabricated data.
While the current implementation loads complete documents, understanding chunking strategies becomes essential when scaling to larger document collections. The challenge of chunking lies in balancing competing concerns: chunks must be small enough to fit within the LLM's context window and precise enough for accurate retrieval, yet large enough to maintain coherent meaning.
Large Language Models operate under strict context limitations. Even the most advanced models can only process a finite number of tokens in a single request, typically ranging from 4,000 to 128,000 tokens depending on the model. When your documents exceed these limits or when you have extensive document collections, you must split them into smaller segments. However, naive splitting at arbitrary boundaries risks severing important connections and losing contextual meaning.
Consider a technical document discussing a multi-step process. If one chunk ends with "First, initialize the system" and the next chunk begins with "Then configure the parameters," the connection between these steps becomes unclear when they're retrieved separately. The second chunk lacks context about what "the system" refers to or what initialization means in this context.
The solution to maintaining context across chunk boundaries involves overlap. By allowing consecutive chunks to share some content at their boundaries, we create continuity that preserves meaning even when chunks are retrieved independently. Let me illustrate this with a concrete example:
Document: "...AI is transforming healthcare. Machine learning algorithms
can detect diseases early. Early detection saves lives..."
Without Overlap:
Chunk 1: "AI is transforming healthcare."
Chunk 2: "Machine learning algorithms can detect diseases early."
Chunk 3: "Early detection saves lives."
❌ Context lost between chunks
With Overlap (50 characters):
Chunk 1: "AI is transforming healthcare. Machine learning"
Chunk 2: "healthcare. Machine learning algorithms can detect diseases early. Early"
Chunk 3: "diseases early. Early detection saves lives."
✅ Context preserved through overlapping boundaries
In the overlapped version, each chunk contains references to concepts from adjacent chunks. When the retrieval system finds Chunk 2, it includes context about healthcare and establishes the connection to disease detection. Similarly, Chunk 3 maintains the connection between early detection and its life-saving implications.
The optimal chunking strategy depends heavily on your document characteristics. For small documents under 1,000 words, the best approach is often to avoid chunking entirely. Loading the complete document as a single unit preserves all context and relationships, eliminating any risk of breaking important connections. This approach works well for FAQs, short articles, or single-page documents.
Medium-sized documents between 1,000 and 5,000 words benefit from chunks of 500 to 1,000 tokens with an overlap of 100 to 200 tokens, representing approximately 20% overlap. This balance provides enough context in each chunk to maintain meaning while keeping chunks small enough for efficient retrieval. Technical documentation, blog posts, and research papers typically fall into this category.
Large documents exceeding 5,000 words require more aggressive chunking with sizes of 1,000 to 1,500 tokens and overlaps of 200 to 300 tokens. The larger overlap percentage compensates for the increased risk of context loss in longer documents. Books, comprehensive guides, and detailed reports benefit from this approach.
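These rules of thumb can be captured in a small helper that picks splitter parameters from a document's word count; the function name and the midpoint values chosen here are illustrative:

def choose_chunk_params(word_count):
    """Return (chunk_size, chunk_overlap) in tokens for a document of the given length."""
    if word_count < 1_000:
        return None, None      # small documents: load whole, no chunking
    if word_count <= 5_000:
        return 750, 150        # medium documents: 500-1000 tokens, roughly 20% overlap
    return 1_250, 250          # large documents: 1000-1500 tokens, 200-300 token overlap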
Implementing chunking with overlap requires careful configuration of text splitting parameters. Using LangChain's RecursiveCharacterTextSplitter, you can specify exactly how documents should be divided:
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,                          # Maximum chunk size
    chunk_overlap=200,                        # Overlap between chunks
    length_function=len,                      # How to measure length
    separators=["\n\n", "\n", ".", " ", ""],  # Split preferences
)

chunks = splitter.split_text(document_text)
The separators parameter deserves special attention as it defines the splitter's preference for where to break text. The system first attempts to split on double newlines (paragraph boundaries), then single newlines, then periods (sentence boundaries), then spaces (word boundaries), and finally individual characters as a last resort. This hierarchy ensures that splits occur at natural boundaries that preserve meaning.
Getting started with the RAG assistant requires setting up your development environment, configuring API access, and preparing your document collection. The entire process typically takes less than 10 minutes and requires no specialized hardware.
Before beginning installation, verify that your system has Python 3.8 or higher installed. You can check your Python version by opening a terminal and running python --version. If you need to install Python, download it from python.org and ensure you check the option to add Python to your system PATH during installation. You'll also need pip, Python's package manager, which comes bundled with modern Python installations. Finally, having Git installed will make it easier to clone the repository, though you can also download it as a ZIP file if you prefer.
Begin by cloning the repository to your local machine. Open a terminal, navigate to the directory where you want to store the project, and run the git clone command with your repository URL. Once cloned, change into the project directory. Creating a virtual environment is crucial for keeping this project's dependencies isolated from other Python projects on your system. Use Python's built-in venv module to create a virtual environment in a folder named venv. After creation, activate the virtual environment using the appropriate command for your operating system. On Windows, you'll run venv\Scripts\activate, while macOS and Linux users should use source venv/bin/activate. You'll know the environment is active when you see (venv) prepended to your terminal prompt. Finally, install all required dependencies by running pip install -r requirements.txt, which reads the requirements file and installs each package with its specified version.
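On a Unix-like system the whole sequence looks roughly like this; the repository URL and project directory are placeholders, and Windows users activate with venv\Scripts\activate as noted above:

git clone <your-repository-url>
cd <project-directory>
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt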
The system's flexibility stems from its support for multiple LLM providers, but this requires providing at least one API key. Create a new file named .env in your project root directory. This file will store your API credentials securely without committing them to version control. You need to configure at least one provider from the available options.
For OpenAI integration, you'll need an API key from OpenAI's platform. Visit https://platform.openai.com/api-keys, sign in or create an account, and generate a new API key. Add this to your .env file with the format OPENAI_API_KEY=sk-...your-key-here.... OpenAI offers the most reliable performance and the widest range of models, making it an excellent choice for production systems, though it's also the most expensive option.
Groq provides an alternative with their Llama models, offering exceptional speed and a generous free tier that's perfect for development and testing. Obtain your API key from https://console.groq.com/keys and add it to your .env file as GROQ_API_KEY=gsk_...your-key-here.... Groq's infrastructure optimizes inference speed, making it ideal for applications where response time is critical.
Google's Gemini models offer another option with competitive pricing and good performance. Generate your API key at https://aistudio.google.com/app/apikey and add it as GOOGLE_API_KEY=AI...your-key-here... in your .env file. Gemini's free tier is particularly generous, making it attractive for experimentation and low-volume deployments.
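Whichever providers you choose, the resulting .env file is just a set of key-value pairs; keep only the keys you actually have, with the placeholder values below replaced by your real keys:

OPENAI_API_KEY=sk-...your-key-here...
GROQ_API_KEY=gsk_...your-key-here...
GOOGLE_API_KEY=AI...your-key-here...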
The system will automatically detect which keys are available and initialize the corresponding model, with OpenAI taking precedence if multiple keys are present. You can also specify which model version to use by setting the corresponding MODEL environment variables, though the defaults work well for most applications.
The RAG system needs documents to create its knowledge base. If the data directory doesn't exist in your project, create it using mkdir -p data on Unix-like systems or mkdir data on Windows. Copy your text and markdown files into this directory. The system recursively scans subdirectories, so you can organize documents in any folder structure that makes sense for your use case. For example, you might create separate folders for different topics, departments, or document types.
With everything configured, you're ready to start the assistant. From your terminal with the virtual environment still activated, run python src/app.py. The system will initialize, reporting which LLM provider it's using, loading all documents from the data directory, processing them into vector embeddings, and finally presenting an interactive prompt where you can start asking questions. The entire initialization typically completes in just a few seconds, after which you'll see a welcome message indicating the system is ready.
Understanding how to interact effectively with the RAG assistant maximizes its value. Let me walk through several real-world scenarios that demonstrate both the system's capabilities and its intentional limitations.
When you ask "What is quantum computing?" the system springs into action. It converts your question into a vector representation, searches the document database for semantically similar content, retrieves the most relevant passages, and provides them to the LLM as context. The response demonstrates the system's ability to synthesize information from your documents into a coherent explanation. The answer stays focused on the core concepts while remaining within the 80-word limit: it explains that quantum computing leverages quantum-mechanical principles such as superposition and entanglement, and distinguishes it from classical computing through its use of qubits, which can exist in multiple states simultaneously.
The guardrails show their value when users venture outside the knowledge base. Asking "What's the weather like today?" triggers the system's out-of-scope detection. The vector search finds no relevant documents since weather information isn't in your knowledge base. Rather than inventing an answer or trying to be helpful by guessing, the system responds with the configured unavailability message stating it doesn't have such data. This honest admission prevents the frustration users experience when AI systems confidently provide incorrect information.
The system truly shines when synthesizing information across multiple documents. A question like "How does AI impact both healthcare and climate science?" requires the retrieval system to find relevant passages from documents on both topics. The LLM then integrates these separate pieces of information into a cohesive response that addresses both domains. It might explain how machine learning algorithms detect diseases early and personalize treatments in healthcare, while simultaneously describing how AI models predict weather patterns and optimize renewable energy systems in climate science, concluding with an observation about AI's broader capability for analyzing complex data patterns.
The practical applications of RAG systems extend across virtually every domain that relies on documented knowledge. Let me explore how different sectors can leverage this technology to transform information access.
Customer support operations represent one of the most immediate applications for RAG technology. By deploying a RAG assistant on your company's knowledge base, you can automate responses to common inquiries while maintaining accuracy. Companies implementing such systems typically see support ticket volume decrease by 40 to 60 percent as customers find instant answers to routine questions. The system operates 24 hours a day across all time zones, ensuring customers never wait for business hours to get help with straightforward issues.
Internal knowledge management transforms when employees can query company information conversationally. Consider the typical scenario where an employee needs to understand a policy or find a specific procedure. Rather than searching through SharePoint folders or scrolling through policy manuals, they simply ask the RAG assistant. This capability dramatically reduces time spent searching for information, improves onboarding efficiency for new employees who can quickly access institutional knowledge, and ensures everyone receives consistent, accurate information regardless of when or how they ask.
Compliance and legal departments benefit from RAG systems that can quickly reference regulations, policies, and procedures. When questions arise about regulatory requirements or company policies, the system retrieves exact language from official documents, ensuring accurate communication while maintaining complete audit trails of what information was provided and when.
Literature review processes traditionally consume enormous amounts of researcher time. Reading through hundreds or thousands of papers to find relevant information is tedious and error-prone. A RAG system deployed over a corpus of research papers enables semantic querying that finds relevant research even when it uses different terminology than your query. Researchers can ask about concepts rather than keywords, synthesize findings across multiple studies, and rapidly identify gaps in existing literature.
Laboratory work benefits from RAG assistants that have access to experimental protocols, methodology documentation, and historical research notes. When questions arise during experiments about proper procedures or previous results, researchers can query the system rather than interrupting colleagues or hunting through lab notebooks.
Educational institutions face constant demand for information about course materials, policies, and procedures. Students repeatedly ask the same questions about assignment requirements, exam dates, and course policies. A RAG system deployed over course materials, syllabi, and institutional policies provides students with instant, accurate answers while freeing instructors to focus on higher-value interactions. The system supports personalized learning by allowing students to explore course materials at their own pace, ask clarifying questions without embarrassment, and receive consistent information regardless of when they study.
Optimizing a RAG system's performance requires balancing multiple competing factors including retrieval quality, response speed, accuracy, and cost. Understanding these trade-offs enables you to tune the system for your specific requirements.
The number of results parameter fundamentally impacts system behavior. Setting it too low, such as retrieving only one document chunk, risks missing important context that might be distributed across multiple passages. Your system might find a relevant chunk but miss crucial details located elsewhere in your documents. The default setting of three results provides a good balance for most applications, offering enough context to answer complex questions without overwhelming the LLM with irrelevant information. Setting the parameter too high introduces its own problems as retrieving ten or more chunks may include less relevant information that dilutes the quality of the context provided to the LLM.
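A quick way to judge the right value is to inspect what each setting actually retrieves. The sketch below assumes a LangChain Chroma vector store named vectordb; lower scores indicate closer matches:

# Compare what k=1 misses and what k=10 drags in
for k in (1, 3, 10):
    hits = vectordb.similarity_search_with_score("What is the return policy?", k=k)
    print(f"k={k}")
    for doc, score in hits:
        print(f"  {score:.3f}  {doc.page_content[:60]}...")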
Vector search quality depends heavily on the embedding model used to convert text into vectors. Better embedding models capture semantic meaning more accurately, leading to more relevant retrievals. The system uses semantic similarity rather than keyword matching, meaning it understands that "automobile accident" and "car crash" refer to similar concepts even though they share no words. ChromaDB implements efficient approximate nearest neighbor search algorithms that enable fast similarity searches even across large document collections.
Different LLM providers offer dramatically different cost structures, and selecting the appropriate model for each use case can significantly impact operational expenses. For simple FAQ-style queries with straightforward answers, Groq's Llama models provide fast, inexpensive responses that typically match the quality of more expensive alternatives. Complex analytical queries that require nuanced understanding or sophisticated reasoning benefit from OpenAI's GPT-4, despite its higher cost per token. Google's Gemini models and GPT-4o-mini offer middle-ground options that balance capability with cost-effectiveness, making them suitable for general-purpose deployments where some queries are simple while others require deeper analysis.
The current implementation provides a solid foundation, but several enhancements would expand its capabilities and improve user experience. Implementing advanced chunking with configurable overlap strategies would enable the system to handle much larger document collections while maintaining context quality. Adding a web interface using Streamlit or Gradio would make the system accessible to non-technical users who prefer graphical interfaces over command-line interactions.
Expanding format support to include PDFs, DOCX files, and PowerPoint presentations would dramatically increase the types of documents the system can process. Many organizations store critical information in these formats, and supporting them eliminates the need for manual conversion. Implementing conversation memory would allow the system to maintain context across multiple questions in a single session, enabling follow-up questions and clarifications that feel more natural.
Showing answer citations that identify which document chunks were used to generate each response would increase user trust and enable verification of information. Adding evaluation metrics to track answer quality, relevance, and user satisfaction would support continuous improvement of the system. Creating a user feedback loop where users can rate responses would enable the system to learn which retrieval strategies and answer styles work best for your specific use case.
This RAG assistant demonstrates that building powerful, production-ready AI applications doesn't require massive infrastructure or PhD-level expertise. By combining modern tools like LangChain and ChromaDB with thoughtful guardrails and clear constraints, we create systems that are both powerful and safe. The implementation proves that effective AI applications often emerge from intelligently combining existing tools rather than building everything from scratch.
The key lessons center on responsibility and practicality. Grounding AI responses in factual sources rather than allowing free generation prevents the hallucinations that plague many AI deployments. Implementing strict guardrails through explicit rules and constraints ensures the system behaves predictably and safely. Designing for flexibility by supporting multiple LLM providers avoids vendor lock-in while enabling cost optimization. Prioritizing user experience through clear, concise answers builds trust and encourages adoption.
As AI continues to evolve, RAG systems represent a practical path forward for organizations seeking to leverage large language models without sacrificing accuracy or control. They make private data accessible through conversational interfaces while preserving the trustworthiness and user safety that production systems require, transforming how your organization accesses and uses its documented knowledge.
Ready to deploy your own RAG assistant? The complete code is available on GitHub, and you can have it running in under 10 minutes.
Here is a demonstration video of the app in action: