Large Language Models (LLMs) offer powerful generative capabilities but are inherently limited by their static training data, leading to "hallucinations" and an inability to access private or real-time information. This project implements a Retrieval-Augmented Generation (RAG) system to address this gap. The system enhances a generative LLM by connecting it to a custom, external knowledge base. This is achieved by ingesting a corpus of user-provided documents (e.g., PDFs and text files), which are processed, chunked, and converted into vector embeddings. These embeddings are stored in a ChromaDB vector database, creating a searchable index. When a user submits a query, the system first retrieves the most relevant document chunks (the "context") via similarity search. This context is then prepended to the user's query in a specialized prompt, instructing the LLM to generate an answer based solely on the provided information. The result is an AI assistant that provides accurate, fact-based, and context-aware responses grounded in the specified documents, significantly mitigating fabrication and expanding the model's utility to domain-specific knowledge.
Methodology
The methodology for this RAG system is divided into two primary phases: (1) Data Ingestion & Indexing and (2) the RAG Query Pipeline. The application is built in Python, using the LangChain framework for orchestration, a Hugging Face sentence-transformer model for embeddings, and ChromaDB as the vector store.
Phase 1: Data Ingestion & Indexing

Document Loading: The system first scans a designated data/ directory. It uses LangChain's document loaders (specifically PyPDFLoader for PDFs and TextLoader for .txt files) to read the raw content and metadata from each file.
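A minimal sketch of this loading step, assuming a flat data/ directory and the two loaders named above; the helper function and directory layout are illustrative, not the project's actual code:

```python
from pathlib import Path

from langchain_community.document_loaders import PyPDFLoader, TextLoader

def load_documents(data_dir: str = "data") -> list:
    """Load every PDF and .txt file under data/ into LangChain Document objects."""
    documents = []
    for path in sorted(Path(data_dir).iterdir()):
        if path.suffix.lower() == ".pdf":
            loader = PyPDFLoader(str(path))
        elif path.suffix.lower() == ".txt":
            loader = TextLoader(str(path), encoding="utf-8")
        else:
            continue  # skip unsupported file types
        documents.extend(loader.load())  # each Document carries page_content and metadata
    return documents

documents = load_documents()
```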
Text Chunking: Because LLMs have limited context windows and retrieval works better on smaller pieces of text, the loaded documents are split into smaller, overlapping chunks. This project uses LangChain's RecursiveCharacterTextSplitter, which attempts to split text along natural boundaries (paragraphs, then sentences, then words).
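A sketch of the chunking step; the chunk_size and chunk_overlap values below are illustrative assumptions, not values taken from the project:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # characters per chunk (assumed value)
    chunk_overlap=200,  # overlap keeps context intact across chunk boundaries (assumed value)
)
chunks = splitter.split_documents(documents)  # 'documents' comes from the loading step above
```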
Embedding Generation: Each text chunk is converted into a high-dimensional vector (an "embedding") that represents its semantic meaning. This is accomplished using the all-MiniLM-L6-v2 sentence-transformer model from Hugging Face, which is efficient and provides high-quality embeddings.
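One way to wrap this model is LangChain's HuggingFaceEmbeddings class (assumed here, via the langchain-huggingface integration package), so the same object can later be handed to the vector store:

```python
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"  # produces 384-dimensional vectors
)

# Example: embed one piece of text into a dense vector (the question is hypothetical).
vector = embeddings.embed_query("What does the leave policy say about carry-over days?")
print(len(vector))  # 384
```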
Vector Storage: The generated embeddings, along with their corresponding text chunks and metadata (such as the source filename), are stored in a persistent ChromaDB collection. This creates the final, searchable index of the knowledge base.
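A sketch of building the persistent index through Chroma's LangChain integration; the persist directory and collection name are placeholders, not the project's actual values:

```python
from langchain_chroma import Chroma

vector_store = Chroma.from_documents(
    documents=chunks,                  # chunked Documents from the splitter
    embedding=embeddings,              # the all-MiniLM-L6-v2 wrapper defined above
    persist_directory="chroma_db",     # on-disk location of the index (assumed path)
    collection_name="knowledge_base",  # assumed collection name
)
```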
Phase 2: The RAG Query Pipeline

Query Embedding: The user's natural-language question is embedded using the same all-MiniLM-L6-v2 model to ensure it lies in the same vector space as the document chunks.
Similarity Search: This query vector is used to perform a similarity search (specifically, a cosine similarity search) against the ChromaDB collection. The database returns the top-k (e.g., n_results=3) most relevant document chunks that are semantically similar to the user's question. This is the "Retrieval" step.
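In LangChain these two steps collapse into a single call, because the vector store applies its configured embedding function to the query before searching. A sketch, with k=3 mirroring the n_results example above and a hypothetical question:

```python
query = "What is the warranty period for the product?"  # hypothetical user question

# The store embeds the query with all-MiniLM-L6-v2 and returns the 3 closest chunks.
relevant_chunks = vector_store.similarity_search(query, k=3)

for doc in relevant_chunks:
    print(doc.metadata.get("source"), "->", doc.page_content[:80])
```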
Prompt Augmentation: A specialized prompt template combines the retrieved context with the user's question (a sketch of one such template follows the list below). The template is structured to include:
Explicit instructions for the AI (e.g., "Answer based only on the following context...").
The retrieved document chunks (the "Context").
The original user "Question."
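The project's exact wording is not reproduced here; the sketch below shows one plausible template covering the three elements above:

```python
from langchain_core.prompts import ChatPromptTemplate

RAG_PROMPT = ChatPromptTemplate.from_template(
    """Answer the question based only on the following context.
If the context does not contain the answer, say that you do not have enough information.

Context:
{context}

Question:
{question}"""
)
```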
Answer Generation: This complete, augmented prompt is sent to a generative LLM (supporting providers like OpenAI, Groq, or Google AI) via a LangChain Expression Language (LCEL) chain. The LLM then generates a final answer that is "grounded" in the provided context. This is the "Generation" step.
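A sketch of such an LCEL chain, assuming an OpenAI chat model as the generator (a Groq or Google AI model could be swapped in) and reusing the vector store and prompt sketched above; the model name is an assumption:

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # assumed provider and model

retriever = vector_store.as_retriever(search_kwargs={"k": 3})

def format_docs(docs):
    """Join the retrieved chunks into a single context string for the prompt."""
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | RAG_PROMPT
    | llm
    | StrOutputParser()
)

answer = rag_chain.invoke("What is the warranty period for the product?")
print(answer)
```

Because every component in the chain is a Runnable, exchanging the generator for a different provider's chat model only changes the llm line, which is what makes the provider-agnostic design described below practical.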
Results

The project produced a functional, end-to-end RAG-based AI assistant. The system demonstrates the following key outcomes:
Successful Document Ingestion: The application correctly loads, chunks, embeds, and indexes custom documents from the data/ directory into a persistent vector store. This was verified by inspecting the ChromaDB collection count.
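One quick way to perform such a check with the chromadb client directly; the path and collection name are the same placeholders used in the sketches above:

```python
import chromadb

client = chromadb.PersistentClient(path="chroma_db")       # assumed persist directory
collection = client.get_collection(name="knowledge_base")  # assumed collection name
print(f"Indexed chunks: {collection.count()}")
```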
Effective Grounded Answering: When tested with questions whose answers were explicitly contained in the provided documents, the assistant provided accurate, relevant, and factual responses. The system was able to synthesize information, often drawing from multiple retrieved chunks to form a comprehensive answer.
Hallucination Mitigation: A critical success criterion was the system's ability to "refuse" to answer questions outside its knowledge base. When asked questions about topics not present in the documents, the assistant correctly followed its prompt instructions and responded that it did not have sufficient information, thereby avoiding the fabrication of answers.
Modular and Extensible Framework: The final system is highly modular. Due to the use of LangChain, the core logic can easily be adapted to use different LLM providers, embedding models, or vector databases with minimal code changes.
In conclusion, the project serves as a successful proof-of-concept, effectively bridging the gap between the generative capabilities of LLMs and the need for verifiable, domain-specific information. The RAG architecture proved to be a highly effective method for creating a trustworthy and knowledgeable AI assistant.