
The RAG Assistant is a modular, scalable framework for building and evaluating Retrieval-Augmented Generation (RAG) systems. RAG systems enhance large language models (LLMs) by allowing them to retrieve relevant context from external knowledge bases before generating responses. This paper outlines the core architecture of the RAG Assistant, detailing its ingestion pipeline, query processing, retrieval mechanism, response generation, and evaluation metrics. We describe the framework's flexibility and modularity, including support for multiple LLMs and vector stores (Chroma, FAISS), and demonstrate its application to real-world AI challenges. Furthermore, the framework's evaluation module provides quantitative measures of retrieval effectiveness using Precision@K, Recall@K, and Mean Reciprocal Rank (MRR). This publication provides insights into both the theoretical and practical aspects of building retrieval-augmented systems, with an emphasis on scalability, accuracy, and evaluation.
The RAG Assistant framework enables the construction of Retrieval-Augmented Generation (RAG) systems that combine external knowledge with the generative capabilities of large language models. By integrating Chroma and FAISS as vector stores, this project allows real-time document retrieval to supplement LLMs, improving accuracy and relevance.
Document Ingestion: Loads documents from URLs, Wikipedia, and PDFs.
Query Processing: Normalizes and expands queries to improve retrieval accuracy.
Retrieval: Utilizes vector stores (Chroma or FAISS) to fetch relevant documents.
Response Generation: Generates answers from LLMs using the retrieved context.
Evaluation: Assesses retrieval accuracy using Precision@K, Recall@K, and MRR.
Loaders: Fetches content from various sources (URLs, Wikipedia, PDFs).
Text Chunking: Splits documents into 1000-character blocks with 150-character overlap.
Embeddings: Uses sentence-transformers/all-MiniLM-L6-v2 for document embedding.
Vector Store: Persistently stores embeddings in Chroma (default) or FAISS.
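To make the ingestion stage concrete, here is a minimal sketch assuming chromadb and sentence-transformers as the underlying libraries; the chunker, function names, and collection name are illustrative, not the project's actual module layout.

# Minimal ingestion sketch (illustrative names, not the project's real modules).
import chromadb
from sentence_transformers import SentenceTransformer

CHUNK_SIZE, CHUNK_OVERLAP = 1000, 150
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def chunk_text(text: str) -> list[str]:
    """Split text into fixed-size blocks with overlap, mirroring the settings above."""
    step = CHUNK_SIZE - CHUNK_OVERLAP
    return [text[i:i + CHUNK_SIZE] for i in range(0, len(text), step)]

def ingest(doc_id: str, text: str, persist_dir: str = "./chroma_db") -> None:
    chunks = chunk_text(text)
    embeddings = embedder.encode(chunks).tolist()          # one vector per chunk
    client = chromadb.PersistentClient(path=persist_dir)   # persistent local store
    collection = client.get_or_create_collection("rag_docs")
    collection.add(
        ids=[f"{doc_id}-{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=embeddings,
        metadatas=[{"source": doc_id}] * len(chunks),
    )

The same pattern applies when FAISS is selected as the backend; only the vector-store calls change.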
Normalization: Cleans up user queries (removes unnecessary characters, trims spaces).
Classification: Categorizes queries to determine how to process them (e.g., factual vs conceptual).
Keyword Extraction: Extracts keywords to enhance retrieval.
Query Expansion: Optionally expands queries using LLMs to make them more precise.
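A rough sketch of the normalization and keyword-extraction steps is shown below; the helper names and stopword list are hypothetical, and LLM-based expansion would sit on top of these helpers when enabled.

# Illustrative query-processing helpers (hypothetical names, not the project's API).
import re

STOPWORDS = {"the", "a", "an", "is", "are", "what", "why", "how", "and", "of", "in"}

def normalize(query: str) -> str:
    """Lowercase, strip punctuation noise, and collapse extra whitespace."""
    query = re.sub(r"[^\w\s@-]", " ", query.lower())
    return re.sub(r"\s+", " ", query).strip()

def extract_keywords(query: str) -> list[str]:
    """Tiny keyword extractor: drop stopwords, keep the rest."""
    return [tok for tok in normalize(query).split() if tok not in STOPWORDS]

print(normalize("  What is Retrieval-Augmented Generation?? "))
# -> "what is retrieval-augmented generation"
print(extract_keywords("What is Retrieval-Augmented Generation?"))
# -> ["retrieval-augmented", "generation"]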
Retriever: Fetches the top-k relevant documents from the vector store based on the query.
Vector Store: Uses Chroma or FAISS to store and retrieve document embeddings.
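Continuing the ingestion sketch above, top-k retrieval against the persisted Chroma collection could look like the following; again, the names are illustrative.

# Illustrative top-k retrieval against the Chroma collection created during ingestion.
import chromadb
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def retrieve(question: str, k: int = 5, persist_dir: str = "./chroma_db") -> list[dict]:
    client = chromadb.PersistentClient(path=persist_dir)
    collection = client.get_or_create_collection("rag_docs")
    query_vec = embedder.encode([question]).tolist()
    result = collection.query(query_embeddings=query_vec, n_results=k)
    # Flatten Chroma's nested response into (text, source, distance) records.
    return [
        {"text": doc, "source": meta["source"], "distance": dist}
        for doc, meta, dist in zip(
            result["documents"][0], result["metadatas"][0], result["distances"][0]
        )
    ]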
Prompt Construction: Combines the user query and retrieved context into a single prompt.
LLM: Processes the prompt and generates an answer.
Answer and Sources: Outputs an answer, with metadata citing the source of information.
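A minimal prompt-construction sketch, assuming the retrieved records from the previous step; the template wording is illustrative, not the project's actual prompt.

# Combine the question with numbered, source-tagged context chunks.
def build_prompt(question: str, contexts: list[dict]) -> str:
    context_block = "\n\n".join(
        f"[{i + 1}] ({c['source']}) {c['text']}" for i, c in enumerate(contexts)
    )
    return (
        "Answer the question using only the context below. "
        "Cite the bracketed source numbers you rely on.\n\n"
        f"Context:\n{context_block}\n\nQuestion: {question}\nAnswer:"
    )

The retrieved records also supply the metadata (source, type, title, score) that is returned alongside the generated answer.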
Precision@K: Measures the proportion of relevant documents in the top K retrieved documents.
Recall@K: Measures the fraction of all relevant documents that appear in the top K retrieved documents.
Mean Reciprocal Rank (MRR): Averages the reciprocal rank of the first relevant document across queries, rewarding retrievers that surface a relevant document near the top of the list.
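These metrics are straightforward to implement; the reference functions below match the definitions above, but are not the project's evaluation module itself.

# Reference implementations of the three retrieval metrics.
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    return sum(doc in relevant for doc in retrieved[:k]) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Assumes at least one relevant document per query.
    return sum(doc in relevant for doc in retrieved[:k]) / len(relevant)

def mrr(all_retrieved: list[list[str]], all_relevant: list[set[str]]) -> float:
    """Average the reciprocal rank of the first relevant hit over all queries."""
    total = 0.0
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(all_retrieved)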
The RAG Assistant is designed to work with multiple language models and vector stores for flexibility and scalability.
OpenAI: For high-quality, reliable completions (GPT-4).
Ollama: For local inference, providing privacy and better control over execution (LLaMA-3).
HuggingFace: To support open-source models through the HuggingFace Inference API.
Chroma: The default, easy-to-use vector store that is well-suited for medium-scale deployments.
FAISS: A more performance-oriented vector store that is useful for large-scale systems requiring high-speed search capabilities.
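As a rough illustration of how provider switching can work, the sketch below dispatches a prompt to either the OpenAI SDK or a local Ollama model; model names and error handling are simplified, the HuggingFace Inference API could be wired in the same way, and this is not the project's actual provider layer.

# Hedged provider-dispatch sketch; simplified for illustration.
import os

def generate(prompt: str, provider: str = "openai") -> str:
    if provider == "openai":
        from openai import OpenAI                      # official OpenAI SDK
        client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
        resp = client.chat.completions.create(
            model="gpt-4", messages=[{"role": "user", "content": prompt}]
        )
        return resp.choices[0].message.content
    if provider == "ollama":
        import ollama                                  # local inference via Ollama
        resp = ollama.chat(
            model="llama3", messages=[{"role": "user", "content": prompt}]
        )
        return resp["message"]["content"]
    raise ValueError(f"Unsupported provider: {provider}")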
To get started, follow these steps:
git clone https://github.com/mohanelango/rag-assistant.git
cd rag-assistant
python -m venv .venv && source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r requirements.txt
cp .env.example .env
Edit .env to select your preferred LLM provider (OpenAI / Ollama / HuggingFace).
Optionally, adjust the chunk size, retrieval k, and model parameters in configs/settings.yaml.
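The exact keys in configs/settings.yaml are project-specific; as a hedged illustration, a settings file of this kind is typically read with PyYAML, with the key names below assumed for the example rather than taken from the project's schema.

# Illustrative settings loader; the key names are hypothetical.
import yaml

with open("configs/settings.yaml") as fh:
    settings = yaml.safe_load(fh)

# e.g. the pipeline might read values such as chunk_size, retrieval_k,
# and the LLM model name at startup.
chunk_size = settings.get("chunk_size", 1000)
retrieval_k = settings.get("retrieval_k", 5)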
make ingest
Or without make:
python -m src.rag.ingest
make ask Q="What is Retrieval-Augmented Generation and why is it useful?"
Or without make:
python -m src.rag.cli --question "What is Retrieval-Augmented Generation and why is it useful?"
make serve
Or without make:
uvicorn src.rag.api:app --reload --port 8000
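Once the server is running, it can be queried over HTTP. The /ask route and JSON payload below are assumptions made for illustration; the actual routes are defined in src.rag.api.

# Hypothetical client call; endpoint path and payload shape are assumed.
import requests

resp = requests.post(
    "http://localhost:8000/ask",
    json={"question": "What is Retrieval-Augmented Generation and why is it useful?"},
    timeout=60,
)
print(resp.json())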
The Retrieval Evaluation module computes Precision@K, Recall@K, and Mean Reciprocal Rank (MRR) to evaluate how well the retriever fetches relevant context.
[ { "question": "What is Retrieval-Augmented Generation?", "relevant_docs": ["Wikipedia:Retrieval-Augmented Generation"] }, { "question": "Who wrote Federalist Paper No. 10?", "relevant_docs": ["./docs/federalist_papers.pdf"] } ]
make evaluate
Precision@5: 0.80
Recall@5: 0.75
MRR: 0.67
You can switch to alternate configurations without changing the codebase:
make ingest SETTINGS=configs/settings.prod.yaml SOURCES=configs/sources.alt.yaml
make ask Q="What is RAG?" SETTINGS=configs/settings.prod.yaml
What is Retrieval-Augmented Generation and why is it useful?
The system processes the query by first normalizing it, then expanding it (if configured) to ensure the query is as detailed and accurate as possible for the retriever.
Explain Retrieval-Augmented Generation and its usefulness in improving large language model accuracy and response relevance.
Retrieval-augmented generation (RAG) is a method where a large language model retrieves relevant information from a specified set of external documents or data sources before generating its answer, supplementing what's in its static training data. This allows the model to incorporate domain-specific or newly updated information it didn't originally train on, such as internal company documents or other authoritative sources like Wikipedia.
Why it's useful:
Keeps answers current without retraining: You update the external knowledge base rather than retraining the model (Ars Technica via Wikipedia: Retrieval-Augmented Generation).
Improves accuracy and relevance: Retrieving context first helps the model "stick to the facts" and reduces hallucinations (Wikipedia: Retrieval-Augmented Generation).
Cuts cost and complexity: Reduces the need for frequent retraining, saving compute and money (Wikipedia: Retrieval-Augmented Generation).
Increases transparency: Systems can include citations to retrieved sources so users can verify claims (Wikipedia: Retrieval-Augmented Generation).
[ { "source": "Wikipedia:Retrieval-Augmented Generation", "type": "wikipedia", "title": "Retrieval-augmented generation", "score": 0.7042 }, { "source": "Wikipedia:LangChain", "type": "wikipedia", "title": "LangChain", "score": 1.272 } ]
Retrieved Documents: The system retrieves relevant documents from the vector store (Chroma or FAISS). These documents provide the background context needed to answer the user's query.
Answer Generation: After retrieving the context, the LLM generates an answer by combining the query with the retrieved context.
Source Attribution: The retrieved sources are listed with metadata, including the document's type, title, and similarity score to the query.
This ensures transparency and credibility, allowing users to verify the response by referencing the sources from which the answer was derived.
This modular, scalable RAG Assistant enables efficient and contextually accurate answer generation through retrieval augmentation. With clear installation instructions, robust evaluation methods, and flexibility in LLM and vector store choices, the project is designed for both research and production environments.
The evaluation module helps assess retrieval performance, making it easy to fine-tune the system based on quantitative results.