
The RAG Assistant is a modular, scalable framework for building and evaluating Retrieval-Augmented Generation (RAG) systems. RAG systems enhance large language models (LLMs) by allowing them to retrieve relevant context from external knowledge bases before generating responses. This paper outlines the core architecture of the RAG Assistant, detailing its ingestion pipeline, query processing, retrieval mechanism, response generation, and evaluation metrics. We describe the framework's flexibility and modularity, including support for multiple LLMs and vector stores (Chroma, FAISS), and demonstrate its application to real-world AI challenges. Furthermore, the framework's evaluation module provides quantitative measures of retrieval effectiveness using Precision@K, Recall@K, and Mean Reciprocal Rank (MRR). This publication provides insights into both the theoretical and practical aspects of building retrieval-augmented systems, with an emphasis on scalability, accuracy, and evaluation.
The RAG Assistant framework enables the construction of Retrieval-Augmented Generation (RAG) systems that combine external knowledge with the generative capabilities of large language models. By integrating Chroma and FAISS as vector stores, this project allows real-time document retrieval to supplement LLMs, improving accuracy and relevance.
Document Ingestion: Loads documents from URLs, Wikipedia, and PDFs.
Query Processing: Normalizes and expands queries to improve retrieval accuracy.
Retrieval: Utilizes vector stores (Chroma or FAISS) to fetch relevant documents.
Response Generation: Generates answers from LLMs using the retrieved context.
Evaluation: Assesses retrieval accuracy using Precision@K, Recall@K, and MRR.
Loaders: Fetches content from various sources (URLs, Wikipedia, PDFs).
Text Chunking: Splits documents into 1000-character blocks with 150-character overlap.
Embeddings: Uses sentence-transformers/all-MiniLM-L6-v2 for document embedding.
Vector Store: Persistently stores embeddings in Chroma (default) or FAISS.
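To make the ingestion stage concrete, here is a minimal sketch assuming chromadb and sentence-transformers as the underlying libraries; the chunker, function names, and collection name are illustrative, not the project's actual module layout.

# Minimal ingestion sketch (illustrative names, not the project's real modules).
import chromadb
from sentence_transformers import SentenceTransformer

CHUNK_SIZE, CHUNK_OVERLAP = 1000, 150
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def chunk_text(text: str) -> list[str]:
    """Split text into fixed-size blocks with overlap, mirroring the settings above."""
    step = CHUNK_SIZE - CHUNK_OVERLAP
    return [text[i:i + CHUNK_SIZE] for i in range(0, len(text), step)]

def ingest(doc_id: str, text: str, persist_dir: str = "./chroma_db") -> None:
    chunks = chunk_text(text)
    embeddings = embedder.encode(chunks).tolist()          # one vector per chunk
    client = chromadb.PersistentClient(path=persist_dir)   # persistent local store
    collection = client.get_or_create_collection("rag_docs")
    collection.add(
        ids=[f"{doc_id}-{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=embeddings,
        metadatas=[{"source": doc_id}] * len(chunks),
    )

The same pattern applies when FAISS is selected as the backend; only the vector-store calls change.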
Normalization: Cleans up user queries (removes unnecessary characters, trims spaces).
Classification: Categorizes queries to determine how to process them (e.g., factual vs conceptual).
Keyword Extraction: Extracts keywords to enhance retrieval.
Query Expansion: Optionally expands queries using LLMs to make them more precise.
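A rough sketch of the normalization and keyword-extraction steps is shown below; the helper names and stopword list are hypothetical, and LLM-based expansion would sit on top of these helpers when enabled.

# Illustrative query-processing helpers (hypothetical names, not the project's API).
import re

STOPWORDS = {"the", "a", "an", "is", "are", "what", "why", "how", "and", "of", "in"}

def normalize(query: str) -> str:
    """Lowercase, strip punctuation noise, and collapse extra whitespace."""
    query = re.sub(r"[^\w\s@-]", " ", query.lower())
    return re.sub(r"\s+", " ", query).strip()

def extract_keywords(query: str) -> list[str]:
    """Tiny keyword extractor: drop stopwords, keep the rest."""
    return [tok for tok in normalize(query).split() if tok not in STOPWORDS]

print(normalize("  What is Retrieval-Augmented Generation?? "))
# -> "what is retrieval-augmented generation"
print(extract_keywords("What is Retrieval-Augmented Generation?"))
# -> ["retrieval-augmented", "generation"]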
Retriever: Fetches the top-k relevant documents from the vector store based on the query.
Vector Store: Uses Chroma or FAISS to store and retrieve document embeddings.
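Continuing the ingestion sketch above, top-k retrieval against the persisted Chroma collection could look like the following; again, the names are illustrative.

# Illustrative top-k retrieval against the Chroma collection created during ingestion.
import chromadb
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def retrieve(question: str, k: int = 5, persist_dir: str = "./chroma_db") -> list[dict]:
    client = chromadb.PersistentClient(path=persist_dir)
    collection = client.get_or_create_collection("rag_docs")
    query_vec = embedder.encode([question]).tolist()
    result = collection.query(query_embeddings=query_vec, n_results=k)
    # Flatten Chroma's nested response into (text, source, distance) records.
    return [
        {"text": doc, "source": meta["source"], "distance": dist}
        for doc, meta, dist in zip(
            result["documents"][0], result["metadatas"][0], result["distances"][0]
        )
    ]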
Prompt Construction: Combines the user query and retrieved context into a single prompt.
LLM: Processes the prompt and generates an answer.
Answer and Sources: Outputs an answer, with metadata citing the source of information.
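A minimal prompt-construction sketch, assuming the retrieved records from the previous step; the template wording is illustrative, not the project's actual prompt.

# Combine the question with numbered, source-tagged context chunks.
def build_prompt(question: str, contexts: list[dict]) -> str:
    context_block = "\n\n".join(
        f"[{i + 1}] ({c['source']}) {c['text']}" for i, c in enumerate(contexts)
    )
    return (
        "Answer the question using only the context below. "
        "Cite the bracketed source numbers you rely on.\n\n"
        f"Context:\n{context_block}\n\nQuestion: {question}\nAnswer:"
    )

The retrieved records also supply the metadata (source, type, title, score) that is returned alongside the generated answer.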
Precision@K: Measures the proportion of relevant documents in the top K retrieved documents.
Recall@K: Measures the fraction of all relevant documents that appear in the top K retrieved documents.
Mean Reciprocal Rank (MRR): Averages the reciprocal rank of the first relevant document across queries, rewarding retrievers that surface a relevant document near the top of the list.
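These metrics are straightforward to implement; the reference functions below match the definitions above, but are not the project's evaluation module itself.

# Reference implementations of the three retrieval metrics.
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    return sum(doc in relevant for doc in retrieved[:k]) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Assumes at least one relevant document per query.
    return sum(doc in relevant for doc in retrieved[:k]) / len(relevant)

def mrr(all_retrieved: list[list[str]], all_relevant: list[set[str]]) -> float:
    """Average the reciprocal rank of the first relevant hit over all queries."""
    total = 0.0
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(all_retrieved)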
The RAG Assistant is designed to work with multiple language models and vector stores for flexibility and scalability.
OpenAI: For high-quality, reliable completions (GPT-4).
Ollama: For local inference, providing privacy and better control over execution (LLaMA-3).
HuggingFace: To support open-source models through the HuggingFace Inference API.
Chroma: The default, easy-to-use vector store that is well-suited for medium-scale deployments.
FAISS: A more performance-oriented vector store that is useful for large-scale systems requiring high-speed search capabilities.
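As a rough illustration of how provider switching can work, the sketch below dispatches a prompt to either the OpenAI SDK or a local Ollama model; model names and error handling are simplified, the HuggingFace Inference API could be wired in the same way, and this is not the project's actual provider layer.

# Hedged provider-dispatch sketch; simplified for illustration.
import os

def generate(prompt: str, provider: str = "openai") -> str:
    if provider == "openai":
        from openai import OpenAI                      # official OpenAI SDK
        client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
        resp = client.chat.completions.create(
            model="gpt-4", messages=[{"role": "user", "content": prompt}]
        )
        return resp.choices[0].message.content
    if provider == "ollama":
        import ollama                                  # local inference via Ollama
        resp = ollama.chat(
            model="llama3", messages=[{"role": "user", "content": prompt}]
        )
        return resp["message"]["content"]
    raise ValueError(f"Unsupported provider: {provider}")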
To get started, follow these steps:
git clone https://github.com/mohanelango/rag-assistant.git
cd rag-assistant
python -m venv .venv && source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r requirements.txt
cp .env.example .env
Edit .env to select your preferred LLM provider (OpenAI / Ollama / HuggingFace).
Optionally, adjust the chunk size, retrieval k, and model parameters in configs/settings.yaml.
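The exact keys in configs/settings.yaml are project-specific; as a hedged illustration, a settings file of this kind is typically read with PyYAML, with the key names below assumed for the example rather than taken from the project's schema.

# Illustrative settings loader; the key names are hypothetical.
import yaml

with open("configs/settings.yaml") as fh:
    settings = yaml.safe_load(fh)

# e.g. the pipeline might read values such as chunk_size, retrieval_k,
# and the LLM model name at startup.
chunk_size = settings.get("chunk_size", 1000)
retrieval_k = settings.get("retrieval_k", 5)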
make ingest
Or without make:
python -m src.rag.ingest
make ask Q="What is Retrieval-Augmented Generation and why is it useful?"
Or without make:
python -m src.rag.cli --question "What is Retrieval-Augmented Generation and why is it useful?"
make serve
Or without make:
uvicorn src.rag.api:app --reload --port 8000
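Once the server is running, it can be queried over HTTP. The /ask route and JSON payload below are assumptions made for illustration; the actual routes are defined in src.rag.api.

# Hypothetical client call; endpoint path and payload shape are assumed.
import requests

resp = requests.post(
    "http://localhost:8000/ask",
    json={"question": "What is Retrieval-Augmented Generation and why is it useful?"},
    timeout=60,
)
print(resp.json())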
The Retrieval Evaluation module computes Precision@K, Recall@K, and Mean Reciprocal Rank (MRR) to evaluate how well the retriever fetches relevant context.
[ { "question": "What is Retrieval-Augmented Generation?", "relevant_docs": ["Wikipedia:Retrieval-Augmented Generation"] }, { "question": "Who wrote Federalist Paper No. 10?", "relevant_docs": ["./docs/federalist_papers.pdf"] } ]
make evaluate
Precision@5: 0.80
Recall@5: 0.75
MRR: 0.67
You can switch to alternate configurations without changing the codebase:
make ingest SETTINGS=configs/settings.prod.yaml SOURCES=configs/sources.alt.yaml
make ask Q="What is RAG?" SETTINGS=configs/settings.prod.yaml
What is Retrieval-Augmented Generation and why is it useful?
The system processes the query by first normalizing it, then expanding it (if configured) to ensure the query is as detailed and accurate as possible for the retriever.
Explain Retrieval-Augmented Generation and its usefulness in improving large language model accuracy and response relevance.
Retrieval-augmented generation (RAG) is a method where a large language model retrieves relevant information from a specified set of external documents or data sources before generating its answer, supplementing what's in its static training data. This allows the model to incorporate domain-specific or newly updated information it didn't originally train on, such as internal company documents or other authoritative sources like Wikipedia.
Why it's useful:
Keeps answers current without retraining: You update the external knowledge base rather than retraining the model (Ars Technica via Wikipedia: Retrieval-Augmented Generation).
Improves accuracy and relevance: Retrieving context first helps the model "stick to the facts" and reduces hallucinations (Wikipedia: Retrieval-Augmented Generation).
Cuts cost and complexity: Reduces the need for frequent retraining, saving compute and money (Wikipedia: Retrieval-Augmented Generation).
Increases transparency: Systems can include citations to retrieved sources so users can verify claims (Wikipedia: Retrieval-Augmented Generation).
[ { "source": "Wikipedia:Retrieval-Augmented Generation", "type": "wikipedia", "title": "Retrieval-augmented generation", "score": 0.7042 }, { "source": "Wikipedia:LangChain", "type": "wikipedia", "title": "LangChain", "score": 1.272 } ]
Retrieved Documents: The system retrieves relevant documents from the vector store (Chroma or FAISS). These documents provide the background context needed to answer the user's query.
Answer Generation: After retrieving the context, the LLM generates an answer by combining the query with the retrieved context.
Source Attribution: The retrieved sources are listed with metadata, including the document's type, title, and similarity score to the query.
This ensures transparency and credibility, allowing users to verify the response by referencing the sources from which the answer was derived.
This modular, scalable RAG Assistant enables efficient and contextually accurate answer generation through retrieval augmentation. With clear installation instructions, robust evaluation methods, and flexibility in LLM and vector store choices, the project is designed for both research and production environments.
The evaluation module helps assess retrieval performance, making it easy to fine-tune the system based on quantitative results.