This project implements a Retrieval-Augmented Generation (RAG) Assistant that combines a vector database (ChromaDB) with multiple Large Language Models (LLMs), supporting OpenAI GPT, Groq LLaMA, and Google Gemini. The system enables users to ask questions over their own local text documents and receive accurate, context-aware responses.
User documents are converted into embeddings using HuggingFace models and stored in a vector database for semantic search. Relevant content is retrieved at query time and passed to the selected LLM to generate grounded and meaningful answers.
The project demonstrates an end-to-end, lightweight AI-powered knowledge retrieval system designed for research, educational, and practical use cases, with a focus on simplicity, flexibility, and extensibility.
Retrieval-Augmented Generation (RAG) is a practical approach used in modern AI systems to produce accurate and reliable answers by grounding responses in real documents. Unlike traditional language models that may generate unsupported or incorrect information, RAG retrieves relevant content first and then generates answers strictly based on that context.
This project implements a lightweight RAG pipeline using embeddings, an LLM, and ChromaDB for vector storage. Text documents are processed, embedded, and stored for semantic search, allowing the system to retrieve relevant information and answer user queries with factual support.
The project demonstrates how a simple and well-structured RAG system can provide context-aware answers for use cases such as documentation assistants, knowledge search, and AI-powered project tools.
- Document source: `.txt` files from the `data/` directory
- Chunking: `RecursiveCharacterTextSplitter` (500 characters, 10% overlap)
- Embeddings: SentenceTransformers (`all-MiniLM-L6-v2`)

| Component | Purpose |
|---|---|
| Python 3.9+ | Core language |
| LangChain | RAG orchestration |
| ChromaDB | Vector store |
| SentenceTransformers (all-MiniLM-L6-v2) | Embedding generation |
| LangChain RecursiveCharacterTextSplitter | Chunking documents |
| OpenAI GPT, Groq Llama, Google Gemini | Multi-model support |
| Text files | Document source |
This project uses the `.txt` files in the `data/` folder; all files are committed to the repository. For this project, I am using the same text files provided in the ReadyTensor project template, each around 2-3 KB in size. These documents are split into chunks, which are converted into embeddings using HuggingFace models and stored in a vector database for semantic search.
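To make "embeddings for semantic search" concrete, here is a tiny, self-contained illustration (not part of the project code) of how a stored chunk and a user question are compared; the model name comes from the tech stack table above:

```python
# Embed one stored chunk and one user question, then compare them.
# A higher cosine similarity means the chunk is more relevant at retrieval time.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
chunk_vec = model.encode("Machine learning is a subset of artificial intelligence.")
query_vec = model.encode("What is machine learning?")

print(util.cos_sim(query_vec, chunk_vec))  # similarity score; higher = more relevant
```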
Before starting, make sure you have Python 3.9+ installed and an API key for at least one supported provider (OpenAI, Groq, or Google Gemini). Important: this project depends on the specific packages listed in requirements.txt.
Clone and install dependencies:
git clone https://github.com/techbrij/rag-aaidc-project1
cd rag-aaidc-project1
python3 -m venv venv
Activate the virtual environment:
On Windows:
venv\Scripts\activate
On macOS/Linux:
source venv/bin/activate
Install dependencies:
pip install -r requirements.txt
Configure your API key:
# Create environment file (choose the method that works on your system)
cp .env.example .env     # Linux/Mac
copy .env.example .env   # Windows
Edit .env and add your API key:
OPENAI_API_KEY=your_key_here
# OR
GROQ_API_KEY=your_key_here
# OR
GOOGLE_API_KEY=your_key_here
Add one API key and comment out the other key entries to avoid conflicts; only one key should be active at a time.
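The application picks up whichever key is set. Purely as an illustration (assuming the common python-dotenv pattern, which may differ from the actual loading code in `src/app.py`), checking that exactly one key is active could look like this:

```python
# Hypothetical sketch: load .env and verify that exactly one provider key is set.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

active_keys = [name for name in ("OPENAI_API_KEY", "GROQ_API_KEY", "GOOGLE_API_KEY")
               if os.getenv(name)]
assert len(active_keys) == 1, "Exactly one provider key should be active"
```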
Run the application:
python src/app.py
The system automatically processes the documents in the data/ directory and provides an interactive interface for asking questions: enter a question and the assistant answers it from your documents.

The first step is to prepare your documents and put them in the data/ folder. The directory contains sample files on various topics; each file holds text content you want your RAG system to search through. For simplicity, I am using the same files provided in the template.
Step 2: Document Loading
Location: `src/app.py`, Function: `load_documents`
- Loads `.txt` files from the `data/` directory.
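As a rough, hypothetical sketch (the real implementation in `src/app.py` may differ), such a loader simply reads every `.txt` file under `data/`:

```python
# Illustrative only: read all .txt files from the data directory into memory.
from pathlib import Path

def load_documents(data_dir: str = "data") -> dict[str, str]:
    """Return {filename: text} for every .txt file in data_dir."""
    return {
        path.name: path.read_text(encoding="utf-8")
        for path in sorted(Path(data_dir).glob("*.txt"))
    }
```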
Step 3: Text Chunking
Location: `src/vectordb.py`, Function: `chunk_text`
- Splits documents into chunks with `RecursiveCharacterTextSplitter` (500 characters, 10% overlap).
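A minimal sketch of such a chunking helper, using the splitter settings listed earlier (500 characters, 10% overlap); the actual method in `src/vectordb.py` may differ:

```python
# Illustrative only: split raw text into overlapping chunks for embedding.
from langchain_text_splitters import RecursiveCharacterTextSplitter

def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap
    )
    return splitter.split_text(text)
```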
Step 4: Document Ingestion
Location: `src/vectordb.py`, Function: `add_documents`
- Uses the `chunk_text()` method to split documents.
- Embeds chunks with `self.embedding_model.encode()`.
- Stores them with `self.collection.add()`.
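A hypothetical sketch of that ingestion method, assuming the wrapper holds a SentenceTransformer as `self.embedding_model` and a ChromaDB collection as `self.collection` (as the calls above suggest):

```python
# Illustrative only: chunk each document, embed the chunks, and store them.
def add_documents(self, documents: dict[str, str]) -> None:
    for name, text in documents.items():
        chunks = self.chunk_text(text)
        embeddings = self.embedding_model.encode(chunks).tolist()
        self.collection.add(
            ids=[f"{name}-{i}" for i in range(len(chunks))],
            documents=chunks,
            embeddings=embeddings,
            metadatas=[{"source": name}] * len(chunks),
        )
```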
Step 5: Semantic Search
Location: `src/vectordb.py`, Function: `search`
- Embeds the query with `self.embedding_model.encode()`.
- Queries the collection with `self.collection.query()`.
- Returns `documents`, `metadatas`, `distances`, and `ids`.
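A minimal sketch of the retrieval call under the same assumptions (`self.embedding_model`, `self.collection`); the real method may use a different number of results:

```python
# Illustrative only: embed the query and return the nearest chunks from ChromaDB.
def search(self, query: str, n_results: int = 3) -> dict:
    query_embedding = self.embedding_model.encode([query]).tolist()
    return self.collection.query(
        query_embeddings=query_embedding,
        n_results=n_results,
        include=["documents", "metadatas", "distances"],  # ids are always returned
    )
```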
Step 6: Prompt Template
Location: `src/app.py`
- Uses `ChatPromptTemplate.from_template()` to create the template.
- The template contains `{context}` (retrieved documents) and `{question}` (user query).
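For illustration, a template of this shape could be built as follows; the exact wording in `src/app.py` may differ:

```python
# Illustrative only: a grounded-answer prompt with {context} and {question} slots.
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context below. "
    "If the answer is not in the context, say the question is not answerable "
    "from the provided documents.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)
```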
Step 7: Query
Location: `src/app.py`, Function: `query`
- Uses `self.vector_db.search()` to find relevant context.
- Uses `self.chain.invoke()` to generate a response.

Project structure:

```
rag-aaidc-project1/
├── src/
│   ├── app.py                  # Main RAG application (implement Steps 2, 6-7)
│   └── vectordb.py             # Vector database wrapper (implement Steps 3-5)
├── data/                       # Replace with your documents (Step 1)
│   └── *.txt                   # Your text files here
├── tests/
│   └── test_performance.py     # Performance test
├── requirements.txt            # All dependencies included
├── .env                        # Environment template
└── README.md                   # The project guide
```
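Putting Steps 5 through 7 together, the query path might look roughly like this hypothetical sketch (the actual logic lives in `src/app.py`, where `self.chain` is the prompt/LLM chain):

```python
# Illustrative only: retrieve relevant chunks, then let the LLM chain answer.
def query(self, question: str) -> str:
    results = self.vector_db.search(question)
    context = "\n\n".join(results["documents"][0])  # top matches for this query
    return self.chain.invoke({"context": context, "question": question})
```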
To assess the performance of our Retrieval-Augmented Generation (RAG) assistant, we developed a dedicated test class using the pytest framework. The evaluation focused on three core components: document loading, document ingestion, and the assistant's response generation.
For each component, we executed the relevant function multiple times and recorded the elapsed time using Python's time module. For the assistant's response, we used a set of diverse questions and averaged the response times over several runs to ensure reliability and account for variability due to external API calls.
The test module is in the tests/ folder. First, install pytest:
pip install pytest
Then run the test with the following command:
pytest tests/test_performance.py -s
Note: tune the configuration to your requirements. I used 5 iterations with the gemini-2.5-flash LLM.
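For reference, a timing test of this kind can be structured as in the following sketch (illustrative, not the contents of `tests/test_performance.py`; it assumes a pytest fixture named `assistant` that exposes `invoke()`):

```python
# Illustrative only: average the assistant's response time over several runs.
import time

QUESTIONS = [
    "What is artificial intelligence?",
    "What are Machine Learning and MLOps?",
    "Who is home minister of India?",
]
RUNS = 5

def test_response_time(assistant):
    for question in QUESTIONS:
        elapsed = []
        for _ in range(RUNS):
            start = time.perf_counter()
            answer = assistant.invoke(question)
            elapsed.append(time.perf_counter() - start)
            assert answer  # every response must be non-empty
        avg = sum(elapsed) / RUNS
        print(f"Avg time for assistant.invoke({question[:30]!r}...) over {RUNS} runs: {avg:.4f} seconds")
```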
Infrastructure: Windows 11, 32 GB RAM, 2TB SSD, i7-13650HX
Environment: Python 3.13.7, gemini-2.5-flash, LangChain 0.3.27
Total Documents: 7
Total Chunks: 35
We used different types of questions for the test cases:
Q1: What is artificial intelligence? (Simple)
Q2: What are Machine Learning and MLOps? (Multi-document question)
Q3: Who is home minister of India? (Out of context)
The following metrics were obtained from our evaluation:
| Metric | Value (seconds) |
|---|---|
| Document Load Time | <0.001 |
| Document Ingestion | 0.41 |
| Avg. Response Time Q1 | 2.94 |
| Avg. Response Time Q2 | 4.33 |
| Avg. Response Time Q3 | 2.07 |
All responses were successfully generated and validated for non-emptiness. The assistant demonstrated consistent performance across different queries, with response times suitable for interactive use.
Avg time for assistant.invoke('What is artificial intelligenc...') over 5 runs: 2.9413 seconds
Avg time for assistant.invoke('What are Machine Learning and ...') over 5 runs: 4.3323 seconds
Avg time for assistant.invoke('Who is home minister of India?...') over 5 runs: 2.0753 seconds
The last question was out of context, and in each iteration it generated the following expected response:
The question is not answerable from the provided documents.
Only `.txt` files are supported; other formats (PDF, HTML, etc.) are not.

RAG enhances large language models by providing them with relevant context from external documents, instead of relying solely on pre-trained knowledge.
Traditional LLMs have limitations: they can hallucinate, lack transparency, and rely on static parametric knowledge. RAG addresses these problems by grounding responses in actual documents.
This work demonstrates the practical significance of Retrieval-Augmented Generation (RAG) as an effective approach to overcome key limitations of large language models, particularly hallucination, lack of transparency, and dependency on static parametric knowledge. By combining semantic retrieval over a curated knowledge base with controlled text generation, the proposed RAG-based assistant delivers context-grounded, verifiable, and domain-specific responses.
To evaluate the effectiveness of the proposed RAG-based assistant, we conducted a comparative examination of traditional large language models (LLMs) without retrieval versus our RAG configuration. We tested around 10 questions and manually checked answer relevancy and context accuracy. For example, when I asked the same question (as shown in the output screenshot) to Gemini directly, I got a general and long response, while our system generated an accurate result based on the provided data.
This project is maintained as part of the ReadyTensor Agentic AI Essentials Certification Program and is intended as an educational and reference implementation of a RAG system.
Maintenance Status: Actively maintained for learning, experimentation, and certification purposes.
Support: Community-driven. Issues and pull requests are welcome through the GitHub repository.
While using this project, it is recommended to monitor API usage and error rates. If dependencies, LLMs, or provider APIs change, the project should be updated accordingly.
This project is licensed under the MIT License, allowing free use, modification, and distribution with proper attribution.
This project implements a complete RAG pipeline using embeddings, an LLM, and ChromaDB for vector storage. Text documents are processed, embedded, and stored for semantic search, allowing the system to retrieve relevant information and answer user queries.
https://github.com/techbrij/rag-aaidc-project1
Brij Mohan