This project implements a Retrieval-Augmented Generation (RAG) Assistant, a hybrid intelligent system that integrates a vector database (ChromaDB) with multiple Large Language Models (LLMs), including OpenAI GPT, Groq Llama, and Google Gemini.
The system allows users to query a knowledge base built from local documents they provide (PDFs, Word files, and text files), retrieves semantically relevant information using HuggingFace embeddings, and generates accurate, context-aware answers through one of the available LLM APIs.
It demonstrates a complete end-to-end architecture for a lightweight, AI-powered knowledge retrieval system designed for research, clinical, or educational applications.
Modern LLMs are powerful, but they lack direct access to private or domain-specific data.
For example, a clinician or researcher may have internal reports or publications stored locally, which are not accessible to cloud-based AI models.
This project bridges that gap by combining:
- Vector search (for factual grounding), and
- Generative reasoning (for natural, contextual answers).
The motivation was to create a self-contained, local retrieval system that enhances any LLM with private knowledge injection, while still allowing flexible integration with multiple AI providers depending on available credentials or cost.
The system follows a modular architecture composed of two main modules and an environment configuration file:

- `app.py` – the orchestration module, responsible for:
  - Environment variable management (dotenv)
  - Document loading and preprocessing
  - Prompt templating and chaining (LangChain; see the sketch after this list)
  - Query-answer workflow management
- `vectordb.py` – the vector database interface, handling:
  - Text chunking and embedding generation
  - Persistent storage of document embeddings
  - Semantic similarity search and retrieval via ChromaDB
- `.env` – the configuration file managing API keys and model selection:
  - `OPENAI_API_KEY`, `GROQ_API_KEY`, `GEMINI_API_KEY`
  - Embedding and ChromaDB collection settings
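For illustration, the prompt templating and chaining step might look like the following LangChain (LCEL) sketch. The prompt wording, the provider class, and the model name are assumptions for the example, not the project's exact code.

```python
# Minimal sketch of the prompt templating and chaining step (LangChain / LCEL).
# The prompt wording, provider class, and model name are illustrative assumptions,
# not the project's exact code.
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI  # any of the three supported providers could be used

# Prompt that injects the retrieved chunks as grounding context
prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # hypothetical model choice

# prompt -> model -> plain-string output
chain = prompt | llm | StrOutputParser()

answer = chain.invoke({"context": "...retrieved chunks...", "question": "What is AMR?"})
print(answer)
```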
The ingestion pipeline loads `.txt`, `.pdf`, `.docx`, or `.md` files from the `data/` directory, splits them with `RecursiveCharacterTextSplitter` (500 characters, 50-character overlap), and embeds each chunk with the `all-MiniLM-L6-v2` model (see the sketch after the table below).

Component | Technology Used | Purpose |
---|---|---|
Programming Language | Python 3.10+ | Core development language |
LLM Integration | LangChain | Orchestration and chaining of LLMs |
Vector Database | ChromaDB | Persistent vector search for contextual retrieval |
Embeddings | SentenceTransformers (`all-MiniLM-L6-v2`) | Semantic text encoding |
Text Splitter | LangChain RecursiveCharacterTextSplitter | Chunking documents for embedding |
Supported LLM APIs | OpenAI GPT, Groq Llama, Google Gemini | Multi-model support |
Document Parsing | PyPDF2, python-docx | Reading .pdf and .docx files respectively |
Environment Management | python-dotenv | Secure API key loading |
Storage | PersistentClient (`./chroma_db`) | Local vector persistence |
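As a rough sketch of the chunk-and-embed step described above (chunk size, overlap, and model name taken from this README; the input file path and variable names are illustrative):

```python
# Illustrative chunk-and-embed step; chunk size, overlap, and model name mirror
# the README, while the input file path and variable names are hypothetical.
from langchain_text_splitters import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

text = open("data/example.txt", encoding="utf-8").read()  # hypothetical input file
chunks = splitter.split_text(text)     # list of ~500-character chunks
embeddings = embedder.encode(chunks)   # one 384-dimensional vector per chunk

print(len(chunks), embeddings.shape)
```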
Configuration lives in a `.env` file, supported inputs are `.pdf`, `.docx`, `.txt`, and `.md` files, and the code is split between `app.py` (LLM orchestration) and `vectordb.py` (vector database management).

Ensure you have:
- Python 3.10 or newer
- `pip` installed
```bash
git clone https://github.com/Nago-01/agentic_ai_project1.git
cd agentic_ai_project1

python -m venv .venv
source .venv/bin/activate   # Linux/Mac
.venv\Scripts\activate      # Windows

pip install -r requirements.txt
```
Create a `.env` file in the root directory with your API keys:
```env
OPENAI_API_KEY=sk-...
GROQ_API_KEY=gsk-...
GOOGLE_API_KEY=AIza...
EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
CHROMA_COLLECTION_NAME=rag_documents
```
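At startup these values are read with python-dotenv; a minimal sketch (variable names match the example above, the defaults are illustrative):

```python
# Sketch of how the keys above could be read at startup with python-dotenv.
# Variable names match the .env example; the defaults are illustrative.
import os
from dotenv import load_dotenv

load_dotenv()  # loads .env from the project root into the process environment

groq_key = os.getenv("GROQ_API_KEY")
openai_key = os.getenv("OPENAI_API_KEY")
google_key = os.getenv("GOOGLE_API_KEY")
embedding_model = os.getenv("EMBEDDING_MODEL", "sentence-transformers/all-MiniLM-L6-v2")
collection_name = os.getenv("CHROMA_COLLECTION_NAME", "rag_documents")
```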
Place your `.pdf`, `.docx`, `.txt`, or `.md` files inside the `data/` directory.
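For reference, a rough sketch of how these formats could be read before chunking, using the PyPDF2 and python-docx libraries listed in the tech stack; the helper function is hypothetical, not the project's loader:

```python
# Hypothetical loader for the supported formats, using the PyPDF2 and python-docx
# libraries named in the tech stack; the project's actual loader may differ.
from pathlib import Path

from docx import Document
from PyPDF2 import PdfReader


def load_document(path: Path) -> str:
    if path.suffix == ".pdf":
        return "\n".join(page.extract_text() or "" for page in PdfReader(str(path)).pages)
    if path.suffix == ".docx":
        return "\n".join(p.text for p in Document(str(path)).paragraphs)
    return path.read_text(encoding="utf-8")  # .txt / .md


texts = [load_document(p) for p in Path("data").iterdir() if p.is_file()]
```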
```bash
python -m src.app
```
```text
Initializing RAG Assistant...
Loading embedding model: sentence-transformers/all-MiniLM-L6-v2
Vector database initialized with collection: rag_documents
RAG Assistant initialized successfully
Loading documents...
Loaded 3 documents
Enter a question or 'quit' to exit: What is antimicrobial resistance?
Antimicrobial resistance refers to the ability of microorganisms, such as bacteria, viruses, and fungi, to resist the effects of antimicrobial agents like antibiotics and antivirals. It occurs through genetic mutations or horizontal gene transfer.
```
`RAGAssistant`
Method | Description |
---|---|
`__init__()` | Initializes the assistant; loads the LLM and vector DB. |
`_initialize_llm()` | Selects an available LLM (Groq → OpenAI → Gemini); see the sketch below. |
`add_documents(documents: List)` | Adds parsed documents to the vector database. |
`invoke(input: str, n_results: int = 3)` | Retrieves relevant chunks and generates an answer. |
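A sketch of the Groq → OpenAI → Gemini fallback performed by `_initialize_llm()`, using the corresponding LangChain integration classes; the model names and exact selection logic are assumptions:

```python
# Sketch of the Groq → OpenAI → Gemini fallback performed by _initialize_llm(),
# using LangChain's provider integrations; model names and the exact selection
# logic are assumptions.
import os


def initialize_llm():
    if os.getenv("GROQ_API_KEY"):
        from langchain_groq import ChatGroq
        return ChatGroq(model="llama-3.1-8b-instant")            # model name is an assumption
    if os.getenv("OPENAI_API_KEY"):
        from langchain_openai import ChatOpenAI
        return ChatOpenAI(model="gpt-4o-mini")                   # model name is an assumption
    if os.getenv("GOOGLE_API_KEY"):
        from langchain_google_genai import ChatGoogleGenerativeAI
        return ChatGoogleGenerativeAI(model="gemini-1.5-flash")  # model name is an assumption
    raise RuntimeError("No LLM API key found in the environment")
```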
`VectorDB`
Method | Description |
---|---|
`chunk_text(text: str, chunk_size: int = 500)` | Splits text into smaller chunks for embedding. |
`add_documents(documents: List)` | Embeds and stores document chunks in ChromaDB. |
`search(query: str, n_results: int = 5)` | Finds the most similar chunks by semantic similarity; see the sketch below. |
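These methods map roughly onto the chromadb client API; a simplified sketch using the persistent store and collection name configured above (the IDs and example chunks are illustrative):

```python
# Simplified correspondence between the VectorDB methods and the chromadb API.
# Collection name and storage path follow the configuration above; IDs and the
# example chunks are illustrative.
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(name="rag_documents")

# add_documents(): store chunks under unique IDs (Chroma embeds them with its
# configured embedding function if no vectors are supplied)
chunks = ["Antimicrobial resistance arises through genetic mutations...", "A second chunk..."]
collection.add(ids=[f"doc-{i}" for i in range(len(chunks))], documents=chunks)

# search(): semantic similarity query returning the top-n matching chunks
results = collection.query(query_texts=["What is antimicrobial resistance?"], n_results=5)
print(results["documents"][0])
```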
The hybrid memory design combines SQLite3 storage with long-term memory retention using VectorStoreRetrievalMemory.

To evaluate the performance and effectiveness of the RAG-based assistant, I compared it against a few baseline and state-of-the-art systems:
- LlamaIndex Simple RAG Implementation – another lightweight RAG baseline for document retrieval.
- Haystack Document QA Pipeline – a robust framework for open-domain document question answering.
- GPT-4 with Context Window Only (no retrieval) – a pure LLM approach without external knowledge integration.
Each system was tested on the same document set using 15 user queries of varying complexity. Answer relevance and context accuracy were assessed qualitatively by manual review, while response latency was measured quantitatively.
The system demonstrated competitive performance in factual accuracy and context retrieval consistency, particularly due to its use of SentenceTransformer embeddings and Chroma persistent vector store, which allowed for efficient similarity search and reusable storage. However, the system exhibited context drift in multi-document scenarios, a limitation I plan to address through query-specific context filtering in future versions.
Metric | Description | Measurement Approach | Value |
---|---|---|---|
Latency (s) | Average time taken from query input to final answer generation. | Mean across 50 runs. | 2.3s |
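The latency figure can be reproduced by timing repeated calls to the assistant and averaging; a minimal sketch (the query text and run count are illustrative):

```python
# Sketch of how the mean latency figure could be measured: time repeated calls
# to the assistant and average them. The query text and run count are illustrative.
import time


def mean_latency(assistant, query: str, runs: int = 50) -> float:
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        assistant.invoke(query)
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)

# Example: mean_latency(rag, "What is antimicrobial resistance?")
```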
This project showcases a RAG-based system implemented to align with the Foundation of Agentic AI - the first module of this program. It combines retrieval-augmented generation (RAG) with hybrid memory and minimal agentic reasoning to provide accurate, context-aware answers derived directly from uploaded publications. By integrating vector search, LLM-based inference, and document chunking, the assistant efficiently bridges the gap between static document knowledge and dynamic question-answering.
Beyond static retrieval, the semi-agentic design introduces flexibility and autonomy in processing queries, ensuring that the assistant evolves toward more interactive and reasoning-capable systems.
This project therefore serves as both a proof-of-concept and a foundation for future intelligent assistants that can reason over specialized knowledge bases with precision.
LangChain Documentation: https://python.langchain.com
ChromaDB: https://docs.trychroma.com
HuggingFace SentenceTransformers: https://www.sbert.net
OpenAI API: https://platform.openai.com/docs
Groq Llama Models: https://groq.com
Google Gemini API: https://ai.google.dev
Python-dotenv: https://pypi.org/project/python-dotenv
Image Credit: WTF in Tech