RT-RAG Assistant is a modular Retrieval-Augmented Generation (RAG) system designed to enable natural language querying over user-provided documents. It combines modern large language models (LLMs) with efficient document retrieval techniques using vector similarity search. Built with production-readiness in mind, the framework offers seamless ingestion of various document formats, transformation into vector stores using FAISS, and response generation through OpenAI APIs, all exposed via a FastAPI-based interface. With features like Docker integration, comprehensive testing, and structured documentation, RT-RAG provides a scalable, customizable assistant for document-centric workflows.
The recent surge in the use of LLMs has led to significant interest in Retrieval-Augmented Generation as a method to overcome the limitations of static model knowledge. However, many implementations of RAG systems are either tightly coupled to specific architectures or lack sufficient flexibility for production deployment. RT-RAG addresses these gaps with three design goals:
Modular Architecture: Separates ingestion, embedding, retrieval, and generation into discrete, extensible components.
Flexibility: Supports multiple document types, easily configurable pipelines, and optional frontend integration.
Production-Readiness: Provides Docker support, test coverage, and CI configurations out-of-the-box.
This system is especially suited for researchers, engineers, and businesses looking to quickly stand up and iterate on document-querying applications.
RT-RAG employs a six-stage RAG pipeline with detailed, extensible handling at each step:
Document Loading
Supports ingestion of multiple formats (PDF, TXT, JSON) using modular loaders (e.g., PyMuPDF for PDFs, custom readers for structured data). Metadata extraction and file normalization are handled automatically.
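A minimal sketch of such a format-dispatching loader is shown below; it assumes PyMuPDF (imported as `fitz`) is installed, and the `load_document` helper is illustrative rather than the actual RT-RAG loader API.

```python
import json
from pathlib import Path

import fitz  # PyMuPDF


def load_document(path: str) -> dict:
    """Return raw text plus basic metadata for a PDF, TXT, or JSON file."""
    p = Path(path)
    suffix = p.suffix.lower()
    if suffix == ".pdf":
        with fitz.open(str(p)) as pdf:
            text = "\n".join(page.get_text() for page in pdf)
    elif suffix == ".txt":
        text = p.read_text(encoding="utf-8")
    elif suffix == ".json":
        # Flatten structured data into readable text for downstream chunking.
        text = json.dumps(json.loads(p.read_text(encoding="utf-8")), indent=2)
    else:
        raise ValueError(f"Unsupported format: {suffix}")
    return {"text": text, "metadata": {"source": str(p), "format": suffix}}
```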
Text Splitting (Chunking)
RT-RAG uses recursive text splitters (e.g., from LangChain) to divide documents into semantically meaningful chunks. Chunk size and overlap are tunable; defaults are 512 tokens with a 20-token overlap to ensure context continuity. Chunks preserve document origin and positional metadata for traceability.
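The snippet below sketches how this step can be configured with LangChain's recursive splitter; the tiktoken-based constructor is assumed so that sizes are measured in tokens, and the sample text and metadata are placeholders.

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=512,    # tokens per chunk (default)
    chunk_overlap=20,  # tokens shared between neighbouring chunks
)

# create_documents keeps per-chunk metadata, preserving document origin
# and making each chunk traceable back to its source file.
chunks = splitter.create_documents(
    texts=["<full document text>"],
    metadatas=[{"source": "example.pdf"}],
)
```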
Embedding Generation
Each chunk is vectorized using models like text-embedding-ada-002 from OpenAI. The system can be configured to use HuggingFace embeddings (e.g., BGE, MiniLM) for local inference. Embeddings are normalized and cached for faster indexing.
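A minimal sketch of the OpenAI-backed embedding path follows; `embed_texts` is an illustrative helper, and L2 normalization is shown so that inner-product search behaves like cosine similarity.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def embed_texts(texts: list[str]) -> np.ndarray:
    """Embed a batch of chunk texts and L2-normalize the vectors."""
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    vectors = np.array([item.embedding for item in resp.data], dtype="float32")
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
```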
Vector Store Management
FAISS serves as the default vector store, enabling efficient similarity search using inner product or cosine metrics. The embedding store is persistent between runs and supports indexing strategies (e.g., IVF, Flat).
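The sketch below shows a flat inner-product index with on-disk persistence, matching the defaults described above; `build_index` is an illustrative helper rather than RT-RAG's actual store interface.

```python
import faiss
import numpy as np


def build_index(vectors: np.ndarray, path: str = "index.faiss") -> faiss.Index:
    """Index normalized chunk embeddings and persist the index to disk."""
    index = faiss.IndexFlatIP(vectors.shape[1])  # inner product == cosine on unit vectors
    index.add(vectors)
    faiss.write_index(index, path)               # persisted between runs
    return index


# A later run can reload the persisted index instead of re-embedding:
# index = faiss.read_index("index.faiss")
```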
Query Handling and Chunk Retrieval
Incoming queries are embedded in real time and matched against stored vectors. Top-N (default: 5) relevant chunks are retrieved, with scores and source metadata passed to the prompt construction phase.
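The following sketch illustrates query-time retrieval, reusing `embed_texts` from the embedding step above; `retrieve` and its return shape are illustrative.

```python
TOP_N = 5  # default number of chunks passed to prompt construction


def retrieve(query: str, index: "faiss.Index", chunks: list) -> list[dict]:
    """Embed the query and return the top-N chunks with similarity scores."""
    query_vec = embed_texts([query])
    scores, ids = index.search(query_vec, TOP_N)
    return [
        {"chunk": chunks[i], "score": float(s)}
        for i, s in zip(ids[0], scores[0])
        if i != -1  # FAISS pads missing results with -1
    ]
```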
Prompt Construction and Answer Generation
Retrieved chunks are dynamically injected into a customizable prompt template. Example:
```
SYSTEM: You are an expert assistant.
CONTEXT: {{retrieved_context}}
QUESTION: {{user_query}}
ANSWER:
```
The system supports few-shot and zero-shot prompting, with placeholder tokens replaced at runtime. Answers are generated using OpenAI's GPT-4 or custom LLM endpoints.
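The sketch below shows how the template above might be filled and sent to the chat completions API; it reuses `client` and `retrieve` from the earlier sketches and assumes the chunks are LangChain documents exposing `page_content`.

```python
PROMPT = "CONTEXT: {retrieved_context}\nQUESTION: {user_query}\nANSWER:"


def answer(query: str, index, chunks) -> str:
    """Retrieve supporting chunks and generate an answer with GPT-4."""
    hits = retrieve(query, index, chunks)
    context = "\n\n".join(hit["chunk"].page_content for hit in hits)
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are an expert assistant."},
            {"role": "user", "content": PROMPT.format(
                retrieved_context=context, user_query=query)},
        ],
    )
    return resp.choices[0].message.content
```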
Key Components
Backend: Python-based FastAPI server (src/rt_rag/api_main.py); a minimal endpoint sketch follows this list
RAG Logic: Modular pipeline logic encapsulated in rag_assistant.py
Vector Storage: FAISS for fast retrieval with potential for other backends
Frontend: Optional React interface for interacting with the assistant
Containerization: Docker and Docker Compose support for deployment
Documentation: Sphinx-based, with ReadTheDocs compatibility
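The endpoint sketch below is hypothetical: the route name and request model are illustrative, and the actual routes live in src/rt_rag/api_main.py.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="RT-RAG Assistant")


class QueryRequest(BaseModel):
    question: str


@app.post("/query")  # hypothetical route; see api_main.py for the real ones
def query_endpoint(req: QueryRequest) -> dict:
    # Delegates to the pipeline, e.g. answer() from the generation sketch,
    # with a module-level index and chunk list prepared at startup.
    return {"answer": answer(req.question, index, chunks)}
```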
The project is organized into clear folders for data, source code, documentation, and testing. It follows best practices including pre-commit hooks, environment isolation (.env), and dependency management via requirements/.
RT-RAG/
├── src/
│   └── rt_rag/
│       ├── api_main.py
│       ├── rag_assistant.py
│       ├── chunking/
│       ├── embeddings/
│       ├── retrieval/
│       ├── prompting/
│       └── utils/
├── tests/
├── data/
├── docker/
├── requirements/
├── docs/
└── README.md
We evaluated RT-RAG across different deployment scenarios, testing:
Scalability: Verified ingestion and retrieval across large corpora (e.g., 1K+ PDF documents).
Latency: Measured response times of ~300ms for document queries under local hosting.
Extensibility: Successfully plugged in alternative embedding models (e.g., HuggingFace-based).
Deployment Flexibility: Docker Compose stacks deployed successfully in cloud environments with minimal configuration.
The system passed all unit and integration tests, with over 85% code coverage.
RT-RAG delivers:
Fast Document Querying: High recall using FAISS + OpenAI embeddings
Smooth Developer Experience: With detailed logging, environment templates, and Dockerized deployment
Strong Performance: Efficient for both local and cloud-hosted workflows
Customizability: Easy to swap in alternative LLMs, vector stores, and file loaders
Compared to ad-hoc RAG scripts or monolithic implementations, RT-RAG cuts development time by providing a production-grade, extensible boilerplate.
RT-RAG provides a robust, end-to-end solution for building document-aware AI assistants using Retrieval-Augmented Generation. Its modular design, production-friendly tooling, and API-first interface make it suitable for teams looking to incorporate LLM-powered document interaction with minimal friction.
End-to-end RAG assistant with modular architecture
Native FAISS and OpenAI support
Production-ready deployment via FastAPI and Docker
Extensible to multiple document types and LLMs
Upcoming improvements include:
Support for hybrid search (BM25 + vector similarity)
Integration with open-source LLMs for local inference
Document annotation and user feedback loop for answer quality
Real-time ingestion and incremental vector updates