RT-RAG Assistant is a modular Retrieval-Augmented Generation (RAG) system designed to enable natural language querying over user-provided documents. It combines modern large language models (LLMs) with efficient document retrieval techniques using vector similarity search. Built with production-readiness in mind, the framework offers seamless ingestion of various document formats, transformation into vector stores using FAISS, and response generation through OpenAI APIs, all exposed via a FastAPI-based interface. With features like Docker integration, comprehensive testing, and structured documentation, RT-RAG provides a scalable, customizable assistant for document-centric workflows.
The recent surge in the use of LLMs has led to significant interest in Retrieval-Augmented Generation as a method to overcome the limitations of static model knowledge. However, many implementations of RAG systems are either tightly coupled to specific architectures or lack sufficient flexibility for production deployment. RT-RAG addresses these gaps with three design goals:
Modular Architecture: Separates ingestion, embedding, retrieval, and generation into discrete, extensible components.
Flexibility: Supports multiple document types, easily configurable pipelines, and optional frontend integration.
Production-Readiness: Provides Docker support, test coverage, and CI configurations out-of-the-box.
This system is especially suited for researchers, engineers, and businesses looking to quickly stand up and iterate on document-querying applications.
RT-RAG employs a six-stage RAG pipeline with detailed, extensible handling at each step:
Document Loading
Supports ingestion of multiple formats (PDF, TXT, JSON) using modular loaders (e.g., PyMuPDF for PDFs, custom readers for structured data). Metadata extraction and file normalization are handled automatically.
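A minimal sketch of such a format-dispatching loader is shown below; it assumes PyMuPDF (imported as `fitz`) is installed, and the `load_document` helper is illustrative rather than the actual RT-RAG loader API.

```python
import json
from pathlib import Path

import fitz  # PyMuPDF


def load_document(path: str) -> dict:
    """Return raw text plus basic metadata for a PDF, TXT, or JSON file."""
    p = Path(path)
    suffix = p.suffix.lower()
    if suffix == ".pdf":
        with fitz.open(str(p)) as pdf:
            text = "\n".join(page.get_text() for page in pdf)
    elif suffix == ".txt":
        text = p.read_text(encoding="utf-8")
    elif suffix == ".json":
        # Flatten structured data into readable text for downstream chunking.
        text = json.dumps(json.loads(p.read_text(encoding="utf-8")), indent=2)
    else:
        raise ValueError(f"Unsupported format: {suffix}")
    return {"text": text, "metadata": {"source": str(p), "format": suffix}}
```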
Text Splitting (Chunking)
RT-RAG uses recursive text splitters (e.g., from LangChain) to divide documents into semantically meaningful chunks. Chunk size and overlap are tunable; defaults are 512 tokens with a 20-token overlap to ensure context continuity. Chunks preserve document origin and positional metadata for traceability.
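The snippet below sketches how this step can be configured with LangChain's recursive splitter; the tiktoken-based constructor is assumed so that sizes are measured in tokens, and the sample text and metadata are placeholders.

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=512,    # tokens per chunk (default)
    chunk_overlap=20,  # tokens shared between neighbouring chunks
)

# create_documents keeps per-chunk metadata, preserving document origin
# and making each chunk traceable back to its source file.
chunks = splitter.create_documents(
    texts=["<full document text>"],
    metadatas=[{"source": "example.pdf"}],
)
```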
Embedding Generation
Each chunk is vectorized using models like text-embedding-ada-002 from OpenAI. The system can be configured to use HuggingFace embeddings (e.g., BGE, MiniLM) for local inference. Embeddings are normalized and cached for faster indexing.
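A minimal sketch of the OpenAI-backed embedding path follows; `embed_texts` is an illustrative helper, and L2 normalization is shown so that inner-product search behaves like cosine similarity.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def embed_texts(texts: list[str]) -> np.ndarray:
    """Embed a batch of chunk texts and L2-normalize the vectors."""
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    vectors = np.array([item.embedding for item in resp.data], dtype="float32")
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
```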
Vector Store Management
FAISS serves as the default vector store, enabling efficient similarity search using inner product or cosine metrics. The embedding store is persistent between runs and supports indexing strategies (e.g., IVF, Flat).
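The sketch below shows a flat inner-product index with on-disk persistence, matching the defaults described above; `build_index` is an illustrative helper rather than RT-RAG's actual store interface.

```python
import faiss
import numpy as np


def build_index(vectors: np.ndarray, path: str = "index.faiss") -> faiss.Index:
    """Index normalized chunk embeddings and persist the index to disk."""
    index = faiss.IndexFlatIP(vectors.shape[1])  # inner product == cosine on unit vectors
    index.add(vectors)
    faiss.write_index(index, path)               # persisted between runs
    return index


# A later run can reload the persisted index instead of re-embedding:
# index = faiss.read_index("index.faiss")
```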
Query Handling and Chunk Retrieval
Incoming queries are embedded in real time and matched against stored vectors. Top-N (default: 5) relevant chunks are retrieved, with scores and source metadata passed to the prompt construction phase.
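The following sketch illustrates query-time retrieval, reusing `embed_texts` from the embedding step above; `retrieve` and its return shape are illustrative.

```python
TOP_N = 5  # default number of chunks passed to prompt construction


def retrieve(query: str, index: "faiss.Index", chunks: list) -> list[dict]:
    """Embed the query and return the top-N chunks with similarity scores."""
    query_vec = embed_texts([query])
    scores, ids = index.search(query_vec, TOP_N)
    return [
        {"chunk": chunks[i], "score": float(s)}
        for i, s in zip(ids[0], scores[0])
        if i != -1  # FAISS pads missing results with -1
    ]
```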
Prompt Construction and Answer Generation
Retrieved chunks are dynamically injected into a customizable prompt template. Example:
```
SYSTEM: You are an expert assistant.
CONTEXT: {{retrieved_context}}
QUESTION: {{user_query}}
ANSWER:
```
The system supports few-shot and zero-shot prompting, with placeholder tokens replaced at runtime. Answers are generated using OpenAI's GPT-4 or custom LLM endpoints.
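The sketch below shows how the template above might be filled and sent to the chat completions API; it reuses `client` and `retrieve` from the earlier sketches and assumes the chunks are LangChain documents exposing `page_content`.

```python
PROMPT = "CONTEXT: {retrieved_context}\nQUESTION: {user_query}\nANSWER:"


def answer(query: str, index, chunks) -> str:
    """Retrieve supporting chunks and generate an answer with GPT-4."""
    hits = retrieve(query, index, chunks)
    context = "\n\n".join(hit["chunk"].page_content for hit in hits)
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are an expert assistant."},
            {"role": "user", "content": PROMPT.format(
                retrieved_context=context, user_query=query)},
        ],
    )
    return resp.choices[0].message.content
```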
Key Components
Backend: Python-based FastAPI server (src/rt_rag/api_main.py); a minimal endpoint sketch follows this list
RAG Logic: Modular pipeline logic encapsulated in rag_assistant.py
Vector Storage: FAISS for fast retrieval with potential for other backends
Frontend: Optional React interface for interacting with the assistant
Containerization: Docker and Docker Compose support for deployment
Documentation: Sphinx-based, with ReadTheDocs compatibility
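The endpoint sketch below is hypothetical: the route name and request model are illustrative, and the actual routes live in src/rt_rag/api_main.py.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="RT-RAG Assistant")


class QueryRequest(BaseModel):
    question: str


@app.post("/query")  # hypothetical route; see api_main.py for the real ones
def query_endpoint(req: QueryRequest) -> dict:
    # Delegates to the pipeline, e.g. answer() from the generation sketch,
    # with a module-level index and chunk list prepared at startup.
    return {"answer": answer(req.question, index, chunks)}
```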
The project is organized into clear folders for data, source code, documentation, and testing. It follows best practices including pre-commit hooks, environment isolation (.env), and dependency management via requirements/.
RT-RAG/
├── src/
│   └── rt_rag/
│       ├── api_main.py
│       ├── rag_assistant.py
│       ├── chunking/
│       ├── embeddings/
│       ├── retrieval/
│       ├── prompting/
│       └── utils/
├── tests/
├── data/
├── docker/
├── requirements/
├── docs/
└── README.md
We evaluated RT-RAG across different deployment scenarios, testing:
Scalability: Verified ingestion and retrieval across large corpora (e.g., 1K+ PDF documents).
Latency: Measured response times of ~300ms for document queries under local hosting.
Extensibility: Successfully plugged in alternative embedding models (e.g., HuggingFace-based).
Deployment Flexibility: Docker Compose stacks deployed successfully in cloud environments with minimal configuration.
The system passed all unit and integration tests, with over 85% code coverage.
RT-RAG delivers:
Fast Document Querying: High recall using FAISS + OpenAI embeddings
Smooth Developer Experience: With detailed logging, environment templates, and Dockerized deployment
Strong Performance: Efficient for both local and cloud-hosted workflows
Customizability: Easy to swap in alternative LLMs, vector stores, and file loaders
Compared to ad-hoc RAG scripts or monolithic implementations, RT-RAG cuts development time by providing a production-grade, extensible boilerplate.
RT-RAG provides a robust, end-to-end solution for building document-aware AI assistants using Retrieval-Augmented Generation. Its modular design, production-friendly tooling, and API-first interface make it suitable for teams looking to incorporate LLM-powered document interaction with minimal friction.
End-to-end RAG assistant with modular architecture
Native FAISS and OpenAI support
Production-ready deployment via FastAPI and Docker
Extensible to multiple document types and LLMs
Upcoming improvements include:
Support for hybrid search (BM25 + vector similarity)
Integration with open-source LLMs for local inference
Document annotation and user feedback loop for answer quality
Real-time ingestion and incremental vector updates