This work presents ready-tensor-agentic-program, a minimal yet complete Retrieval-Augmented Generation (RAG) system designed for instructional, research, and rapid-prototyping purposes. The project demonstrates the essential components of an end-to-end RAG pipeline, including document loading, vector indexing, semantic retrieval, prompt construction, and multi-provider LLM integration with fallback. By preserving simplicity and modularity, the system provides a reproducible scaffold that developers can extend toward production-level applications. This paper details the architecture, methodology, experiments, and early results from running the system in a controlled environment.
Retrieval-Augmented Generation (RAG) is a widely adopted technique for enabling Large Language Models (LLMs) to answer queries using external knowledge sources. However, many RAG frameworks are too complex for beginners or too specialized for demonstration environments.
ready-tensor-agentic-program aims to fill this gap by providing a minimal RAG skeleton that remains:
simple enough for newcomers,
modular enough for engineers,
extensible enough for production experimentation.
The system includes:
local document ingestion,
vector database indexing via a lightweight abstraction,
retrieval of relevant text chunks,
deterministic LLM query execution through a fallback chain spanning OpenAI, Groq, and Google Gemini,
a compact CLI for interactive querying.
This project serves as an educational tool for understanding RAG mechanics while enabling experimentation with multiple LLM providers using a uniform interface.
System Architecture Overview
The framework follows a classical RAG pipeline consisting of:
Document Loading
Plain text files in data/*.txt are parsed and fed into the system using the function load_documents().
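The loader can be as small as a single function. The following sketch is illustrative rather than the project's exact code; it assumes the corpus is a flat data/ directory of UTF-8 text files:

    from pathlib import Path

    def load_documents(data_dir: str = "data") -> list[str]:
        """Read every plain-text file in data_dir and return its contents as strings."""
        docs = []
        for path in sorted(Path(data_dir).glob("*.txt")):  # non-.txt files are skipped
            docs.append(path.read_text(encoding="utf-8"))
        return docs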
Vector Database Indexing
A custom abstraction vectordb.VectorDB provides two essential methods:
add_documents(docs)
search(query, k)
This keeps the pipeline compatible with ChromaDB or any custom backend; a minimal sketch follows.
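A thin wrapper around ChromaDB is enough to satisfy this interface. The class below is a sketch under that assumption; the collection name, id scheme, and default k are illustrative rather than the project's exact implementation:

    import chromadb

    class VectorDB:
        """Minimal vector store backed by an in-memory ChromaDB collection."""

        def __init__(self, name: str = "docs"):
            self._collection = chromadb.Client().get_or_create_collection(name)

        def add_documents(self, docs: list[str]) -> None:
            # ChromaDB requires one id per document; positional ids suffice here.
            ids = [f"doc-{i}" for i in range(len(docs))]
            self._collection.add(documents=docs, ids=ids)

        def search(self, query: str, k: int = 4) -> list[str]:
            # query() returns one result list per query text; flatten to plain strings.
            result = self._collection.query(query_texts=[query], n_results=k)
            return result["documents"][0]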
Retrieval
At inference time, the assistant queries the vector DB to retrieve the most relevant documents. These serve as the context block supplied to the LLM.
Prompt Construction
A LangChain ChatPromptTemplate incorporates:
{context} (retrieved chunks),
{question} (user query).
The template encourages grounded, context-aware answers.
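A template in this spirit can be declared directly with LangChain; the exact wording used by the project may differ from this sketch:

    from langchain_core.prompts import ChatPromptTemplate

    prompt = ChatPromptTemplate.from_messages([
        ("system",
         "Answer the question using only the provided context. "
         "If the context does not contain the answer, say so."),
        ("human", "Context:\n{context}\n\nQuestion: {question}"),
    ])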
LLM Fallback Layer
The system initializes LLMs in strict priority order:
OpenAI
Groq
Google Gemini
Each operates at temperature=0.0 to ensure deterministic behavior during testing and instruction.
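One way to express this priority order is to return the first provider whose API key is present in the environment. The function below is an assumption about how the initialization might look; the model names are illustrative defaults, not values mandated by the project:

    import os

    def init_llm(temperature: float = 0.0):
        """Return the first available chat model in strict priority order."""
        if os.getenv("OPENAI_API_KEY"):
            from langchain_openai import ChatOpenAI
            return ChatOpenAI(model="gpt-4o-mini", temperature=temperature)
        if os.getenv("GROQ_API_KEY"):
            from langchain_groq import ChatGroq
            return ChatGroq(model="llama-3.1-8b-instant", temperature=temperature)
        if os.getenv("GOOGLE_API_KEY"):
            from langchain_google_genai import ChatGoogleGenerativeAI
            return ChatGoogleGenerativeAI(model="gemini-1.5-flash", temperature=temperature)
        raise RuntimeError("No supported LLM provider key found in the environment.")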
Generation Pipeline
The prompt flows through:
prompt → selected LLM → StrOutputParser()
This modular chain produces a clean text response.
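In LangChain Expression Language the composition is a single pipe expression. The snippet below assumes the prompt and init_llm() sketches from the previous subsections:

    from langchain_core.output_parsers import StrOutputParser

    chain = prompt | init_llm() | StrOutputParser()
    answer = chain.invoke({"context": "retrieved chunks go here",
                           "question": "What does intro.txt say?"})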
CLI Interaction
A minimal REPL allows the user to enter natural-language questions and receive grounded responses.
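A loop of a few lines is sufficient. The sketch below reuses the VectorDB wrapper and chain defined above and is illustrative rather than the project's actual CLI code:

    def repl(db, chain, k: int = 4):
        """Minimal read-eval-print loop over the RAG pipeline."""
        while True:
            question = input("ask> ").strip()
            if question.lower() in {"", "exit", "quit"}:
                break
            context = "\n\n".join(db.search(question, k=k))  # top-k chunks as context
            print(chain.invoke({"context": context, "question": question}))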
To validate the pipeline, we designed a simple experimental setup:
Platform: Windows PowerShell
Dependencies: installed via requirements.txt
Sample input documents: placed in data/ (e.g., intro.txt)
Environment keys: at least one of OpenAI, Groq, or Google Gemini provided via .env (a sample layout is sketched below)
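For reference, a minimal .env could look like the sketch below; the variable names are the ones conventionally read by the corresponding LangChain integrations, and only one of the three keys is required:

    OPENAI_API_KEY=your-openai-key
    GROQ_API_KEY=your-groq-key
    GOOGLE_API_KEY=your-google-key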
System initialization loads sample documents.
VectorDB indexes all available content.
The REPL interface is launched.
Users pose questions referencing document content.
The assistant:
retrieves the top-k relevant text chunks,
assembles a prompt,
queries the active LLM,
returns a grounded answer (see the sketch after this list).
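Taken together, one non-interactive pass over this procedure can be scripted in a few lines, reusing the load_documents, VectorDB, prompt, and init_llm sketches introduced earlier:

    from langchain_core.output_parsers import StrOutputParser

    docs = load_documents("data")                    # 1. load the sample documents
    db = VectorDB()
    db.add_documents(docs)                           # 2. index all available content
    chain = prompt | init_llm() | StrOutputParser()  # 3. build the generation chain

    question = "What facts are contained in intro.txt?"
    context = "\n\n".join(db.search(question, k=4))  # 4. retrieve top-k chunks
    print(chain.invoke({"context": context, "question": question}))  # 5. grounded answer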
Test Questions
Example queries included:
“What facts are contained in intro.txt?”
“Explain the key ideas in the loaded documents.”
“Summarize the information relevant to X topic found in data.”
These tests verify retrieval, context injection, and LLM grounding.
Initial experimentation demonstrates the following outcomes:
Correct LLM Fallback Behavior
When an OpenAI key is available, OpenAI becomes the active provider.
Removing the key triggers a seamless fallback to Groq, then Gemini.
Successful Document Ingestion
Text files are consistently detected, loaded, and indexed.
The system correctly ignores non-.txt files.
Accurate Semantic Retrieval
Provided the VectorDB backend returns reasonable embeddings, retrieval yields coherent context chunks aligned with user queries.
Deterministic Generation
Setting temperature=0.0 makes repeated queries produce effectively identical outputs, which is ideal for teaching and debugging.
Correct Prompt Assembly
The assistant composes the context and question into a well-structured prompt consumed by the LLM chain.
User Interaction via CLI
The interactive REPL handles user queries end-to-end once the query()/invoke() mismatch noted below is resolved.
Pending Observations
The CLI call mismatch (query() vs. invoke()) needs correction for full usability.
Retrieval quality depends on the shape of the VectorDB search results (plain strings vs. dicts); a normalization sketch follows.
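The helper below could absorb either result shape until the interface is settled; the dict field names are guesses, not the project's confirmed schema:

    def normalize_hits(hits) -> list[str]:
        """Coerce search results into plain strings, whether the backend returns
        raw strings or dicts carrying the text under some field."""
        chunks = []
        for hit in hits:
            if isinstance(hit, str):
                chunks.append(hit)
            elif isinstance(hit, dict):
                chunks.append(hit.get("text") or hit.get("content") or hit.get("page_content") or "")
            else:
                chunks.append(str(hit))
        return chunks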
The ready-tensor-agentic-program provides a clear, minimalistic, and modular blueprint for implementing RAG systems. Its simplicity makes it ideal for:
instructional sessions,
workshops,
debugging demonstrations,
comparative LLM research,
rapid prototyping of retrieval-based assistants.
By combining deterministic LLM behavior, flexible vector indexing, multi-provider fallback, and a clean pipeline structure, this project lays a strong foundation for scalable RAG development. Future extensions may include more sophisticated text splitting, persistent indexing, benchmarking modules, or a full web interface—without compromising the core minimalist philosophy.