This project presents a Retrieval-Augmented Generation (RAG)-based assistant designed to explore and answer questions from Ready Tensor publications with accuracy and reliability. The system integrates document loaders, text splitting, embeddings, vector stores, and large language models (LLMs) into a modular pipeline that allows seamless ingestion and retrieval of knowledge from research publications. By combining semantic search with generative language models, the assistant retrieves the most relevant passages from ingested publications and formulates precise, context-aware answers while avoiding hallucination. The architecture supports multiple providers (Groq, Google Gemini) and embedding backends (HuggingFace, Google Generative AI), offering flexibility and scalability. A Gradio-based web interface and command-line interface (CLI) are provided for intuitive interaction. This tool enables researchers, practitioners, and students to efficiently query complex technical documents, ensuring that information from Ready Tensor publications becomes more accessible, interpretable, and actionable.
The methodology for developing the Ready Tensor Publication Explorer – RAG Chatbot follows a modular Retrieval-Augmented Generation (RAG) architecture. The workflow consists of six main phases: data ingestion, text preprocessing, embedding generation, vector storage and retrieval, answer generation, and user interaction. Each phase is implemented as a separate module in the system, ensuring scalability, maintainability, and extensibility.
Publications are collected in multiple formats, including PDF, TXT, Markdown, and JSON. A flexible loader system was implemented using LangChain's document loaders. Each loader handles its respective format, with a fallback mechanism for plain text.
Code snippet:
```python
# src/loaders.py
from pathlib import Path
from langchain_community.document_loaders import PyPDFLoader, TextLoader, UnstructuredMarkdownLoader
from langchain.schema import Document
import json


def load_documents_from_folder(folder: str):
    folder = Path(folder)
    docs = []
    for p in folder.rglob("*"):
        if p.is_dir():
            continue
        suffix = p.suffix.lower()
        if suffix == ".pdf":
            docs += PyPDFLoader(str(p)).load()
        elif suffix in {".md", ".markdown"}:
            docs += UnstructuredMarkdownLoader(str(p)).load()
        elif suffix == ".txt":
            docs += TextLoader(str(p), encoding="utf-8").load()
        elif suffix == ".json":
            with open(p, "r", encoding="utf-8") as f:
                data = json.load(f)
            if isinstance(data, list):
                for item in data:
                    content = item.get("publication_description", "")
                    if content:
                        docs.append(Document(page_content=content, metadata=item))
        else:
            text = p.read_text(encoding="utf-8", errors="ignore")
            docs.append(Document(page_content=text, metadata={"source": str(p)}))
    return docs
```
This modular approach allows the system to easily integrate new document formats in the future.
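As an illustration of that extensibility, the sketch below shows how HTML support could be added with one more branch in the suffix check. It is a hypothetical extension, not part of the current loaders, and assumes the `unstructured` dependency required by `UnstructuredHTMLLoader` is installed.

```python
# Hypothetical extension of src/loaders.py (illustrative only): handle HTML files.
from langchain_community.document_loaders import UnstructuredHTMLLoader


def load_html(path: str):
    """Load a single HTML file into LangChain Document objects."""
    return UnstructuredHTMLLoader(path).load()

# Inside load_documents_from_folder, one extra branch would be enough:
# elif suffix in {".html", ".htm"}:
#     docs += load_html(str(p))
```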
Documents are often long and exceed LLM token limits. To address this, documents are segmented into overlapping chunks while preserving semantic continuity and metadata.
Code snippet:
```python
# src/splitter.py
from langchain_text_splitters import RecursiveCharacterTextSplitter
from .config import settings


def create_splitter():
    return RecursiveCharacterTextSplitter(
        chunk_size=settings.chunk_size,        # e.g., 1000
        chunk_overlap=settings.chunk_overlap,  # e.g., 200
        separators=["\n\n", "\n", ". ", "? ", "! ", " "],
    )
```
This ensures context is preserved across chunk boundaries, enabling more accurate retrieval.
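For reference, a typical ingestion pass might combine the loader and splitter as sketched below; the folder path is a placeholder, and the exact wiring in the project's ingestion script may differ.

```python
# Illustrative ingestion pass (assumed wiring; the folder path is a placeholder).
from src.loaders import load_documents_from_folder
from src.splitter import create_splitter

raw_docs = load_documents_from_folder("data/publications")  # hypothetical data folder
splitter = create_splitter()
chunks = splitter.split_documents(raw_docs)  # overlapping chunks, metadata preserved
print(f"Split {len(raw_docs)} documents into {len(chunks)} chunks")
```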
Each chunk is converted into a dense vector representation using sentence embedding models, enabling semantic search.
Code snippet:
```python
# src/embeddings.py
from langchain_huggingface import HuggingFaceEmbeddings


def get_embedder():
    return HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
```
The system supports multiple embedding backends, including HuggingFace and Google Generative AI, providing flexibility for performance and cost optimization.
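A provider switch could be exposed along the lines of the sketch below. This is illustrative rather than the project's exact `get_embedder`, and the Google backend assumes the `langchain-google-genai` package and a `GOOGLE_API_KEY` environment variable.

```python
# Illustrative provider switch (not the project's exact implementation).
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_google_genai import GoogleGenerativeAIEmbeddings


def get_embedder(provider: str = "huggingface"):
    if provider == "google":
        # Requires langchain-google-genai and GOOGLE_API_KEY in the environment
        return GoogleGenerativeAIEmbeddings(model="models/embedding-001")
    return HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
```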
The embeddings are stored in a vector database (Chroma or FAISS). The retriever fetches top-k relevant chunks at query time.
Code snippet:
```python
# src/vectordb.py
from langchain_chroma import Chroma


class PersistedVectorStore:
    # self.emb (embedding function) and self.persist_directory are set in
    # __init__, omitted here for brevity.

    def create_or_update(self, documents):
        self.store = Chroma.from_documents(
            documents,
            embedding=self.emb,
            persist_directory=self.persist_directory,
        )
        self.store.persist()

    def as_retriever(self, **kwargs):
        return self.store.as_retriever(search_kwargs=kwargs)
```
This enables efficient similarity search, ensuring retrieved chunks are contextually relevant for answering queries.
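For illustration, a retrieval call against a populated store might look like the sketch below; `store` is assumed to be an already-built `PersistedVectorStore`, and top-k = 4 is an assumed value rather than the project's configured setting.

```python
# Illustrative retrieval (assumes `store` is a PersistedVectorStore with an existing index).
retriever = store.as_retriever(k=4)  # kwargs are forwarded as Chroma search_kwargs (top-k = 4)
relevant_chunks = retriever.invoke("Which embedding model does the assistant use?")
for doc in relevant_chunks:
    print(doc.metadata.get("source", "unknown"), "->", doc.page_content[:80])
```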
The retrieved chunks are combined with the user's query and passed to a large language model (LLM), such as Groq LLaMA or Google Gemini, via a prompt template that enforces fidelity to the source.
Code snippet:
```python
# src/rag_chain.py
from langchain.chains import RetrievalQA
from .llm import get_chat_llm
from .retriever import get_retriever
from .prompts import QA_SYSTEM


def build_rag_chain(persist_directory="VectorStore"):
    llm = get_chat_llm()
    retriever = get_retriever(persist_directory=persist_directory)
    qa = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=retriever,
        return_source_documents=True,   # Return both answer + source documents
        # return_source_documents=False,  # Return only the answer
        chain_type_kwargs={"prompt": QA_SYSTEM},
    )
    return qa
```
The system prompt constrains the LLM to answer only from the retrieved content and to decline when the context does not contain the answer, reducing hallucination.
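The exact wording of `QA_SYSTEM` is project-specific, but a grounding prompt of this kind could be defined roughly as in the sketch below (illustrative only, with `{context}` and `{question}` matching the variables the "stuff" chain expects).

```python
# src/prompts.py (sketch only; the project's actual wording may differ).
from langchain.prompts import PromptTemplate

QA_SYSTEM = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "You are an assistant for Ready Tensor publications. Answer the question "
        "using ONLY the context below. If the answer is not in the context, say "
        "you don't know instead of guessing.\n\n"
        "Context:\n{context}\n\n"
        "Question: {question}\n"
        "Answer:"
    ),
)
```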
Two interfaces are provided for user interaction: a Gradio-based web application and a command-line interface (CLI). The Gradio entry point is shown below.
Code snippet:
```python
# app.py
from src.ui.gradio_app import build_and_launch

if __name__ == "__main__":
    build_and_launch()
```
This dual-interface design enhances usability across different audiences.
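The CLI entry point is not shown above; a minimal loop over the same RAG chain could look like the sketch below. This is illustrative and assumes a `cli.py` module, which is not necessarily how the project structures its CLI.

```python
# cli.py (sketch): a minimal command-line loop over the same RAG chain.
from src.rag_chain import build_rag_chain


def main():
    qa = build_rag_chain(persist_directory="VectorStore")
    while True:
        query = input("\nAsk about a publication (or 'exit'): ").strip()
        if query.lower() in {"exit", "quit"}:
            break
        result = qa.invoke({"query": query})
        print("\nAnswer:", result["result"])
        for doc in result.get("source_documents", []):
            print("  source:", doc.metadata.get("source", "unknown"))


if __name__ == "__main__":
    main()
```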
The Ready Tensor Publication Explorer – RAG Chatbot was tested on a diverse set of publications in PDF, Markdown, TXT, and JSON formats. The evaluation focused on the system's accuracy, responsiveness, and usability.
The chosen splitter settings (chunk_size=1000, chunk_overlap=200) provided a balance between context retention and performance, ensuring accurate responses without excessive computation. The system provides two interfaces for users: a Gradio web interface and a command-line interface (CLI).
Screenshot of the RAG Chatbot Interface:
Figure 1: Screenshot of the Ready Tensor Publication Explorer web interface showing a sample query and response. The interface displays document preview, chatbot conversation, and sources of retrieved information.
| Challenge | Solution |
|---|---|
| Large documents exceeding token limits | Implemented hierarchical chunking with overlapping windows to preserve context |
| Hallucination by LLM | Enforced strict system prompts limiting responses to retrieved content |
| Diverse file formats | Modular loaders and JSON parsing handled different formats robustly |
| Vector store persistence issues | Enabled Chroma persistence and ensured re-ingestion when needed |
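For the persistence issue, the re-ingestion check could be as simple as the sketch below. This is an assumed approach rather than the project's exact logic, relying on the fact that Chroma writes its index files into `persist_directory`.

```python
# Illustrative re-ingestion check (assumed logic; the project's actual check may differ).
from pathlib import Path


def needs_ingestion(persist_directory: str = "VectorStore") -> bool:
    """Return True when no persisted Chroma index is found on disk."""
    path = Path(persist_directory)
    return not path.exists() or not any(path.iterdir())

# Usage: rebuild the index only when no persisted store exists.
# if needs_ingestion():
#     store.create_or_update(chunks)
```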
The Ready Tensor Publication Explorer – RAG Chatbot effectively combines document retrieval and LLM-based synthesis to provide accurate, context-aware, and interactive exploration of publications.
Key points:

- A modular RAG pipeline covering ingestion, chunking, embedding, vector storage, retrieval, and generation.
- Support for multiple LLM providers (Groq, Google Gemini) and embedding backends (HuggingFace, Google Generative AI).
- Source-grounded prompting that limits answers to retrieved publication content.
- Dual Gradio web and CLI interfaces for different audiences.
This system provides a robust, user-friendly, and extensible solution for knowledge discovery and research applications.