A Retrieval-Augmented Generation (RAG) assistant that answers user queries using a collection of local documents. It retrieves relevant text chunks from your dataset and generates answers using LLMs.
This RAG Assistant is designed for:
- .txt documents using embeddings
- chat-completion LLMs (e.g., gpt-3.5-turbo, gpt-4o-mini)

It is not yet optimized for other document formats.

The workflow:
```
User Query
    │
    ▼
Retrieve relevant document chunks
    │
    ▼
Combine context
    │
    ▼
Send to LLM (OpenAI / Groq / Google Gemini)
    │
    ▼
Return generated answer
```
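The steps above can be sketched in Python. The helper names (`retrieve_chunks`, `answer_query`) and the `llm_client.complete` call are illustrative, not the project's actual API; the retrieval call assumes a Chroma-style collection:

```python
def retrieve_chunks(collection, query, n_results=4):
    """Fetch the most relevant chunks (Chroma-style query interface)."""
    result = collection.query(query_texts=[query], n_results=n_results)
    return result["documents"][0]

def answer_query(collection, llm_client, query):
    # 1. Retrieve relevant document chunks
    chunks = retrieve_chunks(collection, query)
    # 2. Combine them into a single context block
    context = "\n\n".join(chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    # 3. Send to the LLM and return the generated answer
    return llm_client.complete(prompt)
```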
The assistant loads .txt files from a content directory and retrieves the top n_results relevant chunks per query.

Chunking breaks large text files into smaller pieces that can be embedded and retrieved efficiently: too large and context becomes diluted; too small and meaning may be lost.
In this project, we use fixed-size chunking with overlap, where each document is split into chunks of ~300–400 tokens with a 20–30% overlap between chunks. This preserves context across boundaries and ensures that important sentences aren't split mid-thought.
The sliding overlap keeps adjacent chunks partially redundant, so a sentence near a chunk boundary stays retrievable from either side.
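A minimal sketch of this chunker, using whitespace-separated words as a rough stand-in for tokens (an assumption for illustration; the defaults mirror the ~350-token size and ~25% overlap figures above):

```python
def chunk_text(text, chunk_size=350, overlap=90):
    """Split text into fixed-size chunks with a sliding overlap.
    Words approximate tokens here; a real tokenizer would be more exact."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window slides each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last window already reached the end of the text
    return chunks
```

Each chunk repeats the last `overlap` words of the previous one, which is what keeps boundary sentences intact in at least one chunk.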
| Backend | Free Tier Options | Notes |
|---|---|---|
| OpenAI GPT | Free trial credits | 429 errors if quota exceeded |
| Groq LLaMA 3.1 | Free trial / local testing | Supports instant and batch inference |
| Google Gemini | Free-tier API access | May require account setup and quota limits |
Recommended: manage quotas carefully or use local Groq models for higher throughput.
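One way to soften the 429 / quota issues noted in the table is exponential backoff around the LLM call. This is a sketch rather than drop-in code: `QuotaError` stands in for the backend-specific rate-limit exception (e.g. OpenAI's rate-limit error), and `send` is any callable that performs the request:

```python
import time

class QuotaError(Exception):
    """Stand-in for a backend's rate-limit error (e.g. HTTP 429)."""

def call_with_backoff(send, prompt, retries=3, base_delay=2.0):
    """Call send(prompt), waiting twice as long after each quota error."""
    for attempt in range(retries):
        try:
            return send(prompt)
        except QuotaError:
            if attempt == retries - 1:
                raise  # out of retries; surface the error to the caller
            time.sleep(base_delay * (2 ** attempt))
```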
```python
from rag_assistant import RAGAssistant

rag = RAGAssistant()
while True:
    query = input("Enter a question or 'quit' to exit: ")
    if query.lower() == "quit":
        break
    result = rag.query(query)
    print("Answer:", result["answer"])
```
```
Enter a question or 'quit' to exit: What is an asteroid?
Answer: An asteroid is a small rocky body orbiting the Sun, primarily found in the asteroid belt between Mars and Jupiter.
```
For trivial queries (e.g., "test"), consider skipping retrieval or returning a generic response. Tune n_results to limit irrelevant chunks.

Dependencies:
- openai (for the OpenAI API)
- chromadb or another vector database backend (for Groq / embeddings)
- sentence-transformers (for embeddings)
- python-dotenv (for .env loading)
- langchain (optional, for prompt chaining)

Install dependencies:
```bash
pip install openai chromadb sentence-transformers python-dotenv langchain
```
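Since python-dotenv is listed for .env loading, a typical .env might look like the following. The key names and values are placeholders, not confirmed project settings; set only the backends you use, and keep the file out of version control:

```
OPENAI_API_KEY=your-openai-key
GROQ_API_KEY=your-groq-key
GOOGLE_API_KEY=your-gemini-key
```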
MIT License