This project implements a 100% local Retrieval-Augmented Generation (RAG) assistant. It is a Python application designed to answer user questions based on a private collection of text documents.
The core principle is privacy and offline capability. Unlike services that send data to external APIs, this solution runs all components, including the Large Language Model (LLM) and embedding models, entirely on the user's local machine. This ensures that sensitive documents can be processed and queried without ever leaving the user's control.
The system is designed to:
Ingest a directory of user-provided .txt files.
Process and "learn" the information by converting text into vector embeddings.
Store these embeddings in a persistent local vector database.
Query the database to find relevant context for a user's question.
Generate a natural language answer based only on that retrieved context.
The scope is limited to text-based documents (.txt) and answering questions based on the ingested knowledge. It does not access the internet or use any pre-existing knowledge from the LLM's original training.
The application follows a classic RAG pipeline, which can be broken into two main phases: Indexing and Retrieval & Generation.
This project is built on a stack of modern, open-source AI and Python libraries:
LLM (Generation): Qwen/Qwen2.5-3B-Instruct
A powerful, instruction-tuned 3-billion-parameter model from Alibaba.
It is loaded in 4-bit precision using bitsandbytes to run efficiently on consumer GPUs.
Embedding Model: sentence-transformers/all-MiniLM-L6-v2
Vector Database: ChromaDB
Orchestration: LangChain & langchain-huggingface
Local Model Loading: Hugging Face transformers
Before the user can ask a question, the documents must be indexed. This process runs once when the application starts and new documents are found.
Load Documents: The load_documents() function in app.py scans the data/ directory for all .txt files.
Chunk Text: Each document is passed to the chunk_text() function in vectordb.py. This uses RecursiveCharacterTextSplitter to break the text into smaller, overlapping chunks (approx. 1000 characters). This is critical for RAG, as it ensures the model receives small, relevant pieces of context.
Create Embeddings: The list of text chunks is processed by the SentenceTransformer (all-MiniLM-L6-v2) model, which outputs a 384-dimensional vector for each chunk.
Store in Vector DB: The chunks (as text), their embeddings (vectors), and metadata (like the source filename) are stored in the ChromaDB collection.
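As a quick standalone check of the embedding step above, the following minimal sketch shows the model listed in the stack producing 384-dimensional vectors (the sample chunk strings are illustrative, not taken from the project's data):

# Minimal sketch: all-MiniLM-L6-v2 outputs a 384-dimensional vector per chunk
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
chunks = [
    "Artificial Intelligence is a branch of computer science.",
    "Deep learning uses multi-layer neural networks.",
]
embeddings = model.encode(chunks)
print(embeddings.shape)  # (2, 384)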
This is the main application loop when a user asks a question.
User Query: The user inputs a question (e.g., "What is AI?").
Create Query Embedding: The user's question is passed through the same SentenceTransformer model to create a vector.
Similarity Search: ChromaDB is queried using this vector. It performs a semantic search and returns the top k text chunks (default k=3) from the database that are most semantically similar to the question.
Context Injection: These retrieved chunks are formatted and injected into a prompt template.
Generate Answer: The complete prompt (containing the context and the user's question) is sent to the local Qwen LLM.
Return Response: The LLM generates an answer, which is then post-processed to remove any artifacts and presented to the user.
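Conceptually, the similarity search in step 3 ranks every stored chunk vector by how close it is to the query vector and keeps the top k. The toy sketch below illustrates the idea with cosine similarity; it is for intuition only and is not ChromaDB's internal implementation, which uses its own index and distance metric:

# Toy illustration of top-k vector search (not ChromaDB's internals)
import numpy as np

def top_k(query_vec: np.ndarray, chunk_vecs: np.ndarray, k: int = 3) -> np.ndarray:
    # Cosine similarity between the query and every stored chunk vector
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    # Indices of the k most similar chunks, best first
    return np.argsort(sims)[::-1][:k]

rng = np.random.default_rng(0)
chunk_vecs = rng.normal(size=(10, 384))  # stand-in for stored chunk embeddings
query_vec = rng.normal(size=384)         # stand-in for the embedded question
print(top_k(query_vec, chunk_vecs, k=3))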
The project is structured into two main files for modularity: vectordb.py (handles all database logic) and app.py (handles the application logic and LLM).
This file acts as a wrapper for all interactions with ChromaDB and the embedding model.
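The constructor itself is not reproduced in this document; the sketch below shows how such a wrapper is typically initialized. The class name, collection name, persistence path, chunk_overlap value, and import paths are assumptions for illustration, not values taken from the project:

# Hypothetical sketch of the vectordb.py setup (names and parameters are assumptions)
import chromadb
from sentence_transformers import SentenceTransformer
from langchain_text_splitters import RecursiveCharacterTextSplitter

class VectorDB:
    def __init__(self, persist_dir: str = "chroma_db"):
        # Persistent local ChromaDB client and collection
        self.client = chromadb.PersistentClient(path=persist_dir)
        self.collection = self.client.get_or_create_collection(name="documents")

        # Local embedding model used for both documents and queries
        self.embedding_model = SentenceTransformer(
            "sentence-transformers/all-MiniLM-L6-v2"
        )

        # Splitter producing ~1000-character overlapping chunks
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200,  # assumed overlap value
        )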
The chunk_text method uses LangChain's RecursiveCharacterTextSplitter to intelligently divide documents.
# From src/vectordb.py
def chunk_text(self, text: str) -> List[str]:
    """
    Split text into smaller chunks for better retrieval using LangChain.
    """
    chunks = self.text_splitter.split_text(text)
    return chunks
The add_documents method orchestrates the entire indexing pipeline: chunking, embedding, and adding to the database.
# From src/vectordb.py
def add_documents(self, documents: List[Dict[str, Any]]) -> None:
    # ... (loop through documents) ...
    for doc_idx, doc in enumerate(documents):
        # ...
        # 1. Chunk the document
        chunks = self.chunk_text(content)
        for chunk_idx, chunk in enumerate(chunks):
            all_chunks.append(chunk)
            # ... (create metadata and unique IDs) ...

    # 4. Create embeddings for all chunks in a single batch
    print(f"Creating embeddings for {len(all_chunks)} chunks...")
    embeddings = self.embedding_model.encode(all_chunks, show_progress_bar=True)

    # 5. Store in ChromaDB
    self.collection.add(
        embeddings=embeddings,
        documents=all_chunks,
        metadatas=all_metadatas,
        ids=all_ids
    )
The search method takes a text query, embeds it, and queries the ChromaDB collection.
# From src/vectordb.py
def search(self, query: str, n_results: int = 5) -> Dict[str, Any]:
    # 1. Create embedding for the query
    query_embedding = self.embedding_model.encode([query]).tolist()

    # 2. Search the collection
    results = self.collection.query(
        query_embeddings=query_embedding,
        n_results=n_results
    )
    # ... (format and return results) ...
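A possible usage of this method, assuming an instance named db (the variable name and chunk preview are illustrative; the "documents" key matches how the query method below reads the results):

# Illustrative usage of the search wrapper
db = VectorDB()
results = db.search("What is AI?", n_results=3)
for i, chunk in enumerate(results.get("documents", []), start=1):
    print(f"[{i}] {chunk[:80]}...")  # preview each retrieved chunk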
This file initializes the LLM and orchestrates the user-facing query pipeline.
The _initialize_llm method is crucial. It loads the Qwen model from Hugging Face, applies 4-bit quantization for GPU efficiency, and wraps it in a LangChain-compatible HuggingFacePipeline object.
# From src/app.py
def _initialize_llm(self):
    model_id = os.getenv("LLM_MODEL_ID", "Qwen/Qwen2.5-3B-Instruct")

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_quant_type="nf4"
    )

    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto"  # Automatically uses GPU if available
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    text_gen_pipeline = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=500,
        return_full_text=False  # Prevents re-printing the prompt
        # ... (other parameters) ...
    )

    return HuggingFacePipeline(pipeline=text_gen_pipeline)
A specific prompt template instructs the LLM to use only the provided context, which is the key to preventing "hallucinations" or answers drawn from the model's own training data.
# From src/app.py
template = """
You are a helpful AI assistant. You must answer the user's question based *only* on the provided context.
If the context does not contain the answer, you must state that you cannot find the information in the provided documents.
Do not use any external knowledge.

CONTEXT:
{context}

QUESTION:
{question}

ANSWER:
"""
self.prompt_template = ChatPromptTemplate.from_template(template)
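The query method below invokes self.chain, whose construction is not shown in this excerpt. A plausible LCEL composition, sketched under the assumption that the HuggingFacePipeline returned by _initialize_llm is stored as self.llm, would pipe the prompt template into the LLM and parse the output to a plain string:

# Hypothetical sketch: how self.chain might be assembled (assumes self.llm holds the HuggingFacePipeline)
from langchain_core.output_parsers import StrOutputParser

self.chain = self.prompt_template | self.llm | StrOutputParser()
# invoke() then accepts {"context": ..., "question": ...} and returns a string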
The query method ties everything together: search, context injection, generation, and response cleaning.
# From src/app.py
def query(self, question: str, n_results: int = 3) -> Dict[str, Any]:
    # 1. Search for relevant chunks
    search_results = self.vector_db.search(question, n_results=n_results)
    retrieved_docs = search_results.get("documents", [])
    # ... (handle "no documents found" case) ...

    # 2. Combine chunks into a single context string
    context = "\n\n---\n\n".join(retrieved_docs)

    # 3. Generate response using LLM + context
    answer_raw = self.chain.invoke({
        "context": context,
        "question": question
    })

    # FIX: Split the response and take the first part
    answer = answer_raw.split("\n\nAssistant:")[0].strip()

    # 4. Return structured results
    return {
        "answer": answer,
        "context": retrieved_docs
    }
OS: Ubuntu (tested) or other Linux/WSL2/macOS.
Python: 3.11 (required for package compatibility).
Hardware: An NVIDIA GPU with at least 4GB of VRAM is strongly recommended for 4-bit quantization.
git clone https://github.com/wakodepranav2005-git/local_rag_project.git
cd local_rag_project
# Ensure you have Python 3.11 and its 'venv' module
sudo apt update
sudo apt install python3.11 python3.11-venv

# Create the virtual environment
python3.11 -m venv venv

# Activate the environment
source venv/bin/activate
(Your terminal prompt should now start with (venv))
# Install all required packages
pip install -r requirements.txt
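Optionally, you can verify that PyTorch can see your GPU before the first run. This quick check assumes PyTorch was installed via requirements.txt; it is not part of the project code:

# Optional GPU visibility check (not part of the project)
import torch

print(torch.cuda.is_available())          # True is needed for 4-bit GPU inference
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # name of the detected GPU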
Add Your Documents: Place all your knowledge files (as .txt files) into the data/ folder.
Run the Application: With your virtual environment active, start the assistant:
python3 src/app.py
First-Time Run: The first time you run the app, it will download the LLM and embedding models (several GB) and then index your documents. This may take a few minutes.
Ask Questions! Once initialized, you can begin asking questions.
...
Successfully added 21 chunks to the vector database.
Enter a question or 'quit' to exit: What is AI?
Processing query: 'What is AI?'
Generating answer with 3 context chunk(s)...
==================================================
🤖 AI ANSWER:
Artificial Intelligence (AI) is a branch of computer science that aims to create intelligent machines that can perform tasks that typically require human intelligence. These tasks include learning, reasoning, problem-solving, perception, and language understanding.
--- RETRIEVED CONTEXT ---
[1] Artificial Intelligence Overview...
[2] AI Ethics and Responsible Development...
[3] Deep Learning and Neural Networks...
==================================================
Enter a question or 'quit' to exit: quit
This project demonstrates a complete, private RAG pipeline running on consumer hardware. It answers questions from a private knowledge base without any external services, showing that local-first AI applications are viable.