This project implements a 100% local Retrieval-Augmented Generation (RAG) assistant. It is a Python application designed to answer user questions based on a private collection of text documents.
The core principle is privacy and offline capability. Unlike services that send data to external APIs, this solution runs all components, including the Large Language Model (LLM) and embedding models, entirely on the user's local machine. This ensures that sensitive documents can be processed and queried without ever leaving the user's control.
The system is designed to:
Ingest a directory of user-provided .txt files.
Process and "learn" the information by converting text into vector embeddings.
Store these embeddings in a persistent local vector database.
Query the database to find relevant context for a user's question.
Generate a natural language answer based only on that retrieved context.
The scope is limited to text-based documents (.txt) and answering questions based on the ingested knowledge. It does not access the internet or use any pre-existing knowledge from the LLM's original training.
The application follows a classic RAG pipeline, which can be broken into two main phases: Indexing and Retrieval & Generation.
This project is built on a stack of modern, open-source AI and Python libraries:
LLM (Generation): Qwen/Qwen2.5-3B-Instruct
A powerful, instruction-tuned 3-billion-parameter model from Alibaba.
It is loaded in 4-bit precision using bitsandbytes to run efficiently on consumer GPUs.
Embedding Model: sentence-transformers/all-MiniLM-L6-v2
Vector Database: ChromaDB
Orchestration: LangChain & langchain-huggingface
Local Model Loading: Hugging Face transformers
Before the user can ask a question, the documents must be indexed. This process runs once when the application starts and new documents are found.
Load Documents: The load_documents() function in app.py scans the data/ directory for all .txt files.
Chunk Text: Each document is passed to the chunk_text() function in vectordb.py. This uses RecursiveCharacterTextSplitter to break the text into smaller, overlapping chunks (approx. 1000 characters). This is critical for RAG, as it ensures the model receives small, relevant pieces of context.
Create Embeddings: The list of text chunks is processed by the SentenceTransformer (all-MiniLM-L6-v2) model, which outputs a 384-dimensional vector for each chunk.
Store in Vector DB: The chunks (as text), their embeddings (vectors), and metadata (like the source filename) are stored in the ChromaDB collection.
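As a quick standalone check of the embedding step above, the following minimal sketch shows the model listed in the stack producing 384-dimensional vectors (the sample chunk strings are illustrative, not taken from the project's data):

# Minimal sketch: all-MiniLM-L6-v2 outputs a 384-dimensional vector per chunk
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
chunks = [
    "Artificial Intelligence is a branch of computer science.",
    "Deep learning uses multi-layer neural networks.",
]
embeddings = model.encode(chunks)
print(embeddings.shape)  # (2, 384)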
This is the main application loop when a user asks a question.
User Query: The user inputs a question (e.g., "What is AI?").
Create Query Embedding: The user's question is passed through the same SentenceTransformer model to create a vector.
Similarity Search: ChromaDB is queried using this vector. It performs a semantic search and returns the top k text chunks (default k=3) from the database that are most semantically similar to the question.
Context Injection: These retrieved chunks are formatted and injected into a prompt template.
Generate Answer: The complete prompt (containing the context and the user's question) is sent to the local Qwen LLM.
Return Response: The LLM generates an answer, which is then post-processed to remove any artifacts and presented to the user.
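Conceptually, the similarity search in step 3 ranks every stored chunk vector by how close it is to the query vector and keeps the top k. The toy sketch below illustrates the idea with cosine similarity; it is for intuition only and is not ChromaDB's internal implementation, which uses its own index and distance metric:

# Toy illustration of top-k vector search (not ChromaDB's internals)
import numpy as np

def top_k(query_vec: np.ndarray, chunk_vecs: np.ndarray, k: int = 3) -> np.ndarray:
    # Cosine similarity between the query and every stored chunk vector
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    # Indices of the k most similar chunks, best first
    return np.argsort(sims)[::-1][:k]

rng = np.random.default_rng(0)
chunk_vecs = rng.normal(size=(10, 384))  # stand-in for stored chunk embeddings
query_vec = rng.normal(size=384)         # stand-in for the embedded question
print(top_k(query_vec, chunk_vecs, k=3))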
The project is structured into two main files for modularity: vectordb.py (handles all database logic) and app.py (handles the application logic and LLM).
This file acts as a wrapper for all interactions with ChromaDB and the embedding model.
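The constructor itself is not reproduced in this document; the sketch below shows how such a wrapper is typically initialized. The class name, collection name, persistence path, chunk_overlap value, and import paths are assumptions for illustration, not values taken from the project:

# Hypothetical sketch of the vectordb.py setup (names and parameters are assumptions)
import chromadb
from sentence_transformers import SentenceTransformer
from langchain_text_splitters import RecursiveCharacterTextSplitter

class VectorDB:
    def __init__(self, persist_dir: str = "chroma_db"):
        # Persistent local ChromaDB client and collection
        self.client = chromadb.PersistentClient(path=persist_dir)
        self.collection = self.client.get_or_create_collection(name="documents")

        # Local embedding model used for both documents and queries
        self.embedding_model = SentenceTransformer(
            "sentence-transformers/all-MiniLM-L6-v2"
        )

        # Splitter producing ~1000-character overlapping chunks
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200,  # assumed overlap value
        )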
The chunk_text method uses LangChain's RecursiveCharacterTextSplitter to intelligently divide documents.
# From src/vectordb.py
def chunk_text(self, text: str) -> List[str]:
    """
    Split text into smaller chunks for better retrieval using LangChain.
    """
    chunks = self.text_splitter.split_text(text)
    return chunks
The add_documents method orchestrates the entire indexing pipeline: chunking, embedding, and adding to the database.
# From src/vectordb.py
def add_documents(self, documents: List[Dict[str, Any]]) -> None:
    # ... (loop through documents) ...
    for doc_idx, doc in enumerate(documents):
        # ...
        # 1. Chunk the document
        chunks = self.chunk_text(content)
        for chunk_idx, chunk in enumerate(chunks):
            all_chunks.append(chunk)
            # ... (create metadata and unique IDs) ...

    # 4. Create embeddings for all chunks in a single batch
    print(f"Creating embeddings for {len(all_chunks)} chunks...")
    embeddings = self.embedding_model.encode(all_chunks, show_progress_bar=True)

    # 5. Store in ChromaDB
    self.collection.add(
        embeddings=embeddings,
        documents=all_chunks,
        metadatas=all_metadatas,
        ids=all_ids
    )
The search method takes a text query, embeds it, and queries the ChromaDB collection.
# From src/vectordb.py
def search(self, query: str, n_results: int = 5) -> Dict[str, Any]:
    # 1. Create embedding for the query
    query_embedding = self.embedding_model.encode([query]).tolist()

    # 2. Search the collection
    results = self.collection.query(
        query_embeddings=query_embedding,
        n_results=n_results
    )
    # ... (format and return results) ...
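A possible usage of this method, assuming an instance named db (the variable name and chunk preview are illustrative; the "documents" key matches how the query method below reads the results):

# Illustrative usage of the search wrapper
db = VectorDB()
results = db.search("What is AI?", n_results=3)
for i, chunk in enumerate(results.get("documents", []), start=1):
    print(f"[{i}] {chunk[:80]}...")  # preview each retrieved chunk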
This file initializes the LLM and orchestrates the user-facing query pipeline.
The _initialize_llm method is crucial. It loads the Qwen model from Hugging Face, applies 4-bit quantization for GPU efficiency, and wraps it in a LangChain-compatible HuggingFacePipeline object.
# From src/app.py
def _initialize_llm(self):
    model_id = os.getenv("LLM_MODEL_ID", "Qwen/Qwen2.5-3B-Instruct")

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_quant_type="nf4"
    )

    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto"  # Automatically uses GPU if available
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    text_gen_pipeline = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=500,
        return_full_text=False  # Prevents re-printing the prompt
        # ... (other parameters) ...
    )

    return HuggingFacePipeline(pipeline=text_gen_pipeline)
A specific prompt template instructs the LLM to use only the provided context, which is the key to preventing "hallucinations" or answers drawn from the model's own training data.
# From src/app.py
template = """
You are a helpful AI assistant. You must answer the user's question based *only* on the provided context.
If the context does not contain the answer, you must state that you cannot find the information in the provided documents.
Do not use any external knowledge.

CONTEXT:
{context}

QUESTION:
{question}

ANSWER:
"""
self.prompt_template = ChatPromptTemplate.from_template(template)
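The query method below invokes self.chain, whose construction is not shown in this excerpt. A plausible LCEL composition, sketched under the assumption that the HuggingFacePipeline returned by _initialize_llm is stored as self.llm, would pipe the prompt template into the LLM and parse the output to a plain string:

# Hypothetical sketch: how self.chain might be assembled (assumes self.llm holds the HuggingFacePipeline)
from langchain_core.output_parsers import StrOutputParser

self.chain = self.prompt_template | self.llm | StrOutputParser()
# invoke() then accepts {"context": ..., "question": ...} and returns a string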
The query method ties everything together: search, context injection, generation, and response cleaning.
# From src/app.py
def query(self, question: str, n_results: int = 3) -> Dict[str, Any]:
    # 1. Search for relevant chunks
    search_results = self.vector_db.search(question, n_results=n_results)
    retrieved_docs = search_results.get("documents", [])
    # ... (handle "no documents found" case) ...

    # 2. Combine chunks into a single context string
    context = "\n\n---\n\n".join(retrieved_docs)

    # 3. Generate response using LLM + context
    answer_raw = self.chain.invoke({
        "context": context,
        "question": question
    })

    # FIX: Split the response and take the first part
    answer = answer_raw.split("\n\nAssistant:")[0].strip()

    # 4. Return structured results
    return {
        "answer": answer,
        "context": retrieved_docs
    }
OS: Ubuntu (tested) or other Linux/WSL2/macOS.
Python: 3.11 (required for package compatibility).
Hardware: An NVIDIA GPU with at least 4GB of VRAM is strongly recommended for 4-bit quantization.
git clone https://github.com/wakodepranav2005-git/local_rag_project.git
cd local_rag_project
# Ensure you have Python 3.11 and its 'venv' module
sudo apt update
sudo apt install python3.11 python3.11-venv

# Create the virtual environment
python3.11 -m venv venv

# Activate the environment
source venv/bin/activate
(Your terminal prompt should now start with (venv))
# Install all required packages
pip install -r requirements.txt
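Optionally, you can verify that PyTorch can see your GPU before the first run. This quick check assumes PyTorch was installed via requirements.txt; it is not part of the project code:

# Optional GPU visibility check (not part of the project)
import torch

print(torch.cuda.is_available())          # True is needed for 4-bit GPU inference
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # name of the detected GPU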
Add Your Documents: Place all your knowledge files (as .txt files) into the data/ folder.
Run the Application: With your virtual environment active, start the assistant:
python3 src/app.py
First-Time Run: The first time you run the app, it will download the LLM and embedding models (several GB) and then index your documents. This may take a few minutes.
Ask Questions! Once initialized, you can begin asking questions.
...
Successfully added 21 chunks to the vector database.
Enter a question or 'quit' to exit: What is AI?
Processing query: 'What is AI?'
Generating answer with 3 context chunk(s)...
==================================================
🤖 AI ANSWER:
Artificial Intelligence (AI) is a branch of computer science that aims to create intelligent machines that can perform tasks that typically require human intelligence. These tasks include learning, reasoning, problem-solving, perception, and language understanding.
--- RETRIEVED CONTEXT ---
[1] Artificial Intelligence Overview...
[2] AI Ethics and Responsible Development...
[3] Deep Learning and Neural Networks...
==================================================
Enter a question or 'quit' to exit: quit
This project demonstrates a complete, private RAG pipeline running on consumer hardware. It answers questions from a private knowledge base without any external services, showing that local-first AI applications are viable.