
Overview
This project demonstrates how to build a production-ready Retrieval-Augmented Generation (RAG) assistant—an AI system that answers questions about your documents using state-of-the-art language models and vector search technology.
Unlike minimal tutorials, this implementation focuses on the complete lifecycle of a RAG system:
Document Ingestion → Chunking → Embedding → Indexing → Retrieval → Generation
Why RAG Matters
As language models grow more powerful, accurate answers increasingly depend on external knowledge: sources that are updatable, auditable, and domain-specific.
Key Benefits:
Reduces Hallucinations
Grounds responses in verified source documents
Provides traceable citations for every answer
Data Privacy
Keeps sensitive data out of proprietary model training pipelines
Enables on-premises deployment for confidential documents
Domain Expertise
Instantly incorporates specialized knowledge bases
Updates answers without retraining models
Transparency
Shows which documents informed each response
Enables audit trails for compliance
Cost Efficiency
Reduces token usage by focusing on relevant context
Avoids fine-tuning costs for domain adaptation
Architecture
```mermaid
graph TD
    A[📄 Document Files] -->|Load| B[Document Loader]
    B -->|Text| C[Text Chunker]
    C -->|Chunks| D[Embedding Model]
    D -->|Vectors| E[(ChromaDB)]
    F[❓ User Query] -->|Embed| D
    D -->|Query Vector| E
    E -->|Similar Chunks| G[Context Retrieval]
    G -->|Relevant Context| H[LLM Generator]
    H -->|Grounded Answer| I[👤 User]
```
Component Flow:
Ingestion Layer: Loads documents from local storage
Processing Layer: Chunks text into semantic units
Embedding Layer: Converts text to vector representations
Storage Layer: Indexes vectors in ChromaDB
Retrieval Layer: Finds relevant context via similarity search
Generation Layer: Synthesizes answers using retrieved context
Key Features
Core Functionality
✅ Document Ingestion: Load .txt documents through an extensible loader design (multi-format support is on the roadmap)
✅ Intelligent Chunking: Semantic text splitting with configurable overlap
✅ State-of-the-Art Embeddings: Sentence Transformers with multiple model options
✅ Vector Database: Persistent ChromaDB storage with metadata filtering
✅ Multi-LLM Support: Switch between OpenAI, Groq, and Google Gemini
✅ RAG Pipeline: Complete retrieval and generation workflow
Installation & Setup
Prerequisites
Python 3.8 or higher
pip package manager
API key from at least one LLM provider (OpenAI, Groq, or Google)
Step 1: Clone Repository
```bash
git clone https://github.com/yourusername/rag-assistant.git
cd rag-assistant
```
Step 2: Create Virtual Environment
Windows:
```bash
python -m venv .venv
.venv\Scripts\activate
```
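macOS/Linux (the steps above show only Windows; this is the standard venv equivalent):
```bash
python3 -m venv .venv
source .venv/bin/activate
```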
Step 3: Install Dependencies
```bash
pip install -r requirements.txt
```
Required Packages:
```
chromadb>=0.4.0
sentence-transformers>=2.2.0
openai>=1.0.0
groq>=0.4.0
google-generativeai>=0.3.0
python-dotenv>=1.0.0
```
Step 4: Configure Environment
Create a .env file in the project root:
```
# LLM Provider (choose one or multiple)
GROQ_API_KEY=your_groq_api_key_here
GROQ_MODEL=llama-3.3-70b-versatile

# Alternative Providers
# OPENAI_API_KEY=your_openai_key_here
# OPENAI_MODEL=gpt-4-turbo-preview
# GOOGLE_API_KEY=your_google_key_here
# GOOGLE_MODEL=gemini-pro

# Embedding Configuration
EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2

# Vector Database
CHROMA_COLLECTION_NAME=rag_documents
CHROMA_PERSIST_DIRECTORY=./chroma_db

# Retrieval Settings
TOP_K_RESULTS=3
CHUNK_SIZE=500
CHUNK_OVERLAP=50
```
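These values are read at startup with python-dotenv. A minimal sketch of how they might be loaded (variable names here mirror the .env keys; the exact code in the repository may differ):
```python
import os
from dotenv import load_dotenv

load_dotenv()  # read key=value pairs from .env into the process environment

EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "sentence-transformers/all-MiniLM-L6-v2")
COLLECTION_NAME = os.getenv("CHROMA_COLLECTION_NAME", "rag_documents")
PERSIST_DIR = os.getenv("CHROMA_PERSIST_DIRECTORY", "./chroma_db")

# Numeric settings arrive as strings and must be converted
TOP_K_RESULTS = int(os.getenv("TOP_K_RESULTS", "3"))
CHUNK_SIZE = int(os.getenv("CHUNK_SIZE", "500"))
CHUNK_OVERLAP = int(os.getenv("CHUNK_OVERLAP", "50"))
```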
Step 5: Add Documents
Place your .txt files in the data/ directory:
```
data/
├── quantum_computing.txt
├── machine_learning_basics.txt
└── company_handbook.txt
```
Step 6: Run the Assistant
```bash
cd src
python app.py
```
Project Structure
```
rag-assistant/
│
├── data/                     # Document storage
│   ├── quantum_computing.txt
│   └── example_docs.txt
│
├── src/                      # Source code
│   ├── app.py                # Main application entry point
│   ├── vectordb.py           # ChromaDB vector store interface
│   ├── embeddings.py         # Embedding generation logic
│   ├── document_loader.py    # Document ingestion utilities
│   └── llm_client.py         # LLM provider abstraction
│
├── chroma_db/                # Persistent vector database
│
├── .env                      # Environment configuration
├── .env.example              # Example environment file
├── requirements.txt          # Python dependencies
├── README.md                 # This file
└── LICENSE                   # Project license
```
How It Works
Document Loading
All .txt files are read from the data/ directory:
```python
import os
from typing import List

def load_documents(data_dir: str) -> List[str]:
    """
    Loads all .txt files from the specified directory.

    Args:
        data_dir: Path to directory containing documents

    Returns:
        List of document texts
    """
    documents = []
    for filename in os.listdir(data_dir):
        if filename.endswith('.txt'):
            with open(os.path.join(data_dir, filename), 'r', encoding='utf-8') as f:
                documents.append(f.read())
    return documents
```
Supported Features:
Recursive directory scanning
UTF-8 encoding support
Metadata extraction (filename, creation date)
Error handling for corrupted files
Text Chunking
Documents are split into overlapping chunks before embedding:
```python
from typing import List

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """
    Splits text into overlapping chunks for better context preservation.

    Args:
        text: Input document text
        chunk_size: Maximum characters per chunk
        overlap: Number of overlapping characters between chunks

    Returns:
        List of text chunks
    """
    chunks = []
    start = 0
    text_length = len(text)

    while start < text_length:
        end = start + chunk_size
        chunks.append(text[start:end])
        start += chunk_size - overlap

    return chunks
```
Chunking Strategies:
Fixed-size chunks: Consistent length for uniform processing
Sentence-aware splitting: Preserves semantic boundaries (see the sketch after this list)
Overlap mechanism: Maintains context across chunk boundaries
Configurable parameters: Adapt to document types
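The repository's default splitter is the fixed-size version shown above. A sentence-aware variant is not included here, but a minimal sketch using only the standard library (with a deliberately naive regex sentence split) could look like this:
```python
import re
from typing import List

def chunk_text_by_sentence(text: str, chunk_size: int = 500) -> List[str]:
    """Greedy sentence-aware chunking: pack whole sentences into chunks of at most
    chunk_size characters (a single oversized sentence still becomes its own chunk)."""
    # Naive sentence split on ., ! or ? followed by whitespace
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > chunk_size:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```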
Embedding Generation
Chunks and queries are converted to vectors with Sentence Transformers:
```python
from typing import List

from sentence_transformers import SentenceTransformer

class EmbeddingModel:
    def __init__(self, model_name: str):
        """Initialize embedding model from HuggingFace."""
        self.model = SentenceTransformer(model_name)

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        """Generate embeddings for multiple documents."""
        return self.model.encode(texts, show_progress_bar=True).tolist()

    def embed_query(self, query: str) -> List[float]:
        """Generate embedding for a single query."""
        return self.model.encode([query])[0].tolist()
```
Embedding Models:
all-MiniLM-L6-v2: Fast, 384 dimensions (default)
all-mpnet-base-v2: High quality, 768 dimensions
multi-qa-mpnet-base-dot-v1: Optimized for Q&A
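A quick usage example for the EmbeddingModel class above (assuming it is importable from embeddings.py, as the project structure suggests); the default all-MiniLM-L6-v2 model produces 384-dimensional vectors:
```python
from embeddings import EmbeddingModel  # assumed module path

model = EmbeddingModel("sentence-transformers/all-MiniLM-L6-v2")

doc_vectors = model.embed_documents([
    "Qubits can exist in superposition.",
    "Gradient descent minimizes a loss function.",
])
query_vector = model.embed_query("What is superposition?")

print(len(doc_vectors), len(doc_vectors[0]))  # 2 documents, 384 dimensions each
print(len(query_vector))                      # 384
```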
Vector Storage & Retrieval
Embeddings are persisted in ChromaDB and queried by cosine similarity:
```python
from typing import List

import chromadb

class VectorStore:
    def __init__(self, collection_name: str, persist_directory: str):
        """Initialize ChromaDB client with persistence."""
        self.client = chromadb.PersistentClient(path=persist_directory)
        self.collection = self.client.get_or_create_collection(
            name=collection_name,
            metadata={"hnsw:space": "cosine"}
        )

    def add_documents(self, documents: List[str], embeddings: List[List[float]]):
        """Add documents with embeddings to the collection."""
        ids = [f"doc_{i}" for i in range(len(documents))]
        self.collection.add(
            documents=documents,
            embeddings=embeddings,
            ids=ids
        )

    def query(self, query_embedding: List[float], top_k: int = 3):
        """Retrieve most similar documents."""
        results = self.collection.query(
            query_embeddings=[query_embedding],
            n_results=top_k
        )
        return results['documents'][0], results['distances'][0]
```
Answer Generation
Retrieved chunks are stitched into a prompt and passed to the configured LLM:
```python
from typing import List

def generate_answer(query: str, context: List[str], llm_client) -> str:
    """
    Generate answer using retrieved context.

    Args:
        query: User's question
        context: Retrieved document chunks
        llm_client: Configured LLM client

    Returns:
        Generated answer with citations
    """
    prompt = f"""Answer the question based on the following context.

Context:
{chr(10).join(context)}

Question: {query}

Provide a detailed answer based only on the context above.
If the context doesn't contain enough information, say so clearly."""
    return llm_client.complete(prompt)
```
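Putting it together, one end-to-end query pass can be wired from the pieces above. This is a sketch only: module paths follow the project structure, chunk_text is assumed to live in document_loader.py, and get_llm_client is an assumed factory for whichever provider is configured.
```python
# Sketch of the main flow in app.py; the real module layout and names may differ.
from document_loader import load_documents, chunk_text   # assumed location of chunk_text
from embeddings import EmbeddingModel
from vectordb import VectorStore
from llm_client import get_llm_client                    # assumed factory returning a .complete() client

# 1. Ingest and chunk
documents = load_documents("../data")
chunks = [chunk for doc in documents for chunk in chunk_text(doc, chunk_size=500, overlap=50)]

# 2. Embed and index
embedder = EmbeddingModel("sentence-transformers/all-MiniLM-L6-v2")
store = VectorStore("rag_documents", "./chroma_db")
store.add_documents(chunks, embedder.embed_documents(chunks))

# 3. Retrieve and generate (generate_answer is the function defined just above)
query = "What is quantum superposition?"
context, _ = store.query(embedder.embed_query(query), top_k=3)
print(generate_answer(query, context, get_llm_client()))
```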
Usage Examples
Basic Query
```
$ python app.py

RAG Assistant Ready!
Loading documents...
✓ Loaded 5 documents
✓ Generated embeddings
✓ Indexed in ChromaDB

Enter a question or 'quit' to exit: What is quantum superposition?

AI: Quantum superposition is a fundamental principle of quantum mechanics where a quantum
system can exist in multiple states simultaneously until measured. According to the documents,
this property enables quantum computers to process multiple possibilities at once, providing
exponential speedup for certain computational problems. The classic example is Schrödinger's
cat, which exists in a superposition of alive and dead states until observed.

Sources: quantum_computing.txt (chunks 3, 7)
```
Complex Multi-Document Query
```
Enter a question or 'quit' to exit: How does machine learning relate to quantum computing?

AI: Machine learning and quantum computing intersect in several ways. Quantum machine learning
algorithms can potentially process high-dimensional data more efficiently using quantum
superposition and entanglement. The documents mention that quantum algorithms like QNN
(Quantum Neural Networks) may accelerate training of large models by exploring parameter
spaces more efficiently than classical gradient descent. However, practical quantum advantage
for ML tasks is still an active research area.

Sources: quantum_computing.txt (chunk 12), machine_learning_basics.txt (chunk 8)
```
Configuration & Customization
Changing Embedding Models
Edit .env to use different Sentence Transformers:
```
# Faster, smaller model (384 dim)
EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2

# Higher quality (768 dim)
EMBEDDING_MODEL=sentence-transformers/all-mpnet-base-v2

# Optimized for questions (768 dim)
EMBEDDING_MODEL=sentence-transformers/multi-qa-mpnet-base-dot-v1
```
Switching LLM Providers
Use OpenAI:
```
OPENAI_API_KEY=sk-...
OPENAI_MODEL=gpt-4-turbo-preview
```
Use Groq (fast inference):
```
GROQ_API_KEY=gsk_...
GROQ_MODEL=llama-3.3-70b-versatile
```
Use Google Gemini:
```
GOOGLE_API_KEY=AI...
GOOGLE_MODEL=gemini-pro
```
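Each provider sits behind the .complete(prompt) interface that generate_answer expects. llm_client.py is not reproduced in this README; a minimal Groq-backed sketch of that interface (class and helper names are illustrative) might look like:
```python
import os
from groq import Groq

class GroqClient:
    """Minimal wrapper exposing the .complete(prompt) interface used by generate_answer."""

    def __init__(self):
        self.client = Groq(api_key=os.getenv("GROQ_API_KEY"))
        self.model = os.getenv("GROQ_MODEL", "llama-3.3-70b-versatile")

    def complete(self, prompt: str) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content
```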
Tuning Retrieval Parameters
Adjust in .env based on your use case:
```
# Number of context chunks to retrieve
TOP_K_RESULTS=3     # Increase for more context (may add noise)

# Chunk size and overlap
CHUNK_SIZE=500      # Larger = more context per chunk
CHUNK_OVERLAP=50    # Higher = better context preservation
```
Advanced: Custom Document Loaders
Extend document_loader.py to support PDF, DOCX, and other formats:
```python
from PyPDF2 import PdfReader

def load_pdf(filepath: str) -> str:
    """Extract text from PDF file."""
    reader = PdfReader(filepath)
    text = ""
    for page in reader.pages:
        text += page.extract_text()
    return text
```
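A DOCX loader can follow the same pattern. Sketch using python-docx, which is not in requirements.txt and would need to be installed separately (pip install python-docx):
```python
from docx import Document  # pip install python-docx

def load_docx(filepath: str) -> str:
    """Extract text from a .docx file, one paragraph per line."""
    doc = Document(filepath)
    return "\n".join(paragraph.text for paragraph in doc.paragraphs)
```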
Roadmap & Future Enhancements
Planned Features
Multi-Format Support: PDF, DOCX, HTML document ingestion
Hybrid Search: Combine vector similarity with keyword search (BM25)
Re-ranking: Use cross-encoder models to re-score retrieved chunks
Streaming Responses: Real-time answer generation with citations
Web UI: Gradio/Streamlit interface for non-technical users
Evaluation Metrics: Automated testing with RAGAS or LangChain evaluators
Metadata Filtering: Filter by date, author, document type during retrieval
Multilingual Support: Cross-lingual embeddings and multilingual LLMs
Advanced Capabilities
Agentic RAG: Multi-step reasoning with tool use
Semantic Caching: Deduplicate similar queries
Active Learning: User feedback to improve retrieval
Explainability: Highlight exact text spans used in answers
Contributing
Contributions are welcome! Please follow these guidelines:
Create a feature branch, commit your changes, and push them before opening a pull request:
```bash
git checkout -b feature/your-feature
git commit -m "Add your feature"
git push origin feature/your-feature
```
Development Setup
```bash
# Install development dependencies
pip install -r requirements-dev.txt

# Run tests
pytest tests/

# Format code
black src/
flake8 src/
```
License & Attribution
This project is licensed under the MIT License - see the LICENSE file for details.
Third-Party Licenses
Acknowledgments