
Overview
This project demonstrates how to build a production-ready Retrieval-Augmented Generation (RAG) assistant—an AI system that answers questions about your documents using state-of-the-art language models and vector search technology.
Unlike minimal tutorials, this implementation focuses on the complete lifecycle of a RAG system:
Document Ingestion → Chunking → Embedding → Indexing → Retrieval → Generation
Why RAG Matters
As language models grow more powerful, accurate answers increasingly depend on external knowledge: sources that are updatable, auditable, and domain-specific.
Key Benefits:
Reduces Hallucinations
Grounds responses in verified source documents
Provides traceable citations for every answer
Data Privacy
Keeps sensitive data out of proprietary model training pipelines
Enables on-premises deployment for confidential documents
Domain Expertise
Instantly incorporates specialized knowledge bases
Updates answers without retraining models
Transparency
Shows which documents informed each response
Enables audit trails for compliance
Cost Efficiency
Reduces token usage by focusing on relevant context
Avoids fine-tuning costs for domain adaptation
Architecture
```mermaid
graph TD
    A[📄 Document Files] -->|Load| B[Document Loader]
    B -->|Text| C[Text Chunker]
    C -->|Chunks| D[Embedding Model]
    D -->|Vectors| E[(ChromaDB)]
    F[❓ User Query] -->|Embed| D
    D -->|Query Vector| E
    E -->|Similar Chunks| G[Context Retrieval]
    G -->|Relevant Context| H[LLM Generator]
    H -->|Grounded Answer| I[👤 User]
```
Component Flow:
Ingestion Layer: Loads documents from local storage
Processing Layer: Chunks text into semantic units
Embedding Layer: Converts text to vector representations
Storage Layer: Indexes vectors in ChromaDB
Retrieval Layer: Finds relevant context via similarity search
Generation Layer: Synthesizes answers using retrieved context
Key Features
Core Functionality
✅ Document Ingestion: Load .txt documents through an extensible loader design (multi-format support is on the roadmap)
✅ Intelligent Chunking: Semantic text splitting with configurable overlap
✅ State-of-the-Art Embeddings: Sentence Transformers with multiple model options
✅ Vector Database: Persistent ChromaDB storage with metadata filtering
✅ Multi-LLM Support: Switch between OpenAI, Groq, and Google Gemini
✅ RAG Pipeline: Complete retrieval and generation workflow
Installation & Setup
Prerequisites
Python 3.8 or higher
pip package manager
API key from at least one LLM provider (OpenAI, Groq, or Google)
Step 1: Clone Repository
```bash
git clone https://github.com/yourusername/rag-assistant.git
cd rag-assistant
```
Step 2: Create Virtual Environment
Windows:
```bash
python -m venv .venv
.venv\Scripts\activate
```
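macOS/Linux (the steps above show only Windows; this is the standard venv equivalent):
```bash
python3 -m venv .venv
source .venv/bin/activate
```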
Step 3: Install Dependencies
```bash
pip install -r requirements.txt
```
Required Packages:
```
chromadb>=0.4.0
sentence-transformers>=2.2.0
openai>=1.0.0
groq>=0.4.0
google-generativeai>=0.3.0
python-dotenv>=1.0.0
```
Step 4: Configure Environment
Create a .env file in the project root:
```
# LLM Provider (choose one or multiple)
GROQ_API_KEY=your_groq_api_key_here
GROQ_MODEL=llama-3.3-70b-versatile

# Alternative Providers
# OPENAI_API_KEY=your_openai_key_here
# OPENAI_MODEL=gpt-4-turbo-preview
# GOOGLE_API_KEY=your_google_key_here
# GOOGLE_MODEL=gemini-pro

# Embedding Configuration
EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2

# Vector Database
CHROMA_COLLECTION_NAME=rag_documents
CHROMA_PERSIST_DIRECTORY=./chroma_db

# Retrieval Settings
TOP_K_RESULTS=3
CHUNK_SIZE=500
CHUNK_OVERLAP=50
```
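These values are read at startup with python-dotenv. A minimal sketch of how they might be loaded (variable names here mirror the .env keys; the exact code in the repository may differ):
```python
import os
from dotenv import load_dotenv

load_dotenv()  # read key=value pairs from .env into the process environment

EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "sentence-transformers/all-MiniLM-L6-v2")
COLLECTION_NAME = os.getenv("CHROMA_COLLECTION_NAME", "rag_documents")
PERSIST_DIR = os.getenv("CHROMA_PERSIST_DIRECTORY", "./chroma_db")

# Numeric settings arrive as strings and must be converted
TOP_K_RESULTS = int(os.getenv("TOP_K_RESULTS", "3"))
CHUNK_SIZE = int(os.getenv("CHUNK_SIZE", "500"))
CHUNK_OVERLAP = int(os.getenv("CHUNK_OVERLAP", "50"))
```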
Step 5: Add Documents
Place your .txt files in the data/ directory:
```
data/
├── quantum_computing.txt
├── machine_learning_basics.txt
└── company_handbook.txt
```
Step 6: Run the Assistant
```bash
cd src
python app.py
```
Project Structure
```
rag-assistant/
│
├── data/                     # Document storage
│   ├── quantum_computing.txt
│   └── example_docs.txt
│
├── src/                      # Source code
│   ├── app.py                # Main application entry point
│   ├── vectordb.py           # ChromaDB vector store interface
│   ├── embeddings.py         # Embedding generation logic
│   ├── document_loader.py    # Document ingestion utilities
│   └── llm_client.py         # LLM provider abstraction
│
├── chroma_db/                # Persistent vector database
│
├── .env                      # Environment configuration
├── .env.example              # Example environment file
├── requirements.txt          # Python dependencies
├── README.md                 # This file
└── LICENSE                   # Project license
```
How It Works
Document Loading
All .txt files are read from the data/ directory:
```python
import os
from typing import List

def load_documents(data_dir: str) -> List[str]:
    """
    Loads all .txt files from the specified directory.

    Args:
        data_dir: Path to directory containing documents

    Returns:
        List of document texts
    """
    documents = []
    for filename in os.listdir(data_dir):
        if filename.endswith('.txt'):
            with open(os.path.join(data_dir, filename), 'r', encoding='utf-8') as f:
                documents.append(f.read())
    return documents
```
Supported Features:
Recursive directory scanning
UTF-8 encoding support
Metadata extraction (filename, creation date)
Error handling for corrupted files
Text Chunking
Documents are split into overlapping chunks before embedding:
```python
from typing import List

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """
    Splits text into overlapping chunks for better context preservation.

    Args:
        text: Input document text
        chunk_size: Maximum characters per chunk
        overlap: Number of overlapping characters between chunks

    Returns:
        List of text chunks
    """
    chunks = []
    start = 0
    text_length = len(text)

    while start < text_length:
        end = start + chunk_size
        chunks.append(text[start:end])
        start += chunk_size - overlap

    return chunks
```
Chunking Strategies:
Fixed-size chunks: Consistent length for uniform processing
Sentence-aware splitting: Preserves semantic boundaries (see the sketch after this list)
Overlap mechanism: Maintains context across chunk boundaries
Configurable parameters: Adapt to document types
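The repository's default splitter is the fixed-size version shown above. A sentence-aware variant is not included here, but a minimal sketch using only the standard library (with a deliberately naive regex sentence split) could look like this:
```python
import re
from typing import List

def chunk_text_by_sentence(text: str, chunk_size: int = 500) -> List[str]:
    """Greedy sentence-aware chunking: pack whole sentences into chunks of at most
    chunk_size characters (a single oversized sentence still becomes its own chunk)."""
    # Naive sentence split on ., ! or ? followed by whitespace
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > chunk_size:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```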
Embedding Generation
Chunks and queries are converted to vectors with Sentence Transformers:
```python
from typing import List

from sentence_transformers import SentenceTransformer

class EmbeddingModel:
    def __init__(self, model_name: str):
        """Initialize embedding model from HuggingFace."""
        self.model = SentenceTransformer(model_name)

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        """Generate embeddings for multiple documents."""
        return self.model.encode(texts, show_progress_bar=True).tolist()

    def embed_query(self, query: str) -> List[float]:
        """Generate embedding for a single query."""
        return self.model.encode([query])[0].tolist()
```
Embedding Models:
all-MiniLM-L6-v2: Fast, 384 dimensions (default)
all-mpnet-base-v2: High quality, 768 dimensions
multi-qa-mpnet-base-dot-v1: Optimized for Q&A
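A quick usage example for the EmbeddingModel class above (assuming it is importable from embeddings.py, as the project structure suggests); the default all-MiniLM-L6-v2 model produces 384-dimensional vectors:
```python
from embeddings import EmbeddingModel  # assumed module path

model = EmbeddingModel("sentence-transformers/all-MiniLM-L6-v2")

doc_vectors = model.embed_documents([
    "Qubits can exist in superposition.",
    "Gradient descent minimizes a loss function.",
])
query_vector = model.embed_query("What is superposition?")

print(len(doc_vectors), len(doc_vectors[0]))  # 2 documents, 384 dimensions each
print(len(query_vector))                      # 384
```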
Vector Storage & Retrieval
Embeddings are persisted in ChromaDB and queried by cosine similarity:
```python
from typing import List

import chromadb

class VectorStore:
    def __init__(self, collection_name: str, persist_directory: str):
        """Initialize ChromaDB client with persistence."""
        self.client = chromadb.PersistentClient(path=persist_directory)
        self.collection = self.client.get_or_create_collection(
            name=collection_name,
            metadata={"hnsw:space": "cosine"}
        )

    def add_documents(self, documents: List[str], embeddings: List[List[float]]):
        """Add documents with embeddings to the collection."""
        ids = [f"doc_{i}" for i in range(len(documents))]
        self.collection.add(
            documents=documents,
            embeddings=embeddings,
            ids=ids
        )

    def query(self, query_embedding: List[float], top_k: int = 3):
        """Retrieve most similar documents."""
        results = self.collection.query(
            query_embeddings=[query_embedding],
            n_results=top_k
        )
        return results['documents'][0], results['distances'][0]
```
Answer Generation
Retrieved chunks are stitched into a prompt and passed to the configured LLM:
```python
from typing import List

def generate_answer(query: str, context: List[str], llm_client) -> str:
    """
    Generate answer using retrieved context.

    Args:
        query: User's question
        context: Retrieved document chunks
        llm_client: Configured LLM client

    Returns:
        Generated answer with citations
    """
    prompt = f"""Answer the question based on the following context.

Context:
{chr(10).join(context)}

Question: {query}

Provide a detailed answer based only on the context above.
If the context doesn't contain enough information, say so clearly."""
    return llm_client.complete(prompt)
```
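Putting it together, one end-to-end query pass can be wired from the pieces above. This is a sketch only: module paths follow the project structure, chunk_text is assumed to live in document_loader.py, and get_llm_client is an assumed factory for whichever provider is configured.
```python
# Sketch of the main flow in app.py; the real module layout and names may differ.
from document_loader import load_documents, chunk_text   # assumed location of chunk_text
from embeddings import EmbeddingModel
from vectordb import VectorStore
from llm_client import get_llm_client                    # assumed factory returning a .complete() client

# 1. Ingest and chunk
documents = load_documents("../data")
chunks = [chunk for doc in documents for chunk in chunk_text(doc, chunk_size=500, overlap=50)]

# 2. Embed and index
embedder = EmbeddingModel("sentence-transformers/all-MiniLM-L6-v2")
store = VectorStore("rag_documents", "./chroma_db")
store.add_documents(chunks, embedder.embed_documents(chunks))

# 3. Retrieve and generate (generate_answer is the function defined just above)
query = "What is quantum superposition?"
context, _ = store.query(embedder.embed_query(query), top_k=3)
print(generate_answer(query, context, get_llm_client()))
```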
Usage Examples
Basic Query
```
$ python app.py

RAG Assistant Ready!
Loading documents...
✓ Loaded 5 documents
✓ Generated embeddings
✓ Indexed in ChromaDB

Enter a question or 'quit' to exit: What is quantum superposition?

AI: Quantum superposition is a fundamental principle of quantum mechanics where a quantum
system can exist in multiple states simultaneously until measured. According to the documents,
this property enables quantum computers to process multiple possibilities at once, providing
exponential speedup for certain computational problems. The classic example is Schrödinger's
cat, which exists in a superposition of alive and dead states until observed.

Sources: quantum_computing.txt (chunks 3, 7)
```
Complex Multi-Document Query
```
Enter a question or 'quit' to exit: How does machine learning relate to quantum computing?

AI: Machine learning and quantum computing intersect in several ways. Quantum machine learning
algorithms can potentially process high-dimensional data more efficiently using quantum
superposition and entanglement. The documents mention that quantum algorithms like QNN
(Quantum Neural Networks) may accelerate training of large models by exploring parameter
spaces more efficiently than classical gradient descent. However, practical quantum advantage
for ML tasks is still an active research area.

Sources: quantum_computing.txt (chunk 12), machine_learning_basics.txt (chunk 8)
```
Configuration & Customization
Changing Embedding Models
Edit .env to use different Sentence Transformers:
```
# Faster, smaller model (384 dim)
EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2

# Higher quality (768 dim)
EMBEDDING_MODEL=sentence-transformers/all-mpnet-base-v2

# Optimized for questions (768 dim)
EMBEDDING_MODEL=sentence-transformers/multi-qa-mpnet-base-dot-v1
```
Switching LLM Providers
Use OpenAI:
```
OPENAI_API_KEY=sk-...
OPENAI_MODEL=gpt-4-turbo-preview
```
Use Groq (fast inference):
```
GROQ_API_KEY=gsk_...
GROQ_MODEL=llama-3.3-70b-versatile
```
Use Google Gemini:
```
GOOGLE_API_KEY=AI...
GOOGLE_MODEL=gemini-pro
```
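Each provider sits behind the .complete(prompt) interface that generate_answer expects. llm_client.py is not reproduced in this README; a minimal Groq-backed sketch of that interface (class and helper names are illustrative) might look like:
```python
import os
from groq import Groq

class GroqClient:
    """Minimal wrapper exposing the .complete(prompt) interface used by generate_answer."""

    def __init__(self):
        self.client = Groq(api_key=os.getenv("GROQ_API_KEY"))
        self.model = os.getenv("GROQ_MODEL", "llama-3.3-70b-versatile")

    def complete(self, prompt: str) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content
```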
Tuning Retrieval Parameters
Adjust in .env based on your use case:
```
# Number of context chunks to retrieve
TOP_K_RESULTS=3     # Increase for more context (may add noise)

# Chunk size and overlap
CHUNK_SIZE=500      # Larger = more context per chunk
CHUNK_OVERLAP=50    # Higher = better context preservation
```
Advanced: Custom Document Loaders
Extend document_loader.py to support PDF, DOCX, and other formats:
```python
from PyPDF2 import PdfReader

def load_pdf(filepath: str) -> str:
    """Extract text from PDF file."""
    reader = PdfReader(filepath)
    text = ""
    for page in reader.pages:
        text += page.extract_text()
    return text
```
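A DOCX loader can follow the same pattern. Sketch using python-docx, which is not in requirements.txt and would need to be installed separately (pip install python-docx):
```python
from docx import Document  # pip install python-docx

def load_docx(filepath: str) -> str:
    """Extract text from a .docx file, one paragraph per line."""
    doc = Document(filepath)
    return "\n".join(paragraph.text for paragraph in doc.paragraphs)
```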
Roadmap & Future Enhancements
Planned Features
Multi-Format Support: PDF, DOCX, HTML document ingestion
Hybrid Search: Combine vector similarity with keyword search (BM25)
Re-ranking: Use cross-encoder models to re-score retrieved chunks
Streaming Responses: Real-time answer generation with citations
Web UI: Gradio/Streamlit interface for non-technical users
Evaluation Metrics: Automated testing with RAGAS or LangChain evaluators
Metadata Filtering: Filter by date, author, document type during retrieval
Multilingual Support: Cross-lingual embeddings and multilingual LLMs
Advanced Capabilities
Agentic RAG: Multi-step reasoning with tool use
Semantic Caching: Deduplicate similar queries
Active Learning: User feedback to improve retrieval
Explainability: Highlight exact text spans used in answers
Contributing
Contributions are welcome! Please follow these guidelines:
Create a feature branch, commit your changes, and push them before opening a pull request:
```bash
git checkout -b feature/your-feature
git commit -m "Add your feature"
git push origin feature/your-feature
```
Development Setup
```bash
# Install development dependencies
pip install -r requirements-dev.txt

# Run tests
pytest tests/

# Format code
black src/
flake8 src/
```
License & Attribution
This project is licensed under the MIT License - see the LICENSE file for details.
Third-Party Licenses
Acknowledgments