A Retrieval-Augmented Generation system built with ChromaDB and Sentence Transformers, with multi-provider LLM support
Large Language Models (LLMs) are powerful but limited by their training data cutoff. When developers need answers about specific frameworks like LangChain, general-purpose LLMs may produce outdated or hallucinated responses. Retrieval-Augmented Generation (RAG) addresses this by grounding LLM responses in actual source documentation, ensuring answers are accurate and verifiable.
This project builds a RAG-based AI assistant that answers developer questions about LangChain by searching through its official documentation and generating contextual responses. The system demonstrates core RAG concepts: document ingestion, text chunking, vector embedding, similarity search, and context-augmented generation.
Developers working with LangChain face a common challenge: the framework evolves rapidly, and documentation is spread across many pages. Traditional keyword search often fails to surface the most relevant information for nuanced technical questions. An LLM alone may hallucinate API details or reference deprecated patterns.
The gap this project addresses is providing a system that searches the actual LangChain documentation, grounds the LLM's answers in the retrieved text, and attributes each response to its source files.
The system follows a standard RAG architecture:
User Question --> Vector DB Search --> Context Formatting --> LLM Response
Step 1 - Document Ingestion: Plain text files from the data/ directory are loaded using LangChain's TextLoader. Each file represents a page of LangChain documentation.
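A minimal sketch of this step (the langchain_community import path applies to recent LangChain releases; older versions expose TextLoader from langchain.document_loaders):

```python
from pathlib import Path

from langchain_community.document_loaders import TextLoader

# Load every .txt file in data/ as one LangChain Document per file.
documents = []
for path in sorted(Path("data").glob("*.txt")):
    loader = TextLoader(str(path), encoding="utf-8")
    documents.extend(loader.load())  # each Document records its source path in metadata
```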
Step 2 - Text Chunking: Documents are split into smaller chunks using RecursiveCharacterTextSplitter with a chunk size of 500 characters and 200-character overlap. The overlap preserves context across chunk boundaries. The splitter uses a hierarchy of separators (`\n\n`, `\n`, `.`, `,`, `""`) to break text at natural boundaries.
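A sketch of this configuration (the langchain_text_splitters package name applies to recent LangChain releases; the file path is simply one of the project's data files):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,        # maximum characters per chunk
    chunk_overlap=200,     # characters shared with the preceding chunk
    separators=["\n\n", "\n", ".", ",", ""],  # tried coarsest-first
)

raw_text = open("data/tools.txt", encoding="utf-8").read()
chunks = splitter.split_text(raw_text)  # list of ~500-character strings
```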
Step 3 - Embedding: Each chunk is encoded into a 384-dimensional vector using the all-MiniLM-L6-v2 Sentence Transformer model. This lightweight model runs locally without GPU requirements.
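A minimal sketch of this step using the sentence-transformers library:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # small model, runs on CPU
vectors = model.encode(["How do I create a tool in LangChain?"])
print(vectors.shape)                              # (1, 384): one 384-dimensional vector per input
```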
Step 4 - Vector Storage: Embeddings are stored in ChromaDB with persistent local storage. Each chunk retains its source metadata (filename, chunk index) for attribution.
Step 5 - Retrieval: When a user asks a question, it is embedded using the same model and the top 5 most similar chunks are retrieved using cosine distance.
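The sketch below covers both the storage and retrieval steps with ChromaDB. The collection name, chunk text, ids, and metadata fields are illustrative assumptions, not the project's exact values:

```python
import chromadb
from sentence_transformers import SentenceTransformer

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(
    name="langchain_docs",                 # hypothetical collection name
    metadata={"hnsw:space": "cosine"},     # use cosine distance for similarity
)
model = SentenceTransformer("all-MiniLM-L6-v2")

# Step 4: store chunks together with per-chunk metadata for source attribution.
chunks = ["Use the @tool decorator to turn a function into a tool."]
collection.add(
    ids=["tools.txt-0"],
    documents=chunks,
    embeddings=model.encode(chunks).tolist(),
    metadatas=[{"source": "tools.txt", "chunk_index": 0}],
)

# Step 5: embed the question with the same model and fetch the 5 nearest chunks.
question = "How do I create a tool in LangChain?"
results = collection.query(
    query_embeddings=model.encode([question]).tolist(),
    n_results=5,
    include=["documents", "metadatas", "distances"],
)
```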
Step 6 - Generation: Retrieved chunks are formatted into a structured prompt along with conversation history, then sent to the LLM. The prompt template enforces chain-of-thought reasoning with separate "Reasoning" and "Answer" sections.
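As a sketch of the context-formatting step (the dictionary fields below are illustrative, not the project's actual return format), each retrieved chunk can be tagged with its source file before being injected into the prompt:

```python
# Hypothetical retrieval results: chunk text plus the source metadata stored earlier.
retrieved = [
    {"text": "Use the @tool decorator to turn a function into a tool.", "source": "tools.txt"},
    {"text": "Type hints on the function define the tool's input schema.", "source": "tools.txt"},
]

# Tag each chunk with its source file so answers can be attributed.
context = "\n\n".join(f"[{r['source']}] {r['text']}" for r in retrieved)
```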
The project consists of two core modules:
The VectorDB class wraps ChromaDB with embedding generation:
```python
import chromadb
from sentence_transformers import SentenceTransformer

class VectorDB:
    def __init__(self, collection_name, embedding_model):
        self.client = chromadb.PersistentClient(path="./chroma_db")   # persistent local store
        self.embedding_model = SentenceTransformer(embedding_model)   # local embedding model
        self.collection = self.client.get_or_create_collection(name=collection_name)
```
Key methods:
- `chunk_text(text, chunk_size)` — Splits text using LangChain's RecursiveCharacterTextSplitter
- `add_documents(documents)` — Chunks, embeds, and stores documents with metadata
- `search(query, n_results)` — Encodes the query and returns the top-N similar chunks with distances

The RAGAssistant class orchestrates the full pipeline:
```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

class RAGAssistant:
    def __init__(self):
        self.llm = self._initialize_llm()  # Auto-detects provider
        self.vector_db = VectorDB()
        self.prompt_template = ChatPromptTemplate.from_template(...)
        # LCEL pipeline: prompt -> LLM -> plain-string output
        self.chain = self.prompt_template | self.llm | StrOutputParser()
```
The prompt template includes structured fields for role, context, instructions, reasoning process, output constraints, style, and goal. This structured approach ensures consistent, well-formatted responses.
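The exact wording lives in the project source; the sketch below only shows the general shape of such a template, with {context}, {history}, and {question} as illustrative placeholder names:

```python
from langchain_core.prompts import ChatPromptTemplate

prompt_template = ChatPromptTemplate.from_template(
    """Role: You are an assistant that answers questions about LangChain.

Context (retrieved documentation chunks):
{context}

Conversation history:
{history}

Instructions: Answer using only the context above. If the context does not
contain the answer, say so instead of guessing.

Output format:
Reasoning: <how the retrieved context supports the answer>
Answer: <the answer itself>

Question: {question}"""
)
```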
LLM provider detection follows a priority chain: OpenAI > Groq > Google Gemini, based on which API key is present in the environment.
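A sketch of such a priority chain, assuming the standard environment variable names for each integration (the model identifiers are illustrative, not the project's configured defaults):

```python
import os

def initialize_llm():
    """Return a chat model based on whichever API key is present (OpenAI > Groq > Gemini)."""
    if os.getenv("OPENAI_API_KEY"):
        from langchain_openai import ChatOpenAI
        return ChatOpenAI(model="gpt-4o-mini")
    if os.getenv("GROQ_API_KEY"):
        from langchain_groq import ChatGroq
        return ChatGroq(model="llama-3.1-8b-instant")
    if os.getenv("GOOGLE_API_KEY"):
        from langchain_google_genai import ChatGoogleGenerativeAI
        return ChatGoogleGenerativeAI(model="gemini-1.5-flash")
    raise RuntimeError("No LLM API key found in the environment")
```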
The knowledge base consists of 8 text files sourced from the official LangChain documentation:
| File | Topic |
|---|---|
| overview.txt | Core benefits, getting started |
| models.txt | Chat models, providers, streaming, structured output |
| agents.txt | Agent architecture, tools, prompts, memory |
| tools.txt | Creating tools, decorators, ToolNode |
| messages.txt | Message types, roles, conversation history |
| short-term-memory.txt | Short-term memory and state management |
| streaming-overview.txt | Streaming patterns |
| streaming-frontend.txt | Frontend streaming integration |
These documents cover the core concepts a developer encounters when building with LangChain: from basic model invocation through agents, tools, memory, and streaming.
Processing: Each document is chunked into ~500-character segments with 200-character overlap, producing approximately 80-100 total chunks across all 8 files. Chunks are embedded into 384-dimensional vectors and stored in ChromaDB.
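The 80-100 figure is consistent with the overlap arithmetic: each new chunk advances roughly chunk_size minus overlap characters through the text. A rough estimate, where the corpus size is an assumed figure rather than one measured from the repository:

```python
chunk_size, chunk_overlap = 500, 200
stride = chunk_size - chunk_overlap      # ~300 new characters per chunk

# Assuming the 8 files total roughly 25,000-30,000 characters:
for total_chars in (25_000, 30_000):
    print(total_chars // stride)         # prints 83 and 100
```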
| Component | Technology | Purpose |
|---|---|---|
| Framework | LangChain | Prompt templates, output parsing, text splitting |
| Vector Database | ChromaDB | Persistent local vector storage and similarity search |
| Embeddings | Sentence Transformers (all-MiniLM-L6-v2) | Local embedding generation (384 dimensions) |
| LLM Providers | OpenAI, Groq, Google Gemini | Response generation |
| Package Manager | uv | Dependency management and virtual environments |
| Testing | pytest | Unit testing for vector database operations |
The assistant successfully answers developer questions about LangChain by retrieving relevant documentation chunks and generating grounded responses.
Example interaction:
```
Q: How do I create a tool in LangChain?

Reasoning:
Based on the retrieved context from tools.txt, LangChain provides
a @tool decorator for creating tools that agents can use.

Answer:
The simplest way to create a tool in LangChain is with the @tool
decorator. The function's docstring becomes the tool description
that helps the model understand when to use it. Type hints are
required as they define the tool's input schema.
```
The system correctly retrieves chunks from the most relevant source file (e.g., tools.txt for tool-related questions), and the lightweight all-MiniLM-L6-v2 model provides adequate semantic similarity for documentation search without requiring API calls or a GPU.

This project demonstrates a practical RAG system that grounds LLM responses in actual documentation. The architecture is intentionally modular: the vector database, embedding model, and LLM provider can each be swapped independently. While the current implementation is scoped as a CLI tool for LangChain documentation, the same pipeline can be adapted to any text corpus by replacing the files in the data/ directory.
The source code is available on GitHub under the CC BY-NC-SA 4.0 license.