Retrieval-Augmented Generation (RAG) is a key paradigm for building production-ready AI assistants grounded in reliable external knowledge. This project demonstrates a minimal yet robust RAG pipeline built with LangChain, using Chroma as the default vector store and FAISS as an optional alternative. The pipeline ingests multiple data sources (web articles, Wikipedia topics, and PDFs), chunks and embeds them with `sentence-transformers/all-MiniLM-L6-v2`, and serves both a CLI and a REST API for question answering. It features staleness detection to ensure embeddings are refreshed when source data changes, configurable chunk sizes and retrieval parameters, and a fully reproducible workflow that can be run locally or deployed in production. This work highlights best practices for building explainable, testable, and maintainable RAG systems.
The ingestion pipeline begins by loading the sources defined in `configs/sources.yaml`, including URLs, Wikipedia queries, and PDFs. Documents are normalized, chunked into overlapping segments (default 1000 characters with 150-character overlap), and embedded with `sentence-transformers/all-MiniLM-L6-v2`. Embeddings are stored in a persistent Chroma vector store located at `./vectorstore/chroma_index/`. FAISS is also supported as a drop-in alternative.
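As a rough sketch, the ingestion step might look like the following, assuming a `sources.yaml` schema with `urls`, `wikipedia`, and `pdfs` keys (the actual schema and loader choices may differ):

```python
# Sketch of the ingestion step; the sources.yaml keys below are assumptions.
import yaml
from langchain_community.document_loaders import WebBaseLoader, WikipediaLoader, PyPDFLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter

with open("configs/sources.yaml") as f:
    sources = yaml.safe_load(f)

# Load documents from the three source types.
docs = []
for url in sources.get("urls", []):
    docs.extend(WebBaseLoader(url).load())
for topic in sources.get("wikipedia", []):
    docs.extend(WikipediaLoader(query=topic, load_max_docs=2).load())
for path in sources.get("pdfs", []):
    docs.extend(PyPDFLoader(path).load())

# Chunk into overlapping segments (pipeline defaults: 1000 chars, 150 overlap).
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
chunks = splitter.split_documents(docs)

# Embed and persist to the Chroma index.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = Chroma.from_documents(chunks, embeddings,
                                    persist_directory="./vectorstore/chroma_index/")
```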
For retrieval, the system uses a similarity-search retriever with a configurable `k` (default 4). At query time, the retrieved context is merged with the user's question via a LangChain prompt template and passed to the selected LLM provider (OpenAI GPT-4/5, Ollama, or the HuggingFace Inference API). Both CLI and FastAPI endpoints are provided, enabling local use and service deployment.
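A minimal retrieval-and-generation chain along these lines could be wired up as follows; the prompt wording, the `gpt-4` model choice, and the chain composition are illustrative assumptions rather than the project's exact code:

```python
# Sketch of the query path: load the persisted index, retrieve top-k chunks,
# and answer with a prompt-templated LLM call.
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = Chroma(persist_directory="./vectorstore/chroma_index/",
                     embedding_function=embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})  # default k=4

prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)
llm = ChatOpenAI(model="gpt-4")  # could also be an Ollama or HF Inference model

def format_docs(docs):
    return "\n\n".join(d.page_content for d in docs)

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

print(chain.invoke("What is retrieval-augmented generation?"))
```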
To maintain reproducibility, the system includes staleness detection that warns users when `sources.yaml` is newer than the stored index. The Makefile automates ingestion (`make ingest`), query execution (`make ask Q="..."`), and cleanup (`make clean`). Logging is centralized in `logging_config.py` with configurable verbosity, making the pipeline easy to debug and monitor in production.
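A simple way to implement such a staleness check is to compare modification times; the sketch below is a hypothetical illustration (the real implementation may track content hashes or per-source timestamps instead):

```python
# Hypothetical staleness check: warn if the config is newer than the index.
import logging
from pathlib import Path

logger = logging.getLogger(__name__)

def warn_if_stale(config_path="configs/sources.yaml",
                  index_dir="./vectorstore/chroma_index/"):
    config_mtime = Path(config_path).stat().st_mtime
    index_files = [p for p in Path(index_dir).rglob("*") if p.is_file()]
    if not index_files:
        logger.warning("No index found at %s; run `make ingest` first.", index_dir)
        return
    index_mtime = max(p.stat().st_mtime for p in index_files)
    if config_mtime > index_mtime:
        logger.warning("%s is newer than the stored index; consider re-running `make ingest`.",
                       config_path)
```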
The final implementation successfully indexed 19 documents across 3 web sources, 5 Wikipedia topics, and 2 PDF files, resulting in 292 context chunks. The Chroma index persisted reliably and passed automated ingestion idempotency tests. Sample queries returned grounded, context-aware answers with deduplicated source metadata, demonstrating traceability and transparency of responses.
Logging at INFO level provided clear visibility into the ingestion process, including per-source document counts, total chunk counts, and embedding progress. DEBUG-level logs exposed more granular details for troubleshooting. The API endpoint returned JSON responses with both answer text and structured source information, making it easy to consume results programmatically.
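For illustration, consuming the API from Python might look like the following; the `/ask` route and the `answer`/`sources` field names are assumptions rather than the documented contract:

```python
# Hypothetical client call against the FastAPI service.
import requests

resp = requests.post("http://localhost:8000/ask", json={"question": "What is RAG?"})
resp.raise_for_status()
payload = resp.json()

print(payload["answer"])           # answer text
for src in payload["sources"]:     # deduplicated source metadata
    print("-", src)
```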
Overall, this RAG assistant provides a reproducible, extensible, and production-ready blueprint for teams looking to deploy retrieval-augmented LLM systems with a focus on observability, maintainability, and data freshness guarantees.