This publication presents the design and implementation of a Retrieval-Augmented Generation (RAG) assistant built with Python, FastAPI, and ChromaDB, enabling context-aware responses grounded in a custom knowledge base. We show how to load documents into a vector database, configure embeddings, craft effective prompts, and expose a streaming API that delivers responses from large language models (LLMs). We also discuss how embedding model selection impacts retrieval quality, and provide best practices for chunking, prompt design, and RAG architecture applied to real-world applications.
Generative language models such as GPT-4 and similar large language models (LLMs) are powerful for synthesizing text, but they can lack specific domain knowledge outside their training corpus. Retrieval-Augmented Generation (RAG) addresses this by combining retrieval of relevant context from a vector index with generative reasoning from an LLM, thus allowing the model to ground responses in user-supplied documents.
This project demonstrates a fully functional RAG assistant that:
Accepts user questions through an API
Searches a local vector database of documents
Integrates retrieved context into an LLM prompt
Serves answers via FastAPI with token-by-token streaming
Such a setup is applicable to smart assistants, QA systems, and domain-specific knowledge tools.
At a high level, a RAG system consists of the following stages:
Document Ingestion & Chunking
Embedding & Storage in a Vector Database
Query Embedding & Retrieval
Prompt Engineering for Context Conditioning
Answer Generation via LLM
The subsections below describe how each stage is implemented in this project.
2.1 Document Ingestion and Chunking
Documents in a designated data directory are loaded and split into overlapping chunks. Chunking is critical because it ensures that large documents contribute semantically coherent sections to the retriever. The chunk size and overlap are adjustable parameters that impact retrieval precision.
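As one possible illustration (LangChain's RecursiveCharacterTextSplitter is just one of several chunking options; the project itself ships a simple word-based splitter, shown in Section 3.1), chunking with the same default sizes might look like this sketch:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Placeholder document; in the project, files are read from the data directory.
long_document_text = "RAG systems retrieve relevant context before generation. " * 50

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,     # approximate characters per chunk
    chunk_overlap=150,  # characters shared between adjacent chunks
)
chunks = splitter.split_text(long_document_text)
print(len(chunks), "chunks")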
2.2 Embeddings and Vector Storage
We use a Sentence-Transformers model, "sentence-transformers/all-MiniLM-L6-v2" by default, to convert text chunks into dense numerical vectors. This embedding model offers a good balance between speed and semantic fidelity, making it suitable for general knowledge retrieval tasks.
These embeddings are stored in a ChromaDB persistent collection for efficient similarity search.
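For intuition, the minimal sketch below shows what the embedding step produces with this model; all-MiniLM-L6-v2 maps every chunk to a 384-dimensional vector, and the example sentences are placeholders:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
vectors = model.encode([
    "ChromaDB stores dense vectors.",          # placeholder chunks
    "RAG grounds LLM answers in documents.",
])
print(vectors.shape)  # (2, 384): one 384-dimensional vector per chunk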
2.3 Retrieval
Upon user query:
The query is embedded using the same model
The vector store returns the most relevant chunks
These chunks are aggregated into a context string
This mechanism bridges the gap between static LLM knowledge and dynamic, user-provided content.
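Under the hood this is nearest-neighbor search over embeddings. The short sketch below makes the idea concrete with cosine similarity; ChromaDB performs the equivalent search internally, and the example strings are placeholders:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
query_vec = model.encode(["How are documents chunked?"])
chunk_vecs = model.encode([
    "Documents are split into overlapping chunks.",   # placeholder chunks
    "The API streams tokens to the client.",
])
print(util.cos_sim(query_vec, chunk_vecs))  # higher score = more relevant chunk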
2.4 Prompt Engineering
Quality of responses in RAG heavily depends on prompt design. The retrieved context is interpolated into a template that clearly instructs the LLM to use the relevant passages. Effective prompt design includes the elements below, illustrated in the short sketch that follows this list:
Introduction of the role (“You are a helpful assistant…”)
Context blocks with clear separators
Direct question phrasing
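The full template used by the assistant appears in Section 3.2. As a minimal, illustrative sketch (the variable name PROMPT_TEMPLATE and the separators are arbitrary choices, not part of the project), these three elements might be combined as:
# Minimal prompt skeleton: role, clearly separated context block, direct question.
PROMPT_TEMPLATE = """You are a helpful assistant. Answer using ONLY the context below.

--- CONTEXT ---
{context}
--- END CONTEXT ---

Question: {question}
Answer:"""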
Below are the key parts of the implementation:
3.1 Vector Database Wrapper (vectordb.py)
import os
from typing import List

import chromadb
from sentence_transformers import SentenceTransformer


class VectorDB:
    """
    Vector database wrapper using ChromaDB + HuggingFace embeddings,
    optimized for Retrieval-Augmented Generation (RAG).
    """

    def __init__(self, collection_name: str = None, embedding_model: str = None):
        """
        Initialize the vector database.

        Args:
            collection_name: Name of the ChromaDB collection
            embedding_model: HuggingFace model name for embeddings
        """
        self.collection_name = collection_name or os.getenv(
            "CHROMA_COLLECTION_NAME", "rag_documents"
        )
        self.embedding_model_name = embedding_model or os.getenv(
            "EMBEDDING_MODEL", "sentence-transformers/all-MiniLM-L6-v2"
        )

        # Initialize ChromaDB client
        self.client = chromadb.PersistentClient(path="./chroma_db")

        # Load embedding model
        print(f"Loading embedding model: {self.embedding_model_name}")
        self.embedding_model = SentenceTransformer(self.embedding_model_name)

        # Get or create collection
        self.collection = self.client.get_or_create_collection(
            name=self.collection_name,
            metadata={"description": "RAG document chunks"},
        )
        print(f"VectorDB ready with collection: {self.collection_name}")
    def chunk_text(
        self, text: str, chunk_size: int = 800, overlap: int = 150
    ) -> List[str]:
        """
        Chunk text with overlap to preserve semantic continuity.

        Args:
            text: Input document
            chunk_size: Approximate characters per chunk
            overlap: Overlapping characters between chunks

        Returns:
            List of text chunks
        """
        # Several chunking strategies are possible; simple word-based splitting
        # (Option 1) is implemented below.
        #
        # OPTION 1: Simple word-based splitting
        #   - Split text by spaces and group words into chunks of ~chunk_size characters
        # OPTION 2: LangChain's RecursiveCharacterTextSplitter
        #   - from langchain_text_splitters import RecursiveCharacterTextSplitter
        #   - Automatically handles sentence boundaries and preserves context better
        # OPTION 3: Semantic splitting (advanced)
        #   - Split by sentences using nltk or spacy, group semantically related
        #     sentences, and respect paragraph boundaries and document structure
        #
        # Feel free to try different approaches and see what works best!
        words = text.split()
        chunks = []
        start = 0
        while start < len(words):
            current_chunk = []
            current_len = 0
            end = start
            for i in range(start, len(words)):
                current_chunk.append(words[i])
                current_len += len(words[i]) + 1
                end = i + 1
                if current_len >= chunk_size:
                    break
            chunks.append(" ".join(current_chunk))
            if end >= len(words):
                break
            # Step back by roughly `overlap` characters (~5 characters per word),
            # but always advance past the previous start to avoid an infinite loop.
            start = max(end - overlap // 5, start + 1)
        return chunks
    def add_documents(self, documents: List[str]) -> None:
        """
        Ingest documents into the vector store.
        """
        if not documents:
            return
        if self.collection.count() > 0:
            print("Collection already populated. Skipping ingestion.")
            return

        all_chunks, metadatas, ids = [], [], []
        for doc_id, doc in enumerate(documents):
            chunks = self.chunk_text(doc)
            for idx, chunk in enumerate(chunks):
                all_chunks.append(chunk)
                metadatas.append(
                    {"doc_id": doc_id, "chunk_id": idx, "source": "local"}
                )
                ids.append(f"doc_{doc_id}_chunk_{idx}")

        embeddings = self.embedding_model.encode(
            all_chunks, show_progress_bar=True
        )
        self.collection.add(
            documents=all_chunks,
            metadatas=metadatas,
            embeddings=embeddings.tolist(),  # plain lists for ChromaDB
            ids=ids,
        )
        print(f"Ingested {len(all_chunks)} chunks into VectorDB")
    def search(self, query: str, n_results: int = 5) -> List[str]:
        """
        Retrieve top-k relevant chunks for a query.
        """
        if not query.strip():
            return []
        query_embedding = self.embedding_model.encode([query])
        results = self.collection.query(
            query_embeddings=query_embedding.tolist(),
            n_results=n_results,
        )
        docs = results.get("documents", [])
        return docs[0] if docs and docs[0] else []
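A typical end-to-end use of this wrapper looks like the sketch below (the document strings and the question are placeholders):
db = VectorDB()
db.add_documents([
    "ChromaDB persists document embeddings on disk.",              # placeholder documents
    "Sentence-Transformers produces dense vectors for retrieval.",
])
top_chunks = db.search("How are embeddings stored?", n_results=3)
print(top_chunks)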
The embedding model selection (configurable via environment variable) allows experimentation with different vector sizes and semantic properties, essential for performance evaluation and system tuning.
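Since the project loads configuration with python-dotenv, a .env file is the natural place for these settings. The variable names below come from the code above and from Section 3.2; the values are only examples:
# .env (example values)
CHROMA_COLLECTION_NAME=rag_documents
EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
OPENAI_API_KEY=...            # or GROQ_API_KEY / GOOGLE_API_KEY (see Section 3.2)
OPENAI_MODEL=gpt-4o-mini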
3.2 RAG Assistant (app.py)
import os
from typing import List

from dotenv import load_dotenv
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI
from langchain_groq import ChatGroq
from langchain_google_genai import ChatGoogleGenerativeAI

from vectordb import VectorDB

load_dotenv()


def load_documents() -> List[str]:
    """
    Load text documents from the data directory.
    """
    data_dir = os.path.join(os.path.dirname(__file__), "../data")
    if not os.path.exists(data_dir):
        raise FileNotFoundError(f"Missing data directory: {data_dir}")
    documents = []
    for file in os.listdir(data_dir):
        if file.endswith(".txt"):
            with open(os.path.join(data_dir, file), encoding="utf-8") as f:
                text = f.read().strip()
                if text:
                    documents.append(text)
    return documents
class RAGAssistant:
    """
    Document-grounded RAG Assistant.
    Answers questions strictly based on indexed documents.
    """

    def __init__(self):
        self.llm = self._initialize_llm()
        self.vector_db = VectorDB()
        self.prompt = ChatPromptTemplate.from_template(
            """
            You are a document-grounded AI assistant.
            Answer the question ONLY using the provided context.
            If the answer cannot be found in the context, say:
            "I don't know based on the provided documents."

            Context:
            {context}

            Question:
            {question}

            Answer:
            """
        )
        self.chain = self.prompt | self.llm | StrOutputParser()
        print("RAG Assistant initialized")

    def _initialize_llm(self):
        if os.getenv("OPENAI_API_KEY"):
            return ChatOpenAI(
                model=os.getenv("OPENAI_MODEL", "gpt-4o-mini"),
                temperature=0.0,
            )
        if os.getenv("GROQ_API_KEY"):
            return ChatGroq(
                model=os.getenv("GROQ_MODEL", "llama-3.1-8b-instant"),
                temperature=0.0,
            )
        if os.getenv("GOOGLE_API_KEY"):
            return ChatGoogleGenerativeAI(
                model=os.getenv("GOOGLE_MODEL", "gemini-2.0-flash"),
                temperature=0.0,
            )
        raise ValueError("No valid LLM API key found")

    def preprocess_query(self, query: str) -> str:
        """
        Normalize and clean user queries.
        """
        return query.strip().lower()

    def add_documents(self, documents: List[str]) -> None:
        self.vector_db.add_documents(documents)

    def invoke(self, question: str, n_results: int = 4) -> str:
        """
        Execute RAG pipeline.
        """
        question = self.preprocess_query(question)
        context_chunks = self.vector_db.search(question, n_results)
        if not context_chunks:
            return "I don't know based on the provided documents."
        context = "\n\n".join(context_chunks)
        return self.chain.invoke(
            {"context": context, "question": question}
        )
def main():
    try:
        assistant = RAGAssistant()
        docs = load_documents()
        assistant.add_documents(docs)
        while True:
            q = input("Ask a question (or 'quit'): ")
            if q.lower() == "quit":
                break
            print("\nAnswer:")
            print(assistant.invoke(q))
            print("-" * 60)
    except Exception as e:
        print(f"Error: {e}")


if __name__ == "__main__":
    main()
3.3 FastAPI Server (api.py)
import asyncio
from typing import AsyncGenerator

from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

from app import RAGAssistant, load_documents

app = FastAPI(
    title="RAG Assistant API",
    description="Streaming RAG API using FastAPI + ChromaDB",
    version="1.0.0",
)

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"],
)

# Initialize the assistant and index the local documents at startup
assistant = RAGAssistant()
documents = load_documents()
assistant.add_documents(documents)


class ChatRequest(BaseModel):
    question: str
    top_k: int = 4


@app.get("/health")
def health():
    return {"status": "ok"}
async def stream_answer(question: str, top_k: int) -> AsyncGenerator[str, None]:
    """
    Token-level streaming generator.
    """
    context_chunks = assistant.vector_db.search(question, top_k)
    if not context_chunks:
        yield "I don't know based on the provided documents."
        return
    context = "\n\n".join(context_chunks)
    prompt_input = {
        "context": context,
        "question": question,
    }
    # Streaming from LangChain-compatible LLM
    async for chunk in assistant.llm.astream(
        assistant.prompt.format(**prompt_input)
    ):
        if chunk.content:
            yield chunk.content
        await asyncio.sleep(0)  # cooperative multitasking


@app.post("/chat/stream")
async def chat_stream(request: ChatRequest):
    if not request.question.strip():
        raise HTTPException(status_code=400, detail="Question cannot be empty")
    return StreamingResponse(
        stream_answer(request.question, request.top_k),
        media_type="text/plain",
    )
This endpoint continuously streams responses from the LLM as they are generated, providing a responsive user experience similar to conversational chat interfaces.
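To try the endpoint locally, one can start the server with uvicorn and consume the stream with a small client. The sketch below assumes the default host and port and uses a placeholder question:
# Minimal streaming client sketch. Assumes the API was started with
# "uvicorn api:app --reload" and is reachable at http://localhost:8000.
import requests

with requests.post(
    "http://localhost:8000/chat/stream",
    json={"question": "What do the documents say about chunking?", "top_k": 4},
    stream=True,
) as resp:
    resp.raise_for_status()
    for piece in resp.iter_content(chunk_size=None, decode_unicode=True):
        print(piece, end="", flush=True)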
To evaluate the effectiveness of the Retrieval-Augmented Generation (RAG) system, we conducted qualitative and functional assessments focused on retrieval accuracy, response relevance, and system responsiveness. Since the primary goal of this project is to demonstrate a production-ready RAG architecture rather than benchmark a specific dataset, the evaluation emphasizes real-world usability and architectural robustness.
4.1 Retrieval Quality
The vector database powered by ChromaDB and Sentence-Transformers embeddings consistently retrieved semantically relevant document chunks for a wide range of user queries. Queries that closely matched the indexed document content produced highly accurate contextual retrieval, enabling the LLM to generate grounded and fact-consistent responses.
Key observations:
Relevant chunks were typically retrieved within the top-3 to top-5 results
Chunk overlap significantly improved retrieval continuity for long documents
Irrelevant hallucinations were reduced when the prompt explicitly constrained the model to the retrieved context
4.2 Response Relevance and Grounding
Responses generated by the system demonstrated strong alignment with the provided context. The prompt structure—clearly separating context, instructions, and user question—played a critical role in guiding the LLM toward grounded generation.
Compared to a baseline LLM response without retrieval:
Answers were more fact-specific
References to document content were more precise
The system avoided introducing unsupported claims
This confirms that effective prompt engineering combined with high-quality embeddings substantially improves answer reliability.
4.3 Streaming Performance
The FastAPI streaming endpoint successfully delivered token-level responses, improving perceived latency and user experience. Users received immediate partial outputs rather than waiting for full completion, making the system feel responsive and interactive.
Performance characteristics:
Streaming began within milliseconds after prompt submission
No blocking behavior observed during vector retrieval or embedding inference
Suitable for chat-based or real-time assistant interfaces
4.4 Embedding Model Impact
The selected embedding model (all-MiniLM-L6-v2) provided a strong balance between:
Semantic accuracy
Low latency
Modest memory footprint
While larger embedding models may yield marginal retrieval improvements, the chosen model proved effective for general-purpose document retrieval and is well-suited for scalable RAG applications.
4.5 System Robustness
The modular design of vectordb.py and app.py enabled:
Easy swapping of embedding models
Flexible chunking strategies
Straightforward integration of evaluation metrics in future iterations
This architectural flexibility makes the system adaptable to various domains such as customer support, research assistance, and internal knowledge bases.
Choosing an appropriate embedding model is vital for retrieval quality because:
Smaller models may be faster but less semantically accurate
Larger models capture nuanced meaning but require more compute
Beyond the default "all-MiniLM-L6-v2" used in this project, OpenAI's "text-embedding-3-small" or other Sentence-Transformers models such as "all-mpnet-base-v2" are suitable starting points. When experimenting, measure retrieval precision, recall, and response coherence for your domain. Hybrid retrieval strategies (dense + keyword) can further improve robustness.
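One lightweight way to quantify this, sketched below, is precision and recall at k over a small hand-labeled query set; labeled_queries and retrieve_ids are hypothetical placeholders (retrieve_ids could, for example, wrap the ChromaDB query and return chunk IDs):
# Sketch: precision@k / recall@k over a small hand-labeled query set.
# labeled_queries maps each test query to the set of chunk IDs judged relevant;
# retrieve_ids(query, k) is a hypothetical helper returning the top-k retrieved IDs.
def precision_recall_at_k(labeled_queries, retrieve_ids, k=5):
    precisions, recalls = [], []
    for query, relevant in labeled_queries.items():
        retrieved = set(retrieve_ids(query, k))
        hits = len(retrieved & relevant)
        precisions.append(hits / k)
        recalls.append(hits / len(relevant) if relevant else 0.0)
    n = len(labeled_queries)
    return sum(precisions) / n, sum(recalls) / n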
Typical direct applications of this RAG architecture include:
Document-specific QA tools
Research assistants
Help desk automation
Domain-specific chatbots
By grounding responses in an external corpus, such systems overcome the limitations of standalone LLMs and provide verifiable information rather than hallucinated text.
While the current implementation is robust for small-to-medium scale corpora, scaling to large knowledge bases requires:
Distributed vector indexes
Hybrid retrieval with sparse signals (BM25)
Cross-encoder reranking
Additionally, evaluation metrics such as Precision and Recall should be integrated, and visual diagrams can further clarify high-level architecture and workflows in future publications.
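As an illustration of the cross-encoder reranking mentioned above, a common pattern is to over-retrieve with the dense index and then re-score the candidates with a cross-encoder before building the prompt. The sketch below uses a widely available Sentence-Transformers cross-encoder; the model choice and the rerank helper are assumptions, not part of the current project:
# Sketch: rerank dense-retrieval candidates with a cross-encoder.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidate_chunks, top_k=4):
    # Score every (query, chunk) pair and keep the highest-scoring chunks.
    scores = reranker.predict([(query, chunk) for chunk in candidate_chunks])
    ranked = sorted(zip(candidate_chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]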
This publication presented a practical, modular RAG assistant combining vector indexing and streaming LLM responses. By focusing on embedding strategies, robust prompt templates, and an API-first architecture, this project serves as a foundation for research or production-grade RAG applications.
We encourage further enhancements such as retrieval evaluation loops, hybrid search methods, and more sophisticated prompt refinement.