Tags: rag langchain chromadb cultural-heritage retrieval

This guide demonstrates how to build, tune, and validate HeritageQuery, a lightweight retrieval-augmented generation (RAG) assistant that preserves Ethiopian history and culture. It walks through data ingestion, recursive chunking experiments, ChromaDB persistence, SentenceTransformer embeddings, and multi-provider LLM routing so you can reproduce the workflow and extend the assistant for your own heritage datasets.
In this tutorial, you will learn how to stand up a question-answering pipeline that keeps Ethiopian cultural narratives accurate by combining a curated text corpus, ChromaDB vector search, SentenceTransformer embeddings, and LangChain-powered dialog models (OpenAI, Groq, or Google Gemini). We cover repository structure, key design choices, and hands-on tuning so you can adapt HeritageQuery to new regional archives without starting from scratch.
- Install dependencies: `pip install -r requirements.txt`
- Set `OPENAI_API_KEY`, `GROQ_API_KEY`, or `GOOGLE_API_KEY` in `.env`
- Download `sentence-transformers/all-MiniLM-L6-v2` locally

HeritageQuery persists embeddings to `./chroma_db`, so you can stop/start the CLI without re-indexing as long as `CHROMA_COLLECTION_NAME` stays constant.
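A minimal `.env` for this setup might look like the following (the values are placeholders and the collection name is an assumption, not taken from the repository; only one provider key is required):

```shell
OPENAI_API_KEY=your-key-here
CHROMA_COLLECTION_NAME=heritage_docs
```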
Source material lives in data/*.txt, each with YAML-like headers providing title, canonical source, and topical tags. The loader (src/app.py::load_documents) enforces UTF-8 reads, strips the header, and attaches metadata extracted via extract_metadata:
```python
def extract_metadata(text: str) -> dict:
    lines = text.split("\n")
    return {
        "title": lines[1].strip(),
        "source": lines[2].strip(),
        "topics": ", ".join(t.strip() for t in lines[3].split(",")),
    }
```
By keeping topics as a comma-separated string, you can facet downstream analytics while still treating the field as free text in Chroma metadata filters.
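Because `topics` is stored as one comma-separated string, downstream faceting reduces to a split-and-count. A minimal sketch (the helper name and sample records here are illustrative, not from the repository):

```python
from collections import Counter

def facet_topics(metadatas: list[dict]) -> Counter:
    """Count topic occurrences across document metadata records."""
    counts = Counter()
    for meta in metadatas:
        for topic in meta.get("topics", "").split(","):
            topic = topic.strip()
            if topic:
                counts[topic] += 1
    return counts

# Illustrative records mimicking extract_metadata output
records = [
    {"topics": "architecture, religion"},
    {"topics": "religion, ritual"},
]
print(facet_topics(records).most_common(1))  # → [('religion', 2)]
```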
The cultural texts mix long-form history (Lalibela churches) and concise rituals (coffee ceremony), so a single chunk recipe rarely works. I went back and forth between chunk_size=500, chunk_overlap=50 and the tighter chunk_size=300, chunk_overlap=150 that ships in VectorDB.chunk_text. Keywords such as “Biete Medhane Alem,” “Treaty of Wuchale,” and “abol/tona/baraka” acted as probes: if they straddled chunk boundaries, I increased overlap; if embeddings became redundant, I shortened chunks. Settling on 300/150 kept architectural descriptions intact without overloading the encoder.
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=150,
)
chunks = splitter.split_text(text)
```
Run an automated script that sweeps these (chunk_size, chunk_overlap) pairs, and inspect which chunks contain your probe terms before committing to a full re-embed.
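One way to automate the sweep without touching the real embedder is a plain character chunker plus a probe check; a hedged sketch (the helper names and the stand-in corpus are mine, not from src/app.py):

```python
def chunk(text: str, size: int, overlap: int) -> list[str]:
    """Naive character chunker with overlap (stand-in for the LangChain splitter)."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def probe_report(text: str, probes: list[str], pairs: list[tuple[int, int]]) -> dict:
    """For each (chunk_size, chunk_overlap) pair, count probes kept intact in some chunk."""
    report = {}
    for size, overlap in pairs:
        chunks = chunk(text, size, overlap)
        report[(size, overlap)] = sum(any(p in c for c in chunks) for p in probes)
    return report

text = "The Treaty of Wuchale " * 40  # illustrative corpus stand-in
print(probe_report(text, ["Treaty of Wuchale"], [(500, 50), (300, 150)]))
```

A pair that drops a probe below 100% coverage is a signal to raise the overlap before re-embedding the full corpus.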
VectorDB.add_documents batches every chunk, encodes them with SentenceTransformer.encode, and writes to a persistent Chroma collection:
```python
self.collection.add(
    ids=ids,
    documents=documents_to_add,
    metadatas=metadatas,
    embeddings=self.embedding_model.encode(documents_to_add),
)
```
Store path defaults to ./chroma_db; change it via the PersistentClient constructor if you need a network volume. Keeping deterministic embeddings makes regression testing straightforward.
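Deterministic IDs pair well with deterministic embeddings: hashing each chunk's text yields stable IDs across re-runs, so a regression diff only flags chunks whose content actually changed. A sketch under that assumption (this helper is mine, not part of VectorDB):

```python
import hashlib

def chunk_id(text: str, prefix: str = "heritage") -> str:
    """Derive a stable ID from chunk content so re-indexing is idempotent."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
    return f"{prefix}-{digest}"

ids = [chunk_id(c) for c in ["Biete Medhane Alem ...", "abol/tona/baraka ..."]]
assert len(set(ids)) == len(ids)                       # distinct content → distinct IDs
assert chunk_id("same text") == chunk_id("same text")  # stable across runs
```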
RAGAssistant._initialize_llm lazily selects OpenAI, Groq, or Gemini models based on available keys and enforces temperature=0 for reproducible answers. The prompt template is concise and constrains responses to retrieved evidence:
```python
self.prompt_template = ChatPromptTemplate.from_template("""
You are an assistant that answers questions about Ethiopian historical
and cultural features. Use ONLY the provided context...
If the answer is not explicitly stated...say:
"I do not have enough information in the provided documents."
""")
```
When invoking, the assistant concatenates retrieved chunks as From <title>: blocks, reinforcing provenance in every answer.
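That provenance-preserving concatenation can be sketched like so (the function name, signature, and record shape are my guesses at what the assistant does internally, not copied from the source):

```python
def build_context(chunks: list[dict]) -> str:
    """Join retrieved chunks into 'From <title>:' blocks so answers cite their source."""
    return "\n\n".join(
        f"From {c['metadata']['title']}:\n{c['document']}" for c in chunks
    )

retrieved = [
    {"metadata": {"title": "Lalibela Churches"},
     "document": "Biete Medhane Alem is carved from a single rock."},
]
print(build_context(retrieved))
```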
Launch python src/app.py, load documents (assistant.add_documents(sample_docs)), and query inside the REPL. Answers print alongside the supporting context for manual validation. This simple loop doubles as a smoke test for both embedding quality and LLM routing.
- `all-MiniLM-L6-v2` balances speed and semantic recall on CPUs while keeping the embedding dimension low enough for cultural corpora under ~1 MB.
- The provider fallback chain (`OPENAI` ➜ `GROQ` ➜ `GOOGLE`) lets the same CLI run wherever a key is available.
- Zero-temperature decoding, printed context blocks, and transparent metadata (title/source/topics) make it clear what evidence backs each answer.
One limitation: the first run must download models from `sentence-transformers`; offline deployments need additional work.

HeritageQuery shows how a compact Python stack can safeguard cultural context by pairing curated documents with disciplined RAG engineering. By iterating on chunk sizes, enforcing deterministic prompts, and exposing provenance, the assistant becomes a trustworthy baseline for museums, tour operators, and educators focused on Ethiopian heritage.
For support, open an issue in this repository or email the maintainer. Contributions and dataset suggestions are welcome via pull requests.