Tags: rag langchain chromadb cultural-heritage retrieval

This guide demonstrates how to build, tune, and validate HeritageQuery, a lightweight retrieval-augmented generation (RAG) assistant that preserves Ethiopian history and culture. It walks through data ingestion, recursive chunking experiments, ChromaDB persistence, SentenceTransformer embeddings, and multi-provider LLM routing so you can reproduce the workflow and extend the assistant for your own heritage datasets.
In this tutorial, you will learn how to stand up a question-answering pipeline that keeps Ethiopian cultural narratives accurate by combining a curated text corpus, ChromaDB vector search, SentenceTransformer embeddings, and LangChain-powered dialog models (OpenAI, Groq, or Google Gemini). We cover repository structure, key design choices, and hands-on tuning so you can adapt HeritageQuery to new regional archives without starting from scratch.
- Install dependencies: `pip install -r requirements.txt`
- Set `OPENAI_API_KEY`, `GROQ_API_KEY`, or `GOOGLE_API_KEY` in `.env`
- Download `sentence-transformers/all-MiniLM-L6-v2` locally

HeritageQuery persists embeddings to `./chroma_db`, so you can stop/start the CLI without re-indexing as long as `CHROMA_COLLECTION_NAME` stays constant.
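A minimal `.env` for this setup might look like the following (the values are placeholders and the collection name is an assumption, not taken from the repository; only one provider key is required):

```shell
OPENAI_API_KEY=your-key-here
CHROMA_COLLECTION_NAME=heritage_docs
```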
Source material lives in data/*.txt, each with YAML-like headers providing title, canonical source, and topical tags. The loader (src/app.py::load_documents) enforces UTF-8 reads, strips the header, and attaches metadata extracted via extract_metadata:
```python
def extract_metadata(text: str) -> dict:
    lines = text.split("\n")
    return {
        "title": lines[1].strip(),
        "source": lines[2].strip(),
        "topics": ", ".join(t.strip() for t in lines[3].split(",")),
    }
```
By keeping topics as a comma-separated string, you can facet downstream analytics while still treating the field as free text in Chroma metadata filters.
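Because `topics` is stored as one comma-separated string, downstream faceting reduces to a split-and-count. A minimal sketch (the helper name and sample records here are illustrative, not from the repository):

```python
from collections import Counter

def facet_topics(metadatas: list[dict]) -> Counter:
    """Count topic occurrences across document metadata records."""
    counts = Counter()
    for meta in metadatas:
        for topic in meta.get("topics", "").split(","):
            topic = topic.strip()
            if topic:
                counts[topic] += 1
    return counts

# Illustrative records mimicking extract_metadata output
records = [
    {"topics": "architecture, religion"},
    {"topics": "religion, ritual"},
]
print(facet_topics(records).most_common(1))  # → [('religion', 2)]
```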
The cultural texts mix long-form history (Lalibela churches) and concise rituals (coffee ceremony), so a single chunk recipe rarely works. I went back and forth between chunk_size=500, chunk_overlap=50 and the tighter chunk_size=300, chunk_overlap=150 that ships in VectorDB.chunk_text. Keywords such as “Biete Medhane Alem,” “Treaty of Wuchale,” and “abol/tona/baraka” acted as probes: if they straddled chunk boundaries, I increased overlap; if embeddings became redundant, I shortened chunks. Settling on 300/150 kept architectural descriptions intact without overloading the encoder.
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=150,
)
chunks = splitter.split_text(text)
```
Run an automated script that sweeps these (chunk_size, chunk_overlap) pairs, and inspect which chunks contain your probe terms before committing to a full re-embed.
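One way to automate the sweep without touching the real embedder is a plain character chunker plus a probe check; a hedged sketch (the helper names and the stand-in corpus are mine, not from src/app.py):

```python
def chunk(text: str, size: int, overlap: int) -> list[str]:
    """Naive character chunker with overlap (stand-in for the LangChain splitter)."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def probe_report(text: str, probes: list[str], pairs: list[tuple[int, int]]) -> dict:
    """For each (chunk_size, chunk_overlap) pair, count probes kept intact in some chunk."""
    report = {}
    for size, overlap in pairs:
        chunks = chunk(text, size, overlap)
        report[(size, overlap)] = sum(any(p in c for c in chunks) for p in probes)
    return report

text = "The Treaty of Wuchale " * 40  # illustrative corpus stand-in
print(probe_report(text, ["Treaty of Wuchale"], [(500, 50), (300, 150)]))
```

A pair that drops a probe below 100% coverage is a signal to raise the overlap before re-embedding the full corpus.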
VectorDB.add_documents batches every chunk, encodes them with SentenceTransformer.encode, and writes to a persistent Chroma collection:
```python
self.collection.add(
    ids=ids,
    documents=documents_to_add,
    metadatas=metadatas,
    embeddings=self.embedding_model.encode(documents_to_add),
)
```
Store path defaults to ./chroma_db; change it via the PersistentClient constructor if you need a network volume. Keeping deterministic embeddings makes regression testing straightforward.
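Deterministic IDs pair well with deterministic embeddings: hashing each chunk's text yields stable IDs across re-runs, so a regression diff only flags chunks whose content actually changed. A sketch under that assumption (this helper is mine, not part of VectorDB):

```python
import hashlib

def chunk_id(text: str, prefix: str = "heritage") -> str:
    """Derive a stable ID from chunk content so re-indexing is idempotent."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
    return f"{prefix}-{digest}"

ids = [chunk_id(c) for c in ["Biete Medhane Alem ...", "abol/tona/baraka ..."]]
assert len(set(ids)) == len(ids)                       # distinct content → distinct IDs
assert chunk_id("same text") == chunk_id("same text")  # stable across runs
```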
RAGAssistant._initialize_llm lazily selects OpenAI, Groq, or Gemini models based on available keys and enforces temperature=0 for reproducible answers. The prompt template is concise and constrains responses to retrieved evidence:
```python
self.prompt_template = ChatPromptTemplate.from_template("""
You are an assistant that answers questions about Ethiopian historical
and cultural features. Use ONLY the provided context...
If the answer is not explicitly stated...say:
"I do not have enough information in the provided documents."
""")
```
When invoking, the assistant concatenates retrieved chunks as From <title>: blocks, reinforcing provenance in every answer.
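That provenance-preserving concatenation can be sketched like so (the function name, signature, and record shape are my guesses at what the assistant does internally, not copied from the source):

```python
def build_context(chunks: list[dict]) -> str:
    """Join retrieved chunks into 'From <title>:' blocks so answers cite their source."""
    return "\n\n".join(
        f"From {c['metadata']['title']}:\n{c['document']}" for c in chunks
    )

retrieved = [
    {"metadata": {"title": "Lalibela Churches"},
     "document": "Biete Medhane Alem is carved from a single rock."},
]
print(build_context(retrieved))
```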
Launch python src/app.py, load documents (assistant.add_documents(sample_docs)), and query inside the REPL. Answers print alongside the supporting context for manual validation. This simple loop doubles as a smoke test for both embedding quality and LLM routing.
- `all-MiniLM-L6-v2` balances speed and semantic recall on CPUs while keeping the embedding dimension low enough for cultural corpora under ~1 MB.
- The provider fallback chain (`OPENAI` ➜ `GROQ` ➜ `GOOGLE`) lets the same CLI run wherever a key is available.
- Zero-temperature decoding, printed context blocks, and transparent metadata (title/source/topics) make it clear what evidence backs each answer.
One limitation: the first run must download models from `sentence-transformers`; offline deployments need additional work.

HeritageQuery shows how a compact Python stack can safeguard cultural context by pairing curated documents with disciplined RAG engineering. By iterating on chunk sizes, enforcing deterministic prompts, and exposing provenance, the assistant becomes a trustworthy baseline for museums, tour operators, and educators focused on Ethiopian heritage.
For support, open an issue in this repository or email the maintainer. Contributions and dataset suggestions are welcome via pull requests.