This document describes the design of the Study RAG Assistant for
evaluators reviewing the implementation. It explains what the system
does, how it is structured, the specific design decisions that were
made, and the trade-offs behind them. Claims are anchored to source
files so the reviewer can verify each one directly.
The Study RAG Assistant is a conversational document-grounded
question-answering system. A user types a question in natural language;
the system retrieves the most semantically similar chunks from an
indexed corpus, feeds those chunks plus conversation history to a Large
Language Model under strict prompt-level constraints, and streams a
grounded answer back to the user.
Three properties separate this implementation from a minimal RAG
demonstration:
The implementation comprises roughly 730 lines of Python across five
modules and about 135 lines of YAML across two config files, going by
the per-module counts in the module table.
                       +--------------------------+
                       |       app.py (CLI)       |
                       |  REPL, streaming, /cmds  |
                       +-----------+--------------+
                                   |
                                   v
    +------------------------------+-----------------------------+
    |                        RAGAssistant                        |
    |  - rewriter chain   (system + history -> standalone query) |
    |  - answerer chain   (system + history + context + q)       |
    |  - summarizer chain (system + prev summary + exchanges)    |
    |  - memory state     (summary string + recent message buf)  |
    +------+--------------+------------+--------------+----------+
           |              |            |              |
           v              v            v              v
    +-------------+  +---------+  +---------+  +-------------+
    |  VectorDB   |  |  YAML   |  | Prompt  |  | LLM client  |
    | (ChromaDB)  |  | configs |  | builder |  | (OpenAI /   |
    |             |  |         |  |         |  |  Groq /     |
    | embedding   |  |         |  |         |  |  Gemini)    |
    | + retrieve  |  |         |  |         |  |             |
    +-------------+  +---------+  +---------+  +-------------+

| Module | Lines | Responsibility |
|---|---|---|
| src/app.py | ~220 | CLI shell. Banner, REPL, slash commands, streaming output, spinner, signal handling. No domain logic. |
| src/ragassistant.py | ~230 | Orchestrator. Owns the three LCEL chains, memory state, and the query/query_stream API. |
| src/vectordb.py | ~115 | ChromaDB wrapper. Document ingestion (chunking + embedding) and similarity search. |
| src/prompt_builder.py | ~130 | Pure-functional prompt construction from config dicts. No LLM coupling. |
| src/utils.py | ~35 | Two helpers: load_documents (filesystem -> LangChain Documents) and _load_yaml. |
| config/prompt_config.yml | ~85 | Three prompt blocks: rag_prompt_cfg, summarizer_prompt_cfg, rewriter_prompt_cfg. |
| config/config.yml | ~50 | App-level config: reasoning strategies, memory policy. |

This section traces a single user query end-to-end. Subsequent sections
go deeper on each component.

First turn (no history):

    1. User input arrives at the CLI REPL.
       File: src/app.py :: repl()
    2. Slash command? -> Handled locally, no LLM call.
       Otherwise      -> Call assistant.query_stream(question).
    3. Build history list (returns []).
       File: src/ragassistant.py :: _history_messages()
    4. SKIP rewriter (no history; raw question is already standalone).
    5. Retrieve top-3 chunks via cosine search.
       File: src/vectordb.py :: search()
    6. Join chunks with "\n\n---\n\n" separator -> context string.
    7. Invoke answering chain. Stream tokens.
       Chain: SystemMessage(from YAML) -> MessagesPlaceholder("history")
              -> HumanMessage("Research Context: ... Question: ...")
              -> LLM -> StrOutputParser
    8. Print tokens as they arrive (CLI).
    9. After stream completes, append (HumanMessage, AIMessage) pair
       to self.recent. Compact if buffer overflowed.

    Total LLM calls: 1 (answerer).

Follow-up turn (history present):

    1-2. Same as above.
    3. Build history list:
         [SystemMessage("Summary of earlier conversation: ...")] +
         [HumanMessage, AIMessage, HumanMessage, AIMessage, ...]
    4. INVOKE rewriter:
         Input:  history + raw question ("tell me more about that")
         Output: standalone search query
                 ("tell me more about quantum entanglement applications")
    5. Retrieve top-3 chunks using the REWRITTEN query.
    6. Same as above.
    7. Invoke answering chain. Chain sees the FULL history + retrieved
       context + ORIGINAL (not rewritten) question.
       Rationale: retrieval needs the explicit query; the answerer
       benefits from the user's actual phrasing for tone matching.
    8-9. Same as above. Possibly triggers summarization (one extra LLM
         call) if recent buffer overflows.

    Total LLM calls: 2-3 (rewriter + answerer + possibly summarizer).

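
The two traces can be condensed into a short control-flow sketch. This is not the code in src/ragassistant.py: FlowSketch and the three stub callables are hypothetical stand-ins, and history is simplified to plain strings instead of LangChain messages. It only illustrates which query goes where and when a turn is recorded.

```python
from typing import Callable, Iterator

SEPARATOR = "\n\n---\n\n"  # chunk separator named in the trace


class FlowSketch:
    """Control flow only; the callables stand in for the real components."""

    def __init__(
        self,
        rewrite: Callable[[list[str], str], str],
        search: Callable[[str], list[str]],
        answer: Callable[[list[str], str, str], Iterator[str]],
    ) -> None:
        self.rewrite = rewrite
        self.search = search
        self.answer = answer
        self.recent: list[str] = []  # simplified history (assumption)

    def query_stream(self, question: str) -> Iterator[str]:
        history = list(self.recent)
        search_query = question
        if history:  # first turn: the raw question is already standalone
            # An empty rewrite falls back to the literal question.
            search_query = self.rewrite(history, question).strip() or question
        chunks = self.search(search_query)  # retrieval uses the rewritten query
        context = SEPARATOR.join(chunks)
        tokens: list[str] = []
        # The answerer receives the ORIGINAL question.
        for token in self.answer(history, context, question):
            tokens.append(token)
            yield token
        # Reached only when the caller consumed the whole stream.
        self.recent += [question, "".join(tokens)]


calls: list[str] = []


def fake_rewrite(history: list[str], question: str) -> str:
    calls.append("rewrite")
    return "quantum entanglement applications"


def fake_search(query: str) -> list[str]:
    calls.append(f"search:{query}")
    return ["chunk A", "chunk B"]


def fake_answer(history: list[str], context: str, question: str) -> Iterator[str]:
    calls.append(f"answer:{question}")
    yield from ["Grounded ", "answer."]


bot = FlowSketch(fake_rewrite, fake_search, fake_answer)
"".join(bot.query_stream("What is quantum entanglement?"))
"".join(bot.query_stream("tell me more about that"))
print(calls)
```

Because recording happens after the last token is yielded, a stream that is abandoned midway (for example on Ctrl+C) leaves self.recent untouched in this sketch.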
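
Implementation: src/vectordb.py

- Embedding model: sentence-transformers/all-MiniLM-L6-v2, with the device selectable through the device= kwarg.
- Store: ChromaDB (PersistentClient at ./chroma_db).
- Chunking: RecursiveCharacterTextSplitter with chunk_size=500, chunk_overlap=200.

Ingestion (add_documents, lines 60-89 of vectordb.py):

- Each document's chunks are embedded in a single encode(chunks) call (one GPU/CPU forward pass per document).
- Chunk IDs take the form {source}_{i}, where source is the document's metadata["source"] value and i is the chunk index. The ID falls back to a value derived from collection.count() if source is missing.
- Chunks are written with one collection.add(...) call per document, reducing the number of calls to the store.

Search (search, lines 91-115 of vectordb.py):

- The query is embedded with encode([query]).
- Results are requested with include=["documents", "metadatas", "distances"] (ids is always returned and cannot be in include).
- The [[...]] nesting Chroma returns (it supports several queries per call) is flattened to a single result list.

Design decisions:
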
Implementation: src/utils.py :: load_documents
Walks ./data/, loads every .txt file via LangChain's TextLoader,
and returns a list of Document objects. Document objects retain a
metadata["source"] field pointing back to the file, which is used
later for chunk-ID generation and for the /topics CLI command.
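
How metadata["source"] feeds chunk-ID generation can be sketched as follows. The {source}_{i} pattern and the count-based fallback come from the vectordb.py description; the helper name make_chunk_ids and the exact fallback format (doc_N) are assumptions made for illustration.

```python
from typing import Optional


def make_chunk_ids(source: Optional[str], n_chunks: int, existing_count: int) -> list[str]:
    """Return one ID per chunk: "{source}_{i}", or a count-based fallback."""
    if source:
        return [f"{source}_{i}" for i in range(n_chunks)]
    # Assumed fallback format; only its dependence on the current
    # collection size is taken from the description.
    return [f"doc_{existing_count + i}" for i in range(n_chunks)]


print(make_chunk_ids("data/quantum.txt", 3, existing_count=0))
# ['data/quantum.txt_0', 'data/quantum.txt_1', 'data/quantum.txt_2']
print(make_chunk_ids(None, 2, existing_count=10))
# ['doc_10', 'doc_11']
```

Deterministic IDs of this kind are what make a repeated ingestion of the same file collide instead of silently duplicating chunks.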
The ingestion path is idempotent against re-runs in app.py:

    existing = assistant.vector_db.collection.count()
    if existing == 0:
        assistant.add_documents(docs)

If the collection is non-empty, ingestion is skipped. This prevents
duplicate-ID errors on repeated runs without forcing a manual
"delete chroma_db" step.
Limitation: Only .txt is supported. Adding PDF support is a
one-function change: branch on file extension in load_documents and
use PyPDFLoader for .pdf.
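
One way to structure that extension branch is a small dispatch table keyed by file suffix. The sketch below is standard-library only and uses plain dicts in place of LangChain Document objects; load_documents_sketch, load_text, and LOADERS are hypothetical names, and the recursive directory walk is an assumption about how ./data/ is traversed.

```python
import tempfile
from pathlib import Path
from typing import Callable


def load_text(path: Path) -> list[dict]:
    """Read one .txt file into a Document-like dict."""
    return [{
        "page_content": path.read_text(encoding="utf-8"),
        "metadata": {"source": str(path)},
    }]


# Suffix -> loader. A PDF entry would be added here; it could wrap
# PyPDFLoader(str(path)).load() (not wired up in this sketch).
LOADERS: dict[str, Callable[[Path], list[dict]]] = {
    ".txt": load_text,
}


def load_documents_sketch(data_dir: str) -> list[dict]:
    """Load every file under data_dir that has a registered loader."""
    docs: list[dict] = []
    for path in sorted(Path(data_dir).rglob("*")):
        loader = LOADERS.get(path.suffix.lower())
        if path.is_file() and loader is not None:
            docs.extend(loader(path))
    return docs


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "notes.txt").write_text("hello", encoding="utf-8")
    Path(tmp, "slides.pdf").write_bytes(b"%PDF-")  # skipped: no loader registered
    docs = load_documents_sketch(tmp)

print([Path(d["metadata"]["source"]).name for d in docs])  # ['notes.txt']
```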
Implementation: src/prompt_builder.py, config/prompt_config.yml
This is the core extensibility surface. The system never hard-codes a
prompt in Python. Instead, three YAML blocks describe the three chains'
system prompts, and a single function (build_prompt_from_config)
turns each block into a system-message string.
The builder (build_prompt_from_config, ~90 lines):
Pure function. Takes a config dict and an optional app-level config
(for reasoning strategies). Assembles a prompt string by concatenating
labeled sections in a fixed order:
1. role -> "You are <role>."
2. instruction -> "Your task is as follows: <instruction>"
3. context -> "Here's some background that may help you: ..."
4. output_constraints -> "Ensure your response follows these rules: ..."
5. style_or_tone -> "Follow these style and tone guidelines: ..."
6. output_format -> "Structure your response as follows: ..."
7. examples -> few-shot block
8. goal -> "Your goal is to achieve the following outcome: ..."
9. input_data -> the user's content (optional; for runtime use)
10. reasoning_strategy -> splice in a block from app_config
11. final stinger -> "Now perform the task as instructed above."
Only instruction is required; every other field is optional. The
fixed ordering matters on the working assumption that LLMs tend to
weight earlier sections more heavily, which is why role and
instruction come first.
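
The assembly order can be illustrated with a much-reduced sketch. It is not the ~90-line build_prompt_from_config: build_prompt_sketch is a hypothetical name, the label wording for examples and input_data is assumed, and the way a reasoning strategy is looked up (by name under a reasoning_strategies key of the app config) is an assumption based on the config layout described in this document.

```python
from typing import Any, Optional

# Optional sections in their fixed order (after role and instruction).
SECTION_LABELS = [
    ("context", "Here's some background that may help you:"),
    ("output_constraints", "Ensure your response follows these rules:"),
    ("style_or_tone", "Follow these style and tone guidelines:"),
    ("output_format", "Structure your response as follows:"),
    ("examples", "Here are some examples:"),            # wording assumed
    ("goal", "Your goal is to achieve the following outcome:"),
    ("input_data", "Here is the content to work on:"),  # wording assumed
]


def _render(value: Any) -> str:
    """Render a scalar as text and a list as a bullet list."""
    if isinstance(value, (list, tuple)):
        return "\n".join(f"- {item}" for item in value)
    return str(value).strip()


def build_prompt_sketch(config: dict, app_config: Optional[dict] = None) -> str:
    if not config.get("instruction"):
        raise ValueError("'instruction' is the only required field")
    parts = []
    if config.get("role"):
        parts.append(f"You are {_render(config['role'])}.")
    parts.append(f"Your task is as follows:\n{_render(config['instruction'])}")
    for key, label in SECTION_LABELS:
        if config.get(key):
            parts.append(f"{label}\n{_render(config[key])}")
    # Splice in a named reasoning block from the app-level config, if any.
    strategies = (app_config or {}).get("reasoning_strategies", {})
    strategy = config.get("reasoning_strategy")
    if strategy in strategies:
        parts.append(_render(strategies[strategy]))
    parts.append("Now perform the task as instructed above.")
    return "\n\n".join(parts)


prompt = build_prompt_sketch(
    {
        "role": "a study assistant",
        "instruction": "Answer from the provided context only.",
        "output_constraints": ["Cite the source file.", "Say so when context is insufficient."],
        "reasoning_strategy": "Grounded",
    },
    # Placeholder strategy text, not the project's actual Grounded block.
    app_config={"reasoning_strategies": {"Grounded": "Check each claim against the context."}},
)
print(prompt)
```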
Why this design instead of just writing prompts as Python strings:

- Reasoning strategies are defined in config.yml and are spliced into each prompt by the builder.

The three chains in RAGAssistant.__init__:

| Chain | Purpose | YAML block | Input variables |
|---|---|---|---|
| Answerer | Generate the user-facing response | rag_prompt_cfg | history, context, question |
| Summarizer | Compress old turns into a running summary | summarizer_prompt_cfg | previous_summary, exchanges |
| Rewriter | Turn follow-ups into standalone queries | rewriter_prompt_cfg | history, question |
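
Each is constructed identically:

    # Import paths assume the langchain_core package layout.
    from langchain_core.messages import SystemMessage
    from langchain_core.output_parsers import StrOutputParser
    from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

    # Schematic: yaml_block, app_cfg and self.llm come from RAGAssistant.__init__.
    SYSTEM_TEXT = build_prompt_from_config(yaml_block, app_config=app_cfg)
    TEMPLATE = ChatPromptTemplate.from_messages([
        SystemMessage(content=SYSTEM_TEXT),
        MessagesPlaceholder("history"),      # for answerer + rewriter
        ("human", "...{variables}..."),
    ])
    CHAIN = TEMPLATE | self.llm | StrOutputParser()
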
SystemMessage(content=...) is used (not a templated tuple) so the
builder's output is treated as literal text. This prevents stray {
or } characters in the system prompt from being interpreted as
placeholder syntax.
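
Why stray braces matter can be shown with Python's own format-string rules, which the templated form resembles. This is an analogy using the standard library only, not LangChain's code path; placeholder_names is a hypothetical helper.

```python
from string import Formatter


def placeholder_names(text: str) -> list[str]:
    """Return the field names that format-style parsing treats as placeholders."""
    return [name for _, name, _, _ in Formatter().parse(text) if name is not None]


system_text = 'Reply as JSON, for example {"answer": "..."}.'

# Parsed as a template, the JSON example is mistaken for a placeholder.
print(placeholder_names(system_text))

try:
    system_text.format(question="What is entanglement?")
except KeyError as exc:
    print("formatting failed on field:", exc)

# Doubling the braces escapes them, which a templated system prompt
# would need everywhere; passing literal text avoids the issue.
print(placeholder_names('Reply as JSON, for example {{"answer": "..."}}.'))
```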
Implementation: src/ragassistant.py :: _history_messages, _record_turn, _compact_history
Strategy: Summary buffer (the default, configurable in
config/config.yml).
State:

    self.summary: str                 # running natural-language summary
    self.recent: list[BaseMessage]    # last N turns, verbatim

Per-turn flow:
Assembly (_history_messages): If summary is non-empty,
prepend a SystemMessage("Summary of earlier conversation: ...")
before the verbatim recent messages. Return the list.
Recording (_record_turn): Append HumanMessage(question) and
AIMessage(answer) to self.recent. If
len(self.recent) > buffer_size * 2, trigger compaction.
Compaction (_compact_history): Take the oldest turns (all
except the most recent buffer_size * 2 messages), render them as
User: ...\nAssistant: ... plain text, and invoke the summarizer
chain with the current self.summary plus those exchanges. The
summarizer returns an updated summary that subsumes both.
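
The three steps of the per-turn flow can be sketched without LangChain. SummaryBufferSketch is a hypothetical stand-in for the memory logic in RAGAssistant: messages are (role, text) tuples instead of message objects, and the summarizer chain is replaced by an injected function so the flow can be run offline.

```python
from dataclasses import dataclass, field
from typing import Callable

Message = tuple[str, str]  # (role, text); stand-in for LangChain messages


@dataclass
class SummaryBufferSketch:
    summarize: Callable[[str, str], str]  # (previous_summary, exchanges) -> summary
    buffer_size: int = 4
    summary: str = ""
    recent: list[Message] = field(default_factory=list)
    compactions: int = 0

    def history_messages(self) -> list[Message]:
        """Assembly: optional summary message, then the verbatim buffer."""
        head = []
        if self.summary:
            head = [("system", f"Summary of earlier conversation: {self.summary}")]
        return head + list(self.recent)

    def record_turn(self, question: str, answer: str) -> None:
        """Recording: append the pair, compact on overflow."""
        self.recent += [("user", question), ("assistant", answer)]
        if len(self.recent) > self.buffer_size * 2:
            self._compact()

    def _compact(self) -> None:
        """Compaction: fold everything but the newest buffer_size * 2 messages."""
        # Index from the front; recent[:-0] would be empty when buffer_size is 0.
        cut = len(self.recent) - self.buffer_size * 2
        old, self.recent = self.recent[:cut], self.recent[cut:]
        exchanges = "\n".join(
            f"{'User' if role == 'user' else 'Assistant'}: {text}" for role, text in old
        )
        self.summary = self.summarize(self.summary, exchanges)
        self.compactions += 1


def fake_summarize(previous: str, exchanges: str) -> str:
    # Stand-in for the summarizer chain: just counts folded messages.
    seen = int(previous.split()[0]) if previous else 0
    return f"{seen + len(exchanges.splitlines())} earlier messages folded"


memory = SummaryBufferSketch(summarize=fake_summarize, buffer_size=4)
for turn in range(1, 7):
    memory.record_turn(f"question {turn}", f"answer {turn}")

print(len(memory.recent))               # 8
print(memory.summary)                   # 4 earlier messages folded
print(memory.compactions)               # 2
print(memory.history_messages()[0][0])  # system
```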
Why summary buffer over alternatives:
| Strategy | Pro | Con | Why not chosen |
|---|---|---|---|
| Pure buffer (last N verbatim) | Trivial, no extra LLM calls | Cliff-drops information older than N | Loses too much for multi-topic study sessions |
| Pure refine (one evolving summary) | Constant-size memory | Loses recent verbatim phrasing; coreferences break | Pronoun resolution fails immediately |
| Summary buffer (chosen) | Verbatim recent + summary of older | One LLM call per compaction | Best balance for the use case |
Configurability (config/config.yml):

    memory:
      strategy: summary_buffer   # or "buffer" or "none"
      buffer_size: 4

buffer_size: 0 with summary_buffer collapses to refine mode (every
turn folded immediately). strategy: none disables history entirely
(stateless mode for benchmarking).
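Compaction is event-driven, not turn-counted: summarization fires
only when the buffer overflows, so the first buffer_size turns of a
session cost no summarizer calls. Because compaction trims the buffer
back to exactly buffer_size * 2 messages, every later turn overflows
it again by one exchange. With the default buffer_size: 4, a 20-turn
conversation therefore triggers compaction on turns 5 through 20
(16 times). Amortizing the summarizer call across roughly buffer_size
turns would require compaction to trim the buffer below the overflow
threshold.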
Implementation: src/ragassistant.py :: query_stream (lines around the rewriter call)
Problem it solves:
Vector retrieval works on semantic similarity to the query string.
Conversational follow-ups like "tell me more about that" contain no
semantic signal: the embedding has no idea what "that" refers to.
Without intervention, retrieval would return arbitrary chunks.
Mechanism:
A separate LCEL chain (constructed from rewriter_prompt_cfg) sees the
conversation history plus the new question and returns a standalone
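search query. The retrieval call uses the rewritten query; the
answering call uses the original question. This separation matters:
retrieval needs an explicit, self-contained query, while the answerer
benefits from the user's actual phrasing for tone matching.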
Optimization: The rewriter is skipped on the first turn (when
_history_messages returns empty). Cost saved: one LLM call per fresh
session.
Failure modes mitigated: if the rewriter returns an empty string, a
guard clause (if not search_query:) falls back to the literal question.

Implementation: src/ragassistant.py :: _initialize_llm
Three providers supported: OpenAI, Groq, Google Gemini. The init
function checks environment variables in priority order
(OPENAI_API_KEY, GROQ_API_KEY, GOOGLE_API_KEY) and returns the
first matching LangChain ChatModel instance.
Each provider is wrapped by an officially maintained LangChain
integration package (langchain-openai, langchain-groq,
langchain-google-genai). The wrappers expose the same
Runnable interface, so the LCEL chains (prompt | llm | parser) are
provider-agnostic.
Model selection per provider is overridable via env var
(OPENAI_MODEL, GROQ_MODEL, GOOGLE_MODEL) with sensible defaults
chosen for the cost/quality balance appropriate to a learning project.
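
The priority-ordered lookup can be sketched as a pure function over an environment mapping. select_provider and the PROVIDERS table are hypothetical; gpt-4o-mini is the OpenAI default named elsewhere in this document, while the Groq and Gemini defaults are left as placeholders because they are not stated here. The real _initialize_llm returns a LangChain ChatModel instance, not a pair of strings.

```python
import os
from typing import Mapping, Optional

# (provider, API-key variable, model-override variable, default model)
PROVIDERS = [
    ("openai", "OPENAI_API_KEY", "OPENAI_MODEL", "gpt-4o-mini"),
    ("groq", "GROQ_API_KEY", "GROQ_MODEL", "<groq-default-model>"),        # placeholder
    ("google", "GOOGLE_API_KEY", "GOOGLE_MODEL", "<gemini-default-model>"),  # placeholder
]


def select_provider(env: Optional[Mapping[str, str]] = None) -> tuple[str, str]:
    """Return (provider, model) for the first provider whose key is set."""
    env = os.environ if env is None else env
    for name, key_var, model_var, default_model in PROVIDERS:
        if env.get(key_var):
            return name, env.get(model_var) or default_model
    raise RuntimeError("Set one of OPENAI_API_KEY, GROQ_API_KEY, GOOGLE_API_KEY")


print(select_provider({"GROQ_API_KEY": "x", "GOOGLE_API_KEY": "y"}))
# ('groq', '<groq-default-model>')
print(select_provider({"OPENAI_API_KEY": "x", "OPENAI_MODEL": "gpt-4o"}))
# ('openai', 'gpt-4o')
```

Passing the environment as an argument keeps the selection logic testable without mutating os.environ.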
The system has two configuration layers, separated by lifetime:
| Layer | File | Lifetime | Contents |
|---|---|---|---|
| Secrets | .env | Per-deployment | API keys, model overrides |
| Behavior | config/*.yml | Per version (version-controlled) | Prompts, reasoning strategies, memory policy |
config/prompt_config.yml contains three top-level blocks:
rag_prompt_cfg # answerer system prompt
summarizer_prompt_cfg # summarizer system prompt
rewriter_prompt_cfg # rewriter system prompt
Each block is a config dict consumable by build_prompt_from_config.
config/config.yml contains:
reasoning_strategies # named reasoning scaffolds (CoT, ReAct, Self-Ask, Grounded)
memory # memory policy: strategy + buffer_size
Hot-reload: not supported. Configs are read once at
RAGAssistant.__init__, so config changes require a restart. This is
intentional for the demo context: guaranteeing config invariance
during a session simplifies reasoning about behavior.
Hallucination is a primary failure mode of RAG systems. This
implementation defends against it at multiple layers:
rag_prompt_cfg.output_constraints in prompt_config.yml includes
explicit rules:
The default reasoning strategy is Grounded (defined in
config/config.yml). It instructs the model to:
Step 4 is the critical self-check: it catches model drift where the
first half of an answer is grounded but the second half extrapolates
beyond the context.
The answerer prompt explicitly classifies the user input into three
categories before responding:
This prevents the model from synthesizing irrelevant retrieved chunks
into an "answer" for non-research inputs like "thanks".
| Failure mode | Mitigation |
|---|---|
| Hallucinated citations | Prompt-level rule + Grounded reasoning strategy |
| Pronoun in follow-up retrieves wrong chunks | Query rewriter |
| Memory grows without bound | Summary buffer with event-driven compaction |
| User pastes a pleasantry / off-topic question | Three-way input classification in answerer prompt |
| Re-running app duplicates chunks | collection.count() == 0 guard in app.py |
| metadata["source"] missing on a Document | Synthetic ID fallback in vectordb.add_documents |
| Empty documents/ folder | Placeholder context string fed to LLM; prompt rules handle gracefully |
| Provider rate limit / API error during query | Caught at REPL boundary; session continues |
| User Ctrl+C mid-stream | Caught; turn is NOT recorded to memory; prompt returns |
| Rewriter returns an empty string | Defensive fallback to literal question |
The system makes one LLM call on the first turn (answerer only) and
2-3 calls on every later turn (rewriter, answerer, and the summarizer
when compaction fires). A naive single-call RAG would be cheaper but
would fail on follow-up questions and produce worse-grounded answers.
The cost increase (roughly 2-3x per follow-up turn) is judged
worthwhile for the demo use case. For production, the rewriter and
summarizer could use a smaller, cheaper model than the answerer; this
is a single-line change.
Summary buffer was chosen over pure buffer or pure refine for the
reasons in §5.4. Tokens spent on the summary are tokens not spent on
retrieved context, so very long sessions could eventually starve
retrieval. A future enhancement would be token-budgeted compaction
(trigger on token count, not message count).
Provider selection is environment-driven, not config-driven. This
avoids ambiguity (one source of truth: which key is set) at the cost
of being less explicit. The trade-off favors operational simplicity.
A CLI was chosen for demo simplicity. The underlying query_stream
generator is web-frontend-ready (it streams tokens), so this decision
is reversible without architectural change.
Memory compaction triggers on message count, not token count. For the
default gpt-4o-mini (128k context), message count is sufficient. For
small-context models this could starve retrieval. The fix is to use
the already-installed tiktoken package to measure tokens; out of
scope for this revision.
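
A token-budgeted trigger could look roughly like the sketch below. should_compact and approx_tokens are hypothetical names, and the whitespace word count is a crude stand-in for a real tokenizer so that the example runs without extra packages; with tiktoken installed, a counter built from tiktoken.get_encoding(...) and its encode method could be passed in instead.

```python
from typing import Callable, Sequence


def approx_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: count whitespace-separated words."""
    return len(text.split())


def should_compact(
    messages: Sequence[str],
    token_budget: int,
    count_tokens: Callable[[str], int] = approx_tokens,
) -> bool:
    """Trigger compaction on total size of the buffer, not message count."""
    return sum(count_tokens(m) for m in messages) > token_budget


recent = ["short question", "a much longer answer " * 50]
print(should_compact(recent, token_budget=100))      # True (2 + 200 words)
print(should_compact(recent[:1], token_budget=100))  # False
```

The check would replace the len(self.recent) > buffer_size * 2 condition; the rest of the compaction path would stay as it is.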
This section is for reviewers grading the implementation.
| Claim | How to verify |
|---|---|
| "Grounding is enforced at the prompt layer" | Read config/prompt_config.yml :: rag_prompt_cfg.output_constraints |
| "Three chains share one builder" | src/ragassistant.py __init__ constructs three chains via build_prompt_from_config |
| "Memory compaction is event-driven" | src/ragassistant.py :: _record_turn triggers _compact_history only when len(self.recent) > self.buffer_size * 2 |
| "Rewriter skipped on first turn" | src/ragassistant.py :: query_stream, the if history: guard before rewriter invocation |
| "Provider-agnostic LCEL chains" | All chains use ... \| self.llm \| StrOutputParser(). self.llm is any LangChain ChatModel |
| "YAML-driven, no hard-coded prompts" | grep for triple-quoted strings in src/; no instructional prose exists outside of YAML |

# Grounded answer with citation
"What is quantum entanglement?"
# Pronoun resolution (rewriter exercise)
"Tell me more about that."
# Topic switch (rewriter must NOT contaminate)
"How does CRISPR work?"
# Off-topic refusal (anti-hallucination)
"How do I bake sourdough?"
# Pleasantry handling
"Thanks!"
# Meta-question (history exercise)
"What have we discussed?"
# Memory introspection (slash command)
/history

- The rewriter's output can be checked by inspecting search_query after the rewriter call.
- /history shows verbatim turns for the last 4 exchanges and a summary of anything older.

| File | Public surface |
|---|---|
| src/app.py | main() |
| src/ragassistant.py | RAGAssistant, RAGAssistant.query, RAGAssistant.query_stream, RAGAssistant.add_documents, RAGAssistant.reset_memory |
| src/vectordb.py | VectorDB, VectorDB.add_documents, VectorDB.search, VectorDB.chunk_text |
| src/prompt_builder.py | build_prompt_from_config, format_prompt_section, print_prompt_preview |
| src/utils.py | load_documents, _load_yaml |

| File | Top-level keys |
|---|---|
| config/prompt_config.yml | rag_prompt_cfg, summarizer_prompt_cfg, rewriter_prompt_cfg |
| config/config.yml | reasoning_strategies, memory |
| .env (not in repo) | One of OPENAI_API_KEY, GROQ_API_KEY, GOOGLE_API_KEY; optionally *_MODEL overrides |

| Path | Created by | Contents |
|---|---|---|
| chroma_db/ | First run of app.py | ChromaDB persistent index (SQLite + binary blobs) |

The Study RAG Assistant is a deliberately small, deliberately
declarative implementation of a conversational document-Q&A system.
Its distinguishing features are prompt-level grounding enforcement, a
three-chain orchestration that includes a query rewriter and a
summary-buffer memory subsystem, and a YAML-driven configuration model
that makes behavior tunable without code changes.
The architecture prioritizes auditability (every prompt is in YAML,
every chain is a three-line LCEL pipeline, every memory operation is
two helper methods), provider portability (env-var-driven, three
backends supported), and conversational robustness (rewriter for
follow-ups, summary buffer for long sessions, input classification for
non-research inputs).
The trade-offs taken (2-3 LLM calls per turn, in-memory-only state,
text-only ingestion) are aligned with the project's scope as a
learning and demonstration system. Each trade-off is documented in
§2 (non-goals) and §9 (scope limits) so reviewers can distinguish
deliberate choices from oversights.