Multi-Agent RAG Research Assistant: Reducing LLM Hallucinations with a Three-Stage CrewAI Pipeline

Multi-Agent System Overview

GitHub Repository: samrat-kar/rag-research-assistant

Introduction

Chatbots built on large language models have a well-known failure mode: they confidently generate plausible-sounding answers that are factually wrong — a problem known as hallucination. This is especially damaging when users need accurate, verifiable information on a specific topic, and it gets worse when the knowledge required is either recent (not in the model's training data) or proprietary (stored in internal documents the model has never seen).

This project addresses both gaps, recent knowledge and proprietary knowledge, at once. Rather than relying on a single LLM to both search and synthesise, it decomposes the research task across three specialised AI agents, each with a clearly bounded responsibility:

A Research Agent that gathers evidence from the live web and a local document corpus.
An Analyst Agent that validates those findings, resolves conflicts, and verifies numbers.
A Writer Agent that synthesises the validated evidence into a clean, source-cited report.

The three agents are orchestrated in a deterministic sequential pipeline using CrewAI, with explicit context handoffs between stages — the output of each agent becomes the grounded input for the next. This separation of concerns makes the system more reliable than a single-agent approach and the output more auditable.

The project also ships a simpler single-agent interactive demo (demo.py) backed by the same local knowledge base, useful for quick Q&A without running the full pipeline.

Key Concepts

Understanding the system requires familiarity with a handful of foundational ideas. This section explains each briefly.

What is RAG (Retrieval-Augmented Generation)?

A standard LLM is a closed system: it can only answer questions from patterns baked into its training weights. If you need answers grounded in your own documents — or in information published after the training cutoff — the model has no way to access that knowledge.

RAG solves this by injecting retrieved context into the prompt at query time. The workflow has two phases:

Indexing (offline): Documents are split into chunks, each chunk is converted into a dense vector (embedding) that captures its semantic meaning, and those vectors are stored in a searchable index.
Retrieval + Generation (online): When a question arrives, the same embedding model converts it into a vector, the index is searched for the nearest matching chunks, and those chunks are appended to the LLM's prompt. The model generates an answer conditioned on that retrieved evidence, not on memory alone.

The result is an LLM that can cite sources, stay current, and reason over private data — without any fine-tuning.
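
The online phase comes down to assembling a prompt from whatever the index returned. The following is a minimal sketch of that step, assuming retrieval has already produced a list of (source, text) pairs. The function name `build_grounded_prompt` and the prompt wording are illustrative and are not taken from the repository.

```python
def build_grounded_prompt(question: str, chunks: list[tuple[str, str]]) -> str:
    """Assemble a prompt that asks the model to answer from retrieved context.

    `chunks` is a list of (source, text) pairs; the name and the prompt
    wording are hypothetical, chosen only for illustration.
    """
    # Number each chunk and keep its source so the answer can cite it.
    context = "\n\n".join(
        f"[{i}] Source: {source}\n{text}"
        for i, (source, text) in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "Cite sources by their bracketed number.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


prompt = build_grounded_prompt(
    "How efficient are commercial solar panels?",
    [("sustainable_energy.txt", "Commercial silicon panels exceed 22% efficiency.")],
)
```

Keeping the source name next to each chunk is what lets the generated answer carry citations.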

What are Vector Embeddings?

An embedding is a list of numbers (a vector) that represents the meaning of a piece of text. Texts with similar meanings have vectors that are close together in high-dimensional space, measured by cosine similarity. This is what allows semantic search: instead of matching keywords, the system finds chunks that mean roughly the same thing as the query, even if they use different words.

This project uses OpenAI's text-embedding-3-small model, which produces 1,536-dimensional vectors.
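
To make "close together" concrete, here is a small sketch of cosine similarity on toy three-dimensional vectors. Real embeddings from text-embedding-3-small have 1,536 dimensions, but the arithmetic is the same. The vectors and the function name are made up for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings (illustrative values only).
query = [1.0, 0.0, 1.0]
similar = [0.9, 0.1, 0.8]
unrelated = [0.0, 1.0, 0.0]

# The chunk pointing in nearly the same direction as the query scores higher.
ranked = sorted(
    [("similar", similar), ("unrelated", unrelated)],
    key=lambda item: cosine_similarity(query, item[1]),
    reverse=True,
)
```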

What is a Multi-Agent System?

A single LLM call mixes retrieval, reasoning, and writing into one opaque step. That makes errors hard to trace and easy to compound: a hallucinated "fact" introduced in the first sentence contaminates everything that follows.

A multi-agent system breaks the task into stages, assigns each stage to a specialised agent, and passes structured outputs between them. The agents in this project are purpose-limited by design: the Research Agent cannot draw conclusions, the Analyst Agent cannot do free-form writing, and the Writer Agent cannot retrieve new information. Each constraint prevents a whole class of errors.

What is CrewAI?

CrewAI is an open-source Python framework for orchestrating multiple LLM agents as a coordinated "crew". It handles the plumbing: defining agent roles and goals, managing tool calls, passing context between tasks, and running tasks in sequence or in parallel. In this project, CrewAI's Process.sequential mode enforces a strict Research → Analyse → Write order, with each task's output explicitly wired as context to the next.

What is Tavily?

Tavily is a search API built specifically for LLM workflows. Unlike scraping raw Google results, it returns clean, LLM-friendly summaries of web pages, including source URLs. The Research Agent uses it to pull live, up-to-date information that the local knowledge base may not contain.

System Architecture

User Question
      │
      ▼
┌──────────────────────────────┐   TavilySearchTool (live web)
│        Research Agent        │◄─ LocalRAGSearchTool (local docs)
│                              │
│  Goal: Gather evidence from  │
│  web + local knowledge base  │
│  Output: bullet notes +      │
│          sources list        │
└──────────────┬───────────────┘
               │  research notes + sources (via context=)
               ▼
┌──────────────────────────────┐   CalculatorTool (verified arithmetic)
│        Analyst Agent         │◄─ LocalRAGSearchTool (cross-check)
│                              │
│  Goal: Validate claims,      │
│  resolve source conflicts,   │
│  verify numbers              │
│  Output: analysis summary    │
│          + answer outline    │
└──────────────┬───────────────┘
               │  analysis summary + outline (via context=)
               ▼
┌──────────────────────────────┐   SaveReportTool
│        Writer Agent          │──► outputs/report.md
│                              │
│  Goal: Write structured,     │
│  cited Markdown report       │
│  No retrieval tools —        │
│  writes from evidence only   │
└──────────────────────────────┘

Each stage passes its output as explicit context to the next task via CrewAI's context= parameter. This fixes the order of execution (Research → Analyse → Write) and means each downstream agent works only from the previous stage's output plus whatever its own tools return.

Why Three Separate Stages?

The three-stage design is not arbitrary — each stage removes a distinct failure mode:

Stage	Failure mode it prevents
Research Agent (retrieval only)	LLM making up facts when no search is performed
Analyst Agent (validation only)	Conflicting sources silently resolved in favour of the wrong one
Writer Agent (write only, no retrieval)	New hallucinated content injected at the final stage

A single-agent approach merges all three into one opaque LLM call. If the final answer is wrong, there is no clean way to tell which stage introduced the error. With three separated agents, each stage's output is inspectable independently.

Agent Roles and Responsibilities

Research Agent

Attribute	Value
Role	Research Agent
Goal	Collect reliable information from the web and local knowledge base to answer the user's query
Tools	`TavilySearchTool`, `LocalRAGSearchTool`
Output	Bullet-pointed research notes with a Sources section listing URLs and/or local file names
Delegation	Disabled — operates independently, does not hand off mid-task

The Research Agent is the first to act. It queries both live web search (Tavily) and the local document corpus (via semantic embedding retrieval) to collect raw evidence. It is deliberately constrained: it does not draw conclusions, make recommendations, or synthesise findings. Its sole job is evidence gathering with source attribution. This means downstream agents always know the provenance of every claim.

Analyst Agent

Attribute	Value
Role	Analyst Agent
Goal	Analyse research notes, reconcile conflicts, compute any needed numbers, and produce bulletproof conclusions
Tools	`CalculatorTool`, `LocalRAGSearchTool`
Input	Research Agent's notes (via `context=` parameter)
Output	Analysis summary and a clear answer outline (sections + key bullets)
Delegation	Disabled — validates, does not regenerate research

The Analyst Agent receives the Research Agent's notes and critically evaluates them. If sources disagree, it surfaces the conflict rather than silently picking one. If the question involves numbers, it verifies them using the sandboxed CalculatorTool. It may also re-query the local knowledge base to cross-check claims. Its output is a structured answer outline, not a final report — the structure is intentional, as it forces the Analyst to commit to conclusions before the Writer embellishes them.

Writer Agent

Attribute	Value
Role	Writer Agent
Goal	Write the final grounded answer in a clean report format with clear sections and sources
Tools	`SaveReportTool`
Input	Analyst Agent's summary (via `context=` parameter)
Output	A saved `outputs/report.md` in structured Markdown format
Delegation	Disabled — writes only what can be supported by prior stages

The Writer Agent has no retrieval tools, and this is a key architectural constraint. It is intentionally limited to writing from what the Analyst provided, so it cannot pull new, unvalidated material into the report at the final stage. In many single-agent systems, the last "write this up nicely" step is where the model drifts back to pattern-matching from training data. Removing retrieval tools does not make that drift impossible, since the model can still draw on its training data, but combined with the instruction to write only what prior stages support it narrows that failure path considerably.

Tools

Tool	Class	Used By	Purpose
`TavilySearchTool`	`crewai_tools.TavilySearchTool`	Research Agent	Live web search for up-to-date information
`LocalRAGSearchTool`	`src/tools.py`	Research Agent, Analyst Agent	Semantic search over `./data` files using OpenAI embeddings and cosine similarity
`CalculatorTool`	`src/tools.py`	Analyst Agent	Safe sandboxed arithmetic — only numeric operators allowed, no imports
`SaveReportTool`	`src/tools.py`	Writer Agent	Writes the final Markdown report to `./outputs/`, with path sanitisation

LocalRAGSearchTool — How It Works

The LocalRAGSearchTool is backed by an in-memory VectorDB (src/vectordb.py) built at startup:

Ingestion: Documents in ./data (.txt, .md, .csv, .json) are read as UTF-8 text.
Chunking: Each document is split into 500-character chunks with 50-character overlap using RecursiveCharacterTextSplitter. The overlap means text that straddles a chunk boundary appears in both neighbouring chunks, so it is less likely to be separated from its context.
Embedding: Chunks are embedded with text-embedding-3-small (1,536-dim) via OpenAI.
Retrieval: At query time, the query is embedded and compared against all chunk vectors using cosine similarity. The top-k chunks are returned with their source filenames.

The same vector index is shared between the Research Agent and Analyst Agent, ensuring both operate from the same local evidence base.
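
The chunking step can be sketched with a plain fixed-window splitter. This is a simplification and not the repository's code: RecursiveCharacterTextSplitter additionally prefers to break on paragraph and sentence separators, whereas the hypothetical `chunk_text` below only slides a window over the characters.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Simplified stand-in for a recursive splitter; the name and defaults
    are illustrative (500/50 mirror the sizes described above).
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


sample = "".join(str(i % 10) for i in range(1200))
chunks = chunk_text(sample)
```

With a 1,200-character input this yields three chunks, and the last 50 characters of each chunk reappear at the start of the next one.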

CalculatorTool — Why a Safe Evaluator?

LLMs are notoriously unreliable at arithmetic, especially for multi-step calculations. The CalculatorTool addresses this by delegating all numeric computation to Python's eval(), but only after checking the expression against a strict allowlist of characters (0-9 + - * / ( ) . %). Any expression containing letters, and therefore any import, name, or function call, is rejected before execution. This guards against code injection via crafted expressions while still giving the Analyst a reliable arithmetic oracle.
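
A minimal sketch of the allowlist-then-evaluate pattern follows. It is not the code in src/tools.py: the function name `safe_calculate` is hypothetical, and permitting whitespace and emptying the builtins are assumptions added here.

```python
import re

# Characters from the allowlist described above, plus whitespace (an assumption).
_ALLOWED_CHARS = re.compile(r"[0-9+\-*/().%\s]+")


def safe_calculate(expression: str):
    """Evaluate a purely arithmetic expression, rejecting anything else."""
    if not _ALLOWED_CHARS.fullmatch(expression):
        raise ValueError(f"Rejected unsafe expression: {expression!r}")
    try:
        # Builtins are emptied as a second line of defence (an extra
        # precaution in this sketch, not something the source describes).
        return eval(expression, {"__builtins__": {}}, {})
    except SyntaxError as exc:
        raise ValueError(f"Malformed expression: {expression!r}") from exc


result = safe_calculate("59.3 * 0.75")
```

A character allowlist still admits the `**` operator, so very large exponents remain possible. A stricter variant could parse the expression with the `ast` module and cap operand sizes.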

CrewAI Orchestration

The crew uses Process.sequential — each task runs to completion before the next begins. Context is passed explicitly:

analysis_task = Task(..., context=[research_task])   # receives research output
writing_task  = Task(..., context=[analysis_task])   # receives analysis output

This fixed handoff order avoids the scheduling non-determinism of parallel or hierarchical coordination and makes the pipeline easy to debug: if the final report is wrong, you can inspect each stage's output independently in the console log.
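
The handoff can be modelled without any framework. The sketch below is not CrewAI code: `Stage` and `run_sequential` are hypothetical names, and the lambdas stand in for real agents. It only shows each stage receiving the outputs of the stages named in its context list.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Stage:
    """One pipeline step; `context` lists the stages whose output it receives."""
    name: str
    run: Callable[[str, list[str]], str]  # (question, upstream outputs) -> output
    context: list["Stage"] = field(default_factory=list)


def run_sequential(question: str, stages: list[Stage]) -> dict[str, str]:
    """Run stages in order, feeding each one only its declared upstream outputs."""
    outputs: dict[str, str] = {}
    for stage in stages:
        upstream = [outputs[dep.name] for dep in stage.context]
        outputs[stage.name] = stage.run(question, upstream)
    return outputs


# Stand-in stages (illustrative only; real agents would call an LLM and tools).
research = Stage("research", lambda q, ctx: f"notes on {q}")
analysis = Stage("analysis", lambda q, ctx: f"outline from [{ctx[0]}]", context=[research])
writing = Stage("writing", lambda q, ctx: f"report from [{ctx[0]}]", context=[analysis])

outputs = run_sequential("solar vs wind", [research, analysis, writing])
```

Because every intermediate output is kept in the returned dictionary, each stage can be inspected on its own after a run.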

System in Action

Running the Multi-Agent Pipeline

python -m src.main "Compare solar and wind energy efficiency"

Below is a representative console session showing what each agent does in real time:

[2026-03-28 14:22:01] INFO  Building vector index from ./data (7 documents)...
[2026-03-28 14:22:04] INFO  Vector index ready. Starting crew...

> Entering new CrewAgentExecutor chain...

╔══════════════════════════════════════════════════════╗
║  Research Agent  — Evidence Gathering                ║
╚══════════════════════════════════════════════════════╝

Thought: I need to find information about solar and wind energy efficiency.
I'll search the web for current statistics, then check the local knowledge base.

Action: tavily_search_results_json
Action Input: {"query": "solar panel efficiency vs wind turbine capacity factor 2024"}

Observation: [Tavily] Solar photovoltaic panels achieve 15–22% efficiency commercially
(up to 29% in lab settings). Offshore wind turbines reach 40–50% capacity factor...
Sources: [https://iea.org/...], [https://energy.gov/...]

Action: local_rag_search
Action Input: {"query": "solar wind energy efficiency comparison", "top_k": 4}

Observation:
[1] Source: sustainable_energy.txt
Solar power conversion efficiency has improved dramatically from ~6% in 1954 to
over 22% today for commercial silicon panels...

[2] Source: sustainable_energy.txt
Wind turbines convert kinetic energy at 35–45% capacity factor onshore, rising
to 45–55% offshore where wind speeds are higher and more consistent...

Final Answer (Research Agent):
## Research Notes: Solar vs Wind Energy Efficiency

**Solar Energy:**
- Commercial panels: 15–22% efficiency (ratio of solar energy in → electricity out)
- Best commercial: ~22–23% (monocrystalline silicon)
- Lab record: ~29% (multi-junction cells)
- Trend: improving ~0.5% per year

**Wind Energy:**
- Onshore capacity factor: 35–45%
- Offshore capacity factor: 45–55%
- Theoretical maximum (Betz limit): 59.3%
- Trend: growing turbine size increasing yield

**Note:** Efficiency metrics differ — solar uses conversion efficiency, wind uses
capacity factor. Direct comparison requires context.

**Sources:**
- https://iea.org/reports/solar-pv (web)
- https://energy.gov/wind-efficiency (web)
- sustainable_energy.txt (local)

╔══════════════════════════════════════════════════════╗
║  Analyst Agent  — Validation & Analysis              ║
╚══════════════════════════════════════════════════════╝

Thought: I have research notes. Key issue: the metrics are not directly comparable.
Solar uses conversion efficiency; wind uses capacity factor. I need to flag this.
Let me also verify the percentage figures using the calculator.

Action: calculator
Action Input: {"expression": "59.3 * 0.75"}

Observation: 44.475

[Analyst notes: real-world wind turbines achieve ~75% of the Betz limit, confirming
the 40–45% capacity factor range is physically consistent]

Final Answer (Analyst Agent):
## Analysis Summary

**Core finding:** Solar and wind use *different efficiency metrics* — a direct
number comparison is misleading without context.

**Solar (conversion efficiency):** measures how much incident sunlight becomes
electricity. Commercial: 15–22%. Lab: up to 29%.

**Wind (capacity factor):** measures actual output vs theoretical maximum output
over time. Onshore: 35–45%. Offshore: 45–55%. Betz-limit ceiling: 59.3%.

**Conflict resolution:** No source conflicts detected. Web and local sources
are consistent within measurement error.

**Answer outline:**
1. Short answer: both are competitive; metrics differ
2. Solar efficiency detail + trend
3. Wind capacity factor detail + trend
4. Comparison caveat: apples vs oranges
5. Cost context: both are grid-competitive as of 2024
6. Sources

╔══════════════════════════════════════════════════════╗
║  Writer Agent  — Report Composition                  ║
╚══════════════════════════════════════════════════════╝

Action: save_report
Action Input: {"filename": "report.md", "content": "# Solar vs Wind..."}

Observation: Saved report to: outputs/report.md

> Finished chain.

===== FINAL RESULT =====
Report saved to: outputs/report.md

Generated Report (`outputs/report.md`)

# Solar vs Wind Energy Efficiency

## Short Answer
Solar panels convert 15–22% of incident sunlight into electricity. Wind turbines
achieve a 35–45% capacity factor onshore and 45–55% offshore — but these are
different metrics and should not be compared directly as raw numbers.

## Explanation

**Solar Photovoltaic Efficiency**
- Measures the fraction of sunlight converted to electricity
- Commercial silicon panels: 15–22% (monocrystalline cells reaching ~22–23%)
- Laboratory record: ~29% using multi-junction cell technology
- Efficiency has improved by roughly 0.5 percentage points per year since 2010
- Cost: ~$0.04–0.06/kWh as of 2024 (utility scale)

**Wind Turbine Capacity Factor**
- Measures actual annual output vs maximum possible output if running 100% of the time
- Onshore: 35–45% capacity factor under good wind conditions
- Offshore: 45–55% due to stronger, more consistent winds
- Theoretical ceiling (Betz limit): 59.3% — no turbine can exceed this
- Cost: ~$0.03–0.05/kWh onshore; ~$0.08–0.10/kWh offshore as of 2024

**Comparing the Two**
- The metrics measure different things: solar efficiency is an instantaneous
  conversion ratio; wind capacity factor is a time-averaged utilisation measure
- Both technologies are cost-competitive with natural gas as of 2024
- Solar is better suited to sun-rich regions; wind to coastal and high-altitude areas
- Many utility-scale projects now combine both for complementary generation profiles

## Sources
- sustainable_energy.txt (local knowledge base)
- https://iea.org/reports/solar-pv
- https://energy.gov/wind-efficiency

Interactive Demo (`demo.py`)

For quick Q&A without the full pipeline, demo.py runs a single-agent assistant backed by the same local knowledge base:

$ python demo.py

Loading documents from ./data... 7 documents loaded.
Building vector index... done.

Example queries (tool-calling agent):

Q: What is machine learning?
A: Machine learning is a subset of artificial intelligence that enables systems
   to learn from data and improve their performance without being explicitly
   programmed. It encompasses supervised learning (labelled training data),
   unsupervised learning (pattern discovery), and reinforcement learning
   (reward-based optimisation).
Sources: artificial_intelligence.txt

Q: How does deep learning work?
A: Deep learning uses artificial neural networks with multiple layers (hence
   "deep") to learn hierarchical representations of data. Lower layers detect
   simple features (edges, tones); higher layers combine them into abstract
   concepts (faces, sentiment). Training adjusts the weights via
   backpropagation to minimise a loss function.
Sources: artificial_intelligence.txt

Q: What are key AI ethics concerns?
A: Key concerns include: bias and fairness (models reflecting historical
   prejudices), transparency (black-box decisions in high-stakes settings),
   privacy (training on sensitive data), accountability (who is responsible
   when AI causes harm), and misuse (deepfakes, autonomous weapons).
Sources: artificial_intelligence.txt

Interactive mode (type 'quit' to exit)

You: What are the latest advances in quantum computing?
Assistant: Based on the local knowledge base, quantum computing advances include
   error correction milestones, increased qubit counts (IBM reached 1,000+ qubits
   with their Condor processor), and progress on fault-tolerant logical qubits...
Sources: quantum_computing.txt

You: quit

Error Handling and System Resilience

API Key Validation

Both required API keys (OPENAI_API_KEY, TAVILY_API_KEY) are validated at startup before any agent runs. Missing keys raise a ValueError with a clear message pointing to the .env file — the crew never starts in a partially-configured state.
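
A startup check of this kind can be sketched as follows. The function name `validate_api_keys` and the exact error wording are assumptions, not the repository's implementation.

```python
import os
from typing import Mapping

REQUIRED_KEYS = ("OPENAI_API_KEY", "TAVILY_API_KEY")


def validate_api_keys(env: Mapping[str, str] = os.environ) -> None:
    """Raise ValueError if any required key is missing or blank."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]
    if missing:
        raise ValueError(
            f"Missing required environment variable(s): {', '.join(missing)}. "
            "Add them to the .env file in the project root."
        )


# Passing a mapping explicitly makes the check easy to exercise in isolation.
validate_api_keys({"OPENAI_API_KEY": "sk-example", "TAVILY_API_KEY": "tvly-example"})
```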

Tool-Level Resilience

Failure Mode	Handling
Tavily API error / timeout	CrewAI catches the tool error and the Research Agent falls back to local RAG context only
OpenAI API error (embedding)	Exception is raised and logged; the vector index is not partially populated
OpenAI API error (generation)	CrewAI surfaces the error; the task fails cleanly rather than silently producing an empty result
Empty local knowledge base	`build_vectordb()` logs a warning and returns an empty index; `LocalRAGSearchTool` returns `"No relevant local context found."`
Calculator: unsafe expression	Input is validated against an allowlist (`0-9 + - * / ( ) . %`) before evaluation. Anything outside is rejected, not executed
SaveReportTool: path traversal	`..`, `/`, and `\` are stripped from the filename before writing, preventing directory traversal
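
The filename sanitisation in the last row can be sketched like this. The helper name `sanitise_filename` and the fallback to report.md are assumptions made for illustration. Separators are removed first so that stripping them cannot bring two dots together into a new `..` sequence.

```python
def sanitise_filename(filename: str, default: str = "report.md") -> str:
    """Strip path separators and '..' so the result stays inside one directory."""
    cleaned = filename.replace("/", "").replace("\\", "")
    # Repeat until stable: removing one '..' can expose another.
    while ".." in cleaned:
        cleaned = cleaned.replace("..", "")
    return cleaned or default  # fall back if nothing usable is left


safe_name = sanitise_filename("../../etc/passwd")
```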

Graceful Degradation

If Tavily web search fails entirely, the Research Agent produces output from local RAG retrieval alone. Downstream Analyst and Writer agents continue normally, and the final report cites only local sources. The system degrades to a narrower-scope answer rather than failing completely.
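
The degradation path can be illustrated outside the framework. In the project CrewAI handles the tool error itself, so the sketch below is only an analogy: `gather_evidence` and the two search callables are hypothetical.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)

SearchFn = Callable[[str], list[str]]


def gather_evidence(query: str, web_search: SearchFn, local_search: SearchFn) -> dict[str, list[str]]:
    """Collect local and web evidence, keeping local results if the web call fails."""
    evidence: dict[str, list[str]] = {"local": local_search(query), "web": []}
    try:
        evidence["web"] = web_search(query)
    except Exception as exc:  # broad on purpose: any web failure should degrade, not crash
        logger.warning("Web search failed for %r, using local context only: %s", query, exc)
    return evidence


def failing_web_search(query: str) -> list[str]:
    raise TimeoutError("simulated Tavily timeout")


def fake_local_search(query: str) -> list[str]:
    return ["sustainable_energy.txt: wind capacity factor notes"]


evidence = gather_evidence("wind capacity factor", failing_web_search, fake_local_search)
```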

Logging

All modules use Python's logging module (not print). The CLI entry point (src/main.py) configures basicConfig at startup so every agent's INFO-level activity is timestamped in the console. Tool errors are logged at WARNING level with context, making post-run diagnosis straightforward.

Results and Observed Behaviour

Output Format

Every successful pipeline run produces a structured Markdown report at ./outputs/report.md:

# [Topic] Report

## Short Answer
[A concise 1–2 sentence answer]

## Explanation
- Key finding 1 (with source)
- Key finding 2 (with source)
- Key finding 3 (with source)

## Sources
- https://... (web)
- artificial_intelligence.txt (local)
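
As a rough illustration of how structured findings map onto this template, the sketch below renders the three sections from plain Python values. In the project the Writer Agent produces the Markdown itself, so `render_report` is a hypothetical helper that only mirrors the layout.

```python
def render_report(
    topic: str,
    short_answer: str,
    findings: list[tuple[str, str]],
    sources: list[str],
) -> str:
    """Render the report layout from (finding, source) pairs and a source list."""
    lines = [f"# {topic} Report", "", "## Short Answer", short_answer, "", "## Explanation"]
    lines += [f"- {text} ({source})" for text, source in findings]
    lines += ["", "## Sources"]
    lines += [f"- {source}" for source in sources]
    return "\n".join(lines) + "\n"


report = render_report(
    "Solar vs Wind",
    "Metrics differ.",
    [("Commercial panels: 15-22%", "sustainable_energy.txt")],
    ["sustainable_energy.txt (local)"],
)
```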

Observed Behaviour Across Queries

Research Agent consistently retrieves domain-relevant local chunks alongside web results. When a query maps well to a local file (e.g., "quantum entanglement" → quantum_computing.txt), the local context often provides more detailed technical content than a brief web snippet.
Analyst Agent correctly surfaces conflicts when web and local sources differ — for example, when a local document contains older statistics and a Tavily result has updated figures. Rather than silently picking one, it includes both with a date qualifier.
Writer Agent produces structured, citation-complete Markdown without introducing facts absent from the analysis stage. In all test runs, no new claims appeared in the final report that were not in the Analyst's outline.
The pipeline completes end-to-end in approximately 30–60 seconds depending on Tavily and OpenAI response latency. The dominant cost is the three agent turns: each agent makes 1–3 tool calls, and every tool call adds a further OpenAI chat completion to that agent's reasoning loop.

Comparison: Single-Agent vs Multi-Agent

Dimension	Single Agent (demo.py)	Multi-Agent Pipeline
Setup	No Tavily key needed	Requires `TAVILY_API_KEY`
Speed	~5–10 seconds	~30–60 seconds
Web coverage	Local docs only	Web + local docs
Fact verification	None	Analyst checks numbers
Source conflict handling	Not supported	Analyst surfaces conflicts
Output format	Ad-hoc text	Structured Markdown report
Auditability	Single output	3 inspectable stage outputs

The single-agent demo is better for quick lookups over the local corpus. The multi-agent pipeline is better when accuracy, source coverage, and verifiability matter.

Limitations

Sequential latency: Three agent turns plus multiple tool calls mean total wall-clock time is longer than a single LLM call. For time-sensitive use cases, running the web and local retrieval in parallel rather than sequentially would be faster.
Tavily API dependency: Live web search requires a Tavily API key and internet access. If Tavily is unreachable, the system falls back to local RAG only, limiting the breadth of evidence; the OpenAI API must still be reachable for embeddings and generation.
No conversational memory: Each crew.kickoff() call is stateless. Follow-up questions that reference a prior answer are not supported without explicit conversation history management.
In-memory knowledge base: The local vector index is rebuilt from scratch on every run. For larger corpora or production use, a persistent vector store (ChromaDB, FAISS, Pinecone) would eliminate the re-embedding overhead and allow incremental updates.
No reranking: Retrieval uses first-stage cosine similarity only. A cross-encoder reranker could improve precision for queries where the top-k chunks are not all equally relevant, since cosine search is fast but imprecise near the relevance boundary.
Single output format: The Writer Agent always produces Markdown. Different output formats (JSON, structured tables, HTML) would require task prompt changes or additional output parsers.

Reproducibility and Setup

Prerequisites

Python 3.12+
An OpenAI API key (for chat completions and embeddings)
A Tavily API key (for live web search; free tier available at tavily.com)

Environment Variables

Create .env in the project root:

OPENAI_API_KEY=your_openai_key
TAVILY_API_KEY=your_tavily_key
OPENAI_MODEL=gpt-4o-mini
OPENAI_EMBEDDING_MODEL=text-embedding-3-small

Install

python3.12 -m venv .venv312
source .venv312/bin/activate        # Windows: .\.venv312\Scripts\Activate.ps1
pip install -r requirements.txt

Run

# Full multi-agent pipeline
python -m src.main "Your research question here"

# Single-agent interactive demo
python demo.py

Docker

docker build -t rag-research-assistant .
docker run --env-file .env rag-research-assistant "What is quantum entanglement?"

Conclusion

What This Project Demonstrates

This project demonstrates that dividing responsibilities across specialised agents is a practical and effective strategy for reducing LLM hallucinations in research tasks. The key insight is architectural: rather than asking one LLM to retrieve, reason, and write simultaneously, each concern is isolated and handled by an agent constrained to that concern alone.

Three design decisions are worth highlighting:

1. The Writer Agent has no retrieval tools.
This is the most important constraint in the system. In many agentic pipelines, the final synthesis step is where the model drifts back to pattern-matching from training data, because nothing prevents it from doing so. By removing all retrieval tools from the Writer, the system sets a clear boundary: every claim in the final report should trace back to something the Analyst explicitly included in the outline. The model can still draw on its training data, so last-step hallucination is not impossible, but any claim that does not appear in the outline is easy to spot by comparing the two.

2. Context is passed explicitly, not implicitly.
CrewAI's context= parameter means each agent receives a clean, explicit version of the prior stage's output — not a growing conversation history that degrades over multiple turns. This keeps each agent focused and makes the data flow auditable.

3. Source conflicts are surfaced, not silenced.
The Analyst Agent is prompted to flag disagreements between sources rather than resolve them quietly. This is a deliberate choice for a research tool: a user who sees "web source says X, local document says Y (2022 data)" can make an informed judgement. A system that silently picks one gives false confidence.

Key Takeaways

RAG + multi-agent architecture is a practical combination for any task requiring accuracy over a private document corpus combined with live web access. The components (embeddings, vector search, tool-calling agents, sequential orchestration) are all production-ready with current open-source libraries.
CrewAI's sequential process with explicit context handoffs provides enough structure to build reliable pipelines without the complexity of a hierarchical or parallel agent topology.
Role constraints are more valuable than capabilities: the Writer Agent's usefulness comes not from what it can do but from what it is prevented from doing.
LLM arithmetic is unreliable; tool-based arithmetic is not. The CalculatorTool pattern (delegate numbers to a sandboxed evaluator, not to the LLM's internal reasoning) generalises to any domain where numeric accuracy matters.

Extensions Worth Exploring

For anyone building on top of this design, the highest-value extensions would be:

Persistent vector store (ChromaDB or FAISS) to avoid re-embedding on every run.
Cross-encoder reranking after initial cosine retrieval to improve chunk precision.
Conversational memory (LangChain ConversationBufferMemory or similar) to support follow-up questions.
Parallel retrieval (run Tavily and local RAG simultaneously) so the Research Agent's retrieval time is bounded by the slower of the two searches rather than their sum.
Streaming output so users see partial results as each agent completes rather than waiting for the full pipeline.
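
The parallel retrieval idea can be sketched with the standard library alone. The two search callables below are stand-ins, not the project's tools, and `retrieve_in_parallel` is a hypothetical name.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

SearchFn = Callable[[str], list[str]]


def retrieve_in_parallel(query: str, web_search: SearchFn, local_search: SearchFn) -> tuple[list[str], list[str]]:
    """Run both searches concurrently and return (web_results, local_results)."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        web_future = pool.submit(web_search, query)
        local_future = pool.submit(local_search, query)
        # result() blocks until each search finishes and re-raises its exception, if any.
        return web_future.result(), local_future.result()


def slow_web_search(query: str) -> list[str]:
    time.sleep(0.2)  # simulated network latency
    return [f"web result for {query}"]


def slow_local_search(query: str) -> list[str]:
    time.sleep(0.2)  # simulated embedding + lookup latency
    return [f"local chunk for {query}"]


web_results, local_results = retrieve_in_parallel("solar vs wind", slow_web_search, slow_local_search)
```

Threads suit this case because both calls spend most of their time waiting on I/O. The saving applies to the retrieval calls only, not to the LLM reasoning steps around them.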

Licensing

This project is released under CC BY-NC-SA 4.0 — free to use, share, and adapt for non-commercial purposes with attribution. See LICENSE.

Third-party dependencies: CrewAI (MIT), LangChain (MIT), OpenAI SDK (Apache 2.0), NumPy (BSD). Use of the OpenAI and Tavily APIs is subject to their respective Terms of Service.