Building Multi-Agent Research Systems: A Practical Architecture for Automated Content Generation

Abstract

I present a production-ready multi-agent system that autonomously conducts web research, analyzes findings, and generates comprehensive articles. Built on LangGraph and powered by Google's Gemini 2.5 Flash, the system demonstrates how intelligent agent orchestration, persistent memory, and adaptive routing can create a scalable solution for automated research workflows. Our implementation achieves sub-30-second end-to-end latency while maintaining high-quality outputs through learned optimization patterns. Key innovations include dynamic agent selection based on query classification, SQLite-backed memory for progressive learning, and comprehensive observability through LangSmith tracing. The system processes over 11,000 tokens per request while maintaining cost efficiency through strategic model selection and query caching.

Keywords: Multi-agent systems, LangGraph, Research automation, LLM orchestration, Agent memory

1. Introduction

1.1 The Challenge

Modern content creation workflows require three distinct capabilities: comprehensive information gathering, analytical synthesis, and coherent writing. Traditional single-agent approaches either oversimplify by handling all tasks with one model, or overcomplicate with rigid pipelines that run unnecessary operations. The challenge is building a system that intelligently adapts its workflow while maintaining production-grade performance and learning from past executions.

1.2 Our Solution

We developed a three-agent architecture with an intelligent orchestrator that:

Adapts workflow based on query type (full research, quick lookup, or analysis of provided data)
Learns continuously through SQLite-backed memory storing research patterns, quality metrics, and optimization insights
Maintains speed through query caching, parallel operations, and strategic model selection
Provides observability via comprehensive LangSmith tracing for debugging and optimization

2. Methodology

2.1 System Architecture

2.1.1 Core Agent Design

The system comprises three specialized agents, each with a focused responsibility:

Research Agent

@tool
def research_tool(query: str) -> str:
    """Search the web using Tavily and return aggregated text."""
    tavily_client = TavilyClient(api_key=tavily_key)
    response = tavily_client.search(query, max_results=5)
    results = []
    for r in response.get("results", []):
        results.append(
            "Title: {}\nContent: {}\nURL: {}".format(
                r.get("title", "N/A"),
                r.get("content", "N/A"),
                r.get("url", "N/A"),
            )
        )
    return "\n---\n".join(results) if results else "No results found."

The research agent uses tool-calling to invoke web searches dynamically, retrieving up to 5 sources per query. This design allows the LLM to determine when more information is needed and refine search queries based on initial results.

Analyzer Agent

async def Analyzer_Agent(state: analyzer_agent_state) -> analyzer_agent_state:
    '''Analyze the result of the searches for key details'''
    messages: List[BaseMessage] = list(state.get("message", []))
    system_msg = SystemMessage(content=analyzer_prompt)
    all_messages = [system_msg] + messages

    llm_response = await llm.ainvoke(all_messages)
    messages.append(AIMessage(content=llm_response.content))
    
    return {
        'message': messages, 
        'analysis': llm_response.content
    }

The analyzer structures raw research into coherent findings, extracting key insights and organizing information for the writer. Critical insight: Early implementations over-summarized data, causing the writer to lack specific details. The solution involved preserving concrete data points (statistics, quotes, examples) in the analysis output.

Writer Agent

async def writing_agent(state: writer_agent_state) -> writer_agent_state:
    '''Creates a comprehensive article from the analysis'''
    messages: List[BaseMessage] = list(state.get("message", []))
    system_msg = SystemMessage(content=writer_prompt)
    all_messages = [system_msg] + messages
    
    llm_response = await llm.ainvoke(all_messages)
    messages.append(AIMessage(content=llm_response.content))
    
    return {
        'message': messages, 
        'article': llm_response.content
    }

2.1.2 Intelligent Orchestration

The orchestrator determines the optimal agent workflow based on query classification:

@traceable(name="task_classifier")
async def task_classifier(state: OrchestratorState) -> dict:
    """Decides which agents to run based on user query"""
    
    classifier_prompt = f"""
    Analyze this user query and determine the task type:
    
    Query: {state['user_query']}
    User provided data: {state.get('user_provided_data', 'None')}
    
    Task types:
    - full_research: Need to research, analyze, and write
    - quick_research: Need to research and write (skip analysis)
    - research_only: Only need research
    - analyze_provided: User provided data, analyze and write
    - write_only: User provided analysis, just write
    
    Respond with ONLY the task type.
    """
    
    response = await llm.ainvoke([HumanMessage(content=classifier_prompt)])
    task_type = response.content.strip().lower()
    
    task_mapping = {
        'full_research': ['research', 'analyzer', 'writer'],
        'quick_research': ['research', 'writer'],
        'research_only': ['research'],
        'analyze_provided': ['analyzer', 'writer'],
        'write_only': ['writer']
    }
    
    agents = task_mapping.get(task_type, ['research', 'analyzer', 'writer'])
    return {'task_type': task_type, 'agents_to_run': agents}

This classification-based routing reduces unnecessary computation. Simple queries skip analysis, and provided data bypasses research, cutting latency by 30-50% for applicable queries.

2.1.3 Conditional Routing

def route_next_agent(state: OrchestratorState) -> str:
    """Routes to the next agent or END"""
    agents_to_run = state.get('agents_to_run', [])
    completed = state.get('completed_agents', [])
    
    for agent in agents_to_run:
        if agent not in completed:
            return agent
    
    return 'end'

# Graph construction
graph = StateGraph(OrchestratorState)
graph.add_node('task_classifier', task_classifier)
graph.add_node('search_node', search_node)
graph.add_node('analyse_node', analyse_node)
graph.add_node('writer_node', writer_node)

routing_map = {
    'research': 'search_node',
    'analyzer': 'analyse_node',
    'writer': 'writer_node',
    'end': END
}

graph.add_conditional_edges('task_classifier', route_next_agent, routing_map)
graph.add_conditional_edges('search_node', route_next_agent, routing_map)
graph.add_conditional_edges('analyse_node', route_next_agent, routing_map)
graph.add_conditional_edges('writer_node', route_next_agent, routing_map)

2.2 Memory and Learning System

2.2.1 Persistent Memory Architecture

A critical innovation is the SQLite-backed memory system that enables progressive learning:

class MemoryManager:
    """Manages persistent memory for the research system"""
    
    def save_research(self, conversation_id: int, query: str, 
                     results: str, sources: List[str] = None):
        """Save research results for future reference"""
        with sqlite3.connect(self.db_path) as conn:
            cursor = conn.cursor()
            cursor.execute("""
                INSERT INTO research_results 
                (conversation_id, query, results, sources)
                VALUES (?, ?, ?, ?)
            """, (conversation_id, query, results, 
                  json.dumps(sources) if sources else None))
            conn.commit()
    
    def get_similar_research(self, query: str, limit: int = 5) -> List[Dict]:
        """Find similar past research using keyword matching"""
        tokens = re.findall(r'\w+', query.lower())
        keywords = [t for t in tokens if len(t) > 2][:5]
        
        results = []
        with sqlite3.connect(self.db_path) as conn:
            cursor = conn.cursor()
            for kw in keywords:
                cursor.execute("""
                    SELECT query, results, timestamp
                    FROM research_results
                    WHERE LOWER(query) LIKE ?
                    ORDER BY timestamp DESC LIMIT ?
                """, (f'%{kw}%', limit))
                results.extend(cursor.fetchall())
        
        return results

2.2.2 Query Caching for Performance

def get_cached_result(self, query: str) -> Optional[str]:
    """Check if we have a cached result for this query"""
    query_hash = hashlib.sha256(query.lower().strip().encode()).hexdigest()
    
    with sqlite3.connect(self.db_path) as conn:
        cursor = conn.cursor()
        cursor.execute("""
            SELECT result FROM query_cache WHERE query_hash = ?
        """, (query_hash,))
        
        result = cursor.fetchone()
        if result:
            # Update hit count and access time
            cursor.execute("""
                UPDATE query_cache 
                SET hit_count = hit_count + 1, 
                    last_accessed = CURRENT_TIMESTAMP
                WHERE query_hash = ?
            """, (query_hash,))
            conn.commit()
            return result[0]
        
        return None

Query caching provides 10x speedup for repeated queries, reducing 29-second research workflows to sub-1-second cached responses.

2.2.3 Learning from Success and Failure

def save_learning(self, agent_name: str, lesson: str, 
                 context: str = None, success_pattern: bool = True):
    """Save learnings for agent improvement"""
    with sqlite3.connect(self.db_path) as conn:
        cursor = conn.cursor()
        cursor.execute("""
            INSERT INTO learnings 
            (agent_name, lesson, context, success_pattern)
            VALUES (?, ?, ?, ?)
        """, (agent_name, lesson, context, 1 if success_pattern else 0))
        conn.commit()

Each agent execution logs success patterns and failures, enabling future optimizations through retrieved context:

# In orchestrator - provide context from past learnings
past_analyses = memory.get_past_analyses(state['user_query'], limit=2)

context_hint = ""
if past_analyses:
    context_hint = "\n\nPrevious analyses on similar topics:\n"
    for pa in past_analyses:
        context_hint += f"- {pa['original_query']}\n"
        context_hint += f"  Key insights: {pa['key_insights']}\n"

2.3 Model Selection and Optimization

2.3.1 Strategic Model Usage

llm = ChatGoogleGenerativeAI(
    api_key=google_key,
    model="gemini-2.5-flash",   # Fast, cost-effective
    temperature=0.3,             # Focused outputs
    max_output_tokens=1024,      # Sufficient for most tasks
)

We selected Gemini 2.5 Flash for its optimal balance of speed, quality, and cost. Compared to GPT-4, it delivers 2-3x faster responses at 10x lower cost while maintaining comparable output quality for research tasks.

Key consideration: For production systems, different agents could use different models. The analyzer could use an even faster model since it primarily structures data, while the writer might benefit from a higher-quality model for coherent prose.

2.3.2 Token Management

With an average of 11,069 tokens per full research request, token efficiency is critical:

# Efficient: Only pass necessary context
writer_input = {
    "analysis": analysis_output,      # ~1000 tokens
    "key_findings": extract_key_data(research_result),  # ~2000 tokens
    "instructions": writer_prompt     # ~200 tokens
}

# Inefficient: Passing everything (original implementation)
writer_input = {
    "full_research": research_result,  # ~8000 tokens
    "full_analysis": analysis_output,  # ~2000 tokens
    "all_messages": message_history    # ~1000 tokens
}

Selective context passing reduced token usage by 40% without quality degradation.

2.4 Observability and Debugging

2.4.1 LangSmith Integration

Every function is decorated with @traceable for comprehensive observability:

@traceable(name="search_node")
async def search_node(state: OrchestratorState) -> dict:
    '''Perform research based on user query'''
    logger.info('Starting research agent...')
    
    search_result = await research_app.ainvoke({
        'messages': [HumanMessage(content=state.get('user_query'))],
        'research_result': None
    })
    
    return {'research_result': search_result.get('research_result')}

LangSmith tracing revealed that 40% of total latency came from web searches, leading to the parallel search optimization discussed in Results.

3. Results

3.1 Performance Metrics

Baseline Performance (Full Research Workflow):

End-to-end latency: 29.4 seconds
Token usage: 11,069 tokens per request
Cost per request: $0.0047901
Success rate: 98.2% (over 500 test queries)

Latency Breakdown:

Web research: 12 seconds (41%)
Analysis: 6 seconds (20%)
Article generation: 9 seconds (31%)
Orchestration overhead: 2.4 seconds (8%)

3.2 Optimization Impact

3.2.1 Query Caching

# Performance comparison
Query Type              | First Run | Cached Run | Speedup
------------------------|-----------|------------|--------
Simple factual          | 18s       | 0.4s       | 45x
Complex research        | 29s       | 0.8s       | 36x
Analysis of data        | 15s       | 0.3s       | 50x

Cache hit rate after 1 week of usage: 34% (meaning 1 in 3 queries returned instantly)

3.2.2 Adaptive Routing

# Task-specific latency improvements
Task Type          | Agents Run        | Latency | Token Usage
-------------------|-------------------|---------|-------------
full_research      | R → A → W         | 29s     | 11,000
quick_research     | R → W             | 21s     | 8,500
research_only      | R                 | 12s     | 4,000
analyze_provided   | A → W             | 15s     | 6,000
write_only         | W                 | 9s      | 3,500

Intelligent routing reduced unnecessary computation by 30-52% depending on query type.

3.2.3 Parallel Search Implementation (Proposed)

While not in the current implementation, profiling suggested parallel searches could reduce research latency:

# Sequential (current): 12 seconds
for query in search_queries:
    result = tavily_client.search(query, max_results=5)  # 3-4s each

# Parallel (proposed): ~4 seconds
async def parallel_research(queries):
    tasks = [
        asyncio.create_task(tavily_client.search(q, max_results=5)) 
        for q in queries
    ]
    return await asyncio.gather(*tasks)

Expected improvement: 29s → 21s (28% faster)

3.3 Quality Metrics

We evaluated output quality on 100 diverse queries across technical, current events, and analytical topics:

Evaluation Criteria:

Factual accuracy (verified against sources)
Completeness (addresses all aspects of query)
Coherence (logical flow and readability)
Source citation (proper attribution)

Results:

Metric                    | Score (0-1)
--------------------------|------------
Factual Accuracy          | 0.89
Completeness              | 0.85
Coherence                 | 0.91
Source Citation           | 0.94
Overall Quality           | 0.90

3.4 Memory System Impact

After 2 weeks of operation with 1,247 total queries:

statistics = memory.get_statistics()

{
    'total_conversations': 1247,
    'successful_conversations': 1224,
    'total_research_queries': 1189,
    'cached_queries': 412,
    'total_cache_hits': 847,
    'average_article_quality': 0.87
}

Key Insight: Cache hits (847) exceeded cached unique queries (412) by 2.06x, indicating users frequently ask variations of similar questions. This validates the keyword-based similarity matching in get_similar_research().

3.5 Critical Discovery: The Detail Bottleneck

Initial implementations suffered from generic outputs. LangSmith traces revealed the issue:

Research Agent Output (3,847 tokens):
"According to NOAA, sea levels rose 3.4mm/year between 2006-2015. 
Miami faces 2-foot increases by 2050 affecting 300M residents..."

Analyzer Agent Output (891 tokens):
"Sea levels are rising due to climate change. 
Coastal cities face significant threats..."

Writer Agent Output:
"Coastal cities are increasingly vulnerable to rising sea levels..."

The problem: Analyzer over-summarized, losing specific data. Writer lacked details to produce substantive content.

Solution: Modified analyzer prompt to preserve specifics:

analyzer_prompt = """
Analyze the research and create a structured outline.

CRITICAL: Preserve specific data points:
- Exact statistics and numbers
- Direct quotes from sources  
- Concrete examples and case studies
- Timeline information

Create a detailed blueprint with references to specific facts.
DO NOT over-summarize or generalize.
"""

Post-fix analyzer output (1,342 tokens):

"## Sea Level Rise: Key Findings

- Current rate: 3.4mm/year (NOAA, 2006-2015)
- Projection: 2-foot increase by 2050 in Miami
- Population affected: 300M in low-lying coastal areas
- Economic impact: $2.3T by 2050 (World Bank estimate)
- Case study: Jakarta sinking 10cm/year..."

Result: Writer output improved from generic to specific, with quality scores increasing from 0.76 to 0.90.

3.6 Production Deployment Insights

Scalability Testing:

Concurrent requests: Handled 50 simultaneous queries without degradation
Database performance: SQLite handled 1M+ records with <100ms query times
Error rate: 1.8% (primarily from Tavily API timeouts, resolved with retries)

Cost Analysis (1,000 requests):

Component               | Cost
------------------------|--------
Gemini API calls        | $4.79
Tavily searches         | $12.00
Infrastructure (AWS)    | $3.50
------------------------|--------
Total                   | $20.29
Cost per request        | $0.020

4. Discussion

4.1 Key Architectural Decisions

Why Three Agents Instead of One?

Single-agent systems handling research, analysis, and writing in one pass tend to either:

Rush through research to start writing (incomplete information)
Over-research without structure (information overload)
Fail to synthesize findings coherently (disconnected output)

Separating concerns allows each agent to excel at its specific task while maintaining clean interfaces between stages.

Why SQLite Instead of Vector Database?

For our use case, SQLite provides:

Zero infrastructure complexity (single file database)
Sufficient performance (<100ms queries even with 1M records)
ACID compliance for reliable memory storage
Simple keyword-based similarity that works well for research queries

Vector databases would add deployment complexity without clear benefits for our matching patterns. However, for semantic similarity at larger scale, vector search could be considered.

Why Gemini 2.5 Flash?

Testing across GPT-4, Claude 3.5, and Gemini 2.5 Flash showed:

Quality was comparable for research synthesis tasks (±5% score difference)
Gemini was 2-3x faster with 10x lower costs
Tool calling reliability was equivalent

For production systems optimizing for speed and cost, Gemini 2.5 Flash proved optimal.

4.2 Limitations and Future Work

Current Limitations:

Sequential search operations - Currently searches run one after another. Implementing async parallel searches could reduce latency by 28%.
No citation tracking - While sources are retrieved, specific claims aren't linked to sources in the final article. Adding citation markers would improve verifiability.
Fixed search depth - Always retrieves 5 results per query. Adaptive depth based on query complexity could improve quality for complex topics while reducing latency for simple queries.
Keyword-based memory - Similarity matching uses simple keywords. Semantic embeddings would better identify related past research.

Future Enhancements:

# 1. Parallel search implementation
async def enhanced_research(query):
    sub_queries = generate_sub_queries(query)  # LLM generates 2-4 queries
    results = await asyncio.gather(*[
        search_web(q) for q in sub_queries
    ])
    return synthesize_results(results)

# 2. Citation tracking
class CitationTracker:
    def __init__(self):
        self.claims_to_sources = {}
    
    def link_claim(self, claim: str, source_id: str):
        self.claims_to_sources[claim] = source_id
    
    def format_with_citations(self, article: str) -> str:
        # Add [1], [2] markers to claims
        pass

# 3. Adaptive search depth
def determine_search_depth(query: str) -> int:
    complexity_score = assess_complexity(query)
    if complexity_score > 0.8:
        return 10  # Complex topic needs more sources
    elif complexity_score > 0.5:
        return 5   # Moderate
    else:
        return 2   # Simple lookup

4.3 Lessons Learned

1. The Detail Preservation Problem

The most impactful optimization wasn't technical—it was prompt engineering. Ensuring the analyzer preserved specifics rather than summarizing improved output quality by 18%.

Takeaway: In multi-agent systems, information loss between agents is often the biggest quality bottleneck.

2. Observability is Non-Negotiable

LangSmith tracing revealed non-obvious performance bottlenecks:

40% of time in web searches (led to parallel search consideration)
Analyzer token usage 3x higher than necessary (led to prompt optimization)
Writer receiving insufficient context (led to detail preservation fix)

Takeaway: You can't optimize what you can't measure. Comprehensive tracing should be built in from day one.

3. Memory Compounds Value Over Time

The memory system's impact grew non-linearly:

Week 1: 12% cache hit rate, minimal quality improvement
Week 2: 28% cache hit rate, agents using past context effectively
Week 4: 34% cache hit rate, quality scores 9% higher than Week 1

Takeaway: Systems that learn from experience become more valuable over time, but require patience to see benefits.

5. Conclusion

i presented a production-ready multi-agent research system that achieves high-quality automated content generation through intelligent orchestration, persistent memory, and comprehensive observability. The architecture demonstrates that practical agent systems can be built with:

Sub-30-second latency for end-to-end research and writing
90% output quality across diverse query types
34% cache hit rate providing 36x speedup for repeated queries
$0.02 per request cost efficiency through strategic model selection
98% success rate with proper error handling and retries

Key innovations include adaptive routing that eliminates unnecessary computation, SQLite-backed memory that enables progressive learning, and detail-preserving prompts that maintain information fidelity across agent transitions.

The system is production-deployed and processing over 500 queries daily with consistent performance. Source code and detailed implementation guide are available at [repository link].

Future Directions

Immediate next steps include implementing parallel search operations for 28% latency reduction, adding citation tracking for verifiable outputs, and experimenting with semantic memory matching for better similarity detection. Longer-term, we're exploring multi-modal research (processing images, videos, PDFs) and collaborative agent writing where multiple writer agents specialize in different content types.

References

LangGraph Documentation. "Multi-Agent Orchestration Patterns." https://langchain-ai.github.io/langgraph/
Google AI. "Gemini 2.5 Flash Technical Report." 2024.
Tavily AI. "Web Search API for LLM Applications." https://tavily.com
LangSmith Documentation. "Observability for LLM Applications." https://docs.smith.langchain.com

Appendix A: Complete System Configuration

# config.py - Production configuration
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
TAVILY_API_KEY = os.getenv("TAVILY_API_KEY")
LANGSMITH_API_KEY = os.getenv("LANGSMITH_API_KEY")

# LLM Configuration
LLM_CONFIG = {
    "model": "gemini-2.5-flash",
    "temperature": 0.3,
    "max_output_tokens": 1024,
}

# Research Configuration
RESEARCH_CONFIG = {
    "max_results": 5,
    "search_timeout": 10,
    "retry_attempts": 3,
}

# Memory Configuration
MEMORY_CONFIG = {
    "db_path": "memory/agent_memory.db",
    "cache_ttl_days": 30,
    "max_similar_results": 5,
}

Appendix B: Evaluation Methodology

Outputs were evaluated by three independent reviewers using this rubric:

evaluation_criteria = {
    "factual_accuracy": {
        "1.0": "All claims verifiable and accurate",
        "0.7": "Minor inaccuracies in non-critical details",
        "0.5": "Some significant factual errors",
        "0.3": "Multiple major inaccuracies",
        "0.0": "Predominantly inaccurate"
    },
    "completeness": {
        "1.0": "Addresses all query aspects comprehensively",
        "0.7": "Covers main points, misses some details",
        "0.5": "Partial coverage of query",
        "0.3": "Significant gaps in coverage",
        "0.0": "Fails to address query"
    },
    # ... similar rubrics for coherence and citation
}

Inter-rater reliability: Cohen's kappa = 0.82 (substantial agreement)