AI Nexus Herald is a production-grade, AI-powered newsletter generation platform built using a multi-agent architecture. It automatically discovers trending AI topics from curated RSS feeds, performs deep research, and synthesizes content into a polished newsletter ready for delivery.
Built with LangGraph, FastAPI, and a modular backend, the system combines topic discovery, information extraction, and natural language generation, ensuring both accuracy and readability.
This publication documents the production system architecture, testing & security measures, and user-facing operational features.
The AI Nexus Herald is a fully autonomous, multi-agent AI system designed to discover, research, and publish AI-focused newsletters with minimal human intervention. Built using LangGraph for agent orchestration, FastAPI for the backend, and Streamlit for the frontend, the system operates in a modular workflow across three specialized agents.
Link to the original publication has been provided in the References section.
The system is enhanced with a real-time RAG (Retrieval-Augmented Generation) pipeline using semantic similarity to ensure factual accuracy and relevance, and it incorporates evaluation metrics via DeepEval to assess clarity, structure, faithfulness, and topic relevance.
This project demonstrates a production-ready, scalable AI content generation pipeline that combines automation, fact-checking, and high editorial quality to deliver insightful newsletters on the rapidly evolving AI landscape.
To fully understand and replicate the implementation of AI Nexus Herald, the following tools, libraries, and knowledge are required. These are divided into must-have and optional prerequisites:
Category | Requirement |
---|---|
Programming | Intermediate proficiency in Python |
Frameworks | Familiarity with LangChain, LangGraph, FastAPI, and DeepEval |
LLMs | Working knowledge of Large Language Models (LLMs) and API-based usage (e.g., OpenAI, Groq) |
APIs | Understanding of RESTful API integration and authentication |
Tools | Experience with virtual environments, package managers (pip, anaconda) |
Frontend | Basic familiarity with Streamlit for building minimal, lightweight UI |
YAML/JSON | Ability to read and write YAML and JSON for config and prompt files |
Category | Benefit |
---|---|
Docker | Enables containerization and easier deployment across environments |
LLM Prompt Engineering | Helps in refining agent prompts for deterministic outputs |
LangChain Tools Development | Useful for custom tool integration |
Deployment Platforms | Knowledge of Render, Vercel, or AWS for production deployment |
Topic Finder Agent: Extracts AI-related topics from curated RSS feeds using semantic analysis.
Deep Researcher Agent: Retrieves relevant news articles, summarizes context, and validates sources.
Newsletter Writer Agent: Crafts a coherent, structured newsletter in a professional tone.
Finally, the system prepares the completed newsletter for publication or email distribution.
The data ingestion layer extracts information from the RSS feed URLs listed in the rss_config.yaml file. The URLs have been carefully selected to represent current AI-related news, and the configuration file may be updated periodically based on audience news preferences.
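As a rough illustration of this ingestion step, the sketch below reads feed URLs from a YAML config and parses them with feedparser; the config key name (rss_feeds) and the field mapping are assumptions, not the project's exact implementation.

```python
# Illustrative sketch of the RSS ingestion step (config key and field names are assumptions).
import yaml
import feedparser

def load_feed_entries(config_path: str = "rss_config.yaml") -> list[dict]:
    """Read feed URLs from the YAML config and return raw feed entries."""
    with open(config_path, "r", encoding="utf-8") as f:
        config = yaml.safe_load(f)

    entries = []
    for url in config.get("rss_feeds", []):  # assumed key name in rss_config.yaml
        parsed = feedparser.parse(url)
        for entry in parsed.entries:
            entries.append({
                "title": entry.get("title", ""),
                "link": entry.get("link", ""),
                "summary": entry.get("summary", ""),
            })
    return entries
```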
LangGraph orchestration keeps the whole process intuitive, handling the complete workflow with built-in state management.
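To illustrate the orchestration pattern, the following LangGraph sketch wires three placeholder nodes through a shared state; the real agents, state fields, and node names in the project differ, so treat this as a minimal shape of the workflow rather than the actual implementation.

```python
# Minimal LangGraph orchestration sketch (state fields and node bodies are simplified assumptions).
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class NewsletterState(TypedDict, total=False):
    topics: list[str]
    news: list[dict]
    newsletter: str

def find_topics(state: NewsletterState) -> dict:
    # Stand-in for the Topic Finder agent.
    return {"topics": ["Example trending AI topic"]}

def research_topics(state: NewsletterState) -> dict:
    # Stand-in for the Deep Researcher agent: gather articles per topic.
    return {"news": [{"topic": t, "news_articles": []} for t in state["topics"]]}

def write_newsletter(state: NewsletterState) -> dict:
    # Stand-in for the Newsletter Writer agent: compose the final markdown.
    return {"newsletter": "# AI Nexus Herald\n..."}

graph = StateGraph(NewsletterState)
graph.add_node("topic_finder", find_topics)
graph.add_node("deep_researcher", research_topics)
graph.add_node("newsletter_writer", write_newsletter)
graph.add_edge(START, "topic_finder")
graph.add_edge("topic_finder", "deep_researcher")
graph.add_edge("deep_researcher", "newsletter_writer")
graph.add_edge("newsletter_writer", END)

app = graph.compile()
final_state = app.invoke({})  # returns the accumulated state with the newsletter
```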
Prompts are centralized in prompt_builder.py for consistent responses. RAG proves to be the best option in this scenario, where news must be extracted from existing RSS feed content. I selected the nomic-ai text embeddings model because its large context window is potentially useful for handling multiple news articles.
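Below is a minimal sketch of semantic-similarity retrieval over RSS content using the nomic-ai embedding model via sentence-transformers and cosine similarity. The model wrapper, query/document prefixes, and function names here are illustrative assumptions; the project's actual RAG wiring (chunking, storage, thresholds) is not shown.

```python
# Sketch of semantic retrieval over RSS content (assumed wrapper: sentence-transformers).
import numpy as np
from sentence_transformers import SentenceTransformer

# nomic-embed-text models expect task prefixes such as "search_query:" / "search_document:".
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

def top_k_articles(topic: str, articles: list[str], k: int = 5) -> list[str]:
    """Rank article texts by cosine similarity to the topic and return the top k."""
    query_vec = model.encode([f"search_query: {topic}"], normalize_embeddings=True)[0]
    doc_vecs = model.encode(
        [f"search_document: {a}" for a in articles], normalize_embeddings=True
    )
    scores = doc_vecs @ query_vec          # cosine similarity (vectors are normalized)
    ranked = np.argsort(scores)[::-1][:k]  # indices of the most similar articles
    return [articles[i] for i in ranked]
```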
I used FastAPI for the information-retrieval backend, the Groq API for LLM inference, and the OpenAI API for evaluating the system.
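For illustration, a backend endpoint that kicks off the workflow might look like the sketch below; the route name, module path, and response shape are assumptions inferred from the test code shown later in this publication.

```python
# Hypothetical FastAPI endpoint that triggers the full workflow (route and import path are assumptions).
import os

from fastapi import FastAPI

from src.backend.orchestrator import Orchestrator, OrchestratorState  # assumed module path

app = FastAPI(title="AI Nexus Herald")

@app.post("/generate-newsletter")
def generate_newsletter() -> dict:
    """Run the orchestrator graph and return the newsletter markdown."""
    orchestrator = Orchestrator(os.getenv("GROQ_API_KEY"))
    graph = orchestrator.build_orchestrator_graph()
    final_state = graph.invoke(OrchestratorState(), config={"recursion_limit": 200})
    return {"newsletter": final_state["newsletter"]}
```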
The frontend of AI Nexus Herald is built using Streamlit. It displays a button to generate the newsletter and shows a preview of the final newsletter once it has been generated, allowing for manual review before publishing and incorporating a human in the loop.
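A minimal Streamlit sketch of this human-in-the-loop flow is given below; the backend URL and the JSON response shape are assumptions.

```python
# Minimal Streamlit sketch of the generate-and-review flow (endpoint URL is an assumption).
import requests
import streamlit as st

st.title("AI Nexus Herald")

if st.button("Generate Newsletter"):
    with st.spinner("Running the multi-agent workflow..."):
        response = requests.post("http://localhost:8000/generate-newsletter", timeout=600)
    newsletter = response.json().get("newsletter", "")
    st.markdown(newsletter)  # manual review before the newsletter is published
```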
pip install -r requirements.txt
GROQ_API_KEY=...
OPENAI_API_KEY=...
LANGCHAIN_API_KEY=...
LANGCHAIN_TRACING_V2=true
LANGCHAIN_PROJECT="AI Nexus Herald Production"
uvicorn src.backend.main:app --reload
streamlit run Home.py
I made several essential improvements to the existing system to make it production-ready.
The evaluation and testing strategy for AI Nexus Herald includes both automated testing as well as LLM output evaluation.
Most importantly, I have used pytest-based testing through DeepEval. The testing strategy includes:
I have generated an RSS context dataset using the RSS feed URLs from the configuration file. This dataset is used during evaluation to compare the generated newsletter content with context using LLM-as-a-Judge.
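The snippet below sketches how such an RSS context dataset could be written to disk for later comparison; the exact file names and schema under outputs/dataset are assumptions.

```python
# Sketch of persisting the RSS context used during a run (file name and layout are assumptions).
import json
from pathlib import Path

def save_rss_context(entries: list[dict], dataset_index: int) -> Path:
    """Store the RSS titles/summaries from a run so LLM-as-a-Judge can compare against them later."""
    out_dir = Path("outputs/dataset")
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"rss_context_{dataset_index}.json"
    path.write_text(json.dumps(entries, indent=2, ensure_ascii=False), encoding="utf-8")
    return path
```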
The contextual dataset is stored in the outputs/dataset folder.
I have created multiple datasets by running the application and saved them. They are used during evaluation to be compared with the contextual dataset for calculating metrics such as relevancy, correctness, and faithfulness.
The generated datasets are present in the outputs/dataset folder.
The following security measures have been adopted to keep the system safe and secure in terms of accessibility and output generation.
The AI Nexus Herald testing suite includes separate test scripts for each of the agents and for the end-to-end orchestrated workflow.
Tool calling has been tested in individual agent test scripts.
Here are the test case snippets of various test scripts:
```python
# Rule-based test: Ensure datasets are not empty
@pytest.mark.parametrize("dataset_index", ["0", "1", "2"])
def test_basic_data_loading(dataset_index):
    rss_data, generated = load_datasets(dataset_index)
    if not rss_data or not generated:
        pytest.skip(f"Skipping dataset {dataset_index} due to empty data.")

    gold_titles = [item["title"] for item in rss_data]
    generated_topics = [news["topic"] for news in generated]

    result = {
        "dataset_index": dataset_index,
        "gold_titles_count": len(gold_titles),
        "generated_topics_count": len(generated_topics),
        "timestamp": datetime.utcnow().isoformat()
    }
    save_result(dataset_index, "basic_data_loading", result)

    assert len(gold_titles) > 0, "Gold titles dataset is empty."
    assert len(generated_topics) > 0, "Generated topics dataset is empty."


# Rule-based test: Ensure data structure is correct
@pytest.mark.parametrize("dataset_index", ["0", "1", "2"])
def test_data_structure(dataset_index):
    rss_data, generated = load_datasets(dataset_index)
    if not rss_data or not generated:
        pytest.skip(f"Skipping dataset {dataset_index} due to empty data.")

    save_result(dataset_index, "data_structure", {
        "dataset_index": dataset_index,
        "rss_count": len(rss_data),
        "rss_type": type(rss_data).__name__,
        "generated_count": len(generated),
        "generated_type": type(generated).__name__,
        "timestamp": datetime.utcnow().isoformat()
    })

    # Top-level checks
    assert isinstance(rss_data, list), "rss_data must be a list"
    assert isinstance(generated, list), "generated must be a list"

    # Element structure
    for idx, item in enumerate(rss_data):
        assert isinstance(item, dict), f"rss_data[{idx}] must be a dict"
        assert "title" in item, f"rss_data[{idx}] missing 'title' key"

    for idx, news in enumerate(generated):
        assert isinstance(news, dict), f"generated[{idx}] must be a dict"
        assert "topic" in news, f"generated[{idx}] missing 'topic' key"


# LLM-as-a-Judge test: Evaluate news topic relevance to RSS titles
@pytest.mark.parametrize("dataset_index", ["0", "1", "2"])
def test_news_topic_relevance(dataset_index):
    rss_data, generated = load_datasets(dataset_index)
    if not rss_data or not generated:
        pytest.skip(f"Skipping dataset {dataset_index} due to empty data.")

    gold_titles = [item["title"] for item in rss_data]
    topics = [news["topic"] for news in generated]

    relevance_metric = GEval(
        name="Topic Relevance",
        criteria="Determine if the generated news topics are relevant to the titles of the RSS feeds.",
        evaluation_params=[
            LLMTestCaseParams.ACTUAL_OUTPUT,
            LLMTestCaseParams.RETRIEVAL_CONTEXT
        ],
        threshold=0.7
    )

    test_case = LLMTestCase(
        name=f"topic_relevance_test_{dataset_index}",
        input="Select the top 5 trending news topics based on the RSS feed titles.",
        actual_output="\n".join(topics),
        retrieval_context=gold_titles
    )

    # Capture DeepEval table output
    buffer = io.StringIO()
    sys_stdout = sys.stdout
    sys.stdout = buffer
    try:
        results = evaluate([test_case], [relevance_metric])
    finally:
        sys.stdout = sys_stdout
    table_output = buffer.getvalue()

    test_results = results.test_results
    passed = all(r.success for r in test_results)

    # Enhanced save with structured data
    result_data = {
        "dataset_index": dataset_index,
        "passed": passed,
        "timestamp": datetime.utcnow().isoformat(),
        "deepeval_table_output": table_output.strip(),
        "structured_results": [
            {
                "test_name": tr.name,
                "success": tr.success,
                "metrics": [
                    {
                        "name": md.name,
                        "score": md.score,
                        "threshold": md.threshold,
                        "success": md.success,
                        "reason": md.reason,
                        "cost": md.evaluation_cost
                    }
                    for md in tr.metrics_data
                ]
            }
            for tr in test_results
        ]
    }
    save_result(dataset_index, "news_topic_relevance", result_data)

    # Print summary
    for test_result in results.test_results:
        for metric_data in test_result.metrics_data:
            status = "PASSED" if metric_data.success else "FAILED"
            print(f"\nSUMMARY: {metric_data.name} - {status}")
            print(f"Score: {metric_data.score:.4f} (Threshold: {metric_data.threshold})")
            print(f"Reason: {metric_data.reason[:100]}...")

    # Fail pytest if needed
    assert passed, f"Topic relevance test failed for dataset {dataset_index}"
```
```python
# Rule-based test: Check if the data structure of generated news matches the expected format
@pytest.mark.parametrize("dataset_index", ["0", "1", "2"])
def test_data_structure_news(dataset_index):
    _, generated = load_datasets(dataset_index)
    if not generated:
        pytest.skip(f"Skipping dataset {dataset_index} due to empty data.")

    passed = True
    for idx, item in enumerate(generated):
        if not all(key in item for key in ["topic", "news_articles"]):
            passed = False
            break
        if not isinstance(item["news_articles"], list):
            passed = False
            break
        for news in item["news_articles"]:
            if not all(k in news for k in ["title", "link", "summary", "content"]):
                passed = False
                break

    save_result(dataset_index, "data_structure_news", {
        "dataset_index": dataset_index,
        "generated_count": len(generated),
        "passed": passed,
        "timestamp": datetime.utcnow().isoformat()
    })
    assert passed, f"Data structure test failed for dataset {dataset_index}"


# LLM-as-a-Judge: Evaluate if the generated news articles are relevant to their assigned topic
@pytest.mark.parametrize("dataset_index", ["0", "1", "2"])
def test_news_relevance_to_topic(dataset_index):
    _, generated = load_datasets(dataset_index)
    if not generated:
        pytest.skip(f"Skipping dataset {dataset_index} due to empty data.")

    relevance_metric = GEval(
        name="News Relevance to Topic",
        criteria="Determine if the news articles are semantically relevant to their assigned topic.",
        evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
        threshold=0.7
    )

    structured_results = []
    all_passed = True

    for i, item in enumerate(generated):
        topic = item["topic"]
        news_articles = item.get("news_articles", [])
        for j, news in enumerate(news_articles):
            test_case = LLMTestCase(
                name=f"news_relevance_topic_{i}_{j}",
                input=topic,
                actual_output=f"{news['title']}\n{news.get('summary', '')}\n{news.get('content', '')}"
            )

            buffer = io.StringIO()
            sys_stdout = sys.stdout
            sys.stdout = buffer
            try:
                results = evaluate([test_case], [relevance_metric])
            finally:
                sys.stdout = sys_stdout
            table_output = buffer.getvalue()

            test_results = results.test_results
            for tr in test_results:
                structured_results.append({
                    "test_name": tr.name,
                    "success": tr.success,
                    "metrics": [
                        {
                            "name": md.name,
                            "score": md.score,
                            "threshold": md.threshold,
                            "success": md.success,
                            "reason": md.reason,
                            "cost": md.evaluation_cost
                        }
                        for md in tr.metrics_data
                    ]
                })
                if not tr.success:
                    all_passed = False

    save_result(dataset_index, "news_relevance_to_topic", {
        "dataset_index": dataset_index,
        "passed": all_passed,
        "timestamp": datetime.utcnow().isoformat(),
        "structured_results": structured_results
    })
    assert all_passed, f"News relevance to topic failed for dataset {dataset_index}"
```
```python
# Rule-based test: Check presence of newsletter and generated news
@pytest.mark.parametrize("dataset_index", ["0", "1", "2"])
def test_newsletter_and_generated_news_presence(dataset_index):
    generated = load_generated_news(dataset_index)
    newsletter = load_newsletter(dataset_index)

    result = {
        "dataset_index": dataset_index,
        "passed": bool(generated and isinstance(generated, list) and newsletter.strip()),
        "generated_news_count": len(generated) if generated else 0,
        "newsletter_length": len(newsletter.strip()) if newsletter else 0,
        "timestamp": datetime.utcnow().isoformat()
    }
    save_result(dataset_index, "newsletter_and_generated_news_presence", result)

    assert generated and isinstance(generated, list), f"Generated news dataset {dataset_index} is missing or invalid."
    assert newsletter.strip(), f"Newsletter content for dataset {dataset_index} is empty."


# LLM-as-a-Judge: Evaluate newsletter relevance to news articles
@pytest.mark.parametrize("dataset_index", ["0", "1", "2"])
def test_newsletter_relevance_to_news(dataset_index):
    generated = load_generated_news(dataset_index)
    newsletter = load_newsletter(dataset_index)

    all_news_texts = []
    for item in generated:
        for news in item.get("news_articles", []):
            text = f"{news['title']}\n{news.get('link', '')}\n{news.get('summary', '')}\n{news.get('content', '')}"
            all_news_texts.append(text)

    relevance_metric = GEval(
        name="Newsletter Relevance",
        criteria="Determine whether the newsletter reflects the main ideas and topics of the extracted news articles.",
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.RETRIEVAL_CONTEXT],
        threshold=0.5
    )

    test_case = LLMTestCase(
        name=f"newsletter_relevance_test_{dataset_index}",
        input="Generated newsletter based on curated AI news.",
        actual_output=newsletter,
        retrieval_context=all_news_texts
    )

    buffer = io.StringIO()
    sys_stdout = sys.stdout
    sys.stdout = buffer
    try:
        results = evaluate([test_case], [relevance_metric])
    finally:
        sys.stdout = sys_stdout
    table_output = buffer.getvalue()

    test_results = results.test_results
    passed = all(r.success for r in test_results)

    result_data = {
        "dataset_index": dataset_index,
        "passed": passed,
        "timestamp": datetime.utcnow().isoformat(),
        "deepeval_table_output": table_output.strip(),
        "structured_results": [
            {
                "test_name": tr.name,
                "success": tr.success,
                "metrics": [
                    {
                        "name": md.name,
                        "score": md.score,
                        "threshold": md.threshold,
                        "success": md.success,
                        "reason": md.reason,
                        "cost": md.evaluation_cost
                    }
                    for md in tr.metrics_data
                ]
            }
            for tr in test_results
        ]
    }
    save_result(dataset_index, "newsletter_relevance_to_news", result_data)
    assert passed, f"Newsletter relevance test failed for dataset {dataset_index}"
```
```python
# LLM-as-a-judge: Check if the tool is called correctly
def test_tool_call_correctness():
    """Check if extract_titles_from_rss tool is invoked."""
    groq_api_key = os.getenv("GROQ_API_KEY")
    assert groq_api_key, "GROQ_API_KEY must be set."

    agent = TopicFinder(groq_api_key)
    graph = agent.build_topic_finder_graph()
    initial = get_initial_state()
    final = graph.invoke(initial, config={"recursion_limit": 100})

    test_case = LLMTestCase(
        name="tool_invocation",
        input=initial.messages[0].content,
        actual_output="",
        tools_called=[ToolCall(name="extract_titles_from_rss")],
        expected_tools=[ToolCall(name="extract_titles_from_rss")]
    )
    metric = ToolCorrectnessMetric()

    buffer = io.StringIO()
    sys_stdout = sys.stdout
    sys.stdout = buffer
    try:
        results = evaluate([test_case], [metric])
    finally:
        sys.stdout = sys_stdout

    passed = all(r.success for r in results.test_results)
    save_result("tool_call_correctness", {
        "timestamp": datetime.utcnow().isoformat(),
        "passed": passed,
        "deepeval_table_output": buffer.getvalue().strip()
    })
    assert passed, "Tool call correctness test failed."


# Rule based test: Ensure 5 non-empty topics are returned as a list of strings
def test_topics_structure_and_count():
    """Ensure exactly 5 non-empty topic strings are returned."""
    groq_api_key = os.getenv("GROQ_API_KEY")
    assert groq_api_key

    agent = TopicFinder(groq_api_key)
    graph = agent.build_topic_finder_graph()
    final = graph.invoke(get_initial_state(), config={"recursion_limit": 100})

    topics = final.get("topics", [])
    passed = (
        isinstance(topics, list)
        and len(topics) == 5
        and all(isinstance(t, str) and t.strip() for t in topics)
    )
    save_result("topics_structure_and_count", {
        "timestamp": datetime.utcnow().isoformat(),
        "topic_count": len(topics),
        "passed": passed
    })
    assert passed, f"Expected 5 valid topics, got {len(topics)}"


# LLM-as-a-judge: Check if each topic relates to RSS titles
def test_topic_relevancy():
    """Use LLM to check if each topic relates to RSS titles."""
    titles = load_rss_titles()
    groq_api_key = os.getenv("GROQ_API_KEY")

    agent = TopicFinder(groq_api_key)
    graph = agent.build_topic_finder_graph()
    final = graph.invoke(get_initial_state(), config={"recursion_limit": 100})
    topics = final.get("topics", [])

    metric = AnswerRelevancyMetric(threshold=0.5)
    test_cases = [
        LLMTestCase(
            input="Evaluate the relevance of this topic to the given RSS titles.",
            actual_output=topic,
            retrieval_context=titles
        )
        for topic in topics
    ]

    buffer = io.StringIO()
    sys_stdout = sys.stdout
    sys.stdout = buffer
    try:
        results = evaluate(test_cases, [metric])
    finally:
        sys.stdout = sys_stdout

    passed = all(r.success for r in results.test_results)
    structured_metrics = []
    for tr in results.test_results:
        for md in tr.metrics_data:
            structured_metrics.append({
                "name": md.name,
                "score": md.score,
                "threshold": md.threshold,
                "success": md.success,
                "reason": md.reason,
                "cost": md.evaluation_cost
            })

    save_result("topic_relevancy", {
        "timestamp": datetime.utcnow().isoformat(),
        "passed": passed,
        "deepeval_table_output": buffer.getvalue().strip(),
        "metrics": structured_metrics
    })
    assert passed, "One or more topics were not relevant to RSS titles."
```
```python
# LLM-as-a-judge: Check if the agent calls the tool correctly
@pytest.mark.parametrize("dataset_index", ["0", "1", "2"])
def test_tool_call_correctness(dataset_index):
    groq_api_key = os.getenv("GROQ_API_KEY")
    assert groq_api_key, "GROQ_API_KEY must be set."

    topics = load_datasets(dataset_index)
    passed_all = True
    for topic in topics:
        agent = DeepResearcher(groq_api_key)
        graph = agent.build_deep_researcher_graph()
        initial = get_initial_state(topic)
        _ = graph.invoke(initial, config={"recursion_limit": 100})

        case = LLMTestCase(
            name=f"tool_usage_for_{topic}",
            input=initial.messages[0].content,
            actual_output="",
            tools_called=[ToolCall(name="extract_news_from_rss")],
            expected_tools=[ToolCall(name="extract_news_from_rss")]
        )
        metric = ToolCorrectnessMetric()
        result = evaluate([case], [metric])
        if not all(r.success for r in result.test_results):
            passed_all = False

    save_result(dataset_index, "tool_call_correctness", {
        "dataset_index": dataset_index,
        "passed": passed_all,
        "timestamp": datetime.utcnow().isoformat()
    })
    assert passed_all, f"Tool call correctness failed for dataset {dataset_index}"


# Rule-based test: Check if news articles are present and have a specific structure (Title, Link, Summary, Content)
@pytest.mark.parametrize("dataset_index", ["0", "1", "2"])
def test_news_output_format_and_presence(dataset_index):
    groq_api_key = os.getenv("GROQ_API_KEY")
    assert groq_api_key

    topics = load_datasets(dataset_index)
    passed_all = True
    for topic in topics:
        agent = DeepResearcher(groq_api_key)
        graph = agent.build_deep_researcher_graph()
        final = graph.invoke(get_initial_state(topic), config={"recursion_limit": 100})

        arts = final.get("news_articles", [])
        if not (isinstance(arts, list) and len(arts) >= 1):
            passed_all = False
            continue
        for art in arts:
            if not (isinstance(art, dict) and all(
                k in art and isinstance(art[k], str)
                for k in ("title", "link", "summary", "content")
            )):
                passed_all = False

    save_result(dataset_index, "news_output_format", {
        "dataset_index": dataset_index,
        "passed": passed_all,
        "timestamp": datetime.utcnow().isoformat()
    })
    assert passed_all, f"News output format or presence failed for dataset {dataset_index}"


# LLM-as-a-judge: Check if each news article is relevant to the topic
@pytest.mark.parametrize("dataset_index", ["0", "1", "2"])
def test_news_relevancy_to_topic(dataset_index):
    groq_api_key = os.getenv("GROQ_API_KEY")
    assert groq_api_key

    topics = load_datasets(dataset_index)
    passed_all = True
    relevancy_metric = AnswerRelevancyMetric(threshold=0.5)

    for topic in topics:
        agent = DeepResearcher(groq_api_key)
        graph = agent.build_deep_researcher_graph()
        final = graph.invoke(get_initial_state(topic), config={"recursion_limit": 100})

        for art in final.get("news_articles", []):
            content = f"{art['title']}\n{art['link']}\n{art['summary']}\n{art.get('content','')}"
            case = LLMTestCase(
                name=f"rel_news_{topic}",
                input=f"Assess if this news is relevant to topic: {topic}",
                actual_output=content,
                retrieval_context=[topic]
            )
            result = evaluate([case], [relevancy_metric])
            if not all(r.success for r in result.test_results):
                passed_all = False

    save_result(dataset_index, "news_relevancy", {
        "dataset_index": dataset_index,
        "passed": passed_all,
        "timestamp": datetime.utcnow().isoformat()
    })
    assert passed_all, f"News relevancy failed for dataset {dataset_index}"
```
```python
# Rule-based test: Check if the newsletter writer agent generates a non-empty markdown
@pytest.mark.parametrize("dataset_index", ["0", "1", "2"])
def test_generation_and_structure(dataset_index):
    """Test newsletter generation produces non-empty markdown."""
    groq_api = os.getenv("GROQ_API_KEY")
    assert groq_api, "GROQ_API_KEY not set."

    news_list = load_generated_news(dataset_index)
    writer = NewsletterWriter(groq_api)
    graph = writer.build_newsletter_writer_graph()
    state = compose_initial_state(news_list)
    result = graph.invoke(state, config={"recursion_limit": 100})

    newsletter = result.get("newsletter", "")
    passed = isinstance(newsletter, str) and bool(newsletter.strip())
    save_result(dataset_index, "generation_and_structure", {
        "dataset_index": dataset_index,
        "newsletter_length": len(newsletter),
        "passed": passed,
        "timestamp": datetime.utcnow().isoformat()
    })
    assert passed, "Newsletter markdown is empty or not a string."


# LLM-based test: Evaluate newsletter relevancy and faithfulness to articles
@pytest.mark.parametrize("dataset_index", ["0", "1", "2"])
def test_newsletter_relevancy_and_faithfulness(dataset_index):
    """Evaluate that newsletter is relevant and faithful to articles."""
    groq_api = os.getenv("GROQ_API_KEY")
    assert groq_api

    news_list = load_generated_news(dataset_index)
    compiled_context = []
    for entry in news_list:
        for art in entry.get("news_articles", []):
            if isinstance(art, str):
                compiled_context.append(art)
            elif isinstance(art, dict):
                compiled_context.append(
                    f"{art.get('title','')}\n{art.get('summary','')}\n{art.get('content','')}"
                )

    state = compose_initial_state(news_list)
    writer = NewsletterWriter(groq_api)
    graph = writer.build_newsletter_writer_graph()
    result = graph.invoke(state, config={"recursion_limit": 100})
    newsletter_text = result.get("newsletter", "")

    rel_metric = AnswerRelevancyMetric(threshold=0.5)
    faith_metric = FaithfulnessMetric(threshold=0.5)
    test_case = LLMTestCase(
        name=f"newsletter_eval_{dataset_index}",
        input="Evaluate if the newsletter reflects the extracted news correctly.",
        actual_output=newsletter_text,
        retrieval_context=compiled_context
    )

    buffer = io.StringIO()
    sys_stdout = sys.stdout
    sys.stdout = buffer
    try:
        results = evaluate([test_case], [rel_metric, faith_metric])
    finally:
        sys.stdout = sys_stdout
    table_output = buffer.getvalue()

    passed = all(r.success for r in results.test_results)
    result_data = {
        "dataset_index": dataset_index,
        "passed": passed,
        "timestamp": datetime.utcnow().isoformat(),
        "deepeval_table_output": table_output.strip(),
        "structured_results": [
            {
                "test_name": tr.name,
                "success": tr.success,
                "metrics": [
                    {
                        "name": md.name,
                        "score": md.score,
                        "threshold": md.threshold,
                        "success": md.success,
                        "reason": md.reason,
                        "cost": md.evaluation_cost
                    }
                    for md in tr.metrics_data
                ]
            }
            for tr in results.test_results
        ]
    }
    save_result(dataset_index, "newsletter_relevancy_and_faithfulness", result_data)
    assert passed, f"Newsletter relevancy/faithfulness failed for dataset {dataset_index}"
```
```python
# Rule-based test: Check for task completion
def test_full_workflow_outputs():
    """Run full orchestrator and validate structure."""
    groq_api = os.getenv("GROQ_API_KEY")
    assert groq_api, "GROQ_API_KEY must be set."

    orchestrator = Orchestrator(groq_api)
    graph = orchestrator.build_orchestrator_graph()
    final_state = graph.invoke(OrchestratorState(), config={"recursion_limit": 200})

    news = final_state["news"]
    newsletter = final_state["newsletter"]
    passed = (
        isinstance(news, list) and news
        and isinstance(newsletter, str) and newsletter.strip()
    )
    save_result("full_workflow_outputs", {
        "timestamp": datetime.utcnow().isoformat(),
        "passed": passed,
        "news_count": len(news) if isinstance(news, list) else 0,
        "newsletter_length": len(newsletter) if isinstance(newsletter, str) else 0
    })
    assert passed, "Orchestrator failed to produce valid outputs."


# LLM-as-a-Judge: Test semantic consistency between topics, articles, and newsletter
def test_orchestrator_semantic_consistency():
    """Check semantic consistency between topics, articles, and newsletter."""
    groq_api = os.getenv("GROQ_API_KEY")
    assert groq_api

    orchestrator = Orchestrator(groq_api)
    graph = orchestrator.build_orchestrator_graph()
    final = graph.invoke(OrchestratorState(), config={"recursion_limit": 200})

    news = final["news"]
    newsletter = final["newsletter"]

    topic_metric = AnswerRelevancyMetric(threshold=0.5)
    rel_metric = AnswerRelevancyMetric(threshold=0.5)
    faith_metric = FaithfulnessMetric(threshold=0.5)
    all_metrics_results = []

    # Articles vs Topic
    for entry in news:
        topic = entry.topic
        for art in entry.news_articles:
            content = f"{art.title}\n{art.summary}\n{art.content}"
            case = LLMTestCase(
                name=f"article_relevancy_for_{topic}",
                input=f"Check relevance of news to topic: {topic}",
                actual_output=content,
                retrieval_context=[topic]
            )
            buffer = io.StringIO()
            sys_stdout = sys.stdout
            sys.stdout = buffer
            try:
                results = evaluate([case], [topic_metric])
            finally:
                sys.stdout = sys_stdout
            for tr in results.test_results:
                for md in tr.metrics_data:
                    all_metrics_results.append({
                        "name": md.name,
                        "score": md.score,
                        "threshold": md.threshold,
                        "success": md.success,
                        "reason": md.reason,
                        "cost": md.evaluation_cost
                    })

    # Newsletter vs Aggregated News
    aggregated_text = []
    for entry in news:
        for art in entry.news_articles:
            aggregated_text.append(f"{art.title}\n{art.summary}\n{art.content}")

    case = LLMTestCase(
        name="newsletter_end_to_end_relevancy",
        input="Check newsletter alignment with all generated news",
        actual_output=newsletter,
        retrieval_context=aggregated_text
    )
    buffer = io.StringIO()
    sys_stdout = sys.stdout
    sys.stdout = buffer
    try:
        results = evaluate([case], [rel_metric, faith_metric])
    finally:
        sys.stdout = sys_stdout
    for tr in results.test_results:
        for md in tr.metrics_data:
            all_metrics_results.append({
                "name": md.name,
                "score": md.score,
                "threshold": md.threshold,
                "success": md.success,
                "reason": md.reason,
                "cost": md.evaluation_cost
            })

    passed = all(m["success"] for m in all_metrics_results)
    save_result("semantic_consistency", {
        "timestamp": datetime.utcnow().isoformat(),
        "passed": passed,
        "metrics": all_metrics_results
    })
    assert passed, "Semantic consistency test failed."
```
Following are the results of various tests conducted through DeepEval. I have summarized the results of each testing script in markdown files for easy understanding and accessibility.
Generated on: 2025-08-10T11:42:53.948324
Dataset | Gold Titles | Generated Topics | Status |
---|---|---|---|
0 | 115 | 5 | ✅ |
1 | 115 | 5 | ✅ |
2 | 115 | 5 | ✅ |
Dataset | RSS Type | RSS Count | Generated Type | Generated Count |
---|---|---|---|---|
0 | list | 115 | list | 5 |
1 | list | 115 | list | 5 |
2 | list | 115 | list | 5 |
Test: topic_relevance_test_0
Topic Relevance [GEval] ✅
Test: topic_relevance_test_1
Topic Relevance [GEval] ✅
Test: topic_relevance_test_2
Topic Relevance [GEval] ✅
Overall Status: EXCELLENT
Generated on: 2025-08-10T12:10:51.057013
Dataset | Generated Count | Passed |
---|---|---|
0 | 5 | ✅ |
1 | 5 | ✅ |
2 | 5 | ✅ |
Test: news_relevance_topic_0_0
News Relevance to Topic [GEval] (✅)
News Relevance to Topic [GEval] (✅)
News Relevance to Topic [GEval] (✅)
News Relevance to Topic [GEval] (✅)
News Relevance to Topic [GEval] (✅)
News Relevance to Topic [GEval] (✅)
Test: news_relevance_topic_0_0
News Relevance to Topic [GEval] (✅)
News Relevance to Topic [GEval] (✅)
News Relevance to Topic [GEval] (✅)
News Relevance to Topic [GEval] (✅)
News Relevance to Topic [GEval] (✅)
News Relevance to Topic [GEval] (✅)
News Relevance to Topic [GEval] (✅)
News Relevance to Topic [GEval] (✅)
News Relevance to Topic [GEval] (✅)
News Relevance to Topic [GEval] (✅)
Test: news_relevance_topic_0_0
News Relevance to Topic [GEval] (✅)
News Relevance to Topic [GEval] (✅)
News Relevance to Topic [GEval] (✅)
News Relevance to Topic [GEval] (✅)
News Relevance to Topic [GEval] (✅)
News Relevance to Topic [GEval] (✅)
News Relevance to Topic [GEval] (✅)
Overall Status: EXCELLENT
Generated on: 2025-08-10T13:49:40.974387
Test: newsletter_relevance_test_0
Newsletter Relevance [GEval] ✅
Test: newsletter_relevance_test_1
Newsletter Relevance [GEval] ✅
Test: newsletter_relevance_test_2
Newsletter Relevance [GEval] ✅
Overall Status: EXCELLENT
Generated on: 2025-08-10T13:02:56.765064
Timestamp | Passed |
---|---|
2025-08-10T12:51:26.263972 | ✅ |
Timestamp | Topic Count | Passed |
---|---|---|
2025-08-10T12:51:41.750282 | 5 | ✅ |
Generated on: 2025-08-10T15:53:09.079991
Overall Status: EXCELLENT
Generated on: 2025-08-10T16:21:28.425819
Test: newsletter_eval_0
Test: newsletter_eval_1
Test: newsletter_eval_2
Overall Status: EXCELLENT
Generated on: 2025-08-10T16:37:20.299375
Here is a demonstration of using AI Nexus Herald in both local and production environments. Due to the limited memory of Render's free tier, the production backend does not run, but the frontend works fine on Streamlit Cloud.
The localhost demo works perfectly. It shows the complete newsletter generation process along with console logs and LangSmith tracing, and it verifies a link from the generated newsletter by clicking it and validating it through human-in-the-loop review.
The AI Nexus Herald implements a multi-agent orchestration framework that integrates real-time news ingestion, ensures cross-agent context consistency, and applies domain-focused evaluation using DeepEval metrics tailored for AI-related content. Although content generation is AI-driven, the system still has several limitations.
Addressing these limitations would help improve the system and make it a general purpose newsletter generation system.
The AI industry is experiencing rapid growth, with advancements in generative AI, automation, and multi-agent systems reshaping how information is created and consumed. Recent market reports highlight a surge in AI adoption across sectors, and AI-related news volume continues to grow at an unprecedented pace. This expanding landscape presents both opportunities and challenges for the newsletter generation system, enabling timely, high-value content delivery but also demanding scalability, adaptability, and coverage of an ever-broader range of topics to stay relevant and competitive.
AI Nexus Herald is released under the MIT License, allowing free use, modification, and distribution of the software with proper attribution. Users are encouraged to build upon and adapt the system for their own applications.
If you encounter bugs, unexpected behavior, or have suggestions:
While the current system achieves a high degree of automation, reliability, and contextual accuracy, several avenues exist for further system enhancement:
Cross-Language and Cross-Cultural Expansion: Extend the AI Nexus Herald to multilingual pipelines to monitor global AI trends and broaden contextual diversity in topic discovery.
Enhanced Real-Time Knowledge Integration: Incorporate additional high-frequency data sources such as social media trend streams, real-time web search, and research preprint servers to improve coverage of emerging topics.
User-Driven Personalization: Implement fine-grained user preference learning for topic selection, enabling tailored newsletter editions for different AI subdomains or interest profiles.
Advanced Fact-Verification Agents: Introduce a specialized verification agent that cross-references generated content with trusted sources to improve factual reliability.
Deployment on Cloud: Given the memory limitations of free tiers on platforms such as Render and Vercel, deploy the application on a cloud server (AWS, GCP, Azure, etc.) or on an upgraded plan of these services.
These future directions not only aim to refine the AI Nexus Herald pipeline but also open up new questions regarding multi-agent orchestration efficiency, contextual reasoning at scale, and domain-aware evaluation methodologies.
Technical documentation for the project is available in the docs folder of the GitHub repository.
There are three main documents generated for AI Nexus Herald:
During the development and deployment of AI Nexus Herald in production, I learned many lessons.
The AI Nexus Herald demonstrates how a modular multi-agent AI system can be taken from prototype to a fully operational production service. By combining RSS-based discovery, real-time retrieval-augmented generation using semantic similarity, and agent-level testing, it achieves both automation and content reliability.
The architecture supports scalability, security, and continuous improvement, making it a reference point for deploying AI-driven editorial systems at scale.