Cross-Check: A Multi-Agent System for Cross-Checking Phishing URLs

Introduction

Cross-Check is an advanced phishing detection framework powered by Large Language Models (LLMs). Built using Google's Agent Development Kit (ADK) and Mesop, it implements a "debate" mechanism where multiple specialized AI agents analyze a website from different perspectives before reaching a consensus on its legitimacy.

This project serves as the capstone submission for the Agentic AI Developer Certification (Module 3). It aims to deliver a production-grade system that mitigates AI hallucinations through cross-examination between agents and disciplined engineering.

Architecting Multi-Agent Systems

The Challenge: Single-Point Failure

Traditional phishing detection often relies on single-point analysis: asking one model, "Is this phishing?" This approach is prone to hallucinations. A sophisticated phishing site might look visually perfect to a standard LLM, while a legitimate site might be flagged because of benign anomalies. A more reliable system has to move beyond a single inference toward a panel of experts that debate the evidence.

The Agentic Pipeline

Cross-Check is built as a SequentialAgent pipeline with a debate loop at its core. Every request passes through three distinct stages:

1. Ingestion & Preprocessing

Before any AI analysis occurs, the UrlPreProcessor agent runs deterministic checks: it validates the URL format, verifies reachability, and scrapes the target website to extract clean HTML and visible text. This ensures that all subsequent agents analyze the exact same snapshot of the site and avoids spending tokens on invalid inputs.
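The deterministic part of this stage can be illustrated with a minimal, standard-library-only sketch. It is not the project's UrlPreProcessor; the names `is_valid_url` and `extract_visible_text` are hypothetical, and the reachability check and the actual HTTP fetch are left out so the example runs offline.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


def is_valid_url(url: str) -> bool:
    """Return True if the URL has an http(s) scheme and a host name."""
    try:
        parsed = urlparse(url.strip())
        return parsed.scheme in ("http", "https") and bool(parsed.hostname)
    except ValueError:
        # urlparse raises ValueError for malformed hosts such as "http://[bad"
        return False


class _VisibleTextParser(HTMLParser):
    """Collects text nodes, skipping content inside non-visible tags."""

    _SKIPPED = {"script", "style", "noscript", "template"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped tags.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_visible_text(html: str) -> str:
    """Return the page's visible text as a single space-joined string."""
    parser = _VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Rejecting malformed input before any model call is what keeps invalid URLs from consuming tokens; every later agent can then be handed the same extracted text.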

2. The Debate Loop (The Reasoning Engine)

The core of the system is the LoopAgent, which convenes a panel of four specialized experts to debate the findings:


Agent	Focus
URL Analyst	Examines domain patterns, typosquatting, subdomain usage, and TLD characteristics.
HTML Structure Analyst	Inspects the code for hidden elements, obfuscated scripts, suspicious input fields, and deceptive redirection patterns.
Content Semantic Analyst	Analyzes visible text for manipulative language, requests for sensitive information, and social engineering tactics.
Brand Impersonation Analyst	Detects mismatches between the brand identity (e.g., Apple, PayPal) and the actual URL/content.

These agents submit their findings to a Moderator, who evaluates whether a consensus exists. If the team disagrees, the Moderator triggers another round, in which the agents refine their arguments based on peer feedback.
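The control flow of such a debate can be sketched in plain Python, independent of ADK. The sketch below rests on assumptions the description above does not confirm: consensus is taken to mean a unanimous verdict, the number of rounds is capped by a hypothetical `max_rounds` parameter, and each specialist is modeled as a simple callable rather than an LLM agent.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Finding:
    agent: str
    verdict: str  # "PHISHING" or "LEGITIMATE"
    rationale: str


# A specialist receives the site snapshot plus the previous round's findings
# (its peer feedback) and returns its own finding for the current round.
Specialist = Callable[[str, Sequence[Finding]], Finding]


def has_consensus(findings: Sequence[Finding]) -> bool:
    """Assumed moderator rule: every specialist reports the same verdict."""
    return len({f.verdict for f in findings}) == 1


def run_debate(
    snapshot: str,
    specialists: Sequence[Specialist],
    max_rounds: int = 3,
) -> List[List[Finding]]:
    """Run rounds until consensus or max_rounds; return the full history."""
    history: List[List[Finding]] = []
    previous: List[Finding] = []
    for _ in range(max_rounds):
        current = [specialist(snapshot, previous) for specialist in specialists]
        history.append(current)
        if has_consensus(current):
            break
        previous = current  # disagreement: feed findings back as peer feedback
    return history
```

Returning the whole history rather than only the last round keeps every argument available to whatever component delivers the final verdict.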

3. Final Judgment

Once the debate concludes, a distinct JudgementAgent reviews the entire conversation history. It weighs the final arguments from all specialists and delivers the authoritative PHISHING or LEGITIMATE verdict.
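Because the verdict is produced by an LLM, downstream code benefits from normalizing it into a strict two-valued type. The following is a hedged sketch of one way to do that, not the project's JudgementAgent: it assumes a hypothetical output convention in which the judge ends its response with a line of the form `VERDICT: PHISHING` or `VERDICT: LEGITIMATE`, and the names `Verdict` and `parse_verdict` are illustrative.

```python
import re
from enum import Enum


class Verdict(str, Enum):
    PHISHING = "PHISHING"
    LEGITIMATE = "LEGITIMATE"


# Assumed convention: the label sits alone on a line, e.g. "VERDICT: PHISHING".
_VERDICT_LINE = re.compile(
    r"^[ \t]*VERDICT:[ \t]*(PHISHING|LEGITIMATE)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def parse_verdict(judge_output: str) -> Verdict:
    """Extract exactly one verdict label, or raise ValueError."""
    labels = {match.upper() for match in _VERDICT_LINE.findall(judge_output)}
    if len(labels) != 1:
        raise ValueError("expected exactly one verdict line in the judge output")
    return Verdict(labels.pop())
```

Raising on a missing or contradictory label, rather than guessing, lets the caller decide whether to retry the judge or surface an error.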

Engineering

Cross-Check is engineered to meet professional software standards, ensuring it is testable, portable, and resilient.

Containerization & Deployment

The application is fully containerized using Docker. The Dockerfile implements best practices by using uv for fast, frozen dependency management and creating a non-root mesop user for security compliance. This makes the system immediately deployable to environments like Hugging Face Spaces or Kubernetes.

Comprehensive Testing & Evaluation

Reliability is verified through a multi-layered testing strategy that goes beyond simple unit tests:


Layer	Description
Integration & Evaluation	The system includes a dedicated `eval` suite that uses the `AgentEvaluator` to run full end-to-end integration tests. By testing against structured datasets (`legitimate.evalset.json` and `phishing.evalset.json`), we can benchmark the system's actual detection performance and check that the "debate" mechanism functions correctly on real-world examples.
Unit Tests	Individual components, such as the `UrlPreProcessor` and utility functions, are verified using `pytest` to ensure robust error handling and correct data parsing.
CI/CD Pipeline	A GitHub Actions workflow (`tests.yml`) automatically executes the entire unit test suite on every push, so that regressions are caught early.
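To make "detection performance" concrete, the sketch below computes accuracy, precision, and recall from pairs of expected and predicted labels. It is an illustration under stated assumptions, not part of the project's `eval` suite: the function name `detection_metrics` is hypothetical, PHISHING is treated as the positive class, and how the pairs are obtained from the eval sets is left open.

```python
from typing import Dict, Iterable, Tuple


def detection_metrics(
    pairs: Iterable[Tuple[str, str]],
    positive: str = "PHISHING",
) -> Dict[str, float]:
    """Compute accuracy, precision, and recall from (expected, predicted) labels."""
    tp = fp = fn = tn = 0
    for expected, predicted in pairs:
        if predicted == positive:
            if expected == positive:
                tp += 1  # phishing correctly flagged
            else:
                fp += 1  # legitimate site wrongly flagged
        else:
            if expected == positive:
                fn += 1  # phishing site missed
            else:
                tn += 1  # legitimate site correctly passed

    def ratio(numerator: int, denominator: int) -> float:
        # Report 0.0 instead of dividing by zero on empty categories.
        return numerator / denominator if denominator else 0.0

    total = tp + fp + fn + tn
    return {
        "accuracy": ratio(tp + tn, total),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
    }
```

Reporting precision and recall separately matters for phishing detection, since a missed phishing site and a wrongly flagged legitimate site carry different costs.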

Demo

See Cross-Check in action:

Legitimate URL – Analysis of a safe website
Phishing URL – Detection of a phishing attempt
Invalid URL – Handling of invalid URLs
Rate Limit – Graceful handling of API limits

Conclusion

Cross-Check demonstrates the power of Agentic AI when applied with engineering rigor. By simulating a human expert panel of analysts, a moderator, and a judge, it provides a transparent and robust defense against sophisticated phishing attacks, wrapped in a production-ready architecture.

You can explore the project here: Cross-Check on GitHub

📚 Reference

PhishDebate: An LLM-Based Multi-Agent Framework for Phishing Website Detection
Wenhao Li, Selvakumar Manickam, Yung-Wey Chong, Shankar Karuppayah
https://arxiv.org/abs/2506.15656