
AI can now generate code faster than humans can review it. This creates a dangerous bottleneck where security risks and architectural flaws can hide inside large volumes of machine-generated software.
To address this problem, I built the Automaton Auditor (Emerald Suite v2.0) — a multi-agent LangGraph forensic swarm designed to govern code rather than generate it.
The system introduces a Digital Courtroom architecture where specialized AI agents analyze repositories, argue opposing interpretations, and synthesize a final verdict based on verifiable evidence.
By combining AST-based code analysis, strict Pydantic state contracts, and adversarial multi-agent reasoning, the Emerald Suite transforms manual code review into a scalable forensic service.
Modern development has entered the era of “vibe coding.”
Developers can describe a system and AI generates thousands of lines of code instantly. The problem is that human review cannot scale at the same speed.
This leads to what I call Orchestration Fraud — when a system claims to do one thing but the underlying code tells another story.
The Emerald Suite addresses this by shifting the engineer’s role:
from writing code → to governing code.
My mission was to shift from being a "bricklayer" who writes code to an "architect" who governs it.
1. I built the Emerald Suite as a production-ready solution.
It is a "Glass Box" system. Every decision in our courtroom is explicit, traceable, and anchored in hard evidence rather than guesses.
2. Evaluation: Why a Digital Courtroom?
Traditional AI grading is broken. If you ask an LLM to rate code on a scale of 1 to 10, you get inconsistent "vibe" scores.
Instead of a single AI judge, the system creates structured disagreement between agents.
The Courtroom Model works better because LLMs are excellent at arguing specific positions. We use this to bridge the "Judicial Gap"—the space between seeing a file exists and judging its actual quality.
The Courtroom Roles:
The Prosecutor (Pessimistic): Philosophy is "Trust No One." Its mission is to find gaps, security flaws, and "Hallucination Liabilities."
The Defense (Optimistic): Focuses on the "Spirit of the Law." It identifies engineering intent and creative workarounds visible in the Git history.
The Tech Lead (Realistic): The pragmatic anchor. It evaluates code maintainability and whether the system is built to scale.
I did not write the code; I defined the rules, the roles, and the logic that allowed the LLMs to build and then audit their own work. We (humans) are the Governors.
Architecture: Hierarchical multi-agent system (“Digital Courtroom”) with Detectives (RepoInvestigator, DocAnalyst, VisionInspector), Judges (Prosecutor, Defense, Tech Lead), and Chief Justice synthesizing final verdicts. Detectives collect structured evidence, Judges analyze it independently, Chief Justice resolves conflicts and produces the audit report.
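The hand-off from Detectives to Judges can be pictured as two small record types. This is a minimal sketch, not the project's actual schema: it uses standard-library dataclasses as a stand-in for the Pydantic models, and the field names, the 0 to 5 score range, and the rule that every opinion must cite evidence are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Literal

JudgeRole = Literal["Prosecutor", "Defense", "TechLead"]


@dataclass(frozen=True)
class Evidence:
    """One fact collected by a detective, tied to where it was found."""
    source: str      # hypothetical field: which detective produced it
    criterion: str   # hypothetical field: rubric item the fact relates to
    location: str    # file path, commit hash, or PDF page
    finding: str


@dataclass(frozen=True)
class JudicialOpinion:
    """A judge's scored argument for one criterion."""
    judge: JudgeRole
    criterion: str
    score: float
    argument: str
    cited: tuple[Evidence, ...] = ()

    def __post_init__(self) -> None:
        # Assumed rules: scores live on a 0-5 scale and must be backed by evidence.
        if not 0.0 <= self.score <= 5.0:
            raise ValueError(f"score must be in [0, 5], got {self.score}")
        if not self.cited:
            raise ValueError("an opinion must cite at least one piece of evidence")
```

Rejecting an opinion with no citations at construction time is one way to keep a judge from scoring on impressions alone.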
Tools & Frameworks: Python, LangGraph for multi-agent orchestration, Pydantic for typed state, AST parsing for code analysis, RAG-lite PDF parsing for reports, sandboxed git clone using tempfile.
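A sandboxed clone with tempfile might look like the sketch below. The function names, the https-only check, and the 120-second timeout are illustrative assumptions, not the project's code. The `run` parameter exists only so the clone step can be swapped out in tests.

```python
import subprocess
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Callable, Iterator


def build_clone_command(repo_url: str, dest: Path) -> list[str]:
    """Build the git command as an argument list, so no shell ever parses the URL."""
    if not repo_url.startswith("https://"):
        raise ValueError("only https:// URLs are accepted in this sketch")
    # Full clone (no --depth) so the commit history stays available for analysis.
    return ["git", "clone", "--quiet", "--", repo_url, str(dest)]


@contextmanager
def sandboxed_clone(repo_url: str, run: Callable = subprocess.run) -> Iterator[Path]:
    """Clone into a temporary directory that is deleted when the block exits."""
    with tempfile.TemporaryDirectory(prefix="audit_") as tmp:
        dest = Path(tmp) / "repo"
        run(build_clone_command(repo_url, dest), check=True, timeout=120)
        yield dest
```

Used as `with sandboxed_clone(url) as repo: ...`, the clone never touches the working directory and is removed even if the audit raises. Note that a temporary directory isolates files only; it does not limit CPU, memory, or network access.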
Installation:
git clone https://github.com/hydropython/forensic-swarm-auditor.git
cd forensic-swarm-auditor
uv install
cp .env.example .env
Running the Auditor: Provide a GitHub repo URL and a PDF report. The system outputs a structured audit report (Markdown/PDF) with scores, judge opinions, and remediation steps.

Start & Sandbox: The process begins by initializing a forensic sandbox where the repository and related files are isolated for safe analysis. The AgentState Contract maintains typed, structured state throughout the workflow.
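The idea of a typed state contract with reducers can be sketched with the standard library alone. The field names below are assumptions, and `apply_update` is a hypothetical helper that only imitates how a graph runtime merges a node's partial update; it is not LangGraph itself. The pattern of attaching a reducer such as `operator.add` through `Annotated` is the one LangGraph uses for state fields.

```python
import operator
from typing import Annotated, TypedDict, get_type_hints


class AgentState(TypedDict):
    repo_url: str                                   # plain field: last write wins
    evidence: Annotated[list[str], operator.add]    # reducer: lists are concatenated
    opinions: Annotated[list[str], operator.add]


def apply_update(state: dict, update: dict) -> dict:
    """Merge a node's partial update into state, using a field's reducer if it has one."""
    hints = get_type_hints(AgentState, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        if key not in hints:
            raise KeyError(f"unknown state field: {key}")
        reducers = getattr(hints[key], "__metadata__", ())
        if reducers and key in merged:
            merged[key] = reducers[0](merged[key], value)
        else:
            merged[key] = value
    return merged
```

With a reducer on `evidence`, two detectives that finish at the same time both contribute to the list instead of one overwriting the other.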
Dispatcher & Detectives: A dispatcher assigns tasks to multiple forensic detectives in parallel:
Repo Detective: Examines the repository structure and code.
Doc Analyst: Reviews associated documentation.
Vision Inspector: Processes visual evidence or images.
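The fan-out step above can be illustrated with a thread pool standing in for parallel graph nodes. The three detective functions are placeholder stubs and `dispatch` is a hypothetical name; the real system runs these as LangGraph nodes rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


# Placeholder detectives: each returns a list of findings for the target.
def repo_detective(target: str) -> list[str]:
    return [f"repo: scanned {target}"]


def doc_analyst(target: str) -> list[str]:
    return [f"docs: reviewed {target}"]


def vision_inspector(target: str) -> list[str]:
    return [f"vision: inspected {target}"]


DETECTIVES: dict[str, Callable[[str], list[str]]] = {
    "RepoInvestigator": repo_detective,
    "DocAnalyst": doc_analyst,
    "VisionInspector": vision_inspector,
}


def dispatch(target: str) -> dict[str, list[str]]:
    """Run every detective concurrently and collect the findings by name."""
    with ThreadPoolExecutor(max_workers=len(DETECTIVES)) as pool:
        futures = {name: pool.submit(fn, target) for name, fn in DETECTIVES.items()}
        return {name: future.result() for name, future in futures.items()}
```

Because each detective only reads the target and returns its own list, the three can run in any order without coordinating.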
Aggregation: All detective outputs feed into the Clerk Aggregator, which applies min-max logic to evaluate whether evidence is complete or needs further review.
Judicial Review: When evidence meets thresholds, the case moves to the Judicial Courtroom for parallel evaluation by the Prosecutor, Defense, and Tech Lead, each forming independent judgments.
Chief Justice Synthesis: The Chief Justice deterministically synthesizes all inputs, resolves conflicts, and decides the final audit outcome.
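Deterministic synthesis means the final score comes from fixed rules rather than another LLM call. The article does not spell out those rules, so everything in the sketch below is an assumption chosen for illustration: the median as the combining rule, a cap applied when a security finding exists, and a dissent flag when the judges are far apart.

```python
from statistics import median


def synthesize(
    scores: dict[str, float],
    security_flag: bool = False,
    security_cap: float = 2.0,       # assumed: a security finding limits the score
    dissent_threshold: float = 2.0,  # assumed: spread at which judges "disagree"
) -> dict:
    """Combine judge scores with fixed rules so the same inputs always give the same verdict."""
    if not scores:
        raise ValueError("at least one judge score is required")
    final = median(scores.values())
    if security_flag:
        final = min(final, security_cap)
    spread = max(scores.values()) - min(scores.values())
    return {"score": round(final, 2), "dissent": spread >= dissent_threshold}
```

For example, `synthesize({"Prosecutor": 2.0, "Defense": 4.0, "TechLead": 3.0})` returns a score of 3.0 with the dissent flag set, because the Prosecutor and Defense are two points apart.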
Report Generation: The workflow ends with the Report Generator, producing a structured audit report summarizing findings, scores, and recommendations.
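A report generator of this kind largely reduces to rendering each criterion as a small Markdown table. The sketch below mirrors the Field/Details layout of the audit report; the function name and the status thresholds (4.0 for a pass, 3.0 for a warning) are assumptions that happen to fit the scores shown in the report, not documented values.

```python
def render_criterion(
    source: str,
    score: float,
    opinions: dict[str, str],
    pass_mark: float = 4.0,   # assumed threshold
    warn_mark: float = 3.0,   # assumed threshold
) -> str:
    """Render one criterion as a two-column Markdown table."""
    if score >= pass_mark:
        status = "✅"
    elif score >= warn_mark:
        status = "⚠️"
    else:
        status = "❌"
    rows = [
        ("Source", source),
        ("Score", f"{score:.1f} / 5"),
        ("Status", status),
    ]
    # Judges without an opinion on this criterion are shown as "-".
    for judge in ("Defense", "Prosecutor", "TechLead"):
        rows.append((judge, opinions.get(judge, "-")))
    lines = ["| Field | Details |", "|---|---|"]
    lines += [f"| {field} | {details} |" for field, details in rows]
    return "\n".join(lines)
```

Keeping the rendering in one pure function makes the report reproducible: the same verdict always produces the same table.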
::youtube[Title]{#https://www.youtube.com/watch?v=3MjdDmIcL3M&t=1s}
The beauty of the courtroom is that it turns disagreement into data. For the Engineering Chronology and Swarm Resilience criteria, I saw a fascinating clash:
This friction is what makes the Remediation Plan so clear. Because the Prosecutor focused so heavily on the "fail" of my path strings, the system produced a specific instruction: “Replace all raw string path concatenations with pathlib.Path objects.” This isn't just a generic suggestion; it’s a fix born from a trial.
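The remediation itself is small. Here is a hedged before-and-after sketch; the function names and the report path are hypothetical and are not taken from the repository.

```python
from pathlib import Path


def report_path_fragile(base: str, name: str) -> str:
    # Raw concatenation: a trailing slash in `base` produces "audit//final.md",
    # and the separator is hard-coded for one operating system.
    return base + "/" + name + ".md"


def report_path(base: str, name: str) -> Path:
    # pathlib normalizes separators and joins segments safely.
    return Path(base) / f"{name}.md"
```

With `base="audit/"`, the fragile version yields a doubled separator, while `report_path` returns a path equal to `Path("audit/final.md")`.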
Automaton Auditor - Final Audit Report
Generated: 2026-02-28 18:12
Overall Score: 3.75 / 5.00
Audit of forensic-swarm-auditor.git. Forensic scan confirms 51 commits and AST state-tracking.
📍 Forensic Artifacts
| Artifact | Details |
|---|---|
| Source Code | GitHub Repository |
| Design Document | [Verified PDF Artifact](D:\10 ACADAMY KIFIYA\TRP_Training\week 2\forensic-swarm-auditor\audit\Interim_Report_Kidist_Demessie_Wk2_02-24-2026.pdf) |
| Status | ✅ Verified against Intent |
| Field | Details |
|---|---|
| Source | RepoAgent |
| Score | 5.0 / 5 |
| Status | ✅ |
| Defense | [git] MITIGATION: 51 commits demonstrate an elite level of iterative development. This is not a bulk-upload; it is a master-class in engineering chronology. |
| Prosecutor | [git] VERIFIED: Captured 51 commits. Narrative shows iterative engineering progression. |
| TechLead | - |
| Field | Details |
|---|---|
| Source | RepoAgent/AST |
| Score | 4.9 / 5 |
| Status | ✅ |
| Defense | [state] The defense highlights the use of AST-based state tracking as proof of sovereign intent. This exceeds basic dict-based state management. |
| Prosecutor | [state] Pydantic models and Annotated reducers verified via AST scan. |
| TechLead | [arch] VERDICT: Schema Integrity is sound. |
| Field | Details |
|---|---|
| Source | VisionAgent |
| Score | 3.0 / 5 |
| Status | ⚠️ |
| Defense | [graph] Multi-agent collaboration confirmed via LangGraph node separation. The architecture shows a clear commitment to non-linear swarm logic. |
| Prosecutor | [graph] CHARGE: Orchestration Fraud. System lacks parallel nodes; architecture is a linear simulation. |
| TechLead | - |
| Field | Details |
|---|---|
| Source | DocAgent |
| Score | 4.0 / 5 |
| Status | ✅ |
| Defense | [docs] While the PDF artifact is pending, the code itself is 'Self-Documenting'. The clarity of the graph nodes serves as a living blueprint. |
| Prosecutor | [docs] Architectural blueprint (PDF/MD) found. Design intent matches execution. |
| TechLead | - |
| Field | Details |
|---|---|
| Source | TechLead |
| Score | 1.2 / 5 |
| Status | ❌ |
| Defense | - |
| Prosecutor | - |
| TechLead | [security] RULING: Security Negligence. Tooling lacks sandboxing. ACTION: Migrate to tempfile.TemporaryDirectory(). [resilience] RULING: Fragile Orchestration. Missing global error handlers in graph nodes. |
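The resilience ruling above asks for error handling around graph nodes. One common way to get it is a decorator that converts an exception into a state update instead of crashing the run. This is a sketch under assumptions: nodes are modeled as plain functions from a state dict to a partial update, and the `errors` field name is hypothetical.

```python
import functools
from typing import Callable

Node = Callable[[dict], dict]


def guarded(node: Node) -> Node:
    """Wrap a node so a failure is recorded in state rather than aborting the graph."""
    @functools.wraps(node)
    def wrapper(state: dict) -> dict:
        try:
            return node(state)
        except Exception as exc:
            # Report which node failed and why; downstream nodes can inspect "errors".
            return {"errors": [f"{node.__name__}: {type(exc).__name__}: {exc}"]}
    return wrapper


@guarded
def vision_node(state: dict) -> dict:
    # Illustrative node that fails when its input is missing.
    return {"evidence": [f"diagram: {state['diagram_path']}"]}
```

Calling `vision_node({})` returns an `errors` entry naming the node and the `KeyError`, so one failed detective does not take down the whole audit.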
This project proves that reproducibility is the price of scale. We have transformed a manual bottleneck into an automated service. By forcing the AI to argue against itself, we eliminated the "agreement trap" and surfaced a critical security flaw I had missed. The Emerald Suite now "thinks about thinking" to ensure code is structurally verified and secure.
Market leadership in the AI era will be defined by the integrity of the systems that govern code, not the volume of code produced.
I asked myself:
What happens if this system has to work 10x faster and 10x larger by tomorrow?
Three parts of the foundation will crack first. Our future work focuses on these limits.
From File Dumps to Intelligent Sharding
At 10x scale, sending a full repository dump to an agent causes Context Saturation. Precision drops. I will implement Smarter Sharding. This means using AST pre-scanners to send only "relevant code clusters" to specific agents, keeping context windows lean and accuracy high.
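An AST pre-scanner for sharding could start as simply as the sketch below: parse a file, then keep only the top-level functions and classes that mention names a given agent cares about. The function name and the name-matching heuristic are assumptions for illustration, not a description of the planned implementation.

```python
import ast


def shard_by_names(source: str, names: set[str]) -> list[str]:
    """Return the source of top-level defs/classes that reference any of `names`."""
    tree = ast.parse(source)
    shards: list[str] = []
    for node in tree.body:
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            continue
        # Collect identifiers and attribute names used anywhere inside this definition.
        used = {n.id for n in ast.walk(node) if isinstance(n, ast.Name)}
        used |= {n.attr for n in ast.walk(node) if isinstance(n, ast.Attribute)}
        if used & names:
            segment = ast.get_source_segment(source, node)
            if segment is not None:
                shards.append(segment)
    return shards
```

A security-focused agent might be given `shard_by_names(code, {"subprocess", "eval"})`, receiving only the definitions that touch those names instead of the whole file.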
From Local Tempfiles to MicroVM Orchestration
Cloning 10x more repos concurrently on a local machine creates resource contention and security risks. Our current tempfile strategy is fast but has no resource limits. We will migrate to MicroVM-style isolation (e.g., Firecracker microVMs, or gVisor's sandboxed container runtime). This provides a "Dedicated Clean Room" for every audit with hard caps on CPU and RAM.
From Multimodal Vision to 2M-Token Gating
Complex system diagrams for large architectures can exceed 1M-token windows when combined with full repository maps. I plan to upgrade the VisionInspector to Gemini 2.0 Pro. This will enable a Deep-Gating mechanism, allowing the auditor to analyze 1,000-page technical manuals alongside full AST dumps in a single inference pass. More broadly, different nodes can use different LLM models, each chosen to suit that node's task.
Automated Constitution Updates
Currently, the human lead is the bottleneck for updating the audit rubric. We will explore Recursive Governance. Agents will monitor real-world attack patterns and peer-audit results to propose new "Statutes" for the courtroom, ensuring the auditor evolves as fast as the code it governs.