Trust Bench (Project2v2) is a LangGraph-based multi-agent evaluation framework designed to analyze software repositories for security leaks, code quality gaps, and documentation health. Unlike traditional AI-driven evaluation systems that rely on external APIs or non-deterministic models, Trust Bench operates entirely offline, ensuring reproducible and transparent results suitable for secure or air-gapped environments.
The framework coordinates multiple specialized agents—SecurityAgent, QualityAgent, and DocumentationAgent—through a LangGraph orchestrator that manages planning, execution, and collaboration. Each agent runs deterministic Python tools that generate quantifiable metrics, with findings dynamically influencing one another: security violations penalize quality and documentation scores, while strong code structure improves overall trust metrics.
Every run produces a pair of reproducible reports (report.json and report.md) containing composite scores, agent summaries, collaboration logs, and key evaluation metrics such as faithfulness, system latency, and refusal accuracy. By combining explainable reasoning with strict reproducibility, Trust Bench demonstrates how multi-agent systems can enable transparent, automated code evaluation and AI-aligned software auditing without depending on any external infrastructure.
Modern AI-assisted development and evaluation pipelines often depend on opaque, cloud-hosted models that are difficult to reproduce and impossible to fully trust in secure environments. Trust Bench (Project2v2) takes a different approach — building an auditable, deterministic, and fully offline multi-agent system designed for security-minded software evaluation.
The framework coordinates specialized agents — SecurityAgent, QualityAgent, and DocumentationAgent — under a LangGraph orchestrator that ensures transparent collaboration and traceable reasoning. Each agent applies deterministic, domain-specific tools rather than stochastic model outputs, allowing results to remain identical across runs and environments. This architecture delivers consistent, verifiable assessments that can be confidently reviewed by both machines and humans.
Trust Bench focuses on three key evaluation pillars:
- Security: detection of leaked secrets and credential exposure in the repository.
- Code Quality: structural health indicators such as test coverage and maintainability.
- Documentation: completeness and clarity of project documentation.
By combining these dimensions through agent collaboration, Trust Bench provides a unified, reproducible view of code health. It demonstrates how agentic design principles—planning, communication, and feedback—can be applied deterministically to real-world cybersecurity and code-quality problems without sacrificing reproducibility or control.
Trust Bench employs a multi-agent evaluation methodology grounded in deterministic decision-making and transparent orchestration.
Each agent is implemented as a modular component that performs a focused analysis task, communicates findings to peers, and contributes to a final composite score.
The methodology follows four main phases: Planning, Execution, Collaboration, and Reporting.
The Manager Agent initializes the process by defining the evaluation plan for a target repository.
This includes selecting the specialized agents to run, the deterministic tools each will invoke, and the order in which their analyses execute.
The plan is serialized into a LangGraph StateGraph, which enforces explicit state transitions and message flow between agents.
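A minimal sketch of how such a plan could be wired into a LangGraph StateGraph is shown below; the AuditState schema and the agent functions are illustrative placeholders rather than the actual Trust Bench implementation.

```python
# Minimal sketch: wiring a sequential evaluation plan into a LangGraph StateGraph.
# AuditState and the agent functions are illustrative stand-ins, not Trust Bench's code.
from typing import List, TypedDict

from langgraph.graph import END, StateGraph


class AuditState(TypedDict):
    repo_path: str        # target repository under evaluation
    findings: dict        # per-agent results accumulated during the run
    messages: List[str]   # shared collaboration log


def security_agent(state: AuditState) -> dict:
    # Placeholder: run deterministic secret-scanning tools here.
    return {"findings": {**state["findings"], "security": {"secrets_found": 0}}}


def quality_agent(state: AuditState) -> dict:
    # Placeholder: run deterministic quality metrics (e.g., test coverage).
    return {"findings": {**state["findings"], "quality": {"score": 80}}}


def documentation_agent(state: AuditState) -> dict:
    # Placeholder: assess documentation completeness and clarity.
    return {"findings": {**state["findings"], "docs": {"score": 70}}}


builder = StateGraph(AuditState)
builder.add_node("security", security_agent)
builder.add_node("quality", quality_agent)
builder.add_node("documentation", documentation_agent)

# Explicit, sequential transitions keep execution order deterministic.
builder.set_entry_point("security")
builder.add_edge("security", "quality")
builder.add_edge("quality", "documentation")
builder.add_edge("documentation", END)

graph = builder.compile()
result = graph.invoke({"repo_path": ".", "findings": {}, "messages": []})
```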
Each specialized agent operates sequentially within its assigned domain:
- SecurityAgent scans the repository for leaked secrets and credential exposure.
- QualityAgent measures structural quality indicators such as test coverage.
- DocumentationAgent evaluates the completeness and clarity of project documentation.
All analyses are tool-driven, ensuring that outputs are reproducible and fully auditable.
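As an illustration of this tool-driven approach, the following sketch shows the kind of deterministic secret scanner a SecurityAgent might call; the regex patterns and function name are assumptions, not the framework's actual rules.

```python
# Illustrative deterministic tool of the kind a SecurityAgent might call:
# a regex-based secret scanner. The patterns shown are examples only.
import re
from pathlib import Path

SECRET_PATTERNS = {
    "aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "generic_api_key": re.compile(r"(?i)api[_-]?key\s*[:=]\s*['\"][A-Za-z0-9]{16,}['\"]"),
}


def scan_for_secrets(repo_path: str) -> list[dict]:
    """Scan text files under repo_path and return sorted, reproducible findings."""
    findings = []
    for path in sorted(Path(repo_path).rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue
        for name, pattern in SECRET_PATTERNS.items():
            for match in pattern.finditer(text):
                findings.append({"rule": name, "file": str(path), "offset": match.start()})
    return findings  # deterministic: same input tree yields the same ordered output
```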
Once individual analyses are complete, agents exchange structured messages through shared memory.
This collaboration allows each agent to modify its interpretation based on peer findings: security violations penalize the quality and documentation scores, low test coverage lowers the documentation clarity rating, and strong code structure lifts the overall trust metrics.
This feedback mechanism captures the interdependence of real-world software concerns: poor security reduces trust in documentation and overall quality.
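A hypothetical sketch of this penalty propagation is shown below; the per-finding weight, the cap, and the score fields are illustrative choices, not the framework's actual scoring constants.

```python
# Hypothetical sketch of cross-agent penalty propagation: security findings
# reduce the quality and documentation scores. Weights and cap are illustrative.

def apply_collaboration_penalties(scores: dict, security_findings: list) -> dict:
    """Return adjusted scores after propagating security penalties downstream."""
    penalty = min(len(security_findings) * 5, 30)  # cap the downstream impact
    adjusted = dict(scores)
    adjusted["quality"] = max(scores["quality"] - penalty, 0)
    adjusted["documentation"] = max(scores["documentation"] - penalty, 0)
    return adjusted


# Example: two leaked secrets shave 10 points off each dependent score.
print(apply_collaboration_penalties(
    {"security": 40, "quality": 75, "documentation": 68},
    security_findings=[{"rule": "aws_access_key"}, {"rule": "generic_api_key"}],
))
```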
After collaboration, the Manager Agent finalizes all metrics and generates reproducible outputs:
- report.json – machine-readable summary with agent results, metrics, and collaboration logs.
- report.md – human-readable summary with formatted tables, agent scores, and instrumentation data.

Instrumentation tracks system latency, faithfulness, and refusal accuracy for each run.
Each report includes deterministic hashes and timestamps, enabling verification and traceability across multiple runs.
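The sketch below illustrates one way such reproducible output could be produced: serializing report.json with stable key ordering and computing a SHA-256 digest for later verification. The field names and helper are assumptions, not the exact Trust Bench schema.

```python
# Sketch of deterministic report serialization: stable key ordering and fixed
# separators keep report.json byte-for-byte reproducible, and a SHA-256 digest
# of the payload can be recorded for traceability. Field names are assumed.
import hashlib
import json


def write_report(results: dict, path: str = "report.json") -> str:
    payload = json.dumps(results, sort_keys=True, indent=2, separators=(",", ": "))
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(payload)
    return digest  # store or log alongside the report for verification


digest = write_report({"composite_score": 32, "status": "needs_attention"})
print(f"report.json sha256: {digest}")
```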
By integrating LangGraph orchestration, deterministic tools, and structured agent collaboration, Trust Bench creates a verifiable evaluation pipeline that behaves consistently across machines, users, and environments.
This methodology transforms multi-agent systems from stochastic assistants into reliable evaluators, capable of performing secure, offline, and reproducible code assessments.
To validate the performance and reproducibility of the Trust Bench framework, a series of controlled experiments were conducted using both seeded test repositories and real-world open-source projects.
These experiments focused on verifying the system’s determinism, agent collaboration, and metric consistency across repeated runs and different environments.
Environment: fully offline Windows 10 and Windows 11 machines running the deterministic Python toolchain, with no external API or network access.
Data Sources: seeded test repositories containing deliberately injected secrets, together with real-world open-source projects.
Each repository was cloned locally and analyzed offline to ensure consistency and security isolation.
Baseline Determinism Test:
Each repository was analyzed multiple times in identical conditions.
The goal was to verify that report.json outputs were byte-for-byte identical across runs.
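A minimal check along these lines might compare the SHA-256 digests of report.json from two consecutive runs; the paths below are illustrative.

```python
# Illustrative determinism check: hash report.json from two consecutive runs
# and confirm the digests match. Paths are placeholders.
import hashlib
from pathlib import Path


def sha256_of(path: str) -> str:
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


run_a = sha256_of("run_a/report.json")
run_b = sha256_of("run_b/report.json")
assert run_a == run_b, "reports differ between runs -- determinism broken"
```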
Agent Collaboration Verification:
A repository with deliberately injected secrets was used to test how SecurityAgent findings influenced QualityAgent and DocumentationAgent scores.
Expected outcome: lower composite scores as security issues propagated penalties downstream.
Metric Instrumentation Validation:
The instrumentation layer was tested by timing each agent’s execution and recording faithfulness, latency, and refusal accuracy.
Latency was measured using high-precision timers (time.perf_counter()), while faithfulness and refusal accuracy were derived from internal heuristics.
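The sketch below shows this style of latency instrumentation, wrapping each agent call with time.perf_counter(); the agent callables and the rounding are placeholders rather than the framework's actual instrumentation code.

```python
# Sketch of the latency instrumentation described above: wrap each agent call
# with time.perf_counter() and record per-agent wall-clock durations.
import time


def timed_run(agents: dict) -> dict:
    """Run each agent callable and return its duration in seconds."""
    timings = {}
    for name, run_agent in agents.items():
        start = time.perf_counter()
        run_agent()
        timings[name] = round(time.perf_counter() - start, 4)
    return timings


timings = timed_run({
    "security": lambda: sum(range(10_000)),   # stand-in workloads
    "quality": lambda: sum(range(5_000)),
    "documentation": lambda: sum(range(2_000)),
})
print(timings)
```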
Cross-Platform Reproducibility:
The same repository was analyzed on Windows 10 and Windows 11 machines to ensure consistent outputs regardless of system configuration.
| Test Case | Description | Expected Behavior | Observed Outcome |
|---|---|---|---|
| Determinism | Repeat identical runs on same repo | Identical report.json outputs | ✅ All hashes matched |
| Collaboration | Injected seeded secrets | Security penalties propagated to quality/docs | ✅ Correct score adjustments observed |
| Instrumentation | Measure runtime metrics | Realistic latency + faithful summaries | ✅ Metrics logged deterministically |
| Cross-Platform | Compare Windows 10 vs. 11 outputs | Identical results | ✅ Reports identical across OS |
- System Latency: 0.08 seconds
- Faithfulness: 0.62
- Refusal Accuracy: 1.0
- Composite Score: ~32/100 (“needs_attention”)
SecurityAgent detected all seeded secrets, triggering collaboration penalties.
QualityAgent and DocumentationAgent responded as expected, resulting in a lowered overall score — demonstrating that inter-agent communication and scoring logic functioned correctly.
The experiments confirm that Trust Bench’s multi-agent coordination remains entirely deterministic while still exhibiting dynamic, interdependent behavior.
By removing all stochastic LLM dependencies, the framework guarantees reproducibility while maintaining realistic, explainable agent interactions.
These findings validate that Trust Bench can serve as a dependable foundation for offline software evaluation, cybersecurity auditing, and AI system benchmarking where traceability and repeatability are paramount.
The experimental results demonstrate that Trust Bench (Project2v2) successfully achieves its design goals of reproducibility, transparency, and collaborative agent evaluation.
Across all test scenarios—seeded repositories, mixed-language projects, and repeated offline runs—the system maintained deterministic behavior while accurately modeling inter-agent dependencies between security, quality, and documentation assessments.
All repeated executions of the same repository produced identical outputs in both report.json and report.md.
This confirms that the framework’s LangGraph orchestration and toolchain operate deterministically, unaffected by randomness or external API variability.
Byte-for-byte identical results were verified using checksum comparisons, proving the system’s reproducibility across machines and operating systems.
Key Findings:
The agent communication design performed as intended, with clear and traceable cause-and-effect relationships between domains:
| Agent | Trigger | Effect |
|---|---|---|
| SecurityAgent | Detected seeded secrets | Penalized Quality and Documentation scores |
| QualityAgent | Measured low test coverage | Decreased DocumentationAgent clarity rating |
| DocumentationAgent | Adjusted content evaluation | Reflected project readiness and security awareness |
This interaction demonstrated a realistic simulation of organizational code evaluation, where weaknesses in one area directly influence confidence in others.
Collaboration logs captured in shared memory validated that agents exchanged structured messages and updated their reasoning accordingly.
| Metric | Description | Result |
|---|---|---|
| System Latency | Total wall-clock runtime for one full audit | 0.08 seconds |
| Faithfulness | Agreement between agent summaries and underlying tool evidence | 0.62 |
| Refusal Accuracy | Simulated safety test score for prompt-injection handling | 1.0 |
These values confirm the correct operation of the instrumentation layer: latency reflects actual wall-clock execution time, faithfulness is derived from the documented internal heuristics, and refusal accuracy matches the simulated safety test.
The composite project score from the self-audit was approximately 32/100 (“needs_attention”).
These results align with the intended evaluation logic, confirming that cross-agent penalty propagation works as designed.
The experiments and results position Trust Bench as a foundation for auditable multi-agent evaluation systems.
It bridges the gap between symbolic rule-based reasoning and multi-agent coordination—achieving transparency without stochastic uncertainty.
Future expansions such as semantic faithfulness scoring and dashboard visualizations can further enhance interpretability and user insight, while the current deterministic core ensures a solid, reliable baseline for reproducible AI system evaluation.
Trust Bench (Project2v2) demonstrates that deterministic, multi-agent systems can deliver transparent, reproducible, and explainable software evaluation—entirely offline and without reliance on external APIs or stochastic models. By orchestrating specialized agents through LangGraph and enforcing strict determinism at every stage, Trust Bench provides a robust foundation for secure code auditing, quality assessment, and documentation review.
The framework’s reproducible outputs, traceable agent collaboration, and comprehensive instrumentation make it suitable for high-assurance environments where trust, auditability, and repeatability are paramount. As software evaluation increasingly demands both automation and transparency, Trust Bench offers a practical blueprint for building agentic systems that are as reliable as they are insightful.
Future work may extend Trust Bench with semantic analysis, richer reporting, and integration with broader AI safety benchmarks—while maintaining its core commitment to deterministic, auditable evaluation.