
Trust Bench (Project2v2) is a LangGraph-based multi-agent evaluation framework designed to analyze software repositories for security leaks, code quality gaps, and documentation health. Unlike traditional AI-driven evaluation systems that rely on external APIs or non-deterministic models, Trust Bench operates entirely offline, ensuring reproducible and transparent results suitable for secure or air-gapped environments.
The framework coordinates multiple specialized agents—SecurityAgent, QualityAgent, and DocumentationAgent—through a LangGraph orchestrator that manages planning, execution, and collaboration. Each agent runs deterministic Python tools that generate quantifiable metrics, with findings dynamically influencing one another: security violations penalize quality and documentation scores, while strong code structure improves overall trust metrics.
Every run produces a pair of reproducible reports (report.json and report.md) containing composite scores, agent summaries, collaboration logs, and key evaluation metrics such as faithfulness, system latency, and refusal accuracy. By combining explainable reasoning with strict reproducibility, Trust Bench demonstrates how multi-agent systems can enable transparent, automated code evaluation and AI-aligned software auditing without depending on any external infrastructure.
Modern AI-assisted development and evaluation pipelines often depend on opaque, cloud-hosted models that are difficult to reproduce and impossible to fully trust in secure environments. Trust Bench (Project2v2) takes a different approach — building an auditable, deterministic, and fully offline multi-agent system designed for security-minded software evaluation.
The framework coordinates specialized agents — SecurityAgent, QualityAgent, and DocumentationAgent — under a LangGraph orchestrator that ensures transparent collaboration and traceable reasoning. Each agent applies deterministic, domain-specific tools rather than stochastic model outputs, allowing results to remain identical across runs and environments. This architecture delivers consistent, verifiable assessments that can be confidently reviewed by both machines and humans.
Trust Bench focuses on three key evaluation pillars:
By combining these dimensions through agent collaboration, Trust Bench provides a unified, reproducible view of code health. It demonstrates how agentic design principles—planning, communication, and feedback—can be applied deterministically to real-world cybersecurity and code-quality problems without sacrificing reproducibility or control.
The purpose of Trust Bench (Project2v2) is to create a deterministic, offline-capable multi-agent framework that evaluates software repositories for security risks, structural quality issues, and documentation completeness. The system is designed for reproducible academic evaluation, transparent agent collaboration, and practical use by security analysts who need lightweight, explainable assessments without dependence on external APIs.
The primary objectives of this work are:
Develop a multi-agent evaluation workflow
Implement a LangGraph-based system where specialized agents—SecurityAgent, QualityAgent, and DocumentationAgent—cooperate under a Manager agent to generate structured, evidence-backed assessments of a repository.
Provide deterministic and offline-valid scoring
Ensure that all results can be reproduced entirely offline by removing external LLM calls, stabilizing tool outputs, and making scoring functions consistent and transparent.
Surface actionable findings for human analysts
Produce both machine-readable and human-readable reports that highlight detected secrets, structural gaps, documentation quality, and collaboration-driven scoring adjustments.
Enable explainability through agent interaction logs
Record the full conversation between agents to illustrate how findings propagate, how penalties are applied, and how final scoring emerges from the system’s internal reasoning.
Support flexible workflows through CLI and web interfaces
Allow automated CI/CD execution via the CLI and interactive exploration via a lightweight web interface that can clone GitHub repositories and visualize agent progress.
Align with security and reproducibility expectations
Document limitations, ensure transparent error-handling behavior, and maintain a human-in-the-loop workflow consistent with real-world security analysis practices.
Together, these objectives form the foundation for an educational, research-oriented multi-agent evaluation framework that demonstrates practical security analysis workflows while remaining simple enough for reproducibility and further extension.
Trust Bench is a deterministic, offline-capable multi-agent framework that evaluates software repositories along three dimensions: security posture, structural code quality, and documentation completeness. The system is built on a LangGraph-style state machine and uses a Manager (orchestrator) agent to coordinate three specialist agents over a shared state and message bus. All scoring logic is implemented in pure Python with no external LLM calls, ensuring that results are fully reproducible.
At a high level, Trust Bench consists of:
Entry Layer:
main.py (CLI) for automated/scripted evaluations.web_interface.py (Flask-based UI) for interactive audits and GitHub cloning.run_audit.ps1, run_audit.bat, launch.bat) for quick usage on Windows.Orchestration Layer:
multi_agent_system/orchestrator.py) that models the workflow as a sequential state machine.Agent Layer:
multi_agent_system/agents.py:
Tool Layer:
multi_agent_system/tools.py for secret scanning, repository structure analysis, and documentation evaluation.Reporting Layer:
multi_agent_system/reporting.py generates machine-readable (report.json) and human-readable (report.md) outputs.All state is held in a MultiAgentState dictionary that records the repository root, agent results, shared memory, messages, and timing information.
The system follows a fixed, sequential workflow to keep dependencies and behavior explicit:
Entry Point:
The user invokes Trust Bench via the CLI (python main.py --repo <path> --output <dir>), the web interface, or a convenience script. The entry layer validates the repository path (and, in the web UI, sanitizes URLs) and initializes a MultiAgentState object.
Manager – Plan Phase:
The Manager agent sets the repo_root, initializes shared_memory, and records an initial message plan for the specialist agents.
SecurityAgent:
run_secret_scan(repo_root, max_file_mb=1.5) to scan text files for high-signal credential patterns (e.g., AWS keys, GitHub tokens, Slack tokens, RSA private keys).shared_memory["security_findings"] and a summarized security_context.QualityAgent:
analyze_repository_structure(repo_root) to enumerate files, classify them by language, and detect test files using naming conventions (tests/, test_*.py).security_findings.shared_memory["quality_metrics"] and acknowledges SecurityAgent’s influence via a message.DocumentationAgent:
evaluate_documentation(repo_root) to discover README files, count words and sections, and detect key sections such as quickstart and architecture.shared_memory["documentation"] and communicates back to SecurityAgent and QualityAgent where appropriate.Manager – Finalize Phase:
agent_results.excellent, good, fair, or needs_attention).report.json and report.md in the chosen output directory.This pipeline can be succinctly represented as:
Entry (CLI / Web / Script) ↓ [Manager (plan)] ↓ [SecurityAgent] → security context ↓ [QualityAgent] → quality metrics ↓ [DocumentationAgent] → documentation context ↓ [Manager (finalize)] → report.json / report.md The workflow is intentionally sequential (no parallelism) to preserve deterministic, easily debuggable behavior in this educational version. 3. Shared State and Communication Agents collaborate through: Shared Memory: A dictionary that stores security findings, quality metrics, documentation metadata, timing information, and a composite assessment. For example: shared_memory["security_findings"]: list of secret matches (file, pattern, snippet). shared_memory["quality_metrics"]: file counts, language histogram, test ratio. shared_memory["documentation"]: README paths, word counts, section counts, flags such as has_quickstart and has_architecture. Message Bus: A messages list containing structured messages with sender, recipient, content, and data fields. Typical patterns include: SecurityAgent → QualityAgent: “Found N security issues, adjust quality score accordingly.” SecurityAgent → DocumentationAgent: “Security findings exist; documentation should address security practices.” QualityAgent → DocumentationAgent: “Project size and test coverage metrics for contextual documentation scoring.” All agents → Manager: completion messages and result summaries. The Manager later synthesizes these messages into a “collaboration summary” that explains how agent interactions influenced the final evaluation. 4. Design Principles Several design principles guide the architecture of Trust Bench: **Deterministic Evaluation:** All analysis and scoring logic is implemented without randomness or external LLM calls. Given the same repository snapshot and code version, the framework produces identical outputs, supporting grading and research use cases. **Collaborative Intelligence:** Agents do not operate in isolation; security findings influence quality and documentation scores, and repository structure metrics contextualize documentation evaluation. This models how different perspectives within a security review can reinforce or correct each other. **Offline-First Operation:** Core evaluation relies only on local filesystem access and built-in or lightweight Python libraries. Optional LLM integration is strictly limited to the web UI chat feature and is not required for the main evaluation pipeline. **Human-in-the-Loop:** Analysts control repository selection, interface choice (CLI vs. web), and post-hoc interpretation of results. The full conversation log and structured metrics are exposed to support human judgment rather than replace it. **Instrumented Observability:** Timing information, heuristic faithfulness scores, and a complete conversation history are recorded and embedded into the final reports. This makes it possible to trace how the system arrived at a particular score, which is critical for security and trust-related applications. The resulting architecture balances transparency, reproducibility, and collaborative behavior, making Trust Bench a suitable foundation for educational exploration of multi-agent security evaluation as well as a starting point for more advanced research systems. <!-- RT_DIVIDER --> # Methodology Trust Bench employs a **multi-agent evaluation methodology** grounded in deterministic decision-making and transparent orchestration. Each agent is implemented as a modular component that performs a focused analysis task, communicates findings to peers, and contributes to a final composite score. The methodology follows four main phases: **Planning**, **Execution**, **Collaboration**, and **Reporting**. --- ### 1. Planning Phase The **Manager Agent** initializes the process by defining the evaluation plan for a target repository. This includes: - Setting up shared memory for inter-agent communication. - Defining task order (Security → Quality → Documentation). - Establishing deterministic configuration parameters to ensure identical results across runs. The plan is serialized into a LangGraph **StateGraph**, which enforces explicit state transitions and message flow between agents. --- ### 2. Execution Phase Each specialized agent operates sequentially within its assigned domain: - **SecurityAgent:** Scans all repository files using regex-based pattern matching to detect high-signal secrets such as AWS keys, RSA tokens, and API credentials. - **QualityAgent:** Analyzes repository structure, language composition, and testing density using file enumeration and lightweight heuristics. - **DocumentationAgent:** Evaluates README and supporting documents for coverage, word count, and cross-references to security and testing topics. All analyses are **tool-driven**, ensuring that outputs are reproducible and fully auditable. --- ### 3. Collaboration Phase Once individual analyses are complete, agents **exchange structured messages** through shared memory. This collaboration allows each agent to modify its interpretation based on peer findings: - SecurityAgent findings **penalize** quality and documentation scores if vulnerabilities are present. - QualityAgent metrics **influence** DocumentationAgent scoring by rewarding well-structured, well-tested projects. - The Manager Agent **aggregates** and normalizes all results, producing weighted composite metrics that reflect system-wide health. This feedback mechanism captures the interdependence of real-world software concerns: poor security reduces trust in documentation and overall quality. --- ### 4. Reporting Phase After collaboration, the Manager Agent finalizes all metrics and generates reproducible outputs: - **`report.json`** – machine-readable summary with agent results, metrics, and collaboration logs. - **`report.md`** – human-readable summary with formatted tables, agent scores, and instrumentation data. Instrumentation tracks: - **System Latency:** Total and per-agent runtime. - **Faithfulness:** Alignment between agent summaries and tool evidence. - **Refusal Accuracy:** Safety test results from simulated prompt-injection attempts. Each report includes deterministic hashes and timestamps, enabling verification and traceability across multiple runs. --- ### Summary By integrating **LangGraph orchestration**, **deterministic tools**, and **structured agent collaboration**, Trust Bench creates a verifiable evaluation pipeline that behaves consistently across machines, users, and environments. This methodology transforms multi-agent systems from stochastic assistants into **reliable evaluators**, capable of performing secure, offline, and reproducible code assessments. <!-- RT_DIVIDER --> # Experiments To validate the performance and reproducibility of the Trust Bench framework, a series of controlled experiments were conducted using both seeded test repositories and real-world open-source projects. These experiments focused on verifying the system’s **determinism**, **agent collaboration**, and **metric consistency** across repeated runs and different environments. --- ### 1. Experiment Setup **Environment:** - OS: Windows 10 and Windows 11 - Python: 3.10+ - Hardware: CPU-only (no GPU required) - Dependencies: LangGraph, tqdm, rich, and local analysis tools (no network APIs) **Data Sources:** - **Seeded repositories** containing: - Synthetic secrets (AWS keys, RSA tokens) for security detection tests - Mixed-language structures (Python, JavaScript, Markdown) for quality analysis - Minimal and full README examples for documentation evaluation Each repository was cloned locally and analyzed offline to ensure consistency and security isolation. --- ### 2. Experimental Procedure 1. **Baseline Determinism Test:** Each repository was analyzed multiple times in identical conditions. The goal was to verify that `report.json` outputs were byte-for-byte identical across runs. 2. **Agent Collaboration Verification:** A repository with deliberately injected secrets was used to test how SecurityAgent findings influenced QualityAgent and DocumentationAgent scores. Expected outcome: lower composite scores as security issues propagated penalties downstream. 3. **Metric Instrumentation Validation:** The instrumentation layer was tested by timing each agent’s execution and recording faithfulness, latency, and refusal accuracy. Latency was measured using high-precision timers (`time.perf_counter()`), while faithfulness and refusal accuracy were derived from internal heuristics. 4. **Cross-Platform Reproducibility:** The same repository was analyzed on Windows 10 and Windows 11 machines to ensure consistent outputs regardless of system configuration. --- ### 3. Results Summary | Test Case | Description | Expected Behavior | Observed Outcome | |------------|-------------|-------------------|------------------| | **Determinism** | Repeat identical runs on same repo | Identical `report.json` outputs | ✅ All hashes matched | | **Collaboration** | Injected seeded secrets | Security penalties propagated to quality/docs | ✅ Correct score adjustments observed | | **Instrumentation** | Measure runtime metrics | Realistic latency + faithful summaries | ✅ Metrics logged deterministically | | **Cross-Platform** | Compare Windows 10 vs. 11 outputs | Identical results | ✅ Reports identical across OS | --- ### 4. Quantitative Snapshot (Self-Audit) System Latency: 0.08 seconds Faithfulness: 0.62 Refusal Accuracy: 1.0 Composite Score: ~32/100 (“needs_attention”) SecurityAgent detected all seeded secrets, triggering collaboration penalties. QualityAgent and DocumentationAgent responded as expected, resulting in a lowered overall score — demonstrating that inter-agent communication and scoring logic functioned correctly. --- ### 5. Discussion The experiments confirm that Trust Bench’s **multi-agent coordination** remains entirely **deterministic** while still exhibiting dynamic, interdependent behavior. By removing all stochastic LLM dependencies, the framework guarantees reproducibility while maintaining realistic, explainable agent interactions. These findings validate that Trust Bench can serve as a dependable foundation for **offline software evaluation**, **cybersecurity auditing**, and **AI system benchmarking** where traceability and repeatability are paramount. <!-- RT_DIVIDER --> # Results The experimental results demonstrate that **Trust Bench (Project2v2)** successfully achieves its design goals of reproducibility, transparency, and collaborative agent evaluation. Across all test scenarios—seeded repositories, mixed-language projects, and repeated offline runs—the system maintained deterministic behavior while accurately modeling inter-agent dependencies between security, quality, and documentation assessments. --- ### 1. Reproducibility and Determinism All repeated executions of the same repository produced **identical outputs** in both `report.json` and `report.md`. This confirms that the framework’s LangGraph orchestration and toolchain operate deterministically, unaffected by randomness or external API variability. Byte-for-byte identical results were verified using checksum comparisons, proving the system’s reproducibility across machines and operating systems. **Key Findings:** - Reports generated on Windows 10 and Windows 11 matched exactly. - No variance in scoring or metric values across repeated runs. - Tool outputs remained consistent regardless of execution order or environment. --- ### 2. Multi-Agent Collaboration The agent communication design performed as intended, with clear and traceable cause-and-effect relationships between domains: | Agent | Trigger | Effect | |-------|----------|--------| | **SecurityAgent** | Detected seeded secrets | Penalized Quality and Documentation scores | | **QualityAgent** | Measured low test coverage | Decreased DocumentationAgent clarity rating | | **DocumentationAgent** | Adjusted content evaluation | Reflected project readiness and security awareness | This interaction demonstrated a realistic simulation of **organizational code evaluation**, where weaknesses in one area directly influence confidence in others. Collaboration logs captured in shared memory validated that agents exchanged structured messages and updated their reasoning accordingly. --- ### 3. Quantitative Evaluation Metrics | Metric | Description | Result | |---------|-------------|--------| | **System Latency** | Total wall-clock runtime for one full audit | **0.08 seconds** | | **Faithfulness** | Agreement between agent summaries and underlying tool evidence | **0.62** | | **Refusal Accuracy** | Simulated safety test score for prompt-injection handling | **1.0** | These values confirm the correct operation of the instrumentation layer: - Latency is consistent with lightweight deterministic tooling. - Faithfulness remains within expected heuristic range. - Refusal Accuracy indicates robust input sanitization and ethical handling logic. --- ### 4. Overall System Scoring Composite project score from the self-audit: - **Overall:** ~32/100 (**Needs Attention**) - **Security:** 0 (seeded secrets detected) - **Quality:** Moderate (penalized due to security alerts) - **Documentation:** Above average, reduced for missing test/security coverage references These results align perfectly with the intended evaluation logic—proof that cross-agent penalty propagation works as designed. --- ### 5. Observations - Deterministic orchestration allows **verifiable reproducibility**, a core goal for secure evaluation frameworks. - Agent collaboration produces meaningful **context-aware scoring**, linking technical issues to documentation and quality. - Security detections drive cascading penalties, reinforcing real-world accountability. - The instrumentation layer provides measurable, quantitative insight into agent behavior. --- ### 6. Implications The experiments and results position **Trust Bench** as a foundation for **auditable multi-agent evaluation systems**. It bridges the gap between symbolic rule-based reasoning and multi-agent coordination—achieving transparency without stochastic uncertainty. Future expansions such as semantic faithfulness scoring and dashboard visualizations can further enhance interpretability and user insight, while the current deterministic core ensures a solid, reliable baseline for reproducible AI system evaluation. <!-- RT_DIVIDER --> # Conclusion **Trust Bench (Project2v2)** demonstrates that deterministic, multi-agent systems can deliver transparent, reproducible, and explainable software evaluation—entirely offline and without reliance on external APIs or stochastic models. By orchestrating specialized agents through LangGraph and enforcing strict determinism at every stage, Trust Bench provides a robust foundation for secure code auditing, quality assessment, and documentation review. The framework’s reproducible outputs, traceable agent collaboration, and comprehensive instrumentation make it suitable for high-assurance environments where trust, auditability, and repeatability are paramount. As software evaluation increasingly demands both automation and transparency, Trust Bench offers a practical blueprint for building agentic systems that are as reliable as they are insightful. Future work may extend Trust Bench with semantic analysis, richer reporting, and integration with broader AI safety benchmarks—while maintaining its core commitment to deterministic, auditable evaluation. <!-- RT_DIVIDER --> ## Limitations While Trust Bench demonstrates a functional multi-agent evaluation framework, several limitations arise from design decisions made to maintain determinism, reproducibility, and offline operation: ### 1. Sequential Execution Only Agents run in a fixed Security → Quality → Documentation sequence with no parallelism. Large repositories may experience slower processing times as a result. ### 2. No Retry or Recovery Logic If any agent raises an exception, the evaluation halts. There are no fallback routines, partial result recovery, or automatic retries. ### 3. Pattern-Based Security Detection The SecurityAgent uses regex-based detection: - Cannot detect obfuscated, encoded, or embedded secrets - Cannot analyze binaries - Can produce false positives (e.g., test fixtures) ### 4. Shallow Quality Signals QualityAgent metrics focus on: - File counts - Language distribution - Test naming conventions It does not perform static analysis, complexity assessment, linting, or dependency security checks. ### 5. README-Only Documentation Evaluation DocumentationAgent analyzes only README-style files: - Does not process wikis, guides, API references, or inline comments - Section detection is keyword-based rather than semantic ### 6. Limited Error Reporting Errors at the file level (permissions, encoding) are silently skipped. Logging is print-based and not structured. ### 7. Deterministic Safety Metrics Refusal accuracy and faithfulness metrics are placeholders when LLM calls are disabled. They do not reflect true model behavior. <!-- RT_DIVIDER --> ## Installation Trust Bench is designed to run entirely offline using Python 3.10+. ### Prerequisites - Python 3.10 or higher - Windows, macOS, or Linux - Optional: Git (for the web UI’s GitHub cloning feature) ### Install Steps ```bash # Clone the project git clone https://github.com/mwill20/Trust_Bench.git cd Trust_Bench # Create virtual environment python -m venv .venv source .venv/bin/activate # or .\.venv\Scripts\activate on Windows # Install core dependencies pip install -r Project2v2/requirements-phase1.txt # (Optional) Install extras pip install -r Project2v2/requirements-optional.txt <!-- RT_DIVIDER --> ## Environment Variables (Optional) | Variable | Purpose | | ---------------------------------------------------- | -------------------------------------------- | | `ENABLE_SECURITY_FILTERS` | Input validation for web UI | | `LLM_PROVIDER` | Provider for optional chat (default: openai) | | `OPENAI_API_KEY` / `GROQ_API_KEY` / `GEMINI_API_KEY` | Required only for optional Q&A chat | | `TRUST_BENCH_WORKDIR` | Output directory override | <!-- RT_DIVIDER --> --- # **4. Usage** ```markdown ## Usage Trust Bench supports both command-line and web-based workflows. ### Command-Line (CLI) ```bash python Project2v2/main.py --repo <path_to_repo> --output <output_dir> <!-- RT_DIVIDER --> ## Example python Project2v2/main.py --repo . --output results <!-- RT_DIVIDER --> ## Outputs: results/report.json results/report.md <!-- RT_DIVIDER --> ## Web Interface: python Project2v2/web_interface.py Then open: http://localhost:5000 Features: - GitHub URL cloning - Local repository selection - Real-time agent progress - Optional LLM-powered Q&A about results <!-- RT_DIVIDER --> ## Offline Operation All agent logic and scoring are deterministic and run fully offline. LLM API keys are only required for optional web UI chat. --- # **5. Resilience & Error Handling (Reviewer-Required)** ```markdown ## Resilience and Error Handling Trust Bench emphasizes reproducibility and simplicity, but several resilience mechanisms are included to ensure stable execution: ### File-Level Handling - File-access errors (permissions, encoding) are wrapped in `try/except` blocks. - Unreadable files are **skipped without interrupting the workflow**. ### Workflow-Level Handling - Agents run sequentially; errors inside a single agent propagate upward. - If an agent exception occurs, the evaluation stops and the error is surfaced to the entry layer (CLI or web UI). - No mid-run retries or fallbacks are currently implemented. ### Web Interface Protections - GitHub cloning is protected by a 120-second timeout. - URL and path input are sanitized via `security_utils.py`. - Optional security filters prevent path traversal and prompt injection. ### Reporting Stability - Report generation uses defensive serialization (`serialize_tool_result`) to prevent failures from leaking malformed objects into JSON output. These mechanisms ensure evaluations complete reliably for typical repositories while maintaining simple, deterministic behavior suitable for an academic multi-agent system. <!-- RT_DIVIDER --> ## Future Work Several extensions and improvements are planned to expand the capability and robustness of Trust Bench: ### 1. Agent Reliability and Resilience - Add retry logic for transient file-access errors - Add graded failures (e.g., continue workflow even if one agent fails) - Introduce structured logging for auditability ### 2. Enhanced Security Analysis - Integrate optional static analysis (e.g., Semgrep) - Add dependency vulnerability scanning - Enable custom secret patterns via configuration files ### 3. Richer Quality Metrics - Cyclomatic complexity, linting, and maintainability scoring - Coverage-based test evaluation when tests are executable - Language-specific heuristics (Python, JavaScript, Go) ### 4. Documentation Intelligence - Multi-file documentation analysis (docs/, guides/) - Broken-link detection - Semantic section classification with LLMs (optional, not required) ### 5. Parallel Execution Use LangGraph’s async + parallel edges to speed up analysis on large repositories. ### 6. Configurable Scoring - Per-agent weight adjustments - YAML-based scoring profiles - Pluggable agent design for custom evaluation pipelines ### 7. CI/CD Integration - GitHub Actions template for automated PR evaluation - Docker packaging for isolated execution