
This project presents a multi-agent system designed to analyze software repositories, evaluate documentation quality, and generate actionable improvement recommendations in a controlled, human-guided workflow. The system leverages structured agents that perform repository analysis, keyword extraction, content enhancement, and final report synthesis. A human-in-the-loop interaction layer ensures that recommendations remain accurate, responsible, and aligned with the original intent of the repository owner. The project demonstrates how multi-agent collaboration, combined with interactive decision points, leads to more reliable technical content evaluation.
Modern technical repositories increasingly rely on automation to maintain consistent documentation quality, yet fully automated evaluation systems often lack the contextual awareness and judgment necessary for content correctness. To address this gap, this project introduces a multi-agent framework that works cooperatively to inspect repository documentation and produce structured content improvements. Unlike traditional automated systems, this implementation emphasizes human supervision at key stages, ensuring that AI-generated insights remain trustworthy and appropriate. The system brings together agent specialization, error-resilience mechanisms, and clear state-sharing across the pipeline to create a dependable review assistant.
The system is designed around a clear agent workflow. After extracting the README from a GitHub repository, the Repo Analyzer agent evaluates structure, completeness, and documentation quality. The Tag Recommender agent then identifies potential metadata through keyword extraction. Next, the Content Improver agent suggests refinements to titles and introductory sections of the README. Finally, the Reviewer agent compiles insights from all stages into a structured final report. To ensure correctness, a human-in-the-loop checkpoint appears after every major agent step, allowing the user to proceed, stop, or manually edit intermediate content before it is passed to the next agent.
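A minimal sketch of how this sequence might be wired together is shown below. The function names, dictionary keys, and stand-in agent logic are illustrative assumptions rather than the project's actual implementation; the checkpoint is passed in as a callable so the same pipeline can also run non-interactively. A fuller keyword-extraction sketch appears later in this document.

```python
# Hypothetical sketch of the four-agent sequence; names and stand-in logic
# are illustrative, not the project's actual implementation.
from typing import Callable


def run_pipeline(readme: str, checkpoint: Callable[[str, str], str]) -> dict:
    state = {"readme": readme}

    # Repo Analyzer: minimal structural check used as a stand-in here.
    state["analysis"] = {"has_install_section": "install" in readme.lower()}
    state["readme"] = checkpoint("analysis", state["readme"])

    # Tag Recommender: crude keyword pick (a fuller sketch appears later).
    words = {w.lower().strip(".,#*`") for w in readme.split() if len(w) > 6}
    state["tags"] = sorted(words)[:5]
    checkpoint("tags", ", ".join(state["tags"]))

    # Content Improver: placeholder suggestion for the title line.
    title = readme.splitlines()[0] if readme else ""
    state["suggestion"] = f"Consider clarifying the title: {title!r}"
    checkpoint("improvement", state["suggestion"])

    # Reviewer: consolidate intermediate results into a final report.
    state["report"] = "\n".join(f"{k}: {v}" for k, v in state.items())
    return state


if __name__ == "__main__":
    # Non-interactive run: the checkpoint simply approves every stage.
    demo = "# Demo Project\nA small example README.\nInstall with pip."
    print(run_pipeline(demo, lambda stage, content: content)["report"])
```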
The system does not use a traditional dataset. Instead, it dynamically retrieves the README content directly from any public GitHub repository using the GitHub REST API and raw file fallbacks. This ensures that the evaluation is always performed on the most recent version of the documentation.
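The retrieval step could look roughly like the following sketch. The `fetch_readme` helper and the branch fallbacks are assumptions; the GitHub README endpoint and the raw content host are the public ones.

```python
# Hedged sketch of README retrieval; fetch_readme is a hypothetical helper.
import requests


def fetch_readme(owner: str, repo: str, timeout: int = 10) -> str:
    # Preferred path: the GitHub REST API returns the README in raw form
    # when asked for the raw media type.
    api_url = f"https://api.github.com/repos/{owner}/{repo}/readme"
    headers = {"Accept": "application/vnd.github.v3.raw"}
    resp = requests.get(api_url, headers=headers, timeout=timeout)
    if resp.status_code == 200:
        return resp.text

    # Fallback path: try the raw content host for common default branches.
    for branch in ("main", "master"):
        raw_url = f"https://raw.githubusercontent.com/{owner}/{repo}/{branch}/README.md"
        resp = requests.get(raw_url, timeout=timeout)
        if resp.status_code == 200:
            return resp.text

    raise RuntimeError(f"Could not retrieve README for {owner}/{repo}")


# Example usage (network access required):
# text = fetch_readme("sbm-11-SFDC", "rt-aaidc-project2-multiagent")
```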
Each execution produces its own small dataset of artifacts, consisting of:
- the retrieved README content,
- the repository analysis results,
- the recommended tags,
- the suggested title and introduction improvements, and
- the final structured evaluation report.
These artifacts are stored locally within the outputs directory, enabling full reproducibility.
The system is designed for easy setup so users and evaluators can reproduce outputs.
The recommended environment setup begins by creating a Python virtual environment and installing the required dependencies included in the repository.
A .env file must be created to supply the Google API key used for all LLM and embedding operations.
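A minimal sketch of loading that key with `python-dotenv` is shown below; the variable name `GOOGLE_API_KEY` is an assumption about how this project names it.

```python
# Hypothetical configuration loading; GOOGLE_API_KEY is an assumed name.
import os

from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from a local .env file into os.environ

api_key = os.getenv("GOOGLE_API_KEY")
if not api_key:
    raise RuntimeError("GOOGLE_API_KEY is missing; add it to your .env file.")
```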
After installation, the system can be executed using a single command that accepts a GitHub repository URL.
Users may also enable human-in-the-loop mode to intervene during key decision-making stages.
All generated artifacts, including recommendations and structured evaluation reports, are automatically saved in the outputs directory.
The solution is implemented in Python using a lightweight agent-based architecture. GitHub API endpoints are used to retrieve repository content, and network calls use retries with exponential backoff to improve resilience. Multi-agent coordination is handled by custom orchestrator logic that manages shared state and structured responses between agents. Language-generation and embedding tasks are delegated to hosted Google GenAI models, while all agent logic, orchestration, and tool execution run locally, keeping resource requirements minimal.
A multi-agent architecture allows specialized components to independently focus on different aspects of repository evaluation, resulting in clearer modularity and maintainability. Each agent performs a distinct role, and the orchestrator synchronizes their outputs, ensuring the system remains extensible. This modular separation also makes it easier to debug, replace, or enhance individual agents without affecting the larger workflow.
To ensure high-quality and human-aligned recommendations, the system integrates a simple human-in-the-loop workflow. After the initial repository analysis, the system pauses, displays the analyzer output, and asks the user to proceed, stop, or edit the extracted text. If the user edits, the revised content is used in subsequent agent steps. The same checkpoint behavior repeats before the tag recommender and content improver stages. This lightweight interaction model keeps the user in control of recommendations, increases trust, and prevents cascading errors from automatic suggestions.
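The checkpoint prompt might look roughly like this sketch; the exact wording, options, and function name in the project may differ.

```python
# Illustrative HITL checkpoint; the project's actual CLI prompt may differ.
def hitl_checkpoint(stage: str, content: str) -> str:
    """Show an agent's output and let the user proceed, stop, or edit it."""
    print(f"\n--- {stage} output ---\n{content}\n")
    while True:
        choice = input("(p)roceed, (s)top, or (e)dit? ").strip().lower()
        if choice == "p":
            return content  # pass the content through unchanged
        if choice == "s":
            raise SystemExit(f"Run stopped by user after the {stage} stage.")
        if choice == "e":
            return input("Enter the revised content: ")
        print("Please answer p, s, or e.")
```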
The system incorporates multiple safeguards to ensure stable and dependable operation, even under unpredictable conditions. Robust fallback mechanisms handle GitHub API failures, and input sanitization ensures that inconsistent or poorly structured repository documentation does not break the pipeline. A well-defined shared state model prevents data loss across agent steps and minimizes invalid state transitions.
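One way such a shared state model could be expressed is sketched below; the field names and the `require` guard are assumptions based on the stages described above.

```python
# Sketch of a typed shared-state object; field names are assumptions.
from dataclasses import dataclass, field


@dataclass
class SharedState:
    repo_url: str
    readme: str = ""
    analysis: dict = field(default_factory=dict)
    tags: list = field(default_factory=list)
    improvements: list = field(default_factory=list)
    final_report: str = ""

    def require(self, *names: str) -> None:
        """Guard against invalid transitions: later agents fail fast if an
        earlier stage has not populated the fields they depend on."""
        missing = [n for n in names if not getattr(self, n)]
        if missing:
            raise ValueError(f"State is missing required fields: {missing}")


# Example: the Reviewer would call state.require("analysis", "tags")
# before compiling the final report.
```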
Reliability was a core design objective. Transient network errors are mitigated using Tenacity-based exponential backoff, ensuring that temporary embedding or API issues do not interrupt the workflow. Each agent operates within a controlled state boundary, making error propagation more predictable and significantly easier to diagnose. Graceful fallback behaviors are triggered when key repository elements, such as README files or metadata, are missing, or when keyword extraction produces insufficient signals.
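A sketch of how Tenacity-based backoff might wrap a network call follows; the retry limits shown are illustrative, not the project's actual settings.

```python
# Exponential backoff around a network call using Tenacity; limits are assumed.
import requests
from tenacity import retry, stop_after_attempt, wait_exponential


@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=1, max=10))
def get_with_backoff(url: str, timeout: int = 10) -> requests.Response:
    resp = requests.get(url, timeout=timeout)
    resp.raise_for_status()  # trigger a retry on HTTP errors as well
    return resp
```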
The system also emphasizes defensive engineering. Assumptions are validated at every step to avoid crashes during edge cases. Meaningful error messages guide the user when encountering malformed URLs, unexpected exceptions, or incomplete agent outputs. Additionally, the human-in-the-loop mechanism ensures that no automated recommendation is finalized without explicit user approval, further reinforcing system reliability and preventing incorrect or low-quality outputs from reaching the final stage.
The implementation begins by validating the provided repository URL and attempting to fetch its README using the GitHub API or the raw content fallback. If fetching fails, the system retries the request automatically before halting with a descriptive failure message. Once retrieved, the README is passed to the Repo Analyzer, which inspects its structure and checks for the presence of key documentation sections.
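URL validation could be handled by a small helper along these lines; the accepted URL shapes and the function name are assumptions.

```python
# Hypothetical URL validation helper; accepted URL shapes are assumptions.
from urllib.parse import urlparse


def parse_repo_url(url: str) -> tuple[str, str]:
    """Validate a GitHub repository URL and return (owner, repo)."""
    parsed = urlparse(url)
    if parsed.netloc != "github.com":
        raise ValueError(f"Not a GitHub URL: {url}")
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"Expected https://github.com/<owner>/<repo>, got: {url}")
    owner, repo = parts[0], parts[1].removesuffix(".git")
    return owner, repo


# parse_repo_url("https://github.com/sbm-11-SFDC/rt-aaidc-project2-multiagent")
# -> ("sbm-11-SFDC", "rt-aaidc-project2-multiagent")
```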
The next stage evaluates potential tags by analyzing keyword frequencies while excluding common stopwords. Afterwards, suggested improvements for titles and introductory text are generated. Each stage shares its output through a centralized state object, enabling clean, predictable transitions.
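A frequency-based tag recommender of this kind might be sketched as follows; the stopword list and thresholds are illustrative, not the project's exact ones.

```python
# Sketch of frequency-based tag extraction with a small stopword list.
import re
from collections import Counter

STOPWORDS = {"the", "and", "for", "with", "that", "this", "from", "are", "your"}


def recommend_tags(text: str, top_k: int = 5) -> list[str]:
    # Tokenize into lowercase words of three or more characters.
    words = re.findall(r"[a-z][a-z0-9+-]{2,}", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_k)]


# recommend_tags("A multi-agent system for README analysis and README tagging")
# might return ["readme", "multi-agent", "system", "analysis", "tagging"]
```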
Human-in-the-loop checkpoints serve as validation stages where the user can approve or modify intermediate results. These checkpoints not only prevent error propagation but also encourage transparency and human oversight, addressing important safety and accuracy concerns in AI-assisted tools.
Evaluation is conducted by running the tool on a variety of GitHub repositories differing in size, structure, and documentation completeness. Each run measures the quality of tag recommendations, the relevance of content improvements, and the clarity of the final report. Reliability tests simulate API failures to verify retry behavior, while usability checks ensure the HITL checkpoints respond correctly to user input.
The multi-agent system was tested locally on a Windows machine using a Python virtual environment. The evaluation environment included an Intel Core i3 processor and 8 GB RAM, demonstrating that the system runs effectively on standard hardware without requiring a GPU. Hosted models from Google GenAI handled language-generation tasks, while all agent logic, orchestration, and tool execution occurred locally. This setup ensured reliable, reproducible conditions for assessing agent coordination, error handling, and workflow stability.
The system was evaluated by running complete end-to-end workflows across multiple public GitHub repositories. Each agent's contribution (analysis, keyword extraction, content improvement, and final review) was examined for relevance, clarity, and consistency. Edge-case scenarios such as missing README files, incomplete metadata, and inconsistent formatting were introduced to observe how the system recovered from failures. Human-in-the-loop confirmation was included to verify whether user feedback influenced and improved the final output, ensuring the system supports collaborative refinement.
Evaluation focused on practical system-level indicators rather than formal ML metrics. Key measures included relevance of agent outputs to repository content, coherence of the final consolidated summary, workflow completion rate, and stability under error conditions. Latency was observed to assess responsiveness, and HITL interaction quality was measured by how effectively the system integrated user confirmations or corrections. Together, these metrics provided a concise view of accuracy, robustness, and reliability.
This project is fully open-source under the MIT License. Users may modify, distribute, and integrate this system into their own tools with proper attribution. The open license ensures that the multi-agent system can be freely extended for commercial, academic, or personal use.
Testing showed consistent and accurate extraction of README files across multiple repositories. Recommended tags typically aligned with repository topics, and content improvements were concise and contextually appropriate. The reliability tests confirmed that retry logic successfully recovered from intermittent failures without user intervention. Human-in-the-loop checkpoints significantly improved overall output quality by allowing user corrections where necessary.
| Command Executed | Output Screenshot |
|---|---|
| `python -m src.app --repo "https://github.com/sbm-11-SFDC/rt-aaidc-project1-template"` | ![]() |
| `python -m src.app --repo "https://github.com/sbm-11-SFDC/rt-aaidc-project2-multiagent"` | ![]() |
| `python -m src.app --repo "https://github.com/sbm-11-SFDC/rt-aaidc-project2-multiagent" --no-interactive` | ![]() |
The system currently relies on deterministic heuristics for much of its analysis, which limits the sophistication of content improvement suggestions. README analysis accuracy also depends heavily on the structure and clarity of the original documentation. Agent behavior is intentionally kept simple and largely rule-based to avoid unnecessary API costs and unpredictable LLM outputs.
The next phase involves expanding the system with optional LLM-powered agents for more advanced content rewriting. Template-based transformations and quality scoring modules may also be added. Further enhancements could include a simple Web UI and support for batch repository analysis.
Since the system relies only on Python and GitHub APIs, deployment is lightweight. It can be executed locally or packaged within a Docker container. Ensuring network availability is crucial for reliable README retrieval.
While the system meets project requirements, it does not yet support multi-repository batch processing, advanced semantic tagging, or continuous integration workflows. These features can be added in future iterations.
The system requires periodic updates to maintain compatibility with GitHub API changes. Additionally, incorporating optional logging and telemetry (disabled by default) may help analyze tool usage and improve long-term maintainability.
Compared to traditional static linters or documentation checkers, this multi-agent solution provides dynamic, context-aware insights. It improves on simpler tools by integrating flexible human oversight and a structured improvement workflow. This hybrid approach balances automation with editorial control.
The project demonstrates how multi-agent systems can be used to automate and enhance documentation workflows in software development. By integrating human feedback loops, the system avoids common pitfalls of fully autonomous tools and provides an example of responsible AI design.
The approach aligns well with emerging trends in developer tooling, where automated code assistants are increasingly backed by multi-agent coordination. The HITL model is also gaining traction as organizations prioritize AI systems that retain human authority and avoid uncontrolled automation.
This project successfully implements a modular, reliable, and transparent multi-agent system capable of analyzing repositories and generating structured recommendations. Its human-in-the-loop checkpoints ensure high-quality results and responsible use of automation. Through specialization, resilience, and controlled collaboration between agents and human reviewers, the system presents a practical and effective solution for improving documentation quality at scale.
Author
Suraj Mahale
AI & Salesforce Developer
GitHub: https://github.com/sbm-11-SFDC