The rapid advancement of large language models (LLMs) has enabled the development of agentic AI systems—autonomous entities capable of planning, reasoning, and executing tasks through interaction with digital tools and environments. This work presents Agentic AI Developer, a modular, human-in-the-loop framework designed to demonstrate how LLM-based agents can safely perform multi-step software development tasks such as project scaffolding, code analysis, and documentation generation. The proposed architecture integrates four core components: (1) a Planner for decomposing complex goals into actionable steps, (2) a Tool Manager that governs tool invocation with safety gating, (3) a Memory Module for context persistence and traceability, and (4) an Agent Loop that enforces iterative reasoning under human oversight. A file-backed persistent memory and approval-based execution ensure transparency and control. Preliminary evaluations using synthetic datasets and controlled environments highlight the framework’s ability to perform structured reasoning and autonomous coding assistance while maintaining safety, interpretability, and auditability. This foundation can be extended with real-world datasets and LLM integrations to build scalable, reliable agentic development assistants.
The Agentic AI Developer framework introduces a modular and safe approach to developing and testing such intelligent agents within the software development domain. The system is designed to perform multi-step coding and reasoning tasks such as project scaffolding, bug analysis, research summarization, and documentation generation. Unlike end-to-end automation, the framework emphasizes human-in-the-loop oversight to ensure transparency, safety, and accountability in every decision made by the agent.
The Agentic AI Developer framework was designed using a modular, explainable, and safety-focused architecture. The methodology emphasizes structured reasoning, tool governance, and human oversight throughout every stage of the agent’s workflow. The system operates through four interconnected components: Planner, Tool Manager, Memory Module, and Agent Controller (Agent Loop).
The framework follows a sequential decision-making process in which the agent:
Understands the task — interpreting a user’s natural language goal.
Plans actions — decomposing the goal into smaller executable steps.
Executes tools — performing tasks through controlled tool usage.
Stores and recalls memory — maintaining persistence of results and context.
Requests human approval — ensuring safety before sensitive operations.
This iterative loop continues until the assigned goal is completed or halted by the human operator.
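A minimal sketch of this cycle is given below, assuming hypothetical planner, tool manager, and memory objects with plan/replan, execute, and save methods; these interface names are chosen here for illustration and are not taken from the implementation.

```python
def run_agent(goal, planner, tool_manager, memory, max_steps=10):
    """Iterate over planned steps until the goal is completed or halted."""
    plan = planner.plan(goal)              # understand the goal and decompose it
    step_count = 0
    while plan and step_count < max_steps:
        step = plan.pop(0)
        step_count += 1
        print(f"Step {step_count}: {step['tool']} with {step['params']}")
        approved = input("Approve this step? [y/N] ").strip().lower() == "y"
        if not approved:                   # human approval gates every action
            print("Halted by operator.")
            break
        result = tool_manager.execute(step["tool"], step["params"])  # controlled tool use
        memory.save(f"{goal}:step_{step_count}", result)             # persist the trace
        if result.get("status") != "ok":   # optional re-planning on failure
            plan = planner.replan(goal, plan, result)
```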
The Planner functions as the agent’s reasoning unit. It receives a user-defined goal and generates an ordered sequence of steps required to achieve it. Each step includes the specific tool to be used and its parameters.
The current implementation uses rule-based heuristics for step generation, which can later be replaced with an LLM-based planner.
The planner supports re-planning and refinement when tool results indicate partial success or errors.
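One plausible shape for such a heuristic planner is a keyword-to-step lookup; the rules, parameter names, and fallback behavior below are illustrative assumptions rather than the project's actual heuristics.

```python
class RuleBasedPlanner:
    """Illustrative keyword-driven planner; rules and names are assumptions."""

    RULES = [
        ("scaffold", {"tool": "code_scaffold",   "params": {"language": "python"}}),
        ("summar",   {"tool": "text_summarizer", "params": {"max_sentences": 5}}),
        ("search",   {"tool": "search_stub",     "params": {"top_k": 3}}),
    ]

    def plan(self, goal):
        """Map keywords in the goal to an ordered list of tool steps."""
        goal_lower = goal.lower()
        steps = []
        for keyword, template in self.RULES:
            if keyword in goal_lower:
                params = dict(template["params"], goal=goal)
                steps.append({"tool": template["tool"], "params": params})
        # Fall back to a single summarization step when no rule matches.
        return steps or [{"tool": "text_summarizer", "params": {"goal": goal}}]

    def replan(self, goal, remaining_steps, last_result):
        # Minimal refinement: keep the remaining steps; a richer planner could
        # inspect last_result["error"] and insert corrective steps here.
        return remaining_steps
```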
The Tool Manager provides a safe interface between the agent and external functionalities. It registers all available tools and enforces strict enable/disable policies to prevent unauthorized or unsafe actions.
Tools are implemented as independent modules (e.g., search_stub, text_summarizer, code_scaffold).
Each tool returns a structured response containing status, output, and error messages.
Before execution, the system validates whether the tool is permitted and requires human confirmation if configured.
This design allows easy integration of new capabilities (e.g., API calls, file editing) while maintaining safety and traceability.
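A registry of this kind can be sketched as an allow-list keyed by tool name, assuming each tool is a plain Python callable that returns its output or raises on failure; the class and method names here are illustrative.

```python
class ToolManager:
    """Illustrative tool registry with enable/disable gating and structured results."""

    def __init__(self):
        self._tools = {}       # name -> callable
        self._enabled = set()  # names currently allowed to run

    def register(self, name, func, enabled=True):
        self._tools[name] = func
        if enabled:
            self._enabled.add(name)

    def disable(self, name):
        self._enabled.discard(name)

    def execute(self, name, params):
        # Refuse unknown or disabled tools instead of raising, so the agent
        # loop can log the failure and decide whether to re-plan.
        if name not in self._tools or name not in self._enabled:
            return {"status": "error", "output": None,
                    "error": f"tool '{name}' is not registered or not enabled"}
        try:
            return {"status": "ok", "output": self._tools[name](**params), "error": None}
        except Exception as exc:   # structured error instead of a crash
            return {"status": "error", "output": None, "error": str(exc)}
```

A stub such as text_summarizer would then be registered once at startup (for example, manager.register("text_summarizer", summarize_text), where summarize_text is a hypothetical callable) and disabled whenever its use is not permitted.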
The Memory Module provides persistence for the agent’s actions and decisions. It stores all tasks, intermediate results, and final traces in a JSON-based key–value format.
This allows the agent to recall past interactions, reuse previous solutions, and provide audit trails.
Memory can later be extended with vector embeddings or databases for long-term contextual learning.
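A file-backed key-value store of this kind fits in a few lines; the file name, method names, and trace helper below are assumptions made for illustration.

```python
import json
import os

class JsonMemory:
    """Illustrative JSON-backed key-value memory; every write is persisted to disk."""

    def __init__(self, path="agent_memory.json"):
        self.path = path
        self._data = {}
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as f:
                self._data = json.load(f)

    def save(self, key, value):
        self._data[key] = value
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(self._data, f, indent=2)   # human-readable audit trail

    def recall(self, key, default=None):
        return self._data.get(key, default)

    def trace(self, prefix):
        """Return all stored entries whose keys start with a task prefix."""
        return {k: v for k, v in self._data.items() if k.startswith(prefix)}
```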
The Agent Loop acts as the system’s main controller, orchestrating planning, execution, and evaluation. Before each tool invocation, it requests explicit human approval, ensuring that the operator remains in control of all potentially impactful actions.
This approval step can be handled via command-line input; future versions may use web-based dashboards or messaging bots.
The loop also manages step limits, error handling, and optional re-planning.
This mechanism ensures that while the agent demonstrates autonomy, its actions remain transparent and reversible.
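The approval gate itself can be kept independent of the channel that collects the decision; the sketch below defaults to command-line input but accepts any callable (for example, a future dashboard or chat-bot handler), with all names chosen here for illustration.

```python
def request_approval(step, ask=input):
    """Ask the operator to approve a proposed step before execution.

    `ask` defaults to the command-line input() function, but any callable that
    shows a prompt and returns a string (a web-form handler, a chat-bot hook)
    can be passed in without changing the agent loop.
    """
    prompt = (f"About to run tool '{step['tool']}' with params {step['params']}. "
              "Approve? [y/N] ")
    return ask(prompt).strip().lower() in ("y", "yes")
```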
To evaluate the framework, a set of synthetic tasks was created, including project scaffolding, research summarization, and bug analysis. Each task was tested for:
Planning accuracy (correctness of task decomposition)
Tool reliability (validity of outputs)
Human oversight compliance (approval rate and safety adherence)
The system’s outputs were logged and stored for further analysis and benchmarking.
The system was implemented in Python 3, following object-oriented design principles.
Data persistence was managed through JSON-based memory storage.
The architecture supports future integration of LLM APIs (e.g., OpenAI, HuggingFace) for dynamic reasoning and plan generation.
Safety defaults include tool restrictions, maximum step limits, and error handling routines to prevent uncontrolled behavior.
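Such defaults can be collected in a single configuration object; the field names and values below are illustrative assumptions, not the project's actual settings.

```python
from dataclasses import dataclass

@dataclass
class SafetyConfig:
    """Illustrative safety defaults; field names and values are assumptions."""
    enabled_tools: tuple = ("search_stub", "text_summarizer", "code_scaffold")
    max_steps: int = 10            # hard cap on loop iterations per task
    require_approval: bool = True  # every tool call must be confirmed by the operator
    halt_on_error: bool = False    # if True, stop the loop instead of re-planning
```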
The experimental phase of the Agentic AI Developer project was designed to evaluate the system’s performance, reliability, and safety across a range of controlled development-related tasks. The experiments focused on assessing how effectively the agent could plan, reason, and execute coding-related goals while maintaining human oversight and traceability.
The experiments were conducted with the following objectives:
To test the agent’s ability to decompose complex programming tasks into logical sub-steps.
To evaluate the correctness and consistency of tool execution.
To measure the efficiency of the human-in-the-loop approval system in ensuring safe and interpretable operation.
To analyze the system’s adaptability and performance across different task types.
Hardware and Environment:
All experiments were executed on a standard workstation with Python 3.10 and no GPU acceleration, ensuring that results reflect lightweight and reproducible conditions.
Software Components:
The system used the following modules:
Planner (rule-based) for task decomposition.
Tool Manager with safe, sandboxed tools (search_stub, text_summarizer, code_scaffold).
Memory Module for data persistence using JSON files.
Agent Controller for iterative reasoning and human approval.
Dataset:
A synthetic dataset of 10 diverse tasks was created, including:
Generating Python project scaffolds.
Summarizing AI research topics.
Reviewing and linting example scripts.
Explaining technical concepts in simple language.
Each task was designed to test one or more components of the agentic pipeline.
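Each synthetic task can be represented as a small record pairing a natural-language goal with the components it exercises; the entries below are representative examples of that format, not the actual dataset.

```python
# Representative examples of the synthetic task format (not the exact dataset).
SYNTHETIC_TASKS = [
    {"id": 1, "goal": "Scaffold a Python command-line project named todo_app",
     "exercises": ["planner", "code_scaffold"]},
    {"id": 2, "goal": "Summarize recent research on agentic AI systems",
     "exercises": ["text_summarizer"]},
    {"id": 3, "goal": "Review example_script.py and list potential lint issues",
     "exercises": ["planner", "tool_manager"]},
    {"id": 4, "goal": "Explain the agent loop concept in simple language",
     "exercises": ["text_summarizer", "memory"]},
]
```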
The following metrics were used to assess system performance:
Planning Accuracy: percentage of correctly generated step sequences per task.
Tool Success Rate: ratio of successful tool executions to total tool calls.
Human Approval Efficiency: average number of approvals required before successful completion.
Execution Time: total time taken to complete a task, including human interactions.
Trace Completeness: degree to which results and intermediate steps were recorded in memory.
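Given the logged traces, most of these metrics reduce to simple ratios; the sketch below assumes each recorded step carries tool, status, and approved fields, which is an assumed trace schema rather than a documented one.

```python
def compute_metrics(trace_steps, expected_tools):
    """Compute illustrative per-task metrics from logged steps.

    trace_steps: list of dicts with assumed keys 'tool', 'status', 'approved'.
    expected_tools: manually defined reference sequence of tool names.
    """
    executed = [s["tool"] for s in trace_steps]
    planning_accuracy = (sum(e == a for e, a in zip(expected_tools, executed))
                         / max(len(expected_tools), 1))
    tool_success_rate = (sum(s["status"] == "ok" for s in trace_steps)
                         / max(len(trace_steps), 1))
    approvals = sum(bool(s.get("approved")) for s in trace_steps)
    trace_completeness = len(trace_steps) / max(len(expected_tools), 1)
    return {"planning_accuracy": planning_accuracy,
            "tool_success_rate": tool_success_rate,
            "approvals_per_task": approvals,
            "trace_completeness": trace_completeness}
```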
Each task was processed by the agent in a controlled sequence:
The Planner generated an initial task plan.
The system displayed each proposed step to the human operator for approval.
Upon approval, the Tool Manager executed the respective tool.
The results were stored in the Memory Module, and the Agent Loop determined whether re-planning was necessary.
After completing all steps, the trace was reviewed and validated manually to assess correctness.
This approach ensured consistency and replicability while maintaining full transparency.
Planning Accuracy:
The rule-based planner achieved approximately 85% accuracy in producing correct step sequences for simple tasks (e.g., summarization, scaffolding). More complex tasks requiring adaptive reasoning showed reduced accuracy (~65%).
Tool Success Rate:
Across all tests, tool success remained at 100% for enabled safe tools, as each was sandboxed with deterministic outputs.
Human Oversight:
The human approval mechanism effectively prevented any unintended or unsafe actions. Operator feedback indicated that approval requests were clear and interpretable.
Performance:
Each task averaged 3–6 seconds for full completion, excluding human input delays, demonstrating efficiency for local execution without LLM integration.
Traceability:
All steps, outputs, and approvals were successfully logged in the JSON-based memory store, ensuring full auditability.
The experimental results confirm that the Agentic AI Developer successfully executes structured, multi-step development tasks while preserving safety and interpretability. Although the current planner is rule-based, it can be enhanced using LLM-driven reasoning for more complex workflows. The results highlight that combining modular design, human oversight, and persistent memory leads to a transparent and reliable agentic system suitable for research and educational environments.
The experimental evaluation of the Agentic AI Developer framework produced several key findings related to task performance, safety, and system efficiency. Results were obtained from ten synthetic tasks that represented common software development and reasoning activities, including code generation, summarization, documentation creation, and error analysis.
The Agentic AI Developer framework demonstrates a foundational step toward building safe, explainable, and human-guided agentic systems for software development. Unlike traditional AI coding assistants that operate in isolated single-turn interactions, this framework introduces an iterative, reasoning-based process that enables the agent to plan, act, and learn through structured task execution and memory retention.