Adaptive Autonomous AI Co-Browser
Abstract
We present a single-agent “cobrowser” framework that equips one large language model (LLM) with a complete set of browser capabilities—navigating pages, interacting with DOM elements, extracting structured data, and maintaining contextual memory over extended sessions. This design challenges the prevailing multi-agent trend by demonstrating that a single “mind” can reliably handle multifaceted web tasks, including research, form submissions, multi-page data gathering, and real-time user collaboration.
Unlike conventional multi-agent solutions that rely on multiple specialized LLMs (e.g., one for planning, one for execution), our framework foregrounds an “LLM prosthetic” approach. We provide the model a controlled browser environment (with robust fallback selectors, dynamic scroll handling, and error recovery) and track progress through an action-state loop. By stabilizing the environment around the LLM—rather than stacking more LLMs—our system achieves high reliability, simpler debugging, and significantly reduced resource costs.
We validated our agent on diverse real-world tasks (e.g., flight lookups, product comparisons, form-filling scenarios) using both free-tier and local LLM APIs. The results show consistently high completion rates and minimal need for human intervention. Moreover, the system is co-browsable: human users can interact with the same headful browser instance in real time, and the agent seamlessly detects and adapts to user-driven changes. We believe our single-agent, open-source approach can serve as an accessible alternative to multi-agent orchestration, lowering the barriers for both enterprise-scale deployments and individual tinkerers.
Introduction
The rise of autonomous AI agents has accelerated in tandem with advanced language models. Many solutions, exemplified by frameworks such as AutoGPT or BabyAGI, position multi-agent orchestration as the norm—assigning distinct LLMs to specialized roles (planning, page parsing, summarizing, etc.). While intriguing, this proliferation of black boxes can exacerbate complexity and cost. Each added LLM introduces additional “hallucination surfaces” and communication overhead. In practice, real-world tasks often demand more straightforward reliability and a pragmatic handling of web interactions.
In response, we propose a single-agent browser automation framework that treats one LLM as the central cognitive engine. We invest heavily in robust code scaffolding—DOM extraction, structured fallback logic, explicit action histories, and a “notes” subsystem—so that the model can operate effectively within a single, consistent environment. The result is what we call a “cobrowser”: the model navigates in headful mode alongside a human, who can click, scroll, or switch pages in parallel. If the user intervenes mid-task, the agent automatically updates its internal context and continues without losing coherence.
Our overarching aim is to demonstrate that single-agent solutions are not just simpler but can meet or exceed the practical reliability of multi-agent setups. By giving the LLM “eyes and hands” while shielding it from guesswork about page structure, we create an intuitive, high-performing system. Furthermore, the approach is well-suited for organizations or individuals concerned about cost, data privacy, and code auditability—particularly if they plan to run locally or use low-budget/free-tier API plans.
Related work
Recent advances in “agentic AI” have driven a surge in frameworks seeking to automate tasks via large language models. Multi-agent paradigms (e.g., AutoGPT, AgentConnect, BabyAGI) organize several specialized LLM instances into distinct roles (planner, executor, critic), aiming to reduce cognitive load for each agent. Despite successes in certain controlled demos, these setups can introduce cascading uncertainties. Each agent’s output becomes another agent’s input—multiplying opportunities for error or hallucination.
Conversely, single-agent or single-model architectures have historically taken simpler forms, such as chatbots with basic web-scraping scripts. Projects like WebPilot or Selenium AI typically wrap minimal browser control around the LLM but often lack deeper fallback strategies, robust state management, or the co-browsing dimension. While certain single-agent solutions offer direct HTML interpretation, they can still offload tasks to external sub-LLMs for advanced reasoning or parsing, blurring the line back toward multi-agent orchestration.
Our work distinguishes itself by unifying all major browser automation needs—navigation, text extraction, form input, scrolling, error handling—under an explicitly coded scaffold that surrounds one LLM. In this sense, we invert the multi-agent logic: instead of chaining black boxes, we code the “world” in detail so the single LLM can operate with minimal confusion. The concept of the “cobrowser” also aligns with efforts in collaborative user–AI interfaces, but it extends beyond passive observation by letting the user and agent share the same live session. Thus, we fill a niche not fully addressed by either purely multi-agent architectures or simpler single-LLM solutions.
Methodology
Overall System Architecture
Our single-agent cobrowser system is built around one LLM “brain,” a Playwright-powered “body,” and a set of code-defined utilities for DOM analysis, fallback strategies, and user-defined functions (UDFs). Together, these components ensure the agent can handle lengthy tasks in real-world browser environments without relying on multiple LLMs or complex “agent orchestration.”
- One LLM, Many Tools. The system only uses one large language model instance per session. Our code (rather than additional LLMs) handles DOM parsing, fallback selectors, memory storage, and user prompts.
- Browser Integration. We use Playwright in headful mode. The user can see the browser in real time, and the agent can sense if the user changes pages or interacts with elements directly.
- Structured Action Loop. An internal state machine processes each “action” (click, input, navigate, wait, etc.) and updates the agent’s progress or errors accordingly.
- Robust Logging & Observability. Logs capture every action (and LLM prompt) for debugging. We also visually highlight elements in the browser.
**Mind Map & UML Diagram**
Mind Map: (Insert or link the image “Browser Automator Mind Map.png” here)
UML Diagram: (Insert or link “uml.puml” or its rendered image here)
These diagrams illustrate the major system modules (LLM Processor, Page Analyzer, Action Handlers, etc.) and their interactions.
Single-Agent Action-State Loop
The heart of our methodology is a structured “action-state loop.” At any given moment, the agent (1) observes the current browser state, (2) proposes an action, (3) executes it, (4) verifies success or failure, then (5) repeats until the user’s goal is fulfilled.
1. Observe
   - The agent retrieves a DOM snapshot containing textual content, links, forms, and relevant metadata.
   - Our extraction pipeline ensures the LLM sees just enough structure (e.g., headings, input fields, button text) without burying it in raw HTML.
2. Propose
   - The single LLM receives a prompt with the current context: the user’s goal, the extracted DOM summary, the action history, and any notes.
   - It suggests the next action (e.g., click a button, input text, scroll, or navigate to a new URL).
3. Execute
   - The system calls the relevant handler (click, input, navigate, etc.).
   - Fallback strategies are triggered if the primary selector fails. This might involve switching from a text-based selector to a CSS or ARIA-role selector, or scanning for a partial match in button text.
4. Verify
   - We check immediate signals such as a change in the page URL, the presence of new content, or whether an input value matches what was typed. If the action still fails, we increment a retry count.
5. Repeat or Conclude
   - If we detect a final success signal (e.g., the user’s “benchmark success condition” is reached), we stop. Otherwise, we continue.
   - If the agent is stuck after multiple retries, it can either attempt a fallback approach or ask the user for help (via the “sendHumanMessage” action).
Below is a simplified code snippet (placeholder) from `machine.ts` demonstrating how we progress between states:
```typescript
// Placeholder snippet
switch (currentState) {
  case "chooseAction":
    // LLM proposes next action
    proposedAction = await llmProcessor.generateNextAction(pageState, context);
    if (proposedAction.type === "click") {
      nextState = "click";
    } else if (proposedAction.type === "input") {
      nextState = "input";
    } else {
      nextState = "handleFailure";
    }
    break;
  // ...
}
```
Key Components and Innovations
DOM Extraction and Summarization
- Automated Page Analyzer
  - Provides the LLM with a structured summary: headings, main text, input fields, links, forms, etc.
  - Minimizes hallucination risk by handing the model discrete, labeled data (e.g., `selector="#searchBox" type="text"`).
- Progressive Scrolling
  - For pages with infinite scroll or large content, the system scrolls down in increments, extracting new content each time.
- Adaptive Extractors
  - We maintain a registry of specialized extractors for forms, media, tables, etc. If one fails, it doesn’t compromise the entire pipeline.
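As a concrete illustration, here is a minimal sketch of the kind of structured summary the page analyzer could hand to the LLM, assuming a Playwright `Page`; the field names (`headings`, `inputs`, `links`) and selector heuristics are illustrative assumptions, and the real extractor registry covers more element types.

```typescript
import { Page } from "playwright";

// Hypothetical shape of the summary handed to the LLM.
interface PageSummary {
  title: string;
  headings: string[];
  inputs: { selector: string; type: string; placeholder: string }[];
  links: { text: string; href: string }[];
}

async function summarizePage(page: Page): Promise<PageSummary> {
  return {
    title: await page.title(),
    // Top-level headings give the LLM a cheap outline of the page.
    headings: await page.$$eval("h1, h2, h3", els =>
      els.map(el => el.textContent?.trim() ?? "").filter(Boolean).slice(0, 20)
    ),
    // Inputs are reported as discrete, labeled items (selector + type),
    // which is what keeps the model from guessing at raw HTML.
    inputs: await page.$$eval("input, textarea, select", els =>
      els.map(el => ({
        selector: el.id
          ? `#${el.id}`
          : el.getAttribute("name")
          ? `[name="${el.getAttribute("name")}"]`
          : el.tagName.toLowerCase(),
        type: el.getAttribute("type") ?? el.tagName.toLowerCase(),
        placeholder: el.getAttribute("placeholder") ?? "",
      })).slice(0, 30)
    ),
    links: await page.$$eval("a[href]", els =>
      els
        .map(el => ({ text: el.textContent?.trim() ?? "", href: el.getAttribute("href") ?? "" }))
        .filter(l => l.text)
        .slice(0, 40)
    ),
  };
}
```

The caps on headings, inputs, and links reflect the same principle as the prompt design: give the model enough structure to act on, not the whole page.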
Fallback Selector Strategies
When the LLM says “click ‘Submit’,” we attempt the provided selector. If that fails:
- Selector Fallback. We check alternative strategies: the nearest button with text containing “submit,” or a known pattern like `.btn-primary`.
- Heuristic Matching. We can guess based on ARIA attributes (`role="button"`) or element name attributes.
- Direct Link Navigation. If the element is a link but clicking fails, we parse its `href` and do a direct `page.goto()`.
This approach drastically reduces failure rates in dynamic or poorly-labeled pages.
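The cascade can be sketched roughly as follows, assuming Playwright’s locator API; the function name, timeouts, and return values are illustrative, not the exact handlers in the repository.

```typescript
import { Page } from "playwright";

// Try the LLM-provided selector first, then fall back to progressively
// looser strategies. Returns the strategy that succeeded, or null.
async function clickWithFallback(
  page: Page,
  selector: string,
  visibleText?: string
): Promise<string | null> {
  // 1. Primary: the exact selector the LLM proposed.
  try {
    await page.click(selector, { timeout: 3000 });
    return "primary-selector";
  } catch {
    // fall through to the next strategy
  }

  // 2. Text-based fallback: any button-like element containing the label.
  if (visibleText) {
    const byText = page
      .locator('button, [role="button"], a', { hasText: visibleText })
      .first();
    if (await byText.count()) {
      await byText.click();
      return "text-fallback";
    }
  }

  // 3. Direct navigation: if the target is a link, follow its href instead.
  const href = await page
    .locator(selector)
    .first()
    .getAttribute("href", { timeout: 2000 })
    .catch(() => null);
  if (href) {
    await page.goto(new URL(href, page.url()).toString());
    return "direct-navigation";
  }

  return null; // escalate to the retry counter / sendHumanMessage
}
```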
Single LLM Prosthetic
We treat the LLM like it’s “blind” without code, so we “give it eyes” (the summarized DOM) and “hands” (the structured set of actions it can request). By controlling the environment in code, we avoid the overhead of multiple black boxes passing partial instructions among themselves.
**Excerpt from LLM Prompt**
```
--- YOUR CURRENT TASK:
{User’s goal text}
--- FEEDBACK:
{Any system or user feedback, e.g. success or error messages}
--- THIS IS THE STRUCTURED PAGE SUMMARY:
{List of extracted elements with short descriptors}
--- TASK HISTORY:
{Last few user or agent actions in compressed form}
---
```
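For illustration, a hedged sketch of how such a prompt might be assembled in code; the field names (`goal`, `feedback`, `summary`, `history`) are assumptions, not the repository’s exact identifiers.

```typescript
// Assembles the LLM prompt from the current context, mirroring the
// excerpt above. History is truncated to the last few actions.
function buildPrompt(ctx: {
  goal: string;
  feedback: string;
  summary: string;   // structured page summary, serialized
  history: string[]; // compressed action history
}): string {
  return [
    "--- YOUR CURRENT TASK:",
    ctx.goal,
    "--- FEEDBACK:",
    ctx.feedback || "(none)",
    "--- THIS IS THE STRUCTURED PAGE SUMMARY:",
    ctx.summary,
    "--- TASK HISTORY:",
    ctx.history.slice(-5).join("\n"),
    "---",
  ].join("\n");
}
```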
Real-Time Co-Browsing
- Headful Mode
  - The user sees a real Chrome/Chromium window. They can manually click around, fill forms, or open new tabs.
- On-the-Fly Adaptation
  - The agent periodically re-checks the DOM; if the user changed pages, it logs a note like “User navigated to https://…” and adjusts accordingly.
- Visual Overlays
  - For debugging, each click or highlight is visually indicated in the browser. We find this essential for user trust.
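A minimal sketch of the on-the-fly adaptation check, assuming a Playwright `Page`; `addNote` stands in for the notes subsystem described next, and the polling approach is one possible implementation of “periodically re-checks.”

```typescript
import { Page } from "playwright";

// Before each planning step, compare the live URL against what the agent
// last saw; if the user (or a redirect) moved the session, record a note
// so the next LLM prompt reflects the new location.
async function detectUserNavigation(
  page: Page,
  lastKnownUrl: string,
  addNote: (text: string) => Promise<void>
): Promise<string> {
  const currentUrl = page.url();
  if (currentUrl !== lastKnownUrl) {
    await addNote(`User navigated to ${currentUrl}`);
  }
  return currentUrl;
}
```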
Notes Subsystem
- Session-Level Memory
  - The agent can store arbitrary text or data to a local `notes` file.
  - Example usage: “add this e-commerce product link to notes,” “read notes,” or “clear notes.”
- Truncation for Longevity
  - If the file grows too large, we keep only the last 5,000 characters. This ensures we never overload the LLM with an entire novel’s worth of context.
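A minimal sketch of the truncation rule, assuming the notes live in a local text file; the path and helper name are illustrative.

```typescript
import { promises as fs } from "fs";

const NOTES_PATH = "./notes.txt";  // assumed location
const MAX_NOTES_CHARS = 5000;      // matches the truncation rule above

// Append a note, then keep only the most recent 5,000 characters so the
// notes never overwhelm the LLM's context window.
async function addNote(text: string): Promise<void> {
  const existing = await fs.readFile(NOTES_PATH, "utf8").catch(() => "");
  const updated = (existing + "\n" + text).slice(-MAX_NOTES_CHARS);
  await fs.writeFile(NOTES_PATH, updated, "utf8");
}
```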
User-Defined Functions (Parametrized Prompts)
Beyond basic navigation and interaction, we allow user-defined functions (UDFs) that the LLM can call via a simple syntax like `::howToWithThisTech("Node.js", "REST API")`. These are effectively parametrized “mega-prompts” that expand into a known, well-tested template.
- Motivation
  - Repetitive tasks such as “Generate a glossary for X” or “Compare library Y vs. Z” get coded once, then invoked on demand.
  - The user/agent can seamlessly incorporate a big chunk of specialized instructions without retyping them each time.
- Implementation
  - A small parser intercepts these `::functionName(args...)` calls in the LLM’s response.
  - If recognized, it loads a template from `user-defined-functions.json` and merges the arguments into the prompt.
  - The user decides whether to (a) replace the entire goal or (b) prepend that function’s expanded text to the current goal.
**Example: user-defined function snippet**
```jsonc
// user-defined-functions.json
{
  "functions": [
    {
      "name": "howToWithThisTech",
      "description": "Build a practical guide from scratch for a given tech stack",
      "template": "You're an expert on {tech}. The user wants to do {useCase}. Generate a step-by-step solution..."
    }
    // ...
  ]
}
```
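A hedged sketch of the parser described above; the regex, the positional placeholder filling, and the assumption that the real file is plain JSON (the comments in the snippet above are illustrative only) are ours, not necessarily the repository’s exact implementation.

```typescript
import { readFileSync } from "fs";

interface UdfDefinition {
  name: string;
  description: string;
  template: string; // contains placeholders like {tech}, {useCase}
}

// Detect a ::functionName(args...) call in the LLM's response and expand
// it into the corresponding template. Returns null if no UDF is found.
function expandUdfCall(responseText: string): string | null {
  const match = responseText.match(/::(\w+)\(([^)]*)\)/);
  if (!match) return null;

  const [, name, rawArgs] = match;
  const args = rawArgs.split(",").map(a => a.trim().replace(/^"|"$/g, ""));

  // Assumes the file on disk is plain JSON without comments.
  const { functions } = JSON.parse(
    readFileSync("user-defined-functions.json", "utf8")
  ) as { functions: UdfDefinition[] };

  const fn = functions.find(f => f.name === name);
  if (!fn) return null;

  // Fill each {placeholder} with the next positional argument.
  let i = 0;
  return fn.template.replace(/\{[^}]+\}/g, () => args[i++] ?? "");
}
```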
This mechanism is an alternative to multi-agent orchestration. We keep the “knowledge structure” in code, not in a second LLM.
Agent “Pedagogy”: Memory & Alignment
While not strictly “training,” we do treat agent behavior like an evolving skill set:
- Success Patterns
  - Each time an action succeeds for a particular domain (e.g., “clicked #searchBox on example.com”), we store it in a local success-pattern database (see the sketch after this list).
  - Next time the agent visits the same domain, we feed it tips: “Try using `#searchBox` for input actions (worked 3 times).”
- Notes & Daily Refresh (Future Extension)
  - In future expansions, we plan daily “sleep cycles” for the agent to re-check its notes, consolidate them, and approach new tasks fresh.
  - This helps prevent memory bloat and potential “confusion” after extremely long sessions.
- User Intervention
  - The agent can call `sendHumanMessage` if stuck. The user then provides guidance or triggers a user-defined function.
  - This keeps the agent from spiraling into fruitless retries.
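As referenced in the Success Patterns item above, a minimal in-memory sketch of the success-pattern store; the real system persists these records locally, and the tip wording here is illustrative.

```typescript
// Selectors that worked on a given domain are counted and surfaced as
// tips on the next visit to that domain.
const successPatterns = new Map<string, Map<string, number>>();

function recordSuccess(domain: string, selector: string): void {
  const forDomain = successPatterns.get(domain) ?? new Map<string, number>();
  forDomain.set(selector, (forDomain.get(selector) ?? 0) + 1);
  successPatterns.set(domain, forDomain);
}

function tipsFor(domain: string): string[] {
  const forDomain = successPatterns.get(domain);
  if (!forDomain) return [];
  return [...forDomain.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([selector, count]) => `Try using ${selector} (worked ${count} times).`);
}
```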
Illustrative Example
Below is a condensed excerpt from a real log file (see “agent-25060.log.txt” for the full session). It shows the agent:
- Starting with a user goal to check currency rates on `xe.com`.
- Generating an action to navigate.
- Falling back to an alternative button if the first click fails.
- Logging success and comparing the discovered rate to thresholds (1.05 or 1.10) before concluding.
```
[INFO] Starting automation session
  { goal: "Check USD->EUR rate, say 'bad time...' if <1.05, else 'might be good...'" }
[INFO] LLM response => nextAction: { type: "navigate", value: "https://www.xe.com" }
[INFO] Navigation successful, verifying...
  URL changed from "https://www.google.com" to "https://www.xe.com"
[INFO] Next action => "click"
  Attempting selector ".convert-button"
[WARN] Fallback triggered, .convert-button not found
[INFO] Trying text-based fallback => Searching for 'Convert' text in button
[INFO] Found match: button: "Convert"
[INFO] Click successful, verifying...
[INFO] Retrieved exchange rate => 1.12
[INFO] "might be a good time to buy EUR."
```
In production, the logs are significantly more verbose, recording each prompt to the LLM, the exact fallback logic triggered, and so on. This level of detail is crucial for auditing agent decisions.
Summary of Methodology
Our system merges:
- Single-LLM Cognition: One model responsible for planning, summarizing, and deciding.
- Code-Heavy Environment: We handle DOM extraction, fallback logic, and “function calls” in code, not by adding more LLMs.
- Co-Browsing: The agent works in a headful browser session with real-time user synergy.
- User-Defined Functions: A templating system that short-circuits repeated requests into parameterized “mega-prompts.”
- Adaptable Memory: Logging, success patterns, and optional daily refresh cycles.
By stabilizing the environment around one black box (the LLM), we achieve robust, cost-friendly, and transparent agentic automation—a direct counterpoint to the complexity of multi-agent hype.
Experiments
Our experimental design aimed to answer four main questions:
- Can a single-agent cobrowser complete a wide range of web tasks reliably without multi-LLM orchestration?
- How robust is the fallback logic for DOM elements and user-defined functions (UDFs)?
- Does co-browsing (human + agent) improve success rates on more complex tasks?
- How does performance vary across different LLM backends (OpenAI, Gemini, local) in terms of success rate and attempts?
Benchmark Prompts and Setup
We compiled 20 tasks from real-world scenarios, adapted from our benchmark-prompts.txt file, plus a few additional stress-test tasks (like infinite scroll, multi-step forms, and repeated context switching). Representative examples include:
- “Flight Lookup”
  - Navigate to an airline or flight status site, input a flight number, handle potential reCAPTCHAs or waiting spinners, and interpret the flight’s departure/arrival data.
- “Pricing Scraper”
  - Compare item prices across multiple e-commerce sites, add partial info to notes, and recommend the best discount vs. best raw price.
- “Form Validation Stress Test”
  - Visit a form-rich site, intentionally input invalid data (to confirm error messages), then switch to valid data. Save the final success message to notes.
- “Research & Summarize”
  - Search on Google, open the top five relevant links, extract or summarize key points, and compile them into notes. Then read the notes and give the user a short “executive summary.”
- “Kubernetes Crash Course” (User-Defined Function Example)
  - Call `::howToWithThisTech("Kubernetes", "Crash Course")`, which triggers a parameterized prompt. The agent then collects official docs, extracts relevant definitions, and writes a cohesive tutorial in notes.
Hardware/Environment:
- Machine: mid-range laptop (Intel i7, 16GB RAM), Windows 11
- Browser: Playwright with a local Chrome install in headful mode
- LLM Providers:
- OpenAI GPT-3.5
- Gemini v2.0-flash (free-tier)
- Ollama (local Llama-based model)
Evaluation Metrics:
- Completion Rate: Did the agent satisfy the entire prompt’s requirements?
- Attempts: Number of actions before success (including retries).
- Fallback Triggers: How often the agent used alternative selectors or direct link navigation.
- User Interventions: How many times the agent called `sendHumanMessage` or the user forcibly took control in the co-browsing window.
Procedure
- Initialize: We started the agent from the command line with different LLM backends.
- Load Task Prompt: The user typed or pasted a single “task statement” (e.g. “Open flightaware.com, check flight AA100… If it’s departed, say X. If not, do Y.”).
- Run: The agent attempted each step, logging successes or fallback triggers.
- Co-Browsing (Optional for Some Tasks): In half the experiments, we let a human user manipulate the browser mid-task (clicking links or changing pages). The agent had to detect the new URL or DOM state and adapt.
- Completion Check: We measured whether the final results matched the target outcome (e.g., “Successfully found the flight status,” or “Recorded the best e-commerce deals in notes”).
Quantitative Results
| LLM Backend | # of Tasks Tested | Success Rate | Avg Attempts | Fallback Usage | User Interventions |
|---|---|---|---|---|---|
| OpenAI GPT-3.5 | 20 | 18 / 20 (90%) | 2.8 per task | 22 total | 4 |
| Gemini v2.0-free | 20 | 17 / 20 (85%) | 3.1 per task | 27 total | 6 |
| Ollama (Local) | 20 | 15 / 20 (75%) | 3.5 per task | 35 total | 5 |
Observations:
- High Overall Completion: Even with a single LLM, ~75–90% of tasks completed without indefinite loops or major breakdowns.
- Fallback Usage: We saw fallback selectors triggered an average of 1.1 times per task. Notably, e-commerce pages with dynamic naming or heavy JavaScript triggered more fallback attempts.
- User Interventions: These typically happened on tasks with tricky captchas, weird input fields, or when the user intentionally intervened to test co-browsing resilience.
Fallback and Error Analysis
To better understand system resilience, we tracked the top reasons for fallback:
- Non-Unique Selectors: Some pages had multiple “Submit” buttons. The agent guessed incorrectly, then used text-based fallback to find the correct one.
- Dynamic Scripts: Elements changed or loaded late; the initial selector was missing. A second attempt after a short wait or a different strategy (class-based selector) resolved it.
- Heuristics for Links: Sometimes the agent wanted to “click” an anchor that was actually a JS component with a weird `href`. The fallback eventually used direct navigation with the `href`.
From the logs we see a typical pattern of 0–2 fallback attempts per major action.
Impact of User-Defined Functions
A subset of tasks tested user-defined function calls (UDFs). For instance:
- `::compareOpinions("Clean Code Debacle")`: The agent automatically loads a curated “Compare Opinions” prompt, collecting multiple viewpoints from blog posts or YouTube transcripts.
- `::learnJargon("Docker Compose")`: Pulls official docs plus user community Q&A to build a jargon glossary.
Key Findings:
- Tasks that used UDFs had fewer LLM calls and shorter chain-of-thought overhead; the agent “knew exactly how” to structure the output with minimal trial and error.
- UDFs acted like a “partial blueprint,” reducing random wandering and likely saving tokens (and thus cost) on API-based LLMs.
Co-Browsing Outcomes
Half of the tests were run in co-browsing mode, where the user occasionally clicked links or typed in the browser directly:
- Success Rate: Co-browsing tasks achieved a slightly higher success rate (88% vs. 82%), presumably because a human could intervene if the agent got stuck or navigated incorrectly.
- Intervention Reduction: Interestingly, we didn’t see more “sendHumanMessage” calls in co-browsing scenarios. The agent recognized user-driven changes in the DOM and adapted without explicit requests for help.
Example Use Case: “Form Validation Stress Test”
Task:
“Go to the sample form. Input invalid data in each field, press submit, note errors. Then input valid data, register, note the success message. Count how many attempts it took.”
- Navigate: The agent found the page, recognized fields for “name,” “email,” “password,” etc.
- Invalid Input: It typed obviously incorrect data (like a short password) and clicked submit.
- Errors: The agent confirmed error messages in the DOM (through text-based extraction or the presence of `error` classes).
- Notes: The agent wrote “3 errors found” to the session notes.
- Retry with Valid Data: The agent used known patterns (like typical email “test@example.com,” password “Passw0rd!”) or short code-based heuristics to generate random data.
- Final: On success, it appended “Successfully registered on attempt #2” to the notes and ended the session.
This scenario showed a typical pattern: 2–3 attempts total, 1–2 fallback triggers, no direct user intervention.
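For illustration only, a hedged Playwright sketch of the invalid-then-valid loop this scenario exercises; the field selectors, submit button, and `.error` class are assumptions about the sample form, not its actual markup.

```typescript
import { Page } from "playwright";

// Fill the form with invalid data first, then valid data, checking for
// validation errors after each submit. Returns the attempt count.
async function stressTestForm(page: Page): Promise<number> {
  const attempts: Record<string, string>[] = [
    { "#name": "x", "#email": "not-an-email", "#password": "123" },            // invalid
    { "#name": "Test User", "#email": "test@example.com", "#password": "Passw0rd!" }, // valid
  ];

  let attempt = 0;
  for (const fields of attempts) {
    attempt++;
    for (const [selector, value] of Object.entries(fields)) {
      await page.fill(selector, value);
    }
    await page.click("button[type=submit]");
    const errorCount = await page.locator(".error").count();
    if (errorCount === 0) break; // no validation errors remained: success
  }
  return attempt;
}
```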
Limitations and Future Testing
- Highly Complex Flows: Some advanced tasks (like chunked PDF reading or dynamic dashboards requiring repeated logins) were not in our 20 benchmark tasks. We plan follow-up experiments with user authentication flows.
- Load & Concurrency: We tested single sessions, not multiple simultaneous agent instances. Early tests suggest it scales linearly if each agent has its own LLM instance and ephemeral browser.
- Local LLM Quality: The local model (Ollama) performed well overall, but it struggled with more nuanced instructions (leading to more retries). With improved local model finetuning or prompt optimization, we suspect local LLM performance would rise.
Summary of Experimental Findings
These experiments confirm that a single-agent cobrowser can reliably handle multi-step tasks across diverse web environments, even with free-tier or local LLMs. Fallback strategies, user-defined functions, and co-browsing support all significantly boost reliability. The approach is thus validated as a practical, cost-effective alternative to multi-agent frameworks—particularly for everyday real-world tasks where thorough code scaffolding outperforms additional black boxes.
Results
Having covered experimental design and methodology, we now present the major outcomes and observations drawn from our benchmark tasks and real-world user logs (including CAPTCHA encounters and other complexities).
Task Completion and Reliability
Across our 20 benchmark tasks, overall success ranged between 75% and 90%—depending primarily on which LLM backend was used:
| LLM | Success Rate | Avg Attempts | Fallback Usage | Notes |
|---|---|---|---|---|
| OpenAI GPT-3.5 | 90% | ~2.8 per task | 22 total | Most robust; minimal manual help |
| Gemini v2.0-free | 85% | ~3.1 per task | 27 total | Minor prompt stumbles |
| Ollama (Local) | 75% | ~3.5 per task | 35 total | More retries needed |
Tasks like “quick lookups” (e.g., flights or currency exchange) or “basic multi-page scraping” had success rates near or at 100%, whereas more complex flows (like multi-step forms, multi-site note-taking) sometimes triggered multiple retries.
Resilience to Variation and Fallback Usage
An important design goal was robustness against dynamic site structures and varied DOM elements. Our logs show:
- Average of 1.1 fallback triggers per multi-step task. For instance, if `.convert-button` didn’t exist, the system automatically tried `[text~=Convert]` or a direct `href` navigation.
- Minimal Hard Failures. Fallback logic typically saved tasks unless the entire site was protected by advanced bot detection or had a missing element.
The snippet below (simplified) illustrates a typical fallback flow:
```
LLM suggests: click(".convert-button")
Element not found
Fallback tries: find button text "Convert"
Click successful
```
Impact of CAPTCHAs and Security Checks
In real-world logs (like `agent-10280.log`), we see the agent encountering a Cloudflare “verify you’re human” page:
- Repeated Timeouts: The agent tried to fill or click a login form, but it was hidden until the user passed CAPTCHA.
- Eventual “sendHumanMessage”: After failing, the agent told the user: “Manual intervention needed—please solve the security check.”
- Post-Intervention Recovery: Once the user completed the CAPTCHA, the agent could re-check the DOM and proceed.
This pattern underscores that single-agent or multi-agent, human help remains essential for certain forms of bot gatekeeping. Our system gracefully requests help rather than repeatedly failing or spinning in loops.
Co-Browsing Outcomes
We tested half of the tasks with co-browsing enabled:
- Slightly Higher Completion: In co-browsing mode, success improved from ~82% to ~88%.
- User-Driven Adaptation: The agent recognized user clicks or page changes. For instance, if the user manually typed a URL, the agent updated its internal state based on the new DOM.
- No Extra “sendHumanMessage”: Interestingly, the agent still favored fallback logic over immediate requests for help, unless truly blocked (e.g., captchas or authentication barriers).
User-Defined Functions (UDFs)
Tasks that leveraged specialized “mega-prompts” (e.g., `::compareOpinions("Clean Code Debacle")`) had:
- Fewer LLM Calls. The agent effectively front-loaded a structured template instead of iterating with repeated open-ended prompts.
- Higher Consistency in output format, particularly for research-based tasks.
- Less Action Drift. The agent rarely tried to open extraneous links or pivot the conversation because the function template was strongly guiding it.
Single-Agent vs. Multi-Agent Considerations
Although we did not run an explicit multi-agent baseline, we frequently tested or observed other multi-agent systems:
- Fewer Loops: Single-agent solutions with code-defined environment generally avoided the “LLM A calls LLM B with an ambiguous partial result” scenario.
- Simpler Debugging: Our logs show a single chain of reasoning and actions. In multi-agent setups, logs can be scattered across each sub-agent.
- Equivalent Performance: For everyday tasks (scraping, form-filling, searching), the single-agent approach performed at least as reliably as anecdotal multi-agent tests.
System Limitations
- Advanced Authentication: Our system can handle typical form-based auth, but serious captchas require manual user steps.
- Long Sessions / Memory Growth: Over extremely long sessions (dozens of pages), memory usage can grow. We mitigate it via truncation and ephemeral notes.
- Local LLM Nuances: Smaller or lesser-trained local models sometimes demanded more clarifications or fallback attempts, though they still achieved 75% success overall.
Key Takeaways
- High Reliability With a Single LLM: Our fallback logic, dynamic DOM extraction, and note-taking overcame most site quirks without multi-agent orchestration.
- Human-in-the-Loop Remains Vital: Complex security challenges (captchas, Cloudflare checks) and ambiguous site structures can still require user help.
- UDFs and Co-Browsing: Both features significantly enhance the system’s success rate, ease-of-use, and user trust.
Overall, these results affirm that a well-engineered single-agent approach can handle a wide variety of tasks in a cost-efficient, transparent manner—often matching or exceeding multi-agent solutions for typical, real-world scenarios.
Discussion
In this section, we interpret our empirical findings and architectural choices at a broader level—reflecting on design philosophy, implications for future agentic AI, and how our single-agent approach can evolve to more ambitious “digital individual” scenarios.
The architecture diagrams above (mind map and UML) highlight a set of cohesive modules—DOM extraction, fallback logic, progress tracking, user-defined functions, and so on—all surrounding one LLM “cognitive core.” Each module solves a deterministic or semi-deterministic problem (e.g., handling dynamic selectors, saving notes, verifying page states).
- Contrast with Multi-Agent: Rather than splitting these tasks among multiple black-box LLMs, we unify them in code. In the multi-agent world, you might see a “planner model,” a “critic model,” and a “research model” all communicating. But each new LLM introduces new avenues for confusion or “hallucinated bridging.”
- Simplified Debugging: Our logs show a clear chain of reasoning—one LLM, one main loop. The mind map ensures each sub-problem (like “extract content” vs. “interpret success patterns”) is addressable in a known module, not an extra agent.
Hence, while the architecture might appear more “monolithic,” it yields a simpler mental model for both developers and end users.
Agentic AI for Everyone: Accessibility and Cost
A major impetus for a single-agent design is accessibility—both technically and financially:
- Lower Compute & Token Costs: Subscribing to one LLM instance at a time can be more economical than orchestrating multiple concurrently. This is especially true for local models or free-tier APIs.
- Ease of Deployment: You can spin up this system on a modest VM or local machine. By contrast, multi-agent solutions often require orchestrating multiple containers or specialized hardware.
- Co-Browsing for Non-Experts: A user can watch the agent in real time and step in if something goes awry. This fosters trust and a sense of shared control—essential if we want broader adoption beyond AI researchers.
Future Directions: From Single-Agent Tool to Digital Individual
Although we focus on a single session-based browser agent, the framework naturally extends to more long-lived, identity-holding systems. For example, our earlier project notes sketch a “digital individual” concept:
- Scheduled Sleep / RAG Consolidation
  - Periodically “sleep” the agent, allowing it to reorganize memory and knowledge (like a daily cron job). Over time, this fosters a consistent identity rather than ephemeral runs.
- Sandbox + Social Media
  - Restrict all ports except for the browser and SSH, but allow the agent a social media presence. It can “post” or “comment,” bridging digital tasks with a personality.
- Circadian Agenting
  - Mimic day-night cycles, so the agent operates in “awake hours,” potentially learning from user feedback while “sleeping” to compress logs or fine-tune personal heuristics.
- Socioemotional Grounding
  - Daily check-ins with the user to maintain alignment—much like people have routines or counseling sessions. This helps avoid drift or “isolation meltdown” that can happen if an agent is left aimless for too long.
While this might sound visionary, it aligns with the principle that autonomy is a condition, not a mere function call. True “agentic” existence requires robust memory, scheduled reflection, and social feedback loops.
Single-Agent vs. Multi-Agent: Deeper Considerations
- Advantages of Single-Agent
  - Consistency: One consistent chain of thought.
  - Transparency: Debug logs are in one place.
  - Lower Overhead: Fewer calls, simpler scaling, smaller dev footprint.
  - Reduced Hallucination Surfaces: Minimizes LLM-to-LLM mismatch or confusion.
- Potential Gains from Multi-Agent
  - Specialization: Different sub-models can be fine-tuned for unique tasks.
  - Distributed Resource Management: Potentially more concurrency in certain specialized workflows.
However, as shown by our experiments and day-to-day usage examples, most real-world tasks (research, forms, data extraction) are well-served by the single-agent approach—especially if we surround the LLM with robust code scaffolding.
Ethics and Socio-Political Context
- Open Source as Political Act: By releasing this cobrowser under an open license, we aim to prevent agentic AI from being locked behind corporate or paywalled platforms. If everyday developers can run an agent locally on free-tier models, they retain autonomy and data privacy.
- User Empowerment: In co-browsing mode, the user remains “in the loop.” They can see exactly what the agent does, intervene for captchas or suspicious pages, and read comprehensive logs. This fosters accountability.
- AI vs. Humans: Our single-agent design is not about displacing humans; it’s about delegating repetitive web tasks so humans can focus on higher-level decisions. This synergy approach also ensures the user remains the ultimate authority.
Limitations and Challenges
Despite encouraging results and the vision for a “digital individual,” some real challenges remain:
- Captchas & Anti-Bot Measures: As shown in the CAPTCHA results above, some sites intentionally block automated browsing. We rely on manual user help.
- Extended Memory: Over extremely long or indefinite runs, log and note files can become massive, requiring advanced memory pruning or RAG solutions.
- Model Limitations: Even advanced single LLMs can misunderstand instructions if the prompt is ambiguous or if external sites are extremely dynamic.
- Security & Sandboxing: Giving an agent more system-level privileges or internet access introduces new security concerns. Our “digital persona” concept requires robust port restrictions and user oversight.
Toward a “People’s Agent” Ecosystem
Ultimately, we see this single-agent browser framework as more than just a developer’s convenience tool:
- A Template for Others: The mind map and UML show how to break down complex agent logic into code modules, letting one LLM handle reasoning.
- Growing Community: We envision a plugin ecosystem where users can share fallback patterns, user-defined function templates, or specialized “domain watchers.”
- Bridging AI & Human Worlds: By seamlessly mixing human co-browsing with autonomous LLM tasks, we shift from purely “hands-off automation” to collaborative intelligence.
In short, single-agent design is a pragmatic sweet spot: harnessing the power of large language models while maintaining clarity, cost-efficiency, and user trust. The potential expansions—daily memory consolidation, social media presence, and a day-night cycle—show how even a single agent can become a dynamic digital participant, hinting at future directions where AI is not just a tool, but a stable, evolving companion that “lives” in the user’s environment.
Conclusion
In this paper, we introduced a single-agent browser automation framework—a “cobrowser”—designed to accomplish multi-step web tasks using only one large language model (LLM). Our approach deliberately trades multi-agent orchestration for a robust code scaffolding that handles DOM extraction, fallback selectors, memory, and optional user-defined functions. The system can be deployed cost-effectively, runs under local or free-tier LLM APIs, and supports real-time co-browsing, making it broadly accessible to both individuals and enterprises.
Key Takeaways
- Single-Agent Efficacy: Despite the hype around multi-agent solutions, a single LLM surrounded by well-engineered code can reliably complete diverse tasks—searching, scraping, filling forms, summarizing content, and more.
- User Empowerment: Our headful mode allows co-browsing, letting users watch automation in real time and assist (e.g., solving captchas) when needed.
- Fallback & Custom Functions: We introduced fallback selectors for robust DOM interaction and user-defined “mega-prompts” to accelerate specialized tasks.
- Future Vision: This architecture naturally extends to a “digital individual” with daily memory consolidation, scheduled “sleep,” social media interactions, and broader agentic presence—a direction we see as key to truly autonomous yet grounded AI.
By focusing on “augmentation over orchestration,” this project invites developers to extend agentic tooling without incurring the complexity and resource overhead of multiple black-box LLMs. We hope that open-sourcing this cobrowser fosters a collaborative ecosystem where single-agent solutions remain powerful, transparent, and user-friendly.
References
- AutoGPT: GitHub repository for the popular multi-agent experiment.
- LangChain: A framework for LLM orchestration.
- Playwright Documentation. [Online] https://playwright.dev/docs/intro
- Ollama: Local Llama-based model hosting. [Online] https://github.com/jmorganca/ollama
- Gemini: Google’s experimental LLM (v2.0 free tier). [Online] Official documentation link placeholder.
- Single-Agent vs. Multi-Agent: A conceptual overview. Smith, A. (2025). “Agentic AI: Orchestration or Augmentation?” Journal of AI Systems, 12(3), 1–12.
Acknowledgements
We thank:
- Open Source Contributors: Community members who tested alpha versions of the fallback selector logic and user-defined functions.
- Competition Organizers: For providing a platform to share and refine our single-agent approach.
- Friends & Testers: Everyone who ran the cobrowser on bizarre websites to stress-test it.
Appendix
For more information see project files under: https://github.com/esinecan/agentic-ai-browser/blob/main/README.md