Team Gemini: Oracle Forge, Week 2 of 2
Two weeks ago I wrote a post about what we were planning to build and what we expected to get wrong. This is the follow-up: what we actually built, where the plan held, and where reality pushed back.
If you missed the first post: Oracle Forge is a multi-database AI data agent built to compete on the UC Berkeley DataAgentBench (DAB) benchmark. 54 queries. 12 datasets. 4 database types in the same session. The kind of workload that exposes every assumption a demo-friendly agent makes about data being clean.
Enterprise data is not a single database with a tidy schema. It is PostgreSQL for transactions, MongoDB for customer records, SQLite for reference data, DuckDB for analytics, each system with its own conventions, its own ID formats, and its own definition of what a "customer" is.
Most LLM-based agents fail on this not because the model can't reason, but because:

- entity IDs don't resolve across databases,
- query syntax differs by dialect,
- business terms aren't defined anywhere in the schema, and
- outputs drift from the expected contract.

These are engineering problems. Oracle Forge was designed to solve them at the system level, not the prompt level.
The system has four layers that work in sequence on every query.
Before the agent touches a single database, it loads three context layers:
AGENT.md is the persistent global context: the tool catalogue, join key format rules across all 12 DAB datasets, SQL/MongoDB/DuckDB dialect reference, and execution constraints. It lives in the agent directory and is always present.
Domain KB (kb/domain/dab_*.md) is one document per dataset covering schema, field types, authoritative tables, domain term definitions, and known query patterns. Written by Intelligence Officers, updated as the team discovers new patterns. This is the layer that encodes what the database means, not just what it contains.
Corrections log (kb/corrections/corrections_log.md) is a running structured log of every observed failure:
[query that failed] → [what was wrong] → [correct approach]
Injected as context before every session. Updated automatically by utils/autodream.py from evaluation run traces. The agent gets smarter not because the model improves, but because the context it has access to improves.
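The three layers above can be sketched as a simple context assembly step. This is an illustrative sketch, not the project's actual loader: the file layout (AGENT.md, kb/domain/dab_*.md, kb/corrections/corrections_log.md) is from the post, while the function name and concatenation format are assumptions.

```python
from pathlib import Path

def build_session_context(agent_dir: str, dataset: str) -> str:
    """Concatenate the three context layers for one dataset session."""
    root = Path(agent_dir)
    layers = [
        root / "AGENT.md",                                    # global rules, always present
        root / "kb" / "domain" / f"dab_{dataset}.md",         # per-dataset Domain KB
        root / "kb" / "corrections" / "corrections_log.md",   # running failure log
    ]
    parts = []
    for path in layers:
        if path.exists():  # tolerate a layer that has not been written yet
            parts.append(f"## {path.name}\n{path.read_text()}")
    return "\n\n".join(parts)
```

The ordering matters: global rules first, dataset specifics second, corrections last, so the most recent lessons sit closest to the question.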
The Conductor (agent/conductor.py, built on LangGraph) is the orchestration brain. It takes a natural language question, decomposes it into per-database sub-tasks, dispatches them to specialist sub-agents in parallel, and merges results with cross-database entity resolution applied before returning to the user.
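The decompose / fan-out / merge shape of the Conductor can be sketched with plain asyncio. This is a deliberately simplified stand-in, not the LangGraph implementation: sub-task decomposition and entity resolution are stubbed, and all names here are illustrative.

```python
import asyncio

async def run_subtask(db: str, subquery: str) -> dict:
    # Placeholder for a real sub-agent call against one database.
    await asyncio.sleep(0)
    return {"db": db, "rows": [f"result of {subquery!r} on {db}"]}

async def conduct(question: str, plan: dict) -> list:
    # `plan` maps database name -> sub-query (the decomposition step,
    # which the real conductor derives from the natural language question).
    tasks = [run_subtask(db, q) for db, q in plan.items()]
    results = await asyncio.gather(*tasks)                 # dispatch in parallel
    # Naive merge; the real system applies cross-database entity
    # resolution here before returning to the user.
    return [row for r in results for row in r["rows"]]
```

The key design point survives even in the sketch: each database's sub-task runs concurrently, and merging is a distinct step where cross-system ID normalisation has a single, well-defined home.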
Four specialist sub-agents handle their respective systems: one each for PostgreSQL, MongoDB, SQLite, and DuckDB.
A custom MCP server (mcp/mcp_server.py) exposes 29 database tools across all four DB types through a unified REST interface at port 5000. Every sub-agent calls the same POST /v1/tools/{tool_name} endpoint regardless of database type. This separates agent logic from database connection management entirely: adding a new dataset means updating tools.yaml, not touching agent code.
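A sub-agent's tool call might look like the sketch below. The URL shape (POST /v1/tools/{tool_name} on port 5000) is from the post; the JSON payload fields and tool names are assumptions.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"

def tool_url(tool_name: str, base_url: str = BASE_URL) -> str:
    """Build the unified endpoint URL for any tool, any database type."""
    return f"{base_url}/v1/tools/{tool_name}"

def call_tool(tool_name: str, arguments: dict, base_url: str = BASE_URL) -> dict:
    """POST the tool arguments and return the server's JSON response."""
    body = json.dumps({"arguments": arguments}).encode()
    req = urllib.request.Request(
        tool_url(tool_name, base_url),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:  # raises on non-2xx status
        return json.load(resp)

# Hypothetical usage; every sub-agent makes the same shape of call:
# call_tool("postgres_query", {"sql": "SELECT count(*) FROM orders"})
```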
This is the decision we debated longest in our Inception phase and the one we're most confident held up.
The naive self-correction loop is: query fails → modify prompt slightly → retry. This works for simple cases. It fails on exactly the queries DAB is designed to test, because the correct recovery action depends entirely on why the query failed.
We defined four named failure classes:
| Class | What it means | Recovery |
|---|---|---|
| JoinKeyMismatch | Entity IDs don't resolve across databases | Apply normalisation function from Domain KB |
| ContractViolation | Output fails schema validation | Re-query with corrected constraints |
| DialectError | Wrong query syntax for target DB | Rewrite in correct dialect |
| DomainKnowledgeGap | Business term undefined in schema | Query Domain KB, then retry |
The agent classifies before it retries. The recovery action is determined by the class, not by a prompt-level guess.
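Classify-before-retry can be sketched as follows. The post doesn't show the actual classification logic, so the error-message patterns below are assumptions about what each class might look like at the surface level; the recovery table mirrors the one above.

```python
import re

# Assumed surface signatures for three of the four classes.
FAILURE_PATTERNS = {
    "DialectError":       re.compile(r"syntax error|unrecognized token", re.I),
    "ContractViolation":  re.compile(r"schema validation failed|missing field", re.I),
    "DomainKnowledgeGap": re.compile(r"unknown term|undefined column", re.I),
}

RECOVERY = {
    "JoinKeyMismatch":    "apply normalisation function from Domain KB",
    "ContractViolation":  "re-query with corrected constraints",
    "DialectError":       "rewrite in correct dialect",
    "DomainKnowledgeGap": "query Domain KB, then retry",
}

def classify_failure(error_message: str, rows_returned: int) -> str:
    for cls, pattern in FAILURE_PATTERNS.items():
        if pattern.search(error_message):
            return cls
    if rows_returned == 0:
        # Zero rows with no explicit error is the join-key signature:
        # the query ran fine, but IDs did not resolve across databases.
        return "JoinKeyMismatch"
    return "Unknown"
```

The retry loop then dispatches on the class name, never on a prompt-level guess about what went wrong.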
We ran the harness (eval/harness.py) as an external Sentinel: it imports DAB's authoritative validate.py per query, calls validate(ground_truth, llm_answer), and produces a score the agent cannot influence. Every tool call is traced. Every failure is diagnosable without re-running.
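The scoring loop has roughly this shape. The real harness imports DAB's authoritative validate.py; here `validate` is stubbed with a trivial equality check so the sketch runs standalone, and `run_harness` and its trace format are illustrative names, not the project's actual API.

```python
def validate(ground_truth, llm_answer) -> bool:
    # Stand-in for DAB's validate.py; the real check is more involved.
    return ground_truth == llm_answer

def run_harness(cases, agent_fn):
    """Score agent_fn over (query, ground_truth) pairs; trace every case."""
    passed = 0
    traces = []
    for query, ground_truth in cases:
        answer = agent_fn(query)
        ok = validate(ground_truth, answer)  # the agent cannot influence this
        traces.append({"query": query, "answer": answer, "pass": ok})
        passed += ok
    return passed / len(cases), traces
```

Keeping validation outside the agent is what makes the pass rate trustworthy: the same function that scores the benchmark scores every intermediate run.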
The score progression tells the real story of the project:
| Run | Date | Pass Rate | What happened |
|---|---|---|---|
| True baseline | Apr 11 | 1.85% (1/54) | Structural failure: an invalid model ID string caused every LLM call to return 400 |
| Post-fix (yelp) | Apr 13 | 66.7% (2/3) | One model ID fix + one KB update to join key format |
| Multi-dataset | Apr 13 | 33.3% (1/3) | agnews/bookreview Domain KB still incomplete |
The jump from 1.85% to 66.7% on yelp after a single KB update is the clearest evidence in this project that the architecture works as intended. The improvement came entirely from correcting the join key pattern in dab_yelp.md, not from any change to the model, the conductor, or the query logic.
The specific fix: MongoDB yelp business IDs use a businessid_ prefix. DuckDB yelp review data uses businessref_. The agent was querying for exact matches that didn't exist. One KB entry documenting the prefix transformation, and yelp queries went from returning zero rows to passing at 66.7%.
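The transformation the KB entry encodes is tiny: strip each system's prefix down to the shared core ID before joining. The two prefixes are from the post; the function name is illustrative.

```python
# Prefixes observed in the yelp dataset: MongoDB business records vs
# DuckDB review rows referring to the same underlying entity.
PREFIXES = ("businessid_", "businessref_")

def normalise_business_id(raw_id: str) -> str:
    """Strip a known system prefix so IDs compare across databases."""
    for prefix in PREFIXES:
        if raw_id.startswith(prefix):
            return raw_id[len(prefix):]
    return raw_id  # already bare, or an unknown format: pass through
```

Applied on both sides of the join, "businessid_abc123" from MongoDB and "businessref_abc123" from DuckDB resolve to the same entity instead of matching zero rows.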
Looking back at where we spent most of our debugging time:
Cross-database join key mismatches were the highest-impact failure mode by far. One prefix discrepancy propagates into zero rows returned, which propagates into N/A from the synthesiser, which scores as a failure. The fix in every case was a KB update, not a model change. We expect analogous mismatches in other datasets as we complete the full 54-query run.
MCP server process lifecycle was the operational blocker that caused the most lost time. The server needs to be running before any eval run, and it doesn't restart automatically after configuration changes. A small infra issue that costs disproportionate time under deadline pressure.
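A pre-flight check would have recovered most of that lost time: fail fast if nothing is listening before an eval run starts. Port 5000 is from the post; the function itself is an illustrative sketch, not part of the project.

```python
import socket

def server_is_up(host: str = "localhost", port: int = 5000,
                 timeout: float = 1.0) -> bool:
    """Return True if something accepts a TCP connection on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # connection refused or timed out
        return False
```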
crmarenapro domain complexity is the benchmark's hardest dataset: 13 queries requiring stage classification extracted from unstructured activity content. The Domain KB document for this dataset requires the most detailed term definitions of any dataset we've worked with, and it's still not complete.
The silent wrong answer remains the failure mode we handle least well. A query that fails loudly, with a 400 error or a database exception, is easy to catch and classify. A query that succeeds and returns a structurally valid result that happens to be semantically wrong is much harder. ContractViolation catches some of these, but not all. It's the open problem we'd spend more time on in a third week.
The framing we started with โ "context engineering, not model prompting" โ held up. Every material improvement we made was a change to what the agent knew before it ran, not how it ran.
The 38% ceiling on DataAgentBench for frontier models is not a reasoning limitation. It is a context limitation. Models capable of writing sophisticated SQL still fail when they don't know the join key format changed between systems. That is a solvable problem. It just requires treating the knowledge layer as a first-class engineering artifact rather than an afterthought.
That's what Oracle Forge is: a demonstration that the gap between a demo that works and an agent that works in production is primarily an engineering discipline problem, and that discipline is now teachable.
Repo: github.com/Deregit2025/data-agent-forge
Benchmark: github.com/ucbepic/DataAgentBench
Built by: Dereje Derib · Eyoel Nebiyu · Chalie Lijalem · Liul Teshome · Nuhamin Alemayehu · Rafia Kedir
#DataEngineering #AIAgents #ContextEngineering #LLM #DataAgentBench #BuildingInPublic