Team Gemini: Oracle Forge, Week 2 of 2
Two weeks ago I wrote a post about what we were planning to build and what we expected to get wrong. This is the follow-up: what we actually built, where the plan held, and where reality pushed back.
If you missed the first post: Oracle Forge is a multi-database AI data agent built to compete on the UC Berkeley DataAgentBench (DAB) benchmark. 54 queries. 12 datasets. 4 database types in the same session. The kind of workload that exposes every assumption a demo-friendly agent makes about data being clean.
Enterprise data is not a single database with a tidy schema. It is PostgreSQL for transactions, MongoDB for customer records, SQLite for reference data, DuckDB for analytics, each system with its own conventions, its own ID formats, and its own definition of what a "customer" is.
Most LLM-based agents fail on this not because the model can't reason, but because:

- entity IDs don't resolve across databases,
- query syntax differs by dialect,
- business terms aren't defined anywhere in the schema, and
- outputs drift from the expected contract.

These are engineering problems. Oracle Forge was designed to solve them at the system level, not the prompt level.
The system has four layers that work in sequence on every query.
Before the agent touches a single database, it loads three context layers:
AGENT.md is the persistent global context: the tool catalogue, join key format rules across all 12 DAB datasets, SQL/MongoDB/DuckDB dialect reference, and execution constraints. It lives in the agent directory and is always present.
Domain KB (kb/domain/dab_*.md) is one document per dataset covering schema, field types, authoritative tables, domain term definitions, and known query patterns. Written by Intelligence Officers, updated as the team discovers new patterns. This is the layer that encodes what the database means, not just what it contains.
Corrections log (kb/corrections/corrections_log.md) is a running structured log of every observed failure:
[query that failed] → [what was wrong] → [correct approach]
Injected as context before every session. Updated automatically by utils/autodream.py from evaluation run traces. The agent gets smarter not because the model improves, but because the context it has access to improves.
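The three layers above can be sketched as a simple context assembly step. This is an illustrative sketch, not the project's actual loader: the file layout (AGENT.md, kb/domain/dab_*.md, kb/corrections/corrections_log.md) is from the post, while the function name and concatenation format are assumptions.

```python
from pathlib import Path

def build_session_context(agent_dir: str, dataset: str) -> str:
    """Concatenate the three context layers for one dataset session."""
    root = Path(agent_dir)
    layers = [
        root / "AGENT.md",                                    # global rules, always present
        root / "kb" / "domain" / f"dab_{dataset}.md",         # per-dataset Domain KB
        root / "kb" / "corrections" / "corrections_log.md",   # running failure log
    ]
    parts = []
    for path in layers:
        if path.exists():  # tolerate a layer that has not been written yet
            parts.append(f"## {path.name}\n{path.read_text()}")
    return "\n\n".join(parts)
```

The ordering matters: global rules first, dataset specifics second, corrections last, so the most recent lessons sit closest to the question.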
The Conductor (agent/conductor.py, built on LangGraph) is the orchestration brain. It takes a natural language question, decomposes it into per-database sub-tasks, dispatches them to specialist sub-agents in parallel, and merges results with cross-database entity resolution applied before returning to the user.
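The decompose / fan-out / merge shape of the Conductor can be sketched with plain asyncio. This is a deliberately simplified stand-in, not the LangGraph implementation: sub-task decomposition and entity resolution are stubbed, and all names here are illustrative.

```python
import asyncio

async def run_subtask(db: str, subquery: str) -> dict:
    # Placeholder for a real sub-agent call against one database.
    await asyncio.sleep(0)
    return {"db": db, "rows": [f"result of {subquery!r} on {db}"]}

async def conduct(question: str, plan: dict) -> list:
    # `plan` maps database name -> sub-query (the decomposition step,
    # which the real conductor derives from the natural language question).
    tasks = [run_subtask(db, q) for db, q in plan.items()]
    results = await asyncio.gather(*tasks)                 # dispatch in parallel
    # Naive merge; the real system applies cross-database entity
    # resolution here before returning to the user.
    return [row for r in results for row in r["rows"]]
```

The key design point survives even in the sketch: each database's sub-task runs concurrently, and merging is a distinct step where cross-system ID normalisation has a single, well-defined home.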
Four specialist sub-agents handle their respective systems: one each for PostgreSQL, MongoDB, SQLite, and DuckDB.
A custom MCP server (mcp/mcp_server.py) exposes 29 database tools across all four DB types through a unified REST interface at port 5000. Every sub-agent calls the same POST /v1/tools/{tool_name} endpoint regardless of database type. This separates agent logic from database connection management entirely: adding a new dataset means updating tools.yaml, not touching agent code.
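A sub-agent's tool call might look like the sketch below. The URL shape (POST /v1/tools/{tool_name} on port 5000) is from the post; the JSON payload fields and tool names are assumptions.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"

def tool_url(tool_name: str, base_url: str = BASE_URL) -> str:
    """Build the unified endpoint URL for any tool, any database type."""
    return f"{base_url}/v1/tools/{tool_name}"

def call_tool(tool_name: str, arguments: dict, base_url: str = BASE_URL) -> dict:
    """POST the tool arguments and return the server's JSON response."""
    body = json.dumps({"arguments": arguments}).encode()
    req = urllib.request.Request(
        tool_url(tool_name, base_url),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:  # raises on non-2xx status
        return json.load(resp)

# Hypothetical usage; every sub-agent makes the same shape of call:
# call_tool("postgres_query", {"sql": "SELECT count(*) FROM orders"})
```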
This is the decision we debated longest in our Inception phase and the one we're most confident held up.
The naive self-correction loop is: query fails → modify prompt slightly → retry. This works for simple cases. It fails on exactly the queries DAB is designed to test, because the correct recovery action depends entirely on why the query failed.
We defined four named failure classes:
| Class | What it means | Recovery |
|---|---|---|
| JoinKeyMismatch | Entity IDs don't resolve across databases | Apply normalisation function from Domain KB |
| ContractViolation | Output fails schema validation | Re-query with corrected constraints |
| DialectError | Wrong query syntax for target DB | Rewrite in correct dialect |
| DomainKnowledgeGap | Business term undefined in schema | Query Domain KB, then retry |
The agent classifies before it retries. The recovery action is determined by the class, not by a prompt-level guess.
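Classify-before-retry can be sketched as follows. The post doesn't show the actual classification logic, so the error-message patterns below are assumptions about what each class might look like at the surface level; the recovery table mirrors the one above.

```python
import re

# Assumed surface signatures for three of the four classes.
FAILURE_PATTERNS = {
    "DialectError":       re.compile(r"syntax error|unrecognized token", re.I),
    "ContractViolation":  re.compile(r"schema validation failed|missing field", re.I),
    "DomainKnowledgeGap": re.compile(r"unknown term|undefined column", re.I),
}

RECOVERY = {
    "JoinKeyMismatch":    "apply normalisation function from Domain KB",
    "ContractViolation":  "re-query with corrected constraints",
    "DialectError":       "rewrite in correct dialect",
    "DomainKnowledgeGap": "query Domain KB, then retry",
}

def classify_failure(error_message: str, rows_returned: int) -> str:
    for cls, pattern in FAILURE_PATTERNS.items():
        if pattern.search(error_message):
            return cls
    if rows_returned == 0:
        # Zero rows with no explicit error is the join-key signature:
        # the query ran fine, but IDs did not resolve across databases.
        return "JoinKeyMismatch"
    return "Unknown"
```

The retry loop then dispatches on the class name, never on a prompt-level guess about what went wrong.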
We ran the harness (eval/harness.py) as an external Sentinel: it imports DAB's authoritative validate.py per query, calls validate(ground_truth, llm_answer), and produces a score the agent cannot influence. Every tool call is traced. Every failure is diagnosable without re-running.
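The scoring loop has roughly this shape. The real harness imports DAB's authoritative validate.py; here `validate` is stubbed with a trivial equality check so the sketch runs standalone, and `run_harness` and its trace format are illustrative names, not the project's actual API.

```python
def validate(ground_truth, llm_answer) -> bool:
    # Stand-in for DAB's validate.py; the real check is more involved.
    return ground_truth == llm_answer

def run_harness(cases, agent_fn):
    """Score agent_fn over (query, ground_truth) pairs; trace every case."""
    passed = 0
    traces = []
    for query, ground_truth in cases:
        answer = agent_fn(query)
        ok = validate(ground_truth, answer)  # the agent cannot influence this
        traces.append({"query": query, "answer": answer, "pass": ok})
        passed += ok
    return passed / len(cases), traces
```

Keeping validation outside the agent is what makes the pass rate trustworthy: the same function that scores the benchmark scores every intermediate run.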
The score progression tells the real story of the project:
| Run | Date | Pass Rate | What happened |
|---|---|---|---|
| True baseline | Apr 11 | 1.85% (1/54) | Structural failure: an invalid model ID string caused every LLM call to return 400 |
| Post-fix (yelp) | Apr 13 | 66.7% (2/3) | One model ID fix + one KB update to join key format |
| Multi-dataset | Apr 13 | 33.3% (1/3) | agnews/bookreview Domain KB still incomplete |
The jump from 1.85% to 66.7% on yelp after a single KB update is the clearest evidence in this project that the architecture works as intended. The improvement came entirely from correcting the join key pattern in dab_yelp.md, not from any change to the model, the conductor, or the query logic.
The specific fix: MongoDB yelp business IDs use a businessid_ prefix. DuckDB yelp review data uses businessref_. The agent was querying for exact matches that didn't exist. One KB entry documenting the prefix transformation, and yelp queries went from returning zero rows to passing at 66.7%.
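The transformation the KB entry encodes is tiny: strip each system's prefix down to the shared core ID before joining. The two prefixes are from the post; the function name is illustrative.

```python
# Prefixes observed in the yelp dataset: MongoDB business records vs
# DuckDB review rows referring to the same underlying entity.
PREFIXES = ("businessid_", "businessref_")

def normalise_business_id(raw_id: str) -> str:
    """Strip a known system prefix so IDs compare across databases."""
    for prefix in PREFIXES:
        if raw_id.startswith(prefix):
            return raw_id[len(prefix):]
    return raw_id  # already bare, or an unknown format: pass through
```

Applied on both sides of the join, "businessid_abc123" from MongoDB and "businessref_abc123" from DuckDB resolve to the same entity instead of matching zero rows.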
Looking back at where we spent most of our debugging time:
Cross-database join key mismatches were the highest-impact failure mode by far. One prefix discrepancy propagates into zero rows returned, which propagates into N/A from the synthesiser, which scores as a failure. The fix in every case was a KB update, not a model change. We expect analogous mismatches in other datasets as we complete the full 54-query run.
MCP server process lifecycle was the operational blocker that caused the most lost time. The server needs to be running before any eval run, and it doesn't restart automatically after configuration changes. A small infra issue that costs disproportionate time under deadline pressure.
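A pre-flight check would have recovered most of that lost time: fail fast if nothing is listening before an eval run starts. Port 5000 is from the post; the function itself is an illustrative sketch, not part of the project.

```python
import socket

def server_is_up(host: str = "localhost", port: int = 5000,
                 timeout: float = 1.0) -> bool:
    """Return True if something accepts a TCP connection on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # connection refused or timed out
        return False
```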
crmarenapro domain complexity is the benchmark's hardest dataset: 13 queries requiring stage classification extracted from unstructured activity content. The Domain KB document for this dataset requires the most detailed term definitions of any dataset we've worked with, and it's still not complete.
The silent wrong answer remains the failure mode we handle least well. A query that fails loudly, with a 400 error or a database exception, is easy to catch and classify. A query that succeeds and returns a structurally valid result that happens to be semantically wrong is much harder. ContractViolation catches some of these, but not all. It's the open problem we'd spend more time on in a third week.
The framing we started with โ "context engineering, not model prompting" โ held up. Every material improvement we made was a change to what the agent knew before it ran, not how it ran.
The 38% ceiling on DataAgentBench for frontier models is not a reasoning limitation. It is a context limitation. Models capable of writing sophisticated SQL still fail when they don't know the join key format changed between systems. That is a solvable problem. It just requires treating the knowledge layer as a first-class engineering artifact rather than an afterthought.
That's what Oracle Forge is: a demonstration that the gap between a demo that works and an agent that works in production is primarily an engineering discipline problem, and that discipline is now teachable.
Repo: github.com/Deregit2025/data-agent-forge
Benchmark: github.com/ucbepic/DataAgentBench
Built by: Dereje Derib · Eyoel Nebiyu · Chalie Lijalem · Liul Teshome · Nuhamin Alemayehu · Rafia Kedir
#DataEngineering #AIAgents #ContextEngineering #LLM #DataAgentBench #BuildingInPublic