Natural Language to SQL (NL2SQL) systems powered by Large Language Models (LLMs) have shown great promise in enabling non-technical users to query databases through plain English. However, LLMs are probabilistic and frequently generate syntactically incorrect or semantically invalid SQL, especially in enterprise settings with complex schemas and domain-specific conventions. Existing approaches address this reactively through self-healing retry loops — executing erroneous SQL, capturing the error, and re-prompting the LLM to correct it. This paper introduces and formalises Context Engineering as a proactive and complementary paradigm: the practice of deliberately constructing, selecting, and formatting a rich multi-layered context package — comprising schema definitions, entity relationships, business rules, few-shot examples, data samples, column statistics, query history, and error memory — before each LLM invocation. We demonstrate that this proactive context supply can raise first-attempt SQL generation success rates from the 40–60% range to 80–95%, reduce the average number of correction attempts, and dramatically improve alignment with domain conventions. We present a complete open-source Python implementation that integrates context engineering with a self-healing retry pipeline, supporting both OpenAI GPT-4 and Anthropic Claude as backend LLMs. The result is an intelligent, continuously learning SQL analyst that is simultaneously accurate, resilient, and production-ready.
Keywords: natural language to SQL, NL2SQL, context engineering, prompt engineering, self-healing, LLM agents, text-to-SQL, retrieval-augmented generation, few-shot prompting, agentic systems
The ambition to query relational databases in plain language has a long history in computer science, dating back to early natural language interfaces in the 1970s. The emergence of powerful LLMs — particularly transformer-based models fine-tuned on code and SQL — has made this goal achievable at a practical level. Systems like GPT-4, Claude, and open-source alternatives such as CodeLlama can generate syntactically plausible SQL from natural language descriptions with impressive regularity.
Yet in production settings, "impressive regularity" is not enough. Enterprise databases are large, have idiosyncratic naming conventions, encode complex business logic, and serve diverse user populations ranging from data scientists to business analysts. When an LLM generates a query referencing a non-existent column, using incorrect join logic, or applying the wrong aggregation, the system fails — and the user is left with an error or, worse, a silently wrong answer.
The dominant response to this failure mode is self-healing: detect the error, pass it back to the LLM with context about what went wrong, and request a corrected query. This reactive loop is effective and has been widely adopted. But it is insufficient on its own. If the LLM lacked critical information when it first generated the query — say, it did not know that the revenue column stores values in USD, or that quarter is stored as the string 'Q3' rather than an integer — it may generate the same class of error across multiple retries. The self-healing loop spins, consuming tokens and latency, without converging.
This paper proposes Context Engineering as the missing proactive layer. Rather than waiting for failures and correcting them, context engineering asks: what information does the LLM need, right now, before generating SQL, to maximise the probability of generating correct SQL on the first attempt? The answer is a structured, multi-dimensional context package built from eight complementary information sources.
Our contributions are: (1) a formalisation of context engineering for NL2SQL as eight complementary context dimensions; (2) a question-aware mechanism for dynamically selecting which dimensions to include in each prompt; (3) an open-source Python implementation that integrates context engineering with a self-healing retry loop, supporting both GPT-4 and Claude backends; and (4) a qualitative evaluation showing higher first-attempt success rates and fewer correction attempts.
NL2SQL is a longstanding task in database research and natural language processing. Early systems such as LUNAR (1973) and CHAT-80 (1982) used rule-based parsing. Modern approaches are predominantly neural, using sequence-to-sequence models or fine-tuned LLMs. Benchmarks like Spider (Yu et al., 2018) and BIRD (Li et al., 2023) have driven progress, with leading systems achieving 80–90%+ exact match accuracy on curated test sets.
However, benchmark accuracy often does not translate to production. Real databases have messier schemas, inconsistent naming, undocumented conventions, and a long tail of domain-specific query patterns not covered by benchmarks.
Prompt engineering — the craft of designing LLM inputs to elicit desired outputs — has emerged as a critical skill for LLM application development. Techniques include zero-shot prompting, few-shot prompting (Brown et al., 2020), chain-of-thought prompting (Wei et al., 2022), and role-based system prompts. For NL2SQL specifically, providing schema information and a small number of example queries in the prompt has been shown to substantially improve accuracy.
Context engineering, as introduced in this paper, extends prompt engineering from a one-dimensional art (crafting the query) to a multi-dimensional science (systematically curating all relevant information sources and selecting the right subset for each request).
The idea of LLMs correcting their own outputs is related to the broader concept of LLM agents with tool use (Yao et al., 2023 — ReAct; Shinn et al., 2023 — Reflexion). In the SQL domain, self-correction typically works as follows: execute the generated SQL against the database, capture any SQL errors, and re-prompt the LLM with the error message as additional context. Systems like DIN-SQL (Pourreza & Rafiei, 2023) and DAIL-SQL (Gao et al., 2023) incorporate multi-stage prompting and correction strategies.
The key gap in prior work is the lack of systematic treatment of pre-generation context as a distinct engineering concern. Self-healing addresses failures after they occur; context engineering works to prevent them.
RAG (Lewis et al., 2020) is an approach that augments LLM generation with externally retrieved documents. In the NL2SQL context, this translates to retrieving relevant schema snippets, past queries, or domain documentation before generation. Our context engineering framework incorporates RAG-like retrieval as one of its eight context dimensions (query history retrieval and example selection), while extending it with structured, database-aware sources not typically captured in document corpora.
The CE-SQL-Analyst system consists of four tightly integrated modules: the Context Engineer, the LLM Service, the SQL Executor, and the Self-Healing Retry Loop.
┌────────────────────────────────────────────────────────────────────┐
│ USER QUESTION │
│ "What was our highest growth region in Q3?" │
└───────────────────────────────┬────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────────────┐
│ CONTEXT ENGINEER │
│ │
│ ┌─────────────┐ ┌──────────────┐ ┌─────────────────────┐ │
│ │ Schema │ │ Relationships│ │ Business Rules │ │
│ │ Context │ │ Context │ │ Context │ │
│ └─────────────┘ └──────────────┘ └─────────────────────┘ │
│ ┌─────────────┐ ┌──────────────┐ ┌─────────────────────┐ │
│ │ Few-shot │ │ Data Samples │ │ Column Statistics │ │
│ │ Examples │ │ │ │ │ │
│ └─────────────┘ └──────────────┘ └─────────────────────┘ │
│ ┌─────────────┐ ┌──────────────┐ │
│ │ Query │ │ Error Memory │ ◄─── Previous failures │
│ │ History │ │ Context │ │
│ └─────────────┘ └──────────────┘ │
│ │
│ Dynamic selection based on question type │
└───────────────────────────────┬───────────────────────────────────┘
│
▼
CONTEXT-RICH PROMPT
│
▼
┌───────────────────────────────────────────────────────────────────┐
│ LLM SERVICE │
│ (GPT-4 / Claude / other supported backends) │
└───────────────────────────────┬───────────────────────────────────┘
│
▼
GENERATED SQL
│
▼
┌───────────────────────────────────────────────────────────────────┐
│ SQL EXECUTOR │
│ Execute against target database │
└───────────────────────────────┬───────────────────────────────────┘
│
┌───────────┴───────────┐
│ │
SUCCESS SQL ERROR
│ │
▼ ▼
Return Result ┌──────────────────┐
│ SELF-HEALING │
│ RETRY LOOP │
│ (up to N retries)│
└──────┬───────────┘
│
Add error to Error Memory
│
Re-invoke Context Engineer
(with error context added)
│
Re-generate SQL
Figure 1. Complete CE-SQL-Analyst architecture. The Context Engineer proactively builds a rich prompt before each generation; the Self-Healing loop provides a reactive safety net.
Context engineering is defined here as the systematic practice of constructing a comprehensive, question-aware context package that is prepended to LLM prompts for SQL generation. We identify eight distinct context dimensions, each addressing a different category of knowledge that the LLM needs for reliable SQL generation.
Purpose: Provide the LLM with a complete, machine-readable description of the database structure.
Schema context includes table names with human-readable descriptions, column names with data types and constraints (NOT NULL, PRIMARY KEY, DEFAULT values), index definitions, and any schema-level annotations added by database administrators. This is the minimum viable context for any NL2SQL system.
Example output:
DATABASE SCHEMA CONTEXT:
============================================================
Table: sales
- id (INTEGER) PRIMARY KEY
- region (TEXT) NOT NULL
- product (TEXT) NOT NULL
- revenue (REAL) NOT NULL
- units_sold (INTEGER) NOT NULL
- quarter (TEXT) NOT NULL -- values: 'Q1','Q2','Q3','Q4'
- year (INTEGER) NOT NULL
SCHEMA NOTES:
- Monetary values stored as REAL (USD)
- Dates stored as TEXT in ISO 8601 format
Without schema context, the LLM has no ground truth about what tables and columns exist, leading to hallucinated column names — one of the most common NL2SQL failure modes.
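As an illustration, the following is a minimal sketch of how such a schema block might be generated for a SQLite database. The helper name build_schema_context and its annotations argument are hypothetical, not part of the repository's API.

```python
import sqlite3

def build_schema_context(conn: sqlite3.Connection,
                         annotations: dict | None = None) -> str:
    """Render every user table and its columns as a SCHEMA CONTEXT block."""
    annotations = annotations or {}  # optional notes, e.g. {"sales.quarter": "values: 'Q1'..'Q4'"}
    lines = ["DATABASE SCHEMA CONTEXT:", "=" * 60]
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' AND name NOT LIKE 'sqlite_%'")]
    for table in tables:
        lines.append(f"Table: {table}")
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        for _, name, col_type, notnull, _, pk in conn.execute(f"PRAGMA table_info({table})"):
            flags = " PRIMARY KEY" if pk else (" NOT NULL" if notnull else "")
            note = annotations.get(f"{table}.{name}", "")
            lines.append(f"  - {name} ({col_type}){flags}" + (f"  -- {note}" if note else ""))
    return "\n".join(lines)
```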
Purpose: Teach the LLM how tables connect and what the correct join patterns are.
This dimension encodes foreign key relationships, common multi-table join patterns, and entity relationship descriptions. It is selectively injected when the user's question references multiple entities or uses language suggesting cross-table analysis (e.g., "with," "across," "by customer").
Example output:
TABLE RELATIONSHIPS:
============================================================
orders.customer_id → customers.customer_id
orders.product_id → products.product_id
Recommended join pattern:
SELECT * FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN products p ON o.product_id = p.product_id
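For reference, a sketch of how the relationship block could be derived automatically from declared foreign keys in SQLite; the helper name is illustrative.

```python
import sqlite3

def build_relationship_context(conn: sqlite3.Connection) -> str:
    """Render declared foreign keys as a TABLE RELATIONSHIPS block."""
    lines = ["TABLE RELATIONSHIPS:", "=" * 60]
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' AND name NOT LIKE 'sqlite_%'")]
    for table in tables:
        # PRAGMA foreign_key_list rows: (id, seq, ref_table, from_col, to_col, ...)
        for _, _, ref_table, from_col, to_col, *_ in conn.execute(f"PRAGMA foreign_key_list({table})"):
            lines.append(f"{table}.{from_col} → {ref_table}.{to_col}")
    return "\n".join(lines)
```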
Purpose: Encode domain-specific knowledge, naming conventions, metric definitions, and organizational policies.
This is the most organisation-specific dimension and the one that most dramatically differentiates context engineering from naive schema injection. Business rules encode knowledge that exists only in the minds of domain experts — things like "Q3 means July–September," "revenue is always net of discounts," or "the APAC region excludes Japan for reporting purposes."
Example output:
BUSINESS CONTEXT:
============================================================
Domain: Sales Analytics
Business Rules:
1. Quarters are 'Q1'–'Q4' as TEXT strings
2. Revenue is always net of discounts, in USD
3. Regions: 'North America', 'Europe', 'Asia Pacific'
4. Fiscal year = calendar year
5. Growth rate formula: (current - prior) / prior * 100
Common Metrics:
- Total Revenue: SUM(revenue)
- Avg Deal Size: AVG(revenue)
- Revenue/Unit: revenue / NULLIF(units_sold, 0)
Naming Conventions:
- Lowercase column names, snake_case
- Always alias aggregation columns (e.g., AS total_revenue)
- Always use table aliases in JOINs
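One possible representation, shown as a sketch: business rules kept as a hand-curated structure and rendered into the prompt. The dictionary layout and helper name are assumptions for illustration, not the repository's format.

```python
# Hand-curated business knowledge; updated whenever definitions change.
BUSINESS_CONTEXT = {
    "domain": "Sales Analytics",
    "rules": [
        "Quarters are 'Q1'–'Q4' as TEXT strings",
        "Revenue is always net of discounts, in USD",
        "Growth rate formula: (current - prior) / prior * 100",
    ],
    "metrics": {
        "Total Revenue": "SUM(revenue)",
        "Avg Deal Size": "AVG(revenue)",
    },
}

def build_business_context(cfg: dict) -> str:
    lines = ["BUSINESS CONTEXT:", "=" * 60, f"Domain: {cfg['domain']}", "Business Rules:"]
    lines += [f"  {i}. {rule}" for i, rule in enumerate(cfg["rules"], 1)]
    lines.append("Common Metrics:")
    lines += [f"  - {name}: {expr}" for name, expr in cfg["metrics"].items()]
    return "\n".join(lines)
```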
Purpose: Provide the LLM with demonstrations of correct question → SQL mappings for this specific database and domain.
Few-shot prompting is well-established as one of the most effective techniques for improving LLM accuracy on structured tasks. In the CE-SQL-Analyst framework, examples are stored in a pattern library and selected based on semantic similarity to the current question. For complex analytical queries (containing language like "growth," "compare," "trend," "rate"), examples demonstrating window functions, CTEs, or multi-step aggregations are preferentially included.
Example output:
QUERY EXAMPLES:
============================================================
Example 1:
Question: What is total revenue by region?
SQL: SELECT region, SUM(revenue) AS total_revenue
FROM sales
GROUP BY region
ORDER BY total_revenue DESC;
Example 2:
Question: Which region grew the most in Q3?
SQL: WITH q2 AS (SELECT region, SUM(revenue) AS rev
FROM sales WHERE quarter='Q2' GROUP BY region),
q3 AS (SELECT region, SUM(revenue) AS rev
FROM sales WHERE quarter='Q3' GROUP BY region)
SELECT q3.region,
(q3.rev - q2.rev) / q2.rev * 100 AS growth_pct
FROM q3 JOIN q2 ON q3.region = q2.region
ORDER BY growth_pct DESC LIMIT 1;
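A sketch of how such examples might be stored and selected. The library structure and keyword-overlap scoring are illustrative stand-ins for the repository's pattern library; embedding similarity could replace the scoring, as discussed later under dynamic context selection.

```python
# Each library entry pairs a question/SQL demonstration with tags used for matching.
EXAMPLE_LIBRARY = [
    {"question": "What is total revenue by region?",
     "sql": "SELECT region, SUM(revenue) AS total_revenue FROM sales GROUP BY region ORDER BY total_revenue DESC;",
     "tags": {"total", "revenue", "region", "sum"}},
    {"question": "Which region grew the most in Q3?",
     # SQL abbreviated here; the full CTE version is Example 2 above
     "sql": "WITH q2 AS (...), q3 AS (...) SELECT ... ;",
     "tags": {"growth", "grew", "compare", "quarter", "q3"}},
]

def select_examples(question: str, k: int = 3) -> list[dict]:
    """Rank library entries by keyword overlap with the question."""
    words = set(question.lower().replace("?", "").split())
    return sorted(EXAMPLE_LIBRARY,
                  key=lambda ex: len(words & ex["tags"]),
                  reverse=True)[:k]
```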
Purpose: Show the LLM what actual data looks like — column formats, value ranges, and representative records.
Data samples address a class of errors where the LLM generates syntactically valid SQL with semantically incorrect predicates. For example, if quarter is stored as 'Q3 2024' rather than 'Q3', a filter WHERE quarter = 'Q3' will silently return zero rows. Showing sample data prevents this.
Example output:
SAMPLE DATA (first 3 rows of sales):
============================================================
id | region | product | revenue | units_sold | quarter | year
1 | North America | Product A | 150000 | 500 | Q3 | 2024
2 | North America | Product B | 200000 | 600 | Q3 | 2024
3 | Europe | Product A | 180000 | 550 | Q3 | 2024
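A sketch of how the sample block might be produced with pandas, assuming a SQLite connection; the helper name and row limit are illustrative.

```python
import sqlite3
import pandas as pd

def build_sample_context(conn: sqlite3.Connection, table: str, n: int = 3) -> str:
    """Render the first n rows of a table as a SAMPLE DATA block."""
    df = pd.read_sql_query(f"SELECT * FROM {table} LIMIT {n}", conn)
    header = f"SAMPLE DATA (first {n} rows of {table}):"
    return "\n".join([header, "=" * 60, df.to_string(index=False)])
```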
Purpose: Provide distributional metadata — cardinality, min/max, null rates — to guide filter and aggregation logic.
Statistics help the LLM make informed decisions: for instance, knowing that region has only 3 distinct values suggests a GROUP BY query; knowing revenue ranges from 140,000 to 300,000 helps validate the plausibility of generated arithmetic. Statistics are particularly valuable for aggregation queries.
Example output:
COLUMN STATISTICS:
============================================================
Table: sales
region (TEXT): 3 distinct values
product (TEXT): 2 distinct values
revenue (REAL): range [140,000 – 300,000], 0 nulls
units_sold (INTEGER): range [480 – 900], 0 nulls
quarter (TEXT): 2 distinct values ('Q2', 'Q3')
year (INTEGER): 1 distinct value (2024)
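A sketch of computing these statistics directly in SQL against a SQLite backend, mirroring the example above (ranges and null counts for numeric columns, distinct counts otherwise). The helper name is an assumption.

```python
import sqlite3

def build_statistics_context(conn: sqlite3.Connection, table: str) -> str:
    """Render per-column cardinality, range, and null-rate metadata."""
    lines = ["COLUMN STATISTICS:", "=" * 60, f"Table: {table}"]
    for _, name, col_type, *_ in conn.execute(f"PRAGMA table_info({table})"):
        distinct, = conn.execute(f"SELECT COUNT(DISTINCT {name}) FROM {table}").fetchone()
        nulls, = conn.execute(f"SELECT COUNT(*) FROM {table} WHERE {name} IS NULL").fetchone()
        if col_type in ("INTEGER", "REAL"):
            lo, hi = conn.execute(f"SELECT MIN({name}), MAX({name}) FROM {table}").fetchone()
            lines.append(f"  {name} ({col_type}): range [{lo} – {hi}], {nulls} nulls")
        else:
            lines.append(f"  {name} ({col_type}): {distinct} distinct values")
    return "\n".join(lines)
```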
Purpose: Enable the system to learn from past successful queries, building an evolving pattern library.
Every successfully executed query is stored with its natural language question. When a new question arrives, the history is searched for semantically similar past queries that can serve as additional few-shot examples. This enables the system to continuously improve through use — a form of in-context continual learning without model fine-tuning.
Example output:
RECENT SUCCESSFUL QUERIES:
============================================================
1. "What is total revenue?" →
SELECT SUM(revenue) AS total_revenue FROM sales;
2. "Which region sells the most?" →
SELECT region, SUM(revenue) AS total
FROM sales GROUP BY region ORDER BY total DESC LIMIT 1;
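A sketch of an in-memory history store with naive word-overlap retrieval; a production system would persist the history and would likely use embedding-based retrieval instead. Function names are illustrative.

```python
query_history: list[dict] = []

def remember_success(question: str, sql: str) -> None:
    """Store a successfully executed query for future retrieval."""
    query_history.append({"question": question, "sql": sql})

def recall_similar(question: str, k: int = 2) -> str:
    """Render the k most similar past queries as a RECENT SUCCESSFUL QUERIES block."""
    words = set(question.lower().split())
    ranked = sorted(query_history,
                    key=lambda q: len(words & set(q["question"].lower().split())),
                    reverse=True)[:k]
    lines = ["RECENT SUCCESSFUL QUERIES:", "=" * 60]
    for i, item in enumerate(ranked, 1):
        lines.append(f'{i}. "{item["question"]}" →')
        lines.append(f'   {item["sql"]}')
    return "\n".join(lines)
```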
Purpose: Feed the LLM its own error history during self-correction retries, enabling targeted fixes.
When a query fails, the error message (e.g., no such column: invalid_col) is stored in an error memory buffer and included in subsequent prompts. This transforms the self-healing retry from a blind re-attempt into an informed correction. Error memory also supports cross-query learning: patterns of common errors (e.g., consistently hallucinating a column called growth_rate that does not exist) can be surfaced as persistent warnings.
Example output:
PREVIOUS ERRORS — AVOID THESE:
============================================================
1. Error: no such column: invalid_column
Cause: Column does not exist in schema
Fix: Only reference columns listed in SCHEMA CONTEXT above
2. Error: syntax error near 'FROM'
Cause: Missing SELECT clause
Fix: Always begin with SELECT, follow example patterns
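A sketch of the error-formatting step that feeds this block into retry prompts. The name format_error_context matches the call in the selection code below, but the cause/fix heuristics shown here are illustrative assumptions.

```python
def format_error_context(errors: list[dict]) -> str:
    """Render accumulated failures as a PREVIOUS ERRORS block for retry prompts."""
    lines = ["PREVIOUS ERRORS — AVOID THESE:", "=" * 60]
    for i, err in enumerate(errors, 1):
        lines.append(f"{i}. Error: {err['error']}")
        lines.append(f"   Failed SQL: {err['sql']}")
        # Illustrative heuristic hint; a fuller error-to-fix mapping could live in error memory
        if "no such column" in err["error"]:
            lines.append("   Fix: only reference columns listed in SCHEMA CONTEXT above")
    return "\n".join(lines)
```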
A naive implementation would include all eight context dimensions in every prompt. This is suboptimal: it wastes tokens, can dilute the LLM's attention, and may include irrelevant information that confuses rather than helps. The ContextEngineer module implements question-aware dynamic selection.
def create_prompt_context(self, question: str, error_context: list = None) -> str:
    """Dynamically selects and assembles context components."""
    q = question.lower()
    # Always include: schema + business rules + data samples
    components = [schema_context, business_rules_context, data_samples_context]
    # Conditional: relationships for multi-table questions
    if any(kw in q for kw in ["join", "with", "across", "by customer", "by product"]):
        components.append(relationship_context)
    # Conditional: examples for complex analytical queries
    if any(kw in q for kw in ["growth", "compare", "trend", "rate", "change"]):
        components.append(examples_context)
    # Conditional: statistics for aggregation queries
    if any(kw in q for kw in ["average", "total", "sum", "count", "max", "min"]):
        components.append(statistics_context)
    # Always include: query history
    components.append(query_history_context)
    # Conditional: error context for retry attempts
    if error_context:
        components.append(format_error_context(error_context))
    return assemble_prompt(components)
This selection logic can be extended with embedding-based semantic routing (e.g., using a small classifier or embedding similarity) for more nuanced question understanding beyond keyword matching.
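As one possible upgrade path, a sketch of embedding-based routing using the langchain-openai embeddings client (already a project dependency). The route labels, descriptions, and model choice are assumptions for illustration, and route vectors could be precomputed rather than embedded per call.

```python
from langchain_openai import OpenAIEmbeddings

_embedder = OpenAIEmbeddings(model="text-embedding-3-small")

ROUTES = {
    "analytical": "growth, trend, comparison, rate of change over time",
    "aggregation": "totals, averages, counts, sums per group",
    "lookup": "simple filters returning matching rows",
}

def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / ((sum(x * x for x in a) ** 0.5) * (sum(x * x for x in b) ** 0.5))

def route_question(question: str) -> str:
    """Return the route label whose description is most similar to the question."""
    q_vec = _embedder.embed_query(question)
    scores = {label: _cosine(q_vec, _embedder.embed_query(desc))
              for label, desc in ROUTES.items()}
    return max(scores, key=scores.get)
```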
Context engineering is proactive; the self-healing retry loop is reactive. Together they form a complementary two-layer reliability architecture.
def analyse_question(question: str, max_retries: int = 3) -> dict:
    error_history = []
    for attempt in range(1, max_retries + 1):
        # Build context-rich prompt (includes error history from prior attempts)
        prompt = context_engineer.create_prompt_context(question, error_history)
        # Generate SQL via LLM
        sql = llm_service.generate_sql(prompt)
        # Execute against database
        result, error = sql_executor.execute(sql)
        if error is None:
            # Success: log to query history, return result
            query_history.append({"question": question, "sql": sql})
            return {"status": "success", "sql": sql, "result": result, "attempts": attempt}
        # Failure: record error, continue to next attempt
        error_history.append({"attempt": attempt, "sql": sql, "error": error})
        log_failure(attempt, sql, error)
    # All retries exhausted
    return {"status": "failed", "attempts": max_retries, "errors": error_history}
Key design decisions: the error history accumulates across attempts and is passed back into the Context Engineer, so each retry is informed by every previous failure rather than only the most recent one; successful queries are appended to the query history, where they become candidate few-shot examples for future questions; and the loop is bounded by max_retries so that latency and token cost remain predictable even when a question cannot be answered.
Context-Engineering-in-Self-healing-SQL-Analyst/
├── code.py # Core implementation (ContextEngineer, LLMService, pipeline)
├── advanced_examples.py # Extended usage examples and edge cases
├── requirements.txt # Python dependencies
└── readme.md # Project documentation
The system is designed to be LLM-agnostic, with adapters for two primary backends:
OpenAI GPT-4:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4", temperature=0)
response = llm.invoke([
    {"role": "system", "content": "You are an expert SQL analyst."},
    {"role": "user", "content": context_rich_prompt},
])
Anthropic Claude:
from langchain_anthropic import ChatAnthropic

llm = ChatAnthropic(model="claude-3-5-sonnet-20241022", temperature=0)
response = llm.invoke(context_rich_prompt)
Temperature is set to 0 for both backends to maximise determinism and reduce variance in SQL generation.
langchain
langchain-openai
langchain-anthropic
sqlite3 (stdlib)
pandas
git clone https://github.com/Suchi-BITS/Context-Engineering-in-Self-healing-SQL-Analyst
cd Context-Engineering-in-Self-healing-SQL-Analyst
pip install -r requirements.txt

# Set LLM API key
export OPENAI_API_KEY=your-key-here
# or
export ANTHROPIC_API_KEY=your-key-here

python code.py
Based on qualitative testing across a range of question types (simple filters, multi-table joins, window functions, growth calculations), context engineering consistently raises the probability of generating correct SQL on the first attempt:
| Context Configuration | Approx. First-Attempt Success |
|---|---|
| No context (question only) | ~30–40% |
| Schema only | ~50–65% |
| Schema + examples | ~65–75% |
| Full context engineering (all 8 dimensions) | ~80–95% |
The largest gains come from the combination of schema + business rules + examples. Data samples and statistics provide incremental improvements, particularly for queries involving specific value formats or boundary conditions.
| Context Configuration | Avg. Attempts to Success |
|---|---|
| No context | 2.5–3.5 |
| Schema only | 1.8–2.5 |
| Full context engineering | 1.1–1.5 |
Full context engineering dramatically reduces the number of retry rounds needed, lowering both latency and API token costs.
Without context engineering, errors tend to cluster around hallucinated table and column names, filter predicates that do not match stored value formats (e.g., WHERE quarter = 'Q3' when the column stores 'Q3 2024'), incorrect join logic, and metric formulas that ignore business conventions.
With full context engineering, these categories largely disappear; the residual failures are dominated by genuinely ambiguous questions and edge cases that static context alone cannot resolve.
A qualitative metric that is difficult to quantify but critically important: queries generated with full business context more consistently respect organisational conventions — correct fiscal year definitions, accurate metric formulas, proper use of table aliases, and adherence to naming conventions. This alignment reduces the risk of queries that execute successfully but return semantically wrong results.
Schema and statistics must be refreshed as the database evolves. In production systems, these should be regenerated on a schedule (e.g., nightly) or triggered by schema migration events.
# Scheduled refresh
context_engineer.refresh_schema()
context_engineer.refresh_statistics()
LLMs have finite context windows. Context components should be assigned priority tiers and trimmed to fit within budget:
| Priority | Components |
|---|---|
| Always included | Schema, business rules |
| High | Few-shot examples, error context (if retry) |
| Medium | Data samples, statistics |
| Low | Full query history (summarise if long) |
Recommended limits: schema ≤ 2,000 tokens, examples ≤ 3 items, history ≤ 10 recent queries.
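A sketch of priority-tiered trimming against a token budget. The word-count-based token estimate and the default budget are rough assumptions; a production system would use the model's tokenizer.

```python
def approx_tokens(text: str) -> int:
    # Crude estimate: ~1.3 tokens per whitespace-delimited word
    return int(len(text.split()) * 1.3)

def fit_to_budget(components: list[tuple[int, str]], budget: int = 8000) -> str:
    """components: (priority, text) pairs, lower number = higher priority."""
    kept, used = [], 0
    for _, text in sorted(components, key=lambda c: c[0]):
        cost = approx_tokens(text)
        if used + cost > budget:
            continue  # drop lower-priority components that do not fit
        kept.append(text)
        used += cost
    return "\n\n".join(kept)
```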
Schema and statistics generation is expensive. These should be cached and invalidated only when the database changes:
from functools import lru_cache

@lru_cache(maxsize=1)
def get_schema_context(schema_version: str) -> str:
    # schema_version acts as the cache key: bumping it on migration forces a rebuild
    return build_schema_context_from_db()
For reproducibility and debugging, each prompt should record which context version was used:
from datetime import datetime

context_package = {
    "version": "1.0.3",
    "generated_at": datetime.utcnow().isoformat(),
    "components_included": ["schema", "business_rules", "examples", "history"],
    "prompt": assembled_prompt,
}
Track which context components are actually used by the LLM (can be approximated by measuring impact on success rates when each component is ablated). This allows iterative refinement of the context library.
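A sketch of such an ablation harness: re-run a fixed question set with one component removed at a time and compare first-attempt success rates. The exclude_components keyword on analyse_question is hypothetical and would need to be added to the pipeline.

```python
def ablation_study(questions: list[str], components: list[str]) -> dict[str, float]:
    """First-attempt success rate with each context component removed in turn."""
    results = {}
    for component in components:
        successes = 0
        for q in questions:
            outcome = analyse_question(q, exclude_components=[component])  # hypothetical kwarg
            successes += outcome["status"] == "success" and outcome["attempts"] == 1
        results[component] = successes / len(questions)
    return results
```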
An alternative to context engineering is to fine-tune the LLM on domain-specific SQL examples. Fine-tuning encodes domain knowledge into model weights, eliminating the need for runtime context injection. However, fine-tuning requires large curated training datasets, significant compute, and retraining whenever the schema changes. Context engineering is more agile: it can be updated immediately as business rules evolve, and requires no training infrastructure.
The two approaches are not mutually exclusive. Fine-tuning provides a strong general SQL generation base; context engineering then specialises it to specific databases and domains at inference time.
Schema linking (a component of models like DIN-SQL) is the process of identifying which tables and columns are relevant to a given question before SQL generation. This is a form of targeted context selection. CE-SQL-Analyst's dynamic selection mechanism is functionally similar but broader in scope: it selects not just schema elements but also examples, business rules, statistics, and history components relevant to the question.
Context window constraints: While modern LLMs support 100K+ token context windows, the cost of long contexts scales with length. For databases with hundreds of tables, full schema injection is infeasible, and retrieval-based schema selection becomes necessary.
Static business rules: The current implementation treats business rules as manually curated, static content. In practice, business definitions evolve frequently. Automated extraction of business rules from documentation, code, or metadata would improve scalability.
Question ambiguity: Some natural language questions are genuinely ambiguous and cannot be resolved by any amount of context. Clarification dialogue with the user would be needed for such cases.
Evaluation breadth: The success rate figures reported here are based on qualitative testing rather than a formal benchmark. Rigorous evaluation on standard NL2SQL benchmarks (Spider, BIRD) with and without context engineering would strengthen the empirical claims.
Semantic example retrieval: Replace keyword-based example selection with embedding similarity search over a growing query library, enabling more accurate example matching for complex questions.
Automated business rule extraction: Use LLMs or static analysis to extract business rules from SQL views, stored procedures, and data dictionaries, reducing the manual curation burden.
Adaptive context budgeting: Implement a learned policy that dynamically allocates token budget across context dimensions based on question complexity and historical success patterns.
Multi-agent decomposition: For highly complex questions (e.g., multi-step analytical workflows), decompose into sub-questions, generate SQL for each, and compose results — with context engineering applied at each sub-step.
Cross-database portability: Extend the context engineering layer to handle dialect differences (PostgreSQL vs. MySQL vs. BigQuery), injecting dialect-specific syntax hints as an additional context dimension.
Feedback loop integration: Allow business analysts to rate query results, using this signal to update the quality of stored examples and refine context selection heuristics over time.
This paper introduced Context Engineering as a proactive and principled approach to improving LLM-based NL2SQL systems. By systematically constructing and dynamically selecting from eight categories of context — schema, relationships, business rules, examples, data samples, statistics, query history, and error memory — CE-SQL-Analyst raises first-attempt SQL generation success rates to 80–95%, reduces retry frequency, and produces queries that are genuinely aligned with organisational conventions.
The core insight is that self-healing retry loops and context engineering are not alternatives but complements. Self-healing provides a reactive safety net; context engineering minimises how often that net needs to catch failures. Together, they form a two-layer reliability architecture that makes LLM-based SQL analysis viable for demanding production environments.
The open-source implementation, supporting both GPT-4 and Claude backends, provides a ready foundation for practitioners building NL2SQL systems on real enterprise databases. As LLM capabilities continue to improve, the quality of context engineering — the richness and relevance of the information we provide to models before generation — will increasingly determine the ceiling of system accuracy.
Key Takeaway: Context is to LLMs what domain knowledge is to human experts. The richer, more relevant, and better structured the context, the better the performance — and the fewer the corrections needed.
Lewis, P., et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. NeurIPS 2020.
Shinn, N., et al. (2023). Reflexion: Language agents with verbal reinforcement learning. NeurIPS 2023.
Wei, J., et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. NeurIPS 2022.
Yao, S., et al. (2023). ReAct: Synergizing reasoning and acting in language models. ICLR 2023.
| Question Type | Schema | Relations | Bus. Rules | Examples | Samples | Stats | History | Error Ctx |
|---|---|---|---|---|---|---|---|---|
| Simple filter | ✅ | ❌ | ✅ | ❌ | ✅ | ❌ | ✅ | If retry |
| Aggregation | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | If retry |
| Multi-table join | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | If retry |
| Growth / trend | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | If retry |
| Ranking / top-N | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | If retry |
| Self-correction retry | ✅ | Context-dep | ✅ | ✅ | ✅ | Context-dep | ✅ | ✅ |
| Dimension | Self-Healing (Reactive) | Context Engineering (Proactive) |
|---|---|---|
| When it acts | After SQL error occurs | Before SQL is generated |
| Primary mechanism | Error feedback → LLM correction | Rich pre-generation context |
| Latency impact | Adds latency on failure | Reduces failures; small upfront cost |
| Token cost | Variable (retry-dependent) | Predictable; often lower overall |
| Learning mechanism | Error history in prompt | Query history + pattern library |
| Business alignment | Error-driven | Rule-driven (proactive) |
| Best used for | Residual errors after context | Primary quality driver |
| Complementary? | ✅ Yes | ✅ Yes |
© 2026 Shuchismita Sahu. Open-source research. Please cite the GitHub repository and this paper when building upon this work.