Large Language Models (LLMs) are increasingly used to generate SQL queries for analytical tasks, but their accuracy and reliability in practical applications remain a key concern. This paper evaluates the performance of agentic SQL AI systems, focusing on their effectiveness in generating correct, optimized, and executable SQL queries across various analytical scenarios. We benchmark different implementations using real-world financial datasets and analyze an LLM's performance in answering simple and complex business questions. Our findings highlight the strengths and limitations of current LLM-driven SQL analytics, offering insights into improving their reliability for production environments and their ability to accurately answer non-technical users' business questions.
The rise of AI-driven analytics has led to increased reliance on LLMs for generating SQL queries, automating data exploration, and enhancing decision-making processes. However, ensuring the correctness, efficiency, and adaptability of these models in real-world analytical environments remains a challenge. If LLMs cannot generate reliable, accurate SQL queries, they must not be used as agents answering non-technical business users' questions through natural language.
This study evaluates the practical performance of agentic SQL AI systems, analyzing their accuracy in generating SQL queries and their adaptability to complex data structures. By systematically testing different LLM-based approaches, we aim to provide a clear understanding of their effectiveness and limitations in modern data ecosystems and their ability to be integrated into critical business settings.
Our evaluation framework includes a structured assessment of LLM-generated SQL queries across multiple datasets and query types. While there are a number of key performance metrics we might evaluate (e.g., syntactic correctness, execution success rate, query optimization), this study focuses solely on the accuracy of retrieved results. If results are accurate, the time saved by circumventing a human programmer will (1) mitigate the effect of non-ideal syntax, as humans will not review every query, (2) overcome imperfect execution success rates, as LLMs will retry the database query until it succeeds, and (3) offset most sub-optimal query plans.
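As a minimal sketch of this accuracy-only criterion (the function name and comparison routine are ours for illustration, not the project's actual evaluation harness), the snippet below executes the known-correct query and the agent-generated query against the same SQLite database and compares the retrieved rows, ignoring row order and column aliases:

```python
import sqlite3
from collections import Counter

def results_match(db_path: str, golden_sql: str, agent_sql: str) -> bool:
    """Return True when both queries retrieve the same multiset of rows.

    Row order and column aliases are ignored; only the retrieved values count,
    matching the accuracy-only criterion used in this study.
    """
    with sqlite3.connect(db_path) as conn:
        golden_rows = conn.execute(golden_sql).fetchall()
        agent_rows = conn.execute(agent_sql).fetchall()
    return Counter(golden_rows) == Counter(agent_rows)
```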
The methodology follows these steps:
This data pipeline uses Python and SQLite.
This project uses an established data pipeline, data model, and data governance so that the evaluation closely mirrors how SQL-agentic AIs would be deployed in real-world business applications.
This project ingests data from the FDIC (Federal Deposit Insurance Corporation) and the NCUA (National Credit Union Administration) into a SQLite database. The data pipeline uses the FDIC API and NCUA custom query data files.
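A minimal sketch of the ingestion step is shown below. It assumes the public FDIC BankFind API; the exact endpoint, parameters, and response shape are illustrative rather than taken from the project code, and the raw records are staged into SQLite with pandas.

```python
import requests
import pandas as pd
import sqlite3

FDIC_URL = "https://banks.data.fdic.gov/api/institutions"  # illustrative endpoint

def ingest_fdic_institutions(db_path: str, limit: int = 10_000) -> None:
    """Pull institution records from the FDIC API into a raw staging table."""
    resp = requests.get(FDIC_URL, params={"limit": limit, "format": "json"}, timeout=60)
    resp.raise_for_status()
    records = [row["data"] for row in resp.json()["data"]]  # assumed response shape
    df = pd.DataFrame(records)
    with sqlite3.connect(db_path) as conn:
        # Raw ("bronze") layer: loaded as-is, transformed downstream.
        df.to_sql("raw_fdic_institutions", conn, if_exists="replace", index=False)
```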
Pipelines are created to facilitate further testing of LLM accuracy under different data freshness and potentially evolving schemas. Additionally, the database uses a medallion data architecture in which data is loaded into the database and then transformed into downstream normalized tables. The LLM is only given access to the finalized tables. This matches real-world business data warehousing and how businesses might implement LLMs: using proper data governance practices to limit their access to relevant, finalized data tables.
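As a rough sketch of this downstream transformation, the snippet below builds one finalized `dw_` table from a raw staging table. The target table and column names come from the data model used later in this paper; the raw table and its column names are assumptions for illustration, not the project's actual schema.

```python
import sqlite3

def build_dw_profiles(db_path: str) -> None:
    """Transform raw staged records into a finalized dw_ table exposed to the LLM."""
    with sqlite3.connect(db_path) as conn:
        conn.executescript(
            """
            DROP TABLE IF EXISTS dw_financial_institution_profiles;
            CREATE TABLE dw_financial_institution_profiles AS
            SELECT
                CERT     AS inst_id,           -- raw column names are assumed
                NAME     AS institution_name,
                DATEUPDT AS last_update
            FROM raw_fdic_institutions;
            """
        )
```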
The diagram provides a conceptual organization of the data flow and data lineage.
Below is the data model for the transformed data from the data pipeline. Tables in the database are prefixed with "dw" to represent data warehouse. In a traditional database system, these tables might be stored in a different schema or an entirely different database.
Note that all data presented here is publicly available and contains no PII. The selected LLM, GPT-3.5, is only granted access to the finalized tables in our database. Thus, the data pipeline can be adjusted to ingest other data without making that data accessible, and potentially exposed, via the LLM. This access is specified by the `include_tables` keyword below.
```python
from langchain_community.utilities import SQLDatabase
import constants as c  # project constants module (assumed name) holding the DB path and table names

# Expose only the finalized dw_ tables to the LLM.
db = SQLDatabase.from_uri(
    f"sqlite+pysqlite:///{c.DB_NAME}",
    include_tables=[c.DB_DW_TBL_PROFILES, c.DB_DW_TBL_MONEY],
)
```
Note: real-world implementations of LLM SQL agents would limit each agent's individual access to ensure the LLM does not share confidential data with business users.
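For context, a hedged sketch of how such an agent can be wired up with LangChain and GPT-3.5 follows; the exact construction used in this project may differ, and argument names can vary by LangChain version.

```python
from langchain_openai import ChatOpenAI
from langchain_community.agent_toolkits import create_sql_agent

# GPT-3.5 as the backing model; temperature 0 keeps SQL generation deterministic.
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

# The agent can only see the tables exposed through include_tables above.
agent = create_sql_agent(llm, db=db, agent_type="openai-tools", verbose=True)

result = agent.invoke({"input": "What was the last update for Fleet Bank of Maine?"})
print(result["output"])
```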
The evaluation dataset is generated programmatically: four types of queries across two dimensions. A Python script iterates over the database and generates:
Natural Language vs. SQL Queries: Queries are generated as both (1) natural language, to imitate a business user, and (2) SQL queries from a templated "correct" solution. The templated SQL query serves as the answer key for evaluating the SQL agent's correctness; we refer to it as the "Golden SQL" standard.
Simple vs. Complex SQL: Generated queries are either (1) a simple query retrieving the last update date or (2) a more complex query requiring a join and a max aggregation.
Reference the table below for samples of these two query types; a sketch of the templating logic follows the table. Fleet Bank of Maine is a unique FDIC institution in the database, used here to generate both the natural-language queries and the Golden SQL answers.
| Difficulty | Natural Language | SQL Query |
|---|---|---|
| Simple | What was the last update for Fleet Bank of Maine? | `SELECT last_update FROM dw_financial_institution_profiles AS p WHERE p.institution_name = 'Fleet Bank of Maine'` |
| Complex | What is the maximum total assets Fleet Bank of Maine ever had? | `SELECT max(total_assets) FROM dw_financial_institution_money AS m LEFT JOIN dw_financial_institution_profiles AS p ON p.inst_id = m.inst_id WHERE p.institution_name = 'Fleet Bank of Maine'` |
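The sketch below illustrates how such (natural language, Golden SQL) pairs can be templated per institution. The templates mirror the samples in the table above; the project's actual generation script may phrase them differently.

```python
def make_query_pairs(institution_name: str) -> dict:
    """Generate the four query variants for one institution (illustrative templates)."""
    return {
        "simple_nl": f"What was the last update for {institution_name}?",
        "simple_sql": (
            "SELECT last_update FROM dw_financial_institution_profiles AS p "
            f"WHERE p.institution_name = '{institution_name}'"
        ),
        "complex_nl": f"What is the maximum total assets {institution_name} ever had?",
        "complex_sql": (
            "SELECT max(total_assets) FROM dw_financial_institution_money AS m "
            "LEFT JOIN dw_financial_institution_profiles AS p ON p.inst_id = m.inst_id "
            f"WHERE p.institution_name = '{institution_name}'"
        ),
    }

# Example: the pairs for the institution shown in the table above.
pairs = make_query_pairs("Fleet Bank of Maine")
```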
A total of 16,000 queries were programmatically generated.
4,000 queries of each query type were selected as a balance of:
Our results reveal key insights into LLM-driven SQL analytics accuracy. LLMs perform well on straightforward queries (98% accuracy) but struggle with complex joins and aggregations (24% accuracy).
Below is a contingency table showing the LLM's accuracy across difficulties:
Human review of the SQL agent's output queries showed that common errors include incorrect column references, failure to match natural-language queries to existing values in the database, and difficulty handling edge cases.
While LLMs offer promising capabilities for SQL query generation, their accuracy and reliability remain inconsistent for complex analytical tasks. This study highlights the need for further advancements in model training, prompt engineering, and integration with metadata-driven query optimization. Future research should explore hybrid approaches combining LLMs with rule-based systems to enhance robustness and ensure practical usability in enterprise data environments.
For business use cases, it is critical that humans implement proper data governance to ensure that sensitive information is not inadvertently exposed to business users. Additionally, data governance likely plays a key role in improving accuracy, as LLMs often experience a decrease in precision when handling an increasing number of tables and schema complexities. Organizations must establish clear guidelines on query execution and limit LLM access to only necessary datasets to mitigate security risks.
LLMs exhibit strong performance in answering simple analytical questions with high accuracy. However, they should be explicitly prompted to either refuse to answer or provide a cautionary warning when generating more complex queries that require intricate joins, aggregations, or subqueries. This approach ensures that business users do not unknowingly rely on incorrect or inefficient query results and instead seek human oversight when needed. Human oversight can also be used as a feedback loop to improve the LLM's future accuracy.
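One way to operationalize this recommendation is to prepend a cautionary instruction to the agent's system prompt, as in the hedged sketch below; the wording is illustrative, and the `prefix` argument follows LangChain's `create_sql_agent` and may vary by framework or version.

```python
# Illustrative cautionary instruction to prepend to the SQL agent's system prompt.
CAUTION_PREFIX = (
    "You translate business questions into SQL against the data warehouse. "
    "If answering requires joins, aggregations, or subqueries, either decline "
    "to answer or clearly warn the user that the result may be incorrect and "
    "should be verified by an analyst before it is relied upon."
)

# Assumed mechanism: LangChain's SQL agent accepts a `prefix` for its prompt preamble.
agent = create_sql_agent(llm, db=db, agent_type="openai-tools", prefix=CAUTION_PREFIX)
```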
Looking ahead, integrating LLMs with database metadata and adaptive query refinement techniques can further improve their reliability. By leveraging feedback loops, organizations can iteratively enhance SQL generation accuracy while maintaining strict data security and governance policies. Future work should also investigate combining LLMs with rule-based query optimization strategies to enhance their effectiveness in enterprise environments.
Future studies should expand on our findings by exploring additional dimensions of LLM performance in SQL generation. Some key areas of further research include:
Testing with Schema Drift: Investigating how LLMs handle changes in database schemas, such as the addition of new columns, even when these columns are irrelevant to the query. Understanding how schema drift impacts SQL generation accuracy can help refine model adaptability. The same pipeline can be adapted to incorporate additional columns/real-world data provided by the FDIC and NCUA.
Evaluation with Newer Models: This study used GPT-3.5 due to limitations at the time. Testing with newer models, such as GPT-4 and beyond, can help determine whether advancements in LLMs lead to significant improvements in SQL accuracy.
Assessing the Impact of Different Data Access Levels: Simulating scenarios with and without real data governance can reveal how unrestricted access affects LLM performance, security risks, and overall SQL accuracy.
Handling More Complex Queries: While this study evaluated medium-difficulty queries, future research should examine the performance of LLMs in generating and optimizing highly complex queries that involve four or more table joins, intricate filtering conditions, and multi-level aggregations. This is important for understanding to what degree SQL agents can accurately answer business users' questions.
By addressing these areas, researchers and practitioners can gain deeper insights into the evolving capabilities of LLMs in SQL generation and identify best practices for deploying AI-driven analytics in enterprise environments.