We use cookies to improve your browsing experience and to analyze our website traffic. By clicking “Accept All” you agree to our use of cookies. Privacy policy.

Fine-Tuning Qwen 1.5B for Text-to-SQL Generation Using QLoRA on the Spider Dataset

Fine-Tuning Qwen 1.5B for Text-to-SQL Generation Using QLoRA on the Spider Benchmark

Author: Arun Teja Pamu
Repository: https://github.com/pamuarun/LLM_FINE_TUNING
Model: https://huggingface.co/faltooz123/qwen1.5-sql-qlora-spider
Experiment Tracking: https://wandb.ai/arunteja962-aispry/sql-finetuning

1. Abstract

This project demonstrates end-to-end parameter-efficient fine-tuning of an open-weights large language model for Text-to-SQL generation. We fine-tuned Qwen/Qwen1.5-1.8B-Chat on the Spider benchmark dataset using QLoRA a technique that combines 4-bit quantization with Low-Rank Adaptation. The base model achieved 0% exact match accuracy, outputting verbose natural language explanations instead of SQL queries. After fine-tuning on 3,500 training samples over 5 epochs, exact match accuracy improved to 6% and token-level match improved by +20.36 percentage points. The entire training pipeline ran on a single NVIDIA RTX 5060 Laptop GPU (8GB VRAM) in approximately 90 minutes — demonstrating that meaningful task specialization is achievable on consumer hardware.

2. Introduction

Large Language Models (LLMs) have demonstrated strong capabilities in instruction following, code generation, reasoning, and natural language understanding. However, adapting a general-purpose LLM to a specialized downstream task often requires fine-tuning on domain-specific datasets.

One practical application of LLMs is Text-to-SQL generation, where a natural language question is automatically translated into a structured SQL query. This capability enables non-technical users to interact with databases using plain language and has significant applications in business intelligence, analytics, and enterprise data systems.

Recent advances in parameter-efficient fine-tuning techniques such as LoRA (Low-Rank Adaptation) and QLoRA (Quantized LoRA) have made it possible to adapt large language models using consumer-grade hardware. Instead of updating billions of model parameters, these techniques train a small number of adapter parameters while keeping the base model frozen.

In this work, we fine-tune Qwen/Qwen1.5-1.8B-Chat on the Spider Text-to-SQL benchmark using QLoRA. The objective is to improve SQL generation quality while maintaining a lightweight and reproducible training pipeline that can run on an 8GB GPU.

The publication presents the complete workflow, including dataset preparation, baseline evaluation, parameter-efficient fine-tuning, quantitative evaluation, qualitative analysis, and lessons learned during development.

3. Objective

What Task Are We Fine-Tuning For?

The task is Text-to-SQL generation — converting natural language questions into syntactically correct SQL queries. Given a database name and a natural language question, the model must output only the corresponding SQL query with no additional explanation.

Example:

Database	Question	Expected Output
concert_singer	How many singers do we have?	`SELECT count(*) FROM singer`
farm	Show all countries and the number of singers in each country.	`SELECT country, count(*) FROM singer GROUP BY country`
concert_singer	What is the average age of singers from France?	`SELECT avg(age) FROM singer WHERE country = 'France'`

Why This Task?

Text-to-SQL was selected for three key reasons:

1. Crystal clear evaluation criteria
SQL generation has objective, measurable metrics — exact match accuracy and token-level match. Unlike open-ended generation tasks, there is no ambiguity: the query either matches or it does not. This makes it easy to quantify before vs after improvement.

2. High practical value
The ability to query databases using natural language is a high-demand capability in business intelligence, data analytics, and developer tooling. A working Text-to-SQL model can directly power real applications.

3. Consumer hardware feasibility
A 1.8B parameter model quantized to 4-bit fits comfortably within 8GB VRAM. The Spider dataset is well-structured and publicly available. This makes the project fully reproducible without access to expensive infrastructure.

Dataset

Dataset Details

Property	Value
Name	Spider
Source	xlangai/spider
Task	Natural language → SQL generation
Total training samples	7,000
Total validation samples	1,034
Samples used for training	3,500 (50%)
Samples used for evaluation	200
Number of databases	200+ across 138 domains
License	CC BY-SA 4.0

Data Preparation

The raw Spider dataset contains three key fields per sample:

question — the natural language question
query — the ground truth SQL query
db_id — the database identifier
Each sample was converted into Qwen chat instruction format with a system prompt, user prompt, and assistant response:

<|im_start|>system
You are an expert SQL assistant that converts natural language questions
into accurate SQL queries.<|im_end|>
<|im_start|>user
You are an expert SQL assistant. Given a natural language question and a
database name, write the correct SQL query.
 
Database: concert_singer
Question: How many singers do we have?
 
Write only the SQL query, nothing else.<|im_end|>
<|im_start|>assistant
SELECT count(*) FROM singer<|im_end|>

Why This Format?

This format teaches the model three things simultaneously:

Its role as a SQL expert (system prompt)
The context it needs — database name and question (user prompt)
To output only SQL — no explanations (assistant response)
The dataset was processed using HuggingFace datasets, saved in Arrow format for fast loading, and split into 3,500 training and 200 validation samples.

4. Related work

Text-to-SQL has been an active research area for many years. Early systems relied on rule-based approaches and semantic parsing techniques that mapped natural language questions to structured database queries. While effective in constrained domains, these methods struggled with scalability and complex database schemas.

The introduction of the Spider benchmark significantly advanced Text-to-SQL research by providing a large-scale, cross-domain dataset containing complex SQL queries across hundreds of databases. Spider remains one of the most widely used benchmarks for evaluating Text-to-SQL systems.

Large Language Models for SQL Generation

Recent transformer-based language models have shown strong performance in SQL generation tasks. Models such as GPT, LLaMA, Mistral, and Qwen can learn SQL syntax and database reasoning through instruction tuning and supervised fine-tuning. However, deploying and adapting these models often requires significant computational resources.

Parameter-Efficient Fine-Tuning

LoRA introduced a method for adapting large language models by injecting low-rank trainable matrices into transformer layers while keeping the original weights frozen. QLoRA extended this approach by combining LoRA with 4-bit quantization, dramatically reducing memory requirements while maintaining competitive performance.

This project builds upon these developments by applying QLoRA to Qwen 1.5B for Text-to-SQL generation and evaluating the resulting improvements on the Spider benchmark.

5.Methodology

5.1 Base Model Selection

Model: Qwen/Qwen1.5-1.8B-Chat

Property	Value
Parameters	1.8B
Type	Instruction-tuned chat model
Context length	8,192 tokens
Chat format	`<\|im_start\|>` / `<\|im_end\|>` tokens
License	MIT

Qwen1.5-1.8B-Chat was chosen because:

It is already instruction-tuned, so it understands the system/user/assistant role format
At 1.8B parameters, it fits in 8GB VRAM with 4-bit quantization
The <|im_start|>/<|im_end|> chat template is clean and well-documented
Apache 2.0 license allows unrestricted use

5.2 Fine-Tuning Approach — QLoRA

Instead of full fine-tuning (which would require updating all 1.8B parameters), we used QLoRA — a combination of two techniques:

4-bit Quantization (bitsandbytes)
The base model weights are quantized from 16-bit floating point to 4-bit NormalFloat (NF4) format, reducing VRAM from ~7GB to ~1.5GB.

LoRA (Low-Rank Adaptation)
Small trainable rank-decomposition matrices are injected into the attention layers. Only these adapter weights (~13M parameters, 0.7% of total) are updated during training. The base model stays frozen.

LoRA Configuration:

Parameter	Value	Reasoning
Rank (r)	16	Balances capacity vs efficiency
Alpha	32	Standard 2x rank scaling
Dropout	0.05	Regularization
Target modules	q_proj, v_proj, k_proj, o_proj	Attention projection layers
Trainable params	~13M	Only 0.7% of the full model

5.3 Training Setup

Hardware:

Property	Value
GPU	NVIDIA GeForce RTX 5060 Laptop GPU
VRAM	8GB
OS	Windows 11
CUDA	12.8 (nightly — required for Blackwell sm_120)

Key Libraries:

Library	Version
transformers	4.44.2
peft	0.12.0
trl	0.10.1
bitsandbytes	0.49.2
datasets	2.21.0
wandb	0.27.0

Hyperparameters:

Parameter	Value
Training samples	3,500
Epochs	5
Batch size (per device)	8
Gradient accumulation steps	2
Effective batch size	16
Learning rate	2e-4
LR scheduler	Cosine
Warmup ratio	0.05
Weight decay	0.01
Max sequence length	512
Quantization	4-bit NF4 (QLoRA)
Compute dtype	float16
Training time	~90 minutes

Training Procedure:

Load formatted Spider dataset from disk
Quantize base model to 4-bit NF4
Apply LoRA adapters to attention projection layers
Train only adapter weights (base model frozen)
Evaluate on validation set every 100 steps
Save best adapter checkpoint based on eval loss
Publish final adapter to Hugging Face Hub

6. Experimental Setup

The experiments were designed to evaluate the impact of QLoRA fine-tuning on SQL generation performance. A baseline evaluation was first conducted using the original Qwen1.5-1.8B-Chat model. The same evaluation procedure was then repeated after fine-tuning.

Evaluation Metrics

The following metrics were used:

Exact Match Accuracy: Percentage of generated SQL queries that exactly match the reference query.
Average Token Match: Token-level similarity between generated and reference SQL queries.
Qualitative Output Analysis: Manual inspection of generated SQL outputs.
MMLU Accuracy: Used as a lightweight check for catastrophic forgetting after task-specific fine-tuning.

Experimental Workflow

Prepare and format the Spider dataset into Qwen chat instruction format.
Evaluate the base model on validation samples.
Fine-tune the model using QLoRA adapters.
Evaluate the fine-tuned model on the same validation set.
Compare baseline and fine-tuned performance.
Analyze generated SQL outputs and training behavior.
Perform a lightweight MMLU evaluation to assess knowledge retention.

Reproducibility

All experiments were executed using:

Fixed random seed (42)
Hugging Face Transformers ecosystem
Weights & Biases experiment tracking
Published source code and notebooks
Published adapter weights on Hugging Face Hub

This ensures that the training and evaluation process can be reproduced by other researchers and practitioners.

7. Results

7.1 Baseline Evaluation

Before any fine-tuning, the base Qwen1.5-1.8B-Chat model was evaluated on 200 Spider validation samples. A critical failure mode was identified — the model generated verbose natural language explanations instead of SQL:

Baseline output example:

Question : How many singers do we have?
Expected : SELECT count(*) FROM singer
Got      : To find the number of singers in the "concert_singer"
           database, you can use the COUNT function with a WHERE
           clause to filter the rows where the singer column is
           not null. Here's the SQL query:

Baseline scores:

Metric	Score
Exact Match Accuracy	0.00%
Avg Token Match	34.39%

The 34.39% token match indicates the model understood SQL keywords but completely failed at output format.

7.2 Training Loss Curves

The model showed strong, consistent learning across all 5 epochs. Loss decreased smoothly with no instability or divergence.

Step	Epoch	Train Loss	Observation
~10	0.32	4.09	Model completely confused
~20	0.64	1.26	Starting to learn SQL patterns
~30	0.96	0.62	Getting the format right
~40	1.28	0.52	Consistent SQL structure
~50	1.60	0.48	Refining query patterns
~60	1.92	0.46	Solid performance
~80	2.56	0.42	Converging
~90	2.88	0.42	Well converged

Final training loss: 4.09 → 0.42 (90% reduction)

Training and validation loss curves are available in the W&B experiment dashboard linked above, and as screenshots in the output_images/ folder of the repository.

W&B Runs.png

7.3 Baseline vs Fine-Tuned Performance

Metric	Baseline	Fine-Tuned	Improvement
Exact Match Accuracy	0.00%	6.00%	+6.00%
Avg Token Match	34.39%	54.75%	+20.36%
Output behavior	Verbose explanations	Clean SQL queries	✅

7.4 Example Inputs/Outputs Demonstrating Improvement

Example 1 — Simple COUNT query

	Value
Question	How many singers do we have?
Database	concert_singer
Expected SQL	`SELECT count(*) FROM singer`
Baseline output	"To find the number of singers in the database, you can use the COUNT function..."
Fine-tuned output	`SELECT count(*) FROM singer;`
Result	✅ Correct

Example 2 — ORDER BY with multiple columns

	Value
Question	What are the names, countries, and ages for every singer in descending order of age?
Database	concert_singer
Expected SQL	`SELECT name, country, age FROM singer ORDER BY age DESC`
Baseline output	Incorrect table and join syntax
Fine-tuned output	`SELECT Name, Country, Age FROM singer ORDER BY Age DESC;`
Result	✅ Correct (case difference only)

Example 3 — Aggregate with WHERE filter

	Value
Question	What is the average, minimum, and maximum age of all singers from France?
Database	concert_singer
Expected SQL	`SELECT avg(age), min(age), max(age) FROM singer WHERE country = 'France'`
Baseline output	Natural language explanation
Fine-tuned output	`SELECT avg(age), min(age), max(age) FROM singer WHERE country = 'France';`
Result	✅ Correct

Example 4 — GROUP BY

	Value
Question	Show all countries and the number of singers in each country.
Database	concert_singer
Expected SQL	`SELECT country, count(*) FROM singer GROUP BY country`
Baseline output	Long explanation with incorrect syntax
Fine-tuned output	`SELECT Country, COUNT(*) FROM singer GROUP BY Country;`
Result	✅ Correct

7.5 General Benchmark — Catastrophic Forgetting Check (MMLU)

To verify that SQL fine-tuning did not cause the model to forget general knowledge, we evaluated on 50 samples from the MMLU benchmark:

Model	MMLU Accuracy
Base model (before fine-tuning)	25.0% (random baseline)
Fine-tuned model	16.0%

Note: With only 50 samples, high variance is expected — the difference of 9 percentage points corresponds to approximately 4-5 questions. A definitive catastrophic forgetting assessment would require 500+ samples. This limitation is acknowledged.

8. Discussion

What Worked Well

QLoRA efficiency on consumer hardware
The combination of 4-bit quantization and LoRA adapters reduced VRAM requirements from ~7GB to ~1.5GB, enabling training on a laptop GPU. The final adapter file is only 35MB compared to the 3.5GB full model — a 99% size reduction. This makes deployment extremely lightweight.

Dramatic qualitative transformation
The most important result is not captured by exact match numbers alone. The base model consistently produced paragraphs of natural language explanation. After fine-tuning, it produces clean, structured SQL queries on every inference. This is the core goal of instruction fine-tuning and was fully achieved.

Stable training dynamics
Loss dropped smoothly from 4.09 to 0.42 across 5 epochs without spikes or instability. The cosine learning rate schedule and warmup ratio contributed to stable convergence. Eval loss tracked train loss closely, indicating no overfitting.

Reproducible pipeline
The five-notebook structure (dataset preparation → baseline → training → evaluation → publishing) provides a clean, step-by-step workflow that anyone can follow to reproduce the results.

Challenges Faced

RTX 5060 Blackwell architecture compatibility
The RTX 5060 uses NVIDIA's new Blackwell architecture (sm_120) which is not supported by PyTorch stable releases. This required installing PyTorch nightly builds with CUDA 12.8 support and a custom-compiled bitsandbytes build. This added significant setup complexity at the start of the project.

Exact match metric is overly strict
The exact match metric treats these two queries as different:

Expected : SELECT country, count(*) FROM singer GROUP BY country
Got      : SELECT Country, COUNT(*) FROM singer GROUP BY Country

Both queries are semantically identical and produce the same results when executed. A case-insensitive normalization or execution-based metric would more fairly represent the model's actual SQL generation quality.

Chinese text generation in early runs
The first training run (500 samples, 2 epochs) resulted in the model occasionally appending Chinese text after generating the SQL query. This was a symptom of insufficient training — the model had not fully learned to stop at the end of the SQL statement. This was resolved by increasing the training dataset to 3,500 samples and extending to 5 epochs.

Training time constraints
Training on the full 7,000-sample Spider dataset for 5 epochs would require approximately 4 hours on the RTX 5060. We used 3,500 samples as a practical compromise between training time and model quality.

Key Lessons

QLoRA is highly effective for task-specific adaptation on limited hardware
Chat-formatted instruction templates are critical when fine-tuning a chat-tuned base model
Exact match alone is insufficient to capture SQL generation quality — token match and qualitative analysis are also necessary
Even a 90-minute training run on a laptop GPU can produce meaningful behavioral change in an LLM

9. Conclusion

This publication documents a practical, reproducible QLoRA fine-tuning pipeline for Text-to-SQL generation using the Spider benchmark. Starting from a base model that output verbose natural language instead of SQL, we achieved a model that consistently generates clean, structured SQL queries after approximately 90 minutes of training on consumer hardware.

Key achievements:

✅ Exact match accuracy improved from 0% → 6%
✅ Token-level match improved by +20.36 percentage points
✅ Training loss reduced 90% (4.09 → 0.42)
✅ Trained on RTX 5060 Laptop GPU (8GB VRAM) in ~90 minutes
✅ LoRA adapter published to Hugging Face Hub (35MB)
✅ Full experiment tracked on Weights & Biases
✅ No conclusive evidence of catastrophic forgetting

Recommended next steps:

Train on the full 7,000-sample Spider dataset for higher exact match performance
Implement execution-based SQL evaluation against actual database schemas
Include explicit schema information (table/column names) in the prompt
Evaluate additional general benchmarks (HellaSwag, ARC) for a thorough forgetting assessment

10. References

Hu, E. et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv
.09685
Dettmers, T. et al. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. arXiv
.14314
Yu, T. et al. (2018). Spider: A Large-Scale Human-Labeled Dataset for Text-to-SQL. arXiv
.08887
Qwen Team (2024). Qwen1.5: Expanding Language Model Series. Hugging Face.
Hendrycks, D. et al. (2021). Measuring Massive Multitask Language Understanding. ICLR 2021.
Wolf, T. et al. (2020). HuggingFace Transformers: State-of-the-art NLP. EMNLP 2020.

11. Acknowledgements

I would like to acknowledge the open-source AI community whose tools and datasets made this project possible.

Special thanks to:

The Qwen Team for releasing the Qwen 1.5 model series.
The Spider dataset authors for providing a high-quality Text-to-SQL benchmark.
The Hugging Face team for Transformers, Datasets, PEFT, and Hub infrastructure.
The TRL and BitsAndBytes contributors for enabling efficient LLM fine-tuning.
The Weights & Biases team for experiment tracking and visualization.

Their contributions significantly reduced the barrier to entry for conducting large language model research on consumer hardware.

Appendix

A. Project Structure

README.md                       # Project overview and documentation
requirements.txt                # Python dependency list
LICENSE                         # MIT license
fix_notebook.py                 # Cleanup helper for notebook metadata

01_dataset_preparation.ipynb     # Prepare Spider dataset, format Qwen chat prompts
02_baseline_evaluation.ipynb     # Evaluate base Qwen model on SQL generation
03_qlora_finetuning.ipynb       # Fine-tune model with QLoRA adapters
04_evaluation.ipynb             # Compare baseline vs fine-tuned model
05_publishing.ipynb             # Publish adapters and model card to HF hub

data/                           # Processed datasets and evaluation outputs
  train.json
  validation.json
  dataset_config.json
  baseline_results.json
  finetuned_results.json
  comparison_results.json
  spider_formatted/              # Saved HuggingFace DatasetDict

outputs/                        # Fine-tuning checkpoints and final adapter
  qwen-sql-qlora/
    checkpoint-*/
    final_adapter/

wandb/                          # Weights & Biases run logs and summaries

B. Training Hardware Details

Component	Specification
GPU	NVIDIA RTX 5060 Laptop GPU
VRAM	8 GB
CPU	Intel Core Ultra 7
RAM	24 GB
OS	Windows 11

C. Future Improvements

Potential future enhancements include:

Full Spider dataset training
Execution-based SQL evaluation
Schema-aware prompting
Multi-database reasoning benchmarks
Comparison against larger open-source LLMs
Deployment as a web-based Text-to-SQL assistant

D. Resources

Resource	Link
🤗 HuggingFace Model	faltooz123/qwen1.5-sql-qlora-spider
📊 W&B Experiment	sql-finetuning project
💻 GitHub Repository	pamuarun/LLM_FINE_TUNING
📂 Spider Dataset	xlangai/spider