Author: Arun Teja Pamu
Repository: https://github.com/pamuarun/LLM_FINE_TUNING
Model: https://huggingface.co/faltooz123/qwen1.5-sql-qlora-spider
Experiment Tracking: https://wandb.ai/arunteja962-aispry/sql-finetuning

This project demonstrates end-to-end parameter-efficient fine-tuning of an open-weights large language model for Text-to-SQL generation. We fine-tuned Qwen/Qwen1.5-1.8B-Chat on the Spider benchmark dataset using QLoRA a technique that combines 4-bit quantization with Low-Rank Adaptation. The base model achieved 0% exact match accuracy, outputting verbose natural language explanations instead of SQL queries. After fine-tuning on 3,500 training samples over 5 epochs, exact match accuracy improved to 6% and token-level match improved by +20.36 percentage points. The entire training pipeline ran on a single NVIDIA RTX 5060 Laptop GPU (8GB VRAM) in approximately 90 minutes — demonstrating that meaningful task specialization is achievable on consumer hardware.
Large Language Models (LLMs) have demonstrated strong capabilities in instruction following, code generation, reasoning, and natural language understanding. However, adapting a general-purpose LLM to a specialized downstream task often requires fine-tuning on domain-specific datasets.
One practical application of LLMs is Text-to-SQL generation, where a natural language question is automatically translated into a structured SQL query. This capability enables non-technical users to interact with databases using plain language and has significant applications in business intelligence, analytics, and enterprise data systems.
Recent advances in parameter-efficient fine-tuning techniques such as LoRA (Low-Rank Adaptation) and QLoRA (Quantized LoRA) have made it possible to adapt large language models using consumer-grade hardware. Instead of updating billions of model parameters, these techniques train a small number of adapter parameters while keeping the base model frozen.
In this work, we fine-tune Qwen/Qwen1.5-1.8B-Chat on the Spider Text-to-SQL benchmark using QLoRA. The objective is to improve SQL generation quality while maintaining a lightweight and reproducible training pipeline that can run on an 8GB GPU.
The publication presents the complete workflow, including dataset preparation, baseline evaluation, parameter-efficient fine-tuning, quantitative evaluation, qualitative analysis, and lessons learned during development.
The task is Text-to-SQL generation — converting natural language questions into syntactically correct SQL queries. Given a database name and a natural language question, the model must output only the corresponding SQL query with no additional explanation.
Example:
| Database | Question | Expected Output |
|---|---|---|
| concert_singer | How many singers do we have? | SELECT count(*) FROM singer |
| farm | Show all countries and the number of singers in each country. | SELECT country, count(*) FROM singer GROUP BY country |
| concert_singer | What is the average age of singers from France? | SELECT avg(age) FROM singer WHERE country = 'France' |
Text-to-SQL was selected for three key reasons:
1. Crystal clear evaluation criteria
SQL generation has objective, measurable metrics — exact match accuracy and token-level match. Unlike open-ended generation tasks, there is no ambiguity: the query either matches or it does not. This makes it easy to quantify before vs after improvement.
2. High practical value
The ability to query databases using natural language is a high-demand capability in business intelligence, data analytics, and developer tooling. A working Text-to-SQL model can directly power real applications.
3. Consumer hardware feasibility
A 1.8B parameter model quantized to 4-bit fits comfortably within 8GB VRAM. The Spider dataset is well-structured and publicly available. This makes the project fully reproducible without access to expensive infrastructure.
| Property | Value |
|---|---|
| Name | Spider |
| Source | xlangai/spider |
| Task | Natural language → SQL generation |
| Total training samples | 7,000 |
| Total validation samples | 1,034 |
| Samples used for training | 3,500 (50%) |
| Samples used for evaluation | 200 |
| Number of databases | 200+ across 138 domains |
| License | CC BY-SA 4.0 |
The raw Spider dataset contains three key fields per sample:
question — the natural language questionquery — the ground truth SQL querydb_id — the database identifier<|im_start|>system
You are an expert SQL assistant that converts natural language questions
into accurate SQL queries.<|im_end|>
<|im_start|>user
You are an expert SQL assistant. Given a natural language question and a
database name, write the correct SQL query.
Database: concert_singer
Question: How many singers do we have?
Write only the SQL query, nothing else.<|im_end|>
<|im_start|>assistant
SELECT count(*) FROM singer<|im_end|>
This format teaches the model three things simultaneously:
datasets, saved in Arrow format for fast loading, and split into 3,500 training and 200 validation samples.Text-to-SQL has been an active research area for many years. Early systems relied on rule-based approaches and semantic parsing techniques that mapped natural language questions to structured database queries. While effective in constrained domains, these methods struggled with scalability and complex database schemas.
The introduction of the Spider benchmark significantly advanced Text-to-SQL research by providing a large-scale, cross-domain dataset containing complex SQL queries across hundreds of databases. Spider remains one of the most widely used benchmarks for evaluating Text-to-SQL systems.
Large Language Models for SQL Generation
Recent transformer-based language models have shown strong performance in SQL generation tasks. Models such as GPT, LLaMA, Mistral, and Qwen can learn SQL syntax and database reasoning through instruction tuning and supervised fine-tuning. However, deploying and adapting these models often requires significant computational resources.
Parameter-Efficient Fine-Tuning
LoRA introduced a method for adapting large language models by injecting low-rank trainable matrices into transformer layers while keeping the original weights frozen. QLoRA extended this approach by combining LoRA with 4-bit quantization, dramatically reducing memory requirements while maintaining competitive performance.
This project builds upon these developments by applying QLoRA to Qwen 1.5B for Text-to-SQL generation and evaluating the resulting improvements on the Spider benchmark.
Model: Qwen/Qwen1.5-1.8B-Chat
| Property | Value |
|---|---|
| Parameters | 1.8B |
| Type | Instruction-tuned chat model |
| Context length | 8,192 tokens |
| Chat format | <|im_start|> / <|im_end|> tokens |
| License | MIT |
Qwen1.5-1.8B-Chat was chosen because:
<|im_start|>/<|im_end|> chat template is clean and well-documentedInstead of full fine-tuning (which would require updating all 1.8B parameters), we used QLoRA — a combination of two techniques:
4-bit Quantization (bitsandbytes)
The base model weights are quantized from 16-bit floating point to 4-bit NormalFloat (NF4) format, reducing VRAM from ~7GB to ~1.5GB.
LoRA (Low-Rank Adaptation)
Small trainable rank-decomposition matrices are injected into the attention layers. Only these adapter weights (~13M parameters, 0.7% of total) are updated during training. The base model stays frozen.
LoRA Configuration:
| Parameter | Value | Reasoning |
|---|---|---|
| Rank (r) | 16 | Balances capacity vs efficiency |
| Alpha | 32 | Standard 2x rank scaling |
| Dropout | 0.05 | Regularization |
| Target modules | q_proj, v_proj, k_proj, o_proj | Attention projection layers |
| Trainable params | ~13M | Only 0.7% of the full model |
Hardware:
| Property | Value |
|---|---|
| GPU | NVIDIA GeForce RTX 5060 Laptop GPU |
| VRAM | 8GB |
| OS | Windows 11 |
| CUDA | 12.8 (nightly — required for Blackwell sm_120) |
Key Libraries:
| Library | Version |
|---|---|
| transformers | 4.44.2 |
| peft | 0.12.0 |
| trl | 0.10.1 |
| bitsandbytes | 0.49.2 |
| datasets | 2.21.0 |
| wandb | 0.27.0 |
Hyperparameters:
| Parameter | Value |
|---|---|
| Training samples | 3,500 |
| Epochs | 5 |
| Batch size (per device) | 8 |
| Gradient accumulation steps | 2 |
| Effective batch size | 16 |
| Learning rate | 2e-4 |
| LR scheduler | Cosine |
| Warmup ratio | 0.05 |
| Weight decay | 0.01 |
| Max sequence length | 512 |
| Quantization | 4-bit NF4 (QLoRA) |
| Compute dtype | float16 |
| Training time | ~90 minutes |
Training Procedure:
The experiments were designed to evaluate the impact of QLoRA fine-tuning on SQL generation performance. A baseline evaluation was first conducted using the original Qwen1.5-1.8B-Chat model. The same evaluation procedure was then repeated after fine-tuning.
The following metrics were used:
All experiments were executed using:
This ensures that the training and evaluation process can be reproduced by other researchers and practitioners.
Before any fine-tuning, the base Qwen1.5-1.8B-Chat model was evaluated on 200 Spider validation samples. A critical failure mode was identified — the model generated verbose natural language explanations instead of SQL:
Baseline output example:
Question : How many singers do we have?
Expected : SELECT count(*) FROM singer
Got : To find the number of singers in the "concert_singer"
database, you can use the COUNT function with a WHERE
clause to filter the rows where the singer column is
not null. Here's the SQL query:
Baseline scores:
| Metric | Score |
|---|---|
| Exact Match Accuracy | 0.00% |
| Avg Token Match | 34.39% |
The 34.39% token match indicates the model understood SQL keywords but completely failed at output format.
The model showed strong, consistent learning across all 5 epochs. Loss decreased smoothly with no instability or divergence.
| Step | Epoch | Train Loss | Observation |
|---|---|---|---|
| ~10 | 0.32 | 4.09 | Model completely confused |
| ~20 | 0.64 | 1.26 | Starting to learn SQL patterns |
| ~30 | 0.96 | 0.62 | Getting the format right |
| ~40 | 1.28 | 0.52 | Consistent SQL structure |
| ~50 | 1.60 | 0.48 | Refining query patterns |
| ~60 | 1.92 | 0.46 | Solid performance |
| ~80 | 2.56 | 0.42 | Converging |
| ~90 | 2.88 | 0.42 | Well converged |
Final training loss: 4.09 → 0.42 (90% reduction)
Training and validation loss curves are available in the W&B experiment dashboard linked above, and as screenshots in the output_images/ folder of the repository.

| Metric | Baseline | Fine-Tuned | Improvement |
|---|---|---|---|
| Exact Match Accuracy | 0.00% | 6.00% | +6.00% |
| Avg Token Match | 34.39% | 54.75% | +20.36% |
| Output behavior | Verbose explanations | Clean SQL queries | ✅ |
Example 1 — Simple COUNT query
| Value | |
|---|---|
| Question | How many singers do we have? |
| Database | concert_singer |
| Expected SQL | SELECT count(*) FROM singer |
| Baseline output | "To find the number of singers in the database, you can use the COUNT function..." |
| Fine-tuned output | SELECT count(*) FROM singer; |
| Result | ✅ Correct |
Example 2 — ORDER BY with multiple columns
| Value | |
|---|---|
| Question | What are the names, countries, and ages for every singer in descending order of age? |
| Database | concert_singer |
| Expected SQL | SELECT name, country, age FROM singer ORDER BY age DESC |
| Baseline output | Incorrect table and join syntax |
| Fine-tuned output | SELECT Name, Country, Age FROM singer ORDER BY Age DESC; |
| Result | ✅ Correct (case difference only) |
Example 3 — Aggregate with WHERE filter
| Value | |
|---|---|
| Question | What is the average, minimum, and maximum age of all singers from France? |
| Database | concert_singer |
| Expected SQL | SELECT avg(age), min(age), max(age) FROM singer WHERE country = 'France' |
| Baseline output | Natural language explanation |
| Fine-tuned output | SELECT avg(age), min(age), max(age) FROM singer WHERE country = 'France'; |
| Result | ✅ Correct |
Example 4 — GROUP BY
| Value | |
|---|---|
| Question | Show all countries and the number of singers in each country. |
| Database | concert_singer |
| Expected SQL | SELECT country, count(*) FROM singer GROUP BY country |
| Baseline output | Long explanation with incorrect syntax |
| Fine-tuned output | SELECT Country, COUNT(*) FROM singer GROUP BY Country; |
| Result | ✅ Correct |
To verify that SQL fine-tuning did not cause the model to forget general knowledge, we evaluated on 50 samples from the MMLU benchmark:
| Model | MMLU Accuracy |
|---|---|
| Base model (before fine-tuning) | 25.0% (random baseline) |
| Fine-tuned model | 16.0% |
Note: With only 50 samples, high variance is expected — the difference of 9 percentage points corresponds to approximately 4-5 questions. A definitive catastrophic forgetting assessment would require 500+ samples. This limitation is acknowledged.
QLoRA efficiency on consumer hardware
The combination of 4-bit quantization and LoRA adapters reduced VRAM requirements from ~7GB to ~1.5GB, enabling training on a laptop GPU. The final adapter file is only 35MB compared to the 3.5GB full model — a 99% size reduction. This makes deployment extremely lightweight.
Dramatic qualitative transformation
The most important result is not captured by exact match numbers alone. The base model consistently produced paragraphs of natural language explanation. After fine-tuning, it produces clean, structured SQL queries on every inference. This is the core goal of instruction fine-tuning and was fully achieved.
Stable training dynamics
Loss dropped smoothly from 4.09 to 0.42 across 5 epochs without spikes or instability. The cosine learning rate schedule and warmup ratio contributed to stable convergence. Eval loss tracked train loss closely, indicating no overfitting.
Reproducible pipeline
The five-notebook structure (dataset preparation → baseline → training → evaluation → publishing) provides a clean, step-by-step workflow that anyone can follow to reproduce the results.
RTX 5060 Blackwell architecture compatibility
The RTX 5060 uses NVIDIA's new Blackwell architecture (sm_120) which is not supported by PyTorch stable releases. This required installing PyTorch nightly builds with CUDA 12.8 support and a custom-compiled bitsandbytes build. This added significant setup complexity at the start of the project.
Exact match metric is overly strict
The exact match metric treats these two queries as different:
Expected : SELECT country, count(*) FROM singer GROUP BY country
Got : SELECT Country, COUNT(*) FROM singer GROUP BY Country
Both queries are semantically identical and produce the same results when executed. A case-insensitive normalization or execution-based metric would more fairly represent the model's actual SQL generation quality.
Chinese text generation in early runs
The first training run (500 samples, 2 epochs) resulted in the model occasionally appending Chinese text after generating the SQL query. This was a symptom of insufficient training — the model had not fully learned to stop at the end of the SQL statement. This was resolved by increasing the training dataset to 3,500 samples and extending to 5 epochs.
Training time constraints
Training on the full 7,000-sample Spider dataset for 5 epochs would require approximately 4 hours on the RTX 5060. We used 3,500 samples as a practical compromise between training time and model quality.
This publication documents a practical, reproducible QLoRA fine-tuning pipeline for Text-to-SQL generation using the Spider benchmark. Starting from a base model that output verbose natural language instead of SQL, we achieved a model that consistently generates clean, structured SQL queries after approximately 90 minutes of training on consumer hardware.
Key achievements:
Recommended next steps:
I would like to acknowledge the open-source AI community whose tools and datasets made this project possible.
Special thanks to:
Their contributions significantly reduced the barrier to entry for conducting large language model research on consumer hardware.
README.md # Project overview and documentation requirements.txt # Python dependency list LICENSE # MIT license fix_notebook.py # Cleanup helper for notebook metadata 01_dataset_preparation.ipynb # Prepare Spider dataset, format Qwen chat prompts 02_baseline_evaluation.ipynb # Evaluate base Qwen model on SQL generation 03_qlora_finetuning.ipynb # Fine-tune model with QLoRA adapters 04_evaluation.ipynb # Compare baseline vs fine-tuned model 05_publishing.ipynb # Publish adapters and model card to HF hub data/ # Processed datasets and evaluation outputs train.json validation.json dataset_config.json baseline_results.json finetuned_results.json comparison_results.json spider_formatted/ # Saved HuggingFace DatasetDict outputs/ # Fine-tuning checkpoints and final adapter qwen-sql-qlora/ checkpoint-*/ final_adapter/ wandb/ # Weights & Biases run logs and summaries
| Component | Specification |
|---|---|
| GPU | NVIDIA RTX 5060 Laptop GPU |
| VRAM | 8 GB |
| CPU | Intel Core Ultra 7 |
| RAM | 24 GB |
| OS | Windows 11 |
| Resource | Link |
|---|---|
| 🤗 HuggingFace Model | faltooz123/qwen1.5-sql-qlora-spider |
| 📊 W&B Experiment | sql-finetuning project |
| 💻 GitHub Repository | pamuarun/LLM_FINE_TUNING |
| 📂 Spider Dataset | xlangai/spider |