May 30, 2026●16 reads●MIT License

Certifiedunder the LLM Deployment & Engineering module in the LLM Engineering and Deployment program.

Deploying a Fine-Tuned Text-to-SQL Model Using Modal Serverless Infrastructure

api-security
auto-scaling
cost-optimization
gpu-inference
huggingface
large-language-models
llm-deployment
llm-observability
lora
mlops
modal
monitoring
nlp
peft
production-ml
qlora
quantization
qwen
serverless
text-to-sql

Arun Teja

Deploying a Fine-Tuned Text-to-SQL Model Using Modal Serverless Infrastructure

Author: Arun Teja Pamu
Repository: https://github.com/pamuarun/LLM_DEPLOYMENT
Fine-Tuned Model: https://huggingface.co/faltooz123/qwen1.5-sql-qlora-spider
Module 1 Reference: https://github.com/pamuarun/LLM_FINE_TUNING

Abstract

This publication presents a comprehensive deployment plan for a fine-tuned Text-to-SQL language model in a production environment. The model Qwen1.5-1.8B-Chat fine-tuned on the Spider benchmark using QLoRA was developed in Module 1 of this program. This plan covers the full production lifecycle: use case definition, model configuration, deployment strategy, cost analysis, monitoring, and security. The chosen deployment platform is Modal, a serverless GPU infrastructure provider that enables pay-per-use scaling without the overhead of managing persistent cloud instances. The plan demonstrates how a lightweight 1.8B parameter model with a 35MB LoRA adapter can be deployed cost-effectively to serve real business users querying databases in plain English.

Note: This publication presents a comprehensive deployment plan
demonstrating understanding of production LLM systems. Per the project
guidelines, actual cloud infrastructure deployment is not required
the focus is on sound technical decision-making, cost analysis, and
monitoring strategy. All code in the accompanying GitHub repository is
written to be fully functional and ready to deploy by following the
provided instructions.

Introduction

Natural language interfaces to databases represent one of the most practical applications of large language models in enterprise settings. Business analysts, product managers, and non-technical stakeholders routinely need data insights but lack SQL expertise. A production Text-to-SQL API eliminates this barrier users ask questions in plain English and receive accurate SQL queries instantly.

This deployment plan addresses the challenge of taking a fine-tuned model from a research notebook to a production-grade API. The key requirements for this deployment are:

Low latency — business users expect near-instant responses
Cost efficiency — the system must be economically viable at realistic traffic volumes
Reliability — the API must be available during business hours with minimal downtime
Security — database query generation requires careful input validation and access control
The model being deployed faltooz123/qwen1.5-sql-qlora-spider was fine-tuned on the Spider Text-to-SQL benchmark using QLoRA, achieving a +20.36% improvement in token-level match accuracy compared to the base model. It is hosted on Hugging Face Hub as a 35MB LoRA adapter on top of Qwen/Qwen1.5-1.8B-Chat.

Related Work

Text-to-SQL in Production

Several commercial products have deployed Text-to-SQL models at scale. Tools like Retool AI, Thoughtspot, and Microsoft Copilot for Power BI demonstrate real demand for natural language database interfaces. These systems typically use large proprietary models (GPT-4, Claude) via API. Our approach differs by using a fine-tuned open-weights model, which reduces per-request cost and eliminates data privacy concerns associated with sending queries to third-party APIs.

Serverless LLM Deployment

Serverless GPU platforms have emerged as a practical middle ground between managed APIs (easy but expensive) and self-hosted infrastructure (cheap but complex). Modal, RunPod, and Replicate all offer GPU-backed serverless functions. Modal is particularly well-suited for LLM workloads due to its container caching, which keeps model weights warm between requests and dramatically reduces cold start times.

Parameter-Efficient Models for Production

QLoRA adapters offer a unique deployment advantage — the base model can be shared across multiple adapter fine-tunes, reducing storage costs. For our use case, the 35MB adapter is loaded on top of the 3.5GB base model, enabling fast updates (swapping adapters) without redeploying the full model.

1. Use Case Definition

Problem Statement

Business analysts at data-driven organizations need to query internal databases to generate reports, track KPIs, and answer ad-hoc business questions. Writing SQL requires technical expertise that most analysts do not have, creating a bottleneck — they must wait for data engineers to write queries on their behalf.

The Text-to-SQL API solves this by accepting a natural language question and a database name, returning the correct SQL query that the analyst can execute directly in their database client.

Core task: Convert natural language questions into executable SQL queries for structured databases.

Target Users

User Type	How They Use the System
Business Analysts	Query sales, marketing, and operations databases without SQL knowledge
Product Managers	Pull product metrics and user behavior data on demand
Customer Success Teams	Query customer databases to answer client-specific questions
Data Engineers	Use the API to prototype queries before optimizing manually

Input/Output Examples

Example 1 — Simple aggregation

Field	Value
Database	sales_db
Input	How many customers do we have in each country?
Output	`SELECT country, COUNT(*) FROM customers GROUP BY country;`

Example 2 — Filtered query with ordering

Field	Value
Database	ecommerce_db
Input	Show me the top 5 products by revenue last month
Output	`SELECT product_name, SUM(revenue) FROM orders WHERE order_date >= DATE_SUB(NOW(), INTERVAL 1 MONTH) GROUP BY product_name ORDER BY SUM(revenue) DESC LIMIT 5;`

Example 3 — Multi-condition filter

Field	Value
Database	hr_db
Input	List all employees hired after 2020 who are in the engineering department
Output	`SELECT name, hire_date, department FROM employees WHERE hire_date > '2020-01-01' AND department = 'Engineering';`

Success Criteria

Criterion	Target	Measurement
Response latency (p50)	< 2 seconds	API response time logs
Response latency (p99)	< 5 seconds	API response time logs
SQL correctness rate	> 60% exact match	Weekly evaluation on test set
API availability	> 99.5% uptime	Modal uptime dashboard
Error rate	< 1%	Error logs and alerts
Cost per 1,000 requests	< $0.50	Monthly billing analysis

Traffic Expectations

Scenario	Requests/Hour	Requests/Day	Notes
Normal business hours	50–100	600–1,200	9AM–6PM weekdays
Peak load	300–500	—	End of quarter reporting
Off-hours	0–5	50	Automated jobs only
Monthly total	—	~25,000	Estimated average

The traffic pattern is bursty and business-hours-heavy — making serverless deployment ideal over always-on GPU instances.

2. Model Selection & Configuration

Model Details

Aspect	Value
Model	Qwen/Qwen1.5-1.8B-Chat + LoRA adapter
Model Source	Hugging Face Hub (faltooz123/qwen1.5-sql-qlora-spider)
Parameter Count	1.8B (base) + 13M (adapter)
Quantization	4-bit NF4 (QLoRA)
Context Length	8,192 tokens
Max Output Tokens	128
Adapter Size	35MB
Full Model Size	~3.5GB (with base)

Why This Model?

This model was selected over alternatives for four reasons:

1. Task-specific fine-tuning
The model was fine-tuned specifically on the Spider Text-to-SQL benchmark it has learned to output clean SQL queries rather than natural language explanations. A general-purpose model would require more complex prompting and produce less reliable output.

2. Size and cost efficiency
At 1.8B parameters, this model runs on a single A10G GPU (24GB VRAM) with comfortable headroom. Larger models (7B, 13B) would require more expensive hardware, increasing per-request cost significantly.

3. Open weights and data privacy
Unlike GPT-4 or Claude API, this model runs entirely within our infrastructure. Database schemas and business queries never leave our servers critical for enterprise compliance.

4. Fast inference
The 4-bit quantized model generates SQL queries in under 2 seconds on an A10G GPU, meeting our latency targets.

Trade-offs Considered

Option	Pros	Cons	Decision
GPT-4o-mini API	High accuracy, no infra	$0.15/1M tokens, data privacy risk	Rejected
Qwen 7B fine-tuned	Higher accuracy	4x cost, slower inference	Rejected
Our Qwen 1.8B	Low cost, private, fast	Lower accuracy than 7B	✅ Selected
Base model (no fine-tune)	No training cost	0% exact match, wrong output format	Rejected

Quantization Impact

4-bit NF4 quantization reduces VRAM from ~7GB to ~1.5GB with minimal quality loss on structured generation tasks like SQL. For Text-to-SQL, the output space is constrained (SQL keywords are fixed) — quantization has less impact than on open-ended generation tasks. The trade-off is a slight increase in inference latency (~10-15%) compared to FP16, which remains within our 2-second p50 target.

3. Deployment Strategy

Platform Selection

Selected Platform: Modal (Serverless GPU)

Platform	Best For	Why Not Selected
Modal ✅	Serverless, bursty traffic, easy HF integration	— Selected
HuggingFace Inference API	Quick prototyping	Limited GPU options, cold starts
AWS SageMaker	Enterprise, full control	Complex setup, high fixed costs
AWS Bedrock	Managed models only	Doesn't support custom fine-tunes
vLLM on Cloud VM	High sustained throughput	Idle GPU cost during off-hours
Ollama on EC2	Simple self-hosted	No auto-scaling, manual management

Why Modal?

Modal is the best fit for this use case for three reasons:

1. Serverless scales to zero
Our traffic is bursty and business-hours-heavy. With Modal, we pay only for actual compute time. During off-hours (18 hours/day), we pay nothing. A persistent VM would charge 24/7 regardless of traffic.

2. Container image caching
Modal caches the model weights in the container image. After the first cold start (~30-60 seconds), subsequent requests are served from a warm container in under 2 seconds. This eliminates the biggest downside of serverless for LLM workloads.

3. Native Hugging Face integration
Modal can directly load our fine-tuned adapter from Hugging Face Hub using snapshot_download. No manual model management or custom container registries needed.

Infrastructure Details

Component	Configuration
GPU Type	NVIDIA A10G (24GB VRAM)
CPU	4 vCPUs
Memory	32GB RAM
Container timeout	300 seconds
Min replicas (warm)	0 (scales to zero)
Max replicas	5
Concurrency per container	1 request
Cold start time	~30-60 seconds (first request)
Warm request time	< 2 seconds
Region	us-east-1 (lowest latency for US users)

Scaling Strategy

Traffic Level	Replicas	Behavior
0 requests	0	Scales to zero, no cost
1-50 req/hour	1	Single warm container
50-200 req/hour	2-3	Auto-scales up
200-500 req/hour	4-5	Peak capacity
Post-peak	0-1	Scales back down after 5 min idle

Endpoint Architecture

User Request (HTTPS)
       │
       ▼
Modal API Gateway
       │
       ▼
Modal Container (A10G GPU)
  ├── Load base model (Qwen1.5-1.8B, 4-bit)
  ├── Load LoRA adapter (faltooz123/qwen1.5-sql-qlora-spider)
  ├── Run inference (< 2 seconds)
  └── Return SQL query
       │
       ▼
JSON Response → User

4. Cost Analysis

Monthly Cost Breakdown

Cost Component	Details	Monthly Estimate
Compute (A10G GPU)	~25,000 requests × 2 sec = 14 GPU hours @ $1.10/hr	$15.40
Container idle time	Warm container kept alive 2 hrs/day × 22 days	$4.84
Storage (model weights)	3.5GB base + 35MB adapter on Modal volume	$2.00
Network (data transfer)	~1KB per request × 25,000 requests	$0.50
Monitoring (LangFuse)	Free tier (up to 50K traces/month)	$0.00
Total Estimated		~$22.74/month

Note: Modal provides $30/month free credits for new accounts this deployment runs within the free tier for the first month.

Cost Per Request

Monthly compute cost    : $15.40
Monthly requests        : 25,000
Cost per request        : $0.00062
Cost per 1,000 requests : $0.62

Cost Optimization Strategies

Strategy 1 — Response Caching
Many business users ask the same questions repeatedly (e.g., "How many active users do we have today?"). Implementing a Redis cache keyed on (database_id, question_hash) would serve repeated queries instantly at zero GPU cost. Estimated cache hit rate: 20-30%, reducing monthly compute cost by ~$4.

Strategy 2 — Request Batching
Instead of processing each request individually, batch multiple requests together in a single GPU forward pass. Modal supports batching via container concurrency settings. For non-latency-sensitive requests (automated jobs), batching 4-8 requests together reduces GPU time by ~60%.

Strategy 3 — Aggressive Scale-to-Zero
Setting container idle timeout to 2 minutes (instead of default 10) ensures GPU is released quickly after traffic drops. During the 18 off-hours daily, this saves approximately 3-4 GPU hours per day.

Strategy 4 — Spot Instance Pricing
For batch processing jobs (e.g., nightly report generation), Modal supports spot-equivalent pricing. Running non-urgent requests on interruptible containers reduces compute cost by ~40%.

5. Monitoring & Observability Plan

Metrics to Track

Metric	Why It Matters	Alert Threshold	Tool
Latency p50	Typical user experience	> 3 seconds	LangFuse
Latency p99	Worst-case user experience	> 8 seconds	LangFuse
Error rate	API reliability	> 2%	Modal dashboard
Throughput (RPS)	Capacity planning	< 0.1 RPS sustained	CloudWatch
Token usage/request	Cost control	> 150 tokens/request	LiteLLM
GPU utilization	Resource efficiency	> 95% or < 10%	Modal dashboard
Cold start rate	User experience	> 20% of requests	Custom logging
SQL parse success rate	Output quality	< 80% valid SQL	Custom evaluation

Monitoring Stack

Tool	Purpose	Why Selected
LangFuse	LLM request tracing, prompt/response logging, latency tracking	Open source, self-hostable, purpose-built for LLMs
Modal Dashboard	Container health, GPU utilization, scaling events	Built-in, no setup required
LiteLLM	Token usage tracking, cost per request, budget alerts	Lightweight, integrates with any LLM endpoint
Sentry	Error tracking and alerting	Industry standard, free tier sufficient

Alerting Strategy

Alert 1 — High Latency

Condition: p99 latency > 8 seconds for 5 consecutive minutes
Severity: Warning
Notification: Slack #ml-alerts channel
Runbook: Check GPU utilization, check for container cold starts, scale up min replicas
Alert 2 — High Error Rate
Condition: Error rate > 2% over 10-minute window
Severity: Critical
Notification: PagerDuty on-call + Slack #ml-alerts
Runbook: Check Modal logs for OOM errors, verify HF Hub connectivity, rollback if needed
Alert 3 — Budget Threshold
Condition: Monthly spend > $40 (LiteLLM budget alert)
Severity: Warning
Notification: Email to project owner
Runbook: Review traffic spike, enable aggressive caching, check for abuse
Alert 4 — Low SQL Quality
Condition: SQL parse success rate < 80% over 1-hour window
Severity: Warning
Notification: Slack #ml-alerts
Runbook: Sample recent failures, check if input distribution has shifted, consider prompt tuning

Logging Strategy

Every request logs the following to LangFuse:

Input: database name + natural language question
Output: generated SQL query
Latency: total response time
Token count: input + output tokens
Container ID: for debugging cold starts
PII Note: Database names and questions may contain sensitive business data. Logs are retained for 30 days and access is restricted to the ML team.

6. Security Considerations

API Authentication

All API endpoints are protected using API key authentication. Each client application receives a unique API key issued through a key management service. Keys are passed in the Authorization: Bearer <key> header and validated on every request before reaching the model.

# Example authentication header
headers = {
    "Authorization": "Bearer sk-sql-xxxxxxxxxxxxxxxx",
    "Content-Type": "application/json"
}

Rate Limiting

User Type	Rate Limit	Burst Limit
Standard API key	60 requests/minute	100 requests/minute
Premium API key	300 requests/minute	500 requests/minute
Internal service	1,000 requests/minute	2,000 requests/minute

Rate limits are enforced at the API gateway level using a token bucket algorithm. Clients exceeding limits receive a 429 Too Many Requests response with a Retry-After header.

Input Validation

Prompt Injection Prevention:
User inputs are sanitized before being inserted into the prompt template. The following patterns are blocked:

Inputs containing <|im_start|>, <|im_end|>, or similar control tokens
Inputs exceeding 500 characters (prevents prompt stuffing)
Inputs containing SQL injection patterns (semicolons followed by DROP, DELETE, INSERT)

def validate_input(question: str, db_id: str) -> bool:
    # Block control tokens
    blocked_tokens = ["<|im_start|>", "<|im_end|>", "system\n", "assistant\n"]
    for token in blocked_tokens:
        if token in question:
            return False
    # Block excessive length
    if len(question) > 500:
        return False
    return True

PII Handling

Prompt logging: Enabled for debugging but access restricted to ML team
Response logging: SQL queries logged but sanitized of any literal values
Retention: Logs retained for 30 days then automatically deleted
Encryption: All logs encrypted at rest using AES-256
No training on user data: Request logs are never used to retrain the model without explicit consent

Access Control

Resource	Who Can Access
Modal deployment console	ML team only
LangFuse traces	ML team + Engineering leads
API keys management	Engineering leads only
Model weights (HF Hub)	Public (read-only)
Monitoring dashboards	ML team + Management

Discussion

What This Deployment Plan Achieves

This plan demonstrates a cost-effective, scalable path to production for a fine-tuned LLM. The key insight is matching the deployment platform to the traffic pattern — bursty, business-hours-heavy traffic is perfectly served by serverless infrastructure, while a persistent VM would waste money on idle GPU time.

The combination of a small fine-tuned model (1.8B parameters) with serverless deployment results in a system that costs approximately **2.50 per 1M tokens, roughly $5-10 per 1,000 SQL requests) while maintaining data privacy.

Trade-offs and Limitations

Accuracy vs cost: The 1.8B model achieves 6% exact match on the Spider benchmark. A 7B model would achieve significantly higher accuracy but at 4x the compute cost. For a production system, this trade-off should be revisited after measuring real user satisfaction.

Cold starts: The first request after a period of inactivity takes 30-60 seconds while the container initializes. This is acceptable for a business tool where users tolerate occasional delays, but would be unacceptable for a real-time consumer application.

Single model limitation: This deployment serves a single fine-tuned model. A production system would benefit from A/B testing infrastructure to compare model versions before full rollout.

Future Improvements

Execution validation: Run generated SQL against a read-only database replica and return an error if it fails
Schema-aware prompting: Pass actual table and column names in the prompt for higher accuracy
Model versioning: Implement blue-green deployment for zero-downtime model updates
Feedback loop: Collect user corrections to build a higher-quality training dataset for the next fine-tuning iteration

Conclusion

This publication presents a complete deployment plan for a fine-tuned Text-to-SQL model in a production environment. Using Modal's serverless GPU infrastructure, the deployment achieves:

✅ Sub-2-second response latency for warm requests
✅ Estimated cost of 22/month at projected traffic)
✅ Auto-scaling from 0 to 5 replicas based on demand
✅ Complete data privacy — no queries sent to third-party APIs
✅ Comprehensive monitoring with LangFuse, Modal dashboard, and LiteLLM
✅ Security through API key authentication, rate limiting, and input validation

The plan demonstrates that a fine-tuned open-weights model can be deployed cost-effectively for real business use cases without enterprise-scale infrastructure budgets. The serverless approach is particularly well-suited for internal business tools with predictable, business-hours-heavy traffic patterns.

References

Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. arXiv
.14314
Yu, T., Zhang, R., Yang, K., et al. (2018). Spider: A Large-Scale Human-Labeled Dataset for Text-to-SQL. arXiv
.08887
Modal Labs. (2024). Modal Documentation — Serverless GPU Deployment. https://modal.com/docs
LangFuse. (2024). Open Source LLM Observability. https://langfuse.com
Hu, E., Shen, Y., Wallis, P., et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv
.09685
Qwen Team. (2024). Qwen1.5: Expanding Language Model Series. Hugging Face.

Acknowledgements

The author thanks the Modal Labs team for their excellent documentation and free credits for academic projects, the HuggingFace team for the Hub infrastructure that makes model sharing seamless, and the LangFuse team for building an open-source LLM observability platform. This deployment plan builds directly on the fine-tuning work completed in Module 1 of the LLMED Certification Program.

Appendix

A. Full Repository Structure

LLM_DEPLOYMENT/
├── README.md                    # Deployment guide and setup instructions
├── requirements.txt             # Python dependencies
├── deploy/
│   └── modal_deploy.py          # Modal serverless deployment script
├── client/
│   ├── client.py                # HTTP client for the deployed API
│   └── test_requests.py         # Example SQL generation test cases
└── config/
    └── model_config.yaml        # Model and deployment configuration

B. Environment Variables Required

# Hugging Face access token
HUGGINGFACE_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx
 
# Modal API token (set via modal token set)
MODAL_TOKEN_ID=ak-xxxxxxxxxxxxxxxxxxxx
MODAL_TOKEN_SECRET=as-xxxxxxxxxxxxxxxxxxxx
 
# LangFuse observability
LANGFUSE_PUBLIC_KEY=pk-lf-xxxxxxxxxxxxxxxxxxxx
LANGFUSE_SECRET_KEY=sk-lf-xxxxxxxxxxxxxxxxxxxx
LANGFUSE_HOST=https://cloud.langfuse.com
 
# API authentication
SQL_API_KEY=sk-sql-xxxxxxxxxxxxxxxxxxxx

C. Deployment Checklist

Pre-deployment:
□ Modal account created and CLI authenticated
□ HuggingFace token set with read access
□ LangFuse project created and keys configured
□ model_config.yaml reviewed and updated
□ requirements.txt installed locally
 
Deployment:
□ modal deploy deploy/modal_deploy.py
□ Endpoint URL copied from Modal dashboard
□ client.py updated with endpoint URL
□ test_requests.py run and all tests pass
 
Post-deployment:
□ LangFuse traces appearing for test requests
□ Modal dashboard showing GPU utilization
□ Latency p50 < 2 seconds on warm requests
□ Error rate < 1% on test suite

D. Resources

Resource	Link
🤗 Fine-Tuned Model	faltooz123/qwen1.5-sql-qlora-spider
💻 Deployment Repo	pamuarun/LLM_DEPLOYMENT
📦 Module 1 Repo	pamuarun/LLM_FINE_TUNING
📊 W&B Experiment	sql-finetuning project
🚀 Modal Docs	modal.com/docs
📊 LangFuse	langfuse.com

May 30, 2026●16 reads●MIT License

Certifiedunder the LLM Deployment & Engineering module in the LLM Engineering and Deployment program.

Deploying a Fine-Tuned Text-to-SQL Model Using Modal Serverless Infrastructure

api-security
auto-scaling
cost-optimization
gpu-inference
huggingface
large-language-models
llm-deployment
llm-observability
lora
mlops
modal
monitoring
nlp
peft
production-ml
qlora
quantization
qwen
serverless
text-to-sql

Arun Teja

Deploying a Fine-Tuned Text-to-SQL Model Using Modal Serverless Infrastructure

Abstract

Note: This publication presents a comprehensive deployment plan
demonstrating understanding of production LLM systems. Per the project
guidelines, actual cloud infrastructure deployment is not required
the focus is on sound technical decision-making, cost analysis, and
monitoring strategy. All code in the accompanying GitHub repository is
written to be fully functional and ready to deploy by following the
provided instructions.

Introduction

This deployment plan addresses the challenge of taking a fine-tuned model from a research notebook to a production-grade API. The key requirements for this deployment are:

Low latency — business users expect near-instant responses
Cost efficiency — the system must be economically viable at realistic traffic volumes
Reliability — the API must be available during business hours with minimal downtime
Security — database query generation requires careful input validation and access control
The model being deployed faltooz123/qwen1.5-sql-qlora-spider was fine-tuned on the Spider Text-to-SQL benchmark using QLoRA, achieving a +20.36% improvement in token-level match accuracy compared to the base model. It is hosted on Hugging Face Hub as a 35MB LoRA adapter on top of Qwen/Qwen1.5-1.8B-Chat.

Related Work

Text-to-SQL in Production

Serverless LLM Deployment

Parameter-Efficient Models for Production

1. Use Case Definition

Problem Statement

The Text-to-SQL API solves this by accepting a natural language question and a database name, returning the correct SQL query that the analyst can execute directly in their database client.

Core task: Convert natural language questions into executable SQL queries for structured databases.

Target Users

User Type	How They Use the System
Business Analysts	Query sales, marketing, and operations databases without SQL knowledge
Product Managers	Pull product metrics and user behavior data on demand
Customer Success Teams	Query customer databases to answer client-specific questions
Data Engineers	Use the API to prototype queries before optimizing manually

Input/Output Examples

Example 1 — Simple aggregation

Field	Value
Database	sales_db
Input	How many customers do we have in each country?
Output	`SELECT country, COUNT(*) FROM customers GROUP BY country;`

Example 2 — Filtered query with ordering

Field	Value
Database	ecommerce_db
Input	Show me the top 5 products by revenue last month
Output	`SELECT product_name, SUM(revenue) FROM orders WHERE order_date >= DATE_SUB(NOW(), INTERVAL 1 MONTH) GROUP BY product_name ORDER BY SUM(revenue) DESC LIMIT 5;`

Example 3 — Multi-condition filter

Field	Value
Database	hr_db
Input	List all employees hired after 2020 who are in the engineering department
Output	`SELECT name, hire_date, department FROM employees WHERE hire_date > '2020-01-01' AND department = 'Engineering';`

Success Criteria

Criterion	Target	Measurement
Response latency (p50)	< 2 seconds	API response time logs
Response latency (p99)	< 5 seconds	API response time logs
SQL correctness rate	> 60% exact match	Weekly evaluation on test set
API availability	> 99.5% uptime	Modal uptime dashboard
Error rate	< 1%	Error logs and alerts
Cost per 1,000 requests	< $0.50	Monthly billing analysis

Traffic Expectations

Scenario	Requests/Hour	Requests/Day	Notes
Normal business hours	50–100	600–1,200	9AM–6PM weekdays
Peak load	300–500	—	End of quarter reporting
Off-hours	0–5	50	Automated jobs only
Monthly total	—	~25,000	Estimated average

The traffic pattern is bursty and business-hours-heavy — making serverless deployment ideal over always-on GPU instances.

2. Model Selection & Configuration

Model Details

Aspect	Value
Model	Qwen/Qwen1.5-1.8B-Chat + LoRA adapter
Model Source	Hugging Face Hub (faltooz123/qwen1.5-sql-qlora-spider)
Parameter Count	1.8B (base) + 13M (adapter)
Quantization	4-bit NF4 (QLoRA)
Context Length	8,192 tokens
Max Output Tokens	128
Adapter Size	35MB
Full Model Size	~3.5GB (with base)

Why This Model?

This model was selected over alternatives for four reasons:

4. Fast inference
The 4-bit quantized model generates SQL queries in under 2 seconds on an A10G GPU, meeting our latency targets.

Trade-offs Considered

Option	Pros	Cons	Decision
GPT-4o-mini API	High accuracy, no infra	$0.15/1M tokens, data privacy risk	Rejected
Qwen 7B fine-tuned	Higher accuracy	4x cost, slower inference	Rejected
Our Qwen 1.8B	Low cost, private, fast	Lower accuracy than 7B	✅ Selected
Base model (no fine-tune)	No training cost	0% exact match, wrong output format	Rejected

Quantization Impact

3. Deployment Strategy

Platform Selection

Selected Platform: Modal (Serverless GPU)

Platform	Best For	Why Not Selected
Modal ✅	Serverless, bursty traffic, easy HF integration	— Selected
HuggingFace Inference API	Quick prototyping	Limited GPU options, cold starts
AWS SageMaker	Enterprise, full control	Complex setup, high fixed costs
AWS Bedrock	Managed models only	Doesn't support custom fine-tunes
vLLM on Cloud VM	High sustained throughput	Idle GPU cost during off-hours
Ollama on EC2	Simple self-hosted	No auto-scaling, manual management

Why Modal?

Modal is the best fit for this use case for three reasons:

Infrastructure Details

Component	Configuration
GPU Type	NVIDIA A10G (24GB VRAM)
CPU	4 vCPUs
Memory	32GB RAM
Container timeout	300 seconds
Min replicas (warm)	0 (scales to zero)
Max replicas	5
Concurrency per container	1 request
Cold start time	~30-60 seconds (first request)
Warm request time	< 2 seconds
Region	us-east-1 (lowest latency for US users)

Scaling Strategy

Traffic Level	Replicas	Behavior
0 requests	0	Scales to zero, no cost
1-50 req/hour	1	Single warm container
50-200 req/hour	2-3	Auto-scales up
200-500 req/hour	4-5	Peak capacity
Post-peak	0-1	Scales back down after 5 min idle

Endpoint Architecture

User Request (HTTPS)
       │
       ▼
Modal API Gateway
       │
       ▼
Modal Container (A10G GPU)
  ├── Load base model (Qwen1.5-1.8B, 4-bit)
  ├── Load LoRA adapter (faltooz123/qwen1.5-sql-qlora-spider)
  ├── Run inference (< 2 seconds)
  └── Return SQL query
       │
       ▼
JSON Response → User

4. Cost Analysis

Monthly Cost Breakdown

Cost Component	Details	Monthly Estimate
Compute (A10G GPU)	~25,000 requests × 2 sec = 14 GPU hours @ $1.10/hr	$15.40
Container idle time	Warm container kept alive 2 hrs/day × 22 days	$4.84
Storage (model weights)	3.5GB base + 35MB adapter on Modal volume	$2.00
Network (data transfer)	~1KB per request × 25,000 requests	$0.50
Monitoring (LangFuse)	Free tier (up to 50K traces/month)	$0.00
Total Estimated		~$22.74/month

Note: Modal provides $30/month free credits for new accounts this deployment runs within the free tier for the first month.

Cost Per Request

Monthly compute cost    : $15.40
Monthly requests        : 25,000
Cost per request        : $0.00062
Cost per 1,000 requests : $0.62

Cost Optimization Strategies

5. Monitoring & Observability Plan

Metrics to Track

Metric	Why It Matters	Alert Threshold	Tool
Latency p50	Typical user experience	> 3 seconds	LangFuse
Latency p99	Worst-case user experience	> 8 seconds	LangFuse
Error rate	API reliability	> 2%	Modal dashboard
Throughput (RPS)	Capacity planning	< 0.1 RPS sustained	CloudWatch
Token usage/request	Cost control	> 150 tokens/request	LiteLLM
GPU utilization	Resource efficiency	> 95% or < 10%	Modal dashboard
Cold start rate	User experience	> 20% of requests	Custom logging
SQL parse success rate	Output quality	< 80% valid SQL	Custom evaluation

Monitoring Stack

Tool	Purpose	Why Selected
LangFuse	LLM request tracing, prompt/response logging, latency tracking	Open source, self-hostable, purpose-built for LLMs
Modal Dashboard	Container health, GPU utilization, scaling events	Built-in, no setup required
LiteLLM	Token usage tracking, cost per request, budget alerts	Lightweight, integrates with any LLM endpoint
Sentry	Error tracking and alerting	Industry standard, free tier sufficient

Alerting Strategy

Alert 1 — High Latency

Condition: p99 latency > 8 seconds for 5 consecutive minutes
Severity: Warning
Notification: Slack #ml-alerts channel
Runbook: Check GPU utilization, check for container cold starts, scale up min replicas
Alert 2 — High Error Rate
Condition: Error rate > 2% over 10-minute window
Severity: Critical
Notification: PagerDuty on-call + Slack #ml-alerts
Runbook: Check Modal logs for OOM errors, verify HF Hub connectivity, rollback if needed
Alert 3 — Budget Threshold
Condition: Monthly spend > $40 (LiteLLM budget alert)
Severity: Warning
Notification: Email to project owner
Runbook: Review traffic spike, enable aggressive caching, check for abuse
Alert 4 — Low SQL Quality
Condition: SQL parse success rate < 80% over 1-hour window
Severity: Warning
Notification: Slack #ml-alerts
Runbook: Sample recent failures, check if input distribution has shifted, consider prompt tuning

Logging Strategy

Every request logs the following to LangFuse:

Input: database name + natural language question
Output: generated SQL query
Latency: total response time
Token count: input + output tokens
Container ID: for debugging cold starts
PII Note: Database names and questions may contain sensitive business data. Logs are retained for 30 days and access is restricted to the ML team.

6. Security Considerations

API Authentication

# Example authentication header
headers = {
    "Authorization": "Bearer sk-sql-xxxxxxxxxxxxxxxx",
    "Content-Type": "application/json"
}

Rate Limiting

User Type	Rate Limit	Burst Limit
Standard API key	60 requests/minute	100 requests/minute
Premium API key	300 requests/minute	500 requests/minute
Internal service	1,000 requests/minute	2,000 requests/minute

Rate limits are enforced at the API gateway level using a token bucket algorithm. Clients exceeding limits receive a 429 Too Many Requests response with a Retry-After header.

Input Validation

Prompt Injection Prevention:
User inputs are sanitized before being inserted into the prompt template. The following patterns are blocked:

Inputs containing <|im_start|>, <|im_end|>, or similar control tokens
Inputs exceeding 500 characters (prevents prompt stuffing)
Inputs containing SQL injection patterns (semicolons followed by DROP, DELETE, INSERT)

def validate_input(question: str, db_id: str) -> bool:
    # Block control tokens
    blocked_tokens = ["<|im_start|>", "<|im_end|>", "system\n", "assistant\n"]
    for token in blocked_tokens:
        if token in question:
            return False
    # Block excessive length
    if len(question) > 500:
        return False
    return True

PII Handling

Prompt logging: Enabled for debugging but access restricted to ML team
Response logging: SQL queries logged but sanitized of any literal values
Retention: Logs retained for 30 days then automatically deleted
Encryption: All logs encrypted at rest using AES-256
No training on user data: Request logs are never used to retrain the model without explicit consent

Access Control

Resource	Who Can Access
Modal deployment console	ML team only
LangFuse traces	ML team + Engineering leads
API keys management	Engineering leads only
Model weights (HF Hub)	Public (read-only)
Monitoring dashboards	ML team + Management

Discussion

What This Deployment Plan Achieves

Trade-offs and Limitations

Single model limitation: This deployment serves a single fine-tuned model. A production system would benefit from A/B testing infrastructure to compare model versions before full rollout.

Future Improvements

Execution validation: Run generated SQL against a read-only database replica and return an error if it fails
Schema-aware prompting: Pass actual table and column names in the prompt for higher accuracy
Model versioning: Implement blue-green deployment for zero-downtime model updates
Feedback loop: Collect user corrections to build a higher-quality training dataset for the next fine-tuning iteration

Conclusion

This publication presents a complete deployment plan for a fine-tuned Text-to-SQL model in a production environment. Using Modal's serverless GPU infrastructure, the deployment achieves:

✅ Sub-2-second response latency for warm requests
✅ Estimated cost of 22/month at projected traffic)
✅ Auto-scaling from 0 to 5 replicas based on demand
✅ Complete data privacy — no queries sent to third-party APIs
✅ Comprehensive monitoring with LangFuse, Modal dashboard, and LiteLLM
✅ Security through API key authentication, rate limiting, and input validation

References

Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. arXiv
.14314
Yu, T., Zhang, R., Yang, K., et al. (2018). Spider: A Large-Scale Human-Labeled Dataset for Text-to-SQL. arXiv
.08887
Modal Labs. (2024). Modal Documentation — Serverless GPU Deployment. https://modal.com/docs
LangFuse. (2024). Open Source LLM Observability. https://langfuse.com
Hu, E., Shen, Y., Wallis, P., et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv
.09685
Qwen Team. (2024). Qwen1.5: Expanding Language Model Series. Hugging Face.

Acknowledgements

Appendix

A. Full Repository Structure

LLM_DEPLOYMENT/
├── README.md                    # Deployment guide and setup instructions
├── requirements.txt             # Python dependencies
├── deploy/
│   └── modal_deploy.py          # Modal serverless deployment script
├── client/
│   ├── client.py                # HTTP client for the deployed API
│   └── test_requests.py         # Example SQL generation test cases
└── config/
    └── model_config.yaml        # Model and deployment configuration

B. Environment Variables Required

# Hugging Face access token
HUGGINGFACE_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx
 
# Modal API token (set via modal token set)
MODAL_TOKEN_ID=ak-xxxxxxxxxxxxxxxxxxxx
MODAL_TOKEN_SECRET=as-xxxxxxxxxxxxxxxxxxxx
 
# LangFuse observability
LANGFUSE_PUBLIC_KEY=pk-lf-xxxxxxxxxxxxxxxxxxxx
LANGFUSE_SECRET_KEY=sk-lf-xxxxxxxxxxxxxxxxxxxx
LANGFUSE_HOST=https://cloud.langfuse.com
 
# API authentication
SQL_API_KEY=sk-sql-xxxxxxxxxxxxxxxxxxxx

C. Deployment Checklist

Pre-deployment:
□ Modal account created and CLI authenticated
□ HuggingFace token set with read access
□ LangFuse project created and keys configured
□ model_config.yaml reviewed and updated
□ requirements.txt installed locally
 
Deployment:
□ modal deploy deploy/modal_deploy.py
□ Endpoint URL copied from Modal dashboard
□ client.py updated with endpoint URL
□ test_requests.py run and all tests pass
 
Post-deployment:
□ LangFuse traces appearing for test requests
□ Modal dashboard showing GPU utilization
□ Latency p50 < 2 seconds on warm requests
□ Error rate < 1% on test suite

D. Resources

Resource	Link
🤗 Fine-Tuned Model	faltooz123/qwen1.5-sql-qlora-spider
💻 Deployment Repo	pamuarun/LLM_DEPLOYMENT
📦 Module 1 Repo	pamuarun/LLM_FINE_TUNING
📊 W&B Experiment	sql-finetuning project
🚀 Modal Docs	modal.com/docs
📊 LangFuse	langfuse.com