Author: Arun Teja Pamu
Repository: https://github.com/pamuarun/LLM_DEPLOYMENT
Fine-Tuned Model: https://huggingface.co/faltooz123/qwen1.5-sql-qlora-spider
Module 1 Reference: https://github.com/pamuarun/LLM_FINE_TUNING

This publication presents a comprehensive deployment plan for a fine-tuned Text-to-SQL language model in a production environment. The model Qwen1.5-1.8B-Chat fine-tuned on the Spider benchmark using QLoRA was developed in Module 1 of this program. This plan covers the full production lifecycle: use case definition, model configuration, deployment strategy, cost analysis, monitoring, and security. The chosen deployment platform is Modal, a serverless GPU infrastructure provider that enables pay-per-use scaling without the overhead of managing persistent cloud instances. The plan demonstrates how a lightweight 1.8B parameter model with a 35MB LoRA adapter can be deployed cost-effectively to serve real business users querying databases in plain English.

Natural language interfaces to databases represent one of the most practical applications of large language models in enterprise settings. Business analysts, product managers, and non-technical stakeholders routinely need data insights but lack SQL expertise. A production Text-to-SQL API eliminates this barrier users ask questions in plain English and receive accurate SQL queries instantly.
This deployment plan addresses the challenge of taking a fine-tuned model from a research notebook to a production-grade API. The key requirements for this deployment are:
faltooz123/qwen1.5-sql-qlora-spider was fine-tuned on the Spider Text-to-SQL benchmark using QLoRA, achieving a +20.36% improvement in token-level match accuracy compared to the base model. It is hosted on Hugging Face Hub as a 35MB LoRA adapter on top of Qwen/Qwen1.5-1.8B-Chat.Several commercial products have deployed Text-to-SQL models at scale. Tools like Retool AI, Thoughtspot, and Microsoft Copilot for Power BI demonstrate real demand for natural language database interfaces. These systems typically use large proprietary models (GPT-4, Claude) via API. Our approach differs by using a fine-tuned open-weights model, which reduces per-request cost and eliminates data privacy concerns associated with sending queries to third-party APIs.
Serverless GPU platforms have emerged as a practical middle ground between managed APIs (easy but expensive) and self-hosted infrastructure (cheap but complex). Modal, RunPod, and Replicate all offer GPU-backed serverless functions. Modal is particularly well-suited for LLM workloads due to its container caching, which keeps model weights warm between requests and dramatically reduces cold start times.
QLoRA adapters offer a unique deployment advantage — the base model can be shared across multiple adapter fine-tunes, reducing storage costs. For our use case, the 35MB adapter is loaded on top of the 3.5GB base model, enabling fast updates (swapping adapters) without redeploying the full model.
Business analysts at data-driven organizations need to query internal databases to generate reports, track KPIs, and answer ad-hoc business questions. Writing SQL requires technical expertise that most analysts do not have, creating a bottleneck — they must wait for data engineers to write queries on their behalf.
The Text-to-SQL API solves this by accepting a natural language question and a database name, returning the correct SQL query that the analyst can execute directly in their database client.
Core task: Convert natural language questions into executable SQL queries for structured databases.
| User Type | How They Use the System |
|---|---|
| Business Analysts | Query sales, marketing, and operations databases without SQL knowledge |
| Product Managers | Pull product metrics and user behavior data on demand |
| Customer Success Teams | Query customer databases to answer client-specific questions |
| Data Engineers | Use the API to prototype queries before optimizing manually |
Example 1 — Simple aggregation
| Field | Value |
|---|---|
| Database | sales_db |
| Input | How many customers do we have in each country? |
| Output | SELECT country, COUNT(*) FROM customers GROUP BY country; |
Example 2 — Filtered query with ordering
| Field | Value |
|---|---|
| Database | ecommerce_db |
| Input | Show me the top 5 products by revenue last month |
| Output | SELECT product_name, SUM(revenue) FROM orders WHERE order_date >= DATE_SUB(NOW(), INTERVAL 1 MONTH) GROUP BY product_name ORDER BY SUM(revenue) DESC LIMIT 5; |
Example 3 — Multi-condition filter
| Field | Value |
|---|---|
| Database | hr_db |
| Input | List all employees hired after 2020 who are in the engineering department |
| Output | SELECT name, hire_date, department FROM employees WHERE hire_date > '2020-01-01' AND department = 'Engineering'; |
| Criterion | Target | Measurement |
|---|---|---|
| Response latency (p50) | < 2 seconds | API response time logs |
| Response latency (p99) | < 5 seconds | API response time logs |
| SQL correctness rate | > 60% exact match | Weekly evaluation on test set |
| API availability | > 99.5% uptime | Modal uptime dashboard |
| Error rate | < 1% | Error logs and alerts |
| Cost per 1,000 requests | < $0.50 | Monthly billing analysis |
| Scenario | Requests/Hour | Requests/Day | Notes |
|---|---|---|---|
| Normal business hours | 50–100 | 600–1,200 | 9AM–6PM weekdays |
| Peak load | 300–500 | — | End of quarter reporting |
| Off-hours | 0–5 | 50 | Automated jobs only |
| Monthly total | — | ~25,000 | Estimated average |
The traffic pattern is bursty and business-hours-heavy — making serverless deployment ideal over always-on GPU instances.
| Aspect | Value |
|---|---|
| Model | Qwen/Qwen1.5-1.8B-Chat + LoRA adapter |
| Model Source | Hugging Face Hub (faltooz123/qwen1.5-sql-qlora-spider) |
| Parameter Count | 1.8B (base) + 13M (adapter) |
| Quantization | 4-bit NF4 (QLoRA) |
| Context Length | 8,192 tokens |
| Max Output Tokens | 128 |
| Adapter Size | 35MB |
| Full Model Size | ~3.5GB (with base) |
This model was selected over alternatives for four reasons:
1. Task-specific fine-tuning
The model was fine-tuned specifically on the Spider Text-to-SQL benchmark it has learned to output clean SQL queries rather than natural language explanations. A general-purpose model would require more complex prompting and produce less reliable output.
2. Size and cost efficiency
At 1.8B parameters, this model runs on a single A10G GPU (24GB VRAM) with comfortable headroom. Larger models (7B, 13B) would require more expensive hardware, increasing per-request cost significantly.
3. Open weights and data privacy
Unlike GPT-4 or Claude API, this model runs entirely within our infrastructure. Database schemas and business queries never leave our servers critical for enterprise compliance.
4. Fast inference
The 4-bit quantized model generates SQL queries in under 2 seconds on an A10G GPU, meeting our latency targets.
| Option | Pros | Cons | Decision |
|---|---|---|---|
| GPT-4o-mini API | High accuracy, no infra | $0.15/1M tokens, data privacy risk | Rejected |
| Qwen 7B fine-tuned | Higher accuracy | 4x cost, slower inference | Rejected |
| Our Qwen 1.8B | Low cost, private, fast | Lower accuracy than 7B | ✅ Selected |
| Base model (no fine-tune) | No training cost | 0% exact match, wrong output format | Rejected |
4-bit NF4 quantization reduces VRAM from ~7GB to ~1.5GB with minimal quality loss on structured generation tasks like SQL. For Text-to-SQL, the output space is constrained (SQL keywords are fixed) — quantization has less impact than on open-ended generation tasks. The trade-off is a slight increase in inference latency (~10-15%) compared to FP16, which remains within our 2-second p50 target.
Selected Platform: Modal (Serverless GPU)
| Platform | Best For | Why Not Selected |
|---|---|---|
| Modal ✅ | Serverless, bursty traffic, easy HF integration | — Selected |
| HuggingFace Inference API | Quick prototyping | Limited GPU options, cold starts |
| AWS SageMaker | Enterprise, full control | Complex setup, high fixed costs |
| AWS Bedrock | Managed models only | Doesn't support custom fine-tunes |
| vLLM on Cloud VM | High sustained throughput | Idle GPU cost during off-hours |
| Ollama on EC2 | Simple self-hosted | No auto-scaling, manual management |
Modal is the best fit for this use case for three reasons:
1. Serverless scales to zero
Our traffic is bursty and business-hours-heavy. With Modal, we pay only for actual compute time. During off-hours (18 hours/day), we pay nothing. A persistent VM would charge 24/7 regardless of traffic.
2. Container image caching
Modal caches the model weights in the container image. After the first cold start (~30-60 seconds), subsequent requests are served from a warm container in under 2 seconds. This eliminates the biggest downside of serverless for LLM workloads.
3. Native Hugging Face integration
Modal can directly load our fine-tuned adapter from Hugging Face Hub using snapshot_download. No manual model management or custom container registries needed.
| Component | Configuration |
|---|---|
| GPU Type | NVIDIA A10G (24GB VRAM) |
| CPU | 4 vCPUs |
| Memory | 32GB RAM |
| Container timeout | 300 seconds |
| Min replicas (warm) | 0 (scales to zero) |
| Max replicas | 5 |
| Concurrency per container | 1 request |
| Cold start time | ~30-60 seconds (first request) |
| Warm request time | < 2 seconds |
| Region | us-east-1 (lowest latency for US users) |
| Traffic Level | Replicas | Behavior |
|---|---|---|
| 0 requests | 0 | Scales to zero, no cost |
| 1-50 req/hour | 1 | Single warm container |
| 50-200 req/hour | 2-3 | Auto-scales up |
| 200-500 req/hour | 4-5 | Peak capacity |
| Post-peak | 0-1 | Scales back down after 5 min idle |
User Request (HTTPS)
│
▼
Modal API Gateway
│
▼
Modal Container (A10G GPU)
├── Load base model (Qwen1.5-1.8B, 4-bit)
├── Load LoRA adapter (faltooz123/qwen1.5-sql-qlora-spider)
├── Run inference (< 2 seconds)
└── Return SQL query
│
▼
JSON Response → User
| Cost Component | Details | Monthly Estimate |
|---|---|---|
| Compute (A10G GPU) | ~25,000 requests × 2 sec = 14 GPU hours @ $1.10/hr | $15.40 |
| Container idle time | Warm container kept alive 2 hrs/day × 22 days | $4.84 |
| Storage (model weights) | 3.5GB base + 35MB adapter on Modal volume | $2.00 |
| Network (data transfer) | ~1KB per request × 25,000 requests | $0.50 |
| Monitoring (LangFuse) | Free tier (up to 50K traces/month) | $0.00 |
| Total Estimated | ~$22.74/month |
Monthly compute cost : $15.40
Monthly requests : 25,000
Cost per request : $0.00062
Cost per 1,000 requests : $0.62
Strategy 1 — Response Caching
Many business users ask the same questions repeatedly (e.g., "How many active users do we have today?"). Implementing a Redis cache keyed on (database_id, question_hash) would serve repeated queries instantly at zero GPU cost. Estimated cache hit rate: 20-30%, reducing monthly compute cost by ~$4.
Strategy 2 — Request Batching
Instead of processing each request individually, batch multiple requests together in a single GPU forward pass. Modal supports batching via container concurrency settings. For non-latency-sensitive requests (automated jobs), batching 4-8 requests together reduces GPU time by ~60%.
Strategy 3 — Aggressive Scale-to-Zero
Setting container idle timeout to 2 minutes (instead of default 10) ensures GPU is released quickly after traffic drops. During the 18 off-hours daily, this saves approximately 3-4 GPU hours per day.
Strategy 4 — Spot Instance Pricing
For batch processing jobs (e.g., nightly report generation), Modal supports spot-equivalent pricing. Running non-urgent requests on interruptible containers reduces compute cost by ~40%.
| Metric | Why It Matters | Alert Threshold | Tool |
|---|---|---|---|
| Latency p50 | Typical user experience | > 3 seconds | LangFuse |
| Latency p99 | Worst-case user experience | > 8 seconds | LangFuse |
| Error rate | API reliability | > 2% | Modal dashboard |
| Throughput (RPS) | Capacity planning | < 0.1 RPS sustained | CloudWatch |
| Token usage/request | Cost control | > 150 tokens/request | LiteLLM |
| GPU utilization | Resource efficiency | > 95% or < 10% | Modal dashboard |
| Cold start rate | User experience | > 20% of requests | Custom logging |
| SQL parse success rate | Output quality | < 80% valid SQL | Custom evaluation |
| Tool | Purpose | Why Selected |
|---|---|---|
| LangFuse | LLM request tracing, prompt/response logging, latency tracking | Open source, self-hostable, purpose-built for LLMs |
| Modal Dashboard | Container health, GPU utilization, scaling events | Built-in, no setup required |
| LiteLLM | Token usage tracking, cost per request, budget alerts | Lightweight, integrates with any LLM endpoint |
| Sentry | Error tracking and alerting | Industry standard, free tier sufficient |
Alert 1 — High Latency
Every request logs the following to LangFuse:
All API endpoints are protected using API key authentication. Each client application receives a unique API key issued through a key management service. Keys are passed in the Authorization: Bearer <key> header and validated on every request before reaching the model.
# Example authentication header headers = { "Authorization": "Bearer sk-sql-xxxxxxxxxxxxxxxx", "Content-Type": "application/json" }
| User Type | Rate Limit | Burst Limit |
|---|---|---|
| Standard API key | 60 requests/minute | 100 requests/minute |
| Premium API key | 300 requests/minute | 500 requests/minute |
| Internal service | 1,000 requests/minute | 2,000 requests/minute |
Rate limits are enforced at the API gateway level using a token bucket algorithm. Clients exceeding limits receive a 429 Too Many Requests response with a Retry-After header.
Prompt Injection Prevention:
User inputs are sanitized before being inserted into the prompt template. The following patterns are blocked:
<|im_start|>, <|im_end|>, or similar control tokensdef validate_input(question: str, db_id: str) -> bool: # Block control tokens blocked_tokens = ["<|im_start|>", "<|im_end|>", "system\n", "assistant\n"] for token in blocked_tokens: if token in question: return False # Block excessive length if len(question) > 500: return False return True
| Resource | Who Can Access |
|---|---|
| Modal deployment console | ML team only |
| LangFuse traces | ML team + Engineering leads |
| API keys management | Engineering leads only |
| Model weights (HF Hub) | Public (read-only) |
| Monitoring dashboards | ML team + Management |
This plan demonstrates a cost-effective, scalable path to production for a fine-tuned LLM. The key insight is matching the deployment platform to the traffic pattern — bursty, business-hours-heavy traffic is perfectly served by serverless infrastructure, while a persistent VM would waste money on idle GPU time.
The combination of a small fine-tuned model (1.8B parameters) with serverless deployment results in a system that costs approximately **
Accuracy vs cost: The 1.8B model achieves 6% exact match on the Spider benchmark. A 7B model would achieve significantly higher accuracy but at 4x the compute cost. For a production system, this trade-off should be revisited after measuring real user satisfaction.
Cold starts: The first request after a period of inactivity takes 30-60 seconds while the container initializes. This is acceptable for a business tool where users tolerate occasional delays, but would be unacceptable for a real-time consumer application.
Single model limitation: This deployment serves a single fine-tuned model. A production system would benefit from A/B testing infrastructure to compare model versions before full rollout.
This publication presents a complete deployment plan for a fine-tuned Text-to-SQL model in a production environment. Using Modal's serverless GPU infrastructure, the deployment achieves:
The plan demonstrates that a fine-tuned open-weights model can be deployed cost-effectively for real business use cases without enterprise-scale infrastructure budgets. The serverless approach is particularly well-suited for internal business tools with predictable, business-hours-heavy traffic patterns.
The author thanks the Modal Labs team for their excellent documentation and free credits for academic projects, the HuggingFace team for the Hub infrastructure that makes model sharing seamless, and the LangFuse team for building an open-source LLM observability platform. This deployment plan builds directly on the fine-tuning work completed in Module 1 of the LLMED Certification Program.
LLM_DEPLOYMENT/
├── README.md # Deployment guide and setup instructions
├── requirements.txt # Python dependencies
├── deploy/
│ └── modal_deploy.py # Modal serverless deployment script
├── client/
│ ├── client.py # HTTP client for the deployed API
│ └── test_requests.py # Example SQL generation test cases
└── config/
└── model_config.yaml # Model and deployment configuration
# Hugging Face access token HUGGINGFACE_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx # Modal API token (set via modal token set) MODAL_TOKEN_ID=ak-xxxxxxxxxxxxxxxxxxxx MODAL_TOKEN_SECRET=as-xxxxxxxxxxxxxxxxxxxx # LangFuse observability LANGFUSE_PUBLIC_KEY=pk-lf-xxxxxxxxxxxxxxxxxxxx LANGFUSE_SECRET_KEY=sk-lf-xxxxxxxxxxxxxxxxxxxx LANGFUSE_HOST=https://cloud.langfuse.com # API authentication SQL_API_KEY=sk-sql-xxxxxxxxxxxxxxxxxxxx
Pre-deployment:
□ Modal account created and CLI authenticated
□ HuggingFace token set with read access
□ LangFuse project created and keys configured
□ model_config.yaml reviewed and updated
□ requirements.txt installed locally
Deployment:
□ modal deploy deploy/modal_deploy.py
□ Endpoint URL copied from Modal dashboard
□ client.py updated with endpoint URL
□ test_requests.py run and all tests pass
Post-deployment:
□ LangFuse traces appearing for test requests
□ Modal dashboard showing GPU utilization
□ Latency p50 < 2 seconds on warm requests
□ Error rate < 1% on test suite
| Resource | Link |
|---|---|
| 🤗 Fine-Tuned Model | faltooz123/qwen1.5-sql-qlora-spider |
| 💻 Deployment Repo | pamuarun/LLM_DEPLOYMENT |
| 📦 Module 1 Repo | pamuarun/LLM_FINE_TUNING |
| 🚀 Modal Docs | modal.com/docs |
| 📊 LangFuse | langfuse.com |