
GitHub Repository: https://github.com/samrat-kar/llm-deploy-monitor
HuggingFace Model: https://huggingface.co/samrat-kar/qwen2.5-1.5b-sql-qlora
W&B Report: QLoRA Fine-Tuning of Qwen2.5-1.5B – Training Curves & Metrics
Module 1 Publication: From Prose to SQL: QLoRA Fine-Tuning of Qwen2.5-1.5B
Tags: LLM Deployment, RunPod, Text-to-SQL, FastAPI, QLoRA, Qwen2.5, Monitoring, LLMED, Production Engineering
In Module 1 of this program, I fine-tuned Qwen2.5-1.5B-Instruct on ~78,000 text-to-SQL examples using QLoRA, achieving a 75.4% exact-match rate on a held-out validation set, up from 0% with the base model. The resulting model, samrat-kar/qwen2.5-1.5b-sql-qlora, is published on HuggingFace.
This publication covers the deployment plan for that model: taking it from a HuggingFace artifact to a live HTTP service, including the infrastructure choices, cost breakdown, monitoring strategy, and security considerations.
Writing SQL queries is a recurring bottleneck for data analysts and backend developers, especially when working with unfamiliar schemas or performing exploratory analysis. The goal is a lightweight internal tool that accepts a natural-language question and a set of CREATE TABLE statements, and returns a single executable SQL query. No chat interface, no multi-turn conversation: just a clean, fast query generation endpoint.
| User Type | Interaction |
|---|---|
| Data analysts | Paste schema + question into a simple web form or call the API from a notebook |
| Backend developers | Call the /generate endpoint from their IDE or tooling |
| ML engineers | Use the API to benchmark the model against new SQL datasets |
The service is internal, not consumer-facing. Usage is limited to a small team of 5–15 people.
Example 1 – Aggregation
Question: How many singers do we have?
Schema:
CREATE TABLE singer (Singer_ID int, Name text, Country text, Song_Name text)
Output:
SELECT count(*) FROM singer
Example 2 – Filtering
Question: Find all employees in the Engineering department with a salary above 80000.
Schema:
CREATE TABLE employee (Employee_ID int, Name text, Department text, Salary real)
Output:
SELECT * FROM employee WHERE Department = 'Engineering' AND Salary > 80000
Example 3 – JOIN
Question: List each student's name and the dorm they live in.
Schema:
CREATE TABLE student (StuID int, Fname text, LName text) CREATE TABLE lives_in (stuid int, dormid int) CREATE TABLE dorm (dormid int, dorm_name text)
Output:
SELECT T1.Fname, T1.LName, T3.dorm_name FROM student AS T1 JOIN lives_in AS T2 ON T1.StuID = T2.stuid JOIN dorm AS T3 ON T2.dormid = T3.dormid
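As a sketch of how a client might assemble such a request for the /generate endpoint (the JSON field names `question` and `context` match the server snippets later in this plan, but the exact request contract is an assumption):

```python
import json

def build_payload(question: str, context: str) -> dict:
    """Assemble the JSON body for a POST /generate call."""
    return {"question": question, "context": context}

payload = build_payload(
    "How many singers do we have?",
    "CREATE TABLE singer (Singer_ID int, Name text, Country text, Song_Name text)",
)
body = json.dumps(payload)  # what a notebook or script would send over HTTPS
```

From a notebook, this body would be POSTed to the pod's proxy URL with the service's API key in a header.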
| Criterion | Target |
|---|---|
| Exact-match accuracy on validation set | ≥ 70% |
| p50 inference latency | < 500 ms |
| p99 inference latency | < 2,000 ms |
| API error rate | < 1% |
| Service availability during business hours | ≥ 99% |
This is an internal developer tool, so traffic is modest and bursty: concentrated during working hours, quiet at night and on weekends.
| Period | Estimated Requests |
|---|---|
| Typical business hour | 10–30 requests |
| Peak hour (end of sprint, demo day) | 50–80 requests |
| Daily total (weekday) | 100–200 requests |
| Daily total (weekend) | 0–10 requests |
These figures inform the deployment strategy: a persistent pod that runs during business hours and is stopped overnight is more cost-effective than always-on infrastructure.
| Aspect | Value |
|---|---|
| Model | samrat-kar/qwen2.5-1.5b-sql-qlora |
| Base model | Qwen/Qwen2.5-1.5B-Instruct |
| Model source | HuggingFace Hub (custom fine-tuned) |
| Parameter count | 1.54 billion |
| Quantization | 4-bit NF4 (double quantization, bfloat16 compute) |
| Context length | 512 tokens (training max) |
| Max output tokens | 256 |
| Task | Text-to-SQL (natural language + schema → SQL query) |
The model was fine-tuned specifically on the b-mc2/sql-create-context dataset (~78K examples) for this exact task. It outperforms the base model dramatically on SQL generation:
| Metric | Base Model | Fine-Tuned | Improvement |
|---|---|---|---|
| Exact Match | 0.00% | 75.40% | +75.40 pp |
| ROUGE-L | 0.8784 | 0.9856 | +10.72 pp |
The 1.5B parameter size is a deliberate trade-off: it fits comfortably on a single consumer GPU with 4-bit quantization (~3 GB VRAM), inference is fast (300–800 ms on an RTX 3090), and the focused fine-tuning compensates for the smaller capacity. A larger model such as Llama 3.1 8B or Mistral 7B would improve accuracy on complex multi-table joins but would cost 4–5× more per hour for this use case.
Size vs. quality: A 7B model would score higher on complex joins and nested subqueries. For our use case (mostly single-table aggregations and two-table joins), the 1.5B fine-tuned model delivers comparable quality at a fraction of the cost.
Quantization impact: 4-bit NF4 quantization reduces the model from ~6 GB (bfloat16) to ~1.5 GB on disk and ~3 GB VRAM. Empirically, there is no measurable accuracy drop on text-to-SQL tasks at this compression level, since SQL generation does not require high numerical precision.
API vs. self-hosted: Using the OpenAI API or HuggingFace Inference API would eliminate infrastructure management, but would cost 10–30× more per request for this usage volume and would prevent fine-tuned model serving without additional setup.
I chose a RunPod GPU Pod over serverless or managed API options for its low hourly cost, full control over how the model is loaded and served, and the absence of cold-start latency for interactive use.
Alternatives considered:
| Alternative | Why Rejected |
|---|---|
| HuggingFace Inference Endpoints | Higher cost (~$0.90/hr for GPU), less control over model loading |
| Modal (serverless) | Cold-start latency (10–20 s model load) unacceptable for interactive use |
| AWS SageMaker | Significant setup overhead and higher cost; overkill for an internal tool |
| Local deployment | Not accessible to the team; not reproducible |
| Component | Choice | Rationale |
|---|---|---|
| GPU | NVIDIA RTX 3090 (24 GB VRAM) | 1.5B 4-bit model needs ~3 GB VRAM; 24 GB provides headroom for batching |
| Instance type | RunPod Community Cloud (spot) | ~40% cheaper than Secure Cloud; acceptable for a dev tool |
| Container image | RunPod PyTorch 2.3 (CUDA 12.1) | Pre-built, no Docker build step needed |
| Container disk | 30 GB | OS + Python packages (~5 GB) + buffer |
| Network volume | 20 GB | HuggingFace model cache (~3 GB) + logs + room to grow |
| Region | US East | Lowest latency for the team; good GPU availability |
```
User / Notebook
       │
       │ HTTPS (RunPod proxy)
       ▼
┌──────────────────────────────┐
│          RunPod Pod          │
│                              │
│ FastAPI server (port 8000)   │
│   ├── GET /health            │
│   └── POST /generate         │
│            │                 │
│            ▼                 │
│   Transformers pipeline      │
│ (model loaded in GPU VRAM)   │
└──────────────────────────────┘
```
startup.sh installs dependencies → FastAPI starts → the model loads from the HF cache → /health returns ok.

To minimize cost, the pod runs only during business hours.
This gives ~50 GPU-hours/week at $0.39/hr:
| Cost Component | Calculation | Monthly Estimate |
|---|---|---|
| Compute – RTX 3090 @ $0.39/hr | 10 hr/day × 22 weekdays | $85.80 |
| Network volume (20 GB) | $0.10/GB/month | $2.00 |
| Data transfer (egress) | ~100 MB/day × 22 days | $0.00 (RunPod includes egress) |
| Monitoring (LangFuse) | Free tier (up to 50K traces/month) | $0.00 |
| Total | | ~$87.80/month |
At 100–200 requests/day × 22 weekdays ≈ 2,200–4,400 requests/month, the cost per 1,000 requests is approximately $20–$40.
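These figures can be reproduced with simple arithmetic (rates taken from the table above):

```python
rate_per_hour = 0.39   # RTX 3090, RunPod Community Cloud spot
gpu_hours = 10 * 22    # 10 hr/day over 22 weekdays
compute = rate_per_hour * gpu_hours          # compute line item
volume = 20 * 0.10                           # 20 GB network volume @ $0.10/GB/month
total = compute + volume                     # egress and LangFuse free tier are $0

per_1k_low = total / 4400 * 1000             # at 4,400 requests/month
per_1k_high = total / 2200 * 1000            # at 2,200 requests/month
print(f"compute=${compute:.2f} total=${total:.2f} "
      f"per-1k=${per_1k_low:.2f}-${per_1k_high:.2f}")
```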
1. Response caching
Text-to-SQL queries from a dev team are highly repetitive: the same question against the same schema will appear many times. A simple in-process LRU cache (keyed on sha256(question + context)) can eliminate 20–40% of GPU compute:
```python
import hashlib  # for sha256(question + context) keys in the persistent variant
from functools import lru_cache

@lru_cache(maxsize=512)
def cached_generate(question: str, context: str) -> str:
    ...  # run the model; lru_cache memoizes repeated (question, context) pairs
```
For a persistent cache that survives pod restarts, a Redis instance on the same pod (or RunPod volume) stores the (hash → sql) mapping.
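A sketch of that keying scheme, with an in-memory dict standing in for Redis:

```python
import hashlib

cache: dict[str, str] = {}  # stand-in for Redis: sha256 key -> generated SQL

def cache_key(question: str, context: str) -> str:
    """Stable key for a (question, schema) pair."""
    return hashlib.sha256((question + context).encode("utf-8")).hexdigest()

def generate_with_cache(question: str, context: str, generate) -> str:
    key = cache_key(question, context)
    if key not in cache:
        cache[key] = generate(question, context)  # GPU call only on a miss
    return cache[key]
```

With Redis, the dict lookups become `get`/`set` calls on the same keys, so cached answers survive pod restarts.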
2. Pod scheduling (start/stop automation)
RunPod's API supports programmatic pod start/stop. A small cron job or GitHub Action can start the pod at the beginning of the workday and stop it after hours.
This eliminates ~14 idle GPU-hours/day that would otherwise accrue on an always-on pod.
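The reconciliation logic can be sketched in plain Python (the 08:00–18:00 weekday window is an assumption matching the ~10 hr/day schedule above; the runpodctl commands in the comments are assumed CLI invocations, so verify them against RunPod's docs):

```python
from datetime import datetime

BUSINESS_START, BUSINESS_END = 8, 18  # assumed 08:00-18:00 window (10 hr/day)

def desired_state(now: datetime) -> str:
    """Return 'running' during weekday business hours, 'stopped' otherwise."""
    is_weekday = now.weekday() < 5  # Mon=0 .. Fri=4
    in_hours = BUSINESS_START <= now.hour < BUSINESS_END
    return "running" if (is_weekday and in_hours) else "stopped"

# A cron job (or scheduled GitHub Action) compares this with the pod's actual
# state and reconciles, e.g. via RunPod's CLI (hypothetical invocations):
#   runpodctl start pod <pod-id>
#   runpodctl stop pod <pod-id>
```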
3. 4-bit quantization (already applied)
The model is already loaded in 4-bit NF4. This reduces VRAM usage from ~6 GB to ~3 GB, making it possible to use a cheaper GPU (e.g., RTX 3080 10 GB at $0.26/hr) and reducing the cost by ~33%.
4. Request batching (future)
The current server processes requests one at a time. If peak load grows, enabling dynamic batching (grouping concurrent requests into a single forward pass) can increase throughput 3–5× on the same hardware without additional cost. transformers supports this natively via the batch_size argument in the pipeline API.
5. Spot instance pricing
RunPod Community Cloud (spot) is already 30–40% cheaper than Secure Cloud. Spot pods can be interrupted, but for a developer tool with retryable requests this is acceptable.
| Metric | Why It Matters | Alert Threshold |
|---|---|---|
| Latency p50 | Typical user experience | > 800 ms |
| Latency p99 | Worst-case user experience | > 3,000 ms |
| Error rate (5xx) | Service reliability | > 2% over 5 min |
| Throughput (req/min) | Capacity headroom | Drops to 0 during business hours (unexpected outage) |
| Tokens generated (avg) | Runaway generation detection | > 200 tokens/request |
| GPU utilization | Resource efficiency | < 5% (wasted) or > 98% (saturated) |
| GPU VRAM used | OOM prevention | > 90% |
| Cache hit rate | Cost-saving effectiveness | < 10% (caching not helping) |
| Tool | Purpose | Why |
|---|---|---|
| Application logs (FastAPI) | Per-request latency, token counts, errors | Already built into the server; zero added cost |
| LangFuse (free tier) | LLM tracing: logs prompt, output, and latency per request; enables quality monitoring over time | Purpose-built for LLM observability; free up to 50K traces/month |
| RunPod dashboard | GPU utilization, VRAM, pod uptime | Built-in, no setup required |
The server logs every request:
2026-04-10 09:12:43 INFO Generated 9 tokens in 312ms
These logs are written to stdout (visible in RunPod terminal) and can be tailed or piped to a file on the volume.
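Because the log format is fixed, per-request metrics can be recovered from stdout with a small parser (a sketch; the regex assumes exactly the format shown above):

```python
import re

LOG_RE = re.compile(r"Generated (\d+) tokens in (\d+)ms")

def parse_log_line(line: str):
    """Extract (tokens, latency_ms) from a server log line, or None."""
    m = LOG_RE.search(line)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))

tokens, latency_ms = parse_log_line(
    "2026-04-10 09:12:43 INFO Generated 9 tokens in 312ms"
)
```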
To add LangFuse tracing to the server, wrap the generation call:
```python
from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_PUBLIC_KEY and LANGFUSE_SECRET_KEY from env
trace = langfuse.trace(name="sql-generate")
span = trace.span(
    name="model-inference",
    input={"question": req.question, "context": req.context},
)
# ... run inference ...
span.end(
    output={"sql": sql},
    metadata={"latency_ms": latency_ms, "tokens": tokens_generated},
)
```
This gives a time-series view of per-request latency, token counts, and prompt/output pairs, so quality drift can be spotted over time.
Alert conditions:
| Condition | Severity | Notification |
|---|---|---|
| p99 latency > 3 s for 5 consecutive requests | Warning | Slack |
| Error rate > 5% over 10 min | Critical | Email + Slack |
| Pod unreachable (/health fails) | Critical | Email + Slack |
| GPU VRAM > 90% | Warning | Slack |
Notification channels: Email (primary) and a Slack #alerts-sql-api channel via webhook.
Runbook for common issues:
| Issue | Likely Cause | Resolution |
|---|---|---|
| High latency (p99 > 3s) | Concurrent requests queuing | Check GPU utilization; if saturated, reduce max concurrent or add batching |
| OOM crash | VRAM exhausted | Reduce max_new_tokens; check for unusually long context inputs |
| Model returns empty string | Generation hitting eos_token immediately | Check tokenizer pad_token_id config; inspect raw prompt |
| Pod unreachable | Pod stopped or preempted | Restart from RunPod dashboard |
RunPod's proxy provides a pod-specific URL (https://<pod-id>-8000.proxy.runpod.net) that is not guessable. As an additional layer, the server validates a static API key passed as a request header:
```python
import os

from fastapi import HTTPException, Security
from fastapi.security.api_key import APIKeyHeader

API_KEY_HEADER = APIKeyHeader(name="X-API-Key")

def verify_api_key(api_key: str = Security(API_KEY_HEADER)):
    # Reject any request whose X-API-Key header doesn't match the pod's env var.
    if api_key != os.environ["SQL_API_KEY"]:
        raise HTTPException(status_code=403, detail="Invalid API key")
```
The key is set as an environment variable on the pod and never committed to the repository.
Since the service is internal and low-traffic, a simple per-IP rate limit is sufficient. Using slowapi:
```python
from slowapi import Limiter
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)

@app.post("/generate")
@limiter.limit("30/minute")
def generate(request: Request, req: GenerateRequest):
    ...
```
Limit: 30 requests/minute per IP: generous enough for normal use, restrictive enough to prevent runaway loops.
Prompt injection is the primary attack surface: a malicious user could craft a question or context field to override the system prompt.
Mitigations:
- Input length limits: question capped at 500 characters, context capped at 2,000 characters. Inputs exceeding these limits return a 422 error before the model is called.
- Output validation: the generated text must begin with an allowed SQL keyword (SELECT, INSERT, UPDATE, DELETE, WITH). Outputs that don't match are flagged and logged, and a 422 is returned.
- Fixed system prompt: the system prompt is loaded from model_config.yaml and cannot be overridden via the API.

SQL queries typically don't contain PII, but the question field might (e.g., "Find all records for customer John Smith"). Policy:
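The output-validation mitigation reduces to a first-token check; a minimal sketch (the keyword set mirrors the list in the mitigations above):

```python
ALLOWED_KEYWORDS = {"SELECT", "INSERT", "UPDATE", "DELETE", "WITH"}

def is_valid_sql_output(sql: str) -> bool:
    """Accept only outputs whose first token is an allowed SQL keyword."""
    tokens = sql.strip().split(None, 1)
    return bool(tokens) and tokens[0].upper() in ALLOWED_KEYWORDS
```

The server would return a 422 and log the raw output whenever this check fails.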
| Resource | Who Can Access |
|---|---|
| RunPod pod (SSH) | Deployment engineer only (RunPod account credentials) |
| /generate endpoint | Team members with the API key |
| LangFuse dashboard | ML team only (LangFuse account) |
| Model weights (HuggingFace) | Public (read-only) |
This deployment plan covers the full lifecycle of taking samrat-kar/qwen2.5-1.5b-sql-qlora from a HuggingFace artifact to a production-ready internal service: infrastructure selection, cost modeling and optimization, monitoring and alerting, and security controls.