
GitHub Repository: https://github.com/samrat-kar/llm-deploy-monitor
HuggingFace Model: https://huggingface.co/samrat-kar/qwen2.5-1.5b-sql-qlora
W&B Report: QLoRA Fine-Tuning of Qwen2.5-1.5B – Training Curves & Metrics
Module 1 Publication: From Prose to SQL: QLoRA Fine-Tuning of Qwen2.5-1.5B
Tags: LLM Deployment, RunPod, Text-to-SQL, FastAPI, QLoRA, Qwen2.5, Monitoring, LLMED, Production Engineering
In Module 1 of this program, I fine-tuned Qwen2.5-1.5B-Instruct on ~78,000 text-to-SQL examples using QLoRA, achieving a 75.4% exact-match rate on a held-out validation set, up from 0% with the base model. The resulting model, samrat-kar/qwen2.5-1.5b-sql-qlora, is published on HuggingFace.
This publication covers the deployment plan for that model: taking it from a HuggingFace artifact to a live HTTP service, including the infrastructure choices, cost breakdown, monitoring strategy, and security considerations.
Writing SQL queries is a recurring bottleneck for data analysts and backend developers, especially when working with unfamiliar schemas or performing exploratory analysis. The goal is a lightweight internal tool that accepts a natural-language question and a set of CREATE TABLE statements, and returns a single executable SQL query. No chat interface, no multi-turn conversation: just a clean, fast query generation endpoint.
| User Type | Interaction |
|---|---|
| Data analysts | Paste schema + question into a simple web form or call the API from a notebook |
| Backend developers | Call the /generate endpoint from their IDE or tooling |
| ML engineers | Use the API to benchmark the model against new SQL datasets |
The service is internal, not consumer-facing. Usage is limited to a small team of 5–15 people.
Example 1 – Aggregation
Question: How many singers do we have?
Schema:
CREATE TABLE singer (Singer_ID int, Name text, Country text, Song_Name text)
Output:
SELECT count(*) FROM singer
Example 2 – Filtering
Question: Find all employees in the Engineering department with a salary above 80000.
Schema:
CREATE TABLE employee (Employee_ID int, Name text, Department text, Salary real)
Output:
SELECT * FROM employee WHERE Department = 'Engineering' AND Salary > 80000
Example 3 – JOIN
Question: List each student's name and the dorm they live in.
Schema:
CREATE TABLE student (StuID int, Fname text, LName text) CREATE TABLE lives_in (stuid int, dormid int) CREATE TABLE dorm (dormid int, dorm_name text)
Output:
SELECT T1.Fname, T1.LName, T3.dorm_name FROM student AS T1 JOIN lives_in AS T2 ON T1.StuID = T2.stuid JOIN dorm AS T3 ON T2.dormid = T3.dormid
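As a sketch of how a client might assemble such a request for the /generate endpoint (the JSON field names `question` and `context` match the server snippets later in this plan, but the exact request contract is an assumption):

```python
import json

def build_payload(question: str, context: str) -> dict:
    """Assemble the JSON body for a POST /generate call."""
    return {"question": question, "context": context}

payload = build_payload(
    "How many singers do we have?",
    "CREATE TABLE singer (Singer_ID int, Name text, Country text, Song_Name text)",
)
body = json.dumps(payload)  # what a notebook or script would send over HTTPS
```

From a notebook, this body would be POSTed to the pod's proxy URL with the service's API key in a header.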
| Criterion | Target |
|---|---|
| Exact-match accuracy on validation set | ≥ 70% |
| p50 inference latency | < 500 ms |
| p99 inference latency | < 2,000 ms |
| API error rate | < 1% |
| Service availability during business hours | ≥ 99% |
This is an internal developer tool, so traffic is modest and bursty: concentrated during working hours, quiet at night and on weekends.
| Period | Estimated Requests |
|---|---|
| Typical business hour | 10–30 requests |
| Peak hour (end of sprint, demo day) | 50–80 requests |
| Daily total (weekday) | 100–200 requests |
| Daily total (weekend) | 0–10 requests |
These figures inform the deployment strategy: a persistent pod that runs during business hours and is stopped overnight is more cost-effective than always-on infrastructure.
| Aspect | Value |
|---|---|
| Model | samrat-kar/qwen2.5-1.5b-sql-qlora |
| Base model | Qwen/Qwen2.5-1.5B-Instruct |
| Model source | HuggingFace Hub (custom fine-tuned) |
| Parameter count | 1.54 billion |
| Quantization | 4-bit NF4 (double quantization, bfloat16 compute) |
| Context length | 512 tokens (training max) |
| Max output tokens | 256 |
| Task | Text-to-SQL (natural language + schema → SQL query) |
The model was fine-tuned specifically on the b-mc2/sql-create-context dataset (~78K examples) for this exact task. It outperforms the base model dramatically on SQL generation:
| Metric | Base Model | Fine-Tuned | Improvement |
|---|---|---|---|
| Exact Match | 0.00% | 75.40% | +75.40 pp |
| ROUGE-L | 0.8784 | 0.9856 | +10.72 pp |
The 1.5B parameter size is a deliberate trade-off: it fits comfortably on a single consumer GPU with 4-bit quantization (~3 GB VRAM), inference is fast (300–800 ms on an RTX 3090), and the focused fine-tuning compensates for the smaller capacity. A larger model such as Llama 3.1 8B or Mistral 7B would improve accuracy on complex multi-table joins but would cost 4–5× more per hour for this use case.
Size vs. quality: A 7B model would score higher on complex joins and nested subqueries. For our use case (mostly single-table aggregations and two-table joins), the 1.5B fine-tuned model delivers comparable quality at a fraction of the cost.
Quantization impact: 4-bit NF4 quantization reduces the model from ~6 GB (bfloat16) to ~1.5 GB on disk and ~3 GB VRAM. Empirically, there is no measurable accuracy drop on text-to-SQL tasks at this compression level, since SQL generation does not require high numerical precision.
API vs. self-hosted: Using the OpenAI API or HuggingFace Inference API would eliminate infrastructure management, but would cost 10–30× more per request for this usage volume and would prevent fine-tuned model serving without additional setup.
I chose a RunPod GPU Pod over serverless or managed API options for its low hourly cost, full control over how the model is loaded and served, and the absence of cold-start latency for interactive use.
Alternatives considered:
| Alternative | Why Rejected |
|---|---|
| HuggingFace Inference Endpoints | Higher cost (~$0.90/hr for GPU), less control over model loading |
| Modal (serverless) | Cold-start latency (10–20 s model load) unacceptable for interactive use |
| AWS SageMaker | Significant setup overhead and higher cost; overkill for an internal tool |
| Local deployment | Not accessible to the team; not reproducible |
| Component | Choice | Rationale |
|---|---|---|
| GPU | NVIDIA RTX 3090 (24 GB VRAM) | 1.5B 4-bit model needs ~3 GB VRAM; 24 GB provides headroom for batching |
| Instance type | RunPod Community Cloud (spot) | ~40% cheaper than Secure Cloud; acceptable for a dev tool |
| Container image | RunPod PyTorch 2.3 (CUDA 12.1) | Pre-built, no Docker build step needed |
| Container disk | 30 GB | OS + Python packages (~5 GB) + buffer |
| Network volume | 20 GB | HuggingFace model cache (~3 GB) + logs + room to grow |
| Region | US East | Lowest latency for the team; good GPU availability |
```
User / Notebook
       │
       │ HTTPS (RunPod proxy)
       ▼
┌──────────────────────────────┐
│          RunPod Pod          │
│                              │
│ FastAPI server (port 8000)   │
│   ├── GET /health            │
│   └── POST /generate         │
│            │                 │
│            ▼                 │
│   Transformers pipeline      │
│ (model loaded in GPU VRAM)   │
└──────────────────────────────┘
```
startup.sh installs dependencies → FastAPI starts → the model loads from the HF cache → /health returns ok.

To minimize cost, the pod runs only during business hours.
This gives ~50 GPU-hours/week at $0.39/hr:
| Cost Component | Calculation | Monthly Estimate |
|---|---|---|
| Compute – RTX 3090 @ $0.39/hr | 10 hr/day × 22 weekdays | $85.80 |
| Network volume (20 GB) | $0.10/GB/month | $2.00 |
| Data transfer (egress) | ~100 MB/day × 22 days | $0.00 (RunPod includes egress) |
| Monitoring (LangFuse) | Free tier (up to 50K traces/month) | $0.00 |
| Total | | ~$87.80/month |
At 100–200 requests/day × 22 weekdays ≈ 2,200–4,400 requests/month, the cost per 1,000 requests is approximately $20–$40.
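These figures can be reproduced with simple arithmetic (rates taken from the table above):

```python
rate_per_hour = 0.39   # RTX 3090, RunPod Community Cloud spot
gpu_hours = 10 * 22    # 10 hr/day over 22 weekdays
compute = rate_per_hour * gpu_hours          # compute line item
volume = 20 * 0.10                           # 20 GB network volume @ $0.10/GB/month
total = compute + volume                     # egress and LangFuse free tier are $0

per_1k_low = total / 4400 * 1000             # at 4,400 requests/month
per_1k_high = total / 2200 * 1000            # at 2,200 requests/month
print(f"compute=${compute:.2f} total=${total:.2f} "
      f"per-1k=${per_1k_low:.2f}-${per_1k_high:.2f}")
```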
1. Response caching
Text-to-SQL queries from a dev team are highly repetitive: the same question against the same schema will appear many times. A simple in-process LRU cache (keyed on sha256(question + context)) can eliminate 20–40% of GPU compute:
```python
import hashlib  # for sha256(question + context) keys in the persistent variant
from functools import lru_cache

@lru_cache(maxsize=512)
def cached_generate(question: str, context: str) -> str:
    ...  # run the model; lru_cache memoizes repeated (question, context) pairs
```
For a persistent cache that survives pod restarts, a Redis instance on the same pod (or RunPod volume) stores the (hash → sql) mapping.
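A sketch of that keying scheme, with an in-memory dict standing in for Redis:

```python
import hashlib

cache: dict[str, str] = {}  # stand-in for Redis: sha256 key -> generated SQL

def cache_key(question: str, context: str) -> str:
    """Stable key for a (question, schema) pair."""
    return hashlib.sha256((question + context).encode("utf-8")).hexdigest()

def generate_with_cache(question: str, context: str, generate) -> str:
    key = cache_key(question, context)
    if key not in cache:
        cache[key] = generate(question, context)  # GPU call only on a miss
    return cache[key]
```

With Redis, the dict lookups become `get`/`set` calls on the same keys, so cached answers survive pod restarts.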
2. Pod scheduling (start/stop automation)
RunPod's API supports programmatic pod start/stop. A small cron job or GitHub Action can start the pod at the beginning of the workday and stop it after hours.
This eliminates ~14 idle GPU-hours/day that would otherwise accrue on an always-on pod.
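The reconciliation logic can be sketched in plain Python (the 08:00–18:00 weekday window is an assumption matching the ~10 hr/day schedule above; the runpodctl commands in the comments are assumed CLI invocations, so verify them against RunPod's docs):

```python
from datetime import datetime

BUSINESS_START, BUSINESS_END = 8, 18  # assumed 08:00-18:00 window (10 hr/day)

def desired_state(now: datetime) -> str:
    """Return 'running' during weekday business hours, 'stopped' otherwise."""
    is_weekday = now.weekday() < 5  # Mon=0 .. Fri=4
    in_hours = BUSINESS_START <= now.hour < BUSINESS_END
    return "running" if (is_weekday and in_hours) else "stopped"

# A cron job (or scheduled GitHub Action) compares this with the pod's actual
# state and reconciles, e.g. via RunPod's CLI (hypothetical invocations):
#   runpodctl start pod <pod-id>
#   runpodctl stop pod <pod-id>
```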
3. 4-bit quantization (already applied)
The model is already loaded in 4-bit NF4. This reduces VRAM usage from ~6 GB to ~3 GB, making it possible to use a cheaper GPU (e.g., RTX 3080 10 GB at $0.26/hr) and reducing the cost by ~33%.
4. Request batching (future)
The current server processes requests one at a time. If peak load grows, enabling dynamic batching (grouping concurrent requests into a single forward pass) can increase throughput 3–5× on the same hardware without additional cost. transformers supports this natively via the batch_size argument in the pipeline API.
5. Spot instance pricing
RunPod Community Cloud (spot) is already 30–40% cheaper than Secure Cloud. Spot pods can be interrupted, but for a developer tool with retryable requests this is acceptable.
| Metric | Why It Matters | Alert Threshold |
|---|---|---|
| Latency p50 | Typical user experience | > 800 ms |
| Latency p99 | Worst-case user experience | > 3,000 ms |
| Error rate (5xx) | Service reliability | > 2% over 5 min |
| Throughput (req/min) | Capacity headroom | Drops to 0 during business hours (unexpected outage) |
| Tokens generated (avg) | Runaway generation detection | > 200 tokens/request |
| GPU utilization | Resource efficiency | < 5% (wasted) or > 98% (saturated) |
| GPU VRAM used | OOM prevention | > 90% |
| Cache hit rate | Cost-saving effectiveness | < 10% (caching not helping) |
| Tool | Purpose | Why |
|---|---|---|
| Application logs (FastAPI) | Per-request latency, token counts, errors | Already built into the server; zero added cost |
| LangFuse (free tier) | LLM tracing: logs prompt, output, and latency per request; enables quality monitoring over time | Purpose-built for LLM observability; free up to 50K traces/month |
| RunPod dashboard | GPU utilization, VRAM, pod uptime | Built-in, no setup required |
The server logs every request:
2026-04-10 09:12:43 INFO Generated 9 tokens in 312ms
These logs are written to stdout (visible in RunPod terminal) and can be tailed or piped to a file on the volume.
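Because the log format is fixed, per-request metrics can be recovered from stdout with a small parser (a sketch; the regex assumes exactly the format shown above):

```python
import re

LOG_RE = re.compile(r"Generated (\d+) tokens in (\d+)ms")

def parse_log_line(line: str):
    """Extract (tokens, latency_ms) from a server log line, or None."""
    m = LOG_RE.search(line)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))

tokens, latency_ms = parse_log_line(
    "2026-04-10 09:12:43 INFO Generated 9 tokens in 312ms"
)
```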
To add LangFuse tracing to the server, wrap the generation call:
```python
from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_PUBLIC_KEY and LANGFUSE_SECRET_KEY from env
trace = langfuse.trace(name="sql-generate")
span = trace.span(
    name="model-inference",
    input={"question": req.question, "context": req.context},
)
# ... run inference ...
span.end(
    output={"sql": sql},
    metadata={"latency_ms": latency_ms, "tokens": tokens_generated},
)
```
This gives a time-series view of per-request latency, token counts, and prompt/output pairs, so quality drift can be spotted over time.
Alert conditions:
| Condition | Severity | Notification |
|---|---|---|
| p99 latency > 3 s for 5 consecutive requests | Warning | Slack |
| Error rate > 5% over 10 min | Critical | Email + Slack |
| Pod unreachable (/health fails) | Critical | Email + Slack |
| GPU VRAM > 90% | Warning | Slack |
Notification channels: Email (primary) and a Slack #alerts-sql-api channel via webhook.
Runbook for common issues:
| Issue | Likely Cause | Resolution |
|---|---|---|
| High latency (p99 > 3s) | Concurrent requests queuing | Check GPU utilization; if saturated, reduce max concurrent or add batching |
| OOM crash | VRAM exhausted | Reduce max_new_tokens; check for unusually long context inputs |
| Model returns empty string | Generation hitting eos_token immediately | Check tokenizer pad_token_id config; inspect raw prompt |
| Pod unreachable | Pod stopped or preempted | Restart from RunPod dashboard |
RunPod's proxy provides a pod-specific URL (https://<pod-id>-8000.proxy.runpod.net) that is not guessable. As an additional layer, the server validates a static API key passed as a request header:
```python
import os

from fastapi import HTTPException, Security
from fastapi.security.api_key import APIKeyHeader

API_KEY_HEADER = APIKeyHeader(name="X-API-Key")

def verify_api_key(api_key: str = Security(API_KEY_HEADER)):
    # Reject any request whose X-API-Key header doesn't match the pod's env var.
    if api_key != os.environ["SQL_API_KEY"]:
        raise HTTPException(status_code=403, detail="Invalid API key")
```
The key is set as an environment variable on the pod and never committed to the repository.
Since the service is internal and low-traffic, a simple per-IP rate limit is sufficient. Using slowapi:
```python
from slowapi import Limiter
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)

@app.post("/generate")
@limiter.limit("30/minute")
def generate(request: Request, req: GenerateRequest):
    ...
```
Limit: 30 requests/minute per IP: generous enough for normal use, restrictive enough to prevent runaway loops.
Prompt injection is the primary attack surface: a malicious user could craft a question or context field to override the system prompt.
Mitigations:
- Input length limits: question capped at 500 characters, context capped at 2,000 characters. Inputs exceeding these limits return a 422 error before the model is called.
- Output validation: the generated text must begin with an allowed SQL keyword (SELECT, INSERT, UPDATE, DELETE, WITH). Outputs that don't match are flagged and logged, and a 422 is returned.
- Fixed system prompt: the system prompt is loaded from model_config.yaml and cannot be overridden via the API.

SQL queries typically don't contain PII, but the question field might (e.g., "Find all records for customer John Smith"). Policy:
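The output-validation mitigation reduces to a first-token check; a minimal sketch (the keyword set mirrors the list in the mitigations above):

```python
ALLOWED_KEYWORDS = {"SELECT", "INSERT", "UPDATE", "DELETE", "WITH"}

def is_valid_sql_output(sql: str) -> bool:
    """Accept only outputs whose first token is an allowed SQL keyword."""
    tokens = sql.strip().split(None, 1)
    return bool(tokens) and tokens[0].upper() in ALLOWED_KEYWORDS
```

The server would return a 422 and log the raw output whenever this check fails.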
| Resource | Who Can Access |
|---|---|
| RunPod pod (SSH) | Deployment engineer only (RunPod account credentials) |
| /generate endpoint | Team members with the API key |
| LangFuse dashboard | ML team only (LangFuse account) |
| Model weights (HuggingFace) | Public (read-only) |
This deployment plan covers the full lifecycle of taking samrat-kar/qwen2.5-1.5b-sql-qlora from a HuggingFace artifact to a production-ready internal service: infrastructure selection, cost modeling and optimization, monitoring and alerting, and security controls.