
This publication documents the deployment strategy and implementation for a Nigerian news headline generation system, building upon a fine-tuned Qwen 2.5 0.5B model from Module 1. I implemented a custom FastAPI inference server using HuggingFace Transformers with automatic CPU/GPU detection, and containerised it with Docker for production portability. The deployment maintains the quality improvements from Module 1 (ROUGE-1: 31.81%, +17% over baseline) while achieving production-grade latency (<200ms on GPU) through 4-bit quantization. I conducted comprehensive cost analysis comparing GPU vs CPU deployment scenarios, and designed a monitoring and observability strategy using industry-standard tools. Security measures implemented in the current prototype include Pydantic input validation and structured error handling; rate limiting, authentication, and container hardening are designed for production and documented as a roadmap. The complete implementation includes automated testing with 20+ real Nigerian news examples across multiple categories.
Repository: github.com/Blaqadonis/nigerian-news-headlines-qlora
Model: huggingface.co/Blaqadonis/Qwen2.5-0.5B-Nigerian-News-Headlines
Nigerian news organisations face a measurable productivity bottleneck: journalists and editors spend 15–20 minutes crafting headlines per article, directly constraining content velocity and publication timeliness. Generic, Western-trained headline models fail to capture Nigerian context — local institutions (INEC, EFCC, NDDC), party abbreviations (PDP, APC), geographic markers, and the editorial conventions of outlets like Vanguard, Punch, and ThisDay.
The system must generate concise, engaging, editorially appropriate headlines (5–15 words) from Nigerian news article excerpts in real time (<200ms on GPU).
News Editors at Nigerian media outlets (Arise TV, Vanguard, Punch, ThisDay): rapid headline generation for breaking news; 50–200 articles/day per outlet.
Content Management Systems: automated headline suggestions in CMS workflow; 500–1,000 requests/day; API integration required.
News Aggregators: standardising headlines across sources; 10,000+ requests/day; high throughput and cost efficiency critical.
Example 1 — Political News
Input: "President Bola Tinubu has approved the appointment of new heads for several federal agencies as part of his administration's restructuring efforts..."
Fine-tuned output: "Tinubu Approves New Appointments for Federal Agencies" (7 words, concise)
Baseline output: "President Tinubu's Government Restructures Federal Agencies with New Leadership" (10 words, verbose)
Example 2 — Economic News
Input: "Nigeria's inflation rate climbed to 33.40% in July 2024, according to the latest report from the National Bureau of Statistics..."
Fine-tuned output: "Nigeria Inflation Rate Climbs to 33.40% in July 2024" (prioritises the key figure)
Baseline output: "National Bureau of Statistics Reports Record-High Inflation Driven by Food and Energy"
Example 3 — Sports News
Input: "The Super Eagles of Nigeria secured their qualification for the 2025 Africa Cup of Nations after a commanding 3-1 victory over South Africa..."
Fine-tuned output: "Super Eagles Secure AFCON 2025 Qualification After South Africa Win" (uses correct terminology)
Baseline output: "Nigeria's National Team Advances to AFCON Following Three-Goal Performance"
| Dimension | Metric | Target | Status |
|---|---|---|---|
| Quality | ROUGE-1 | > 30% | ✅ 31.81% |
| Quality | ROUGE-2 | > 10% | ✅ 11.59% |
| Quality | ROUGE-L | > 25% | ✅ 28.46% |
| Quality | Headline length (5–15 words) | > 90% compliance | ✅ Achieved |
| Performance | TTFT on GPU | < 200ms | ✅ 150–200ms |
| Performance | Error rate | < 1% | ✅ 0.3% in testing |
| Business | Uptime | 99.9% | 🎯 Production target |
| Business | Cost per headline | < $0.013 | ✅ |
| Phase | Requests / Day | Peak RPS | Use Case |
|---|---|---|---|
| MVP (Month 1–3) | 100–500 | 1–2 | Single newsroom pilot |
| Growth (Month 4–6) | 1,000–5,000 | 5–10 | 3–5 newsrooms |
| Scale (Month 7+) | 10,000+ | 20–50 | National platform |
Peak hours follow Nigerian news cycles: 6–9 AM and 5–8 PM WAT. Weekend traffic runs at roughly 40% of weekday volume. Major events (elections, disasters, sports finals) generate 5–10× spikes. Approximately 90% of traffic originates from Lagos, Abuja, and Port Harcourt.
| Aspect | Configuration |
|---|---|
| Model | Qwen 2.5 0.5B Instruct (fine-tuned) |
| Model Source | HuggingFace — Blaqadonis/Qwen2.5-0.5B-Nigerian-News-Headlines |
| Parameter Count | 494M |
| Quantization | 4-bit NF4 (BitsAndBytes) on GPU; float32 on CPU |
| Context Length | 512 tokens |
| Max Output Tokens | 50 (default); configurable up to 200 |
Selecting the right model for production deployment requires balancing quality, latency, cost, and operational complexity. Three candidate models were evaluated:
| Model | Parameters | VRAM (4-bit) | Latency (GPU) | Notes |
|---|---|---|---|---|
| Qwen 2.5 0.5B Instruct ✅ | 494M | ~4 GB | 150–200ms | Selected — optimal cost/quality |
| Qwen 2.5 1.5B Instruct | 1.5B | ~8 GB | 300–500ms | Higher quality, 2× cost |
| Qwen 2.5 3B Instruct | 3B | ~12 GB | 600–900ms | Overkill for headline task |
Qwen 2.5 0.5B Instruct was selected for four reasons:
AutoModelForCausalLM.from_pretrained() with no proprietary inference runtime required.

4-bit NF4 (NormalFloat) quantization is applied at inference time using the BitsAndBytes library. The exact configuration from deploy_nigerian_headlines.py:

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,         # nested quantization saves ~0.4 GB
        bnb_4bit_compute_dtype=torch.bfloat16,  # stable dequantization on CUDA
    )

Double quantization applies a second round of quantization to the quantization constants themselves, reducing memory by a further 0.1–0.4 GB. On CPU, the model loads at full float32 precision (~12 GB RAM, 800–1,200ms latency) — acceptable for low-traffic or batch deployments.
The base model was fine-tuned using QLoRA on 4,286 samples from the okite97/news-data dataset (AriseTv Nigerian news corpus). Only 1.08M of 494M parameters (0.22%) were trained, using rank-8 LoRA adapters targeting the query and value projection layers. Training completed in 18 minutes on a single T4 GPU.
| Metric | Baseline (Zero-shot) | Fine-tuned (QLoRA) | Improvement |
|---|---|---|---|
| ROUGE-1 | 27.16% | 31.81% | +17.13% |
| ROUGE-2 | 8.23% | 11.59% | +40.78% |
| ROUGE-L | 22.26% | 28.46% | +27.88% |
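The Improvement column is a relative change, (fine-tuned − baseline) / baseline. The sketch below reproduces that arithmetic from the rounded scores in the table. The function name is illustrative, and the results may differ from the printed percentages in the second decimal place if the original figures were computed from unrounded scores.

```python
def relative_improvement(baseline: float, finetuned: float) -> float:
    """Relative change from the baseline score to the fine-tuned score, in percent."""
    return (finetuned - baseline) / baseline * 100.0


# Rounded (baseline, fine-tuned) scores as they appear in the table above.
scores = {
    "ROUGE-1": (27.16, 31.81),
    "ROUGE-2": (8.23, 11.59),
    "ROUGE-L": (22.26, 28.46),
}

improvements = {
    name: relative_improvement(baseline, finetuned)
    for name, (baseline, finetuned) in scores.items()
}

for name, pct in improvements.items():
    print(f"{name}: +{pct:.2f}%")
```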
The merged model is publicly available at huggingface.co/Blaqadonis/Qwen2.5-0.5B-Nigerian-News-Headlines and is loaded directly by the deployment server.
Several deployment platforms were evaluated before choosing self-hosted FastAPI on EC2:
| Platform | Considered | Reason Not Selected |
|---|---|---|
| HuggingFace Inference Endpoints | Yes | Managed but $0.06–0.30/hour with no direct control over BitsAndBytes quantization config |
| Modal | Yes | Excellent serverless option but cold-start latency of 2–5 seconds is unacceptable for interactive headline generation |
| AWS SageMaker | Yes | Mature and production-grade but significant operational overhead and cost premium not justified for a 0.5B model |
| vLLM on EC2 | Yes | Optimal for high-throughput batching but adds infrastructure complexity not warranted at MVP scale |
| Ollama on EC2 | Yes | Simple CPU deployment but limited quantization control and no native streaming API matching our requirements |
| Self-hosted FastAPI on EC2 ✅ | Selected | Full control over quantization, prompt construction, and streaming; no per-token API costs; works offline; single command to run |
Self-hosted FastAPI was selected because the model is small enough that managed infrastructure overhead is not justified, the 4-bit NF4 quantization configuration requires direct BitsAndBytes access not available on managed endpoints, and the deployment must be reproducible across environments with no vendor lock-in.
Geographic Region: ap-south-1 (Mumbai) is the recommended primary region for Nigerian deployments, offering the lowest latency from Lagos and Abuja via West Africa's undersea cable connections. eu-west-1 (Ireland) is the recommended failover region. us-east-1 is avoided due to the additional round-trip latency from West Africa.
The deployment consists of a custom FastAPI application server built on HuggingFace Transformers, containerised with Docker, serving the quantized fine-tuned model over HTTP. This is a direct inference setup — not a managed inference service. The model is loaded via AutoModelForCausalLM, tokenization is handled by the application, and generation is called via model.generate(). This design gives full control over prompt construction, quantization, streaming, and benchmarking logic.
Request flow: client → FastAPI endpoint → build_prompt() → tokenizer → model.generate() → extract_headline() → JSON response
The server (deploy_nigerian_headlines.py) implements automatic device detection on startup:

    DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
    DTYPE = torch.bfloat16 if DEVICE == "cuda" else torch.float32

On GPU, the model loads with 4-bit quantization and device_map="auto". On CPU, it loads at full float32 precision with low_cpu_mem_usage=True. A single codebase handles both environments with no code changes needed.
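The branching described here can be written as plain data, which makes it easy to unit-test without a GPU. This is a sketch only: `select_load_settings` and the dictionary layout are hypothetical and not taken from deploy_nigerian_headlines.py. In a real server the argument would presumably come from `torch.cuda.is_available()`.

```python
def select_load_settings(cuda_available: bool) -> dict:
    """Return the load settings described in the text for GPU or CPU hosts."""
    if cuda_available:
        return {
            "device": "cuda",
            "dtype": "bfloat16",
            "quantization": "4-bit NF4",
            "model_kwargs": {"device_map": "auto"},
        }
    return {
        "device": "cpu",
        "dtype": "float32",
        "quantization": None,  # full precision on CPU
        "model_kwargs": {"low_cpu_mem_usage": True},
    }


gpu_settings = select_load_settings(True)
cpu_settings = select_load_settings(False)
print(gpu_settings["device"], cpu_settings["device"])
```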
| Endpoint | Method | Description |
|---|---|---|
| GET / | GET | Health check: returns status, model name, device, GPU name |
| POST /generate | POST | Single headline generation (greedy or sampled decoding) |
| POST /generate_stream | POST | Streaming generation: yields tokens as produced |
| POST /ttft_itl_batched | POST | Latency benchmarking: returns TTFT, TPOT, ITL statistics |
Each request is formatted using Qwen's chat template via tokenizer.apply_chat_template, wrapping the task instruction and excerpt in the expected user turn structure:

    You are an expert Nigerian news headline writer.
    Generate a compelling, concise headline that captures the essence
    of the following news excerpt. Make it engaging and suitable for Nigerian media.
    ## News Excerpt:
    {excerpt}
    ## Headline:

Post-generation, extract_headline() strips the prompt prefix, isolates the first output line, and removes surrounding quotes or special characters.
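A minimal sketch of the two helpers, assuming the behaviour described above. The function names come from the text, but the bodies here are illustrative rather than the repository's actual code. The sketch also leaves out the `tokenizer.apply_chat_template` step and works on plain strings.

```python
PROMPT_TEMPLATE = (
    "You are an expert Nigerian news headline writer.\n"
    "Generate a compelling, concise headline that captures the essence\n"
    "of the following news excerpt. Make it engaging and suitable for Nigerian media.\n"
    "## News Excerpt:\n"
    "{excerpt}\n"
    "## Headline:\n"
)


def build_prompt(excerpt: str) -> str:
    """Insert the user excerpt into the fixed instruction template."""
    return PROMPT_TEMPLATE.format(excerpt=excerpt.strip())


def extract_headline(generated: str, prompt: str) -> str:
    """Strip the prompt prefix, keep the first non-empty line, trim quotes and markup."""
    text = generated[len(prompt):] if generated.startswith(prompt) else generated
    for line in text.splitlines():
        line = line.strip()
        if line:
            return line.strip(" \"'*#")
    return ""


prompt = build_prompt("President Bola Tinubu has approved the appointment of new heads...")
raw_output = prompt + ' "Tinubu Approves New Appointments for Federal Agencies"\nExtra text'
headline = extract_headline(raw_output, prompt)
print(headline)
```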
The Docker image builds on nvidia/cuda:12.1.0-base-ubuntu22.04, installs Python 3.11, and exposes port 8000. The following features are implemented in the current repository:

- Model weights are cached in a named volume (model-cache:/app/model_cache), so the 4 GB model download only occurs on first start. Subsequent restarts use the cached weights.
- The deploy.resources block in docker-compose.yml reserves one NVIDIA GPU. Commenting it out and setting DEVICE=cpu switches to CPU-only mode without rebuilding the image.
- A health check polls GET / every 30 seconds with a 10-second timeout and 3 retries. The container is only marked healthy after the model has fully loaded and the server is responding.
- The restart policy unless-stopped ensures automatic recovery from crashes, OOM events, or host reboots.
- The HuggingFace token is supplied through the HF_TOKEN environment variable, not baked into the image.

The health check block in docker-compose.yml:

    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s

MVP (100–500 req/day): single container on a t3.xlarge (CPU) or g4dn.xlarge (GPU). No load balancer needed. The Docker health check and restart policy provide sufficient reliability for a pilot deployment.
Growth (1,000–5,000 req/day): two GPU containers behind an AWS Application Load Balancer. The model cache volume is mounted read-only on both containers, avoiding duplicate downloads. The ALB distributes requests across both containers and handles health-based routing automatically.
Scale (10,000+ req/day): Kubernetes (EKS) or Docker Swarm with horizontal pod autoscaling based on CPU utilisation and request queue depth. Redis response caching is added for repeated or near-identical excerpts (common in breaking news cycles), reducing effective GPU load by an estimated 15–20%.
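One way the response cache could be keyed is sketched below. The class and function names are hypothetical, and an in-memory dictionary stands in for Redis so the example runs on its own. The normalisation only collapses whitespace and case, so it catches re-submitted excerpts but not truly paraphrased ones.

```python
import hashlib
import re
from typing import Callable, Dict


def cache_key(excerpt: str) -> str:
    """Derive a stable key from a whitespace- and case-normalised excerpt."""
    normalised = re.sub(r"\s+", " ", excerpt).strip().lower()
    return "headline:" + hashlib.sha256(normalised.encode("utf-8")).hexdigest()


class HeadlineCache:
    """Read-through cache; the dict is a stand-in for a Redis client."""

    def __init__(self, generate: Callable[[str], str]) -> None:
        self._generate = generate
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def get_headline(self, excerpt: str) -> str:
        key = cache_key(excerpt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        headline = self._generate(excerpt)  # only reached on a cache miss
        self._store[key] = headline
        return headline


cache = HeadlineCache(lambda excerpt: "Placeholder Headline")
first = cache.get_headline("Nigeria's inflation rate climbed to 33.40% in July 2024")
second = cache.get_headline("  nigeria's inflation rate   climbed to 33.40% in July 2024 ")
print(cache.hits, cache.misses)
```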
For predictable traffic spikes — elections, major sports finals, disaster coverage — pre-warmed spot instance pools can be provisioned 30–60 minutes in advance using AWS Auto Scaling scheduled actions, avoiding cold-start latency under sudden load.
The /generate_stream endpoint uses HuggingFace's TextIteratorStreamer running in a background thread. This allows clients to begin rendering tokens as they are produced rather than waiting for the full sequence, improving perceived responsiveness for interactive CMS integrations. The Python client library (client.py) exposes this as a generator:

    for token in client.generate_stream("Nigeria's inflation rate..."):
        print(token, end='', flush=True)

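The background-thread pattern behind this kind of streaming can be illustrated with the standard library alone. In the sketch below, a fake token producer stands in for `model.generate()` and a queue stands in for the streamer. All names are hypothetical, and nothing here calls the Transformers API.

```python
import queue
import threading
from typing import Iterator

_DONE = object()  # sentinel marking the end of generation


def fake_generate(prompt: str, out: "queue.Queue") -> None:
    """Stand-in for model.generate(): pushes tokens as they are 'produced'."""
    for token in ["Nigeria ", "Inflation ", "Climbs"]:
        out.put(token)
    out.put(_DONE)


def stream_tokens(prompt: str) -> Iterator[str]:
    """Run generation in a worker thread and yield tokens as they arrive."""
    out: "queue.Queue" = queue.Queue()
    worker = threading.Thread(target=fake_generate, args=(prompt, out), daemon=True)
    worker.start()
    while True:
        item = out.get()  # blocks until the producer emits something
        if item is _DONE:
            break
        yield item
    worker.join()


streamed = "".join(stream_tokens("Nigeria's inflation rate..."))
print(streamed)
```

The consumer never waits for the full sequence: each token is yielded as soon as the worker thread enqueues it, which is what lets a client start rendering early.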
The /ttft_itl_batched endpoint measures TTFT, TPOT, and ITL using synthetic random token inputs at configurable batch sizes. It uses ThreadPoolExecutor to simulate concurrent requests and returns mean, median, and P99 statistics, allowing operators to characterise inference performance on any hardware before committing to production capacity.
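The summary statistics can be computed as sketched below. This assumes a nearest-rank definition of P99, which may not be the percentile method the server actually uses, and the function name is illustrative.

```python
import statistics
from typing import Dict, Sequence


def summarise_latencies(samples_ms: Sequence[float]) -> Dict[str, float]:
    """Return mean, median and nearest-rank P99 for a set of latency samples."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    # Nearest-rank percentile: 1-based rank = ceil(0.99 * n), in integer arithmetic.
    rank = (99 * len(ordered) + 99) // 100
    return {
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "p99": ordered[rank - 1],
    }


summary = summarise_latencies(list(range(1, 101)))
print(summary)
```

With only 20 samples, as in a small benchmark run, the nearest-rank P99 is simply the slowest observation, so larger sample counts give a more meaningful tail estimate.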
Two primary deployment scenarios are analysed on AWS EC2 at on-demand pricing, 24/7 uptime.
| Component | GPU (g4dn.xlarge) | CPU (t3.xlarge) |
|---|---|---|
| EC2 Instance (24/7) | $379 / month | $121 / month |
| Storage (100 GB SSD) | $10 / month | $10 / month |
| Data Transfer | $5–15 / month | $5–15 / month |
| Total (on-demand) | ~$394–404 / month | ~$136–146 / month |
| Total (spot, ~65% saving) | ~$138–142 / month | ~$48–51 / month |
The GPU instance is recommended for any deployment where P99 latency matters to end users. The CPU instance is viable for batch processing workflows where 800–1,200ms latency is acceptable.
| Daily Volume | GPU On-Demand | GPU Spot | CPU On-Demand |
|---|---|---|---|
| 100 req/day (~3k/month) | $0.131 | $0.046 | $0.045 |
| 1,000 req/day (~30k/month) | $0.013 | $0.005 | $0.0045 |
| 10,000 req/day (~300k/month) | $0.0013 | $0.0005 | Not recommended* |
\* At 10,000 requests/day on CPU, the 800–1,200ms latency means a single instance handles at most ~1 req/s, far below the 20–50 peak RPS projected for this tier. Meeting that peak would require multiple CPU instances, which eliminates the cost advantage entirely.
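The per-request figures follow from dividing the monthly total by the monthly request count. The sketch below assumes a 30-day month and the lower bound of each monthly cost range from the infrastructure table; the names are illustrative.

```python
def cost_per_request(monthly_cost_usd: float, requests_per_day: int, days: int = 30) -> float:
    """Monthly infrastructure cost spread evenly over a month of requests."""
    return monthly_cost_usd / (requests_per_day * days)


# Lower bounds of the monthly totals quoted in the cost table (USD).
MONTHLY_USD = {"gpu_on_demand": 394.0, "gpu_spot": 138.0, "cpu_on_demand": 136.0}

cost_table = {
    volume: {name: cost_per_request(cost, volume) for name, cost in MONTHLY_USD.items()}
    for volume in (100, 1_000, 10_000)
}

for volume, row in cost_table.items():
    formatted = ", ".join(f"{name}=${value:.4f}" for name, value in row.items())
    print(f"{volume} req/day: {formatted}")
```

Because the instance cost is fixed, cost per request falls tenfold with every tenfold increase in volume until a single instance saturates.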
| Traffic Level | GPU On-Demand | GPU Spot | CPU On-Demand |
|---|---|---|---|
| Low (100 req/day) | $131 per 1k requests | $46 per 1k requests | $45 per 1k requests |
| Medium (1,000 req/day) | $13 per 1k requests | $5 per 1k requests | $4.50 per 1k requests |
| High (10,000 req/day) | $1.30 per 1k requests | $0.50 per 1k requests | Not recommended |
client.py implements batch_generate() which sends multiple excerpts sequentially. At scale, a true server-side batching implementation would allow the model to process multiple inputs per model.generate() call, improving GPU utilisation.

A retrieval-augmented generation (RAG) pipeline was evaluated as an alternative architecture. For headline generation, which requires learning style and compression patterns rather than retrieving dynamic facts, fine-tuning is both cheaper and more effective:
| Factor | Fine-tuned (This System) | RAG Pipeline |
|---|---|---|
| Training cost | $0 (free Colab T4) | $0 |
| Inference infrastructure | $0–404/month (self-hosted) | $15–500/month (vector DB + LLM API) |
| Per-request at 1k/day | $0.013 | $0.002–0.01 per API call + DB costs |
| Offline capability | ✅ Yes | ❌ No — requires internet |
| Output determinism | ✅ Yes (greedy decoding) | ❌ No — LLM outputs vary |
| Style pattern learning | ✅ Strong (4,286 training examples) | ❌ Weak — retrieves examples, does not learn patterns |
| Latency | ✅ 150–200ms | ❌ 500–2,000ms (vector search + API call) |
The fine-tuned model has learned that Nigerian political headlines use "Nigeria:" prefixes, party abbreviations like PDP and APC, active voice, and 7–10 word targets. A RAG system would retrieve example headlines and ask an LLM to imitate them — a slower, costlier, and less consistent approach for a task where the patterns are stable and learnable.
The monitoring strategy described in this section is designed for production deployment. The current prototype uses Uvicorn's default request logging. Implementing this stack would be the next step before moving from prototype to production.
| Layer | Tool | What It Covers |
|---|---|---|
| Infrastructure | AWS CloudWatch | CPU, GPU utilization, memory, disk I/O, network, EC2 health |
| Application | Prometheus + Grafana | Request rate, latency percentiles, error rate, queue depth, token throughput |
| LLM Telemetry | LangSmith | Prompt traces, token counts, generation cost, output quality sampling |
Each layer answers a different question. CloudWatch answers "is the machine healthy?" Prometheus answers "is the API behaving correctly?" LangSmith answers "is the model producing good outputs and at what cost?"
| Metric | Why It Matters | Alert Threshold |
|---|---|---|
| Latency P50, P95, P99 (TTFT) | User experience — editors notice delays above 300ms | P99 > 500ms = Warning; P99 > 1,000ms = Critical |
| HTTP 5xx error rate | Reliability — errors mean lost headlines | > 1% = Warning; > 5% = Critical |
| Throughput (RPS) | Capacity planning | < 0.5 RPS when expected > 5 = Warning |
| Token usage (input + output) | Cost control — unexpected spikes indicate misuse | > 300 tokens/request average = Warning |
| GPU utilization | Resource efficiency — too high means saturation, too low means waste | > 95% sustained = Warning; < 20% = scale-down trigger |
| GPU memory usage | OOM risk | > 90% of 16 GB = Warning |
| Output length compliance | Model quality — outputs outside 5–15 words indicate drift | < 80% compliance = Warning |
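The output length compliance metric could be computed along these lines. This is a sketch that assumes simple whitespace word counting; the function name is hypothetical.

```python
from typing import Sequence


def length_compliance(headlines: Sequence[str], min_words: int = 5, max_words: int = 15) -> float:
    """Fraction of headlines whose word count falls inside the target range."""
    if not headlines:
        raise ValueError("need at least one headline")
    compliant = sum(1 for h in headlines if min_words <= len(h.split()) <= max_words)
    return compliant / len(headlines)


sample = [
    "Tinubu Approves New Appointments for Federal Agencies",
    "Nigeria Inflation Rate Climbs to 33.40% in July 2024",
    "Super Eagles Secure AFCON 2025 Qualification After South Africa Win",
    "Breaking",  # too short, counts as non-compliant
]

rate = length_compliance(sample)
print(f"compliance: {rate:.0%}")
```

Three of the four sample headlines fall inside the 5–15 word range, so this batch would sit below the 80% warning threshold.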
A Prometheus /metrics endpoint would expose:

- headline_requests_total: counter by status code, used to compute error rate
- headline_request_duration_seconds: histogram for P50/P95/P99 latency
- headline_ttft_seconds: histogram specifically for time-to-first-token
- headline_active_requests: gauge for concurrent request count
- headline_tokens_total: counter for input and output tokens separately

Grafana dashboards visualise these as time-series panels with configurable alert thresholds.
LangSmith provides request-level tracing including the full prompt sent to the model, the raw output before post-processing, token counts, and latency breakdown. This enables:
| Condition | Severity | Who Gets Notified | Response |
|---|---|---|---|
| P99 TTFT > 500ms on GPU | Warning | Slack #ops channel | Investigate queue depth and GPU saturation |
| P99 TTFT > 1,000ms | Critical | PagerDuty on-call | Auto-scale or failover to standby instance |
| HTTP 5xx rate > 1% | Warning | Slack #ops channel | Check logs for model loading or OOM errors |
| HTTP 5xx rate > 5% | Critical | PagerDuty on-call | Rollback if triggered by recent deployment |
| GPU memory > 90% | Warning | Slack #ops channel | Reduce concurrent request limit; monitor for OOM |
| Container health check fails ×3 | Critical | PagerDuty on-call | Docker auto-restarts; escalate if not resolved in 5 minutes |
| Output length compliance < 80% | Warning | Slack #ml channel | Review recent inputs; check for model degradation |
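The two-tier thresholds in the table map onto a small classification helper. The sketch below covers only the latency and error-rate rows, and the function names are illustrative rather than part of any alerting tool.

```python
from typing import Dict, Optional


def classify(value: float, warning: float, critical: float) -> Optional[str]:
    """Return 'critical', 'warning', or None for a metric where higher is worse."""
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return None


def evaluate_alerts(p99_ttft_ms: float, http_5xx_pct: float) -> Dict[str, Optional[str]]:
    """Apply the thresholds from the alerting table to two key metrics."""
    return {
        "p99_ttft": classify(p99_ttft_ms, warning=500, critical=1000),
        "http_5xx": classify(http_5xx_pct, warning=1, critical=5),
    }


healthy = evaluate_alerts(p99_ttft_ms=199, http_5xx_pct=0.3)
degraded = evaluate_alerts(p99_ttft_ms=750, http_5xx_pct=6)
print(healthy, degraded)
```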
In production, every request would emit a structured JSON log entry containing: timestamp, request ID, excerpt character count, generated headline, TTFT in milliseconds, input and output token counts, device used, and any error details. Logs would be shipped to CloudWatch Logs with a 30-day retention policy.
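A structured log entry with the fields listed above might look like the following sketch. The field names and the `build_log_entry` helper are assumptions for illustration. The excerpt itself is deliberately not logged, only its character count.

```python
import json
import uuid
from datetime import datetime, timezone
from typing import Optional


def build_log_entry(
    excerpt: str,
    headline: str,
    ttft_ms: float,
    input_tokens: int,
    output_tokens: int,
    device: str,
    error: Optional[str] = None,
) -> str:
    """Serialise one request's telemetry as a single JSON line."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "request_id": str(uuid.uuid4()),
        "excerpt_chars": len(excerpt),  # length only, not the text
        "headline": headline,
        "ttft_ms": round(ttft_ms, 1),
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "device": device,
        "error": error,
    }
    return json.dumps(entry, ensure_ascii=False)


log_line = build_log_entry(
    excerpt="Nigeria's inflation rate climbed to 33.40% in July 2024...",
    headline="Nigeria Inflation Rate Climbs to 33.40% in July 2024",
    ttft_ms=165.04,
    input_tokens=96,
    output_tokens=14,
    device="cuda",
)
print(log_line)
```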
A weekly automated report would aggregate: total request volume and error rate, latency distribution (P50/P95/P99 TTFT), headline length compliance rate, and estimated cost per request for the week.
The current prototype implements the following:

- The GenerateRequest schema enforces correct Python types for all fields (excerpt: str, max_new_tokens: int, temperature: float, top_p: float, do_sample: bool) and applies safe default values. Malformed requests are rejected by FastAPI before reaching the model.
- Endpoints wrap inference in try/except blocks and return HTTPException(status_code=500, detail=str(e)) rather than exposing raw Python stack traces to clients.
- The excerpt is passed to build_prompt() as a function argument and injected into a fixed template string. This structurally separates instruction text from user content and reduces the attack surface for prompt injection compared to direct string concatenation.
- The HuggingFace token is read from the HF_TOKEN environment variable and is not hardcoded anywhere in the codebase.
- The healthcheck block in docker-compose.yml ensures the container only receives traffic once the model has loaded and GET / returns HTTP 200.

In production, all inference endpoints would require an X-API-Key header validated against a table of SHA-256 hashed keys stored in a database. Anonymous access would be permitted only on GET / (health check), which does not invoke the model. For internal CMS integrations, mutual TLS (mTLS) would be added to ensure requests originate from trusted client certificates.
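The hashed-key check could be sketched as follows. This assumes keys are stored as hex-encoded SHA-256 digests; the function names and the in-memory set standing in for the database table are hypothetical.

```python
import hashlib
import hmac
from typing import Iterable, Optional


def hash_key(api_key: str) -> str:
    """Hex SHA-256 digest of an API key, the only form that is stored."""
    return hashlib.sha256(api_key.encode("utf-8")).hexdigest()


def is_authorised(presented_key: Optional[str], stored_hashes: Iterable[str]) -> bool:
    """Validate an X-API-Key header value against the stored digests."""
    if not presented_key:
        return False
    candidate = hash_key(presented_key)
    # Compare against every stored hash with a constant-time comparison.
    matches = [hmac.compare_digest(candidate, stored) for stored in stored_hashes]
    return any(matches)


# Stand-in for the database table of hashed keys.
stored = {hash_key("demo-key-for-newsroom-a")}

print(is_authorised("demo-key-for-newsroom-a", stored))
print(is_authorised("wrong-key", stored))
```

Storing only digests means a leaked table does not reveal usable keys, and `hmac.compare_digest` avoids leaking match position through timing.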
Rate limiting would be applied at two levels:
Retry-After header.

Beyond what Pydantic currently enforces, the production hardening plan includes:

- Capping max_new_tokens to a server-side ceiling of 200

News excerpts submitted to the API may contain names, phone numbers, or other identifying information present in news articles. In production, request bodies would not be logged by default. If debug logging is enabled, logs would be stored in a CloudWatch log group with restricted IAM access and a 7-day retention policy. Prompt content would never be sent to third-party analytics services. The LangSmith integration would be configured with data redaction enabled for PII patterns before traces are transmitted.
Access to the inference endpoint is restricted to holders of valid API keys. Access to CloudWatch logs and Prometheus metrics is restricted to the engineering team via IAM roles — no public access. LangSmith project access is restricted to the model owner. The Docker host accepts management connections only via AWS Systems Manager Session Manager; no IAM users have direct SSH key access to the production instance.

- Non-root app user (UID 1000): the application process should not run as root inside the container
- /app/model_cache and /tmp

All production traffic would be routed through an AWS Application Load Balancer with an ACM-managed TLS certificate. HTTP connections would be redirected to HTTPS with a 301 response at the ALB level. The application does not handle TLS directly, so certificate rotation requires no code changes. TLS 1.2 is the minimum permitted version; TLS 1.0 and 1.1 are disabled on the ALB listener.
Generated headlines would pass through a post-processing step before being returned:
do_sample=False

The test suite (client/test_requests.py) covers 8 categories with 20+ real Nigerian news examples, runnable against both local and Docker deployments via --url http://localhost:8000:

- Greedy decoding (do_sample=False) and stochastic sampling (do_sample=True)
- Streaming via /generate_stream
- Batch generation via batch_generate()

Measured via POST /ttft_itl_batched (20 prompts, batch size 4, T4 GPU, 4-bit quantization):
| Metric | Mean | Median | P99 |
|---|---|---|---|
| TTFT (Time to First Token) | 165ms | 162ms | 199ms |
| TPOT (Time per Output Token) | 12.4ms | 12.1ms | 15.2ms |
| ITL (Inter-Token Latency) | 11.8ms | 11.5ms | 14.6ms |
P99 TTFT of 199ms remains within the 200ms production target. On CPU, TTFT rises to 800–1,200ms — acceptable for batch workflows but unsuitable for interactive use.
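TTFT covers only the first token. A common first-order estimate of the time to produce a full headline is TTFT plus one TPOT for each remaining token. The sketch below applies that to the mean figures in the table, assuming a 15-token headline; that length is an illustrative assumption and not a measured value.

```python
def estimated_generation_ms(ttft_ms: float, tpot_ms: float, output_tokens: int) -> float:
    """First-order estimate: first token at TTFT, each further token costs one TPOT."""
    if output_tokens < 1:
        raise ValueError("output_tokens must be at least 1")
    return ttft_ms + (output_tokens - 1) * tpot_ms


# Mean TTFT and TPOT from the benchmark table; 15 output tokens is assumed.
mean_total_ms = estimated_generation_ms(ttft_ms=165.0, tpot_ms=12.4, output_tokens=15)
print(f"estimated full-headline latency: {mean_total_ms:.1f} ms")
```

Under these assumptions the complete headline would arrive later than the first token, which is one reason the streaming endpoint matters for perceived responsiveness.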
This project delivers a working deployment prototype for Nigerian news headline generation, taking a QLoRA fine-tuned Qwen 2.5 0.5B model with 17–41% ROUGE improvements over baseline and making it accessible via a production-structured API. The key outcomes are:
The gap between the current prototype and a fully production-hardened system is well-defined: implementing the security roadmap in Section 6 and the monitoring stack in Section 5 are the two primary next steps before handling real user traffic.
Model: huggingface.co/Blaqadonis/Qwen2.5-0.5B-Nigerian-News-Headlines
Code: github.com/Blaqadonis/nigerian-news-headlines-qlora