Deploying a Fine-Tuned Code-Generation LLM via Cloudflare Workers and HuggingFace Inference

Author: cmndcntrlcyber
Project: bubble-chats — Multi-Platform AI Assistant Suite
Model: cmndcntrlcyber/qwen14b-code-trainer-v6-gguf
Repository: cmndcntrlcyber/bubble_chats
Date: June 2026

Abstract

This publication documents the production deployment of a fine-tuned large language model for offensive security code generation. The model — a Q4_K_M GGUF quantization of a LoRA-adapted Qwen2.5-Coder-14B-Instruct — is served through the HuggingFace Inference Router behind a Cloudflare Worker proxy, and consumed by four client surfaces: a browser extension, a Linux GTK3 desktop app, a Windows Tauri desktop app, and an embeddable website widget. The architecture keeps API credentials server-side, normalizes streaming SSE formats across providers, and supports hot-swapping between the fine-tuned model, open-source alternatives, and Anthropic's Claude API without client changes.

1. Use Case Definition

Problem Statement

Security practitioners and developers frequently need to generate, review, and explain offensive security code — shellcode, exploit scaffolding, PowerShell payloads, privilege escalation scripts — in a context-aware, conversational interface. General-purpose chat interfaces lack the domain vocabulary and code-formatting precision required, while purpose-built tools typically lack the conversational layer. The goal is a floating chat assistant, accessible from any surface (browser, desktop, website), that defaults to a fine-tuned code-generation model with domain-specific calibration, while retaining the ability to escalate to larger general models for reasoning-heavy tasks.

Target Users

User Type	Interaction Pattern
Red team operators	Asking for payload generation, evasion ideas, enumeration scripts
Security engineers	Reviewing exploit code, explaining CVE proof-of-concepts
Students / CTF players	Learning offensive techniques in a guided, conversational way
Website visitors (widget)	General assistant, lead capture

Input / Output Examples

Example 1 — Payload generation

Input: "Write a Python reverse shell that connects to 10.10.10.5:4444 and avoids string-based AV detection"
Expected output: Working Python code using socket + subprocess, with obfuscation commentary

Example 2 — Code explanation

Input: "Explain what this shellcode does: \x31\xc0\x50\x68..."
Expected output: Step-by-step x86 assembly breakdown, syscall identification

Example 3 — Enumeration script

Input: "Give me a PowerShell one-liner to enumerate local admin group members without net.exe"
Expected output: Get-LocalGroupMember or WMI-based equivalent with explanation

Success Criteria

Metric	Target
Time to first token	< 2 seconds (p50)
Streaming latency	Continuous tokens, no stalls > 3 s
Code correctness (manual eval)	Runnable output on ≥ 80% of well-formed requests
Uptime	≥ 99.5% (Cloudflare SLA baseline)
Auth bypass rate	0% (timing-safe shared secret)

Traffic Expectations

Baseline: 50–200 requests/day across all client surfaces
Peak: Up to 500 requests/day during CTF events or training sessions
Request size: ~1–8 KB (messages array), ~512–2,048 output tokens
Concurrent users: 1–10 (personal/small-team tool, not consumer-scale)

The free tier of Cloudflare Workers handles 100,000 requests/day; HuggingFace Inference PRO handles concurrent requests with queue management. No auto-scaling infrastructure is required at this traffic level.

2. Model Selection & Configuration

Model Details

Aspect	Value
Model	`cmndcntrlcyber/qwen14b-code-trainer-v6-gguf`
Base Model	`Qwen/Qwen2.5-Coder-14B-Instruct`
Model Source	HuggingFace Hub (fine-tuned, custom)
Fine-tuning Method	LoRA (Phase 4A adapter), merged into base weights
Parameter Count	14.7B (Qwen2 architecture)
Quantization	Q4_K_M (GGUF via llama.cpp)
Quantized Size	~9 GB
Context Length	4,096 tokens
Max Output Tokens	1,024 (chat mode), 384 (context-agent mode)
Validation Loss	0.4724 (3,265-row validation split)
License	Apache 2.0

Secondary / Fallback Models

Role	Model	Rationale
Context annotation agent	`Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled`	High reasoning quality for second-opinion annotations
Fallback (website/browser)	`claude-haiku-4-5-20251001` (Anthropic API)	Reliable fallback when HF Worker unavailable
Local dev / offline	Ollama + `llama3.2`	Zero-cost local testing

Justification for Model Selection

Why fine-tune rather than use a base model? The Qwen2.5-Coder-14B-Instruct base model is strong at general code generation but lacks calibration for offensive security terminology, code patterns (e.g., shellcode, payload loaders), and the conversational style needed when explaining red team techniques. The LoRA fine-tune on a domain-specific dataset reduces the prompt engineering burden on users and produces outputs that require less post-editing.

Why Q4_K_M GGUF? The Q4_K_M quantization at 14B parameters reaches a practical balance: the 9 GB footprint fits on a single A100 GPU instance available through HuggingFace Inference Jobs, while the perplexity penalty vs. F16 is approximately 1–3% — negligible for code generation tasks where token-level accuracy matters less than structural correctness. Q8_0 would be more accurate but nearly doubles VRAM requirements.
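
The footprint figures above can be sanity-checked with a rough weights-only calculation. This is a sketch under stated assumptions: the bits-per-weight values for Q8_0 and Q4_K_M are approximate figures for llama.cpp quantization formats and are not taken from this project's build logs, and the result excludes the KV cache and runtime overhead.

```javascript
// Back-of-the-envelope weight footprint for a 14.7B-parameter model.
// ASSUMPTION: bits-per-weight values are approximations, not measured here.
// F16 is 16 bpw by definition; Q8_0 is about 8.5 bpw; Q4_K_M is roughly 4.85 bpw.
const PARAMS = 14.7e9;
const BITS_PER_WEIGHT = { F16: 16, Q8_0: 8.5, Q4_K_M: 4.85 };

// Returns decimal gigabytes for the weights only (no KV cache, no activations).
function weightGigabytes(params, bitsPerWeight) {
  return (params * bitsPerWeight) / 8 / 1e9;
}

const sizes = Object.fromEntries(
  Object.entries(BITS_PER_WEIGHT).map(([name, bpw]) => [name, weightGigabytes(PARAMS, bpw)])
);

for (const [name, gb] of Object.entries(sizes)) {
  console.log(`${name}: ${gb.toFixed(1)} GB`);
}
```

Under these assumptions the Q4_K_M weights come to about 9 GB and Q8_0 to a little under 16 GB, which is consistent with the "nearly doubles" comparison.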

Why 4,096 context tokens? The median offensive-security conversation fits comfortably in 4K tokens. Longer context (32K+) would increase inference cost and latency with minimal benefit for the targeted use case.

Trade-offs considered:

Option	Pros	Cons	Decision
GPT-4o-mini via OpenAI API	Easy setup, no hosting	No fine-tuning, per-token cost accumulates, data leaves user control	Rejected
Self-hosted vLLM on EC2	Full control, batch support	High fixed cost ($500+/month for 80 GB GPU), ops overhead	Deferred (Phase 6)
HF Inference API (PRO)	Managed GPU, pay-per-use, supports GGUF	Shared infrastructure, queue during peak	Selected
Ollama local only	Zero cost, offline	Requires local GPU; not viable for browser/website clients	Retained as dev/local mode

3. Deployment Strategy

Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│                        CLIENT SURFACES                          │
│  browser-bubble  │  linux-desktop  │  windows-tauri  │  website │
└──────────────────┴─────────────────┴─────────────────┴──────────┘
          │                │                │               │
          └────────────────┴────────────────┴───────────────┘
                                   │
                      X-Bubble-Auth (shared secret)
                                   │
                    ┌──────────────▼──────────────┐
                    │   Cloudflare Worker          │
                    │   bubble-hf-worker           │
                    │                              │
                    │  • Timing-safe auth check    │
                    │  • CORS enforcement          │
                    │  • OpenAI→Anthropic SSE norm │
                    │  • Model routing (chat/ctx)  │
                    └──────────────┬──────────────┘
                                   │ Bearer HF_TOKEN
                    ┌──────────────▼──────────────┐
                    │  HuggingFace Inference       │
                    │  Router (v1/chat/completions)│
                    │                              │
                    │  Primary: qwen14b-code-v6    │
                    │  Context: Qwen3.5-27B-...    │
                    └─────────────────────────────┘

Platform Selection

Primary deployment: Cloudflare Workers + HuggingFace Inference Router

Cloudflare Workers were chosen as the proxy/gateway layer for three reasons:

Edge-native, zero cold starts — Workers run in Cloudflare's V8 isolates, eliminating cold-start latency. A Lambda function or Cloud Run container would add 200–2,000 ms on first request.
Secrets isolation — The HF_TOKEN and BUBBLE_SHARED_SECRET never reach the client. The Worker holds them as encrypted Wrangler secrets, reachable only in the Worker's runtime environment.
Free at this scale — The free Workers tier covers 100K requests/day. Monthly traffic at projected usage stays well within this.

HuggingFace Inference Router was chosen for model serving because:

The fine-tuned GGUF is already hosted on the Hub
The Inference Router provides an OpenAI-compatible /v1/chat/completions endpoint with streaming support
No GPU provisioning or container management is needed; the router handles scheduling across HF's A100 fleet
Cost aligns with usage (HF PRO subscription: $9/month, covers moderate inference volume)

Alternatives considered and rejected:

Platform	Why Rejected
Modal	Would require repackaging the GGUF; HF native hosting is simpler
AWS SageMaker	High setup overhead; monthly minimums exceed usage-based HF cost
Hugging Face Inference Endpoints (dedicated)	Fixed cost (~$80–300+/month for GPU endpoint); overkill for < 500 req/day
vLLM on Cloud VM	Excellent for high throughput; not justified until traffic > 1K req/day

Infrastructure Details

Component	Configuration
Worker runtime	Cloudflare V8 isolate, `compatibility_date = "2025-01-01"`
Worker CPU limit	10ms CPU time per request (adequate for proxy; no heavy compute)
Worker memory	128 MB (Cloudflare default)
HF Inference tier	HF PRO (shared GPU fleet, A100 access)
GGUF quantization	Q4_K_M, ~9 GB VRAM; fits on one 40 GB A100 with headroom
Streaming	SSE (Server-Sent Events) end-to-end; Worker normalizes OpenAI→Anthropic SSE format
Geographic region	Cloudflare global edge; HF inference runs in US East data centers
Endpoint type	Real-time streaming (not batch)

Deployment Instructions (Cloudflare Worker)

cd cloudflare-hf-worker
npm install

# Deploy worker code to Cloudflare
npx wrangler deploy

# Set encrypted secrets
npx wrangler secret put HF_TOKEN           # your HuggingFace token
npx wrangler secret put BUBBLE_SHARED_SECRET  # openssl rand -hex 32

Before running the deploy step, copy wrangler.toml.example to wrangler.toml and set:

HF_DEFAULT_MODEL — the primary chat model ID
HF_CONTEXT_MODEL — the context-agent model ID
ALLOWED_ORIGINS — comma-separated list of permitted client origins
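
As an illustration of how a comma-separated ALLOWED_ORIGINS value could be applied at request time, here is a minimal sketch. The helper names parseAllowedOrigins and corsHeadersFor are hypothetical and the header set is an assumption; the Worker's actual CORS code may differ.

```javascript
// Hypothetical helpers: not taken from the bubble-hf-worker source.
// Splits the ALLOWED_ORIGINS variable into a set of exact-match origins.
function parseAllowedOrigins(csv) {
  return new Set(
    (csv || '')
      .split(',')
      .map((origin) => origin.trim())
      .filter((origin) => origin.length > 0)
  );
}

// Returns CORS headers for a permitted origin, or null when the caller should reject.
function corsHeadersFor(requestOrigin, allowedOriginsCsv) {
  const allowed = parseAllowedOrigins(allowedOriginsCsv);
  if (!requestOrigin || !allowed.has(requestOrigin)) return null;
  return {
    'Access-Control-Allow-Origin': requestOrigin,
    'Access-Control-Allow-Methods': 'POST, OPTIONS',
    'Access-Control-Allow-Headers': 'Content-Type, X-Bubble-Auth',
    Vary: 'Origin', // the response differs per origin, so caches must key on it
  };
}
```

Echoing back the single matching origin, rather than a wildcard, keeps the allowlist meaningful when more than one client surface is configured.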

SSE Format Normalization

A critical implementation detail: the HuggingFace Inference Router returns OpenAI-compatible SSE:

data: {"choices":[{"delta":{"content":"..."}}]}

All four bubble clients consume Anthropic-compatible SSE:

data: {"type":"content_block_delta","delta":{"type":"text_delta","text":"..."}}

The Worker's openAiToAnthropicSse() transform stream handles this conversion in-flight, so no client code changes were needed when switching from the Anthropic API to the HF backend.
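
The conversion described above can be sketched as a per-line mapping wrapped in a TransformStream. This is an illustrative version only: convertSseLine and createSseNormalizer are hypothetical names, and the author's openAiToAnthropicSse() may handle more event types (for example stop events) than this sketch does.

```javascript
// Hypothetical per-line converter; the real openAiToAnthropicSse() may differ.
function convertSseLine(line) {
  if (!line.startsWith('data:')) return null; // ignore comments, blanks, event: fields
  const payload = line.slice(5).trim();
  if (payload === '' || payload === '[DONE]') return null; // OpenAI end-of-stream sentinel
  let chunk;
  try {
    chunk = JSON.parse(payload);
  } catch {
    return null; // skip malformed lines instead of breaking the stream
  }
  const text = chunk?.choices?.[0]?.delta?.content;
  if (typeof text !== 'string' || text.length === 0) return null;
  const event = { type: 'content_block_delta', delta: { type: 'text_delta', text } };
  return `data: ${JSON.stringify(event)}\n\n`;
}

// Buffers partial lines across network chunks, then converts each complete line.
function createSseNormalizer() {
  const decoder = new TextDecoder();
  const encoder = new TextEncoder();
  let buffer = '';
  return new TransformStream({
    transform(chunk, controller) {
      buffer += decoder.decode(chunk, { stream: true });
      const lines = buffer.split('\n');
      buffer = lines.pop(); // the last element is an incomplete line (or '')
      for (const line of lines) {
        const out = convertSseLine(line.trimEnd());
        if (out) controller.enqueue(encoder.encode(out));
      }
    },
    flush(controller) {
      const out = convertSseLine(buffer.trimEnd());
      if (out) controller.enqueue(encoder.encode(out));
    },
  });
}
```

In a Worker, a stream like this would typically be applied with upstream.body.pipeThrough(createSseNormalizer()) before the body is handed to the Response.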

Dual-Agent Design

Each chat request triggers the primary model. Optionally, a parallel mode=context request routes to the context-agent model (Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled), which streams a 2–4 sentence analytical annotation below the primary response. This gives users a second-opinion layer on technical claims without blocking the primary reply.
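
A minimal sketch of how the mode switch could select the model and token cap is shown below. The function name buildUpstreamBody and the request shape are assumptions for illustration; the token caps mirror the Model Details table, and the env keys are the wrangler.toml variables named earlier.

```javascript
// Hypothetical routing helper; not copied from the Worker source.
const MAX_TOKENS = { chat: 1024, context: 384 };

// Builds the OpenAI-compatible request body for the HF router based on the mode.
function buildUpstreamBody(mode, messages, env) {
  const isContext = mode === 'context';
  return {
    model: isContext ? env.HF_CONTEXT_MODEL : env.HF_DEFAULT_MODEL,
    messages,
    max_tokens: isContext ? MAX_TOKENS.context : MAX_TOKENS.chat,
    stream: true, // both modes stream their output to the client
  };
}
```

Because any mode other than context falls through to the primary model, an unknown or missing mode value degrades to a normal chat request.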

4. Cost Analysis

Monthly Cost Breakdown

Cost Component	Configuration	Monthly Estimate
Cloudflare Workers	Free tier (< 100K req/day)	$0
HuggingFace PRO	Subscription (shared inference fleet)	$9
Model storage (HF Hub)	Free for public models	$0
Network egress	Cloudflare → HF (intra-datacenter or peered)	~$0–2
Monitoring (CF Analytics)	Built-in, no extra cost	$0
Total Estimated		~$9–11/month

If traffic scales past 100K requests/day, upgrading to Cloudflare Workers Paid adds $5/month, bringing the total to roughly $14–16/month.

Cost Per Request

At 200 requests/day = 6,000 requests/month:

Monthly cost: ~$10
Cost per request: ~$0.0017 (~0.17 cents)

At 500 requests/day = 15,000 requests/month:

Monthly cost: ~$11 (still within free Worker tier)
Cost per request: ~$0.00073 (~0.07 cents)
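
The per-request figures above follow from dividing the flat monthly cost by monthly volume. A small sketch of that arithmetic, assuming a 30-day month as the text does:

```javascript
// Flat monthly cost spread over monthly request volume (30-day month assumed).
function costPerRequest(monthlyCostUsd, requestsPerDay, daysPerMonth = 30) {
  return monthlyCostUsd / (requestsPerDay * daysPerMonth);
}

console.log(costPerRequest(10, 200).toFixed(4)); // 0.0017
console.log(costPerRequest(11, 500).toFixed(5)); // 0.00073
```

Because the dominant cost is a fixed subscription, the per-request cost falls as volume rises until a paid tier is triggered.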

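For comparison: Claude Haiku 4.5 via the Anthropic API bills 1,024 output tokens at roughly $0.005 per request (at the $5 per million output-token list price at the time of writing), before input tokens are counted. The HF-backed fine-tuned model is cheaper per request at both traffic levels above and provides domain-specific quality.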

Cost Optimization Strategies

1. Response caching for repeated queries
Common queries (boilerplate shellcode templates, standard enumeration scripts) can be cached at the Worker layer using the Cloudflare Cache API. A 1-hour TTL on deterministic prompts (temperature=0, identical messages array hash) would reduce HF inference calls by an estimated 15–25% for training/CTF use cases where the same questions recur.
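
One way to derive a cache key for such deterministic prompts is to hash the model and messages array. The sketch below is illustrative: isCacheable, cacheKeyFor, and the cache.invalid placeholder host are hypothetical, and the Node fallback exists only so the snippet runs outside the Workers runtime.

```javascript
// Hypothetical cache-key helpers; not taken from the Worker source.
// Node fallback for local testing only; Workers always expose globalThis.crypto.
const subtle = (globalThis.crypto ?? require('node:crypto').webcrypto).subtle;

// Only deterministic requests (temperature 0) are treated as safe to cache.
function isCacheable(body) {
  return body.temperature === 0 && Array.isArray(body.messages) && body.messages.length > 0;
}

// Hashes the model and messages into a synthetic GET URL usable as a cache key.
async function cacheKeyFor(body) {
  const canonical = JSON.stringify({ model: body.model, messages: body.messages });
  const digest = await subtle.digest('SHA-256', new TextEncoder().encode(canonical));
  const hex = [...new Uint8Array(digest)]
    .map((byte) => byte.toString(16).padStart(2, '0'))
    .join('');
  // The chat request is a POST, so its body is folded into a URL for the cache lookup.
  return `https://cache.invalid/chat/${hex}`;
}
```

The resulting URL could then be wrapped in a GET Request for a Cache API lookup, with the 1-hour TTL expressed as a Cache-Control: max-age=3600 header on the stored response. Streamed responses would need to be buffered or teed before they can be stored, which this sketch does not cover.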

2. Q4_K_M quantization reduces inference cost
The chosen Q4_K_M quantization uses roughly 30% of the VRAM that the F16 (half-precision) weights would require: about 9 GB versus about 29 GB for 14.7B parameters at 2 bytes each. On HF's shared GPU fleet, a smaller memory footprint means shorter queue times and lower effective per-request cost. The 1–3% perplexity penalty is acceptable for code generation tasks.

3. Token limits by mode
Chat mode caps at 1,024 output tokens. Context-agent mode caps at 384 tokens. These limits prevent runaway generation that would inflate cost with no UX benefit (context annotations are meant to be concise).

4. Lazy context-agent activation
The context agent is opt-in at the client level (HF_CONTEXT_ENABLED=true). When disabled (the default), no secondary inference call is made, which halves the number of HF inference calls relative to running the agent on every chat request.

5. Cloudflare free tier headroom
At projected traffic (50–500 req/day), the free Workers tier (100K req/day) provides 200–2000× headroom. No paid compute scaling is triggered automatically; scale events are manual and deliberate.

5. Monitoring & Observability Plan

Metrics to Track

Metric	Why It Matters	Tool	Alert Threshold
Worker request count	Volume tracking, cost forecasting	Cloudflare Analytics	> 90K/day (90% of free tier)
Worker error rate (4xx/5xx)	Reliability indicator	Cloudflare Analytics	> 2% over 5-minute window
Worker CPU time (p99)	Detect processing spikes	Cloudflare Analytics	> 8ms (near 10ms limit)
HF Inference latency (TTFT)	User experience; time-to-first-token	Client-side logging	> 5s (p95)
Auth failure rate (401 count)	Security signal; potential brute-force	Cloudflare Analytics	> 10/hour
Shared secret header absence	Misconfigurations or probing	Worker logs	Any occurrence from known origins
HF upstream error rate	HF availability; model availability	Worker logs (upstream.status)	> 5% over 10-minute window

Tools Selection

Tool	Purpose	Cost
Cloudflare Analytics	Worker invocation count, error rates, CPU time, geographic distribution	Free (included)
Cloudflare Logpush (optional)	Export Worker request logs to R2/S3 for long-term retention and analysis	$0.05/GB
HuggingFace Hub → model page	Download/inference activity visible on model card	Free
LangFuse (planned Phase 6)	LLM-specific tracing: prompt/response pairs, latency per token, cost attribution	Free tier available
Self-hosted Uptime Check	Periodic `curl` smoke-test of the Worker endpoint from cron	Free (existing infra)

Current Implementation: Worker-Level Logging

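The Worker logs the upstream HTTP status and error body on failure, then relays them to the client:

if (!upstream.ok) {
  // Fall back to the status text if the error body cannot be read.
  const err = await upstream.text().catch(() => upstream.statusText);
  // console.error output is what wrangler tail displays.
  console.error('HF upstream error', upstream.status, err);
  return new Response(err, { status: upstream.status, headers: corsHeaders });
}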

Cloudflare's wrangler tail surfaces these in real time during development:

npx wrangler tail --format pretty

Alerting Strategy

Alert 1: High error rate

Condition: 5xx rate > 2% over a 5-minute rolling window
Channel: Cloudflare Notification → email (ray.soreng@gmail.com)
Runbook: Check wrangler tail for upstream error body; verify HF token validity; check HF status page
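
For checks run outside Cloudflare Notifications (for example from the cron smoke test), the rolling-window condition can be expressed as a small function. This is a sketch only: errorRateExceeded and the event shape are hypothetical and not part of the current deployment.

```javascript
// Hypothetical rolling-window check; events are { timestamp: ms, status: number }.
function errorRateExceeded(events, nowMs, windowMs = 5 * 60 * 1000, threshold = 0.02) {
  // Keep only events inside the window that ends at nowMs.
  const recent = events.filter(
    (event) => event.timestamp > nowMs - windowMs && event.timestamp <= nowMs
  );
  if (recent.length === 0) return false; // no traffic means nothing to alert on
  const errors = recent.filter((event) => event.status >= 500).length;
  return errors / recent.length > threshold;
}
```

With very low traffic a single 5xx can exceed 2%, so a minimum sample size may be worth adding before wiring this to a notification.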

Alert 2: Free tier approach (90K req/day)

Condition: Daily request count > 90,000
Channel: Cloudflare Notification → email
Runbook: Enable Workers Paid plan ($5/month) to avoid service interruption

Alert 3: Auth anomaly (> 10 401s/hour)

Condition: 401 response rate spikes
Channel: Cloudflare Notification → email
Runbook: Review originating IPs in Analytics; rotate BUBBLE_SHARED_SECRET if probing confirmed; verify ALLOWED_ORIGINS list is restrictive

Alert 4: HF upstream timeouts

Condition: HF response latency > 30s (upstream timeout) recurring > 3 times in 10 minutes
Channel: Email / manual
Runbook: Switch HF_DEFAULT_MODEL to a lighter model (e.g., meta-llama/Llama-3.1-8B-Instruct) in wrangler.toml and redeploy; file a HF status issue if persistent

Future Observability Improvements

For a Phase 6 production upgrade, the following would be added:

LangFuse integration: Wrap each HF call with a LangFuse trace, capturing prompt tokens, completion tokens, latency, and model. This enables per-user cost attribution and quality retrospectives.
Structured Worker logging: Replace plain-text error returns with JSON-structured log events shipped to Cloudflare R2 → queried with D1 or Grafana.
Model drift detection: Periodically run a fixed evaluation prompt set against the deployed model, embed each output, and compare it against the embedding of a stored reference output, alerting when cosine similarity drops below 0.85.
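
The drift check could be sketched as follows, assuming each output has already been turned into a numeric embedding vector by some embedding model (which one is left open here). The names cosineSimilarity and driftedPrompts are hypothetical.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  if (a.length !== b.length || a.length === 0) {
    throw new Error('vectors must be non-empty and of equal length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // a zero vector has no direction
  return dot / Math.sqrt(normA * normB);
}

// Returns the ids of prompts whose current output drifted below the threshold.
// ASSUMPTION: each pair is { id, reference: number[], current: number[] }.
function driftedPrompts(pairs, threshold = 0.85) {
  return pairs
    .filter(({ reference, current }) => cosineSimilarity(reference, current) < threshold)
    .map(({ id }) => id);
}
```

A non-empty result from driftedPrompts would be the trigger for the alert described above.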

6. Security Considerations

API Authentication

All client-to-Worker communication is authenticated via a shared secret transmitted in the X-Bubble-Auth HTTP header. The Worker performs a timing-safe comparison using a bitwise XOR loop (not ===) to prevent timing oracle attacks:

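function timingSafeEqual(a, b) {
  if (typeof a !== 'string' || typeof b !== 'string') return false;
  const ae = new TextEncoder().encode(a);
  const be = new TextEncoder().encode(b);
  // Compare encoded byte lengths: equal string lengths do not imply equal byte lengths.
  if (ae.length !== be.length) return false;
  let diff = 0;
  // Accumulate differences with XOR so the loop never exits early on a mismatch.
  for (let i = 0; i < ae.length; i++) diff |= ae[i] ^ be[i];
  return diff === 0;
}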

The BUBBLE_SHARED_SECRET is stored as an encrypted Wrangler secret (never in wrangler.toml or source control). It should be generated with openssl rand -hex 32 and rotated periodically.
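
The comparison above still returns early when the lengths differ, which reveals the secret's length to an attacker. One common variant hashes both values first so the XOR loop always runs over two 32-byte digests. The sketch below illustrates that variant; it is not the Worker's current code, secretsMatch is a hypothetical name, and the Node fallback exists only for local testing.

```javascript
// Hypothetical digest-based comparison; a variant, not the deployed implementation.
// Node fallback for local testing only; Workers always expose globalThis.crypto.
const subtleCrypto = (globalThis.crypto ?? require('node:crypto').webcrypto).subtle;

async function sha256Bytes(text) {
  const digest = await subtleCrypto.digest('SHA-256', new TextEncoder().encode(text));
  return new Uint8Array(digest);
}

// Compares SHA-256 digests so the loop length never depends on the inputs.
async function secretsMatch(provided, expected) {
  if (typeof provided !== 'string' || typeof expected !== 'string') return false;
  const [a, b] = await Promise.all([sha256Bytes(provided), sha256Bytes(expected)]);
  let diff = 0;
  for (let i = 0; i < a.length; i++) diff |= a[i] ^ b[i]; // both digests are 32 bytes
  return diff === 0;
}
```

In a request handler this would be called as await secretsMatch(request.headers.get('X-Bubble-Auth'), env.BUBBLE_SHARED_SECRET); a missing header arrives as null and is rejected by the type check.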

Rate Limiting

Cloudflare Workers do not apply per-IP rate limiting by default. The current mitigations are:

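Origin allowlist (ALLOWED_ORIGINS): Browser requests from unlisted origins fail at the CORS preflight. CORS is enforced by the browser, so non-browser clients are stopped by the auth gate below rather than by this list.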
Auth gate: Every request without a valid X-Bubble-Auth header returns 401 immediately — no model inference is triggered, so unauthenticated probing costs nothing.
Planned: Cloudflare Rate Limiting rules (wrangler.toml → rules) can enforce per-IP limits (e.g., 60 requests/minute) when needed.
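
To make the per-IP limit concrete, here is a minimal fixed-window counter. It is a sketch under a significant assumption: state lives in one isolate's memory, and Workers isolates do not share memory, so this only approximates a global limit. createRateLimiter is a hypothetical name, not a Cloudflare API.

```javascript
// Hypothetical in-memory fixed-window limiter; per isolate, not globally consistent.
function createRateLimiter(limit = 60, windowMs = 60_000) {
  const counters = new Map(); // key -> { windowStart, count }
  return function allow(key, nowMs = Date.now()) {
    const entry = counters.get(key);
    // Start a fresh window on first sight or once the previous window has elapsed.
    if (!entry || nowMs - entry.windowStart >= windowMs) {
      counters.set(key, { windowStart: nowMs, count: 1 });
      return true;
    }
    entry.count += 1;
    return entry.count <= limit;
  };
}
```

A handler would key the limiter on the client IP (Cloudflare passes it in the CF-Connecting-IP request header) and return 429 when allow() is false. A durable store would be needed for limits that hold across isolates.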

Input Validation

The Worker validates the messages array structure before forwarding to HF:

if (!Array.isArray(messages) || messages.length === 0)
  return new Response(JSON.stringify({ error: 'messages array required' }), { status: 400 });

Prompt injection mitigation: The context-agent system prompt is hardcoded in the Worker and prepended server-side — clients cannot override it. The primary chat system prompt is client-supplied (appropriate for a personal tool) but could be locked server-side for multi-tenant deployments.
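
A stricter per-message check and the server-side system prompt prepend could look like the sketch below. Both helpers are hypothetical illustrations: the allowed role set and error strings are assumptions, and the Worker's actual validation is the two-line check shown above.

```javascript
// Hypothetical helpers; stricter than the Worker's current validation.
const ALLOWED_ROLES = new Set(['system', 'user', 'assistant']);

// Returns an error string for a malformed messages array, or null when it is valid.
function validateMessages(messages) {
  if (!Array.isArray(messages) || messages.length === 0) return 'messages array required';
  for (const message of messages) {
    if (!message || !ALLOWED_ROLES.has(message.role)) return 'invalid role';
    if (typeof message.content !== 'string' || message.content.length === 0) {
      return 'content must be a non-empty string';
    }
  }
  return null;
}

// Drops any client-supplied system messages and prepends the server-held prompt.
function withServerSystemPrompt(messages, systemPrompt) {
  const withoutClientSystem = messages.filter((message) => message.role !== 'system');
  return [{ role: 'system', content: systemPrompt }, ...withoutClientSystem];
}
```

Applying withServerSystemPrompt only in context mode would match the current design, where the chat system prompt remains client-supplied.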

No content filtering currently. For production use as a red team tool, this is intentional: the fine-tuned model is specifically designed to generate offensive code for authorized use. In a public-facing deployment, a content moderation layer (e.g., Llama Guard or a Cloudflare WAF rule) would be added.

PII Handling

Prompts and responses are not logged by the Worker (no persistent storage, no R2 write).
Cloudflare logs standard HTTP metadata (IP, timestamp, response code) per their privacy policy; these are not exported by default.
HuggingFace Inference Router logs inference requests per their data processing agreement — users should review HF's enterprise DPA for any PII concerns.
The contact form (website-bubble) collects name, email, and phone and delivers them to a Discord webhook; no persistent database stores this data.

Access Control

Resource	Who Has Access	Control Mechanism
HF_TOKEN	Worker runtime only	Wrangler encrypted secret
BUBBLE_SHARED_SECRET	Worker + authorized clients	Wrangler encrypted secret; shared out-of-band
Cloudflare dashboard	Account owner	Cloudflare SSO + 2FA
HuggingFace Hub (model)	Public (read); owner (write)	HF token with scoped permissions
`.env` files	Local dev only	`.gitignore` excludes all `.env` and `.env.*`

Supply Chain

All Worker dependencies are pinned in package-lock.json (currently no runtime npm dependencies — the Worker is a single vanilla JS file).
The GGUF model file is hosted on HuggingFace Hub under the apache-2.0 license; the base model (Qwen/Qwen2.5-Coder-14B-Instruct) is also Apache 2.0.
The LoRA training pipeline source is at https://github.com/cmndcntrlcyber/code-trainer-offsec-pipeline.

7. Lessons Learned & What's Next

What Worked Well

Cloudflare Workers as a proxy layer proved extremely practical: zero infrastructure to manage, global edge delivery, and the timing-safe auth was straightforward to implement.
SSE normalization in-flight (OpenAI→Anthropic format) decoupled the backend model choice from the client streaming consumer entirely. Switching models required only a wrangler.toml config change.
Q4_K_M quantization hit the right balance. Inference on HF's shared A100 fleet is fast enough for interactive use (TTFT < 2s observed) without requiring a dedicated endpoint.

Limitations

No per-user cost attribution. All HF inference is billed to a single PRO subscription. Adding per-user quotas would require a usage-tracking layer (D1 database or KV store on the Worker).
Queue latency during peak HF load. The shared inference fleet can experience delays when popular models are heavily loaded. A dedicated HF Inference Endpoint would eliminate this at higher cost.
No automated evaluation. There is no CI pipeline that runs the evaluation prompt set against the deployed model to detect regressions after a model swap.

Phase 6 Roadmap

Item	Description
vLLM deployment	Self-hosted vLLM stack for higher throughput and OpenAI-compatible batch API
LangFuse tracing	Per-request LLM observability with cost tracking
Llama Guard integration	Content moderation layer for multi-tenant widget deployments
Per-user rate limiting	Cloudflare KV-backed request quota per API key
Automated eval CI	GitHub Actions job that runs fixed prompt set and compares against reference outputs

References

Model: cmndcntrlcyber/qwen14b-code-trainer-v6-gguf
Base model: Qwen/Qwen2.5-Coder-14B-Instruct
Repository: cmndcntrlcyber/bubble_chats
Training pipeline: cmndcntrlcyber/code-trainer-offsec-pipeline
Cloudflare Workers documentation: developers.cloudflare.com/workers
HuggingFace Inference Router: huggingface.co/docs/inference-providers
llama.cpp GGUF quantization: github.com/ggerganov/llama.cpp
ReadyTensor Module 2 requirements: app.readytensor.ai/lessons/llmed-program-module-2-project