Author: cmndcntrlcyber
Project: bubble-chats: Multi-Platform AI Assistant Suite
Model: cmndcntrlcyber/qwen14b-code-trainer-v6-gguf
Repository: cmndcntrlcyber/bubble_chats
Date: June 2026
This publication documents the production deployment of a fine-tuned large language model for offensive security code generation. The model, a Q4_K_M GGUF quantization of a LoRA-adapted Qwen2.5-Coder-14B-Instruct, is served through the HuggingFace Inference Router behind a Cloudflare Worker proxy, and consumed by four client surfaces: a browser extension, a Linux GTK3 desktop app, a Windows Tauri desktop app, and an embeddable website widget. The architecture keeps API credentials server-side, normalizes streaming SSE formats across providers, and supports hot-swapping between the fine-tuned model, open-source alternatives, and Anthropic's Claude API without client changes.
Security practitioners and developers frequently need to generate, review, and explain offensive security code (shellcode, exploit scaffolding, PowerShell payloads, privilege escalation scripts) in a context-aware, conversational interface. General-purpose chat interfaces lack the domain vocabulary and code-formatting precision required, while purpose-built tools typically lack the conversational layer. The goal is a floating chat assistant, accessible from any surface (browser, desktop, website), that defaults to a fine-tuned code-generation model with domain-specific calibration, while retaining the ability to escalate to larger general models for reasoning-heavy tasks.
| User Type | Interaction Pattern |
|---|---|
| Red team operators | Asking for payload generation, evasion ideas, enumeration scripts |
| Security engineers | Reviewing exploit code, explaining CVE proof-of-concepts |
| Students / CTF players | Learning offensive techniques in a guided, conversational way |
| Website visitors (widget) | General assistant, lead capture |
Example 1: Payload generation
"Write a Python reverse shell that connects to 10.10.10.5:4444 and avoids string-based AV detection"

Example 2: Code explanation
"Explain what this shellcode does: \x31\xc0\x50\x68..."

Example 3: Enumeration script
"Give me a PowerShell one-liner to enumerate local admin group members without net.exe"
Expected: Get-LocalGroupMember or WMI-based equivalent with explanation

| Metric | Target |
|---|---|
| Time to first token | < 2 seconds (p50) |
| Streaming latency | Continuous tokens, no stalls > 3 s |
| Code correctness (manual eval) | Runnable output on ≥ 80% of well-formed requests |
| Uptime | ≥ 99.5% (Cloudflare SLA baseline) |
| Auth bypass rate | 0% (timing-safe shared secret) |
The free tier of Cloudflare Workers handles 100,000 requests/day; HuggingFace Inference PRO handles concurrent requests with queue management. No auto-scaling infrastructure is required at this traffic level.
| Aspect | Value |
|---|---|
| Model | cmndcntrlcyber/qwen14b-code-trainer-v6-gguf |
| Base Model | Qwen/Qwen2.5-Coder-14B-Instruct |
| Model Source | HuggingFace Hub (fine-tuned, custom) |
| Fine-tuning Method | LoRA (Phase 4A adapter), merged into base weights |
| Parameter Count | 14.7B (Qwen2 architecture) |
| Quantization | Q4_K_M (GGUF via llama.cpp) |
| Quantized Size | ~9 GB |
| Context Length | 4,096 tokens |
| Max Output Tokens | 1,024 (chat mode), 384 (context-agent mode) |
| Validation Loss | 0.4724 (3,265-row validation split) |
| License | Apache 2.0 |
| Role | Model | Rationale |
|---|---|---|
| Context annotation agent | Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled | High reasoning quality for second-opinion annotations |
| Fallback (website/browser) | claude-haiku-4-5-20251001 (Anthropic API) | Reliable fallback when HF Worker unavailable |
| Local dev / offline | Ollama + llama3.2 | Zero-cost local testing |
Why fine-tune rather than use a base model? The Qwen2.5-Coder-14B-Instruct base model is strong at general code generation but lacks calibration for offensive security terminology, code patterns (e.g., shellcode, payload loaders), and the conversational style needed when explaining red team techniques. The LoRA fine-tune on a domain-specific dataset reduces the prompt engineering burden on users and produces outputs that require less post-editing.
Why Q4_K_M GGUF? The Q4_K_M quantization at 14B parameters reaches a practical balance: the 9 GB footprint fits on a single A100 GPU instance available through HuggingFace Inference Jobs, while the perplexity penalty vs. F16 is approximately 1–3%, which is negligible for code generation tasks where token-level accuracy matters less than structural correctness. Q8_0 would be more accurate but nearly doubles VRAM requirements.
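The footprint comparison can be checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: the bits-per-weight figures are approximate values commonly quoted for llama.cpp quantization types (an assumption, since real GGUF files vary with the tensor mix), and it counts weights only, not KV cache or runtime overhead.

```javascript
// Approximate bits per weight (assumed; actual GGUF files vary slightly).
const BITS_PER_WEIGHT = { F16: 16, Q8_0: 8.5, Q4_K_M: 4.85 };

// Estimate weight storage in GB (10^9 bytes) for a given parameter count.
function estimateWeightsGB(paramCount, quant) {
  const bits = BITS_PER_WEIGHT[quant];
  if (bits === undefined) throw new Error(`unknown quantization: ${quant}`);
  return (paramCount * bits) / 8 / 1e9;
}

const PARAMS = 14.7e9; // parameter count from the model table
for (const quant of Object.keys(BITS_PER_WEIGHT)) {
  console.log(quant, estimateWeightsGB(PARAMS, quant).toFixed(1), 'GB');
}
```

Under these assumptions the Q4_K_M estimate lands close to the ~9 GB figure in the model table, and Q8_0 comes out a little under twice that.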
Why 4,096 context tokens? The median offensive-security conversation fits comfortably in 4K tokens. Longer context (32K+) would increase inference cost and latency with minimal benefit for the targeted use case.
Trade-offs considered:
| Option | Pros | Cons | Decision |
|---|---|---|---|
| GPT-4o-mini via OpenAI API | Easy setup, no hosting | No fine-tuning, per-token cost accumulates, data leaves user control | Rejected |
| Self-hosted vLLM on EC2 | Full control, batch support | High fixed cost ($500+/month for 80 GB GPU), ops overhead | Deferred (Phase 6) |
| HF Inference API (PRO) | Managed GPU, pay-per-use, supports GGUF | Shared infrastructure, queue during peak | Selected |
| Ollama local only | Zero cost, offline | Requires local GPU; not viable for browser/website clients | Retained as dev/local mode |
    +---------------------------------------------------------------+
    |                        CLIENT SURFACES                        |
    | browser-bubble | linux-desktop | windows-tauri |   website    |
    +--------+---------------+---------------+---------------+------+
             |               |               |               |
             +---------------+------+--------+---------------+
                                    |
                      X-Bubble-Auth (shared secret)
                                    |
                     +--------------v--------------+
                     |      Cloudflare Worker      |
                     |      bubble-hf-worker       |
                     |                             |
                     | • Timing-safe auth check    |
                     | • CORS enforcement          |
                     | • OpenAI→Anthropic SSE norm |
                     | • Model routing (chat/ctx)  |
                     +--------------+--------------+
                                    | Bearer HF_TOKEN
                     +--------------v--------------+
                     |    HuggingFace Inference    |
                     | Router (v1/chat/completions)|
                     |                             |
                     | Primary: qwen14b-code-v6    |
                     | Context: Qwen3.5-27B-...    |
                     +-----------------------------+
Primary deployment: Cloudflare Workers + HuggingFace Inference Router
Cloudflare Workers were chosen as the proxy/gateway layer for three reasons:
- HF_TOKEN and BUBBLE_SHARED_SECRET never reach the client. The Worker holds them as encrypted Wrangler secrets, reachable only in the Worker's runtime environment.

HuggingFace Inference Router was chosen for model serving because:
- /v1/chat/completions endpoint with streaming support

Alternatives considered and rejected:
| Platform | Why Rejected |
|---|---|
| Modal | Would require repackaging the GGUF; HF native hosting is simpler |
| AWS SageMaker | High setup overhead; monthly minimums exceed usage-based HF cost |
| Hugging Face Inference Endpoints (dedicated) | Fixed cost (~$80–300+/month for GPU endpoint); overkill for < 500 req/day |
| vLLM on Cloud VM | Excellent for high throughput; not justified until traffic > 1K req/day |
| Component | Configuration |
|---|---|
| Worker runtime | Cloudflare V8 isolate, compatibility_date = 2025-01-01 |
| Worker CPU limit | 10ms CPU time per request (adequate for proxy; no heavy compute) |
| Worker memory | 128 MB (Cloudflare default) |
| HF Inference tier | HF PRO (shared GPU fleet, a100 access) |
| GGUF quantization | Q4_K_M: ~9 GB VRAM, fits on one a100-40GB with overhead |
| Streaming | SSE (Server-Sent Events) end-to-end; Worker normalizes OpenAI→Anthropic SSE format |
| Geographic region | Cloudflare global edge; HF inference runs in US East data centers |
| Endpoint type | Real-time streaming (not batch) |
    cd cloudflare-hf-worker
    npm install

    # Deploy worker code to Cloudflare
    npx wrangler deploy

    # Set encrypted secrets
    npx wrangler secret put HF_TOKEN               # your HuggingFace token
    npx wrangler secret put BUBBLE_SHARED_SECRET   # openssl rand -hex 32
Edit wrangler.toml (copy from wrangler.toml.example) to set:
- HF_DEFAULT_MODEL: the primary chat model ID
- HF_CONTEXT_MODEL: the context-agent model ID
- ALLOWED_ORIGINS: comma-separated list of permitted client origins

A critical implementation detail: the HuggingFace Inference Router returns OpenAI-compatible SSE:
data: {"choices":[{"delta":{"content":"..."}}]}
All four bubble clients consume Anthropic-compatible SSE:
data: {"type":"content_block_delta","delta":{"type":"text_delta","text":"..."}}
The Worker's openAiToAnthropicSse() transform stream handles this conversion in-flight, so no client code changes were needed when switching from the Anthropic API to the HF backend.
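As a rough illustration of the per-line mapping between the two formats shown above, here is a minimal sketch. It is not the Worker's actual transform: `convertOpenAiLine` is a hypothetical helper name, and mapping the OpenAI `[DONE]` sentinel to an Anthropic-style `message_stop` event is an assumption about how end-of-stream could be signalled.

```javascript
// Hypothetical helper: converts one OpenAI-style SSE line to an Anthropic-style line.
// Returns null for lines that carry no text (comments, blanks, role-only deltas).
function convertOpenAiLine(line) {
  if (!line.startsWith('data:')) return null;
  const payload = line.slice(5).trim();
  if (payload === '[DONE]') {
    // Assumed mapping for end-of-stream.
    return 'data: ' + JSON.stringify({ type: 'message_stop' });
  }
  let parsed;
  try {
    parsed = JSON.parse(payload);
  } catch {
    return null; // skip malformed chunks rather than break the stream
  }
  const text = parsed?.choices?.[0]?.delta?.content;
  if (typeof text !== 'string' || text.length === 0) return null;
  return 'data: ' + JSON.stringify({
    type: 'content_block_delta',
    delta: { type: 'text_delta', text },
  });
}

console.log(convertOpenAiLine('data: {"choices":[{"delta":{"content":"hello"}}]}'));
```

In a streaming setting, network chunks can end in the middle of a line, so a real transform would need to buffer partial lines before calling a per-line function like this one.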
Each chat request triggers the primary model. Optionally, a parallel mode=context request routes to the context-agent model (Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled), which streams a 2–4 sentence analytical annotation below the primary response. This gives users a second-opinion layer on technical claims without blocking the primary reply.
| Cost Component | Configuration | Monthly Estimate |
|---|---|---|
| Cloudflare Workers | Free tier (< 100K req/day) | $0 |
| HuggingFace PRO | Subscription (shared inference fleet) | $9 |
| Model storage (HF Hub) | Free for public models | $0 |
| Network egress | Cloudflare → HF (intra-datacenter or peered) | ~$0–2 |
| Monitoring (CF Analytics) | Built-in, no extra cost | $0 |
| Total Estimated | | ~$9–11/month |
If traffic scales past 100K requests/day, upgrading to Cloudflare Workers Paid adds
At 200 requests/day = 6,000 requests/month:
At 500 requests/day = 15,000 requests/month:
For comparison: Claude Haiku 4.5 via the Anthropic API at 1,024 output tokens ≈ $0.0012/request. The HF-backed fine-tuned model is cheaper at scale and provides domain-specific quality.
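The "cheaper at scale" claim can be made concrete with the figures quoted in this section. The sketch below assumes a 30-day month, treats the HuggingFace PRO subscription as a flat $9 with no overage, and uses the quoted per-request estimate as-is (ignoring input tokens); all three are simplifications.

```javascript
// Figures taken from the cost table and the comparison above.
const FLAT_MONTHLY_USD = 9;      // HuggingFace PRO subscription
const PER_REQUEST_USD = 0.0012;  // quoted per-request API estimate

// Monthly cost of a pure pay-per-request backend.
function perRequestMonthlyCost(requestsPerDay, days = 30) {
  return requestsPerDay * days * PER_REQUEST_USD;
}

// Requests per month at which the flat subscription becomes the cheaper option.
function breakEvenRequestsPerMonth() {
  return FLAT_MONTHLY_USD / PER_REQUEST_USD;
}

for (const perDay of [200, 500]) {
  const cost = perRequestMonthlyCost(perDay).toFixed(2);
  console.log(`${perDay} req/day: $${cost} per-request vs $${FLAT_MONTHLY_USD} flat`);
}
console.log('break-even:', Math.round(breakEvenRequestsPerMonth()), 'requests/month');
```

Under these assumptions the per-request option costs about $7.20/month at 200 requests/day and about $18/month at 500 requests/day, so the flat subscription pulls ahead at roughly 7,500 requests/month.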
1. Response caching for repeated queries
Common queries (boilerplate shellcode templates, standard enumeration scripts) can be cached at the Worker layer using the Cloudflare Cache API. A 1-hour TTL on deterministic prompts (temperature=0, identical messages array hash) would reduce HF inference calls by an estimated 15–25% for training/CTF use cases where the same questions recur.
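One way to decide which requests qualify is to derive a canonical key only when the request is deterministic. The sketch below is an assumption about how that could look, not the Worker's code: `cacheKeyFor` is a hypothetical helper, and the key is left as a plain string that a real Worker might hash (for example with SHA-256) before using it with the Cache API.

```javascript
// Hypothetical helper: returns a cache key for deterministic requests,
// or null when the response should not be cached.
function cacheKeyFor(body) {
  if (!body || body.temperature !== 0) return null; // sampling makes output vary
  if (!Array.isArray(body.messages) || body.messages.length === 0) return null;
  // Keep only the fields that affect the answer, in a fixed property order,
  // so that equivalent requests produce identical keys.
  const canonical = {
    model: body.model ?? '',
    messages: body.messages.map((m) => ({ role: m.role, content: m.content })),
  };
  return JSON.stringify(canonical);
}

const ONE_HOUR_SECONDS = 3600; // TTL suggested in the text

const key = cacheKeyFor({
  temperature: 0,
  model: 'example-model',
  messages: [{ role: 'user', content: 'hello' }],
});
console.log(key !== null, ONE_HOUR_SECONDS);
```

Because responses are streamed, a cached entry would have to be assembled from the full stream before it is stored, which is a design cost to weigh against the expected hit rate.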
2. Q4_K_M quantization reduces inference cost
The chosen Q4_K_M quantization uses roughly 30% of the memory that the F16 (half-precision) weights would require (~9 GB versus ~29 GB for 14.7B parameters at 2 bytes each). On HF's shared GPU fleet, reduced memory footprint means shorter queue times and lower effective per-request cost. The 1–3% perplexity penalty is acceptable for code generation tasks.
3. Token limits by mode
Chat mode caps at 1,024 output tokens. Context-agent mode caps at 384 tokens. These limits prevent runaway generation that would inflate cost with no UX benefit (context annotations are meant to be concise).
4. Lazy context-agent activation
The context agent is opt-in at the client level (HF_CONTEXT_ENABLED=true). When disabled (the default), no secondary inference call is made. This halves the effective HF inference volume for most deployments.
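The token caps and the opt-in context agent can be pictured as a small planning step that lists the upstream calls one user message would trigger. This is a sketch under assumptions: `planRequests` and the `settings` object are hypothetical, and only the variable names and limits are taken from this document.

```javascript
// Output-token caps from the model table.
const MAX_TOKENS = { chat: 1024, context: 384 };

// Hypothetical helper: lists the upstream calls one user message would trigger.
// `settings` mirrors the configuration names used in this document.
function planRequests(messages, settings) {
  const plans = [
    { mode: 'chat', model: settings.HF_DEFAULT_MODEL, max_tokens: MAX_TOKENS.chat, stream: true, messages },
  ];
  // The context agent is opt-in; anything other than the string "true" disables it.
  if (settings.HF_CONTEXT_ENABLED === 'true') {
    plans.push({ mode: 'context', model: settings.HF_CONTEXT_MODEL, max_tokens: MAX_TOKENS.context, stream: true, messages });
  }
  return plans;
}

const plans = planRequests(
  [{ role: 'user', content: 'hello' }],
  { HF_DEFAULT_MODEL: 'primary-model', HF_CONTEXT_MODEL: 'context-model' },
);
console.log(plans.length); // one call when the context agent is left disabled
```

With the flag unset, only the primary call is planned, which is where the halving of inference volume comes from.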
5. Cloudflare free tier headroom
At projected traffic (50–500 req/day), the free Workers tier (100K req/day) provides 200–2000× headroom. No paid compute scaling is triggered automatically; scale events are manual and deliberate.
| Metric | Why It Matters | Tool | Alert Threshold |
|---|---|---|---|
| Worker request count | Volume tracking, cost forecasting | Cloudflare Analytics | > 90K/day (90% of free tier) |
| Worker error rate (4xx/5xx) | Reliability indicator | Cloudflare Analytics | > 2% over 5-minute window |
| Worker CPU time (p99) | Detect processing spikes | Cloudflare Analytics | > 8ms (near 10ms limit) |
| HF Inference latency (TTFT) | User experience; time-to-first-token | Client-side logging | > 5s (p95) |
| Auth failure rate (401 count) | Security signal; potential brute-force | Cloudflare Analytics | > 10/hour |
| Shared secret header absence | Misconfigurations or probing | Worker logs | Any occurrence from known origins |
| HF upstream error rate | HF availability; model availability | Worker logs (upstream.status) | > 5% over 10-minute window |
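The numeric thresholds in the table above could be encoded as data and checked by a small function, for example from a scheduled job. This is only a sketch: the metric field names are hypothetical, and how the values would be collected from Cloudflare Analytics or Worker logs is out of scope here.

```javascript
// Thresholds copied from the metrics table; the field names are hypothetical.
const THRESHOLDS = {
  requestsPerDay: 90000,      // 90% of the free tier
  errorRate: 0.02,            // 4xx/5xx share over a 5-minute window
  cpuP99Ms: 8,                // near the 10 ms CPU limit
  ttftP95Seconds: 5,          // time to first token
  authFailuresPerHour: 10,    // 401 responses
  upstreamErrorRate: 0.05,    // HF errors over a 10-minute window
};

// Returns the names of every reported metric that exceeds its threshold.
function breachedAlerts(metrics) {
  return Object.entries(THRESHOLDS)
    .filter(([name, limit]) => typeof metrics[name] === 'number' && metrics[name] > limit)
    .map(([name]) => name);
}

console.log(breachedAlerts({ errorRate: 0.03, cpuP99Ms: 4 }));
```

Metrics that are missing from the input are ignored rather than treated as breaches, so a partial snapshot does not raise false alarms.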
| Tool | Purpose | Cost |
|---|---|---|
| Cloudflare Analytics | Worker invocation count, error rates, CPU time, geographic distribution | Free (included) |
| Cloudflare Logpush (optional) | Export Worker request logs to R2/S3 for long-term retention and analysis | $0.05/GB |
| HuggingFace Hub (model page) | Download/inference activity visible on model card | Free |
| LangFuse (planned Phase 6) | LLM-specific tracing: prompt/response pairs, latency per token, cost attribution | Free tier available |
| Self-hosted Uptime Check | Periodic curl smoke-test of the Worker endpoint from cron | Free (existing infra) |
The Worker logs upstream HTTP status on error:
    if (!upstream.ok) {
      const err = await upstream.text().catch(() => upstream.statusText);
      return new Response(err, { status: upstream.status, headers: corsHeaders });
    }
Cloudflare's wrangler tail surfaces these in real time during development:
npx wrangler tail --format pretty
Alert 1: High error rate
- ray.soreng@gmail.com)
- wrangler tail for upstream error body; verify HF token validity; check HF status page

Alert 2: Free tier approach (90K req/day)

Alert 3: Auth anomaly (> 10 401s/hour)
- BUBBLE_SHARED_SECRET if probing confirmed; verify ALLOWED_ORIGINS list is restrictive

Alert 4: HF upstream timeouts
- HF_DEFAULT_MODEL to a lighter model (e.g., meta-llama/Llama-3.1-8B-Instruct) in wrangler.toml and redeploy; file a HF status issue if persistent

For a Phase 6 production upgrade, the following would be added:
All client-to-Worker communication is authenticated via a shared secret transmitted in the X-Bubble-Auth HTTP header. The Worker performs a timing-safe comparison using a bitwise XOR loop (not ===) to prevent timing oracle attacks:
    function timingSafeEqual(a, b) {
      if (typeof a !== 'string' || typeof b !== 'string') return false;
      if (a.length !== b.length) return false;
      const ae = new TextEncoder().encode(a);
      const be = new TextEncoder().encode(b);
      let diff = 0;
      for (let i = 0; i < ae.length; i++) diff |= ae[i] ^ be[i];
      return diff === 0;
    }
The BUBBLE_SHARED_SECRET is stored as an encrypted Wrangler secret (never in wrangler.toml or source control). It should be generated with openssl rand -hex 32 and rotated periodically.
Cloudflare Workers do not apply per-IP rate limiting by default. The current mitigations are:
- ALLOWED_ORIGINS): Requests from unlisted origins are rejected at the CORS preflight level.
- X-Bubble-Auth header returns 401 immediately; no model inference is triggered, so unauthenticated probing costs nothing.
- wrangler.toml → rules) can enforce per-IP limits (e.g., 60 requests/minute) when needed.

The Worker validates the messages array structure before forwarding to HF:
    if (!Array.isArray(messages) || messages.length === 0)
      return new Response(JSON.stringify({ error: 'messages array required' }), { status: 400 });
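The origin allowlist and the payload check can both be expressed as small pure functions. The sketch below is illustrative and goes beyond what the Worker is documented to do: the helper names are hypothetical, and the per-message role and content checks are an assumed stricter variant of the array check shown above.

```javascript
// Parse the comma-separated ALLOWED_ORIGINS value into a Set of exact origins.
function parseAllowedOrigins(csv) {
  return new Set(String(csv || '').split(',').map((s) => s.trim()).filter(Boolean));
}

// Exact-match check; a missing Origin header is treated as not allowed here.
function isAllowedOrigin(origin, allowed) {
  return typeof origin === 'string' && allowed.has(origin);
}

// Assumed stricter validation: every message needs a known role and string content.
const ROLES = new Set(['system', 'user', 'assistant']);
function validateMessages(messages) {
  if (!Array.isArray(messages) || messages.length === 0) return 'messages array required';
  for (const [i, m] of messages.entries()) {
    if (!m || !ROLES.has(m.role)) return `messages[${i}].role is invalid`;
    if (typeof m.content !== 'string') return `messages[${i}].content must be a string`;
  }
  return null; // null means the payload is acceptable
}

const allowed = parseAllowedOrigins('https://example.com, https://app.example.com');
console.log(isAllowedOrigin('https://example.com', allowed));
console.log(validateMessages([{ role: 'user', content: 'hello' }]));
```

Desktop clients may not send an Origin header at all, so a deployment that serves them would need a deliberate rule for that case rather than the blanket rejection used in this sketch.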
Prompt injection mitigation: The context-agent system prompt is hardcoded in the Worker and prepended server-side, so clients cannot override it. The primary chat system prompt is client-supplied (appropriate for a personal tool) but could be locked server-side for multi-tenant deployments.
No content filtering currently. For production use as a red team tool, this is intentional: the fine-tuned model is specifically designed to generate offensive code for authorized use. In a public-facing deployment, a content moderation layer (e.g., Llama Guard or a Cloudflare WAF rule) would be added.
| Resource | Who Has Access | Control Mechanism |
|---|---|---|
| HF_TOKEN | Worker runtime only | Wrangler encrypted secret |
| BUBBLE_SHARED_SECRET | Worker + authorized clients | Wrangler encrypted secret; shared out-of-band |
| Cloudflare dashboard | Account owner | Cloudflare SSO + 2FA |
| HuggingFace Hub (model) | Public (read); owner (write) | HF token with scoped permissions |
| .env files | Local dev only | .gitignore excludes all *.env and .env.* |

- package-lock.json (currently no runtime npm dependencies; the Worker is a single vanilla JS file).
- apache-2.0 license; the base model (Qwen/Qwen2.5-Coder-14B-Instruct) is also Apache 2.0.
- https://github.com/cmndcntrlcyber/code-trainer-offsec-pipeline.
- wrangler.toml config change.

| Item | Description |
|---|---|
| vLLM deployment | Self-hosted vLLM stack for higher throughput and OpenAI-compatible batch API |
| LangFuse tracing | Per-request LLM observability with cost tracking |
| Llama Guard integration | Content moderation layer for multi-tenant widget deployments |
| Per-user rate limiting | Cloudflare KV-backed request quota per API key |
| Automated eval CI | GitHub Actions job that runs fixed prompt set and compares against reference outputs |
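For the planned per-user rate limiting, a fixed-window counter is one simple option. The sketch below is an assumption about how such a quota could work, not a committed design: `allowRequest` is a hypothetical helper, and a plain `Map` stands in for the key-value store so the logic can be read and run on its own.

```javascript
// Fixed-window counter. `store` is any Map-like object standing in for a KV namespace.
// Returns true while the caller is within `limit` requests for the current window.
function allowRequest(store, key, nowMs, limit = 60, windowMs = 60000) {
  const windowStart = Math.floor(nowMs / windowMs) * windowMs;
  const bucket = `${key}:${windowStart}`; // one counter per key per window
  const count = (store.get(bucket) || 0) + 1;
  store.set(bucket, count);
  return count <= limit;
}

const store = new Map();
console.log(allowRequest(store, 'client-a', Date.now()));
```

A real Cloudflare KV store is asynchronous and eventually consistent, so counts kept there would be approximate; a stricter quota would need a strongly consistent store, and old buckets would need an expiry so the store does not grow without bound.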