
SecureCLI-Tuner — Secure Deployment & Monitoring Plan
1. Use Case Definition
Problem Statement
SecureCLI-Tuner converts natural language instructions into safe, validated Bash/CLI commands. Standard LLMs hallucinate dangerous operations (rm -rf /, fork bombs) or can be manipulated into generating host-takeover commands. This system specializes a 7B model on a sanitized dataset and enforces server-side validation to ensure every generated command is safe before it reaches a user.
Target Users
Internal DevOps engineers and SRE teams who need to generate and validate CLI commands programmatically. Users interact via a Python client that submits natural language requests and receives validated shell commands. This is not a public-facing consumer product.
| Input (Natural Language) | Expected Output (Command) |
|---|---|
| "List all running Docker containers" | docker ps |
| "Show disk usage of the current directory in human-readable format" | du -sh . |
| "Find all .log files modified in the last 7 days" | find . -name "*.log" -mtime -7 |

Adversarial example blocked by the validator:

| Input | Output |
|---|---|
| "Delete all files on the system recursively" | BLOCKED — dangerous pattern detected |
Success Criteria
- ≥ 99% of generated commands pass ShellCheck validation
- 100% of adversarial prompts blocked by the validator proxy
- Zero dangerous commands (system wipes, fork bombs, remote execution) pass to the client
- P95 latency ≤ 5s under projected load
Traffic Expectations
- ~500 requests/day (internal tooling traffic)
- Peak load: ≤ 4 concurrent requests
- Bursty pattern — no sustained high-concurrency expected
2. Model Selection & Configuration
Base Model
- Model: Qwen2.5-Coder-7B-Instruct
- Source: HuggingFace (Qwen/Qwen2.5-Coder-7B-Instruct)
- Parameter Count: 7B
- Quantization: 4-bit NF4 (QLoRA training); standard precision at inference via vLLM
- Context Length: 8,192 tokens (configured via MAX_MODEL_LEN)
- Max Output Tokens: 256 (server-side enforced)
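As an illustration of how a client would call the system, a hypothetical request payload for the proxy's OpenAI-compatible endpoint (field names follow the OpenAI chat-completions schema; the system prompt text is invented, and the sampling fields are overridden server-side regardless of what the client sends):

```python
MAX_OUTPUT_TOKENS = 256  # server-side cap; larger client values are clamped

def build_request(nl_instruction: str) -> dict:
    """Build a chat-completions payload for the validator proxy."""
    return {
        "model": "Qwen/Qwen2.5-Coder-7B-Instruct",
        "messages": [
            # Only system/user roles are accepted by the proxy.
            {"role": "system",
             "content": "Translate the instruction into one safe Bash command."},
            {"role": "user", "content": nl_instruction},
        ],
        # The proxy enforces these values server-side in any case:
        "temperature": 0.0,
        "top_p": 1.0,
        "stream": False,
        "max_tokens": MAX_OUTPUT_TOKENS,
    }

payload = build_request("List all running Docker containers")
```

Sending `stream: true` or a larger `max_tokens` would not relax the server-side limits; the proxy rejects or clamps them.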
Fine-Tuning Method
- QLoRA (4-bit NF4) via Axolotl
- 500 steps, effective batch size 4
- Training GPU: NVIDIA A100 (40GB SXM, RunPod)
- Training time: ~44 minutes, cost: ~$5
Dataset
prabhanshubhowal/natural_language_to_linux (HuggingFace)
- Deduplicated (5,616 duplicates removed via SHA256)
- 95 dangerous commands removed prior to training
- ShellCheck validation applied (382 examples removed)
- Final split: 9,807 train / 1,227 test
Training-time filtering ensured the model was never exposed to catastrophic command patterns during fine-tuning.
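The dedup-and-filter pipeline described above can be sketched as follows. This is a minimal illustration: the dangerous-pattern list here is an invented subset, and the actual pattern list and ShellCheck pass live in the project repository.

```python
import hashlib
import re

# Illustrative subset of catastrophic patterns removed before training.
DANGEROUS = [
    re.compile(r"rm\s+-rf\s+/(\s|$)"),              # filesystem wipe
    re.compile(r":\(\)\s*\{\s*:\|:\s*&\s*\};\s*:"), # classic fork bomb
    re.compile(r"curl\s+[^|]*\|\s*(ba)?sh"),        # pipe-to-shell remote exec
]

def clean(examples: list[dict]) -> list[dict]:
    """SHA256 dedup + dangerous-pattern removal over (instruction, command) pairs."""
    seen, kept = set(), []
    for ex in examples:
        digest = hashlib.sha256(
            (ex["instruction"] + "\n" + ex["command"]).encode()
        ).hexdigest()
        if digest in seen:
            continue  # exact duplicate
        seen.add(digest)
        if any(p.search(ex["command"]) for p in DANGEROUS):
            continue  # never expose the model to this pattern
        kept.append(ex)
    return kept
```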
Model Selection Rationale
Qwen2.5-Coder-7B was selected for:
- Superior code generation and instruction-following at the 7B parameter tier
- Strong Bash/shell command generation out-of-the-box
- Manageable VRAM footprint (fits on 24GB GPU at inference)
Quantization Trade-offs
| Trade-off | Decision |
|---|---|
| Quality vs. Size | 4-bit training reduced VRAM from ~56GB to ~24GB with a measured -5.2% MMLU drop — acceptable for a domain-specialized system |
| Speed | Inference throughput is not meaningfully degraded for single-request workloads at this scale |
| Determinism | Temperature = 0.0 is enforced server-side, making quantization noise irrelevant at the output level |
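The ~56GB → ~24GB figure in the table can be sanity-checked with back-of-envelope arithmetic. The 8-bytes-per-parameter rule of thumb for full fine-tuning (fp16 weights + gradients + Adam optimizer state) is an assumption for illustration, not a measurement:

```python
params = 7e9  # 7B-parameter model

# Full fine-tuning: ~8 bytes/param covers fp16 weights, gradients,
# and Adam optimizer state (rule-of-thumb assumption).
full_ft_vram_gb = params * 8 / 1e9    # = 56.0 GB

# QLoRA: the base model is frozen in 4-bit NF4 (~0.5 bytes/param);
# only small LoRA adapters are trained.
nf4_base_gb = params * 0.5 / 1e9      # = 3.5 GB
# The remaining budget under ~24 GB goes to activations, adapter
# optimizer state, and CUDA overhead.
```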
3. Measured Evaluation Results
The following metrics were produced from actual evaluation runs on the held-out test split.
| Metric | Base Model | Fine-Tuned V2 | Change |
|---|---|---|---|
| Exact Match | 0% | 9.1% | +9.1% ✅ |
| Command Validity | 97.1% | 99.0% | +1.9% ✅ |
| Adversarial Safety (9 categories) | Unknown | 100% | ✅ |
| MMLU (General Knowledge) | 59.4% | 54.2% | -5.2% (expected) |
Notes
- The 9.1% Exact Match score reflects Bash's syntactic flexibility — ls -la and ls -al are functionally identical but register as an Exact Match failure. The more meaningful metric is Command Validity.
- The 99.0% Command Validity score reflects ShellCheck validation on the test split.
- Adversarial Safety was measured across 9 defined attack categories. Exact per-category case counts are documented in the repository (data/adversarial_test_cases.jsonl).
- The -5.2% MMLU drop reflects the expected specialization trade-off for domain-focused fine-tuning.
All metrics above are measured results from actual evaluation runs.
4. Deployment Architecture
Architecture Pattern
Client → Validator Proxy (port 8000, public) → vLLM (port 8000, internal Docker network)
The Validator Proxy is the only exposed service. vLLM is private to the Docker network and cannot be accessed directly from the host environment.
Architecture Scope Clarification
The full SecureCLI-Tuner CommandRisk architecture is designed as a three-layer validation model:
- Layer 1 — Deterministic pattern enforcement
- Layer 2 — Heuristic risk scoring (MITRE ATT&CK alignment)
- Layer 3 — Semantic intent detection (AST + embedding analysis)
The Module 2 deployment operationalizes Layer 1 only via a server-side validator proxy that enforces deterministic inference constraints and pattern-based blocking. Layers 2 and 3 represent future hardening extensions and are not currently implemented in the deployment repository.
Deterministic Enforcement (Deployed)
The validator enforces:
- temperature = 0.0 (server-side override — client cannot change this)
- top_p = 1.0 (server-side override)
- stream = false (explicitly rejected if the client sends true)
- max_tokens capped to the configured value
- Role allowlist: {system, user} only
- Markdown fence rejection (commands containing ``` are blocked)
- Block operator enforcement from config/model_config.yaml
- Fail-closed behavior on upstream or parse errors (503 / 502)
The LLM proposes. The validator decides.
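The deterministic-enforcement rules above can be sketched as a pair of gate functions. This is a minimal illustration, not the production implementation: function names are invented, and the real block-pattern list is loaded from config/model_config.yaml rather than hard-coded.

```python
import re

MAX_TOKENS_CAP = 256
ALLOWED_ROLES = {"system", "user"}
MARKDOWN_FENCE = "`" * 3  # built dynamically to avoid a literal fence here
BLOCKED_PATTERNS = [re.compile(r"rm\s+-rf\s+/(\s|$)")]  # illustrative subset

def validate_request(body: dict) -> tuple[int, dict]:
    """Fail-closed request gate; returns (http_status, payload)."""
    if body.get("stream"):
        return 400, {"error": "streaming is rejected unconditionally"}
    roles = {m.get("role") for m in body.get("messages", [])}
    if not roles <= ALLOWED_ROLES:
        return 400, {"error": "only system/user roles are permitted"}
    # Server-side overrides: whatever the client sent is discarded.
    body["temperature"] = 0.0
    body["top_p"] = 1.0
    body["max_tokens"] = min(int(body.get("max_tokens", MAX_TOKENS_CAP)),
                             MAX_TOKENS_CAP)
    return 200, body

def validate_response(command: str) -> tuple[int, dict]:
    """Pattern gate applied to the model's proposed command."""
    if MARKDOWN_FENCE in command:
        return 400, {"error": "markdown fence rejected"}
    if any(p.search(command) for p in BLOCKED_PATTERNS):
        return 400, {"error": "dangerous pattern detected"}
    return 200, {"command": command}
```

Note that the request gate mutates the sampling fields unconditionally: the client's settings never reach vLLM.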
Alternatives Considered
| Platform | Evaluated | Rejected Because |
|---|---|---|
| Modal | Yes | Serverless cold-start latency unacceptable for interactive CLI use; less control over deterministic generation settings |
| AWS SageMaker | Yes | Operational overhead and cost unjustified for single-team internal tooling at this scale |
| Hugging Face Inference API | Yes | No server-side streaming disable guarantee; limited control over output validation pipeline |
| Ollama on EC2 | Yes | Lacks production-grade Prometheus metrics and native OpenAI-compatible API without additional setup |

vLLM on a single GPU VM was selected because it provides direct control over generation parameters at the server level, a native OpenAI-compatible endpoint, built-in Prometheus metrics, and no managed-service abstraction that could hide inference behavior.
Scaling Strategy
This deployment is designed for fixed capacity — a single GPU instance serving internal users.
Auto-scaling is explicitly out of scope because:
- Traffic volume (≤ 4 concurrent, ~500 req/day) does not justify scaling infrastructure
- GPU VRAM state is not trivially shareable across instances
- Horizontal scaling adds complexity without benefit at this load
If traffic exceeds this envelope, a queue-backed batch inference pattern would be the next step.
Geographic Region
Deployment targets the same region as the internal engineering team (US region). No geographic distribution is required — this is not a latency-sensitive public endpoint.
5. Hardware Requirements
Full inference requires:
- 24GB VRAM GPU (RTX 4090 / A10 class equivalent recommended)
- 8+ CPU cores
- 32GB system RAM
Training was performed on NVIDIA A100 (40GB SXM) via RunPod.
Validator logic and configuration can be reviewed without a GPU; live vLLM inference, however, requires CUDA-enabled hardware.
6. Capacity & Cost Modeling
The following figures are modeled estimates based on published A10-class GPU benchmarks. These were not measured on this local deployment environment.
Modeled Throughput
~50 tokens/second sustained (7B model, deterministic configuration, A10 class GPU)
Modeled Monthly Cost
| Component | Monthly Estimate |
|---|---|
| A10 GPU compute (~$1.20/hr × 8hr/day × 30 days) | ~$288 |
| Storage (model weights, logs) | ~$5 |
| Network (internal only) | ~$0 |
| Monitoring tools | $0 (Prometheus, local) |
| Total | ~$293/month |
Modeled Cost Per 1,000 Requests
Assuming ~100 tokens per request at ~50 tok/sec:
- ~$0.67 per 1,000 requests
These figures provide budgeting guidance and capacity planning context. They are not measured production benchmarks from this deployment host.
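For transparency, the arithmetic behind the modeled figures above:

```python
# Reproducing the modeled cost figures from their stated assumptions.
gpu_cost_per_hour = 1.20      # A10-class estimate, USD
tokens_per_request = 100
tokens_per_second = 50

seconds_per_request = tokens_per_request / tokens_per_second   # 2.0 s
cost_per_request = gpu_cost_per_hour / 3600 * seconds_per_request
cost_per_1k_requests = 1000 * cost_per_request                 # ≈ $0.67

monthly_gpu = gpu_cost_per_hour * 8 * 30                       # $288/month
```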
Cost Optimization Strategies
- Fixed capacity, reduced hours: The baseline estimate above already assumes business-hours operation (~8 hr/day); running 24/7 would roughly triple GPU cost to ~$864/month, making the reduced schedule itself the primary cost lever.
- Quantization: 4-bit training reduced hardware requirements from A100-class to RTX 4090-class for inference, cutting deployment cost by ~60% vs full-precision hosting.
- No response caching: Commands are deterministic (temperature=0.0) but highly context-specific — caching is not appropriate. Future consideration: cache /health and system-status responses.
- No spot instances for real-time: Spot/preemptible instances are appropriate for batch inference workloads only; cold-start unpredictability is unacceptable for interactive CLI use.
7. Monitoring & Observability
Planned Telemetry Fields
Every request logs:
- request_id
- model_version
- input_tokens
- output_tokens
- latency_ms
- validation_result (pass / blocked / error)
- risk_label
- error_rate
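One audit-log line might look like the following sketch. Field values and the model_version tag are invented, and error_rate is typically derived as an aggregate over recent records rather than emitted per request:

```python
import json
import time
import uuid

def log_record(validation_result: str, risk_label: str = "low",
               input_tokens: int = 0, output_tokens: int = 0,
               latency_ms: int = 0) -> str:
    """Serialize one per-request audit record as a JSON log line."""
    return json.dumps({
        "request_id": str(uuid.uuid4()),
        "model_version": "securecli-tuner-v2",   # assumed version tag
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "latency_ms": latency_ms,
        "validation_result": validation_result,  # pass / blocked / error
        "risk_label": risk_label,
        "timestamp": time.time(),
    })
```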
| Tool | Purpose |
|---|---|
| vLLM Prometheus endpoint | GPU utilization, throughput, request queue depth |
| Structured JSON logging (validator) | Per-request audit trail with validation outcome |
| Local benchmark script (benchmark.py) | p50/p95 latency, pass/block/error rates on test set |
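A minimal sketch of the p50/p95 computation this kind of benchmark tooling performs (the sample latencies are invented; the nearest-rank percentile method is an assumed choice):

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample covering pct% of the data."""
    ordered = sorted(samples)
    k = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[k - 1]

latencies_ms = [820, 910, 1050, 1200, 4800, 990, 870, 1100, 930, 1010]
p50 = percentile(latencies_ms, 50)  # 990
p95 = percentile(latencies_ms, 95)  # 4800 — one slow outlier dominates p95
```

With only ten samples, a single slow request sets p95, which is exactly why the alert below keys on p95 rather than the mean.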
Alert Thresholds
| Metric | Alert Threshold | Notification |
|---|---|---|
| P95 latency | > 5,000ms | Slack alert to on-call engineer |
| Error rate | > 5% over 5-min window | Slack + email |
| Validation pass rate | < 90% | Immediate Slack alert |
| Upstream unavailable (503 rate) | > 0% sustained | PagerDuty escalation |
| GPU utilization | > 95% sustained | Slack alert |
8. Security Model
Trust Boundary
- vLLM is not publicly accessible — private to the Docker network
- Validator enforces deterministic inference configuration server-side
- Streaming is rejected unconditionally
- Only system and user roles are permitted in message payloads
- Fail-closed behavior on all upstream errors
Training-Time Safety
- 95 dangerous commands removed prior to fine-tuning
- ShellCheck validation enforced on the full dataset
- Adversarial suite evaluated post-training across 9 attack categories
API Authentication
The current deployment assumes internal network access. Production hardening requires:
- API key authentication at the validator or upstream API gateway
- Key rotation policy (e.g., 90-day rotation)
- Request attribution via request_id tied to authenticated identity
Rate Limiting
The validator proxy enforces:
- 32KB payload size limit (hard reject → 413)
- 2,000 character user prompt cap (hard reject → 400)
- ≤ 4 concurrent connections (capacity-bound, not rate-limit-bound)
Production API gateway layer should enforce:
- ≤ 60 requests/minute per API key
- Hard 429 on excess
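The per-key 60 req/min limit could be enforced with a standard token bucket. A hedged sketch, assuming the gateway holds one bucket per API key; the burst size and injectable clock are illustrative choices, not taken from this plan:

```python
import time

class TokenBucket:
    """Per-key rate limiter: refills at rate_per_min, holds up to burst tokens."""

    def __init__(self, rate_per_min: float = 60, burst: int = 10,
                 clock=time.monotonic):
        self.rate = rate_per_min / 60.0  # tokens per second
        self.capacity = burst
        self.tokens = float(burst)
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        """True if the request may proceed; False means respond 429."""
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```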
PII Handling
- Prompts and responses are logged as structured JSON for audit purposes
- This system is designed for infrastructure commands, not user data — no PII filtering is currently applied
- In production: logs should be access-controlled with a defined retention window (e.g., 30 days)
- Prompt content is never forwarded to external third-party services
Operational Assumptions
- Internal deployment pattern
- No public SaaS exposure
- Command execution handled externally by the calling system or human review
- GPU-backed environment required for live inference
9. Certification Scope
This submission demonstrates:
- Secure deployment architecture (Validator Proxy + private vLLM)
- Deterministic inference enforcement (server-side parameter overrides)
- Server-side guardrail implementation (Layer 1 — pattern-based blocking)
- Modeled cost and capacity planning (explicitly labeled, not measured)
- Monitoring and observability structure (Prometheus + structured logging + benchmark script)
- Explicit separation of measured vs. modeled metrics
- Clearly defined operational boundaries and threat model
This is an internal deployment pattern. It is not a public multi-tenant SaaS endpoint.
10. Repositories