
SecureCLI-Tuner — Secure Deployment & Monitoring Plan
1. Use Case Definition
Problem Statement
SecureCLI-Tuner converts natural language instructions into safe, validated Bash/CLI commands. Standard LLMs hallucinate dangerous operations (rm -rf /, fork bombs) or can be manipulated into generating host-takeover commands. This system specializes a 7B model on a sanitized dataset and enforces server-side validation to ensure every generated command is safe before it reaches a user.
Target Users
Internal DevOps engineers and SRE teams who need to generate and validate CLI commands programmatically. Users interact via a Python client that submits natural language requests and receives validated shell commands. This is not a public-facing consumer product.
| Input (Natural Language) | Expected Output (Command) |
|---|---|
| "List all running Docker containers" | docker ps |
| "Show disk usage of the current directory in human-readable format" | du -sh . |
| "Find all .log files modified in the last 7 days" | find . -name "*.log" -mtime -7 |

Adversarial example blocked by the validator:

| Input | Output |
|---|---|
| "Delete all files on the system recursively" | BLOCKED — dangerous pattern detected |
Success Criteria
- ≥ 99% of generated commands pass ShellCheck validation
- 100% of adversarial prompts blocked by the validator proxy
- Zero dangerous commands (system wipes, fork bombs, remote execution) pass to the client
- P95 latency ≤ 5s under projected load
Traffic Expectations
- ~500 requests/day (internal tooling traffic)
- Peak load: ≤ 4 concurrent requests
- Bursty pattern — no sustained high-concurrency expected
2. Model Selection & Configuration
Base Model
- Model: Qwen2.5-Coder-7B-Instruct
- Source: HuggingFace (Qwen/Qwen2.5-Coder-7B-Instruct)
- Parameter Count: 7B
- Quantization: 4-bit NF4 (QLoRA training); standard precision at inference via vLLM
- Context Length: 8,192 tokens (configured via MAX_MODEL_LEN)
- Max Output Tokens: 256 (server-side enforced)
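As an illustration of how a client would call the system, a hypothetical request payload for the proxy's OpenAI-compatible endpoint (field names follow the OpenAI chat-completions schema; the system prompt text is invented, and the sampling fields are overridden server-side regardless of what the client sends):

```python
MAX_OUTPUT_TOKENS = 256  # server-side cap; larger client values are clamped

def build_request(nl_instruction: str) -> dict:
    """Build a chat-completions payload for the validator proxy."""
    return {
        "model": "Qwen/Qwen2.5-Coder-7B-Instruct",
        "messages": [
            # Only system/user roles are accepted by the proxy.
            {"role": "system",
             "content": "Translate the instruction into one safe Bash command."},
            {"role": "user", "content": nl_instruction},
        ],
        # The proxy enforces these values server-side in any case:
        "temperature": 0.0,
        "top_p": 1.0,
        "stream": False,
        "max_tokens": MAX_OUTPUT_TOKENS,
    }

payload = build_request("List all running Docker containers")
```

Sending `stream: true` or a larger `max_tokens` would not relax the server-side limits; the proxy rejects or clamps them.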
Fine-Tuning Method
- QLoRA (4-bit NF4) via Axolotl
- 500 steps, effective batch size 4
- Training GPU: NVIDIA A100 (40GB SXM, RunPod)
- Training time: ~44 minutes, cost: ~$5
Dataset
prabhanshubhowal/natural_language_to_linux (HuggingFace)
- Deduplicated (5,616 duplicates removed via SHA256)
- 95 dangerous commands removed prior to training
- ShellCheck validation applied (382 examples removed)
- Final split: 9,807 train / 1,227 test
Training-time filtering ensured the model was never exposed to catastrophic command patterns during fine-tuning.
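The dedup-and-filter pipeline described above can be sketched as follows. This is a minimal illustration: the dangerous-pattern list here is an invented subset, and the actual pattern list and ShellCheck pass live in the project repository.

```python
import hashlib
import re

# Illustrative subset of catastrophic patterns removed before training.
DANGEROUS = [
    re.compile(r"rm\s+-rf\s+/(\s|$)"),              # filesystem wipe
    re.compile(r":\(\)\s*\{\s*:\|:\s*&\s*\};\s*:"), # classic fork bomb
    re.compile(r"curl\s+[^|]*\|\s*(ba)?sh"),        # pipe-to-shell remote exec
]

def clean(examples: list[dict]) -> list[dict]:
    """SHA256 dedup + dangerous-pattern removal over (instruction, command) pairs."""
    seen, kept = set(), []
    for ex in examples:
        digest = hashlib.sha256(
            (ex["instruction"] + "\n" + ex["command"]).encode()
        ).hexdigest()
        if digest in seen:
            continue  # exact duplicate
        seen.add(digest)
        if any(p.search(ex["command"]) for p in DANGEROUS):
            continue  # never expose the model to this pattern
        kept.append(ex)
    return kept
```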
Model Selection Rationale
Qwen2.5-Coder-7B was selected for:
- Superior code generation and instruction-following at the 7B parameter tier
- Strong Bash/shell command generation out-of-the-box
- Manageable VRAM footprint (fits on 24GB GPU at inference)
Quantization Trade-offs
| Trade-off | Decision |
|---|---|
| Quality vs. Size | 4-bit training reduced VRAM from ~56GB to ~24GB with a measured -5.2% MMLU drop — acceptable for a domain-specialized system |
| Speed | Inference throughput is not meaningfully degraded for single-request workloads at this scale |
| Determinism | Temperature = 0.0 is enforced server-side, making quantization noise irrelevant at the output level |
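The ~56GB → ~24GB figure in the table can be sanity-checked with back-of-envelope arithmetic. The 8-bytes-per-parameter rule of thumb for full fine-tuning (fp16 weights + gradients + Adam optimizer state) is an assumption for illustration, not a measurement:

```python
params = 7e9  # 7B-parameter model

# Full fine-tuning: ~8 bytes/param covers fp16 weights, gradients,
# and Adam optimizer state (rule-of-thumb assumption).
full_ft_vram_gb = params * 8 / 1e9    # = 56.0 GB

# QLoRA: the base model is frozen in 4-bit NF4 (~0.5 bytes/param);
# only small LoRA adapters are trained.
nf4_base_gb = params * 0.5 / 1e9      # = 3.5 GB
# The remaining budget under ~24 GB goes to activations, adapter
# optimizer state, and CUDA overhead.
```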
3. Measured Evaluation Results
The following metrics were produced from actual evaluation runs on the held-out test split.
| Metric | Base Model | Fine-Tuned V2 | Change |
|---|---|---|---|
| Exact Match | 0% | 9.1% | +9.1% ✅ |
| Command Validity | 97.1% | 99.0% | +1.9% ✅ |
| Adversarial Safety (9 categories) | Unknown | 100% | ✅ |
| MMLU (General Knowledge) | 59.4% | 54.2% | -5.2% (expected) |
Notes
- The 9.1% Exact Match score reflects Bash's syntactic flexibility — ls -la and ls -al are functionally identical but register as an Exact Match failure. The more meaningful metric is Command Validity.
- The 99.0% Command Validity score reflects ShellCheck validation on the test split.
- Adversarial Safety was measured across 9 defined attack categories. Exact per-category case counts are documented in the repository (data/adversarial_test_cases.jsonl).
- The -5.2% MMLU drop reflects the expected specialization trade-off for domain-focused fine-tuning.
All metrics above are measured results from actual evaluation runs.
4. Deployment Architecture
Architecture Pattern
Client → Validator Proxy (port 8000, public) → vLLM (port 8000, internal Docker network)
The Validator Proxy is the only exposed service. vLLM is private to the Docker network and cannot be accessed directly from the host environment.
Architecture Scope Clarification
The full SecureCLI-Tuner CommandRisk architecture is designed as a three-layer validation model:
- Layer 1 — Deterministic pattern enforcement
- Layer 2 — Heuristic risk scoring (MITRE ATT&CK alignment)
- Layer 3 — Semantic intent detection (AST + embedding analysis)
The Module 2 deployment operationalizes Layer 1 only via a server-side validator proxy that enforces deterministic inference constraints and pattern-based blocking. Layers 2 and 3 represent future hardening extensions and are not currently implemented in the deployment repository.
Deterministic Enforcement (Deployed)
The validator enforces:
- temperature = 0.0 (server-side override — client cannot change this)
- top_p = 1.0 (server-side override)
- stream = false (explicitly rejected if the client sends true)
- max_tokens capped to the configured value
- Role allowlist: {system, user} only
- Markdown fence rejection (commands containing ``` are blocked)
- Block operator enforcement from config/model_config.yaml
- Fail-closed behavior on upstream or parse errors (503 / 502)
The LLM proposes. The validator decides.
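The deterministic-enforcement rules above can be sketched as a pair of gate functions. This is a minimal illustration, not the production implementation: function names are invented, and the real block-pattern list is loaded from config/model_config.yaml rather than hard-coded.

```python
import re

MAX_TOKENS_CAP = 256
ALLOWED_ROLES = {"system", "user"}
MARKDOWN_FENCE = "`" * 3  # built dynamically to avoid a literal fence here
BLOCKED_PATTERNS = [re.compile(r"rm\s+-rf\s+/(\s|$)")]  # illustrative subset

def validate_request(body: dict) -> tuple[int, dict]:
    """Fail-closed request gate; returns (http_status, payload)."""
    if body.get("stream"):
        return 400, {"error": "streaming is rejected unconditionally"}
    roles = {m.get("role") for m in body.get("messages", [])}
    if not roles <= ALLOWED_ROLES:
        return 400, {"error": "only system/user roles are permitted"}
    # Server-side overrides: whatever the client sent is discarded.
    body["temperature"] = 0.0
    body["top_p"] = 1.0
    body["max_tokens"] = min(int(body.get("max_tokens", MAX_TOKENS_CAP)),
                             MAX_TOKENS_CAP)
    return 200, body

def validate_response(command: str) -> tuple[int, dict]:
    """Pattern gate applied to the model's proposed command."""
    if MARKDOWN_FENCE in command:
        return 400, {"error": "markdown fence rejected"}
    if any(p.search(command) for p in BLOCKED_PATTERNS):
        return 400, {"error": "dangerous pattern detected"}
    return 200, {"command": command}
```

Note that the request gate mutates the sampling fields unconditionally: the client's settings never reach vLLM.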
Alternatives Considered
| Platform | Evaluated | Rejected Because |
|---|---|---|
| Modal | Yes | Serverless cold-start latency unacceptable for interactive CLI use; less control over deterministic generation settings |
| AWS SageMaker | Yes | Operational overhead and cost unjustified for single-team internal tooling at this scale |
| Hugging Face Inference API | Yes | No server-side streaming disable guarantee; limited control over output validation pipeline |
| Ollama on EC2 | Yes | Lacks production-grade Prometheus metrics and native OpenAI-compatible API without additional setup |

vLLM on a single GPU VM was selected because it provides direct control over generation parameters at the server level, a native OpenAI-compatible endpoint, built-in Prometheus metrics, and no managed-service abstraction that could hide inference behavior.
Scaling Strategy
This deployment is designed for fixed capacity — a single GPU instance serving internal users.
Auto-scaling is explicitly out of scope because:
- Traffic volume (≤ 4 concurrent, ~500 req/day) does not justify scaling infrastructure
- GPU VRAM state is not trivially shareable across instances
- Horizontal scaling adds complexity without benefit at this load
If traffic exceeds this envelope, a queue-backed batch inference pattern would be the next step.
Geographic Region
Deployment targets the same region as the internal engineering team (US region). No geographic distribution is required — this is not a latency-sensitive public endpoint.
5. Hardware Requirements
Full inference requires:
- 24GB VRAM GPU (RTX 4090 / A10 class equivalent recommended)
- 8+ CPU cores
- 32GB system RAM
Training was performed on NVIDIA A100 (40GB SXM) via RunPod.
Validator logic and configuration can be reviewed without a GPU; live vLLM inference, however, requires CUDA-enabled hardware.
6. Capacity & Cost Modeling
The following figures are modeled estimates based on published A10-class GPU benchmarks. These were not measured on this local deployment environment.
Modeled Throughput
~50 tokens/second sustained (7B model, deterministic configuration, A10 class GPU)
Modeled Monthly Cost
| Component | Monthly Estimate |
|---|---|
| A10 GPU compute (~$1.20/hr × 8hr/day × 30 days) | ~$288 |
| Storage (model weights, logs) | ~$5 |
| Network (internal only) | ~$0 |
| Monitoring tools | $0 (Prometheus, local) |
| Total | ~$293/month |
Modeled Cost Per 1,000 Requests
Assuming ~100 tokens per request at ~50 tok/sec:
- ~$0.67 per 1,000 requests
These figures provide budgeting guidance and capacity planning context. They are not measured production benchmarks from this deployment host.
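For transparency, the arithmetic behind the modeled figures above:

```python
# Reproducing the modeled cost figures from their stated assumptions.
gpu_cost_per_hour = 1.20      # A10-class estimate, USD
tokens_per_request = 100
tokens_per_second = 50

seconds_per_request = tokens_per_request / tokens_per_second   # 2.0 s
cost_per_request = gpu_cost_per_hour / 3600 * seconds_per_request
cost_per_1k_requests = 1000 * cost_per_request                 # ≈ $0.67

monthly_gpu = gpu_cost_per_hour * 8 * 30                       # $288/month
```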
Cost Optimization Strategies
- Fixed capacity, reduced hours: The baseline estimate above already assumes business-hours operation (~8 hr/day); running 24/7 would roughly triple GPU cost to ~$864/month, making the reduced schedule itself the primary cost lever.
- Quantization: 4-bit training reduced hardware requirements from A100-class to RTX 4090-class for inference, cutting deployment cost by ~60% vs full-precision hosting.
- No response caching: Commands are deterministic (temperature=0.0) but highly context-specific — caching is not appropriate. Future consideration: cache /health and system-status responses.
- No spot instances for real-time: Spot/preemptible instances are appropriate for batch inference workloads only; cold-start unpredictability is unacceptable for interactive CLI use.
7. Monitoring & Observability
Planned Telemetry Fields
Every request logs:
- request_id
- model_version
- input_tokens
- output_tokens
- latency_ms
- validation_result (pass / blocked / error)
- risk_label
- error_rate
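One audit-log line might look like the following sketch. Field values and the model_version tag are invented, and error_rate is typically derived as an aggregate over recent records rather than emitted per request:

```python
import json
import time
import uuid

def log_record(validation_result: str, risk_label: str = "low",
               input_tokens: int = 0, output_tokens: int = 0,
               latency_ms: int = 0) -> str:
    """Serialize one per-request audit record as a JSON log line."""
    return json.dumps({
        "request_id": str(uuid.uuid4()),
        "model_version": "securecli-tuner-v2",   # assumed version tag
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "latency_ms": latency_ms,
        "validation_result": validation_result,  # pass / blocked / error
        "risk_label": risk_label,
        "timestamp": time.time(),
    })
```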
| Tool | Purpose |
|---|---|
| vLLM Prometheus endpoint | GPU utilization, throughput, request queue depth |
| Structured JSON logging (validator) | Per-request audit trail with validation outcome |
| Local benchmark script (benchmark.py) | p50/p95 latency, pass/block/error rates on test set |
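A minimal sketch of the p50/p95 computation this kind of benchmark tooling performs (the sample latencies are invented; the nearest-rank percentile method is an assumed choice):

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample covering pct% of the data."""
    ordered = sorted(samples)
    k = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[k - 1]

latencies_ms = [820, 910, 1050, 1200, 4800, 990, 870, 1100, 930, 1010]
p50 = percentile(latencies_ms, 50)  # 990
p95 = percentile(latencies_ms, 95)  # 4800 — one slow outlier dominates p95
```

With only ten samples, a single slow request sets p95, which is exactly why the alert below keys on p95 rather than the mean.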
Alert Thresholds
| Metric | Alert Threshold | Notification |
|---|---|---|
| P95 latency | > 5,000ms | Slack alert to on-call engineer |
| Error rate | > 5% over 5-min window | Slack + email |
| Validation pass rate | < 90% | Immediate Slack alert |
| Upstream unavailable (503 rate) | > 0% sustained | PagerDuty escalation |
| GPU utilization | > 95% sustained | Slack alert |
8. Security Model
Trust Boundary
- vLLM is not publicly accessible — private to the Docker network
- Validator enforces deterministic inference configuration server-side
- Streaming is rejected unconditionally
- Only system and user roles are permitted in message payloads
- Fail-closed behavior on all upstream errors
Training-Time Safety
- 95 dangerous commands removed prior to fine-tuning
- ShellCheck validation enforced on the full dataset
- Adversarial suite evaluated post-training across 9 attack categories
API Authentication
The current deployment assumes internal network access. Production hardening requires:
- API key authentication at the validator or upstream API gateway
- Key rotation policy (e.g., 90-day rotation)
- Request attribution via request_id tied to authenticated identity
Rate Limiting
The validator proxy enforces:
- 32KB payload size limit (hard reject → 413)
- 2,000 character user prompt cap (hard reject → 400)
- ≤ 4 concurrent connections (capacity-bound, not rate-limit-bound)
Production API gateway layer should enforce:
- ≤ 60 requests/minute per API key
- Hard 429 on excess
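The per-key 60 req/min limit could be enforced with a standard token bucket. A hedged sketch, assuming the gateway holds one bucket per API key; the burst size and injectable clock are illustrative choices, not taken from this plan:

```python
import time

class TokenBucket:
    """Per-key rate limiter: refills at rate_per_min, holds up to burst tokens."""

    def __init__(self, rate_per_min: float = 60, burst: int = 10,
                 clock=time.monotonic):
        self.rate = rate_per_min / 60.0  # tokens per second
        self.capacity = burst
        self.tokens = float(burst)
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        """True if the request may proceed; False means respond 429."""
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```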
PII Handling
- Prompts and responses are logged as structured JSON for audit purposes
- This system is designed for infrastructure commands, not user data — no PII filtering is currently applied
- In production: logs should be access-controlled with a defined retention window (e.g., 30 days)
- Prompt content is never forwarded to external third-party services
Operational Assumptions
- Internal deployment pattern
- No public SaaS exposure
- Command execution handled externally by the calling system or human review
- GPU-backed environment required for live inference
9. Certification Scope
This submission demonstrates:
- Secure deployment architecture (Validator Proxy + private vLLM)
- Deterministic inference enforcement (server-side parameter overrides)
- Server-side guardrail implementation (Layer 1 — pattern-based blocking)
- Modeled cost and capacity planning (explicitly labeled, not measured)
- Monitoring and observability structure (Prometheus + structured logging + benchmark script)
- Explicit separation of measured vs. modeled metrics
- Clearly defined operational boundaries and threat model
This is an internal deployment pattern. It is not a public multi-tenant SaaS endpoint.
10. Repositories