AIT (AI Terminal): Deploying a Fine-Tuned LLM for Shell Command Generation
Module 2 Capstone Project – LLM Engineering & Deployment Certification
Executive Summary
AIT (AI Terminal) is a CLI tool that translates natural language descriptions into shell commands. This publication documents the complete deployment strategy for a fine-tuned Qwen3-0.6B model, including infrastructure choices, cost analysis, monitoring plan, and security considerations.
Key Highlights:
- Fine-tuned Qwen3-0.6B model specialized for terminal command generation
- Deployed on a HuggingFace Dedicated Inference Endpoint
- LiteLLM proxy on HuggingFace Spaces for API management
- Go CLI client for cross-platform distribution
- Total estimated cost: $0.50/hour (scale-to-zero enabled, Nvidia T4 GPU)
1. Use Case Definition
Problem Statement
Developers frequently need to recall complex shell commands for tasks like file manipulation, system administration, and data processing. Searching documentation or Stack Overflow interrupts workflow and reduces productivity.
AIT solves this by:
- Converting natural language descriptions to executable shell commands
- Supporting multiple platforms (Linux, macOS, Windows)
- Providing instant, contextual command generation
Target Users
| User Type | Use Case |
|---|---|
| Developers | Quick command lookup during coding sessions |
| DevOps Engineers | System administration tasks across platforms |
| Data Scientists | File manipulation and data processing commands |
| Students | Learning shell commands through natural language |
Example Interactions
| Input Description | Target OS | Generated Command |
|---|---|---|
| "find all PDF files modified in the last 7 days" | Linux | `find . -name "*.pdf" -mtime -7` |
| "show disk usage sorted by size" | macOS | `du -sh * \| sort -h` |
| "list all running processes" | Windows | `Get-Process` |
| "compress all log files older than 30 days" | Linux | `find . -name "*.log" -mtime +30 -exec gzip {} \;` |
Success Criteria
| Metric | Target | Actual (Load Test) | Measurement Method |
|---|---|---|---|
| Response Latency (p50) | < 2 seconds | 1.29s ✓ | LiteLLM metrics |
| Response Latency (p95) | < 30 seconds | 25.33s ✓ | LiteLLM metrics |
| Success Rate | > 99% | 100% ✓ | Load test |
| Command Accuracy | > 85% syntactically correct | | Manual evaluation |
| Availability | > 99% uptime | 100% ✓ | Health check monitoring |
Note: Higher p95 latency is expected due to HuggingFace Dedicated Endpoint cold starts and LiteLLM response caching. Once warmed up, median latency is 1.29s with many cached responses returning in ~0.2s.
Traffic Expectations
| Scenario | Requests/Hour | Requests/Day |
|---|---|---|
| Low (Initial) | 10-50 | 100-500 |
| Medium (Growth) | 50-200 | 500-2,000 |
| Peak (Burst) | 500+ | 5,000+ |
The deployment uses scale-to-zero to handle variable traffic efficiently.
2. Model Selection & Configuration
Model Details
| Aspect | Choice |
|---|---|
| Model | Qwen3-0.6B Terminal Instruct |
| Model Source | Fine-tuned from Qwen/Qwen3-0.6B (Module 1 project) |
| Parameter Count | 600 million |
| Quantization | None (FP16) |
| Context Length | 8,192 tokens |
| Max Output Tokens | 256 tokens |
| Model Repository | HuggingFace Hub |
Why Qwen3-0.6B?
- Size Efficiency: 600M parameters provide an excellent quality-to-cost ratio for command generation
- Fine-Tuned Specialization: trained specifically on terminal command datasets from Module 1
- Fast Inference: small model size enables sub-second generation on GPU
- Low Cost: runs efficiently on minimal hardware
Quantization Decision
No quantization applied because:
- The model is already small (600M params)
- FP16 fits comfortably in 16 GB VRAM (Nvidia T4)
- Quantization would reduce costs only marginally but could impact command accuracy
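The FP16 decision is easy to sanity-check with back-of-the-envelope arithmetic: at 2 bytes per parameter, the 600M weights occupy roughly 1.2 GB, a small fraction of the T4's 16 GB. A minimal sketch of that estimate (a lower bound only — KV cache and activations add overhead on top):

```go
package main

// fp16WeightsGB estimates model weight memory in GB at 2 bytes per
// parameter (FP16). KV cache and activations are not included, so treat
// this as a floor, not the total VRAM footprint.
func fp16WeightsGB(params float64) float64 {
	return params * 2 / 1e9
}
```

For Qwen3-0.6B, `fp16WeightsGB(600e6)` gives 1.2 GB, which is why even 4-bit quantization would buy little here.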
3. Deployment Strategy
Architecture Overview
```
┌─────────────┐      ┌──────────────────────────────────────────┐
│   AIT CLI   │─────▶│ LiteLLM Proxy (HuggingFace Spaces)       │
│ (Go binary) │      │ ┌───────────┐    ┌──────────────────┐    │
└─────────────┘      │ │ LiteLLM   │───▶│HF Endpoint Proxy │    │
                     │ │(port 7860)│    │   (port 8000)    │    │
                     │ └───────────┘    └────────┬─────────┘    │
                     └───────────────────────────┬──────────────┘
                                                 │
                     ┌───────────────────────────▼──────────────┐
                     │ HuggingFace Dedicated Inference Endpoint │
                     │      (Qwen3-0.6B Terminal Instruct)      │
                     └───────────────────────────┬──────────────┘
                                                 │
                     ┌───────────────────────────▼──────────────┐
                     │            Supabase PostgreSQL           │
                     │      (Virtual keys, spend tracking)      │
                     └──────────────────────────────────────────┘
```
| Component | Platform | Justification |
|---|---|---|
| Model Hosting | HuggingFace Dedicated Endpoint | Scale-to-zero, managed TGI, simple deployment |
| API Gateway | LiteLLM on HF Spaces | Free hosting, virtual keys, rate limiting |
| Database | Supabase (Free Tier) | Managed PostgreSQL, connection pooling |
| CLI Distribution | Go binary | Cross-platform, single executable, no runtime |
Why HuggingFace Dedicated Endpoints?
HuggingFace Dedicated Endpoints was selected because:
- Scale-to-zero: no charges when idle
- Managed TGI: optimized inference server included
- Simple deployment: one-click from the model repository
- Pay-per-hour: only charged for actual compute time
Infrastructure Details
| Specification | Value |
|---|---|
| Instance Type | GPU (Nvidia T4) |
| GPU Memory | 16 GB |
| Scaling | Scale-to-zero (min: 0, max: 1) |
| Region | us-east-1 (AWS) |
| Endpoint Type | Dedicated (not serverless) |
| Cold Start Time | ~30-60 seconds |
Why GPU (Nvidia T4)?
For production use with low latency:
- GPU inference latency: 0.2-0.5 seconds (excellent UX)
- CPU inference latency: 1-3 seconds (slower)
- GPU cost: $0.50/hour (with scale-to-zero)
- 16 GB VRAM: ample headroom for the 600M model
The faster response time justifies the GPU cost for a better user experience.
4. Cost Analysis
Estimated Monthly Costs
| Component | Scenario: Low | Scenario: Medium | Scenario: High |
|---|---|---|---|
| HF Endpoint (GPU T4) | $40 (80 hrs) | $165 (330 hrs) | $375 (750 hrs) |
| HF Spaces (Docker) | $0 (free) | $0 (free) | $0 (free) |
| Supabase (Free Tier) | $0 | $0 | $0 |
| Network Transfer | $0 | $0 | ~$1 |
| Total | $40/month | $165/month | $376/month |
Cost Per Request
| Metric | Value |
|---|---|
| Endpoint hourly cost | $0.50 |
| Requests per hour (avg) | 60 |
| Cost per request | ~$0.008 ($0.50 / 60) |
| Cost per 1,000 requests | ~$8.33 |
Cost Optimization Strategies
1. Scale-to-Zero
- Endpoint automatically scales down after 15 minutes of inactivity
- No charges during idle periods
- Savings: 50-80% compared to an always-on deployment
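The savings range can be reproduced from the cost table: an always-on T4 bills roughly 730 hours per month (~$365), while the scale-to-zero scenarios above bill far fewer. A small sketch of that estimate (the rates and hour counts are the figures quoted above, not measured billing data):

```go
package main

// endpointCost estimates monthly endpoint spend from billed GPU hours.
func endpointCost(hourlyRate, billedHours float64) float64 {
	return hourlyRate * billedHours
}

// savings returns the fractional saving of a scale-to-zero scenario
// versus keeping the endpoint warm for every hour of the month.
func savings(hourlyRate, billedHours, hoursInMonth float64) float64 {
	alwaysOn := endpointCost(hourlyRate, hoursInMonth)
	return 1 - endpointCost(hourlyRate, billedHours)/alwaysOn
}
```

For the low scenario (80 billed hours of 730) this works out to roughly 89% savings; for medium (330 hours), about 55% — consistent with the 50-80% range above.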
2. Response Caching (LiteLLM)
```yaml
litellm_settings:
  cache: true
  cache_params:
    type: "local"
    ttl: 600  # 10 minutes
```
- Identical prompts return cached responses
- Savings: 10-30% reduction in endpoint calls
5. Monitoring & Observability Plan
Metrics to Track
| Metric | Why It Matters | Alert Threshold | Observed | Tool |
|---|---|---|---|---|
| Latency (p50) | User experience | > 2s | 1.29s ✓ | LiteLLM |
| Latency (p95) | Tail latency | > 30s | 25.33s ✓ | LiteLLM |
| Error Rate | Reliability | > 5% | 0% ✓ | LiteLLM |
| Throughput (RPM) | Capacity planning | < 10 RPM sustained | 19 RPM ✓ | LiteLLM |
| Token Usage | Cost control | > 500 tokens/request | | LiteLLM |
| Endpoint Status | Availability | != "running" | running ✓ | HF Dashboard |
| Database Connections | Infrastructure health | > 80% pool | | Supabase |
Monitoring Stack
| Tool | Purpose | Why Selected |
|---|---|---|
| LiteLLM Dashboard | Request tracking, spend logs, virtual key management | Built-in, no additional cost |
| HuggingFace Dashboard | Endpoint health, scaling events, logs | Native to platform |
| Supabase Dashboard | Database metrics, connection pool status | Native to platform |
LiteLLM Observability Features
```yaml
general_settings:
  enable_user_auth: true
  max_budget: 10.0
  budget_duration: "30d"
```
Tracked automatically:
- Request count per virtual key
- Token usage per request
- Latency distribution
- Error rates by error type
- Spend per key/user
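The p50/p95 numbers quoted throughout this document follow the usual nearest-rank percentile convention. A small sketch of that computation (illustrative, not LiteLLM's internal code):

```go
package main

import (
	"math"
	"sort"
)

// percentile returns the nearest-rank percentile (p in 0..1) of a latency
// sample, e.g. p=0.95 for the p95 figures in the metrics table.
func percentile(latencies []float64, p float64) float64 {
	// Sort a copy so the caller's sample is left untouched.
	s := append([]float64(nil), latencies...)
	sort.Float64s(s)
	idx := int(math.Ceil(p*float64(len(s)))) - 1
	if idx < 0 {
		idx = 0
	}
	return s[idx]
}
```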
6. Security Considerations
API Authentication
| Layer | Mechanism | Purpose |
|---|---|---|
| LiteLLM | Virtual Keys (sk-...) | User authentication, budget control |
| HF Endpoint | HF Token | Backend authentication (not exposed) |
| Master Key | Environment variable | Admin access only |
Virtual Key Security
```
# Users receive scoped virtual keys
sk-xxxxxxxxxxxxxxxxxxxx

# Keys are:
# - Revocable instantly
# - Budget-limited ($5/month default)
# - Rate-limited (60 RPM)
# - Model-scoped (only "default" model)
```
Rate Limiting
| Limit | Value | Purpose |
|---|---|---|
| Requests per minute | 60 | Prevent abuse |
| Tokens per minute | 10,000 | Cost control |
| Max budget per key | $5/month | Spending cap |
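Conceptually, the per-key RPM cap is just a counter per key that resets each minute. A simplified fixed-window sketch of the idea (the production deployment relies on LiteLLM's built-in limiting, not code like this):

```go
package main

import "time"

// rpmLimiter caps requests per key within a fixed one-minute window —
// a deliberately simplified stand-in for LiteLLM's per-key rate limits.
type rpmLimiter struct {
	limit       int
	counts      map[string]int
	windowStart time.Time
}

func newRPMLimiter(limit int) *rpmLimiter {
	return &rpmLimiter{
		limit:       limit,
		counts:      make(map[string]int),
		windowStart: time.Now(),
	}
}

// Allow reports whether the key may make another request this minute.
func (l *rpmLimiter) Allow(key string) bool {
	if time.Since(l.windowStart) >= time.Minute {
		// New window: reset every key's counter.
		l.counts = make(map[string]int)
		l.windowStart = time.Now()
	}
	if l.counts[key] >= l.limit {
		return false
	}
	l.counts[key]++
	return true
}
```

A fixed window admits brief bursts at window boundaries; sliding-window or token-bucket schemes smooth that out, which is one reason to prefer the gateway's own limiter.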
| Risk | Mitigation |
|---|---|
| Prompt Injection | Model fine-tuned for command generation only; system prompt enforces output format |
| Oversized Input | Max input tokens limited in LiteLLM config |
| Malicious Commands | CLI displays the command for user review before any execution |
PII Handling
| Data Type | Logged? | Retention |
|---|---|---|
| Prompts | No (by default) | N/A |
| Responses | No (by default) | N/A |
| Request metadata | Yes (LiteLLM) | 30 days |
| API tokens | Masked in all output | N/A |
Access Control
| Role | Capabilities |
|---|---|
| End User | Use virtual key, generate commands |
| Admin | Create/revoke keys, view spend, access dashboard |
| System | HF token (backend only), database access |
Secrets Management
| Secret | Storage | Access |
|---|---|---|
| `LITELLM_MASTER_KEY` | HF Spaces secrets | Admin only |
| `HF_TOKEN` | HF Spaces secrets | System only |
| `DATABASE_URL` | HF Spaces secrets | System only |
| User API tokens | `~/.ait/config.json` | User's machine only |
7. Deployment Instructions
Prerequisites
- HuggingFace account with write access token
- Supabase account (free tier)
- Go 1.21+ (for building CLI)
Step 1: Deploy Model Endpoint
1. Go to HuggingFace Inference Endpoints
2. Create a new endpoint from Eng-Elias/Qwen3-0.6B-terminal-instruct
3. Select a GPU (Nvidia T4) instance and enable scale-to-zero
4. Wait for status: "Running"
Step 2: Deploy LiteLLM Proxy
1. Create a HuggingFace Space (Docker SDK)
2. Set secrets: HF_TOKEN, LITELLM_MASTER_KEY, DATABASE_URL, HF_ENDPOINT_URL
3. Upload files from deploy/litellm/hf-spaces/
4. Wait for the build to complete
Step 3: Create Virtual Key
```shell
curl -X POST https://your-space.hf.space/key/generate \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{"models": ["default"], "max_budget": 5.0}'
```
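Behind the scenes, each prompt the CLI sends becomes an OpenAI-compatible chat completion request against the proxy. A sketch of that request construction (the struct layout and system prompt are illustrative assumptions, not the actual ait implementation):

```go
package main

import (
	"bytes"
	"encoding/json"
	"net/http"
)

// message and chatRequest follow the OpenAI-compatible payload shape
// that the LiteLLM proxy exposes.
type message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model     string    `json:"model"`
	Messages  []message `json:"messages"`
	MaxTokens int       `json:"max_tokens"`
}

// newCommandRequest builds the HTTP request for one natural-language
// prompt. baseURL and apiKey come from `ait setup`; "default" is the
// model alias configured in LiteLLM. The system prompt is a placeholder.
func newCommandRequest(baseURL, apiKey, prompt string) (*http.Request, error) {
	body, err := json.Marshal(chatRequest{
		Model: "default",
		Messages: []message{
			{Role: "system", Content: "Return only a shell command."},
			{Role: "user", Content: prompt},
		},
		MaxTokens: 256, // matches the endpoint's max output tokens
	})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest("POST", baseURL+"/v1/chat/completions", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+apiKey) // virtual key, not the HF token
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}
```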
Step 4: Install & Configure CLI
```shell
# Install
go install github.com/Eng-Elias/ait@latest

# Configure
ait setup
# Enter: https://your-space.hf.space/v1/chat/completions
# Enter: sk-your-virtual-key
# Enter: default
```
Step 5: Test
```shell
ait "list all files larger than 100MB"
# Output: find . -size +100M
```
8. Conclusion
The architecture balances simplicity with production requirements, using managed services (HuggingFace, Supabase) to minimize operational overhead while maintaining full control over the deployment.