AIT (AI Terminal): Deploying a Fine-Tuned LLM for Shell Command Generation
Module 2 Capstone Project – LLM Engineering & Deployment Certification
Executive Summary
AIT (AI Terminal) is a CLI tool that translates natural language descriptions into shell commands. This publication documents the complete deployment strategy for a fine-tuned Qwen3-0.6B model, including infrastructure choices, cost analysis, monitoring plan, and security considerations.
Key Highlights:
- Fine-tuned Qwen3-0.6B model specialized for terminal command generation
- Deployed on a HuggingFace Dedicated Inference Endpoint
- LiteLLM proxy on HuggingFace Spaces for API management
- Go CLI client for cross-platform distribution
- Total estimated cost: $0.50/hour (scale-to-zero enabled, Nvidia T4 GPU)
1. Use Case Definition
Problem Statement
Developers frequently need to recall complex shell commands for tasks like file manipulation, system administration, and data processing. Searching documentation or Stack Overflow interrupts workflow and reduces productivity.
AIT solves this by:
- Converting natural language descriptions to executable shell commands
- Supporting multiple platforms (Linux, macOS, Windows)
- Providing instant, contextual command generation
Target Users
| User Type | Use Case |
|---|---|
| Developers | Quick command lookup during coding sessions |
| DevOps Engineers | System administration tasks across platforms |
| Data Scientists | File manipulation and data processing commands |
| Students | Learning shell commands through natural language |
Example Interactions
| Input Description | Target OS | Generated Command |
|---|---|---|
| "find all PDF files modified in the last 7 days" | Linux | `find . -name "*.pdf" -mtime -7` |
| "show disk usage sorted by size" | macOS | `du -sh * \| sort -h` |
| "list all running processes" | Windows | `Get-Process` |
| "compress all log files older than 30 days" | Linux | `find . -name "*.log" -mtime +30 -exec gzip {} \;` |
Success Criteria
| Metric | Target | Actual (Load Test) | Measurement Method |
|---|---|---|---|
| Response Latency (p50) | < 2 seconds | 1.29s ✓ | LiteLLM metrics |
| Response Latency (p95) | < 30 seconds | 25.33s ✓ | LiteLLM metrics |
| Success Rate | > 99% | 100% ✓ | Load test |
| Command Accuracy | > 85% syntactically correct | | Manual evaluation |
| Availability | > 99% uptime | 100% ✓ | Health check monitoring |
Note: Higher p95 latency is expected due to HuggingFace Dedicated Endpoint cold starts and LiteLLM response caching. Once warmed up, median latency is 1.29s with many cached responses returning in ~0.2s.
Traffic Expectations
| Scenario | Requests/Hour | Requests/Day |
|---|---|---|
| Low (Initial) | 10-50 | 100-500 |
| Medium (Growth) | 50-200 | 500-2,000 |
| Peak (Burst) | 500+ | 5,000+ |
The deployment uses scale-to-zero to handle variable traffic efficiently.
2. Model Selection & Configuration
Model Details
| Aspect | Choice |
|---|---|
| Model | Qwen3-0.6B Terminal Instruct |
| Model Source | Fine-tuned from Qwen/Qwen3-0.6B (Module 1 project) |
| Parameter Count | 600 million |
| Quantization | None (FP16) |
| Context Length | 8,192 tokens |
| Max Output Tokens | 256 tokens |
| Model Repository | HuggingFace Hub |
Why Qwen3-0.6B?
- Size Efficiency: 600M parameters provide an excellent quality-to-cost ratio for command generation
- Fine-Tuned Specialization: trained specifically on terminal command datasets from Module 1
- Fast Inference: small model size enables sub-second generation on GPU
- Low Cost: runs efficiently on minimal hardware
Quantization Decision
No quantization applied because:
- The model is already small (600M params)
- FP16 fits comfortably in 16 GB VRAM (Nvidia T4)
- Quantization would reduce costs only marginally but could impact command accuracy
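The FP16 decision is easy to sanity-check with back-of-the-envelope arithmetic: at 2 bytes per parameter, the 600M weights occupy roughly 1.2 GB, a small fraction of the T4's 16 GB. A minimal sketch of that estimate (a lower bound only — KV cache and activations add overhead on top):

```go
package main

// fp16WeightsGB estimates model weight memory in GB at 2 bytes per
// parameter (FP16). KV cache and activations are not included, so treat
// this as a floor, not the total VRAM footprint.
func fp16WeightsGB(params float64) float64 {
	return params * 2 / 1e9
}
```

For Qwen3-0.6B, `fp16WeightsGB(600e6)` gives 1.2 GB, which is why even 4-bit quantization would buy little here.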
3. Deployment Strategy
Architecture Overview
```
┌─────────────┐      ┌──────────────────────────────────────────┐
│   AIT CLI   │─────▶│ LiteLLM Proxy (HuggingFace Spaces)       │
│ (Go binary) │      │ ┌───────────┐    ┌──────────────────┐    │
└─────────────┘      │ │ LiteLLM   │───▶│HF Endpoint Proxy │    │
                     │ │(port 7860)│    │   (port 8000)    │    │
                     │ └───────────┘    └────────┬─────────┘    │
                     └───────────────────────────┬──────────────┘
                                                 │
                     ┌───────────────────────────▼──────────────┐
                     │ HuggingFace Dedicated Inference Endpoint │
                     │      (Qwen3-0.6B Terminal Instruct)      │
                     └───────────────────────────┬──────────────┘
                                                 │
                     ┌───────────────────────────▼──────────────┐
                     │            Supabase PostgreSQL           │
                     │      (Virtual keys, spend tracking)      │
                     └──────────────────────────────────────────┘
```
| Component | Platform | Justification |
|---|---|---|
| Model Hosting | HuggingFace Dedicated Endpoint | Scale-to-zero, managed TGI, simple deployment |
| API Gateway | LiteLLM on HF Spaces | Free hosting, virtual keys, rate limiting |
| Database | Supabase (Free Tier) | Managed PostgreSQL, connection pooling |
| CLI Distribution | Go binary | Cross-platform, single executable, no runtime |
Why HuggingFace Dedicated Endpoints?
HuggingFace Dedicated Endpoints was selected because:
- Scale-to-zero: no charges when idle
- Managed TGI: optimized inference server included
- Simple deployment: one-click from the model repository
- Pay-per-hour: only charged for actual compute time
Infrastructure Details
| Specification | Value |
|---|---|
| Instance Type | GPU (Nvidia T4) |
| GPU Memory | 16 GB |
| Scaling | Scale-to-zero (min: 0, max: 1) |
| Region | us-east-1 (AWS) |
| Endpoint Type | Dedicated (not serverless) |
| Cold Start Time | ~30-60 seconds |
Why GPU (Nvidia T4)?
For production use with low latency:
- GPU inference latency: 0.2-0.5 seconds (excellent UX)
- CPU inference latency: 1-3 seconds (slower)
- GPU cost: $0.50/hour (with scale-to-zero)
- 16 GB VRAM: ample headroom for the 600M model
The faster response time justifies the GPU cost for a better user experience.
4. Cost Analysis
Estimated Monthly Costs
| Component | Scenario: Low | Scenario: Medium | Scenario: High |
|---|---|---|---|
| HF Endpoint (GPU T4) | $40 (80 hrs) | $165 (330 hrs) | $375 (750 hrs) |
| HF Spaces (Docker) | $0 (free) | $0 (free) | $0 (free) |
| Supabase (Free Tier) | $0 | $0 | $0 |
| Network Transfer | $0 | $0 | ~$1 |
| Total | $40/month | $165/month | $376/month |
Cost Per Request
| Metric | Value |
|---|---|
| Endpoint hourly cost | $0.50 |
| Requests per hour (avg) | 60 |
| Cost per request | ~$0.008 ($0.50 / 60) |
| Cost per 1,000 requests | ~$8.33 |
Cost Optimization Strategies
1. Scale-to-Zero
- Endpoint automatically scales down after 15 minutes of inactivity
- No charges during idle periods
- Savings: 50-80% compared to an always-on deployment
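The savings range can be reproduced from the cost table: an always-on T4 bills roughly 730 hours per month (~$365), while the scale-to-zero scenarios above bill far fewer. A small sketch of that estimate (the rates and hour counts are the figures quoted above, not measured billing data):

```go
package main

// endpointCost estimates monthly endpoint spend from billed GPU hours.
func endpointCost(hourlyRate, billedHours float64) float64 {
	return hourlyRate * billedHours
}

// savings returns the fractional saving of a scale-to-zero scenario
// versus keeping the endpoint warm for every hour of the month.
func savings(hourlyRate, billedHours, hoursInMonth float64) float64 {
	alwaysOn := endpointCost(hourlyRate, hoursInMonth)
	return 1 - endpointCost(hourlyRate, billedHours)/alwaysOn
}
```

For the low scenario (80 billed hours of 730) this works out to roughly 89% savings; for medium (330 hours), about 55% — consistent with the 50-80% range above.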
2. Response Caching (LiteLLM)
```yaml
litellm_settings:
  cache: true
  cache_params:
    type: "local"
    ttl: 600  # 10 minutes
```
- Identical prompts return cached responses
- Savings: 10-30% reduction in endpoint calls
5. Monitoring & Observability Plan
Metrics to Track
| Metric | Why It Matters | Alert Threshold | Observed | Tool |
|---|---|---|---|---|
| Latency (p50) | User experience | > 2s | 1.29s ✓ | LiteLLM |
| Latency (p95) | Tail latency | > 30s | 25.33s ✓ | LiteLLM |
| Error Rate | Reliability | > 5% | 0% ✓ | LiteLLM |
| Throughput (RPM) | Capacity planning | < 10 RPM sustained | 19 RPM ✓ | LiteLLM |
| Token Usage | Cost control | > 500 tokens/request | | LiteLLM |
| Endpoint Status | Availability | != "running" | running ✓ | HF Dashboard |
| Database Connections | Infrastructure health | > 80% pool | | Supabase |
Monitoring Stack
| Tool | Purpose | Why Selected |
|---|---|---|
| LiteLLM Dashboard | Request tracking, spend logs, virtual key management | Built-in, no additional cost |
| HuggingFace Dashboard | Endpoint health, scaling events, logs | Native to platform |
| Supabase Dashboard | Database metrics, connection pool status | Native to platform |
LiteLLM Observability Features
```yaml
general_settings:
  enable_user_auth: true
  max_budget: 10.0
  budget_duration: "30d"
```
Tracked automatically:
- Request count per virtual key
- Token usage per request
- Latency distribution
- Error rates by error type
- Spend per key/user
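The p50/p95 numbers quoted throughout this document follow the usual nearest-rank percentile convention. A small sketch of that computation (illustrative, not LiteLLM's internal code):

```go
package main

import (
	"math"
	"sort"
)

// percentile returns the nearest-rank percentile (p in 0..1) of a latency
// sample, e.g. p=0.95 for the p95 figures in the metrics table.
func percentile(latencies []float64, p float64) float64 {
	// Sort a copy so the caller's sample is left untouched.
	s := append([]float64(nil), latencies...)
	sort.Float64s(s)
	idx := int(math.Ceil(p*float64(len(s)))) - 1
	if idx < 0 {
		idx = 0
	}
	return s[idx]
}
```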
6. Security Considerations
API Authentication
| Layer | Mechanism | Purpose |
|---|---|---|
| LiteLLM | Virtual Keys (sk-...) | User authentication, budget control |
| HF Endpoint | HF Token | Backend authentication (not exposed) |
| Master Key | Environment variable | Admin access only |
Virtual Key Security
```
# Users receive scoped virtual keys
sk-xxxxxxxxxxxxxxxxxxxx

# Keys are:
# - Revocable instantly
# - Budget-limited ($5/month default)
# - Rate-limited (60 RPM)
# - Model-scoped (only "default" model)
```
Rate Limiting
| Limit | Value | Purpose |
|---|---|---|
| Requests per minute | 60 | Prevent abuse |
| Tokens per minute | 10,000 | Cost control |
| Max budget per key | $5/month | Spending cap |
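Conceptually, the per-key RPM cap is just a counter per key that resets each minute. A simplified fixed-window sketch of the idea (the production deployment relies on LiteLLM's built-in limiting, not code like this):

```go
package main

import "time"

// rpmLimiter caps requests per key within a fixed one-minute window —
// a deliberately simplified stand-in for LiteLLM's per-key rate limits.
type rpmLimiter struct {
	limit       int
	counts      map[string]int
	windowStart time.Time
}

func newRPMLimiter(limit int) *rpmLimiter {
	return &rpmLimiter{
		limit:       limit,
		counts:      make(map[string]int),
		windowStart: time.Now(),
	}
}

// Allow reports whether the key may make another request this minute.
func (l *rpmLimiter) Allow(key string) bool {
	if time.Since(l.windowStart) >= time.Minute {
		// New window: reset every key's counter.
		l.counts = make(map[string]int)
		l.windowStart = time.Now()
	}
	if l.counts[key] >= l.limit {
		return false
	}
	l.counts[key]++
	return true
}
```

A fixed window admits brief bursts at window boundaries; sliding-window or token-bucket schemes smooth that out, which is one reason to prefer the gateway's own limiter.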
| Risk | Mitigation |
|---|---|
| Prompt Injection | Model fine-tuned for command generation only; system prompt enforces output format |
| Oversized Input | Max input tokens limited in LiteLLM config |
| Malicious Commands | CLI displays the command for user review before any execution |
PII Handling
| Data Type | Logged? | Retention |
|---|---|---|
| Prompts | No (by default) | N/A |
| Responses | No (by default) | N/A |
| Request metadata | Yes (LiteLLM) | 30 days |
| API tokens | Masked in all output | N/A |
Access Control
| Role | Capabilities |
|---|---|
| End User | Use virtual key, generate commands |
| Admin | Create/revoke keys, view spend, access dashboard |
| System | HF token (backend only), database access |
Secrets Management
| Secret | Storage | Access |
|---|---|---|
| `LITELLM_MASTER_KEY` | HF Spaces secrets | Admin only |
| `HF_TOKEN` | HF Spaces secrets | System only |
| `DATABASE_URL` | HF Spaces secrets | System only |
| User API tokens | `~/.ait/config.json` | User's machine only |
7. Deployment Instructions
Prerequisites
- HuggingFace account with write access token
- Supabase account (free tier)
- Go 1.21+ (for building CLI)
Step 1: Deploy Model Endpoint
1. Go to HuggingFace Inference Endpoints
2. Create a new endpoint from Eng-Elias/Qwen3-0.6B-terminal-instruct
3. Select a GPU (Nvidia T4) instance and enable scale-to-zero
4. Wait for status: "Running"
Step 2: Deploy LiteLLM Proxy
1. Create a HuggingFace Space (Docker SDK)
2. Set secrets: HF_TOKEN, LITELLM_MASTER_KEY, DATABASE_URL, HF_ENDPOINT_URL
3. Upload files from deploy/litellm/hf-spaces/
4. Wait for the build to complete
Step 3: Create Virtual Key
```shell
curl -X POST https://your-space.hf.space/key/generate \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{"models": ["default"], "max_budget": 5.0}'
```
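Behind the scenes, each prompt the CLI sends becomes an OpenAI-compatible chat completion request against the proxy. A sketch of that request construction (the struct layout and system prompt are illustrative assumptions, not the actual ait implementation):

```go
package main

import (
	"bytes"
	"encoding/json"
	"net/http"
)

// message and chatRequest follow the OpenAI-compatible payload shape
// that the LiteLLM proxy exposes.
type message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model     string    `json:"model"`
	Messages  []message `json:"messages"`
	MaxTokens int       `json:"max_tokens"`
}

// newCommandRequest builds the HTTP request for one natural-language
// prompt. baseURL and apiKey come from `ait setup`; "default" is the
// model alias configured in LiteLLM. The system prompt is a placeholder.
func newCommandRequest(baseURL, apiKey, prompt string) (*http.Request, error) {
	body, err := json.Marshal(chatRequest{
		Model: "default",
		Messages: []message{
			{Role: "system", Content: "Return only a shell command."},
			{Role: "user", Content: prompt},
		},
		MaxTokens: 256, // matches the endpoint's max output tokens
	})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest("POST", baseURL+"/v1/chat/completions", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+apiKey) // virtual key, not the HF token
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}
```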
Step 4: Install & Configure CLI
```shell
# Install
go install github.com/Eng-Elias/ait@latest

# Configure
ait setup
# Enter: https://your-space.hf.space/v1/chat/completions
# Enter: sk-your-virtual-key
# Enter: default
```
Step 5: Test
```shell
ait "list all files larger than 100MB"
# Output: find . -size +100M
```
8. Conclusion
The architecture balances simplicity with production requirements, using managed services (HuggingFace, Supabase) to minimize operational overhead while maintaining full control over the deployment.