A Technical Guide to Building a Multi-OS Terminal Commands Assistant with QLoRA
Terminal commands are the backbone of software development, system administration, and DevOps workflows. However, remembering the exact syntax across different operating systems—Linux, Windows, and macOS—presents a significant challenge, especially for developers who frequently switch between environments.
This project fine-tunes Qwen3-0.6B, a compact yet capable language model, to translate natural language instructions into accurate terminal commands for all three major operating systems. The result is a lightweight, efficient model that can serve as an intelligent command-line assistant.
Fine-tune a language model to generate precise terminal commands from natural language descriptions, with support for:
- Linux/Unix commands
- Windows CMD commands
- macOS commands
- A JSON mode that returns the command for all three operating systems at once
The training dataset was synthetically generated using a combination of template-based generation and GPT-assisted augmentation, ensuring diverse coverage of common terminal operations.
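To make the generation approach concrete, here is a minimal sketch of template-based sample generation. The templates, helper names, and value pools are illustrative assumptions, not the project's actual generation script.

```python
import json
import random

# Hypothetical templates: each maps a natural-language pattern to per-OS commands.
TEMPLATES = [
    {
        "instruction": "List all files including hidden ones",
        "outputs": {"[LINUX]": "ls -la", "[WINDOWS]": "dir /a", "[MAC]": "ls -la"},
    },
    {
        "instruction": "Create a new folder named {name}",
        "outputs": {"[LINUX]": "mkdir {name}", "[WINDOWS]": "mkdir {name}", "[MAC]": "mkdir {name}"},
    },
]

def generate_samples(n=10):
    """Produce Alpaca-style records by filling templates with sampled values."""
    samples = []
    for _ in range(n):
        template = random.choice(TEMPLATES)
        name = random.choice(["projects", "backup", "logs"])
        os_tag, command = random.choice(list(template["outputs"].items()))
        samples.append({
            "instruction": template["instruction"].format(name=name),
            "input": os_tag,
            "output": command.format(name=name),
        })
    return samples

if __name__ == "__main__":
    for record in generate_samples(3):
        print(json.dumps(record))
```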
| Split | Samples | Purpose |
|---|---|---|
| Train | >9000 | Model training |
| Validation | >1000 | Hyperparameter tuning |
| Test | >500 | Final evaluation |
| Total | >12,000 | - |
Each sample follows the Alpaca instruction format:
{ "instruction": "List all files including hidden ones", "input": "[LINUX]", "output": "ls -la" }
| Input Type | Description | Example Output |
|---|---|---|
| `[LINUX]` | Linux/Unix command | `ls -la` |
| `[WINDOWS]` | Windows CMD command | `dir /a` |
| `[MAC]` | macOS command | `ls -la` |
| `""` (empty) | Cross-platform or context-dependent | Varies |
| `"Return...JSON"` | JSON with all OS commands | `{"linux": "...", "windows": "...", "mac": "..."}` |
┌─────────────────────────────────────────────────────────────┐
│ Command Categories │
├─────────────────┬───────────────────────────────────────────┤
│ File Operations │ list, copy, move, delete, find, rename │
│ Directory Ops │ create, remove, navigate, list contents │
│ System Info │ disk usage, memory, CPU, processes │
│ Text Processing │ grep, sed, awk, sort, uniq, head, tail │
│ Network │ ping, curl, wget, netstat, ssh, scp │
│ Compression │ tar, zip, gzip, unzip, 7z │
│ Permissions │ chmod, chown, icacls, attrib │
│ Package Mgmt │ apt, yum, brew, choco, winget │
│ Git Operations │ clone, commit, push, pull, branch │
│ Docker │ run, build, compose, exec, logs │
└─────────────────┴───────────────────────────────────────────┘
| Criterion | Choice | Rationale |
|---|---|---|
| Model | Qwen3-0.6B | Excellent performance-to-size ratio |
| Parameters | 600M | Fits in 6GB VRAM with 4-bit quantization |
| Architecture | Decoder-only Transformer | Standard for text generation |
| Context Length | 32K tokens | More than sufficient for commands |
QLoRA (Quantized Low-Rank Adaptation) was chosen to enable training on my laptop hardware while maintaining model quality.
```python
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_use_double_quant=True,        # Nested quantization
    bnb_4bit_compute_dtype=torch.float16,  # Computation precision
)
```
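For context, here is a minimal sketch of passing this config when loading the base model with the Hugging Face `transformers` API; the Hub id `Qwen/Qwen3-0.6B` and the `device_map` choice are assumptions, and the project's loading code may differ.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# `bnb_config` is the BitsAndBytesConfig defined above.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-0.6B",
    quantization_config=bnb_config,  # store weights in 4-bit NF4
    device_map="auto",               # place layers on the available GPU
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
```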
```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                  # Rank of the update matrices
    lora_alpha=32,         # Scaling factor
    lora_dropout=0.1,      # Dropout for regularization
    target_modules=[       # Attention projections to adapt
        "q_proj", "k_proj", "v_proj", "o_proj",
    ],
    bias="none",
    task_type="CAUSAL_LM",
)
```
Why these settings? A rank of 16 with `lora_alpha=32` (an effective scaling of 2x) balances adapter capacity against trainable parameter count, a dropout of 0.1 guards against overfitting on a relatively small dataset, and restricting the adapters to the attention projections keeps trainable parameters, and therefore VRAM usage, low.
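Applying the adapters follows the standard `peft` workflow. The sketch below continues from the quantized model and `lora_config` above, so the exact calls are assumptions about the project's script.

```python
from peft import get_peft_model, prepare_model_for_kbit_training

# `model` is the 4-bit base model and `lora_config` the LoraConfig from above.
model = prepare_model_for_kbit_training(model)  # prepare quantized weights for training
model = get_peft_model(model, lora_config)      # wrap with trainable LoRA adapters
model.print_trainable_parameters()              # prints trainable vs. total parameter counts
```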
| Parameter | Value | Rationale |
|---|---|---|
| Epochs | 3 | Sufficient convergence without overfitting |
| Batch Size | 2 | Memory constraint |
| Gradient Accumulation | 8 | Effective batch size of 16 |
| Learning Rate | 2e-4 | Standard for LoRA fine-tuning |
| LR Scheduler | Cosine | Smooth decay for better convergence |
| Warmup Ratio | 0.1 | 10% of steps for warmup |
| Optimizer | Paged AdamW 8-bit | Memory-efficient optimizer |
| Max Sequence Length | 256 | Sufficient for command generation |
| Gradient Checkpointing | Enabled | Reduces memory at cost of speed |
| Precision | FP16 | Mixed precision training |
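These values map onto Hugging Face `TrainingArguments` roughly as follows; the argument names come from the standard API, but the output directory and logging cadence are illustrative assumptions rather than the project's exact configuration.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="qwen3-terminal-instruct",  # illustrative name
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,         # effective batch size 2 x 8 = 16
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    optim="paged_adamw_8bit",
    fp16=True,
    gradient_checkpointing=True,
    logging_steps=100,
)
```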
| Component | Specification |
|---|---|
| GPU | NVIDIA RTX 2060 (6GB VRAM) |
| RAM | 32GB DDR4 |
| Storage | NVMe SSD |
| Framework | HuggingFace Transformers + PEFT |
| Training Time | ~3.8 hours (229 minutes) |
┌─────────────────────────────────────────────────────────────┐
│ Training Pipeline │
├─────────────────────────────────────────────────────────────┤
│ 1. Load Base Model (Qwen3-0.6B) with 4-bit Quantization │
│ ↓ │
│ 2. Apply LoRA Adapters to Attention Layers │
│ ↓ │
│ 3. Prepare Dataset (Tokenize + Format) │
│ ↓ │
│ 4. Train with HuggingFace Trainer │
│ ↓ │
│ 5. Evaluate on Validation Set │
│ ↓ │
│ 6. Save LoRA Adapters + Merge with Base Model │
│ ↓ │
│ 7. Publish to HuggingFace Hub │
└─────────────────────────────────────────────────────────────┘
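Steps 4 through 7 can be sketched as follows, reusing the objects from the earlier snippets and assuming the tokenized train/validation datasets from step 3 are already prepared. Repo and directory names follow the links in the Resources section; the project's actual script may differ.

```python
import torch
from transformers import Trainer, AutoModelForCausalLM, DataCollatorForLanguageModeling
from peft import PeftModel

# Steps 4-5: train with the HuggingFace Trainer and evaluate on the validation split.
trainer = Trainer(
    model=model,                  # PEFT-wrapped 4-bit model from the earlier sketches
    args=training_args,
    train_dataset=train_dataset,  # tokenized splits assumed from step 3
    eval_dataset=val_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

# Step 6: save the LoRA adapters, then merge them into a full-precision copy of the base.
trainer.model.save_pretrained("qwen3-terminal-lora")
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B", torch_dtype=torch.float16)
merged = PeftModel.from_pretrained(base, "qwen3-terminal-lora").merge_and_unload()

# Step 7: publish both artifacts to the HuggingFace Hub.
merged.push_to_hub("Eng-Elias/qwen3-0.6b-terminal-instruct")
trainer.model.push_to_hub("Eng-Elias/qwen3-0.6b-terminal-instruct-lora")
```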
The model showed rapid convergence, with loss decreasing significantly in the first epoch and continuing to improve throughout training.
Loss
│
0.35 ┤ ●
│ ╲
0.30 ┤ ╲
│ ╲
0.25 ┤ ╲
│ ╲
0.20 ┤ ●
│ ╲
0.15 ┤ ╲
│ ╲
0.10 ┤ ●───●───●
│ ╲
0.05 ┤ ●───●───●───●───●
│
0.00 ┼──────────────────────────────────────────→ Steps
0 200 400 600 800 1000 1200 1400 1600 1800
| Step | Training Loss | Validation Loss |
|---|---|---|
| 100 | 0.3108 | 0.2372 |
| 200 | 0.1446 | 0.1315 |
| 300 | 0.1131 | 0.1079 |
| 400 | 0.0912 | 0.0927 |
| 500 | 0.0884 | 0.0809 |
| 600 | 0.0759 | 0.0709 |
| 800 | 0.0583 | 0.0591 |
| 1000 | 0.0528 | 0.0509 |
| 1200 | 0.0459 | 0.0459 |
| 1400 | 0.0424 | 0.0435 |
| 1600 | 0.0404 | 0.0421 |
| 1800 | 0.0412 | 0.0417 |
Key Observations:
- Training loss dropped sharply early on, from 0.31 at step 100 to below 0.10 by step 400.
- Validation loss tracked training loss closely throughout, indicating no overfitting.
- Gains plateaued after roughly step 1600, supporting the choice of 3 epochs.
| Metric | Base Model (Qwen3-0.6B) | Fine-Tuned Model | Improvement |
|---|---|---|---|
| Exact Match | 2.0% | 93.0% | +91.0 pts |
| Fuzzy Match | 5.0% | 94.0% | +89.0 pts |
Accuracy Comparison
│
100% ┤ ████████
│ ████████
90% ┤ ████████
│ ████████
80% ┤ ████████
│ ████████
70% ┤ ████████
│ ████████
60% ┤ ████████
│ ████████
50% ┤ ████████
│ ████████
40% ┤ ████████
│ ████████
30% ┤ ████████
│ ████████
20% ┤ ████████
│ ████████
10% ┤ ██ ████████
│ ██ ████████
0% ┼──██──────────────────────────████████──
Base Model Fine-Tuned
(2%) (93%)
The fine-tuned model achieves a 46.5x improvement in exact match accuracy!
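The project's exact fuzzy-match criterion is not reproduced here; one simple way to implement such a metric is a normalized similarity ratio with a threshold, for example with Python's `difflib`. The threshold and helper names below are illustrative assumptions.

```python
from difflib import SequenceMatcher

def exact_match(pred: str, ref: str) -> bool:
    """Strict comparison after trimming surrounding whitespace."""
    return pred.strip() == ref.strip()

def fuzzy_match(pred: str, ref: str, threshold: float = 0.8) -> bool:
    """Accept near-identical commands (e.g. reordered flags) above a similarity threshold."""
    return SequenceMatcher(None, pred.strip(), ref.strip()).ratio() >= threshold

print(exact_match("ls -la", "ls -al"))  # False: same flags, different order
print(fuzzy_match("ls -la", "ls -al"))  # True under this relaxed criterion
```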
To ensure the model retained general language capabilities, we evaluated on HellaSwag, a commonsense reasoning benchmark.
| Model | HellaSwag Accuracy | Status |
|---|---|---|
| Base Qwen3-0.6B | 43.8% | Baseline |
| Fine-Tuned | 42.5% | ✅ Minimal degradation |
| Difference | -1.3 pts | Within acceptable range |
HellaSwag Performance (Catastrophic Forgetting Check)
│
50% ┤
│
45% ┤ ████████ ████████
│ ████████ ████████
40% ┤ ████████ ████████
│ ████████ ████████
35% ┤ ████████ ████████
│ ████████ ████████
30% ┤ ████████ ████████
│ ████████ ████████
25% ┤ ████████ ████████
│ ████████ ████████
20% ┤ ████████ ████████
│ ████████ ████████
┼──████████────────████████──
Base Model Fine-Tuned
(43.8%) (42.5%)
✅ Only a 1.3-point decrease: no significant catastrophic forgetting
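For reference, a benchmark like HellaSwag can be scored with EleutherAI's lm-evaluation-harness; the sketch below assumes its Python API and is not necessarily the evaluation setup used by this project.

```python
import lm_eval

# Score the fine-tuned model on HellaSwag (zero-shot).
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Eng-Elias/qwen3-0.6b-terminal-instruct",
    tasks=["hellaswag"],
    batch_size=8,
)
print(results["results"]["hellaswag"])  # accuracy metrics for the task
```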
The model was evaluated from multiple distribution sources (local checkpoints and the HuggingFace Hub) to verify consistency:
| Source | Exact Match | Fuzzy Match |
|---|---|---|
| Local LoRA Adapters | 93.0% | 94.0% |
| Local Merged Model | 90.0% | 91.0% |
| HuggingFace LoRA | 93.0% | 94.0% |
| HuggingFace Merged | 93.0% | 94.0% |
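The two Hub variants are loaded differently: the merged model behaves like a regular checkpoint, while the LoRA adapters are attached to the base model at load time. A sketch using the standard `transformers`/`peft` APIs (the base Hub id `Qwen/Qwen3-0.6B` is an assumption):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Option A: merged model, a regular checkpoint requiring no PEFT at inference.
merged = AutoModelForCausalLM.from_pretrained("Eng-Elias/qwen3-0.6b-terminal-instruct")

# Option B: LoRA adapters applied on top of the base Qwen3-0.6B weights.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B")
adapted = PeftModel.from_pretrained(base, "Eng-Elias/qwen3-0.6b-terminal-instruct-lora")

tokenizer = AutoTokenizer.from_pretrained("Eng-Elias/qwen3-0.6b-terminal-instruct")
```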
| Instruction | OS | Expected | Generated | Match |
|---|---|---|---|---|
| List all files including hidden ones | Linux | ls -la | ls -la | ✅ |
| Create a new folder named projects | Windows | mkdir projects | mkdir projects | ✅ |
| Show disk usage | Mac | df -h | df -h | ✅ |
| Find files larger than 100MB | Linux | find . -size +100M | find . -size +100M | ✅ |
| Kill process by name | Windows | taskkill /IM process.exe /F | taskkill /IM process.exe /F | ✅ |
Input:
Instruction: Delete file named temp.txt
Input: Return the command for all operating systems as JSON
Output:
{ "description": "Delete file named temp.txt", "linux": "rm temp.txt", "windows": "del temp.txt", "mac": "rm temp.txt" }
| Instruction | OS | Generated Command |
|---|---|---|
| Find all Python files modified in last 7 days | Linux | find . -name "*.py" -mtime -7 |
| Compress folder with tar and gzip | Linux | tar -czvf archive.tar.gz folder/ |
| Show network connections on specific port | Windows | netstat -an \| findstr :8080 |
| Create symbolic link | Mac | ln -s /path/to/target /path/to/link |
- **QLoRA Efficiency**: Training a 600M-parameter model on 6GB of VRAM was successful.
- **Rapid Convergence**: The model learned the task quickly.
- **Generalization**: High accuracy held across different command categories.
- **Catastrophic Forgetting Mitigation**: LoRA's additive nature preserved the base model's capabilities.
- **Memory Constraints**: Fitting training into 6GB of VRAM required 4-bit quantization, a small batch size with gradient accumulation, and gradient checkpointing.
- **Sequence Length Decisions**: A maximum sequence length of 256 tokens keeps memory usage low while remaining sufficient for command generation.
- **Evaluation Metrics**: Exact match alone is too strict for commands with equivalent variants (e.g., `ls -la` vs `ls -al`), which motivated the additional fuzzy-match metric.

This project successfully demonstrates that small language models can be effectively fine-tuned for specialized technical tasks. The fine-tuned Qwen3-0.6B model achieves over 93% exact-match accuracy on command generation, retains its general language capabilities, and ships as a 17.5MB LoRA adapter trained in under four hours on consumer hardware.
The combination of QLoRA fine-tuning, careful dataset preparation, and appropriate evaluation metrics enabled this compact model to serve as a practical cross-platform command-line assistant.
| Aspect | Result |
|---|---|
| Task Performance | >93% accuracy (46.5x improvement over base) |
| Efficiency | 17.5MB adapter, trains in 3.8 hours on RTX 2060 |
| Capability Retention | ~97% of original HellaSwag accuracy (1.3-point drop) |
| Practical Value | Ready-to-use multi-OS command assistant |
| Resource | Link |
|---|---|
| Merged Model | HuggingFace: Eng-Elias/qwen3-0.6b-terminal-instruct |
| LoRA Adapters | HuggingFace: Eng-Elias/qwen3-0.6b-terminal-instruct-lora |
| GitHub Repository | github.com/Eng-Elias/qwen3-600M-terminal-instruct |
| W&B Training Dashboard | wandb.ai/engelias-/qwen3-terminal-instruct |
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model
model = AutoModelForCausalLM.from_pretrained("Eng-Elias/qwen3-0.6b-terminal-instruct")
tokenizer = AutoTokenizer.from_pretrained("Eng-Elias/qwen3-0.6b-terminal-instruct")

# Generate a command
prompt = """### Instruction:
List all running Docker containers

### Input:
[LINUX]

### Response:
"""
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Output: docker ps
```
Eng. Elias Owis
This project is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).
This project was completed as part of the Ready Tensor LLM Engineering and Deployment Program.