A Technical Guide to Building a Multi-OS Terminal Commands Assistant with QLoRA
Terminal commands are the backbone of software development, system administration, and DevOps workflows. However, remembering the exact syntax across different operating systems—Linux, Windows, and macOS—presents a significant challenge, especially for developers who frequently switch between environments.
This project fine-tunes Qwen3-0.6B, a compact yet capable language model, to translate natural language instructions into accurate terminal commands for all three major operating systems. The result is a lightweight, efficient model that can serve as an intelligent command-line assistant.
Fine-tune a language model to generate precise terminal commands from natural language descriptions, with support for:
- Linux/Unix commands
- Windows CMD commands
- macOS commands
- A JSON mode that returns the command for all three operating systems at once
The training dataset was synthetically generated using a combination of template-based generation and GPT-assisted augmentation, ensuring diverse coverage of common terminal operations.
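To make the generation approach concrete, here is a minimal sketch of template-based sample generation. The templates, helper names, and value pools are illustrative assumptions, not the project's actual generation script.

```python
import json
import random

# Hypothetical templates: each maps a natural-language pattern to per-OS commands.
TEMPLATES = [
    {
        "instruction": "List all files including hidden ones",
        "outputs": {"[LINUX]": "ls -la", "[WINDOWS]": "dir /a", "[MAC]": "ls -la"},
    },
    {
        "instruction": "Create a new folder named {name}",
        "outputs": {"[LINUX]": "mkdir {name}", "[WINDOWS]": "mkdir {name}", "[MAC]": "mkdir {name}"},
    },
]

def generate_samples(n=10):
    """Produce Alpaca-style records by filling templates with sampled values."""
    samples = []
    for _ in range(n):
        template = random.choice(TEMPLATES)
        name = random.choice(["projects", "backup", "logs"])
        os_tag, command = random.choice(list(template["outputs"].items()))
        samples.append({
            "instruction": template["instruction"].format(name=name),
            "input": os_tag,
            "output": command.format(name=name),
        })
    return samples

if __name__ == "__main__":
    for record in generate_samples(3):
        print(json.dumps(record))
```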
| Split | Samples | Purpose |
|---|---|---|
| Train | >9000 | Model training |
| Validation | >1000 | Hyperparameter tuning |
| Test | >500 | Final evaluation |
| Total | >12,000 | - |
Each sample follows the Alpaca instruction format:
{ "instruction": "List all files including hidden ones", "input": "[LINUX]", "output": "ls -la" }
| Input Type | Description | Example Output |
|---|---|---|
| `[LINUX]` | Linux/Unix command | `ls -la` |
| `[WINDOWS]` | Windows CMD command | `dir /a` |
| `[MAC]` | macOS command | `ls -la` |
| `""` (empty) | Cross-platform or context-dependent | Varies |
| `"Return...JSON"` | JSON with all OS commands | `{"linux": "...", "windows": "...", "mac": "..."}` |
┌─────────────────────────────────────────────────────────────┐
│ Command Categories │
├─────────────────┬───────────────────────────────────────────┤
│ File Operations │ list, copy, move, delete, find, rename │
│ Directory Ops │ create, remove, navigate, list contents │
│ System Info │ disk usage, memory, CPU, processes │
│ Text Processing │ grep, sed, awk, sort, uniq, head, tail │
│ Network │ ping, curl, wget, netstat, ssh, scp │
│ Compression │ tar, zip, gzip, unzip, 7z │
│ Permissions │ chmod, chown, icacls, attrib │
│ Package Mgmt │ apt, yum, brew, choco, winget │
│ Git Operations │ clone, commit, push, pull, branch │
│ Docker │ run, build, compose, exec, logs │
└─────────────────┴───────────────────────────────────────────┘
| Criterion | Choice | Rationale |
|---|---|---|
| Model | Qwen3-0.6B | Excellent performance-to-size ratio |
| Parameters | 600M | Fits in 6GB VRAM with 4-bit quantization |
| Architecture | Decoder-only Transformer | Standard for text generation |
| Context Length | 32K tokens | More than sufficient for commands |
QLoRA (Quantized Low-Rank Adaptation) was chosen to enable training on my laptop hardware while maintaining model quality.
```python
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_use_double_quant=True,        # Nested quantization
    bnb_4bit_compute_dtype=torch.float16,  # Computation precision
)
```
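For context, here is a minimal sketch of passing this config when loading the base model with the Hugging Face `transformers` API; the Hub id `Qwen/Qwen3-0.6B` and the `device_map` choice are assumptions, and the project's loading code may differ.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# `bnb_config` is the BitsAndBytesConfig defined above.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-0.6B",
    quantization_config=bnb_config,  # store weights in 4-bit NF4
    device_map="auto",               # place layers on the available GPU
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
```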
```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                  # Rank of the update matrices
    lora_alpha=32,         # Scaling factor
    lora_dropout=0.1,      # Dropout for regularization
    target_modules=[       # Attention projections to adapt
        "q_proj", "k_proj", "v_proj", "o_proj",
    ],
    bias="none",
    task_type="CAUSAL_LM",
)
```
Why these settings? A rank of 16 with `lora_alpha=32` (an effective scaling of 2x) balances adapter capacity against trainable parameter count, a dropout of 0.1 guards against overfitting on a relatively small dataset, and restricting the adapters to the attention projections keeps trainable parameters, and therefore VRAM usage, low.
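Applying the adapters follows the standard `peft` workflow. The sketch below continues from the quantized model and `lora_config` above, so the exact calls are assumptions about the project's script.

```python
from peft import get_peft_model, prepare_model_for_kbit_training

# `model` is the 4-bit base model and `lora_config` the LoraConfig from above.
model = prepare_model_for_kbit_training(model)  # prepare quantized weights for training
model = get_peft_model(model, lora_config)      # wrap with trainable LoRA adapters
model.print_trainable_parameters()              # prints trainable vs. total parameter counts
```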
| Parameter | Value | Rationale |
|---|---|---|
| Epochs | 3 | Sufficient convergence without overfitting |
| Batch Size | 2 | Memory constraint |
| Gradient Accumulation | 8 | Effective batch size of 16 |
| Learning Rate | 2e-4 | Standard for LoRA fine-tuning |
| LR Scheduler | Cosine | Smooth decay for better convergence |
| Warmup Ratio | 0.1 | 10% of steps for warmup |
| Optimizer | Paged AdamW 8-bit | Memory-efficient optimizer |
| Max Sequence Length | 256 | Sufficient for command generation |
| Gradient Checkpointing | Enabled | Reduces memory at cost of speed |
| Precision | FP16 | Mixed precision training |
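These values map onto Hugging Face `TrainingArguments` roughly as follows; the argument names come from the standard API, but the output directory and logging cadence are illustrative assumptions rather than the project's exact configuration.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="qwen3-terminal-instruct",  # illustrative name
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,         # effective batch size 2 x 8 = 16
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    optim="paged_adamw_8bit",
    fp16=True,
    gradient_checkpointing=True,
    logging_steps=100,
)
```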
| Component | Specification |
|---|---|
| GPU | NVIDIA RTX 2060 (6GB VRAM) |
| RAM | 32GB DDR4 |
| Storage | NVMe SSD |
| Framework | HuggingFace Transformers + PEFT |
| Training Time | ~3.8 hours (229 minutes) |
┌─────────────────────────────────────────────────────────────┐
│ Training Pipeline │
├─────────────────────────────────────────────────────────────┤
│ 1. Load Base Model (Qwen3-0.6B) with 4-bit Quantization │
│ ↓ │
│ 2. Apply LoRA Adapters to Attention Layers │
│ ↓ │
│ 3. Prepare Dataset (Tokenize + Format) │
│ ↓ │
│ 4. Train with HuggingFace Trainer │
│ ↓ │
│ 5. Evaluate on Validation Set │
│ ↓ │
│ 6. Save LoRA Adapters + Merge with Base Model │
│ ↓ │
│ 7. Publish to HuggingFace Hub │
└─────────────────────────────────────────────────────────────┘
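Steps 4 through 7 can be sketched as follows, reusing the objects from the earlier snippets and assuming the tokenized train/validation datasets from step 3 are already prepared. Repo and directory names follow the links in the Resources section; the project's actual script may differ.

```python
import torch
from transformers import Trainer, AutoModelForCausalLM, DataCollatorForLanguageModeling
from peft import PeftModel

# Steps 4-5: train with the HuggingFace Trainer and evaluate on the validation split.
trainer = Trainer(
    model=model,                  # PEFT-wrapped 4-bit model from the earlier sketches
    args=training_args,
    train_dataset=train_dataset,  # tokenized splits assumed from step 3
    eval_dataset=val_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

# Step 6: save the LoRA adapters, then merge them into a full-precision copy of the base.
trainer.model.save_pretrained("qwen3-terminal-lora")
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B", torch_dtype=torch.float16)
merged = PeftModel.from_pretrained(base, "qwen3-terminal-lora").merge_and_unload()

# Step 7: publish both artifacts to the HuggingFace Hub.
merged.push_to_hub("Eng-Elias/qwen3-0.6b-terminal-instruct")
trainer.model.push_to_hub("Eng-Elias/qwen3-0.6b-terminal-instruct-lora")
```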
The model showed rapid convergence, with loss decreasing significantly in the first epoch and continuing to improve throughout training.
Loss
│
0.35 ┤ ●
│ ╲
0.30 ┤ ╲
│ ╲
0.25 ┤ ╲
│ ╲
0.20 ┤ ●
│ ╲
0.15 ┤ ╲
│ ╲
0.10 ┤ ●───●───●
│ ╲
0.05 ┤ ●───●───●───●───●
│
0.00 ┼──────────────────────────────────────────→ Steps
0 200 400 600 800 1000 1200 1400 1600 1800
| Step | Training Loss | Validation Loss |
|---|---|---|
| 100 | 0.3108 | 0.2372 |
| 200 | 0.1446 | 0.1315 |
| 300 | 0.1131 | 0.1079 |
| 400 | 0.0912 | 0.0927 |
| 500 | 0.0884 | 0.0809 |
| 600 | 0.0759 | 0.0709 |
| 800 | 0.0583 | 0.0591 |
| 1000 | 0.0528 | 0.0509 |
| 1200 | 0.0459 | 0.0459 |
| 1400 | 0.0424 | 0.0435 |
| 1600 | 0.0404 | 0.0421 |
| 1800 | 0.0412 | 0.0417 |
Key Observations:
- Training loss dropped sharply early on, from 0.31 at step 100 to below 0.10 by step 400.
- Validation loss tracked training loss closely throughout, indicating no overfitting.
- Gains plateaued after roughly step 1600, supporting the choice of 3 epochs.
| Metric | Base Model (Qwen3-0.6B) | Fine-Tuned Model | Improvement |
|---|---|---|---|
| Exact Match | 2.0% | 93.0% | +91.0 pts |
| Fuzzy Match | 5.0% | 94.0% | +89.0 pts |
Accuracy Comparison
│
100% ┤ ████████
│ ████████
90% ┤ ████████
│ ████████
80% ┤ ████████
│ ████████
70% ┤ ████████
│ ████████
60% ┤ ████████
│ ████████
50% ┤ ████████
│ ████████
40% ┤ ████████
│ ████████
30% ┤ ████████
│ ████████
20% ┤ ████████
│ ████████
10% ┤ ██ ████████
│ ██ ████████
0% ┼──██──────────────────────────████████──
Base Model Fine-Tuned
(2%) (93%)
The fine-tuned model achieves a 46.5x improvement in exact match accuracy!
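The project's exact fuzzy-match criterion is not reproduced here; one simple way to implement such a metric is a normalized similarity ratio with a threshold, for example with Python's `difflib`. The threshold and helper names below are illustrative assumptions.

```python
from difflib import SequenceMatcher

def exact_match(pred: str, ref: str) -> bool:
    """Strict comparison after trimming surrounding whitespace."""
    return pred.strip() == ref.strip()

def fuzzy_match(pred: str, ref: str, threshold: float = 0.8) -> bool:
    """Accept near-identical commands (e.g. reordered flags) above a similarity threshold."""
    return SequenceMatcher(None, pred.strip(), ref.strip()).ratio() >= threshold

print(exact_match("ls -la", "ls -al"))  # False: same flags, different order
print(fuzzy_match("ls -la", "ls -al"))  # True under this relaxed criterion
```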
To ensure the model retained general language capabilities, we evaluated on HellaSwag, a commonsense reasoning benchmark.
| Model | HellaSwag Accuracy | Status |
|---|---|---|
| Base Qwen3-0.6B | 43.8% | Baseline |
| Fine-Tuned | 42.5% | ✅ Minimal degradation |
| Difference | -1.3 pts | Within acceptable range |
HellaSwag Performance (Catastrophic Forgetting Check)
│
50% ┤
│
45% ┤ ████████ ████████
│ ████████ ████████
40% ┤ ████████ ████████
│ ████████ ████████
35% ┤ ████████ ████████
│ ████████ ████████
30% ┤ ████████ ████████
│ ████████ ████████
25% ┤ ████████ ████████
│ ████████ ████████
20% ┤ ████████ ████████
│ ████████ ████████
┼──████████────────████████──
Base Model Fine-Tuned
(43.8%) (42.5%)
✅ Only a 1.3-point decrease: no significant catastrophic forgetting
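For reference, a benchmark like HellaSwag can be scored with EleutherAI's lm-evaluation-harness; the sketch below assumes its Python API and is not necessarily the evaluation setup used by this project.

```python
import lm_eval

# Score the fine-tuned model on HellaSwag (zero-shot).
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Eng-Elias/qwen3-0.6b-terminal-instruct",
    tasks=["hellaswag"],
    batch_size=8,
)
print(results["results"]["hellaswag"])  # accuracy metrics for the task
```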
The model was evaluated from multiple distribution sources (local checkpoints and the HuggingFace Hub) to verify consistency:
| Source | Exact Match | Fuzzy Match |
|---|---|---|
| Local LoRA Adapters | 93.0% | 94.0% |
| Local Merged Model | 90.0% | 91.0% |
| HuggingFace LoRA | 93.0% | 94.0% |
| HuggingFace Merged | 93.0% | 94.0% |
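The two Hub variants are loaded differently: the merged model behaves like a regular checkpoint, while the LoRA adapters are attached to the base model at load time. A sketch using the standard `transformers`/`peft` APIs (the base Hub id `Qwen/Qwen3-0.6B` is an assumption):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Option A: merged model, a regular checkpoint requiring no PEFT at inference.
merged = AutoModelForCausalLM.from_pretrained("Eng-Elias/qwen3-0.6b-terminal-instruct")

# Option B: LoRA adapters applied on top of the base Qwen3-0.6B weights.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B")
adapted = PeftModel.from_pretrained(base, "Eng-Elias/qwen3-0.6b-terminal-instruct-lora")

tokenizer = AutoTokenizer.from_pretrained("Eng-Elias/qwen3-0.6b-terminal-instruct")
```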
| Instruction | OS | Expected | Generated | Match |
|---|---|---|---|---|
| List all files including hidden ones | Linux | ls -la | ls -la | ✅ |
| Create a new folder named projects | Windows | mkdir projects | mkdir projects | ✅ |
| Show disk usage | Mac | df -h | df -h | ✅ |
| Find files larger than 100MB | Linux | find . -size +100M | find . -size +100M | ✅ |
| Kill process by name | Windows | taskkill /IM process.exe /F | taskkill /IM process.exe /F | ✅ |
Input:
Instruction: Delete file named temp.txt
Input: Return the command for all operating systems as JSON
Output:
{ "description": "Delete file named temp.txt", "linux": "rm temp.txt", "windows": "del temp.txt", "mac": "rm temp.txt" }
| Instruction | OS | Generated Command |
|---|---|---|
| Find all Python files modified in last 7 days | Linux | find . -name "*.py" -mtime -7 |
| Compress folder with tar and gzip | Linux | tar -czvf archive.tar.gz folder/ |
| Show network connections on specific port | Windows | netstat -an \| findstr :8080 |
| Create symbolic link | Mac | ln -s /path/to/target /path/to/link |
- **QLoRA Efficiency**: Training a 600M-parameter model on 6GB of VRAM was successful.
- **Rapid Convergence**: The model learned the task quickly.
- **Generalization**: High accuracy held across different command categories.
- **Catastrophic Forgetting Mitigation**: LoRA's additive nature preserved the base model's capabilities.
- **Memory Constraints**: Fitting training into 6GB of VRAM required 4-bit quantization, a small batch size with gradient accumulation, and gradient checkpointing.
- **Sequence Length Decisions**: A maximum sequence length of 256 tokens keeps memory usage low while remaining sufficient for command generation.
- **Evaluation Metrics**: Exact match alone is too strict for commands with equivalent variants (e.g., `ls -la` vs `ls -al`), which motivated the additional fuzzy-match metric.

This project successfully demonstrates that small language models can be effectively fine-tuned for specialized technical tasks. The fine-tuned Qwen3-0.6B model achieves over 93% exact-match accuracy on command generation, retains its general language capabilities, and ships as a 17.5MB LoRA adapter trained in under four hours on consumer hardware.
The combination of QLoRA fine-tuning, careful dataset preparation, and appropriate evaluation metrics enabled this compact model to serve as a practical cross-platform command-line assistant.
| Aspect | Result |
|---|---|
| Task Performance | >93% accuracy (46.5x improvement over base) |
| Efficiency | 17.5MB adapter, trains in 3.8 hours on RTX 2060 |
| Capability Retention | ~97% of original HellaSwag accuracy (1.3-point drop) |
| Practical Value | Ready-to-use multi-OS command assistant |
| Resource | Link |
|---|---|
| Merged Model | HuggingFace: Eng-Elias/qwen3-0.6b-terminal-instruct |
| LoRA Adapters | HuggingFace: Eng-Elias/qwen3-0.6b-terminal-instruct-lora |
| GitHub Repository | github.com/Eng-Elias/qwen3-600M-terminal-instruct |
| W&B Training Dashboard | wandb.ai/engelias-/qwen3-terminal-instruct |
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model
model = AutoModelForCausalLM.from_pretrained("Eng-Elias/qwen3-0.6b-terminal-instruct")
tokenizer = AutoTokenizer.from_pretrained("Eng-Elias/qwen3-0.6b-terminal-instruct")

# Generate a command
prompt = """### Instruction:
List all running Docker containers

### Input:
[LINUX]

### Response:
"""
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Output: docker ps
```
Eng. Elias Owis
This project is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).
This project was completed as part of the Ready Tensor LLM Engineering and Deployment Program.