Author: cmndcntrlcyber
Models: code-trainer-vision-adapter · qwen14b-code-trainer-v6-aggressive · qwen14b-code-trainer-v6-gguf
Dataset: code-trainer-offsec-dataset
Repository: github.com/cmndcntrlcyber/code-trainer-offsec-pipeline
W&B projects: rtpi-phase3-vision · rtpi-phase43-qwen14b
Code-Trainer V6 is a six-phase pipeline that takes a VS Code screenshot of
source code and emits the underlying source. It produces three publishable
artifacts: a Swin-B + Qwen2.5-Coder-1.5B multimodal LoRA adapter, a
Qwen2.5-Coder-14B-Instruct text-only LoRA adapter trained via a 3-config
sweep, and a Q4_K_M GGUF of the merged 14 B model for local inference.
Across two evaluation regimes (task-specific and a GSM8K catastrophic-
forgetting check) the fine-tuning measurably outperforms the base on the
intended task while preserving general reasoning. The full training stack
runs on a single 16 GB consumer GPU plus Hugging Face Skills A100 jobs,
keeping the estimated bill of materials at roughly $104 end-to-end.
Task: given a screenshot of code rendered in a VS Code-style editor,
reconstruct the source code as text.
This is a constrained instance of the broader "image-to-code" problem (UI
mocks, design files, etc.) chosen for two reasons:
We therefore split the work across two distinct fine-tuning phases:
cmndcntrlcyber/code-trainer-offsec-dataset

- main: text-only messages rows. Used for Phase 4.
- v2-multimodal: same rows with base64-encoded WebP screenshots embedded.

src/phase2_preprocessing/ rewrites captures into chat format.

We use a deterministic 80 / 10 / 10 split on row index. The dataset card
lists the upstream-repository licenses; the dataset itself inherits the
weakest license among contributors (i.e. compatible with permissive
fine-tuning use).
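A deterministic split on row index can be sketched as follows. The modulo rule and the name `split_for_index` are assumptions made for illustration; the pipeline's actual assignment rule may differ (for example, contiguous index ranges).

```python
from collections import Counter


def split_for_index(row_index: int) -> str:
    """Assign a row to a split using only its index (hypothetical rule)."""
    bucket = row_index % 10  # ten buckets give an 80 / 10 / 10 split
    if bucket < 8:
        return "train"
    if bucket == 8:
        return "validation"
    return "test"


# The same index always lands in the same split, so reruns are reproducible.
counts = Counter(split_for_index(i) for i in range(1000))
print(counts)
```

Because the assignment depends only on the index, no random seed has to be stored alongside the dataset.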
    image (224 × 224 × 3)
        │
        ▼
    Swin-B encoder (frozen, 87.7 M)
        │  visual feature sequence (49 × 1024)
        ▼
    MLP projector (2-layer, GELU, trained, ~2.1 M)
        │  decoder embedding sequence (49 × 1536)
        ▼
    Qwen2.5-Coder-1.5B-Instruct + LoRA (r=16, α=32, trained, ~24 M)
        │
        ▼
    source code tokens
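The 49 × 1024 feature sequence in the diagram follows from the standard Swin-B layout (patch size 4, base width 128, four stages with patch merging in between). The sketch below reproduces only that shape arithmetic; the function name is hypothetical and no model weights are involved.

```python
def swin_b_output_shape(image_size: int = 224,
                        patch_size: int = 4,
                        base_dim: int = 128,
                        num_stages: int = 4) -> tuple[int, int]:
    """Return (num_tokens, channels) after the last Swin stage."""
    side = image_size // patch_size      # 224 / 4 = 56 patches per side
    dim = base_dim
    for _ in range(num_stages - 1):      # patch merging between stages
        side //= 2                       # halves the spatial resolution
        dim *= 2                         # doubles the channel width
    return side * side, dim


tokens, channels = swin_b_output_shape()
print(tokens, channels)  # 49 1024
```

The projector then maps each of the 49 tokens from width 1024 to the decoder's embedding width of 1536, leaving the sequence length unchanged.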
The vision encoder stays frozen. Swin-B is pretrained on natural images, but
its low-level features (edges, repeated motifs, gridded layouts) carry over
well enough to code-screenshot inputs that we did not see a return on
unfreezing during early ablations.
Three LoRA configurations were trained for a single epoch over the full
26,126-row training split, then ranked by eval_loss on the full
3,265-row validation split:
| Config | r / α | LR | Effective batch (batch × accum) | Notes |
|---|---|---|---|---|
| conservative | 16 / 32 | 1.5 e-4 | 16 (1 × 16) | small adapter, small LR |
| standard | 32 / 64 | 2 e-4 | 16 (2 × 8) | middle ground |
| aggressive | 64 / 128 | 3 e-4 | 16 (4 × 4) | largest adapter, largest LR |
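The three configurations can be written down as plain data, which makes two invariants of the sweep explicit: the effective batch is always 16 and α is always 2r. The class and field names here are illustrative and not taken from the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SweepConfig:
    name: str
    lora_r: int
    lora_alpha: int
    learning_rate: float
    per_device_batch: int
    grad_accum_steps: int

    @property
    def effective_batch(self) -> int:
        # Samples contributing to each optimizer step on a single GPU.
        return self.per_device_batch * self.grad_accum_steps


# Values copied from the sweep table above.
SWEEP = [
    SweepConfig("conservative", 16, 32, 1.5e-4, 1, 16),
    SweepConfig("standard", 32, 64, 2e-4, 2, 8),
    SweepConfig("aggressive", 64, 128, 3e-4, 4, 4),
]

for cfg in SWEEP:
    print(cfg.name, cfg.effective_batch, cfg.lora_alpha / cfg.lora_r)
```

Holding the effective batch fixed means the sweep varies adapter capacity and learning rate together, so the ranking cannot attribute the gain to either one alone.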
A second-pass experiment (Phase 4B) trained the winning aggressive config
for 3 epochs over an 8 K slice to test whether more passes were worth
fewer unique examples. They were not: Phase 4B's full-validation eval_loss was
0.5126, vs Phase 4A's 0.4724 (lower is better). The single-epoch full-data adapter is the
canonical Phase 5 conversion target.
| Layer | Choice | Why |
|---|---|---|
| Hardware | RTX 5060 Ti 16 GB (Blackwell) for Phase 3 dev; HF Skills a100-large for cloud runs | Single-GPU dev loop + cheap A100 bursts |
| Precision | BF16 + gradient checkpointing | Blackwell tensor cores are best at BF16; checkpointing keeps 14 B + LoRA in 80 GB |
| Trainer | transformers.TrainingArguments + trl.SFTTrainer (Phase 4); custom Trainer (Phase 3) | SFTTrainer's chat-template formatter is exactly what we want for Phase 4 |
| LoRA | peft.LoraConfig on q_proj, k_proj, v_proj, o_proj (Phase 4) or all linear (Phase 3) | Standard recipe; r ∈ {16, 32, 64} swept |
| Dependency mgmt | uv with pyproject.toml + uv.lock; torch pinned to cu128 | Reproducible local + cloud installs (uv sync --frozen) |
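To get a feel for how adapter size grows across r ∈ {16, 32, 64}, the sketch below counts LoRA parameters for the four attention projections. Each adapted matrix of shape d_out × d_in adds r · (d_in + d_out) trainable values. The model dimensions are assumptions based on the published Qwen2.5-14B architecture and should be checked against the model's `config.json`; the function names are hypothetical.

```python
def lora_params(r: int, d_in: int, d_out: int) -> int:
    """Trainable entries for one adapted matrix: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


# Assumed Qwen2.5-14B-style dimensions; verify against the model's config.json.
HIDDEN = 5120      # model width
KV_DIM = 1024      # 8 key/value heads x 128 head dim (grouped-query attention)
NUM_LAYERS = 48


def attention_lora_total(r: int) -> int:
    per_layer = (
        lora_params(r, HIDDEN, HIDDEN)    # q_proj
        + lora_params(r, HIDDEN, KV_DIM)  # k_proj
        + lora_params(r, HIDDEN, KV_DIM)  # v_proj
        + lora_params(r, HIDDEN, HIDDEN)  # o_proj
    )
    return per_layer * NUM_LAYERS


for r in (16, 32, 64):
    print(r, attention_lora_total(r))
```

Under these assumptions the count scales linearly with r, so the aggressive configuration trains four times as many adapter parameters as the conservative one.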
| Phase | Knob | Value |
|---|---|---|
| 3 | LR | 2 e-4 |
| 3 | Batch × accum | 8 × 4 (eff 32) |
| 3 | Epochs | 3 |
| 3 | LoRA r / α / dropout | 16 / 32 / 0.05 |
| 3 | Max seq | 2048 |
| 4 (sweep winner) | LR | 3 e-4 |
| 4 (sweep winner) | Batch × accum | 4 × 4 (eff 16) |
| 4 (sweep winner) | Epochs | 1 |
| 4 (sweep winner) | LoRA r / α / dropout | 64 / 128 / 0.05 |
| 4 (sweep winner) | Max seq | 2048 |
| Metric | Base | Fine-tuned | Δ |
|---|---|---|---|
| exact_match | 0.0000 | 0.0000 | 0 |
| bleu_4 | 0.0000 | 0.0000 | 0 |
| mean_edit_similarity | 0.0382 | 0.0446 | +16.8 % |
| syntax_valid_rate † | 0.1950 | 0.6100 | +213 % |

† Python parser; the test split is multilingual so absolute numbers are not
directly comparable across languages. The delta is the meaningful signal.
syntax_valid_rate more than tripled: the adapter learned to emit
code-shaped output rather than free-form text. mean_edit_similarity moved
modestly. exact_match and bleu_4 are zero on both rows: the model is
paraphrasing the source rather than reconstructing it byte-for-byte. For
a 1.5 B base model with ~5.5 h of training on 26 K multilingual samples, this
is likely the realistic ceiling without scaling.
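The two metrics that moved can be approximated with the standard library alone. This is a minimal sketch, assuming syntax_valid_rate means "parses with Python's `ast`" (as the table footnote suggests) and that edit similarity is a character-level similarity ratio; `difflib.SequenceMatcher` is swapped in here as a stand-in, and the pipeline's evaluator may define the metric differently.

```python
import ast
from difflib import SequenceMatcher


def is_valid_python(code: str) -> bool:
    """True if the text parses as Python source."""
    try:
        ast.parse(code)
        return True
    except (SyntaxError, ValueError):
        return False


def syntax_valid_rate(predictions: list[str]) -> float:
    if not predictions:
        return 0.0
    return sum(is_valid_python(p) for p in predictions) / len(predictions)


def edit_similarity(prediction: str, reference: str) -> float:
    # Stand-in similarity in [0, 1]; 1.0 means identical strings.
    return SequenceMatcher(None, prediction, reference).ratio()


def mean_edit_similarity(predictions: list[str], references: list[str]) -> float:
    pairs = list(zip(predictions, references))
    if not pairs:
        return 0.0
    return sum(edit_similarity(p, r) for p, r in pairs) / len(pairs)


preds = ["x = 1", "Here is the code you asked for:"]
print(syntax_valid_rate(preds))
```

A Python-only parser rejects correct output in other languages, which is why the footnote treats the delta, not the absolute rate, as the signal.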
| Rank | Config | r / α | LR | eval_loss |
|---|---|---|---|---|
| 1 | aggressive | 64 / 128 | 3 e-4 | 0.4724 |
| 2 | standard | 32 / 64 | 2 e-4 | 0.4798 |
| 3 | conservative | 16 / 32 | 1.5 e-4 | 0.4819 |
Larger r + larger LR cleanly won, with monotonic improvement across the
three configurations. The Phase 4B 3-epoch / 8 K-slice rerun of aggressive
underperformed at 0.5126, so we kept Phase 4A's adapter.
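The selection rule is simply "lowest eval_loss on the full validation split wins". A small sketch with the reported numbers; the variable names are illustrative only.

```python
# eval_loss values copied from the sweep table; lower is better.
phase_4a = {"aggressive": 0.4724, "standard": 0.4798, "conservative": 0.4819}
phase_4b_aggressive = 0.5126  # 3 epochs over the 8 K slice

ranked = sorted(phase_4a.items(), key=lambda item: item[1])
best_name, best_loss = ranked[0]

# Keep the single-epoch adapter unless the multi-epoch rerun beats it.
keep_phase_4a = best_loss < phase_4b_aggressive

print(best_name, best_loss, keep_phase_4a)
```

The gaps between the three sweep entries are small (under 0.01), while the gap to the Phase 4B rerun is roughly 0.04.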
GSM8K (grade-school math word problems) is orthogonal to the screenshot-to-code
training task. A small drop would be expected; a large drop would indicate
catastrophic forgetting of general reasoning.
| Run | exact_match (flexible-extract) | exact_match (strict-match) |
|---|---|---|
| Base Qwen/Qwen2.5-Coder-14B-Instruct | 0.6050 ± 0.013 | 0.0000 |
| + adapter qwen14b-code-trainer-v6-aggressive | 0.6778 ± 0.013 | 0.0000 |
| Δ | +0.0728 (+12.0 % relative) | n/a |
Result: no forgetting, and a small lift. The adapter scores 67.8 %
on GSM8K vs the base model's 60.5 %, a 12 % relative improvement on a
domain we never trained on. This is consistent with the chat-format SFT
having taught the model cleaner answer formatting on free-form prompts; the
LoRA changes did not erase math-reasoning capability.

strict-match (which expects GSM8K's raw #### 42 answer convention) is
zero on both rows because the chat-trained model emits prose like "The
answer is 42", a formatting artifact rather than a reasoning failure. Both rows
use the same lm-evaluation-harness (lm-eval==0.4.4) pipeline, launched
via launch_benchmark.py.
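The gap between the two columns comes down to how the final answer is pulled out of the generation. The sketch below uses simplified regular expressions written for illustration; they approximate the idea behind the two filters and are not the harness's exact patterns.

```python
import re
from typing import Optional

# Simplified stand-ins for the two answer filters (assumed patterns).
STRICT = re.compile(r"#### (-?[0-9.,]+)")
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def strict_extract(text: str) -> Optional[str]:
    """Accept only the raw GSM8K convention, e.g. '#### 42'."""
    match = STRICT.search(text)
    return match.group(1) if match else None


def flexible_extract(text: str) -> Optional[str]:
    """Take the last number that appears anywhere in the text."""
    matches = NUMBER.findall(text)
    return matches[-1].replace(",", "") if matches else None


chat_style = "She sells 6 boxes at 7 dollars each. The answer is 42."
print(strict_extract(chat_style), flexible_extract(chat_style))
```

A chat-style answer yields nothing under the strict pattern but the correct number under the flexible one, which matches the 0.0000 strict-match column on both rows.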
| Phase | Hours on a100-large | Cost |
|---|---|---|
| 3 (training) | ~5.5 | ~$17.60 |
| 3 (eval) | 0.34 | ~$1.10 |
| 4A (sweep, 3 configs) | ~21 | ~$66.10 |
| 4B (3-epoch slice) | 4.9 | ~$15.60 |
| 4 eval (full-val rerun) | 0.25 | ~$0.80 |
| 5 (GGUF convert) | ~1 | ~$2.00 |
| Forgetting bench (GSM8K × 2) | ~0.5 | ~$1.50 |
| Total (estimated) | ~33 | ~$104 |
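The totals can be checked by summing the rows. The figures below are copied from the table; all of them are the approximate values reported there.

```python
# (phase, hours on a100-large, cost in USD), copied from the table above.
ROWS = [
    ("3 (training)", 5.5, 17.60),
    ("3 (eval)", 0.34, 1.10),
    ("4A (sweep, 3 configs)", 21.0, 66.10),
    ("4B (3-epoch slice)", 4.9, 15.60),
    ("4 eval (full-val rerun)", 0.25, 0.80),
    ("5 (GGUF convert)", 1.0, 2.00),
    ("Forgetting bench (GSM8K x 2)", 0.5, 1.50),
]

total_hours = sum(hours for _, hours, _ in ROWS)
total_cost = sum(cost for _, _, cost in ROWS)

print(f"{total_hours:.2f} h, ${total_cost:.2f}")  # 33.49 h, $104.70
```

The Phase 4A sweep accounts for roughly 63 % of both the hours and the spend.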
What worked.
- aggressive LoRA at r = 64. Each step up the sweep ladder bought a [...]
- [...] huggingface/transformers-pytorch-gpu image and pinned torch to cu128, a [...]

What didn't.

- [...] (--train-limit 16000-20000 × 2 epochs would probably have won).
- wait_for_job(timeout=...) [...] wait_timeout_seconds (6 h) decoupled from timeout_seconds (2 h runtime cap).
- evaluator.py (output_ids[combined.shape[1]:]) discarded everything when [...] inputs_embeds. Caught by the [...] generate() with inputs_embeds already returns only the [...]

Limitations.
Every artifact above can be reproduced by cloning the repo, setting two env
vars, and running one command per phase:
    git clone https://github.com/cmndcntrlcyber/code-trainer-offsec-pipeline.git
    cd code-trainer-offsec-pipeline
    uv sync --frozen
    playwright install chromium          # only needed for Phase 1 capture

    export GITHUB_TOKEN=...              # only Phase 1
    export HF_USERNAME=...               # all phases that push to Hub
    set -a && source .env && set +a      # if you keep an .env file

    # Phase 1: collect & screenshot (local)
    python -m src.phase1_data_collection.scripts.run_collection \
        --config src/config/v6_config.yaml

    # Phase 2: build dataset + push to Hub
    python -m src.phase2_preprocessing.scripts.build_dataset \
        --config src/config/v6_config.yaml
    python -m src.phase2_preprocessing.scripts.upload_to_hub \
        --config src/config/v6_config.yaml

    # Phase 3: Swin-B + Qwen-1.5B LoRA on HF Skills A100
    python -m src.phase3_vision_model.scripts.launch_vision_training \
        --config src/config/v6_config.yaml --wait

    # Phase 4: Qwen-14B LoRA sweep + best-config retrain
    python -m src.phase4_qwen_finetuning.scripts.launch_validation_sweep \
        --config src/config/v6_config.yaml --wait
    python -m src.phase4_qwen_finetuning.scripts.launch_full_training \
        --config src/config/v6_config.yaml --best-config aggressive --wait

    # Phase 5: merge LoRA, convert to Q4_K_M GGUF, push to Hub
    python -m src.phase5_deployment.scripts.launch_convert \
        --config src/config/v6_config.yaml --wait

    # Forgetting check
    python -m src.phase4_qwen_finetuning.scripts.launch_benchmark \
        --adapter cmndcntrlcyber/qwen14b-code-trainer-v6-aggressive --wait
    python -m src.phase4_qwen_finetuning.scripts.launch_benchmark \
        --adapter cmndcntrlcyber/qwen14b-code-trainer-v6-aggressive \
        --baseline --wait
Costs are listed in §4.4. All training runs publish to W&B at the project URLs
above; per-run JSON metrics are also pushed back to each adapter's Hub repo
for traceability.
- [...] syntax_valid_rate with a language-specific syntax check; report Δs per [...]
- [...] src/phase6_inference/ [...]
- [...] mean_edit_similarity.

The pipeline is deliberately built on permissive open-source primitives:
Qwen2.5-Coder,
Swin Transformer,
peft,
trl,
llama.cpp, and
lm-evaluation-harness.
We Got Claude to Fine-Tune an Open Source LLM
Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning
Hugging Face Skills handled the cloud GPU bursts at the cost of one queue
incident and one timeout-semantics bug, both fixed and documented in the
linked codebase.