ShayGPT is a 1.3 B-parameter autoregressive language model whose entire training pipeline (data ingestion, optimisation, LoRA-based adapter fine-tuning, and model merging) was implemented from scratch.
Pre-training was carried out for approximately 5 × 10^10 tokens (≈ 450,000 optimisation steps) on two NVIDIA H200 SXM GPUs (188 GB total HBM3) using bfloat16 precision and 8-bit AdamW.
A curriculum of six English corpora (OSCAR, OpenWebText, Wiki40B, GPT-2 dumps, MiniPile, and BookCorpus), delivered by a streaming DataLoader, kept GPU utilisation above 98 %.
The model reaches a cross-entropy loss of 1.40 (perplexity ≈ 4.1) at the end of pre-training and supports parameter-efficient instruction tuning via LoRA (r = 16, α = 32). All weights, whether merged or adapter-only, can be loaded for inference on CPU-only hosts, including Apple M-series Macs, demonstrating an end-to-end lightweight workflow.
Large language models (LLMs) normally require vast computational resources. Recent work on parameter-efficient fine-tuning (e.g. LoRA) and 8-bit optimisation (bitsandbytes) narrows the gap between research prototypes and deployable systems.
ShayGPT explores how far a small transformer (20 layers, model dimension 1280, 20 heads, ≈ 1.3 B parameters) can be pushed in a DIY training stack while retaining compatibility with the Hugging Face ecosystem.
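As a rough illustration of that shape and of the bfloat16 / 8-bit-AdamW setup mentioned above, the sketch below builds a GPT-2-style configuration with the stated 20 layers, model dimension 1280, and 20 heads, and attaches bitsandbytes' 8-bit AdamW. The vocabulary size, context length, learning rate, and weight decay are assumptions not given in the text, and ShayGPT's own from-scratch implementation may differ in detail.

```python
# Minimal sketch, not ShayGPT's actual implementation: a GPT-2-style model
# with the stated shape, cast to bfloat16, paired with 8-bit AdamW.
import torch
import bitsandbytes as bnb
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=50_257,   # assumption: GPT-2 BPE vocabulary
    n_positions=1024,    # assumption: context length is not stated
    n_embd=1280,         # model dimension from the text
    n_layer=20,          # transformer layers from the text
    n_head=20,           # attention heads from the text
)
model = GPT2LMHeadModel(config).to(device="cuda", dtype=torch.bfloat16)

# bitsandbytes stores the AdamW moments in 8 bits, shrinking optimiser
# memory relative to fp32 AdamW; lr and weight decay are assumptions.
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=3e-4, weight_decay=0.1)
```

Quantised optimiser state is one of the main levers for fitting a run of this size into a two-GPU memory budget alongside bfloat16 weights and activations.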
A streaming loader feeds the six English corpora sequentially, following the phase schedule below; a loader sketch appears after the table.
Phase | Token share | Corpus | Rationale |
---|---|---|---|
0-14 | 24 % | OSCAR-en | broad web |
15-24 | 18 % | OpenWebText | web |
25-29 | 18 % | Wiki40B | formal |
30-34 | 14 % | GPT-2 dump | continuity |
35-39 | 12 % | MiniPile | variety |
40-∞ | 14 % | BookCorpus | long-form |
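A minimal sketch of such a sequential streaming curriculum, assuming the Hugging Face `datasets` streaming API: the dataset identifiers, the `text` field name, and the per-phase budgets are illustrative stand-ins for the schedule above, and the later phases are omitted for brevity.

```python
# Sketch of a sequential streaming curriculum; identifiers and budgets are
# illustrative, not ShayGPT's exact loader.
from datasets import load_dataset

TOTAL_TOKENS = 50_000_000_000               # ≈ 5 × 10^10 pre-training tokens
CURRICULUM = [                              # (dataset id, config, token share)
    ("oscar", "unshuffled_deduplicated_en", 0.24),
    ("Skylion007/openwebtext", None, 0.18),
    ("wiki40b", "en", 0.18),
    # GPT-2 dump, MiniPile, and BookCorpus phases omitted for brevity
]

def stream_curriculum(tokenizer):
    """Yield tokenised documents phase by phase until each phase's budget is spent."""
    for name, config, share in CURRICULUM:
        budget = int(share * TOTAL_TOKENS)
        stream = load_dataset(name, config, split="train", streaming=True)
        seen = 0
        for example in stream:
            ids = tokenizer(example["text"])["input_ids"]
            seen += len(ids)
            yield ids
            if seen >= budget:
                break
```

Because nothing is materialised on disk, the loader can keep the GPUs fed while the curriculum switches corpora at the phase boundaries.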
Filtering strips HTML markup and URLs, drops short documents, and enforces a ≥ 95 % ASCII-character ratio.
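A sketch of that filter is below; apart from the 95 % ASCII requirement, the concrete rules (the HTML and URL regexes, the minimum document length) are assumptions.

```python
# Hypothetical document filter matching the description above.
import re

HTML_TAG = re.compile(r"<[^>]+>")
URL = re.compile(r"https?://\S+")

def clean_or_drop(doc: str, min_chars: int = 200) -> str | None:
    """Strip HTML tags and URLs, then drop short or non-ASCII-heavy documents."""
    text = URL.sub(" ", HTML_TAG.sub(" ", doc)).strip()
    if len(text) < min_chars:                       # assumed length threshold
        return None
    ascii_ratio = sum(ch.isascii() for ch in text) / len(text)
    return text if ascii_ratio >= 0.95 else None    # ≥ 95 % ASCII from the text
```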
Step | CE loss | PPL |
---|---|---|
0 | 10.2 | 27 k |
100 k | 4.9 | 134 |
250 k | 2.3 | 10.0 |
450 k | 1.4 | 4.1 |
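Perplexity here is simply the exponential of the mean cross-entropy loss in nats, PPL = exp(CE); the snippet below reproduces the right-hand column of the table.

```python
# Sanity check: perplexity is exp(cross-entropy) for losses measured in nats.
import math

for step, ce in [(0, 10.2), (100_000, 4.9), (250_000, 2.3), (450_000, 1.4)]:
    print(f"step {step:>7,}: CE {ce:>5.2f} -> PPL {math.exp(ce):,.1f}")
```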
Metric | Base | + LoRA |
---|---|---|
CE loss | 1.40 | 0.12 |
Win-rate | 21 % | 37 % |
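The instruction-tuning setup can be sketched with the `peft` library, assuming a Hugging Face-compatible base checkpoint; the LoRA rank and α come from the text, while the target modules, dropout, and paths are illustrative.

```python
# Sketch of LoRA (r = 16, alpha = 32) instruction tuning and adapter merging;
# module names and paths are placeholders, not ShayGPT's exact configuration.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("path/to/shaygpt-base")  # placeholder
lora_cfg = LoraConfig(
    r=16,                                  # rank from the text
    lora_alpha=32,                         # alpha from the text
    target_modules=["c_attn", "c_proj"],   # assumption: attention projections
    lora_dropout=0.05,                     # assumption: not stated
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()

# ... instruction-tuning loop over the adapter parameters ...

# Merging folds the low-rank updates into the base weights, so the result
# loads like an ordinary checkpoint (e.g. on CPU or an M-series Mac).
merged = model.merge_and_unload()
merged.save_pretrained("shaygpt-instruct-merged")
```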
ShayGPT confirms that a thoughtfully engineered medium-scale transformer can be: