ShayGPT is a 1.3 B-parameter autoregressive language model whose entire training pipeline (data ingestion, optimisation, LoRA-based adapter fine-tuning, and model merging) was implemented from scratch.
Pre-training was carried out for approximately 5 × 10^10 tokens (≈ 450,000 optimisation steps) on two NVIDIA H200 SXM GPUs (188 GB total HBM3) using bfloat16 precision and 8-bit AdamW.
A curriculum of six English corpora (OSCAR, OpenWebText, Wiki40B, GPT-2 dumps, MiniPile, and BookCorpus), delivered by a streaming DataLoader, kept GPU utilisation above 98 %.
The model reaches a cross-entropy loss of 1.40 (perplexity ≈ 4.1) at the end of pre-training and supports parameter-efficient instruction tuning via LoRA (r = 16, α = 32). All weights, whether merged or adapter-only, can be loaded for inference on CPU-only hosts, including Apple M-series Macs, demonstrating an end-to-end lightweight workflow.
Large language models (LLMs) normally require vast computational resources. Recent work on parameter-efficient fine-tuning (e.g. LoRA) and 8-bit optimisation (bitsandbytes) narrows the gap between research prototypes and deployable systems.
ShayGPT explores how far a small transformer (20 layers, model dimension 1280, 20 heads, ≈ 1.3 B parameters) can be pushed in a DIY training stack while retaining compatibility with the Hugging Face ecosystem.
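As a rough illustration of that shape and of the bfloat16 / 8-bit-AdamW setup mentioned above, the sketch below builds a GPT-2-style configuration with the stated 20 layers, model dimension 1280, and 20 heads, and attaches bitsandbytes' 8-bit AdamW. The vocabulary size, context length, learning rate, and weight decay are assumptions not given in the text, and ShayGPT's own from-scratch implementation may differ in detail.

```python
# Minimal sketch, not ShayGPT's actual implementation: a GPT-2-style model
# with the stated shape, cast to bfloat16, paired with 8-bit AdamW.
import torch
import bitsandbytes as bnb
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=50_257,   # assumption: GPT-2 BPE vocabulary
    n_positions=1024,    # assumption: context length is not stated
    n_embd=1280,         # model dimension from the text
    n_layer=20,          # transformer layers from the text
    n_head=20,           # attention heads from the text
)
model = GPT2LMHeadModel(config).to(device="cuda", dtype=torch.bfloat16)

# bitsandbytes stores the AdamW moments in 8 bits, shrinking optimiser
# memory relative to fp32 AdamW; lr and weight decay are assumptions.
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=3e-4, weight_decay=0.1)
```

Quantised optimiser state is one of the main levers for fitting a run of this size into a two-GPU memory budget alongside bfloat16 weights and activations.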
A streaming loader feeds the six English corpora sequentially, following the phase schedule below; a loader sketch appears after the table.
Phase | Token share | Corpus | Rationale |
---|---|---|---|
0-14 | 24 % | OSCAR-en | broad web |
15-24 | 18 % | OpenWebText | web |
25-29 | 18 % | Wiki40B | formal |
30-34 | 14 % | GPT-2 dump | continuity |
35-39 | 12 % | MiniPile | variety |
40-∞ | 14 % | BookCorpus | long-form |
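A minimal sketch of such a sequential streaming curriculum, assuming the Hugging Face `datasets` streaming API: the dataset identifiers, the `text` field name, and the per-phase budgets are illustrative stand-ins for the schedule above, and the later phases are omitted for brevity.

```python
# Sketch of a sequential streaming curriculum; identifiers and budgets are
# illustrative, not ShayGPT's exact loader.
from datasets import load_dataset

TOTAL_TOKENS = 50_000_000_000               # ≈ 5 × 10^10 pre-training tokens
CURRICULUM = [                              # (dataset id, config, token share)
    ("oscar", "unshuffled_deduplicated_en", 0.24),
    ("Skylion007/openwebtext", None, 0.18),
    ("wiki40b", "en", 0.18),
    # GPT-2 dump, MiniPile, and BookCorpus phases omitted for brevity
]

def stream_curriculum(tokenizer):
    """Yield tokenised documents phase by phase until each phase's budget is spent."""
    for name, config, share in CURRICULUM:
        budget = int(share * TOTAL_TOKENS)
        stream = load_dataset(name, config, split="train", streaming=True)
        seen = 0
        for example in stream:
            ids = tokenizer(example["text"])["input_ids"]
            seen += len(ids)
            yield ids
            if seen >= budget:
                break
```

Because nothing is materialised on disk, the loader can keep the GPUs fed while the curriculum switches corpora at the phase boundaries.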
Filtering strips HTML markup and URLs, drops short documents, and enforces a ≥ 95 % ASCII-character ratio.
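A sketch of that filter is below; apart from the 95 % ASCII requirement, the concrete rules (the HTML and URL regexes, the minimum document length) are assumptions.

```python
# Hypothetical document filter matching the description above.
import re

HTML_TAG = re.compile(r"<[^>]+>")
URL = re.compile(r"https?://\S+")

def clean_or_drop(doc: str, min_chars: int = 200) -> str | None:
    """Strip HTML tags and URLs, then drop short or non-ASCII-heavy documents."""
    text = URL.sub(" ", HTML_TAG.sub(" ", doc)).strip()
    if len(text) < min_chars:                       # assumed length threshold
        return None
    ascii_ratio = sum(ch.isascii() for ch in text) / len(text)
    return text if ascii_ratio >= 0.95 else None    # ≥ 95 % ASCII from the text
```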
Step | CE loss | PPL |
---|---|---|
0 | 10.2 | 27 k |
100 k | 4.9 | 134 |
250 k | 2.3 | 10.0 |
450 k | 1.4 | 4.1 |
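Perplexity here is simply the exponential of the mean cross-entropy loss in nats, PPL = exp(CE); the snippet below reproduces the right-hand column of the table.

```python
# Sanity check: perplexity is exp(cross-entropy) for losses measured in nats.
import math

for step, ce in [(0, 10.2), (100_000, 4.9), (250_000, 2.3), (450_000, 1.4)]:
    print(f"step {step:>7,}: CE {ce:>5.2f} -> PPL {math.exp(ce):,.1f}")
```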
Metric | Base | + LoRA |
---|---|---|
CE loss | 1.40 | 0.12 |
Win-rate | 21 % | 37 % |
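The instruction-tuning setup can be sketched with the `peft` library, assuming a Hugging Face-compatible base checkpoint; the LoRA rank and α come from the text, while the target modules, dropout, and paths are illustrative.

```python
# Sketch of LoRA (r = 16, alpha = 32) instruction tuning and adapter merging;
# module names and paths are placeholders, not ShayGPT's exact configuration.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("path/to/shaygpt-base")  # placeholder
lora_cfg = LoraConfig(
    r=16,                                  # rank from the text
    lora_alpha=32,                         # alpha from the text
    target_modules=["c_attn", "c_proj"],   # assumption: attention projections
    lora_dropout=0.05,                     # assumption: not stated
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()

# ... instruction-tuning loop over the adapter parameters ...

# Merging folds the low-rank updates into the base weights, so the result
# loads like an ordinary checkpoint (e.g. on CPU or an M-series Mac).
merged = model.merge_and_unload()
merged.save_pretrained("shaygpt-instruct-merged")
```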
ShayGPT confirms that a thoughtfully engineered medium-scale transformer can be: