# Training Language Models with Reinforcement Learning on Mathematical Reasoning
A modified version of nanochat trained with reinforcement learning on the DeepMind AQuA-RAT dataset for algebraic reasoning and multiple-choice problem solving.
Quick Start • Dataset • Modifications • Training • Results
This project adapts the nanochat training framework (originally designed for GSM8K numerical reasoning) to work with AQuA-RAT (Algebra Question Answering with Rationales), a dataset of ~97,000 algebraic word problems with multiple-choice answers (A-E) and natural language solution rationales.
| Model | Parameters | Training Time | AQuA-RAT Dev Accuracy |
|---|---|---|---|
| depth-8 | ~60M | 3-4 hours | 30-50% |
| depth-20 | ~561M | 6-8 hours | 40-60% |
nanochat is a minimalist yet complete pipeline for training transformer language models from scratch, created by Andrej Karpathy. It implements the full stack used here: tokenizer training, base pretraining, mid-training, supervised fine-tuning, reinforcement learning, and evaluation.
Original focus: Training on GSM8K (Grade School Math 8K) with free-form numeric answers.
The DeepMind AQuA-RAT dataset contains algebraic reasoning problems in JSON format:
{ "question": "A person is traveling at 20 km/hr and reached his destiny in 2.5 hr then find the distance?", "options": [ "A) 53 km", "B) 55 km", "C) 52 km", "D) 60 km", "E) 50 km" ], "rationale": "The distance that the person traveled = 20 * 2.5 = 50 km. Answer: E", "correct": "E" }
Dataset splits: train, validation, and test (exported as `train.jsonl`, `validation.jsonl`, and `test.jsonl` by `scripts/prepare_aqua.py`).

Key characteristics:

- Multiple-choice answers (A-E)
- Natural language solution rationales
- ~97,000 algebraic word problems at roughly high-school level
| Aspect | GSM8K (Original) | AQuA-RAT (This Project) |
|---|---|---|
| Format | Free-form numeric | Multiple choice (A-E) |
| Answer | Single number | Letter choice |
| Size | 8,500 problems | 97,700 problems |
| Difficulty | Elementary school | High school algebra |
| Rationale | Step-by-step | Natural language |
| Evaluation | Exact match on number | Categorical accuracy |
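The evaluation change is the most consequential: instead of parsing a free-form number and checking exact match, the AQuA pipeline only needs to recover a single letter. A minimal sketch of the two scoring rules (illustrative helper names, not the repo's exact functions):

```python
import re

def score_gsm8k(completion: str, gold: str) -> bool:
    """GSM8K-style scoring: exact match on the final number in the completion."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return bool(numbers) and numbers[-1] == gold

def score_aqua(completion: str, gold_letter: str) -> bool:
    """AQuA-style scoring: categorical match on the declared letter A-E."""
    match = re.search(r"answer\s*[:\-]\s*([A-E])", completion, flags=re.IGNORECASE)
    predicted = match.group(1).upper() if match else None
    return predicted == gold_letter.upper()

print(score_gsm8k("... so the total is 72", "72"))          # True
print(score_aqua("The distance is 50 km. Answer: E", "E"))  # True
```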
To adapt nanochat from GSM8K to AQuA-RAT, we modified the following components:
### 1. Dataset Preparation (`scripts/prepare_aqua.py`)

Created a new file to download and format AQuA-RAT:

- Uses `datasets.load_dataset("deepmind/aqua_rat")` and optionally caps split sizes.
- Emits JSONL files (`train.jsonl`, `validation.jsonl`, `test.jsonl`) compatible with the conversation schema used throughout nanochat.
- Defaults to `~/.cache/nanochat/aqua`, but accepts `--output_dir` overrides so launchers can bundle their own artifact.

```python
# New file: scripts/prepare_aqua.py
def format_example(row):
    options = row["options"]  # e.g. ["A) 53 km", ..., "E) 50 km"]
    # (not shown in the original snippet) derive the letter labels and gold answer
    letters = [opt.split(")")[0].strip().upper() for opt in options]
    correct = row["correct"].strip().upper()
    assistant_content = [
        {"type": "text", "text": row["rationale"].strip()},
        {"type": "text", "text": f"Answer: {correct}"},
    ]
    return {
        "messages": [
            {"role": "user", "content": _render_user_prompt(row["question"], options)},
            {"role": "assistant", "content": assistant_content},
        ],
        "letters": letters,
        "answer_letter": correct,
    }
```
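`_render_user_prompt` is shared with `tasks/aqua.py`. A minimal sketch of what such a renderer could look like, assuming a plain question-plus-lettered-options layout (hypothetical template; the exact wording lives in the repo):

```python
def _render_user_prompt(question: str, options: list[str]) -> str:
    # Hypothetical layout: question, the lettered options, and an explicit
    # instruction to finish with "Answer: <LETTER>".
    lines = [question.strip(), ""]
    lines.extend(opt.strip() for opt in options)
    lines.append("")
    lines.append("Answer with the letter of the correct option (A-E).")
    return "\n".join(lines)
```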
### 2. Task Wrapper (`tasks/aqua.py`)

- Reads from a configurable `data_dir` (or the `AQUA_DATA_DIR` / `NANOCHAT_AQUA_DIR` environment variables) so the task can load the cached JSONL splits offline.
- Reuses `_render_user_prompt` to format the question/options using the common prompt template, and `_extract_letter` to score completions.
- The assistant turn ends with an `Answer: <LETTER>` line for SFT, while `evaluate()` only cares about the letter.

```python
import re

# LETTER_RE (not shown in the original snippet): matches a standalone A-E token.
LETTER_RE = re.compile(r"\b([A-E])\b")

def _extract_letter(text, default=None):
    # Prefer an explicit "Answer: X" declaration, then fall back to the first bare letter.
    answer_match = re.search(r"answer\s*[:\-]\s*([A-E])", text, flags=re.IGNORECASE)
    if answer_match:
        return answer_match.group(1).upper()
    match = LETTER_RE.search(text)
    return match.group(1).upper() if match else default
```
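For instance (behaviour follows directly from the regexes above):

```python
print(_extract_letter("The distance is 50 km. Answer: E"))  # "E"
print(_extract_letter("I would pick (C) because ..."))       # "C" (fallback: first bare letter)
print(_extract_letter("No idea.", default="A"))              # "A" (default when nothing matches)
```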
Key differences from GSM8K: answers are letters (A-E) rather than free-form numbers, and scoring is categorical accuracy on the extracted letter rather than exact numeric match.
### 3. Reinforcement Learning (`scripts/chat_rl.py`)

Modified to support both GSM8K and AQuA-RAT:
Key updates:
- `train_task` / `val_task` now instantiate `AQUA(...)` instead of `GSM8K(...)`.
- Rewards flow through the task's `evaluate()` helper, so any completion containing the correct answer letter is rewarded.
- The evaluation hook is now `run_aqua_eval`, still reporting pass@k accuracy.
- The CLI flags are unchanged (`--run`, `--temperature`, `--max_new_tokens`, …).

### 4. Evaluation Harness (`scripts/chat_eval.py`)

- Registers `'AQUA'` in the task registry so `-a AQUA` just works.
- Multiple-choice scoring goes through `run_categorical_eval`, clamping logits to the candidate letters.

### 5. Pipeline Orchestration (`run_aquarat_small.sh`)

The end-to-end script now stages AQuA at every phase:
```bash
# (Optional) Cache the dataset locally as JSONL
python -m scripts.prepare_aqua --output_dir "$NANOCHAT_BASE_DIR/aqua"

# Mid-training now samples from the AQuA mixture
torchrun -m scripts.mid_train -- --run=demo --num_iterations=200

# SFT stage emphasises AQuA problems
torchrun -m scripts.sft_train -- --run=demo --aqua_train_examples=20000

# RL fine-tuning rewards the correct letter on AQuA-RAT
torchrun -m scripts.chat_rl -- --run=demo --temperature=0.7 --max_new_tokens=64
```
What changed vs upstream nanochat:

- `tasks/aqua.py` loads AQuA-RAT either from Hugging Face or the cached JSONL splits.
- `scripts/mid_train.py` extends the original Reasoning+Chat mixture with an AQuA component.
- `scripts/chat_sft.py` replaces the GSM8K component with AQuA, keeping ARC and the rest of the chat mixture.
- `scripts/chat_rl.py` retools the GRPO loop to sample, reward, and evaluate on AQuA-RAT.
- `scripts/chat_eval.py` registers the new AQuA task so `chat_eval` can report its accuracy alongside the other benchmarks.

### Stage 1: Base Pretraining

What happens: Model learns language from scratch on the FineWeb corpus
```bash
torchrun --nproc_per_node=8 -m scripts.base_train -- --depth=8
```
Duration: 1.5-2 hours on 8x H100
Output: Base checkpoint with general language understanding
Metrics: Validation loss, CORE benchmark scores
### Stage 2: Mid-Training

What happens: Teach conversation format and special tokens
```bash
torchrun --nproc_per_node=8 -m scripts.mid_train
```
Duration: 30 minutes
Output: Conversational checkpoint
Metrics: Format adherence, tool use capability
### Stage 3: Supervised Fine-Tuning (SFT)

What happens: Fine-tune on AQuA-RAT with ground-truth solutions
```bash
torchrun --nproc_per_node=8 -m scripts.sft_train -- \
  --aqua_train_examples=20000 \
  --aqua_val_examples=254
```
Duration: 30 minutes
Output: AQuA-tuned checkpoint
Metrics: Dev set accuracy (categorical)
### Stage 4: Reinforcement Learning

What happens: Policy-gradient learning with the GRPO algorithm
```bash
torchrun --nproc_per_node=1 -m scripts.chat_rl -- \
  --temperature=0.7 \
  --max_new_tokens=64
```
Duration: 30 minutes
Algorithm: Group Relative Policy Optimization (GRPO)
Reward: +1.0 for correct letter, +0.1 for valid letter format
Output: RL-optimized checkpoint
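A minimal sketch of this reward scheme and the group-relative advantage GRPO uses (illustrative code, not the repo's exact implementation; `_extract_letter` is the helper shown earlier, and the advantage normalization is the standard GRPO recipe of centering each rollout's reward against its own group):

```python
import statistics

def reward(completion: str, gold_letter: str) -> float:
    """+1.0 for the correct letter, +0.1 for producing any valid letter, else 0."""
    predicted = _extract_letter(completion)
    if predicted == gold_letter:
        return 1.0
    if predicted is not None:
        return 0.1
    return 0.0

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each sample's reward against its own group of rollouts."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero for constant groups
    return [(r - mean) / std for r in rewards]

# Example: 4 rollouts for the same question whose gold answer is "E"
rollouts = ["Answer: E", "Answer: B", "It is 50 km, so E. Answer: E", "not sure"]
rewards = [reward(c, "E") for c in rollouts]   # [1.0, 0.1, 1.0, 0.0]
print(group_relative_advantages(rewards))
```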
Logged metrics:
- `rl/acc` - Accuracy on training samples
- `rl/mean_reward` - Average reward per generation
- `rl/kl_letter_mean` - KL divergence at the decision point
- `rl/kl_sequence_mean` - Full-sequence KL
- `rl/letter_margin_mean` - Confidence (logit gap)
- `attn/entropy_mean` - Attention mechanism patterns
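As one concrete example, `rl/letter_margin_mean` can be read as the gap between the logit of the top-scoring answer letter and the runner-up at the answer position. A sketch of that computation (assumed tensor shapes and names, not the repo's exact code):

```python
import torch

def letter_margin(letter_logits: torch.Tensor) -> torch.Tensor:
    """Logit gap between the top-scoring answer letter and the runner-up.

    letter_logits: (batch, 5) logits restricted to the tokens for "A".."E"
    at the position where the model emits its answer letter.
    """
    top2 = letter_logits.topk(k=2, dim=-1).values  # (batch, 2), sorted descending
    return top2[:, 0] - top2[:, 1]                 # larger gap = more confident

# Example: one sample strongly prefers "E", another is nearly undecided
logits = torch.tensor([[0.1, 0.2, 0.0, 0.3, 2.5],
                       [1.0, 1.1, 0.9, 1.0, 1.05]])
print(letter_margin(logits))         # per-sample margins
print(letter_margin(logits).mean())  # what a mean-margin metric would average
```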
Clone with submodules so the `rustbpe` tokenizer sources are present:

```bash
git clone --recurse-submodules https://github.com/HarleyCoops/nanochatAquaRat.git
```

For existing clones, run `git submodule update --init --recursive` before building.

On Windows, make sure the Rust toolchain (Cargo and rustup) is on `PATH`:
```powershell
$env:Path += ";$env:USERPROFILE\.cargo\bin"
setx PATH "$env:Path"
setx CARGO_HOME "$env:USERPROFILE\.cargo"
setx RUSTUP_HOME "$env:USERPROFILE\.rustup"
rustup set default-host x86_64-pc-windows-msvc
rustup default stable-x86_64-pc-windows-msvc
cargo --version
rustup --version
```
Then build the `rustbpe` tokenizer extension into the active environment:

```bash
uv run maturin develop
```
Use the automation helper for one-command deployment:
```bash
# Set credentials
export LAMBDA_API_KEY='your-lambda-api-key'
export WANDB_API_KEY='your-wandb-api-key'

# Launch with auto-start
python scripts/launch_lambda_training.py \
  --ssh-key-name your_lambda_ssh_key \
  --instance-type gpu_8x_h100_sxm5 \
  --region us-west-1 \
  --auto-start \
  --inject-env WANDB_API_KEY
```
The script provisions the instance, clones this repository, sets up environment variables, and starts training in a tmux session.
Monitor training:
```bash
# SSH to instance
ssh ubuntu@<INSTANCE_IP>

# Attach to tmux session
tmux attach -t nanochat-train

# Or view logs
tail -f ~/nanochatAquaRat/training.log
```
Spin up on-demand GPUs via Hyperbolic's marketplace API:
```bash
# Set credentials
export HYPERBOLIC_API_KEY='your-hyperbolic-api-key'
export WANDB_API_KEY='your-wandb-api-key'

# Launch with auto-start
python scripts/launch_hyperbolic_training.py \
  --gpu-count 1 \
  --region us-east \
  --auto-start \
  --inject-env WANDB_API_KEY
```
The launcher discovers an available node (respecting --region, --supplier, or --max-price filters), provisions it, copies your .env, and optionally starts training in tmux. Use --list to inspect available marketplace inventory without launching.
For step-by-step control, see LAMBDA_MANUAL_SETUP.md.
Quick summary:
1. `ssh ubuntu@<IP>`
2. `git clone <repo-url> && cd nanochatAquaRat`
3. `echo "WANDB_API_KEY=..." > .env`
4. `bash run_aquarat_small.sh`

For marketplace nodes without automation access, follow this lightweight bootstrap:
SSH to the node using the connection details from the marketplace dashboard (note the `-p <port>` and username), then install base packages and clone the repo:

```bash
sudo apt-get update
sudo apt-get install -y git curl unzip build-essential python3 python3-venv tmux
git clone https://github.com/HarleyCoops/nanochatAquaRat.git
cd nanochatAquaRat
```
Create a `.env` with the required keys (WANDB, GCS bucket, AQUA path) and upload your GCP service-account JSON to the VM, e.g. `scp -P <port> C:\path\to\credentials.json user@<ip>:/home/user/gcp-sa.json`. Then install the Python and Rust tooling and build the environment:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y --default-toolchain stable
source "$HOME/.cargo/env"
export PATH="$HOME/.local/bin:$PATH"
uv venv && uv sync --extra gpu
source .venv/bin/activate
uv run maturin develop
uv run python -m scripts.tok_train
```
Install the gcloud CLI and pull the cached AQuA dataset from GCS:

```bash
curl -sSL https://sdk.cloud.google.com | bash
source "$HOME/.bashrc"
gcloud auth login --no-launch-browser
gcloud config set project <your-project-id>
gcloud storage cp gs://nanochat-aquarat-datasets/datasets/aqua/aqua_cache.zip .
unzip -o aqua_cache.zip -d ~/aqua_cache
export AQUA_DATA_DIR=$HOME/aqua_cache
```
Fetch the identity conversations and eval bundle used by the nanochat pipeline:

```bash
cd ~/.cache/nanochat
curl -L -o identity_conversations.jsonl https://karpathy-public.s3.us-west-2.amazonaws.com/identity_conversations.jsonl
curl -L -o eval_bundle.zip https://karpathy-public.s3.us-west-2.amazonaws.com/eval_bundle.zip
unzip -q eval_bundle.zip && rm eval_bundle.zip
cd ~/nanochatAquaRat
```
Finally, launch training with `CUDA_VISIBLE_DEVICES=0 bash run_aquarat_lite.sh` or the full `run_aquarat_small.sh`.

A simplified launcher is also available:
```bash
export LAMBDA_API_KEY='your-key'
export WANDB_API_KEY='your-key'

python launch_lambda.py \
  --instance-type gpu_8x_h100_sxm5 \
  --region us-west-1
```
See QUICKSTART.md for details.
```bash
# Setup environment
cp .env.template .env
# Edit .env with your WANDB_API_KEY

# Run training
bash run_aquarat_small.sh
```
Requirements:
Keep the GitHub docs mirrored with the Hugging Face model card:
1. Edit `README.md` (and any linked docs) as usual.
2. Run the sync script locally:

   ```bash
   uv run python -m scripts.sync_hf_repo --no-push
   ```

   This copies every README dependency into `hf_release/`. The script warns if a referenced file such as `LICENSE` is missing.
3. Push the refreshed card to the Hub:

   ```bash
   uv run python -m scripts.sync_hf_repo --repo-id HarleyCooper/nanochatAquaRat
   ```

   The command requires prior `huggingface-cli login` (or an `HF_TOKEN` env var). Use `--dry-run` to review operations without copying or uploading.

```
nanochatAquaRat/
├── nanochat/…                         # Vendored upstream nanochat package
├── scripts/
│   ├── base_train.py                  # Base pretraining stage
│   ├── mid_train.py                   # Mid-training (now includes AQuA)
│   ├── chat_sft.py                    # Chat SFT pipeline
│   ├── sft_train.py                   # Shim so `-m scripts.sft_train` still works
│   ├── chat_rl.py                     # Reinforcement learning on AQuA-RAT
│   ├── chat_eval.py                   # Evaluation harness (adds AQuA task)
│   ├── prepare_aqua.py                # AQuA-RAT JSONL exporter
│   ├── launch_lambda_training.py      # Lambda Labs automation
│   ├── launch_hyperbolic_training.py  # Hyperbolic Labs automation
│   └── upload_to_gcs.sh               # Artifact helper
├── tasks/
│   ├── aqua.py                        # AQuA-RAT task implementation
│   ├── arc.py / gsm8k.py / mmlu.py    # Other reasoning tasks
│   └── …
├── run_aquarat_small.sh               # End-to-end orchestration
├── pyproject.toml / uv.lock           # Environment definitions
└── README.md
```
| File | Type | Description |
|---|---|---|
| `tasks/aqua.py` | NEW | Conversation + evaluation wrapper for AQuA-RAT |
| `scripts/prepare_aqua.py` | NEW | Materializes train/validation/test JSONL splits for offline use |
| `scripts/mid_train.py` | MODIFIED | Adds AQuA to the mid-training mixture |
| `scripts/chat_sft.py` | MODIFIED | SFT mixture now includes AQuA controls |
| `scripts/sft_train.py` | NEW | Thin compatibility shim around `chat_sft` |
| `scripts/chat_rl.py` | MODIFIED | RL loop retargeted from GSM8K to AQuA-RAT |
| `scripts/chat_eval.py` | MODIFIED | Registers AQuA for categorical evaluation |
| `run_aquarat_small.sh` | MODIFIED | Pipeline glue aligned with AQuA staging |
| `scripts/launch_hyperbolic_training.py` | NEW | Hyperbolic Labs automation helper |
| `launch_lambda.py` / `scripts/launch_lambda_training.py` | EXISTING | Lambda Labs support retained |
All metrics stream to Weights & Biases in real time:
Training Metrics:
RL Metrics:
Interpretability:
Example W&B dashboard:
```
rl/acc                ██████████ 0.45
rl/kl_letter_mean     ██████████ 0.12
rl/letter_margin_mean ██████████ 2.34
attn/entropy_mean     ██████████ 3.21
```
| Depth | Parameters | Training Time | Best Instance Type | Estimated Cost |
|---|---|---|---|---|
| 8 | ~60M | 3-4 hours | 1-2x A100 | ~$18-35 |
| 12 | ~180M | 4-5 hours | 4x A100 | ~$35-45 |
| 20 | ~561M | 6-8 hours | 8x H100 | ~$144-192 |
| 26 | ~1.1B | 10-12 hours | 8x H100 | ~$240-288 |
To change model depth, edit the `--depth` parameter in `run_aquarat_small.sh`.
After SFT (before RL):
After RL:
Lambda Labs pricing (8x H100 SXM5 @ ~$24/hour):
| Model | Training Time | Total Cost |
|---|---|---|
| depth-8 (60M) | 3-4 hours | ~$96 |
| depth-20 (561M) | 6-8 hours | ~$192 |
Budget options:
This project is based on the nanochat framework. For issues specific to:
This project inherits the license from the base nanochat project.