Training Your First Model from Scratch

Track 2: Practical Guides - Post 2.1 of 6
This is the first post in Track 2, focusing on hands-on training.
Introduction
You've read about the technical details—the Muon optimizer, KV caching, modern architectures. Now it's time to get your hands dirty and train your first language model from scratch.
This guide walks through training a small nanochat model step by step: setting up the environment, monitoring training progress, and evaluating the final model.
It is hands-on and practical. You'll train a working language model and learn how to configure, monitor, and troubleshoot training runs. We start small (a model you can train in minutes on a single GPU) and then scale up.
Prerequisites
Hardware
- Minimum: 1× GPU with 24GB VRAM (e.g., RTX 3090, RTX 4090)
- Recommended: 1× GPU with 80GB VRAM (e.g., A100, H100)
- Optimal: 8× H100 GPUs for the full experience
Software
- Python 3.10+
- CUDA 11.8+ or 12.x
- uv package manager (installs automatically)
Time
- Quick tutorial model: 15-30 minutes
- Full d20 model ($100 tier): 4 hours on 8×H100
Part 1: Environment Setup
Step 1: Clone and Enter Repository
git clone https://github.com/karpathy/nanochat.git
cd nanochat
Step 2: Install Dependencies
nanochat uses uv for fast, reliable dependency management:
# Install uv if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create virtual environment and install dependencies
uv venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
uv pip install -e .
NOTE
What gets installed: PyTorch 2.x with CUDA support, minimal dependencies (numpy, tiktoken, requests), total ~2000 lines in uv.lock (much cleaner than pip!)
Step 3: Download Data
nanochat trains on the FineWeb-Edu-100B dataset. For our first model, we'll download just a few shards:
# Download 10 shards (~550 MB, ~2.5B characters at ~250M chars/shard)
python -m nanochat.dataset -n 10 -w 4
Output:
Downloading shard_00000.parquet...
Successfully downloaded shard_00000.parquet
Downloading shard_00001.parquet...
...
Done! Downloaded: 10/10 shards to /Users/you/nanochat/base_data
Storage breakdown:
- 10 shards: ~550 MB (good for quick experiments)
- 100 shards: ~5.5 GB (good for small models)
- 1823 shards: ~100 GB (full dataset for large models)
Step 4: Train Tokenizer
Before training the model, we need a tokenizer:
python -m scripts.tok_train
What this does:
- Streams through dataset and counts byte-pair frequencies
- Trains BPE with vocab size 32,000
- Saves to tokenizer/tokenizer.pkl (~1 MB)
- Creates token_bytes.pt mapping for bpb evaluation
Output:
Processing sequences from iterator (buffer_size: 8192)
Processed 1,000,000 sequences total, 458,234 unique
Starting BPE training: 31,744 merges to compute
...
Progress: 100% (31744/31744 merges)
Finished training: 31744 merges completed
Saved tokenizer encoding to tokenizer/tokenizer.pkl
Time: ~10 minutes (only need to do once!)
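To make the pair-counting and merging steps concrete, here is a toy sketch of BPE training in pure Python. It is not nanochat's tokenizer code, which is much faster and more involved; the function names `count_pairs`, `merge_pair`, and `train_bpe` are hypothetical and exist only for this illustration.

```python
from collections import Counter

def count_pairs(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))

def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges, starting from the 256 raw byte values."""
    ids = list(text.encode("utf-8"))
    merges = {}
    for step in range(num_merges):
        pairs = count_pairs(ids)
        if not pairs:
            break
        pair = max(pairs, key=pairs.get)  # most frequent adjacent pair
        new_id = 256 + step               # ids 0..255 are reserved for raw bytes
        ids = merge_pair(ids, pair, new_id)
        merges[pair] = new_id
    return merges, ids

merges, ids = train_bpe("aaabdaaabac", num_merges=3)
print(merges)
print(ids)
```

Because the base vocabulary is the 256 byte values, a 32,000-token vocabulary needs 32,000 - 256 = 31,744 merges, which matches the merge count in the log above.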
Part 2: Your First Training Run
Now for the main event—training a model! We'll start with a tiny model to understand the workflow.
Quick Start: The d8 Model
Let's train the smallest practical model—depth 8 (11M parameters):
python scripts/base_train.py --depth=8 --num_iterations=100
What this command does:
- --depth=8: Model with 8 layers (11M params)
- --num_iterations=100: Train for 100 optimization steps (~5 minutes)
Expected output:
╔══════════════════════════════════════════════════════════════╗
║ nanochat ║
║ The best ChatGPT that $100 can buy ║
╚══════════════════════════════════════════════════════════════╝
Vocab size: 32,000
num_layers: 8
model_dim: 512
num_heads: 4
num_kv_heads: 4
Number of parameters: 11,091,456
Estimated FLOPs per token: 8.837e+07
Tokens / micro-batch / rank: 32 x 2048 = 65,536
Total batch size 524,288 => gradient accumulation steps: 8
Using user-provided number of iterations: 100
Total number of training tokens: 52,428,800
Tokens : Params ratio: 4.73
Total training FLOPs estimate: 4.631e+15
step 00000/00100 (0.00%) | loss: 10.389445 | lrm: 1.00 | dt: 1247.32ms | tok/sec: 525,890 | mfu: 38.52 | total time: 0.00m
step 00001/00100 (1.00%) | loss: 9.847221 | lrm: 1.00 | dt: 1189.45ms | tok/sec: 551,234 | mfu: 40.38 | total time: 0.02m
...
step 00100/00100 (100.00%) | loss: 4.125678 | lrm: 0.00 | dt: 1145.67ms | tok/sec: 573,892 | mfu: 42.05 | total time: 1.92m
Peak memory usage: 8,234.56 MiB
Total training time: 1.92m
Minimum validation bpb: 2.8945
Congratulations! You've trained your first language model. Let's break down what just happened.
Part 3: Understanding the Training Output
The Banner
╔══════════════════════════════════════════════════════════════╗
║ nanochat ║
║ The best ChatGPT that $100 can buy ║
╚══════════════════════════════════════════════════════════════╝
Just for fun. :)
Model Configuration
num_layers: 8
model_dim: 512
num_heads: 4
num_kv_heads: 4
Number of parameters: 11,091,456
Key relationships:
- model_dim = depth × 64 (aspect ratio)
- num_heads = ceil(model_dim / 128) (head dim 128)
- num_kv_heads = num_heads (1:1 ratio: every query head has its own KV head, so no MQA/GQA sharing)
For depth=8:
- model_dim = 8 × 64 = 512
- num_heads = ceil(512 / 128) = 4
- Parameters ≈ 6 × model_dim² × num_layers (rule of thumb)
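The relationships above can be sketched as a small helper. The function name `derive_config` and its keyword defaults are assumptions taken from the formulas in this section, not from nanochat's source.

```python
import math

def derive_config(depth, aspect_ratio=64, head_dim=128):
    """Derive model shape from depth using the relationships listed above."""
    model_dim = depth * aspect_ratio
    num_heads = max(1, math.ceil(model_dim / head_dim))
    return {
        "num_layers": depth,
        "model_dim": model_dim,
        "num_heads": num_heads,
        "num_kv_heads": num_heads,  # 1:1 ratio, no KV-head sharing
    }

for depth in (8, 12, 20, 26):
    print(depth, derive_config(depth))
```

For depth=8 this reproduces the values in the training banner: model_dim 512 and 4 heads.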
Batch Size Configuration
Tokens / micro-batch / rank: 32 x 2048 = 65,536
Total batch size 524,288 => gradient accumulation steps: 8
What this means:
- Each GPU processes 32 sequences of 2048 tokens
- Target total batch: 524,288 tokens
- Need 8 gradient accumulation steps to reach target
- Formula:
grad_accum = total_batch / (device_batch × seq_len × world_size)
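As a quick check of this formula, the sketch below computes the accumulation steps for a few setups. The function name `grad_accum_steps` is hypothetical; the divisibility check is an assumption about what a training script would reasonably enforce.

```python
def grad_accum_steps(total_batch_size, device_batch_size, seq_len, world_size):
    """Number of micro-batches each rank accumulates per optimizer step."""
    tokens_per_micro_step = device_batch_size * seq_len * world_size
    if total_batch_size % tokens_per_micro_step != 0:
        raise ValueError("total_batch_size must be a multiple of "
                         "device_batch_size * seq_len * world_size")
    return total_batch_size // tokens_per_micro_step

print(grad_accum_steps(524_288, 32, 2048, world_size=1))  # 8, as in the log above
print(grad_accum_steps(524_288, 16, 2048, world_size=1))  # 16
print(grad_accum_steps(524_288, 32, 2048, world_size=8))  # 1
```

Halving the per-GPU batch doubles the accumulation steps, and adding GPUs reduces them, while the total tokens per optimizer step stay the same.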
Training Horizon
Using user-provided number of iterations: 100
Total number of training tokens: 52,428,800
Tokens : Params ratio: 4.73
Three ways to specify training length:
1. Explicit iterations (what we used):
--num_iterations=100
2. Target FLOPs (for scaling laws experiments):
--target_flops=1e19
3. Data:param ratio (default, Chinchilla-optimal):
--target_param_data_ratio=20 # Default
For our d8 model:
- 11M params × 20 = 220M tokens
- 220M / 524K batch = 420 steps (Chinchilla-optimal)
- We used 100 steps for speed
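The arithmetic above is easy to reproduce. This is a minimal sketch with hypothetical helper names (`steps_for_ratio`, `tokens_to_params_ratio`); it uses the rounded 11M figure from the bullets, so plugging in the exact parameter count gives a slightly different step count.

```python
def steps_for_ratio(num_params, total_batch_size, ratio=20):
    """Optimizer steps needed to see `ratio` tokens per parameter."""
    target_tokens = num_params * ratio
    return round(target_tokens / total_batch_size)

def tokens_to_params_ratio(num_steps, total_batch_size, num_params):
    """Tokens:params ratio reached after `num_steps` optimizer steps."""
    return num_steps * total_batch_size / num_params

print(steps_for_ratio(11_000_000, 524_288))                         # 420
print(round(tokens_to_params_ratio(100, 524_288, 11_091_456), 2))   # 4.73
```

The second line matches the "Tokens : Params ratio" printed by the 100-step run.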
Training Step Output
step 00042/00100 (42.00%) | loss: 5.234567 | lrm: 1.00 | dt: 1145.67ms | tok/sec: 573,892 | mfu: 42.05 | total time: 0.80m
Field breakdown:
| Field | Meaning | Good Values |
|---|---|---|
| step 00042/00100 | Current step / total | - |
| (42.00%) | Progress percentage | - |
| loss: 5.234567 | Training loss (cross-entropy) | Decreasing |
| lrm: 1.00 | Learning rate multiplier | 1.0 during training, 0.0 at end |
| dt: 1145.67ms | Step duration | Lower is better |
| tok/sec: 573,892 | Token throughput | Higher is better |
| mfu: 42.05 | Model FLOPs Utilization (%) | 35-50% typical |
| total time: 0.80m | Cumulative time | - |
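If you want to plot or post-process these fields, a small parser helps. The sketch below assumes the line format shown in this post; the regular expression and the name `parse_step_line` are illustrative and would need adjusting if the script's log format differs.

```python
import re

STEP_PATTERN = re.compile(
    r"step (?P<step>\d+)/(?P<total>\d+) \((?P<pct>[\d.]+)%\) \| "
    r"loss: (?P<loss>\S+) \| lrm: (?P<lrm>[\d.]+) \| dt: (?P<dt_ms>[\d.]+)ms \| "
    r"tok/sec: (?P<tok_per_sec>[\d,]+) \| mfu: (?P<mfu>[\d.]+) \| "
    r"total time: (?P<total_min>[\d.]+)m"
)

def parse_step_line(line):
    """Parse one training log line into a dict, or return None if it does not match."""
    match = STEP_PATTERN.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "step": int(fields["step"]),
        "total": int(fields["total"]),
        "loss": float(fields["loss"]),  # float() also accepts "nan"
        "lrm": float(fields["lrm"]),
        "dt_ms": float(fields["dt_ms"]),
        "tok_per_sec": int(fields["tok_per_sec"].replace(",", "")),
        "mfu": float(fields["mfu"]),
        "total_min": float(fields["total_min"]),
    }

line = ("step 00042/00100 (42.00%) | loss: 5.234567 | lrm: 1.00 | dt: 1145.67ms | "
        "tok/sec: 573,892 | mfu: 42.05 | total time: 0.80m")
print(parse_step_line(line))
```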
MFU (Model FLOPs Utilization):
MFU = actual_flops_per_sec / theoretical_peak_flops
For H100 (bfloat16, dense): theoretical peak = 989 TFLOP/s
Typical MFU values:
- 35-45%: Good (small models)
- 45-55%: Excellent (medium models)
- 55-60%: Outstanding (large models, good kernels)
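The MFU definition can be written out as a short function. This is a sketch: `mfu_percent` is a hypothetical name, and the input numbers in the example are made up for illustration rather than taken from the training log.

```python
def mfu_percent(flops_per_token, tokens_per_sec,
                peak_flops_per_sec=989e12, num_gpus=1):
    """Model FLOPs Utilization as a percentage of theoretical peak.

    peak_flops_per_sec defaults to the H100 bfloat16 figure quoted above.
    """
    achieved_flops_per_sec = flops_per_token * tokens_per_sec
    return 100.0 * achieved_flops_per_sec / (peak_flops_per_sec * num_gpus)

# Hypothetical inputs: 3.5e9 FLOPs per token at 120,000 tokens/sec on one GPU
print(f"{mfu_percent(3.5e9, 120_000):.2f}%")
```

When training on several GPUs, the denominator scales with the number of devices, so pass the aggregate throughput together with `num_gpus`.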
Validation Evaluation
Every 250 steps (by default), the model evaluates on validation data:
Step 00000 | Validation bpb: 4.5678
Step 00250 | Validation bpb: 1.8234
Step 00500 | Validation bpb: 1.6789
Bits per byte (bpb):
- 4.5: Untrained model (high)
- 2.0: Learning something
- 1.5: Decent small model
- 1.0: Good medium model
- 0.6: GPT-4 level (estimated)
See Loss Landscape & Scaling Laws for details on bpb.
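For intuition, bits per byte is the cross-entropy converted from nats per token to bits, then normalized by the number of bytes the tokens cover. The sketch below assumes you already know the mean loss, token count, and byte count of the evaluated text; the function name `bits_per_byte` and the example numbers are hypothetical.

```python
import math

def bits_per_byte(mean_loss_nats, num_tokens, num_bytes):
    """Convert mean per-token cross-entropy (in nats) to bits per byte."""
    total_bits = mean_loss_nats * num_tokens / math.log(2)
    return total_bits / num_bytes

# Hypothetical example: loss of 3.0 nats/token, 1,000 tokens covering 4,000 bytes
print(round(bits_per_byte(3.0, 1_000, 4_000), 3))
```

Because the metric is normalized by bytes rather than tokens, it stays comparable across tokenizers with different vocabulary sizes.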
Final Statistics
Peak memory usage: 8,234.56 MiB
Total training time: 1.92m
Minimum validation bpb: 2.8945
Memory usage:
- d8 model: ~8 GB (fits on consumer GPUs!)
- d20 model: ~32 GB (needs A100/H100)
- d26 model: ~64 GB (needs 80GB GPU or reduce batch size)
Part 4: Training a Real Model (d20)
Now let's train the "standard" nanochat model—depth 20 (83M parameters):
Step 1: Download More Data
# d20 with 20× data ratio needs ~1.66B tokens
# At ~250M chars/shard, need about 80 shards
python -m nanochat.dataset -n 100 -w 8
Step 2: Launch Training
Single GPU:
python scripts/base_train.py --depth=20
Multi-GPU (8× GPUs):
torchrun --standalone --nproc_per_node=8 scripts/base_train.py --depth=20
Training time:
- Single GPU: ~32 hours
- 8× GPUs: ~4 hours
Step 3: Monitor Progress
The script outputs progress regularly. Key things to watch:
1. Loss should decrease smoothly:
step 00000: loss=10.38
step 00100: loss=4.12
step 00500: loss=2.89
step 01000: loss=2.34
step 02000: loss=1.98
step 03000: loss=1.78
2. Validation bpb should improve:
Step 00000 | Validation bpb: 4.4567
Step 00250 | Validation bpb: 2.1234
Step 00500 | Validation bpb: 1.8901
Step 00750 | Validation bpb: 1.7234
Step 01000 | Validation bpb: 1.6123
...
Step 03167 | Validation bpb: 1.4501
3. Model samples (every 2000 steps):
Step 02000 | Sample outputs:
The capital of France is Paris.
The chemical symbol of gold is Au
If yesterday was Friday, then tomorrow will be Sunday
These get better over time!
Step 4: Find Your Checkpoint
After training completes, find your model:
ls -lh base_checkpoints/d20/
Output:
checkpoint_003167.pt # 335 MB - the final model
meta.json # Metadata (config, metrics)
Part 5: Configuration Deep-Dive
nanochat uses a "Poor Man's Configurator"—you can override any training parameter via command line or config file.
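To illustrate the idea, here is a minimal sketch of how such a configurator can work: defaults are plain values, a config file is executed as Python to overwrite them, and `--key=value` flags are applied last. The function name `apply_overrides` and the exact precedence are assumptions for illustration; nanochat's actual implementation may differ in details.

```python
import ast

def apply_overrides(defaults, config_source="", cli_args=()):
    """Return a copy of `defaults` updated from config-file text, then CLI flags."""
    cfg = dict(defaults)

    # 1) Config file: plain Python assignments, executed into a scratch namespace
    namespace = {}
    exec(config_source, {}, namespace)
    for key, value in namespace.items():
        if key not in cfg:
            raise KeyError(f"Unknown config key: {key}")
        cfg[key] = value

    # 2) CLI flags of the form --key=value take precedence over the file
    for arg in cli_args:
        if not arg.startswith("--") or "=" not in arg:
            raise ValueError(f"Expected --key=value, got: {arg}")
        key, raw = arg[2:].split("=", 1)
        if key not in cfg:
            raise KeyError(f"Unknown config key: {key}")
        try:
            cfg[key] = ast.literal_eval(raw)  # numbers, booleans, quoted strings
        except (ValueError, SyntaxError):
            cfg[key] = raw                    # fall back to the raw string
    return cfg

defaults = {"depth": 20, "num_iterations": -1, "device_batch_size": 32,
            "matrix_lr": 0.02, "run": "dummy"}
config_text = 'depth = 12\nnum_iterations = 500\nrun = "d12_fast_lr"\n'
cfg = apply_overrides(defaults, config_text, ["--depth=16"])
print(cfg)
```

In this sketch the CLI flag wins over the config file, so `depth` ends up as 16 while `num_iterations` and `run` come from the file.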
Command-Line Overrides
python scripts/base_train.py \
--depth=20 \
--device_batch_size=16 \
--max_seq_len=1024 \
--matrix_lr=0.01 \
--num_iterations=5000
Common parameters:
| Parameter | Default | Description |
|---|---|---|
| depth | 20 | Model depth (scales everything else) |
| max_seq_len | 2048 | Context length |
| device_batch_size | 32 | Per-GPU batch size |
| total_batch_size | 524288 | Total tokens per step |
| num_iterations | -1 | Explicit step count (-1 = auto) |
| target_param_data_ratio | 20 | Chinchilla ratio |
| matrix_lr | 0.02 | Muon learning rate |
| embedding_lr | 0.2 | AdamW LR for embeddings |
| grad_clip | 1.0 | Gradient clipping (0 = off) |
| eval_every | 250 | Validation frequency |
| run | "dummy" | Wandb run name ("dummy" = no logging) |
Config Files
Create config/my_experiment.py:
# Smaller model for quick testing
depth = 12
num_iterations = 500
device_batch_size = 16
# More aggressive learning rates
matrix_lr = 0.03
embedding_lr = 0.3
# Wandb logging
run = "d12_fast_lr"
Run with:
python scripts/base_train.py config/my_experiment.py
You can combine config files with CLI overrides:
python scripts/base_train.py config/my_experiment.py --depth=16
Memory Management
Out of memory? Reduce device_batch_size:
# Original (32 → 65K tokens per GPU)
python scripts/base_train.py --depth=26
# Reduced (16 → 32K tokens per GPU)
python scripts/base_train.py --depth=26 --device_batch_size=16
# Further reduced (8 → 16K tokens per GPU)
python scripts/base_train.py --depth=26 --device_batch_size=8
The script automatically compensates by increasing gradient accumulation:
- Same total batch size (524K tokens)
- Same final model quality
- Slightly slower (more sequential compute)
Memory usage by model size:
| depth | params | VRAM (batch=32) | VRAM (batch=16) | VRAM (batch=8) |
|---|---|---|---|---|
| 8 | 11M | 8 GB | 6 GB | 5 GB |
| 12 | 30M | 16 GB | 12 GB | 10 GB |
| 16 | 54M | 28 GB | 20 GB | 16 GB |
| 20 | 83M | 42 GB | 32 GB | 26 GB |
| 24 | 118M | 64 GB | 48 GB | 38 GB |
| 26 | 140M | 76 GB | 56 GB | 44 GB |
Part 6: Monitoring with Wandb
Enable Weights & Biases logging for beautiful dashboards:
# Install wandb
uv pip install wandb
# Login (first time only)
wandb login
# Train with logging
python scripts/base_train.py --run=my_experiment_name
What gets logged:
Training metrics (every 100 steps):
- train/loss: Cross-entropy loss
- train/lrm: Learning rate multiplier
- train/dt: Step duration
- train/tok_per_sec: Token throughput
- train/mfu: Model FLOPs utilization
Validation metrics (every 250 steps):
- val/bpb: Bits per byte
CORE metrics (every 2000 steps):
- core_metric: Overall score
- Individual task scores
System metrics:
- total_training_flops: Cumulative FLOPs
- total_training_time: Wall-clock time
Wandb dashboard view:
my_experiment_name
├── train/loss [line chart: decreasing]
├── val/bpb [line chart: decreasing]
├── train/mfu [line chart: stable ~40-50%]
├── core_metric [line chart: increasing]
└── System [GPU utilization, memory, etc.]
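One common way to make logging optional is a no-op stand-in that is used when the run name is "dummy". The sketch below shows that pattern; the class `DummyLogger`, the function `make_logger`, and the project name "nanochat" are assumptions for illustration, and `wandb` is imported only when real logging is requested.

```python
class DummyLogger:
    """No-op stand-in used when logging is disabled."""

    def log(self, metrics, step=None):
        pass

    def finish(self):
        pass

def make_logger(run_name, project="nanochat"):
    """Return a wandb run, or a no-op logger when run_name is "dummy"."""
    if run_name == "dummy":
        return DummyLogger()
    import wandb  # lazy import: the dummy path works without wandb installed
    return wandb.init(project=project, name=run_name)

logger = make_logger("dummy")
logger.log({"train/loss": 5.23, "train/mfu": 42.05}, step=42)
logger.finish()
```

The training loop can then call `logger.log(...)` unconditionally, with no branching on whether logging is enabled.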
Part 7: Testing Your Model
After training, let's test what the model learned!
Quick CLI Test
python scripts/chat_cli.py
Interactive session:
You: The capital of France is
Assistant: Paris. The city is known for the Eiffel Tower and the Louvre Museum.
You: If 2+2=4, then 3+3=
Assistant: 6
You: Why is the sky blue?
Assistant: The sky appears blue because of Rayleigh scattering...
(Your d8 model won't be this good yet—that's from a larger model!)
Web UI
python scripts/chat_web.py
Then visit http://localhost:8000 (or your server's IP if remote).
Evaluation Benchmarks
Run the CORE benchmark evaluation:
python scripts/base_eval.py
This loads the latest checkpoint and evaluates the CORE benchmark (~15 minutes on 8 GPUs).
Output:
Model: base_model (step 3167)
================================================================================
Task , Accuracy , Centered
arc_challenge , 0.2134 , 0.2645
arc_easy , 0.3456 , 0.3789
...
CORE , , 0.2219
Part 8: Common Issues and Solutions
Issue 1: Loss is NaN
step 00042: loss=nan
Causes:
- Learning rate too high
- Numerical instability
- Corrupted data
A defensive check in the training loop can skip a bad batch instead of crashing. This is a sketch; adapt the variable names to your loop. Skipping batches only hides the symptom, so also apply the solutions that follow:
import logging
import torch

# Inside the training loop, wrapped around the forward/backward pass
try:
    with autocast_ctx:
        loss = model(train_inputs, train_targets)
    # A NaN loss indicates training instability; drop this batch
    if torch.isnan(loss):
        logging.warning(f"NaN loss detected at step {step}. Skipping batch.")
        optimizer.zero_grad()
        continue
    loss.backward()
except RuntimeError as e:
    if "out of memory" in str(e):
        logging.error(f"OOM at step {step}. Clearing cache and skipping batch.")
        torch.cuda.empty_cache()
        optimizer.zero_grad()
        continue
    raise  # re-raise anything that is not an OOM error
Solutions:
# Reduce learning rates
python scripts/base_train.py --matrix_lr=0.01 --embedding_lr=0.1
# Enable gradient clipping (should be on by default)
python scripts/base_train.py --grad_clip=1.0
Issue 2: Loss Not Decreasing
step 00000: loss=10.38
step 00100: loss=10.35
step 00200: loss=10.32
Causes:
- Learning rate too low
- Not enough training steps
- Model too small for data
Solutions:
# Increase learning rates
python scripts/base_train.py --matrix_lr=0.03 --embedding_lr=0.3
# Train longer
python scripts/base_train.py --num_iterations=1000
# Use larger model
python scripts/base_train.py --depth=16
Issue 3: OOM (Out of Memory)
torch.cuda.OutOfMemoryError: CUDA out of memory
Solution:
# Reduce batch size (automatically increases grad accumulation)
python scripts/base_train.py --device_batch_size=16
# Or reduce sequence length
python scripts/base_train.py --max_seq_len=1024
# Or use smaller model
python scripts/base_train.py --depth=12
Issue 4: Slow Training
tok/sec: 50,000 (expected: 500,000+)
Causes:
- CPU bottleneck (data loading)
- Inefficient GPU utilization
- Small batch size
Solutions:
# Increase device batch size (if memory allows)
python scripts/base_train.py --device_batch_size=64
# Check data loading isn't bottleneck
# (should see GPU utilization near 100% in nvidia-smi)
Issue 5: Can't Resume Training
nanochat doesn't support mid-training checkpointing by default (keeps code simple). To train longer:
# Option 1: Increase iterations from the start
python scripts/base_train.py --num_iterations=5000
# Option 2: Use mid-training script (covered in later posts)
python scripts/mid_train.py
Part 9: Scaling Up
Scaling to Multiple GPUs
2 GPUs:
torchrun --standalone --nproc_per_node=2 scripts/base_train.py
4 GPUs:
torchrun --standalone --nproc_per_node=4 scripts/base_train.py
8 GPUs:
torchrun --standalone --nproc_per_node=8 scripts/base_train.py
What changes:
- Training time drops roughly in proportion to the number of GPUs
- Memory per GPU stays the same
- Gradient accumulation automatically reduced
- Final model quality identical
Scaling to Larger Models
d12 (30M params) - $30 tier:
torchrun --standalone --nproc_per_node=8 scripts/base_train.py \
--depth=12 \
--run=d12_experiment
d26 (140M params) - $300 tier:
# Need more data
python -m nanochat.dataset -n 450 -w 8
# Train (reduce batch size for memory)
torchrun --standalone --nproc_per_node=8 scripts/base_train.py \
--depth=26 \
--device_batch_size=16 \
--run=d26_experiment
Performance expectations:
| Model | Params | Training Time (8×H100) | Final bpb | CORE | Cost |
|---|---|---|---|---|---|
| d8 | 11M | 30 min | 2.1 | 0.12 | $12 |
| d12 | 30M | 90 min | 1.8 | 0.18 | $36 |
| d16 | 54M | 2.5 hours | 1.6 | 0.25 | $60 |
| d20 | 83M | 4 hours | 1.45 | 0.28 | $96 |
| d26 | 140M | 12 hours | 1.3 | 0.35 | $288 |
(At $24/hr for 8×H100)
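The cost column is simply training hours multiplied by the hourly rate. A minimal sketch, using the table's training times and the $24/hr figure; the helper name `training_cost_usd` is hypothetical, and your provider's pricing may differ.

```python
def training_cost_usd(hours, hourly_rate=24.0):
    """Estimated cost of a run on an 8xH100 node billed per hour."""
    return hours * hourly_rate

runs = [("d8", 0.5), ("d12", 1.5), ("d16", 2.5), ("d20", 4.0), ("d26", 12.0)]
for name, hours in runs:
    print(f"{name}: ${training_cost_usd(hours):.0f}")
```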
Part 10: Next Steps
Congratulations! You've trained your first language model from scratch. Here's what to explore next:
1. Fine-tune for chat:
- See Fine-tuning for Chat (SFT)
- Turn your base model into a conversational assistant
2. Understand the architecture:
- Review Modern Transformer Architecture
- Learn what's inside the model you just trained
3. Experiment with hyperparameters:
- Try different learning rates
- Test longer training runs
- Compare model sizes
4. Train your own tokenizer:
- See Tokenizer Design Choices
- Customize vocabulary for your domain
5. Build custom evaluations:
- See Building Custom Evaluation Tasks
- Test your model on tasks you care about
Conclusion
Training a language model from scratch is no longer magic—it's engineering. With nanochat, you can:
✅ Train models from 11M to 140M+ parameters
✅ Use single GPU or scale to 8× GPUs
✅ Monitor training with comprehensive metrics
✅ Evaluate with standardized benchmarks
✅ Deploy as a chat interface
The key insights:
- Start small: Train d8 (11M) first to verify your setup
- Monitor metrics: Watch loss, bpb, and sample outputs
- Manage memory: Adjust device_batch_size to fit your GPU
- Scale gradually: d8 → d12 → d20 → d26
- Iterate fast: Use config files for experiments
You now have the foundation to train, evaluate, and deploy language models. The next posts will build on this foundation to create more capable systems.
Related Posts
Previous in series:
- Loss Landscape & Scaling Laws - Understand evaluation metrics like bpb and Chinchilla scaling laws
Next in series:
- Fine-tuning for Chat (SFT) - Transform your base model into a conversational assistant
Related posts:
- Modern Transformer Architecture - Deep dive into RoPE, QK normalization, and architectural innovations
- Muon Optimizer Explained - Understanding the optimizer that powers nanochat training
Part of the nanochat Deep-Dive Series • Track 2: Practical Guides
GitHub: nanochat repository
Training Script: scripts/base_train.py
TIP
Experiment notebooks: Due to reader interest, interactive Jupyter notebooks for hands-on experiments are planned. Let us know if you'd like to see them!



