Training Your First Model from Scratch

Track 2: Practical Guides - Post 2.1 of 6
This is the first post in Track 2, focusing on hands-on training.

You can train a language model in 15 minutes on one GPU

I've run these exact steps dozens of times while validating different model configurations. The setup actually works—no hidden dependencies, no broken imports. From git clone to generating text in 15 minutes. That's how accessible language model training has become.

TL;DR: Clone nanochat, run one command, watch the loss drop. Start small with tutorial models, then scale to the d20 model (~561M params) which trains in 4 hours on 8×H100s for $100.

Your first training run will probably fail. Not because the code is broken—but because of environment issues you didn't anticipate. A colleague spent three hours debugging "CUDA out of memory" on a 24GB card before realizing their system was running Chrome in the background, eating 4GB of VRAM. Another researcher's run crashed after 90 minutes because they forgot to use screen and their SSH connection dropped. The fix for both: a 5-minute sanity check. This post front-loads those lessons so your first run succeeds.

You've read about the technical details—the Muon optimizer, KV caching, modern architectures. Now it's time to get your hands dirty and train your first language model from scratch.

📈 Understanding Training Metrics

• Loss: Cross-entropy loss on training data (lower is better)
• Perplexity: e^loss, measures model uncertainty (lower is better)
• Learning Rate: Warmup then cosine decay for stable training
• Val Loss: Validation loss - watch for overfitting if it increases
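
To make the loss and perplexity numbers concrete, here is a small sketch. It is not part of nanochat and the function names are illustrative. It assumes the loss is reported in nats, and relies on the fact that a model guessing uniformly over a vocabulary of size V has a cross-entropy of ln(V).

```python
import math


def perplexity(loss_nats: float) -> float:
    """Perplexity is the exponential of the cross-entropy loss (in nats)."""
    return math.exp(loss_nats)


def uniform_guess_loss(vocab_size: int) -> float:
    """Loss of a model that assigns equal probability to every token."""
    return math.log(vocab_size)


if __name__ == "__main__":
    start = uniform_guess_loss(32_000)  # vocab size used later in this post
    print(f"expected loss at step 0: {start:.2f}")
    print(f"perplexity at that loss: {perplexity(start):.0f}")
```

A freshly initialized model should start close to this value. A first-step loss far above it is worth investigating before you spend more compute.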

This is hands-on and practical. You'll train a working language model and learn how to configure, monitor, and troubleshoot training runs. Starting small (a model you can train in minutes on a single GPU), then scaling up.

You need one GPU with 24GB VRAM to start

Hardware

Minimum: 1× GPU with 24GB VRAM (e.g., RTX 3090, RTX 4090)
Recommended: 1× GPU with 80GB VRAM (e.g., A100, H100)
Optimal: 8× H100 GPUs for the full experience

Software

Python 3.10+
CUDA 11.8+ or 12.x
uv package manager (installed in Step 2 below)

Time

Quick tutorial model: 15-30 minutes
Full d20 model ($100 tier): 4 hours on 8×H100

Part 1: Environment setup takes 5 minutes with uv

Start with git clone, not pip install. Cloning ensures you get the exact tested versions, training scripts, and configs that actually work together.

Step 1: Clone and Enter Repository

git clone https://github.com/karpathy/nanochat.git
cd nanochat

Step 2: Install Dependencies

nanochat uses uv for fast, reliable dependency management:

# Install uv if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | sh
 
# Create virtual environment and install dependencies
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv pip install -e .

NOTE

What gets installed: PyTorch 2.x with CUDA support, minimal dependencies (numpy, tiktoken, requests), total ~2000 lines in uv.lock (much cleaner than pip!)

Step 3: Download Data

nanochat trains on the FineWeb-Edu-100B dataset. For our first model, we'll download just a few shards:

# Download 10 shards (~550 MB; at ~250M chars/shard that is ~2.5B characters, roughly 0.5B tokens)
python -m nanochat.dataset -n 10 -w 4

Output:

Downloading shard_00000.parquet...
Successfully downloaded shard_00000.parquet
Downloading shard_00001.parquet...
...
Done! Downloaded: 10/10 shards to /Users/you/nanochat/base_data

Storage breakdown:

10 shards: ~550 MB (good for quick experiments)
100 shards: ~5.5 GB (good for small models)
1823 shards: ~100 GB (full dataset for large models)

I've seen developers skip straight to 100+ shards on their first run, then wonder why training takes forever. Here's the pattern that works: train on 10 shards first. Get to a working model in 15 minutes. Validate your entire pipeline—data loading, checkpointing, evaluation. Then scale data. A broken pipeline with 100GB of data is still broken. A working pipeline with 500MB is ready to scale.

Step 4: Train Tokenizer

Before training the model, we need a tokenizer:

python -m scripts.tok_train

What this does:

Streams through dataset and counts byte-pair frequencies
Trains BPE with vocab size 32,000
Saves to tokenizer/tokenizer.pkl (~1 MB)
Creates token_bytes.pt mapping for bpb evaluation
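
The counting-and-merging idea behind BPE can be sketched in a few lines. This is a toy illustration, not nanochat's tokenizer: the real one is an optimized implementation, and the function names here are hypothetical. It starts from the 256 raw byte values, so a vocabulary of size V needs V - 256 merges, which is why a 32,000-token vocabulary reports 31,744 merges.

```python
from collections import Counter


def most_frequent_pair(ids):
    """Return the most common adjacent pair of token ids, or None if none exist."""
    pairs = Counter(zip(ids, ids[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, vocab_size):
    """Learn (vocab_size - 256) merges on top of the 256 byte values."""
    ids = list(text.encode("utf-8"))
    merges = {}
    for new_id in range(256, vocab_size):
        pair = most_frequent_pair(ids)
        if pair is None:  # nothing left to merge
            break
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges, ids


if __name__ == "__main__":
    merges, ids = train_bpe("aaabdaaabac", vocab_size=259)
    print(f"{len(merges)} merges, {len(ids)} tokens after merging")
```

Each merge shortens the sequence, which is the compression that lets the model see more text per 2048-token context.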

Output:

Processing sequences from iterator (buffer_size: 8192)
Processed 1,000,000 sequences total, 458,234 unique
Starting BPE training: 31,744 merges to compute
...
Progress: 100% (31744/31744 merges)
Finished training: 31744 merges completed
Saved tokenizer encoding to tokenizer/tokenizer.pkl

Time: ~10 minutes (only need to do once!)

Part 2: One command trains a model—complexity hides inside

Now for the main event—training a model! We'll start with a tiny model to understand the workflow.

Quick Start: The d8 Model

Let's train the smallest practical model—depth 8 (11M parameters):

python scripts/base_train.py --depth=8 --num_iterations=100

What this command does:

--depth=8: Model with 8 layers (11M params)
--num_iterations=100: Train for 100 optimization steps (~5 minutes)

Expected output:

╔══════════════════════════════════════════════════════════════╗
║                         nanochat                             ║
║           The best ChatGPT that $100 can buy                 ║
╚══════════════════════════════════════════════════════════════╝

Vocab size: 32,000
num_layers: 8
model_dim: 512
num_heads: 4
num_kv_heads: 4
Number of parameters: 11,091,456
Estimated FLOPs per token: 8.837e+07

Tokens / micro-batch / rank: 32 x 2048 = 65,536
Total batch size 524,288 => gradient accumulation steps: 8

Using user-provided number of iterations: 100
Total number of training tokens: 52,428,800
Tokens : Params ratio: 4.73
Total training FLOPs estimate: 4.631e+15

step 00000/00100 (0.00%) | loss: 10.389445 | lrm: 1.00 | dt: 1247.32ms | tok/sec: 525,890 | mfu: 38.52 | total time: 0.00m
step 00001/00100 (1.00%) | loss: 9.847221 | lrm: 1.00 | dt: 1189.45ms | tok/sec: 551,234 | mfu: 40.38 | total time: 0.02m
...
step 00100/00100 (100.00%) | loss: 4.125678 | lrm: 0.00 | dt: 1145.67ms | tok/sec: 573,892 | mfu: 42.05 | total time: 1.92m

Peak memory usage: 8,234.56 MiB
Total training time: 1.92m
Minimum validation bpb: 2.8945

Congratulations! You've trained your first language model. Let's break down what just happened.

Part 3: These numbers reveal if training is working or failing

The Banner

╔══════════════════════════════════════════════════════════════╗
║                         nanochat                             ║
║           The best ChatGPT that $100 can buy                 ║
╚══════════════════════════════════════════════════════════════╝

Just for fun. :)

Model Configuration

num_layers: 8
model_dim: 512
num_heads: 4
num_kv_heads: 4
Number of parameters: 11,091,456

Key relationships:

model_dim = depth × 64 (aspect ratio)
num_heads = ceil(model_dim / 128) (head dim 128)
num_kv_heads = num_heads (1:1 ratio, i.e. standard multi-head attention with no key/value sharing)

For depth=8:

model_dim = 8 × 64 = 512
num_heads = ceil(512 / 128) = 4
Parameters ≈ 6 × model_dim² × num_layers (rule of thumb)
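
The shape relationships above can be written as a small helper. This is a sketch under the stated rules (aspect ratio 64, head dimension 128), not code copied from nanochat; `ModelShape` and `shape_from_depth` are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    depth: int
    model_dim: int
    num_heads: int
    num_kv_heads: int


def shape_from_depth(depth: int, aspect_ratio: int = 64, head_dim: int = 128) -> ModelShape:
    """Derive the model shape from a single depth knob."""
    model_dim = depth * aspect_ratio
    num_heads = max(1, math.ceil(model_dim / head_dim))
    # 1:1 ratio: every query head has its own key/value head
    return ModelShape(depth, model_dim, num_heads, num_kv_heads=num_heads)


if __name__ == "__main__":
    for d in (8, 12, 20):
        print(shape_from_depth(d))
```

Tying everything to depth is what lets a single `--depth` flag scale the whole model.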

Batch Size Configuration

Tokens / micro-batch / rank: 32 x 2048 = 65,536
Total batch size 524,288 => gradient accumulation steps: 8

What this means:

Each GPU processes 32 sequences of 2048 tokens
Target total batch: 524,288 tokens
Need 8 gradient accumulation steps to reach target
Formula: grad_accum = total_batch / (device_batch × seq_len × world_size)

In practice, gradient accumulation lets you match large-scale experiments on modest hardware. A single 24GB GPU can simulate 8-GPU training—you trade wall-clock time for capital expenditure.
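
The formula above is easy to check by hand. The helper below is an illustrative sketch (the function name is an assumption, not a nanochat API). It requires the target batch to divide evenly, since a fractional accumulation step has no meaning.

```python
def grad_accum_steps(total_batch_size: int, device_batch_size: int,
                     max_seq_len: int, world_size: int = 1) -> int:
    """Micro-steps per optimizer step needed to reach the target token batch."""
    tokens_per_micro_step = device_batch_size * max_seq_len * world_size
    if total_batch_size % tokens_per_micro_step != 0:
        raise ValueError(
            f"total_batch_size={total_batch_size} is not a multiple of "
            f"{tokens_per_micro_step} tokens per micro-step"
        )
    return total_batch_size // tokens_per_micro_step


if __name__ == "__main__":
    # One GPU, 32 sequences of 2048 tokens, as in the log above
    print(grad_accum_steps(524_288, 32, 2048, world_size=1))
    # Eight GPUs with the same per-GPU batch
    print(grad_accum_steps(524_288, 32, 2048, world_size=8))
```

Halving `device_batch_size` doubles the accumulation steps, so the optimizer still sees the same 524,288 tokens per update.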

Training Horizon

Using user-provided number of iterations: 100
Total number of training tokens: 52,428,800
Tokens : Params ratio: 4.73

Three ways to specify training length:

Explicit iterations (what we used):
```
--num_iterations=100
```
Target FLOPs (for scaling laws experiments):
```
--target_flops=1e19
```
Data:param ratio (default, Chinchilla-optimal):
```
--target_param_data_ratio=20  # Default
```

For our d8 model:

11M params × 20 = 220M tokens
220M / 524K batch = 420 steps (Chinchilla-optimal)
We used 100 steps for speed
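
The three ways of setting the training length reduce to simple arithmetic. The sketch below reproduces the numbers from the log; the function names are hypothetical, and the FLOPs-based formula is an assumption about how such a target would be converted to steps.

```python
TOTAL_BATCH_SIZE = 524_288  # tokens per optimizer step


def tokens_for_steps(num_iterations: int, total_batch_size: int = TOTAL_BATCH_SIZE) -> int:
    """Explicit iterations: tokens seen is steps times batch size."""
    return num_iterations * total_batch_size


def steps_for_ratio(num_params: int, ratio: int = 20,
                    total_batch_size: int = TOTAL_BATCH_SIZE) -> int:
    """Data:param ratio: train on ratio * params tokens."""
    return (num_params * ratio) // total_batch_size


def steps_for_flops(target_flops: float, flops_per_token: float,
                    total_batch_size: int = TOTAL_BATCH_SIZE) -> int:
    """Target FLOPs: divide the budget by the FLOPs spent per step."""
    return round(target_flops / (flops_per_token * total_batch_size))


if __name__ == "__main__":
    params = 11_091_456  # d8 parameter count from the log
    tokens = tokens_for_steps(100)
    print(f"100 steps = {tokens:,} tokens, ratio {tokens / params:.2f}")
    print(f"20x ratio = {steps_for_ratio(params)} steps")
```

With the exact parameter count the 20× ratio works out to 423 steps; the 420 quoted above comes from rounding to 11M parameters first.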

Training Step Output

step 00042/00100 (42.00%) | loss: 5.234567 | lrm: 1.00 | dt: 1145.67ms | tok/sec: 573,892 | mfu: 42.05 | total time: 0.80m

Field breakdown:

| Field | Meaning | Good Values |
| --- | --- | --- |
| `step 00042/00100` | Current step / total | - |
| `(42.00%)` | Progress percentage | - |
| `loss: 5.234567` | Training loss (cross-entropy) | Decreasing |
| `lrm: 1.00` | Learning rate multiplier | 1.0 during training, 0.0 at end |
| `dt: 1145.67ms` | Step duration | Lower is better |
| `tok/sec: 573,892` | Token throughput | Higher is better |
| `mfu: 42.05` | Model FLOPs Utilization (%) | 35-50% typical |
| `total time: 0.80m` | Cumulative time | - |

MFU (Model FLOPs Utilization):

MFU = actual_flops_per_sec / theoretical_peak_flops

For H100 (bfloat16): theoretical peak = 989 TFLOPs/s

Typical MFU values:

35-45%: Good (small models)
45-55%: Excellent (medium models)
55-60%: Outstanding (large models, good kernels)

If MFU drops below 30%, something's wrong—check batch size, data loading, or GPU utilization. Higher MFU = more value per GPU-hour.
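
As a rough sketch, MFU can be computed from the FLOPs-per-token estimate and the measured throughput. The helper name and the example inputs below are hypothetical, chosen only to illustrate the arithmetic; the peak figure is the H100 bfloat16 number quoted above, and multi-GPU runs divide by the combined peak.

```python
H100_BF16_PEAK_FLOPS = 989e12  # per GPU, as quoted above


def mfu_percent(flops_per_token: float, tokens_per_sec: float, num_gpus: int = 1,
                peak_flops_per_gpu: float = H100_BF16_PEAK_FLOPS) -> float:
    """Achieved FLOPs per second as a percentage of the hardware peak."""
    achieved_flops_per_sec = flops_per_token * tokens_per_sec
    return 100.0 * achieved_flops_per_sec / (peak_flops_per_gpu * num_gpus)


if __name__ == "__main__":
    # Hypothetical inputs, not measurements from a real run
    print(f"{mfu_percent(flops_per_token=3.5e9, tokens_per_sec=120_000):.1f}%")
```

On a GPU other than an H100, pass that card's own peak as `peak_flops_per_gpu`; comparing against the wrong peak makes the percentage meaningless.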

Validation Evaluation

Every 250 steps (by default), the model evaluates on validation data:

Step 00000 | Validation bpb: 4.5678
Step 00250 | Validation bpb: 1.8234
Step 00500 | Validation bpb: 1.6789

Bits per byte (bpb):

4.5: Untrained model (high)
2.0: Learning something
1.5: Decent small model
1.0: Good medium model
0.6: GPT-4 level (estimated)

For tracking your progress, this means: bpb is your single best metric during pretraining. If it stops decreasing, you've either exhausted your data or hit the model's capacity limit.
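
Bits per byte converts the summed cross-entropy loss from nats into bits and normalizes by the number of bytes of text, not the number of tokens. That makes it comparable across tokenizers with different vocabularies. A minimal sketch follows; the function names are illustrative, not nanochat's API.

```python
import math


def bits_per_byte(total_loss_nats: float, total_bytes: int) -> float:
    """Summed token losses (nats) divided by ln(2) and the bytes they cover."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return total_loss_nats / (math.log(2) * total_bytes)


def bpb_from_mean_loss(mean_loss_nats: float, bytes_per_token: float) -> float:
    """Same quantity from a per-token mean loss and an average token length."""
    return mean_loss_nats / (math.log(2) * bytes_per_token)


if __name__ == "__main__":
    # Hypothetical numbers: mean loss of 3.0 nats, tokens averaging 4.5 bytes
    print(f"{bpb_from_mean_loss(3.0, 4.5):.3f} bits per byte")
```

The same per-token loss gives a lower bpb when tokens cover more bytes, which is why a better tokenizer shows up in this metric.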

See Loss Landscape & Scaling Laws for details on bpb.

Final Statistics

Peak memory usage: 8,234.56 MiB
Total training time: 1.92m
Minimum validation bpb: 2.8945

Memory usage:

d8 model: ~8 GB (fits on consumer GPUs!)
d20 model: ~32 GB (needs A100/H100)
d26 model: ~64 GB (needs 80GB GPU or reduce batch size)

Practically speaking, d8 is your prototyping friend. Debug your data pipeline, loss functions, and evaluation on d8 before burning H100 hours on d20. Most bugs show up at d8 scale.

Part 4: The d20 model delivers GPT-2 quality for $100

Now let's train the "standard" nanochat model—the d20 (~561M parameters) from speedrun.sh:

Step 1: Download More Data

# d20 (~561M params) with a 20× data:param ratio needs ~11.2B tokens
# Assuming roughly 4.8 chars/token and ~250M chars/shard, that is about 215 shards; 240 leaves headroom
python -m nanochat.dataset -n 240 -w 8

Step 2: Launch Training

Single GPU:

python scripts/base_train.py --depth=20

Multi-GPU (8× GPUs):

torchrun --standalone --nproc_per_node=8 scripts/base_train.py --depth=20

Training time:

Single GPU: ~32 hours
8× GPUs: ~4 hours

Step 3: Monitor Progress

The script outputs progress regularly. Key things to watch:

1. Loss should decrease smoothly:

step 00000: loss=10.38
step 00100: loss=4.12
step 00500: loss=2.89
step 01000: loss=2.34
step 02000: loss=1.98
step 03000: loss=1.78

2. Validation bpb should improve:

Step 00000 | Validation bpb: 4.4567
Step 00250 | Validation bpb: 2.1234
Step 00500 | Validation bpb: 1.8901
Step 00750 | Validation bpb: 1.7234
Step 01000 | Validation bpb: 1.6123
...
Step 03167 | Validation bpb: 1.4501

3. Model samples (every 2000 steps):

Step 02000 | Sample outputs:
The capital of France is Paris.
The chemical symbol of gold is Au
If yesterday was Friday, then tomorrow will be Sunday

These get better over time!

Samples are your qualitative sanity check. If loss drops but samples still look like gibberish, something's wrong—usually a tokenizer mismatch or data corruption. Trust samples over metrics when they disagree.

Step 4: Find Your Checkpoint

After training completes, find your model:

ls -lh base_checkpoints/d20/

Output:

checkpoint_003167.pt     # 335 MB - the final model
meta.json                # Metadata (config, metrics)

Part 5: These hyperparameters control 90% of training dynamics

nanochat uses a "Poor Man's Configurator"—you can override any training parameter via command line or config file.

Command-Line Overrides

python scripts/base_train.py \
    --depth=20 \
    --device_batch_size=16 \
    --max_seq_len=1024 \
    --matrix_lr=0.01 \
    --num_iterations=5000

Common parameters:

| Parameter | Default | Description |
| --- | --- | --- |
| `depth` | 20 | Model depth (scales everything else) |
| `max_seq_len` | 2048 | Context length |
| `device_batch_size` | 32 | Per-GPU batch size |
| `total_batch_size` | 524288 | Total tokens per step |
| `num_iterations` | -1 | Explicit step count (-1 = auto) |
| `target_param_data_ratio` | 20 | Chinchilla ratio |
| `matrix_lr` | 0.02 | Muon learning rate |
| `embedding_lr` | 0.2 | AdamW LR for embeddings |
| `grad_clip` | 1.0 | Gradient clipping (0 = off) |
| `eval_every` | 250 | Validation frequency |
| `run` | "dummy" | Wandb run name ("dummy" = no logging) |

Config Files

Create config/my_experiment.py:

# Smaller model for quick testing
depth = 12
num_iterations = 500
device_batch_size = 16
 
# More aggressive learning rates
matrix_lr = 0.03
embedding_lr = 0.3
 
# Wandb logging
run = "d12_fast_lr"

Run with:

python scripts/base_train.py config/my_experiment.py

You can combine config files with CLI overrides:

python scripts/base_train.py config/my_experiment.py --depth=16
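
To show how this style of override can work, here is a minimal sketch of a `--key=value` parser layered on top of default settings. It is an assumption-laden illustration, not nanochat's actual configurator: `apply_overrides` is a hypothetical name, and config-file loading is left out for brevity.

```python
import ast


def apply_overrides(config: dict, argv: list[str]) -> dict:
    """Return a copy of `config` updated from --key=value arguments."""
    result = dict(config)
    for arg in argv:
        if not arg.startswith("--") or "=" not in arg:
            raise ValueError(f"expected --key=value, got {arg!r}")
        key, raw = arg[2:].split("=", 1)
        if key not in result:
            raise KeyError(f"unknown config key: {key}")
        try:
            value = ast.literal_eval(raw)  # numbers, booleans, quoted strings
        except (ValueError, SyntaxError):
            value = raw  # anything else is kept as a plain string
        if type(value) is not type(result[key]):
            raise TypeError(f"{key} expects {type(result[key]).__name__}, got {raw!r}")
        result[key] = value
    return result


if __name__ == "__main__":
    defaults = {"depth": 20, "matrix_lr": 0.02, "run": "dummy"}
    print(apply_overrides(defaults, ["--depth=16", "--run=d16_test"]))
```

Rejecting unknown keys and mismatched types catches typos such as `--dept=16` before a long run starts with the wrong settings.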

Memory Management

Out of memory? Reduce device_batch_size:

# Original (32 → 65K tokens per GPU)
python scripts/base_train.py --depth=26
 
# Reduced (16 → 32K tokens per GPU)
python scripts/base_train.py --depth=26 --device_batch_size=16
 
# Further reduced (8 → 16K tokens per GPU)
python scripts/base_train.py --depth=26 --device_batch_size=8

The script automatically compensates by increasing gradient accumulation:

Same total batch size (524K tokens)
Same final model quality
Slightly slower (more sequential compute)

Training time and cost by model size (8×H100):

| Model | Params | Training Time (8×H100) | Cost |
| --- | --- | --- | --- |
| d20 | ~561M | ~4 hours | ~$100 |
| d26 | ~1B (est.) | ~12 hours | ~$300 |
| d32 | ~1.9B | ~33 hours | ~$800 |

Note: Parameter counts verified from Karpathy's nanochat walkthrough. Smaller tutorial models (d6, d8) exist for learning but aren't the focus of production training.

Part 6: Wandb transforms cryptic numbers into actionable insights

Enable Weights & Biases logging for beautiful dashboards:

# Install wandb
uv pip install wandb
 
# Login (first time only)
wandb login
 
# Train with logging
python scripts/base_train.py --run=my_experiment_name

What gets logged:

Training metrics (every 100 steps):
- train/loss: Cross-entropy loss
- train/lrm: Learning rate multiplier
- train/dt: Step duration
- train/tok_per_sec: Token throughput
- train/mfu: Model FLOPs utilization
Validation metrics (every 250 steps):
- val/bpb: Bits per byte
CORE metrics (every 2000 steps):
- core_metric: Overall score
- Individual task scores
System metrics:
- total_training_flops: Cumulative FLOPs
- total_training_time: Wall-clock time

Wandb dashboard view:

my_experiment_name
├── train/loss        [line chart: decreasing]
├── val/bpb          [line chart: decreasing]
├── train/mfu        [line chart: stable ~40-50%]
├── core_metric      [line chart: increasing]
└── System           [GPU utilization, memory, etc.]

Part 7: Text generation proves your model learned something useful

After training, let's test what the model learned!

Quick CLI Test

python scripts/chat_cli.py

Interactive session:

You: The capital of France is
Assistant: Paris. The city is known for the Eiffel Tower and the Louvre Museum.

You: If 2+2=4, then 3+3=
Assistant: 6

You: Why is the sky blue?
Assistant: The sky appears blue because of Rayleigh scattering...

(Your d8 model won't be this good yet—that's from a larger model!)

Web UI

python scripts/chat_web.py

Then visit http://localhost:8000 (or your server's IP if remote).

Evaluation Benchmarks

Run the CORE benchmark on the base model:

python scripts/base_eval.py

This loads the latest checkpoint and evaluates CORE benchmark (~15 minutes on 8 GPUs).

Output:

Model: base_model (step 3167)
================================================================================
Task                                , Accuracy  , Centered
arc_challenge                       , 0.2134    , 0.2645
arc_easy                            , 0.3456    , 0.3789
...
CORE                                ,           , 0.2219

Part 8: These five errors break every first training run—here's how to fix them

Issue 1: Loss is NaN

step 00042: loss=nan

Causes:

Learning rate too high
Numerical instability
Corrupted data

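A defensive check inside the training loop can skip a bad batch instead of crashing the run. The fragment below is a sketch; it assumes the surrounding loop already defines step, model, optimizer, autocast_ctx, train_inputs, and train_targets:

# At the top of the training script
import logging
import torch

# Inside the training loop, replacing the plain forward/backward calls
try:
    with autocast_ctx:
        loss = model(train_inputs, train_targets)

    # A NaN or inf loss signals instability. Skip the batch rather than
    # backpropagating it into the weights.
    if not torch.isfinite(loss):
        logging.warning(f"Non-finite loss at step {step}. Skipping batch.")
        optimizer.zero_grad(set_to_none=True)
        continue

    loss.backward()

except RuntimeError as e:
    # CUDA OOM is raised as a RuntimeError whose message contains "out of memory"
    if "out of memory" in str(e):
        logging.error(f"OOM at step {step}. Clearing cache and skipping batch.")
        optimizer.zero_grad(set_to_none=True)
        torch.cuda.empty_cache()
        continue
    raise  # re-raise anything else with its original traceback

# Caveat: under torchrun every rank must take the same branch, otherwise
# the ranks fall out of sync at the next gradient synchronization.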

Solutions:

# Reduce learning rates
python scripts/base_train.py --matrix_lr=0.01 --embedding_lr=0.1
 
# Enable gradient clipping (should be on by default)
python scripts/base_train.py --grad_clip=1.0

Issue 2: Loss Not Decreasing

step 00000: loss=10.38
step 00100: loss=10.35
step 00200: loss=10.32

Causes:

Learning rate too low
Not enough training steps
Model too small for data

Solutions:

# Increase learning rates
python scripts/base_train.py --matrix_lr=0.03 --embedding_lr=0.3
 
# Train longer
python scripts/base_train.py --num_iterations=1000
 
# Use larger model
python scripts/base_train.py --depth=16

Issue 3: OOM (Out of Memory)

torch.cuda.OutOfMemoryError: CUDA out of memory

Solution:

# Reduce batch size (automatically increases grad accumulation)
python scripts/base_train.py --device_batch_size=16
 
# Or reduce sequence length
python scripts/base_train.py --max_seq_len=1024
 
# Or use smaller model
python scripts/base_train.py --depth=12

Issue 4: Slow Training

tok/sec: 50,000  (expected: 500,000+)

Causes:

CPU bottleneck (data loading)
Inefficient GPU utilization
Small batch size

Solutions:

# Increase device batch size (if memory allows)
python scripts/base_train.py --device_batch_size=64
 
# Check data loading isn't bottleneck
# (should see GPU utilization near 100% in nvidia-smi)

Issue 5: Can't Resume Training

By default, nanochat does not save intermediate checkpoints that a run can resume from (this keeps the code simple). To train longer:

# Option 1: Increase iterations from the start
python scripts/base_train.py --num_iterations=5000
 
# Option 2: Use mid-training script (covered in later posts)
python scripts/mid_train.py

Part 9: DDP scales training linearly—add GPUs, divide time

Scaling to Multiple GPUs

2 GPUs:

torchrun --standalone --nproc_per_node=2 scripts/base_train.py

4 GPUs:

torchrun --standalone --nproc_per_node=4 scripts/base_train.py

8 GPUs:

torchrun --standalone --nproc_per_node=8 scripts/base_train.py

What changes:

Training time ≈ single-GPU time ÷ num_GPUs (near-linear, as in the ~32 hour vs ~4 hour d20 estimate)
Memory per GPU stays the same
Gradient accumulation automatically reduced
Final model quality identical

Scaling to Larger Models

d12 (30M params) - $30 tier:

torchrun --standalone --nproc_per_node=8 scripts/base_train.py \
    --depth=12 \
    --run=d12_experiment

d26 (~1B params, est.) - $300 tier:

# Need more data
python -m nanochat.dataset -n 450 -w 8
 
# Train (reduce batch size for memory)
torchrun --standalone --nproc_per_node=8 scripts/base_train.py \
    --depth=26 \
    --device_batch_size=16 \
    --run=d26_experiment

Performance expectations:

| Model | Params | Training Time (8×H100) | Cost |
| --- | --- | --- | --- |
| d20 | ~561M | ~4 hours | ~$100 |
| d26 | ~1B | ~12 hours | ~$300 |
| d32 | ~1.9B | ~33 hours | ~$800 |

Note: Parameter counts for d20 (560,988,160) and d32 (1.9B) are from Karpathy's official nanochat walkthrough. Costs assume ~$24/hr for 8×H100 on Lambda Labs as of January 2025.
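
The cost column is just hours multiplied by the hourly rate. A quick sketch to redo the estimate for your own provider (the function name is illustrative; the default rate is the ~$24/hr figure quoted above):

```python
def training_cost_usd(hours: float, hourly_rate_usd: float = 24.0) -> float:
    """Cost of renting the node for the whole run."""
    return hours * hourly_rate_usd


if __name__ == "__main__":
    for model, hours in [("d20", 4), ("d26", 12), ("d32", 33)]:
        print(f"{model}: ${training_cost_usd(hours):,.0f}")
```

Under the near-linear scaling assumption, a single GPU at one eighth of the node price takes eight times as long and costs about the same; the extra GPUs buy wall-clock time.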

Part 10: Your first model opens these five paths forward

Congratulations! You've trained your first language model from scratch. Here's what to explore next:

1. Fine-tune for chat:

See Fine-tuning for Chat (SFT)
Turn your base model into a conversational assistant

2. Understand the architecture:

Review Modern Transformer Architecture
Learn what's inside the model you just trained

3. Experiment with hyperparameters:

Try different learning rates
Test longer training runs
Compare model sizes

4. Train your own tokenizer:

See Tokenizer Design Choices
Customize vocabulary for your domain

5. Build custom evaluations:

See Building Custom Evaluation Tasks
Test your model on tasks you care about

Training a language model is engineering, not magic

With nanochat, you can:

✅ Train models from ~11M (d8) to ~1.9B (d32) parameters
✅ Use single GPU or scale to 8× GPUs
✅ Monitor training with key metrics
✅ Evaluate with standardized benchmarks
✅ Deploy as a chat interface

For your first training run, this means: you're not waiting for permission or expensive cloud credits. A single RTX 3090 and 15 minutes gets you from zero to generating text. That's the bar now.

Before you start your first training run:

Verify your GPU memory first. Run nvidia-smi before anything else. If you see <20GB free, close other apps or reduce device_batch_size.
Start with d8, not d20. The tutorial model trains in 15 minutes. If something breaks, you find out fast—not 4 hours in.
Watch the loss curve, not the terminal. Enable wandb logging (--run=my_experiment). A flattening loss means your model hit capacity or exhausted data.
Save your config files. Every training run should be reproducible. Config files beat command-line archaeology.
Check validation bpb, not just training loss. If they diverge, you're overfitting. For tiny models, this happens fast.
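
The memory check from the first item can also be scripted. The sketch below is one possible pre-flight check, not part of nanochat: the function names are hypothetical, and the 20 GB threshold is the figure from the checklist. It uses `torch.cuda.mem_get_info()`, which reports free and total device memory in bytes.

```python
def has_enough_free_memory(free_bytes: int, min_free_gb: float = 20.0) -> bool:
    """True if the free VRAM meets the threshold."""
    return free_bytes >= min_free_gb * 1024**3


def check_gpu(min_free_gb: float = 20.0) -> bool:
    """Print a short report and return whether the current GPU looks usable."""
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment.")
        return False
    if not torch.cuda.is_available():
        print("No CUDA device is visible.")
        return False
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    gib = 1024**3
    print(f"{torch.cuda.get_device_name(0)}: "
          f"{free_bytes / gib:.1f} GiB free of {total_bytes / gib:.1f} GiB")
    ok = has_enough_free_memory(free_bytes, min_free_gb)
    if not ok:
        print("Close other GPU applications or lower --device_batch_size.")
    return ok


if __name__ == "__main__":
    check_gpu()
```

Run it inside the same virtual environment and the same screen or tmux session you will train in, so it checks what the training job will actually see.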

The best way to demystify AI is to build it yourself. You just started.

Previous in series:

Loss Landscape & Scaling Laws - Understand evaluation metrics like bpb and Chinchilla scaling laws

Next in series:

Fine-tuning for Chat (SFT) - Transform your base model into a conversational assistant

Related posts:

Modern Transformer Architecture - RoPE, QK normalization, and architectural innovations explained
Muon Optimizer Explained - Understanding the optimizer that powers nanochat training

Part of the nanochat Deep-Dive Series • Track 2: Practical Guides

Sources and References

Institutional and Industry Research

Epoch AI — Tracks training compute trends and scaling law research (as of January 2025).
Stanford HAI AI Index — Annual report on LLM training costs and efficiency improvements.
MLCommons MLPerf Training — Industry benchmarks for distributed training performance.
Lambda Labs GPU Pricing — Current cloud GPU costs (~$24/hr for 8×H100 as of January 2025).

nanochat Implementation

nanochat repository. Full source code.
base_train.py. Training script.
Karpathy, A. nanoGPT. Predecessor project.

Training Data

Penedo, G., et al. (2024). The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale. arXiv:2406.17557. 15T token dataset.
Penedo, G., et al. (2024). FineWeb-Edu: A Curated Web Dataset for Educational Content. HuggingFace. Filtered educational subset.

Distributed Training

PyTorch Distributed Tutorial. torchrun and DDP setup.
Li, S., et al. (2020). PyTorch Distributed: Experiences on Accelerating Data Parallel Training. VLDB 2020. DDP design.
Rajbhandari, S., et al. (2020). ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. SC 2020. Memory optimization.

Training Methodology

Hoffmann, J., et al. (2022). Training Compute-Optimal Large Language Models. NeurIPS 2022. Chinchilla scaling laws.
Kaplan, J., et al. (2020). Scaling Laws for Neural Language Models. OpenAI scaling laws.
Vaswani, A., et al. (2017). Attention Is All You Need. NeurIPS 2017. Transformer architecture.

Optimizer

Liu, J., et al. (2025). Muon is Scalable for LLM Training. arXiv:2502.16982. Muon optimizer scaling.
Jordan, K. (2024). Muon: Momentum + Newton-Schulz. Original Muon blog post.

From random noise to coherent text—this is where LLMs begin. Everything else is refinement.
