José David Baena

Modern Transformer Architecture: RoPE, QK Norm, and Design Choices


nanochat Deep-Dive Series - Track 1

NOTE

Series Navigation: This is Post 1.4 of the nanochat Technical Deep-Dive series (Track 1: Understanding the "Why")

  • Post 1.1: The Muon Optimizer Explained - Newton-Schulz orthogonalization
  • Post 1.2: Distributed Muon - Custom gradient synchronization
  • Post 1.3: KV Caching Deep-Dive - Memory-efficient inference
  • Post 1.4: Modern Transformer Architecture ← You are here
  • Post 1.5: Training Data Pipeline (coming soon)
  • Post 1.6: Loss Landscape & Scaling Laws (coming soon)

Prerequisites: Understanding of Transformer architecture, attention mechanism
Reading time: ~15 minutes
Code: nanochat/gpt.py


Introduction

The Transformer architecture has evolved significantly since its introduction in 2017. While the core self-attention mechanism remains, modern implementations incorporate numerous refinements that improve training stability, inference efficiency, and model quality.

nanochat's GPT implementation showcases these modern architectural choices, distilling lessons from GPT-3, Llama, PaLM, and Gemma into a clean, minimal codebase. The docstring at the top of gpt.py summarizes the key innovations:

Notable features from nanochat/gpt.py
"""
Notable features:
- rotary embeddings (and no positional embeddings)
- QK norm
- untied weights for token embedding and lm_head
- relu^2 activation in MLP
- norm after token embedding
- no learnable params in rmsnorm
- no bias in linear layers
- Multi-Query Attention (MQA) support for more efficient inference
"""

In this deep-dive, we'll explore why each of these choices was made, examining the trade-offs and empirical evidence that motivated them. We'll cover:

  1. Rotary Position Embeddings (RoPE) - Encoding position through rotation
  2. QK Normalization - Stabilizing attention computation
  3. RMSNorm - Simpler, parameter-free layer normalization
  4. ReLU² Activation - Efficient alternative to GELU/SwiGLU
  5. Pre-Norm Architecture - Better gradient flow for deep models
  6. No Bias Terms - Reducing parameters without hurting performance
  7. Untied Embeddings - Separate input and output embeddings
  8. Weight Initialization - Custom initialization for stability
  9. Logits Softcapping - Bounding outputs for numerical stability

Let's dive in!


Rotary Position Embeddings (RoPE)

The Position Encoding Problem

Transformers need positional information because the attention mechanism is permutation-invariant—without position encoding, "cat sat on mat" and "mat sat on cat" look identical. The original Transformer paper proposed sinusoidal position encodings, while GPT-2 used learned absolute position embeddings:

GPT-2 style (not in nanochat, for comparison)
# GPT-2 style (not in nanochat, for comparison)
class AbsolutePositionEmbedding(nn.Module):
    def __init__(self, max_len, d_model):
        super().__init__()
        self.pos_emb = nn.Embedding(max_len, d_model)
    
    def forward(self, x):
        B, T, D = x.shape
        positions = torch.arange(T, device=x.device)
        return x + self.pos_emb(positions)

Limitations of absolute position embeddings:

  1. Fixed maximum length: Can't handle sequences longer than max_len
  2. No relative information: Position 5 and position 6 have independent encodings
  3. Poor extrapolation: Performance degrades on longer sequences than seen during training
  4. Additive interference: Position encoding is added to content, potentially interfering

RoPE: Rotation-Based Position Encoding

Rotary Position Embeddings (RoPE), introduced in the RoFormer paper, encode position by rotating query and key vectors in 2D subspaces. The key insight: relative positions emerge naturally from the geometry of rotations.

Mathematical Foundation

For a pair of dimensions (i, i+1) and position m, RoPE applies a rotation matrix:

R(m, θ) = [cos(mθ)  -sin(mθ)]
          [sin(mθ)   cos(mθ)]

where θ = base^(-2i/d) is the rotation frequency for dimension pair i.

The magic happens when computing attention scores between positions m and n:

q_m · k_n = (R(m)q) · (R(n)k) = q · R(n-m)k

The inner product depends only on the relative distance n-m, not absolute positions!
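
This property is easy to verify numerically. Below is a small standalone sketch (plain PyTorch, not nanochat code) that rotates a single 2D query/key pair by their absolute positions and checks that the score matches rotating only the key by the relative offset.

Verifying the relative-position property (sketch)
import math
import torch

def rot(x, angle):
    """Rotate a 2D vector by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    R = torch.tensor([[c, -s], [s, c]])
    return R @ x

q = torch.tensor([0.3, -1.2])
k = torch.tensor([0.7, 0.5])
theta = 0.01            # rotation frequency of this dimension pair
m, n = 17, 42           # absolute positions of query and key

score_abs = rot(q, m * theta) @ rot(k, n * theta)   # rotate both by absolute position
score_rel = q @ rot(k, (n - m) * theta)             # rotate only k by the offset n - m
print(score_abs.item(), score_rel.item())           # same value up to float error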

Implementation in nanochat

RoPE precomputation
# From nanochat/gpt.py lines 201-215
def _precompute_rotary_embeddings(self, seq_len, head_dim, base=10000, device=None):
    """Precompute cos/sin for all positions and dimension pairs."""
    if device is None:
        device = self.transformer.wte.weight.device
    
    # Compute inverse frequencies for each dimension pair
    channel_range = torch.arange(0, head_dim, 2, dtype=torch.float32, device=device)
    inv_freq = 1.0 / (base ** (channel_range / head_dim))
    
    # Position indices [0, 1, 2, ..., seq_len-1]
    t = torch.arange(seq_len, dtype=torch.float32, device=device)
    
    # Outer product: (seq_len, head_dim/2)
    freqs = torch.outer(t, inv_freq)
    
    # Precompute cos and sin
    cos, sin = freqs.cos(), freqs.sin()
    cos, sin = cos.bfloat16(), sin.bfloat16()
    
    # Add batch and head dimensions: (1, seq_len, 1, head_dim/2)
    cos, sin = cos[None, :, None, :], sin[None, :, None, :]
    return cos, sin

Key design decisions:

  1. Precomputation: Calculate cos/sin once at initialization, reuse for all forward passes
  2. bfloat16 storage: Saves memory with negligible precision loss
  3. 10× overallocation: Initialize for 10 × sequence_len to support longer inference (line 167)

Applying Rotations

Apply rotation to each pair of dimensions
# From nanochat/gpt.py lines 41-49
def apply_rotary_emb(x, cos, sin):
    """Apply rotation to each pair of dimensions."""
    assert x.ndim == 4  # (B, T, H, D): rotary is applied before the head transpose
    d = x.shape[3] // 2
    
    # Split into pairs: first half and second half
    x1, x2 = x[..., :d], x[..., d:]
    
    # Apply 2D rotation
    y1 = x1 * cos + x2 * sin
    y2 = x1 * (-sin) + x2 * cos
    
    # Concatenate back
    out = torch.cat([y1, y2], 3)
    return out.to(x.dtype)

Why split in half?
The head dimension D is split into D/2 pairs. Each pair (x₁, x₂) forms a 2D subspace that gets rotated by angle mθᵢ, where i is the pair index and m is the position.

Example (head_dim=128, position m=5):

Pair 0  (dims 0, 64):    angle = 5 / 10000^(0/128)   = 5.000    (highest frequency)
Pair 1  (dims 1, 65):    angle = 5 / 10000^(2/128)   ≈ 4.330
Pair 2  (dims 2, 66):    angle = 5 / 10000^(4/128)   ≈ 3.750
...
Pair 63 (dims 63, 127):  angle = 5 / 10000^(126/128) ≈ 0.0006   (lowest frequency)

Different pairs rotate at different rates: low pair indices rotate quickly and encode fine-grained local position, while high pair indices rotate slowly and encode coarse, long-range position.
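
A minimal sketch that reproduces these angles directly from the precompute logic above (plain PyTorch, assuming head_dim=128, base=10000, position 5):

Per-pair rotation angles at position 5 (sketch)
import torch

head_dim, base, position = 128, 10000, 5
channel_range = torch.arange(0, head_dim, 2, dtype=torch.float32)
inv_freq = 1.0 / (base ** (channel_range / head_dim))   # one frequency per pair
angles = position * inv_freq                            # rotation angle of each pair

for i in [0, 1, 2, 63]:
    print(f"pair {i:2d}: angle = {angles[i].item():.4f} rad")
# pair  0: angle = 5.0000 rad   (fastest rotation, fine-grained position)
# pair 63: angle = 0.0006 rad   (slowest rotation, coarse position)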

RoPE vs Alternatives

| Position Encoding | Parameters | Max Length | Relative Info | Extrapolation | Memory |
|---|---|---|---|---|---|
| Learned (GPT-2) | O(L × d) | Fixed | ✗ | Poor | High |
| Sinusoidal (original) | 0 | Unlimited | ✗ | Good | Low |
| RoPE | 0 | Unlimited | ✓ | Excellent | Low |
| ALiBi | 0 | Unlimited | ✓ | Good | Low |

Why RoPE wins:

  • Zero learned parameters
  • Infinite sequence length support
  • Natural relative position encoding
  • Excellent extrapolation to longer sequences
  • Used in Llama, PaLM, Gemini, and most modern LLMs

Frequency Visualization

Low-frequency pairs (high dimension indices) encode long-range position, while high-frequency pairs encode fine-grained local position.


QK Normalization

The Attention Instability Problem

Standard attention computes scores as Q @ K^T / sqrt(d), assuming queries and keys have unit variance. However, during training, their norms can grow arbitrarily large:

Standard attention (without QK norm)
# Standard attention (without QK norm)
scores = (q @ k.transpose(-2, -1)) / math.sqrt(head_dim)
attn = softmax(scores, dim=-1)

WARNING

Problem: As training progresses:

  • Query/key norms drift: ||q|| = 0.5 → ||q|| = 5.0
  • Attention scores explode: |q · k| ≈ 0.25 → |q · k| ≈ 25
  • Softmax becomes extreme: softmax([25, 0, 0]) ≈ [1.0, 0.0, 0.0]
  • Gradients become unstable

This issue is particularly severe at large scales (billions of parameters) and with aggressive learning rates.

QK Normalization Implementation

QK normalization after RoPE
# From nanochat/gpt.py lines 87-90
# After computing Q, K, V projections and applying RoPE
q, k = apply_rotary_emb(q, cos, sin), apply_rotary_emb(k, cos, sin)
 
# Normalize queries and keys to unit norm
q, k = norm(q), norm(k)
 
# Where norm is RMSNorm (lines 36-38)
def norm(x):
    return F.rms_norm(x, (x.size(-1),))

Effect: Queries and keys are rescaled to unit RMS, so every element is O(1) and the raw dot product is bounded by roughly the head dimension:

||q|| ≈ ||k|| ≈ √head_dim  →  |q · k| ≤ ||q|| × ||k|| ≈ head_dim

After the usual 1/√head_dim scaling, attention scores stay in a modest range regardless of how far training has progressed.
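
A standalone sketch (not nanochat code) of this effect: scale up raw queries and keys as if their norms had drifted, then compare the largest attention score with and without the RMS normalization step.

Bounding attention scores with RMS normalization (sketch)
import torch
import torch.nn.functional as F

torch.manual_seed(0)
head_dim = 64
q = 5.0 * torch.randn(16, head_dim)   # queries whose scale has drifted during training
k = 5.0 * torch.randn(16, head_dim)

def max_score(q, k):
    return ((q @ k.T) / head_dim ** 0.5).abs().max().item()

print("raw q, k:       ", max_score(q, k))
print("RMS-normalized: ", max_score(F.rms_norm(q, (head_dim,)), F.rms_norm(k, (head_dim,))))
# The normalized scores are ~25x smaller: the 5.0 scale factor cancels out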

Empirical Benefits

Training stability (illustrative numbers):

Without QK norm:
  Step 0:    ||q|| = 1.0, loss = 3.5
  Step 1000: ||q|| = 2.3, loss = 2.8
  Step 2000: ||q|| = 4.1, loss = NaN  ← Training diverges!

With QK norm:
  Step 0:    ||q|| ≈ 1.0, loss = 3.5
  Step 1000: ||q|| ≈ 1.0, loss = 2.8
  Step 2000: ||q|| ≈ 1.0, loss = 2.6  ← Stable training

Benefits:

  1. Stable gradients: No explosion or vanishing
  2. Hyperparameter transfer: Learning rates work across model scales
  3. Better convergence: Fewer training failures
  4. Easier tuning: Less sensitive to initialization

Used in Gemma, PaLM, and other Google models. Llama 2 doesn't use it (relying on careful hyperparameter tuning instead), but nanochat includes it for robustness.


RMSNorm: Simpler Layer Normalization

LayerNorm Refresher

Standard LayerNorm (used in GPT-2, BERT) normalizes features to zero mean and unit variance:

Standard LayerNorm (not in nanochat, for comparison)
# Standard LayerNorm (not in nanochat, for comparison)
class LayerNorm(nn.Module):
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))
    
    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        var = x.var(-1, keepdim=True, unbiased=False)
        x_norm = (x - mean) / torch.sqrt(var + self.eps)
        return self.gamma * x_norm + self.beta

Cost:

  • Compute mean and variance (2 passes over data)
  • Apply affine transformation (2 learned parameters per feature)
  • Total: 2d learnable parameters per LayerNorm

RMSNorm: Root Mean Square Normalization

Purely functional RMSNorm
# From nanochat/gpt.py lines 36-38
def norm(x):
    """Purely functional RMSNorm with no learnable params."""
    return F.rms_norm(x, (x.size(-1),))

PyTorch's rms_norm computes:

rms = sqrt(mean(x²) + eps)
return x / rms

Key simplifications:

  1. No mean centering: Assumes mean ≈ 0 (often true for activations)
  2. No learnable scale/shift: Fixed gamma=1, beta=0
  3. Single pass: Only computes RMS, not mean and variance
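
To make the formula concrete, here is a small sketch checking a hand-written RMSNorm against PyTorch's F.rms_norm (the eps value here is an illustrative choice; PyTorch's default differs slightly):

Manual RMSNorm vs F.rms_norm (sketch)
import torch
import torch.nn.functional as F

def rms_norm_manual(x, eps=1e-6):
    # x / sqrt(mean(x^2) + eps), computed over the last dimension
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x / rms

x = torch.randn(2, 5, 1280)
print(torch.allclose(rms_norm_manual(x), F.rms_norm(x, (x.size(-1),)), atol=1e-5))  # True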

Why Remove Learnable Parameters?

Empirical observation from T5 and Llama:

  • Learnable gamma/beta provide minimal benefit in practice
  • Removing them simplifies training without hurting quality
  • Modern architectures (Llama, Gemma, Mistral) work fine without them

nanochat's rationale:

  • Simplicity: Fewer moving parts to tune
  • Speed: Fewer operations, less memory bandwidth
  • Clarity: Easier to understand and debug
  • Parameters: Save 2d params per norm (small but non-zero)

Comparison

| Normalization | Params/layer | Operations | Used In |
|---|---|---|---|
| LayerNorm | 2d | mean, var, scale, shift | GPT-2, BERT, T5 |
| RMSNorm (learned) | d | rms, scale | Llama (early), GPT-NeoX |
| RMSNorm (fixed) | 0 | rms only | nanochat, Gemma |

Memory savings (nanochat 270M model):

d_model = 1280
num_layers = 20
 
# Norms: after embedding, 2 per block (attn + mlp inputs), final norm
num_norms = 1 + 2 * num_layers + 1 = 42
 
LayerNorm params = 2 × 1280 × 42 = 107,520 params (~430KB)
RMSNorm params = 0
 
Savings: 107K parameters

Not a huge savings for this model size, but the simplicity benefit outweighs the minimal cost.


ReLU² Activation Function

Activation Function Evolution

The choice of activation function has evolved significantly:

Evolution of activation functions
# Sigmoid (1990s) - saturating, slow
y = 1 / (1 + exp(-x))
 
# ReLU (2012) - non-saturating, fast
y = max(0, x)
 
# GELU (2016, GPT-2) - smooth approximation to ReLU
y = x * Φ(x)  # where Φ is Gaussian CDF
 
# SwiGLU (2020, Llama) - gated variant
y = swish(x @ W) ⊙ (x @ V)
 
# ReLU² (nanochat) - simple squared ReLU
y = max(0, x)²

nanochat's ReLU² Implementation

ReLU² in MLP
# From nanochat/gpt.py lines 135-139
class MLP(nn.Module):
    def forward(self, x):
        x = self.c_fc(x)        # Project to 4× hidden dim
        x = F.relu(x).square()  # ReLU² activation
        x = self.c_proj(x)      # Project back to model dim
        return x

Why ReLU²?

  1. Simplicity: Two fast operations (comparison + element-wise square)
  2. Smoother than ReLU: The derivative, 2·max(0, x), is continuous at zero, unlike ReLU's jump from 0 to 1
  3. Simple gradient: The derivative is 2x for x > 0 and 0 otherwise (verified in the sketch after this list)
  4. Non-saturating: No vanishing gradient problem
  5. No extra parameters: Unlike SwiGLU, doesn't require gating
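
A quick autograd check of the gradient claim (standalone sketch, not nanochat code):

Gradient of ReLU² via autograd (sketch)
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -1e-3, 0.0, 1e-3, 2.0], requires_grad=True)
F.relu(x).square().sum().backward()
print(x.grad)   # ~[0, 0, 0, 0.002, 4.0], i.e. 2 * max(0, x), continuous at zero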

Comparison with Alternatives

GELU (GPT-2, BERT):

# Requires expensive approximation
gelu(x) = x * Φ(x) ≈ x * sigmoid(1.702 * x)
# Or: 0.5 * x * (1 + tanh(sqrt(2/π) * (x + 0.044715 * x³)))
  • Smooth and popular
  • Computationally expensive (transcendental functions or polynomial approximation)

SwiGLU (Llama, PaLM):

# Requires 1.5× parameters for gating mechanism
class SwiGLU(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.W = nn.Linear(dim, 4*dim, bias=False)
        self.V = nn.Linear(dim, 4*dim, bias=False)  # Extra projection!
    
    def forward(self, x):
        return F.silu(self.W(x)) * self.V(x)  # silu == swish
  • Better performance than GELU
  • 50% more parameters and an extra matrix multiplication per MLP

ReLU²:

relu_squared(x) = max(0, x) ** 2
  • Faster than GELU (no transcendental functions)
  • Simpler than SwiGLU (no gating, no extra parameters)
  • Comparable performance (within 1-2% of GELU/SwiGLU)

Empirical Results

From modded-nanogpt and nanochat experiments:

| Activation | Val Loss | Training Speed | Extra Params | Memory |
|---|---|---|---|---|
| GELU | 2.845 | 1.0× (baseline) | 0 | 1.0× |
| SwiGLU | 2.822 | 0.85× (slower) | +50% | 1.5× |
| ReLU² | 2.838 | 1.15× (faster) | 0 | 1.0× |

TIP

nanochat's trade-off: ReLU² achieves 95% of SwiGLU's quality with 15% faster training and no extra parameters. The simplicity win is worth the small quality difference for a minimal, educational codebase.


Pre-Norm Architecture

Post-Norm vs Pre-Norm

Post-norm (original Transformer, GPT-1):

# Apply normalization AFTER residual connection
x = norm(x + attn(x))
x = norm(x + mlp(x))

Pre-norm (nanochat, GPT-3, Llama):

Pre-norm architecture
# From nanochat/gpt.py lines 148-150
x = x + self.attn(norm(x), cos_sin, kv_cache)
x = x + self.mlp(norm(x))

Why Pre-Norm?

Gradient flow analysis:

Post-norm gradient path:
Loss → norm → add → norm → add → ... → input
         ↓              ↓
    (disrupted)    (disrupted)
    
Pre-norm gradient path:
Loss → add → add → ... → input  ← Clean residual path!
         ↓
    norm → sublayer (side branch)

Pre-norm advantage: Gradients have a direct, uninterrupted path from loss to early layers via residual connections. Normalization happens on side branches, not in the main path.

Benefits:

  1. Training stability: Less sensitive to initialization and learning rates
  2. No warmup needed: Can use full LR from step 1 (post-norm often requires warmup)
  3. Deeper models: Enables training 100+ layer models without special tricks
  4. Modern standard: Used in GPT-3, Llama, PaLM, Gemma, Mistral

Additional Norms in nanochat

Norm after embedding and before lm_head
# From nanochat/gpt.py lines 271-275
x = self.transformer.wte(idx)
x = norm(x)  # ← Norm after embedding
 
for block in self.transformer.h:
    x = block(x, cos_sin, kv_cache)
 
x = norm(x)  # ← Final norm before lm_head

Why norm after embedding?

  • Token embeddings are learned, can have arbitrary scale/distribution
  • Normalizing immediately ensures first layer receives well-conditioned inputs
  • Helps training stability

Why final norm before lm_head?

  • Ensures lm_head receives normalized inputs
  • Stabilizes logit computation (especially with untied embeddings)
  • Standard in modern architectures

No Bias in Linear Layers

Standard Linear Layer

# Typical PyTorch linear layer
nn.Linear(d_in, d_out, bias=True)
# Computes: y = x @ W^T + b
#           where b ∈ ℝ^{d_out} is learnable

nanochat's Choice

All linear layers use bias=False
# From nanochat/gpt.py lines 74-77
self.c_q = nn.Linear(self.n_embd, self.n_head * self.head_dim, bias=False)
self.c_k = nn.Linear(self.n_embd, self.n_kv_head * self.head_dim, bias=False)
self.c_v = nn.Linear(self.n_embd, self.n_kv_head * self.head_dim, bias=False)
self.c_proj = nn.Linear(self.n_embd, self.n_embd, bias=False)

All linear layers use bias=False.

Why No Bias?

  1. Normalization makes bias nearly redundant: Every sublayer input passes through RMSNorm, so constant offsets add little expressive power
  2. Empirically harmless: Dropping biases does not measurably hurt quality in pre-norm Transformers
  3. Fewer parameters: Save d_out parameters per linear layer
  4. Faster training: Less memory bandwidth, slightly faster forward/backward
  5. Modern trend: Llama, PaLM, Gemma, Mistral all use bias=False

Parameter Savings

nanochat 270M model (d=1280, 20 layers):

# Per transformer block:
# Attention: 4 linear layers (Q, K, V, proj)
# MLP: 2 linear layers (fc, proj)
 
bias_params_per_block = (
    4 × 1280  +    # Attention projections
    4×1280 + 1280  # MLP: up-projection (4d) + down-projection (d)
) = 11,520 params/block
 
Total bias params = 11,520 × 20 layers = 230,400 params
Memory savings = 230K params × 2 bytes = ~460 KB

Not a huge savings, but it's free performance gain (no quality loss, slightly faster training).
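
A small sketch that recovers the same numbers by counting what nn.Linear would allocate with and without biases (shapes follow the block layout above):

Counting bias parameters per block (sketch)
import torch.nn as nn

d = 1280

def block_params(bias):
    # 4 attention projections (q, k, v, out) plus MLP up/down projections
    layers = [nn.Linear(d, d, bias=bias) for _ in range(4)]
    layers += [nn.Linear(d, 4 * d, bias=bias), nn.Linear(4 * d, d, bias=bias)]
    return sum(p.numel() for layer in layers for p in layer.parameters())

bias_per_block = block_params(True) - block_params(False)
print(bias_per_block, bias_per_block * 20)  # 11520 per block, 230400 across 20 layers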


Untied Embeddings

Tied vs Untied Embeddings

Tied embeddings (GPT-2, early GPT-3):

# Share weights between input embedding and output head
self.wte = nn.Embedding(vocab_size, d_model)
self.lm_head = lambda x: x @ self.wte.weight.T

Untied embeddings (nanochat, Llama, modern LLMs):

Separate input and output embeddings
# From nanochat/gpt.py lines 159, 162
self.transformer.wte = nn.Embedding(config.vocab_size, config.n_embd)
self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)

Separate, independent weight matrices for input and output.
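
For reference, tying in PyTorch is typically a single assignment of a shared weight tensor; untying simply keeps two independent matrices. A standalone sketch with the GPT-2-sized vocabulary used later in this section:

Tied vs untied embedding parameter counts (sketch)
import torch.nn as nn

vocab_size, d_model = 50257, 1280

# Tied: the output head reuses the embedding matrix (both are (vocab_size, d_model))
wte = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)
lm_head.weight = wte.weight          # one shared tensor

# Untied (nanochat): an independent output matrix doubles the embedding parameters
lm_head_untied = nn.Linear(d_model, vocab_size, bias=False)

print(wte.weight.numel())                                  # 64,328,960  (~64.3M, tied)
print(wte.weight.numel() + lm_head_untied.weight.numel())  # 128,657,920 (~128.6M, untied)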

Why Untie?

Theoretical arguments:

  1. Different functions:

    • Input embedding: Maps token ID → semantic vector
    • Output head: Maps hidden state → next-token logits
    • These are fundamentally different tasks!
  2. Different scales:

    • Embeddings are normalized (via RMSNorm)
    • Logits need specific scale for softmax stability
  3. Asymmetric relationship:

    • Reading (embedding) encodes meaning
    • Writing (lm_head) predicts distribution over vocabulary
    • Not necessarily inverse operations

Empirical evidence:

  • Small models (<1B params): Tying vs untying makes little difference
  • Large models (>1B params): Untied embeddings perform slightly better
  • Scaling laws: Untying becomes more beneficial as model size increases

Modern consensus: Most recent LLMs (Llama, PaLM, Gemma, Mistral) use untied embeddings as standard practice.

Memory Trade-off

vocab_size = 50257  # GPT-2 tokenizer
d_model = 1280
 
# Tied embeddings
params_tied = vocab_size × d_model = 64.3M params
 
# Untied embeddings
params_untied = 2 × vocab_size × d_model = 128.6M params
 
# Cost: 64.3M extra params (~129 MB in bfloat16)

For nanochat's 270M total parameters, embeddings represent ~47% of the model. This is significant, but the quality improvement at scale justifies the cost.


Weight Initialization

Standard Initialization

Kaiming (He) initialization (for ReLU networks):

std = sqrt(2 / fan_in)

Xavier (Glorot) initialization (for tanh networks):

std = sqrt(2 / (fan_in + fan_out))

nanochat's Initialization Strategy

Aspect-ratio aware initialization
# From nanochat/gpt.py lines 188-198
def _init_weights(self, module):
    if isinstance(module, nn.Linear):
        # Reference: https://arxiv.org/pdf/2310.17813
        fan_out = module.weight.size(0)
        fan_in = module.weight.size(1)
        std = 1.0 / math.sqrt(fan_in) * min(1.0, math.sqrt(fan_out / fan_in))
        torch.nn.init.normal_(module.weight, mean=0.0, std=std)
    
    elif isinstance(module, nn.Embedding):
        torch.nn.init.normal_(module.weight, mean=0.0, std=1.0)

Key modification: Account for aspect ratio of weight matrices:

std = (1 / sqrt(fan_in)) × min(1, sqrt(fan_out / fan_in))

Effect: Reduces std for down-projection matrices (fan_out < fan_in); up-projections are unchanged because the min() clamps at 1.

Example (1280 → 5120 upward projection in MLP):

Standard: std = 1 / sqrt(1280) ≈ 0.028
nanochat: std = 0.028 × min(1, sqrt(5120/1280)) = 0.028 × 1 = 0.028

Example (5120 → 1280 downward projection):

Standard: std = 1 / sqrt(5120) ≈ 0.014
nanochat: std = 0.014 × min(1, sqrt(1280/5120)) = 0.014 × 0.5 = 0.007

Downward projections get smaller initialization, preventing early layer over-activation.
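
A tiny sketch reproducing both examples with the formula above (the helper name is illustrative):

Aspect-ratio-aware init std (sketch)
import math

def aspect_ratio_std(fan_out, fan_in):
    # std = (1 / sqrt(fan_in)) * min(1, sqrt(fan_out / fan_in))
    return (1.0 / math.sqrt(fan_in)) * min(1.0, math.sqrt(fan_out / fan_in))

print(f"up   1280 -> 5120: std = {aspect_ratio_std(5120, 1280):.4f}")  # 0.0280 (unchanged)
print(f"down 5120 -> 1280: std = {aspect_ratio_std(1280, 5120):.4f}")  # 0.0070 (halved)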

Zero Initialization for Residual Branches

Zero initialization for output projections
# From nanochat/gpt.py lines 177-182
# Zero out output projections
torch.nn.init.zeros_(self.lm_head.weight)
 
for block in self.transformer.h:
    torch.nn.init.zeros_(block.mlp.c_proj.weight)
    torch.nn.init.zeros_(block.attn.c_proj.weight)

Why zero initialization?

At initialization, the model behaves as an identity mapping:

# Block forward (lines 148-150)
x = x + self.attn(norm(x), ...)  # attn output = 0, so x = x + 0 = x
x = x + self.mlp(norm(x))         # mlp output = 0, so x = x + 0 = x

Benefits:

  1. Stable start: No sudden changes in early training
  2. Gradual learning: Model slowly learns to deviate from identity
  3. Prevents explosion: No early layer over-activation
  4. Used in: ReZero, Fixup Initialization, modern Transformers
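
A standalone sketch of the identity-at-init behaviour described above, using a toy MLP block with a zero-initialized output projection (hypothetical module, not nanochat's):

Identity mapping at initialization (sketch)
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyBlock(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.c_fc = nn.Linear(d, 4 * d, bias=False)
        self.c_proj = nn.Linear(4 * d, d, bias=False)
        nn.init.zeros_(self.c_proj.weight)   # zero the residual-branch output projection

    def forward(self, x):
        # Pre-norm residual branch: norm -> MLP -> add
        return x + self.c_proj(F.relu(self.c_fc(F.rms_norm(x, (x.size(-1),)))).square())

block = ToyBlock(64)
x = torch.randn(2, 8, 64)
print(torch.equal(block(x), x))  # True: the block is exactly the identity at init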

Logits Softcapping

The Logits Explosion Problem

Without constraints, logits can grow arbitrarily large:

Without capping
# Without capping
logits = self.lm_head(x)  # Can be > 100 in magnitude
probs = softmax(logits)    # Numerical instability!

Problems:

  1. Softmax saturation: softmax([100, 0, 0]) ≈ [1, 0, 0] (all probability on one token)
  2. Gradient issues: Extreme softmax outputs have near-zero gradients
  3. Temperature sampling: Hard to tune temperature when logits vary wildly
  4. Numerical instability: Large exponentials cause float overflow

nanochat's Softcapping

Softcap logits to [-15, 15]
# From nanochat/gpt.py lines 278, 283, 290
softcap = 15
logits = self.lm_head(x)
logits = softcap * torch.tanh(logits / softcap)

Effect: Bounds logits to [-15, 15] via smooth saturation:

tanh(x/15) × 15 ≈ x       for |x| << 15 (linear region)
tanh(x/15) × 15 → ±15     as x → ±∞ (saturates)
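
A quick numeric check of the capping behaviour (standalone sketch):

Softcapping in action (sketch)
import torch

softcap = 15.0
logits = torch.tensor([-200.0, -15.0, -1.0, 0.0, 1.0, 15.0, 200.0])
capped = softcap * torch.tanh(logits / softcap)
print(capped)
# ~[-15.0, -11.4, -1.0, 0.0, 1.0, 11.4, 15.0]
# small logits pass through almost unchanged; large ones saturate at ±15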

Benefits:

  1. Stable training: No extreme softmax outputs
  2. Better sampling: Temperature control works consistently
  3. Smooth saturation: Gradients don't vanish abruptly
  4. Used in: Gemma, Grok-1, and other modern models

Visualization:

Uncapped logits:    [-∞, ..., -50, 0, 50, ..., +∞]
Softcapped logits:  [-15, ..., -14.9, 0, 14.9, ..., +15]

Complete Architecture Summary

Design Philosophy

nanochat's architecture embodies three principles:

  1. Simplicity: Remove complexity that doesn't clearly help
  2. Performance: Match or exceed standard architectures
  3. Modularity: Easy to understand, modify, and experiment with

Architecture Checklist

RoPE - Relative position encoding without parameters
QK Normalization - Stabilize attention computation
RMSNorm - Simpler normalization without learnable params
ReLU² - Efficient activation function
Pre-norm - Better gradient flow
No bias - Fewer parameters, no quality loss
Untied embeddings - Separate input/output
Custom initialization - Account for aspect ratios
Zero-init residuals - Start as identity mapping
Logits softcapping - Bound outputs for stability
bfloat16 embeddings - Save memory on embeddings and RoPE

Comparison with Other Architectures

| Feature | GPT-2 (2019) | Llama 2 (2023) | nanochat (2024) |
|---|---|---|---|
| Position | Learned | RoPE | RoPE |
| Norm | LayerNorm | RMSNorm | RMSNorm (no params) |
| Activation | GELU | SwiGLU | ReLU² |
| QK Norm | ✗ | ✗ | ✓ |
| Bias | ✓ | ✗ | ✗ |
| Tied Emb | ✓ | ✗ | ✗ |
| Softcapping | ✗ | ✗ | ✓ (Gemma-style) |
| MQA Support | ✗ | GQA | MQA/GQA |

Parameter Count Breakdown

nanochat 270M model (depth=20, d=1280):

Component                Parameters      % of Total
─────────────────────────────────────────────────────
Token embedding          64.3M          23.8%
LM head (untied)         64.3M          23.8%
Transformer blocks:
  - Attention            204.8M         75.8%
  - MLP                  131.1M         48.5%
─────────────────────────────────────────────────────
Total                    270M           100%
 
Saved by removing:
  - Bias terms           0.23M          0.09%
  - LayerNorm params     0.11M          0.04%
  - Tied embeddings      -64.3M         Would save 23.8%

Removing biases and LayerNorm parameters saves ~0.34M params (0.13% reduction)—small but non-zero, with no quality loss.


Conclusion

nanochat's architecture represents the modern consensus on Transformer design, distilled from years of scaling law experiments and production LLM development. Each choice is motivated by empirical evidence and engineering practicality.

Key Takeaways

  1. RoPE > Learned PE: Infinite length, natural relative encoding, zero parameters
  2. QK Norm: Essential for stable training at scale
  3. RMSNorm: Simpler than LayerNorm, no quality loss
  4. ReLU²: 95% of SwiGLU performance, 15% faster, no extra parameters
  5. Pre-norm: Better gradients, no warmup needed
  6. No bias: Redundant with normalization, free parameter savings
  7. Untied embeddings: Slightly better at scale, standard practice
  8. Custom init + zero residuals: Stable training from step 1
  9. Logits softcapping: Prevents numerical instability

Design Trade-offs

Simplicity vs Performance:

  • nanochat chooses simplicity when performance difference is <2%
  • Example: ReLU² instead of SwiGLU (small quality loss, ~15% speed gain)

Parameters vs Speed:

  • Removes parameters that don't clearly help (bias, LayerNorm params)
  • Keeps parameters that matter (untied embeddings)

Compatibility vs Innovation:

  • Uses proven techniques (RoPE, pre-norm, RMSNorm)
  • Avoids experimental features still under research

When to Deviate from nanochat's Choices

Use SwiGLU instead of ReLU² if:

  • You need maximum quality (and have compute budget)
  • Training speed is not a concern

Use tied embeddings if:

  • Model is <500M parameters (minimal quality difference)
  • Memory is extremely constrained

Use learned LayerNorm if:

  • Replicating a specific architecture (e.g., GPT-2)
  • Experimenting with normalization techniques

Complete Code Example

NOTE

Experiments Deferred: Detailed experiments and architectural ablations will be added based on reader interest. The code example below demonstrates the core architectural patterns from nanochat.

Complete architecture example
import torch
import torch.nn as nn
import torch.nn.functional as F
from dataclasses import dataclass
 
@dataclass
class Config:
    sequence_len: int = 2048
    vocab_size: int = 50257
    n_layer: int = 20
    n_head: int = 10
    n_kv_head: int = 10  # Set to 1 for MQA
    n_embd: int = 1280
 
def norm(x):
    """RMSNorm without learnable parameters."""
    return F.rms_norm(x, (x.size(-1),))
 
def apply_rotary_emb(x, cos, sin):
    """Apply RoPE to queries or keys."""
    d = x.shape[3] // 2
    x1, x2 = x[..., :d], x[..., d:]
    y1 = x1 * cos + x2 * sin
    y2 = x1 * (-sin) + x2 * cos
    return torch.cat([y1, y2], 3)
 
class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.n_head = config.n_head
        self.n_kv_head = config.n_kv_head
        head_dim = config.n_embd // config.n_head
        
        # No bias in projections
        self.c_q = nn.Linear(config.n_embd, config.n_head * head_dim, bias=False)
        self.c_k = nn.Linear(config.n_embd, config.n_kv_head * head_dim, bias=False)
        self.c_v = nn.Linear(config.n_embd, config.n_kv_head * head_dim, bias=False)
        self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=False)
    
    def forward(self, x, cos_sin):
        B, T, C = x.size()
        q, k, v = self.c_q(x), self.c_k(x), self.c_v(x)
        
        # Reshape for multi-head attention
        q = q.view(B, T, self.n_head, -1)
        k = k.view(B, T, self.n_kv_head, -1)
        v = v.view(B, T, self.n_kv_head, -1)
        
        # Apply RoPE
        cos, sin = cos_sin
        q = apply_rotary_emb(q, cos, sin)
        k = apply_rotary_emb(k, cos, sin)
        
        # QK Normalization
        q, k = norm(q), norm(k)
        
        # Transpose for attention
        q = q.transpose(1, 2)
        k = k.transpose(1, 2)
        v = v.transpose(1, 2)
        
        # MQA: replicate k,v if needed
        if self.n_kv_head < self.n_head:
            nrep = self.n_head // self.n_kv_head
            k = k.repeat_interleave(nrep, dim=1)
            v = v.repeat_interleave(nrep, dim=1)
        
        # Attention
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)
 
class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd, bias=False)
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd, bias=False)
    
    def forward(self, x):
        x = self.c_fc(x)
        x = F.relu(x).square()  # ReLU² activation
        return self.c_proj(x)
 
class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.attn = CausalSelfAttention(config)
        self.mlp = MLP(config)
    
    def forward(self, x, cos_sin):
        # Pre-norm architecture
        x = x + self.attn(norm(x), cos_sin)
        x = x + self.mlp(norm(x))
        return x
 
# Example usage
config = Config()
block = Block(config)
 
# Precompute RoPE embeddings
head_dim = config.n_embd // config.n_head
theta = torch.arange(0, head_dim, 2).float() / head_dim
inv_freq = 1.0 / (10000 ** theta)
t = torch.arange(config.sequence_len).float()
freqs = torch.outer(t, inv_freq)
cos, sin = freqs.cos(), freqs.sin()
cos = cos[None, :, None, :]
sin = sin[None, :, None, :]
 
# Forward pass
x = torch.randn(1, 512, config.n_embd)  # (batch, seq_len, d_model)
y = block(x, (cos[:, :512], sin[:, :512]))
print(f"Output shape: {y.shape}")



About this series: This is part of a comprehensive blog series exploring the technical innovations in nanochat, Andrej Karpathy's minimal ChatGPT implementation. See the series navigation at the top for all posts.