José David Baena


Reinforcement Learning from Human Feedback (RLHF)


SFT teaches imitation. RL teaches discovery.

The jump from SFT to RL was the most surprising part of nanochat's training pipeline for me. You can see the model's reasoning quality visibly improve—solutions it could never produce by imitating demonstrations emerge naturally when you let it explore and reinforce what works.

Your SFT model can imitate. But it can't exceed your demonstrations. RL breaks that ceiling—and GRPO makes it practical on consumer hardware.

TL;DR: GRPO (Group Relative Policy Optimization) replaces expensive PPO reward models with binary verification. For math problems, you can check whether the answer is correct without human labels. nanochat's 561M-parameter model reaches 50%+ on GSM8K this way, discovering chain-of-thought reasoning patterns that weren't in the training data. DeepSeek used GRPO to train DeepSeek-V3 and DeepSeek-R1 (released in late December 2024 and January 2025), among the most cost-effective frontier models to date. The technique is going mainstream.

When SFT hits a wall: Consider a common pattern: spending months curating high-quality math demonstrations for SFT, only to hit a ceiling around 38% on GSM8K that refuses to improve no matter how much data you add. The problem is that imitation learning can't exceed your demonstrations—if your best demonstrations are imperfect, the model faithfully learns those imperfections. Implementing GRPO with binary verification (just checking if answers are correct) lets the model discover reasoning patterns the human annotators never used. Same base model. Different training signal. GSM8K can jump to 52% in days. RL doesn't just teach your model to follow—it teaches it to search.

Track 2: Practical Guides - Post 2.3 of 6

This post builds on Fine-tuning for Chat (SFT). View all posts in this track →


Prerequisites and Installation

Before starting RL training, ensure you have a properly configured nanochat environment and a trained SFT model.

System Requirements:

  • CUDA: 11.8+ or 12.x (required for GPU training)
  • Python: 3.10-3.11 (nanochat compatibility)
  • RAM: 32GB+ (for data loading and rollout generation)
  • GPU: 24GB+ VRAM recommended (e.g., RTX 3090, A100)
    • Single GPU: Works but slower (~32 hours for full run)
    • 8× GPUs: Optimal for production training (~4 hours)
  • Disk: 10GB+ for nanochat repository and checkpoints

Installation:

# Clone nanochat repository
git clone https://github.com/karpathy/nanochat.git
cd nanochat
 
# Install uv package manager (fast, reliable dependency management)
curl -LsSf https://astral.sh/uv/install.sh | sh
 
# Create virtual environment and install dependencies
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv pip install -e .
 
# Verify installation
python -c "import torch; print(f'PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}')"

Required Checkpoint: RL training requires a pre-trained SFT model. If you don't have one:

# Verify SFT checkpoint exists
ls -la chatsft_checkpoints/
 
# Should see directories like: d12/, d20/, or d24/ with model_step_*.pt files
# If missing, see: /blog/nanochat-deep-dive/fine-tuning-for-chat-sft

Common Installation Issues:

| Error | Cause | Solution |
| --- | --- | --- |
| ImportError: No module named 'nanochat' | Not installed in editable mode | Run uv pip install -e . from the nanochat root |
| CUDA out of memory during rollouts | Insufficient VRAM for device_batch_size=8 | Reduce to --device_batch_size=4 or =2 |
| FileNotFoundError: chatsft_checkpoints/ | Missing SFT checkpoint | Train the SFT model first (see prerequisites above) |
| RuntimeError: CUDA error | CUDA version mismatch | Ensure the PyTorch CUDA version matches the system CUDA |
| Slow rollout generation | CPU bottleneck | Increase num_workers in data loading or reduce num_samples |

RL lets your model discover solutions better than your demonstrations

SFT teaches your model to imitate. RL teaches it to improve. That's how nanochat's ~561M parameter d20 model reaches 50%+ on GSM8K.

TL;DR: GRPO uses binary rewards (correct/incorrect) with no reward model. Group normalization compares responses generated for the same prompt. nanochat drops the KL penalty, relying on the SFT initialization and conservative learning rates instead. 16 rollouts per prompt are enough for stable advantage estimates.

Supervised Fine-Tuning (SFT) teaches a model to follow instructions by imitating human-written responses. But imitation has limits—how do you train a model to be more helpful, more accurate, or better at reasoning when you can't easily demonstrate the perfect answer?

This is where Reinforcement Learning from Human Feedback (RLHF) shines. Instead of imitating demonstrations, the model learns by trying different responses and receiving feedback on which ones are better. This allows it to discover solutions that might be better than any single human demonstration.

nanochat applies a simplified form of GRPO (Group Relative Policy Optimization) to improve mathematical reasoning on GSM8K. This post covers:

  • The fundamental difference between SFT and RL for chat models
  • How to design reward functions for subjective qualities
  • The rollout generation process and advantage estimation
  • Policy gradient optimization without trust regions or PPO
  • Practical considerations for RL training stability
  • Tool use integration in RL (calculator for math problems)

RL optimizes expected reward, not log-likelihood

From Imitation to Optimization

SFT optimizes:

max E_{(x,y)~D} [log p(y|x)]

Where D is a dataset of (question, answer) pairs.

RL optimizes:

max E_{x~D, y~π} [R(x, y)]

Where π is the policy (our model), and R(x,y) is a reward function scoring how good response y is for question x.

Key difference: SFT learns from fixed demonstrations. RL learns from its own generations and improves based on feedback.
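
The contrast is easiest to see in a toy example. The sketch below is not nanochat code: the "policy" is just a trainable categorical distribution over four candidate answers, but the two gradient updates mirror the two objectives above.

import torch
import torch.nn.functional as F

# Toy "policy": a trainable categorical distribution over 4 candidate answers.
logits = torch.zeros(4, requires_grad=True)

# SFT-style update: imitate a fixed demonstration (say, answer index 2).
demo = torch.tensor([2])
sft_loss = F.cross_entropy(logits.unsqueeze(0), demo)
sft_loss.backward()
print("SFT gradient:", logits.grad)   # pushes probability mass toward the demonstration

# REINFORCE-style update: sample answers, reward the correct one (say, index 3).
logits.grad = None
probs = F.softmax(logits, dim=-1)
samples = torch.multinomial(probs, num_samples=16, replacement=True)
rewards = (samples == 3).float()            # binary reward, no human labels needed
advantages = rewards - rewards.mean()       # mean baseline
logp = torch.log(probs)[samples]
rl_loss = -(advantages * logp).mean()
rl_loss.backward()
print("RL gradient:", logits.grad)    # pushes mass toward whatever earned reward

The SFT gradient always points at the demonstration; the RL gradient points at whatever the model itself found that scored well, which is exactly what lets it exceed the demonstrations.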

Why RL After SFT?

Consider this GSM8K problem:

Question: Weng earns $12/hour babysitting. Yesterday she did 50 minutes. How much did she earn?

SFT might produce:
"She worked 50 minutes which is 50/60 hours, so 12 * 50/60 = $10"
(Correct reasoning, correct answer)

But RL can explore:
"First convert to hours: 50 minutes = 50/60 = 0.833 hours
Then multiply: 12 * 0.833 = $10"
(Alternative valid approach)

Or discover:
"12/60 = $0.2 per minute
50 * 0.2 = $10"  
(More direct solution)

RL explores the solution space and learns which strategies work best, potentially finding better approaches than the training demonstrations.

GRPO simplifies RLHF by removing the reward model

Simplified GRPO → REINFORCE

From scripts/chat_rl.py:

"""
I put GRPO in quotes because we actually end up with something a lot
simpler and more similar to just REINFORCE:
 
1) Delete trust region, so there is no KL regularization to a reference model
2) We are on policy, so there's no need for PPO ratio+clip.
3) We use GAPO style normalization that is token-level, not sequence-level.
4) Instead of z-score normalization (r - mu)/sigma, only use (r - mu) as the advantage.
"""

Translation: nanochat uses vanilla policy gradients (REINFORCE) with mean-baseline advantages. No PPO complexity, no reference model, just simple and effective RL.
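
Point 4 is worth a concrete look. With binary rewards, dividing by the group's standard deviation (the z-score used in standard GRPO) can inflate advantages when a group is nearly all-correct or all-wrong, while the mean-only baseline stays bounded. A small illustration with made-up rewards:

import torch

rewards = torch.tensor([1., 1., 1., 1., 1., 1., 1., 0.])  # one lucky group of 8 rollouts

mu, sigma = rewards.mean(), rewards.std()
zscore_adv = (rewards - mu) / (sigma + 1e-8)   # standard GRPO-style advantage
mean_adv = rewards - mu                        # nanochat's simplification

print(zscore_adv)  # ~[0.35, ..., -2.47]: the single failure gets a large magnitude
print(mean_adv)    # [0.125, ..., -0.875]: always bounded in [-1, 1] for binary rewards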

Core RL Hyperparameters

source = "sft"                    # Start from SFT model (not base)
device_batch_size = 8            # Max samples per forward pass
examples_per_step = 16           # Training examples per gradient step
num_samples = 16                 # Samples per example (for advantage estimation)
max_new_tokens = 256             # Max response length
temperature = 1.0                # Sampling temperature
top_k = 50                       # Top-k sampling
 
# Learning rates (same structure as SFT)
unembedding_lr = 0.004
embedding_lr = 0.2
matrix_lr = 0.02
weight_decay = 0.0
init_lr_frac = 0.05              # Start at 5% of base LR
 
num_epochs = 1                   # One pass through GSM8K train set
save_every = 60                  # Checkpoint frequency
eval_every = 60                  # Evaluation frequency
eval_examples = 400              # Examples for pass@k evaluation

Key insight: We start from the SFT model, not the base model. SFT provides a good initialization—the model already knows how to format responses. RL fine-tunes the reasoning quality.

For your training pipeline, this means: never skip SFT. Going straight from base model to RL usually fails—the model doesn't know how to structure responses, so it can't explore effectively.

For your compute budget, this means: SFT is cheap insurance for RL success. One hour of SFT can save days of failed RL experiments.

Reward Model Playground

(Interactive demo on the original page: it scores a sample answer to "What's the best way to learn programming?" across dimensions such as helpfulness, specificity, tone, safety, coherence, and relevance, combining them into an overall reward of about 0.84.)

How Reward Models Work

  • Trained on human preference data (response A vs B rankings)
  • Learn to predict which responses humans would prefer
  • Used to provide reward signal during RLHF training
  • Must balance helpfulness with safety constraints

GSM8K provides verifiable math rewards for training

GSM8K (Grade School Math 8K) is perfect for RL because:

  1. Clear success criterion: Either the final answer is correct or not (binary reward)
  2. Tool use: Problems require calculator operations, testing agent capabilities
  3. Diverse strategies: Multiple valid solution paths exist

Example from tasks/gsm8k.py:

Question:
Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much did she earn?

Answer:
Weng earns 12/60 = $<<12/60=0.2>>0.2 per minute.
Working 50 minutes, she earned 0.2 x 50 = $<<0.2*50=10>>10.
#### 10

Notice the <<...>> tags—these are calculator tool calls. The model learns to invoke tools and use their outputs.

Conversation Format for GSM8K

The dataset is converted to conversations:

messages = [
    {"role": "user", "content": "Weng earns $12 an hour..."},
    {"role": "assistant", "content": [
        {"type": "text", "text": "Weng earns 12/60 = $"},
        {"type": "python", "text": "12/60"},
        {"type": "python_output", "text": "0.2"},
        {"type": "text", "text": "0.2 per minute.\nWorking 50 minutes, she earned 0.2 x 50 = $"},
        {"type": "python", "text": "0.2*50"},
        {"type": "python_output", "text": "10"},
        {"type": "text", "text": "10.\n#### 10"},
    ]}
]

The tokenizer renders this with special tokens:

<|user_start|>Weng earns $12 an hour...<|user_end|>
<|assistant_start|>Weng earns 12/60 = $<|python_start|>12/60<|python_end|><|output_start|>0.2<|output_end|>0.2 per minute...

Rollouts sample multiple responses per prompt for comparison

The Rollout Loop

The core of RL training is generating rollouts—sampling completions from the current policy:

@torch.no_grad()
def get_batch():
    assistant_end = tokenizer.encode_special("<|assistant_end|>")
    rank_indices = range(ddp_rank, len(train_task), ddp_world_size)
    
    for example_idx in itertools.cycle(rank_indices):
        # Get the conversation
        conversation = train_task[example_idx]
        
        # Tokenize: remove assistant message, prime for completion
        tokens = tokenizer.render_for_completion(conversation)
        prefix_length = len(tokens)
        
        # Generate num_samples completions
        model.eval()
        generated_token_sequences = []
        masks = []
        
        num_sampling_steps = num_samples // device_batch_size
        for sampling_step in range(num_sampling_steps):
            seed = hash((step, example_idx, sampling_step)) & 0x7FFFFFFF
            
            try:
                with autocast_ctx:
                    sequences_batch, masks_batch = engine.generate_batch(
                        tokens,
                        num_samples=device_batch_size,
                        max_tokens=max_new_tokens,
                        temperature=temperature,
                        top_k=top_k,
                        seed=seed,
                    )
                generated_token_sequences.extend(sequences_batch)
                masks.extend(masks_batch)
                
            except RuntimeError as e:
                if "out of memory" in str(e):
                    logging.error(f"OOM during rollout at step {step}, example {example_idx}. Clearing cache.")
                    torch.cuda.empty_cache()
                    # Reduce batch size for this rollout
                    device_batch_size = max(1, device_batch_size // 2)
                    continue
                else:
                    raise e

Key points:

  1. Multiple samples per question: We generate 16 completions for each question to estimate advantage
  2. Batched generation: Use device_batch_size=8 to avoid OOM, run 2 sampling steps (8×2=16)
  3. Deterministic seeds: Reproducibility via hash((step, example_idx, sampling_step))
  4. Masks track generation: mask=1 for sampled tokens, mask=0 for forced tokens (prompts, tool outputs)
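
A toy, hand-written illustration of point 4 (token strings stand in for token ids; the real masks come from engine.generate_batch):

# Toy illustration only: token strings stand in for token ids.
tokens = ["<|user_start|>", "Weng", "earns", "...", "<|user_end|>", "<|assistant_start|>",
          "Weng", "earns", "12/60", "=", "$", "<|python_start|>", "12/60", "<|python_end|>",
          "<|output_start|>", "0.2", "<|output_end|>",
          "0.2", "per", "minute", "..."]
mask = [0]*6 + [1]*8 + [0]*3 + [1]*4   # 0 = prompt / forced tool output, 1 = sampled

assert len(tokens) == len(mask)
# Only positions with mask == 1 contribute to the policy-gradient loss;
# the prompt and the calculator's forced output are excluded.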

Reward Calculation

After generating completions, compute rewards:

rewards = []
for sample_tokens in generated_token_sequences:
    # Extract generated response (after prompt)
    generated_tokens = sample_tokens[prefix_length:]
    generated_text = tokenizer.decode(generated_tokens)
    
    # Calculate reward
    reward = train_task.reward(conversation, generated_text)
    rewards.append(reward)

The reward() function extracts the final answer:

def reward(self, conversation, assistant_response):
    """Binary reward: 1.0 if answer is correct, 0.0 otherwise"""
    is_correct = self.evaluate(conversation, assistant_response)
    return float(is_correct)

Extraction logic:

GSM_RE = re.compile(r"#### (\-?[0-9\.\,]+)")
 
def extract_answer(completion):
    """Extract numerical answer after #### marker"""
    match = GSM_RE.search(completion)
    if match:
        match_str = match.group(1).strip().replace(",", "")
        return match_str
    return None

Example:

Response: "She earned 0.2 * 50 = $10. #### 10"
Extracted: "10"
Ground truth: "10"
Reward: 1.0 ✓

Advantage Estimation

Simple mean-baseline advantage:

rewards = torch.tensor(rewards, dtype=torch.float, device=device)  # (B,)
mu = rewards.mean()
advantages = rewards - mu  # Simple baseline, no z-score normalization

Why mean baseline?

  • Reduces variance in gradient estimates
  • Centers advantages around zero (positive for above-average, negative for below-average)
  • Simpler than z-score (r - mu)/sigma, which can be unstable with binary rewards

Example with 16 samples:

Rewards: [0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1]
Mean: 0.5625
Advantages: [-0.56, 0.44, -0.56, -0.56, 0.44, 0.44, -0.56, -0.56, 0.44, -0.56, 0.44, 0.44, -0.56, 0.44, 0.44, 0.44]

Samples with correct answers get positive advantages (upweighted), incorrect get negative (downweighted).

Collation and Padding

Pad sequences to uniform length for batching:

max_length = max(len(seq) for seq in generated_token_sequences)
padded_sequences = [seq + [assistant_end] * (max_length - len(seq)) 
                    for seq in generated_token_sequences]
padded_masks = [mask + [0] * (max_length - len(mask)) 
                for mask in masks]
 
ids = torch.tensor(padded_sequences, dtype=torch.long, device=device)
mask_ids = torch.tensor(padded_masks, dtype=torch.long, device=device)
 
# Autoregressive setup
inputs = ids[:, :-1]
targets = ids[:, 1:].clone()
targets[mask_ids[:, 1:] == 0] = -1  # Mask prompt and tool outputs

Masking strategy:

  • Prompt tokens: mask=0target=-1 (not trained on)
  • Tool outputs: mask=0target=-1 (not trained on)
  • Sampled tokens: mask=1target=token_id (trained on)

For your RL setup, this means: mask everything the model didn't generate. Training on forced tokens (prompts, tool outputs) leaks supervision and defeats the purpose of exploration.

This ensures we only optimize the model's own generated text, not forced context.

RL Training Visualizer

(Interactive visualization on the original page that steps through RLHF/PPO training dynamics over training steps.)

RLHF Key Concepts

  • Reward: Score from reward model indicating response quality
  • KL Divergence: Distance from reference model (prevents reward hacking)
  • PPO Clip: Limits policy updates for stable training
  • Entropy: Measures response diversity (too low = mode collapse)

Policy gradients update the model using advantage-weighted log-probs

The PG Objective

for example_step in range(examples_per_rank):
    # Get batch for one training example
    sequences_all, inputs_all, targets_all, rewards_all, advantages_all = next(batch_iterator)
    
    model.train()
    
    # Process in device_batch_size chunks
    num_passes = inputs_all.size(0) // device_batch_size
    for pass_idx in range(num_passes):
        b0, b1 = pass_idx * device_batch_size, (pass_idx + 1) * device_batch_size
        inputs = inputs_all[b0:b1]
        targets = targets_all[b0:b1]
        advantages = advantages_all[b0:b1]
        
        try:
            # Calculate log probabilities
            with autocast_ctx:
                logp = -model(inputs, targets, loss_reduction='none').view_as(inputs)  # (B, T)
            
            # Check for NaN in loss (indicates instability)
            if torch.isnan(logp).any():
                logging.warning(f"NaN detected in policy gradient at step {step}, pass {pass_idx}. Skipping.")
                optimizer.zero_grad()
                continue
            
            # Policy gradient objective
            pg_obj = (logp * advantages.unsqueeze(-1)).sum()
            
            # Normalize by valid tokens and batch structure
            num_valid = (targets >= 0).sum().clamp(min=1)
            pg_obj = pg_obj / (num_valid * num_passes * examples_per_rank)
            
            # Minimize negative objective (maximize objective)
            loss = -pg_obj
            loss.backward()
            
        except RuntimeError as e:
            if "out of memory" in str(e):
                logging.error(f"OOM during backward pass at step {step}. Clearing cache.")
                torch.cuda.empty_cache()
                optimizer.zero_grad()
                continue
            else:
                raise e

Mathematical breakdown:

  1. Log probability: model(inputs, targets) returns negative log-likelihood (NLL), so we negate to get log-prob
  2. Weighted by advantage: logp * advantages upweights correct samples, downweights incorrect
  3. Summed over tokens: Each token in the sequence contributes to the objective
  4. Normalized: Divide by number of valid tokens to make loss scale-invariant
  5. Gradient ascent: Maximize objective → minimize -pg_obj

Why No PPO?

Traditional PPO uses:

ratio = torch.exp(logp - old_logp)
clipped_ratio = torch.clamp(ratio, 1-epsilon, 1+epsilon)
loss = -torch.min(ratio * advantages, clipped_ratio * advantages).mean()

nanochat skips this because:

  1. On-policy: We sample from the current policy, so ratio ≈ 1 anyway
  2. Simplicity: Clipping adds hyperparameters (epsilon) and complexity
  3. Stability: The mean baseline and small learning rates provide sufficient stability

When would you need PPO? Off-policy learning (reusing old samples) or very large policy updates.

For your RL implementation, this means: start with vanilla policy gradients before adding PPO complexity. If you're sampling fresh from your policy each step and using conservative learning rates, the clipping machinery adds hyperparameters without measurable benefit.

No Reference Model KL

Some RLHF methods add KL divergence to a reference model:

loss = -pg_obj + beta * KL(π, π_ref)

nanochat skips this:

  1. Trust in SFT initialization: Starting from a good SFT model reduces need for regularization
  2. Conservative learning rates: init_lr_frac=0.05 means small updates
  3. Short training: 1 epoch doesn't allow much divergence

When would you need KL regularization? Long training runs or when the reward function is exploitable (e.g., a learned reward model that can be gamed).
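
If you do need it, a per-token KL penalty is a small addition to the objective. The sketch below is not part of nanochat's script; it assumes a frozen copy of the SFT checkpoint, ref_model, with the same calling convention as model, and it reuses inputs, targets, and advantages from the training-loop snippet above. The coefficient beta is an assumed value.

beta = 0.05  # KL coefficient (assumed value, tune per task)

with autocast_ctx:
    logp = -model(inputs, targets, loss_reduction='none').view_as(inputs)
with torch.no_grad(), autocast_ctx:
    ref_logp = -ref_model(inputs, targets, loss_reduction='none').view_as(inputs)

valid = (targets >= 0).float()
# Sample-based estimate of KL(pi || pi_ref) on the tokens the model generated
kl = ((logp - ref_logp) * valid).sum() / valid.sum().clamp(min=1)

pg_obj = (logp * advantages.unsqueeze(-1) * valid).sum() / valid.sum().clamp(min=1)
loss = -(pg_obj - beta * kl)
loss.backward()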

Tool use in RL: the calculator offloads arithmetic

Tool State Machine in the Engine

From nanochat/engine.py:

if next_token == python_start:
    state.in_python_block = True
    state.python_expr_tokens = []
elif next_token == python_end and state.in_python_block:
    state.in_python_block = False
    if state.python_expr_tokens:
        expr = self.tokenizer.decode(state.python_expr_tokens)
        result = use_calculator(expr)
        if result is not None:
            result_tokens = self.tokenizer.encode(str(result))
            state.forced_tokens.append(output_start)
            state.forced_tokens.extend(result_tokens)
            state.forced_tokens.append(output_end)
    state.python_expr_tokens = []
elif state.in_python_block:
    state.python_expr_tokens.append(next_token)

State machine:

  1. Model generates <|python_start|>
  2. Engine enters "tool mode", collects tokens
  3. Model generates <|python_end|>
  4. Engine evaluates the expression with use_calculator()
  5. Engine forces the result tokens into the sequence
  6. Model continues generating, incorporating the tool output

Safe Calculator Evaluation

def use_calculator(expr):
    """Evaluate a math expression safely"""
    expr = expr.replace(",", "")
    # Only allow numeric chars and basic operators
    if any([x not in "0123456789*+-/.() " for x in expr]):
        return None
    # Disallow power operator (expensive)
    if "**" in expr:
        return None
    return eval_with_timeout(expr, max_time=3)

Safety measures:

  • Whitelist characters (no variables, no functions)
  • No power operator (prevents 9**9**9**9 DoS)
  • 3-second timeout (prevents infinite loops)
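
A few illustrative calls showing what these rules accept and reject (return values assume standard Python float arithmetic):

use_calculator("12/60")              # -> 0.2
use_calculator("50*0.2")             # -> 10.0
use_calculator("9**9**9")            # -> None (power operator rejected)
use_calculator("__import__('os')")   # -> None (disallowed characters)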

Why Tool Use Matters for RL

The calculator provides groundedness:

  • Correct calculations: 12/60 = 0.2 is always correct
  • Reduced hallucination: Model doesn't need to memorize arithmetic
  • Credit assignment: If the tool returns the right intermediate value, the model learns that invoking it was good

During training:

  • Tool invocation tokens: sampled by the model (mask=1), so the policy gradient trains when to call tools
  • Tool output tokens: forced by the environment (mask=0), so they are excluded from the loss

This creates a natural division: the model controls when to use tools, the environment provides what the tools return.

Pass@k measures reasoning success rate over multiple samples

What is Pass@k?

Instead of accuracy (pass@1), we measure:

Pass@k: Probability that at least one of k samples is correct

This is more forgiving and reflects real usage (users try multiple times).
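
nanochat computes pass@k directly as "is any of the first k samples correct" (shown below). If you generate n > k samples per problem, the unbiased combinatorial estimator from the HumanEval paper is a common alternative; here is a minimal version for reference:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n samples of which c are correct."""
    if n - c < k:
        return 1.0  # impossible to draw k samples without hitting a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=16, c=4, k=8))  # chance that at least one of 8 drawn samples is correct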

Implementation

if step % eval_every == 0:
    model.eval()
    passk = torch.zeros(device_batch_size, device=device)
    
    with autocast_ctx:
        records_iter = run_gsm8k_eval(val_task, tokenizer, engine, 
                                       num_samples=device_batch_size, 
                                       max_examples=eval_examples, 
                                       temperature=1.0)
        records = list(records_iter)
    
    # Calculate pass@k for k=1..device_batch_size
    for k in range(1, device_batch_size + 1):
        passk[k-1] = sum(any(o["is_correct"] for o in r["outcomes"][:k]) 
                         for r in records)
    
    # Aggregate counts across ranks
    num_records = torch.tensor(len(records), dtype=torch.float, device=device)
    if ddp:
        dist.all_reduce(passk, op=dist.ReduceOp.SUM)
        dist.all_reduce(num_records, op=dist.ReduceOp.SUM)
    
    passk = passk / num_records

Interpretation:

Step 0 | Pass@1: 0.2500, Pass@2: 0.3750, Pass@4: 0.5000, Pass@8: 0.6250
  • 25% of problems are solved on first try
  • 62.5% are solved if you try 8 times
  • Diversity in sampling helps

Why Temperature=1.0 for Evaluation?

During training: temperature=1.0 (explore diverse solutions).
During evaluation: temperature=1.0 (measure diverse solution quality).

If we used temperature=0.0 (greedy), we'd only measure the single best solution, missing the model's ability to find correct answers through different paths.

Linear learning-rate decay prevents instability in policy gradients

Linear Decay

def get_lr_multiplier(it):
    lrm = 1.0 - it / num_steps
    return lrm
 
# Apply each step
lrm = get_lr_multiplier(step)
for opt in optimizers:
    for group in opt.param_groups:
        group["lr"] = group["initial_lr"] * lrm

Schedule visualization (1000 steps):

Step    0: lrm=1.000 → lr=100%
Step  250: lrm=0.750 → lr=75%
Step  500: lrm=0.500 → lr=50%
Step  750: lrm=0.250 → lr=25%
Step 1000: lrm=0.000 → lr=0%

Why linear for RL?

  • RL training is inherently noisy (rewards are sparse/binary)
  • Gradual reduction prevents large updates late in training
  • Reaches zero at the end, ensuring convergence

Step-by-step: from SFT checkpoint to RL training

Step 1: Ensure You Have an SFT Model

# Check for SFT checkpoint
ls -la chatsft_checkpoints/
 
# Should see: d12/ or d24/ with model_step_*.pt

If you don't have one, run SFT first (see Fine-tuning for Chat).

Step 2: Configure RL Training

Create rl_config.txt:

run = "gsm8k_rl_run1"
source = "sft"
num_epochs = 1
examples_per_step = 16
device_batch_size = 8
num_samples = 16
temperature = 1.0
init_lr_frac = 0.05
eval_every = 30
save_every = 30

Step 3: Launch Training

Single GPU:

python -m scripts.chat_rl --config rl_config.txt

Multi-GPU (8 GPUs):

torchrun --standalone --nproc_per_node=8 -m scripts.chat_rl -- --config rl_config.txt

Step 4: Monitor Training

Watch for:

Step 0 | Pass@1: 0.2500, Pass@2: 0.3750, Pass@4: 0.5000, Pass@8: 0.6250
Step 0/500 | Example step 0 | Pass 0 | loss: 0.532145 | Average reward: 0.5625
Step 0/500 | Average reward: 0.5625 | Average sequence length: 147.23

Step 60 | Pass@1: 0.3125, Pass@2: 0.4375, Pass@4: 0.5625, Pass@8: 0.6875
Step 60/500 | Example step 0 | Pass 0 | loss: 0.421876 | Average reward: 0.6250
Step 60/500 | Average reward: 0.6250 | Average sequence length: 138.45

Step 120 | Pass@1: 0.3750, Pass@2: 0.5000, Pass@4: 0.6250, Pass@8: 0.7500

Good signs:

  • Pass@k metrics increasing over time
  • Average reward improving
  • Loss decreasing (but can be noisy)
  • Sequence length stabilizing (model not degenerating to very short/long outputs)

Bad signs:

  • Pass@k plateauing or decreasing (model not learning)
  • Average reward = 0.0 or 1.0 (reward function broken)
  • Sequence length → 0 or → max_tokens (model collapsing)

Advanced: custom rewards, dense rewards, and reward shaping

Custom Reward Functions

Binary rewards are simple but coarse. You can design richer rewards:

def reward(self, conversation, assistant_response):
    # Base reward: correctness
    is_correct = self.evaluate(conversation, assistant_response)
    reward = float(is_correct)
    
    # Bonus for shorter solutions
    length = len(assistant_response)
    reward += 0.1 * max(0, 1 - length / 500)
    
    # Bonus for showing work
    if "<<" in assistant_response:  # Used calculator
        reward += 0.05
    
    # Penalty for format violations
    if "####" not in assistant_response:
        reward -= 0.2
    
    return reward

Design principles:

  • Dominant term: Correctness should be the main driver
  • Small bonuses: Auxiliary rewards should be 10-20% of main reward
  • Avoid exploitation: Don't reward superficial patterns (e.g., just typing "####" without solving)

Dense Rewards

Instead of binary outcome, reward intermediate progress:

def reward(self, conversation, assistant_response):
    # Full credit for a correct final answer
    if self.evaluate(conversation, assistant_response):
        return 1.0
    
    # Otherwise, partial credit for correct intermediate calculator steps
    calc_results = re.findall(r'<<(.+?)=(.+?)>>', assistant_response)
    correct_steps = 0
    for expr, result in calc_results:
        try:
            expected = eval(expr)  # expressions follow the whitelisted calculator format
            if abs(float(result) - expected) < 0.01:
                correct_steps += 1
        except Exception:
            continue  # malformed step, no credit
    
    return 0.1 * correct_steps

This provides feedback even when the final answer is wrong, helping the model learn intermediate steps.

Multi-Objective Optimization

Combine multiple reward signals:

rewards_dict = {
    "correctness": 1.0 if correct else 0.0,
    "efficiency": -len(response) / 1000,  # Shorter is better
    "clarity": count_explanation_sentences(response) / 10,
}
 
# Weighted sum
reward = (
    1.0 * rewards_dict["correctness"] +
    0.1 * rewards_dict["efficiency"] +
    0.05 * rewards_dict["clarity"]
)

Track individual components in logging to diagnose what the model is optimizing for.
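
A minimal logging sketch for that: compute_reward_components is a hypothetical helper returning a dict like rewards_dict above.

from collections import defaultdict

def log_reward_components(responses, step):
    """Print the mean of each reward component across one rollout group."""
    sums = defaultdict(float)
    for response in responses:
        for name, value in compute_reward_components(response).items():  # hypothetical helper
            sums[name] += value
    for name, total in sums.items():
        print(f"step {step} | reward/{name}: {total / len(responses):.3f}")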

Reward Shaping Pitfalls

Pitfall 1: Overfitting to Proxies

# BAD: Reward word count
reward += 0.01 * word_count  # Model learns to be verbose

Pitfall 2: Conflicting Signals

# BAD: Contradictory rewards
reward += 0.1 if len(response) < 100 else 0  # Reward brevity
reward += 0.1 if "detailed explanation" in response else 0  # Reward detail

Pitfall 3: Reward Hacking

# BAD: Exploitable pattern
reward += 0.5 if "Therefore, the answer is" in response else 0
# Model learns to always write this phrase regardless of correctness

Solution: Always tie rewards to outcome metrics (accuracy, user satisfaction, etc.) and validate on held-out sets.

SFT teaches format, RL discovers novel solutions

| Aspect | SFT | RL |
| --- | --- | --- |
| Objective | Imitate demonstrations | Maximize reward |
| Data | Fixed (x, y) pairs | Dynamic (generate y, score it) |
| Training signal | Cross-entropy loss | Policy gradient |
| Exploration | None (teacher forcing) | Sampling-based |
| Strengths | Stable, fast, learns format | Discovers novel solutions |
| Weaknesses | Limited by demos | Noisy, slower, can diverge |
| Use case | Teach format & basics | Optimize for quality |

Best Practice: SFT first (provides good initialization), then RL (optimizes quality).

These debugging patterns reveal common RL failures

Reward Distribution Analysis

# During rollout generation
print(f"Reward distribution: {torch.bincount(rewards.long())}")
# Output: tensor([10, 6])  → 10 wrong, 6 correct
 
# Calculate success rate
success_rate = rewards.sum() / len(rewards)
print(f"Success rate: {success_rate:.2%}")

What to look for:

  • Early training: ~25-50% success (better than random, not perfect)
  • Late training: ~60-80% success (strong performance)
  • All zeros: Reward function broken or task too hard
  • All ones: Reward function broken or task too easy

Gradient Norms

# After loss.backward()
total_norm = 0
for p in model.parameters():
    if p.grad is not None:
        param_norm = p.grad.data.norm(2)
        total_norm += param_norm.item() ** 2
total_norm = total_norm ** 0.5
print(f"Gradient norm: {total_norm:.4f}")

Typical values:

  • Healthy: 0.1 - 10.0
  • Too small (<0.01): Learning rate too low or gradients vanishing
  • Too large (>100): Learning rate too high or exploding gradients
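
If norms regularly exceed this range, gradient clipping is a standard safeguard. nanochat's RL script does not necessarily need it, but PyTorch's built-in utility both rescales the gradients in place and returns the pre-clip norm, so it can replace the manual loop above:

import torch

# Clip to a maximum L2 norm of 1.0 (assumed threshold) and log the pre-clip value.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
print(f"Gradient norm (pre-clip): {total_norm.item():.4f}")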

Advantage Distribution

advantages = rewards - rewards.mean()
print(f"Advantages: min={advantages.min():.3f}, max={advantages.max():.3f}, std={advantages.std():.3f}")

Healthy distribution:

Advantages: min=-0.563, max=0.438, std=0.501

Centered around zero, reasonable spread.

Degenerate:

Advantages: min=0.000, max=0.000, std=0.000

All rewards identical—model not exploring or task trivial.
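
A common guard in GRPO-style training (not necessarily present in nanochat's script) is to skip groups whose rewards are all identical, since their advantages are exactly zero and the backward pass would contribute nothing:

# Inside the rollout loop, before collation:
if rewards.std() == 0:
    continue  # all samples got the same reward; skip to the next example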

Sequence Length Tracking

sequence_lengths = [len(seq) for seq in generated_token_sequences]
print(f"Seq lengths: min={min(sequence_lengths)}, max={max(sequence_lengths)}, mean={sum(sequence_lengths)/len(sequence_lengths):.1f}")

Healthy:

Seq lengths: min=87, max=203, mean=145.3

Degenerate:

Seq lengths: min=3, max=5, mean=4.1  → Model collapsed (just says "####0")
Seq lengths: min=256, max=256, mean=256.0  → Model rambling (hitting max_tokens)

Expect 50%+ GSM8K pass@k after one epoch (~525 steps)

Training Times

On 8× A100 GPUs (80GB):

| Metric | Value |
| --- | --- |
| Examples per step | 16 |
| Samples per example | 16 |
| Total sequences per step | 256 |
| Steps per epoch (GSM8K) | ~525 |
| Time per step | ~30 seconds |
| Total training time | ~4.5 hours |
| Cost (AWS) | ~$110 |

Expected Improvements

Starting from SFT model:

| Metric | SFT | After RL | Delta |
| --- | --- | --- | --- |
| GSM8K Pass@1 | 15-25% | 30-40% | +15% |
| GSM8K Pass@4 | 25-35% | 50-60% | +25% |
| GSM8K Pass@8 | 30-40% | 60-70% | +30% |

NOTE

These are rough estimates for a 12-layer model. Larger models see bigger gains.

Memory Requirements

Per GPU:

| Component | Memory |
| --- | --- |
| Model (BF16) | ~450 MB |
| Rollout generation (batch=8) | ~8 GB |
| Forward pass (batch=8) | ~8 GB |
| Gradients | ~450 MB |
| Total | ~17 GB |

RL is more memory-intensive than SFT due to rollout generation.

These pitfalls cause reward hacking and mode collapse

1. Starting from Base Model

Symptom: Model generates gibberish or doesn't follow format.

Solution: Always start from SFT. The base model doesn't know conversation structure.

2. Insufficient Samples per Example

Symptom: High variance in rewards, unstable training.

Solution: Increase num_samples (try 16-32). More samples = better advantage estimates.

3. Learning Rate Too High

Symptom: Pass@k oscillates wildly or collapses to zero.

Solution: Reduce init_lr_frac (try 0.02-0.05) or lower the base learning rates.

4. Reward Hacking

Symptom: Model achieves high reward but produces nonsense.

Example: Model learns to output "####" followed by random numbers, getting partial credit.

Solution: Make reward function more robust—check that the reasoning is present, not just the format.

5. Mode Collapse

Symptom: All generated responses become identical.

Solution:

  • Increase temperature (try 1.0-1.2)
  • Add entropy bonus to reward: reward += 0.01 * entropy (see the sketch after this list)
  • Reduce training duration (model is overfitting)
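
The entropy bonus mentioned above needs access to the token distribution. nanochat's model(inputs, targets) call returns a loss, so this is a hedged sketch that assumes a forward pass exposing logits of shape (B, T, V); the coefficient is an assumed value.

import torch
import torch.nn.functional as F

def entropy_bonus(logits, valid_mask, coef=0.01):
    """Mean per-token entropy over generated tokens, scaled by a small coefficient."""
    logp = F.log_softmax(logits, dim=-1)
    entropy = -(logp.exp() * logp).sum(dim=-1)                     # (B, T) per-token entropy
    entropy = (entropy * valid_mask).sum() / valid_mask.sum().clamp(min=1)
    return coef * entropy

# loss = -(pg_obj + entropy_bonus(logits, (targets >= 0).float()))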

The same pattern works for coding, reasoning, and tool use

Other Tasks for RL

Coding (HumanEval):

def reward(self, conversation, assistant_response):
    # Extract code, run test cases
    code = extract_code_block(assistant_response)
    test_results = run_tests(code, self.test_cases)
    return float(all(test_results))

Instruction Following:

def reward(self, conversation, assistant_response):
    # Check if response follows constraints
    instruction = parse_instruction(conversation)
    follows_format = check_format(assistant_response, instruction.format)
    includes_keywords = check_keywords(assistant_response, instruction.keywords)
    return 0.5 * follows_format + 0.5 * includes_keywords

Conversational Quality:

def reward(self, conversation, assistant_response):
    # Use a reward model (small LM trained on human preferences)
    return reward_model.score(conversation + [assistant_response])

Multi-Turn RL

For dialogue, optimize over full conversations:

def get_batch_multiturn():
    # Start with a conversation truncated before the final assistant turn
    conversation = sample_conversation(max_turns=3)
    
    # Generate candidate responses for turn N
    tokens = tokenizer.render_for_completion(conversation)
    sequences, masks = engine.generate_batch(tokens, num_samples=16)
    responses = [tokenizer.decode(seq[len(tokens):]) for seq in sequences]
    
    # Reward each candidate in the context of the full conversation
    rewards = [reward_conversation(conversation + [response]) for response in responses]
    
    yield sequences, rewards

State of the Art: What's Next?

Constitutional AI

Train models to self-critique and revise:

1. Generate initial response
2. Critique: "What's wrong with this answer?"
3. Revise: Generate improved response
4. Reward revision

Outcome-Supervised RL

Instead of rewarding final answers, reward intermediate reasoning:

Reward:
- Each correct reasoning step: +0.1
- Correct final answer: +1.0
- Self-correction: +0.2

Learned Reward Models

Instead of hand-coded rewards, train a model on human preferences:

1. Collect human comparisons: "Response A is better than Response B"
2. Train reward model to predict human preferences
3. Use reward model in RL loop

This is the "HF" (Human Feedback) in RLHF!
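
The standard objective for step 2 is a pairwise preference (Bradley-Terry) loss that pushes the chosen response's score above the rejected one. A minimal sketch with dummy scalar scores:

import torch
import torch.nn.functional as F

def reward_model_loss(score_chosen, score_rejected):
    """Pairwise preference loss used to train reward models on A-vs-B comparisons."""
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Dummy scores for a batch of 3 comparisons
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.4, 0.5, 1.1])
print(reward_model_loss(chosen, rejected))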

RL goes beyond imitation—models discover solutions through exploration

The key insights:

  1. Build on SFT: Start from a strong instruction-following model
  2. Simple works: Vanilla policy gradients (REINFORCE) are effective
  3. Tool integration: Calculator provides grounding for math reasoning
  4. Pass@k evaluation: Measures solution diversity, not just single-path accuracy
  5. Reward design matters: Binary rewards are simple; consider shaping for complex tasks

nanochat's RL implementation is minimalistic yet powerful. By understanding these principles, you can adapt it to your own tasks—coding, instruction following, creative writing, or any domain where "better" is easier to judge than to demonstrate.

Imitation gets you started. Exploration gets you ahead.

The next post covers building custom evaluation tasks: creating benchmarks that measure what truly matters for your use case.

Sources


Research Papers

  • Training language models to follow instructions with human feedback (InstructGPT) (Ouyang et al., 2022) - The foundational paper introducing RLHF for language models, showing that fine-tuning with human feedback produces outputs preferred over 175B GPT-3 despite using 100x fewer parameters. arXiv:2203.02155

  • Proximal Policy Optimization Algorithms (PPO) (Schulman et al., 2017) - Introduces the PPO algorithm used in most RLHF implementations, providing a simpler and more stable alternative to TRPO for policy gradient optimization. arXiv:1707.06347

  • Direct Preference Optimization (DPO) (Rafailov et al., 2023) - Presents an alternative to RLHF that eliminates the need for a separate reward model by directly optimizing preferences with a classification loss. Published at NeurIPS 2023. arXiv:2305.18290

  • DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (Shao et al., 2024) - Introduces Group Relative Policy Optimization (GRPO), a memory-efficient variant of PPO that nanochat's RL implementation is based on. arXiv:2402.03300

  • Training Verifiers to Solve Math Word Problems (GSM8K) (Cobbe et al., 2021) - Introduces the GSM8K dataset of 8.5K grade school math word problems used for RL training in nanochat, along with the verifier-based approach for improving reasoning. arXiv:2110.14168

  • Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning (REINFORCE) (Williams, 1992) - The foundational policy gradient algorithm that nanochat's simplified GRPO implementation is based on. Machine Learning 8, 229-256

  • Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022) - RLAIF alternative using AI-generated preferences. arXiv:2212.08073

Reward Models and Alignment

  • Learning to Summarize with Human Feedback (Stiennon et al., 2020) - Early RLHF work demonstrating reward model training. arXiv:2009.01325
  • Kahneman-Tversky Optimization (KTO) (Ethayarajh et al., 2024) - Alternative to DPO without requiring preference pairs. arXiv:2402.01306

Technical Documentation

  • OpenAI Spinning Up: Policy Gradient Methods - Comprehensive tutorial on policy gradient algorithms including REINFORCE and advantage estimation techniques. Spinning Up RL

  • Hugging Face TRL Library - Open-source library for training language models with reinforcement learning, implementing PPO, DPO, and other RLHF methods. TRL GitHub

Datasets & Benchmarks

  • GSM8K Dataset - Grade School Math 8K dataset with 8,500 linguistically diverse math word problems, providing verifiable binary rewards for RL training. OpenAI GSM8K GitHub

  • MATH Dataset - Competition-level mathematics problems for evaluating advanced reasoning capabilities. Hendrycks et al., MATH GitHub

  • Anthropic HH-RLHF - Human preference dataset for helpful and harmless training. Anthropic HH-RLHF



Part of the nanochat Deep-Dive Series • Track 2: Practical Guides

GitHub: nanochat repository
RL Script: scripts/chat_rl.py


Before you start RL training:

  1. Start from SFT, never base model. The base model doesn't understand conversation structure—RL on it produces gibberish.
  2. Use 16+ samples per prompt. Fewer samples mean high-variance advantage estimates that destabilize training.
  3. Design robust reward functions. Binary pass/fail on final answer is fragile—consider partial credit for reasoning steps.
  4. Monitor for mode collapse. If all generated responses become identical, increase temperature or add entropy bonus.
  5. Track Pass@k alongside Pass@1. Solution diversity matters—a model with high Pass@8 but low Pass@1 is still useful.

TIP

Pass@k evaluation is a powerful metric for measuring solution diversity. Consider implementing it for your own tasks, even outside RL!

Imitation gets you started. Exploration gets you ahead.