Building Custom Evaluation Tasks

Track 2: Practical Guides - Post 2.4 of 6
This post builds on Reinforcement Learning from Human Feedback. View all posts in this track →
Standard benchmarks don't measure what you care about
Building domain-specific evaluation tasks is where I've seen the most value from nanochat's framework. The abstraction is simple enough to implement quickly, but flexible enough to handle any domain.
MMLU measures general knowledge. GSM8K measures math. Neither measures your actual use case. That's why you need custom evaluation tasks.
TL;DR: One base class with two modes (categorical and generative). CORE normalizes scores 0–1 with baseline centering. Custom tasks take 50 lines. Sandbox execution makes code evaluation safe.
The benchmark that missed the bug: Consider a common pattern: a legal-tech team trains a contract analysis model and celebrates 78% accuracy on MMLU law questions. In production, it misses 40% of non-compete clause violations—the exact use case they built it for. MMLU tests general legal knowledge; their users need specific clause detection. After building a custom evaluation task with 500 real contract excerpts, they discover the model is guessing on unfamiliar clause structures. The custom benchmark exposes the gap. Targeted fine-tuning later, clause detection hits 91%. General benchmarks tell you your model is broadly capable. Custom benchmarks tell you it actually works.
You've trained a model, fine-tuned it for chat, and optimized it with RL. But how do you know if it's actually good at what you care about? Standard benchmarks like MMLU and GSM8K are useful, but they measure general capabilities—not your specific use case.
Custom evaluation tasks let you measure what matters for your application: medical diagnosis accuracy, legal document analysis, code generation for your codebase, or any domain-specific skill.
nanochat's evaluation framework makes this straightforward. This post covers:
- The three evaluation task types: multiple choice, schema, and language modeling
- How the CORE benchmark framework works
- Building custom tasks for any domain
- Best practices for prompt engineering in evaluation
- Code execution sandboxing for safe evaluation
- Distributed evaluation for large benchmarks
One Task class handles any evaluation type
Base Task Class
All evaluation tasks in nanochat inherit from Task:
class Task:
    def __init__(self, start=0, stop=None, step=1):
        # Allows lightweight slicing over the dataset
        self.start = start
        self.stop = stop
        self.step = step

    @property
    def eval_type(self):
        # one of 'generative' | 'categorical'
        raise NotImplementedError

    def num_examples(self):
        raise NotImplementedError

    def get_example(self, index):
        # Returns a conversation dict
        raise NotImplementedError

    def evaluate(self, conversation, assistant_response):
        # Returns success (bool or float)
        raise NotImplementedError

Key design principles:
- Lightweight slicing: Create views over datasets without copying data
- Lazy loading: Examples fetched on-demand via get_example()
- Two evaluation modes: categorical (constrained choices) or generative (free-form)
- Conversation format: All tasks return standardized conversation dicts
The Two Evaluation Modes
Categorical Evaluation
When to use: Multiple choice, yes/no, classification tasks
How it works: Model assigns probabilities to predefined options, chooses highest probability
Example from MMLU:
@property
def eval_type(self):
    return 'categorical'

def get_example(self, index):
    row = self.ds[index]
    question = row["question"]
    choices = row["choices"]  # ["Choice A", "Choice B", "Choice C", "Choice D"]
    answer = row["answer"]    # 0, 1, 2, or 3
    user_message = render_mc(question, ['A', 'B', 'C', 'D'], choices)
    assistant_message = ['A', 'B', 'C', 'D'][answer]
    return {
        "messages": [
            {"role": "user", "content": user_message},
            {"role": "assistant", "content": assistant_message}
        ],
        "letters": ['A', 'B', 'C', 'D'],  # For evaluation
    }

def evaluate(self, conversation, assistant_response):
    correct_answer = conversation['messages'][-1]['content']
    return assistant_response == correct_answer

Advantages:
- Fast (no sampling required, batch multiple questions)
- Deterministic (no temperature variation)
- Easy to score (exact match)
Generative Evaluation
When to use: Open-ended tasks (code generation, math, creative writing)
How it works: Model generates free-form text, evaluated against success criteria
Example from HumanEval:
@property
def eval_type(self):
    return 'generative'

def get_example(self, index):
    row = self.ds[index]
    prompt = row['prompt']               # Function signature
    solution = row['canonical_solution']
    test = row['test']                   # Test cases
    return {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": f"{prompt}\n{solution}"},
        ],
        "entry_point": row['entry_point'],
        "test": test,
    }

def evaluate(self, conversation, completion):
    # Extract code from completion
    code = extract_program(completion)
    # Build executable program
    program = (
        imports +
        "\n\n" +
        code +
        "\n\n" +
        conversation['test'] +
        "\n" +
        f"check({conversation['entry_point']})"
    )
    # Execute and check
    result = execute_code(program)
    return result.success

Advantages:
- Flexible (measures complex capabilities)
- Realistic (mimics actual use)
- Rich feedback (can analyze failure modes)
For your custom tasks, this means: use categorical evaluation when you can (faster, deterministic), but don't force it. If your task is naturally open-ended—code generation, summarization, creative writing—generative evaluation captures what categorical misses.
For your compute budget, this means: categorical evaluation is 10-100× cheaper than generative. A single forward pass vs. potentially hundreds of tokens sampled. If you're evaluating on every training step, categorical saves real money.
CORE normalizes scores across diverse task types
What is CORE?
CORE is a compact 11-task benchmark from the DCLM paper that evaluates base models across diverse capabilities with minimal compute.
From nanochat/core_eval.py:
Tasks:
1. ARC (easy/challenge) - Science reasoning
2. HellaSwag - Commonsense reasoning
3. MMLU - Multitask knowledge
4. OpenBookQA - Elementary science
5. PIQA - Physical reasoning
6. SIQA - Social reasoning
7. WinoGrande - Coreference resolution
8. BoolQ - Yes/no questions
9. COPA - Causal reasoning
10. StoryCloze - Story completion
11. SQuAD - Reading comprehension
Why CORE matters:
- Coverage: Tests diverse reasoning types
- Efficiency: 11 tasks vs. dozens in full benchmarks
- Correlation: CORE score correlates with broader evaluation suites
- Centered metric: Accounts for random baselines (0=random, 1=perfect)
Three Task Types in CORE
CORE defines three fundamental task structures:
1. Multiple Choice Tasks
Structure: Same context, different continuations
Question: What is the capital of France?
- Paris=A
- London=B
- Berlin=C
- Madrid=D
Respond only with the letter of the correct answer.
Evaluation method: Compare log probabilities of each continuation, choose lowest loss (highest probability)
From nanochat/core_eval.py:
if task_type == 'multiple_choice':
    prompts = render_prompts_mc(item, continuation_delimiter, fewshot_examples)
    tokens, start_idxs, end_idxs = batch_sequences_mc(tokenizer, prompts)
    # ...
    # Find option with lowest average loss
    mean_losses = [losses[i, si-1:ei-1].mean().item()
                   for i, (si, ei) in enumerate(zip(start_idxs, end_idxs))]
    pred_idx = mean_losses.index(min(mean_losses))
    is_correct = pred_idx == item['gold']

Key insight: We evaluate log probabilities, not generated text. This is much faster and avoids issues with generation formatting.
2. Schema Tasks
Structure: Different contexts, same continuation
Context A: "The dog barked loudly."
Context B: "The dog slept quietly."
Context C: "The dog ran quickly."
Continuation: " It was happy."
Which context is most likely?
Use case: Sentence completion, coreference resolution
Evaluation method: Similar to multiple choice, but context varies instead of continuation
if task_type == 'schema':
    prompts = render_prompts_schema(item, continuation_delimiter, fewshot_examples)
    tokens, start_idxs, end_idxs = batch_sequences_schema(tokenizer, prompts)
    # Find context with lowest loss for the continuation
    mean_losses = [losses[i, si-1:ei-1].mean().item()
                   for i, (si, ei) in enumerate(zip(start_idxs, end_idxs))]
    pred_idx = mean_losses.index(min(mean_losses))
    is_correct = pred_idx == item['gold']

3. Language Modeling Tasks
Structure: Context + continuation, evaluate continuation likelihood
Context: "The capital of France is"
Continuation: " Paris"
Check if model assigns high probability to continuation
Use case: Reading comprehension, factual knowledge
Evaluation method: Check if argmax predictions match actual tokens
if task_type == 'language_modeling':
    prompts = render_prompts_lm(item, continuation_delimiter, fewshot_examples)
    tokens, start_idxs, end_idxs = batch_sequences_lm(tokenizer, prompts)
    # Check if all predicted tokens match actual tokens
    si, ei = start_idxs[0], end_idxs[0]
    predicted_tokens = predictions[0, si-1:ei-1]
    actual_tokens = input_ids[0, si:ei]
    is_correct = torch.all(predicted_tokens == actual_tokens).item()

Centered Metrics
Raw accuracy can be misleading when random guessing achieves high scores:
# Example: 4-choice multiple choice
raw_accuracy = 0.30      # 30% correct
random_baseline = 0.25   # 25% by guessing

# Centered accuracy
centered = (raw_accuracy - random_baseline) / (1.0 - random_baseline)
# centered = (0.30 - 0.25) / (1.0 - 0.25) = 0.067 (6.7% above random)

Interpretation:
- 0.0 = Random guessing
- 1.0 = Perfect performance
- Negative = Worse than random (model is broken)
For your evaluation pipeline, this means: always report centered accuracy, not raw accuracy. A "30% accuracy" sounds bad until you realize random is 25%—your model is actually learning something. Centered metrics make progress visible.
For your stakeholder reports, this means: centered metrics translate to business value. "7% above random" is a concrete improvement, while "30% accuracy" on a 4-choice task sounds terrible even though it means the same thing.
The CORE metric is the mean of centered accuracies across all 11 tasks.
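To make that aggregation concrete, here is a minimal sketch of computing a CORE-style score from per-task raw accuracies and random baselines. The task names and numbers are illustrative placeholders, not nanochat's actual results:

def centered_accuracy(raw, baseline):
    """Center a raw accuracy so 0 = random guessing and 1 = perfect."""
    return (raw - baseline) / (1.0 - baseline)

# Hypothetical per-task results: (raw accuracy, random baseline)
results = {
    "arc_easy":  (0.52, 0.25),
    "hellaswag": (0.31, 0.25),
    "boolq":     (0.63, 0.50),
    "squad":     (0.18, 0.01),  # language modeling tasks have a near-zero baseline
}

centered = {name: centered_accuracy(raw, base) for name, (raw, base) in results.items()}
core_score = sum(centered.values()) / len(centered)
print(f"CORE (mean centered accuracy): {core_score:.3f}")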
Multiple choice tasks: constrained evaluation in 40 lines
Example: Custom Medical Diagnosis Task
Building a medical diagnosis task from scratch:
from datasets import load_dataset
from tasks.common import Task, render_mc
class MedicalDiagnosis(Task):
    """Evaluate medical diagnosis from symptoms"""

    def __init__(self, split, **kwargs):
        super().__init__(**kwargs)
        assert split in ["train", "test"], "split must be train|test"
        # Assume you have a dataset with structure:
        # {"symptoms": str, "diagnosis": str, "options": List[str]}
        self.ds = load_dataset("your_org/medical_diagnosis", split=split)
        self.letters = ('A', 'B', 'C', 'D', 'E')  # Up to 5 options

    @property
    def eval_type(self):
        return 'categorical'

    def num_examples(self):
        return len(self.ds)

    def get_example(self, index):
        row = self.ds[index]
        symptoms = row['symptoms']
        diagnosis = row['diagnosis']
        options = row['options']  # List of possible diagnoses

        # Find which option is correct
        assert diagnosis in options, "Correct diagnosis not in options!"
        answer_idx = options.index(diagnosis)

        # Render as multiple choice
        question = f"Given the following symptoms:\n{symptoms}\n\nWhat is the most likely diagnosis?"
        user_message = render_mc(question, self.letters[:len(options)], options)
        assistant_message = self.letters[answer_idx]

        return {
            "messages": [
                {"role": "user", "content": user_message},
                {"role": "assistant", "content": assistant_message}
            ],
            "letters": self.letters[:len(options)],
            "specialty": row.get('specialty', 'general'),  # For grouping results
        }

    def evaluate(self, conversation, assistant_response):
        assert assistant_response in conversation['letters']
        correct_answer = conversation['messages'][-1]['content']
        return assistant_response == correct_answer

Usage:
from tasks.medical_diagnosis import MedicalDiagnosis
from scripts.chat_eval import run_categorical_eval

# Create task
task = MedicalDiagnosis(split="test")

# Evaluate (from scripts/chat_eval.py pattern)
accuracy = run_categorical_eval(
    task_object=task,
    tokenizer=tokenizer,
    model=model,
    batch_size=8,
    max_problems=None  # Evaluate all
)
print(f"Medical Diagnosis Accuracy: {accuracy:.2%}")

Generative tasks: free-form evaluation with verifiable answers
Example: Custom Code Generation Task
For domain-specific code generation:
from datasets import load_dataset
from tasks.common import Task
from nanochat.execution import execute_code

class CustomCodeGen(Task):
    """Evaluate code generation for your specific domain"""

    def __init__(self, split, **kwargs):
        super().__init__(**kwargs)
        # Load your custom code dataset
        # Structure: {"prompt": str, "solution": str, "tests": str, "imports": str}
        self.ds = load_dataset("your_org/custom_code", split=split)

    @property
    def eval_type(self):
        return 'generative'

    def num_examples(self):
        return len(self.ds)

    def get_example(self, index):
        row = self.ds[index]
        prompt = row['prompt']      # Function description or stub
        solution = row['solution']  # Reference solution
        tests = row['tests']        # Test cases
        imports = row['imports']    # Required imports
        return {
            "messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": solution},
            ],
            "tests": tests,
            "imports": imports,
        }

    def evaluate(self, conversation, completion):
        """Execute generated code and check if tests pass"""
        # Extract code from markdown blocks
        code = self.extract_code(completion)

        # Build executable program
        program = (
            conversation['imports'] +
            "\n\n" +
            code +
            "\n\n" +
            conversation['tests']
        )

        # Execute safely
        result = execute_code(
            program,
            timeout=5.0,
            maximum_memory_bytes=256 * 1024 * 1024  # 256MB
        )
        return result.success

    def extract_code(self, completion):
        """Extract code from LLM output"""
        import re
        # Try to find fenced code blocks first
        pattern = r'```(?:python)?\s*\n(.*?)\n```'
        matches = re.findall(pattern, completion, re.DOTALL)
        if matches:
            return matches[0].strip()
        # Fall back to whole completion
        return completion.strip()

Advanced: Partial Credit
Instead of binary pass/fail, award partial credit:
def evaluate(self, conversation, completion):
    """Award partial credit based on number of passing tests"""
    code = self.extract_code(completion)
    test_cases = self.parse_tests(conversation['tests'])  # List of individual tests

    passing_tests = 0
    for test_case in test_cases:
        program = (
            conversation['imports'] +
            "\n\n" +
            code +
            "\n\n" +
            test_case
        )
        result = execute_code(program, timeout=2.0)
        if result.success:
            passing_tests += 1

    # Return fraction of tests passed
    return passing_tests / len(test_cases)

Sandbox execution makes code evaluation safe
The Sandbox
nanochat includes a sandboxed execution environment for running untrusted code from LLMs:
From nanochat/execution.py:
def execute_code(
    code: str,
    timeout: float = 5.0,
    maximum_memory_bytes: Optional[int] = 256 * 1024 * 1024,
) -> ExecutionResult:
    """
    Execute Python code in a sandboxed environment.

    Safety features:
    - Runs in separate process (can be killed)
    - Time limit (default 5 seconds)
    - Memory limit (default 256MB)
    - Temporary directory (auto-cleaned)
    - Disabled dangerous functions (os.system, subprocess, etc.)
    """

What's protected:
def reliability_guard(maximum_memory_bytes):
    # Disable exit/quit
    builtins.exit = None
    builtins.quit = None

    # Disable dangerous OS operations
    os.kill = None
    os.system = None
    os.remove = None
    os.fork = None

    # Disable subprocess
    subprocess.Popen = None

    # Disable filesystem manipulation
    shutil.rmtree = None

    # Set memory limits
    resource.setrlimit(resource.RLIMIT_AS, (maximum_memory_bytes, maximum_memory_bytes))

What's NOT protected:
- Network access (sockets can be opened)
- Python's dynamic features (ctypes, etc.)
- No kernel-level isolation
Recommendation: Use this sandbox for evaluation, but for production systems serving untrusted code, use proper containerization (Docker with limited capabilities, gVisor, Firecracker, etc.).
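If you do need stronger isolation, here is a hedged sketch of one common approach: run the snippet in a throwaway Docker container with the network disabled and resource caps. The image name and mount layout are placeholders of my own, not part of nanochat:

import os
import subprocess
import tempfile

def run_in_container(code: str, timeout: float = 10.0) -> subprocess.CompletedProcess:
    """Run untrusted Python inside a locked-down, throwaway container (sketch)."""
    with tempfile.TemporaryDirectory() as tmpdir:
        script_path = os.path.join(tmpdir, "snippet.py")
        with open(script_path, "w") as f:
            f.write(code)
        cmd = [
            "docker", "run", "--rm",
            "--network", "none",         # no network access
            "--memory", "256m",          # hard memory cap
            "--cpus", "1.0",             # limit CPU
            "--pids-limit", "64",        # limit fork bombs
            "--read-only",               # read-only root filesystem
            "-v", f"{tmpdir}:/work:ro",  # mount the snippet read-only
            "-w", "/work",
            "python:3.11-slim",          # placeholder image
            "python", "snippet.py",
        ]
        return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)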
Execution Result Handling
result = execute_code("print('hello'); 1/0")

# Check result
if result.success:
    print(f"Output: {result.stdout}")
else:
    print(f"Error: {result.error}")

if result.timeout:
    print("Execution timed out")
if result.memory_exceeded:
    print("Memory limit exceeded")

Example outputs:
# Success
ExecutionResult(success=True, stdout='hello world\n', stderr='')

# Timeout
ExecutionResult(success=False, timeout=True, error='Execution timed out')

# Runtime error
ExecutionResult(success=False, error='ZeroDivisionError: division by zero')

# Memory exceeded
ExecutionResult(success=False, memory_exceeded=True, error='Memory limit exceeded: ...')

Prompt format affects evaluation accuracy more than you expect
Multiple Choice Rendering
The render_mc() function creates standardized multiple choice prompts:
From tasks/common.py:
def render_mc(question, letters, choices):
    """
    Important design decisions:
    1) Letter AFTER choice (better binding for small models)
    2) No whitespace before letter (tokenization consistency)
    """
    query = f"Multiple Choice question: {question}\n"
    query += "".join([f"- {choice}={letter}\n" for letter, choice in zip(letters, choices)])
    query += "\nRespond only with the letter of the correct answer."
    return query

Example output:
Multiple Choice question: What is the capital of France?
- Paris=A
- London=B
- Berlin=C
- Madrid=D
Respond only with the letter of the correct answer.
Why this format?
- Letter after choice: Smaller models bind better when the letter comes after
- No space before letter: Tokenizer treats "=A" consistently (not "= A" or " A")
- Explicit instruction: "Respond only with the letter" reduces verbosity
Few-Shot Prompting
CORE uses few-shot examples for consistency:
def render_prompts_mc(item, continuation_delimiter, fewshot_examples=None):
    template_str = """
{%- for example in fewshot_examples -%}
{{ example.query }}{{ continuation_delimiter }}{{ example.choices[example.gold] }}
{% endfor -%}
{{ item.query }}{{ continuation_delimiter }}{{ choice }}""".strip()
    template = Template(template_str)
    fewshot_examples = fewshot_examples or []
    prompts = [
        template.render(choice=choice, item=item, fewshot_examples=fewshot_examples,
                        continuation_delimiter=continuation_delimiter)
        for choice in item['choices']
    ]
    return prompts

Example with 2-shot:
Q: What is 2+2?
A: 4
Q: What is the capital of Germany?
A: Berlin
Q: What is the chemical symbol for gold?
A: Au
Best practices:
- Use 0-5 shots (more doesn't always help)
- Sample few-shot examples randomly but deterministically (seed-based); see the sketch after this list
- Exclude the current example from the few-shot pool
- Match format exactly (same template for few-shot and test)
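A minimal sketch of deterministic few-shot sampling that excludes the item under test, assuming an indexable dataset. The helper name and the per-example seeding scheme are my own illustration, not nanochat's implementation:

import random

def sample_fewshot(dataset, current_index, num_shots=3, seed=1234):
    """Pick few-shot examples deterministically, never including the item under test."""
    candidate_idxs = [i for i in range(len(dataset)) if i != current_index]
    # Seed per example so reruns are reproducible but each item gets its own shots
    rng = random.Random(seed + current_index)
    chosen = rng.sample(candidate_idxs, k=min(num_shots, len(candidate_idxs)))
    return [dataset[i] for i in chosen]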
Distributed evaluation scales to 10K+ examples
Multi-GPU Evaluation
From nanochat/core_eval.py:
def evaluate_task(model, tokenizer, data, device, task_meta):
    """Evaluate task across multiple GPUs"""
    rank = dist.get_rank() if dist.is_initialized() else 0
    world_size = dist.get_world_size() if dist.is_initialized() else 1

    correct = torch.zeros(len(data), dtype=torch.float32, device=device)

    # Each rank processes different examples
    for idx in range(rank, len(data), world_size):
        is_correct = evaluate_example(idx, model, tokenizer, data, device, task_meta)
        correct[idx] = float(is_correct)

    # Synchronize results across ranks
    if world_size > 1:
        dist.barrier()
        dist.all_reduce(correct, op=dist.ReduceOp.SUM)

    # Compute mean accuracy
    mean_correct = correct.mean().item()
    return mean_correct

How it works:
- Stride distribution: Rank 0 gets examples [0, 8, 16, ...], Rank 1 gets [1, 9, 17, ...], etc.
- Local evaluation: Each rank evaluates its assigned examples
- Synchronization: all_reduce sums results across ranks
- Global mean: Final accuracy computed from aggregated results
Launch evaluation:

# Single GPU
python scripts/base_eval.py

# 8 GPUs (8x speedup)
torchrun --nproc_per_node=8 scripts/base_eval.py

Task mixtures combine multiple evaluations into one score
Combining Multiple Tasks
Use TaskMixture to create multi-task datasets:
from tasks.common import TaskMixture
from tasks.arc import ARC
from tasks.gsm8k import GSM8K
from tasks.medical_diagnosis import MedicalDiagnosis
train_ds = TaskMixture([
    ARC(subset="ARC-Easy", split="train"),  # 2.3K examples
    GSM8K(subset="main", split="train"),    # 8K examples
    MedicalDiagnosis(split="train"),        # Your custom task
])

# Access examples (automatically shuffled)
for i in range(len(train_ds)):
    conversation = train_ds[i]
    # Train on this conversation

TaskMixture features:
- Deterministic shuffling: Same seed = same ordering (reproducible)
- Uniform sampling: All tasks mixed throughout training (prevents forgetting)
- Simple oversampling: Include a task multiple times to oversample
# Oversample medical diagnosis 3x
train_ds = TaskMixture([
    ARC(subset="ARC-Easy", split="train"),
    GSM8K(subset="main", split="train"),
    MedicalDiagnosis(split="train"),
    MedicalDiagnosis(split="train"),  # 2x
    MedicalDiagnosis(split="train"),  # 3x
])

These practices make evaluations reliable and reproducible
1. Data Quality
Good evaluation data:
- Representative of real use cases
- Balanced difficulty (not all easy/hard)
- Diverse examples (covers edge cases)
- High-quality labels (verified correct)
- Clear success criteria
Bad evaluation data:
- Artificial or contrived examples
- Ambiguous questions
- Multiple valid answers (but only one labeled)
- Label noise or errors
2. Prompt Design
Do:
- Use clear, specific instructions
- Provide examples (few-shot) when helpful
- Match training format
- Test prompts on small sample first
Don't:
- Change format between train and eval
- Use ambiguous wording
- Assume model knows implicit conventions
- Over-complicate prompts
3. Evaluation Metrics
For classification tasks:
- Accuracy (overall)
- Per-class precision/recall
- Confusion matrix
- Calibration (confidence vs. correctness)
For generation tasks:
- Exact match
- F1 score (token overlap)
- BLEU/ROUGE (for summarization)
- Pass@k for code (see the estimator sketch after this list)
- Human evaluation (gold standard)
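Pass@k is usually reported with the unbiased estimator from the HumanEval paper rather than by literally sampling k completions once. A small sketch (the function name is mine):

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n samples per problem, c of which pass."""
    if n - c < k:
        return 1.0
    # 1 - probability that all k drawn samples are failures
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 23 pass the tests
print(f"pass@1  = {pass_at_k(200, 23, 1):.3f}")   # 0.115
print(f"pass@10 = {pass_at_k(200, 23, 10):.3f}")

Average the per-problem estimates across your benchmark to get the headline pass@k number.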
4. Error Analysis
Always analyze failures:
# Collect failures
failures = []
for i in range(len(task)):
    conversation = task[i]
    response = generate_response(conversation)
    is_correct = task.evaluate(conversation, response)
    if not is_correct:
        failures.append({
            "index": i,
            "conversation": conversation,
            "response": response,
            "expected": conversation['messages'][-1]['content'],
        })

# Analyze patterns
print(f"Failure rate: {len(failures) / len(task):.2%}")
print("\nSample failures:")
for failure in failures[:5]:
    print(f"\nQuestion: {failure['conversation']['messages'][0]['content']}")
    print(f"Expected: {failure['expected']}")
    print(f"Got: {failure['response']}")

These mistakes invalidate your evaluation results
1. Data Leakage
Problem: Test examples appear in training data
Solution:
- Use different splits (train/val/test)
- Check for near-duplicates between train and test (see the sketch after this list)
- Time-based splits for temporal data
- Hold out test set completely
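One cheap way to catch near-duplicates is n-gram overlap between normalized texts. A minimal sketch; the n-gram size, threshold, and normalization are illustrative assumptions, and the two question lists are placeholders:

import re

def ngrams(text: str, n: int = 8) -> set:
    """Word n-grams over lowercased, punctuation-stripped text."""
    words = re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def is_near_duplicate(a: str, b: str, threshold: float = 0.5) -> bool:
    """Flag pairs whose n-gram Jaccard overlap exceeds the threshold."""
    ga, gb = ngrams(a), ngrams(b)
    if not ga or not gb:
        return False
    jaccard = len(ga & gb) / len(ga | gb)
    return jaccard >= threshold

# Placeholder data; in practice these come from your train/test splits
test_questions = ["What is the capital of France?"]
train_questions = ["what is  the capital of france ?"]

leaks = [
    (i, j)
    for i, test_q in enumerate(test_questions)
    for j, train_q in enumerate(train_questions)
    if is_near_duplicate(test_q, train_q)
]
print(f"Suspected leaks: {leaks}")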
2. Prompt Sensitivity
Problem: Small prompt changes cause large metric changes
Solution:
- Test multiple prompt variations
- Report mean and standard deviation across variations (see the sketch after this list)
- Use few-shot examples
- Standardize format
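A sketch of measuring prompt sensitivity by re-running the same evaluation under several phrasings. Here run_eval is a stand-in for whatever evaluation entry point you use, and the templates are illustrative:

import statistics

# Hypothetical prompt variants for the same underlying question format
PROMPT_TEMPLATES = [
    "Question: {question}\nAnswer with a single letter.",
    "{question}\nRespond only with the letter of the correct answer.",
    "Pick the best option for: {question}",
]

def prompt_sensitivity(run_eval, task, model):
    """Run the same eval once per template and summarize the spread."""
    scores = [run_eval(task, model, template=t) for t in PROMPT_TEMPLATES]
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "scores": scores,
    }

If the standard deviation is a large fraction of the mean, report results across templates rather than a single "best prompt" number.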
3. Metric Gaming
Problem: Model exploits metric without solving task
Example: Model learns to always output "A" on multiple choice if "A" is most common
Solution:
- Balance datasets so answer choices are evenly distributed (see the check sketched after this list)
- Use multiple complementary metrics
- Manual inspection of samples
- Adversarial examples
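Checking for an exploitable answer distribution is a one-liner with collections.Counter. A sketch against the conversation format used above:

from collections import Counter

def answer_distribution(task):
    """Count how often each answer letter is the correct one."""
    counts = Counter()
    for i in range(task.num_examples()):
        conversation = task.get_example(i)
        counts[conversation['messages'][-1]['content']] += 1
    return counts

# e.g. Counter({'A': 312, 'B': 305, 'C': 298, 'D': 285}) is fine;
# Counter({'A': 720, 'B': 180, 'C': 170, 'D': 130}) invites letter-guessing.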
4. Evaluation Bugs
Problem: Bug in evaluation code inflates/deflates scores
Solution:
- Unit test evaluation logic (see the example after this list)
- Verify with reference implementations
- Check edge cases (empty strings, special characters)
- Manual verification on small sample
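A minimal pytest-style unit test for the categorical evaluate() contract. The conversation dict is constructed by hand, and the helper copies the comparison logic shown earlier so the test stays self-contained:

def evaluate_categorical(conversation, assistant_response):
    """Same logic as the categorical evaluate() methods above, copied for testability."""
    assert assistant_response in conversation["letters"]
    correct_answer = conversation["messages"][-1]["content"]
    return assistant_response == correct_answer

def test_evaluate_categorical():
    conversation = {
        "messages": [
            {"role": "user", "content": "Multiple Choice question: ..."},
            {"role": "assistant", "content": "B"},
        ],
        "letters": ("A", "B", "C"),
    }
    assert evaluate_categorical(conversation, "B") is True
    assert evaluate_categorical(conversation, "A") is False

Edge cases worth adding: empty responses, lowercase letters, and responses outside conversation['letters'].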
Custom evaluation measures what actually matters
Generic benchmarks miss your domain. Build evaluations that test what you care about. The key principles:
- Choose the right eval type: Categorical for constrained choices, generative for open-ended tasks
- Design clear prompts: Unambiguous instructions, consistent format
- Define success criteria: Exact match, fuzzy match, execution-based, or multi-faceted scoring
- Analyze failures: Understand where and why the model fails
- Use safe execution: Sandbox untrusted code from LLMs
- Distribute evaluation: Speed up with multi-GPU evaluation
The nanochat evaluation framework provides all the building blocks—Task abstraction, CORE benchmark patterns, safe execution, and distributed evaluation. By extending these patterns to your domain, you can build benchmarks that drive model improvements where they matter most.
If you can't measure it, you can't improve it. Now you can measure anything.
Next up: tokenizer design choices. Vocabulary size, regex patterns, and special tokens affect model performance and training efficiency.
Before you build your custom evaluation:
- Define success criteria before writing code. Exact match? Fuzzy match? Execution-based? This determines your entire architecture.
- Start with 10 hand-verified examples. Run your model on these manually—you'll discover edge cases no automated test catches.
- Sandbox untrusted code execution. Never run LLM-generated code without resource limits—one infinite loop crashes your eval pipeline.
- Balance your evaluation dataset. Equal distribution of answer choices—if 60% of answers are "A", your model will learn to guess "A".
- Test on the actual model first. Run 5 examples end-to-end before launching 10,000—discover prompt format bugs early.
Sources
Institutional and Industry Research
- Epoch AI — Tracks evaluation methodology evolution and benchmark trends (as of January 2025).
- Stanford HAI AI Index — Annual report on AI evaluation standards and benchmark adoption.
- MLCommons MLPerf — Industry-standard benchmarks for AI model evaluation.
- EleutherAI Evaluation Harness — Community-standard framework for LLM benchmarking.
Research Papers
- DCLM (DataComp-LM): Li, J. et al. (2024). "DataComp-LM: In search of the next generation of training sets for language models". arXiv:2406.11794. Introduces CORE benchmark used in nanochat evaluation.
- MMLU: Hendrycks, D. et al. (2020). "Measuring Massive Multitask Language Understanding". arXiv:2009.03300. ICLR 2021. 57-task benchmark covering academic and professional knowledge.
- HellaSwag: Zellers, R. et al. (2019). "HellaSwag: Can a Machine Really Finish Your Sentence?". arXiv:1905.07830. ACL 2019. Commonsense reasoning benchmark with adversarial filtering.
- HumanEval: Chen, M. et al. (2021). "Evaluating Large Language Models Trained on Code". arXiv:2107.03374. Code generation benchmark measuring functional correctness.
- ARC Challenge: Clark, P. et al. (2018). "Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge". arXiv:1803.05457. Grade-school science questions requiring reasoning.
Evaluation Methodology
- Beyond Accuracy: Liang, P. et al. (2022). "Holistic Evaluation of Language Models". HELM framework for comprehensive LM evaluation.
- Few-Shot Learning: Brown, T. et al. (2020). "Language Models are Few-Shot Learners". NeurIPS 2020. Few-shot evaluation protocol.
- TruthfulQA: Lin, S. et al. (2022). "TruthfulQA: Measuring How Models Mimic Human Falsehoods". ACL 2022. Truthfulness benchmark.
Evaluation Frameworks
- lm-evaluation-harness: EleutherAI. "Language Model Evaluation Harness". Industry-standard framework powering HuggingFace Open LLM Leaderboard.
- OpenLM: "OpenLM Framework". Training framework used by DCLM for standardized experiments.
- BIG-bench: Srivastava, A. et al. (2022). "Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models". 204 collaborative benchmark tasks.
nanochat Implementation
- nanochat Repository: karpathy/nanochat. Source code for evaluation task implementations.
- Tasks Directory: tasks/. Task implementations for CORE benchmark.
- CORE Eval: nanochat/core_eval.py. Distributed evaluation implementation.
Related Posts
Previous in series:
- Loss Landscape, Scaling Laws, and Evaluation - Evaluation metrics and frameworks
Next in series:
- Fine-Tuning for Chat: SFT - Training on evaluation tasks
💡 nanochat Tip: Domain-specific evaluation is often more valuable than general benchmarks. Invest in building high-quality evaluation tasks for your use case—they'll guide model development and measure real-world impact.
MMLU measures what the internet knows. Your custom task measures what your users need. Build both.