Building Custom Evaluation Tasks

Track 2: Practical Guides - Post 2.4 of 6
This post builds on Reinforcement Learning from Human Feedback. View all posts in this track →
Standard benchmarks don't measure what you care about
Building domain-specific evaluation tasks is where I've seen the most value from nanochat's framework. The abstraction is simple enough to implement quickly, but flexible enough to handle any domain.
MMLU measures general knowledge. GSM8K measures math. Neither measures your actual use case. That's why you need custom evaluation tasks.
TL;DR: One base class with two modes (categorical and generative). CORE normalizes scores 0–1 with baseline centering. Custom tasks take 50 lines. Sandbox execution makes code evaluation safe.
The benchmark that missed the bug: Consider a common pattern: a legal-tech team trains a contract analysis model and celebrates 78% accuracy on MMLU law questions. In production, it misses 40% of non-compete clause violations—the exact use case they built it for. MMLU tests general legal knowledge; their users need specific clause detection. After building a custom evaluation task with 500 real contract excerpts, they discover the model is guessing on unfamiliar clause structures. The custom benchmark exposes the gap. Targeted fine-tuning later, clause detection hits 91%. General benchmarks tell you your model is broadly capable. Custom benchmarks tell you it actually works.
You've trained a model, fine-tuned it for chat, and optimized it with RL. But how do you know if it's actually good at what you care about? Standard benchmarks like MMLU and GSM8K are useful, but they measure general capabilities—not your specific use case.
Custom evaluation tasks let you measure what matters for your application: medical diagnosis accuracy, legal document analysis, code generation for your codebase, or any domain-specific skill.
nanochat's evaluation framework makes this straightforward. This post covers:
- The three evaluation task types: multiple choice, schema, and language modeling
- How the CORE benchmark framework works
- Building custom tasks for any domain
- Best practices for prompt engineering in evaluation
- Code execution sandboxing for safe evaluation
- Distributed evaluation for large benchmarks
One Task class handles any evaluation type
Base Task Class
All evaluation tasks in nanochat inherit from Task:
class Task:
    def __init__(self, start=0, stop=None, step=1):
        # Allows lightweight slicing over the dataset
        self.start = start
        self.stop = stop
        self.step = step

    @property
    def eval_type(self):
        # one of 'generative' | 'categorical'
        raise NotImplementedError

    def num_examples(self):
        raise NotImplementedError

    def get_example(self, index):
        # Returns a conversation dict
        raise NotImplementedError

    def evaluate(self, conversation, assistant_response):
        # Returns success (bool or float)
        raise NotImplementedError

Key design principles:
- Lightweight slicing: Create views over datasets without copying data
- Lazy loading: Examples fetched on-demand via get_example()
- Two evaluation modes: categorical (constrained choices) or generative (free-form)
- Conversation format: All tasks return standardized conversation dicts
The Two Evaluation Modes
Categorical Evaluation
When to use: Multiple choice, yes/no, classification tasks
How it works: Model assigns probabilities to predefined options, chooses highest probability
Example from MMLU:
@property
def eval_type(self):
    return 'categorical'

def get_example(self, index):
    row = self.ds[index]
    question = row["question"]
    choices = row["choices"]  # ["Choice A", "Choice B", "Choice C", "Choice D"]
    answer = row["answer"]    # 0, 1, 2, or 3
    user_message = render_mc(question, ['A', 'B', 'C', 'D'], choices)
    assistant_message = ['A', 'B', 'C', 'D'][answer]
    return {
        "messages": [
            {"role": "user", "content": user_message},
            {"role": "assistant", "content": assistant_message}
        ],
        "letters": ['A', 'B', 'C', 'D'],  # For evaluation
    }

def evaluate(self, conversation, assistant_response):
    correct_answer = conversation['messages'][-1]['content']
    return assistant_response == correct_answer

Advantages:
- Fast (no sampling required, batch multiple questions)
- Deterministic (no temperature variation)
- Easy to score (exact match)
Generative Evaluation
When to use: Open-ended tasks (code generation, math, creative writing)
How it works: Model generates free-form text, evaluated against success criteria
Example from HumanEval:
@property
def eval_type(self):
    return 'generative'

def get_example(self, index):
    row = self.ds[index]
    prompt = row['prompt']               # Function signature
    solution = row['canonical_solution']
    test = row['test']                   # Test cases
    return {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": f"{prompt}\n{solution}"},
        ],
        "entry_point": row['entry_point'],
        "test": test,
    }

def evaluate(self, conversation, completion):
    # Extract code from completion
    code = extract_program(completion)
    # Build executable program
    program = (
        imports +
        "\n\n" +
        code +
        "\n\n" +
        conversation['test'] +
        "\n" +
        f"check({conversation['entry_point']})"
    )
    # Execute and check
    result = execute_code(program)
    return result.success

Advantages:
- Flexible (measures complex capabilities)
- Realistic (mimics actual use)
- Rich feedback (can analyze failure modes)
For your custom tasks, this means: use categorical evaluation when you can (faster, deterministic), but don't force it. If your task is naturally open-ended—code generation, summarization, creative writing—generative evaluation captures what categorical misses.
For your compute budget, this means: categorical evaluation is 10-100× cheaper than generative. A single forward pass vs. potentially hundreds of tokens sampled. If you're evaluating on every training step, categorical saves real money.
CORE normalizes scores across diverse task types
What is CORE?
CORE is a compact 11-task benchmark from the DCLM paper that evaluates base models across diverse capabilities with minimal compute.
From nanochat/core_eval.py:
Tasks:
1. ARC (easy/challenge) - Science reasoning
2. HellaSwag - Commonsense reasoning
3. MMLU - Multitask knowledge
4. OpenBookQA - Elementary science
5. PIQA - Physical reasoning
6. SIQA - Social reasoning
7. WinoGrande - Coreference resolution
8. BoolQ - Yes/no questions
9. COPA - Causal reasoning
10. StoryCloze - Story completion
11. SQuAD - Reading comprehension
Why CORE matters:
- Coverage: Tests diverse reasoning types
- Efficiency: 11 tasks vs. dozens in full benchmarks
- Correlation: CORE score correlates with broader evaluation suites
- Centered metric: Accounts for random baselines (0=random, 1=perfect)
Three Task Types in CORE
CORE defines three fundamental task structures:
1. Multiple Choice Tasks
Structure: Same context, different continuations
Question: What is the capital of France?
- Paris=A
- London=B
- Berlin=C
- Madrid=D
Respond only with the letter of the correct answer.
Evaluation method: Compare log probabilities of each continuation, choose lowest loss (highest probability)
From nanochat/core_eval.py:
if task_type == 'multiple_choice':
    prompts = render_prompts_mc(item, continuation_delimiter, fewshot_examples)
    tokens, start_idxs, end_idxs = batch_sequences_mc(tokenizer, prompts)
    # ...
    # Find option with lowest average loss
    mean_losses = [losses[i, si-1:ei-1].mean().item()
                   for i, (si, ei) in enumerate(zip(start_idxs, end_idxs))]
    pred_idx = mean_losses.index(min(mean_losses))
    is_correct = pred_idx == item['gold']

Key insight: We evaluate log probabilities, not generated text. This is much faster and avoids issues with generation formatting.
2. Schema Tasks
Structure: Different contexts, same continuation
Context A: "The dog barked loudly."
Context B: "The dog slept quietly."
Context C: "The dog ran quickly."
Continuation: " It was happy."
Which context is most likely?
Use case: Sentence completion, coreference resolution
Evaluation method: Similar to multiple choice, but context varies instead of continuation
if task_type == 'schema':
    prompts = render_prompts_schema(item, continuation_delimiter, fewshot_examples)
    tokens, start_idxs, end_idxs = batch_sequences_schema(tokenizer, prompts)
    # Find context with lowest loss for the continuation
    mean_losses = [losses[i, si-1:ei-1].mean().item()
                   for i, (si, ei) in enumerate(zip(start_idxs, end_idxs))]
    pred_idx = mean_losses.index(min(mean_losses))
    is_correct = pred_idx == item['gold']

3. Language Modeling Tasks
Structure: Context + continuation, evaluate continuation likelihood
Context: "The capital of France is"
Continuation: " Paris"
Check if model assigns high probability to continuation
Use case: Reading comprehension, factual knowledge
Evaluation method: Check if argmax predictions match actual tokens
if task_type == 'language_modeling':
    prompts = render_prompts_lm(item, continuation_delimiter, fewshot_examples)
    tokens, start_idxs, end_idxs = batch_sequences_lm(tokenizer, prompts)
    # Check if all predicted tokens match actual tokens
    si, ei = start_idxs[0], end_idxs[0]
    predicted_tokens = predictions[0, si-1:ei-1]
    actual_tokens = input_ids[0, si:ei]
    is_correct = torch.all(predicted_tokens == actual_tokens).item()

Centered Metrics
Raw accuracy can be misleading when random guessing achieves high scores:
# Example: 4-choice multiple choice
raw_accuracy = 0.30      # 30% correct
random_baseline = 0.25   # 25% by guessing

# Centered accuracy
centered = (raw_accuracy - random_baseline) / (1.0 - random_baseline)
# centered = (0.30 - 0.25) / (1.0 - 0.25) = 0.067 (6.7% above random)

Interpretation:
- 0.0 = Random guessing
- 1.0 = Perfect performance
- Negative = Worse than random (model is broken)
For your evaluation pipeline, this means: always report centered accuracy, not raw accuracy. A "30% accuracy" sounds bad until you realize random is 25%—your model is actually learning something. Centered metrics make progress visible.
For your stakeholder reports, this means: centered metrics translate to business value. "7% above random" is a concrete improvement, while "30% accuracy" on a 4-choice task sounds terrible even though it means the same thing.
The CORE metric is the mean of centered accuracies across all 11 tasks.
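To make that aggregation concrete, here is a minimal sketch of computing a CORE-style score from per-task raw accuracies and random baselines. The task names and numbers are illustrative placeholders, not nanochat's actual results:

def centered_accuracy(raw, baseline):
    """Center a raw accuracy so 0 = random guessing and 1 = perfect."""
    return (raw - baseline) / (1.0 - baseline)

# Hypothetical per-task results: (raw accuracy, random baseline)
results = {
    "arc_easy":  (0.52, 0.25),
    "hellaswag": (0.31, 0.25),
    "boolq":     (0.63, 0.50),
    "squad":     (0.18, 0.01),  # language modeling tasks have a near-zero baseline
}

centered = {name: centered_accuracy(raw, base) for name, (raw, base) in results.items()}
core_score = sum(centered.values()) / len(centered)
print(f"CORE (mean centered accuracy): {core_score:.3f}")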
Multiple choice tasks: constrained evaluation in 40 lines
Example: Custom Medical Diagnosis Task
Building a medical diagnosis task from scratch:
from datasets import load_dataset
from tasks.common import Task, render_mc
class MedicalDiagnosis(Task):
    """Evaluate medical diagnosis from symptoms"""

    def __init__(self, split, **kwargs):
        super().__init__(**kwargs)
        assert split in ["train", "test"], "split must be train|test"
        # Assume you have a dataset with structure:
        # {"symptoms": str, "diagnosis": str, "options": List[str]}
        self.ds = load_dataset("your_org/medical_diagnosis", split=split)
        self.letters = ('A', 'B', 'C', 'D', 'E')  # Up to 5 options

    @property
    def eval_type(self):
        return 'categorical'

    def num_examples(self):
        return len(self.ds)

    def get_example(self, index):
        row = self.ds[index]
        symptoms = row['symptoms']
        diagnosis = row['diagnosis']
        options = row['options']  # List of possible diagnoses

        # Find which option is correct
        assert diagnosis in options, "Correct diagnosis not in options!"
        answer_idx = options.index(diagnosis)

        # Render as multiple choice
        question = f"Given the following symptoms:\n{symptoms}\n\nWhat is the most likely diagnosis?"
        user_message = render_mc(question, self.letters[:len(options)], options)
        assistant_message = self.letters[answer_idx]

        return {
            "messages": [
                {"role": "user", "content": user_message},
                {"role": "assistant", "content": assistant_message}
            ],
            "letters": self.letters[:len(options)],
            "specialty": row.get('specialty', 'general'),  # For grouping results
        }

    def evaluate(self, conversation, assistant_response):
        assert assistant_response in conversation['letters']
        correct_answer = conversation['messages'][-1]['content']
        return assistant_response == correct_answer

Usage:
from tasks.medical_diagnosis import MedicalDiagnosis
from scripts.chat_eval import run_categorical_eval

# Create task
task = MedicalDiagnosis(split="test")

# Evaluate (from scripts/chat_eval.py pattern)
accuracy = run_categorical_eval(
    task_object=task,
    tokenizer=tokenizer,
    model=model,
    batch_size=8,
    max_problems=None  # Evaluate all
)
print(f"Medical Diagnosis Accuracy: {accuracy:.2%}")

Generative tasks: free-form evaluation with verifiable answers
Example: Custom Code Generation Task
For domain-specific code generation:
from datasets import load_dataset
from tasks.common import Task
from nanochat.execution import execute_code

class CustomCodeGen(Task):
    """Evaluate code generation for your specific domain"""

    def __init__(self, split, **kwargs):
        super().__init__(**kwargs)
        # Load your custom code dataset
        # Structure: {"prompt": str, "solution": str, "tests": str, "imports": str}
        self.ds = load_dataset("your_org/custom_code", split=split)

    @property
    def eval_type(self):
        return 'generative'

    def num_examples(self):
        return len(self.ds)

    def get_example(self, index):
        row = self.ds[index]
        prompt = row['prompt']      # Function description or stub
        solution = row['solution']  # Reference solution
        tests = row['tests']        # Test cases
        imports = row['imports']    # Required imports
        return {
            "messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": solution},
            ],
            "tests": tests,
            "imports": imports,
        }

    def evaluate(self, conversation, completion):
        """Execute generated code and check if tests pass"""
        # Extract code from markdown blocks
        code = self.extract_code(completion)

        # Build executable program
        program = (
            conversation['imports'] +
            "\n\n" +
            code +
            "\n\n" +
            conversation['tests']
        )

        # Execute safely
        result = execute_code(
            program,
            timeout=5.0,
            maximum_memory_bytes=256 * 1024 * 1024  # 256MB
        )
        return result.success

    def extract_code(self, completion):
        """Extract code from LLM output"""
        import re
        # Try to find fenced code blocks first
        pattern = r'```(?:python)?\s*\n(.*?)\n```'
        matches = re.findall(pattern, completion, re.DOTALL)
        if matches:
            return matches[0].strip()
        # Fall back to whole completion
        return completion.strip()

Advanced: Partial Credit
Instead of binary pass/fail, award partial credit:
def evaluate(self, conversation, completion):
    """Award partial credit based on number of passing tests"""
    code = self.extract_code(completion)
    test_cases = self.parse_tests(conversation['tests'])  # List of individual tests

    passing_tests = 0
    for test_case in test_cases:
        program = (
            conversation['imports'] +
            "\n\n" +
            code +
            "\n\n" +
            test_case
        )
        result = execute_code(program, timeout=2.0)
        if result.success:
            passing_tests += 1

    # Return fraction of tests passed
    return passing_tests / len(test_cases)

Sandbox execution makes code evaluation safe
The Sandbox
nanochat includes a sandboxed execution environment for running untrusted code from LLMs:
From nanochat/execution.py:
def execute_code(
    code: str,
    timeout: float = 5.0,
    maximum_memory_bytes: Optional[int] = 256 * 1024 * 1024,
) -> ExecutionResult:
    """
    Execute Python code in a sandboxed environment.

    Safety features:
    - Runs in separate process (can be killed)
    - Time limit (default 5 seconds)
    - Memory limit (default 256MB)
    - Temporary directory (auto-cleaned)
    - Disabled dangerous functions (os.system, subprocess, etc.)
    """

What's protected:
def reliability_guard(maximum_memory_bytes):
    # Disable exit/quit
    builtins.exit = None
    builtins.quit = None

    # Disable dangerous OS operations
    os.kill = None
    os.system = None
    os.remove = None
    os.fork = None

    # Disable subprocess
    subprocess.Popen = None

    # Disable filesystem manipulation
    shutil.rmtree = None

    # Set memory limits
    resource.setrlimit(resource.RLIMIT_AS, (maximum_memory_bytes, maximum_memory_bytes))

What's NOT protected:
- Network access (sockets can be opened)
- Python's dynamic features (ctypes, etc.)
- No kernel-level isolation
Recommendation: Use this sandbox for evaluation, but for production systems serving untrusted code, use proper containerization (Docker with limited capabilities, gVisor, Firecracker, etc.).
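If you do need stronger isolation, here is a hedged sketch of one common approach: run the snippet in a throwaway Docker container with the network disabled and resource caps. The image name and mount layout are placeholders of my own, not part of nanochat:

import os
import subprocess
import tempfile

def run_in_container(code: str, timeout: float = 10.0) -> subprocess.CompletedProcess:
    """Run untrusted Python inside a locked-down, throwaway container (sketch)."""
    with tempfile.TemporaryDirectory() as tmpdir:
        script_path = os.path.join(tmpdir, "snippet.py")
        with open(script_path, "w") as f:
            f.write(code)
        cmd = [
            "docker", "run", "--rm",
            "--network", "none",         # no network access
            "--memory", "256m",          # hard memory cap
            "--cpus", "1.0",             # limit CPU
            "--pids-limit", "64",        # limit fork bombs
            "--read-only",               # read-only root filesystem
            "-v", f"{tmpdir}:/work:ro",  # mount the snippet read-only
            "-w", "/work",
            "python:3.11-slim",          # placeholder image
            "python", "snippet.py",
        ]
        return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)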
Execution Result Handling
result = execute_code("print('hello'); 1/0")

# Check result
if result.success:
    print(f"Output: {result.stdout}")
else:
    print(f"Error: {result.error}")

if result.timeout:
    print("Execution timed out")
if result.memory_exceeded:
    print("Memory limit exceeded")

Example outputs:
# Success
ExecutionResult(success=True, stdout='hello world\n', stderr='')

# Timeout
ExecutionResult(success=False, timeout=True, error='Execution timed out')

# Runtime error
ExecutionResult(success=False, error='ZeroDivisionError: division by zero')

# Memory exceeded
ExecutionResult(success=False, memory_exceeded=True, error='Memory limit exceeded: ...')

Prompt format affects evaluation accuracy more than you expect
Multiple Choice Rendering
The render_mc() function creates standardized multiple choice prompts:
From tasks/common.py:
def render_mc(question, letters, choices):
    """
    Important design decisions:
    1) Letter AFTER choice (better binding for small models)
    2) No whitespace before letter (tokenization consistency)
    """
    query = f"Multiple Choice question: {question}\n"
    query += "".join([f"- {choice}={letter}\n" for letter, choice in zip(letters, choices)])
    query += "\nRespond only with the letter of the correct answer."
    return query

Example output:
Multiple Choice question: What is the capital of France?
- Paris=A
- London=B
- Berlin=C
- Madrid=D
Respond only with the letter of the correct answer.
Why this format?
- Letter after choice: Smaller models bind better when the letter comes after
- No space before letter: Tokenizer treats "=A" consistently (not "= A" or " A")
- Explicit instruction: "Respond only with the letter" reduces verbosity
Few-Shot Prompting
CORE uses few-shot examples for consistency:
def render_prompts_mc(item, continuation_delimiter, fewshot_examples=None):
    template_str = """
{%- for example in fewshot_examples -%}
{{ example.query }}{{ continuation_delimiter }}{{ example.choices[example.gold] }}
{% endfor -%}
{{ item.query }}{{ continuation_delimiter }}{{ choice }}""".strip()
    template = Template(template_str)
    fewshot_examples = fewshot_examples or []
    prompts = [
        template.render(choice=choice, item=item, fewshot_examples=fewshot_examples,
                        continuation_delimiter=continuation_delimiter)
        for choice in item['choices']
    ]
    return prompts

Example with 2-shot:
Q: What is 2+2?
A: 4
Q: What is the capital of Germany?
A: Berlin
Q: What is the chemical symbol for gold?
A: Au
Best practices:
- Use 0-5 shots (more doesn't always help)
- Sample few-shot examples randomly but deterministically (seed-based); see the sketch after this list
- Exclude the current example from the few-shot pool
- Match format exactly (same template for few-shot and test)
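A minimal sketch of deterministic few-shot sampling that excludes the item under test, assuming an indexable dataset. The helper name and the per-example seeding scheme are my own illustration, not nanochat's implementation:

import random

def sample_fewshot(dataset, current_index, num_shots=3, seed=1234):
    """Pick few-shot examples deterministically, never including the item under test."""
    candidate_idxs = [i for i in range(len(dataset)) if i != current_index]
    # Seed per example so reruns are reproducible but each item gets its own shots
    rng = random.Random(seed + current_index)
    chosen = rng.sample(candidate_idxs, k=min(num_shots, len(candidate_idxs)))
    return [dataset[i] for i in chosen]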
Distributed evaluation scales to 10K+ examples
Multi-GPU Evaluation
From nanochat/core_eval.py:
def evaluate_task(model, tokenizer, data, device, task_meta):
    """Evaluate task across multiple GPUs"""
    rank = dist.get_rank() if dist.is_initialized() else 0
    world_size = dist.get_world_size() if dist.is_initialized() else 1

    correct = torch.zeros(len(data), dtype=torch.float32, device=device)

    # Each rank processes different examples
    for idx in range(rank, len(data), world_size):
        is_correct = evaluate_example(idx, model, tokenizer, data, device, task_meta)
        correct[idx] = float(is_correct)

    # Synchronize results across ranks
    if world_size > 1:
        dist.barrier()
        dist.all_reduce(correct, op=dist.ReduceOp.SUM)

    # Compute mean accuracy
    mean_correct = correct.mean().item()
    return mean_correct

How it works:
- Stride distribution: Rank 0 gets examples [0, 8, 16, ...], Rank 1 gets [1, 9, 17, ...], etc.
- Local evaluation: Each rank evaluates its assigned examples
- Synchronization: all_reduce sums results across ranks
- Global mean: Final accuracy computed from aggregated results
Launch evaluation:

# Single GPU
python scripts/base_eval.py

# 8 GPUs (8x speedup)
torchrun --nproc_per_node=8 scripts/base_eval.py

Task mixtures combine multiple evaluations into one score
Combining Multiple Tasks
Use TaskMixture to create multi-task datasets:
from tasks.common import TaskMixture
from tasks.arc import ARC
from tasks.gsm8k import GSM8K
from tasks.medical_diagnosis import MedicalDiagnosis
train_ds = TaskMixture([
    ARC(subset="ARC-Easy", split="train"),  # 2.3K examples
    GSM8K(subset="main", split="train"),    # 8K examples
    MedicalDiagnosis(split="train"),        # Your custom task
])

# Access examples (automatically shuffled)
for i in range(len(train_ds)):
    conversation = train_ds[i]
    # Train on this conversation

TaskMixture features:
- Deterministic shuffling: Same seed = same ordering (reproducible)
- Uniform sampling: All tasks mixed throughout training (prevents forgetting)
- Simple oversampling: Include a task multiple times to oversample
# Oversample medical diagnosis 3x
train_ds = TaskMixture([
    ARC(subset="ARC-Easy", split="train"),
    GSM8K(subset="main", split="train"),
    MedicalDiagnosis(split="train"),
    MedicalDiagnosis(split="train"),  # 2x
    MedicalDiagnosis(split="train"),  # 3x
])

These practices make evaluations reliable and reproducible
1. Data Quality
Good evaluation data:
- Representative of real use cases
- Balanced difficulty (not all easy/hard)
- Diverse examples (covers edge cases)
- High-quality labels (verified correct)
- Clear success criteria
Bad evaluation data:
- Artificial or contrived examples
- Ambiguous questions
- Multiple valid answers (but only one labeled)
- Label noise or errors
2. Prompt Design
Do:
- Use clear, specific instructions
- Provide examples (few-shot) when helpful
- Match training format
- Test prompts on small sample first
Don't:
- Change format between train and eval
- Use ambiguous wording
- Assume model knows implicit conventions
- Over-complicate prompts
3. Evaluation Metrics
For classification tasks:
- Accuracy (overall)
- Per-class precision/recall
- Confusion matrix
- Calibration (confidence vs. correctness)
For generation tasks:
- Exact match
- F1 score (token overlap)
- BLEU/ROUGE (for summarization)
- Pass@k for code (see the estimator sketch after this list)
- Human evaluation (gold standard)
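Pass@k is usually reported with the unbiased estimator from the HumanEval paper rather than by literally sampling k completions once. A small sketch (the function name is mine):

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n samples per problem, c of which pass."""
    if n - c < k:
        return 1.0
    # 1 - probability that all k drawn samples are failures
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 23 pass the tests
print(f"pass@1  = {pass_at_k(200, 23, 1):.3f}")   # 0.115
print(f"pass@10 = {pass_at_k(200, 23, 10):.3f}")

Average the per-problem estimates across your benchmark to get the headline pass@k number.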
4. Error Analysis
Always analyze failures:
# Collect failures
failures = []
for i in range(len(task)):
    conversation = task[i]
    response = generate_response(conversation)
    is_correct = task.evaluate(conversation, response)
    if not is_correct:
        failures.append({
            "index": i,
            "conversation": conversation,
            "response": response,
            "expected": conversation['messages'][-1]['content'],
        })

# Analyze patterns
print(f"Failure rate: {len(failures) / len(task):.2%}")
print("\nSample failures:")
for failure in failures[:5]:
    print(f"\nQuestion: {failure['conversation']['messages'][0]['content']}")
    print(f"Expected: {failure['expected']}")
    print(f"Got: {failure['response']}")

These mistakes invalidate your evaluation results
1. Data Leakage
Problem: Test examples appear in training data
Solution:
- Use different splits (train/val/test)
- Check for near-duplicates between train and test (see the sketch after this list)
- Time-based splits for temporal data
- Hold out test set completely
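One cheap way to catch near-duplicates is n-gram overlap between normalized texts. A minimal sketch; the n-gram size, threshold, and normalization are illustrative assumptions, and the two question lists are placeholders:

import re

def ngrams(text: str, n: int = 8) -> set:
    """Word n-grams over lowercased, punctuation-stripped text."""
    words = re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def is_near_duplicate(a: str, b: str, threshold: float = 0.5) -> bool:
    """Flag pairs whose n-gram Jaccard overlap exceeds the threshold."""
    ga, gb = ngrams(a), ngrams(b)
    if not ga or not gb:
        return False
    jaccard = len(ga & gb) / len(ga | gb)
    return jaccard >= threshold

# Placeholder data; in practice these come from your train/test splits
test_questions = ["What is the capital of France?"]
train_questions = ["what is  the capital of france ?"]

leaks = [
    (i, j)
    for i, test_q in enumerate(test_questions)
    for j, train_q in enumerate(train_questions)
    if is_near_duplicate(test_q, train_q)
]
print(f"Suspected leaks: {leaks}")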
2. Prompt Sensitivity
Problem: Small prompt changes cause large metric changes
Solution:
- Test multiple prompt variations
- Report mean and standard deviation across variations (see the sketch after this list)
- Use few-shot examples
- Standardize format
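A sketch of measuring prompt sensitivity by re-running the same evaluation under several phrasings. Here run_eval is a stand-in for whatever evaluation entry point you use, and the templates are illustrative:

import statistics

# Hypothetical prompt variants for the same underlying question format
PROMPT_TEMPLATES = [
    "Question: {question}\nAnswer with a single letter.",
    "{question}\nRespond only with the letter of the correct answer.",
    "Pick the best option for: {question}",
]

def prompt_sensitivity(run_eval, task, model):
    """Run the same eval once per template and summarize the spread."""
    scores = [run_eval(task, model, template=t) for t in PROMPT_TEMPLATES]
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "scores": scores,
    }

If the standard deviation is a large fraction of the mean, report results across templates rather than a single "best prompt" number.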
3. Metric Gaming
Problem: Model exploits metric without solving task
Example: Model learns to always output "A" on multiple choice if "A" is most common
Solution:
- Balance datasets so answer choices are evenly distributed (see the check sketched after this list)
- Use multiple complementary metrics
- Manual inspection of samples
- Adversarial examples
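Checking for an exploitable answer distribution is a one-liner with collections.Counter. A sketch against the conversation format used above:

from collections import Counter

def answer_distribution(task):
    """Count how often each answer letter is the correct one."""
    counts = Counter()
    for i in range(task.num_examples()):
        conversation = task.get_example(i)
        counts[conversation['messages'][-1]['content']] += 1
    return counts

# e.g. Counter({'A': 312, 'B': 305, 'C': 298, 'D': 285}) is fine;
# Counter({'A': 720, 'B': 180, 'C': 170, 'D': 130}) invites letter-guessing.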
4. Evaluation Bugs
Problem: Bug in evaluation code inflates/deflates scores
Solution:
- Unit test evaluation logic (see the example after this list)
- Verify with reference implementations
- Check edge cases (empty strings, special characters)
- Manual verification on small sample
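A minimal pytest-style unit test for the categorical evaluate() contract. The conversation dict is constructed by hand, and the helper copies the comparison logic shown earlier so the test stays self-contained:

def evaluate_categorical(conversation, assistant_response):
    """Same logic as the categorical evaluate() methods above, copied for testability."""
    assert assistant_response in conversation["letters"]
    correct_answer = conversation["messages"][-1]["content"]
    return assistant_response == correct_answer

def test_evaluate_categorical():
    conversation = {
        "messages": [
            {"role": "user", "content": "Multiple Choice question: ..."},
            {"role": "assistant", "content": "B"},
        ],
        "letters": ("A", "B", "C"),
    }
    assert evaluate_categorical(conversation, "B") is True
    assert evaluate_categorical(conversation, "A") is False

Edge cases worth adding: empty responses, lowercase letters, and responses outside conversation['letters'].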
Custom evaluation measures what actually matters
Generic benchmarks miss your domain. Build evaluations that test what you care about. The key principles:
- Choose the right eval type: Categorical for constrained choices, generative for open-ended tasks
- Design clear prompts: Unambiguous instructions, consistent format
- Define success criteria: Exact match, fuzzy match, execution-based, or multi-faceted scoring
- Analyze failures: Understand where and why the model fails
- Use safe execution: Sandbox untrusted code from LLMs
- Distribute evaluation: Speed up with multi-GPU evaluation
The nanochat evaluation framework provides all the building blocks—Task abstraction, CORE benchmark patterns, safe execution, and distributed evaluation. By extending these patterns to your domain, you can build benchmarks that drive model improvements where they matter most.
If you can't measure it, you can't improve it. Now you can measure anything.
Next up: tokenizer design choices. Vocabulary size, regex patterns, and special tokens affect model performance and training efficiency.
Before you build your custom evaluation:
- Define success criteria before writing code. Exact match? Fuzzy match? Execution-based? This determines your entire architecture.
- Start with 10 hand-verified examples. Run your model on these manually—you'll discover edge cases no automated test catches.
- Sandbox untrusted code execution. Never run LLM-generated code without resource limits—one infinite loop crashes your eval pipeline.
- Balance your evaluation dataset. Equal distribution of answer choices—if 60% of answers are "A", your model will learn to guess "A".
- Test on the actual model first. Run 5 examples end-to-end before launching 10,000—discover prompt format bugs early.
Sources
Institutional and Industry Research
- Epoch AI — Tracks evaluation methodology evolution and benchmark trends (as of January 2025).
- Stanford HAI AI Index — Annual report on AI evaluation standards and benchmark adoption.
- MLCommons MLPerf — Industry-standard benchmarks for AI model evaluation.
- EleutherAI Evaluation Harness — Community-standard framework for LLM benchmarking.
Research Papers
- DCLM (DataComp-LM): Li, J. et al. (2024). "DataComp-LM: In search of the next generation of training sets for language models". arXiv:2406.11794. Introduces CORE benchmark used in nanochat evaluation.
- MMLU: Hendrycks, D. et al. (2020). "Measuring Massive Multitask Language Understanding". arXiv:2009.03300. ICLR 2021. 57-task benchmark covering academic and professional knowledge.
- HellaSwag: Zellers, R. et al. (2019). "HellaSwag: Can a Machine Really Finish Your Sentence?". arXiv:1905.07830. ACL 2019. Commonsense reasoning benchmark with adversarial filtering.
- HumanEval: Chen, M. et al. (2021). "Evaluating Large Language Models Trained on Code". arXiv:2107.03374. Code generation benchmark measuring functional correctness.
- ARC Challenge: Clark, P. et al. (2018). "Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge". arXiv:1803.05457. Grade-school science questions requiring reasoning.
Evaluation Methodology
- Beyond Accuracy: Liang, P. et al. (2022). "Holistic Evaluation of Language Models". HELM framework for comprehensive LM evaluation.
- Few-Shot Learning: Brown, T. et al. (2020). "Language Models are Few-Shot Learners". NeurIPS 2020. Few-shot evaluation protocol.
- TruthfulQA: Lin, S. et al. (2022). "TruthfulQA: Measuring How Models Mimic Human Falsehoods". ACL 2022. Truthfulness benchmark.
Evaluation Frameworks
- lm-evaluation-harness: EleutherAI. "Language Model Evaluation Harness". Industry-standard framework powering HuggingFace Open LLM Leaderboard.
- OpenLM: "OpenLM Framework". Training framework used by DCLM for standardized experiments.
- BIG-bench: Srivastava, A. et al. (2022). "Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models". 204 collaborative benchmark tasks.
nanochat Implementation
- nanochat Repository: karpathy/nanochat. Source code for evaluation task implementations.
- Tasks Directory: tasks/. Task implementations for CORE benchmark.
- CORE Eval: nanochat/core_eval.py. Distributed evaluation implementation.
Related Posts
Previous in series:
- Loss Landscape, Scaling Laws, and Evaluation - Evaluation metrics and frameworks
Next in series:
- Fine-Tuning for Chat: SFT - Training on evaluation tasks
💡 nanochat Tip: Domain-specific evaluation is often more valuable than general benchmarks. Invest in building high-quality evaluation tasks for your use case—they'll guide model development and measure real-world impact.
MMLU measures what the internet knows. Your custom task measures what your users need. Build both.