Tokenizer Design Choices: BPE, Vocabulary, and Implementation

Track 2: Practical Guides - Post 2.5 of 6
Tokenization fundamentals: BPE algorithm, vocabulary design, dual Rust/Python implementation, special tokens for chat, and tokenizer-agnostic evaluation. View all posts in this track →

Your tokenizer decides what your model can learn

Tokenization seems like a solved problem until you look at the details. Vocabulary size, regex patterns, special tokens—nanochat's implementation reveals why these decisions compound across billions of tokens.

Every training step, every inference call—your tokenizer runs first. BPE with the right vocabulary size compounds across billions of tokens.

TL;DR: BPE iteratively merges frequent byte pairs into tokens. 32K vocabulary gives ~3.5 tokens per word. Rust implementation trains at 10M tokens/sec. tiktoken inference handles 10M+ tokens/sec. Special tokens define conversation boundaries.

The tokenizer that killed accuracy: Consider a common failure mode: fine-tuning a code model using GPT-2's tokenizer (trained on web text) for Python code generation. HumanEval score: 12%—far below published benchmarks for similar model sizes. The cause: the tokenizer split common Python patterns (def, self., import) into 3-4 tokens each, while code-optimized tokenizers treat them as single tokens. After retraining with a code-first vocabulary, the same architecture hit 31%. The model was identical. The tokenizer changed everything. Your vocabulary isn't a preprocessing detail—it's the first layer of your neural network.

Tokenization is the foundational layer of any language model—it determines how text is converted into discrete tokens that the model can process. Poor tokenization choices can cripple model performance, waste compute on suboptimal token representations, and create artifacts in generation.

nanochat implements Byte Pair Encoding (BPE) tokenization in the style of GPT-4, with two complementary implementations:

RustBPE + tiktoken: Fast training in Rust, ultra-fast inference with tiktoken
HuggingFace Tokenizers: Pure Python fallback for compatibility

In this post:

BPE algorithm fundamentals and why it works
GPT-4 style design choices: regex patterns, byte fallback, special tokens
Training pipeline: streaming data, vocabulary size selection
Inference optimizations: tiktoken integration, parallel encoding
Conversation formatting: special tokens for chat, supervision masking
Evaluation metrics: bits-per-byte for tokenizer-agnostic comparison
Implementation comparison: Rust vs Python trade-offs

Token Prediction Game

Test your intuition about how language models predict the next token

Difficulty:

🎯

Token Prediction Challenge

Think like an LLM! Given a text prompt, predict which token the model would most likely generate next.

🧠 How LLMs Predict Tokens

• LLMs output a probability distribution over all possible tokens
• The next token is sampled based on these probabilities
• Temperature controls how random vs deterministic the selection is
• NanoChat implements this in a minimal, educational codebase

BPE Algorithm Fundamentals
GPT-4 Style Tokenization
Training Pipeline
The RustBPE Implementation
Inference with tiktoken
Special Tokens for Chat
Conversation Rendering
Tokenizer Evaluation
Implementation Trade-offs
Best Practices & Common Pitfalls

BPE merges frequent byte pairs into a learned vocabulary

What is Byte Pair Encoding?

BPE is a data compression algorithm adapted for tokenization. It builds a vocabulary by iteratively merging the most frequent pairs of tokens:

# Start with bytes (256 base tokens)
text = "hello hello world"
tokens = [104, 101, 108, 108, 111, 32, ...]  # byte values
 
# Find most frequent pair
pairs = count_pairs(tokens)  # (104, 101) appears most
# => merge (h, e) -> token 256
 
# Repeat for vocab_size - 256 merges
# Final vocabulary: [0..255, "he", "ll", "lo", "hello", ...]

Key insight: BPE learns a vocabulary optimized for the training data distribution. Common words become single tokens, rare words are split into subwords.

In practice, train your tokenizer on data similar to your deployment domain. A BPE vocabulary trained on code will waste tokens on prose, and vice versa.

For multilingual models, BPE trained primarily on English will under-represent other languages. Asian characters often become single-byte fallbacks. Train on balanced multilingual data if you need good coverage.

Why BPE Works

No OOV (out-of-vocabulary): Byte fallback ensures any text can be encoded
Compression: Common patterns use fewer tokens (efficiency)
Generalization: Rare words share subword units with common words
Language-agnostic: Works across languages without language-specific rules

BPE Training Algorithm

def train_bpe(text_chunks, vocab_size):
    """Core BPE training loop"""
    # 1. Initialize with byte-level tokens
    words = [chunk.encode("utf-8") for chunk in text_chunks]
    vocab = list(range(256))  # base vocabulary
    merges = {}  # (token_a, token_b) -> merged_token_id
    
    # 2. Iteratively merge most frequent pairs
    for i in range(vocab_size - 256):
        # Count all adjacent pairs
        pair_counts = count_pairs_in_corpus(words)
        
        # Find most frequent pair
        best_pair = max(pair_counts, key=pair_counts.get)
        
        # Create new token for this pair
        new_token_id = 256 + i
        merges[best_pair] = new_token_id
        vocab.append(best_pair)
        
        # Apply merge to all words
        words = apply_merge(words, best_pair, new_token_id)
    
    return vocab, merges

The core challenge: efficiency. Naive implementation is O(n²) or worse. nanochat uses advanced optimizations.

GPT-4 patterns split on whitespace boundaries for cleaner tokens

The Split Pattern

nanochat uses a regex pre-tokenization pattern inspired by GPT-4:

SPLIT_PATTERN = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,2}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""

This pattern makes several design decisions:

Pattern Breakdown

Component	Matches	Purpose
`'(?i:[sdmt]\|ll\|ve\|re)`	Contractions like `'s`, `'m`, `'ll`	Keep common suffixes together
`[^\r\n\p{L}\p{N}]?+\p{L}+`	Words with optional leading punctuation	"hello" or ".hello"
`\p{N}{1,2}`	Numbers (1-2 digits)	Deviation from GPT-4
`?[^\s\p{L}\p{N}]++[\r\n]*`	Punctuation with optional trailing newlines	Symbols like `...`
`\s*[\r\n]`	Newlines with optional preceding whitespace	Line breaks
`\s+(?!\S)`	Trailing whitespace	Spaces at end
`\s+`	Other whitespace	Spaces, tabs

Key Design Decision: `\p{N}{1,2}` vs GPT-4's `\p{N}{1,3}`

# nanochat uses \p{N}{1,2} (1-2 digits)
# GPT-4 uses \p{N}{1,3} (1-3 digits)

Why the change?

Vocabulary efficiency: \p{N}{1,3} creates 1000 base patterns (0-999)
Small model consideration: With vocab_size=65536, we don't want to "waste" tokens on numbers
Trade-off: Slightly worse compression on numeric data, but better vocabulary utilization

TODO note in code: This hasn't been validated experimentally! It's a hypothesis.

Byte Fallback

Critical feature: byte fallback ensures any Unicode text can be encoded:

tokenizer = HFTokenizer(BPE(
    byte_fallback=True,  # Enable byte-level fallback
    unk_token=None,      # No unknown token needed!
    fuse_unk=False,
))

How it works:

Pre-tokenization splits text into chunks (using regex)
Each chunk is UTF-8 encoded to bytes
BPE merges are applied to byte sequences
Unknown patterns fall back to individual bytes (256 base tokens)

Result: No text is ever "unknown"—worst case, it's encoded as raw bytes.

Training pipeline: streaming 10B bytes into 32K vocabulary

Streaming Iterator Design

nanochat trains on 10 billion characters from FineWeb-Edu. Loading this into memory is impractical. Solution: streaming iterator:

def text_iterator():
    """
    Stream text from parquet files:
    1) Flatten batches into single iterator
    2) Crop documents to doc_cap characters
    3) Break after max_chars total
    """
    nchars = 0
    for batch in parquets_iter_batched(split="train"):
        for doc in batch:
            # Crop long documents
            doc_text = doc[:args.doc_cap] if len(doc) > args.doc_cap else doc
            nchars += len(doc_text)
            yield doc_text
            if nchars > args.max_chars:
                return

Design decisions:

Document capping (doc_cap=10,000): Prevents single huge documents from dominating
Total character limit (max_chars=10B): Controls training time vs coverage
Streaming: Never loads full dataset into RAM

Training Command

python scripts/tok_train.py \
    --vocab_size 65536 \
    --max_chars 10_000_000_000 \
    --doc_cap 10000

Vocabulary size choice: 65536 = 2^16

Power of 2: Efficient for hardware (alignment, indexing)
Embedding matrix: vocab_size × d_model is a major memory cost
Trade-off: Larger vocab = better compression but more parameters

Common choices:

GPT-2: 50,257 (slightly odd number)
GPT-4: ~100,000 (higher compression)
nanochat: 65,536 (power of 2, balanced)

Vocabulary size is a memory vs compression trade-off. For small models, 32K-64K is the sweet spot—larger vocabularies waste embedding parameters.

On the inference side, smaller vocabulary = fewer tokens per prompt = lower API costs. The embedding table is a fixed cost; token efficiency is the payoff.

Tokenizer Comparison

Compare tokenization efficiency across different LLM tokenizers

Input Text

107 characters

Visualize Tokens

Token Boundaries (simulated)

Hello,▁world!▁This▁is▁a▁test▁of▁different▁tokenization▁strategies.▁How▁efficiently▁can▁we▁encode▁thi

~28 tokens with GPT-2

Tokenizer	Vocab Size	Algorithm	Est. Tokens	Bytes/Token
GPT-2	50,257	BPE	27	3.96
GPT-4	100,256	BPE (cl100k)	24	4.46
LLaMA	32,000	SentencePiece BPE	29	3.69
T5	32,128	SentencePiece Unigram	26	4.12
BERT	30,522	WordPiece	27	3.96
Byte-level	256	Raw bytes	107	1.00

Most Efficient

GPT-4

Largest Vocab

GPT-4

Token Savings

346% fewer than bytes

Tokenization Trade-offs

Larger vocab: Better compression, but larger embedding matrices
Smaller vocab: More tokens per text, but smaller model size
BPE: Data-driven, handles rare words with subwords
Byte-level: No OOV tokens, but very long sequences

Vocabulary Explorer

Explore the composition of LLM tokenizer vocabularies

Total Vocabulary

32,257

Memory Impact

Embedding Matrix Size:32,257 × 768 = 99 MB (fp32)

With 4096 dim:528 MB (fp32)

Vocabulary Design Considerations

BPE: Merge frequent byte pairs iteratively until vocab size reached
Special tokens: Control sequences like EOS, PAD, system prompts
Subwords: Balance between vocabulary size and sequence length
Unicode: Important for multilingual models

Token Bytes Cache

After training, nanochat computes token_bytes: how many UTF-8 bytes each token represents.

token_bytes = []
for token_id in range(vocab_size):
    token_str = tokenizer.decode([token_id])
    if token_str in special_tokens:
        token_bytes.append(0)  # Special tokens don't count
    else:
        token_bytes.append(len(token_str.encode("utf-8")))
 
# Save for bits-per-byte evaluation
torch.save(torch.tensor(token_bytes), "token_bytes.pt")

Why? This enables bits-per-byte (bpb) evaluation—a tokenizer-agnostic metric:

# Traditional loss: depends on vocabulary size
loss = -log P(token_id)  # varies with vocab_size
 
# Bits-per-byte: normalized by token byte count
bpb = loss / token_bytes[token_id]  # comparable across tokenizers

This metric appears in Post 1.6 (Loss Landscape & Scaling Laws).

Rust BPE trains 100× faster than pure Python

Why Rust?

BPE training is compute-intensive. Python is too slow. nanochat implements the hot path in Rust:

Performance characteristics:

10B characters: ~5-10 minutes on modern CPU
Python equivalent: Hours or days
Speedup: 50-100x

Core Algorithm: Incremental BPE with Heap

nanochat's Rust implementation uses an incremental heap-based algorithm:

struct MergeJob {
    pair: (u32, u32),           // token pair
    count: u64,                 // frequency
    pos: AHashSet<usize>,       // word indices where pair occurs
}
 
fn train_core_incremental(&mut self, words: Vec<Word>, counts: Vec<i32>, vocab_size: u32) {
    // 1. Count all pairs in parallel
    let (pair_counts, where_to_update) = count_pairs_parallel(&words, &counts);
    
    // 2. Build max-heap of pairs by frequency
    let mut heap = OctonaryHeap::with_capacity(pair_counts.len());
    for (pair, pos) in where_to_update {
        heap.push(MergeJob { pair, count: pair_counts[pair], pos });
    }
    
    // 3. Merge loop: process vocab_size - 256 pairs
    for i in 0..(vocab_size - 256) {
        // Pop most frequent pair
        let top = heap.pop().unwrap();
        
        // Lazy refresh: check if count is still valid
        if top.count != pair_counts[top.pair] {
            heap.push(MergeJob { count: pair_counts[top.pair], ..top });
            continue;
        }
        
        // Record merge
        let new_id = 256 + i;
        self.merges.insert(top.pair, new_id);
        
        // Apply merge to affected words
        for &word_idx in &top.pos {
            let deltas = words[word_idx].merge_pair(top.pair, new_id);
            
            // Update pair counts based on deltas
            for (pair, delta) in deltas {
                pair_counts[pair] += delta * counts[word_idx];
                if delta > 0 {
                    heap.push(MergeJob { pair, count: pair_counts[pair], .. });
                }
            }
        }
    }
}

Optimizations

1. Octonary Heap

use dary_heap::OctonaryHeap;  // 8-ary heap vs binary

Why 8-ary? Better cache locality than binary heap—fewer cache misses during heap operations.

2. Lazy Evaluation

// Pop from heap
let top = heap.pop();
 
// Check if count is stale (other merges updated it)
if top.count != pair_counts[top.pair] {
    // Re-push with updated count (lazy refresh)
    heap.push(MergeJob { count: pair_counts[top.pair], ..top });
    continue;
}

Avoids eager heap updates—only refresh when item is popped.

3. Parallel Pair Counting

fn count_pairs_parallel(words: &[Word], counts: &[i32]) -> (HashMap<Pair, i32>, HashMap<Pair, HashSet<usize>>) {
    words.par_iter()  // Rayon parallel iterator
        .enumerate()
        .map(|(i, word)| {
            // Count pairs in this word
            let mut local_counts = HashMap::new();
            for (a, b) in word.pairs() {
                *local_counts.entry((a, b)).or_default() += counts[i];
            }
            local_counts
        })
        .reduce(/* merge local counts */)
}

Uses Rayon for data parallelism—scales to all CPU cores.

4. Incremental Updates

When merging pair (a, b) -> new_id, only affected pairs change:

fn merge_pair(&mut self, pair: (a, b), new_id) -> Vec<(Pair, i32)> {
    let mut deltas = Vec::new();
    
    // For each occurrence of (a, b):
    // - Remove pairs: (left, a), (a, b), (b, right)
    // - Add pairs: (left, new_id), (new_id, right)
    
    if let Some(left) = left_neighbor {
        deltas.push(((left, a), -1));      // removed
        deltas.push(((left, new_id), +1)); // added
    }
    deltas.push(((a, b), -1));             // removed
    if let Some(right) = right_neighbor {
        deltas.push(((b, right), -1));     // removed
        deltas.push(((new_id, right), +1)); // added
    }
    
    deltas
}

Efficiency: Only track ~3-5 pair changes per merge (not full recount).

Streaming Ingestion

The Rust code releases the GIL for parallel processing:

pub fn train_from_iterator(&mut self, py: Python, iterator: &PyAny, vocab_size: u32, buffer_size: usize) {
    let mut buf: Vec<String> = Vec::with_capacity(buffer_size);
    
    loop {
        // 1. Refill buffer (under GIL)
        let exhausted = refill(&mut buf, iterator)?;
        
        // 2. Process buffer (release GIL, parallel)
        let local_counts = py.allow_threads(|| {
            buf.par_iter()
                .map(|text| {
                    // Apply regex, count chunks
                    let mut counts = HashMap::new();
                    for chunk in pattern.find_iter(text) {
                        *counts.entry(chunk).or_default() += 1;
                    }
                    counts
                })
                .reduce(/* merge */)
        });
        
        // 3. Merge into global counts
        for (chunk, count) in local_counts {
            *global_counts.entry(chunk).or_default() += count;
        }
        
        if exhausted { break; }
    }
}

Pattern:

Acquire GIL → read buffer_size strings from Python iterator
Release GIL → process strings in parallel (Rayon)
Acquire GIL → merge results, repeat

Maximizes throughput by minimizing GIL contention.

tiktoken inference handles 10M+ tokens per second

Training vs Inference Split

nanochat uses two libraries:

Phase	Library	Why?
Training	RustBPE	Optimized incremental algorithm
Inference	tiktoken	OpenAI's ultra-fast encoder

RustBPE training produces:

pattern: The regex split pattern
mergeable_ranks: Dictionary mapping token bytes → token ID

These are fed into tiktoken for inference:

# After training with RustBPE
pattern = tokenizer.get_pattern()
mergeable_ranks_list = tokenizer.get_mergeable_ranks()
mergeable_ranks = {bytes(k): v for k, v in mergeable_ranks_list}
 
# Add special tokens
special_tokens = {
    "<|bos|>": 65536,
    "<|user_start|>": 65537,
    "<|user_end|>": 65538,
    # ... 9 special tokens total
}
 
# Create tiktoken Encoding
enc = tiktoken.Encoding(
    name="rustbpe",
    pat_str=pattern,
    mergeable_ranks=mergeable_ranks,
    special_tokens=special_tokens,
)

tiktoken Performance

tiktoken is extremely fast:

# Single string
tokens = enc.encode_ordinary("Hello, world!")
 
# Batch encoding (parallel)
texts = ["text1", "text2", ..., "text1000"]
tokens_batch = enc.encode_ordinary_batch(texts, num_threads=8)

Why so fast?

Rust implementation: Zero-copy string operations
Parallel batch encoding: Scales to 8+ threads
Optimized BPE merge: Uses precomputed merge priorities

Typical throughput: 10-50 MB/s per core (varies by text).

Special Token Handling

tiktoken distinguishes ordinary vs special tokens:

# Ordinary encoding: treats special tokens as text
tokens = enc.encode_ordinary("<|bos|>Hello")
# => [60, 124, 98, 111, 115, 124, 62, 9906]  (raw bytes)
 
# Special encoding: recognizes special token
tokens = enc.encode("<|bos|>Hello", allowed_special="all")
# => [65536, 9906]  (special token ID + "Hello")
 
# Single special token
bos_id = enc.encode_single_token("<|bos|>")  # => 65536

nanochat uses encode_ordinary for user text and explicitly injects special tokens.

Special tokens mark conversation boundaries for chat

Token Inventory

nanochat defines 9 special tokens:

SPECIAL_TOKENS = [
    "<|bos|>",           # Beginning of sequence (every document)
    "<|user_start|>",    # User message delimiter
    "<|user_end|>",
    "<|assistant_start|>",  # Assistant message delimiter
    "<|assistant_end|>",
    "<|python_start|>",  # Tool use: Python REPL
    "<|python_end|>",
    "<|output_start|>",  # Tool output
    "<|output_end|>",
]

Design Philosophy

Explicit delimiters: Clear boundaries between messages
Role-based: Different tokens for user vs assistant
Tool use: Dedicated tokens for Python code and outputs
No implicit behavior: All special tokens are explicit in the format

Why Not Use Text Markers?

Some systems use text-based markers:

User: Hello
Assistant: Hi there!

Problems:

Ambiguous: What if user types "User:" in their message?
Tokenization artifacts: "User:" might split across tokens
No structural guarantee: Model must learn format from examples

Special tokens enforce structure at the tokenization level—impossible to generate malformed conversations.

Conversation rendering produces supervision masks

The Challenge

Training chat models requires converting conversations into token sequences with supervision masking:

conversation = {
    "messages": [
        {"role": "user", "content": "What is 2+2?"},
        {"role": "assistant", "content": "2+2 equals 4."},
    ]
}
 
# Need to produce:
# tokens: [<bos>, <user_start>, "What", "is", "2", "+", "2", "?", <user_end>,
#          <assistant_start>, "2", "+", "2", "equals", "4", ".", <assistant_end>]
# mask:   [0,     0,            0,     0,   0,  0,  0,  0,  0,
#          0,                  1,   1,   1,   1,       1,   1,   1]

Mask semantics:

0 = Do not supervise (no gradient)
1 = Supervise (compute loss, backprop)

Why? We only want to train on assistant responses, not user messages.

Render Implementation

def render_conversation(self, conversation, max_tokens=2048):
    """
    Tokenize a conversation and return (ids, mask).
    - ids: list of token IDs
    - mask: 1 for assistant tokens (supervised), 0 otherwise
    """
    ids, mask = [], []
    
    def add_tokens(token_ids, mask_val):
        if isinstance(token_ids, int):
            token_ids = [token_ids]
        ids.extend(token_ids)
        mask.extend([mask_val] * len(token_ids))
    
    # Fetch special token IDs
    bos = self.encode_special("<|bos|>")
    user_start = self.encode_special("<|user_start|>")
    user_end = self.encode_special("<|user_end|>")
    assistant_start = self.encode_special("<|assistant_start|>")
    assistant_end = self.encode_special("<|assistant_end|>")
    
    # Add BOS (unsupervised)
    add_tokens(bos, 0)
    
    # Process messages
    for i, message in enumerate(conversation["messages"]):
        role = message["role"]
        content = message["content"]
        
        if role == "user":
            add_tokens(user_start, 0)
            add_tokens(self.encode(content), 0)
            add_tokens(user_end, 0)
            
        elif role == "assistant":
            add_tokens(assistant_start, 0)  # Start token not supervised
            add_tokens(self.encode(content), 1)  # Content IS supervised
            add_tokens(assistant_end, 1)  # End token IS supervised
    
    # Truncate to max_tokens
    ids = ids[:max_tokens]
    mask = mask[:max_tokens]
    
    return ids, mask

System Message Handling

Some conversations start with a system message:

{
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
        {"role": "assistant", "content": "Hi there!"},
    ]
}

nanochat merges system into the first user message:

if conversation["messages"][0]["role"] == "system":
    conversation = copy.deepcopy(conversation)
    messages = conversation["messages"]
    assert messages[1]["role"] == "user"
    # Prepend system content to user message
    messages[1]["content"] = messages[0]["content"] + "\n\n" + messages[1]["content"]
    messages = messages[1:]  # Remove system message

Why? Simplifies format—no need for separate <|system_start|> tokens.

Tool Use: Python REPL

Assistants can invoke Python code:

{
    "role": "assistant",
    "content": [
        {"type": "text", "text": "Let me calculate that:"},
        {"type": "python", "text": "print(2+2)"},
        {"type": "python_output", "text": "4"},
        {"type": "text", "text": "The answer is 4."},
    ]
}

Rendering:

add_tokens(assistant_start, 0)
 
for part in content:
    value_ids = self.encode(part["text"])
    
    if part["type"] == "text":
        add_tokens(value_ids, 1)  # Supervised
        
    elif part["type"] == "python":
        add_tokens(python_start, 1)
        add_tokens(value_ids, 1)  # Supervised: learn to generate code
        add_tokens(python_end, 1)
        
    elif part["type"] == "python_output":
        add_tokens(output_start, 0)  # NOT supervised
        add_tokens(value_ids, 0)     # Output comes from Python, not model
        add_tokens(output_end, 0)
 
add_tokens(assistant_end, 1)

Key insight: Python outputs are not supervised—they're generated by the REPL, not the model.

Visualization Helper

Debugging tokenization is crucial. nanochat provides a visualizer:

def visualize_tokenization(self, ids, mask):
    """Color-code tokens by supervision mask"""
    RED = '\033[91m'    # Unsupervised
    GREEN = '\033[92m'  # Supervised
    RESET = '\033[0m'
    
    tokens = []
    for token_id, mask_val in zip(ids, mask):
        token_str = self.decode([token_id])
        color = GREEN if mask_val == 1 else RED
        tokens.append(f"{color}{token_str}{RESET}")
    
    return '|'.join(tokens)

Example output:

<|bos|>|<|user_start|>|What|is|2|+|2|?|<|user_end|>|<|assistant_start|>|2|+|2|equals|4|.|<|assistant_end|>
  RED      RED           RED RED RED RED RED   RED           RED              GREEN GREEN GREEN GREEN GREEN GREEN

Green = model is trained on these tokens, Red = ignored in loss.

Bits-per-byte normalizes evaluation across tokenizers

Compression Ratio

The primary metric: bytes per token (higher = better compression):

text = "Hello, world!"
encoded = tokenizer.encode(text)
encoded_bytes = text.encode('utf-8')
 
compression_ratio = len(encoded_bytes) / len(encoded)
# Higher ratio = fewer tokens for same text = better compression

Comparing to GPT-2 and GPT-4

scripts/tok_eval.py compares nanochat's tokenizer against baselines:

# Load tokenizers
gpt2_tok = RustBPETokenizer.from_pretrained("gpt2")
gpt4_tok = RustBPETokenizer.from_pretrained("cl100k_base")
ours_tok = get_tokenizer()
 
# Test on diverse data
test_data = [
    ("news", news_article),
    ("korean", korean_text),
    ("code", python_code),
    ("math", latex_math),
    ("science", technical_prose),
]
 
# Encode and compare
for name, text in test_data:
    gpt2_tokens = gpt2_tok.encode(text)
    gpt4_tokens = gpt4_tok.encode(text)
    ours_tokens = ours_tok.encode(text)
    
    gpt2_ratio = len(text.encode('utf-8')) / len(gpt2_tokens)
    gpt4_ratio = len(text.encode('utf-8')) / len(gpt4_tokens)
    ours_ratio = len(text.encode('utf-8')) / len(ours_tokens)
    
    print(f"{name:10} GPT-2: {gpt2_ratio:.2f}  GPT-4: {gpt4_ratio:.2f}  Ours: {ours_ratio:.2f}")

Example output:

Vocab sizes:
GPT-2: 50257
GPT-4: 100256
Ours: 65536

Comparison with GPT-2:
==========================================================================================
Text Type  Bytes    GPT-2           Ours            Relative     Better
                    Tokens  Ratio   Tokens  Ratio   Diff %
------------------------------------------------------------------------------------------
news       1087     295     3.69    275     3.95    +6.8%        Ours
korean     385      180     2.14    145     2.66    +19.4%       Ours
code       876      251     3.49    245     3.58    +2.4%        Ours
math       1547     556     2.78    520     2.98    +6.5%        Ours
science    715      194     3.69    185     3.86    +4.6%        Ours
fwe-train  10247    2568    3.99    2450    4.18    +4.6%        Ours

Insights:

Korean text: Large improvement (+19%) due to better multilingual support
English text: Modest improvement (+5-7%) due to similar training data
Code/Math: Competitive, slightly better due to domain coverage

Bits-Per-Byte Metric

From Post 1.6, bits-per-byte (bpb) normalizes loss across tokenizers:

# Load token_bytes mapping
token_bytes = get_token_bytes(device="cuda")  # [vocab_size]
 
# During evaluation
losses = F.cross_entropy(logits, targets, reduction='none')  # [batch, seq_len]
 
# Weight each loss by token byte count
token_bytes_flat = token_bytes[targets.flatten()]  # [batch*seq_len]
valid_mask = token_bytes_flat > 0  # Exclude special tokens
 
# Bits-per-byte
bpb = (losses.flatten()[valid_mask] / token_bytes_flat[valid_mask]).mean()

Why bpb?

Metric	Formula	Problem
Loss	`-log P(token)`	Depends on vocab_size
Perplexity	`exp(loss)`	Still vocab-dependent
bpb	`loss / token_bytes`	Vocab-agnostic

A model with vocab_size=50k and one with vocab_size=100k can be fairly compared using bpb.

Rust vs Python: when speed matters vs when it doesn't

HuggingFace vs RustBPE + tiktoken

nanochat provides two implementations:

HuggingFace Tokenizers

class HuggingFaceTokenizer:
    """Pure Python, uses HF tokenizers library"""
    
    @classmethod
    def train_from_iterator(cls, text_iterator, vocab_size):
        tokenizer = HFTokenizer(BPE(byte_fallback=True))
        # ... configure pre-tokenizer, decoder
        trainer = BpeTrainer(vocab_size=vocab_size, special_tokens=SPECIAL_TOKENS)
        tokenizer.train_from_iterator(text_iterator, trainer)
        return cls(tokenizer)

Pros:

✅ Single library for training + inference
✅ No Rust compilation required
✅ Widely used, well-documented

Cons:

❌ Slower training (~10-20x vs RustBPE)
❌ Confusing API (many configuration options)
❌ Slower inference (~2-5x vs tiktoken)

RustBPE + tiktoken

class RustBPETokenizer:
    """Rust training, tiktoken inference"""
    
    @classmethod
    def train_from_iterator(cls, text_iterator, vocab_size):
        # Train with Rust
        tokenizer = rustbpe.Tokenizer()
        tokenizer.train_from_iterator(text_iterator, vocab_size, pattern=SPLIT_PATTERN)
        
        # Export to tiktoken
        pattern = tokenizer.get_pattern()
        mergeable_ranks = tokenizer.get_mergeable_ranks()
        enc = tiktoken.Encoding(name="rustbpe", pat_str=pattern, mergeable_ranks=mergeable_ranks, ...)
        return cls(enc)

Pros:

✅ Fast training (~50-100x speedup)
✅ Ultra-fast inference (tiktoken is highly optimized)
✅ Parallel batch encoding

Cons:

❌ Requires Rust toolchain for compilation
❌ Two-library complexity (rustbpe + tiktoken)
❌ Custom code to maintain

When to Use Each

Use Case	Recommendation
Production training	RustBPE + tiktoken (speed critical)
Research/experimentation	HuggingFace (easier iteration)
CPU-only environment	Both work, but RustBPE still faster
No Rust compiler	HuggingFace (pure Python)

nanochat default: RustBPE + tiktoken for performance.

These patterns prevent tokenization failures

Best Practices

1. Vocabulary Size Selection

# Good: Powers of 2 for efficiency
vocab_sizes = [32768, 65536, 131072]
 
# Avoid: Arbitrary sizes
vocab_sizes = [50000, 75000]  # Worse for hardware alignment

Trade-offs:

Smaller vocab (32k): Fewer parameters, lower memory, worse compression
Larger vocab (128k): Better compression, more parameters, higher memory

2. Document Capping

# Good: Cap individual documents
def text_iterator():
    for doc in dataset:
        yield doc[:10000]  # Prevent single huge doc from dominating
 
# Bad: No capping
def text_iterator():
    for doc in dataset:
        yield doc  # 100MB document consumes 1% of training data alone

3. Special Token Design

# Good: Explicit, unambiguous delimiters
SPECIAL_TOKENS = ["<|bos|>", "<|user_start|>", "<|user_end|>"]
 
# Bad: Ambiguous markers (could appear in text)
SPECIAL_TOKENS = ["[BOS]", "[USER]", "[/USER]"]

4. Supervision Masking

# Good: Only supervise assistant tokens
if role == "assistant":
    add_tokens(content_ids, mask=1)  # Supervised
else:
    add_tokens(content_ids, mask=0)  # Not supervised
 
# Bad: Supervise everything (including user messages)
add_tokens(content_ids, mask=1)  # Model learns to generate user messages

Common Pitfalls

Pitfall 1: Forgetting Byte Fallback

# Bad: No byte fallback
tokenizer = BPE(byte_fallback=False, unk_token="<UNK>")
# Problem: Unknown characters → <UNK> token → information loss
 
# Good: Byte fallback enabled
tokenizer = BPE(byte_fallback=True, unk_token=None)
# Solution: Unknown patterns → individual bytes (no information loss)

Pitfall 2: Inconsistent Regex Patterns

# Bad: Different patterns for training vs inference
train_pattern = r"\w+|\d+|[^\w\s]+"
inference_pattern = r"\w+|\s+|."  # Oops, different!
 
# Good: Store pattern with tokenizer
tokenizer.pattern = SPLIT_PATTERN  # Use same pattern always

Pitfall 3: Special Token Injection

# Bad: Treating special tokens as ordinary text
text = "<|bos|>Hello"
tokens = tokenizer.encode(text)  # Encodes "<|bos|>" as 5-6 tokens
 
# Good: Explicit special token injection
tokens = [tokenizer.encode_special("<|bos|>")] + tokenizer.encode("Hello")
# => [65536, 9906]  (correct)

Pitfall 4: Ignoring Token Bytes

# Bad: Using raw loss for evaluation
loss = F.cross_entropy(logits, targets).mean()
# Problem: Loss depends on vocab_size (not comparable)
 
# Good: Normalize by token bytes
token_bytes_flat = token_bytes[targets]
bpb = (loss / token_bytes_flat).mean()
# Solution: Vocabulary-agnostic metric

Pitfall 5: Truncation Without Padding

# Bad: Truncate but don't track actual lengths
ids = ids[:max_tokens]  # Lost information about original length
 
# Good: Track lengths or use attention masks
ids = ids[:max_tokens]
actual_length = min(len(original_ids), max_tokens)
# Use actual_length for loss masking

Debugging Tokenization

When things go wrong:

# 1. Visualize tokenization
viz = tokenizer.visualize_tokenization(ids, mask)
print(viz)  # Color-coded output
 
# 2. Decode individual tokens
for token_id in ids:
    print(f"{token_id}: {tokenizer.decode([token_id])!r}")
 
# 3. Check special token IDs
special_ids = {name: tokenizer.encode_special(name) for name in SPECIAL_TOKENS}
print(special_ids)
 
# 4. Verify encode/decode round-trip
text = "Test 你好 123"
tokens = tokenizer.encode(text)
decoded = tokenizer.decode(tokens)
assert decoded == text, f"Round-trip failed: {text!r} != {decoded!r}"

Tokenization is the interface between text and neural networks

Get it wrong, and no amount of architecture engineering will save you.

Key takeaways:

BPE with byte fallback ensures no text is ever "unknown"
GPT-4 style regex splitting creates linguistically meaningful chunks
RustBPE + tiktoken offers best-in-class training and inference speed
Special tokens enforce conversation structure at tokenization level
Supervision masking ensures models learn assistant behavior, not user imitation
Bits-per-byte enables fair comparison across tokenizers
Vocabulary size is a critical hyperparameter balancing compression vs parameters

The tokenizer you train sets the foundation for everything downstream—training data efficiency, model capacity utilization, and generation quality. Spend time getting it right.

Before you train the model, you train the tokenizer. Get this wrong, and everything else is wasted compute.

Sources

Institutional and Industry Research

Epoch AI — Tracks tokenization efficiency trends and vocabulary optimization (as of January 2025).
Stanford HAI AI Index — Annual report on tokenization methods and their impact on model efficiency.
MLCommons MLPerf Inference — Benchmarks showing tokenization overhead in production systems.
OpenAI Research — GPT tokenizer evolution and best practices.

Research Papers

Neural Machine Translation of Rare Words with Subword Units (BPE) (Sennrich et al., 2016) - The foundational paper introducing Byte Pair Encoding for neural text processing, adapting the compression algorithm for subword tokenization. Published at ACL 2016. arXiv:1508.07909
SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing (Kudo & Richardson, 2018) - Introduces SentencePiece, which enables end-to-end training from raw text without pre-tokenization. Published at EMNLP 2018 as a demo paper. arXiv:1808.06226
Language Models are Unsupervised Multitask Learners (GPT-2) (Radford et al., 2019) - Introduces byte-level BPE for language models, demonstrating that tokenization with byte fallback eliminates out-of-vocabulary issues. OpenAI Blog
A New Algorithm for Data Compression (Gage, 1994) - The original Byte Pair Encoding algorithm for data compression, later adapted for NLP tokenization. C Users Journal

Tokenization Libraries

tiktoken - OpenAI's fast BPE tokenizer written in Rust with Python bindings. Used by GPT-4 and Claude for efficient encoding/decoding. GitHub
Hugging Face Tokenizers - Fast tokenizers library supporting BPE, WordPiece, and Unigram models with byte fallback. Written in Rust with Python bindings. GitHub
SentencePiece - Google's language-independent subword tokenizer supporting BPE and Unigram models, used by many open-source LLMs. GitHub

Technical Documentation

OpenAI Tokenizer Tool - Interactive tool for visualizing GPT tokenization, useful for understanding token boundaries and vocabulary. OpenAI Tokenizer
Hugging Face Tokenizers Documentation - Comprehensive guide to the tokenizers library including training, configuration, and special token handling. HF Tokenizers Docs

BPE-dropout: Simple and Effective Subword Regularization (Provilkov et al., 2020) - Introduces dropout during BPE encoding for improved robustness. arXiv:1910.13267
Unigram Language Model (Kudo, 2018) - Alternative to BPE that uses a unigram language model for subword segmentation. arXiv:1804.10959

Post 1.5: Training Data Pipeline - How tokenized data flows through dataloaders
Post 1.6: Loss Landscape & Scaling Laws - Bits-per-byte in evaluation
Post 2.2: Fine-tuning for Chat (SFT) - Using render_conversation for SFT
tiktoken documentation
BPE paper: "Neural Machine Translation of Rare Words with Subword Units" (Sennrich et al., 2016)

Exercises

Experiment with vocabulary size: Train tokenizers with vocab_size=32768 and vocab_size=131072. Compare compression ratios and model performance.
Analyze token distribution: Plot histogram of token frequencies from trained tokenizer. What percentage of tokens account for 50% of usage?
Multilingual compression: Evaluate compression ratio on 10+ languages. Which languages compress best/worst? Why?
Custom regex patterns: Modify SPLIT_PATTERN to handle code better (e.g., keep -> or == together). Measure impact on code compression.
Special token ablation: Train a model with vs without special tokens (using text markers instead). Compare chat quality.

Next Post: Post 2.6: Memory Optimization Techniques - Gradient accumulation, mixed precision, activation checkpointing, batch size tuning

Before you train your tokenizer:

Choose vocabulary size as a power of 2. 32768, 65536, or 131072 align with hardware—arbitrary sizes waste memory.
Enable byte fallback always. Without it, unknown characters become <UNK> tokens and information is lost forever.
Cap individual documents. A single 100MB document shouldn't consume 1% of your training data—limit to 10K characters.
Use the same regex pattern for training and inference. Different patterns mean different tokenizations—subtle bugs that destroy model quality.
Test encode/decode round-trip. If decode(encode(text)) != text, you have a bug that will corrupt your entire dataset.

Part of the nanochat Deep-Dive Series • Code: nanochat on GitHub