José David Baena

Tokenizer Design Choices: BPE, Vocabulary, and Implementation


Track 2: Practical Guides - Post 2.5 of 6

This post explores tokenization fundamentals: BPE algorithm, vocabulary design, dual Rust/Python implementation, special tokens for chat, and tokenizer-agnostic evaluation.

Introduction

Tokenization is the foundational layer of any language model—it determines how text is converted into discrete tokens that the model can process. Poor tokenization choices can cripple model performance, waste compute on suboptimal token representations, and create artifacts in generation.

nanochat implements Byte Pair Encoding (BPE) tokenization in the style of GPT-4, with two complementary implementations:

  1. RustBPE + tiktoken: Fast training in Rust, ultra-fast inference with tiktoken
  2. HuggingFace Tokenizers: Single-library fallback for compatibility (Python API over a Rust core)

In this post:

  • BPE algorithm fundamentals and why it works
  • GPT-4 style design choices: regex patterns, byte fallback, special tokens
  • Training pipeline: streaming data, vocabulary size selection
  • Inference optimizations: tiktoken integration, parallel encoding
  • Conversation formatting: special tokens for chat, supervision masking
  • Evaluation metrics: bits-per-byte for tokenizer-agnostic comparison
  • Implementation comparison: Rust vs Python trade-offs

Table of Contents

  1. BPE Algorithm Fundamentals
  2. GPT-4 Style Tokenization
  3. Training Pipeline
  4. The RustBPE Implementation
  5. Inference with tiktoken
  6. Special Tokens for Chat
  7. Conversation Rendering
  8. Tokenizer Evaluation
  9. Implementation Trade-offs
  10. Best Practices & Common Pitfalls

BPE Algorithm Fundamentals

What is Byte Pair Encoding?

BPE is a data compression algorithm adapted for tokenization. It builds a vocabulary by iteratively merging the most frequent pairs of tokens:

# Start with bytes (256 base tokens)
text = "hello hello world"
tokens = [104, 101, 108, 108, 111, 32, ...]  # byte values
 
# Find most frequent pair
pairs = count_pairs(tokens)  # (104, 101) "he" is tied for the highest count
# => merge (h, e) -> token 256 (ties are broken by a fixed rule)
 
# Repeat for vocab_size - 256 merges
# Final vocabulary: [0..255, "he", "ll", "lo", "hello", ...]

Key insight: BPE learns a vocabulary optimized for the training data distribution. Common words become single tokens, rare words are split into subwords.

Why BPE Works

  1. No OOV (out-of-vocabulary): Byte fallback ensures any text can be encoded
  2. Compression: Common patterns use fewer tokens (efficiency)
  3. Generalization: Rare words share subword units with common words
  4. Language-agnostic: Works across languages without language-specific rules

BPE Training Algorithm

def train_bpe(text_chunks, vocab_size):
    """Core BPE training loop"""
    # 1. Initialize with byte-level tokens
    words = [chunk.encode("utf-8") for chunk in text_chunks]
    vocab = list(range(256))  # base vocabulary
    merges = {}  # (token_a, token_b) -> merged_token_id
    
    # 2. Iteratively merge most frequent pairs
    for i in range(vocab_size - 256):
        # Count all adjacent pairs
        pair_counts = count_pairs_in_corpus(words)
        
        # Find most frequent pair
        best_pair = max(pair_counts, key=pair_counts.get)
        
        # Create new token for this pair
        new_token_id = 256 + i
        merges[best_pair] = new_token_id
        vocab.append(best_pair)
        
        # Apply merge to all words
        words = apply_merge(words, best_pair, new_token_id)
    
    return vocab, merges

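The core challenge is efficiency. This naive loop recounts every pair in the corpus on every merge, so its cost grows as (number of merges) × (corpus size). nanochat avoids the full recount with the incremental, heap-based algorithm described in the RustBPE section.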
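The two helpers that train_bpe leaves undefined can be sketched in a few lines. What follows is a deliberately naive, self-contained version: the helper names follow the pseudocode above, while the tie-breaking rule (highest count, then smallest pair) and the early exit when no pairs remain are assumptions made for this sketch, not nanochat's behavior.

```python
from collections import Counter

def count_pairs_in_corpus(words):
    """Count adjacent token pairs across all words (a full recount)."""
    counts = Counter()
    for word in words:
        for pair in zip(word, word[1:]):
            counts[pair] += 1
    return counts

def apply_merge(words, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    merged_words = []
    for word in words:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged_words.append(out)
    return merged_words

def train_bpe_naive(text_chunks, vocab_size):
    """Naive BPE: recount all pairs before every merge."""
    words = [list(chunk.encode("utf-8")) for chunk in text_chunks]
    merges = {}
    for i in range(vocab_size - 256):
        pair_counts = count_pairs_in_corpus(words)
        if not pair_counts:
            break  # nothing left to merge
        # Assumed tie-break: highest count first, then smallest pair
        best_pair = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges[best_pair] = 256 + i
        words = apply_merge(words, best_pair, 256 + i)
    return merges, words

# Three pre-tokenized chunks, three merges (vocab_size = 256 + 3)
merges, words = train_bpe_naive(["hello", " hello", " world"], vocab_size=259)
print(merges)    # {(101, 108): 256, (104, 256): 257, (108, 111): 258}
print(words[0])  # [257, 258]
```

Every iteration walks the whole corpus twice (once to count, once to merge), which is the cost the incremental algorithm is designed to remove.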


GPT-4 Style Tokenization

The Split Pattern

nanochat uses a regex pre-tokenization pattern inspired by GPT-4:

SPLIT_PATTERN = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,2}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""

This pattern makes several design decisions:

Pattern Breakdown

Component                   Matches                                                         Purpose
--------------------------  --------------------------------------------------------------  -----------------------------
'(?i:[sdmt]|ll|ve|re)       Contractions like 's, 'm, 'll                                   Keep common suffixes together
[^\r\n\p{L}\p{N}]?+\p{L}+   Words with one optional leading non-letter, non-digit character  "hello", " hello", ".hello"
\p{N}{1,2}                  Numbers (1-2 digits)                                            Deviation from GPT-4
 ?[^\s\p{L}\p{N}]++[\r\n]*  Punctuation with optional trailing newlines                     Symbols like ...
                            (the pattern starts with an optional literal space)
\s*[\r\n]                   Newlines with optional preceding whitespace                     Line breaks
\s+(?!\S)                   Trailing whitespace                                             Spaces at end
\s+                         Other whitespace                                                Spaces, tabs

Key Design Decision: \p{N}{1,2} vs GPT-4's \p{N}{1,3}

# nanochat uses \p{N}{1,2} (1-2 digits)
# GPT-4 uses \p{N}{1,3} (1-3 digits)

Why the change?

  • Vocabulary efficiency: \p{N}{1,3} allows up to 1,110 distinct ASCII digit chunks (10 + 100 + 1,000, counting leading zeros), versus 110 for \p{N}{1,2}
  • Small model consideration: With vocab_size=65536, we don't want to "waste" tokens on numbers
  • Trade-off: Slightly worse compression on numeric data, but better vocabulary utilization

TODO note in code: This hasn't been validated experimentally! It's a hypothesis.
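The effect of the digit limit is easy to see on a long number. The sketch below uses Python's built-in re module with \d as a simplified stand-in for \p{N} (an assumption: the real pattern needs the third-party regex module and also covers non-ASCII numerals). The function names are made up for this illustration.

```python
import re

def digit_chunks(text, max_len):
    """Greedily split digit runs into chunks of at most `max_len` digits."""
    return re.findall(r"\d{1,%d}" % max_len, text)

def num_possible_chunks(max_len):
    """Count distinct ASCII digit strings of length 1..max_len."""
    return sum(10 ** k for k in range(1, max_len + 1))

two = digit_chunks("1234567", 2)
three = digit_chunks("1234567", 3)
print(two)    # ['12', '34', '56', '7']
print(three)  # ['123', '456', '7']
print(num_possible_chunks(2), num_possible_chunks(3))  # 110 1110
```

With a limit of two digits, the same number costs four chunks instead of three, but the tokenizer can never spend more than 110 vocabulary entries on whole digit chunks.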

Byte Fallback

Critical feature: byte fallback ensures any Unicode text can be encoded:

from tokenizers import Tokenizer as HFTokenizer
from tokenizers.models import BPE

tokenizer = HFTokenizer(BPE(
    byte_fallback=True,  # Enable byte-level fallback
    unk_token=None,      # No unknown token needed!
    fuse_unk=False,
))

How it works:

  1. Pre-tokenization splits text into chunks (using regex)
  2. Each chunk is UTF-8 encoded to bytes
  3. BPE merges are applied to byte sequences
  4. Unknown patterns fall back to individual bytes (256 base tokens)

Result: No text is ever "unknown"—worst case, it's encoded as raw bytes.
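The worst case is easy to check without any tokenizer library. This minimal sketch applies no merges at all, so every string is represented purely by the 256 base tokens; the helper names are hypothetical.

```python
def to_base_tokens(text):
    """Encode text as raw UTF-8 byte values (the 256 base tokens)."""
    return list(text.encode("utf-8"))

def from_base_tokens(ids):
    """Decode a list of byte values back into text."""
    return bytes(ids).decode("utf-8")

# Accented Latin, Chinese, and an emoji, written with escapes for clarity
sample = "na\u00efve \u4f60\u597d \U0001F642"
ids = to_base_tokens(sample)

print(len(sample), len(ids))            # 10 18
print(all(0 <= i < 256 for i in ids))   # True
print(from_base_tokens(ids) == sample)  # True
```

Ten characters become eighteen tokens here, which is why learned merges matter for compression, but nothing is ever lost.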


Training Pipeline

Streaming Iterator Design

nanochat trains its tokenizer on up to 10 billion characters from FineWeb-Edu. Loading that much text into memory is impractical, so the training script streams it through an iterator:

def text_iterator():
    """
    Stream text from parquet files:
    1) Flatten batches into single iterator
    2) Crop documents to doc_cap characters
    3) Break after max_chars total
    """
    nchars = 0
    for batch in parquets_iter_batched(split="train"):
        for doc in batch:
            # Crop long documents
            doc_text = doc[:args.doc_cap] if len(doc) > args.doc_cap else doc
            nchars += len(doc_text)
            yield doc_text
            if nchars > args.max_chars:
                return

Design decisions:

  1. Document capping (doc_cap=10,000): Prevents single huge documents from dominating
  2. Total character limit (max_chars=10B): Controls training time vs coverage
  3. Streaming: Never loads full dataset into RAM
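To see how the two limits interact, here is a self-contained variant of the iterator that takes its data and limits as arguments. The function name and the tiny in-memory batches are stand-ins for this sketch; the real script reads parquet files.

```python
def capped_text_iterator(batches, doc_cap, max_chars):
    """Yield documents cropped to doc_cap until max_chars is exceeded."""
    nchars = 0
    for batch in batches:
        for doc in batch:
            doc_text = doc[:doc_cap]  # slicing is a no-op for short docs
            nchars += len(doc_text)
            yield doc_text
            if nchars > max_chars:
                return

# A generator of batches, so nothing is materialized up front
batches = (b for b in [["a" * 50, "b" * 5], ["c" * 30, "d" * 30]])
docs = list(capped_text_iterator(batches, doc_cap=10, max_chars=20))
print([len(d) for d in docs])  # [10, 5, 10]
```

Because the check happens after the yield, the total can overshoot max_chars by at most doc_cap characters, which is negligible at the 10B scale.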

Training Command

python scripts/tok_train.py \
    --vocab_size 65536 \
    --max_chars 10_000_000_000 \
    --doc_cap 10000

Vocabulary size choice: 65536 = 2^16

  • Power of 2: Efficient for hardware (alignment, indexing)
  • Embedding matrix: vocab_size × d_model is a major memory cost
  • Trade-off: Larger vocab = better compression but more parameters

Common choices:

  • GPT-2: 50,257 (slightly odd number)
  • GPT-4: ~100,000 (higher compression)
  • nanochat: 65,536 (power of 2, balanced)
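The parameter cost of each choice is simple arithmetic. The sketch below assumes a model width of 1280, a hypothetical value picked only to make the numbers concrete.

```python
def embedding_params(vocab_size, d_model):
    """Number of parameters in a vocab_size x d_model embedding table."""
    return vocab_size * d_model

D_MODEL = 1280  # hypothetical model width, chosen only for illustration

sizes = {"GPT-2": 50257, "nanochat": 65536, "2x nanochat": 131072}
params = {name: embedding_params(v, D_MODEL) for name, v in sizes.items()}
for name, n in params.items():
    print(f"{name:12} {n / 1e6:6.1f}M parameters")
```

If the output projection is a separate matrix of the same shape rather than tied to the embedding, the vocabulary-dependent cost doubles.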

Token Bytes Cache

After training, nanochat computes token_bytes: how many UTF-8 bytes each token represents.

token_bytes = []
for token_id in range(vocab_size):
    token_str = tokenizer.decode([token_id])
    if token_str in special_tokens:
        token_bytes.append(0)  # Special tokens don't count
    else:
        token_bytes.append(len(token_str.encode("utf-8")))
 
# Save for bits-per-byte evaluation
torch.save(torch.tensor(token_bytes), "token_bytes.pt")

Why? This enables bits-per-byte (bpb) evaluation—a tokenizer-agnostic metric:

# Traditional loss: depends on vocabulary size
loss = -log P(token_id)  # nats per token; varies with vocab_size
 
# Bits-per-byte: total loss (converted to bits) divided by total bytes
bpb = sum(loss) / (ln(2) * sum(token_bytes[token_id]))  # comparable across tokenizers

This metric appears in Post 1.6 (Loss Landscape & Scaling Laws).


The RustBPE Implementation

Why Rust?

BPE training is compute-intensive, and a pure-Python training loop is too slow at this scale, so nanochat implements the hot path in Rust.

Performance characteristics:

  • 10B characters: ~5-10 minutes on modern CPU
  • Python equivalent: Hours or days
  • Speedup: 50-100x

Core Algorithm: Incremental BPE with Heap

nanochat's Rust implementation uses an incremental heap-based algorithm:

struct MergeJob {
    pair: (u32, u32),           // token pair
    count: u64,                 // frequency
    pos: AHashSet<usize>,       // word indices where pair occurs
}
 
fn train_core_incremental(&mut self, words: Vec<Word>, counts: Vec<i32>, vocab_size: u32) {
    // 1. Count all pairs in parallel
    let (pair_counts, where_to_update) = count_pairs_parallel(&words, &counts);
    
    // 2. Build max-heap of pairs by frequency
    let mut heap = OctonaryHeap::with_capacity(pair_counts.len());
    for (pair, pos) in where_to_update {
        heap.push(MergeJob { pair, count: pair_counts[pair], pos });
    }
    
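    // 3. Merge loop: run until vocab_size - 256 merges have been recorded
    let num_merges = vocab_size - 256;
    let mut merges_done = 0;
    while merges_done < num_merges {
        // Pop most frequent pair; stop early if no pairs remain
        let Some(top) = heap.pop() else { break };
        
        // Lazy refresh: a stale entry is re-pushed with its current count
        // and does not use up one of the merges
        if top.count != pair_counts[top.pair] {
            heap.push(MergeJob { count: pair_counts[top.pair], ..top });
            continue;
        }
        
        // Record merge
        let new_id = 256 + merges_done;
        self.merges.insert(top.pair, new_id);
        merges_done += 1;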
        
        // Apply merge to affected words
        for &word_idx in &top.pos {
            let deltas = words[word_idx].merge_pair(top.pair, new_id);
            
            // Update pair counts based on deltas
            for (pair, delta) in deltas {
                pair_counts[pair] += delta * counts[word_idx];
                if delta > 0 {
                    heap.push(MergeJob { pair, count: pair_counts[pair], .. });
                }
            }
        }
    }
}

Optimizations

1. Octonary Heap

use dary_heap::OctonaryHeap;  // 8-ary heap vs binary

Why 8-ary? Better cache locality than binary heap—fewer cache misses during heap operations.

2. Lazy Evaluation

// Pop from heap
let top = heap.pop();
 
// Check if count is stale (other merges updated it)
if top.count != pair_counts[top.pair] {
    // Re-push with updated count (lazy refresh)
    heap.push(MergeJob { count: pair_counts[top.pair], ..top });
    continue;
}

Avoids eager heap updates—only refresh when item is popped.
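The same lazy-refresh idea can be sketched in Python with the standard heapq module. This is an illustration of the technique, not a port of the Rust code: heapq is a binary min-heap, so counts are negated, and dropping pairs whose count has fallen to zero is an assumption of this sketch.

```python
import heapq

def pop_most_frequent(heap, pair_counts):
    """Pop the pair with the highest current count, refreshing stale entries."""
    while heap:
        neg_count, pair = heapq.heappop(heap)
        current = pair_counts.get(pair, 0)
        if -neg_count != current:
            # Stale entry: re-push with the up-to-date count and try again
            if current > 0:
                heapq.heappush(heap, (-current, pair))
            continue
        return pair, current
    return None

pair_counts = {("a", "b"): 5, ("b", "c"): 4, ("c", "d"): 3}
heap = [(-count, pair) for pair, count in pair_counts.items()]
heapq.heapify(heap)

# An earlier merge lowers the count of ("a", "b") without touching the heap
pair_counts[("a", "b")] = 2
result = pop_most_frequent(heap, pair_counts)
print(result)  # (('b', 'c'), 4)
```

This only stays correct if any pair whose count increases is pushed again, which is what the `if delta > 0` branch in the merge loop does.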

3. Parallel Pair Counting

fn count_pairs_parallel(words: &[Word], counts: &[i32]) -> (HashMap<Pair, i32>, HashMap<Pair, HashSet<usize>>) {
    words.par_iter()  // Rayon parallel iterator
        .enumerate()
        .map(|(i, word)| {
            // Count pairs in this word
            let mut local_counts = HashMap::new();
            for (a, b) in word.pairs() {
                *local_counts.entry((a, b)).or_default() += counts[i];
            }
            local_counts
        })
        .reduce(/* merge local counts */)
}

Uses Rayon for data parallelism—scales to all CPU cores.

4. Incremental Updates

When merging pair (a, b) -> new_id, only affected pairs change:

fn merge_pair(&mut self, pair: (a, b), new_id) -> Vec<(Pair, i32)> {
    let mut deltas = Vec::new();
    
    // For each occurrence of (a, b):
    // - Remove pairs: (left, a), (a, b), (b, right)
    // - Add pairs: (left, new_id), (new_id, right)
    
    if let Some(left) = left_neighbor {
        deltas.push(((left, a), -1));      // removed
        deltas.push(((left, new_id), +1)); // added
    }
    deltas.push(((a, b), -1));             // removed
    if let Some(right) = right_neighbor {
        deltas.push(((b, right), -1));     // removed
        deltas.push(((new_id, right), +1)); // added
    }
    
    deltas
}

Efficiency: Each merged occurrence produces at most five count deltas (three removals, two additions), so no full recount is needed.
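A runnable Python version of the delta bookkeeping makes it possible to check the claim against a full recount. The function below is a sketch written for this post, operating on one word as a list of token IDs; it is not the Rust implementation.

```python
from collections import Counter

def merge_pair(word, pair, new_id):
    """Merge `pair` in `word`; return (new_word, deltas to the pair counts)."""
    a, b = pair
    out, deltas, i = [], Counter(), 0
    while i < len(word):
        if i + 1 < len(word) and word[i] == a and word[i + 1] == b:
            left = out[-1] if out else None  # may itself be a fresh new_id
            right = word[i + 2] if i + 2 < len(word) else None
            if left is not None:
                deltas[(left, a)] -= 1       # removed
                deltas[(left, new_id)] += 1  # added
            deltas[(a, b)] -= 1              # removed
            if right is not None:
                deltas[(b, right)] -= 1      # removed
                deltas[(new_id, right)] += 1 # added
            out.append(new_id)
            i += 2
        else:
            out.append(word[i])
            i += 1
    return out, deltas

def pair_counts(word):
    """Full recount of adjacent pairs, used here only as a reference."""
    return Counter(zip(word, word[1:]))

word = [1, 2, 3, 1, 2]
new_word, deltas = merge_pair(word, (1, 2), 9)
print(new_word)  # [9, 3, 9]
# Counter addition drops non-positive entries, so this equals a recount
print(pair_counts(word) + deltas == pair_counts(new_word))  # True
```

Back-to-back occurrences such as [1, 2, 1, 2] are handled because the left neighbor is read from the already-merged output.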

Streaming Ingestion

The Rust code releases the GIL for parallel processing:

pub fn train_from_iterator(&mut self, py: Python, iterator: &PyAny, vocab_size: u32, buffer_size: usize) {
    let mut buf: Vec<String> = Vec::with_capacity(buffer_size);
    
    loop {
        // 1. Refill buffer (under GIL)
        let exhausted = refill(&mut buf, iterator)?;
        
        // 2. Process buffer (release GIL, parallel)
        let local_counts = py.allow_threads(|| {
            buf.par_iter()
                .map(|text| {
                    // Apply regex, count chunks
                    let mut counts = HashMap::new();
                    for chunk in pattern.find_iter(text) {
                        *counts.entry(chunk).or_default() += 1;
                    }
                    counts
                })
                .reduce(/* merge */)
        });
        
        // 3. Merge into global counts
        for (chunk, count) in local_counts {
            *global_counts.entry(chunk).or_default() += count;
        }
        
        if exhausted { break; }
    }
}

Pattern:

  1. Acquire GIL → read buffer_size strings from Python iterator
  2. Release GIL → process strings in parallel (Rayon)
  3. Acquire GIL → merge results, repeat

Maximizes throughput by minimizing GIL contention.
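The refill, process, merge rhythm can be mirrored in plain Python. This single-threaded sketch only illustrates the buffering pattern: the regex is a much-simplified stand-in for SPLIT_PATTERN, the function name is hypothetical, and step 2 runs sequentially here where the Rust code uses Rayon with the GIL released.

```python
import re
from collections import Counter
from itertools import islice

# Simplified stand-in for SPLIT_PATTERN (assumption for this sketch)
CHUNK_RE = re.compile(r" ?[A-Za-z]+|\d{1,2}|\S|\s+")

def count_chunks_buffered(text_iter, buffer_size):
    """Count pre-tokenized chunks, pulling buffer_size texts at a time."""
    text_iter = iter(text_iter)
    global_counts = Counter()
    while True:
        # 1. Refill buffer from the iterator
        buf = list(islice(text_iter, buffer_size))
        if not buf:
            break
        # 2. Process the buffer (the parallel step in the Rust version)
        local_counts = Counter()
        for text in buf:
            local_counts.update(CHUNK_RE.findall(text))
        # 3. Merge into global counts
        global_counts.update(local_counts)
    return global_counts

counts = count_chunks_buffered(["the cat", "the dog", "the end"], buffer_size=2)
print(counts["the"], counts[" cat"])  # 3 1
```

Counting unique chunks first is what lets the merge loop work on (word, count) pairs instead of the raw text.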


Inference with tiktoken

Training vs Inference Split

nanochat uses two libraries:

Phase      Library   Why?
---------  --------  -------------------------------
Training   RustBPE   Optimized incremental algorithm
Inference  tiktoken  OpenAI's ultra-fast encoder

RustBPE training produces:

  • pattern: The regex split pattern
  • mergeable_ranks: Dictionary mapping token bytes → token ID

These are fed into tiktoken for inference:

# After training with RustBPE
pattern = tokenizer.get_pattern()
mergeable_ranks_list = tokenizer.get_mergeable_ranks()
mergeable_ranks = {bytes(k): v for k, v in mergeable_ranks_list}
 
# Add special tokens
special_tokens = {
    "<|bos|>": 65536,
    "<|user_start|>": 65537,
    "<|user_end|>": 65538,
    # ... 9 special tokens total
}
 
# Create tiktoken Encoding
enc = tiktoken.Encoding(
    name="rustbpe",
    pat_str=pattern,
    mergeable_ranks=mergeable_ranks,
    special_tokens=special_tokens,
)

tiktoken Performance

tiktoken is extremely fast:

# Single string
tokens = enc.encode_ordinary("Hello, world!")
 
# Batch encoding (parallel)
texts = [f"text{i}" for i in range(1, 1001)]
tokens_batch = enc.encode_ordinary_batch(texts, num_threads=8)

Why so fast?

  • Rust implementation: Zero-copy string operations
  • Parallel batch encoding: Scales to 8+ threads
  • Optimized BPE merge: Uses precomputed merge priorities

Typical throughput: 10-50 MB/s per core (varies by text).

Special Token Handling

tiktoken distinguishes ordinary vs special tokens:

# Ordinary encoding: treats special tokens as text
tokens = enc.encode_ordinary("<|bos|>Hello")
# => [60, 124, 98, 111, 115, 124, 62, 9906]  (illustrative IDs: "<|bos|>" becomes ordinary tokens)
 
# Special encoding: recognizes special token
tokens = enc.encode("<|bos|>Hello", allowed_special="all")
# => [65536, 9906]  (special token ID + "Hello")
 
# Single special token
bos_id = enc.encode_single_token("<|bos|>")  # => 65536

nanochat uses encode_ordinary for user text and explicitly injects special tokens.


Special Tokens for Chat

Token Inventory

nanochat defines 9 special tokens:

SPECIAL_TOKENS = [
    "<|bos|>",           # Beginning of sequence (every document)
    "<|user_start|>",    # User message delimiter
    "<|user_end|>",
    "<|assistant_start|>",  # Assistant message delimiter
    "<|assistant_end|>",
    "<|python_start|>",  # Tool use: Python REPL
    "<|python_end|>",
    "<|output_start|>",  # Tool output
    "<|output_end|>",
]

Design Philosophy

  1. Explicit delimiters: Clear boundaries between messages
  2. Role-based: Different tokens for user vs assistant
  3. Tool use: Dedicated tokens for Python code and outputs
  4. No implicit behavior: All special tokens are explicit in the format

Why Not Use Text Markers?

Some systems use text-based markers:

User: Hello
Assistant: Hi there!

Problems:

  • Ambiguous: What if user types "User:" in their message?
  • Tokenization artifacts: "User:" might split across tokens
  • No structural guarantee: Model must learn format from examples

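Special tokens enforce structure at the tokenization level: because user text is encoded with encode_ordinary, no user input can ever produce a delimiter token, so message boundaries cannot be spoofed by content.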


Conversation Rendering

The Challenge

Training chat models requires converting conversations into token sequences with supervision masking:

conversation = {
    "messages": [
        {"role": "user", "content": "What is 2+2?"},
        {"role": "assistant", "content": "2+2 equals 4."},
    ]
}
 
# Need to produce:
# tokens: [<bos>, <user_start>, "What", "is", "2", "+", "2", "?", <user_end>,
#          <assistant_start>, "2", "+", "2", "equals", "4", ".", <assistant_end>]
# mask:   [0,     0,            0,     0,   0,  0,  0,  0,  0,
#          0,                  1,   1,   1,   1,       1,   1,   1]

Mask semantics:

  • 0 = Do not supervise (no gradient)
  • 1 = Supervise (compute loss, backprop)

Why? We only want to train on assistant responses, not user messages.
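What the mask does to the loss can be shown with plain numbers. In this sketch the per-token losses are invented, and the mask is assumed to be already aligned with the prediction targets; in a real training loop the targets are the token sequence shifted by one position.

```python
def masked_mean_loss(token_losses, mask):
    """Average per-token losses over supervised positions only (mask == 1)."""
    supervised = [loss for loss, m in zip(token_losses, mask) if m == 1]
    if not supervised:
        return 0.0
    return sum(supervised) / len(supervised)

# 10 unsupervised positions followed by 7 supervised ones, as in the example
mask = [0] * 10 + [1] * 7
token_losses = [2.0] * 10 + [0.5] * 7  # made-up values for illustration

masked = masked_mean_loss(token_losses, mask)
unmasked = sum(token_losses) / len(token_losses)
print(masked)              # 0.5
print(round(unmasked, 3))  # 1.382
```

Without the mask, the user-side tokens dominate the average and the model is pushed toward predicting user text.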

Render Implementation

def render_conversation(self, conversation, max_tokens=2048):
    """
    Tokenize a conversation and return (ids, mask).
    - ids: list of token IDs
    - mask: 1 for assistant tokens (supervised), 0 otherwise
    """
    ids, mask = [], []
    
    def add_tokens(token_ids, mask_val):
        if isinstance(token_ids, int):
            token_ids = [token_ids]
        ids.extend(token_ids)
        mask.extend([mask_val] * len(token_ids))
    
    # Fetch special token IDs
    bos = self.encode_special("<|bos|>")
    user_start = self.encode_special("<|user_start|>")
    user_end = self.encode_special("<|user_end|>")
    assistant_start = self.encode_special("<|assistant_start|>")
    assistant_end = self.encode_special("<|assistant_end|>")
    
    # Add BOS (unsupervised)
    add_tokens(bos, 0)
    
    # Process messages
    for i, message in enumerate(conversation["messages"]):
        role = message["role"]
        content = message["content"]
        
        if role == "user":
            add_tokens(user_start, 0)
            add_tokens(self.encode(content), 0)
            add_tokens(user_end, 0)
            
        elif role == "assistant":
            add_tokens(assistant_start, 0)  # Start token not supervised
            add_tokens(self.encode(content), 1)  # Content IS supervised
            add_tokens(assistant_end, 1)  # End token IS supervised
    
    # Truncate to max_tokens
    ids = ids[:max_tokens]
    mask = mask[:max_tokens]
    
    return ids, mask

System Message Handling

Some conversations start with a system message:

{
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
        {"role": "assistant", "content": "Hi there!"},
    ]
}

nanochat merges system into the first user message:

if conversation["messages"][0]["role"] == "system":
    conversation = copy.deepcopy(conversation)
    messages = conversation["messages"]
    assert messages[1]["role"] == "user"
    # Prepend system content to user message
    messages[1]["content"] = messages[0]["content"] + "\n\n" + messages[1]["content"]
    messages = messages[1:]  # Drop the system message; render from `messages` from here on

Why? Simplifies format—no need for separate <|system_start|> tokens.
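Wrapped in a function, the merge step is easy to test in isolation. The function name is hypothetical, and the length check in the assert is an addition for this sketch.

```python
import copy

def merge_system_message(conversation):
    """Return the message list with any leading system message folded
    into the first user message. The input is never mutated."""
    messages = conversation["messages"]
    if messages and messages[0]["role"] == "system":
        conversation = copy.deepcopy(conversation)
        messages = conversation["messages"]
        assert len(messages) > 1 and messages[1]["role"] == "user"
        messages[1]["content"] = messages[0]["content"] + "\n\n" + messages[1]["content"]
        messages = messages[1:]
    return messages

conversation = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
        {"role": "assistant", "content": "Hi there!"},
    ]
}
merged = merge_system_message(conversation)
print([m["role"] for m in merged])  # ['user', 'assistant']
print(repr(merged[0]["content"]))   # 'You are a helpful assistant.\n\nHello!'
```

The deep copy matters when the same conversation object is rendered more than once, for example across training epochs.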

Tool Use: Python REPL

Assistants can invoke Python code:

{
    "role": "assistant",
    "content": [
        {"type": "text", "text": "Let me calculate that:"},
        {"type": "python", "text": "print(2+2)"},
        {"type": "python_output", "text": "4"},
        {"type": "text", "text": "The answer is 4."},
    ]
}

Rendering:

add_tokens(assistant_start, 0)
 
for part in content:
    value_ids = self.encode(part["text"])
    
    if part["type"] == "text":
        add_tokens(value_ids, 1)  # Supervised
        
    elif part["type"] == "python":
        add_tokens(python_start, 1)
        add_tokens(value_ids, 1)  # Supervised: learn to generate code
        add_tokens(python_end, 1)
        
    elif part["type"] == "python_output":
        add_tokens(output_start, 0)  # NOT supervised
        add_tokens(value_ids, 0)     # Output comes from Python, not model
        add_tokens(output_end, 0)
 
add_tokens(assistant_end, 1)

Key insight: Python outputs are not supervised—they're generated by the REPL, not the model.

Visualization Helper

Debugging tokenization is crucial. nanochat provides a visualizer:

def visualize_tokenization(self, ids, mask):
    """Color-code tokens by supervision mask"""
    RED = '\033[91m'    # Unsupervised
    GREEN = '\033[92m'  # Supervised
    RESET = '\033[0m'
    
    tokens = []
    for token_id, mask_val in zip(ids, mask):
        token_str = self.decode([token_id])
        color = GREEN if mask_val == 1 else RED
        tokens.append(f"{color}{token_str}{RESET}")
    
    return '|'.join(tokens)

Example output:

<|bos|>|<|user_start|>|What|is|2|+|2|?|<|user_end|>|<|assistant_start|>|2|+|2|equals|4|.|<|assistant_end|>
RED:   <|bos|> through <|assistant_start|> (the first 10 tokens)
GREEN: 2|+|2|equals|4|.|<|assistant_end|> (the last 7 tokens)

Green = model is trained on these tokens, Red = ignored in loss.


Tokenizer Evaluation

Compression Ratio

The primary metric: bytes per token (higher = better compression):

text = "Hello, world!"
encoded = tokenizer.encode(text)
encoded_bytes = text.encode('utf-8')
 
compression_ratio = len(encoded_bytes) / len(encoded)
# Higher ratio = fewer tokens for same text = better compression

Comparing to GPT-2 and GPT-4

scripts/tok_eval.py compares nanochat's tokenizer against baselines:

# Load tokenizers
gpt2_tok = RustBPETokenizer.from_pretrained("gpt2")
gpt4_tok = RustBPETokenizer.from_pretrained("cl100k_base")
ours_tok = get_tokenizer()
 
# Test on diverse data
test_data = [
    ("news", news_article),
    ("korean", korean_text),
    ("code", python_code),
    ("math", latex_math),
    ("science", technical_prose),
]
 
# Encode and compare
for name, text in test_data:
    gpt2_tokens = gpt2_tok.encode(text)
    gpt4_tokens = gpt4_tok.encode(text)
    ours_tokens = ours_tok.encode(text)
    
    gpt2_ratio = len(text.encode('utf-8')) / len(gpt2_tokens)
    gpt4_ratio = len(text.encode('utf-8')) / len(gpt4_tokens)
    ours_ratio = len(text.encode('utf-8')) / len(ours_tokens)
    
    print(f"{name:10} GPT-2: {gpt2_ratio:.2f}  GPT-4: {gpt4_ratio:.2f}  Ours: {ours_ratio:.2f}")

Example output:

Vocab sizes:
GPT-2: 50257
GPT-4: 100256
Ours: 65536

Comparison with GPT-2:
==========================================================================================
Text Type  Bytes    GPT-2           Ours            Relative     Better
                    Tokens  Ratio   Tokens  Ratio   Diff %
------------------------------------------------------------------------------------------
news       1087     295     3.68    275     3.95    +6.8%        Ours
korean     385      180     2.14    145     2.66    +19.4%       Ours
code       876      251     3.49    245     3.58    +2.4%        Ours
math       1547     556     2.78    520     2.98    +6.5%        Ours
science    715      194     3.69    185     3.86    +4.6%        Ours
fwe-train  10247    2568    3.99    2450    4.18    +4.6%        Ours

Insights:

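  • Korean text: Largest improvement (19.4% fewer tokens), which suggests better coverage of non-English text than GPT-2's vocabulary
  • English text: Modest improvement (4.6% to 6.8% fewer tokens on news, science, and fwe-train), plausibly because both tokenizers were trained on similar data
  • Code/Math: Competitive (2.4% fewer tokens on code, 6.5% on math)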
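The Ratio and Relative Diff columns follow directly from the byte and token counts. The sketch below recomputes two rows of the table; the function names are made up for this illustration.

```python
def bytes_per_token(num_bytes, num_tokens):
    """Compression ratio: UTF-8 bytes per token (higher is better)."""
    return num_bytes / num_tokens

def relative_diff_pct(baseline_tokens, our_tokens):
    """Percentage of tokens saved relative to the baseline tokenizer."""
    return (baseline_tokens - our_tokens) / baseline_tokens * 100

# (bytes, GPT-2 tokens, our tokens), copied from the table above
rows = {"korean": (385, 180, 145), "code": (876, 251, 245)}

summary = {}
for name, (nbytes, gpt2, ours) in rows.items():
    summary[name] = (
        round(bytes_per_token(nbytes, gpt2), 2),
        round(bytes_per_token(nbytes, ours), 2),
        round(relative_diff_pct(gpt2, ours), 1),
    )
    print(name, summary[name])
# korean (2.14, 2.66, 19.4)
# code (3.49, 3.58, 2.4)
```

Note that the relative difference is measured in token counts, so +19.4% means 19.4% fewer tokens, not a 19.4% higher ratio.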

Bits-Per-Byte Metric

From Post 1.6, bits-per-byte (bpb) normalizes loss across tokenizers:

import math

# Load token_bytes mapping
token_bytes = get_token_bytes(device="cuda")  # [vocab_size]
 
# During evaluation: flatten batch and sequence dimensions first
losses = F.cross_entropy(
    logits.flatten(0, 1), targets.flatten(), reduction='none'
)  # per-token loss in nats, [batch*seq_len]
 
# Look up how many bytes each target token represents
token_bytes_flat = token_bytes[targets.flatten()]  # [batch*seq_len]
valid_mask = token_bytes_flat > 0  # Exclude special tokens
 
# Bits-per-byte: total nats / (ln 2 * total bytes)
total_nats = losses[valid_mask].sum()
total_bytes = token_bytes_flat[valid_mask].sum()
bpb = total_nats / (math.log(2) * total_bytes)

Why bpb?

Metric      Formula                            Problem
----------  ---------------------------------  ---------------------
Loss        -log P(token)                      Depends on vocab_size
Perplexity  exp(loss)                          Still vocab-dependent
bpb         total nats / (ln 2 × total bytes)  Vocab-agnostic

A model with vocab_size=50k and one with vocab_size=100k can be fairly compared using bpb.
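A toy calculation shows why. Below, two hypothetical tokenizers split the same 12-byte text into 2 and 4 tokens, and the model assigns the text the same total probability in both cases (all numbers are invented for this sketch).

```python
import math

def bits_per_byte(token_nats, token_bytes):
    """Total nats over counted tokens divided by (ln 2 * total bytes).
    Tokens with a byte count of 0 (special tokens) are excluded."""
    total_nats = sum(n for n, b in zip(token_nats, token_bytes) if b > 0)
    total_bytes = sum(b for b in token_bytes if b > 0)
    return total_nats / (math.log(2) * total_bytes)

# Same text (12 bytes), same total loss (6 nats), different granularity
coarse_nats, coarse_bytes = [3.0, 3.0], [6, 6]
fine_nats, fine_bytes = [1.5, 1.5, 1.5, 1.5], [3, 3, 3, 3]

coarse_loss = sum(coarse_nats) / len(coarse_nats)  # 3.0 nats per token
fine_loss = sum(fine_nats) / len(fine_nats)        # 1.5 nats per token

coarse_bpb = bits_per_byte(coarse_nats, coarse_bytes)
fine_bpb = bits_per_byte(fine_nats, fine_bytes)
print(coarse_loss, fine_loss)                    # 3.0 1.5
print(round(coarse_bpb, 4), round(fine_bpb, 4))  # 0.7213 0.7213
```

Per-token loss makes the finer tokenizer look twice as good, while bpb correctly reports that both setups model the text equally well.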


Implementation Trade-offs

HuggingFace vs RustBPE + tiktoken

nanochat provides two implementations:

HuggingFace Tokenizers

class HuggingFaceTokenizer:
    """Single-library option: HF tokenizers (Rust core, Python API)"""
    
    @classmethod
    def train_from_iterator(cls, text_iterator, vocab_size):
        tokenizer = HFTokenizer(BPE(byte_fallback=True))
        # ... configure pre-tokenizer, decoder
        trainer = BpeTrainer(vocab_size=vocab_size, special_tokens=SPECIAL_TOKENS)
        tokenizer.train_from_iterator(text_iterator, trainer)
        return cls(tokenizer)

Pros:

  • ✅ Single library for training + inference
  • ✅ No Rust compilation required
  • ✅ Widely used, well-documented

Cons:

  • ❌ Slower training (~10-20x vs RustBPE)
  • ❌ Confusing API (many configuration options)
  • ❌ Slower inference (~2-5x vs tiktoken)

RustBPE + tiktoken

class RustBPETokenizer:
    """Rust training, tiktoken inference"""
    
    @classmethod
    def train_from_iterator(cls, text_iterator, vocab_size):
        # Train with Rust
        tokenizer = rustbpe.Tokenizer()
        tokenizer.train_from_iterator(text_iterator, vocab_size, pattern=SPLIT_PATTERN)
        
        # Export to tiktoken
        pattern = tokenizer.get_pattern()
        mergeable_ranks = {bytes(k): v for k, v in tokenizer.get_mergeable_ranks()}
        enc = tiktoken.Encoding(name="rustbpe", pat_str=pattern, mergeable_ranks=mergeable_ranks, ...)
        return cls(enc)

Pros:

  • ✅ Fast training (~50-100x faster than a pure-Python implementation)
  • ✅ Ultra-fast inference (tiktoken is highly optimized)
  • ✅ Parallel batch encoding

Cons:

  • ❌ Requires Rust toolchain for compilation
  • ❌ Two-library complexity (rustbpe + tiktoken)
  • ❌ Custom code to maintain

When to Use Each

Use Case                  Recommendation
------------------------  -------------------------------------------
Production training       RustBPE + tiktoken (speed critical)
Research/experimentation  HuggingFace (easier iteration)
CPU-only environment      Both work, but RustBPE is still faster
No Rust compiler          HuggingFace (installs from prebuilt wheels)

nanochat default: RustBPE + tiktoken for performance.


Best Practices & Common Pitfalls

Best Practices

1. Vocabulary Size Selection

# Good: Powers of 2 for efficiency
vocab_sizes = [32768, 65536, 131072]
 
# Avoid: Arbitrary sizes
vocab_sizes = [50000, 75000]  # Worse for hardware alignment

Trade-offs:

  • Smaller vocab (32k): Fewer parameters, lower memory, worse compression
  • Larger vocab (128k): Better compression, more parameters, higher memory

2. Document Capping

# Good: Cap individual documents
def text_iterator():
    for doc in dataset:
        yield doc[:10000]  # Prevent single huge doc from dominating
 
# Bad: No capping
def text_iterator():
    for doc in dataset:
        yield doc  # 100MB document consumes 1% of training data alone

3. Special Token Design

# Good: Explicit, unambiguous delimiters
SPECIAL_TOKENS = ["<|bos|>", "<|user_start|>", "<|user_end|>"]
 
# Bad: Ambiguous markers (could appear in text)
SPECIAL_TOKENS = ["[BOS]", "[USER]", "[/USER]"]

4. Supervision Masking

# Good: Only supervise assistant tokens
if role == "assistant":
    add_tokens(content_ids, 1)  # Supervised
else:
    add_tokens(content_ids, 0)  # Not supervised
 
# Bad: Supervise everything (including user messages)
add_tokens(content_ids, 1)  # Model learns to generate user messages

Common Pitfalls

Pitfall 1: Forgetting Byte Fallback

# Bad: No byte fallback
tokenizer = BPE(byte_fallback=False, unk_token="<UNK>")
# Problem: Unknown characters → <UNK> token → information loss
 
# Good: Byte fallback enabled
tokenizer = BPE(byte_fallback=True, unk_token=None)
# Solution: Unknown patterns → individual bytes (no information loss)

Pitfall 2: Inconsistent Regex Patterns

# Bad: Different patterns for training vs inference
train_pattern = r"\w+|\d+|[^\w\s]+"
inference_pattern = r"\w+|\s+|."  # Oops, different!
 
# Good: Store pattern with tokenizer
tokenizer.pattern = SPLIT_PATTERN  # Use same pattern always

Pitfall 3: Special Token Injection

# Bad: Treating special tokens as ordinary text
text = "<|bos|>Hello"
tokens = tokenizer.encode(text)  # Encodes "<|bos|>" as several ordinary tokens, not one special token
 
# Good: Explicit special token injection
tokens = [tokenizer.encode_special("<|bos|>")] + tokenizer.encode("Hello")
# => [65536, 9906]  (correct)

Pitfall 4: Ignoring Token Bytes

# Bad: Using raw loss for evaluation
loss = F.cross_entropy(logits, targets)
# Problem: Loss depends on vocab_size (not comparable)
 
# Good: Divide total loss by total bytes (mask out special tokens first)
losses = F.cross_entropy(logits, targets, reduction='none')  # per-token nats
bpb = losses.sum() / (math.log(2) * token_bytes[targets].sum())
# Solution: Vocabulary-agnostic metric

Pitfall 5: Truncation Without Padding

# Bad: Truncate but don't track actual lengths
ids = ids[:max_tokens]  # Lost information about original length
 
# Good: Record the original length before truncating
original_length = len(ids)
ids = ids[:max_tokens]
actual_length = min(original_length, max_tokens)
# Use actual_length for loss masking

Debugging Tokenization

When things go wrong:

# 1. Visualize tokenization
viz = tokenizer.visualize_tokenization(ids, mask)
print(viz)  # Color-coded output
 
# 2. Decode individual tokens
for token_id in ids:
    print(f"{token_id}: {tokenizer.decode([token_id])!r}")
 
# 3. Check special token IDs
special_ids = {name: tokenizer.encode_special(name) for name in SPECIAL_TOKENS}
print(special_ids)
 
# 4. Verify encode/decode round-trip
text = "Test 你好 123"
tokens = tokenizer.encode(text)
decoded = tokenizer.decode(tokens)
assert decoded == text, f"Round-trip failed: {text!r} != {decoded!r}"

Conclusion

Tokenization is the interface between raw text and neural networks—get it wrong, and no amount of architecture engineering will save you.

Key takeaways:

  1. BPE with byte fallback ensures no text is ever "unknown"
  2. GPT-4 style regex splitting creates linguistically meaningful chunks
  3. RustBPE + tiktoken offers best-in-class training and inference speed
  4. Special tokens enforce conversation structure at tokenization level
  5. Supervision masking ensures models learn assistant behavior, not user imitation
  6. Bits-per-byte enables fair comparison across tokenizers
  7. Vocabulary size is a critical hyperparameter balancing compression vs parameters

The tokenizer you train sets the foundation for everything downstream—training data efficiency, model capacity utilization, and generation quality. Spend time getting it right.



Exercises

  1. Experiment with vocabulary size: Train tokenizers with vocab_size=32768 and vocab_size=131072. Compare compression ratios and model performance.

  2. Analyze token distribution: Plot histogram of token frequencies from trained tokenizer. What percentage of tokens account for 50% of usage?

  3. Multilingual compression: Evaluate compression ratio on 10+ languages. Which languages compress best/worst? Why?

  4. Custom regex patterns: Modify SPLIT_PATTERN to handle code better (e.g., keep -> or == together). Measure impact on code compression.

  5. Special token ablation: Train a model with vs without special tokens (using text markers instead). Compare chat quality.


Next Post: Post 2.6: Memory Optimization Techniques - Gradient accumulation, mixed precision, activation checkpointing, batch size tuning


Part of the nanochat Deep-Dive Series • Code: nanochat on GitHub
