José David Baena

Mathematical Foundations of Model Compression: Theory Behind Tiny LLMs


📚 Tiny Language Models Series - Track 1: Foundation

Part 3 of 3 - Understanding the mathematical theory behind efficient models

  1. 1.1 Tiny Language Models: Complete Guide
  2. 1.2 Evolution of Efficient LLMs
  3. 1.3 Mathematical Foundations (You are here)

Four mathematical pillars explain why compression works

Understanding the math behind compression changed how I approach model optimization. After years of trial-and-error compression attempts, learning the theory was a revelation—I could finally reason about what's possible and predict when approaches would fail before wasting compute.

75% size reduction. 2% accuracy loss. That's not magic—it's information theory and linear algebra.

TL;DR: Distillation transfers dark knowledge through temperature-scaled softmax. Quantization uses k-bit precision with bounded error. Pruning exploits the Lottery Ticket Hypothesis. LoRA reduces fine-tuning to rank-r updates. These four techniques compose.

The compression that shouldn't have worked: Consider a common pattern: quantizing a 1.5B model to INT4, expecting perhaps 85% quality retention based on the usual rule of thumb, and getting 97%. Why? The model's weight distribution is unusually narrow, with most weights clustered near zero, so INT4's 16 levels cover the actual range with minimal discretization error. Try the same technique on a model with a wide weight distribution and quality drops to 61%. The math predicts both outcomes: quantization error grows with the spread of the weights and shrinks exponentially with the number of bits (roughly as the square of the quantization step, i.e., range² / 2^(2b)). Understanding the theory explains why some models compress well, and warns against expecting the same behavior from architecturally different models. A toy demonstration follows below.
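To make the weight-spread effect concrete, here is a small self-contained sketch (assuming PyTorch; the 4-bit quantizer and the synthetic weight tensors are purely illustrative, not taken from any real model):

import torch

def int4_fake_quant(w):
    """Symmetric 4-bit fake quantization: round to a 16-level signed grid, then dequantize."""
    qmax = 2 ** 3 - 1                       # 7 positive levels
    s = w.abs().max() / qmax                # min-max scale
    return s * torch.clamp(torch.round(w / s), -qmax - 1, qmax)

torch.manual_seed(0)
narrow = 0.01 * torch.randn(100_000)        # weights clustered near zero
wide = narrow.clone()
wide[:50] *= 100                            # a few outliers stretch the quantization range

for name, w in [("narrow", narrow), ("wide", wide)]:
    rel_mse = (w - int4_fake_quant(w)).pow(2).mean() / w.pow(2).mean()
    print(f"{name} distribution: relative quantization error = {rel_mse:.3f}")

The narrow distribution uses all 16 levels efficiently; the outlier-stretched one wastes most of its levels on a range almost no weights occupy.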

Why can a 2.7B parameter model match a 13B model's performance? How does quantization reduce model size by 75% while losing only 2% accuracy? What mathematical principles enable knowledge distillation to compress GPT-4's capabilities into a 3B student model?

The answers lie in the mathematical foundations of model compression—a set of elegant theories explaining why neural networks can be radically simplified without catastrophic quality loss.

This article provides the theoretical grounding for understanding tiny language models. Unlike typical tutorials that show you how to compress models, this explains why the techniques work, backed by:

  • Rigorous mathematical proofs of compression theorems
  • Detailed algorithms with step-by-step derivations
  • Information theory perspectives on model capacity
  • Practical implementations translating theory to code

Four pillars of compression:

  1. Knowledge Distillation: Teacher-student learning theory
  2. Quantization: Precision reduction with minimal information loss
  3. Pruning: Structured sparsity and the Lottery Ticket Hypothesis
  4. Low-Rank Factorization: Matrix decomposition for parameter efficiency

Temperature-scaled softmax transfers dark knowledge between models

The Central Idea

Intuition: A trained neural network contains more than just its weights—it encodes knowledge about the problem structure. This knowledge can be transferred to a smaller model.

Formal definition: Given a large teacher model T and smaller student model S, distillation minimizes:

L_distill = α · L_KL(T || S) + (1-α) · L_task(S, y)

Where:

  • L_KL(T || S) = KL divergence between the teacher's and student's output distributions (teacher as the target, matching the code below)
  • L_task(S, y) = Standard task loss (e.g., cross-entropy with true labels)
  • α ∈ [0,1] = Balance between mimicking teacher vs learning task

Temperature-Scaled Softmax

The problem with raw logits: Teacher outputs are often nearly one-hot (e.g., 0.99 on the correct class and the remaining 0.01 spread across all others). This provides little information about relationships between classes.

Solution: Temperature scaling softens the distribution.

Standard softmax:

p_i = exp(z_i) / Σ_j exp(z_j)

Temperature-scaled softmax:

p_i^(T) = exp(z_i/T) / Σ_j exp(z_j/T)

Where T > 1 increases entropy (more uniform distribution).

Practical value: At T=2 or T=3, we get "soft targets" revealing inter-class relationships.

For your distillation experiments, this means: start with T=2 and α=0.7. Only tune these hyperparameters after you've validated your data pipeline and training setup. Temperature tuning gives 1-2% gains; data quality gives 10-20%.
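A quick illustration (assuming PyTorch; the logits are made up) of how raising the temperature exposes the relative similarities that a near-one-hot distribution hides:

import torch
import torch.nn.functional as F

# Hypothetical teacher logits for classes [dog, puppy, cat, car]
logits = torch.tensor([9.0, 6.5, 4.0, -2.0])

for T in (1.0, 3.0):
    probs = F.softmax(logits / T, dim=-1)
    print(f"T={T}: {[round(p, 4) for p in probs.tolist()]}")

# At T=1 the top class dominates; at T=3 the dog > puppy > cat > car ordering becomes visible.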

Temperature Scaling Demo

See how temperature affects probability distributions in softmax

Softmax with temperature:

p_i^(τ) = exp(z_i/τ) / Σ_j exp(z_j/τ)

Key Insights:
  • τ < 1: Sharper distribution, model becomes more confident
  • τ = 1: Standard softmax behavior
  • τ > 1: Softer distribution, reveals "dark knowledge"
  • τ → ∞: Approaches uniform distribution
💡 In knowledge distillation, higher temperatures help students learn from the teacher's relative confidence across all classes, not just the top prediction.

The Distillation Loss Function

Implementation:

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """
    Compute knowledge distillation loss.

    Args:
        student_logits: [batch, vocab_size] - Student model outputs
        teacher_logits: [batch, vocab_size] - Teacher model outputs
        labels: [batch] - True labels
        temperature: float - Softening parameter
        alpha: float - Weight for distillation vs task loss
    """
    # Temperature-scaled distributions (log-probabilities for the student: numerically stable)
    student_log_soft = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_soft = F.softmax(teacher_logits / temperature, dim=-1)

    # KL divergence (teacher as target), with temperature-scaling compensation
    distill_loss = F.kl_div(
        student_log_soft,
        teacher_soft,
        reduction='batchmean'
    ) * (temperature ** 2)
    
    # Standard task loss
    task_loss = F.cross_entropy(student_logits, labels)
    
    # Combined loss
    return alpha * distill_loss + (1 - alpha) * task_loss

Why T² scaling? The gradients of the soft-target loss shrink roughly as 1/T², so multiplying the KL term by T² keeps the soft and hard losses on comparable scales as you vary the temperature.
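A quick numerical check of that claim (a sketch assuming PyTorch, with random logits): with the T² factor the soft-loss gradient magnitude stays roughly constant across temperatures; drop the factor and gradients shrink as temperature grows.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
student_logits = torch.randn(8, 100, requires_grad=True)
teacher_logits = torch.randn(8, 100)

for T in (1.0, 2.0, 4.0):
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction='batchmean',
    ) * (T ** 2)                 # remove the T**2 factor and watch the gradients shrink
    grad, = torch.autograd.grad(soft_loss, student_logits)
    print(f"T={T}: gradient norm = {grad.norm():.4f}")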

Distillation Loss Visualizer

Explore how temperature and loss weighting affect knowledge distillation

L_total = α · τ² · L_soft + (1 - α) · L_hard

  • Soft target loss (L_soft): KL divergence with teacher outputs, weighted by α
  • Hard target loss (L_hard): cross-entropy with ground-truth labels, weighted by 1 - α
💡 Temperature τ controls how "soft" the teacher's probability distribution is. Higher τ reveals more dark knowledge about inter-class relationships.

Dark Knowledge: What Gets Transferred?

Question: What information does the teacher provide beyond the correct label?

Answer: Inter-class similarities and relationships.

Example: For input "A cute dog"

  • True label: "dog"
  • Teacher probabilities (T=3):
    • dog: 0.85
    • puppy: 0.10
    • cat: 0.03
    • car: 0.00001

The teacher reveals that "puppy" is more related to "dog" than "cat" is, and "car" is completely unrelated.

Feature-Based Distillation

Beyond matching output distributions, we can match intermediate representations.

Formulation:

L_feature = ||h_S^(l) - h_T^(l)||_2^2

Where h^(l) are hidden states at layer l.

Challenge: Teacher and student may have different dimensions.

Solution: Use projection layer W:

L_feature = ||h_S^(l) - W · h_T^(l)||_2^2

Implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDistillation(nn.Module):
    def __init__(self, student_dim, teacher_dim):
        super().__init__()
        # Project teacher hidden states to the student's width
        self.projection = nn.Linear(teacher_dim, student_dim)
        
    def forward(self, student_hidden, teacher_hidden):
        # Project teacher features to student dimension
        teacher_projected = self.projection(teacher_hidden)
        
        # MSE loss between features
        return F.mse_loss(student_hidden, teacher_projected)
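A minimal usage sketch (the hidden sizes are hypothetical) showing the projection aligning mismatched dimensions:

# Hypothetical widths: student hidden size 384, teacher hidden size 768
feature_loss = FeatureDistillation(student_dim=384, teacher_dim=768)

student_hidden = torch.randn(4, 128, 384)   # [batch, seq_len, hidden]
teacher_hidden = torch.randn(4, 128, 768)

loss = feature_loss(student_hidden, teacher_hidden)   # scalar MSE after projection
print(loss.item())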

k-bit quantization bounds error while cutting memory 4-8×

The Quantization Problem

Goal: Represent floating-point weights with lower precision (INT8, INT4) while minimizing accuracy loss.

Mathematical formulation:

Given weight matrix W ∈ ℝ^(m×n), find quantized representation Ŵ and scale factors s such that:

Ŵ = Quantize(W) = round(W/s) · s

Where Ŵ uses b bits per element (e.g., b=8 for INT8).

Symmetric vs Asymmetric Quantization

Symmetric quantization:

s = max(|W|) / (2^(b-1) - 1)
Ŵ_ij = s · clip(round(W_ij/s), -2^(b-1), 2^(b-1)-1)

Asymmetric quantization:

s = (max(W) - min(W)) / (2^b - 1)
z = round(-min(W)/s)  # zero-point
Ŵ_ij = s · (clip(round(W_ij/s + z), 0, 2^b-1) - z)
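A minimal sketch of both schemes as fake quantization (quantize, then immediately dequantize) in PyTorch; the calibration code below reuses quantize_symmetric:

import torch

def symmetric_scale(W, bits=8):
    """Min-max symmetric scale: the largest magnitude maps to the top level."""
    return W.abs().max() / (2 ** (bits - 1) - 1)

def quantize_symmetric(W, s, bits=8):
    """Fake-quantize W on a signed grid with scale s and zero-point 0."""
    qmax = 2 ** (bits - 1) - 1
    return s * torch.clamp(torch.round(W / s), -qmax - 1, qmax)

def quantize_asymmetric(W, bits=8):
    """Fake-quantize W on an unsigned grid shifted by a zero-point z."""
    qmax = 2 ** bits - 1
    s = (W.max() - W.min()) / qmax
    z = torch.round(-W.min() / s)
    return s * (torch.clamp(torch.round(W / s) + z, 0, qmax) - z)

# Example: INT8 vs INT4 reconstruction error on a random weight matrix
W = torch.randn(512, 512)
for bits in (8, 4):
    W_hat = quantize_symmetric(W, symmetric_scale(W, bits), bits)
    print(f"INT{bits}: MSE = {(W - W_hat).pow(2).mean():.6f}")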

Calibration: Finding Optimal Scale

Three approaches:

1. Min-Max Calibration (simple, outlier-sensitive):

s = max(|W|) / (2^(b-1) - 1)

For your calibration strategy, this means: min-max works for well-behaved weight distributions, but outliers will destroy your quantization quality. Use percentile-based (99.9th percentile) for production models—it's more robust and rarely worse.

2. Percentile-Based (robust to outliers):

s = percentile(|W|, 99.9) / (2^(b-1) - 1)

3. MSE-Optimal (minimize reconstruction error):

s* = argmin_s ||W - Quantize(W, s)||_2^2

Implementation:

def find_optimal_scale(weights, bits=8, num_candidates=100):
    """Find the scale factor minimizing quantization MSE via grid search."""
    # Start from the min-max scale and also try smaller scales that clip outliers
    minmax_scale = weights.abs().max() / (2 ** (bits - 1) - 1)
    scales = torch.linspace(0.2 * minmax_scale, minmax_scale, num_candidates)
    
    best_scale = scales[-1]
    best_error = float('inf')
    
    for s in scales:
        # Quantize with this scale
        quant = quantize_symmetric(weights, s, bits)
        
        # Measure error
        error = (weights - quant).pow(2).mean()
        
        if error < best_error:
            best_error = error
            best_scale = s
    
    return best_scale

GPTQ: Post-Training Quantization

Challenge: Quantizing all weights independently causes error accumulation.

Insight: Quantize sequentially, compensating for errors.

Algorithm:

def gptq_quantize(model, calibration_data, bits=4):
    """
    GPTQ: Accurate post-training quantization (simplified sketch).
    Based on: https://arxiv.org/abs/2210.17323
    Assumes helper functions compute_hessian_inverse() and quantize().
    """
    for layer in model.layers:
        # Work on a detached copy so in-place error compensation is safe
        W = layer.weight.data.clone()  # [out_features, in_features]
        
        # Compute inverse Hessian of the layer inputs on calibration data
        H_inv = compute_hessian_inverse(layer, calibration_data)
        
        # Quantize column-by-column with error compensation
        W_quant = torch.zeros_like(W)
        
        for i in range(W.size(1)):  # For each column
            # Quantize this column
            w_col = W[:, i]
            w_col_quant = quantize(w_col, bits)
            
            # Compute quantization error
            error = w_col - w_col_quant
            
            # Compensate error in remaining columns
            W[:, i+1:] -= torch.outer(error, H_inv[i, i+1:]) / H_inv[i, i]
            
            W_quant[:, i] = w_col_quant
        
        layer.weight.data = W_quant
    
    return model

The Lottery Ticket Hypothesis explains why pruning preserves accuracy

The Pruning Hypothesis

Core idea: Neural networks are over-parameterized. Many weights contribute minimally to the output and can be removed.

Magnitude-based pruning:

Prune(W) = W ⊙ M

Where mask M_ij = 𝟙_{|W_ij| > θ} for threshold θ.

Unstructured vs Structured Pruning

Unstructured (element-wise):

  • Remove individual weights
  • High sparsity possible (90%+)
  • Requires sparse matrix operations

Structured (neuron/channel pruning):

  • Remove entire rows/columns
  • Lower sparsity (50-70%)
  • Standard matrix operations (fast)

import torch

# Unstructured pruning
def prune_unstructured(W, sparsity=0.9):
    """Remove individual weights by magnitude."""
    threshold = torch.quantile(W.abs(), sparsity)
    mask = W.abs() > threshold
    return W * mask
 
# Structured pruning (neuron-level)
def prune_structured(W, sparsity=0.5):
    """Remove entire neurons (rows) by norm."""
    row_norms = W.norm(dim=1)
    num_keep = int((1 - sparsity) * W.size(0))
    
    _, keep_indices = torch.topk(row_norms, num_keep)
    
    W_pruned = torch.zeros_like(W)
    W_pruned[keep_indices] = W[keep_indices]
    return W_pruned
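A quick sanity check of the two functions above (random weights, illustrative sparsity targets):

W = torch.randn(256, 256)

W_unstruct = prune_unstructured(W, sparsity=0.9)
print(f"zeroed weights: {(W_unstruct == 0).float().mean():.1%}")               # ~90%

W_struct = prune_structured(W, sparsity=0.5)
print(f"zeroed rows: {(W_struct.abs().sum(dim=1) == 0).float().mean():.1%}")   # ~50%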

The Lottery Ticket Hypothesis

Claim: Dense networks contain sparse subnetworks ("winning tickets") that can train to comparable accuracy when initialized properly.

Algorithm:

def find_lottery_ticket(model, train_fn, prune_rate=0.2, iterations=5):
    """Find a sparse subnetwork via iterative magnitude pruning.
    
    Assumes train_fn(model, masks) trains the model while enforcing the masks.
    """
    # Save initial weights so we can rewind after each pruning round
    initial_weights = {name: param.clone() for name, param in model.named_parameters()}
    
    masks = {name: torch.ones_like(param) for name, param in model.named_parameters()}
    
    for it in range(iterations):
        # Train with current mask
        model = train_fn(model, masks)
        
        # Prune by magnitude
        for name, param in model.named_parameters():
            if 'weight' not in name:
                continue
            
            weights_magnitude = (param * masks[name]).abs()
            threshold = torch.quantile(weights_magnitude[masks[name] > 0], prune_rate)
            
            # Update mask
            masks[name] = (weights_magnitude > threshold).float()
        
        # Reset to initial weights (key insight!)
        for name, param in model.named_parameters():
            param.data = initial_weights[name] * masks[name]
    
    return model, masks

For your compression pipeline, this means: you don't need to accept quality loss as the inevitable cost of pruning. The Lottery Ticket Hypothesis says that somewhere inside your large model there is often a sparse network waiting to be discovered. Iterative magnitude pruning finds it, frequently giving a 10× parameter reduction while preserving 95%+ of the model's capability.

For your training budget, this means: the "rewinding" trick—resetting to initial weights after finding the mask—is critical. Training from scratch with the pruned architecture often fails. The winning ticket's initialization matters as much as its structure.


LoRA reduces fine-tuning to rank-r matrix updates

Matrix Rank and Neural Networks

Observation: Weight matrices in transformers often have low intrinsic rank.

Definition: Intrinsic rank is the smallest r such that:

W ≈ U V^T

Where U ∈ ℝ^(m×r), V ∈ ℝ^(n×r), and r << min(m, n).
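A small demonstration (assuming PyTorch; the synthetic matrix is constructed to be low-rank plus noise, which is exactly the regime where truncated SVD works well):

import torch

torch.manual_seed(0)
d, true_rank = 1024, 16

# Synthetic weight matrix with low intrinsic rank plus a little noise
W = torch.randn(d, true_rank) @ torch.randn(true_rank, d) + 0.01 * torch.randn(d, d)

U, S, Vh = torch.linalg.svd(W, full_matrices=False)
r = 16
W_r = (U[:, :r] * S[:r]) @ Vh[:r, :]       # best rank-r approximation (Eckart-Young)

full_params = W.numel()
lowrank_params = U[:, :r].numel() + S[:r].numel() + Vh[:r, :].numel()
print(f"parameters: {full_params:,} -> {lowrank_params:,}")
print(f"relative error: {(W - W_r).norm() / W.norm():.4f}")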

LoRA: Low-Rank Adaptation

Insight: For fine-tuning, weight updates are low-rank.

LoRA formulation:

h = W x          # standard
h = W_0 x + ΔW x # fine-tuning
h = W_0 x + B A x # LoRA

Where:

  • W_0 is frozen pretrained weights
  • B ∈ ℝ^(d×r), A ∈ ℝ^(r×d) are trainable
  • r << d (typically r = 8-64, d = 768-4096)

Implementation:

import torch
import torch.nn as nn

class LoRALayer(nn.Module):
    def __init__(self, in_features, out_features, rank=16, alpha=16):
        super().__init__()
        
        # LoRA parameters: A gets a small random init, B starts at zero,
        # so the update B·A contributes nothing at the start of fine-tuning
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) / rank)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        
        self.rank = rank
        self.alpha = alpha
        self.scaling = alpha / rank
    
    def forward(self, x, base_output):
        # Low-rank update
        lora_output = (x @ self.lora_A.T) @ self.lora_B.T
        return base_output + self.scaling * lora_output
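Usage sketch: wrap a frozen base projection and add the LoRA path on top (dimensions are illustrative):

base = nn.Linear(768, 768)          # frozen pretrained projection (W_0)
for p in base.parameters():
    p.requires_grad = False

lora = LoRALayer(in_features=768, out_features=768, rank=8)

x = torch.randn(2, 16, 768)         # [batch, seq_len, hidden]
h = lora(x, base(x))                # W_0 x + scaling * B A x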

Parameter efficiency: For Llama-7B with LoRA (a back-of-the-envelope count follows the list below):

  • Base model: 7B parameters (frozen)
  • LoRA parameters: ~4M (trainable)
  • Trainable fraction: 0.057%
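Here is that back-of-the-envelope count (assuming a Llama-7B-like configuration in which LoRA adapts the query and value projections of each of 32 layers; the exact fraction depends on which matrices you adapt):

d_model = 4096          # hidden size
n_layers = 32           # transformer blocks
adapted_per_layer = 2   # W_q and W_v
rank = 8

lora_params = n_layers * adapted_per_layer * 2 * d_model * rank   # A and B per projection
base_params = 7e9

print(f"LoRA parameters: {lora_params / 1e6:.1f}M")              # ~4.2M
print(f"Trainable fraction: {lora_params / base_params:.3%}")    # ~0.06%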

Mathematics reveals beautiful truths about neural network redundancy

Through distillation, quantization, pruning, and low-rank factorization, you can compress models by 10-100× while retaining 90-98% of capabilities.

Key insights:

  1. Distillation works because soft probabilities encode class relationships
  2. Quantization works because weights are robust to precision reduction
  3. Pruning works because networks are over-parameterized
  4. LoRA works because updates live in low-dimensional subspaces

These aren't heuristics—they're mathematical principles with theoretical foundations.


Before you apply compression mathematics:

  1. Understand temperature before tuning it. T>1 flattens distributions (more knowledge transfer); T<1 sharpens them (more confident predictions).
  2. Calculate your quantization error budget. INT8 gives ~0.5% error per layer; with 24 layers, expect roughly 10% cumulative, so plan accordingly (a quick calculation is sketched after this list).
  3. Use magnitude pruning as baseline. More sophisticated methods rarely beat simple magnitude pruning by more than 5%.
  4. Start with LoRA rank=8. Most fine-tuning tasks don't need higher rank—scale up only when validation loss plateaus.
  5. Remember: 90% weights, 10% capacity. Pruning 90% of parameters typically loses only 10% of capability—neural networks are massively redundant.
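For point 2, a toy compounding calculation (the 0.5% per-layer figure is the rule of thumb from the list above, not a measurement):

per_layer_error = 0.005     # ~0.5% quality loss per quantized layer (rule of thumb)
n_layers = 24

cumulative = 1 - (1 - per_layer_error) ** n_layers
print(f"expected cumulative degradation: {cumulative:.1%}")   # roughly 11%, in the ~10% ballpark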

Mathematics doesn't lie. Tiny models work—and now you know why.


Sources and References

Institutional and Industry Research

  • Epoch AI — Tracks compute efficiency trends and mathematical foundations of model compression (as of January 2025).
  • Stanford HAI AI Index — Annual report on AI efficiency metrics and theoretical advances in compression.
  • MLCommons MLPerf Training — Industry-standard benchmarks validating compression mathematics.
  • Google DeepMind Research — Foundational research on neural network theory and compression.

Information Theory

  • Cover, T.M. & Thomas, J.A. (2006). Elements of Information Theory, 2nd Edition. Wiley. KL divergence and entropy foundations.
  • Shannon, C.E. (1948). A Mathematical Theory of Communication. Original information theory paper.


The math tells you what's possible before you train. That's cheaper than finding out after.