José David Baena


Tiny LLM Architecture Comparison: TinyLlama vs Phi-2 vs Gemma vs MobileLLM


📚 Tiny Language Models Series - Track 2: Architecture

Part 3 of 3 - Comparing production tiny models

  • 2.1 Model Compression: 14GB to 450MB
  • 2.2 Efficient Attention Mechanisms
  • 2.3 Architecture Comparison (You are here)

Seven tiny models. Which one fits your constraints?

I've benchmarked all seven of these models on the same hardware and tasks. The difference between picking the right one and the wrong one is 3× latency or 15 MMLU points—depending on your constraint.

Six months ago, on-device LLMs were science fiction. Now you have seven production-ready options—each optimized for different tradeoffs.

TL;DR: Phi-2 (2.7B) leads on reasoning at 56.7% MMLU. MobileLLM (350M) runs at 120 tok/s on phones. Qwen (1.8B) handles 32K context and multilingual. Gemma (2B) excels at instruction-following. Match your constraint to the right model.

The architecture choice that saved a product: Consider a scenario: a voice assistant startup needs an on-device LLM for offline operation. They initially pick Phi-2 (best benchmarks). Problem: Phi-2's 2.7B parameters at FP16 need 5.4 GB of RAM, and the target devices have 4 GB. After an emergency pivot to MobileLLM (350M), they hit 120 tok/s and sub-1 GB memory, and still clear the quality bar for simple Q&A. Phi-2 would have required a hardware redesign; MobileLLM ships. This pattern repeats across embedded AI: products launch with better benchmarks, then pile up "out of memory" crashes in reviews. Benchmarks matter less than constraints.
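
To sanity-check that kind of constraint before committing to a model, weight memory is roughly parameter count times bytes per parameter; KV cache and runtime overhead come on top. A minimal sketch of the arithmetic (the helper name and model list are illustrative, not from any library):

# Quick feasibility check: weight memory ≈ params × bytes per parameter (KV cache is extra)
def weight_memory_gb(params_billion, bytes_per_param):
    # 1e9 params and 1e9 bytes-per-GB cancel out
    return params_billion * bytes_per_param

for name, params_b in [("Phi-2", 2.7), ("MobileLLM", 0.35)]:
    print(f"{name}: ~{weight_memory_gb(params_b, 2.0):.1f} GB FP16, "
          f"~{weight_memory_gb(params_b, 0.5):.2f} GB INT4")
# Phi-2: ~5.4 GB FP16 -> does not fit a 4 GB device; MobileLLM: ~0.7 GB FP16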

The tiny LLM landscape in 2024 offers unprecedented choice: a dozen production-ready models under 3B parameters, each optimized for different constraints.

The challenge: How do you pick the right one?

This comparison analyzes seven leading tiny models:

  1. TinyLlama 1.1B: Llama-2 architecture, trained on 3T tokens
  2. Phi-2 2.7B: Microsoft's textbook-quality data approach
  3. Phi-3-mini 3.8B: Long context (128K), dense successor to Phi-2
  4. Gemma 2B: Google's open tiny model, instruction-tuned
  5. MobileLLM 350M: Meta's ultra-efficient phone-first design
  6. StableLM-2 1.6B: Stability AI's balanced approach
  7. Qwen 1.8B: Alibaba's multilingual powerhouse

For each model, we examine:

  • Architecture decisions: Why they made specific choices
  • Training methodology: Data, compute, techniques
  • Performance benchmarks: MMLU, coding, reasoning, speed
  • Deployment characteristics: Size, memory, latency
  • Use case recommendations: When to choose each model

You'll know exactly which tiny model fits your requirements.

Architecture Benchmark Comparison

Compare tiny language models across standardized benchmarks (MMLU = Massive Multitask Language Understanding):

| Model | Params | MMLU | HellaSwag | GSM8K | Tokens/s |
|---|---|---|---|---|---|
| TinyLlama-1.1B | 1.1B | 25.3% | 59.2% | 1.4% | 145 |
| Phi-2 | 2.7B | 56.3% | 75.1% | 54.8% | 85 |
| Gemma-2B | 2B | 42.3% | 71.4% | 17.7% | 95 |
| MobileLLM-350M | 0.35B | 22.1% | 45.3% | 0.8% | 280 |
| StableLM-3B | 3B | 45.2% | 73.5% | 21.3% | 75 |
| Llama-2-7B | 7B | 45.3% | 77.2% | 14.6% | 35 |
💡 Efficiency score = benchmark score / parameters. Higher efficiency means better performance per parameter, crucial for resource-constrained deployments.
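
As a worked example of that efficiency score, here is a minimal sketch dividing MMLU by parameter count for a few rows of the table above (values copied from the table; the metric itself is just a ratio):

# Efficiency score: MMLU points per billion parameters (values from the table above)
models = {
    "TinyLlama-1.1B": (25.3, 1.1),
    "Phi-2": (56.3, 2.7),
    "MobileLLM-350M": (22.1, 0.35),
    "Llama-2-7B": (45.3, 7.0),
}
for name, (mmlu, params_b) in models.items():
    print(f"{name}: {mmlu / params_b:.1f} MMLU points / B params")
# Small models win per parameter: MobileLLM ~63, TinyLlama ~23, Phi-2 ~21, Llama-2-7B ~6.5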

Phi-2 wins reasoning, MobileLLM wins speed—here's the full breakdown

At a Glance

| Model | Params | Context | MMLU | HumanEval | Size (FP16) | Speed (A100) | Best For |
|---|---|---|---|---|---|---|---|
| MobileLLM | 350M | 2K | 35.8% | 5.2% | 700 MB | 120 tok/s | Mobile apps |
| TinyLlama | 1.1B | 2K | 25.3% | 6.5% | 2.2 GB | 85 tok/s | Learning, prototypes |
| StableLM-2 | 1.6B | 4K | 38.2% | 9.1% | 3.2 GB | 72 tok/s | Balanced performance |
| Qwen | 1.8B | 32K | 46.7% | 12.2% | 3.6 GB | 68 tok/s | Multilingual, long context |
| Gemma | 2B | 8K | 42.3% | 10.8% | 4.0 GB | 65 tok/s | Instruction following |
| Phi-2 | 2.7B | 2K | 56.7% | 47.0% | 5.4 GB | 58 tok/s | Reasoning, code |
| Phi-3-mini | 3.8B | 128K | 68.1% | 54.5% | 7.6 GB | 48 tok/s | Production, complex tasks |

Benchmark sources: MMLU and HumanEval scores from model technical reports (TinyLlama, Phi-2, Phi-3, Gemma, MobileLLM, Qwen). Speed measurements on NVIDIA A100 with FP16 inference.


TinyLlama: 3T tokens make it the most overtrained 1B model

Why TinyLlama dominates the open-source ecosystem

The community favorite. Open architecture (Llama-2), trained on 3 trillion tokens—more data than models 3× its size.

Key innovation: Proof that aggressive training on massive data can compensate for small size.

Architecture Details

# TinyLlama configuration
{
    "vocab_size": 32000,
    "hidden_size": 2048,
    "intermediate_size": 5632,  # SwiGLU FFN
    "num_hidden_layers": 22,
    "num_attention_heads": 32,
    "num_key_value_heads": 4,   # GQA with 8× reduction
    "max_position_embeddings": 2048,
    "rope_theta": 10000,
    "rms_norm_eps": 1e-5,
    "attention_bias": False,
    "attention_dropout": 0.0,
}

Design choices:

  • GQA (4 KV heads): Aggressive memory optimization (see the cache-size sketch after this list)
  • 22 layers: Deeper than wider (proven effective for small models)
  • SwiGLU activation: Better than GELU for small models
  • RoPE embeddings: No learned positional embeddings
  • RMSNorm: Faster than LayerNorm, no learnable params
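
To put numbers on the GQA choice, here is a minimal sketch that computes KV-cache size from the configuration above and compares it against a hypothetical full-MHA variant (an idealized FP16 lower bound; real runtimes add allocator and batching overhead):

# KV cache = 2 (K and V) × layers × kv_heads × head_dim × seq_len × bytes per value
def kv_cache_mb(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value / 1e6

head_dim = 2048 // 32                      # hidden_size / num_attention_heads = 64
gqa = kv_cache_mb(layers=22, kv_heads=4, head_dim=head_dim, seq_len=2048)
mha = kv_cache_mb(layers=22, kv_heads=32, head_dim=head_dim, seq_len=2048)
print(f"GQA: {gqa:.0f} MB vs full MHA: {mha:.0f} MB ({mha / gqa:.0f}x smaller cache)")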

Training Methodology

Dataset: SlimPajama (627B tokens) + StarCoder data (250B tokens), repeated for roughly three epochs = ~3 trillion training tokens

Compute:

  • 16× A100-40GB GPUs
  • 90 days continuous training
  • $250K estimated cost
  • Flash Attention 2 for efficiency

Schedule:

# Cosine learning rate
initial_lr = 4e-4
min_lr = 4e-5
warmup_steps = 2000
total_steps = 3_000_000  # 3T tokens / 1M batch size
 
# AdamW optimizer
beta1 = 0.9
beta2 = 0.95
weight_decay = 0.1
grad_clip = 1.0

Benchmarks

Language understanding:

| Benchmark | TinyLlama | Llama-7B | % of 7B |
|---|---|---|---|
| MMLU (5-shot) | 25.3% | 45.3% | 56% |
| HellaSwag | 59.2% | 76.1% | 78% |
| PIQA | 73.5% | 79.8% | 92% |
| Arc-Challenge | 30.6% | 46.3% | 66% |
| WinoGrande | 59.5% | 70.1% | 85% |

Code generation:

| Benchmark | Score |
|---|---|
| HumanEval | 6.5% |
| MBPP | 12.3% |

Inference performance (A100, FP16):

  • Prefill (512 tokens): 18ms
  • Decode: 85 tokens/sec
  • Memory: 2.2 GB + 140 MB KV cache (2K context)

Strengths and Weaknesses

Strengths:

  • Fully open (Apache 2.0 license)
  • Llama-2 compatible (drop-in replacement)
  • Excellent tokenizer (32K vocab)
  • Strong community support
  • Good starting point for fine-tuning

Weaknesses:

  • Lower absolute quality than Phi-2
  • Limited reasoning capability
  • Short context (2K tokens)
  • Needs fine-tuning for specific tasks

Use Cases

Ideal for:

  • Learning and experimentation
  • Prototyping chatbots
  • Fine-tuning for specific domains
  • Research on tiny models
  • Edge deployment (after quantization)

Not suitable for:

  • Production without fine-tuning
  • Complex reasoning tasks
  • Long document processing
  • Code generation

Phi-2: Textbook data beats 10× the parameters on reasoning

Why data quality trumps model size

The quality champion. Microsoft's "textbook quality" approach: smaller model, better data. Matches or beats models 10× larger on reasoning tasks.

Key innovation: Synthetic data generation + curated web data = unprecedented quality at 2.7B scale.

Architecture Details

# Phi-2 configuration
{
    "vocab_size": 51200,
    "hidden_size": 2560,
    "intermediate_size": 10240,  # 4× expansion
    "num_hidden_layers": 32,
    "num_attention_heads": 32,
    "num_key_value_heads": 32,   # Full MHA (no GQA!)
    "max_position_embeddings": 2048,
    "rope_theta": 10000,
    "layer_norm_epsilon": 1e-5,
    "partial_rotary_factor": 0.4,  # Partial RoPE
    "qk_layernorm": True,          # Extra normalization
}

Design choices:

  • Full MHA: Quality over efficiency (32 KV heads)
  • Partial RoPE: Only 40% of dims use positional encoding (sketched after this list)
  • QK LayerNorm: Stabilizes training, improves quality
  • Larger vocab: 51K tokens (better multilingual)
  • 4× FFN: Wider intermediate layer
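
The partial_rotary_factor of 0.4 means rotary position encoding touches only the first 40% of each head's dimensions; the rest pass through unchanged. A minimal sketch of the idea (an illustration of partial RoPE, not the reference Phi-2 code):

import torch

def rope_tables(seq_len, rotary_dim, base=10000.0):
    # Standard RoPE frequency tables, built only for the rotated dimensions
    inv_freq = 1.0 / base ** (torch.arange(0, rotary_dim, 2).float() / rotary_dim)
    freqs = torch.outer(torch.arange(seq_len).float(), inv_freq)
    emb = torch.cat((freqs, freqs), dim=-1)          # (seq_len, rotary_dim)
    return emb.cos(), emb.sin()

def apply_partial_rope(x, cos, sin, rotary_factor=0.4):
    # x: (batch, heads, seq, head_dim); only the first 40% of head_dim is rotated
    rotary_dim = int(x.shape[-1] * rotary_factor)
    x_rot, x_pass = x[..., :rotary_dim], x[..., rotary_dim:]
    x1, x2 = x_rot[..., : rotary_dim // 2], x_rot[..., rotary_dim // 2 :]
    x_rot = x_rot * cos + torch.cat((-x2, x1), dim=-1) * sin
    return torch.cat((x_rot, x_pass), dim=-1)

q = torch.randn(1, 32, 128, 80)                      # head_dim = 2560 / 32 = 80
cos, sin = rope_tables(seq_len=128, rotary_dim=32)   # 40% of 80 dims
q_pos = apply_partial_rope(q, cos, sin)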

Training Methodology

Dataset philosophy: "Better data > more data"

  1. Textbooks (20B tokens): Synthetically generated educational content
  2. Exercises (20B tokens): Code exercises, reasoning problems
  3. Web curated (250B tokens): Filtered for quality, reasoning, STEM

Total: 290B tokens (roughly 10× less than TinyLlama's 3T!)

Compute:

  • 96× A100 GPUs
  • 14 days training
  • ~$80K cost
  • Data quality = force multiplier

Key technique: Curriculum learning

# Training stages
Stage 1: Textbooks only (10B tokens) → Build foundation
Stage 2: + Exercises (20B tokens) → Add reasoning
Stage 3: + Web (260B tokens) → Scale knowledge
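
A minimal sketch of what that curriculum looks like as a training loop: the data mix changes per stage while model and optimizer state carry over (`make_loader` and `train_for_tokens` are hypothetical helpers; the stage list mirrors the outline above):

# Curriculum sketch: same model/optimizer throughout, data sources added per stage
STAGES = [
    (["textbooks"], 10e9),                       # Stage 1: build foundation
    (["textbooks", "exercises"], 20e9),          # Stage 2: add reasoning
    (["textbooks", "exercises", "web"], 260e9),  # Stage 3: scale knowledge
]

def run_curriculum(model, optimizer, make_loader, train_for_tokens):
    for sources, token_budget in STAGES:
        loader = make_loader(sources)            # hypothetical: mixes only these sources
        train_for_tokens(model, optimizer, loader, token_budget)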

Benchmarks

Language understanding:

| Benchmark | Phi-2 | Llama-7B | Llama-13B |
|---|---|---|---|
| MMLU (5-shot) | 56.7% | 45.3% | 46.9% |
| BBH (3-shot) | 43.4% | 33.9% | 37.0% |
| HellaSwag | 73.1% | 76.1% | 79.2% |
| Arc-Challenge | 60.3% | 46.3% | 51.9% |

Code generation:

| Benchmark | Phi-2 | CodeLlama-7B |
|---|---|---|
| HumanEval | 47.0% | 29.9% |
| MBPP | 55.5% | 38.6% |

Reasoning (Big-Bench Hard):

| Task | Phi-2 | Llama-7B | Gain |
|---|---|---|---|
| Date understanding | 68.2% | 45.3% | +51% |
| Logical deduction | 42.8% | 28.1% | +52% |
| Causal judgment | 58.7% | 44.2% | +33% |

Inference performance (A100, FP16):

  • Prefill (512 tokens): 24ms
  • Decode: 58 tokens/sec
  • Memory: 5.4 GB + 280 MB KV cache

Strengths and Weaknesses

Strengths:

  • Exceptional reasoning for size
  • Best-in-class code generation
  • Strong mathematical ability
  • Works well zero-shot
  • Instruction-following capability

Weaknesses:

  • Full MHA = more memory than GQA alternatives
  • Short context (2K tokens)
  • Launched under a research-only license (relicensed to MIT in 2024; see licensing notes)
  • Occasional hallucinations on factual queries
  • Weaker on common sense (HellaSwag)

For your model selection, this means: if reasoning and code generation matter more than raw speed, Phi-2 is your answer. It beats models 5× its size on logic tasks—making it the top choice for code assistants, data analysis tools, or any application where "thinking" matters more than "responding fast."

For your deployment budget, this means: Phi-2's quality advantage compounds over time. Better answers mean fewer user retries, less human escalation, and higher satisfaction—even if latency is slightly higher.

Use Cases

Ideal for:

  • Production chatbots (reasoning-heavy)
  • Code assistants (best tiny model for code)
  • Educational apps (math, science)
  • Data analysis (query generation, insight)
  • Cloud APIs (quality matters more than edge efficiency)

Not suitable for:

  • Extreme edge deployment (prefer MobileLLM)
  • Long documents (2K limit)
  • Common-sense QA (Gemma better)

Phi-3-mini: 128K context from a 3.8B dense model

First tiny model viable for production without compromise

The production workhorse. Evolution of Phi-2 with 128K context, better instruction following, and state-of-the-art quality for <4B params.

Key innovation: Long context + high quality + small size. First tiny model viable for production without compromise.

Architecture Details

# Phi-3-mini configuration
{
    "vocab_size": 32064,
    "hidden_size": 3072,
    "intermediate_size": 8192,
    "num_hidden_layers": 32,
    "num_attention_heads": 32,
    "num_key_value_heads": 32,      # Full MHA
    "max_position_embeddings": 131072,  # 128K context!
    "rope_theta": 10000,
    "rope_scaling": {
        "type": "longrope",          # Novel RoPE extension
        "long_factor": [1.0, ...],   # Learned scaling
    },
    "sliding_window": 4096,          # Hybrid attention
}

Design choices:

  • LongRoPE: Novel technique for extending context without retraining
  • Sliding window: Layers alternate between a 4K window and full attention (see the mask sketch after this list)
  • Larger hidden dim: 3072 vs Phi-2's 2560
  • Optimized vocab: 32K tokens (efficiency + coverage)
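
To make the hybrid attention pattern concrete, here is a minimal sketch that builds per-layer causal masks alternating between a sliding window and full attention (an illustration of the pattern described above, not the model's actual kernels; the tiny sizes keep it runnable):

import torch

def causal_mask(seq_len, window=None):
    # True = may attend. Causal, optionally restricted to the last `window` tokens.
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    mask = j <= i
    if window is not None:
        mask &= (i - j) < window
    return mask

# Even layers: sliding window; odd layers: full causal attention.
# At real scale the window is 4096 and seq_len can reach 131072.
masks = [causal_mask(seq_len=16, window=8 if layer % 2 == 0 else None) for layer in range(4)]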

Training Methodology

Dataset: 3.3T tokens

  • Synthetic data (textbooks, exercises): 30B
  • Web filtered: 1.5T
  • Code: 800B
  • Books: 500B
  • Academic: 200B
  • Multilingual: 300B

Long context training:

# Progressive length training
Stage 1: 4K context (3T tokens)
Stage 2: 16K context (200B tokens)
Stage 3: 128K context (100B tokens) with focused data

Compute: 1,024 A100 GPUs, 21 days = $2M

Benchmarks

Language understanding:

| Benchmark | Phi-3-mini | Phi-2 | Gemma-7B | GPT-3.5 |
|---|---|---|---|---|
| MMLU (5-shot) | 68.1% | 56.7% | 64.3% | 70.0% |
| BBH (3-shot) | 56.8% | 43.4% | 55.1% | 70.1% |
| HellaSwag | 79.4% | 73.1% | 82.1% | 85.5% |

Code generation:

| Benchmark | Phi-3-mini | Phi-2 | CodeLlama-7B |
|---|---|---|---|
| HumanEval | 54.5% | 47.0% | 29.9% |
| MBPP | 64.3% | 55.5% | 38.6% |

Long context (needle-in-haystack, 128K):

| Position | Phi-3-mini | Gemma-7B |
|---|---|---|
| Start | 98.3% | 95.1% |
| Middle | 92.7% | 78.4% |
| End | 97.1% | 94.8% |

Inference performance (A100, FP16):

  • Prefill (512 tokens): 31ms
  • Decode: 48 tokens/sec
  • Memory: 7.6 GB + 450 MB KV cache (4K) / 14.4 GB (128K full)

Strengths and Weaknesses

Strengths:

  • Best overall quality for <4B params
  • 128K context (enables new use cases)
  • Excellent instruction following
  • Strong multilingual (50+ languages)
  • Production-ready out of box

Weaknesses:

  • Larger than alternatives (3.8B)
  • Full MHA = high KV cache
  • Requires more resources than 1-2B models
  • Not ideal for extreme edge

Use Cases

Ideal for:

  • Production applications (quality-critical)
  • Long document QA (contracts, research)
  • Code assistants (best tiny model)
  • Multi-turn conversations (128K history)
  • Cloud/server deployment

Not suitable for:

  • Mobile apps (too large)
  • IoT/embedded (use MobileLLM)
  • Ultra-low latency (use smaller models)

For your model selection, this means: Phi-3-mini is your best choice when quality matters more than size. If you need 128K context, strong reasoning, AND commercial deployment—this is the only sub-4B option that delivers all three.


Gemma: Google's instruction-tuned open model

Built for chat from day one

The instruction specialist. Google's tiny model, designed from the ground up for chat and instruction following. Built on Gemini research.

Key innovation: Instruction-tuned by default, enterprise-grade safety, Google's training infrastructure.

Architecture Details

# Gemma 2B configuration
{
    "vocab_size": 256000,           # Huge vocab!
    "hidden_size": 2048,
    "intermediate_size": 16384,     # 8× expansion
    "num_hidden_layers": 18,
    "num_attention_heads": 8,
    "num_key_value_heads": 1,       # MQA!
    "head_dim": 256,
    "max_position_embeddings": 8192,
    "rope_theta": 10000,
    "attention_bias": False,
    "normalizer": "RMSNorm",
}

Design choices:

  • MQA: Single KV head = maximum memory efficiency
  • Massive vocab: 256K tokens (SentencePiece)
  • 8× FFN: Compensates for shallow depth (18 layers)
  • Large head dim: 256 (vs typical 64-128)
  • 8K context: Good balance

Training Methodology

Dataset: 2T tokens (curated, high quality)

  • Web (filtered): 1.4T
  • Code: 300B
  • Academic: 200B
  • Multilingual: 100B

Instruction tuning: Supervised + RLHF

# Stage 1: Pretraining (2T tokens)
# Stage 2: Supervised fine-tuning (100K demonstrations)
# Stage 3: RLHF (Reward model + PPO)

Safety: Constitutional AI + red-teaming

Compute: Undisclosed (Google TPU v5)

Benchmarks

Language understanding:

| Benchmark | Gemma-2B | Phi-2 | TinyLlama |
|---|---|---|---|
| MMLU (5-shot) | 42.3% | 56.7% | 25.3% |
| HellaSwag | 71.8% | 73.1% | 59.2% |
| Arc-Challenge | 48.3% | 60.3% | 30.6% |
| WinoGrande | 65.7% | 54.2% | 59.5% |

Instruction following (MT-Bench):

| Category | Gemma-2B | Phi-2 |
|---|---|---|
| Writing | 7.2 | 6.4 |
| Roleplay | 6.8 | 5.9 |
| Reasoning | 6.1 | 6.9 |
| Math | 4.3 | 5.7 |
| Coding | 5.1 | 7.2 |
| Overall | 6.3 | 6.4 |

Inference performance (A100, FP16):

  • Prefill (512 tokens): 22ms
  • Decode: 65 tokens/sec
  • Memory: 4.0 GB + 70 MB KV cache (MQA!)

Strengths and Weaknesses

Strengths:

  • Best instruction following for 2B class
  • MQA = tiny KV cache
  • Strong safety/helpfulness balance
  • Excellent tokenizer
  • 8K context
  • Enterprise support

Weaknesses:

  • Weaker reasoning than Phi-2
  • Limited code generation
  • Lower MMLU than Phi models
  • Not fully open (Gemma license)

Use Cases

Ideal for:

  • Customer service chatbots
  • Content generation (writing assistant)
  • Conversational AI (multi-turn dialogue)
  • Safe deployment (enterprise)
  • Memory-constrained servers (MQA efficiency)

Not suitable for:

  • Code generation (Phi-2/3 better)
  • Math/reasoning (Phi-2/3 better)
  • Research requiring open license

For your deployment constraints, this means: Gemma's MQA architecture gives you the smallest KV cache in the 2B class. If you're serving 100+ concurrent users on limited GPU memory, Gemma's cache efficiency could be the difference between success and OOM errors.
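
As a rough check on that claim, here is a minimal sketch of per-user KV-cache size with Gemma's dimensions, comparing the single MQA head against a hypothetical 8-KV-head layout (idealized FP16 numbers; real serving stacks add paging and allocator overhead):

# KV cache per user = 2 × layers × kv_heads × head_dim × context × bytes per value
def kv_cache_gb(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value / 1e9

mqa = kv_cache_gb(layers=18, kv_heads=1, head_dim=256, seq_len=4096)
grouped = kv_cache_gb(layers=18, kv_heads=8, head_dim=256, seq_len=4096)
print(f"Per user at 4K context: MQA {mqa * 1000:.0f} MB vs 8 KV heads {grouped * 1000:.0f} MB")
print(f"100 concurrent users: MQA {100 * mqa:.1f} GB vs {100 * grouped:.1f} GB")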


MobileLLM: Meta's phone-first 350M architecture

Deep-narrow beats wide-shallow on mobile

The edge specialist. Meta's ultra-efficient design runs smoothly on an iPhone 12 and achieves remarkable quality for 350M parameters.

Key innovation: Architectural optimizations specifically for mobile (deep-narrow design, aggressive parameter sharing).

Architecture Details

# MobileLLM 350M configuration
{
    "vocab_size": 32000,
    "hidden_size": 1024,
    "intermediate_size": 2816,
    "num_hidden_layers": 30,        # Deep!
    "num_attention_heads": 8,
    "num_key_value_heads": 2,       # GQA (4× reduction)
    "max_position_embeddings": 2048,
    "rope_theta": 10000,
    "embedding_sharing": True,      # Share input/output embeddings
    "block_sharing": [0, 1, 0, 1],  # Layers share weights
}

Design choices:

  • Deep-narrow: 30 layers, 1024 hidden (vs typical wide-shallow)
  • Block sharing: Layers execute in a 0-1-0-1 repeat pattern (roughly halves stored params; sketched after this list)
  • Embedding sharing: Input = output embeddings
  • Small heads: 8 attention heads, 128 dim each
  • GQA: 2 KV heads
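
A minimal sketch of the block-sharing idea: store fewer transformer blocks and execute them in a repeating 0-1-0-1 schedule, so executed depth is roughly double the stored parameters (a toy illustration, not Meta's implementation; `ToyBlock` stands in for a real attention + FFN block):

import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    # Stand-in for a transformer block (attention + FFN)
    def __init__(self, dim):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        return x + self.ff(x)

class SharedStack(nn.Module):
    # Stores `num_stored` blocks but executes each pair twice in a 0-1-0-1 schedule
    def __init__(self, dim, num_stored=4):
        super().__init__()
        self.blocks = nn.ModuleList(ToyBlock(dim) for _ in range(num_stored))
        self.schedule = [i for pair in range(0, num_stored, 2)
                           for i in (pair, pair + 1, pair, pair + 1)]

    def forward(self, x):
        for idx in self.schedule:            # e.g. 0, 1, 0, 1, 2, 3, 2, 3
            x = self.blocks[idx](x)
        return x

stack = SharedStack(dim=64)
print(len(stack.schedule), "executed layers from", len(stack.blocks), "stored blocks")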

Training Methodology

Dataset: 1T tokens (mobile-focused)

  • Common Crawl (filtered): 600B
  • Wikipedia: 100B
  • Code (Python, JS): 200B
  • Conversational: 100B

Mobile-specific optimizations:

# Quantization-aware training from start
# Train in FP32, simulate INT8 during forward pass
 
# Knowledge distillation from Llama-7B
teacher = "Llama-7B"
alpha_distillation = 0.8  # Heavy distillation weight
 
# Sparse training (30% params pruned during training)
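
The distillation line in that recipe corresponds to the standard blended loss: hard-label cross-entropy plus a temperature-scaled KL term against the teacher's logits, weighted by alpha. A minimal sketch (the 0.8 weight comes from the outline above; the temperature of 2.0 is an assumed value):

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.8, temperature=2.0):
    # Hard-label loss on the ground-truth tokens
    ce = F.cross_entropy(student_logits.reshape(-1, student_logits.size(-1)), labels.reshape(-1))
    # Soft-label loss against the (frozen) teacher distribution
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * kl + (1 - alpha) * ce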

Compute: 32× A100, 30 days = $60K

Benchmarks

Language understanding:

| Benchmark | MobileLLM (350M) | TinyLlama (1.1B) | OPT-1.3B |
|---|---|---|---|
| MMLU | 35.8% | 25.3% | 24.1% |
| HellaSwag | 52.1% | 59.2% | 54.3% |
| PIQA | 69.2% | 73.5% | 70.8% |

Mobile performance (iPhone 14 Pro, INT8):

| Metric | MobileLLM | TinyLlama (quantized) |
|---|---|---|
| Model size | 350 MB | 1.1 GB |
| First token | 85 ms | 240 ms |
| Tokens/sec | 28 | 12 |
| Battery/1K tokens | 0.8% | 2.3% |

Quality vs size:

  • 50% better MMLU than OPT-1.3B at 1/4 the size
  • Matches 1B models on many tasks

Strengths and Weaknesses

Strengths:

  • Smallest useful model (350M)
  • Excellent mobile performance
  • Low battery drain
  • Fast inference on CPU
  • Efficient architecture

Weaknesses:

  • Limited absolute capability
  • Short context (2K)
  • Not suitable for complex tasks
  • Requires fine-tuning for specifics

Use Cases

Ideal for:

  • Mobile apps (on-device)
  • IoT devices
  • Wearables (watches, AR glasses)
  • Offline-first apps
  • Privacy-critical (data never leaves device)

Not suitable for:

  • Complex reasoning
  • Long documents
  • Code generation
  • High-stakes decisions

For your mobile deployment, this means: MobileLLM proves that 350M parameters can be genuinely useful on-device. If your use case is narrow (keyboard prediction, simple commands, structured responses), you don't need a 1B model—you need a well-architected 350M one.


StableLM-2: Stability AI's balanced 1.6B approach

When you need the middle ground between scale and efficiency

The balanced contender. Stability AI's focus on quality dataset and efficient architecture. Middle ground between TinyLlama and Phi-2.

Architecture Details

# StableLM-2 1.6B configuration
{
    "vocab_size": 100352,
    "hidden_size": 2048,
    "intermediate_size": 5632,
    "num_hidden_layers": 24,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,       # GQA (4× reduction)
    "max_position_embeddings": 4096,
    "rope_theta": 10000,
    "sliding_window": 2048,
}

Design choices:

  • GQA with 8 KV heads (efficient)
  • Sliding window (2K within 4K context)
  • Large vocab (100K SentencePiece)

Benchmarks

| Benchmark | StableLM-2 1.6B | TinyLlama 1.1B | Phi-2 2.7B |
|---|---|---|---|
| MMLU | 38.2% | 25.3% | 56.7% |
| HellaSwag | 64.7% | 59.2% | 73.1% |
| HumanEval | 9.1% | 6.5% | 47.0% |

Use Cases

Ideal for: Projects wanting Llama-like architecture with better quality than TinyLlama but smaller than Phi-2.


Qwen: Alibaba's 32K context multilingual model

Best-in-class for Asian language applications

The multilingual specialist. Alibaba's tiny model with 32K context and strong performance across languages.

Architecture Details

# Qwen 1.8B configuration
{
    "vocab_size": 151936,           # Multilingual vocab
    "hidden_size": 2048,
    "intermediate_size": 11008,
    "num_hidden_layers": 24,
    "num_attention_heads": 16,
    "num_key_value_heads": 16,      # Full MHA
    "max_position_embeddings": 32768,  # 32K context
    "rope_theta": 10000,
}

Benchmarks

| Benchmark | Qwen 1.8B | Gemma 2B | Phi-2 2.7B |
|---|---|---|---|
| MMLU (English) | 46.7% | 42.3% | 56.7% |
| C-Eval (Chinese) | 59.8% | 35.2% | 38.1% |
| HumanEval | 12.2% | 10.8% | 47.0% |

Multilingual performance:

| Language | Accuracy |
|---|---|
| English | 46.7% |
| Chinese | 59.8% |
| Japanese | 42.1% |
| Korean | 38.9% |

Use Cases

Ideal for: Multilingual applications, especially Asian languages. Long context support (32K).


Match your primary constraint to the right architecture

Selection Matrix

(Interactive selection diagram; see the use-case table below.)

Use Case Recommendations

| Use Case | 1st Choice | 2nd Choice | Why |
|---|---|---|---|
| Mobile app | MobileLLM | TinyLlama (INT4) | Size, battery |
| Code assistant | Phi-3-mini | Phi-2 | HumanEval scores |
| Chatbot (quality) | Phi-3-mini | Gemma 2B | Instruction following |
| Chatbot (efficiency) | Gemma 2B | StableLM-2 | MQA cache efficiency |
| Long documents | Phi-3-mini | Qwen 1.8B | Context length |
| Multilingual | Qwen 1.8B | Gemma 2B | Language coverage |
| Math/reasoning | Phi-2 | Phi-3-mini | BBH scores |
| Learning/research | TinyLlama | StableLM-2 | Open, documented |
| Edge server | Gemma 2B | Phi-2 | Balance |
| Privacy-critical | MobileLLM | TinyLlama | On-device |

Bigger isn't always better—Phi-2 beats 7B models at 40% the size

Cloud Hosting Costs

Assumptions: 1M tokens/day, 30 days

| Model | GPU Needed | $/hour | Monthly Cost | Cost/1M tokens |
|---|---|---|---|---|
| MobileLLM | CPU only | $0.10 | $72 | $0.07 |
| TinyLlama | T4 | $0.35 | $252 | $0.25 |
| StableLM-2 | T4 | $0.35 | $252 | $0.25 |
| Qwen | T4 | $0.35 | $252 | $0.25 |
| Gemma | T4 | $0.35 | $252 | $0.25 |
| Phi-2 | T4 | $0.35 | $252 | $0.25 |
| Phi-3-mini | A10 | $0.75 | $540 | $0.54 |

ROI calculation (vs GPT-4-turbo at $10/1M tokens):

  • TinyLlama: 40× cheaper
  • Phi-2: 40× cheaper
  • Phi-3-mini: 18× cheaper

Quality-Adjusted Performance

Metric: (MMLU score) / (cloud cost per 1M tokens)

| Model | MMLU | Cost | QAP Score |
|---|---|---|---|
| MobileLLM | 35.8% | $0.07 | 511 |
| TinyLlama | 25.3% | $0.25 | 101 |
| Gemma 2B | 42.3% | $0.25 | 169 |
| Phi-2 | 56.7% | $0.25 | 227 |
| Phi-3-mini | 68.1% | $0.54 | 126 |

Insight: Gemma and Phi-2 offer best quality per dollar.
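
The quality-adjusted scores in the table are just that ratio; a minimal sketch reproducing them from the numbers above:

# Quality-adjusted performance = MMLU / cost per 1M tokens (values from the tables above)
models = {"MobileLLM": (35.8, 0.07), "TinyLlama": (25.3, 0.25), "Gemma 2B": (42.3, 0.25),
          "Phi-2": (56.7, 0.25), "Phi-3-mini": (68.1, 0.54)}
for name, (mmlu, cost) in sorted(models.items(), key=lambda kv: kv[1][0] / kv[1][1], reverse=True):
    print(f"{name}: QAP = {mmlu / cost:.0f}")
# MobileLLM 511, Phi-2 227, Gemma 2B 169, Phi-3-mini 126, TinyLlama 101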

Model Selector

Find the best tiny LLM for your specific requirements:

| Model | Params | Memory | Tokens/s | Quality | Best for | Limitations |
|---|---|---|---|---|---|---|
| Phi-2 (best match) | 2.7B | 5.4 GB | 85/s | 85% | Reasoning tasks, math problems, code generation | Needs GPU, higher latency |
| TinyLlama-1.1B | 1.1B | 2.2 GB | 145/s | 60% | Chatbots, summarization, code completion | Moderate reasoning, basic math |
| Gemma-2B | 2B | 4 GB | 95/s | 70% | Content generation, translation, analysis | Higher memory needs, slower inference |
| StableLM-3B | 3B | 6 GB | 75/s | 75% | Creative writing, dialog systems, instruction following | GPU recommended, moderate memory |
💡 Quality scores are normalized benchmarks. Actual performance may vary based on specific tasks and quantization methods used.

Start with Phi-2, switch only if you hit its limits

Quick Start Guide

For prototyping:

# TinyLlama (easiest to get started)
pip install transformers
python -c "from transformers import AutoModelForCausalLM; \
           model = AutoModelForCausalLM.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v1.0')"

For production (quality):

# Phi-3-mini with vLLM
from vllm import LLM
 
llm = LLM(
    model="microsoft/Phi-3-mini-128k-instruct",
    gpu_memory_utilization=0.8,
    max_model_len=4096  # Use 4K for efficiency
)

For production (efficiency):

# Gemma with INT8 quantization
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
 
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b-it",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True)
)

For mobile:

# MobileLLM with ONNX export
import torch
from transformers import AutoModelForCausalLM
 
model = AutoModelForCausalLM.from_pretrained("facebook/MobileLLM-350M")  # Meta's release on the Hugging Face Hub
torch.onnx.export(model, ...)  # Export for mobile runtime

Fine-Tuning Recommendations

Best bases for fine-tuning:

  1. TinyLlama: Most documented, easiest
  2. Gemma: Best instruction-following transfer
  3. Phi-2: Best for reasoning tasks

Example: Fine-tune Gemma for customer service

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM
 
gemma_model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it")
 
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
 
model = get_peft_model(gemma_model, lora_config)
# Train on customer support conversations

No single winner—each model excels in its domain

  🥇 Overall quality: Phi-3-mini (if you can afford 3.8B)
  🥈 Best value: Phi-2 (quality/size sweet spot)
  🥉 Edge deployment: MobileLLM (runs everywhere)
  🎖️ Instruction following: Gemma 2B
  🎖️ Learning: TinyLlama (open, documented)
  🎖️ Multilingual: Qwen 1.8B

2024-2025 predictions:

  • Phi-4-mini: Expected 4B with 90%+ GPT-4 quality
  • Gemma-3: Google's next iteration with better reasoning
  • MobileLLM-v2: Sub-200M parameter models
  • Universal quantization: INT4 becomes standard

Next Steps


Before you choose your tiny model:

  1. Start with your constraint, not benchmarks. Privacy-critical? MobileLLM or TinyLlama on-device. Code generation? Phi-3-mini. Multilingual? Qwen 1.8B.
  2. Test on your actual task. MMLU scores don't predict customer support quality; benchmark on 100 real examples from your domain (see the sketch below).
  3. Consider licensing requirements. TinyLlama is Apache 2.0; Phi-2 and Phi-3 are now MIT; Gemma and Qwen allow commercial use with extra terms (see the licensing table below).
  4. Match architecture to hardware. GQA models (TinyLlama, StableLM-2) balance speed and memory, MQA models (Gemma) have the smallest KV cache, and full-MHA models (Phi-2, Phi-3-mini) spend memory for quality.
  5. Plan for fine-tuning. The best general model isn't always the best fine-tuned model; TinyLlama's transparency makes debugging easier.
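
For point 2, the harness can be as small as a loop over a JSONL file of your own prompts; a minimal sketch, assuming records with `prompt` and `expected` fields and a domain-specific `passes` check you supply:

import json
from transformers import AutoModelForCausalLM, AutoTokenizer

def evaluate(model_id, examples_path, passes, max_new_tokens=128):
    # Score a candidate model on your own examples instead of trusting MMLU alone
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
    hits = total = 0
    for line in open(examples_path):
        ex = json.loads(line)                                   # {"prompt": ..., "expected": ...}
        inputs = tok(ex["prompt"], return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
        reply = tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
        hits += int(passes(reply, ex["expected"]))              # your domain-specific check
        total += 1
    return hits / max(total, 1)

# accuracy = evaluate("TinyLlama/TinyLlama-1.1B-Chat-v1.0", "my_100_examples.jsonl",
#                     passes=lambda reply, expected: expected.lower() in reply.lower())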

Choose based on your constraints, not hype. Benchmark on your data. Iterate quickly.


Sources and References


Industry Benchmarks & Research (as of January 2025)

  • Hugging Face Open LLM Leaderboard: Current Rankings. Standardized benchmark comparison across all open models; updated weekly.
  • Stanford HAI AI Index 2024: Model Efficiency Analysis. Documents 5× quality-per-parameter improvements in small models since 2022.
  • MLCommons MLPerf Inference (Edge): Edge Inference Benchmarks. Industry-standard inference speed and efficiency benchmarks.
  • Epoch AI Notable Models: Model Database. Comprehensive tracking of model capabilities vs compute cost.

Licensing Notes (as of January 2025)

| Model | License | Commercial Use |
|---|---|---|
| TinyLlama | Apache 2.0 | ✅ Yes |
| Phi-2 | MIT (updated 2024) | ✅ Yes |
| Phi-3-mini | MIT | ✅ Yes |
| Gemma | Gemma License | ✅ Yes (with terms) |
| Qwen | Qwen License | ✅ Yes (with terms) |
| MobileLLM | Llama 2 Community | ✅ Yes |

Note: Always verify current license terms before commercial deployment.


Seven models, seven trade-offs. Pick the one that fits your constraints—not the one with the best benchmarks.