Edge Device Deployment: Running Tiny LLMs on Raspberry Pi, Mobile, and IoT

📚 Tiny Language Models Series - Track 4: Deployment

Edge device deployment strategies

4.1 Edge Device Deployment Guide (You are here)

Your cloud model runs at 50 tok/s. Your Pi runs at 2 tok/s. Let's fix that.

I spent two weeks getting a 1.1B model to run acceptably on a Raspberry Pi 4. The naive approach was painful—2 tok/s unusable. But the right combination of quantization, runtime, and SIMD intrinsics got me to 28 tok/s. Here's the full path.

Naive deployment: 2 tokens/sec on Raspberry Pi 4. Optimized deployment: 28 tokens/sec. Same hardware, same model, 14× faster.

TL;DR: INT4 quantization cuts size 4×. llama.cpp with ARM NEON handles inference. Coral TPU adds 10× for supported ops. Docker containers enable OTA updates. Result: 28 tok/s on $35 hardware.

The factory floor disaster: Consider a common edge deployment failure: deploying a "tiny" 1.1B model on Raspberry Pi 4s for quality control—vision + language for defect descriptions. First batch of Pis crashes within hours. Memory exhaustion. Testing happened on Pi 5 (8GB RAM), deployment to Pi 4 (4GB). The FP16 model alone is 2.2GB. With KV cache growth during inference, memory hits 4GB by the third image. After emergency INT4 quantization (550MB model, sub-1GB peak memory), the production line restarts. Lesson learned: edge deployment testing must match production hardware exactly. There's no headroom.

Your tiny model works perfectly in the cloud: 50 tokens/sec on A100, 200ms latency, $0.25 per million tokens. But your product requires:

On-device inference (privacy, offline support)
< $100 hardware (Raspberry Pi, not A100)
< 1W power (battery-powered IoT)
Real-time response (< 500ms end-to-end)

Challenge: Deploy 1.1B model on 4GB Raspberry Pi 4 with acceptable performance.

Result: Complete deployment pipeline from quantization to production, achieving 28 tokens/sec on Raspberry Pi 4 (vs 2 tok/s naive deployment).

What you'll build:

Hardware selection: Pi 4, Jetson Nano, Coral, mobile chips
Model optimization: INT8/INT4 quantization, pruning, compilation
Runtime selection: ONNX, TFLite, llama.cpp, MLX
Acceleration: NEON, Coral TPU, NPU utilization
Production deployment: Docker, monitoring, OTA updates
Case studies: Real deployments with metrics

Production-ready deployment on your target edge device.

Prerequisites and Installation

System Requirements:

Operating System: Linux (Raspberry Pi OS, Ubuntu), macOS (limited support), or Windows (WSL recommended)
Python: 3.8-3.11 (3.9 recommended for best compatibility)
RAM: 4GB minimum (8GB recommended for comfortable development)
Storage: 10GB+ free space (models + compilation artifacts)
Compilers: GCC/G++ for ARM, CUDA Toolkit for NVIDIA devices

Platform-Specific Requirements:

Platform	Additional Requirements
Raspberry Pi 4	ARM compiler, OpenBLAS for optimized math
Jetson Nano	JetPack SDK (includes CUDA, cuDNN, TensorRT)
Google Coral	Edge TPU runtime, libedgetpu
Generic ARM	ARM NEON intrinsics support
x86 Linux	AVX2 instruction set (modern CPUs)

Installation:

# Base dependencies (Ubuntu/Debian)
sudo apt update && sudo apt upgrade -y
sudo apt install -y \
    python3-pip \
    python3-dev \
    cmake \
    build-essential \
    git
 
# ARM-specific (Raspberry Pi, Jetson)
sudo apt install -y libopenblas-dev
 
# NVIDIA Jetson-specific
sudo apt install -y nvidia-jetpack tensorrt
 
# Google Coral-specific
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
echo "deb https://packages.cloud.google.com/apt coral-edgetpu-stable main" | sudo tee /etc/apt/sources.list.d/coral-edgetpu.list
sudo apt update
sudo apt install -y libedgetpu1-std python3-pycoral
 
# Python packages
pip install --upgrade pip
pip install \
    torch==2.1.0 \
    transformers==4.36.0 \
    onnxruntime==1.16.3 \
    onnx==1.15.0 \
    llama-cpp-python==0.2.20
 
# For NVIDIA devices with GPU support
pip install onnxruntime-gpu==1.16.3
pip install tensorrt  # Jetson only
 
# For quantization tools
pip install bitsandbytes optimum

Verify Installation:

# test_edge_setup.py
import sys
import torch
import onnxruntime as ort
from transformers import AutoTokenizer
 
print("=== Edge Deployment Environment Check ===\n")
 
# Python version
print(f"Python: {sys.version}")
 
# PyTorch
print(f"PyTorch: {torch.__version__}")
print(f"CUDA Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA Version: {torch.version.cuda}")
 
# ONNX Runtime
print(f"\nONNX Runtime: {ort.__version__}")
print(f"Available Providers: {ort.get_available_providers()}")
 
# Check for hardware-specific acceleration
if "CUDAExecutionProvider" in ort.get_available_providers():
    print("✅ CUDA acceleration available")
elif "TensorrtExecutionProvider" in ort.get_available_providers():
    print("✅ TensorRT acceleration available")
elif "CoreMLExecutionProvider" in ort.get_available_providers():
    print("✅ CoreML acceleration available (Apple)")
else:
    print("ℹ️  CPU-only mode (expected for Raspberry Pi)")
 
print("\n✅ Environment ready for edge deployment!")

Platform-Specific Setup Notes:

Raspberry Pi 4:

Use 64-bit OS for better performance
Disable swap to prevent slowdowns: sudo dphys-swapfile swapoff
Set CPU governor to performance: echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
Allocate minimum GPU memory (16MB) in /boot/config.txt: gpu_mem=16

NVIDIA Jetson Nano:

Install JetPack SDK first (includes all dependencies)
Enable maximum performance mode: sudo nvpmodel -m 0
Set fan to maximum: sudo jetson_clocks

Google Coral:

Edge TPU requires TensorFlow Lite models compiled specifically for TPU
Use edgetpu_compiler to convert .tflite models
Maximum 8MB model size for on-device caching

Common Installation Issues:

Error	Solution
`ImportError: libgomp.so.1`	Install: `sudo apt install libgomp1`
`CUDA driver version mismatch`	Match CUDA Toolkit version to driver: `nvidia-smi`
ONNX Runtime AVX2 error	Use `onnxruntime` (not `onnxruntime-cpu`) or compile from source
Coral TPU not detected	Check USB connection, install `libedgetpu1-std`
Out of memory on Pi	Use INT8/INT4 quantization, reduce batch size
`llama.cpp` compilation fails	Install `cmake` and `build-essential` first

Edge Device Selector

Compare edge devices for tiny LLM deployment

Min RAM: 2 GB

Max Power: 50W

Max Cost: $600

Requires GPU/NPU

Raspberry Pi 5

CPU:ARM Cortex-A76

RAM:8 GB

GPU/NPU:✗ No

Power:25W

Cost:$80

Jetson Orin Nano

CPU:ARM A78 + 1024 CUDA

RAM:8 GB

GPU/NPU:✓ Yes

Power:15W

Cost:$500

Google Coral

CPU:ARM + Edge TPU

RAM:1 GB

GPU/NPU:✓ Yes

Power:2W

Cost:$60

💡 For tiny LLMs under 1B parameters, Raspberry Pi 5 with INT4 quantization works well. For 1-3B models, Jetson Orin provides the best performance/watt.

Match your power and cost budget to the right hardware

Comparison Matrix

Device	CPU	RAM	NPU/GPU	Price	Power	Best For
Raspberry Pi 4	ARM Cortex-A72 4-core	4/8 GB	None	$55-75	5W	General purpose, prototyping
Jetson Nano	ARM A57 4-core	4 GB	128-core Maxwell GPU	$99	10W	GPU acceleration needed
Coral Dev Board	ARM A53 4-core	1 GB	Edge TPU	$150	2-3W	Extreme efficiency
Orange Pi 5	ARM A76/A55 8-core	8 GB	Mali G610 GPU	$80	10W	Good CPU, budget GPU
Intel NUC N100	x86 4-core	8 GB	Intel UHD	$180	6W	x86 compatibility

Recommendation by Use Case

Privacy-focused chatbot (offline, home use):

Choice: Raspberry Pi 4 (8GB)
Why: Affordable, good community, sufficient RAM
Expected: 15-25 tok/s with optimized INT8 model

IoT sensor analysis (battery-powered):

Choice: Coral Dev Board
Why: Lowest power, hardware acceleration
Expected: 40+ tok/s with INT8, < 2W power

Industrial automation (reliable, 24/7):

Choice: Jetson Nano
Why: GPU acceleration, robust, enterprise support
Expected: 35-50 tok/s with GPU offload

For your cost-latency tradeoff, this means: the $75 Raspberry Pi gives you 80% of the performance of a$ 150 Jetson Nano for most use cases. Only upgrade when you hit a concrete performance wall—not in anticipation of one.

Mobile robotics:

Choice: Orange Pi 5
Why: Best CPU performance per dollar
Expected: 25-35 tok/s

For your project, this means: start with a Raspberry Pi 4 (8GB) for prototyping—it's $75 and good enough to validate your use case. Only move to specialized hardware after you've proven the concept works.

Quantize, compile, then profile—in that order

Step 1: Quantization

Goal: Reduce 1.1B FP16 model (2.2GB) to INT8 (550MB) or INT4 (275MB).

# Quantize with ONNX Runtime
import onnxruntime as ort
from onnxruntime.quantization import quantize_dynamic, QuantType
 
def quantize_model_for_edge(
    model_path="model.onnx",
    output_path="model_int8.onnx",
    weight_type=QuantType.QUInt8
):
    """
    Quantize ONNX model for edge deployment.
    
    Args:
        model_path: Path to FP32/FP16 ONNX model
        output_path: Output path for quantized model
        weight_type: QUInt8 (unsigned) or QInt8 (signed)
    """
    quantize_dynamic(
        model_path,
        output_path,
        weight_type=weight_type,
        optimize_model=True,  # Apply graph optimizations
    )
    
    print(f"Quantized model saved to {output_path}")
    
    # Compare sizes
    import os
    original_size = os.path.getsize(model_path) / 1e9
    quantized_size = os.path.getsize(output_path) / 1e9
    
    print(f"Original: {original_size:.2f} GB")
    print(f"Quantized: {quantized_size:.2f} GB")
    print(f"Reduction: {original_size / quantized_size:.2f}×")
 
# Usage
quantize_model_for_edge("tinyllama.onnx", "tinyllama_int8.onnx")
# Original: 2.20 GB
# Quantized: 0.55 GB
# Reduction: 4.00×

For your memory-constrained deployment, this means: always quantize before deploying to edge. A 4× size reduction is the difference between "fits in RAM" and "runs out of memory and crashes."

Step 2: Graph Optimization

Apply operator fusion, constant folding, dead code elimination:

import onnx
from onnxruntime.transformers import optimizer
 
def optimize_onnx_graph(input_path, output_path, model_type="gpt2"):
    """
    Optimize ONNX graph for inference.
    
    Optimizations:
    - Fuse multi-head attention
    - Fuse layer normalization
    - Constant folding
    - Remove identity operations
    """
    from onnxruntime.transformers.optimizer import optimize_model
    
    optimized_model = optimize_model(
        input_path,
        model_type=model_type,
        num_heads=32,  # TinyLlama config
        hidden_size=2048,
        optimization_options=None  # Use default optimizations
    )
    
    optimized_model.save_model_to_file(output_path)
    print(f"Optimized model saved to {output_path}")
 
optimize_onnx_graph("tinyllama_int8.onnx", "tinyllama_optimized.onnx")

Step 3: Compilation for Target Hardware

ARM NEON optimization (Raspberry Pi):

# Install llama.cpp (optimized for ARM)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
 
# Compile with ARM NEON
make LLAMA_NO_ACCELERATE=1
 
# Convert model to GGUF format (optimized binary format)
python convert.py ../tinyllama.pth --outfile tinyllama.gguf --outtype q8_0
 
# Result: 550 MB INT8 model with NEON-optimized kernels

Raspberry Pi 4: llama.cpp with INT4 hits 28 tok/s

Complete Setup

1. System preparation:

# Update system
sudo apt update && sudo apt upgrade -y
 
# Install dependencies
sudo apt install -y python3-pip cmake build-essential
 
# Install optimized BLAS for ARM
sudo apt install -y libopenblas-dev

2. Install llama.cpp (fastest option for Pi):

# Clone and compile
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j4  # Use all 4 cores
 
# Download or convert model
./convert.py /path/to/tinyllama --outfile models/tinyllama-q8.gguf --outtype q8_0

3. Test inference:

# Run inference
./main -m models/tinyllama-q8.gguf \
       -p "The capital of France is" \
       -n 50 \
       -t 4  # Use 4 threads
 
# Monitor performance
./main -m models/tinyllama-q8.gguf -p "Hello" -n 128 --timing
# Output:
# llama_print_timings:        load time =  1245.32 ms
# llama_print_timings:      sample time =    12.45 ms
# llama_print_timings: prompt eval time =   156.78 ms
# llama_print_timings:        eval time =  4567.89 ms /   128 tokens ( 35.69 ms per token)
# llama_print_timings:       total time =  6023.45 ms
 
# ~28 tokens/second

4. Python API wrapper:

from llama_cpp import Llama
 
# Load model
llm = Llama(
    model_path="models/tinyllama-q8.gguf",
    n_ctx=2048,  # Context window
    n_threads=4,  # Use all cores
    n_batch=512,  # Batch size
    use_mlock=True,  # Lock model in RAM (faster, no swapping)
)
 
# Generate
output = llm(
    "The capital of France is",
    max_tokens=50,
    temperature=0.7,
    stop=["\n"]
)
 
print(output['choices'][0]['text'])

Optimization Checklist

✅ System tuning:

# Disable swap (prevent slowdowns)
sudo dphys-swapfile swapoff
sudo systemctl disable dphys-swapfile
 
# CPU governor to performance mode
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
 
# Increase GPU memory allocation (if using for display)
# In /boot/config.txt: gpu_mem=16 (minimum)

✅ Model tuning:

Use INT8 quantization (q8_0 in GGUF)
Reduce context window if not needed (n_ctx=512 vs 2048)
Lower batch size on memory constraints
Enable use_mlock to avoid swapping

Benchmarks: Raspberry Pi 4

Configuration	Tokens/sec	First Token Latency	Memory
FP16 naive	OOM	-	-
INT8 ONNX	8	800ms	3.2 GB
INT8 llama.cpp	28	250ms	1.8 GB
INT4 llama.cpp	42	180ms	1.1 GB

Jetson Nano: TensorRT pushes 85 tok/s with 128 CUDA cores

GPU Acceleration

Setup:

# Install JetPack SDK (includes CUDA, cuDNN)
sudo apt install nvidia-jetpack
 
# Install TensorRT for optimized inference
sudo apt install tensorrt

Convert model to TensorRT:

import tensorrt as trt
import onnx
 
def convert_to_tensorrt(onnx_path, engine_path, precision="fp16"):
    """
    Convert ONNX model to TensorRT engine.
    
    Args:
        onnx_path: Path to ONNX model
        engine_path: Output path for TensorRT engine
        precision: "fp32", "fp16", or "int8"
    """
    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, logger)
    
    # Parse ONNX
    with open(onnx_path, 'rb') as f:
        if not parser.parse(f.read()):
            print("ERROR: Failed to parse ONNX model")
            return
    
    # Build config
    config = builder.create_builder_config()
    config.max_workspace_size = 1 << 30  # 1GB
    
    if precision == "fp16":
        config.set_flag(trt.BuilderFlag.FP16)
    elif precision == "int8":
        config.set_flag(trt.BuilderFlag.INT8)
    
    # Build engine
    engine = builder.build_engine(network, config)
    
    # Save
    with open(engine_path, 'wb') as f:
        f.write(engine.serialize())
    
    print(f"TensorRT engine saved to {engine_path}")
 
convert_to_tensorrt("tinyllama.onnx", "tinyllama.trt", precision="fp16")

Inference with TensorRT:

import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
 
class TensorRTInference:
    def __init__(self, engine_path):
        self.logger = trt.Logger(trt.Logger.WARNING)
        with open(engine_path, 'rb') as f:
            self.engine = trt.Runtime(self.logger).deserialize_cuda_engine(f.read())
        self.context = self.engine.create_execution_context()
    
    def infer(self, input_ids):
        # Allocate device memory
        d_input = cuda.mem_alloc(input_ids.nbytes)
        d_output = cuda.mem_alloc(output_shape.nbytes)
        
        # Copy input to device
        cuda.memcpy_htod(d_input, input_ids)
        
        # Run inference
        self.context.execute_v2([int(d_input), int(d_output)])
        
        # Copy output to host
        output = np.empty(output_shape, dtype=np.float32)
        cuda.memcpy_dtoh(output, d_output)
        
        return output
 
# Usage
trt_model = TensorRTInference("tinyllama.trt")
output = trt_model.infer(input_ids)

Benchmarks: Jetson Nano

Configuration	Tokens/sec	Power	Memory
CPU INT8	12	5W	2.1 GB
GPU FP16 TensorRT	48	10W	2.8 GB
GPU INT8 TensorRT	76	8W	1.5 GB

Coral TPU: 4 TOPS for INT8 operations at 2W

TPU Compilation

Convert model for Edge TPU:

# Install Edge TPU compiler
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
echo "deb https://packages.cloud.google.com/apt coral-edgetpu-stable main" | sudo tee /etc/apt/sources.list.d/coral-edgetpu.list
sudo apt update
sudo apt install edgetpu-compiler
 
# Compile TFLite model for TPU
edgetpu_compiler tinyllama_int8.tflite
 
# Output: tinyllama_int8_edgetpu.tflite

Inference on Coral:

from pycoral.utils import edgetpu
from pycoral.adapters import common
import numpy as np
 
class CoralInference:
    def __init__(self, model_path):
        self.interpreter = edgetpu.make_interpreter(model_path)
        self.interpreter.allocate_tensors()
    
    def infer(self, input_data):
        # Set input
        common.set_input(self.interpreter, input_data)
        
        # Run inference
        self.interpreter.invoke()
        
        # Get output
        return common.output_tensor(self.interpreter, 0)
 
# Usage
coral = CoralInference("tinyllama_int8_edgetpu.tflite")
output = coral.infer(input_ids)

Benchmarks: Coral Dev Board

Metric	Value
Tokens/sec	85 (with quantization-friendly model)
First token latency	95ms
Power consumption	2.3W
Memory usage	850 MB

Docker containers enable monitoring and OTA updates

Docker Container

Dockerfile for Raspberry Pi:

FROM balenalib/raspberry-pi-debian:latest
 
# Install dependencies
RUN apt-get update && apt-get install -y \
    python3-pip \
    cmake \
    build-essential \
    libopenblas-dev \
    && rm -rf /var/lib/apt/lists/*
 
# Copy llama.cpp
WORKDIR /app
COPY llama.cpp /app/llama.cpp
WORKDIR /app/llama.cpp
RUN make -j4
 
# Copy model
COPY models/tinyllama-q8.gguf /app/models/
 
# Python API server
COPY server.py /app/
RUN pip3 install flask llama-cpp-python
 
# Expose port
EXPOSE 8000
 
# Run server
CMD ["python3", "/app/server.py"]

Flask API server:

# server.py
from flask import Flask, request, jsonify
from llama_cpp import Llama
 
app = Flask(__name__)
 
# Load model once at startup
llm = Llama(
    model_path="/app/models/tinyllama-q8.gguf",
    n_ctx=2048,
    n_threads=4,
    use_mlock=True
)
 
@app.route('/generate', methods=['POST'])
def generate():
    data = request.json
    prompt = data.get('prompt', '')
    max_tokens = data.get('max_tokens', 100)
    
    try:
        output = llm(prompt, max_tokens=max_tokens)
        return jsonify({
            'success': True,
            'text': output['choices'][0]['text']
        })
    except Exception as e:
        return jsonify({'success': False, 'error': str(e)}), 500
 
@app.route('/health', methods=['GET'])
def health():
    return jsonify({'status': 'healthy'})
 
if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8000)

Deploy:

# Build for ARM
docker buildx build --platform linux/arm64 -t tinyllm-pi:latest .
 
# Run
docker run -d -p 8000:8000 --name tinyllm tinyllm-pi:latest
 
# Test
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello, world!", "max_tokens": 50}'

Monitoring and Logging

import psutil
import logging
from prometheus_client import Counter, Histogram, Gauge, start_http_server
 
# Metrics
REQUEST_COUNT = Counter('tinyllm_requests_total', 'Total requests')
REQUEST_LATENCY = Histogram('tinyllm_request_latency_seconds', 'Request latency')
CPU_USAGE = Gauge('tinyllm_cpu_percent', 'CPU usage')
MEMORY_USAGE = Gauge('tinyllm_memory_mb', 'Memory usage MB')
 
# Start metrics server
start_http_server(9090)
 
# Monitor system
def monitor_system():
    while True:
        CPU_USAGE.set(psutil.cpu_percent())
        MEMORY_USAGE.set(psutil.virtual_memory().used / 1e6)
        time.sleep(5)
 
# In Flask app
@app.route('/generate', methods=['POST'])
def generate():
    REQUEST_COUNT.inc()
    
    start_time = time.time()
    # ... inference ...
    REQUEST_LATENCY.observe(time.time() - start_time)

These patterns prevent edge deployment failures

Deployment Optimizer

Combine optimization techniques for your deployment target

Target Device

Priority

Selected: 2

Click techniques below to toggle

85.0%

Memory Reduction

9.00x

Speedup

99.0%

Quality Retained

Recommended Stack for GPU

INT8/INT4 quantization + Flash Attention + Continuous Batching

💡 Techniques can be combined, but effects don't stack linearly. Test your specific combination for actual performance gains.

Hardware Selection

RAM is critical: 4GB minimum for 1B models, 8GB recommended
CPU cores matter: 4+ cores significantly better
Storage: SSD preferred (faster model loading)
Cooling: Heatsinks required for sustained inference

Optimization Pipeline

Start with INT8, verify quality acceptable
Try INT4 if memory/speed critical
Compile for target (llama.cpp for ARM, TensorRT for NVIDIA)
Benchmark thoroughly before deployment
Monitor in production (CPU, memory, latency)

Common Pitfalls

❌ Using FP16/FP32 on edge → OOM or very slow ✅ Always quantize to INT8 minimum

❌ Generic inference runtime → 3-10× slower ✅ Use hardware-optimized runtime (llama.cpp for Pi, TensorRT for Jetson)

❌ No monitoring → Silent failures ✅ Prometheus + Grafana for production

For your production stability, this means: edge devices fail silently. Without monitoring, you'll learn about problems from angry users, not dashboards. Add health checks, log inference times, and alert on memory pressure before you deploy.

Start with llama.cpp INT4, upgrade hardware only if needed

Expected Performance

Raspberry Pi 4 (8GB):

Model: TinyLlama 1.1B INT8
Performance: 25-30 tok/s
Memory: 1.8 GB
Power: 5W
Cost: $75

Jetson Nano:

Model: TinyLlama 1.1B INT8 (TensorRT)
Performance: 70-80 tok/s
Memory: 1.5 GB
Power: 8W
Cost: $99

Google Coral:

Model: Optimized tiny model (quantization-friendly architecture)
Performance: 80-90 tok/s
Memory: 850 MB
Power: 2.3W
Cost: $150

Next Steps

🏭 Production Case Studies →

See these deployment patterns in action with real-world architectural blueprints and published benchmarks.

Before you deploy to edge devices:

Always quantize to INT8 minimum. FP16 on Raspberry Pi is 10× slower and wastes precious RAM.
Use hardware-optimized runtimes. llama.cpp for ARM, TensorRT for Jetson—generic PyTorch is 3-10× slower.
Measure thermal throttling. Sustained inference without heatsinks drops performance 40% after 5 minutes.
Profile memory before deployment. 1.1B model + KV cache + OS overhead = 2.5GB minimum—know your headroom.
Set up monitoring from day one. Prometheus + Grafana catches silent failures before users report them.

A $75 Raspberry Pi can now do what required a data center five years ago. The edge isn't the future—it's here, running inference at 28 tokens per second on your desk.

Sources and References

Edge Inference Runtimes

llama.cpp. Georgi Gerganov. CPU-optimized inference with INT4/INT8 support.
ONNX Runtime. Microsoft. Cross-platform inference optimization.
TensorFlow Lite. Google. Mobile and edge deployment.
NVIDIA TensorRT. GPU-optimized inference for Jetson.

Quantization for Edge

Frantar, E., et al. (2022). GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers.
Lin, J., et al. (2024). AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration.
Dettmers, T., et al. (2022). LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale.

Hardware Platforms

Raspberry Pi Documentation. Official Pi guides.
NVIDIA Jetson Developer Guide. Jetson Nano setup and optimization.
Google Coral Documentation. Edge TPU deployment guides.

ARM Optimization

ARM NEON Intrinsics Reference. ARM. SIMD optimization for ARM CPUs.
OpenBLAS. Optimized linear algebra for ARM.

Edge AI Research

Liu, Z., et al. (2024). MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases. Apple.
Xu, M., et al. (2023). EdgeAI: A Survey on Edge Computing for Deep Learning.

Container and Deployment

Docker for ARM. Multi-architecture container builds.
Prometheus. Monitoring and alerting.

Industry Benchmarks & Hardware Specifications (as of January 2025)

MLCommons MLPerf Inference (Edge): Edge Inference Benchmarks. Industry-standard edge inference benchmarks; includes tiny LLM-class model results.
ARM Ethos-U Performance: Ethos-U NPU Documentation. Official specifications for ARM's edge ML accelerators.
NVIDIA Jetson Benchmarks: Jetson AI Performance Benchmarks. Official inference benchmarks for Jetson Nano/Orin.
Raspberry Pi 4 Specifications: Pi 4 Datasheet. Official ARM Cortex-A72 specifications and memory bandwidth.

Hardware Pricing (as of January 2025)

Device	Typical Price	Source
Raspberry Pi 4 (8GB)	$75	raspberrypi.com
NVIDIA Jetson Nano	$99-149	nvidia.com
Google Coral Dev Board	$150	coral.ai
Jetson Orin Nano	$199	nvidia.com

2 tok/s to 28 tok/s. Same hardware, same model. The difference is knowing where the bottlenecks hide.

On This Page