Deployment & Infrastructure

Edge and On-Device LLM Inference

Running LLMs on phones, laptops, and IoT devices — model selection, optimization frameworks, and practical deployment guides for edge computing

Published: 2026-04-13 · Last updated: 2026-04-13

Edge and On-Device LLM Inference

Running LLMs locally on edge devices — phones, laptops, Raspberry Pi, and embedded systems — offers complete privacy, zero latency, and no API costs. But it requires aggressive optimization.

Hardware Landscape

DeviceRAMNeural AcceleratorMax Practical Model
iPhone 15/168GBNeural Engine (16-core)3B (MLX, CoreML)
Android (flagship)12GBSnapdragon NPU7B (MLC, QNN)
MacBook M316-36GBNeural Engine (16-core)70B (MLX, llama.cpp)
Windows laptop16GBNPU (Intel/AMD)7B (ONNX)
Raspberry Pi 58GBNone (CPU only)1B (llama.cpp)
NVIDIA Jetson Orin16-64GBGPU (2048-core)13B (TensorRT)

Inference Frameworks

llama.cpp (Universal)

The most widely used edge inference engine:

# Convert model to GGUF
python convert-hf-to-gguf.py meta-llama/Llama-3.2-1B \
    --outfile llama-3.2-1b.Q4_K_M.gguf \
    --outtype q4_k_m

# Run on CPU
./llama-cli \
    --model llama-3.2-1b.Q4_K_M.gguf \
    --prompt "Hello, world!" \
    --threads 8 \
    --n-predict 200

# Run on Mac with Metal acceleration
./llama-cli \
    --model llama-3.2-1b.Q4_K_M.gguf \
    --prompt "Hello!" \
    --n-gpu-layers 999  # Offload all layers to GPU

MLX (Apple Silicon)

Apple's framework optimized for M-series chips:

import mlx.core as mx
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")

prompt = "Explain quantum computing simply."
response = generate(model, tokenizer, prompt=prompt, max_tokens=200, temp=0.7)
print(response)

Why MLX? Unified memory architecture of Apple Silicon means CPU and GPU share memory — no data transfer overhead.

MLC LLM (Cross-Platform)

Compiles models to run on any GPU (mobile, desktop, web):

from mlc_llm import MLCEngine

engine = MLCEngine(model="Llama-3.2-3B-Instruct-q4f16_1")
response = engine.chat.completions.create(
    messages=[{"role": "user", "content": "Hello!"}]
)

Platforms: iOS, Android, macOS, Linux, Windows, WebGPU (browser).

ONNX Runtime (Windows/Android)

import onnxruntime_genai as og

model = og.Model("Llama-3.2-3B-Instruct-onnx")
tokenizer = og.Tokenizer(model)

inputs = tokenizer.encode("Hello, world!")
outputs = model.generate(inputs, max_length=200)
print(tokenizer.decode(outputs[0]))

Tiny Models (< 1B parameters)

ModelSize (Q4)Speed (CPU)QualityBest For
Qwen2.5-0.5B300MB50 tok/sBasicIoT, microcontrollers
Llama-3.2-1B600MB30 tok/sGood for sizeMobile, embedded
Phi-3.5-mini (3.8B)2.2GB15 tok/sSurprisingly goodLaptops, phones

Small Models (1-7B parameters)

ModelSize (Q4)Speed (M3 Mac)QualityBest For
Llama-3.2-3B2GB40 tok/sVery goodGeneral mobile use
Gemma-2-2B1.5GB50 tok/sGoodLightweight assistant
Mistral-7B4GB20 tok/sExcellentLaptops, development
Qwen2.5-7B4.5GB18 tok/sStrong multilingualMultilingual edge

On-Device RAG

Combine local LLM with local knowledge base:

# On-device RAG pipeline
from sentence_transformers import SentenceTransformer
import faiss

# Embed documents on device (runs once)
embedder = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")
doc_embeddings = embedder.encode(local_documents)

# Build local index
index = faiss.IndexFlatIP(doc_embeddings.shape[1])
index.add(doc_embeddings.astype("float32"))

# Query
def local_rag(query: str, k: int = 3) -> str:
    query_embed = embedder.encode([query])
    _, indices = index.search(query_embed.astype("float32"), k)
    
    context = "\n".join([local_documents[i] for i in indices[0]])
    prompt = f"Based on the following context:\n\n{context}\n\nAnswer: {query}"
    
    return generate(model, tokenizer, prompt=prompt, max_tokens=200)

Performance Optimization for Edge

1. Model Selection Criteria

def select_edge_model(
    device_ram: int,         # GB
    target_speed: int,       # tokens/sec
    quality_requirement: str,  # "low", "medium", "high"
) -> str:
    """Pick the best model for device constraints."""
    candidates = {
        "high": [("Mistral-7B", 4), ("Qwen2.5-7B", 4.5)],
        "medium": [("Phi-3.5-mini", 2.2), ("Llama-3.2-3B", 2)],
        "low": [("Llama-3.2-1B", 0.6), ("Qwen2.5-0.5B", 0.3)],
    }
    
    for model_name, size_gb in candidates[quality_requirement]:
        if size_gb < device_ram * 0.6:  # Leave 40% for OS + KV cache
            return model_name
    
    return "Qwen2.5-0.5B"  # Fallback to smallest

2. Memory Management

# Pre-allocate KV cache to avoid allocation during generation
# llama.cpp does this automatically, but custom implementations need:

class EdgeKVCache:
    def __init__(self, max_tokens=4096, num_layers=32, num_heads=32, head_dim=128):
        # Pre-allocate: max_tokens × layers × heads × head_dim × 2 bytes (FP16)
        memory = max_tokens * num_layers * num_heads * head_dim * 2 / (1024**3)
        print(f"Pre-allocating {memory:.1f} GB for KV cache")
        
        self.k_cache = torch.zeros(max_tokens, num_layers, num_heads, head_dim, 
                                   dtype=torch.float16)
        self.v_cache = torch.zeros(max_tokens, num_layers, num_heads, head_dim, 
                                   dtype=torch.float16)

3. Battery Optimization

# Throttle generation speed to save battery
def battery_aware_generation(prompt: str, battery_pct: float) -> str:
    if battery_pct < 20:
        # Low battery: use smallest model, fewer tokens
        model = load_smallest_model()
        return generate(model, prompt, max_tokens=100, temp=0.1)
    elif battery_pct < 50:
        # Medium: use standard settings
        model = load_standard_model()
        return generate(model, prompt, max_tokens=300)
    else:
        # Full battery: use best quality
        model = load_best_model()
        return generate(model, prompt, max_tokens=500, temp=0.7)

Privacy Benefits of Edge Inference

Cloud API:                    On-Device:
User → Prompt → Internet → Server → Response → User    User → Prompt → Local Model → Response
              ↑                                               
         Data leaves device                            No data leaves device

Compliance: On-device inference naturally satisfies GDPR, HIPAA, and other data privacy regulations since data never leaves the user's device.

Key Takeaways

  • llama.cpp is the universal choice for CPU/edge inference
  • MLX is best for Apple Silicon; MLC LLM for cross-platform GPU
  • 1-3B models at Q4 quantization run well on modern phones
  • On-device RAG enables private, offline knowledge systems
  • Edge inference eliminates API costs and data privacy concerns
  • Battery and memory management are critical on mobile devices

Related docs