Inference Optimization and Quantization

Comprehensive guide to running LLMs efficiently — quantization methods, memory management, batching strategies, and throughput optimization

Published: 2026-04-11 · Last updated: 2026-04-13

Running LLMs efficiently requires understanding the interplay between model size, precision, memory, and compute. This guide covers quantization methods, memory optimization techniques, batching strategies, and benchmarking.

The Memory Budget

For a model with P parameters:

Component	Memory (FP16)	Memory (INT4)
Model weights	2 × P bytes	0.5 × P bytes
KV cache (per token)	2 × layers × d_model × 2 bytes	Same
Activations	Varies by batch size	Same

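Example for a 70B model with 4K context (assuming 80 layers, d_model = 8192, and an FP16 KV cache without grouped-query attention):

Component	FP16	INT4
Weights	140 GB	35 GB
KV cache (4K tokens)	~10.7 GB	~10.7 GB
Activations (batch=1)	8 GB	8 GB
Total	~159 GB	~54 GB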

This is why INT4 quantization enables running 70B models on a single 48GB GPU (with some offloading).
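
The formulas in the first table translate directly into a small calculator. The sketch below applies them to a hypothetical 7B configuration (32 layers, d_model = 4096); those values and the function names are assumptions chosen for illustration, not taken from a specific model card.

```python
GB = 1e9  # decimal gigabytes, matching the tables above


def weight_bytes(num_params: float, bits_per_weight: float) -> float:
    """Memory for the weights alone at a given precision."""
    return num_params * bits_per_weight / 8


def kv_cache_bytes(num_layers: int, d_model: int, num_tokens: int,
                   bytes_per_value: int = 2, batch_size: int = 1) -> int:
    """K and V tensors for every layer and token: 2 x layers x d_model x bytes."""
    per_token = 2 * num_layers * d_model * bytes_per_value
    return per_token * num_tokens * batch_size


# Hypothetical 7B configuration (illustrative values only)
params, layers, d_model, context = 7e9, 32, 4096, 4096

fp16_weights = weight_bytes(params, 16)
int4_weights = weight_bytes(params, 4)
kv = kv_cache_bytes(layers, d_model, context)

print(f"FP16 weights: {fp16_weights / GB:.1f} GB")
print(f"INT4 weights: {int4_weights / GB:.1f} GB")
print(f"KV cache ({context} tokens): {kv / GB:.2f} GB")
```

Note that the KV cache term scales with both context length and batch size, so it is worth recomputing whenever either changes.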

Quantization Methods

Post-Training Quantization (PTQ)

Quantize an already-trained model without any further training. GPTQ is a widely used PTQ method:

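# GPTQ: Quantize weights to 4-bit with minimal accuracy loss
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-3.2-3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Calibration data: 128-512 representative samples, tokenized.
# calibration_texts is a placeholder; supply your own text samples.
calibration_texts = ["Replace with 128-512 representative text samples."]
calibration_dataset = [tokenizer(text) for text in calibration_texts]

model = AutoGPTQForCausalLM.from_pretrained(
    model_id,
    quantize_config=BaseQuantizeConfig(bits=4, group_size=128),
)
model.quantize(calibration_dataset)
model.save_quantized("./llama-3-gptq-4bit")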
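
To make the `bits=4, group_size=128` settings concrete, here is a minimal sketch of group-wise 4-bit quantization using plain round-to-nearest. It is not the GPTQ algorithm, which additionally adjusts the remaining weights to compensate for each rounding error; the helper names are made up for this illustration.

```python
import random


def quantize_group(values, bits=4):
    """Asymmetric round-to-nearest: map floats to integer codes in [0, 2**bits - 1]."""
    qmax = 2 ** bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax or 1.0  # avoid a zero scale for constant groups
    codes = [min(qmax, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Recover approximate floats from the integer codes."""
    return [c * scale + lo for c in codes]


def quantize_groupwise(weights, bits=4, group_size=128):
    """Quantize a flat weight list in independent groups of `group_size` values."""
    return [quantize_group(weights[i:i + group_size], bits)
            for i in range(0, len(weights), group_size)]


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(512)]
groups = quantize_groupwise(weights)

restored = [w for codes, scale, lo in groups
            for w in dequantize_group(codes, scale, lo)]
max_error = max(abs(a - b) for a, b in zip(weights, restored))
largest_step = max(scale for _, scale, _ in groups)
print(f"groups: {len(groups)}, max error: {max_error:.5f}, step: {largest_step:.5f}")
```

Each group stores its own scale and offset, so a smaller `group_size` tracks the local weight range more closely at the cost of extra metadata per group.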

AWQ (Activation-Aware Weight Quantization)

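Protect the most important weights by rescaling them before quantization:

# AWQ uses activation statistics to find "salient" weight channels (those that
# contribute most to outputs) and scales them up before 4-bit quantization,
# which shrinks their relative rounding error. All weights are stored in 4-bit.

# Usage via AutoAWQ:
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-3.2-3B"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Calibrates on AutoAWQ's default dataset; pass calib_data= to use your own samples
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized("./llama-3-awq-4bit")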
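
The effect of activation-aware scaling can be seen on a single dot product. In this sketch the numbers are hand-picked and the scale factor `s` is fixed by hand, whereas AWQ itself derives the scales from calibration activations; the function names are illustrative assumptions.

```python
def quantize_symmetric(weights, bits=4):
    """Round-to-nearest with one shared step size for the whole group."""
    qmax = 2 ** (bits - 1) - 1  # 7 for signed 4-bit codes
    step = max(abs(w) for w in weights) / qmax
    return [round(w / step) * step for w in weights]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


weights = [7.0, 1.3, -2.4, 0.6]
activations = [0.1, 10.0, 0.1, 0.1]  # channel 1 sees large activations: "salient"
exact = dot(weights, activations)

# Plain quantization: the rounding error on channel 1 is amplified by its activation
plain_error = abs(dot(quantize_symmetric(weights), activations) - exact)

# AWQ-style: scale the salient weight up and its activation down by the same factor.
# The product is unchanged, but the weight's rounding error is divided by s.
s = 4.0
scaled_w = [w * s if i == 1 else w for i, w in enumerate(weights)]
scaled_x = [x / s if i == 1 else x for i, x in enumerate(activations)]
scaled_error = abs(dot(quantize_symmetric(scaled_w), scaled_x) - exact)

print(f"error without scaling: {plain_error:.2f}")
print(f"error with scaling:    {scaled_error:.2f}")
```

The trick works here because scaling the salient weight does not change the group's largest magnitude, so the step size stays the same while the salient weight lands on a relatively finer grid.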

GGUF (llama.cpp format)

The standard for CPU and edge inference:

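# Convert a local Hugging Face checkpoint directory to an FP16 GGUF file
# (the script is named convert-hf-to-gguf.py in older llama.cpp checkouts)
python convert_hf_to_gguf.py ./Llama-3.2-3B \
    --outfile llama-3-3b.f16.gguf \
    --outtype f16

# Quantize to Q4_K_M (4-bit k-quant, medium variant)
./llama-quantize llama-3-3b.f16.gguf llama-3-3b.Q4_K_M.gguf Q4_K_M

# Run with llama.cpp
./llama-cli \
    --model llama-3-3b.Q4_K_M.gguf \
    --prompt "Hello, world!" \
    --n-gpu-layers 35  # Offload up to 35 layers to GPU (any remaining layers run on CPU)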

GGUF Quantization	Size (7B)	Size (70B)	Quality
Q2_K	2.6 GB	26 GB	Noticeable degradation
Q3_K_M	3.3 GB	33 GB	Good for most tasks
Q4_K_M	4.2 GB	42 GB	Near-FP16 quality
Q5_K_M	4.9 GB	49 GB	Very close to FP16
Q6_K	5.7 GB	57 GB	Essentially FP16
Q8_0	7.4 GB	74 GB	Practically indistinguishable from FP16
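
A table like this is usually read against a memory budget. The sketch below picks the largest quantization that fits, using the 70B sizes from the table above, and derives the effective bits per weight implied by a file size; the function names and the headroom value are assumptions for illustration.

```python
# Approximate file sizes in GB for a 70B model, copied from the table above
SIZES_70B_GB = {
    "Q2_K": 26, "Q3_K_M": 33, "Q4_K_M": 42,
    "Q5_K_M": 49, "Q6_K": 57, "Q8_0": 74,
}


def largest_quant_that_fits(sizes_gb, budget_gb, headroom_gb=0.0):
    """Return the biggest quantization whose file fits the budget, or None."""
    fitting = {name: size for name, size in sizes_gb.items()
               if size + headroom_gb <= budget_gb}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


def bits_per_weight(size_gb, num_params):
    """Effective bits per weight implied by a file size."""
    return size_gb * 1e9 * 8 / num_params


choice = largest_quant_that_fits(SIZES_70B_GB, budget_gb=48, headroom_gb=4)
print(choice)
print(f"{bits_per_weight(42, 70e9):.1f} bits per weight for Q4_K_M")
```

The effective figure for a "4-bit" file comes out above 4 bits because the format also stores per-block scales and mixes bit widths across tensors. The headroom leaves space for the KV cache and activations.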

Memory Optimization Techniques

1. KV Cache Management

import torch

# Sliding window KV cache (only keep recent context in GPU memory)
class SlidingWindowKVCache:
    """Assumes each append adds one token: tensors of shape (batch, 1, d_model)."""

    def __init__(self, window_size=4096):
        self.window_size = window_size
        self.k_cache = []
        self.v_cache = []

    def append(self, k_new, v_new):
        self.k_cache.append(k_new)
        self.v_cache.append(v_new)

        # Evict the oldest entries once the window is full
        if len(self.k_cache) > self.window_size:
            self.k_cache = self.k_cache[-self.window_size:]
            self.v_cache = self.v_cache[-self.window_size:]

    def get(self):
        # Concatenate along the sequence dimension (dim=1)
        return torch.cat(self.k_cache, dim=1), torch.cat(self.v_cache, dim=1)
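
The eviction behaviour can be checked without a GPU by using a stand-in that stores plain values instead of tensors. This is a sketch with a hypothetical `SlidingWindow` class, using `collections.deque` to drop the oldest entry automatically.

```python
from collections import deque


class SlidingWindow:
    """Stand-in for the cache above: keeps only the most recent `window_size` entries."""

    def __init__(self, window_size):
        self.entries = deque(maxlen=window_size)  # deque discards the oldest item itself

    def append(self, entry):
        self.entries.append(entry)

    def get(self):
        return list(self.entries)


cache = SlidingWindow(window_size=4)
for position in range(6):  # pretend each position is one token's K/V pair
    cache.append(position)

print(cache.get())  # [2, 3, 4, 5]
```

With a fixed window, KV cache memory is bounded by the window size times the per-token cost, regardless of how long generation runs. The trade-off is that tokens outside the window are no longer visible to attention.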

2. Offloading

Move less-used parts of the model to CPU/disk:

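from accelerate import init_empty_weights, infer_auto_device_map, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM

checkpoint = "path/to/checkpoint"  # placeholder: local directory holding the model weights

# Build the model skeleton without allocating memory for its weights
config = AutoConfig.from_pretrained(checkpoint)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# Offload some layers to CPU
device_map = infer_auto_device_map(
    model,
    max_memory={0: "20GB", "cpu": "64GB"},  # GPU 0: 20GB, CPU: 64GB
)

model = load_checkpoint_and_dispatch(
    model,
    checkpoint=checkpoint,
    device_map=device_map,
    offload_folder="offload",  # Disk offload path
)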

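Trade-off: offloaded layers must be copied to the GPU on every forward pass, so CPU offloading is often 10-50× slower than keeping the whole model on the GPU, but it enables running models that don't fit in GPU memory.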

3. Tensor Parallelism

Split model across multiple GPUs:

# vLLM tensor parallelism
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,  # Split across 4 GPUs
    gpu_memory_utilization=0.9,  # Use 90% of GPU memory
)
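
The idea behind splitting a layer across GPUs can be shown with a column-parallel linear layer: each device holds a slice of the weight matrix's columns, computes its slice of the output, and the slices are concatenated. This is a plain-Python sketch with simulated devices and made-up helper names; it illustrates the principle only and says nothing about how vLLM implements it internally.

```python
def matmul(a, b):
    """Plain-Python matrix multiply for lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(matrix, parts):
    """Split a matrix into `parts` equal column blocks, one per simulated device."""
    width = len(matrix[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in matrix] for i in range(parts)]


def column_parallel_linear(x, weight, num_devices):
    shards = split_columns(weight, num_devices)
    # On real hardware these partial products run concurrently, one per GPU
    partial_outputs = [matmul(x, shard) for shard in shards]
    # Gather step: stitch each device's output columns back together, row by row
    return [sum((out[r] for out in partial_outputs), []) for r in range(len(x))]


x = [[1, 2], [3, 4]]                   # 2 tokens, hidden size 2
weight = [[1, 0, 2, 1], [0, 1, 1, 3]]  # hidden size 2 -> 4 output features

parallel = column_parallel_linear(x, weight, num_devices=2)
assert parallel == matmul(x, weight)
print(parallel)  # [[1, 2, 4, 7], [3, 4, 10, 15]]
```

Each device stores only its share of the weights, which is what lets a model too large for one GPU fit. The gather step is the communication cost paid in exchange.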

Batching Strategies

Static Batching

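# Process multiple prompts in one padded batch.
# Assumes `model` and `tokenizer` are an already-loaded Hugging Face causal LM
# and its tokenizer, with a pad token set and padding_side="left".
prompts = ["Hello", "What is AI?", "Write code"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)

Problem: The batch returns only when its longest generation finishes, so short requests wait idle and new requests cannot start until the whole batch is done.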

Continuous Batching (vLLM)

Request 1: ██████████ (done at t=5)
Request 2: ██████████████ (done at t=7)
Request 3:           ████████ (done at t=9, started when Req 1 finished)
Request 4:               ██████████ (done at t=12, started when Req 2 finished)

The scheduler admits and retires requests at every decoding step rather than once per batch, so a slot freed by a finished request is reused immediately, maximizing GPU utilization.
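
The difference between the two strategies can be reproduced with a toy scheduler. This sketch assumes all requests are queued at t=0, each decode step takes one time unit, and prefill cost is ignored; the function names are illustrative, and real schedulers such as vLLM's handle many more details.

```python
def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    finish, clock = [], 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        clock += max(batch)
        finish.extend([clock] * len(batch))  # every request waits for the slowest
    return finish


def continuous_batching_steps(lengths, batch_size):
    """A freed slot is handed to the next waiting request immediately."""
    slots = [0] * batch_size  # time at which each slot becomes free
    finish = []
    for length in lengths:
        slot = min(range(batch_size), key=slots.__getitem__)
        slots[slot] += length
        finish.append(slots[slot])
    return finish


lengths = [5, 7, 4, 5]  # decode steps per request, in arrival order

static_finish = static_batching_steps(lengths, batch_size=2)
continuous_finish = continuous_batching_steps(lengths, batch_size=2)

print(static_finish)      # [7, 7, 12, 12]
print(continuous_finish)  # [5, 7, 9, 12]
```

With two slots, continuous batching finishes the four requests at t=5, 7, 9, and 12, the same times as in the diagram. Static batching makes the first and third requests wait for their slower batch partners.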

Throughput vs. Latency

Optimization	Throughput	Per-Request Latency	Best For
Large batch	High	High (queuing)	Batch processing, offline jobs
Small batch	Low	Low	Interactive chat
Speculative decoding	Similar to higher	Lower	Interactive serving when a good draft model is available
Quantization	Higher	Lower	Memory-bound deployments (validate accuracy first)
Tensor parallelism	Same	Lower (for large models)	Models that don't fit one GPU

Benchmarking Your Setup

import time
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.2-3B")
params = SamplingParams(max_tokens=100, temperature=0.7)

# Benchmark: 100 identical prompts (use a sample of real traffic for realistic numbers)
prompts = ["Explain quantum physics"] * 100

start = time.perf_counter()  # monotonic, high-resolution timer
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

total_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)

print(f"Total time: {elapsed:.1f}s")
print(f"Throughput: {len(prompts) / elapsed:.1f} req/s")
print(f"Tokens/sec: {total_tokens / elapsed:.0f}")
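
Aggregate throughput hides how individual requests fare, so it can help to summarize per-request latencies as well. The sketch below assumes you have recorded one latency and one output token count per request by some means of your own; the measurements shown are made up, and `percentile` and `summarize` are hypothetical helpers.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct between 0 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(latencies_s, output_tokens, wall_time_s):
    """Combine throughput and latency figures for one benchmark run."""
    return {
        "req_per_s": len(latencies_s) / wall_time_s,
        "tokens_per_s": sum(output_tokens) / wall_time_s,
        "p50_latency_s": percentile(latencies_s, 50),
        "p95_latency_s": percentile(latencies_s, 95),
    }


# Made-up measurements for illustration only
latencies = [0.8, 1.1, 0.9, 2.4, 1.0]
tokens = [100, 100, 90, 100, 100]

stats = summarize(latencies, tokens, wall_time_s=2.5)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

Reporting a high percentile alongside the median exposes slow outliers, such as the 2.4 s request here, that an average would smooth over.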

Key Takeaways

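INT4 quantization cuts weight memory by about 4× relative to FP16, typically with ~2-5% accuracy loss (varies by method and task)
GGUF format makes CPU inference practical, especially for models up to about 13B
KV cache memory grows linearly with context length and often dominates for long generations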
Continuous batching dramatically improves throughput
Always benchmark your specific workload — theoretical speedups may not materialize

Speculative Decoding — Generation speedup techniques
Deployment Strategies — Production serving
Cost Management — Managing LLM spend
