Deployment Strategies for Production

Serving LLMs in production — API design, autoscaling, load balancing, monitoring, and reliability patterns for high-availability model serving

Published: 2026-04-11 · Last updated: 2026-04-13

Moving an LLM from prototype to production means meeting targets for latency, throughput, reliability, and cost at the same time. This guide covers deployment options, API design, autoscaling, load balancing, and monitoring.

Deployment Options

1. Cloud API (Managed)

Provider	Models	Pricing	SLA
OpenAI	GPT-4o, o1, GPT-4o mini	$/token	99.9%
Anthropic	Claude Sonnet, Opus	$/token	99.9%
Google	Gemini Pro, Flash	$/token	99.9%
Together AI	Open-source models	$/token	99.5%
Groq	Llama, Mixtral (ultra-fast)	$/token	99.5%

Pros: No infrastructure to manage, automatic scaling, access to the latest models.
Cons: Per-token costs grow linearly with traffic, data leaves your network, and provider rate limits cap throughput.

2. Self-Hosted (Cloud GPUs)

# Example: Kubernetes deployment on cloud GPUs
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-3-70b
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llm-server
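  template:
    metadata:
      labels:
        app: llm-server  # must match spec.selector.matchLabels, or the Deployment is rejected
    spec: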
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        command:
        - python3
        - -m
        - vllm.entrypoints.openai.api_server
        - --model
        - meta-llama/Llama-3.1-70B-Instruct
        - --tensor-parallel-size
        - "4"  # 4 GPUs per replica
        - --max-num-seqs
        - "256"
        resources:
          limits:
            nvidia.com/gpu: 4
        ports:
        - containerPort: 8000

GPU Cost Estimates (monthly):

GPU	VRAM	Cost (AWS)	Models Supported
A10G	24GB	~$800	7B (quantized)
A100	80GB	~$3,500	70B (tensor parallel)
H100	80GB	~$5,000	70B+, 405B (multi-node)
L40S	48GB	~$2,000	8B-13B
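
To make the trade-off between per-token pricing and fixed GPU cost concrete, the sketch below estimates the monthly token volume at which one self-hosted replica costs the same as a cloud API. The function name is illustrative, the GPU figure is the approximate A100 cost from the table above, and the API price is a hypothetical placeholder rather than a quoted rate.

```python
def breakeven_tokens_per_month(gpu_monthly_cost: float,
                               gpus_per_replica: int,
                               api_price_per_million_tokens: float) -> float:
    """Monthly token volume at which one self-hosted replica costs the same as the API."""
    replica_cost = gpu_monthly_cost * gpus_per_replica
    return replica_cost / api_price_per_million_tokens * 1_000_000


# ~$3,500 per A100 per month, 4 GPUs per replica, and a HYPOTHETICAL
# API price of $1.00 per million tokens.
tokens = breakeven_tokens_per_month(3500, 4, 1.00)
print(f"{tokens / 1e9:.1f}B tokens/month")  # 14.0B tokens/month
```

This ignores engineering time and whether a single replica can actually sustain that throughput, so it is best read as a lower bound on the volume needed to justify self-hosting.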

3. Hybrid (Router)

Route each request to the cloud API or the self-hosted model based on cost, quality requirements, and backend health:

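from openai import OpenAI

class ModelRouter:
    """Routes requests between a cloud API and a self-hosted server.

    self_hosted_query(), cloud_query(), and self_hosted_healthy() are thin
    wrappers around the two backends and are omitted here for brevity.
    """
    def __init__(self):
        self.cloud_client = OpenAI()  # reads OPENAI_API_KEY from the environment
        self.self_hosted_url = "http://internal-llm:8000"

    def route(self, request: str, priority: str = "balanced") -> str:
        if priority == "cost":
            # Cheapest path: always use the self-hosted model
            return self.self_hosted_query(request)
        elif priority == "quality":
            return self.cloud_query(request, model="gpt-4o")
        else:  # balanced: prefer self-hosted, fall back to a cheaper cloud model
            if self.self_hosted_healthy():
                return self.self_hosted_query(request)
            return self.cloud_query(request, model="gpt-4o-mini")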
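
The router above falls back to the cloud only when a health check reports the self-hosted backend as down. Health checks can lag behind real failures, so one possible refinement is to fall back on errors as well. The following is a minimal sketch of that idea; `route_with_fallback` and the stub backends are hypothetical names, and backends are passed in as plain callables so the example runs without a network.

```python
from typing import Callable

Backend = Callable[[str], str]


def route_with_fallback(request: str, primary: Backend, fallback: Backend) -> tuple[str, str]:
    """Try the primary backend; on any failure, retry once on the fallback.

    Returns (backend_name, response) so callers can log which path served the request.
    """
    try:
        return "primary", primary(request)
    except Exception:
        # The primary failed mid-request, so serve from the fallback instead.
        return "fallback", fallback(request)


# Stub backends standing in for the self-hosted and cloud clients.
def flaky_self_hosted(request: str) -> str:
    raise ConnectionError("self-hosted backend unreachable")


def cloud(request: str) -> str:
    return f"cloud answer to: {request}"


print(route_with_fallback("hello", flaky_self_hosted, cloud))
# ('fallback', 'cloud answer to: hello')
```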

API Design Patterns

Streaming Responses

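import json
from uuid import uuid4

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    messages: list[dict]

@app.post("/v1/chat/completions")
async def chat_completion(request: ChatRequest):
    # One ID is shared by every chunk of the same completion
    completion_id = f"chatcmpl-{uuid4()}"

    # `model` is assumed to be an already-loaded engine whose stream_generate()
    # yields text tokens. A plain (sync) generator is used on purpose:
    # StreamingResponse iterates it in a thread pool, so the blocking loop
    # does not stall the event loop.
    def generate():
        for token in model.stream_generate(request.messages):
            chunk = {
                "id": completion_id,
                "object": "chat.completion.chunk",
                "choices": [{
                    "delta": {"content": token},
                    "index": 0,
                }],
            }
            yield f"data: {json.dumps(chunk)}\n\n"

        # Stream terminator expected by OpenAI-compatible clients
        yield "data: [DONE]\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")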
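
On the client side, the stream arrives as `data:` lines that have to be parsed back into text. The sketch below shows one way to do that for the chunk format used above; `iter_stream_content` is a hypothetical helper, and it takes any iterable of lines so it can be exercised without a running server.

```python
import json
from typing import Iterable, Iterator


def iter_stream_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text deltas from OpenAI-style server-sent event lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue  # skip blank separators and non-data fields
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        content = chunk["choices"][0]["delta"].get("content")
        if content:
            yield content


sample = [
    'data: {"choices": [{"delta": {"content": "Hel"}, "index": 0}]}',
    "",
    'data: {"choices": [{"delta": {"content": "lo"}, "index": 0}]}',
    "",
    "data: [DONE]",
]
print("".join(iter_stream_content(sample)))  # Hello
```

In practice the lines would come from an HTTP client's streaming line iterator rather than a list.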

Rate Limiting

from fastapi import Request, HTTPException
import time

class RateLimiter:
    """Sliding-window limiter on requests and tokens per minute.

    State lives in process memory, so each server replica enforces
    its own limits independently.
    """
    def __init__(self, requests_per_minute=60, tokens_per_minute=100000):
        self.requests = {}  # user_id → [(timestamp, tokens)]
        self.rpm = requests_per_minute
        self.tpm = tokens_per_minute

    async def check(self, request: Request, user_id: str, tokens: int):
        now = time.monotonic()  # monotonic clock is unaffected by wall-clock adjustments
        minute_ago = now - 60

        # Drop entries that have left the 60-second window
        recent = [
            (ts, tok) for ts, tok in self.requests.get(user_id, [])
            if ts > minute_ago
        ]
        self.requests[user_id] = recent

        # Check the request-count limit
        if len(recent) >= self.rpm:
            raise HTTPException(status_code=429, detail="Too many requests")

        # Check the token budget, including this request's tokens
        total_tokens = sum(tok for _, tok in recent)
        if total_tokens + tokens > self.tpm:
            raise HTTPException(status_code=429, detail="Token limit exceeded")

        recent.append((now, tokens))
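
A 429 response is more useful to clients when it says how long to wait, for example through a `Retry-After` header. As a rough sketch, the wait time for the request-count limit can be derived from the same sliding window; `retry_after_seconds` is a hypothetical helper and is not part of the limiter above.

```python
def retry_after_seconds(timestamps: list[float], now: float,
                        limit: int, window: float = 60.0) -> float:
    """Seconds until a sliding-window limiter would admit another request."""
    recent = sorted(ts for ts in timestamps if ts > now - window)
    if len(recent) < limit:
        return 0.0  # under the limit: no need to wait
    # A new request is admitted once fewer than `limit` entries remain,
    # i.e. when this entry ages out of the window.
    blocking = recent[len(recent) - limit]
    return max(0.0, blocking + window - now)


# Three requests at t=10, 20, 30 with a limit of 3 per 60s, checked at t=40:
print(retry_after_seconds([10, 20, 30], now=40, limit=3))  # 30.0
```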

Autoscaling

Kubernetes HPA (Horizontal Pod Autoscaler)

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llama-3-70b  # must match the Deployment's metadata.name
  minReplicas: 2
  maxReplicas: 20
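  metrics:
  # CPU utilization requires a CPU request on the container and is a weak
  # signal for GPU-bound inference; treat it as a secondary trigger.
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  # queue_depth is a custom per-pod metric; it must be exposed to the HPA
  # through a custom metrics adapter.
  - type: Pods
    pods:
      metric:
        name: queue_depth
      target:
        type: AverageValue
        averageValue: 50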
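
To see how these targets translate into replica counts, the sketch below reproduces the scaling rule described in the Kubernetes HPA documentation: desired replicas = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance (0.1 by default), with the largest result across metrics winning. The function name and the sample metric values are illustrative assumptions; the real controller also applies stabilization windows and readiness handling that are not modeled here.

```python
import math


def desired_replicas(current_replicas: int, current_value: float, target_value: float,
                     min_replicas: int = 2, max_replicas: int = 20,
                     tolerance: float = 0.1) -> int:
    """Simplified HPA rule for a single metric."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no scaling
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# 4 replicas, average queue depth 120 against a target of 50,
# and average CPU utilization 35% against a target of 70%.
queue_based = desired_replicas(4, current_value=120, target_value=50)
cpu_based = desired_replicas(4, current_value=35, target_value=70)
print(queue_based, cpu_based, max(queue_based, cpu_based))  # 10 2 10
```

Because the largest proposal wins, a deep queue scales the deployment up even while CPU looks idle, which is the usual situation for GPU-bound serving.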

Scale-to-Zero (Serverless)

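For infrequently used models, serverless GPU platforms scale to zero when idle, at the cost of a cold start on the first request: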

Platform	Cold Start	Max GPU	Notes
Modal	5-15s	H100	GPU serverless
RunPod	10-30s	A100-H100	Serverless GPUs
Baseten	5-20s	A100	Model-focused
Replicate	3-10s	Various	Easy deployment

Load Balancing

import httpx

# Round-robin with health checks
class LoadBalancer:
    def __init__(self, backends: list[str]):
        self.backends = backends
        self.current = 0
        self.health = {b: True for b in backends}

    def next_backend(self) -> str:
        healthy = [b for b in self.backends if self.health[b]]
        if not healthy:
            raise RuntimeError("All backends unhealthy")

        backend = healthy[self.current % len(healthy)]
        self.current += 1
        return backend

    async def health_check(self):
        # httpx.get() is synchronous and cannot be awaited, so probe with
        # AsyncClient; its timeout replaces the asyncio.wait_for wrapper.
        async with httpx.AsyncClient(timeout=5) as client:
            for backend in self.backends:
                try:
                    resp = await client.get(f"{backend}/health")
                    self.health[backend] = resp.status_code == 200
                except httpx.HTTPError:  # connection errors and timeouts
                    self.health[backend] = False
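
Round-robin assumes requests cost roughly the same, but LLM generation times vary widely with output length. A common alternative, shown here as a different technique rather than a drop-in part of the class above, is to pick the backend with the fewest in-flight requests. The class name and hostnames below are hypothetical.

```python
class LeastOutstandingBalancer:
    """Pick the backend with the fewest in-flight requests."""

    def __init__(self, backends: list[str]):
        self.in_flight = {b: 0 for b in backends}

    def acquire(self) -> str:
        # min() returns the first backend among ties (dict insertion order)
        backend = min(self.in_flight, key=self.in_flight.get)
        self.in_flight[backend] += 1
        return backend

    def release(self, backend: str) -> None:
        # Call when the request finishes, whether it succeeded or failed
        self.in_flight[backend] -= 1


lb = LeastOutstandingBalancer(["http://llm-0:8000", "http://llm-1:8000"])
a = lb.acquire()  # llm-0 (tie, first in order)
b = lb.acquire()  # llm-1
lb.release(b)     # the request on llm-1 finishes first
c = lb.acquire()  # llm-1 again, since llm-0 is still busy
print(a, b, c)
```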

Monitoring

Key Metrics

Metric	Alert Threshold	Action
P50 latency	> 500ms	Check GPU utilization
P99 latency	> 5s	Check for long outputs, scale up
Error rate	> 1%	Check OOM, model crashes
GPU memory	> 90%	Scale up, reduce batch
Queue depth	> 100	Scale replicas
Token throughput	Degraded	Check for throttling
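
As a small illustration of how the latency rows in this table could be evaluated, the sketch below computes nearest-rank percentiles over a window of request latencies and compares them with the thresholds above. In a real deployment this would normally be an alerting rule in the monitoring system; the function names and sample data here are made up for the example.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_alerts(latencies_s: list[float]) -> list[str]:
    """Return the latency alerts from the table that currently fire."""
    alerts = []
    if percentile(latencies_s, 50) > 0.5:
        alerts.append("P50 latency > 500ms")
    if percentile(latencies_s, 99) > 5.0:
        alerts.append("P99 latency > 5s")
    return alerts


# 100 requests: most are fast, two are very slow.
samples = [0.2] * 97 + [0.4, 6.0, 7.5]
print(latency_alerts(samples))  # ['P99 latency > 5s']
```

The median stays healthy while the tail alert fires, which is why the table tracks P50 and P99 separately.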

Prometheus + Grafana Dashboard

from prometheus_client import Counter, Histogram, start_http_server

REQUEST_COUNT = Counter("llm_requests_total", "Total LLM requests", ["model", "status"])
REQUEST_LATENCY = Histogram("llm_request_latency_seconds", "Request latency", ["model"])
TOKEN_COUNT = Counter("llm_tokens_total", "Total tokens processed", ["type"])

# Expose /metrics for Prometheus to scrape (the port is an arbitrary example)
start_http_server(8001)

# In your request handler:
# .time() is used as a context manager to record the duration of the block
with REQUEST_LATENCY.labels(model="llama-3-70b").time():
    response = model.generate(prompt)

REQUEST_COUNT.labels(model="llama-3-70b", status="success").inc()
TOKEN_COUNT.labels(type="input").inc(input_tokens)
TOKEN_COUNT.labels(type="output").inc(output_tokens)

Key Takeaways

Cloud APIs are the fastest to deploy; self-hosting can be cheaper at sustained high volume
Streaming responses reduce perceived latency because users see the first tokens before generation finishes
Always implement rate limiting (requests and tokens per minute) to protect against abuse
Monitor latency percentiles, GPU memory, queue depth, and error rates
Use a router pattern to blend cloud and self-hosted backends for cost optimization
