Cost Management and Optimization

Understanding and controlling LLM costs — token pricing, caching strategies, model selection for budget, and spend tracking at scale

Published: 2026-04-12 · Last updated: 2026-04-13

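LLM costs scale directly with usage because every request is billed per input and output token. A feature serving 1M users with a frontier model can cost anywhere from thousands to hundreds of thousands of dollars per month, depending on message volume and model choice. This guide covers how to understand, track, and optimize LLM spend.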

Token Pricing Landscape

Major API Providers (approximate, per 1M tokens)

Provider	Model	Input ($/1M tokens)	Output ($/1M tokens)	Context window (tokens)
OpenAI	GPT-4o	$2.50	$10.00	128K
OpenAI	GPT-4o mini	$0.15	$0.60	128K
OpenAI	o1	$15.00	$60.00	128K
Anthropic	Claude Sonnet 4	$3.00	$15.00	200K
Anthropic	Claude Opus	$15.00	$75.00	200K
Google	Gemini 2.5 Pro	$1.25	$10.00	1M
Google	Gemini Flash	$0.075	$0.30	1M
DeepSeek	DeepSeek V3	$0.14	$0.28	128K

Self-Hosted Cost Calculation

def self_hosted_cost(gpu_hourly_rate: float, tokens_per_second: float) -> dict:
    """Calculate cost per 1M tokens for a self-hosted setup.

    Assumes the GPU is busy 100% of the time; idle capacity raises the real cost.
    """
    tokens_per_hour = tokens_per_second * 3600
    cost_per_million = (gpu_hourly_rate / tokens_per_hour) * 1_000_000

    return {
        "cost_per_1M_tokens": f"${cost_per_million:.2f}",
        "gpu_hours_per_1M_tokens": f"{1_000_000 / tokens_per_hour:.1f}",
    }

# Example: A100 at $3.50/hr, generating 500 tokens/sec
print(self_hosted_cost(3.50, 500))
# {'cost_per_1M_tokens': '$1.94', 'gpu_hours_per_1M_tokens': '0.6'}

Rule of thumb: Self-hosting becomes cheaper than API when you exceed ~10B tokens/month.
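
The break-even point depends heavily on what is counted as the fixed cost of self-hosting. The following is a minimal sketch, assuming a fixed monthly bill (GPUs reserved around the clock plus operational overhead) and a single blended API price. The function name and the sample figures are hypothetical and chosen for illustration; different hardware, staffing, and pricing move the threshold considerably.

```python
def break_even_tokens_per_month(fixed_monthly_cost: float, api_price_per_1m: float) -> float:
    """Monthly token volume above which a fixed self-hosting bill beats the API."""
    if api_price_per_1m <= 0:
        raise ValueError("api_price_per_1m must be positive")
    return fixed_monthly_cost / api_price_per_1m * 1_000_000

# Hypothetical inputs: $50,000/month for hardware and operations,
# $5.00 per 1M tokens as a blended API price
tokens = break_even_tokens_per_month(50_000, 5.0)
print(f"Break-even: {tokens / 1e9:.1f}B tokens/month")  # Break-even: 10.0B tokens/month
```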

Cost Estimation Calculator

def estimate_monthly_cost(
    daily_users: int,
    avg_conversation_length: int,  # messages
    avg_tokens_per_message: int,
    model_input_price: float,       # per 1M tokens
    model_output_price: float,
) -> float:
    """Estimate monthly API cost."""
    # Assumes output volume roughly matches input volume, as is typical in chat
    daily_input_tokens = daily_users * avg_conversation_length * avg_tokens_per_message
    daily_output_tokens = daily_input_tokens

    monthly_input_cost = (daily_input_tokens * 30 / 1_000_000) * model_input_price
    monthly_output_cost = (daily_output_tokens * 30 / 1_000_000) * model_output_price
    
    return monthly_input_cost + monthly_output_cost

# Example: 1,000 users, 10 messages, 200 tokens each, GPT-4o
cost = estimate_monthly_cost(
    daily_users=1000,
    avg_conversation_length=10,
    avg_tokens_per_message=200,
    model_input_price=2.50,
    model_output_price=10.00,
)
print(f"Estimated monthly cost: ${cost:,.0f}")  # Estimated monthly cost: $750
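
The estimate above counts each message once. Chat APIs are stateless, so an application that resends the conversation history on every turn pays for earlier messages again as input. The sketch below shows that effect, assuming the full history is resent with no truncation or summarization; the function name is illustrative.

```python
def conversation_tokens(turns: int, user_tokens: int, reply_tokens: int) -> tuple[int, int]:
    """Return (input_tokens, output_tokens) for one conversation with full history resent."""
    input_tokens = 0
    history = 0  # tokens accumulated from earlier turns
    for _ in range(turns):
        input_tokens += history + user_tokens   # prior turns plus the new user message
        history += user_tokens + reply_tokens   # both sides join the history
    output_tokens = turns * reply_tokens
    return input_tokens, output_tokens

inp, out = conversation_tokens(turns=10, user_tokens=200, reply_tokens=200)
print(inp, out)  # 20000 2000
```

Under these assumptions a 10-turn conversation sends 20,000 input tokens, ten times the 2,000 that a once-per-message count gives, so conversation length can matter as much as per-token price.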

Cost Reduction Strategies

1. Model Cascading

Use a cheap model for easy queries, escalate to expensive models only when needed:

def cascade_query(query: str) -> str:
    """Try cheap model first, escalate if needed."""
    # Step 1: Try GPT-4o mini (call_model is a placeholder for your own API wrapper)
    response = call_model("gpt-4o-mini", query)
    
    # Step 2: Check if response is adequate
    if is_adequate(response):
        return response
    
    # Step 3: Escalate to GPT-4o
    return call_model("gpt-4o", query)

def is_adequate(response: str, min_length: int = 50) -> bool:
    """Simple adequacy check."""
    return len(response) >= min_length and "I don't know" not in response

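Cost savings: if the cheap model handles 60-80% of queries, only 20-40% escalate. Escalated queries pay for both calls, so the net saving is somewhat smaller than the share of queries the cheap model handles.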
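
To see what a given escalation rate is worth, compare the expected cost per query with the cost of always using the expensive model. This is a rough sketch, assuming every query pays for the cheap call and escalated queries also pay for the expensive one. The function name is illustrative, and the 6% cost ratio mirrors the GPT-4o mini to GPT-4o price ratio in the table above.

```python
def cascade_savings(cheap_cost: float, expensive_cost: float, escalation_rate: float) -> float:
    """Fraction saved versus sending every query to the expensive model."""
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    # Every query pays for the cheap call; escalated queries also pay for the expensive one.
    expected_cost = cheap_cost + escalation_rate * expensive_cost
    return 1.0 - expected_cost / expensive_cost

# Per-query costs normalized so the expensive model costs 1.0
for rate in (0.2, 0.4):
    print(f"escalation {rate:.0%}: savings {cascade_savings(0.06, 1.0, rate):.0%}")
# escalation 20%: savings 74%
# escalation 40%: savings 54%
```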

2. Response Caching

import hashlib
import json
from diskcache import Cache

cache = Cache("./llm-cache")

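def cached_completion(messages: list, model: str = "gpt-4o") -> str:
    """Cache responses to identical (model, messages) requests."""
    # Include the model in the key so different models never share cached responses
    cache_key = hashlib.sha256(
        json.dumps({"model": model, "messages": messages}, sort_keys=True).encode()
    ).hexdigest()
    
    if cache_key in cache:
        return cache[cache_key]
    
    # call_llm is a placeholder for your own API wrapper
    response = call_llm(messages, model)
    cache[cache_key] = response
    return response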

Cache hit rates: 20-50% for customer support, 5-15% for general chat.
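
Hit rates vary widely by workload, so it helps to measure your own. The sketch below is a small in-memory cache that counts hits and misses. The class name and the stand-in backend are hypothetical; a real backend would call the provider's API.

```python
import hashlib
import json
from typing import Callable

class HitRateCache:
    """In-memory response cache that counts hits and misses."""

    def __init__(self, backend: Callable[[list, str], str]):
        self.backend = backend          # function that actually calls the model
        self.store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def complete(self, messages: list, model: str) -> str:
        # Key on both model and messages so different models never share entries.
        payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
        key = hashlib.sha256(payload.encode()).hexdigest()
        if key in self.store:
            self.hits += 1
            return self.store[key]
        self.misses += 1
        self.store[key] = self.backend(messages, model)
        return self.store[key]

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

# Stand-in backend for demonstration only
cache = HitRateCache(lambda messages, model: f"reply from {model}")
question = [{"role": "user", "content": "How do I reset my password?"}]
cache.complete(question, "gpt-4o-mini")
cache.complete(question, "gpt-4o-mini")
print(f"hit rate: {cache.hit_rate:.0%}")  # hit rate: 50%
```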

3. Prompt Optimization

Shorter prompts = fewer input tokens:

# Before: verbose instructions, roughly 60 tokens before the question is added
prompt = """You are a helpful AI assistant with expertise in customer support.
Your goal is to help users with their questions about our product.
Please be polite, professional, and thorough.
If you don't know the answer, say so clearly.
Always provide step-by-step explanations when applicable.

User question: {question}"""

# After: a handful of tokens before the question is added
prompt = "Answer concisely: {question}"
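
Because the prompt prefix is sent with every request, small per-request savings add up at volume. Here is a quick sketch of that arithmetic; the function name, the 300-token trim, and the request volume are hypothetical.

```python
def monthly_prompt_savings(tokens_saved_per_request: int,
                           requests_per_month: int,
                           input_price_per_1m: float) -> float:
    """Dollars saved per month by trimming a prompt that is sent with every request."""
    tokens_saved = tokens_saved_per_request * requests_per_month
    return tokens_saved / 1_000_000 * input_price_per_1m

# Hypothetical: 300 tokens trimmed per request, 1M requests/month, $2.50 per 1M input tokens
print(f"${monthly_prompt_savings(300, 1_000_000, 2.50):,.0f}")  # $750
```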

4. Output Length Control

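from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Set max_tokens to prevent runaway generation
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=500,      # Hard cap on output tokens
    temperature=0.3,     # Lower = less random output; it does not cap length
)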

# Stop sequences to prevent unnecessary content
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    stop=["\n\n\n", "###"],  # Stop at natural boundaries
)
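
A hard output cap also gives each request a predictable worst-case cost. This is a minimal sketch of that bound; the function name and the 1,000-token input size are assumptions, while the prices are the GPT-4o mini figures from the table above.

```python
def max_request_cost(input_tokens: int, max_tokens: int,
                     input_price_per_1m: float, output_price_per_1m: float) -> float:
    """Upper bound on one request's cost when output is capped at max_tokens."""
    return (input_tokens * input_price_per_1m + max_tokens * output_price_per_1m) / 1_000_000

# Hypothetical 1,000-token input with max_tokens=500 on GPT-4o mini
print(f"${max_request_cost(1_000, 500, 0.15, 0.60):.6f}")  # $0.000450
```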

Spend Tracking

# Track token usage per request
from dataclasses import dataclass
from datetime import datetime

@dataclass
class UsageRecord:
    timestamp: datetime
    user_id: str
    model: str
    input_tokens: int
    output_tokens: int
    cost: float
    endpoint: str

class UsageTracker:
    def __init__(self):
        self.records: list[UsageRecord] = []
        self.price_map = {
            "gpt-4o": {"input": 2.50, "output": 10.00},
            "gpt-4o-mini": {"input": 0.15, "output": 0.60},
        }
    
    def record(self, record: UsageRecord):
        self.records.append(record)
    
    def calculate_cost(self, input_tokens: int, output_tokens: int, model: str) -> float:
        prices = self.price_map[model]
        return (input_tokens / 1_000_000 * prices["input"] + 
                output_tokens / 1_000_000 * prices["output"])
    
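    def monthly_spend(self, month: int | None = None, year: int | None = None) -> float:
        """Total cost, optionally filtered by month (1-12) and year."""
        records = self.records
        if month is not None:
            records = [r for r in records if r.timestamp.month == month]
        if year is not None:
            # Without a year filter, the same month from different years is summed together
            records = [r for r in records if r.timestamp.year == year]
        return sum(r.cost for r in records)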
    
    def top_users(self, n: int = 10) -> list[tuple[str, float]]:
        user_costs = {}
        for r in self.records:
            user_costs[r.user_id] = user_costs.get(r.user_id, 0) + r.cost
        return sorted(user_costs.items(), key=lambda x: x[1], reverse=True)[:n]
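
The same grouping used for top users works for any dimension, such as model or endpoint. The standalone sketch below assumes a trimmed-down record type; `Usage` and `spend_by` are illustrative names, not part of the tracker above.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Usage:
    """Trimmed-down stand-in with only the fields needed for grouping."""
    user_id: str
    model: str
    endpoint: str
    cost: float

def spend_by(records: list[Usage], field: str) -> dict[str, float]:
    """Total cost grouped by one field: 'user_id', 'model', or 'endpoint'."""
    totals: dict[str, float] = defaultdict(float)
    for r in records:
        totals[getattr(r, field)] += r.cost
    # Largest spenders first
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))

records = [
    Usage("u1", "gpt-4o", "/chat", 0.50),
    Usage("u2", "gpt-4o-mini", "/chat", 0.25),
    Usage("u1", "gpt-4o", "/summarize", 1.00),
]
print(spend_by(records, "model"))     # {'gpt-4o': 1.5, 'gpt-4o-mini': 0.25}
print(spend_by(records, "endpoint"))  # {'/summarize': 1.0, '/chat': 0.75}
```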

Budget Alerts

def check_budget_alerts(monthly_spend: float, budget: float):
    """Send alert when approaching budget limit."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    usage_ratio = monthly_spend / budget
    
    if usage_ratio >= 1.0:
        send_alert("🚨 BUDGET EXCEEDED", f"Spend: ${monthly_spend:.0f} / ${budget:.0f}")
    elif usage_ratio >= 0.8:
        send_alert("⚠️ Approaching budget", f"Spend: ${monthly_spend:.0f} / ${budget:.0f} ({usage_ratio:.0%})")
    elif usage_ratio >= 0.5:
        send_alert("📊 Budget update", f"Spend: ${monthly_spend:.0f} / ${budget:.0f} ({usage_ratio:.0%})")
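
If the check runs on a schedule, the same alert would fire on every run once a threshold is passed. One way to avoid that is to remember which thresholds have already triggered. The sketch below assumes that state is kept in a set and reset each billing month; `THRESHOLDS` and `newly_crossed` are hypothetical names.

```python
THRESHOLDS = (0.5, 0.8, 1.0)  # same levels as the alert function above

def newly_crossed(usage_ratio: float, already_alerted: set[float]) -> list[float]:
    """Return thresholds crossed for the first time and remember them."""
    crossed = [t for t in THRESHOLDS if usage_ratio >= t and t not in already_alerted]
    already_alerted.update(crossed)
    return crossed

alerted: set[float] = set()           # reset at the start of each billing month
print(newly_crossed(0.55, alerted))   # [0.5]
print(newly_crossed(0.60, alerted))   # []
print(newly_crossed(1.05, alerted))   # [0.8, 1.0]
```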

Key Takeaways

API costs scale linearly with usage; self-hosting costs are mostly fixed
Model cascading (cheap → expensive) can cut costs substantially when the cheap model handles 60-80% of queries
Caching identical requests can avoid 5-50% of API calls, depending on use case
Always set max_tokens and use stop sequences to control output length
Track spend per user, per model, and per endpoint to identify optimization opportunities
