Deployment & Infrastructure
Cost Management and Optimization
Understanding and controlling LLM costs — token pricing, caching strategies, model selection for budget, and spend tracking at scale
Published: 2026-04-12 · Last updated: 2026-04-13
Cost Management and Optimization
LLM costs can scale dramatically with usage. A feature serving 1M users with a frontier model can cost thousands per month. This guide covers how to understand, track, and optimize LLM spend.
Token Pricing Landscape
Major API Providers (approximate, per 1M tokens)
| Provider | Model | Input | Output | Context |
|---|---|---|---|---|
| OpenAI | GPT-4o | $2.50 | $10.00 | 128K |
| OpenAI | GPT-4o mini | $0.15 | $0.60 | 128K |
| OpenAI | o1 | $15.00 | $60.00 | 128K |
| Anthropic | Claude Sonnet 4 | $3.00 | $15.00 | 200K |
| Anthropic | Claude Opus | $15.00 | $75.00 | 200K |
| Gemini 2.5 Pro | $1.25 | $10.00 | 1M | |
| Gemini Flash | $0.075 | $0.30 | 1M | |
| DeepSeek | DeepSeek V3 | $0.14 | $0.28 | 128K |
Self-Hosted Cost Calculation
def self_hosted_cost(gpu_hourly_rate: float, tokens_per_second: float) -> dict:
"""Calculate cost per 1M tokens for self-hosted setup."""
tokens_per_hour = tokens_per_second * 3600
cost_per_million = (gpu_hourly_rate / tokens_per_hour) * 1_000_000
return {
"cost_per_1M_input_tokens": f"${cost_per_million:.2f}",
"gpu_hours_per_1M_tokens": f"{1_000_000 / tokens_per_hour:.1f}",
}
# Example: A100 at $3.50/hr, generating 500 tokens/sec
print(self_hosted_cost(3.50, 500))
# {'cost_per_1M_input_tokens': '$7.00', 'gpu_hours_per_1M_tokens': '0.6'}
Rule of thumb: Self-hosting becomes cheaper than API when you exceed ~10B tokens/month.
Cost Estimation Calculator
def estimate_monthly_cost(
daily_users: int,
avg_conversation_length: int, # messages
avg_tokens_per_message: int,
model_input_price: float, # per 1M tokens
model_output_price: float,
) -> float:
"""Estimate monthly API cost."""
daily_input_tokens = daily_users * avg_conversation_length * avg_tokens_per_message
daily_output_tokens = daily_users * avg_conversation_length * avg_tokens_per_message
# Output tokens are typically similar to input in chat
monthly_input_cost = (daily_input_tokens * 30 / 1_000_000) * model_input_price
monthly_output_cost = (daily_output_tokens * 30 / 1_000_000) * model_output_price
return monthly_input_cost + monthly_output_cost
# Example: 1,000 users, 10 messages, 200 tokens each, GPT-4o
cost = estimate_monthly_cost(
daily_users=1000,
avg_conversation_length=10,
avg_tokens_per_message=200,
model_input_price=2.50,
model_output_price=10.00,
)
print(f"Estimated monthly cost: ${cost:,.0f}") # ~$75,000/month!
Cost Reduction Strategies
1. Model Cascading
Use a cheap model for easy queries, escalate to expensive models only when needed:
def cascade_query(query: str) -> str:
"""Try cheap model first, escalate if needed."""
# Step 1: Try GPT-4o mini
response = call_model("gpt-4o-mini", query)
# Step 2: Check if response is adequate
if is_adequate(response):
return response
# Step 3: Escalate to GPT-4o
return call_model("gpt-4o", query)
def is_adequate(response: str, min_length: int = 50) -> bool:
"""Simple adequacy check."""
return len(response) >= min_length and "I don't know" not in response
Cost savings: 60-80% of queries handled by cheap model, only 20-40% escalate.
2. Response Caching
import hashlib
import json
from diskcache import Cache
cache = Cache("./llm-cache")
def cached_completion(messages: list, model: str = "gpt-4o") -> str:
"""Cache responses to identical prompts."""
cache_key = hashlib.sha256(
json.dumps(messages, sort_keys=True).encode()
).hexdigest()
if cache_key in cache:
return cache[cache_key]
response = call_llm(messages, model)
cache[cache_key] = response
return response
Cache hit rates: 20-50% for customer support, 5-15% for general chat.
3. Prompt Optimization
Shorter prompts = fewer input tokens:
# Before: 350 tokens
prompt = """You are a helpful AI assistant with expertise in customer support.
Your goal is to help users with their questions about our product.
Please be polite, professional, and thorough.
If you don't know the answer, say so clearly.
Always provide step-by-step explanations when applicable.
User question: {question}"""
# After: 50 tokens
prompt = "Answer concisely: {question}"
4. Output Length Control
# Set max_tokens to prevent runaway generation
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}],
max_tokens=500, # Hard limit
temperature=0.3, # Lower = more predictable length
)
# Stop sequences to prevent unnecessary content
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": prompt}],
stop=["\n\n\n", "###"], # Stop at natural boundaries
)
Spend Tracking
# Track token usage per request
from dataclasses import dataclass
from datetime import datetime
@dataclass
class UsageRecord:
timestamp: datetime
user_id: str
model: str
input_tokens: int
output_tokens: int
cost: float
endpoint: str
class UsageTracker:
def __init__(self):
self.records: list[UsageRecord] = []
self.price_map = {
"gpt-4o": {"input": 2.50, "output": 10.00},
"gpt-4o-mini": {"input": 0.15, "output": 0.60},
}
def record(self, record: UsageRecord):
self.records.append(record)
def calculate_cost(self, input_tokens: int, output_tokens: int, model: str) -> float:
prices = self.price_map[model]
return (input_tokens / 1_000_000 * prices["input"] +
output_tokens / 1_000_000 * prices["output"])
def monthly_spend(self, month: int = None) -> float:
records = self.records
if month:
records = [r for r in records if r.timestamp.month == month]
return sum(r.cost for r in records)
def top_users(self, n: int = 10) -> list[tuple[str, float]]:
user_costs = {}
for r in self.records:
user_costs[r.user_id] = user_costs.get(r.user_id, 0) + r.cost
return sorted(user_costs.items(), key=lambda x: x[1], reverse=True)[:n]
Budget Alerts
def check_budget_alerts(monthly_spend: float, budget: float):
"""Send alert when approaching budget limit."""
usage_ratio = monthly_spend / budget
if usage_ratio >= 1.0:
send_alert("🚨 BUDGET EXCEEDED", f"Spend: ${monthly_spend:.0f} / ${budget:.0f}")
elif usage_ratio >= 0.8:
send_alert("⚠️ Approaching budget", f"Spend: ${monthly_spend:.0f} / ${budget:.0f} ({usage_ratio:.0%})")
elif usage_ratio >= 0.5:
send_alert("📊 Budget update", f"Spend: ${monthly_spend:.0f} / ${budget:.0f} ({usage_ratio:.0%})")
Key Takeaways
- API costs scale linearly with usage; self-hosting has fixed costs
- Model cascading (cheap → expensive) can reduce costs by 60-80%
- Caching identical requests saves 5-50% depending on use case
- Always set max_tokens and use stop sequences to control output length
- Track spend per user, per model, and per endpoint to identify optimization opportunities
Related Documentation
- Deployment Strategies — Production serving patterns
- Inference Optimization — Technical cost reduction
- Evaluation Metrics — Justifying cost with quality metrics
Related docs
LLM Metrics & KPIs
Defining and tracking LLM success metrics — quality KPIs, cost KPIs, user satisfaction, throughput targets, and dashboard design
LLM Latency Optimization
Achieving sub-second LLM latency — speculative decoding, model parallelism, prefill optimization, and real-time serving patterns
Attention Mechanisms Variants
A deep technical survey of attention variants — from scaled dot-product to FlashAttention, linear attention, and state space alternatives