Evaluation & Safety
LLM Metrics & KPIs
Defining and tracking LLM success metrics — quality KPIs, cost KPIs, user satisfaction, throughput targets, and dashboard design
Published: 2026-04-27 · Last updated: 2026-04-13
LLM Metrics & KPIs
You cannot manage what you cannot measure. LLM systems require a comprehensive metrics framework spanning output quality, operational performance, cost efficiency, and user satisfaction. This guide provides a structured approach to defining, tracking, and acting on LLM KPIs.
The LLM Metrics Framework
┌──────────────────────────────────────────────────────────────┐
│ LLM Metrics Dashboard │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Quality │ │ Operations │ │ Cost │ │
│ │ KPIs │ │ KPIs │ │ KPIs │ │
│ │ │ │ │ │ │ │
│ │ • Accuracy │ │ • Latency │ │ • $/Request │ │
│ │ • Helpfulness│ │ • Throughput │ │ • $/Token │ │
│ │ • Safety │ │ • Error Rate │ │ • Budget │ │
│ │ • Hallucination│ │ • Uptime │ │ • Utilization│ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ User │ │ Business │ │
│ │ Satisfaction│ │ Impact │ │
│ │ │ │ │ │
│ │ • CSAT │ │ • Task │ │
│ │ • NPS │ │ Completion │ │
│ │ • Retention │ │ • Time Saved │ │
│ │ • Feedback │ │ • Revenue │ │
│ └──────────────┘ └──────────────┘ │
└──────────────────────────────────────────────────────────────┘
Quality KPIs
Output Quality Metrics
| Metric | Definition | Target | How to Measure |
|---|---|---|---|
| Accuracy | Fraction of responses that are factually correct | >90% | Human evaluation, fact-checking |
| Helpfulness | How useful the response is for the user's goal | >4.0/5.0 | User ratings, LLM judge |
| Relevance | How well the response addresses the query | >85% | LLM judge, keyword matching |
| Completeness | Whether the response covers all aspects of the query | >80% | LLM judge with rubric |
| Coherence | Logical flow and consistency within the response | >90% | LLM judge, self-consistency check |
| Conciseness | No unnecessary verbosity | 50-200 words for simple queries | Token count analysis |
LLM-as-a-Judge Evaluation
class LLMJudgeEvaluator:
"""Use an LLM to evaluate response quality."""
def __init__(self, judge_model, rubric: dict):
self.judge = judge_model
self.rubric = rubric
async def evaluate(self, prompt: str, response: str) -> dict:
"""Evaluate a response against the rubric."""
rubric_text = "\n".join(f"- {k}: {v}" for k, v in self.rubric.items())
evaluation_prompt = f"""Evaluate the following response against these criteria:
Rubric:
{rubric_text}
Query: {prompt}
Response: {response}
For each criterion, provide:
1. A score from 1-5
2. A brief justification
Format your response as JSON:
{{"criterion_name": {{"score": N, "justification": "..."}}}}"""
result = await self.judge.generate(evaluation_prompt, max_tokens=500)
scores = self._parse_scores(result.text)
return {
"scores": scores,
"overall": sum(s["score"] for s in scores.values()) / len(scores),
"prompt": prompt,
"response": response,
}
def _parse_scores(self, text: str) -> dict:
"""Parse JSON scores from the LLM response."""
import json
import re
match = re.search(r'\{.*\}', text, re.DOTALL)
if match:
return json.loads(match.group())
return {}
Safety & Alignment Metrics
| Metric | Definition | Target | Frequency |
|---|---|---|---|
| Toxicity Rate | % of responses containing toxic content | <0.1% | Continuous |
| Bias Score | Demographic parity across groups | <5% disparity | Weekly |
| Hallucination Rate | % of responses with unsupported claims | <5% | Weekly |
| Jailbreak Success Rate | % of adversarial prompts that succeed | <1% | Monthly |
| Refusal Rate | % of queries the model appropriately refuses | 95-100% for harmful queries | Continuous |
| Over-refusal Rate | % of benign queries incorrectly refused | <2% | Weekly |
class SafetyMetricsTracker:
"""Track safety metrics in real-time."""
def __init__(self):
self.toxicity_detector = ToxicityDetector()
self.hallucination_detector = HallucinationDetector()
self.bias_evaluator = BiasEvaluator()
self.total_responses = 0
self.toxic_responses = 0
self.hallucinated_responses = 0
def record_response(self, prompt: str, response: str) -> dict:
"""Record and evaluate a response for safety."""
self.total_responses += 1
metrics = {
"toxic": self.toxicity_detector.detect(response),
"hallucinated": self.hallucination_detector.detect(prompt, response),
"refused": self._is_refusal(response),
"over_refused": False,
}
if metrics["toxic"]:
self.toxic_responses += 1
if metrics["hallucinated"]:
self.hallucinated_responses += 1
# Check for over-refusal
if metrics["refused"] and not self._should_have_refused(prompt):
metrics["over_refused"] = True
return metrics
def get_current_rates(self) -> dict:
"""Get current safety metric rates."""
return {
"toxicity_rate": self.toxic_responses / max(self.total_responses, 1),
"hallucination_rate": self.hallucinated_responses / max(self.total_responses, 1),
"total_responses": self.total_responses,
}
Operational KPIs
Performance Metrics
| Metric | Definition | Target | Alert Threshold |
|---|---|---|---|
| TTFT | Time to First Token | <500ms | >1000ms |
| TPOT | Time Per Output Token | <50ms | >100ms |
| p50 Latency | Median end-to-end latency | <2s | >3s |
| p95 Latency | 95th percentile latency | <5s | >8s |
| p99 Latency | 99th percentile latency | <10s | >15s |
| Throughput | Requests per second | Based on capacity | <50% of target |
| Error Rate | % of requests that fail | <0.1% | >1% |
| Uptime | % of time service is available | >99.9% | <99.5% |
Capacity Metrics
class CapacityMetrics:
"""Track system capacity and utilization."""
def __init__(self, max_rps: int, max_concurrent: int):
self.max_rps = max_rps
self.max_concurrent = max_concurrent
self.current_rps = 0
self.current_concurrent = 0
self.peak_rps = 0
self.peak_concurrent = 0
def record_request_start(self):
self.current_concurrent += 1
self.current_rps += 1
self.peak_concurrent = max(self.peak_concurrent, self.current_concurrent)
self.peak_rps = max(self.peak_rps, self.current_rps)
def record_request_end(self):
self.current_concurrent -= 1
def get_utilization(self) -> dict:
return {
"rps_utilization": self.current_rps / self.max_rps,
"concurrent_utilization": self.current_concurrent / self.max_concurrent,
"peak_rps": self.peak_rps,
"peak_concurrent": self.peak_concurrent,
"headroom_rps": self.max_rps - self.current_rps,
"headroom_concurrent": self.max_concurrent - self.current_concurrent,
}
def is_at_capacity(self, threshold: float = 0.85) -> bool:
"""Check if we're approaching capacity limits."""
util = self.get_utilization()
return (
util["rps_utilization"] > threshold or
util["concurrent_utilization"] > threshold
)
Cost KPIs
Cost Tracking
| Metric | Definition | Target | How to Reduce |
|---|---|---|---|
| Cost per Request | Average cost of a single LLM call | <$0.01 for simple, <$0.10 for complex | Model routing, caching |
| Cost per Token | Average cost per 1K tokens | Varies by model | Smaller models, compression |
| Monthly LLM Spend | Total monthly cost across all providers | Within budget | Budget allocation, quotas |
| Cost per User | LLM cost attributed per user | Decreasing over time | Efficiency improvements |
| Cost per Successful Task | Cost divided by task completion rate | Decreasing | Better prompting, fewer retries |
| Token Utilization | % of generated tokens used by downstream | >80% | Shorter prompts, focused responses |
Cost Dashboard
class CostDashboard:
"""Track and visualize LLM costs."""
def __init__(self):
self.daily_costs: dict[str, float] = {} # date → cost
self.model_costs: dict[str, float] = {} # model → cost
self.endpoint_costs: dict[str, float] = {} # endpoint → cost
def record_cost(self, cost: float, model: str, endpoint: str, date: str = None):
"""Record a cost entry."""
from datetime import datetime
date = date or datetime.utcnow().strftime("%Y-%m-%d")
self.daily_costs[date] = self.daily_costs.get(date, 0) + cost
self.model_costs[model] = self.model_costs.get(model, 0) + cost
self.endpoint_costs[endpoint] = self.endpoint_costs.get(endpoint, 0) + cost
def get_daily_summary(self, days: int = 30) -> dict:
"""Get cost summary for recent days."""
from datetime import datetime, timedelta
cutoff = (datetime.utcnow() - timedelta(days=days)).strftime("%Y-%m-%d")
recent_costs = {
date: cost for date, cost in self.daily_costs.items()
if date >= cutoff
}
total = sum(recent_costs.values())
avg_daily = total / max(len(recent_costs), 1)
daily_values = sorted(recent_costs.values())
median_daily = daily_values[len(daily_values) // 2] if daily_values else 0
return {
"total": total,
"avg_daily": avg_daily,
"median_daily": median_daily,
"max_daily": max(recent_costs.values()) if recent_costs else 0,
"min_daily": min(recent_costs.values()) if recent_costs else 0,
"projected_monthly": avg_daily * 30,
}
def get_model_breakdown(self) -> list[dict]:
"""Get cost breakdown by model."""
total = sum(self.model_costs.values())
return [
{
"model": model,
"cost": cost,
"percentage": cost / total * 100 if total > 0 else 0,
}
for model, cost in sorted(self.model_costs.items(), key=lambda x: x[1], reverse=True)
]
def get_cost_anomalies(self, threshold: float = 2.0) -> list[dict]:
"""Detect unusual cost spikes."""
if len(self.daily_costs) < 7:
return []
values = list(self.daily_costs.values())
mean_cost = sum(values) / len(values)
std_cost = (sum((v - mean_cost) ** 2 for v in values) / len(values)) ** 0.5
anomalies = []
for date, cost in self.daily_costs.items():
if std_cost > 0 and abs(cost - mean_cost) > threshold * std_cost:
anomalies.append({
"date": date,
"cost": cost,
"expected": mean_cost,
"deviation": (cost - mean_cost) / std_cost,
})
return sorted(anomalies, key=lambda x: abs(x["deviation"]), reverse=True)
User Satisfaction Metrics
Tracking User Feedback
| Metric | Definition | Target | Collection Method |
|---|---|---|---|
| CSAT | Customer Satisfaction Score (1-5) | >4.0 | Thumbs up/down, star rating |
| NPS | Net Promoter Score (-100 to +100) | >30 | Periodic survey |
| Task Success Rate | % of tasks users complete successfully | >80% | Behavioral tracking |
| Time to Completion | How long users take to achieve their goal | Decreasing | Session analytics |
| Re-query Rate | % of queries where users immediately re-query | <10% | Session logs |
| Session Length | Average queries per session | Context-dependent | Session tracking |
| Retention Rate | % of users returning in subsequent weeks | >60% weekly | Cohort analysis |
Feedback Collection
class FeedbackCollector:
"""Collect and analyze user feedback."""
def __init__(self):
self.feedback: list[dict] = []
def record_feedback(
self,
user_id: str,
query: str,
response: str,
rating: int, # 1-5
feedback_text: str = None,
category: str = None,
):
"""Record user feedback."""
self.feedback.append({
"user_id": user_id,
"query": query,
"response": response,
"rating": rating,
"feedback_text": feedback_text,
"category": category,
"timestamp": datetime.utcnow().isoformat(),
})
def get_csat(self, window_days: int = 7) -> dict:
"""Calculate CSAT score."""
cutoff = (datetime.utcnow() - timedelta(days=window_days)).isoformat()
recent = [f for f in self.feedback if f["timestamp"] >= cutoff]
if not recent:
return {"csat": None, "count": 0}
ratings = [f["rating"] for f in recent]
return {
"csat": sum(ratings) / len(ratings),
"count": len(ratings),
"distribution": {
str(i): ratings.count(i) for i in range(1, 6)
},
}
def get_low_rated_queries(self, threshold: int = 2, top_k: int = 10) -> list[dict]:
"""Find the most poorly rated queries."""
low_rated = [f for f in self.feedback if f["rating"] <= threshold]
# Group by query pattern
from collections import Counter
query_counts = Counter(f["query"] for f in low_rated)
return [
{"query": query, "low_rating_count": count}
for query, count in query_counts.most_common(top_k)
]
def get_trending_issues(self) -> list[dict]:
"""Identify emerging issues from feedback text."""
recent = [
f for f in self.feedback
if f["rating"] <= 2 and f.get("feedback_text")
]
# Simple keyword-based categorization
categories = {
"incorrect_answer": ["wrong", "incorrect", "not right", "inaccurate"],
"too_verbose": ["too long", "verbose", "rambling", "too much"],
"too_brief": ["too short", "not enough", "insufficient"],
"hallucination": ["made up", "fabricated", "doesn't exist", "not true"],
"unsafe": ["inappropriate", "offensive", "harmful"],
}
issue_counts = Counter()
for feedback in recent:
text = feedback["feedback_text"].lower()
for category, keywords in categories.items():
if any(kw in text for kw in keywords):
issue_counts[category] += 1
return [
{"category": cat, "count": count}
for cat, count in issue_counts.most_common()
]
Dashboard Design
Executive Dashboard
┌─────────────────────────────────────────────────────────────┐
│ LLM Platform Overview │
│ Last 7 days | Updated 5m ago │
├─────────────────────────────────────────────────────────────┤
│ │
│ Quality Operations Cost Users │
│ ─────── ────────── ──── ───── │
│ Accuracy: 94% p50: 1.2s $4,230 CSAT: 4.3│
│ Safety: 99.8% p95: 3.8s $0.004/req NPS: 42 │
│ Hallucination: Throughput: $126K/mo Active: │
│ 2.3% 847 rps (proj) 12,450 │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Daily Cost Trend (last 30 days) │ │
│ │ $600 ┤ ╭──╮ │ │
│ │ $500 ┤ ╱ ╲ ╭──╮ │ │
│ │ $400 ┤ ╱ ╲ ╱ ╲ ╭──╮ │ │
│ │ $300 ┤ ╱ ╲╱ ╲ ╱ ╲ │ │
│ │ $200 ┤╱ ╲╱ ╲ │ │
│ │ └───────────────────────────────── │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────┐ ┌──────────────────┐ │
│ │ Top Issues │ │ Model Breakdown │ │
│ │ 1. Hallucination │ │ GPT-4o: 45% │ │
│ │ (23 reports) │ │ Claude: 30% │ │
│ │ 2. Latency spike │ │ Haiku: 15% │ │
│ │ (12 reports) │ │ Self-host: 10% │ │
│ │ 3. Over-refusal │ │ │ │
│ │ (8 reports) │ │ │ │
│ └──────────────────┘ └──────────────────┘ │
└─────────────────────────────────────────────────────────────┘
Alert Configuration
| Alert | Condition | Severity | Action |
|---|---|---|---|
| Quality drop | Accuracy drops below 85% for 1h | Critical | Page on-call, rollback |
| Latency spike | p95 > 2x baseline for 15m | High | Scale up, investigate |
| Cost anomaly | Daily cost > 2x moving average | Medium | Investigate, alert team |
| Error rate | Error rate > 1% for 5m | Critical | Page on-call, failover |
| Safety violation | Any toxic response detected | Critical | Immediate review, block model |
| User satisfaction | CSAT drops below 3.5 for 1 day | High | Review feedback, investigate |
| Capacity warning | Utilization > 85% for 30m | Medium | Scale up proactively |
Metric Interdependencies
Understanding how metrics relate helps avoid optimization traps:
| Trade-off | Description | Resolution |
|---|---|---|
| Quality vs. Cost | Better models cost more | Route by task complexity |
| Latency vs. Quality | Stronger models are slower | Use smaller models for simple tasks |
| Safety vs. Helpfulness | Over-refusal reduces helpfulness | Tune refusal thresholds carefully |
| Conciseness vs. Completeness | Shorter may miss details | Adapt length to query complexity |
| Throughput vs. Latency | More concurrent requests = slower | Scale horizontally |
Implementation Checklist
Phase 1: Foundation (Week 1-2)
- Instrument all LLM calls with logging
- Track basic cost metrics (tokens, $/request)
- Monitor latency (p50, p95, p99)
- Set up error rate monitoring
- Collect thumbs up/down feedback
Phase 2: Quality (Week 3-4)
- Implement LLM-as-a-judge evaluation
- Track hallucination rate
- Measure task success rate
- Build safety metrics pipeline
- Create quality regression tests
Phase 3: User Insights (Week 5-6)
- Calculate CSAT and NPS
- Track retention and engagement
- Analyze feedback patterns
- Build executive dashboard
- Set up alerting thresholds
Phase 4: Optimization (Ongoing)
- Cost optimization reviews (weekly)
- Quality trend analysis (weekly)
- Capacity planning (monthly)
- Model comparison experiments
- A/B test framework for changes
Cross-References
- For a broader evaluation framework, see Evaluation Metrics & Benchmarks
- For monitoring and observability infrastructure, see LLM Observability & Monitoring
- For cost optimization across providers, see Model Hub & Federation
- For governance and compliance tracking, see Generative AI Governance
Related docs
Data Platform Monitor Agent Implementation Guide
Architecture, workflow design, metrics, and rollout guidance for a data platform monitor agent in production.
Data Platform Reviewer Agent Implementation Guide
Architecture, workflow design, metrics, and rollout guidance for a data platform reviewer agent in production.
Developer Productivity Monitor Agent Implementation Guide
Architecture, workflow design, metrics, and rollout guidance for a developer productivity monitor agent in production.
Related agents
Data Platform Monitor Agent
Data Platform agent blueprint focused on watch workflows over time, detect drift or failures, and surface the smallest useful signal to operators quickly for analysts and engineers need better query generation, pipeline debugging, and dataset explanation across changing schemas.
Data Platform Reviewer Agent
Data Platform agent blueprint focused on inspect drafts, tool outputs, or decisions for gaps, policy issues, and missing evidence before work moves forward for analysts and engineers need better query generation, pipeline debugging, and dataset explanation across changing schemas.
Developer Productivity Monitor Agent
Developer Productivity agent blueprint focused on watch workflows over time, detect drift or failures, and surface the smallest useful signal to operators quickly for engineering teams want reliable help with issue triage, runbook guidance, and change review without obscuring system ownership.