Evaluation & Safety

Evaluation Metrics and Benchmarks

How to measure LLM capability — from academic benchmarks (MMLU, GSM8K, HumanEval) to practical evaluation pipelines for production systems

Published: 2026-04-13 · Last updated: 2026-04-13

Evaluation Metrics and Benchmarks

Evaluating LLMs is fundamentally different from evaluating traditional software. Outputs are open-ended, correctness is often subjective, and capabilities span dozens of domains. This guide covers both academic benchmarks and practical evaluation strategies.

Academic Benchmarks

Knowledge and Understanding

Benchmark	Task	Format	What It Measures
MMLU	57 subjects (STEM, humanities, etc.)	Multiple choice	Broad knowledge
MMLU-Pro	Harder MMLU with reasoning	Multiple choice + reasoning	Deep understanding
HellaSwag	Sentence completion	Multiple choice	Commonsense reasoning
ARC	Science questions	Multiple choice	Scientific reasoning
TruthfulQA	Factual accuracy	Multiple choice + generation	Resistance to misconceptions

Reasoning and Math

Benchmark	Task	What It Measures
GSM8K	Grade school math word problems	Multi-step arithmetic
MATH	Competition-level math	Advanced mathematical reasoning
AIME	American Invitational Math Exam	Olympiad-level problem solving
GPQA	Graduate-level science questions	Expert-level reasoning

Code Generation

Benchmark	Task	Metric
HumanEval	164 Python functions	Pass@1 (does the code pass tests?)
MBPP	974 Python tasks	Pass@1
HumanEval+	Extended HumanEval	Pass@1, Pass@10
LiveCodeBench	Recent competition problems	Pass@1 on unseen problems
SWE-bench	Real GitHub issues	% resolved autonomously

Instruction Following

Benchmark	Task	What It Measures
IFEval	Followable instructions	Instruction-following accuracy
AlpacaEval	Pairwise comparison	Overall helpfulness
Arena Hard	Challenging prompts	Capability on hard tasks

Typical Model Scores

Approximate scores for reference:

Model	MMLU	GSM8K	HumanEval	TruthfulQA
GPT-4o	88%	95%	90%	75%
Claude Sonnet 4	90%	96%	92%	80%
Llama 3.1 405B	85%	90%	85%	70%
Llama 3.1 70B	79%	83%	75%	65%
Llama 3.2 3B	62%	55%	45%	50%
Mistral Large	80%	80%	72%	63%
DeepSeek V3	85%	92%	86%	72%

Note: Scores change frequently with new model releases. Always check official benchmark reports.

Running Benchmarks Yourself

Using lm-evaluation-harness

# Install
pip install lm-eval

# Evaluate a Hugging Face model
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-3.2-3B \
    --tasks mmlu,gsm8k,hellaswag \
    --device cuda:0 \
    --batch_size 8 \
    --output_results

# Evaluate via OpenAI-compatible API
lm_eval --model openai \
    --model_args model=meta-llama/Llama-3.2-3B,base_url=http://localhost:8000/v1 \
    --tasks mmlu \
    --num_fewshot 5

Custom Evaluation Pipeline

import json
from openai import OpenAI

client = OpenAI()

def evaluate_model(model_name: str, benchmark_data: list[dict]) -> dict:
    """Run a custom benchmark on a model."""
    results = []
    
    for item in benchmark_data:
        response = client.chat.completions.create(
            model=model_name,
            messages=[{"role": "user", "content": item["prompt"]}],
            temperature=0,  # Deterministic for evaluation
        )
        
        answer = response.choices[0].message.content.strip()
        is_correct = check_answer(answer, item["expected"])
        
        results.append({
            "prompt": item["prompt"],
            "expected": item["expected"],
            "actual": answer,
            "correct": is_correct,
        })
    
    accuracy = sum(1 for r in results if r["correct"]) / len(results)
    return {
        "model": model_name,
        "accuracy": accuracy,
        "total": len(results),
        "correct": sum(1 for r in results if r["correct"]),
        "details": results,
    }

Production Evaluation

Academic benchmarks measure general capability, but production systems need task-specific evaluation.

Task-Specific Evaluation

# Define YOUR evaluation criteria
def evaluate_customer_support(model_name: str) -> dict:
    """Evaluate on your actual support tickets."""
    tickets = load_support_tickets()
    
    metrics = {
        "accuracy": [],      # Does it give the right answer?
        "tone": [],           # Is it appropriately professional?
        "completeness": [],   # Does it address all issues raised?
        "safety": [],         # Does it avoid harmful commitments?
    }
    
    for ticket in tickets:
        response = generate_response(model_name, ticket)
        
        # Human or LLM-as-judge scoring
        scores = score_response(response, ticket["rubric"])
        metrics["accuracy"].append(scores["accuracy"])
        metrics["tone"].append(scores["tone"])
        metrics["completeness"].append(scores["completeness"])
        metrics["safety"].append(scores["safety"])
    
    return {metric: sum(scores)/len(scores) for metric, scores in metrics.items()}

LLM-as-a-Judge

def llm_judge(prompt: str, response: str, rubric: str) -> dict:
    """Use GPT-4o as an evaluator."""
    judgment = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": f"""You are an expert evaluator. Score the response based on this rubric:
{rubric}

Return ONLY a JSON: {{"score": 1-5, "reasoning": "brief explanation"}}"""
        }, {
            "role": "user",
            "content": f"Prompt: {prompt}\nResponse: {response}"
        }],
        response_format={"type": "json_object"},
    )
    
    return json.loads(judgment.choices[0].message.content)

A/B Testing Models

def ab_test_model_comparison(
    model_a: str,
    model_b: str,
    test_prompts: list[str],
    judge_model: str = "gpt-4o",
) -> dict:
    """Compare two models on a test set."""
    wins_a, wins_b, ties = 0, 0, 0
    
    for prompt in test_prompts:
        response_a = call_model(model_a, prompt)
        response_b = call_model(model_b, prompt)
        
        judgment = llm_judge_compare(prompt, response_a, response_b, judge_model)
        
        if judgment == "A":
            wins_a += 1
        elif judgment == "B":
            wins_b += 1
        else:
            ties += 1
    
    total = len(test_prompts)
    return {
        "model_a": model_a,
        "model_b": model_b,
        "a_win_rate": f"{wins_a/total:.1%}",
        "b_win_rate": f"{wins_b/total:.1%}",
        "tie_rate": f"{ties/total:.1%}",
    }

Evaluation Best Practices

Test on YOUR data: Academic benchmarks are gameable; use your own evaluation set
Use chain-of-thought: Some models need CoT prompting to show their true capability
Temperature matters: Use temperature=0 for evaluation consistency
Multiple runs: Average over 3-5 runs to reduce variance
Human spot-checks: Automated metrics miss nuance; always have humans review samples
Monitor degradation: Re-run evaluation when models are updated

Key Takeaways

MMLU measures knowledge, GSM8K measures reasoning, HumanEval measures coding
Academic benchmarks don't always correlate with production performance
Build your own evaluation set from real user queries
LLM-as-a-judge is convenient but has its own biases
Always evaluate at temperature=0 for consistency
A/B test models on your actual workload before committing

Prompt Engineering — Optimizing prompts before evaluation
Observability — Continuous evaluation in production
Safety and Red-teaming — Evaluating safety properties

Related docs

LLM Metrics & KPIs

Defining and tracking LLM success metrics — quality KPIs, cost KPIs, user satisfaction, throughput targets, and dashboard design

LLM Bias Mitigation

Understanding and mitigating bias in LLM outputs — demographic bias, cultural bias, measurement techniques, debiasing strategies, and continuous monitoring

Prompt Security Testing

Systematic prompt security testing methodology — injection testing, jailbreak detection, output validation, and continuous security monitoring

Related agents

Data Platform Evaluator Agent

Data Platform agent blueprint focused on score outputs against explicit rubrics so teams can compare variants, regressions, and rollout quality over time for analysts and engineers need better query generation, pipeline debugging, and dataset explanation across changing schemas.

Data Platform Reviewer Agent

Data Platform agent blueprint focused on inspect drafts, tool outputs, or decisions for gaps, policy issues, and missing evidence before work moves forward for analysts and engineers need better query generation, pipeline debugging, and dataset explanation across changing schemas.

Developer Productivity Evaluator Agent

Developer Productivity agent blueprint focused on score outputs against explicit rubrics so teams can compare variants, regressions, and rollout quality over time for engineering teams want reliable help with issue triage, runbook guidance, and change review without obscuring system ownership.