Advanced Technical

LLM Testing & Debugging

Systematic approaches to testing and debugging LLM applications — unit testing prompts, integration testing chains, regression testing model updates, and production debugging

Published: 2026-04-19 · Last updated: 2026-04-13

LLM Testing & Debugging

Testing LLM applications requires a fundamentally different approach than traditional software testing. Outputs are non-deterministic, quality is subjective, and bugs often manifest as subtle quality regressions rather than crashes. This guide provides systematic methodologies for testing and debugging LLM applications across the entire development lifecycle.

The Testing Challenge

Traditional software testing assumes:

Deterministic outputs given the same inputs
Binary pass/fail criteria
Clear specifications

LLM applications violate all three assumptions. Outputs vary between runs, quality exists on a spectrum, and "correct" behavior is often ambiguous. This requires a testing pyramid adapted for LLM systems:

        ┌─────────────┐
        │  E2E Tests  │  Full pipeline, golden datasets
        ├─────────────┤
        │ Integration │  Multi-step chains, tool use
        ├─────────────┤
        │  Unit Tests │  Individual prompts, parsers
        ├─────────────┤
        │ Eval Suites │  Quality benchmarks, regression
        └─────────────┘

Unit Testing Prompts

Testing Prompt Structure

import pytest

def test_system_prompt_includes_role():
    """Verify the system prompt establishes the correct role."""
    from myapp.prompts import SYSTEM_PROMPT

    assert "You are a helpful coding assistant" in SYSTEM_PROMPT
    assert "Do not provide medical advice" in SYSTEM_PROMPT
    assert "Respond in JSON format" in SYSTEM_PROMPT


def test_prompt_template_renders_correctly():
    """Test that the prompt template fills in variables properly."""
    from myapp.prompts import render_answer_prompt

    prompt = render_answer_prompt(
        context="Python is a programming language.",
        question="What is Python?",
    )

    assert "Python is a programming language." in prompt
    assert "What is Python?" in prompt
    assert "{{context}}" not in prompt  # No unfilled template variables
    assert "{{question}}" not in prompt

Testing Output Parsers

def test_json_response_parser():
    """Test the parser handles valid JSON responses."""
    from myapp.parsers import parse_json_response

    # Valid JSON
    result = parse_json_response('{"answer": "yes", "confidence": 0.95}')
    assert result["answer"] == "yes"
    assert result["confidence"] == pytest.approx(0.95)

    # JSON with markdown code block
    result = parse_json_response('```json\n{"answer": "yes"}\n```')
    assert result["answer"] == "yes"

    # JSON with surrounding text
    result = parse_json_response('Here is the answer:\n{"answer": "yes"}\nHope this helps!')
    assert result["answer"] == "yes"


def test_json_parser_handles_errors():
    """Test the parser handles malformed responses gracefully."""
    from myapp.parsers import parse_json_response, ParseError

    with pytest.raises(ParseError, match="Could not extract JSON"):
        parse_json_response("This is not JSON at all")

    with pytest.raises(ParseError, match="Missing required field"):
        parse_json_response('{"answer": "yes"}')  # missing "confidence"

Testing with Mocked LLM Responses

from unittest.mock import AsyncMock, patch

@pytest.mark.asyncio
async def test_answer_chain_with_mock_llm():
    """Test the full answer chain without calling the actual LLM."""
    mock_response = '{"answer": "Paris is the capital of France", "sources": ["encyclopedia"]}'

    with patch("myapp.llm_client.generate", new_callable=AsyncMock) as mock_llm:
        mock_llm.return_value = mock_response

        result = await answer_question("What is the capital of France?")

        assert result["answer"] == "Paris is the capital of France"
        assert "encyclopedia" in result["sources"]
        mock_llm.assert_called_once()

Integration Testing Chains

Testing Multi-Step Workflows

import pytest

class TestRAGChain:
    """Integration tests for the full RAG pipeline."""

    @pytest.fixture
    def rag_chain(self):
        return RAGChain(
            retriever=TestRetriever(),
            llm=MockLLM(),
            prompt_template=DEFAULT_PROMPT,
        )

    async def test_retrieval_then_generation(self, rag_chain):
        """Test the full chain with known inputs."""
        result = await rag_chain.query("How do I reset my password?")

        # Verify retrieval happened
        assert len(result.sources) > 0
        assert any("password reset" in s.title for s in result.sources)

        # Verify generation quality
        assert "password" in result.answer.lower()
        assert len(result.answer) > 50  # Not a trivial response

    async def test_handles_no_results(self, rag_chain):
        """Test behavior when retrieval returns nothing."""
        result = await rag_chain.query("XYZNONEXISTENTQUERY123")

        assert result.answer == "I could not find information about that topic."
        assert result.sources == []
        assert result.fallback_used is True

    async def test_hallucination_detection(self, rag_chain):
        """Test that the chain flags unsupported answers."""
        result = await rag_chain.query("What is the meaning of life?")

        # This query should trigger the hallucination detector
        assert result.confidence < 0.5 or result.hallucination_warning is True

Testing Tool-Using Agents

class TestAgentTools:
    """Test agent tool calling with controlled inputs."""

    @pytest.fixture
    def agent(self):
        return Agent(
            llm=MockLLM(responses=[
                '{"action": "search", "query": "weather in London"}',
                '{"action": "final_answer", "answer": "It is 15°C in London"}',
            ]),
            tools=[WeatherTool(), SearchTool()],
        )

    async def test_tool_selection(self, agent):
        """Test that the agent selects the correct tool."""
        result = await agent.run("What is the weather in London?")

        # Verify the agent called the search tool first
        first_action = agent.history[0]
        assert first_action["action"] == "search"

        # Verify the final answer uses tool results
        assert "15°C" in result.answer
        assert len(agent.history) == 2  # search + final_answer

    async def test_tool_error_handling(self, agent):
        """Test agent behavior when a tool fails."""
        # Simulate a tool error
        agent.tools["weather"].raise_error = True

        result = await agent.run("What is the weather in London?")

        # Agent should retry with a different approach or report the error
        assert result.error is not None or result.fallback_response is not None

Regression Testing Model Updates

Golden Dataset Testing

class GoldenTestSuite:
    """Maintain a set of input/output pairs that should not regress."""

    def __init__(self, test_file: str = "golden_tests.jsonl"):
        self.tests = self._load(test_file)

    def _load(self, path: str) -> list[dict]:
        import json
        tests = []
        with open(path) as f:
            for line in f:
                tests.append(json.loads(line))
        return tests

    def run(self, model_fn) -> dict:
        """Run all golden tests and report pass/fail."""
        results = []
        for test in self.tests:
            output = model_fn(test["input"])
            passed = self._evaluate(output, test)
            results.append({
                "id": test["id"],
                "passed": passed,
                "input": test["input"],
                "expected": test.get("expected_output"),
                "actual": output,
            })

        pass_rate = sum(1 for r in results if r["passed"]) / len(results)
        return {
            "total": len(results),
            "passed": sum(1 for r in results if r["passed"]),
            "failed": sum(1 for r in results if not r["passed"]),
            "pass_rate": pass_rate,
            "results": results,
        }

    def _evaluate(self, output: str, test: dict) -> bool:
        """Evaluate output against golden test criteria."""
        if "exact_match" in test:
            return output.strip() == test["exact_match"].strip()
        if "contains" in test:
            return all(term in output for term in test["contains"])
        if "llm_judge" in test:
            # Use an LLM judge for quality assessment
            return llm_judge(output, test["llm_judge"]["criteria"]) >= test["llm_judge"]["min_score"]
        return False

Comparing Model Versions

def compare_model_versions(
    model_a,
    model_b,
    test_cases: list[str],
    eval_fn,
) -> dict:
    """Compare two model versions across a test suite."""
    results_a = [eval_fn(model_a(q), q) for q in test_cases]
    results_b = [eval_fn(model_b(q), q) for q in test_cases]

    comparison = {
        "model_a_avg": sum(results_a) / len(results_a),
        "model_b_avg": sum(results_b) / len(results_b),
        "improvement": sum(results_b) / len(results_b) - sum(results_a) / len(results_a),
        "better_on": sum(1 for a, b in zip(results_a, results_b) if b > a),
        "worse_on": sum(1 for a, b in zip(results_a, results_b) if b < a),
        "same_on": sum(1 for a, b in zip(results_a, results_b) if b == a),
        "details": [
            {"query": q, "score_a": a, "score_b": b, "delta": b - a}
            for q, a, b in zip(test_cases, results_a, results_b)
        ],
    }

    return comparison

Regression testing checklist:

Test Type	What to Check	Frequency
Golden dataset	Pre-defined input/output pairs	Every model change
Quality regression	Aggregate quality metrics	Every deployment
Safety regression	Safety test suite pass rate	Every model change
Latency regression	Response time p50/p95/p99	Every deployment
Cost regression	Token usage per query	Weekly
Bias regression	Demographic parity tests	Monthly

Debugging Production Issues

Systematic Debugging Framework

User reports a problem
         │
         ▼
┌─────────────────────────┐
│ 1. REPRODUCE            │  Can you trigger it consistently?
└────────┬────────────────┘
         ▼
┌─────────────────────────┐
│ 2. ISOLATE              │  Which component is failing?
│  - Prompt?              │
│  - Retrieval?           │
│  - Generation?          │
│  - Parsing?             │
└────────┬────────────────┘
         ▼
┌─────────────────────────┐
│ 3. MINIMIZE             │  Find the smallest failing case
└────────┬────────────────┘
         ▼
┌─────────────────────────┐
│ 4. HYPOTHESIZE          │  What change would fix it?
└────────┬────────────────┘
         ▼
┌─────────────────────────┐
│ 5. VERIFY               │  Test the fix on the minimal case
└────────┬────────────────┘
         ▼
┌─────────────────────────┐
│ 6. REGRESSION TEST      │  Add to golden dataset
└─────────────────────────┘

Debug Logging for LLM Calls

import logging
import json
from functools import wraps

logger = logging.getLogger("llm.debug")

def debug_llm_call(func):
    """Decorator that logs LLM inputs and outputs for debugging."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        call_id = generate_call_id()
        prompt = kwargs.get("prompt") or args[0] if args else None

        logger.info(f"[{call_id}] LLM call starting")
        logger.debug(f"[{call_id}] Prompt: {prompt[:500]}...")
        logger.debug(f"[{call_id}] Parameters: temperature={kwargs.get('temperature', 'default')}")

        try:
            import time
            start = time.time()
            result = func(*args, **kwargs)
            elapsed = time.time() - start

            logger.info(f"[{call_id}] LLM call completed in {elapsed:.2f}s")
            logger.debug(f"[{call_id]}] Response: {result[:500]}...")
            logger.debug(f"[{call_id}] Tokens used: {result.usage.total_tokens}")

            return result

        except Exception as e:
            logger.error(f"[{call_id}] LLM call failed: {e}")
            raise

    return wrapper

Debugging Hallucinations

def debug_hallucination(query: str, response: str, sources: list[dict]) -> dict:
    """Analyze whether a response is grounded in the provided sources."""
    analysis = {
        "query": query,
        "response": response,
        "hallucination_risk": "low",
        "unsupported_claims": [],
        "supported_claims": [],
        "source_coverage": 0.0,
    }

    # Extract claims from the response
    claims = extract_claims(response)

    for claim in claims:
        supported_by = []
        for source in sources:
            if claim in source["content"]:
                supported_by.append(source["id"])

        if supported_by:
            analysis["supported_claims"].append({
                "claim": claim,
                "sources": supported_by,
            })
        else:
            analysis["unsupported_claims"].append(claim)

    # Calculate risk level
    total_claims = len(claims)
    unsupported = len(analysis["unsupported_claims"])

    if total_claims > 0:
        analysis["source_coverage"] = 1 - (unsupported / total_claims)

    if analysis["source_coverage"] < 0.5:
        analysis["hallucination_risk"] = "high"
    elif analysis["source_coverage"] < 0.8:
        analysis["hallucination_risk"] = "medium"

    return analysis

Debugging Retrieval Failures

def debug_retrieval(query: str, results: list[dict], threshold: float = 0.3) -> dict:
    """Diagnose why retrieval may have failed to find relevant documents."""
    analysis = {
        "query": query,
        "num_results": len(results),
        "max_similarity": max((r["score"] for r in results), default=0.0),
        "min_similarity": min((r["score"] for r in results), default=0.0),
        "avg_similarity": sum(r["score"] for r in results) / len(results) if results else 0.0,
        "possible_issues": [],
    }

    if analysis["max_similarity"] < threshold:
        analysis["possible_issues"].append(
            "All results have low similarity — query may be out of domain"
        )

    if analysis["avg_similarity"] < 0.1:
        analysis["possible_issues"].append(
            "Very low average similarity — embedding model may not match domain"
        )

    score_range = analysis["max_similarity"] - analysis["min_similarity"]
    if score_range < 0.05:
        analysis["possible_issues"].append(
            "Scores are too uniform — index may need re-normalization"
        )

    if len(results) == 0:
        analysis["possible_issues"].append(
            "Zero results — index may be empty or query embedding failed"
        )

    # Check for query issues
    if len(query.split()) < 2:
        analysis["possible_issues"].append(
            "Very short query — consider query expansion"
        )

    return analysis

Production Issue Checklist

When debugging a production LLM issue, check these systematically:

Check	Command / Approach
Model API status	`curl https://api.openai.com/v1/models`
Token usage spike	Query APM for `llm.tokens.total` metric
Latency regression	Compare p95 latency to 7-day baseline
Error rate spike	Check `llm.errors.rate` in monitoring
Prompt changes	`git diff HEAD~10..HEAD -- prompts/`
Index freshness	Compare index timestamp to latest document
Embedding drift	Re-embed sample queries, compare to cached
Rate limiting	Check `429` response count in logs

Automated Testing Pipeline

# .github/workflows/llm-tests.yml
name: LLM Tests

on: [pull_request]

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pytest tests/unit/ --tb=short

  golden-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pytest tests/golden/ -v
        env:
          OPENAI_API_KEY: ${{ secrets.TEST_API_KEY }}

  quality-regression:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run quality eval
        run: python scripts/run_eval_suite.py
        env:
          OPENAI_API_KEY: ${{ secrets.TEST_API_KEY }}
      - name: Check for regression
        run: python scripts/check_regression.py
        # Fail if quality drops by more than 2%

  safety-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pytest tests/safety/ -v
        # Must pass 100% — no regressions on safety

Cross-References

For a broader evaluation framework, see Evaluation Metrics & Benchmarks
For prompt engineering fundamentals, see Prompt Engineering
For detecting hallucinations systematically, see Hallucination Detection & Mitigation
For tracking quality in production, see LLM Metrics & KPIs

Related docs

Model Versioning Management

Managing model versions in production — rollback strategies, A/B testing, canary deployments, version compatibility, and lifecycle management

Prompt Security Testing

Systematic prompt security testing methodology — injection testing, jailbreak detection, output validation, and continuous security monitoring

Data Platform Evaluator Agent Implementation Guide

Architecture, workflow design, metrics, and rollout guidance for a data platform evaluator agent in production.

Related agents

Data Platform Evaluator Agent

Data Platform agent blueprint focused on score outputs against explicit rubrics so teams can compare variants, regressions, and rollout quality over time for analysts and engineers need better query generation, pipeline debugging, and dataset explanation across changing schemas.

Developer Productivity Evaluator Agent

Developer Productivity agent blueprint focused on score outputs against explicit rubrics so teams can compare variants, regressions, and rollout quality over time for engineering teams want reliable help with issue triage, runbook guidance, and change review without obscuring system ownership.

Finance Operations Evaluator Agent

Finance Operations agent blueprint focused on score outputs against explicit rubrics so teams can compare variants, regressions, and rollout quality over time for finance teams need faster reconciliation, exception review, and policy-aware reporting for recurring operational workflows.