Deployment & Infrastructure

LLM Observability and Monitoring

Tracking LLM behavior in production — logging, tracing, evaluation pipelines, drift detection, and alerting for AI systems

Published: 2026-04-12 · Last updated: 2026-04-13

LLM Observability and Monitoring

Traditional monitoring tracks latency, errors, and throughput. LLM monitoring must additionally track output quality, safety, coherence, and drift — metrics that are inherently subjective and require specialized tooling.

What to Monitor

System Metrics (Standard)

Metric	Why It Matters	Alert Threshold
Latency (P50, P95, P99)	User experience	P99 > 10s
Error rate	System health	> 1%
Token throughput	Capacity planning	Degradation
GPU memory usage	Resource health	> 90%
Queue depth	Demand vs supply	> 100 pending

LLM-Specific Metrics

Metric	Why It Matters	How to Measure
Output quality	Are responses useful?	Human review, LLM-as-judge
Hallucination rate	Are facts correct?	Fact-checking pipeline
Toxicity rate	Is output safe?	Toxicity classifier
Response length drift	Are responses degrading?	Statistical monitoring
Prompt injection attempts	Security threats	Pattern detection
Token cost per request	Budget tracking	Usage tracking

Tracing and Logging

Structured Request Logging

import json
import uuid
from datetime import datetime

def log_llm_request(
    prompt: str,
    response: str,
    model: str,
    input_tokens: int,
    output_tokens: int,
    latency_ms: float,
    user_id: str,
    metadata: dict = None,
):
    """Log a single LLM interaction."""
    log_entry = {
        "trace_id": str(uuid.uuid4()),
        "timestamp": datetime.utcnow().isoformat(),
        "model": model,
        "user_id": user_id,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "latency_ms": latency_ms,
        "prompt_preview": prompt[:200],  # Don't log full prompts (PII risk)
        "response_preview": response[:200],
        "metadata": metadata or {},
    }
    
    # Write to structured log / analytics pipeline
    with open("llm-requests.jsonl", "a") as f:
        f.write(json.dumps(log_entry) + "\n")

Distributed Tracing with OpenTelemetry

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer("llm-service")

@tracer.start_as_current_span("llm_completion")
def llm_completion(user_query: str) -> str:
    span = trace.get_current_span()
    span.set_attribute("query_length", len(user_query))
    span.set_attribute("model", "gpt-4o")
    
    start = time.time()
    response = call_model(user_query)
    elapsed = (time.time() - start) * 1000
    
    span.set_attribute("response_length", len(response))
    span.set_attribute("latency_ms", elapsed)
    span.set_attribute("input_tokens", count_tokens(user_query))
    span.set_attribute("output_tokens", count_tokens(response))
    
    return response

Quality Evaluation Pipeline

Automated Quality Checks

from textstat import flesch_reading_ease
import openai

def evaluate_response_quality(prompt: str, response: str) -> dict:
    """Automated quality scoring."""
    scores = {}
    
    # Readability
    scores["readability"] = flesch_reading_ease(response)
    
    # Coherence (LLM-as-judge)
    judgment = openai.ChatCompletion.create(
        model="gpt-4o-mini",  # Cheap model for judging
        messages=[{
            "role": "system",
            "content": "Rate the response quality from 1-5 based on: relevance, accuracy, helpfulness."
        }, {
            "role": "user",
            "content": f"Prompt: {prompt}\nResponse: {response}"
        }]
    )
    scores["llm_judge_score"] = int(judgment.choices[0].message.content.strip())
    
    # Toxicity check
    toxicity = toxicity_classifier.predict(response)
    scores["toxicity"] = toxicity
    
    # Length appropriateness
    prompt_tokens = count_tokens(prompt)
    response_tokens = count_tokens(response)
    scores["length_ratio"] = response_tokens / max(prompt_tokens, 1)
    
    return scores

Human-in-the-Loop Review

# Sample low-confidence or flagged responses for human review
def flag_for_review(log_entry: dict, quality_scores: dict) -> bool:
    """Determine if a response needs human review."""
    if quality_scores.get("toxicity", 0) > 0.5:
        return True
    if quality_scores.get("llm_judge_score", 5) <= 2:
        return True
    if log_entry["output_tokens"] > 2000:  # Unusually long
        return True
    if "I don't know" in log_entry["response_preview"].lower():
        return True
    return False

# Route flagged responses to review queue
if flag_for_review(log_entry, scores):
    review_queue.add({
        "trace_id": log_entry["trace_id"],
        "prompt": log_entry["prompt_preview"],
        "response": log_entry["response_preview"],
        "scores": scores,
        "flagged_at": datetime.utcnow().isoformat(),
    })

Drift Detection

Response Distribution Drift

import numpy as np
from scipy import stats

def detect_drift(
    baseline_lengths: list[int],
    current_lengths: list[int],
    window_size: int = 1000,
) -> dict:
    """Detect if response lengths have drifted."""
    baseline = np.random.choice(baseline_lengths, window_size)
    current = np.random.choice(current_lengths, window_size)
    
    # Kolmogorov-Smirnov test
    ks_statistic, p_value = stats.ks_2samp(baseline, current)
    
    return {
        "drift_detected": p_value < 0.01,
        "p_value": p_value,
        "ks_statistic": ks_statistic,
        "baseline_mean": np.mean(baseline),
        "current_mean": np.mean(current),
    }

Topic/Domain Drift

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def detect_topic_drift(
    baseline_prompts: list[str],
    current_prompts: list[str],
) -> float:
    """Detect if user query topics have shifted."""
    vectorizer = TfidfVectorizer(max_features=1000)
    all_texts = baseline_prompts + current_prompts
    
    tfidf_matrix = vectorizer.fit_transform(all_texts)
    baseline_tfidf = tfidf_matrix[:len(baseline_prompts)]
    current_tfidf = tfidf_matrix[len(baseline_prompts):]
    
    # Compare mean topic distributions
    baseline_mean = baseline_tfidf.mean(axis=0)
    current_mean = current_tfidf.mean(axis=0)
    
    similarity = cosine_similarity(baseline_mean, current_mean)[0][0]
    return float(similarity)  # 1.0 = identical, < 0.8 = significant drift

Observability Platforms

Platform	Features	Best For
LangSmith	Tracing, evaluation, datasets	LangChain users
Arize Phoenix	Embedding visualization, drift	Open-source, self-hosted
Helicone	Proxy, caching, analytics	Cost tracking + observability
Galileo	Quality monitoring, alerting	Enterprise
WhyLabs	Data + LLM monitoring	Statistical rigor
Custom (OpenTelemetry + Grafana)	Full control	Engineering teams

Alerting Rules

# Example: Prometheus alerting rules for LLM service
groups:
  - name: llm-alerts
    rules:
      - alert: HighLatency
        expr: histogram_quantile(0.99, rate(llm_request_latency_seconds[5m])) > 10
        for: 5m
        annotations:
          summary: "P99 latency above 10s"
      
      - alert: HighErrorRate
        expr: rate(llm_requests_total{status="error"}[5m]) / rate(llm_requests_total[5m]) > 0.05
        for: 2m
        annotations:
          summary: "Error rate above 5%"
      
      - alert: ToxicitySpike
        expr: rate(toxic_responses_total[5m]) > 0.02
        for: 5m
        annotations:
          summary: "Toxicity rate above 2%"
      
      - alert: BudgetThreshold
        expr: monthly_llm_spend > budget * 0.8
        annotations:
          summary: "80% of monthly budget consumed"

Key Takeaways

LLM observability requires both system metrics AND quality metrics
Structured logging with trace IDs enables debugging individual requests
Automated quality checks (LLM-as-judge, toxicity) catch issues before humans do
Drift detection identifies when user behavior or model output changes
Sample flagged responses for human review to maintain quality over time

Cost Management — Tracking spend alongside quality
Evaluation Metrics — Defining quality benchmarks
Safety and Red-teaming — Detecting adversarial inputs

Related docs

LLM Metrics & KPIs

Defining and tracking LLM success metrics — quality KPIs, cost KPIs, user satisfaction, throughput targets, and dashboard design

LLM Bias Mitigation

Understanding and mitigating bias in LLM outputs — demographic bias, cultural bias, measurement techniques, debiasing strategies, and continuous monitoring

Model Comparison Guide

A systematic methodology for comparing LLMs — benchmark analysis, cost evaluation, task-specific assessment, and selection frameworks

Related agents

Data Platform Evaluator Agent

Data Platform agent blueprint focused on score outputs against explicit rubrics so teams can compare variants, regressions, and rollout quality over time for analysts and engineers need better query generation, pipeline debugging, and dataset explanation across changing schemas.

Data Platform Monitor Agent

Data Platform agent blueprint focused on watch workflows over time, detect drift or failures, and surface the smallest useful signal to operators quickly for analysts and engineers need better query generation, pipeline debugging, and dataset explanation across changing schemas.

Developer Productivity Evaluator Agent

Developer Productivity agent blueprint focused on score outputs against explicit rubrics so teams can compare variants, regressions, and rollout quality over time for engineering teams want reliable help with issue triage, runbook guidance, and change review without obscuring system ownership.