Topic Hub

Testing

31 linked pages across the LLM-Docs library.

doc

Prompt Security Testing

Systematic prompt security testing methodology — injection testing, jailbreak detection, output validation, and continuous security monitoring

doc

LLM Testing & Debugging

Systematic approaches to testing and debugging LLM applications — unit testing prompts, integration testing chains, regression testing model updates, and production debugging

doc

Data Platform Evaluator Agent Implementation Guide

Architecture, workflow design, metrics, and rollout guidance for a data platform evaluator agent in production.

doc

Developer Productivity Evaluator Agent Implementation Guide

Architecture, workflow design, metrics, and rollout guidance for a developer productivity evaluator agent in production.

doc

Finance Operations Evaluator Agent Implementation Guide

Architecture, workflow design, metrics, and rollout guidance for a finance operations evaluator agent in production.

doc

Growth Marketing Evaluator Agent Implementation Guide

Architecture, workflow design, metrics, and rollout guidance for a growth marketing evaluator agent in production.

doc

Healthcare Operations Evaluator Agent Implementation Guide

Architecture, workflow design, metrics, and rollout guidance for a healthcare operations evaluator agent in production.

doc

Legal Compliance Evaluator Agent Implementation Guide

Architecture, workflow design, metrics, and rollout guidance for a legal compliance evaluator agent in production.

doc

Research Intelligence Evaluator Agent Implementation Guide

Architecture, workflow design, metrics, and rollout guidance for a research intelligence evaluator agent in production.

doc

Sales Enablement Evaluator Agent Implementation Guide

Architecture, workflow design, metrics, and rollout guidance for a sales enablement evaluator agent in production.

doc

Security Operations Evaluator Agent Implementation Guide

Architecture, workflow design, metrics, and rollout guidance for a security operations evaluator agent in production.

doc

Support Operations Evaluator Agent Implementation Guide

Architecture, workflow design, metrics, and rollout guidance for a support operations evaluator agent in production.

doc

Evaluation Metrics and Benchmarks

How to measure LLM capability — from academic benchmarks (MMLU, GSM8K, HumanEval) to practical evaluation pipelines for production systems

doc

Evaluation Systems Architecture Patterns

Reference patterns, tradeoffs, and building blocks for designing evaluation systems.

doc

Evaluation Systems Cost and Performance

How to trade off latency, throughput, quality, and spend when operating evaluation systems.

doc

Evaluation Systems Evaluation Metrics

Metrics, scorecards, and review methods for measuring evaluation systems quality in practice.

doc

Evaluation Systems Failure Modes

Common failure patterns, debugging workflows, and prevention strategies for evaluation systems.

doc

Evaluation Systems Foundations

Core concepts, terminology, workflows, and mental models for measuring quality, regressions, and business impact across AI workflows in modern AI systems.

doc

Evaluation Systems Implementation Guide

A practical step-by-step guide for implementing evaluation systems with production constraints in mind.

doc

Evaluation Systems Production Checklist

Deployment checklist, operational controls, and rollout guidance for evaluation systems workloads.

doc

Evaluation Systems Vendor Landscape

How vendors, open-source options, and ecosystem tools compare for evaluation systems use cases.

agent

Data Platform Evaluator Agent

Data Platform agent blueprint that scores outputs against explicit rubrics so teams can compare variants, regressions, and rollout quality over time. Aimed at analysts and engineers who need better query generation, pipeline debugging, and dataset explanation across changing schemas.

agent

Developer Productivity Evaluator Agent

Developer Productivity agent blueprint that scores outputs against explicit rubrics so teams can compare variants, regressions, and rollout quality over time. Aimed at engineering teams that want reliable help with issue triage, runbook guidance, and change review without obscuring system ownership.

agent

Finance Operations Evaluator Agent

Finance Operations agent blueprint that scores outputs against explicit rubrics so teams can compare variants, regressions, and rollout quality over time. Aimed at finance teams that need faster reconciliation, exception review, and policy-aware reporting for recurring operational workflows.

agent

Growth Marketing Evaluator Agent

Growth Marketing agent blueprint that scores outputs against explicit rubrics so teams can compare variants, regressions, and rollout quality over time. Aimed at campaign teams that need faster experimentation, channel-specific copy, and clearer measurement loops without losing brand control.

agent

Healthcare Operations Evaluator Agent

Healthcare Operations agent blueprint that scores outputs against explicit rubrics so teams can compare variants, regressions, and rollout quality over time. Aimed at care and operations teams that need workflow assistance around intake, documentation, and coordination while preserving safety review.

agent

Legal Compliance Evaluator Agent

Legal Compliance agent blueprint that scores outputs against explicit rubrics so teams can compare variants, regressions, and rollout quality over time. Aimed at legal teams that need structured review support for contracts, obligations, and policy mapping under strict approval controls.

agent

Research Intelligence Evaluator Agent

Research Intelligence agent blueprint that scores outputs against explicit rubrics so teams can compare variants, regressions, and rollout quality over time. Aimed at research and strategy teams that need synthesis across large source sets with explicit provenance, tradeoffs, and update tracking.

agent

Sales Enablement Evaluator Agent

Sales Enablement agent blueprint that scores outputs against explicit rubrics so teams can compare variants, regressions, and rollout quality over time. Targets fragmented deal context, inconsistent follow-up quality, and too much rep time spent gathering account intelligence.

agent

Security Operations Evaluator Agent

Security Operations agent blueprint that scores outputs against explicit rubrics so teams can compare variants, regressions, and rollout quality over time. Aimed at security teams that must classify alerts, enrich incidents, and reduce analyst fatigue without introducing unsafe automation.

agent

Support Operations Evaluator Agent

Support Operations agent blueprint that scores outputs against explicit rubrics so teams can compare variants, regressions, and rollout quality over time. Targets high ticket volume, inconsistent routing, and slow escalation paths across chat, email, and in-product support.