Testing & Evaluation
Measure quality so you can ship with evidence.
You wouldn't ship a function without tests. You shouldn't ship a Claude feature without evals. They're the same idea — measurable definitions of "did it work" — adapted for outputs that don't always have one right answer. This module is the pragmatic version: small, fast, and useful from day one.
By the end of this module you'll have
- A working eval harness — golden examples, a runner, a pass/fail report
- A small library of graders for the three common output shapes (exact, substring, model-as-judge)
- A regression habit: every change to a prompt or model goes through evals before merge
Time: about 2 hours for the basics, ~7 hours with all three notebooks.
Prerequisites: Modules 3 (prompt basics), 6 (advanced prompting), 14 (production patterns).
Why this isn't just "tests, but for LLMs"
Two things break the analogy with normal unit tests:
- Outputs are non-deterministic. Same prompt, slightly different reply. Your assertions need to be tolerant of variation while strict about correctness.
- There isn't always a single right answer. "Summarise this article" has a thousand valid summaries. You need graders that capture the qualities that matter, not the exact string.
Solving (1) and (2) is what an eval harness is for. The good news: a useful one is ~50 lines.
The minimum viable eval harness
Save as eval_harness.py:
import json
from dataclasses import dataclass
from typing import Callable

from anthropic import Anthropic
from dotenv import load_dotenv

load_dotenv()
client = Anthropic()

@dataclass
class EvalCase:
    name: str
    input: str
    grader: Callable[[str], tuple[bool, str]]  # returns (passed, why)

def run_eval(cases: list[EvalCase], *, model="claude-sonnet-4-6", system: str = "") -> dict:
    results, passed = [], 0
    for c in cases:
        r = client.messages.create(
            model=model, max_tokens=400, system=system,
            messages=[{"role": "user", "content": c.input}],
        )
        output = r.content[0].text
        ok, why = c.grader(output)
        passed += int(ok)
        results.append({"name": c.name, "passed": ok, "why": why, "output": output[:200]})
    return {"pass_rate": passed / len(cases), "n": len(cases), "results": results}

# --- graders -------------------------------------------------------------

def exact(expected: str):
    return lambda out: (out.strip() == expected.strip(), "exact")

def contains(needle: str):
    return lambda out: (needle.lower() in out.lower(), f"contains {needle!r}")

def is_valid_json():
    def _g(out: str):
        try:
            json.loads(out)
            return True, "valid JSON"
        except Exception as exc:
            return False, f"json parse failed: {exc}"
    return _g

def model_judge(rubric: str, model="claude-haiku-4-5-20251001"):
    def _g(out: str):
        verdict = client.messages.create(
            model=model, max_tokens=20,
            system="Reply only PASS or FAIL.\nRubric: " + rubric,
            messages=[{"role": "user", "content": out}],
        ).content[0].text.strip().upper()
        return verdict.startswith("PASS"), verdict
    return _g

# --- a real eval set -----------------------------------------------------

CASES = [
    EvalCase("classifies positive review",
             "Setup was painless and audio is crisp.",
             contains("positive")),
    EvalCase("classifies negative review",
             "App crashes every five minutes. Useless.",
             contains("negative")),
    EvalCase("returns valid JSON",
             "Setup was painless and audio is crisp.",
             is_valid_json()),
    EvalCase("doesn't invent an unrelated answer",
             "Translate to French: 'I love coffee'.",
             model_judge("PASS if the response is a French translation of 'I love coffee'.")),
]

if __name__ == "__main__":
    SYSTEM = (
        "Classify the review's sentiment as positive, neutral, or negative. "
        "Reply with one JSON object: {\"sentiment\": \"...\"}"
    )
    report = run_eval(CASES, system=SYSTEM)
    print(f"Pass rate: {report['pass_rate']:.0%} ({report['n']} cases)")
    for r in report["results"]:
        mark = "✓" if r["passed"] else "✗"
        print(f" {mark} {r['name']} ({r['why']})")
What you should notice:
- Each case is self-contained: input, grader, name. Adding a case is one append.
- Graders are composable. is_valid_json() and contains("positive") can grade the same output.
- The judge model is cheaper than the model under test. Haiku grading Sonnet is the right ratio.
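One way to make that composition explicit is a small combinator that requires every grader to pass. This is a sketch, not part of the harness above; the name all_of is an assumption.

```python
def all_of(*graders):
    # Passes only if every sub-grader passes; reports the first failure reason.
    def _g(out: str):
        for g in graders:
            ok, why = g(out)
            if not ok:
                return False, why
        return True, "all graders passed"
    return _g

# e.g. all_of(is_valid_json(), contains("positive")) for the review classifier
```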
Picking the right grader
| Output shape | Use this grader |
|---|---|
| Exact value (a label, a number) | exact("...") or regex |
| A specific token must appear | contains("...") |
| JSON / structured output | is_valid_json() plus a schema check |
| Free-form text with quality requirements | model_judge(rubric) |
| Multi-criteria (factuality, tone, format) | Several graders, all must pass |
| Comparing two model outputs | Pairwise judge: "which is better, A or B, and why?" |
Strict graders are flaky if your output is genuinely creative. Loose graders pass garbage. The art is choosing the strictest grader the task allows.
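For the structured-output row, "plus a schema check" can be as small as verifying required keys and types. A minimal sketch, assuming you describe the contract as a name-to-type mapping (the required argument is illustrative, not a fixed API):

```python
import json

def matches_schema(required: dict[str, type]):
    def _g(out: str):
        try:
            data = json.loads(out)
        except Exception as exc:
            return False, f"json parse failed: {exc}"
        for key, typ in required.items():
            if key not in data:
                return False, f"missing key {key!r}"
            if not isinstance(data[key], typ):
                return False, f"{key!r} is not {typ.__name__}"
        return True, "schema ok"
    return _g

# e.g. matches_schema({"sentiment": str}) for the review classifier above
```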
The model-as-judge pattern (and its traps)
Letting Claude judge another model's output works well for subjective qualities (tone, helpfulness, factuality vs a source). It also has known failure modes:
- Position bias in pairwise comparisons. Always randomise A/B order.
- Length bias. Judges sometimes prefer longer answers. Mention "verbosity is not a virtue" in the rubric.
- Hand-waving ("looks reasonable"). Force the rubric to enumerate criteria, and force PASS/FAIL.
- Self-preference. A judge sometimes rates outputs from its own family generously. If you can, use a different family or a deterministic check.
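A minimal pairwise judge that guards against the first two traps: it randomises which answer sits in slot A and tells the judge that verbosity is not a virtue. The prompt wording is an assumption, not a canonical recipe.

```python
import random

from anthropic import Anthropic

client = Anthropic()

def pairwise_judge(question: str, answer_1: str, answer_2: str,
                   model: str = "claude-haiku-4-5-20251001") -> str:
    # Randomise which answer appears as A so the judge can't favour a position.
    swapped = random.random() < 0.5
    a, b = (answer_2, answer_1) if swapped else (answer_1, answer_2)
    verdict = client.messages.create(
        model=model, max_tokens=10,
        system=("You compare two answers to the same question. "
                "Verbosity is not a virtue. Reply with exactly one letter: A or B."),
        messages=[{"role": "user", "content":
                   f"Question: {question}\n\nAnswer A:\n{a}\n\nAnswer B:\n{b}"}],
    ).content[0].text.strip().upper()
    # Undo the shuffle before reporting the winner.
    first_won = verdict.startswith("A") != swapped
    return "answer_1" if first_won else "answer_2"
```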
Calibrate by running the judge against a labelled set you've graded yourself. If the judge agrees with you ≥85%, it's good enough.
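Measuring that agreement is a few lines, assuming you keep your hand-graded examples as (output, should_pass) pairs; the names here are illustrative.

```python
from eval_harness import model_judge  # the judge grader from the harness above

def judge_agreement(labelled: list[tuple[str, bool]], grader) -> float:
    # Fraction of hand-labelled outputs where the judge's verdict matches yours.
    hits = sum(grader(out)[0] == want for out, want in labelled)
    return hits / len(labelled)

# rate = judge_agreement(my_labels, model_judge("PASS if the reply cites the source article."))
# Trust the judge for this task only if rate >= 0.85.
```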
A regression workflow that scales
The harness is the easy part. The discipline:
- Start with 10 cases. Cover the obvious failure modes you've already hit. Stop at 10; don't try to be exhaustive on day one.
- Lock a baseline. Run the harness, save the pass rate. That's your floor.
- Every prompt or model change: re-run, compare to baseline. Refuse to merge regressions.
- Grow the suite by ~1 case per week. When a real bug is reported, add it to the suite before you fix it.
- Once a month, audit cases that always pass — they're probably no longer interesting.
This is the same loop as software unit tests. The point is to make changes safely, not to achieve coverage theatre.
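A minimal sketch of the "lock a baseline" step, assuming you keep the baseline pass rate in a small JSON file checked into the repo (the filename is an assumption):

```python
import json
from pathlib import Path

BASELINE = Path("eval_baseline.json")  # assumed filename, committed with the code

def check_against_baseline(report: dict) -> bool:
    if not BASELINE.exists():
        # First run: lock the floor.
        BASELINE.write_text(json.dumps({"pass_rate": report["pass_rate"]}))
        print(f"Baseline locked at {report['pass_rate']:.0%}")
        return True
    floor = json.loads(BASELINE.read_text())["pass_rate"]
    ok = report["pass_rate"] >= floor
    print(f"{report['pass_rate']:.0%} vs baseline {floor:.0%}: {'ok' if ok else 'REGRESSION'}")
    return ok
```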
Online evals (production)
Module 14 introduced sampled "smell tests" on live traffic. Treat those as a separate signal from your offline eval suite:
| Eval type | Catches | Misses |
|---|---|---|
| Offline suite (golden cases) | Regressions on known-important behaviours | New failure modes you haven't seen yet |
| Online smell test (1% of traffic) | Real-world degradation, weird inputs | Slow drift the smell test isn't sensitive to |
| Human review (10/day random sample) | Subtle quality, brand tone, anything qualitative | Anything the reviewer doesn't see (PII, edge cases) |
You want all three eventually. Start with offline; layer the others as you scale.
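A sampled smell test can be as small as the sketch below, assuming your serving path lets you hook each response. The 1% rate, the rubric, and the record_smell_test sink are all placeholders.

```python
import random

from eval_harness import model_judge  # the judge grader from the harness above

def maybe_smell_test(user_input: str, model_output: str, sample_rate: float = 0.01) -> None:
    # Grade a small random slice of live traffic with the cheap judge.
    if random.random() >= sample_rate:
        return
    ok, why = model_judge("PASS if the reply is on-topic, polite, and answers the user.")(model_output)
    record_smell_test(user_input, model_output, ok, why)  # hypothetical sink: a log line, a metric, a DB row
```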
Try changing one thing
- Add a case for an input you know breaks the prompt today. Watch it fail. Fix the prompt. Watch it pass.
- Run the harness against claude-haiku-4-5-20251001 and claude-sonnet-4-6. Compare pass rates. Decide whether the cost difference is worth it.
- Add a model_judge rubric that grades brand voice. Run on 20 outputs. Notice how often it disagrees with you — calibrate.
- Write a tiny CI step: python -m my_app.evals --threshold 0.9 that fails the build if pass rate drops below 90%. You've just gated merges on quality. A sketch of that entry point follows this list.
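A minimal sketch of the CI entry point, assuming the harness file shown earlier is importable and that my_app.evals is where you choose to put it (the module path comes from the command, not from a real package):

```python
# my_app/evals.py, the module path the CI command above assumes
import argparse
import sys

from eval_harness import CASES, run_eval  # the harness file shown earlier

SYSTEM = (
    "Classify the review's sentiment as positive, neutral, or negative. "
    "Reply with one JSON object: {\"sentiment\": \"...\"}"
)

def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--threshold", type=float, default=0.9)
    args = parser.parse_args()
    report = run_eval(CASES, system=SYSTEM)
    print(f"Pass rate: {report['pass_rate']:.0%} ({report['n']} cases)")
    # A non-zero exit code fails the build when quality drops below the threshold.
    sys.exit(0 if report["pass_rate"] >= args.threshold else 1)

if __name__ == "__main__":
    main()
```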
Going deeper: open the notebooks
- notebooks/01_introduction.ipynb — building eval suites, model-as-judge calibration (~1.5–2h)
- notebooks/02_intermediate.ipynb — pairwise comparisons, A/B testing prompts in production (~2–3h)
- notebooks/03_advanced.ipynb — eval-driven product development, golden-set maintenance (~1.5–2.5h)
Module checklist
- [ ] You've run an eval harness with at least 5 real cases
- [ ] You can name three grader types and when to use each
- [ ] You've watched a model-as-judge and noticed at least one of its biases
- [ ] You have an opinion about which evals you'd require before merging a prompt change
Stage 04 complete
That's the Pro stage. You can now optimise, customise (or know not to), coordinate multiple agents, scale across teams, and prove quality with evidence. The final stage is Specialties — the same toolkit applied to particular domains.