Testing & Evaluation
Measure quality so you can ship with evidence.
You wouldn't ship a function without tests. You shouldn't ship a Claude feature without evals. They're the same idea — measurable definitions of "did it work" — adapted for outputs that don't always have one right answer. This module is the pragmatic version: small, fast, and useful from day one.
By the end of this module you'll have
- A working eval harness — golden examples, a runner, a pass/fail report
- A small library of graders for the three common output shapes (exact, substring, model-as-judge)
- A regression habit: every change to a prompt or model goes through evals before merge
Time: about 2 hours for the basics, ~7 hours with all three notebooks.
Prerequisites: Modules 3 (prompt basics), 6 (advanced prompting), 14 (production patterns).
Why this isn't just "tests, but for LLMs"
Two things break the analogy with normal unit tests:
- Outputs are non-deterministic. Same prompt, slightly different reply. Your assertions need to be tolerant of variation while strict about correctness.
- There isn't always a single right answer. "Summarise this article" has a thousand valid summaries. You need graders that capture the qualities that matter, not the exact string.
Solving (1) and (2) is what an eval harness is for. The good news: a useful one is ~50 lines.
The minimum viable eval harness
Save as eval_harness.py:
import json
from dataclasses import dataclass
from typing import Callable

from anthropic import Anthropic
from dotenv import load_dotenv

load_dotenv()
client = Anthropic()

@dataclass
class EvalCase:
    name: str
    input: str
    grader: Callable[[str], tuple[bool, str]]  # returns (passed, why)

def run_eval(cases: list[EvalCase], *, model="claude-sonnet-4-6", system: str = "") -> dict:
    results, passed = [], 0
    for c in cases:
        r = client.messages.create(
            model=model, max_tokens=400, system=system,
            messages=[{"role": "user", "content": c.input}],
        )
        output = r.content[0].text
        ok, why = c.grader(output)
        passed += int(ok)
        results.append({"name": c.name, "passed": ok, "why": why, "output": output[:200]})
    return {"pass_rate": passed / len(cases), "n": len(cases), "results": results}

# --- graders -------------------------------------------------------------

def exact(expected: str):
    return lambda out: (out.strip() == expected.strip(), "exact")

def contains(needle: str):
    return lambda out: (needle.lower() in out.lower(), f"contains {needle!r}")

def is_valid_json():
    def _g(out: str):
        try:
            json.loads(out)
            return True, "valid JSON"
        except Exception as exc:
            return False, f"json parse failed: {exc}"
    return _g

def model_judge(rubric: str, model="claude-haiku-4-5-20251001"):
    def _g(out: str):
        verdict = client.messages.create(
            model=model, max_tokens=20,
            system="Reply only PASS or FAIL.\nRubric: " + rubric,
            messages=[{"role": "user", "content": out}],
        ).content[0].text.strip().upper()
        return verdict.startswith("PASS"), verdict
    return _g

# --- a real eval set -----------------------------------------------------

CASES = [
    EvalCase("classifies positive review",
             "Setup was painless and audio is crisp.",
             contains("positive")),
    EvalCase("classifies negative review",
             "App crashes every five minutes. Useless.",
             contains("negative")),
    EvalCase("returns valid JSON",
             "Setup was painless and audio is crisp.",
             is_valid_json()),
    EvalCase("doesn't invent an unrelated answer",
             "Translate to French: 'I love coffee'.",
             model_judge("PASS if the response is a French translation of 'I love coffee'.")),
]

if __name__ == "__main__":
    SYSTEM = (
        "Classify the review's sentiment as positive, neutral, or negative. "
        "Reply with one JSON object: {\"sentiment\": \"...\"}"
    )
    report = run_eval(CASES, system=SYSTEM)
    print(f"Pass rate: {report['pass_rate']:.0%} ({report['n']} cases)")
    for r in report["results"]:
        mark = "✓" if r["passed"] else "✗"
        print(f" {mark} {r['name']} ({r['why']})")
What you should notice:
- Each case is self-contained: input, grader, name. Adding a case is one append.
- Graders are composable. is_valid_json() and contains("positive") can grade the same output.
- The judge model is cheaper than the model under test. Haiku grading Sonnet is the right ratio.
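One way to make that composition explicit is a small combinator that requires every grader to pass. This is a sketch, not part of the harness above; the name all_of is an assumption.

```python
def all_of(*graders):
    # Passes only if every sub-grader passes; reports the first failure reason.
    def _g(out: str):
        for g in graders:
            ok, why = g(out)
            if not ok:
                return False, why
        return True, "all graders passed"
    return _g

# e.g. all_of(is_valid_json(), contains("positive")) for the review classifier
```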
Picking the right grader
| Output shape | Use this grader |
|---|---|
| Exact value (a label, a number) | exact("...") or regex |
| A specific token must appear | contains("...") |
| JSON / structured output | is_valid_json() plus a schema check |
| Free-form text with quality requirements | model_judge(rubric) |
| Multi-criteria (factuality, tone, format) | Several graders, all must pass |
| Comparing two model outputs | Pairwise judge: "which is better, A or B, and why?" |
Strict graders are flaky if your output is genuinely creative. Loose graders pass garbage. The art is choosing the strictest grader the task allows.
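For the structured-output row, "plus a schema check" can be as small as verifying required keys and types. A minimal sketch, assuming you describe the contract as a name-to-type mapping (the required argument is illustrative, not a fixed API):

```python
import json

def matches_schema(required: dict[str, type]):
    def _g(out: str):
        try:
            data = json.loads(out)
        except Exception as exc:
            return False, f"json parse failed: {exc}"
        for key, typ in required.items():
            if key not in data:
                return False, f"missing key {key!r}"
            if not isinstance(data[key], typ):
                return False, f"{key!r} is not {typ.__name__}"
        return True, "schema ok"
    return _g

# e.g. matches_schema({"sentiment": str}) for the review classifier above
```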
The model-as-judge pattern (and its traps)
Letting Claude judge another model's output works well for subjective qualities (tone, helpfulness, factuality vs a source). It also has known failure modes:
- Position bias in pairwise comparisons. Always randomise A/B order.
- Length bias. Judges sometimes prefer longer answers. Mention "verbosity is not a virtue" in the rubric.
- Hand-waving ("looks reasonable"). Force the rubric to enumerate criteria, and force PASS/FAIL.
- Self-preference. A judge sometimes rates outputs from its own family generously. If you can, use a different family or a deterministic check.
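A minimal pairwise judge that guards against the first two traps: it randomises which answer sits in slot A and tells the judge that verbosity is not a virtue. The prompt wording is an assumption, not a canonical recipe.

```python
import random

from anthropic import Anthropic

client = Anthropic()

def pairwise_judge(question: str, answer_1: str, answer_2: str,
                   model: str = "claude-haiku-4-5-20251001") -> str:
    # Randomise which answer appears as A so the judge can't favour a position.
    swapped = random.random() < 0.5
    a, b = (answer_2, answer_1) if swapped else (answer_1, answer_2)
    verdict = client.messages.create(
        model=model, max_tokens=10,
        system=("You compare two answers to the same question. "
                "Verbosity is not a virtue. Reply with exactly one letter: A or B."),
        messages=[{"role": "user", "content":
                   f"Question: {question}\n\nAnswer A:\n{a}\n\nAnswer B:\n{b}"}],
    ).content[0].text.strip().upper()
    # Undo the shuffle before reporting the winner.
    first_won = verdict.startswith("A") != swapped
    return "answer_1" if first_won else "answer_2"
```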
Calibrate by running the judge against a labelled set you've graded yourself. If the judge agrees with you ≥85%, it's good enough.
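Measuring that agreement is a few lines, assuming you keep your hand-graded examples as (output, should_pass) pairs; the names here are illustrative.

```python
from eval_harness import model_judge  # the judge grader from the harness above

def judge_agreement(labelled: list[tuple[str, bool]], grader) -> float:
    # Fraction of hand-labelled outputs where the judge's verdict matches yours.
    hits = sum(grader(out)[0] == want for out, want in labelled)
    return hits / len(labelled)

# rate = judge_agreement(my_labels, model_judge("PASS if the reply cites the source article."))
# Trust the judge for this task only if rate >= 0.85.
```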
A regression workflow that scales
The harness is the easy part. The discipline:
- Start with 10 cases. Cover the obvious failure modes you've already hit. Stop at 10; don't try to be exhaustive on day one.
- Lock a baseline. Run the harness, save the pass rate. That's your floor.
- Every prompt or model change: re-run, compare to baseline. Refuse to merge regressions.
- Grow the suite by ~1 case per week. When a real bug is reported, add it to the suite before you fix it.
- Once a month, audit cases that always pass — they're probably no longer interesting.
This is the same loop as software unit tests. The point is to make changes safely, not to achieve coverage theatre.
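A minimal sketch of the "lock a baseline" step, assuming you keep the baseline pass rate in a small JSON file checked into the repo (the filename is an assumption):

```python
import json
from pathlib import Path

BASELINE = Path("eval_baseline.json")  # assumed filename, committed with the code

def check_against_baseline(report: dict) -> bool:
    if not BASELINE.exists():
        # First run: lock the floor.
        BASELINE.write_text(json.dumps({"pass_rate": report["pass_rate"]}))
        print(f"Baseline locked at {report['pass_rate']:.0%}")
        return True
    floor = json.loads(BASELINE.read_text())["pass_rate"]
    ok = report["pass_rate"] >= floor
    print(f"{report['pass_rate']:.0%} vs baseline {floor:.0%}: {'ok' if ok else 'REGRESSION'}")
    return ok
```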
Online evals (production)
Module 14 introduced sampled "smell tests" on live traffic. Treat those as a separate signal from your offline eval suite:
| Eval type | Catches | Misses |
|---|---|---|
| Offline suite (golden cases) | Regressions on known-important behaviours | New failure modes you haven't seen yet |
| Online smell test (1% of traffic) | Real-world degradation, weird inputs | Slow drift the smell test isn't sensitive to |
| Human review (10/day random sample) | Subtle quality, brand tone, anything qualitative | Anything the reviewer doesn't see (PII, edge cases) |
You want all three eventually. Start with offline; layer the others as you scale.
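A sampled smell test can be as small as the sketch below, assuming your serving path lets you hook each response. The 1% rate, the rubric, and the record_smell_test sink are all placeholders.

```python
import random

from eval_harness import model_judge  # the judge grader from the harness above

def maybe_smell_test(user_input: str, model_output: str, sample_rate: float = 0.01) -> None:
    # Grade a small random slice of live traffic with the cheap judge.
    if random.random() >= sample_rate:
        return
    ok, why = model_judge("PASS if the reply is on-topic, polite, and answers the user.")(model_output)
    record_smell_test(user_input, model_output, ok, why)  # hypothetical sink: a log line, a metric, a DB row
```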
Try changing one thing
- Add a case for an input you know breaks the prompt today. Watch it fail. Fix the prompt. Watch it pass.
- Run the harness against claude-haiku-4-5-20251001 and claude-sonnet-4-6. Compare pass rates. Decide whether the cost difference is worth it.
- Add a model_judge rubric that grades brand voice. Run on 20 outputs. Notice how often it disagrees with you — calibrate.
- Write a tiny CI step: python -m my_app.evals --threshold 0.9 that fails the build if pass rate drops below 90%. You've just gated merges on quality. A sketch of that entry point follows this list.
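A minimal sketch of the CI entry point, assuming the harness file shown earlier is importable and that my_app.evals is where you choose to put it (the module path comes from the command, not from a real package):

```python
# my_app/evals.py, the module path the CI command above assumes
import argparse
import sys

from eval_harness import CASES, run_eval  # the harness file shown earlier

SYSTEM = (
    "Classify the review's sentiment as positive, neutral, or negative. "
    "Reply with one JSON object: {\"sentiment\": \"...\"}"
)

def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--threshold", type=float, default=0.9)
    args = parser.parse_args()
    report = run_eval(CASES, system=SYSTEM)
    print(f"Pass rate: {report['pass_rate']:.0%} ({report['n']} cases)")
    # A non-zero exit code fails the build when quality drops below the threshold.
    sys.exit(0 if report["pass_rate"] >= args.threshold else 1)

if __name__ == "__main__":
    main()
```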
Going deeper: open the notebooks
- notebooks/01_introduction.ipynb — building eval suites, model-as-judge calibration (~1.5–2h)
- notebooks/02_intermediate.ipynb — pairwise comparisons, A/B testing prompts in production (~2–3h)
- notebooks/03_advanced.ipynb — eval-driven product development, golden-set maintenance (~1.5–2.5h)
Module checklist
- [ ] You've run an eval harness with at least 5 real cases
- [ ] You can name three grader types and when to use each
- [ ] You've watched a model-as-judge and noticed at least one of its biases
- [ ] You have an opinion about which evals you'd require before merging a prompt change
Stage 04 complete
That's the Pro stage. You can now optimise, customise (or know not to), coordinate multiple agents, scale across teams, and prove quality with evidence. The final stage is Specialties — the same toolkit applied to particular domains.