Stage 04 · Pro · Module 20 of 26 · ~7h

Testing & Evaluation

Measure quality so you can ship with evidence.


You wouldn't ship a function without tests. You shouldn't ship a Claude feature without evals. They're the same idea — measurable definitions of "did it work" — adapted for outputs that don't always have one right answer. This module is the pragmatic version: small, fast, and useful from day one.

By the end of this module you'll have:

  - a ~50-line eval harness you can run from day one
  - four grader styles (exact match, substring, JSON validity, model-as-judge) and rules for picking between them
  - a calibration check for model-as-judge graders
  - a regression workflow that grows with your product

Time: about 2 hours for the basics, ~7 hours with all three notebooks.

Prerequisites: Modules 3 (prompt basics), 6 (advanced prompting), 14 (production patterns).


Why this isn't just "tests, but for LLMs"

Two things break the analogy with normal unit tests:

  1. Outputs are non-deterministic. Same prompt, slightly different reply. Your assertions need to be tolerant of variation while strict about correctness.
  2. There isn't always a single right answer. "Summarise this article" has a thousand valid summaries. You need graders that capture the qualities that matter, not the exact string.

Solving (1) and (2) is what an eval harness is for. The good news: a useful one is ~50 lines.


The minimum viable eval harness

Save as eval_harness.py:

import json
from dataclasses import dataclass
from typing import Callable
from anthropic import Anthropic
from dotenv import load_dotenv

load_dotenv()
client = Anthropic()

@dataclass
class EvalCase:
    name: str
    input: str
    grader: Callable[[str], tuple[bool, str]]   # returns (passed, why)

def run_eval(cases: list[EvalCase], *, model="claude-sonnet-4-6", system: str = "") -> dict:
    results, passed = [], 0
    for c in cases:
        r = client.messages.create(
            model=model, max_tokens=400, system=system,
            messages=[{"role": "user", "content": c.input}],
        )
        output = r.content[0].text
        ok, why = c.grader(output)
        passed += int(ok)
        results.append({"name": c.name, "passed": ok, "why": why, "output": output[:200]})
    return {"pass_rate": passed / len(cases), "n": len(cases), "results": results}

# --- graders -------------------------------------------------------------

def exact(expected: str):
    return lambda out: (out.strip() == expected.strip(), "exact")

def contains(needle: str):
    return lambda out: (needle.lower() in out.lower(), f"contains {needle!r}")

def is_valid_json():
    def _g(out: str):
        try:
            json.loads(out)
            return True, "valid JSON"
        except Exception as exc:
            return False, f"json parse failed: {exc}"
    return _g

def model_judge(rubric: str, model="claude-haiku-4-5-20251001"):
    def _g(out: str):
        verdict = client.messages.create(
            model=model, max_tokens=20,
            system="Reply only PASS or FAIL.\nRubric: " + rubric,
            messages=[{"role": "user", "content": out}],
        ).content[0].text.strip().upper()
        return verdict.startswith("PASS"), verdict
    return _g

# --- a real eval set -----------------------------------------------------

CASES = [
    EvalCase("classifies positive review",
             "Setup was painless and audio is crisp.",
             contains("positive")),
    EvalCase("classifies negative review",
             "App crashes every five minutes. Useless.",
             contains("negative")),
    EvalCase("returns valid JSON",
             "Setup was painless and audio is crisp.",
             is_valid_json()),
    EvalCase("doesn't invent an unrelated answer",
             "Translate to French: 'I love coffee'.",
             model_judge("PASS if the response sticks to sentiment classification "
                         "and does not translate the text into French.")),
]

if __name__ == "__main__":
    SYSTEM = (
        "Classify the review's sentiment as positive, neutral, or negative. "
        "Reply with one JSON object: {\"sentiment\": \"...\"}"
    )
    report = run_eval(CASES, system=SYSTEM)
    print(f"Pass rate: {report['pass_rate']:.0%} ({report['n']} cases)")
    for r in report["results"]:
        mark = "✓" if r["passed"] else "✗"
        print(f"  {mark} {r['name']}  ({r['why']})")

What you should notice:

  - Every grader returns (passed, why), so a failing case explains itself in the report.
  - The judge uses a cheaper model (Haiku) than the model under test.
  - Outputs are truncated to 200 characters in the results, so the report stays readable.
  - The pass rate is a single number you can track across runs.


Picking the right grader

| Output shape | Use this grader |
|---|---|
| Exact value (a label, a number) | `exact("...")` or a regex |
| A specific token must appear | `contains("...")` |
| JSON / structured output | `is_valid_json()` plus a schema check |
| Free-form text with quality requirements | `model_judge(rubric)` |
| Multi-criteria (factuality, tone, format) | Several graders, all must pass |
| Comparing two model outputs | Pairwise judge: "which is better, A or B, and why?" |

Strict graders are flaky if your output is genuinely creative. Loose graders pass garbage. The art is choosing the strictest grader the task allows.
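The table mentions two graders the harness above doesn't define: a regex grader and a JSON schema check. A minimal sketch of both, following the same (passed, why) convention (the names `matches` and `json_schema` are mine):

```python
import json
import re

def matches(pattern: str):
    """Grader: pass if the output matches a regex anywhere."""
    def _g(out: str):
        ok = re.search(pattern, out) is not None
        return ok, f"regex {pattern!r}"
    return _g

def json_schema(required: dict[str, type]):
    """Grader: valid JSON *and* each required key present with the right type."""
    def _g(out: str):
        try:
            obj = json.loads(out)
        except Exception as exc:
            return False, f"json parse failed: {exc}"
        for key, typ in required.items():
            if key not in obj:
                return False, f"missing key {key!r}"
            if not isinstance(obj[key], typ):
                return False, f"{key!r} is {type(obj[key]).__name__}, expected {typ.__name__}"
        return True, "schema ok"
    return _g
```

For the sentiment task above, `json_schema({"sentiment": str})` is strictly stronger than `is_valid_json()`: it also catches a reply that is valid JSON but has the wrong shape.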


The model-as-judge pattern (and its traps)

Letting Claude judge another model's output works well for subjective qualities (tone, helpfulness, factuality versus a source). It also has known failure modes:

  - Verbosity bias: longer answers tend to be scored higher regardless of quality.
  - Position bias: in pairwise comparisons, the option shown first (or last) tends to be favoured.
  - Self-preference: a model tends to rate output in its own style more generously.
  - Loose rubrics: anything the judge can interpret charitably drifts toward PASS.

Calibrate by running the judge against a labelled set you've graded yourself. If the judge agrees with you ≥85%, it's good enough.
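That calibration step is a one-liner's worth of logic. A sketch, where `judge_agreement` is a hypothetical helper and the stand-in judge replaces a real `model_judge(rubric)` call:

```python
def judge_agreement(judge, labelled: list[tuple[str, bool]]) -> float:
    """Fraction of human-labelled outputs where the judge's verdict
    matches yours. `judge` is any grader returning (passed, why);
    `labelled` pairs each output with your own PASS/FAIL decision."""
    hits = sum(judge(out)[0] == human for out, human in labelled)
    return hits / len(labelled)

# Stand-in judge for illustration; a real run would pass model_judge(rubric).
fake_judge = lambda out: ("merci" in out.lower(), "contains merci")
labelled = [("Merci beaucoup !", True), ("Thank you", False), ("merci", True)]
agreement = judge_agreement(fake_judge, labelled)  # 1.0 on this tiny set
```

If the agreement on your hand-labelled set is below the 85% bar, tighten the rubric before trusting the judge in the suite.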


A regression workflow that scales

The harness is the easy part. The discipline:

  1. Start with 10 cases. Cover the obvious failure modes you've already hit. Stop at 10; don't try to be exhaustive on day one.
  2. Lock a baseline. Run the harness, save the pass rate. That's your floor.
  3. Every prompt or model change: re-run, compare to baseline. Refuse to merge regressions.
  4. Grow the suite by ~1 case per week. When a real bug is reported, add it to the suite before you fix it.
  5. Once a month, audit cases that always pass — they're probably no longer interesting.

This is the same loop as software unit tests. The point is to make changes safely, not to achieve coverage theatre.
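Steps 2 and 3 can be sketched as a small baseline gate on top of `run_eval`'s report (the file name and helper are mine, not part of the harness):

```python
import json
import pathlib

BASELINE = pathlib.Path("eval_baseline.json")

def check_against_baseline(report: dict, tolerance: float = 0.0) -> bool:
    """Compare a run's pass rate to the saved baseline floor.
    The first run writes the baseline; later runs must meet it."""
    if not BASELINE.exists():
        BASELINE.write_text(json.dumps({"pass_rate": report["pass_rate"]}))
        return True  # first run establishes the floor
    floor = json.loads(BASELINE.read_text())["pass_rate"]
    return report["pass_rate"] >= floor - tolerance
```

Note the baseline only ratchets when you choose: raise the floor by rewriting the file deliberately, not as a side effect of a lucky run.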


Online evals (production)

Module 14 introduced sampled "smell tests" on live traffic. Treat those as a separate signal from your offline eval suite:

| Eval type | Catches | Misses |
|---|---|---|
| Offline suite (golden cases) | Regressions on known-important behaviours | New failure modes you haven't seen yet |
| Online smell test (1% of traffic) | Real-world degradation, weird inputs | Slow drift the smell test isn't sensitive to |
| Human review (10/day random sample) | Subtle quality, brand tone, anything qualitative | Anything the reviewer doesn't see (PII, edge cases) |

You want all three eventually. Start with offline; layer the others as you scale.
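The online smell test row reduces to a few lines around your existing graders. A sketch, assuming cheap graders only on the sampled path (`maybe_grade`, `SAMPLE_RATE`, and the `failures` list are placeholder names; wire the sink into your real metrics pipeline):

```python
import random

SAMPLE_RATE = 0.01           # grade roughly 1% of live traffic
failures: list[dict] = []    # stand-in for your metrics/logging sink

def maybe_grade(request_id: str, output: str, graders,
                sample_rate: float = SAMPLE_RATE) -> None:
    """Sampled online smell test: run cheap graders on a fraction
    of production outputs and record any failures."""
    if random.random() >= sample_rate:
        return
    for grader in graders:
        ok, why = grader(output)
        if not ok:
            failures.append({"request_id": request_id, "why": why})
```

Keep model-as-judge graders off this path, or you're paying for a second model call on live traffic; structural checks like JSON validity are usually enough for a smell test.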


Try changing one thing


Going deeper: open the notebooks


Module checklist


Stage 04 complete

That's the Pro stage. You can now optimise, customise (or know not to), coordinate multiple agents, scale across teams, and prove quality with evidence. The final stage is Specialties — the same toolkit applied to particular domains.

Module S1 · Business Intelligence