Stage 04 · Pro · Module 16 of 26 · ~6h

Optimization

Cut latency and cost without losing quality.


Three forces are constantly in tension: quality, latency, cost. You can usually optimise two; the third bites you. This module is the toolkit for moving each lever deliberately — with measurements, not vibes.

By the end of this module you'll have a latency-and-cost benchmark harness, a cached system prompt, a working Haiku-to-Sonnet cascade, and a decision table for choosing which lever to pull first.

Time: about 1.5 hours for the basics, ~6 hours with all three notebooks.

Prerequisites: Modules 4 (API basics), 5 (tokens), 14 (production patterns).


Measure before you optimise

The first rule of optimisation is the same in any system: don't guess. Run something realistic, measure it, then change one thing.

import time, statistics
from anthropic import Anthropic
from dotenv import load_dotenv

load_dotenv()
client = Anthropic()

def benchmark(fn, *, runs=10):
    latencies, costs = [], []
    for _ in range(runs):
        start = time.perf_counter()
        r = fn()
        latencies.append((time.perf_counter() - start) * 1000)
        # rough cost proxy: input + 4x output (output tokens cost more). Replace with real per-token rates.
        costs.append(r.usage.input_tokens + 4 * r.usage.output_tokens)
    return {
        "p50_ms": round(statistics.median(latencies)),
        "p90_ms": round(statistics.quantiles(latencies, n=10)[-1]),
        "mean_cost": round(statistics.mean(costs)),
    }

def baseline():
    return client.messages.create(
        model="claude-sonnet-4-6", max_tokens=400,
        messages=[{"role": "user", "content": "Summarise the plot of Hamlet in 3 bullets."}],
    )

print(benchmark(baseline))

That's your baseline. Every optimisation below changes one thing and re-runs it. Anything that doesn't improve p90_ms or mean_cost measurably is noise — abandon it.
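
For example, a variant that changes only the model drops straight into the same harness; the Haiku-versus-Sonnet comparison here is just an illustration of the workflow, not a recommendation:

def haiku_variant():
    # Same prompt, same max_tokens; the only change from baseline() is the model.
    return client.messages.create(
        model="claude-haiku-4-5-20251001", max_tokens=400,
        messages=[{"role": "user", "content": "Summarise the plot of Hamlet in 3 bullets."}],
    )

print("baseline:", benchmark(baseline))
print("variant: ", benchmark(haiku_variant))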


Lever 1 · Prompt caching (the cheapest win)

If your prompts have a long, unchanging prefix (a system message with rules, a knowledge-base snippet, a few-shot block), Anthropic can cache it server-side and charge you a fraction for every reuse.

SYSTEM_RULES = "You are a customer support agent. Tone: warm but precise. " * 200  # ~thousands of tokens

response = client.messages.create(
    model="claude-sonnet-4-6", max_tokens=400,
    system=[
        {
            "type": "text",
            "text": SYSTEM_RULES,
            "cache_control": {"type": "ephemeral"},     # ← cache this prefix
        },
    ],
    messages=[{"role": "user", "content": "How do I cancel an order?"}],
)

What changes: the first request that writes the cache pays a small premium on the cached prefix; every later request that reuses it pays only a fraction of the normal input price for those tokens and skips reprocessing them, so time-to-first-token drops too. The cache is ephemeral: it expires after a few minutes without reuse, so it only pays off on steady traffic.

Use it for: shared system prompts, large RAG context that's the same across users, few-shot example blocks. Don't use it for: prompts that change every time.
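
To verify the cache is doing anything, inspect the usage block on the response; the Anthropic SDK reports cached-prefix tokens in separate fields (the field names below come from the current SDK, and getattr keeps the sketch safe if your version differs):

usage = response.usage
print("cache writes:", getattr(usage, "cache_creation_input_tokens", 0))
print("cache reads: ", getattr(usage, "cache_read_input_tokens", 0))
# First call: writes > 0, reads == 0.
# Repeat calls shortly after: reads > 0, and the cached prefix is billed at the reduced rate.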


Lever 2 · Model cascade

Cheap model first; only escalate if the answer fails a check. The check is itself cheap.

def cascade(question: str) -> str:
    fast = client.messages.create(
        model="claude-haiku-4-5-20251001", max_tokens=400,
        messages=[{"role": "user", "content": question}],
    ).content[0].text

    judge = client.messages.create(
        model="claude-haiku-4-5-20251001", max_tokens=10,
        system="Reply only PASS or FAIL. FAIL if the answer is empty, evasive, or contradicts itself.",
        messages=[{"role": "user", "content": f"Q: {question}\nA: {fast}"}],
    ).content[0].text.strip()

    if judge.upper().startswith("PASS"):
        return fast

    # Only the failures pay for Sonnet.
    return client.messages.create(
        model="claude-sonnet-4-6", max_tokens=600,
        messages=[{"role": "user", "content": question}],
    ).content[0].text

If 80% of your traffic is easy, most requests never pay Sonnet prices and the bill drops sharply; the exact saving depends on the price gap between the two models. Measure the escalation rate: if it climbs much above 30%, the saving shrinks and every escalated query pays the latency of three calls instead of one, so re-check whether running Sonnet directly is simpler.
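
A back-of-the-envelope way to sanity-check that threshold, with made-up relative prices (the 1:3 Haiku-to-Sonnet ratio and the judge overhead are assumptions; substitute your real per-token rates):

HAIKU, SONNET = 1.0, 3.0        # illustrative relative cost per query, not real prices
JUDGE = 0.1 * HAIKU             # assume the judge call is ~10% of a full Haiku call

for escalation_rate in (0.1, 0.3, 0.5):
    cascade = HAIKU + JUDGE + SONNET * escalation_rate
    print(f"escalation {escalation_rate:.0%}: cascade {cascade:.2f} vs sonnet-only {SONNET:.2f}")
# As the escalation rate climbs, the saving shrinks while every escalated
# query still pays the latency of two or three round trips.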


Lever 3 · Trim the prompt

Tokens you didn't send are tokens you didn't pay for. Three places to trim: the system prompt (cut instructions the model follows without being told), few-shot examples (keep only the ones that demonstrably change behaviour), and retrieved or historical context (send the chunks that matter, not the whole document or the whole conversation).

A useful exercise: print(len(prompt)) next to your output, then ask "would the answer change if I cut a third of this?" Run a benchmark to find out.
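
If you want a real number rather than a character count, the SDK's token-counting endpoint prices up a prompt without generating anything (this assumes an SDK version that exposes client.messages.count_tokens; the two prompts are just illustrative):

long_prompt = "Please could you kindly, in your own words, summarise the plot of Hamlet for me, ideally as three bullet points, thanks."
short_prompt = "Summarise the plot of Hamlet in 3 bullets."

for name, prompt in (("long", long_prompt), ("short", short_prompt)):
    count = client.messages.count_tokens(
        model="claude-sonnet-4-6",
        messages=[{"role": "user", "content": prompt}],
    )
    print(name, count.input_tokens)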


Lever 4 · Streaming (latency only, not cost)

Streaming doesn't change cost or total time-to-completion. It changes time-to-first-token, which is what users actually feel.

with client.messages.stream(
    model="claude-sonnet-4-6", max_tokens=600,
    messages=[{"role": "user", "content": "Write a 6-paragraph essay on focus."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

Use streaming any time the user is waiting on the response. Don't use it for background jobs, batched workloads, or anything where you need the full response before deciding what to do next.
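
To put numbers on what the user feels, time the first chunk separately from the whole stream; this reuses the timing approach from the benchmark at the top:

start = time.perf_counter()
first_chunk_ms = None

with client.messages.stream(
    model="claude-sonnet-4-6", max_tokens=600,
    messages=[{"role": "user", "content": "Write a 6-paragraph essay on focus."}],
) as stream:
    for _ in stream.text_stream:
        if first_chunk_ms is None:
            first_chunk_ms = (time.perf_counter() - start) * 1000

total_ms = (time.perf_counter() - start) * 1000
print(f"first chunk: {first_chunk_ms:.0f} ms, full response: {total_ms:.0f} ms")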


Lever 5 · Smaller max_tokens

Latency tracks output length. If your task always answers in 200 tokens, max_tokens=200 is faster and a hard ceiling on cost.

Bonus: it also forces you to write tighter prompts that ask for shorter answers.
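
Capping output is a one-line change to the baseline from the top of the module, which makes it the cheapest experiment to run; the 200 here is just an example ceiling:

def capped_baseline():
    return client.messages.create(
        model="claude-sonnet-4-6", max_tokens=200,   # hard ceiling: caps latency and output cost
        messages=[{"role": "user", "content": "Summarise the plot of Hamlet in 3 bullets. Be brief."}],
    )

print(benchmark(capped_baseline))   # compare against the earlier baseline numbers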


A decision table

Symptom → First lever to pull
Prompts share a long, stable prefix → Prompt caching
Most queries are easy, some are hard → Model cascade
User waits and watches a spinner → Streaming
Bills are high but p90 latency is fine → Trim prompt + smaller max_tokens
Quality is good but everything's slow → Move to a smaller model and re-evaluate quality
Quality dropped after an optimisation → Roll back; you traded away one of the three forces

Next module

Module 17 · Fine-tuning — when (and when not) to specialise Claude for your domain.