Stage 04 · Pro · Module 16 of 26 · ~6h

Optimization

Cut latency and cost without losing quality.


Three forces are constantly in tension: quality, latency, cost. You can usually optimise two; the third bites you. This module is the toolkit for moving each lever deliberately — with measurements, not vibes.

By the end of this module you'll have a latency-and-cost benchmark harness, a cached system prompt, a working Haiku-to-Sonnet cascade, and a decision table for choosing which lever to pull first.

Time: about 1.5 hours for the basics, ~6 hours with all three notebooks.

Prerequisites: Modules 4 (API basics), 5 (tokens), 14 (production patterns).


Measure before you optimise

The first rule of optimisation is the same in any system: don't guess. Run something realistic, measure it, then change one thing.

import time, statistics
from anthropic import Anthropic
from dotenv import load_dotenv

load_dotenv()
client = Anthropic()

def benchmark(fn, *, runs=10):
    latencies, costs = [], []
    for _ in range(runs):
        start = time.perf_counter()
        r = fn()
        latencies.append((time.perf_counter() - start) * 1000)
        # rough cost proxy: input + 4x output (output tokens cost more). Replace with real per-token rates.
        costs.append(r.usage.input_tokens + 4 * r.usage.output_tokens)
    return {
        "p50_ms": round(statistics.median(latencies)),
        "p90_ms": round(statistics.quantiles(latencies, n=10)[-1]),
        "mean_cost": round(statistics.mean(costs)),
    }

def baseline():
    return client.messages.create(
        model="claude-sonnet-4-6", max_tokens=400,
        messages=[{"role": "user", "content": "Summarise the plot of Hamlet in 3 bullets."}],
    )

print(benchmark(baseline))

That's your baseline. Every optimisation below changes one thing and re-runs it. Anything that doesn't improve p90_ms or mean_cost measurably is noise — abandon it.
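
For example, a variant that changes only the model drops straight into the same harness; the Haiku-versus-Sonnet comparison here is just an illustration of the workflow, not a recommendation:

def haiku_variant():
    # Same prompt, same max_tokens; the only change from baseline() is the model.
    return client.messages.create(
        model="claude-haiku-4-5-20251001", max_tokens=400,
        messages=[{"role": "user", "content": "Summarise the plot of Hamlet in 3 bullets."}],
    )

print("baseline:", benchmark(baseline))
print("variant: ", benchmark(haiku_variant))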


Lever 1 · Prompt caching (the cheapest win)

If your prompts have a long, unchanging prefix (a system message with rules, a knowledge-base snippet, a few-shot block), Anthropic can cache it server-side and charge you a fraction for every reuse.

SYSTEM_RULES = "You are a customer support agent. Tone: warm but precise. " * 200  # ~thousands of tokens

response = client.messages.create(
    model="claude-sonnet-4-6", max_tokens=400,
    system=[
        {
            "type": "text",
            "text": SYSTEM_RULES,
            "cache_control": {"type": "ephemeral"},     # ← cache this prefix
        },
    ],
    messages=[{"role": "user", "content": "How do I cancel an order?"}],
)

What changes: the first request that writes the cache pays a small premium on the cached prefix; every later request that reuses it pays only a fraction of the normal input price for those tokens and skips reprocessing them, so time-to-first-token drops too. The cache is ephemeral: it expires after a few minutes without reuse, so it only pays off on steady traffic.

Use it for: shared system prompts, large RAG context that's the same across users, few-shot example blocks. Don't use it for: prompts that change every time.
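
To verify the cache is doing anything, inspect the usage block on the response; the Anthropic SDK reports cached-prefix tokens in separate fields (the field names below come from the current SDK, and getattr keeps the sketch safe if your version differs):

usage = response.usage
print("cache writes:", getattr(usage, "cache_creation_input_tokens", 0))
print("cache reads: ", getattr(usage, "cache_read_input_tokens", 0))
# First call: writes > 0, reads == 0.
# Repeat calls shortly after: reads > 0, and the cached prefix is billed at the reduced rate.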


Lever 2 · Model cascade

Cheap model first; only escalate if the answer fails a check. The check is itself cheap.

def cascade(question: str) -> str:
    fast = client.messages.create(
        model="claude-haiku-4-5-20251001", max_tokens=400,
        messages=[{"role": "user", "content": question}],
    ).content[0].text

    judge = client.messages.create(
        model="claude-haiku-4-5-20251001", max_tokens=10,
        system="Reply only PASS or FAIL. FAIL if the answer is empty, evasive, or contradicts itself.",
        messages=[{"role": "user", "content": f"Q: {question}\nA: {fast}"}],
    ).content[0].text.strip()

    if judge.upper().startswith("PASS"):
        return fast

    # Only the failures pay for Sonnet.
    return client.messages.create(
        model="claude-sonnet-4-6", max_tokens=600,
        messages=[{"role": "user", "content": question}],
    ).content[0].text

If 80% of your traffic is easy, most requests never pay Sonnet prices and the bill drops sharply; the exact saving depends on the price gap between the two models. Measure the escalation rate: if it climbs much above 30%, the saving shrinks and every escalated query pays the latency of three calls instead of one, so re-check whether running Sonnet directly is simpler.
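
A back-of-the-envelope way to sanity-check that threshold, with made-up relative prices (the 1:3 Haiku-to-Sonnet ratio and the judge overhead are assumptions; substitute your real per-token rates):

HAIKU, SONNET = 1.0, 3.0        # illustrative relative cost per query, not real prices
JUDGE = 0.1 * HAIKU             # assume the judge call is ~10% of a full Haiku call

for escalation_rate in (0.1, 0.3, 0.5):
    cascade = HAIKU + JUDGE + SONNET * escalation_rate
    print(f"escalation {escalation_rate:.0%}: cascade {cascade:.2f} vs sonnet-only {SONNET:.2f}")
# As the escalation rate climbs, the saving shrinks while every escalated
# query still pays the latency of two or three round trips.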


Lever 3 · Trim the prompt

Tokens you didn't send are tokens you didn't pay for. Three places to trim: the system prompt (cut instructions the model follows without being told), few-shot examples (keep only the ones that demonstrably change behaviour), and retrieved or historical context (send the chunks that matter, not the whole document or the whole conversation).

A useful exercise: print(len(prompt)) next to your output, then ask "would the answer change if I cut a third of this?" Run a benchmark to find out.
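
If you want a real number rather than a character count, the SDK's token-counting endpoint prices up a prompt without generating anything (this assumes an SDK version that exposes client.messages.count_tokens; the two prompts are just illustrative):

long_prompt = "Please could you kindly, in your own words, summarise the plot of Hamlet for me, ideally as three bullet points, thanks."
short_prompt = "Summarise the plot of Hamlet in 3 bullets."

for name, prompt in (("long", long_prompt), ("short", short_prompt)):
    count = client.messages.count_tokens(
        model="claude-sonnet-4-6",
        messages=[{"role": "user", "content": prompt}],
    )
    print(name, count.input_tokens)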


Lever 4 · Streaming (latency only, not cost)

Streaming doesn't change cost or total time-to-completion. It changes time-to-first-token, which is what users actually feel.

with client.messages.stream(
    model="claude-sonnet-4-6", max_tokens=600,
    messages=[{"role": "user", "content": "Write a 6-paragraph essay on focus."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

Use streaming any time the user is waiting on the response. Don't use it for background jobs, batched workloads, or anything where you need the full response before deciding what to do next.
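
To put numbers on what the user feels, time the first chunk separately from the whole stream; this reuses the timing approach from the benchmark at the top:

start = time.perf_counter()
first_chunk_ms = None

with client.messages.stream(
    model="claude-sonnet-4-6", max_tokens=600,
    messages=[{"role": "user", "content": "Write a 6-paragraph essay on focus."}],
) as stream:
    for _ in stream.text_stream:
        if first_chunk_ms is None:
            first_chunk_ms = (time.perf_counter() - start) * 1000

total_ms = (time.perf_counter() - start) * 1000
print(f"first chunk: {first_chunk_ms:.0f} ms, full response: {total_ms:.0f} ms")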


Lever 5 · Smaller max_tokens

Latency tracks output length. If your task always answers in 200 tokens, max_tokens=200 is faster and a hard ceiling on cost.

Bonus: it also forces you to write tighter prompts that ask for shorter answers.
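
Capping output is a one-line change to the baseline from the top of the module, which makes it the cheapest experiment to run; the 200 here is just an example ceiling:

def capped_baseline():
    return client.messages.create(
        model="claude-sonnet-4-6", max_tokens=200,   # hard ceiling: caps latency and output cost
        messages=[{"role": "user", "content": "Summarise the plot of Hamlet in 3 bullets. Be brief."}],
    )

print(benchmark(capped_baseline))   # compare against the earlier baseline numbers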


A decision table

Symptom → First lever to pull
Prompts share a long, stable prefix → Prompt caching
Most queries are easy, some are hard → Model cascade
User waits and watches a spinner → Streaming
Bills are high but p90 latency is fine → Trim prompt + smaller max_tokens
Quality is good but everything's slow → Move to a smaller model and re-evaluate quality
Quality dropped after an optimisation → Roll back; you traded away one of the three forces

Next module

Module 17 · Fine-tuning — when (and when not) to specialise Claude for your domain.