Optimization
Cut latency and cost without losing quality.
Three forces are constantly in tension: quality, latency, cost. You can usually optimise two; the third bites you. This module is the toolkit for moving each lever deliberately — with measurements, not vibes.
By the end of this module you'll have
- A simple measurement habit so optimisations are evidence-based
- Working prompt caching that cuts cost on repeat-prefix workloads
- A model cascade pattern: fast model first, escalate only when needed
Time: about 1.5 hours for the basics, ~6 hours with all three notebooks.
Prerequisites: Modules 4 (API basics), 5 (tokens), 14 (production patterns).
Measure before you optimise
The first rule of optimisation is the same in any system: don't guess. Run something realistic, measure it, then change one thing.
```python
import time, statistics

from anthropic import Anthropic
from dotenv import load_dotenv

load_dotenv()
client = Anthropic()


def benchmark(fn, *, runs=10):
    latencies, costs = [], []
    for _ in range(runs):
        start = time.perf_counter()
        r = fn()
        latencies.append((time.perf_counter() - start) * 1000)
        # Rough cost proxy: input tokens plus output tokens weighted 4x,
        # since output tokens cost more. Replace with real per-token rates.
        costs.append(r.usage.input_tokens + 4 * r.usage.output_tokens)
    return {
        "p50_ms": round(statistics.median(latencies)),
        "p90_ms": round(statistics.quantiles(latencies, n=10)[-1]),
        "mean_cost": round(statistics.mean(costs)),
    }


def baseline():
    return client.messages.create(
        model="claude-sonnet-4-6", max_tokens=400,
        messages=[{"role": "user", "content": "Summarise the plot of Hamlet in 3 bullets."}],
    )


print(benchmark(baseline))
```
That's your baseline. Every optimisation below changes one thing and re-runs it. Anything that doesn't improve p90_ms or mean_cost measurably is noise — abandon it.
Lever 1 · Prompt caching (the cheapest win)
If your prompts have a long, unchanging prefix (a system message with rules, a knowledge-base snippet, a few-shot block), Anthropic can cache it server-side and charge you a fraction for every reuse.
```python
SYSTEM_RULES = "You are a customer support agent. Tone: warm but precise. " * 200  # ~thousands of tokens

response = client.messages.create(
    model="claude-sonnet-4-6", max_tokens=400,
    system=[
        {
            "type": "text",
            "text": SYSTEM_RULES,
            "cache_control": {"type": "ephemeral"},  # ← cache this prefix
        },
    ],
    messages=[{"role": "user", "content": "How do I cancel an order?"}],
)
```
What changes:
- The first call writes the cache (slightly more expensive than uncached).
- Subsequent calls within the cache TTL pay a small fraction of the input-token cost for the cached prefix.
- The user message and the response are not cached — only the prefix.
Use it for: shared system prompts, large RAG context that's the same across users, few-shot example blocks. Don't use it for: prompts that change every time.
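To confirm the cache is actually being hit, look at the usage block on the response. A minimal sketch; the `cache_creation_input_tokens` and `cache_read_input_tokens` fields are an assumption about your SDK version's response object, so check what yours reports:

```python
# Sketch: verify the cache is doing work. Field names assume a recent SDK that
# reports cache accounting on `usage`; getattr keeps it safe if yours doesn't.
usage = response.usage
print("regular input tokens:", usage.input_tokens)
print("written to cache:    ", getattr(usage, "cache_creation_input_tokens", None))
print("read from cache:     ", getattr(usage, "cache_read_input_tokens", None))
# First call: tokens show up under "written to cache". Repeat calls within the
# TTL: the prefix shows up under "read from cache" at the discounted rate.
```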
Lever 2 · Model cascade
Cheap model first; only escalate if the answer fails a check. The check is itself cheap.
```python
def cascade(question: str) -> str:
    fast = client.messages.create(
        model="claude-haiku-4-5-20251001", max_tokens=400,
        messages=[{"role": "user", "content": question}],
    ).content[0].text

    judge = client.messages.create(
        model="claude-haiku-4-5-20251001", max_tokens=10,
        system="Reply only PASS or FAIL. FAIL if the answer is empty, evasive, or contradicts itself.",
        messages=[{"role": "user", "content": f"Q: {question}\nA: {fast}"}],
    ).content[0].text.strip()

    if judge.upper().startswith("PASS"):
        return fast

    # Only the failures pay for Sonnet.
    return client.messages.create(
        model="claude-sonnet-4-6", max_tokens=600,
        messages=[{"role": "user", "content": question}],
    ).content[0].text
```
If 80% of your traffic is easy, this cuts cost by ~70%. Measure the escalation rate — if it's above 30%, the cascade is helping less than just running Sonnet.
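To get that number, have the cascade report whether it escalated and tally the results over a sample of real traffic. A minimal sketch, assuming you modify `cascade()` above to return an `(answer, escalated)` tuple; `sample_questions` is a hypothetical list drawn from your own logs:

```python
# Sketch: measure the escalation rate. Assumes cascade() was changed to return
# (answer, escalated) and that sample_questions comes from your own traffic.
sample_questions = [
    "How do I cancel an order?",
    "Explain the difference between your refund and exchange policies.",
]

escalated_count = 0
for q in sample_questions:
    _, escalated = cascade(q)
    escalated_count += escalated

rate = escalated_count / len(sample_questions)
print(f"escalation rate: {rate:.0%}")  # above ~30%? The cascade may not be paying off.
```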
Lever 3 · Trim the prompt
Tokens you didn't send are tokens you didn't pay for. Three places to trim:
- System prompt. That 1,200-word brand voice guide? Most of it is restating itself. Cut to the actual rules.
- History. Module 10's rolling window applies here too. You rarely need 30 turns of context.
- RAG context. Top-K=3 with sharp chunks usually beats Top-K=10 with sprawling chunks.
A useful exercise: print `len(prompt)` next to your output, then ask "would the answer change if I cut a third of this?" Run a benchmark to find out.
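Character counts are only a proxy for tokens. If you want actual token counts before sending anything, newer SDK versions expose a token-counting call; a minimal sketch, assuming your version has `client.messages.count_tokens`:

```python
# Sketch: count input tokens up front. Assumes your SDK version exposes
# client.messages.count_tokens; otherwise fall back to a rough len(prompt) // 4.
count = client.messages.count_tokens(
    model="claude-sonnet-4-6",
    system=SYSTEM_RULES,
    messages=[{"role": "user", "content": "How do I cancel an order?"}],
)
print("input tokens:", count.input_tokens)
```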
Lever 4 · Streaming (latency only, not cost)
Streaming doesn't change cost or total time-to-completion. It changes time-to-first-token, which is what users actually feel.
```python
with client.messages.stream(
    model="claude-sonnet-4-6", max_tokens=600,
    messages=[{"role": "user", "content": "Write a 6-paragraph essay on focus."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
```
Use streaming any time the user is waiting on the response. Don't use it for background jobs, batched workloads, or anything where you need the full response before deciding what to do next.
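Since the win is time-to-first-token, measure that specifically rather than total latency. A minimal sketch reusing the timing approach from the benchmark at the top of the module:

```python
# Sketch: compare time-to-first-token (TTFT) with total time for a streamed call.
start = time.perf_counter()
first_token_ms = None

with client.messages.stream(
    model="claude-sonnet-4-6", max_tokens=600,
    messages=[{"role": "user", "content": "Write a 6-paragraph essay on focus."}],
) as stream:
    for text in stream.text_stream:
        if first_token_ms is None:
            first_token_ms = (time.perf_counter() - start) * 1000

total_ms = (time.perf_counter() - start) * 1000
print(f"TTFT: {first_token_ms:.0f} ms, total: {total_ms:.0f} ms")
```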
Lever 5 · Smaller max_tokens
Latency tracks output length. If your task always answers in 200 tokens, max_tokens=200 is faster and a hard ceiling on cost.
Bonus: it also forces you to write tighter prompts that ask for shorter answers.
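As a concrete check, re-run the benchmark from the top of the module with a tighter cap; a minimal sketch:

```python
# Sketch: the baseline request again, but with a hard ceiling on output length.
def capped():
    return client.messages.create(
        model="claude-sonnet-4-6", max_tokens=200,  # was 400 in baseline()
        messages=[{"role": "user", "content": "Summarise the plot of Hamlet in 3 bullets."}],
    )

print(benchmark(capped))  # compare p90_ms and mean_cost against baseline()
```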
A decision table
| Symptom | First lever to pull |
|---|---|
| Prompts share a long, stable prefix | Prompt caching |
| Most queries are easy, some are hard | Model cascade |
| User waits and watches a spinner | Streaming |
| Bills are high but P90 latency is fine | Trim prompt + smaller max_tokens |
| Quality is good but everything's slow | Move to a smaller model and re-evaluate quality |
| Quality dropped after an optimisation | Roll back; you over-optimised the other two forces at quality's expense |
Try changing one thing
- Add prompt caching to your most-called endpoint and run the benchmark before/after. Note the cost difference per 1,000 calls.
- Build a cascade where the judge is a regex (`r"^[A-Z]"` for "starts with a capital", say). Far cheaper than a Haiku call. Sometimes good enough; see the sketch after this list.
- Run an A/B: half your traffic to Sonnet, half to a cascade. Compare quality (smell-test pass rate from Module 14) and cost.
- Cut your system prompt by 30%. Re-run quality evals. Note the win — and where quality slipped.
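For the regex-judge exercise, the change to the cascade is small. A minimal sketch; the PASS criterion here is deliberately crude and purely illustrative:

```python
import re

def cascade_regex_judge(question: str) -> str:
    """Cascade variant where the judge is a regex instead of a second model call."""
    fast = client.messages.create(
        model="claude-haiku-4-5-20251001", max_tokens=400,
        messages=[{"role": "user", "content": question}],
    ).content[0].text

    # Crude check: a non-trivial answer that starts with a capital letter.
    if len(fast) > 40 and re.match(r"^[A-Z]", fast.strip()):
        return fast

    return client.messages.create(
        model="claude-sonnet-4-6", max_tokens=600,
        messages=[{"role": "user", "content": question}],
    ).content[0].text
```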
Going deeper: open the notebooks
- `notebooks/01_introduction.ipynb` — measurement habits, prompt caching, baseline benchmarks (~1.5–2h)
- `notebooks/02_intermediate.ipynb` — model cascades, escalation policies, A/B harnesses (~2–3h)
- `notebooks/03_advanced.ipynb` — multi-region, batched workloads, capacity planning (~1.5–2.5h)
Module checklist
- [ ] You've benchmarked at least one Claude call and have numbers, not vibes
- [ ] You've enabled prompt caching and observed the cost change
- [ ] You've built a cascade and measured its escalation rate
- [ ] You can name which lever to pull when given each symptom in the table
Next module
Module 17 · Fine-tuning — when (and when not) to specialise Claude for your domain.