Production Patterns
Logs, retries, safety — ship without surprises.
The day after you ship a Claude feature, three things go wrong: a model returns nonsense and you can't reproduce it, a slow response leaves a user staring at a timed-out page, or a single bad input cascades into a thousand downstream failures. This module gives you the small set of patterns that make those things visible, recoverable, and survivable.
By the end of this module you'll have
- Structured logs for every model call: enough to debug tomorrow, nothing that endangers users
- A timeout + retry + fallback pattern that fails gracefully instead of catastrophically
- A simple online eval that catches quality regressions before users do
Time: about 2 hours for the basics, ~8 hours with all three notebooks.
Prerequisites: Modules 4 (API basics), 7 (building apps). Familiarity with at least one production system you've operated.
Pattern 1 · Log every call (carefully)
Every model call should write a structured log line. You'll thank yourself the first time a user reports a bad answer.
import json, time, uuid, logging

from anthropic import Anthropic
from dotenv import load_dotenv

load_dotenv()
client = Anthropic()
log = logging.getLogger("claude")

def truncated_prefix(text: str, n: int = 80) -> str:
    """Truncated, *not* hashed — but never log full bodies in production."""
    return text[:n].replace("\n", "\\n") + ("…" if len(text) > n else "")

def call(*, prompt: str, model: str, **kw):
    request_id = str(uuid.uuid4())
    started = time.perf_counter()
    try:
        r = client.messages.create(
            model=model, max_tokens=kw.pop("max_tokens", 600),
            messages=[{"role": "user", "content": prompt}], **kw,
        )
        elapsed_ms = (time.perf_counter() - started) * 1000
        log.info(json.dumps({
            "event": "claude.call",
            "request_id": request_id,
            "model": model,
            "latency_ms": round(elapsed_ms),
            "in_tokens": r.usage.input_tokens,
            "out_tokens": r.usage.output_tokens,
            "stop": r.stop_reason,
            "prompt": truncated_prefix(prompt),  # truncated only
        }))
        return r
    except Exception as exc:
        log.exception(json.dumps({
            "event": "claude.error",
            "request_id": request_id,
            "model": model,
            "error": type(exc).__name__,
        }))
        raise
What this gets you, day one: searchable logs, latency percentiles, cost-per-feature dashboards (just sum tokens grouped by feature), and a request_id you can quote when a user complains. What it deliberately keeps out: PII in your logs.
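As a back-of-the-envelope example of that cost roll-up, here is one way to sum yesterday's log lines; the per-token rates and the optional `feature` tag are placeholders rather than part of the pattern above, so substitute your own pricing and tags.

```python
import json
from collections import defaultdict

# Placeholder per-token rates in USD; substitute your models' current pricing.
RATES = {"claude-sonnet-4-6": {"in": 3e-6, "out": 15e-6}}

def cost_per_feature(log_lines):
    """Group token spend by a 'feature' tag, falling back to 'untagged'."""
    totals = defaultdict(float)
    for line in log_lines:
        rec = json.loads(line)
        if rec.get("event") != "claude.call" or rec["model"] not in RATES:
            continue
        rate = RATES[rec["model"]]
        totals[rec.get("feature", "untagged")] += (
            rec["in_tokens"] * rate["in"] + rec["out_tokens"] * rate["out"]
        )
    return dict(totals)
```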
Pattern 2 · Timeout + retry + fallback
Three layers, each handling a different class of problem:
import time, random

from anthropic import Anthropic, RateLimitError, APIConnectionError, APITimeoutError

client = Anthropic()
TRANSIENT = (RateLimitError, APIConnectionError, APITimeoutError)

def call_with_recovery(
    prompt: str,
    *,
    primary: str = "claude-sonnet-4-6",
    fallback: str = "claude-haiku-4-5-20251001",
):
    for attempt in range(4):
        try:
            return client.messages.create(
                model=primary, max_tokens=600, timeout=20.0,  # layer 1: timeout
                messages=[{"role": "user", "content": prompt}],
            )
        except TRANSIENT:
            if attempt == 3:
                break
            time.sleep((2 ** attempt) + random.random())  # layer 2: retry with backoff
    # layer 3: fall back to a cheaper, often more available model
    return client.messages.create(
        model=fallback, max_tokens=600, timeout=20.0,
        messages=[{"role": "user", "content": prompt}],
    )
Three rules to keep this honest:
- Set a real `timeout`. No timeout means your user is staring at a spinner forever.
- Cap retries. Four attempts is plenty. Anything more masks systemic problems.
- Have a fallback you'd actually ship. "Sorry, try again" is fine — that's still a fallback. Don't pretend Haiku is a perfect substitute for Sonnet; just decide what degraded UX looks like (see the sketch after this list).
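If your degraded mode is a message rather than a cheaper model, the wrapper can be this small. A minimal sketch, assuming `call_with_recovery` from above and a placeholder apology string:

```python
def answer_or_apologize(prompt: str) -> str:
    """Return model text, or a canned degraded-mode message if recovery also fails."""
    try:
        response = call_with_recovery(prompt)
        return response.content[0].text
    except Exception:
        # The fallback model failed too; ship the degraded UX you decided on.
        return "Sorry, we couldn't generate an answer right now. Please try again in a minute."
```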
Pattern 3 · Online evals (catch regressions in flight)
You can't run the full eval suite on every request, but you can run a small classifier on the response itself and alert when quality drops.
def passes_smell_test(prompt: str, response_text: str) -> bool:
    """Cheap, coarse signal. Misses subtle regressions but catches obvious ones."""
    judge = client.messages.create(
        model="claude-haiku-4-5-20251001", max_tokens=20,
        system=(
            "You are a quality gate. Reply with one word: PASS or FAIL.\n"
            "FAIL if the response is empty, refuses a benign request, or contradicts itself."
        ),
        messages=[{
            "role": "user",
            "content": f"PROMPT:\n{prompt}\n\nRESPONSE:\n{response_text}\n\nVerdict:",
        }],
    )
    return judge.content[0].text.strip().upper().startswith("PASS")
Wire it in on a sample of traffic (say 1%), log the failures with the full request and response, and build a dashboard. Module 20 turns this into a real eval framework.
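A minimal sketch of that wiring, assuming a 1% sample rate and reusing `passes_smell_test` plus the `log` object from Pattern 1; logging full bodies on sampled failures is a deliberate, documented exception to the truncation rule.

```python
import json
import logging
import random

log = logging.getLogger("claude")
SAMPLE_RATE = 0.01  # assumed: judge 1% of traffic

def maybe_run_smell_test(prompt: str, response_text: str) -> None:
    """Sample a fraction of responses through the judge and log failures."""
    if random.random() >= SAMPLE_RATE:
        return
    if not passes_smell_test(prompt, response_text):
        # Sampled failures keep full bodies so you can debug the regression;
        # make sure this path is covered by your privacy review.
        log.warning(json.dumps({
            "event": "claude.smell_test_fail",
            "prompt": prompt,
            "response": response_text,
        }))
```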
What to log and what to never log
| Log | Don't log |
|---|---|
| Model id, latency, in/out tokens, request_id, error type | Full user prompts that contain PII or secrets |
| Stop reason (end_turn, max_tokens, tool_use) | API keys, even hashed |
| Truncated prompt prefix (≤ 100 chars) | Raw response bodies that may contain user data |
| Whether the smell test passed | Anything you couldn't justify in a privacy review |
| Cost-per-call (out_tokens × out_rate + in_tokens × in_rate) | "Just for now" log-everything fields. They never come out. |
A useful instinct: imagine an auditor reading your logs. Could they reconstruct who asked what? If yes, change what you log.
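If a truncated prefix isn't enough context for debugging, one option is a light redaction pass before anything reaches the logs. The two regexes below are illustrative only and will not catch every form of PII; treat this as a sketch, not a compliance tool.

```python
import re

# Illustrative patterns only; real PII handling needs a proper review, not two regexes.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text: str) -> str:
    """Mask obvious emails and phone numbers before a string is logged."""
    return PHONE.sub("[PHONE]", EMAIL.sub("[EMAIL]", text))
```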
A small operational checklist
Before a Claude-backed feature goes to real users:
- [ ] Every call logs a structured line with model, latency, tokens, request_id
- [ ] There's a timeout on every request
- [ ] Retries are capped and backoff is exponential with jitter
- [ ] There's a defined fallback (degraded model, cached answer, or graceful "try again")
- [ ] PII never reaches the logs
- [ ] You can compute cost-per-feature from yesterday's logs
- [ ] You have a smell-test or sampled human review running on responses
- [ ] You know how to roll the model id back if the next release misbehaves
If any of those is unchecked, you'll find out in production. Better to find out in staging.
Try changing one thing
- Add a `feature` field to the log line and tag every call site. Now your dashboard groups by feature.
- Set `timeout=0.5` deliberately and watch the retry/fallback fire. Helps you trust the recovery path.
- Run the smell test on 100 historical responses. Set a baseline pass rate; alert when it drops 5 points.
- Add `cache_key = hash((model, prompt))` and skip the call when you have a recent cached response (see the sketch after this list). Module 16 dives deeper.
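A minimal sketch of that cache idea, assuming a plain in-memory dict, a fixed TTL, and the `call` helper from Pattern 1; Module 16 covers real caching.

```python
import time

CACHE: dict[int, tuple[float, object]] = {}
TTL_SECONDS = 300  # assumed freshness window

def cached_call(prompt: str, model: str = "claude-sonnet-4-6"):
    """Reuse a recent response for an identical (model, prompt) pair."""
    cache_key = hash((model, prompt))
    hit = CACHE.get(cache_key)
    if hit and time.time() - hit[0] < TTL_SECONDS:
        return hit[1]
    response = call(prompt=prompt, model=model)  # the logged call from Pattern 1
    CACHE[cache_key] = (time.time(), response)
    return response
```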
Going deeper: open the notebooks
- `notebooks/01_introduction.ipynb` — observability, structured logs, error budgets (~1.5–2h)
- `notebooks/02_intermediate.ipynb` — circuit breakers, idempotency, cost tracking per user (~2–3h)
- `notebooks/03_advanced.ipynb` — disaster recovery, multi-region, incident playbooks (~1.5–2.5h)
Module checklist
- [ ] You've added structured logging around at least one Claude call
- [ ] You've tested your retry/fallback path by deliberately failing the primary call
- [ ] You can name the items on the pre-launch operational checklist from memory
- [ ] You've decided what your feature's degraded mode looks like
Next module
Module 15 · Advanced Reasoning — patterns for the hard problems where one prompt isn't enough.