Stage 03 · Builder · Module 14 of 26 · ~8h

Production Patterns

Logs, retries, safety — ship without surprises.


The day after you ship a Claude feature, three things go wrong: a model returns nonsense and you can't reproduce it, a slow response leaves a user staring at a timed-out page, or a single bad input cascades into a thousand bad outputs. This module gives you the small set of patterns that make those failures visible, recoverable, and survivable.

By the end of this module you'll have structured logging on every model call, a timeout/retry/fallback wrapper, and a lightweight online eval watching live traffic.

Time: about 2 hours for the basics, ~8 hours with all three notebooks.

Prerequisites: Module 4 (API basics) and Module 7 (building apps). Familiarity with at least one production system you've operated.


Pattern 1 · Log every call (carefully)

Every model call should write a structured log line. You'll thank yourself the first time a user reports a bad answer.

import json, time, uuid, logging
from anthropic import Anthropic
from dotenv import load_dotenv

load_dotenv()
client = Anthropic()
log = logging.getLogger("claude")

def truncated_prefix(text: str, n: int = 80) -> str:
    """First n chars only. Never log full prompt bodies in production."""
    return text[:n].replace("\n", "\\n") + ("…" if len(text) > n else "")

def call(*, prompt: str, model: str, **kw):
    request_id = str(uuid.uuid4())
    started = time.perf_counter()
    try:
        r = client.messages.create(
            model=model, max_tokens=kw.pop("max_tokens", 600),
            messages=[{"role": "user", "content": prompt}], **kw,
        )
        elapsed_ms = (time.perf_counter() - started) * 1000
        log.info(json.dumps({
            "event":      "claude.call",
            "request_id": request_id,
            "model":      model,
            "latency_ms": round(elapsed_ms),
            "in_tokens":  r.usage.input_tokens,
            "out_tokens": r.usage.output_tokens,
            "stop":       r.stop_reason,
            "prompt":     hashed_prefix(prompt),       # truncated only
        }))
        return r
    except Exception as exc:
        log.exception(json.dumps({
            "event":      "claude.error",
            "request_id": request_id,
            "model":      model,
            "error":      type(exc).__name__,
        }))
        raise

What this gets you, day one: searchable logs, latency percentiles, cost-per-feature dashboards (just sum tokens grouped by feature), and a request_id you can quote when a user complains. What it doesn't get you: PII in your logs.
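A sketch of the cost dashboard math, assuming line-delimited JSON logs and a "feature" field added to each log line (the per-token rates below are placeholders, not real pricing):

import json
from collections import defaultdict

# Placeholder per-token rates; substitute your model's actual pricing.
IN_RATE, OUT_RATE = 3.00 / 1_000_000, 15.00 / 1_000_000

def cost_per_feature(log_path: str) -> dict:
    """Sum token cost per feature from line-delimited claude.call logs."""
    totals = defaultdict(float)
    with open(log_path) as f:
        for line in f:
            entry = json.loads(line)
            if entry.get("event") != "claude.call":
                continue
            feature = entry.get("feature", "unknown")   # assumes you log one
            totals[feature] += (entry["in_tokens"] * IN_RATE
                                + entry["out_tokens"] * OUT_RATE)
    return dict(totals)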


Pattern 2 · Timeout + retry + fallback

Three layers, each handling a different class of problem:

import time, random
from anthropic import Anthropic, RateLimitError, APIConnectionError, APITimeoutError

client = Anthropic()  # or reuse the client from Pattern 1

TRANSIENT = (RateLimitError, APIConnectionError, APITimeoutError)

def call_with_recovery(prompt: str, *, primary="claude-sonnet-4-6", fallback="claude-haiku-4-5-20251001"):
    for attempt in range(4):
        try:
            return client.messages.create(
                model=primary, max_tokens=600, timeout=20.0,            # layer 1: timeout
                messages=[{"role": "user", "content": prompt}],
            )
        except TRANSIENT:
            if attempt == 3:
                break
            time.sleep((2 ** attempt) + random.random())                # layer 2: retry with backoff
    # layer 3: fallback to a cheaper, often more available model
    return client.messages.create(
        model=fallback, max_tokens=600, timeout=20.0,
        messages=[{"role": "user", "content": prompt}],
    )

Three rules to keep this honest:

  1. Set a real timeout. No timeout means your user is staring at a spinner forever.
  2. Cap retries. Four attempts is plenty. Anything more masks systemic problems.
  3. Have a fallback you'd actually ship. "Sorry, try again" is fine — that's still a fallback. Don't pretend Haiku is a perfect substitute for Sonnet; just decide in advance what degraded UX looks like (a minimal sketch follows this list).
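For the "Sorry, try again" flavor, a minimal sketch: wrap call_with_recovery and return a canned degraded answer if even the fallback model is down. The FallbackAnswer shape is made up for illustration.

from dataclasses import dataclass

@dataclass
class FallbackAnswer:
    """Stand-in shape for when no model answer is available."""
    text: str
    degraded: bool = True

def call_or_apologize(prompt: str):
    """Same three layers, but the true last resort is a canned message."""
    try:
        return call_with_recovery(prompt)
    except Exception:
        # Even the fallback model failed: admit it, keep the user unblocked.
        return FallbackAnswer("Sorry, we couldn't answer that right now. Please try again.")

Callers check .degraded to decide what the UI shows. The point is that someone decided this path in advance, not at 3 a.m. during an outage.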

Pattern 3 · Online evals (catch regressions in flight)

You can't run the full eval suite on every request, but you can run a small classifier on the response itself and alert when quality drops.

def passes_smell_test(prompt: str, response_text: str) -> bool:
    """Cheap, cheap signal. Misses subtle regressions but catches obvious ones."""
    judge = client.messages.create(
        model="claude-haiku-4-5-20251001", max_tokens=20,
        system=(
            "You are a quality gate. Reply with one word: PASS or FAIL.\n"
            "FAIL if the response is empty, refuses for a benign request, or contradicts itself."
        ),
        messages=[{
            "role": "user",
            "content": f"PROMPT:\n{prompt}\n\nRESPONSE:\n{response_text}\n\nVerdict:",
        }],
    )
    return judge.content[0].text.strip().upper().startswith("PASS")

Wire it in as a sample (e.g. 1% of traffic), log the failures with full request/response, and build a dashboard. Module 20 turns this into a real eval framework.
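A minimal sketch of that wiring, reusing call, log, and json from Pattern 1 (the sample rate and event name are illustrative):

import random

SAMPLE_RATE = 0.01   # judge roughly 1% of live traffic

def call_with_online_eval(prompt: str, **kw):
    r = call(prompt=prompt, model="claude-sonnet-4-6", **kw)
    if random.random() < SAMPLE_RATE:
        response_text = r.content[0].text
        if not passes_smell_test(prompt, response_text):
            # Failures are rare and worth full context; send these to a
            # restricted store, per the privacy table below.
            log.warning(json.dumps({
                "event":    "claude.smell_test_failed",
                "model":    "claude-sonnet-4-6",
                "prompt":   prompt,
                "response": response_text,
            }))
    return r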


What to log and what to never log

Log:
  - Model id, latency, in/out tokens, request_id, error type
  - Stop reason (end_turn, max_tokens, tool_use)
  - Truncated prompt prefix (≤ 100 chars)
  - Whether the smell test passed
  - Cost-per-call (in_tokens × in_rate + out_tokens × out_rate)

Don't log:
  - Full user prompts that contain PII or secrets
  - API keys, even hashed
  - Raw response bodies that may contain user data
  - Anything you couldn't justify in a privacy review
  - "Just for now" log-everything fields. They never come out.

A useful instinct: imagine an auditor reading your logs. Could they reconstruct who asked what? If yes, change what you log.
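One way to act on that instinct is a redaction pass before anything reaches the log. A minimal sketch: the two patterns below are illustrative, nowhere near a complete PII scrubber.

import re

# Illustrative patterns only; a real deployment needs a vetted PII scrubber.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text: str) -> str:
    """Mask obvious identifiers before a prefix is logged."""
    return PHONE.sub("[phone]", EMAIL.sub("[email]", text))

# In Pattern 1, log truncated_prefix(redact(prompt)) instead of the raw prompt.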


A small operational checklist

Before a Claude-backed feature goes to real users:

  - Every model call emits a structured log line with a request_id.
  - Every request has a real timeout.
  - Retries are capped and use exponential backoff with jitter.
  - A fallback you'd actually ship is wired in and tested.
  - An online eval samples live traffic and alerts on failures.
  - Every logged field would survive a privacy review.

If any of those is unchecked, you'll find out in production. Better to find out in staging.


Try changing one thing


Going deeper: open the notebooks


Module checklist


Next module

Module 15 · Advanced Reasoning — patterns for the hard problems where one prompt isn't enough.