Stage 04 · Pro · Module 19 of 26 · ~8h

Enterprise Scale

Patterns that survive thousands of users.


Once Claude is in your product, the next problem isn't model behaviour — it's everything else. Quotas, key rotation, multi-tenant isolation, audit trails, the day someone in finance asks "what is this $14k AWS line?" This module is the toolbox for that day.

By the end of this module you'll have

Time: about 2 hours for the basics, ~8 hours with all three notebooks.

Prerequisites: Modules 4 (API basics), 14 (production patterns), 16 (optimization). Familiarity with operating multi-tenant systems helps.


The shape of a Claude platform team

Once you have more than two teams using Claude, you have a platform problem. The solution is almost always a shared layer that centralizes retries, logging, budgets, and auth:

Application teams                            ┌─────────────────┐
┌───────┐ ┌───────┐ ┌───────┐                │  Anthropic API  │
│ Web   │ │ Email │ │  CRM  │                └────────┬────────┘
└──┬────┘ └──┬────┘ └──┬────┘                         ▲
   └─────────┴─────────┘                              │
             │                                        │
             ▼                                        │
   ┌────────────────────────┐                         │
   │  Internal Claude proxy │ ── retries ── logs ── budgets ── auth ──┘
   │  (one place, one team) │
   └────────────────────────┘

A small platform team owns the proxy, and application teams consume it. They never see the API key, they can't skip retries, and they get usage-by-feature dashboards for free.


A minimum viable proxy

In roughly 80 lines this gives you keys that never leave the proxy, per-tenant budgets, and uniform logs. Use FastAPI for an HTTP front; the same shape works as a Python library import.

# claude_proxy.py
import json, logging, os, time, uuid
from fastapi import FastAPI, Header, HTTPException, Request
from anthropic import Anthropic
from dotenv import load_dotenv

load_dotenv()
client = Anthropic()                            # ANTHROPIC_API_KEY held here, nowhere else
app = FastAPI()
log = logging.getLogger("claude.proxy")

# Per-tenant token budget (tokens/min). In real life: Redis with a sliding window.
BUDGETS: dict[str, int]   = {"web": 200_000, "email": 50_000, "crm": 100_000}
SPENT:   dict[str, list]  = {k: [] for k in BUDGETS}     # list of (epoch_sec, tokens)

def allow_and_record(tenant: str, tokens: int) -> bool:
    now = time.time()
    SPENT[tenant] = [(t, n) for t, n in SPENT[tenant] if now - t < 60]
    if sum(n for _, n in SPENT[tenant]) + tokens > BUDGETS.get(tenant, 0):
        return False
    SPENT[tenant].append((now, tokens))
    return True

@app.post("/messages")
async def messages(req: Request, x_tenant: str = Header(...), x_feature: str = Header(default="unknown")):
    if x_tenant not in BUDGETS:
        raise HTTPException(403, "Unknown tenant")
    body = await req.json()

    # Estimate before we spend. ~4 chars per token; swap in a real token
    # counter if you need accuracy. String content is the common case;
    # content-block lists count their text blocks.
    def _chars(content) -> int:
        if isinstance(content, str):
            return len(content)
        return sum(len(b.get("text", "")) for b in content if isinstance(b, dict))

    estimate = sum(_chars(m.get("content", "")) for m in body.get("messages", [])) // 4
    if not allow_and_record(x_tenant, estimate):
        raise HTTPException(429, "Tenant minute budget exhausted")

    request_id = str(uuid.uuid4())
    started = time.perf_counter()
    try:
        r = client.messages.create(**body, timeout=20.0)
    except Exception as exc:
        log.exception(json.dumps({"event":"claude.error","request_id":request_id,"tenant":x_tenant,"feature":x_feature,"error":type(exc).__name__}))
        raise HTTPException(502, str(exc))

    latency_ms = round((time.perf_counter() - started) * 1000)
    log.info(json.dumps({
        "event":"claude.call","request_id":request_id,"tenant":x_tenant,"feature":x_feature,
        "model":body.get("model"),"latency_ms":latency_ms,
        "in_tokens":r.usage.input_tokens,"out_tokens":r.usage.output_tokens,"stop":r.stop_reason,
    }))
    return r.model_dump()

What this gets you on day one:

What it doesn't get you (yet, deliberately):


Per-tenant budgets that work

The example above uses an in-memory dict; that breaks the second you run two replicas. The real version uses Redis (or any shared store) with a sliding-window counter:

Approach                             When to use it
In-memory counter                    Single-process demo. Don't ship.
Redis sliding window (per tenant)    Default for most teams. Sub-millisecond, accurate enough.
Token bucket per tenant              When you also need bursts; richer behaviour.
Anthropic-side rate limits only      Lazy. Means tenant A can starve tenant B.
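The token-bucket row can be sketched in a few lines of pure Python. This is a per-process sketch for illustration; a real deployment would keep the bucket state in a shared store such as Redis, and the rate and burst numbers here are only examples:

```python
import time

class TokenBucket:
    """Per-tenant token bucket: a steady refill rate plus burst capacity."""

    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec        # tokens restored per second
        self.capacity = burst           # maximum instantaneous burst
        self.tokens = float(burst)      # start full
        self.updated = time.monotonic()

    def allow(self, cost: int) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if cost <= self.tokens:
            self.tokens -= cost
            return True
        return False

# 200k tokens/min steady rate, 50k burst: the second large call is refused
# because the bucket refills gradually rather than resetting each minute.
bucket = TokenBucket(rate_per_sec=200_000 / 60, burst=50_000)
print(bucket.allow(40_000))   # True
print(bucket.allow(40_000))   # False
```

Compared with the sliding window, the bucket lets a tenant spend its whole burst at once, then forces it back to the steady rate, which is the "richer behaviour" the table refers to.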

Make the budget visible: a 429 should include "tenant web exhausted (200k/min)" so the application team can see it and act. Hidden quotas cause hidden tickets.


Secrets, keys, and compliance

A short list of disciplines that pay back instantly:


Quotas, fallbacks, degraded modes

When you hit a real outage (Anthropic, your network, a region), your application teams shouldn't all reinvent the same fallback. The proxy is the right place to:

Document the contract: callers shouldn't have to know which fallback fired, only that the headers and response shape stay stable.
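One shape that contract could take is a proxy-side helper that tries the requested model, then a cheaper fallback, then a canned degraded response, and reports which path it took in a header. This is a sketch: the helper name, the `x-claude-degraded` header, and the fallback policy are all illustrative choices, and `create_fn` stands in for the SDK call so the logic is testable:

```python
# Canned response for the fully-degraded path; shape is illustrative.
DEGRADED = {"degraded": True, "content": "Service busy; please retry shortly."}

def call_with_fallback(create_fn, body: dict, fallback_model: str):
    """Return (response, headers). Header value: 0 = primary, 1 = fallback model, 2 = canned."""
    try:
        return create_fn(**body), {"x-claude-degraded": "0"}
    except Exception:
        pass  # primary failed; try the cheaper model
    try:
        return create_fn(**{**body, "model": fallback_model}), {"x-claude-degraded": "1"}
    except Exception:
        return DEGRADED, {"x-claude-degraded": "2"}
```

Because every application team calls through the proxy, they all inherit this behaviour without writing (or subtly mis-writing) their own version of it.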


Observability you'll want by month two

The dashboards aren't the product. They're how the platform team finds the next thing to fix without waiting for a complaint.
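Because the proxy emits one JSON line per call, the first dashboard can be as simple as folding those lines into per-tenant token totals. A minimal sketch, using the same field names the proxy's `claude.call` log line emits:

```python
import json
from collections import defaultdict

def tokens_by_tenant(log_lines):
    """Fold the proxy's JSON log lines into per-tenant input/output token totals."""
    totals: dict[str, dict[str, int]] = defaultdict(lambda: {"in": 0, "out": 0})
    for line in log_lines:
        rec = json.loads(line)
        if rec.get("event") != "claude.call":    # skip errors and other events
            continue
        totals[rec["tenant"]]["in"] += rec["in_tokens"]
        totals[rec["tenant"]]["out"] += rec["out_tokens"]
    return dict(totals)
```

The same fold, grouped by `feature` instead of `tenant`, is what answers the "$14k AWS line" question from the introduction.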


Try changing one thing


Going deeper: open the notebooks


Module checklist


Next module

Module 20 · Testing & Evaluation — and the closer for the Pro stage: how to prove quality with numbers, not vibes.