Enterprise Scale
Patterns that survive thousands of users.
Once Claude is in your product, the next problem isn't model behaviour — it's everything else. Quotas, key rotation, multi-tenant isolation, audit trails, the day someone in finance asks "what is this $14k AWS line?" This module is the toolbox for that day.
By the end of this module you'll have
- A shared client layer so every team uses the same retries, logs, and budgets
- Per-tenant rate limiting that your finance team will thank you for
- A clean view of where governance, secrets, and compliance fit in
Time: about 2 hours for the basics, ~8 hours with all three notebooks.
Prerequisites: Modules 4 (API basics), 14 (production patterns), 16 (optimization). Familiarity with operating multi-tenant systems helps.
The shape of a Claude platform team
Once you have more than two teams using Claude, you have a platform problem. The solution is almost always a shared layer that sits between every application team and the API:
Application teams ┌─────────────────┐
┌───────┐ ┌───────┐ ┌───────┐ │ Anthropic API │
│ Web │ │ Email │ │ CRM │ └────────┬────────┘
└──┬────┘ └──┬────┘ └──┬────┘ ▲
└─────────┴─────────┘ │
│ │
▼ │
┌────────────────────────┐ │
│ Internal Claude proxy │ ── retries ── logs ── budgets ── auth ──┘
│ (one place, one team) │
└────────────────────────┘
A small platform team owns the proxy. Application teams consume it. They never see the API key, they never get to skip retries, they get usage-by-feature dashboards for free.
A minimum viable proxy
In ~80 lines this gives you keys-not-leaked, per-tenant budgets, and uniform logs. Use FastAPI for an HTTP front; the same shape works as a Python library import.
# claude_proxy.py
import json, logging, os, time, uuid
from fastapi import FastAPI, Header, HTTPException, Request
from anthropic import Anthropic
from dotenv import load_dotenv
load_dotenv()
client = Anthropic() # ANTHROPIC_API_KEY held here, nowhere else
app = FastAPI()
log = logging.getLogger("claude.proxy")
# Per-tenant token budget (tokens/min). In real life: Redis with a sliding window.
BUDGETS: dict[str, int] = {"web": 200_000, "email": 50_000, "crm": 100_000}
SPENT: dict[str, list] = {k: [] for k in BUDGETS} # list of (epoch_sec, tokens)
def allow_and_record(tenant: str, tokens: int) -> bool:
now = time.time()
SPENT[tenant] = [(t, n) for t, n in SPENT[tenant] if now - t < 60]
if sum(n for _, n in SPENT[tenant]) + tokens > BUDGETS.get(tenant, 0):
return False
SPENT[tenant].append((now, tokens))
return True
@app.post("/messages")
async def messages(req: Request, x_tenant: str = Header(...), x_feature: str = Header(default="unknown")):
if x_tenant not in BUDGETS:
raise HTTPException(403, "Unknown tenant")
body = await req.json()
# Estimate before we spend. ~4 chars per token; raise to a real estimator if needed.
estimate = sum(len(m.get("content", "")) for m in body.get("messages", [])) // 4
if not allow_and_record(x_tenant, estimate):
raise HTTPException(429, "Tenant minute budget exhausted")
request_id = str(uuid.uuid4())
started = time.perf_counter()
try:
r = client.messages.create(**body, timeout=20.0)
except Exception as exc:
log.exception(json.dumps({"event":"claude.error","request_id":request_id,"tenant":x_tenant,"feature":x_feature,"error":type(exc).__name__}))
raise HTTPException(502, str(exc))
latency_ms = round((time.perf_counter() - started) * 1000)
log.info(json.dumps({
"event":"claude.call","request_id":request_id,"tenant":x_tenant,"feature":x_feature,
"model":body.get("model"),"latency_ms":latency_ms,
"in_tokens":r.usage.input_tokens,"out_tokens":r.usage.output_tokens,"stop":r.stop_reason,
}))
return r.model_dump()
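An application team then calls the proxy instead of the SDK. A minimal sketch of a caller, assuming the proxy runs locally on port 8000; the feature name and prompt are illustrative, and the header names match the proxy above.
# caller_example.py: what an application team's call looks like
import requests

resp = requests.post(
    "http://localhost:8000/messages",
    headers={"x-tenant": "web", "x-feature": "search-summary"},
    json={
        "model": "claude-sonnet-4-5",
        "max_tokens": 300,
        "messages": [{"role": "user", "content": "Summarise this support ticket: ..."}],
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["content"][0]["text"])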
What this gets you on day one:
- Single point of compromise for the API key.
- Per-tenant budgets the second a team's experiment runs away.
- One log shape you can build dashboards and alerts against.
- Per-feature attribution, because callers send an X-Feature header (anything that omits it lands in "unknown").
What it doesn't get you (yet, deliberately):
- Caching, idempotency keys, replay attack defence — add when you need them, not before.
- Auth — slot in your existing service-to-service auth, don't reinvent it.
Per-tenant budgets that work
The example above uses an in-memory dict; that breaks the second you run two replicas. The real version uses Redis (or any shared store) with a sliding-window counter (sketched below):
| Approach | When to use it |
|---|---|
| In-memory counter | Single-process demo. Don't ship. |
| Redis sliding window (per tenant) | Default for most teams. Sub-millisecond, accurate enough. |
| Token bucket per tenant | When you also need bursts; richer behaviour. |
| Anthropic-side rate limits only | Lazy. Means tenant A can starve tenant B. |
Make the budget visible: a 429 should include "tenant web exhausted (200k/min)" so the application team can see it and act. Hidden quotas cause hidden tickets.
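Here's a minimal sketch of the Redis sliding-window row, assuming a local Redis and the redis-py client. The key scheme and the check_and_record name are illustrative stand-ins for allow_and_record above.
# redis_budget.py: per-tenant sliding-window spend shared across replicas
import time
import uuid

import redis

r = redis.Redis.from_url("redis://localhost:6379/0")
WINDOW_SECONDS = 60

def check_and_record(tenant: str, tokens: int, budget: int) -> bool:
    """True if the tenant is still under budget; records the spend if so.

    Check-then-write isn't atomic across replicas; move it into a Lua script
    if slight over-admission under contention matters to you.
    """
    key = f"budget:{tenant}"
    now = time.time()
    pipe = r.pipeline()
    pipe.zremrangebyscore(key, 0, now - WINDOW_SECONDS)  # drop spend older than the window
    pipe.zrange(key, 0, -1)                              # what's left inside the window
    _, members = pipe.execute()
    spent = sum(int(m.split(b":", 1)[1]) for m in members)
    if spent + tokens > budget:
        return False
    # Member = unique id + token count; score = timestamp, so old spend ages out.
    r.zadd(key, {f"{uuid.uuid4().hex}:{tokens}": now})
    r.expire(key, WINDOW_SECONDS * 2)
    return True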
Secrets, keys, and compliance
A short list of disciplines that pay back instantly:
- The API key lives in exactly one place — the proxy's secret manager. Application teams cannot read it.
- Rotate on a schedule (90 days is reasonable). Have a rotation runbook tested in staging.
- Per-environment keys. Production, staging, dev are different keys. A leaked dev key isn't a production incident.
- Audit trail of admin changes. Who raised the web budget from 200k to 500k? Why? Cheap to set up; expensive to retrofit.
- PII boundary at the proxy. Strip or hash known sensitive fields before sending to the API and before writing logs (a minimal sketch follows this list). Decide once, enforce everywhere.
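Here's a minimal sketch of that PII boundary, assuming a known list of sensitive field names. The field list and the hashing scheme are placeholders for whatever your data-classification policy actually requires.
# pii_boundary.py: scrub known sensitive fields before the request leaves the
# proxy and before anything is written to logs
import hashlib

SENSITIVE_FIELDS = {"email", "phone", "account_number"}  # placeholder field names

def scrub(record: dict) -> dict:
    """Return a copy with sensitive string values replaced by a short, stable hash."""
    cleaned = {}
    for key, value in record.items():
        if key in SENSITIVE_FIELDS and isinstance(value, str):
            digest = hashlib.sha256(value.encode()).hexdigest()[:12]
            cleaned[key] = f"sha256:{digest}"
        else:
            cleaned[key] = value
    return cleaned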
Quotas, fallbacks, degraded modes
When you hit a real outage (Anthropic, your network, a region), your application teams shouldn't all reinvent the same fallback. The proxy is the right place to:
- Fall back to a smaller model when the primary fails, with a header so callers know (sketched below).
- Return cached recent results for idempotent requests.
- Return a clear "degraded mode" response that frontends can treat consistently ("Claude is temporarily unavailable; here's what I can do offline").
Document the contract: callers don't have to opt in, but the fallback headers and the degraded response shape should be stable so frontends can rely on them.
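Here's a minimal sketch of that fallback path. The fallback model and the X-Claude-Fallback header name are illustrative; use whatever small model and header your platform standardises on.
# fallback.py: try the requested model, retry once on a smaller one,
# and return a stable degraded-mode shape if both fail
from anthropic import Anthropic, APIError
from fastapi.responses import JSONResponse

client = Anthropic()
FALLBACK_MODEL = "claude-3-5-haiku-latest"  # placeholder for your "smaller model"

def call_with_fallback(body: dict) -> JSONResponse:
    try:
        r = client.messages.create(**body, timeout=20.0)
        return JSONResponse(r.model_dump())
    except APIError:
        pass  # primary failed; fall through to the smaller model
    try:
        r = client.messages.create(**{**body, "model": FALLBACK_MODEL}, timeout=20.0)
        return JSONResponse(
            r.model_dump(),
            headers={"X-Claude-Fallback": FALLBACK_MODEL},  # so callers can tell
        )
    except APIError:
        # Degraded mode: one response shape every frontend can handle.
        return JSONResponse({"degraded": True, "reason": "claude_unavailable"}, status_code=503)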
Observability you'll want by month two
- Cost-by-feature dashboard (sum tokens × rates from logs, grouped by x_feature; a rough sketch follows this list).
- Latency P50/P90/P99 by feature and by tenant.
- Smell-test pass rate (Module 14) on a 1% sample, by feature.
- Budget exhaustion rate per tenant (how often they 429).
- Top offenders by cost — almost always two or three call sites. Optimise those first (Module 16).
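Here's a rough sketch of that first aggregation, assuming each log line is one of the proxy's bare JSON records. The per-million-token rates are placeholders; use the pricing for the models you actually run.
# cost_by_feature.py: fold the proxy's claude.call records into dollars per feature
import json
from collections import defaultdict

RATES = {  # $ per million tokens (input, output); placeholder numbers
    "claude-sonnet-4-5": (3.00, 15.00),
}

def cost_by_feature(log_path: str) -> dict[str, float]:
    totals: dict[str, float] = defaultdict(float)
    with open(log_path) as f:
        for line in f:
            try:
                rec = json.loads(line)
            except ValueError:
                continue  # skip non-JSON lines (startup noise, tracebacks)
            if rec.get("event") != "claude.call":
                continue
            in_rate, out_rate = RATES.get(rec.get("model"), (0.0, 0.0))
            cost = (rec["in_tokens"] * in_rate + rec["out_tokens"] * out_rate) / 1_000_000
            totals[rec.get("feature", "unknown")] += cost
    return dict(totals)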
The dashboards aren't the product. They're how the platform team finds the next thing to fix without waiting for a complaint.
Try changing one thing
- Add a priority header (high|normal|low). When the global budget is tight, drop low first (one possible shape is sketched after this list).
- Replace the in-memory budget dict with Redis. Run two replicas. Convince yourself it's accurate.
- Write a migration script: every existing direct Anthropic() call site is replaced by a call to your proxy. Count what you find; it's usually more than expected.
- Add a cache_control flag on certain endpoints so the proxy turns on prompt caching automatically.
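One possible shape for that first exercise, assuming an illustrative global budget and an x-priority header read in the endpoint signature (x_priority: str = Header(default="normal")); the threshold is an arbitrary starting point.
# priority_shedding.py: shed "low" traffic when the global minute budget is nearly spent
from fastapi import HTTPException

GLOBAL_BUDGET = 500_000  # placeholder: tokens/min across all tenants
SHED_AT = 0.9            # start dropping low-priority traffic at 90% of budget

def maybe_shed(priority: str, global_spent_this_minute: int) -> None:
    """Raise 429 for low-priority calls once the global budget is nearly gone."""
    if priority == "low" and global_spent_this_minute > GLOBAL_BUDGET * SHED_AT:
        raise HTTPException(429, "Low-priority traffic shed: global budget nearly exhausted")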
Going deeper: open the notebooks
- notebooks/01_introduction.ipynb: proxy patterns, per-tenant budgeting (~1.5–2h)
- notebooks/02_intermediate.ipynb: multi-region operation, region failover, audit trails (~2–3h)
- notebooks/03_advanced.ipynb: governance frameworks, compliance regimes, cost allocation (~1.5–2.5h)
Module checklist
- [ ] You can sketch the proxy + tenants diagram and name what each piece owns
- [ ] You've run the minimum proxy and seen a 429 from a tenant going over budget
- [ ] You know where API keys live in your real architecture (and that it's exactly one place)
- [ ] You can name three dashboards you'd build in month two
Next module
Module 20 · Testing & Evaluation — and the closer for the Pro stage: how to prove quality with numbers, not vibes.