Enterprise Scale
Patterns that survive thousands of users.
Once Claude is in your product, the next problem isn't model behaviour — it's everything else. Quotas, key rotation, multi-tenant isolation, audit trails, the day someone in finance asks "what is this $14k AWS line?" This module is the toolbox for that day.
By the end of this module you'll have
- A shared client layer so every team uses the same retries, logs, and budgets
- Per-tenant rate limiting that your finance team will thank you for
- A clean view of where governance, secrets, and compliance fit in
Time: about 2 hours for the basics, ~8 hours with all three notebooks.
Prerequisites: Modules 4 (API basics), 14 (production patterns), 16 (optimization). Familiarity with operating multi-tenant systems helps.
The shape of a Claude platform team
Once you have more than two teams using Claude, you have a platform problem. The solution is almost always a shared layer that sits between every application team and the API:
Application teams ┌─────────────────┐
┌───────┐ ┌───────┐ ┌───────┐ │ Anthropic API │
│ Web │ │ Email │ │ CRM │ └────────┬────────┘
└──┬────┘ └──┬────┘ └──┬────┘ ▲
└─────────┴─────────┘ │
│ │
▼ │
┌────────────────────────┐ │
│ Internal Claude proxy │ ── retries ── logs ── budgets ── auth ──┘
│ (one place, one team) │
└────────────────────────┘
A small platform team owns the proxy. Application teams consume it. They never see the API key, they never get to skip retries, they get usage-by-feature dashboards for free.
A minimum viable proxy
In ~80 lines this gives you keys-not-leaked, per-tenant budgets, and uniform logs. Use FastAPI for an HTTP front; the same shape works as a Python library import.
# claude_proxy.py
import json, logging, os, time, uuid
from fastapi import FastAPI, Header, HTTPException, Request
from anthropic import Anthropic
from dotenv import load_dotenv
load_dotenv()
client = Anthropic() # ANTHROPIC_API_KEY held here, nowhere else
app = FastAPI()
log = logging.getLogger("claude.proxy")
# Per-tenant token budget (tokens/min). In real life: Redis with a sliding window.
BUDGETS: dict[str, int] = {"web": 200_000, "email": 50_000, "crm": 100_000}
SPENT: dict[str, list] = {k: [] for k in BUDGETS} # list of (epoch_sec, tokens)
def allow_and_record(tenant: str, tokens: int) -> bool:
now = time.time()
SPENT[tenant] = [(t, n) for t, n in SPENT[tenant] if now - t < 60]
if sum(n for _, n in SPENT[tenant]) + tokens > BUDGETS.get(tenant, 0):
return False
SPENT[tenant].append((now, tokens))
return True
@app.post("/messages")
async def messages(req: Request, x_tenant: str = Header(...), x_feature: str = Header(default="unknown")):
if x_tenant not in BUDGETS:
raise HTTPException(403, "Unknown tenant")
body = await req.json()
# Estimate before we spend. ~4 chars per token; raise to a real estimator if needed.
estimate = sum(len(m.get("content", "")) for m in body.get("messages", [])) // 4
if not allow_and_record(x_tenant, estimate):
raise HTTPException(429, "Tenant minute budget exhausted")
request_id = str(uuid.uuid4())
started = time.perf_counter()
try:
r = client.messages.create(**body, timeout=20.0)
except Exception as exc:
log.exception(json.dumps({"event":"claude.error","request_id":request_id,"tenant":x_tenant,"feature":x_feature,"error":type(exc).__name__}))
raise HTTPException(502, str(exc))
latency_ms = round((time.perf_counter() - started) * 1000)
log.info(json.dumps({
"event":"claude.call","request_id":request_id,"tenant":x_tenant,"feature":x_feature,
"model":body.get("model"),"latency_ms":latency_ms,
"in_tokens":r.usage.input_tokens,"out_tokens":r.usage.output_tokens,"stop":r.stop_reason,
}))
return r.model_dump()
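An application team then calls the proxy instead of the SDK. A minimal sketch of a caller, assuming the proxy runs locally on port 8000; the feature name and prompt are illustrative, and the header names match the proxy above.
# caller_example.py: what an application team's call looks like
import requests

resp = requests.post(
    "http://localhost:8000/messages",
    headers={"x-tenant": "web", "x-feature": "search-summary"},
    json={
        "model": "claude-sonnet-4-5",
        "max_tokens": 300,
        "messages": [{"role": "user", "content": "Summarise this support ticket: ..."}],
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["content"][0]["text"])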
What this gets you on day one:
- Single point of compromise for the API key.
- Per-tenant budgets the second a team's experiment runs away.
- One log shape you can build dashboards and alerts against.
- Per-feature attribution, because callers send an X-Feature header (anything that omits it lands in "unknown").
What it doesn't get you (yet, deliberately):
- Caching, idempotency keys, replay attack defence — add when you need them, not before.
- Auth — slot in your existing service-to-service auth, don't reinvent it.
Per-tenant budgets that work
The example above uses an in-memory dict; that breaks the second you run two replicas. The real version uses Redis (or any shared store) with a sliding-window counter (sketched below):
| Approach | When to use it |
|---|---|
| In-memory counter | Single-process demo. Don't ship. |
| Redis sliding window (per tenant) | Default for most teams. Sub-millisecond, accurate enough. |
| Token bucket per tenant | When you also need bursts; richer behaviour. |
| Anthropic-side rate limits only | Lazy. Means tenant A can starve tenant B. |
Make the budget visible: a 429 should include "tenant web exhausted (200k/min)" so the application team can see it and act. Hidden quotas cause hidden tickets.
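Here's a minimal sketch of the Redis sliding-window row, assuming a local Redis and the redis-py client. The key scheme and the check_and_record name are illustrative stand-ins for allow_and_record above.
# redis_budget.py: per-tenant sliding-window spend shared across replicas
import time
import uuid

import redis

r = redis.Redis.from_url("redis://localhost:6379/0")
WINDOW_SECONDS = 60

def check_and_record(tenant: str, tokens: int, budget: int) -> bool:
    """True if the tenant is still under budget; records the spend if so.

    Check-then-write isn't atomic across replicas; move it into a Lua script
    if slight over-admission under contention matters to you.
    """
    key = f"budget:{tenant}"
    now = time.time()
    pipe = r.pipeline()
    pipe.zremrangebyscore(key, 0, now - WINDOW_SECONDS)  # drop spend older than the window
    pipe.zrange(key, 0, -1)                              # what's left inside the window
    _, members = pipe.execute()
    spent = sum(int(m.split(b":", 1)[1]) for m in members)
    if spent + tokens > budget:
        return False
    # Member = unique id + token count; score = timestamp, so old spend ages out.
    r.zadd(key, {f"{uuid.uuid4().hex}:{tokens}": now})
    r.expire(key, WINDOW_SECONDS * 2)
    return True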
Secrets, keys, and compliance
A short list of disciplines that pay back instantly:
- The API key lives in exactly one place — the proxy's secret manager. Application teams cannot read it.
- Rotate on a schedule (90 days is reasonable). Have a rotation runbook tested in staging.
- Per-environment keys. Production, staging, dev are different keys. A leaked dev key isn't a production incident.
- Audit trail of admin changes. Who raised the web budget from 200k to 500k? Why? Cheap to set up; expensive to retrofit.
- PII boundary at the proxy. Strip or hash known sensitive fields before sending to the API and before writing logs (a minimal sketch follows this list). Decide once, enforce everywhere.
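Here's a minimal sketch of that PII boundary, assuming a known list of sensitive field names. The field list and the hashing scheme are placeholders for whatever your data-classification policy actually requires.
# pii_boundary.py: scrub known sensitive fields before the request leaves the
# proxy and before anything is written to logs
import hashlib

SENSITIVE_FIELDS = {"email", "phone", "account_number"}  # placeholder field names

def scrub(record: dict) -> dict:
    """Return a copy with sensitive string values replaced by a short, stable hash."""
    cleaned = {}
    for key, value in record.items():
        if key in SENSITIVE_FIELDS and isinstance(value, str):
            digest = hashlib.sha256(value.encode()).hexdigest()[:12]
            cleaned[key] = f"sha256:{digest}"
        else:
            cleaned[key] = value
    return cleaned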
Quotas, fallbacks, degraded modes
When you hit a real outage (Anthropic, your network, a region), your application teams shouldn't all reinvent the same fallback. The proxy is the right place to:
- Fall back to a smaller model when the primary fails, with a header so callers know (sketched below).
- Return cached recent results for idempotent requests.
- Return a clear "degraded mode" response that frontends can treat consistently ("Claude is temporarily unavailable; here's what I can do offline").
Document the contract: callers don't have to opt in, but the fallback headers and the degraded response shape should be stable so frontends can rely on them.
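Here's a minimal sketch of that fallback path. The fallback model and the X-Claude-Fallback header name are illustrative; use whatever small model and header your platform standardises on.
# fallback.py: try the requested model, retry once on a smaller one,
# and return a stable degraded-mode shape if both fail
from anthropic import Anthropic, APIError
from fastapi.responses import JSONResponse

client = Anthropic()
FALLBACK_MODEL = "claude-3-5-haiku-latest"  # placeholder for your "smaller model"

def call_with_fallback(body: dict) -> JSONResponse:
    try:
        r = client.messages.create(**body, timeout=20.0)
        return JSONResponse(r.model_dump())
    except APIError:
        pass  # primary failed; fall through to the smaller model
    try:
        r = client.messages.create(**{**body, "model": FALLBACK_MODEL}, timeout=20.0)
        return JSONResponse(
            r.model_dump(),
            headers={"X-Claude-Fallback": FALLBACK_MODEL},  # so callers can tell
        )
    except APIError:
        # Degraded mode: one response shape every frontend can handle.
        return JSONResponse({"degraded": True, "reason": "claude_unavailable"}, status_code=503)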
Observability you'll want by month two
- Cost-by-feature dashboard (sum tokens × rates from logs, grouped by x_feature; a rough sketch follows this list).
- Latency P50/P90/P99 by feature and by tenant.
- Smell-test pass rate (Module 14) on a 1% sample, by feature.
- Budget exhaustion rate per tenant (how often they 429).
- Top offenders by cost — almost always two or three call sites. Optimise those first (Module 16).
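Here's a rough sketch of that first aggregation, assuming each log line is one of the proxy's bare JSON records. The per-million-token rates are placeholders; use the pricing for the models you actually run.
# cost_by_feature.py: fold the proxy's claude.call records into dollars per feature
import json
from collections import defaultdict

RATES = {  # $ per million tokens (input, output); placeholder numbers
    "claude-sonnet-4-5": (3.00, 15.00),
}

def cost_by_feature(log_path: str) -> dict[str, float]:
    totals: dict[str, float] = defaultdict(float)
    with open(log_path) as f:
        for line in f:
            try:
                rec = json.loads(line)
            except ValueError:
                continue  # skip non-JSON lines (startup noise, tracebacks)
            if rec.get("event") != "claude.call":
                continue
            in_rate, out_rate = RATES.get(rec.get("model"), (0.0, 0.0))
            cost = (rec["in_tokens"] * in_rate + rec["out_tokens"] * out_rate) / 1_000_000
            totals[rec.get("feature", "unknown")] += cost
    return dict(totals)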
The dashboards aren't the product. They're how the platform team finds the next thing to fix without waiting for a complaint.
Try changing one thing
- Add a priority header (high|normal|low). When the global budget is tight, drop low first (one possible shape is sketched after this list).
- Replace the in-memory budget dict with Redis. Run two replicas. Convince yourself it's accurate.
- Write a migration script: every existing direct Anthropic() call site is replaced by a call to your proxy. Count what you find; it's usually more than expected.
- Add a cache_control flag on certain endpoints so the proxy turns on prompt caching automatically.
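One possible shape for that first exercise, assuming an illustrative global budget and an x-priority header read in the endpoint signature (x_priority: str = Header(default="normal")); the threshold is an arbitrary starting point.
# priority_shedding.py: shed "low" traffic when the global minute budget is nearly spent
from fastapi import HTTPException

GLOBAL_BUDGET = 500_000  # placeholder: tokens/min across all tenants
SHED_AT = 0.9            # start dropping low-priority traffic at 90% of budget

def maybe_shed(priority: str, global_spent_this_minute: int) -> None:
    """Raise 429 for low-priority calls once the global budget is nearly gone."""
    if priority == "low" and global_spent_this_minute > GLOBAL_BUDGET * SHED_AT:
        raise HTTPException(429, "Low-priority traffic shed: global budget nearly exhausted")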
Going deeper: open the notebooks
- notebooks/01_introduction.ipynb: proxy patterns, per-tenant budgeting (~1.5–2h)
- notebooks/02_intermediate.ipynb: multi-region operation, region failover, audit trails (~2–3h)
- notebooks/03_advanced.ipynb: governance frameworks, compliance regimes, cost allocation (~1.5–2.5h)
Module checklist
- [ ] You can sketch the proxy + tenants diagram and name what each piece owns
- [ ] You've run the minimum proxy and seen a 429 from a tenant going over budget
- [ ] You know where API keys live in your real architecture (and that it's exactly one place)
- [ ] You can name three dashboards you'd build in month two
Next module
Module 20 · Testing & Evaluation — and the closer for the Pro stage: how to prove quality with numbers, not vibes.