Tokens & Limits
Understand tokens, costs, and context limits before they bite.
The single biggest "I didn't expect that bill" moment with any LLM is misunderstanding tokens. This module makes sure that doesn't happen to you, and shows you how to keep prompts inside the limits the API actually enforces.
By the end of this module you'll have
- A working intuition for what a token is and roughly how many your text contains
- The habit of reading `response.usage` after every call
- A simple token-aware truncation helper for handling long inputs
Time: about 45 minutes for the basics; the three notebooks add roughly 5–7 hours on top (per-notebook estimates below).
Prerequisites: Modules 1 through 4.
What a token is (without the lecture)
Roughly:
- 1 token ≈ 4 characters of English (so ~75 words ≈ 100 tokens).
- Whitespace and punctuation each cost a token or two.
- Code, JSON, and other languages tokenise differently — often denser.
- Numbers can split unpredictably (`12345` may be more than one token).
You're charged for input tokens (what you send) plus output tokens (what Claude writes back). Output tokens are typically a few times more expensive per token than input tokens.
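To see what that asymmetry means in dollars, here's a minimal sketch of a cost estimate built from a response's usage block. The per-million-token prices are placeholder numbers, not real rates; look up current pricing for your model.

```python
# PLACEHOLDER prices in USD per million tokens -- not real rates,
# check Anthropic's pricing page for the model you're using.
INPUT_PRICE_PER_MTOK = 1.00
OUTPUT_PRICE_PER_MTOK = 5.00

def estimate_cost(usage) -> float:
    """Approximate dollar cost of one call, given response.usage."""
    return (usage.input_tokens * INPUT_PRICE_PER_MTOK
            + usage.output_tokens * OUTPUT_PRICE_PER_MTOK) / 1_000_000
```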
The 80/20. For nearly all decisions, "characters / 4" is close enough. Use the real count from `response.usage` once you've sent the request.
Read your usage every time
Save as `track_usage.py`:
```python
from anthropic import Anthropic
from dotenv import load_dotenv

load_dotenv()
client = Anthropic()

response = client.messages.create(
    model="claude-haiku-4-5-20251001",
    max_tokens=200,
    messages=[
        {"role": "user", "content": "Explain DNS in three sentences."}
    ],
)

print(response.content[0].text)
print()
print(f"input tokens:  {response.usage.input_tokens}")
print(f"output tokens: {response.usage.output_tokens}")
print(f"stop reason:   {response.stop_reason}")
```
Three numbers worth watching:
| Field | What it tells you |
|---|---|
| `usage.input_tokens` | Size of everything you sent (system + messages + history) |
| `usage.output_tokens` | What Claude wrote — this is usually the biggest cost lever |
| `stop_reason` | `"end_turn"` = clean finish · `"max_tokens"` = you cut Claude off · `"stop_sequence"` = a custom stop fired |
If you see `"max_tokens"` and a truncated reply, raise `max_tokens`. If you see `"end_turn"` and the answer is too long anyway, lower `max_tokens` or tighten your prompt.
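If you want to handle the cut-off case automatically, one simple pattern is to retry once with a bigger budget. A sketch (the doubling factor is an arbitrary choice, and you pass `model=...` through as usual):

```python
def complete(messages, max_tokens=200, **kw):
    """Call the API; if the reply was cut off, retry once with double the budget."""
    r = client.messages.create(messages=messages, max_tokens=max_tokens, **kw)
    if r.stop_reason == "max_tokens":
        # Reply ended mid-thought: give it more room and try again.
        r = client.messages.create(messages=messages, max_tokens=max_tokens * 2, **kw)
    return r
```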
Estimate before you send
You don't always need a perfect tokenizer. For most decisions, this is enough:
```python
def approx_tokens(text: str) -> int:
    """Rough estimate: ~4 chars per token. Errs slightly low for code."""
    return max(1, len(text) // 4)
```
For exact counts (when it matters — pricing reports, hard limits) use the Anthropic count-tokens endpoint; note that tiktoken is OpenAI's tokenizer and only approximates Claude's. For day-to-day plumbing, the line above is fine.
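Recent versions of the Python SDK expose that endpoint as `client.messages.count_tokens`; a quick sketch, assuming your SDK version supports it:

```python
# Exact input-token count, computed server-side before you commit to a call.
count = client.messages.count_tokens(
    model="claude-haiku-4-5-20251001",
    messages=[{"role": "user", "content": "Explain DNS in three sentences."}],
)
print(count.input_tokens)
```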
Token-aware truncation for long inputs
The most common "context too long" fix: keep the start and end of a document, summarise (or just drop) the middle.
```python
def truncate_to_tokens(text: str, budget_tokens: int) -> str:
    """Naive but predictable: keep text under a token budget."""
    char_budget = budget_tokens * 4
    if len(text) <= char_budget:
        return text
    head = text[: char_budget // 2]
    tail = text[-(char_budget // 2):]
    return f"{head}\n\n[... {len(text) - char_budget} chars omitted ...]\n\n{tail}"
```
```python
with open("some_long_file.txt") as f:
    document = f.read()

trimmed = truncate_to_tokens(document, budget_tokens=20_000)

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=800,
    messages=[
        {"role": "user", "content": f"Summarise this document:\n\n{trimmed}"}
    ],
)
```
For real document QA, summarise the dropped middle instead of throwing it away — Module 9 (RAG) shows how to do that properly. This helper is the cheap version that's right 80% of the time.
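For a taste of that, here's a rough sketch of the "summarise the middle" variant, reusing the `client` and the Haiku model name from earlier (swap in whatever model you like). It costs one extra cheap call, and the middle is clipped so the summary request can't blow the context window itself:

```python
def truncate_with_summary(text: str, budget_tokens: int) -> str:
    """Like truncate_to_tokens, but replace the dropped middle with a
    one-paragraph model-written summary (one extra call)."""
    char_budget = budget_tokens * 4
    if len(text) <= char_budget:
        return text
    half = char_budget // 2
    head, middle, tail = text[:half], text[half:-half], text[-half:]
    summary = client.messages.create(
        model="claude-haiku-4-5-20251001",  # a cheap model is fine for this
        max_tokens=300,
        messages=[{
            "role": "user",
            # Clip the middle so this call stays inside the context window too.
            "content": f"Summarise this excerpt in one paragraph:\n\n{middle[:100_000]}",
        }],
    ).content[0].text
    return f"{head}\n\n[middle summarised: {summary}]\n\n{tail}"
```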
Rate limits, briefly
You'll hit two kinds of limit:
- Per-minute request and token limits. Surfaced as `RateLimitError`. The retry wrapper from Module 4 handles transient bursts; for sustained load, throttle on your side (a minimal sketch follows this list).
- The model's context window. A hard ceiling on input + output tokens combined. If you exceed it you get a `BadRequestError`; no retry will save you — you have to truncate or chunk.
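Here's the minimal client-side throttle the first bullet mentions. It assumes a self-imposed budget of 50 requests per minute; replace that with whatever your tier actually allows:

```python
import time

MIN_INTERVAL = 60 / 50  # assumed budget: 50 requests/minute (adjust to your tier)
_last_call = 0.0

def throttled_create(**kw):
    """Sleep just enough to keep calls spaced under the per-minute budget."""
    global _last_call
    wait = MIN_INTERVAL - (time.monotonic() - _last_call)
    if wait > 0:
        time.sleep(wait)
    _last_call = time.monotonic()
    return client.messages.create(**kw)
```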
A small monitor that prints running totals can save your wallet:
```python
total_in = total_out = 0

def call(messages, **kw):
    global total_in, total_out
    r = client.messages.create(messages=messages, **kw)
    total_in += r.usage.input_tokens
    total_out += r.usage.output_tokens
    return r

# ... use call(...) instead of client.messages.create(...)
print(f"running total — in: {total_in} out: {total_out}")
```
Try changing one thing
- Set `max_tokens=20`. Watch `stop_reason` flip to `"max_tokens"` — and notice the reply is mid-sentence.
- Run the same prompt on Haiku and Sonnet and compare token counts. They're usually similar but not identical.
- Send a document far bigger than the context window (millions of characters) with no truncation. Read the error you get back — it's friendlier than you'd expect.
- Add `system="..."` and watch `input_tokens` rise. Now you know what your "personality" is costing.
Going deeper: open the notebooks
- `notebooks/01_introduction.ipynb` — tokenizer intuition, common pitfalls, chunking (~1.5–2h)
- `notebooks/02_intermediate.ipynb` — adaptive retrieval under tight budgets, summarisation pipelines (~2–3h)
- `notebooks/03_advanced.ipynb` — hard vs soft limits, SLOs, postmortems (~1.5–2.5h)
Module checklist
- [ ] You can estimate tokens for any English passage to within ~20%
- [ ] You read `response.usage` and `response.stop_reason` after every call
- [ ] You have a truncation helper for inputs that might exceed your budget
- [ ] You know which exception means "retry" and which means "fix your request"
Stage 01 complete
You've finished the Foundations stage. You can talk to Claude from Python, pick a model, write prompts that work, wrap calls safely, and keep token costs honest. That's a real skill set.
Next up is Stage 02 · Practitioner — building real things end-to-end.