Tokens & Limits
Understand tokens, costs, and context limits before they bite.
The single biggest "I didn't expect that bill" moment with any LLM is misunderstanding tokens. This module makes sure that doesn't happen to you, and shows you how to keep prompts inside the limits the API actually enforces.
By the end of this module you'll have
- A working intuition for what a token is and roughly how many your text contains
- The habit of reading `response.usage` after every call
- A simple token-aware truncation helper for handling long inputs
Time: about 45 minutes for the basics; the three notebooks add roughly 5–7 hours on top (per-notebook estimates below).
Prerequisites: Modules 1 through 4.
What a token is (without the lecture)
Roughly:
- 1 token ≈ 4 characters of English (so ~75 words ≈ 100 tokens).
- Whitespace and punctuation each cost a token or two.
- Code, JSON, and other languages tokenise differently — often denser.
- Numbers can split unpredictably (`12345` may be more than one token).
You're charged for input tokens (what you send) plus output tokens (what Claude writes back). Output tokens are typically a few times more expensive per token than input tokens.
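To see what that asymmetry means in dollars, here's a minimal sketch of a cost estimate built from a response's usage block. The per-million-token prices are placeholder numbers, not real rates; look up current pricing for your model.

```python
# PLACEHOLDER prices in USD per million tokens -- not real rates,
# check Anthropic's pricing page for the model you're using.
INPUT_PRICE_PER_MTOK = 1.00
OUTPUT_PRICE_PER_MTOK = 5.00

def estimate_cost(usage) -> float:
    """Approximate dollar cost of one call, given response.usage."""
    return (usage.input_tokens * INPUT_PRICE_PER_MTOK
            + usage.output_tokens * OUTPUT_PRICE_PER_MTOK) / 1_000_000
```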
The 80/20. For nearly all decisions, "characters / 4" is close enough. Use the real count from `response.usage` once you've sent the request.
Read your usage every time
Save as `track_usage.py`:
```python
from anthropic import Anthropic
from dotenv import load_dotenv

load_dotenv()
client = Anthropic()

response = client.messages.create(
    model="claude-haiku-4-5-20251001",
    max_tokens=200,
    messages=[
        {"role": "user", "content": "Explain DNS in three sentences."}
    ],
)

print(response.content[0].text)
print()
print(f"input tokens:  {response.usage.input_tokens}")
print(f"output tokens: {response.usage.output_tokens}")
print(f"stop reason:   {response.stop_reason}")
```
Three numbers worth watching:
| Field | What it tells you |
|---|---|
| `usage.input_tokens` | Size of everything you sent (system + messages + history) |
| `usage.output_tokens` | What Claude wrote — this is usually the biggest cost lever |
| `stop_reason` | `"end_turn"` = clean finish · `"max_tokens"` = you cut Claude off · `"stop_sequence"` = a custom stop fired |
If you see `"max_tokens"` and a truncated reply, raise `max_tokens`. If you see `"end_turn"` and the answer is too long anyway, lower `max_tokens` or tighten your prompt.
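If you want to handle the cut-off case automatically, one simple pattern is to retry once with a bigger budget. A sketch (the doubling factor is an arbitrary choice, and you pass `model=...` through as usual):

```python
def complete(messages, max_tokens=200, **kw):
    """Call the API; if the reply was cut off, retry once with double the budget."""
    r = client.messages.create(messages=messages, max_tokens=max_tokens, **kw)
    if r.stop_reason == "max_tokens":
        # Reply ended mid-thought: give it more room and try again.
        r = client.messages.create(messages=messages, max_tokens=max_tokens * 2, **kw)
    return r
```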
Estimate before you send
You don't always need a perfect tokenizer. For most decisions, this is enough:
```python
def approx_tokens(text: str) -> int:
    """Rough estimate: ~4 chars per token. Errs slightly low for code."""
    return max(1, len(text) // 4)
```
For exact counts (when it matters — pricing reports, hard limits) use the Anthropic count-tokens endpoint; note that tiktoken is OpenAI's tokenizer and only approximates Claude's. For day-to-day plumbing, the line above is fine.
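Recent versions of the Python SDK expose that endpoint as `client.messages.count_tokens`; a quick sketch, assuming your SDK version supports it:

```python
# Exact input-token count, computed server-side before you commit to a call.
count = client.messages.count_tokens(
    model="claude-haiku-4-5-20251001",
    messages=[{"role": "user", "content": "Explain DNS in three sentences."}],
)
print(count.input_tokens)
```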
Token-aware truncation for long inputs
The most common "context too long" fix: keep the start and end of a document, summarise (or just drop) the middle.
```python
def truncate_to_tokens(text: str, budget_tokens: int) -> str:
    """Naive but predictable: keep text under a token budget."""
    char_budget = budget_tokens * 4
    if len(text) <= char_budget:
        return text
    head = text[: char_budget // 2]
    tail = text[-(char_budget // 2):]
    return f"{head}\n\n[... {len(text) - char_budget} chars omitted ...]\n\n{tail}"
```
```python
with open("some_long_file.txt") as f:
    document = f.read()

trimmed = truncate_to_tokens(document, budget_tokens=20_000)

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=800,
    messages=[
        {"role": "user", "content": f"Summarise this document:\n\n{trimmed}"}
    ],
)
```
For real document QA, summarise the dropped middle instead of throwing it away — Module 9 (RAG) shows how to do that properly. This helper is the cheap version that's right 80% of the time.
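For a taste of that, here's a rough sketch of the "summarise the middle" variant, reusing the `client` and the Haiku model name from earlier (swap in whatever model you like). It costs one extra cheap call, and the middle is clipped so the summary request can't blow the context window itself:

```python
def truncate_with_summary(text: str, budget_tokens: int) -> str:
    """Like truncate_to_tokens, but replace the dropped middle with a
    one-paragraph model-written summary (one extra call)."""
    char_budget = budget_tokens * 4
    if len(text) <= char_budget:
        return text
    half = char_budget // 2
    head, middle, tail = text[:half], text[half:-half], text[-half:]
    summary = client.messages.create(
        model="claude-haiku-4-5-20251001",  # a cheap model is fine for this
        max_tokens=300,
        messages=[{
            "role": "user",
            # Clip the middle so this call stays inside the context window too.
            "content": f"Summarise this excerpt in one paragraph:\n\n{middle[:100_000]}",
        }],
    ).content[0].text
    return f"{head}\n\n[middle summarised: {summary}]\n\n{tail}"
```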
Rate limits, briefly
You'll hit two kinds of limit:
- Per-minute request and token limits. Surfaced as `RateLimitError`. The retry wrapper from Module 4 handles transient bursts; for sustained load, throttle on your side (a minimal sketch follows this list).
- The model's context window. A hard ceiling on input + output tokens combined. If you exceed it you get a `BadRequestError`; no retry will save you — you have to truncate or chunk.
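Here's the minimal client-side throttle the first bullet mentions. It assumes a self-imposed budget of 50 requests per minute; replace that with whatever your tier actually allows:

```python
import time

MIN_INTERVAL = 60 / 50  # assumed budget: 50 requests/minute (adjust to your tier)
_last_call = 0.0

def throttled_create(**kw):
    """Sleep just enough to keep calls spaced under the per-minute budget."""
    global _last_call
    wait = MIN_INTERVAL - (time.monotonic() - _last_call)
    if wait > 0:
        time.sleep(wait)
    _last_call = time.monotonic()
    return client.messages.create(**kw)
```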
A small monitor that prints running totals can save your wallet:
```python
total_in = total_out = 0

def call(messages, **kw):
    global total_in, total_out
    r = client.messages.create(messages=messages, **kw)
    total_in += r.usage.input_tokens
    total_out += r.usage.output_tokens
    return r

# ... use call(...) instead of client.messages.create(...)
print(f"running total — in: {total_in} out: {total_out}")
```
Try changing one thing
- Set `max_tokens=20`. Watch `stop_reason` flip to `"max_tokens"` — and notice the reply is mid-sentence.
- Run the same prompt on Haiku and Sonnet and compare token counts. They're usually similar but not identical.
- Send a document far bigger than the context window (millions of characters) with no truncation. Read the error you get back — it's friendlier than you'd expect.
- Add `system="..."` and watch `input_tokens` rise. Now you know what your "personality" is costing.
Going deeper: open the notebooks
- `notebooks/01_introduction.ipynb` — tokenizer intuition, common pitfalls, chunking (~1.5–2h)
- `notebooks/02_intermediate.ipynb` — adaptive retrieval under tight budgets, summarisation pipelines (~2–3h)
- `notebooks/03_advanced.ipynb` — hard vs soft limits, SLOs, postmortems (~1.5–2.5h)
Module checklist
- [ ] You can estimate tokens for any English passage to within ~20%
- [ ] You read `response.usage` and `response.stop_reason` after every call
- [ ] You have a truncation helper for inputs that might exceed your budget
- [ ] You know which exception means "retry" and which means "fix your request"
Stage 01 complete
You've finished the Foundations stage. You can talk to Claude from Python, pick a model, write prompts that work, wrap calls safely, and keep token costs honest. That's a real skill set.
Next up is Stage 02 · Practitioner — building real things end-to-end.