Stage 01 · Foundations · Module 5 of 26 · ~3h

Tokens & Limits

Understand tokens, costs, and context limits before they bite.


The single biggest "I didn't expect that bill" moment with any LLM is misunderstanding tokens. This module makes sure that doesn't happen to you, and shows you how to keep prompts inside the limits the API actually enforces.

By the end of this module you'll be able to read token usage from real responses, estimate sizes before you send, and keep long inputs inside the context window.

Time: about 45 minutes for the basics, ~3 hours with all three notebooks.

Prerequisites: Modules 1 through 4.


What a token is (without the lecture)

Roughly: a token is a short chunk of text the model reads and writes in — about 4 characters, or three-quarters of an English word, on average.

You're charged for input tokens (what you send) plus output tokens (what Claude writes back). Output tokens are typically several times more expensive per token than input tokens.

The 80/20. For nearly all decisions, "characters / 4" is close enough. Use the real count from response.usage once you've sent the request.
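Once you have token counts, turning them into dollars is one multiplication per side. A minimal sketch — the per-million-token rates below are placeholders invented for illustration, not real prices; look up the current rates for your model:

```python
# Placeholder prices, per million tokens: (input rate, output rate).
# These numbers are made up -- substitute the real rates for your model.
PRICE_PER_MTOK = {
    "example-small": (1.00, 5.00),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollars for one request, given the counts from response.usage."""
    in_rate, out_rate = PRICE_PER_MTOK[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

print(estimate_cost("example-small", input_tokens=2_000, output_tokens=500))
```

Note how 500 output tokens cost more here than 2,000 input tokens — that asymmetry is why trimming verbose replies is usually the first lever to pull.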


Read your usage every time

Save as track_usage.py:

from anthropic import Anthropic
from dotenv import load_dotenv

load_dotenv()
client = Anthropic()

response = client.messages.create(
    model="claude-haiku-4-5-20251001",
    max_tokens=200,
    messages=[
        {"role": "user", "content": "Explain DNS in three sentences."}
    ],
)

print(response.content[0].text)
print()
print(f"input  tokens: {response.usage.input_tokens}")
print(f"output tokens: {response.usage.output_tokens}")
print(f"stop reason:   {response.stop_reason}")

Three numbers worth watching:

Field               | What it tells you
usage.input_tokens  | Size of everything you sent (system + messages + history)
usage.output_tokens | What Claude wrote — this is usually the biggest cost lever
stop_reason         | "end_turn" = clean finish · "max_tokens" = you cut Claude off · "stop_sequence" = a custom stop fired

If you see "max_tokens" and a truncated reply, raise max_tokens. If you see "end_turn" and the answer is too long anyway, lower max_tokens or tighten your prompt.


Estimate before you send

You don't always need a perfect tokenizer. For most decisions, this is enough:

def approx_tokens(text: str) -> int:
    """Rough estimate: ~4 chars per token. Tends to err low for code-heavy text."""
    return max(1, len(text) // 4)

For exact counts (when it matters — pricing reports, hard limits) use tiktoken or hit the Anthropic count-tokens endpoint. For day-to-day plumbing, the line above is fine.


Token-aware truncation for long inputs

The most common "context too long" fix: keep the start and end of a document, summarise (or just drop) the middle.

def truncate_to_tokens(text: str, budget_tokens: int) -> str:
    """Naive but predictable: keep text under a token budget."""
    char_budget = budget_tokens * 4
    if len(text) <= char_budget:
        return text
    head = text[: char_budget // 2]
    tail = text[-(char_budget // 2):]
    return f"{head}\n\n[... {len(text) - char_budget} chars omitted ...]\n\n{tail}"

with open("some_long_file.txt") as f:
    document = f.read()

trimmed = truncate_to_tokens(document, budget_tokens=20_000)

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=800,
    messages=[
        {"role": "user", "content": f"Summarise this document:\n\n{trimmed}"}
    ],
)

For real document QA, summarise the dropped middle instead of throwing it away — Module 9 (RAG) shows how to do that properly. This helper is the cheap version that's right 80% of the time.
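An alternative to dropping the middle is to split the document into token-budget pieces and summarise each one separately. A minimal sketch using the same ~4-chars-per-token rule — `chunk_by_tokens` is a helper invented here, not part of any SDK:

```python
def chunk_by_tokens(text: str, budget_tokens: int, overlap_tokens: int = 0) -> list[str]:
    """Split text into pieces that each fit a token budget (~4 chars/token).

    Optional overlap keeps a little shared context between adjacent chunks.
    """
    char_budget = budget_tokens * 4
    step = max(1, char_budget - overlap_tokens * 4)
    return [text[i : i + char_budget] for i in range(0, len(text), step)]

chunks = chunk_by_tokens("x" * 10_000, budget_tokens=500)  # 2,000-char pieces
print(len(chunks))  # 5
```

Each chunk then goes through its own (cheap) summarisation call, and the summaries get stitched back together — the map-reduce pattern Module 9 builds on.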


Rate limits, briefly

You'll hit two kinds of limit:

  1. Per-minute request and token limits. Surfaced as RateLimitError. The retry wrapper from Module 4 handles transient bursts; for sustained load, throttle on your side.
  2. The model's context window. A hard ceiling on input + output tokens combined. If you exceed it you get a BadRequestError, and no retry will save you: you have to truncate or chunk the input.
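For the first kind, client-side throttling can be as simple as spacing requests out. A minimal sketch — the clock and sleep functions are injectable so the behaviour is easy to verify without real waiting; in production the `time.monotonic` / `time.sleep` defaults apply:

```python
import time

class Throttle:
    """Space calls so that no more than `rpm` go out per minute."""

    def __init__(self, rpm: int, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / rpm
        self._clock = clock
        self._sleep = sleep
        self._next_ok = 0.0  # earliest moment the next call may start

    def wait(self) -> float:
        """Sleep until a slot is free; return the seconds slept."""
        now = self._clock()
        delay = max(0.0, self._next_ok - now)
        if delay:
            self._sleep(delay)
        self._next_ok = max(now, self._next_ok) + self.interval
        return delay

# Deterministic demo with a fake clock (no real sleeping):
t = [0.0]
throttle = Throttle(rpm=30, clock=lambda: t[0],
                    sleep=lambda s: t.__setitem__(0, t[0] + s))
print([round(throttle.wait(), 1) for _ in range(3)])  # [0.0, 2.0, 2.0]
```

Call `throttle.wait()` right before each `client.messages.create(...)`; the first call goes straight through, then calls are spaced `60 / rpm` seconds apart.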

A small monitor that prints running totals can save your wallet:

total_in = total_out = 0

def call(messages, **kw):
    global total_in, total_out
    r = client.messages.create(messages=messages, **kw)
    total_in  += r.usage.input_tokens
    total_out += r.usage.output_tokens
    return r

# ... use call(...) instead of client.messages.create(...)
print(f"running total — in: {total_in}  out: {total_out}")


Stage 01 complete

You've finished the Foundations stage. You can talk to Claude from Python, pick a model, write prompts that work, wrap calls safely, and keep token costs honest. That's a real skill set.

Next up is Stage 02 · Practitioner — building real things end-to-end, starting with Module 6 · Advanced Prompting.