Models & Capabilities
Pick the right Claude model for the job at hand.
There are several Claude models. They cost different amounts, run at different speeds, and handle different levels of difficulty. This module teaches you to pick the right one by measuring, not guessing.
By the end of this module you'll have
- A clear mental model of the Haiku / Sonnet / Opus families and when each pays off
- A small benchmark script that compares two models on your task in seconds
- A reusable rule of thumb for escalating from cheaper to more capable models
Time: about 45 minutes for the basics, ~4 hours with all three notebooks.
Prerequisites: Module 1 finished and your environment working.
The shape of the family (today)
| Family | Latest ID | Best for | Rough cost | Rough speed |
|---|---|---|---|---|
| Haiku | `claude-haiku-4-5-20251001` | Classification, routing, high-volume cheap work, drafts | $ | Very fast |
| Sonnet | `claude-sonnet-4-6` | Most production work, the sensible default | $$ | Fast |
| Opus | `claude-opus-4-7` | Hard reasoning, long multi-step plans, premium quality | $$$ | Slower |
Pricing and speed change over time; always confirm with the official docs. What stays true: smaller models are faster and cheaper, larger models reason better.
Rule of thumb. Start on Sonnet. Drop to Haiku once you've proven Sonnet works and you have a quality bar to test against. Escalate to Opus only when Sonnet visibly fails on tasks you actually care about.
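One way to encode that escalation is a cheapest-first fallback. The sketch below is illustrative, not a prescribed pattern: `looks_good` is a placeholder for whatever quality bar your task actually needs, and the model IDs are the ones from the table above.

```python
from anthropic import Anthropic

client = Anthropic()

# Cheapest first; escalate only when the output fails your quality check.
ESCALATION = ["claude-haiku-4-5-20251001", "claude-sonnet-4-6", "claude-opus-4-7"]

def looks_good(text: str) -> bool:
    # Placeholder quality bar. Replace with a real check for your task,
    # e.g. "parses as JSON" or "contains exactly three bullet points".
    return len(text.strip()) > 0

def run_with_escalation(task: str) -> tuple[str, str]:
    """Try each model in order; return (model_id, reply) for the first that passes."""
    for model in ESCALATION:
        response = client.messages.create(
            model=model,
            max_tokens=400,
            messages=[{"role": "user", "content": task}],
        )
        text = response.content[0].text
        if looks_good(text):
            return model, text
    return model, text  # nothing passed: fall back to the strongest model's answer
```

The check is the whole game here: without a quality bar you can test against, "drop to Haiku" is just guessing again.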
Compare two models on your task (5 minutes)
Save as `compare_models.py` and adjust the task string to something you care about:

```python
import time

from anthropic import Anthropic
from dotenv import load_dotenv

load_dotenv()  # reads ANTHROPIC_API_KEY from your .env file
client = Anthropic()

task = "Summarize the plot of Hamlet in three bullet points, neutral tone."

models_to_compare = [
    "claude-haiku-4-5-20251001",
    "claude-sonnet-4-6",
]

for model in models_to_compare:
    # Time the full round trip, request to complete response.
    start = time.perf_counter()
    response = client.messages.create(
        model=model,
        max_tokens=400,
        messages=[{"role": "user", "content": task}],
    )
    elapsed_ms = (time.perf_counter() - start) * 1000

    text = response.content[0].text
    print(f"\n=== {model} ===")
    print(f"latency: {elapsed_ms:.0f} ms")
    print(f"in/out: {response.usage.input_tokens} / {response.usage.output_tokens} tokens")
    print(f"output: {text}")
```
Look at three things in the output:
- Latency — how long the user waits.
- Tokens out — what you'll be billed for, and a rough proxy for verbosity.
- The reply itself — read it. Does the cheaper model's answer hold up against the more expensive one for this task?
If the answers look equally good, you've just saved real money. If the bigger model is clearly better, you've earned the right to spend more.
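To turn those token counts into dollars, multiply by the per-million-token rates. A minimal helper; the prices below are placeholders for this sketch, so take current rates from the official pricing page before relying on it:

```python
# Placeholder per-million-token prices in USD. Illustrative only;
# check the official pricing page for current numbers.
PRICES = {
    "claude-haiku-4-5-20251001": {"in": 1.00, "out": 5.00},
    "claude-sonnet-4-6": {"in": 3.00, "out": 15.00},
}

def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICES[model]
    return (input_tokens * p["in"] + output_tokens * p["out"]) / 1_000_000
```

Drop a call to this into the loop in `compare_models.py`, next to the usage print, and the "equally good" comparison becomes a concrete price difference.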
How to choose, in practice
Start with the task, not the model. Three questions decide most of it:
- What's the failure cost? Mis-routed support ticket = low. Wrong code shipped to prod = high. Higher cost → bigger model.
- How long is the chain of reasoning? A single classification = small. A multi-step plan with branching = larger.
- How long is the input? Long context windows are supported across the family, but longer prompts on a bigger model multiply latency and cost.
Then measure. A 10-line benchmark on 20 real examples beats a one-liner argument every time. Module 20 (Testing & Evaluation) shows how to turn that into a regression suite.
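If you want those three questions as a starting point in code, here is a toy heuristic. It is purely illustrative: the `Task` fields and thresholds are invented for this sketch and should come from your own measurements, not from this page.

```python
from dataclasses import dataclass

@dataclass
class Task:
    failure_cost: str      # "low" | "medium" | "high"
    reasoning_steps: int   # rough count of dependent steps
    input_tokens: int      # approximate prompt size

def pick_starting_model(task: Task) -> str:
    """Toy heuristic: start cheap, move up as stakes and reasoning depth grow."""
    if task.failure_cost == "high" or task.reasoning_steps > 5:
        return "claude-opus-4-7"
    if task.failure_cost == "medium" or task.reasoning_steps > 1:
        return "claude-sonnet-4-6"
    return "claude-haiku-4-5-20251001"
```

Treat the output as the model you benchmark first, not the model you ship.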
Try changing one thing
- Add `claude-opus-4-7` to `models_to_compare`. Notice how much slower it is, and whether the answer is meaningfully better.
- Lower `max_tokens` to 80. Different models truncate differently: some still finish a thought, others stop mid-sentence.
- Change the task to something domain-specific (e.g., "extract email addresses from this paragraph"). Cheaper models often win on simple, well-scoped tasks.
- Add `system="Reply in JSON only"`. See which model holds the format more reliably (a sketch of the call follows this list).
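For that last experiment, note that the system prompt is a top-level parameter on `messages.create`, not another message. The change to `compare_models.py` looks like this:

```python
response = client.messages.create(
    model=model,
    max_tokens=400,
    system="Reply in JSON only.",  # system prompt is its own parameter
    messages=[{"role": "user", "content": task}],
)
```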
Going deeper: open the notebooks
- `notebooks/01_introduction.ipynb`: capability map, reading model cards like a PM + engineer (~1.5–2h)
- `notebooks/02_intermediate.ipynb`: A/B tests across models, regression suites (~2–3h)
- `notebooks/03_advanced.ipynb`: latency budgets, multi-lingual work, failure analysis from logs (~1.5–2.5h)
Module checklist
- [ ] You've benchmarked the same task on at least two models
- [ ] You can name two situations where Haiku is the right call, and two where Opus is
- [ ] You have a starting model in mind for the project you'll build later in this curriculum
Next module
Module 3 · Prompt Engineering Basics — the patterns that make any model behave more reliably.