Stage 01 · Foundations · Module 2 of 26 · ~4h

Models & Capabilities

Pick the right Claude model for the job at hand.


There are several Claude models. They cost different amounts, run at different speeds, and handle different difficulty levels. This module teaches you to pick the right one — by measuring, not guessing.

By the end of this module you'll have compared at least two models on a task you care about, and a measured basis for choosing between them.

Time: about 45 minutes for the basics, ~4 hours with all three notebooks.

Prerequisites: Module 1 finished and your environment working.


The shape of the family (today)

| Family | Latest ID | Best for | Rough cost | Rough speed |
|--------|-----------|----------|------------|-------------|
| Haiku  | claude-haiku-4-5-20251001 | Classification, routing, high-volume cheap work, drafts | $ | Very fast |
| Sonnet | claude-sonnet-4-6 | Most production work — the sensible default | $$ | Fast |
| Opus   | claude-opus-4-7 | Hard reasoning, long multi-step plans, premium quality | $$$ | Slower |

Pricing and speed change over time — always confirm with the official docs. What stays true: smaller is faster and cheaper, larger reasons better.

Rule of thumb. Start on Sonnet. Drop to Haiku once you've proven Sonnet works and you have a quality bar to test against. Escalate to Opus only when Sonnet visibly fails on tasks you actually care about.
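The rule of thumb above can be sketched as a tiny helper. This is our own illustration, not an official API — the tier names and the function are invented here, and the model IDs are the ones from the table in this module:

```python
# Model IDs from the table above -- check the official docs for current ones.
MODEL_BY_TIER = {
    "cheap": "claude-haiku-4-5-20251001",  # once Sonnet is proven and you have a quality bar
    "default": "claude-sonnet-4-6",        # start here
    "premium": "claude-opus-4-7",          # only when Sonnet visibly fails
}

def pick_model(sonnet_proven: bool = False, sonnet_fails: bool = False) -> str:
    """Encode the rule of thumb: start on Sonnet, drop to Haiku once
    Sonnet is proven, escalate to Opus only on visible failure."""
    if sonnet_fails:
        return MODEL_BY_TIER["premium"]
    if sonnet_proven:
        return MODEL_BY_TIER["cheap"]
    return MODEL_BY_TIER["default"]
```

The point of writing it down as code: "proven" and "fails" should be outcomes of measurement (the benchmark below), not opinions.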


Compare two models on your task (5 minutes)

Save as compare_models.py and adjust the task string to something you care about:

import time
from anthropic import Anthropic
from dotenv import load_dotenv

load_dotenv()  # reads ANTHROPIC_API_KEY from .env
client = Anthropic()

task = "Summarize the plot of Hamlet in three bullet points, neutral tone."

models_to_compare = [
    "claude-haiku-4-5-20251001",
    "claude-sonnet-4-6",
]

for model in models_to_compare:
    start = time.perf_counter()
    response = client.messages.create(
        model=model,
        max_tokens=400,
        messages=[{"role": "user", "content": task}],
    )
    elapsed_ms = (time.perf_counter() - start) * 1000
    text = response.content[0].text

    print(f"\n=== {model} ===")
    print(f"latency:   {elapsed_ms:.0f} ms")
    print(f"in/out:    {response.usage.input_tokens} / {response.usage.output_tokens} tokens")
    print(f"output:    {text}")

Look at three things in the output:

  1. Latency — how long the user waits.
  2. Tokens in and out — what you'll be billed for; output tokens are also a rough proxy for verbosity.
  3. The reply itself — read it. Does the cheaper model's answer hold up against the more expensive one for this task?

If the answers look equally good, you've just saved real money. If the bigger model is clearly better, you've earned the right to spend more.
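To turn the token counts the script prints into dollars, multiply by per-million-token rates. A minimal sketch — the prices below are placeholders, not real rates; look up current pricing in the official docs before trusting any number this produces:

```python
# (input, output) USD per million tokens -- ILLUSTRATIVE PLACEHOLDERS ONLY.
PRICE_PER_MTOK = {
    "claude-haiku-4-5-20251001": (1.00, 5.00),
    "claude-sonnet-4-6": (3.00, 15.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollars for one request, given the counts from response.usage."""
    in_price, out_price = PRICE_PER_MTOK[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# e.g. 50 input / 300 output tokens on the Sonnet placeholder rates:
print(f"${request_cost('claude-sonnet-4-6', 50, 300):.5f}")
```

Per-request costs look tiny; multiply by your expected daily volume to see the number that actually matters.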


How to choose, in practice

Start with the task, not the model. Three questions decide most of it:

  1. What's the failure cost? Mis-routed support ticket = low. Wrong code shipped to prod = high. Higher cost → bigger model.
  2. How long is the chain of reasoning? A single classification = small. A multi-step plan with branching = larger.
  3. How long is the input? Long context windows are supported across the family, but longer prompts on a bigger model multiply latency and cost.

Then measure. A 10-line benchmark on 20 real examples beats a one-liner argument every time. Module 20 (Testing & Evaluation) shows how to turn that into a regression suite.
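Here is roughly what that 10-line benchmark looks like. The grader below is a naive keyword check — a stand-in we invented for illustration; your real quality bar might be exact match, a rubric, or a judge model:

```python
def passes(output: str, required_keywords: list[str]) -> bool:
    """Crude grader: the answer must mention every required keyword."""
    lowered = output.lower()
    return all(kw.lower() in lowered for kw in required_keywords)

def pass_rate(outputs: list[str], expectations: list[list[str]]) -> float:
    """Fraction of real examples where a model's output passed the grader."""
    results = [passes(out, kws) for out, kws in zip(outputs, expectations)]
    return sum(results) / len(results)

# Collect `outputs` by running each model over ~20 real examples,
# then compare pass rates side by side.
```

If Haiku's pass rate matches Sonnet's on your 20 examples, the switch is justified by data, not by a hunch.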


Try changing one thing


Going deeper: open the notebooks


Module checklist


Next module

Module 3 · Prompt Engineering Basics — the patterns that make any model behave more reliably.