Stage 04 · Pro · Module 17 of 26 · ~7h

Fine-tuning (and the cheaper alternatives)

When (and when not) to specialise Claude for your domain.


The honest answer first: most teams who think they need to fine-tune don't. Prompt engineering, few-shot examples, RAG, and prompt caching solve 90% of the cases people reach for fine-tuning. This module helps you know which 10% is real, and what to do about the other 90%.

By the end of this module you'll have a clear rule for deciding when fine-tuning is genuinely warranted, plus working alternatives: few-shot prompting at scale, distillation to a smaller model, and RAG.

Time: about 1 hour for the basics, ~7 hours with all three notebooks.

Prerequisites: Modules 3 (prompt basics), 9 (RAG), 16 (optimization), 20 (testing) — at least skim 20 first; you can't reason about fine-tuning without evals.


The landscape (today)

For nearly everyone reading this in their first year with Claude, fine-tuning is the wrong instinct. Here's why.


Why people think they need to fine-tune

What they say → What's usually happening

"It doesn't know our product" → The system prompt has 80 words of brand voice and zero domain context. Add a knowledge file via RAG.
"It hallucinates customer details" → They're feeding raw inputs without grounding. Module 9 fixes this.
"Output format isn't reliable" → They're missing few-shot examples and a prefilled "{" assistant turn. Module 6 fixes this.
"It's too verbose" → The prompt doesn't say "Reply in ≤ 60 words. No preamble." Module 3 fixes this.
"It's expensive at scale" → Sonnet on every call; no prompt caching, no model cascade. Module 16 fixes this.
"It refuses our valid use case" → A clearer system prompt with the legitimate context usually solves it, no customisation needed.

Fix the prompt, the retrieval, and the cascade. Re-measure. Usually you're done.


Three alternatives that almost always work first

1. Few-shot prompting at scale

Embed your actual examples (good and bad) directly into the system prompt. With prompt caching from Module 16, this is cheap on every call after the first.

SYSTEM = """\
You are a support classifier. Use these examples as your style guide.

Example 1:
Input: "My order arrived broken."
Output: {"intent": "refund_request", "urgency": "medium"}

Example 2:
Input: "Where is my order?"
Output: {"intent": "shipping_status", "urgency": "low"}

(... 30 examples ...)

Reply with one JSON object matching the same shape.
"""

Thirty examples is often the difference between 75% and 95% accuracy. No fine-tuning needed.
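A sketch of how that system prompt is sent with prompt caching enabled via the Messages API. The request is built as a plain dict so the shape is visible; the model name, the truncated `SYSTEM` string, and the `build_request` helper are illustrative, not a fixed recipe.

```python
# Sketch: few-shot system prompt + prompt caching + a prefilled "{" to force JSON.
# Model name is illustrative; swap in whatever your account offers.
SYSTEM = """You are a support classifier. Use these examples as your style guide.
(... the few-shot block from above ...)
Reply with one JSON object matching the same shape."""

def build_request(ticket: str) -> dict:
    return {
        "model": "claude-3-5-haiku-20241022",
        "max_tokens": 100,
        # cache_control marks the long example block for reuse on later calls,
        # so you pay full input price only the first time.
        "system": [
            {"type": "text", "text": SYSTEM,
             "cache_control": {"type": "ephemeral"}}
        ],
        "messages": [
            {"role": "user", "content": ticket},
            # Prefilling the assistant turn with "{" nudges pure-JSON output.
            {"role": "assistant", "content": "{"},
        ],
    }

# With the anthropic SDK (not imported here, to keep the sketch self-contained):
# client = anthropic.Anthropic()
# response = client.messages.create(**build_request("My order arrived broken."))
```

Keeping the examples in the cached system block, and only the ticket in the user turn, is what makes thirty examples affordable per call.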

2. Distillation to a smaller model

For high-volume, narrow tasks where Sonnet's quality is overkill but Haiku is borderline:

  1. Run Sonnet on a few thousand real inputs and store the outputs.
  2. Use that synthetic dataset to fine-tune a small open-weights model (Llama 3, Mistral 7B).
  3. Serve the fine-tuned small model for the easy traffic; cascade to Sonnet only for hard cases.

You're not fine-tuning Claude; you're fine-tuning a smaller model with Claude's outputs as the teacher. This is what most "fine-tune the LLM" success stories actually are.
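Steps 1 and 2 can be sketched as a small collection loop. The record shape (a `messages` list of user/assistant turns, one JSON object per line) is the chat-style JSONL most open fine-tuning stacks accept, though field names vary by toolkit; the stubbed teacher outputs below stand in for real Sonnet calls.

```python
import json

def to_chat_record(user_input: str, teacher_output: str) -> dict:
    # One supervised example: the student model learns to imitate the teacher.
    return {"messages": [
        {"role": "user", "content": user_input},
        {"role": "assistant", "content": teacher_output},
    ]}

# Stub for running Sonnet over a few thousand real production inputs.
teacher_outputs = {
    "My order arrived broken.": '{"intent": "refund_request", "urgency": "medium"}',
    "Where is my order?": '{"intent": "shipping_status", "urgency": "low"}',
}

def build_dataset(inputs: list[str]) -> list[str]:
    # One JSON object per line: the JSONL file you feed to the trainer.
    return [json.dumps(to_chat_record(i, teacher_outputs[i])) for i in inputs]

lines = build_dataset(list(teacher_outputs))
```

Writing `lines` to disk gives you the training file; step 3's cascade is then just routing logic in front of the two models.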

3. RAG on your own corpus

If the gap is "Claude doesn't know our docs," fine-tuning won't close it: fine-tuning teaches behaviour, not facts. For facts, use Module 9's RAG pattern.
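Module 9 covers this properly; as a minimal illustration of the idea, here is a toy keyword retriever (all names mine, and real pipelines use embeddings) that grounds the prompt in your own corpus:

```python
def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    # Toy relevance score: count shared lowercase words with the query.
    q = set(query.lower().split())
    return sorted(chunks,
                  key=lambda c: len(q & set(c.lower().split())),
                  reverse=True)[:k]

def grounded_prompt(query: str, chunks: list[str]) -> str:
    # The model answers from retrieved context, not from its weights.
    context = "\n".join(retrieve(query, chunks))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "Refunds are issued to the original payment method within 5 business days.",
    "Standard shipping takes 3 to 5 business days inside the EU.",
]
```

The facts live in `docs` and can be updated any time, which is exactly what a fine-tuned model can't do.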


When fine-tuning is the right call

Genuinely consider customisation when all of these are true:

  1. The gap is behavioural (tone, structure, judgment), not factual recall.
  2. You've already pushed few-shot prompting (20+ examples) as far as it goes.
  3. You've measured the remaining gap with a real eval suite (Module 20).
  4. Volume is high enough to justify weeks of setup and ongoing ops.
  5. The economics still favour customisation after prompt caching and cascades.

If you can't tick all five, don't fine-tune yet. The expected ROI on better prompts and RAG is higher and the iteration loop is hours, not weeks.


A decision tree

Is the gap factual recall? ────────────► Use RAG (Module 9). Stop.
                │ no
                ▼
Have you tried 20+ few-shot examples? ─► Try that first. Stop.
                │ yes
                ▼
Have you measured the gap with evals? ─► Build evals (Module 20). Then return.
                │ yes
                ▼
Is volume high enough to justify
weeks of setup + ops?                ─► No: stay on prompts. Stop.
                │ yes
                ▼
Open-weights model viable?            ─► Yes: distil from Claude into a small model.
                │ no
                ▼
Reach out to Anthropic about
enterprise customisation. Have your
evals and economics ready.

Most paths exit at "Stop."
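The tree reads naturally as a function. A sketch (the function and parameter names are mine, not an official API):

```python
def fine_tuning_call(
    gap_is_factual: bool,
    tried_few_shot: bool,
    has_evals: bool,
    volume_justifies_ops: bool,
    open_weights_viable: bool,
) -> str:
    # Mirrors the decision tree above, top to bottom; first match wins.
    if gap_is_factual:
        return "Use RAG (Module 9). Stop."
    if not tried_few_shot:
        return "Try 20+ few-shot examples first. Stop."
    if not has_evals:
        return "Build evals (Module 20), then return."
    if not volume_justifies_ops:
        return "Stay on prompts. Stop."
    if open_weights_viable:
        return "Distil from Claude into a small open-weights model."
    return "Talk to Anthropic about enterprise customisation, evals in hand."
```

Note how many arguments must line up before the last branch is even reachable.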


What this module's exercises actually do

Because fine-tuning Claude isn't a public API knob, the exercises build the alternatives:

  1. Few-shot prompting at scale, with prompt caching.
  2. Distilling Sonnet outputs into a small open-weights model.
  3. A RAG pipeline over your own corpus, measured with evals.

If you finish those and you still think you need fine-tuning, you'll be rare — and you'll have the eval numbers to make the conversation with Anthropic productive.


Try changing one thing


Going deeper: open the notebooks


Module checklist


Next module

Module 18 · Multi-Agent Systems — when one prompt isn't enough, several specialised ones working together often are.