Fine-tuning (and the cheaper alternatives)
When (and when not) to specialise Claude for your domain.
The honest answer first: most teams who think they need to fine-tune don't. Prompt engineering, few-shot examples, RAG, and prompt caching solve 90% of the cases people reach for fine-tuning. This module helps you know which 10% is real, and what to do about the other 90%.
By the end of this module you'll have
- A clear-eyed view of when fine-tuning Claude is the right move and when it isn't
- Three alternatives that ship this week and usually win
- A decision tree for "should I fine-tune?" that ends in "no" most of the time
Time: about 1 hour for the basics, ~7 hours with all three notebooks.
Prerequisites: Modules 3 (prompt basics), 9 (RAG), 16 (optimization), 20 (testing) — at least skim 20 first; you can't reason about fine-tuning without evals.
The landscape (today)
- Claude itself: customisation of Claude is offered by Anthropic for enterprise-scale customers, often via Amazon Bedrock, with significant onboarding. It is not a "click here to fine-tune" SaaS button.
- Open-weights models (Llama, Mistral, etc.): full fine-tuning is widely available and cheap for small models. A common pattern is to use Claude to generate training data for a smaller fine-tuned model.
- Always check current docs. Customisation availability changes; rely on Anthropic's docs for what's offered today, not folklore.
For nearly everyone reading this in their first year with Claude, fine-tuning is the wrong instinct. Here's why.
Why people think they need to fine-tune
| What they say | What's usually really happening |
|---|---|
| "It doesn't know our product" | The system prompt has 80 words of brand voice and zero domain context. Add a knowledge file via RAG. |
| "It hallucinates customer details" | They're feeding raw inputs without grounding. Module 9 fixes this. |
| "Output format isn't reliable" | They're missing few-shot examples and a prefilled-{ trick. Module 6 fixes this. |
| "It's too verbose" | The prompt doesn't say "Reply in ≤ 60 words. No preamble." Module 3 fixes this. |
| "It's expensive at scale" | Sonnet on every call; no prompt caching, no model cascade. Module 16 fixes this. |
| "It refuses our valid use case" | A clearer system prompt with the legitimate context usually solves it without customisation. |
Fix the prompt, the retrieval, and the cascade. Re-measure. Usually you're done.
Three alternatives that almost always work first
1. Few-shot prompting at scale
Embed your actual examples (good and bad) directly into the system prompt. With prompt caching from Module 16, this is cheap on every call after the first.
SYSTEM = """\
You are a support classifier. Use these examples as your style guide.
Example 1:
Input: "My order arrived broken."
Output: {"intent": "refund_request", "urgency": "medium"}
Example 2:
Input: "Where is my order?"
Output: {"intent": "shipping_status", "urgency": "low"}
(... 30 examples ...)
Reply with one JSON object matching the same shape.
"""
Thirty examples is often the difference between 75% and 95% accuracy. No fine-tuning needed.
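At scale you wouldn't paste examples by hand; a small helper can render labeled records into that prompt shape. A minimal sketch — the `build_system_prompt` helper and the record list are illustrative, not from a real codebase:

```python
import json

# Labeled examples; in practice these come from your real, reviewed traffic.
EXAMPLES = [
    {"input": "My order arrived broken.",
     "output": {"intent": "refund_request", "urgency": "medium"}},
    {"input": "Where is my order?",
     "output": {"intent": "shipping_status", "urgency": "low"}},
]

def build_system_prompt(examples):
    """Render labeled records into a few-shot system prompt."""
    lines = ["You are a support classifier. Use these examples as your style guide."]
    for i, ex in enumerate(examples, start=1):
        lines.append(f"Example {i}:")
        lines.append(f'Input: "{ex["input"]}"')
        lines.append(f"Output: {json.dumps(ex['output'])}")
    lines.append("Reply with one JSON object matching the same shape.")
    return "\n".join(lines)

print(build_system_prompt(EXAMPLES))
```

Keeping the examples in data rather than in the prompt string also means you can regenerate the cached prefix whenever your reviewed set grows.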
2. Distillation to a smaller model
For high-volume, narrow tasks where Sonnet's quality is overkill but Haiku is borderline:
- Run Sonnet on a few thousand real inputs and store the outputs.
- Use that synthetic dataset to fine-tune a small open-weights model (Llama 3, Mistral 7B).
- Serve the fine-tuned small model for the easy traffic; cascade to Sonnet only for hard cases.
You're not fine-tuning Claude; you're fine-tuning a smaller model with Claude's outputs as the teacher. This is what most "fine-tune the LLM" success stories actually are.
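The first two steps above reduce to writing (prompt, completion) pairs as JSONL. A sketch of that step, with the teacher stubbed out — `teacher_label` stands in for the real Sonnet API call, and the JSONL field names are a common convention, not a fixed standard:

```python
import io
import json

def teacher_label(text):
    # Stub: in the real pipeline this is a call to Sonnet on a real input.
    return {"intent": "refund_request" if "broken" in text else "other"}

def build_jsonl(inputs, fh):
    """Write (prompt, completion) pairs in a JSONL shape that most
    open-weights fine-tuning tooling can ingest."""
    for text in inputs:
        record = {"prompt": text, "completion": json.dumps(teacher_label(text))}
        fh.write(json.dumps(record) + "\n")

buf = io.StringIO()
build_jsonl(["My order arrived broken.", "Where is my order?"], buf)
print(buf.getvalue())
```

The point is that the "fine-tuning" artefact is just a file of teacher outputs; all the quality leverage is in which inputs you sample and how you review the labels.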
3. RAG on your own corpus
If the gap is "Claude doesn't know our docs," fine-tuning won't close it — that's not what fine-tuning is for. Fine-tuning teaches behaviour, not facts. For facts, use Module 9's RAG pattern.
When fine-tuning is the right call
Genuinely consider customisation when all of these are true:
- You've already invested seriously in prompting, RAG, caching, and cascades.
- You have written, run, and read evals (Module 20). You can quantify the remaining gap.
- The remaining gap is a stylistic or behavioural one — not factual recall.
- The economics work: you're sending enough volume that a custom model's setup cost amortises.
- You have an enterprise relationship with Anthropic (for Claude) or you're willing to operate an open-weights model yourself.
If you can't tick all five, don't fine-tune yet. The expected ROI on better prompts and RAG is higher and the iteration loop is hours, not weeks.
A decision tree
```text
Is the gap factual recall? ──── yes ──► Use RAG (Module 9). Stop.
          │ no
          ▼
Have you tried 20+ few-shot
examples? ───────────────────── no ───► Try that first. Stop.
          │ yes
          ▼
Have you measured the gap
with evals? ─────────────────── no ───► Build evals (Module 20). Then return.
          │ yes
          ▼
Is volume high enough to justify
weeks of setup + ops? ───────── no ───► Stay on prompts. Stop.
          │ yes
          ▼
Open-weights model viable? ──── yes ──► Distil from Claude into a small model.
          │ no
          ▼
Reach out to Anthropic about
enterprise customisation. Have your
evals and economics ready.
```
Most paths exit at "Stop."
What this module's exercises actually do
Because fine-tuning Claude isn't a public API knob, the exercises build the alternatives:
- A few-shot system prompt with cached prefix that adapts a generic model to a domain
- A distillation pipeline that turns Sonnet outputs into a tiny model's training data
- An eval harness that proves a 92% → 95% improvement before you go looking for further gains
If you finish those and you still think you need fine-tuning, you'll be rare — and you'll have the eval numbers to make the conversation with Anthropic productive.
Try changing one thing
- Take a task where you're tempted to fine-tune. Add 20 few-shot examples instead. Measure the win.
- Cache the few-shot block and re-measure the cost. Often the "fine-tuning saves money" argument evaporates.
- Pick a high-volume narrow task. Sketch the distillation plan: which small model, how many synthetic examples, where to run inference.
- Write down the eval numbers you'd need to see before paying for fine-tuning. Now go improve them with cheaper levers.
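For the caching bullet, a back-of-envelope cost model helps. The per-token prices and discount below are placeholders (check Anthropic's current pricing page), and cache-write overhead on the first call is ignored for simplicity:

```python
def monthly_prompt_cost(calls, prefix_tokens, query_tokens,
                        input_price, cache_read_price):
    """Compare a few-shot prefix sent raw on every call vs. read from
    the prompt cache. Prices are USD per million input tokens."""
    uncached = calls * (prefix_tokens + query_tokens) * input_price / 1e6
    cached = calls * (prefix_tokens * cache_read_price
                      + query_tokens * input_price) / 1e6
    return uncached, cached

# Placeholder numbers: 1M calls/month, a 3k-token few-shot prefix,
# 200-token queries, $3/MTok input, cache reads at a tenth of that.
uncached, cached = monthly_prompt_cost(1_000_000, 3000, 200, 3.0, 0.3)
print(f"uncached ${uncached:,.0f}/mo vs cached ${cached:,.0f}/mo")
```

Under these assumptions the cached path is several times cheaper, which is usually enough to sink the "fine-tuning saves money on prompt tokens" argument on its own.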
Going deeper: open the notebooks
- `notebooks/01_introduction.ipynb` — when prompting alone wins, measured (~1.5–2h)
- `notebooks/02_intermediate.ipynb` — distillation pipelines and synthetic data quality (~2–3h)
- `notebooks/03_advanced.ipynb` — operating a custom model: cost, latency, governance (~1.5–2.5h)
Module checklist
- [ ] You can name three alternatives to fine-tuning that often win
- [ ] You understand fine-tuning teaches behaviour, not facts
- [ ] You've written down the evals you'd need before fine-tuning is worth it
- [ ] You're slightly less excited about fine-tuning than you were 30 minutes ago
Next module
Module 18 · Multi-Agent Systems — when one prompt isn't enough, several specialised ones working together often are.