RAG Systems
Ground Claude’s answers in your own documents.
Claude doesn't know your company's docs, your private wiki, or yesterday's tickets. Retrieval-Augmented Generation (RAG) fixes that: at question time, you fetch the relevant snippets and put them in the prompt. The model answers from your data, not its training set.
By the end of this module you'll have:
- A working search → assemble → answer loop using real Python
- The instinct to ask "did the right context make it into the prompt?" before blaming the model
- A simple citations pattern so you can show users where each claim came from
Time: about 1.5 hours for the basics, ~8 hours with all three notebooks.
Prerequisites: Modules 6 (advanced prompting) and 7 (building apps). Module 5 (tokens) helps too.
RAG in three lines
1. Retrieve — find the K most relevant chunks for the user's question.
2. Assemble — paste them into a prompt with clear delimiters.
3. Answer — ask Claude to answer using ONLY those chunks, with citations.
That's it. Vectors, embeddings, re-ranking — those are optimisations for step 1. The shape of the loop never changes.
A working RAG in 30 lines
This uses simple keyword matching so you can see the loop without any vector-DB setup. We'll graduate to embeddings in the notebook.
Save as `mini_rag.py`:
```python
import re

from anthropic import Anthropic
from dotenv import load_dotenv

load_dotenv()
client = Anthropic()

# Pretend this is your knowledge base. Each chunk has an id and text.
DOCS = [
    {"id": "policy-1", "text": "Refunds are available within 30 days of purchase, with proof of purchase."},
    {"id": "policy-2", "text": "Shipping to Europe takes 5–7 business days. Express shipping is 2 days."},
    {"id": "policy-3", "text": "Customer support hours are 9am–6pm GMT, Monday to Friday."},
    {"id": "product-1", "text": "The Pro plan costs $29/month and includes priority support."},
    {"id": "product-2", "text": "The Free plan has a 5GB storage limit."},
]

def retrieve(query: str, k: int = 3):
    """Naive keyword scoring: count term overlaps. Replace with embeddings later."""
    terms = set(re.findall(r"\w+", query.lower()))
    scored = []
    for doc in DOCS:
        words = set(re.findall(r"\w+", doc["text"].lower()))
        scored.append((len(terms & words), doc))
    scored.sort(reverse=True, key=lambda x: x[0])
    return [doc for score, doc in scored[:k] if score > 0]

def answer(question: str) -> str:
    chunks = retrieve(question)
    if not chunks:
        return "I don't have anything in the knowledge base about that."
    context = "\n".join(f"[{c['id']}] {c['text']}" for c in chunks)
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=400,
        system=(
            "Answer the user's question using ONLY the snippets below. "
            "If the snippets don't contain the answer, say so. "
            "Cite each fact with its [id] tag at the end of the sentence. "
            "Do not invent information.\n\n"
            f"SNIPPETS:\n{context}"
        ),
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text

print(answer("How long does shipping take to Europe?"))
print()
print(answer("What's the weather in Paris?"))  # nothing relevant — should say so
```
You should see citations like "Shipping to Europe takes 5–7 business days [policy-2]" and an honest "no information" for the off-topic question.
What just happened?
Three things you'll do in every RAG system:
- You shrank the world. Instead of letting Claude guess, you handed it three relevant snippets out of (potentially) millions. That's the entire point.
- You forced grounding. "Use ONLY the snippets" plus "if not present, say so" is the difference between a useful tool and a confident liar.
- You enabled audit. The `[id]` citations let users (and you) trace any claim back to source. Module 14 turns this into a real eval signal.
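If you want to act on those citations programmatically, one possible audit helper (a sketch; the function names are made up here, and it assumes answers carry `[id]` tags as the system prompt instructs):

```python
import re

def extract_citations(answer_text: str) -> set[str]:
    """Pull every [id]-style citation tag out of a model answer."""
    return set(re.findall(r"\[([\w-]+)\]", answer_text))

def audit(answer_text: str, retrieved_ids: set[str]) -> set[str]:
    """Return citations that do NOT match any retrieved chunk: possible fabrications."""
    return extract_citations(answer_text) - retrieved_ids

reply = "Shipping takes 5-7 business days [policy-2]. Returns take 30 days [policy-9]."
print(audit(reply, {"policy-1", "policy-2"}))  # prints {'policy-9'}: a citation with no source
```

In a real pipeline you'd run this on every answer and log any unmatched tags as a quality signal.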
When the answer is wrong, where did it go wrong?
Most RAG bugs aren't model bugs. Walk this in order:
- Did the right chunk get retrieved? Print `retrieve(question)` separately and read it. If the right text isn't there, no model can save you.
- Did the right chunk make it into the prompt? Token budgets are real — if you fetched 10 chunks and only 3 fit, the right one might have been dropped.
- Was the chunk too long or too short? Sentence fragments lose context. Whole-page chunks dilute relevance. Aim for 100–500 tokens per chunk in practice.
- Did you tell the model to stay grounded? "Use ONLY the snippets" is doing real work — remove it and watch hallucinations return.
- Only then is it a model problem.
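The first check can be made concrete with a small helper (a sketch reusing the keyword scoring from `mini_rag.py`; the name `debug_retrieval` is illustrative):

```python
import re

def debug_retrieval(query: str, docs: list[dict], k: int = 3) -> list[tuple[int, str]]:
    """Score every chunk against the query and print the top-k,
    so you can see exactly what retrieval would hand the model."""
    terms = set(re.findall(r"\w+", query.lower()))
    scored = sorted(
        ((len(terms & set(re.findall(r"\w+", d["text"].lower()))), d["id"]) for d in docs),
        key=lambda x: x[0],
        reverse=True,
    )
    for score, doc_id in scored[:k]:
        print(f"score={score}  [{doc_id}]")
    return scored[:k]
```

Run it on the question that produced a wrong answer before you touch anything else; a top score of 0 or 1 usually means a retrieval problem, not a model problem.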
Upgrade path: embeddings instead of keywords
Keyword search misses synonyms ("refund" vs "return"). The standard fix is embeddings:
```python
# Pseudocode — see notebook 02 for a complete example.
from openai import OpenAI  # any embedding provider works

embeddings_client = OpenAI()

def embed(text: str):
    return embeddings_client.embeddings.create(
        model="text-embedding-3-small", input=text
    ).data[0].embedding

DOC_VECTORS = [(doc, embed(doc["text"])) for doc in DOCS]

def retrieve(query: str, k: int = 3):
    qv = embed(query)
    scored = [(cosine(qv, dv), doc) for doc, dv in DOC_VECTORS]
    scored.sort(reverse=True, key=lambda x: x[0])
    return [doc for _, doc in scored[:k]]
```
The downstream prompt is unchanged. That's the point — get the loop right with keywords first, swap retrievers later.
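One detail the pseudocode glosses over: `cosine` has to come from somewhere. A minimal pure-Python version, in case you don't want a numpy dependency yet:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical direction)
print(cosine([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```

At a few thousand documents this brute-force scan is fine; beyond that, reach for a vector index.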
Try changing one thing
- Add an off-policy chunk to `DOCS` (e.g. "Our private API key is sk-XXX") and ask "what's the API key?". Notice how Claude still cites the chunk — so RAG can leak data. Filter what you put in `DOCS`.
- Lower `k` to 1. See answers degrade for questions that need to combine two facts.
- Replace the `system` prompt with just the question (no instructions). Watch grounding evaporate.
- Add a "no answer" check: if the top score is below a threshold, return "I don't know" without calling the model. You just saved an API call.
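That last experiment might look like this (a sketch; `MIN_SCORE` and the helper names are illustrative, and the right threshold depends on your data):

```python
import re

MIN_SCORE = 2  # illustrative threshold: tune against real queries

def retrieve_with_scores(query: str, docs: list[dict], k: int = 3):
    """Keyword-overlap retrieval that keeps the scores alongside the chunks."""
    terms = set(re.findall(r"\w+", query.lower()))
    scored = [(len(terms & set(re.findall(r"\w+", d["text"].lower()))), d) for d in docs]
    scored.sort(key=lambda x: x[0], reverse=True)
    return scored[:k]

def answer_or_refuse(question: str, docs: list[dict]) -> str:
    top = retrieve_with_scores(question, docs)
    if not top or top[0][0] < MIN_SCORE:
        # Below the threshold: refuse without spending an API call.
        return "I don't know: nothing relevant in the knowledge base."
    return f"(would call the model with {len(top)} chunks here)"
```

A keyword threshold is coarse; with embeddings the same idea becomes a minimum cosine similarity.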
Going deeper: open the notebooks
- `notebooks/01_introduction.ipynb` — embeddings, hybrid retrieval (keyword + vector), re-ranking (~1.5–2h)
- `notebooks/02_intermediate.ipynb` — query rewriting, contradiction handling, freshness (~2–3h)
- `notebooks/03_advanced.ipynb` — access control, PII handling, scaling retrieval infra (~1.5–2.5h)
Module checklist
- [ ] You ran a RAG query and got a citation in the answer
- [ ] You watched the model honestly say "I don't know" when retrieval missed
- [ ] You can name the three steps of the loop without notes
- [ ] You've debugged at least one wrong answer by first checking what was retrieved
Next module
Module 10 · Conversations — multi-turn chat that remembers, stays on topic, and doesn't grow without bound.