Stage 02 · Practitioner · Module 9 of 26 · ~8h

RAG Systems

Ground Claude’s answers in your own documents.


Claude doesn't know your company's docs, your private wiki, or yesterday's tickets. Retrieval-Augmented Generation fixes that: at question time, you fetch the relevant snippets and put them in the prompt. The model answers from your data, not its training set.

By the end of this module you'll have:

- a working RAG loop (mini_rag.py) that answers from your own documents, with citations
- a debugging order to follow when answers go wrong
- a clear upgrade path from keyword matching to embeddings

Time: about 1.5 hours for the basics, ~8 hours with all three notebooks.

Prerequisites: Modules 6 (advanced prompting) and 7 (building apps). Module 5 (tokens) helps too.


RAG in three lines

1.  Retrieve  — find the K most relevant chunks for the user's question.
2.  Assemble  — paste them into a prompt with clear delimiters.
3.  Answer    — ask Claude to answer using ONLY those chunks, with citations.

That's it. Vectors, embeddings, re-ranking — those are optimisations for step 1. The shape of the loop never changes.


A working RAG in about 40 lines

This uses simple keyword matching so you can see the loop without any vector-DB setup. We'll graduate to embeddings in the notebook.

Save as mini_rag.py:

import re
from anthropic import Anthropic
from dotenv import load_dotenv

load_dotenv()
client = Anthropic()

# Pretend this is your knowledge base. Each chunk has an id and text.
DOCS = [
    {"id": "policy-1",  "text": "Refunds are available within 30 days of purchase, with proof of purchase."},
    {"id": "policy-2",  "text": "Shipping to Europe takes 5–7 business days. Express shipping is 2 days."},
    {"id": "policy-3",  "text": "Customer support hours are 9am–6pm GMT, Monday to Friday."},
    {"id": "product-1", "text": "The Pro plan costs $29/month and includes priority support."},
    {"id": "product-2", "text": "The Free plan has a 5GB storage limit."},
]

def retrieve(query: str, k: int = 3):
    """Naive keyword scoring: count term overlaps. Replace with embeddings later."""
    terms = set(re.findall(r"\w+", query.lower()))
    scored = []
    for doc in DOCS:
        words = set(re.findall(r"\w+", doc["text"].lower()))
        scored.append((len(terms & words), doc))
    scored.sort(reverse=True, key=lambda x: x[0])
    return [doc for score, doc in scored[:k] if score > 0]

def answer(question: str) -> str:
    chunks = retrieve(question)
    if not chunks:
        return "I don't have anything in the knowledge base about that."

    context = "\n".join(f"[{c['id']}] {c['text']}" for c in chunks)

    response = client.messages.create(
        model="claude-sonnet-4-6", max_tokens=400,
        system=(
            "Answer the user's question using ONLY the snippets below. "
            "If the snippets don't contain the answer, say so. "
            "Cite each fact with its [id] tag at the end of the sentence. "
            "Do not invent information.\n\n"
            f"SNIPPETS:\n{context}"
        ),
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text

print(answer("How long does shipping take to Europe?"))
print()
print(answer("What's the weather in Paris?"))   # nothing relevant — should say so

You should see citations like "Shipping to Europe takes 5–7 business days [policy-2]" and an honest "no information" for the off-topic question.


What just happened?

Three things you'll do in every RAG system:

  1. You shrank the world. Instead of letting Claude guess, you handed it three relevant snippets out of (potentially) millions. That's the entire point.
  2. You forced grounding. "Use ONLY the snippets" plus "if not present, say so" is the difference between a useful tool and a confident liar.
  3. You enabled audit. The [id] citations let users (and you) trace any claim back to source (a short sketch follows this list). Module 14 turns this into a real eval signal.
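
Here's what that audit can look like in practice: a minimal sketch reusing retrieve and answer from mini_rag.py. The cited_ids helper is our own name, not part of any SDK:

import re

def cited_ids(text: str) -> set:
    """Collect every [id] tag the model cited in its answer."""
    return set(re.findall(r"\[([\w-]+)\]", text))

question = "How long does shipping take to Europe?"
retrieved = {c["id"] for c in retrieve(question)}
reply = answer(question)

# Any cited id that wasn't retrieved means the model invented a source.
print(cited_ids(reply) - retrieved)

An empty result (Python prints set()) means every citation traces back to a chunk you actually supplied; that's the raw material Module 14 turns into an eval signal.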

When the answer is wrong, where did it go wrong?

Most RAG bugs aren't model bugs. Walk this in order:

  1. Did the right chunk get retrieved? Print retrieve(question) separately and read it (see the sketch after this list). If the right text isn't there, no model can save you.
  2. Did the right chunk make it into the prompt? Token budgets are real — if you fetched 10 chunks and only 3 fit, the right one might have been dropped.
  3. Was the chunk too long or too short? Sentence fragments lose context. Whole-page chunks dilute relevance. Aim for 100–500 tokens per chunk in practice (a chunking sketch appears after the embeddings section).
  4. Did you tell the model to stay grounded? "Use ONLY the snippets" is doing real work — remove it and watch hallucinations return.
  5. Only then is it a model problem.
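
To run step 1 in isolation, score the chunks by hand. This sketch reuses DOCS and retrieve from mini_rag.py; the test question is our own:

import re

question = "Can I get my money back?"   # a synonym of "refund": no literal word overlap
terms = set(re.findall(r"\w+", question.lower()))
for doc in DOCS:
    overlap = len(terms & set(re.findall(r"\w+", doc["text"].lower())))
    print(overlap, doc["id"], doc["text"][:50])

print(retrieve(question))   # what would actually reach the prompt: an empty list

Every chunk scores zero, so the question falls through to the "no information" fallback even though policy-1 answers it. That synonym gap is exactly what the embeddings upgrade below closes.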

Upgrade path: embeddings instead of keywords

Keyword search misses synonyms ("refund" vs "return"). The standard fix is embeddings:

# A runnable sketch — see notebook 02 for a complete example.
import math
from openai import OpenAI                     # any embedding provider works

embeddings_client = OpenAI()

def embed(text: str) -> list:
    return embeddings_client.embeddings.create(
        model="text-embedding-3-small", input=text
    ).data[0].embedding

def cosine(a, b) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Embed every chunk once, up front.
DOC_VECTORS = [(doc, embed(doc["text"])) for doc in DOCS]

def retrieve(query: str, k: int = 3):
    qv = embed(query)
    scored = [(cosine(qv, dv), doc) for doc, dv in DOC_VECTORS]
    scored.sort(reverse=True, key=lambda x: x[0])
    return [doc for _, doc in scored[:k]]

The downstream prompt is unchanged. That's the point — get the loop right with keywords first, swap retrievers later.
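
One practical gap before the notebooks: real documents won't arrive pre-chunked like DOCS. Here's a minimal sketch of the 100–500-token guideline from the debugging list, using the rough rule of thumb that a token is about three-quarters of an English word; the chunk_text helper and its defaults are our assumptions, not a standard:

import re

def chunk_text(text: str, target_tokens: int = 300) -> list:
    """Greedily pack whole sentences into chunks of roughly target_tokens tokens."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    target_words = int(target_tokens * 0.75)   # ~0.75 words per token in English
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > target_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

Keeping sentences whole avoids the fragment problem from step 3 of the debugging list; capping chunk size keeps any single chunk from diluting relevance.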


Try changing one thing

Pick one, re-run mini_rag.py, and predict the output before you look:

1. Delete "ONLY" from the system prompt and ask the off-topic question again; step 4 of the debugging list predicts what happens.
2. Change k from 3 to 1 in retrieve and see which questions stop working.
3. Add a new chunk to DOCS and ask a question only it can answer.


Going deeper: open the notebooks


Module checklist

- You ran mini_rag.py and saw both a cited answer and an honest refusal.
- You can walk the five retrieval-debugging steps in order before blaming the model.
- You can explain why swapping keyword retrieval for embeddings leaves the downstream prompt untouched.


Next module

Module 10 · Conversations — multi-turn chat that remembers, stays on topic, and doesn't grow without bound.