Stage 03 · Builder · Module 11 of 26 · ~6h

Data Analysis with Claude

Have Claude crunch CSVs and surface insights.


LLMs are surprisingly good at the talking-about-data parts of analysis: spotting odd values, suggesting follow-up questions, drafting a first interpretation. They are emphatically not a replacement for actually running the numbers. This module shows you how to get the best of both — Claude as a senior pair, your code as the source of truth.

By the end of this module you'll have a working "describe this CSV" tool and three patterns for analyzing real datasets without pasting raw rows into a prompt.

Time: about 1.5 hours for the basics, ~6 hours with all three notebooks.

Prerequisites: Modules 6 (advanced prompting), 7 (building apps), 8 (tool use). Familiarity with pandas helps.


The cardinal rule

Compute first. Narrate second.

Don't paste a CSV into a prompt and ask "what does this say?" — the model will invent plausible numbers. Instead: run real code to compute statistics, then pass the summary to Claude for interpretation. The model is the senior reviewer; pandas is the analyst.


A working "describe this CSV" tool

Save as describe_csv.py. You'll need pandas, the Anthropic SDK, and python-dotenv (pip install pandas anthropic python-dotenv).

import io
import pandas as pd
from anthropic import Anthropic
from dotenv import load_dotenv

load_dotenv()
client = Anthropic()

def technical_summary(df: pd.DataFrame) -> str:
    """Compute hard facts. No LLM involved."""
    parts = []
    parts.append(f"Rows: {len(df):,} · Columns: {len(df.columns)}")
    parts.append("\nDtypes:\n" + df.dtypes.to_string())
    parts.append("\nMissing values per column:\n" + df.isna().sum().to_string())
    if len(df.select_dtypes("number").columns):
        parts.append("\nNumeric describe:\n" + df.describe().round(2).to_string())
    cats = df.select_dtypes(include=["object", "category"])
    if len(cats.columns):
        top = {c: cats[c].value_counts().head(3).to_dict() for c in cats.columns}
        parts.append("\nTop categorical values:\n" + "\n".join(f"{k}: {v}" for k, v in top.items()))
    return "\n".join(parts)

def narrate(df: pd.DataFrame, *, business_context: str = "") -> str:
    facts = technical_summary(df)
    response = client.messages.create(
        model="claude-sonnet-4-6", max_tokens=600,
        system=(
            "You are a senior data analyst reviewing a colleague's quick look. "
            "Use ONLY the numbers in the summary below. Do not invent statistics. "
            "If a question can't be answered from the summary, say so explicitly. "
            "Output: (1) a 4-sentence overview, (2) up to 3 anomalies worth investigating, "
            "(3) up to 3 follow-up questions you'd ask."
        ),
        messages=[{
            "role": "user",
            "content": f"BUSINESS CONTEXT: {business_context or 'none provided'}\n\nDATASET SUMMARY:\n{facts}",
        }],
    )
    return response.content[0].text

if __name__ == "__main__":
    csv_text = """date,product,units,revenue
2026-01-01,A,12,240
2026-01-01,B,5,150
2026-01-02,A,15,300
2026-01-02,B,,
2026-01-03,A,18,360
2026-01-03,B,7,210
"""
    df = pd.read_csv(io.StringIO(csv_text))
    print(narrate(df, business_context="A small e-commerce sales sample"))

Notice the structure:

  1. Pandas computes the facts. Row counts, dtypes, descriptive stats, missing-value counts — all real, all reproducible.
  2. Claude reads the summary, not the data. It can't hallucinate a mean if you handed it the mean.
  3. The system prompt forbids invention. "Use ONLY the numbers" + "say so if you can't answer" is doing real work.

What Claude is great at (and what it isn't)

| Great at | Don't trust it for |
| --- | --- |
| Spotting "this column has 30% missing values, that's worth a look" | Computing those percentages on raw data |
| Drafting follow-up questions from a summary | Choosing which statistical test is appropriate without you |
| Translating SQL to Python (or vice versa) | Running unverified SQL against your prod database |
| Plain-English summaries of regression coefficients | Inventing coefficients from a description of the data |
| Finding inconsistencies between two summaries | Joining two CSVs by inferring keys |

The pattern: delegate the language work, keep the math.


Safer alternatives to "paste the CSV"

Three patterns that scale to real datasets:

1. Aggregate first, ask second

top_products = df.groupby("product")["revenue"].sum().sort_values(ascending=False).head(10)
narrative = narrate_series(top_products, business_context="weekly revenue by product")

You hand Claude 10 numbers, not 10 million rows.
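`narrate_series` isn't defined in describe_csv.py above; a minimal sketch, assuming it mirrors the `narrate` function (the name, prompt wording, and token budget here are illustrative):

```python
import pandas as pd

def series_facts(series: pd.Series) -> str:
    """Render a small, pre-aggregated Series as plain text — the only thing Claude sees."""
    return f"Name: {series.name}\nValues:\n{series.to_string()}"

def narrate_series(series: pd.Series, *, business_context: str = "") -> str:
    from anthropic import Anthropic  # imported lazily so series_facts works without the SDK
    client = Anthropic()
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=400,
        system=(
            "You are a senior data analyst. Use ONLY the numbers in the summary "
            "below. Do not invent statistics."
        ),
        messages=[{
            "role": "user",
            "content": f"BUSINESS CONTEXT: {business_context or 'none provided'}\n\n"
                       f"SERIES SUMMARY:\n{series_facts(series)}",
        }],
    )
    return response.content[0].text
```

Same bargain as before: pandas computed the aggregation; Claude only ever sees the ten summed values.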

2. Tool use (Module 8) for interactive analysis

Define tools like top_n(table, column, n) and correlation(table, col_a, col_b). Claude decides what to ask for, you compute it, return the result. The model never sees the raw rows — it composes its analysis from your computed answers.
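A sketch of what those tool definitions and the local dispatcher might look like — the schemas follow the shape the Anthropic Messages API expects for `tools`, but the names and parameters are this page's illustrative examples, not a fixed API:

```python
import pandas as pd

# Tool schemas in the shape the Anthropic Messages API expects.
TOOLS = [
    {
        "name": "top_n",
        "description": "Sum a column per group and return the top n groups.",
        "input_schema": {
            "type": "object",
            "properties": {
                "group_by": {"type": "string"},
                "column": {"type": "string"},
                "n": {"type": "integer"},
            },
            "required": ["group_by", "column", "n"],
        },
    },
    {
        "name": "correlation",
        "description": "Pearson correlation between two numeric columns.",
        "input_schema": {
            "type": "object",
            "properties": {
                "col_a": {"type": "string"},
                "col_b": {"type": "string"},
            },
            "required": ["col_a", "col_b"],
        },
    },
]

def run_tool(df: pd.DataFrame, name: str, args: dict) -> str:
    """Execute a tool call locally; only the computed result goes back to Claude."""
    if name == "top_n":
        result = (df.groupby(args["group_by"])[args["column"]]
                    .sum().sort_values(ascending=False).head(args["n"]))
        return result.to_string()
    if name == "correlation":
        return f"{df[args['col_a']].corr(df[args['col_b']]):.3f}"
    raise ValueError(f"unknown tool: {name}")
```

In the tool-use loop from Module 8, each `tool_use` block from Claude gets routed through `run_tool` and the string result goes back as a `tool_result` message.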

3. Code-first: ask Claude to write the analysis code

prompt = """
A pandas DataFrame `df` has columns: date, product, units, revenue.
Write Python that:
1. Aggregates revenue by week and product
2. Highlights the week-on-week change
3. Returns a small DataFrame ready to display
Reply with code only, no commentary.
"""

You review and run the code yourself. Claude wrote it; you executed it. Different bargain — the model still doesn't see the data, but it does the boring SQL/pandas writing for you.
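One practical wrinkle: even when told "code only", models often wrap the reply in a markdown fence. A small, hypothetical helper to strip it before your review step:

```python
import re

def extract_code(reply: str) -> str:
    """Return the body of a ```python fence if present, else the reply as-is."""
    match = re.search(r"```(?:python)?\n(.*?)```", reply, re.DOTALL)
    return match.group(1).strip() if match else reply.strip()
```

Read the extracted code before running it — executing unreviewed model output against real data defeats the purpose of keeping the math on your side.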


Try changing one thing


Going deeper: open the notebooks


Module checklist


Next module

Module 12 · Code Generation — same idea, applied to the language Claude is unusually good at: code.