Stage 03 · Builder · Module 11 of 26 · ~6h

Data Analysis with Claude

Have Claude crunch CSVs and surface insights.


LLMs are surprisingly good at the talking-about-data parts of analysis: spotting odd values, suggesting follow-up questions, drafting a first interpretation. They are emphatically not a replacement for actually running the numbers. This module shows you how to get the best of both — Claude as a senior pair, your code as the source of truth.

By the end of this module you'll have a working "describe this CSV" tool and three patterns for analyzing real datasets without pasting raw rows into a prompt.

Time: about 1.5 hours for the basics, ~6 hours with all three notebooks.

Prerequisites: Modules 6 (advanced prompting), 7 (building apps), 8 (tool use). Familiarity with pandas helps.


The cardinal rule

Compute first. Narrate second.

Don't paste a CSV into a prompt and ask "what does this say?" — the model will invent plausible numbers. Instead: run real code to compute statistics, then pass the summary to Claude for interpretation. The model is the senior reviewer; pandas is the analyst.


A working "describe this CSV" tool

Save as describe_csv.py. You'll need pandas, the Anthropic SDK, and python-dotenv (pip install pandas anthropic python-dotenv).

import io
import pandas as pd
from anthropic import Anthropic
from dotenv import load_dotenv

load_dotenv()
client = Anthropic()

def technical_summary(df: pd.DataFrame) -> str:
    """Compute hard facts. No LLM involved."""
    parts = []
    parts.append(f"Rows: {len(df):,} · Columns: {len(df.columns)}")
    parts.append("\nDtypes:\n" + df.dtypes.to_string())
    parts.append("\nMissing values per column:\n" + df.isna().sum().to_string())
    if len(df.select_dtypes("number").columns):
        parts.append("\nNumeric describe:\n" + df.describe().round(2).to_string())
    cats = df.select_dtypes(include=["object", "category"])
    if len(cats.columns):
        top = {c: cats[c].value_counts().head(3).to_dict() for c in cats.columns}
        parts.append("\nTop categorical values:\n" + "\n".join(f"{k}: {v}" for k, v in top.items()))
    return "\n".join(parts)

def narrate(df: pd.DataFrame, *, business_context: str = "") -> str:
    facts = technical_summary(df)
    response = client.messages.create(
        model="claude-sonnet-4-6", max_tokens=600,
        system=(
            "You are a senior data analyst reviewing a colleague's quick look. "
            "Use ONLY the numbers in the summary below. Do not invent statistics. "
            "If a question can't be answered from the summary, say so explicitly. "
            "Output: (1) a 4-sentence overview, (2) up to 3 anomalies worth investigating, "
            "(3) up to 3 follow-up questions you'd ask."
        ),
        messages=[{
            "role": "user",
            "content": f"BUSINESS CONTEXT: {business_context or 'none provided'}\n\nDATASET SUMMARY:\n{facts}",
        }],
    )
    return response.content[0].text

if __name__ == "__main__":
    csv_text = """date,product,units,revenue
2026-01-01,A,12,240
2026-01-01,B,5,150
2026-01-02,A,15,300
2026-01-02,B,,
2026-01-03,A,18,360
2026-01-03,B,7,210
"""
    df = pd.read_csv(io.StringIO(csv_text))
    print(narrate(df, business_context="A small e-commerce sales sample"))

Notice the structure:

  1. Pandas computes the facts. Row counts, dtypes, descriptive stats, missing-value counts — all real, all reproducible.
  2. Claude reads the summary, not the data. It can't hallucinate a mean if you handed it the mean.
  3. The system prompt forbids invention. "Use ONLY the numbers" + "say so if you can't answer" is doing real work.

What Claude is great at (and what it isn't)

| Great at | Don't trust it for |
| --- | --- |
| Spotting "this column has 30% missing values, that's worth a look" | Computing those percentages on raw data |
| Drafting follow-up questions from a summary | Choosing which statistical test is appropriate without you |
| Translating SQL to Python (or vice versa) | Running unverified SQL against your prod database |
| Plain-English summaries of regression coefficients | Inventing coefficients from a description of the data |
| Finding inconsistencies between two summaries | Joining two CSVs by inferring keys |

The pattern: delegate the language work, keep the math.


Safer alternatives to "paste the CSV"

Three patterns that scale to real datasets:

1. Aggregate first, ask second

top_products = df.groupby("product")["revenue"].sum().sort_values(ascending=False).head(10)
narrative = narrate_series(top_products, business_context="weekly revenue by product")

You hand Claude 10 numbers, not 10 million rows.
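`narrate_series` isn't defined in describe_csv.py above; a minimal sketch, assuming it mirrors the `narrate` function (the name, prompt wording, and token budget here are illustrative):

```python
import pandas as pd

def series_facts(series: pd.Series) -> str:
    """Render a small, pre-aggregated Series as plain text — the only thing Claude sees."""
    return f"Name: {series.name}\nValues:\n{series.to_string()}"

def narrate_series(series: pd.Series, *, business_context: str = "") -> str:
    from anthropic import Anthropic  # imported lazily so series_facts works without the SDK
    client = Anthropic()
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=400,
        system=(
            "You are a senior data analyst. Use ONLY the numbers in the summary "
            "below. Do not invent statistics."
        ),
        messages=[{
            "role": "user",
            "content": f"BUSINESS CONTEXT: {business_context or 'none provided'}\n\n"
                       f"SERIES SUMMARY:\n{series_facts(series)}",
        }],
    )
    return response.content[0].text
```

Same bargain as before: pandas computed the aggregation; Claude only ever sees the ten summed values.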

2. Tool use (Module 8) for interactive analysis

Define tools like top_n(table, column, n) and correlation(table, col_a, col_b). Claude decides what to ask for, you compute it, return the result. The model never sees the raw rows — it composes its analysis from your computed answers.
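A sketch of what those tool definitions and the local dispatcher might look like — the schemas follow the shape the Anthropic Messages API expects for `tools`, but the names and parameters are this page's illustrative examples, not a fixed API:

```python
import pandas as pd

# Tool schemas in the shape the Anthropic Messages API expects.
TOOLS = [
    {
        "name": "top_n",
        "description": "Sum a column per group and return the top n groups.",
        "input_schema": {
            "type": "object",
            "properties": {
                "group_by": {"type": "string"},
                "column": {"type": "string"},
                "n": {"type": "integer"},
            },
            "required": ["group_by", "column", "n"],
        },
    },
    {
        "name": "correlation",
        "description": "Pearson correlation between two numeric columns.",
        "input_schema": {
            "type": "object",
            "properties": {
                "col_a": {"type": "string"},
                "col_b": {"type": "string"},
            },
            "required": ["col_a", "col_b"],
        },
    },
]

def run_tool(df: pd.DataFrame, name: str, args: dict) -> str:
    """Execute a tool call locally; only the computed result goes back to Claude."""
    if name == "top_n":
        result = (df.groupby(args["group_by"])[args["column"]]
                    .sum().sort_values(ascending=False).head(args["n"]))
        return result.to_string()
    if name == "correlation":
        return f"{df[args['col_a']].corr(df[args['col_b']]):.3f}"
    raise ValueError(f"unknown tool: {name}")
```

In the tool-use loop from Module 8, each `tool_use` block from Claude gets routed through `run_tool` and the string result goes back as a `tool_result` message.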

3. Code-first: ask Claude to write the analysis code

prompt = """
A pandas DataFrame `df` has columns: date, product, units, revenue.
Write Python that:
1. Aggregates revenue by week and product
2. Highlights the week-on-week change
3. Returns a small DataFrame ready to display
Reply with code only, no commentary.
"""

You review and run the code yourself. Claude wrote it; you executed it. Different bargain — the model still doesn't see the data, but it does the boring SQL/pandas writing for you.
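One practical wrinkle: even when told "code only", models often wrap the reply in a markdown fence. A small, hypothetical helper to strip it before your review step:

```python
import re

def extract_code(reply: str) -> str:
    """Return the body of a ```python fence if present, else the reply as-is."""
    match = re.search(r"```(?:python)?\n(.*?)```", reply, re.DOTALL)
    return match.group(1).strip() if match else reply.strip()
```

Read the extracted code before running it — executing unreviewed model output against real data defeats the purpose of keeping the math on your side.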


Try changing one thing


Going deeper: open the notebooks


Module checklist


Next module

Module 12 · Code Generation — same idea, applied to the language Claude is unusually good at: code.