Data Analysis with Claude
Have Claude crunch CSVs and surface insights.
LLMs are surprisingly good at the talking-about-data parts of analysis: spotting odd values, suggesting follow-up questions, drafting a first interpretation. They are emphatically not a replacement for actually running the numbers. This module shows you how to get the best of both: Claude as a senior pair, your code as the source of truth.
By the end of this module you'll have
- A working describe-this-table tool that summarises any CSV in plain English
- A safe pattern: compute first, narrate second — never let the model invent statistics
- A clear sense of what to delegate to Claude versus what to keep in pandas
Time: about 1.5 hours for the basics, ~6 hours with all three notebooks.
Prerequisites: Modules 6 (advanced prompting), 7 (building apps), 8 (tool use). Familiarity with pandas helps.
The cardinal rule
Compute first. Narrate second.
Don't paste a CSV into a prompt and ask "what does this say?" — the model will invent plausible numbers. Instead: run real code to compute statistics, then pass the summary to Claude for interpretation. The model is the senior reviewer; pandas is the analyst.
A working "describe this CSV" tool
Save as describe_csv.py. You'll need pandas (pip install pandas); the script also imports the anthropic and python-dotenv packages.
import io

import pandas as pd
from anthropic import Anthropic
from dotenv import load_dotenv

load_dotenv()  # load environment variables (such as ANTHROPIC_API_KEY) from .env
client = Anthropic()


def technical_summary(df: pd.DataFrame) -> str:
    """Compute hard facts. No LLM involved."""
    parts = []
    parts.append(f"Rows: {len(df):,} · Columns: {len(df.columns)}")
    parts.append("\nDtypes:\n" + df.dtypes.to_string())
    parts.append("\nMissing values per column:\n" + df.isna().sum().to_string())
    # Descriptive statistics only make sense if there is at least one numeric column
    if len(df.select_dtypes("number").columns):
        parts.append("\nNumeric describe:\n" + df.describe().round(2).to_string())
    # For text-like columns, report the three most frequent values
    cats = df.select_dtypes(include=["object", "category"])
    if len(cats.columns):
        top = {c: cats[c].value_counts().head(3).to_dict() for c in cats.columns}
        parts.append("\nTop categorical values:\n" + "\n".join(f"{k}: {v}" for k, v in top.items()))
    return "\n".join(parts)


def narrate(df: pd.DataFrame, *, business_context: str = "") -> str:
    # The model only ever sees the computed summary, never the raw rows
    facts = technical_summary(df)
    response = client.messages.create(
        model="claude-sonnet-4-6", max_tokens=600,
        system=(
            "You are a senior data analyst reviewing a colleague's quick look. "
            "Use ONLY the numbers in the summary below. Do not invent statistics. "
            "If a question can't be answered from the summary, say so explicitly. "
            "Output: (1) a 4-sentence overview, (2) up to 3 anomalies worth investigating, "
            "(3) up to 3 follow-up questions you'd ask."
        ),
        messages=[{
            "role": "user",
            "content": f"BUSINESS CONTEXT: {business_context or 'none provided'}\n\nDATASET SUMMARY:\n{facts}",
        }],
    )
    return response.content[0].text


if __name__ == "__main__":
    # Small inline sample; the 2026-01-02 row for product B has missing values
    csv_text = """date,product,units,revenue
2026-01-01,A,12,240
2026-01-01,B,5,150
2026-01-02,A,15,300
2026-01-02,B,,
2026-01-03,A,18,360
2026-01-03,B,7,210
"""
    df = pd.read_csv(io.StringIO(csv_text))
    print(narrate(df, business_context="A small e-commerce sales sample"))
Notice the structure:
- Pandas computes the facts. Row counts, dtypes, descriptive stats, missing-value counts — all real, all reproducible.
- Claude reads the summary, not the data. It has no need to guess a mean when you have already handed it the real one.
- The system prompt forbids invention. "Use ONLY the numbers" + "say so if you can't answer" is doing real work.
What Claude is great at (and what it isn't)
| Great at | Don't trust it for |
|---|---|
| Spotting "this column has 30% missing values, that's worth a look" | Computing those percentages on raw data |
| Drafting follow-up questions from a summary | Choosing which statistical test is appropriate without you |
| Translating SQL to Python (or vice versa) | Running unverified SQL against your prod database |
| Plain-English summaries of regression coefficients | Inventing coefficients from a description of the data |
| Finding inconsistencies between two summaries | Joining two CSVs by inferring keys |
The pattern: delegate the language work, keep the math.
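The first row of the table is that pattern in miniature: the percentage comes from pandas, and the judgement about whether it matters comes from Claude. A minimal sketch, assuming a hypothetical helper missing_percentages that is not part of describe_csv.py, with made-up sample rows:

```python
import io

import pandas as pd


def missing_percentages(df: pd.DataFrame) -> pd.Series:
    """Percentage of missing values per column, rounded to one decimal place."""
    # isna() gives a boolean frame; the mean of booleans is the missing fraction
    return df.isna().mean().mul(100).round(1)


csv_text = """date,product,units,revenue
2026-01-01,A,12,240
2026-01-02,B,,
2026-01-03,A,18,360
2026-01-03,B,7,210
"""
df = pd.read_csv(io.StringIO(csv_text))
pct = missing_percentages(df)
print(pct.to_string())
```

The resulting Series can be appended to the summary text, so the model comments on percentages it was given rather than ones it estimated.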
Safer alternatives to "paste the CSV"
Three patterns that scale to real datasets:
1. Aggregate first, ask second
top_products = df.groupby("product")["revenue"].sum().sort_values(ascending=False).head(10)
narrative = narrate_series(top_products, business_context="weekly revenue by product")
You hand Claude 10 numbers, not 10 million rows.
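narrate_series is not defined in describe_csv.py. One way it might look, as a sketch: the prompt is built from the aggregated values only, and the model call is passed in as a hypothetical ask callable so the prompt-building part can be inspected without an API key. The names series_prompt and ask are assumptions made for illustration.

```python
from typing import Callable, Optional


def series_prompt(series, *, business_context: str = "") -> str:
    """Format an aggregated Series (or any mapping with .items()) as a prompt."""
    lines = [f"{label}: {value}" for label, value in series.items()]
    return (
        f"BUSINESS CONTEXT: {business_context or 'none provided'}\n\n"
        "AGGREGATED VALUES:\n" + "\n".join(lines)
    )


def narrate_series(series, *, business_context: str = "",
                   ask: Optional[Callable[[str], str]] = None) -> str:
    """Build the prompt, then hand it to `ask` if a model call was supplied."""
    prompt = series_prompt(series, business_context=business_context)
    if ask is None:
        # No model wired in: return the prompt so it can be inspected
        return prompt
    return ask(prompt)


# A plain dict stands in for the pandas Series here; both expose .items()
example = narrate_series({"A": 900, "B": 360}, business_context="weekly revenue by product")
print(example)
```

In practice ask would wrap client.messages.create with the same system prompt used by narrate.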
2. Tool use (Module 8) for interactive analysis
Define tools like top_n(table, column, n) and correlation(table, col_a, col_b). Claude decides what to ask for, you compute it, return the result. The model never sees the raw rows — it composes its analysis from your computed answers.
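A minimal sketch of the compute side of that loop, assuming tables are held in a dict of DataFrames. The tool definitions use the name, description, and input_schema fields from Module 8; the TABLES registry, the run_tool dispatcher, and the sample data are hypothetical. The request/response loop with the API is left out.

```python
import json

import pandas as pd

# Hypothetical in-memory registry of tables the model may ask about
TABLES = {
    "sales": pd.DataFrame({
        "product": ["A", "B", "C"],
        "units": [1, 2, 3],
        "revenue": [10.0, 20.0, 30.0],
    })
}

# Tool definitions to pass to the API alongside the conversation
TOOLS = [
    {
        "name": "top_n",
        "description": "Return the n rows with the largest values in a column.",
        "input_schema": {
            "type": "object",
            "properties": {
                "table": {"type": "string"},
                "column": {"type": "string"},
                "n": {"type": "integer"},
            },
            "required": ["table", "column", "n"],
        },
    },
    {
        "name": "correlation",
        "description": "Pearson correlation between two numeric columns.",
        "input_schema": {
            "type": "object",
            "properties": {
                "table": {"type": "string"},
                "col_a": {"type": "string"},
                "col_b": {"type": "string"},
            },
            "required": ["table", "col_a", "col_b"],
        },
    },
]


def top_n(table: str, column: str, n: int) -> list:
    # Round-trip through to_json so the values are plain JSON types
    return json.loads(TABLES[table].nlargest(n, column).to_json(orient="records"))


def correlation(table: str, col_a: str, col_b: str) -> float:
    df = TABLES[table]
    return float(df[col_a].corr(df[col_b]))


HANDLERS = {"top_n": top_n, "correlation": correlation}


def run_tool(name: str, tool_input: dict) -> str:
    """Execute a requested tool and return a JSON string for the tool result."""
    if name not in HANDLERS:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        return json.dumps({"result": HANDLERS[name](**tool_input)})
    except (KeyError, TypeError, ValueError) as exc:
        # Unknown table or column, or malformed arguments from the model
        return json.dumps({"error": str(exc)})


print(run_tool("top_n", {"table": "sales", "column": "revenue", "n": 2}))
```

Only names listed in HANDLERS can run, and errors go back to the model as data, so a malformed request produces a message it can react to rather than a crash.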
3. Code-first: ask Claude to write the analysis code
prompt = """
A pandas DataFrame `df` has columns: date, product, units, revenue.
Write Python that:
1. Aggregates revenue by week and product
2. Highlights the week-on-week change
3. Returns a small DataFrame ready to display
Reply with code only, no commentary.
"""
You review and run the code yourself. Claude wrote it; you executed it. Different bargain — the model still doesn't see the data, but it does the boring SQL/pandas writing for you.
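Before reading generated code closely, a cheap static pass can help. A sketch using only the standard library, with hypothetical helper names: extract_code strips a markdown fence if the model added one despite the instruction, and review_notes parses the code without executing it and lists the imports plus any call names on a small watch list. The watch list is an illustrative assumption, not a security boundary, and the pass does not replace reading the code.

```python
import ast
import re

TICKS = "`" * 3  # a markdown code fence marker
FENCE = re.compile(rf"^{TICKS}[\w-]*\n(.*?)\n{TICKS}\s*$", re.DOTALL)

# Illustrative names that deserve a second look in analysis code
WATCHLIST = {"eval", "exec", "open", "system", "remove", "rmtree", "to_sql"}


def extract_code(reply: str) -> str:
    """Return the code inside a single fenced block, or the reply unchanged."""
    match = FENCE.match(reply.strip())
    return match.group(1) if match else reply.strip()


def review_notes(code: str) -> dict:
    """Parse (never execute) the code and summarise what to check by hand."""
    tree = ast.parse(code)  # raises SyntaxError if the reply is not valid Python
    imports, flagged = set(), set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            imports.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            imports.add(node.module or ".")
        elif isinstance(node, ast.Call):
            func = node.func
            # Plain calls carry .id, method calls carry .attr
            name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
            if name in WATCHLIST:
                flagged.add(name)
    return {"imports": sorted(imports), "flagged_calls": sorted(flagged)}


reply = (
    TICKS + "python\n"
    "import os\n"
    "weekly = df.groupby('product')['revenue'].sum()\n"
    "os.remove('x.csv')\n" + TICKS
)
notes = review_notes(extract_code(reply))
print(notes)
```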
Try changing one thing
- Add BUSINESS CONTEXT: "monthly revenue across two products" and re-run. Notice the narrative gets sharper when context is explicit.
- Pass an obviously wrong "summary" (e.g. inflate one mean by 10x). Watch Claude flag it as an anomaly: it is doing some sanity checking.
- Remove the "Use ONLY the numbers" line from the system prompt. Re-run on a different CSV. Watch made-up percentages creep in.
- Build a tiny tool-use version where Claude can ask top_products(n=5) instead of getting the summary up-front.
Going deeper: open the notebooks
- notebooks/01_introduction.ipynb: analysis copilot patterns, summary-first prompts (~1.5–2h)
- notebooks/02_intermediate.ipynb: Claude as code generator for analysis, review loops (~2–3h)
- notebooks/03_advanced.ipynb: analyst-grade evaluation, statistical literacy guards (~1.5–2.5h)
Module checklist
- [ ] You ran describe_csv.py and got a sensible narrative
- [ ] You can name three things Claude is great at and three things you would never delegate
- [ ] You've separated computing the facts from narrating the facts in your own head
- [ ] You can imagine a tool-use version of this script
Next module
Module 12 · Code Generation — same idea, applied to the language Claude is unusually good at: code.