How Claude Automates 95% of Business Analytics: Inside Anthropic's Data Team

June 2026 · 6 min read · Technical

Anthropic's data science team published a detailed account on 3 June 2026 of how it automated its own business analytics with Claude. The headline numbers deserve attention: 95% of business analytics queries at Anthropic are now answered by Claude, at roughly 95% accuracy in aggregate. The analysts who used to field those requests now spend their time on causal modelling, forecasting, and machine learning instead of ad-hoc query work.

For Australian data leaders the post is useful precisely because it is not a victory lap. It is candid about the failure mode most teams hit first: point Claude at the warehouse, let it write SQL, and watch a false sense of precision creep in. The agent always returns a number. Whether it is the right number depends on work that happens long before anyone types a question.

Why analytics agents fail differently from coding agents

Coding agents perform well because software offers an open-ended solution space with natural guardrails: tests, type signatures, and documentation catch hallucinations before they ship. Analytics is the opposite shape of problem. A stakeholder's question usually has exactly one correct answer, drawn from one correct source, and there is no deterministic test that proves the answer right.

Anthropic's framing follows from that: analytics accuracy is a context and verification problem, not a code generation problem. The hard part is mapping a question like "weekly active users" to specific, current entities in the data model and knowing the correct way to use them. Get that mapping right and the SQL becomes trivial. The team identified three failure modes that account for the overwhelming majority of wrong answers:

  • Ambiguity. With hundreds of plausible fields in a data model, the agent picks the wrong one. Measuring "active users" alone forces choices about which actions count as active, whether fraudulent accounts are included, and what lookback window applies.

  • Staleness. Data sources, business definitions, and schemas change constantly. Assets and agent knowledge go stale and start returning subtly wrong answers that look plausible.

  • Discovery. The right table may exist and be properly annotated, but the search space is so vast the agent simply never finds it.
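
To make the ambiguity failure mode concrete, here is a minimal Python sketch. The event records, the action names, and the `count_active_users` helper are all hypothetical and are not taken from Anthropic's post. The only point is that one question, "how many active users?", yields different numbers depending on which definition is chosen.

```python
from datetime import date, timedelta

# Hypothetical event log: (user_id, action, event_date, is_fraudulent_account)
EVENTS = [
    ("u1", "login", date(2026, 6, 1), False),
    ("u1", "query", date(2026, 6, 2), False),
    ("u2", "login", date(2026, 6, 2), False),
    ("u3", "query", date(2026, 6, 3), True),
    ("u4", "query", date(2026, 5, 20), False),
]


def count_active_users(events, as_of, lookback_days, active_actions, include_fraud):
    """Count distinct users with a qualifying action inside the lookback window."""
    window_start = as_of - timedelta(days=lookback_days)
    users = set()
    for user_id, action, event_date, is_fraud in events:
        if not (window_start < event_date <= as_of):
            continue  # outside the lookback window
        if action not in active_actions:
            continue  # this action does not count as "active"
        if is_fraud and not include_fraud:
            continue  # fraudulent accounts excluded under this definition
        users.add(user_id)
    return len(users)


as_of = date(2026, 6, 3)

# Definition A: any login or query in the last 28 days, fraud included.
loose = count_active_users(EVENTS, as_of, 28, {"login", "query"}, include_fraud=True)

# Definition B: at least one query in the last 7 days, fraud excluded.
strict = count_active_users(EVENTS, as_of, 7, {"query"}, include_fraud=False)

print(loose, strict)  # 4 1
```

Both results are defensible readings of "active users" on the same five rows. An agent that picks either one without a governed definition will still return a confident number.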

The stack Anthropic built to attack those errors

Each layer of Anthropic's agentic data stack exists to counter one or more of the three failure modes. Data foundations shrink the space of plausible entities until a concept resolves to a single governed answer. Maintenance and validation processes keep definitions from rotting as the business changes. Skills make sure the agent reliably finds and correctly uses the governed answer.

The practices that did the heavy lifting are unglamorous and very adoptable:

  • Fewer, heavily governed datasets. Curate a small set of canonical, single-source-of-truth models with clear owners, then aggressively deprecate the near-duplicates. The goal is that a concept search returns one governed answer, not forty plausible candidates.

  • Enforcement, not just governance. Agents are structurally routed to canonical models first, CI fails changes that bypass them, and downstream teams build on the governed layer or explain why not. Governance without enforcement decays straight back to the duplicates problem.

  • Colocation. Nearly all data code lives in one repository. If a modelling change would break a downstream dashboard or invalidate a documented metric, CI flags it and the fix ships in the same pull request.

  • A self-documenting warehouse. Column descriptions, grain documentation, valid value ranges, lineage, and ownership are maintained with the same rigour as the transformations themselves.

  • Semantic layer first. Skills instruct the agent to check compiled metric definitions before writing any SQL. If a question maps to a defined metric, the agent calls a function and gets the same number every other surface in the company produces.
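
The "semantic layer first" rule can be sketched as a lookup that runs before any SQL is written. This is an illustrative sketch only: the `MetricDefinition` class, the `METRICS` and `ALIASES` tables, and the placeholder value are assumptions made for the example, not Anthropic's implementation or any particular semantic-layer product's API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class MetricDefinition:
    name: str
    owner: str
    grain: str
    compute: Callable[[], float]  # stands in for a call into the governed layer


# Hypothetical compiled metric definitions, keyed by canonical name.
METRICS = {
    "weekly_active_users": MetricDefinition(
        name="weekly_active_users",
        owner="growth-analytics",  # hypothetical owning team
        grain="one row per week",
        compute=lambda: 1234.0,  # placeholder value, not real data
    ),
}

# Hypothetical aliases that map everyday phrasing to a canonical metric.
ALIASES = {"wau": "weekly_active_users"}


def resolve_metric(question_term: str) -> Optional[MetricDefinition]:
    """Return the governed metric for a term, or None if nothing is defined."""
    key = question_term.strip().lower()
    canonical = ALIASES.get(key, key.replace(" ", "_"))
    return METRICS.get(canonical)


def answer(question_term: str) -> dict:
    """Prefer the governed metric; only fall through to SQL when none exists."""
    metric = resolve_metric(question_term)
    if metric is not None:
        return {"source": "semantic_layer", "metric": metric.name, "value": metric.compute()}
    return {"source": "needs_sql", "metric": None, "value": None}


print(answer("WAU"))
print(answer("churned logos"))
```

The design choice worth noting is the explicit `needs_sql` fallthrough. The agent cannot silently improvise a metric that already has a governed definition, and questions that miss the layer are visible as candidates for new definitions.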

One negative result is worth as much as the wins. Anthropic tried bootstrapping its semantic layer by having an LLM auto-generate metric definitions from raw tables and query logs. The output looked plausible but encoded the very ambiguities the layer was meant to eliminate, and it scored net-negative on evals against a smaller, human-curated layer. The lesson for any team tempted to shortcut the curation step: write the definitions yourself.

What the numbers look like for an Australian data team

A mid-level data analyst in Sydney or Melbourne costs around $110,000 a year fully loaded. In most businesses a large share of that time goes to repetitive query requests rather than analysis. If 40% of an analyst's week is rote retrieval work, that is roughly $44,000 of capacity per analyst per year, or about $220,000 across a five-person team. The Anthropic pattern suggests that capacity can move to Claude once the context and verification layers are built properly.
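
The arithmetic behind those figures is simple enough to rerun with your own inputs. The sketch below uses the numbers from the paragraph above; the variable names are just labels chosen for the example, and the 40% share is the article's illustrative assumption rather than a measured value.

```python
FULLY_LOADED_COST = 110_000  # dollars per analyst per year, from the text
ROTE_SHARE_PCT = 40          # share of the week spent on rote retrieval
TEAM_SIZE = 5                # analysts on the team

# Integer arithmetic keeps the dollar figures exact.
per_analyst = FULLY_LOADED_COST * ROTE_SHARE_PCT // 100
per_team = per_analyst * TEAM_SIZE

print(f"${per_analyst:,} per analyst, ${per_team:,} per team")
# $44,000 per analyst, $220,000 per team
```

Swapping in your own salary band, rote-work share, and headcount gives a rough upper bound on the capacity at stake, not a forecast of savings.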

The accuracy discipline matters more here than in most markets. An APRA-regulated lender or insurer putting agent-produced figures into a board pack, or any Australian business letting an agent query customer data subject to the Privacy Act, cannot accept answers that merely look precise. Verification in the loop is what separates a useful tool from a liability. Practical first steps:

  • Do not start by giving an agent warehouse access. Start by writing down the table definitions, metric definitions, and gotchas your analysts carry in their heads.

  • Treat Claude Skills as the delivery mechanism for that context, one well-scoped skill per analytics domain. Anthropic builds most of its analytics skills from a single template.

  • Build verification into the loop so wrong answers fail loudly instead of passing as precise.

  • Measure accuracy in aggregate on a fixed evaluation set before widening access beyond the data team.
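
The last two steps can be sketched as a small evaluation harness. Everything here is hypothetical: the questions, the expected values, the `stub_agent` stand-in, and the 0.95 gate are assumptions chosen for illustration, and a real harness would call your actual agent against answers your analysts have verified.

```python
import math

# Hypothetical fixed evaluation set: questions with analyst-verified answers.
EVAL_SET = [
    {"question": "weekly active users, week of 2026-05-25", "expected": 1200.0},
    {"question": "new signups, May 2026", "expected": 340.0},
    {"question": "paid conversion rate, May 2026", "expected": 0.082},
    {"question": "churned accounts, May 2026", "expected": 17.0},
]


def stub_agent(question):
    """Stand-in for the real agent; returns None when it cannot answer."""
    canned = {
        "weekly active users, week of 2026-05-25": 1200.0,
        "new signups, May 2026": 340.0,
        "paid conversion rate, May 2026": 0.082,
        "churned accounts, May 2026": 21.0,  # deliberately wrong
    }
    return canned.get(question)


def score_agent(eval_set, agent, rel_tol=1e-3):
    """Return aggregate accuracy and the list of questions that failed."""
    failures = []
    for case in eval_set:
        got = agent(case["question"])
        # A missing answer counts as a failure, so gaps fail loudly.
        if got is None or not math.isclose(got, case["expected"], rel_tol=rel_tol):
            failures.append(case["question"])
    accuracy = 1 - len(failures) / len(eval_set)
    return accuracy, failures


ACCURACY_GATE = 0.95  # assumed threshold; choose your own

accuracy, failures = score_agent(EVAL_SET, stub_agent)
ready_to_widen = accuracy >= ACCURACY_GATE
print(f"accuracy={accuracy:.0%}, failures={failures}, widen access={ready_to_widen}")
```

Keeping the evaluation set fixed is what makes the accuracy figure comparable from one run to the next. The list of failed questions tells you which definitions or skills to fix before rerunning.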

Build the boring layers first

The distance between a demo that answers one question and a stack that holds 95% accuracy across a business is not model capability. It is design work: context engineering, skill structure, governed data foundations, and a verification loop. That work is exactly where most Australian teams stall, because it sits between data engineering and AI engineering and belongs to neither team by default.

Automata AI builds Claude-powered analytics workflows for Australian businesses, from the semantic layer up to the skills your team actually queries. If your analysts are still fielding the same ten questions every week, book a short brainstorm with us and we will map which of them Claude should be answering.
