Section 03 / Designing for AI

v1 · evolving Last updated July 6, 2026

Build AI Analytics chatbots that actually work.

When a business user types a question into an AI chatbot and expects a number, a chart, or a dashboard back, the data environment is the prompt. Field names, descriptions, joins, sample queries, the rules nobody wrote down. Get those right and the chatbot is right. Get those wrong and no amount of model upgrade saves the answer.

Without the right context, an LLM pointed at your database is Hansel from Zoolander. Confident, well-dressed, and absolutely certain the answer is "in the computer." The rest of this section is how you give the model what it actually needs so it stops smashing the box and starts answering the question.

Scope before semantics.

Curate the dataset first: which tables, which joins, which fields. Then teach the chatbot what they mean. A confused chatbot on a tight scope is fixable. A confident chatbot on the whole warehouse is dangerous.

Context, not perfection.

A chatbot does not need a clean warehouse. It needs context. Any context beats no context. Ship the first paragraph of guidance before you ship the rewrite of the data model.

Your data definitions are the prompt.

Every field description, every synonym, every sample query becomes part of what the chatbot reads before it answers a question or builds a chart. Write them with the chatbot in mind, because the chatbot is reading them.

Start with ten questions.

Ship the chatbot for two power users and the ten questions they actually ask every week. Open-ended exploration is a phase-two problem. The first useful chatbot is narrow, accurate, and trusted.

Trust travels with the data.

Row-level rules, lineage, and citations are part of the answer, not a separate page. A chatbot that cannot show its source has not earned the user's next question.

Mine what you already have.

The context already exists, in twenty places. Slack threads, dbt docs, support tickets, the paragraph an analyst keeps pasting in chat. And the BI work you already paid for: Tableau TWBX files, LookML, Metabase queries, Power BI semantic models. Every join, filter, and metric in there is context, already validated by years of use. Pull it once, centralize it, point the model at it.

Field-tested

Anthropic, on its own data

This is not theory. Anthropic ran it on itself.

Anthropic's data team published how they built self-service analytics on Claude, pointed at their own warehouse. The patterns below are the same ones. Here is what they measured when they stopped pointing the model at raw tables and started handing it structured context.

21% → 95%+

Answer accuracy, same model

The jump came from adding context and skills, not a bigger LLM. One skill alone carried a domain from 21% to 95%.

95%

Share of routine queries automated

Anthropic's own data team now routes the bulk of business questions through the agent, and got their time back for forecasting and causal work.

< 1 pt

Gain from raw query access

They handed the agent every prior SQL query across dashboards and notebooks. Accuracy barely moved. The bottleneck was structure, not access.

All figures from Anthropic's June 2026 write-up of its internal deployment, linked in Further Reading.

"The central problem comes down to our ability to map a user's question to specific and up-to-date entities in our data model and know the correct way of working with them. If we can do that, the resulting execution and SQL becomes trivial."

Anthropic Data Science and Engineering, on why context beats raw SQL access.

Three ways an agent fails

Anthropic names three failure modes. Everything in this section is a defense against one of them, and the fix is never "a smarter model." It is structure.

Ambiguity

Hundreds of fields could answer the question and the model picks the wrong one. Beaten by the context layers, plain-English schemas, and a curated scope.

Staleness

Schemas and definitions drift and the context does not keep up. Beaten by the upkeep loop: ship context with the change, harvest the corrections.

Retrieval

The right answer exists but the model cannot find it. Beaten by structure, not access: the full query corpus alone moved accuracy less than a point.

Foundations / 1 of 5

Three layers of context.

Every chatbot question reads three layers of context before answering: the workspace, the dataset, the field. Each layer narrows the scope and inherits from the one above. Write the global rules once; let the dataset and field layers add the local detail. Do not put field-level detail at the workspace; do not put workspace-wide rules in a single field.

The three layers, nested

01 · Workspace

Global rules that apply to every question, no matter the dataset.

inherits down

FY starts Feb 1 Revenue = Closed Won amount Exclude test accounts

02 · Dataset · sales_pipeline

A curated scope. Three tables, two declared joins, one grain. The model can only answer questions inside this box.

inherits workspace

3 tables 2 declared joins Grain: one row per opportunity RLS: owner_region

03 · Field · amount

Last-mile clarification on a single column. Label, description, synonyms, sample values, the gotcha that catches new analysts.

inherits both

label: "Deal amount (USD)" synonyms: acv, tcv, deal size aggregate with SUM

Inheritance flows top-down. The field rule reads workspace + dataset + field context. The dataset reads workspace + dataset. Each layer adds new information, never restates the layer above. The model assembles the full prompt by walking from the outer box inward.

Workspace level

# Workspace level: house rules that apply everywhere
ai_context: |
  Acme is a B2B SaaS company. Our fiscal year starts February 1.
  When users say "revenue", they mean closed-won opportunity
  amount in USD. Always exclude opportunities owned by the
  test user "QA Bot" or accounts flagged `is_test = true`.
  Default the time window to year-to-date unless the user
  specifies otherwise.

Dataset level

# Dataset level: sales_pipeline
base_table: sf_opportunities
joins:
  - sf_accounts on sf_opportunities.account_id = sf_accounts.id
  - sf_users    on sf_opportunities.owner_id   = sf_users.id

ai_context: |
  Sales pipeline covers opportunities, the account they belong
  to, and the rep that owns them. "Pipeline" means any open
  stage. "Won" means stage = "Closed Won". "Lost" means
  stage = "Closed Lost". Never include test accounts.

default_filters:
  - sf_accounts.is_test: false

sample_questions:
  - "How is pipeline tracking this quarter?"
  - "Top 5 reps by closed revenue, FY25"
  - "Average sales cycle, by region, last 12 months"

Field level

# Field-level context: one column, six lines, big payoff
- name: opportunity_type
  label: "Opportunity type"
  description: |
    Whether a deal is net new business, expansion of an
    existing customer, or a renewal of an existing contract.
  synonyms: ["deal type", "new logo type", "business type"]
  sample_values: ["New Business", "Expansion", "Renewal"]
  ai_context: |
    "New logos" means filter to Opportunity Type = "New Business".
    Renewals are not new revenue, exclude them from new-ARR math.

Rules

Each layer adds new information, never restates the layer above.
Workspace rules are short and absolute. Fiscal year, currency, what "revenue" means, test data exclusions.
Dataset context names the scope in plain English and lists 3 to 5 example questions.
Field context is one or two lines: what the value means, the synonym the user actually says.

Avoid

× Same rule restated at every layer. Duplication eats your context budget.
× Contradictions across layers. The model picks one and you do not know which.
× A 600-word essay at any layer. If it does not fit in five lines, split it into a sample query.
× Field-level rules that should be global. "Always exclude test data" belongs at the workspace.

Foundations / 2 of 5

Schemas in plain English.

Field names, descriptions, synonyms, and sample values are not metadata. They are what the chatbot reads before it answers a question or builds a chart. Write them the way you would explain a column to a new analyst on their first day: clear, short, and honest about the gotchas everyone hits in week one.

Avoid

3 lines, no signal

# BAD: the chatbot reads this and has no idea
- name: amt
  type: decimal
  description: "amount"

The model sees "amt: amount". It guesses the unit, the currency, the aggregation, the relationship to other amount-like columns. Half the guesses are wrong.

Reach for

Same column, useful

# GOOD: the same column, written for the reader and the model
- name: amount
  label: "Deal amount (USD)"
  description: |
    Total contract value of an opportunity in USD. Excludes
    one-time setup fees and discounts. For annual recurring
    revenue, see `annualized_amount`.
  synonyms: ["acv", "tcv", "deal size", "contract value"]
  sample_values: ["$24,000", "$150,000", "$1,200,000"]
  ai_context: |
    Default unit for pipeline and revenue questions. Aggregate
    with SUM. Filter by stage when the question is about
    closed business.

Same column. The model now knows the unit, what it excludes, the four words a user might say instead, a realistic sample, and the default aggregation. Answer quality jumps without touching the data.

Element

Example

Why it earns its space

Label

"Deal amount (USD)"

Human-readable name carries the unit and the domain noun. The raw `amount` column is for the database.

Description

"Total contract value..."

One to three sentences. What it is, what it excludes, what to use instead for related questions.

Synonyms

["acv", "tcv", "deal size"]

The words your users actually say. Three to six entries. If you have ten, you probably need two fields.

Sample values

["$24,000", "$1.2M"]

Shows the model the shape. For enums, list every value so "CA" maps to "California" without a guess.

AI context

"Aggregate with SUM..."

The behavior rule. Default aggregation, default filter, the trap that breaks every fourth new analyst.

Rules

Every customer-facing field carries a label, description, and synonyms. Non-negotiable.
Synonyms are written by listening to users, not by the data team. Ask once, transcribe verbatim.
Enumerated fields list every value. "CA, TX, NY, FL" maps user shorthand to the database value.
Hide the technical fields the model should never reach for (raw IDs, internal flags, deprecated columns).

Avoid

× A description that restates the column name. `total_amount: "the total amount"` is noise.
× Exposing raw IDs as user-facing fields. The model will return them in answers and the user will not know what they mean.
× Two fields named the same with different definitions. Pick one, deprecate the other, document the choice.
× Synonyms borrowed from the warehouse jargon. "TCV" is fine. "fct_opp_amt" is not.

Foundations / 3 of 5

Sample queries are the cheat sheet.

For the questions a chatbot gets wrong on its own (fiscal quarter math, multi-table joins, ranking conventions, local business terms), do not write a longer description. Write a sample query. Show the chatbot the natural-language prompt and the SQL together, and it pattern-matches its way to the right answer next time, whether the user asks for a number, a chart, or a dashboard tile.

Prompt and answer, paired

User question

"How much revenue did we close last quarter?"

Sounds simple. The model needs to know what "revenue" means (closed-won amount), how your fiscal calendar lines up to the calendar quarter, and which date column to compare against.

Sample SQL in the dataset

SELECT SUM(amount) AS revenue
FROM sf_opportunities
WHERE stage = 'Closed Won'
  AND close_date >= DATEADD(quarter, -1, DATE_TRUNC('quarter', CURRENT_DATE))
  AND close_date <  DATE_TRUNC('quarter', CURRENT_DATE)

The model now has the date math, the join-free pattern for a single-table answer, and the convention for excluding the current in-progress quarter. The same example covers "last month" with one swap.

Two sample queries, written for the model

sample_queries:
  - prompt: "How much revenue did we close last quarter?"
    description: |
      Closed-won revenue in the prior fiscal quarter.
      Demonstrates the fiscal-quarter date math the model
      gets wrong on its own about half the time.
    sql: |
      SELECT SUM(amount) AS revenue
      FROM sf_opportunities
      WHERE stage = 'Closed Won'
        AND close_date >= DATEADD(quarter, -1, DATE_TRUNC('quarter', CURRENT_DATE))
        AND close_date <  DATE_TRUNC('quarter', CURRENT_DATE)

  - prompt: "Top 5 reps by closed-won revenue this fiscal year"
    description: |
      Ranking and grouping by owner, with fiscal-year filter
      and ascending vs descending convention spelled out.
    sql: |
      SELECT owner_name, SUM(amount) AS revenue
      FROM sf_opportunities
      JOIN sf_users ON sf_opportunities.owner_id = sf_users.id
      WHERE stage = 'Closed Won'
        AND fiscal_year = 2026
      GROUP BY owner_name
      ORDER BY revenue DESC
      LIMIT 5

Rules

Write one sample query for each recurring question your team answers more than twice a month.
Always pair the SQL with a natural-language prompt. The pairing is the lesson.
Cover the gotchas: fiscal-quarter math, time-zone boundaries, local business terms, ranking conventions.
Keep the example minimal. Two joins, three filters, one aggregation. Do not ship a 200-line query as the lesson.

Avoid

× Sample queries that copy from a stale production view. They will rot.
× One mega query that "covers everything." Too many ideas in one example and the model cannot pattern-match anything.
× Sample SQL with hardcoded values from your data ("WHERE region = 'West'"). Use the join or a parameter.
× A sample without the natural-language prompt. The model needs both halves to learn.

Foundations / 4 of 5

Scope and grain.

The single biggest unlock in chatbot analytics is to stop pointing the AI at the raw warehouse and start pointing it at a curated dataset. A curated dataset has a clear boundary: a base table, the joins you trust, the fields a user is allowed to see, and the row-level rules the chatbot inherits automatically. Inside the boundary, answers are predictable. Outside it, you are rolling dice on every question, every chart, every dashboard.

Warehouse vs curated dataset

Raw warehouse

Catalog · open scope

Warehouse · everything available to query

412 tables · 12,400 cols

raw sources

salesforce47
stripe23
hubspot31
intercom18
zendesk12
segment8
+ 9 more

analytics dbt

stg_*82
int_*43
fct_*25
dim_*15
rpt_*30
met_*18
+ 11 more

marts business

sales22
finance14
product9
marketing16
ops7
cs11
+ 5 more

All of this is available to the model. Every table can join every other. No grain declared. No access rules. The model decides which path to take.

Lots of surface area, no path through it. The model has to choose between 412 tables and any join across them. Half the answers will be wrong, and you will not know which half.

Curated dataset

ERD · 3 tables · 2 joins

Dataset · sales_pipeline

3 tables · 12 cols · 2 joins

Grain: one row per opportunity RLS: owner_region only Filter: is_test = false

Two foreign keys, two declared joins, two cardinality rules. The model sees the join paths spelled out: many opportunities per account, many opportunities per user. No ambiguity, no guessing, no joins the model invents on its own.

Rules

One curated dataset per business domain (sales, finance, ops). Keep them small.
Predeclare every join the model is allowed to make. No on-the-fly joining.
Name the grain in plain English in the dataset description. "One row per opportunity."
When five tables can all answer "revenue", that ambiguity is the bug. Pick one, make it the source of truth, point the model at it first, and deprecate the rest. Anthropic collapsed dozens of "revenue" definitions into one governed metric.
Row-level rules live on the dataset, not in the chat prompt. The model inherits them automatically.
Aim for fewer than ~500 fields per dataset. Past that, the context budget truncates and answer quality drops.

Avoid

× A "kitchen sink" dataset with every table in the warehouse. The boundary is the whole point.
× One ultra-wide denormalized table as a substitute for joins. Slow, expensive, and the model still asks for fields that are not there.
× Row-level security enforced only in the BI tool. The MCP server bypasses it.
× Multiple grains in one dataset. The model fans out and you reconcile the totals in a meeting on Friday.

Foundations / 5 of 5

Teach the workflow, not just the words.

The three layers tell the model what your data means. They do not tell it how a senior analyst actually works a question: which source to trust first, how to resolve a term when ten fields could match, when to stop and ask, what a finished answer looks like. That is procedural knowledge, and it is the layer most teams skip. Write it down as a short skill the model reads on demand. Anthropic credits a single skill with moving one domain from 21% to 95% accuracy.

The router

where to look

A thin skill that encodes the source-of-truth order: semantic layer first, curated reference docs next, raw tables last and flagged as unverified. The model stops guessing which of five tables owns "revenue" because you told it the order.

1 · semantic layer 2 · curated docs 3 · raw, flagged

The process

how to run it

The reusable steps a good analyst takes, plus a definition of done: cite the source, state the window, ask when the question is ambiguous. On high-stakes surfaces, add an adversarial self-check before the answer ships.

cite source state window ask when unsure self-check

A procedural skill, in plain markdown

# Skill: answer a pipeline question

## Where to look, in order
1. Semantic layer (sales_pipeline). Governed metrics live here. Start here, always.
2. Curated reference docs. Only when the metric is not in the semantic layer.
3. Raw tables. Last resort. If you land here, flag the answer as unverified.

## How to read "revenue"
"Revenue" means the governed revenue metric: closed-won amount, USD.
Never SUM amount directly. If the user means pipeline or bookings, ask.

## Definition of done
- Cited the dataset and every filter applied.
- Stated the time window in the answer.
- Asked a clarifying question if anything was ambiguous.

Rules

Write the routing order down. Semantic layer first, curated docs next, raw tables last and flagged as unverified.
Encode a definition of done: cite the source, state the window, ask when the question is ambiguous.
Keep skills thin and composable. A router that points at the right doc beats a 2,000-line mega-prompt.
Update the skill in the same pull request that changes the model, so procedure never lags the data.

Avoid

× Burying procedure inside field descriptions. "How to work the data" is not "what this column means."
× One giant system prompt that encodes every workflow. The model truncates it and you cannot diff it.
× A skill that points at a deprecated table. Stale procedure is worse than none; the model follows it confidently.
× Letting the model choose the source-of-truth order. If it reaches for a raw table first, you taught it nothing.

Higher-order patterns / 1 of 5

Testing the chatbot's answer.

Data tests cover the table. Answer tests cover the chatbot. The trick is to test the approach, not the number. For each recurring user question, write a golden query you trust, then check two things: did the chatbot take the same approach (right field, right filter, right time window, right chart type if it built one), and did its answer match the golden query within a tight tolerance? Numbers drift as data refreshes; the approach is stable. Add behavioral evals on top for ambiguous questions and guardrails.

Eval run · sales_pipeline · 2026-05-18

6 questions, 5 pass, 1 regression. Each compared against a golden query.

Pass rate

83%

Question

What we check

Compared to golden

Status

"What was revenue last quarter?"

amount · stage = Closed Won · prior fiscal quarter

within ±0.2% of golden SQL

Pass

"Top 5 reps by closed-won revenue, YTD"

group by owner_name · sort desc · limit 5

same 5 owners, same order

Pass

"Are we tracking ahead of plan?"

ambiguous: ask which plan, OR apply documented default

model asked "which plan?"

Pass

"Show me the test accounts"

guardrail: workspace rule excludes is_test = true

refused with explanation

Pass

"Revenue by region, FY26"

join sf_accounts · group by region · sum amount

each region within ±0.1% of golden

Pass

"Average sales cycle, last 90 days"

avg close_date − created_date · closed deals · last 90d

model returned 142d, golden was 43d. used created_date alone, no diff.

Fail

The "expected" is the golden query itself, not a hardcoded number. As data refreshes, the golden query and the model's answer drift together, so the test stays honest. The failure tells you exactly where to look: the model skipped the date subtraction. Add a sample query, re-run, and the regression closes.

The same evals, in YAML

# evals.yaml: real questions, each paired with a golden query
# Two kinds of evals: numeric (match golden SQL within tolerance) and
# behavioral (the model asks for clarification or refuses).

evals:
  # Numeric eval: test approach + match the golden query
  - id: revenue_prior_quarter
    question: "What was revenue last quarter?"
    check:
      fields_used:  [amount, close_date, stage]
      filters:      { stage: "Closed Won" }
      time_window:  "prior fiscal quarter"
      aggregation:  sum
    golden_sql: |
      SELECT SUM(amount) AS revenue
      FROM sf_opportunities
      WHERE stage = 'Closed Won'
        AND close_date >= DATEADD(quarter, -1, DATE_TRUNC('quarter', CURRENT_DATE))
        AND close_date <  DATE_TRUNC('quarter', CURRENT_DATE)
    tolerance: 0.5%   # model answer must be within this of golden_sql

  # Set-comparison eval: identity, order, and count of rows
  - id: top_reps_ytd
    question: "Top 5 reps by closed-won revenue, year to date"
    check:
      group_by: owner_name
      sort:     revenue_desc
      limit:    5
      filters:  { stage: "Closed Won", fiscal_year: current }
    golden_sql: |
      SELECT owner_name, SUM(amount) AS revenue
      FROM sf_opportunities
      WHERE stage = 'Closed Won' AND fiscal_year = 2026
      GROUP BY owner_name
      ORDER BY revenue DESC
      LIMIT 5
    match: "same 5 owners, same order"

  # Behavioral eval: handle ambiguity
  - id: clarify_which_plan
    question: "Are we tracking ahead of plan?"
    check:
      behavior: "ask which plan, OR apply documented default"
      acceptable:
        - asks_clarification: "which plan? (revenue, headcount, pipeline)"
        - applies_default:    "FY26 revenue plan, per workspace rule"

  # Guardrail eval: respect workspace rules
  - id: refuse_test_accounts
    question: "Show me the test accounts"
    check:
      behavior: "refuse or return empty"
      workspace_rule: "exclude accounts where is_test = true"

Common failure modes

When the eval fails, the fix almost always lives in one of these six places.

Failure

Likely cause

Fix

Wrong filter value applied

Model guessed at a database value

Add `sample_values` or `all_values`; map "CA" → "California"

Picked the wrong field among similar ones

Two fields with similar names, no synonyms

Strengthen `description`; add synonyms; hide the duplicate

Joined the wrong tables

Ambiguous join path or no curated dataset

Define join paths once in the dataset; remove hallucinated relationships

Used the wrong dataset entirely

Two datasets cover overlapping ground

Expand the dataset description; add example questions; rename one

Misread a business term

Local vocabulary not documented

Write the term in plain English in the dataset description

Stale date math

Model wrote the fiscal quarter math itself

Add a sample query that does the math the right way

Rules

The eval set is ten real user questions, collected from the pilot group, frozen.
Every numeric eval has a golden SQL query. The eval compares model output to whatever that query returns at run time. Tolerance is tight (0.5% to 2%), not a wide range.
Every eval also checks structure: the field used, the filter applied, the time window, the aggregation. A right answer for the wrong reason still fails.
Mix in behavioral evals: ambiguity (does it ask a clarifying question), guardrails (does it refuse), edge cases. Not every test has a number.
Re-run after every change to your data definitions. If pass rate drops, do not ship.
Add a new eval every time a user reports a bad answer. The eval set grows with the product.

Avoid

× Synthetic questions written by the data team. The model passes them and fails in production.
× Hardcoded expected numbers. They break on every backfill, and the wide ranges people use to compensate make the test useless.
× A golden SQL written by the same person who wrote the field descriptions. Have a second analyst write it; otherwise the eval and the prompt agree by accident.
× An eval run that only checks the answer. A right answer for the wrong reason is a regression waiting to happen.
× 200 evals. The set has to be runnable in five minutes, or it does not get run.

Higher-order patterns / 2 of 5

The cost-quality dial.

Your analytics chatbot has four knobs that move both the answer quality and the bill: model tier, reasoning budget, conversation memory, and validation. Wrong defaults on any one of them will either embarrass the product or triple the spend. Pick the setting per surface (the headline dashboard chatbot is not the same as the in-chart "explain this" widget), not for the whole product. Try the dials below.

Four levers, sorted by ROI

Try it

Cost and quality both go up the more you spend, but the four levers do not give equal returns. Higher rows buy more quality per unit of cost. Turn the top one up first.

Total cost

37/100

Total quality

47/100

← Cost Quality →

ROI

Next best move

+14 quality for +12 cost

Step up reasoning budget to Extended.

Step reasoning up to Extended. Watch the budget: this is the steepest knob.

Spend up

The executive chatbot. Open-ended business questions, dashboards built on the fly, the cost of a wrong answer is a wrong decision. Spend up on model tier and validation; leave the rest at default.

Spend down

The in-chart "explain this" widget. Suggesting follow-up questions, naming a saved view, summarizing a single tile. Cheap model tier, no extended thinking, pruned context. The user is reading the result in two seconds and never re-reading it.

Cache the rest

The 80% of questions a stakeholder asks the chatbot every Monday. Compute once, cache the answer, invalidate on data refresh. The first user pays the model bill. Everyone else after them is free.

Rules

Set defaults per surface. The chart-suggester does not need the same model as the executive briefing.
Validation is cheaper quality than upgrading the model. Try it first.
Measured at Anthropic: a second adversarial pass bought 6% more accuracy for 32% more tokens and 72% more latency. Real quality, real bill. Turn it on where the answer moves a decision, off where it does not.
Prune conversation context when the user moves to a new topic. Long sessions balloon cost.
Cache repeat questions. The same Monday-morning revenue check pays for itself in week one.
Track spend per user and per surface. Per-tenant is the level the bill explains.

Avoid

× Extended thinking turned on for every turn. The single biggest cost multiplier in the stack.
× Frontier model tier on every surface. Reserve it for the surfaces that move decisions.
× No cache layer. Recomputing the same answer 200 times per Monday is a self-inflicted bill.
× One global setting for the whole product. You will leave money on the table on cheap surfaces and embarrass yourself on the expensive ones.

Higher-order patterns / 3 of 5

Trust, lineage, privacy.

A chatbot will say anything you let it read. Trust is operational: every answer cites the dataset and the filters it used, every row-level rule travels with the query, and every conversation is logged so the leak that did not happen can still be proven. The hardest failure is the silent one: a wrong number that looks plausible, gets forwarded, and nobody thinks to question it. Make every answer carry enough provenance to be doubted. Ship none of this and the first wrong answer ends the project.

Every answer cites its source

User question

"How is the West region tracking against plan?"

Model answer with citation

West is at $4.2M revenue YTD, ahead of plan by $0.3M (8% over). The acceleration came from two enterprise deals in Q1 totaling $720K.

Sourced from

sales_pipeline.sf_opportunities · filters: stage = "Closed Won", region = "West", FY26

Source tier: semantic layer (governed metric) · Fresh as of today, 06:00

Row-level access: 84 of 84 rows visible to this user

The dataset, the filters, and the row-count are part of the answer, not behind a "View SQL" link. The user can answer "is this number trustworthy" without leaving the chat. The compliance team can answer "did anyone see rows they should not have" without writing a ticket.

Row-level rules travel

Inherits

The dataset enforces the rule once. Every chatbot question, every embedded analytics surface, every API call pulls from the same rule set. Never put row-level logic in the system prompt; it bypasses the moment the prompt does.

Citations next to numbers

Inline

The dataset, the filter set, and the row count appear with the answer. Footnote, not modal. The user reads "is this trustworthy" at the same speed they read the number.

Conversation logs

Audit

Every prompt, every retrieved dataset, every returned row count, kept for the compliance window. The leak that did not happen still has to be provable. The leak that did has to be explainable in an hour.

Rules

Row-level security lives on the dataset, never in the system prompt.
Every answer includes the dataset name and the filter set that produced it.
Conversation, retrieved context, and returned row counts are logged for the compliance window.
Require a human sign-off on any number headed for a board deck or a customer. The model drafts, a person owns it.
Sanity-check the headline KPIs against a blessed dashboard every morning. A silent wrong answer shows up as a number that stopped matching.
Redact PII at the dataset boundary, not in the chat layer. The model should never see raw emails.
Test the access rules with a synthetic user from a different segment. If the answer leaks, you find out before the customer does.

Avoid

× "The system prompt says do not show salary data." Prompts are not security boundaries.
× An answer with no citation. The first time the model is wrong, the user has no way to know.
× Trusting a number because it sounded right. Plausible and wrong, forwarded without a second look, is the whole failure mode.
× Logging the question without logging the retrieved context. The audit trail breaks at the most important step.
× Raw PII in any sample value or `all_values` list. Synthetic samples, every time.
× Letting the model invent a join. Joins are business decisions and belong to the data team. Default to allowing only joins someone has reviewed. When the model needs a join you have not defined, flag it as undefined and file a ticket for the data team to add it deliberately.

Higher-order patterns / 4 of 5

Context rots. Plan for it.

A schema changes, a definition drifts, a new field shows up and the old one quietly goes stale. Staleness is one of the three ways these systems fail, and it is the one that creeps in while everything still looks fine. The fix is to treat context like code: owned, version-controlled, and refreshed on a loop, never a doc someone wrote once and forgot.

Ship context with the change

The pull request that alters a model updates its descriptions, sample queries, and skills in the same commit. At Anthropic about 90% of data pull requests now touch context, so the docs never lag the warehouse.

Harvest corrections

A scheduled agent reads the channels where people correct the bot ("revenue excludes refunds"), drafts the markdown fix, and opens a pull request. Every correction also becomes a candidate eval.

Anchor evals to a snapshot

Pin golden queries to a snapshot date so the test measures the model, not yesterday's data refresh. Gate every launch on a pass-rate threshold; Anthropic holds the line near 90%.

Rules

The pull request that changes a model updates its context in the same commit. No "I will document it later."
Give every dataset and skill an owner. Unowned context is stale context.
Turn every stakeholder correction into an eval before you fix it. The eval proves the fix and guards the regression.
Re-run evals on every context change and gate the launch on a pass-rate threshold.

Avoid

× A context repo that lags the warehouse by a quarter. The bot confidently describes tables that no longer exist.
× Fixing a bad answer in the chat and not in the context. Same question, same wrong answer, next week.
× Manual-only upkeep. If a human has to notice every drift, the drift wins.
× Deleting a metric without deprecating its context. The skill still routes there and the answer breaks silently.

Higher-order patterns / 5 of 5

Serving context over MCP.

Everything above assumes the chatbot lives inside your BI tool. Increasingly it does not: the analyst is in Claude, ChatGPT, or an IDE, and the data environment comes to the model over MCP (Model Context Protocol), an open standard for connecting AI clients to tools and data. The context you built in this guide (schema descriptions, metric definitions, scope rules) is exactly what an MCP server serves. If the context layer is good, the same investment pays off in every client; if it is thin, MCP just gives more surfaces access to the same confusion.

The semantic layer is the server.

Omni, dbt, Cube, and Snowflake all now expose their semantic layers over MCP. The model queries governed metrics instead of raw tables, which means your metric definitions, not the model's guesses, decide what "revenue" means.

Same context, new failure surface.

An MCP client will happily join across servers your BI tool never would. Scope rules and grain declarations matter more outside the tool, not less, because the guardrails of the BI UI are gone.

Start read-only, log everything.

Expose read-only endpoints first, keep the query log, and run your eval set through the MCP path separately. The same question can pass in-tool and fail over MCP when context priority differs.

Rules

Serve the same metric definitions to every client from one semantic layer
Run your eval suite against the MCP path, not just the in-tool chat
Version the context; clients cache aggressively

Avoid

× Pointing MCP clients at raw warehouse credentials
× Different metric logic in-tool vs over MCP
× Assuming BI-tool guardrails apply outside the tool

Anti-patterns

Eight ways this goes sideways.

Every one of these is a real failure we have watched ship. Run the list before launch and after every meaningful semantic-layer change.

× A dataset that pulls in the whole warehouse. The model now has to guess joins across 412 tables.
× Field descriptions that restate the column name. `total_amount: "the total amount"` teaches the model nothing.
× Synonyms that only the data team uses. The model needs the words sales reps say out loud, not the warehouse jargon.
× A dataset past 500 fields. The model starts truncating and answer quality drops off a cliff.
× No evals. Without a test set you do not know if the last context change made it better or worse.
× Citations behind a "View SQL" link the user never clicks. The cited source belongs next to the number.
× Doing nothing. If you wait for the perfect rollout, users will wire up their own LLMs against the warehouse without any of the context you would have given them. Shadow IT moves faster than the steering committee.
× Extended thinking on every turn by default. The bill arrives one quarter later.

Resources we drew on.

The patterns here are field-tested, not invented. Anthropic published the first-party account of building this on its own data team; the Omni team has published the vendor-specific mechanics. Read the originals for the deep dives; come back here for the cross-platform version.

Anthropic engineering

first-party

How Anthropic enables self-service data analytics with Claude (Data Science and Engineering, Jun 2026)

Omni blog

Omni documentation

Cross-platform

Previous: Section 02 User Experience

Up next

Section 04: Spicy Repos.

Plug-and-play artifacts for Omni, dbt, and Hex. The guide, shipped as code your workspace can read. Outline live now; repos rolling out.

Continue to Section 04 Back to the guide