Skip to content
The Spicy field guide
Section 03 / Designing for AI
v1 · evolving

Build AI Analytics chatbots that actually work.

When a business user types a question into an AI chatbot and expects a number, a chart, or a dashboard back, the data environment is the prompt. Field names, descriptions, joins, sample queries, the rules nobody wrote down. Get those right and the chatbot is right. Get those wrong and no amount of model upgrade saves the answer.

Without the right context, an LLM pointed at your database is Hansel from Zoolander. Confident, well-dressed, and absolutely certain the answer is "in the computer." The rest of this section is how you give the model what it actually needs so it stops smashing the box and starts answering the question.

Scope before semantics.

Curate the dataset first: which tables, which joins, which fields. Then teach the chatbot what they mean. A confused chatbot on a tight scope is fixable. A confident chatbot on the whole warehouse is dangerous.

Context, not perfection.

A chatbot does not need a clean warehouse. It needs context. Any context beats no context. Ship the first paragraph of guidance before you ship the rewrite of the data model.

Your data definitions are the prompt.

Every field description, every synonym, every sample query becomes part of what the chatbot reads before it answers a question or builds a chart. Write them with the chatbot in mind, because the chatbot is reading them.

Start with ten questions.

Ship the chatbot for two power users and the ten questions they actually ask every week. Open-ended exploration is a phase-two problem. The first useful chatbot is narrow, accurate, and trusted.

Trust travels with the data.

Row-level rules, lineage, and citations are part of the answer, not a separate page. A chatbot that cannot show its source has not earned the user's next question.

Mine what you already have.

The context already exists, in twenty places. Slack threads, dbt docs, support tickets, the paragraph an analyst keeps pasting in chat. And the BI work you already paid for: Tableau TWBX files, LookML, Metabase queries, Power BI semantic models. Every join, filter, and metric in there is context, already validated by years of use. Pull it once, centralize it, point the model at it.


Foundations / 1 of 4

Three layers of context.

Every chatbot question reads three layers of context before answering: the workspace, the dataset, the field. Each layer narrows the scope and inherits from the one above. Write the global rules once; let the dataset and field layers add the local detail. Do not put field-level detail at the workspace; do not put workspace-wide rules in a single field.

The three layers, nested
01 · Workspace

Global rules that apply to every question, no matter the dataset.

inherits down
FY starts Feb 1 Revenue = Closed Won amount Exclude test accounts
02 · Dataset · sales_pipeline

A curated scope. Three tables, two declared joins, one grain. The model can only answer questions inside this box.

inherits workspace
3 tables 2 declared joins Grain: one row per opportunity RLS: owner_region
03 · Field · amount

Last-mile clarification on a single column. Label, description, synonyms, sample values, the gotcha that catches new analysts.

inherits both
label: "Deal amount (USD)" synonyms: acv, tcv, deal size aggregate with SUM

Inheritance flows top-down. The field rule reads workspace + dataset + field context. The dataset reads workspace + dataset. Each layer adds new information, never restates the layer above. The model assembles the full prompt by walking from the outer box inward.

Workspace level
# Workspace level: house rules that apply everywhere
ai_context: |
  Acme is a B2B SaaS company. Our fiscal year starts February 1.
  When users say "revenue", they mean closed-won opportunity
  amount in USD. Always exclude opportunities owned by the
  test user "QA Bot" or accounts flagged `is_test = true`.
  Default the time window to year-to-date unless the user
  specifies otherwise.
Dataset level
# Dataset level: sales_pipeline
base_table: sf_opportunities
joins:
  - sf_accounts on sf_opportunities.account_id = sf_accounts.id
  - sf_users    on sf_opportunities.owner_id   = sf_users.id

ai_context: |
  Sales pipeline covers opportunities, the account they belong
  to, and the rep that owns them. "Pipeline" means any open
  stage. "Won" means stage = "Closed Won". "Lost" means
  stage = "Closed Lost". Never include test accounts.

default_filters:
  - sf_accounts.is_test: false

sample_questions:
  - "How is pipeline tracking this quarter?"
  - "Top 5 reps by closed revenue, FY25"
  - "Average sales cycle, by region, last 12 months"
Field level
# Field-level context: one column, six lines, big payoff
- name: opportunity_type
  label: "Opportunity type"
  description: |
    Whether a deal is net new business, expansion of an
    existing customer, or a renewal of an existing contract.
  synonyms: ["deal type", "new logo type", "business type"]
  sample_values: ["New Business", "Expansion", "Renewal"]
  ai_context: |
    "New logos" means filter to Opportunity Type = "New Business".
    Renewals are not new revenue, exclude them from new-ARR math.
Rules
  • Each layer adds new information, never restates the layer above.
  • Workspace rules are short and absolute. Fiscal year, currency, what "revenue" means, test data exclusions.
  • Dataset context names the scope in plain English and lists 3 to 5 example questions.
  • Field context is one or two lines: what the value means, the synonym the user actually says.
Avoid
  • × Same rule restated at every layer. Duplication eats your context budget.
  • × Contradictions across layers. The model picks one and you do not know which.
  • × A 600-word essay at any layer. If it does not fit in five lines, split it into a sample query.
  • × Field-level rules that should be global. "Always exclude test data" belongs at the workspace.

Foundations / 2 of 4

Schemas in plain English.

Field names, descriptions, synonyms, and sample values are not metadata. They are what the chatbot reads before it answers a question or builds a chart. Write them the way you would explain a column to a new analyst on their first day: clear, short, and honest about the gotchas everyone hits in week one.

Avoid
3 lines, no signal
# BAD: the chatbot reads this and has no idea
- name: amt
  type: decimal
  description: "amount"

The model sees "amt: amount". It guesses the unit, the currency, the aggregation, the relationship to other amount-like columns. Half the guesses are wrong.

Reach for
Same column, useful
# GOOD: the same column, written for the reader and the model
- name: amount
  label: "Deal amount (USD)"
  description: |
    Total contract value of an opportunity in USD. Excludes
    one-time setup fees and discounts. For annual recurring
    revenue, see `annualized_amount`.
  synonyms: ["acv", "tcv", "deal size", "contract value"]
  sample_values: ["$24,000", "$150,000", "$1,200,000"]
  ai_context: |
    Default unit for pipeline and revenue questions. Aggregate
    with SUM. Filter by stage when the question is about
    closed business.

Same column. The model now knows the unit, what it excludes, the four words a user might say instead, a realistic sample, and the default aggregation. Answer quality jumps without touching the data.

Label
"Deal amount (USD)"
Human-readable name carries the unit and the domain noun. The raw `amount` column is for the database.
Description
"Total contract value..."
One to three sentences. What it is, what it excludes, what to use instead for related questions.
Synonyms
["acv", "tcv", "deal size"]
The words your users actually say. Three to six entries. If you have ten, you probably need two fields.
Sample values
["$24,000", "$1.2M"]
Shows the model the shape. For enums, list every value so "CA" maps to "California" without a guess.
AI context
"Aggregate with SUM..."
The behavior rule. Default aggregation, default filter, the trap that breaks every fourth new analyst.
Rules
  • Every customer-facing field carries a label, description, and synonyms. Non-negotiable.
  • Synonyms are written by listening to users, not by the data team. Ask once, transcribe verbatim.
  • Enumerated fields list every value. "CA, TX, NY, FL" maps user shorthand to the database value.
  • Hide the technical fields the model should never reach for (raw IDs, internal flags, deprecated columns).
Avoid
  • × A description that restates the column name. `total_amount: "the total amount"` is noise.
  • × Exposing raw IDs as user-facing fields. The model will return them in answers and the user will not know what they mean.
  • × Two fields named the same with different definitions. Pick one, deprecate the other, document the choice.
  • × Synonyms borrowed from the warehouse jargon. "TCV" is fine. "fct_opp_amt" is not.

Foundations / 3 of 4

Sample queries are the cheat sheet.

For the questions a chatbot gets wrong on its own (fiscal quarter math, multi-table joins, ranking conventions, local business terms), do not write a longer description. Write a sample query. Show the chatbot the natural-language prompt and the SQL together, and it pattern-matches its way to the right answer next time, whether the user asks for a number, a chart, or a dashboard tile.

Prompt and answer, paired
User question
"How much revenue did we close last quarter?"
Sounds simple. The model needs to know what "revenue" means (closed-won amount), how your fiscal calendar lines up to the calendar quarter, and which date column to compare against.
Sample SQL in the dataset
SELECT SUM(amount) AS revenue
FROM sf_opportunities
WHERE stage = 'Closed Won'
  AND close_date >= DATEADD(quarter, -1, DATE_TRUNC('quarter', CURRENT_DATE))
  AND close_date <  DATE_TRUNC('quarter', CURRENT_DATE)

The model now has the date math, the join-free pattern for a single-table answer, and the convention for excluding the current in-progress quarter. The same example covers "last month" with one swap.

Two sample queries, written for the model
sample_queries:
  - prompt: "How much revenue did we close last quarter?"
    description: |
      Closed-won revenue in the prior fiscal quarter.
      Demonstrates the fiscal-quarter date math the model
      gets wrong on its own about half the time.
    sql: |
      SELECT SUM(amount) AS revenue
      FROM sf_opportunities
      WHERE stage = 'Closed Won'
        AND close_date >= DATEADD(quarter, -1, DATE_TRUNC('quarter', CURRENT_DATE))
        AND close_date <  DATE_TRUNC('quarter', CURRENT_DATE)

  - prompt: "Top 5 reps by closed-won revenue this fiscal year"
    description: |
      Ranking and grouping by owner, with fiscal-year filter
      and ascending vs descending convention spelled out.
    sql: |
      SELECT owner_name, SUM(amount) AS revenue
      FROM sf_opportunities
      JOIN sf_users ON sf_opportunities.owner_id = sf_users.id
      WHERE stage = 'Closed Won'
        AND fiscal_year = 2026
      GROUP BY owner_name
      ORDER BY revenue DESC
      LIMIT 5
Rules
  • Write one sample query for each recurring question your team answers more than twice a month.
  • Always pair the SQL with a natural-language prompt. The pairing is the lesson.
  • Cover the gotchas: fiscal-quarter math, time-zone boundaries, local business terms, ranking conventions.
  • Keep the example minimal. Two joins, three filters, one aggregation. Do not ship a 200-line query as the lesson.
Avoid
  • × Sample queries that copy from a stale production view. They will rot.
  • × One mega query that "covers everything." Too many ideas in one example and the model cannot pattern-match anything.
  • × Sample SQL with hardcoded values from your data ("WHERE region = 'West'"). Use the join or a parameter.
  • × A sample without the natural-language prompt. The model needs both halves to learn.

Foundations / 4 of 4

Scope and grain.

The single biggest unlock in chatbot analytics is to stop pointing the AI at the raw warehouse and start pointing it at a curated dataset. A curated dataset has a clear boundary: a base table, the joins you trust, the fields a user is allowed to see, and the row-level rules the chatbot inherits automatically. Inside the boundary, answers are predictable. Outside it, you are rolling dice on every question, every chart, every dashboard.

Warehouse vs curated dataset
Raw warehouse
Catalog · open scope
Warehouse · everything available to query
412 tables · 12,400 cols
raw sources
  • salesforce47
  • stripe23
  • hubspot31
  • intercom18
  • zendesk12
  • segment8
  • + 9 more
analytics dbt
  • stg_*82
  • int_*43
  • fct_*25
  • dim_*15
  • rpt_*30
  • met_*18
  • + 11 more
marts business
  • sales22
  • finance14
  • product9
  • marketing16
  • ops7
  • cs11
  • + 5 more
All of this is available to the model. Every table can join every other. No grain declared. No access rules. The model decides which path to take.

Lots of surface area, no path through it. The model has to choose between 412 tables and any join across them. Half the answers will be wrong, and you will not know which half.

Curated dataset
ERD · 3 tables · 2 joins
Dataset · sales_pipeline
3 tables · 12 cols · 2 joins
sf_opportunities BASE id PK amount numeric stage text close_date date account_id FK owner_id FK type text sf_accounts id PK name text region text is_test bool sf_users id PK name text region text role text N 1 account_id = id N 1 owner_id = id
Grain: one row per opportunity RLS: owner_region only Filter: is_test = false

Two foreign keys, two declared joins, two cardinality rules. The model sees the join paths spelled out: many opportunities per account, many opportunities per user. No ambiguity, no guessing, no joins the model invents on its own.

Rules
  • One curated dataset per business domain (sales, finance, ops). Keep them small.
  • Predeclare every join the model is allowed to make. No on-the-fly joining.
  • Name the grain in plain English in the dataset description. "One row per opportunity."
  • Row-level rules live on the dataset, not in the chat prompt. The model inherits them automatically.
  • Aim for fewer than ~500 fields per dataset. Past that, the context budget truncates and answer quality drops.
Avoid
  • × A "kitchen sink" dataset with every table in the warehouse. The boundary is the whole point.
  • × One ultra-wide denormalized table as a substitute for joins. Slow, expensive, and the model still asks for fields that are not there.
  • × Row-level security enforced only in the BI tool. The MCP server bypasses it.
  • × Multiple grains in one dataset. The model fans out and you reconcile the totals in a meeting on Friday.

Higher-order patterns / 1 of 3

Testing the chatbot's answer.

Data tests cover the table. Answer tests cover the chatbot. The trick is to test the approach, not the number. For each recurring user question, write a golden query you trust, then check two things: did the chatbot take the same approach (right field, right filter, right time window, right chart type if it built one), and did its answer match the golden query within a tight tolerance? Numbers drift as data refreshes; the approach is stable. Add behavioral evals on top for ambiguous questions and guardrails.

Eval run · sales_pipeline · 2026-05-18
6 questions, 5 pass, 1 regression. Each compared against a golden query.
Pass rate
83%
"What was revenue last quarter?"
amount · stage = Closed Won · prior fiscal quarter
within ±0.2% of golden SQL
Pass
"Top 5 reps by closed-won revenue, YTD"
group by owner_name · sort desc · limit 5
same 5 owners, same order
Pass
"Are we tracking ahead of plan?"
ambiguous: ask which plan, OR apply documented default
model asked "which plan?"
Pass
"Show me the test accounts"
guardrail: workspace rule excludes is_test = true
refused with explanation
Pass
"Revenue by region, FY26"
join sf_accounts · group by region · sum amount
each region within ±0.1% of golden
Pass
"Average sales cycle, last 90 days"
avg close_date − created_date · closed deals · last 90d
model returned 142d, golden was 43d. used created_date alone, no diff.
Fail
The "expected" is the golden query itself, not a hardcoded number. As data refreshes, the golden query and the model's answer drift together, so the test stays honest. The failure tells you exactly where to look: the model skipped the date subtraction. Add a sample query, re-run, and the regression closes.
The same evals, in YAML
# evals.yaml: real questions, each paired with a golden query
# Two kinds of evals: numeric (match golden SQL within tolerance) and
# behavioral (the model asks for clarification or refuses).

evals:
  # Numeric eval: test approach + match the golden query
  - id: revenue_prior_quarter
    question: "What was revenue last quarter?"
    check:
      fields_used:  [amount, close_date, stage]
      filters:      { stage: "Closed Won" }
      time_window:  "prior fiscal quarter"
      aggregation:  sum
    golden_sql: |
      SELECT SUM(amount) AS revenue
      FROM sf_opportunities
      WHERE stage = 'Closed Won'
        AND close_date >= DATEADD(quarter, -1, DATE_TRUNC('quarter', CURRENT_DATE))
        AND close_date <  DATE_TRUNC('quarter', CURRENT_DATE)
    tolerance: 0.5%   # model answer must be within this of golden_sql

  # Set-comparison eval: identity, order, and count of rows
  - id: top_reps_ytd
    question: "Top 5 reps by closed-won revenue, year to date"
    check:
      group_by: owner_name
      sort:     revenue_desc
      limit:    5
      filters:  { stage: "Closed Won", fiscal_year: current }
    golden_sql: |
      SELECT owner_name, SUM(amount) AS revenue
      FROM sf_opportunities
      WHERE stage = 'Closed Won' AND fiscal_year = 2026
      GROUP BY owner_name
      ORDER BY revenue DESC
      LIMIT 5
    match: "same 5 owners, same order"

  # Behavioral eval: handle ambiguity
  - id: clarify_which_plan
    question: "Are we tracking ahead of plan?"
    check:
      behavior: "ask which plan, OR apply documented default"
      acceptable:
        - asks_clarification: "which plan? (revenue, headcount, pipeline)"
        - applies_default:    "FY26 revenue plan, per workspace rule"

  # Guardrail eval: respect workspace rules
  - id: refuse_test_accounts
    question: "Show me the test accounts"
    check:
      behavior: "refuse or return empty"
      workspace_rule: "exclude accounts where is_test = true"
Common failure modes
When the eval fails, the fix almost always lives in one of these six places.
Wrong filter value applied
Model guessed at a database value
Add `sample_values` or `all_values`; map "CA" → "California"
Picked the wrong field among similar ones
Two fields with similar names, no synonyms
Strengthen `description`; add synonyms; hide the duplicate
Joined the wrong tables
Ambiguous join path or no curated dataset
Define join paths once in the dataset; remove hallucinated relationships
Used the wrong dataset entirely
Two datasets cover overlapping ground
Expand the dataset description; add example questions; rename one
Misread a business term
Local vocabulary not documented
Write the term in plain English in the dataset description
Stale date math
Model wrote the fiscal quarter math itself
Add a sample query that does the math the right way
Rules
  • The eval set is ten real user questions, collected from the pilot group, frozen.
  • Every numeric eval has a golden SQL query. The eval compares model output to whatever that query returns at run time. Tolerance is tight (0.5% to 2%), not a wide range.
  • Every eval also checks structure: the field used, the filter applied, the time window, the aggregation. A right answer for the wrong reason still fails.
  • Mix in behavioral evals: ambiguity (does it ask a clarifying question), guardrails (does it refuse), edge cases. Not every test has a number.
  • Re-run after every change to your data definitions. If pass rate drops, do not ship.
  • Add a new eval every time a user reports a bad answer. The eval set grows with the product.
Avoid
  • × Synthetic questions written by the data team. The model passes them and fails in production.
  • × Hardcoded expected numbers. They break on every backfill, and the wide ranges people use to compensate make the test useless.
  • × A golden SQL written by the same person who wrote the field descriptions. Have a second analyst write it; otherwise the eval and the prompt agree by accident.
  • × An eval run that only checks the answer. A right answer for the wrong reason is a regression waiting to happen.
  • × 200 evals. The set has to be runnable in five minutes, or it does not get run.

Higher-order patterns / 2 of 3

The cost-quality dial.

Your analytics chatbot has four knobs that move both the answer quality and the bill: model tier, reasoning budget, context window, and validation. Wrong defaults on any one of them will either embarrass the product or triple the spend. Pick the setting per surface (the headline dashboard chatbot is not the same as the in-chart "explain this" widget), not for the whole product. Try the dials below.

Four levers, sorted by ROI
Try it
Cost and quality both go up the more you spend, but the four levers do not give equal returns. Higher rows buy more quality per unit of cost. Turn the top one up first.
Total cost
37/100
Total quality
47/100
← Cost Quality →
ROI
Next best move
+11 quality for +12 cost
Step up reasoning budget to Extended.

Let the model think longer before answering complex questions.

Spend up

The executive chatbot. Open-ended business questions, dashboards built on the fly, the cost of a wrong answer is a wrong decision. Spend up on model tier and validation; leave the rest at default.

Spend down

The in-chart "explain this" widget. Suggesting follow-up questions, naming a saved view, summarizing a single tile. Cheap model tier, no extended thinking, pruned context. The user is reading the result in two seconds and never re-reading it.

Cache the rest

The 80% of questions a stakeholder asks the chatbot every Monday. Compute once, cache the answer, invalidate on data refresh. The first user pays the model bill. Everyone else after them is free.

Rules
  • Set defaults per surface. The chart-suggester does not need the same model as the executive briefing.
  • Validation is cheaper quality than upgrading the model. Try it first.
  • Prune conversation context when the user moves to a new topic. Long sessions balloon cost.
  • Cache repeat questions. The same Monday-morning revenue check pays for itself in week one.
  • Track spend per user and per surface. Per-tenant is the level the bill explains.
Avoid
  • × Extended thinking turned on for every turn. The single biggest cost multiplier in the stack.
  • × Frontier model tier on every surface. Reserve it for the surfaces that move decisions.
  • × No cache layer. Recomputing the same answer 200 times per Monday is a self-inflicted bill.
  • × One global setting for the whole product. You will leave money on the table on cheap surfaces and embarrass yourself on the expensive ones.

Higher-order patterns / 3 of 3

Trust, lineage, privacy.

A chatbot will say anything you let it read. Trust is operational: every answer cites the dataset and the filters it used, every row-level rule travels with the query, and every conversation is logged so the leak that did not happen can still be proven. Ship none of this and the first wrong answer ends the project.

Every answer cites its source
User question
"How is the West region tracking against plan?"
Model answer with citation

West is at $4.2M revenue YTD, ahead of plan by $0.3M (8% over). The acceleration came from two enterprise deals in Q1 totaling $720K.

Sourced from
sales_pipeline.sf_opportunities · filters: stage = "Closed Won", region = "West", FY26
Row-level access: 84 of 84 rows visible to this user

The dataset, the filters, and the row-count are part of the answer, not behind a "View SQL" link. The user can answer "is this number trustworthy" without leaving the chat. The compliance team can answer "did anyone see rows they should not have" without writing a ticket.

Row-level rules travel
Inherits

The dataset enforces the rule once. Every chatbot question, every embedded analytics surface, every API call pulls from the same rule set. Never put row-level logic in the system prompt; it bypasses the moment the prompt does.

Citations next to numbers
Inline

The dataset, the filter set, and the row count appear with the answer. Footnote, not modal. The user reads "is this trustworthy" at the same speed they read the number.

Conversation logs
Audit

Every prompt, every retrieved dataset, every returned row count, kept for the compliance window. The leak that did not happen still has to be provable. The leak that did has to be explainable in an hour.

Rules
  • Row-level security lives on the dataset, never in the system prompt.
  • Every answer includes the dataset name and the filter set that produced it.
  • Conversation, retrieved context, and returned row counts are logged for the compliance window.
  • Redact PII at the dataset boundary, not in the chat layer. The model should never see raw emails.
  • Test the access rules with a synthetic user from a different segment. If the answer leaks, you find out before the customer does.
Avoid
  • × "The system prompt says do not show salary data." Prompts are not security boundaries.
  • × An answer with no citation. The first time the model is wrong, the user has no way to know.
  • × Logging the question without logging the retrieved context. The audit trail breaks at the most important step.
  • × Raw PII in any sample value or `all_values` list. Synthetic samples, every time.
  • × Letting the model invent a join. Joins are business decisions and belong to the data team. Default to allowing only joins someone has reviewed. When the model needs a join you have not defined, flag it as undefined and file a ticket for the data team to add it deliberately.

Anti-patterns

Eight ways this goes sideways.

Every one of these is a real failure we have watched ship. Run the list before launch and after every meaningful semantic-layer change.

  • × A dataset that pulls in the whole warehouse. The model now has to guess joins across 400 tables.
  • × Field descriptions that restate the column name. `total_amount: "the total amount"` teaches the model nothing.
  • × Synonyms that only the data team uses. The model needs the words sales reps say out loud, not the warehouse jargon.
  • × A dataset past 500 fields. The model starts truncating and answer quality drops off a cliff.
  • × No evals. Without a test set you do not know if the last context change made it better or worse.
  • × Citations behind a "View SQL" link the user never clicks. The cited source belongs next to the number.
  • × Doing nothing. If you wait for the perfect rollout, users will wire up their own LLMs against the warehouse without any of the context you would have given them. Shadow IT moves faster than the steering committee.
  • × Extended thinking on every turn by default. The bill arrives one quarter later.
Working draft

The field is moving fast.

This section evolves every quarter as the tooling does. If a chapter is missing the trade-off that matters to you, or the example that finally made it click, tell us.