Build AI Analytics chatbots that actually work.
When a business user types a question into an AI chatbot and expects a number, a chart, or a dashboard back, the data environment is the prompt. Field names, descriptions, joins, sample queries, the rules nobody wrote down. Get those right and the chatbot is right. Get those wrong and no amount of model upgrade saves the answer.
Without the right context, an LLM pointed at your database is Hansel from Zoolander. Confident, well-dressed, and absolutely certain the answer is "in the computer." The rest of this section is how you give the model what it actually needs so it stops smashing the box and starts answering the question.
Scope before semantics.
Curate the dataset first: which tables, which joins, which fields. Then teach the chatbot what they mean. A confused chatbot on a tight scope is fixable. A confident chatbot on the whole warehouse is dangerous.
Context, not perfection.
A chatbot does not need a clean warehouse. It needs context. Any context beats no context. Ship the first paragraph of guidance before you ship the rewrite of the data model.
Your data definitions are the prompt.
Every field description, every synonym, every sample query becomes part of what the chatbot reads before it answers a question or builds a chart. Write them with the chatbot in mind, because the chatbot is reading them.
Start with ten questions.
Ship the chatbot for two power users and the ten questions they actually ask every week. Open-ended exploration is a phase-two problem. The first useful chatbot is narrow, accurate, and trusted.
Trust travels with the data.
Row-level rules, lineage, and citations are part of the answer, not a separate page. A chatbot that cannot show its source has not earned the user's next question.
Mine what you already have.
The context already exists, in twenty places. Slack threads, dbt docs, support tickets, the paragraph an analyst keeps pasting in chat. And the BI work you already paid for: Tableau TWBX files, LookML, Metabase queries, Power BI semantic models. Every join, filter, and metric in there is context, already validated by years of use. Pull it once, centralize it, point the model at it.
Three layers of context.
Every chatbot question reads three layers of context before answering: the workspace, the dataset, the field. Each layer narrows the scope and inherits from the one above. Write the global rules once; let the dataset and field layers add the local detail. Do not put field-level detail at the workspace; do not put workspace-wide rules in a single field.
Global rules that apply to every question, no matter the dataset.
inherits downA curated scope. Three tables, two declared joins, one grain. The model can only answer questions inside this box.
inherits workspaceLast-mile clarification on a single column. Label, description, synonyms, sample values, the gotcha that catches new analysts.
inherits bothInheritance flows top-down. The field rule reads workspace + dataset + field context. The dataset reads workspace + dataset. Each layer adds new information, never restates the layer above. The model assembles the full prompt by walking from the outer box inward.
# Workspace level: house rules that apply everywhere
ai_context: |
Acme is a B2B SaaS company. Our fiscal year starts February 1.
When users say "revenue", they mean closed-won opportunity
amount in USD. Always exclude opportunities owned by the
test user "QA Bot" or accounts flagged `is_test = true`.
Default the time window to year-to-date unless the user
specifies otherwise. # Dataset level: sales_pipeline
base_table: sf_opportunities
joins:
- sf_accounts on sf_opportunities.account_id = sf_accounts.id
- sf_users on sf_opportunities.owner_id = sf_users.id
ai_context: |
Sales pipeline covers opportunities, the account they belong
to, and the rep that owns them. "Pipeline" means any open
stage. "Won" means stage = "Closed Won". "Lost" means
stage = "Closed Lost". Never include test accounts.
default_filters:
- sf_accounts.is_test: false
sample_questions:
- "How is pipeline tracking this quarter?"
- "Top 5 reps by closed revenue, FY25"
- "Average sales cycle, by region, last 12 months" # Field-level context: one column, six lines, big payoff
- name: opportunity_type
label: "Opportunity type"
description: |
Whether a deal is net new business, expansion of an
existing customer, or a renewal of an existing contract.
synonyms: ["deal type", "new logo type", "business type"]
sample_values: ["New Business", "Expansion", "Renewal"]
ai_context: |
"New logos" means filter to Opportunity Type = "New Business".
Renewals are not new revenue, exclude them from new-ARR math. - Each layer adds new information, never restates the layer above.
- Workspace rules are short and absolute. Fiscal year, currency, what "revenue" means, test data exclusions.
- Dataset context names the scope in plain English and lists 3 to 5 example questions.
- Field context is one or two lines: what the value means, the synonym the user actually says.
- × Same rule restated at every layer. Duplication eats your context budget.
- × Contradictions across layers. The model picks one and you do not know which.
- × A 600-word essay at any layer. If it does not fit in five lines, split it into a sample query.
- × Field-level rules that should be global. "Always exclude test data" belongs at the workspace.
Schemas in plain English.
Field names, descriptions, synonyms, and sample values are not metadata. They are what the chatbot reads before it answers a question or builds a chart. Write them the way you would explain a column to a new analyst on their first day: clear, short, and honest about the gotchas everyone hits in week one.
# BAD: the chatbot reads this and has no idea
- name: amt
type: decimal
description: "amount" The model sees "amt: amount". It guesses the unit, the currency, the aggregation, the relationship to other amount-like columns. Half the guesses are wrong.
# GOOD: the same column, written for the reader and the model
- name: amount
label: "Deal amount (USD)"
description: |
Total contract value of an opportunity in USD. Excludes
one-time setup fees and discounts. For annual recurring
revenue, see `annualized_amount`.
synonyms: ["acv", "tcv", "deal size", "contract value"]
sample_values: ["$24,000", "$150,000", "$1,200,000"]
ai_context: |
Default unit for pipeline and revenue questions. Aggregate
with SUM. Filter by stage when the question is about
closed business. Same column. The model now knows the unit, what it excludes, the four words a user might say instead, a realistic sample, and the default aggregation. Answer quality jumps without touching the data.
- Every customer-facing field carries a label, description, and synonyms. Non-negotiable.
- Synonyms are written by listening to users, not by the data team. Ask once, transcribe verbatim.
- Enumerated fields list every value. "CA, TX, NY, FL" maps user shorthand to the database value.
- Hide the technical fields the model should never reach for (raw IDs, internal flags, deprecated columns).
- × A description that restates the column name. `total_amount: "the total amount"` is noise.
- × Exposing raw IDs as user-facing fields. The model will return them in answers and the user will not know what they mean.
- × Two fields named the same with different definitions. Pick one, deprecate the other, document the choice.
- × Synonyms borrowed from the warehouse jargon. "TCV" is fine. "fct_opp_amt" is not.
Sample queries are the cheat sheet.
For the questions a chatbot gets wrong on its own (fiscal quarter math, multi-table joins, ranking conventions, local business terms), do not write a longer description. Write a sample query. Show the chatbot the natural-language prompt and the SQL together, and it pattern-matches its way to the right answer next time, whether the user asks for a number, a chart, or a dashboard tile.
SELECT SUM(amount) AS revenue
FROM sf_opportunities
WHERE stage = 'Closed Won'
AND close_date >= DATEADD(quarter, -1, DATE_TRUNC('quarter', CURRENT_DATE))
AND close_date < DATE_TRUNC('quarter', CURRENT_DATE) The model now has the date math, the join-free pattern for a single-table answer, and the convention for excluding the current in-progress quarter. The same example covers "last month" with one swap.
sample_queries:
- prompt: "How much revenue did we close last quarter?"
description: |
Closed-won revenue in the prior fiscal quarter.
Demonstrates the fiscal-quarter date math the model
gets wrong on its own about half the time.
sql: |
SELECT SUM(amount) AS revenue
FROM sf_opportunities
WHERE stage = 'Closed Won'
AND close_date >= DATEADD(quarter, -1, DATE_TRUNC('quarter', CURRENT_DATE))
AND close_date < DATE_TRUNC('quarter', CURRENT_DATE)
- prompt: "Top 5 reps by closed-won revenue this fiscal year"
description: |
Ranking and grouping by owner, with fiscal-year filter
and ascending vs descending convention spelled out.
sql: |
SELECT owner_name, SUM(amount) AS revenue
FROM sf_opportunities
JOIN sf_users ON sf_opportunities.owner_id = sf_users.id
WHERE stage = 'Closed Won'
AND fiscal_year = 2026
GROUP BY owner_name
ORDER BY revenue DESC
LIMIT 5 - Write one sample query for each recurring question your team answers more than twice a month.
- Always pair the SQL with a natural-language prompt. The pairing is the lesson.
- Cover the gotchas: fiscal-quarter math, time-zone boundaries, local business terms, ranking conventions.
- Keep the example minimal. Two joins, three filters, one aggregation. Do not ship a 200-line query as the lesson.
- × Sample queries that copy from a stale production view. They will rot.
- × One mega query that "covers everything." Too many ideas in one example and the model cannot pattern-match anything.
- × Sample SQL with hardcoded values from your data ("WHERE region = 'West'"). Use the join or a parameter.
- × A sample without the natural-language prompt. The model needs both halves to learn.
Scope and grain.
The single biggest unlock in chatbot analytics is to stop pointing the AI at the raw warehouse and start pointing it at a curated dataset. A curated dataset has a clear boundary: a base table, the joins you trust, the fields a user is allowed to see, and the row-level rules the chatbot inherits automatically. Inside the boundary, answers are predictable. Outside it, you are rolling dice on every question, every chart, every dashboard.
- salesforce47
- stripe23
- hubspot31
- intercom18
- zendesk12
- segment8
- + 9 more
- stg_*82
- int_*43
- fct_*25
- dim_*15
- rpt_*30
- met_*18
- + 11 more
- sales22
- finance14
- product9
- marketing16
- ops7
- cs11
- + 5 more
Lots of surface area, no path through it. The model has to choose between 412 tables and any join across them. Half the answers will be wrong, and you will not know which half.
Two foreign keys, two declared joins, two cardinality rules. The model sees the join paths spelled out: many opportunities per account, many opportunities per user. No ambiguity, no guessing, no joins the model invents on its own.
- One curated dataset per business domain (sales, finance, ops). Keep them small.
- Predeclare every join the model is allowed to make. No on-the-fly joining.
- Name the grain in plain English in the dataset description. "One row per opportunity."
- Row-level rules live on the dataset, not in the chat prompt. The model inherits them automatically.
- Aim for fewer than ~500 fields per dataset. Past that, the context budget truncates and answer quality drops.
- × A "kitchen sink" dataset with every table in the warehouse. The boundary is the whole point.
- × One ultra-wide denormalized table as a substitute for joins. Slow, expensive, and the model still asks for fields that are not there.
- × Row-level security enforced only in the BI tool. The MCP server bypasses it.
- × Multiple grains in one dataset. The model fans out and you reconcile the totals in a meeting on Friday.
Testing the chatbot's answer.
Data tests cover the table. Answer tests cover the chatbot. The trick is to test the approach, not the number. For each recurring user question, write a golden query you trust, then check two things: did the chatbot take the same approach (right field, right filter, right time window, right chart type if it built one), and did its answer match the golden query within a tight tolerance? Numbers drift as data refreshes; the approach is stable. Add behavioral evals on top for ambiguous questions and guardrails.
# evals.yaml: real questions, each paired with a golden query
# Two kinds of evals: numeric (match golden SQL within tolerance) and
# behavioral (the model asks for clarification or refuses).
evals:
# Numeric eval: test approach + match the golden query
- id: revenue_prior_quarter
question: "What was revenue last quarter?"
check:
fields_used: [amount, close_date, stage]
filters: { stage: "Closed Won" }
time_window: "prior fiscal quarter"
aggregation: sum
golden_sql: |
SELECT SUM(amount) AS revenue
FROM sf_opportunities
WHERE stage = 'Closed Won'
AND close_date >= DATEADD(quarter, -1, DATE_TRUNC('quarter', CURRENT_DATE))
AND close_date < DATE_TRUNC('quarter', CURRENT_DATE)
tolerance: 0.5% # model answer must be within this of golden_sql
# Set-comparison eval: identity, order, and count of rows
- id: top_reps_ytd
question: "Top 5 reps by closed-won revenue, year to date"
check:
group_by: owner_name
sort: revenue_desc
limit: 5
filters: { stage: "Closed Won", fiscal_year: current }
golden_sql: |
SELECT owner_name, SUM(amount) AS revenue
FROM sf_opportunities
WHERE stage = 'Closed Won' AND fiscal_year = 2026
GROUP BY owner_name
ORDER BY revenue DESC
LIMIT 5
match: "same 5 owners, same order"
# Behavioral eval: handle ambiguity
- id: clarify_which_plan
question: "Are we tracking ahead of plan?"
check:
behavior: "ask which plan, OR apply documented default"
acceptable:
- asks_clarification: "which plan? (revenue, headcount, pipeline)"
- applies_default: "FY26 revenue plan, per workspace rule"
# Guardrail eval: respect workspace rules
- id: refuse_test_accounts
question: "Show me the test accounts"
check:
behavior: "refuse or return empty"
workspace_rule: "exclude accounts where is_test = true" - The eval set is ten real user questions, collected from the pilot group, frozen.
- Every numeric eval has a golden SQL query. The eval compares model output to whatever that query returns at run time. Tolerance is tight (0.5% to 2%), not a wide range.
- Every eval also checks structure: the field used, the filter applied, the time window, the aggregation. A right answer for the wrong reason still fails.
- Mix in behavioral evals: ambiguity (does it ask a clarifying question), guardrails (does it refuse), edge cases. Not every test has a number.
- Re-run after every change to your data definitions. If pass rate drops, do not ship.
- Add a new eval every time a user reports a bad answer. The eval set grows with the product.
- × Synthetic questions written by the data team. The model passes them and fails in production.
- × Hardcoded expected numbers. They break on every backfill, and the wide ranges people use to compensate make the test useless.
- × A golden SQL written by the same person who wrote the field descriptions. Have a second analyst write it; otherwise the eval and the prompt agree by accident.
- × An eval run that only checks the answer. A right answer for the wrong reason is a regression waiting to happen.
- × 200 evals. The set has to be runnable in five minutes, or it does not get run.
The cost-quality dial.
Your analytics chatbot has four knobs that move both the answer quality and the bill: model tier, reasoning budget, context window, and validation. Wrong defaults on any one of them will either embarrass the product or triple the spend. Pick the setting per surface (the headline dashboard chatbot is not the same as the in-chart "explain this" widget), not for the whole product. Try the dials below.
Let the model think longer before answering complex questions.
The executive chatbot. Open-ended business questions, dashboards built on the fly, the cost of a wrong answer is a wrong decision. Spend up on model tier and validation; leave the rest at default.
The in-chart "explain this" widget. Suggesting follow-up questions, naming a saved view, summarizing a single tile. Cheap model tier, no extended thinking, pruned context. The user is reading the result in two seconds and never re-reading it.
The 80% of questions a stakeholder asks the chatbot every Monday. Compute once, cache the answer, invalidate on data refresh. The first user pays the model bill. Everyone else after them is free.
- Set defaults per surface. The chart-suggester does not need the same model as the executive briefing.
- Validation is cheaper quality than upgrading the model. Try it first.
- Prune conversation context when the user moves to a new topic. Long sessions balloon cost.
- Cache repeat questions. The same Monday-morning revenue check pays for itself in week one.
- Track spend per user and per surface. Per-tenant is the level the bill explains.
- × Extended thinking turned on for every turn. The single biggest cost multiplier in the stack.
- × Frontier model tier on every surface. Reserve it for the surfaces that move decisions.
- × No cache layer. Recomputing the same answer 200 times per Monday is a self-inflicted bill.
- × One global setting for the whole product. You will leave money on the table on cheap surfaces and embarrass yourself on the expensive ones.
Trust, lineage, privacy.
A chatbot will say anything you let it read. Trust is operational: every answer cites the dataset and the filters it used, every row-level rule travels with the query, and every conversation is logged so the leak that did not happen can still be proven. Ship none of this and the first wrong answer ends the project.
West is at $4.2M revenue YTD, ahead of plan by $0.3M (8% over). The acceleration came from two enterprise deals in Q1 totaling $720K.
The dataset, the filters, and the row-count are part of the answer, not behind a "View SQL" link. The user can answer "is this number trustworthy" without leaving the chat. The compliance team can answer "did anyone see rows they should not have" without writing a ticket.
The dataset enforces the rule once. Every chatbot question, every embedded analytics surface, every API call pulls from the same rule set. Never put row-level logic in the system prompt; it bypasses the moment the prompt does.
The dataset, the filter set, and the row count appear with the answer. Footnote, not modal. The user reads "is this trustworthy" at the same speed they read the number.
Every prompt, every retrieved dataset, every returned row count, kept for the compliance window. The leak that did not happen still has to be provable. The leak that did has to be explainable in an hour.
- Row-level security lives on the dataset, never in the system prompt.
- Every answer includes the dataset name and the filter set that produced it.
- Conversation, retrieved context, and returned row counts are logged for the compliance window.
- Redact PII at the dataset boundary, not in the chat layer. The model should never see raw emails.
- Test the access rules with a synthetic user from a different segment. If the answer leaks, you find out before the customer does.
- × "The system prompt says do not show salary data." Prompts are not security boundaries.
- × An answer with no citation. The first time the model is wrong, the user has no way to know.
- × Logging the question without logging the retrieved context. The audit trail breaks at the most important step.
- × Raw PII in any sample value or `all_values` list. Synthetic samples, every time.
- × Letting the model invent a join. Joins are business decisions and belong to the data team. Default to allowing only joins someone has reviewed. When the model needs a join you have not defined, flag it as undefined and file a ticket for the data team to add it deliberately.
Eight ways this goes sideways.
Every one of these is a real failure we have watched ship. Run the list before launch and after every meaningful semantic-layer change.
- × A dataset that pulls in the whole warehouse. The model now has to guess joins across 400 tables.
- × Field descriptions that restate the column name. `total_amount: "the total amount"` teaches the model nothing.
- × Synonyms that only the data team uses. The model needs the words sales reps say out loud, not the warehouse jargon.
- × A dataset past 500 fields. The model starts truncating and answer quality drops off a cliff.
- × No evals. Without a test set you do not know if the last context change made it better or worse.
- × Citations behind a "View SQL" link the user never clicks. The cited source belongs next to the number.
- × Doing nothing. If you wait for the perfect rollout, users will wire up their own LLMs against the warehouse without any of the context you would have given them. Shadow IT moves faster than the steering committee.
- × Extended thinking on every turn by default. The bill arrives one quarter later.
The field is moving fast.
This section evolves every quarter as the tooling does. If a chapter is missing the trade-off that matters to you, or the example that finally made it click, tell us.