Model Acceptance Criteria: How We Prove Our Data Models Work
TDD changed how we write code. Evals changed how we benchmark LLMs. MAC is how we prove our data models work.
{{REVIEW: hook as specified. If you want a softer opener — a scene from the Progress engagement, say — flag it and I’ll rewrite.}}
I’ve spent the last month on a dbt model called gold_opp_pipeline at a client. It had 14 tests when I inherited it. By the time I was done it had 95. That gap — the 81 tests nobody thought to write — is the reason I’m writing this.
The 81 weren’t edge cases or nice-to-haves. One of them caught a real UAT failure: 3,227 rows where opportunity type and sub-type didn’t align for closed-won deals. The bug had been sitting in production reporting for months. Nobody caught it because nobody had a framework that told them where to look.
That’s what this piece is about.
The problem: your tests are in one cell
Most dbt projects I audit have the same shape of test coverage. Lots of not_null. Lots of unique. Some accepted_values. Maybe a relationship test if the team is disciplined.
Pull up any schema.yml file and count: how many of your tests check a single column in isolation against a fixed rule? In most codebases, it’s north of 90%.
That means every test you have lives in one cell of a much larger space. One corner of a matrix you didn’t know you were supposed to fill.
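To make "one cell" concrete, here's the shape of almost every test in those audits, written out as the SQL a not_null check compiles down to. The table and column names are placeholders, not from any engagement.

```sql
-- Column × Absolute: one field, one fixed rule, no external reference.
-- This is what a schema.yml not_null test compiles down to.
-- Table and column names are placeholders.
SELECT id
FROM my_gold_model
WHERE id IS NULL;
-- expect 0 rows
```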
Here’s the matrix:
|           | Absolute | Rel:Source | Rel:Prod | Rel:Recon | Temporal | Human |
| --------- | -------- | ---------- | -------- | --------- | -------- | ----- |
| Column    | X        | -          | -        | -         | -        | -     |
| Row       | -        | -          | -        | -         | -        | -     |
| Aggregate | -        | -          | -        | -         | -        | -     |
The X is where most teams live. The dashes are where the bugs hide.
The framework: Scope × Basis
Every meaningful test on an analytical model lives at the intersection of two questions.
Scope — where are you evaluating?
- Column — one field, in isolation. `amount > 0`. `stage` is in an accepted list. `opportunity_id` is unique.
- Row — one record, across columns. `start_date < end_date`. New-business opportunities must have positive ACV. Salesforce rows must have `sf_id`; HubSpot rows must have `hs_id`.
- Aggregate — the whole dataset. Accounting identities. Row-count reconciliation. Total pipeline ties to the executive dashboard.
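As a sketch, here's the row-scope version of the date rule above, written as a standalone check. The table name is a placeholder.

```sql
-- Row × Absolute: one record, evaluated across columns.
-- Encodes the start_date < end_date rule; table name is a placeholder.
SELECT id, start_date, end_date
FROM my_gold_model
WHERE start_date >= end_date;
-- expect 0 rows
```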
Most bugs live at the aggregate level. Every row can pass every check and the total can still be wrong. A join fans out. A filter silently drops records. A window function double-counts across partitions.
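An aggregate-scope check is the countermeasure to all three of those failure modes. A minimal sketch, assuming a gold model that should carry exactly one row per source record; both model names are placeholders.

```sql
-- Aggregate × Relative:Source: row-count reconciliation.
-- Catches fanout and silent filtering that row-level checks can't see.
-- Model names are placeholders.
SELECT g.n AS gold_rows, s.n AS source_rows
FROM (SELECT COUNT(*) AS n FROM my_gold_model) g
CROSS JOIN (SELECT COUNT(*) AS n FROM my_silver_source) s
WHERE g.n <> s.n;
-- expect 0 rows; a mismatch means a join fanned out or a filter dropped records
```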
Basis — what are you evaluating against?
- Absolute — a fixed rule. No external reference needed.
- Relative: Source — upstream data. Did the transformation preserve what it was supposed to?
- Relative: Production — the current dashboard, the finance ledger. Does the new model tie to the old truth?
- Relative: Reconciliation — an external system. Stripe. The bank. A vendor export.
- Temporal — history. Did last quarter’s numbers retroactively change? Did MoM swing past a threshold?
- Human sanity check — show the number to someone who knows the business and watch their face.
Three scopes, six bases, eighteen cells. A model isn’t “tested” until you’ve walked every one of them and decided — explicitly — whether it applies, what the check is, or why you’re skipping it.
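Most of those bases map straightforwardly onto SQL; Temporal is the one teams rarely have plumbing for. A sketch, assuming a daily snapshot table of quarterly totals exists — the table, columns, and quarter literal are all placeholders.

```sql
-- Aggregate × Temporal: a closed quarter should never retroactively change.
-- Assumes a snapshot table captured once per day; all names are placeholders.
SELECT t.fiscal_quarter, t.total_arr, y.total_arr AS prior_total
FROM quarterly_totals_snapshot t
JOIN quarterly_totals_snapshot y
  ON y.fiscal_quarter = t.fiscal_quarter
 AND y.snapshot_date  = t.snapshot_date - 1
WHERE t.fiscal_quarter < '2026-Q1'   -- closed quarters only; literal is a placeholder
  AND t.total_arr <> y.total_arr;
-- expect 0 rows: history should be immutable
```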
The prerequisite: System of Record
Before you can write a single Relative:Source check, you have to answer one question: for this source table, what system is authoritative?
This sounds obvious and almost nobody does it. Half the dbt projects I audit have two systems feeding a conformed dimension with no documented merge rule. You can’t test against “the source” when “the source” is ambiguous. That’s a Stop-tier issue, and it blocks the rest of the matrix until it’s resolved.
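The fix is cheap once you make the call: write the precedence down as code. A sketch assuming Salesforce has been declared the System of Record for account names, with HubSpot only filling gaps; all table and column names are placeholders.

```sql
-- System of Record, made executable: Salesforce wins on conflict,
-- HubSpot fills gaps. Table and column names are placeholders.
SELECT
  account_id,
  COALESCE(sf.account_name, hs.account_name) AS account_name  -- SF is authoritative
FROM salesforce_accounts sf
FULL OUTER JOIN hubspot_accounts hs USING (account_id);
```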
Severity: Stop, Pause, Go
Not every failing test should kill the pipeline.
- Stop — block. Data cannot be served. Accounting identities. Broken referential integrity. Missing conformed dimensions.
- Pause — quarantine. Flag for review before serving. Aggregate recons outside tolerance. Row-count swings past threshold.
- Go — log it and keep moving. Cosmetic issues. Optional attributes. Minor distribution shifts.
Severity is also layer-aware. A null in a key field is a Go in Bronze (mirror what the source sent), a Pause in Silver (transformation should have handled it), and a Stop in Gold (consumer-facing). Same issue, three severities. The medallion layer is part of the test design, not an afterthought.
Default mapping by basis: Absolute and Relative:Source default to Stop. Relative:Production, Relative:Recon, and Temporal default to Pause. Human checks default to Go. Override based on business impact — a null in an optional description is still a Go even though it’s Absolute.
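dbt ships only two native severities, `error` and `warn`, so the three tiers don't map one-to-one. One workable approximation, sketched as a singular test: Stop compiles to `error`, Pause to `warn` (which an agent can pick up from the run results), and Go checks live in audit models that only log. The model name and tolerance band below are placeholders.

```sql
-- tests/pause_row_count_swing.sql
-- A Pause-tier check: warn, don't block. A Stop-tier check would use
-- severity = 'error'. Model name and tolerance band are placeholders.
{{ config(severity = 'warn') }}

SELECT COUNT(*) AS row_count
FROM {{ ref('my_gold_model') }}
HAVING COUNT(*) NOT BETWEEN 9000 AND 11000
```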
The anchor: gold_opp_pipeline
Here’s what the matrix looked like on the Progress engagement.
gold_opp_pipeline has roughly 30 columns and one row per opportunity. It pulls from three sources: the core opportunity fact table, a classification table, and the combined transaction table. The model has a dual-path pattern — for closed-won opportunities, projected measures get overridden with actuals from the transaction table. Open pipeline retains projected values from Salesforce.
This is where the matrix earned its keep.
Conventional schema.yml testing treats each column as having a single source. But projected_arr in this model has two sources depending on a condition:
```sql
-- Rel:Source for projected_arr when is_closed_won = FALSE
-- must equal silver_fct_ips_opportunity.projected_arr
SELECT g.opportunity_id, g.projected_arr, s.projected_arr AS source_value
FROM gold_opp_pipeline g
JOIN silver_fct_ips_opportunity s USING (opportunity_id)
WHERE g.is_closed_won = FALSE
  AND g.projected_arr IS DISTINCT FROM s.projected_arr;  -- NULL-safe: <> silently passes NULL mismatches
-- expect 0 rows
```
```sql
-- Rel:Source for projected_arr when is_closed_won = TRUE
-- must equal SUM(silver_fct_combined_transaction.arr) per opportunity
WITH txn AS (
  SELECT opportunity_id, SUM(arr) AS actual_arr
  FROM silver_fct_combined_transaction
  GROUP BY 1
)
SELECT g.opportunity_id, g.projected_arr, t.actual_arr
FROM gold_opp_pipeline g
LEFT JOIN txn t USING (opportunity_id)  -- LEFT JOIN: a closed-won opp with no transactions should also fail
WHERE g.is_closed_won = TRUE
  AND g.projected_arr IS DISTINCT FROM t.actual_arr;
-- expect 0 rows
```
Two tests for what a schema.yml would treat as one column. The matrix forces the question — what’s the source for this cell under this condition? — and the answer reveals the override pattern needs its own coverage.
The aggregate row of the matrix caught something else. A2 checks that closed-won revenue in gold_opp_pipeline reconciles to gold_txn_pipeline. A6 checks for orphan closed-won opportunities that exist in one model but not the other. Cross-model reconciliation at the aggregate level — the kind of check that catches silent fanout from a bad join long before a finance leader sees the wrong number in a board deck.
The R1 UAT failure — 3,227 rows where opportunity type and sub-type didn’t align for closed-won — was a Row × Relative:Production check. The business rule existed in the BRD. No test in the repo encoded it. The matrix has a cell for exactly that, and once we filled it, the bug surfaced on the first run.
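The shape of that check, sketched below. The real list of valid type/sub-type pairings lives in the BRD; the two pairings and the column names here are placeholders.

```sql
-- Row × Relative:Production: type / sub-type alignment for closed-won deals.
-- Valid pairings come from the BRD; these two rows are placeholders.
WITH valid AS (
  SELECT 'New Business' AS opportunity_type, 'New Logo' AS opportunity_sub_type
  UNION ALL
  SELECT 'Renewal', 'Flat Renewal'
)
SELECT g.opportunity_id, g.opportunity_type, g.opportunity_sub_type
FROM gold_opp_pipeline g
LEFT JOIN valid v
  ON v.opportunity_type = g.opportunity_type
 AND v.opportunity_sub_type = g.opportunity_sub_type
WHERE g.is_closed_won = TRUE
  AND v.opportunity_type IS NULL;
-- expect 0 rows; this is the check that surfaced the 3,227-row R1 failure
```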
14 tests to 95. Most of the 81 weren’t exotic. They were the cells we hadn’t thought to walk.
{{REVIEW: do you want me to include the full coverage table from the case study here, or keep the prose tighter? I kept it prose-first. Your call.}}
The agent multiplier
Here’s the part that makes this work now and didn’t five years ago.
95 tests is a lot. If every one of those fires an alert at 3am and a human has to triage, you’ll burn out your on-call rotation in a week. That’s the reason most data teams can’t run this many checks — the alert-to-signal ratio destroys the team before the tests prove their worth.
Agents change the math.
Severity tiers mean every test runs, but only Stops wake a human. Pauses go to an agent that knows the lineage, pulls the row counts from the last seven runs, cross-references the affected models, and writes a triage note. By the time a human looks at it, the question isn’t “what happened?” — it’s “do I accept this explanation?”
And the flip side: agents require this rigor. A model making decisions on your data at 3am doesn't Slack you when a number looks weird. It acts on what you served. More models are being built, and less human review is happening between data and decision. The old testing posture — catch the big stuff, trust the rest — stops being safe.
MAC makes the load bearable for humans and the surface legible for agents. It’s the testing discipline analytical data has always needed and now — finally — has the tooling to deploy.
{{REVIEW: the “agents change the math” framing is my extrapolation of your agentic-world thesis from the relaunch essay. If this overclaims or gets the emphasis wrong, flag it.}}
What to do Monday
Pick your most important gold-layer model. Open its schema.yml. Count the tests. Now draw the 3×6 matrix and mark which cells have at least one check.
If any scope row is empty, you have a gap. If the only basis column you’ve populated is Absolute, you have a bigger gap. If you don’t know the System of Record for one of your sources, you have the biggest gap of all — and none of the Relative:Source tests you write will be trustworthy until you resolve it.
The framework is one page. The discipline is the whole job.
Over the next seven days I’m sending a drip course that walks each piece end-to-end — one axis at a time, with the interview questions to extract the matrix from a stakeholder, and the prompt pattern to hand the filled matrix to an agent and get a dbt test suite out the other side.
{{REVIEW: subscribe CTA — drafting as a drip-course lead magnet. If the magnet is actually the downloadable template instead of the drip, swap the language. I defaulted to drip because the Notion spec leads with it.}}
Subscribe below. Day 1 lands tomorrow: "Your tests are in one cell."
MAC — Model Acceptance Criteria — is the working framework Ray Data Co uses on client engagements. The anchor case study is gold_opp_pipeline on the Mammoth Growth / Progress engagement, April 2026.
{{REVIEW: byline/closing line — match whatever SC relaunch format you’ve settled on. I don’t have a canonical footer from the recent issues.}}