“Building a Better Data Agent Benchmark” — Benn Stancil (dbt Labs)
Why this is in the vault
ADE-bench is the most direct adjacent infrastructure to MAC that’s shipped publicly. dbt Labs explicitly names “Analytics Development Environment” (ADE) as their agent surface, ships a forkable benchmark repo, and frames the bottleneck as business context, not model capability — exactly the framing MAC sits inside. Worth filing, cross-linking from MAC anchor draft, and probably worth a fork-experiment to position MAC as the missing instrumentation layer.
The core argument
Synthetic benchmarks that test code reasoning in isolation miss what matters for analytical agents — the bottleneck is whether the agent has the business context (project structure, conventions, prior decisions) to do the right thing, not whether it can write valid SQL. ADE-bench evaluates agents inside realistic dbt projects (staging/intermediate/mart, macros, third-party sources, DuckDB + Snowflake) on real tasks (bug fixes, refactors, model updates, multiple-choice analytical questions), and treats prompt/context variants as a first-class evaluation axis.
Key claims
- The benchmark surface is dbt projects, not isolated SQL prompts. Tasks land inside a real project tree with all the conventions and prior context that implies.
- Task types include: bug fixes, refactors, model creation/update, and multiple-choice analytical questions (“which result answers question X”).
- Scoring is task completion: does the expected model exist, and does its content match the solution key? Coarse pass/fail (see the grading sketch after this list).
- Prompt/context variants are graded as their own axis — vague instructions vs. specific, with-context vs. without — because instruction quality is the load-bearing variable in real agent work.
- Failure mode highlighted: vague instructions cause agents to make unnecessary writes (creating new models that shouldn’t exist, etc.) — exactly the failure mode downstream tests should catch.
- Backends supported: DuckDB + Snowflake.
- The repo is public at github.com/dbt-labs/ade-bench — tasks, scoring harness, and answer keys are forkable.
- No leaderboard or numerical results published yet — this is a launch + design-principles piece, not a results report. The slot is open.
- First shared at Coalesce / dbt Summit 2025; April 2026 blog post is the public expansion.
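Not the real ADE-bench harness (the repo defines its own scoring); this is a minimal Python sketch, assuming a DuckDB backend and hypothetical names (a `main` schema the agent writes into, a `solution` schema holding the answer key), of the coarse grading the claims above describe: model existence, content match against the solution key, and a check for the unnecessary-writes failure mode.

```python
# Sketch only, not ADE-bench's actual harness. Assumes a DuckDB warehouse file,
# an agent-built schema `main`, and a hypothetical `solution` schema with answer keys.
import duckdb

def grade_task(db_path: str, expected_model: str, preexisting_models: set[str]) -> dict:
    con = duckdb.connect(db_path, read_only=True)

    # 1. Existence: did the agent build the expected model at all?
    built = {r[0] for r in con.execute(
        "SELECT table_name FROM information_schema.tables "
        "WHERE table_schema = 'main'").fetchall()}
    exists = expected_model in built

    # 2. Content match: symmetric diff against the solution key (all-or-nothing).
    content_match = False
    if exists:
        mismatched_rows = con.execute(
            f"SELECT count(*) FROM ("
            f"  (SELECT * FROM main.{expected_model} EXCEPT SELECT * FROM solution.{expected_model})"
            f"  UNION ALL "
            f"  (SELECT * FROM solution.{expected_model} EXCEPT SELECT * FROM main.{expected_model})"
            f") AS diff").fetchone()[0]
        content_match = mismatched_rows == 0

    # 3. Unnecessary writes: new models outside the task (the vague-instruction failure mode).
    unexpected = built - preexisting_models - {expected_model}

    return {
        "pass": exists and content_match and not unexpected,
        "exists": exists,
        "content_match": content_match,
        "unexpected_models": sorted(unexpected),
    }
```

Coarse by design: one boolean per task, which is exactly the surface the mapping section below argues MAC should extend into graded coverage.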
⚠️ Sponsorship
Not sponsored. This is dbt Labs publishing on their own corporate blog about their own benchmark, not third-party paid content. Author bias: dbt is selling the entire ADE surface that ADE-bench was built to evaluate, so the framing will favor agents-inside-dbt-projects as the right unit of evaluation. Worth flagging when citing.
Mapping against Ray Data Co
Direct adjacency to MAC, not competition. Two different layers:
- ADE-bench scores agent task completion (did the agent fix the bug, build the model, answer the question)
- MAC’s Scope × Basis matrix scores output correctness via test coverage (does the resulting model pass row-count, uniqueness, referential, freshness tests at the appropriate scope)
ADE-bench’s current grading (“verify model existence + compare table content to solution key”) is a coarse pass/fail. MAC’s matrix is the missing instrumentation layer that turns ADE-bench task completions into graded coverage — does the agent’s model pass the test surface a senior data engineer would have written?
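For contrast, a minimal sketch of the test surface the matrix implies, run against the same DuckDB backend; table and column names (`dim_customers`, `fct_orders`, `customer_id`, `updated_at`) are illustrative assumptions, not pulled from ADE-bench or the MAC skills.

```python
# Sketch only: illustrative MAC-style checks, not the real /audit-model or /generate-tests
# output. All table and column names are assumptions for the example.
import duckdb

def run_test_surface(db_path: str) -> dict[str, bool]:
    con = duckdb.connect(db_path, read_only=True)

    def scalar(sql: str):
        return con.execute(sql).fetchone()[0]

    return {
        # Row count: the model should not be empty.
        "row_count_nonzero": scalar(
            "SELECT count(*) FROM dim_customers") > 0,
        # Uniqueness: the primary key has no duplicates.
        "customer_id_unique": scalar(
            "SELECT count(*) - count(DISTINCT customer_id) FROM dim_customers") == 0,
        # Referential integrity: every order points at a known customer.
        "orders_reference_customers": scalar(
            "SELECT count(*) FROM fct_orders o "
            "LEFT JOIN dim_customers c ON o.customer_id = c.customer_id "
            "WHERE c.customer_id IS NULL") == 0,
        # Freshness: newest record is recent (assumes an updated_at timestamp column).
        "fresh_within_2_days": bool(scalar(
            "SELECT max(updated_at) >= now() - INTERVAL 2 DAY FROM dim_customers")),
    }
```

The output is a coverage vector per run rather than a single bit, which is the grading-layer difference this note is describing.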
Implications for RDCO:
- MAC positioning anchor: We can position MAC as “what you grade ADE-bench runs with.” dbt’s benchmark gives us legitimate, citable infrastructure to anchor MAC content against. The /audit-model + /generate-tests skills produce exactly the test surface ADE-bench is missing.
- Anchor evidence for the agent-deployer thesis: dbt has publicly committed to “ADE” as their agent surface naming and shipped benchmark infrastructure. They’re now a named agent-deployer, not just hosting LLM access. This reinforces the 2026-04-30-rdco-thesis-targeting-systems-feedback-loops cluster.
- Sanity Check angle (latent): “The benchmark dbt shipped grades the wrong half of agent data work” — a contrarian frame that ONLY works if we fork the repo, run sample tasks, and document specific cases where MAC tests catch failures ADE-bench misses. Don’t pitch the article without the experiment.
- Stancil-as-author signal: Benn Stancil writes Mode/Stratechery-tier essays and is now embedded with dbt Labs (post-ThoughtSpot). His byline gives the piece more authorial weight than typical dbt blog content — track him as an author candidate for the vault.
Open follow-ups
- Fork experiment: clone dbt-labs/ade-bench, run 1-2 sample tasks, document the gap where MAC's test surface would catch failures ADE-bench currently grades as pass. Queued as Notion task.
- MAC anchor draft cross-link: when the MAC anchor article gets a writing pass, cite ADE-bench as adjacent infrastructure + position MAC as the grading layer above it.
- Watch ADE-bench leaderboard: when results land, anchor numerical comparisons.
- Track Benn Stancil: candidate for the X follow-forward / author-watch list. His writing is high-density and he’s now in the dbt orbit.
Related
- 2026-04-30-rdco-thesis-targeting-systems-feedback-loops
- 2026-04-30-rdco-bet-architecture-playbook
- 2026-01-09-trevin-chow-agent-orchestration-thesis
- ~/rdco-vault/01-projects/data-quality-framework/ — MAC project folder (likely STRATEGY.md target for v1.5)