06-reference

dbt semantic layer vs text to sql benchmark

Mon Apr 06 2026 20:00:00 GMT-0400 (Eastern Daylight Time) ·reference ·source: dbt Labs Blog ·by Jason Ganz and Benoit Perigaud (dbt Labs)

dbt Semantic Layer vs Text-to-SQL — 2026 Benchmark

Why this is in the vault

Quantitative benchmark that validates the founder’s Scope × Basis testing framework from a different angle. The semantic layer enforces absolute basis constraints (fixed definitions in an ontology), while text-to-SQL operates without those guardrails and needs relative + temporal checks to catch silent errors. Direct consulting deliverable relevance for the phData/MG engagement.

The benchmark

dbt Labs tested both approaches against an ACME Insurance dataset with 11 questions spanning multiple reasoning levels — from simple lookups to multi-step calculations requiring joins and business logic.

Semantic Layer accuracy: 98.2% with Sonnet 4.6, 100% with GPT-5.3 Codex. The semantic layer compiles queries deterministically — the model selects the right metric definition, and the system generates correct SQL or returns an explicit error. It never produces silently wrong answers.

Text-to-SQL accuracy: 90% with Sonnet 4.6, 84.1% with GPT-5.3 Codex. Raw text-to-SQL has improved dramatically — doubling from 32.7% in 2023 to 64.5% in 2026 — but the failure mode is dangerous. Wrong answers look plausible. There is no structural guarantee that the generated SQL respects business definitions.

Key finding

The authors recommend semantic layer for critical, governed data and text-to-SQL for ad hoc exploration where approximate answers are acceptable. Even minimal modeling effort (three dbt models defining core metrics) improved accuracy for both approaches, reinforcing that the data layer — not the model — determines answer quality.

Mapping against Ray Data Co

This maps directly onto the founder’s Scope × Basis framework. The semantic layer is an absolute basis enforcement mechanism: metric definitions are accounting identities baked into the ontology, and violations surface as errors rather than wrong numbers. Text-to-SQL, by contrast, requires relative checks (does this match the canonical definition?) and temporal checks (did this quarter’s number change retroactively?) because there is no structural guarantee of correctness.

The benchmark also reinforces the Natkins thesis — the data layer does the work. Both approaches use the same underlying models; the accuracy gap is entirely attributable to the structured data layer sitting between the LLM and the warehouse.