06-reference

book adwd ch5 modeling star schemas 2026 04 13

2026-04-13 · reference · source: Agile Data Warehouse Design (book) · by Lawrence Corr, Jim Stagnitto

Chapter 5: Modeling Star Schemas — Corr & Stagnitto

Why this is in the vault

This is the key chapter for data quality in the ADWD methodology. Corr frames data profiling explicitly as test-driven design — profile the source before you build the schema, and the profile results become your test suite. The severity ranking table (Table 5-2) is a pre-built triage system for data issues. The chapter also introduces ETL audit attributes as a standard pattern and demands that teams identify the System of Record before profiling begins. All of these map directly onto Ray Data Co’s testing framework.

Key concepts

Agile data profiling as TDD

Corr positions profiling as the DW equivalent of test-first development. You examine candidate data sources for prioritized events and dimensions before writing any DDL or ETL. The profiling results annotate the BEAM* business model so stakeholders can review data realities alongside their requirements. If no source exists yet (proactive design), the BEAM* model becomes a specification the operational team builds toward.

Profiling techniques

Three core checks, all executable with basic SQL:

  1. Missing values — count NULLs and blanks per column, calculate percentage missing. Focus on columns stakeholders marked mandatory (MD).
  2. Uniqueness and frequency — count distinct values and their distribution. 100% unique suggests a candidate business key; progressive drops in uniqueness reveal hierarchy levels. Frequency graphs expose “empty” values (spaces, default dates) that aren’t technically NULL.
  3. Data ranges and lengths — min/max/avg for numerics and dates, shortest/longest strings. Spots outliers, helps define warehouse data types, and reveals quality drift when grouped by time period.
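All three checks can be sketched as one SQL pass per column. The snippet below uses Python's built-in sqlite3 as a stand-in for an offline source snapshot; the `customer` table, its columns, and the sample values are invented for illustration, not from the book:

```python
import sqlite3

# Toy snapshot (illustrative table/column names, not Corr's).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (customer_id TEXT, email TEXT, signup_date TEXT);
INSERT INTO customer VALUES
  ('C1', 'a@x.com', '2024-01-05'),
  ('C2', NULL,      '2024-02-10'),
  ('C2', '',        '2024-02-11');
""")

def profile_column(conn, table, column):
    """Run the three basic profiling checks on one column."""
    total, missing, distinct_vals, lo, hi = conn.execute(f"""
        SELECT COUNT(*),
               SUM(CASE WHEN {column} IS NULL OR TRIM({column}) = ''
                        THEN 1 ELSE 0 END),       -- NULLs *and* blanks
               COUNT(DISTINCT {column}),          -- uniqueness / BK candidacy
               MIN({column}), MAX({column})       -- range check
        FROM {table}
    """).fetchone()
    return {
        "pct_missing": 100.0 * missing / total,
        "distinct": distinct_vals,
        "unique": distinct_vals == total,  # 100% unique -> candidate business key
        "min": lo, "max": hi,
    }

print(profile_column(conn, "customer", "customer_id"))
print(profile_column(conn, "customer", "email"))
```

Note that the blank-string check catches the "empty but not NULL" values Corr's frequency graphs are meant to expose.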

Corr warns: always profile against offline snapshots, never against live transactional systems.

Annotated models and severity ranking

Profiling results are recorded directly on the BEAM* tables using a notation system: {source} for data lineage, strikethrough for unavailable data, MD for missing mandatory values, NN for unexpectedly not-null columns, and italics for discovered bonus attributes.

Table 5-2 — Data source issues ranked by severity (1 = highest, 12 = lowest):

| Severity | Issue | Outcome |
| --- | --- | --- |
| 1 | Missing conformed dimension | Stop |
| 1 | Missing event | Stop |
| 3 | Missing or incorrect business key | Stop |
| 3 | Conflicting data for conformed dimension | Stop |
| 5 | Event granularity mismatch | Stop/Pause |
| 6 | Missing non-conformed dimension | Pause |
| 6 | Missing or poorly populated event detail | Pause |
| 8 | Missing mandatory values | Pause |
| 8 | Incorrect hierarchical relationship | Pause |
| 10 | Missing or poorly populated dimensional attribute | Go |
| 10 | Mismatched detail and attribute values | Go |
| 12 | Additional event details or dimensional attributes | Go |

Outcomes: Stop = rethink or reprioritize; Pause = resolve before building physical schema; Go = proceed but flag the gap.
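Table 5-2 is mechanical enough to encode as a lookup. The issue names below are taken verbatim from the table; the dict structure and the `triage` helper are our own convenience, not Corr's:

```python
# Table 5-2 as a machine-readable triage lookup: issue -> (severity, outcome).
SEVERITY_TABLE = {
    "missing conformed dimension":                        (1,  "Stop"),
    "missing event":                                      (1,  "Stop"),
    "missing or incorrect business key":                  (3,  "Stop"),
    "conflicting data for conformed dimension":           (3,  "Stop"),
    "event granularity mismatch":                         (5,  "Stop/Pause"),
    "missing non-conformed dimension":                    (6,  "Pause"),
    "missing or poorly populated event detail":           (6,  "Pause"),
    "missing mandatory values":                           (8,  "Pause"),
    "incorrect hierarchical relationship":                (8,  "Pause"),
    "missing or poorly populated dimensional attribute":  (10, "Go"),
    "mismatched detail and attribute values":             (10, "Go"),
    "additional event details or dimensional attributes": (12, "Go"),
}

def triage(issues):
    """Return the most severe outcome across a list of discovered issues."""
    order = {"Stop": 0, "Stop/Pause": 1, "Pause": 2, "Go": 3}
    outcomes = [SEVERITY_TABLE[i][1] for i in issues]
    return min(outcomes, key=order.__getitem__)

print(triage(["missing mandatory values", "missing event"]))  # -> Stop
```

Seeding a consulting checklist from this dict keeps the deliverable traceable back to the published table.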

System of Record (SoR) identification

Before profiling, identify the authoritative source for each fact and dimension. Corr’s rule: source from the original system, not downstream copies (“a photocopy of a photocopy”). Conformed dimensions are especially tricky — they may span multiple operational systems with no single best source. MDM systems help when available, but often ETL must merge keys and attributes from several SoRs.

Surrogate keys

Business keys from source systems are unreliable as warehouse primary keys — they change, get reused, conflict across systems, and are often composite or alphanumeric. Surrogate keys (integer sequence numbers) solve six problems: insulate from BK changes, unify multiple BKs for conformed dimensions, track SCD history via simple equi-joins, represent missing/N-A dimension values (SK = 0), reduce fact table size, and improve join performance. Reserve SK 0 for the default “Missing” row; use negative SKs for other special values (“Unknown”, “Not Applicable”).
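A minimal sketch of surrogate-key assignment with the reserved values described above. The class shape and the business-key examples are assumptions; only the SK 0 = "Missing" / negative-SK convention comes from the chapter:

```python
# Surrogate keys as integer sequence numbers: SK 0 reserved for the default
# "Missing" row, negative SKs for other special members.
class SurrogateKeyMap:
    SPECIAL = {None: 0, "N/A": -1, "UNKNOWN": -2}

    def __init__(self):
        self._map = {}   # business key -> surrogate key
        self._next = 1   # next SK to hand out

    def lookup(self, business_key):
        """Return the SK for a business key, assigning a new one if unseen."""
        if business_key in self.SPECIAL:
            return self.SPECIAL[business_key]
        if business_key not in self._map:
            self._map[business_key] = self._next
            self._next += 1
        return self._map[business_key]

skm = SurrogateKeyMap()
print([skm.lookup(bk) for bk in ["CUST-9", None, "CUST-9", "N/A", "CUST-3"]])
# -> [1, 0, 1, -1, 2]
```

In a real pipeline the map would live in the dimension table itself; the in-memory dict just shows why SK assignment insulates fact tables from BK churn.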

Slowly changing dimensions

Attribute change behavior is captured during modelstorming using three codes: FV (fixed value, must not change), CV (current value, overwrite on change, i.e. Type 1), and HV (historic value, keep change history, i.e. Type 2).

Corr’s pragmatic advice: default to storing history (HV) because retrofitting Type 2 onto an existing warehouse is extremely expensive. Even if stakeholders say CV now, model HV and hide history in the BI layer until needed.

ETL audit attributes

Every dimension should carry administrative columns for SCD management and auditability: CREATED and UPDATED DATE (with matching BY columns), EFFECTIVE and END DATE for versioning, a CURRENT flag marking the live version, and a MISSING indicator for default rows.
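A sketch of a dimension row carrying these audit columns, plus a Type 2 (HV) change routine that expires the old version and appends a new one. The column names follow the chapter; the Python shape, the `city` attribute, and the 9999-12-31 open-end convention are assumptions:

```python
from dataclasses import dataclass, replace
from datetime import date

@dataclass
class DimRow:
    customer_sk: int
    customer_bk: str
    city: str              # an HV attribute (illustrative)
    effective_date: date
    end_date: date         # open-ended rows use a far-future date
    current: bool
    created_date: date
    updated_date: date

def apply_hv_change(rows, old, new_city, change_date, next_sk):
    """SCD Type 2: close off the current version and append a new one."""
    expired = replace(old, end_date=change_date, current=False,
                      updated_date=change_date)
    rows[rows.index(old)] = expired
    rows.append(DimRow(next_sk, old.customer_bk, new_city,
                       change_date, date(9999, 12, 31), True,
                       change_date, change_date))

rows = [DimRow(1, "CUST-9", "Leeds", date(2024, 1, 1), date(9999, 12, 31),
               True, date(2024, 1, 1), date(2024, 1, 1))]
apply_hv_change(rows, rows[0], "York", date(2026, 4, 1), next_sk=2)
```

Because each version gets its own SK, fact rows loaded before the change keep pointing at the old attribute values via a simple equi-join.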

Star schema design patterns

The chapter walks through converting BEAM* event tables into fact tables.

Physical dimensional matrix

After star schemas are drawn, build a physical dimensional matrix showing which dimensions appear in which fact tables — the implementation-level view of the earlier event matrix. This serves as ongoing documentation of the warehouse architecture.
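The matrix itself is just a fact-to-dimensions mapping, which makes it easy to keep as living documentation. The fact and dimension names below are invented for illustration:

```python
# Physical dimensional matrix: fact table -> dimensions it joins to.
MATRIX = {
    "fact_orders":    ["dim_date", "dim_customer", "dim_product"],
    "fact_shipments": ["dim_date", "dim_customer", "dim_carrier"],
}

def conformed_dimensions(matrix):
    """Dimensions shared by more than one fact table (conformance candidates)."""
    counts = {}
    for dims in matrix.values():
        for d in dims:
            counts[d] = counts.get(d, 0) + 1
    return sorted(d for d, n in counts.items() if n > 1)

print(conformed_dimensions(MATRIX))  # -> ['dim_customer', 'dim_date']
```

Reading the matrix column-wise like this surfaces exactly the conformed dimensions whose absence ranks severity 1 in Table 5-2.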

Mapping against Ray Data Co

Corr’s profiling = Column x Absolute in the Scope x Basis matrix

Corr’s three profiling techniques (missing values, uniqueness, ranges) all operate at the Column scope and use Absolute basis — they check intrinsic properties of individual columns without comparing across tables or over time. This is the bottom-left quadrant of our testing matrix. The framework shows that Corr’s approach is necessary but not sufficient; it doesn’t cover Row-scope checks (referential integrity between tables), Cross-source checks, or Relative-basis checks (drift, trend breaks).

Corr’s severity table = our Stop/Pause/Go tiers

Table 5-2 maps almost directly onto the triage tiers in the data quality framework: severities 1-5 land in Stop, 6-8 in Pause, and 10-12 in Go.

This gives us an external, published validation that the three-tier severity model is a sound pattern — and Corr’s specific issue list can seed a default checklist for the consulting deliverable.

SoR identification = a new prerequisite step

Corr’s insistence on identifying the System of Record before profiling is something our current testing matrix template assumes but doesn’t enforce. This should become an explicit prerequisite step in the framework: before running any Column x Absolute tests, document which source system is authoritative for each entity. Without this, profiling results are meaningless — you might be profiling a stale replica and attributing quality problems to the wrong system.
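One way to make the prerequisite enforceable rather than assumed is a guard that refuses to profile undocumented entities. The registry shape, entity names, and source-system names here are assumptions for illustration:

```python
# SoR registry: entity -> authoritative source system (never a replica).
SOR_REGISTRY = {
    "customer": "crm_prod",
    "order":    "oms_prod",
}

def assert_sor_documented(entities, registry=SOR_REGISTRY):
    """Fail fast if any in-scope entity lacks a documented System of Record."""
    missing = [e for e in entities if e not in registry]
    if missing:
        raise RuntimeError(f"No System of Record documented for: {missing}")
    return {e: registry[e] for e in entities}

# Run this before any Column x Absolute tests are scheduled.
print(assert_sor_documented(["customer", "order"]))
```

Wiring this into the test harness turns "document the SoR first" from a checklist item into a hard gate.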

ETL audit columns as a standard pattern

Corr’s audit attributes (CREATED/UPDATED DATE/BY, EFFECTIVE/END DATE, CURRENT, MISSING) align with what the SeattleDataGuy piece on noisy checks identifies as the metadata layer needed to make quality checks actionable. If dimensions carry audit columns, you can write time-aware quality checks that distinguish “data was always bad” from “data degraded after ETL version X.”
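A sketch of such a time-aware check: bucket the missing-value rate by the CREATED DATE audit column so a sudden jump points at a specific load rather than at the source in general. The row shape and sample data are assumptions:

```python
from collections import defaultdict

def missing_rate_by_load(rows, column, created_col="created_date"):
    """Missing-value rate per load date, keyed off the audit column."""
    buckets = defaultdict(lambda: [0, 0])  # load date -> [missing, total]
    for row in rows:
        bucket = buckets[row[created_col]]
        bucket[1] += 1
        if row.get(column) in (None, ""):
            bucket[0] += 1
    return {d: miss / total for d, (miss, total) in sorted(buckets.items())}

rows = [
    {"created_date": "2026-04-01", "email": "a@x.com"},
    {"created_date": "2026-04-01", "email": "b@x.com"},
    {"created_date": "2026-04-08", "email": None},
    {"created_date": "2026-04-08", "email": ""},
]
print(missing_rate_by_load(rows, "email"))
# -> {'2026-04-01': 0.0, '2026-04-08': 1.0}
```

A flat-but-high rate across all loads says "data was always bad"; a step change after one date says "data degraded after ETL version X".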