Chapter 5: Modeling Star Schemas — Corr & Stagnitto
Why this is in the vault
This is the key chapter for data quality in the ADWD methodology. Corr frames data profiling explicitly as test-driven design — profile the source before you build the schema, and the profile results become your test suite. The severity ranking table (Table 5-2) is a pre-built triage system for data issues. The chapter also introduces ETL audit attributes as a standard pattern and demands that teams identify the System of Record before profiling begins. All of these map directly onto Ray Data Co’s testing framework.
Key concepts
Agile data profiling as TDD
Corr positions profiling as the DW equivalent of test-first development. You examine candidate data sources for prioritized events and dimensions before writing any DDL or ETL. The profiling results annotate the BEAM* business model so stakeholders can review data realities alongside their requirements. If no source exists yet (proactive design), the BEAM* model becomes a specification the operational team builds toward.
Profiling techniques
Three core checks, all executable with basic SQL (a Python sketch follows after this list):
- Missing values — count NULLs and blanks per column, calculate percentage missing. Focus on columns stakeholders marked mandatory (MD).
- Uniqueness and frequency — count distinct values and their distribution. 100% unique suggests a candidate business key; progressive drops in uniqueness reveal hierarchy levels. Frequency graphs expose “empty” values (spaces, default dates) that aren’t technically NULL.
- Data ranges and lengths — min/max/avg for numerics and dates, shortest/longest strings. Spots outliers, helps define warehouse data types, and reveals quality drift when grouped by time period.
Corr warns: always profile against offline snapshots, never against live transactional systems.
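As a rough illustration of the three checks, here is a pandas sketch rather than SQL, run against a snapshot extract per the warning above; the file, table, and column names are assumptions, not from the book.

```python
# Minimal profiling sketch against an offline snapshot (names are illustrative).
import pandas as pd

def profile_column(df: pd.DataFrame, col: str) -> dict:
    s = df[col]
    n = len(s)
    # Missing values: NULLs plus blank/whitespace-only strings
    blanks = int((s.astype(str).str.strip() == "").sum())
    missing = int(s.isna().sum()) + blanks
    report = {
        "rows": n,
        "missing": missing,
        "pct_missing": round(100 * missing / n, 2) if n else None,
        # Uniqueness and frequency: distinct count plus the top of the distribution
        "distinct": int(s.nunique(dropna=True)),
        "top_values": s.value_counts(dropna=False).head(5).to_dict(),
    }
    # Ranges and lengths: min/max for numerics and dates, string lengths otherwise
    if pd.api.types.is_numeric_dtype(s) or pd.api.types.is_datetime64_any_dtype(s):
        report.update(min=s.min(), max=s.max())
    else:
        lengths = s.dropna().astype(str).str.len()
        if not lengths.empty:
            report.update(shortest=int(lengths.min()), longest=int(lengths.max()))
    return report

snapshot = pd.read_csv("orders_snapshot.csv")   # offline extract, never the live OLTP system
print(profile_column(snapshot, "ORDER_ID"))     # 100% unique suggests a candidate business key
```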
Annotated models and severity ranking
Profiling results are recorded directly on the BEAM* tables using a notation system: {source} for data lineage, strikethrough for unavailable data, MD for missing mandatory values, NN for unexpectedly not-null columns, and italics for discovered bonus attributes.
Table 5-2 — Data source issues ranked by severity (1 = highest, 12 = lowest):
| Severity | Issue | Outcome |
|---|---|---|
| 1 | Missing conformed dimension | Stop |
| 1 | Missing event | Stop |
| 3 | Missing or incorrect business key | Stop |
| 3 | Conflicting data for conformed dimension | Stop |
| 5 | Event granularity mismatch | Stop/Pause |
| 6 | Missing non-conformed dimension | Pause |
| 6 | Missing or poorly populated event detail | Pause |
| 8 | Missing mandatory values | Pause |
| 8 | Incorrect hierarchical relationship | Pause |
| 10 | Missing or poorly populated dimensional attribute | Go |
| 10 | Mismatched detail and attribute values | Go |
| 12 | Additional event details or dimensional attributes | Go |
Outcomes: Stop = rethink or reprioritize; Pause = resolve before building physical schema; Go = proceed but flag the gap.
System of Record (SoR) identification
Before profiling, identify the authoritative source for each fact and dimension. Corr’s rule: source from the original system, not downstream copies (“a photocopy of a photocopy”). Conformed dimensions are especially tricky — they may span multiple operational systems with no single best source. MDM systems help when available, but often ETL must merge keys and attributes from several SoRs.
Surrogate keys
Business keys from source systems are unreliable as warehouse primary keys — they change, get reused, conflict across systems, and are often composite or alphanumeric. Surrogate keys (integer sequence numbers) solve six problems: insulate from BK changes, unify multiple BKs for conformed dimensions, track SCD history via simple equi-joins, represent missing or not-applicable dimension values (SK = 0), reduce fact table size, and improve join performance. Reserve SK 0 for the default “Missing” row; use negative SKs for other special values (“Unknown”, “Not Applicable”).
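A minimal sketch of surrogate key assignment under these conventions, assuming a simple in-memory key map; class and member names are ours, and conformed dimensions would additionally need matching/merge logic across source systems.

```python
# Surrogate key assignment: integer sequence per dimension member, SK 0 reserved
# for "Missing", negative SKs for other special rows (names are illustrative).
SPECIAL_MEMBERS = {0: "Missing", -1: "Unknown", -2: "Not Applicable"}

class SurrogateKeyMap:
    def __init__(self) -> None:
        self._sk_by_bk: dict[str, int] = {}   # business key -> surrogate key
        self._next_sk = 1                     # 0 and negatives are reserved

    def lookup(self, business_key: str | None) -> int:
        if business_key is None or not business_key.strip():
            return 0                          # route missing BKs to the Missing row
        if business_key not in self._sk_by_bk:
            self._sk_by_bk[business_key] = self._next_sk
            self._next_sk += 1
        return self._sk_by_bk[business_key]

keys = SurrogateKeyMap()
print(keys.lookup("CUST-0042"))   # 1
print(keys.lookup("CUST-0042"))   # 1 again: same member, stable SK
print(keys.lookup(None))          # 0: Missing
```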
Slowly changing dimensions
Attribute change behavior is captured during modelstorming using three codes:
- FV (Fixed Value) — corrected in place, no history needed. Implemented as Type 1 SCD.
- CV (Current Value) — overwritten on change, history lost. Also Type 1, but often a reporting directive rather than a storage decision.
- HV (Historic Value) — new row inserted on change, full history preserved. Type 2 SCD, keyed by surrogate to enable simple equi-joins against facts.
Corr’s pragmatic advice: default to storing history (HV) because retrofitting Type 2 onto an existing warehouse is extremely expensive. Even if stakeholders say CV now, model HV and hide history in the BI layer until needed.
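A minimal sketch of the HV (Type 2) change path: expire the current version row and insert a new one stamped with the administrative columns described in the next section. The row shape, helper name, and dates are illustrative, not from the book.

```python
from datetime import date

MAX_DB_DATE = date(9999, 12, 31)   # END_DATE default instead of NULL

def apply_hv_change(current_row: dict, changed_attrs: dict, change_date: date, next_sk: int) -> list[dict]:
    """Expire the current version and return it together with the new version row."""
    expired = {**current_row, "END_DATE": change_date, "CURRENT": "N"}
    new_version = {
        **current_row, **changed_attrs,
        "CUSTOMER_SK": next_sk,              # new surrogate key for the new version row
        "EFFECTIVE_DATE": change_date,
        "END_DATE": MAX_DB_DATE,
        "CURRENT": "Y",
    }
    return [expired, new_version]

current = {"CUSTOMER_SK": 1, "CUSTOMER_ID": "C-42", "CITY": "Leeds",
           "EFFECTIVE_DATE": date(2020, 1, 1), "END_DATE": MAX_DB_DATE, "CURRENT": "Y"}
print(apply_hv_change(current, {"CITY": "York"}, date(2024, 6, 1), next_sk=2))
```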
ETL audit attributes
Every dimension should carry administrative columns for SCD management and auditability:
- EFFECTIVE DATE / END DATE — valid date range for each version row (END DATE defaults to max DB date, not NULL)
- CURRENT flag — “Y”/“N” for quick filtering without date logic
- MISSING flag — identifies special placeholder rows
- CREATED DATE / CREATED BY / UPDATED DATE / UPDATED BY — basic lineage for when and which ETL process touched the row
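For illustration, a default “Missing” placeholder row carrying these columns might look like the sketch below; the attribute values and ETL process name are assumptions, not from the book.

```python
from datetime import date

missing_row = {
    "CUSTOMER_SK": 0,                      # reserved SK for the Missing member
    "CUSTOMER_ID": "N/A",
    "CITY": "Missing",
    "EFFECTIVE_DATE": date(1900, 1, 1),
    "END_DATE": date(9999, 12, 31),        # max DB date, not NULL
    "CURRENT": "Y",
    "MISSING": "Y",                        # marks this as a placeholder row
    "CREATED_DATE": date.today(),
    "CREATED_BY": "etl_dim_init",
    "UPDATED_DATE": date.today(),
    "UPDATED_BY": "etl_dim_init",
}
```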
Star schema design patterns
The chapter walks through converting BEAM* event tables into fact tables:
- Fact table types: TF (transaction fact, from discrete events), PS (periodic snapshot), AS (accumulating snapshot)
- Fact additivity: FA (fully additive — can be summed across any dimension), SA (semi-additive), NA (non-additive). Store additive components and calculate percentages in BI (see the sketch after this list).
- Degenerate dimensions (DD): transaction IDs like ORDER ID that live in the fact table without a separate dimension. If too many accumulate, consolidate into a junk/how dimension.
- Enhanced star schema layout: arrange dimensions around the fact by W-type — When (top-left), Who (top-right), Where (bottom-left), What (bottom-right), How (top-center), Why (bottom-center). Consistent layout across all stars aids rapid comprehension.
- Avoid snowflaking: normalized outrigger tables add complexity, hurt browse performance, and save negligible space. Use hierarchy charts instead. Exception: very large dimensions (Chapter 6).
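A small sketch of why additive components are stored instead of ratios: averaging a stored discount percentage loses the weighting, while summing the components and re-deriving the ratio does not. Column names and figures are illustrative.

```python
import pandas as pd

facts = pd.DataFrame({
    "product": ["A", "A", "B"],
    "gross_amount": [100.0, 300.0, 200.0],
    "discount_amount": [10.0, 60.0, 10.0],
})

# Wrong: averaging a stored percentage ignores the weighting (product A comes out as 15%)
facts["discount_pct"] = facts["discount_amount"] / facts["gross_amount"]
naive = facts.groupby("product")["discount_pct"].mean()

# Right: sum the additive components, then derive the ratio (product A comes out as 17.5%)
summed = facts.groupby("product")[["discount_amount", "gross_amount"]].sum()
summed["discount_pct"] = summed["discount_amount"] / summed["gross_amount"]
print(naive, summed, sep="\n")
```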
Physical dimensional matrix
After star schemas are drawn, build a physical dimensional matrix showing which dimensions appear in which fact tables — the implementation-level view of the earlier event matrix. This serves as ongoing documentation of the warehouse architecture.
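As a sketch, the matrix can be kept as a simple mapping from fact tables to the dimensions they reference; the table names below are hypothetical.

```python
# Physical dimensional matrix as a plain mapping (names are illustrative).
dimensional_matrix = {
    "FACT_ORDERS":    {"DIM_DATE", "DIM_CUSTOMER", "DIM_PRODUCT"},
    "FACT_SHIPMENTS": {"DIM_DATE", "DIM_CUSTOMER", "DIM_PRODUCT", "DIM_CARRIER"},
    "FACT_RETURNS":   {"DIM_DATE", "DIM_CUSTOMER", "DIM_PRODUCT", "DIM_REASON"},
}

# Dimensions shared by every fact table form the conformance backbone
conformed = set.intersection(*dimensional_matrix.values())
print(sorted(conformed))   # ['DIM_CUSTOMER', 'DIM_DATE', 'DIM_PRODUCT']
```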
Mapping against Ray Data Co
Corr’s profiling = Column x Absolute in the Scope x Basis matrix
Corr’s three profiling techniques (missing values, uniqueness, ranges) all operate at the Column scope and use Absolute basis — they check intrinsic properties of individual columns without comparing across tables or over time. This is the bottom-left quadrant of our testing matrix. The framework shows that Corr’s approach is necessary but not sufficient; it doesn’t cover Row-scope checks (referential integrity between tables), Cross-source checks, or Relative-basis checks (drift, trend breaks).
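A compact way to record where each of Corr's checks lands in the matrix, and which quadrants stay uncovered; the check names and quadrant labels are ours, not Corr's.

```python
# Classification of the chapter's profiling checks in the Scope x Basis matrix.
corr_profiling = {
    "missing_values":       {"scope": "column", "basis": "absolute"},
    "uniqueness_frequency": {"scope": "column", "basis": "absolute"},
    "ranges_and_lengths":   {"scope": "column", "basis": "absolute"},
}
not_covered = [
    {"scope": "row",          "basis": "absolute"},   # e.g. referential integrity between tables
    {"scope": "cross-source", "basis": "absolute"},   # e.g. source-to-target reconciliation
    {"scope": "column",       "basis": "relative"},   # e.g. drift, trend breaks
]
```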
Corr’s severity table = our Stop/Pause/Go tiers
Table 5-2 maps almost directly onto the triage tiers in the data quality framework:
- Stop (severity 1-3) = tests that should block deployment. Missing conformed dimensions or business keys are structural — you can’t build on a broken foundation.
- Pause (severity 5-8) = tests that should flag warnings and require human sign-off. Missing mandatory values or hierarchy errors are quality issues that might be acceptable temporarily.
- Go (severity 10-12) = tests that log observations but don’t block anything. Bonus attributes and minor mismatches are improvement opportunities, not blockers.
This gives us an external, published validation that the three-tier severity model is a sound pattern — and Corr’s specific issue list can seed a default checklist for the consulting deliverable.
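A tiny helper could seed the tiers directly from Table 5-2's severity ranks; the boundaries follow the bullets above (severity 5's "Stop/Pause" is treated as Pause here), and the function name is ours.

```python
def triage_tier(severity: int) -> str:
    """Map a Table 5-2 severity rank onto the framework's triage tiers."""
    if severity <= 3:
        return "stop"    # block deployment: structural gaps
    if severity <= 8:
        return "pause"   # warn and require human sign-off
    return "go"          # log the observation, keep building

assert triage_tier(1) == "stop"    # missing conformed dimension
assert triage_tier(6) == "pause"   # missing non-conformed dimension
assert triage_tier(12) == "go"     # bonus attributes discovered
```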
SoR identification = a new prerequisite step
Corr’s insistence on identifying the System of Record before profiling is something our current testing matrix template assumes but doesn’t enforce. This should become an explicit prerequisite step in the framework: before running any Column x Absolute tests, document which source system is authoritative for each entity. Without this, profiling results are meaningless — you might be profiling a stale replica and attributing quality problems to the wrong system.
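A sketch of what the prerequisite artifact could look like: a documented SoR per entity that profiling refuses to run without. The system and entity names are hypothetical.

```python
# Documented System of Record per entity, checked before any Column x Absolute test runs.
SYSTEM_OF_RECORD = {
    "customer": {"sor": "crm_prod", "note": "ERP copy is downstream; do not profile it"},
    "order":    {"sor": "oms_prod", "note": "reporting replica lags roughly 24h"},
}

def require_sor(entity: str) -> str:
    """Refuse to profile an entity until its System of Record is documented."""
    if entity not in SYSTEM_OF_RECORD:
        raise RuntimeError(f"No System of Record documented for '{entity}'; profiling blocked")
    return SYSTEM_OF_RECORD[entity]["sor"]

print(require_sor("customer"))   # 'crm_prod'
```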
ETL audit columns as a standard pattern
Corr’s audit attributes (CREATED/UPDATED DATE/BY, EFFECTIVE/END DATE, CURRENT, MISSING) align with what the SeattleDataGuy piece on noisy checks identifies as the metadata layer needed to make quality checks actionable. If dimensions carry audit columns, you can write time-aware quality checks that distinguish “data was always bad” from “data degraded after ETL version X.”
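A hedged sketch of such a time-aware check: group a column's missing-value rate by the ETL process and load month recorded in the audit columns. The audit column names follow the list above; the EMAIL column and the file are assumptions.

```python
import pandas as pd

dim = pd.read_parquet("dim_customer.parquet")
dim["load_month"] = dim["CREATED_DATE"].dt.to_period("M")
dim["missing_email"] = dim["EMAIL"].isna()

# Missing-value rate per ETL process and load month: a step change at one load
# points at the pipeline, a flat high rate points at the source system.
missing_rate = dim.groupby(["CREATED_BY", "load_month"])["missing_email"].mean()
print(missing_rate)
```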
Related
- 01-projects/data-quality-framework/testing-matrix-template — the Scope x Basis matrix; Corr’s profiling populates the Column x Absolute quadrant
- 2026-03-30-founder-data-quality-framework — founder’s original framework; Corr’s severity table validates the Stop/Pause/Go tiers
- 2026-04-07-seattle-data-guy-noisy-data-quality-checks — noisy checks problem; Corr’s audit attributes are part of the solution
- 2026-04-04-eric-weber-data-team-roi-ai-first — Weber’s ROI framing gives the “why measure” that pairs with Corr’s “how to measure”
- 2026-04-12-lindstrom-board-ai-governance — governance accountability maps to Corr’s SoR identification pattern