Chapter 8: How Many — Fact Tables and Measures — Corr & Stagnitto
Why this is in the vault
Fact table design is the structural decision that determines whether a dimensional model performs well or collapses under query load and ETL complexity. This chapter covers the three core fact table types, additivity rules that silently corrupt reports when violated, and aggregation patterns that remain relevant even in modern column-store engines. For Ray Data Co’s consulting practice, these patterns appear in virtually every client engagement — especially when building SaaS metric models like ARR waterfalls, where the choice between transaction facts, periodic snapshots, and accumulating snapshots directly determines whether reconciliation is even possible.
Key concepts
Fact table types
Three fact table types correspond to three kinds of business events:
-
Transaction fact tables [TF] — capture discrete, point-in-time events (a sale, a click, a payment). Insert-only, narrow, and extremely deep. They are the atomic bedrock: every other fact table can theoretically be derived from them. In practice, their size makes them impractical for answering cumulative or balance-style questions directly, so they need to be supplemented with snapshots. Financial transaction facts often carry an audit date dimension to handle late-arriving adjustments.
-
Periodic snapshots [PS] — store recurring measurements at a regular cadence (daily stock levels, monthly account balances). They trade event-level detail for pre-computed period-level measures, and their design is more open-ended than transaction facts because stakeholders can request KPIs that never existed in the source system. Corr recommends loading monthly snapshots nightly rather than waiting for month-end — this staggers ETL load and provides period-to-date values that are on average only one day stale rather than fifteen.
-
Accumulating snapshots [AS] — track evolving, multi-milestone events (an order moving through fulfillment, a claim through adjudication). They have multiple date keys representing milestones, and rows are updated as events progress. These are the most valuable for process performance measurement and the hardest to build, because they often merge data from multiple source systems with complex cardinality. Corr’s core advice: build them incrementally by first delivering each milestone as its own transaction star schema, then merging into the full accumulating snapshot once conformed dimensions are proven.
Granularity
Granularity is the single most important fact table property — it defines what each row means. It can be expressed in business terms (“one row per order line item”) or dimensionally (by listing the granularity dimensions that uniquely identify each row). Fuzzy or mixed granularity makes a fact table impossible to build or use correctly. For accumulating and period-to-date snapshots, the granularity dimensions (GD columns) also serve as unique update indexes for efficient ETL.
Modeling evolving events
Complex business processes can be modeled as multi-verb evolving events using BEAM* event tables. The key modeling tool is the event timeline — a visual chronology of milestones that surfaces duration measures between them. Duration naming follows a formula: number milestones DT1-DTn chronologically, then define durations as differences (e.g., Packing Time = DT2 - DT1). This keeps definitions unambiguous and portable.
State counts (1/0 flags recording whether a milestone has been reached) and event counts (number of repeated milestones like shipments per order) are essential derived measures that simplify process queries. Corr emphasizes that state logic can be surprisingly complex — the test for “delivered” is not just whether a delivery date exists, but whether delivered quantity equals ordered quantity.
Fact types (additivity)
-
Fully additive (FA) — can be summed across any dimension. Revenue is the canonical example. Must use a single standard unit of measure; store conversion factors as facts if local units are also needed.
-
Semi-additive (SA) — summable across most dimensions but not across their non-additive dimension(s), typically time. Account balances are the classic case. Averaging semi-additive facts requires dividing by the count of distinct non-additive dimension values (e.g., distinct dates), not by row count — using SQL
AVG()directly gives wrong answers. Corr provides BEAM* notation (SA/NA codes) to document these constraints so BI tools can enforce correct aggregation. -
Non-additive (NA) — never meaningful when summed (unit prices, percentages, temperatures). Store the additive components instead (discount amount rather than discount percentage) and let BI tools calculate ratios in context. Timestamps are non-additive but can be subtracted in pairs to produce additive or semi-additive duration facts.
Heterogeneous facts
When product lines have fundamentally different measures (banking: checking vs. mortgage accounts), a monolithic fact table becomes sparse and hard to manage. The pattern is to split into a core fact table with shared measures and custom fact tables per product family with their exclusive facts. Both share conformed dimensions. This avoids the “fact rich but data poor” anti-pattern.
Factless fact tables
Some events have nothing to measure except their occurrence — seminar attendance, eligibility coverage. These are modeled as factless fact tables containing only dimension keys. A dummy additive fact (always 1) can be added to support aggregate navigation. Factless tables also serve as coverage tables that enable “what didn’t happen” analysis (promoted but unsold products, covered but unclaimed benefits).
Fact table optimization
Four levers for performance:
-
Downsizing — use integer surrogate keys, date keys instead of datetime, move text into dimensions, limit history to what BI actually needs (start with the most recent data and work backwards).
-
Indexing — bitmap indexes on foreign keys for star join optimization; unique GD-based indexes on accumulating/PTD snapshots for efficient ETL updates. Prioritize query indexes over ETL indexes when there is a trade-off.
-
Partitioning — by date key range. Enables fast ETL (load into empty, un-indexed partitions then swap), history pruning (truncate a partition vs. row-by-row delete), and real-time support (un-indexed hot partitions for intraday trickle feeds, merged nightly).
-
Aggregation — pre-summarized derived fact tables that act as materialized group-by indexes. Three types:
- Lost dimension — drop dimensions entirely (simplest, just a GROUP BY).
- Shrunken dimension — roll up dimensions to coarser grain (Date to Month, Store to Region). More complex but covers broader queries.
- Collapsed dimension — pre-join selected dimension attributes into a denormalized summary table. Fastest to query but the wider rows can negate size savings.
Target aggregates ~20x smaller than base tables. Use DBMS aggregate navigation so they remain invisible to users. Budget up to 100% storage overhead.
Drill-across
When BI users need to combine measures from multiple fact tables (cross-process analysis), directly joining the tables produces the fan trap / chasm trap — overcounting caused by M:M relationships between joined fact tables. The solution is drill-across queries: separate SELECT statements against each star schema, aggregated independently to conformed row headers, then merged via full outer join. This is the only safe way to combine measures from different grain fact tables.
Mapping against Ray Data Co
ARR waterfall relevance
The ARR waterfall is a textbook accumulating snapshot problem. A subscription contract moves through milestones — booking, activation, renewal, expansion, contraction, churn — each with its own timestamp and amount. The chapter’s guidance maps directly:
-
Start with transaction facts. Each milestone event (new booking, expansion, churn) should first be modeled as its own transaction fact table with conformed customer, product, and calendar dimensions. This is exactly what the testing matrix template already assumes when it references
opportunity_idfidelity checks against bronze. -
Layer the accumulating snapshot on top. Once the transaction stars reconcile, merge into a contract-level accumulating snapshot where each row tracks one contract through its lifecycle. Milestone dates (book date, activation date, first renewal date, churn date) become the multiple time dimensions. Duration measures (time to activate, renewal cycle length) fall out naturally.
-
Semi-additivity is the trap. ARR balance at any point in time is semi-additive — you can sum it across customers and products, but not across time. The waterfall components (new, expansion, contraction, churn) are additive period-level changes, while the opening and closing ARR balances are semi-additive snapshots. Mixing these in a single report without understanding the additivity rules is how most ARR reconciliation bugs happen. Corr’s SA/NA notation would force the team to document this explicitly in the model.
-
Incremental delivery. The chapter’s recommendation to build accumulating snapshots incrementally by delivering milestone transaction stars first aligns with how we should sequence client engagements: ship a reconciled bookings fact table before attempting the full waterfall.
Broader consulting application
- The heterogeneous facts pattern applies whenever a client has multiple revenue streams measured differently (usage-based vs. seat-based SaaS, or mixed product/service businesses).
- Factless coverage tables are useful for modeling entitlement vs. consumption in usage-based pricing.
- The drill-across pattern is critical for any client wanting a single executive dashboard combining pipeline (CRM fact table), revenue (billing fact table), and usage (product analytics fact table) — joining these directly is the number one cause of inflated metrics in client environments.
Related
- testing-matrix-template — the data quality framework that validates each layer of this fact table architecture
- rdco-article-business-metrics — the business metrics article covers the “what to measure” side; this chapter covers “how to structure the storage”
- 2026-04-03-b2b-saas-pricing-masterclass — pricing model complexity drives heterogeneous fact table design; usage-based pricing especially requires careful additivity documentation