The activity schema aims for these design goals
only one definition for each concept - know exactly where to find the right data
single model layer - no layers of dependencies
simple definitions - no thousand-line SQL queries to model data
no relationships in the modeling layer - all concepts are defined independently of each other
no foreign key joins - all concepts can be joined together without having to identify how they relate in advance
analyses can be run anywhere — analyses can be shared and reused across companies with vastly different data.
high performance - reduced joins and fewer columns leverages the power of modern column-oriented data warehouses
incremental updates - no more rebuilding data models on every update (View Highlight)
At its core an activity schema consists of transforming raw tables into a single, time series table called an activity stream. All downstream plots, tables, materialized views, etc used for BI are built directly from that single table, with no other dependencies. (View Highlight)
An activity schema models an entity taking a sequence of activities over time. For example, a customer (entity) viewed a web page (activity). (View Highlight)
An activity schema is implemented as a single, time series table with (optionally) a limited set of enrichment tables for additional metadata. (View Highlight)
Each row in the table represents a single activity taken by the entity at a point in time. In other words, it has an activity identifier, an entity identifier, a timestamp, and some basic metadata called features. (View Highlight)
An activity schema table will only have one entity type and is typically named _stream. For example, an activity schema implementation for customers would be customer_stream, and one for bikes would be bike_stream (View Highlight)
Activities are specific actions taken by an entity. For example, if the entity is a customer an activity could be ‘opened an email’ or ‘submitted a support ticket’. Each row in a table modeled by an activity schema is a single instance of an activity taken by a specific entity.
Activities are intended to model real business processes. Taken together, the series of activities for a given entity would represent all relevant interactions that entity has had with a company. (View Highlight)
Every activity has metadata associated with it beyond the customer, the activity, and the timestamp. A ‘viewed page’ activity will want to store the actual page viewed, while an ‘invoice paid’ activity will store the total amount paid.
The stream tables of an activity schema have a finite number of metadata columns that can be associated with each activity. For performance reasons, the most commonly used metadata are stored in the stream table, and in limited cases, an activity schema supports an unlimited number of additional metadata columns stored in a separate table. In practice fewer than 4% of activities need additional columns. (View Highlight)
An activity schema uses a single table to store all activities. No new tables are created as the data evolves — any future activities will create rows in the same table. This is fundamentally distinct from other models, like a star schema, where each concept has its own ‘fact’ table. (View Highlight)
There are two types of tables in an Activity Schema
activity stream (one per activity schema)
entity table (optional - one per activity schema)
The activity stream table, (typically called _stream) stores all activities, their timestamps, the entity’s identifier, and some metadata.
The entity table (typically called ) stores metadata for each entity. For example, a customers table can store date of birth, first and last name, etc. (View Highlight)
Modeling here is the step to transform source tables in a data warehouse into the activity stream format. Querying is running SQL queries against the activity stream table to generate tables, materialized views, etc to be used for BI. (View Highlight)
the activity stream table, and the primary modeling concept is an activity (rather than a noun like an ‘order’ or an ‘invoice’).
The process is straightforward
choose your activities
find relevant raw table(s)
write a SQL query to create each activity (View Highlight)
Once the activities have been identified then it’s usually fairly straightforward to find which source table(s) will be needed. The only requirement is that we can identify an entity, a relevant activity, and a timestamp. (View Highlight)
The activity schema approach is effectively the same, only without the layers of dependencies. A common approach is to create a single SQL query per activity desired, and append the results of each query to the activity stream table. (View Highlight)
Note: I disagree with their rigid schema structure - but if you have a flexible schema you destroy the benefit of incremental loads. How can we get around this?
An activity schema differs in some fundamental ways to more traditional approaches
data is in a time-series format
queries only select from the activity stream table (and optionally join in enrichment tables)
any activity can be related (joined) to any other activity using only the entity and timestamp (View Highlight)
Note: Time-series +
One table +
Self-joins instead of foreign relationships +