The output of this data pipeline is a function of the input. In other words, the output is derived from the input by running the input through the pipeline.
This is an important characteristic of the output. As long as the input data and pipeline transformations (i.e. the pipeline code) are preserved, the output can always be recreated. The input data is primary; if lost, it cannot be replaced. The output data, along with any intermediate stages in the pipeline, is derivative; it can always be recreated from the primary data using the pipeline.
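A minimal Python sketch of this idea (the `pipeline` function and sample records are hypothetical, not from the article): because the output is a pure function of the input, rerunning the preserved code on the preserved input always reproduces the output and every intermediate stage.

```python
# Sketch: the pipeline is a pure function of its input, so the output
# (and any intermediate stage) can always be recreated from the input.
# The `pipeline` function and sample records here are illustrative.

def pipeline(source_rows):
    # Intermediate stage: drop malformed rows (derivative data).
    valid = [r for r in source_rows if r.get("amount") is not None]
    # Output stage: aggregate amounts per user (also derivative).
    totals = {}
    for row in valid:
        totals[row["user"]] = totals.get(row["user"], 0) + row["amount"]
    return totals

source = [  # primary data: irreplaceable if lost
    {"user": "a", "amount": 3},
    {"user": "b", "amount": 5},
    {"user": "a", "amount": None},  # malformed row
]

# Running the pipeline twice on the same input yields the same output.
assert pipeline(source) == pipeline(source)
```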
Note: I feel the transformation jobs are getting the short end of the stick here. Primary data can change. You may want to keep the output the same even as the source changes (because you acquired a company or switched source systems).
Most data pipelines, if you zoom out far enough, look something like this. You have some source data; it gets sliced, diced, and combined in various ways to produce some outputs.
Any time someone queries the output of the pipeline, it’s logically equivalent to them running the entire pipeline on the source data to get the output they’re looking for. In this way, a pipeline is a view into the source data.
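One way to picture the view analogy, as a hedged Python sketch (all names are illustrative): reading the precomputed output table and recomputing the pipeline against the source are logically interchangeable.

```python
# Sketch: the pipeline output behaves like a view over the source data.
# Querying the stored output is logically equivalent to rerunning the
# whole pipeline on the source. All names here are illustrative.

def pipeline(source_rows):
    return {r["user"]: r["amount"] for r in source_rows}

source = [{"user": "a", "amount": 3}, {"user": "b", "amount": 5}]

materialized_output = pipeline(source)  # precomputed output table

def query_materialized(user):
    return materialized_output[user]    # read the stored output

def query_as_view(user):
    return pipeline(source)[user]       # recompute from the source

# Both paths give the same answer; the output is a cached view.
assert query_materialized("a") == query_as_view("a")
```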
When updating a materialized view, there are two high-level properties you typically care about: the update trigger and the update granularity. The former affects the freshness of your output, which impacts end-users of the data, and the latter affects the performance of your update process, which impacts the engineers or operators responsible for that process.
Note: Update trigger affects data freshness, which impacts end users.
Update granularity affects workflow performance, which impacts engineers.
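To make the two axes concrete, here is an illustrative Python sketch (the names are invented, not from the article): the trigger decides *when* an update runs, and the granularity decides *how much* work each run does.

```python
# Illustrative sketch (names invented): trigger and granularity are
# independent axes of a materialized-view update strategy.
from enum import Enum

class Trigger(Enum):
    ON_SCHEDULE = "cron"        # run every N minutes/hours -> freshness
    ON_SOURCE_CHANGE = "event"  # run when source data changes

class Granularity(Enum):
    FULL_REFRESH = "full"        # rebuild the whole output table
    INCREMENTAL = "incremental"  # update only what changed -> cost

# Trigger -> freshness -> end users; granularity -> cost -> operators.
trigger, granularity = Trigger.ON_SCHEDULE, Granularity.FULL_REFRESH
print(f"{granularity.value} refresh, triggered by {trigger.value}")
```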
A common update granularity is the full refresh. No matter how small or large the change to the source data, when the pipeline runs it throws away the entire output table and rebuilds it from scratch.
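As a hedged sketch of full-refresh semantics (hypothetical names; a real warehouse job would do this in SQL or an orchestration framework): every run discards the output table and rebuilds it wholesale, regardless of how small the change to the source was.

```python
# Sketch of full-refresh granularity: each run throws away the output
# table and rebuilds it from the source, no matter how small the
# change. Names and the in-memory "table" are illustrative.

def pipeline(source_rows):
    return {r["user"]: r["amount"] for r in source_rows}

output_table = {}

def full_refresh(source_rows):
    global output_table
    output_table = {}                     # throw away the entire output
    output_table = pipeline(source_rows)  # rebuild from scratch

source = [{"user": "a", "amount": 3}]
full_refresh(source)

source.append({"user": "b", "amount": 5})  # a one-row change...
full_refresh(source)                       # ...still rebuilds everything
```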