06-reference

seattle data guy full refresh vs incremental

Mon Mar 16 2026 20:00:00 GMT-0400 (Eastern Daylight Time) · reference · source: SeattleDataGuy's Newsletter (Substack) · by SeattleDataGuy (Ben Rogojan)

“Full Refresh vs Incremental Pipelines” — @SeattleDataGuy

⚠️ Sponsorship — self-sponsored this issue

First issue in the backfill where SDG drops Estuary and replaces it with his own consulting CTA. Quote: “today’s article is sponsored by me, the Seattle Data Guy!” followed by a pitch for his data-consulting services. Not misleading — it’s clearly labeled — but a different pattern than the disclosed-adviser Estuary slot. Skill should detect both sponsor forms.

The core argument

How you load data into your tables is a real design decision that affects compute costs, implementation complexity, backfill ease, update-handling, and tooling constraints. SDG walks through the two primary patterns — full refresh and incremental — and the common incremental sub-patterns.

Full refresh

CREATE OR REPLACE over a SELECT. Simplest possible pipeline. Works well when tables are small, refresh windows are generous, and freshness requirements are loose.

Pair with WAP (write-audit-publish):

  1. Write — build staged version
  2. Audit — row count vs prior, null checks on required cols, duplicate key detection, metric range checks. Stop the pipeline if any fail.
  3. Publish — promote to production only after checks pass.
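The three WAP steps can be sketched as a runnable script. This is an illustrative sketch using SQLite, not the article's code; the table names (`raw_orders`, `orders`, `orders_staged`) and the specific audit thresholds are assumptions.

```python
# Write-audit-publish (WAP) full refresh, sketched against an in-memory SQLite DB.
# Table names and audit checks are illustrative, not from the article.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_orders (order_id INTEGER, amount REAL);
    INSERT INTO raw_orders VALUES (1, 10.0), (2, 25.5), (3, 7.25);
    CREATE TABLE orders (order_id INTEGER, amount REAL);  -- production table
""")

# 1. Write: build the staged version in a separate table.
conn.executescript("""
    DROP TABLE IF EXISTS orders_staged;
    CREATE TABLE orders_staged AS SELECT * FROM raw_orders;
""")

# 2. Audit: row count, null check on the required key, duplicate key detection.
#    Stop the pipeline (raise) if any check fails.
staged = conn.execute("SELECT COUNT(*) FROM orders_staged").fetchone()[0]
nulls = conn.execute(
    "SELECT COUNT(*) FROM orders_staged WHERE order_id IS NULL"
).fetchone()[0]
dupes = conn.execute(
    "SELECT COUNT(*) FROM (SELECT order_id FROM orders_staged"
    " GROUP BY order_id HAVING COUNT(*) > 1)"
).fetchone()[0]
assert staged > 0 and nulls == 0 and dupes == 0, "audit failed; not publishing"

# 3. Publish: promote to production only after the checks pass.
conn.executescript("""
    DROP TABLE orders;
    ALTER TABLE orders_staged RENAME TO orders;
""")
published = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

The swap in step 3 is the key design choice: production is never half-written, because the staged table either passes audit and is renamed atomically, or the run aborts.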

Breaks down as tables grow, refresh windows tighten, or freshness requirements increase.

Incremental — four sub-patterns

1. ID appending

Two variants:
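A minimal sketch of the basic ID-append form, assuming a monotonically increasing integer key (table and column names here are illustrative, not from the article):

```python
# ID appending: copy only rows whose id is above the target's high-water mark.
# Assumes a monotonically increasing integer key; names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_events (id INTEGER, payload TEXT);
    CREATE TABLE analytics_events (id INTEGER, payload TEXT);
    INSERT INTO raw_events VALUES (1, 'a'), (2, 'b');
    INSERT INTO analytics_events VALUES (1, 'a'), (2, 'b');  -- already loaded
    INSERT INTO raw_events VALUES (3, 'c'), (4, 'd');        -- new arrivals
""")

# COALESCE handles the first run, when the target is empty.
conn.execute("""
    INSERT INTO analytics_events
    SELECT * FROM raw_events
    WHERE id > (SELECT COALESCE(MAX(id), 0) FROM analytics_events)
""")
loaded = conn.execute("SELECT COUNT(*) FROM analytics_events").fetchone()[0]
```

Like the naïve date version below, this breaks if ids are not strictly increasing or old rows mutate.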

2. Date appending

Requires a reliable update_date. SDG gripes that this is far less common than you’d expect — he’s seen Postgres tables with create_date but never update_date, or the column exists but is never populated.

Naïve version: WHERE order_date > (SELECT MAX(order_date) FROM analytics.orders). Works when data arrives in order, old records don't change, and there are no late-arriving records.

Lookback window version (safer):

DELETE FROM analytics.orders WHERE order_date >= CURRENT_DATE - INTERVAL '2 day';
INSERT INTO analytics.orders SELECT * FROM raw.orders WHERE order_date >= CURRENT_DATE - INTERVAL '2 day';

Trades a little compute for a lot more reliability when data can arrive late.
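The lookback window above can be exercised end to end; a sketch in SQLite (the article's analytics.orders / raw.orders flattened to single names, since SQLite has no schemas):

```python
# Lookback-window incremental load: delete the last two days from the target,
# then re-insert that whole window from raw. A late-arriving record for
# yesterday is picked up on this run.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_orders (order_id INTEGER, order_date TEXT);
    CREATE TABLE analytics_orders (order_id INTEGER, order_date TEXT);
    -- Target already holds yesterday's load; record 3 then arrived late
    -- in raw with yesterday's date, and record 2 is today's data.
    INSERT INTO raw_orders VALUES
        (1, date('now', '-1 day')),
        (3, date('now', '-1 day')),
        (2, date('now'));
    INSERT INTO analytics_orders VALUES (1, date('now', '-1 day'));
""")

# Reload the whole window rather than only rows past the max date.
conn.execute(
    "DELETE FROM analytics_orders WHERE order_date >= date('now', '-2 day')"
)
conn.execute("""
    INSERT INTO analytics_orders
    SELECT * FROM raw_orders WHERE order_date >= date('now', '-2 day')
""")
rows = conn.execute("SELECT COUNT(*) FROM analytics_orders").fetchone()[0]
```

After the run the target holds all three records, including the late arrival the naïve MAX(order_date) filter would have skipped.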

3. Upsert / Merge

MERGE INTO analytics.customers AS target
USING staging.customers AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN UPDATE SET target.email = source.email, target.status = source.status
WHEN NOT MATCHED THEN INSERT (customer_id, email, status) VALUES (source.customer_id, source.email, source.status);

SDG notes that early Redshift lacked MERGE, which is how he first encountered the term "upsert." Platform capabilities shape pipeline design.
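On engines without MERGE, the same semantics are commonly expressed as an upsert. A sketch using SQLite's INSERT ... ON CONFLICT syntax (this is SQLite's equivalent, not the article's MERGE; it requires a unique key on customer_id):

```python
# The MERGE pattern expressed as a SQLite upsert: update matched keys,
# insert unmatched ones. 'excluded' refers to the row that failed to insert.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY, email TEXT, status TEXT
    );
    INSERT INTO customers VALUES (1, 'old@x.com', 'active');
""")

# One existing customer gets updated; one brand-new customer gets inserted.
conn.executemany("""
    INSERT INTO customers (customer_id, email, status)
    VALUES (?, ?, ?)
    ON CONFLICT (customer_id) DO UPDATE SET
        email = excluded.email,
        status = excluded.status
""", [(1, 'new@x.com', 'active'), (2, 'b@x.com', 'trial')])

email_1 = conn.execute(
    "SELECT email FROM customers WHERE customer_id = 1"
).fetchone()[0]
total = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
```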

4. Aggregation append (accounting pattern)

Common in finance/healthcare where corrections are append-only rather than destructive. Instead of updating a row, you insert a correction row (positive or negative delta). Aggregating the whole key group gives you the correct current state.
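A sketch of the pattern (names are illustrative): corrections land as new delta rows, and current state is recovered by summing the key group.

```python
# Aggregation append: never update in place; insert signed correction rows
# and aggregate per key to get current state. Names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ledger (account TEXT, amount REAL);
    INSERT INTO ledger VALUES ('acct_a', 100.0);  -- original entry
    INSERT INTO ledger VALUES ('acct_a', -25.0);  -- correction: overbilled 25
    INSERT INTO ledger VALUES ('acct_a', 5.0);    -- correction: late fee
""")

# Current balance is the sum over the key group; full history stays intact.
balance = conn.execute(
    "SELECT SUM(amount) FROM ledger WHERE account = 'acct_a'"
).fetchone()[0]
```

The append-only property is the point: every prior state remains reconstructable, which is why the pattern shows up where audit trails are mandatory.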

Mapping against Ray Data Co

Curation section