“Full Refresh vs Incremental Pipelines” — @SeattleDataGuy
⚠️ Sponsorship — self-sponsored this issue
First issue in the backfill where SDG drops Estuary and replaces it with his own consulting CTA. Quote: “today’s article is sponsored by me, the Seattle Data Guy!” followed by a pitch for his data-consulting services. Not misleading — it’s clearly labeled — but a different pattern than the disclosed-adviser Estuary slot. Skill should detect both sponsor forms.
The core argument
How you load data into your tables is a real design decision: it affects compute cost, implementation complexity, ease of backfilling, how updates are handled, and tooling constraints. SDG walks through the two primary patterns — full refresh and incremental — and the common incremental sub-patterns.
Full refresh
CREATE OR REPLACE over a SELECT. Simplest possible pipeline. Works well when:
- Dataset is small
- Rebuild cost is low
- Source lacks reliable change tracking
- Simplicity beats efficiency
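A minimal sketch of the full-refresh pattern in warehouse-style SQL (CREATE OR REPLACE TABLE is Snowflake/BigQuery/Databricks syntax; the raw.orders and analytics.orders names are illustrative, not from the article):
-- Full refresh: throw away the old table and rebuild it from source every run.
CREATE OR REPLACE TABLE analytics.orders AS
SELECT
    order_id,
    customer_id,
    order_date,
    total_amount
FROM raw.orders;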
Pair with WAP (write-audit-publish):
- Write — build staged version
- Audit — row count vs prior, null checks on required cols, duplicate key detection, metric range checks. Stop the pipeline if any fail.
- Publish — promote to production only after checks pass.
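A sketch of the full write → audit → publish flow under the same assumptions (illustrative table names; the staging schema, the 10% row-count tolerance, and the specific checks are placeholders for whatever matters for the table):
-- Write: build the candidate version off to the side.
CREATE OR REPLACE TABLE staging.orders_build AS
SELECT * FROM raw.orders;

-- Audit: each check should return zero rows; stop the run otherwise.
SELECT 'row_count_drop' AS failed_check
FROM (SELECT COUNT(*) AS n FROM staging.orders_build) b,
     (SELECT COUNT(*) AS n FROM analytics.orders) p
WHERE b.n < p.n * 0.9;                    -- row count vs prior

SELECT 'null_required_col' AS failed_check
FROM staging.orders_build
WHERE order_id IS NULL OR order_date IS NULL
LIMIT 1;                                  -- null checks on required cols

SELECT 'duplicate_key' AS failed_check
FROM staging.orders_build
GROUP BY order_id
HAVING COUNT(*) > 1
LIMIT 1;                                  -- duplicate key detection

-- Publish: promote only after every audit query comes back empty.
CREATE OR REPLACE TABLE analytics.orders AS
SELECT * FROM staging.orders_build;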
Breaks down as tables grow, refresh windows tighten, or freshness requirements increase.
Incremental — four sub-patterns
1. ID appending
Two variants:
- WHERE id > MAX(id) — faster, assumes in-order arrival and no missed rows.
- WHERE id NOT EXISTS — safer, more expensive, catches missed rows.
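Spelled out as runnable SQL (same illustrative raw.orders / analytics.orders names), the two variants look roughly like this:
-- Variant 1: high-water mark on id. Cheap, but silently skips any row that
-- arrives later with an id below the current max.
INSERT INTO analytics.orders
SELECT *
FROM raw.orders r
WHERE r.id > (SELECT MAX(id) FROM analytics.orders);

-- Variant 2: anti-join. Scans more, but picks up any id the target is missing.
INSERT INTO analytics.orders
SELECT *
FROM raw.orders r
WHERE NOT EXISTS (
    SELECT 1 FROM analytics.orders a WHERE a.id = r.id
);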
2. Date appending
Requires a reliable update_date. SDG gripes that this is far less common than you’d expect — he’s seen Postgres tables with create_date but no update_date, or where the column exists but is never populated.
Naïve version: WHERE order_date > MAX(order_date). Works when data arrives in order, old records don’t change, no late-arriving records.
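As actual SQL (same illustrative names), the naïve version is just a high-water mark on the date column:
-- Naïve date append: only pull rows newer than the latest loaded order_date.
INSERT INTO analytics.orders
SELECT *
FROM raw.orders
WHERE order_date > (SELECT MAX(order_date) FROM analytics.orders);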
Lookback window version (safer):
DELETE FROM analytics.orders WHERE order_date >= CURRENT_DATE - INTERVAL '2 day';
INSERT INTO analytics.orders SELECT * FROM raw.orders WHERE order_date >= CURRENT_DATE - INTERVAL '2 day';
Trades a little compute for a lot more reliability when data can arrive late.
3. Upsert / Merge
MERGE INTO analytics.customers AS target
USING staging.customers AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN UPDATE SET target.email = source.email, target.status = source.status
WHEN NOT MATCHED THEN INSERT (customer_id, email, status) VALUES (source.customer_id, source.email, source.status);
SDG notes Redshift early on didn’t have MERGE, which is how he first encountered the term “upsert.” Platform capabilities shape pipeline design.
4. Aggregation append (accounting pattern)
Common in finance/healthcare where corrections are append-only rather than destructive. Instead of updating a row, you insert a correction row (positive or negative delta). Aggregating the whole key group gives you the correct current state.
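A minimal sketch, assuming a ledger-style analytics.account_entries table (an illustrative name, not from the article): corrections land as new signed rows, and the current state is always an aggregate over the key group:
-- Original entry.
INSERT INTO analytics.account_entries (account_id, entry_date, amount)
VALUES (42, DATE '2026-04-01', 100.00);

-- Later correction: never update the original row; append an offsetting delta.
INSERT INTO analytics.account_entries (account_id, entry_date, amount)
VALUES (42, DATE '2026-04-03', -15.00);

-- Correct current state = sum over the whole key group.
SELECT account_id, SUM(amount) AS current_balance
FROM analytics.account_entries
GROUP BY account_id;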
Mapping against Ray Data Co
- autoinv is mostly full-refresh territory. Our experiments produce dated outputs (pm1e_predictions_20260411-141032.csv) — each run is a new artifact, no in-place updates. Fine for research; irrelevant until we start persisting long-running state.
- The WAP pattern is worth stealing explicitly for any live pipeline. Our BiasAudit is halfway there: it’s a gate between computation and “this strategy is deployable.” Turning it into an explicit Write → Audit → Publish flow would make the discipline more visible. Something to consider when we first paper-trade a strategy.
- If/when we ingest live equity bars or Polymarket trade streams into a persistent store, date-append with a lookback window is the right default. The “max date naive append” approach would miss delayed trades, which is exactly the class of bug that’ll silently corrupt backtests.
- The Redshift MERGE anecdote is the same lesson as our own infrastructure choices: the platform you pick shapes the code you write. We chose file-based outputs over a real database because Claude Code operates on markdown/CSV naturally; that limits some things but simplifies everything else.
Curation section
- “Database Federation: Decentralized and ACL-Compliant Hive Databases” — genuine third-party, Uber engineering blog. 16K+ Hive datasets in a single monolithic database, decomposing into federated ACL-compliant ones. Not applicable at our scale. Skipping deep-dive.
- “Layer by Layer, We Built Data Systems No One Understands” — appears to be SDG’s own prior post (the narrative voice and the “I’ve helped companies save millions on compute” framing matches). Not filing deep-dive; theme (over-layered modern data stacks) is noted but not blocking.
Related
- 2026-02-09-seattle-data-guy-why-data-pipelines-exist
- 2026-02-23-seattle-data-guy-backfills
- 2026-04-07-seattle-data-guy-noisy-data-quality-checks — the “audit” step of WAP is the philosophical anchor for our BiasAudit