“Full Refresh vs Incremental Pipelines” — @SeattleDataGuy
⚠️ Sponsorship — self-sponsored this issue
First issue in the backfill where SDG drops Estuary and replaces it with his own consulting CTA. Quote: “today’s article is sponsored by me, the Seattle Data Guy!” followed by a pitch for his data-consulting services. Not misleading — it’s clearly labeled — but a different pattern than the disclosed-adviser Estuary slot. Skill should detect both sponsor forms.
The core argument
How you load data into your tables is a real design decision: it affects compute cost, implementation complexity, ease of backfilling, how updates are handled, and tooling constraints. SDG walks through the two primary patterns — full refresh and incremental — and the common incremental sub-patterns.
Full refresh
CREATE OR REPLACE over a SELECT. Simplest possible pipeline. Works well when:
- Dataset is small
- Rebuild cost is low
- Source lacks reliable change tracking
- Simplicity beats efficiency
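A minimal sketch of the full-refresh pattern in warehouse-style SQL (CREATE OR REPLACE TABLE is Snowflake/BigQuery/Databricks syntax; the raw.orders and analytics.orders names are illustrative, not from the article):
-- Full refresh: throw away the old table and rebuild it from source every run.
CREATE OR REPLACE TABLE analytics.orders AS
SELECT
    order_id,
    customer_id,
    order_date,
    total_amount
FROM raw.orders;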
Pair with WAP (write-audit-publish):
- Write — build staged version
- Audit — row count vs prior, null checks on required cols, duplicate key detection, metric range checks. Stop the pipeline if any fail.
- Publish — promote to production only after checks pass.
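A sketch of the full write → audit → publish flow under the same assumptions (illustrative table names; the staging schema, the 10% row-count tolerance, and the specific checks are placeholders for whatever matters for the table):
-- Write: build the candidate version off to the side.
CREATE OR REPLACE TABLE staging.orders_build AS
SELECT * FROM raw.orders;

-- Audit: each check should return zero rows; stop the run otherwise.
SELECT 'row_count_drop' AS failed_check
FROM (SELECT COUNT(*) AS n FROM staging.orders_build) b,
     (SELECT COUNT(*) AS n FROM analytics.orders) p
WHERE b.n < p.n * 0.9;                    -- row count vs prior

SELECT 'null_required_col' AS failed_check
FROM staging.orders_build
WHERE order_id IS NULL OR order_date IS NULL
LIMIT 1;                                  -- null checks on required cols

SELECT 'duplicate_key' AS failed_check
FROM staging.orders_build
GROUP BY order_id
HAVING COUNT(*) > 1
LIMIT 1;                                  -- duplicate key detection

-- Publish: promote only after every audit query comes back empty.
CREATE OR REPLACE TABLE analytics.orders AS
SELECT * FROM staging.orders_build;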
Breaks down as tables grow, refresh windows tighten, or freshness requirements increase.
Incremental — four sub-patterns
1. ID appending
Two variants:
- WHERE id > MAX(id) — faster, assumes in-order arrival and no missed rows.
- WHERE id NOT EXISTS — safer, more expensive, catches missed rows.
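Spelled out as runnable SQL (same illustrative raw.orders / analytics.orders names), the two variants look roughly like this:
-- Variant 1: high-water mark on id. Cheap, but silently skips any row that
-- arrives later with an id below the current max.
INSERT INTO analytics.orders
SELECT *
FROM raw.orders r
WHERE r.id > (SELECT MAX(id) FROM analytics.orders);

-- Variant 2: anti-join. Scans more, but picks up any id the target is missing.
INSERT INTO analytics.orders
SELECT *
FROM raw.orders r
WHERE NOT EXISTS (
    SELECT 1 FROM analytics.orders a WHERE a.id = r.id
);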
2. Date appending
Requires a reliable update_date. SDG gripes that this is far less common than you’d expect — he’s seen Postgres tables with create_date but no update_date, or where the column exists but is never populated.
Naïve version: WHERE order_date > MAX(order_date). Works when data arrives in order, old records don’t change, no late-arriving records.
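As actual SQL (same illustrative names), the naïve version is just a high-water mark on the date column:
-- Naïve date append: only pull rows newer than the latest loaded order_date.
INSERT INTO analytics.orders
SELECT *
FROM raw.orders
WHERE order_date > (SELECT MAX(order_date) FROM analytics.orders);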
Lookback window version (safer):
DELETE FROM analytics.orders WHERE order_date >= CURRENT_DATE - INTERVAL '2 day';
INSERT INTO analytics.orders SELECT * FROM raw.orders WHERE order_date >= CURRENT_DATE - INTERVAL '2 day';
Trades a little compute for a lot more reliability when data can arrive late.
3. Upsert / Merge
MERGE INTO analytics.customers AS target
USING staging.customers AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN UPDATE SET target.email = source.email, target.status = source.status
WHEN NOT MATCHED THEN INSERT (customer_id, email, status) VALUES (source.customer_id, source.email, source.status);
SDG notes Redshift early on didn’t have MERGE, which is how he first encountered the term “upsert.” Platform capabilities shape pipeline design.
4. Aggregation append (accounting pattern)
Common in finance/healthcare where corrections are append-only rather than destructive. Instead of updating a row, you insert a correction row (positive or negative delta). Aggregating the whole key group gives you the correct current state.
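A minimal sketch, assuming a ledger-style analytics.account_entries table (an illustrative name, not from the article): corrections land as new signed rows, and the current state is always an aggregate over the key group:
-- Original entry.
INSERT INTO analytics.account_entries (account_id, entry_date, amount)
VALUES (42, DATE '2026-04-01', 100.00);

-- Later correction: never update the original row; append an offsetting delta.
INSERT INTO analytics.account_entries (account_id, entry_date, amount)
VALUES (42, DATE '2026-04-03', -15.00);

-- Correct current state = sum over the whole key group.
SELECT account_id, SUM(amount) AS current_balance
FROM analytics.account_entries
GROUP BY account_id;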
Mapping against Ray Data Co
- autoinv is mostly full-refresh territory. Our experiments produce dated outputs (pm1e_predictions_20260411-141032.csv) — each run is a new artifact, no in-place updates. Fine for research; irrelevant until we start persisting long-running state.
- The WAP pattern is worth stealing explicitly for any live pipeline. Our BiasAudit is halfway there: it’s a gate between computation and “this strategy is deployable.” Turning it into an explicit Write → Audit → Publish flow would make the discipline more visible. Something to consider when we first paper-trade a strategy.
- If/when we ingest live equity bars or Polymarket trade streams into a persistent store, date-append with a lookback window is the right default. The “max date naive append” approach would miss delayed trades, which is exactly the class of bug that’ll silently corrupt backtests.
- The Redshift MERGE anecdote is the same lesson as our own infrastructure choices: the platform you pick shapes the code you write. We chose file-based outputs over a real database because Claude Code operates on markdown/CSV naturally; that limits some things but simplifies everything else.
Curation section
- “Database Federation: Decentralized and ACL-Compliant Hive Databases” — genuine third-party, Uber engineering blog. 16K+ Hive datasets in a single monolithic database, decomposing into federated ACL-compliant ones. Not applicable at our scale. Skipping deep-dive.
- “Layer by Layer, We Built Data Systems No One Understands” — appears to be SDG’s own prior post (the narrative voice and the “I’ve helped companies save millions on compute” framing matches). Not filing deep-dive; theme (over-layered modern data stacks) is noted but not blocking.
Related
- 2026-02-09-seattle-data-guy-why-data-pipelines-exist
- 2026-02-23-seattle-data-guy-backfills
- 2026-04-07-seattle-data-guy-noisy-data-quality-checks — the “audit” step of WAP is the philosophical anchor for our BiasAudit