seattle data guy backfills

Sun Feb 22 2026 19:00:00 GMT-0500 (Eastern Standard Time) · reference · source: SeattleDataGuy's Newsletter (Substack) · by SeattleDataGuy (Ben Rogojan)

“Backfills — The Necessary Evil of Data Engineering” — @SeattleDataGuy

⚠️ Sponsorship

Estuary pre-amble (adviser disclosure). Same pattern.

The core argument

Backfills (rerunning a pipeline against historical data) are unavoidable: source data arrives late, bugs get fixed, and schemas or logic change. Data engineers hate them because of cost, scale, time, blast radius, and trust erosion. The right approach depends on the underlying table type and on why you are backfilling. The article is meta-commentary on a Zach Wilson / Brian Greene LinkedIn debate about blue-green swaps vs. re-runnable pipelines.

Why backfill

Why engineers dislike them

Scale; cost (rerunning pay-as-you-go jobs hurts); time (a backfill blocks daily jobs and consumes engineer hours); blast radius (every downstream user needs a heads-up); and trust erosion (when stakeholders see numbers change, they start questioning the data).

The Zach-vs-Brian debate, decoded

Same phenomenon viewed from two infrastructure positions:

SDG’s conclusion: both are right for different stacks. The key discipline is “don’t run a bunch of random SQL scripts against production” — build a repeatable process that balances safety and verification, regardless of which pattern your stack favors.

He also cites Albert Campillio’s blue-green diagram and flags a subtle risk: if the swap is two ALTER statements and you lose connection between them, you’ve created a window where the production table doesn’t exist. Atomicity matters.
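A minimal sketch of the atomicity point, assuming an engine with transactional DDL. SQLite is used here only because it runs anywhere. The table names `orders` and `orders_staging`, the `swap_tables` helper, and the `fail_between` flag are hypothetical and do not come from the article. Both renames sit inside one transaction, so a failure between them rolls back and the production table never goes missing.

```python
import sqlite3


def make_db():
    """Build a throwaway database with a production table and a staging table."""
    # isolation_level=None disables implicit transactions, so the explicit
    # BEGIN / COMMIT / ROLLBACK statements below are the only ones issued.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
    conn.execute("CREATE TABLE orders_staging (id INTEGER, amount REAL)")
    conn.execute("INSERT INTO orders VALUES (1, 10.0)")          # current numbers
    conn.execute("INSERT INTO orders_staging VALUES (1, 12.5)")  # backfilled numbers
    return conn


def swap_tables(conn, prod, staging, fail_between=False):
    """Promote `staging` to `prod`, keeping the old data as `<prod>_old`.

    Table names are interpolated directly, so they must be trusted identifiers.
    """
    old = f"{prod}_old"
    conn.execute("BEGIN")
    try:
        conn.execute(f"ALTER TABLE {prod} RENAME TO {old}")
        if fail_between:
            # Stand-in for a connection dropping between the two statements.
            raise ConnectionError("simulated dropped connection")
        conn.execute(f"ALTER TABLE {staging} RENAME TO {prod}")
        conn.execute("COMMIT")
    except Exception:
        # Undo the first rename so `prod` still exists under its own name.
        conn.execute("ROLLBACK")
        raise
```

Not every warehouse rolls back DDL, so check how your engine handles renames before relying on this pattern. Some engines offer a single-statement swap instead, which avoids the gap without needing a transaction.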

Two table-type approaches

  1. Traditional table + SFTP source — parameterize by partner ID and date range; the pipeline rerun handles delete-and-replace atomically. Works fine.
  2. Partition-based tables at scale — 180 partitions × multi-step pipeline = inconsistency windows across downstream readers. Swap tables instead of rerunning in place.
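The first approach can be sketched as a parameterized delete-and-replace. This is an illustration under assumptions, not the article's code: the `events` table, its columns, and the `make_events_table` and `backfill` helpers are hypothetical, and SQLite stands in for the real warehouse. The delete and the insert share one transaction, so readers see either the old rows or the new rows for the chosen partner and date range, never an empty gap.

```python
import sqlite3


def make_events_table():
    """Create an in-memory `events` table for the example."""
    # isolation_level=None means no implicit transactions; backfill() opens its own.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        "CREATE TABLE events (partner_id TEXT, event_date TEXT, amount REAL)"
    )
    return conn


def backfill(conn, partner_id, start, end, rows):
    """Replace one partner's rows for the inclusive range [start, end].

    `rows` is an iterable of (event_date, amount) pairs. Dates are ISO
    strings (YYYY-MM-DD), so plain string comparison orders them correctly.
    """
    # Ignore any supplied rows that fall outside the requested window.
    in_range = [(partner_id, d, a) for d, a in rows if start <= d <= end]
    conn.execute("BEGIN")
    try:
        conn.execute(
            "DELETE FROM events "
            "WHERE partner_id = ? AND event_date BETWEEN ? AND ?",
            (partner_id, start, end),
        )
        conn.executemany(
            "INSERT INTO events (partner_id, event_date, amount) VALUES (?, ?, ?)",
            in_range,
        )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

Because the delete is scoped by the same parameters as the insert, running the same backfill twice leaves the table in the same state, and rows for other partners or dates are untouched.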

Limiting backfill frequency

Mapping against Ray Data Co

Curation section