“Backfills — The Necessary Evil of Data Engineering” — @SeattleDataGuy
⚠️ Sponsorship
Estuary pre-amble (adviser disclosure). Same pattern.
The core argument
Backfills — rerunning a pipeline against historical data — are unavoidable (late source data, bugs, schema/logic changes), data engineers hate them (cost, scale, time, blast radius, trust erosion), and the right approach depends on the underlying table type and why you’re doing it. Article is meta-commentary on a Zach Wilson / Brian Greene LinkedIn debate about blue-green-swap vs re-runnable pipelines.
Why backfill
- Late or corrected source data — upstream system sends bad SFTP file, now you need to rerun for a specific date range or partner.
- Pipeline bugs — a plain `UPDATE` isn’t enough; you need the full pipeline logic re-applied (sketch below).
- Schema / logic changes — at Facebook, column drops and data-type conversions often required full table rebuilds because the underlying file format (e.g. ORC/Parquet) wouldn’t allow in-place changes.
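Why a plain `UPDATE` falls short, as a minimal sqlite3 sketch (the `orders` table and the 2% fee rule are hypothetical, not from the article): the pipeline derives `fee_usd` from `amount_usd`, so patching the raw value in place silently leaves the derived value stale.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, amount_usd REAL, fee_usd REAL);
    INSERT INTO orders VALUES (1, 100.0, 2.0);  -- fee_usd was derived as 2% of amount_usd
""")

# The tempting one-off fix: patch the bad source value in place.
conn.execute("UPDATE orders SET amount_usd = 150.0 WHERE order_id = 1")

# But the derived column is now stale: fee_usd is still 2.0, not 3.0.
# Only re-applying the pipeline's transform over the affected rows
# restores consistency, which is what a real backfill does.
print(conn.execute("SELECT amount_usd, fee_usd FROM orders").fetchone())  # (150.0, 2.0)
```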
Why engineers dislike them
Scale, cost (rerunning pay-as-you-go jobs hurts), time (blocks daily jobs and consumes engineer hours), blast radius (every downstream user needs a heads-up), and trust erosion (stakeholders see numbers change and start questioning the data).
The Zach-vs-Brian debate, decoded
Same phenomenon viewed from two infrastructure positions:
- Zach (blue-green swap): build table v2 alongside v1, swap. Safer for partition-heavy environments (Facebook-scale, ORC/Parquet, 180+ partitions) because re-running in place leaves windows of inconsistency across downstream readers.
- Brian (re-runnable pipeline): pipeline should be idempotent and safely replayable in place. Works well in modern, partition-aware stacks where write-once-read-many storage makes this cleaner.
SDG’s conclusion: both are right for different stacks. The key discipline is “don’t run a bunch of random SQL scripts against production” — build a repeatable process that balances safety and verification, regardless of which pattern your stack favors.
He also cites Albert Campillio’s blue-green diagram and flags a subtle risk: if the swap is two ALTER statements and you lose connection between them, you’ve created a window where the production table doesn’t exist. Atomicity matters.
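A minimal sketch of the atomic version in SQLite (table names hypothetical; SQLite supports transactional DDL, which many warehouses do not, so check your engine before copying the pattern):

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # manage transactions by hand
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE events (id INTEGER);             -- current prod table (v1)
    CREATE TABLE events_v2 (id INTEGER, x REAL);  -- rebuilt copy, verified offline
""")

# Risky pattern: two standalone ALTERs. Lose the connection between them
# and there is a window where no `events` table exists for readers.
# Safer pattern: both renames inside one transaction, so the swap is
# all-or-nothing.
cur.execute("BEGIN")
try:
    cur.execute("ALTER TABLE events RENAME TO events_old")
    cur.execute("ALTER TABLE events_v2 RENAME TO events")
    cur.execute("COMMIT")
except Exception:
    cur.execute("ROLLBACK")  # a failed swap leaves v1 untouched
    raise
```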
Two table-type approaches
- Traditional table + SFTP source — parameterize by partner ID and date range; the pipeline rerun handles delete-and-replace atomically (sketch after this list). Works fine.
- Partition-based tables at scale — 180 partitions × multi-step pipeline = inconsistency windows across downstream readers. Swap tables instead of rerunning in place.
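For the traditional-table case, a minimal sketch of the parameterized delete-and-replace rerun (the `partner_daily` table and column names are hypothetical):

```python
import sqlite3

def backfill_slice(conn: sqlite3.Connection, partner_id: str,
                   start_day: str, end_day: str, fresh_rows) -> None:
    """Re-run one partner/date-range slice as delete-and-replace.

    Idempotent: rerunning with the same inputs yields the same state,
    so a failed or repeated backfill is safe.
    """
    with conn:  # one transaction: commit on success, rollback on error
        conn.execute(
            "DELETE FROM partner_daily WHERE partner_id = ? AND day BETWEEN ? AND ?",
            (partner_id, start_day, end_day),
        )
        conn.executemany(
            "INSERT INTO partner_daily (partner_id, day, value) VALUES (?, ?, ?)",
            fresh_rows,
        )
```

Because the delete and insert share one transaction, downstream readers never see a half-replaced slice; a corrected SFTP file for one partner is just another call with the same parameters.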
Limiting backfill frequency
- Reliable data quality checks catch bad data at ingest, not after (sketch after this list).
- Pipelines designed to be re-runnable and parameterized from day one.
- No one-off fixes in production.
- Understand your storage format’s limitations before choosing it.
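On the first point, a minimal sketch of an ingest-time gate (column names and checks are illustrative; a real suite would live in dbt tests, Great Expectations, or similar):

```python
def validate_ingest(rows: list[dict]) -> None:
    """Reject an obviously bad source file before it lands in prod."""
    errors = []
    for i, row in enumerate(rows):
        if not row.get("partner_id"):
            errors.append(f"row {i}: missing partner_id")
        amount = row.get("amount_usd")
        if amount is None or amount < 0:
            errors.append(f"row {i}: bad amount_usd {amount!r}")
    if errors:
        # Failing one file loudly is a cheap rerun; loading it silently
        # is tomorrow's full backfill plus a round of stakeholder emails.
        raise ValueError("ingest rejected:\n" + "\n".join(errors))
```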
Mapping against Ray Data Co
- autoinv is in the “traditional table” regime — SQLite / parquet files, small enough to just rerun. No partition-swap pattern needed.
- The “don’t run random SQL against production” discipline applies metaphorically to research runs. Our scripts write dated outputs to `experiments/outputs/`, which is the moral equivalent of blue-green: every run is a new artifact, nothing gets silently overwritten. eq1 → eq2 → eq3 PEAD runs all sit side-by-side, so any regression is traceable (sketch after this list).
- PM1e lesson from this morning: the April 3-10 Elon event predictions got written after the event closed, meaning the `market_mid` column was post-resolution. That’s the research-lab version of “ran random SQL against production.” I caught it, but the lesson is that even read-only backfills need timing discipline. Already fixed by re-running the forecaster pre-resolution for the April 10-17 event.
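A minimal sketch of that dated-artifact convention (the naming scheme below is a guess for illustration, not the repo’s actual layout):

```python
from datetime import date
from pathlib import Path

def new_run_dir(run_name: str, base: str = "experiments/outputs") -> Path:
    """Allocate a fresh dated directory for one research run.

    Blue-green in miniature: each run writes a new artifact beside the
    old ones, so runs sit side by side and regressions stay diffable.
    """
    run_dir = Path(base) / f"{date.today():%Y-%m-%d}-{run_name}"
    run_dir.mkdir(parents=True, exist_ok=False)  # refuse to clobber an existing run
    return run_dir
```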
Curation section
- “The Analytical Skills No One Teaches You” — SDG’s own prior article (Olga Berezovsky guest post). Already filed as 2026-01-23-seattle-data-guy-analytical-skills. Self-cross-promo.
- “The Insanity of Data Education” — appears to be from another Substack publication (a data-modeling survey of 1,100+ professionals, 89% of whom report struggling). Could be genuine third-party or another self-reference. Skipping deep-dive; the headline finding (“59% cite pressure to move fast, 51% lack clear ownership”) is consistent with themes already filed.
Related
- 2026-01-23-seattle-data-guy-analytical-skills
- 2026-02-09-seattle-data-guy-why-data-pipelines-exist
- ../01-projects/automated-investing/experiments/pm1e-elon-forecast — the “timing discipline” lesson from this morning