“Data Pipeline Foundations” — @SeattleDataGuy
Why this is in the vault
This is Rogojan’s meta-index of his own data-pipeline writing — a curated table-of-contents he intends to keep updating. For RDCO it’s useful as a single-reference map of the pipeline vocabulary we’ve been accumulating from the SDG backfill, and it confirms which topics Rogojan himself treats as the foundational set (sources, processing, building, operating, beyond-the-basics).
Sponsorship
Estuary preamble at top — the standard “adviser + past client work” disclosure SDG uses in roughly 60% of issues. Explicit and early. Bias angle: Estuary sells managed ingestion/CDC, so any SDG piece that lands on “CDC is the pattern you want” should be read against that backdrop. This particular issue is a link roundup, not an argument piece, so Estuary bias is low-impact here.
Issue contents
Rogojan gathers his prior pipeline articles into five buckets and adds a short framing paragraph per bucket:
- Where Data Comes From — SFTP, APIs, database types (relational/document/time-series). The SFTP essay and the “still passing data via email” one-liner make his recurring point that enterprise data movement is less modern than Twitter implies.
- How Data Gets Processed — why pipelines exist at all, the “T” in ETL, full-refresh vs incremental, CDC, batch vs stream. Most of these are backfilled already under SDG slugs.
- Building Real Pipelines — pipeline patterns, what it takes to build a system, Prefect/Mage/Airflow comparison (guest-authored by Daniel Beach).
- Operating Pipelines in Production — backfills, “why your pipeline isn’t production-ready”, noisy data-quality checks. This is the operations discipline bucket.
- Beyond the Basics — migrations, Snowflake warehouse (by Inmon), Joe Reis “Fundamentals Are Gravity”, Airflow deployment mistakes, data modeling war stories.
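The full-refresh vs incremental distinction in the “How Data Gets Processed” bucket can be sketched in a few lines. This is a hedged toy illustration, not code from the article — the table, column names, and watermark are hypothetical:

```python
from datetime import datetime

# Toy source table: each row carries an updated_at timestamp.
SOURCE = [
    {"id": 1, "value": "a", "updated_at": datetime(2026, 1, 1)},
    {"id": 2, "value": "b", "updated_at": datetime(2026, 1, 5)},
    {"id": 3, "value": "c", "updated_at": datetime(2026, 1, 9)},
]

def full_refresh(source):
    """Full refresh: drop the target and reload every row, every run."""
    return list(source)

def incremental(source, watermark):
    """Incremental: pull only rows changed since the last high-water mark."""
    return [row for row in source if row["updated_at"] > watermark]

# Full refresh always moves all 3 rows; the incremental run with a
# watermark of Jan 4 moves only the 2 rows updated after it.
refreshed = full_refresh(SOURCE)
delta = incremental(SOURCE, datetime(2026, 1, 4))
```

The trade-off the SDG pieces keep circling: full refresh is simple and self-healing but its cost grows with table size, while incremental (and, further along, CDC) moves less data per run at the price of tracking state.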
His closing point: even in the AI era the job is still “get data into a central location, standardize, integrate, make queryable” — and AI-era data modeling looks a lot like self-service-era data modeling (“just-in-time data modeling” per Joe Reis).
Mapping against Ray Data Co
Three useful things for us:
- Taxonomy validation. Rogojan’s five buckets (sources / processing / building / operating / beyond-basics) are a reasonable skeleton for any data-pipeline discipline article we write. If the data-quality-framework project needs a pipeline-lifecycle scaffold, this is a known-good outline to borrow.
- “Every pipeline is a liability” — his one-liner under Operating Pipelines. Worth promoting to a durable RDCO principle: pipeline count is a cost, not an asset, and AI that lets us write more pipelines faster increases future migration/backfill/quality load proportionally. Directly relevant to the agent-deployer thesis: AI agents that generate pipelines on demand are creating future operational debt unless we also automate the operate-in-production half.
- “Just-in-time data modeling” = AI-era self-service. Rogojan (crediting Reis) flags that teams are building new tables per use case instead of modeling coherently. This reinforces the positioning gap we’ve already been tracking — Model Acceptance Criteria exists precisely because the AI era makes proliferation cheap and coherence expensive.
No contradictions with existing vault positions. The article is mostly confirmatory.
Curation section — notes
Only two items in the “Articles Worth Reading” block this issue:
- “Under the Hood: Scaling Responsible AI at Uber” — third-party (Uber engineering blog). Covers Model Catalog, feature-importance explainability, early compliance checks. Adjacent to data-quality-framework but heavy enterprise-governance framing. Did not deep-fetch — topic is adjacent rather than core, and Uber’s Responsible AI program is not something we have a specific RDCO-angle question about right now. Flag for possible future fetch if we start writing about model governance tooling.
- “5 Key Predictions for the Data Industry in 2026” — no author or source byline shown in the email; the domain resolves behind a Substack redirect. Predictions pieces are low-rigor by genre. Did not deep-fetch. Likely a cross-promo slot (another Substack author).
No self-cross-promo detected in curation this issue. The “internal” self-promotion is entirely in the main body (the whole article IS a self-promotion index of his own pipeline essays), which is disclosed transparently as “here’s my series.”
Related
- 2026-01-05-seattle-data-guy-data-pipeline-patterns — referenced directly under “Building Real Pipelines”
- 2026-01-14-seattle-data-guy-build-a-pipeline-system — referenced directly
- 2026-02-09-seattle-data-guy-why-data-pipelines-exist — referenced directly
- 2026-02-23-seattle-data-guy-backfills — referenced directly
- 2026-03-17-seattle-data-guy-full-refresh-vs-incremental — referenced directly
- 2026-04-07-seattle-data-guy-noisy-data-quality-checks — referenced directly
- 2026-03-25-seattle-data-guy-know-nothing-and-be-happy — thematic sibling on the AI-era data landscape
- 2026-04-04-dedp-history-state-de — Inmon lineage, connects to the “SNOWFLAKE AND DATA WAREHOUSE” curation link
Copyright note
All framing above is paraphrase. No direct quotes longer than a few words. Original piece at the source_url.