
2026-04-17 · reference · source: SeattleDataGuy's Newsletter · by Ben Rogojan (SeattleDataGuy)

“Data Pipeline Foundations” — @SeattleDataGuy

Why this is in the vault

This is Rogojan’s meta-index of his own data-pipeline writing — a curated table-of-contents he intends to keep updating. For RDCO it’s useful as a single-reference map of the pipeline vocabulary we’ve been accumulating from the SDG backfill, and it confirms which topics Rogojan himself treats as the foundational set (sources, processing, building, operating, beyond-the-basics).

Sponsorship

Estuary preamble at top — the standard “adviser + past client work” disclosure SDG uses in roughly 60% of issues. Explicit and early. Bias angle: Estuary sells managed ingestion/CDC, so any SDG piece that lands on “CDC is the pattern you want” should be read against that backdrop. This particular issue is a link roundup, not an argument piece, so Estuary bias is low-impact here.

Issue contents

Rogojan gathers his prior pipeline articles into five buckets (sources, processing, building, operating, beyond the basics) and adds a short framing paragraph per bucket.

His closing point: even in the AI era the job is still “get data into a central location, standardize, integrate, make queryable” — and AI-era data modeling looks a lot like self-service-era data modeling (“just-in-time data modeling,” per Joe Reis).
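For my own reference, Rogojan's four-step framing (centralize, standardize, integrate, make queryable) reduces to a pattern like the sketch below. The source data, column names, and SQLite-as-warehouse stand-in are all invented for illustration, not anything from the article:

```python
import sqlite3

# Two hypothetical source systems with inconsistent column names.
source_a = [{"UserId": 1, "Amt": 10.0}, {"UserId": 2, "Amt": 5.5}]
source_b = [{"user": 3, "amount": 7.25}]

def standardize(row):
    # Map each source's columns onto one shared schema.
    return {
        "user_id": row.get("UserId") or row.get("user"),
        "amount": row.get("Amt") or row.get("amount"),
    }

# Centralize: land everything in one store (in-memory SQLite as a
# stand-in for a warehouse).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (user_id INTEGER, amount REAL)")

# Integrate: union the standardized rows into one table.
for row in map(standardize, source_a + source_b):
    conn.execute(
        "INSERT INTO orders VALUES (?, ?)", (row["user_id"], row["amount"])
    )

# Make queryable: consumers hit one table with plain SQL.
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(total)  # 22.75
```

The point of the sketch is how little of it is "the pipeline" — most of the work is the `standardize` mapping, which is exactly where the just-in-time-modeling proliferation problem lives.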

Mapping against Ray Data Co

Three useful things for us:

  1. Taxonomy validation. Rogojan’s five buckets (sources / processing / building / operating / beyond-basics) are a reasonable skeleton for any data-pipeline discipline article we write. If the data-quality-framework project needs a pipeline-lifecycle scaffold, this is a known-good outline to borrow.
  2. “Every pipeline is a liability” — his one-liner under Operating Pipelines. Worth promoting to a durable RDCO principle: pipeline count is a cost, not an asset, and AI that lets us write more pipelines faster increases future migration/backfill/quality load proportionally. Directly relevant to the agent-deployer thesis: AI agents that generate pipelines on demand are creating future operational debt unless we also automate the operate-in-production half.
  3. “Just-in-time data modeling” = AI-era self-service. Rogojan (crediting Reis) flags that teams are building new tables per use case instead of modeling coherently. This reinforces the positioning gap we’ve already been tracking — Model Acceptance Criteria exists precisely because the AI era makes proliferation cheap and coherence expensive.

No contradictions with existing vault positions. The article is mostly confirmatory.

Curation section — notes

Only two items in the “Articles Worth Reading” block this issue:

  1. “Under the Hood: Scaling Responsible AI at Uber” — third-party (Uber engineering blog). Covers Model Catalog, feature-importance explainability, early compliance checks. Adjacent to data-quality-framework but heavy enterprise-governance framing. Did not deep-fetch — topic is adjacent rather than core, and Uber’s Responsible AI program is not something we have a specific RDCO-angle question about right now. Flag for possible future fetch if we start writing about model governance tooling.
  2. “5 Key Predictions for the Data Industry in 2026” — no author or source byline shown in the email; the link resolves through a Substack redirect. Predictions pieces are low-rigor by genre. Did not deep-fetch. Likely a cross-promo slot (another Substack author).

No self-cross-promo detected in curation this issue. The “internal” self-promotion is entirely in the main body (the whole article IS a self-promotion index of his own pipeline essays), which is disclosed transparently as “here’s my series.”

All framing above is paraphrase. No direct quotes longer than a few words. Original piece at the source_url.