06-reference

SeattleDataGuy — build a data pipeline system

2026-01-13 · reference · source: SeattleDataGuy's Newsletter (Substack) · by SeattleDataGuy (Ben Rogojan)

“What It Actually Takes to Build a Data Pipeline System” — @SeattleDataGuy

Why this is in the vault

Second article in SDG’s 2026 pipeline series. Component-level inventory of what lives inside a pipeline system. Useful as a checklist against our own autoinv architecture — tells us which boxes we have, which we skipped, and which we can defer.

⚠️ Sponsorship disclosure — flagged

The article opens with a preamble for Estuary, a data-movement platform SDG is a disclosed adviser for. Quote: “Estuary, a platform I’ve used to help make clients’ data workflows easier and am an adviser for.” Not a paid ad placement per se, but a vested interest that warrants the same bias weight. SDG is transparent about the relationship — the disclosure is upfront, not hidden — but future references to Estuary in his content should be treated as non-neutral.

No other sponsors detected in this issue. The rest reads as his own material.

The core argument

If you were to build a data pipeline system from scratch (and SDG notes most teams shouldn’t — they should buy or adopt Airflow/dbt/etc.), here are the nine components you’d actually end up writing.

The nine components

  1. Secrets & connection management — without shared source/destination config, the rest is orphaned SQL and Python doing nothing.
  2. Logging & monitoring — “library not found” vs “table not found” traceability. Gets more important as AI generates more pipeline code.
  3. Dependency awareness (DAGs) — some pipelines don’t need full DAGs, but they all need some form of “what runs after what.” Airflow’s set_downstream, dbt’s {{ ref() }}.
  4. Execution engine routers — newer concept. Teams now want to use DuckDB + Presto + Spark + Databricks + Snowflake in different parts of the same pipeline. Expect routing layers that pick a compute engine based on cost/speed/data size. Facebook did this manually.
  5. Scheduler — cron, Jenkins, SSIS/SQL Agent, Airflow. Airflow’s scheduler is special because it separates when work should happen from how it runs, which is what enables backfills/retries/complex deps.
  6. Pipelines themselves — the actual DSL or code that defines the transformation (dbt models, Airflow tasks, Glue jobs).
  7. Data quality checks — built-in, not bolt-on. “If it’s not easy to integrate, people won’t do it.”
  8. UI — technically optional, effectively mandatory at any real scale because non-engineers need to filter/rerun/debug.
  9. Operational concerns (not components but needed): idempotency/backfill safety, ownership tracking, alerting + on-call routing, environment isolation (dev/test/prod, ideally with natural dev flow).
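Item 3’s “what runs after what” can be sketched without any orchestrator: dependency awareness is, at bottom, a topological sort over declared upstream steps. A minimal sketch using the standard library (the step names are hypothetical, not from the article):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical pipeline: each key lists the steps that must finish first.
deps = {
    "extract": [],
    "stage": ["extract"],
    "transform": ["stage"],
    "quality_check": ["transform"],
    "publish": ["transform", "quality_check"],
}

def run_order(dag: dict[str, list[str]]) -> list[str]:
    """Return one valid execution order; raises CycleError on a cycle."""
    return list(TopologicalSorter(dag).static_order())

order = run_order(deps)  # "extract" first, "publish" last
```

Airflow’s `set_downstream` and dbt’s `{{ ref() }}` are, in effect, two different front-ends for building this same graph.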
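Item 4’s routing layer can be sketched as a rule table keyed on estimated input size and latency need. The thresholds and engine choices below are assumptions for illustration, not from the article:

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    est_bytes: int            # estimated input size
    interactive: bool = False  # is a human waiting on the result?

def pick_engine(job: Job) -> str:
    """Route a job to a compute engine (illustrative thresholds only)."""
    GB = 1024 ** 3
    if job.est_bytes < 10 * GB:
        return "duckdb"   # small enough: single-node is cheapest
    if job.interactive:
        return "presto"   # medium/large + human waiting: low-latency MPP
    return "spark"        # large batch: horizontal scale

pick_engine(Job("daily_rollup", est_bytes=2 * 1024**3))  # -> "duckdb"
```

A real router would fold in cost quotas and historical runtimes, but the shape — job metadata in, engine name out — is the whole idea.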

Mapping against Ray Data Co — autoinv scorecard

Running this checklist against our own autoinv package:

| SDG component | Our status | Notes |
| --- | --- | --- |
| Secrets & connection mgmt | ⚠️ partial | 1Password wrappers for MCP creds, but autoinv API keys still via env vars. Could tighten. |
| Logging & monitoring | ⚠️ partial | Scripts print to stdout, no structured logs. Fine at current scale. |
| Dependency awareness (DAGs) | ❌ none | No orchestration. Scripts run manually via /loop or by hand. Deferred until we have >3 dependent pipelines. |
| Execution engine routers | ❌ N/A | Single-machine Python. Not a problem worth solving. |
| Scheduler | ⚠️ partial | /loop skill + the 4am nightly restart launchd job handle everything so far. Not truly a scheduler yet. |
| Pipelines themselves | ✅ | autoinv package modules (data, pricing, metrics, polymarket, kalshi, engine). |
| Data quality checks | ✅ | BiasAudit is our version. PEAD eq3’s survivorship flag just caught a drawdown the headline return would have hidden. |
| UI | ❌ N/A | Zero real users. Claude Code IS the UI. |
| Idempotency / backfill | ✅ | Scripts save dated outputs to experiments/outputs/; reruns don’t break prior artifacts. |
| Ownership | ❌ N/A | Single-operator project. |
| Alerting / on-call | ⚠️ partial | No pager; channel reply tool surfaces issues. Fine at scale-1. |
| Environment isolation | ❌ N/A | Single .venv, no prod/test split. |

Takeaway: six of nine are genuinely N/A at our scale or better handled by Claude Code than custom tooling. The two gaps worth watching are logging/monitoring (if any strategy goes to paper-trade or live execution, structured logs become non-optional) and scheduler (if /loop grows more than a few concurrent rhythms, we’ll want real task management, probably Temporal or a lightweight durable-execution layer).
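Before reaching for Temporal, the “when vs how” separation the takeaway flags can be sketched as a pure function that computes which logical runs are owed, leaving execution to whatever runs them. Names and the example schedule are hypothetical:

```python
from datetime import datetime, timedelta

def due_runs(last_run: datetime, every: timedelta, now: datetime) -> list[datetime]:
    """Logical run timestamps owed since last_run -- the 'when'.

    Returning timestamps instead of executing anything is what makes
    backfills cheap: a caller can replay missed intervals explicitly.
    """
    runs = []
    t = last_run + every
    while t <= now:
        runs.append(t)
        t += every
    return runs

# Nightly 4am job that last ran three days ago -> three owed runs.
owed = due_runs(datetime(2026, 1, 10, 4), timedelta(days=1), datetime(2026, 1, 13, 5))
```

This is roughly the trick behind Airflow’s scheduler: schedule state lives apart from task execution, which is what makes retries and backfills tractable.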

Curation section

Two pieces, both apparently SDG’s own blog content used for cross-promotion rather than third-party curation.

Note: the “curation” section in SDG’s newsletter is sometimes self-promotion dressed as curation. The skill needs to detect that pattern (same domain, same author byline) and label accordingly.