“What It Actually Takes to Build a Data Pipeline System” — @SeattleDataGuy
Why this is in the vault
Second article in SDG’s 2026 pipeline series. Component-level inventory of what lives inside a pipeline system. Useful as a checklist against our own autoinv architecture — tells us which boxes we have, which we skipped, and which we can defer.
⚠️ Sponsorship disclosure — flagged
The article opens with a preamble for Estuary, a data-movement platform SDG is a disclosed adviser for. Quote: “Estuary, a platform I’ve used to help make clients’ data workflows easier and am an adviser for.” Not a paid ad placement per se, but a vested-interest call-out that warrants the same bias weight. SDG is transparent about the relationship — the disclosure is upfront, not hidden — but future references to Estuary in his content should be treated as non-neutral.
No other sponsors detected in this issue. The rest reads as his own material.
The core argument
If you were to build a data pipeline system from scratch (and SDG notes most teams shouldn’t — they should buy Airflow/dbt/etc), here are the nine components you’d actually end up writing.
The nine components
- Secrets & connection management — without shared source/destination config, the rest is orphaned SQL and Python doing nothing.
- Logging & monitoring — “library not found” vs “table not found” traceability. Gets more important as AI generates more pipeline code.
- Dependency awareness (DAGs) — some pipelines don’t need full DAGs, but they all need some form of “what runs after what.” Airflow’s `set_downstream`, dbt’s `{{ ref() }}` (first sketch after this list).
- Execution engine routers — newer concept. Teams now want to use DuckDB + Presto + Spark + Databricks + Snowflake in different parts of the same pipeline. Expect routing layers that pick the compute engine based on cost/speed/data size; Facebook did this manually (second sketch after this list).
- Scheduler — cron, Jenkins, SSIS/SQL Agent, Airflow. Airflow’s scheduler is special because it separates when work should happen from how it runs, which is what enables backfills/retries/complex deps.
- Pipelines themselves — the actual DSL or code that defines the transformation (dbt models, Airflow tasks, Glue jobs).
- Data quality checks — built-in, not bolt-on. “If it’s not easy to integrate, people won’t do it.”
- UI — technically optional, effectively mandatory at any real scale because non-engineers need to filter/rerun/debug.
- Operational concerns (not components but needed): idempotency/backfill safety, ownership tracking, alerting + on-call routing, environment isolation (dev/test/prod, ideally with natural dev flow).
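For the dependency-awareness and scheduler bullets, a minimal Airflow-style sketch (Airflow 2.x imports; the DAG id and task functions are placeholders invented here, not anything from autoinv or the article). The point is the separation SDG calls out: the DAG declares when work should happen and what runs after what, the operators own how each step runs, and `catchup=True` is what makes backfills possible.

```python
# Minimal Airflow 2.x-style sketch. DAG id, schedule, and task bodies are
# illustrative placeholders only.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    """Placeholder extract step."""
    ...


def transform():
    """Placeholder transform step."""
    ...


with DAG(
    dag_id="example_pipeline",
    start_date=datetime(2026, 1, 1),
    schedule_interval="@daily",  # when work should happen
    catchup=True,                # lets the scheduler generate backfill runs
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    # Dependency awareness: what runs after what (same as extract_task >> transform_task)
    extract_task.set_downstream(transform_task)
```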
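For the execution-engine-router bullet, a purely hypothetical sketch of the routing idea, assuming the only signal is approximate data size (thresholds and engine choices are made up for illustration; a real router would also weigh cost and latency):

```python
# Hypothetical execution-engine router: pick a compute engine from a rough
# data-size estimate. Thresholds and engine names are illustrative only.
GIB = 1024 ** 3


def choose_engine(approx_bytes: int) -> str:
    """Return the engine a routing layer might dispatch this step to."""
    if approx_bytes < 10 * GIB:
        return "duckdb"  # small: single-node, cheap and fast to start
    if approx_bytes < 500 * GIB:
        return "presto"  # medium: interactive distributed SQL
    return "spark"       # large: heavy distributed batch


print(choose_engine(2 * GIB))  # -> "duckdb"
```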
Mapping against Ray Data Co — autoinv scorecard
Running this checklist against our own autoinv package:
| SDG component | Our status | Notes |
|---|---|---|
| Secrets & connection mgmt | ⚠️ partial | 1Password wrappers for MCP creds, but autoinv API keys still via env vars. Could tighten. |
| Logging & monitoring | ⚠️ partial | Scripts print to stdout, no structured logs. Fine at current scale. |
| Dependency awareness (DAGs) | ❌ | No orchestration. Scripts run manually via /loop or by hand. Deferred until we have >3 dependent pipelines. |
| Execution engine routers | ❌ N/A | Single-machine Python. Not a problem worth solving. |
| Scheduler | ⚠️ partial | /loop skill + the 4am nightly restart launchd job handles everything so far. Not truly a scheduler yet. |
| Pipelines themselves | ✅ | autoinv package modules (data, pricing, metrics, polymarket, kalshi, engine). |
| Data quality checks | ✅ | BiasAudit is our version. PEAD eq3’s survivorship flag just caught a drawdown the headline return would have hidden. |
| UI | ❌ N/A | Zero real users. Claude Code IS the UI. |
| Idempotency / backfill | ✅ | Scripts save dated outputs to experiments/outputs/, reruns don’t break prior artifacts (sketch below the table). |
| Ownership | ❌ N/A | Single-operator project. |
| Alerting / on-call | ⚠️ partial | No pager; channel reply tool surfaces issues. Fine at scale-1. |
| Environment isolation | ❌ N/A | Single .venv, no prod/test split. |
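For the idempotency row, a sketch of the dated-output pattern the table describes (function name and the pandas/Parquet call are assumptions about the shape of the code, not the actual autoinv scripts):

```python
# Sketch of the dated-output pattern: each run writes under a date-stamped
# directory, so a rerun replaces only today's artifact and never touches prior
# ones. Paths and names are illustrative; to_parquet needs pyarrow or fastparquet.
from datetime import date
from pathlib import Path

import pandas as pd


def save_dated(df: pd.DataFrame, name: str, root: str = "experiments/outputs") -> Path:
    out_dir = Path(root) / date.today().isoformat()
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{name}.parquet"
    df.to_parquet(out_path)
    return out_path
```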
Takeaway: six of nine are genuinely N/A at our scale or better handled by Claude Code than custom tooling. The two gaps worth watching are logging/monitoring (if any strategy goes to paper-trade or live execution, structured logs become non-optional) and scheduler (if /loop grows more than a few concurrent rhythms, we’ll want real task management, probably Temporal or a lightweight durable-execution layer).
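If the logging gap ever needs closing, a stdlib-only structured-logging sketch that could replace the stdout prints (logger name and the fields emitted are illustrative, not current autoinv code):

```python
# Stdlib-only structured logging: each record becomes one JSON line, so
# "library not found" vs "table not found" can be filtered mechanically.
import json
import logging


class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("autoinv.pipeline")  # illustrative logger name
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("loaded prices table rows=%d", 1250)
```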
Curation section — bottom links
Two pieces, both appear to be SDG’s own blog content used for cross-promotion (not third-party curation):
- “Behind the Scenes of SQL: Understanding SQL Query Execution” — basic SQL internals piece. Skipping deep-dive; not relevant to our work.
- “Back To The Basics: What Is Columnar Storage” — columnar storage overview (Parquet, analytical use cases). Skipping deep-dive; we use Parquet via `autoinv.data` but we don’t need the explainer.
Noting that the “curation” section in SDG’s newsletter is sometimes self-promotion dressed as curation. The skill needs to detect that pattern (same domain, same author byline) and label accordingly.
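A hypothetical helper for that detection, assuming the skill has each link’s URL, the newsletter’s own domain, and optionally bylines to compare (the function name, signature, and example values are invented here):

```python
# Hypothetical self-promotion check: flag "curation" links that share the
# newsletter's own domain or repeat the newsletter author's byline.
from urllib.parse import urlparse


def is_self_promotion(link_url: str, newsletter_domain: str,
                      link_byline: str | None = None,
                      newsletter_author: str | None = None) -> bool:
    same_domain = urlparse(link_url).netloc.lower().endswith(newsletter_domain.lower())
    same_author = (
        link_byline is not None
        and newsletter_author is not None
        and link_byline.strip().lower() == newsletter_author.strip().lower()
    )
    return same_domain or same_author


# Example with made-up values:
# is_self_promotion("https://www.example-newsletter.com/post", "example-newsletter.com")  # True
```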
Related
- 2026-01-05-seattle-data-guy-data-pipeline-patterns — previous in series
- 2026-04-07-seattle-data-guy-noisy-data-quality-checks — the “fewer better checks” article that aligns with our `BiasAudit` design
- ../01-projects/automated-investing/autoinv/README — subject of the scorecard above