“What It Actually Takes to Build a Data Pipeline System” — @SeattleDataGuy
Why this is in the vault
Second article in SDG’s 2026 pipeline series. Component-level inventory of what lives inside a pipeline system. Useful as a checklist against our own autoinv architecture — tells us which boxes we have, which we skipped, and which we can defer.
⚠️ Sponsorship disclosure — flagged
The article opens with a preamble for Estuary, a data-movement platform SDG is a disclosed adviser for. Quote: “Estuary, a platform I’ve used to help make clients’ data workflows easier and am an adviser for.” Not a paid ad placement per se, but a vested-interest call-out that warrants the same bias weight. SDG is transparent about the relationship — the disclosure is upfront, not hidden — but future references to Estuary in his content should be treated as non-neutral.
No other sponsors detected in this issue. The rest reads as his own material.
The core argument
If you were to build a data pipeline system from scratch (and SDG notes most teams shouldn’t — they should buy Airflow/dbt/etc), here are the nine components you’d actually end up writing.
The nine components
- Secrets & connection management — without shared source/destination config, the rest is orphaned SQL and Python doing nothing.
- Logging & monitoring — “library not found” vs “table not found” traceability. Gets more important as AI generates more pipeline code.
- Dependency awareness (DAGs) — some pipelines don’t need full DAGs, but they all need some form of “what runs after what.” Airflow’s `set_downstream`, dbt’s `{{ ref() }}` (first sketch after this list).
- Execution engine routers — newer concept. Teams now want to use DuckDB + Presto + Spark + Databricks + Snowflake in different parts of the same pipeline. Expect routing layers that pick the compute engine based on cost/speed/data size; Facebook did this manually (second sketch after this list).
- Scheduler — cron, Jenkins, SSIS/SQL Agent, Airflow. Airflow’s scheduler is special because it separates when work should happen from how it runs, which is what enables backfills/retries/complex deps.
- Pipelines themselves — the actual DSL or code that defines the transformation (dbt models, Airflow tasks, Glue jobs).
- Data quality checks — built-in, not bolt-on. “If it’s not easy to integrate, people won’t do it.”
- UI — technically optional, effectively mandatory at any real scale because non-engineers need to filter/rerun/debug.
- Operational concerns (not components but needed): idempotency/backfill safety, ownership tracking, alerting + on-call routing, environment isolation (dev/test/prod, ideally with natural dev flow).
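For the dependency-awareness and scheduler bullets, a minimal Airflow-style sketch (Airflow 2.x imports; the DAG id and task functions are placeholders invented here, not anything from autoinv or the article). The point is the separation SDG calls out: the DAG declares when work should happen and what runs after what, the operators own how each step runs, and `catchup=True` is what makes backfills possible.

```python
# Minimal Airflow 2.x-style sketch. DAG id, schedule, and task bodies are
# illustrative placeholders only.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    """Placeholder extract step."""
    ...


def transform():
    """Placeholder transform step."""
    ...


with DAG(
    dag_id="example_pipeline",
    start_date=datetime(2026, 1, 1),
    schedule_interval="@daily",  # when work should happen
    catchup=True,                # lets the scheduler generate backfill runs
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    # Dependency awareness: what runs after what (same as extract_task >> transform_task)
    extract_task.set_downstream(transform_task)
```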
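For the execution-engine-router bullet, a purely hypothetical sketch of the routing idea, assuming the only signal is approximate data size (thresholds and engine choices are made up for illustration; a real router would also weigh cost and latency):

```python
# Hypothetical execution-engine router: pick a compute engine from a rough
# data-size estimate. Thresholds and engine names are illustrative only.
GIB = 1024 ** 3


def choose_engine(approx_bytes: int) -> str:
    """Return the engine a routing layer might dispatch this step to."""
    if approx_bytes < 10 * GIB:
        return "duckdb"  # small: single-node, cheap and fast to start
    if approx_bytes < 500 * GIB:
        return "presto"  # medium: interactive distributed SQL
    return "spark"       # large: heavy distributed batch


print(choose_engine(2 * GIB))  # -> "duckdb"
```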
Mapping against Ray Data Co — autoinv scorecard
Running this checklist against our own autoinv package:
| SDG component | Our status | Notes |
|---|---|---|
| Secrets & connection mgmt | ⚠️ partial | 1Password wrappers for MCP creds, but autoinv API keys still via env vars. Could tighten. |
| Logging & monitoring | ⚠️ partial | Scripts print to stdout, no structured logs. Fine at current scale. |
| Dependency awareness (DAGs) | ❌ | No orchestration. Scripts run manually via /loop or by hand. Deferred until we have >3 dependent pipelines. |
| Execution engine routers | ❌ N/A | Single-machine Python. Not a problem worth solving. |
| Scheduler | ⚠️ partial | /loop skill + the 4am nightly restart launchd job handles everything so far. Not truly a scheduler yet. |
| Pipelines themselves | ✅ | autoinv package modules (data, pricing, metrics, polymarket, kalshi, engine). |
| Data quality checks | ✅ | BiasAudit is our version. PEAD eq3’s survivorship flag just caught a drawdown the headline return would have hidden. |
| UI | ❌ N/A | Zero real users. Claude Code IS the UI. |
| Idempotency / backfill | ✅ | Scripts save dated outputs to experiments/outputs/, reruns don’t break prior artifacts (sketch below the table). |
| Ownership | ❌ N/A | Single-operator project. |
| Alerting / on-call | ⚠️ partial | No pager; channel reply tool surfaces issues. Fine at scale-1. |
| Environment isolation | ❌ N/A | Single .venv, no prod/test split. |
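For the idempotency row, a sketch of the dated-output pattern the table describes (function name and the pandas/Parquet call are assumptions about the shape of the code, not the actual autoinv scripts):

```python
# Sketch of the dated-output pattern: each run writes under a date-stamped
# directory, so a rerun replaces only today's artifact and never touches prior
# ones. Paths and names are illustrative; to_parquet needs pyarrow or fastparquet.
from datetime import date
from pathlib import Path

import pandas as pd


def save_dated(df: pd.DataFrame, name: str, root: str = "experiments/outputs") -> Path:
    out_dir = Path(root) / date.today().isoformat()
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{name}.parquet"
    df.to_parquet(out_path)
    return out_path
```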
Takeaway: six of nine are genuinely N/A at our scale or better handled by Claude Code than custom tooling. The two gaps worth watching are logging/monitoring (if any strategy goes to paper-trade or live execution, structured logs become non-optional) and scheduler (if /loop grows more than a few concurrent rhythms, we’ll want real task management, probably Temporal or a lightweight durable-execution layer).
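If the logging gap ever needs closing, a stdlib-only structured-logging sketch that could replace the stdout prints (logger name and the fields emitted are illustrative, not current autoinv code):

```python
# Stdlib-only structured logging: each record becomes one JSON line, so
# "library not found" vs "table not found" can be filtered mechanically.
import json
import logging


class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("autoinv.pipeline")  # illustrative logger name
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("loaded prices table rows=%d", 1250)
```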
Curation section — bottom links
Two pieces, both appear to be SDG’s own blog content used for cross-promotion (not third-party curation):
- “Behind the Scenes of SQL: Understanding SQL Query Execution” — basic SQL internals piece. Skipping deep-dive; not relevant to our work.
- “Back To The Basics: What Is Columnar Storage” — columnar storage overview (Parquet, analytical use cases). Skipping deep-dive; we use Parquet via `autoinv.data` but we don’t need the explainer.
Noting that the “curation” section in SDG’s newsletter is sometimes self-promotion dressed as curation. The skill needs to detect that pattern (same domain, same author byline) and label accordingly.
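A hypothetical helper for that detection, assuming the skill has each link’s URL, the newsletter’s own domain, and optionally bylines to compare (the function name, signature, and example values are invented here):

```python
# Hypothetical self-promotion check: flag "curation" links that share the
# newsletter's own domain or repeat the newsletter author's byline.
from urllib.parse import urlparse


def is_self_promotion(link_url: str, newsletter_domain: str,
                      link_byline: str | None = None,
                      newsletter_author: str | None = None) -> bool:
    same_domain = urlparse(link_url).netloc.lower().endswith(newsletter_domain.lower())
    same_author = (
        link_byline is not None
        and newsletter_author is not None
        and link_byline.strip().lower() == newsletter_author.strip().lower()
    )
    return same_domain or same_author


# Example with made-up values:
# is_self_promotion("https://www.example-newsletter.com/post", "example-newsletter.com")  # True
```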
Related
- 2026-01-05-seattle-data-guy-data-pipeline-patterns — previous in series
- 2026-04-07-seattle-data-guy-noisy-data-quality-checks — the “fewer better checks” article that aligns with our `BiasAudit` design
- ../01-projects/automated-investing/autoinv/README — subject of the scorecard above