“Common Data Pipeline Patterns You’ll See in the Real World” — @SeattleDataGuy
Why this is in the vault
Part of the SDG backfill. First article in the SDG 2026 pipelines series. Useful as a taxonomic foundation note — future SDG pipeline articles in this series will reference these five categories, so having the shared vocabulary filed makes later pieces cheaper to integrate.
The core argument
When data teams say “data pipeline,” they actually mean any of several structurally different things. SDG groups them into five real-world patterns that show up across industries, and names a few more quickly at the end.
The five patterns
- Source Standardization Pipelines — ingest from multiple partners in different formats (CSV, XML, positional files, APIs), map into a shared core model. Mapping is the hard part: standardize gender codes, category labels, date formats, time zones. Output powers marketplaces, industry-level reports, cross-partner analytics.
- Amalgamation Pipelines — merge multiple sources into a single flow or 360-view: a sales funnel stitched across HubSpot + Google Ads + Salesforce + Stripe. The hard parts are choosing a reliable join key and handling late-landing data.
- Excel “Data Pipelines” — semi-automated VBA/VLOOKUP-driven extract-transform-load. SDG argues these functionally solve the same problem even if they’re not “real” pipelines. They tend to get productionized later.
- Enrichment Pipelines — separate pipeline adding columns to core tables: lead scores, ML-derived features, external data joins. Built after the core model is stable.
- Operational Pipelines (reverse ETL) — push data back into operational systems (Salesforce segmentation, NetSuite updates, HubSpot lists). Hard because target systems often require single-record updates and have idiosyncratic APIs; straddles the software/data boundary.
Honorable mentions: ML pipelines, integration pipelines, migration pipelines, metadata/lineage pipelines.
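The first pattern is concrete enough to sketch. A minimal source-standardization mapper, assuming two hypothetical partners (field names, gender codes, and date formats are all invented for illustration, not from the article):

```python
from datetime import datetime, timezone

# Shared lookup for the "standardize gender codes" mapping step.
GENDER_MAP = {"M": "male", "F": "female", "1": "male", "2": "female"}

def from_partner_a(raw: dict) -> dict:
    # Partner A: CSV-style dict, US-format dates, local gender codes.
    return {
        "gender": GENDER_MAP.get(raw["gndr"], "unknown"),
        "event_ts": datetime.strptime(raw["date"], "%m/%d/%Y").replace(tzinfo=timezone.utc),
        "category": raw["cat"].strip().lower(),
    }

def from_partner_b(raw: dict) -> dict:
    # Partner B: API payload, ISO-8601 timestamps, labels already clean.
    return {
        "gender": GENDER_MAP.get(raw["gender"], raw["gender"]),
        "event_ts": datetime.fromisoformat(raw["timestamp"]),
        "category": raw["category"],
    }

# The "pipeline" is just per-source mappers converging on one core model.
MAPPERS = {"partner_a": from_partner_a, "partner_b": from_partner_b}

def standardize(source: str, raw: dict) -> dict:
    return MAPPERS[source](raw)
```

The point of the sketch: the ingest mechanics are trivial, and all the real work lives in the per-partner mapping functions, which matches SDG's claim that mapping is the hard part.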
Mapping against Ray Data Co
This is a taxonomy article, not an operational one. Value for us is vocabulary-sharing across future SDG pieces in the series. Light RDCO mapping:
- autoinv’s data layer is a source-standardization pipeline — Polygon, Kalshi, Polymarket, Gamma API all come in differently and get normalized. Same class as SDG’s first pattern.
- The PEAD bias audit is enrichment — it adds a derived “signal quality” field on top of the event table rather than changing the table itself.
- No operational pipeline yet — we don’t push data back into any external system (no Slack alerter, no dashboard sync). When we eventually do (probable MCP server for one of the strategies), operational-pipeline concerns apply: idempotency, single-record semantics, failure handling.
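Those three concerns (idempotency, single-record semantics, failure handling) can be sketched in one small loop. Everything here is hypothetical — `push_fn` stands in for whatever external API a future MCP server or alerter would call:

```python
import hashlib
import json

def payload_hash(record: dict) -> str:
    # Stable fingerprint so unchanged records can be skipped on re-runs.
    return hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()

class ReverseEtlSync:
    def __init__(self, push_fn, max_retries: int = 3):
        self.push_fn = push_fn      # hypothetical wrapper around the target system's update API
        self.sent = {}              # record_id -> hash of last successfully pushed payload
        self.max_retries = max_retries

    def sync_one(self, record_id: str, record: dict) -> str:
        h = payload_hash(record)
        if self.sent.get(record_id) == h:
            return "skipped"        # idempotent: identical payload was already pushed
        for attempt in range(self.max_retries):
            try:
                # Single-record semantics: one API call per record, as many
                # operational targets require.
                self.push_fn(record_id, record)
                self.sent[record_id] = h
                return "pushed"
            except Exception:
                if attempt == self.max_retries - 1:
                    return "failed" # surface for dead-letter / alerting, don't crash the batch
        return "failed"
```

Usage is one call per record; re-running the same batch produces `"skipped"` instead of duplicate writes, which is the property that makes retries safe.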
Curation section — links worth noting
SDG’s “Articles Worth Reading” pointed at two pieces:
- mehdio — “How I Run System Design Interviews for Data Engineers” (exploration-style interview approach). Flagged as potentially useful if we ever hire; skipping deep-dive for now.
- Joe Reis — “Code Wasn’t The Hard Part (Keep Building)”. Loosely relevant to the “thinking vs doing” thesis from 2026-04-10-paddy-srinivasan-agentic-cloud. Skipping deep-dive; Joe Reis already has a tracked-author slot anyway.
Sponsorships / bias notes
- Self-promotion at top (Data Leaders Playbook signup link) — noted, not acting on.
- “Reader-supported publication, consider becoming a paid subscriber” — standard Substack boilerplate. No third-party paid placements detected in this issue.
Related
- 2026-04-07-seattle-data-guy-noisy-data-quality-checks — the maintenance counterpart to this “building” article
- ../01-projects/automated-investing/autoinv/README — where our own pipelines live
- ../02-sops — operational rhythms