06-reference

seattle data guy noisy data quality checks

Mon Apr 06 2026 20:00:00 GMT-0400 (Eastern Daylight Time) · reference · source: SeattleDataGuy's Newsletter (Substack) · by SeattleDataGuy

“Daily Tasks With Data Pipelines — Data Quality Checks And The Problem With Noisy Checks” — @SeattleDataGuy

Why this is in the vault

Founder flagged this with a CRM note: “Seattle Data Guy should be added to the CRM. He has great content for the discipline.” The piece is directly relevant to the disciplined-execution posture we’ve been building into the autoinv package (bias audit, validation gates, honest metrics) and to the ../02-sops operational rhythm generally. Any system that runs unattended eventually fails in the same way this article describes — too many checks, unclear ownership, alert fatigue, then outright neglect.

The core argument

Deploying a pipeline is not the end — it’s the beginning of ongoing maintenance work. The article opens with a satirical scene: a team gets 137 quality alerts a day and fixes 2. Data quality tools promise trustworthiness, but teams create “noisy” alert systems that quickly get ignored. The harder, unglamorous problem is maintenance, not building.

Where quality checks break down

  1. Over-checking everything. Applying checks to every column indiscriminately creates noise, especially on unused or naturally null-heavy columns. Without prioritization across hundreds of columns, the result is chaos.
  2. Poorly tuned thresholds. Static thresholds fail; dynamic thresholds still miss gradual drift. Minor daily fluctuations sit just below the alert line indefinitely and mask real degradation.
  3. No clear ownership. Alerts pile up unassigned. If no specific person is on the hook for an alert firing, it gets ignored.
  4. Misaligned incentives — the article’s sharpest point. Short direct quote from the piece: “The team generating bad data is rewarded for shipping features, not fixing pipelines.” Application teams optimize for features and ship-speed; they overwrite historical fields or ignore update timestamps because the app works. Data teams inherit the mess and can’t fix the upstream cause.
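Point 2 above is worth making concrete. A minimal sketch (all function names, the 5% threshold, and the drift rates are illustrative assumptions, not from the article) of how a null rate can creep upward for a month while every point-in-time check passes, and how comparing a recent window against an anchored baseline still catches it:

```python
from statistics import mean

def point_in_time_check(null_rate, threshold=0.05):
    """Static daily check: alerts only if today's rate crosses the line."""
    return null_rate > threshold

def trend_check(history, baseline_days=7, recent_days=7, max_ratio=2.0):
    """Compare the recent window against a baseline anchored at deployment.

    Gradual drift that never trips the daily threshold still shows up
    as a widening ratio between recent and baseline averages.
    """
    baseline = mean(history[:baseline_days])
    recent = mean(history[-recent_days:])
    return recent > max_ratio * max(baseline, 1e-9)

# A null rate creeping from 1.0% to 4.8% over 30 days: every daily
# check stays just under the 5% line, masking real degradation.
rates = [0.01 + 0.0013 * day for day in range(30)]
daily_alerts = [point_in_time_check(r) for r in rates]  # all False
drift_alert = trend_check(rates)                        # True
```

The design choice mirrors the article's complaint: both static and window-relative dynamic thresholds adapt to slow decay, so the baseline has to be anchored to a known-good period rather than trailing the data.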

What actually works, per the author

One caveat on AI-powered quality tools: many impress in demos but fail in production. The author urges healthy skepticism toward vendor pitches.

Why data engineers ignore checks — the diagnostic list

When checks start getting ignored, the root cause is usually one of these (paraphrased from the article’s list):

Good comments worth capturing

Two commenter insights that extend the argument:

The “product with SLAs” reframe is the important one — it’s the same move the Jaya Gupta moat thesis makes for AI agents: the hard problem isn’t capability, it’s trust. A data quality product with stated SLAs is trust-as-a-first-class-citizen.
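To make the "product with SLAs" reframe concrete, here is a minimal sketch of a dataset shipping with explicit, machine-checkable SLAs instead of ad-hoc alerts. The class name, fields, and thresholds are hypothetical illustrations, not anything from the article or the comments:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class DatasetSLA:
    """Illustrative contract a data product could publish to consumers."""
    max_staleness: timedelta   # data must be fresher than this
    min_completeness: float    # fraction of expected rows present

    def evaluate(self, last_loaded_at, rows_present, rows_expected):
        """Return per-clause pass/fail, evaluated against UTC now."""
        now = datetime.now(timezone.utc)
        return {
            "fresh": now - last_loaded_at <= self.max_staleness,
            "complete": rows_present / rows_expected >= self.min_completeness,
        }

sla = DatasetSLA(max_staleness=timedelta(hours=6), min_completeness=0.99)
status = sla.evaluate(
    last_loaded_at=datetime.now(timezone.utc) - timedelta(hours=2),
    rows_present=995,
    rows_expected=1000,
)
```

The point of the reframe is that each SLA clause is a stated promise with a binary verdict, which is what makes trust a first-class citizen rather than a byproduct of alert volume.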

Mapping against Ray Data Co’s own posture

Where this reinforces what we already do:

Where this surfaces gaps we should close:

Action items

  1. Add SeattleDataGuy to the CRM (task #4 when we get to it). He publishes substantive data-engineering content weekly, has 114K+ subscribers, and is clearly thoughtful. Specifically worth following for data platform discipline content.
  2. Add a monthly “audit checks” item to the vault rhythm. Kill dead autoinv tests, review which BiasAudit gates are tripping vs being ignored.
  3. When we eventually build a data product (probable future MCP server for one of the small bets), define SLAs before shipping — not after.
  4. Consider an “error trend” dashboard for the long-running pipelines (PM1e forward-test, any future live strategy) that tracks drift over time rather than point-in-time pass/fail.
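For action item 2, the monthly audit needs data on which checks are actually being acted on. A hypothetical sketch (the `CheckLog` class and its methods are assumptions for illustration; the 137-alerts figure echoes the article's opening) of logging each alert alongside whether anyone responded, then ranking checks by ignore rate to find kill candidates:

```python
from collections import defaultdict

class CheckLog:
    """Track, per check, how often it fires vs how often anyone acts."""

    def __init__(self):
        self.fired = defaultdict(int)
        self.acted = defaultdict(int)

    def record(self, check, acted_on):
        self.fired[check] += 1
        if acted_on:
            self.acted[check] += 1

    def noisiest(self):
        """Checks sorted by action rate, lowest first: top entries are
        the ones nobody responds to and the first candidates to kill."""
        return sorted(self.fired, key=lambda c: self.acted[c] / self.fired[c])

log = CheckLog()
for _ in range(137):                      # fires all day, nobody responds
    log.record("col_null_rate", acted_on=False)
log.record("col_null_rate", acted_on=True)
log.record("row_count_drop", acted_on=True)  # rare but always actioned
```

Even this crude action-rate ranking separates a check that fired 138 times and was acted on once from one that fired once and was fixed immediately, which is the distinction the monthly audit needs.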

Tracked author

SeattleDataGuy (real name not disclosed in the article; publicly known as Ben Rogojan in the data engineering community). Data engineer, strategy consultant, 114K+ Substack subscribers. Covers data engineering, MLOps, data science. High-quality operational content, not a “here’s a new framework” shouter. Worth adding to the CRM as a thought leader for the data engineering discipline. Publishes weekly, good source of filtered references back to the broader data ecosystem.

One short direct quote (<15 words) used in quotation marks for the incentive-misalignment point. All other content paraphrased and analyzed. Public newsletter content, non-paywalled at time of access.