
data cleaning is analysis

Thu Apr 02 2026 20:00:00 GMT-0400 (Eastern Daylight Time) · article · source: https://counting.substack.com/p/whys-it-hard-to-teach-data-cleaning · by Randy Au

Why Data Cleaning Is Hard to Teach — It’s Analysis, Not Grunt Work

Randy Au reframes data cleaning as analysis with intent and purpose, not menial grunt work. The reason it’s hard to teach is that every analysis requires data to be shaped in a unique way.

The core reframe

“Data cleaning is transforming data with intent and purpose, that purpose being to complete an analysis.”

Data transformations span a spectrum of reusability: stripping SQL injection from form data is highly reusable; reweighting a survey sample for unexpected bias is not. The more repeatable and generalizable the transformation, the more people write it off as uninteresting “cleaning.”
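To make the two ends of that spectrum concrete, here is a minimal Python sketch. It is an illustration, not code from the article. The function names, the respondent data, and the 50/50 population split are all hypothetical, and a simple whitespace normalizer stands in for the highly reusable end of the spectrum. The reweighting uses basic post-stratification, which is only one of several ways to correct a biased sample.

```python
from collections import Counter


def normalize_text(value: str) -> str:
    """Reusable end of the spectrum: trim, collapse whitespace, lowercase.

    Nothing here depends on the analysis, so it can be applied almost anywhere.
    """
    return " ".join(value.split()).lower()


def poststratification_weights(sample_groups, population_shares):
    """Analysis-specific end: weight respondents so that each group's share
    of the sample matches its (assumed known) share of the population."""
    counts = Counter(sample_groups)
    n = len(sample_groups)
    group_weight = {
        group: population_shares[group] / (count / n)
        for group, count in counts.items()
    }
    return [group_weight[group] for group in sample_groups]


def weighted_mean(values, weights):
    """Mean of values where each value counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical survey: 8 of 10 respondents are on mobile, but the
# population is assumed (for illustration only) to be split 50/50.
groups = ["mobile"] * 8 + ["desktop"] * 2
satisfied = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
population_shares = {"mobile": 0.5, "desktop": 0.5}

weights = poststratification_weights(groups, population_shares)
raw_rate = sum(satisfied) / len(satisfied)
adjusted_rate = weighted_mean(satisfied, weights)

print(normalize_text("  Hello   World "))
print(raw_rate, round(adjusted_rate, 3))
```

The normalizer would look the same in any project. The weights only make sense once you know which population the analysis is trying to describe, which is why that step is hard to package as generic “cleaning.”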

Why it’s unteachable in the traditional sense

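Every analysis requires data in its own particular shape, and forces on both sides of the data determine what “clean” means. Upstream, data collection processes are tied to the original research question. Downstream, the analysis at hand dictates the form the data must take. When data is repurposed (what’s called “found data”), cleaning becomes especially complex, because the data was collected to answer a different question.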
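A small Python sketch of how the downstream question changes what counts as clean. The event records and function names are hypothetical, chosen only to illustrate the point. The same repeated rows are noise for one analysis and the signal for another.

```python
from collections import Counter

# Hypothetical raw event log: user "a" submitted the same form twice.
raw_events = [
    {"user": "a", "action": "submit", "ts": 1},
    {"user": "a", "action": "submit", "ts": 2},
    {"user": "b", "action": "submit", "ts": 5},
]


def shape_for_conversion(events):
    """Question: how many users submitted at all?

    Repeated submits are noise here, so collapse to one entry per user.
    """
    return sorted({e["user"] for e in events if e["action"] == "submit"})


def shape_for_retry_analysis(events):
    """Question: how often do users have to retry?

    Repeated submits are the signal here, so keep them and count the extras.
    """
    counts = Counter(e["user"] for e in events if e["action"] == "submit")
    return {user: count - 1 for user, count in counts.items()}


print(shape_for_conversion(raw_events))
print(shape_for_retry_analysis(raw_events))
```

Deduplicating up front would be correct cleaning for the first question and would destroy the data needed for the second, so neither shape can be called clean without naming the analysis.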

The teaching gap

Public datasets are almost always pre-cleaned, which gives learners a false impression of what real data work looks like. Au’s suggestion is to lead people to messy data or encourage them to generate their own.

Connects to: analytics as craft; analytics is a profession; embrace the grind; recipe for data intuition.

Open questions