Why Data Cleaning Is Hard to Teach — It’s Analysis, Not Grunt Work
Randy Au reframes data cleaning as analysis carried out with intent and purpose, not menial grunt work. It is hard to teach because every analysis needs its data shaped in its own way.
The core reframe
“Data cleaning is transforming data with intent and purpose, that purpose being to complete an analysis.”
Data transformations span a spectrum of reusability: stripping SQL injection attempts from form data is highly reusable; reweighting a survey sample to correct for unexpected bias is not. The more repeatable and generalizable the transformation, the more readily people write it off as uninteresting “cleaning.”
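The two ends of that spectrum can be sketched in Python. This is an illustration, not code from Au's article: a generic whitespace normalizer stands in for the reusable end (in place of the SQL-injection example), and simple post-stratification weighting stands in for the analysis-specific end. The function names, age groups, and population shares are all hypothetical.

```python
from collections import Counter


def normalize_whitespace(value: str) -> str:
    """Reusable end of the spectrum: trim and collapse runs of whitespace.

    Nothing here depends on the analysis, so it can be applied anywhere.
    """
    return " ".join(value.split())


def poststratify_weights(sample_groups, population_shares):
    """Analysis-specific end: weight each respondent so that group shares
    in the sample match the (assumed) shares in the target population.

    The right groups and shares depend entirely on the question being asked.
    """
    counts = Counter(sample_groups)
    n = len(sample_groups)
    return [population_shares[g] / (counts[g] / n) for g in sample_groups]


# Hypothetical survey: younger respondents are over-represented (75% vs. 50%).
respondents = ["18-34", "18-34", "18-34", "35+"]
assumed_shares = {"18-34": 0.5, "35+": 0.5}
weights = poststratify_weights(respondents, assumed_shares)
```

The first function could ship in a shared utility library unchanged. The second only makes sense once someone has decided which population the survey is meant to describe, which is an analytical decision rather than a mechanical one.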
Why it’s unteachable in the traditional sense
Every analysis requires data in a certain unique shape:
- Software packages need specific formats
- Algorithms crash unless data types are exactly right
- One method needs missing values handled one way, another needs them handled differently
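As a small illustration of the last point, the sketch below prepares the same column in two ways. The sample values and the helper names `drop_missing` and `impute_mean` are made up for the example; which treatment counts as "clean" depends on what the downstream method can tolerate.

```python
from statistics import mean

# Hypothetical sensor readings; None marks a missing value.
readings = [4.0, None, 6.0, 8.0, None]


def drop_missing(values):
    """For a method that needs complete cases: discard missing entries."""
    return [v for v in values if v is not None]


def impute_mean(values):
    """For a method that needs a fixed-length input: fill gaps with the mean
    of the observed values."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


complete_cases = drop_missing(readings)  # shorter list, no invented values
imputed = impute_mean(readings)          # same length, gaps filled with 6.0
```

Neither result is the correct cleaning in general: dropping rows changes the sample, and imputing the mean shrinks the variance. The choice is part of the analysis.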
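Upstream and downstream forces both shape what “clean” means. Upstream, data collection processes are tied to the original research question; downstream, the tools and methods listed above dictate the shape the data must take. When data is repurposed for a question it was not collected to answer (so-called “found data”), cleaning becomes especially complex.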
The teaching gap
Public datasets are almost always pre-cleaned, creating a false impression of what real data work looks like. We need to lead people to messy data or encourage them to generate their own.
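One way to generate messy practice data is to start from a clean table and inject common defects on purpose. The sketch below is only a starting point under stated assumptions: the function `make_messy`, the column names, and the defect rates are hypothetical, and real mess is usually less tidy than anything a script produces.

```python
import random


def make_messy(rows, seed=0, missing_rate=0.2, duplicate_rate=0.1):
    """Return a copy of `rows` with injected defects: missing values,
    type drift, inconsistent text formatting, and duplicate records."""
    rng = random.Random(seed)  # seeded so an exercise can be reproduced
    messy = []
    for row in rows:
        dirty = dict(row)
        if rng.random() < missing_rate:
            dirty["age"] = None              # missing value
        elif rng.random() < 0.5:
            dirty["age"] = str(dirty["age"])  # number stored as text
        city = dirty["city"]
        # Same value, inconsistent casing or padding.
        dirty["city"] = rng.choice([city, city.upper(), f" {city} "])
        messy.append(dirty)
        if rng.random() < duplicate_rate:
            messy.append(dict(dirty))        # accidental duplicate
    return messy


clean = [
    {"age": 34, "city": "Lyon"},
    {"age": 51, "city": "Osaka"},
    {"age": 27, "city": "Lyon"},
]
messy = make_messy(clean, seed=1)
```

Because the clean original is kept, learners can check their cleaned result against it, which is rarely possible with found data.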
Connects to: analytics as craft; analytics is a profession; embrace the grind; recipe for data intuition.
Open questions
- Could generative/synthetic SQL create realistic messy datasets for training?
- Is “data cleaning” the wrong term entirely, and does the term itself contribute to the devaluation of the work?