The Ghosts in the Data
Summary
Vicki Boykis introduces the concept of “ghost knowledge” — knowledge that exists within expert communities but is never written down, and is therefore effectively invisible to outsiders. The mental model: the most important things about working with data are implicit, not explicit, and they follow power-law distributions rather than Gaussian ones.
Key ideas:
- Ghost knowledge is the real barrier to entry. The difference between explicit knowledge (documented APIs, SQL syntax) and implicit knowledge (knowing which columns are unreliable, which metrics have political baggage) is where data work actually lives. You learn it only by being embedded in the community.
- Power laws, not bell curves. Real-world data distributions are often Paretian, not Gaussian. A heavy-tailed power law may have no meaningful average, and when the tail exponent is small enough its variance is infinite, so confidence intervals built on the standard deviation break down. This means tail-end phenomena deserve as much attention as “typical” users — and most analysts default to thinking in averages.
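The contrast above can be made concrete with a small simulation. This is a hedged sketch (the distribution parameters are invented for illustration): for a Gaussian, the top 1% of observations contribute roughly 1% of the total, so the average describes almost everyone; for a heavy-tailed Pareto sample, the top 1% dominates.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Gaussian: "typical" describes almost everyone.
gaussian = rng.normal(loc=100, scale=15, size=n)

# Pareto with tail exponent alpha = 1.1: the mean barely exists
# and the variance is infinite.
pareto = (rng.pareto(1.1, size=n) + 1) * 100

def top1_share(x):
    """Fraction of the total contributed by the top 1% of observations."""
    cutoff = np.quantile(x, 0.99)
    return x[x >= cutoff].sum() / x.sum()

print(f"Gaussian top-1% share: {top1_share(gaussian):.1%}")  # about 1%
print(f"Pareto   top-1% share: {top1_share(pareto):.1%}")    # a large fraction
```

A dashboard averaging the Pareto sample would be quietly dominated by a handful of tail observations — exactly the phenomenon the note warns about.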
- Data work is programming work. As data systems become larger and more distributed, data practitioners must adopt software engineering best practices: version control, tested modules, documentation, reproducibility. The era of “just run a query” is ending.
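A minimal sketch of what “data work is programming work” looks like in practice: a business rule encoded as a named, unit-tested function rather than an ad-hoc query. All table, column, and status names here are invented for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Order:
    order_id: str
    amount_cents: int
    status: str  # e.g. "completed", "refunded", "pending"

def net_revenue_cents(orders: list[Order]) -> int:
    """Net revenue: completed orders minus refunded ones.

    Encoding the rule in a reviewable, tested function is one way
    ghost knowledge ("refunds subtract, pending counts for nothing")
    becomes explicit and survives analyst turnover.
    """
    total = 0
    for o in orders:
        if o.status == "completed":
            total += o.amount_cents
        elif o.status == "refunded":
            total -= o.amount_cents
    return total

# A unit test that pins the business rule down in version control.
def test_net_revenue():
    orders = [
        Order("a", 1000, "completed"),
        Order("b", 500, "refunded"),
        Order("c", 250, "pending"),  # pending orders count for nothing
    ]
    assert net_revenue_cents(orders) == 500

test_net_revenue()
```

The same rule buried in a WHERE clause would be rediscovered (or misremembered) by every new analyst; as code, it is versioned, tested, and reproducible.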
- The fundamental friction. Engineering sees data as an insignificant byproduct of writing code. Analytics sees code as an irritating hassle to get to the data. Both are optimizing for different things: developers want stable, fast apps (even at the expense of logging quality); analysts want clean data (even if it means pausing processes to fix things).
The ghost knowledge concept is deeply relevant to 01-projects/phdata/index and 01-projects/phdata/career-transition. Consulting success depends on quickly acquiring ghost knowledge at each client — the undocumented context about which datasets are trustworthy, which stakeholders have political capital around certain metrics, and which “data quality issues” are actually business logic no one wrote down. This is also why consulting engagements need embedded time, not just deliverables.
The “data work is programming work” argument reinforces 06-reference/2026-04-03-uber-data-culture-first-principles (data as code) and connects to 06-reference/2026-04-03-analytics-engineering-everywhere (AE brings engineering practices to the analytics layer).
The developer-vs-analyst friction maps to the org design tensions in 06-reference/2026-03-31-block-hierarchy-to-intelligence — where you place the data team in the hierarchy determines which incentive structure wins.
For 01-projects/data-marketplace/index, ghost knowledge is the hardest thing to package into a data product. Raw data without context is low-value. The opportunity is in surfacing ghost knowledge alongside the data — metadata, lineage, known quirks, recommended use cases.
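One way to sketch “surfacing ghost knowledge alongside the data” is a dataset card that ships quirks, lineage, and recommended uses with the product. This is a hypothetical structure; every field name and example value is invented.

```python
from dataclasses import dataclass, field

@dataclass
class DatasetCard:
    """Metadata that travels with a dataset in a data-product listing."""
    name: str
    description: str
    lineage: list[str] = field(default_factory=list)        # upstream sources
    known_quirks: list[str] = field(default_factory=list)   # ghost knowledge, written down
    recommended_uses: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the card as plain text for a catalog page."""
        lines = [f"# {self.name}", self.description, ""]
        for title, items in [("Lineage", self.lineage),
                             ("Known quirks", self.known_quirks),
                             ("Recommended uses", self.recommended_uses)]:
            if items:
                lines.append(f"{title}:")
                lines.extend(f"  - {item}" for item in items)
        return "\n".join(lines)

card = DatasetCard(
    name="orders_daily",
    description="Daily order rollup.",
    known_quirks=["Rows before the 2021 backfill double-count refunds."],
)
print(card.render())
```

The quirks list is the high-value part: it is precisely the context a buyer cannot recover from the raw rows.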
The power-law insight connects to 06-reference/2026-04-03-feature-stores-hierarchy — feature engineering is often about capturing tail behavior that averages would miss.
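A small sketch of tail-aware feature engineering, under the assumption of per-user transaction amounts (the feature names are invented): alongside the mean, emit quantile and max features so heavy-tailed behavior survives aggregation.

```python
import numpy as np

def spend_features(amounts: np.ndarray) -> dict[str, float]:
    """Per-user spend features: the mean alone hides heavy tails,
    so also capture the p99 and the maximum."""
    return {
        "mean_spend": float(amounts.mean()),
        "p99_spend": float(np.quantile(amounts, 0.99)),
        "max_spend": float(amounts.max()),
    }

# A user with 99 small purchases and one huge one: the mean feature
# alone would make them look ordinary.
amounts = np.array([1.0] * 99 + [1000.0])
print(spend_features(amounts))
```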
Open Questions
- How do you systematically capture ghost knowledge before it walks out the door when an analyst leaves? Is this what a well-maintained dbt project with thorough documentation actually achieves?
- If power-law distributions are the norm, what does that mean for standard BI dashboards that show averages and trends? Should the default visualization be different?
- The developer-analyst friction seems structural. Is there an org design pattern that genuinely resolves it, or is managed tension the best outcome?