ghosts in the data

Thu Apr 02 2026 20:00:00 GMT-0400 (Eastern Daylight Time) · article · source: http://veekaybee.github.io/2021/03/26/data-ghosts/ · by Vicki Boykis

Summary

Vicki Boykis introduces the concept of “ghost knowledge”: knowledge that circulates within expert communities but is never written down, so for outsiders it effectively does not exist. The mental model has two parts: the most important things about working with data are implicit rather than explicit, and real-world data often follows power-law distributions rather than Gaussian ones.

Key ideas:

The ghost knowledge concept is deeply relevant to 01-projects/phdata/index and 01-projects/phdata/career-transition. Consulting success depends on quickly acquiring ghost knowledge at each client — the undocumented context about which datasets are trustworthy, which stakeholders have political capital around certain metrics, and which “data quality issues” are actually business logic no one wrote down. This is also why consulting engagements need embedded time, not just deliverables.

The “data work is programming work” argument reinforces 06-reference/2026-04-03-uber-data-culture-first-principles (data as code) and connects to 06-reference/2026-04-03-analytics-engineering-everywhere (AE brings engineering practices to the analytics layer).

The developer-vs-analyst friction maps to the org design tensions in 06-reference/2026-03-31-block-hierarchy-to-intelligence — where you place the data team in the hierarchy determines which incentive structure wins.

For 01-projects/data-marketplace/index, ghost knowledge is the hardest thing to package into a data product. Raw data without context is low-value. The opportunity is in surfacing ghost knowledge alongside the data — metadata, lineage, known quirks, recommended use cases.
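One way to picture "surfacing ghost knowledge alongside the data" is a small record that travels with each dataset. The sketch below is only an illustration: the class name `DatasetContext`, its fields, and the `orders_daily` example are all hypothetical, chosen to mirror the list above (metadata, lineage, known quirks, recommended use cases), and do not come from the article or any particular catalog tool.

```python
from dataclasses import dataclass, field


# Hypothetical schema: field names are assumptions for illustration only.
@dataclass
class DatasetContext:
    name: str
    description: str = ""
    lineage: list[str] = field(default_factory=list)           # upstream sources
    known_quirks: list[str] = field(default_factory=list)      # undocumented gotchas
    recommended_uses: list[str] = field(default_factory=list)  # what the data is safe for

    def missing_context(self) -> list[str]:
        """Return the names of context fields that are still empty."""
        checks = {
            "description": self.description,
            "lineage": self.lineage,
            "known_quirks": self.known_quirks,
            "recommended_uses": self.recommended_uses,
        }
        return [key for key, value in checks.items() if not value]


# Made-up example dataset with only part of its context written down.
orders = DatasetContext(
    name="orders_daily",
    description="One row per order per day.",
    known_quirks=["Negative totals are refunds, not errors."],
)
print(orders.missing_context())  # ['lineage', 'recommended_uses']
```

A check like `missing_context` makes the gap visible: a dataset can be listed without its context, but the empty fields show exactly which ghost knowledge has not yet been captured.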

The power-law insight connects to 06-reference/2026-04-03-feature-stores-hierarchy — feature engineering is often about capturing tail behavior that averages would miss.
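A small sketch of why averages miss tail behavior, using made-up numbers rather than anything from the article. The helper `tail_features` and the `spend` values are hypothetical: 99 customers spend 10 each and one spends 10,000, a crude stand-in for a heavy-tailed distribution.

```python
from statistics import mean, median


def tail_features(values, top_fraction=0.01):
    """Summarize a sample with the mean plus a few tail-aware features."""
    ordered = sorted(values, reverse=True)
    # Number of observations in the top slice (at least one).
    k = max(1, round(len(ordered) * top_fraction))
    total = sum(ordered)
    return {
        "mean": mean(ordered),
        "median": median(ordered),
        "max": ordered[0],
        # Share of the total contributed by the top slice.
        "top_share": sum(ordered[:k]) / total if total else 0.0,
    }


# Illustrative data only: 99 small values and one very large one.
spend = [10] * 99 + [10_000]
features = tail_features(spend)
print(features)
```

Here the mean is about 110, a value that describes none of the customers: the median is 10, and the single largest customer accounts for roughly 91% of total spend. Features such as the maximum or the top-slice share carry the information the average hides.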

Open Questions