Entities, Instances, and Identifiers (Ch 5)
Chapter 5 covers the most fundamental building blocks: entities (the “things” you model), instances (specific occurrences), and identifiers (how you tell them apart). It extends the traditional structured-data treatment of these concepts to all five forms.
Key concepts:
- Entity discovery across camps: Same Customer entity appears as a table (relational), dimension (analytics), JSON document (application), feature vector (ML/AI), and type node (knowledge graph)
- Entities in unstructured data: Use named entity recognition (NER) to extract entities from text; the file itself can also be an entity, identified by its metadata
- Natural vs. surrogate keys: Natural keys are meaningful but unstable; surrogates are stable but opaque; the hybrid approach (surrogate PK + indexed natural attributes) is the pragmatic default
- Anti-patterns: God Entity (one table for everything), Phantom Entity (no clear business meaning), Split Entity (same concept fragmented across tables), Temporal Confusion (no distinction between current and historical state)
- Cross-form identifiers: Use persistent customer_id across structured tables, JSON events, text metadata, and image filenames
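The hybrid key and cross-form identifier ideas above can be sketched in Python with the standard-library sqlite3 module. This is a minimal illustration, not the book's implementation: the customer table, its email column, the add_customer helper, the UUID-based surrogate, and the event and filename formats are all assumptions made for the example.

```python
import json
import sqlite3
import uuid

# In-memory database so the sketch is self-contained.
conn = sqlite3.connect(":memory:")

# Hybrid approach: an opaque surrogate primary key, plus the natural
# attribute (email, a hypothetical choice) kept as an indexed column.
conn.execute(
    """
    CREATE TABLE customer (
        customer_id TEXT PRIMARY KEY,  -- surrogate key: stable, opaque
        email       TEXT NOT NULL      -- natural attribute: meaningful, may change
    )
    """
)
conn.execute("CREATE UNIQUE INDEX idx_customer_email ON customer (email)")


def add_customer(email: str) -> str:
    """Insert a customer and return the generated surrogate key."""
    customer_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO customer (customer_id, email) VALUES (?, ?)",
        (customer_id, email),
    )
    return customer_id


cid = add_customer("ada@example.com")

# The natural attribute changes; the surrogate key does not.
conn.execute(
    "UPDATE customer SET email = ? WHERE customer_id = ?",
    ("ada@new.example.com", cid),
)

# Cross-form identifiers: the same customer_id travels with a JSON event
# and an image filename (both formats are illustrative assumptions).
event = json.dumps({"event": "order_placed", "customer_id": cid})
image_filename = f"{cid}_profile.png"

# Lookup by natural attribute still works through the unique index.
row = conn.execute(
    "SELECT customer_id FROM customer WHERE email = ?",
    ("ada@new.example.com",),
).fetchone()
```

Because the email lives in an indexed attribute rather than in the key, changing it touches a single row, and the JSON event and the filename keep pointing at the same customer_id.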
RDCO relevance
Foundational for dbt model design. The natural vs. surrogate key guidance and anti-pattern catalog are directly applicable to client engagements. The cross-form identifier principle matters as clients start connecting dbt models to vector stores and knowledge graphs.