Entities, Instances, and Identifiers (Ch 5)

Chapter 5 covers the most fundamental building blocks: entities (the "things" you model), instances (specific occurrences), and identifiers (how you tell them apart). It extends the traditional structured-data treatment to all five forms.

Key concepts:

Entity discovery across camps: The same Customer entity appears as a table (relational), dimension (analytics), JSON document (application), feature vector (ML/AI), and type node (knowledge graph)
Entities in unstructured data: Use named entity recognition (NER) to extract entities from text; the file itself can also be an entity, identified through its metadata
Natural vs. surrogate keys: Natural keys are meaningful but unstable; surrogate keys are stable but opaque; the hybrid approach (surrogate primary key plus indexed natural attributes) is the pragmatic default
Anti-patterns: God Entity (one table for everything), Phantom Entity (no clear business meaning), Split Entity (same concept fragmented across tables), Temporal Confusion (no distinction between current and historical state)
Cross-form identifiers: Use one persistent customer_id across structured tables, JSON events, text metadata, and image filenames
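
A minimal sketch of the hybrid key approach and the cross-form identifier idea, assuming Python with the standard-library sqlite3 module. The table layout, the choice of UUIDs as surrogate keys, the sample customer, the event fields, and the filename pattern are all hypothetical illustrations, not taken from the chapter.

```python
import json
import sqlite3
import uuid


def new_customer_id() -> str:
    """Surrogate key: opaque and stable, carries no business meaning.

    Using a UUID here is an assumption; any stable generated key works.
    """
    return str(uuid.uuid4())


conn = sqlite3.connect(":memory:")

# Hybrid approach: surrogate primary key, natural attribute kept and indexed.
conn.execute(
    """
    CREATE TABLE customer (
        customer_id TEXT PRIMARY KEY,  -- surrogate key (stable, opaque)
        email       TEXT NOT NULL,     -- natural attribute (meaningful, may change)
        full_name   TEXT NOT NULL
    )
    """
)
conn.execute("CREATE INDEX ix_customer_email ON customer (email)")

cid = new_customer_id()
conn.execute(
    "INSERT INTO customer (customer_id, email, full_name) VALUES (?, ?, ?)",
    (cid, "ada@example.com", "Ada Lovelace"),
)

# The natural attribute changes; the surrogate key does not, so anything
# that references customer_id stays valid.
conn.execute(
    "UPDATE customer SET email = ? WHERE customer_id = ?",
    ("ada@newmail.example", cid),
)

# Cross-form identifiers: the same customer_id travels with every form.
event = json.dumps({"event": "order_placed", "customer_id": cid, "total": 42.5})
text_metadata = {"source": "support_ticket.txt", "customer_id": cid}
image_filename = f"{cid}_profile.jpg"

# Lookup by the indexed natural attribute still resolves to the same entity.
row = conn.execute(
    "SELECT customer_id, email FROM customer WHERE email = ?",
    ("ada@newmail.example",),
).fetchone()
```

Because the JSON event, the text metadata, and the image filename all carry the same customer_id as the table row, they can be joined back to the Customer entity even after the email address changes.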

RDCO relevance

This chapter is foundational for dbt model design. The natural vs. surrogate key guidance and the anti-pattern catalog are directly applicable to client engagements. The cross-form identifier principle matters as clients start connecting dbt models to vector stores and knowledge graphs.