Beyond Rows and Columns: The Five Forms of Data (Ch 4)
First chapter of Part 2 (Building Blocks). Catalogs five forms of data that modern modelers must handle:
- Structured — tables, rows/columns, SQL, relational databases. Still the home turf but not the whole picture. Same data needs different models for OLTP, OLAP, and ML features.
- Semi-structured — JSON, XML, NoSQL. Flexibility at the cost of consistency. Hybrid “shred stable fields, keep rest as raw” approach recommended.
- Unstructured — text, images, audio, video. Model through metadata + derived features + reference pattern (content in object storage, reference in relational model, features in vector DB).
- ML/AI artifacts — trained models, embeddings, feature vectors, agent traces, synthetic data. Provenance tracking is a modeling challenge.
- Metadata — business, operational, technical. The “connective tissue.” Modern table formats (Iceberg, Delta, Hudi) are essentially metadata-as-model.
Also distinguishes form from format (a relational table is a form; a Parquet file is a format), and modeling intent (greenfield design) from modeling exhaust (brownfield work, i.e. reverse engineering a model from what already exists).
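A minimal sketch of the hybrid approach from the semi-structured bullet, assuming a hypothetical event payload. The field names, the `STABLE_FIELDS` tuple, and the `raw_payload` column are illustrative choices for these notes, not taken from the chapter.

```python
import json
from typing import Any

# Hypothetical: fields assumed stable enough to promote to typed columns.
STABLE_FIELDS = ("event_id", "user_id", "event_type")


def shred(raw_json: str) -> dict[str, Any]:
    """Split one JSON document into stable columns plus a raw remainder."""
    doc = json.loads(raw_json)
    # Promote the stable fields; a missing field becomes None (NULL).
    row: dict[str, Any] = {field: doc.get(field) for field in STABLE_FIELDS}
    # Everything else stays as raw JSON so new keys never break the schema.
    rest = {k: v for k, v in doc.items() if k not in STABLE_FIELDS}
    row["raw_payload"] = json.dumps(rest, sort_keys=True)
    return row


event = (
    '{"event_id": 1, "user_id": 42, "event_type": "click", '
    '"button": "buy", "ab_test": {"arm": "B"}}'
)
row = shred(event)
print(row)
```

The trade-off is the one the chapter names: the promoted columns get consistency (types, constraints, tests), while the raw remainder keeps the flexibility of the source format.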
RDCO relevance
Expands the scope of what our dbt projects should cover. Most of our work is structured and semi-structured, but the chapter's argument for treating metadata as a first-class citizen strengthens our case for investing in dbt documentation, descriptions, and tests as modeling artifacts rather than afterthoughts.
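One way to act on this is a coverage check over model properties. The sketch below assumes a properties file has already been parsed into a dict shaped like a dbt `schema.yml` (`models`, `columns`, `description`, `tests` or `data_tests`). The sample models are hypothetical, and requiring a test on every column is an assumed policy for illustration, not a dbt rule.

```python
from typing import Any

# Hypothetical stand-in for a parsed dbt properties file (schema.yml).
schema: dict[str, Any] = {
    "version": 2,
    "models": [
        {
            "name": "orders",
            "description": "One row per customer order.",
            "columns": [
                {
                    "name": "order_id",
                    "description": "Primary key.",
                    "tests": ["unique", "not_null"],
                },
                {"name": "status", "description": ""},
            ],
        },
        {"name": "stg_payments", "columns": [{"name": "payment_id"}]},
    ],
}


def metadata_gaps(schema: dict[str, Any]) -> list[str]:
    """List models and columns that lack a description or any test."""
    gaps: list[str] = []
    for model in schema.get("models", []):
        if not (model.get("description") or "").strip():
            gaps.append(f"{model['name']}: missing description")
        for col in model.get("columns", []):
            label = f"{model['name']}.{col['name']}"
            if not (col.get("description") or "").strip():
                gaps.append(f"{label}: missing description")
            # Accept either key name for column tests.
            if not (col.get("tests") or col.get("data_tests")):
                gaps.append(f"{label}: no tests")
    return gaps


gaps = metadata_gaps(schema)
for gap in gaps:
    print(gap)
```

Run in CI, a check like this would make missing descriptions and tests fail the build the same way a broken model does.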