Sun Mar 01 2026 19:00:00 GMT-0500 (Eastern Standard Time) · reference · source: Practical Data Modeling (Substack) · by Joe Reis
data-modeling · structured-data · semi-structured-data · unstructured-data · ml-artifacts · metadata · mixed-model-arts · chapter-4

Beyond Rows and Columns: The Five Forms of Data (Ch 4)

First chapter of Part 2 (Building Blocks). Catalogs five forms of data that modern modelers must handle:

  1. Structured — tables, rows/columns, SQL, relational databases. Still the home turf, but not the whole picture. The same data needs different models for OLTP, OLAP, and ML feature workloads.
  2. Semi-structured — JSON, XML, NoSQL. Buys flexibility at the cost of consistency. Recommends a hybrid approach: “shred stable fields, keep rest as raw.”
  3. Unstructured — text, images, audio, video. Modeled through metadata, derived features, and a reference pattern: the content lives in object storage, a reference to it lives in the relational model, and the derived features live in a vector DB.
  4. ML/AI artifacts — trained models, embeddings, feature vectors, agent traces, synthetic data. Provenance tracking is a modeling challenge.
  5. Metadata — business, operational, technical. The “connective tissue.” Modern table formats (Iceberg, Delta, Hudi) are essentially metadata-as-model.
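
To make the hybrid approach from item 2 concrete, here is a minimal sketch using only Python's standard library. The `events` table, the `STABLE_FIELDS` tuple, and the sample payloads are all hypothetical, chosen for illustration; the chapter does not prescribe a schema or a database.

```python
import json
import sqlite3

# Hypothetical: the fields we consider stable enough to promote to columns.
STABLE_FIELDS = ("event_id", "user_id", "event_type")


def shred(payload: str) -> tuple:
    """Split one JSON document into stable column values plus the raw remainder."""
    doc = json.loads(payload)
    stable = tuple(doc.get(name) for name in STABLE_FIELDS)
    rest = {k: v for k, v in doc.items() if k not in STABLE_FIELDS}
    # sort_keys keeps the stored remainder deterministic across runs.
    return stable + (json.dumps(rest, sort_keys=True),)


def load_events(conn: sqlite3.Connection, payloads: list[str]) -> None:
    """Create the (hypothetical) events table and insert the shredded rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        "event_id TEXT PRIMARY KEY, user_id TEXT, event_type TEXT, raw TEXT)"
    )
    conn.executemany(
        "INSERT INTO events VALUES (?, ?, ?, ?)",
        [shred(p) for p in payloads],
    )


conn = sqlite3.connect(":memory:")
load_events(conn, [
    '{"event_id": "e1", "user_id": "u1", "event_type": "click", "x": 10, "y": 20}',
    '{"event_id": "e2", "user_id": "u2", "event_type": "view", "referrer": "email"}',
])
rows = conn.execute(
    "SELECT event_id, event_type, raw FROM events ORDER BY event_id"
).fetchall()
```

The stable fields become queryable, constrainable columns, while anything variable stays in `raw` and can be shredded later if it turns out to be stable too.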

Also distinguishes form from format (e.g., a relational table vs. a Parquet file) and modeling intent (greenfield design) from modeling exhaust (brownfield reverse engineering).
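
One way to read the form vs. format distinction, sketched with the standard library: the same tabular records are serialized as CSV and as JSON Lines, and both load into identical rows. The `order_id` and `amount` columns and the sample values are made up for illustration, and this is my interpretation of the distinction, not an example from the chapter.

```python
import csv
import io
import json

# The same two logical records, serialized in two different formats.
CSV_TEXT = "order_id,amount\n1,9.50\n2,12.00\n"
JSONL_TEXT = '{"order_id": 1, "amount": 9.5}\n{"order_id": 2, "amount": 12.0}\n'


def from_csv(text: str) -> list[tuple[int, float]]:
    """Parse CSV text into typed (order_id, amount) rows."""
    reader = csv.DictReader(io.StringIO(text))
    return [(int(r["order_id"]), float(r["amount"])) for r in reader]


def from_jsonl(text: str) -> list[tuple[int, float]]:
    """Parse JSON Lines text into the same typed (order_id, amount) rows."""
    docs = (json.loads(line) for line in text.splitlines() if line.strip())
    return [(int(d["order_id"]), float(d["amount"])) for d in docs]


csv_rows = from_csv(CSV_TEXT)
jsonl_rows = from_jsonl(JSONL_TEXT)
```

Both loaders yield the same rows, which is the point: the form (fixed columns, one record per row) is a modeling decision, while the format is a storage and serialization decision that can change without touching the model.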

RDCO relevance

Expands the scope of what we should be thinking about in dbt projects. Most of our work is structured + semi-structured, but the metadata-as-first-class-citizen argument strengthens our case for investing in dbt documentation, descriptions, and tests as modeling artifacts, not afterthoughts.