The Great Data Debate — Data Lake vs. Warehouse Convergence
An a16z podcast debate between Martin Casado (a16z), Tristan Handy (dbt Labs), George Fraser (Fivetran), and Bob Muglia (former Snowflake CEO) on whether data lakes and data warehouses will converge.
Three types of data (Bob Muglia)
- Structured — tables, rows, columns
- Semi-structured — JSON, YAML, hierarchical
- Complex — documents, images, videos (not “unstructured” — all data has structure)
Complex data is increasingly becoming semi-structured through ML processing — models extract labels, transcripts, or embeddings from images and video, yielding JSON-like records.
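To make the structured/semi-structured distinction concrete, a minimal sketch that flattens a semi-structured JSON record into the flat column/value pairs of a structured row (the field names are hypothetical):

```python
import json

def flatten(record, prefix=""):
    """Recursively flatten a nested (semi-structured) dict into
    flat column -> value pairs, i.e. a structured row."""
    row = {}
    for key, value in record.items():
        col = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, prefix=col + "."))
        else:
            row[col] = value
    return row

# A semi-structured JSON document (hypothetical event payload)
doc = json.loads('{"user": {"id": 7, "plan": "pro"}, "event": "login"}')
row = flatten(doc)
# row == {"user.id": 7, "user.plan": "pro", "event": "login"}
```

The same record could live untouched in a lake as JSON or land in a warehouse as the flattened columns — the convergence debate is partly about where that transformation happens.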
The convergence thesis
- Snowflake and Databricks approach from opposite directions: SQL/declarative vs. procedural/code-based
- Each will end up offering both paradigms within its platform
- File storage, indexing, and metadata management will converge
- Use case drives technology — the architecture optimized for the dominant use case wins
Three paths to bridge analytics and ML
- ML into SQL — create UDFs for linear algebra (BigQuery’s approach). “Feature engineering is just another data transformation.”
- SQL into Python — use data frames to embed SQL in procedural code (Databricks’ approach)
- Arrow interchange — a format layer that lets everything talk to each other
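The "ML into SQL" path can be sketched with stdlib SQLite standing in for BigQuery: register a Python function as a SQL UDF so a linear-algebra step (here, a dot product over JSON-encoded vectors) runs inside an ordinary query. The table and column names are hypothetical, and real systems would use native array types rather than JSON strings.

```python
import json
import sqlite3

def dot(a_json: str, b_json: str) -> float:
    """Dot product of two JSON-encoded vectors -- a stand-in for the
    linear-algebra UDFs a warehouse would expose to SQL."""
    a, b = json.loads(a_json), json.loads(b_json)
    return float(sum(x * y for x, y in zip(a, b)))

conn = sqlite3.connect(":memory:")
conn.create_function("dot", 2, dot)  # expose the Python function to SQL

# Hypothetical feature table: one JSON-encoded vector per row
conn.execute("CREATE TABLE features (id INTEGER, vec TEXT)")
conn.executemany(
    "INSERT INTO features VALUES (?, ?)",
    [(1, "[1, 0, 2]"), (2, "[0, 3, 1]")],
)

# "Feature engineering is just another data transformation":
# score each row against a query vector entirely in SQL
rows = conn.execute(
    "SELECT id, dot(vec, '[2, 1, 0]') AS score FROM features ORDER BY id"
).fetchall()
# rows == [(1, 2.0), (2, 3.0)]
```

The second path is the mirror image — embedding that SQL inside a dataframe pipeline — and the Arrow path sidesteps the question by letting both sides exchange the same columnar buffers.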
The notebook as unifying layer
The notebook is the best candidate for the modern data document: language-agnostic, able to show data and code side by side with rich business context, and able to hide the code entirely to focus on business implications.
Key tension
Martin: operational AI is growing faster and will pull the stack toward data lake architecture. Bob: “Relational always wins” — SQL replaced navigational, hierarchical, OLAP, and MapReduce models.
Connects to data team operations, analytics engineering, tools and infrastructure.
Open questions
- Did the convergence thesis play out? (Databricks added SQL; Snowflake added Python)
- Is “feature engineering is just another transformation” still undersold?