great data debate a16z

Thu Apr 02 2026 20:00:00 GMT-0400 (Eastern Daylight Time) · article · source: https://a16z.com/2020/11/12/a16z-podcast-the-great-data-debate/ · by a16z (Martin Casado, Tristan Handy, George Fraser, Bob Muglia)

The Great Data Debate — Data Lake vs. Warehouse Convergence

An a16z podcast debate among Martin Casado (a16z), Tristan Handy (dbt), George Fraser (Fivetran), and Bob Muglia (ex-Snowflake CEO) on whether data lakes and data warehouses will converge.

Three types of data (Bob Muglia)

  1. Structured — tables, rows, columns
  2. Semi-structured — JSON, YAML, hierarchical
  3. Complex — documents, images, videos (not “unstructured” — all data has structure)

Complex data is increasingly becoming semi-structured through ML processing.

The convergence thesis

Three paths to bridge analytics and ML

  1. ML into SQL — create UDFs for linear algebra (BigQuery’s approach). “Feature engineering is just another data transformation.”
  2. SQL into Python — use data frames to embed SQL in procedural code (Databricks’ approach)
  3. Arrow interchange — a format layer that lets everything talk to each other
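
The first path (ML into SQL) can be sketched with a user-defined function. This is a minimal illustration only: it uses Python's built-in sqlite3 module as a stand-in for a warehouse, not BigQuery, and the `dot` function, the `customers` table, the derived feature, and the weights are all hypothetical names and values invented for the example. The feature (spend per visit) is computed in SQL, and a linear score is applied by the UDF inside the same query.

```python
import sqlite3


def dot(*args):
    """Hypothetical UDF: dot product over a flat argument list.

    The first half of the arguments are weights, the second half features.
    """
    half, remainder = divmod(len(args), 2)
    if remainder:
        raise ValueError("dot() needs an even number of arguments")
    return sum(w * x for w, x in zip(args[:half], args[half:]))


conn = sqlite3.connect(":memory:")
# narg=-1 lets the SQL function accept any number of arguments.
conn.create_function("dot", -1, dot)

# Illustrative data only; not taken from the podcast.
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, visits INTEGER, spend REAL)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, 2, 10.0), (2, 5, 40.0), (3, 0, 0.0)],
)

weights = (0.5, 0.1)  # assumed model weights for (visits, spend per visit)

# Feature engineering (spend per visit) and scoring both happen in SQL.
# MAX(visits, 1) is SQLite's two-argument scalar max, used to avoid
# dividing by zero for customers with no visits.
rows = conn.execute(
    """
    SELECT id,
           dot(?, ?, visits, spend / MAX(visits, 1)) AS score
    FROM customers
    ORDER BY id
    """,
    weights,
).fetchall()

for customer_id, score in rows:
    print(customer_id, score)
```

The second path inverts this arrangement: the query would be issued from procedural code and the result handled as a data frame, with the scoring done in Python rather than inside the SQL engine.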

The notebook as unifying layer

The notebook is best suited as the modern data document: it is language-agnostic and shows both data and code alongside rich business context. Code can be hidden so that readers focus on the business implications.

Key tension

Martin: operational AI is growing faster than analytics and will pull the stack toward data lake architecture. Bob: “Relational always wins.” SQL displaced the navigational, hierarchical, OLAP, and MapReduce models.

Connects to data team operations, analytics engineering, tools and infrastructure.

Open questions