The Great Data Debate — Data Lake vs. Warehouse Convergence
An a16z podcast debate between Martin Casado (a16z), Tristan Handy (dbt Labs), George Fraser (Fivetran), and Bob Muglia (former Snowflake CEO) on whether data lakes and data warehouses will converge.
Three types of data (Bob Muglia)
- Structured — tables, rows, columns
- Semi-structured — JSON, YAML, hierarchical
- Complex — documents, images, videos (not “unstructured” — all data has structure)
Complex data is increasingly becoming semi-structured through ML processing — models extract labels, transcripts, or embeddings from images and video, yielding JSON-like records.
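To make the structured/semi-structured distinction concrete, a minimal sketch that flattens a semi-structured JSON record into the flat column/value pairs of a structured row (the field names are hypothetical):

```python
import json

def flatten(record, prefix=""):
    """Recursively flatten a nested (semi-structured) dict into
    flat column -> value pairs, i.e. a structured row."""
    row = {}
    for key, value in record.items():
        col = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, prefix=col + "."))
        else:
            row[col] = value
    return row

# A semi-structured JSON document (hypothetical event payload)
doc = json.loads('{"user": {"id": 7, "plan": "pro"}, "event": "login"}')
row = flatten(doc)
# row == {"user.id": 7, "user.plan": "pro", "event": "login"}
```

The same record could live untouched in a lake as JSON or land in a warehouse as the flattened columns — the convergence debate is partly about where that transformation happens.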
The convergence thesis
- Snowflake and Databricks approach from opposite directions: SQL/declarative vs. procedural/code-based
- Each will end up offering both paradigms within its platform
- File storage, indexing, and metadata management will converge
- Use case drives technology — the architecture optimized for the dominant use case wins
Three paths to bridge analytics and ML
- ML into SQL — create UDFs for linear algebra (BigQuery’s approach). “Feature engineering is just another data transformation.”
- SQL into Python — use data frames to embed SQL in procedural code (Databricks’ approach)
- Arrow interchange — a format layer that lets everything talk to each other
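The "ML into SQL" path can be sketched with stdlib SQLite standing in for BigQuery: register a Python function as a SQL UDF so a linear-algebra step (here, a dot product over JSON-encoded vectors) runs inside an ordinary query. The table and column names are hypothetical, and real systems would use native array types rather than JSON strings.

```python
import json
import sqlite3

def dot(a_json: str, b_json: str) -> float:
    """Dot product of two JSON-encoded vectors -- a stand-in for the
    linear-algebra UDFs a warehouse would expose to SQL."""
    a, b = json.loads(a_json), json.loads(b_json)
    return float(sum(x * y for x, y in zip(a, b)))

conn = sqlite3.connect(":memory:")
conn.create_function("dot", 2, dot)  # expose the Python function to SQL

# Hypothetical feature table: one JSON-encoded vector per row
conn.execute("CREATE TABLE features (id INTEGER, vec TEXT)")
conn.executemany(
    "INSERT INTO features VALUES (?, ?)",
    [(1, "[1, 0, 2]"), (2, "[0, 3, 1]")],
)

# "Feature engineering is just another data transformation":
# score each row against a query vector entirely in SQL
rows = conn.execute(
    "SELECT id, dot(vec, '[2, 1, 0]') AS score FROM features ORDER BY id"
).fetchall()
# rows == [(1, 2.0), (2, 3.0)]
```

The second path is the mirror image — embedding that SQL inside a dataframe pipeline — and the Arrow path sidesteps the question by letting both sides exchange the same columnar buffers.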
The notebook as unifying layer
The notebook is the best candidate for the modern data document: language-agnostic, able to show data and code side by side with rich business context, and able to hide the code entirely to focus on business implications.
Key tension
Martin: operational AI is growing faster and will pull the stack toward data lake architecture. Bob: “Relational always wins” — SQL replaced navigational, hierarchical, OLAP, and MapReduce models.
Connects to data team operations, analytics engineering, tools and infrastructure.
Open questions
- Did the convergence thesis play out? (Databricks added SQL; Snowflake added Python)
- Is “feature engineering is just another transformation” still undersold?