DEDP 1.2 — The History and State of Data Engineering
Comprehensive timeline from BI origins to the 2025 landscape. The densest historical reference in the book.
Historical Timeline
1970s-1980s: Foundations
- Edgar F. Codd proposed the relational model (1970), abstracting storage complexities behind a declarative query interface; SQL later became its standard language
- Bill Inmon formally defined “data warehouse,” establishing foundational BI principles
- SQL evolved into variants (T-SQL, PL/SQL) with procedural capabilities
1996: Dimensional Modeling
- Ralph Kimball’s The Data Warehouse Toolkit — dimensional modeling approaches still used today (see the sketch after this list)
- Inmon vs Kimball debate begins (and never really ends)
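A minimal sketch of what dimensional modeling means in practice, using DuckDB from Python; the star schema here (fact_sales plus dim_customer and dim_date) is a hypothetical example, not one taken from the book:

```python
import duckdb

con = duckdb.connect()

# Kimball-style star schema: a narrow fact table surrounded by descriptive dimension tables.
con.execute("CREATE TABLE dim_customer (customer_id INT, region TEXT)")
con.execute("CREATE TABLE dim_date (date_id INT, year INT, month INT)")
con.execute("CREATE TABLE fact_sales (customer_id INT, date_id INT, amount DOUBLE)")

# The typical dimensional query: aggregate the facts, then slice by dimension attributes.
rows = con.execute("""
    SELECT d.year, c.region, SUM(f.amount) AS revenue
    FROM fact_sales f
    JOIN dim_customer c USING (customer_id)
    JOIN dim_date d USING (date_id)
    GROUP BY d.year, c.region
    ORDER BY d.year, revenue DESC
""").fetchall()
```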
Early 2000s: Distributed Computing
- Massively parallel processing (MPP) databases emerged
- Google published GFS (2003) and MapReduce (2004) — two papers that changed everything
- Hadoop emerged (2006) as an open-source implementation of the GFS/MapReduce ideas, with Yahoo as its main early backer
Early 2010s: Cloud Revolution
- AWS, GCP, Azure transformed infrastructure economics
- Key tools: Amazon Redshift, Snowflake, Apache Airflow (2014), Superset (2015), dbt (2016)
2017-2018: Discipline Formation
- Maxime Beauchemin published “The Rise of the Data Engineer” — formally defining the discipline
- Transition from “big data engineer” to “data engineer”
- Functional Data Engineering paradigm established: pure, idempotent tasks over immutable, partitioned data (sketched below)
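A minimal sketch of the functional paradigm's central idea: each task is a pure function of its inputs that overwrites exactly one partition, so reruns and backfills are safe. The paths, the event_date column, and the DuckDB engine are assumptions for illustration, not prescribed tooling.

```python
from pathlib import Path
import duckdb

def build_daily_partition(ds: str, source_glob: str, target_dir: str) -> Path:
    """Pure function of its inputs: reads immutable source files and (re)writes
    exactly one date partition, so rerunning the same `ds` is idempotent."""
    out_dir = Path(target_dir) / f"ds={ds}"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / "part-000.parquet"
    duckdb.sql(f"""
        COPY (
            SELECT *
            FROM read_parquet('{source_glob}')
            WHERE event_date = DATE '{ds}'  -- this task owns exactly one day
        ) TO '{out_file}' (FORMAT PARQUET)
    """)
    return out_file

# A backfill is just the same pure task re-applied to older partitions:
# for ds in ("2018-01-01", "2018-01-02"):
#     build_daily_partition(ds, "raw/events/*.parquet", "warehouse/events")
```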
State of the Field by Year
2022: Declarative Era
- Infrastructure-as-code, orchestration-as-code dominated
- Metadata management: cataloging, lineage, discovery
- Rust emerged for data-intensive apps (Ballista, Polars, DataFusion)
- Privacy regulations (GDPR, CCPA) heightened governance focus
2023: Renaissance and AI
- Data modeling renaissance amid Modern Data Stack (MDS) adoption challenges
- Vector databases (Pinecone, Qdrant) surged with GenAI
- Open Data Stack gained traction with open standards
- Open formats dominated: Parquet for files, Iceberg and Delta Lake as table formats
2024-2025: Return to Fundamentals
- AI strengthens rather than replaces data engineering roles
- Return to fundamentals: data modeling, SQL, lifecycle understanding
- Small data stack approaches gaining adoption (cost, speed, simplicity)
- Presentation and data quality remain critical bottlenecks
Key Insight
Beneath technological cycles lies an enduring need: “fresh, organized, and clean data.”
The book focuses on convergent design patterns recurring across eras rather than chasing technological trends.
Mental Models
- Technology waves, constant problems — every ~10 years the stack turns over, but the core needs (clean, fresh, organized, accessible data) never change. This is the strongest argument for pattern-based thinking.
- Paper → open-source → commercial → commodity — the lifecycle of data infrastructure (GFS paper → Hadoop → Snowflake → commodity cloud warehousing). Recognizing where a technology sits in this cycle informs build-vs-buy.
- Small data stack as counter-trend — after years of “big data” hype, 2024-2025 sees a pragmatic return to simpler, cheaper approaches: DuckDB, Polars, single-node processing (sketched below)
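A minimal sketch of the single-node approach, assuming a local directory of Parquet files (the path and column names are hypothetical): DuckDB queries the files in place with no cluster, warehouse, or ingestion step, and Polars' lazy API (scan_parquet plus group_by/agg/collect) expresses the same query on the same hardware.

```python
import duckdb

# One process, one machine: query Parquet files where they sit and aggregate.
top_customers = duckdb.sql("""
    SELECT customer_id, SUM(amount) AS total_spend
    FROM read_parquet('data/orders/*.parquet')
    GROUP BY customer_id
    ORDER BY total_spend DESC
    LIMIT 10
""").fetchall()
```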
Related
- Intro to Data Engineering — overview and personal context
- Challenges in Data Engineering — lifecycle challenges
- DWH, MDM, Data Lake — architecture evolution details
- MV, OBT, dbt, OLAP, DWA — modeling approach evolution
- ETL Tool Comparisons — tool evolution from bash to Python
- Semantic Layer & BI — serving layer evolution
- Data Contracts & Schema Evolution — integration evolution