DEDP 1.3 — Challenges in Data Engineering
Catalogs the major obstacles across the data engineering lifecycle. Useful reference for scoping consulting engagements — every client challenge maps to one of these categories.
Data Engineering Lifecycle Stages
1. Generation
- Data volume from apps and devices grows exponentially
- Coordinating synchronization with production source systems
- Managing schema changes as source systems evolve
- Handling variations in data arrival frequency
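The schema-change problem above can be sketched in miniature with a tolerant-reader approach: the pipeline declares the fields it expects, so a producer adding or dropping a field surfaces explicitly instead of breaking ingestion. All field names here are illustrative, not from the source.

```python
# Schema-tolerant ingestion sketch: project raw records onto an expected
# schema; missing fields get defaults, unexpected fields are preserved for
# inspection rather than silently dropped. Illustrative names only.

EXPECTED_FIELDS = {"user_id": None, "event": None, "ts": None}

def normalize(record: dict) -> dict:
    """Project a raw record onto the expected schema."""
    out = {k: record.get(k, default) for k, default in EXPECTED_FIELDS.items()}
    # Fields introduced by a producer-side schema change end up in 'extras'.
    extras = {k: v for k, v in record.items() if k not in EXPECTED_FIELDS}
    if extras:
        out["extras"] = extras
    return out

# A producer added a new 'device' field and dropped 'ts':
raw = {"user_id": 1, "event": "click", "device": "ios"}
print(normalize(raw))
# {'user_id': 1, 'event': 'click', 'ts': None, 'extras': {'device': 'ios'}}
```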
2. Storage
- Three Vs framework: Volume (massive amounts), Velocity (processing speed), Variety (structured to unstructured)
- Format decisions compound over time
3. Ingestion / Integration
- Complexity ranges from simple scripts to sophisticated ETL/ELT platforms
- Decisions: duplication strategies, frequency patterns, architectural approach
- Staging → cleansing → core → mart model remains standard
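The staging → cleansing → core → mart model can be sketched as plain functions over lists of dicts; a real implementation lives in a warehouse behind an orchestrator, but the data flow through the layers is the same. Field names and business rules here are made up for illustration.

```python
# Hedged sketch of the four-layer model; each layer has one responsibility.

def stage(raw_rows):
    """Staging: land the data as-is, no transformation."""
    return list(raw_rows)

def cleanse(rows):
    """Cleansing: drop malformed rows, normalize types."""
    return [
        {**r, "amount": float(r["amount"])}
        for r in rows
        if r.get("amount") is not None
    ]

def core(rows):
    """Core: apply shared business logic (here: tag large orders)."""
    return [{**r, "is_large": r["amount"] >= 100} for r in rows]

def mart(rows):
    """Mart: shape data for one consumer (here: totals by customer)."""
    totals = {}
    for r in rows:
        totals[r["customer"]] = totals.get(r["customer"], 0.0) + r["amount"]
    return totals

raw = [
    {"customer": "a", "amount": "120.5"},
    {"customer": "b", "amount": None},   # dropped in cleansing
    {"customer": "a", "amount": "30"},
]
print(mart(core(cleanse(stage(raw)))))  # {'a': 150.5}
```

The point of the layering is that each stage can be re-run, tested, and owned independently, which is why the model has remained standard.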
4. Transformation
- Automating business logic into code is persistently hard
- Evolving requirements, tool selection among thousands of options
- Persistent vs on-demand data definitions (relates to cache pattern and materialized views)
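The persistent vs on-demand trade-off can be shown in miniature: the same aggregation either recomputed on every read (like a view) or computed once and served from storage (like a materialized view or cache). The data and names are illustrative.

```python
# On-demand vs persistent data definitions, sketched over an in-memory list.

orders = [{"sku": "x", "qty": 2}, {"sku": "x", "qty": 3}, {"sku": "y", "qty": 1}]

def qty_by_sku():
    """On-demand: always reflects current data, pays compute on every read."""
    out = {}
    for o in orders:
        out[o["sku"]] = out.get(o["sku"], 0) + o["qty"]
    return out

# Persistent: computed once; cheap and stable to read, but stale until refreshed.
materialized = qty_by_sku()

orders.append({"sku": "y", "qty": 4})

print(qty_by_sku())   # {'x': 5, 'y': 5}  -- sees the new row
print(materialized)   # {'x': 5, 'y': 1}  -- stale until explicitly refreshed
materialized = qty_by_sku()  # the refresh step an orchestrator would schedule
```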
5. Serving
- Presenting data compellingly determines adoption
- Format selection: dashboards, APIs, datasets
- Narrative-driven insights vs raw data (relates to semantic layer)
Undercurrents (Cross-Lifecycle Challenges)
| Undercurrent | Key Challenges |
|---|---|
| Orchestration | Dependency management, intermediate stages, workflow modeling, execution tracking |
| Software Engineering | Git, testing, open-source contribution, multi-language mastery (Python, Scala, Java, SQL, Rust) |
| Security | Permission management, row-level security, balancing protection vs innovation |
| Data Management | Governance, lineage, storage ops, lifecycle management, privacy compliance, discoverability |
| DataOps | IaC, containerization, monitoring, collaborative agile culture |
| Data Architecture | Complex interdependent systems, balancing upfront planning vs rapid prototyping |
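The orchestration undercurrent (dependency management, workflow modeling, execution tracking) reduces to running a dependency graph in a valid order. A minimal sketch using the standard library's `graphlib`; real orchestrators such as Airflow or Dagster add scheduling, retries, and persisted state. Task names mirror the layer model above and are illustrative.

```python
# Tiny DAG runner: tasks declare their predecessors, and the scheduler
# executes them in topological order while recording what ran.

from graphlib import TopologicalSorter

def run(dag, tasks):
    executed = []
    for name in TopologicalSorter(dag).static_order():
        tasks[name]()          # execute the task
        executed.append(name)  # execution tracking
    return executed

# Each key runs only after all of its listed predecessors.
dag = {
    "ingest": set(),
    "cleanse": {"ingest"},
    "core": {"cleanse"},
    "mart": {"core"},
    "report": {"mart"},
}
tasks = {name: (lambda n=name: print(f"running {n}")) for name in dag}
print(run(dag, tasks))
# ['ingest', 'cleanse', 'core', 'mart', 'report']
```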
Traditional BI Pain Points (Historical Context)
- Slow integration of new data sources
- Lack of transparency in transformation logic
- User dependency on specialized engineers
- Difficulty with semi-structured and unstructured data
- Limited real-time availability (daily-only refresh was standard)
These are the problems that drove the evolution toward modern ETL approaches and the data lake/warehouse convergence.
Work Product Pyramid
Three ascending levels:
- Infrastructure setup — storage formats, orchestration
- Data foundation — schemas, business logic, data layers
- Data accessibility — dashboards, notebooks, datasets
Outcome: “consistent measurement and high-quality data based on a stable yet observable data platform”
Mental Models
- Lifecycle as diagnostic framework — when a client says “our data is bad,” map the symptom to a lifecycle stage. Generation problems need different solutions than transformation problems.
- Undercurrents as maturity indicators — the six undercurrents (orchestration, SWE, security, data management, DataOps, architecture) are a maturity checklist. Most orgs are strong in 1-2 and weak in the rest.
- Work Product Pyramid as prioritization — you cannot build data accessibility without data foundation, and you cannot build foundation without infrastructure. Teams that jump to dashboards without solid infrastructure always backtrack.
Related
- Intro to Data Engineering — field overview
- History and State of DE — how we got here
- DE Workspace Packaging — infrastructure-level pattern
- Data Asset Reusability — addressing transformation challenges
- Cache Pattern — performance and serving optimization
- Dynamic Queries — transformation pattern