Data Science for Business — Foster Provost & Tom Fawcett
Summary
The definitive bridge between data science technique and business application. Provost and Fawcett map the canonical data mining tasks, the CRISP-DM process, and the organizational structures that make data science actually work. Core mental models:
- Two Types of Data-Driven Decisions. (a) Discovery-based: finding unknown patterns in data. (b) Repetitive at scale: decisions made millions of times where even small accuracy improvements compound. The type determines the approach.
- Canonical Data Mining Tasks. Classification (predict category), regression (predict number), similarity matching (find alike items), clustering (group by similarity without a target), co-occurrence grouping / market-basket analysis (find items that appear together in transactions), profiling (characterize typical behavior for anomaly detection), link prediction (predict connections between items), and causal modeling (understand what actually influences what). The critical skill: decompose a business problem into pieces that map to known tasks.
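Co-occurrence grouping is the most mechanical of these tasks to sketch. A minimal illustration, assuming a hypothetical list of grocery transactions (the data, the function name `cooccurrence_counts`, and the item names are all invented for the example):

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(transactions):
    """Count how often each unordered pair of items appears together."""
    pair_counts = Counter()
    for basket in transactions:
        # Each basket contributes one count per distinct item pair.
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1
    return pair_counts

# Hypothetical transaction log for illustration.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "cereal"},
    {"bread", "butter", "cereal"},
]

counts = cooccurrence_counts(transactions)
print(counts.most_common(1))  # [(('bread', 'butter'), 3)]
```

Real market-basket analysis adds support/confidence/lift thresholds on top of these raw counts, but the decomposition step the book emphasizes is recognizing that "which products sell together?" reduces to exactly this task.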
- A Model Is a Simplified Representation of Reality Created to Serve a Purpose. This definition is load-bearing. Models don’t need to be “right” — they need to be useful for the decision at hand. Descriptive models tell you what churning customers look like; predictive models tell you who will churn. Different purposes, different standards.
- CRISP-DM Process. Business Understanding -> Data Understanding -> Data Preparation -> Modeling -> Evaluation -> Deployment. The critical insight is that data mining is closer to R&D than engineering — it requires problem formulation skill, rapid prototyping, and comfort with ambiguity, not just technical execution.
- The Over-the-Wall Problem. “Your model is not what the data scientists design, it’s what the engineers build.” Data science engineers — people who understand both the production system and the data science — are the critical bridge role. Data scientists should remain involved through deployment; engineers should be involved from the start.
- Overfitting as Organizational Hazard. Looking too hard at data finds something that won’t generalize. This applies to business decisions too — optimizing too narrowly on historical patterns creates brittleness. Evaluation environments that mirror production are essential before deployment.
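The overfitting point can be made concrete with an extreme case: a "model" that simply memorizes its training data scores perfectly on data it has seen and poorly on anything new. A minimal sketch, assuming invented synthetic data (the noise level, the function names, and the 75/25 split are all illustrative choices, not from the book):

```python
import random

def memorizing_model(train):
    """Degenerate model: memorize every training example exactly,
    fall back to the majority class on unseen inputs."""
    lookup = {x: y for x, y in train}
    labels = [y for _, y in train]
    default = max(set(labels), key=labels.count)
    return lambda x: lookup.get(x, default)

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

random.seed(0)
# Synthetic labeled data: the label depends weakly on x, with 30% noise.
data = [(i, (i % 3 == 0) if random.random() > 0.3 else (random.random() > 0.5))
        for i in range(200)]
train, holdout = data[:150], data[150:]

model = memorizing_model(train)
print(accuracy(model, train))    # 1.0 -- perfect on data it has seen
print(accuracy(model, holdout))  # substantially lower: the memorized detail doesn't generalize
```

The gap between the two numbers is exactly what an evaluation environment that mirrors production is meant to expose before deployment.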
Relevance
- 06-reference/2026-04-03-data-products-taxonomy — The canonical task taxonomy maps directly to data product types. Each product serves one or more of these tasks.
- 06-reference/2026-04-03-analytics-engineering-everywhere — The “over-the-wall” problem is why analytics engineering exists. The role bridges the data science / data engineering gap.
- 06-reference/2026-04-03-reforge-why-analytics-efforts-fail — Analytics efforts fail when they skip Business Understanding and jump to Modeling. CRISP-DM’s first step is the most important.
- 06-reference/2026-04-03-scaling-data-informed-driven-led — The two types of decisions map to data-informed (discovery) vs. data-driven (repetitive at scale).
Open Questions
- How does the CRISP-DM process adapt to LLM-based analytics where the “model” is a prompt chain rather than a trained classifier?
- Is the canonical task taxonomy still complete, or do generative AI tasks need a new category?