Reforge — Experimentation Foundations
Summary
This consolidates the foundational Reforge material on why experimentation matters, when to use it, and how to prepare — the prerequisites and guardrails that come before the strategic vs. ad hoc distinction covered in 06-reference/2026-04-03-reforge-strategic-experimentation.
Why Experimentation Is Critical
Neither intuition nor data alone is sufficient for good decisions.
Intuition fails because:
- We overestimate the probability our own ideas will work.
- We overestimate the impact of initiatives to justify pursuing them.
- We inflate success in hindsight and overplay our understanding of why things worked.
- As the customers, the business, and the product change, our intuition drifts without us noticing.
Data alone fails because:
- Past performance does not equal future performance.
- People interpret data subjectively to support pre-existing narratives.
- Summarizing data loses the nuance and context that give it meaning.
Experimentation bridges the gap — it refines intuition with objective customer data to get closer to truth. It provides: a common language, problem decomposition into testable assumptions, connection of ideas to metrics, directional progress on big ideas, and deeper learnings through structured hypothesis building.
Three Hurdles to Experimentation Culture
- Culture — the organization does not promote the decision-making and information-gathering practices experimentation requires.
- Myths and beliefs — strongly held opinions undermine experimentation’s potential impact. See 06-reference/2026-04-03-five-myths-of-experimentation for the specific myths.
- Narrow, ad hoc approach — treating experimentation as one-off tests rather than a strategic system. See 06-reference/2026-04-03-reforge-strategic-experimentation.
Cultural Barriers
Experimentation bridges the gap between perception and reality, but three things create distrust:
- Inability to read statistical results — experimentation produces shades of grey (statistical significance), not binary yes/no. The burden is on the experimentation owner to communicate clearly.
- Multiple sources of truth — poor data infrastructure produces conflicting data. Nothing kills confidence in experiments faster than messy data.
- Ingrained mental models — decision-makers are reluctant to change beliefs formed from opinion or old data, even when experiments show those relationships have shifted.
When to Use Experimentation
Experimentation is not the right tool for every problem (strategic pivots, backend infrastructure, platform compatibility, and persona expansion may not be testable). Three criteria must be met:
1. Minimum prerequisite capabilities:
- Technological infrastructure to deploy experiments efficiently.
- Sufficient time to run tests long enough to reach valid results.
- Team understanding of how to set up tests, avoid false signals, and analyze results.
2. No disqualifying constraints:
- Scale constraints — not enough users/customers for statistical validity.
- Decision timing constraints — the decision must be made faster than test cycles allow; take a leap of faith instead.
- Non-product constraints — legal, compliance, or regulatory requirements that dictate the solution (e.g., SOX compliance).
- Product constraints — big reveals, launches, or hardware where partial testing isn’t feasible.
- Measurability constraints — the initiative’s purpose isn’t metric-driven (admin tools, internal features). Use do-no-harm testing at most.
3. Well-defined inputs (avoid garbage-in, garbage-out):
- Experiments must be grounded in organizational mission and strategy — otherwise you polish ideas from “unimportant” to “refined but still unimportant.”
- The problem must be clearly defined — loose problems produce loosely actionable insights.
- Success metrics must be identified upfront — never run an experiment, scan a wide array of metrics, and cherry-pick the ones that improve.
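To make the cherry-picking danger concrete, here is a minimal simulation (my illustration, not from the source), assuming 20 independent metrics and no real treatment effect: scanning the dashboard after the fact almost guarantees a spurious "winner."

```python
# Illustrative simulation (not from the source): control and variant are
# drawn from the SAME distribution, so any "significant" metric is noise.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_runs, n_metrics, n_users = 1_000, 20, 500
false_wins = 0
for _ in range(n_runs):
    control = rng.normal(size=(n_metrics, n_users))
    variant = rng.normal(size=(n_metrics, n_users))
    p_values = ttest_ind(control, variant, axis=1).pvalue
    false_wins += (p_values < 0.05).any()  # did any metric "improve"?

print(false_wins / n_runs)  # ~0.64, matching 1 - 0.95**20
```

With 20 metrics and a 0.05 threshold, at least one false positive appears in roughly 64% of tests, which is why the success metric must be fixed before the experiment runs.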
Three Components of a Strategic Opportunity
Before running experiments, identify the strategic opportunity:
- Strategy — understand the organization’s mission and how the growth model supports it. The four sub-questions: How do we acquire? Retain? Monetize? Defend and improve? See 06-reference/2026-04-03-reforge-defining-strategy.
- Customer problem — defined by tying behavior to business impact and understanding why the problem exists. Problems come from two sources:
- User-identified — specific UX issues without clear business impact.
- Data-identified — KPI anomalies without clear customer explanation.
A well-defined problem connects both: the behavior AND its business impact AND the underlying cause.
- Business outcome — three levels of behavior metrics:
- Individual actions (click rates, time on page) — low strategic value alone.
- Actions signaling intent (pricing page visits, referrals sent) — intermediate value.
- Outcome-creating actions (paid conversion, activation, new user signup) — high value. Experimentation success should always be measured at this level.
Test Preparation: Statistical Foundations
- The null hypothesis is the default assumption that two variations perform the same.
- P-value is the probability of seeing a result at least as extreme when the null is true; keeping it low guards against a Type 1 error (declaring a difference when none exists). Target: p < 0.05. When a real effect exists, the p-value shrinks as samples accumulate.
- Statistical power is the probability of avoiding a Type 2 error (failing to detect a real difference, i.e., a false negative). Higher is better, unlike p-value.
- Both must be considered together — running underpowered tests leads to missed real effects.
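As a sketch of how these pieces combine in practice, the standard normal-approximation formula for a two-proportion test converts a target alpha and power into a required sample size. The conversion rates below are illustrative assumptions, not Reforge figures.

```python
# Approximate users per variant needed to detect a conversion lift at a
# given significance level (alpha, Type 1 control) and power (Type 2 control).
from scipy.stats import norm

def sample_size_per_arm(p_base: float, p_target: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided significance threshold
    z_power = norm.ppf(power)          # protects against missing a real effect
    variance = p_base * (1 - p_base) + p_target * (1 - p_target)
    n = ((z_alpha + z_power) ** 2 * variance) / (p_base - p_target) ** 2
    return int(n) + 1

# Detecting a 3% -> 4% lift needs roughly 5,300 users per arm; halving the
# detectable lift roughly quadruples the requirement.
print(sample_size_per_arm(0.03, 0.04))
```

This arithmetic also answers the scale-constraint question directly: if traffic cannot reach the required n per arm within a reasonable test window, the constraint is binding.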
The Growth Experiment Process (supplementary)
From Conor Dewey: growth is “the scientific method applied to KPIs.” The process:
- Build a whiteboard-level quantitative model of how your product grows — break down acquisition channels, activation rate, retention curve.
- Identify the highest-leverage points in the model.
- Dig one level deeper with data analysis and segmentation to discover specific problems.
- Frame as: Opportunity -> Problem -> Question -> Hypotheses -> Prioritize -> Solutions.
Example: “Most team invites come from onboarding. 55% engage with the form but only 3% invite. Why? Hypotheses: users don’t understand the benefit, prefer a different invite mechanism, or face too much friction typing emails.”
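A whiteboard-level model like the one step one describes can be sketched in a few lines of code (every input rate below is a hypothetical placeholder, not a figure from the source or the example above); bumping each input in turn surfaces the highest-leverage point for step two.

```python
# Toy growth model: direct signups plus a viral loop from team invites.
def monthly_new_active_users(visitors: float, signup_rate: float,
                             activation_rate: float,
                             invites_per_user: float,
                             invite_accept_rate: float) -> float:
    direct = visitors * signup_rate * activation_rate
    viral = direct * invites_per_user * invite_accept_rate * activation_rate
    return direct + viral

baseline = dict(visitors=50_000, signup_rate=0.04, activation_rate=0.50,
                invites_per_user=0.30, invite_accept_rate=0.20)
base = monthly_new_active_users(**baseline)

# Leverage check: apply a 10% improvement to each input, one at a time.
for lever in baseline:
    bumped = {**baseline, lever: baseline[lever] * 1.1}
    lift = monthly_new_active_users(**bumped) / base - 1
    print(f"{lever:>18}: +{lift:.1%}")
```

With these placeholders, activation_rate carries the most leverage because it appears in both the direct and viral paths, so step three's data digging and segmentation would start there.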
Relevance to projects:
- 01-projects/data-marketplace/index — Before launching experiments, apply the three-criteria checklist. Early on, scale constraints will likely disqualify A/B testing. Use the strategic opportunity framework to identify the 1-2 customer problems worth solving pre-launch, and design experiments that can run with small samples (qualitative research, landing page tests, concierge tests).
- 01-projects/newsletter/index — The newsletter has enough volume for basic experimentation. The growth experiment process template (opportunity -> problem -> question -> hypotheses) is a good lightweight framework for testing subject lines, CTAs, and content formats.
Connects to 06-reference/2026-04-03-reforge-strategic-experimentation (strategic vs. ad hoc distinction), 06-reference/2026-04-03-five-myths-of-experimentation (common myths that create cultural barriers), 06-reference/2026-04-03-reforge-defining-strategy (strategy as input to experimentation), 06-reference/2026-04-03-reforge-why-analytics-efforts-fail (data infrastructure as prerequisite), and 06-reference/2026-04-03-reforge-growth-models (growth model as the source of strategic opportunities).
Open Questions
- For Ray Data Co projects, which of the five experimental constraints are binding right now? Scale constraints are likely the biggest blocker — what is the minimum sample size needed for useful experiments?
- How do you build experimentation culture as a solo operator? The cultural barriers (distrust, ingrained beliefs) are less relevant, but the discipline of structured hypothesis building still matters.
- The test preparation material on statistical power suggests many small companies run underpowered tests and make bad decisions from them. Is it better to not test at all than to test with insufficient power?