06-reference

commoncog becks measurement model

2026-04-14 · reference · source: Commoncog · by Cedric Chin

“Beck’s Measurement Model, or Why It’s So Damn Hard to Measure Software Development” — @CedricChin

Why this is in the vault

Beck’s four-layer model (effort → output → outcome → impact) is the cleanest framework I’ve seen for explaining why AI-agent output is so hard to measure — and it maps directly onto the MAC severity tiers and the agent-deployer JD. With the full essay now in hand, we also get Chin’s contrast against Colin Bryar’s Amazon flywheel answer, and his closing point that no single metric set suffices for software — only a combination of practices (read: the WBR) gets you to predictive knowledge. This is the measurement-theory backbone for anything RDCO says about “productivity” of software or agents.

The core argument (paraphrased)

Why measuring software dev is hard: the value chain has four distinct layers, and most metrics measure the wrong ones.

Chin opens with the August 2023 McKinsey report (“Yes, You Can Measure Software Developer Productivity”) proposing four metrics — Inner/Outer Loop Time Spent, Developer Velocity Index Benchmark, Contribution Analysis, Talent Capability Score — “presented with a beautiful sheen of legitimacy”. The dev community trashed it. Kent Beck and Gergely Orosz responded with a two-part rebuttal that Chin crystallizes into Beck’s Measurement Model.

The model, in Beck’s own framing:

  1. Effort: “planning, coding and so on”. The input.
  2. Output: “tangible things like the feature itself, the code, design documents”.
  3. Outcome: “Customers will behave differently as a result” (e.g., fewer stuck in onboarding).
  4. Impact: “value flowing back to us like feedback, revenue, referrals”.

Each arrow between layers is lossy and lagged. McKinsey’s four metrics all sit at the Effort and Output layers — time spent, contributions, velocity benchmarks — but the value businesses care about lives at Outcome and Impact, two or three causal hops away. Chin’s sharp claim: “the more a metric lies on the ‘Effort’ and ‘Output’ side of the measurement model, the more likely you’re going to get gamed metrics” divorced from business outcomes.
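
To make the layer logic concrete, a minimal sketch (Python chosen arbitrarily; the metric-to-layer assignments are my reading of the essay, not Beck's or McKinsey's):

```python
from enum import IntEnum

class Layer(IntEnum):
    """Beck's four layers, ordered by causal distance from business value."""
    EFFORT = 1   # planning, coding, time spent
    OUTPUT = 2   # the feature itself, the code, design documents
    OUTCOME = 3  # changed customer behavior
    IMPACT = 4   # value flowing back: feedback, revenue, referrals

# My reading of where the essay places McKinsey's metrics, plus two
# contrasting examples; assignments are illustrative, not Beck's.
METRIC_LAYER = {
    "inner/outer loop time spent": Layer.EFFORT,
    "developer velocity index benchmark": Layer.EFFORT,
    "contribution analysis": Layer.OUTPUT,
    "talent capability score": Layer.EFFORT,
    "onboarding completion rate": Layer.OUTCOME,
    "revenue per customer": Layer.IMPACT,
}

def gaming_risk(metric: str) -> str:
    """Chin's claim: the closer a metric sits to Effort/Output, the easier to game."""
    return "high" if METRIC_LAYER[metric] <= Layer.OUTPUT else "lower, but lagged and lossy"
```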

The asymmetry with sales/recruiting. Chin previously thought software was hard to measure because the work was variable (a bug fix ≠ a refactor ≠ a feature launch). He now corrects himself: “it’s not the variability of the work that matters, it’s the tightness of the relationship between Effort + Output to Outcomes + (business) Impact”. Sales has a short, tight loop (call → deal → revenue, same quarter). Recruiting is the same (source → hire → retention). Software has a long, lossy one: changing the nature of value flows in a complex system. The feedback loop is so loose that entire statistical sub-disciplines exist just to attribute user-behavior changes to product changes.

The predictive power test. Chin likes Beck’s model partly because it predicts other hard-to-measure domains: editorial work at The New Yorker; brand marketing (vs. performance marketing — he cites a ride-hailing company shifting budget from perf to brand precisely because perf was over-measured); Selena Gomez’s Rare Beauty, whose influencer wellness retreats, 1%-to-mental-health pledge, and founder vulnerability all compound into brand value with no tight metric loop. Anywhere the effort→impact chain is long and lossy, measurement goes soft.

Bryar’s Amazon-flywheel counter-answer. Chin contrasts McKinsey with Colin Bryar’s answer to “how do you measure software engineering at Amazon?” Bryar’s reply: engineers never got a free pass, but Amazon didn’t measure them on lines of code. Instead, the flywheel (price / selection / convenience) was drilled so deep that every engineer knew which of the three drivers their feature pushed, and metrics for engineering tied back to those drivers. “You should imagine that you’re hovering over the shoulder of your happiest customer … what observable behaviours are they doing that you can detect?” Bryar’s advice sits squarely on the Outcome + Impact end of Beck’s model — the opposite of McKinsey. Executives take it on faith that moving price/selection/convenience drives long-run impact.
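
A toy encoding of Bryar's pattern (the feature record and behavior metric are invented; the point is only the invariant, that every deliverable declares its driver and an observable customer behavior):

```python
# Amazon's flywheel drivers, per Bryar; the feature record below is invented.
FLYWHEEL_DRIVERS = {"price", "selection", "convenience"}

feature = {
    "name": "one-click reorder",                    # hypothetical feature
    "driver": "convenience",                        # which flywheel driver it pushes
    "observable_behavior": "repeat-purchase rate",  # what the happiest customer does
}

# Bryar's discipline as an invariant: no feature ships without a named driver.
assert feature["driver"] in FLYWHEEL_DRIVERS, "every feature must push a named driver"
```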

Where this leaves us. Chin admits neither answer is a silver bullet. His two closing observations:

  1. Software measurement is a combination of practices, not a single metric set. From his own months running an Amazon-style Weekly Business Review at Commoncog, he’s developing “a fingertip feel for Commoncog’s performance” — predicting routine variation, making weekly bets, starting to guess correctly what moves the numbers. Marketing and software both have loose feedback loops; you don’t get the clarity of sales, but “having this sense is not nothing” — beats running on superstition.
  2. Both Bryar and Beck fall out of a process-control worldview. Everything is a process. Outputs you care about live in the Outcome/Impact buckets. You pick controllable inputs, guess at the causal relationship, drive the input, watch the output, learn. Do it on a cadence. Put it together and “what you get is something close to a Weekly Business Review.”
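
The process-control loop is simple enough to sketch; a minimal, hedged rendering (field names invented, loosely patterned on Chin's description of his own WBR, not his actual metrics):

```python
from dataclasses import dataclass

@dataclass
class WeeklyBet:
    """One row of a WBR-style cadence: drive an input, predict the output, check, learn."""
    input_metric: str   # a controllable input (Effort/Output side)
    output_metric: str  # the Outcome/Impact metric you believe it drives
    action: str         # what you changed this week
    prediction: str     # your guess at how the output will move
    observed: str = ""  # filled in next week; prediction vs. observed is the learning

bet = WeeklyBet(
    input_metric="essays published per week",
    output_metric="trial-to-paid conversion",
    action="shipped two case studies",
    prediction="conversion up slightly, 1-2 week lag",
)
```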

The teaser: Chin is reaching out to Bryar for permission to write the WBR up publicly. (That becomes Parts 13-14 of the series.)

Mapping against Ray Data Co

This is strong-mapping. Five concrete links:

1. Agent-deployer measurement is software-dev measurement, one abstraction up. Per 2026-04-14-levie-agent-deployer-role-jd, the agent-deployer role needs measurable success criteria. Beck’s model says: don’t measure “tokens generated” or “agent actions executed” (effort/output). Measure outcome (did downstream human behavior shift — fewer tickets, faster close, better decisions?) and impact (did value flow back — revenue, retention, cost reduction?). Most early AI-agent deployments fail precisely where McKinsey did: they instrument effort/output, then can’t explain why leadership doesn’t feel the productivity gains. The fix, per Bryar’s Amazon pattern, is to drill a value flywheel into the agent-deployer so every agent shipped is justified by which driver it pushes. For RDCO clients that’s usually: time-to-insight, cost-per-decision, or error-rate-reduction. MAC severity tiers should explicitly bind to the outcome layer, not the output layer.
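
One hedged sketch of what "bind to the outcome layer" could mean operationally (metric names and the registry shape are invented for illustration, not any client's actual schema):

```python
# Hypothetical agent-deployment metric registry (all names invented).
AGENT_METRICS = {
    "tokens_generated":       "output",
    "agent_actions_executed": "output",
    "tickets_deflected":      "outcome",
    "time_to_insight_hours":  "outcome",
    "cost_per_decision":      "impact",
}

def validate_success_criteria(criteria: list[str]) -> None:
    """Reject a deployment whose success criteria all sit at the effort/output layers."""
    layers = {AGENT_METRICS[name] for name in criteria}
    if layers <= {"effort", "output"}:
        raise ValueError("success criteria are all effort/output-layer; "
                         "bind success (and MAC severity) to outcome or impact")

validate_success_criteria(["tokens_generated", "tickets_deflected"])  # passes
```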

2. MAC’s six-basis matrix is a layer-aware measurement grid. The MAC matrix (3 scopes × 6 bases — absolute, rel-source, rel-production, rel-recon, temporal, human) reinterprets cleanly through Beck’s lens: absolute/rel-source/rel-production/rel-recon are output-layer checks (did the artifact come out right?); temporal is outcome-layer (does behavior stay stable over time?); human is impact-layer (does the end user trust it enough to act, and does that action generate value?). This labeling is load-bearing: it explains why a fully-passing output-layer suite doesn’t guarantee business value — the argument we need when clients push back on paying for the full MAC practice. See ../01-projects/data-quality-framework/testing-matrix-template.
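
The same labeling as a lookup (basis names are from the MAC matrix as described above; the function and its use are illustrative):

```python
# The note's mapping of MAC's six bases onto Beck's layers.
BASIS_TO_LAYER = {
    "absolute":       "output",
    "rel-source":     "output",
    "rel-production": "output",
    "rel-recon":      "output",
    "temporal":       "outcome",
    "human":          "impact",
}

def uncovered_layers(passing_bases: set[str]) -> set[str]:
    """Why a fully green output-layer suite still doesn't guarantee business value."""
    covered = {BASIS_TO_LAYER[b] for b in passing_bases}
    return {"outcome", "impact"} - covered

# All four artifact checks green, yet outcome and impact remain unmeasured:
print(uncovered_layers({"absolute", "rel-source", "rel-production", "rel-recon"}))
```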

3. phData / MG sales pitch: “four layers, pick your layer honestly.” The common failure pattern at phData-style data-engineering shops (per MG context) is selling dashboards that measure output — row counts, job runtimes, SLA compliance — and calling it “data-driven.” Beck’s model gives us the vocabulary to push the conversation one layer up: what outcome are you trying to shift, and how will you know? That reframes the sale from “do you want better dashboards?” (commodity) to “do you want a measurement discipline that ties engineering output to business impact?” (RDCO’s actual pitch). Bryar’s flywheel is the template: surface your client’s 2-3 value drivers, tie every data-engineering deliverable back to one of them, and the output metrics become diagnostic rather than terminal.

4. State-ownership is the persistence substrate for outcome/impact measurement. Outcome and impact are lagged — not visible in the sprint, only in the quarter. That requires durable state: the vault, the MAC history, the model of the business. Per ../04-tooling/rdco-state-ownership-architecture, the client owns this. Without it, every new model/agent rollout restarts the measurement clock and you can never close the effort→impact loop. Beck’s model is the why for state-ownership: short-horizon effort metrics are cheap and fungible; long-horizon impact metrics require memory, and memory is a moat.
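
A minimal sketch of the persistence substrate, assuming an invented table shape (the point is a client-owned history that survives agent swaps, so the effort→impact clock never resets):

```python
import sqlite3

# Invented schema: one long-lived, client-owned table of layer-tagged observations.
conn = sqlite3.connect("client_owned_metrics.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS metric_history (
        observed_at   TEXT NOT NULL,  -- ISO date; impact lags by quarters, so keep everything
        metric        TEXT NOT NULL,
        layer         TEXT NOT NULL CHECK (layer IN ('effort','output','outcome','impact')),
        value         REAL NOT NULL,
        agent_version TEXT            -- which model/agent rollout was live at the time
    )
""")
conn.commit()
```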

5. The WBR is the endgame — “combination of practices, not a single metric set.” Chin’s closing is the most important RDCO mapping: software/agent measurement cannot be reduced to one KPI dashboard. It requires a weekly discipline where you make bets, watch routine vs. exceptional variation, update your causal model, and iterate. That is exactly what RDCO’s consulting posture should deliver — not a measurement tool, but an operating cadence the client practices. MAC is the agent-era equivalent of the WBR, and the drip course should position it that way: “You are not buying a framework, you are installing a weekly review discipline.” This closes the loop back to 2026-04-15-commoncog-becoming-data-driven-first-principles, where the WBR is the canonical implementation of Deming/SPC.

One honest caveat. Beck’s model doesn’t dissolve the hard cases. Chin flags refactoring and tech-debt paydown as still difficult under Bryar’s approach — “months-long refactoring … difficult to calculate business impact.” In agent-land, the analogue is harness investment, eval infrastructure, and skill authoring. We should be prepared to tell clients these line items will resist measurement and must be funded on the theory that they compound into future velocity — essentially a faith-based budget item validated over quarters, not sprints.