
operational definitions

2026-04-20 · concept · status: draft
operational-definitions · spc · deming · wheeler · mac-framework · data-quality

Operational Definitions: Criterion + Test + Decision Rule

The one-sentence claim

A metric is not operationally defined until you can hand the criterion, the test procedure, and the threshold for “yes” to a stranger and have them produce the same count you would — and the third part is the one teams skip, which is why their numbers drift.

The pattern

An operational definition is the contract between a concept and a number. Without one, the concept floats — “active user,” “incident,” “complete,” “blemish,” “error” — and every team measuring it produces a different count for reasons that have nothing to do with the underlying reality. With one, the count is reproducible across people, shifts, vendors, audits, and time.

The version most teams reach for is two-part: a name and a test. “Active user = logged in this month.” “Defective = failed inspection.” “Incident = paged on-call.” This feels complete. It is not. It leaves the load-bearing decision — what threshold separates yes from no — implicit, and an implicit threshold is a threshold every observer is free to set on their own.

Wheeler & Chambers make this concrete in the context of count data. Their canonical example is the blemish count: a roll of fabric is inspected, and somebody writes down a number. Two inspectors looking at the same roll routinely produce different counts — not because one is careless, but because “blemish” was never defined past the criterion. One inspector counts surface scratches above a certain size. The other counts any visible mark. A third counts only marks that would cause the roll to be downgraded. All three are doing their jobs honestly. The chart built from their pooled counts is meaningless.

The fix is not better inspectors. The fix is the third part.

The three parts

Per Wheeler & Chambers, Understanding Statistical Process Control (3rd ed), Ch. 11.2 (~p. 290), attributing the structure to Deming: an operational definition consists of a criterion, a test, and a decision rule.

Criterion. The concept the definition is for. What thing are we counting or measuring? “A blemish on the surface of the fabric.” “An active user of the product.” “An incident in the production system.” “A completed task on the board.” The criterion names the category but does not yet make it measurable. This is the part everyone gets right.

Test. The procedure that produces an observation. How do you take the measurement? “Examine a 100-yard sample under standard lighting at arm’s length.” “Query the events table for sessions in the last 30 days.” “Read the PagerDuty incident log.” The test is the operational machinery. This is the part most disciplined teams get right — and where they typically stop.

Decision rule. The threshold that converts an observation into a yes/no count. What separates in-spec from out-of-spec? “A surface mark larger than 2mm in any dimension counts as one blemish.” “A user with at least 3 distinct sessions and at least one transaction counts as active.” “A PagerDuty incident of severity SEV2 or worse counts as an incident.” The decision rule is what makes the count reproducible. Without it, every observer is silently making the rule themselves, and the counts can never be compared.

The three parts are not optional ornaments. They are the minimum specification. Drop any one and the metric stops being a metric and starts being a conversation about a metric.
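
To see what all three parts look like when they are pinned in code rather than in heads, here is a minimal sketch for the active-user example. The events schema (user_id, session_id, event_type, event_ts) and the Postgres-flavored syntax are assumptions for illustration, not a prescription:

    -- Criterion: an active user of the product.
    -- Test: query the events table over a rolling 30-day window.
    -- Decision rule: >= 3 distinct sessions AND >= 1 transaction.
    SELECT user_id
    FROM events
    WHERE event_ts >= NOW() - INTERVAL '30 days'
    GROUP BY user_id
    HAVING COUNT(DISTINCT session_id) >= 3
       AND COUNT(*) FILTER (WHERE event_type = 'transaction') >= 1;

Anyone who runs this query gets the same count, because the decision rule is the HAVING clause, not a judgment call.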

Where the third part collapses

Teams skip the decision rule because in the moment of definition it feels obvious. The criterion is the concept; the test is the procedure; the threshold seems like it can be left to common sense. It cannot.

A product team defines “active user” as “uses the product.” The test is the events query. There is no decision rule. Six months later, a quarterly target is in danger. Someone moves the threshold inside the query from $100 in spend to $10 in spend, and the active-user count jumps. The criterion did not change. The test did not change. The decision rule — which was never written down — was rewritten silently. This is the case Cedric Chin opens his operational-definition essay with in ../2026-04-15-commoncog-whats-operational-definition, and it is the canonical failure mode.
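
In query form, the silent rewrite looks like this; the events table with a spend column is a hypothetical stand-in for whatever the team actually queried:

    -- Both versions satisfy the written definition ("uses the product,"
    -- measured by the events query). Only the unwritten decision rule
    -- differs, and no registry records the change.
    SELECT COUNT(*) AS active_users
    FROM (
      SELECT user_id
      FROM events
      WHERE event_ts >= NOW() - INTERVAL '30 days'
      GROUP BY user_id
      HAVING SUM(spend) >= 100   -- quietly edited to >= 10 when the target slipped
    ) qualified;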

A reliability team defines “incident” as “a problem that paged the on-call.” The test is the PagerDuty log. There is no decision rule. Three months later, the team adjusts alert sensitivity to reduce noise. Incident counts drop 40%. Reliability looks like it improved. Reliability did not improve; the decision rule that nobody wrote down moved.
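
A sketch of the fix, assuming an incidents table with a numeric severity column where 1 is the worst: pin the decision rule in the counting query, so that a change to what counts has to arrive as an edit with a diff and an author.

    -- Criterion: an incident in the production system.
    -- Test: read the PagerDuty incident log (exported to this table).
    -- Decision rule: SEV2 or worse counts; lower-severity pages do not.
    SELECT COUNT(*) AS incidents
    FROM pagerduty_incidents
    WHERE created_at >= DATE_TRUNC('month', CURRENT_DATE)
      AND severity_level <= 2;   -- 1 = SEV1, 2 = SEV2; retuning SEV3 noise cannot move this count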

A QA team defines “defective unit” as “fails inspection.” The test is the inspection protocol. There is no decision rule. Different inspectors apply different mental thresholds. The defect rate shows variation that is mostly inspector-to-inspector noise — the chart, by Wheeler’s standard, contains no signal because the data were never operationally defined to begin with.

In every case the metric looked operational. It had a name and a procedure. What it lacked was the third part — and the third part was where the work was.

Application to RDCO and MG

This is the prerequisite layer underneath every measurement RDCO and its clients build on top of.

For MG client harness builds. Every “error,” “incident,” “incomplete,” “anomaly,” and “exception” the harness counts needs all three parts before a process behavior chart built on top of it carries a signal. The MAC matrix at ../../01-projects/data-quality-framework/testing-matrix-template.md gives you the scope and basis of each test, but the test is only as good as the operational definition of the thing it is testing. A row-level absolute test for “non-positive ACV on new opportunities” requires an operational definition of “new opportunity” — criterion (a row in the opportunities table), test (read the row’s type and acv fields), decision rule (type = 'New Business' AND acv <= 0). The decision rule is in the SQL because Wheeler is right: without it, every analyst writes their own version.
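
Spelled out as a runnable check, with the table and column names taken from the example and everything else an assumption:

    -- Row-level absolute test: non-positive ACV on new-business opportunities.
    -- Criterion: a "new opportunity" is a row in the opportunities table.
    -- Test: read the row's type and acv fields.
    -- Decision rule: the predicate below, verbatim.
    SELECT id, type, acv
    FROM opportunities
    WHERE type = 'New Business'
      AND acv <= 0;   -- <= 0 also flags zero-dollar deals, by design
    -- A clean run returns zero rows; every returned row is a violation.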

For RDCO’s MAC framework. MAC is the prerequisite layer for trustworthy counts; operational definitions are the prerequisite layer underneath MAC. You cannot write a meaningful Relative: Source check (silver count = bronze count where stage = 'Closed Won') without a criterion-test-rule definition of “closed won” that is identical on both sides of the comparison. The portable skills bundle at ../../01-projects/data-quality-framework/portable-skills-bundle.md should ship with an operational-definition checklist as the first interview the /audit-model plan skill runs — before any cell of the matrix is filled, the operational definitions for every counted entity are pinned.
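
A sketch of that check with the shared definition made explicit; the bronze/silver schema names are assumptions:

    -- Relative: Source check. The stage predicate is the operational
    -- definition of "closed won" and must be identical on both sides,
    -- or the comparison measures definition drift, not the pipeline.
    SELECT
      (SELECT COUNT(*) FROM bronze.opportunities WHERE stage = 'Closed Won') AS bronze_ct,
      (SELECT COUNT(*) FROM silver.opportunities WHERE stage = 'Closed Won') AS silver_ct;
    -- The check passes when bronze_ct = silver_ct.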

For the agent itself. Every metric the AI agent tracks is suspect until operationally defined. “Interesting things found” — by what criterion, what test, what threshold? “Tasks completed” — does a task with a follow-up question count as completed? “Skills called” — does a skill that errored on the first call and succeeded on the retry count once or twice? Without the third part, the agent’s self-reported telemetry is a conversation, not a measurement. The PBC discipline Chin lays out in ../2026-04-15-commoncog-process-behaviour-charts depends on the underlying counts being comparable run-to-run, and that depends on the decision rule being explicit and stable.

The exemplar to imitate is Amazon’s Weekly Business Review as documented in ../2026-04-15-commoncog-amazon-weekly-business-review: every metric on the deck has a written operational definition, definitions change only via a deliberate process with a paper trail, and the comparability of metrics across weeks is treated as a load-bearing property of the meeting itself. Drift the definition and you have broken the chart.

A workable template

Copy this into the metric registry whenever a new metric is defined. If any field is empty, the metric is not yet operational and should not be charted, reported, or used as evidence in a decision.

Metric name: [the concept, e.g. "active user"]

(1) Criterion — what concept this metric is for, in one sentence:
    [e.g. "A user who derives meaningful ongoing value from the product."]

(2) Test — the exact procedure or query that produces the observation:
    [e.g. "SELECT user_id FROM events WHERE event_ts >= NOW() - INTERVAL '30 days'
     GROUP BY user_id"]

(3) Decision rule — the threshold that converts the observation into a yes/no count:
    [e.g. "A user counts as active if they have >= 3 distinct sessions AND
     >= 1 transaction in the 30-day window."]

Owner: [single human accountable for changes to this definition]
Last reviewed: [YYYY-MM-DD]
Change log: [every modification to any of the three parts, with date and reason]
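
As a usage example, here is the template filled in for one of the agent metrics from the previous section; every value is illustrative, and the metric is not operational until a real owner is named:

Metric name: skills called

(1) Criterion: a skill invocation made by the agent during a run.

(2) Test: parse the run transcript for skill-invocation events.

(3) Decision rule: every invocation attempt counts once, so a call that
    errors and then succeeds on retry counts as two calls.

Owner: [unassigned]
Last reviewed: [YYYY-MM-DD]
Change log: [no changes yet]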

The change log is not optional. The whole point of an operational definition is that it survives long enough to make charts comparable across time; a definition with no change log is a definition that gets quietly rewritten and breaks the comparability without anyone noticing. The friend in Chin’s essay rewrote the active-user threshold and nobody could prove it had moved — because there was no log.