06-reference / concepts

binary decision around continuous probability

2026-04-19 · concept · status: draft
decision-theory · probability · thresholds · floodplain-maps · llm-sampling · diffusion-models

Binary Decisions Around Continuous Probability: The Anti-Pattern Hidden in Every Threshold

The one-sentence claim

Wherever a continuous probability gets collapsed to a binary decision upstream of where the binary is actually needed, the thrown-away distribution was the most expensive signal in the pipeline — and the collapse is almost never audited.

The pattern

A system produces a probability — smooth, graded, calibrated as well as the model can manage. Then, for operational reasons (an insurance boundary, a regulatory line, a budget cap, a UI that needs one answer, a CPU that needs an integer), the system applies a threshold and emits a binary outcome. The threshold was chosen for reasons that have nothing to do with the underlying shape of the probability distribution. Small changes in the input produce discontinuous changes in the decision. Two cases that are materially identical end up on opposite sides of the line.
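The collapse is small enough to state in code. A minimal sketch — the function name, threshold, and probabilities are illustrative, not drawn from any of the sources below:

```python
def decide(p: float, threshold: float = 0.5) -> bool:
    """Collapse a calibrated probability to a bit. The distance from the
    threshold, which is the actual signal, is discarded."""
    return p >= threshold

# Materially identical inputs, discontinuous outcomes:
assert decide(0.501) != decide(0.499)
# Wildly different inputs, identical outcome:
assert decide(0.501) == decide(0.999)
```

The two assertions are the whole anti-pattern: the bit cannot distinguish 0.501 from 0.999, and it maximally distinguishes 0.501 from 0.499.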

The binarization is sometimes required. That is not the anti-pattern. The anti-pattern is that the probability is usually destroyed well before the last possible moment — before downstream consumers, before the audit layer, before the human who has to act on the output. What survives is a bit. What gets lost is the calibration.

Three domains make the shape visible.

Three domains where it shows up

Floodplain maps (civil engineering). NFIP base-flood boundaries draw a crisp line around the 100-year floodplain. A property one foot inside carries a federally mandated insurance obligation and a raft of building-code constraints. A property one foot outside carries neither. The underlying quantity — annual exceedance probability of a given rainfall depth — is smooth, uncertain, and in the Kerr County case extrapolated from gauges that had never seen an event of the magnitude being predicted. Grady asks the load-bearing question in ../2026-04-20-practical-engineering-an-engineers-perspective-on-the-texas-floods: what's the difference in risk profile between just-inside-the-line and just-outside? Essentially zero. The line is wrong in one direction every year; which direction is random. The homeowner cannot tell you what their actual flood risk is — they can only tell you which side of the map they are on.

LLM token sampling (autoregressive AI). In ../2026-04-20-3blue1brown-large-language-models-explained-briefly Grant is careful: an LLM does not “predict one word with certainty” — it “assigns a probability to all possible next words.” Every chatbot UI in production then runs softmax → sample (temperature > 0) or softmax → argmax (temperature = 0), collapses the vocabulary-wide distribution to one token, and throws the rest away. The token is what the user sees. The distribution is what carried the calibration — including the model’s implicit uncertainty, the second- and third-place candidates a retry might have landed on, the entropy signal that could tell a downstream system whether to trust this completion. Downstream systems can’t audit a decision they never saw the distribution for. The distribution existed for one function call and was never logged.
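A hedged sketch of that emit step with an audit log bolted on. `emit_token` and the log schema are invented for illustration, not any production API; the point is only that the distribution and its entropy can be logged in the same call that destroys them:

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def emit_token(logits, temperature=1.0, audit_log=None):
    """Collapse the vocabulary-wide distribution to one token, but record
    the distribution and its entropy before they are thrown away."""
    if temperature == 0:
        probs = softmax(logits)
        token = max(range(len(probs)), key=probs.__getitem__)  # argmax
    else:
        probs = softmax([x / temperature for x in logits])
        token = random.choices(range(len(probs)), weights=probs)[0]  # sample
    if audit_log is not None:
        entropy = -sum(p * math.log(p) for p in probs if p > 0)
        audit_log.append({"probs": probs, "entropy": entropy, "token": token})
    return token
```

The entropy field is the "should a downstream system trust this completion" signal the paragraph above describes; logging it costs one line.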

Diffusion classifier-free guidance (generative AI). In ../2026-04-20-3blue1brown-but-how-do-ai-images-and-videos-actually-work Welch Labs makes the inverse case visible. Take out the random noise step in DDPM sampling and generated points collapse to the mean of the conditional distribution — the Bayesian-optimal single-image answer becomes a smeared, blurry average. To get a crisp output you deliberately push off the mean into a specific mode, via classifier-free guidance: condition vector minus unconditioned vector, scaled by alpha. The “crispness” of a generated image is the distribution being deliberately abandoned in favor of a point sample. This is the binarization done on purpose at the output stage — because the product is a single image, not a distribution over images. Note the direction: the mean was throwing away the perceptual structure; the sample throws away the center of mass. Both are information destruction. Which one costs you depends on what the consumer needs.
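The guidance step described above can be sketched in a few lines. This uses the standard classifier-free-guidance parameterization (unconditioned estimate plus alpha times the conditioned-minus-unconditioned difference); the function name is illustrative, and exact conventions for the scale vary between implementations:

```python
def cfg_estimate(eps_cond, eps_uncond, alpha):
    """Classifier-free guidance: start from the unconditioned noise estimate
    and push along (conditioned - unconditioned), scaled by alpha.
    alpha = 1 recovers the plain conditional estimate; alpha > 1 deliberately
    overshoots the mean into a sharper, more prompt-faithful mode."""
    return [u + alpha * (c - u) for c, u in zip(eps_cond, eps_uncond)]

assert cfg_estimate([1.0], [0.0], 1.0) == [1.0]  # alpha = 1: no overshoot
assert cfg_estimate([1.0], [0.0], 2.0) == [2.0]  # alpha > 1: off the mean
```

Every alpha above 1 is information destruction on purpose: the output moves further from the center of mass toward a specific mode.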

What the anti-pattern destroys

In floodplain: the homeowner never sees their actual risk. The NFIP could ship a calibrated annual-exceedance probability with confidence intervals; it ships a zone label. The insurance premium is binary, the regulatory burden is binary, the public’s trust in the map collapses when the line moves — because they were never given the uncertainty to begin with.

In LLM: the downstream system cannot weight the output by confidence. Every tool-use hop, every RAG pipeline, every audit layer is forced to treat “the model said X” as a flat signal when the model actually produced a distribution whose shape told you whether to believe X. The production stack burns compute re-computing uncertainty signals the model already had and discarded.

In diffusion: the mean was the Bayesian-optimal estimator under squared-error loss. We threw it away because squared-error loss does not match human perception. That is a real trade — but it is a trade that depends on the product, and it gets made once and never revisited.

The RDCO surface

This is where the pattern earns its keep. Every skill that emits a decision is hiding a probability downstream skills could have used.

The operational rule: when a decision is fundamentally probabilistic, return the probability alongside the decision. Keep the binary for the consumer that needs a binary. Log the probability for the consumer that could use it. Never collapse at the emit step when the collapse could have happened at the consumption step.
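The rule fits in a type. A sketch, assuming nothing about RDCO internals — the `Decision` dataclass and `check` function are hypothetical, not an existing interface:

```python
from dataclasses import dataclass

@dataclass
class Decision:
    """Emit the bit and the probability together: the consumer that needs
    a binary reads .verdict; the consumer that can use the gradient,
    or the audit layer, reads .p."""
    verdict: bool
    p: float

def check(p: float, threshold: float = 0.5) -> Decision:
    """Collapse at the emit step, but carry the probability alongside."""
    return Decision(verdict=p >= threshold, p=p)

d = check(0.51)
# d.verdict is the bit the binary consumer needs; d.p survives for everyone else.
```

Nothing downstream is forced to use `p`, but nothing downstream is blinded either — the collapse becomes the consumer's choice instead of the emitter's.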

When the pattern is fine anyway

Honest counter. The CPU needs an integer. The screen has to render one image. The insurance policy requires a boundary someone can point a lawyer at. The task board has to show one status column. Binarization is the product sometimes — the whole point of production systems is that they resolve ambiguity to action.

The anti-pattern is not “binarization exists.” It is binarizing upstream of the decision point that actually needs the binary. When the probability gets destroyed two hops before the human (or the next skill) acts on it, every intermediate layer is flying blind for no good reason. Push the collapse as late as the pipeline allows. Log the distribution even when the emit is binary. Give the consumer the option to read the gradient.

Grady says it in civil-engineering language: “facing the limitations of our understanding head-on actually instills more trust than pretending like we have all the answers.” Same move. Expose the distribution.

Confidence

Three sources, minimum canon-tier — just-barely promoted. Honest caveats:

  1. Two of the three sources are from the 3Blue1Brown cluster (Sanderson on LLMs, Welch Labs guest video on diffusion). That is one author-community, not two. The cross-domain framing inside that cluster is real — LLM logits and diffusion score-functions are genuinely different AI objects — but the editorial voice is correlated. A fourth source from outside the 3B1B orbit would materially harden the AI half.
  2. Only one civil-engineering source. Grady’s floodplain-map case is vivid and canonical, but a single Practical Engineering video is not a domain-wide pattern on its own.
  3. The mapping to RDCO skill design is interpretive. The argument “binary /check-board is structurally identical to a floodplain zone” is a claim, not a proof. It is load-bearing enough to act on (the cycle-27 proposal is already queued) but should be revisited once a fourth source forces a sharpening.

The obvious missing source is the decision-theory / calibration literature: Brier score, reliability diagrams, the proper-scoring-rules tradition, or any classic treatment of threshold selection in classification. brier-score already exists in the vault as a concept page and is the natural next citation — adding one primary-literature piece (a Gneiting & Raftery paper, Brier’s 1950 original, or a Platt-scaling / isotonic-regression reference) would bring this to four sources with two communities.
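For reference, the Brier score itself is one line — a sketch of the standard definition (Brier, 1950), not drawn from any of the three sources above:

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes.
    Only computable if the probability was logged, not just the bit."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

# A sharp, calibrated forecaster scores lower than a hedging one:
brier_score([0.9, 0.1, 0.8], [1, 0, 1])   # 0.02
brier_score([0.5, 0.5, 0.5], [1, 0, 1])   # 0.25
```

Which is exactly the audit this note says never happens: you cannot Brier-score a pipeline that only logged its binary decisions.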

The pattern earns the “anti-pattern” framing at three independent-domain exemplars. It does not yet earn canon-tier confidence in the “this is the general principle across decision science” framing. Act on the RDCO implications now; revisit the concept page when the fourth source lands.