06-reference

hughes quickcheck property based testing

2026-05-04 ·reference ·source: Koen Claessen and John Hughes, "QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs" (ICFP 2000) + Hypothesis (Python) docs ·by Koen Claessen, John Hughes (paper); David R. MacIver et al. (Hypothesis)
property-based-testing ·quickcheck ·hypothesis ·john-hughes ·agent-verifier ·rule-corpus-v2 ·voice-profile

John Hughes - “QuickCheck” (2000) + Hypothesis docs

Sources are fully open. The Claessen / Hughes paper is the original 2000 publication at ICFP; the Hypothesis docs at hypothesis.readthedocs.io are the canonical modern Python interpretation of the same idea, with David R. MacIver as the principal author of the library.

Why this is in the vault

The verify-action rule corpus is example-based today (R001 is a literal regex for one Unicode codepoint), and property-based testing is the natural v2 direction once we want to catch entire classes of voice-and-behavior drift the regex can never see.

The core argument / doctrine

QuickCheck (and its descendants: Hypothesis in Python, Eris in PHP, fast-check in JS, ScalaCheck in Scala, PropEr in Erlang) changes the unit of testing relative to xUnit / pytest. The unit is not an example; the unit is a property.

A property is a universally-quantified statement: “for all inputs x in some specified domain, this predicate holds.” In QuickCheck the property is a Haskell function returning Bool. The framework’s job is to:

  1. Generate random inputs of the right shape via a generator (Hypothesis calls these strategies). Generators are first-class, composable, and programmable. st.integers(), st.lists(st.text()), recursive generators for trees, custom generators for domain types.
  2. Run the property against many generated inputs (default 100 in Hypothesis).
  3. Shrink on failure. When an input falsifies the property, the framework systematically reduces that input toward a minimal counterexample. A list of 412 elements that fails gets shrunk to a list of 2 elements that still fails. Shrinking is what makes property-based testing usable: the failure report points at the simplest input that breaks the property, not the random monstrosity that happened to find it first.
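The generate / run / shrink loop above can be sketched in a few lines of pure Python. This is an illustrative toy, not Hypothesis's actual internals: `check_property`, `gen_list`, and `shrink_list` are names invented here, and the shrinker is a naive greedy one (drop one element at a time) rather than Hypothesis's integrated shrinking.

```python
import random

def check_property(prop, generate, shrink, runs=100, seed=0):
    """Run `prop` against `runs` random inputs; on failure, greedily shrink."""
    rng = random.Random(seed)
    for _ in range(runs):
        x = generate(rng)
        if prop(x):
            continue
        # Shrink: repeatedly replace x with any smaller candidate that still fails.
        shrunk = True
        while shrunk:
            shrunk = False
            for candidate in shrink(x):
                if not prop(candidate):
                    x = candidate
                    shrunk = True
                    break
        return x  # minimal counterexample found
    return None  # no counterexample in `runs` attempts

# Generator: random integer lists. Shrinker: drop one element at a time.
gen_list = lambda rng: [rng.randint(-100, 100) for _ in range(rng.randint(0, 50))]
shrink_list = lambda xs: [xs[:i] + xs[i + 1:] for i in range(len(xs))]

# Deliberately false property: "no list contains a negative number".
counterexample = check_property(lambda xs: all(n >= 0 for n in xs),
                                gen_list, shrink_list)
```

Note what shrinking buys: the first failing list may have dozens of elements, but the reported counterexample is a single-element list holding one negative number, because every irrelevant element gets dropped.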

The key claim in the QuickCheck paper, captured in the Claessen/Hughes abstract: properties expressed as ordinary code, tested against many random cases, give stronger assurance than hand-written examples while requiring less test code. A short Hughes-tradition line: “Properties are universally quantified Boolean-valued functions.” (Six words; from the paper.) A short Hypothesis line: “Strategies describe the type of inputs you want.” (Eight words; from the docs.)

Why property-based finds bugs example-based misses. Examples are anchored to the developer’s mental model. The bugs example-based testing finds are the bugs the developer thought to look for. Generators do not share that mental model; they explore the input space stochastically and (with shrinking) report the simplest failure they can find. The empirical pattern across the QuickCheck literature: developers write 3-5 properties, each of which catches dozens of bugs that the example tests would never have caught, including bugs caused by the developer’s own assumptions about what “valid input” looks like.

Limits. Properties are harder to write than examples because they require articulating the universal claim, not the specific instance. Some behaviors do not have a clean property formulation - “this UI shows the right copy” cannot be a property. And generators have to be designed thoughtfully; a generator that never produces edge cases gives false confidence.

Mapping against Ray Data Co

The natural v2 direction for the verify-action rule corpus. Today the corpus is example-based. R001 is a substring check: fail if U+2014 appears anywhere in the text. That works for that specific Unicode codepoint. It does not generalize to other AI-tells the founder will eventually flag (e.g., specific phrases, hedging cadences, novel-length replies, particular formatting tics). Each new tell currently means a new R-rule with a new regex. That is fine for the small-N case, but the corpus will hit a ceiling.

A property-based reformulation replaces each per-tell regex with a single universally quantified predicate: for all drafts, the set of tell violations is empty, where the tell set is data rather than code.
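As one hedged sketch (pure Python; the names, tell sets, and thresholds below are illustrative, not the real corpus API): the per-tell regexes collapse into one predicate over declared tell data, so a newly flagged tell becomes a one-line data change instead of a new R-rule.

```python
# Hypothetical tell data. Only the em dash reflects the real R001;
# the phrases and length budget are placeholders for illustration.
FORBIDDEN_CODEPOINTS = {"\u2014"}          # em dash (R001's target)
FORBIDDEN_PHRASES = {"delve", "tapestry"}  # example tells, not real rules
MAX_REPLY_CHARS = 1200                     # example length budget

def violations(text: str) -> list[str]:
    """The for-all predicate: a draft passes iff this list is empty."""
    found = []
    found += [f"codepoint {hex(ord(c))}" for c in FORBIDDEN_CODEPOINTS if c in text]
    found += [f"phrase {p!r}" for p in FORBIDDEN_PHRASES if p in text.lower()]
    if len(text) > MAX_REPLY_CHARS:
        found.append(f"length {len(text)} > {MAX_REPLY_CHARS}")
    return found

assert violations("short, clean draft") == []
assert violations("a\u2014b") == ["codepoint 0x2014"]
```

The property itself ("for all drafts d, violations(d) == []") is what a generator then attacks with random and adversarial drafts.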

Property-based testing as the eval-layer foundation. 2026-05-04-karlmehta-llm-commoditization-intelligence-rails argues the eval / control-plane layer is the durable moat. Most production “evals” today are example-based (a fixture set of input/output pairs). Property-based evals are the version that survives model swaps and prompt rewrites - because a property does not depend on any specific fixture input, only on the predicate holding across generated ones. Action implication: when we build the RDCO eval layer beyond the verify-action corpus, the design is property-based from day one, not example-based with a planned migration.

Adversarial inputs as a test of the rule corpus itself. Hughes’ deeper insight is that the generator is the surface that finds bugs, not the property. The verify-action corpus today has no generator; we run hand-written fixtures. A v2 step is: generate random or adversarial drafts (Unicode-heavy, length-extreme, multilingual, code-block-injection, prompt-injection-style) and run them through the rule corpus. Rules that pass everything regardless of input are either trivially correct or trivially wrong; adversarial generation distinguishes the two. This is property-based testing applied to the verifier itself, not to the LLM.
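A minimal sketch of that v2 step, in pure Python with invented names (`adversarial_drafts`, `audit_rules`, and the `R_broken` rule are all illustrative): generate hostile drafts, run every rule over them, and flag any rule that never fires, since a rule that passes everything regardless of input is either trivially correct or trivially wrong.

```python
import random

def adversarial_drafts(rng, n=200):
    """Generate hostile drafts: Unicode-heavy, length-extreme, code-block-laden."""
    pools = [
        # General Punctuation block U+2000-U+206F, which includes U+2014.
        lambda: "".join(chr(rng.randint(0x2000, 0x206F)) for _ in range(20)),
        lambda: "x" * rng.choice([0, 1, 10_000]),                   # length extremes
        lambda: "```\n" + "print('hi')\n" * rng.randint(1, 5) + "```",  # code-block injection
    ]
    return [rng.choice(pools)() for _ in range(n)]

def audit_rules(rules, drafts):
    """For each rule, record whether it fired on at least one hostile draft."""
    return {name: any(rule(d) for d in drafts) for name, rule in rules.items()}

rng = random.Random(42)
drafts = adversarial_drafts(rng)
rules = {
    "R001_em_dash": lambda t: "\u2014" in t,   # the real R001 check
    "R_broken":     lambda t: False,           # deliberately dead rule: never fires
}
fired = audit_rules(rules, drafts)  # R_broken stays False and gets flagged
```

The audit inverts the usual direction: instead of the rules judging drafts, the generated drafts judge the rules.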

Connection to the MAC framework. The Scope x Basis matrix is example-shaped today (each cell describes a specific check). A property-based reading of the matrix is straightforward: “for all rows r in the model, the column-level absolute property holds” / “for all months m in the aggregate, the accounting identity holds.” The matrix already has this shape implicitly. Promoting the implicit universality to explicit forall properties is the bridge from the matrix to a generative test suite that produces synthetic data and checks the model’s response. Useful for pre-production validation of new dbt models.
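To make the implicit forall explicit, here is a toy version of the two matrix cells quoted above, over synthetic rows (the row schema and property names are invented for illustration; nothing here is a real dbt model):

```python
import random

def make_rows(rng, n=500):
    """Synthetic model rows; the identity holds by construction."""
    rows = []
    for i in range(n):
        debit = rng.randint(0, 10_000)
        rows.append({"id": i, "month": rng.randint(1, 12),
                     "debit": debit, "credit": debit})
    return rows

def column_property(rows):
    """forall rows r: amounts are non-negative (column-level absolute check)."""
    return all(r["debit"] >= 0 and r["credit"] >= 0 for r in rows)

def identity_property(rows):
    """forall months m: sum(debit) == sum(credit) (accounting identity)."""
    months = {r["month"] for r in rows}
    return all(
        sum(r["debit"] for r in rows if r["month"] == m)
        == sum(r["credit"] for r in rows if r["month"] == m)
        for m in months
    )

rows = make_rows(random.Random(7))
```

Feeding generated rows through both predicates is exactly the "generative test suite" shape: synthetic data in, universal claims checked, before any production data touches the model.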

Limits to flag for the founder. Property-based testing requires generators; generators require domain modeling; domain modeling for “the founder’s voice” is exactly what voice-match already approximates with a fuzzy LLM-based scorer. We will not get a clean property for voice without accepting a fuzzy predicate, which means we are partially back in the LLM-verifier failure mode. The right architecture is: hard properties (chat_id format, em dash absence, length budgets) stay deterministic; soft properties (voice match, brand tone) use a voice-match scorer with a clearly stated threshold and acknowledged failure modes. The split is the same one 2026-05-05-tdd-is-dead-debate-dhh-beck-fowler makes.
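The hard/soft split can be sketched as a single verifier entry point. Everything here is hypothetical scaffolding: `voice_match_score` is a stub standing in for the real LLM-based voice-match scorer, and the 0.8 threshold is a placeholder that would need an explicitly agreed value.

```python
# Hard properties: deterministic, no model in the loop.
HARD_PROPERTIES = {
    "no_em_dash":    lambda t: "\u2014" not in t,
    "length_budget": lambda t: len(t) <= 1200,   # illustrative budget
}
VOICE_THRESHOLD = 0.8  # placeholder; the real threshold needs sign-off

def voice_match_score(text: str) -> float:
    """Stub for the fuzzy scorer; the real call goes to the voice-match LLM."""
    return 1.0 if text else 0.0

def verify(text: str) -> dict:
    """Run hard properties deterministically, then the soft fuzzy gate."""
    hard = {name: prop(text) for name, prop in HARD_PROPERTIES.items()}
    soft = voice_match_score(text) >= VOICE_THRESHOLD
    return {"hard": hard, "soft_pass": soft,
            "pass": all(hard.values()) and soft}

result = verify("a clean, short reply")  # passes both layers
```

The point of the structure: a hard-property failure is a fact, a soft-property failure is a judgment with a stated threshold, and the report keeps the two legible and separate.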