PM1c + PM1d — Category slicing and stability check
Context
PM1b identified the $100K-$2M volume band as the first place we’ve measured mispricing on Polymarket (Brier 0.12-0.15 vs a 0.12 discipline gate). The founder asked to bundle the next three steps: category slicing (which kinds of markets drive it), stability check (is it persistent across time), and an X-sentiment prototype (can we build an edge).
This doc covers category slicing and stability. The X-sentiment prototype (PM1e) is deliberately NOT executed yet — the stability results changed my read of whether it’s worth the xmcp/LLM spend. See the “Decision point” section below.
Cost so far: $0. All analysis used the free Polymarket Gamma and CLOB endpoints.
PM1c — Category slicing
Setup. Pulled the top 500 resolved markets in the $100K-$2M band via volume_num_min=100000, volume_num_max=2000000, order=volumeNum. Polymarket’s own category field is empty on 490 of 491 markets, so we infer categories from slug + events[0].ticker prefixes via a small regex classifier. The rule set lives in pm1c_category_slicing.py and covers sports, crypto, politics, tennis, esports, cricket, olympics, macro-fed, weather, entertainment, space-tech, tech, and “other”.
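For reference, a minimal sketch of the pull and the classifier shape, assuming the Gamma /markets endpoint accepts the band parameters named above. The closed/ascending flags, pagination handling, and the regex subset here are illustrative; the real rule set lives in pm1c_category_slicing.py.

```python
import re
import requests

GAMMA = "https://gamma-api.polymarket.com/markets"

def fetch_band(limit=500):
    """Pull resolved markets in the $100K-$2M band, highest volume first.

    Assumes Gamma accepts these params as used in pm1c_category_slicing.py;
    pagination (if Gamma caps the page size) is elided here.
    """
    params = {
        "closed": "true",            # assumption: filter to resolved markets
        "volume_num_min": 100_000,
        "volume_num_max": 2_000_000,
        "order": "volumeNum",
        "ascending": "false",        # assumption: highest volume first
        "limit": limit,
    }
    return requests.get(GAMMA, params=params, timeout=30).json()

# Illustrative subset of the slug/ticker rules; the real classifier
# covers all 13 categories listed above.
RULES = [
    ("sports",   re.compile(r"\b(nba|nfl|mlb|nhl|epl|vs)\b")),
    ("crypto",   re.compile(r"\b(btc|eth|bitcoin|ethereum|solana)\b")),
    ("politics", re.compile(r"\b(election|president|senate|congress)\b")),
]

def classify(market):
    """Infer a category from slug + first event ticker, falling back to 'other'."""
    events = market.get("events") or [{}]
    text = f"{market.get('slug', '')} {events[0].get('ticker', '')}".lower()
    for category, pattern in RULES:
        if pattern.search(text):
            return category
    return "other"
```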
Category distribution (N=491 total):
| Category | N | Share |
|---|---:|---:|
| sports | 220 | 45% |
| other | 174 | 35% |
| politics | 30 | 6% |
| esports | 25 | 5% |
| crypto | 25 | 5% |
| olympics | 4 | 1% |
| tech | 4 | 1% |
| tennis | 4 | 1% |
| space-tech | 3 | 1% |
| weather | 1 | <1% |
| macro-fed | 1 | <1% |
Brier by category (categories with N ≥ 15 only; prices snapshotted 3 days before resolution):
| Category | N | Median vol | Brier | Gate | Win rate |
|---|---:|---:|---:|---|---:|
| sports | 200 | $1,857,763 | 0.1989 | FAIL | 50.5% |
| crypto | 21 | $1,882,647 | 0.1876 | FAIL | 61.9% |
| other | 163 | $1,861,514 | 0.1159 | PASS | 31.9% |
| politics | 28 | $1,881,371 | 0.0057 | PASS | 21.4% |
Key observations:
- Sports is by far the biggest category (45% of the sample) and shows the worst calibration (Brier 0.1989 with a 50.5% win rate). A constant 50% forecast scores Brier 0.25, so ~0.20 at a 50/50 win rate means the market is barely better than coin-flipping — which is actually expected, because most individual sports games are close to coin flips. (See the worked check after this list.)
- Politics in this band is nearly perfectly calibrated (Brier 0.0057). Sophisticated traders dominate political prediction markets even at this volume level — informational edge is very hard here.
- Crypto shows an above-gate Brier, but at N=21 the point estimate has wide uncertainty.
- “Other” is a grab-bag — some of it is probably miscategorized sports or crypto that our regex missed. At Brier 0.1159 it’s borderline.
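To make the coin-flip baseline concrete, here is a worked check of the Brier arithmetic and the gate logic. Function names are illustrative, not the actual script’s.

```python
def brier(prices, outcomes):
    """Mean squared error between snapshot price and resolved outcome (0/1)."""
    return sum((p - o) ** 2 for p, o in zip(prices, outcomes)) / len(prices)

GATE = 0.12  # the PM1b discipline gate

def passes_gate(prices, outcomes, min_n=15):
    """Apply the min-N and Brier gates used in the category table above."""
    return len(prices) >= min_n and brier(prices, outcomes) <= GATE

# A constant 50% forecast scores (0.5 - outcome)^2 = 0.25 on every market,
# regardless of how outcomes fall — the coin-flip floor referenced above.
assert brier([0.5, 0.5], [1, 0]) == 0.25
```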
PM1d — Stability across time
Setup. Same $100K-$2M band. Bucket markets into three windows by endDate: H1 2025, H2 2025, Q1 2026. Compute Brier per (window × category) cell, require N ≥ 15 per cell. Only two categories clear the noise floor in at least two windows: sports and other.
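A minimal sketch of the bucketing, assuming each market row carries Gamma’s ISO endDate plus the snapshot price and resolved outcome from PM1c. Field and window names are illustrative.

```python
from collections import defaultdict
from datetime import datetime

def window_of(end_date_iso):
    """Map a market's endDate to one of the three PM1d windows (else None)."""
    d = datetime.fromisoformat(end_date_iso.replace("Z", "+00:00"))
    if d.year == 2025:
        return "H1 2025" if d.month <= 6 else "H2 2025"
    if d.year == 2026 and d.month <= 3:
        return "Q1 2026"
    return None

def brier_by_cell(markets, min_n=15):
    """Brier per (window x category) cell, suppressing cells below the noise floor."""
    cells = defaultdict(list)
    for m in markets:
        w = window_of(m["endDate"])
        if w is not None:
            cells[(w, m["category"])].append((m["price"] - m["outcome"]) ** 2)
    return {k: sum(v) / len(v) for k, v in cells.items() if len(v) >= min_n}
```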
Results:
Brier by window × category:

| Category | H1 2025 | H2 2025 | Q1 2026 |
|---|---:|---:|---:|
| sports | 0.1073 | 0.2216 | 0.1987 |
| other | 0.0796 | 0.1179 | 0.1562 |

Sample sizes (window × category):

| Category | H1 2025 | H2 2025 | Q1 2026 |
|---|---:|---:|---:|
| sports | 22 | 96 | 80 |
| other | 27 | 57 | 56 |
Stability verdict:
- Sports: UNSTABLE. Brier range 0.107 → 0.222 (2× degradation between H1 2025 and H2 2025). H1 2025 sports in this band actually PASSES the gate; H2 2025 and Q1 2026 fail it badly.
- Other: UNSTABLE. Brier trends from 0.080 → 0.118 → 0.156 — a monotonic degradation over time.
What this means
The “sports has alpha in the $100K-$2M band” hypothesis is weaker than it looked from PM1c alone. Two plausible explanations:
1. Sample composition effect. The top-500 markets from each window are biased toward whatever events had high volume at that time. In H1 2025 the top sports markets might be elite playoff games (hyperefficient); in H2 2025 they include a broader sweep of regular-season games (less efficient). The “mispricing” isn’t a property of the band, it’s a property of the mix of games within the band.
2. Regime change. Polymarket’s user base grew and changed across 2025. Calibration may genuinely have degraded because new, less sophisticated users joined and bid prices off-fair.
I can’t distinguish these two explanations from this data alone. Both explanations argue against a simple “buy all sports markets in this band” strategy. If it’s (1), the alpha lives in a specific sub-population we haven’t identified yet. If it’s (2), the alpha exists but is regime-dependent and we need continuous recalibration.
The “other” category also trends unstable, which is less interpretable because “other” is whatever didn’t match our regex — it’s a catch-all with unclear composition.
Politics and crypto have too few markets per window to produce meaningful per-window Brier numbers.
Decision point — why I’m pausing before PM1e
The founder’s guidance was to be cost-conscious: “if our margin is thin it could quietly eat into our return.” Given what PM1d just showed, here’s my read:
Evidence for spending on PM1e (X-sentiment prototype):
- There IS mispricing in the $100K-$2M band in aggregate
- Sports is the largest and worst-calibrated category
- X is genuinely the fastest information channel for sports (injuries, weather, late scratches)
- Single-market prototype cost is bounded (~20-50 X queries + 1-2 LLM calls)
Evidence against spending on PM1e right now:
- The sports signal is not stable across time — a prototype that works on a Q1 2026 market might fail on a 2025 market, or vice versa
- The category effect is confounded with sample composition — we don’t know what sub-population of sports is actually driving the signal
- A failed prototype would tell us almost nothing (“did the X sentiment not work, or is sports mispricing real but we picked a bad market?”)
- There’s cheaper free analysis available first
What I want to do BEFORE spending on PM1e:
- Sub-category analysis within sports — split sports into NBA / NFL / MLB / soccer / esports / tennis / other. Which specific leagues drive the high Brier? That could reveal the real alpha population. (Free.)
- Regular-season vs playoff split — hypothesis: playoff markets are well-priced because volume and attention concentrate there, while regular-season markets are where the mispricing hides. (Free.)
- Spread-aware Brier — account for the bid-ask spread at the time of the snapshot. Even if the midpoint shows Brier 0.20, a 10-cent-wide spread could put the actual tradeable prices well within discipline range. The narrow-margin concern applies doubly here. (Cheap — one extra Polymarket API call per market; see the sketch after this list.)
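A sketch of what spread-aware scoring could look like, assuming we can pull bid/ask at the snapshot from the CLOB book. The field names and the scoring rule are my assumptions, not an implemented method: the idea is to score each market at the quoted price nearest the realized outcome, giving the market full credit for its bid-ask band. If this charitable Brier drops back under the 0.12 gate, the midpoint mispricing isn’t tradeable.

```python
def spread_aware_brier_term(bid, ask, outcome):
    """Squared error using the book edge closest to the outcome (0 or 1)."""
    p = ask if outcome == 1 else bid  # ask is nearest 1, bid is nearest 0
    return (p - outcome) ** 2

# Example: midpoint 0.50 with a 10-cent spread (0.45/0.55) on a YES outcome
# scores (0.55 - 1)^2 = 0.2025 instead of the midpoint's (0.50 - 1)^2 = 0.25.
assert round(spread_aware_brier_term(0.45, 0.55, 1), 4) == 0.2025
```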
Only after those three are done would I feel comfortable spending xmcp budget on PM1e. And at that point I’d pick the prototype market deliberately — from the specific sub-population the analysis identifies as the alpha target.
Cost budget for future PM1e
When we do run PM1e, here’s the cost frame I’m planning against:
Per-prototype budget (single market):
- xmcp X searches: ~10-30 searches per market, each retrieving ~10-30 tweets. Cost depends on X API tier; at Basic-tier ~$0.01 per tweet retrieved, that’s roughly $1-$9 per market. Budget: $5 per prototype market.
- LLM calls: Feed ~50-200 tweets + market context to Claude Sonnet for probability extraction. Single call, ~10K tokens. Cost: ~$0.05 per call (Sonnet pricing).
- Polymarket API calls: Free.
- Total per prototype: under $10, probably under $5.
Scaled-strategy budget (what it’d cost to run continuously):
- 200 active markets × 1 check per day × 5 searches per check = 1,000 X searches/day; at ~10 tweets per search and ~$0.01 per tweet, that’s ~$100/day in data cost
- Plus ~200 LLM probability calls per day × $0.05 = $10/day inference cost
- Total: ~$110/day ≈ $3,300/month.
- This is the hurdle a live strategy has to beat. If our edge is narrow (1-2% per bet on small positions), the data cost alone could eat half the returns. That’s exactly the scenario the founder flagged.
This budget math is going to be a key part of any go/no-go decision. Putting it here so we have a reference.
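The same budget math as a tiny cost model, so the go/no-go inputs are explicit and easy to re-run when rates change. All rates here are this doc’s planning assumptions, not quoted vendor prices.

```python
def daily_cost(markets=200, searches_per_check=5, tweets_per_search=10,
               usd_per_tweet=0.01, llm_calls_per_market=1, usd_per_call=0.05):
    """Daily data + inference cost for the scaled strategy, per the frame above."""
    data = markets * searches_per_check * tweets_per_search * usd_per_tweet
    inference = markets * llm_calls_per_market * usd_per_call
    return data + inference

# 200 markets, 5 searches/check, ~10 tweets/search:
print(daily_cost())       # 110.0  (~$100/day data + $10/day LLM)
print(daily_cost() * 30)  # 3300.0 (~$3,300/month hurdle)
```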
Plots
- outputs/pm1c_category_slicing.png — horizontal bar chart, Brier by category, color-coded pass/fail
- outputs/pm1d_stability_check.png — grouped bars of Brier per category per time window
Related
- pm1-polymarket-baseline — original (superseded-in-part)
- pm1b-polymarket-long-tail-correction — the corrected long-tail finding this builds on
- ../scripts/pm1c_category_slicing
- ../scripts/pm1d_stability_check
- ../architecture-vision — future 5-agent vision
- ../../../06-reference/2026-04-10-halls-moore-algo-trading — the four backtesting biases framework we used to design the stability check