PM1c + PM1d — Category slicing and stability check
Context
PM1b identified the $100K-$2M volume band as the first place we’ve measured mispricing on Polymarket (Brier 0.12-0.15 vs a 0.12 discipline gate). The founder asked to bundle the next three steps: category slicing (which kinds of markets drive it), stability check (is it persistent across time), and an X-sentiment prototype (can we build an edge).
This doc covers category slicing and stability. The X-sentiment prototype (PM1e) is deliberately NOT executed yet — the stability results changed my read of whether it’s worth the xmcp/LLM spend. See the “Decision point” section below.
Cost so far: $0. All analysis used the free Polymarket Gamma and CLOB endpoints.
PM1c — Category slicing
Setup. Pulled the top 500 resolved markets in the $100K-$2M band via volume_num_min=100000, volume_num_max=2000000, order=volumeNum. Polymarket’s own category field is empty on 490 of 491 markets, so we infer categories from slug + events[0].ticker prefixes via a small regex classifier. The rule set lives in pm1c_category_slicing.py and covers sports, crypto, politics, tennis, esports, cricket, olympics, macro-fed, weather, entertainment, space-tech, tech, and “other”.
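For reference, a minimal sketch of the pull and the classifier shape, assuming the Gamma /markets endpoint accepts the band parameters named above. The closed/ascending flags, pagination handling, and the regex subset here are illustrative; the real rule set lives in pm1c_category_slicing.py.

```python
import re
import requests

GAMMA = "https://gamma-api.polymarket.com/markets"

def fetch_band(limit=500):
    """Pull resolved markets in the $100K-$2M band, highest volume first.

    Assumes Gamma accepts these params as used in pm1c_category_slicing.py;
    pagination (if Gamma caps the page size) is elided here.
    """
    params = {
        "closed": "true",            # assumption: filter to resolved markets
        "volume_num_min": 100_000,
        "volume_num_max": 2_000_000,
        "order": "volumeNum",
        "ascending": "false",        # assumption: highest volume first
        "limit": limit,
    }
    return requests.get(GAMMA, params=params, timeout=30).json()

# Illustrative subset of the slug/ticker rules; the real classifier
# covers all 13 categories listed above.
RULES = [
    ("sports",   re.compile(r"\b(nba|nfl|mlb|nhl|epl|vs)\b")),
    ("crypto",   re.compile(r"\b(btc|eth|bitcoin|ethereum|solana)\b")),
    ("politics", re.compile(r"\b(election|president|senate|congress)\b")),
]

def classify(market):
    """Infer a category from slug + first event ticker, falling back to 'other'."""
    events = market.get("events") or [{}]
    text = f"{market.get('slug', '')} {events[0].get('ticker', '')}".lower()
    for category, pattern in RULES:
        if pattern.search(text):
            return category
    return "other"
```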
Category distribution (N=491 total):
| Category | N | Share |
|---|---:|---:|
| sports | 220 | 45% |
| other | 174 | 35% |
| politics | 30 | 6% |
| esports | 25 | 5% |
| crypto | 25 | 5% |
| olympics | 4 | 1% |
| tech | 4 | 1% |
| tennis | 4 | 1% |
| space-tech | 3 | 1% |
| weather | 1 | <1% |
| macro-fed | 1 | <1% |
Brier by category (categories with N ≥ 15 only; prices snapshotted 3 days before resolution):
| Category | N | Median vol | Brier | Gate | Win rate |
|---|---:|---:|---:|---|---:|
| sports | 200 | $1,857,763 | 0.1989 | FAIL | 50.5% |
| crypto | 21 | $1,882,647 | 0.1876 | FAIL | 61.9% |
| other | 163 | $1,861,514 | 0.1159 | PASS | 31.9% |
| politics | 28 | $1,881,371 | 0.0057 | PASS | 21.4% |
Key observations:
- Sports is by far the biggest category (45% of the sample) and shows the worst calibration (Brier 0.1989 with a 50.5% win rate). A constant 50% forecast scores Brier 0.25, so ~0.20 at a 50/50 win rate means the market is barely better than coin-flipping — which is actually expected, because most individual sports games are close to coin flips. (See the worked check after this list.)
- Politics in this band is nearly perfectly calibrated (Brier 0.0057). Sophisticated traders dominate political prediction markets even at this volume level — informational edge is very hard here.
- Crypto shows an above-gate Brier, but at N=21 the point estimate has wide uncertainty.
- “Other” is a grab-bag — some of it is probably miscategorized sports or crypto that our regex missed. At Brier 0.1159 it’s borderline.
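To make the coin-flip baseline concrete, here is a worked check of the Brier arithmetic and the gate logic. Function names are illustrative, not the actual script’s.

```python
def brier(prices, outcomes):
    """Mean squared error between snapshot price and resolved outcome (0/1)."""
    return sum((p - o) ** 2 for p, o in zip(prices, outcomes)) / len(prices)

GATE = 0.12  # the PM1b discipline gate

def passes_gate(prices, outcomes, min_n=15):
    """Apply the min-N and Brier gates used in the category table above."""
    return len(prices) >= min_n and brier(prices, outcomes) <= GATE

# A constant 50% forecast scores (0.5 - outcome)^2 = 0.25 on every market,
# regardless of how outcomes fall — the coin-flip floor referenced above.
assert brier([0.5, 0.5], [1, 0]) == 0.25
```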
PM1d — Stability across time
Setup. Same $100K-$2M band. Bucket markets into three windows by endDate: H1 2025, H2 2025, Q1 2026. Compute Brier per (window × category) cell, require N ≥ 15 per cell. Only two categories clear the noise floor in at least two windows: sports and other.
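A minimal sketch of the bucketing, assuming each market row carries Gamma’s ISO endDate plus the snapshot price and resolved outcome from PM1c. Field and window names are illustrative.

```python
from collections import defaultdict
from datetime import datetime

def window_of(end_date_iso):
    """Map a market's endDate to one of the three PM1d windows (else None)."""
    d = datetime.fromisoformat(end_date_iso.replace("Z", "+00:00"))
    if d.year == 2025:
        return "H1 2025" if d.month <= 6 else "H2 2025"
    if d.year == 2026 and d.month <= 3:
        return "Q1 2026"
    return None

def brier_by_cell(markets, min_n=15):
    """Brier per (window x category) cell, suppressing cells below the noise floor."""
    cells = defaultdict(list)
    for m in markets:
        w = window_of(m["endDate"])
        if w is not None:
            cells[(w, m["category"])].append((m["price"] - m["outcome"]) ** 2)
    return {k: sum(v) / len(v) for k, v in cells.items() if len(v) >= min_n}
```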
Results:
Brier by window × category:

| Category | H1 2025 | H2 2025 | Q1 2026 |
|---|---:|---:|---:|
| sports | 0.1073 | 0.2216 | 0.1987 |
| other | 0.0796 | 0.1179 | 0.1562 |

Sample sizes (window × category):

| Category | H1 2025 | H2 2025 | Q1 2026 |
|---|---:|---:|---:|
| sports | 22 | 96 | 80 |
| other | 27 | 57 | 56 |
Stability verdict:
- Sports: UNSTABLE. Brier range 0.107 → 0.222 (2× degradation between H1 2025 and H2 2025). H1 2025 sports in this band actually PASSES the gate; H2 2025 and Q1 2026 fail it badly.
- Other: UNSTABLE. Brier trends from 0.080 → 0.118 → 0.156 — a monotonic degradation over time.
What this means
The “sports has alpha in the $100K-$2M band” hypothesis is weaker than it looked from PM1c alone. Two plausible explanations:
1. Sample composition effect. The top-500 markets from each window are biased toward whatever events had high volume at that time. In H1 2025 the top sports markets might be elite playoff games (hyperefficient); in H2 2025 they include a broader sweep of regular-season games (less efficient). The “mispricing” isn’t a property of the band, it’s a property of the mix of games within the band.
2. Regime change. Polymarket’s user base grew and changed across 2025. Calibration may genuinely have degraded because new, less sophisticated users joined and bid prices off-fair.
I can’t distinguish these two explanations from this data alone. Both explanations argue against a simple “buy all sports markets in this band” strategy. If it’s (1), the alpha lives in a specific sub-population we haven’t identified yet. If it’s (2), the alpha exists but is regime-dependent and we need continuous recalibration.
The “other” category also trends unstable, which is less interpretable because “other” is whatever didn’t match our regex — it’s a catch-all with unclear composition.
Politics and crypto have too few markets per window to produce meaningful per-window Brier numbers.
Decision point — why I’m pausing before PM1e
The founder’s guidance was to be cost-conscious: “if our margin is thin it could quietly eat into our return.” Given what PM1d just showed, here’s my read:
Evidence for spending on PM1e (X-sentiment prototype):
- There IS mispricing in the $100K-$2M band in aggregate
- Sports is the largest and worst-calibrated category
- X is genuinely the fastest information channel for sports (injuries, weather, late scratches)
- Single-market prototype cost is bounded (~20-50 X queries + 1-2 LLM calls)
Evidence against spending on PM1e right now:
- The sports signal is not stable across time — a prototype that works on a Q1 2026 market might fail on a 2025 market, or vice versa
- The category effect is confounded with sample composition — we don’t know what sub-population of sports is actually driving the signal
- A failed prototype would tell us almost nothing (“did the X sentiment not work, or is sports mispricing real but we picked a bad market?”)
- There’s cheaper free analysis available first
What I want to do BEFORE spending on PM1e:
- Sub-category analysis within sports — split sports into NBA / NFL / MLB / soccer / esports / tennis / other. Which specific leagues drive the high Brier? That could reveal the real alpha population. (Free.)
- Regular-season vs playoff split — hypothesis: playoff markets are well-priced because volume and attention concentrate there, while regular-season markets are where the mispricing hides. (Free.)
- Spread-aware Brier — account for the bid-ask spread at the time of the snapshot. Even if the midpoint shows Brier 0.20, a 10-cent-wide spread could put the actual tradeable prices well within discipline range. The narrow-margin concern applies doubly here. (Cheap — one extra Polymarket API call per market; see the sketch after this list.)
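A sketch of what spread-aware scoring could look like, assuming we can pull bid/ask at the snapshot from the CLOB book. The field names and the scoring rule are my assumptions, not an implemented method: the idea is to score each market at the quoted price nearest the realized outcome, giving the market full credit for its bid-ask band. If this charitable Brier drops back under the 0.12 gate, the midpoint mispricing isn’t tradeable.

```python
def spread_aware_brier_term(bid, ask, outcome):
    """Squared error using the book edge closest to the outcome (0 or 1)."""
    p = ask if outcome == 1 else bid  # ask is nearest 1, bid is nearest 0
    return (p - outcome) ** 2

# Example: midpoint 0.50 with a 10-cent spread (0.45/0.55) on a YES outcome
# scores (0.55 - 1)^2 = 0.2025 instead of the midpoint's (0.50 - 1)^2 = 0.25.
assert round(spread_aware_brier_term(0.45, 0.55, 1), 4) == 0.2025
```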
Only after those three are done would I feel comfortable spending xmcp budget on PM1e. And at that point I’d pick the prototype market deliberately — from the specific sub-population the analysis identifies as the alpha target.
Cost budget for future PM1e
When we do run PM1e, here’s the cost frame I’m planning against:
Per-prototype budget (single market):
- xmcp X searches: ~10-30 searches per market, each retrieving ~10-30 tweets. Cost depends on X API tier; at Basic-tier ~$0.01 per tweet retrieved, that’s roughly $1-$9 per market. Budget: $5 per prototype market.
- LLM calls: Feed ~50-200 tweets + market context to Claude Sonnet for probability extraction. Single call, ~10K tokens. Cost: ~$0.05 per call (Sonnet pricing).
- Polymarket API calls: Free.
- Total per prototype: under $10, probably under $5.
Scaled-strategy budget (what it’d cost to run continuously):
- 200 active markets × 1 check per day × 5 searches per check = 1,000 X searches/day; at ~10 tweets per search and ~$0.01 per tweet, that’s ~$100/day in data cost
- Plus ~200 LLM probability calls per day × $0.05 = $10/day inference cost
- Total: ~$110/day ≈ $3,300/month.
- This is the hurdle a live strategy has to beat. If our edge is narrow (1-2% per bet on small positions), the data cost alone could eat half the returns. That’s exactly the scenario the founder flagged.
This budget math is going to be a key part of any go/no-go decision. Putting it here so we have a reference.
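The same budget math as a tiny cost model, so the go/no-go inputs are explicit and easy to re-run when rates change. All rates here are this doc’s planning assumptions, not quoted vendor prices.

```python
def daily_cost(markets=200, searches_per_check=5, tweets_per_search=10,
               usd_per_tweet=0.01, llm_calls_per_market=1, usd_per_call=0.05):
    """Daily data + inference cost for the scaled strategy, per the frame above."""
    data = markets * searches_per_check * tweets_per_search * usd_per_tweet
    inference = markets * llm_calls_per_market * usd_per_call
    return data + inference

# 200 markets, 5 searches/check, ~10 tweets/search:
print(daily_cost())       # 110.0  (~$100/day data + $10/day LLM)
print(daily_cost() * 30)  # 3300.0 (~$3,300/month hurdle)
```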
Plots
- outputs/pm1c_category_slicing.png — horizontal bar chart, Brier by category, color-coded pass/fail
- outputs/pm1d_stability_check.png — grouped bars of Brier per category per time window
Related
- pm1-polymarket-baseline — original (superseded-in-part)
- pm1b-polymarket-long-tail-correction — the corrected long-tail finding this builds on
- ../scripts/pm1c_category_slicing
- ../scripts/pm1d_stability_check
- ../architecture-vision — future 5-agent vision
- ../../../06-reference/2026-04-10-halls-moore-algo-trading — the four backtesting biases framework we used to design the stability check