PM1 — Polymarket Baseline Calibration
⚠ Correction (2026-04-10): The conclusion in this document — that Polymarket is efficient across its entire top-650 volume range — was correct for what it measured but misleading as a statement about the whole venue. Our pagination approach capped us at markets with $7M+ volume (Gamma’s /markets endpoint 500s beyond offset ~950). A follow-up analysis using volume-band filters (pm1b-polymarket-long-tail-correction) found that three out of five volume bands between $10K and $100M fail the 0.12 discipline gate. Read the correction for the current verdict. The methodology and top-volume numbers below remain valid.
Question asked: can we find a winning strategy on a live prediction market?
Short answer: not with a directional model on the top-650 markets by volume ($7M+ median), but markets in the $100K–$2M volume range show Brier scores around 0.12–0.15 — above the discipline gate — and are the plausible target for directional strategies. See pm1b-polymarket-long-tail-correction for details.
This experiment is the first real measurement of whether Polymarket is even beatable from our current toolkit. Before testing any strategy idea, we need to know what “trust the market price” does on average — that’s the baseline any of our strategies has to beat. If Polymarket is hyperefficient, we’re either wasting time on directional prediction or we need a totally different angle (informational edge, market-making, niche markets).
Setup
Data source: Polymarket public Gamma + CLOB APIs (anonymous, no wallet). Client code at autoinv/polymarket.py.
Sample: all resolved binary markets in the top 650 by total volume (volumeNum sort), resolved between roughly 2023 and early 2026. Final N = 611 markets with usable daily price history.
Methodology:
- For each market, fetch daily mid-price history (`/prices-history` with `fidelity=1440`)
- For each of several "days before resolution" windows (1, 3, 7, 14, 30), snapshot the market's latest price at or before that time
- Compute the Brier score of those snapshots against the realized binary outcome
- Compare to naive baselines (always predict 0.5, always predict majority class) and to the discipline gate of Brier ≤ 0.12
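The snapshot-and-score step above can be sketched in a few lines. This is a minimal illustration of the method, not the actual code in pm1_brier_baseline.py; `snapshot_price` and `brier` are hypothetical helper names.

```python
from bisect import bisect_right

def snapshot_price(history, cutoff):
    """Latest price at or before `cutoff`, given history as a
    time-ascending list of (timestamp, price) pairs."""
    times = [t for t, _ in history]
    i = bisect_right(times, cutoff)
    return history[i - 1][1] if i else None

def brier(pairs):
    """Mean squared error of probability snapshots against 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in pairs) / len(pairs)

# Toy example: one market's daily history, snapshotted at day 2,
# scored together with a second market snapshotted at 0.10.
history = [(1, 0.30), (2, 0.55), (3, 0.70)]
p = snapshot_price(history, cutoff=2)   # -> 0.55
score = brier([(p, 1), (0.10, 0)])      # ((0.45)^2 + (0.10)^2) / 2
```

The real scripts do the same thing per market and per window, then average across the sample.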
Scripts:
- pm1_polymarket_explore.py — API hello-world
- pm1_brier_baseline.py — top-100 markets, Brier across time windows
- pm1_brier_by_volume_tier.py — Brier across four volume tiers
Results — top 100 markets, calibration across time windows
Window Brier Mean price Win rate
--------------------------------------------------------
1 day before 0.0348 0.199 20.2%
3 days before 0.0325 0.198 20.2%
7 days before 0.0361 0.194 20.2%
14 days before 0.0425 0.184 19.4%
30 days before 0.0610 0.158 20.0%
Baselines:
Always predict 0.5: Brier = 0.2500
Always predict majority (0.202): Brier = 0.1612
Discipline gate: Brier = 0.12
Excellent forecaster: Brier < 0.10
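The constant baselines have a closed form: always predicting probability p against outcomes with win rate w gives Brier w(1-p)^2 + (1-w)p^2. A quick check (illustrative only) reproduces the numbers above, including the base-rate prediction the table labels "majority (0.202)":

```python
def constant_brier(p, win_rate):
    """Brier score of always predicting probability p when a
    fraction win_rate of outcomes resolve to 1."""
    return win_rate * (1 - p) ** 2 + (1 - win_rate) * p ** 2

print(constant_brier(0.5, 0.202))    # 0.25 regardless of win rate
print(constant_brier(0.202, 0.202))  # ~0.1612, the "majority (0.202)" baseline
```

Note that p = win_rate minimizes this expression, so 0.1612 is the best any constant prediction can do on this sample: it is the floor a real strategy has to beat before calibration even enters the picture.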
Interpretation:
- The best window is 3 days before resolution, Brier = 0.033. That's roughly 5× better than the majority baseline, and 2–4× better than published benchmarks for professional election forecasters (538's historical 0.06–0.12 on presidential races).
- Calibration improves as resolution approaches (0.061 at 30 days → 0.033 at 3 days), with a slight uptick at 1 day (0.035). Expected: more time to resolution means more unresolved information, hence more noise.
- Mean price is ~0.20 because the dataset is dominated by multi-option markets where only one outcome wins, so most individual YES/NO pairs resolve NO (~20% win rate).
Results — Brier by volume tier
Tier N Median vol Brier Majority Lift
-------------------------------------------------------------------------
Top 1-50 (ultra-liquid) 49 $130M 0.0537 0.1741 +0.1204
Top 51-150 (large) 99 $ 53M 0.0148 0.1286 +0.1138
Top 151-350 (mid) 190 $ 24M 0.0262 0.1257 +0.0994
Top 351-650 (small) 273 $ 12M 0.0749 0.1609 +0.0860
Interpretation:
- Every volume tier is below 0.12. The entire top 650 markets by volume are better calibrated than our discipline gate; Polymarket clears it on its own.
- The Top 1-50 tier is actually worse than the Top 51-150 tier (0.0537 vs 0.0148), which is counterintuitive. I suspect this is driven by a few high-volume long-shot surprises: a $130M market at a 2% price that hit YES contributes a 0.96 per-observation error, which swamps dozens of correctly-called 2%→0 outcomes. With N=49 the sampling noise is high.
- The smallest tier (median volume $12M) has Brier 0.0749 — the worst of the four tiers but still below every published benchmark. Efficiency holds up even as liquidity drops through an order of magnitude.
- Brier rises only 1.39× (0.0537 → 0.0749) across a 10× range in volume. There's no obvious "liquidity tail" where calibration falls apart and alpha is sitting uncollected.
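The long-shot-surprise effect suspected in the Top 1-50 tier is easy to see with toy arithmetic. The numbers below are illustrative, not the actual per-market data: they just show how a single mispriced-in-hindsight long shot dominates a small-N tier average.

```python
n = 49                      # markets in the Top 1-50 tier with usable history
miss = (0.02 - 0.0) ** 2    # a correctly-called 2% long shot resolving NO: 0.0004
hit = (0.02 - 1.0) ** 2     # one 2% long shot that resolved YES: ~0.96

clean = miss                                 # tier Brier if every call lands
with_surprise = (hit + (n - 1) * miss) / n   # one surprise among 49 markets

print(clean)          # 0.0004
print(with_surprise)  # ~0.02 -- a ~50x jump caused by a single market
```

With N this small, one or two such outcomes are enough to reorder the tiers, which is why the 0.0537 vs 0.0148 inversion shouldn't be over-interpreted.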
The honest verdict
We did not find a winning strategy, and we quantified why that was unlikely from the start.
Polymarket is efficient. Not just “reasonably calibrated” — actively better than the best professional human forecasters across the entire 650-market volume range we tested. A strategy that buys contracts whose price disagrees with a model estimate will lose, on average, because the model is worse than the market price.
This is the honest negative result the roadmap article warned about: “tools have democratized, conviction hasn’t. Edge lives in unique data, unique models, or unique execution — not better pip installs.”
Five things this does NOT rule out:
- Very small / niche markets. Our sample stopped at volume rank 650 (median $12M). There’s a long tail of markets with $1K–$100K volume where fewer sophisticated traders participate. Plausibly less efficient but also plausibly too illiquid to trade.
- Short-horizon / intraday opportunities. We only looked at daily snapshots. There’s a 12-hour fidelity cap on resolved markets that prevents us from studying intraday price action without on-chain indexing. Some of the alpha might live in short bursts we can’t see from daily data.
- Informational edge strategies. If we build a data source the market doesn’t have — real-time news NLP, sensor data, scraped primary sources — we’d be competing with information rather than modeling noise. This is what the article means by “unique data.”
- Structural / market-making strategies. Providing liquidity (spreads, rebates) rather than directional prediction. This is a different game entirely and requires latency and capital, not forecasting accuracy.
- Other venues. Kalshi, Manifold, niche venues — different user bases, different efficiency characteristics. Polymarket’s advantage is that it’s the deepest crypto prediction market and attracts professional traders. Smaller venues may not.
Discipline gate status
From the simulate-like-quant-desk article: beat 0.12 Brier on a live event before deploying real capital.
Current status: the market itself beats 0.12 Brier at every window we measured. To pass the gate with a strategy, we’d have to build something that’s better than the market price. Our current stack (Black-Scholes for binary contracts + Monte Carlo + no alternative information sources) has zero chance of doing that.
Implication: the right next step is NOT to start tuning strategies on this data. It’s to figure out what our edge would be before writing another line of backtest code.
What I’d do next (pending your input)
Ranked roughly by effort:
- Confirm the negative result is not an artifact. Run the same analysis on Kalshi to see whether the result generalizes, or whether Polymarket has some weird selection effect. If Kalshi shows similar efficiency, we're looking at "prediction markets are hard" as a property of the venue class, not of Polymarket specifically.
- Scan below volume-rank 650. Our sample may have missed the truly inefficient tail. Pulling the $1K-$500K volume band and repeating the analysis would either confirm or refute the "efficiency holds on tiny markets" finding. ~2 hours of work, low risk, potentially reveals where alpha lives.
- Pick a specific market category where we have a plausible edge — e.g., crypto price markets (because we can build a live options-implied-volatility view that retail doesn't have), or weather markets (because NOAA data is structured and underused), or niche sports markets in leagues the big traders ignore. Define the edge first, then build the strategy.
- Pivot from forecasting to market-making. Polymarket's CLOB has rebate programs for liquidity providers. A market-making strategy doesn't need directional edge — it needs spread capture + inventory management + latency. Totally different skill stack but potentially more forgiving than directional prediction.
- Abandon Polymarket and look at equities again with the autoinv toolkit. Equity markets are also efficient, but we have stronger data sources (fundamentals, earnings, alternative data) and more developed infrastructure. The PM track was supposed to be the "lower-stakes sandbox" — if it's not meaningfully easier than equities, the rationale weakens.
Plots
- outputs/pm1_brier_baseline.png — Brier vs days-before-resolution curve for top 100 markets, with baselines overlaid
- outputs/pm1_brier_by_volume_tier.png — Brier by volume tier bar chart with majority baseline and discipline gate
- outputs/pm1_brier_baseline.csv — raw per-market data for reanalysis
What I added to the package
- `autoinv/polymarket.py` — anonymous read-only client for Gamma / CLOB / Data APIs, with a `Market` dataclass and `list_markets`, `get_market`, `get_midpoint`, `get_price_history`, and `iter_resolved_markets` helpers
- Key defaults baked in:
  - `order="volumeNum"` — the `"volume"` field sorts alphabetically and gives nonsense; use `volumeNum` for the actual numeric volume
  - `fidelity=1440` (daily bars) — the 12-hour fidelity cap on resolved markets is enforced by the API
  - Resolved outcome parsed from `outcomePrices` tuples: (1.0, 0.0) or (0.0, 1.0)
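The outcome parsing can be sketched as a small pure function. This is a hypothetical helper, assuming `outcomePrices` arrives as a (yes, no) pair as described above; the real client may handle more edge cases.

```python
def parse_resolved_outcome(outcome_prices):
    """Map a resolved market's outcomePrices pair to a binary result.

    Assumes a (yes, no) pair of floats (or numeric strings):
    (1.0, 0.0) -> 1 (YES won); (0.0, 1.0) -> 0 (NO won);
    anything else (unresolved or non-binary) -> None.
    """
    yes, no = (float(x) for x in outcome_prices)
    if (yes, no) == (1.0, 0.0):
        return 1
    if (yes, no) == (0.0, 1.0):
        return 0
    return None

print(parse_resolved_outcome(("1.0", "0.0")))  # 1
print(parse_resolved_outcome((0.0, 1.0)))      # 0
```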
Related
- autoinv package README
- Infrastructure plan — PM1 was the first milestone after the math curriculum
- Consolidation pass writeup — built the engine that made PM1 quick to write
- ../../../06-reference/2026-04-10-gemchange-quant-from-scratch
- ../../../06-reference/2026-04-10-gemchange-simulate-like-quant-desk
- ../../../06-reference/2026-04-10-halls-moore-algo-trading