PM1 (Kalshi) MLB Deep Dive — hypothesis refuted

TL;DR

The MLB mispricing finding was an artifact of snapshot timing, compounded by a small sample. At 1 hour before close, Kalshi MLB markets have Brier 0.1918 with lift +0.058 over the majority baseline: positive, not negative. The initial -0.056 lift came from the baseline script snapshotting N=40 MLB markets at ~24h before close, which is before meaningful price discovery on short-lived game markets; on the larger sample the 24h lift shrinks to -0.014. MLB is efficient once prices mature.

The only confirmed mispricing from the entire PM1 track remains the Polymarket Elon tweet-count finding (PM1c3). The April 9-11 Elon forward test resolves tomorrow 16:00 UTC; that’s the one real alpha lead we have left to validate.

Why run this deep dive

PM1 Kalshi baseline found MLB at Brier 0.3051 with lift -0.056 on N=40 markets. That paralleled the Elon tweet-count finding on Polymarket — a possible second sub-population where the market is actively worse than majority prediction. The finding was dramatic enough to be worth a larger sample and tighter snapshot timing before building strategy infrastructure around it.

The specific concern I flagged in the baseline writeup: MLB game markets on Kalshi open only ~48 hours before close. My adaptive snapshot took the price at market midlife (~24h before close), which might be before meaningful price discovery. If early-life prices are noisy opening quotes with minimal liquidity, the apparent negative lift could be an artifact of asking the wrong question at the wrong time.

Methodology

Larger sample. Paginated through Kalshi’s API to pull 200 settled MLB markets (up from 40).
Multiple snapshot timings. For each market, snapshotted the mid price at 24h, 12h, 6h, and 1h before close. This builds a timing curve showing how the market matures as resolution approaches.
Favorite vs underdog split at the snapshot closest to close (1h).
Calibration decile plot: sort predictions into 10 roughly equal-count bins (deciles) and compare mean predicted probability to observed win rate in each.

Cost: $0. All data from Kalshi’s public candlesticks endpoint.
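For reference, here is a minimal sketch of the three metrics used throughout. The function names are mine, not the baseline script's. The majority baseline is assumed to be the Brier score of always predicting the sample base rate, which reduces to p × (1 - p); that definition is inferred from the reported numbers (e.g. 0.4877 × 0.5123 ≈ 0.2498 in the 24h row), not confirmed against the script.

```python
from typing import Sequence


def brier_score(preds: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(preds) != len(outcomes) or len(preds) == 0:
        raise ValueError("preds and outcomes must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(preds, outcomes)) / len(preds)


def majority_baseline(outcomes: Sequence[int]) -> float:
    """Brier score of a constant forecast equal to the base rate: p * (1 - p).

    Assumption: this is how the 'Majority' column is defined.
    """
    if len(outcomes) == 0:
        raise ValueError("outcomes must be non-empty")
    p = sum(outcomes) / len(outcomes)
    return p * (1 - p)


def lift(preds: Sequence[float], outcomes: Sequence[int]) -> float:
    """Baseline Brier minus market Brier; positive means the market beats the baseline."""
    return majority_baseline(outcomes) - brier_score(preds, outcomes)


if __name__ == "__main__":
    # Toy data, not from the Kalshi sample.
    toy_preds = [0.8, 0.3, 0.6, 0.1]
    toy_outcomes = [1, 0, 1, 0]
    print(round(brier_score(toy_preds, toy_outcomes), 4))
    print(round(lift(toy_preds, toy_outcomes), 4))
```

With this sign convention, lower Brier is better and lift = Majority - Brier, which matches every row in the tables below.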

Results — the timing curve

Timing          N     Brier    Majority    Lift    Mean pred    Win rate
24h before    162    0.2637    0.2498   -0.0139    0.5071      48.77%
12h before    167    0.2618    0.2500   -0.0119    0.5051      49.70%
 6h before    151    0.2684    0.2497   -0.0187    0.5034      48.34%
 1h before    147    0.1918    0.2500   +0.0582    0.4861      50.34%

The picture is unambiguous:

At 24h, 12h, and 6h before close, the market has marginally negative lift (between -0.012 and -0.019). Small enough to be sampling noise; certainly not a durable mispricing.
At 1h before close, lift jumps to +0.058 — the market meaningfully beats the baseline.
The original “Brier 0.3051, lift -0.056” from the baseline was the midlife-snapshot number, which here would sit somewhere around the 24h mark. The larger sample (N=162 vs N=40) moved that to -0.014 and revealed the earlier result was a bad point estimate.

The baseline’s “MLB is mispriced” conclusion was wrong. The right conclusion: Kalshi MLB markets mature sharply in the last few hours before close. Before the game starts, liquidity pours in, informed traders price the game correctly, and the market hits positive lift. The pre-game window (6-24 hours before close) is where the market is indistinguishable from the baseline — that’s not an anomaly, that’s just the market not having a strong view yet.
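The per-offset snapshot behind the timing curve can be sketched roughly as follows. This is an illustration under assumptions: the `Candle` shape (timestamp in seconds, mid price) and the function name are hypothetical, and the real candlestick payload will look different.

```python
from bisect import bisect_right
from typing import Optional, Sequence, Tuple

# Hypothetical shape: (unix_timestamp_seconds, mid_price), sorted ascending by time.
Candle = Tuple[float, float]


def price_at_offset(
    candles: Sequence[Candle], close_ts: float, hours_before: float
) -> Optional[float]:
    """Return the last mid price at or before (close_ts - hours_before).

    Returns None when the market has no candle that early, so the caller
    can drop the market from that timing bucket.
    """
    target = close_ts - hours_before * 3600
    timestamps = [ts for ts, _ in candles]
    idx = bisect_right(timestamps, target)  # count of candles with ts <= target
    if idx == 0:
        return None
    return candles[idx - 1][1]


if __name__ == "__main__":
    hour = 3600
    # Toy market that opens at t=0 and closes at t=48h.
    toy = [(0, 0.50), (24 * hour, 0.52), (42 * hour, 0.55), (47 * hour, 0.70)]
    for h in (24, 12, 6, 1):
        print(h, price_at_offset(toy, 48 * hour, h))
```

Taking the last candle at or before the target time avoids look-ahead: the snapshot never uses a price that was not yet available at that offset.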

Favorite vs underdog (1h before close)

Slice                        N    Brier   Majority   Lift    Mean pred    Win rate
Favorites (mid > 0.5)       71   0.1909   0.2083   +0.017    0.742       70.42%
Underdogs (mid ≤ 0.5)       76   0.1926   0.2161   +0.024    0.247       31.58%

Both sides beat baseline by a small margin. Favorites land at 70.4% win rate when priced at 74.2% mean prediction (slight overconfidence, -3.8 points). Underdogs land at 31.6% win rate when priced at 24.7% mean prediction (slight underconfidence on longshots, +6.9 points).

Neither split shows the kind of systematic error that would enable a strategy. The market is basically right, with mild edge-case quirks that might not survive out-of-sample.
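A quick consistency check on the split: weighting the two slices by N should recover the overall 1h row (Brier 0.1918, mean pred 0.4861, win rate 50.34%). The sketch below uses only the numbers from the table; the dictionary layout and helper name are mine.

```python
# Values copied from the favorite/underdog table (1h before close).
SLICES = {
    "favorites": {"n": 71, "brier": 0.1909, "mean_pred": 0.742, "win_rate": 0.7042},
    "underdogs": {"n": 76, "brier": 0.1926, "mean_pred": 0.247, "win_rate": 0.3158},
}


def pooled(field: str) -> float:
    """N-weighted average of a per-slice statistic."""
    total_n = sum(s["n"] for s in SLICES.values())
    return sum(s["n"] * s[field] for s in SLICES.values()) / total_n


if __name__ == "__main__":
    for field in ("brier", "mean_pred", "win_rate"):
        print(field, round(pooled(field), 4))
```

The slices do recombine to the overall row. Each slice's lift is measured against its own base rate (0.2083 and 0.2161), which is a tougher baseline than the pooled 0.2500, and that is why both slice lifts come out smaller than the overall +0.058.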

Calibration decile plot (1h before close)

Decile    N    Mean pred    Observed    Delta
  0      16     0.0263      0.0000    -0.026  ← accurate
  1      14     0.1143      0.2143    +0.100
  2      14     0.2729      0.5000    +0.227  ← market underpredicts
  3      15     0.3567      0.4000    +0.043
  4      15     0.4400      0.4000    -0.040
  5      14     0.5350      0.7857    +0.251  ← market underpredicts
  6      15     0.6087      0.5333    -0.075
  7      14     0.6871      0.4286    -0.259  ← market overpredicts
  8      15     0.8600      0.8000    -0.060
  9      15     0.9687      1.0000    +0.031  ← accurate

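Extremes are accurate (deciles 0 and 9). The middle is noisy, with three deviations above 20 points (deciles 2, 5, and 7), but N per decile is only 14-16. At p=0.5 and N=14 the standard error of a proportion is sqrt(0.25/14) ≈ 0.134, so a 25-point miss is about 1.9 standard errors: borderline for a single bin, and not conclusive once ten bins are examined at once. The signs also alternate (+, +, -), so the misses do not line up into a directional bias.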
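As a check on the sampling-variance argument, the standard error can be computed directly from the table. This sketch uses the normal approximation to the binomial with each decile's mean prediction as p. That approximation is my choice, not necessarily what the original analysis used, and it is poor for the extreme deciles, where p is near 0 or 1.

```python
import math

# (decile, n, mean_pred, observed) copied from the calibration table.
DECILES = [
    (0, 16, 0.0263, 0.0000),
    (1, 14, 0.1143, 0.2143),
    (2, 14, 0.2729, 0.5000),
    (3, 15, 0.3567, 0.4000),
    (4, 15, 0.4400, 0.4000),
    (5, 14, 0.5350, 0.7857),
    (6, 15, 0.6087, 0.5333),
    (7, 14, 0.6871, 0.4286),
    (8, 15, 0.8600, 0.8000),
    (9, 15, 0.9687, 1.0000),
]


def proportion_se(p: float, n: int) -> float:
    """Standard error of an observed proportion when the true rate is p."""
    return math.sqrt(p * (1 - p) / n)


def z_scores(rows):
    """Map decile -> (observed - mean_pred) / SE, treating mean_pred as the true rate."""
    return {d: (obs - pred) / proportion_se(pred, n) for d, n, pred, obs in rows}


if __name__ == "__main__":
    print(round(proportion_se(0.5, 14), 4))  # worst-case SE for a 14-market bin
    for decile, z in z_scores(DECILES).items():
        print(decile, round(z, 2))
```

Under these assumptions the three large misses (deciles 2, 5, and 7) land at roughly 1.9 to 2.1 standard errors each, and every other decile is well inside 1.5.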

The takeaway: MLB at 1h before close is well-calibrated in aggregate, with no obvious systematic bias on favorites or underdogs. There’s no repeatable directional edge visible in this sample.
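For anyone reproducing the decile table, a minimal equal-count binning sketch is below. The exact binning rule in the original script is not stated here (the reported bin sizes of 14-16 suggest ties or a slightly different split), so treat this as one reasonable way to do it rather than the method used.

```python
from typing import List, Sequence, Tuple

Row = Tuple[int, int, float, float, float]  # (bin, n, mean_pred, observed, delta)


def calibration_deciles(
    preds: Sequence[float], outcomes: Sequence[int], bins: int = 10
) -> List[Row]:
    """Sort by predicted probability, cut into equal-count bins, and compare
    mean prediction to observed win rate in each bin."""
    if len(preds) != len(outcomes):
        raise ValueError("preds and outcomes must have equal length")
    pairs = sorted(zip(preds, outcomes))
    n = len(pairs)
    rows: List[Row] = []
    for b in range(bins):
        chunk = pairs[b * n // bins : (b + 1) * n // bins]
        if not chunk:
            continue  # fewer samples than bins
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        observed = sum(o for _, o in chunk) / len(chunk)
        rows.append((b, len(chunk), mean_pred, observed, observed - mean_pred))
    return rows


if __name__ == "__main__":
    # Toy data, not from the Kalshi sample.
    toy_preds = [i / 20 for i in range(20)]
    toy_outcomes = [0] * 10 + [1] * 10
    for row in calibration_deciles(toy_preds, toy_outcomes):
        print(row)
```

Delta here is observed minus mean prediction, the same sign convention as the table: positive means the market underpredicted.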

What this means for the PM1 track

Before this deep dive: we had two candidate mispricings — Polymarket Elon tweets + Kalshi MLB — and they looked like a reusable pattern (retail-dominated markets where systematic data access gives alpha).

After this deep dive: we have one candidate, Polymarket Elon tweets. The “Kalshi MLB” parallel was a false positive caused by a methodology artifact.

Why Elon tweets still stands:

Polymarket Elon markets live 7+ days, so “3 days before close” is a meaningful snapshot inside the mature market period (not near open like the MLB 24h snapshot was)
The Elon finding was N=21 with binary outcomes and no decile calibration issues
The structural explanation is tighter: narrow 20-tweet buckets, retail-only participation, public data nobody pulls
Forward test is already running, first scored result tomorrow

What doesn’t generalize: the “retail markets have mispricing because the crowd is guessing” pattern. Kalshi MLB is also retail-dominated, and it is efficient once it matures. The reason Elon tweets might be mispriced isn’t that the crowd is retail — it’s that the bucket structure is narrow enough to reward anyone who actually pulls the posting-rate data. Most retail sports markets don’t have that narrow-bucket property.

Updates to earlier docs

pm1-kalshi-baseline — the MLB verdict (“biggest red flag, real alpha target”) is wrong. The sample there used a snapshot at market midlife which happened to fall at the immature end of MLB’s short lifetime. Adding a prominent correction banner to that doc.
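Architecture implication: adaptive snapshots need to be smarter. “Market midlife” isn’t a good proxy for “price is mature” on short-lived markets. Better heuristic: snapshot at the latest point where the price can be assumed mature, defined as “close - min(0.15 × lifetime, 72h)”, so the offset scales with a fraction of lifetime rather than naive midlife. Caveat: for a 48h MLB market this gives 7.2h before close, which the timing curve above still shows at negative lift, so the 0.15 fraction should be checked per series before it goes into the next iteration of the Kalshi baseline script.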
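A sketch of the proposed heuristic, to make the numbers concrete. The function name and parameter names are hypothetical; the 0.15 fraction and 72h cap are the values proposed above, exposed as arguments so they can be tuned per series.

```python
from datetime import datetime, timedelta, timezone


def mature_snapshot_time(
    open_ts: datetime,
    close_ts: datetime,
    fraction: float = 0.15,
    cap: timedelta = timedelta(hours=72),
) -> datetime:
    """Snapshot time = close - min(fraction * lifetime, cap)."""
    lifetime = close_ts - open_ts
    if lifetime <= timedelta(0):
        raise ValueError("close_ts must be after open_ts")
    offset = min(lifetime * fraction, cap)
    return close_ts - offset


if __name__ == "__main__":
    close = datetime(2025, 1, 10, 16, 0, tzinfo=timezone.utc)  # arbitrary example date
    for label, life in [("48h", 48), ("7d", 168), ("60d", 1440)]:
        snap = mature_snapshot_time(close - timedelta(hours=life), close)
        hours_before = (close - snap).total_seconds() / 3600
        print(label, round(hours_before, 1), "hours before close")
```

For a 48h market this lands 7.2h before close, for a 7-day market 25.2h, and for a 60-day market the 72h cap applies.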

What I’d do next

Let the Elon forward-test run. April 9-11 resolves tomorrow 16:00 UTC — first scored bet. Nothing else to do in the meantime.
Don’t chase other “sports mispricing” hypotheses on Kalshi. If the MLB signal was a false positive, the marginal negative lifts for NHL/EPL/LaLiga in the baseline are probably the same artifact.
Rerun the Kalshi baseline with the corrected snapshot methodology to confirm which series actually have lift above/below baseline when prices are mature. Cheap.
Return focus to the narrow-bucket retail hypothesis. Elon tweets fit the pattern. What else does? Polymarket “will X tweet Y-Z times” is rare, but other narrow-bucket question types — “what will the closing Dow be rounded to nearest 100” / “how many home runs will be hit in MLB this week” / specific player prop buckets — might show the same structure. Worth a targeted search on Polymarket and Kalshi for narrow-bucket market types.
Don’t build more infrastructure until we have a confirmed direction. We’ve got one live forward-test, we’ll score it tomorrow, and the result tells us whether this entire track has legs.

pm1-kalshi-baseline — the baseline (MLB verdict superseded)
pm1c3-other-breakdown — the Elon tweet finding that still stands
pm1e-elon-forecast — the live forward test
pm1cd-category-and-stability — historical context on stability-check methodology