pm1 kalshi mlb deepdive

Thu Apr 09 2026 20:00:00 GMT-0400 (Eastern Daylight Time) ·experiment-writeup ·status: complete

PM1 (Kalshi) MLB Deep Dive — hypothesis refuted

TL;DR

The MLB mispricing finding was a snapshot-timing artifact. At 1 hour before close, Kalshi MLB markets have Brier 0.1918 with lift +0.058 over the majority baseline: positive, not negative. The initial -0.056 lift arose because the baseline script snapshotted MLB markets at ~24h before close, which is before meaningful price discovery on short-lived game markets, and did so on only N=40 markets; on the larger sample the 24h lift is just -0.014. MLB is efficient once prices mature.

The only confirmed mispricing from the entire PM1 track remains the Polymarket Elon tweet-count finding (PM1c3). The April 9-11 Elon forward test resolves tomorrow 16:00 UTC; that’s the one real alpha lead we have left to validate.

Why run this deep dive

PM1 Kalshi baseline found MLB at Brier 0.3051 with lift -0.056 on N=40 markets. That paralleled the Elon tweet-count finding on Polymarket — a possible second sub-population where the market is actively worse than majority prediction. The finding was dramatic enough to be worth a larger sample and tighter snapshot timing before building strategy infrastructure around it.

The specific concern I flagged in the baseline writeup: MLB game markets on Kalshi open only ~48 hours before close. My adaptive snapshot put the price at “market midpoint” (~24h before close), which might be before meaningful price discovery. If early-life prices are noisy opening quotes with minimal liquidity, the apparent negative lift could be an artifact of asking the wrong question at the wrong time.

Methodology

  1. Larger sample. Paginated Kalshi to pull 200 settled MLB markets (up from 40).
  2. Multiple snapshot timings. For each market, snapshotted the mid price at 24h, 12h, 6h, and 1h before close. This builds a timing curve showing how the market matures as resolution approaches.
  3. Favorite vs underdog split at the closest snapshot.
  4. Calibration decile plot — bin predictions into 10 deciles, compare mean predicted probability to observed win rate.

Cost: $0. All data from Kalshi’s public candlesticks endpoint.
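
The multi-timing snapshot in step 2 can be sketched as follows. This is a minimal sketch, not the script used for the experiment: the `Candle` record, its field names (`end_ts`, `yes_bid`, `yes_ask`), and the helpers `snapshot_mid` and `snapshots` are hypothetical stand-ins for whatever the candlesticks endpoint returns, and prices are assumed to be probabilities in [0, 1]. Taking the last candle at or before the target time avoids peeking at later prices.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence

HOUR = 3600
OFFSETS_H = (24, 12, 6, 1)  # hours before close, as in the methodology


@dataclass(frozen=True)
class Candle:
    """Hypothetical candle record; the field names are assumptions."""
    end_ts: int      # candle end time, unix seconds
    yes_bid: float   # best YES bid at candle end, in [0, 1]
    yes_ask: float   # best YES ask at candle end, in [0, 1]


def snapshot_mid(candles: Sequence[Candle], close_ts: int,
                 hours_before: int) -> Optional[float]:
    """Mid price from the last candle ending at or before the target time.

    Returns None when the market has no candle that early, for example
    because it opened less than `hours_before` hours before close.
    """
    target = close_ts - hours_before * HOUR
    eligible = [c for c in candles if c.end_ts <= target]
    if not eligible:
        return None
    last = max(eligible, key=lambda c: c.end_ts)
    return (last.yes_bid + last.yes_ask) / 2


def snapshots(candles: Sequence[Candle],
              close_ts: int) -> Dict[int, Optional[float]]:
    """Map each offset in hours to its mid price, or None if unavailable."""
    return {h: snapshot_mid(candles, close_ts, h) for h in OFFSETS_H}
```

In this sketch, a market with no candle before a given offset returns None and would drop out of that timing's sample.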

Results — the timing curve

Timing          N     Brier    Majority    Lift    Mean pred    Win rate
24h before    162    0.2637    0.2498   -0.0139    0.5071      48.77%
12h before    167    0.2618    0.2500   -0.0119    0.5051      49.70%
 6h before    151    0.2684    0.2497   -0.0187    0.5034      48.34%
 1h before    147    0.1918    0.2500   +0.0582    0.4861      50.34%
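
For reference, the columns in this table are related by simple formulas. The sketch below assumes the majority baseline means predicting the observed base rate p for every market, which gives a Brier score of p(1 - p), and that lift is the baseline Brier minus the market Brier. Both assumptions reproduce the table values; the function names are illustrative.

```python
from typing import Sequence


def brier(preds: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between predicted probability and 0/1 outcome."""
    if len(preds) != len(outcomes) or not preds:
        raise ValueError("need equal-length, non-empty inputs")
    return sum((p - y) ** 2 for p, y in zip(preds, outcomes)) / len(preds)


def majority_brier(win_rate: float) -> float:
    """Brier score of always predicting the base rate: p * (1 - p)."""
    return win_rate * (1.0 - win_rate)


def lift(market_brier: float, win_rate: float) -> float:
    """Positive lift means the market beats the base-rate baseline."""
    return majority_brier(win_rate) - market_brier


# Check against the 1h and 24h rows of the timing table.
print(round(lift(0.1918, 0.5034), 4))  # 0.0582
print(round(lift(0.2637, 0.4877), 4))  # -0.0139
```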

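The picture is unambiguous: at 24h, 12h, and 6h before close the lift is slightly negative (-0.012 to -0.019), and at 1h it jumps to +0.058.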

The baseline’s “MLB is mispriced” conclusion was wrong. The right conclusion: Kalshi MLB markets mature sharply in the last few hours before close. Before the game starts, liquidity pours in, informed traders price the game correctly, and the market hits positive lift. The pre-game window (6-24 hours before close) is where the market is indistinguishable from the baseline — that’s not an anomaly, that’s just the market not having a strong view yet.

Favorite vs underdog (1h before close)

Slice                        N    Brier   Majority   Lift    Mean pred    Win rate
Favorites (mid > 0.5)       71   0.1909   0.2083   +0.017    0.742       70.42%
Underdogs (mid ≤ 0.5)       76   0.1926   0.2161   +0.024    0.247       31.58%

Both sides beat baseline by a small margin. Favorites land at 70.4% win rate when priced at 74.2% mean prediction (slight overconfidence, -3.8 points). Underdogs land at 31.6% win rate when priced at 24.7% mean prediction (slight underconfidence on longshots, +6.9 points).

Neither split shows the kind of systematic error that would enable a strategy. The market is basically right, with mild edge-case quirks that might not survive out-of-sample.
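
One arithmetic note on reading this table next to the timing table: the per-slice lifts (+0.017, +0.024) are much smaller than the pooled +0.058 because each slice is scored against its own base rate (baseline Brier of about 0.208 and 0.216 instead of 0.250), which already contains the favorite/underdog information. The sketch below recombines the slice rows and recovers the pooled 1h figures to within rounding. It uses only numbers from the tables above and assumes the baseline is p(1 - p) at each group's win rate.

```python
# (N, Brier, win rate) per slice, copied from the table above.
slices = {
    "favorites": (71, 0.1909, 0.7042),
    "underdogs": (76, 0.1926, 0.3158),
}

n_total = sum(n for n, _, _ in slices.values())

# Pooled Brier and win rate are N-weighted averages of the slice values.
pooled_brier = sum(n * b for n, b, _ in slices.values()) / n_total
pooled_win = sum(n * w for n, _, w in slices.values()) / n_total

# Base-rate baseline p * (1 - p), computed per slice and for the pool.
slice_lift = {k: w * (1 - w) - b for k, (n, b, w) in slices.items()}
pooled_lift = pooled_win * (1 - pooled_win) - pooled_brier

print(n_total, round(pooled_brier, 4), round(pooled_win, 4))
print({k: round(v, 4) for k, v in slice_lift.items()})
print(round(pooled_lift, 4))
```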

Calibration decile plot (1h before close)

Decile    N    Mean pred    Observed    Delta
  0      16     0.0263      0.0000    -0.026  ← accurate
  1      14     0.1143      0.2143    +0.100
  2      14     0.2729      0.5000    +0.227  ← market underpredicts
  3      15     0.3567      0.4000    +0.043
  4      15     0.4400      0.4000    -0.040
  5      14     0.5350      0.7857    +0.251  ← market underpredicts
  6      15     0.6087      0.5333    -0.075
  7      14     0.6871      0.4286    -0.259  ← market overpredicts
  8      15     0.8600      0.8000    -0.060
  9      15     0.9687      1.0000    +0.031  ← accurate
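
A table like this can be produced along the following lines. This is a sketch under assumptions: it sorts markets by predicted probability and cuts them into ten roughly equal-count bins, which is consistent with the 14-16 markets per decile shown above but may not match the original script's binning exactly (tie handling in particular). `calibration_bins` is an illustrative name.

```python
from typing import List, Sequence, Tuple


def calibration_bins(preds: Sequence[float], outcomes: Sequence[int],
                     n_bins: int = 10) -> List[Tuple[int, float, float, float]]:
    """Return (n, mean_pred, observed_rate, delta) per equal-count bin.

    delta = observed - mean_pred, so a positive delta means the market
    underpredicted that bin.
    """
    if len(preds) != len(outcomes):
        raise ValueError("preds and outcomes must have the same length")
    pairs = sorted(zip(preds, outcomes))  # ascending by prediction
    total = len(pairs)
    rows = []
    for i in range(n_bins):
        # Integer boundaries spread any remainder evenly across bins.
        lo = i * total // n_bins
        hi = (i + 1) * total // n_bins
        chunk = pairs[lo:hi]
        if not chunk:
            continue  # fewer samples than bins
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        observed = sum(y for _, y in chunk) / len(chunk)
        rows.append((len(chunk), mean_pred, observed, observed - mean_pred))
    return rows
```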

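Extremes are accurate (deciles 0 and 9). The middle is noisy with several 20+ point deviations, but N per decile is only 14-16. At p=0.5 and N=14 the standard error of a proportion is about 13 points, so a 25-point miss is just under two standard errors. That is borderline rather than unremarkable for any single decile, but the misses point in opposite directions (deciles 2 and 5 underpredicted, decile 7 overpredicted), so they look more like sampling noise than a systematic bias.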
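
To put numbers on the sampling-variance argument, the sketch below computes the binomial standard error for the three largest decile misses, using each decile's mean prediction as the null probability. The rows are copied from the table above; treating the markets within a decile as independent draws at a single probability is a simplifying assumption.

```python
from math import sqrt

# (decile, N, mean predicted, observed rate) from the table above.
rows = [
    (2, 14, 0.2729, 0.5000),
    (5, 14, 0.5350, 0.7857),
    (7, 14, 0.6871, 0.4286),
]


def z_score(n: int, pred: float, observed: float) -> float:
    """Deviation in standard errors of a binomial proportion under `pred`."""
    se = sqrt(pred * (1.0 - pred) / n)
    return (observed - pred) / se


for decile, n, pred, obs in rows:
    print(decile, round(z_score(n, pred, obs), 1))
# prints 1.9, 1.9, and -2.1 for deciles 2, 5, and 7
```

With only 14 markets per decile the normal approximation behind these z-scores is rough, so they are best read as order-of-magnitude figures.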

The takeaway: MLB at 1h before close is well-calibrated in aggregate, with no obvious systematic bias on favorites or underdogs. There’s no repeatable directional edge visible in this sample.

What this means for the PM1 track

Before this deep dive: we had two candidate mispricings — Polymarket Elon tweets + Kalshi MLB — and they looked like a reusable pattern (retail-dominated markets where systematic data access gives alpha).

After this deep dive: we have one candidate, Polymarket Elon tweets. The “Kalshi MLB” parallel was a false positive caused by a methodology artifact.

Why Elon tweets still stands:

What doesn’t generalize: the “retail markets have mispricing because the crowd is guessing” pattern. Kalshi MLB is also retail-dominated, and it is efficient once it matures. The reason Elon tweets might be mispriced isn’t that the crowd is retail — it’s that the bucket structure is narrow enough to reward anyone who actually pulls the posting-rate data. Most retail sports markets don’t have that narrow-bucket property.

Updates to earlier docs

What I’d do next

  1. Let the Elon forward-test run. April 9-11 resolves tomorrow 16:00 UTC — first scored bet. Nothing else to do in the meantime.
  2. Don’t chase other “sports mispricing” hypotheses on Kalshi. If the MLB finding was a false positive, the marginal negative lifts for NHL/EPL/LaLiga in the baseline are probably the same artifact.
  3. Rerun the Kalshi baseline with the corrected snapshot methodology to confirm which series actually have lift above/below baseline when prices are mature. Cheap.
  4. Return focus to the narrow-bucket retail hypothesis. Elon tweets fit the pattern. What else does? Polymarket “will X tweet Y-Z times” is rare, but other narrow-bucket question types — “what will the closing Dow be rounded to nearest 100” / “how many home runs will be hit in MLB this week” / specific player prop buckets — might show the same structure. Worth a targeted search on Polymarket and Kalshi for narrow-bucket market types.
  5. Don’t build more infrastructure until we have a confirmed direction. We’ve got one live forward-test, we’ll score it tomorrow, and the result tells us whether this entire track has legs.