PM1c3 — Breaking down the “other” category

Context

PM1c + PM1d found that the $100K-$2M band’s “other” bucket (163 markets, Brier 0.116) looked potentially interesting but was opaque because “other” was a catch-all for whatever didn’t match our sports/crypto/politics/esports regex. This script drills into that residual with a second-pass classifier covering crypto-price-threshold, elon-tweets, fed-rates, political-event, entertainment, weather, tech-launch, geopolitical, and several other patterns.

Cost: $0. Polymarket endpoints only.

Method

Re-pull the top 500 markets in the $100K-$2M band
Apply the PM1c coarse classifier (which tags 174 as “other” in this pull, versus the 163 reported in PM1c + PM1d)
Apply a second-pass classifier with 14 refined regex rules
Compute Brier per refined category with MIN_N=10
Mark any category with both an above-gate Brier (gate FAIL) AND a skewed win rate (not 50/50) as a viable candidate
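The second-pass classification and per-category scoring above can be sketched roughly as follows. This is a minimal illustration, not the actual script: the three regex patterns are hypothetical stand-ins for the 14 real rules, and `GATE_BRIER` and `SKEW_TOLERANCE` are assumed values, since the gate threshold is not stated in this doc (the results only imply it sits somewhere between 0.0977 and 0.1419).

```python
import re
from collections import defaultdict

# Hypothetical second-pass rules; the real script has 14 and its patterns
# are not reproduced here.
REFINED_RULES = [
    ("elon-tweets", re.compile(r"\belon\b.*\btweets?\b", re.I)),
    ("fed-rates", re.compile(r"\bfed\b.*\b(rate|bps)\b", re.I)),
    ("weather", re.compile(r"\b(temperature|hurricane|rainfall)\b", re.I)),
]

MIN_N = 10             # from the Method section
GATE_BRIER = 0.12      # ASSUMPTION: the doc does not state the gate value
SKEW_TOLERANCE = 0.05  # ASSUMPTION: how far from 50% counts as "not 50/50"


def refine(question: str) -> str:
    """Return the first matching refined label, or 'unknown'."""
    for label, pattern in REFINED_RULES:
        if pattern.search(question):
            return label
    return "unknown"


def brier_by_category(markets):
    """markets: iterable of (question, midpoint_prob, outcome), outcome in {0, 1}."""
    groups = defaultdict(list)
    for question, prob, outcome in markets:
        groups[refine(question)].append((prob, outcome))

    report = {}
    for label, rows in groups.items():
        if len(rows) < MIN_N:
            continue  # too few markets to score
        n = len(rows)
        brier = sum((p - y) ** 2 for p, y in rows) / n
        win_rate = sum(y for _, y in rows) / n
        fails_gate = brier > GATE_BRIER
        report[label] = {
            "n": n,
            "brier": brier,
            "win_rate": win_rate,
            "gate": "FAIL" if fails_gate else "PASS",
            # viable = worse than the gate AND outcomes are not a coin flip
            "viable": fails_gate and abs(win_rate - 0.5) > SKEW_TOLERANCE,
        }
    return report
```

Rule order matters in a first-match classifier like this one, so more specific patterns would need to come before broader ones.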

Results

Refined category distribution (N=174 “other” markets):

109  unknown        (residual — couldn't classify even with finer rules)
 21  elon-tweets
 13  crypto-adjacent
 10  geopolitical
  8  political-event
  7  tech-launch
  2  entertainment-film
  2  election
  1  weather
  1  fed-rates

Brier per category (MIN_N=10):

Category            N    Median vol     Brier    Win%    Gate
elon-tweets        21   $1,881,446    0.2459   28.57%   FAIL  ← target
crypto-adjacent    12   $1,791,255    0.1419   50.00%   FAIL
unknown           102   $1,861,599    0.0977   32.35%   PASS

Viable candidates — above the gate AND not 50/50:

elon-tweets: N=21, Brier=0.2459, win rate=28.57%

The unknown residual (109 markets, 102 of which are scored in the Brier table) passes the gate at 0.0977. It is a mix of tennis finals, soccer matches, celebrity predictions, crypto price thresholds (missed by our regex), sovereign leader questions, and miscellaneous events. It looks reasonably calibrated in aggregate, and no obvious sub-population drives a gap.

The elon-tweets finding is the real result. Worth digging into.

The Elon-tweets finding, in detail

All 21 markets are structurally similar: “Will Elon Musk post [X-Y] tweets from [date1] to [date2]?” — narrow-bucket predictions of Elon’s weekly tweet count. Each market covers a ~20-tweet-wide bucket (e.g., 280-299, 300-319, 320-339).

Outcome distribution: 6 YES (28.6%), 15 NO (71.4%). Skewed toward NO, which is expected because any given narrow bucket has low prior probability — there are usually ~20 buckets per week covering the plausible range, so most resolve NO.

The Brier math:

Base-rate baseline (always predict the in-sample base rate, 28.6%): Brier = 0.286 × 0.714 = 0.2041
Polymarket’s midpoint 3 days before resolution: Brier = 0.2459
Lift: -0.0419. The market is worse than always predicting the base rate.

Read: the market is doing worse than a “know nothing, use base rate” strategy on these narrow-bucket markets. This is the first place in our analysis where we’ve found Polymarket’s own price to be measurably underperforming a trivial baseline.
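The baseline arithmetic can be checked in a few lines. A constant forecast p scored against outcomes with YES rate r has Brier r(1-p)² + (1-r)p², which reduces to r(1-r) when p = r. The function name below is ours, and the 0.2459 market figure is taken from the results rather than recomputed.

```python
def constant_forecast_brier(p: float, yes_rate: float) -> float:
    """Brier score of always forecasting p when a fraction yes_rate resolves YES."""
    return yes_rate * (1 - p) ** 2 + (1 - yes_rate) * p ** 2


yes, total = 6, 21
base_rate = yes / total                                     # 0.2857...
baseline = constant_forecast_brier(base_rate, base_rate)    # r * (1 - r)
always_no = constant_forecast_brier(0.0, base_rate)         # literal majority-class forecast
market = 0.2459                                             # reported market Brier
lift = baseline - market                                    # negative => market is worse

print(f"baseline={baseline:.4f} always_no={always_no:.4f}")
```

With these rounded inputs the lift comes out at roughly -0.042, matching the reported -0.0419 up to rounding. Note that always predicting NO scores 0.2857, so the base-rate forecast is the tougher of the two trivial baselines.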

Structural reason this is plausible:

These are retail entertainment markets, not high-stakes political/financial prediction. Professional traders are elsewhere.
The ranges are narrow (20 tweets wide). Small errors in estimating Elon’s posting rate translate into large errors in bucket prediction.
Most traders are guessing. Nobody with the actual tweet history is bothering to price these accurately.
Critically, the required data is 100% public and accessible via the X API. This isn’t a “we need insider information” edge — it’s a “we need to actually look at the data” edge.

Caveats I’m flagging honestly:

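N=21 is small. The 0.2459 point estimate could easily be 0.20 ± 0.08 on a larger sample. The direction (market Brier worse than the base-rate baseline) is more reliable than the magnitude, but the baseline uses the in-sample base rate, which a live forecaster would not know in advance.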
Stability unknown. We haven’t checked whether elon-tweets calibration has been consistently bad across time windows. Could be a 2026-specific artifact.
Elon is structurally unpredictable. He’s been known to tweet 100 times in a day and 0 the next. Our frequency model could whiff on regime-change events.
Narrow buckets amplify errors. A forecaster that’s “close” on the actual count can still be catastrophically wrong on the bucket prediction if the count lands near a boundary.
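To make the narrow-bucket caveat concrete, here is a rough sensitivity sketch. It uses a Normal approximation purely for illustration (the prototype proposes a Negative Binomial), and the means and the standard deviation of 15 are hypothetical values, not estimates from Elon’s history.

```python
import math


def normal_cdf(x: float, mean: float, sd: float) -> float:
    """Normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def bucket_prob_normal(lo: int, hi: int, mean: float, sd: float) -> float:
    """P(lo <= count <= hi) with a continuity correction for integer counts."""
    return normal_cdf(hi + 0.5, mean, sd) - normal_cdf(lo - 0.5, mean, sd)


# Hypothetical: how the 300-319 bucket probability moves as the estimated
# weekly mean drifts upward by about 5% and 10%.
sensitivity = {
    mean: bucket_prob_normal(300, 319, mean, sd=15.0)
    for mean in (310, 325, 340)
}
for mean, prob in sensitivity.items():
    print(mean, round(prob, 3))
```

The tighter the assumed spread relative to the 20-tweet bucket width, the faster the bucket probability falls as the mean estimate drifts, which is the boundary effect the last caveat describes.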

Why this is the ideal PM1e prototype target

Live markets exist right now. Four active Elon tweet count events on Polymarket:
- elon-musk-of-tweets-april-3-april-10: resolves today at 16:00 UTC (30 markets)
- elon-musk-of-tweets-april-7-april-14: resolves April 14 (30 markets)
- elon-musk-of-tweets-april-9-april-11: resolves April 11 (10 markets)
- elon-musk-of-tweets-april-10-april-17: resolves April 17 (30 markets)
We can start forward-testing immediately.
No LLM required. This is pure frequency analysis. We pull Elon’s recent tweet history, fit a distribution (Poisson or Negative Binomial) to his weekly counts, compute P(count in bucket) for each market. No sentiment, no LLM inference, no Claude calls. Pennies per run.
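xmcp gives us the data. The X API’s getUsersPosts endpoint returns Elon’s timestamped tweets. One call pulls ~100-200 tweets. At the ~300 tweets per week implied by the bucket ranges, that is less than one week of history, so fitting a distribution to 12 weekly counts needs a paginated backfill first; after that, one call per snapshot should keep the series current. Cost: ~$0.02 per snapshot.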
Forward-testable. New event drops every few days. We can predict on each bucket, wait for resolution, score Brier over weeks. Build a track record as a side effect.
Failure modes are informative. If our Brier is worse than the market’s, we know the market has information we don’t. If our Brier is better, we’ve found a repeatable edge. Either answer is useful.
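The forward-test comparison could be scored with something like the sketch below. The CSV column names (`our_prob`, `market_mid`, `outcome`) are hypothetical, since the prediction file layout has not been fixed yet.

```python
import csv
import io


def load_rows(csv_text: str):
    """Parse resolved predictions from CSV text with the assumed columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {
            "our_prob": float(r["our_prob"]),
            "market_mid": float(r["market_mid"]),
            "outcome": int(r["outcome"]),  # 1 = YES, 0 = NO
        }
        for r in reader
    ]


def brier(pairs) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    pairs = list(pairs)
    return sum((p - y) ** 2 for p, y in pairs) / len(pairs)


def score_forward_test(rows):
    """Compare our Brier to the market's on the same resolved markets."""
    ours = brier((r["our_prob"], r["outcome"]) for r in rows)
    market = brier((r["market_mid"], r["outcome"]) for r in rows)
    return {"ours": ours, "market": market, "edge": market - ours}
```

A positive `edge` means our forecasts scored better than the market midpoint on the same markets; a negative one means the market knew something we did not.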

Proposed PM1e prototype structure

Skill: scripts/pm1e_elon_tweet_forecast.py

Pull active elon-musk-of-tweets-* events via /events endpoint
For each event, list its bucket markets and get current midpoints
Pull Elon’s last ~200 tweets via xmcp getUsersPosts(elonmusk)
Compute weekly tweet counts over the last 12 weeks
Fit a Negative Binomial distribution to weekly counts (robust to overdispersion; Elon’s variance > mean)
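For each bucket market [X, Y], compute our P(tweets ∈ [X, Y]) = F(Y) - F(X-1), where F is the Negative Binomial CDF with mean μ (scaled to the bucket’s time window) and dispersion α. Note that scipy.stats.nbinom is parameterized by (n, p), not (μ, α): assuming α is the NB2 dispersion (variance = μ + αμ²), the call is nbinom.cdf(k, n, p) with n = 1/α and p = n/(n + μ)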
Compare our probability to the market midpoint, record the delta
Write predictions + midpoints to a dated CSV for later scoring
Wait for resolution, then compute Brier for our predictions and for the market’s midpoints
Run /loop 6h /pm1e-elon-forecast to refresh predictions as the market evolves
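The counting, fitting, and bucket-probability steps might look like the sketch below. It makes several assumptions worth calling out: the fit is method-of-moments rather than maximum likelihood, α is taken to be the NB2 dispersion (variance = μ + αμ²), the pmf is computed with the standard library instead of scipy, and shorter windows are handled by scaling the mean linearly while holding α fixed. All function names are ours.

```python
import math
from collections import Counter
from datetime import timedelta
from statistics import mean, variance


def weekly_counts(timestamps, end, weeks=12):
    """Count posts in each of the `weeks` consecutive 7-day windows ending at `end`."""
    start = end - timedelta(weeks=weeks)
    per_week = Counter(
        (ts - start).days // 7 for ts in timestamps if start <= ts < end
    )
    return [per_week.get(i, 0) for i in range(weeks)]


def fit_nb_moments(counts):
    """Method-of-moments NB2 fit, returning (mu, alpha)."""
    mu = mean(counts)
    alpha = (variance(counts) - mu) / mu ** 2
    # Floor alpha so the fit stays valid when the sample is not overdispersed.
    return mu, max(alpha, 1e-6)


def nb_pmf(k, mu, alpha):
    """Negative Binomial pmf in the (n, p) form with n = 1/alpha, p = n/(n+mu)."""
    n = 1.0 / alpha
    p = n / (n + mu)
    log_pmf = (
        math.lgamma(k + n) - math.lgamma(n) - math.lgamma(k + 1)
        + n * math.log(p) + k * math.log1p(-p)
    )
    return math.exp(log_pmf)


def bucket_prob(lo, hi, mu, alpha, window_days=7):
    """P(lo <= count <= hi) for a window of `window_days` days."""
    # ASSUMPTION: mean scales linearly with window length, alpha is unchanged.
    mu_window = mu * window_days / 7
    return sum(nb_pmf(k, mu_window, alpha) for k in range(lo, hi + 1))


# Hypothetical weekly counts, for illustration only (not real data).
example_counts = [280, 310, 295, 330, 305, 290, 320, 300, 315, 285, 340, 310]
example_mu, example_alpha = fit_nb_moments(example_counts)
example_p = bucket_prob(300, 319, example_mu, example_alpha)
```

With only 12 weekly observations the dispersion estimate will be noisy, so it may be worth comparing the fitted bucket probabilities against a plain empirical histogram before trusting either.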

Cost estimate:

Per snapshot: ~$0.02 (one xmcp getUsersPosts call) + free Polymarket calls
Daily cost running every 6h: ~$0.08/day
Full two-week forward-test (until all active events resolve): under $2 total
Zero LLM inference cost.
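The cost figures are simple enough to parameterize. In the sketch below, `calls_per_snapshot` is a hypothetical knob for the case where one getUsersPosts call does not return enough history; the other inputs come from the estimate above.

```python
def forward_test_cost(cost_per_call=0.02, interval_hours=6, days=14, calls_per_snapshot=1):
    """Return (daily_cost, total_cost) in dollars for the polling loop."""
    snapshots_per_day = 24 // interval_hours
    daily = cost_per_call * calls_per_snapshot * snapshots_per_day
    return daily, daily * days


daily_cost, total_cost = forward_test_cost()
print(f"daily=${daily_cost:.2f} total=${total_cost:.2f}")
```

At the stated inputs this gives $0.08 per day and $1.12 over two weeks, consistent with the "under $2" bound.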

This is well within the “free analysis” envelope you agreed to. The ~$2 bounded cost is a rounding error against any plausible trading outcome, and even if the prototype finds zero edge we’ve built and validated a pipeline we can point at other data-rich markets.

What this means for the broader PM1 story

Recap of the full arc:

PM1: top-650 markets look hyperefficient → we were wrong, pagination capped us
PM1b: volume bands reveal $100K-$2M is the mispricing zone (Brier 0.12-0.15)
PM1c: splitting by category shows sports (N=200, Brier 0.20) dominates the band
PM1c2: sports Brier is stuck at ~0.20 for inherent 50/50 reasons; spread impact is zero
PM1d (stability): sports Brier swings 0.11-0.22 across time windows, signal not persistent
PM1c3 (this doc): drilling into “other” reveals elon-tweets as the first real mispriced sub-population, with a plausible structural explanation and live markets to test on

The overall shape: most of Polymarket is efficient, most “mispricing” is actually game variance, but there are specific small retail-dominated sub-categories where the market is meaningfully worse than trivial baselines. Elon-tweets is the first one we’ve found. There may be others (crypto-price-threshold markets missed by our regex, narrow-bucket social-media metrics, long-tail celebrity events).

The strategy implication isn’t “trade everything in the $100K-$2M band.” It’s “find the specific retail-entertainment niches where the crowd is guessing and systematic data access gives us an edge.” That’s a much narrower but potentially repeatable thesis.

Related

pm1cd-category-and-stability: the PM1c analysis this drills into
pm1b-polymarket-long-tail-correction: the volume-band finding
pm1-polymarket-baseline: the original (superseded) baseline
../../../06-reference/concepts/brier-score: reference
../architecture-vision: 5-agent target; PM1e would be a Strategy Research + Paper Testing cycle