
pm1c3 other breakdown

Thu Apr 09 2026 20:00:00 GMT-0400 (Eastern Daylight Time) · experiment-writeup · status: complete

PM1c3 — Breaking down the “other” category

Context

PM1c + PM1d found that the $100K-$2M band’s “other” bucket (163 markets, Brier 0.116) looked potentially interesting but was opaque because “other” was a catch-all for whatever didn’t match our sports/crypto/politics/esports regex. This script drills into that residual with a second-pass classifier covering crypto-price-threshold, elon-tweets, fed-rates, political-event, entertainment, weather, tech-launch, geopolitical, and several other patterns.

Cost: $0. Polymarket endpoints only.

Method

  1. Re-pull the top 500 markets in the $100K-$2M band
  2. Apply the PM1c coarse classifier (which tags 174 as “other”)
  3. Apply a second-pass classifier with 14 refined regex rules
  4. Compute Brier per refined category with MIN_N=10
  5. For any category whose Brier is above the gate (Gate = FAIL) AND whose win rate is skewed away from 50/50, mark it as a viable candidate
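
Steps 3 and 4 can be sketched roughly as follows. This is a minimal illustration, not the actual script: the three regex patterns are hypothetical stand-ins for the 14 refined rules, and the (question, price, outcome) tuple layout, including which price snapshot is scored, is an assumption. Only MIN_N=10 comes from the method above.

```python
import re
from collections import defaultdict

MIN_N = 10  # minimum markets per category, as in step 4

# Hypothetical patterns for illustration; the real 14 rules are not shown here.
REFINED_RULES = [
    ("elon-tweets", re.compile(r"elon musk.*\btweets?\b", re.I)),
    ("fed-rates", re.compile(r"\bfed\b.*\b(rates?|bps)\b", re.I)),
    ("weather", re.compile(r"\b(temperature|hurricane|rainfall)\b", re.I)),
]

def classify(question):
    """Return the first matching refined label, or 'unknown' for the residual."""
    for label, pattern in REFINED_RULES:
        if pattern.search(question):
            return label
    return "unknown"

def brier_by_category(markets):
    """markets: iterable of (question, price_yes, outcome), outcome in {0, 1}.

    Returns per-category N, Brier, and win rate, skipping categories
    with fewer than MIN_N markets.
    """
    groups = defaultdict(list)
    for question, price, outcome in markets:
        groups[classify(question)].append((price, outcome))
    stats = {}
    for label, rows in groups.items():
        if len(rows) < MIN_N:
            continue
        brier = sum((p - o) ** 2 for p, o in rows) / len(rows)
        win_rate = sum(o for _, o in rows) / len(rows)
        stats[label] = {"n": len(rows), "brier": brier, "win_rate": win_rate}
    return stats
```

The gate comparison is left out on purpose, since the threshold value is not stated in this writeup.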

Results

Refined category distribution (N=174 “other” markets):

109  unknown        (residual — couldn't classify even with finer rules)
 21  elon-tweets
 13  crypto-adjacent
 10  geopolitical
  8  political-event
  7  tech-launch
  2  entertainment-film
  2  election
  1  weather
  1  fed-rates

Brier per category (MIN_N=10):

Category            N    Median vol     Brier    Win%    Gate
elon-tweets        21   $1,881,446    0.2459   28.57%   FAIL  ← target
crypto-adjacent    12   $1,791,255    0.1419   50.00%   FAIL
unknown           102   $1,861,599    0.0977   32.35%   PASS

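Viable candidates (Brier above the gate AND win rate not 50/50): only elon-tweets qualifies. crypto-adjacent also fails the gate, but its win rate is exactly 50.00%.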

The unknown residual (109 markets, N=102 in the Brier table) passes the gate at 0.0977. It’s a mix of tennis finals, soccer matches, celebrity predictions, crypto price thresholds (missed by my regex), sovereign leader questions, and miscellaneous events. It looks reasonably calibrated in aggregate: no obvious sub-population drives a gap.

The elon-tweets finding is the real result. Worth digging into.

The Elon-tweets finding, in detail

All 21 markets are structurally similar: “Will Elon Musk post [X-Y] tweets from [date1] to [date2]?” — narrow-bucket predictions of Elon’s weekly tweet count. Each market covers a ~20-tweet-wide bucket (e.g., 280-299, 300-319, 320-339).

Outcome distribution: 6 YES (28.6%), 15 NO (71.4%). Skewed toward NO, which is expected because any given narrow bucket has low prior probability — there are usually ~20 buckets per week covering the plausible range, so most resolve NO.

The Brier math:
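
One way to reconstruct the comparison from the numbers already reported, assuming the “base rate” forecast means always predicting the in-sample YES frequency (6 of 21). That reading is an assumption, and the in-sample rate is only known after resolution, so this is a sketch of the arithmetic rather than the original calculation. A constant forecast p against a YES rate of p scores p(1-p), which here is about 0.2041.

```python
# Counts from the outcome distribution above: 6 YES, 15 NO.
yes, no = 6, 15
n = yes + no
base_rate = yes / n

# Brier score of a constant forecast equal to the base rate.
# Each YES market contributes (1 - p)^2, each NO market contributes p^2.
baseline_brier = (yes * (1 - base_rate) ** 2 + no * base_rate ** 2) / n

market_brier = 0.2459  # elon-tweets row of the Brier table above
gap = market_brier - baseline_brier

print(f"base rate      = {base_rate:.4f}")
print(f"baseline Brier = {baseline_brier:.4f}")
print(f"market Brier   = {market_brier:.4f}")
print(f"gap            = {gap:+.4f}")
```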

Read: the market is doing worse than a “know nothing, use base rate” strategy on these narrow-bucket markets. This is the first place in our analysis where we’ve found Polymarket’s own price to be measurably underperforming a trivial baseline.

Structural reason this is plausible:

Caveats I’m flagging honestly:

Why this is the ideal PM1e prototype target

  1. Live markets exist right now. Four active Elon tweet count events on Polymarket:
    • elon-musk-of-tweets-april-3-april-10 — resolves today at 16:00 UTC (30 markets)
    • elon-musk-of-tweets-april-7-april-14 — resolves April 14 (30 markets)
    • elon-musk-of-tweets-april-9-april-11 — resolves April 11 (10 markets)
    • elon-musk-of-tweets-april-10-april-17 — resolves April 17 (30 markets)
    We can start forward-testing immediately.
  2. No LLM required. This is pure frequency analysis. We pull Elon’s recent tweet history, fit a distribution (Poisson or Negative Binomial) to his weekly counts, compute P(count in bucket) for each market. No sentiment, no LLM inference, no Claude calls. Pennies per run.
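  3. xmcp gives us the data. The X API’s getUsersPosts endpoint returns Elon’s timestamped tweets. One call pulls ~100-200 tweets at ~$0.02 per snapshot. The example buckets above (280-339) put a week at roughly 300 tweets, so one call covers less than a week; fitting a distribution to weekly counts takes several paginated calls.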
  4. Forward-testable. New event drops every few days. We can predict on each bucket, wait for resolution, score Brier over weeks. Build a track record as a side effect.
  5. Failure modes are informative. If our Brier is worse than the market’s, we know the market has information we don’t. If our Brier is better, we’ve found a repeatable edge. Either answer is useful.
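
The comparison in items 4 and 5 could be scored along these lines. This is a hedged sketch: the CSV column names (market_id, our_prob, market_mid, outcome) are hypothetical, an empty outcome is assumed to mean "not yet resolved", and the sample rows are made up purely to exercise the function.

```python
import csv
import io

def score_forecasts(csv_text):
    """Return (our_brier, market_brier) over resolved rows only."""
    ours, market, n = 0.0, 0.0, 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["outcome"] == "":  # unresolved market, skip it
            continue
        y = float(row["outcome"])
        ours += (float(row["our_prob"]) - y) ** 2
        market += (float(row["market_mid"]) - y) ** 2
        n += 1
    if n == 0:
        raise ValueError("no resolved rows to score")
    return ours / n, market / n

# Made-up rows for illustration only (not real predictions or prices).
SAMPLE = """market_id,our_prob,market_mid,outcome
a,0.10,0.20,0
b,0.30,0.25,1
c,0.05,0.15,
"""

our_brier, market_brier = score_forecasts(SAMPLE)
print(f"ours={our_brier:.4f} market={market_brier:.4f}")
```

Scoring both forecasts on exactly the same resolved rows keeps the comparison paired, so a difference cannot come from one side being evaluated on easier markets.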

Proposed PM1e prototype structure

Skill: scripts/pm1e_elon_tweet_forecast.py

  1. Pull active elon-musk-of-tweets-* events via /events endpoint
  2. For each event, list its bucket markets and get current midpoints
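  3. Pull Elon’s recent tweet timestamps via xmcp getUsersPosts(elonmusk), paginating until the last 12 weeks are covered (a single ~200-tweet page is under a week at the ~300 tweets/week the buckets above imply)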
  4. Compute weekly tweet counts over the last 12 weeks
  5. Fit a Negative Binomial distribution to weekly counts (robust to overdispersion; Elon’s variance > mean)
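  6. For each bucket market [X, Y], compute our P(tweets ∈ [X, Y]) = F(Y) - F(X-1), where F is the fitted Negative Binomial CDF, scaled to the event’s time window (the April 9-11 event, for example, spans 2 days rather than 7). If nbinom here is scipy.stats.nbinom, note that it takes (n, p) rather than (μ, α): assuming variance = μ + αμ², use n = 1/α and p = 1/(1 + αμ)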
  7. Compare our probability to the market midpoint, record the delta
  8. Write predictions + midpoints to a dated CSV for later scoring
  9. Wait for resolution, then compute Brier for our predictions and for the market’s midpoints
  10. Run /loop 6h /pm1e-elon-forecast to refresh predictions as the market evolves
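
Steps 5 and 6 might look roughly like this. It is a minimal sketch under several assumptions: the fit is method-of-moments rather than maximum likelihood, the PMF is written with the standard library instead of scipy so the block is self-contained, and the weekly counts are invented placeholders, not Elon’s real history. The (n, p) pair follows the same convention as scipy.stats.nbinom. Scaling to shorter event windows is not handled here.

```python
import math
from statistics import mean, variance

def fit_nbinom_moments(counts):
    """Method-of-moments Negative Binomial fit; returns (n, p)."""
    m = mean(counts)
    v = variance(counts)  # sample variance
    if v <= m:
        raise ValueError("no overdispersion; a Poisson fit would be the fallback")
    n = m * m / (v - m)  # equals 1/alpha when variance = mu + alpha * mu^2
    p = m / v            # equals 1 / (1 + alpha * mu)
    return n, p

def nbinom_logpmf(k, n, p):
    """log P(K = k) for a Negative Binomial with real-valued n."""
    return (math.lgamma(k + n) - math.lgamma(n) - math.lgamma(k + 1)
            + n * math.log(p) + k * math.log1p(-p))

def bucket_prob(lo, hi, n, p):
    """P(lo <= K <= hi), bounds inclusive, by summing the PMF."""
    return sum(math.exp(nbinom_logpmf(k, n, p)) for k in range(lo, hi + 1))

# Placeholder weekly counts for illustration only (NOT real data).
weekly_counts = [310, 275, 342, 298, 260, 385, 301, 289, 330, 270, 355, 315]

n, p = fit_nbinom_moments(weekly_counts)
buckets = [(280, 299), (300, 319), (320, 339)]  # example buckets from above
probs = {b: bucket_prob(b[0], b[1], n, p) for b in buckets}

for (lo, hi), prob in probs.items():
    print(f"P({lo}-{hi}) = {prob:.3f}")
```

Each bucket probability would then be compared against the market midpoint in step 7. With only 12 weekly observations the moment estimates are noisy, so the deltas deserve wide error bars.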

Cost estimate:

This is well within the “free analysis” envelope you agreed to. The ~$2 bounded cost is a rounding error against any plausible trading outcome, and even if the prototype finds zero edge we’ve built and validated a pipeline we can point at other data-rich markets.

What this means for the broader PM1 story

Recap of the full arc:

  1. PM1: top-650 markets look hyperefficient → we were wrong, pagination capped us
  2. PM1b: volume bands reveal $100K-$2M is the mispricing zone (Brier 0.12-0.15)
  3. PM1c: splitting by category shows sports (N=200, Brier 0.20) dominates the band
  4. PM1c2: sports Brier is stuck at ~0.20 for inherent 50/50 reasons; spread impact is zero
  5. PM1cd (stability): sports Brier swings 0.11-0.22 across time windows, signal not persistent
  6. PM1c3 (this doc): drilling into “other” reveals elon-tweets as the first real mispriced sub-population, with a plausible structural explanation and live markets to test on

The overall shape: most of Polymarket is efficient, most “mispricing” is actually game variance, but there are specific small retail-dominated sub-categories where the market is meaningfully worse than trivial baselines. Elon-tweets is the first one we’ve found. There may be others (crypto-price-threshold markets missed by our regex, narrow-bucket social-media metrics, long-tail celebrity events).

The strategy implication isn’t “trade everything in the $100K-$2M band.” It’s “find the specific retail-entertainment niches where the crowd is guessing and systematic data access gives us an edge.” That’s a much narrower but potentially repeatable thesis.