Brier Score — the forecaster’s ruler
TL;DR
Brier score = mean squared error between your probability predictions and the realized outcomes.
- Lower is better. Zero is perfect. 0.25 is the score you get from always predicting 0.5, regardless of the base rate (the "I have no idea" baseline). 1.0 is the worst possible score (predicting the opposite of what happens, with full confidence, every time).
- It’s a proper scoring rule — you can’t game it by hedging. The only way to minimize your expected Brier is to report your honest best estimate.
- It’s the standard way to evaluate probabilistic forecasters — weather, elections, prediction markets, medical diagnoses.
The formula
For N binary events with predicted probabilities pᵢ and realized outcomes oᵢ (each 0 or 1):
Brier = (1/N) · Σ (pᵢ - oᵢ)²
For every prediction, you square the gap between what you said would happen and what did happen, then average over all your predictions.
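In code it is one line. A minimal sketch, with illustrative names (not the actual autoinv.metrics.brier_score implementation):

```python
import numpy as np

def brier_score(probs, outcomes):
    """Mean squared gap between predicted probabilities and realized 0/1 outcomes."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return float(np.mean((probs - outcomes) ** 2))

# Three predictions: confident and right, hedged, confident and wrong.
print(brier_score([0.9, 0.5, 0.8], [1, 1, 0]))  # (0.01 + 0.25 + 0.64) / 3 = 0.30
```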
Why it matters
Brier punishes confident wrong predictions more than hesitant ones. If you predict 0.9 for an event that turns out to be 0 (never happened), you eat a 0.81 penalty. If you predict 0.6 for the same event, you only eat 0.36. That’s the right incentive — confident bets should cost more when they’re wrong.
But it also punishes timid right predictions. If you predict 0.55 for an event that happens, you eat 0.2025 (you got it right but you hedged too much). That’s also the right incentive — if you actually knew, you should have said so.
The property that makes this work: the expected Brier is minimized when you report your true subjective probability. If you genuinely believe something is 70% likely, saying “70%” gives you the lowest expected Brier over many trials. Saying “50%” to hedge makes your expected score worse, not better. This is what “proper scoring rule” means.
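You can convince yourself with a five-line check: if the event truly happens with probability q and you report p, your expected penalty is q·(p−1)² + (1−q)·p², which bottoms out exactly at p = q. A quick illustrative sketch:

```python
def expected_brier(reported_p, true_q):
    """Expected per-event penalty when the event truly occurs with probability true_q."""
    return true_q * (reported_p - 1) ** 2 + (1 - true_q) * reported_p ** 2

true_q = 0.7
for p in (0.5, 0.6, 0.7, 0.8, 0.9):
    print(f"report {p:.1f} -> expected Brier {expected_brier(p, true_q):.3f}")
# Reporting 0.7 gives the minimum (0.210 = q*(1-q)); hedging down to 0.5 costs you (0.250).
```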
Reference values
These are the benchmarks that let you calibrate your intuition:
| Brier score | What it means |
|---|---|
| 0.00 | Perfect. Every prediction matched the outcome exactly. Impossible in practice. |
| 0.02-0.05 | Excellent. Professional forecaster with strong track record on a tractable domain. |
| 0.06-0.12 | Very good. 538’s historical range on US presidential elections. The best human election forecasters. |
| 0.10-0.15 | Good. Published weather forecasters on precipitation. |
| 0.12 | Our discipline gate. From the gemchange simulate-like-quant-desk article: “if your simulation can beat that, you have edge.” |
| 0.16 | Always predicting the base rate (the "majority class" baseline). Predicting the base rate p on every event gives Brier p·(1−p); on a 20% base rate, always guessing "20%" gives 0.16. This is the "I know nothing except the overall frequency" floor. |
| 0.25 | Always predicting 0.5. This is what you get if you don’t know anything about individual events — you just throw up your hands and say “50/50 no idea.” |
| 0.50 | Worse than random on a 50/50 event. You’re systematically wrong. |
| 1.00 | Catastrophically wrong on every prediction. Always predicted the opposite of the truth. |
Important nuance: the "good" Brier number depends on how hard the prediction problem is. A perfectly calibrated forecaster's expected score on an event with true probability p is p·(1−p), so for near-coin-flip events (most individual sports games) the floor sits around 0.25, while for lopsided events it drops fast: about 0.09 when the favorite is at 90%, about 0.05 at 95%.
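That floor falls straight out of the expected-penalty formula above. A quick check:

```python
# Irreducible expected Brier q*(1-q) for a perfectly calibrated forecaster.
for q in (0.50, 0.80, 0.90, 0.95):
    print(f"true prob {q:.2f} -> floor {q * (1 - q):.4f}")
# 0.50 -> 0.2500, 0.80 -> 0.1600, 0.90 -> 0.0900, 0.95 -> 0.0475
```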
Always compare your Brier to the right baseline:
- What does “always predict 0.5” score on this dataset? (The worst honest baseline.)
- What does “always predict the base rate / majority class” score? (The second-worst honest baseline — if you can’t beat this, your model knows nothing.)
- What do professional forecasters on similar problems score? (The practical ceiling.)
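Here is a minimal sketch of computing those baselines on a set of resolved outcomes (toy data, illustrative only):

```python
import numpy as np

outcomes = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0], dtype=float)  # toy resolved outcomes

base_rate = outcomes.mean()                          # 0.30 here
always_half = np.mean((0.5 - outcomes) ** 2)         # 0.250 by construction
always_base = np.mean((base_rate - outcomes) ** 2)   # p*(1-p) = 0.21 for a 30% base rate

print(f"always 0.5       -> {always_half:.3f}")
print(f"always base rate -> {always_base:.3f}")
# A strategy's Brier has to come in meaningfully below both before it means anything.
```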
How we use it in Automated Investing
- Discipline gate: no strategy deploys real capital until it beats Brier 0.12 on held-out data (from the roadmap article). This is the primary go/no-go criterion.
- Market calibration: measure how well Polymarket’s own midpoint predicts outcomes by computing Brier of the midpoint vs the realized outcome across a sample of resolved markets. That tells us whether there’s room for a model-based strategy to beat the market or whether the market is already too efficient.
- Baseline comparison: every performance number is reported alongside “majority class” Brier and “always 0.5” Brier. If our strategy’s Brier isn’t meaningfully below the majority baseline, we don’t have a strategy — we have a noise generator.
- Per-category slicing: compute Brier by category (sports / crypto / politics / etc.) to find where mispricing actually lives instead of averaging across the whole venue.
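A sketch of the per-category slice, assuming a flat table of resolved markets with a category label, a model probability, and a 0/1 outcome (the column names here are made up for illustration, not the actual autoinv schema):

```python
import pandas as pd

# Hypothetical resolved-market table; the column names are illustrative.
df = pd.DataFrame({
    "category":   ["sports", "sports", "crypto", "politics", "crypto", "politics"],
    "model_prob": [0.55, 0.48, 0.80, 0.90, 0.65, 0.30],
    "outcome":    [1, 0, 1, 1, 0, 0],
})

df["sq_err"] = (df["model_prob"] - df["outcome"]) ** 2

# Brier per category, plus the sample size so tiny slices don't get over-read.
per_category = df.groupby("category")["sq_err"].agg(brier="mean", n="count")
print(per_category.sort_values("brier"))
```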
Relationship to other scoring rules
- Log loss (cross-entropy): another proper scoring rule. Log loss = -(1/N) Σ [oᵢ · log(pᵢ) + (1-oᵢ) · log(1-pᵢ)]. Log loss is much harsher on confident wrong predictions (saying 0.99 for something that didn’t happen produces a huge log-loss penalty; Brier’s penalty is bounded at 1.0 per observation). Choose log loss if you want to discourage extreme overconfidence; choose Brier if you want a gentler, bounded scorer.
- Accuracy: just the fraction of times you were on the “right side” of 0.5. Accuracy ignores confidence entirely — a model that says 0.51 is treated the same as one that says 0.99, which is almost never what you want in finance. Never use accuracy on probabilistic problems.
- Calibration plot: not a single number but a diagnostic plot. Group predictions into buckets (e.g., 0-10%, 10-20%, …), plot the predicted probability on the x-axis and the observed frequency on the y-axis. A perfect forecaster’s plot is the 45° line. Use this alongside Brier to see where your model is mis-calibrated.
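To see these side by side, here is an illustrative sketch that scores one set of predictions with Brier, log loss, and accuracy, then prints a crude calibration table:

```python
import numpy as np

probs = np.array([0.95, 0.70, 0.60, 0.20, 0.10, 0.85, 0.40, 0.55])
outcomes = np.array([1, 1, 0, 0, 0, 1, 1, 1], dtype=float)

eps = 1e-15                       # keep log() away from exact 0 and 1
p = np.clip(probs, eps, 1 - eps)

brier = np.mean((probs - outcomes) ** 2)
log_loss = -np.mean(outcomes * np.log(p) + (1 - outcomes) * np.log(1 - p))
accuracy = np.mean((probs >= 0.5) == (outcomes == 1))

print(f"Brier {brier:.3f}  log loss {log_loss:.3f}  accuracy {accuracy:.2f}")

# Crude calibration table: predicted vs observed frequency per decile bucket.
buckets = np.floor(probs * 10).clip(0, 9).astype(int)
for b in np.unique(buckets):
    mask = buckets == b
    print(f"bucket {b*10:>2}-{(b+1)*10}%: predicted {probs[mask].mean():.2f}, "
          f"observed {outcomes[mask].mean():.2f}, n={mask.sum()}")
```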
Gotchas
- Small samples give noisy Brier estimates. A Brier of 0.10 on N=20 means very little; the same number on N=500 is load-bearing. Always report the N, and ideally a bootstrap interval (see the sketch after this list).
- Class imbalance hides bad calibration. On a 95/5 class split, always predicting “95%” gives Brier 0.0475 — looks great but the model knows nothing about individual events. The majority-class baseline catches this.
- Brier doesn’t tell you WHY a forecaster is wrong. Is it overconfident? Under-confident? Systematically biased toward “yes”? For that you need a calibration plot.
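For the small-sample gotcha, a percentile bootstrap over the per-event squared errors is a cheap way to see how wide the uncertainty really is. An illustrative sketch, not repo code:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_brier_ci(probs, outcomes, n_boot=10_000, alpha=0.05):
    """Percentile bootstrap interval for the Brier score over resampled events."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    idx = rng.integers(0, len(probs), size=(n_boot, len(probs)))  # resample events with replacement
    scores = np.mean((probs[idx] - outcomes[idx]) ** 2, axis=1)
    return np.quantile(scores, [alpha / 2, 1 - alpha / 2])

# Toy calibrated forecaster on N=20 events with a long-run Brier near 0.10.
probs = rng.uniform(0.80, 0.95, size=20)
outcomes = (rng.random(20) < probs).astype(float)
print(bootstrap_brier_ci(probs, outcomes))  # the interval is typically wide on N=20
```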
Related
- ../2026-04-10-gemchange-simulate-like-quant-desk — introduces Brier score as the discipline gate for prediction-market strategies
- ../../01-projects/automated-investing/experiments/pm1-polymarket-baseline — our first use of Brier to measure Polymarket calibration
- ../../01-projects/automated-investing/autoinv/metrics — autoinv.metrics.brier_score implementation