Level 2 — Statistics Drills
Three drills from the quant-from-scratch roadmap Level 2 homework. Where Level 1 was vibe-check math, Level 2 is where the “most of what looks like signal is noise” lesson lands. Each drill is built specifically to expose a place where naive statistical intuition gets burned on real market data.
Drill 1 — Normality test and Student-t MLE on SPY returns
Script: ../scripts/level_2_normality_and_t_fit.py
Setup: 1,568 days of SPY daily returns (2020-01-02 → 2026-03-31, auto-adjusted). Test normality with Jarque-Bera, D’Agostino-Pearson, and Shapiro-Wilk. Fit both a Normal and a Student-t distribution via MLE, compare log-likelihoods, and run a likelihood ratio test.
Results:
Mean return: 0.000584
Std dev: 0.012932
Skewness: -0.2524
Kurtosis: 13.0531 (normal = 0, excess)
Normality tests:
Jarque-Bera: stat=11148.39, p=0.00e+00
D'Agostino-Pearson: stat=344.54, p=1.53e-75
Shapiro-Wilk: stat=0.8752, p=1.11e-33
MLE fits:
Normal: scale=0.012932, log-lik=4592.81
Student-t: df=2.91, scale=0.007603, log-lik=4851.12
LR test: stat=516.62, p=0.00e+00
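The three normality tests map directly onto scipy one-liners. A minimal sketch on synthetic heavy-tailed data standing in for the SPY series (the real script runs the same calls on the downloaded returns):

```python
import numpy as np
from scipy import stats

# Synthetic heavy-tailed "returns" standing in for the SPY series
rng = np.random.default_rng(0)
returns = stats.t.rvs(df=3, scale=0.008, size=1500, random_state=rng)

jb_stat, jb_p = stats.jarque_bera(returns)   # skewness + kurtosis based
dp_stat, dp_p = stats.normaltest(returns)    # D'Agostino-Pearson omnibus
sw_stat, sw_p = stats.shapiro(returns)       # Shapiro-Wilk

# On data this heavy-tailed, all three reject normality decisively
```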
Interpretation:
- Excess kurtosis of 13.05 is extreme. Normal distribution has excess kurtosis of 0. This number is driven by the COVID crash tail in March 2020 — the dataset starts in January 2020 specifically to include that shock, which is the right call for stress-testing a model.
- All three normality tests reject with p-values that are effectively zero. There’s no ambiguity.
- The MLE fit of Student-t gives df = 2.91, which is a very heavy-tailed distribution (lower df = heavier tails; at df ≤ 2 the variance is infinite). This is on the heavy side because of the COVID period in the sample.
- Likelihood ratio test: adding the df parameter improves log-likelihood by 258 points. With only one extra parameter, any LR statistic above ~3.84 (the 5% χ² critical value with one degree of freedom) is significant. Ours is 516.62.
- The Student-t beats the Normal decisively — a p-value this small leaves no doubt at any conventional significance level.
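The Normal-vs-t comparison is a textbook nested-model LR test, since the Normal is the df → ∞ limit of the Student-t. A sketch of the mechanics on synthetic data (variable names here are illustrative, not from the script):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
returns = stats.t.rvs(df=3, scale=0.008, size=1500, random_state=rng)

# Normal MLE has a closed form: sample mean and (biased) std
mu, sigma = returns.mean(), returns.std()
ll_norm = stats.norm.logpdf(returns, loc=mu, scale=sigma).sum()

# Student-t MLE via scipy's generic numerical fitter (df, loc, scale)
df_hat, loc_hat, scale_hat = stats.t.fit(returns)
ll_t = stats.t.logpdf(returns, df_hat, loc=loc_hat, scale=scale_hat).sum()

# Models are nested, so the LR statistic is ~ chi2 with 1 degree of
# freedom (the one extra parameter: df)
lr = 2.0 * (ll_t - ll_norm)
p_value = stats.chi2.sf(lr, df=1)
```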
Why this matters for trading: if you assume returns are Gaussian and compute position sizes (or VaR, or option prices) under that assumption, you are systematically underestimating tail risk by orders of magnitude. Black Monday (1987) was a ~22 sigma event under Gaussian assumptions — something that should happen far less than once in the age of the universe. It happened. This is why full-Kelly betting and default OLS standard errors kill accounts.
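To put a number on the Black Monday claim, a back-of-envelope check (not from the script): under a Gaussian, the one-sided probability of a daily move 22 standard deviations below the mean, and the implied waiting time in years if every trading day were an independent draw:

```python
from scipy import stats

# One-sided Gaussian tail probability of a -22 sigma daily move
p_tail = stats.norm.sf(22)

# Expected waiting time in years, assuming 252 iid trading days per year
years_to_wait = 1.0 / (p_tail * 252)
```

The waiting time dwarfs the ~1.4e10-year age of the universe by dozens of orders of magnitude, which is the point.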
Plot: outputs/level_2_normality_and_t_fit.png — SPY return histogram (log-y) with Normal and Student-t density overlays. The tail difference is visually obvious.
Drill 2 — Fama-French 3-factor regression on AAPL with Newey-West SEs
Script: ../scripts/level_2_fama_french_regression.py
Setup: AAPL daily returns 2020-01-03 → 2026-02-27 (1,546 obs). Regress daily excess returns on the three Fama-French factors (Mkt-RF, SMB, HML) pulled directly from Ken French’s data library. Compare OLS default SEs against Newey-West HAC SEs with lag=5 to show why you need HAC on financial data.
Note on data source: pandas-datareader has a known incompatibility with recent pandas versions (it depends on the deprecate_kwarg helper that newer pandas removed), so I fetch the Ken French CSV zip directly from Dartmouth and parse it. That's the fallback documented in the script.
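The fetch-and-parse fallback boils down to reading the factors CSV with pandas. A sketch of just the parsing step on an inline snippet in the same layout — the values below are made up, and the real file wraps the data block in header/footer text that the script has to skip:

```python
import io
import pandas as pd

# Illustrative rows in the Ken French daily-factors layout (percent units)
sample = """\
,Mkt-RF,SMB,HML,RF
20200102,0.85,-0.42,-0.10,0.006
20200103,-0.67,0.11,0.35,0.006
"""

ff = pd.read_csv(io.StringIO(sample), index_col=0)
ff.index = pd.to_datetime(ff.index.astype(str), format="%Y%m%d")
ff = ff / 100.0  # percent -> decimal daily returns
```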
Results (Newey-West HAC SEs — the correct ones):
alpha: 0.031542 %/day (p=0.3187)
Mkt-RF: 1.1734 (p~0)
SMB: -0.3361 (p=9.19e-12) → large-cap tilt
HML: -0.3235 (p=3.27e-28) → growth tilt
R²: 0.6416
Annualized alpha: 7.95%
Interpretation:
- Market beta of 1.17 — AAPL is more volatile than the market, consistent with intuition for a mega-cap tech stock
- SMB loading of -0.34 — large-cap tilt, which of course AAPL has (market cap > $3T)
- HML loading of -0.32 — growth tilt, which of course AAPL has (low book-to-market ratio)
- R² of 0.64 — these three factors alone explain 64% of AAPL’s daily variance
- Alpha of 7.95% annualized looks great — until you notice the p-value of 0.32, which means it’s not statistically distinguishable from zero
- The punchline: any apparent edge in owning AAPL is just factor exposure. You’re being compensated for taking market beta, large-cap growth risk, and (implicitly) momentum exposure. You are not getting alpha. If you pitched an AAPL-only “strategy” to an investor, they would price the factor exposure and pay you zero for the alpha.
- NW vs OLS SEs: in this particular dataset, the NW SE on alpha was only 3.8% wider than OLS. That’s small — smaller than I expected — probably because daily AAPL returns are less autocorrelated than, say, illiquid small-caps or fixed-income. The article’s rule still stands: always use HAC SEs for financial data. When they matter, they matter a lot.
This is the “your edge is factor exposure” lesson in action. Once you regress out market, size, and value factors, “alpha” disappears. The quants who actually make money are the ones whose alpha survives a multi-factor regression — and even then, they have to worry about momentum, quality, low-vol, liquidity, and a dozen other known factors.
Drill 3 — Permutation test on a synthetic momentum strategy
Script: ../scripts/level_2_permutation_test.py
Setup: Test a toy “buy after up day” strategy on SPY: if yesterday’s return was positive, hold SPY today; otherwise sit in cash. Compare the observed strategy’s mean daily return against 10,000 random permutations of the entry signal.
Results:
Observations: 1568
Days in market: 858 (54.7%)
Strategy mean return: 0.000063/day (1.60% annualized)
Buy-and-hold mean: 0.000584/day (14.72% annualized)
Permutation test (10,000 shuffles):
Permuted mean of means: 0.000322
Strategy percentile: 5.7%
One-sided p-value: 0.9431
Verdict: FAIL — strategy is WORSE than 94% of random entry signals.
Interpretation: this turned out to be an even better teaching moment than expected. The strategy doesn't just fail to beat buy-and-hold — it actively underperforms random entry timing. The observed strategy sits at the 5.7th percentile of the random shuffles, meaning 94.3% of random entry signals produced higher returns than the “buy after up day” logic.
Why this happens: short-term reversal is a well-documented anomaly. On very short horizons (one day), up days tend to be followed by smaller or negative returns on average. A trader who naively thinks “momentum = buy after up days” is fighting the empirical short-term mean reversion. The permutation test catches this cleanly — no prior factor model needed, no normality assumption, just shuffle the signal and count.
The lesson for us: the permutation test is the fastest BS detector we have. No assumptions about distribution, no need to trust standard errors. It’s slow (10,000 passes over the data) but conceptually bulletproof. Every strategy we test in this project will get a permutation test as a final gate.
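The permutation machinery is a few lines of numpy. A sketch on synthetic returns standing in for SPY (the script does the same with the real series and 10,000 shuffles):

```python
import numpy as np

rng = np.random.default_rng(3)
ret = rng.normal(0.0005, 0.012, size=1500)  # stand-in for SPY daily returns

# Toy signal: hold today iff yesterday's return was positive
signal = np.zeros_like(ret)
signal[1:] = (ret[:-1] > 0).astype(float)
observed_mean = (signal * ret).mean()

# Shuffle the signal, keep the return series fixed, recompute the mean
n_perm = 2000
perm_means = np.array([(rng.permutation(signal) * ret).mean()
                       for _ in range(n_perm)])

# One-sided p-value: how often a random entry does at least this well
p_value = (perm_means >= observed_mean).mean()
```

Shuffling the signal rather than the returns is the key design choice: it preserves the return distribution exactly and only destroys the timing, which is the thing being tested.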
Plot: outputs/level_2_permutation_test.png — histogram of permuted strategy means with the observed strategy and buy-and-hold both marked.
Cross-drill observations
- Fat tails are real and they’re not a minor correction. A Student-t fit with df < 3 on daily SPY is a giant signal that Gaussian assumptions anywhere downstream (position sizing, VaR, option pricing, backtesting confidence intervals) will be systematically wrong on the tails.
- Apparent alpha is usually factor exposure. Even on a mega-cap stock as obviously “good” as AAPL, a three-factor regression strips out 64% of the variance and leaves alpha statistically indistinguishable from zero. Our first hundred strategy ideas will all look like this.
- Permutation tests are our friend. They’re distribution-free, assumption-light, and they kill bad strategies dead. The fact that a “buy after up day” strategy failed the permutation test is exactly the kind of result that saves you from deploying a bad idea with real money.
- Multiple comparisons is coming. We haven’t run into it yet — we tested one strategy, once. But as soon as we start sweeping parameter grids or trying different signals, Bonferroni / Benjamini-Hochberg become essential. That’s on the list for Level 3 or early Level 4.
Gate check → proceed to Level 3?
Yes. L2 drills all pass:
- Normality test correctly rejects Gaussian for real returns ✓
- Student-t MLE significantly improves log-likelihood ✓
- Fama-French regression runs cleanly with Newey-West SEs ✓
- Permutation test correctly rejects a bad strategy ✓
- All plots render, all three scripts reproducible ✓
Next action (Level 3): linear algebra. S&P 500 PCA to reproduce the “5 eigenvectors explain ~70% of variance” result, and a Markowitz mean-variance optimizer from scratch with cvxpy. This is where we add cvxpy and scikit-learn to the requirements.