Level 2b — Time-Series CV and the Halls-Moore Gotcha
This is a bonus drill triggered by a deep-read of Halls-Moore's *Successful Algorithmic Trading*, Chapter 15. In that chapter the book demonstrates k-fold cross-validation with `shuffle=True` on SPY time-series data, then explicitly flags the problem in an italicized aside on page 275: shuffling breaks temporal ordering, which lets the model peek into the future and inflates reported accuracy.
The claim is intuitive but worth proving on our own data. If the book is right, we should see shuffled k-fold produce a more optimistic cross-validation score than TimeSeriesSplit (walk-forward) on the same model, same features, same data.
Setup
Script: ../scripts/level_2b_time_series_cv.py
- Universe: SPY daily returns, 2020-01-02 → 2026-03-31, 1,563 observations
- Features: 5 lagged daily returns (`lag1` through `lag5`)
- Target: sign of the current day's return (1 = up, 0 = down)
- Model: logistic regression with `StandardScaler` preprocessing, wrapped in a sklearn `Pipeline`
- CV schemes:
  - Shuffled k-fold (WRONG): `KFold(n_splits=10, shuffle=True, random_state=42)`
  - TimeSeriesSplit (CORRECT): `TimeSeriesSplit(n_splits=10)` — each fold's train set strictly precedes the test set (walk-forward)
- Both use `cross_val_score` with `scoring="accuracy"`
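The setup above can be sketched in a few lines. This is a minimal reconstruction, not the actual `level_2b_time_series_cv.py` script: it substitutes synthetic Gaussian returns for the real SPY download, so the printed scores will not match the reported results.

```python
# Minimal sketch of the drill's setup, using synthetic returns as a
# stand-in for SPY daily data (assumption: the real script downloads SPY).
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(42)
returns = pd.Series(rng.normal(0.0005, 0.01, 1563))  # placeholder daily returns

# Features: 5 lagged returns; target: sign of the current day's return
X = pd.concat({f"lag{k}": returns.shift(k) for k in range(1, 6)}, axis=1).dropna()
y = (returns.loc[X.index] > 0).astype(int)

model = Pipeline([("scale", StandardScaler()),
                  ("clf", LogisticRegression(max_iter=1000))])

shuffled = cross_val_score(model, X, y, scoring="accuracy",
                           cv=KFold(n_splits=10, shuffle=True, random_state=42))
walkfwd = cross_val_score(model, X, y, scoring="accuracy",
                          cv=TimeSeriesSplit(n_splits=10))

print(f"shuffled k-fold mean accuracy: {shuffled.mean():.4f}")
print(f"walk-forward   mean accuracy: {walkfwd.mean():.4f}")
```

Note that `TimeSeriesSplit` takes no `shuffle` argument at all: the splitter is walk-forward by construction.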
Results
| CV scheme                 | Mean   | Std    | Min    | Max    |
|---------------------------|--------|--------|--------|--------|
| Shuffled k-fold (WRONG)   | 0.5527 | 0.0346 | 0.5064 | 0.6178 |
| TimeSeriesSplit (CORRECT) | 0.5437 | 0.0379 | 0.4577 | 0.5915 |
Inflation from shuffling: +0.0091 accuracy points
Majority-class baseline: 0.5477
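The majority-class baseline is just the accuracy of always predicting the more common class. A tiny sketch with illustrative label counts chosen to match the 54.77% up-day rate in this sample:

```python
# Majority-class baseline: accuracy of always predicting the commoner class.
# The label counts below are illustrative, picked to give a 54.77% up rate.
import numpy as np

y = np.array([1] * 5477 + [0] * 4523)
baseline = max(y.mean(), 1 - y.mean())
print(f"majority-class baseline accuracy: {baseline:.4f}")  # 0.5477
```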
Interpretation
The gotcha is real. The shuffled approach reports 55.27% mean accuracy; the walk-forward approach reports 54.37%, a 91-basis-point inflation purely from shuffling. Every fold's score shifts, so this isn't one outlier fold; it's systematic.
But the more important insight comes from comparing to the majority-class baseline. The class balance is 54.77% (SPY goes up 54.77% of days in this sample). If you blindly predicted “up” every day, you’d be right 54.77% of the time.
- Shuffled k-fold: 55.27% → lift of +0.50% over majority baseline (looks like a tiny edge)
- TimeSeriesSplit: 54.37% → lift of -0.40% over majority baseline (worse than “always predict up”)
The shuffled approach made a worthless model look like it had a tiny edge. The correct approach revealed the model has negative lift. If we deployed this strategy based on shuffled CV, we’d think we had a profitable (if marginal) signal. Under the proper evaluation, the model is actively harmful relative to the dumbest possible strategy.
This is exactly the failure mode Halls-Moore’s aside warned about. On our data, with our model, the gotcha changes the verdict from “marginal edge” to “no edge at all.”
It’s also consistent with the broader literature: lagged daily returns on a liquid index are widely known to be near-zero predictive for next-day direction. The efficient-market hypothesis isn’t perfect, but daily-horizon directional prediction from lagged returns is where it basically holds.
Why this matters beyond this one drill
- Every future backtest we run needs time-series-aware CV. Not shuffled k-fold, not a plain train/test split with `random_state`. Use `TimeSeriesSplit`, walk-forward validation, or purged k-fold (López de Prado's approach). sklearn's shuffling defaults (`train_test_split` shuffles unless told otherwise) are a trap for financial time series.
- This is a rule for the `autoinv` package. I'll wire `TimeSeriesSplit` as the default CV in any utility we build, and I'll refuse to expose a `shuffle=True` option on financial data. If we ever want to measure the inflation explicitly, we can use the script from this drill as the reference.
- Always compare to the majority-class baseline. The absolute accuracy number (55.27% looked like something!) is meaningless without knowing what a trivial model would have scored. Majority baseline is the first sanity check; random permutation (the L2 permutation test) is the second.
- This generalizes to regression problems too. The same logic applies to predicting returns instead of direction, and the same `TimeSeriesSplit` object works.
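To make the regression point concrete, here is a minimal sketch reusing the same splitter with a continuous target. The data is synthetic and the `Ridge` model is an illustrative choice, not something from this drill:

```python
# Same TimeSeriesSplit, regression flavor: predict the return itself
# rather than its sign. Synthetic data; Ridge is an illustrative model.
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(0)
returns = pd.Series(rng.normal(0.0005, 0.01, 1000))
X = pd.concat({f"lag{k}": returns.shift(k) for k in range(1, 6)}, axis=1).dropna()
y = returns.loc[X.index]  # continuous target: the current day's return

scores = cross_val_score(Ridge(), X, y, cv=TimeSeriesSplit(n_splits=10),
                         scoring="neg_mean_squared_error")
print(f"mean walk-forward MSE: {-scores.mean():.6f}")
```

The only changes from the classification version are the target and the scoring metric; the walk-forward splitting logic is identical.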
Cross-links
- Applies the lesson from: Halls-Moore reference doc — Chapter 15 gotcha on p. 275
- Complements: Level 2 writeup — the permutation test is the out-of-sample check; time-series CV is the cross-validation check; both belong in our toolkit
- Feeds: Infrastructure plan — `TimeSeriesSplit` as the default CV scheme in the `autoinv` consolidation package
Plot
outputs/level_2b_time_series_cv.png — fold-by-fold bar chart comparing the two CV schemes against the majority-class baseline. Visually obvious that shuffled scores cluster higher than walk-forward scores across most folds.