Level 2b — Time-Series CV and the Halls-Moore Gotcha
This is a bonus drill triggered by a deep-read of Halls-Moore's *Successful Algorithmic Trading*, Chapter 15. In that chapter the book demonstrates k-fold cross-validation with `shuffle=True` on SPY time-series data, then explicitly flags the problem in an italicized aside on page 275: shuffling breaks temporal ordering, which lets the model peek into the future and inflates reported accuracy.
The claim is intuitive but worth proving on our own data. If the book is right, we should see shuffled k-fold produce a more optimistic cross-validation score than TimeSeriesSplit (walk-forward) on the same model, same features, same data.
Setup
Script: ../scripts/level_2b_time_series_cv.py
- Universe: SPY daily returns, 2020-01-02 → 2026-03-31, 1,563 observations
- Features: 5 lagged daily returns (`lag1` through `lag5`)
- Target: sign of the current day's return (1 = up, 0 = down)
- Model: logistic regression with `StandardScaler` preprocessing, wrapped in a sklearn `Pipeline`
- CV schemes:
  - Shuffled k-fold (WRONG): `KFold(n_splits=10, shuffle=True, random_state=42)`
  - TimeSeriesSplit (CORRECT): `TimeSeriesSplit(n_splits=10)` — each fold's train set strictly precedes the test set (walk-forward)
- Both use `cross_val_score` with `scoring="accuracy"`
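The setup above can be sketched in a few lines. This is a minimal reconstruction, not the actual `level_2b_time_series_cv.py` script: it substitutes synthetic Gaussian returns for the real SPY download, so the printed scores will not match the reported results.

```python
# Minimal sketch of the drill's setup, using synthetic returns as a
# stand-in for SPY daily data (assumption: the real script downloads SPY).
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(42)
returns = pd.Series(rng.normal(0.0005, 0.01, 1563))  # placeholder daily returns

# Features: 5 lagged returns; target: sign of the current day's return
X = pd.concat({f"lag{k}": returns.shift(k) for k in range(1, 6)}, axis=1).dropna()
y = (returns.loc[X.index] > 0).astype(int)

model = Pipeline([("scale", StandardScaler()),
                  ("clf", LogisticRegression(max_iter=1000))])

shuffled = cross_val_score(model, X, y, scoring="accuracy",
                           cv=KFold(n_splits=10, shuffle=True, random_state=42))
walkfwd = cross_val_score(model, X, y, scoring="accuracy",
                          cv=TimeSeriesSplit(n_splits=10))

print(f"shuffled k-fold mean accuracy: {shuffled.mean():.4f}")
print(f"walk-forward   mean accuracy: {walkfwd.mean():.4f}")
```

Note that `TimeSeriesSplit` takes no `shuffle` argument at all: the splitter is walk-forward by construction.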
Results
| CV scheme                 | Mean   | Std    | Min    | Max    |
|---------------------------|--------|--------|--------|--------|
| Shuffled k-fold (WRONG)   | 0.5527 | 0.0346 | 0.5064 | 0.6178 |
| TimeSeriesSplit (CORRECT) | 0.5437 | 0.0379 | 0.4577 | 0.5915 |
Inflation from shuffling: +0.0091 accuracy points
Majority-class baseline: 0.5477
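The majority-class baseline is just the accuracy of always predicting the more common class. A tiny sketch with illustrative label counts chosen to match the 54.77% up-day rate in this sample:

```python
# Majority-class baseline: accuracy of always predicting the commoner class.
# The label counts below are illustrative, picked to give a 54.77% up rate.
import numpy as np

y = np.array([1] * 5477 + [0] * 4523)
baseline = max(y.mean(), 1 - y.mean())
print(f"majority-class baseline accuracy: {baseline:.4f}")  # 0.5477
```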
Interpretation
The gotcha is real. The shuffled approach reports 55.27% mean accuracy; the walk-forward approach reports 54.37%, a 91-basis-point inflation purely from shuffling. Every fold's score shifts, so this isn't one outlier fold; it's systematic.
But the more important insight comes from comparing to the majority-class baseline. The class balance is 54.77% (SPY goes up 54.77% of days in this sample). If you blindly predicted “up” every day, you’d be right 54.77% of the time.
- Shuffled k-fold: 55.27% → lift of +0.50% over majority baseline (looks like a tiny edge)
- TimeSeriesSplit: 54.37% → lift of -0.40% over majority baseline (worse than “always predict up”)
The shuffled approach made a worthless model look like it had a tiny edge. The correct approach revealed the model has negative lift. If we deployed this strategy based on shuffled CV, we’d think we had a profitable (if marginal) signal. Under the proper evaluation, the model is actively harmful relative to the dumbest possible strategy.
This is exactly the failure mode Halls-Moore’s aside warned about. On our data, with our model, the gotcha changes the verdict from “marginal edge” to “no edge at all.”
It’s also consistent with the broader literature: lagged daily returns on a liquid index are widely known to be near-zero predictive for next-day direction. The efficient-market hypothesis isn’t perfect, but daily-horizon directional prediction from lagged returns is where it basically holds.
Why this matters beyond this one drill
- Every future backtest we run needs time-series-aware CV. Not shuffled k-fold, not a plain train/test split with `random_state`. Use `TimeSeriesSplit`, walk-forward validation, or purged k-fold (López de Prado's approach). sklearn's shuffling defaults (`train_test_split` shuffles unless told otherwise) are a trap for financial time series.
- This is a rule for the `autoinv` package. I'll wire `TimeSeriesSplit` as the default CV in any utility we build, and I'll refuse to expose a `shuffle=True` option on financial data. If we ever want to measure the inflation explicitly, we can use the script from this drill as the reference.
- Always compare to the majority-class baseline. The absolute accuracy number (55.27% looked like something!) is meaningless without knowing what a trivial model would have scored. Majority baseline is the first sanity check; random permutation (the L2 permutation test) is the second.
- This generalizes to regression problems too. The same logic applies to predicting returns instead of direction, and the same `TimeSeriesSplit` object works.
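To make the regression point concrete, here is a minimal sketch reusing the same splitter with a continuous target. The data is synthetic and the `Ridge` model is an illustrative choice, not something from this drill:

```python
# Same TimeSeriesSplit, regression flavor: predict the return itself
# rather than its sign. Synthetic data; Ridge is an illustrative model.
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(0)
returns = pd.Series(rng.normal(0.0005, 0.01, 1000))
X = pd.concat({f"lag{k}": returns.shift(k) for k in range(1, 6)}, axis=1).dropna()
y = returns.loc[X.index]  # continuous target: the current day's return

scores = cross_val_score(Ridge(), X, y, cv=TimeSeriesSplit(n_splits=10),
                         scoring="neg_mean_squared_error")
print(f"mean walk-forward MSE: {-scores.mean():.6f}")
```

The only changes from the classification version are the target and the scoring metric; the walk-forward splitting logic is identical.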
Cross-links
- Applies the lesson from: Halls-Moore reference doc — Chapter 15 gotcha on p. 275
- Complements: Level 2 writeup — the permutation test is the out-of-sample check; time-series CV is the cross-validation check; both belong in our toolkit
- Feeds: Infrastructure plan — `TimeSeriesSplit` as the default CV scheme in the `autoinv` consolidation package
Plot
outputs/level_2b_time_series_cv.png — fold-by-fold bar chart comparing the two CV schemes against the majority-class baseline. Visually obvious that shuffled scores cluster higher than walk-forward scores across most folds.