Level 3 — Linear Algebra Drills

Two drills from the quant-from-scratch roadmap Level 3 homework. The machinery that underpins portfolio construction, factor models, and every neural network.

Drill 1 — PCA on S&P 500 returns

Script: ../scripts/level_3_sp500_pca.py

Setup: Pull daily returns for a universe of large-cap US stocks, 2022-01-04 → 2026-03-31. Standardize each column (mean 0, unit variance). Compute the correlation matrix, eigendecompose it, and look at the eigenvalue spectrum.

Universe: the script tries to fetch the current S&P 500 list from Wikipedia but Wikipedia returned a 403 (bot detection), so it fell back to a hard-coded list of 100 large-caps. After dropping tickers with insufficient history, 99 stocks made the final cut. We should revisit this when/if we need the full 500 — options: pandas_datareader (broken), wikipedia-api package, or just a static CSV we check into the repo.

Results:

Top 10 principal components:
  PC    Eigenvalue    Var expl    Cumulative
   1       30.0055     30.31%        30.31%
   2        8.5081      8.59%        38.90%
   3        4.3682      4.41%        43.31%
   4        2.9405      2.97%        46.29%
   5        2.3480      2.37%        48.66%
  ...
  20        ...                      68.34%
  22        ...                     ~70%

PC1 alone:      30.31%  ('the market factor')
PC1-5 total:    48.66%
PC1-20 total:   68.34%
PCs needed to reach 70%: 22

Fraction of PC1 loadings with same sign: 100.00%

PC1 is the market factor — confirmed. 100% of the PC1 loadings have the same sign, which means this principal component is “everything moves together.” That’s the single most important eigenvector and it explains ~30% of all variance across the universe on its own. This is the core insight the article is pointing at: there’s one dominant mode in equity returns, and it’s just “beta.”

But the article’s “5 eigenvectors explain ~70%” claim did NOT reproduce on this sample. I needed 22 PCs to cross 70%, not 5. A few possible reasons:

Time period. 2022–2026 spans post-COVID rate hike cycle, tech / old-economy divergence, and a period of unusually high sector dispersion. When the market is less “all rowing together,” you need more components to explain the variance.
Universe size. 99 tickers vs 500. More stocks means more idiosyncratic variance, but it cuts both ways — with only 99 large-caps you’re already close to the market factor and should need fewer components, not more. So this probably isn’t it.
Covariance vs correlation PCA. I used the correlation matrix (standardized returns). The article might have used raw covariance, which over-weights high-volatility stocks and typically gives a larger first eigenvalue. Worth re-running on the covariance matrix as a sanity check.
Rolling window effects. The article is almost certainly quoting a rough rule of thumb, not a specific measurement. “Five eigenvectors explain ~70%” is the kind of number that gets rounded for pedagogical clarity.

My read: the article’s number is directionally correct but imprecise. The important qualitative result — PC1 is the market factor, a small number of components capture the dominant modes, the rest is idiosyncratic noise — holds exactly as advertised. The quantitative “5 components = 70%” claim doesn’t hold on a 99-stock 2022-2026 sample. I’d call this a validated teaching moment about the difference between textbook heuristics and live data.

Plot: outputs/level_3_sp500_pca.png — scree plot + cumulative variance curve, with the 70% threshold marked.

TODO for later: re-run on the full S&P 500, on a longer time period (say 2010-2026), and on the covariance matrix instead of correlation, to see which of the three factors above is driving the discrepancy. That’s a deeper L3 exercise once we’re wired up for bulk ticker pulls.

Drill 2 — Markowitz mean-variance optimizer with cvxpy

Script: ../scripts/level_3_markowitz_optimizer.py

Setup: 10-asset universe (AAPL, MSFT, NVDA, GOOGL, AMZN, JPM, XOM, JNJ, PG, KO) pulled via yfinance 2022–2026. Compute sample mean returns and sample covariance, annualize, and solve three problems with cvxpy: (a) unconstrained minimum variance, (b) efficient frontier sweep across minimum-return targets, (c) max-Sharpe portfolio on the frontier with rf=3%.

Constraints: long/short bounded between -10% and +40% per position, weights sum to 1.

Results:

Annualized sample mean returns:
  AAPL   12.38%      GOOGL  12.78%    XOM     6.92%
  MSFT   11.62%      AMZN   19.91%    JNJ    55.63% (*)
  NVDA   21.64%      JPM    10.20%    PG      1.34%
                                       KO     30.31% (*)

Minimum-variance portfolio:
  Return: 13.63%, Vol: 12.16%, Sharpe: 0.874
  Largest holdings: GOOGL 29%, JPM 26%, PG 14%, KO 14%, XOM 12%

Max-Sharpe portfolio:
  Return: 33.35%, Vol: 17.87%, Sharpe: 1.698

What this drill illustrates beautifully:

Sample means are garbage expected returns. Look at the JNJ row: 55.63% annualized expected return. JNJ has not compounded at anywhere near 55% over any meaningful period in history. This number is a product of picking a specific start and end date on a specific sample. The Markowitz optimizer took JNJ at its word. This is the article’s “estimation error is the real enemy” lesson showing up immediately: the optimizer trusts whatever noise you feed it.
Max-Sharpe looks amazing and is fake. Sharpe of 1.698 on a vanilla long-only-ish portfolio? That’s hedge-fund territory. It’s also the product of trusting sample means. If we rolled the window forward by six months, the weights and the Sharpe would look completely different. This is the “full Kelly dies from estimation error” lesson.
cvxpy is the right abstraction. The whole optimization loop — constraints, objective, efficient frontier sweep — is about 15 lines. When we want to add transaction cost penalties, cardinality constraints, or sector caps at L4, it’s a drop-in change to the constraints list. No reinventing.
The minimum-variance portfolio is more honest. 12% vol, 13.6% return, Sharpe 0.87 — these are numbers that could plausibly survive out-of-sample. Min-var doesn’t require you to trust expected return estimates because the objective only depends on the covariance matrix (which is far more stable than means).

Plot: outputs/level_3_markowitz_frontier.png — efficient frontier, individual asset positions (with labels), min-var point (green star), max-Sharpe point (red star).

What to do differently when this matters for real money (per the articles):

Use shrinkage estimators (Ledoit-Wolf) instead of sample covariance
Use Black-Litterman posterior means instead of sample means — the prior is something like equilibrium market-cap weights, and the posterior combines that with views
Use fractional Kelly, not full Kelly, on position sizing
Resample the optimizer inputs (block bootstrap) and average the resulting weights
Add turnover / transaction cost penalties
Cap max single-position weight much tighter than 40%

All of the above are cvxpy-friendly drop-ins when we need them at L4 / PM4.

Gate check → proceed to Level 4?

Yes, with one honest asterisk.

Passing:

PCA decomposition runs cleanly, PC1 is identified as the market factor (100% same-sign loadings) ✓
Eigenvalue spectrum is computed and plotted ✓
Markowitz optimizer runs end-to-end via cvxpy ✓
Efficient frontier + min-var + max-Sharpe all computed ✓
Both plots render, both scripts reproducible ✓

Asterisk:

The article’s “5 eigenvectors explain ~70%” quantitative claim did NOT reproduce on our 99-stock 2022-2026 sample (needed 22 PCs). I documented this as a teaching moment rather than a blocker. We should revisit when we pull the full S&P 500 and test on longer history.

Next action (Level 4): calculus and convex optimization. Gradient descent from scratch on the Rosenbrock function, then a portfolio optimization problem with explicit transaction cost constraints via cvxpy. That’s a small step from what we already have.

Automated Investing Infrastructure Plan
Level 2 writeup — sample-mean estimation error preview
Source roadmap article
Simulation guide (PM track)

Level 3 — Linear Algebra Drills

Drill 1 — PCA on S&P 500 returns

Drill 2 — Markowitz mean-variance optimizer with cvxpy

Gate check → proceed to Level 4?

Related