Level 4 — Calculus & Optimization Drills
Two drills from the quant-from-scratch roadmap Level 4 homework. Calculus is the language of change, and every optimizer underneath every model in this project is doing some form of gradient descent. This level is about understanding what the machinery is actually doing so we can reason about when it breaks.
Drill 1 — Gradient descent from scratch on Rosenbrock
Script: ../scripts/level_4_gradient_descent.py
Setup: The Rosenbrock function f(x,y) = (1-x)² + 100(y - x²)² has a known minimum at (1, 1) with f = 0. It’s smooth but has a narrow curved valley, which means naive gradient descent bounces against the walls instead of sliding along the floor. We implement three optimizers by hand — vanilla GD, momentum GD, and Adam — all using the analytical gradient. No torch, no scipy.
Analytical gradient (derived from the chain rule):
∂f/∂x = -2(1-x) - 400x(y - x²)
∂f/∂y = 200(y - x²)
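A minimal NumPy sketch of the function and its analytical gradient (function names here are illustrative, not necessarily the ones in the script):

```python
import numpy as np

def rosenbrock(p):
    """f(x, y) = (1 - x)^2 + 100 (y - x^2)^2, minimum at (1, 1)."""
    x, y = p
    return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2

def rosenbrock_grad(p):
    """Analytical gradient from the chain rule (matches the two formulas above)."""
    x, y = p
    dfdx = -2 * (1 - x) - 400 * x * (y - x ** 2)
    dfdy = 200 * (y - x ** 2)
    return np.array([dfdx, dfdy])
```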
Results (5,000 steps from starting point (-1.5, 2.5)):
Vanilla GD final=(0.91407, 0.83517) f=0.007396 dist_to_min=0.18588
Momentum GD final=(0.98932, 0.97870) f=0.000114 dist_to_min=0.02383
Adam final=(0.99622, 0.99245) f=0.000014 dist_to_min=0.00844
Interpretation:
- Vanilla GD makes it to the valley but can’t navigate the curve well. After 5,000 steps it’s still 0.19 away from the minimum. If we cranked the learning rate higher, it would oscillate and diverge. This is the “bouncing against the valley walls” failure mode that motivates momentum.
- Momentum GD (β=0.9) dampens oscillations by accumulating a velocity vector. Final distance 0.024 — about 8× better than vanilla.
- Adam combines momentum with per-parameter adaptive learning rates. Final distance 0.008 — best by far. This is why Adam is the default in basically every modern deep learning codebase.
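For reference, a minimal sketch of the three update rules, assuming the rosenbrock_grad helper above; the learning rates and Adam hyperparameters are illustrative defaults, not necessarily the values the script uses:

```python
import numpy as np

def vanilla_gd(grad_fn, p0, lr=1e-3, steps=5000):
    p = np.array(p0, dtype=float)
    for _ in range(steps):
        p -= lr * grad_fn(p)                      # step straight down the gradient
    return p

def momentum_gd(grad_fn, p0, lr=1e-3, beta=0.9, steps=5000):
    p = np.array(p0, dtype=float)
    v = np.zeros_like(p)
    for _ in range(steps):
        v = beta * v + grad_fn(p)                 # accumulate velocity, damping oscillations
        p -= lr * v
    return p

def adam(grad_fn, p0, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, steps=5000):
    p = np.array(p0, dtype=float)
    m, v = np.zeros_like(p), np.zeros_like(p)
    for t in range(1, steps + 1):
        g = grad_fn(p)
        m = b1 * m + (1 - b1) * g                 # first moment (momentum term)
        v = b2 * v + (1 - b2) * g ** 2            # second moment (per-parameter scale)
        m_hat = m / (1 - b1 ** t)                 # bias corrections
        v_hat = v / (1 - b2 ** t)
        p -= lr * m_hat / (np.sqrt(v_hat) + eps)  # adaptive per-parameter step
    return p
```

Each is called the same way, e.g. adam(rosenbrock_grad, (-1.5, 2.5), steps=5000).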
The punchline that matters for this project: when an optimizer “converges,” it’s worth asking what that actually means numerically. Vanilla GD reported a function value of 0.0074, which sounds tiny. But it’s 0.19 away from the true minimum in parameter space. In a portfolio optimization context, “close in objective value” and “close in weight space” can be wildly different things, and the difference matters when the weights are being traded for real money.
Plot: outputs/level_4_rosenbrock.png — Rosenbrock contour plot (log-spaced level curves) with all three descent paths overlaid.
Drill 2 — Portfolio optimization with transaction cost constraints
Script: ../scripts/level_4_portfolio_with_costs.py
Setup: Extends the L3 Markowitz optimizer with an L1 transaction cost penalty. Start from an equal-weight prior portfolio w_prev, minimize variance + λ · ||w - w_prev||_1 subject to the usual constraints (sum = 1, long/short bounds, minimum return target of 15%). Sweep λ from 0 (pure Markowitz) to 0.5 (essentially frozen).
Why L1 (not L2) turnover penalty: brokerage commissions, bid-ask spreads, and market impact at moderate trade sizes are approximately linear in trade size, which maps to an L1 norm. An L2 penalty grows quadratically with trade size, so relative to that roughly linear cost it overstates the cost of large rebalances and understates the cost of small ones.
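A minimal cvxpy sketch of the cost-aware problem (the function name, variable names, and the long/short bound values here are assumptions for illustration; Sigma, mu, and the equal-weight prior w_prev come from the data):

```python
import cvxpy as cp

def cost_aware_markowitz(Sigma, mu, w_prev, lam, target=0.15, lb=-0.2, ub=0.4):
    """Minimize variance + lam * L1 turnover, subject to the usual constraints."""
    n = len(mu)
    w = cp.Variable(n)
    variance = cp.quad_form(w, Sigma)      # w' Sigma w
    turnover = cp.norm1(w - w_prev)        # L1 transaction-cost proxy
    constraints = [
        cp.sum(w) == 1,                    # fully invested
        mu @ w >= target,                  # minimum return target (mu: expected-return array)
        w >= lb, w <= ub,                  # long/short bounds (illustrative values)
    ]
    cp.Problem(cp.Minimize(variance + lam * turnover), constraints).solve()
    return w.value
```

The λ sweep is then just a loop over this function, one solve per value in the table below.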
Results (10-asset universe, 2022-2026 sample, target 15% return):
lambda   turnover   return   vol      largest weight change
0.000    87.1%      15.00%   12.19%   GOOGL +20.0%
0.001    81.9%      15.00%   12.20%   GOOGL +19.4%
0.010    51.9%      15.00%   12.81%   GOOGL +16.0%
0.050     7.4%      16.69%   16.35%   JNJ  -3.7%
0.100     0.0%      18.27%   17.58%   MSFT +0.0%
0.500     0.0%      18.27%   17.58%   AAPL +0.0%
Interpretation:
- λ = 0 (no transaction cost): the optimizer rebalances aggressively — 87% of the portfolio weight changes. GOOGL goes from 10% (equal-weight) to 30%. Volatility drops from 17.58% → 12.19%. Looks like a free lunch.
- λ = 0.01: turnover drops from 87% to 52%. Still hits the 15% return target. Vol only slightly higher at 12.81%. This is the sweet spot for a “cheap rebalance” cost regime.
- λ = 0.05: turnover collapses to 7.4%. The optimizer starts giving up on aggressive rebalancing because it’s too expensive. Notice the return JUMPS to 16.69% and vol goes back up to 16.35% — the optimizer is sticking closer to the (high-return, high-vol) equal-weight prior because moving away from it costs more than the variance reduction is worth.
- λ ≥ 0.1: turnover hits 0%. The optimizer freezes at the prior. The return constraint (mu·w ≥ 15%) is still satisfied since equal-weight returns 18.27% on this sample, so there’s no rebalance required to hit the floor.
Why this matters operationally: Naive Markowitz says “rebalance to these new weights.” A cost-aware optimizer says “rebalance partway toward these new weights, stopping when the trade cost exceeds the variance-reduction benefit.” The latter is what a real portfolio manager does. The λ parameter is where you put your estimate of turnover cost in basis points.
Connection to the articles: this is directly the kind of “more complex constraint” mentioned in the L4 homework. The L1 penalty is also what you’d use for the ||w||_1 ≤ k relaxation (LASSO-style), the standard convex relaxation of hard cardinality constraints. We’ll see this again at PM4 when we optimize prediction-market positions with turnover-aware execution.
Plot: outputs/level_4_portfolio_with_costs.png — left panel: turnover vs λ (symlog scale). Right panel: weight bars comparing prior, λ=0 (free rebalance), and λ=0.1 (cost-aware) allocations.
Cross-drill observations
- Both drills are about understanding what the optimizer does when it succeeds vs when it fails. Drill 1 shows that “convergence” is a spectrum — vanilla GD, momentum GD, and Adam all “converge” in the sense of reducing f, but they end up in very different places. Drill 2 shows that the “optimal” answer depends entirely on the cost function you hand the optimizer: change the cost, get a completely different answer.
- Convex optimization is an abstraction layer you can trust exactly as much as you trust your constraints. cvxpy made drill 2 trivial — adding the turnover penalty was one line. But the choice of L1 vs L2 penalty, the λ value, the prior portfolio, and the return target all live outside the solver and are judgment calls. The optimizer is only as smart as the modeler.
- This is where the project starts to look operational rather than academic. Everything in L1-L3 was a vibe check. L4 drill 2 is an actual tool I’d run on a real portfolio tomorrow if we had a prior position and a cost estimate. The gap between “drill” and “production” is narrow here — mostly data plumbing and the shrinkage-estimator upgrade from the L3 notes.
Gate check → proceed to Level 5?
Yes. All L4 drills pass cleanly:
- Gradient descent from scratch runs, all three optimizers descend toward the minimum (momentum and Adam get close; vanilla stalls in the valley) ✓
- Adam beats momentum beats vanilla, exactly as theory predicts ✓
- Portfolio optimizer with L1 turnover penalty runs end-to-end via cvxpy ✓
- λ sweep shows expected monotonic turnover shrinkage ✓
- Both plots render, both scripts reproducible ✓
Next action (Level 5): stochastic calculus. This is the hardest level per the article — 6-8 weeks of study recommended. Drills: Black-Scholes from scratch, Monte Carlo convergence check, and all five Greeks. The key insight to internalize is that (dW_t)² = dt, which is why Itô’s lemma has the second-order term that ordinary calculus drops. Black-Scholes is derived from Itô’s lemma + a delta-hedging argument that cancels the dW terms, so the option price ends up independent of the stock’s drift μ. That’s the mind-bending risk-neutral pricing result.
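For later reference, a sketch of the chain of steps that paragraph describes, in standard Black-Scholes notation (nothing here is implemented yet). With dS = μS dt + σS dW and (dW)² = dt, Itô’s lemma applied to an option price V(S, t) keeps the second-order term that ordinary calculus would drop:

```latex
dV = \left( \frac{\partial V}{\partial t} + \mu S \frac{\partial V}{\partial S}
     + \tfrac{1}{2}\sigma^{2} S^{2} \frac{\partial^{2} V}{\partial S^{2}} \right) dt
     + \sigma S \frac{\partial V}{\partial S}\, dW
```

Holding the delta hedge (short ∂V/∂S shares against the option) cancels the dW term; setting the drift of the hedged portfolio equal to the risk-free rate r gives the Black-Scholes PDE, in which μ no longer appears:

```latex
\frac{\partial V}{\partial t} + \tfrac{1}{2}\sigma^{2} S^{2} \frac{\partial^{2} V}{\partial S^{2}}
     + r S \frac{\partial V}{\partial S} - r V = 0
```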
After L5, we have the math foundation. The next track is the Prediction Markets / simulation guide: Monte Carlo binary contracts, importance sampling for tail events, particle filters for live updating, copulas for correlated portfolios, and the 5-layer production stack. That’s where this project starts to look like a real system.
Related
- Automated Investing Infrastructure Plan
- Level 3 writeup — Markowitz baseline this drill extends
- Source roadmap article
- Simulation guide (PM track, next)