Logistic Regression Gradient & Convexity
Derive and implement logistic regression *correctly* (loss, gradient, Hessian intuition) and explain why the loss is convex.
Students can implement vectorized logistic regression, gradient-check it, and communicate convexity + optimization reasoning at FAANG interview depth.
Progress — 0/8 tasks
Interview Angles
- How do you compute `log(1 + exp(z))` stably?
- Why is Newton fast but expensive?
- What breaks if data is perfectly separable?
FAANG Gotchas
- Most bugs are missing `1/n` constants or shape/broadcast mistakes.
Logistic Regression Gradient & Convexity — Student Lab
Complete all TODOs. This lab is math-first and stability-first.
Section 0 — Synthetic Dataset
We’ll create a binary classification dataset with both separable and non-separable regimes.
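A minimal sketch of such a generator, assuming NumPy; the function name `make_dataset` and the default sizes are my own choices, not part of the lab:

```python
import numpy as np

def make_dataset(n=200, d=2, separable=False, seed=0):
    """Binary classification data with labels in {0, 1}."""
    rng = np.random.default_rng(seed)
    w_true = rng.normal(size=d)
    X = rng.normal(size=(n, d))
    margin = X @ w_true
    if separable:
        y = (margin > 0).astype(float)               # perfectly separable regime
    else:
        p = 1.0 / (1.0 + np.exp(-margin))
        y = (rng.uniform(size=n) < p).astype(float)  # noisy labels near the boundary
    return X, y
```

The `separable=True` branch matters later: Section 4 asks what goes wrong when the data can be split perfectly.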
Section 1 — Sigmoid + Stable Loss
Task 1.1: Stable sigmoid
Explain: Why does sigmoid saturate for large |z| and what does that do to gradients?
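One standard stable construction splits on the sign of `z` so that `exp` only ever sees non-positive arguments; a sketch of what Task 1.1 is after:

```python
import numpy as np

def sigmoid(z):
    """Stable sigmoid: exp() only ever receives non-positive arguments."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    ez = np.exp(z[~pos])            # safe: z < 0 on this branch
    out[~pos] = ez / (1.0 + ez)
    return out
```

For the saturation question: for large |z| the output is pinned near 0 or 1, so `p * (1 - p)` — which appears in the gradient and Hessian — underflows toward 0 and learning stalls.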
Task 1.2: Stable binary cross-entropy loss
We use labels y in {0,1}.
Loss per example: -y log(p) - (1-y) log(1-p) where p=sigmoid(z).
Interview Angle: explain a stable form of log-loss.
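One stable form works directly on logits via a softplus identity, never computing `log(p)` from a possibly-clipped `p`; a sketch (function names are my own):

```python
import numpy as np

def softplus(z):
    """Stable log(1 + exp(z)) = max(z, 0) + log1p(exp(-|z|))."""
    return np.maximum(z, 0.0) + np.log1p(np.exp(-np.abs(z)))

def bce_from_logits(z, y):
    """Mean log-loss from logits, using the identity
    -y*log(p) - (1-y)*log(1-p) = softplus(z) - y*z  for p = sigmoid(z)."""
    return float(np.mean(softplus(z) - y * z))
```

The identity follows from `log(sigmoid(z)) = -softplus(-z)` and `log(1 - sigmoid(z)) = -softplus(z)`, plus `softplus(-z) = softplus(z) - z`.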
Section 2 — Gradient (Derive → Implement → Check)
Task 2.1: Derive the gradient (write in markdown)
Show that for loss averaged over n examples:
grad = X^T (p - y) / n
Checkpoint: What is the shape of grad?
Task 2.2: Implement loss and gradient
FAANG gotcha: ensure y is 0/1, not -1/+1.
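A vectorized sketch of the target implementation (shape conventions are assumptions stated in the docstring):

```python
import numpy as np

def loss_and_grad(w, X, y):
    """Mean log-loss and gradient. Shapes: X (n, d), y (n,) in {0,1}, w (d,)."""
    n = X.shape[0]
    z = X @ w
    # stable sigmoid: both np.where branches only exponentiate -|z|
    p = np.where(z >= 0,
                 1.0 / (1.0 + np.exp(-np.abs(z))),
                 np.exp(-np.abs(z)) / (1.0 + np.exp(-np.abs(z))))
    # stable mean loss: softplus(z) - y*z
    loss = np.mean(np.maximum(z, 0.0) + np.log1p(np.exp(-np.abs(z))) - y * z)
    grad = X.T @ (p - y) / n        # shape (d,) -- don't forget the 1/n
    return float(loss), grad
```

Note `grad` has shape `(d,)`, matching `w`, which answers the Checkpoint in Task 2.1.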
Task 2.3: Numerical gradient check (finite differences)
Compare your analytic gradient to a central-difference numerical gradient.
Checkpoint: Why can eps too small make the check worse?
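A sketch of a central-difference checker (helper name is mine). On the Checkpoint: the central difference has O(eps²) truncation error, but the subtraction `f(w+e) - f(w-e)` loses roughly `eps`-relative precision to floating-point cancellation, so an eps that is too small makes rounding error dominate.

```python
import numpy as np

def numerical_grad(f, w, eps=1e-5):
    """Central-difference gradient of a scalar function f at w."""
    g = np.zeros_like(w, dtype=float)
    for i in range(w.size):
        e = np.zeros_like(w, dtype=float)
        e[i] = eps
        g[i] = (f(w + e) - f(w - e)) / (2.0 * eps)  # O(eps^2) truncation error
    return g
```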
Section 3 — Convexity & Hessian Intuition
Task 3.1: Hessian-vector product (HVP)
For logistic regression: H = (1/n) X^T S X where S = diag(p(1-p)).
Implement HVP: compute H@v without building full H explicitly.
- s = p * (1 - p)
- compute X @ v, multiply elementwise by s, then apply X^T
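The steps above can be sketched as (function name is mine; the plain `np.exp` sigmoid is acceptable here for modest `z`):

```python
import numpy as np

def hvp(w, X, v):
    """H @ v for H = (1/n) X^T S X with S = diag(p*(1-p)), without forming H."""
    n = X.shape[0]
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    s = p * (1.0 - p)                 # per-example curvature weights, shape (n,)
    return X.T @ (s * (X @ v)) / n    # (n,) elementwise, then (d, n) @ (n,)
```

This costs two matrix-vector products, O(nd), versus O(nd²) to build H explicitly.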
Task 3.2: Empirical PSD check
Check that v^T H v >= 0 for random v (PSD).
Explain: Why does PSD imply convexity?
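A sketch of the empirical check (the HVP helper is repeated here so the snippet stands alone). For the Explain prompt: v^T H v = (1/n)·||S^{1/2} X v||² ≥ 0 for every v, and a loss whose Hessian is PSD at every w is convex.

```python
import numpy as np

def hvp(w, X, v):
    """H @ v with H = (1/n) X^T diag(p(1-p)) X, computed matrix-free."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ ((p * (1.0 - p)) * (X @ v)) / X.shape[0]

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
w = rng.normal(size=4)
quad = [v @ hvp(w, X, v) for v in rng.normal(size=(200, 4))]
all_psd = all(q >= -1e-12 for q in quad)   # allow tiny float noise
```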
Section 4 — Optimization (Bonus)
Task 4.1: One step of GD vs Newton (conceptual)
Implement one gradient descent step. (Newton step is optional.)
FAANG gotcha: perfectly separable data can push weights to infinity; explain why.
Section 4 — Notation Reference
Gradient descent update rule:
w' = w - η ∇L(w)
Where:
- w = current weights
- w' = updated weights after one optimization step
- η = learning rate (step size)
- ∇L(w) = gradient of the loss with respect to w
Newton (Hessian) step notation:
w' = w - H(w)^{-1} ∇L(w)
Where:
- H(w) = ∇²L(w) = Hessian matrix (second derivatives)
- H(w)^{-1} ∇L(w) = the gradient rescaled by local curvature
Interpretation: gradient descent uses only slope; Newton step uses slope + curvature for a more curvature-aware update.
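A sketch contrasting the two updates (helper names are mine; the `ridge` term in the Newton step is an assumed numerical guard, relevant on separable data where H can become near-singular as weights grow):

```python
import numpy as np

def gd_step(w, X, y, lr=0.1):
    """One gradient-descent step: slope only, O(n*d) per step."""
    n = X.shape[0]
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return w - lr * (X.T @ (p - y) / n)

def newton_step(w, X, y, ridge=1e-8):
    """One Newton step: O(n*d^2) to build H plus O(d^3) to solve --
    few iterations needed, but each one is expensive."""
    n = X.shape[0]
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    g = X.T @ (p - y) / n
    s = p * (1.0 - p)
    H = X.T @ (s[:, None] * X) / n + ridge * np.eye(w.size)
    return w - np.linalg.solve(H, g)
```

On the separability gotcha: if some w separates the data perfectly, scaling that w by any c > 1 strictly lowers the loss, so the loss has no finite minimizer and the weights grow without bound.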