Logistic Regression Gradient & Convexity
Derive and implement logistic regression *correctly* (loss, gradient, Hessian intuition) and explain why the loss is convex.
Students can implement vectorized logistic regression, gradient-check it, and communicate convexity + optimization reasoning at FAANG interview depth.
Progress — 0/8 tasks
Interview Angles
- How do you compute `log(1 + exp(z))` stably?
- Why is Newton fast but expensive?
- What breaks if data is perfectly separable?
FAANG Gotchas
- Most bugs are missing `1/n` constants or shape/broadcast mistakes.
Logistic Regression Gradient & Convexity — Student Lab
Complete all TODOs. This lab is math-first and stability-first.
Section 0 — Synthetic Dataset
We’ll create a binary classification dataset with both separable and non-separable regimes.
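A minimal sketch of such a generator, assuming NumPy; the function name `make_dataset` and the default sizes are my own choices, not part of the lab:

```python
import numpy as np

def make_dataset(n=200, d=2, separable=False, seed=0):
    """Binary classification data with labels in {0, 1}."""
    rng = np.random.default_rng(seed)
    w_true = rng.normal(size=d)
    X = rng.normal(size=(n, d))
    margin = X @ w_true
    if separable:
        y = (margin > 0).astype(float)               # perfectly separable regime
    else:
        p = 1.0 / (1.0 + np.exp(-margin))
        y = (rng.uniform(size=n) < p).astype(float)  # noisy labels near the boundary
    return X, y
```

The `separable=True` branch matters later: Section 4 asks what goes wrong when the data can be split perfectly.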
Section 1 — Sigmoid + Stable Loss
Task 1.1: Stable sigmoid
Explain: Why does sigmoid saturate for large |z| and what does that do to gradients?
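One standard stable construction splits on the sign of `z` so that `exp` only ever sees non-positive arguments; a sketch of what Task 1.1 is after:

```python
import numpy as np

def sigmoid(z):
    """Stable sigmoid: exp() only ever receives non-positive arguments."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    ez = np.exp(z[~pos])            # safe: z < 0 on this branch
    out[~pos] = ez / (1.0 + ez)
    return out
```

For the saturation question: for large |z| the output is pinned near 0 or 1, so `p * (1 - p)` — which appears in the gradient and Hessian — underflows toward 0 and learning stalls.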
Task 1.2: Stable binary cross-entropy loss
We use labels y in {0,1}.
Loss per example: -y log(p) - (1-y) log(1-p) where p=sigmoid(z).
Interview Angle: explain a stable form of log-loss.
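One stable form works directly on logits via a softplus identity, never computing `log(p)` from a possibly-clipped `p`; a sketch (function names are my own):

```python
import numpy as np

def softplus(z):
    """Stable log(1 + exp(z)) = max(z, 0) + log1p(exp(-|z|))."""
    return np.maximum(z, 0.0) + np.log1p(np.exp(-np.abs(z)))

def bce_from_logits(z, y):
    """Mean log-loss from logits, using the identity
    -y*log(p) - (1-y)*log(1-p) = softplus(z) - y*z  for p = sigmoid(z)."""
    return float(np.mean(softplus(z) - y * z))
```

The identity follows from `log(sigmoid(z)) = -softplus(-z)` and `log(1 - sigmoid(z)) = -softplus(z)`, plus `softplus(-z) = softplus(z) - z`.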
Section 2 — Gradient (Derive → Implement → Check)
Task 2.1: Derive the gradient (write in markdown)
Show that for loss averaged over n examples:
grad = X^T (p - y) / n
Checkpoint: What is the shape of grad?
Task 2.2: Implement loss and gradient
FAANG gotcha: ensure y is 0/1, not -1/+1.
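A vectorized sketch of the target implementation (shape conventions are assumptions stated in the docstring):

```python
import numpy as np

def loss_and_grad(w, X, y):
    """Mean log-loss and gradient. Shapes: X (n, d), y (n,) in {0,1}, w (d,)."""
    n = X.shape[0]
    z = X @ w
    # stable sigmoid: both np.where branches only exponentiate -|z|
    p = np.where(z >= 0,
                 1.0 / (1.0 + np.exp(-np.abs(z))),
                 np.exp(-np.abs(z)) / (1.0 + np.exp(-np.abs(z))))
    # stable mean loss: softplus(z) - y*z
    loss = np.mean(np.maximum(z, 0.0) + np.log1p(np.exp(-np.abs(z))) - y * z)
    grad = X.T @ (p - y) / n        # shape (d,) -- don't forget the 1/n
    return float(loss), grad
```

Note `grad` has shape `(d,)`, matching `w`, which answers the Checkpoint in Task 2.1.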
Task 2.3: Numerical gradient check (finite differences)
Compare your analytic gradient to a central-difference numerical gradient.
Checkpoint: Why can eps too small make the check worse?
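A sketch of a central-difference checker (helper name is mine). On the Checkpoint: the central difference has O(eps²) truncation error, but the subtraction `f(w+e) - f(w-e)` loses roughly `eps`-relative precision to floating-point cancellation, so an eps that is too small makes rounding error dominate.

```python
import numpy as np

def numerical_grad(f, w, eps=1e-5):
    """Central-difference gradient of a scalar function f at w."""
    g = np.zeros_like(w, dtype=float)
    for i in range(w.size):
        e = np.zeros_like(w, dtype=float)
        e[i] = eps
        g[i] = (f(w + e) - f(w - e)) / (2.0 * eps)  # O(eps^2) truncation error
    return g
```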
Section 3 — Convexity & Hessian Intuition
Task 3.1: Hessian-vector product (HVP)
For logistic regression: H = (1/n) X^T S X where S = diag(p(1-p)).
Implement HVP: compute H@v without building full H explicitly.
- s = p * (1 - p)
- compute X @ v, multiply elementwise by s, then apply X^T
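The steps above can be sketched as (function name is mine; the plain `np.exp` sigmoid is acceptable here for modest `z`):

```python
import numpy as np

def hvp(w, X, v):
    """H @ v for H = (1/n) X^T S X with S = diag(p*(1-p)), without forming H."""
    n = X.shape[0]
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    s = p * (1.0 - p)                 # per-example curvature weights, shape (n,)
    return X.T @ (s * (X @ v)) / n    # (n,) elementwise, then (d, n) @ (n,)
```

This costs two matrix-vector products, O(nd), versus O(nd²) to build H explicitly.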
Task 3.2: Empirical PSD check
Check that v^T H v >= 0 for random v (PSD).
Explain: Why does PSD imply convexity?
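A sketch of the empirical check (the HVP helper is repeated here so the snippet stands alone). For the Explain prompt: v^T H v = (1/n)·||S^{1/2} X v||² ≥ 0 for every v, and a loss whose Hessian is PSD at every w is convex.

```python
import numpy as np

def hvp(w, X, v):
    """H @ v with H = (1/n) X^T diag(p(1-p)) X, computed matrix-free."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ ((p * (1.0 - p)) * (X @ v)) / X.shape[0]

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
w = rng.normal(size=4)
quad = [v @ hvp(w, X, v) for v in rng.normal(size=(200, 4))]
all_psd = all(q >= -1e-12 for q in quad)   # allow tiny float noise
```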
Section 4 — Optimization (Bonus)
Task 4.1: One step of GD vs Newton (conceptual)
Implement one gradient descent step. (Newton step is optional.)
FAANG gotcha: perfectly separable data can push weights to infinity; explain why.
Section 4 — Notation Reference
Gradient descent update rule:
w' = w - η ∇L(w)
Where:
- w = current weights
- w' = updated weights after one optimization step
- η = learning rate (step size)
- ∇L(w) = gradient of the loss with respect to w
Newton (Hessian) step notation:
w' = w - H(w)^{-1} ∇L(w)
Where:
- H(w) = ∇²L(w) = Hessian matrix (second derivatives)
- H(w)^{-1} ∇L(w) = the gradient rescaled by local curvature
Interpretation: gradient descent uses only slope; Newton step uses slope + curvature for a more curvature-aware update.
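A sketch contrasting the two updates (helper names are mine; the `ridge` term in the Newton step is an assumed numerical guard, relevant on separable data where H can become near-singular as weights grow):

```python
import numpy as np

def gd_step(w, X, y, lr=0.1):
    """One gradient-descent step: slope only, O(n*d) per step."""
    n = X.shape[0]
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return w - lr * (X.T @ (p - y) / n)

def newton_step(w, X, y, ridge=1e-8):
    """One Newton step: O(n*d^2) to build H plus O(d^3) to solve --
    few iterations needed, but each one is expensive."""
    n = X.shape[0]
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    g = X.T @ (p - y) / n
    s = p * (1.0 - p)
    H = X.T @ (s[:, None] * X) / n + ridge * np.eye(w.size)
    return w - np.linalg.solve(H, g)
```

On the separability gotcha: if some w separates the data perfectly, scaling that w by any c > 1 strictly lowers the loss, so the loss has no finite minimizer and the weights grow without bound.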