Activation Functions & Initialization
Activation Functions & Initialization – Student Lab
Focus: implement activations + init schemes and empirically verify signal/gradient propagation across depth.
Section 0 – Setup
We'll work with synthetic Gaussian inputs so we can isolate activation/init effects.
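A possible setup cell for the whole lab (the batch size and width here are just example choices, not requirements):

```python
import numpy as np

# Fixed seed so runs are reproducible across tasks.
rng = np.random.default_rng(0)

# Synthetic standard-normal input: 512 samples, 256 features.
X0 = rng.standard_normal((512, 256))
```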
Section 1 – Activations
Task 1.1
Implement ReLU, tanh, and GELU (approx).
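One way Task 1.1 could look in NumPy (the GELU uses the common tanh approximation; function names are ours):

```python
import numpy as np

def relu(x):
    # Elementwise max(0, x).
    return np.maximum(0.0, x)

def tanh(x):
    # Thin wrapper so all three activations share the same call style.
    return np.tanh(x)

def gelu(x):
    # Tanh approximation of GELU:
    # 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))
```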
Task 1.2
Compare output mean/std on standard normal input.
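A sketch of the comparison in Task 1.2: feed a large standard-normal sample through each activation and look at the output statistics (for ReLU on N(0,1) input, the mean should land near 1/√(2π) ≈ 0.399):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)  # standard normal input

acts = {
    "relu": lambda v: np.maximum(0.0, v),
    "tanh": np.tanh,
    "gelu": lambda v: 0.5 * v * (1 + np.tanh(np.sqrt(2 / np.pi) * (v + 0.044715 * v**3))),
}
stats = {name: (float(f(x).mean()), float(f(x).std())) for name, f in acts.items()}
for name, (m, s) in stats.items():
    print(f"{name}: mean={m:+.3f} std={s:.3f}")
```

Note how every activation shifts the mean and/or shrinks the std relative to the N(0,1) input; this is exactly the distortion that init schemes have to compensate for.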
Optional concept – Softplus
softplus(v) = log(1 + e^v)
Why it is used:
- smooth approximation of ReLU
- numerically stable when implemented carefully (no overflow for large |v|)
- useful in logistic-loss derivations
Behavior:
- if v is very large and positive, log(1 + e^v) ≈ v
- if v is very large and negative, log(1 + e^v) ≈ 0
So Softplus behaves like max(0, v) but smoothly.
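A stable implementation follows from the identity log(1 + e^v) = max(v, 0) + log(1 + e^(−|v|)), which never exponentiates a large positive number (a sketch, not the only formulation):

```python
import numpy as np

def softplus(v):
    # Stable: the exponent is always <= 0, so np.exp never overflows,
    # and log1p stays accurate near 0.
    return np.maximum(v, 0.0) + np.log1p(np.exp(-np.abs(v)))
```

Compare with the naive `np.log(1 + np.exp(v))`, which overflows to `inf` once v exceeds roughly 710 in float64.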
Section 2 – Initialization schemes
We'll initialize weight matrices for a linear layer: Y = X W (no bias).
Task 2.1
Implement:
- naive normal init with std=1
- Xavier normal init
- He normal init
Return a weight matrix of shape (fan_in, fan_out).
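A possible implementation of Task 2.1 (Xavier normal uses std = √(2/(fan_in + fan_out)), He normal uses std = √(2/fan_in); function names are ours):

```python
import numpy as np

def naive_init(fan_in, fan_out, rng):
    # Every entry ~ N(0, 1): variance does not account for fan-in at all.
    return rng.standard_normal((fan_in, fan_out))

def xavier_init(fan_in, fan_out, rng):
    # Balances forward and backward variance for roughly linear activations.
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return std * rng.standard_normal((fan_in, fan_out))

def he_init(fan_in, fan_out, rng):
    # The extra factor of 2 compensates for ReLU zeroing half the units.
    std = np.sqrt(2.0 / fan_in)
    return std * rng.standard_normal((fan_in, fan_out))
```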
Section 3 – Forward signal propagation across depth
We simulate an L-layer network: X_{l+1} = act(X_l W_l)
Task 3.1
Write simulate_forward(X0, L, init_fn, act_fn) returning stats per layer.
We care about:
- mean/std of activations
- for ReLU: fraction of zeros
- for tanh: saturation fraction (|a| > 0.95) and average local derivative
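A sketch of `simulate_forward` for Task 3.1, with a short He + ReLU run as a sanity check (the `init_fn(fan_in, fan_out, rng)` signature is one possible convention):

```python
import numpy as np

def simulate_forward(X0, L, init_fn, act_fn, rng):
    """Run X_{l+1} = act(X_l W_l) for L layers, recording per-layer stats."""
    X, stats = X0, []
    for _ in range(L):
        W = init_fn(X.shape[1], X.shape[1], rng)
        X = act_fn(X @ W)
        stats.append({
            "mean": float(X.mean()),
            "std": float(X.std()),
            "frac_zero": float(np.mean(X == 0.0)),              # relevant for ReLU
            "frac_saturated": float(np.mean(np.abs(X) > 0.95)), # relevant for tanh
        })
    return stats

# Sanity check: He init + ReLU should stay roughly stable in scale.
rng = np.random.default_rng(0)
he = lambda fi, fo, r: np.sqrt(2.0 / fi) * r.standard_normal((fi, fo))
relu = lambda z: np.maximum(0.0, z)
stats = simulate_forward(rng.standard_normal((512, 256)), 20, he, relu, rng)
```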
Task 3.2
Compare naive vs Xavier/He for depth L=50 using both ReLU and tanh.
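A compact, self-contained version of the Task 3.2 experiment (the qualitative outcomes in the comments are what the theory predicts; exact numbers will vary with the seed):

```python
import numpy as np

def final_std(w_std, act, n=256, L=50, rng=None):
    """Forward X_{l+1} = act(X_l W_l) for L layers; return std of the last layer."""
    rng = rng or np.random.default_rng(0)
    X = rng.standard_normal((512, n))
    for _ in range(L):
        X = act(X @ (w_std * rng.standard_normal((n, n))))
    return float(X.std())

n = 256
relu = lambda z: np.maximum(0.0, z)
results = {
    "naive + relu":  final_std(1.0, relu),                       # blows up with depth
    "He + relu":     final_std(np.sqrt(2.0 / n), relu),          # roughly stable
    "naive + tanh":  final_std(1.0, np.tanh),                    # saturates near +/-1
    "Xavier + tanh": final_std(np.sqrt(2.0 / (2 * n)), np.tanh), # slowly shrinks
}
for k, v in results.items():
    print(f"{k:14s} final std = {v:.3e}")
```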
Section 4 – Backward gradient propagation (toy)
We estimate gradient flow using a simple scalar loss:
- Forward: X_{l+1} = act(X_l W_l)
- Loss: mean(X_L)
- Backward (approx): propagate gradients using local Jacobians
This is not a full autodiff engine; it's a controlled experiment to see how gradient norms explode or vanish.
Task 4.1
Implement activation derivatives for ReLU and tanh.
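Task 4.1 could look like this (for ReLU we pick the subgradient value 0 at z = 0, a common convention):

```python
import numpy as np

def relu_grad(z):
    # Subgradient of ReLU: 0 for z < 0, 1 for z > 0 (0 chosen at z == 0).
    return (z > 0).astype(float)

def tanh_grad(z):
    # d/dz tanh(z) = 1 - tanh(z)^2
    return 1.0 - np.tanh(z) ** 2
```

A quick finite-difference check against `np.tanh` is a good way to catch sign or formula errors before running the depth experiment.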
Task 4.2
Simulate gradient norms across depth for different init schemes.
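One way to wire Tasks 4.1 and 4.2 together for the ReLU case: cache pre-activations on the forward pass, then apply each layer's local Jacobian in reverse (a sketch under the toy loss mean(X_L); function name and defaults are ours):

```python
import numpy as np

def grad_norms(w_std, n=256, L=50, rng=None):
    """Forward with ReLU, caching pre-activations, then backprop loss = mean(X_L).
    Returns ||dloss/dX_l|| for l = 0..L-1 (index 0 is the earliest layer)."""
    rng = rng or np.random.default_rng(0)
    X = rng.standard_normal((128, n))
    Ws, Zs = [], []
    for _ in range(L):
        W = w_std * rng.standard_normal((n, n))
        Z = X @ W
        Ws.append(W); Zs.append(Z)
        X = np.maximum(0.0, Z)
    g = np.full_like(X, 1.0 / X.size)   # dloss/dX_L for loss = mean(X_L)
    norms = []
    for W, Z in zip(reversed(Ws), reversed(Zs)):
        g = (g * (Z > 0)) @ W.T         # ReLU Jacobian, then the linear layer
        norms.append(float(np.linalg.norm(g)))
    return norms[::-1]

naive = grad_norms(1.0)
he = grad_norms(np.sqrt(2.0 / 256))
print(f"naive: ||g_0|| / ||g_last|| = {naive[0] / naive[-1]:.2e}")  # explodes
print(f"He:    ||g_0|| / ||g_last|| = {he[0] / he[-1]:.2e}")        # stays O(1)-ish
```

The experiment mirrors Section 3: the same variance argument that keeps forward activations stable under He init also keeps the backward norms from exploding or vanishing.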