Activation Functions & Initialization
Activation Functions & Initialization – Student Lab
Focus: implement activations + init schemes and empirically verify signal/gradient propagation across depth.
Section 0 – Setup
We'll work with synthetic Gaussian inputs so we can isolate activation/init effects.
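A possible setup cell for the whole lab (the batch size and width here are just example choices, not requirements):

```python
import numpy as np

# Fixed seed so runs are reproducible across tasks.
rng = np.random.default_rng(0)

# Synthetic standard-normal input: 512 samples, 256 features.
X0 = rng.standard_normal((512, 256))
```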
Section 1 – Activations
Task 1.1
Implement ReLU, tanh, and GELU (approx).
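One way Task 1.1 could look in NumPy (the GELU uses the common tanh approximation; function names are ours):

```python
import numpy as np

def relu(x):
    # Elementwise max(0, x).
    return np.maximum(0.0, x)

def tanh(x):
    # Thin wrapper so all three activations share the same call style.
    return np.tanh(x)

def gelu(x):
    # Tanh approximation of GELU:
    # 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))
```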
Task 1.2
Compare output mean/std on standard normal input.
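A sketch of the comparison in Task 1.2: feed a large standard-normal sample through each activation and look at the output statistics (for ReLU on N(0,1) input, the mean should land near 1/√(2π) ≈ 0.399):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)  # standard normal input

acts = {
    "relu": lambda v: np.maximum(0.0, v),
    "tanh": np.tanh,
    "gelu": lambda v: 0.5 * v * (1 + np.tanh(np.sqrt(2 / np.pi) * (v + 0.044715 * v**3))),
}
stats = {name: (float(f(x).mean()), float(f(x).std())) for name, f in acts.items()}
for name, (m, s) in stats.items():
    print(f"{name}: mean={m:+.3f} std={s:.3f}")
```

Note how every activation shifts the mean and/or shrinks the std relative to the N(0,1) input; this is exactly the distortion that init schemes have to compensate for.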
Optional concept – Softplus
softplus(v) = log(1 + e^v)
Why it is used:
- smooth approximation of ReLU
- numerically stable when implemented carefully (no overflow for large |v|)
- useful in logistic-loss derivations
Behavior:
- if v is very large and positive, log(1 + e^v) ≈ v
- if v is very large and negative, log(1 + e^v) ≈ 0
So Softplus behaves like max(0, v) but smoothly.
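A stable implementation follows from the identity log(1 + e^v) = max(v, 0) + log(1 + e^(−|v|)), which never exponentiates a large positive number (a sketch, not the only formulation):

```python
import numpy as np

def softplus(v):
    # Stable: the exponent is always <= 0, so np.exp never overflows,
    # and log1p stays accurate near 0.
    return np.maximum(v, 0.0) + np.log1p(np.exp(-np.abs(v)))
```

Compare with the naive `np.log(1 + np.exp(v))`, which overflows to `inf` once v exceeds roughly 710 in float64.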
Section 2 – Initialization schemes
We'll initialize weight matrices for a linear layer: Y = X W (no bias).
Task 2.1
Implement:
- naive normal init with std=1
- Xavier normal init
- He normal init
Return a weight matrix of shape (fan_in, fan_out).
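A possible implementation of Task 2.1 (Xavier normal uses std = √(2/(fan_in + fan_out)), He normal uses std = √(2/fan_in); function names are ours):

```python
import numpy as np

def naive_init(fan_in, fan_out, rng):
    # Every entry ~ N(0, 1): variance does not account for fan-in at all.
    return rng.standard_normal((fan_in, fan_out))

def xavier_init(fan_in, fan_out, rng):
    # Balances forward and backward variance for roughly linear activations.
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return std * rng.standard_normal((fan_in, fan_out))

def he_init(fan_in, fan_out, rng):
    # The extra factor of 2 compensates for ReLU zeroing half the units.
    std = np.sqrt(2.0 / fan_in)
    return std * rng.standard_normal((fan_in, fan_out))
```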
Section 3 – Forward signal propagation across depth
We simulate an L-layer network: X_{l+1} = act(X_l W_l)
Task 3.1
Write simulate_forward(X0, L, init_fn, act_fn) returning stats per layer.
We care about:
- mean/std of activations
- for ReLU: fraction of zeros
- for tanh: saturation fraction (|a| > 0.95) and average local derivative
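A sketch of `simulate_forward` for Task 3.1, with a short He + ReLU run as a sanity check (the `init_fn(fan_in, fan_out, rng)` signature is one possible convention):

```python
import numpy as np

def simulate_forward(X0, L, init_fn, act_fn, rng):
    """Run X_{l+1} = act(X_l W_l) for L layers, recording per-layer stats."""
    X, stats = X0, []
    for _ in range(L):
        W = init_fn(X.shape[1], X.shape[1], rng)
        X = act_fn(X @ W)
        stats.append({
            "mean": float(X.mean()),
            "std": float(X.std()),
            "frac_zero": float(np.mean(X == 0.0)),              # relevant for ReLU
            "frac_saturated": float(np.mean(np.abs(X) > 0.95)), # relevant for tanh
        })
    return stats

# Sanity check: He init + ReLU should stay roughly stable in scale.
rng = np.random.default_rng(0)
he = lambda fi, fo, r: np.sqrt(2.0 / fi) * r.standard_normal((fi, fo))
relu = lambda z: np.maximum(0.0, z)
stats = simulate_forward(rng.standard_normal((512, 256)), 20, he, relu, rng)
```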
Task 3.2
Compare naive vs Xavier/He for depth L=50 using both ReLU and tanh.
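A compact, self-contained version of the Task 3.2 experiment (the qualitative outcomes in the comments are what the theory predicts; exact numbers will vary with the seed):

```python
import numpy as np

def final_std(w_std, act, n=256, L=50, rng=None):
    """Forward X_{l+1} = act(X_l W_l) for L layers; return std of the last layer."""
    rng = rng or np.random.default_rng(0)
    X = rng.standard_normal((512, n))
    for _ in range(L):
        X = act(X @ (w_std * rng.standard_normal((n, n))))
    return float(X.std())

n = 256
relu = lambda z: np.maximum(0.0, z)
results = {
    "naive + relu":  final_std(1.0, relu),                       # blows up with depth
    "He + relu":     final_std(np.sqrt(2.0 / n), relu),          # roughly stable
    "naive + tanh":  final_std(1.0, np.tanh),                    # saturates near +/-1
    "Xavier + tanh": final_std(np.sqrt(2.0 / (2 * n)), np.tanh), # slowly shrinks
}
for k, v in results.items():
    print(f"{k:14s} final std = {v:.3e}")
```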
Section 4 – Backward gradient propagation (toy)
We estimate gradient flow using a simple scalar loss:
- Forward: X_{l+1} = act(X_l W_l)
- Loss: mean(X_L)
- Backward (approx): propagate gradients using local Jacobians
This is not a full autodiff engine; it's a controlled experiment to see how gradient norms explode or vanish.
Task 4.1
Implement activation derivatives for ReLU and tanh.
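Task 4.1 could look like this (for ReLU we pick the subgradient value 0 at z = 0, a common convention):

```python
import numpy as np

def relu_grad(z):
    # Subgradient of ReLU: 0 for z < 0, 1 for z > 0 (0 chosen at z == 0).
    return (z > 0).astype(float)

def tanh_grad(z):
    # d/dz tanh(z) = 1 - tanh(z)^2
    return 1.0 - np.tanh(z) ** 2
```

A quick finite-difference check against `np.tanh` is a good way to catch sign or formula errors before running the depth experiment.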
Task 4.2
Simulate gradient norms across depth for different init schemes.
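One way to wire Tasks 4.1 and 4.2 together for the ReLU case: cache pre-activations on the forward pass, then apply each layer's local Jacobian in reverse (a sketch under the toy loss mean(X_L); function name and defaults are ours):

```python
import numpy as np

def grad_norms(w_std, n=256, L=50, rng=None):
    """Forward with ReLU, caching pre-activations, then backprop loss = mean(X_L).
    Returns ||dloss/dX_l|| for l = 0..L-1 (index 0 is the earliest layer)."""
    rng = rng or np.random.default_rng(0)
    X = rng.standard_normal((128, n))
    Ws, Zs = [], []
    for _ in range(L):
        W = w_std * rng.standard_normal((n, n))
        Z = X @ W
        Ws.append(W); Zs.append(Z)
        X = np.maximum(0.0, Z)
    g = np.full_like(X, 1.0 / X.size)   # dloss/dX_L for loss = mean(X_L)
    norms = []
    for W, Z in zip(reversed(Ws), reversed(Zs)):
        g = (g * (Z > 0)) @ W.T         # ReLU Jacobian, then the linear layer
        norms.append(float(np.linalg.norm(g)))
    return norms[::-1]

naive = grad_norms(1.0)
he = grad_norms(np.sqrt(2.0 / 256))
print(f"naive: ||g_0|| / ||g_last|| = {naive[0] / naive[-1]:.2e}")  # explodes
print(f"He:    ||g_0|| / ||g_last|| = {he[0] / he[-1]:.2e}")        # stays O(1)-ish
```

The experiment mirrors Section 3: the same variance argument that keeps forward activations stable under He init also keeps the backward norms from exploding or vanishing.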