Transformer Architecture — Student Lab
Implement a minimal Transformer encoder-style block forward pass in NumPy:
- embeddings + sinusoidal positional encoding
- multi-head self-attention (from scratch)
- FFN
- residuals + layer norm (Pre-LN)
Focus: shapes, masking, and correctness.
Section 0 — Utilities (softmax, LayerNorm, activation)
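A minimal NumPy sketch of the three utilities this section asks for. The tanh approximation of GELU and per-feature LayerNorm are assumptions about the intended variants; adapt if your lab specifies otherwise.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max along the axis for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize over the last (feature) dimension, then scale and shift.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def gelu(x):
    # Tanh approximation of GELU (a common choice; exact erf form also works).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))
```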
Section 1 — Embeddings + positional encoding
Task 1.1
Implement sinusoidal positional encoding pos_enc(T, D) returning (T,D).
Use the standard formula with sin/cos on even/odd dims.
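A sketch of the standard sinusoidal encoding, assuming D is even (odd D would need a half-column adjustment):

```python
import numpy as np

def pos_enc(T, D):
    # Sinusoidal positional encoding: sin on even dims, cos on odd dims.
    pos = np.arange(T)[:, None]              # (T, 1)
    i = np.arange(0, D, 2)[None, :]          # (1, D/2) even dimension indices
    angles = pos / np.power(10000.0, i / D)  # (T, D/2)
    pe = np.zeros((T, D))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe
```

Note that row 0 is all sin(0)=0 / cos(0)=1 entries, a quick sanity check for your implementation.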
Task 1.2
Implement token embedding lookup.
Given tokens (B,T) and embedding table E (V,D), return X (B,T,D).
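In NumPy the lookup is a single fancy-indexing operation; a minimal sketch:

```python
import numpy as np

def embed(tokens, E):
    # tokens: (B, T) integer ids; E: (V, D) embedding table -> X: (B, T, D)
    return E[tokens]
```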
Section 2 — Multi-head self-attention
Task 2.1
Implement masks:
- causal_mask(T) -> (1, 1, T, T)
- padding_mask(lengths, T) -> (B, 1, 1, T)
Mask convention: 1 = keep, 0 = mask.
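A sketch of both masks under the stated convention (1 = keep, 0 = mask); the extra singleton axes make them broadcast against (B, h, T, T) attention scores:

```python
import numpy as np

def causal_mask(T):
    # Lower-triangular: position t may attend only to positions <= t.
    return np.tril(np.ones((T, T)))[None, None, :, :]   # (1, 1, T, T)

def padding_mask(lengths, T):
    # 1 for real tokens, 0 for padding beyond each sequence's length.
    lengths = np.asarray(lengths)
    m = (np.arange(T)[None, :] < lengths[:, None]).astype(float)  # (B, T)
    return m[:, None, None, :]                                    # (B, 1, 1, T)
```

The two masks can be combined by elementwise multiplication, since broadcasting aligns their shapes.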
Task 2.2
Implement head split/merge utilities.
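One way to sketch the split/merge pair with reshape and transpose (assuming D is divisible by the head count):

```python
import numpy as np

def split_heads(x, n_heads):
    # (B, T, D) -> (B, h, T, D/h)
    B, T, D = x.shape
    Dh = D // n_heads
    return x.reshape(B, T, n_heads, Dh).transpose(0, 2, 1, 3)

def merge_heads(x):
    # (B, h, T, Dh) -> (B, T, h*Dh), the inverse of split_heads
    B, h, T, Dh = x.shape
    return x.transpose(0, 2, 1, 3).reshape(B, T, h * Dh)
```

A good correctness check is that merge_heads(split_heads(x, h)) returns x exactly.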
Task 2.3
Implement scaled dot-product attention over heads. Inputs: Q, K, V, each of shape (B, h, T, Dh). The mask should be broadcastable to (B, h, T, T). Return (O, A), where O has shape (B, h, T, Dh) and A has shape (B, h, T, T).
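A sketch of the attention core under these shapes; the large negative fill value (-1e9) for masked positions is an assumption, any value that underflows the softmax works:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(Q, K, V, mask=None):
    # Q, K, V: (B, h, T, Dh); mask broadcastable to (B, h, T, T), 1=keep 0=mask.
    Dh = Q.shape[-1]
    scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(Dh)  # (B, h, T, T)
    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)      # masked -> ~zero weight
    A = softmax(scores)                                 # rows sum to 1
    O = A @ V                                           # (B, h, T, Dh)
    return O, A
```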
Task 2.4
Implement a multi-head self-attention forward pass:
- project Q, K, V with Wq, Wk, Wv
- split into heads
- apply attention
- merge heads
- output projection with Wo
Return (Y, A).
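The steps above can be sketched end to end as follows (bias-free projections are an assumption; add biases if your lab uses them):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def mha_forward(x, Wq, Wk, Wv, Wo, n_heads, mask=None):
    B, T, D = x.shape
    Dh = D // n_heads
    def split(z):  # (B, T, D) -> (B, h, T, Dh)
        return z.reshape(B, T, n_heads, Dh).transpose(0, 2, 1, 3)
    # project and split into heads
    Q, K, V = split(x @ Wq), split(x @ Wk), split(x @ Wv)
    # scaled dot-product attention per head
    scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(Dh)
    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)
    A = softmax(scores)                                  # (B, h, T, T)
    # merge heads and apply the output projection
    O = (A @ V).transpose(0, 2, 1, 3).reshape(B, T, D)
    return O @ Wo, A
```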
Section 3 — Feed-forward network (FFN)
Task 3.1
Implement FFN forward: FFN(x) = (gelu(x W1 + b1)) W2 + b2.
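The formula translates directly into two matrix multiplies around the activation; a sketch using the tanh GELU approximation:

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, W1, b1, W2, b2):
    # (B, T, D) @ (D, D_ff) -> expand, activate, then project back to D.
    return gelu(x @ W1 + b1) @ W2 + b2
```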
Section 4 — Transformer block (Pre-LN)
Pre-LN block:
- x1 = x + MHA(LN(x))
- x2 = x1 + FFN(LN(x1))
Task 4.1
Implement transformer_block_forward(x, params, n_heads, attn_mask=None) returning (y, attn_weights).
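A self-contained sketch assembling the whole Pre-LN block. The `params` dict keys used here (`ln1_g`, `ln1_b`, `Wq`, `W1`, `b1`, etc.) are assumptions about the lab's parameter layout, not a required naming:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def transformer_block_forward(x, params, n_heads, attn_mask=None):
    p = params
    B, T, D = x.shape
    Dh = D // n_heads
    def split(z):  # (B, T, D) -> (B, h, T, Dh)
        return z.reshape(B, T, n_heads, Dh).transpose(0, 2, 1, 3)

    # Attention sublayer: x1 = x + MHA(LN(x))
    h = layer_norm(x, p["ln1_g"], p["ln1_b"])
    Q, K, V = split(h @ p["Wq"]), split(h @ p["Wk"]), split(h @ p["Wv"])
    scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(Dh)
    if attn_mask is not None:
        scores = np.where(attn_mask == 0, -1e9, scores)
    A = softmax(scores)
    O = (A @ V).transpose(0, 2, 1, 3).reshape(B, T, D)
    x1 = x + O @ p["Wo"]

    # FFN sublayer: x2 = x1 + FFN(LN(x1))
    h2 = layer_norm(x1, p["ln2_g"], p["ln2_b"])
    x2 = x1 + gelu(h2 @ p["W1"] + p["b1"]) @ p["W2"] + p["b2"]
    return x2, A
```

Note the residual adds use the unnormalized stream (x and x1), which is what makes the block Pre-LN rather than Post-LN.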