Transformer Architecture — Student Lab
Implement a minimal Transformer encoder-style block forward pass in NumPy:
- embeddings + sinusoidal positional encoding
- multi-head self-attention (from scratch)
- FFN
- residuals + layer norm (Pre-LN)
Focus: shapes, masking, and correctness.
Section 0 — Utilities (softmax, LayerNorm, activation)
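A minimal NumPy sketch of the three utilities this section asks for. The tanh approximation of GELU and per-feature LayerNorm are assumptions about the intended variants; adapt if your lab specifies otherwise.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max along the axis for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize over the last (feature) dimension, then scale and shift.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def gelu(x):
    # Tanh approximation of GELU (a common choice; exact erf form also works).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))
```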
Section 1 — Embeddings + positional encoding
Task 1.1
Implement sinusoidal positional encoding pos_enc(T, D) returning (T,D).
Use the standard formula with sin/cos on even/odd dims.
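A sketch of the standard sinusoidal encoding, assuming D is even (odd D would need a half-column adjustment):

```python
import numpy as np

def pos_enc(T, D):
    # Sinusoidal positional encoding: sin on even dims, cos on odd dims.
    pos = np.arange(T)[:, None]              # (T, 1)
    i = np.arange(0, D, 2)[None, :]          # (1, D/2) even dimension indices
    angles = pos / np.power(10000.0, i / D)  # (T, D/2)
    pe = np.zeros((T, D))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe
```

Note that row 0 is all sin(0)=0 / cos(0)=1 entries, a quick sanity check for your implementation.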
Task 1.2
Implement token embedding lookup.
Given tokens (B,T) and embedding table E (V,D), return X (B,T,D).
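In NumPy the lookup is a single fancy-indexing operation; a minimal sketch:

```python
import numpy as np

def embed(tokens, E):
    # tokens: (B, T) integer ids; E: (V, D) embedding table -> X: (B, T, D)
    return E[tokens]
```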
Section 2 — Multi-head self-attention
Task 2.1
Implement masks:
- causal_mask(T) -> (1, 1, T, T)
- padding_mask(lengths, T) -> (B, 1, 1, T)
Mask convention: 1 = keep, 0 = mask.
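A sketch of both masks under the stated convention (1 = keep, 0 = mask); the extra singleton axes make them broadcast against (B, h, T, T) attention scores:

```python
import numpy as np

def causal_mask(T):
    # Lower-triangular: position t may attend only to positions <= t.
    return np.tril(np.ones((T, T)))[None, None, :, :]   # (1, 1, T, T)

def padding_mask(lengths, T):
    # 1 for real tokens, 0 for padding beyond each sequence's length.
    lengths = np.asarray(lengths)
    m = (np.arange(T)[None, :] < lengths[:, None]).astype(float)  # (B, T)
    return m[:, None, None, :]                                    # (B, 1, 1, T)
```

The two masks can be combined by elementwise multiplication, since broadcasting aligns their shapes.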
Task 2.2
Implement head split/merge utilities.
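One way to sketch the split/merge pair with reshape and transpose (assuming D is divisible by the head count):

```python
import numpy as np

def split_heads(x, n_heads):
    # (B, T, D) -> (B, h, T, D/h)
    B, T, D = x.shape
    Dh = D // n_heads
    return x.reshape(B, T, n_heads, Dh).transpose(0, 2, 1, 3)

def merge_heads(x):
    # (B, h, T, Dh) -> (B, T, h*Dh), the inverse of split_heads
    B, h, T, Dh = x.shape
    return x.transpose(0, 2, 1, 3).reshape(B, T, h * Dh)
```

A good correctness check is that merge_heads(split_heads(x, h)) returns x exactly.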
Task 2.3
Implement scaled dot-product attention over heads. Inputs: Q, K, V, each of shape (B, h, T, Dh). The mask should be broadcastable to (B, h, T, T). Return (O, A), where O has shape (B, h, T, Dh) and A has shape (B, h, T, T).
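A sketch of the attention core under these shapes; the large negative fill value (-1e9) for masked positions is an assumption, any value that underflows the softmax works:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(Q, K, V, mask=None):
    # Q, K, V: (B, h, T, Dh); mask broadcastable to (B, h, T, T), 1=keep 0=mask.
    Dh = Q.shape[-1]
    scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(Dh)  # (B, h, T, T)
    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)      # masked -> ~zero weight
    A = softmax(scores)                                 # rows sum to 1
    O = A @ V                                           # (B, h, T, Dh)
    return O, A
```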
Task 2.4
Implement a multi-head self-attention forward pass:
- project Q, K, V with Wq, Wk, Wv
- split into heads
- apply attention
- merge heads
- output projection with Wo
Return (Y, A).
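The steps above can be sketched end to end as follows (bias-free projections are an assumption; add biases if your lab uses them):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def mha_forward(x, Wq, Wk, Wv, Wo, n_heads, mask=None):
    B, T, D = x.shape
    Dh = D // n_heads
    def split(z):  # (B, T, D) -> (B, h, T, Dh)
        return z.reshape(B, T, n_heads, Dh).transpose(0, 2, 1, 3)
    # project and split into heads
    Q, K, V = split(x @ Wq), split(x @ Wk), split(x @ Wv)
    # scaled dot-product attention per head
    scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(Dh)
    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)
    A = softmax(scores)                                  # (B, h, T, T)
    # merge heads and apply the output projection
    O = (A @ V).transpose(0, 2, 1, 3).reshape(B, T, D)
    return O @ Wo, A
```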
Section 3 — Feed-forward network (FFN)
Task 3.1
Implement FFN forward: FFN(x) = (gelu(x W1 + b1)) W2 + b2.
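The formula translates directly into two matrix multiplies around the activation; a sketch using the tanh GELU approximation:

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, W1, b1, W2, b2):
    # (B, T, D) @ (D, D_ff) -> expand, activate, then project back to D.
    return gelu(x @ W1 + b1) @ W2 + b2
```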
Section 4 — Transformer block (Pre-LN)
Pre-LN block:
- x1 = x + MHA(LN(x))
- x2 = x1 + FFN(LN(x1))
Task 4.1
Implement transformer_block_forward(x, params, n_heads, attn_mask=None) returning (y, attn_weights).
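A self-contained sketch assembling the whole Pre-LN block. The `params` dict keys used here (`ln1_g`, `ln1_b`, `Wq`, `W1`, `b1`, etc.) are assumptions about the lab's parameter layout, not a required naming:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def transformer_block_forward(x, params, n_heads, attn_mask=None):
    p = params
    B, T, D = x.shape
    Dh = D // n_heads
    def split(z):  # (B, T, D) -> (B, h, T, Dh)
        return z.reshape(B, T, n_heads, Dh).transpose(0, 2, 1, 3)

    # Attention sublayer: x1 = x + MHA(LN(x))
    h = layer_norm(x, p["ln1_g"], p["ln1_b"])
    Q, K, V = split(h @ p["Wq"]), split(h @ p["Wk"]), split(h @ p["Wv"])
    scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(Dh)
    if attn_mask is not None:
        scores = np.where(attn_mask == 0, -1e9, scores)
    A = softmax(scores)
    O = (A @ V).transpose(0, 2, 1, 3).reshape(B, T, D)
    x1 = x + O @ p["Wo"]

    # FFN sublayer: x2 = x1 + FFN(LN(x1))
    h2 = layer_norm(x1, p["ln2_g"], p["ln2_b"])
    x2 = x1 + gelu(h2 @ p["W1"] + p["b1"]) @ p["W2"] + p["b2"]
    return x2, A
```

Note the residual adds use the unnormalized stream (x and x1), which is what makes the block Pre-LN rather than Post-LN.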