
Transformer Architecture

Hard · NLP & Transformers · W10 D1


Tasks

  0. Utilities (softmax, LayerNorm, activation)
  1. Embeddings + positional encoding
  2. Multi-head self-attention
  3. Feed-forward network (FFN)
  4. Transformer block (Pre-LN)
Python 3 — Notebook
1
Dataset & Setup

Transformer Architecture — Student Lab

Implement a minimal Transformer encoder-style block forward pass in NumPy:

  • embeddings + sinusoidal positional encoding
  • multi-head self-attention (from scratch)
  • FFN
  • residuals + layer norm (Pre-LN)

Focus: shapes, masking, and correctness.

Section 0 — Utilities (softmax, LayerNorm, activation)
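One possible sketch of the three utilities this section asks for. The tanh approximation of GELU is an assumption; the lab may accept the exact (erf-based) form as well.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize over the last (feature) dimension, then scale and shift.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def gelu(x):
    # Tanh approximation of GELU, as used in GPT-style models.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))
```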


Section 1 — Embeddings + positional encoding

Task 1.1

Implement sinusoidal positional encoding pos_enc(T, D) returning (T,D). Use the standard formula with sin/cos on even/odd dims.
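A sketch of the standard sinusoidal encoding, assuming D is even (sin on even dims, cos on odd dims, frequencies 10000^(-2i/D)):

```python
import numpy as np

def pos_enc(T, D):
    # Positions (T, 1) against half-dim frequencies (1, D/2).
    pos = np.arange(T)[:, None]               # (T, 1)
    i = np.arange(0, D, 2)[None, :]           # (1, D/2)
    angles = pos / np.power(10000.0, i / D)   # (T, D/2)
    pe = np.zeros((T, D))
    pe[:, 0::2] = np.sin(angles)              # even dims -> sin
    pe[:, 1::2] = np.cos(angles)              # odd dims  -> cos
    return pe
```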

Task 1.2

Implement token embedding lookup. Given tokens (B,T) and embedding table E (V,D), return X (B,T,D).
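A minimal sketch: NumPy fancy indexing does the lookup directly, since indexing E with an integer array of shape (B, T) returns one D-row per token id. (Some Transformer implementations additionally scale by sqrt(D); the task text does not ask for that, so it is omitted here.)

```python
import numpy as np

def embed(tokens, E):
    # tokens: (B, T) int ids; E: (V, D) table -> X: (B, T, D).
    return E[tokens]
```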

Section 2 — Multi-head self-attention

Task 2.1

Implement masks:

  • causal_mask(T) -> (1, 1, T, T)
  • padding_mask(lengths, T) -> (B, 1, 1, T)

Mask convention: 1 = keep, 0 = mask.
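One way to build these masks as float 0/1 arrays (the expected dtype is an assumption) with the shapes and keep/mask convention given above:

```python
import numpy as np

def causal_mask(T):
    # Lower-triangular: query t may attend to keys <= t. 1=keep, 0=mask.
    return np.tril(np.ones((T, T)))[None, None, :, :]            # (1, 1, T, T)

def padding_mask(lengths, T):
    # Keep key positions j < length for each sequence in the batch.
    keep = np.arange(T)[None, :] < np.asarray(lengths)[:, None]  # (B, T)
    return keep[:, None, None, :].astype(np.float64)             # (B, 1, 1, T)
```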
Task 2.2

Implement head split/merge utilities.
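A sketch of the usual reshape/transpose pair, assuming D divides evenly into n_heads heads of size Dh = D // n_heads:

```python
import numpy as np

def split_heads(x, n_heads):
    # (B, T, D) -> (B, h, T, Dh) with D = h * Dh.
    B, T, D = x.shape
    Dh = D // n_heads
    return x.reshape(B, T, n_heads, Dh).transpose(0, 2, 1, 3)

def merge_heads(x):
    # (B, h, T, Dh) -> (B, T, h * Dh); inverse of split_heads.
    B, h, T, Dh = x.shape
    return x.transpose(0, 2, 1, 3).reshape(B, T, h * Dh)
```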

Task 2.3

Implement scaled dot-product attention over heads. Inputs: Q,K,V each shape (B,h,T,Dh). Mask shape should be broadcastable to (B,h,T,T). Return (O,A) where O=(B,h,T,Dh), A=(B,h,T,T).
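A possible sketch with the stated shapes; masked positions get a large negative score (-1e9) before the softmax, and the stable softmax is inlined so the block stands alone:

```python
import numpy as np

def attention(Q, K, V, mask=None):
    # Q, K, V: (B, h, T, Dh); scores: (B, h, T, T).
    Dh = Q.shape[-1]
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(Dh)
    if mask is not None:
        # mask broadcastable to (B, h, T, T); 1=keep, 0=mask.
        scores = np.where(mask == 1, scores, -1e9)
    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    A = np.exp(scores)
    A = A / A.sum(axis=-1, keepdims=True)
    return A @ V, A
```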

Task 2.4

Implement a multi-head self-attention forward pass:

  • project Q,K,V with Wq,Wk,Wv
  • split into heads
  • apply attention
  • merge heads
  • output projection Wo

Return (Y, A).
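The steps above can be sketched in one function, assuming square (D, D) projection weights and no biases (the lab's parameterization may differ):

```python
import numpy as np

def mha_forward(x, Wq, Wk, Wv, Wo, n_heads, mask=None):
    # x: (B, T, D); Wq/Wk/Wv/Wo: (D, D).
    B, T, D = x.shape
    Dh = D // n_heads

    def split(z):  # (B, T, D) -> (B, h, T, Dh)
        return z.reshape(B, T, n_heads, Dh).transpose(0, 2, 1, 3)

    Q, K, V = split(x @ Wq), split(x @ Wk), split(x @ Wv)
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(Dh)
    if mask is not None:
        scores = np.where(mask == 1, scores, -1e9)   # 1=keep, 0=mask
    scores = scores - scores.max(axis=-1, keepdims=True)
    A = np.exp(scores)
    A = A / A.sum(axis=-1, keepdims=True)            # (B, h, T, T)
    O = A @ V                                        # (B, h, T, Dh)
    Y = O.transpose(0, 2, 1, 3).reshape(B, T, D) @ Wo  # merge heads + project
    return Y, A
```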
Section 3 — Feed-forward network (FFN)

Task 3.1

Implement FFN forward: FFN(x) = (gelu(x W1 + b1)) W2 + b2.
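A direct sketch of the formula, with the tanh-approximate GELU inlined so the block stands alone (assumed shapes: W1 (D, D_ff), W2 (D_ff, D)):

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def ffn_forward(x, W1, b1, W2, b2):
    # x: (B, T, D) -> (B, T, D) via the expansion dim D_ff.
    return gelu(x @ W1 + b1) @ W2 + b2
```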

Section 4 — Transformer block (Pre-LN)

Pre-LN block:

  1. x1 = x + MHA(LN(x))
  2. x2 = x1 + FFN(LN(x1))

Task 4.1

Implement transformer_block_forward(x, params, n_heads, attn_mask=None) returning (y, attn_weights).
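One possible end-to-end sketch of the two Pre-LN sublayers. The params layout is an assumption: a dict with keys Wq, Wk, Wv, Wo, W1, b1, W2, b2, ln1_g, ln1_b, ln2_g, ln2_b (the lab may use different names or a different container):

```python
import numpy as np

def _softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def _ln(x, g, b, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return g * (x - mu) / np.sqrt(var + eps) + b

def _gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def transformer_block_forward(x, params, n_heads, attn_mask=None):
    B, T, D = x.shape
    Dh = D // n_heads
    p = params

    def split(z):  # (B, T, D) -> (B, h, T, Dh)
        return z.reshape(B, T, n_heads, Dh).transpose(0, 2, 1, 3)

    # Sublayer 1: x1 = x + MHA(LN(x))
    h = _ln(x, p["ln1_g"], p["ln1_b"])
    Q, K, V = split(h @ p["Wq"]), split(h @ p["Wk"]), split(h @ p["Wv"])
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(Dh)
    if attn_mask is not None:
        scores = np.where(attn_mask == 1, scores, -1e9)  # 1=keep, 0=mask
    A = _softmax(scores)                                 # (B, h, T, T)
    O = (A @ V).transpose(0, 2, 1, 3).reshape(B, T, D)   # merge heads
    x1 = x + O @ p["Wo"]

    # Sublayer 2: x2 = x1 + FFN(LN(x1))
    h2 = _ln(x1, p["ln2_g"], p["ln2_b"])
    x2 = x1 + _gelu(h2 @ p["W1"] + p["b1"]) @ p["W2"] + p["b2"]
    return x2, A
```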

