# Transformer Architecture — Theory & Concepts

> **Read this before attempting the exercises.** The Transformer is the backbone of nearly every modern large language model, from BERT to GPT.

---

## 1. Architecture Overview

![Transformer block overview](/images/w10-d1-transformer-block-overview.svg)

```text
Input tokens -> Embedding + Positional Encoding
                    |
            +----------------+
            |  Multi-Head    | <-- Self-Attention
            |  Attention     |
            +----------------+
            |  Add & Norm    | <-- Residual + LayerNorm
            +----------------+
            |  Feed-Forward  | <-- 2-layer MLP
            |  Network       |
            +----------------+
            |  Add & Norm    |
            +----------------+
                    |
              x N layers
                    |
            Output logits
```

---

## 2. Multi-Head Attention

Instead of one attention function, use multiple heads that attend to different aspects:

![Multi-head split and concat](/images/w10-d1-multihead-split-concat.svg)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        B, T, C = x.shape
        # Project, then split the model dimension into heads: (B, n_heads, T, d_k)
        Q = self.W_q(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        K = self.W_k(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        V = self.W_v(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        # Scaled dot-product attention per head
        scores = (Q @ K.transpose(-2, -1)) / (self.d_k ** 0.5)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        attn = F.softmax(scores, dim=-1)
        out = attn @ V
        # Merge heads back: (B, T, d_model)
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        return self.W_o(out)
```

**Why multiple heads?** Each head can learn a different attention pattern: one for syntax, one for coreference, one for semantic similarity, and so on.
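The head-splitting reshapes in the `forward` method above are where most shape bugs happen. Here they are in isolation, with the dimensions written out (the small sizes here are arbitrary examples):

```python
import torch

B, T, d_model, n_heads = 2, 5, 16, 4   # batch, sequence length, model dim, heads
d_k = d_model // n_heads
x = torch.randn(B, T, d_model)

# Split the model dimension into heads: (B, T, d_model) -> (B, n_heads, T, d_k)
heads = x.view(B, T, n_heads, d_k).transpose(1, 2)
print(heads.shape)  # torch.Size([2, 4, 5, 4])

# Merge back: (B, n_heads, T, d_k) -> (B, T, d_model)
merged = heads.transpose(1, 2).contiguous().view(B, T, d_model)
print(torch.equal(merged, x))  # True -- the round trip is lossless
```

The split-and-merge is lossless: each head just sees a different `d_k`-sized slice of every token's vector.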

---

## 3. Positional Encoding

Attention is permutation-invariant. Positional encodings add position information:

![Positional encoding intuition](/images/w10-d1-positional-encoding-intuition.svg)

```python
import math
import torch

def positional_encoding(max_len, d_model):
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len).unsqueeze(1).float()
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)  # even dims: sine
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dims: cosine
    return pe
```
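In use, the encoding is simply added to the token embeddings before the first Transformer block. A quick sketch (the function is restated here so the snippet is self-contained; the example sizes are arbitrary):

```python
import math
import torch

def positional_encoding(max_len, d_model):
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len).unsqueeze(1).float()
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

pe = positional_encoding(max_len=50, d_model=16)
emb = torch.randn(1, 50, 16)     # token embeddings, batch of 1
x = emb + pe.unsqueeze(0)        # broadcast-add position information
print(x.shape)                   # torch.Size([1, 50, 16])

# Because sin^2 + cos^2 = 1 per dimension pair, every position vector
# has the same squared norm: d_model / 2
print(torch.allclose(pe.pow(2).sum(dim=1), torch.full((50,), 8.0)))  # True
```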

Modern models use **learned positional embeddings** or **RoPE** (Rotary Position Embedding).
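To see concretely why positions are needed at all, here is a minimal demonstration of permutation equivariance, using plain single-head self-attention with no learned weights: shuffling the input tokens just shuffles the outputs in exactly the same way, so without positional encodings the model cannot tell word order.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, d = 6, 8
x = torch.randn(T, d)

def self_attention(x):
    # Scaled dot-product self-attention with Q = K = V = x
    scores = (x @ x.T) / d ** 0.5
    return F.softmax(scores, dim=-1) @ x

perm = torch.randperm(T)
out = self_attention(x)
out_perm = self_attention(x[perm])

# Permuting the inputs merely permutes the outputs: no position information
print(torch.allclose(out[perm], out_perm, atol=1e-6))  # True
```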

---

## 4. Feed-Forward Network

Each position is processed independently by a 2-layer MLP:

```python
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),      # expand (typically d_ff = 4 * d_model)
            nn.GELU(),
            nn.Linear(d_ff, d_model),      # project back
        )

    def forward(self, x):
        return self.net(x)                 # applied to each position independently
```
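Putting the pieces from the diagram together, a full block is attention plus the MLP, each wrapped in a residual connection and a LayerNorm. The sketch below uses PyTorch's built-in `nn.MultiheadAttention` so it stands alone, and uses the *pre-norm* arrangement (normalize before each sub-layer) that most modern implementations prefer over the original post-norm layout:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One block from the diagram: Multi-Head Attention and FFN, each with Add & Norm."""
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        # Pre-norm: normalize, transform, then add the residual
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, need_weights=False)
        x = x + a
        x = x + self.ff(self.ln2(x))
        return x

block = TransformerBlock(d_model=64, n_heads=4, d_ff=256)
out = block(torch.randn(2, 10, 64))
print(out.shape)  # torch.Size([2, 10, 64]) -- same shape in, same shape out
```

Because each block maps `(B, T, d_model)` to `(B, T, d_model)`, blocks can be stacked N times, as the `x N layers` annotation in the overview indicates.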

---

## 5. Encoder vs Decoder

| Component | Attention Type | Use Case |
|-----------|---------------|----------|
| **Encoder** | Bidirectional (see all tokens) | BERT, classification, NER |
| **Decoder** | Causal (see only past tokens) | GPT, text generation |
| **Encoder-Decoder** | Cross-attention | T5, translation |
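The only mechanical difference between bidirectional and causal attention is the mask. A decoder-style causal mask is a lower-triangular matrix, plugged into the `masked_fill(mask == 0, -1e9)` step shown earlier; a small sketch:

```python
import torch

T = 5
# Position t may attend only to positions <= t
causal_mask = torch.tril(torch.ones(T, T))

scores = torch.randn(T, T)                          # stand-in attention scores
scores = scores.masked_fill(causal_mask == 0, -1e9)  # block future positions
attn = torch.softmax(scores, dim=-1)

# Everything above the diagonal is (numerically) zero: no attention to the future
print(torch.allclose(attn.triu(1), torch.zeros(T, T)))  # True
print(torch.allclose(attn.sum(dim=-1), torch.ones(T)))  # rows still sum to 1
```

An encoder (BERT-style) simply omits the mask, so every token attends to every other token.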

---

## 6. Key Dimensions

| Hyperparameter | GPT-2 Small | BERT-Base | GPT-3 |
|---------------|-------------|-----------|-------|
| d_model | 768 | 768 | 12,288 |
| n_heads | 12 | 12 | 96 |
| n_layers | 12 | 12 | 96 |
| d_ff | 3072 | 3072 | 49,152 |
| Parameters | 117M | 110M | 175B |
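The parameter counts in the table follow roughly from the other columns. A back-of-the-envelope sketch (ignoring biases, LayerNorms, and positional embeddings; the `vocab_size=50257` default is GPT-2's vocabulary, and published counts differ slightly depending on what is included):

```python
def approx_params(d_model, n_layers, d_ff, vocab_size=50257):
    """Rough Transformer parameter count: attention + FFN weights per layer, plus embeddings."""
    attn = 4 * d_model * d_model    # W_q, W_k, W_v, W_o
    ffn = 2 * d_model * d_ff        # two linear layers
    return n_layers * (attn + ffn) + vocab_size * d_model

# GPT-2 Small dimensions from the table
print(f"{approx_params(768, 12, 3072) / 1e6:.0f}M")  # 124M, in the right ballpark
```

Note that attention and FFN weights scale with `d_model` squared, which is why GPT-3's 16x larger `d_model` (with 8x the layers) yields roughly a 1500x jump in parameters.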

---

*Now you're ready to attempt the exercises. Switch to the **Exercise** tab and start coding!*
