S30 AI Lab — www.thes30.com
#40

Embeddings & Representation

Hard · NLP & Transformers · W9 D1


Tasks

1. Tokenization + vocabulary
2. Co-occurrence matrix
3. PMI / PPMI
4. SVD embeddings
5. Skip-gram with Negative Sampling (SGNS)
6. Tiny analogy checks
Python 3 — Notebook
Dataset & Setup

Embeddings & Representation — Student Lab

You will implement classic embedding methods offline:

  • โ—co-occurrence matrices
  • โ—PMI / PPMI
  • โ—SVD embeddings
  • โ—skip-gram with negative sampling (Word2Vec-style)

Focus: correctness, determinism, and interview-ready intuition.

Corpus (offline)

We include repeated patterns so embeddings have signal.
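The actual corpus loads inside the notebook editor and is not reproduced on this page. As a hypothetical stand-in for working offline, a toy corpus with the same deliberately-repeated-pattern property could look like:

```python
# Hypothetical stand-in corpus (the lab's real corpus is loaded by the notebook).
# Repetition inflates co-occurrence counts so the tiny vocabulary carries signal.
CORPUS = [
    "the king rules the kingdom",
    "the queen rules the kingdom",
    "the man walks in the city",
    "the woman walks in the city",
    "the king is a man",
    "the queen is a woman",
] * 10  # repeat so counts are large enough for PMI / SGNS to be stable
```
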


Section 0 — Tokenization + vocabulary

Task 0.1

Implement tokenize(s) that extracts lowercase alphabetic tokens. Hint: regex [a-z]+.

Task 0.2

Build vocab with min_count and reserve id 0 for UNK. Return (word2id, id2word, counts).
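One possible reference sketch for both tasks. The literal `"<unk>"` token string and the alphabetical ordering of ids are choices of this sketch, not mandated by the task; only "UNK at id 0" is required.

```python
import re
from collections import Counter

def tokenize(s):
    """Task 0.1: lowercase alphabetic tokens only (regex [a-z]+)."""
    return re.findall(r"[a-z]+", s.lower())

def build_vocab(texts, min_count=1):
    """Task 0.2: count tokens, keep those with count >= min_count,
    reserve id 0 for the UNK token. Returns (word2id, id2word, counts)."""
    counts = Counter(tok for t in texts for tok in tokenize(t))
    # Sorted order makes ids deterministic across runs.
    id2word = ["<unk>"] + sorted(w for w, c in counts.items() if c >= min_count)
    word2id = {w: i for i, w in enumerate(id2word)}
    return word2id, id2word, counts
```
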


Section 1 — Co-occurrence matrix

We build a target-context co-occurrence matrix C with a sliding window.

Task 1.1

Implement build_cooccurrence(texts, word2id, window=2) returning C of shape (V,V). Count context words within +/- window positions (excluding the center token).

FAANG gotcha: for large corpora, this should be sparse; we use dense here for simplicity.
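A dense sketch matching the stated interface. The `tokenize` parameter (defaulting to whitespace split) is an assumption made so the snippet stays self-contained; in the lab you would pass the Section 0 tokenizer.

```python
import numpy as np

def word_id(w, word2id):
    """Map a word to its id, falling back to UNK (id 0)."""
    return word2id.get(w, 0)

def build_cooccurrence(texts, word2id, window=2, tokenize=str.split):
    """Dense (V, V) target-context counts with a +/- `window` sliding window.
    The center token itself is excluded from its own context."""
    V = len(word2id)
    C = np.zeros((V, V), dtype=np.float64)
    for t in texts:
        ids = [word_id(w, word2id) for w in tokenize(t)]
        for i, center in enumerate(ids):
            lo, hi = max(0, i - window), min(len(ids), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    C[center, ids[j]] += 1.0
    return C
```
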


Section 2 — PMI / PPMI

PMI(i,j) = log( P(i,j) / (P(i) P(j)) ). PPMI = max(PMI, 0).

Task 2.1

Implement ppmi(C) returning a matrix of the same shape as C. Use smoothing to avoid log(0).
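A minimal sketch. Additive smoothing inside the log is one of several valid ways to avoid log(0); the `eps` value here is an assumption of this sketch.

```python
import numpy as np

def ppmi(C, eps=1e-8):
    """PPMI(i,j) = max(log(P(i,j) / (P(i) P(j))), 0), with eps-smoothing."""
    total = C.sum()
    p_ij = C / total
    p_i = p_ij.sum(axis=1, keepdims=True)  # row (target) marginals
    p_j = p_ij.sum(axis=0, keepdims=True)  # column (context) marginals
    pmi = np.log((p_ij + eps) / (p_i * p_j + eps))
    return np.maximum(pmi, 0.0)  # clip negatives -> PPMI
```
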


Section 3 — SVD embeddings

We compute low-rank embeddings from PPMI.

Task 3.1

Implement svd_embeddings(M, k) returning E of shape (V,k). Use np.linalg.svd (dense).
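A sketch using np.linalg.svd. Scaling U by the top-k singular values is one common convention; variants weight by sqrt(S) or drop S entirely, and the task does not pin this down.

```python
import numpy as np

def svd_embeddings(M, k):
    """Rank-k embeddings from (e.g. PPMI) matrix M: rows of U[:, :k] * S[:k]."""
    U, S, Vt = np.linalg.svd(M, full_matrices=False)  # singular values descending
    return U[:, :k] * S[:k]  # shape (V, k)
```
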


Task 3.2 — Similarity

Implement cosine similarity and nearest neighbors for a query word.
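The task gives no exact signatures, so the `nearest` helper and its parameters below are assumptions of this sketch:

```python
import numpy as np

def cosine(u, v, eps=1e-9):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

def nearest(query, E, word2id, id2word, topn=3):
    """Top-n cosine neighbors of `query`'s embedding, excluding the query itself."""
    q = E[word2id[query]]
    sims = E @ q / (np.linalg.norm(E, axis=1) * np.linalg.norm(q) + 1e-9)
    order = np.argsort(-sims)  # descending similarity
    return [(id2word[i], float(sims[i])) for i in order if id2word[i] != query][:topn]
```
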


Task 3.3 — Analogy bridge

Implement analogy(a,b,c,E) and test on SVD embeddings before SGNS.


Section 4 — Skip-gram with Negative Sampling (SGNS)

We train center and context embeddings so that true (center, context) pairs have high dot product while randomly sampled negative pairs have low dot product.

Task 4.1

Create training pairs (center_id, context_id) from the corpus for a given window. Return arrays centers, contexts.
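A sketch mirroring the co-occurrence window logic; as before, the `tokenize` parameter defaulting to whitespace split is an assumption made to keep the snippet self-contained.

```python
import numpy as np

def skipgram_pairs(texts, word2id, window=2, tokenize=str.split):
    """All (center_id, context_id) pairs within +/- `window` positions.
    Returns two parallel int arrays: centers, contexts."""
    centers, contexts = [], []
    for t in texts:
        ids = [word2id.get(w, 0) for w in tokenize(t)]  # UNK -> 0
        for i, c in enumerate(ids):
            lo, hi = max(0, i - window), min(len(ids), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    centers.append(c)
                    contexts.append(ids[j])
    return np.array(centers), np.array(contexts)
```
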


Task 4.2

Implement the negative sampling distribution p(w) ∝ count(w)^{0.75} and a sampler. Hint: build a CDF and sample with rng.random + np.searchsorted.
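A sketch following the hint; `side="right"` in searchsorted is a deliberate choice so that ids with zero count (probability 0) can never be drawn.

```python
import numpy as np

def neg_sampling_cdf(counts_per_id, power=0.75):
    """Unigram distribution raised to the 0.75 power, returned as a CDF over ids."""
    p = np.asarray(counts_per_id, dtype=np.float64) ** power
    p /= p.sum()
    return np.cumsum(p)  # cdf[-1] == 1.0

def sample_negatives(cdf, k, rng):
    """Draw k negative ids by inverse-CDF sampling (rng.random + searchsorted)."""
    return np.searchsorted(cdf, rng.random(k), side="right")
```
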


Task 4.3

Train SGNS embeddings.

Implement:

  • โ—sigmoid
  • โ—one SGD step for a (center, context) positive pair and K negatives

Hints:

  • โ—maximize log ฯƒ(u_c ยท v_pos) + ฮฃ log ฯƒ(-u_c ยท v_neg)
  • โ—use gradients on dot products
  • โ—keep learning rate small

Section 5 — Tiny analogy checks

Compute a - b + c and find the nearest neighbor.

Task 5.1

Implement analogy(a,b,c,E) and test on a few toy examples, comparing results on the SVD and SGNS embeddings.
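A sketch; the extra word2id/id2word parameters go beyond the stated analogy(a,b,c,E) signature and are assumptions made to keep the snippet self-contained. Excluding the three input words from the candidates is the usual convention for analogy evaluation.

```python
import numpy as np

def analogy(a, b, c, E, word2id, id2word, topn=1):
    """Nearest cosine neighbors of E[a] - E[b] + E[c], excluding a, b, c."""
    q = E[word2id[a]] - E[word2id[b]] + E[word2id[c]]
    sims = E @ q / (np.linalg.norm(E, axis=1) * np.linalg.norm(q) + 1e-9)
    ranked = [id2word[i] for i in np.argsort(-sims) if id2word[i] not in (a, b, c)]
    return ranked[:topn]
```
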

