S30 AI Lab — www.thes30.com
#40

Embeddings & Representation

Hard · NLP & Transformers · W9 D1


Tasks

1. Tokenization + vocabulary
2. Co-occurrence matrix
3. PMI / PPMI
4. SVD embeddings
5. Skip-gram with Negative Sampling (SGNS)
6. Tiny analogy checks
Python 3 — Notebook
Dataset & Setup

Embeddings & Representation — Student Lab

You will implement classic embedding methods offline:

  • โ—co-occurrence matrices
  • โ—PMI / PPMI
  • โ—SVD embeddings
  • โ—skip-gram with negative sampling (Word2Vec-style)

Focus: correctness, determinism, and interview-ready intuition.

Corpus (offline)

We include repeated patterns so embeddings have signal.
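The actual corpus loads inside the notebook editor and is not reproduced on this page. As a hypothetical stand-in for working offline, a toy corpus with the same deliberately-repeated-pattern property could look like:

```python
# Hypothetical stand-in corpus (the lab's real corpus is loaded by the notebook).
# Repetition inflates co-occurrence counts so the tiny vocabulary carries signal.
CORPUS = [
    "the king rules the kingdom",
    "the queen rules the kingdom",
    "the man walks in the city",
    "the woman walks in the city",
    "the king is a man",
    "the queen is a woman",
] * 10  # repeat so counts are large enough for PMI / SGNS to be stable
```
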


Section 0 — Tokenization + vocabulary

Task 0.1

Implement tokenize(s) that extracts lowercase alphabetic tokens. Hint: regex [a-z]+.

Task 0.2

Build vocab with min_count and reserve id 0 for UNK. Return (word2id, id2word, counts).
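One possible reference sketch for both tasks. The literal `"<unk>"` token string and the alphabetical ordering of ids are choices of this sketch, not mandated by the task; only "UNK at id 0" is required.

```python
import re
from collections import Counter

def tokenize(s):
    """Task 0.1: lowercase alphabetic tokens only (regex [a-z]+)."""
    return re.findall(r"[a-z]+", s.lower())

def build_vocab(texts, min_count=1):
    """Task 0.2: count tokens, keep those with count >= min_count,
    reserve id 0 for the UNK token. Returns (word2id, id2word, counts)."""
    counts = Counter(tok for t in texts for tok in tokenize(t))
    # Sorted order makes ids deterministic across runs.
    id2word = ["<unk>"] + sorted(w for w, c in counts.items() if c >= min_count)
    word2id = {w: i for i, w in enumerate(id2word)}
    return word2id, id2word, counts
```
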


Section 1 — Co-occurrence matrix

We build a target-context co-occurrence matrix C with a sliding window.

Task 1.1

Implement build_cooccurrence(texts, word2id, window=2) returning C of shape (V,V). Count context words within +/- window positions (excluding the center token).

FAANG gotcha: for large corpora, this should be sparse; we use dense here for simplicity.
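A dense sketch matching the stated interface. The `tokenize` parameter (defaulting to whitespace split) is an assumption made so the snippet stays self-contained; in the lab you would pass the Section 0 tokenizer.

```python
import numpy as np

def word_id(w, word2id):
    """Map a word to its id, falling back to UNK (id 0)."""
    return word2id.get(w, 0)

def build_cooccurrence(texts, word2id, window=2, tokenize=str.split):
    """Dense (V, V) target-context counts with a +/- `window` sliding window.
    The center token itself is excluded from its own context."""
    V = len(word2id)
    C = np.zeros((V, V), dtype=np.float64)
    for t in texts:
        ids = [word_id(w, word2id) for w in tokenize(t)]
        for i, center in enumerate(ids):
            lo, hi = max(0, i - window), min(len(ids), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    C[center, ids[j]] += 1.0
    return C
```
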


Section 2 — PMI / PPMI

PMI(i,j) = log( P(i,j) / (P(i) P(j)) ). PPMI = max(PMI, 0).

Task 2.1

Implement ppmi(C) returning a matrix of the same shape as C. Use smoothing to avoid log(0).
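A minimal sketch. Additive smoothing inside the log is one of several valid ways to avoid log(0); the `eps` value here is an assumption of this sketch.

```python
import numpy as np

def ppmi(C, eps=1e-8):
    """PPMI(i,j) = max(log(P(i,j) / (P(i) P(j))), 0), with eps-smoothing."""
    total = C.sum()
    p_ij = C / total
    p_i = p_ij.sum(axis=1, keepdims=True)  # row (target) marginals
    p_j = p_ij.sum(axis=0, keepdims=True)  # column (context) marginals
    pmi = np.log((p_ij + eps) / (p_i * p_j + eps))
    return np.maximum(pmi, 0.0)  # clip negatives -> PPMI
```
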


Section 3 — SVD embeddings

We compute low-rank embeddings from PPMI.

Task 3.1

Implement svd_embeddings(M, k) returning E of shape (V,k). Use np.linalg.svd (dense).
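A sketch using np.linalg.svd. Scaling U by the top-k singular values is one common convention; variants weight by sqrt(S) or drop S entirely, and the task does not pin this down.

```python
import numpy as np

def svd_embeddings(M, k):
    """Rank-k embeddings from (e.g. PPMI) matrix M: rows of U[:, :k] * S[:k]."""
    U, S, Vt = np.linalg.svd(M, full_matrices=False)  # singular values descending
    return U[:, :k] * S[:k]  # shape (V, k)
```
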


Task 3.2 — Similarity

Implement cosine similarity and nearest neighbors for a query word.
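The task gives no exact signatures, so the `nearest` helper and its parameters below are assumptions of this sketch:

```python
import numpy as np

def cosine(u, v, eps=1e-9):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

def nearest(query, E, word2id, id2word, topn=3):
    """Top-n cosine neighbors of `query`'s embedding, excluding the query itself."""
    q = E[word2id[query]]
    sims = E @ q / (np.linalg.norm(E, axis=1) * np.linalg.norm(q) + 1e-9)
    order = np.argsort(-sims)  # descending similarity
    return [(id2word[i], float(sims[i])) for i in order if id2word[i] != query][:topn]
```
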


Task 3.3 — Analogy bridge

Implement analogy(a,b,c,E) and test on SVD embeddings before SGNS.


Section 4 — Skip-gram with Negative Sampling (SGNS)

We train center and context embeddings so that true (center, context) pairs have high dot product while randomly sampled negative pairs have low dot product.

Task 4.1

Create training pairs (center_id, context_id) from the corpus for a given window. Return arrays centers, contexts.
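A sketch mirroring the co-occurrence window logic; as before, the `tokenize` parameter defaulting to whitespace split is an assumption made to keep the snippet self-contained.

```python
import numpy as np

def skipgram_pairs(texts, word2id, window=2, tokenize=str.split):
    """All (center_id, context_id) pairs within +/- `window` positions.
    Returns two parallel int arrays: centers, contexts."""
    centers, contexts = [], []
    for t in texts:
        ids = [word2id.get(w, 0) for w in tokenize(t)]  # UNK -> 0
        for i, c in enumerate(ids):
            lo, hi = max(0, i - window), min(len(ids), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    centers.append(c)
                    contexts.append(ids[j])
    return np.array(centers), np.array(contexts)
```
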


Task 4.2

Implement the negative sampling distribution p(w) ∝ count(w)^{0.75} and a sampler. Hint: build a CDF and sample with rng.random + np.searchsorted.
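A sketch following the hint; `side="right"` in searchsorted is a deliberate choice so that ids with zero count (probability 0) can never be drawn.

```python
import numpy as np

def neg_sampling_cdf(counts_per_id, power=0.75):
    """Unigram distribution raised to the 0.75 power, returned as a CDF over ids."""
    p = np.asarray(counts_per_id, dtype=np.float64) ** power
    p /= p.sum()
    return np.cumsum(p)  # cdf[-1] == 1.0

def sample_negatives(cdf, k, rng):
    """Draw k negative ids by inverse-CDF sampling (rng.random + searchsorted)."""
    return np.searchsorted(cdf, rng.random(k), side="right")
```
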


Task 4.3

Train SGNS embeddings.

Implement:

  • โ—sigmoid
  • โ—one SGD step for a (center, context) positive pair and K negatives

Hints:

  • โ—maximize log ฯƒ(u_c ยท v_pos) + ฮฃ log ฯƒ(-u_c ยท v_neg)
  • โ—use gradients on dot products
  • โ—keep learning rate small

Section 5 — Tiny analogy checks

Compute a - b + c and find the nearest neighbor.

Task 5.1

Implement analogy(a,b,c,E) and test on a few toy examples, comparing results on the SVD and SGNS embeddings.
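A sketch; the extra word2id/id2word parameters go beyond the stated analogy(a,b,c,E) signature and are assumptions made to keep the snippet self-contained. Excluding the three input words from the candidates is the usual convention for analogy evaluation.

```python
import numpy as np

def analogy(a, b, c, E, word2id, id2word, topn=1):
    """Nearest cosine neighbors of E[a] - E[b] + E[c], excluding a, b, c."""
    q = E[word2id[a]] - E[word2id[b]] + E[word2id[c]]
    sims = E @ q / (np.linalg.norm(E, axis=1) * np.linalg.norm(q) + 1e-9)
    ranked = [id2word[i] for i in np.argsort(-sims) if id2word[i] not in (a, b, c)]
    return ranked[:topn]
```
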

