Embeddings & Representation: Student Lab
You will implement classic embedding methods offline:
- co-occurrence matrices
- PMI / PPMI
- SVD embeddings
- skip-gram with negative sampling (Word2Vec-style)
Focus: correctness, determinism, and interview-ready intuition.
Corpus (offline)
We include repeated patterns so embeddings have signal.
Section 0: Tokenization + vocabulary
Task 0.1
Implement tokenize(s) that extracts lowercase alphabetic tokens.
Hint: regex [a-z]+.
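A minimal sketch matching the hint, so you can check your own version against it:

```python
import re

def tokenize(s):
    """Lowercase alphabetic tokens only; digits and punctuation are dropped."""
    return re.findall(r"[a-z]+", s.lower())
```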
Task 0.2
Build vocab with min_count and reserve id 0 for UNK.
Return (word2id, id2word, counts).
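One way to sketch it. The function name `build_vocab`, the UNK string `"<unk>"`, and the count-then-alphabetical tie-break are our choices, not requirements, but a deterministic ordering is what makes the lab reproducible:

```python
from collections import Counter

def build_vocab(texts, min_count=1):
    """texts is a list of token lists (as produced by tokenize()).
    Ids are assigned by descending count, then alphabetically, with
    id 0 reserved for UNK."""
    counts = Counter(tok for toks in texts for tok in toks)
    kept = sorted((w for w, c in counts.items() if c >= min_count),
                  key=lambda w: (-counts[w], w))  # deterministic ordering
    word2id = {"<unk>": 0}
    for w in kept:
        word2id[w] = len(word2id)
    id2word = {i: w for w, i in word2id.items()}
    return word2id, id2word, counts
```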
Section 1: Co-occurrence matrix
We build a target-context co-occurrence matrix C with a sliding window.
Task 1.1
Implement build_cooccurrence(texts, word2id, window=2) returning C of shape (V,V).
Count context words within +/- window positions (excluding the center token).
FAANG gotcha: for large corpora, this should be sparse; we use dense here for simplicity.
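A dense sketch of `build_cooccurrence`, assuming `texts` is a list of token lists and out-of-vocab tokens map to UNK (id 0):

```python
import numpy as np

def build_cooccurrence(texts, word2id, window=2):
    """Dense (V, V) target-context count matrix with a +/- window
    sliding window; the center token itself is not counted."""
    V = len(word2id)
    C = np.zeros((V, V), dtype=np.float64)
    for toks in texts:
        ids = [word2id.get(t, 0) for t in toks]
        for i, center in enumerate(ids):
            lo, hi = max(0, i - window), min(len(ids), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    C[center, ids[j]] += 1.0
    return C
```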
Section 2: PMI / PPMI
PMI(i,j) = log( P(i,j) / (P(i) P(j)) ). PPMI = max(PMI, 0).
Task 2.1
Implement ppmi(C) returning a matrix same shape as C.
Use smoothing to avoid log(0).
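A sketch using a small `eps` inside the log as the smoothing term; the exact smoothing scheme (and the value of `eps`) is a choice, not prescribed by the lab:

```python
import numpy as np

def ppmi(C, eps=1e-12):
    """PPMI from a co-occurrence count matrix:
    PMI(i,j) = log(P(i,j) / (P(i) P(j))), clipped at 0."""
    total = C.sum()
    p_ij = C / max(total, eps)
    p_i = p_ij.sum(axis=1, keepdims=True)   # marginal over rows (targets)
    p_j = p_ij.sum(axis=0, keepdims=True)   # marginal over columns (contexts)
    pmi = np.log((p_ij + eps) / (p_i * p_j + eps))
    return np.maximum(pmi, 0.0)
```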
Section 3: SVD embeddings
We compute low-rank embeddings from PPMI.
Task 3.1
Implement svd_embeddings(M, k) returning E of shape (V,k).
Use np.linalg.svd (dense).
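A sketch. Scaling the columns of U by the full singular values is one common convention; sqrt(S) is another, and either is acceptable here:

```python
import numpy as np

def svd_embeddings(M, k):
    """Rank-k embeddings E = U_k * S_k from a (V, V) matrix."""
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :k] * S[:k]
```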
Task 3.2: Similarity
Implement cosine similarity and nearest neighbors for a query word.
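One possible shape for the helpers (the names `cosine` and `nearest_neighbors` are ours; the task does not fix signatures):

```python
import numpy as np

def cosine(u, v, eps=1e-12):
    """Cosine similarity; eps guards division by zero for zero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

def nearest_neighbors(word, E, word2id, id2word, topn=5):
    """Rank all other vocab words by cosine similarity to `word`."""
    q_id = word2id[word]
    sims = [(cosine(E[q_id], E[i]), id2word[i])
            for i in range(len(E)) if i != q_id]
    sims.sort(reverse=True)
    return [w for _, w in sims[:topn]]
```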
Task 3.3: Analogy bridge
Implement analogy(a,b,c,E) and test on SVD embeddings before SGNS.
Section 4: Skip-gram with Negative Sampling (SGNS)
We train center and context embeddings so that true (center,context) pairs have high dot-product.
Task 4.1
Create training pairs (center_id, context_id) from the corpus for a given window.
Return arrays centers, contexts.
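A sketch (the name `make_pairs` is ours; the task only fixes the outputs). It reuses the same windowing rule as Section 1:

```python
import numpy as np

def make_pairs(texts, word2id, window=2):
    """All (center, context) id pairs within +/- window, center excluded.
    Returns two parallel int arrays."""
    centers, contexts = [], []
    for toks in texts:
        ids = [word2id.get(t, 0) for t in toks]
        for i, c in enumerate(ids):
            for j in range(max(0, i - window), min(len(ids), i + window + 1)):
                if j != i:
                    centers.append(c)
                    contexts.append(ids[j])
    return np.array(centers, dtype=np.int64), np.array(contexts, dtype=np.int64)
```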
Task 4.2
Implement the negative sampling distribution p(w) ∝ count(w)^0.75 and a sampler.
Hint: build a CDF and sample with rng.random + np.searchsorted.
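A sampler sketch following the hint. The name `make_sampler` is an assumption; `side='right'` in `searchsorted` ensures ids with zero probability are never drawn:

```python
import numpy as np

def make_sampler(counts_by_id, power=0.75, seed=0):
    """counts_by_id is a length-V array of word counts indexed by id.
    Returns a function that samples k negative ids at once."""
    probs = np.asarray(counts_by_id, dtype=np.float64) ** power
    probs /= probs.sum()
    cdf = np.cumsum(probs)
    rng = np.random.default_rng(seed)

    def sample(k):
        # invert the CDF: uniform draws in [0, 1) -> ids
        return np.searchsorted(cdf, rng.random(k), side="right")
    return sample
```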
Task 4.3
Train SGNS embeddings.
Implement:
- sigmoid
- one SGD step for a (center, context) positive pair and K negatives
Hints:
- maximize log σ(u_c · v_pos) + Σ log σ(-u_c · v_neg)
- use gradients on the dot products
- keep the learning rate small
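A sketch of the two pieces (the name `sgns_step` is ours). The gradients follow from d/dx log σ(x) = 1 − σ(x); we take an ascent step on the objective, reading the pre-step vectors so the order of updates does not matter:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(U, V, c, pos, negs, lr=0.025):
    """One SGD ascent step on log sigma(u_c . v_pos) + sum log sigma(-u_c . v_neg).
    U holds center vectors, V context vectors; both are updated in place."""
    u = U[c].copy()                    # pre-step center vector
    grad_u = np.zeros_like(u)
    # positive pair: d/du log sigma(u.v) = (1 - sigma(u.v)) v
    g = 1.0 - sigmoid(u @ V[pos])
    grad_u += g * V[pos]
    V[pos] += lr * g * u
    # negatives: d/du log sigma(-u.v) = -sigma(u.v) v
    for n in negs:
        g = sigmoid(u @ V[n])
        grad_u -= g * V[n]
        V[n] -= lr * g * u
    U[c] += lr * grad_u
```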
Section 5: Tiny analogy checks
Compute a - b + c and find the nearest neighbor, conventionally excluding a, b, and c themselves from the candidates.
Task 5.1
Implement analogy(a,b,c,E) and test on a few toy examples.
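A sketch of `analogy`. The extra `word2id`/`id2word` arguments are our assumption, since the task only fixes `analogy(a,b,c,E)`; excluding the three query words from the candidates is the standard convention:

```python
import numpy as np

def analogy(a, b, c, E, word2id, id2word, topn=1):
    """Return the word(s) nearest to E[a] - E[b] + E[c] by cosine."""
    q = E[word2id[a]] - E[word2id[b]] + E[word2id[c]]
    q = q / (np.linalg.norm(q) + 1e-12)
    En = E / (np.linalg.norm(E, axis=1, keepdims=True) + 1e-12)
    sims = En @ q                       # cosine against every row
    exclude = {word2id[a], word2id[b], word2id[c]}
    order = [i for i in np.argsort(-sims) if i not in exclude]
    return [id2word[i] for i in order[:topn]]
```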