Natural Language Processing: Word Vectors
(Many slides borrowed from Richard Socher and Chris Manning)
Lecture plan
- Word representations
- Word vectors (embeddings): the skip-gram algorithm
- Relation to matrix factorization
- Evaluation
Representing words
Representing words
Definition: meaning (Webster dictionary)
- the idea that is represented by a word, phrase, etc.
- the idea that a person wants to express by using words, signs, etc.
- the idea that is expressed in a work of writing, art, etc.
In linguistics: signifier <-> signified (idea or thing) = denotation
Representing words with computers
- A word is the set of meanings it has in a taxonomy (a graph of meanings, e.g. WordNet)
- Hypernym: the is-a relation
- Hyponym: the inverse of hypernym
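As a concrete illustration (not from the original slides), a minimal sketch of querying such a taxonomy with NLTK's WordNet interface; it assumes NLTK is installed and the WordNet data has been downloaded:

```python
# Minimal sketch, assuming `pip install nltk` and nltk.download('wordnet').
from nltk.corpus import wordnet as wn

# Each synset is one sense (meaning) of the word.
for synset in wn.synsets("good")[:3]:
    print(synset.name(), "=", synset.definition())

# Hypernyms walk up the is-a hierarchy; hyponyms walk down it.
dog = wn.synset("dog.n.01")
print(dog.hypernyms())    # e.g. [Synset('canine.n.02'), Synset('domestic_animal.n.01')]
print(dog.hyponyms()[:3])
```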
Drawbacks
- Expensive to build and maintain!
- Subjective (how do you split senses into synsets?)
- Incomplete: missing new and informal senses (wicked, badass, nifty, crack, ace, wizard, genius, ninja)
- Missing functionality: how do you compute word similarity? How do you compose meanings?
Discrete representation
Words are atomic symbols (one-hot representation):
V = {hotel, motel, walk, wife, spouse}
hotel  [1 0 0 0 0]
motel  [0 1 0 0 0]
walk   [0 0 1 0 0]
wife   [0 0 0 1 0]
spouse [0 0 0 0 1]
In practice |V| can be around 100,000.
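To make the drawback on the next slide concrete, a small numpy sketch (the toy vocabulary is from the slide):

```python
import numpy as np

vocab = ["hotel", "motel", "walk", "wife", "spouse"]
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return the one-hot vector for `word` over this toy vocabulary."""
    v = np.zeros(len(vocab))
    v[word_to_id[word]] = 1.0
    return v

# Every pair of distinct words has dot product 0: no notion of similarity.
print(one_hot("hotel") @ one_hot("motel"))   # 0.0
print(one_hot("hotel") @ one_hot("spouse"))  # 0.0
```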
Drawback
- "Barack Obama's wife" should match "Barack Obama's spouse", not "Barack Obama's advisors"
- "Seattle motels" should match "Seattle hotels", not "Seattle attractions"
- But all one-hot word vectors are orthogonal and equidistant
- Goal: word vectors with a natural notion of similarity, e.g. $\langle \text{hotel}, \text{motel} \rangle > \langle \text{hotel}, \text{spouse} \rangle$
Distributional similarity
"You shall know a word by the company it keeps" (Firth, 1957)
- ... cashed a check at the bank across the street ...
- ... that bank holds the mortgage on my home ...
- ... said that the bank raised his forecast for ...
- ... employees of the bank have confessed to the charges ...
Central idea: represent words by their context
Idea 1
word    context
wife    {met: 3, married: 4, children: 2, wedded: 1, ...}
spouse  {met: 2, married: 5, children: 2, kids: 1, ...}
Problem: the context words are themselves atomic symbols, so we miss that married <=> wedded and children <=> kids
Distributed representations
language = [0.278, 0.911, 0.792, 0.177, 0.109, 0.542, 0.0003, ...]
Represent words and contexts as low-dimensional dense vectors
Word vectors
Supervised learning (Intro to ML prerequisite)
- Input: $\{(x_i, y_i)\}_{i=1}^{N}$, $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$
- Output (probabilistic model): $f : \mathcal{X} \to \mathcal{Y}$, $f(x) = \arg\max_y p(y \mid x)$
- Example: train a spam detector from spam and non-spam e-mails.
Word embeddings
... that bank holds the mortgage on my home ...
1. Define a supervised learning task from raw text (no manual annotation!), pairing each center word with the words around it (a generation sketch follows after the list):
   1. (x, y) = (bank, that)
   2. (x, y) = (bank, holds)
   3. (x, y) = (holds, bank)
   4. (x, y) = (holds, the)
   5. (x, y) = (the, holds)
   6. (x, y) = (the, mortgage)
   7. (x, y) = (mortgage, the)
   8. (x, y) = (mortgage, on)
   9. (x, y) = (on, mortgage)
   10. (x, y) = (on, my)
   11. (x, y) = (my, on)
   12. (x, y) = (my, home)
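As referenced above, a sketch (ours, not from the slides) of how such pairs can be generated; `window` is the context half-width:

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, outside) pairs for every window offset j != 0."""
    for t, center in enumerate(tokens):
        for j in range(-window, window + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                yield center, tokens[t + j]

pairs = list(skipgram_pairs("that bank holds the mortgage on my home".split()))
print(pairs[:4])  # [('that', 'bank'), ('that', 'holds'), ('bank', 'that'), ...]
```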
Word embeddings (Intro to ML prerequisite)
2. Define a model for the output given the input, e.g. $p(\text{holds} \mid \text{bank})$:
$$p_\theta(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w=1}^{V} \exp(u_w^\top v_c)}$$
- $u$: vector for the outside word, $v$: vector for the center word, $V$: number of words in the vocabulary, $\theta$: all parameters
- This is a multi-class classification model (how many classes?)
- Number of parameters in the model: $|\theta| = 2Vd$, since $u, v \in \mathbb{R}^d$
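A numpy sketch of this softmax model (the sizes and the random initialization are illustrative, not from the slides):

```python
import numpy as np

def softmax_prob(U, v_c):
    """p(o | c) for every outside word o: softmax over the dot products u_o^T v_c.

    U: (V, d) matrix of outside-word vectors; v_c: (d,) center-word vector."""
    scores = U @ v_c           # (V,) dot products
    scores -= scores.max()     # shift for numerical stability
    exp = np.exp(scores)
    return exp / exp.sum()     # normalize by the partition function

V, d = 10000, 100
rng = np.random.default_rng(0)
U, v_c = rng.normal(size=(V, d)), rng.normal(size=d)
p = softmax_prob(U, v_c)
print(p.shape, p.sum())        # (10000,) ~1.0
```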
Word embeddings (Intro to ML prerequisite)
3. Define an objective function for a corpus of length T:
$$L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \ne 0}} p_\theta(w_{t+j} \mid w_t)$$
$$J(\theta) = -\log L(\theta) = -\sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log p_\theta(w_{t+j} \mid w_t)$$
Find the parameters that minimize $J(\theta)$, i.e., maximize the likelihood.
Word embeddings (Intro to ML prerequisite)
Intuitions:
- What probabilities would maximize the objective?
- Why should similar words have similar vectors?
- Why do we have different parameters for the center word and the outside word?
Worked example: suppose the corpus contains the pair (x, y) twice and (x, z) once, i.e. $c(x, y) = 2$, $c(x, z) = 1$. Writing $p = p(y \mid x)$, so that $p(z \mid x) = 1 - p$, the likelihood is
$$L = p(y \mid x)^2\, p(z \mid x) = p^2 (1 - p)$$
$$\frac{dL}{dp} = 2p - 3p^2 = p(2 - 3p) = 0 \;\Rightarrow\; p(y \mid x) = \tfrac{2}{3},\quad p(z \mid x) = \tfrac{1}{3}$$
The maximum-likelihood probabilities match the empirical counts.
Gradient descent (Intro to ML prerequisite)
- How do we find the right model parameters?
- Start at some point and repeatedly move in the opposite direction of the gradient
Gradient descent (Intro to ML prerequisite)
Example: $f(x) = x^4 + 3x^3 + 2$, with derivative $f'(x) = 4x^3 + 9x^2$
Gradient descent (Intro to ML prerequisite)
We want to minimize:
$$J(\theta) = -\sum_{t=1}^{T} \sum_{j} \log p_\theta(w_{t+j} \mid w_t)$$
Update rule, where $\alpha$ is the step size:
$$\theta_j^{\text{new}} = \theta_j^{\text{old}} - \alpha \frac{\partial J(\theta)}{\partial \theta_j}, \qquad \theta^{\text{new}} = \theta^{\text{old}} - \alpha \nabla J(\theta), \quad \theta \in \mathbb{R}^{2Vd}$$
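A minimal gradient-descent sketch on the one-dimensional example from two slides back (the starting point and step size are arbitrary choices of ours):

```python
def f(x):
    return x**4 + 3 * x**3 + 2

def grad_f(x):
    return 4 * x**3 + 9 * x**2

x, alpha = -1.0, 0.01         # arbitrary start and step size
for _ in range(2000):
    x -= alpha * grad_f(x)    # step against the gradient
print(x, f(x))                # x approaches the minimum at x = -9/4
```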
Stochastic gradient descent (Intro to ML prerequisite)
- For large corpora (billions of tokens) the full update is very slow
- Instead: sample a window t and update the parameters based on that window alone
$$\theta^{\text{new}} = \theta^{\text{old}} - \alpha \nabla J_t(\theta)$$
Deriving the gradient
- Mostly applications of the chain rule
- Let's derive the gradient of $\log p_\theta(w_{t+j} \mid w_t)$ for a window t and a center word
- You will do this again in the assignment (and more)
Whiteboard
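For reference (the whiteboard content itself is not in the deck), a sketch of the center-word gradient the derivation arrives at. By the chain rule,

$$\frac{\partial}{\partial v_c} \log p_\theta(o \mid c)
  = \frac{\partial}{\partial v_c} \Big( u_o^\top v_c - \log \sum_{w=1}^{V} \exp(u_w^\top v_c) \Big)
  = u_o - \sum_{w=1}^{V} p_\theta(w \mid c)\, u_w$$

That is, the observed outside vector minus the expected outside vector under the current model.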
Class 2: recap
- Goal: represent words with low-dimensional vectors
- Approach: define a supervised learning problem from a corpus
- We defined the necessary components for skip-gram:
  - Model (softmax over word labels for each word)
  - Objective (minimize the negative log-likelihood)
  - Optimization with SGD
- We computed the gradient for some parameters by hand
Computational problem
- Computing the partition function is too expensive
- Solution 1: hierarchical softmax (Morin and Bengio, 2005) reduces the computation time to $\log V$ by constructing a binary tree over the vocabulary
- Solution 2: change the model, i.e., skip-gram with negative sampling (home assignment 1)
Logistic regression
(x, y) = ((bank, holds), 1)
(x, y) = ((bank, table), 0)
(x, y) = ((bank, eat), 0)
(x, y) = ((holds, bank), 1)
(x, y) = ((holds, quickly), 0)
(x, y) = ((holds, which), 0)
(x, y) = ((the, mortgage), 1)
(x, y) = ((the, eat), 0)
(x, y) = ((the, who), 0)
What information is lost? $\sum_{o \in V} p(y = 1 \mid o, c) = ?$
Logistic regression (Intro to ML prerequisite)
Model:
$$p_\theta(y = 1 \mid c, o) = \frac{1}{1 + \exp(-u_o^\top v_c)} = \sigma(u_o^\top v_c)$$
$$p_\theta(y = 0 \mid c, o) = 1 - \sigma(u_o^\top v_c) = \sigma(-u_o^\top v_c)$$
Objective (maximize):
$$\sum_{t,j} \Big[ \log \sigma(u_{w_{t+j}}^\top v_{w_t}) + \sum_{k} \mathbb{E}_{w^{(k)} \sim p(w)} \log \sigma(-u_{w^{(k)}}^\top v_{w_t}) \Big]$$
where negative words are sampled from $p(w) = U(w)^{3/4} / Z$, the unigram distribution raised to the 3/4 power.
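A numpy sketch of the per-pair negative-sampling objective and the sampling distribution (the function names are ours):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(v_c, u_o, U_neg):
    """Loss for one (center, outside) pair with k sampled negatives.

    v_c: (d,) center vector; u_o: (d,) true outside vector;
    U_neg: (k, d) outside vectors of the k negative samples."""
    pos = np.log(sigmoid(u_o @ v_c))            # pull the true pair together
    neg = np.log(sigmoid(-U_neg @ v_c)).sum()   # push sampled pairs apart
    return -(pos + neg)                         # we minimize the negative

def neg_sampling_dist(counts):
    """p(w) proportional to unigram counts raised to the 3/4 power."""
    p = counts.astype(float) ** 0.75
    return p / p.sum()
```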
Summary
We defined the three necessary components:
- Model (binary classification)
- Objective (maximum likelihood with negative sampling)
- Optimization method (SGD)
Many variants
- CBOW: predict the center word from its context
- Defining context:
  - How big is the window?
  - Is it sequential or based on syntactic information?
  - A different model for every context position?
  - Use stop words?
Matrix factorization
Matrix factorization (Landauer and Dumais, 1997)
Consider the word-context co-occurrence matrix for a corpus:
"I like deep learning." "I like NLP." "I enjoy flying."

          I  like  enjoy  deep  learning  NLP  flying  .
I         0   2     1      0     0         0    0      0
like      2   0     0      1     0         1    0      0
enjoy     1   0     0      0     0         0    1      0
deep      0   1     0      0     1         0    0      0
learning  0   0     0      1     0         0    0      1
NLP       0   1     0      0     0         0    0      1
flying    0   0     1      0     0         0    0      1
.         0   0     0      0     1         1    1      0
Matrix factorization
Reconstruct the matrix from low-dimensional word and context representations, minimizing:
$$\sum_{i,j} (A_{ij} - \hat{A}^k_{ij})^2 = \| A - \hat{A}_k \|_F^2$$
Matrix factorization
(figure omitted)
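A sketch with numpy's SVD: by the Eckart-Young theorem, the rank-k truncation minimizes exactly the Frobenius objective above (the random matrix here stands in for real co-occurrence counts):

```python
import numpy as np

def svd_word_vectors(A, k):
    """Rows of U_k @ diag(S_k) serve as k-dimensional word vectors."""
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] * S[:k]

rng = np.random.default_rng(0)
A = rng.poisson(1.0, size=(8, 8)).astype(float)  # stand-in co-occurrence counts
W = svd_word_vectors(A, k=2)
print(W.shape)  # (8, 2)
```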
Relation to skip-gram (Levy and Goldberg, 2014)
The output of skip-gram can be viewed as factorizing a word-context matrix:
$$M \approx V U^\top, \qquad M \in \mathbb{R}^{|V| \times |V|}, \quad V, U \in \mathbb{R}^{|V| \times d}$$
What should the values of M be? The factorization gives $M_{co} = \langle v_c, u_o \rangle$.
Relation to skip-gram
Re-write the objective, with $\#(c,o)$, $\#(c)$, $\#(o)$ denoting pair and unigram counts and T the total number of pairs:
$$L(\theta) = \sum_{c,o} \#(c,o)\,\big(\log \sigma(u_o^\top v_c) + k\,\mathbb{E}_{w \sim p}[\log \sigma(-u_w^\top v_c)]\big)$$
$$= \sum_{c,o} \#(c,o) \log \sigma(u_o^\top v_c) + \sum_{c} \#(c)\, k\, \mathbb{E}_{w \sim p}[\log \sigma(-u_w^\top v_c)]$$
$$= \sum_{c,o} \#(c,o) \log \sigma(u_o^\top v_c) + \sum_{c} \#(c)\, k \sum_{o} \frac{\#(o)}{T} \log \sigma(-u_o^\top v_c)$$
$$= \sum_{c,o} \Big( \#(c,o) \log \sigma(u_o^\top v_c) + \#(c)\, k\, \frac{\#(o)}{T} \log \sigma(-u_o^\top v_c) \Big)$$
Relation to skip-gram
Assume the dot products are independent of one another, and let $x = u_o^\top v_c$:
$$\ell(x) = \#(c,o) \log \sigma(x) + \#(c)\, k\, \frac{\#(o)}{T} \log \sigma(-x), \qquad L(\theta) = \sum_{c,o} \ell(x)$$
Setting the derivative to zero:
$$\frac{\partial \ell(x)}{\partial x} = \#(c,o)\, \sigma(-x) - \#(c)\, k\, \frac{\#(o)}{T}\, \sigma(x) = 0$$
$$x = \log \frac{\#(c,o) \cdot T}{\#(c) \cdot \#(o)} - \log k = \log \frac{p(c,o)}{p(c)\, p(o)} - \log k = \text{PMI}(c,o) - \log k$$
Relation to skip-gram
Conclusion:
- Skip-gram with negative sampling implicitly factorizes a shifted PMI matrix
- Many NLP methods factorize the PMI matrix with matrix decomposition methods to obtain dense vectors (a sketch follows below)
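As referenced above, a sketch (ours, not from the slides) of building a shifted positive PMI matrix from co-occurrence counts; negative entries are clipped to zero, as is common in practice:

```python
import numpy as np

def shifted_ppmi(counts, k=1.0):
    """Shifted positive PMI: max(PMI(c, o) - log k, 0), elementwise.

    counts: (V, V) word-context co-occurrence count matrix."""
    total = counts.sum()
    p_co = counts / total
    p_c = counts.sum(axis=1, keepdims=True) / total
    p_o = counts.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_co / (p_c * p_o))
    pmi[~np.isfinite(pmi)] = -np.inf        # zero counts get -inf PMI
    return np.maximum(pmi - np.log(k), 0.0) # shift by log k, clip at 0
```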
Evaluation
Evaluation
Intrinsic vs. extrinsic evaluation:
- Intrinsic: define an artificial task that tries to directly measure the quality of your learning algorithm
- Extrinsic: check whether your output is useful in a real NLP task
Intrinsic evaluation
Word analogies: man::woman <-> king::??, i.e. a::b <-> c::d
Normalize all word vectors to unit length, then:
$$d = \arg\max_i \frac{(x_b - x_a + x_c)^\top x_i}{\| x_b - x_a + x_c \|}$$
Does cosine distance capture semantic and syntactic intuitions?
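A numpy sketch of the analogy lookup (W holds unit-normalized word vectors; the helper names are ours):

```python
import numpy as np

def analogy(a, b, c, W, word_to_id, id_to_word):
    """Answer a::b <-> c::? with the word whose vector is most cosine-similar
    to x_b - x_a + x_c. W is (V, d) with unit-length rows."""
    target = W[word_to_id[b]] - W[word_to_id[a]] + W[word_to_id[c]]
    target /= np.linalg.norm(target)
    sims = W @ target                   # cosine similarities, since rows are unit length
    for w in (a, b, c):                 # the query words themselves are excluded
        sims[word_to_id[w]] = -np.inf
    return id_to_word[int(np.argmax(sims))]
```

With good embeddings, analogy("man", "woman", "king", ...) should return "queen".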
Visualization
(figures omitted)
Word analogies evaluation
(figure omitted)
Human correlation intrinsic evaluation

word 1    word 2    human judgement
tiger     cat        7.35
book      paper      7.46
computer  internet   7.58
plane     car        5.77
stock     phone      1.62
stock     CD         1.31
stock     jaguar     0.92
Human correlation intrinsic evaluation
Compute the Spearman rank correlation between human similarity judgements and model similarity predictions (e.g. on the WordSim-353 dataset), as sketched below.
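A sketch with scipy; the model similarities here are made up for illustration:

```python
from scipy.stats import spearmanr

# Human judgements from the table above, and hypothetical model cosine scores.
human = [7.35, 7.46, 7.58, 5.77, 1.62, 1.31, 0.92]
model = [0.81, 0.77, 0.83, 0.55, 0.20, 0.25, 0.11]  # made-up cosine similarities
rho, _ = spearmanr(human, model)
print(rho)  # close to 1 when the model ranks pairs the way humans do
```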
Extrinsic evaluation
Task: named entity recognition (NER): find mentions of persons, locations, and organizations in text. Using good word representations as features might be useful here.
Extrinsic evaluation
(figure omitted)
Summary
- Words are central to language, and most NLP systems use some word representation
- Graph-based (taxonomy) representations are difficult to manipulate and compose
- One-hot vectors are usable with enough data but lose all similarity and generalization information
- Word embeddings provide a compact way to encode word meaning and similarity
- Skip-gram with negative sampling is a popular approach that learns word embeddings by casting an unsupervised problem as a supervised one
- It is closely related to classical matrix decomposition methods
Assignment 1
- Implement skip-gram with negative sampling
- There is ample literature if you want to consider this topic for a project
Gradient checks
Approximate each partial derivative with a centered difference:
$$\frac{\partial J(\theta)}{\partial \theta_i} \approx \frac{J(\theta + \epsilon e_i) - J(\theta - \epsilon e_i)}{2\epsilon}$$
Compute this for every parameter, for a small epsilon, and compare against the analytic gradient.
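A numpy sketch of the check (the function names are ours):

```python
import numpy as np

def grad_check(J, theta, analytic_grad, eps=1e-4):
    """Max absolute gap between analytic and centered-difference gradients."""
    num_grad = np.zeros_like(theta)
    for i in range(theta.size):
        old = theta.flat[i]
        theta.flat[i] = old + eps
        plus = J(theta)
        theta.flat[i] = old - eps
        minus = J(theta)
        theta.flat[i] = old                          # restore the parameter
        num_grad.flat[i] = (plus - minus) / (2 * eps)
    return np.max(np.abs(num_grad - analytic_grad))

# Example: J(theta) = sum(theta^2) has gradient 2*theta, so the gap is tiny.
theta = np.random.randn(5)
print(grad_check(lambda t: np.sum(t**2), theta, 2 * theta))
```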