Lecture 7: Word Embeddings

1 Lecture 7: Word Embeddings
Kai-Wei Chang, University of Virginia, kw@kwchang.net
Course webpage: Natural Language Processing

2 This lecture
- Learning word vectors (cont.)
- Representation learning in NLP
6501 Natural Language Processing

3 Recap: Latent Semantic Analysis
- Data representation
  - Encode single-relational data in a matrix
    - Co-occurrence (e.g., from a general corpus)
    - Synonyms (e.g., from a thesaurus)
- Factorization
  - Apply SVD to the matrix to find latent components
- Measuring degree of relation
  - Cosine of latent vectors

4 Recap: Mapping to Latent Space via SVD
$C_{d \times n} \approx U_{d \times k} \, \Sigma_{k \times k} \, V^{T}_{k \times n}$
- SVD generalizes the original data
  - Uncovers relationships not explicit in the thesaurus
  - Term vectors projected to k-dim latent space
  - Word similarity: cosine of two column vectors in $\Sigma V^{T}$
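
The mapping above can be sketched in a few lines of NumPy. This is only an illustration: the word list, the counts in the matrix, and the choice k = 2 are made up, and the helper names `latent_word_vectors` and `cosine` are hypothetical. Columns of the matrix stand for words, so the latent word vectors are the columns of the truncated Sigma times V transpose.

```python
import numpy as np

# Hypothetical toy matrix C (d = 4 contexts x n = 5 words); the counts are made up.
words = ["sunny", "rainy", "cloudy", "car", "wheel"]
C = np.array([
    [3.0, 2.0, 3.0, 0.0, 0.0],
    [2.0, 3.0, 2.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 3.0, 2.0],
    [0.0, 1.0, 0.0, 2.0, 3.0],
])

def latent_word_vectors(C, k):
    """Return the k x n matrix Sigma_k V_k^T; column j is the latent vector of word j."""
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    return np.diag(s[:k]) @ Vt[:k, :]

def cosine(a, b):
    """Cosine similarity of two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

Z = latent_word_vectors(C, k=2)
sim_weather = cosine(Z[:, 0], Z[:, 1])  # sunny vs. rainy
sim_mixed = cosine(Z[:, 0], Z[:, 3])    # sunny vs. car
print(sim_weather, sim_mixed)
```

With counts like these, the two weather words should end up much closer to each other in the latent space than a weather word and a vehicle word.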

5 Low rank approximation
- Frobenius norm. C is an m × n matrix:
  $\|C\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} |c_{ij}|^2}$
- Rank of a matrix
  - How many vectors in the matrix are independent of each other
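
A small numerical check of the two definitions, assuming NumPy; the matrices `C` and `B` are arbitrary examples chosen for illustration. The Frobenius norm is computed both from its definition and with `np.linalg.norm`, and the rank with `np.linalg.matrix_rank`.

```python
import numpy as np

# Illustrative 3 x 2 matrix with two independent columns.
C = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [0.0, 3.0]])

# Frobenius norm: square root of the sum of squared entries.
frob_manual = np.sqrt(np.sum(np.abs(C) ** 2))
frob_numpy = np.linalg.norm(C, "fro")

# Rank: number of linearly independent columns (equivalently, rows).
rank_C = np.linalg.matrix_rank(C)

# In B the second column is twice the first, so only one independent vector remains.
B = np.array([[1.0, 2.0],
              [2.0, 4.0]])
rank_B = np.linalg.matrix_rank(B)
print(frob_manual, frob_numpy, rank_C, rank_B)
```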

6 Low rank approximation
- Low rank approximation problem:
  $\min_X \|C - X\|_F \quad \text{s.t.} \quad \operatorname{rank}(X) = k$
- If I can only use k independent vectors to describe the points in the space, what are the best choices?
Essentially, we minimize the reconstruction loss under a low rank constraint.

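8 Low rank approximation
- Assume the rank of C is r
- SVD: $C = U \Sigma V^{T}$, $\Sigma = \operatorname{diag}(\sigma_1, \sigma_2, \ldots, \sigma_r, 0, 0, 0, \ldots, 0)$, with r non-zero values
- Zero out the $r - k$ trailing values: $\Sigma_k = \operatorname{diag}(\sigma_1, \sigma_2, \ldots, \sigma_k, 0, 0, 0, \ldots, 0)$
- $C_k = U \Sigma_k V^{T}$ is the best rank-k approximation:
  $C_k = \arg\min_X \|C - X\|_F \quad \text{s.t.} \quad \operatorname{rank}(X) = k$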
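
The truncation step can be tried numerically. The sketch below assumes NumPy and a random 6 x 4 matrix; `low_rank_approx` is a hypothetical helper name. It zeroes the trailing singular values, then compares the reconstruction error with the norm of the discarded singular values and with the error of an arbitrary rank-k competitor `X`.

```python
import numpy as np

def low_rank_approx(C, k):
    """Zero out all but the k largest singular values and rebuild the matrix."""
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    s_k = s.copy()
    s_k[k:] = 0.0
    # (U * s_k) scales column i of U by s_k[i], which equals U @ diag(s_k).
    return (U * s_k) @ Vt, s

rng = np.random.default_rng(0)
C = rng.normal(size=(6, 4))  # illustrative data
k = 2

C_k, s = low_rank_approx(C, k)
err_svd = np.linalg.norm(C - C_k, "fro")

# The error equals the Frobenius norm of the discarded singular values.
err_expected = np.sqrt(np.sum(s[k:] ** 2))

# An arbitrary rank-k matrix for comparison; it cannot beat the SVD truncation.
X = rng.normal(size=(6, k)) @ rng.normal(size=(k, 4))
err_other = np.linalg.norm(C - X, "fro")
print(err_svd, err_expected, err_other)
```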

9 Word2Vec
- LSA: a compact representation of the co-occurrence matrix
- Word2Vec: predict surrounding words (skip-gram)
  - Similar to using co-occurrence counts: Levy & Goldberg (2014), Pennington et al. (2014)
- Easy to incorporate new words or sentences

10 Word2Vec
- Similar to a language model, but predicting the next word is not the goal.
- Idea: words that are semantically similar often occur near each other in text
  - Embeddings that are good at predicting neighboring words are also good at representing similarity

11 Skip-gram vs. continuous bag-of-words
- What are the differences?

12 Skip-gram vs. continuous bag-of-words

13 Objective of Word2Vec (Skip-gram)
- Maximize the log likelihood of the context words $w_{t-m}, w_{t-m+1}, \ldots, w_{t-1}, w_{t+1}, w_{t+2}, \ldots, w_{t+m}$ given the center word $w_t$
- m is usually 5~
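
To make the window concrete, here is a minimal sketch that lists the (center word, context word) pairs the objective sums over. The function name `skipgram_pairs`, the example sentence, and the tiny window m = 1 are illustrative assumptions.

```python
def skipgram_pairs(tokens, m):
    """Return (center, context) pairs for every position t and offset j with 0 < |j| <= m."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-m, m + 1):
            # Skip the center word itself and positions outside the sentence.
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((center, tokens[t + j]))
    return pairs

sentence = "tomorrow will be rainy".split()
pairs = skipgram_pairs(sentence, m=1)
for center, context in pairs:
    print(center, "->", context)
```

Each pair contributes one log-probability term of the form log p(context | center) to the log likelihood.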

14 Objective of Word2Vec (Skip-gram)
- How to model $\log P(w_{t+j} \mid w_t)$?
  $p(w_{t+j} \mid w_t) = \frac{\exp(u_{w_{t+j}} \cdot v_{w_t})}{\sum_{w'} \exp(u_{w'} \cdot v_{w_t})}$
- The softmax function, again!
- Every word has 2 vectors
  - $v_w$: when w is the center word
  - $u_w$: when w is the outside word (context word)
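
A minimal sketch of this softmax, assuming NumPy and randomly initialized vectors. The matrices `V` (center vectors) and `U` (outside vectors), the vocabulary size, and the dimension are illustrative; `context_probs` is a hypothetical helper name.

```python
import numpy as np

def context_probs(center_idx, V, U):
    """Return p(w' | center) for every word w': a softmax over the scores u_{w'} . v_center."""
    scores = U @ V[center_idx]
    scores = scores - scores.max()  # subtracting the max does not change the softmax
    e = np.exp(scores)
    return e / e.sum()

rng = np.random.default_rng(0)
vocab_size, dim = 6, 3
V = rng.normal(size=(vocab_size, dim))  # row w is v_w, used when w is the center word
U = rng.normal(size=(vocab_size, dim))  # row w is u_w, used when w is the outside word

p = context_probs(2, V, U)
print(p, p.sum())
```

Note that the denominator touches every row of `U`, so one probability costs time proportional to the vocabulary size.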

15 How to update?
$p(w_{t+j} \mid w_t) = \frac{\exp(u_{w_{t+j}} \cdot v_{w_t})}{\sum_{w'} \exp(u_{w'} \cdot v_{w_t})}$
- How to minimize $J(\theta)$ (the negative log likelihood)?
  - Gradient descent!
  - How to compute the gradient?

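16 Recap: Calculus
- Gradient: for $x^{T} = [x_1 \; x_2 \; \cdots \; x_n]$,
  $\nabla \phi(x) = \left[ \frac{\partial \phi(x)}{\partial x_1}, \frac{\partial \phi(x)}{\partial x_2}, \ldots, \frac{\partial \phi(x)}{\partial x_n} \right]^{T}$
- If $\phi(x) = a \cdot x$ (or represented as $a^{T} x$), then $\nabla \phi(x) = a$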

17 Recap: Calculus
- If $y = f(u)$ and $u = g(x)$ (i.e., $y = f(g(x))$), then
  $\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx} = \frac{df(u)}{du} \cdot \frac{dg(x)}{dx}$
1. y = xˆ + 6 z
2. $y = \ln(x^2 + 5)$
3. y = exp(x + 3x + 2)
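
As a worked instance of the chain rule, take the second exercise, read here as y = ln(x^2 + 5); the exponent 2 is an assumption, since the extracted slide text is unclear at that point. Setting u = x^2 + 5 gives:

```latex
\begin{aligned}
u &= g(x) = x^2 + 5, \qquad y = f(u) = \ln u \\
\frac{dy}{dx} &= \frac{df(u)}{du} \cdot \frac{dg(x)}{dx}
             = \frac{1}{u} \cdot 2x
             = \frac{2x}{x^2 + 5}
\end{aligned}
```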

18 Other useful formulas
- $y = \exp(x) \Rightarrow \frac{dy}{dx} = \exp(x)$
- $y = \log x \Rightarrow \frac{dy}{dx} = \frac{1}{x}$
When I say log (in this course), usually I mean ln.

20 Example
- Assume the vocabulary set is W. We have one center word c and one context word o.
- What is the conditional probability $p(o \mid c)$?
  $p(o \mid c) = \frac{\exp(u_o \cdot v_c)}{\sum_{w' \in W} \exp(u_{w'} \cdot v_c)}$
- What is the gradient of the log likelihood w.r.t. $v_c$?
  $\frac{\partial \log p(o \mid c)}{\partial v_c} = u_o - E_{w \sim p(w \mid c)}[u_w]$
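
The gradient formula (the observed outside vector minus the expected outside vector under the model) can be checked against finite differences. This is a sketch assuming NumPy and random vectors; the sizes, the index `o`, and the function names are illustrative.

```python
import numpy as np

def log_prob(o, v_c, U):
    """log p(o | c) = u_o . v_c - log sum_w exp(u_w . v_c)."""
    scores = U @ v_c
    return scores[o] - np.log(np.sum(np.exp(scores)))

def grad_log_prob(o, v_c, U):
    """Analytic gradient w.r.t. v_c: u_o - E_{w ~ p(w|c)}[u_w]."""
    scores = U @ v_c
    p = np.exp(scores - scores.max())
    p = p / p.sum()
    return U[o] - p @ U

rng = np.random.default_rng(1)
U = rng.normal(size=(5, 3))  # outside vectors u_w for a 5-word vocabulary
v_c = rng.normal(size=3)     # center vector of word c
o = 2                        # index of the observed context word

analytic = grad_log_prob(o, v_c, U)

# Central finite differences along each coordinate of v_c.
eps = 1e-6
numeric = np.array([
    (log_prob(o, v_c + eps * e, U) - log_prob(o, v_c - eps * e, U)) / (2 * eps)
    for e in np.eye(3)
])
print(analytic, numeric)
```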

21 Gradient Descent
$\min_{w} J(w)$
- Update w: $w \leftarrow w - \eta \nabla J(w)$
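
A minimal sketch of the update rule on a one-dimensional problem. The objective J(w) = (w - 3)^2, the step size, and the number of steps are illustrative choices, and `gradient_descent` is a hypothetical helper name.

```python
def gradient_descent(grad, w0, eta, steps):
    """Repeat the update w <- w - eta * grad(w) for a fixed number of steps."""
    w = w0
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

# J(w) = (w - 3)^2 has gradient 2 * (w - 3) and its minimum at w = 3.
w_star = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0, eta=0.1, steps=100)
print(w_star)
```

For this convex objective the iterates move steadily toward 3; with a step size that is too large they would overshoot instead.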

22 Local minimum vs. global minimum

23 Stochastic gradient descent
- Let $J(w) = \frac{1}{n} \sum_{j=1}^{n} J_j(w)$
- Gradient descent update rule: $w \leftarrow w - \eta \, \frac{1}{n} \sum_{j=1}^{n} \nabla J_j(w)$
- Stochastic gradient descent:
  - Approximate $\frac{1}{n} \sum_{j=1}^{n} \nabla J_j(w)$ by the gradient at a single example, $\nabla J_i(w)$ (why?)
  - At each step: randomly pick an example i, then update $w \leftarrow w - \eta \nabla J_i(w)$
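
The stochastic update can be sketched on a toy problem where the answer is known. Here each example contributes a per-example loss of 0.5 * (w - x_i)^2, so the full objective is minimized at the mean of the data; the data, step size, seed, and the name `sgd` are illustrative assumptions.

```python
import random

def sgd(grad_i, n, w0, eta, steps, seed=0):
    """Stochastic gradient descent: each step uses the gradient of one random example."""
    rng = random.Random(seed)
    w = w0
    for _ in range(steps):
        i = rng.randrange(n)        # randomly pick an example i
        w = w - eta * grad_i(w, i)  # w <- w - eta * grad J_i(w)
    return w

data = [1.0, 2.0, 3.0, 4.0, 5.0]

# For J_i(w) = 0.5 * (w - data[i])^2 the per-example gradient is w - data[i].
w_hat = sgd(lambda w, i: w - data[i], n=len(data), w0=0.0, eta=0.01, steps=5000)
print(w_hat)
```

Each step costs one gradient evaluation instead of n, at the price of a noisy estimate: with a constant step size the result hovers near the mean (3.0 here) rather than settling exactly on it.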

24 Negative sampling
- With a large vocabulary set, stochastic gradient descent is still not enough (why?)
  $\frac{\partial \log p(o \mid c)}{\partial v_c} = u_o - E_{w \sim p(w \mid c)}[u_w]$
- Let's approximate it again!
  - Only sample a few words that do not appear in the context
  - Essentially, put more weight on positive samples
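
The slide does not spell out the sampled objective, so the sketch below uses the standard word2vec negative-sampling form, log sigmoid(u_o . v_c) plus the sum over sampled words k of log sigmoid(-u_k . v_c), as an assumption. Negatives are drawn uniformly from words other than o, which is a simplification; the sizes and names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_objective(o, negatives, v_c, U):
    """log sigmoid(u_o . v_c) + sum_k log sigmoid(-u_k . v_c) over the sampled negatives."""
    pos = np.log(sigmoid(U[o] @ v_c))
    neg = np.sum(np.log(sigmoid(-(U[negatives] @ v_c))))
    return pos + neg

def neg_sampling_grad(o, negatives, v_c, U):
    """Gradient w.r.t. v_c; it touches only 1 + len(negatives) rows of U."""
    g = (1.0 - sigmoid(U[o] @ v_c)) * U[o]
    for k in negatives:
        g = g - sigmoid(U[k] @ v_c) * U[k]
    return g

rng = np.random.default_rng(0)
vocab_size, dim = 50, 4
U = rng.normal(size=(vocab_size, dim))  # outside vectors
v_c = rng.normal(size=dim)              # center vector
o = 3                                   # observed context word (positive sample)

# Simplification: sample 5 distinct negatives uniformly from all words except o.
candidates = np.array([w for w in range(vocab_size) if w != o])
negatives = rng.choice(candidates, size=5, replace=False)

value = neg_sampling_objective(o, negatives, v_c, U)
grad = neg_sampling_grad(o, negatives, v_c, U)
print(value, grad)
```

The gradient now costs a handful of dot products instead of a sum over the whole vocabulary.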

25 More about Word2Vec: relation to LSA
- LSA factorizes a matrix of co-occurrence counts
- Levy and Goldberg (2014) prove that the skip-gram model implicitly factorizes a (shifted) PMI matrix!
- $\mathrm{PMI}(w, c) = \log \frac{P(w \mid c)}{P(w)} = \log \frac{P(w, c)}{P(w) P(c)} = \log \frac{\#(w, c) \cdot |D|}{\#(w) \, \#(c)}$
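
The count-based form of PMI is easy to compute directly. In this sketch D is a list of observed (word, context) pairs; the pairs themselves and the name `pmi_table` are made up for illustration.

```python
import math
from collections import Counter

def pmi_table(pairs):
    """PMI(w, c) = log( #(w, c) * |D| / (#(w) * #(c)) ) for every observed pair."""
    pair_counts = Counter(pairs)
    w_counts = Counter(w for w, _ in pairs)
    c_counts = Counter(c for _, c in pairs)
    D = len(pairs)
    return {
        (w, c): math.log(n_wc * D / (w_counts[w] * c_counts[c]))
        for (w, c), n_wc in pair_counts.items()
    }

# Hypothetical collection D of (word, context) pairs.
pairs = [
    ("rainy", "umbrella"),
    ("rainy", "umbrella"),
    ("rainy", "day"),
    ("sunny", "day"),
]
pmi = pmi_table(pairs)
for key, value in sorted(pmi.items()):
    print(key, round(value, 3))
```

A positive value means the pair co-occurs more often than independence would predict, and a negative value means less often.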

26 All problems solved?

27 Continuous Semantic Representations
[Figure: words plotted in a continuous space: sunny, rainy, cab, car, wheel, cloudy, windy, emotion, joy, sad, feeling]

28 Semantics Needs More Than Similarity
Tomorrow will be rainy.
Tomorrow will be sunny.
similar(rainy, sunny)?
antonym(rainy, sunny)?

29 Polarity Inducing LSA [Yih, Zweig, Platt 2012]
- Data representation
  - Encode two opposite relations in a matrix using polarity
    - Synonyms & antonyms (e.g., from a thesaurus)
- Factorization
  - Apply SVD to the matrix to find latent components
- Measuring degree of relation
  - Cosine of latent vectors

30 Encode Synonyms & Antonyms in Matrix
- Joyfulness: joy, gladden; sorrow, sadden
- Sad: sorrow, sadden; joy, gladden
Target word: row vector
Inducing polarity
[Table: rows are Group 1: joyfulness, Group 2: sad, Group 3: affection; columns are joy, gladden, sorrow, sadden, goodwill]
Cosine score: + Synonyms

31 Encode Synonyms & Antonyms in Matrix
- Joyfulness: joy, gladden; sorrow, sadden
- Sad: sorrow, sadden; joy, gladden
Target word: row vector
Inducing polarity
[Table: rows are Group 1: joyfulness, Group 2: sad, Group 3: affection; columns are joy, gladden, sorrow, sadden, goodwill]
Cosine score: − Antonyms
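
The polarity idea can be sketched with NumPy. The cell values of the slide's matrix are not recoverable from the text, so the entries below are assumptions: +1 for a synonym of the group, -1 for an antonym, 0 otherwise. The helper names and k = 2 are also illustrative.

```python
import numpy as np

groups = ["joyfulness", "sad", "affection"]
terms = ["joy", "gladden", "sorrow", "sadden", "goodwill"]

# Assumed weights: +1 for a synonym of the group, -1 for an antonym, 0 otherwise.
M = np.array([
    [ 1.0,  1.0, -1.0, -1.0, 0.0],  # Group 1: joyfulness
    [-1.0, -1.0,  1.0,  1.0, 0.0],  # Group 2: sad
    [ 0.0,  0.0,  0.0,  0.0, 1.0],  # Group 3: affection
])

def term_vectors(M, k):
    """Columns of Sigma_k V_k^T: one k-dimensional latent vector per term."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return np.diag(s[:k]) @ Vt[:k, :]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

Z = term_vectors(M, k=2)
col = {t: Z[:, i] for i, t in enumerate(terms)}

cos_synonyms = cosine(col["joy"], col["gladden"])  # expected to be positive
cos_antonyms = cosine(col["joy"], col["sorrow"])   # expected to be negative
print(cos_synonyms, cos_antonyms)
```

Because antonyms enter the matrix with the opposite sign, their latent vectors point in opposite directions, so the sign of the cosine separates synonyms from antonyms.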

32 Continuous representations for entities
[Figure: entities in a continuous space: Republican Party, Democratic Party, ?, George W Bush, Laura Bush, Michelle Obama]

33 Continuous representations for entities
- Useful resources for NLP applications
  - Semantic Parsing & Question Answering
  - Information Extraction
