GloVe: Global Vectors for Word Representation
1 GloVe: Global Vectors for Word Representation (J. Pennington, R. Socher, C.D. Manning). M. Korniyenko, S. Samson. Deep Learning for NLP, 13 Jun
2 Outline Background One-hot Vectors Co-occurrence Matrix SVD word2vec GloVe Basics The Model Intrinsic Evaluation Results
3 Outline Background One-hot Vectors Co-occurrence Matrix SVD word2vec GloVe Basics The Model Intrinsic Evaluation Results
4 One-hot Vectors Words as discrete units.
5 One-hot Vectors Words as discrete units. Represent as one-hot vectors. Example Let our lexicon be: {king, queen, man, woman} king [1, 0, 0, 0] queen [0, 1, 0, 0] man [0, 0, 1, 0] woman [0, 0, 0, 1]
6 One-hot Vectors Words as discrete units. Represent as one-hot vectors. Example Let our lexicon be: {king, queen, man, woman} king [1, 0, 0, 0] queen [0, 1, 0, 0] man [0, 0, 1, 0] woman [0, 0, 0, 1] What would the dot product of the king vector and queen vector be?
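As an aside (not from the slides), a minimal numpy sketch of the four one-hot vectors above; the dot product of any two distinct one-hot vectors is 0, so these representations encode no notion of similarity between words.

```python
import numpy as np

# One-hot vectors for the toy lexicon {king, queen, man, woman}
lexicon = ["king", "queen", "man", "woman"]
one_hot = {w: np.eye(len(lexicon))[i] for i, w in enumerate(lexicon)}

# The dot product of any two distinct one-hot vectors is 0: no similarity captured.
print(one_hot["king"] @ one_hot["queen"])  # 0.0
print(one_hot["king"] @ one_hot["king"])   # 1.0
```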
7 One-hot Vectors Can we reduce the size of this space from $\mathbb{R}^{|V|}$ to something smaller and thus find a subspace that encodes the relationships between words?
8 Outline Background One-hot Vectors Co-occurrence Matrix SVD word2vec GloVe Basics The Model Intrinsic Evaluation Results
9 Co-occurrence Matrix You shall know a word by the company it keeps. - Firth, J. R. Papers in Linguistics, 1957.
10 Co-occurrence Matrix You shall know a word by the company it keeps. - Firth, J. R. Papers in Linguistics, 1957. Example Let our corpus be: I like deep learning. I like NLP. I enjoy flying.
11 Word Vectors Co-occurrence Matrix counts (symmetric window of size 1):

          I  like  enjoy  deep  learning  NLP  flying  .
I         0   2     1      0      0        0     0     0
like      2   0     0      1      0        1     0     0
enjoy     1   0     0      0      0        0     1     0
deep      0   1     0      0      1        0     0     0
learning  0   0     0      1      0        0     0     1
NLP       0   1     0      0      0        0     0     1
flying    0   0     1      0      0        0     0     1
.         0   0     0      0      1        1     1     0
12 How do we make neighbours represent words? Use a co-occurrence matrix X: word-document (Latent Semantic Analysis) or word-window (captures both syntactic and semantic information).
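One possible sketch of building such a word-window co-occurrence matrix from the toy corpus above (the symmetric window of size 1 is an assumption, chosen to match the counts shown earlier):

```python
import numpy as np

corpus = ["I like deep learning .", "I like NLP .", "I enjoy flying ."]
window = 1  # assumed symmetric window size

# Build the vocabulary and the word-window co-occurrence counts X
vocab = sorted({w for sent in corpus for w in sent.split()})
idx = {w: i for i, w in enumerate(vocab)}
X = np.zeros((len(vocab), len(vocab)))

for sent in corpus:
    words = sent.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if i != j:
                X[idx[w], idx[words[j]]] += 1

print(vocab)
print(X)
```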
13 Drawbacks of simple co-occurrence vectors: they increase in size with the vocabulary, are very high dimensional, and suffer from sparsity issues.
14 Low dimensional vectors Store most of the information in a fixed, small number of dimensions: a dense vector.
15 Low dimensional vectors Store most of the information in a fixed, small number of dimensions: a dense vector. Two possible solutions: Singular Value Decomposition (SVD) based methods (count based) and iteration based methods (direct prediction).
16 Outline Background One-hot Vectors Co-occurrence Matrix SVD word2vec GloVe Basics The Model Intrinsic Evaluation Results
17 SVD Dimensionality Reduction on X
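A rough sketch of this dimensionality reduction (my own illustration, not the authors' code): keep the top k singular directions of the co-occurrence matrix X and use the scaled left singular vectors as dense word vectors. The choice of k and of the left singular vectors is illustrative.

```python
import numpy as np

def svd_word_vectors(X, k=2):
    """Reduce a |V| x |V| co-occurrence matrix to |V| x k dense word vectors."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * S[:k]  # scale the top-k left singular vectors

# Example: X from the co-occurrence sketch above
# vectors = svd_word_vectors(X, k=2)
```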
18 SVD Model disadvantages The dimensions of the matrix change very often (new words are added very frequently and corpus changes in size). The matrix is extremely sparse since most words do not co-occur. Quadratic cost to train. Requires the incorporation of some hacks on X to account for the drastic imbalance in word frequency.
19 SVD Model disadvantages The dimensions of the matrix change very often (new words are added very frequently and corpus changes in size). The matrix is extremely sparse since most words do not co-occur. Quadratic cost to train. Requires the incorporation of some hacks on X to account for the drastic imbalance in word frequency. Iteration based methods solve many of these issues in a far more elegant manner.
20 Outline Background One-hot Vectors Co-occurrence Matrix SVD word2vec GloVe Basics The Model Intrinsic Evaluation Results
21 Neural networks and word embedding The main idea of our model is to predict between a center word $w_t$ and its context words, $p(\text{context} \mid w_t) = \ldots$, which has a loss function, e.g. $J = 1 - p(w_{-t} \mid w_t)$, where $w_{-t}$ denotes the words surrounding $w_t$. We will be adjusting the vector representations of words to minimize the loss.
22-23 History of directly learning low-dimensional word vectors Learning representations by back-propagating errors (Rumelhart et al., 1986) A neural probabilistic language model (Bengio et al., 2003) NLP (almost) from Scratch (Collobert & Weston, 2008) Distributed Representations of Words and Phrases and their Compositionality (Mikolov et al., 2013) - word2vec
24 word2vec Main idea Predict between every word and its context words.
25 word2vec Main idea Predict between every word and its context words. Skip-grams(SG) Predicting surrounding context words given a center word.
26 word2vec Main idea Predict between every word and its context words. Skip-grams (SG) Predicting surrounding context words given a center word. Continuous Bag of Words (CBOW) Predicting a center word from the surrounding context.
27 word2vec Main idea Predict between every word and its context words. Skip-grams (SG) Predicting surrounding context words given a center word Continuous Bag of Words (CBOW) Predicting a center word from the surrounding context
28-29 word2vec Skip-gram prediction (figure)
30 word2vec Question How many vectors represent each word?
31 word2vec Objective function Idea: Maximize the probability of any context word given the current center word: $J'(\theta) = \prod_{t=1}^{T} \prod_{-m \le j \le m,\, j \ne 0} p(w_{t+j} \mid w_t; \theta)$ where $T$ is the length of our text (corpus): for each word $t = 1 \ldots T$ we try to predict the surrounding words; $m$ is the size of our window; $\theta$ represents all variables we will optimize.
32-33 word2vec Negative Log Likelihood $J'(\theta) = \prod_{t=1}^{T} \prod_{-m \le j \le m,\, j \ne 0} p(w_{t+j} \mid w_t)$ becomes $J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{-m \le j \le m,\, j \ne 0} \log p(w_{t+j} \mid w_t)$
34 word2vec For $p(w_{t+j} \mid w_t)$ the simplest formulation is $p(o \mid c) = \frac{\exp(u_o^T v_c)}{\sum_{w=1}^{V} \exp(u_w^T v_c)}$ where $o$ is the outside (context) word index, $c$ the center word index, $v_c$ the center vector of index $c$, and $u_o$ the outside vector of index $o$.
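A small numpy sketch of this softmax probability (the vocabulary size, dimension, and random vectors are placeholders):

```python
import numpy as np

V, d = 10000, 100                  # placeholder vocabulary size and embedding dimension
U = np.random.randn(V, d) * 0.01   # "outside" vectors u_w, one row per word
v_c = np.random.randn(d) * 0.01    # center vector v_c

def p_outside_given_center(o, U, v_c):
    """p(o | c) = exp(u_o^T v_c) / sum_w exp(u_w^T v_c)."""
    scores = U @ v_c
    scores -= scores.max()         # shift for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[o]

print(p_outside_given_center(42, U, v_c))
```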
37 Output Layer (figure)
38 word2vec Training the model Compute all vector gradients. We define the set of all parameters of the model as one long vector $\theta$, where $d$ is the dimension of each vector and $V$ the number of words: $\theta = \begin{bmatrix} v_{a} \\ \vdots \\ v_{zebra} \\ u_{a} \\ \vdots \\ u_{zebra} \end{bmatrix} \in \mathbb{R}^{2dV}$
39 word2vec Gradient Descent To minimize $J(\theta)$ over the entire training data we would need to compute gradients for all windows: $\theta_j^{new} = \theta_j^{old} - \alpha \frac{\partial}{\partial \theta_j^{old}} J(\theta)$. In matrix notation: $\theta^{new} = \theta^{old} - \alpha \frac{\partial J(\theta)}{\partial \theta^{old}} = \theta^{old} - \alpha \nabla_\theta J(\theta)$
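As a sketch, the stochastic variant of this update applied window by window, which is what word2vec uses in practice; the learning rate and the gradient function grad_J_window are placeholders:

```python
def sgd(theta, windows, grad_J_window, alpha=0.025, epochs=1):
    """Stochastic gradient descent: update theta one window at a time
    instead of computing the gradient over the whole corpus."""
    for _ in range(epochs):
        for window in windows:
            theta = theta - alpha * grad_J_window(theta, window)
    return theta
```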
40 word2vec Skip-gram model and negative sampling $J_t(\theta) = \log \sigma(u_o^T v_c) + \sum_{i=1}^{k} \mathbb{E}_{j \sim P(w)}\left[\log \sigma(-u_j^T v_c)\right]$ where $k$ is the number of negative samples and $\sigma(x)$ is the sigmoid function; in the first log term we maximize the probability of the two words co-occurring.
41 word2vec Skip-gram model and negative sampling Clearer notation: $J_t(\theta) = \log \sigma(u_o^T v_c) + \sum_{j \sim P(w)} \log \sigma(-u_j^T v_c)$
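A sketch of evaluating this negative-sampling objective for one (center, outside) pair with k sampled negatives; the dimensions and random vectors are placeholders, and in real training the negatives are drawn from the noise distribution P(w):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_objective(u_o, v_c, U_neg):
    """J_t = log sigma(u_o^T v_c) + sum_j log sigma(-u_j^T v_c),
    where the rows of U_neg are the vectors of k sampled negative words."""
    positive = np.log(sigmoid(u_o @ v_c))
    negative = np.log(sigmoid(-(U_neg @ v_c))).sum()
    return positive + negative  # maximized during training

d, k = 100, 5
u_o, v_c = np.random.randn(d) * 0.01, np.random.randn(d) * 0.01
U_neg = np.random.randn(k, d) * 0.01  # placeholders for negatives drawn from P(w)
print(neg_sampling_objective(u_o, v_c, U_neg))
```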
42 Count based and Direct prediction
43 Outline Background One-hot Vectors Co-occurrence Matrix SVD word2vec GloVe Basics The Model Intrinsic Evaluation Results
44 Basics Notation $X$: matrix of word-word co-occurrence counts. $X_{ij}$: number of times word $j$ occurs in the context of word $i$. $X_i = \sum_k X_{ik}$: number of times any word appears in the context of word $i$. $P_{ij} = P(j \mid i) = X_{ij}/X_i$: probability that word $j$ appears in the context of word $i$.
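Given a count matrix X, these quantities are mechanical to compute; a brief illustrative sketch (the index arguments are placeholders):

```python
import numpy as np

def cooccurrence_probabilities(X):
    """P[i, j] = P(j | i) = X_ij / X_i, with X_i = sum_k X_ik."""
    X_i = X.sum(axis=1, keepdims=True)
    return X / np.maximum(X_i, 1e-12)  # guard against empty rows

def probability_ratio(P, i, j, k):
    """The ratio GloVe builds on: P(k | i) / P(k | j), e.g. i = ice, j = steam."""
    return P[i, k] / P[j, k]
```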
45-47 Basics Meaning Extraction Selected co-occurrence probabilities from a 6 billion token corpus:

Probability and Ratio      k = solid    k = gas      k = water    k = fashion
P(k | ice)                 1.9 x 10^-4  6.6 x 10^-5  3.0 x 10^-3  1.7 x 10^-5
P(k | steam)               2.2 x 10^-5  7.8 x 10^-4  2.2 x 10^-3  1.8 x 10^-5
P(k | ice) / P(k | steam)  8.9          8.5 x 10^-2  1.36         0.96

Starting point for word vector learning? Use ratios of co-occurrence probabilities instead of the probabilities themselves.
48 Outline Background One-hot Vectors Co-occurrence Matrix SVD word2vec GloVe Basics The Model Intrinsic Evaluation Results
49 The Model Objective Function $J = \sum_{i,j=1}^{V} f(X_{ij})\left(w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2$ where $V$ is the size of the vocabulary.
50-51 The Model Most General Form $F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$ where $w \in \mathbb{R}^d$ are word vectors and $\tilde{w} \in \mathbb{R}^d$ are separate context word vectors. The ratio $P_{ik}/P_{jk}$ is extracted from the corpus. For $F$, encode the information from $P_{ik}/P_{jk}$ in the word vector space.
52-59 The Model $F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$. Restrict to the difference of the two target words: $F(w_i - w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$. Notice that the arguments on the LHS are vectors while the RHS is a scalar. Solution: take the dot product of the arguments: $F\left((w_i - w_j)^T \tilde{w}_k\right) = \frac{P_{ik}}{P_{jk}}$. For word-word co-occurrence matrices, word and context word are interchangeable: exchange $w \leftrightarrow \tilde{w}$ and $X \leftrightarrow X^T$. However, doing so does not leave the model invariant.
60-66 The Model To restore the symmetry: $F$ must be a homomorphism between the groups $(\mathbb{R}, +)$ and $(\mathbb{R}_{>0}, \times)$: $F\left((w_i - w_j)^T \tilde{w}_k\right) = \frac{F(w_i^T \tilde{w}_k)}{F(w_j^T \tilde{w}_k)}$. From $F\left((w_i - w_j)^T \tilde{w}_k\right) = \frac{P_{ik}}{P_{jk}}$ it follows that $F(w_i^T \tilde{w}_k) = P_{ik} = \frac{X_{ik}}{X_i}$, so $F = \exp$, or $w_i^T \tilde{w}_k = \log(P_{ik}) = \log(X_{ik}) - \log(X_i)$. The term $\log(X_i)$ is independent of $k$, so it can be absorbed into a bias $b_i$ for $w_i$: $w_i^T \tilde{w}_k + b_i$. Add a bias for $\tilde{w}_k$: $w_i^T \tilde{w}_k + b_i + \tilde{b}_k$. Finally, $w_i^T \tilde{w}_k + b_i + \tilde{b}_k = \log(X_{ik})$.
67-71 The Model $w_i^T \tilde{w}_k + b_i + \tilde{b}_k = \log(X_{ik})$ A simplification over $F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$. Add an additive shift, $\log(X_{ik}) \to \log(1 + X_{ik})$, to maintain the sparsity of $X$ and avoid divergences. What are the drawbacks to this model? It weighs all co-occurrences, even rare and non-existent ones, equally. Cast the above equation as a least squares problem with a weighting function $f(X_{ij})$.
72 The Model Using Least Squares Regression $J = \sum_{i,j=1}^{V} f(X_{ij})\left(w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2$ where $V$ is the size of the vocabulary.
73 The Model Properties of The Weighting Function $f(0) = 0$: if $f$ is viewed as a continuous function, it should vanish as $x \to 0$ fast enough that $\lim_{x \to 0} f(x)\log^2 x$ is finite. The $X_{ij}$ are co-occurrence counts, which lie in $\mathbb{N}_0$; if $X_{ij}$ is zero, the logarithm diverges, so GloVe trains only on the nonzero elements. $f(x)$ should be non-decreasing so that rare co-occurrences are not overweighted. $f(x)$ should be relatively small for large values of $x$, so that frequent co-occurrences are not overweighted.
74-75 The Model The Weighting Function $f(x) = \begin{cases} (x/x_{max})^{\alpha} & \text{if } x < x_{max} \\ 1 & \text{otherwise} \end{cases}$ with $x_{max} = 100$ for the group's experiments and $\alpha = 3/4$. (Figure: plot of $f$ with $\alpha = 3/4$, from Pennington et al. 2014.)
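A compact sketch of this weighting function together with the GloVe cost from the previous slides (my own illustration; the parameter arrays W, W_tilde, b, b_tilde are placeholders for the word vectors, context vectors, and biases):

```python
import numpy as np

def f(x, x_max=100.0, alpha=0.75):
    """GloVe weighting: (x / x_max)^alpha below x_max, 1 otherwise."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_cost(X, W, W_tilde, b, b_tilde):
    """J = sum_ij f(X_ij) (w_i^T w~_j + b_i + b~_j - log X_ij)^2,
    summed over the nonzero entries of X only."""
    i, j = np.nonzero(X)
    inner = (W[i] * W_tilde[j]).sum(axis=1) + b[i] + b_tilde[j]
    return np.sum(f(X[i, j]) * (inner - np.log(X[i, j])) ** 2)
```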
76-79 The Model Relation to Skip-gram The softmax model for skip-gram is $Q_{ij} = \frac{\exp(w_i^T \tilde{w}_j)}{\sum_{k=1}^{V} \exp(w_i^T \tilde{w}_k)}$, the probability that word $j$ appears in the context of word $i$. The implied global objective function is $J = -\sum_{i \in \text{corpus}} \sum_{j \in \text{context}(i)} \log Q_{ij}$. However, this sum can be evaluated much more efficiently by grouping identical terms: $J = -\sum_{i=1}^{V} \sum_{j=1}^{V} X_{ij} \log Q_{ij}$. Recall that $X_i = \sum_k X_{ik}$ and $P_{ij} = P(j \mid i) = X_{ij}/X_i$.
80-82 The Model Relation to Skip-gram We can rewrite $J$ as $J = -\sum_{i=1}^{V} X_i \sum_{j=1}^{V} P_{ij} \log Q_{ij} = \sum_{i=1}^{V} X_i H(P_i, Q_i)$ where $H(P_i, Q_i)$ is the cross entropy of the distributions $P_i$ and $Q_i$. However, cross entropy error is not ideal. Why? Distributions with long tails are modeled poorly, and the normalization of $Q$ is a computational bottleneck.
83-85 The Model Relation to Skip-gram In a least squares objective, discard the normalization factors in $Q$ and $P$: $\hat{J} = \sum_{i,j} X_i \left(\hat{P}_{ij} - \hat{Q}_{ij}\right)^2$ where $\hat{P}_{ij} = X_{ij}$ and $\hat{Q}_{ij} = \exp(w_i^T \tilde{w}_j)$. Another problem! $X_{ij}$ often takes very large values. Minimize the squared error of the logarithms of $\hat{P}$ and $\hat{Q}$ instead: $\hat{J} = \sum_{i,j} X_i \left(\log \hat{P}_{ij} - \log \hat{Q}_{ij}\right)^2 = \sum_{i,j} X_i \left(w_i^T \tilde{w}_j - \log X_{ij}\right)^2$
86 The Model Relation to Skip-gram $\hat{J} = \sum_{i,j} f(X_{ij})\left(w_i^T \tilde{w}_j - \log X_{ij}\right)^2$ which is equivalent to the cost function $J = \sum_{i,j=1}^{V} f(X_{ij})\left(w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2$
87 Outline Background One-hot Vectors Co-occurrence Matrix SVD word2vec GloVe Basics The Model Intrinsic Evaluation Results
88 Intrinsic Evaluation Comparing Intrinsic and Extrinsic Evaluations (Mundra, Socher) Intrinsic Evaluation: the evaluation of a set of word vectors generated by an embedding technique (such as word2vec or GloVe) on specific intermediate subtasks (such as analogy completion). Evaluation on a specific subtask; fast to compute performance; helps understand the subsystem; needs positive correlation with the real task to determine usefulness.
89 Intrinsic Evaluation Comparing Intrinsic and Extrinsic Evaluations (Mundra, Socher) Extrinsic Evaluation: the evaluation of a set of word vectors generated by an embedding technique on the real task at hand. Evaluation on a real task; can be slow to compute performance; unclear whether the subsystem is the problem, other subsystems, or their interactions; if replacing the subsystem improves performance, the change is likely good.
90 Intrinsic Evaluation Example Performance in completing word vector analogies. a : b :: c :?
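Analogies a : b :: c : ? are typically answered by finding the word whose vector is closest in cosine similarity to w_b - w_a + w_c; a minimal sketch (the vector matrix and lookup tables are placeholders):

```python
import numpy as np

def analogy(a, b, c, vectors, word2id, id2word):
    """Return the word d maximizing cos(w_b - w_a + w_c, w_d), excluding a, b, c."""
    target = vectors[word2id[b]] - vectors[word2id[a]] + vectors[word2id[c]]
    target /= np.linalg.norm(target)
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = normed @ target
    for w in (a, b, c):                 # don't return one of the query words
        scores[word2id[w]] = -np.inf
    return id2word[int(np.argmax(scores))]
```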
91 Outline Background One-hot Vectors Co-occurrence Matrix SVD word2vec GloVe Basics The Model Intrinsic Evaluation Results
92 Nearest neighbors The closest words to the target word frog: frogs, toad, litoria, leptodactylidae, rana
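Nearest-neighbor lists like this one come from ranking the vocabulary by cosine similarity to the target word's vector; a short sketch (the vector matrix and lookup tables are placeholders):

```python
import numpy as np

def nearest_neighbors(word, vectors, word2id, id2word, n=5):
    """Return the n words with highest cosine similarity to `word`."""
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = normed @ normed[word2id[word]]
    scores[word2id[word]] = -np.inf      # exclude the word itself
    top = np.argsort(-scores)[:n]
    return [id2word[int(i)] for i in top]
```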
93 Word analogies (figure)
94 GloVe Visualization: man-woman (figure)
95 GloVe Visualization: comparative-superlative (figure)
96 Results on the word analogy task, given as percent accuracy
97 GloVe vs Skip-Gram
98 Accuracy on the analogy task for 300-dimensional vectors trained on different corpora
99 Conclusion GloVe is a new global log-bilinear regression model for the unsupervised learning of word representations that outperforms other models on word analogy, word similarity, and named entity recognition tasks.
100 Further Reading T. Mikolov, K. Chen, G. Corrado, J. Dean. Efficient Estimation of Word Representations in Vector Space. International Conference on Learning Representations, 2013. E. Huang, R. Socher, C.D. Manning, A. Ng. Improving Word Representations via Global Context and Multiple Word Prototypes. Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, 2012.