GloVe: Global Vectors for Word Representation


GloVe: Global Vectors for Word Representation.¹ J. Pennington, R. Socher, C. D. Manning. M. Korniyenko, S. Samson. Deep Learning for NLP, 13 Jun 2017. ¹ https://nlp.stanford.edu/projects/glove/

Outline Background One-hot Vectors Co-occurrence Matrix SVD word2vec GloVe Basics The Model Intrinsic Evaluation Results


One-hot Vectors. Words as discrete units, represented as one-hot vectors. Example: let our lexicon be {king, queen, man, woman}: king = [1, 0, 0, 0], queen = [0, 1, 0, 0], man = [0, 0, 1, 0], woman = [0, 0, 0, 1]. What would the dot product of the king vector and the queen vector be?
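To make the dot-product question concrete, here is a minimal NumPy sketch (not from the slides) using the toy lexicon above; it shows that any two distinct one-hot vectors have dot product zero, so this representation encodes no similarity between words.

```python
import numpy as np

# One-hot vectors for the toy lexicon (illustrative sketch).
lexicon = ["king", "queen", "man", "woman"]
one_hot = {w: np.eye(len(lexicon))[i] for i, w in enumerate(lexicon)}

# Distinct one-hot vectors are orthogonal, so no similarity is captured.
print(one_hot["king"] @ one_hot["queen"])  # 0.0
print(one_hot["king"] @ one_hot["king"])   # 1.0
```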

One-hot Vectors. Can we reduce the size of this space from $\mathbb{R}^{|V|}$ to something smaller, and thus find a subspace that encodes the relationships between words?

Outline Background One-hot Vectors Co-occurrence Matrix SVD word2vec GloVe Basics The Model Intrinsic Evaluation Results

Co-occurrence Matrix. "You shall know a word by the company it keeps." (Firth, J. R., Papers in Linguistics, 1957.) Example: let our corpus be: "I like deep learning. I like NLP. I enjoy flying."

Word Vectors. Co-occurrence matrix counts (window size 1):

             I  like  enjoy  deep  learning  NLP  flying  .
I            0   2     1      0      0        0     0     0
like         2   0     0      1      0        1     0     0
enjoy        1   0     0      0      0        0     1     0
deep         0   1     0      0      1        0     0     0
learning     0   0     0      1      0        0     0     1
NLP          0   1     0      0      0        0     0     1
flying       0   0     1      0      0        0     0     1
.            0   0     0      0      1        1     1     0
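As an illustration, a minimal NumPy sketch (not from the slides) that builds this window-1 co-occurrence matrix from the toy corpus; the variable names (corpus, vocab, X) are my own, and the period is treated as a token, as in the table.

```python
import numpy as np

# Build the window-1 co-occurrence matrix for the toy corpus above.
corpus = ["I like deep learning .", "I like NLP .", "I enjoy flying ."]
vocab = ["I", "like", "enjoy", "deep", "learning", "NLP", "flying", "."]
idx = {w: i for i, w in enumerate(vocab)}

X = np.zeros((len(vocab), len(vocab)), dtype=int)
window = 1
for sentence in corpus:
    tokens = sentence.split()
    for t, word in enumerate(tokens):
        for j in range(max(0, t - window), min(len(tokens), t + window + 1)):
            if j != t:
                X[idx[word], idx[tokens[j]]] += 1

print(X)  # matches the counts in the table above
```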

How do we make neighbours represent words? Use a co-occurrence matrix X: word-document (as in Latent Semantic Analysis) or word-window (captures both syntactic and semantic information).

Drawbacks of simple co-occurrence vectors: they increase in size with the vocabulary, are very high dimensional, and suffer from sparsity issues.

Low dimensional vectors. Store most of the information in a fixed, small number of dimensions: a dense vector. Two possible solutions: Singular Value Decomposition (SVD) based methods (count based) and iteration based methods (direct prediction).

Outline Background One-hot Vectors Co-occurrence Matrix SVD word2vec GloVe Basics The Model Intrinsic Evaluation Results

SVD.² Dimensionality reduction on X. ² http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
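A hedged sketch of the idea, not the authors' code: apply a truncated SVD to the toy co-occurrence matrix from the earlier example and keep the top singular directions as dense word vectors.

```python
import numpy as np

# Truncated SVD of the toy window-1 co-occurrence matrix (illustrative).
vocab = ["I", "like", "enjoy", "deep", "learning", "NLP", "flying", "."]
X = np.array([
    [0, 2, 1, 0, 0, 0, 0, 0],
    [2, 0, 0, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 0, 0, 1],
    [0, 0, 1, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1, 1, 1, 0],
], dtype=float)

U, S, Vt = np.linalg.svd(X)
k = 2                              # keep the top-k singular directions
word_vectors = U[:, :k] * S[:k]    # dense k-dimensional vectors, one per word

for word, vec in zip(vocab, word_vectors):
    print(f"{word:10s} {vec}")
```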

SVD model disadvantages. The dimensions of the matrix change very often (new words are added frequently and the corpus changes in size). The matrix is extremely sparse, since most words do not co-occur. Quadratic cost to train. Requires hacks on X to account for the drastic imbalance in word frequency. Iteration based methods solve many of these issues in a far more elegant manner.

Outline Background One-hot Vectors Co-occurrence Matrix SVD word2vec GloVe Basics The Model Intrinsic Evaluation Results

Neural networks and word embeddings. The main idea of the model is to predict between a center word $w_t$ and its context words in terms of word vectors, $p(\text{context} \mid w_t) = \ldots$, which has a loss function such as $J = 1 - p(w_{-t} \mid w_t)$, where $w_{-t}$ denotes the context of $w_t$. We keep adjusting the vector representations of words to minimize this loss.

History of directly learning low-dimensional word vectors: Learning representations by back-propagating errors (Rumelhart et al., 1986); A neural probabilistic language model (Bengio et al., 2003); NLP (almost) from Scratch (Collobert & Weston, 2008); Distributed Representations of Words and Phrases and their Compositionality (Mikolov et al., 2013), i.e. word2vec.

word2vec. Main idea: predict between every word and its context words. Skip-gram (SG): predict the surrounding context words given a center word. Continuous Bag of Words (CBOW): predict a center word from the surrounding context.

word2vec: Skip-gram prediction (figure).³ ³ http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

word2vec Question How many vectors represent each word?

word2vec objective function. Idea: maximize the probability of any context word given the current center word:

$J'(\theta) = \prod_{t=1}^{T} \prod_{-m \le j \le m,\, j \ne 0} p(w_{t+j} \mid w_t; \theta)$

T is the length of our text (corpus): for each word t = 1, ..., T we try to predict the surrounding words; m is the size of our window; θ represents all the variables we will optimize.

word2vec: negative log likelihood.

$J'(\theta) = \prod_{t=1}^{T} \prod_{-m \le j \le m,\, j \ne 0} p(w_{t+j} \mid w_t)$

$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{-m \le j \le m,\, j \ne 0} \log p(w_{t+j} \mid w_t)$

word2vec. For $p(w_{t+j} \mid w_t)$ the simplest formulation is

$p(o \mid c) = \frac{\exp(u_o^T v_c)}{\sum_{w=1}^{V} \exp(u_w^T v_c)}$

where o is the outside (context) word index, c is the center word index, $v_c$ is the center vector for index c, and $u_o$ is the outside vector for index o.
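As a concrete illustration (not from the slides), a small NumPy sketch of this softmax with random toy vectors; U and Vc here are stand-in matrices for the outside and center vectors.

```python
import numpy as np

# Skip-gram softmax p(o | c) with random toy vectors (illustrative only).
rng = np.random.default_rng(0)
V, d = 8, 5
U = rng.normal(size=(V, d))   # "outside" (context) vectors u_w
Vc = rng.normal(size=(V, d))  # "center" vectors v_w

def p_outside_given_center(o, c):
    """p(o | c) = exp(u_o . v_c) / sum_w exp(u_w . v_c)"""
    scores = U @ Vc[c]                  # u_w . v_c for every word w
    scores -= scores.max()              # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[o]

# Negative log-likelihood of one (center, outside) pair:
print(-np.log(p_outside_given_center(o=3, c=1)))
```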

(Figure.⁴) ⁴ http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

Output layer (figure).⁵ ⁵ http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

word2vec: training the model. Compute all vector gradients. We define the set of all parameters of the model as one long vector θ. With d the dimension of each vector and V the number of words,

$\theta = (v_a, \ldots, v_{zebra}, u_a, \ldots, u_{zebra}) \in \mathbb{R}^{2dV}$

word2vec: gradient descent. To minimize J(θ) over the entire training data we would compute gradients for all windows and update

$\theta_j^{new} = \theta_j^{old} - \alpha \frac{\partial}{\partial \theta_j^{old}} J(\theta)$

or, in matrix notation,

$\theta^{new} = \theta^{old} - \alpha \frac{\partial J(\theta)}{\partial \theta^{old}} = \theta^{old} - \alpha \nabla_\theta J(\theta)$

word2vec: skip-gram model with negative sampling.

$J_t(\theta) = \log \sigma(u_o^T v_c) + \sum_{i=1}^{k} \mathbb{E}_{j \sim P(w)}\left[\log \sigma(-u_j^T v_c)\right]$

k is the number of negative samples and σ(x) is the sigmoid function; in the first log term we maximize the probability of the two words co-occurring.

word2vec: skip-gram model with negative sampling, in clearer notation:

$J_t(\theta) = \log \sigma(u_o^T v_c) + \sum_{j \sim P(w)} \log \sigma(-u_j^T v_c)$
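A minimal sketch (my own, not the slides' code) of this negative-sampling objective for a single (center, outside) pair, together with one manual gradient step on $v_c$ in the spirit of the gradient-descent slide above. The noise distribution P(w) is simplified to uniform here, and all sizes and names are illustrative.

```python
import numpy as np

# Negative-sampling objective for one (center, outside) pair, toy setup.
rng = np.random.default_rng(0)
V, d, k = 8, 5, 3
U = rng.normal(scale=0.1, size=(V, d))   # outside vectors u_w
Vc = rng.normal(scale=0.1, size=(V, d))  # center vectors v_w
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

c, o = 1, 3                              # center and true outside word
neg = rng.integers(0, V, size=k)         # k negative samples j ~ P(w), uniform here

# J_t = log sigma(u_o . v_c) + sum_j log sigma(-u_j . v_c)
J = np.log(sigmoid(U[o] @ Vc[c])) + np.log(sigmoid(-U[neg] @ Vc[c])).sum()

# Gradient of J_t with respect to v_c, and one ascent step (J_t is maximized).
grad_vc = (1 - sigmoid(U[o] @ Vc[c])) * U[o] - (
    sigmoid(U[neg] @ Vc[c])[:, None] * U[neg]
).sum(axis=0)
Vc[c] += 0.1 * grad_vc
print(J)
```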

Count based and Direct prediction

Outline Background One-hot Vectors Co-occurrence Matrix SVD word2vec GloVe Basics The Model Intrinsic Evaluation Results

Basics: notation.
$X$: matrix of word-word co-occurrence counts.
$X_{ij}$: frequency of word j occurring in the context of word i.
$X_i = \sum_k X_{ik}$: frequency of any word appearing in the context of word i.
$P_{ij} = P(j \mid i) = X_{ij} / X_i$: probability that word j appears in the context of word i.

Basics: meaning extraction. Selected co-occurrence probabilities from a 6 billion token corpus:

Probability and ratio        k = solid     k = gas       k = water     k = fashion
P(k | ice)                   1.9 × 10^-4   6.6 × 10^-5   3.0 × 10^-3   1.7 × 10^-5
P(k | steam)                 2.2 × 10^-5   7.8 × 10^-4   2.2 × 10^-3   1.8 × 10^-5
P(k | ice) / P(k | steam)    8.9           8.5 × 10^-2   1.36          0.96

Starting point for word vector learning? Ratios of co-occurrence probabilities instead of the probabilities themselves.

Outline Background One-hot Vectors Co-occurrence Matrix SVD word2vec GloVe Basics The Model Intrinsic Evaluation Results

The Model: objective function

$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$

where V is the size of the vocabulary.

The Model: most general form

$F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$

where $w \in \mathbb{R}^d$ are word vectors and $\tilde{w} \in \mathbb{R}^d$ are separate context word vectors; the ratio $P_{ik}/P_{jk}$ is extracted from the corpus. For F, we want to encode the information present in $P_{ik}/P_{jk}$ in the word vector space.

The Model

$F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$

Restrict F to depend on the difference of the two target words:

$F(w_i - w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$

Notice that the arguments on the LHS are vectors while the RHS is a scalar. Solution: take the dot product of the arguments:

$F\left((w_i - w_j)^T \tilde{w}_k\right) = \frac{P_{ik}}{P_{jk}}$

For word-word co-occurrence matrices, word and context word are interchangeable: we should be able to exchange $w \leftrightarrow \tilde{w}$ and $X \leftrightarrow X^T$. However, the model above is not invariant under this exchange.

The Model. To restore the symmetry, F must be a homomorphism between the groups $(\mathbb{R}, +)$ and $(\mathbb{R}_{>0}, \times)$:

$F\left((w_i - w_j)^T \tilde{w}_k\right) = \frac{F(w_i^T \tilde{w}_k)}{F(w_j^T \tilde{w}_k)}$

which, combined with $F\left((w_i - w_j)^T \tilde{w}_k\right) = P_{ik}/P_{jk}$, gives

$F(w_i^T \tilde{w}_k) = P_{ik} = \frac{X_{ik}}{X_i}$

The solution is $F = \exp$ (since $\exp(a - b) = \exp(a)/\exp(b)$), so that

$w_i^T \tilde{w}_k = \log(P_{ik}) = \log(X_{ik}) - \log(X_i)$

The term $\log(X_i)$ is independent of k, so it can be absorbed into a bias $b_i$ for $w_i$: $w_i^T \tilde{w}_k + b_i$. Adding a bias $\tilde{b}_k$ for $\tilde{w}_k$ restores the symmetry: $w_i^T \tilde{w}_k + b_i + \tilde{b}_k$. Finally,

$w_i^T \tilde{w}_k + b_i + \tilde{b}_k = \log(X_{ik})$

The Model

$w_i^T \tilde{w}_k + b_i + \tilde{b}_k = \log(X_{ik})$

A simplification over $F(w_i, w_j, \tilde{w}_k) = P_{ik}/P_{jk}$. Add an additive shift, $\log(X_{ik}) \to \log(1 + X_{ik})$, to maintain the sparsity of X and avoid the divergence of the logarithm at zero counts. What are the drawbacks of this model? It weighs all co-occurrences equally, even rare and non-existent ones. Solution: cast the above equation as a least squares problem with a weighting function $f(X_{ij})$.

The Model: using least squares regression

$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$

where V is the size of the vocabulary.

The Model: properties of the weighting function.
f(0) = 0. If f is viewed as a continuous function, it should vanish as x → 0 fast enough that $\lim_{x \to 0} f(x) \log^2 x$ is finite. (The $X_{ij}$ are co-occurrence counts in $\mathbb{N}_0$; if $X_{ij}$ is zero the logarithm diverges, so GloVe trains only on the nonzero elements of X.)
f(x) should be non-decreasing, so that rare co-occurrences are not overweighted.
f(x) should be relatively small for large values of x, so that frequent co-occurrences are not overweighted.

The Model: the weighting function

$f(x) = \begin{cases} (x / x_{max})^\alpha & \text{if } x < x_{max} \\ 1 & \text{otherwise} \end{cases}$

with $x_{max} = 100$ and $\alpha = 3/4$ for the group's experiments. Figure: f(x) with α = 3/4 (Pennington et al. 2014, Section 4).
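Putting the pieces together, a minimal NumPy sketch (not the reference implementation) of this weighting function and the weighted least-squares objective, evaluated on a random toy count matrix; all names, sizes, and the zero-initialized biases are illustrative. The paper itself optimizes this objective with AdaGrad over the nonzero entries of X.

```python
import numpy as np

# GloVe weighting function and weighted least-squares loss on toy data.
rng = np.random.default_rng(0)
V, d = 8, 5
X = rng.integers(0, 50, size=(V, V)).astype(float)   # toy co-occurrence counts
W = rng.normal(scale=0.1, size=(V, d))    # word vectors w_i
Wc = rng.normal(scale=0.1, size=(V, d))   # context vectors w~_j
b = np.zeros(V)                           # biases b_i
bc = np.zeros(V)                          # context biases b~_j

def f(x, x_max=100.0, alpha=0.75):
    """GloVe weighting: (x / x_max)^alpha below x_max, 1 otherwise."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(X, W, Wc, b, bc):
    # J = sum_ij f(X_ij) (w_i . w~_j + b_i + b~_j - log X_ij)^2,
    # summed over the nonzero entries of X only.
    i, j = np.nonzero(X)
    residual = (W[i] * Wc[j]).sum(axis=1) + b[i] + bc[j] - np.log(X[i, j])
    return (f(X[i, j]) * residual ** 2).sum()

print(glove_loss(X, W, Wc, b, bc))
```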

The Model: relation to skip-gram. The softmax model for skip-gram is

$Q_{ij} = \frac{\exp(w_i^T \tilde{w}_j)}{\sum_{k=1}^{V} \exp(w_i^T \tilde{w}_k)}$

the probability that word j appears in the context of word i. The implied global objective function is

$J = -\sum_{i \in \text{corpus}} \sum_{j \in \text{context}(i)} \log Q_{ij}$

However, since the same (i, j) pairs appear many times, the sum can be evaluated much more efficiently by grouping them:

$J = -\sum_{i=1}^{V} \sum_{j=1}^{V} X_{ij} \log Q_{ij}$

Recall that $X_i = \sum_k X_{ik}$ and $P_{ij} = P(j \mid i) = X_{ij} / X_i$.
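The claim that the grouped sum matches the per-position sum is easy to check numerically. Below is a small sketch (not from the slides) with a random toy corpus and random vectors, where X_ij is obtained by counting (center, context) pairs with a window of 1.

```python
import numpy as np
from collections import Counter

# Check: -sum over corpus positions of log Q_ij == -sum_ij X_ij log Q_ij.
rng = np.random.default_rng(0)
V, d = 6, 4
corpus = rng.integers(0, V, size=50)    # toy corpus of word ids, window = 1
W = rng.normal(size=(V, d))             # word vectors w_i
Wc = rng.normal(size=(V, d))            # context vectors w~_j

def log_Q(i, j):
    """log Q_ij = w_i . w~_j - log sum_k exp(w_i . w~_k), computed stably."""
    scores = W[i] @ Wc.T
    m = scores.max()
    return scores[j] - (m + np.log(np.exp(scores - m).sum()))

pairs = [(corpus[t], corpus[t + o]) for t in range(len(corpus))
         for o in (-1, 1) if 0 <= t + o < len(corpus)]

J_positions = -sum(log_Q(i, j) for i, j in pairs)

X = Counter(pairs)                      # X_ij as (center, context) pair counts
J_grouped = -sum(n * log_Q(i, j) for (i, j), n in X.items())

print(np.isclose(J_positions, J_grouped))   # True
```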

The Model: relation to skip-gram. We can rewrite J as

$J = -\sum_{i=1}^{V} X_i \sum_{j=1}^{V} P_{ij} \log Q_{ij} = \sum_{i=1}^{V} X_i \, H(P_i, Q_i)$

where $H(P_i, Q_i)$ is the cross entropy of the distributions $P_i$ and $Q_i$. However, cross entropy error is not ideal. Why? Distributions with long tails are modeled poorly, and the normalization of Q is a computational bottleneck.

The Model: relation to skip-gram. In a least squares objective, discard the normalization factors in Q and P:

$\hat{J} = \sum_{i,j} X_i \left( \hat{P}_{ij} - \hat{Q}_{ij} \right)^2$

where $\hat{P}_{ij} = X_{ij}$ and $\hat{Q}_{ij} = \exp(w_i^T \tilde{w}_j)$. Another problem: $X_{ij}$ often takes very large values. Instead, minimize the squared error of the logarithms of $\hat{P}$ and $\hat{Q}$:

$\hat{J} = \sum_{i,j} X_i \left( \log \hat{P}_{ij} - \log \hat{Q}_{ij} \right)^2 = \sum_{i,j} X_i \left( w_i^T \tilde{w}_j - \log X_{ij} \right)^2$

The Model: relation to skip-gram. Finally, replace the weighting factor $X_i$ with the more general weighting function $f(X_{ij})$:

$\hat{J} = \sum_{i,j} f(X_{ij}) \left( w_i^T \tilde{w}_j - \log X_{ij} \right)^2$

which is equivalent to the cost function

$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$

Outline Background One-hot Vectors Co-occurrence Matrix SVD word2vec GloVe Basics The Model Intrinsic Evaluation Results

Intrinsic Evaluation. Comparing intrinsic and extrinsic evaluations.⁶ Intrinsic evaluation: the evaluation of a set of word vectors generated by an embedding technique (such as word2vec or GloVe) on specific intermediate subtasks (such as analogy completion). Evaluation on a specific subtask; fast to compute performance; helps understand the subsystem; needs positive correlation with the real task to determine usefulness. ⁶ Mundra, Socher

Intrinsic Evaluation. Comparing intrinsic and extrinsic evaluations.⁷ Extrinsic evaluation: the evaluation of a set of word vectors generated by an embedding technique on the real task at hand. Evaluation on a real task; can be slow to compute performance; unclear whether the subsystem is the problem, another subsystem, or their interactions; if replacing the subsystem improves performance, the change is likely good. ⁷ Mundra, Socher

Intrinsic Evaluation example: performance in completing word vector analogies, a : b :: c : ?
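For concreteness, a minimal sketch (not from the slides) of this analogy evaluation: the answer to a : b :: c : ? is the vocabulary word whose vector is most cosine-similar to x_b - x_a + x_c, excluding the query words. The embeddings below are random stand-ins; with trained GloVe vectors the example typically returns "queen".

```python
import numpy as np

# Analogy completion by vector arithmetic and cosine similarity (toy vectors).
rng = np.random.default_rng(0)
vocab = ["king", "queen", "man", "woman", "paris", "france"]
vectors = {w: rng.normal(size=50) for w in vocab}     # stand-in embeddings

def analogy(a, b, c):
    target = vectors[b] - vectors[a] + vectors[c]
    def cosine(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    # Exclude the three query words, as is standard for this evaluation.
    candidates = [w for w in vocab if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("man", "king", "woman"))   # with real embeddings: "queen"
```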

Outline Background One-hot Vectors Co-occurrence Matrix SVD word2vec GloVe Basics The Model Intrinsic Evaluation Results

Nearest neighbors.⁸ The closest words to the target word frog: frogs, toad, litoria, leptodactylidae, rana. ⁸ https://nlp.stanford.edu/projects/glove/

Word analogies (figure).⁹ ⁹ http://cs224d.stanford.edu/lectures/cs224d-lecture2.pdf

GloVe visualization: man-woman (figure).¹⁰ ¹⁰ https://nlp.stanford.edu/projects/glove/

GloVe visualization: comparative-superlative (figure).¹¹ ¹¹ https://nlp.stanford.edu/projects/glove/

Results on the word analogy task, given as percent accuracy (table).¹² ¹² https://nlp.stanford.edu/pubs/glove.pdf

GloVe vs. skip-gram (figure).¹³ ¹³ https://nlp.stanford.edu/pubs/glove.pdf

Accuracy on the analogy task for 300-dimensional vectors trained on different corpora (figure).¹⁴ ¹⁴ https://nlp.stanford.edu/pubs/glove.pdf

Conclusion. GloVe is a new global log-bilinear regression model for the unsupervised learning of word representations that outperforms other models on word analogy, word similarity, and named entity recognition tasks.

Further Reading. T. Mikolov, K. Chen, G. Corrado, J. Dean. Efficient Estimation of Word Representations in Vector Space. International Conference on Learning Representations, 2013. E. Huang, R. Socher, C. D. Manning, A. Ng. Improving Word Representations via Global Context and Multiple Word Prototypes. Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, 873-882, 2012.