Word Embeddings 2 - Class Discussions

Jalaj
February 18, 2016

Opening Remarks - Word embeddings as a concept are intriguing. The approaches are mostly ad-hoc but show good empirical performance.

Paper 1 - Skip-Gram Model (Mikolov et al.)

1. How do word representations encode linguistic regularities and patterns? Consider this example: vec("Madrid") - vec("Spain") + vec("France") is closer to vec("Paris") than to any other word vector. This points to a linear structure in these language patterns.

2. It is like subtracting and adding contexts: from vec("Madrid") we removed its context by subtracting vec("Spain"), then added the context of vec("France") to get to vec("Paris").

3. Question: Why do these patterns (which can be represented by simple algebraic operations) occur?

4. Skip-gram model: the objective is to obtain word representations that are useful for predicting the surrounding words in a document. Given training words w_1, w_2, ..., w_T and context size c, we want to maximize

   \frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t),

   where

   p(w_{t+j} \mid w_t) = p(w_O \mid w_I) = \frac{\exp(\chi_{w_O}^T \rho_{w_I})}{\sum_{w=1}^{W} \exp(\chi_w^T \rho_{w_I})}.   (1)

   The parameters of this model are \theta = \{\chi_{1:W}, \rho_{1:W}\}. Problem: the derivatives needed for optimization are very expensive to compute, since the normalizer in (1) runs over the whole vocabulary. (A toy computation of this softmax is sketched below item 5.)

5. View p(w_O \mid w_I) as a potential function. Can we do variational inference or stochastic variational inference here?
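As a concrete illustration of the softmax in (1) and why each gradient step is expensive, here is a minimal NumPy sketch; the vocabulary size, embedding dimension, and the random chi and rho matrices are hypothetical stand-ins, not values from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)
W, d = 10_000, 100                          # toy vocabulary size and embedding dimension (assumed)
chi = rng.normal(scale=0.1, size=(W, d))    # "output" vectors chi_{1:W}
rho = rng.normal(scale=0.1, size=(W, d))    # "input" vectors rho_{1:W}

def p_out_given_in(w_I):
    """Full softmax of equation (1): exp(chi_{w_O}^T rho_{w_I}) / sum_{w=1}^W exp(chi_w^T rho_{w_I})."""
    scores = chi @ rho[w_I]                 # one dot product per vocabulary word: O(W d) per training pair
    scores -= scores.max()                  # shift for numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

probs = p_out_given_in(w_I=42)
print(probs.shape, probs.sum())             # (10000,) 1.0 -- every word enters the normalizer (and the gradient)
```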

6. Skip-gram model vs. Bengio's (NPL) model and CBOW (Continuous Bag-of-Words representation): in skip-gram we predict the contextual words given the current word, \sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t). In NPL and CBOW, given the contextual words, we predict the current word, p(w_t \mid w_{t-1}, ..., w_{t-c}). We are still encoding contextual information in both; does this flip matter?

7. We didn't really discuss Hierarchical Softmax, which is a computationally efficient approximation of the full softmax.

8. NCE (Noise Contrastive Estimation) is introduced as an alternative to Hierarchical Softmax, and Negative Sampling is introduced as a simplification of NCE.

9. NCE - connections to Generative Adversarial Networks? Possible relations to model checking and goodness of fit? We didn't really comment much here.

10. Negative Sampling: has the following objective, which replaces \log p(w_O \mid w_I) as in (1):

    \chi_{w_O}^T \rho_{w_I} - \log(1 + \exp(\chi_{w_O}^T \rho_{w_I})) + \sum_{k=1}^{K} \left( -\chi_{w_k}^T \rho_{w_I} - \log(1 + \exp(-\chi_{w_k}^T \rho_{w_I})) \right),   (2)

    i.e. \log \sigma(\chi_{w_O}^T \rho_{w_I}) + \sum_{k=1}^{K} \log \sigma(-\chi_{w_k}^T \rho_{w_I}), where w_1, ..., w_K are sampled noise words. (A sketch contrasting (1) and (2) appears after item 13.)

11. Intuition: for each word w_I, consider two classes - words w_O that co-occur with w_I and words w_k that don't co-occur with w_I (let's refer to them as noise). Maximizing (2) is maximizing the probability of being able to distinguish w_O from the noise words w_k.

12. Can we add some regularization term to (2)? Since we are simultaneously learning \{\chi_{1:W}, \rho_{1:W}\}, this makes sense.

13. There was no consensus on whether (2) is some sort of Taylor approximation or Monte Carlo estimate of (1), or whether it is a new objective function in its own right that happens to give high-quality word representations. Can variational inference or stochastic variational inference give something like Negative Sampling?
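To make the contrast between (1) and (2) concrete, here is a minimal NumPy sketch of the negative-sampling objective for one (w_I, w_O) pair, meant to be used with the toy chi and rho arrays from the sketch above; drawing the noise words uniformly is a simplification (word2vec draws them from a unigram distribution raised to the 3/4 power):

```python
import numpy as np

rng = np.random.default_rng(1)

def log_sigmoid(x):
    """log sigma(x) = -log(1 + exp(-x)), computed stably via logaddexp."""
    return -np.logaddexp(0.0, -x)

def negative_sampling_objective(chi, rho, w_I, w_O, K=5):
    """Equation (2): log sigma(chi_{w_O}^T rho_{w_I}) + sum_{k=1}^K log sigma(-chi_{w_k}^T rho_{w_I})."""
    W = chi.shape[0]
    positive = log_sigmoid(chi[w_O] @ rho[w_I])          # the observed (input, output) pair
    noise_ids = rng.integers(0, W, size=K)               # K "noise" words (uniform here for simplicity)
    negative = log_sigmoid(-(chi[noise_ids] @ rho[w_I])).sum()
    return positive + negative                           # touches only K+1 rows of chi, not all W as in (1)

# usage with the toy embeddings from the previous sketch:
# print(negative_sampling_objective(chi, rho, w_I=42, w_O=7))
```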

14. I go with the first view: Negative Sampling is some sort of approximation of NCE, which in itself approximates (1).

Paper 2 - GloVe (Manning et al.)

1. Two broad classes of methods: (a) global matrix factorization (of the word-word co-occurrence matrix) methods, which perform poorly on word analogy tasks; (b) methods that take local contexts into account, like the skip-gram model, which do better on analogy tasks (indicating they learn a finer structure) but are trained on separate local context windows instead of leveraging information from the entire corpus.

2. Model: X is the matrix of word-word co-occurrences, X_{ij} is the number of times word j occurs in the context of word i, and X_i = \sum_k X_{ik} is the number of times any word appears in the context of word i.

3. The contextual information is indeed captured through this big matrix. This combines global matrix factorization ideas with local contextual ideas.

4. p_{ij} = p(w_O = j \mid w_I = i) \approx X_{ij} / X_i. By assuming a linear structure, they posit a log-linear model:

   \log(p_{ij}) = \chi_j^T \rho_i, i.e. \log(X_{ij}) = \chi_j^T \rho_i + b_i + b_j

   for some bias terms b_i and b_j. I have continued the notation of \chi_j being the word representation, as in (1).

5. Criticism: the authors assume a linear relationship in the model rather than coming up with a model that exhibits linear relationships (contrast this with Arora et al., who come up with an underlying generative model and derive the linear structure).

6. They then go on to solve a weighted least squares problem with the loss function

   \sum_{i,j} f(X_{ij}) (\chi_j^T \rho_i + b_i + b_j - \log(X_{ij}))^2.

   (A sketch of this loss appears just below; the weighting function f is discussed in item 7.)
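A rough NumPy sketch of the weighted least squares loss in item 6. The weighting function below, f(x) = min((x / x_max)^{0.75}, 1), is the specific choice reported in the GloVe paper; the co-occurrence counts and vector dimensions are made-up stand-ins:

```python
import numpy as np

rng = np.random.default_rng(2)
V, d = 500, 50                                     # toy vocabulary size and embedding dimension (assumed)
X = rng.poisson(1.0, size=(V, V)).astype(float)    # stand-in co-occurrence counts X_ij

def f(x, x_max=100.0, alpha=0.75):
    """GloVe weighting: down-weights rare pairs, caps the influence of very frequent ones."""
    return np.minimum((x / x_max) ** alpha, 1.0)

def glove_loss(chi, rho, b, b_tilde, X):
    """sum_{i,j} f(X_ij) * (chi_j^T rho_i + b_i + b_j - log X_ij)^2, over observed (nonzero) X_ij."""
    i, j = np.nonzero(X)                           # log X_ij is only defined for observed pairs
    residual = np.einsum('nd,nd->n', rho[i], chi[j]) + b[i] + b_tilde[j] - np.log(X[i, j])
    return np.sum(f(X[i, j]) * residual ** 2)

chi = rng.normal(scale=0.1, size=(V, d))           # context-word vectors chi_j
rho = rng.normal(scale=0.1, size=(V, d))           # word vectors rho_i
b, b_tilde = np.zeros(V), np.zeros(V)              # bias terms b_i, b_j
print(glove_loss(chi, rho, b, b_tilde, X))
```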

7. f(·) is just a weighting function to account for contagion (certain words occurring together in many articles, like the frequent co-occurrence of "in", "the"). The choice of f(·) seems ad-hoc. Better ways to deal with it? Integrated/robust models?

8. A comment was made in class about why the squared loss was chosen in the objective function. The objective function of the skip-gram model can be reduced to minimizing a weighted sum of cross-entropy errors (with long tails) and also requires computing the normalizing constant, both of which are undesirable. As an alternative, they choose the squared loss; a very nice discussion of this is given in Section 3.1.

9. There was discussion of some model selection questions that can be explored:
   (a) Choice of the dimension of the embeddings? Size of the context window?
   (b) Can we have different context lengths? Can we weight along context lengths?
   (c) Bayesian nonparametric models for choosing the context length? Cross-validation can be expensive.

Paper 3 - RAND-WALK (Arora et al.)

1. In the PMI (Pointwise Mutual Information) model, we have a symmetric matrix whose entries are

   PMI(w, w') = \log \frac{p(w, w')}{p(w)\, p(w')},

   where p(w) is the marginal probability of word w and p(w, w') is the probability of words w, w' appearing together within a given context size.

2. A low-rank SVD of the PMI matrix gives the word vectors. Since the approximation v_w^T v_{w'} \approx PMI(w, w') is good in practice, it points to the linear structure. Arora et al. provide a generative model and show that

   PMI(w, w') = \frac{v_w^T v_{w'}}{d} + O(\epsilon)   (4)

   for some constants d and \epsilon. Thus their generative model implies the linear structure (as opposed to Pennington et al., who sort of assume it). (A sketch of the PMI-plus-SVD construction appears after this item.)
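A minimal sketch of the construction in items 1-2: build a PMI matrix from empirical co-occurrence probabilities and take a rank-d truncated SVD to get word vectors. The counts here are random stand-ins, and zeroing out the log(0) entries is a common practical shortcut rather than something from the notes:

```python
import numpy as np

rng = np.random.default_rng(3)
V, d = 300, 50                                     # toy vocabulary size and target rank (assumed)
C = rng.poisson(2.0, size=(V, V)).astype(float)
C = C + C.T                                        # symmetric co-occurrence counts

p_joint = C / C.sum()                              # empirical p(w, w')
p_word = p_joint.sum(axis=1)                       # marginals p(w)
with np.errstate(divide="ignore"):
    pmi = np.log(p_joint / np.outer(p_word, p_word))
pmi[~np.isfinite(pmi)] = 0.0                       # replace log(0) entries (practical shortcut)

# rank-d truncated SVD: the best rank-d approximation of the PMI matrix
U, S, Vt = np.linalg.svd(pmi)
word_vecs = U[:, :d] * np.sqrt(S[:d])              # word vectors v_w
ctx_vecs = Vt[:d].T * np.sqrt(S[:d])               # context vectors

approx = word_vecs @ ctx_vecs.T
print(np.linalg.norm(pmi - approx) / np.linalg.norm(pmi))   # relative reconstruction error
```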

[Figure: the latent variable random walk model - a chain of discourse vectors c_1, c_2, c_3, ..., c_T, each emitting an observed word w_1, w_2, w_3, ..., w_T.]

3. Their model is a latent variable random walk model, as shown above.

4. Here c_t \in R^d represents the discourse vector at time t, representing the topics currently being talked about. Every latent word vector v_w \in R^d captures the correlation of word w with the discourse as

   p(\text{word } w \text{ emitted at time } t \mid c_t) \propto \exp(c_t^T v_w).   (5)

   (A toy simulation of this model is sketched below item 7.)

5. Using (5) and integrating over the discourse vectors c, they are able to derive (4).

6. We briefly tried to understand the relationship between analogies and likelihood ratios as given in Section 3, i.e. the justification of the following statement: "woman" is the solution to the analogy king:queen::man:y because

   \frac{p(\chi \mid \text{king})}{p(\chi \mid \text{queen})} \approx \frac{p(\chi \mid \text{man})}{p(\chi \mid \text{woman})},

   where \chi is any word. Below are some of my thoughts on this issue (which readers may or may not find useful), inspired by Maja and Adi's comments on Piazza; they mention a nice point about interchangeability.

7. One observation is to look at p(\chi \mid \text{king}) as the probability of \chi occurring in the context of the word "king". Now how does the ratio help? Let's view \chi as a probe word. For probe words which are neither related to "king" nor "queen", like "tree", or which are related to both, like "marriage", the ratio is close to 1. For probe words that favor "king" more, like "fight", the ratio will be large, and for those which favor "queen", like "beauty", it will be small. This is the same intuition as given by Pennington et al. in Table 1; one can look at the magnitudes to get a sense of how the ratios help in nicely classifying probe words into three buckets. Now we want a word y whose context classifies probe words into the same buckets (when the ratio is taken with the context of "man").
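Returning to the generative model in items 3-5, here is a minimal simulation sketch: a slowly drifting discourse vector c_t on the unit sphere emits one word per step according to (5). All sizes and step scales are hypothetical choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(4)
V, d, T = 1000, 20, 50                                 # toy vocabulary, dimension, sequence length (assumed)
v = rng.normal(scale=1.0 / np.sqrt(d), size=(V, d))    # latent word vectors v_w

def emission_probs(c_t):
    """Equation (5): p(word w emitted at time t | c_t) proportional to exp(c_t^T v_w)."""
    scores = v @ c_t
    probs = np.exp(scores - scores.max())
    return probs / probs.sum()

c = rng.normal(size=d)
c /= np.linalg.norm(c)                                 # discourse vector on the unit sphere
words = []
for t in range(T):
    words.append(rng.choice(V, p=emission_probs(c)))   # emit a word from the current discourse
    c = c + rng.normal(scale=0.05, size=d)             # small step => slowly moving discourse
    c /= np.linalg.norm(c)
print(words[:10])                                      # a sample of the generated word indices
```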

8. We want to look for y such that the probe words which capture similar dimensions of "king" and "queen" would also capture similar dimensions of "man" and y. Probe words which favor "king" capture some dimension with respect to "queen" - say masculinity - and those which favor "queen" capture femininity. (Pardon any poor choice of examples in the above paragraphs.)

9. In some sense these probe words define a direction in the space of word representations - the direction to go from "king" to "queen". Following the same direction from "man" leads to "woman".

10. This is put more formally in the random walk model. There is an underlying slow random walk of the discourse vector c_t, and the probe words capture similar features between "king" and "man" through their correlation with c_t:

    \frac{p(\chi \mid \text{king})}{p(\chi \mid \text{man})} = \prod_t \frac{\exp(c_t^T v_\chi)\, \exp(c_t^T v_{\text{king}})}{\exp(c_t^T v_\chi)\, \exp(c_t^T v_{\text{man}})} = \prod_t \exp\big(c_t^T (v_{\text{king}} - v_{\text{man}})\big),

    \frac{p(\chi \mid \text{queen})}{p(\chi \mid y)} = \prod_t \exp\big(c_t^T (v_{\text{queen}} - v_y)\big).

    So essentially (v_{\text{queen}} - v_y) must be aligned along (v_{\text{king}} - v_{\text{man}}). (A toy numeric check of this cancellation is sketched after item 11.)

11. Probe words are used since the c_t are unobserved latent variables that are marginalized out. In equations 3.4, 3.5 and 3.6, they show that v_{\text{king}} - v_{\text{man}} = v_{\text{queen}} - v_{\text{woman}} = \mu_R + \text{small noise}. The proof that the noise is small follows from the isotropy assumption.
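As a toy numeric check of the cancellation in item 10 (all vectors below are random stand-ins; the only point is that the probe word's vector drops out of the ratio, leaving a term driven by the difference vector):

```python
import numpy as np

rng = np.random.default_rng(5)
d = 20
c = rng.normal(size=d)
c /= np.linalg.norm(c)                               # a single discourse vector c_t
v_king, v_man, v_probe = rng.normal(size=(3, d))     # random stand-in word vectors

def emission_ratio(v_a, v_b, v_chi, c_t):
    """Ratio of the unnormalized emission weights exp(c_t^T v_chi) exp(c_t^T v_a)
    over exp(c_t^T v_chi) exp(c_t^T v_b), as in item 10."""
    num = np.exp(c_t @ v_chi) * np.exp(c_t @ v_a)
    den = np.exp(c_t @ v_chi) * np.exp(c_t @ v_b)
    return num / den

lhs = emission_ratio(v_king, v_man, v_probe, c)
rhs = np.exp(c @ (v_king - v_man))                   # the probe vector has cancelled
print(np.isclose(lhs, rhs))                          # True: the ratio depends only on v_king - v_man

# consequence: if v_queen - v_y is aligned with v_king - v_man, the two ratios in item 10 match
```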