ANLP Lecture 22 Lexical Semantics with Dense Vectors

ANLP Lecture 22: Lexical Semantics with Dense Vectors
Henry S. Thompson
Based on slides by Jurafsky & Martin, some via Dorota Glowacka
5 November 2018

Previous lectures: sparse vectors recap
How to represent a word as a sparse vector, with dimensions corresponding to the words in the vocabulary
The values in the vector were a function of the count of the word co-occurring with each neighbouring word
Each word is thus represented with a vector that is long (with vocabularies of 20,000 to 50,000) and sparse (most elements of the vector for each word are zero)

Today's lecture
How to represent a word with vectors that are short (length 50 to 1,000) and dense (most values are non-zero)
Why short vectors?
Easier to include as features in machine learning systems
Because they contain fewer parameters, they generalize better and are less prone to overfitting
They may also capture synonymy better than sparse vectors, where e.g. car and automobile are separate, unrelated dimensions

Before density, another approach to normalisation
Raw frequency moves common and uncommon words away from each other in vector space
Cosine (a length-normalised dot product) corrects for vector length
PPMI is one way of trying to detect important co-occurrences, based on the divergence between observed and predicted (from unigram MLEs) bigram probabilities
Another intuition is that a word that's common in only some contexts carries more information than one that's common everywhere

Common here, but not elsewhere...
This is formalised under the name tf-idf:
tf: term frequency
idf: inverse document frequency
Originally from Information Retrieval, where there are lots of documents, often with lots of words in them

Tf-idf: combine two factors
tf: term frequency: the frequency count of term i in document d
idf: inverse document frequency: idf_i = log(N / df_i), where N is the total number of documents in the collection and df_i is the number of documents that contain term i
Terms such as "the" or "good" have very low idf, because df_i is close to N
tf-idf value for term i in document d: w_(i,d) = tf_(i,d) x idf_i
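To make the formula concrete, here is a minimal Python sketch (an illustration added for this write-up, not the lecture's code); the toy corpus and the base-10 log are my own assumptions:

import math
from collections import Counter

# Toy corpus: each document is a list of tokens (hypothetical data)
docs = [
    "the apricot jam is sweet".split(),
    "the cement mixer is loud".split(),
    "apricot preserves and apricot jam".split(),
]

N = len(docs)          # total number of documents in the collection
df = Counter()         # df_i: number of documents containing term i
for d in docs:
    df.update(set(d))

def tf_idf(doc):
    # Return a dict mapping each term in doc to its tf-idf weight
    tf = Counter(doc)
    return {t: tf[t] * math.log10(N / df[t]) for t in tf}

for i, d in enumerate(docs):
    print(i, tf_idf(d))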

Summary: tf-idf
Compare two words using tf-idf cosine to see if they are similar
Compare two documents by taking the centroid of the vectors of all the terms in each document
The centroid document vector is: d = (t_1 + t_2 + ... + t_k) / k
(And yes, that's vector addition in the numerator, with ordinary division by the scalar k)
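As a quick illustration of the centroid-and-cosine idea (a sketch with invented tf-idf vectors, not real data):

import numpy as np

# Hypothetical tf-idf vectors for individual terms (toy 3-dimensional space)
term_vectors = {
    "apricot": np.array([0.9, 0.1, 0.0]),
    "jam":     np.array([0.7, 0.2, 0.1]),
    "cement":  np.array([0.0, 0.1, 0.8]),
}

def centroid(terms):
    # Document vector d = (t_1 + ... + t_k) / k
    return np.mean([term_vectors[t] for t in terms], axis=0)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

doc1 = centroid(["apricot", "jam"])
doc2 = centroid(["cement"])
print(cosine(doc1, doc2))   # low value: the two toy documents are dissimilar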

Tf-idf and PPMI are sparse representations
tf-idf and PPMI vectors are long (length |V| = 20,000 to 50,000) and sparse (most elements are zero)
Alternative: dense vectors, which are short (length 50 to 1,000) and dense (most elements are non-zero)

Neural network-inspired dense embeddings
Methods for generating dense embeddings inspired by neural network models
Neural network language models are trained to predict contexts from words; this process amounts to learning word embeddings
The intuition is that words with similar meanings tend to occur near each other in text
The process for learning these embeddings has a strong relationship with dot-product similarity metrics
We'll look at one particular version of this: skip-grams

Word2vec
Skip-grams (and related approaches such as continuous bag of words, CBOW) are often referred to as word2vec
Code available on the web
Popular embedding methods, very fast to train
Idea: predict rather than count

Word2vec
Instead of counting how often each word w occurs near apricot, train a classifier on a binary prediction task: is w likely to show up near apricot?
We don't actually care about this task itself
But we'll take the learned classifier weights as the word embeddings

Use running text as implicitly supervised training data!
A word c near apricot acts as the gold "correct answer" to the question "Is word w likely to show up near apricot?"
No need for hand-labeled supervision
The idea comes from neural language modeling

Prediction with skip-grams
The skip-gram model predicts each neighbouring word in a context window of L words
e.g. for a context window L = 2, the context of w_t is [w_(t-2), w_(t-1), w_(t+1), w_(t+2)]
We want to predict each of these context words from the target word w_j
Skip-gram calculates the probability p(w_k | w_j) by computing the dot product between the context vector c_k of word w_k and the target vector v_j of word w_j
The higher the dot product between two vectors, the more similar they are
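A minimal sketch (not from the lecture) of extracting skip-gram (target, context) training pairs with a window of L words on each side; the sentence is the running example used later in these slides:

tokens = "lemon a tablespoon of apricot preserves or jam".split()
L = 2   # context window size

pairs = []
for t, target in enumerate(tokens):
    for k in range(max(0, t - L), min(len(tokens), t + L + 1)):
        if k != t:
            pairs.append((target, tokens[k]))

print(pairs[:6])   # e.g. ('lemon', 'a'), ('lemon', 'tablespoon'), ...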

Prediction with skip-grams
The dot product c_k . v_j is a number ranging from -inf to +inf
We use the softmax function to normalise the dot products into probabilities:
p(w_k | w_j) = exp(c_k . v_j) / sum over i in V of exp(c_i . v_j)
Computing the denominator requires computing the dot product between every word in V and the target word w_j, which may take a long time
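Here is a small numpy sketch of that softmax computation (toy vocabulary size and random embeddings, purely for illustration):

import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4                 # toy vocabulary size and embedding dimension
W = rng.normal(size=(V, d))  # target embeddings v_j, one row per word
C = rng.normal(size=(V, d))  # context embeddings c_k, one row per word

def p_context_given_target(j):
    # Softmax over dot products: p(w_k | w_j) for every k in V
    scores = C @ W[j]                      # c_k . v_j for all k in one matrix-vector product
    exp = np.exp(scores - scores.max())    # subtract the max for numerical stability
    return exp / exp.sum()

probs = p_context_given_target(3)
print(probs.sum())   # 1.0: a proper distribution over the whole vocabulary

The denominator sums over every word in V, which is exactly the cost that negative sampling (next slide) avoids.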

Skip-gram with negative sampling
Faster than using the softmax function
In the training phase, for each target word the algorithm takes the surrounding context words as positive examples
For each positive example, the algorithm samples k noise words, or negative examples, from the non-neighbouring words, according to their weighted unigram probability
The goal is to move the embeddings towards the neighbour words and away from the noise words
We're basically training a logistic regression classifier with two groups of features: positive and negative
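A minimal sketch of drawing the k noise words from a weighted unigram distribution (my own illustration; the 0.75 exponent is an assumption taken from the usual word2vec weighting, not something stated on this slide):

import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
tokens = "lemon a tablespoon of apricot preserves or jam".split()

counts = Counter(tokens)
vocab = list(counts)
# Weighted unigram distribution: P(w) proportional to count(w) ** 0.75
weights = np.array([counts[w] for w in vocab], dtype=float) ** 0.75
weights /= weights.sum()

def sample_negatives(target, context, k=2):
    # Draw k noise words, avoiding the target and the true context word
    negs = []
    while len(negs) < k:
        w = rng.choice(vocab, p=weights)
        if w not in (target, context):
            negs.append(w)
    return negs

print(sample_negatives("apricot", "jam", k=2))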

Skip-gram with negative sampling
Training sentence for the example target word apricot (context window L = 2):
lemon, a tablespoon of apricot preserves or jam
with c1 = tablespoon, c2 = of, w = apricot, c3 = preserves, c4 = or
Goal: learn an embedding whose dot product with each context word is high
We select 2 noise words for each of the context words:
cement, bacon, dear, coaxial, ocean, hence, never, puddle (n1 ... n8)
We want the noise words n_i to have a low dot product with the target embedding w

Skip-gram goal
Given a pair (t, c) = (target, context), e.g. (apricot, jam) or (apricot, aardvark)
Return the probability that c is a real context word:
P(+ | t, c)
P(- | t, c) = 1 - P(+ | t, c)

How to compute P(+ | t, c)?
Intuition: words are likely to appear near similar words
Model similarity with the dot product!
Similarity(t, c) ≈ t . c
Problem: the dot product is not a probability! (Neither is cosine)

Turning the dot product into a probability
Solution: use a sigmoid
The sigmoid lies between 0 and 1:
σ(x) = 1 / (1 + e^(-x))

Skip-gram with negative sampling
So the learning objective, for one (target, context) pair, is to maximise:
log P(+ | t, c) + sum_{i=1..k} log P(- | t, n_i)
Using σ and the dot product, this becomes:
log σ(c . w) + sum_{i=1..k} log σ(-n_i . w)
where σ is the sigmoid function, c is our good context vector, and the n_i are the vectors for our k bad (noise) context words
We want to maximise the dot product of w with c and minimise the dot products of w with all the n_i
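To make the objective concrete, here is a sketch of one stochastic gradient step for a single (target, context, noise words) triple; the toy sizes, learning rate, and initialisation are assumptions, not values from the lecture:

import numpy as np

rng = np.random.default_rng(0)
V, d, lr = 10, 4, 0.1                    # toy vocabulary size, dimension, learning rate
W = rng.normal(scale=0.1, size=(V, d))   # target embeddings w
C = rng.normal(scale=0.1, size=(V, d))   # context embeddings c

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(t, c, negs):
    # One gradient-ascent step on log sigma(c.w) + sum_i log sigma(-n_i.w)
    w = W[t]
    g = 1.0 - sigmoid(C[c] @ w)          # gradient scale for the positive pair
    grad_w = g * C[c]
    C[c] += lr * g * w                   # push the good context vector towards w
    for n in negs:
        gn = -sigmoid(C[n] @ w)          # gradient scale for a noise word
        grad_w += gn * C[n]
        C[n] += lr * gn * w              # push the noise vector away from w
    W[t] += lr * grad_w                  # update the target embedding

sgns_step(t=3, c=5, negs=[1, 7])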

Reminder: we have two matrices
We're learning two separate embeddings for each word w: a word (target) embedding v and a context embedding c
The embeddings are encoded in two matrices: the word matrix W and the context matrix C
Each row i of the word matrix W is a 1 × d vector embedding v_i for word i in the vocabulary V
Each column i of the context matrix C is a d × 1 vector embedding c_i for word i in the vocabulary V

Learning skip-gram weights
We can approach this as a logistic regression problem
Use stochastic gradient descent:
starts with randomly initialized W and C matrices
then iterates over the training corpus to maximize the objective function

Skip-gram as neural network
Or we can use a neural network to find the weights
We have an input vector x for word w_j, represented as a one-hot vector: one element = 1, and all the others equal 0
Predict the probability of each of the output words in 3 steps:
1. Select the embedding from W: x is multiplied by W to give the hidden (projection) layer
2. Compute the dot products c_k . v_j: for each of the context words, multiply the projection vector by the context matrix C. This produces a 1 × |V| output vector with a score for each word in V
3. Normalize the dot products into probabilities, using the softmax
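A minimal numpy sketch of those three steps with a one-hot input (toy sizes, random weights, illustration only):

import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4
W = rng.normal(size=(V, d))    # word (target) embedding matrix, one row per word
C = rng.normal(size=(d, V))    # context embedding matrix, one column per word

j = 3
x = np.zeros(V)
x[j] = 1.0                     # one-hot input vector for word w_j

h = x @ W                      # step 1: selects row j of W (the projection layer)
scores = h @ C                 # step 2: dot product of v_j with every context vector
exp = np.exp(scores - scores.max())
probs = exp / exp.sum()        # step 3: softmax over the vocabulary

print(probs.shape, probs.sum())   # (10,) 1.0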

Skip-gram as neural network, cont'd
The resulting network looks like this:
If we look at just one output unit, we have a yes-no classifier, just like the one from lecture 19 (on slide 13)
For the good context words we want yes results
For the bad context words we want no results

What we actually want
A reminder that we're not really interested in the (logistic regression) classifier or the neural network context predictor
We just want their weights, to use as word embeddings
We can either just use the weights from W, discarding C altogether, or combine W and C
As usual, whether/how we do this is a matter for tuning to the particular application in view
Which is also the case for the context window size L

Some real embeddings
Examples of the closest tokens to some target words, using a phrase-based extension of the skip-gram algorithm (Mikolov et al. 2013):

Redmond              Havel                    ninjutsu        graffiti      capitulate
Redmond Wash.        Vaclav Havel             ninja           spray paint   capitulation
Redmond Washington   president Vaclav Havel   martial arts    graffiti      capitulated
Microsoft            Velvet Revolution        swordsmanship   taggers       capitulating

Properties of embeddings
Offsets between embeddings can capture relations between words
e.g. vector(king) + (vector(woman) - vector(man)) is close to vector(queen)
Offsets can also capture grammatical number
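A small sketch of the offset computation (the 3-dimensional embeddings below are invented purely so that the analogy works; real embeddings are learned, not hand-written):

import numpy as np

emb = {
    "king":  np.array([0.8, 0.9, 0.1]),
    "man":   np.array([0.7, 0.1, 0.1]),
    "woman": np.array([0.7, 0.1, 0.9]),
    "queen": np.array([0.8, 0.9, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def nearest(v, exclude):
    # Vocabulary word whose embedding has the highest cosine with v
    return max((w for w in emb if w not in exclude), key=lambda w: cosine(emb[w], v))

target = emb["king"] + (emb["woman"] - emb["man"])
print(nearest(target, exclude={"king", "woman", "man"}))   # 'queen' with these toy vectors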

Singular Value Decomposition (SVD)
[Optional background from here on]
SVD is a method for finding the most important dimensions of a dataset
SVD belongs to a family of methods that can approximate an N-dimensional dataset using fewer dimensions, such as Principal Component Analysis (PCA) or Factor Analysis
It was first applied to the task of generating embeddings from term-document matrices, in Latent Semantic Analysis (LSA)

Singular Value Decomposition, cont'd
Dimensionality reduction methods first rotate the axes of the original dataset into a new space
The new space is chosen so that the highest-order dimension captures the most variance in the original dataset
The next dimension captures the next most variance, and so on
While some information about the relationships between the original points is necessarily lost in the transformation, the remaining dimensions preserve as much of the original structure as possible

Principal Component Analysis
Here's an example of PCA:

Latent Semantic Analysis (LSA)
LSA is a particular application of SVD to a |V| × c term-document matrix X, representing |V| words and their co-occurrence with c documents
Cell values are some version of normalised counts, e.g. PPMI or tf-idf
SVD factorises the matrix X into the product of three matrices W, Σ and C:
X = W Σ C
where X is |V| × c, W is |V| × m, Σ is the m × m diagonal matrix diag(σ_1, σ_2, ..., σ_m), and C is m × c

Latent Semantic Analysis factors
W: a |V| × m matrix, where each row represents a word and each column represents one of m dimensions in a latent space
The m column vectors are orthogonal to each other and are ordered by the amount of variance they capture in the original dataset
m = rank of X (the number of linearly independent rows)
Σ: a diagonal m × m matrix with the singular values along the diagonal, expressing the importance of each dimension
C: an m × c matrix, where each row now represents one of the latent dimensions; the m row vectors are orthogonal to each other

LSA: dimensionality reduction
By using only the top k dimensions of W, Σ and C, the product of these 3 matrices becomes a least-squares approximation to the original X
Since the highest-order dimensions encode the most variance, the truncated SVD keeps the most important information in the original X
That is, W gives us lower-dimensional embeddings (fewer columns) for all the words (same number of rows)

LSA: dimensionality reduction, cont'd
Taking only the top k ≤ m dimensions after SVD is applied to the co-occurrence matrix X:
X ≈ W_k Σ_k C_k
where X is |V| × c, W_k is |V| × k, Σ_k = diag(σ_1, ..., σ_k) is k × k, and C_k is k × c
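A minimal numpy sketch of truncating the SVD of a toy term-document matrix to k dimensions and reading off the k-dimensional word embeddings from W_k (the matrix values are invented):

import numpy as np

# Toy |V| x c term-document matrix (values invented for illustration)
X = np.array([
    [2.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 3.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 2.0],
    [1.0, 1.0, 1.0, 1.0],
])

U, s, Vt = np.linalg.svd(X, full_matrices=False)    # X = U @ diag(s) @ Vt

k = 2
W_k = U[:, :k]                                      # |V| x k: one k-dimensional embedding per word
X_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # rank-k least-squares approximation of X

print(W_k.shape)             # (5, 2)
print(np.round(X_approx, 2))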

Word embedding vs. document classification
Notice that, because what we care about here is words, in both X and W each row corresponds to a distinct word type
This is in contrast to last week's lab, where we were interested in classifying documents, and each column corresponded to a different word type

SVD and LSA
Using only the top k dimensions leads to a reduced W matrix, with one k-dimensional row per word
This row acts as a dense k-dimensional vector (embedding) representing that word
LSA embeddings generally set k = 300
LSA applies a particular weighting to each co-occurrence cell, multiplying two weights: a local weight and a global weight

LSA term weighting
The local weight of each term i in document j is its log frequency: log(f(i,j) + 1)
The global weight of term i is a version of its entropy with respect to the entire document collection:
1 + (sum over j of p(i,j) log p(i,j)) / log D
where D is the number of documents
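A sketch of this log-entropy weighting (my own reading of the standard scheme, not code from the lecture; in particular, taking p(i,j) to be term i's count in document j divided by its total count over the collection is an assumption):

import numpy as np

# Toy term-document count matrix f[i, j] (values invented)
f = np.array([
    [2.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
    [0.0, 3.0, 0.0],
])
D = f.shape[1]                       # number of documents

local = np.log(f + 1)                # local weight: log(f(i,j) + 1)

p = f / f.sum(axis=1, keepdims=True) # assumed definition of p(i,j)
with np.errstate(divide="ignore", invalid="ignore"):
    plogp = np.where(p > 0, p * np.log(p), 0.0)
entropy = 1 + plogp.sum(axis=1) / np.log(D)    # global weight of each term

weighted = local * entropy[:, None]            # cell values that LSA would pass to SVD
print(np.round(weighted, 3))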

SVD and word-context
In LSA, SVD is applied to a term-document matrix
An alternative is to apply SVD to a word-word or word-context matrix, where the context dimensions are words (rather than documents, as in LSA)
This relies on a PPMI-weighted word-word matrix
Only the top dimensions are used: truncated SVD

Word embeddings in truncated SVD