Natural Language Processing


Natural Language Processing: Word Vectors. Many slides borrowed from Richard Socher and Chris Manning.

Lecture plan: word representations; word vectors (embeddings); the skip-gram algorithm; relation to matrix factorization; evaluation.

Representing words

Representing words. Definition of "meaning" (Webster's dictionary): the idea that is represented by a word, phrase, etc.; the idea that a person wants to express by using words, signs, etc.; the idea that is expressed in a work of writing, art, etc. In linguistics: signifier <-> signified (the idea or thing), i.e. denotation.

Representing words with computers. One approach: a word is the set of meanings it has in a taxonomy (a graph of meanings). Hypernym: the is-a relation; hyponym: the inverse of hypernym.

Drawbacks. Expensive! Subjective (how do you split senses into different synsets?). Incomplete: misses words and senses such as wicked, badass, nifty, crack, ace, wizard, genius, ninja. Missing functionality: how do you compute word similarity? How do you compose meanings?

Discrete representation. Words are atomic symbols (one-hot representation). For V = {hotel, motel, walk, wife, spouse}: hotel = [1 0 0 0 0], motel = [0 1 0 0 0], walk = [0 0 1 0 0], wife = [0 0 0 1 0], spouse = [0 0 0 0 1]. In practice $|V| \approx 100{,}000$.
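As a concrete illustration, a minimal sketch of the one-hot encoding for this toy vocabulary (the ordering is just the example above):

```python
import numpy as np

# Toy vocabulary from the slide; real vocabularies have around 100,000 entries.
vocab = ["hotel", "motel", "walk", "wife", "spouse"]
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return the one-hot vector of a word in the toy vocabulary."""
    v = np.zeros(len(vocab))
    v[word_to_id[word]] = 1.0
    return v

# Distinct one-hot vectors are orthogonal, so they carry no notion of similarity.
print(one_hot("hotel") @ one_hot("motel"))   # 0.0
print(one_hot("hotel") @ one_hot("hotel"))   # 1.0
```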

Drawback: query pairs such as "Barack Obama's wife" / "Barack Obama's spouse" and "Seattle motels" / "Seattle hotels" should be recognized as near-equivalent, unlike "Barack Obama's wife" / "Barack Obama's advisors" or "Seattle motels" / "Seattle attractions". But with the one-hot representation all word vectors are orthogonal and equidistant. Goal: word vectors with a natural notion of similarity, e.g. $\langle v_{\text{hotel}}, v_{\text{motel}} \rangle > \langle v_{\text{hotel}}, v_{\text{spouse}} \rangle$.

Distributional similarity. "You shall know a word by the company it keeps" (Firth, 1957). Examples: "...cashed a check at the bank across the street...", "...that bank holds the mortgage on my home...", "...said that the bank raised his forecast for...", "...employees of the bank have confessed to the charges...". Central idea: represent words by their context.

Idea 1: represent each word by its context counts. wife -> {met: 3, married: 4, children: 2, wedded: 1, ...}; spouse -> {met: 2, married: 5, children: 2, kids: 1, ...}. Problem: the counts treat context words as atomic, so the representation does not know that married ≈ wedded or that children ≈ kids.

Distributed representations. language = [0.278, 0.911, 0.792, 0.177, 0.109, 0.542, 0.0003]. Represent words and contexts as low-dimensional dense vectors.

Word vectors

Supervised learning (intro to ML prerequisite). Input: $\{(x_i, y_i)\}_{i=1}^{N}$ with $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$. Output (a probabilistic model): $f : \mathcal{X} \to \mathcal{Y}$, $f(x) = \arg\max_y p(y \mid x)$. Example: train a spam detector from spam and non-spam e-mails.

Word embeddings. Example sentence: "that bank holds the mortgage on my home". Step 1: define a supervised learning task from raw text (no manual annotation!), pairing each center word x with the words y in its context window, e.g.: (x, y) = (bank, that), (bank, holds), (holds, bank), (holds, the), (the, holds), (the, mortgage), (mortgage, the), (mortgage, on), (on, mortgage), (on, my), (my, on), (my, home).

Word embeddings (intro to ML prerequisite). Step 2: define a model for the output given the input, e.g. p(holds | bank):

$$p_\theta(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w=1}^{|V|} \exp(u_w^\top v_c)}$$

where $u_o$ is the vector of the outside (context) word, $v_c$ is the vector of the center word, $|V|$ is the number of words in the vocabulary, $\theta$ denotes all parameters, and $u, v \in \mathbb{R}^d$. This is a multi-class classification model (with how many classes?). How many parameters are in the model? $|\theta| = 2|V|d$.
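A minimal NumPy sketch of this softmax model; the matrices, vocabulary size, and dimension are made-up toy values, not parameters from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
V_size, d = 10, 4                  # toy vocabulary size and embedding dimension
U = rng.normal(size=(V_size, d))   # one outside vector u_w per vocabulary word
Vec = rng.normal(size=(V_size, d)) # one center vector v_w per vocabulary word

def p_outside_given_center(c):
    """Return the distribution p(o | c) = softmax(U v_c) over the whole vocabulary."""
    scores = U @ Vec[c]                    # u_w^T v_c for every word w
    scores -= scores.max()                 # for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()   # the denominator is the partition function

probs = p_outside_given_center(c=3)
print(probs.sum())         # 1.0: a proper distribution over |V| classes
print(U.size + Vec.size)   # 2 * |V| * d parameters, as stated on the slide
```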

Word embeddings (intro to ML prerequisite). Step 3: define an objective function for a corpus of length T:

$$L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \ne 0}} p_\theta(w_{t+j} \mid w_t)$$

$$J(\theta) = \log L(\theta) = \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log p_\theta(w_{t+j} \mid w_t)$$

Find the parameters that maximize the objective.

Word embeddings (intro to ML prerequisite). Intuitions: What probabilities would maximize the objective? Why should similar words have similar vectors? Why do we have different parameters for the center word and the outside word? Toy example: suppose $c(x, y) = 2$ and $c(x, z) = 1$. Then

$$J(\theta) = p(y \mid x)^2\, p(z \mid x) = p(y \mid x)^2\,(1 - p(y \mid x))$$

$$\nabla J(\theta) = 2\,p(y \mid x) - 3\,p(y \mid x)^2 = p(y \mid x)\,(2 - 3\,p(y \mid x)) = 0$$

$$\Rightarrow\quad p(y \mid x) = \tfrac{2}{3}, \qquad p(z \mid x) = \tfrac{1}{3}$$

so the maximizing probabilities are exactly the empirical co-occurrence frequencies.


Gradient descent (intro to ML prerequisite). How do we find the right model parameters? Start at some point and move in the direction opposite to the gradient.

Gradient descent (intro to ML prerequisite). Example: $f(x) = x^4 + 3x^3 + 2$, with $f'(x) = 4x^3 + 9x^2$.

Gradient descent (intro to ML prerequisite). We want to minimize the negative log-likelihood

$$J(\theta) = -\sum_{t=1}^{T} \sum_{j} \log p_\theta(w_{t+j} \mid w_t)$$

Update rule, per parameter and in vector form, with $\theta \in \mathbb{R}^{2|V|d}$:

$$\theta_j^{\text{new}} = \theta_j^{\text{old}} - \alpha\,\frac{\partial J(\theta)}{\partial \theta_j}, \qquad \theta^{\text{new}} = \theta^{\text{old}} - \alpha\,\nabla J(\theta)$$

where $\alpha$ is a step size.
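A small sketch applying this update rule to the toy function f(x) = x^4 + 3x^3 + 2 from the earlier gradient-descent slide; the starting point and step size are arbitrary choices:

```python
def f(x):
    return x**4 + 3 * x**3 + 2

def f_prime(x):
    return 4 * x**3 + 9 * x**2

x = -1.0         # arbitrary starting point
alpha = 0.01     # step size
for _ in range(200):
    x = x - alpha * f_prime(x)   # move in the direction opposite to the gradient

print(x, f(x))   # approaches the local minimum at x = -9/4 = -2.25
```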

Stochastic gradient descent (intro to ML prerequisite). For large corpora (billions of tokens) the full-batch update is very slow. Instead, sample a window $t$ and update the parameters based on that window alone: $\theta^{\text{new}} = \theta^{\text{old}} - \alpha\,\nabla J_t(\theta)$.

Deriving the gradient. Mostly applications of the chain rule. Let's derive the gradient of $\log p_\theta(w_{t+j} \mid w_t)$ for a single window $t$ and a center word. You will do this again in the assignment (and more).

Whiteboard
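The whiteboard derivation itself is not transcribed; for reference, here is a reconstruction of the standard chain-rule derivation for the center-word vector under the softmax model defined above:

$$
\begin{aligned}
\frac{\partial}{\partial v_c}\,\log p_\theta(o \mid c)
&= \frac{\partial}{\partial v_c}\Big(u_o^\top v_c - \log \sum_{w=1}^{|V|} \exp(u_w^\top v_c)\Big) \\
&= u_o - \frac{\sum_{w=1}^{|V|} \exp(u_w^\top v_c)\,u_w}{\sum_{w'=1}^{|V|} \exp(u_{w'}^\top v_c)} \\
&= u_o - \sum_{w=1}^{|V|} p_\theta(w \mid c)\,u_w
\end{aligned}
$$

i.e. the observed outside vector minus the outside vector expected under the model.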

Class 2 recap. Goal: represent words with low-dimensional vectors. Approach: define a supervised learning problem from a corpus. We defined the necessary components for skip-gram: the model (a softmax over outside words for each center word), the objective (minimize the negative log-likelihood), and the optimization method (SGD). We computed the gradient for some parameters by hand.

Computational problem. Computing the partition function (the softmax denominator over the whole vocabulary) is too expensive. Solution 1: hierarchical softmax (Morin and Bengio, 2005), which reduces the computation to $O(\log |V|)$ by constructing a binary tree over the vocabulary. Solution 2: change the model, i.e. skip-gram with negative sampling (home assignment 1).

Logistic regression. Recast the task as binary classification over (center, outside) pairs, where observed pairs get label 1 and sampled pairs get label 0: (x, y) = ((bank, holds), 1), ((bank, table), 0), ((bank, eat), 0), ((holds, bank), 1), ((holds, quickly), 0), ((holds, which), 0), ((the, mortgage), 1), ((the, eat), 0), ((the, who), 0). What information is lost? $\sum_{o \in V} p(y = 1 \mid o, c) = \,?$ (unlike the softmax model, these probabilities need not sum to 1 over the vocabulary).

Logistic regression (intro to ML prerequisite). Model:

$$p_\theta(y = 1 \mid c, o) = \frac{1}{1 + \exp(-u_o^\top v_c)} = \sigma(u_o^\top v_c), \qquad p_\theta(y = 0 \mid c, o) = 1 - \sigma(u_o^\top v_c) = \sigma(-u_o^\top v_c)$$

Objective (maximized over all windows, with $k$ negative samples $w^{(k)}$ drawn from $p(w)$):

$$\sum_{t,j} \Big[ \log \sigma\big(u_{w_{t+j}}^\top v_{w_t}\big) + \sum_{k,\; w^{(k)} \sim p(w)} \log \sigma\big(-u_{w^{(k)}}^\top v_{w_t}\big) \Big]$$

where the negative-sampling distribution is $p(w) = U(w)^{3/4} / Z$ (the unigram distribution raised to the 3/4 power, normalized).
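A minimal NumPy sketch of this negative-sampling objective for one observed (center, outside) pair with k sampled negatives; the vectors and the unigram counts are toy placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
V_size, d, k = 10, 4, 5
U = rng.normal(size=(V_size, d))    # outside vectors u_w
Vec = rng.normal(size=(V_size, d))  # center vectors v_w

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Negative-sampling distribution p(w) = U(w)^(3/4) / Z (toy unigram counts).
unigram = rng.random(V_size)
p_neg = unigram ** 0.75
p_neg /= p_neg.sum()

def neg_sampling_loss(center, outside):
    """Negative log of the objective for one observed pair plus k negative samples."""
    negatives = rng.choice(V_size, size=k, p=p_neg)               # k sampled words
    pos_term = np.log(sigmoid(U[outside] @ Vec[center]))          # label 1 term
    neg_term = np.log(sigmoid(-U[negatives] @ Vec[center])).sum() # label 0 terms
    return -(pos_term + neg_term)

print(neg_sampling_loss(center=3, outside=7))
```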

Summary. We defined the three necessary components: the model (binary classification), the objective (maximum likelihood with negative sampling), and the optimization method (SGD).

Many variants. CBOW: predict the center word from its context. Choices in defining the context: How big is the window? Is it sequential or based on syntactic information? A different model for every context position? Keep or drop stop words?

Matrix factorization

Matrix factorization (Landauer and Dumais, 1997). Consider the word-context co-occurrence matrix (window size 1) for the corpus "I like deep learning." / "I like NLP." / "I enjoy flying.":

          I  like  enjoy  deep  learning  NLP  flying  .
I         0   2     1      0      0        0     0     0
like      2   0     0      1      0        1     0     0
enjoy     1   0     0      0      0        0     1     0
deep      0   1     0      0      1        0     0     0
learning  0   0     0      1      0        0     0     1
NLP       0   1     0      0      0        0     0     1
flying    0   0     1      0      0        0     0     1
.         0   0     0      0      1        1     1     0
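A short sketch that rebuilds this co-occurrence matrix from the three example sentences (window size 1, with the period treated as a token, as on the slide):

```python
import numpy as np

corpus = ["I like deep learning .", "I like NLP .", "I enjoy flying ."]
tokenized = [sentence.split() for sentence in corpus]

vocab = sorted({w for sentence in tokenized for w in sentence})
word_to_id = {w: i for i, w in enumerate(vocab)}

window = 1
A = np.zeros((len(vocab), len(vocab)), dtype=int)
for sentence in tokenized:
    for i, w in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if j != i:
                A[word_to_id[w], word_to_id[sentence[j]]] += 1

print(vocab)
print(A)   # e.g. A[word_to_id["I"], word_to_id["like"]] == 2, as in the table above
```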

Matrix factorization. Reconstruct the matrix from low-dimensional word and context representations, minimizing

$$\sum_{i,j} \big(A_{ij} - \hat{A}^k_{ij}\big)^2 = \big\|A - \hat{A}_k\big\|^2$$

where $\hat{A}_k$ is a rank-$k$ reconstruction.
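A sketch of this rank-k reconstruction with a truncated SVD; the matrix here is a random symmetric stand-in for a co-occurrence matrix, and k = 2 is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)
C = rng.integers(0, 3, size=(8, 8))
A = (C + C.T).astype(float)           # random symmetric stand-in for co-occurrence counts

k = 2                                 # target dimensionality
U, S, Vt = np.linalg.svd(A)           # A = U diag(S) Vt
A_k = (U[:, :k] * S[:k]) @ Vt[:k, :]  # best rank-k reconstruction in squared error
word_vectors = U[:, :k] * S[:k]       # k-dimensional word representations

print(np.linalg.norm(A - A_k))        # the minimized reconstruction error ||A - A_k||
```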


Relation to skip-gram (Levy and Goldberg, 2014). The output of skip-gram can be viewed as factorizing a word-context matrix: $M = V U^\top$, with $M \in \mathbb{R}^{|V| \times |V|}$ and $V, U \in \mathbb{R}^{|V| \times d}$. What should the values of $M$ be? Entry $M_{co}$ is $\langle v_c, u_o \rangle$.

Relation to skip-gram. Re-write the negative-sampling objective, where $\#(c,o)$ is the number of times the pair $(c,o)$ occurs in the corpus, $\#(c)$ and $\#(o)$ are the marginal counts, and $T$ is the total number of pairs:

$$
\begin{aligned}
L(\theta) &= \sum_{c,o} \#(c,o)\,\Big[\log\sigma(u_o^\top v_c) + k\,\mathbb{E}_{o'\sim p}\big[\log\sigma(-u_{o'}^\top v_c)\big]\Big] \\
&= \sum_{c,o} \#(c,o)\,\log\sigma(u_o^\top v_c) + \sum_{c,o} \#(c,o)\,k\,\mathbb{E}_{o'\sim p}\big[\log\sigma(-u_{o'}^\top v_c)\big] \\
&= \sum_{c,o} \#(c,o)\,\log\sigma(u_o^\top v_c) + \sum_{c} \#(c)\,k\,\mathbb{E}_{o'\sim p}\big[\log\sigma(-u_{o'}^\top v_c)\big] \\
&= \sum_{c,o} \#(c,o)\,\log\sigma(u_o^\top v_c) + \sum_{c} \#(c)\,k\sum_{o}\frac{\#(o)}{T}\,\log\sigma(-u_o^\top v_c) \\
&= \sum_{c,o} \Big[\#(c,o)\,\log\sigma(u_o^\top v_c) + \#(c)\,k\,\frac{\#(o)}{T}\,\log\sigma(-u_o^\top v_c)\Big]
\end{aligned}
$$

Relation to skip-gram. Let's assume the dot products are independent of one another. Let $x = u_o^\top v_c$, so $L(\theta) = \sum_{c,o} \ell(x)$ with

$$\ell(x) = \#(c,o)\,\log\sigma(x) + \#(c)\,k\,\frac{\#(o)}{T}\,\log\sigma(-x)$$

Setting the derivative to zero:

$$\frac{\partial \ell(x)}{\partial x} = \#(c,o)\,\sigma(-x) - \#(c)\,k\,\frac{\#(o)}{T}\,\sigma(x) = 0$$

$$\Rightarrow\quad x = \log\frac{\#(c,o)\,T}{\#(c)\,\#(o)} - \log k = \log\frac{p(c,o)}{p(c)\,p(o)} - \log k = \mathrm{PMI}(c,o) - \log k$$

Relation to skip-gram. Conclusion: skip-gram with negative sampling implicitly factorizes a shifted PMI matrix, $M_{co} = \mathrm{PMI}(c,o) - \log k$. Many NLP methods instead factorize the PMI matrix directly with matrix decomposition methods to obtain dense vectors.
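A sketch of the classical alternative mentioned here: compute a shifted PMI matrix explicitly from co-occurrence counts and then factorize it (e.g. with the SVD above). The counts are a toy stand-in, k plays the role of the number of negative samples, and the clipping at zero (positive PMI) is a common practical choice rather than something stated on the slide:

```python
import numpy as np

rng = np.random.default_rng(0)
counts = rng.integers(0, 5, size=(8, 8)).astype(float)   # toy #(c, o) counts
k = 5                                                     # shift = log k

T = counts.sum()                   # total number of (center, outside) pairs
p_co = counts / T                  # p(c, o)
p_c = counts.sum(axis=1) / T       # p(c)
p_o = counts.sum(axis=0) / T       # p(o)

with np.errstate(divide="ignore"):
    pmi = np.log(p_co / np.outer(p_c, p_o))
shifted_ppmi = np.maximum(pmi - np.log(k), 0.0)   # zero out unseen / low-association pairs

print(shifted_ppmi.round(2))
```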

Evaluation

Evaluation. Intrinsic vs. extrinsic evaluation. Intrinsic: define some artificial task that tries to directly measure the quality of your learning algorithm. Extrinsic: check whether your output is useful in a real NLP task.

Intrinsic evaluation. Word analogies: normalize all word vectors to unit length. man::woman <-> king::?? In general, for a::b <-> c::d, predict

$$d = \arg\max_i \frac{(x_b - x_a + x_c)^\top x_i}{\|x_b - x_a + x_c\|}$$

Does cosine distance capture semantic and syntactic intuitions?
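A minimal sketch of this analogy evaluation with cosine similarity; the vocabulary and embedding matrix are random placeholders, so the printed answer is meaningless, but the argmax is the one in the formula above (excluding the query words, as is usual in practice):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["man", "woman", "king", "queen", "apple", "car"]
word_to_id = {w: i for i, w in enumerate(vocab)}
X = rng.normal(size=(len(vocab), 50))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # normalize all word vectors to unit length

def analogy(a, b, c):
    """Return the word d maximizing cosine(x_b - x_a + x_c, x_d), excluding a, b, c."""
    target = X[word_to_id[b]] - X[word_to_id[a]] + X[word_to_id[c]]
    target /= np.linalg.norm(target)
    scores = X @ target
    for w in (a, b, c):
        scores[word_to_id[w]] = -np.inf
    return vocab[int(np.argmax(scores))]

print(analogy("man", "woman", "king"))   # "queen" with real embeddings
```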

Visualization (figures)

Word analogies evaluation

Human correlation intrinsic evaluation. Human similarity judgements for word pairs:

word 1     word 2     human judgement
tiger      cat        7.35
book       paper      7.46
computer   internet   7.58
plane      car        5.77
stock      phone      1.62
stock      CD         1.31
stock      jaguar     0.92

Human correlation intrinsic evaluation. Compute the Spearman rank correlation between human similarity judgements and model similarity predictions (WordSim-353).
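A sketch of this evaluation on the handful of pairs from the table above; the model similarities are hypothetical numbers invented for the example, and scipy's spearmanr computes the rank correlation:

```python
from scipy.stats import spearmanr

# Human judgements for the word pairs listed above.
human = [7.35, 7.46, 7.58, 5.77, 1.62, 1.31, 0.92]
# Hypothetical model cosine similarities for the same pairs (made up for illustration).
model = [0.62, 0.58, 0.71, 0.40, 0.15, 0.12, 0.05]

rho, p_value = spearmanr(human, model)
print(rho)   # 1.0 would mean the model ranks the pairs exactly as humans do
```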

Extrinsic evaluation. Task: named entity recognition, i.e. find mentions of persons, locations, and organizations in text. Using good word representations might be useful here.


Summary. Words are central to language, and most NLP systems use some word representation. Graph-based (taxonomy) representations are difficult to manipulate and compose. One-hot vectors are useful with enough data but lose all generalization information. Word embeddings provide a compact way to encode word meaning and similarity. Skip-gram with negative sampling is a popular approach for learning word embeddings by casting an unsupervised problem as a supervised one, and it is related to classical matrix decomposition methods.

Assignment 1. Implement skip-gram with negative sampling. There is ample literature if you want to consider this topic for a project.

Gradient checks:

$$\frac{\partial J(\theta)}{\partial \theta} = \lim_{\epsilon \to 0} \frac{J(\theta + \epsilon) - J(\theta - \epsilon)}{2\epsilon}$$

Compute this for every parameter with a small epsilon and compare against the analytic gradient.
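A small sketch of such a numerical gradient check; the objective here is a toy function with a known analytic gradient, and epsilon is a typical small value:

```python
import numpy as np

def f(x):
    return np.sum(x ** 2)   # toy objective; its analytic gradient is 2x

def numerical_gradient(f, x, eps=1e-5):
    """Central differences (f(x + eps) - f(x - eps)) / (2 eps), one parameter at a time."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e.flat[i] = eps
        grad.flat[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return grad

x = np.array([1.0, -2.0, 3.0])
print(numerical_gradient(f, x))   # close to the analytic gradient [2, -4, 6]
```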