Natural Language Processing


Natural Language Processing: Language Models. Based on slides from Michael Collins, Chris Manning and Richard Socher.

Plan: problem definition, trigram models, evaluation, estimation, interpolation, discounting.

Motivations. Define a probability distribution over sentences. Why? Machine translation: P(high winds) > P(large winds). Spelling correction: P(The office is fifteen minutes from here) > P(The office is fifteen minuets from here). Speech recognition (that's where it started!): P(recognize speech) > P(wreck a nice beach). And more!

Motivations. The techniques introduced here will also be useful later in the course.

Problem definition. Given a finite vocabulary $V = \{\text{the, a, man, telescope, Beckham, two}, \ldots\}$, we have an infinite language $L$, which is $V^*$ concatenated with the special symbol STOP. Examples: "the STOP", "a STOP", "the fan STOP", "the fan saw Beckham STOP", "the fan saw saw STOP", "the fan saw Beckham play for Real Madrid STOP".

Problem definition. Input: a training set of example sentences (currently hundreds of billions of words). Output: a probability distribution $p$ over $L$ with $\sum_{x \in L} p(x) = 1$ and $p(x) \ge 0$ for all $x \in L$. For example: $p(\text{the STOP}) = 10^{-12}$, $p(\text{the fan saw Beckham STOP}) = 2 \times 10^{-8}$, $p(\text{the fan saw saw STOP}) = 10^{-15}$.

A naive method. Assume we have $N$ training sentences. Let $x_1, \ldots, x_n$ be a sentence, and $c(x_1, \ldots, x_n)$ the number of times it appeared in the training data. Define a language model: $p(x_1, \ldots, x_n) = \frac{c(x_1, \ldots, x_n)}{N}$. No generalization!

Markov processes. Given a sequence of $n$ random variables $X_1, X_2, \ldots, X_n$ (e.g. $n = 100$, $X_i \in V$), we want a sequence probability model $p(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n)$. There are $|V|^n$ possible sequences.

First-order Markov process. Chain rule: $p(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = p(X_1 = x_1) \prod_{i=2}^{n} p(X_i = x_i \mid X_1 = x_1, \ldots, X_{i-1} = x_{i-1})$. Markov assumption: $p(X_i = x_i \mid X_1 = x_1, \ldots, X_{i-1} = x_{i-1}) = p(X_i = x_i \mid X_{i-1} = x_{i-1})$.

Second-order Markov process. Relax the independence assumption: $p(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = p(X_1 = x_1)\, p(X_2 = x_2 \mid X_1 = x_1) \prod_{i=3}^{n} p(X_i = x_i \mid X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})$. Simplify notation with $x_0 = x_{-1} = *$. Is this reasonable?

Detail: variable length. We want a probability distribution over sequences of any length. Define $X_n = \text{STOP}$ always, and obtain a probability distribution over all sequences. Intuition: at every step you stop with probability $\alpha_h$ (conditioned on the history) and keep going with probability $1 - \alpha_h$. $p(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = \prod_{i=1}^{n} p(X_i = x_i \mid X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})$.

Trigram language model. A trigram language model contains a vocabulary $V$ and a non-negative parameter $q(w \mid u, v)$ for every trigram, where $w \in V \cup \{\text{STOP}\}$ and $u, v \in V \cup \{*\}$. The probability of a sentence $x_1, \ldots, x_n$, where $x_n = \text{STOP}$, is $p(x_1, \ldots, x_n) = \prod_{i=1}^{n} q(x_i \mid x_{i-2}, x_{i-1})$.

Example: $p(\text{the dog barks STOP}) = q(\text{the} \mid *, *)\, q(\text{dog} \mid *, \text{the})\, q(\text{barks} \mid \text{the}, \text{dog})\, q(\text{STOP} \mid \text{dog}, \text{barks})$.

Limitation. The Markovian assumption is false: "He is from France, so it makes sense that his first language is ___". We would want to model longer dependencies.

Sparseness. Maximum likelihood estimation for $q$: let $c(w_1, \ldots, w_n)$ be the number of times that n-gram appears in a corpus, and set $q(w_i \mid w_{i-2}, w_{i-1}) = \frac{c(w_{i-2}, w_{i-1}, w_i)}{c(w_{i-2}, w_{i-1})}$. If the vocabulary has 20,000 words, the number of parameters is $20{,}000^3 \approx 8 \times 10^{12}$! Most sentences will have zero or undefined probabilities.
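
To make the estimator concrete, here is a minimal Python sketch (not from the lecture) of MLE trigram estimation and the resulting sentence probability; the function names, the `*`/`STOP` padding convention, and the toy corpus are illustrative assumptions.

```python
from collections import defaultdict

def train_trigram_mle(sentences):
    """MLE trigram estimates q(w | u, v) = c(u, v, w) / c(u, v).
    `sentences` is an iterable of token lists; '*' and 'STOP' are the
    boundary symbols used in the slides."""
    tri, bi = defaultdict(int), defaultdict(int)
    for sent in sentences:
        tokens = ["*", "*"] + list(sent) + ["STOP"]
        for u, v, w in zip(tokens, tokens[1:], tokens[2:]):
            tri[(u, v, w)] += 1
            bi[(u, v)] += 1
    return lambda w, u, v: tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0

def sentence_prob(q, sent):
    """p(x_1..x_n) = prod_i q(x_i | x_{i-2}, x_{i-1})."""
    tokens = ["*", "*"] + list(sent) + ["STOP"]
    p = 1.0
    for u, v, w in zip(tokens, tokens[1:], tokens[2:]):
        p *= q(w, u, v)
    return p

corpus = [["the", "dog", "barks"], ["the", "fan", "saw", "Beckham"]]
q = train_trigram_mle(corpus)
print(sentence_prob(q, ["the", "dog", "barks"]))
```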

Berkeley restaurant project sentences: "can you tell me about any good cantonese restaurants close by", "mid priced thai food is what i'm looking for", "tell me about chez panisse", "can you give me a listing of the kinds of food that are available", "i'm looking for a good place to eat breakfast", "when is caffe venezia open during the day".

Bigram counts: table of bigram counts from the Berkeley restaurant corpus (table not transcribed).

Bigram probabilities: table of normalized bigram probabilities from the Berkeley restaurant corpus (table not transcribed).

What did we learn? $p(\text{english} \mid \text{want}) < p(\text{chinese} \mid \text{want})$: people like Chinese stuff more, at least in this corpus. $p(\text{to} \mid \text{want}) = 0.66$ and $p(\text{eat} \mid \text{to}) = 0.28$: English grammar behaves in a certain way.

Evaluation: perplexity. Test data: $S = \{s_1, s_2, \ldots, s_m\}$, whose sentences were not used to estimate the parameters. A good language model assigns high $p(S)$ and has low perplexity. Perplexity is the normalized inverse probability of $S$: $p(S) = \prod_{i=1}^{m} p(s_i)$, $\log_2 p(S) = \sum_{i=1}^{m} \log_2 p(s_i)$, and $\text{perplexity} = 2^{-l}$ with $l = \frac{1}{M} \sum_{i=1}^{m} \log_2 p(s_i)$, where $M$ is the number of words in the test corpus.
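
A small sketch of this perplexity computation, assuming a `sentence_logprob` function that returns $\log_2 p(s)$ for a sentence under some model; the +1 per sentence counts the STOP symbol, following the convention used in these slides.

```python
def perplexity(sentence_logprob, test_sentences):
    """perplexity = 2^{-l}, l = (1/M) * sum_i log2 p(s_i),
    where M is the total number of words in the test data."""
    total_log2p = sum(sentence_logprob(s) for s in test_sentences)
    M = sum(len(s) + 1 for s in test_sentences)  # +1 for STOP
    l = total_log2p / M
    return 2.0 ** (-l)
```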

Evaluation: perplexity. Say we have a vocabulary $V$, let $N = |V| + 1$, and take a trigram model with the uniform distribution $q(w \mid u, v) = 1/N$. Then $l = \frac{1}{M} \sum_{i=1}^{m} \log_2 p(s_i) = \frac{1}{M} \cdot M \log_2 \frac{1}{N} = \log_2 \frac{1}{N}$, so $\text{perplexity} = 2^{-l} = 2^{\log_2 N} = N$. Perplexity is the effective vocabulary size.

Typical values of perplexity. With $|V| = 50{,}000$: trigram model perplexity 74 (much less than 50,000), bigram model 137, unigram model 955.

Evaluation. Extrinsic evaluations: machine translation, speech recognition, spelling correction, and so on.

History. Shannon (1950) estimated the perplexity that humans achieve on printed English (we are good!). Test your own perplexity.

Estimating parameters. Recall that the number of parameters for a trigram model with $|V| = 20{,}000$ is about $8 \times 10^{12}$, so the estimate $q(w_i \mid w_{i-2}, w_{i-1}) = \frac{c(w_{i-2}, w_{i-1}, w_i)}{c(w_{i-2}, w_{i-1})}$ leads to zero and undefined probabilities.

Bias-variance tradeoff. Given a corpus of length $M$: trigram model $q(w_i \mid w_{i-2}, w_{i-1}) = \frac{c(w_{i-2}, w_{i-1}, w_i)}{c(w_{i-2}, w_{i-1})}$; bigram model $q(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}$; unigram model $q(w_i) = \frac{c(w_i)}{M}$.


Linear interpolation. Combine the three models to get all their benefits: $q_{LI}(w_i \mid w_{i-2}, w_{i-1}) = \lambda_1\, q(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2\, q(w_i \mid w_{i-1}) + \lambda_3\, q(w_i)$, with $\lambda_i \ge 0$ and $\lambda_1 + \lambda_2 + \lambda_3 = 1$.

Linear interpolation. We need to verify that the parameters define a probability distribution: $\sum_{w \in V} q_{LI}(w \mid u, v) = \sum_{w \in V} \left[ \lambda_1\, q(w \mid u, v) + \lambda_2\, q(w \mid v) + \lambda_3\, q(w) \right] = \lambda_1 \sum_{w \in V} q(w \mid u, v) + \lambda_2 \sum_{w \in V} q(w \mid v) + \lambda_3 \sum_{w \in V} q(w) = \lambda_1 + \lambda_2 + \lambda_3 = 1$.
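
A minimal sketch of linear interpolation, assuming trigram, bigram, and unigram estimators `q3`, `q2`, `q1` are available; the default lambda values are placeholders to be tuned on development data, not values from the lecture.

```python
def q_interpolated(w, u, v, q3, q2, q1, lambdas=(0.5, 0.3, 0.2)):
    """Linearly interpolated estimate
    q_LI(w | u, v) = l1*q(w|u,v) + l2*q(w|v) + l3*q(w)."""
    l1, l2, l3 = lambdas
    # The lambdas must be non-negative and sum to 1 for q_LI to be a distribution.
    assert min(lambdas) >= 0 and abs(l1 + l2 + l3 - 1.0) < 1e-9
    return l1 * q3(w, u, v) + l2 * q2(w, v) + l3 * q1(w)
```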

Estimating coefficients. Use a validation/development set (intro to ML!): partition the training data into training (90%?) and development (10%?) sets, and optimize the coefficients to minimize perplexity (the measure we care about!) on the development data.

Discounting methods. Bigram counts and MLE estimates for histories following "the" (c(the) = 48):

  x                c(x)   q(w_i | w_{i-1})
  the, dog         15     15/48
  the, woman       11     11/48
  the, man         10     10/48
  the, park         5      5/48
  the, job          2      2/48
  the, telescope    1      1/48
  the, manual       1      1/48
  the, afternoon    1      1/48
  the, country      1      1/48
  the, street       1      1/48

Low-count bigrams have high estimates.

Discounting methods. Discounted counts $c^*(x) = c(x) - 0.5$:

  x                c(x)   c*(x)   q(w_i | w_{i-1})
  the, dog         15     14.5    14.5/48
  the, woman       11     10.5    10.5/48
  the, man         10      9.5     9.5/48
  the, park         5      4.5     4.5/48
  the, job          2      1.5     1.5/48
  the, telescope    1      0.5     0.5/48
  the, manual       1      0.5     0.5/48
  the, afternoon    1      0.5     0.5/48
  the, country      1      0.5     0.5/48
  the, street       1      0.5     0.5/48

Katz back-off. In a bigram model, let $A(w_{i-1}) = \{w : c(w_{i-1}, w) > 0\}$ and $B(w_{i-1}) = \{w : c(w_{i-1}, w) = 0\}$. Then $q_{BO}(w_i \mid w_{i-1}) = \frac{c^*(w_{i-1}, w_i)}{c(w_{i-1})}$ if $w_i \in A(w_{i-1})$, and $q_{BO}(w_i \mid w_{i-1}) = \alpha(w_{i-1}) \frac{q(w_i)}{\sum_{w \in B(w_{i-1})} q(w)}$ if $w_i \in B(w_{i-1})$, where $\alpha(w_{i-1}) = 1 - \sum_{w \in A(w_{i-1})} \frac{c^*(w_{i-1}, w)}{c(w_{i-1})}$.

Katz back-off. In a trigram model, let $A(w_{i-2}, w_{i-1}) = \{w : c(w_{i-2}, w_{i-1}, w) > 0\}$ and $B(w_{i-2}, w_{i-1}) = \{w : c(w_{i-2}, w_{i-1}, w) = 0\}$. Then $q_{BO}(w_i \mid w_{i-2}, w_{i-1}) = \frac{c^*(w_{i-2}, w_{i-1}, w_i)}{c(w_{i-2}, w_{i-1})}$ if $w_i \in A(w_{i-2}, w_{i-1})$, and $q_{BO}(w_i \mid w_{i-2}, w_{i-1}) = \alpha(w_{i-2}, w_{i-1}) \frac{q_{BO}(w_i \mid w_{i-1})}{\sum_{w \in B(w_{i-2}, w_{i-1})} q_{BO}(w \mid w_{i-1})}$ if $w_i \in B(w_{i-2}, w_{i-1})$, where $\alpha(w_{i-2}, w_{i-1}) = 1 - \sum_{w \in A(w_{i-2}, w_{i-1})} \frac{c^*(w_{i-2}, w_{i-1}, w)}{c(w_{i-2}, w_{i-1})}$.
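
The back-off recursion can be easier to read in code. Below is a hedged sketch of the bigram case with the 0.5 absolute discount from the earlier slide; the dict-based counts and helper names are assumptions for illustration.

```python
def katz_bigram(bigram_counts, unigram_counts, discount=0.5):
    """Katz back-off for a bigram model with discounted counts
    c*(u, w) = c(u, w) - discount. Returns a function q_BO(w | u)."""
    total = sum(unigram_counts.values())
    q_unigram = {w: c / total for w, c in unigram_counts.items()}

    def q_bo(w, u):
        seen = {v for (x, v) in bigram_counts if x == u}            # A(u)
        if w in seen:
            return (bigram_counts[(u, w)] - discount) / unigram_counts[u]
        # Missing probability mass alpha(u), redistributed in proportion to q(w).
        alpha = 1.0 - sum((bigram_counts[(u, v)] - discount) / unigram_counts[u]
                          for v in seen)
        unseen_mass = sum(q_unigram[v] for v in q_unigram if v not in seen)
        return alpha * q_unigram.get(w, 0.0) / unseen_mass if unseen_mass else 0.0

    return q_bo
```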

Class 3

Debt 1: gradient checks. $\frac{\partial J(\theta)}{\partial \theta} = \lim_{\epsilon \to 0} \frac{J(\theta + \epsilon) - J(\theta - \epsilon)}{2\epsilon}$. Compute this numerically for every parameter with a small $\epsilon$ to check the analytic gradients.
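
A possible implementation of the two-sided gradient check, assuming the parameters are flattened into a single vector; the epsilon and tolerance values are illustrative defaults.

```python
import numpy as np

def gradient_check(J, theta, analytic_grad, eps=1e-4, tol=1e-6):
    """Numerically check gradients with the central difference
    (J(theta + eps) - J(theta - eps)) / (2 * eps), one parameter at a time.
    J maps a parameter vector to a scalar loss; analytic_grad comes from backprop."""
    numeric = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        numeric[i] = (J(theta + step) - J(theta - step)) / (2 * eps)
    return np.max(np.abs(numeric - analytic_grad)) < tol
```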

Debt 2: matrix factorization. We are in a sample space of (center, outside) word pairs: $\#(c) = \sum_o \#(c, o)$, $\#(o) = \sum_c \#(c, o)$, and $T = \sum_{c, o} \#(c, o)$. Let $m$ be the context size, $D$ the corpus length, and $c(w)$ the number of times word $w$ appears in the corpus. Then $p(w) = \frac{c(w)}{D} = \frac{m \cdot c(w)}{m \cdot D} = \frac{\#(o)}{T}$.

Debt 3: linear interpolation. $q_{LI}(w_i \mid w_{i-2}, w_{i-1}) = \lambda_1\, q(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2\, q(w_i \mid w_{i-1}) + \lambda_3\, q(w_i)$, with $\lambda_i \ge 0$ and $\lambda_1 + \lambda_2 + \lambda_3 = 1$.

Debt 3: linear interpolation. Make the coefficients depend on how frequent the conditioning bigram is, via a bucketing function $\Pi(w_{i-2}, w_{i-1}) = 1$ if $c(w_{i-2}, w_{i-1}) = 0$; $2$ if $1 \le c(w_{i-2}, w_{i-1}) \le 2$; $3$ if $3 \le c(w_{i-2}, w_{i-1}) \le 10$; $4$ otherwise. Then $q_{LI}(w_i \mid w_{i-2}, w_{i-1}) = \lambda_1^{\Pi(w_{i-2}, w_{i-1})}\, q(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2^{\Pi(w_{i-2}, w_{i-1})}\, q(w_i \mid w_{i-1}) + \lambda_3^{\Pi(w_{i-2}, w_{i-1})}\, q(w_i)$, where $\lambda_i^{\Pi(w_{i-2}, w_{i-1})} \ge 0$ and $\lambda_1^{\Pi(w_{i-2}, w_{i-1})} + \lambda_2^{\Pi(w_{i-2}, w_{i-1})} + \lambda_3^{\Pi(w_{i-2}, w_{i-1})} = 1$.

Debt 4: unknown words. What if we see a completely new word at test time? Then $p(s) = 0$, which gives infinite perplexity. Solution: create a special token <unk>. Fix the vocabulary $V$, replace any word in the training set that is not in $V$ with <unk>, and train. At test time, use $p(\text{<unk>})$ for words not in the vocabulary.
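
A small sketch of the <unk> convention described above; the count threshold used to fix the vocabulary is a placeholder choice, not something specified in the lecture.

```python
from collections import Counter

def build_vocab_with_unk(train_tokens, min_count=2, unk="<unk>"):
    """Keep words seen at least `min_count` times; everything else maps to <unk>."""
    counts = Counter(train_tokens)
    vocab = {w for w, c in counts.items() if c >= min_count}
    vocab.add(unk)
    return vocab

def map_to_vocab(tokens, vocab, unk="<unk>"):
    """Replace out-of-vocabulary tokens with <unk> (used both on the
    training data and, at test time, on unseen words)."""
    return [w if w in vocab else unk for w in tokens]
```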

Administration. Projects: word vectors, language models. Late homework: you can have 5 late days during the semester; afterwards it is a 5-point deduction per day.

Summary and re-cap. Sentence probability: decompose with the chain rule and use a Markov independence assumption. Smooth the estimates to do better on rare events. There have been many attempts to improve LMs using syntax, but it is not easy. More complex smoothing methods exist for handling rare events (Kneser-Ney, Good-Turing, ...).

Problem. Our estimator $q(w \mid u, v)$ is based on a one-hot representation: there is no relation between words, so $p(\text{played} \mid \text{the, actress})$ tells us nothing about $p(\text{played} \mid \text{the, actor})$. Can we use distributed representations? Neural networks.

Plan for today: feed-forward neural networks for language modeling, recurrent neural networks for language modeling, vanishing/exploding gradients, computation graphs.

Neural networks: history. Proposed in the mid-20th century, advanced by backpropagation in the 1980s. In the 1990s SVMs appeared and were more successful. Since the beginning of this decade, neural networks have shown great success in speech, vision, and language.

Motivation 1. Can we learn representations from raw data?

Motivation 2. Can we learn non-linear decision boundaries? Actually very similar to motivation 1.

A single neuron. A neuron is a computational unit of the form $f_{w,b}(x) = f(w^\top x + b)$, where $x$ is the input vector, $w$ the weight vector, $b$ the bias, and $f$ the activation function.

A single neuron. If $f$ is the sigmoid, a neuron is logistic regression: for a binary classification training example $(x, y)$, $p(y = 1 \mid x) = \frac{1}{1 + e^{-w^\top x - b}} = \sigma(w^\top x + b)$. This provides a linear decision boundary.

Single-layer network. Perform logistic regression multiple times in parallel (with different parameters): $y = \sigma(\hat{w}^\top \hat{x} + b) = \sigma(w^\top x)$, where $x, w \in \mathbb{R}^{d+1}$ with $w_{d+1} = 1$ and $x_{d+1} = b$.

Single-layer network. L1: input layer, L2: hidden layer, L3: output layer. The output layer provides the prediction; the hidden layer is the learned representation.

Multi-layer network. Repeat: feed each layer's output as the input of the next layer.

Matrix notation. $a_1 = f(W_{11} x_1 + W_{12} x_2 + W_{13} x_3 + b_1)$, $a_2 = f(W_{21} x_1 + W_{22} x_2 + W_{23} x_3 + b_2)$, $a_3 = f(W_{31} x_1 + W_{32} x_2 + W_{33} x_3 + b_3)$; compactly, $z = Wx + b$, $a = f(z)$, with $x \in \mathbb{R}^4$, $z \in \mathbb{R}^3$, $W \in \mathbb{R}^{3 \times 4}$ and $f$ applied element-wise.
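
In code, the matrix form of a single layer is a one-liner; the tanh activation and the random toy shapes below are illustrative assumptions, not choices from the slides.

```python
import numpy as np

def layer_forward(x, W, b, f=np.tanh):
    """One fully connected layer in matrix notation: z = Wx + b, a = f(z)."""
    z = W @ x + b
    return f(z)

rng = np.random.default_rng(0)
x = rng.normal(size=4)          # input, x in R^4
W = rng.normal(size=(3, 4))     # weights, W in R^{3x4}
b = rng.normal(size=3)          # bias, b in R^3
a = layer_forward(x, W, b)      # activations, a in R^3
```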

Language modeling with NNs. Keep the Markov assumption, but learn a probability $q(u \mid v, w)$ with distributed representations.

Language modeling with NNs (Bengio et al., 2003). Embeddings: $e(w) = W^e w$, with $W^e \in \mathbb{R}^{d \times |V|}$ and $w \in \mathbb{R}^{|V| \times 1}$ a one-hot vector. Hidden layer: $h(w_{i-2}, w_{i-1}) = \sigma(W^h [e(w_{i-2}); e(w_{i-1})])$, with $W^h \in \mathbb{R}^{m \times 2d}$. Output: $f(z) = \text{softmax}(W^o z)$, with $W^o \in \mathbb{R}^{|V| \times m}$, $z \in \mathbb{R}^{m \times 1}$. Then $p(w_i \mid w_{i-2}, w_{i-1}) = f(h(w_{i-2}, w_{i-1}))_i$. Example: $f(h(\text{the, dog}))_{\text{laughed}} = p(\text{laughed} \mid \text{the, dog})$.
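
A minimal NumPy sketch of this forward pass, with parameter shapes following the slide; the toy sizes, random initialization, and function names are assumptions, and no training is shown.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def trigram_nlm_forward(i_prev2, i_prev1, We, Wh, Wo):
    """Feed-forward trigram LM: embed the two context words, concatenate,
    apply a sigmoid hidden layer, then a softmax over the vocabulary.
    Shapes: We (d x |V|), Wh (m x 2d), Wo (|V| x m)."""
    e = np.concatenate([We[:, i_prev2], We[:, i_prev1]])   # [e(w_{i-2}); e(w_{i-1})]
    h = 1.0 / (1.0 + np.exp(-(Wh @ e)))                     # sigmoid hidden layer
    return softmax(Wo @ h)                                   # p(. | w_{i-2}, w_{i-1})

# Toy sizes, made up for illustration: |V| = 10, d = 4, m = 8
rng = np.random.default_rng(0)
V, d, m = 10, 4, 8
We = rng.normal(size=(d, V))
Wh = rng.normal(size=(m, 2 * d))
Wo = rng.normal(size=(V, m))
p = trigram_nlm_forward(2, 7, We, Wh, Wo)   # distribution over the next word
```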

Loss function. Minimize the negative log-likelihood. When we learned word vectors, we were basically training a language model.

Advantages. If we see "The cat is walking in the bedroom" in the training set, we can hope to generalize to "A dog was running through a room". We can also use n-grams with n > 3 and pay only a linear price, compared to the exponential growth in parameters of count-based n-gram models.

Parameter estimation. We train with SGD. How do we efficiently compute the gradients? Backpropagation (Rumelhart, Hinton and Williams, 1986). The proof is in intro to ML; here we will only repeat the algorithm and give an example.

Backpropagation. Notation: $W_t$: weight matrix at the input of layer $t$; $z_t$: output vector of layer $t$; $x = z_0$: input vector; $y$: gold scalar; $\hat{y} = z_L$: predicted scalar; $l(y, \hat{y})$: loss function; $v_t = W_t z_{t-1}$: pre-activations; $\delta_t = \frac{\partial l(y, z_L)}{\partial z_t}$: gradient vector.

Backpropagation. Run the network forward to obtain all values $v_t, z_t$. Base case: $\delta_L = l'(y, z_L)$. Recursion: $\delta_t = W_{t+1}^\top (\delta_{t+1} \odot \sigma'(v_{t+1}))$, with $\delta_{t+1} \in \mathbb{R}^{d_{t+1} \times 1}$ and $W_{t+1} \in \mathbb{R}^{d_{t+1} \times d_t}$. Gradients: $\frac{\partial l}{\partial W_t} = (\delta_t \odot \sigma'(v_t))\, z_{t-1}^\top$.

Bigram LM example, forward pass: $z_0 \in \mathbb{R}^{|V| \times 1}$ is a one-hot input vector; $z_1 = W_1 z_0$, $W_1 \in \mathbb{R}^{d_1 \times |V|}$, $z_1 \in \mathbb{R}^{d_1 \times 1}$; $z_2 = \sigma(W_2 z_1)$, $W_2 \in \mathbb{R}^{d_2 \times d_1}$, $z_2 \in \mathbb{R}^{d_2 \times 1}$; $z_3 = \text{softmax}(W_3 z_2)$, $W_3 \in \mathbb{R}^{|V| \times d_2}$, $z_3 \in \mathbb{R}^{|V| \times 1}$; loss $l(y, z_3) = -\sum_i y^{(i)} \log z_3^{(i)}$.

Bigram LM example, backward pass: $\delta_3 \odot \sigma'(v_3) = z_3 - y$ (softmax with cross-entropy); $\delta_2 = W_3^\top(\delta_3 \odot \sigma'(v_3)) = W_3^\top(z_3 - y)$; $\delta_1 = W_2^\top(\delta_2 \odot \sigma'(v_2)) = W_2^\top(z_2 \odot (1 - z_2) \odot \delta_2)$. Gradients: $\frac{\partial l}{\partial W_3} = (\delta_3 \odot \sigma'(v_3))\, z_2^\top = (z_3 - y)\, z_2^\top$; $\frac{\partial l}{\partial W_2} = (\delta_2 \odot \sigma'(v_2))\, z_1^\top = (\delta_2 \odot z_2 \odot (1 - z_2))\, z_1^\top$; $\frac{\partial l}{\partial W_1} = (\delta_1 \odot \sigma'(v_1))\, z_0^\top = \delta_1 z_0^\top$ (the first layer is linear).
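
The same forward/backward computation, transcribed into NumPy as a sketch; the function name and the use of one-hot vectors as plain arrays are illustrative choices, and the gradient formulas mirror the slide's equations.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def bigram_lm_grads(z0, y, W1, W2, W3):
    """Forward and backward pass for the bigram LM example above.
    z0 and y are one-hot vectors for the current and next word;
    returns the loss and the gradients of W1, W2, W3."""
    # Forward
    z1 = W1 @ z0                    # embedding (linear layer)
    z2 = sigmoid(W2 @ z1)           # hidden layer
    z3 = softmax(W3 @ z2)           # output distribution
    loss = -np.sum(y * np.log(z3))
    # Backward
    d3 = z3 - y                                  # delta_3 * sigma'(v_3)
    dW3 = np.outer(d3, z2)
    d2 = W3.T @ d3                               # delta_2
    dW2 = np.outer(d2 * z2 * (1 - z2), z1)
    d1 = W2.T @ (d2 * z2 * (1 - z2))             # delta_1
    dW1 = np.outer(d1, z0)                       # first layer is linear
    return loss, dW1, dW2, dW3
```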

Summary. Neural nets can improve language models: better scalability for larger n, use of word similarity, complex decision boundaries. Training is done with backpropagation. But we still have a Markov assumption: "He is from France, so it makes sense that his first language is ___".

Recurrent neural networks. Input: $w_1, \ldots, w_{t-1}, w_t, w_{t+1}, \ldots, w_T$, where each $w_i \in \mathbb{R}^{|V|}$ is a one-hot vector (e.g. "the dog laughed"). Model: $x_t = W^{(e)} w_t$, $W^{(e)} \in \mathbb{R}^{d \times |V|}$; $h_t = \sigma(W^{(hh)} h_{t-1} + W^{(hx)} x_t)$, $W^{(hh)} \in \mathbb{R}^{D_h \times D_h}$, $W^{(hx)} \in \mathbb{R}^{D_h \times d}$; $\hat{y}_t = \text{softmax}(W^{(s)} h_t)$, $W^{(s)} \in \mathbb{R}^{|V| \times D_h}$.

Recurrent neural networks

Recurrent neural networks. They can exploit long-range dependencies, and each time step has the same weights (weight sharing/tying). What is the loss function? $J(\theta) = \sum_{t=1}^{T} CE(y_t, \hat{y}_t)$.
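
A sketch of the RNN language model forward pass and its summed cross-entropy loss; the sigmoid nonlinearity matches the slide's notation, while the toy sizes, random initialization, and integer word ids are assumptions.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def rnn_lm_loss(word_ids, We, Whh, Whx, Ws):
    """RNN LM: embed each word, update the hidden state, predict the next
    word with a softmax, and sum the cross-entropy over time.
    Shapes: We (d x |V|), Whh (Dh x Dh), Whx (Dh x d), Ws (|V| x Dh)."""
    h = np.zeros(Whh.shape[0])
    loss = 0.0
    for t in range(len(word_ids) - 1):
        x_t = We[:, word_ids[t]]                           # embedding of current word
        h = 1.0 / (1.0 + np.exp(-(Whh @ h + Whx @ x_t)))   # recurrent hidden state
        y_hat = softmax(Ws @ h)                            # distribution over next word
        loss += -np.log(y_hat[word_ids[t + 1]])            # cross-entropy with true next word
    return loss

rng = np.random.default_rng(0)
V, d, Dh = 12, 5, 8
We = rng.normal(size=(d, V)); Whh = rng.normal(size=(Dh, Dh))
Whx = rng.normal(size=(Dh, d)); Ws = rng.normal(size=(V, Dh))
print(rnn_lm_loss([3, 7, 1, 9], We, Whh, Whx, Ws))
```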

Recurrent neural networks. Components: model? The RNN we just saw. Loss? Sum of cross-entropy over time. Optimization? SGD. Gradient computation? Backpropagation through time.

Training RNNs. Capturing long-range dependencies with RNNs is difficult because of vanishing/exploding gradients: small changes in the hidden layer at step $k$ cause huge or minuscule changes to the hidden layer at step $t$ for $t \gg k$.

Explanation. Consider a simple linear RNN with no input: $h_t = W h_{t-1}$, so $h_t = W^t h_0$. With the eigendecomposition $W = Q \Lambda Q^{-1}$, we get $h_t = (Q \Lambda Q^{-1})^t h_0 = Q \Lambda^t Q^{-1} h_0$. Some eigenvalues will explode and some will shrink to zero, stretching the input in the direction of the eigenvector with the largest eigenvalue.
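
A quick numerical illustration of this effect (the matrix and numbers are made up): repeated multiplication amplifies the component along the eigenvector whose eigenvalue exceeds 1 and shrinks the one whose eigenvalue is below 1.

```python
import numpy as np

W = np.array([[1.1, 0.0],
              [0.0, 0.9]])      # eigenvalues 1.1 (explodes) and 0.9 (vanishes)
h = np.array([1.0, 1.0])
for t in range(50):
    h = W @ h
print(h)   # roughly [1.1**50, 0.9**50] = [117.4, 0.005]
```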

Explanation (Pascanu et al., 2013). With $h_t = W^{(hh)} \sigma(h_{t-1}) + W^{(hx)} x_t$, the gradient is $\frac{\partial J_t(\theta)}{\partial \theta} = \sum_{k=1}^{t} \frac{\partial J_t(\theta)}{\partial h_t} \frac{\partial h_t}{\partial h_k} \frac{\partial h_k}{\partial \theta}$, where $\frac{\partial h_t}{\partial h_k} = \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}} = \prod_{i=k+1}^{t} W^{(hh)\top} \mathrm{diag}(\sigma'(h_{i-1}))$ and $\left\| \frac{\partial h_i}{\partial h_{i-1}} \right\| \le \left\| W^{(hh)\top} \right\| \left\| \mathrm{diag}(\sigma'(h_{i-1})) \right\|$. When each factor's norm is below 1, the product shrinks exponentially, so the contribution of step $k$ to the total gradient at step $t$ is small.

Solutions. Exploding gradients: gradient clipping, i.e. re-normalize the gradient to have norm at most $C$ (this is no longer the true gradient); exploding gradients are easy to detect. Vanishing gradients: the problem is with the model! Change it (LSTMs, GRUs); maybe later in this class.
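
A minimal sketch of gradient clipping by global norm; the threshold value is a placeholder, not a number from the lecture.

```python
import numpy as np

def clip_gradient(grad, max_norm=5.0):
    """If ||g|| > C, rescale g to have norm C (no longer the true gradient)."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad
```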

Illustration Pascanu et al, 2013

Results Jozefowicz et al, 2016

Summary. Language modeling is a fundamental NLP task used in machine translation, spelling correction, speech recognition, and more. Traditional models use n-gram counts and smoothing. Feed-forward models take word similarity into account to generalize better. Recurrent models can potentially learn to exploit long-range interactions. Neural models have dramatically reduced perplexity, and recurrent networks are now used in many other NLP tasks (bidirectional RNNs, deep RNNs).