Natural Language Processing: Language models. Based on slides from Michael Collins, Chris Manning and Richard Socher
Plan
  Problem definition
  Trigram models
  Evaluation
  Estimation
  Interpolation
  Discounting
Motivations
  Define a probability distribution over sentences. Why?
  $p(01010 \mid x) \propto p(x \mid 01010)\, p(01010)$, where $x$ stands for the observed input
  Machine translation: P(high winds) > P(large winds)
  Spelling correction: P(The office is fifteen minutes from here) > P(The office is fifteen minuets from here)
  Speech recognition (that's where it started!): P(recognize speech) > P(wreck a nice beach)
  And more!
Motivations
  The techniques will be useful later
Problem definition
  Given a finite vocabulary $V = \{\text{the}, \text{a}, \text{man}, \text{telescope}, \text{Beckham}, \text{two}, \ldots\}$
  We have an infinite language $L$, which is $V^*$ concatenated with the special symbol STOP:
    the STOP
    a STOP
    the fan STOP
    the fan saw Beckham STOP
    the fan saw saw STOP
    the fan saw Beckham play for Real Madrid STOP
Problem definition
  Input: a training set of example sentences (currently: hundreds of billions of words)
  Output: a probability distribution $p$ over $L$, with $\sum_{x \in L} p(x) = 1$ and $p(x) \ge 0$ for all $x \in L$
    $p(\text{the STOP}) = 10^{-12}$
    $p(\text{the fan saw Beckham STOP}) = 2 \times 10^{-8}$
    $p(\text{the fan saw saw STOP}) = 10^{-15}$
A naive method
  Assume we have $N$ training sentences
  Let $x_1, x_2, \ldots, x_n$ be a sentence, and $c(x_1, x_2, \ldots, x_n)$ be the number of times it appeared in the training data
  Define a language model: $p(x_1, \ldots, x_n) = \frac{c(x_1, \ldots, x_n)}{N}$
  No generalization!
Markov processes
  Given a sequence of $n$ random variables $X_1, X_2, \ldots, X_n$ (say $n = 100$), $X_i \in V$
  We want a sequence probability model $p(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n)$
  There are $|V|^n$ possible sequences
First-order Markov process
  Chain rule: $p(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = p(X_1 = x_1) \prod_{i=2}^{n} p(X_i = x_i \mid X_1 = x_1, \ldots, X_{i-1} = x_{i-1})$
  Markov assumption: $p(X_i = x_i \mid X_1 = x_1, \ldots, X_{i-1} = x_{i-1}) = p(X_i = x_i \mid X_{i-1} = x_{i-1})$
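Second-order Markov process
  Relax the independence assumption: $p(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = p(X_1 = x_1)\, p(X_2 = x_2 \mid X_1 = x_1) \prod_{i=3}^{n} p(X_i = x_i \mid X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})$
  Simplify notation: $x_0 = *$, $x_{-1} = *$, where $*$ is a special start symbol, so the product can run from $i = 1$
  Is this reasonable?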
Detail: variable length
  Probability distribution over sequences of any length
  Always define $X_n = \text{STOP}$, and obtain a probability distribution over all sequences
  Intuition: at every step you have probability $\alpha_h$ to stop (conditioned on the history $h$) and $(1 - \alpha_h)$ to keep going
  $p(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = \prod_{i=1}^{n} p(X_i = x_i \mid X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})$
Trigram language model
  A trigram language model contains:
    A vocabulary $V$
    Non-negative parameters $q(w \mid u, v)$ for every trigram $u, v, w$, such that $w \in V \cup \{\text{STOP}\}$ and $u, v \in V \cup \{*\}$ ($*$ is the start symbol)
  The probability of a sentence $x_1, \ldots, x_n$, where $x_n = \text{STOP}$, is $p(x_1, \ldots, x_n) = \prod_{i=1}^{n} q(x_i \mid x_{i-2}, x_{i-1})$
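Example
  $p(\text{the dog barks STOP}) = q(\text{the} \mid *, *) \times q(\text{dog} \mid *, \text{the}) \times q(\text{barks} \mid \text{the}, \text{dog}) \times q(\text{STOP} \mid \text{dog}, \text{barks})$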
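The product in the example can be sketched in a few lines of Python. This is only an illustration: the function name `sentence_log_prob`, the dictionary layout `(u, v, w) -> q(w | u, v)` and the numbers in `toy_q` are assumptions made for this sketch, not values from the slides. Log-probabilities are summed instead of multiplying raw probabilities, to avoid underflow on long sentences.

```python
import math

START, STOP = "*", "STOP"

def sentence_log_prob(words, q):
    """Return log2 p(x_1, ..., x_n) with x_n = STOP under trigram parameters q.

    q maps (u, v, w) to q(w | u, v); a missing entry is treated as probability 0.
    """
    padded = [START, START] + list(words) + [STOP]
    total = 0.0
    for i in range(2, len(padded)):
        u, v, w = padded[i - 2], padded[i - 1], padded[i]
        prob = q.get((u, v, w), 0.0)
        if prob == 0.0:
            return float("-inf")  # one unseen trigram zeroes the whole sentence
        total += math.log2(prob)
    return total

# Made-up parameter values, for illustration only.
toy_q = {
    ("*", "*", "the"): 0.5,
    ("*", "the", "dog"): 0.25,
    ("the", "dog", "barks"): 0.5,
    ("dog", "barks", "STOP"): 1.0,
}
log_p = sentence_log_prob(["the", "dog", "barks"], toy_q)
```

With these toy values the four factors are 0.5, 0.25, 0.5 and 1, so the sentence probability is $2^{-4}$.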
Limitation
  The Markov assumption is false
  "He is from France, so it makes sense that his first language is ..."
  We would want to model longer dependencies
Sparseness
  Maximum likelihood for estimating $q$
  Let $c(w_1, \ldots, w_n)$ be the number of times that n-gram appears in a corpus
  $q(w_i \mid w_{i-2}, w_{i-1}) = \frac{c(w_{i-2}, w_{i-1}, w_i)}{c(w_{i-2}, w_{i-1})}$
  If the vocabulary has 20,000 words, the number of parameters is $8 \times 10^{12}$!
  Most sentences will have zero or undefined probabilities
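A minimal sketch of the maximum-likelihood estimate, assuming sentences are given as lists of tokens. The helper names `count_ngrams` and `q_ml` and the three-sentence corpus are hypothetical. The sketch returns `None` for an unseen context (the 0/0 case) and 0.0 for an unseen trigram with a seen context, which are exactly the "undefined" and "zero" cases mentioned above.

```python
from collections import Counter

START, STOP = "*", "STOP"

def count_ngrams(sentences):
    """Count trigrams (u, v, w) and the contexts (u, v) they were observed in."""
    trigrams, contexts = Counter(), Counter()
    for sentence in sentences:
        padded = [START, START] + list(sentence) + [STOP]
        for i in range(2, len(padded)):
            trigrams[tuple(padded[i - 2:i + 1])] += 1
            contexts[tuple(padded[i - 2:i])] += 1
    return trigrams, contexts

def q_ml(w, u, v, trigrams, contexts):
    """Maximum-likelihood q(w | u, v) = c(u, v, w) / c(u, v)."""
    c_context = contexts[(u, v)]
    if c_context == 0:
        return None  # undefined: the context never occurred in training
    return trigrams[(u, v, w)] / c_context

# Hypothetical toy corpus.
corpus = [["the", "dog", "barks"], ["the", "dog", "sleeps"], ["the", "cat", "sleeps"]]
trigrams, contexts = count_ngrams(corpus)
```

For example, `q_ml("barks", "the", "cat", trigrams, contexts)` is 0.0 even though "the cat barks" is a perfectly reasonable sentence.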
Berkeley restaurant project sentences
  can you tell me about any good cantonese restaurants close by
  mid priced thai food is what i'm looking for
  tell me about chez panisse
  can you give me a listing of the kinds of food that are available
  i'm looking for a good place to eat breakfast
  when is caffe venezia open during the day
Bigram counts (table omitted)
Bigram probabilities (table omitted)
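What did we learn
  $p(\text{english} \mid \text{want}) < p(\text{chinese} \mid \text{want})$: in this corpus, people ask for Chinese food more often than English food
  $p(\text{to} \mid \text{want}) = 0.66$: a fact about English ("want" is usually followed by "to")
  $p(\text{eat} \mid \text{to}) = 0.28$: a fact about English ("to" is often followed by a verb)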
Evaluation: perplexity
  Test data: $S = \{s_1, s_2, \ldots, s_m\}$; the parameters are not estimated from $S$
  A good language model has high $p(S)$ and low perplexity
  Perplexity is the normalized inverse probability of $S$
  $p(S) = \prod_{i=1}^{m} p(s_i)$, so $\log_2 p(S) = \sum_{i=1}^{m} \log_2 p(s_i)$
  $\text{perplexity} = 2^{-l}$, $l = \frac{1}{M} \sum_{i=1}^{m} \log_2 p(s_i)$
  $m$ is the number of test sentences and $M$ is the number of words in the corpus
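Evaluation: perplexity
  Say we have a vocabulary $V$ with $N = |V| + 1$ and a trigram model with uniform distribution $q(w \mid u, v) = 1/N$
  With $m$ test sentences containing $M$ words in total: $l = \frac{1}{M} \sum_{i=1}^{m} \log_2 p(s_i) = \frac{1}{M} \log_2 \left(\frac{1}{N}\right)^{M} = \log_2 \frac{1}{N}$
  $\text{perplexity} = 2^{-l} = 2^{\log_2 N} = N$
  Perplexity is the effective vocabulary size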
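The uniform-model calculation can be checked numerically. The sketch below assumes per-sentence log2 probabilities are already available. The function name `perplexity` and the toy sentence lengths are illustrative choices, and the word count includes the STOP symbol of each sentence.

```python
import math

def perplexity(sentence_log2_probs, num_words):
    """Perplexity 2^(-l), with l = (1/M) * sum_i log2 p(s_i) and M = num_words."""
    l = sum(sentence_log2_probs) / num_words
    return 2.0 ** (-l)

# Uniform model: every word (including STOP) has probability 1/N.
N = 1000
lengths = [5, 8, 3]  # hypothetical sentence lengths, counting STOP
uniform_log_probs = [n * math.log2(1.0 / N) for n in lengths]
ppl = perplexity(uniform_log_probs, sum(lengths))
```

Up to floating-point rounding, `ppl` equals `N` regardless of the sentence lengths chosen.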
Typical values of perplexity
  When $|V| = 50{,}000$:
    trigram model: 74 ($\ll 50{,}000$)
    bigram model: 137
    unigram model: 955
Evaluation
  Extrinsic evaluations: MT, speech, spelling correction, ...
History
  Shannon (1950) estimated the perplexity score that humans get for printed English (we are good!)
  Test your perplexity
Estimating parameters
  Recall that the number of parameters for a trigram model with $|V| = 20{,}000$ is $8 \times 10^{12}$, leading to zeros and undefined probabilities
  $q(w_i \mid w_{i-2}, w_{i-1}) = \frac{c(w_{i-2}, w_{i-1}, w_i)}{c(w_{i-2}, w_{i-1})}$
Bias-variance tradeoff
  Given a corpus of length $M$
  Trigram model: $q(w_i \mid w_{i-2}, w_{i-1}) = \frac{c(w_{i-2}, w_{i-1}, w_i)}{c(w_{i-2}, w_{i-1})}$
  Bigram model: $q(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}$
  Unigram model: $q(w_i) = \frac{c(w_i)}{M}$
Linear interpolation
  $q_{LI}(w_i \mid w_{i-2}, w_{i-1}) = \lambda_1\, q(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2\, q(w_i \mid w_{i-1}) + \lambda_3\, q(w_i)$
  $\lambda_i \ge 0$, $\lambda_1 + \lambda_2 + \lambda_3 = 1$
  Combine the three models to get all benefits
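Linear interpolation
  Need to verify that the parameters define a probability distribution:
  $\sum_{w \in V} q_{LI}(w \mid u, v) = \sum_{w \in V} \left[ \lambda_1 q(w \mid u, v) + \lambda_2 q(w \mid v) + \lambda_3 q(w) \right]$
  $= \lambda_1 \sum_{w \in V} q(w \mid u, v) + \lambda_2 \sum_{w \in V} q(w \mid v) + \lambda_3 \sum_{w \in V} q(w)$
  $= \lambda_1 + \lambda_2 + \lambda_3 = 1$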
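The verification above can also be run on a toy example. In this sketch the three component models are plain dictionaries, and the function name `q_li`, the data layout and all numbers are assumptions made for illustration. The sum-to-one argument relies on each component being a proper distribution for the given context, which holds for the toy values below.

```python
def q_li(w, u, v, q3, q2, q1, lambdas):
    """Interpolated estimate: l1*q(w|u,v) + l2*q(w|v) + l3*q(w)."""
    l1, l2, l3 = lambdas
    assert min(lambdas) >= 0 and abs(sum(lambdas) - 1.0) < 1e-12
    return (l1 * q3.get((u, v), {}).get(w, 0.0)
            + l2 * q2.get(v, {}).get(w, 0.0)
            + l3 * q1.get(w, 0.0))

# Toy component models (made-up numbers); each row sums to 1.
vocab = ["dog", "cat", "STOP"]
q1 = {"dog": 0.5, "cat": 0.25, "STOP": 0.25}      # unigram q(w)
q2 = {"the": {"dog": 0.75, "cat": 0.25}}          # bigram q(w | v)
q3 = {("*", "the"): {"dog": 1.0}}                 # trigram q(w | u, v)
lambdas = (0.5, 0.25, 0.25)

total = sum(q_li(w, "*", "the", q3, q2, q1, lambdas) for w in vocab)
p_cat = q_li("cat", "*", "the", q3, q2, q1, lambdas)
```

Here the trigram "* the cat" was never seen, yet `p_cat` is positive because the bigram and unigram terms contribute.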
Estimating coefficients
  Use a validation/development set (intro to ML!)
  Partition the training data into training (90%?) and dev (10%?) data, and optimize the coefficients to minimize the perplexity (the measure we care about!) on the development data
Discounting methods
  x                c(x)   q(w_i | w_{i-1})
  the              48
  the, dog         15     15/48
  the, woman       11     11/48
  the, man         10     10/48
  the, park         5      5/48
  the, job          2      2/48
  the, telescope    1      1/48
  the, manual       1      1/48
  the, afternoon    1      1/48
  the, country      1      1/48
  the, street       1      1/48
  Maximum-likelihood estimates for low-count bigrams are too high
Discounting methods
  $c^*(x) = c(x) - 0.5$
  x                c(x)   c*(x)   q(w_i | w_{i-1})
  the              48
  the, dog         15     14.5    14.5/48
  the, woman       11     10.5    10.5/48
  the, man         10      9.5     9.5/48
  the, park         5      4.5     4.5/48
  the, job          2      1.5     1.5/48
  the, telescope    1      0.5     0.5/48
  the, manual       1      0.5     0.5/48
  the, afternoon    1      0.5     0.5/48
  the, country      1      0.5     0.5/48
  the, street       1      0.5     0.5/48
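The table can be reproduced with a short sketch that also computes how much probability mass the discount frees up. The function name `discounted_estimates` is an illustrative choice; the counts are the ones in the table.

```python
# Counts of words following "the", taken from the table above.
counts_after_the = {
    "dog": 15, "woman": 11, "man": 10, "park": 5, "job": 2,
    "telescope": 1, "manual": 1, "afternoon": 1, "country": 1, "street": 1,
}

def discounted_estimates(counts, discount=0.5):
    """Return q(w | context) from discounted counts, plus the freed-up mass."""
    total = sum(counts.values())
    q = {w: (c - discount) / total for w, c in counts.items()}
    missing_mass = 1.0 - sum(q.values())
    return q, missing_mass

q_disc, missing_mass = discounted_estimates(counts_after_the)
```

Ten seen bigrams each give up 0.5 counts, so the missing mass is $10 \times 0.5 / 48 = 5/48$. This is the mass that a back-off scheme can redistribute to unseen words.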
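Katz back-off
  In a bigram model:
  $A(w_{i-1}) = \{w : c(w_{i-1}, w) > 0\}$
  $B(w_{i-1}) = \{w : c(w_{i-1}, w) = 0\}$
  $q_{BO}(w_i \mid w_{i-1}) = \begin{cases} \frac{c^*(w_{i-1}, w_i)}{c(w_{i-1})} & w_i \in A(w_{i-1}) \\ \alpha(w_{i-1}) \frac{q(w_i)}{\sum_{w \in B(w_{i-1})} q(w)} & w_i \in B(w_{i-1}) \end{cases}$
  $\alpha(w_{i-1}) = 1 - \sum_{w \in A(w_{i-1})} \frac{c^*(w_{i-1}, w)}{c(w_{i-1})}$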
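Katz back-off
  In a trigram model:
  $A(w_{i-2}, w_{i-1}) = \{w : c(w_{i-2}, w_{i-1}, w) > 0\}$
  $B(w_{i-2}, w_{i-1}) = \{w : c(w_{i-2}, w_{i-1}, w) = 0\}$
  $q_{BO}(w_i \mid w_{i-2}, w_{i-1}) = \begin{cases} \frac{c^*(w_{i-2}, w_{i-1}, w_i)}{c(w_{i-2}, w_{i-1})} & w_i \in A(w_{i-2}, w_{i-1}) \\ \alpha(w_{i-2}, w_{i-1}) \frac{q_{BO}(w_i \mid w_{i-1})}{\sum_{w \in B(w_{i-2}, w_{i-1})} q_{BO}(w \mid w_{i-1})} & w_i \in B(w_{i-2}, w_{i-1}) \end{cases}$
  $\alpha(w_{i-2}, w_{i-1}) = 1 - \sum_{w \in A(w_{i-2}, w_{i-1})} \frac{c^*(w_{i-2}, w_{i-1}, w)}{c(w_{i-2}, w_{i-1})}$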
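A minimal sketch of the bigram case of Katz back-off, under a few assumptions that are not in the slides: the function name `katz_bigram` is hypothetical, $c(w_{i-1})$ is taken to be the total count of bigrams starting with the context word, the context word is assumed to have been seen at least once, and the toy counts are made up. The trigram case follows the same pattern, backing off to this bigram estimate instead of the unigram one.

```python
def katz_bigram(prev, bigram_counts, unigram_counts, discount=0.5):
    """Return a dict w -> q_BO(w | prev) over all words in unigram_counts.

    Assumes prev was seen in training (otherwise c(prev) would be zero).
    """
    total_words = sum(unigram_counts.values())
    q_uni = {w: c / total_words for w, c in unigram_counts.items()}

    # A(prev): words seen after prev, with their counts.
    seen = {w: c for (u, w), c in bigram_counts.items() if u == prev and c > 0}
    c_prev = sum(seen.values())

    # Discounted estimates for seen words, and the left-over mass alpha(prev).
    q = {w: (c - discount) / c_prev for w, c in seen.items()}
    alpha = 1.0 - sum(q.values())

    # B(prev): unseen words share alpha in proportion to their unigram estimates.
    unseen = [w for w in q_uni if w not in seen]
    unseen_mass = sum(q_uni[w] for w in unseen)
    for w in unseen:
        q[w] = alpha * q_uni[w] / unseen_mass
    return q

# Made-up toy counts.
toy_unigrams = {"the": 4, "dog": 3, "cat": 2, "fish": 1}
toy_bigrams = {("the", "dog"): 3, ("the", "cat"): 1}
q_bo = katz_bigram("the", toy_bigrams, toy_unigrams)
```

In this toy case $\alpha(\text{the}) = 0.25$, and it is split between "the" and "fish" in the ratio of their unigram probabilities.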
Class 3
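Debt 1: Gradient checks
  $\frac{\partial J(\theta)}{\partial \theta} = \lim_{\epsilon \to 0} \frac{J(\theta + \epsilon) - J(\theta - \epsilon)}{2\epsilon}$
  Compute this for every parameter, for a small $\epsilon$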
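A gradient check along these lines might look as follows. The names `numerical_gradient`, `toy_objective` and `toy_analytic_gradient` are hypothetical, and the quadratic objective is chosen only because its gradient is easy to write by hand.

```python
def numerical_gradient(J, theta, eps=1e-5):
    """Central-difference estimate of dJ/dtheta_i for every parameter i."""
    grad = []
    for i in range(len(theta)):
        plus = list(theta)
        minus = list(theta)
        plus[i] += eps
        minus[i] -= eps
        grad.append((J(plus) - J(minus)) / (2 * eps))
    return grad

def toy_objective(theta):
    # J(theta) = theta_0^2 + 3 * theta_0 * theta_1
    return theta[0] ** 2 + 3 * theta[0] * theta[1]

def toy_analytic_gradient(theta):
    return [2 * theta[0] + 3 * theta[1], 3 * theta[0]]

theta = [1.5, -2.0]
numeric = numerical_gradient(toy_objective, theta)
analytic = toy_analytic_gradient(theta)
max_gap = max(abs(n - a) for n, a in zip(numeric, analytic))
```

In practice the analytic gradient would come from backpropagation, and a large `max_gap` signals a bug in the derivation or the code.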
Debt 2: matrix factorization
  We are in a sample space of pairs $(c, o)$ with counts $\#(c, o)$
  $\#(c) = \sum_{o} \#(c, o)$, $\#(o) = \sum_{c} \#(c, o)$, $T = \sum_{c, o} \#(c, o)$
  Let $m$ be the context size, $D$ be the corpus length, and $c(w)$ be the number of times word $w$ appears in the corpus
  $p(w) = \frac{c(w)}{D} = \frac{m \cdot c(w)}{m \cdot D} = \frac{\#(o)}{T}$ for $o = w$
Debt 3: linear interpolation
  $q_{LI}(w_i \mid w_{i-2}, w_{i-1}) = \lambda_1\, q(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2\, q(w_i \mid w_{i-1}) + \lambda_3\, q(w_i)$
  $\lambda_i \ge 0$, $\lambda_1 + \lambda_2 + \lambda_3 = 1$
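Debt 3: linear interpolation
  Let the coefficients depend on the count of the history, through a bucketing function:
  $\Pi(w_{i-2}, w_{i-1}) = \begin{cases} 1 & c(w_{i-2}, w_{i-1}) = 0 \\ 2 & 1 \le c(w_{i-2}, w_{i-1}) \le 2 \\ 3 & 3 \le c(w_{i-2}, w_{i-1}) \le 10 \\ 4 & \text{otherwise} \end{cases}$
  $q_{LI}(w_i \mid w_{i-2}, w_{i-1}) = \lambda_1^{\Pi(w_{i-2}, w_{i-1})} q(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2^{\Pi(w_{i-2}, w_{i-1})} q(w_i \mid w_{i-1}) + \lambda_3^{\Pi(w_{i-2}, w_{i-1})} q(w_i)$
  $\lambda_i^{\Pi(w_{i-2}, w_{i-1})} \ge 0$, $\lambda_1^{\Pi(w_{i-2}, w_{i-1})} + \lambda_2^{\Pi(w_{i-2}, w_{i-1})} + \lambda_3^{\Pi(w_{i-2}, w_{i-1})} = 1$ (the superscript is a bucket index, not a power)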
Debt 4: unknown words
  What if we see a completely new word at test time?
  $p(s) = 0$ $\Rightarrow$ infinite perplexity
  Solution: create a special token <unk>
    Fix the vocabulary $V$ and replace any word in the training set that is not in $V$ with <unk>
    Train
    At test time, use p(<unk>) for words not in $V$
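The <unk> recipe can be sketched as below. The slides only say to fix a vocabulary; choosing it by a minimum training count (`min_count`) is an assumption made here for illustration, as are the function names `build_vocab` and `replace_unknown` and the toy data.

```python
from collections import Counter

UNK = "<unk>"

def build_vocab(sentences, min_count=2):
    """Fix V as the set of words seen at least min_count times (an assumed rule)."""
    counts = Counter(w for sentence in sentences for w in sentence)
    return {w for w, c in counts.items() if c >= min_count}

def replace_unknown(sentence, vocab):
    """Map every word outside the fixed vocabulary to <unk>."""
    return [w if w in vocab else UNK for w in sentence]

# Hypothetical toy data.
train = [["the", "dog", "barks"], ["the", "dog", "sleeps"], ["the", "cat", "barks"]]
vocab = build_vocab(train)
train_unk = [replace_unknown(s, vocab) for s in train]   # train the model on this
test_unk = replace_unknown(["a", "dog", "meows"], vocab)  # same mapping at test time
```

Because rare training words become <unk>, the model sees <unk> during training and assigns it non-zero probability, so a new test word no longer drives the sentence probability to zero.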
Administration
  Projects: word vectors, language models
  Late HW: you can have 5 late days during the semester; afterwards there is a 5-point deduction per day
Summary and re-cap
  Sentence probability: decompose with the chain rule
  Use a Markov independence assumption
  Smooth estimates to do better on rare events
  Many attempts to improve LMs using syntax, but it is not easy
  More complex smoothing methods exist for handling rare events (Kneser-Ney, Good-Turing, ...)
Problem
  Our estimator $q(w \mid u, v)$ is based on a one-hot representation: there is no relation between words
  For example, $p(\text{played} \mid \text{the}, \text{actress})$ and $p(\text{played} \mid \text{the}, \text{actor})$ are estimated separately
  Can we use distributed representations? Neural networks
Plan for today
  Feed-forward neural networks for language modeling
  Recurrent neural networks for language modeling
  Vanishing/exploding gradients
  Computation graphs
Neural networks: history
  Proposed in the mid-20th century, advanced by backpropagation in the 1980s
  In the 1990s SVMs appeared and were more successful
  Since the beginning of the 2010s, neural networks have shown great success in speech, vision, and language
Motivation 1
  Can we learn representations from raw data?
Motivation 2
  Can we learn non-linear decision boundaries?
  Actually very similar to motivation 1
A single neuron (intro to ML)
  A neuron is a computational unit of the form $f_{w,b}(x) = f(w^\top x + b)$
    $x$: input vector
    $w$: weights vector
    $b$: bias
    $f$: activation function
A single neuron (intro to ML)
  If $f$ is the sigmoid, a neuron is logistic regression
  Let $(x, y)$ be a binary classification training example
  $p(y = 1 \mid x) = \frac{1}{1 + e^{-w^\top x - b}} = \sigma(w^\top x + b)$
  Provides a linear decision boundary
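A single sigmoid neuron is short enough to write out directly. The function names `sigmoid` and `neuron` and the example weights are illustrative choices, not part of the slides.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b, f=sigmoid):
    """Compute f(w^T x + b) for input vector x, weights w and bias b."""
    return f(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Made-up weights: the point (1, 2) lies exactly on the boundary w^T x + b = 0.
w, b = [0.5, -0.25], 0.0
on_boundary = neuron([1.0, 2.0], w, b)
positive_side = neuron([3.0, 2.0], w, b)
```

The prediction crosses 0.5 exactly where $w^\top x + b = 0$, which is why the decision boundary is linear.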
Single layer network
  Perform logistic regression in parallel multiple times (with different parameters)
  $y = \sigma(\hat{w}^\top \hat{x} + b) = \sigma(w^\top x)$, where $x, w \in \mathbb{R}^{d+1}$ extend $\hat{x}, \hat{w}$ with $x_{d+1} = 1$ and $w_{d+1} = b$
Single layer network
  L1: input layer
  L2: hidden layer
  L3: output layer
  The output layer provides the prediction
  The hidden layer is the learned representation
Multi-layer network
  Repeat: stack further layers in the same way (figure omitted)
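Matrix notation
  $a_1 = f(W_{11} x_1 + W_{12} x_2 + W_{13} x_3 + b_1)$
  $a_2 = f(W_{21} x_1 + W_{22} x_2 + W_{23} x_3 + b_2)$
  $a_3 = f(W_{31} x_1 + W_{32} x_2 + W_{33} x_3 + b_3)$
  $z = Wx + b$, $a = f(z)$
  $x \in \mathbb{R}^{3}$, $z \in \mathbb{R}^{3}$, $W \in \mathbb{R}^{3 \times 3}$ (or $x \in \mathbb{R}^{4}$, $W \in \mathbb{R}^{3 \times 4}$ if the bias is absorbed as a constant input)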
Language modeling with NNs
  Keep the Markov assumption
  Learn a probability $q(u \mid v, w)$ with distributed representations
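Language modeling with NNs (Bengio et al, 2003)
  $e(w) = W_e w$, $W_e \in \mathbb{R}^{d \times |V|}$, $w \in \mathbb{R}^{|V| \times 1}$
  $h(w_{i-2}, w_{i-1}) = \sigma(W_h [e(w_{i-2}); e(w_{i-1})])$, $W_h \in \mathbb{R}^{m \times 2d}$
  $f(z) = \text{softmax}(W_o z)$, $W_o \in \mathbb{R}^{|V| \times m}$, $z \in \mathbb{R}^{m \times 1}$
  $p(w_i \mid w_{i-2}, w_{i-1}) = f(h(w_{i-2}, w_{i-1}))_i$
  Example (figure): the inputs "the" and "dog" are embedded as $e(\text{the})$ and $e(\text{dog})$, combined into $h(\text{the}, \text{dog})$, and the entry of $f(h(\text{the}, \text{dog}))$ for "laughed" is $p(\text{laughed} \mid \text{the}, \text{dog})$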
Loss function
  Minimize the negative log-likelihood
  When we learned word vectors, we were basically training a language model
Advantages
  If we see "The cat is walking in the bedroom" in the training set, we can hope to generalize to "A dog was running through a room"
  We can use n-grams with $n > 3$ and pay only a linear price
  Compared to what?
Parameter estimation
  We train with SGD
  How to efficiently compute the gradients? Backpropagation (Rumelhart, Hinton and Williams, 1986)
  The proof is in intro to ML; here we only repeat the algorithm and give an example
Backpropagation
  Notation:
    $W_t$: weight matrix at the input of layer $t$
    $z_t$: output vector at layer $t$
    $x = z_0$: input vector
    $y$: gold scalar
    $\hat{y} = z_L$: predicted scalar
    $l(y, \hat{y})$: loss function
    $v_t = W_t z_{t-1}$: pre-activations
    $\delta_t = \frac{\partial l(y, z_L)}{\partial z_t}$: gradient vector
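Backpropagation
  Run the network forward to obtain all values $v_t$, $z_t$
  Base: $\delta_L = l'(y, z_L)$
  Recursion: $\delta_t = W_{t+1}^\top (\sigma'(v_{t+1}) \odot \delta_{t+1})$, with $\sigma'(v_{t+1}), \delta_{t+1} \in \mathbb{R}^{d_{t+1} \times 1}$ and $W_{t+1} \in \mathbb{R}^{d_{t+1} \times d_t}$
  Gradients: $\frac{\partial l}{\partial W_t} = (\delta_t \odot \sigma'(v_t))\, z_{t-1}^\top$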
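Bigram LM example
  Forward pass:
    $z_0 \in \mathbb{R}^{|V| \times 1}$: one-hot input vector
    $z_1 = W_1 z_0$, $W_1 \in \mathbb{R}^{d_1 \times |V|}$, $z_1 \in \mathbb{R}^{d_1 \times 1}$
    $z_2 = \sigma(W_2 z_1)$, $W_2 \in \mathbb{R}^{d_2 \times d_1}$, $z_2 \in \mathbb{R}^{d_2 \times 1}$
    $z_3 = \text{softmax}(W_3 z_2)$, $W_3 \in \mathbb{R}^{|V| \times d_2}$, $z_3 \in \mathbb{R}^{|V| \times 1}$
    $l(y, z_3) = -\sum_i y^{(i)} \log z_3^{(i)}$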
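Bigram LM example
  Backward pass:
    $\sigma'(v_3) \odot \delta_3 = (z_3 - y)$
    $\delta_2 = W_3^\top (\sigma'(v_3) \odot \delta_3) = W_3^\top (z_3 - y)$
    $\delta_1 = W_2^\top (\sigma'(v_2) \odot \delta_2) = W_2^\top (z_2 \odot (1 - z_2) \odot \delta_2)$
    $\frac{\partial l}{\partial W_3} = (\delta_3 \odot \sigma'(v_3))\, z_2^\top = (z_3 - y)\, z_2^\top$
    $\frac{\partial l}{\partial W_2} = (\delta_2 \odot \sigma'(v_2))\, z_1^\top = (\delta_2 \odot z_2 \odot (1 - z_2))\, z_1^\top$
    $\frac{\partial l}{\partial W_1} = (\delta_1 \odot \sigma'(v_1))\, z_0^\top = \delta_1 z_0^\top$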
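The forward and backward passes of this example can be sketched in plain Python, with matrices as lists of rows. The helper names, the layer sizes ($|V| = 3$, $d_1 = d_2 = 2$) and the weight values are assumptions for illustration. A numerical gradient check on a few entries is a reasonable way to convince yourself the formulas are implemented correctly.

```python
import math

def matvec(W, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def matvec_t(W, d):
    """Compute W^T d without building the transpose."""
    return [sum(W[i][j] * d[i] for i in range(len(W))) for j in range(len(W[0]))]

def outer(a, b):
    return [[ai * bj for bj in b] for ai in a]

def softmax(v):
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    s = sum(exps)
    return [e / s for e in exps]

def forward(W1, W2, W3, z0):
    z1 = matvec(W1, z0)                                        # linear embedding layer
    z2 = [1.0 / (1.0 + math.exp(-a)) for a in matvec(W2, z1)]  # sigmoid hidden layer
    z3 = softmax(matvec(W3, z2))                               # distribution over next word
    return z1, z2, z3

def loss(W1, W2, W3, z0, y):
    z3 = forward(W1, W2, W3, z0)[2]
    return -sum(yi * math.log(pi) for yi, pi in zip(y, z3))

def backward(W1, W2, W3, z0, y):
    """Return (dl/dW1, dl/dW2, dl/dW3) following the backward-pass equations."""
    z1, z2, z3 = forward(W1, W2, W3, z0)
    d3 = [p - t for p, t in zip(z3, y)]                   # sigma'(v3) * delta3 = z3 - y
    delta2 = matvec_t(W3, d3)
    d2 = [d * z * (1.0 - z) for d, z in zip(delta2, z2)]  # delta2 * sigma'(v2)
    delta1 = matvec_t(W2, d2)                             # layer 1 is linear
    return outer(delta1, z0), outer(d2, z1), outer(d3, z2)

# Made-up weights for |V| = 3, d1 = 2, d2 = 2.
W1 = [[0.1, -0.2, 0.3], [0.4, 0.0, -0.1]]
W2 = [[0.5, -0.3], [0.2, 0.1]]
W3 = [[0.3, -0.1], [0.0, 0.2], [-0.4, 0.6]]
z0 = [0.0, 1.0, 0.0]   # one-hot previous word
y = [0.0, 0.0, 1.0]    # one-hot gold next word
gW1, gW2, gW3 = backward(W1, W2, W3, z0, y)
```

Because $z_0$ is one-hot, only one column of `gW1` is non-zero: a training example only updates the embedding of the word that was actually observed.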
Summary
  Neural nets can improve language models:
    better scalability for larger $n$
    use of word similarity
    complex decision boundaries
  Training through backpropagation
  But we still have a Markov assumption: "He is from France, so it makes sense that his first language is ..."
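Recurrent neural networks
  Example (figure): "the dog laughed"
  Input: $w_1, \ldots, w_{t-1}, w_t, w_{t+1}, \ldots, w_T$, $w_i \in \mathbb{R}^{|V|}$
  Model:
    $x_t = W^{(e)} w_t$, $W^{(e)} \in \mathbb{R}^{d \times |V|}$
    $h_t = \sigma(W^{(hh)} h_{t-1} + W^{(hx)} x_t)$, $W^{(hh)} \in \mathbb{R}^{D_h \times D_h}$, $W^{(hx)} \in \mathbb{R}^{D_h \times d}$
    $\hat{y}_t = \text{softmax}(W^{(s)} h_t)$, $W^{(s)} \in \mathbb{R}^{|V| \times D_h}$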
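A forward pass through this model can be sketched as follows. The function name `rnn_forward`, the zero initial state $h_0 = 0$, the layer sizes and the weight values are all assumptions for illustration; words are passed as vocabulary indices, so multiplying $W^{(e)}$ by a one-hot vector reduces to selecting a column.

```python
import math

def matvec(W, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def softmax(v):
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    s = sum(exps)
    return [e / s for e in exps]

def rnn_forward(word_ids, W_e, W_hh, W_hx, W_s):
    """Return the predicted next-word distribution at every time step."""
    h = [0.0] * len(W_hh)  # assumed initial hidden state h_0 = 0
    predictions = []
    for idx in word_ids:
        x = [row[idx] for row in W_e]  # W_e times a one-hot vector = column idx
        pre = [a + b for a, b in zip(matvec(W_hh, h), matvec(W_hx, x))]
        h = [1.0 / (1.0 + math.exp(-p)) for p in pre]  # the same weights at every step
        predictions.append(softmax(matvec(W_s, h)))
    return predictions

# Made-up weights for |V| = 3, d = 2, D_h = 2.
W_e = [[0.1, 0.4, -0.3], [0.2, -0.5, 0.6]]
W_hh = [[0.5, -0.2], [0.1, 0.3]]
W_hx = [[0.7, 0.0], [-0.4, 0.9]]
W_s = [[0.2, -0.1], [0.0, 0.5], [-0.3, 0.4]]
preds = rnn_forward([0, 1, 2], W_e, W_hh, W_hx, W_s)
```

The prediction at step $t$ depends on all of $w_1, \ldots, w_t$ through $h_t$, so there is no fixed-length Markov window.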
Recurrent neural networks
Recurrent neural networks
  Can exploit long-range dependencies
  Each layer has the same weights (weight sharing/tying)
  What is the loss function? $J(\theta) = \sum_{t=1}^{T} CE(y_t, \hat{y}_t)$
Recurrent neural networks
  Components:
    Model? RNN (saw before)
    Loss? Sum of cross-entropy over time
    Optimization? SGD
    Gradient computation? Back-propagation through time
Training RNNs
  Capturing long-range dependencies with RNNs is difficult
  Vanishing/exploding gradients
  Small changes in the hidden layer value at step $k$ cause huge or minuscule changes to the hidden layer values at step $t$, for $t \gg k$
Explanation
  Consider a simple linear RNN with no input:
    $h_t = W h_{t-1}$
    $h_t = W^t h_0$
    $h_t = (Q \Lambda Q^{-1})^t h_0$
    $h_t = (Q \Lambda^t Q^{-1}) h_0$
  where $W = Q \Lambda Q^{-1}$ is an eigendecomposition
  Components along eigenvalues with magnitude above 1 will explode, and those with magnitude below 1 will shrink to zero
  The input is stretched in the direction of the eigenvector with the largest eigenvalue
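A two-dimensional toy case makes this concrete. The matrix below is a made-up example with eigenvalues 1.2 (eigenvector $[1, 1]$) and 0.4 (eigenvector $[1, -1]$); the function name `step` and the number of steps are illustrative choices.

```python
def step(W, h):
    """One step of the linear RNN: h_t = W h_{t-1}."""
    return [sum(wij * hj for wij, hj in zip(row, h)) for row in W]

# W = Q diag(1.2, 0.4) Q^{-1} with eigenvectors [1, 1] and [1, -1].
W = [[0.8, 0.4], [0.4, 0.8]]
h = [1.0, 0.0]  # equals 0.5 * [1, 1] + 0.5 * [1, -1]

for _ in range(50):
    h = step(W, h)

# Coordinates of h_50 in the eigenvector basis.
along_large = (h[0] + h[1]) / 2  # grows like 0.5 * 1.2**50
along_small = (h[0] - h[1]) / 2  # shrinks like 0.5 * 0.4**50, essentially zero
```

After 50 steps the state is almost exactly aligned with $[1, 1]$, and the information that was stored along $[1, -1]$ is effectively gone.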
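Explanation (Pascanu et al, 2013)
  $h_t = W^{(hh)} \sigma(h_{t-1}) + W^{(hx)} x_t$
  $\frac{\partial J_t(\theta)}{\partial \theta} = \sum_{k=1}^{t} \frac{\partial J_t(\theta)}{\partial h_t} \frac{\partial h_t}{\partial h_k} \frac{\partial h_k}{\partial \theta}$
  $\frac{\partial h_t}{\partial h_k} = \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}} = \prod_{i=k+1}^{t} W^{(hh)\top} \text{diag}(\sigma'(h_{i-1}))$
  $\left\| \frac{\partial h_i}{\partial h_{i-1}} \right\| \le \left\| W^{(hh)\top} \right\| \left\| \text{diag}(\sigma'(h_{i-1})) \right\| \le 1$
  When this bound is below 1, the contribution of step $k$ to the total gradient at step $t$ is small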
Solutions
  Exploding gradient: gradient clipping
    Re-normalize the gradient so that its norm is at most $C$ (this is no longer the true gradient)
    Exploding gradients are easy to detect
  The problem is with the model! Change it (LSTMs, GRUs)
    Maybe later in this class
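Clipping by norm takes only a few lines. This sketch treats the gradient as one flat list of numbers; the function name `clip_gradient` and the threshold values are illustrative, and in practice the threshold $C$ is a hyperparameter to tune.

```python
import math

def clip_gradient(grad, max_norm):
    """Rescale grad so that its L2 norm is at most max_norm (direction is kept)."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]

clipped = clip_gradient([3.0, 4.0], max_norm=1.0)    # norm 5 is scaled down to 1
untouched = clip_gradient([0.3, 0.4], max_norm=1.0)  # norm 0.5 is left as is
```

The update direction is preserved and only the step size is capped, which is why the clipped vector is no longer the true gradient.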
Illustration Pascanu et al, 2013
Results Jozefowicz et al, 2016
Summary
  Language modeling is a fundamental NLP task used in machine translation, spelling correction, speech recognition, etc.
  Traditional models use n-gram counts and smoothing
  Feed-forward models take word similarity into account to generalize better
  Recurrent models can potentially learn to exploit long-range interactions
  Neural models dramatically reduced perplexity
  Recurrent networks are now used in many other NLP tasks (bidirectional RNNs, deep RNNs)