Natural Language Processing 1 lecture 6: modeling sequences (final part) Ivan Titov Institute for Logic, Language and Computation
Outline
After a recap:
- A few more words about unsupervised estimation of HMMs (forward-backward)
- More on discriminative estimation (CRFs / MEMMs)
- Recurrent neural networks / encoder-decoder
- Syntactic parsing (PCFGs)
Examples of structured prediction problems
[Figures: for each problem, an example input ("our x") and the corresponding structured output ("our y")]
- Syntactic parsing
- Protein structure prediction
- Visual scene parsing
Examples of structured prediction problems
[Figures: for each problem, an example input ("our x") and the corresponding structured output ("our y")]
- Syntactic parsing
- Protein structure prediction
- Visual scene parsing
We cannot estimate a distinct set of parameters for each y, so we need to understand: (1) how to break y into parts; (2) how to predict these parts; and (3) how these parts interact with each other.
Structured Prediction
- Given a training dataset $D = \{(x_i, y_i)\}_{i=1}^{l}$, with $x_i \in X$, $y_i \in Y$ (the output is now a graph)
- Represent examples in some feature space: $f : X \times Y \to \mathbb{R}^n$ (now input-output pairs are mapped to the feature space)
- Assume the classification rule: $\hat{y} = \arg\max_{y \in Y} w^T f(x, y)$, with $w \in \mathbb{R}^n$ (classification is computationally challenging as we search through the space $Y$)
- Estimate the parameters by optimizing some objective on the training data
  - For example, the hinge loss (max-margin classification / SVMs): $\arg\min_{w} \frac{1}{2}\|w\|^2$ s.t. $w^T f(x_i, y_i) - \max_{y \in Y \setminus \{y_i\}} w^T f(x_i, y) \ge 1$ for all $i$. The first term is the score for the "gold" structure; the max is the highest score among the remaining structures.
Consider the sequence labeling example
- The feature space: each coordinate of $f$ counts a fragment of the labeled sequence (tag-tag fragments over N, M, V and tag-word fragments over birds, dogs, can, fly, ...)
- $f(\text{dogs:N can:M fly:V}) = (0, 1, 0, 0, \dots, 1, 1, \dots, 0, 1, 1, \dots)^T$ (counts of the corresponding fragments)
- $f(\text{dogs:N can:N fly:N}) = (0, 1, 1, 1, \dots, 0, 0, \dots, 2, 0, 0, \dots)^T$ (features of another sequence)
- $w = (5, 4, 2, 2, \dots, 10, 5, \dots, -1, 3, 3, \dots)^T$
- We want to find weights which score all the wrong sequences below the correct ones: $w^T f(\text{dogs:N can:M fly:V}) > w^T f(\text{dogs:N can:N fly:N})$, and likewise for all $(M^3 - 1)$ other "wrong" sequences for this sentence
Structured Perceptron
- Return to structured prediction: $\hat{y} = \arg\max_{y \in Y} w^T f(x, y)$
- Perceptron algorithm, given a training set $D = \{(x_i, y_i)\}_{i=1}^{l}$:

    w = 0                                        // initialize
    do
      err = 0
      for i = 1 to l                             // over the training examples
        ŷ = argmax_y w^T f(x_i, y)               // model prediction (runs Viterbi during training)
        if ( w^T f(x_i, ŷ) > w^T f(x_i, y_i) )   // if mistake
          w += f(x_i, y_i) - f(x_i, ŷ)           // update
          err++                                  // # errors
        endif
      endfor
    while ( err > 0 )                            // repeat until no errors
    return w

- The update pushes the correct sequence up and the incorrectly predicted one down
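The loop above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the feature function (counts of tag-tag and tag-word fragments), the two-sentence training set and the helper names are hypothetical; the argmax is a brute-force search over all labelings rather than Viterbi; and a "mistake" is taken to be any prediction that differs from the gold labeling, which is slightly different from the score comparison on the slide.

```python
from itertools import product

TAGS = ["N", "M", "V"]

def features(words, tags):
    """Count tag-tag ("T") and tag-word ("E") fragments of a labeled sequence."""
    counts = {}
    prev = "<s>"
    for word, tag in zip(words, tags):
        for key in (("T", prev, tag), ("E", tag, word)):
            counts[key] = counts.get(key, 0) + 1
        prev = tag
    return counts

def score(w, feats):
    """Dot product w^T f for sparse feature counts."""
    return sum(w.get(k, 0.0) * v for k, v in feats.items())

def predict(w, words):
    # Brute-force argmax over all |TAGS|^n labelings; Viterbi would replace this.
    return max(product(TAGS, repeat=len(words)),
               key=lambda tags: score(w, features(words, tags)))

def train(data, max_epochs=20):
    w = {}
    for _ in range(max_epochs):
        errors = 0
        for words, gold in data:
            pred = predict(w, words)
            if pred != tuple(gold):                      # mistake
                for k, v in features(words, gold).items():
                    w[k] = w.get(k, 0.0) + v             # push the gold sequence up
                for k, v in features(words, pred).items():
                    w[k] = w.get(k, 0.0) - v             # push the prediction down
                errors += 1
        if errors == 0:                                  # no errors in a full pass
            break
    return w

# Hypothetical toy training set.
data = [(["dogs", "can", "fly"], ["N", "M", "V"]),
        (["birds", "fly"], ["N", "V"])]
w = train(data)
```

After training, `predict(w, ["dogs", "can", "fly"])` can be compared against the gold tags to confirm that the learned weights separate the correct sequence from the others.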
Averaged Structured Perceptron
- Return to structured prediction: $\hat{y} = \arg\max_{y \in Y} w^T f(x, y)$
- Perceptron algorithm, given a training set $D = \{(x_i, y_i)\}_{i=1}^{l}$:

    w = 0; k = 0                                 // initialize
    do
      err = 0
      for i = 1 to l                             // over the training examples
        ŷ = argmax_y w^T f(x_i, y)               // model prediction
        if ( w^T f(x_i, ŷ) > w^T f(x_i, y_i) )   // if mistake
          w += f(x_i, y_i) - f(x_i, ŷ)           // update
          err++                                  // # errors
        endif
        w_k = w; k++                             // store the current weights
      endfor
    while ( err > 0 )                            // in practice: do not run until convergence (just T iterations)
    return (1/k) * sum_{i=1..k} w_i

- How can we compute the average (much more) memory efficiently?
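One common answer to the memory question is to keep, next to `w`, a second vector that accumulates each update weighted by the step at which it happened. The sketch below is an assumption about how this could be done, not something stated on the slide; `updates` is a hypothetical list of per-step weight changes (zero vectors for steps without a mistake), and both function names are made up for the illustration.

```python
import numpy as np

def naive_average(updates):
    """Store every intermediate weight vector w_k and average them."""
    w = np.zeros_like(updates[0], dtype=float)
    history = []
    for delta in updates:
        w = w + delta
        history.append(w.copy())
    return np.mean(history, axis=0)

def running_average(updates):
    """Same average using only two vectors: w and a step-weighted sum u."""
    w = np.zeros_like(updates[0], dtype=float)
    u = np.zeros_like(w)
    c = 0                       # number of steps taken so far
    for delta in updates:
        w += delta
        u += c * delta          # delta is absent from the first c stored vectors
        c += 1
    return w - u / c

# Hypothetical sequence of updates (the second step made no mistake).
updates = [np.array([1.0, 0.0]), np.array([0.0, 0.0]),
           np.array([-1.0, 2.0]), np.array([0.5, 0.5])]
avg_naive = naive_average(updates)
avg_running = running_average(updates)
```

Both functions return the same vector, but the second one needs memory for two weight vectors regardless of the number of steps.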
Generative vs discriminative on smaller datasets
- For smaller training sets:
  - Theoretical results: generative classifiers converge faster to their optimal error [Ng & Jordan, NIPS 01]
  - Empirical: [Figure: error rates of a discriminative classifier and a generative model on predicting housing prices in the Boston area, plotted against the number of training examples]
Hidden Markov Models: Unsupervised Estimation
- $N$ is the number of tags, $M$ is the vocabulary size
- Parameters (to be estimated from the training set); note the change in notation from $y$ to $s$:
  - Transition probabilities $a_{ji} = P(s_t = i \mid s_{t-1} = j)$, $A$ is an $[N \times N]$ matrix
  - Emission probabilities $b_{ik} = P(x_t = k \mid s_t = i)$, $B$ is an $[N \times M]$ matrix
- Training corpus:
  - $x^{(1)}$ = (In, an, Oct., 19, review, of, ...), $y^{(1)}$ = (IN, DT, NNP, CD, NN, IN, ...)
  - $x^{(2)}$ = (Ms., Haag, plays, Elianti, .), $y^{(2)}$ = (NNP, NNP, VBZ, NNP, .)
  - ...
  - $x^{(L)}$ = (The, company, said, ...), $y^{(L)}$ = (DT, NN, VBD, NNP, .)
- For notation reasons, let's assume that all the sentences have length $n$
- How to estimate the parameters using maximum likelihood estimation? You might have guessed what these estimates are.
- How do we estimate models in the unsupervised set-up?
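EM (intuition)
- State posteriors: $\gamma_t^{(l)}(i) = P(s_t = i \mid x_1^{(l)}, \dots, x_n^{(l)})$
- Example posteriors for three sentences (one column per word):

    John  loves  Mary         Mary  loves  books        John  hates  books
    N  0.7   0.2   0.6        N  0.5   0.1   0.8        N  0.9   0.4   0.5
    V  0.3   0.8   0.4        V  0.5   0.9   0.2        V  0.1   0.6   0.5

- Expected emission counts: $\hat{C}_E(x, s) = \sum_{l=1}^{L} \sum_{t=1}^{n} \gamma_t^{(l)}(s)\, I(x_t^{(l)} = x)$, where $I$ is an indicator function: it equals 1 if the condition is true, 0 otherwise
- Re-estimated emission probabilities: $P(x \mid s) = \hat{C}_E(x, s) / \sum_{x'} \hat{C}_E(x', s)$
- For example, $P(x_t = \text{books} \mid s_t = N) = 1.3 / 4.7$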
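The arithmetic behind the 1.3 / 4.7 estimate can be reproduced directly from the posteriors listed on the slide. The snippet below is a small sketch; the data layout and the function name `emission_prob` are choices made for this illustration only.

```python
from collections import defaultdict

sentences = [["John", "loves", "Mary"],
             ["Mary", "loves", "books"],
             ["John", "hates", "books"]]

# gamma[l][state][t]: posterior of `state` at position t of sentence l (from the slide).
gamma = [
    {"N": [0.7, 0.2, 0.6], "V": [0.3, 0.8, 0.4]},
    {"N": [0.5, 0.1, 0.8], "V": [0.5, 0.9, 0.2]},
    {"N": [0.9, 0.4, 0.5], "V": [0.1, 0.6, 0.5]},
]

# Expected emission counts: sum the posterior mass of each (word, state) pair.
expected_count = defaultdict(float)
for words, posteriors in zip(sentences, gamma):
    for t, word in enumerate(words):
        for state, probs in posteriors.items():
            expected_count[(word, state)] += probs[t]

def emission_prob(word, state):
    """Normalize the expected counts over all words emitted by `state`."""
    total = sum(c for (_, s), c in expected_count.items() if s == state)
    return expected_count[(word, state)] / total

p_books_given_n = emission_prob("books", "N")   # 1.3 / 4.7
```

The numerator collects the mass 0.8 + 0.5 assigned to "books" under N, and the denominator is the total mass assigned to N over all nine word positions.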
EM (intuition)
- Posteriors over pairs of states: $\xi_t^{(l)}(i, j) = P(s_t = i, s_{t+1} = j \mid x_1^{(l)}, \dots, x_n^{(l)})$
- Example posteriors for three sentences ($ marks the sentence boundary; the two middle columns are the two word-to-word transitions):

    John loves Mary     start: $-N 0.8, $-V 0.2    V-V 0.1 0.2   V-N 0.1 0.1   N-N 0.2 0.3   N-V 0.6 0.4    end: N-$ 0.7, V-$ 0.3
    Mary loves books    start: $-N 0.7, $-V 0.3    V-V 0.2 0.1   V-N 0.2 0.1   N-N 0.2 0.1   N-V 0.4 0.7    end: N-$ 0.8, V-$ 0.2
    John hates books    start: $-N 0.6, $-V 0.4    V-V 0.1 0.1   V-N 0.2 0.1   N-N 0.2 0.3   N-V 0.5 0.5    end: N-$ 0.6, V-$ 0.4

- Expected transition counts: $\hat{C}_T(i, j) = \sum_{l=1}^{L} \sum_{t=0}^{n-1} \xi_t^{(l)}(i, j)$
- Re-estimated transition probabilities: $P(j \mid i) = \hat{C}_T(i, j) / \sum_{j'} \hat{C}_T(i, j')$
- For example, $P(s_t = N \mid s_{t-1} = V) = 0.8 / 2.5$
- Disclaimer: posterior distributions in this example may not satisfy natural consistency conditions; this is just an example
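The 0.8 / 2.5 estimate can be checked by summing the pair posteriors from the slide. In the sketch below the assignment of the end-of-sentence values to the second and third sentence follows the order in which they appear on the slide, which is an assumption; the result does not depend on it, because only their sum enters the denominator. The name `transition_prob` is hypothetical.

```python
from collections import defaultdict

# xi[l][t][(i, j)]: posterior of the state pair (i, j) at step t of sentence l.
xi = [
    [{("$", "N"): 0.8, ("$", "V"): 0.2},
     {("V", "V"): 0.1, ("V", "N"): 0.1, ("N", "N"): 0.2, ("N", "V"): 0.6},
     {("V", "V"): 0.2, ("V", "N"): 0.1, ("N", "N"): 0.3, ("N", "V"): 0.4},
     {("N", "$"): 0.7, ("V", "$"): 0.3}],
    [{("$", "N"): 0.7, ("$", "V"): 0.3},
     {("V", "V"): 0.2, ("V", "N"): 0.2, ("N", "N"): 0.2, ("N", "V"): 0.4},
     {("V", "V"): 0.1, ("V", "N"): 0.1, ("N", "N"): 0.1, ("N", "V"): 0.7},
     {("N", "$"): 0.8, ("V", "$"): 0.2}],
    [{("$", "N"): 0.6, ("$", "V"): 0.4},
     {("V", "V"): 0.1, ("V", "N"): 0.2, ("N", "N"): 0.2, ("N", "V"): 0.5},
     {("V", "V"): 0.1, ("V", "N"): 0.1, ("N", "N"): 0.3, ("N", "V"): 0.5},
     {("N", "$"): 0.6, ("V", "$"): 0.4}],
]

# Expected transition counts: sum over sentences and positions.
expected_count = defaultdict(float)
for sentence in xi:
    for step in sentence:
        for pair, prob in step.items():
            expected_count[pair] += prob

def transition_prob(prev_state, next_state):
    """Normalize the expected counts over all successors of `prev_state`."""
    total = sum(c for (i, _), c in expected_count.items() if i == prev_state)
    return expected_count[(prev_state, next_state)] / total

p_n_given_v = transition_prob("V", "N")   # 0.8 / 2.5
```

Here the denominator 2.5 is the sum of the V-V (0.8), V-N (0.8) and V-$ (0.9) masses.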
Intuitive conclusion
- We need to figure out how to compute:
  (1) probabilistic predictions about states: $\gamma_t(i) = P(s_t = i \mid x_1, \dots, x_n)$
  (2) probabilistic predictions about pairs of states: $\xi_t(i, j) = P(s_t = i, s_{t+1} = j \mid x_1, \dots, x_n)$
Forward-backward probabilities
[Figure: forward and backward probabilities on an HMM chain; picture based on one from Tommi Jaakkola]
- Forward probabilities: $\alpha_t(i) = P(x_1, \dots, x_t, s_t = i)$
- Backward probabilities: $\beta_t(i) = P(x_{t+1}, \dots, x_n \mid s_t = i)$. We can think of this as evidence about the current state from future observations.
Forward-backward probabilities
- Recursion for calculating forward probabilities $\alpha_t(i) = P(x_1, \dots, x_t, s_t = i)$:
  - $\alpha_1(i) = P(i \mid \$)\, P(x_1 \mid i)$
  - $\alpha_t(i) = \left( \sum_j \alpha_{t-1}(j)\, P(i \mid j) \right) P(x_t \mid i)$
Forward-backward probabilities
- Analogously, recursion for calculating backward probabilities $\beta_t(i) = P(x_{t+1}, \dots, x_n \mid s_t = i)$:
  - $\beta_n(i) = 1$ (we assume here that the $n$-th symbol is \$, or </s>)
  - $\beta_t(i) = \sum_j P(j \mid i)\, P(x_{t+1} \mid j)\, \beta_{t+1}(j)$
Forward-backward probabilities
- Recall: $\alpha_t(i) = P(x_1, \dots, x_t, s_t = i)$ and $\beta_t(i) = P(x_{t+1}, \dots, x_n \mid s_t = i)$
- The forward and backward probabilities are complementary and permit us to evaluate various probabilities:
  1. $P(x_1, x_2, \dots, x_n)$
  2. $\gamma_t(i) = P(s_t = i \mid x_1, \dots, x_n)$
  3. $\xi_t(i, j) = P(s_t = i, s_{t+1} = j \mid x_1, \dots, x_n)$
- The last two are the ones we need for EM
Forward-backward probabilities
The probability of an observed sequence:
$P(x_1, \dots, x_n) = \sum_i P(x_1, \dots, x_n, s_t = i)$
$= \sum_i P(x_1, \dots, x_t, s_t = i)\, P(x_{t+1}, \dots, x_n \mid s_t = i)$
$= \sum_i \alpha_t(i)\, \beta_t(i)$
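The two recursions and the identity $P(x_1, \dots, x_n) = \sum_i \alpha_t(i)\beta_t(i)$ can be sketched with NumPy. The parameters below are made up for illustration, `A[j, i]` stores $P(s_t = i \mid s_{t-1} = j)$, `pi[i]` plays the role of $P(i \mid \$)$, and, as a simplifying assumption, no explicit end-of-sentence symbol is modeled.

```python
import numpy as np

def forward(pi, A, B, obs):
    """alpha[t, i] = P(x_1..x_t, s_t = i), with 0-based t."""
    alpha = np.zeros((len(obs), len(pi)))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, len(obs)):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha

def backward(A, B, obs):
    """beta[t, i] = P(x_{t+1}..x_n | s_t = i), with beta at the last step set to 1."""
    beta = np.ones((len(obs), A.shape[0]))
    for t in range(len(obs) - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

# Hypothetical 2-state, 3-symbol HMM and a short observation sequence.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])
obs = [0, 1, 2]

alpha = forward(pi, A, B, obs)
beta = backward(A, B, obs)
likelihoods = (alpha * beta).sum(axis=1)   # one value of P(x_1..x_n) per position t
```

Every entry of `likelihoods` should agree, since the decomposition holds for any choice of the position $t$.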
Forward-backward probabilities
The posterior probability that the HMM was in state $i$ at time $t$:
$\gamma_t(i) = P(s_t = i \mid x_1, \dots, x_n) = \frac{P(x_1, \dots, x_n, s_t = i)}{P(x_1, \dots, x_n)} = \frac{\alpha_t(i)\, \beta_t(i)}{\sum_j \alpha_t(j)\, \beta_t(j)}$
Forward-backward probabilities
The posterior probability of being in states $i$ and $j$ at times $t$ and $t+1$, respectively:
$\xi_t(i, j) = P(s_t = i, s_{t+1} = j \mid x_1, \dots, x_n) = \frac{\alpha_t(i)\, P(s_{t+1} = j \mid s_t = i)\, P(x_{t+1} \mid s_{t+1} = j)\, \beta_{t+1}(j)}{\sum_{j'} \alpha_t(j')\, \beta_t(j')}$
Putting everything together
- We need to figure out how to compute:
  (1) probabilistic predictions about states: $\gamma_t(i) = P(s_t = i \mid x_1, \dots, x_n)$
  (2) probabilistic predictions about pairs of states: $\xi_t(i, j) = P(s_t = i, s_{t+1} = j \mid x_1, \dots, x_n)$
- Now we know how: via forward and backward probabilities (see also chapter 6.5 of J&M)
- Now we have all the components in place for EM:
  - $\hat{C}_T(i, j) = \sum_{l=1}^{L} \sum_{t=1}^{n-1} \xi_t^{(l)}(i, j) \;\Rightarrow\; P(j \mid i) = \hat{C}_T(i, j) / \sum_{j'} \hat{C}_T(i, j')$
  - $\hat{C}_E(x, s) = \sum_{l=1}^{L} \sum_{t=1}^{n} \gamma_t^{(l)}(s)\, I(x_t^{(l)} = x) \;\Rightarrow\; P(x \mid s) = \hat{C}_E(x, s) / \sum_{x'} \hat{C}_E(x', s)$
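A single EM (Baum-Welch) iteration built from these pieces might look as follows. This is a sketch under assumptions: the toy parameters and corpus are invented, `A[i, j]` stores $P(j \mid i)$, the initial distribution `pi` stands in for the transitions out of \$, and no end-of-sentence symbol is modeled.

```python
import numpy as np

def forward_backward(pi, A, B, obs):
    """Return alpha[t, i] and beta[t, i] for one observation sequence."""
    n, N = len(obs), len(pi)
    alpha, beta = np.zeros((n, N)), np.ones((n, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, n):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    for t in range(n - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return alpha, beta

def em_step(pi, A, B, corpus):
    """E-step: expected counts from posteriors. M-step: normalize the counts."""
    N, M = B.shape
    c_init, c_trans, c_emit = np.zeros(N), np.zeros((N, N)), np.zeros((N, M))
    for obs in corpus:
        alpha, beta = forward_backward(pi, A, B, obs)
        px = alpha[-1].sum()                        # P(x_1..x_n)
        gamma = alpha * beta / px                   # gamma[t, i]
        c_init += gamma[0]
        for t in range(len(obs) - 1):
            # xi_t(i, j) = alpha_t(i) P(j|i) P(x_{t+1}|j) beta_{t+1}(j) / P(x)
            c_trans += (alpha[t][:, None] * A
                        * (B[:, obs[t + 1]] * beta[t + 1])[None, :]) / px
        for t, x in enumerate(obs):
            c_emit[:, x] += gamma[t]
    return (c_init / c_init.sum(),
            c_trans / c_trans.sum(axis=1, keepdims=True),
            c_emit / c_emit.sum(axis=1, keepdims=True))

def log_likelihood(pi, A, B, corpus):
    return sum(np.log(forward_backward(pi, A, B, obs)[0][-1].sum()) for obs in corpus)

# Hypothetical initial parameters and unlabeled corpus (symbols are vocabulary indices).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
corpus = [[0, 1, 2], [2, 1, 0, 0], [1, 1, 2]]

new_pi, new_A, new_B = em_step(pi, A, B, corpus)
```

Repeating `em_step` until the log-likelihood stops improving gives the usual training loop; each iteration should not decrease `log_likelihood`.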
Viterbi / forward-backward / Baum-Welch
- Have you noticed a relation between the forward algorithm and Viterbi?
- Roughly speaking, Viterbi selects the best previous states, whereas the forward algorithm sums over all possibilities:
  - Forward: $\alpha_t(i) = P(x_1, \dots, x_t, s_t = i) = \sum_{s_1, \dots, s_{t-1}} P(x_1, \dots, x_t, s_1, \dots, s_{t-1}, s_t = i)$
  - Viterbi: $v_t(i) = \max_{s_1, \dots, s_{t-1}} P(x_1, \dots, x_t, s_1, \dots, s_{t-1}, s_t = i)$
  - Forward: $\alpha_t(i) = P(x_t \mid i) \left( \sum_j \alpha_{t-1}(j)\, P(i \mid j) \right)$ (a sum-product algorithm)
  - Viterbi: $v_t(i) = P(x_t \mid i) \left( \max_j v_{t-1}(j)\, P(i \mid j) \right)$ (a max-product algorithm)
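The sum-product versus max-product relation can be made concrete by writing a single recursion whose only varying part is the operation that combines the previous states. The sketch below uses invented toy parameters, with `A[j, i]` storing $P(i \mid j)$, and includes a brute-force enumeration of all state sequences purely as a sanity check.

```python
import numpy as np
from itertools import product

def forward_or_viterbi(pi, A, B, obs, combine):
    """combine=np.sum yields alpha at the last step; combine=np.max yields v at the last step."""
    table = pi * B[:, obs[0]]
    for x in obs[1:]:
        # (table[:, None] * A)[j, i] = table[j] * P(i | j); combine over the previous state j.
        table = combine(table[:, None] * A, axis=0) * B[:, x]
    return table

def joint(pi, A, B, obs, states):
    """P(x_1..x_n, s_1..s_n) for one complete state sequence."""
    p = pi[states[0]] * B[states[0], obs[0]]
    for t in range(1, len(obs)):
        p *= A[states[t - 1], states[t]] * B[states[t], obs[t]]
    return p

# Hypothetical 2-state, 3-symbol HMM.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
obs = [0, 1, 2, 1]

alpha_last = forward_or_viterbi(pi, A, B, obs, np.sum)
v_last = forward_or_viterbi(pi, A, B, obs, np.max)
all_joint = [joint(pi, A, B, obs, s) for s in product(range(2), repeat=len(obs))]
```

Summing `alpha_last` gives the total probability of the observations, while the maximum of `v_last` is the probability of the single best state sequence; recovering that sequence itself would additionally require back-pointers, which are omitted here.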
Viterbi / forward-backward / Baum-Welch
- Forward-backward algorithm for HMMs
  [Figure: forward probabilities (beliefs) and backward probabilities (beliefs) passed along the chain]
- It can be generalized to more general graphs ("belief propagation")
Summary so far
- Supervised estimation: generative (ML) and (feature-rich) discriminative modeling (structured perceptron, conditional random fields, MEMMs)
- Unsupervised modeling: generative (EM)
- Recently, there has been some work on estimation of feature-rich models for unsupervised modeling
Discriminative estimation
- Generative models not only consider the labeling but also score how likely an input sequence $x_1, \dots, x_n$ is
- Discriminative models instead concentrate only on modeling how likely an output sequence $y_1, \dots, y_n$ is for a given input $x_1, \dots, x_n$
- (I switched back to using $y$ instead of $s$)
Discriminative estimation
- We have learnt one discriminative method: the structured perceptron
- But what if we want to get probabilities $P(y \mid x)$?
Recap: logistic regression
- Logistic ("softmax") regression (aka max-entropy classifier): $P(y \mid x, w) = \frac{\exp(w^T f(x, y))}{Z(x)}$
- $Z(x)$ is a partition function: $Z(x) = \sum_{y' \in Y} \exp(w^T f(x, y'))$ (notation abuse: we drop $w$ from $Z$)
- "Conditional" log-likelihood: $L(w) = \sum_{l=1}^{L} \log P(y^{(l)} \mid x^{(l)}, w) + k\,\|w\|_2^2$
- We will forget the regularizer in the future discussion
Recap: stochastic gradient descent
- We compute the gradient $\nabla L(w)$ of the conditional log-likelihood function $L(w)$
- We can update our parameter vector based on the gradient: $w := w + \eta\, \nabla L(w)$, where $\eta$ is the learning rate
- In practice, slightly "smarter" gradient methods are normally used
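Recap: gradient for multi-class
- Let's derive the gradient:
  $\frac{\partial L(w)}{\partial w_i} = \frac{\partial}{\partial w_i} \left[ \sum_{l=1}^{L} \log P(y^{(l)} \mid x^{(l)}, w) \right]$
  $= \frac{\partial}{\partial w_i} \left[ \sum_{l=1}^{L} \left( w^T f(x^{(l)}, y^{(l)}) - \log \sum_{y' \in Y} \exp(w^T f(x^{(l)}, y')) \right) \right]$
  $= \sum_{l=1}^{L} \left( f_i(x^{(l)}, y^{(l)}) - \sum_{\tilde{y}} f_i(x^{(l)}, \tilde{y}) \frac{\exp(w^T f(x^{(l)}, \tilde{y}))}{\sum_{y' \in Y} \exp(w^T f(x^{(l)}, y'))} \right)$
  $= \sum_{l=1}^{L} \left( f_i(x^{(l)}, y^{(l)}) - \sum_{\tilde{y}} f_i(x^{(l)}, \tilde{y})\, P(\tilde{y} \mid x^{(l)}, w) \right)$
- Intuition: we are trying to find a model such that the feature expectations computed according to the model are similar to their estimates from the data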
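The final line of the derivation, observed feature counts minus expected feature counts, can be checked numerically. The following sketch uses random feature vectors as a stand-in for a real feature function, ignores the regularizer, and compares the analytic gradient with central finite differences; the array layout and names are assumptions made for this illustration.

```python
import numpy as np

def log_likelihood(w, F, gold):
    """F[l, y] is the feature vector f(x^(l), y); gold[l] is the index of y^(l)."""
    scores = F @ w                                     # shape (L, |Y|)
    log_z = np.log(np.exp(scores).sum(axis=1))         # log Z(x^(l))
    return (scores[np.arange(len(gold)), gold] - log_z).sum()

def gradient(w, F, gold):
    """Observed features minus features expected under P(y | x, w)."""
    scores = F @ w
    probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    observed = F[np.arange(len(gold)), gold].sum(axis=0)
    expected = np.einsum("ly,lyi->i", probs, F)
    return observed - expected

# Hypothetical data: 4 examples, 3 classes, 5 features.
rng = np.random.default_rng(0)
F = rng.normal(size=(4, 3, 5))
gold = np.array([0, 2, 1, 1])
w = rng.normal(size=5)

analytic = gradient(w, F, gold)
eps = 1e-6
numeric = np.array([
    (log_likelihood(w + eps * e, F, gold) - log_likelihood(w - eps * e, F, gold)) / (2 * eps)
    for e in np.eye(len(w))
])
```

A gradient ascent step would then be `w = w + eta * analytic` for some learning rate `eta`.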
Structured case
- The soft-max classifier: $P(y \mid x, w) = \frac{\exp(w^T f(x, y))}{Z(x)}$
- How do we generalize it to sequences?
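Idea 1: Max-Ent Markov Models (MEMMs)
[Figure: chain of labels $y_1, y_2, y_3, \dots, y_n$, each connected to the previous label and to the inputs $x_1, x_2, x_3, \dots, x_n$]
- $P(y \mid x, w) = P(y_1 \mid x, w) \prod_{t=2}^{n} P(y_t \mid y_{t-1}, x, w)$
- $P(y_t \mid y_{t-1}, x, w) = \frac{\exp(w^T f(y_{t-1}, y_t, x, t))}{Z(y_{t-1}, x, t)}$ (notation abuse: we will drop $t$)
- Essentially, prediction of each next label is just a classification decision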
Idea 1: Max-Ent Markov Models (MEMMs)
[Figure: chain of labels $y_1, y_2, y_3, \dots, y_n$, each connected to the previous label and to the whole input $x$]
- $P(y \mid x, w) = P(y_1 \mid x, w) \prod_{t=2}^{n} P(y_t \mid y_{t-1}, x, w)$
- $P(y_t \mid y_{t-1}, x, w) = \frac{\exp(w^T f(y_{t-1}, y_t, x, t))}{Z(y_{t-1}, x, t)}$ (notation abuse: we will drop $t$)
- Essentially, prediction of each next label is just a classification decision
- What are the training examples?
- How do we search?
MEMMs: Problem 1
$P(y_t \mid y_{t-1}, x, w) = \frac{\exp(w^T f(y_{t-1}, y_t, x, t))}{Z(y_{t-1}, x, t)}$
- Since the model is trained using the correct label as $y_{t-1}$, it will end up over-relying on it
- At test time, $y_{t-1}$ (unlike $x$) will be predicted (and often incorrect)
- Formally, the distribution of features in training and testing is different and the examples are interdependent (the i.i.d. assumption standard in machine learning is broken)
- This is not only a theoretical problem
MEMMs: Problem 1
- Consider a sequence labeling problem $\{0, 1\}^3 \to \{0, 1\}^3$
- Input vectors are uniformly distributed
- If the input is (1, 1, 1), the output is also (1, 1, 1)
- Otherwise, the output is (0, 0, 0)
[Figure: chain $y_1, y_2, y_3$ with each $y_t$ connected to $x_t$]
- $P(y \mid x, w) = P(y_1 \mid x, w) \prod_{t=2}^{n} P(y_t \mid y_{t-1}, x, w)$ (let's forget for now that the probabilities are computed with softmax)
- $P(y_1 = 1 \mid x_1 = 1) = 1/4$, $P(y_1 = 0 \mid x_1 = 1) = 3/4$
- $P(y_2 = 0 \mid y_1 = 0, x_2) = 1$ for all $x_2$ (ignores the input!)
- $P(y_2 = 1 \mid y_1 = 1, x_2 = 1) = 1$
- Hence $P((1, 1, 1) \mid (1, 1, 1)) = 1/4$ and $P((0, 0, 0) \mid (1, 1, 1)) = 3/4$
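The 1/4 and 3/4 figures can be reproduced by estimating each local distribution from the eight equally likely training pairs. The sketch below assumes, as the example suggests, that each local factor conditions only on the previous label and the current input symbol; the function names are hypothetical, positions are 0-based, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import product

inputs = list(product([0, 1], repeat=3))       # uniform over {0,1}^3

def target(x):
    """Gold labeling: all ones only for the all-ones input."""
    return (1, 1, 1) if x == (1, 1, 1) else (0, 0, 0)

def p_first(y1, x1):
    """ML estimate of P(y_1 | x_1) from the 8 equally likely examples."""
    match = [x for x in inputs if x[0] == x1]
    return Fraction(sum(target(x)[0] == y1 for x in match), len(match))

def p_next(t, y, y_prev, x_t):
    """ML estimate of P(y_t | y_{t-1}, x_t) for a 0-based position t >= 1.

    Only defined for contexts that occur in the data (otherwise the
    denominator is zero).
    """
    match = [x for x in inputs if target(x)[t - 1] == y_prev and x[t] == x_t]
    return Fraction(sum(target(x)[t] == y for x in match), len(match))

def memm_prob(y, x):
    """Product of the locally normalized factors."""
    p = p_first(y[0], x[0])
    for t in range(1, 3):
        p *= p_next(t, y[t], y[t - 1], x[t])
    return p

p_correct = memm_prob((1, 1, 1), (1, 1, 1))    # Fraction(1, 4)
p_wrong = memm_prob((0, 0, 0), (1, 1, 1))      # Fraction(3, 4)
```

The first decision is made from $x_1$ alone, and once $y_1 = 0$ has been chosen the later factors assign probability 1 to continuing with zeros whatever the input says, so the wrong labeling wins.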
MEMMs: Problem 1
- Same toy problem: a sequence labeling problem $\{0, 1\}^3 \to \{0, 1\}^3$ with uniformly distributed inputs, output (1, 1, 1) for input (1, 1, 1) and (0, 0, 0) otherwise, for which the MEMM gives $P((1, 1, 1) \mid (1, 1, 1)) = 1/4$ and $P((0, 0, 0) \mid (1, 1, 1)) = 3/4$
- One can say: this all has happened because we factorized the model badly (i.e. the features are not appropriate)
- Yes, but:
  1. We do not know if we will do better with (complex) real problems
  2. Structured perceptron (and CRF) would learn a perfect classifier with this factorization
MEMMs: Problem 2
- The "label bias problem"
- For example, our set of labels is small (=2)
- Imagine we perform Viterbi. At time t-1: Hypothesis 1 has p = 0.001, Hypothesis 2 has p = 0.0001
- At time t comes input completely inconsistent with Hypothesis 1
- Slide annotation: "It is often true"
- [Figure: Viterbi lattice over times t-2, t-1, t with one state at time t marked in red]
- Can the model (in principle) ensure that the red state is not included in the winning path?
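MEMMs: Problem 2
- The "label bias problem"
- For example, our set of labels is small (=2)
- Imagine we perform Viterbi. At time t-1: Hypothesis 1 has p = 0.001, Hypothesis 2 has p = 0.0001
- At time t comes input completely inconsistent with Hypothesis 1
- Slide annotation: "It is often true"
- [Figure: Viterbi lattice over times t-2, t-1, t; the outgoing transition probabilities of each hypothesis are 0 and 1, the resulting path scores at time t are p = 0.0001 and p = 0.001, and the path with p = 0.001 is marked "The winner"]
- Can the model (in principle) ensure that the red state is not included in the winning path?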
MEMMs: Problem 2
- The "label bias problem"
- For example, our set of labels is small (=2)
- Imagine we perform Viterbi. At time t-1: Hypothesis 1 has p = 0.001, Hypothesis 2 has p = 0.0001
- At time t comes input completely inconsistent with Hypothesis 1
- Slide annotation: "It is often true"
- [Figure: Viterbi lattice over times t-2, t-1, t; Hypothesis 1 now splits its mass 0.5 / 0.5 over its two successors, giving path scores p = 0.001 x 0.5 each, while Hypothesis 2 has transition probabilities 1 and 0]
- Can the model (in principle) ensure that the red state is not included in the winning path? No, because we need to preserve the probability mass (the probabilities sum to 1)
- Why is it not a serious problem for HMMs?
MEMMs: Conclusions
- MEMMs are easy to estimate and use
- In practice, they are very often used in NLP (we will see them again in the context of parsing)
- But they have serious problems and, consequently, are brittle
- These problems are not unique to MEMMs but affect any "piecewise" estimation approach (our neural / deep models as well)
- Problems very similar to "Problem 1" come up when "pipelines" are used:
  - For example, text -> PoS tagging -> syntactic parsing -> semantic analysis -> dialog state prediction -> ...
  - Errors propagate across stages and we should be "careful" when training models for individual stages
Idea 2: conditional random fields (CRF)
- $P(y \mid x, w) = \frac{\exp\left(w^T \sum_{t=1}^{n} f(y_{t-1}, y_t, x)\right)}{Z(x)}$
- The partition function: $Z(x) = \sum_{y' \in Y^n} \exp\left(w^T \sum_{t=1}^{n} f(y'_{t-1}, y'_t, x)\right)$. The summation is over the set of all potential labelings of the entire sentence (exponential size).
- (I am not careful with START and STOP, to simplify notation)
- [Figure: the factor graph over $y_1, y_2, y_3, \dots, y_n$, with factors $w^T f(y_1, y_2, x)$, $w^T f(y_2, y_3, x)$, ...]
- We score the entire sequence in one shot
- What are the training examples?
- How do we search?
CRF (Lafferty, McCallum and Pereira, 2001)
Idea 2: chain conditional random field (CRF)
- $P(y \mid x, w) = \frac{\exp\left(w^T \sum_{t=1}^{n} f(y_{t-1}, y_t, x)\right)}{Z(x)}$
- We score the entire sequence in one shot
- Do we have the i.i.d. problem (problem 1)?
- Do we have the label bias problem (problem 2)?
How do we estimate the model?
- $P(y \mid x, w) = \frac{\exp\left(w^T \sum_{t=1}^{n} f(y_{t-1}, y_t, x)\right)}{Z(x)}$, with $Z(x) = \sum_{y' \in Y^n} \exp\left(w^T \sum_{t=1}^{n} f(y'_{t-1}, y'_t, x)\right)$
- Do we have any hope of computing the gradient?
  - $L(w) = \sum_{l=1}^{L} \left[ w^T \sum_{t=1}^{n} f(y_{t-1}^{(l)}, y_t^{(l)}, x^{(l)}) - \log Z(x^{(l)}, w) \right]$
  - $\nabla L(w) = \sum_{l=1}^{L} \sum_{t=1}^{n} \left[ f(y_{t-1}^{(l)}, y_t^{(l)}, x^{(l)}) - \sum_{y, y' \in Y} P(y_{t-1} = y, y_t = y' \mid x^{(l)}, w)\, f(y, y', x^{(l)}) \right]$
- Again: matching data and model expectations
- How do we compute the probabilities?
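Although $Z(x)$ sums over exponentially many labelings, the chain structure lets the same sum-product recursion as the forward algorithm compute it in time linear in the sentence length. The sketch below assumes that the local score $w^T f(y_{t-1}, y_t, x)$ has already been split into a per-position table `unary[t, y]` and a label-pair table `pairwise[y_prev, y]`; both tables are random placeholders, and START/STOP handling is omitted.

```python
import numpy as np
from itertools import product

def log_partition(unary, pairwise):
    """log Z(x) for chain scores unary[t, y_t] + pairwise[y_{t-1}, y_t]."""
    log_alpha = unary[0]
    for t in range(1, len(unary)):
        # scores[j, i]: best-known log mass ending in label j, extended with label i
        scores = log_alpha[:, None] + pairwise + unary[t][None, :]
        m = scores.max(axis=0)                               # for numerical stability
        log_alpha = m + np.log(np.exp(scores - m).sum(axis=0))
    m = log_alpha.max()
    return m + np.log(np.exp(log_alpha - m).sum())

def brute_force_log_partition(unary, pairwise):
    """Enumerate all |Y|^n labelings; feasible only for tiny examples."""
    n, k = unary.shape
    total = 0.0
    for y in product(range(k), repeat=n):
        s = sum(unary[t, y[t]] for t in range(n))
        s += sum(pairwise[y[t - 1], y[t]] for t in range(1, n))
        total += np.exp(s)
    return np.log(total)

# Hypothetical scores for a sentence of length 4 with 3 labels.
rng = np.random.default_rng(0)
unary = rng.normal(size=(4, 3))
pairwise = rng.normal(size=(3, 3))

log_z_fast = log_partition(unary, pairwise)
log_z_slow = brute_force_log_partition(unary, pairwise)
```

Combining this forward pass with the analogous backward pass gives the pairwise marginals $P(y_{t-1} = y, y_t = y' \mid x, w)$ needed in the gradient, in the same way as $\xi_t(i, j)$ is obtained for HMMs.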
Summary: linear models for (supervised) sequence labeling
- HMMs
  + Very easy to estimate; easy to generalize to un- and semi-supervised learning
  - Low asymptotic performance
- Structured perceptron
  + Simple to implement
  - Does not yield probabilities; does not optimize an objective (kind of)
- MEMMs
  + Fast to estimate (no decoding at training time)
  - They can be brittle (recall the problems mentioned)
- CRFs
  + Give out probabilities, motivated by a clear objective, stable performance
  - Harder to implement (forward-backward), expensive training (esp. when not a 1st-order Markov model)
Recurrent neural networks (RNNs)
[Figure: chain of outputs $y_1, y_2, \dots, y_n$ above inputs $x_1, x_2, \dots, x_n$]
- Lots of similarities to MEMMs
- What's nice about them?
  - They can perform fairly well with minimal feature engineering
- What's problematic about them (vs. MEMMs)?
  - "Vanishing gradients": gradient information is not propagated from observations ($y_t$) to states far away in the past. The problem is mitigated by Long Short-Term Memory networks (LSTMs) but is still a major issue.
  - Decoding?
Encoder-Decoder
[Figure: an encoder reading $x_1, x_2, \dots, x_n$ and a decoder producing $y_1, y_2, \dots, y_n$]
- Sounds like a crazy idea: compressing the entire sentence down to a single vector
- More general than sequence labeling (why?)
- Decoding is approximate
- Brittle (but some "better" ideas are around: attention models)
NN models vs linear models
- Linear models
  + Often, exact decoding is possible
  + Some degree of interpretability
  + Easier to encode prior knowledge (?)
  + Convex optimization (for supervised learning)
  - Feature engineering is crucial
  - They can be expensive (!) in practice
- Representation learning models (incl. deep / neural)
  + Feature induction
  + Parameter sharing across multiple tasks / features (!!)
  + (Related) modeling compositionality of language
  + Can be efficient (e.g., on GPUs)
  - Non-convex optimization
  - Not very interpretable
  - Exact decoding is not possible
  - They can be more expensive (for some tasks)
[Figure: overview map of the course, with five columns]
- NLP problems: doc. classification, topic analysis, shallow syntactic parsing / tagging, syntactic parsing, relation extraction, semantic parsing, models of inference, machine translation, question answering, opinion analysis, summarization, dialogue systems, ...
- Types of structures: bags, sequences / chains, spanning trees, hierarchical trees or DAGs, bipartite graphs, ...
- Models / views: Naive Bayes, topic models, HMMs, history- / transition-based models, PCFGs, DOP, global scoring (e.g., MST), "IBM" models, ...
- Set-ups: supervised estimation, unsupervised, partially / semi-supervised, ...
- Modeling frameworks: generative ML, generative Bayes, discriminative, discriminative Bayes, representation learning (factorizations / NNs), ...
- Many problems in NLP can be cast as / approximated with a sequence model
- What about dialog systems? (e.g., Siri)