Lecture 6: Modeling Sequences (final part)

Natural Language Processing, Lecture 6: Modeling Sequences (final part). Ivan Titov, Institute for Logic, Language and Computation.

Outline (after a recap):
- A few more words about unsupervised estimation of HMMs (forward-backward)
- More on discriminative estimation (CRFs / MEMMs)
- Recurrent neural networks / encoder-decoder
- Syntactic parsing (PCFGs)

Examples of structured prediction problems (each with an input x and a structured output y):
- Syntactic parsing
- Protein structure prediction
- Visual scene parsing
[Figure: example inputs x and structured outputs y for each task]

We cannot estimate a distinct set of parameters for each y; we need to understand (1) how to break y into parts, (2) how to predict these parts, and (3) how these parts interact with each other.

Structured Prediction
- Given a training dataset D = {x_i, y_i}, i = 1, ..., l, with x_i ∈ X, y_i ∈ Y; the output y is now a graph.
- Represent examples in some feature space: input-output pairs are mapped by f : X × Y → R^n.
- Assume the classification rule  ŷ = argmax_{y ∈ Y} w^T f(x, y),  with w ∈ R^n.
- Estimate the parameters by optimizing some objective on the training data. For example, the hinge loss (max-margin classification / SVMs):

    argmin_w (1/2) ||w||^2
    s.t.  w^T f(x_i, y_i) - max_{y ∈ Y \ {y_i}} w^T f(x_i, y) ≥ 1   for all i

  (the left-hand side contains the score of the "gold" structure and the highest score among the remaining structures).
- Classification is computationally challenging, as we search through the space Y.

Consider the sequence labeling example. The feature space:

    f(dogs:N can:M fly:V) = ( 0, 1, 0, 0, ..., 1, 1, ..., 0, 1, 1, ... )^T
    f(dogs:N can:N fly:N) = ( 0, 1, 1, 1, ..., 0, 0, ..., 2, 0, 0, ... )^T
    w                     = ( 5, 4, 2, 2, ..., 10, 5, ..., -1, 3, 3, ... )^T

The entries are counts of the corresponding fragments; the second vector gives the features of another candidate sequence. We want to find weights which score all the wrong sequences below the correct one:

    w^T f(dogs:N can:M fly:V) > w^T f(dogs:N can:N fly:N)

and likewise for all (M^3 - 1) other "wrong" sequences for this sentence (M being the number of tags).

Structured Perceptron
- Return to structured prediction: ŷ = argmax_{y ∈ Y} w^T f(x, y)
- Perceptron algorithm, given a training set D = {x_i, y_i}, i = 1, ..., l:

    w = 0                                     // initialize
    do
        err = 0
        for i = 1 to l                        // over the training examples
            ŷ = argmax_y w^T f(x_i, y)        // model prediction (runs Viterbi during training)
            if ( w^T f(x_i, ŷ) > w^T f(x_i, y_i) )     // if mistake
                w += f(x_i, y_i) - f(x_i, ŷ)           // update: push the correct sequence up
                err++                                  //         and the predicted one down
            endif
        endfor
    while ( err > 0 )                         // repeat until no errors
    return w
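To make the loop above concrete, here is a minimal, self-contained Python sketch on the toy tagging example from the previous slide. For brevity the argmax is a brute-force enumeration over tag sequences (standing in for Viterbi); the feature map `features` and the tag set `TAGS` are illustrative choices, not part of the slides.

```python
# A minimal sketch of the structured perceptron above on a toy tagging task.
# Brute-force enumeration stands in for Viterbi as the argmax.
from itertools import product
from collections import Counter

TAGS = ["N", "M", "V"]

def features(words, tags):
    """Counts of emission (tag, word) and transition (prev_tag, tag) fragments."""
    f = Counter()
    prev = "$"
    for w, t in zip(words, tags):
        f[("emit", t, w)] += 1
        f[("trans", prev, t)] += 1
        prev = t
    return f

def score(w, f):
    return sum(w.get(k, 0.0) * v for k, v in f.items())

def predict(w, words):
    return max(product(TAGS, repeat=len(words)),
               key=lambda tags: score(w, features(words, tags)))

def train(data, max_epochs=20):
    w = Counter()
    for _ in range(max_epochs):
        errors = 0
        for words, gold in data:
            guess = predict(w, words)
            if list(guess) != list(gold):          # mistake: push gold up, guess down
                for k, v in features(words, gold).items():
                    w[k] += v
                for k, v in features(words, guess).items():
                    w[k] -= v
                errors += 1
        if errors == 0:                            # repeat until no errors
            break
    return w

data = [(["dogs", "can", "fly"], ["N", "M", "V"]),
        (["birds", "can", "fly"], ["N", "M", "V"])]
w = train(data)
print(predict(w, ["dogs", "can", "fly"]))          # ('N', 'M', 'V')
```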

Averaged Structured Perceptron
- Return to structured prediction: ŷ = argmax_{y ∈ Y} w^T f(x, y)
- Perceptron algorithm, given a training set D = {x_i, y_i}, i = 1, ..., l:

    w = 0; k = 0                              // initialize
    do
        err = 0
        for i = 1 to l                        // over the training examples
            ŷ = argmax_y w^T f(x_i, y)        // model prediction
            if ( w^T f(x_i, ŷ) > w^T f(x_i, y_i) )     // if mistake
                w += f(x_i, y_i) - f(x_i, ŷ)           // update
                err++
            endif
            w_k = w; k++                      // store the current weight vector
        endfor
    while ( err > 0 )                         // in practice: do not run until convergence, just T iterations
    return (1/k) Σ_{i=1}^{k} w_i

- How can we compute the average (much) more memory-efficiently? (See the sketch below.)
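One common answer to the question above (a sketch, not from the slides): keep, alongside w, a "lag-weighted" accumulator u, so the average of all intermediate weight vectors can be recovered at the end without storing them. The `predict` and `features` helpers from the previous sketch are assumed.

```python
# Memory-efficient averaging: remember *when* each update happened instead of
# storing every intermediate weight vector.
from collections import Counter

def averaged_train(data, predict, features, max_epochs=10):
    w, u = Counter(), Counter()
    c = 0                                    # number of instances seen so far
    for _ in range(max_epochs):
        for words, gold in data:
            guess = predict(w, words)
            if list(guess) != list(gold):
                delta = Counter(features(words, gold))
                delta.subtract(features(words, guess))
                for k, v in delta.items():
                    w[k] += v
                    u[k] += c * v            # weight the update by its "lag"
            c += 1
    # average of the post-update weight vectors over all c instances:
    return {k: w[k] - u[k] / c for k in w}
```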

Generative vs. discriminative on smaller datasets
- For smaller training sets:
  - Theoretical result: generative classifiers converge faster to their optimal error [Ng & Jordan, NIPS 01]
  - Empirical: [Figure: error rates of a discriminative classifier vs. a generative model as a function of the number of training examples, on predicting housing prices in the Boston area]

Hidden Markov Models: Unsupervised Estimation
(Note the change in notation from y to s.)
- N is the number of tags, M the vocabulary size.
- Parameters (to be estimated from the training set):
  - Transition probabilities a_ji = P(s_t = i | s_{t-1} = j);  A is an [N x N] matrix
  - Emission probabilities b_ik = P(x_t = k | s_t = i);  B is an [N x M] matrix
- Training corpus:
    x^(1) = (In, an, Oct., 19, review, of, ...),  y^(1) = (IN, DT, NNP, CD, NN, IN, ...)
    x^(2) = (Ms., Haag, plays, Elianti, .),       y^(2) = (NNP, NNP, VBZ, NNP, .)
    ...
    x^(L) = (The, company, said, ...),            y^(L) = (DT, NN, VBD, NNP, ...)
  For notational convenience, assume that all the sentences have length n.
- How do we estimate the parameters using maximum likelihood estimation? You might have guessed what these estimates are (see the counting sketch below).
- How do we estimate the model in the unsupervised set-up?
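For the supervised case, the maximum-likelihood estimates are just normalized counts. A minimal sketch follows; the `$` start symbol and the dictionary layout are implementation choices, not prescribed by the slides.

```python
# Supervised MLE for the HMM parameters: normalized counts of transitions
# and emissions from a tagged corpus ("$" marks the sentence start).
from collections import Counter, defaultdict

def estimate_hmm(tagged_corpus):
    trans, emit = defaultdict(Counter), defaultdict(Counter)
    for words, tags in tagged_corpus:
        prev = "$"
        for w, t in zip(words, tags):
            trans[prev][t] += 1          # count(prev -> t)
            emit[t][w] += 1              # count(t emits w)
            prev = t
    A = {i: {j: c / sum(cs.values()) for j, c in cs.items()} for i, cs in trans.items()}
    B = {i: {w: c / sum(cs.values()) for w, c in cs.items()} for i, cs in emit.items()}
    return A, B

corpus = [(["Ms.", "Haag", "plays", "Elianti", "."],
           ["NNP", "NNP", "VBZ", "NNP", "."])]
A, B = estimate_hmm(corpus)
print(A["$"]["NNP"])   # 1.0: every sentence in this tiny corpus starts with NNP
```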

EM (intuition)

    γ_t^(l)(i) = P(s_t = i | x_1^(l), ..., x_n^(l))

[Figure: per-position state posteriors γ for three example sentences]

          John  loves  Mary      Mary  loves  books      John  hates  books
    N      0.7   0.2    0.6       0.5   0.1    0.8        0.9   0.4    0.5
    V      0.3   0.8    0.4       0.5   0.9    0.2        0.1   0.6    0.5

Expected emission counts and the resulting estimate:

    Ĉ_E(x, s) = Σ_{l=1}^{L} Σ_{t=1}^{n} γ_t^(l)(s) I(x_t^(l) = x)
    (I is an indicator function: equal to 1 if the condition is true, 0 otherwise)

    P(x | s) = Ĉ_E(x, s) / Σ_{x'} Ĉ_E(x', s)

For example:  P(x_t = books | s_t = N) = 1.3 / 4.7

EM (intuition)

    ξ_t^(l)(i, j) = P(s_t = i, s_{t+1} = j | x_1^(l), ..., x_n^(l))

[Figure: posteriors ξ over adjacent pairs of states (including the boundary symbol $) for the three example sentences]

    John loves Mary:   $-N: 0.8, $-V: 0.2;  V-V: 0.1, V-N: 0.1, N-N: 0.2, N-V: 0.6;  V-V: 0.2, V-N: 0.1, N-N: 0.3, N-V: 0.4;  N-$: 0.7, V-$: 0.3
    Mary loves books:  $-N: 0.7, $-V: 0.3;  V-V: 0.2, V-N: 0.2, N-N: 0.2, N-V: 0.4;  V-V: 0.1, V-N: 0.1, N-N: 0.1, N-V: 0.7;  N-$: 0.8, V-$: 0.2
    John hates books:  $-N: 0.6, $-V: 0.4;  V-V: 0.1, V-N: 0.2, N-N: 0.2, N-V: 0.5;  V-V: 0.1, V-N: 0.1, N-N: 0.3, N-V: 0.5;  N-$: 0.6, V-$: 0.4

Expected transition counts and the resulting estimate:

    Ĉ_T(i, j) = Σ_{l=1}^{L} Σ_t ξ_t^(l)(i, j)

    P(j | i) = Ĉ_T(i, j) / Σ_{j'} Ĉ_T(i, j')

For example:  a_ji = P(s_t = N | s_{t-1} = V) = 0.8 / 2.5

Disclaimer: the posterior distributions in this example may not satisfy natural consistency conditions; this is just an illustration.

Intuitive conclusion: we need to figure out how to compute
  (1) probabilistic predictions about states:           γ_t(i) = P(s_t = i | x_1, ..., x_n)
  (2) probabilistic predictions about pairs of states:  ξ_t(i, j) = P(s_t = i, s_{t+1} = j | x_1, ..., x_n)

Forward-backward probabilities (picture based on one from Tommi Jaakkola)
- Forward probabilities:   α_t(i) = P(x_1, ..., x_t, s_t = i)
- Backward probabilities:  β_t(i) = P(x_{t+1}, ..., x_n | s_t = i)
  We can think of β as evidence about the current state coming from the future observations.

Forward-backward probabilities
- Recursion for calculating the forward probabilities α_t(i) = P(x_1, ..., x_t, s_t = i):

    α_1(i) = P(i | $) P(x_1 | i)
    α_t(i) = ( Σ_j α_{t-1}(j) P(i | j) ) P(x_t | i)

Forward-backward probabilities
- Analogously, the recursion for calculating the backward probabilities β_t(i) = P(x_{t+1}, ..., x_n | s_t = i):

    β_n(i) = 1          (we assume here that the n-th symbol is $, or </s>)
    β_t(i) = Σ_j P(j | i) P(x_{t+1} | j) β_{t+1}(j)
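A minimal numpy sketch of the two recursions above. The layout A[j, i] = P(i | j), the emissions B[i, k] = P(x = k | s = i), the start distribution pi, and the toy N/V numbers are illustrative assumptions (the end-of-sentence symbol is omitted, so β_n(i) = 1 as on the slide).

```python
# Forward and backward recursions for an HMM (minimal sketch).
import numpy as np

def forward(pi, A, B, x):
    n, N = len(x), len(pi)
    alpha = np.zeros((n, N))
    alpha[0] = pi * B[:, x[0]]                         # alpha_1(i) = P(i|$) P(x_1|i)
    for t in range(1, n):
        alpha[t] = (alpha[t - 1] @ A) * B[:, x[t]]     # sum over previous states j
    return alpha

def backward(A, B, x, N):
    n = len(x)
    beta = np.zeros((n, N))
    beta[-1] = 1.0                                     # beta_n(i) = 1
    for t in range(n - 2, -1, -1):
        beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])   # sum over next states j
    return beta

# Toy N/V example in the spirit of the earlier slides (numbers are made up):
pi = np.array([0.8, 0.2])                              # P(N|$), P(V|$)
A  = np.array([[0.3, 0.7],                             # from N: P(N|N), P(V|N)
               [0.8, 0.2]])                            # from V: P(N|V), P(V|V)
B  = np.array([[0.4, 0.1, 0.5],                        # N emits John, loves, Mary
               [0.1, 0.8, 0.1]])                       # V emits John, loves, Mary
x  = [0, 1, 2]                                         # "John loves Mary"
alpha, beta = forward(pi, A, B, x), backward(A, B, x, len(pi))
print(alpha[-1].sum(), (alpha[0] * beta[0]).sum())     # both equal P(x_1, ..., x_n)
```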

Forward-backward probabilities
- The forward and backward probabilities α_t(i) = P(x_1, ..., x_t, s_t = i) and β_t(i) = P(x_{t+1}, ..., x_n | s_t = i) are complementary and permit us to evaluate various probabilities:
  1. P(x_1, x_2, ..., x_n)
  2. γ_t(i) = P(s_t = i | x_1, ..., x_n)
  3. ξ_t(i, j) = P(s_t = i, s_{t+1} = j | x_1, ..., x_n)
  (The last two are the ones we need for EM.)

Forward-backward probabilities
- The probability of an observed sequence:

    P(x_1, ..., x_n) = Σ_i P(x_1, ..., x_n, s_t = i)
                     = Σ_i P(x_1, ..., x_t, s_t = i) P(x_{t+1}, ..., x_n | s_t = i)
                     = Σ_i α_t(i) β_t(i)        (for any t)

Forward-backward probabilities
- The posterior probability that the HMM was in state i at time t:

    γ_t(i) = P(s_t = i | x_1, ..., x_n) = P(x_1, ..., x_n, s_t = i) / P(x_1, ..., x_n)
           = α_t(i) β_t(i) / Σ_j α_t(j) β_t(j)

Forward-backward probabilities
- The posterior probability of being in states i and j at times t and t+1, respectively:

    ξ_t(i, j) = P(s_t = i, s_{t+1} = j | x_1, ..., x_n)
              = α_t(i) P(s_{t+1} = j | s_t = i) P(x_{t+1} | s_{t+1} = j) β_{t+1}(j) / Σ_{j'} α_t(j') β_t(j')

Putting everything together (see also chapter 6.5 of J&M)
- We needed to figure out how to compute
  (1) probabilistic predictions about states:           γ_t(i) = P(s_t = i | x_1, ..., x_n)
  (2) probabilistic predictions about pairs of states:  ξ_t(i, j) = P(s_t = i, s_{t+1} = j | x_1, ..., x_n)
  Now we know how: via the forward and backward probabilities.
- Now we have all the components in place for EM:

    Ĉ_T(i, j) = Σ_{l=1}^{L} Σ_{t=1}^{n-1} ξ_t^(l)(i, j)                =>  P(j | i) = Ĉ_T(i, j) / Σ_{j'} Ĉ_T(i, j')
    Ĉ_E(x, s) = Σ_{l=1}^{L} Σ_{t=1}^{n} γ_t^(l)(s) I(x_t^(l) = x)      =>  P(x | s) = Ĉ_E(x, s) / Σ_{x'} Ĉ_E(x', s)
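Continuing the previous sketch, the posteriors γ and ξ and the count-based re-estimation of the transition matrix follow directly from the formulas above. This is shown for a single sequence; summing the expected counts over sentences gives the full M-step.

```python
# Posteriors gamma_t(i) and xi_t(i, j), and the count-based M-step for the
# transitions (one EM step over one sequence, using alpha/beta from above).
import numpy as np

def posteriors(alpha, beta, A, B, x):
    Z = alpha[-1].sum()                                # P(x_1, ..., x_n)
    gamma = alpha * beta / Z                           # gamma[t, i] = P(s_t = i | x)
    n, N = alpha.shape
    xi = np.zeros((n - 1, N, N))
    for t in range(n - 1):
        # xi[t, i, j] = alpha_t(i) P(j|i) P(x_{t+1}|j) beta_{t+1}(j) / P(x)
        xi[t] = alpha[t][:, None] * A * (B[:, x[t + 1]] * beta[t + 1])[None, :] / Z
    return gamma, xi

def reestimate_transitions(xi):
    counts = xi.sum(axis=0)                            # expected count of i -> j
    return counts / counts.sum(axis=1, keepdims=True)  # renormalize each row: P(j | i)
```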

Viterbi / forward-backward / Baum-Welch
- Have you noticed a relation between the forward algorithm and Viterbi?
- Roughly speaking, Viterbi selects the best previous state, whereas the forward algorithm sums over all possibilities:

    Forward:  α_t(i) = P(x_1, ..., x_t, s_t = i) = Σ_{s_1, ..., s_{t-1}} P(x_1, ..., x_t, s_1, ..., s_{t-1}, s_t = i)
    Viterbi:  v_t(i) = max_{s_1, ..., s_{t-1}} P(x_1, ..., x_t, s_1, ..., s_{t-1}, s_t = i)

    Forward:  α_t(i) = P(x_t | i) Σ_j α_{t-1}(j) P(i | j)        (a sum-product algorithm)
    Viterbi:  v_t(i) = P(x_t | i) max_j v_{t-1}(j) P(i | j)       (a max-product algorithm)
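The relation can be made concrete in code: Viterbi is the `forward` function from the earlier sketch with the sum over previous states replaced by a max, plus backpointers to read off the best sequence (same assumed A, B, pi conventions as before).

```python
# Same recursion as `forward` above, with max instead of sum and backpointers.
import numpy as np

def viterbi(pi, A, B, x):
    n, N = len(x), len(pi)
    v = np.zeros((n, N))
    back = np.zeros((n, N), dtype=int)
    v[0] = pi * B[:, x[0]]
    for t in range(1, n):
        scores = v[t - 1][:, None] * A                 # scores[j, i] = v_{t-1}(j) P(i|j)
        back[t] = scores.argmax(axis=0)                # best previous state j for each i
        v[t] = scores.max(axis=0) * B[:, x[t]]         # max instead of sum
    best = [int(v[-1].argmax())]
    for t in range(n - 1, 0, -1):                      # follow the backpointers
        best.append(int(back[t, best[-1]]))
    return list(reversed(best)), v[-1].max()
```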

Viterbi / forward-backward / Baum-Welch
- The forward-backward computation for HMMs combines forward probabilities (beliefs) and backward probabilities (beliefs).
- It can be generalized to more general graphs ("belief propagation").

Summary so far
- Supervised estimation: generative (ML) and feature-rich discriminative modeling (structured perceptron, conditional random fields, MEMMs)
- Unsupervised estimation: generative (EM). Recently, there has also been work on estimating feature-rich models in the unsupervised setting.

Outline (after a recap):
- A few more words about unsupervised estimation of HMMs (forward-backward)
- More on discriminative estimation (CRFs / MEMMs)
- Recurrent neural networks / encoder-decoder
- Syntactic parsing (PCFGs)

Discriminative estimation
- Generative models not only consider the labeling but also score how likely an input sequence x_1, ..., x_n is.
- Discriminative models instead concentrate only on modeling how likely an output sequence y_1, ..., y_n is for a given input x_1, ..., x_n.
(We switch back to using y instead of s.)

Discriminative estimation
- We have learnt one discriminative method: the structured perceptron.
- But what if we want to get probabilities P(y | x)?

Recap: logistic regression
- Logistic ("softmax") regression, aka the max-entropy classifier:

    P(y | x, w) = exp(w^T f(x, y)) / Z(x)

  where Z(x) is the partition function:  Z(x) = Σ_{y' ∈ Y} exp(w^T f(x, y'))
  (Notation abuse: we will often drop w from the conditioning.)
- "Conditional" log-likelihood:

    L(w) = Σ_{l=1}^{L} log P(y^(l) | x^(l), w) - λ ||w||_2^2

  (We will ignore the regularizer in the rest of the discussion.)

Recap: stochastic gradient descent
- We compute the gradient ∇L(w) of the conditional log-likelihood.
- We can update our parameter vector based on the gradient:

    w := w + η ∇L(w)        (η is the learning rate)

- In practice, slightly "smarter" gradient methods are normally used.

Recap: gradient for the multi-class case
- Let's derive the gradient:

    ∂L(w)/∂w_i = ∂/∂w_i [ Σ_{l=1}^{L} log P(y^(l) | x^(l), w) ]
               = Σ_{l=1}^{L} ∂/∂w_i [ w^T f(x^(l), y^(l)) - log Σ_{y' ∈ Y} exp(w^T f(x^(l), y')) ]
               = Σ_{l=1}^{L} ( f_i(x^(l), y^(l)) - Σ_ỹ f_i(x^(l), ỹ) exp(w^T f(x^(l), ỹ)) / Σ_{y' ∈ Y} exp(w^T f(x^(l), y')) )
               = Σ_{l=1}^{L} ( f_i(x^(l), y^(l)) - Σ_ỹ f_i(x^(l), ỹ) P(ỹ | x^(l), w) )

- Intuition: we are trying to find a model whose feature expectations, computed according to the model, are similar to their estimates from the data.
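A small sketch of this gradient for the multi-class case, with class-specific feature blocks so that W is a (K, d) matrix and f(x, y) stacks x into the block of class y. The shapes and the toy data are assumptions for illustration.

```python
# Gradient of the multi-class softmax log-likelihood:
# empirical feature counts minus model-expected feature counts.
import numpy as np

def softmax(z):
    z = z - z.max()                        # numerical stability
    e = np.exp(z)
    return e / e.sum()

def loglik_grad(W, X, y, num_classes):
    """W: (K, d) weights, X: (L, d) inputs, y: (L,) gold labels."""
    grad = np.zeros_like(W)
    for x_l, y_l in zip(X, y):
        p = softmax(W @ x_l)               # P(y' | x, w) for all classes y'
        emp = np.zeros(num_classes)
        emp[y_l] = 1.0
        grad += np.outer(emp - p, x_l)     # f(x, y_gold) - E_{y' ~ P}[f(x, y')]
    return grad

# One gradient-ascent step over a toy batch, as in the update rule above:
rng = np.random.default_rng(0)
X, y = rng.normal(size=(5, 3)), np.array([0, 1, 2, 1, 0])
W = np.zeros((3, 3))
W += 0.1 * loglik_grad(W, X, y, num_classes=3)
```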

Structured case
- The softmax classifier:  P(y | x, w) = exp(w^T f(x, y)) / Z(x)
- How do we generalize it to sequences?

Idea 1: Max-Ent Markov Models (MEMMs)

[Figure: a chain y_1 -> y_2 -> y_3 -> ... -> y_n, with each y_t also conditioned on the input x_1, ..., x_n]

    P(y | x, w) = P(y_1 | x, w) Π_{t=2}^{n} P(y_t | y_{t-1}, x, w)

    P(y_t | y_{t-1}, x, w) = exp(w^T f(y_{t-1}, y_t, x, t)) / Z(y_{t-1}, x, t)

    (Notation abuse: we will drop t where it is clear from context.)

- Essentially, the prediction of each next label is just a classification decision.
- What are the training examples? (See the sketch below.)
- How do we search?
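As a partial answer to "what are the training examples?", here is a sketch of how each position becomes one local classification instance; the feature dictionary is a hypothetical choice.

```python
# Each position t yields one (features, label) pair, conditioned on the *gold*
# previous label at training time (which is exactly what Problem 1 below is about).
def memm_training_examples(words, tags):
    examples = []
    prev = "<s>"
    for t, (w, tag) in enumerate(zip(words, tags)):
        context = {"word": w, "prev_tag": prev, "position": t}
        examples.append((context, tag))    # one local classification instance
        prev = tag                         # gold previous label during training
    return examples

print(memm_training_examples(["dogs", "can", "fly"], ["N", "M", "V"]))
```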

MEMMs: Problem 1

    P(y_t | y_{t-1}, x, w) = exp(w^T f(y_{t-1}, y_t, x, t)) / Z(y_{t-1}, x, t)

- Since the model is trained using the correct (gold) label as y_{t-1}, it ends up over-relying on it.
- At test time, y_{t-1} (unlike x) is predicted, and often incorrect.
- Formally, the distribution of features at training and test time is different, and the examples are interdependent (the i.i.d. assumption standard in machine learning is broken).
- This is not only a theoretical problem.

MEMMs: Problem 1
- Consider a sequence labeling problem {0,1}^3 -> {0,1}^3:
  - input vectors are uniformly distributed;
  - if the input is (1,1,1), the output is also (1,1,1);
  - otherwise, the output is (0,0,0).
  (Let's forget for now that the probabilities are computed with a softmax.)
- The MEMM factorizes P(y | x, w) = P(y_1 | x, w) Π_{t=2}^{n} P(y_t | y_{t-1}, x, w), and the locally trained model learns:

    P(y_1 = 1 | x_1 = 1) = 1/4,     P(y_1 = 0 | x_1 = 1) = 3/4
    P(y_2 = 0 | y_1 = 0, x_2) = 1 for all x_2        (ignores the input!)
    P(y_2 = 1 | y_1 = 1, x_2 = 1) = 1

  so  P((1,1,1) | (1,1,1)) = 1/4  and  P((0,0,0) | (1,1,1)) = 3/4  (see the check below).
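The numbers above can be checked by enumerating the eight inputs (a small sketch, assuming the local classifiers are fit by simple relative-frequency counting).

```python
# Checking the slide's numbers by enumeration: inputs are uniform over {0,1}^3,
# the output is (1,1,1) only for input (1,1,1), otherwise (0,0,0).
from itertools import product

data = [(x, (1, 1, 1) if x == (1, 1, 1) else (0, 0, 0))
        for x in product([0, 1], repeat=3)]

with_x1_1 = [y for x, y in data if x[0] == 1]
print(sum(y[0] == 1 for y in with_x1_1) / len(with_x1_1))   # P(y1=1 | x1=1) = 1/4
print(sum(y[0] == 0 for y in with_x1_1) / len(with_x1_1))   # P(y1=0 | x1=1) = 3/4

# Given y1 = 1, every case has x = (1,1,1) and y2 = 1, so the local classifier
# learns P(y2=1 | y1=1, x2=1) = 1; given y1 = 0 it learns P(y2=0 | y1=0, x2) = 1
# regardless of x2 - the input is ignored, exactly as the slide states.
```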

MEMMs: Problem 1 (continued)
- One can say: this all happened in the example above because we factorized the model badly (i.e., the features are not appropriate). Yes, but:
  1. We do not know if we will do better with (complex) real problems.
  2. The structured perceptron (and a CRF) would learn a perfect classifier with this factorization.

MEMMs: Problem 2
- The "label bias" problem. Suppose our set of labels is small (say, 2) and we perform Viterbi decoding.
- At time t-1: Hypothesis 1 has p = 0.001, Hypothesis 2 has p = 0.0001.
- At time t an input arrives that is completely inconsistent with Hypothesis 1 (this happens often).
- Can the model (in principle) ensure that the red state, the continuation of Hypothesis 1, is not included in the winning path?
[Figure: a two-state trellis over times t-2, t-1, t]

MEMMs: Problem 2
[Figure: the same trellis; in response to the inconsistent input, Hypothesis 1 routes its transition probability (1 vs. 0) to the other label, while Hypothesis 2 keeps probability 1 on its own continuation]
- The resulting path scores at time t are p = 0.001 (via Hypothesis 1) and p = 0.0001 (via Hypothesis 2).
- The path extending Hypothesis 1 is still the winner: the model can only redirect its probability mass, not reduce it.

MEMMs: Problem 2
[Figure: the same trellis; now Hypothesis 1 splits its transition probability 0.5 / 0.5, while Hypothesis 2 keeps probability 1]
- The continuations of Hypothesis 1 each score 0.001 x 0.5 = 0.0005, versus 0.0001 for Hypothesis 2; Hypothesis 1 still wins.
- So, can the model (in principle) ensure that the red state is not included in the winning path? No, because the probability mass must be preserved: the outgoing probabilities sum to 1.
- Why is this not a serious problem for HMMs?

MEMMs: Conclusions
- MEMMs are easy to estimate and use; in practice they are very often used in NLP (we will see them again in the context of parsing).
- But they have serious problems and are, consequently, brittle.
- These problems are not unique to MEMMs; they affect any "piecewise" estimation approach (our neural / deep models as well).
- Problems very similar to "Problem 1" come up when "pipelines" are used, for example: text -> PoS tagging -> syntactic parsing -> semantic analysis -> dialog state prediction -> ...
- Errors propagate across stages, and we should be careful when training models for the individual stages.

Idea 2: conditional random fields (CRFs)

    P(y | x, w) = exp( w^T Σ_{t=1}^{n} f(y_{t-1}, y_t, x) ) / Z(x)

  The partition function sums over the set of all potential labelings of the entire sentence (exponential size):

    Z(x) = Σ_{y' ∈ Y^n} exp( w^T Σ_{t=1}^{n} f(y'_{t-1}, y'_t, x) )

  (To simplify notation, we are not careful about START and STOP symbols here.)

[Figure: the factor graph over y_1, y_2, y_3, ..., y_n, with factors w^T f(y_1, y_2, x), w^T f(y_2, y_3, x), ...]

- We score the entire sequence in one shot.
- What are the training examples?
- How do we search?

CRF (Lafferty, McCallum and Pereira, 2001)

Idea 2: chain conditional random field (CRF)

    P(y | x, w) = exp( w^T Σ_{t=1}^{n} f(y_{t-1}, y_t, x) ) / Z(x)

- We score the entire sequence in one shot.
- Do we have the i.i.d. problem (Problem 1)?
- Do we have the label bias problem (Problem 2)?

How do we estimate the model?

    P(y | x, w) = exp( w^T Σ_{t=1}^{n} f(y_{t-1}, y_t, x) ) / Z(x),     Z(x) = Σ_{y' ∈ Y^n} exp( w^T Σ_{t=1}^{n} f(y'_{t-1}, y'_t, x) )

- Do we have any hope of computing the gradient?

    ∇L(w) = ∇ Σ_{l=1}^{L} [ w^T Σ_{t=1}^{n} f(y^(l)_{t-1}, y^(l)_t, x^(l)) - log Z(x^(l), w) ]
          = Σ_{l=1}^{L} [ Σ_{t=1}^{n} f(y^(l)_{t-1}, y^(l)_t, x^(l)) - Σ_{t=1}^{n} Σ_{y, y' ∈ Y} P(y_{t-1} = y, y_t = y' | x^(l), w) f(y_{t-1} = y, y_t = y', x^(l)) ]

- Again: matching data and model feature expectations.
- How do we compute the probabilities P(y_{t-1} = y, y_t = y' | x^(l), w)? (See the sketch below.)
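They can be computed with a forward-backward pass over the chain, exactly as for HMMs but with unnormalized potentials. A sketch follows; log-space computation, START/STOP handling, and batching are omitted, and `psi_start` and `psi` are assumed to be precomputed potential tables.

```python
# With potentials psi[t][a, b] = exp(w^T f(y_t = a, y_{t+1} = b, x)), a
# forward-backward pass gives Z(x) and the pairwise marginals needed above.
import numpy as np

def crf_pairwise_marginals(psi_start, psi):
    """psi_start: (K,) potentials for the first label; psi: list of (K, K) matrices."""
    K, n = len(psi_start), len(psi) + 1
    alpha = np.zeros((n, K))
    beta = np.ones((n, K))
    alpha[0] = psi_start
    for t in range(1, n):
        alpha[t] = alpha[t - 1] @ psi[t - 1]          # sum over previous labels
    for t in range(n - 2, -1, -1):
        beta[t] = psi[t] @ beta[t + 1]                # sum over following labels
    Z = alpha[-1].sum()
    marginals = [np.outer(alpha[t], beta[t + 1]) * psi[t] / Z for t in range(n - 1)]
    return Z, marginals                               # plug into the expectation term
```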

Summary: linear models for (supervised) sequence labeling

  HMMs
    + Very easy to estimate; easy to generalize to un- and semi-supervised learning
    - Low asymptotic performance
  Structured perceptron
    + Simple to implement
    - Does not yield probabilities; does not optimize an objective (kind of)
  MEMMs
    + Fast to estimate (no decoding at training time)
    - Can be brittle (recall the problems mentioned)
  CRFs
    + Yield probabilities, are motivated by a clear objective, stable performance
    - Harder to implement (forward-backward), expensive training (especially beyond first-order Markov models)

Outline (after a recap):
- A few more words about unsupervised estimation of HMMs (forward-backward)
- More on discriminative estimation (CRFs / MEMMs)
- Recurrent neural networks / encoder-decoder
- Syntactic parsing (PCFGs)

Recurrent neural networks (RNNs)
[Figure: an RNN unrolled over x_1, ..., x_n, predicting y_1, ..., y_n from the hidden states]
- Lots of similarities to MEMMs.
- What is nice about them? They can perform fairly well with minimal feature engineering.
- What is problematic about them (vs. MEMMs)? "Vanishing gradients": gradient information is not propagated from the observations y_t to states far away in the past. The vanishing gradient problem is mitigated by Long Short-Term Memory networks (LSTMs), but it remains a major issue.
- Decoding?
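A minimal sketch of the forward pass of a simple (Elman) RNN tagger; the weight shapes and the absence of any training loop are simplifications for illustration.

```python
# Simple RNN tagger forward pass: the hidden state carries information left to
# right, and each label y_t is predicted from the hidden state h_t.
import numpy as np

def rnn_tag_scores(x_embeddings, W_xh, W_hh, W_hy, b_h, b_y):
    h = np.zeros(W_hh.shape[0])
    scores = []
    for x_t in x_embeddings:                       # x_1, ..., x_n
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)   # new state from input and old state
        scores.append(W_hy @ h + b_y)              # unnormalized label scores for y_t
    return scores                                  # softmax of each gives P(y_t | x_1..t)

# Backpropagation through the repeated W_hh multiplications is what makes the
# gradients vanish (or explode) over long spans, as noted on the slide.
```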

Encoder-Decoder
[Figure: an encoder RNN reading x_1, ..., x_n and a decoder RNN producing y_1, ..., y_n]
- Sounds like a crazy idea: compressing the entire sentence down to a single vector.
- More general than sequence labeling (why?).
- Decoding is approximate.
- Brittle (but some "better" ideas are around: attention models).

NN models vs. linear models

Linear models:
  + Often, exact decoding is possible
  + Some degree of interpretability
  + Easier to encode prior knowledge (?)
  + Convex optimization (for supervised learning)
  - Feature engineering is crucial
  - They can be expensive (!) in practice

Representation learning models (incl. deep / neural):
  + Feature induction
  + Parameter sharing across multiple tasks / features (!!)
  + (Related) modeling the compositionality of language
  + Can be efficient (e.g., on GPUs)
  - Non-convex optimization
  - Not very interpretable
  - Exact decoding is not possible
  - They can be more expensive (for some tasks)

NLP problems, structures, and modeling frameworks (overview)

  NLP problems: document classification, topic analysis, shallow syntactic parsing / tagging, syntactic parsing, relation extraction, semantic parsing, models of inference, machine translation, question answering, opinion analysis, summarization, dialogue systems, ...
  Types of structures: bags, sequences / chains, spanning trees, hierarchical trees / DAGs, bipartite graphs, ...
  Models / views: Naive Bayes, topic models, HMMs, PCFGs, DOP, history- / transition-based models, global scoring (e.g., MST), "IBM" models, ...
  Set-ups: supervised, unsupervised, partially / semi-supervised
  Modeling frameworks: generative ML, generative Bayes, discriminative, discriminative Bayes, representation learning (factorizations / NNs)

  Many problems in NLP can be cast as, or approximated with, sequence models.
  What about dialog systems (e.g., Siri)?