Lecture 13: Structured Prediction Kai-Wei Chang CS @ University of Virginia kw@kwchang.net Course webpage: http://kwchang.net/teaching/nlp16 CS6501: NLP 1
Quiz 2 v Lectures 9-13 v Lecture 12: before page 44 v Lecture 13: before page 33 v Key points: v HMM model v Three basic problems v Sequential tagging CS6501: NLP 2
Three basic problems for HMMs v Likelihood of the input: how likely does the sentence "I love cat" occur? v Forward algorithm v Decoding (tagging) the input: what are the POS tags of "I love cat"? v Viterbi algorithm v Estimation (learning): how do we learn the model? v Find the best model parameters v Case 1: supervised -- tags are annotated v Maximum likelihood estimation (MLE) v Case 2: unsupervised -- only unannotated text v Forward-backward algorithm CS6501: NLP 3
Supervised Learning Setting v Assume we have annotated examples Tag set: DT, JJ, NN, VBD, ... POS Tagger The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./. CS6501: NLP 4
Sequence tagging problems v Many problems in NLP (ML) have data with tag sequences v Brainstorm: name other sequential tagging problems CS6501: NLP 5
OCR example CS6501: NLP 6
Noun phrase (NP) chunking v Task: identify all non-recursive NP chunks CS6501: NLP 7
The BIO encoding v Define three new tags v B-NP: beginning of a noun phrase chunk v I-NP: inside of a noun phrase chunk v O: outside of a noun phrase chunk POS Tagging with a restricted Tagset? CS6501: NLP 8
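Below is a minimal sketch (not from the slides) of how gold NP chunk spans could be converted into BIO tags; the helper name, the sentence, and the chunk spans are made-up illustrations.

```python
def chunks_to_bio(tokens, chunks):
    """Convert non-recursive NP chunk spans into BIO tags.
    tokens: list of words; chunks: list of (start, end) pairs, end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end in chunks:
        tags[start] = "B-NP"                 # first token of the chunk
        for i in range(start + 1, end):
            tags[i] = "I-NP"                 # remaining tokens inside the chunk
    return tags

tokens = ["The", "grand", "jury", "commented", "on", "a", "number",
          "of", "other", "topics", "."]
# Hypothetical NP chunks: "The grand jury", "a number", "other topics"
print(chunks_to_bio(tokens, [(0, 3), (5, 7), (8, 10)]))
# ['B-NP', 'I-NP', 'I-NP', 'O', 'O', 'B-NP', 'I-NP', 'O', 'B-NP', 'I-NP', 'O']
```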
Shallow parsing v Task: identify all non-recursive NP, verb (VP) and preposition (PP) chunks CS6501: NLP 9
BIO Encoding for Shallow Parsing v Define new tags v B-NP B-VP B-PP: beginning of an NP, VP, PP chunk v I-NP I-VP I-PP: inside of an NP, VP, PP chunk v O: outside of any chunk POS Tagging with a restricted Tagset? CS6501: NLP 10
Named Entity Recognition v Task: identify all mentions of named entities (people, organizations, locations, dates) CS6501: NLP 11
BIO Encoding for NER v Define many new tags v B-PERS, B-DATE, ...: beginning of a mention of a person/date/... v I-PERS, I-DATE, ...: inside of a mention of a person/date/... v O: outside of any mention of a named entity CS6501: NLP 12
Sequence tagging v Many NLP tasks are sequence tagging tasks v Input: a sequence of tokens/words v Output: a sequence of corresponding labels v E.g., POS tags, BIO encoding for NER v Solution: finding the most probable label sequence for the given word sequence v $t = \arg\max_t P(t \mid w)$ CS6501: NLP 13
Sequential tagging vs. independent prediction v Sequence labeling: $t = \arg\max_t P(t \mid w)$, where t is a vector/matrix of tags v Independent classifier: $y = \arg\max_y P(y \mid x)$, where y is a single label (Figure: a linear-chain model connecting neighboring tags $t_i, t_j$ to their words $w_i, w_j$, vs. isolated pairs $y_i, x_i$ and $y_j, x_j$) CS6501: NLP 14
Sequential tagging vs. independent prediction v Sequence labeling: $t = \arg\max_t P(t \mid w)$ v t is a vector/matrix v Dependencies both between (t, w) and between $(t_i, t_{i-1})$ v Structured output v Difficult to solve the inference problem v Independent classifiers: $y = \arg\max_y P(y \mid x)$ v y is a single label v Dependency only within (y, x) v Independent output v Easy to solve the inference problem CS6501: NLP 15
Recap: Viterbi Decoding Induction: $\delta_i(q) = P(w_i \mid t_i = q) \max_{q'} \delta_{i-1}(q') P(t_i = q \mid t_{i-1} = q')$ CS6501 Natural Language Processing 16
Recap: Viterbi algorithm v Store the best tag sequence for $w_1 \ldots w_i$ that ends in $t_j$ in T[j][i] v $T[j][i] = \max P(w_1 \ldots w_i, t_1, \ldots, t_i = t_j)$ v Recursively compute T[j][i] from the entries in the previous column T[j][i-1] v $T[j][i] = P(w_i \mid t_j) \max_k T[k][i-1] P(t_j \mid t_k)$ v $P(w_i \mid t_j)$: generating the current observation; $T[k][i-1]$: the best tag sequence ending in tag $t_k$ at position i-1; $P(t_j \mid t_k)$: transition from the previous best ending tag CS6501: NLP 17
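The T[j][i] recurrence above translates directly into code. A minimal sketch, assuming the HMM is given as dictionaries of log-probabilities (the names log_init, log_trans, log_emit are illustrative, and smoothing/unknown words are ignored):

```python
import math

def viterbi(words, tags, log_emit, log_trans, log_init):
    """Viterbi decoding: T[j][i] = best log-score of a tag sequence
    for w_1..w_i that ends in tag j (log space avoids underflow)."""
    n = len(words)
    T = [[-math.inf] * n for _ in tags]        # scores
    back = [[0] * n for _ in tags]             # backpointers
    for j, tag in enumerate(tags):
        T[j][0] = log_init[tag] + log_emit[tag][words[0]]
    for i in range(1, n):
        for j, tag in enumerate(tags):
            best_k = max(range(len(tags)),
                         key=lambda k: T[k][i - 1] + log_trans[tags[k]][tag])
            T[j][i] = (log_emit[tag][words[i]]
                       + T[best_k][i - 1] + log_trans[tags[best_k]][tag])
            back[j][i] = best_k
    # Follow backpointers from the best final tag
    j = max(range(len(tags)), key=lambda k: T[k][n - 1])
    path = [j]
    for i in range(n - 1, 0, -1):
        j = back[j][i]
        path.append(j)
    return [tags[j] for j in reversed(path)]
```

back[j][i] records which previous tag achieved the maximum, so the best path can be recovered after the table is filled.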
Two modeling perspectives v Generative models v Model the joint probability of labels and words v $t = \arg\max_t P(t \mid w) = \arg\max_t \frac{P(t, w)}{P(w)} = \arg\max_t P(t, w)$ v Discriminative models v Directly model the conditional probability of labels given the words v $t = \arg\max_t P(t \mid w)$, often modeled by a softmax function CS6501: NLP 18
Generative vs. discriminative models v Binary classification as an example (Figures: the generative model's view and the discriminative model's view of the data) CS6501: NLP 19
Generative vs. discriminative models v Generative: models the joint distribution v Full probabilistic specification for all the random variables v Dependence assumptions have to be specified for $P(w \mid t)$ and $P(t)$ v Can be used in unsupervised learning v Discriminative: models the conditional distribution v Only explains the target variable v Arbitrary features can be incorporated for modeling $P(t \mid w)$ v Needs labeled data, suitable for (semi-)supervised learning CS6501: NLP 20
Independent Classifiers v $P(t \mid w) = \prod_i P(t_i \mid w_i)$ v ~95% accuracy (token-wise) (Figure: each tag $t_i$ depends only on its own word $w_i$; $t_1 \ldots t_4$ over $w_1 \ldots w_4$) CS6501: NLP 21
Maximum entropy Markov models v MEMMs are discriminative models of the labels t given the observed input sequence w v $P(t \mid w) = \prod_i P(t_i \mid w_i, t_{i-1})$ CS6501: NLP 22
Design features v Emission-like features v Binary feature functions v f_first-letter-capitalized-NNP(China) = 1 (for China/NNP) v f_first-letter-capitalized-VB(know) = 0 (for know/VB) v Integer (or real-valued) feature functions v f_number-of-vowels-NNP(China) = 2 v Transition-like features v Binary feature functions v f_first-letter-capitalized-VB-NNP(China) = 1 v Note: the features are not necessarily independent! CS6501: NLP 23
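A sketch of what the feature functions above might look like in code; the function names mirror the slide, while the implementation details are assumptions:

```python
def f_first_letter_capitalized_nnp(tag, word):
    # Fires when the word is capitalized and the proposed tag is NNP
    return 1 if word[0].isupper() and tag == "NNP" else 0

def f_number_of_vowels_nnp(tag, word):
    # Integer-valued feature: vowel count, only active for tag NNP
    return sum(c in "aeiou" for c in word.lower()) if tag == "NNP" else 0

def f_first_letter_capitalized_vb_nnp(tag, prev_tag, word):
    # Transition-like feature: capitalized word tagged NNP right after a VB
    return 1 if word[0].isupper() and tag == "NNP" and prev_tag == "VB" else 0

print(f_first_letter_capitalized_nnp("NNP", "China"))            # 1
print(f_number_of_vowels_nnp("NNP", "China"))                     # 2
print(f_first_letter_capitalized_vb_nnp("NNP", "VB", "China"))    # 1
```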
Parameterization of $P(t_i \mid w_i, t_{i-1})$ v Associate a real-valued weight $\lambda_k$ with each specific type of feature function, e.g., $\lambda_k$ for f_first-letter-capitalized-NNP(w) v Define a scoring function $f(t_i, t_{i-1}, w_i) = \sum_k \lambda_k f_k(t_i, t_{i-1}, w_i)$ v Naturally, $P(t_i \mid w_i, t_{i-1}) \propto \exp f(t_i, t_{i-1}, w_i)$ v Recall the basic requirements of a probability distribution: v $P(x) \geq 0$ v $\sum_x P(x) = 1$ CS6501: NLP 24
Parameterization of MEMMs v $P(t \mid w) = \prod_i P(t_i \mid w_i, t_{i-1}) = \prod_i \frac{\exp f(t_i, t_{i-1}, w_i)}{\sum_{t'} \exp f(t', t_{i-1}, w_i)}$ v It is a log-linear model: $\log P(t \mid w) = \sum_i f(t_i, t_{i-1}, w_i) - C(\lambda)$, where $C(\lambda)$ collects the normalization terms ($\lambda$: parameters) v The Viterbi algorithm can be used to decode the most probable label sequence based on the local scores $f(t_i, t_{i-1}, w_i)$ CS6501: NLP 25
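A minimal sketch of the local normalization, assuming score(t, prev_tag, word) returns the weighted sum $\sum_k \lambda_k f_k(t, t_{i-1}, w_i)$ defined on the previous slides:

```python
import math

def local_prob(tag, prev_tag, word, tagset, score):
    """MEMM local distribution:
    P(t_i | w_i, t_{i-1}) = exp f(t_i, t_{i-1}, w_i) / sum_{t'} exp f(t', t_{i-1}, w_i)."""
    z = sum(math.exp(score(t, prev_tag, word)) for t in tagset)
    return math.exp(score(tag, prev_tag, word)) / z
```

Each position is normalized on its own, which is exactly what "locally normalized" means; the label bias discussion a few slides below follows from this per-position normalization.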
Parameter estimation (Intuition) v The maximum likelihood estimator can be used in a similar way as in HMMs: $\lambda^* = \arg\max_\lambda \sum_{(t,w)} \log P(t \mid w) = \arg\max_\lambda \sum_{(t,w)} \left( \sum_i f(t_i, t_{i-1}, w_i) - C(\lambda) \right)$ v Decompose the training data into such per-position units CS6501: NLP 26
Parameter estimation (Intuition) v Essentially, we are training local classifiers using the previously assigned tags as features CS6501: NLP 27
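One concrete way to read "training local classifiers": turn each position into a classification example whose features include the previous gold tag, then fit any off-the-shelf classifier. A toy sketch assuming scikit-learn is available (the feature set and the tiny data are made up):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

tagged_sents = [[("The", "DT"), ("grand", "JJ"), ("jury", "NN")]]  # made-up data

rows, labels = [], []
for sent in tagged_sents:
    prev = "<s>"
    for word, tag in sent:
        rows.append({"word": word.lower(),
                     "capitalized": word[0].isupper(),
                     "prev_tag": prev})          # previous gold tag as a feature
        labels.append(tag)
        prev = tag

vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(rows), labels)
```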
More about MEMMs v Emission features can go across multiple observations: $f(t_i, t_{i-1}, w) = \sum_k \lambda_k f_k(t_i, t_{i-1}, w)$, i.e., the features may look at the whole word sequence w v Especially useful for shallow parsing and NER tasks CS6501: NLP 28
Label bias problem v Consider the following tag sequences as the training data: Thomas/B-PER Jefferson/I-PER and Thomas/B-LOC Hall/I-LOC (Figure: a state-transition diagram over the states other, B-PER, E-PER, B-LOC, E-LOC) CS6501: NLP 29
Label bias problem v Training data: Thomas/B-PER Jefferson/I-PER and Thomas/B-LOC Hall/I-LOC v MEMM local probabilities: P(B-PER | Thomas, other) = 1/2, P(B-LOC | Thomas, other) = 1/2, P(I-PER | Jefferson, B-PER) = 1, P(I-LOC | Jefferson, B-LOC) = 1 v Should globally normalize! (Figure: state-transition diagram over other, B-PER, E-PER, B-LOC, E-LOC) CS6501: NLP 30
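A worked version of the example: with the probabilities above, the MEMM scores the two readings of "Thomas Jefferson" as $P(\text{B-PER I-PER} \mid \text{Thomas Jefferson}) = \tfrac{1}{2} \cdot 1 = \tfrac{1}{2}$ and $P(\text{B-LOC I-LOC} \mid \text{Thomas Jefferson}) = \tfrac{1}{2} \cdot 1 = \tfrac{1}{2}$, a tie: because each state's outgoing probabilities must sum to 1, the evidence from "Jefferson" cannot lower the score of the LOC path. Global normalization removes this bias.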
Conditional Random Field v Model global dependency: score the entire sequence directly v $P(t \mid w) \propto \exp S(t, w)$, i.e., $P(t \mid w) = \frac{\exp S(t, w)}{\sum_{t'} \exp S(t', w)}$ (Figure: a linear-chain factor graph over tags $t_1 \ldots t_4$ and words $w_1 \ldots w_4$) CS6501: NLP 31
Conditional Random Field v $S(t, w) = \sum_i \left( \sum_k \lambda_k f_k(t_i, w) + \sum_l \gamma_l g_l(t_i, t_{i-1}, w) \right)$ v $P(t \mid w) \propto \exp S(t, w) = \exp \sum_i \left( \sum_k \lambda_k f_k(t_i, w) + \sum_l \gamma_l g_l(t_i, t_{i-1}, w) \right)$ v Node features $f(t_i, w)$ and edge features $g(t_i, t_{i-1}, w)$ (Figure: linear-chain factor graph over $t_1 \ldots t_4$ and $w_1 \ldots w_4$) CS6501: NLP 32
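A brute-force sketch of the CRF's global normalization (feasible only for tiny tag sets and short sentences; node_score and edge_score stand in for the weighted feature sums $\sum_k \lambda_k f_k$ and $\sum_l \gamma_l g_l$, and their signatures are assumptions):

```python
import math
from itertools import product

def crf_prob(t, w, tagset, node_score, edge_score):
    """P(t|w) = exp S(t,w) / sum_{t'} exp S(t',w), with
    S(t,w) = sum_i [ node_score(t_i, i, w) + edge_score(t_i, t_{i-1}, w) ]."""
    def S(tags):
        total, prev = 0.0, "<s>"
        for i, tag in enumerate(tags):
            total += node_score(tag, i, w) + edge_score(tag, prev, w)
            prev = tag
        return total
    # Enumerate every possible tag sequence to form the partition function Z
    Z = sum(math.exp(S(tp)) for tp in product(tagset, repeat=len(w)))
    return math.exp(S(t)) / Z
```

In practice the sum over all tag sequences is computed with dynamic programming (a forward-style algorithm), not by enumeration.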
Design features (repeated) v Emission-like features v Binary feature functions v f_first-letter-capitalized-NNP(China) = 1 (for China/NNP) v f_first-letter-capitalized-VB(know) = 0 (for know/VB) v Integer (or real-valued) feature functions v f_number-of-vowels-NNP(China) = 2 v Transition-like features v Binary feature functions v f_first-letter-capitalized-VB-NNP(China) = 1 v Note: the features are not necessarily independent! CS6501: NLP 33
General Idea v We want the score of the correct answer $S(t, w)$ to be higher than that of every other output: $S(t, w) > S(t', w) \;\forall t' \in \mathcal{T}, t' \neq t$ v Account for different levels of mistakes with a margin: $S(t, w) \geq S(t', w) + \Delta(t', t) \;\forall t' \in \mathcal{T}$ v Several ML models can be used: v Structured Perceptron v Structured SVM v Learning to Search CS6501: NLP 34
Log-linear model v $P(t \mid w) \propto \exp S(t, w)$ v $S(t, w) = \sum_i \left( \sum_k \lambda_k f_k(t_i, w) + \sum_l \gamma_l g_l(t_i, t_{i-1}, w) \right) = \sum_k \lambda_k \left( \sum_i f_k(t_i, w) \right) + \sum_l \gamma_l \left( \sum_i g_l(t_i, t_{i-1}, w) \right) = \theta^\top F(t, w)$ v Here $\theta = (\lambda_1, \lambda_2, \ldots, \gamma_1, \gamma_2, \ldots)$ and $F(t, w) = \left( \sum_i f_1(t_i, w), \sum_i f_2(t_i, w), \ldots, \sum_i g_1(t_i, t_{i-1}, w), \sum_i g_2(t_i, t_{i-1}, w), \ldots \right)$ v Essentially, we aggregate transition and emission patterns as features CS6501: NLP 35
MEMM vs. CRF v As on the previous slide, we can rearrange the summations; the score function can be the same: $S(t, w) = \sum_i \left( \sum_k \lambda_k f_k(t_i, w) + \sum_l \gamma_l g_l(t_i, t_{i-1}, w) \right) = \sum_i f(t_i, t_{i-1}, w_i)$ v MEMM: locally normalized: $P(t \mid w) = \prod_i P(t_i \mid w_i, t_{i-1}) = \prod_i \frac{\exp f(t_i, t_{i-1}, w_i)}{\sum_{t'} \exp f(t', t_{i-1}, w_i)}$ v CRF: globally normalized: $P(t \mid w) = \frac{\exp S(t, w)}{\sum_{t'} \exp S(t', w)} = \frac{\prod_i \exp f(t_i, t_{i-1}, w_i)}{\sum_{t'} \prod_i \exp f(t'_i, t'_{i-1}, w_i)}$ CS6501: NLP 36
HMM vs. MEMM vs. CRF v HMM is a generative model of the joint distribution P(X, Y); MEMM and CRF are discriminative models of the conditional distribution P(Y | X) (Figure: the three graphical models side by side) CS6501: NLP 37
Structured Prediction beyond sequence tagging v Assign values to a set of interdependent output variables v Example tasks: v Part-of-speech tagging -- Input: They operate ships and banks. Output: Pronoun Verb Noun And Noun v Dependency parsing -- Input: They operate ships and banks. Output: a dependency tree over the sentence rooted at Root v Segmentation -- Input and output shown as images on the slide 38
Inference v Find the best scoring output given the model: $\arg\max_y S(y, x)$ v The output space is usually exponentially large v Inference algorithms: v Specific: e.g., Viterbi (linear chain) v General: integer linear programming (ILP) v Approximate inference algorithms: e.g., belief propagation, dual decomposition 39
Learning Structured Models v Repeat: solve the inference problem, then update the model with (stochastic) gradient updates 40
Example: Structured Perceptron v Goal: we want the score of the correct answer $S(y, x; \theta)$ to be higher than that of every other output: $S(y, x; \theta) > S(y', x; \theta) \;\forall y' \in \mathcal{Y}, y' \neq y$ v Let $S(y, x; \theta) = \theta^\top F(y, x)$ v Given training data $\{(y^i, x^i)\}$, $i = 1 \ldots N$ v Loop until convergence: v For $i = 1 \ldots N$: v Let $y' = \arg\max_y \theta^\top F(y, x^i)$ v If $y' \neq y^i$: $\theta \leftarrow \theta + \eta \left( F(y^i, x^i) - F(y', x^i) \right)$ Kai-Wei Chang 41
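A minimal sketch of this training loop, assuming feature_map(y, x) returns the aggregated feature vector F(y, x) as a dict of counts and decode(x, theta) solves the argmax (e.g., via Viterbi for linear chains); all names here are illustrative:

```python
def structured_perceptron(data, feature_map, decode, num_epochs=10, eta=1.0):
    """Structured perceptron: if the decoded y' differs from the gold y,
    update theta <- theta + eta * (F(y, x) - F(y', x))."""
    theta = {}
    for _ in range(num_epochs):
        for x, y in data:
            y_hat = decode(x, theta)              # argmax_y theta . F(y, x)
            if y_hat != y:
                for feat, val in feature_map(y, x).items():
                    theta[feat] = theta.get(feat, 0.0) + eta * val
                for feat, val in feature_map(y_hat, x).items():
                    theta[feat] = theta.get(feat, 0.0) - eta * val
    return theta
```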