Sequence Labeling: HMMs & Structured Perceptron
CMSC 723 / LING 723 / INST 725
Marine Carpuat (marine@cs.umd.edu)
HMM: Formal Specification
Q: a finite set of N states, Q = {q_0, q_1, q_2, q_3, ...}
N × N transition probability matrix A = [a_ij], where a_ij = P(q_j | q_i) and Σ_j a_ij = 1 for every state i
Sequence of observations O = o_1, o_2, ..., o_T, each drawn from a given set of symbols (vocabulary V)
N × |V| emission probability matrix B = [b_i(o_t)], where b_i(o_t) = P(o_t | q_i) and the probabilities for each state i sum to 1 over the vocabulary
Start and end states: an explicit start state q_0, or alternatively a prior distribution over start states {π_1, π_2, π_3, ...} with Σ_i π_i = 1, and a set of final states q_F
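As a concrete illustration of these objects, the parameters of a small HMM can be stored directly as NumPy arrays. This is only a sketch: the state names and every probability value below are made up for illustration and are not the λ_stock parameters used in the running example.

import numpy as np

# Hypothetical 3-state HMM over a 3-symbol vocabulary (illustrative values only)
states = ["Bull", "Bear", "Static"]      # Q
vocab = ["up", "down", "unchanged"]      # V

pi = np.array([0.2, 0.5, 0.3])           # prior over start states, sums to 1

# A[i, j] = P(q_j | q_i); each row sums to 1
A = np.array([[0.6, 0.2, 0.2],
              [0.3, 0.5, 0.2],
              [0.4, 0.1, 0.5]])

# B[i, k] = P(vocab[k] | q_i); each row sums to 1
B = np.array([[0.7, 0.1, 0.2],
              [0.1, 0.6, 0.3],
              [0.3, 0.3, 0.4]])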
HMMs: Three Problems
Likelihood: Given an HMM λ = (A, B, π) and a sequence of observed events O, find P(O | λ)
Decoding: Given an HMM λ = (A, B, π) and an observation sequence O, find the most likely (hidden) state sequence
Learning: Given a set of observation sequences and the set of states Q in λ, compute the parameters A and B
Computing Likelihood
π_1 = 0.5, π_2 = 0.2, π_3 = 0.3
[Figure: λ_stock with an observation sequence O over time steps t = 1 ... 6]
Assuming λ_stock models the stock market, how likely are we to observe this sequence of outputs?
Computing Likelihood
First try: sum over all possible ways in which we could generate O from λ
What's the problem? Takes O(N^T) time to compute!
Right idea, wrong algorithm!
Forward Algorithm
Use an N × T trellis or chart of forward probabilities α_t(j) (also written α_tj)
α_t(j) = P(being in state j after seeing t observations) = P(o_1, o_2, ..., o_t, q_t = j)
Each cell = extensions of all paths from other cells: α_t(j) = Σ_i α_{t-1}(i) a_ij b_j(o_t)
α_{t-1}(i): forward path probability until (t-1)
a_ij: transition probability of going from state i to j
b_j(o_t): probability of emitting symbol o_t in state j
P(O | λ) = Σ_i α_T(i)
Forward Algorithm: Formal Definition
Initialization: α_1(j) = π_j b_j(o_1)
Recursion: α_t(j) = Σ_i α_{t-1}(i) a_ij b_j(o_t)
Termination: P(O | λ) = Σ_i α_T(i)
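A minimal Python sketch of the forward algorithm, assuming the array-based representation above (states indexed 0..N-1, observations given as vocabulary indices):

import numpy as np

def forward(A, B, pi, obs):
    """Compute P(O | lambda) with the forward trellis alpha[t, j]."""
    N = A.shape[0]          # number of states
    T = len(obs)            # number of observations
    alpha = np.zeros((T, N))

    # Initialization: alpha_1(j) = pi_j * b_j(o_1)
    alpha[0] = pi * B[:, obs[0]]

    # Recursion: alpha_t(j) = sum_i alpha_{t-1}(i) * a_ij * b_j(o_t)
    for t in range(1, T):
        for j in range(N):
            alpha[t, j] = np.sum(alpha[t - 1] * A[:, j]) * B[j, obs[t]]

    # Termination: P(O | lambda) = sum_i alpha_T(i)
    return alpha[-1].sum()

# e.g. forward(A, B, pi, [0, 1, 1]) for the (hypothetical) observation sequence up, down, down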
Forward Algorithm
Given the observation sequence O [shown in the figure], find P(O | λ_stock)
Forward Algorithm
[Empty trellis figure: states Static, Bear, Bull (rows) against time steps t=1, t=2, t=3 (columns)]
Forward Algorithm: Initialization
[Trellis, t=1 column]
α_1(Static) = 0.3 × 0.3 = 0.09
α_1(Bear) = 0.5 × 0.1 = 0.05
α_1(Bull) = 0.2 × 0.7 = 0.14
Forward Algorithm: Recursion
[Trellis, t=2 column] One term of the sum for α_2(Bull): α_1(Bull) × a_Bull,Bull × b_Bull(o_2) = 0.14 × 0.6 × 0.1 = 0.0084; adding the terms from the other states gives α_2(Bull) = 0.0145 ... and so on
Forward Algorithm: Recursion
Work through the rest of these numbers
[Trellis so far: α_1(Static) = 0.09, α_1(Bear) = 0.05, α_1(Bull) = 0.14, α_2(Bull) = 0.0145; the remaining cells are left as an exercise]
HMMs: Three Problems
Likelihood: Given an HMM λ = (A, B, π) and a sequence of observed events O, find P(O | λ)
Decoding: Given an HMM λ = (A, B, π) and an observation sequence O, find the most likely (hidden) state sequence
Learning: Given a set of observation sequences and the set of states Q in λ, compute the parameters A and B
Decoding
π_1 = 0.5, π_2 = 0.2, π_3 = 0.3
[Figure: λ_stock with an observation sequence O over time steps t = 1 ... 6]
Given λ_stock as our model and O as our observations, what are the most likely states the market went through to produce O?
Viterbi Algorithm
Use an N × T trellis of values v_t(j), just like in the forward algorithm
v_t(j) = P(in state j after seeing t observations and passing through the most likely state sequence so far) = P(q_1, q_2, ..., q_{t-1}, q_t = j, o_1, o_2, ..., o_t)
Each cell = extension of the most likely path from other cells: v_t(j) = max_i v_{t-1}(i) a_ij b_j(o_t)
v_{t-1}(i): Viterbi probability until (t-1)
a_ij: transition probability of going from state i to j
b_j(o_t): probability of emitting symbol o_t in state j
P* = max_i v_T(i)
Viterbi vs. Forward Almost the same algorithm! Key difference: maximize instead of sum over previous paths We also need to store the most likely path (transition): Use backpointers to keep track of most likely transition At the end, follow the chain of backpointers to recover the most likely state sequence
Viterbi Algorithm: Formal Definition
Initialization: v_1(j) = π_j b_j(o_1), with backpointer bt_1(j) = 0
Recursion: v_t(j) = max_i v_{t-1}(i) a_ij b_j(o_t), with bt_t(j) = argmax_i v_{t-1}(i) a_ij b_j(o_t)
Termination: P* = max_i v_T(i); take the best final state argmax_i v_T(i) and follow the backpointers
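A matching Python sketch of Viterbi with backpointers, under the same assumptions as the forward sketch (states indexed 0..N-1, observations as vocabulary indices):

import numpy as np

def viterbi(A, B, pi, obs):
    """Return the most likely state sequence and its probability."""
    N = A.shape[0]
    T = len(obs)
    v = np.zeros((T, N))
    backptr = np.zeros((T, N), dtype=int)

    # Initialization: v_1(j) = pi_j * b_j(o_1)
    v[0] = pi * B[:, obs[0]]

    # Recursion: v_t(j) = max_i v_{t-1}(i) * a_ij * b_j(o_t), storing the argmax i
    for t in range(1, T):
        for j in range(N):
            scores = v[t - 1] * A[:, j] * B[j, obs[t]]
            backptr[t, j] = np.argmax(scores)
            v[t, j] = scores[backptr[t, j]]

    # Termination: pick the best final state, then follow backpointers to recover the path
    path = [int(np.argmax(v[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    path.reverse()
    return path, v[-1].max()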
Viterbi Algorithm
Given the observation sequence O [shown in the figure], find the most likely state sequence given λ_stock
Viterbi Algorithm
[Empty trellis figure: states Static, Bear, Bull (rows) against time steps t=1, t=2, t=3 (columns)]
Viterbi Algorithm: Initialization
[Trellis, t=1 column]
v_1(Static) = 0.3 × 0.3 = 0.09
v_1(Bear) = 0.5 × 0.1 = 0.05
v_1(Bull) = 0.2 × 0.7 = 0.14
Viterbi Algorithm: Recursion
[Trellis, t=2 column] The Bull → Bull candidate: v_1(Bull) × a_Bull,Bull × b_Bull(o_2) = 0.14 × 0.6 × 0.1 = 0.0084; taking the max over all predecessors gives v_2(Bull) = 0.0084
Viterbi Algorithm: Recursion
[Trellis, t=2 column] v_2(Bull) = 0.0084; store a backpointer to the best predecessor ... and so on
Viterbi Algorithm: Recursion
Work through the rest of the algorithm
[Trellis so far: v_1(Static) = 0.09, v_1(Bear) = 0.05, v_1(Bull) = 0.14, v_2(Bull) = 0.0084; the remaining cells are left as an exercise]
HMMs: Three Problems
Likelihood: Given an HMM λ = (A, B, π) and a sequence of observed events O, find P(O | λ)
Decoding: Given an HMM λ = (A, B, π) and an observation sequence O, find the most likely (hidden) state sequence
Learning: Given a set of observation sequences and the set of states Q in λ, compute the parameters A and B
Learning HMMs for POS tagging is a supervised task A POS tagged corpus tells us the hidden states! We can compute Maximum Likelihood Estimates (MLEs) for the various parameters
Supervised Training: Transition Probabilities
Estimate any P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1}) from the tagged data
Example: for P(NN | VB), count how many times a noun follows a verb and divide by the total number of times you see a verb
Supervised Training: Emission Probabilities
Estimate any P(w_i | t_i) = C(w_i, t_i) / C(t_i) from the tagged data
Example: for P(bank | NN), count how many times bank is tagged as a noun and divide by how many times anything is tagged as a noun
Supervised Training: Priors
Estimate any P(q_1 = t_i) = π_i = C(t_i) / N from the tagged data
Example: for π_NN, count the number of times NN occurs and divide by the total number of tag tokens N
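A sketch of this count-based supervised estimation in Python; the corpus format (a list of sentences, each a list of (word, tag) pairs) and the dictionary-based output are assumptions for illustration, not a prescribed interface:

from collections import Counter

def train_hmm(tagged_sentences):
    """Count-based MLE estimates of transition, emission, and prior probabilities."""
    trans = Counter()        # C(t_{i-1}, t_i)
    emit = Counter()         # C(t_i, w_i)
    context = Counter()      # C(t_{i-1}) as a transition context
    tag_count = Counter()    # C(t_i)

    for sent in tagged_sentences:            # sent = [(word, tag), ...]
        prev = "<s>"
        for word, tag in sent:
            trans[(prev, tag)] += 1
            context[prev] += 1
            emit[(tag, word)] += 1
            tag_count[tag] += 1
            prev = tag
        trans[(prev, "</s>")] += 1
        context[prev] += 1

    P_T = {bg: c / context[bg[0]] for bg, c in trans.items()}     # P(t_i | t_{i-1})
    P_E = {tw: c / tag_count[tw[0]] for tw, c in emit.items()}    # P(w_i | t_i)
    total_tags = sum(tag_count.values())
    pi = {t: c / total_tags for t, c in tag_count.items()}        # pi_i = C(t_i) / total tag tokens
    return P_T, P_E, pi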
HMMs: Three Problems
Likelihood: Given an HMM λ = (A, B, π) and a sequence of observed events O, find P(O | λ)
Decoding: Given an HMM λ = (A, B, π) and an observation sequence O, find the most likely (hidden) state sequence
Learning: Given a set of observation sequences and the set of states Q in λ, compute the parameters A and B
Approaches to POS tagging
Discriminative classifiers: treat tagging as a multiclass classification problem (logistic regression / perceptron); model context using lots of features
Generative models: treat tagging as a structured prediction problem (sequence labeling); Hidden Markov Models model transitions between states/POS
Structured perceptron: classification with lots of features over structured models!
Let's restructure HMMs with features
Given a sentence X, predict its part-of-speech sequence Y
Natural/JJ language/NN processing/NN (/-LRB- NLP/NN )/-RRB- is/VBZ a/DT field/NN of/IN computer/NN science/NN
Let's restructure HMMs with features
POS → POS transition probabilities: P(Y) = ∏_{i=1}^{I+1} P_T(y_i | y_{i-1})
e.g. over <s> JJ NN NN -LRB- NN -RRB- ... </s>: P_T(JJ | <s>) × P_T(NN | JJ) × P_T(NN | NN) × ...
POS → Word emission probabilities: P(X | Y) = ∏_{i=1}^{I} P_E(x_i | y_i)
e.g. over natural language processing ( nlp ) ...: P_E(natural | JJ) × P_E(language | NN) × P_E(processing | NN) × ...
Restructuring HMM with Features
Normal HMM: P(X, Y) = ∏_{i=1}^{I} P_E(x_i | y_i) × ∏_{i=1}^{I+1} P_T(y_i | y_{i-1})
Log likelihood: log P(X, Y) = Σ_{i=1}^{I} log P_E(x_i | y_i) + Σ_{i=1}^{I+1} log P_T(y_i | y_{i-1})
Score: S(X, Y) = Σ_{i=1}^{I} w_{E, y_i, x_i} + Σ_{i=1}^{I+1} w_{T, y_{i-1}, y_i}
When w_{E, y_i, x_i} = log P_E(x_i | y_i) and w_{T, y_{i-1}, y_i} = log P_T(y_i | y_{i-1}), then log P(X, Y) = S(X, Y)
Example
φ(X = I visited Nara, Y_1 = PRP VBD NNP):
φ_T,<S>,PRP = 1, φ_T,PRP,VBD = 1, φ_T,VBD,NNP = 1, φ_T,NNP,</S> = 1
φ_E,PRP,I = 1, φ_E,VBD,visited = 1, φ_E,NNP,Nara = 1
φ_CAPS,PRP = 1, φ_CAPS,NNP = 1, φ_SUF,VBD,...ed = 1
φ(X = I visited Nara, Y_2 = NNP VBD NNP):
φ_T,<S>,NNP = 1, φ_T,NNP,VBD = 1, φ_T,VBD,NNP = 1, φ_T,NNP,</S> = 1
φ_E,NNP,I = 1, φ_E,VBD,visited = 1, φ_E,NNP,Nara = 1
φ_CAPS,NNP = 2, φ_SUF,VBD,...ed = 1
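A sketch of a feature-extraction function that produces φ(X, Y) as a sparse Counter; the feature templates mirror the example above (transition, emission, capitalization, -ed suffix), but the tuple-based feature names are an illustrative choice:

from collections import Counter

def create_features(words, tags):
    """Sparse feature vector phi(X, Y) for a sentence and a candidate tag sequence."""
    phi = Counter()
    prev = "<S>"
    for word, tag in zip(words, tags):
        phi[("T", prev, tag)] += 1                 # transition feature
        phi[("E", tag, word)] += 1                 # emission feature
        if word[0].isupper():
            phi[("CAPS", tag)] += 1                # capitalization feature
        if word.endswith("ed"):
            phi[("SUF", tag, "...ed")] += 1        # suffix feature
        prev = tag
    phi[("T", prev, "</S>")] += 1
    return phi

# e.g. create_features("I visited Nara".split(), ["PRP", "VBD", "NNP"])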
How to decode?
We must find the POS sequence that satisfies: Ŷ = argmax_Y Σ_i w_i φ_i(X, Y)
Solution: Viterbi algorithm
HMM Viterbi Algorithm
Forward step: calculate the best path to each node, i.e., the path with the lowest negative log probability
Backtracking step: reproduce the path
HMM Viterbi with Features
Same as with probabilities, but use feature weights
[Lattice figure: node 0:<S> followed by candidate nodes 1:NN, 1:JJ, 1:VB, 1:PRP, 1:NNP for the first word, I]
best_score["1 NN"] = w_T,<S>,NN + w_E,NN,I
best_score["1 JJ"] = w_T,<S>,JJ + w_E,JJ,I
best_score["1 VB"] = w_T,<S>,VB + w_E,VB,I
best_score["1 PRP"] = w_T,<S>,PRP + w_E,PRP,I
best_score["1 NNP"] = w_T,<S>,NNP + w_E,NNP,I
HMM Viterbi with Features
Can add additional features
[Same lattice figure]
best_score["1 NN"] = w_T,<S>,NN + w_E,NN,I + w_CAPS,NN
best_score["1 JJ"] = w_T,<S>,JJ + w_E,JJ,I + w_CAPS,JJ
best_score["1 VB"] = w_T,<S>,VB + w_E,VB,I + w_CAPS,VB
best_score["1 PRP"] = w_T,<S>,PRP + w_E,PRP,I + w_CAPS,PRP
best_score["1 NNP"] = w_T,<S>,NNP + w_E,NNP,I + w_CAPS,NNP
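A sketch of Viterbi decoding over feature weights, assuming the weights are stored in a dict keyed by the same tuple-style feature names as the create_features sketch (unseen features score 0); the function name and interface are illustrative:

def viterbi_decode(w, words, tagset):
    """Find the highest-scoring tag sequence under feature weights w."""
    best_score = [{"<S>": 0.0}]      # best_score[i][tag]: best path score ending in `tag` at position i
    best_edge = [{"<S>": None}]      # backpointers

    prev_tags = ["<S>"]
    for i, word in enumerate(words, start=1):
        best_score.append({})
        best_edge.append({})
        for tag in tagset:
            # emission-style features for tagging `word` with `tag`
            local = w.get(("E", tag, word), 0.0)
            if word[0].isupper():
                local += w.get(("CAPS", tag), 0.0)
            if word.endswith("ed"):
                local += w.get(("SUF", tag, "...ed"), 0.0)
            # best previous tag under transition + local feature weights
            scores = {prev: best_score[i - 1][prev] + w.get(("T", prev, tag), 0.0) + local
                      for prev in prev_tags}
            best_prev = max(scores, key=scores.get)
            best_score[i][tag] = scores[best_prev]
            best_edge[i][tag] = best_prev
        prev_tags = list(tagset)

    # add the transition into </S>, pick the best final tag, then follow backpointers
    final = {tag: best_score[-1][tag] + w.get(("T", tag, "</S>"), 0.0) for tag in prev_tags}
    tag = max(final, key=final.get)
    tags = [tag]
    for i in range(len(words), 1, -1):
        tag = best_edge[i][tag]
        tags.append(tag)
    tags.reverse()
    return tags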
Learning: the Structured Perceptron
Recall the perceptron algorithm: if there is a mistake, update w ← w + y φ(x, y)
The update increases the score of positive examples and decreases the score of negative examples
What are positive/negative examples in the structured perceptron, where y is a sequence of labels?
Learning in the Structured Perceptron
Positive example, the correct feature vector: φ(I visited Nara, PRP VBD NNP)
Negative example, an incorrect feature vector: φ(I visited Nara, NNP VBD NNP)
Problem: there are many incorrect feature vectors!
φ(I visited Nara, NNP VBD NNP)
φ(I visited Nara, PRP VBD NN)
φ(I visited Nara, PRP VB NNP)
Structured Perceptron Update Rule
Use the incorrect answer with the highest score: Ŷ = argmax_Y Σ_i w_i φ_i(X, Y)
The update rule becomes w ← w + φ(X, Y') - φ(X, Ŷ), where Y' is the correct answer
Note: if the highest-scoring answer is correct (Ŷ = Y'), there is no change
Structured Perceptron Algorithm
create map w
for I iterations
    for each labeled pair X, Y_prime in the data
        Y_hat = hmm_viterbi(w, X)
        phi_prime = create_features(X, Y_prime)
        phi_hat = create_features(X, Y_hat)
        w += phi_prime - phi_hat
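A runnable sketch of this training loop in Python, assuming the create_features and viterbi_decode sketches above and training data given as (words, gold_tags) pairs; the names and data format are assumptions for illustration:

from collections import defaultdict

def train_structured_perceptron(data, tagset, iterations=5):
    """data: list of (words, gold_tags) pairs. Returns the learned feature-weight map w."""
    w = defaultdict(float)
    for _ in range(iterations):
        for words, gold_tags in data:
            pred_tags = viterbi_decode(w, words, tagset)      # Y_hat under current weights
            if pred_tags != list(gold_tags):                  # mistake: perceptron update
                phi_gold = create_features(words, gold_tags)
                phi_pred = create_features(words, pred_tags)
                for feat, count in phi_gold.items():
                    w[feat] += count                          # increase score of the correct answer
                for feat, count in phi_pred.items():
                    w[feat] -= count                          # decrease score of the predicted answer
    return w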
Recap: POS tagging Example of a sequence labeling task Can be addressed using Hidden Markov Models Decoding with Viterbi algorithm Supervised training: counts-based estimates Structured Perceptron Decoding with Viterbi algorithm Supervised training: perceptron algorithm with structured weight updates First example of structured prediction, first step toward predicting syntactic structure
Aside: unsupervised training for HMMs?
Baum-Welch algorithm (sketch)
Start with random parameters A, B
Until convergence:
E-step: determine the probability of the observations being generated by various hidden state sequences (using the forward-backward algorithm)
M-step: re-estimate A, B using these probabilities
More details here