Part of Speech Tagging: Viterbi, Forward, Backward, Forward-Backward, Baum-Welch
COMP-599, Oct 1, 2015
Announcements
Research skills workshop today, 3pm-4:30pm, Schulich Library room 313
Start thinking about your project proposal (due Oct 22)
Exercise in Supervised Training
What are the MLEs for the following training corpus?
DT  NN  VBD IN DT  NN
the cat sat on the mat
DT  NN  VBD JJ
the cat was sad
RB VBD DT  NN
so was the mat
DT  JJ  NN  VBD IN DT  JJ  NN
the sad cat was on the sad mat
Outline
Hidden Markov models: review of last class
Things to do with HMMs:
- Forward algorithm
- Backward algorithm
- Viterbi algorithm
- Baum-Welch as Expectation Maximization
Parts of Speech in English
Nouns: restaurant, me, dinner
Verbs: find, eat, is
Adjectives: good, vegetarian
Prepositions: in, of, up, above
Adverbs: quickly, well, very
Determiners: the, a, an
Sequence Labelling
Predict labels for an entire sequence of inputs:
?      ?      ? ?  ?     ?   ? ?    ?    ?   ?
Pierre Vinken , 61 years old , will join the board
becomes
NNP    NNP    , CD NNS   JJ  , MD   VB   DT  NN
Pierre Vinken , 61 years old , will join the board
Must consider:
- Current word
- Previous context
Decomposing the Joint Probability
The graph specifies how the joint probability decomposes.
[Graphical model: a chain Q_1 → Q_2 → Q_3 → Q_4 → Q_5, with each state Q_t emitting an observation O_t]
P(O, Q) = P(Q_1) ∏_{t=1}^{T−1} P(Q_{t+1} | Q_t) ∏_{t=1}^{T} P(O_t | Q_t)
- P(Q_1): initial state probability
- P(Q_{t+1} | Q_t): state transition probabilities
- P(O_t | Q_t): emission probabilities
Model Parameters
Let there be N possible tags and W possible words.
The parameter set θ has three components:
1. Initial probabilities for Q_1: Π = {π_1, π_2, …, π_N} (categorical)
2. Transition probabilities from Q_t to Q_{t+1}: A = {a_ij}, i, j ∈ [1, N] (categorical)
3. Emission probabilities from Q_t to O_t: B = {b_i(w_k)}, i ∈ [1, N], k ∈ [1, W] (categorical)
How many distributions and values of each type are there?
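To make the bookkeeping concrete, here is a minimal sketch, not from the slides: the sizes and toy numbers are invented, with θ stored as arrays, together with the joint probability factorization from the previous slide.

```python
import numpy as np

# Toy sizes and values, invented for illustration: N tags, W words.
N, W = 2, 3
pi = np.array([0.6, 0.4])          # pi[i]   = P(Q_1 = i): 1 distribution, N values
A  = np.array([[0.7, 0.3],
               [0.4, 0.6]])        # A[i, j] = P(Q_{t+1} = j | Q_t = i): N distributions, N values each
B  = np.array([[0.5, 0.4, 0.1],
               [0.1, 0.2, 0.7]])   # B[i, k] = b_i(w_k) = P(O_t = w_k | Q_t = i): N distributions, W values each

def joint_prob(obs, states):
    """P(O, Q | theta): initial prob, then transitions and emissions, as in the factorization."""
    p = pi[states[0]] * B[states[0], obs[0]]
    for t in range(1, len(obs)):
        p *= A[states[t - 1], states[t]] * B[states[t], obs[t]]
    return p
```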
Model Parameters: MLE
Recall the MLE for a categorical distribution:
P(outcome i) = #(outcome i) / #(all events)
For our parameters:
π_i = P(Q_1 = i) = #(Q_1 = i) / #(sentences)
a_ij = P(Q_{t+1} = j | Q_t = i) = #(i, j) / #(i)
b_i(k) = P(O_t = k | Q_t = i) = #(word k, tag i) / #(i)
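A sketch of these count ratios in code, assuming sentences are given as lists of (word, tag) pairs; the helper name hmm_mle is mine. Running it on the slide-3 corpus is one way to check your exercise answers.

```python
from collections import Counter

def hmm_mle(tagged_sents):
    """Supervised MLE: pi, A, B as count ratios from tagged sentences."""
    init, trans, emit = Counter(), Counter(), Counter()
    tag_total, from_total = Counter(), Counter()
    for sent in tagged_sents:
        init[sent[0][1]] += 1                      # #(Q_1 = i)
        for word, tag in sent:
            emit[(tag, word)] += 1                 # #(word k, tag i)
            tag_total[tag] += 1                    # #(i) for emissions
        for (_, t1), (_, t2) in zip(sent, sent[1:]):
            trans[(t1, t2)] += 1                   # #(i, j)
            from_total[t1] += 1                    # #(i) as a "from" state
    pi = {t: c / len(tagged_sents) for t, c in init.items()}
    A = {pair: c / from_total[pair[0]] for pair, c in trans.items()}
    B = {pair: c / tag_total[pair[0]] for pair, c in emit.items()}
    return pi, A, B
```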
Questions for an HMM
1. Compute the likelihood of a sequence of observations, P(O | θ)
   - Forward algorithm, backward algorithm
2. What state sequence best explains a sequence of observations? argmax_Q P(Q, O | θ)
   - Viterbi algorithm
3. Given an observation sequence (without labels), what is the best model for it?
   - Forward-backward algorithm, Baum-Welch algorithm: Expectation Maximization
Q1: Compute Likelihood
Marginalize over all possible state sequences:
P(O | θ) = Σ_Q P(O, Q | θ)
Problem: exponentially many paths (N^T)
Solution: forward algorithm
- Dynamic programming to avoid unnecessary recalculations
- Create a table of all the possible state sequences, annotated with probabilities
Forward Algorithm
[Trellis of possible state sequences: one row per state (VB, NN, DT, JJ, CD), one column per time step (O_1 … O_5); cell (i, t) holds α_i(t)]
α_i(t) = P(O_{1:t}, Q_t = i | θ)
Probability of the current tag and the words up to now.
First Column
α_i(t) = P(O_{1:t}, Q_t = i | θ)
Consider the first column of the trellis:
α_j(1) = π_j b_j(O_1)
Initial state probability times first emission.
Middle Cells
α_i(t) = P(O_{1:t}, Q_t = i | θ)
Consider a cell in a middle column of the trellis:
α_j(t) = [Σ_{i=1}^{N} α_i(t−1) a_ij] b_j(O_t)
All possible ways to get to the current state, times the emission from the current state.
After the Last Column
α_i(t) = P(O_{1:t}, Q_t = i | θ)
Sum over the last column for the overall likelihood:
P(O | θ) = Σ_{j=1}^{N} α_j(T)
Forward Algorithm Summary
Create trellis α_i(t) for i = 1…N, t = 1…T
α_j(1) = π_j b_j(O_1) for j = 1…N
for t = 2…T:
  for j = 1…N:
    α_j(t) = [Σ_{i=1}^{N} α_i(t−1) a_ij] b_j(O_t)
P(O | θ) = Σ_{j=1}^{N} α_j(T)
Runtime: O(N²T)
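As a sketch in code, in the same numpy style as the parameter arrays above; obs as a list of word indices is my assumed encoding:

```python
import numpy as np

def forward(pi, A, B, obs):
    """alpha[t, i] = P(O_{1:t}, Q_t = i | theta), stored 0-indexed in t."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                      # alpha_j(1) = pi_j b_j(O_1)
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]  # sum over ways in, then emit
    return alpha

def likelihood(pi, A, B, obs):
    return forward(pi, A, B, obs)[-1].sum()           # P(O | theta) = sum_j alpha_j(T)
```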
Backward Algorithm
Nothing stops us from going backwards too! This is not just for fun, as we'll see later.
Define a new trellis with cells β_i(t):
β_i(t) = P(O_{t+1:T} | Q_t = i, θ)
Probability of all subsequent words, given that the current tag is i.
Note that unlike α_i(t), it excludes the current word.
Backward Algorithm Summary
Create trellis β_i(t) for i = 1…N, t = 1…T
β_i(T) = 1 for i = 1…N
for t = T−1…1:
  for i = 1…N:
    β_i(t) = Σ_{j=1}^{N} a_ij b_j(O_{t+1}) β_j(t+1)
P(O | θ) = Σ_{i=1}^{N} π_i b_i(O_1) β_i(1)
Remember that β_j(t+1) does not include b_j(O_{t+1}), so we need to add this factor in!
Runtime: O(N²T)
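The same style of sketch for the backward pass:

```python
def backward(pi, A, B, obs):
    """beta[t, i] = beta_i(t) from the slide, stored 0-indexed in t."""
    T, N = len(obs), len(pi)
    beta = np.ones((T, N))                              # beta_i(T) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])  # add b_j(O_{t+1}) back in
    return beta
```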
Forward and Backward
α_i(t) = P(O_{1:t}, Q_t = i | θ)
β_i(t) = P(O_{t+1:T} | Q_t = i, θ)
That means:
α_i(t) β_i(t) = P(O, Q_t = i | θ)
Probability of the entire sequence of observations, with state i at timestep t.
Thus:
P(O | θ) = Σ_{i=1}^{N} α_i(t) β_i(t) for any t = 1…T
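This identity makes a handy sanity check for the two sketches above, using the toy parameters from the earlier snippet and any valid observation sequence:

```python
obs = [0, 2, 1]                           # any sequence of word indices < W
alpha = forward(pi, A, B, obs)
beta = backward(pi, A, B, obs)
# sum_i alpha_i(t) beta_i(t) should equal P(O | theta) at every t
per_t = (alpha * beta).sum(axis=1)
assert np.allclose(per_t, per_t[0])
```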
Working in the Log Domain
Practical note: we need to avoid underflow, so work in the log domain. Let a_i = log p_i.
Products become sums: log ∏_i p_i = Σ_i log p_i = Σ_i a_i
Sums need the log-sum trick: log Σ_i p_i = log Σ_i exp(a_i)
Let b = max_i a_i. Then:
log Σ_i exp(a_i) = log [exp(b) Σ_i exp(a_i − b)] = b + log Σ_i exp(a_i − b)
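A minimal sketch of the trick; scipy.special.logsumexp implements the same idea if you prefer a library call:

```python
import numpy as np

def logsumexp(a):
    """Stable log(sum_i exp(a_i)): shift by the max so the largest exponent is 0."""
    b = np.max(a)
    return b + np.log(np.sum(np.exp(a - b)))
```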
Q2: Sequence Labelling
Find the most likely state sequence for a sample:
Q* = argmax_Q P(Q, O | θ)
Intuition: use the forward algorithm, but replace summation with max. This is now called the Viterbi algorithm.
Trellis cells:
δ_i(t) = max_{Q_{1:t−1}} P(Q_{1:t−1}, O_{1:t}, Q_t = i | θ)
Viterbi Algorithm Summary
Create trellis δ_i(t) for i = 1…N, t = 1…T
δ_j(1) = π_j b_j(O_1) for j = 1…N
for t = 2…T:
  for j = 1…N:
    δ_j(t) = [max_i δ_i(t−1) a_ij] b_j(O_t)
Take max_i δ_i(T)
Runtime: O(N²T)
Backpointers
[Trellis of δ_i(t) cells, states × time steps, as before]
Keep track of where the max entry to each cell came from.
Work backwards to recover the best label sequence.
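A sketch of Viterbi with backpointers, in the same style as the forward code above:

```python
def viterbi(pi, A, B, obs):
    """Return (best state sequence, its joint probability with the observations)."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)                 # backpointers
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A            # scores[i, j] = delta_i(t-1) a_ij
        psi[t] = scores.argmax(axis=0)                # best predecessor for each j
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]                  # best final state
    for t in range(T - 1, 0, -1):                     # follow backpointers
        path.append(int(psi[t, path[-1]]))
    return path[::-1], float(delta[-1].max())
```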
Exercise
3 states (X, Y, Z), 2 possible emissions (!, @)
π = [0.2, 0.5, 0.3]
A = [0.5 0.4 0.1]
    [0.2 0.3 0.5]
    [0.1 0.1 0.8]
B = [0.1 0.9]
    [0.5 0.5]
    [0.7 0.3]
Run the forward, backward, and Viterbi algorithms on the emission sequence "!@@"
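One way to check your answers against the sketches above; the matrix layout here is my reading of the slide, so verify the row/column order against your notes:

```python
import numpy as np

pi = np.array([0.2, 0.5, 0.3])         # states X, Y, Z
A = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.3, 0.5],
              [0.1, 0.1, 0.8]])        # rows: from-state; columns: to-state
B = np.array([[0.1, 0.9],
              [0.5, 0.5],
              [0.7, 0.3]])             # columns: emissions !, @
obs = [0, 1, 1]                        # "!@@"

print(likelihood(pi, A, B, obs))       # forward
print(backward(pi, A, B, obs))
print(viterbi(pi, A, B, obs))          # best path and its probability
```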
Q3: Unsupervised Training
Problem: no state sequences from which to compute the counts above.
Solution: guess the state sequences.
Initialize parameters randomly.
Repeat for a while:
1. Predict the current state sequences using the current model
2. Update the current parameters based on current predictions
"Hard EM" or Viterbi EM Initialize parameters randomly Repeat for a while: 1. Predict the current state sequences using the current model with the Viterbi algorithm 2. Update the current parameters using the current predictions as in the supervised learning case Can also use "soft" predictions; i.e., the probabilities of all the possible state sequences 26
Baum-Welch Algorithm
a.k.a. the forward-backward algorithm
EM applied to HMMs:
- Expectation: get expected counts for hidden structures using the current θ^k
- Maximization: find θ^{k+1} to maximize the likelihood of the training data, given the expected counts from the E-step
Responsibilities Needed
Supervised/Hard EM predicts a single tag; Baum-Welch predicts a distribution over tags. Call such probabilities responsibilities.
γ_i(t) = P(Q_t = i | O, θ^k)
Probability of being in state i at time t, given the observed sequence under the current model.
ξ_ij(t) = P(Q_t = i, Q_{t+1} = j | O, θ^k)
Probability of transitioning from i at time t to j at time t+1.
E-Step: γ
γ_i(t) = P(Q_t = i | O, θ^k)
       = P(Q_t = i, O | θ^k) / P(O | θ^k)
       = α_i(t) β_i(t) / P(O | θ^k)
E-Step: ξ
ξ_ij(t) = P(Q_t = i, Q_{t+1} = j | O, θ^k)
        = P(Q_t = i, Q_{t+1} = j, O | θ^k) / P(O | θ^k)
        = α_i(t) a_ij b_j(O_{t+1}) β_j(t+1) / P(O | θ^k)
The numerator factors as: getting to state i at time t (α_i(t)), the transition from i to j (a_ij), emitting O_{t+1} (b_j(O_{t+1})), and the rest of the sequence (β_j(t+1)).
M-Step
"Soft" versions of the MLE updates:
π_i^{k+1} = γ_i(1)
a_ij^{k+1} = Σ_{t=1}^{T−1} ξ_ij(t) / Σ_{t=1}^{T−1} γ_i(t)              (compare: #(i, j) / #(i))
b_i^{k+1}(w_k) = Σ_{t=1}^{T} γ_i(t) 1[O_t = w_k] / Σ_{t=1}^{T} γ_i(t)  (compare: #(word k, tag i) / #(i))
With multiple sentences, sum up the expected counts over all sentences.
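Putting the E-step and M-step together for a single training sequence, as a sketch built on the forward/backward code above; extending it to multiple sentences means summing the soft counts over sentences before normalizing:

```python
def baum_welch_step(pi, A, B, obs):
    """One EM iteration on one sequence: responsibilities, then soft-count ratios."""
    alpha, beta = forward(pi, A, B, obs), backward(pi, A, B, obs)
    lik = alpha[-1].sum()                                        # P(O | theta^k)
    # E-step: gamma[t, i] and xi[t, i, j]
    gamma = alpha * beta / lik
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * (B[:, obs[1:]].T * beta[1:])[:, None, :]) / lik
    # M-step: soft-count ratios
    pi_new = gamma[0]                                            # pi_i = gamma_i(1)
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]     # sum_t xi / sum_t gamma
    B_new = np.zeros_like(B)
    for t, o in enumerate(obs):
        B_new[:, o] += gamma[t]                                  # soft emission counts
    B_new /= gamma.sum(axis=0)[:, None]
    return pi_new, A_new, B_new
```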
When to Stop EM?
Multiple options:
- When the training set likelihood stops improving
- When prediction performance on a held-out development or validation set stops improving
Why Does EM Work?
EM finds a local optimum in P(O | θ). One can show that after each step of EM:
P(O | θ^{k+1}) ≥ P(O | θ^k)
However, this is not necessarily a global optimum.
Proof of Baum-Welch Correctness
Part 1: Show that in our procedure, one iteration corresponds to this update:
θ^{k+1} = argmax_θ Σ_Q P(Q | O, θ^k) log P(O, Q | θ)
http://www.cs.berkeley.edu/~stephentu/writeups/hmm-baum-welch-derivation.pdf
Part 2: Show that improving the quantity Σ_Q P(Q | O, θ^k) log P(O, Q | θ) corresponds to improving P(O | θ).
https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm#Proof_of_correctness
Dealing with Local Optima
Random restarts:
- Train multiple models with different random initializations
- Use model selection on a development set to pick the best one
Biased initialization:
- Bias your initialization using some external source of knowledge (e.g., external corpus counts, a clustering procedure, or expert knowledge about the domain)
- Further training will hopefully improve results
Caveats
Baum-Welch with no labelled data generally gives poor results, at least for linguistic structure (~40% accuracy, according to Johnson (2007)).
Semi-supervised learning: combine small amounts of labelled data with a larger corpus of unlabelled data.
In Practice
Per-token (i.e., per-word) accuracy results on the WSJ corpus:
Most frequent tag baseline: ~90-94%
HMMs (Brants, 2000): 96.5%
Stanford tagger (Manning, 2011): 97.32%
Other Sequence Modelling Tasks
Chunking (a.k.a. shallow parsing): find syntactic chunks in the sentence; not hierarchical.
[NP The chicken] [V crossed] [NP the road] [P across] [NP the lake].
Named-Entity Recognition (NER): identify elements in text that correspond to some high-level categories (e.g., PERSON, ORGANIZATION, LOCATION).
[ORG McGill University] is located in [LOC Montreal, Canada].
Problem: need to detect spans of multiple words.
First Attempt
Simply label words by their category:
ORG     ORG   ORG   -    ORG        -    -        -   LOC
McGill, UQAM, UdeM, and  Concordia  are  located  in  Montreal.
What is the problem with this scheme?
IOB Tagging
Label whether a word is inside, outside, or at the beginning of a span, as well as its category. For n categories, 2n+1 total tags.
B-ORG   I-ORG       O   O        O   B-LOC      I-LOC
McGill  University  is  located  in  Montreal,  Canada
B-ORG   B-ORG  B-ORG  O    B-ORG      O    O        O   B-LOC
McGill, UQAM,  UdeM,  and  Concordia  are  located  in  Montreal.
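Recovering spans from IOB tags is then a simple scan. A sketch; the function name and the half-open (category, start, end) span convention are mine:

```python
def iob_to_spans(tags):
    """Scan IOB tags like B-ORG / I-ORG / O into (category, start, end) spans."""
    spans, start, cat = [], None, None
    for i, tag in enumerate(tags + ["O"]):       # sentinel flushes a span at the end
        if tag == "O" or tag.startswith("B-"):   # anything but I-XXX closes the open span
            if cat is not None:
                spans.append((cat, start, i))
            start, cat = (i, tag[2:]) if tag.startswith("B-") else (None, None)
    return spans

tags = ["B-ORG", "I-ORG", "O", "O", "O", "B-LOC", "I-LOC"]
print(iob_to_spans(tags))                        # [('ORG', 0, 2), ('LOC', 5, 7)]
```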