Machine Learning & Data Mining CS/CNS/EE 155 Lecture 8: Hidden Markov Models
Sequence Prediction (POS Tagging)
x = Fish Sleep, y = (N, V)
x = The Dog Ate My Homework, y = (D, N, V, D, N)
x = The Fox Jumped Over The Fence, y = (D, N, V, P, D, N)
Challenges
Multivariate Output: make multiple predictions simultaneously.
Variable-Length Input/Output: sentence lengths are not fixed.
Multivariate Outputs
x = Fish Sleep, y = (N, V)
POS Tags: Det, Noun, Verb, Adj, Adv, Prep (how many classes?)
Multiclass prediction: replicate the weights and biases, one per class:
w = (w_1, w_2, ..., w_K),  b = (b_1, b_2, ..., b_K)
Score all classes:  f(x | w, b) = (w_1^T x - b_1, w_2^T x - b_2, ..., w_K^T x - b_K)
Predict via largest score:  argmax_k  w_k^T x - b_k
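A minimal sketch of this replicate-weights scheme (the function and variable names are illustrative, not from the slides): every class k keeps its own (w_k, b_k), all classes are scored, and the largest score wins.

import numpy as np

def predict_multiclass(x, W, b):
    # Score all K classes as w_k^T x - b_k and predict the class with the largest score.
    scores = W @ x - b            # shape (K,)
    return int(np.argmax(scores)), scores

# Toy example: K = 3 classes, 2 features.
W = np.array([[ 1.0,  0.0],
              [ 0.0,  1.0],
              [-1.0, -1.0]])
b = np.array([0.0, 0.5, 0.0])
x = np.array([0.2, 0.9])
k, scores = predict_multiclass(x, W, b)
print(k, scores)                  # index of the highest-scoring class and all K scores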
Multiclass Prediction
x = Fish Sleep, y = (N, V)
POS Tags: Det, Noun, Verb, Adj, Adv, Prep (L = 6)
Naive idea: treat every possible length-M sequence as a different class:
(D, D), (D, N), (D, V), (D, Adj), (D, Adv), (D, Pr)
(N, D), (N, N), (N, V), (N, Adj), (N, Adv), ...
L^M classes! Length 2: 6^2 = 36.
Not tractable for sequence prediction: exponential explosion in #classes!
Why is Naïve Multiclass Intractable?
x = I fish often
POS Tags: Det, Noun, Verb, Adj, Adv, Prep
(D, D, D), (D, D, N), (D, D, V), (D, D, Adj), (D, D, Adv), (D, D, Pr)
(D, N, D), (D, N, N), (D, N, V), (D, N, Adj), (D, N, Adv), (D, N, Pr)
(D, V, D), (D, V, N), (D, V, V), (D, V, Adj), (D, V, Adv), (D, V, Pr)
(N, D, D), (N, D, N), (N, D, V), (N, D, Adj), (N, D, Adv), (N, D, Pr)
(N, N, D), (N, N, N), (N, N, V), (N, N, Adj), (N, N, Adv), (N, N, Pr)
...
Treats every combination as a different class (learns a model for each combination).
Exponentially large representation! (Exponential time to consider every class; exponential storage.)
Assume pronouns are nouns for simplicity.
Independent Classification
x = I fish often
POS Tags: Det, Noun, Verb, Adj, Adv, Prep
Treat each word independently (assumption): independent multiclass prediction per word.
Predict for x = I independently; predict for x = fish independently; predict for x = often independently. Concatenate the predictions.
#Classes = #POS Tags (6 in our example), so each word is solvable using standard multiclass prediction.
Assume pronouns are nouns for simplicity.
Independent Classification
x = I fish often
Treat each word independently: independent multiclass prediction per word.

P(y|x)      x=I     x=fish   x=often
y=Det       0.0     0.0      0.0
y=Noun      1.0     0.75     0.0
y=Verb      0.0     0.25     0.0
y=Adj       0.0     0.0      0.4
y=Adv       0.0     0.0      0.6
y=Prep      0.0     0.0      0.0

Prediction: (N, N, Adv). Correct: (N, V, Adv). Why the mistake?
Assume pronouns are nouns for simplicity.
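A tiny sketch of the independent scheme on this example (the dictionary layout and names are mine); it reproduces the (N, N, Adv) mistake because each word is tagged in isolation.

# Per-word probabilities from the table above (zero rows omitted).
probs = {
    "I":     {"Noun": 1.0},
    "fish":  {"Noun": 0.75, "Verb": 0.25},
    "often": {"Adj": 0.4, "Adv": 0.6},
}

def predict_independently(sentence):
    # Pick the highest-probability tag for each word in isolation.
    return [max(probs[w], key=probs[w].get) for w in sentence]

print(predict_independently(["I", "fish", "often"]))   # ['Noun', 'Noun', 'Adv']: misses the Verb reading of "fish"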
Context Between Words
x = I fish often
POS Tags: Det, Noun, Verb, Adj, Adv, Prep
Independent predictions ignore word pairs. In isolation, "fish" is more likely to be a Noun; but conditioned on following a (pro)noun, "fish" is more likely to be a Verb!
1st-order dependence models all pairs; 2nd-order considers all triplets; arbitrary order = exponential size (naïve multiclass).
Assume pronouns are nouns for simplicity.
1st-Order Hidden Markov Model
x = (x_1, x_2, x_3, x_4, ..., x_M)  (sequence of words)
y = (y_1, y_2, y_3, y_4, ..., y_M)  (sequence of POS tags)
P(x_i | y_i): probability of state y_i generating x_i
P(y_{i+1} | y_i): probability of state y_i transitioning to y_{i+1}
P(y_1 | y_0): y_0 is defined to be the Start state
P(End | y_M): prior probability of y_M being the final state (not always used)
Graphical Model Representation
[Graphical model: state chain Y_0 → Y_1 → Y_2 → ... → Y_M → Y_End, with each state Y_i emitting observation X_i; the Y_End node is optional.]
P(x, y) = P(End | y_M) ∏_{i=1}^{M} P(y_i | y_{i-1}) ∏_{i=1}^{M} P(x_i | y_i)
1st-Order Hidden Markov Model
Joint distribution:  P(x, y) = P(End | y_M) ∏_{i=1}^{M} P(y_i | y_{i-1}) ∏_{i=1}^{M} P(x_i | y_i)   (the End term is optional)
P(x_i | y_i): probability of state y_i generating x_i
P(y_{i+1} | y_i): probability of state y_i transitioning to y_{i+1}
P(y_1 | y_0): y_0 is defined to be the Start state
P(End | y_M): prior probability of y_M being the final state (not always used)
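A direct sketch of this factorization as code (dictionary-based parameters and names are mine; the transition and emission numbers below are the ones used in the fish/sleep example later in the lecture):

def hmm_joint_prob(x, y, trans, emit, use_end=True):
    # P(x, y) = P(End|y_M) * prod_i P(y_i|y_{i-1}) * prod_i P(x_i|y_i), with y_0 = Start.
    p = 1.0
    prev = "Start"
    for word, tag in zip(x, y):
        p *= trans[prev][tag] * emit[tag][word]
        prev = tag
    if use_end:
        p *= trans[prev]["End"]    # the optional End term
    return p

trans = {"Start": {"noun": 0.8, "verb": 0.2},
         "noun":  {"noun": 0.1, "verb": 0.8, "End": 0.1},
         "verb":  {"noun": 0.2, "verb": 0.1, "End": 0.7}}
emit = {"noun": {"fish": 0.8, "sleep": 0.2},
        "verb": {"fish": 0.5, "sleep": 0.5}}
print(hmm_joint_prob(["fish", "sleep"], ["noun", "verb"], trans, emit))   # 0.8*0.8 * 0.8*0.5 * 0.7 = 0.1792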
1st-Order Hidden Markov Model
Conditional distribution of x given y:  P(x | y) = ∏_{i=1}^{M} P(x_i | y_i)
Given a POS tag sequence y, each P(x_i | y_i) can be computed independently (the x_i are conditionally independent given the y_i).
(Same component definitions as on the previous slide.)
1st-Order Hidden Markov Model
Additional complexity of (#POS Tags)^2: models all state-state pairs (all POS tag-tag pairs).
Models all state-observation pairs (all tag-word pairs): same complexity as independent multiclass.
(Same component definitions as above.)
Relationship to Naïve Bayes
[Two graphical models: the HMM chain Y_0 → Y_1 → ... → Y_M → Y_End with emissions X_1, ..., X_M, and the same model with the transition edges removed, leaving disjoint Y_i → X_i pairs.]
If we ignore the transition probabilities, the HMM reduces to a sequence of disjoint Naïve Bayes models.
P(word | state/tag)
Two-word language: fish and sleep. Two-tag language: Noun and Verb.

P(x|y)      y=Noun   y=Verb
x=fish       0.8      0.5
x=sleep      0.2      0.5

Given a tag sequence y:
P("fish sleep" | (N, V)) = 0.8 * 0.5
P("fish fish" | (N, V)) = 0.8 * 0.5
P("sleep fish" | (V, N)) = 0.5 * 0.8
P("sleep sleep" | (N, V)) = 0.2 * 0.5
Slides borrowed from Ralph Grishman
Sampling
HMMs are generative models: they model the joint distribution P(x, y), so we can generate samples from it.
First consider the conditional distribution P(x | y):

P(x|y)      y=Noun   y=Verb
x=fish       0.8      0.5
x=sleep      0.2      0.5

Given tag sequence y = (N, V), sample each word independently:
Sample x_1 from P(x_1 | N): fish with prob. 0.8, sleep with prob. 0.2.
Sample x_2 from P(x_2 | V): fish with prob. 0.5, sleep with prob. 0.5.
What about sampling from P(x, y)?
Forward Sampling of P(y, x)
P(x, y) = P(End | y_M) ∏_{i=1}^{M} P(y_i | y_{i-1}) ∏_{i=1}^{M} P(x_i | y_i)

P(x|y)      y=Noun   y=Verb
x=fish       0.8      0.5
x=sleep      0.2      0.5

[State diagram: P(noun|start) = 0.8, P(verb|start) = 0.2, P(verb|noun) = 0.8, P(noun|verb) = 0.2, P(end|verb) = 0.7.]
Initialize y_0 = Start, i = 0.
1. i = i + 1
2. Sample y_i from P(y_i | y_{i-1})
3. If y_i == End: quit
4. Sample x_i from P(x_i | y_i)
5. Go to Step 1
Exploits conditional independence. Requires P(End | y_i).
Slides borrowed from Ralph Grishman
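A runnable sketch of this sampling loop (function names and dictionary layout are mine; the transition values follow the diagram plus the later worked example):

import random

trans = {"Start": {"noun": 0.8, "verb": 0.2},
         "noun":  {"noun": 0.1, "verb": 0.8, "End": 0.1},
         "verb":  {"noun": 0.2, "verb": 0.1, "End": 0.7}}
emit = {"noun": {"fish": 0.8, "sleep": 0.2},
        "verb": {"fish": 0.5, "sleep": 0.5}}

def sample_from(dist):
    outcomes, weights = zip(*dist.items())
    return random.choices(outcomes, weights=weights)[0]

def forward_sample():
    # Alternate transition and emission samples until the End state is drawn.
    y, x = [], []
    state = "Start"
    while True:
        state = sample_from(trans[state])      # Step 2: sample y_i from P(y_i | y_{i-1})
        if state == "End":                     # Step 3: quit once End is reached
            return y, x
        y.append(state)
        x.append(sample_from(emit[state]))     # Step 4: sample x_i from P(x_i | y_i)

print(forward_sample())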
Forward Sampling of P(y, x | M)   (fixed length M)
P(x, y | M) = ∏_{i=1}^{M} P(y_i | y_{i-1}) ∏_{i=1}^{M} P(x_i | y_i)

(Same emission table as above.)
[State diagram without an End state: P(noun|start) = 0.8, P(verb|start) = 0.2, P(verb|noun) = 0.91, P(noun|noun) = 0.09, P(noun|verb) = 0.667, P(verb|verb) = 0.333.]
Initialize y_0 = Start, i = 0.
1. i = i + 1
2. If i > M: quit
3. Sample y_i from P(y_i | y_{i-1})
4. Sample x_i from P(x_i | y_i)
5. Go to Step 1
Exploits conditional independence. Assumes no P(End | y_i).
Slides borrowed from Ralph Grishman
1st-Order Hidden Markov Model
P(x_{k+1:M}, y_{k+1:M} | x_{1:k}, y_{1:k}) = P(x_{k+1:M}, y_{k+1:M} | y_k)
Memory-less: the model only needs y_k to model the rest of the sequence.
(Same component definitions as above.)
Viterbi Algorithm
Most Common Prediction Problem
Given an input sentence, predict the POS tag sequence:  argmax_y P(y | x)
Naive approach: try all possible y's and choose the one with the highest probability. Exponential time: L^M possible y's.
Bayes's Rule:
argmax_y P(y | x) = argmax_y P(y, x) / P(x) = argmax_y P(y, x) = argmax_y P(x | y) P(y)
where  P(x | y) = ∏_{i=1}^{M} P(x_i | y_i)   and   P(y) = P(End | y_M) ∏_{i=1}^{M} P(y_i | y_{i-1})
argmax_y P(y, x) = argmax_y ∏_{i=1}^{M} P(y_i | y_{i-1}) ∏_{i=1}^{M} P(x_i | y_i)
 = argmax_{y_M} argmax_{y_{1:M-1}} P(y_M | y_{M-1}) P(x_M | y_M) P(y_{1:M-1}, x_{1:M-1})
where  P(y_{1:k}, x_{1:k}) = P(x_{1:k} | y_{1:k}) P(y_{1:k}),   P(x_{1:k} | y_{1:k}) = ∏_{i=1}^{k} P(x_i | y_i),   P(y_{1:k}) = ∏_{i=0}^{k-1} P(y_{i+1} | y_i)
Exploit the memory-less property: the choice of y_M only depends on y_{1:M-1} via P(y_M | y_{M-1})!
Dynamic Programming
Input: x = (x_1, x_2, x_3, ..., x_M)
Computed: the best length-k prefix ending in each tag, e.g.:
Ŷ_k(V) = argmax_{y_{1:k-1}} P(y_{1:k-1} ⊕ V, x_{1:k})      Ŷ_k(N) = argmax_{y_{1:k-1}} P(y_{1:k-1} ⊕ N, x_{1:k})
(⊕ denotes sequence concatenation.)
Claim (recursive definition):
Ŷ_{k+1}(V) = argmax_{y_{1:k} ∈ {Ŷ_k(T)}_T} P(y_{1:k} ⊕ V, x_{1:k+1})
           = argmax_{y_{1:k} ∈ {Ŷ_k(T)}_T} P(y_{1:k}, x_{1:k}) P(y_{k+1} = V | y_k) P(x_{k+1} | y_{k+1} = V)
The P(y_{1:k}, x_{1:k}) terms are pre-computed.
Solve:  Ŷ_2(V) = argmax_{y_1 ∈ {Ŷ_1(T)}_T} P(y_1, x_1) P(y_2 = V | y_1) P(x_2 | y_2 = V)
Store each Ŷ_1(Z) and P(Ŷ_1(Z), x_1).
[Trellis: each of Ŷ_1(V), Ŷ_1(D), Ŷ_1(N) has an arrow (y_1 = V, D, N) into Ŷ_2(V), and similarly into Ŷ_2(D) and Ŷ_2(N).]
Ŷ_1(Z) is just Z.  Ex: Ŷ_2(V) = (N, V).
Solve:  Ŷ_3(V) = argmax_{y_{1:2} ∈ {Ŷ_2(T)}_T} P(y_{1:2}, x_{1:2}) P(y_3 = V | y_2) P(x_3 | y_3 = V)
Store each Ŷ_1(Z) and P(Ŷ_1(Z), x_1); store each Ŷ_2(Z) and P(Ŷ_2(Z), x_{1:2}).
[Trellis: arrows y_2 = V, D, N from Ŷ_2(V), Ŷ_2(D), Ŷ_2(N) into Ŷ_3(V), Ŷ_3(D), Ŷ_3(N).]
Ex: Ŷ_2(V) = (N, V).
Claim: we only need to check the solutions Ŷ_2(Z), Z = V, D, N.
Proof sketch: suppose Ŷ_2(V) = (N, V) but Ŷ_3(V) = (V, V, V); then (N, V, V) has at least as high probability. The probabilities of (V, V, V) and (N, V, V) differ only in the three terms P(y_1 | y_0), P(x_1 | y_1), P(y_2 | y_1), and none of these depend on y_3! The proof depends on the 1st-order property.
Final step (with the optional End term):
Ŷ_M(V) = argmax_{y_{1:M-1} ∈ {Ŷ_{M-1}(T)}_T} P(y_{1:M-1}, x_{1:M-1}) P(y_M = V | y_{M-1}) P(x_M | y_M = V) P(End | y_M = V)
Store each Ŷ_1(Z) & P(Ŷ_1(Z), x_1), each Ŷ_2(Z) & P(Ŷ_2(Z), x_{1:2}), each Ŷ_3(Z) & P(Ŷ_3(Z), x_{1:3}), ...
Ex: Ŷ_2(V) = (N, V), Ŷ_3(V) = (D, N, V).
Viterbi Algorithm
Solve:  argmax_y P(y | x) = argmax_y P(y, x) / P(x) = argmax_y P(y, x) = argmax_y P(x | y) P(y)
For k = 1..M: iteratively solve for each Ŷ_k(Z), with Z looping over every POS tag.
Predict the best Ŷ_M(Z).
Also known as Maximum A Posteriori (MAP) inference.
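A compact sketch of the whole procedure (dictionary-based parameters; all names are mine). delta[k][z] plays the role of P(Ŷ_{k+1}(z), x_{1:k+1}) and psi stores the back pointers:

def viterbi(x, states, trans, emit, use_end=True):
    # Most likely state sequence argmax_y P(y, x) for the observation list x.
    delta = [{z: trans["Start"][z] * emit[z][x[0]] for z in states}]   # k = 1
    psi = [{}]
    for k in range(1, len(x)):
        d, back = {}, {}
        for z in states:
            best_prev = max(states, key=lambda zp: delta[-1][zp] * trans[zp][z])
            d[z] = delta[-1][best_prev] * trans[best_prev][z] * emit[z][x[k]]
            back[z] = best_prev
        delta.append(d)
        psi.append(back)
    # Optional End term, then follow back pointers from the best final state.
    final = {z: delta[-1][z] * (trans[z]["End"] if use_end else 1.0) for z in states}
    best = max(states, key=lambda z: final[z])
    path = [best]
    for back in reversed(psi[1:]):
        path.append(back[path[-1]])
    return list(reversed(path)), final[best]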
Numerical Example
x = (Fish, Sleep)
[State diagram: P(noun|start) = 0.8, P(verb|start) = 0.2, P(noun|noun) = 0.1, P(verb|noun) = 0.8, P(end|noun) = 0.1, P(noun|verb) = 0.2, P(verb|verb) = 0.1, P(end|verb) = 0.7.]
Emission probabilities: P(fish|noun) = 0.8, P(sleep|noun) = 0.2, P(fish|verb) = 0.5, P(sleep|verb) = 0.5.
Slides borrowed from Ralph Grishman
Viterbi trellis: rows = {start, verb, noun, end}, columns = positions 0 to 3.
Initialization (position 0): start = 1, verb = 0, noun = 0, end = 0.
Token 1: fish.
verb: P(verb|start) * P(fish|verb) = 0.2 * 0.5 = 0.1
noun: P(noun|start) * P(fish|noun) = 0.8 * 0.8 = 0.64
(start and end stay at 0 for position 1.)
Token 2: sleep (if fish is a verb).
verb: 0.1 * P(verb|verb) * P(sleep|verb) = 0.1 * 0.1 * 0.5 = 0.005
noun: 0.1 * P(noun|verb) * P(sleep|noun) = 0.1 * 0.2 * 0.2 = 0.004
Token 2: sleep (if fish is a noun).
verb: 0.64 * P(verb|noun) * P(sleep|verb) = 0.64 * 0.8 * 0.5 = 0.256
noun: 0.64 * P(noun|noun) * P(sleep|noun) = 0.64 * 0.1 * 0.2 = 0.0128
Token 2: sleep. Take the maximum over predecessors and set back pointers:
verb: max(0.005, 0.256) = 0.256 (back pointer to noun)
noun: max(0.004, 0.0128) = 0.0128 (back pointer to noun)
Token 3: end. Take the maximum and set back pointers:
end: max(0.256 * P(end|verb), 0.0128 * P(end|noun)) = max(0.256 * 0.7, 0.0128 * 0.1) = 0.1792 (back pointer to verb)
Decode by following the back pointers from the end state: fish = noun, sleep = verb.
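For reference, running the Viterbi sketch from the earlier slide with these numbers (the dictionary layout is mine) reproduces the trellis and the decoded sequence:

states = ["noun", "verb"]
trans = {"Start": {"noun": 0.8, "verb": 0.2},
         "noun":  {"noun": 0.1, "verb": 0.8, "End": 0.1},
         "verb":  {"noun": 0.2, "verb": 0.1, "End": 0.7}}
emit = {"noun": {"fish": 0.8, "sleep": 0.2},
        "verb": {"fish": 0.5, "sleep": 0.5}}

path, prob = viterbi(["fish", "sleep"], states, trans, emit)
print(path, prob)   # ['noun', 'verb'] 0.1792  (= 0.256 * 0.7)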
What might go wrong for long sequences? Underflow!
Small numbers get repeatedly multiplied together and become exponentially small.
Viterbi Algorithm (with Log Probabilities)
Solve:  argmax_y P(y | x) = argmax_y P(y, x) / P(x) = argmax_y P(y, x) = argmax_y log P(x | y) + log P(y)
For k = 1..M: iteratively solve for each log Ŷ_k(Z), with Z looping over every POS tag. Predict the best log Ŷ_M(Z).
The log probability of Ŷ_M(Z) accumulates additively, not multiplicatively.
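In the Viterbi sketch above this is a small change: store log probabilities and add instead of multiply. A minimal version of the inner update, assuming no zero-probability entries:

import math

def log_viterbi_update(prev_log_delta, states, trans, emit, word):
    # prev_log_delta[z] holds log P(Y_k(z), x_{1:k}); products become sums of logs.
    d, back = {}, {}
    for z in states:
        best_prev = max(states, key=lambda zp: prev_log_delta[zp] + math.log(trans[zp][z]))
        d[z] = prev_log_delta[best_prev] + math.log(trans[best_prev][z]) + math.log(emit[z][word])
        back[z] = best_prev
    return d, back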
Recap: Independent Classification
x = I fish often
Treat each word independently: independent multiclass prediction per word (same P(y|x) table as earlier).
Prediction: (N, N, Adv). Correct: (N, V, Adv).
The mistake comes from not modeling multiple words jointly.
Assume pronouns are nouns for simplicity.
Recap: Viterbi
Models pairwise transitions between states (pairwise transitions between POS tags): a 1st-order model.
P(x, y) = P(End | y_M) ∏_{i=1}^{M} P(y_i | y_{i-1}) ∏_{i=1}^{M} P(x_i | y_i)
x = I fish often
Independent: (N, N, Adv).  HMM Viterbi: (N, V, Adv).
*Assuming we defined P(x, y) properly.
Training HMMs
Supervised Training
Given: S = {(x^j, y^j)}_{j=1}^{N}   (word sequence / sentence, POS tag sequence)
Goal: estimate P(x, y) using S, where
P(x, y) = P(End | y_M) ∏_{i=1}^{M} P(y_i | y_{i-1}) ∏_{i=1}^{M} P(x_i | y_i)
Maximum Likelihood!
Aside: Matrix Formulation
Define the transition matrix A:  A_ab = P(y_{i+1} = a | y_i = b), or log P(y_{i+1} = a | y_i = b).

P(y_next | y)     y=Noun   y=Verb
y_next=Noun        0.09     0.667
y_next=Verb        0.91     0.333

Define the observation matrix O:  O_wz = P(x_i = w | y_i = z), or log P(x_i = w | y_i = z).

P(x|y)      y=Noun   y=Verb
x=fish       0.8      0.5
x=sleep      0.2      0.5
Aside: Matrix Formulation
P(x, y) = P(End | y_M) ∏_{i=1}^{M} P(y_i | y_{i-1}) ∏_{i=1}^{M} P(x_i | y_i) = A_{End, y_M} ∏_{i=1}^{M} A_{y_i, y_{i-1}} ∏_{i=1}^{M} O_{x_i, y_i}
Log-probability formulation:
log P(x, y) = Ã_{End, y_M} + Σ_{i=1}^{M} Ã_{y_i, y_{i-1}} + Σ_{i=1}^{M} Õ_{x_i, y_i}
where each entry of Ã is defined as log(A), and likewise for Õ.
Maximum Likelihood
argmax_{A,O} ∏_{(x,y)∈S} P(x, y) = argmax_{A,O} ∏_{(x,y)∈S} P(End | y_M) ∏_{i=1}^{M} P(y_i | y_{i-1}) ∏_{i=1}^{M} P(x_i | y_i)
Estimate each component separately:
A_ab = ( Σ_{j=1}^{N} Σ_{i=0}^{M_j} 1[y^j_{i+1} = a] 1[y^j_i = b] ) / ( Σ_{j=1}^{N} Σ_{i=0}^{M_j} 1[y^j_i = b] )
O_wz = ( Σ_{j=1}^{N} Σ_{i=1}^{M_j} 1[x^j_i = w] 1[y^j_i = z] ) / ( Σ_{j=1}^{N} Σ_{i=1}^{M_j} 1[y^j_i = z] )
(Derived by minimizing the negative log likelihood.)
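A counting sketch of these estimates (dictionary-based; names are mine). Transition counts include Start to the first tag and, optionally, the last tag to End:

from collections import defaultdict

def estimate_hmm(labeled, use_end=True):
    # Maximum-likelihood A and O from labeled sequences: labeled = [(words, tags), ...].
    trans_counts = defaultdict(lambda: defaultdict(float))
    emit_counts = defaultdict(lambda: defaultdict(float))
    for words, tags in labeled:
        prev = "Start"
        for w, t in zip(words, tags):
            trans_counts[prev][t] += 1.0
            emit_counts[t][w] += 1.0
            prev = t
        if use_end:
            trans_counts[prev]["End"] += 1.0
    normalize = lambda row: {k: v / sum(row.values()) for k, v in row.items()}
    A = {b: normalize(row) for b, row in trans_counts.items()}   # A[b][a] = P(y_{i+1} = a | y_i = b)
    O = {z: normalize(row) for z, row in emit_counts.items()}    # O[z][w] = P(x_i = w | y_i = z)
    return A, O

A, O = estimate_hmm([(["fish", "sleep"], ["noun", "verb"]), (["fish"], ["noun"])])
print(A["Start"], O["noun"])   # {'noun': 1.0} {'fish': 1.0}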
Recap: Supervised Training
argmax_{A,O} ∏_{(x,y)∈S} P(x, y) = argmax_{A,O} ∏_{(x,y)∈S} P(End | y_M) ∏_{i=1}^{M} P(y_i | y_{i-1}) ∏_{i=1}^{M} P(x_i | y_i)
Maximum likelihood training reduces to counting statistics: super easy! Why?
What about the unsupervised case?
Conditional Independence Assumptions
argmax_{A,O} ∏_{(x,y)∈S} P(x, y) = argmax_{A,O} ∏_{(x,y)∈S} P(End | y_M) ∏_{i=1}^{M} P(y_i | y_{i-1}) ∏_{i=1}^{M} P(x_i | y_i)
Everything decomposes into products of pairs, i.e., P(y_{i+1} = a | y_i = b) doesn't depend on anything else.
So we can just estimate frequencies: how often y_{i+1} = a when y_i = b over the training set.
Note that P(y_{i+1} = a | y_i = b) is a common model across all locations of all sequences.
#Parameters: transitions A: #Tags^2; observations O: #Words x #Tags.
This avoids directly modeling word/word pairings (#Tags = 10s, #Words = 10,000s).
Unsupervised Training
What if there are no y's? Just a training set of sentences S = {x^j}_{j=1}^{N} (word sequences).
We still want to estimate P(x, y). How? Why?
argmax_{A,O} ∏_j P(x^j) = argmax_{A,O} ∏_j Σ_y P(x^j, y)
Why Unsupervised Training?
Supervised data is hard to acquire: it requires annotating POS tags. Unsupervised data is plentiful: just grab some text!
It might just work for POS tagging: learn y's that correspond to POS tags.
It can also be used for other tasks: detecting outlier sentences (sentences with low probability), or sampling new sentences.
EM Algorithm (Baum-Welch)
If we had the y's, we could do maximum likelihood. If we had (A, O), we could predict the y's. Chicken vs. egg!
1. Initialize A and O arbitrarily
2. Predict the probabilities of the y's for each training x   (Expectation step)
3. Use the y's to estimate a new (A, O)   (Maximization step)
4. Repeat from Step 2 until convergence
http://en.wikipedia.org/wiki/Baum%E2%80%93Welch_algorithm
Expectation Step
Given (A, O), for each training x = (x_1, ..., x_M), predict P(y_i) for each position of y = (y_1, ..., y_M):

              x_1    x_2    ...    x_M
P(y_i=Noun)   0.5    0.4    ...    0.05
P(y_i=Det)    0.4    0.6    ...    0.25
P(y_i=Verb)   0.0    ...    ...    0.7

This encodes the current model's beliefs about y: the marginal distribution of each y_i.
Recall: Matrix Formulation
Transition matrix A:  A_ab = P(y_{i+1} = a | y_i = b), or log P(y_{i+1} = a | y_i = b).
Observation matrix O:  O_wz = P(x_i = w | y_i = z), or log P(x_i = w | y_i = z).
(Same example tables as before.)
Maximization Step
Maximize the likelihood over the marginal distributions: replace observed counts with expected counts.
Supervised:
A_ab = ( Σ_{j=1}^{N} Σ_{i=0}^{M_j} 1[y^j_{i+1} = a] 1[y^j_i = b] ) / ( Σ_{j=1}^{N} Σ_{i=0}^{M_j} 1[y^j_i = b] )
O_wz = ( Σ_{j=1}^{N} Σ_{i=1}^{M_j} 1[x^j_i = w] 1[y^j_i = z] ) / ( Σ_{j=1}^{N} Σ_{i=1}^{M_j} 1[y^j_i = z] )
Unsupervised (using the marginals from the E-step):
A_ab = ( Σ_{j=1}^{N} Σ_{i=0}^{M_j} P(y^j_i = b, y^j_{i+1} = a) ) / ( Σ_{j=1}^{N} Σ_{i=0}^{M_j} P(y^j_i = b) )
O_wz = ( Σ_{j=1}^{N} Σ_{i=1}^{M_j} 1[x^j_i = w] P(y^j_i = z) ) / ( Σ_{j=1}^{N} Σ_{i=1}^{M_j} P(y^j_i = z) )
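A sketch of the unsupervised M-step (layout and names are mine; Start/End transitions are skipped for brevity). It assumes the E-step has produced per-position marginals gamma[j][i][z] ≈ P(y_i = z | x^j) and pairwise marginals xi[j][i][(b, a)] ≈ P(y_i = b, y_{i+1} = a | x^j):

from collections import defaultdict

def m_step(sentences, gamma, xi):
    # Re-estimate A and O from expected counts instead of observed counts.
    trans_num, trans_den = defaultdict(float), defaultdict(float)
    emit_num, emit_den = defaultdict(float), defaultdict(float)
    for j, words in enumerate(sentences):
        for i, w in enumerate(words):
            for z, p in gamma[j][i].items():
                emit_num[(w, z)] += p
                emit_den[z] += p
        for i in range(len(words) - 1):
            for (b, a), p in xi[j][i].items():
                trans_num[(a, b)] += p
                trans_den[b] += p
    A = {ab: trans_num[ab] / trans_den[ab[1]] for ab in trans_num}   # A[(a, b)] = P(y_{i+1} = a | y_i = b)
    O = {wz: emit_num[wz] / emit_den[wz[1]] for wz in emit_num}      # O[(w, z)] = P(x_i = w | y_i = z)
    return A, O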
Computing Marginals (Forward-Backward Algorithm)
Solving the E-step requires computing the marginals P(y_i = z) for every position of every training sentence.
This can be solved using dynamic programming, similar to Viterbi.
Notation
α_z(i) = P(x_{1:i}, y_i = Z | A, O): probability of observing the prefix x_{1:i} and having the i-th state be y_i = Z.
β_z(i) = P(x_{i+1:M} | y_i = Z, A, O): probability of observing the suffix x_{i+1:M} given that the i-th state is y_i = Z.
Computing marginals = combining the two terms:
P(y_i = z | x) = α_z(i) β_z(i) / Σ_{z'} α_{z'}(i) β_{z'}(i)
http://en.wikipedia.org/wiki/Baum%E2%80%93Welch_algorithm
Notation
With α_z(i) = P(x_{1:i}, y_i = Z | A, O) and β_z(i) = P(x_{i+1:M} | y_i = Z, A, O) as before, the pairwise marginals also combine the two terms:
P(y_i = b, y_{i-1} = a | x) = α_a(i-1) P(y_i = b | y_{i-1} = a) P(x_i | y_i = b) β_b(i) / Σ_{a',b'} α_{a'}(i-1) P(y_i = b' | y_{i-1} = a') P(x_i | y_i = b') β_{b'}(i)
http://en.wikipedia.org/wiki/Baum%E2%80%93Welch_algorithm
Forward (sub-)Algorithm
Solve for every α_z(i) = P(x_{1:i}, y_i = Z | A, O).
Naively: α_z(i) = Σ_{y_{1:i-1}} P(x_{1:i}, y_i = Z, y_{1:i-1} | A, O).  Exponential time!
Can be computed recursively (like Viterbi):
α_z(1) = P(y_1 = z | y_0) P(x_1 | y_1 = z) = O_{x_1, z} A_{z, Start}
α_z(i+1) = O_{x_{i+1}, z} Σ_{j=1}^{L} α_j(i) A_{z, j}
Viterbi effectively replaces the sum with a max.
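A sketch of this recursion (dictionary-based, same parameter layout as the earlier sketches; indexing is 0-based, so alpha[i-1][z] corresponds to α_z(i) on the slide):

def forward(x, states, trans, emit):
    # alpha[i][z] = P(x_{1:i+1}, y_{i+1} = z); built left to right.
    alpha = [{z: trans["Start"][z] * emit[z][x[0]] for z in states}]
    for i in range(1, len(x)):
        alpha.append({z: emit[z][x[i]] * sum(alpha[-1][j] * trans[j][z] for j in states)
                      for z in states})
    return alpha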
Backward (sub-)Algorithm
Solve for every β_z(i) = P(x_{i+1:M} | y_i = Z, A, O).
Naively: β_z(i) = Σ_{y_{i+1:M}} P(x_{i+1:M}, y_{i+1:M} | y_i = Z, A, O).  Exponential time!
Can be computed recursively (like Viterbi):
β_z(M) = 1
β_z(i) = Σ_{j=1}^{L} β_j(i+1) A_{j, z} O_{x_{i+1}, j}
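The matching backward sketch (base case β_z(M) = 1, as on the slide; indexing is again 0-based):

def backward(x, states, trans, emit):
    # beta[i][z] = P(x_{i+2:M} | y_{i+1} = z); built right to left with beta[M-1][z] = 1.
    M = len(x)
    beta = [dict() for _ in range(M)]
    beta[M - 1] = {z: 1.0 for z in states}
    for i in range(M - 2, -1, -1):
        beta[i] = {z: sum(beta[i + 1][j] * trans[z][j] * emit[j][x[i + 1]] for j in states)
                   for z in states}
    return beta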
Forward-Backward Algorithm
Run forward to get α_z(i) = P(x_{1:i}, y_i = Z | A, O); run backward to get β_z(i) = P(x_{i+1:M} | y_i = Z, A, O).
For each training x = (x_1, ..., x_M), compute each marginal P(y_i) for y = (y_1, ..., y_M):
P(y_i = z | x) = α_z(i) β_z(i) / Σ_{z'} α_{z'}(i) β_{z'}(i)
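Combining the two sketches gives the per-position marginals needed by the E-step (assuming the forward/backward functions sketched above):

def marginals(x, states, trans, emit):
    # gamma[i][z] = P(y_{i+1} = z | x) = alpha_z * beta_z / sum over z' of alpha_z' * beta_z'.
    alpha = forward(x, states, trans, emit)
    beta = backward(x, states, trans, emit)
    gamma = []
    for a, b in zip(alpha, beta):
        norm = sum(a[z] * b[z] for z in states)
        gamma.append({z: a[z] * b[z] / norm for z in states})
    return gamma

# Example: marginals(["fish", "sleep"], states, trans, emit) with the fish/sleep model from earlier.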
Recap: Unsupervised Training
Train using only word sequences S = {x^j}_{j=1}^{N} (sentences).
The y's are hidden states: all pairwise transitions go through the y's, hence "hidden" Markov model.
Train using the EM algorithm; it converges to a local optimum.
Initialization
How to choose the number of hidden states?
By hand, or by cross-validation: evaluate P(x) on validation data.
P(x) can be computed via the forward algorithm:
P(x) = Σ_y P(x, y) = Σ_z α_z(M) P(End | y_M = z)
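With the forward sketch from before, this validation likelihood is one extra sum over the last column (assuming the model includes an End state):

def sequence_likelihood(x, states, trans, emit):
    # P(x) = sum over z of alpha_z(M) * P(End | y_M = z).
    alpha = forward(x, states, trans, emit)
    return sum(alpha[-1][z] * trans[z]["End"] for z in states)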
Recap: Sequence Prediction & HMMs
HMMs model pairwise dependencies in sequences.
x = I fish often.  Independent: (N, N, Adv).  HMM Viterbi: (N, V, Adv).
Compact: only pairwise dependencies between the y's are modeled.
Main limitation: lots of independence assumptions, so predictive accuracy can be poor.
Next Week
Conditional Random Fields: a sequential version of logistic regression. Removes many independence assumptions and is more accurate in practice, but can only be trained in the supervised setting.
Recitation tonight: recap of Viterbi and Forward/Backward.