Machine Learning & Data Mining CS/CNS/EE 155. Lecture 11: Hidden Markov Models


Prudence Mitchell
5 years ago

1 Machine Learning & Data Mining CS/CNS/EE 155 Lecture 11: Hidden Markov Models

2 Kaggle Competition Part 1

3 Kaggle Competition Part 2

4 Announcements
Updated Kaggle report due date: 9pm on Monday, Feb 13th
Next recitation: Tuesday, Feb 14th, 7pm-8pm (recap of concepts covered today)

5 Sequence Prediction (POS Tagging)
x = "Fish Sleep", y = (N, V)
x = "The Dog Ate My Homework", y = (D, N, V, D, N)
x = "The Fox Jumped Over The Fence", y = (D, N, V, P, D, N)

6 Challenges
Multivariable output: make multiple predictions simultaneously
Variable-length input/output: sentence lengths are not fixed

7 Multivariate Outputs
x = "Fish Sleep", y = (N, V)
Multiclass prediction over POS tags: Det, Noun, Verb, Adj, Adv, Prep (how many classes?)
Replicate weights: $w = (w_1, w_2, \ldots, w_K)$, $b = (b_1, b_2, \ldots, b_K)$
Score all classes: $f(x \mid w, b) = \left(w_1^T x - b_1,\; w_2^T x - b_2,\; \ldots,\; w_K^T x - b_K\right)$
Predict via largest score: $\operatorname{argmax}_k \left(w_k^T x - b_k\right)$

8 Multiclass Prediction
x = "Fish Sleep", y = (N, V)
POS tags: Det, Noun, Verb, Adj, Adv, Prep (L = 6)
Multiclass prediction: treat every possible length-M sequence as a different class:
(D, D), (D, N), (D, V), (D, Adj), (D, Adv), (D, Pr), (N, D), (N, N), (N, V), (N, Adj), (N, Adv), ...
$L^M$ classes! Length 2: $6^2 = 36$.

9 Multiclass Prediction (Not Tractable for Sequence Prediction)
x = "Fish Sleep", y = (N, V)
POS tags: Det, Noun, Verb, Adj, Adv, Prep (L = 6)
Multiclass prediction: treat every possible length-M sequence as a different class:
(D, D), (D, N), (D, V), (D, Adj), (D, Adv), (D, Pr), (N, D), (N, N), (N, V), (N, Adj), (N, Adv), ...
$L^M$ classes! Length 2: $6^2 = 36$.
Exponential explosion in #classes!

10 Why is Naïve Multiclass Intractable?
x = "I fish often"; POS tags: Det, Noun, Verb, Adj, Adv, Prep
(D, D, D), (D, D, N), (D, D, V), (D, D, Adj), (D, D, Adv), (D, D, Pr)
(D, N, D), (D, N, N), (D, N, V), (D, N, Adj), (D, N, Adv), (D, N, Pr)
(D, V, D), (D, V, N), (D, V, V), (D, V, Adj), (D, V, Adv), (D, V, Pr)
(N, D, D), (N, D, N), (N, D, V), (N, D, Adj), (N, D, Adv), (N, D, Pr)
(N, N, D), (N, N, N), (N, N, V), (N, N, Adj), (N, N, Adv), (N, N, Pr)
...
Assume pronouns are nouns for simplicity.


11 Why is Naïve Multiclass Intractable?
x = "I fish often"; POS tags: Det, Noun, Verb, Adj, Adv, Prep
(D, D, D), (D, D, N), (D, D, V), (D, D, Adj), (D, D, Adv), (D, D, Pr)
(D, N, D), (D, N, N), (D, N, V), (D, N, Adj), (D, N, Adv), (D, N, Pr)
(D, V, D), (D, V, N), (D, V, V), (D, V, Adj), (D, V, Adv), (D, V, Pr)
(N, D, D), (N, D, N), (N, D, V), (N, D, Adj), (N, D, Adv), (N, D, Pr)
(N, N, D), (N, N, N), (N, N, V), (N, N, Adj), (N, N, Adv), (N, N, Pr)
...
Treats every combination as a different class (learn a model for each combination).
Exponentially large representation! (Exponential time to consider every class; exponential storage.)
Assume pronouns are nouns for simplicity.
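The growth in the number of classes can be checked directly. The sketch below is only an illustration, assuming the six tags listed on the slide; the function name `num_sequence_classes` is hypothetical.

```python
from itertools import product

TAGS = ["Det", "Noun", "Verb", "Adj", "Adv", "Prep"]  # L = 6 tags from the slide


def num_sequence_classes(num_tags: int, length: int) -> int:
    """Number of distinct tag sequences of a given length: L^M."""
    return num_tags ** length


# Enumerating classes explicitly is only feasible for very short sentences.
pairs = list(product(TAGS, repeat=2))
triples = list(product(TAGS, repeat=3))

for m in (2, 3, 10, 20):
    print(m, num_sequence_classes(len(TAGS), m))
```

Already at 20 words the count is far beyond anything that could be stored or scored one class at a time.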

12 Independent Classification
x = "I fish often"; POS tags: Det, Noun, Verb, Adj, Adv, Prep
Treat each word independently (assumption): independent multiclass prediction per word.
Predict for x = "I" independently
Predict for x = "fish" independently
Predict for x = "often" independently
Concatenate predictions.
Assume pronouns are nouns for simplicity.


13 Independent Classification
x = "I fish often"; POS tags: Det, Noun, Verb, Adj, Adv, Prep
Treat each word independently (assumption): independent multiclass prediction per word.
Predict for x = "I" independently
Predict for x = "fish" independently
Predict for x = "often" independently
Concatenate predictions.
#Classes = #POS tags (6 in our example), so this is solvable using standard multiclass prediction.
Assume pronouns are nouns for simplicity.

14 Independent Classification
x = "I fish often"; POS tags: Det, Noun, Verb, Adj, Adv, Prep
Treat each word independently: independent multiclass prediction per word.
[Table: P(y | x) for x = "I", "fish", "often" over y = Det, Noun, Verb, Adj, Adv, Prep; values lost in extraction]
Prediction: (N, N, Adv)
Correct: (N, V, Adv)
Why the mistake?
Assume pronouns are nouns for simplicity.

15 Context Between Words
x = "I fish often"; POS tags: Det, Noun, Verb, Adj, Adv, Prep
Independent predictions ignore word pairs.
In isolation, "fish" is more likely to be a Noun.
But conditioned on following a (pro)noun, "fish" is more likely to be a Verb!
1st-order dependence: model all pairs.
2nd order considers all triplets.
Arbitrary order = exponential size (naïve multiclass).
Assume pronouns are nouns for simplicity.

16 1st Order Hidden Markov Model
$x = (x_1, x_2, x_3, x_4, \ldots, x_M)$ (sequence of words)
$y = (y_1, y_2, y_3, y_4, \ldots, y_M)$ (sequence of POS tags)
$P(x_i \mid y_i)$: probability of state $y_i$ generating $x_i$
$P(y_{i+1} \mid y_i)$: probability of state $y_i$ transitioning to $y_{i+1}$
$P(y_1 \mid y_0)$: $y_0$ is defined to be the Start state
$P(\text{End} \mid y_M)$: prior probability of $y_M$ being the final state (not always used)

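17 Graphical Model Representation
[Figure: chain $Y_0 \to Y_1 \to Y_2 \to \cdots \to Y_M \to Y_{\text{End}}$, with each $Y_i \to X_i$; the End state is optional]
$P(x, y) = P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i)$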

18 1st Order Hidden Markov Model
Joint distribution (the $P(\text{End} \mid y_M)$ factor is optional):
$P(x, y) = P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i)$
$P(x_i \mid y_i)$: probability of state $y_i$ generating $x_i$
$P(y_{i+1} \mid y_i)$: probability of state $y_i$ transitioning to $y_{i+1}$
$P(y_1 \mid y_0)$: $y_0$ is defined to be the Start state
$P(\text{End} \mid y_M)$: prior probability of $y_M$ being the final state (not always used)
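The joint distribution is just a running product of one transition term and one emission term per position. The following is a minimal sketch, not the course's reference code. It assumes the two-word, two-tag fish/sleep model used elsewhere in this lecture; the emission values appear on the sampling slide, while the transition values are an assumption reconstructed from the surviving products in the numerical example. Note that `TRANS[prev][next]` is indexed previous-state first, the reverse of the slides' $A_{ab}$ convention.

```python
START, END = "start", "end"

# TRANS[prev][next] = P(next | prev); values are assumed (see lead-in).
TRANS = {
    START: {"noun": 0.8, "verb": 0.2},
    "noun": {"noun": 0.1, "verb": 0.8, END: 0.1},
    "verb": {"noun": 0.2, "verb": 0.1, END: 0.7},
}
# EMIT[tag][word] = P(word | tag)
EMIT = {
    "noun": {"fish": 0.8, "sleep": 0.2},
    "verb": {"fish": 0.5, "sleep": 0.5},
}


def joint_prob(words, tags, use_end=True):
    """P(x, y) = P(End | y_M) * prod_i P(y_i | y_{i-1}) * prod_i P(x_i | y_i)."""
    if len(words) != len(tags):
        raise ValueError("words and tags must have the same length")
    prob = 1.0
    prev = START
    for word, tag in zip(words, tags):
        prob *= TRANS[prev][tag] * EMIT[tag][word]
        prev = tag
    if use_end:  # the End factor is optional in the lecture
        prob *= TRANS[prev][END]
    return prob


p_nv = joint_prob(["fish", "sleep"], ["noun", "verb"])
p_nn = joint_prob(["fish", "sleep"], ["noun", "noun"])
print(p_nv, p_nn)
```

With these assumed numbers, the tagging (noun, verb) gets a much larger joint probability than (noun, noun), which is the behaviour the independent classifier could not capture.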

19 1st Order Hidden Markov Model
Conditional distribution on x given y:
$P(x \mid y) = \prod_{i=1}^{M} P(x_i \mid y_i)$
Given a POS tag sequence y, each $P(x_i \mid y_i)$ can be computed independently ($x_i$ is conditionally independent of everything else given $y_i$).
$P(x_i \mid y_i)$: probability of state $y_i$ generating $x_i$
$P(y_{i+1} \mid y_i)$: probability of state $y_i$ transitioning to $y_{i+1}$
$P(y_1 \mid y_0)$: $y_0$ is defined to be the Start state
$P(\text{End} \mid y_M)$: prior probability of $y_M$ being the final state (not always used)

20 1st Order Hidden Markov Model
$P(y_{i+1} \mid y_i)$ models all state-state pairs (all POS tag-tag pairs): additional complexity of $(\#\text{POS tags})^2$.
$P(x_i \mid y_i)$ models all state-observation pairs (all tag-word pairs): same complexity as independent multiclass.
$P(x_i \mid y_i)$: probability of state $y_i$ generating $x_i$
$P(y_{i+1} \mid y_i)$: probability of state $y_i$ transitioning to $y_{i+1}$
$P(y_1 \mid y_0)$: $y_0$ is defined to be the Start state
$P(\text{End} \mid y_M)$: prior probability of $y_M$ being the final state (not always used)

21 Relationship to Naïve Bayes
[Figure: the HMM chain $Y_0 \to Y_1 \to \cdots \to Y_M \to Y_{\text{End}}$ with each $Y_i \to X_i$, shown next to the same model with the edges between the $Y_i$ removed]
Reduces to a sequence of disjoint Naïve Bayes models (if we ignore transition probabilities).

22 P(word | state/tag)
Two-word language: "fish" and "sleep"
Two-tag language: Noun and Verb
P(x | y)     y = Noun   y = Verb
x = fish     0.8        0.5
x = sleep    0.2        0.5
Given tag sequence y:
P("fish sleep" | (N, V)) = 0.8 * 0.5
P("fish fish" | (N, V)) = 0.8 * 0.5
P("sleep fish" | (V, V)) = 0.5 * 0.5
P("sleep sleep" | (N, N)) = 0.2 * 0.2
Slides borrowed from Ralph Grishman

23 Sampling
HMMs are generative models: they model the joint distribution P(x, y), so we can generate samples from this distribution.
First consider the conditional distribution P(x | y):
P(x | y)     y = Noun   y = Verb
x = fish     0.8        0.5
x = sleep    0.2        0.5
Given tag sequence y = (N, V), sample each word independently:
Sample $x_1$ from $P(x_1 \mid N)$: (0.8 fish, 0.2 sleep)
Sample $x_2$ from $P(x_2 \mid V)$: (0.5 fish, 0.5 sleep)
What about sampling from P(x, y)?

24 Forward Sampling of P(y, x)
$P(x, y) = P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i)$
[Figure: state-transition diagram over start, noun, verb, end, with the emission table P(x | y); edge labels mostly lost in extraction]
Initialize $y_0$ = Start
Initialize i = 0
1. i = i + 1
2. Sample $y_i$ from $P(y_i \mid y_{i-1})$
3. If $y_i$ == End: Quit
4. Sample $x_i$ from $P(x_i \mid y_i)$
5. Goto Step 1
Exploits conditional independence. Requires $P(\text{End} \mid y_i)$.
Slides borrowed from Ralph Grishman

25 Forward Sampling of P(y, x | M)
$P(x, y \mid M) = \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i)$
[Figure: state-transition diagram over start, noun, verb (no end state), with the emission table P(x | y); edge labels lost in extraction]
Initialize $y_0$ = Start
Initialize i = 0
1. i = i + 1
2. If i > M: Quit
3. Sample $y_i$ from $P(y_i \mid y_{i-1})$
4. Sample $x_i$ from $P(x_i \mid y_i)$
5. Goto Step 1
Exploits conditional independence. Assumes no $P(\text{End} \mid y_i)$.
Slides borrowed from Ralph Grishman
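Both sampling procedures above can be sketched in a few lines. This is an illustrative sketch under assumptions: the emission values come from the sampling slide, the transition values are reconstructed from the numerical example and should be treated as assumed, and in the fixed-length variant the End transition is simply dropped and the remaining weights are renormalized (one possible reading of "assumes no P(End | y_i)").

```python
import random

START, END = "start", "end"
TRANS = {  # TRANS[prev][next] = P(next | prev); assumed values
    START: {"noun": 0.8, "verb": 0.2},
    "noun": {"noun": 0.1, "verb": 0.8, END: 0.1},
    "verb": {"noun": 0.2, "verb": 0.1, END: 0.7},
}
EMIT = {  # EMIT[tag][word] = P(word | tag)
    "noun": {"fish": 0.8, "sleep": 0.2},
    "verb": {"fish": 0.5, "sleep": 0.5},
}


def sample_from(dist, rng):
    """Draw one key of `dist` with probability proportional to its value."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def forward_sample(rng):
    """Sample (words, tags) from P(x, y); stops when the End state is drawn."""
    words, tags, prev = [], [], START
    while True:
        tag = sample_from(TRANS[prev], rng)
        if tag == END:
            return words, tags
        tags.append(tag)
        words.append(sample_from(EMIT[tag], rng))
        prev = tag


def forward_sample_fixed_length(length, rng):
    """Sample exactly `length` (word, tag) pairs, ignoring the End state."""
    words, tags, prev = [], [], START
    for _ in range(length):
        no_end = {t: p for t, p in TRANS[prev].items() if t != END}
        tag = sample_from(no_end, rng)  # rng.choices renormalizes the weights
        tags.append(tag)
        words.append(sample_from(EMIT[tag], rng))
        prev = tag
    return words, tags


rng = random.Random(0)
print(forward_sample(rng))
print(forward_sample_fixed_length(5, rng))
```

Each step only looks at the previous tag, which is exactly the conditional independence the slides refer to.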

26 1st Order Hidden Markov Model
Memory-less model: only $y_k$ is needed to model the rest of the sequence.
$P(x_{k+1:M}, y_{k+1:M} \mid x_{1:k}, y_{1:k}) = P(x_{k+1:M}, y_{k+1:M} \mid y_k)$
$P(x_i \mid y_i)$: probability of state $y_i$ generating $x_i$
$P(y_{i+1} \mid y_i)$: probability of state $y_i$ transitioning to $y_{i+1}$
$P(y_1 \mid y_0)$: $y_0$ is defined to be the Start state
$P(\text{End} \mid y_M)$: prior probability of $y_M$ being the final state (not always used)

27 Viterbi Algorithm

28 Most Common Prediction Problem
Given an input sentence, predict the POS tag sequence: $\operatorname{argmax}_y P(y \mid x)$
Naïve approach: try all possible y's and choose the one with highest probability.
Exponential time: $L^M$ possible y's.

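29 Bayes's Rule
$\operatorname{argmax}_y P(y \mid x) = \operatorname{argmax}_y \frac{P(y, x)}{P(x)} = \operatorname{argmax}_y P(y, x) = \operatorname{argmax}_y P(x \mid y)\, P(y)$
$P(x \mid y) = \prod_{i=1}^{M} P(x_i \mid y_i)$
$P(y) = P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1})$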

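30 Exploit the memory-less property
$\operatorname{argmax}_y P(y, x) = \operatorname{argmax}_y \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i)$
$= \operatorname{argmax}_{y_M} \operatorname{argmax}_{y_{1:M-1}} \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i)$
$= \operatorname{argmax}_{y_M} \operatorname{argmax}_{y_{1:M-1}} P(y_M \mid y_{M-1})\, P(x_M \mid y_M)\, P(y_{1:M-1}, x_{1:M-1})$
where
$P(y_{1:k}, x_{1:k}) = P(x_{1:k} \mid y_{1:k})\, P(y_{1:k})$
$P(x_{1:k} \mid y_{1:k}) = \prod_{i=1}^{k} P(x_i \mid y_i)$
$P(y_{1:k}) = \prod_{i=1}^{k} P(y_i \mid y_{i-1})$
The choice of $y_M$ only depends on $y_{1:M-1}$ via $P(y_M \mid y_{M-1})$!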

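31 Dynamic Programming
Input: $x = (x_1, x_2, x_3, \ldots, x_M)$
Computed: the best length-k prefix ending in each tag. Examples ($\oplus$ denotes sequence concatenation):
$\hat{Y}^k(V) = \left(\operatorname{argmax}_{y_{1:k-1}} P(y_{1:k-1} \oplus V,\; x_{1:k})\right) \oplus V$
$\hat{Y}^k(N) = \left(\operatorname{argmax}_{y_{1:k-1}} P(y_{1:k-1} \oplus N,\; x_{1:k})\right) \oplus N$
Claim:
$\hat{Y}^{k+1}(V) = \left(\operatorname{argmax}_{y_{1:k} \in \{\hat{Y}^k(T)\}_T} P(y_{1:k} \oplus V,\; x_{1:k+1})\right) \oplus V$
$= \left(\operatorname{argmax}_{y_{1:k} \in \{\hat{Y}^k(T)\}_T} P(y_{1:k}, x_{1:k})\, P(y_{k+1} = V \mid y_k)\, P(x_{k+1} \mid y_{k+1} = V)\right) \oplus V$
The factor $P(y_{1:k}, x_{1:k})$ is pre-computed. Recursive definition!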

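32 Solve:
$\hat{Y}^2(V) = \left(\operatorname{argmax}_{y_1 \in \{\hat{Y}^1(T)\}_T} P(y_1, x_1)\, P(y_2 = V \mid y_1)\, P(x_2 \mid y_2 = V)\right) \oplus V$
Store each $\hat{Y}^1(Z)$ and $P(\hat{Y}^1(Z), x_1)$. $\hat{Y}^1(Z)$ is just Z.
[Figure: trellis with columns $\hat{Y}^1(\cdot)$ and $\hat{Y}^2(\cdot)$ over tags V, D, N; edges $y_1 = V$, $y_1 = D$, $y_1 = N$ lead into $\hat{Y}^2(V)$]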

33 Solve:
$\hat{Y}^2(V) = \left(\operatorname{argmax}_{y_1 \in \{\hat{Y}^1(T)\}_T} P(y_1, x_1)\, P(y_2 = V \mid y_1)\, P(x_2 \mid y_2 = V)\right) \oplus V$
Store each $\hat{Y}^1(Z)$ and $P(\hat{Y}^1(Z), x_1)$. $\hat{Y}^1(Z)$ is just Z.
[Figure: the same trellis with only the winning edge $y_1 = N$ into $\hat{Y}^2(V)$ kept]
Ex: $\hat{Y}^2(V) = (N, V)$

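34 Solve:
$\hat{Y}^3(V) = \left(\operatorname{argmax}_{y_{1:2} \in \{\hat{Y}^2(T)\}_T} P(y_{1:2}, x_{1:2})\, P(y_3 = V \mid y_2)\, P(x_3 \mid y_3 = V)\right) \oplus V$
Store each $\hat{Y}^1(Z)$ and $P(\hat{Y}^1(Z), x_1)$.
Store each $\hat{Y}^2(Z)$ and $P(\hat{Y}^2(Z), x_{1:2})$.
[Figure: trellis with columns $\hat{Y}^1(\cdot)$, $\hat{Y}^2(\cdot)$, $\hat{Y}^3(\cdot)$ over tags V, D, N; edges $y_2 = V$, $y_2 = D$, $y_2 = N$ lead into $\hat{Y}^3(V)$]
Ex: $\hat{Y}^2(V) = (N, V)$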

35 Solve:
$\hat{Y}^3(V) = \left(\operatorname{argmax}_{y_{1:2} \in \{\hat{Y}^2(T)\}_T} P(y_{1:2}, x_{1:2})\, P(y_3 = V \mid y_2)\, P(x_3 \mid y_3 = V)\right) \oplus V$
Store each $\hat{Y}^1(Z)$ and $P(\hat{Y}^1(Z), x_1)$.
Store each $\hat{Y}^2(Z)$ and $P(\hat{Y}^2(Z), x_{1:2})$.
Claim: only need to check the solutions $\hat{Y}^2(Z)$, Z = V, D, N.
Ex: $\hat{Y}^2(V) = (N, V)$. Suppose $\hat{Y}^3(V) = (V, V, V)$; then one can prove that (N, V, V) has higher probability, a contradiction.
The proof depends on the 1st-order property: the probabilities of (V, V, V) and (N, V, V) differ in 3 terms, $P(y_1 \mid y_0)$, $P(x_1 \mid y_1)$, and $P(y_2 \mid y_1)$. None of these depend on $y_3$!
[Figure: trellis with columns $\hat{Y}^1(\cdot)$, $\hat{Y}^2(\cdot)$, $\hat{Y}^3(\cdot)$ over tags V, D, N]

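36 Final step
$\hat{Y}^M(V) = \left(\operatorname{argmax}_{y_{1:M-1} \in \{\hat{Y}^{M-1}(T)\}_T} P(y_{1:M-1}, x_{1:M-1})\, P(y_M = V \mid y_{M-1})\, P(x_M \mid y_M = V)\, P(\text{End} \mid y_M = V)\right) \oplus V$
(the $P(\text{End} \mid y_M = V)$ factor is optional)
Store each $\hat{Y}^1(Z)$ and $P(\hat{Y}^1(Z), x_1)$.
Store each $\hat{Y}^2(Z)$ and $P(\hat{Y}^2(Z), x_{1:2})$.
Store each $\hat{Y}^3(Z)$ and $P(\hat{Y}^3(Z), x_{1:3})$.
[Figure: trellis with columns $\hat{Y}^1(\cdot)$, $\hat{Y}^2(\cdot)$, $\hat{Y}^3(\cdot)$, ..., $\hat{Y}^M(\cdot)$ over tags V, D, N]
Ex: $\hat{Y}^2(V) = (N, V)$
Ex: $\hat{Y}^3(V) = (D, N, V)$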

37 Viterbi Algorithm
Solve: $\operatorname{argmax}_y P(y \mid x) = \operatorname{argmax}_y \frac{P(y, x)}{P(x)} = \operatorname{argmax}_y P(y, x) = \operatorname{argmax}_y P(x \mid y)\, P(y)$
For k = 1..M: iteratively solve for each $\hat{Y}^k(Z)$, with Z looping over every POS tag.
Predict the best $\hat{Y}^M(Z)$.
Also known as Maximum A Posteriori (MAP) inference.

38 Numerical Example
x = (Fish Sleep)
[Figure: state-transition diagram over start, noun, verb, end; surviving edge labels: 0.8, 0.7, 0.2]
Slides borrowed from Ralph Grishman

39 [Each slide of this example repeats the transition diagram and the emission table P(x | y) for x = fish, sleep and y = Noun, Verb]
Viterbi table, column 0: start = 1, verb = 0, noun = 0, end = 0

40 Token 1: fish
start = 0, verb = 0.2 * 0.5, noun = 0.8 * 0.8, end = 0

41 Token 1: fish
start = 0, verb = 0.1, noun = 0.64, end = 0

42 Token 2: sleep (if fish is a verb)
verb = 0.1 * 0.1 * 0.5, noun = 0.1 * 0.2 * 0.2

43 Token 2: sleep (if fish is a verb)
verb = 0.005, noun = 0.004

44 Token 2: sleep (if fish is a noun)
verb = 0.64 * 0.8 * 0.5, noun = 0.64 * 0.1 * 0.2

45 Token 2: sleep (if fish is a noun)
verb = 0.256, noun = 0.0128

46 Token 2: sleep; take maximum, set back pointers
verb = max(0.005, 0.256), noun = max(0.004, 0.0128)

47 Token 2: sleep; take maximum, set back pointers
verb = 0.256 (back pointer to noun), noun = 0.0128 (back pointer to noun)

48 Token 3: end
end = 0.256 * 0.7 (from verb) or 0.0128 * 0.1 (from noun)

49 Token 3: end; take maximum, set back pointers
end = max(0.256 * 0.7, 0.0128 * 0.1), back pointer to verb

50 Decode: fish = noun, sleep = verb
end = 0.256 * 0.7
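The walk-through can be reproduced with a short Viterbi implementation. This is a sketch, not the original course code. The emission values are the ones given on the sampling slide; the transition values are an assumption, read off the surviving products in the example (0.8 and 0.2 out of start, 0.1/0.8/0.1 out of noun, 0.2/0.1/0.7 out of verb). `TRANS[prev][next]` is indexed previous-state first.

```python
START, END = "start", "end"
TRANS = {  # TRANS[prev][next] = P(next | prev); assumed values
    START: {"noun": 0.8, "verb": 0.2},
    "noun": {"noun": 0.1, "verb": 0.8, END: 0.1},
    "verb": {"noun": 0.2, "verb": 0.1, END: 0.7},
}
EMIT = {  # EMIT[tag][word] = P(word | tag)
    "noun": {"fish": 0.8, "sleep": 0.2},
    "verb": {"fish": 0.5, "sleep": 0.5},
}


def viterbi(words, trans=TRANS, emit=EMIT):
    """Return (best tag sequence, its joint probability including End)."""
    tags = list(emit)
    # Each column maps tag z -> (score of best prefix ending in z, back pointer).
    table = [{z: (trans[START].get(z, 0.0) * emit[z].get(words[0], 0.0), None)
              for z in tags}]
    for word in words[1:]:
        prev_col, col = table[-1], {}
        for z in tags:
            # Only the best prefix ending in each previous tag is considered.
            back, score = max(
                ((t, prev_col[t][0] * trans[t].get(z, 0.0)) for t in tags),
                key=lambda pair: pair[1],
            )
            col[z] = (score * emit[z].get(word, 0.0), back)
        table.append(col)
    # Final transition into End, then follow the back pointers.
    last, best = max(
        ((z, table[-1][z][0] * trans[z].get(END, 0.0)) for z in tags),
        key=lambda pair: pair[1],
    )
    path = [last]
    for col in reversed(table[1:]):
        path.append(col[path[-1]][1])
    path.reverse()
    return path, best


best_path, best_prob = viterbi(["fish", "sleep"])
print(best_path, best_prob)
```

Under these assumed parameters the decode agrees with the slides: fish is tagged noun, sleep is tagged verb, and the winning score is 0.256 * 0.7.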


51 What might go wrong for long sequences?
Decode: fish = noun, sleep = verb
Underflow! Small numbers get repeatedly multiplied together and become exponentially small.
Slides borrowed from Ralph Grishman

52 Viterbi Algorithm (w/ Log Probabilities)
Solve: $\operatorname{argmax}_y P(y \mid x) = \operatorname{argmax}_y \frac{P(y, x)}{P(x)} = \operatorname{argmax}_y P(y, x) = \operatorname{argmax}_y \left[\log P(x \mid y) + \log P(y)\right]$
For k = 1..M: iteratively solve for each $\log \hat{Y}^k(Z)$, with Z looping over every POS tag.
Predict the best $\log \hat{Y}^M(Z)$.
$\log \hat{Y}^M(Z)$ accumulates additively, not multiplicatively.
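A quick experiment shows why the log formulation matters. The sketch below uses a hypothetical sequence of 400 per-step factors of 0.1 (not from the slides) and a hypothetical helper `log_viterbi_step` that performs one Viterbi update in log space.

```python
import math

step_probs = [0.1] * 400  # hypothetical per-step probabilities

# Multiplying probabilities directly underflows to exactly 0.0 in float64.
product = 1.0
for p in step_probs:
    product *= p

# Summing log probabilities stays comfortably within floating-point range.
log_score = sum(math.log(p) for p in step_probs)


def log_viterbi_step(prev_log_scores, log_trans_into_z, log_emit_z):
    """One log-space update: max_t [score(t) + log P(z | t)] + log P(x | z)."""
    best = max(s + t for s, t in zip(prev_log_scores, log_trans_into_z))
    return best + log_emit_z


# The "sleep as verb" cell of the fish/sleep example, computed in log space.
cell = log_viterbi_step(
    [math.log(0.64), math.log(0.1)],  # best scores for (noun, verb) after "fish"
    [math.log(0.8), math.log(0.1)],   # assumed P(verb | noun), P(verb | verb)
    math.log(0.5),                    # P(sleep | verb)
)
print(product, log_score, math.exp(cell))
```

Because log is monotonic, taking the max over sums of logs selects the same path as taking the max over products.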

53 Recap: Independent Classification
x = "I fish often"; POS tags: Det, Noun, Verb, Adj, Adv, Prep
Treat each word independently: independent multiclass prediction per word.
[Table: P(y | x) for x = "I", "fish", "often" over y = Det, Noun, Verb, Adj, Adv, Prep; values lost in extraction]
Prediction: (N, N, Adv)
Correct: (N, V, Adv)
Mistake due to not modeling multiple words.
Assume pronouns are nouns for simplicity.

54 Recap: Viterbi
Models pairwise transitions between states (pairwise transitions between POS tags): a 1st-order model.
$P(x, y) = P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i)$
x = "I fish often"
Independent: (N, N, Adv)
HMM Viterbi: (N, V, Adv)*
*Assuming we defined P(x, y) properly

55 Training HMMs

56 Supervised Training
Given: $S = \{(x_i, y_i)\}_{i=1}^{N}$, where each $x_i$ is a word sequence (sentence) and each $y_i$ is a POS tag sequence.
Goal: estimate P(x, y) using S.
$P(x, y) = P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i)$
Maximum likelihood!

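57 Aside: Matrix Formulation
Define the transition matrix A: $A_{ab} = P(y_{i+1} = a \mid y_i = b)$, or $\log P(y_{i+1} = a \mid y_i = b)$
[Table: $P(y_{\text{next}} \mid y)$ for $y, y_{\text{next}} \in$ {Noun, Verb}; values lost in extraction]
Observation matrix O: $O_{wz} = P(x_i = w \mid y_i = z)$, or $\log P(x_i = w \mid y_i = z)$
[Table: $P(x \mid y)$ for x = fish, sleep and y = Noun, Verb; values lost in extraction]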

58 Aside: Matrix Formulation
$P(x, y) = P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i) = A_{\text{End}, y_M} \prod_{i=1}^{M} A_{y_i, y_{i-1}} \prod_{i=1}^{M} O_{x_i, y_i}$
Log prob. formulation:
$\log P(x, y) = \tilde{A}_{\text{End}, y_M} + \sum_{i=1}^{M} \tilde{A}_{y_i, y_{i-1}} + \sum_{i=1}^{M} \tilde{O}_{x_i, y_i}$
Each entry of $\tilde{A}$ is defined as the log of the corresponding entry of A (and likewise for $\tilde{O}$).

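59 Maximum Likelihood
$\operatorname{argmax}_{A,O} \prod_{(x,y) \in S} P(x, y) = \operatorname{argmax}_{A,O} \prod_{(x,y) \in S} P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i)$
Estimate each component separately:
$A_{ab} = \dfrac{\sum_{j=1}^{N} \sum_{i=0}^{M_j} 1\left[(y^j_{i+1} = a) \wedge (y^j_i = b)\right]}{\sum_{j=1}^{N} \sum_{i=0}^{M_j} 1\left[y^j_i = b\right]}$
$O_{wz} = \dfrac{\sum_{j=1}^{N} \sum_{i=1}^{M_j} 1\left[(x^j_i = w) \wedge (y^j_i = z)\right]}{\sum_{j=1}^{N} \sum_{i=1}^{M_j} 1\left[y^j_i = z\right]}$
(Derived via minimizing the negative log likelihood.)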
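The two estimators above are ratios of counts, so supervised training amounts to a single pass over the tagged data. A minimal sketch, assuming a tiny hypothetical training set (the three sentences below are made up for illustration) and no smoothing:

```python
from collections import Counter, defaultdict

START, END = "start", "end"


def normalize(counts):
    """Turn nested counts into conditional probabilities, row by row."""
    return {
        context: {k: c / sum(row.values()) for k, c in row.items()}
        for context, row in counts.items()
    }


def estimate_hmm(data):
    """Count-based maximum likelihood estimates.

    Returns (trans, emit) with trans[prev][next] = P(next | prev) and
    emit[tag][word] = P(word | tag). Note the index order is the reverse of
    the slides' A_ab = P(y_{i+1} = a | y_i = b).
    """
    trans_counts = defaultdict(Counter)
    emit_counts = defaultdict(Counter)
    for words, tags in data:
        prev = START
        for word, tag in zip(words, tags):
            trans_counts[prev][tag] += 1
            emit_counts[tag][word] += 1
            prev = tag
        trans_counts[prev][END] += 1  # optional End transition
    return normalize(trans_counts), normalize(emit_counts)


# Hypothetical tagged sentences, for illustration only.
toy_data = [
    (["fish", "sleep"], ["noun", "verb"]),
    (["fish", "fish"], ["noun", "verb"]),
    (["sleep"], ["verb"]),
]
trans_hat, emit_hat = estimate_hmm(toy_data)
print(trans_hat)
print(emit_hat)
```

In practice some smoothing is usually added so that unseen tag-word pairs do not receive probability zero; that is outside what the slide states.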

60 Recap: Supervised Training
Maximum likelihood training:
$\operatorname{argmax}_{A,O} \prod_{(x,y) \in S} P(x, y) = \operatorname{argmax}_{A,O} \prod_{(x,y) \in S} P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i)$
Counting statistics. Super easy! Why?
What about the unsupervised case?

61 Conditional Independence Assumptions
$\operatorname{argmax}_{A,O} \prod_{(x,y) \in S} P(x, y) = \operatorname{argmax}_{A,O} \prod_{(x,y) \in S} P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i)$
Everything decomposes to products of pairs, i.e., $P(y_{i+1} = a \mid y_i = b)$ doesn't depend on anything else.
Can just estimate frequencies: how often $y_{i+1} = a$ when $y_i = b$ over the training set.
Note that $P(y_{i+1} = a \mid y_i = b)$ is a common model across all locations of all sequences.

62 Conditional Independence Assumptions
$\operatorname{argmax}_{A,O} \prod_{(x,y) \in S} P(x, y) = \operatorname{argmax}_{A,O} \prod_{(x,y) \in S} P(\text{End} \mid y_M) \prod_{i=1}^{M} P(y_i \mid y_{i-1}) \prod_{i=1}^{M} P(x_i \mid y_i)$
Everything decomposes to products of pairs, i.e., $P(y_{i+1} = a \mid y_i = b)$ doesn't depend on anything else.
Can just estimate frequencies: how often $y_{i+1} = a$ when $y_i = b$ over the training set.
Note that $P(y_{i+1} = a \mid y_i = b)$ is a common model across all locations of all sequences.
# Parameters:
Transitions A: $\#\text{Tags}^2$
Observations O: #Words x #Tags
Avoids directly modeling word/word pairings.
#Tags = 10s, #Words = 10000s

63 Unsupervised Training
What if there are no y's? Just a training set of sentences:
$S = \{x_i\}_{i=1}^{N}$, where each $x_i$ is a word sequence (sentence).
Still want to estimate P(x, y). How? Why?
$\operatorname{argmax} \prod_i P(x_i) = \operatorname{argmax} \prod_i \sum_y P(x_i, y)$

64 Why Unsupervised Training?
Supervised data is hard to acquire: it requires annotating POS tags.
Unsupervised data is plentiful: just grab some text!
Might just work for POS tagging: learn y's that correspond to POS tags.
Can be used for other tasks:
Detect outlier sentences (sentences with low probability)
Sampling new sentences

65 EM Algorithm (Baum-Welch)
If we had y's → max likelihood.
If we had (A, O) → predict y's.
Chicken vs. egg!
1. Initialize A and O arbitrarily
2. Predict prob. of y's for each training x (Expectation Step)
3. Use y's to estimate new (A, O) (Maximization Step)
4. Repeat from Step 2 until convergence
http://en.wikipedia.org/wiki/Baum%E2%80%93Welch_algorithm

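66 Expectation Step
Given (A, O), for each training $x = (x_1, \ldots, x_M)$:
Predict $P(y_i)$ for each position of $y = (y_1, \ldots, y_M)$.
[Table: columns $x_1, x_2, \ldots, x_M$; rows $P(y_i = \text{Noun})$, $P(y_i = \text{Det})$, $P(y_i = \text{Verb})$; values lost in extraction]
Encodes the current model's beliefs about y: the marginal distribution of each $y_i$.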

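67 Recall: Matrix Formulation
Define the transition matrix A: $A_{ab} = P(y_{i+1} = a \mid y_i = b)$, or $\log P(y_{i+1} = a \mid y_i = b)$
[Table: $P(y_{\text{next}} \mid y)$ for $y, y_{\text{next}} \in$ {Noun, Verb}; values lost in extraction]
Observation matrix O: $O_{wz} = P(x_i = w \mid y_i = z)$, or $\log P(x_i = w \mid y_i = z)$
[Table: $P(x \mid y)$ for x = fish, sleep and y = Noun, Verb; values lost in extraction]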

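68 Maximization Step
Max. likelihood over the marginal distribution.
Supervised:
$A_{ab} = \dfrac{\sum_{j=1}^{N} \sum_{i=0}^{M_j} 1\left[(y^j_{i+1} = a) \wedge (y^j_i = b)\right]}{\sum_{j=1}^{N} \sum_{i=0}^{M_j} 1\left[y^j_i = b\right]}$
$O_{wz} = \dfrac{\sum_{j=1}^{N} \sum_{i=1}^{M_j} 1\left[(x^j_i = w) \wedge (y^j_i = z)\right]}{\sum_{j=1}^{N} \sum_{i=1}^{M_j} 1\left[y^j_i = z\right]}$
Unsupervised (indicator counts on y replaced by marginals):
$A_{ab} = \dfrac{\sum_{j=1}^{N} \sum_{i=0}^{M_j} P(y^j_i = b,\; y^j_{i+1} = a)}{\sum_{j=1}^{N} \sum_{i=0}^{M_j} P(y^j_i = b)}$
$O_{wz} = \dfrac{\sum_{j=1}^{N} \sum_{i=1}^{M_j} 1\left[x^j_i = w\right] P(y^j_i = z)}{\sum_{j=1}^{N} \sum_{i=1}^{M_j} P(y^j_i = z)}$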

69 Computing Marginals (Forward-Backward Algorithm)
Solving the E-step requires computing marginals.
[Table: columns $x_1, x_2, \ldots, x_M$; rows $P(y_i = \text{Noun})$, $P(y_i = \text{Det})$, $P(y_i = \text{Verb})$; values lost in extraction]
Can solve using dynamic programming! Similar to Viterbi.

70 Notation
Probability of observing prefix $x_{1:i}$ and having the i-th state be $y_i = z$:
$\alpha_z(i) = P(x_{1:i}, y_i = z \mid A, O)$
Probability of observing suffix $x_{i+1:M}$ given the i-th state being $y_i = z$:
$\beta_z(i) = P(x_{i+1:M} \mid y_i = z, A, O)$
Computing marginals = combining the two terms:
$P(y_i = z \mid x) = \dfrac{\alpha_z(i)\, \beta_z(i)}{\sum_{z'} \alpha_{z'}(i)\, \beta_{z'}(i)}$
http://en.wikipedia.org/wiki/Baum%E2%80%93Welch_algorithm

71 Notation
Probability of observing prefix $x_{1:i}$ and having the i-th state be $y_i = z$:
$\alpha_z(i) = P(x_{1:i}, y_i = z \mid A, O)$
Probability of observing suffix $x_{i+1:M}$ given the i-th state being $y_i = z$:
$\beta_z(i) = P(x_{i+1:M} \mid y_i = z, A, O)$
Computing marginals = combining the two terms:
$P(y_i = b, y_{i-1} = a \mid x) = \dfrac{\alpha_a(i-1)\, P(y_i = b \mid y_{i-1} = a)\, P(x_i \mid y_i = b)\, \beta_b(i)}{\sum_{a', b'} \alpha_{a'}(i-1)\, P(y_i = b' \mid y_{i-1} = a')\, P(x_i \mid y_i = b')\, \beta_{b'}(i)}$
http://en.wikipedia.org/wiki/Baum%E2%80%93Welch_algorithm

72 Forward (sub-)algorithm
Solve for every: $\alpha_z(i) = P(x_{1:i}, y_i = z \mid A, O)$
Naively (exponential time!):
$\alpha_z(i) = P(x_{1:i}, y_i = z \mid A, O) = \sum_{y_{1:i-1}} P(x_{1:i}, y_i = z, y_{1:i-1} \mid A, O)$
Can be computed recursively (like Viterbi):
$\alpha_z(1) = P(y_1 = z \mid y_0)\, P(x_1 \mid y_1 = z) = O_{x_1, z}\, A_{z, \text{Start}}$
$\alpha_z(i+1) = O_{x_{i+1}, z} \sum_{j=1}^{L} \alpha_j(i)\, A_{z, j}$
Viterbi effectively replaces the sum with a max.

73 Backward (sub-)algorithm
Solve for every: $\beta_z(i) = P(x_{i+1:M} \mid y_i = z, A, O)$
Naively (exponential time!):
$\beta_z(i) = P(x_{i+1:M} \mid y_i = z, A, O) = \sum_{y_{i+1:M}} P(x_{i+1:M}, y_{i+1:M} \mid y_i = z, A, O)$
Can be computed recursively (like Viterbi):
$\beta_z(M) = 1$
$\beta_z(i) = \sum_{j=1}^{L} \beta_j(i+1)\, A_{j, z}\, O_{x_{i+1}, j}$

74 Forward-Backward Algorithm
Runs Forward: $\alpha_z(i) = P(x_{1:i}, y_i = z \mid A, O)$
Runs Backward: $\beta_z(i) = P(x_{i+1:M} \mid y_i = z, A, O)$
For each training $x = (x_1, \ldots, x_M)$, computes each $P(y_i)$ for $y = (y_1, \ldots, y_M)$:
$P(y_i = z \mid x) = \dfrac{\alpha_z(i)\, \beta_z(i)}{\sum_{z'} \alpha_{z'}(i)\, \beta_{z'}(i)}$
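The forward and backward recursions and the marginal formula translate almost line for line into code. The sketch below is illustrative rather than a reference implementation: it assumes the fish/sleep parameters used in this lecture's example (with transition values reconstructed from the numerical example, so treat them as assumed), follows the slides in setting the final backward values to 1 (no End factor), and works in plain probability space, so it would underflow on long sequences.

```python
START = "start"
TRANS = {  # TRANS[prev][next] = P(next | prev); assumed values, End omitted
    START: {"noun": 0.8, "verb": 0.2},
    "noun": {"noun": 0.1, "verb": 0.8},
    "verb": {"noun": 0.2, "verb": 0.1},
}
EMIT = {  # EMIT[tag][word] = P(word | tag)
    "noun": {"fish": 0.8, "sleep": 0.2},
    "verb": {"fish": 0.5, "sleep": 0.5},
}
TAGS = list(EMIT)


def forward(words):
    """alpha[i][z] = P(x_1..x_{i+1}, y_{i+1} = z), using 0-based list index i."""
    alpha = [{z: TRANS[START][z] * EMIT[z][words[0]] for z in TAGS}]
    for word in words[1:]:
        prev = alpha[-1]
        alpha.append({
            z: EMIT[z][word] * sum(prev[j] * TRANS[j][z] for j in TAGS)
            for z in TAGS
        })
    return alpha


def backward(words):
    """beta[i][z] = P(remaining words | y_{i+1} = z); the last column is all 1."""
    beta = [{z: 1.0 for z in TAGS}]
    for word in reversed(words[1:]):
        nxt = beta[0]
        beta.insert(0, {
            z: sum(nxt[j] * TRANS[z][j] * EMIT[j][word] for j in TAGS)
            for z in TAGS
        })
    return beta


def marginals(words):
    """P(y_i = z | x) for every position i, by normalizing alpha * beta."""
    result = []
    for a, b in zip(forward(words), backward(words)):
        unnorm = {z: a[z] * b[z] for z in TAGS}
        total = sum(unnorm.values())
        result.append({z: v / total for z, v in unnorm.items()})
    return result


posteriors = marginals(["fish", "sleep"])
print(posteriors)
```

A useful sanity check is that the normalizer, the sum over z of alpha times beta, is the same at every position, since each equals P(x) under the model.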

75 Recap: Unsupervised Training
Train using only word sequences: $S = \{x_i\}_{i=1}^{N}$, where each $x_i$ is a word sequence (sentence).
The y's are hidden states, and all pairwise transitions are through the y's. Hence "hidden" Markov model.
Train using the EM algorithm, which converges to a local optimum.

76 Initialization
How to choose the number of hidden states?
By hand
Cross-validation: evaluate P(x) on validation data
Can compute P(x) via the forward algorithm:
$P(x) = \sum_y P(x, y) = \sum_z \alpha_z(M)\, P(\text{End} \mid y_M = z)$
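The identity for P(x) can be verified on a small case by comparing the forward algorithm against a brute-force sum over all tag sequences. A sketch under assumptions: the parameters are the fish/sleep values used in this lecture's example, with the transition values reconstructed and therefore assumed.

```python
from itertools import product

START, END = "start", "end"
TRANS = {  # TRANS[prev][next] = P(next | prev); assumed values
    START: {"noun": 0.8, "verb": 0.2},
    "noun": {"noun": 0.1, "verb": 0.8, END: 0.1},
    "verb": {"noun": 0.2, "verb": 0.1, END: 0.7},
}
EMIT = {  # EMIT[tag][word] = P(word | tag)
    "noun": {"fish": 0.8, "sleep": 0.2},
    "verb": {"fish": 0.5, "sleep": 0.5},
}
TAGS = list(EMIT)


def likelihood_forward(words):
    """P(x) = sum_z alpha_z(M) * P(End | y_M = z), in time linear in M."""
    alpha = {z: TRANS[START][z] * EMIT[z][words[0]] for z in TAGS}
    for word in words[1:]:
        alpha = {
            z: EMIT[z][word] * sum(alpha[j] * TRANS[j][z] for j in TAGS)
            for z in TAGS
        }
    return sum(alpha[z] * TRANS[z][END] for z in TAGS)


def likelihood_brute_force(words):
    """P(x) = sum over all L^M tag sequences of P(x, y); exponential in M."""
    total = 0.0
    for tags in product(TAGS, repeat=len(words)):
        prob, prev = 1.0, START
        for word, tag in zip(words, tags):
            prob *= TRANS[prev][tag] * EMIT[tag][word]
            prev = tag
        total += prob * TRANS[prev][END]
    return total


sentence = ["fish", "sleep"]
print(likelihood_forward(sentence), likelihood_brute_force(sentence))
```

For cross-validation one would typically compare log P(x) summed over validation sentences across models with different numbers of hidden states.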

77 Recap: Sequence Prediction & HMMs
Models pairwise dependences in sequences.
x = "I fish often"; POS tags: Det, Noun, Verb, Adj, Adv, Prep
Independent: (N, N, Adv)
HMM Viterbi: (N, V, Adv)
Compact: only models pairwise dependences between y's.
Main limitation: lots of independence assumptions, hence poor predictive accuracy.

78 Next Week
Deep Generative Models (Recent Applications Lecture)
More on Unsupervised Learning
Recitation TUESDAY (7pm): recap of Viterbi and Forward/Backward
