Machine Learning & Data Mining CS/CNS/EE 155. Lecture 8: Hidden Markov Models


1 Machine Learning & Data Mining CS/CNS/EE 155 Lecture 8: Hidden Markov Models 1

2 Sequence Prediction (POS Tagging). x = "Fish Sleep", y = (N, V). x = "The Dog Ate My Homework", y = (D, N, V, D, N). x = "The Fox Jumped Over The Fence", y = (D, N, V, P, D, N)

3 Challenges. Multivariable output: make multiple predictions simultaneously. Variable-length input/output: sentence lengths are not fixed.

4 Multivariate Outputs. x = "Fish Sleep", y = (N, V). POS Tags: Det, Noun, Verb, Adj, Adv, Prep. Multiclass prediction: replicate the weights, one per class: w = (w_1, ..., w_K), b = (b_1, ..., b_K). Score all classes: f(x | w, b) = (w_1^T x - b_1, ..., w_K^T x - b_K). Predict via largest score: argmax_k w_k^T x - b_k. How many classes?
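The replicated-weights construction above can be sketched in a few lines. This is a minimal illustration; the weight and bias values below are made-up placeholders, and the scoring follows the slide's w_k^T x - b_k form:

```python
import numpy as np

# Hypothetical setup: K = 3 classes, d = 2 features.
# One weight vector w_k and bias b_k per class (replicated weights).
W = np.array([[ 1.0, -0.5],
              [ 0.2,  0.9],
              [-1.0,  0.3]])    # shape (K, d)
b = np.array([0.1, -0.2, 0.0])  # shape (K,)

def predict(x):
    """Score all classes with w_k^T x - b_k, then predict via largest score."""
    scores = W @ x - b
    return int(np.argmax(scores))

print(predict(np.array([1.0, 2.0])))
```

For sequence prediction the question on the slide is exactly how large K must be, which the next slides answer.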

5 Multiclass Prediction. x = "Fish Sleep", y = (N, V). Treat every possible length-M sequence as a different class: (D, D), (D, N), (D, V), (D, Adj), (D, Adv), (D, Pr), (N, D), (N, N), (N, V), (N, Adj), (N, Adv), ... With L = 6 POS tags (Det, Noun, Verb, Adj, Adv, Prep): L^M classes! Length 2: 6^2 = 36!

6 Multiclass Prediction. x = "Fish Sleep", y = (N, V). Treating every possible length-M sequence as a different class is not tractable for sequence prediction: L^M classes (L = 6 POS tags; length 2 already gives 6^2 = 36). Exponential explosion in #classes!

7 Why is Naïve Multiclass Intractable? x = "I fish often". POS Tags: Det, Noun, Verb, Adj, Adv, Prep. (D, D, D), (D, D, N), (D, D, V), (D, D, Adj), (D, D, Adv), (D, D, Pr), (D, N, D), (D, N, N), (D, N, V), (D, N, Adj), (D, N, Adv), (D, N, Pr), (D, V, D), (D, V, N), (D, V, V), (D, V, Adj), (D, V, Adv), (D, V, Pr), (N, D, D), (N, D, N), (N, D, V), (N, D, Adj), (N, D, Adv), (N, D, Pr), (N, N, D), (N, N, N), (N, N, V), (N, N, Adj), (N, N, Adv), (N, N, Pr), ... Assume pronouns are nouns for simplicity.

8 Why is Naïve Multiclass Intractable? x = "I fish often". POS Tags: Det, Noun, Verb, Adj, Adv, Prep. Treats every combination as a different class (learns a model for each combination): an exponentially large representation! (Exponential time to consider every class; exponential storage.) Assume pronouns are nouns for simplicity.

9 Independent Classification. x = "I fish often". POS Tags: Det, Noun, Verb, Adj, Adv, Prep. Treat each word independently (assumption): independent multiclass prediction per word. Predict for x = "I" independently, for x = "fish" independently, for x = "often" independently; then concatenate the predictions. Assume pronouns are nouns for simplicity.

10 Independent Classification. x = "I fish often". POS Tags: Det, Noun, Verb, Adj, Adv, Prep. Treat each word independently: independent multiclass prediction per word, so #classes = #POS tags (6 in our example). Solvable using standard multiclass prediction. Assume pronouns are nouns for simplicity.

11 Independent Classification. x = "I fish often". POS Tags: Det, Noun, Verb, Adj, Adv, Prep. Treat each word independently: independent multiclass prediction per word, giving a table of P(y | x) for each word over the six tags (values omitted on the slide). Prediction: (N, N, Adv). Correct: (N, V, Adv). Why the mistake? Assume pronouns are nouns for simplicity.

12 Context Between Words. x = "I fish often". POS Tags: Det, Noun, Verb, Adj, Adv, Prep. Independent predictions ignore word pairs. In isolation, "fish" is more likely to be a Noun; but conditioned on following a (pro)noun, "fish" is more likely to be a Verb! 1st-order dependence models all pairs; 2nd order considers all triplets; arbitrary order = exponential size (naïve multiclass). Assume pronouns are nouns for simplicity.

13 1st Order Hidden Markov Model. x = (x_1, x_2, ..., x_M) (sequence of words); y = (y_1, y_2, ..., y_M) (sequence of POS tags). P(x_i | y_i): probability of state y_i generating x_i. P(y_{i+1} | y_i): probability of state y_i transitioning to y_{i+1}. P(y_1 | y_0): y_0 is defined to be the Start state. P(End | y_M): prior probability of y_M being the final state (not always used).

14 Graphical Model Representation. Y_0 → Y_1 → Y_2 → ... → Y_M → Y_End (End is optional), with each Y_i emitting X_i. P(x, y) = P(End | y_M) ∏_{i=1}^{M} P(y_i | y_{i-1}) ∏_{i=1}^{M} P(x_i | y_i)

15 1st Order Hidden Markov Model. Joint distribution: P(x, y) = P(End | y_M) ∏_{i=1}^{M} P(y_i | y_{i-1}) ∏_{i=1}^{M} P(x_i | y_i) (the End term is optional). P(x_i | y_i): probability of state y_i generating x_i. P(y_{i+1} | y_i): probability of state y_i transitioning to y_{i+1}. P(y_1 | y_0): y_0 is defined to be the Start state. P(End | y_M): prior probability of y_M being the final state (not always used).
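Since the joint factorizes per position, it can be evaluated in a single left-to-right pass. A minimal sketch, using the two-tag fish/sleep toy model that appears later in the lecture (those transition and emission values are assumed here):

```python
# Toy HMM: tags {noun, verb}, words {fish, sleep}.
# Transition probabilities P(y_1 | Start), P(y_{i+1} | y_i), P(End | y_M):
trans = {('start', 'noun'): 0.8, ('start', 'verb'): 0.2,
         ('noun', 'noun'): 0.1, ('noun', 'verb'): 0.8, ('noun', 'end'): 0.1,
         ('verb', 'noun'): 0.2, ('verb', 'verb'): 0.1, ('verb', 'end'): 0.7}
# Emission probabilities P(x_i | y_i):
emit = {('noun', 'fish'): 0.8, ('noun', 'sleep'): 0.2,
        ('verb', 'fish'): 0.5, ('verb', 'sleep'): 0.5}

def joint(x, y):
    """P(x, y) = P(End|y_M) * prod_i P(y_i|y_{i-1}) * prod_i P(x_i|y_i)."""
    p = 1.0
    prev = 'start'
    for word, tag in zip(x, y):
        p *= trans[(prev, tag)] * emit[(tag, word)]
        prev = tag
    return p * trans[(prev, 'end')]  # optional End-state term

print(joint(['fish', 'sleep'], ['noun', 'verb']))
```

The loop visits each position once, multiplying in one transition term and one emission term, exactly mirroring the two products in the formula.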

16 1st Order Hidden Markov Model. Conditional distribution of x given y: P(x | y) = ∏_{i=1}^{M} P(x_i | y_i). Given a POS tag sequence y, we can compute each P(x_i | y_i) independently (x_i is conditionally independent given y_i). (Definitions of P(x_i | y_i), P(y_{i+1} | y_i), P(y_1 | y_0), P(End | y_M) as before.)

17 1st Order Hidden Markov Model. Additional complexity of (#POS tags)^2: models all state-state pairs (all tag-tag pairs). Models all state-observation pairs (all tag-word pairs): same complexity as independent multiclass. (Definitions of P(x_i | y_i), P(y_{i+1} | y_i), P(y_1 | y_0), P(End | y_M) as before.)

18 Relationship to Naïve Bayes. If we ignore the transition probabilities, the chain Y_0 → Y_1 → ... → Y_M → Y_End (each Y_i emitting X_i) reduces to a sequence of disjoint Naïve Bayes models, one per position.

19 P(word | state/tag). Two-word language: "fish" and "sleep". Two-tag language: Noun and Verb.
P(x | y):   y = Noun   y = Verb
x = fish      0.8        0.5
x = sleep     0.2        0.5
Given a tag sequence y: P("fish sleep" | (N, V)) = 0.8 · 0.5; P("fish fish" | (N, V)) = 0.8 · 0.5; P("sleep fish" | (V, V)) = 0.5 · 0.5; P("sleep sleep" | (N, N)) = 0.2 · 0.2. Slides borrowed from Ralph Grishman

20 Sampling. HMMs are generative models: they model the joint distribution P(x, y), so we can generate samples from it. First consider the conditional distribution P(x | y), with the emission table above. Given tag sequence y = (N, V), sample each word independently: sample x_1 from P(x_1 | N) (0.8 fish, 0.2 sleep); sample x_2 from P(x_2 | V) (0.5 fish, 0.5 sleep). What about sampling from P(x, y)?

21 Forward Sampling of P(y, x). P(x, y) = P(End | y_M) ∏_{i=1}^{M} P(y_i | y_{i-1}) ∏_{i=1}^{M} P(x_i | y_i). Initialize y_0 = Start, i = 0. 1. i = i + 1. 2. Sample y_i from P(y_i | y_{i-1}). 3. If y_i == End: quit. 4. Sample x_i from P(x_i | y_i). 5. Go to Step 1. Exploits conditional independence; requires P(End | y_i). Slides borrowed from Ralph Grishman

22 Forward Sampling of P(y, x | length M). P(x, y) = ∏_{i=1}^{M} P(y_i | y_{i-1}) ∏_{i=1}^{M} P(x_i | y_i). Initialize y_0 = Start, i = 0. 1. i = i + 1. 2. If i > M: quit. 3. Sample y_i from P(y_i | y_{i-1}). 4. Sample x_i from P(x_i | y_i). 5. Go to Step 1. Exploits conditional independence; assumes no P(End | y_i). Slides borrowed from Ralph Grishman
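The fixed-length loop above can be sketched directly. This is a minimal sketch on the toy fish/sleep tables used in the lecture's numerical example; since this variant has no End state, transitions out of each tag are renormalized over the tags (an assumption of this sketch):

```python
import random

# Toy tables from the lecture's fish/sleep example.
trans = {('start', 'noun'): 0.8, ('start', 'verb'): 0.2,
         ('noun', 'noun'): 0.1, ('noun', 'verb'): 0.8, ('noun', 'end'): 0.1,
         ('verb', 'noun'): 0.2, ('verb', 'verb'): 0.1, ('verb', 'end'): 0.7}
emit = {('noun', 'fish'): 0.8, ('noun', 'sleep'): 0.2,
        ('verb', 'fish'): 0.5, ('verb', 'sleep'): 0.5}
TAGS, WORDS = ('noun', 'verb'), ('fish', 'sleep')

def draw(options, rng):
    """Sample a value from (value, weight) pairs; weights are renormalized."""
    total = sum(p for _, p in options)
    r = rng.random() * total
    acc = 0.0
    for v, p in options:
        acc += p
        if r <= acc:
            return v
    return options[-1][0]

def sample_fixed_length(M, rng):
    """Forward-sample (y, x) of length M: y_i ~ P(y_i|y_{i-1}), then x_i ~ P(x_i|y_i)."""
    y, x, prev = [], [], 'start'
    for _ in range(M):
        tag = draw([(t, trans[(prev, t)]) for t in TAGS], rng)
        word = draw([(w, emit[(tag, w)]) for w in WORDS], rng)
        y.append(tag)
        x.append(word)
        prev = tag
    return y, x

print(sample_fixed_length(2, random.Random(0)))
```

Over many samples, the first tag should be "noun" roughly 80% of the time, matching P(noun | start) = 0.8.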

23 1st Order Hidden Markov Model. P(x_{k+1:M}, y_{k+1:M} | x_{1:k}, y_{1:k}) = P(x_{k+1:M}, y_{k+1:M} | y_k). Memoryless: the model only needs y_k to model the rest of the sequence. (Definitions of P(x_i | y_i), P(y_{i+1} | y_i), P(y_1 | y_0), P(End | y_M) as before.)

24 Viterbi Algorithm

25 Most Common Prediction Problem. Given an input sentence, predict the POS tag sequence: argmax_y P(y | x). Naïve approach: try all possible y's and choose the one with highest probability. Exponential time: L^M possible y's.

26 Bayes's Rule. argmax_y P(y | x) = argmax_y P(y, x) / P(x) = argmax_y P(y, x) = argmax_y P(x | y) P(y), where P(x | y) = ∏_{i=1}^{M} P(x_i | y_i) and P(y) = P(End | y_M) ∏_{i=1}^{M} P(y_i | y_{i-1}).

27 argmax_y P(y, x) = argmax_y ∏_{i=1}^{M} P(y_i | y_{i-1}) ∏_{i=1}^{M} P(x_i | y_i) = argmax_{y_M} argmax_{y_{1:M-1}} P(y_M | y_{M-1}) P(x_M | y_M) P(y_{1:M-1}, x_{1:M-1}), where P(y_{1:k}, x_{1:k}) = P(x_{1:k} | y_{1:k}) P(y_{1:k}), P(x_{1:k} | y_{1:k}) = ∏_{i=1}^{k} P(x_i | y_i), and P(y_{1:k}) = ∏_{i=1}^{k} P(y_i | y_{i-1}). Exploit the memoryless property: the choice of y_M depends on y_{1:M-1} only via P(y_M | y_{M-1})!

28 Dynamic Programming. Input: x = (x_1, x_2, x_3, ..., x_M). Computed: the best length-k prefix ending in each tag, e.g. Ŷ_k(V) = [argmax_{y_{1:k-1}} P(y_{1:k-1} ⊕ V, x_{1:k})] ⊕ V and Ŷ_k(N) = [argmax_{y_{1:k-1}} P(y_{1:k-1} ⊕ N, x_{1:k})] ⊕ N, where ⊕ denotes sequence concatenation. Claim: Ŷ_{k+1}(V) = [argmax_{y_{1:k} ∈ {Ŷ_k(T)}_T} P(y_{1:k} ⊕ V, x_{1:k+1})] ⊕ V = [argmax_{y_{1:k} ∈ {Ŷ_k(T)}_T} P(y_{1:k}, x_{1:k}) P(y_{k+1} = V | y_k) P(x_{k+1} | y_{k+1} = V)] ⊕ V. The inner solutions are pre-computed: a recursive definition!

29 Solve: Ŷ_2(V) = [argmax_{y_1 ∈ {Ŷ_1(T)}_T} P(y_1, x_1) P(y_2 = V | y_1) P(x_2 | y_2 = V)] ⊕ V. Store each Ŷ_1(Z) and P(Ŷ_1(Z), x_1). In the trellis, each of Ŷ_2(V), Ŷ_2(D), Ŷ_2(N) picks its best predecessor y_1 among Ŷ_1(V), Ŷ_1(D), Ŷ_1(N). Ŷ_1(Z) is just Z.

30 Solve: Ŷ_2(V) = [argmax_{y_1 ∈ {Ŷ_1(T)}_T} P(y_1, x_1) P(y_2 = V | y_1) P(x_2 | y_2 = V)] ⊕ V. Store each Ŷ_1(Z) and P(Ŷ_1(Z), x_1). Ŷ_1(Z) is just Z. Ex: Ŷ_2(V) = (N, V), i.e. the best predecessor of V at step 2 was y_1 = N.

31 Solve: Ŷ_3(V) = [argmax_{y_{1:2} ∈ {Ŷ_2(T)}_T} P(y_{1:2}, x_{1:2}) P(y_3 = V | y_2) P(x_3 | y_3 = V)] ⊕ V. Store each Ŷ_1(Z) and P(Ŷ_1(Z), x_1); store each Ŷ_2(Z) and P(Ŷ_2(Z), x_{1:2}). Each of Ŷ_3(V), Ŷ_3(D), Ŷ_3(N) picks its best predecessor among Ŷ_2(V), Ŷ_2(D), Ŷ_2(N). Ex: Ŷ_2(V) = (N, V).

32 Solve: Ŷ_3(V) = [argmax_{y_{1:2} ∈ {Ŷ_2(T)}_T} P(y_{1:2}, x_{1:2}) P(y_3 = V | y_2) P(x_3 | y_3 = V)] ⊕ V. Store each Ŷ_1(Z) and P(Ŷ_1(Z), x_1); store each Ŷ_2(Z) and P(Ŷ_2(Z), x_{1:2}). Claim: we only need to check the stored solutions Ŷ_2(Z) for Z = V, D, N. Ex: Ŷ_2(V) = (N, V).

33 Solve: Ŷ_3(V) = [argmax_{y_{1:2} ∈ {Ŷ_2(T)}_T} P(y_{1:2}, x_{1:2}) P(y_3 = V | y_2) P(x_3 | y_3 = V)] ⊕ V. Claim: we only need to check the solutions Ŷ_2(Z), Z = V, D, N. Suppose Ŷ_2(V) = (N, V) but Ŷ_3(V) = (V, V, V); then (N, V, V) must have higher probability. The proof depends on the 1st-order property: the probabilities of (V, V, V) and (N, V, V) differ only in the three terms P(y_1 | y_0), P(x_1 | y_1), and P(y_2 | y_1), and none of these depend on y_3! Ex: Ŷ_2(V) = (N, V).

34 Solve: Ŷ_M(V) = [argmax_{y_{1:M-1} ∈ {Ŷ_{M-1}(T)}_T} P(y_{1:M-1}, x_{1:M-1}) P(y_M = V | y_{M-1}) P(x_M | y_M = V) P(End | y_M = V)] ⊕ V (the End term is optional). Store each Ŷ_1(Z) and P(Ŷ_1(Z), x_1), each Ŷ_2(Z) and P(Ŷ_2(Z), x_{1:2}), each Ŷ_3(Z) and P(Ŷ_3(Z), x_{1:3}), and so on up to Ŷ_M(Z). Ex: Ŷ_2(V) = (N, V); Ŷ_3(V) = (D, N, V).

35 Viterbi Algorithm. Solve argmax_y P(y | x) = argmax_y P(y, x) / P(x) = argmax_y P(y, x) = argmax_y P(x | y) P(y). For k = 1..M: iteratively solve for each Ŷ_k(Z), with Z looping over every POS tag. Predict the best Ŷ_M(Z). Also known as Maximum A Posteriori (MAP) inference.
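A compact sketch of this loop in code, using the two-tag fish/sleep model from the numerical example that follows (its transition and emission values are assumed here; plain probabilities are used for clarity, though logs are preferable for long sequences, as discussed below):

```python
# Viterbi on the toy fish/sleep HMM.
trans = {('start', 'noun'): 0.8, ('start', 'verb'): 0.2,
         ('noun', 'noun'): 0.1, ('noun', 'verb'): 0.8, ('noun', 'end'): 0.1,
         ('verb', 'noun'): 0.2, ('verb', 'verb'): 0.1, ('verb', 'end'): 0.7}
emit = {('noun', 'fish'): 0.8, ('noun', 'sleep'): 0.2,
        ('verb', 'fish'): 0.5, ('verb', 'sleep'): 0.5}
TAGS = ('noun', 'verb')

def viterbi(x):
    """Return (best tag sequence, its joint probability with x, End term included)."""
    # delta[z] = prob. of the best length-k prefix ending in tag z (the P(Y_hat_k(z), x_{1:k}) table);
    # path[z] = that prefix itself (the back pointers, stored explicitly).
    delta = {z: trans[('start', z)] * emit[(z, x[0])] for z in TAGS}
    path = {z: [z] for z in TAGS}
    for word in x[1:]:
        new_delta, new_path = {}, {}
        for z in TAGS:
            best = max(TAGS, key=lambda p: delta[p] * trans[(p, z)])
            new_delta[z] = delta[best] * trans[(best, z)] * emit[(z, word)]
            new_path[z] = path[best] + [z]
        delta, path = new_delta, new_path
    last = max(TAGS, key=lambda z: delta[z] * trans[(z, 'end')])
    return path[last], delta[last] * trans[(last, 'end')]

tags, p = viterbi(['fish', 'sleep'])
print(tags, p)
```

On x = (fish, sleep) this reproduces the worked example that follows: the decoded sequence is (noun, verb) with probability 0.256 · 0.7 = 0.1792.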

36 Numerical Example. x = (fish, sleep). Transitions: P(noun | start) = 0.8, P(verb | start) = 0.2; P(noun | noun) = 0.1, P(verb | noun) = 0.8, P(end | noun) = 0.1; P(noun | verb) = 0.2, P(verb | verb) = 0.1, P(end | verb) = 0.7. Emissions: P(fish | noun) = 0.8, P(sleep | noun) = 0.2, P(fish | verb) = 0.5, P(sleep | verb) = 0.5. Slides borrowed from Ralph Grishman

37 Initialization. Trellis scores: start = 1, noun = 0, verb = 0, end = 0. Slides borrowed from Ralph Grishman

38 Token 1: fish. verb = 0.2 · 0.5; noun = 0.8 · 0.8; end = 0. Slides borrowed from Ralph Grishman

39 Token 1: fish. verb = 0.1; noun = 0.64; end = 0. Slides borrowed from Ralph Grishman

40 Token 2: sleep (if fish is a verb). verb = 0.1 · 0.1 · 0.5; noun = 0.1 · 0.2 · 0.2. Slides borrowed from Ralph Grishman

41 Token 2: sleep (if fish is a verb). verb = 0.005; noun = 0.004. Slides borrowed from Ralph Grishman

42 Token 2: sleep (if fish is a noun). verb = 0.64 · 0.8 · 0.5; noun = 0.64 · 0.1 · 0.2. Slides borrowed from Ralph Grishman

43 Token 2: sleep (if fish is a noun). verb = 0.256; noun = 0.0128. Slides borrowed from Ralph Grishman

44 Token 2: sleep. Take the maximum, set back pointers: verb = max(0.005, 0.256) = 0.256 (back pointer: noun); noun = max(0.004, 0.0128) = 0.0128 (back pointer: noun). Slides borrowed from Ralph Grishman

45 Token 2: sleep. Take the maximum, set back pointers: verb = 0.256 and noun = 0.0128, both reached from noun. Slides borrowed from Ralph Grishman

46 Token 3: end. From verb: 0.256 · 0.7; from noun: 0.0128 · 0.1. Slides borrowed from Ralph Grishman

47 Token 3: end. Take the maximum, set back pointers: end = max(0.256 · 0.7, 0.0128 · 0.1) = 0.1792 (back pointer: verb). Slides borrowed from Ralph Grishman

48 Decode (follow the back pointers): fish = noun, sleep = verb; best path probability 0.256 · 0.7 = 0.1792. Slides borrowed from Ralph Grishman

49 Decode: fish = noun, sleep = verb. What might go wrong for long sequences? Underflow! Small numbers get repeatedly multiplied together and become exponentially small. Slides borrowed from Ralph Grishman

50 Viterbi Algorithm (w/ Log Probabilities). Solve argmax_y P(y | x) = argmax_y P(y, x) / P(x) = argmax_y P(y, x) = argmax_y log P(x | y) + log P(y). For k = 1..M: iteratively solve for each log Ŷ_k(Z), with Z looping over every POS tag. Predict the best log Ŷ_M(Z). The log-probability of Ŷ_M(Z) accumulates additively, not multiplicatively.

51 Recap: Independent Classification. x = "I fish often". POS Tags: Det, Noun, Verb, Adj, Adv, Prep. Treat each word independently: independent multiclass prediction per word via P(y | x) for each of "I", "fish", "often". Prediction: (N, N, Adv). Correct: (N, V, Adv). Mistake due to not modeling multiple words. Assume pronouns are nouns for simplicity.

52 Recap: Viterbi. Models pairwise transitions between states (pairwise transitions between POS tags): a 1st-order model. P(x, y) = P(End | y_M) ∏_{i=1}^{M} P(y_i | y_{i-1}) ∏_{i=1}^{M} P(x_i | y_i). x = "I fish often". Independent: (N, N, Adv). HMM Viterbi: (N, V, Adv). *Assuming we defined P(x, y) properly.

53 Training HMMs

54 Supervised Training. Given: S = {(x_i, y_i)}_{i=1}^{N} (word sequence (sentence) paired with POS tag sequence). Goal: estimate P(x, y) using S, where P(x, y) = P(End | y_M) ∏_{i=1}^{M} P(y_i | y_{i-1}) ∏_{i=1}^{M} P(x_i | y_i). Maximum likelihood!

55 Aside: Matrix Formulation. Define the transition matrix A with A_ab = P(y_{i+1} = a | y_i = b), or log P(y_{i+1} = a | y_i = b): a table P(y_next | y) over tag pairs (e.g. Noun/Verb). Define the observation matrix O with O_wz = P(x_i = w | y_i = z), or log P(x_i = w | y_i = z): a table P(x | y) over word-tag pairs (e.g. fish/sleep × Noun/Verb).

56 Aside: Matrix Formulation. P(x, y) = P(End | y_M) ∏_{i=1}^{M} P(y_i | y_{i-1}) ∏_{i=1}^{M} P(x_i | y_i) = A_{End,y_M} ∏_{i=1}^{M} A_{y_i,y_{i-1}} ∏_{i=1}^{M} O_{x_i,y_i}. Log-probability formulation: log P(x, y) = Ã_{End,y_M} + Σ_{i=1}^{M} Ã_{y_i,y_{i-1}} + Σ_{i=1}^{M} Õ_{x_i,y_i}, where each entry of Ã (and likewise Õ) is defined as the log of the corresponding entry of A (and O).

57 Maximum Likelihood. argmax_{A,O} ∏_{(x,y)∈S} P(x, y) = argmax_{A,O} ∏_{(x,y)∈S} P(End | y_M) ∏_{i=1}^{M} P(y_i | y_{i-1}) ∏_{i=1}^{M} P(x_i | y_i). Estimate each component separately:
A_ab = Σ_{j=1}^{N} Σ_{i=0}^{M_j-1} 1[(y^j_{i+1} = a) ∧ (y^j_i = b)] / Σ_{j=1}^{N} Σ_{i=0}^{M_j-1} 1[y^j_i = b]
O_wz = Σ_{j=1}^{N} Σ_{i=1}^{M_j} 1[(x^j_i = w) ∧ (y^j_i = z)] / Σ_{j=1}^{N} Σ_{i=1}^{M_j} 1[y^j_i = z]
(Derived via minimizing the negative log likelihood.)
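The counting formulas above are straightforward to compute in code. A minimal sketch on a made-up three-sentence labeled corpus (the sentences and tag sequences are illustrative only; here the transition denominator also counts transitions out of Start and into End):

```python
from collections import Counter

# Tiny made-up labeled corpus: (sentence, tag sequence) pairs.
S = [(['fish', 'sleep'], ['noun', 'verb']),
     (['fish', 'fish'],  ['noun', 'noun']),
     (['sleep'],         ['verb'])]

trans_count, out_count, emit_count = Counter(), Counter(), Counter()
for x, y in S:
    prev = 'start'
    for word, tag in zip(x, y):
        trans_count[(prev, tag)] += 1   # numerator of A: count(b -> a)
        out_count[prev] += 1            # denominator of A: count of transitions out of b
        emit_count[(tag, word)] += 1    # numerator of O: count(z emits w)
        prev = tag
    trans_count[(prev, 'end')] += 1     # optional End transition
    out_count[prev] += 1

# Maximum-likelihood estimates: A_ab = count(b -> a) / count(b), O_wz = count(z, w) / count(z).
tag_total = Counter(t for _, y in S for t in y)
A = {bt: c / out_count[bt[0]] for bt, c in trans_count.items()}
O = {tw: c / tag_total[tw[0]] for tw, c in emit_count.items()}
print(A[('start', 'noun')], O[('noun', 'fish')])
```

Each conditional distribution is just a ratio of counts, which is why supervised HMM training is "super easy", as the next slide puts it.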

58 Recap: Supervised Training. argmax_{A,O} ∏_{(x,y)∈S} P(x, y) = argmax_{A,O} ∏_{(x,y)∈S} P(End | y_M) ∏ P(y_i | y_{i-1}) ∏ P(x_i | y_i). Maximum likelihood training = counting statistics. Super easy! Why? What about the unsupervised case?


60 Conditional Independence Assumptions. argmax_{A,O} ∏_{(x,y)∈S} P(x, y) = argmax_{A,O} ∏_{(x,y)∈S} P(End | y_M) ∏ P(y_i | y_{i-1}) ∏ P(x_i | y_i). Everything decomposes into products of pairs; i.e., P(y_{i+1} = a | y_i = b) doesn't depend on anything else. So we can just estimate frequencies: how often y_{i+1} = a when y_i = b over the training set. Note that P(y_{i+1} = a | y_i = b) is a common model across all locations of all sequences.

61 Conditional Independence Assumptions. #Parameters: transitions A: #Tags²; observations O: #Words × #Tags. This avoids directly modeling word/word pairings (#Tags = 10s, #Words = 10000s). Everything still decomposes into products of pairs, each a common model across all locations of all sequences.

62 Unsupervised Training. What if there are no y's? Just a training set of sentences (word sequences): S = {x_i}_{i=1}^{N}. We still want to estimate P(x, y). How? Why? argmax ∏_i P(x_i) = argmax ∏_i Σ_y P(x_i, y).


64 Why Unsupervised Training? Supervised data is hard to acquire: it requires annotating POS tags. Unsupervised data is plentiful: just grab some text! It might just work for POS tagging: learn y's that correspond to POS tags. It can also be used for other tasks: detecting outlier sentences (sentences with low probability), or sampling new sentences.

65 EM Algorithm (Baum-Welch). If we had the y's → maximum likelihood. If we had (A, O) → predict the y's. Chicken vs. egg! 1. Initialize A and O arbitrarily. 2. Predict the probabilities of the y's for each training x (Expectation step). 3. Use the y's to estimate new (A, O) (Maximization step). 4. Repeat from Step 2 until convergence. http://en.wikipedia.org/wiki/Baum%E2%80%93Welch_algorithm

66 Expectation Step. Given (A, O): for each training x = (x_1, ..., x_M), predict P(y_i) for each position, i.e. a table of P(y_i = Noun), P(y_i = Det), P(y_i = Verb), ... for each of x_1, x_2, ..., x_M. This encodes the current model's beliefs about y: the marginal distribution of each y_i.

67 Recall: Matrix Formulation. Transition matrix A: A_ab = P(y_{i+1} = a | y_i = b), or log P(y_{i+1} = a | y_i = b). Observation matrix O: O_wz = P(x_i = w | y_i = z), or log P(x_i = w | y_i = z).

68 Maximization Step. Maximize the likelihood over the marginal distributions. Supervised: A_ab = Σ_{j=1}^{N} Σ_{i=0}^{M_j-1} 1[(y^j_{i+1} = a) ∧ (y^j_i = b)] / Σ_j Σ_i 1[y^j_i = b]; O_wz = Σ_j Σ_i 1[(x^j_i = w) ∧ (y^j_i = z)] / Σ_j Σ_i 1[y^j_i = z]. Unsupervised (replace the indicators with the marginals): A_ab = Σ_{j=1}^{N} Σ_{i=0}^{M_j-1} P(y^j_i = b, y^j_{i+1} = a) / Σ_j Σ_i P(y^j_i = b); O_wz = Σ_j Σ_{i: x^j_i = w} P(y^j_i = z) / Σ_j Σ_i P(y^j_i = z).

69 Computing Marginals (Forward-Backward Algorithm). Solving the E-step requires computing the marginals P(y_i = Noun), P(y_i = Det), P(y_i = Verb), ... for each of x_1, ..., x_M. This can be solved using dynamic programming, similar to Viterbi!

70 Notation. α_z(i) = P(x_{1:i}, y_i = Z | A, O): probability of observing the prefix x_{1:i} and having the i-th state be y_i = Z. β_z(i) = P(x_{i+1:M} | y_i = Z, A, O): probability of observing the suffix x_{i+1:M} given the i-th state y_i = Z. Computing marginals = combining the two terms: P(y_i = z | x) = α_z(i) β_z(i) / Σ_{z'} α_{z'}(i) β_{z'}(i). http://en.wikipedia.org/wiki/Baum%E2%80%93Welch_algorithm

71 Notation. α_z(i) = P(x_{1:i}, y_i = Z | A, O); β_z(i) = P(x_{i+1:M} | y_i = Z, A, O). The pairwise marginals also combine the two terms: P(y_{i-1} = a, y_i = b | x) = α_a(i-1) P(y_i = b | y_{i-1} = a) P(x_i | y_i = b) β_b(i) / Σ_{a',b'} α_{a'}(i-1) P(y_i = b' | y_{i-1} = a') P(x_i | y_i = b') β_{b'}(i). http://en.wikipedia.org/wiki/Baum%E2%80%93Welch_algorithm

72 Forward (sub-)Algorithm. Solve for every α_z(i) = P(x_{1:i}, y_i = Z | A, O). Naïvely: α_z(i) = Σ_{y_{1:i-1}} P(x_{1:i}, y_i = Z, y_{1:i-1} | A, O): exponential time! But it can be computed recursively (like Viterbi): α_z(1) = P(y_1 = z | y_0) P(x_1 | y_1 = z) = O_{x_1,z} A_{z,Start}; α_z(i+1) = O_{x_{i+1},z} Σ_{j=1}^{L} α_j(i) A_{z,j}. Viterbi effectively replaces the sum with a max.

73 Backward (sub-)Algorithm. Solve for every β_z(i) = P(x_{i+1:M} | y_i = Z, A, O). Naïvely: β_z(i) = Σ_{y_{i+1:M}} P(x_{i+1:M}, y_{i+1:M} | y_i = Z, A, O): exponential time! But it can be computed recursively (like Viterbi): β_z(M) = 1; β_z(i) = Σ_{j=1}^{L} β_j(i+1) A_{j,z} O_{x_{i+1},j}.

74 Forward-Backward Algorithm. Run the forward pass: α_z(i) = P(x_{1:i}, y_i = Z | A, O). Run the backward pass: β_z(i) = P(x_{i+1:M} | y_i = Z, A, O). For each training x = (x_1, ..., x_M), compute each marginal P(y_i) for y = (y_1, ..., y_M): P(y_i = z | x) = α_z(i) β_z(i) / Σ_{z'} α_{z'}(i) β_{z'}(i).
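As a sketch, here are the two recursions and the marginal combination on the toy fish/sleep model from the numerical example (those transition and emission values are assumed; the optional End term is omitted, matching β_z(M) = 1 above):

```python
# Forward-backward marginals on the toy fish/sleep HMM.
trans = {('start', 'noun'): 0.8, ('start', 'verb'): 0.2,
         ('noun', 'noun'): 0.1, ('noun', 'verb'): 0.8, ('noun', 'end'): 0.1,
         ('verb', 'noun'): 0.2, ('verb', 'verb'): 0.1, ('verb', 'end'): 0.7}
emit = {('noun', 'fish'): 0.8, ('noun', 'sleep'): 0.2,
        ('verb', 'fish'): 0.5, ('verb', 'sleep'): 0.5}
TAGS = ('noun', 'verb')

def forward(x):
    """alpha[i][z] = P(x_{1:i+1}, y_{i+1} = z)  (0-indexed positions)."""
    alpha = [{z: trans[('start', z)] * emit[(z, x[0])] for z in TAGS}]
    for word in x[1:]:
        prev = alpha[-1]
        alpha.append({z: emit[(z, word)] * sum(prev[j] * trans[(j, z)] for j in TAGS)
                      for z in TAGS})
    return alpha

def backward(x):
    """beta[i][z] = P(x_{i+2:M} | y_{i+1} = z), with beta_z(M) = 1 (no End term)."""
    M = len(x)
    beta = [dict.fromkeys(TAGS, 1.0) for _ in range(M)]
    for i in range(M - 2, -1, -1):
        beta[i] = {z: sum(beta[i + 1][j] * trans[(z, j)] * emit[(j, x[i + 1])] for j in TAGS)
                   for z in TAGS}
    return beta

def marginals(x):
    """P(y_i = z | x) = alpha_z(i) beta_z(i) / sum_z' alpha_z'(i) beta_z'(i)."""
    a, b = forward(x), backward(x)
    out = []
    for ai, bi in zip(a, b):
        norm = sum(ai[z] * bi[z] for z in TAGS)
        out.append({z: ai[z] * bi[z] / norm for z in TAGS})
    return out

print(marginals(['fish', 'sleep']))
```

A useful sanity check: the normalizer Σ_z α_z(i) β_z(i) equals P(x) and is the same at every position i.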

75 Recap: Unsupervised Training. Train using only word sequences (sentences): S = {x_i}_{i=1}^{N}. The y's are hidden states: all pairwise transitions run through the y's, hence "hidden" Markov model. Train using the EM algorithm; it converges to a local optimum.

76 Initialization. How do we choose the number of hidden states? By hand, or by cross-validation: evaluate P(x) on validation data. P(x) can be computed via the forward algorithm: P(x) = Σ_y P(x, y) = Σ_z α_z(M) P(End | y_M = z).

77 Recap: Sequence Prediction & HMMs. Models pairwise dependences in sequences. x = "I fish often"; POS Tags: Det, Noun, Verb, Adj, Adv, Prep. Independent: (N, N, Adv). HMM Viterbi: (N, V, Adv). Compact: only models pairwise dependences between the y's. Main limitation: lots of independence assumptions, hence poor predictive accuracy.

78 Next Week. Conditional Random Fields: a sequential version of logistic regression. Removes many independence assumptions; more accurate in practice; can only be trained in the supervised setting. Recitation tonight: recap of Viterbi and Forward/Backward.


More information

COMP90051 Statistical Machine Learning

COMP90051 Statistical Machine Learning COMP90051 Statistical Machine Learning Semester 2, 2017 Lecturer: Trevor Cohn 24. Hidden Markov Models & message passing Looking back Representation of joint distributions Conditional/marginal independence

More information

Hidden Markov Models. By Parisa Abedi. Slides courtesy: Eric Xing

Hidden Markov Models. By Parisa Abedi. Slides courtesy: Eric Xing Hidden Markov Models By Parisa Abedi Slides courtesy: Eric Xing i.i.d to sequential data So far we assumed independent, identically distributed data Sequential (non i.i.d.) data Time-series data E.g. Speech

More information

Introduction to Machine Learning CMU-10701

Introduction to Machine Learning CMU-10701 Introduction to Machine Learning CMU-10701 Hidden Markov Models Barnabás Póczos & Aarti Singh Slides courtesy: Eric Xing i.i.d to sequential data So far we assumed independent, identically distributed

More information

Today. Finish word sense disambigua1on. Midterm Review

Today. Finish word sense disambigua1on. Midterm Review Today Finish word sense disambigua1on Midterm Review The midterm is Thursday during class 1me. H students and 20 regular students are assigned to 517 Hamilton. Make-up for those with same 1me midterm is

More information

10 : HMM and CRF. 1 Case Study: Supervised Part-of-Speech Tagging

10 : HMM and CRF. 1 Case Study: Supervised Part-of-Speech Tagging 10-708: Probabilistic Graphical Models 10-708, Spring 2018 10 : HMM and CRF Lecturer: Kayhan Batmanghelich Scribes: Ben Lengerich, Michael Kleyman 1 Case Study: Supervised Part-of-Speech Tagging We will

More information

Midterm sample questions

Midterm sample questions Midterm sample questions CS 585, Brendan O Connor and David Belanger October 12, 2014 1 Topics on the midterm Language concepts Translation issues: word order, multiword translations Human evaluation Parts

More information

Empirical Methods in Natural Language Processing Lecture 11 Part-of-speech tagging and HMMs

Empirical Methods in Natural Language Processing Lecture 11 Part-of-speech tagging and HMMs Empirical Methods in Natural Language Processing Lecture 11 Part-of-speech tagging and HMMs (based on slides by Sharon Goldwater and Philipp Koehn) 21 February 2018 Nathan Schneider ENLP Lecture 11 21

More information

Sequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them

Sequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them HMM, MEMM and CRF 40-957 Special opics in Artificial Intelligence: Probabilistic Graphical Models Sharif University of echnology Soleymani Spring 2014 Sequence labeling aking collective a set of interrelated

More information

Hidden Markov Models

Hidden Markov Models Hidden Markov Models Slides mostly from Mitch Marcus and Eric Fosler (with lots of modifications). Have you seen HMMs? Have you seen Kalman filters? Have you seen dynamic programming? HMMs are dynamic

More information

Sequence Modelling with Features: Linear-Chain Conditional Random Fields. COMP-599 Oct 6, 2015

Sequence Modelling with Features: Linear-Chain Conditional Random Fields. COMP-599 Oct 6, 2015 Sequence Modelling with Features: Linear-Chain Conditional Random Fields COMP-599 Oct 6, 2015 Announcement A2 is out. Due Oct 20 at 1pm. 2 Outline Hidden Markov models: shortcomings Generative vs. discriminative

More information

Hidden Markov Models in Language Processing

Hidden Markov Models in Language Processing Hidden Markov Models in Language Processing Dustin Hillard Lecture notes courtesy of Prof. Mari Ostendorf Outline Review of Markov models What is an HMM? Examples General idea of hidden variables: implications

More information

A.I. in health informatics lecture 8 structured learning. kevin small & byron wallace

A.I. in health informatics lecture 8 structured learning. kevin small & byron wallace A.I. in health informatics lecture 8 structured learning kevin small & byron wallace today models for structured learning: HMMs and CRFs structured learning is particularly useful in biomedical applications:

More information

UVA CS / Introduc8on to Machine Learning and Data Mining

UVA CS / Introduc8on to Machine Learning and Data Mining UVA CS 4501-001 / 6501 007 Introduc8on to Machine Learning and Data Mining Lecture 13: Probability and Sta3s3cs Review (cont.) + Naïve Bayes Classifier Yanjun Qi / Jane, PhD University of Virginia Department

More information

Intelligent Systems (AI-2)

Intelligent Systems (AI-2) Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 19 Oct, 24, 2016 Slide Sources Raymond J. Mooney University of Texas at Austin D. Koller, Stanford CS - Probabilistic Graphical Models D. Page,

More information

LECTURER: BURCU CAN Spring

LECTURER: BURCU CAN Spring LECTURER: BURCU CAN 2017-2018 Spring Regular Language Hidden Markov Model (HMM) Context Free Language Context Sensitive Language Probabilistic Context Free Grammar (PCFG) Unrestricted Language PCFGs can

More information

Administrivia. What is Information Extraction. Finite State Models. Graphical Models. Hidden Markov Models (HMMs) for Information Extraction

Administrivia. What is Information Extraction. Finite State Models. Graphical Models. Hidden Markov Models (HMMs) for Information Extraction Administrivia Hidden Markov Models (HMMs) for Information Extraction Group meetings next week Feel free to rev proposals thru weekend Daniel S. Weld CSE 454 What is Information Extraction Landscape of

More information

Part of Speech Tagging: Viterbi, Forward, Backward, Forward- Backward, Baum-Welch. COMP-599 Oct 1, 2015

Part of Speech Tagging: Viterbi, Forward, Backward, Forward- Backward, Baum-Welch. COMP-599 Oct 1, 2015 Part of Speech Tagging: Viterbi, Forward, Backward, Forward- Backward, Baum-Welch COMP-599 Oct 1, 2015 Announcements Research skills workshop today 3pm-4:30pm Schulich Library room 313 Start thinking about

More information

Lecture 11: Viterbi and Forward Algorithms

Lecture 11: Viterbi and Forward Algorithms Lecture 11: iterbi and Forward lgorithms Kai-Wei Chang CS @ University of irginia kw@kwchang.net Couse webpage: http://kwchang.net/teaching/lp16 CS6501 atural Language Processing 1 Quiz 1 Quiz 1 30 25

More information

Lecture 12: EM Algorithm

Lecture 12: EM Algorithm Lecture 12: EM Algorithm Kai-Wei hang S @ University of Virginia kw@kwchang.net ouse webpage: http://kwchang.net/teaching/nlp16 S6501 Natural Language Processing 1 Three basic problems for MMs v Likelihood

More information

Lecture 11: Hidden Markov Models

Lecture 11: Hidden Markov Models Lecture 11: Hidden Markov Models Cognitive Systems - Machine Learning Cognitive Systems, Applied Computer Science, Bamberg University slides by Dr. Philip Jackson Centre for Vision, Speech & Signal Processing

More information

STA 414/2104: Machine Learning

STA 414/2104: Machine Learning STA 414/2104: Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistics! rsalakhu@cs.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 9 Sequential Data So far

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project

More information

Computa(on of Similarity Similarity Search as Computa(on. Stoyan Mihov and Klaus U. Schulz

Computa(on of Similarity Similarity Search as Computa(on. Stoyan Mihov and Klaus U. Schulz Computa(on of Similarity Similarity Search as Computa(on Stoyan Mihov and Klaus U. Schulz Roadmap What is similarity computa(on? NLP examples of similarity computa(on Machine transla(on Speech recogni(on

More information

Intelligent Systems (AI-2)

Intelligent Systems (AI-2) Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 19 Oct, 23, 2015 Slide Sources Raymond J. Mooney University of Texas at Austin D. Koller, Stanford CS - Probabilistic Graphical Models D. Page,

More information

STAD68: Machine Learning

STAD68: Machine Learning STAD68: Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! h0p://www.cs.toronto.edu/~rsalakhu/ Lecture 1 Evalua;on 3 Assignments worth 40%. Midterm worth 20%. Final

More information

Hidden Markov Models. Ivan Gesteira Costa Filho IZKF Research Group Bioinformatics RWTH Aachen Adapted from:

Hidden Markov Models. Ivan Gesteira Costa Filho IZKF Research Group Bioinformatics RWTH Aachen Adapted from: Hidden Markov Models Ivan Gesteira Costa Filho IZKF Research Group Bioinformatics RWTH Aachen Adapted from: www.ioalgorithms.info Outline CG-islands The Fair Bet Casino Hidden Markov Model Decoding Algorithm

More information

Lecture 13: Discriminative Sequence Models (MEMM and Struct. Perceptron)

Lecture 13: Discriminative Sequence Models (MEMM and Struct. Perceptron) Lecture 13: Discriminative Sequence Models (MEMM and Struct. Perceptron) Intro to NLP, CS585, Fall 2014 http://people.cs.umass.edu/~brenocon/inlp2014/ Brendan O Connor (http://brenocon.com) 1 Models for

More information

Probabilistic Graphical Models

Probabilistic Graphical Models CS 1675: Intro to Machine Learning Probabilistic Graphical Models Prof. Adriana Kovashka University of Pittsburgh November 27, 2018 Plan for This Lecture Motivation for probabilistic graphical models Directed

More information

Hidden Markov Models

Hidden Markov Models CS 2750: Machine Learning Hidden Markov Models Prof. Adriana Kovashka University of Pittsburgh March 21, 2016 All slides are from Ray Mooney Motivating Example: Part Of Speech Tagging Annotate each word

More information

Bayesian Networks Introduction to Machine Learning. Matt Gormley Lecture 24 April 9, 2018

Bayesian Networks Introduction to Machine Learning. Matt Gormley Lecture 24 April 9, 2018 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Bayesian Networks Matt Gormley Lecture 24 April 9, 2018 1 Homework 7: HMMs Reminders

More information

So# Inference and Posterior Marginals. September 19, 2013

So# Inference and Posterior Marginals. September 19, 2013 So# Inference and Posterior Marginals September 19, 2013 So# vs. Hard Inference Hard inference Give me a single solucon Viterbi algorithm Maximum spanning tree (Chu- Liu- Edmonds alg.) So# inference Task

More information

Conditional Random Fields for Sequential Supervised Learning

Conditional Random Fields for Sequential Supervised Learning Conditional Random Fields for Sequential Supervised Learning Thomas G. Dietterich Adam Ashenfelter Department of Computer Science Oregon State University Corvallis, Oregon 97331 http://www.eecs.oregonstate.edu/~tgd

More information

CISC 889 Bioinformatics (Spring 2004) Hidden Markov Models (II)

CISC 889 Bioinformatics (Spring 2004) Hidden Markov Models (II) CISC 889 Bioinformatics (Spring 24) Hidden Markov Models (II) a. Likelihood: forward algorithm b. Decoding: Viterbi algorithm c. Model building: Baum-Welch algorithm Viterbi training Hidden Markov models

More information

Brief Introduction of Machine Learning Techniques for Content Analysis

Brief Introduction of Machine Learning Techniques for Content Analysis 1 Brief Introduction of Machine Learning Techniques for Content Analysis Wei-Ta Chu 2008/11/20 Outline 2 Overview Gaussian Mixture Model (GMM) Hidden Markov Model (HMM) Support Vector Machine (SVM) Overview

More information

Where are we? è Five major sec3ons of this course

Where are we? è Five major sec3ons of this course UVA CS 4501-001 / 6501 007 Introduc8on to Machine Learning and Data Mining Lecture 12: Probability and Sta3s3cs Review Yanjun Qi / Jane University of Virginia Department of Computer Science 10/02/14 1

More information

Hidden Markov Models Part 2: Algorithms

Hidden Markov Models Part 2: Algorithms Hidden Markov Models Part 2: Algorithms CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 Hidden Markov Model An HMM consists of:

More information

A brief introduction to Conditional Random Fields

A brief introduction to Conditional Random Fields A brief introduction to Conditional Random Fields Mark Johnson Macquarie University April, 2005, updated October 2010 1 Talk outline Graphical models Maximum likelihood and maximum conditional likelihood

More information

Advanced Data Science

Advanced Data Science Advanced Data Science Dr. Kira Radinsky Slides Adapted from Tom M. Mitchell Agenda Topics Covered: Time series data Markov Models Hidden Markov Models Dynamic Bayes Nets Additional Reading: Bishop: Chapter

More information

Statistical NLP for the Web Log Linear Models, MEMM, Conditional Random Fields

Statistical NLP for the Web Log Linear Models, MEMM, Conditional Random Fields Statistical NLP for the Web Log Linear Models, MEMM, Conditional Random Fields Sameer Maskey Week 13, Nov 28, 2012 1 Announcements Next lecture is the last lecture Wrap up of the semester 2 Final Project

More information

Lecture 15. Probabilistic Models on Graph

Lecture 15. Probabilistic Models on Graph Lecture 15. Probabilistic Models on Graph Prof. Alan Yuille Spring 2014 1 Introduction We discuss how to define probabilistic models that use richly structured probability distributions and describe how

More information

Lecture 3: ASR: HMMs, Forward, Viterbi

Lecture 3: ASR: HMMs, Forward, Viterbi Original slides by Dan Jurafsky CS 224S / LINGUIST 285 Spoken Language Processing Andrew Maas Stanford University Spring 2017 Lecture 3: ASR: HMMs, Forward, Viterbi Fun informative read on phonetics The

More information

An Introduc+on to Sta+s+cs and Machine Learning for Quan+ta+ve Biology. Anirvan Sengupta Dept. of Physics and Astronomy Rutgers University

An Introduc+on to Sta+s+cs and Machine Learning for Quan+ta+ve Biology. Anirvan Sengupta Dept. of Physics and Astronomy Rutgers University An Introduc+on to Sta+s+cs and Machine Learning for Quan+ta+ve Biology Anirvan Sengupta Dept. of Physics and Astronomy Rutgers University Why Do We Care? Necessity in today s labs Principled approach:

More information

Par$cle Filters Part I: Theory. Peter Jan van Leeuwen Data- Assimila$on Research Centre DARC University of Reading

Par$cle Filters Part I: Theory. Peter Jan van Leeuwen Data- Assimila$on Research Centre DARC University of Reading Par$cle Filters Part I: Theory Peter Jan van Leeuwen Data- Assimila$on Research Centre DARC University of Reading Reading July 2013 Why Data Assimila$on Predic$on Model improvement: - Parameter es$ma$on

More information

Machine Learning & Data Mining CS/CNS/EE 155. Lecture 4: Recent Applica6ons of Lasso

Machine Learning & Data Mining CS/CNS/EE 155. Lecture 4: Recent Applica6ons of Lasso Machine Learning Data Mining CS/CNS/EE 155 Lecture 4: Recent Applica6ons of Lasso 1 Today: Two Recent Applica6ons Cancer Detec0on Personaliza0on via twi9er music( Biden( soccer( Labour( Applica6ons of

More information

Discrimina)ve Latent Variable Models. SPFLODD November 15, 2011

Discrimina)ve Latent Variable Models. SPFLODD November 15, 2011 Discrimina)ve Latent Variable Models SPFLODD November 15, 2011 Lecture Plan 1. Latent variables in genera)ve models (review) 2. Latent variables in condi)onal models 3. Latent variables in structural SVMs

More information

Hidden Markov Models and Applica2ons. Spring 2017 February 21,23, 2017

Hidden Markov Models and Applica2ons. Spring 2017 February 21,23, 2017 Hidden Markov Models and Applica2ons Spring 2017 February 21,23, 2017 Gene finding in prokaryotes Reading frames A protein is coded by groups of three nucleo2des (codons): ACGTACGTACGTACGT ACG-TAC-GTA-CGT-ACG-T

More information

Log-Linear Models, MEMMs, and CRFs

Log-Linear Models, MEMMs, and CRFs Log-Linear Models, MEMMs, and CRFs Michael Collins 1 Notation Throughout this note I ll use underline to denote vectors. For example, w R d will be a vector with components w 1, w 2,... w d. We use expx

More information

Natural Language Processing

Natural Language Processing SFU NatLangLab Natural Language Processing Anoop Sarkar anoopsarkar.github.io/nlp-class Simon Fraser University October 20, 2017 0 Natural Language Processing Anoop Sarkar anoopsarkar.github.io/nlp-class

More information

lecture 6: modeling sequences (final part)

lecture 6: modeling sequences (final part) Natural Language Processing 1 lecture 6: modeling sequences (final part) Ivan Titov Institute for Logic, Language and Computation Outline After a recap: } Few more words about unsupervised estimation of

More information

Last Lecture Recap UVA CS / Introduc8on to Machine Learning and Data Mining. Lecture 3: Linear Regression

Last Lecture Recap UVA CS / Introduc8on to Machine Learning and Data Mining. Lecture 3: Linear Regression UVA CS 4501-001 / 6501 007 Introduc8on to Machine Learning and Data Mining Lecture 3: Linear Regression Yanjun Qi / Jane University of Virginia Department of Computer Science 1 Last Lecture Recap q Data

More information

We Live in Exciting Times. CSCI-567: Machine Learning (Spring 2019) Outline. Outline. ACM (an international computing research society) has named

We Live in Exciting Times. CSCI-567: Machine Learning (Spring 2019) Outline. Outline. ACM (an international computing research society) has named We Live in Exciting Times ACM (an international computing research society) has named CSCI-567: Machine Learning (Spring 2019) Prof. Victor Adamchik U of Southern California Apr. 2, 2019 Yoshua Bengio,

More information

Statistical Methods for NLP

Statistical Methods for NLP Statistical Methods for NLP Sequence Models Joakim Nivre Uppsala University Department of Linguistics and Philology joakim.nivre@lingfil.uu.se Statistical Methods for NLP 1(21) Introduction Structured

More information

Hidden Markov Models The three basic HMM problems (note: change in notation) Mitch Marcus CSE 391

Hidden Markov Models The three basic HMM problems (note: change in notation) Mitch Marcus CSE 391 Hidden Markov Models The three basic HMM problems (note: change in notation) Mitch Marcus CSE 391 Parameters of an HMM States: A set of states S=s 1, s n Transition probabilities: A= a 1,1, a 1,2,, a n,n

More information

Hidden Markov Models

Hidden Markov Models CS769 Spring 2010 Advanced Natural Language Processing Hidden Markov Models Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu 1 Part-of-Speech Tagging The goal of Part-of-Speech (POS) tagging is to label each

More information

Hidden Markov Models

Hidden Markov Models Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training

More information

Conditional Random Field

Conditional Random Field Introduction Linear-Chain General Specific Implementations Conclusions Corso di Elaborazione del Linguaggio Naturale Pisa, May, 2011 Introduction Linear-Chain General Specific Implementations Conclusions

More information

Basic Text Analysis. Hidden Markov Models. Joakim Nivre. Uppsala University Department of Linguistics and Philology

Basic Text Analysis. Hidden Markov Models. Joakim Nivre. Uppsala University Department of Linguistics and Philology Basic Text Analysis Hidden Markov Models Joakim Nivre Uppsala University Department of Linguistics and Philology joakimnivre@lingfiluuse Basic Text Analysis 1(33) Hidden Markov Models Markov models are

More information

More on HMMs and other sequence models. Intro to NLP - ETHZ - 18/03/2013

More on HMMs and other sequence models. Intro to NLP - ETHZ - 18/03/2013 More on HMMs and other sequence models Intro to NLP - ETHZ - 18/03/2013 Summary Parts of speech tagging HMMs: Unsupervised parameter estimation Forward Backward algorithm Bayesian variants Discriminative

More information

Part 2. Representation Learning Algorithms

Part 2. Representation Learning Algorithms 53 Part 2 Representation Learning Algorithms 54 A neural network = running several logistic regressions at the same time If we feed a vector of inputs through a bunch of logis;c regression func;ons, then

More information

Partially Directed Graphs and Conditional Random Fields. Sargur Srihari

Partially Directed Graphs and Conditional Random Fields. Sargur Srihari Partially Directed Graphs and Conditional Random Fields Sargur srihari@cedar.buffalo.edu 1 Topics Conditional Random Fields Gibbs distribution and CRF Directed and Undirected Independencies View as combination

More information

Generative Model (Naïve Bayes, LDA)

Generative Model (Naïve Bayes, LDA) Generative Model (Naïve Bayes, LDA) IST557 Data Mining: Techniques and Applications Jessie Li, Penn State University Materials from Prof. Jia Li, sta3s3cal learning book (Has3e et al.), and machine learning

More information

Hidden Markov Models. x 1 x 2 x 3 x K

Hidden Markov Models. x 1 x 2 x 3 x K Hidden Markov Models 1 1 1 1 2 2 2 2 K K K K x 1 x 2 x 3 x K Viterbi, Forward, Backward VITERBI FORWARD BACKWARD Initialization: V 0 (0) = 1 V k (0) = 0, for all k > 0 Initialization: f 0 (0) = 1 f k (0)

More information

Founda'ons of Large- Scale Mul'media Informa'on Management and Retrieval. Lecture #3 Machine Learning. Edward Chang

Founda'ons of Large- Scale Mul'media Informa'on Management and Retrieval. Lecture #3 Machine Learning. Edward Chang Founda'ons of Large- Scale Mul'media Informa'on Management and Retrieval Lecture #3 Machine Learning Edward Y. Chang Edward Chang Founda'ons of LSMM 1 Edward Chang Foundations of LSMM 2 Machine Learning

More information

CSCI 5832 Natural Language Processing. Today 2/19. Statistical Sequence Classification. Lecture 9

CSCI 5832 Natural Language Processing. Today 2/19. Statistical Sequence Classification. Lecture 9 CSCI 5832 Natural Language Processing Jim Martin Lecture 9 1 Today 2/19 Review HMMs for POS tagging Entropy intuition Statistical Sequence classifiers HMMs MaxEnt MEMMs 2 Statistical Sequence Classification

More information

CS838-1 Advanced NLP: Hidden Markov Models

CS838-1 Advanced NLP: Hidden Markov Models CS838-1 Advanced NLP: Hidden Markov Models Xiaojin Zhu 2007 Send comments to jerryzhu@cs.wisc.edu 1 Part of Speech Tagging Tag each word in a sentence with its part-of-speech, e.g., The/AT representative/nn

More information

STA 4273H: Sta-s-cal Machine Learning

STA 4273H: Sta-s-cal Machine Learning STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 1 Evalua:on

More information

CS 7180: Behavioral Modeling and Decision- making in AI

CS 7180: Behavioral Modeling and Decision- making in AI CS 7180: Behavioral Modeling and Decision- making in AI Hidden Markov Models Prof. Amy Sliva October 26, 2012 Par?ally observable temporal domains POMDPs represented uncertainty about the state Belief

More information

Statistical Sequence Recognition and Training: An Introduction to HMMs

Statistical Sequence Recognition and Training: An Introduction to HMMs Statistical Sequence Recognition and Training: An Introduction to HMMs EECS 225D Nikki Mirghafori nikki@icsi.berkeley.edu March 7, 2005 Credit: many of the HMM slides have been borrowed and adapted, with

More information