CSC401/2511 Natural Language Computing, Spring 2019, Lecture 5. Frank Rudzicz and Chloé Pou-Prom, University of Toronto


1 CSC401/2511 Natural Language Computing, Spring 2019, Lecture 5. Frank Rudzicz and Chloé Pou-Prom, University of Toronto

2 Definition of an HMM θ. A hidden Markov model (HMM) is specified by the 5-tuple {S, W, Π, A, B}:
S = {s_1, …, s_N}: set of states (e.g., moods)
W = {w_1, …, w_K}: output alphabet (e.g., words)
Π = {π_1, …, π_N}: initial state probabilities
A = {a_ij}, i, j ∈ S: state transition probabilities
B = {b_i(w)}, i ∈ S, w ∈ W: state output probabilities
yielding
Q = {q_0, …, q_T}, q_i ∈ S: state sequence
O = {o_0, …, o_T}, o_i ∈ W: output sequence
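
As a concrete illustration, here is a minimal sketch (not from the lecture) of how θ = {S, W, Π, A, B} might be stored as arrays; the two states and three words are invented placeholders.

```python
import numpy as np

states = ["s1", "s2"]             # S: set of states
vocab = ["w1", "w2", "w3"]        # W: output alphabet

Pi = np.array([0.6, 0.4])         # Π: initial state probabilities (sums to 1)
A = np.array([[0.7, 0.3],         # A[i, j] = a_ij = P(q_{t+1} = j | q_t = i)
              [0.2, 0.8]])        # (each row sums to 1)
B = np.array([[0.5, 0.4, 0.1],    # B[i, w] = b_i(w) = P(o_t = w | q_t = i)
              [0.1, 0.3, 0.6]])   # (each row sums to 1)
```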

3 Fundamental tasks for HMMs. 1. Given a model with particular parameters θ = {Π, A, B}, how do we efficiently compute the likelihood of a particular observation sequence, P(O; θ)? We previously computed the probabilities of word sequences using N-grams. The probability of a particular sequence is usually useful as a means to some other end.

4 Fundamental tasks for HMMs. 2. Given an observation sequence O and a model θ, how do we choose a state sequence Q = {q_0, …, q_T} that best explains the observations? This is the task of inference, i.e., guessing at the best explanation of unknown ('latent') variables given our model. This is often an important part of classification.

5 Fundamental tasks for HMMs. 3. Given a large observation sequence O, how do we choose the best parameters θ = {Π, A, B} that explain the data O? This is the task of training. As before, we want our parameters to be set so that the available training data is maximally likely, but doing so will involve guessing unseen information.

6 Fundamental tasks for HMMs. 2. Given an observation sequence O and a model θ, how do we choose a state sequence Q = {q_0, …, q_T} that best explains the observations? This is the task of inference, i.e., guessing at the best explanation of unknown ('latent') variables given our model. This is often an important part of classification.

7 Example PoS state sequences. Will/MD the/DT chair/NN chair/?? the/DT meeting/NN from/IN that/DT chair/NN?
a) MD DT NN VB: Will the chair chair …
b) MD DT NN NN: Will the chair chair …

8 Task 2: Choosing Q* = {q_0 … q_T}. The purpose of finding the best state sequence Q* out of all possible state sequences Q is that it tells us what is most likely to be going on 'under the hood'. E.g., it tells us the most likely part-of-speech tags. E.g., it tells us the most likely English words given French translations (*in a very simple model). With the Forward algorithm, we didn't care about specific state sequences; we were summing over all possible state sequences.

9 Task 2: Choosing Q* = {q_0 … q_T}. In other words, Q* = argmax_Q P(O, Q; θ), where
P(O, Q; θ) = π_{q_0} b_{q_0}(o_0) ∏_{t=1}^{T} a_{q_{t-1} q_t} b_{q_t}(o_t)
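
The product above translates directly into code. A small sketch (my own function and variable names, not the lecture's) that scores one candidate state sequence Q against an observation sequence O, both given as integer indices into S and W, with Pi, A, B as numpy arrays like those in the earlier sketch:

```python
def joint_prob(O, Q, Pi, A, B):
    """P(O, Q; θ) = π_{q_0} b_{q_0}(o_0) ∏_{t=1..T} a_{q_{t-1} q_t} b_{q_t}(o_t)."""
    p = Pi[Q[0]] * B[Q[0], O[0]]
    for t in range(1, len(O)):
        p *= A[Q[t - 1], Q[t]] * B[Q[t], O[t]]
    return p
```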

10 Recall. Observation likelihoods depend on the state, which changes over time. We cannot simply choose the state that maximizes the probability of o_t without considering the state sequence. (Slide shows three per-state emission tables, P(word) over the vocabulary {upside, down, promise, friend, monster, midnight, halloween}, one table for each state.)

11 The Viterbi algorithm. The Viterbi algorithm is an inductive dynamic-programming algorithm that uses a new kind of trellis. We define the probability of the most probable path leading to the trellis node at (state i, time t) as
δ_i(t) = max_{q_0 … q_{t-1}} P(q_0 … q_{t-1}, o_0 … o_{t-1}, q_t = s_i; θ)
ψ_i(t): the best possible previous state, if I'm in state i at time t.

12 Viterbi example. For illustration, we assume a simpler state-transition topology over three states s, h, and d:
a_ss = 1.0; a_hh = .8, a_hs = .2; a_dd = .4, a_dh = .5, a_ds = .1
with per-state emission probabilities P(word):
state s: upside .25, down .25, promise .05, friend .3, monster .05, midnight .09, halloween .01
state h: upside .3, down 0, promise 0, friend .2, monster .05, midnight .05, halloween .4
state d: upside .1, down .05, promise .05, friend .6, monster .05, midnight .1, halloween .05

13 Step 1: Initialization of Viterbi. Initialize with δ_i(0) = π_i b_i(o_0) and ψ_i(0) = 0 for all states i. (Trellis diagram: the first column, at time t = 0, holds π_s b_s(o_0), π_h b_h(o_0), and π_d b_d(o_0). δ: max probability; ψ: backtrace.)

14 Step 1: Initialization of Viterbi. For example, let's assume π_d = .8, π_h = .2 (and π_s = 0), and O = {upside, friend, halloween}. Then δ_s(0) = 0 × .25 = 0, δ_h(0) = .2 × .3 = .06, and δ_d(0) = .8 × .1 = .08. (δ: max probability; ψ: backtrace. Observations: o_0 = upside, o_1 = friend, o_2 = halloween.)

15 Step 2: Induction of Viterbi. The best path to state s_j at time t, δ_j(t), depends on the best path to each possible previous state, δ_i(t−1), and their transitions to j, a_ij:
δ_j(t) = max_i [δ_i(t−1) a_ij] b_j(o_t)
ψ_j(t) = argmax_i [δ_i(t−1) a_ij]
(Observations: o_0 = upside, o_1 = friend, o_2 = halloween.)

16 Step 2: Induction of Viterbi. Specifically:
δ_s(1) = max_i [δ_i(0) a_is] b_s(o_1), ψ_s(1) = argmax_i [δ_i(0) a_is]
δ_h(1) = max_i [δ_i(0) a_ih] b_h(o_1), ψ_h(1) = argmax_i [δ_i(0) a_ih]
δ_d(1) = max_i [δ_i(0) a_id] b_d(o_1), ψ_d(1) = argmax_i [δ_i(0) a_id]
(Trellis so far: δ_h(0) = .06, δ_d(0) = .08. Observations: o_0 = upside, o_1 = friend, o_2 = halloween.)

17 Step 2: Induction of Viterbi. Starting with δ_d(1) = max_i [δ_i(0) a_id] b_d(o_1) and ψ_d(1) = argmax_i [δ_i(0) a_id], the candidates for the max are:
δ_s(0) = 0, a_sd = 0, so δ_s(0) a_sd = 0
δ_h(0) = .06, a_hd = 0, so δ_h(0) a_hd = 0
δ_d(0) = .08, a_dd = .4, so δ_d(0) a_dd = .032

18 Step 2: Induction of Viterbi. The max is δ_d(0) a_dd = .032, and b_d(friend) = .6, so
δ_d(1) = max_i [δ_i(0) a_id] b_d(o_1) = .032 × .6 = 1.92E-2
ψ_d(1) = d, i.e., d was the most likely previous state.

19 Step 2: Induction of Viterbi. Next, δ_h(1) = max_i [δ_i(0) a_ih] b_h(o_1) and ψ_h(1) = argmax_i [δ_i(0) a_ih]. The candidates for the max are:
δ_s(0) = 0, a_sh = 0, so δ_s(0) a_sh = 0
δ_h(0) = .06, a_hh = .8, so δ_h(0) a_hh = .048
δ_d(0) = .08, a_dh = .5, so δ_d(0) a_dh = .04
(Trellis so far: δ_d(1) = 1.92E-2, from d.)

20 Step 2: Induction of Viterbi. The max is δ_h(0) a_hh = .048, and b_h(friend) = .2, so
δ_h(1) = max_i [δ_i(0) a_ih] b_h(o_1) = .048 × .2 = 9.6E-3
ψ_h(1) = h.
(Trellis so far: δ_h(1) = 9.6E-3 from h; δ_d(1) = 1.92E-2 from d.)

21 Step 2: Induction of Viterbi. Finally, δ_s(1) = max_i [δ_i(0) a_is] b_s(o_1) and ψ_s(1) = argmax_i [δ_i(0) a_is]. The candidates for the max are:
δ_s(0) = 0, a_ss = 1.0, so δ_s(0) a_ss = 0
δ_h(0) = .06, a_hs = .2, so δ_h(0) a_hs = .012
δ_d(0) = .08, a_ds = .1, so δ_d(0) a_ds = .008

22 Step 2: Induction of Viterbi. The max is δ_h(0) a_hs = .012, and b_s(friend) = .3, so
δ_s(1) = max_i [δ_i(0) a_is] b_s(o_1) = .012 × .3 = 3.6E-3
ψ_s(1) = h.
(Trellis at t = 1: δ_s(1) = 3.6E-3 from h; δ_h(1) = 9.6E-3 from h; δ_d(1) = 1.92E-2 from d.)

23 Step 2: Induction of Viterbi. Moving on to t = 2:
δ_s(2) = max_i [δ_i(1) a_is] b_s(o_2), ψ_s(2) = argmax_i [δ_i(1) a_is]
δ_h(2) = max_i [δ_i(1) a_ih] b_h(o_2), ψ_h(2) = argmax_i [δ_i(1) a_ih]
δ_d(2) = max_i [δ_i(1) a_id] b_d(o_2), ψ_d(2) = argmax_i [δ_i(1) a_id]

24 Step 2: Induction of Viterbi. For δ_d(2), the candidates for the max are:
δ_s(1) = 3.6E-3, a_sd = 0, so δ_s(1) a_sd = 0
δ_h(1) = 9.6E-3, a_hd = 0, so δ_h(1) a_hd = 0
δ_d(1) = 1.92E-2, a_dd = .4, so δ_d(1) a_dd = 7.68E-3

25 Step 2: Induction of Viterbi. Continuing:
δ_s(2) = 3.6E-3 × .01 = 3.6E-5, ψ_s(2) = s
δ_h(2) = 9.6E-3 × .4 = 3.84E-3, ψ_h(2) = d
δ_d(2) = 7.68E-3 × .05 = 3.84E-4, ψ_d(2) = d

26 Step 3: Conclusion of Viterbi. Choose the best final state: Q*_T = argmax_i δ_i(T). (Final trellis column: δ_s(2) = 3.6E-5, δ_h(2) = 3.84E-3, δ_d(2) = 3.84E-4, so the best final state is h.)

27 Step 3: Conclusion of Viterbi. Recursively choose the best previous state: Q*_{t−1} = ψ_{Q*_t}(t). (Backtracing from h at t = 2: ψ_h(2) = d and ψ_d(1) = d, so Q* = {d, d, h}.)

28 Step 3: Conclusion of Viterbi. Sequence probability: P(O, Q*; θ) = max_i δ_i(T) = 3.84E-3.
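
Putting the three steps together, here is a minimal sketch of the Viterbi algorithm (my own variable names; Pi, A, B are numpy arrays as in the earlier sketches, and O holds observation indices). On the toy example above, with states ordered (s, h, d) and the reconstructed emission tables, it should return Q* = (d, d, h) with probability 3.84E-3, matching slide 28.

```python
import numpy as np

def viterbi(O, Pi, A, B):
    N, T = len(Pi), len(O)
    delta = np.zeros((T, N))           # delta[t, j]: probability of the best path ending in j at t
    psi = np.zeros((T, N), dtype=int)  # psi[t, j]: best previous state on that path
    delta[0] = Pi * B[:, O[0]]                       # Step 1: initialization
    for t in range(1, T):                            # Step 2: induction
        scores = delta[t - 1][:, None] * A           # scores[i, j] = delta[t-1, i] * a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, O[t]]
    q = [int(delta[T - 1].argmax())]                 # Step 3: best final state...
    for t in range(T - 1, 0, -1):
        q.append(int(psi[t][q[-1]]))                 # ...then backtrace through psi
    return q[::-1], delta[T - 1].max()
```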

29 Why did we choose Q* = {q_0 … q_T}? Recall the purpose of HMMs: to represent multivariate systems where some variable is unknown/hidden/latent. Finding the best hidden-state sequence Q* allows us to: identify unseen parts-of-speech given words; identify equivalent English words given French words; identify unknown phonemes given speech sounds; decipher hidden messages from encrypted symbols; identify hidden relationships from gene sequences; identify hidden market conditions given stock prices; …

30 Working in the log domain. Our formulation was Q* = argmax_Q P(O, Q; θ); this is equivalent to Q* = argmin_Q [−log_2 P(O, Q; θ)], where
−log_2 P(O, Q; θ) = −log_2 [π_{q_0} b_{q_0}(o_0)] − Σ_{t=1}^{T} log_2 [a_{q_{t-1} q_t} b_{q_t}(o_t)]
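
The same recurrence can be run directly in the log domain, turning products into sums and the max over probabilities into a min over costs (negative log-probabilities). This is a sketch under the same assumptions as the earlier Viterbi code, not the lecture's own implementation:

```python
import numpy as np

def viterbi_log(O, Pi, A, B):
    N, T = len(Pi), len(O)
    with np.errstate(divide="ignore"):       # log2(0) = -inf is exactly what we want here
        logPi, logA, logB = np.log2(Pi), np.log2(A), np.log2(B)
    cost = np.zeros((T, N))                  # cost[t, j] = -log2 of the best path probability
    psi = np.zeros((T, N), dtype=int)
    cost[0] = -(logPi + logB[:, O[0]])
    for t in range(1, T):
        c = cost[t - 1][:, None] - logA      # cost so far plus transition cost
        psi[t] = c.argmin(axis=0)
        cost[t] = c.min(axis=0) - logB[:, O[t]]
    q = [int(cost[T - 1].argmin())]
    for t in range(T - 1, 0, -1):
        q.append(int(psi[t][q[-1]]))
    return q[::-1], cost[T - 1].min()
```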

31 Fundamental tasks for HMMs. 3. Given a large observation sequence O for training, but not the state sequence, how do we choose the best parameters θ = {Π, A, B} that explain the data O? This is the task of training. As with observable Markov models and MLE, we want our parameters to be set so that the available training data is maximally likely, but doing so will involve guessing unseen information.

32 Task 3: Choosing θ = {Π, A, B}. We want to modify the parameters of our model θ = {Π, A, B} so that P(O; θ) is maximized for some training data O: θ* = argmax_θ P(O; θ). Why? E.g., if we later want to choose the best state sequence Q* for previously unseen test data, the parameters of the HMM should be tuned to similar training data.

33 Task 3: Choosing θ = {Π, A, B}.
θ* = argmax_θ P(O; θ) = argmax_θ Σ_Q P(O, Q; θ)
Can we do this?
P(O, Q; θ) = P(q_{0:T}) P(w_{0:T} | q_{0:T}) = ∏_t P(q_t | q_{t−1}) P(w_t | q_t)
Recall that we could use MLE when Q was known.

34 Task 3: Choosing θ = {Π, A, B}.
P(O, Q; θ) = P(q_{0:T}) P(w_{0:T} | q_{0:T}) = ∏_t P(q_t | q_{t−1}) P(w_t | q_t)
If the training data contained state sequences, we could simply do maximum likelihood estimation, as before (see the counting sketch below):
P(q_i | q_{i−1}) = Count(q_{i−1} q_i) / Count(q_{i−1})
P(w_i | q_i) = Count(w_i, q_i) / Count(q_i)
But we don't know the states; we can't count them. However, we can use an iterative hill-climbing approach if we can guess the counts.
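
When the states are observed, those counts are easy to collect. A small sketch (placeholder data format, not the lecture's code) of MLE from state-labelled sequences:

```python
from collections import Counter

def mle_estimates(tagged_corpus):
    """tagged_corpus: list of sequences, each a list of (word, state) pairs."""
    init, trans, emit = Counter(), Counter(), Counter()
    from_count, state_count = Counter(), Counter()
    for seq in tagged_corpus:
        init[seq[0][1]] += 1
        for i, (word, state) in enumerate(seq):
            state_count[state] += 1
            emit[(state, word)] += 1
            if i > 0:
                prev = seq[i - 1][1]
                from_count[prev] += 1
                trans[(prev, state)] += 1
    # P(q_i | q_{i-1}) = Count(q_{i-1} q_i) / Count(q_{i-1});  P(w | q) = Count(w, q) / Count(q)
    Pi = {q: c / len(tagged_corpus) for q, c in init.items()}
    A = {pair: c / from_count[pair[0]] for pair, c in trans.items()}
    B = {pair: c / state_count[pair[0]] for pair, c in emit.items()}
    return Pi, A, B
```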

35 What to do with incomplete data? When our training data are incomplete (i.e., one or more variables in our model is hidden), we cannot use maximum likelihood estimation. We have no way of counting the state transitions because we don't know which sequence of states generated our observations. We can guess the counts if we have some good pre-existing model.

36 Expecting and maximizing. If we knew θ, we could make expectations such as: the expected number of times in state s_i, and the expected number of transitions s_i → s_j. If we knew the expected number of times in state s_i and the expected number of transitions s_i → s_j, then we could compute the maximum likelihood estimate of θ = {{π_i}, {a_ij}, {b_i(w)}}.

37 Expectation-maximization. Expectation-maximization (EM) is an iterative training algorithm that alternates between two steps: Expectation (E): guesses the expected counts for the hidden sequence using the current model θ_k. Maximization (M): computes a new θ that maximizes the likelihood of the data, given the guesses of the E-step. This θ_{k+1} is then used in the next E-step. Continue until convergence or a stopping condition.

38 Baum-Welch re-estimation. Baum-Welch (BW): n. a specific version of EM for HMMs, a.k.a. the forward-backward algorithm.
1. Initialize the model.
2. Compute expectations for α_i(t) and β_i(t) for each state i and time t, given training data O.
3. Adjust our start, transition, and observation probabilities to maximize the likelihood of O.
4. Go to 2. and repeat until convergence or a stopping condition.

39 Local maxima. Baum-Welch changes θ to climb a 'hill' in P(O; θ). How we initialize θ can have a big effect. (Figure: P(O; θ) plotted against θ, with several local maxima.)

40 Step 1: BW initialization. Our initial guess for the parameters, θ_0, can be: a) All probabilities are uniform (e.g., b_i(w_a) = b_i(w_b) for all states i and words w). (Slide shows the example topology with uniform transition probabilities of .33 and emission tables with P(word) = .143 for every word in every state.)

41 Step 1: BW initialization. Our initial guess for the parameters, θ_0, can be: b) All probabilities are drawn randomly (subject to the condition that Σ_i P_i = 1), as in the sketch below. (Slide shows randomly drawn transition probabilities and per-state emission tables.)
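
A sketch of option (b), drawing every distribution at random and normalizing so that each one sums to 1 (the seed and shapes are arbitrary choices):

```python
import numpy as np

def random_init(n_states, n_words, seed=0):
    rng = np.random.default_rng(seed)
    Pi = rng.random(n_states)
    A = rng.random((n_states, n_states))
    B = rng.random((n_states, n_words))
    return (Pi / Pi.sum(),                      # Π sums to 1
            A / A.sum(axis=1, keepdims=True),   # each row of A sums to 1
            B / B.sum(axis=1, keepdims=True))   # each row of B sums to 1
```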

42 Step 1: BW initialization. Our initial guess for the parameters, θ_0, can be: c) Observation distributions are drawn from prior distributions: e.g., b_i(w_a) = P(w_a) for all states i; sometimes this involves pre-clustering, e.g. k-means. (Figure: all blue dots are words in state BLUE; their probability distribution P(word) is shown in a table.)

43 What to expect when you're expecting. If we knew θ, we could estimate expectations such as: the expected number of times in state s_i, and the expected number of transitions s_i → s_j. If we knew the expected number of times in state s_i and the expected number of transitions s_i → s_j, then we could compute the maximum likelihood estimate of θ = {{a_ij}, {b_i(w)}, {π_i}}.

44 BW E-step (occupation). We define γ_i(t) = P(q_t = i | O; θ_k) as the probability of being in state i at time t, based on our current model, θ_k, given the entire observation, O, and rewrite it as:
γ_i(t) = P(q_t = i, O; θ_k) / P(O; θ_k) = α_i(t) β_i(t) / P(O; θ_k)
Remember, α_i(t) and β_i(t) depend on values from θ = {π_i, a_ij, b_i(w)}.

45 Combining α and β.
P(O, q_t = i; θ) = α_i(t) β_i(t)
P(O; θ) = Σ_{i=1}^{N} α_i(t) β_i(t)
(Trellis diagram over states s_1, s_2, s_3, …, s_N and times 0 … T−1.)
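
A sketch of the E-step occupation quantities, computing α and β with the Forward and Backward passes and then γ_i(t) = α_i(t) β_i(t) / P(O; θ_k); the function and array names are mine:

```python
import numpy as np

def forward_backward_gamma(O, Pi, A, B):
    N, T = len(Pi), len(O)
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = Pi * B[:, O[0]]
    for t in range(1, T):                            # Forward pass
        alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):                   # Backward pass
        beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])
    likelihood = alpha[T - 1].sum()                  # P(O; θ_k)
    gamma = alpha * beta / likelihood                # gamma[t, i] = γ_i(t)
    return alpha, beta, gamma, likelihood
```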

46 BW E-step (transition). We define ξ_ij(t) = P(q_t = i, q_{t+1} = j | O; θ_k) as the probability of transitioning from state i at time t to state j at time t + 1, based on our current model, θ_k, and given the entire observation, O. This is:
ξ_ij(t) = P(q_t = i, q_{t+1} = j, O; θ_k) / P(O; θ_k) = α_i(t) a_ij b_j(o_{t+1}) β_j(t + 1) / P(O; θ_k)
Again, these estimates come from our model at iteration k, θ_k.

47 BW E-step (transition). (Diagram: state s_i at time t connects to state s_j at time t + 1 via a_ij b_j(o_{t+1}); α_i(t) covers the lattice up to time t and β_j(t + 1) covers it from time t + 1 onward.)
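
A matching sketch for the transition expectations, assuming alpha, beta, and likelihood come from the previous sketch:

```python
import numpy as np

def xi_matrix(O, A, B, alpha, beta, likelihood):
    T, N = alpha.shape
    xi = np.zeros((T - 1, N, N))   # xi[t, i, j] = ξ_ij(t) = α_i(t) a_ij b_j(o_{t+1}) β_j(t+1) / P(O; θ_k)
    for t in range(T - 1):
        xi[t] = (alpha[t][:, None] * A *
                 (B[:, O[t + 1]] * beta[t + 1])[None, :]) / likelihood
    return xi
```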

48 Expecting and maximizing. If we knew θ, we could estimate expectations such as: the expected number of times in state s_i, and the expected number of transitions s_i → s_j. If we knew the expected number of times in state s_i and the expected number of transitions s_i → s_j, then we could compute the maximum likelihood estimate of θ = {{a_ij}, {b_i(w)}, {π_i}}.

49 BW M-step. We update our parameters as if we were doing MLE:
I. Initial-state probabilities: π_i = γ_i(0), for i ∈ 1..N
II. State-transition probabilities: a_ij = Σ_{t=0}^{T−1} ξ_ij(t) / Σ_{t=0}^{T−1} γ_i(t), for i, j ∈ 1..N (the analogue of P(q_j | q_i) = Count(q_i q_j) / Count(q_i))
III. Discrete observation probabilities: b_j(w) = Σ_{t=0}^{T−1} γ_j(t) [o_t = w] / Σ_{t=0}^{T−1} γ_j(t), for j ∈ 1..N and w ∈ V (the analogue of P(w | q) = Count(w, q) / Count(q))
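
A sketch of the three M-step updates (I–III), taking the γ and ξ arrays from the E-step sketches; O is assumed to be a numpy array of observation indices:

```python
import numpy as np

def m_step(O, gamma, xi, n_words):
    T, N = gamma.shape
    Pi = gamma[0]                                          # I. π_i = γ_i(0)
    A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]   # II. a_ij = Σ_t ξ_ij(t) / Σ_t γ_i(t)
    B = np.zeros((N, n_words))
    for w in range(n_words):                               # III. b_j(w) = Σ_{t: o_t = w} γ_j(t) / Σ_t γ_j(t)
        B[:, w] = gamma[O == w].sum(axis=0)
    B /= gamma.sum(axis=0)[:, None]
    return Pi, A, B
```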

50 Baum-Welch iteration. We update our parameters after each iteration, θ_{k+1} = {π_i, a_ij, b_j(w)}; rinse and repeat until θ_k ≈ θ_{k+1} (until change almost stops). Baum proved that P(O; θ_{k+1}) ≥ P(O; θ_k), although this method does not guarantee a global maximum.
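
Tying the pieces together, one possible Baum-Welch loop, assuming the forward_backward_gamma, xi_matrix, and m_step sketches above are in scope; the stopping tolerance is an arbitrary choice:

```python
import numpy as np

def baum_welch(O, Pi, A, B, max_iter=100, tol=1e-6):
    O = np.asarray(O)
    prev_ll = -np.inf
    for _ in range(max_iter):
        alpha, beta, gamma, likelihood = forward_backward_gamma(O, Pi, A, B)   # E-step
        xi = xi_matrix(O, A, B, alpha, beta, likelihood)
        Pi, A, B = m_step(O, gamma, xi, B.shape[1])                            # M-step
        ll = np.log(likelihood)
        if ll - prev_ll < tol:     # P(O; θ_{k+1}) ≥ P(O; θ_k): stop once the climb levels off
            break
        prev_ll = ll
    return Pi, A, B
```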

51 Features of Baum-Welch. Although we're not guaranteed to achieve a global optimum, the local optima are often good enough. BW does not estimate the number of states, which must be known beforehand. Moreover, some constraints on topology are often imposed beforehand to assist training.

52 Discrete vs. continuous. If our observations are drawn from a continuous space (e.g., speech acoustics), the probabilities b_i(X) must also be continuous. HMMs generalize to continuous distributions, or multivariate observations, e.g., b_i(⟨14.28, .85, .21⟩).
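
For continuous observations, each b_i becomes a density rather than a table. A sketch (not the lecture's code) with a single multivariate Gaussian per state; real acoustic models typically use Gaussian mixtures instead:

```python
import numpy as np
from scipy.stats import multivariate_normal

def b_continuous(x, mean, cov):
    """Emission density b_i(x) for one state with Gaussian parameters (mean, cov)."""
    return multivariate_normal.pdf(x, mean=mean, cov=cov)

# e.g., b_continuous(np.array([14.28, 0.85, 0.21]), mean=np.zeros(3), cov=np.eye(3))
```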

53 Adaptation. It can take a LOT of data to train HMMs. Imagine that we're given a trained HMM but not the data. Also imagine that this HMM has been trained with data from many sources (e.g., many speakers). We want to use this HMM with a particular new source for whom we have some data (but not enough to fully train the HMM properly from scratch). To be more accurate for that source, we want to change the original HMM parameters slightly given the new data.

54 Deleted interpolation. For added robustness, we can combine estimates of a generic HMM, G, trained with lots of data from many sources, with a specific HMM, S, trained with a little data from a single source:
P_DI(o) = λ P(o; θ_G) + (1 − λ) P(o; θ_S)
This gives us a model tuned to our target source (S), but with some general knowledge (G) built in. How do we pick λ ∈ [0..1]?

55 Deleted interpolation: learning λ.
1. Initialize λ with an empirical or guessed estimate.
2. Given O_a, which is adaptation data of which O_{a,j} is the j-th partition, and there are M partitions,
3. Update λ (the weight of model G) according to:
λ̂ = (1/M) Σ_{j=1}^{M} λ P(O_{a,j}; θ_G) / P_DI(O_{a,j})
We continue until λ and λ̂ are sufficiently close.
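
A sketch of this re-estimation loop, treating each partition O_{a,j} as one data point in a two-component mixture; p_generic and p_specific are placeholders for "likelihood under θ_G" and "likelihood under θ_S" (e.g., computed with the Forward algorithm):

```python
def reestimate_lambda(partitions, p_generic, p_specific, lam=0.5, tol=1e-4):
    while True:
        new_lam = sum(
            lam * p_generic(o) / (lam * p_generic(o) + (1 - lam) * p_specific(o))
            for o in partitions
        ) / len(partitions)
        if abs(new_lam - lam) < tol:   # continue until λ and λ̂ are sufficiently close
            return new_lam
        lam = new_lam
```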

56 Aside: maximum a posteriori (MAP). Given adaptation data O_a, the MAP estimate is θ* = argmax_θ P(O_a | θ) P(θ). If we can guess some structure for P(θ), we can use EM to estimate new parameters (or Monte Carlo). For continuous b_i(o), we use a Dirichlet distribution that defines the hyper-parameters of the model, and the Lagrange method to describe the change in parameters θ → θ*.

57 Summary. Important ideas to know: the definition of an HMM (e.g., its parameters); the purpose of the Forward algorithm; how to compute α_i(t) and β_i(t); the purpose of the Viterbi algorithm; how to compute δ_i(t) and ψ_i(t); the purpose of the Baum-Welch algorithm; some understanding of EM; some understanding of the equations.

58

59 State duration. The probability of staying in a particular state s_i for a specific period of time, τ, diminishes exponentially over time, all else being equal: P(τ steps in s_i) = a_ii^{τ−1} (1 − a_ii). (From Philip Jackson at the University of Surrey.)
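
Since this is a geometric distribution over τ, one immediate consequence (not stated on the slide) is the expected duration in state i: E[τ] = Σ_{τ=1}^{∞} τ a_ii^{τ−1} (1 − a_ii) = 1 / (1 − a_ii), so a self-loop probability of a_ii = .9 corresponds to an average stay of 10 time steps.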

60 Combining HMMs. Often, we link HMMs together. E.g., we have lots of speech data for /w/, /ah/, and /n/, but almost no data for the word 'one'. (Diagram: an HMM for /w/ trained only with /w/ data, one for /ah/ trained only with /ah/ data, and one for /n/ trained only with /n/ data, concatenated into a model for 'one'.)

61 N-best lists. In our discussion of the Viterbi algorithm, we encountered a situation where one state at time t was equally likely to have been reached from two other states at time t − 1. Sometimes, instead of keeping track of only the single best path to state i at time t, we in fact keep track of the N best paths to state i at time t. E.g., in our Viterbi trellis, each node would store δ: max probability, 2nd max probability, 3rd max probability; and ψ: best backtrace, 2nd best backtrace, 3rd best backtrace.

62 Generative vs. discriminative. HMMs are generative classifiers: you can generate synthetic samples from them because they model the phenomenon itself. Other classifiers (e.g., artificial neural networks and support vector machines) are discriminative in that their probabilities are trained specifically to reduce the error in classification. (Diagram: ANN and SVM.)

63 Reading (optional). Manning & Schütze (note that they use another formulation). Rabiner, L. (1990) A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. In: Readings in Speech Recognition. Morgan Kaufmann. (Posted on course website.) Optional software: Hidden Markov Model Toolkit; scikit's HMM.
