The Basic Idea of EM

Jianxin Wu
LAMDA Group
National Key Lab for Novel Software Technology
Nanjing University, China
wujx2001@gmail.com

June 7, 2017

Contents

1 Introduction
2 GMM: A working example
  2.1 Gaussian mixture model
  2.2 The hidden variable interpretation
  2.3 What if we can observe the hidden variable?
  2.4 Can we imitate an oracle?
3 An informal description of the EM algorithm
4 The Expectation-Maximization algorithm
  4.1 Jointly-non-concave incomplete log-likelihood
  4.2 (Possibly) Concave complete data log-likelihood
  4.3 The general EM derivation
  4.4 The E- & M-steps
  4.5 The EM algorithm
  4.6 Will EM converge?
5 EM for GMM

1 Introduction

Statistical learning models are very important in many areas inside computer science, including but not confined to machine learning, computer vision, pattern recognition and data mining. They are also important in some deep learning models, such as the Restricted Boltzmann Machine (RBM).
Statistical learning models have parameters, and estimating such parameters from data is one of the key problems in the study of such models. Expectation-Maximization (EM) is arguably the most widely used parameter estimation technique. Hence, it is worthwhile to know some basics of EM.

However, although EM is a must-have piece of knowledge for studying statistical learning models, it is not easy for beginners. This note introduces the basic idea behind EM. I want to emphasize that the main purpose of this note is to introduce the basic idea (or, emphasizing the intuition) behind EM, not to cover all details of EM or to present rigorous mathematical derivations. [1]

2 GMM: A working example

Let us start from a simple working example: the Gaussian Mixture Model (GMM).

2.1 Gaussian mixture model

In Figure 1, we show three curves corresponding to three different probability density functions (p.d.f.). The blue curve is the p.d.f. of a normal distribution N(10, 16), i.e., a Gaussian distribution with mean µ = 10 and standard deviation σ = 4 (and σ^2 = 16). We denote this p.d.f. as p_1(x) = N(x; 10, 16). The red curve is another normal distribution N(30, 49) with µ = 30 and σ = 7. Similarly, we denote it as p_2(x) = N(x; 30, 49).

We are interested in the black curve, whose first half is similar to the blue one, while the second half is similar to the red one. This curve is also the p.d.f. of a distribution, denoted by p_3. Since the black curve is similar to parts of the blue and red curves, it is reasonable to conjecture that p_3 is related to both p_1 and p_2. Indeed, p_3 is a weighted combination of p_1 and p_2. In this example,

    p_3(x) = 0.2 p_1(x) + 0.8 p_2(x).    (1)

Because 0.2 + 0.8 = 1, it is easy to verify that p_3(x) ≥ 0 always holds and ∫ p_3(x) dx = 1. Hence, p_3 is a valid p.d.f. Since p_3 is a mixture of two Gaussians (p_1 and p_2), it is a Gaussian mixture model (GMM). The definition of a GMM is in fact more general: it can have more than two components, and the Gaussians can be multivariate.

[1] The first version of this note was written in Chinese, and was started as note-taking in a course at the Georgia Institute of Technology while I was a graduate student there. That version was typeset in Microsoft Word. Unfortunately, it contained a lot of errors and I did not have a chance to check it again. This version (written in 2016) was started while I was preparing materials for the Pattern Recognition course I will teach in the Spring Semester at Nanjing University. It is greatly expanded, and the errors that I found have been corrected.
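As a quick check, the following minimal Python sketch (using NumPy, SciPy and Matplotlib; the variable names are illustrative, not part of the note) reproduces the three curves in Figure 1 and verifies numerically that the mixture in Equation 1 is a valid p.d.f.:

```python
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

x = np.linspace(0, 50, 1000)
p1 = norm.pdf(x, loc=10, scale=4)   # N(10, 16): mean 10, standard deviation 4 (blue curve)
p2 = norm.pdf(x, loc=30, scale=7)   # N(30, 49): mean 30, standard deviation 7 (red curve)
p3 = 0.2 * p1 + 0.8 * p2            # the mixture of Equation (1) (black curve)

plt.plot(x, p1, "b", x, p2, "r", x, p3, "k")
plt.show()

print((p3 >= 0).all())              # True: the mixture is non-negative
print(np.trapz(p3, x))              # ~1 (a tiny amount of mass lies outside [0, 50])
```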
Figure 1: A simple GMM illustration.

A GMM is a distribution whose p.d.f. has the following form:

    p(x) = ∑_{i=1}^{N} α_i N(x; µ_i, Σ_i)    (2)

         = ∑_{i=1}^{N} α_i / ((2π)^{d/2} |Σ_i|^{1/2}) · exp( −(1/2) (x − µ_i)^T Σ_i^{−1} (x − µ_i) ),    (3)

in which x is a d-dimensional random vector. In this GMM, there are N Gaussian components, with the i-th Gaussian having the mean vector µ_i ∈ R^d and the covariance matrix Σ_i ∈ R^{d×d}. [2] These Gaussian components are combined using a linear combination, where the weight for the i-th component is α_i (called the mixing coefficient). The mixing coefficients must satisfy the following conditions:

    ∑_{i=1}^{N} α_i = 1,    (4)

    α_i ≥ 0, ∀ i.    (5)

It is easy to verify that under these conditions, p(x) is a valid multivariate probability density function.

[2] We will use boldface letters to denote a vector.
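To make Equation 3 concrete, here is a minimal Python sketch that evaluates the GMM density at a point by transcribing the formula directly (the function name and the example parameters are illustrative only):

```python
import numpy as np

def gmm_pdf(x, alpha, mu, Sigma):
    """Evaluate the GMM density of Equation (3) at a point x in R^d.
    alpha: (N,) mixing coefficients; mu: (N, d) means; Sigma: (N, d, d) covariances."""
    d = x.shape[0]
    total = 0.0
    for a, m, S in zip(alpha, mu, Sigma):
        diff = x - m
        norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(S))
        total += a / norm_const * np.exp(-0.5 * diff @ np.linalg.solve(S, diff))
    return total

# Example with N = 2 bivariate components (hypothetical parameters).
alpha = np.array([0.3, 0.7])
mu = np.array([[0.0, 0.0], [3.0, 3.0]])
Sigma = np.array([np.eye(2), 2 * np.eye(2)])
print(gmm_pdf(np.array([1.0, 1.0]), alpha, mu, Sigma))
```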
2.2 The hidden variable interpretation

We can have a different interpretation of the Gaussian mixture model, using the hidden variable concept, as illustrated in Figure 2.

Figure 2: GMM as a graphical model.

In Figure 2, the random variable X follows a Gaussian mixture model (cf. Equation 3). Its parameters are

    θ = {α_i, µ_i, Σ_i}_{i=1}^{N}.    (6)

If we want to sample an instance from this GMM, we could directly sample from the p.d.f. in Equation 3. However, there is another, two-step way to perform the sampling.

Let us define a random variable Z, which follows a multinomial (discrete) distribution, taking values from the set {1, 2, ..., N}. The probability that Z takes the value i is α_i, i.e., Pr(Z = i) = α_i, for 1 ≤ i ≤ N. Then, the two-step sampling procedure is:

Step 1 Sample from Z, and get a value i (1 ≤ i ≤ N);

Step 2 Sample x from the i-th Gaussian component N(µ_i, Σ_i).

It is easy to verify that a sample x obtained from this two-step sampling procedure follows the underlying GMM distribution in Equation 3.
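The two-step procedure can be written down almost verbatim as code. A minimal sketch (illustrative names; it assumes the parameter shapes used in the previous snippet):

```python
import numpy as np

def sample_gmm(M, alpha, mu, Sigma, seed=0):
    """Draw M samples from a GMM by the two-step procedure:
    step 1 draws the component index from Z, step 2 draws x from that Gaussian."""
    rng = np.random.default_rng(seed)
    N = len(alpha)
    samples = []
    for _ in range(M):
        i = rng.choice(N, p=alpha)                                 # Step 1: Z = i with probability alpha_i
        samples.append(rng.multivariate_normal(mu[i], Sigma[i]))   # Step 2: x ~ N(mu_i, Sigma_i)
    return np.array(samples)
```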
In learning GMM parameters, we are given a sample set {x_1, x_2, ..., x_M}, where the x_j are i.i.d. (identically and independently distributed) instances sampled from the p.d.f. in Equation 3. From this set of samples, we want to estimate or learn the GMM parameters θ = {α_i, µ_i, Σ_i}_{i=1}^{N}. Because we are given the samples x_j, the random variable X (cf. Figure 2) is called an observed (or observable) random variable. As shown in Figure 2, observed random variables are usually shown as filled circles.

The random variable Z, however, is not observable, and is called a hidden variable (or a latent variable). Hidden variables are shown as empty circles, like the Z node in Figure 2.

2.3 What if we can observe the hidden variable?

In real applications, we do not know the value (or instantiation) of Z, because it is hidden (not observable). This fact makes estimating GMM parameters rather difficult, and techniques such as EM (the focus of this note) have to be employed.

However, for the sample set X = {x_1, x_2, ..., x_M}, let us consider the scenario in which some oracle has given us the values of Z: Z = {z_1, z_2, ..., z_M}. In other words, we know that x_j is sampled from the z_j-th Gaussian component. In this case, it is easy to estimate the parameters θ.

First, we can find all those samples that are generated from the i-th component, and use X_i to denote this subset of samples. In precise mathematical language,

    X_i = {x_j | z_j = i, 1 ≤ j ≤ M}.    (7)

Estimating the mixing coefficients is a simple counting exercise. We can count the number of examples generated from the i-th Gaussian component as m_i = |X_i|, where |·| is the size (number of elements) of a set. Then, the maximum likelihood estimate of α_i is

    α̂_i = m_i / ∑_{j=1}^{N} m_j = m_i / M.    (8)

Second, it is also easy to estimate the µ_i and Σ_i parameters for any 1 ≤ i ≤ N. The maximum likelihood estimates are the same as those for a single Gaussian: [3]

    µ̂_i = (1 / m_i) ∑_{x ∈ X_i} x,    (9)

    Σ̂_i = (1 / m_i) ∑_{x ∈ X_i} (x − µ̂_i)(x − µ̂_i)^T.    (10)

In short, if we know the hidden variable's instantiations, the estimation is straightforward. Unfortunately, we are only given the observed sample set X. The hidden variable instantiations Z are unknown to us. This fact complicates the entire parameter estimation process.

[3] Please refer to my note on properties of normal distributions for the derivation of these equations.
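If an oracle did reveal the component labels, Equations 7-10 would translate into a few lines of code. A minimal sketch (assuming 0-based labels; the names are illustrative):

```python
import numpy as np

def fit_with_oracle(X, z, N):
    """Maximum likelihood estimates (Equations (8)-(10)) when the oracle
    reveals z_j, the index of the component that generated x_j.
    X: (M, d) samples; z: (M,) integer labels in {0, ..., N-1}."""
    M, d = X.shape
    alpha, mu, Sigma = np.empty(N), np.empty((N, d)), np.empty((N, d, d))
    for i in range(N):
        Xi = X[z == i]                       # the subset X_i of Equation (7)
        m_i = len(Xi)                        # m_i = |X_i|
        alpha[i] = m_i / M                   # Equation (8)
        mu[i] = Xi.mean(axis=0)              # Equation (9)
        diff = Xi - mu[i]
        Sigma[i] = diff.T @ diff / m_i       # Equation (10), the (biased) MLE covariance
    return alpha, mu, Sigma
```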
2.4 Can we imitate an oracle?

A natural question to ask ourselves is: if we do not have an oracle to teach us, can we imitate the oracle's teaching? In other words, can we guess the value of z_j for x_j?

A natural choice is to use the posterior p(z_j | x_j, θ^(t)) as a replacement for z_j. This term is the posterior probability given the sample x_j and the current parameter value θ^(t). [4] The posterior probability is the best educated guess we can make given the information that is at hand.

[4] As we will see, EM is an iterative process, in which the variable t is the iteration index. We update the parameter θ in every iteration, and use θ^(t) to denote its value in the t-th iteration.

In this guessing game, we have at least two issues in our way.

First, an oracle is supposed to know everything, and would be able to tell us that, for example, x_7 comes from the third Gaussian component, with 100% confidence. If an oracle existed, we could simply say z_7 = 3 in this example. However, our guess will never be deterministic; it can at best be a probability distribution over the random variable z_j. Hence, we will assume that for every observed sample x_j, there is a corresponding hidden vector z_j, whose values can be guessed but cannot be observed. We still use Z to denote the underlying random variable, and use Z to denote the set of hidden vectors. In the GMM example, a vector z_j has N dimensions, but one and only one of these dimensions is 1, and all the others are 0.

Second, the guess we have about z_j is a distribution determined by the posterior p(z_j | x_j, θ^(t)). However, what we really want are values instead of a distribution. How are we going to use this guess? A common trick in statistical learning is to use its expectation. We will leave the details about how the expectation is used to later sections.

3 An informal description of the EM algorithm

Now we are ready to give an informal description of the EM algorithm.

- We first initialize the values of θ in any reasonable way;
- Then, we can estimate the best possible Z (the expectation of its posterior distribution) using X and the current θ estimate;
- With this Z estimate, we can find a better estimate of θ using X;
- A better θ (combined with X) will lead to a better guess of Z;
- This process (estimating θ and Z in alternating order) can proceed until the change in θ is small (i.e., the procedure converges).

In still more informal language, after proper initialization of the parameters, we can:
E-Step Find a better guess of the non-observable hidden variables, by using the data and the current parameter values;

M-Step Find a better parameter estimate, by using the current guess of the hidden variables and the data;

Repeat Repeat the above two steps until convergence.

In the EM algorithm, the first step is usually called the Expectation step, abbreviated as the E-step, while the second step is usually called the Maximization step, abbreviated as the M-step. The EM algorithm repeats the E- and M-steps in alternating order. When the algorithm converges, we get the desired parameter estimates.

4 The Expectation-Maximization algorithm

Now we will show more details of the EM algorithm. Suppose we are dealing with two sets of random variables: the observed variables X and the hidden variables Z. The joint p.d.f. is p(X, Z; θ), where θ are the parameters. We are given a set of instances of X from which to learn the parameters: X = {x_1, x_2, ..., x_M}. The task is to estimate θ from X.

For every x_j, there is a corresponding z_j. And we want to clarify that θ now includes the parameters associated with Z. In the GMM example, the z_j are estimates for Z, {α_i, µ_i, Σ_i}_{i=1}^{N} are parameters specifying X, and θ includes both sets of parameters.

4.1 Jointly-non-concave incomplete log-likelihood

If we use the maximum likelihood (ML) estimation technique, the ML estimate of θ is

    θ̂ = arg max_θ p(X | θ).    (11)

Or, equivalently, we can maximize the log-likelihood

    θ̂ = arg max_θ ln p(X | θ),    (12)

because ln(·) is a monotonically increasing function. Then, parameter estimation becomes an optimization problem. We will use the notation L(θ) to denote the log-likelihood, that is,

    L(θ) = ln p(X | θ).    (13)

Recent developments in optimization tell us that a minimization problem can generally be considered easy if it is convex, while non-convex problems are usually difficult to solve. Equivalently, a concave maximization problem is generally considered easy, while non-concave maximization is usually difficult, because the negative of a convex function is concave, and vice versa.
Unfortunately, the log-likelihood is non-concave in most cases. Take the Gaussian mixture model as an example: the likelihood p(X | θ) is

    p(X | θ) = ∏_{j=1}^{M} ( ∑_{i=1}^{N} α_i / ((2π)^{d/2} |Σ_i|^{1/2}) · exp( −(1/2) (x_j − µ_i)^T Σ_i^{−1} (x_j − µ_i) ) ).    (14)

The log-likelihood has the following form:

    ∑_{j=1}^{M} ln ( ∑_{i=1}^{N} α_i / ((2π)^{d/2} |Σ_i|^{1/2}) · exp( −(1/2) (x_j − µ_i)^T Σ_i^{−1} (x_j − µ_i) ) ).    (15)

This expression is non-concave with respect to the joint optimization variables {α_i, µ_i, Σ_i}_{i=1}^{N}. In other words, this is a difficult maximization problem.

We have two sets of random variables X and Z. The log-likelihood in Equation 15 is called the incomplete data log-likelihood because Z does not appear in that equation.

4.2 (Possibly) Concave complete data log-likelihood

The complete data log-likelihood is

    ln p(X, Z | θ).    (16)

Let us use GMM as an example once more. In a GMM, each vector z_j (these vectors together form Z) is an N-dimensional vector with N − 1 entries equal to 0 and only one entry equal to 1. Hence, the complete data likelihood is

    p(X, Z | θ) = ∏_{j=1}^{M} ∏_{i=1}^{N} [ α_i / ((2π)^{d/2} |Σ_i|^{1/2}) · exp( −(1/2) (x_j − µ_i)^T Σ_i^{−1} (x_j − µ_i) ) ]^{z_ij}.    (17)

This equation can be explained using the two-step sampling process. Let us assume x_j is generated by the i-th Gaussian component. Then we know that z_ij = 1, and z_kj = 0 for all k ≠ i. In other words, for the N − 1 components with z_kj = 0 the term inside [·] is raised to the power 0 and hence equals 1, and the remaining entry evaluates to α_i N(x_j; µ_i, Σ_i), which exactly matches the two-step sampling procedure. [5]

Then, the complete data log-likelihood is

    ∑_{j=1}^{M} ∑_{i=1}^{N} z_ij ( (1/2) ( ln |Σ_i^{−1}| − (x_j − µ_i)^T Σ_i^{−1} (x_j − µ_i) ) + ln α_i ) + const.    (18)

[5] The first step has probability α_i, and the second step has density N(x_j; µ_i, Σ_i). These two steps are independent of each other, hence the product rule applies.
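To spell out the step from Equation 17 to Equation 18: taking the logarithm turns the products into sums and brings the exponent z_ij down, giving

    ln p(X, Z | θ) = ∑_{j=1}^{M} ∑_{i=1}^{N} z_ij ( ln α_i − (d/2) ln(2π) − (1/2) ln |Σ_i| − (1/2) (x_j − µ_i)^T Σ_i^{−1} (x_j − µ_i) ).

Because ∑_{i=1}^{N} z_ij = 1 for every j, the −(d/2) ln(2π) terms add up to the constant −(Md/2) ln(2π), and writing −ln |Σ_i| as ln |Σ_i^{−1}| yields exactly Equation 18.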
Let us consider the scenario where the hidden variables z_ij are known, but α_i, µ_i and Σ_i are unknown. Here we suppose Σ_i is invertible for 1 ≤ i ≤ N. Instead of considering the parameters (µ_i, Σ_i), we consider (µ_i, Σ_i^{−1}). [6] It is well known that the log-determinant function ln |·| is concave. It is also easy to prove that the quadratic term (x_j − µ_i)^T Σ_i^{−1} (x_j − µ_i) is jointly convex with respect to the variables (µ_i, Σ_i^{−1}), which directly implies that its negative is concave. [7] Hence, this sub-problem can be solved efficiently.

From this optimization perspective, we can understand the EM algorithm from a different point of view. Although the original maximum likelihood parameter estimation problem is difficult to solve (jointly non-concave), the EM algorithm can usually (but not always) produce concave subproblems, which are then efficiently solvable.

4.3 The general EM derivation

Now we talk about EM in the general sense. We have observable variables X and samples X. We also have hidden variables Z and unobservable samples Z. The overall system parameters are denoted by θ. The parameter learning problem tries to find optimal parameters θ̂ by maximizing the incomplete data log-likelihood

    θ̂ = arg max_θ ln p(X | θ).    (19)

We assume Z is discrete, and hence

    p(X | θ) = ∑_Z p(X, Z | θ).    (20)

However, this assumption is mainly for notational simplicity. If Z is continuous, we can replace the summation with an integral.

Although we have mentioned previously that we can use the posterior of Z, i.e., p(Z | X, θ), as our guess, it is also interesting to observe what happens to the complete data likelihood if we use an arbitrary distribution for Z (and hence to understand why the posterior is special and why we should use it).

Let q be any valid probability distribution for Z. We can measure how different q is from the posterior using the classic Kullback-Leibler (KL) divergence measure,

    KL(q ‖ p) = −∑_Z q(Z) ln ( p(Z | X, θ) / q(Z) ).    (21)

Probability theory tells us that

    p(X | θ) = p(X, Z | θ) / p(Z | X, θ)    (22)

             = ( p(X, Z | θ) / q(Z) ) · ( q(Z) / p(Z | X, θ) ).    (23)

[6] It is more natural to understand this choice as using the canonical parameterization of a normal distribution. Please refer to my note on properties of normal distributions.

[7] For knowledge about convexity, please refer to the book Convex Optimization by Stephen Boyd and Lieven Vandenberghe, Cambridge University Press. The PDF version of this book is available at http://stanford.edu/~boyd/cvxbook/.
Hence,

    ln p(X | θ) = ( ∑_Z q(Z) ) ln p(X | θ)    (24)

                = ∑_Z q(Z) ln p(X | θ)    (25)

                = ∑_Z q(Z) ln ( ( p(X, Z | θ) / q(Z) ) · ( q(Z) / p(Z | X, θ) ) )    (26)

                = ∑_Z q(Z) ( ln ( p(X, Z | θ) / q(Z) ) − ln ( p(Z | X, θ) / q(Z) ) )    (27)

                = ∑_Z q(Z) ln ( p(X, Z | θ) / q(Z) ) + KL(q ‖ p)    (28)

                = L(q, θ) + KL(q ‖ p).    (29)

We have decomposed the incomplete data log-likelihood into two terms. The first term is L(q, θ), defined as

    L(q, θ) = ∑_Z q(Z) ln ( p(X, Z | θ) / q(Z) ).    (30)

The second term is a KL divergence between q and the posterior,

    KL(q ‖ p) = −∑_Z q(Z) ln ( p(Z | X, θ) / q(Z) ),    (31)

which is copied from Equation 21.

There are some nice properties of the KL divergence. For example,

    KL(q ‖ p) ≥ 0    (32)

always holds, and the equality sign is true if and only if q = p. [8] One direct consequence of this property is that

    L(q, θ) ≤ ln p(X | θ)    (33)

always holds, and

    L(q, θ) = ln p(X | θ) if and only if q(Z) = p(Z | X, θ).    (34)

In other words, we have found a lower bound of ln p(X | θ).

[8] For more properties of the KL divergence, please refer to the book Elements of Information Theory by Thomas M. Cover and Joy A. Thomas, John Wiley & Sons, Inc.
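The decomposition in Equations 24-29 can also be verified numerically on a toy model. A minimal Python sketch with a single observation from a two-component 1-D GMM (all numbers are illustrative):

```python
import numpy as np
from scipy.stats import norm

# Toy model: two 1-D Gaussian components (hypothetical parameters).
alpha = np.array([0.2, 0.8])
mu    = np.array([10.0, 30.0])
sigma = np.array([4.0, 7.0])

x = 22.0                                   # a single observation
joint = alpha * norm.pdf(x, mu, sigma)     # p(x, z = i | theta) for the two components
log_px = np.log(joint.sum())               # incomplete-data log-likelihood ln p(x | theta)
posterior = joint / joint.sum()            # p(z | x, theta)

q = np.array([0.5, 0.5])                   # an arbitrary valid distribution over z
L  = np.sum(q * np.log(joint / q))         # the lower bound L(q, theta)
KL = np.sum(q * np.log(q / posterior))     # KL(q || p(z | x, theta)) >= 0

print(np.isclose(L + KL, log_px))          # True: ln p(x|theta) = L(q,theta) + KL(q||p)
print(np.isclose(np.sum(posterior * np.log(joint / posterior)), log_px))  # bound is tight at q = posterior
```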
Hence, in order to maximize ln p(X | θ), we can perform two steps.

The first step is to make the lower bound L(q, θ) equal to ln p(X | θ). As mentioned above, we know the equality holds if and only if q̂(Z) = p(Z | X, θ). Now we have

    ln p(X | θ) = L(q̂, θ),    (35)

and L depends only on θ now. This is the Expectation step (E-step) in the EM algorithm.

In the second step, we can maximize L(q̂, θ) with respect to θ. Since ln p(X | θ) = L(q̂, θ), an increase of L(q̂, θ) also means an increase of the log-likelihood ln p(X | θ). And, because we are maximizing L(q̂, θ) in this step, the log-likelihood will always increase unless we are already at a local maximum of the log-likelihood. This is the Maximization step (M-step) in the EM algorithm.

4.4 The E- & M-steps

In the E-step, we already know that we should set

    q̂(Z) = p(Z | X, θ),    (36)

which is straightforward (at least in its mathematical form). Then, how shall we maximize L(q̂, θ)? We can substitute q̂ into the definition of L, and find the optimal θ that maximizes L after plugging in q̂. However, note that q̂ involves θ too. Hence, we need some more notation.

Suppose we are in the t-th iteration. In the E-step, q̂ is computed using the current parameters, as

    q̂(Z) = p(Z | X, θ^(t)).    (37)

Then, L becomes

    L(q̂, θ) = ∑_Z q̂(Z) ln ( p(X, Z | θ) / q̂(Z) )    (38)

            = ∑_Z q̂(Z) ln p(X, Z | θ) − ∑_Z q̂(Z) ln q̂(Z)    (39)

            = ∑_Z p(Z | X, θ^(t)) ln p(X, Z | θ) + const,    (40)

in which const = −∑_Z q̂(Z) ln q̂(Z) does not involve the variable θ, and hence can be ignored. The remaining term is in fact an expectation, which we denote by Q(θ, θ^(t)):

    Q(θ, θ^(t)) = ∑_Z p(Z | X, θ^(t)) ln p(X, Z | θ)    (41)

                = E_{Z | X, θ^(t)} [ln p(X, Z | θ)].    (42)

That is, in the E-step, we compute the posterior of Z. In the M-step, we compute the expectation of the complete data log-likelihood ln p(X, Z | θ)
with respect to the posterior distribution p(Z | X, θ^(t)), and we maximize this expectation to get a better parameter estimate:

    θ^(t+1) = arg max_θ Q(θ, θ^(t)) = arg max_θ E_{Z | X, θ^(t)} [ln p(X, Z | θ)].    (43)

Thus, three computations are involved in EM: 1) the posterior, 2) the expectation, 3) the maximization. We treat 1) as the E-step, and 2)+3) as the M-step. Some researchers prefer to treat 1)+2) as the E-step, and 3) as the M-step. However, no matter how the computations are attributed to the different steps, the EM algorithm does not change.

4.5 The EM algorithm

Now we are ready to write down the EM algorithm.

Algorithm 1 The Expectation-Maximization Algorithm
1: t ← 0
2: Initialize the parameters to θ^(0)
3: The E(xpectation)-step: Find p(Z | X, θ^(t))
4: The M(aximization)-step.1: Find the expectation

       Q(θ, θ^(t)) = E_{Z | X, θ^(t)} [ln p(X, Z | θ)]    (44)

5: The M(aximization)-step.2: Find a new parameter estimate

       θ^(t+1) = arg max_θ Q(θ, θ^(t))    (45)

6: t ← t + 1
7: If the log-likelihood has not converged, go to the E-step again (Line 3)
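Algorithm 1 maps directly onto a short, generic driver loop. A schematic Python sketch (the arguments e_step, m_step and log_lik are placeholders that a concrete model must supply; this is not tied to any particular implementation):

```python
def em(X, theta, e_step, m_step, log_lik, max_iter=100, tol=1e-6):
    """Generic EM loop mirroring Algorithm 1.
      e_step(X, theta)   -> posterior p(Z | X, theta), e.g. a responsibility matrix
      m_step(X, post)    -> new theta maximizing Q(theta, theta_t)
      log_lik(X, theta)  -> incomplete-data log-likelihood ln p(X | theta)
    """
    prev = log_lik(X, theta)
    for t in range(max_iter):
        post = e_step(X, theta)        # E-step (line 3 of Algorithm 1)
        theta = m_step(X, post)        # M-step (lines 4-5 of Algorithm 1)
        cur = log_lik(X, theta)
        if abs(cur - prev) < tol:      # line 7: stop once the log-likelihood converges
            break
        prev = cur
    return theta
```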
4.6 Will EM converge?

The analysis of EM's convergence properties is a complex topic. However, it is easy to show that the EM algorithm always achieves a higher (or equal) likelihood, and converges to a local maximum of the likelihood.

Let us consider two time steps t − 1 and t. From Equation 35, we get

    L(q̂^(t), θ^(t)) = ln p(X | θ^(t)),    (46)

    L(q̂^(t−1), θ^(t−1)) = ln p(X | θ^(t−1)).    (47)

Note that we have added the time index to the superscript of q̂ to emphasize that q̂ also changes between iterations.

Now, because at the (t − 1)-th iteration we have

    θ^(t) = arg max_θ L(q̂^(t−1), θ),    (48)

it follows that

    L(q̂^(t−1), θ^(t)) ≥ L(q̂^(t−1), θ^(t−1)).    (49)

Similarly, at the t-th iteration, based on Equation 33 and Equation 35, we have

    L(q̂^(t−1), θ^(t)) ≤ ln p(X | θ^(t)) = L(q̂^(t), θ^(t)).    (50)

Putting these equations together, we get

    ln p(X | θ^(t)) = L(q̂^(t), θ^(t))            [Use (46)]    (51)

                    ≥ L(q̂^(t−1), θ^(t))          [Use (50)]    (52)

                    ≥ L(q̂^(t−1), θ^(t−1))        [Use (49)]    (53)

                    = ln p(X | θ^(t−1)).          [Use (47)]    (54)

Hence, EM converges to a local maximum of the likelihood. However, the analysis of its convergence rate is very complex and beyond the scope of this introductory note.

5 EM for GMM

Now we can apply the EM algorithm to GMM. The first thing is to compute the posterior. Using Bayes' theorem, we have

    p(z_ij | x_j, θ^(t)) = p(x_j, z_ij | θ^(t)) / p(x_j | θ^(t)),    (55)

in which z_ij can be 0 or 1, and z_ij = 1 is true if and only if x_j is generated by the i-th Gaussian component.

Next, we will compute the Q function, which is the expectation of the complete data log-likelihood ln p(X, Z | θ) with respect to the posterior distribution we just found. The GMM complete data log-likelihood was already computed in Equation 18. For easier reference, we copy this equation here:

    ∑_{j=1}^{M} ∑_{i=1}^{N} z_ij ( (1/2) ( ln |Σ_i^{−1}| − (x_j − µ_i)^T Σ_i^{−1} (x_j − µ_i) ) + ln α_i ) + const.    (56)

The expectation of Equation 56 with respect to Z is

    ∑_{j=1}^{M} ∑_{i=1}^{N} γ_ij ( (1/2) ( ln |Σ_i^{−1}| − (x_j − µ_i)^T Σ_i^{−1} (x_j − µ_i) ) + ln α_i ),    (57)
where the constant term is ignored and γ_ij is the expectation of z_ij given x_j and θ^(t). In other words, we need to compute the expectation of z_ij under the conditional distribution defined by Equation 55.

In Equation 55, the denominator does not depend on Z, and p(x_j | θ^(t)) equals ∑_{i=1}^{N} α_i^(t) N(x_j; µ_i^(t), Σ_i^(t)). For the numerator, we can compute it directly using the two-step sampling interpretation, as

    p(x_j, z_ij = 1 | θ^(t)) = Pr(z_ij = 1 | θ^(t)) p(x_j | z_ij = 1, θ^(t))    (58)

                             = α_i^(t) N(x_j; µ_i^(t), Σ_i^(t)).    (59)

Note that when z_ij = 0, we always have z_ij p(x_j, z_ij | θ^(t)) = 0, so only the z_ij = 1 case contributes to the expectation. Hence, we have

    γ_ij = E[z_ij | x_j, θ^(t)] = Pr(z_ij = 1 | x_j, θ^(t))    (60)

         = α_i^(t) N(x_j; µ_i^(t), Σ_i^(t)) / p(x_j | θ^(t))    (61)

         = α_i^(t) N(x_j; µ_i^(t), Σ_i^(t)) / ∑_{k=1}^{N} α_k^(t) N(x_j; µ_k^(t), Σ_k^(t))    (62)

for 1 ≤ i ≤ N, 1 ≤ j ≤ M. After the γ_ij are computed, Equation 57 is completely specified.

We start the optimization from α_i. Because of the constraint ∑_{i=1}^{N} α_i = 1, we use the Lagrange multiplier method, remove the irrelevant terms, and get

    ∑_{j=1}^{M} ∑_{i=1}^{N} γ_ij ln α_i + λ ( ∑_{i=1}^{N} α_i − 1 ).    (63)

Setting the derivative to 0 gives, for any 1 ≤ i ≤ N,

    ∑_{j=1}^{M} γ_ij / α_i + λ = 0,    (64)

or α_i = −( ∑_{j=1}^{M} γ_ij ) / λ. Because ∑_{i=1}^{N} α_i = 1, we know that λ = −∑_{i=1}^{N} ∑_{j=1}^{M} γ_ij. Hence, α_i = ∑_{j=1}^{M} γ_ij / ∑_{i=1}^{N} ∑_{j=1}^{M} γ_ij. For notational simplicity, we define

    m_i = ∑_{j=1}^{M} γ_ij.    (65)

From the definition of γ_ij, it is easy to prove that

    ∑_{i=1}^{N} m_i = ∑_{i=1}^{N} ∑_{j=1}^{M} γ_ij = ∑_{j=1}^{M} ( ∑_{i=1}^{N} γ_ij ) = ∑_{j=1}^{M} 1 = M.    (66)
Then, we get the updating rule for α_i:

    α_i^(t+1) = m_i / M.    (67)

Furthermore, using similar steps as in deriving the single Gaussian equations, [9] it is easy to show that for any 1 ≤ i ≤ N,

    µ_i^(t+1) = ( ∑_{j=1}^{M} γ_ij x_j ) / m_i,    (68)

    Σ_i^(t+1) = ( ∑_{j=1}^{M} γ_ij (x_j − µ_i^(t+1)) (x_j − µ_i^(t+1))^T ) / m_i.    (69)

Putting these results together, we have the complete set of updating rules for GMM. If at iteration t the parameters are estimated as α_i^(t), µ_i^(t), and Σ_i^(t) for 1 ≤ i ≤ N, the EM algorithm updates these parameters as (for 1 ≤ i ≤ N, 1 ≤ j ≤ M)

    γ_ij = α_i^(t) N(x_j; µ_i^(t), Σ_i^(t)) / ∑_{k=1}^{N} α_k^(t) N(x_j; µ_k^(t), Σ_k^(t)),    (70)

    m_i = ∑_{j=1}^{M} γ_ij,    (71)

    µ_i^(t+1) = ( ∑_{j=1}^{M} γ_ij x_j ) / m_i,    (72)

    Σ_i^(t+1) = ( ∑_{j=1}^{M} γ_ij (x_j − µ_i^(t+1)) (x_j − µ_i^(t+1))^T ) / m_i.    (73)

[9] Please refer to my note on properties of normal distributions.
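Equations 70-73 (together with α_i^(t+1) = m_i / M from Equation 67) translate into a compact implementation. A minimal NumPy/SciPy sketch (illustrative function and variable names; no attention is paid to numerical safeguards such as log-sum-exp or covariance regularization):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, N, T=100, seed=0):
    """EM for a GMM with N components on data X of shape (M, d)."""
    M, d = X.shape
    rng = np.random.default_rng(seed)
    alpha = np.full(N, 1.0 / N)                        # mixing coefficients
    mu = X[rng.choice(M, N, replace=False)]            # initialize means from the data
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(N)])
    for _ in range(T):
        # E-step: responsibilities gamma[j, i] = p(z_ij = 1 | x_j, theta)   (Eq. 70)
        dens = np.column_stack([alpha[i] * multivariate_normal.pdf(X, mu[i], Sigma[i])
                                for i in range(N)])
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update alpha, mu, Sigma                                   (Eqs. 67, 71-73)
        m = gamma.sum(axis=0)                          # m_i = sum_j gamma_ij
        alpha = m / M
        mu = (gamma.T @ X) / m[:, None]
        for i in range(N):
            diff = X - mu[i]
            Sigma[i] = (gamma[:, i, None] * diff).T @ diff / m[i]
    return alpha, mu, Sigma
```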
Exercises

1. Derive the updating equations for Gaussian Mixture Models by yourself. You should not refer to Section 5 during your derivation. If you have just finished reading Section 5, wait for at least 2 to 3 hours before working on this problem.

2. In this problem, we will use the Expectation-Maximization method to learn parameters in a hidden Markov model (HMM). As will be shown in this problem, the Baum-Welch algorithm is indeed performing EM updates. To work out the solution for this problem, you will also need knowledge and facts learned in the HMM and information theory notes. We will use the notation of the HMM note. For your convenience, the notation is repeated as follows.

   - There are N discrete states, denoted by symbols S_1, S_2, ..., S_N.
   - There are M discrete output symbols, denoted by V_1, V_2, ..., V_M.
   - We assume one sequence with T time steps, whose hidden state is Q_t and whose observed output is O_t at time t (1 ≤ t ≤ T). We use q_t and o_t to denote the indexes of the state and output symbols at time t, respectively, i.e., Q_t = S_{q_t} and O_t = V_{o_t}.
   - The notation 1:t denotes all the ordered time steps between 1 and t. For example, o_{1:T} is the sequence of all observed output symbols.
   - An HMM has parameters λ = (π, A, B), where π ∈ R^N specifies the initial state distribution, A ∈ R^{N×N} is the state transition matrix, and B ∈ R^{N×M} is the observation probability matrix. Note that A_ij = Pr(Q_t = S_j | Q_{t−1} = S_i) and b_j(k) = Pr(O_t = V_k | Q_t = S_j) are elements of A and B, respectively.
   - In this problem, we use a variable r to denote the index of EM iterations. Hence, λ^(1) are the initial parameters. Various probabilities have been defined in the HMM note, denoted by α_t(i), β_t(i), γ_t(i), δ_t(i) and ξ_t(i, j). In this problem, we assume that at the r-th iteration, λ^(r) are known and these probabilities are computed using λ^(r).

   The purpose of this problem is to use the EM algorithm to find λ^(r+1) using a training sequence o_{1:T} and λ^(r), by treating Q and O as the hidden and observed random variables, respectively.

   a) Suppose the hidden variables can be observed as S_{q_1}, S_{q_2}, ..., S_{q_T}. Show that the complete data log-likelihood is

       ln π_{q_1} + ∑_{t=1}^{T−1} ln A_{q_t q_{t+1}} + ∑_{t=1}^{T} ln b_{q_t}(o_t).    (74)

   b) The expectation of Equation 74 with respect to the hidden variables Q_t (conditioned on o_{1:T} and λ^(r)) forms an auxiliary function Q(λ, λ^(r)) (the E-step). Show that the expectation of the first term in Equation 74 equals ∑_{i=1}^{N} γ_1(i) ln π_i, i.e.,

       E_{Q_{1:T}} [ln π_{Q_1}] = ∑_{i=1}^{N} γ_1(i) ln π_i.    (75)

   c) Because the parameter π only hinges on Equation 75, the update rule for π can be found by maximizing this equation. Prove that we should set π_i^(r+1) = γ_1(i) in the M-step. Note that γ_1(i) is computed using λ^(r)
   as parameter values. (Hint: the right-hand side of Equation 75 is related to the cross entropy.)

   d) The second part of the E-step calculates the expectation of the middle term in Equation 74. Show that

       E_{Q_{1:T}} [ ∑_{t=1}^{T−1} ln A_{q_t q_{t+1}} ] = ∑_{i=1}^{N} ∑_{j=1}^{N} ( ∑_{t=1}^{T−1} ξ_t(i, j) ) ln A_ij.    (76)

   e) For the M-step relevant to A, prove that we should set

       A_ij^(r+1) = ∑_{t=1}^{T−1} ξ_t(i, j) / ∑_{t=1}^{T−1} γ_t(i).    (77)

   f) The final part of the E-step calculates the expectation of the last term in Equation 74. Show that

       E_{Q_{1:T}} [ ∑_{t=1}^{T} ln b_{q_t}(o_t) ] = ∑_{j=1}^{N} ∑_{k=1}^{M} ( ∑_{t=1}^{T} 1(o_t = k) γ_t(j) ) ln b_j(k).    (78)

   g) For the M-step relevant to B, prove that we should set

       b_j^(r+1)(k) = ∑_{t=1}^{T} 1(o_t = k) γ_t(j) / ∑_{t=1}^{T} γ_t(j),    (79)

      in which 1(·) is the indicator function.

   h) Are these results obtained using EM the same as those in the Baum-Welch algorithm?