9 : Learning Partially Observed GM : EM Algorithm
10-708: Probabilistic Graphical Models, Spring
9 : Learning Partially Observed GM : EM Algorithm
Lecturer: Eric P. Xing        Scribes: Rohan Ramanath, Rahul Goutam

1 Generalized Iterative Scaling

In this section, we summarize from the previous lecture the use of Generalized Iterative Scaling (GIS) to find the Maximum Likelihood Estimate (MLE) of a feature-based Undirected Graphical Model (UGM). Each feature $f_i$ has a weight $\theta_i$ which represents the numerical strength of the feature and whether it increases or decreases the probability of the clique. To represent these features as part of the probabilistic model, we can multiply them into the clique potentials to get:

$$p(x) = \frac{\exp\{\sum_i \theta_i f_i(x_{c_i})\}}{Z(\theta)}$$

At the maximum likelihood setting of the parameters, for each clique, the model marginals must be equal to the observed marginals, i.e. $\tilde{p}(x) = \frac{m(x)}{N} = p_{MLE}(x)$, where $m(x)$ is the number of times the configuration $x$ is found out of a possible $N$. The scaled likelihood function can then be represented as:

$$\tilde{l}(\theta; D) = \frac{l(\theta; D)}{N} = \frac{1}{N}\sum_n \log p(x_n \mid \theta) = \sum_x \tilde{p}(x) \log p(x \mid \theta) = \sum_x \tilde{p}(x) \sum_i \theta_i f_i(x) - \log Z(\theta)$$

The $\log Z(\theta)$ term is very complicated to handle directly. The tangent to the log function forms an upper bound on it ($\log Z(\theta) \le \mu Z(\theta) - \log \mu - 1, \ \forall \mu$). This fact is used to get a lower bound on the value of $\tilde{l}(\theta; D)$ by using $\mu = Z^{-1}(\theta^{(t)})$. Substituting $\Delta\theta_i^{(t)} = \theta_i - \theta_i^{(t)}$, we get:

$$\begin{aligned}
\tilde{l}(\theta; D) &\ge \sum_x \tilde{p}(x) \sum_i \theta_i f_i(x) - \frac{Z(\theta)}{Z(\theta^{(t)})} - \log Z(\theta^{(t)}) + 1 \\
&= \sum_i \theta_i \sum_x \tilde{p}(x) f_i(x) - \frac{1}{Z(\theta^{(t)})} \sum_x \exp\Big\{\sum_i \theta_i f_i(x)\Big\} - \log Z(\theta^{(t)}) + 1 \\
&= \sum_i \theta_i \sum_x \tilde{p}(x) f_i(x) - \frac{1}{Z(\theta^{(t)})} \sum_x \exp\Big\{\sum_i \theta_i^{(t)} f_i(x)\Big\} \exp\Big\{\sum_i \Delta\theta_i^{(t)} f_i(x)\Big\} - \log Z(\theta^{(t)}) + 1 \\
&= \sum_i \theta_i \sum_x \tilde{p}(x) f_i(x) - \sum_x p(x \mid \theta^{(t)}) \exp\Big\{\sum_i \Delta\theta_i^{(t)} f_i(x)\Big\} - \log Z(\theta^{(t)}) + 1
\end{aligned}$$
We can further relax the constraints by assuming $\sum_i f_i(x) = 1$ and $f_i(x) \ge 0$. Since the exponential function is convex, $\exp(\sum_i \pi_i x_i) \le \sum_i \pi_i \exp(x_i)$, so

$$\tilde{l}(\theta; D) \ge \sum_i \theta_i \sum_x \tilde{p}(x) f_i(x) - \sum_x p(x \mid \theta^{(t)}) \sum_i f_i(x) \exp\{\Delta\theta_i^{(t)}\} - \log Z(\theta^{(t)}) + 1 \triangleq \Lambda(\theta)$$

Setting the derivative of $\Lambda$ with respect to $\theta_i$ to zero gives

$$\frac{\partial \Lambda(\theta)}{\partial \theta_i} = \sum_x \tilde{p}(x) f_i(x) - \sum_x p(x \mid \theta^{(t)}) f_i(x) \exp\{\Delta\theta_i^{(t)}\} = 0
\quad\Rightarrow\quad
\exp\{\Delta\theta_i^{(t)}\} = \frac{\sum_x \tilde{p}(x) f_i(x)}{\sum_x p^{(t)}(x) f_i(x)}$$

where $p^{(t)}(x)$ is the unnormalized version of $p(x \mid \theta^{(t)})$. Substituting this result into the update equations, we get:

$$\theta_i^{(t+1)} = \theta_i^{(t)} + \Delta\theta_i^{(t)}, \qquad
p^{(t+1)}(x) = p^{(t)}(x) \prod_i e^{\Delta\theta_i^{(t)} f_i(x)} = p^{(t)}(x) \prod_i \left(\frac{\sum_x \tilde{p}(x) f_i(x)}{\sum_x p^{(t)}(x) f_i(x)}\right)^{f_i(x)}$$

$$\theta_i^{(t+1)} = \theta_i^{(t)} + \log\left(\frac{\sum_x \tilde{p}(x) f_i(x)}{\sum_x p^{(t)}(x) f_i(x)}\right)$$

To summarize, GIS can be used for iterative scaling on general UGMs with feature-based potentials. IPF is a special case of GIS in which the clique potential is built on features defined as indicator functions of clique configurations.

2 Mixture Models

Let us consider a dataset $X$. We wish to model the data by specifying a joint distribution $p(X_i, Z_i) = p(X_i \mid Z_i) p(Z_i)$. Here, $Z_i \sim \text{Multinomial}(\pi)$ (where $\pi_j \ge 0$, $\sum_j \pi_j = 1$, and the parameter $\pi_j$ gives $p(Z_i = j)$), and $X_i \mid Z_i = j \sim \mathcal{N}(\mu_j, \Sigma_j)$. We let $K$ denote the number of values that the $Z_i$'s can take on. Thus, our model posits that each $X_i$ was generated by randomly choosing $Z_i$ from $\{1, \dots, K\}$, and then $X_i$ was drawn from one of $K$ Gaussians depending on $Z_i$. This is called the mixture of Gaussians model. Also, note that the $z^{(i)}$'s are latent random variables, meaning that they are hidden/unobserved. This makes the problem tractable, as each conditional distribution $P(X \mid Z)$ can now be assumed to be unimodal. In the specific case of a mixture of Gaussians, the marginal density factorizes as:

$$P(X \mid \mu, \Sigma) = \sum_k \pi_k \, \mathcal{N}(X \mid \mu_k, \Sigma_k)$$

where $\pi_k$ is called the mixture proportion and $\mathcal{N}(X \mid \mu_k, \Sigma_k)$ is the Gaussian distribution of the $k$-th mixture component. The task is to learn the values of $\pi_k$, $\mu_k$ and $\Sigma_k$. In the fully observed case, this is fairly straightforward: we would simply decompose the log likelihood function and then find the maximum likelihood estimates for all the parameters. The log likelihood can be expressed as:
Figure 1: A mixture of 2 Gaussians

$$\begin{aligned}
l(\theta; D) &= \sum_n \log P(x_n, z_n) = \sum_n \log P(z_n \mid \pi)\, P(x_n \mid z_n, \mu, \Sigma) \\
&= \sum_n \log \prod_k \pi_k^{z_n^k} + \sum_n \log \prod_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)^{z_n^k} \\
&= \sum_n \sum_k z_n^k \log \pi_k \;-\; \frac{1}{2} \sum_n \sum_k z_n^k (x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k) + C
\end{aligned}$$

Since the model is fully observed, the log likelihood completely decomposes. However, if we do not observe the values of $z$, as in the mixture model setting, all the parameters become coupled together, as shown below:

$$l_c(\theta; D) = \sum_n \log \sum_z P(x_n, z \mid \theta) = \sum_n \log \sum_z P(z \mid \theta_z)\, P(x_n \mid z, \theta_x)$$

The summation is now inside the log, and the log likelihood no longer decomposes nicely. Hence, it is hard to find a closed-form solution for the maximum likelihood estimates. This provides the motivation for the Expectation Maximization (EM) algorithm.

3 K-Means

Given a set of observations $(x_1, x_2, \dots, x_n)$, where each observation is a $d$-dimensional real vector, $k$-means clustering aims to partition the $n$ observations into $k$ sets ($k \le n$), given by assignments $z = \{z_1, z_2, \dots, z_n\}$, so as to minimize the within-cluster sum of squares (WCSS). We start with initial random guesses for the class centroids and iterate between the following two steps until convergence.

Expectation Step: Use some distance metric (Euclidean distance, cosine similarity, etc.) and the
current guess of the centroids to assign every data point to the nearest centroid:

$$z_n^{(t)} = \arg\min_k \, (x_n - \mu_k^{(t)})^T \Sigma_k^{-1(t)} (x_n - \mu_k^{(t)})$$

Maximization Step: Use the new class assignments to recompute the centroid values:

$$\mu_k^{(t+1)} = \frac{\sum_n \delta(z_n^{(t)}, k)\, x_n}{\sum_n \delta(z_n^{(t)}, k)}$$

To understand the EM viewpoint, we can think of each of the clusters as being associated with some distribution, i.e. $P(x \mid z = k) \sim \mathcal{N}(x \mid \mu_k, \Sigma_k)$, and we would like to learn the parameters ($\mu_k$ and $\Sigma_k$) of each distribution.

4 EM Algorithm

The EM algorithm is an efficient iterative procedure for computing the Maximum Likelihood (ML) estimate in the presence of latent variables. In ML estimation, we wish to estimate the model parameter(s) for which the observed data are most likely. As seen in Section 2, the maximum likelihood estimator cannot be used directly in the partially observed setting. Finding a maximum likelihood solution requires taking derivatives of the likelihood function with respect to both the parameters and the latent variables and simultaneously solving the resulting equations. In statistical models with latent variables, this usually leads to intractable solutions. The EM algorithm works around this by considering the expected complete log likelihood. Each iteration of the EM algorithm consists of two processes: the E-step and the M-step. In the expectation, or E-step, the latent variables are estimated given the observed data and the current estimate of the model parameters. This is achieved using the conditional expectation, explaining the choice of terminology. In the M-step, the likelihood function is maximized under the assumption that the latent variables are known; the estimates of the latent variables from the E-step are used in lieu of the actual latent variables. The log likelihood from Section 2 can be written in the form of an expected log likelihood as:

$$\langle l_c(\theta; x, z) \rangle = \sum_n \sum_k \langle z_n^k \rangle_{P(Z \mid X; \theta)} \log \pi_k \;-\; \frac{1}{2} \sum_n \sum_k \langle z_n^k \rangle_{P(Z \mid X; \theta)} \left( (x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k) + \log |\Sigma_k| \right) + C$$

where $\langle z_n^k \rangle_{P(Z \mid X; \theta)}$ is the expected value of $z_n^k$ with respect to $P(Z \mid X; \theta)$. The EM algorithm maximizes $\langle l_c(\theta; D) \rangle_{P(Z \mid X; \theta)}$ by iterating between two steps (called the E and M steps).
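The two K-means steps described in Section 3 can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the lecture: it uses plain Euclidean distance (i.e. $\Sigma_k$ fixed to the identity), and the function name and `init` argument are our own.

```python
import numpy as np

def kmeans(X, K, n_iters=100, init=None, seed=0):
    """K-means as hard-assignment EM: alternate nearest-centroid
    assignment (E-step) and centroid re-estimation (M-step)."""
    rng = np.random.default_rng(seed)
    # Initial guesses for the class centroids: given, or random data points.
    mu = np.asarray(init, float) if init is not None else \
        X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # E-step: squared Euclidean distance of every point to every centroid,
        # then assign each point to its nearest centroid.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
        z = d2.argmin(axis=1)
        # M-step: recompute each centroid as the mean of its assigned points
        # (keeping the old centroid if a cluster happens to be empty).
        new_mu = np.array([X[z == k].mean(axis=0) if np.any(z == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):  # assignments stopped changing the centroids
            break
        mu = new_mu
    return mu, z
```

On two well-separated point clouds, the returned centroids converge to the within-cluster means after a couple of iterations.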
In the E-step, we compute the sufficient statistics of the hidden variables (i.e., $Z$) using the current estimate of the parameters (i.e., $\mu_k$ and $\Sigma_k$). The M-step then maximizes over the parameters using the expected values of the hidden variables computed in the preceding E-step. Specifically, for the GMM, we compute $\tau_n^{k(t)} = \langle z_n^k \rangle_{P(Z \mid X; \theta^{(t)})}$ in the E-step and maximize the expected complete log likelihood with respect to the model parameters $\mu_k$, $\Sigma_k$ and $\pi_k$ in the M-step to obtain:

$$\pi_k^{(t+1)} = \frac{\langle n_k \rangle}{N}, \qquad
\mu_k^{(t+1)} = \frac{\sum_n \tau_n^{k(t)} x_n}{\sum_n \tau_n^{k(t)}}, \qquad
\Sigma_k^{(t+1)} = \frac{\sum_n \tau_n^{k(t)} (x_n - \mu_k^{(t+1)})(x_n - \mu_k^{(t+1)})^T}{\sum_n \tau_n^{k(t)}}$$

where $\tau_n^{k(t)} = \langle z_n^k \rangle_{P(Z \mid X; \theta^{(t)})}$ and $\langle n_k \rangle = \sum_n \tau_n^{k(t)}$.
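These E- and M-step updates for a mixture of Gaussians can be sketched directly in NumPy. This is an illustrative sketch, not code from the lecture: the names `em_gmm` and `init_mu` are our own, the Gaussian density is hand-rolled via a Cholesky factorization, and a small ridge is added to each covariance for numerical stability.

```python
import numpy as np

def log_gauss(X, mu, Sigma):
    """Log-density of N(mu, Sigma) at each row of X."""
    d = X.shape[1]
    L = np.linalg.cholesky(Sigma)
    sol = np.linalg.solve(L, (X - mu).T)        # L^{-1} (x - mu)
    logdet = 2.0 * np.log(np.diag(L)).sum()
    return -0.5 * ((sol ** 2).sum(axis=0) + d * np.log(2 * np.pi) + logdet)

def em_gmm(X, K, n_iters=50, init_mu=None, seed=0):
    """EM for a mixture of Gaussians: responsibilities tau in the E-step,
    the three closed-form weighted-average updates in the M-step."""
    N, d = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)
    mu = np.asarray(init_mu, float) if init_mu is not None else \
        X[rng.choice(N, size=K, replace=False)]
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * K)
    for _ in range(n_iters):
        # E-step: tau[n, k] = p(z_n^k = 1 | x_n, theta^(t)).
        tau = np.stack([pi[k] * np.exp(log_gauss(X, mu[k], Sigma[k]))
                        for k in range(K)], axis=1)
        tau /= tau.sum(axis=1, keepdims=True)
        # M-step: expected counts <n_k>, then pi, mu, Sigma updates.
        nk = tau.sum(axis=0)
        pi = nk / N
        mu = (tau.T @ X) / nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (tau[:, k, None] * diff).T @ diff / nk[k] \
                + 1e-6 * np.eye(d)
    return pi, mu, Sigma, tau
```

Replacing the soft responsibilities $\tau$ with hard 0/1 assignments recovers exactly the K-means updates of Section 3.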
5 Comparison between K-means and EM

The EM algorithm for the mixture of Gaussians is a soft version of K-means. In K-means, the E-step assigns each point to a single cluster, and the M-step recomputes the cluster centers assuming each point belongs to exactly one cluster. In the EM version, the E-step assigns points to clusters in a probabilistic manner, and the M-step recomputes the cluster centroids under these probabilistic assignments, where the weight of each point is given by $\tau_n^{k(t)}$.

6 Theory underlying the EM Algorithm

The previous sections tell us how the EM algorithm can be used to compute the Maximum Likelihood (ML) estimate in the presence of missing or hidden data. Let $X$ represent the set of observable variables and $Z$ the set of all latent variables. The probability model is $p(X, Z \mid \theta)$. If $Z$ were observed, we would define the complete log likelihood as

$$l_c(\theta; X, Z) = \log p(X, Z \mid \theta)$$

ML estimation would then amount to maximizing the complete log likelihood. If the probability $p(X, Z \mid \theta)$ factors such that separate components of $\theta$ occur in separate factors, then the log separates them into different terms which can be maximized independently (decoupling). However, since $Z$ is not observed, it has to be marginalized out, and the incomplete log likelihood function takes the following form:

$$l(\theta; X) = \log p(X \mid \theta) = \log \sum_Z p(X, Z \mid \theta)$$

The incomplete log likelihood cannot be decoupled, because the summation over $p(X, Z \mid \theta)$ lies inside the logarithm. To handle this, we average over $z$ to remove the randomness, using an averaging distribution $q(Z \mid X)$. The expected complete log likelihood can then be defined as:

$$\langle l_c(\theta; x, Z) \rangle_q = \sum_Z q(Z \mid x, \theta) \log p(x, Z \mid \theta)$$

The expected complete log likelihood is a deterministic function of $\theta$. It is also linear in the complete log likelihood function and hence inherits its factorizability. The expected complete log likelihood also provides a lower bound on the (incomplete) log likelihood, as shown below:

$$l(\theta; x) = \log p(x \mid \theta) = \log \sum_Z p(x, Z \mid \theta) = \log \sum_Z q(Z \mid x) \frac{p(x, Z \mid \theta)}{q(Z \mid x)}$$
$$\ge \sum_Z q(Z \mid x) \log \frac{p(x, Z \mid \theta)}{q(Z \mid x)} \qquad \text{(Jensen's inequality)}$$

$$l(\theta; x) \ge \langle l_c(\theta; x, Z) \rangle_q + H_q$$

where $H_q$ is the entropy of the distribution $q$ and does not depend on $\theta$; it is a constant for a particular distribution $q$. Hence, to maximize this lower bound on the log likelihood with respect to $\theta$, it is sufficient to maximize the expected complete log likelihood.

The EM algorithm can be seen as a coordinate ascent algorithm. For fixed data $X$, define a functional $F$, called the free energy, as:

$$F(q, \theta) = \sum_Z q(Z \mid x) \log \frac{p(x, Z \mid \theta)}{q(Z \mid x)} \le l(\theta; x)$$

The EM algorithm is coordinate ascent on $F$. The E-step and the M-step can then be defined as

E-step: $q^{t+1} = \arg\max_q F(q, \theta^t)$
M-step: $\theta^{t+1} = \arg\max_\theta F(q^{t+1}, \theta)$

At the $(t+1)$-th step, we first maximize the free energy $F(q, \theta^t)$ with respect to $q$. For this optimal choice $q^{t+1}$, we then maximize $F(q^{t+1}, \theta)$ with respect to $\theta$ to get the updated value $\theta^{t+1}$.

Let us look at the E-step. If we set $q^{t+1}(Z \mid x)$ to the posterior distribution of the latent variables given the data and the parameters, $p(Z \mid x, \theta^t)$, we maximize $F(q, \theta^t)$. The proof of this claim is given below:

$$F(p(Z \mid x, \theta^t), \theta^t) = \sum_Z p(Z \mid x, \theta^t) \log \frac{p(x, Z \mid \theta^t)}{p(Z \mid x, \theta^t)} = \sum_Z p(Z \mid x, \theta^t) \log p(x \mid \theta^t) = \log p(x \mid \theta^t) = l(\theta^t; x)$$

Since $l(\theta^t; x)$ is an upper bound on $F(q, \theta^t)$, this shows that $F(q, \theta^t)$ is maximized by setting $q(Z \mid x)$ to the posterior probability distribution $p(Z \mid x, \theta^t)$. The claim can also be proved using variational calculus, or from the identity

$$l(\theta; x) - F(q, \theta) = KL\big(q \,\|\, p(Z \mid x, \theta)\big)$$

Without loss of generality, we can assume $p(x, Z \mid \theta)$ to be a generalized exponential family distribution:

$$p(x, Z \mid \theta) = \frac{1}{Z(\theta)} h(x, Z) \exp\Big\{\sum_i \theta_i f_i(x, Z)\Big\}$$

A special case is when the $p(x \mid Z)$ are GLIMs. Then $f_i(x, Z) = \eta_i^T(z)\, \xi_i(x)$.
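The two facts just shown, that $F(q, \theta) \le l(\theta; x)$ for any $q$, with equality exactly when $q$ is the posterior $p(Z \mid x, \theta)$, can be checked numerically on a toy discrete model. All the numbers below are made up for illustration: one observation $x$, a binary latent $Z$, and fixed parameters $\theta$ summarized by a prior and a likelihood vector.

```python
import numpy as np

# Toy latent-variable model (illustrative numbers only).
pz = np.array([0.3, 0.7])            # p(Z)
px_given_z = np.array([0.8, 0.1])    # p(x | Z) at the observed x
p_joint = pz * px_given_z            # p(x, Z | theta)
log_lik = np.log(p_joint.sum())      # l(theta; x) = log p(x | theta)
posterior = p_joint / p_joint.sum()  # p(Z | x, theta)

def free_energy(q):
    """F(q, theta) = sum_Z q(Z|x) log[ p(x, Z | theta) / q(Z|x) ]."""
    q = np.asarray(q, float)
    return float(np.sum(q * (np.log(p_joint) - np.log(q))))

# Any distribution q gives a lower bound on the log likelihood...
for q in ([0.5, 0.5], [0.9, 0.1], [0.2, 0.8]):
    assert free_energy(q) <= log_lik
# ...and the posterior attains it exactly (the E-step's optimal choice).
assert np.isclose(free_energy(posterior), log_lik)
```

The gap between the two quantities at any other $q$ is precisely the KL divergence $KL(q \,\|\, p(Z \mid x, \theta))$ from the identity above.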
The expected complete log likelihood under $q^{t+1} = p(Z \mid x, \theta^t)$ is given below:

$$\begin{aligned}
\langle l_c(\theta; x, Z) \rangle_{q^{t+1}} &= \sum_Z q(Z \mid x, \theta^t) \sum_i \theta_i f_i(x, Z) - A(\theta) \\
&= \sum_i \theta_i \langle f_i(x, Z) \rangle_{q(Z \mid x, \theta^t)} - A(\theta) \\
&\overset{\text{GLIM}}{=} \sum_i \theta_i \langle \eta_i(Z) \rangle_{q(Z \mid x, \theta^t)}\, \xi_i(x) - A(\theta)
\end{aligned}$$

Now let us analyze the M-step of the EM algorithm. The M-step can be viewed as maximization of the expected complete log likelihood. This is shown below:

$$F(q, \theta) = \sum_Z q(Z \mid x) \log \frac{p(x, Z \mid \theta)}{q(Z \mid x)} = \sum_Z q(Z \mid x) \log p(x, Z \mid \theta) - \sum_Z q(Z \mid x) \log q(Z \mid x) = \langle l_c(\theta; x, Z) \rangle_q + H_q$$

The free energy hence breaks into two terms. The first term is the expected complete log likelihood, and the second term, which is independent of $\theta$, is the entropy. Thus, maximizing the free energy with respect to $\theta$ is equivalent to maximizing the expected complete log likelihood:

$$\theta^{t+1} = \arg\max_\theta \langle l_c(\theta; x, Z) \rangle_{q^{t+1}} = \arg\max_\theta \sum_Z q(Z \mid x, \theta^t) \log p(x, Z \mid \theta)$$

Under the optimal $q^{t+1}$, this is equivalent to solving a standard MLE problem for the fully observed model $p(x, Z \mid \theta)$, with the sufficient statistics involving $Z$ replaced by their expectations w.r.t. $p(Z \mid x, \theta)$.

7 Example: Hidden Markov Model

Let us look at the learning problem of Hidden Markov Models (HMMs) in the framework of the EM algorithm. The learning problem in HMMs can take two forms:

Supervised learning: We have annotated data with the correct answer known. Examples include a genomic region where we have good annotations of the CpG islands, or a casino player who allows us to observe him as he changes dice while producing 10,000 rolls. Since all variables are fully observed, we can learn the parameters $\theta$ of the model by applying MLE.

Unsupervised learning: We do not have annotated data. We observe only some of the variables, and the remaining (latent) variables are unknown. Examples include the porcupine genome, where we do not know how frequent the CpG islands are, or 10,000 rolls of dice from a casino player without knowing when he changes dice.
Figure 2: A hidden Markov model

7.1 Baum-Welch Algorithm

The Baum-Welch algorithm is used to learn the parameters $\theta$ of an HMM. The complete log likelihood of an HMM can be written as

$$l_c(\theta; x, y) = \log p(x, y) = \log \prod_n \left( p(y_{n,1}) \prod_{t=2}^T p(y_{n,t} \mid y_{n,t-1}) \prod_{t=1}^T p(x_{n,t} \mid y_{n,t}) \right)$$

The expected complete log likelihood can be written as

$$\langle l_c(\theta; x, y) \rangle = \sum_n \langle y_{n,1}^i \rangle_{p(y_{n,1} \mid x_n)} \log \pi_i \;+\; \sum_n \sum_{t=2}^T \langle y_{n,t-1}^i y_{n,t}^j \rangle_{p(y_{n,t-1}, y_{n,t} \mid x_n)} \log a_{i,j} \;+\; \sum_n \sum_{t=1}^T x_{n,t}^k \langle y_{n,t}^i \rangle_{p(y_{n,t} \mid x_n)} \log b_{i,k}$$

where $A = \{a_{i,j}\} = P(Y_t^j = 1 \mid Y_{t-1}^i = 1)$ is the transition matrix and $B = \{b_{i,k}\} = P(x_t^k = 1 \mid Y_t^i = 1)$ is the emission matrix. The E-step of the EM algorithm is then

$$\gamma_{n,t}^i = \langle y_{n,t}^i \rangle = p(y_{n,t}^i = 1 \mid x_n), \qquad
\xi_{n,t}^{i,j} = \langle y_{n,t-1}^i y_{n,t}^j \rangle = p(y_{n,t-1}^i = 1, y_{n,t}^j = 1 \mid x_n)$$

The M-step of the algorithm is

$$\pi_i^{ML} = \frac{\sum_n \gamma_{n,1}^i}{N}, \qquad
a_{i,j}^{ML} = \frac{\sum_n \sum_{t=2}^T \xi_{n,t}^{i,j}}{\sum_n \sum_{t=1}^{T-1} \gamma_{n,t}^i}, \qquad
b_{i,k}^{ML} = \frac{\sum_n \sum_{t=1}^T \gamma_{n,t}^i\, x_{n,t}^k}{\sum_n \sum_{t=1}^T \gamma_{n,t}^i}$$

In the unsupervised learning case, the Baum-Welch algorithm proceeds as follows:

1. Start with a best guess of the parameters $\theta$ for the model.
2. Estimate the counts $A_{ij}$, $B_{ik}$ on the training data:
   $$A_{ij} = \sum_{n,t} \langle y_{n,t-1}^i y_{n,t}^j \rangle, \qquad B_{ik} = \sum_{n,t} \langle y_{n,t}^i \rangle\, x_{n,t}^k$$
3. Update $\theta$ according to $A_{ij}$, $B_{ik}$. The problem now becomes one of supervised learning.
4. Repeat steps 2 and 3 until convergence.

It can be proven that we get a more (or equally) likely parameter set $\theta$ in each iteration.

8 EM for general BNs

Algorithm 1 shows the EM algorithm for a general Bayesian Network.

    while not converged do
        % E-step
        for each node i do
            ESS_i = 0
        end
        for each data sample n do
            do inference with x_{n,H}
            for each node i do
                ESS_i += < SS_i(x_{n,i}, x_{n,pi_i}) >_{p(x_{n,H} | x_{n,-H})}
            end
        end
        % M-step
        for each node i do
            theta_i = MLE(ESS_i)
        end
    end

    Algorithm 1: EM algorithm for a general BN

9 Summary of the EM algorithm

In summary, the EM algorithm is a way of maximizing the likelihood of latent variable models. It computes the ML estimate of the parameters by breaking the original problem into two parts:

1. Estimating the unobserved data from the observed data and the current parameters.
2. Using this completed data, with all variables observed, to find the ML estimates of the parameters.

The algorithm alternates between filling in the latent variables using its best guess and updating the parameters based on this guess.

10 EM for Conditional Mixture Models

The model for the conditional mixture model is defined as
$$P(Y \mid X) = \sum_k p(z^k = 1 \mid x, \xi)\, p(y \mid z^k = 1, x, \theta_k, \sigma_k)$$

The objective function for EM is defined as

$$\langle l_c(\theta; x, y, z) \rangle = \sum_n \langle \log p(z_n \mid x_n, \xi) \rangle_{p(z \mid x, y)} + \sum_n \langle \log p(y_n \mid x_n, z_n, \theta, \sigma) \rangle_{p(z \mid x, y)}$$

The E-step of the EM algorithm is

$$\tau_n^{k(t)} = P(z_n^k = 1 \mid x_n, y_n, \theta) = \frac{P(z^k = 1 \mid x_n)\, p_k(y_n \mid x_n, \theta_k, \sigma_k^2)}{\sum_j P(z^j = 1 \mid x_n)\, p_j(y_n \mid x_n, \theta_j, \sigma_j^2)}$$

The M-step of the EM algorithm uses the normal equation for linear regression, $\theta = (X^T X)^{-1} X^T Y$, but with the data re-weighted by $\tau$, or uses the weighted IRLS algorithm to update $\xi$, $\theta_k$, $\sigma_k$ based on the data points $(x_n, y_n)$ with weights $\tau_n^{k(t)}$.

11 EM Variants

There are some variants of the EM algorithm.

Sparse EM: The sparse EM algorithm does not recompute, at every iteration, the posterior probability of each data point under the models for which it is very close to 0. Instead, it keeps an active list which it updates every once in a while.

Generalized (Incomplete) EM: Sometimes it is hard to compute the maximum likelihood estimate of the parameters in the M-step, even with the complete data. However, we can still make progress by doing an M-step that merely increases the likelihood a bit (like a gradient step).

12 EM: Report Card

EM is one of the most popular methods for obtaining the Maximum Likelihood Estimate or the Maximum a Posteriori Estimate of the parameters in a model with latent, unobserved variables. EM does not require any learning rate parameter and is very fast in low dimensions. It also ensures convergence, as each iteration is guaranteed to improve the likelihood. The cons of the EM algorithm are that it can give us a local optimum instead of the global optimum, it can be slower than conjugate gradient (especially near convergence), it requires an expensive inference step, and it is a maximum likelihood/MAP method.

Disclaimer: Some portions of the content have been taken directly from the lecture slides [1] and last year's scribe notes [2]. The authors do not claim originality of these notes.

[1] epng/class/10708/lectures/lecture9-em.pdf
[2] epng/class/ /lecture/scrbe9.pdf
More informationXII.3 The EM (Expectation-Maximization) Algorithm
XII.3 The EM (Expectaton-Maxzaton) Algorth Toshnor Munaata 3/7/06 The EM algorth s a technque to deal wth varous types of ncoplete data or hdden varables. It can be appled to a wde range of learnng probles
More informationLecture 12: Discrete Laplacian
Lecture 12: Dscrete Laplacan Scrbe: Tanye Lu Our goal s to come up wth a dscrete verson of Laplacan operator for trangulated surfaces, so that we can use t n practce to solve related problems We are mostly
More informationClustering & Unsupervised Learning
Clusterng & Unsupervsed Learnng Ken Kreutz-Delgado (Nuno Vasconcelos) ECE 175A Wnter 2012 UCSD Statstcal Learnng Goal: Gven a relatonshp between a feature vector x and a vector y, and d data samples (x,y
More informationThe Multiple Classical Linear Regression Model (CLRM): Specification and Assumptions. 1. Introduction
ECONOMICS 5* -- NOTE (Summary) ECON 5* -- NOTE The Multple Classcal Lnear Regresson Model (CLRM): Specfcaton and Assumptons. Introducton CLRM stands for the Classcal Lnear Regresson Model. The CLRM s also
More informationThe Gaussian classifier. Nuno Vasconcelos ECE Department, UCSD
he Gaussan classfer Nuno Vasconcelos ECE Department, UCSD Bayesan decson theory recall that we have state of the world X observatons g decson functon L[g,y] loss of predctng y wth g Bayes decson rule s
More informationGaussian process classification: a message-passing viewpoint
Gaussan process classfcaton: a message-passng vewpont Flpe Rodrgues fmpr@de.uc.pt November 014 Abstract The goal of ths short paper s to provde a message-passng vewpont of the Expectaton Propagaton EP
More informationAssortment Optimization under MNL
Assortment Optmzaton under MNL Haotan Song Aprl 30, 2017 1 Introducton The assortment optmzaton problem ams to fnd the revenue-maxmzng assortment of products to offer when the prces of products are fxed.
More informationOverview. Hidden Markov Models and Gaussian Mixture Models. Acoustic Modelling. Fundamental Equation of Statistical Speech Recognition
Overvew Hdden Marov Models and Gaussan Mxture Models Steve Renals and Peter Bell Automatc Speech Recognton ASR Lectures &5 8/3 January 3 HMMs and GMMs Key models and algorthms for HMM acoustc models Gaussans
More informationOn an Extension of Stochastic Approximation EM Algorithm for Incomplete Data Problems. Vahid Tadayon 1
On an Extenson of Stochastc Approxmaton EM Algorthm for Incomplete Data Problems Vahd Tadayon Abstract: The Stochastc Approxmaton EM (SAEM algorthm, a varant stochastc approxmaton of EM, s a versatle tool
More informationPattern Classification
Pattern Classfcaton All materals n these sldes ere taken from Pattern Classfcaton (nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wley & Sons, 000 th the permsson of the authors and the publsher
More informationOutline. Bayesian Networks: Maximum Likelihood Estimation and Tree Structure Learning. Our Model and Data. Outline
Outlne Bayesan Networks: Maxmum Lkelhood Estmaton and Tree Structure Learnng Huzhen Yu janey.yu@cs.helsnk.f Dept. Computer Scence, Unv. of Helsnk Probablstc Models, Sprng, 200 Notces: I corrected a number
More informationMaximal Margin Classifier
CS81B/Stat41B: Advanced Topcs n Learnng & Decson Makng Mamal Margn Classfer Lecturer: Mchael Jordan Scrbes: Jana van Greunen Corrected verson - /1/004 1 References/Recommended Readng 1.1 Webstes www.kernel-machnes.org
More informationClustering & (Ken Kreutz-Delgado) UCSD
Clusterng & Unsupervsed Learnng Nuno Vasconcelos (Ken Kreutz-Delgado) UCSD Statstcal Learnng Goal: Gven a relatonshp between a feature vector x and a vector y, and d data samples (x,y ), fnd an approxmatng
More informationVQ widely used in coding speech, image, and video
at Scalar quantzers are specal cases of vector quantzers (VQ): they are constraned to look at one sample at a tme (memoryless) VQ does not have such constrant better RD perfomance expected Source codng
More informationECE 534: Elements of Information Theory. Solutions to Midterm Exam (Spring 2006)
ECE 534: Elements of Informaton Theory Solutons to Mdterm Eam (Sprng 6) Problem [ pts.] A dscrete memoryless source has an alphabet of three letters,, =,, 3, wth probabltes.4,.4, and., respectvely. (a)
More informationChapter 2 - The Simple Linear Regression Model S =0. e i is a random error. S β2 β. This is a minimization problem. Solution is a calculus exercise.
Chapter - The Smple Lnear Regresson Model The lnear regresson equaton s: where y + = β + β e for =,..., y and are observable varables e s a random error How can an estmaton rule be constructed for the
More informationMARKOV CHAIN AND HIDDEN MARKOV MODEL
MARKOV CHAIN AND HIDDEN MARKOV MODEL JIAN ZHANG JIANZHAN@STAT.PURDUE.EDU Markov chan and hdden Markov mode are probaby the smpest modes whch can be used to mode sequenta data,.e. data sampes whch are not
More informationComposite Hypotheses testing
Composte ypotheses testng In many hypothess testng problems there are many possble dstrbutons that can occur under each of the hypotheses. The output of the source s a set of parameters (ponts n a parameter
More informationClassification as a Regression Problem
Target varable y C C, C,, ; Classfcaton as a Regresson Problem { }, 3 L C K To treat classfcaton as a regresson problem we should transform the target y nto numercal values; The choce of numercal class
More informatione i is a random error
Chapter - The Smple Lnear Regresson Model The lnear regresson equaton s: where + β + β e for,..., and are observable varables e s a random error How can an estmaton rule be constructed for the unknown
More information6 Supplementary Materials
6 Supplementar Materals 61 Proof of Theorem 31 Proof Let m Xt z 1:T : l m Xt X,z 1:t Wethenhave mxt z1:t ˆm HX Xt z 1:T mxt z1:t m HX Xt z 1:T + mxt z 1:T HX We consder each of the two terms n equaton
More informationFinding Dense Subgraphs in G(n, 1/2)
Fndng Dense Subgraphs n Gn, 1/ Atsh Das Sarma 1, Amt Deshpande, and Rav Kannan 1 Georga Insttute of Technology,atsh@cc.gatech.edu Mcrosoft Research-Bangalore,amtdesh,annan@mcrosoft.com Abstract. Fndng
More informationLecture 20: November 7
0-725/36-725: Convex Optmzaton Fall 205 Lecturer: Ryan Tbshran Lecture 20: November 7 Scrbes: Varsha Chnnaobreddy, Joon Sk Km, Lngyao Zhang Note: LaTeX template courtesy of UC Berkeley EECS dept. Dsclamer:
More informationOnline Classification: Perceptron and Winnow
E0 370 Statstcal Learnng Theory Lecture 18 Nov 8, 011 Onlne Classfcaton: Perceptron and Wnnow Lecturer: Shvan Agarwal Scrbe: Shvan Agarwal 1 Introducton In ths lecture we wll start to study the onlne learnng
More informationLimited Dependent Variables
Lmted Dependent Varables. What f the left-hand sde varable s not a contnuous thng spread from mnus nfnty to plus nfnty? That s, gven a model = f (, β, ε, where a. s bounded below at zero, such as wages
More informationThe Geometry of Logit and Probit
The Geometry of Logt and Probt Ths short note s meant as a supplement to Chapters and 3 of Spatal Models of Parlamentary Votng and the notaton and reference to fgures n the text below s to those two chapters.
More informationStanford University CS359G: Graph Partitioning and Expanders Handout 4 Luca Trevisan January 13, 2011
Stanford Unversty CS359G: Graph Parttonng and Expanders Handout 4 Luca Trevsan January 3, 0 Lecture 4 In whch we prove the dffcult drecton of Cheeger s nequalty. As n the past lectures, consder an undrected
More informationRetrieval Models: Language models
CS-590I Informaton Retreval Retreval Models: Language models Luo S Department of Computer Scence Purdue Unversty Introducton to language model Ungram language model Document language model estmaton Maxmum
More informationp 1 c 2 + p 2 c 2 + p 3 c p m c 2
Where to put a faclty? Gven locatons p 1,..., p m n R n of m houses, want to choose a locaton c n R n for the fre staton. Want c to be as close as possble to all the house. We know how to measure dstance
More informationMASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 12 10/21/2013. Martingale Concentration Inequalities and Applications
MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.65/15.070J Fall 013 Lecture 1 10/1/013 Martngale Concentraton Inequaltes and Applcatons Content. 1. Exponental concentraton for martngales wth bounded ncrements.
More informationLecture 21: Numerical methods for pricing American type derivatives
Lecture 21: Numercal methods for prcng Amercan type dervatves Xaoguang Wang STAT 598W Aprl 10th, 2014 (STAT 598W) Lecture 21 1 / 26 Outlne 1 Fnte Dfference Method Explct Method Penalty Method (STAT 598W)
More informationprinceton univ. F 17 cos 521: Advanced Algorithm Design Lecture 7: LP Duality Lecturer: Matt Weinberg
prnceton unv. F 17 cos 521: Advanced Algorthm Desgn Lecture 7: LP Dualty Lecturer: Matt Wenberg Scrbe: LP Dualty s an extremely useful tool for analyzng structural propertes of lnear programs. Whle there
More informationHidden Markov Models
CM229S: Machne Learnng for Bonformatcs Lecture 12-05/05/2016 Hdden Markov Models Lecturer: Srram Sankararaman Scrbe: Akshay Dattatray Shnde Edted by: TBD 1 Introducton For a drected graph G we can wrte
More information3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X
Statstcs 1: Probablty Theory II 37 3 EPECTATION OF SEVERAL RANDOM VARIABLES As n Probablty Theory I, the nterest n most stuatons les not on the actual dstrbuton of a random vector, but rather on a number
More information10-701/ Machine Learning, Fall 2005 Homework 3
10-701/15-781 Machne Learnng, Fall 2005 Homework 3 Out: 10/20/05 Due: begnnng of the class 11/01/05 Instructons Contact questons-10701@autonlaborg for queston Problem 1 Regresson and Cross-valdaton [40
More informationHidden Markov Models
Hdden Markov Models Namrata Vaswan, Iowa State Unversty Aprl 24, 204 Hdden Markov Model Defntons and Examples Defntons:. A hdden Markov model (HMM) refers to a set of hdden states X 0, X,..., X t,...,
More informationSTATS 306B: Unsupervised Learning Spring Lecture 10 April 30
STATS 306B: Unsupervsed Learnng Sprng 2014 Lecture 10 Aprl 30 Lecturer: Lester Mackey Scrbe: Joey Arthur, Rakesh Achanta 10.1 Factor Analyss 10.1.1 Recap Recall the factor analyss (FA) model for lnear
More informationExpected Value and Variance
MATH 38 Expected Value and Varance Dr. Neal, WKU We now shall dscuss how to fnd the average and standard devaton of a random varable X. Expected Value Defnton. The expected value (or average value, or
More information