8 : Learning Partially Observed GM: the EM algorithm
10-708: Probabilistic Graphical Models, Spring
Lecturer: Eric P. Xing
Scribes: Aurick Qiao, Hao Zhang, Bing Liu

1 Introduction

Two fundamental questions in graphical models are learning and inference. Learning is the process of estimating the parameters, and in some cases the topology of the network, from data. We have covered learning methods for completely observed graphical models, both directed and undirected, in previous chapters. The focus of this chapter is on learning of partially observed or unobserved graphical models. Two estimation principles that will be covered are maximal margin and maximum entropy. On the inference side, we previously covered the elimination algorithm and the message-passing algorithm as exact inference techniques. Other exact inference algorithms include the junction tree algorithm, which is less often used these days. Doing exact inference, however, can be expensive, especially in the domain of big data. Thus, much effort has been put into approximate inference techniques, including stochastic simulation, sampling methods, Markov chain Monte Carlo methods, variational algorithms, etc.

2 Mixture Models

Mixture models are models in which the probability distribution of a data point is defined by a weighted sum of conditional likelihoods. We say that a distribution $p$ is a mixture of $K$ component distributions $p_1, p_2, \ldots, p_K$ if

$$p(x) = \sum_k \lambda_k p_k(x)$$

with the $\lambda_k > 0$ being the mixture proportions and $\sum_k \lambda_k = 1$.

2.1 Unobserved Variables

A variable can be unobserved (latent) for a number of reasons. It might be an imaginary quantity meant to provide a simplified and abstract view of the data generation process, as in, for example, speech recognition models. It can also be a real-world object or phenomenon that is difficult or impossible to measure, such as the temperature of a star, the causes of a disease, or evolutionary ancestors. Or it might be a real-world object or phenomenon that was sometimes not measured, due to faulty system components, etc. Latent variables can be either discrete or continuous.
Discrete latent variables can be used to partition or cluster data into sub-groups. Continuous latent variables (factors), on the other hand, can be used for dimensionality reduction (factor analysis, etc.).
2.2 Gaussian Mixture Models

A Gaussian mixture model is a probabilistic model in which we assume all data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. Consider a mixture of $K$ Gaussians,

$$P(X \mid \mu, \Sigma) = \sum_k \pi_k \, \mathcal{N}(X \mid \mu_k, \Sigma_k)$$

where $\pi_k$ is the mixture proportion and $\mathcal{N}(X \mid \mu_k, \Sigma_k)$ is the mixture component. Let's see how this expression is derived. We introduce $Z$ as a latent class indicator vector:

$$P(z_n) = \mathrm{multinomial}(z_n : \pi) = \prod_k (\pi_k)^{z_n^k}$$

We have $X$ as a conditional Gaussian variable with a class-specific mean and covariance:

$$P(x_n \mid z_n^k = 1, \mu, \Sigma) = \frac{1}{(2\pi)^{m/2} |\Sigma_k|^{1/2}} \exp\left\{ -\tfrac{1}{2} (x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k) \right\}$$

Therefore, the likelihood of a sample $x_n$ can be expressed as:

$$P(x_n \mid \mu, \Sigma) = \sum_{z_n} P(z_n, x_n) = \sum_k p(z_n^k = 1 \mid \pi) \, p(x_n \mid z_n^k = 1, \mu, \Sigma) = \sum_{z_n} \prod_k \left[ (\pi_k)^{z_n^k} \mathcal{N}(x_n : \mu_k, \Sigma_k)^{z_n^k} \right] = \sum_k \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$$

Now we want to learn the mixture model. In fully observed i.i.d. settings, the log-likelihood can be decomposed into a sum of local terms (at least for directed models). This can be illustrated as:

$$\ell(\theta; D) = \log p(x, z \mid \theta) = \log p(z \mid \theta_z) + \log p(x \mid z, \theta_x)$$

With latent variables, however, all the parameters become coupled together via marginalization:

$$\ell(\theta; D) = \log \sum_z p(x, z \mid \theta) = \log \sum_z p(z \mid \theta_z) \, p(x \mid z, \theta_x)$$

Observe that the summation is inside the log, so the log-likelihood does not decompose nicely, and it is hard to find a closed-form solution for the maximum likelihood estimates. We then turn to learning the mixture densities using gradient ascent on the log-likelihood, with:

$$\ell(\theta) = \log p(x \mid \theta) = \log \sum_k \pi_k p_k(x \mid \theta_k)$$
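As a concrete instance of the marginal likelihood above, the following sketch (plain Python, univariate components; the function names are ours, not from the notes) evaluates $\log p(D) = \sum_n \log \sum_k \pi_k \mathcal{N}(x_n \mid \mu_k, \sigma_k^2)$:

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a univariate Gaussian N(x | mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixture_log_likelihood(xs, pis, mus, sigmas):
    """log p(D) = sum_n log sum_k pi_k N(x_n | mu_k, sigma_k^2).
    Note the sum over k sits inside the log, as in the text."""
    return sum(math.log(sum(p * normal_pdf(x, m, s)
                            for p, m, s in zip(pis, mus, sigmas)))
               for x in xs)
```

Because the sum over components sits inside the logarithm, this expression does not split into per-component terms, which is exactly why no closed-form MLE exists.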
[Figure 1: Mixture Model Learning with Latent Variables]

Taking the derivative with respect to $\theta_k$ we have:

$$\frac{\partial \ell}{\partial \theta_k} = \frac{1}{p(x \mid \theta)} \frac{\partial}{\partial \theta_k} \sum_{k'} \pi_{k'} p_{k'}(x \mid \theta_{k'}) = \frac{\pi_k}{p(x \mid \theta)} \frac{\partial p_k(x \mid \theta_k)}{\partial \theta_k} = \frac{\pi_k p_k(x \mid \theta_k)}{p(x \mid \theta)} \frac{\partial \log p_k(x \mid \theta_k)}{\partial \theta_k} = r_k \frac{\partial \ell_k}{\partial \theta_k}$$

where $r_k = \pi_k p_k(x \mid \theta_k) / p(x \mid \theta)$ is the responsibility of component $k$. Thus, the gradient is the responsibility-weighted sum of the individual log-likelihood gradients. This can be passed to a conjugate gradient routine.

3 Parameter Constraints

As we can see above, mixture densities with latent variables are not easy to learn. When using a gradient approach because no closed-form solution can be derived, one has less control over the conditions on the parameters. To mitigate this, one can impose constraints on the parameters. Often we have constraints such as $\sum_k \pi_k = 1$, and $\Sigma_k$ being symmetric positive definite (hence $\Sigma_{k,ii} > 0$). We can either use constrained optimization, or we can re-parametrize in terms of unconstrained values. For normalized weights, one can use the softmax transform:

$$\pi_k = \frac{\exp(\gamma_k)}{\sum_j \exp(\gamma_j)}$$

For covariance matrices, one can use the Cholesky decomposition

$$\Sigma^{-1} = A^T A$$

where $A$ is upper triangular with positive diagonal:

$$A_{ii} = \exp(\lambda_i) > 0; \quad A_{ij} = \eta_{ij} \ (j > i); \quad A_{ij} = 0 \ (j < i)$$

where the parameters $\lambda_i$ and $\eta_{ij}$ are unconstrained. One can then use the chain rule to compute $\frac{\partial \ell}{\partial \pi}$ and $\frac{\partial \ell}{\partial A}$.
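Both re-parametrizations can be sketched in a few lines. This is an illustrative 2x2 instance with our own function names; the point is only that any unconstrained $(\gamma, \lambda, \eta)$ map to a valid simplex and a symmetric positive definite precision matrix:

```python
import math

def softmax(gammas):
    """Map unconstrained gammas to mixture weights: positive, summing to 1."""
    m = max(gammas)                      # subtract max for numerical stability
    exps = [math.exp(g - m) for g in gammas]
    z = sum(exps)
    return [e / z for e in exps]

def precision_from_cholesky(lams, etas):
    """Build Sigma^{-1} = A^T A for a 2x2 case, with A upper triangular,
    A_ii = exp(lam_i) > 0 and A_12 = eta (unconstrained)."""
    a11, a22 = math.exp(lams[0]), math.exp(lams[1])
    a12 = etas[0]
    # A = [[a11, a12], [0, a22]];  A^T A is symmetric positive definite
    return [[a11 * a11, a11 * a12],
            [a11 * a12, a12 * a12 + a22 * a22]]
```

A gradient routine can then optimize $(\gamma, \lambda, \eta)$ freely, with the constraints enforced automatically by the transforms.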
4 Identifiability

Identifiability is a property that a model has to satisfy in order to do precise inference. A mixture model induces a multi-modal likelihood; hence, gradient ascent can only find a local maximum. Mixture models are unidentifiable, since we can always switch the hidden labels without affecting the likelihood. Therefore, we should be careful in trying to interpret the meaning of latent variables.

5 Expectation-Maximization Algorithm

We will continue to use the Gaussian mixture model to demonstrate the expectation-maximization algorithm. If we assume that all variables are observed, we can learn the parameters simply by using MLE. The data log-likelihood function is:

$$\ell(\theta; D) = \sum_n \log p(z_n, x_n) = \sum_n \log p(z_n \mid \pi) \, p(x_n \mid z_n, \mu, \sigma)$$
$$= \sum_n \sum_k z_n^k \log \pi_k + \sum_n \sum_k z_n^k \log \mathcal{N}(x_n; \mu_k, \sigma)$$
$$= \sum_n \sum_k z_n^k \log \pi_k - \sum_n \sum_k z_n^k \frac{1}{2\sigma^2} (x_n - \mu_k)^2 + C$$

$C$ is some constant that we do not bother calculating, since maximizing the log-likelihood is independent of any additive constant. Our maximum likelihood estimates of the parameters are then:

$$\hat{\pi}_{k,\mathrm{MLE}} = \arg\max_\pi \ell(\theta; D), \quad \hat{\mu}_{k,\mathrm{MLE}} = \arg\max_\mu \ell(\theta; D), \quad \hat{\sigma}_{k,\mathrm{MLE}} = \arg\max_\sigma \ell(\theta; D)$$

We can easily put these functions into an optimization algorithm, or, in the case of $\hat{\mu}_k$, obtain a simple closed form:

$$\hat{\mu}_{k,\mathrm{MLE}} = \frac{\sum_n z_n^k x_n}{\sum_n z_n^k}$$

However, the above calculations break down in the case of unobserved $z_n$. For example, it is not possible to use the formula for $\hat{\mu}_{k,\mathrm{MLE}}$, since we do not know what values of $z_n^k$ to use.

5.1 K-means

We can obtain some insight into solving this problem by considering a related algorithm: K-means. The K-means algorithm finds clusters of data points as follows:

1. Initialize $K$ points ("centers") randomly.
2. Assign each data point to the closest center.
3. Move each center to the mean of all data points assigned to it.
4. Repeat from 2. until converged.
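The four steps above translate directly into code. A minimal 1-D sketch (our own naming; a production version would use vectors and a convergence test rather than a fixed iteration count):

```python
def kmeans(points, centers, iters=20):
    """Plain 1-D K-means: assign each point to its closest center,
    then move each center to the mean of its assigned points."""
    for _ in range(iters):
        # Step 2: hard assignment to the closest center
        clusters = [[] for _ in centers]
        for x in points:
            k = min(range(len(centers)), key=lambda j: (x - centers[j]) ** 2)
            clusters[k].append(x)
        # Step 3: move each center to the mean of its cluster
        # (keep the old center if a cluster ends up empty)
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return centers
```

The hard 0/1 assignment in step 2 is exactly what EM will later soften into responsibilities.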
We can view the K-means problem as a graphical model, where the hidden variable is the center each data point is assigned to. More formally, and writing the above in a form that is closer to our Gaussian mixture model, we get:

$$z_n^{(t)} = \arg\min_k \, (x_n - \mu_k^{(t)})^T \Sigma_k^{-1(t)} (x_n - \mu_k^{(t)})$$

$$\mu_k^{(t+1)} = \frac{\sum_n \delta(z_n^{(t)}, k) \, x_n}{\sum_n \delta(z_n^{(t)}, k)}$$

Here, $z_n$ are the centers assigned to each data point, and $\mu_k$ are the centers.

5.2 The EM Algorithm

Moving closer to the Gaussian mixture model, what if instead of using the means as centers, we use Gaussian distributions? This way, each center is associated with a covariance matrix as well as a mean. Now, instead of assigning each data point to the closest center, we can do a soft assignment to all centers. For example, a data point $p$ can be assigned $\frac{1}{2}$ to center $a$, $\frac{1}{4}$ to center $b$, and $\frac{1}{4}$ to center $c$. This essentially gives us the EM algorithm for Gaussian mixture models, which iteratively improves the expected complete log-likelihood:

$$\langle \ell_c(\theta; x, z) \rangle = \langle \log p(z \mid \pi) \rangle_{p(z \mid x)} + \langle \log p(x \mid z, \mu, \Sigma) \rangle_{p(z \mid x)}$$
$$= \sum_n \sum_k \langle z_n^k \rangle \log \pi_k - \frac{1}{2} \sum_n \sum_k \langle z_n^k \rangle \left( (x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k) + \log |\Sigma_k| + C \right)$$

The E step. During the expectation step, we compute the expected values of the sufficient statistics of the hidden variables, given the current estimates of the parameters. In the case of Gaussian mixture models, we compute the following:

$$\tau_n^{k(t)} = \langle z_n^k \rangle_{q^{(t)}} = p\left(z_n^k = 1 \mid x_n, \mu^{(t)}, \Sigma^{(t)}\right) = \frac{\pi_k^{(t)} \mathcal{N}\left(x_n \mid \mu_k^{(t)}, \Sigma_k^{(t)}\right)}{\sum_i \pi_i^{(t)} \mathcal{N}\left(x_n \mid \mu_i^{(t)}, \Sigma_i^{(t)}\right)}$$

The E step is essentially performing inference with an estimate of the parameters. In K-means, we calculate the hard assignments $z_n^{(t)}$, while in the EM algorithm, we calculate the expected value of a sufficient statistic $\tau_n^{k(t)}$. In the case of Gaussian mixture models, $\tau_n^{k(t)} = \langle z_n^k \rangle_{q^{(t)}}$.

The M step. During the M step, we estimate the parameters with MLE, using the sufficient statistics of the hidden variables we calculated in the E step. In the case of Gaussian mixture models, the sufficient statistic is the expected value of the hidden variables.
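The E-step responsibility formula can be transcribed directly. A univariate sketch (function names are ours):

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a univariate Gaussian N(x | mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def responsibilities(x, pis, mus, sigmas):
    """E step for one point:
    tau_k = pi_k N(x | mu_k, sigma_k^2) / sum_i pi_i N(x | mu_i, sigma_i^2)."""
    unnorm = [p * normal_pdf(x, m, s) for p, m, s in zip(pis, mus, sigmas)]
    z = sum(unnorm)
    return [u / z for u in unnorm]
```

Each point receives a soft assignment over all components that sums to one, in contrast with K-means' hard assignment.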
$$\pi_k = \arg\max_\pi \langle \ell_c(\theta) \rangle \;\Rightarrow\; \pi_k^{(t+1)} = \frac{\sum_n \langle z_n^k \rangle_{q^{(t)}}}{N} = \frac{\sum_n \tau_n^{k(t)}}{N}$$

$$\mu_k = \arg\max_\mu \langle \ell_c(\theta) \rangle \;\Rightarrow\; \mu_k^{(t+1)} = \frac{\sum_n \tau_n^{k(t)} x_n}{\sum_n \tau_n^{k(t)}}$$
$$\Sigma_k = \arg\max_\Sigma \langle \ell_c(\theta) \rangle \;\Rightarrow\; \Sigma_k^{(t+1)} = \frac{\sum_n \tau_n^{k(t)} \left(x_n - \mu_k^{(t+1)}\right) \left(x_n - \mu_k^{(t+1)}\right)^T}{\sum_n \tau_n^{k(t)}}$$

In K-means, $\mu_k$ is calculated as a weighted sum where each weight is 0 or 1, while in EM, it is a proper weighted sum.

5.3 Theory of the EM Algorithm

Supposing that the latent variables $z$ could be observed, we define the complete log-likelihood as

$$\ell_c(\theta; x, z) = \log p(x, z \mid \theta)$$

With observed $z$, the above can usually be optimized easily: $p(x, z \mid \theta)$ is the product of many factors, so the log-likelihood function decomposes into a sum of many terms, each of which can be optimized separately. However, $z$ is not observed, so the log-likelihood cannot be optimized in this way. The likelihood function we are really trying to optimize is the log of a marginal probability, called the incomplete log-likelihood:

$$\ell(\theta; x) = \log p(x \mid \theta) = \log \sum_z p(x, z \mid \theta)$$

Unfortunately, this objective function has a sum inside of a log, which does not decouple as nicely. Instead, we will optimize the following expected complete log-likelihood:

$$\langle \ell_c(\theta; x, z) \rangle_q = \sum_z q(z \mid x, \theta) \log p(x, z \mid \theta)$$

where $q$ is an arbitrary distribution. Notice now that the $\log p(x, z \mid \theta)$ inside the sum, like the complete log-likelihood function, factors nicely. So the expected complete log-likelihood factors, and is easier to optimize. However, it is not clear whether maximizing this function actually maximizes the incomplete log-likelihood. As a first step, we can easily show that the expected complete log-likelihood is a lower bound for the incomplete log-likelihood:

$$\ell(\theta; x) = \log p(x \mid \theta) = \log \sum_z p(x, z \mid \theta) = \log \sum_z q(z \mid x) \frac{p(x, z \mid \theta)}{q(z \mid x)}$$

Using Jensen's inequality,

$$\geq \sum_z q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)} = \sum_z q(z \mid x) \log p(x, z \mid \theta) - \sum_z q(z \mid x) \log q(z \mid x) = \langle \ell_c(\theta; x, z) \rangle_q + H_q$$

Note that the above is the sum of the expected complete log-likelihood and the entropy of the distribution $q$.
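The lower bound can be checked numerically on a tiny discrete model. The joint table below is an arbitrary illustrative choice (not from the notes): $z$ takes two values, and for the single observed $x$ we tabulate $p(x, z \mid \theta)$. The bound holds for any $q$, with equality when $q(z) = p(z \mid x, \theta)$:

```python
import math

# Toy joint p(x, z | theta) for the observed x (illustrative numbers)
p_joint = {0: 0.3, 1: 0.2}
incomplete_ll = math.log(sum(p_joint.values()))   # l(theta; x) = log sum_z p(x, z)

def lower_bound(q):
    """<l_c>_q + H_q = sum_z q(z) log [p(x, z) / q(z)]  (terms with q(z)=0 vanish)."""
    return sum(q[z] * math.log(p_joint[z] / q[z]) for z in q if q[z] > 0)

# Posterior q(z) = p(x, z) / p(x) attains the bound with equality
posterior = {z: p / sum(p_joint.values()) for z, p in p_joint.items()}
```

Any other $q$ leaves a gap equal to the KL divergence between $q$ and the posterior.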
As Coordinate Ascent

The EM algorithm is also, in some sense, a coordinate ascent algorithm. To see this, define the following functional for fixed data $x$, called the free energy:

$$F(q, \theta) = \sum_z q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)} \leq \ell(\theta; x)$$

The EM algorithm is then

E step: $q^{t+1} = \arg\max_q F(q, \theta^t)$
M step: $\theta^{t+1} = \arg\max_\theta F(q^{t+1}, \theta)$

In performing the E step, we obtain the posterior distribution of the latent variables given the data and the parameters, i.e.

$$q^{t+1} = \arg\max_q F(q, \theta^t) = p(z \mid x, \theta^t)$$

To prove this, we just have to show that this value of $q^{t+1}$ attains the bound $\ell(\theta^t; x) \geq F(q, \theta^t)$:

$$F(p(z \mid x, \theta^t), \theta^t) = \sum_z p(z \mid x, \theta^t) \log \frac{p(x, z \mid \theta^t)}{p(z \mid x, \theta^t)} = \sum_z p(z \mid x, \theta^t) \log p(x \mid \theta^t) = \log p(x \mid \theta^t) = \ell(\theta^t; x)$$

Without loss of generality, assume that $p(x, z \mid \theta)$ is a generalized exponential family distribution, so that

$$p(x, z \mid \theta) = \frac{1}{Z(\theta)} h(x, z) \exp\left\{ \sum_i \theta_i f_i(x, z) \right\}$$

Then the expected complete log-likelihood under $q^{t+1} = p(z \mid x, \theta^t)$ is

$$\langle \ell_c(\theta; x, z) \rangle_{q^{t+1}} = \sum_z q(z \mid x, \theta^t) \log p(x, z \mid \theta) = \sum_i \theta_i \langle f_i(x, z) \rangle_{q(z \mid x, \theta^t)} - A(\theta)$$

In the special case of $p$ being a generalized linear model,

$$= \sum_i \theta_i \langle \eta_i(z) \rangle_{q(z \mid x, \theta^t)} \, \xi_i(x) - A(\theta)$$

Let's now take a look at the M step, and remember that the free energy is the sum of two terms, $F(q, \theta) = \langle \ell_c(\theta; x, z) \rangle_q + H_q$. The second term is constant when maximizing with respect to $\theta$, so we need only consider the first term:

$$\theta^{t+1} = \arg\max_\theta \sum_z q(z \mid x) \log p(x, z \mid \theta)$$

When $q$ is optimal, this is the same as MLE of the fully observed $p(x, z \mid \theta)$, but instead of the sufficient statistics for $z$, their expectations with respect to $p(z \mid x, \theta^t)$ are used.
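Putting the two coordinate-ascent steps together gives a complete, runnable EM loop for a Gaussian mixture. A minimal univariate sketch with two components; the names, iteration count, and the small variance floor (which guards against a component collapsing onto a single point) are our own choices:

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_gmm(xs, pis, mus, sigmas, iters=50):
    """EM for a 1-D Gaussian mixture: the E step computes responsibilities tau,
    the M step re-estimates (pi, mu, sigma) from tau-weighted sums."""
    n, K = len(xs), len(pis)
    for _ in range(iters):
        # E step: tau[i][k] = p(z_k = 1 | x_i) under current parameters
        tau = []
        for x in xs:
            u = [p * normal_pdf(x, m, s) for p, m, s in zip(pis, mus, sigmas)]
            z = sum(u)
            tau.append([v / z for v in u])
        # M step: responsibility-weighted MLEs
        nk = [sum(t[k] for t in tau) for k in range(K)]
        pis = [nk[k] / n for k in range(K)]
        mus = [sum(t[k] * x for t, x in zip(tau, xs)) / nk[k] for k in range(K)]
        sigmas = [max(1e-6, math.sqrt(sum(t[k] * (x - mus[k]) ** 2
                                          for t, x in zip(tau, xs)) / nk[k]))
                  for k in range(K)]
    return pis, mus, sigmas
```

On two well-separated clusters, the means converge to the cluster centers and the mixing weights to the cluster fractions, each iteration improving the likelihood as the theory guarantees.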
6 Example: Hidden Markov Model

With the EM algorithm, it becomes very straightforward for us to solve some simple graphical models with hidden variables. Let's take the Hidden Markov Model (HMM) as an example. Generally, the learning paradigms for an HMM are classified into the following two categories:

Supervised learning: we estimate the model parameters when the label information is provided, i.e., the right answer is known. For example, we learn from a genomic region $x = x_1, x_2, \ldots, x_N$ where we have good annotations of the CpG islands.

Unsupervised learning: we estimate the model without label information. For instance, the porcupine genome: we don't know how frequent the CpG islands are there, nor do we know their composition, but we still want to learn from the data.

Our goal is to learn from the data using maximum likelihood estimation (MLE). Specifically, we update the model parameters $\theta$ by maximizing $p(x \mid \theta)$. We now introduce the Baum-Welch algorithm, which is a variant of the EM algorithm for learning HMMs. Recall that the complete log-likelihood of an HMM is written as follows:

$$\ell_c(\theta; x, y) = \log p(x, y) = \log \prod_n \left( p(y_{n,1}) \prod_{t=2}^{T} p(y_{n,t} \mid y_{n,t-1}) \prod_{t=1}^{T} p(x_{n,t} \mid y_{n,t}) \right) \qquad (1)$$

Its expectation can be further written as:

$$\langle \ell_c(\theta; x, y) \rangle = \sum_n \sum_i \langle y_{n,1}^i \rangle_{p(y_{n,1} \mid x_n)} \log \pi_i + \sum_n \sum_{t=2}^{T} \sum_{i,j} \langle y_{n,t-1}^i y_{n,t}^j \rangle_{p(y_{n,t-1}, y_{n,t} \mid x_n)} \log a_{i,j} + \sum_n \sum_{t=1}^{T} \sum_{i,k} x_{n,t}^k \langle y_{n,t}^i \rangle_{p(y_{n,t} \mid x_n)} \log b_{i,k} \qquad (2)$$

where we denote $A = \{a_{i,j}\} = p(y_t^j = 1 \mid y_{t-1}^i = 1)$ as the transition matrix and $B = \{b_{i,k}\} = p(x_t^k = 1 \mid y_t^i = 1)$ as the emission matrix. Accordingly, we derive the EM algorithm for HMMs as follows.

The E step:

$$\gamma_{n,t}^i = \langle y_{n,t}^i \rangle = p(y_{n,t}^i = 1 \mid x_n), \qquad \xi_{n,t}^{i,j} = \langle y_{n,t-1}^i y_{n,t}^j \rangle = p(y_{n,t-1}^i = 1, y_{n,t}^j = 1 \mid x_n) \qquad (3)$$

The M step:

$$\pi_i^{ML} = \frac{\sum_n \gamma_{n,1}^i}{N}, \qquad a_{ij}^{ML} = \frac{\sum_n \sum_{t=2}^{T} \xi_{n,t}^{i,j}}{\sum_n \sum_{t=1}^{T-1} \gamma_{n,t}^i}, \qquad b_{ik}^{ML} = \frac{\sum_n \sum_{t=1}^{T} \gamma_{n,t}^i x_{n,t}^k}{\sum_n \sum_{t=1}^{T} \gamma_{n,t}^i} \qquad (4)$$
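The M-step formulas in (4) can be transcribed directly once the E-step quantities $\gamma$ and $\xi$ are available (computed by forward-backward, which we do not reimplement here). A sketch for a single sequence; the function name and input layout are our own, and we assume every state has nonzero expected occupancy so all denominators are positive:

```python
def hmm_m_step(gammas, xis, obs, n_states, n_symbols):
    """Baum-Welch M step for one sequence.
    gammas[t][i]  = p(y_t = i | x)               for t = 0..T-1
    xis[t][i][j]  = p(y_t = i, y_{t+1} = j | x)  for t = 0..T-2
    obs[t]        = observed symbol index at time t."""
    T = len(obs)
    pi = list(gammas[0])                                  # initial-state probabilities
    # a_ij: expected i->j transitions over expected visits to i (t < T-1)
    A = [[sum(xis[t][i][j] for t in range(T - 1)) /
          sum(gammas[t][i] for t in range(T - 1))
          for j in range(n_states)] for i in range(n_states)]
    # b_ik: expected emissions of symbol k from state i over expected visits to i
    B = [[sum(gammas[t][i] for t in range(T) if obs[t] == k) /
          sum(gammas[t][i] for t in range(T))
          for k in range(n_symbols)] for i in range(n_states)]
    return pi, A, B
```

Each row of the resulting $A$ and $B$ is a proper probability distribution, since the numerators partition the corresponding denominator.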
In the unsupervised learning setting, we are given $x = x^1, \ldots, x^N$ while the state paths $y = y^1, \ldots, y^N$ are not known. The EM algorithm proceeds accordingly:

1. Start with the best guess of the parameters $\theta$ for the model.
2. Estimate the expected transition and emission counts $A_{ij}$, $B_{ik}$ in the training data:

$$A_{ij} = \sum_{n,t} \langle y_{n,t-1}^i y_{n,t}^j \rangle, \qquad B_{ik} = \sum_{n,t} \langle y_{n,t}^i \rangle x_{n,t}^k \qquad (5)$$

3. Update $\theta$ according to $A_{ij}$, $B_{ik}$. The problem now becomes supervised learning.
4. Repeat steps 2 and 3 until convergence.

The above algorithm is called the Baum-Welch algorithm.

7 EM for general BNs

Now we demonstrate the EM algorithm for general Bayesian networks:

    while not converged:
        for each node i:
            ESS_i = 0
        for each data sample n:
            do inference with x_{n,H}        # fill in the hidden variables
            for each node i:
                ESS_i += <SS_i(x_{n,i}, x_{n,pi_i})>_{p(x_{n,H} | x_n)}
        for each node i:
            theta_i = MLE(ESS_i)

8 Summary: EM algorithm

In summary, the EM algorithm can be regarded as a way of maximizing the likelihood function for latent variable models. It finds the MLE of the parameters when the original (hard) problem can be broken up into two (easy) pieces:

1. Estimate some missing or unobserved data from the observed data and current parameters.
2. Using this "complete" data, find the maximum likelihood parameter estimates.

The EM algorithm alternates between filling in the latent variables using the best guess and updating the parameters based on this guess. Specifically, in the M-step we optimize a lower bound on the likelihood. In the E-step we close the gap, making the bound equal to the likelihood.
9 Conditional Mixture Model and EM Algorithm

Consider the following situation: we will model $p(Y \mid X)$ using different experts, each responsible for different regions of the input space. We have a latent variable $Z$, which chooses an expert using a softmax gating function:

$$p(z^k = 1 \mid x) = \mathrm{Softmax}(\xi_k^T x) \qquad (6)$$

Each expert can be a linear regression model (or a more complex model):

$$p(y \mid x, z^k = 1) = \mathcal{N}(y; \theta_k^T x, \sigma_k^2) \qquad (7)$$

The posterior expert responsibilities are:

$$p(z^k = 1 \mid x, y, \theta) = \frac{p(z^k = 1 \mid x) \, p_k(y \mid x, \theta_k, \sigma_k^2)}{\sum_j p(z^j = 1 \mid x) \, p_j(y \mid x, \theta_j, \sigma_j^2)} \qquad (8)$$

Now we derive the EM algorithm for the conditional mixture model. The objective function is as follows:

$$\langle \ell_c(\theta; x, y, z) \rangle = \langle \log p(z \mid x, \xi) \rangle_{p(z \mid x, y)} + \langle \log p(y \mid x, z, \theta, \sigma) \rangle_{p(z \mid x, y)}$$
$$= \sum_n \sum_k \langle z_n^k \rangle \log \mathrm{Softmax}_k(\xi^T x_n) - \frac{1}{2} \sum_n \sum_k \langle z_n^k \rangle \left( \frac{(y_n - \theta_k^T x_n)^2}{\sigma_k^2} + \log \sigma_k^2 + C \right) \qquad (9)$$

The EM algorithm:

E-step:

$$\tau_n^{k(t)} = p(z_n^k = 1 \mid x_n, y_n, \theta) = \frac{p(z_n^k = 1 \mid x_n) \, p_k(y_n \mid x_n, \theta_k, \sigma_k^2)}{\sum_j p(z_n^j = 1 \mid x_n) \, p_j(y_n \mid x_n, \theta_j, \sigma_j^2)} \qquad (10)$$

M-step: we use the normal equations for standard linear regression, $\theta = (X^T X)^{-1} X^T Y$, but with the data reweighted by $\tau$. We can also use IRLS to update $\{\xi_k, \theta_k, \sigma_k\}$ based on the data pairs $(x_n, y_n)$, with weights $\tau_n^{k(t)}$.

10 Hierarchical Mixture of Experts

The hierarchical mixture of experts is like a soft version of a depth-2 classification/regression tree. In the model, $P(Y \mid X, G_1, G_2)$ can be modeled as a GLIM, with parameters dependent on the values of $G_1$ and $G_2$, which specify a conditional path to a given leaf in the tree.

11 Mixture of Overlapping Experts

Different from the typical mixture of experts, by removing the link between $X$ and $Z$ in the graphical model of the mixture of experts, we can make the partitions independent of the input, thus allowing overlapping. Accordingly, this leads to a mixture of linear regressors, in which each subpopulation has a different conditional mean:

$$p(z^k = 1 \mid x, y, \theta) = \frac{p(z^k = 1) \, p_k(y \mid x, \theta_k, \sigma_k^2)}{\sum_j p(z^j = 1) \, p_j(y \mid x, \theta_j, \sigma_j^2)} \qquad (11)$$
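The reweighted normal equations of the M-step can be sketched in one dimension (a hypothetical no-intercept model $y \approx \theta x$; the function name is ours, and a real implementation would solve the full matrix system):

```python
def weighted_theta(xs, ys, ws):
    """Responsibility-weighted normal equations for a 1-D linear expert
    y ~ theta * x:  theta = (sum_n w_n x_n^2)^{-1} (sum_n w_n x_n y_n)."""
    sxx = sum(w * x * x for w, x in zip(ws, xs))
    sxy = sum(w * x * y for w, x, y in zip(ws, xs, ys))
    return sxy / sxx
```

With uniform weights this is ordinary least squares; with the E-step responsibilities $\tau_n^{k(t)}$ as weights, each expert $k$ fits only the data it is responsible for.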
12 Partially Hidden Data

We can also learn the model parameters when there are missing (hidden) variables in only some of the cases. In this case, the cost function becomes:

$$\ell(\theta; D) = \sum_{n \in \text{Complete}} \log p(x_n, y_n \mid \theta) + \sum_{m \in \text{Missing}} \log \sum_y p(x_m, y \mid \theta) \qquad (12)$$

Now we can think of this in a new way: in the E-step we estimate the hidden variables on the incomplete cases only. Then, in the M-step we optimize the log-likelihood on the complete data plus the expected log-likelihood on the incomplete data from the E-step.

13 EM Variants

In this section, we briefly introduce two variants of the basic EM algorithm.

The first one is sparse EM. It does not exactly re-compute the posterior probability on each data point under all models, because it is almost zero for most of them. Instead, it keeps an "active list" which is updated every once in a while.

The second one is generalized (incomplete) EM. Sometimes we may find it very difficult to find the ML parameters in the M-step, even given the completed data. However, we can still make progress by doing an M-step that improves the likelihood a bit. This is in the line of the IRLS step in the mixture-of-experts model.

14 A Report Card for EM

EM is one of the most popular methods to obtain the MLE or MAP estimate for a model with latent variables. The advantages of EM are as follows:

- no learning rate (step-size) parameter
- automatically enforces parameter constraints
- very fast for low dimensions
- each iteration guaranteed to improve the likelihood

The disadvantages of EM are:

- can get stuck in local optima
- can be slower than conjugate gradient (especially near convergence)
- requires an expensive inference step
- is a maximum likelihood/MAP method

There are also modern developments beyond traditional EM, such as convex relaxation and direct non-convex optimization.
More informationREGRESSION (Physics 1210 Notes, Partial Modified Appendix A)
REGRESSION (Physics 0 Notes, Partial Modified Appedix A) HOW TO PERFORM A LINEAR REGRESSION Cosider the followig data poits ad their graph (Table I ad Figure ): X Y 0 3 5 3 7 4 9 5 Table : Example Data
More informationMachine Learning for Data Science (CS 4786)
Machie Learig for Data Sciece CS 4786) Lecture 9: Pricipal Compoet Aalysis The text i black outlies mai ideas to retai from the lecture. The text i blue give a deeper uderstadig of how we derive or get
More informationADVANCED SOFTWARE ENGINEERING
ADVANCED SOFTWARE ENGINEERING COMP 3705 Exercise Usage-based Testig ad Reliability Versio 1.0-040406 Departmet of Computer Ssciece Sada Narayaappa, Aeliese Adrews Versio 1.1-050405 Departmet of Commuicatio
More informationStatistical Inference (Chapter 10) Statistical inference = learn about a population based on the information provided by a sample.
Statistical Iferece (Chapter 10) Statistical iferece = lear about a populatio based o the iformatio provided by a sample. Populatio: The set of all values of a radom variable X of iterest. Characterized
More informationElement sampling: Part 2
Chapter 4 Elemet samplig: Part 2 4.1 Itroductio We ow cosider uequal probability samplig desigs which is very popular i practice. I the uequal probability samplig, we ca improve the efficiecy of the resultig
More informationDistributional Similarity Models (cont.)
Distributioal Similarity Models (cot.) Regia Barzilay EECS Departmet MIT October 19, 2004 Sematic Similarity Vector Space Model Similarity Measures cosie Euclidea distace... Clusterig k-meas hierarchical
More informationDistributional Similarity Models (cont.)
Sematic Similarity Vector Space Model Similarity Measures cosie Euclidea distace... Clusterig k-meas hierarchical Last Time EM Clusterig Soft versio of K-meas clusterig Iput: m dimesioal objects X = {
More informationChapter 2: Numerical Methods
Chapter : Numerical Methods. Some Numerical Methods for st Order ODEs I this sectio, a summar of essetial features of umerical methods related to solutios of ordiar differetial equatios is give. I geeral,
More informationStatistical Inference Based on Extremum Estimators
T. Rotheberg Fall, 2007 Statistical Iferece Based o Extremum Estimators Itroductio Suppose 0, the true value of a p-dimesioal parameter, is kow to lie i some subset S R p : Ofte we choose to estimate 0
More informationLecture 2 October 11
Itroductio to probabilistic graphical models 203/204 Lecture 2 October Lecturer: Guillaume Oboziski Scribes: Aymeric Reshef, Claire Verade Course webpage: http://www.di.es.fr/~fbach/courses/fall203/ 2.
More informationLecture 8: Solving the Heat, Laplace and Wave equations using finite difference methods
Itroductory lecture otes o Partial Differetial Equatios - c Athoy Peirce. Not to be copied, used, or revised without explicit writte permissio from the copyright ower. 1 Lecture 8: Solvig the Heat, Laplace
More informationChapter 2 The Monte Carlo Method
Chapter 2 The Mote Carlo Method The Mote Carlo Method stads for a broad class of computatioal algorithms that rely o radom sampligs. It is ofte used i physical ad mathematical problems ad is most useful
More informationRecursive Algorithms. Recurrences. Recursive Algorithms Analysis
Recursive Algorithms Recurreces Computer Sciece & Egieerig 35: Discrete Mathematics Christopher M Bourke cbourke@cseuledu A recursive algorithm is oe i which objects are defied i terms of other objects
More informationSupport vector machine revisited
6.867 Machie learig, lecture 8 (Jaakkola) 1 Lecture topics: Support vector machie ad kerels Kerel optimizatio, selectio Support vector machie revisited Our task here is to first tur the support vector
More informationKinetics of Complex Reactions
Kietics of Complex Reactios by Flick Colema Departmet of Chemistry Wellesley College Wellesley MA 28 wcolema@wellesley.edu Copyright Flick Colema 996. All rights reserved. You are welcome to use this documet
More informationThis exam contains 19 pages (including this cover page) and 10 questions. A Formulae sheet is provided with the exam.
Probability ad Statistics FS 07 Secod Sessio Exam 09.0.08 Time Limit: 80 Miutes Name: Studet ID: This exam cotais 9 pages (icludig this cover page) ad 0 questios. A Formulae sheet is provided with the
More informationCS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 5
CS434a/54a: Patter Recogitio Prof. Olga Veksler Lecture 5 Today Itroductio to parameter estimatio Two methods for parameter estimatio Maimum Likelihood Estimatio Bayesia Estimatio Itroducto Bayesia Decisio
More informationRoberto s Notes on Series Chapter 2: Convergence tests Section 7. Alternating series
Roberto s Notes o Series Chapter 2: Covergece tests Sectio 7 Alteratig series What you eed to kow already: All basic covergece tests for evetually positive series. What you ca lear here: A test for series
More informationLecture 11 and 12: Basic estimation theory
Lecture ad 2: Basic estimatio theory Sprig 202 - EE 94 Networked estimatio ad cotrol Prof. Kha March 2 202 I. MAXIMUM-LIKELIHOOD ESTIMATORS The maximum likelihood priciple is deceptively simple. Louis
More information1 Covariance Estimation
Eco 75 Lecture 5 Covariace Estimatio ad Optimal Weightig Matrices I this lecture, we cosider estimatio of the asymptotic covariace matrix B B of the extremum estimator b : Covariace Estimatio Lemma 4.
More informationDifferentiable Convex Functions
Differetiable Covex Fuctios The followig picture motivates Theorem 11. f ( x) f ( x) f '( x)( x x) ˆx x 1 Theorem 11 : Let f : R R be differetiable. The, f is covex o the covex set C R if, ad oly if for
More informationMachine Learning Brett Bernstein
Machie Learig Brett Berstei Week Lecture: Cocept Check Exercises Starred problems are optioal. Statistical Learig Theory. Suppose A = Y = R ad X is some other set. Furthermore, assume P X Y is a discrete
More informationREGRESSION WITH QUADRATIC LOSS
REGRESSION WITH QUADRATIC LOSS MAXIM RAGINSKY Regressio with quadratic loss is aother basic problem studied i statistical learig theory. We have a radom couple Z = X, Y ), where, as before, X is a R d
More information1.010 Uncertainty in Engineering Fall 2008
MIT OpeCourseWare http://ocw.mit.edu.00 Ucertaity i Egieerig Fall 2008 For iformatio about citig these materials or our Terms of Use, visit: http://ocw.mit.edu.terms. .00 - Brief Notes # 9 Poit ad Iterval
More informationLecture 7: October 18, 2017
Iformatio ad Codig Theory Autum 207 Lecturer: Madhur Tulsiai Lecture 7: October 8, 207 Biary hypothesis testig I this lecture, we apply the tools developed i the past few lectures to uderstad the problem
More information6.883: Online Methods in Machine Learning Alexander Rakhlin
6.883: Olie Methods i Machie Learig Alexader Rakhli LECTURES 5 AND 6. THE EXPERTS SETTING. EXPONENTIAL WEIGHTS All the algorithms preseted so far halluciate the future values as radom draws ad the perform
More informationLecture 9: September 19
36-700: Probability ad Mathematical Statistics I Fall 206 Lecturer: Siva Balakrisha Lecture 9: September 9 9. Review ad Outlie Last class we discussed: Statistical estimatio broadly Pot estimatio Bias-Variace
More information10/2/ , 5.9, Jacob Hays Amit Pillay James DeFelice
0//008 Liear Discrimiat Fuctios Jacob Hays Amit Pillay James DeFelice 5.8, 5.9, 5. Miimum Squared Error Previous methods oly worked o liear separable cases, by lookig at misclassified samples to correct
More informationDS 100: Principles and Techniques of Data Science Date: April 13, Discussion #10
DS 00: Priciples ad Techiques of Data Sciece Date: April 3, 208 Name: Hypothesis Testig Discussio #0. Defie these terms below as they relate to hypothesis testig. a) Data Geeratio Model: Solutio: A set
More informationECE 8527: Introduction to Machine Learning and Pattern Recognition Midterm # 1. Vaishali Amin Fall, 2015
ECE 8527: Itroductio to Machie Learig ad Patter Recogitio Midterm # 1 Vaishali Ami Fall, 2015 tue39624@temple.edu Problem No. 1: Cosider a two-class discrete distributio problem: ω 1 :{[0,0], [2,0], [2,2],
More informationVector Quantization: a Limiting Case of EM
. Itroductio & defiitios Assume that you are give a data set X = { x j }, j { 2,,, }, of d -dimesioal vectors. The vector quatizatio (VQ) problem requires that we fid a set of prototype vectors Z = { z
More informationMachine Learning Theory Tübingen University, WS 2016/2017 Lecture 12
Machie Learig Theory Tübige Uiversity, WS 06/07 Lecture Tolstikhi Ilya Abstract I this lecture we derive risk bouds for kerel methods. We will start by showig that Soft Margi kerel SVM correspods to miimizig
More informationNUMERICAL METHODS FOR SOLVING EQUATIONS
Mathematics Revisio Guides Numerical Methods for Solvig Equatios Page 1 of 11 M.K. HOME TUITION Mathematics Revisio Guides Level: GCSE Higher Tier NUMERICAL METHODS FOR SOLVING EQUATIONS Versio:. Date:
More informationCS322: Network Analysis. Problem Set 2 - Fall 2009
Due October 9 009 i class CS3: Network Aalysis Problem Set - Fall 009 If you have ay questios regardig the problems set, sed a email to the course assistats: simlac@staford.edu ad peleato@staford.edu.
More informationAlgebra of Least Squares
October 19, 2018 Algebra of Least Squares Geometry of Least Squares Recall that out data is like a table [Y X] where Y collects observatios o the depedet variable Y ad X collects observatios o the k-dimesioal
More informationCHAPTER I: Vector Spaces
CHAPTER I: Vector Spaces Sectio 1: Itroductio ad Examples This first chapter is largely a review of topics you probably saw i your liear algebra course. So why cover it? (1) Not everyoe remembers everythig
More informationLecture 7: Properties of Random Samples
Lecture 7: Properties of Radom Samples 1 Cotiued From Last Class Theorem 1.1. Let X 1, X,...X be a radom sample from a populatio with mea µ ad variace σ
More information17. Joint distributions of extreme order statistics Lehmann 5.1; Ferguson 15
17. Joit distributios of extreme order statistics Lehma 5.1; Ferguso 15 I Example 10., we derived the asymptotic distributio of the maximum from a radom sample from a uiform distributio. We did this usig
More informationLecture 22: Review for Exam 2. 1 Basic Model Assumptions (without Gaussian Noise)
Lecture 22: Review for Exam 2 Basic Model Assumptios (without Gaussia Noise) We model oe cotiuous respose variable Y, as a liear fuctio of p umerical predictors, plus oise: Y = β 0 + β X +... β p X p +
More informationMaximum Likelihood Estimation
Chapter 9 Maximum Likelihood Estimatio 9.1 The Likelihood Fuctio The maximum likelihood estimator is the most widely used estimatio method. This chapter discusses the most importat cocepts behid maximum
More informationHomework Set #3 - Solutions
EE 15 - Applicatios of Covex Optimizatio i Sigal Processig ad Commuicatios Dr. Adre Tkaceko JPL Third Term 11-1 Homework Set #3 - Solutios 1. a) Note that x is closer to x tha to x l i the Euclidea orm
More informationChapter 8: Estimating with Confidence
Chapter 8: Estimatig with Cofidece Sectio 8.2 The Practice of Statistics, 4 th editio For AP* STARNES, YATES, MOORE Chapter 8 Estimatig with Cofidece 8.1 Cofidece Itervals: The Basics 8.2 8.3 Estimatig
More informationx a x a Lecture 2 Series (See Chapter 1 in Boas)
Lecture Series (See Chapter i Boas) A basic ad very powerful (if pedestria, recall we are lazy AD smart) way to solve ay differetial (or itegral) equatio is via a series expasio of the correspodig solutio
More informationSimulation. Two Rule For Inverting A Distribution Function
Simulatio Two Rule For Ivertig A Distributio Fuctio Rule 1. If F(x) = u is costat o a iterval [x 1, x 2 ), the the uiform value u is mapped oto x 2 through the iversio process. Rule 2. If there is a jump
More information