9 : Learning Partially Observed GM : EM Algorithm

Size: px

Start display at page:

Download "9 : Learning Partially Observed GM : EM Algorithm"

Noah Bell
5 years ago
Views:

1 10-708: Probablstc Graphcal Models , Sprng : Learnng Partally Observed GM : EM Algorthm Lecturer: Erc P. Xng Scrbes: Mrnmaya Sachan, Phan Gadde, Vswanathan Srpradha 1 Introducton So far n the course, we have studed how graphcal models brng the worlds of graph theory and statstcs together, provdng an appealng medum by whch we can model a large sets of varables and ndependences va a data structure that lends tself naturally to the desgn of effcent general-purpose algorthms. We have studed about representaton and nference. Gven the graphcal model, nference tells us how can we use these models to effcently answer probablstc queres. In the prevous lecture, we saw another mportant topc n graphcal models,.e. learnng. Learnng s used to specfy two thngs that descrbe a GM: the graph topology (structure) and the parameters of each local dstrbuton. We also saw a technque for learnng parameters n graphcal models when all the varables n the GM are fully observed called Maxmum Lelhood Estmaton. We assume that the goal of learnng n ths case s to fnd the values of the parameters of each dstrbuton whch maxmze the lelhood of the tranng data. We saw that the log-lelhood functon decomposes accordng to the structure of the graph, and hence we can maxmze the contrbuton to the log-lelhood of each node ndependently (assumng the parameters n each node are ndependent of the other nodes). However, often we may want to model some extra latent varables that we care about and we mght want to learn them from data. The latent varables could be magnary quanttes meant to mae the analyss of the data easer (e.g. n HMM) or they may not exst, but be hard or mpossble to measure. Dscrete latent varables are often used to partton/cluster data nto sub groups and contnuous latent varables (factors) are used for dmensonalty reducton. In ths lecture, we descrbe the expectaton maxmzaton algorthm (EM) whch can be used to learn model parameters n the partally observed settng. 2 Mxture Models We motvate the EM algorthm through the well-nown mxture model example. Consder the problem of clusterng a set of N observed data ponts x 1... x N nto K clusters. One way to approach the clusterng problem s to estmate the margnal densty P (X ). However, ths can n general be hard as P (X ) can be multmodal. However, a mxture model s a way of constranng the form of P (X ) so that the problem becomes tractable and yet have suffcent power to model real world data. A typcal fnte-dmensonal mxture model (shown n Fgure 1) conssts of: N random varables correspondng to the observatons, each assumed to be dstrbuted accordng to a mxture of K components, wth each component belongng to the same parametrc famly of dstrbutons but wth dfferent parameters. 1

2 2 9 : Learnng Partally Observed GM : EM Algorthm Fgure 1: A Graphcal Model representaton of the Gaussan Mxture Model N latent varables specfyng the dentty of the mxture component for each observaton, each dstrbuted accordng to a K-dmensonal categorcal dstrbuton. A set of K multnomally dstrbuted mxture weghts. K sets of parameters, each set specfes the parameters of the correspondng mxture component. The graphcal model allows the factorzaton of the margnal densty as: P (X ) = Z P (X Z )P (Z ) Ths maes the problem tractable as each condtonal dstrbuton P (X Z ) can now be assumed to be unmodal. In the specfc case of a mxture of gaussans, margnal densty s factorzed as: P (X µ, Σ) = π N(X µ, Σ ) In the above equaton, we have:- π - Mxture proporton N(x µ, Σ ) - Gaussan dstrbuton for the mxture component Our tas s to learn the values of π, µ and Σ. Note that f we were worng n a fully observed scenaro, ths would have been relatvely easy. We would have smply decomposed the log lelhood functon and then found the maxmum lelhood estmates for all the parameters. In the fully observed scenaro, the log-lelhood can be wrtten as: l(θ; D) = log = log P (x, z ) P (z π)p(x z, µ, σ ) = = log π z + log N(x µ, σ ) z z log(π ) 1 z (x µ ) T Σ 1 2 (x µ ) + C

3 9 : Learnng Partally Observed GM : EM Algorthm 3 Whle parameters can be easly learned n the completely observed case as the log lelhood decomposes. However, f we do not observe the values of z as n the mxture model settng, all the parameters become coupled together va margnalzaton. l c (θ; D) = log z P (X, z θ) = log z P (z θ z )P (X z, θ x ) Notce the summaton nsde the log. Hence, the log lelhood functon no longer decomposes ncely and hence, t s hard to fnd closed form maxmum lelhood estmates. Ths s why we ntroduce the EM algorthm. 3 K-Means Recall the K-means algorthm. K-means s a clusterng technque that ams to partton n observatons nto K clusters n whch each observaton belongs to the cluster wth the nearest mean resultng n a parttonng of the data space nto Vorono cells. We start wth ntal random guesses for the class centrods. Then we terate between the followng two steps untl convergence (when class centrods stop changng): Usng the guesses for the class centrods, assgn each data pont to the nearest centrod usng some dstance metrc. z (t) = argmn (x µ (t) Usng the new class assgnments, recompute class centrods. )T Σ 1(t) (x µ (t) ) (1) µ (t+1) = δ(z(t), )x δ(z (t), ) (2) Now let s genealze the problem. We would le to thn of each of our clusters as beng assocated wth some dstrbuton e.g. P (x z = ) N(µ, Σ ) and we would le to learn the parameters (µ and Σ ) for each dstrbuton. In ths smple example, we would prove that the soluton descrbed n equaton 1 and 2 s n fact what the EM algorthm for K-means does. 4 EM Algorthm We now ntroduce the EM algorthm. We had seen that the maxmum lelhood estmator cannot be generally used to do nference n general n a partally observed settng. Fndng a maxmum lelhood soluton requres tang dervatves of the lelhood functon wth respect to both the parameters and the latent varables and smultaneously solvng the resultng equatons. In statstcal models wth latent varables, ths usually leads to ntractable solutons. The result s typcally a set of nterlocng equatons n whch the soluton to the parameters requres the values of the latent varables and vce-versa and substtutng one set of equatons nto the other produces an unsolvable system of equatons. The EM algorthm wors around ths by consderng the expected complete log lelhood. The expected complete log lelhood, n the Gaussan mxture model settng can be wrtten as: < z > P (Z X;θ) log(π ) 1 2 < z > P (Z X;θ) ((x µ ) T Σ 1 (x µ ) + log Σ + C) (3)

4 4 9 : Learnng Partally Observed GM : EM Algorthm Here, <... > P (Z X;θ) denotes the expectaton wth respect to P (Z X; θ) The EM algorthm maxmzes < l c (θ; x, z) > P (Z X;θ) by teratng between two steps (called E and M steps). The E step computes the expected value of the suffcent statstcs of the hdden varables (.e., Z) gven the current estmate of the parameters (.e., π and µ) and the M step computes the parameters under the current results of the expected value of the hdden varables. Hence, n the Gaussan mxture model, we compute < z > P (Z X;θ) n the E step and maxmze (3) wth respect to the model parameters π, µ and Σ n the M step to obtan: π t+1 = < n > P (Z X;θ) N µ t+1 = Σ t+1 = τ (t) x τ (t) τ (t) (x µ t+1 )(x µ t+1 τ (t) where τ (t) =< z > P (Z X;θ) and < n > P (Z X;θ) = τ (t). ) T 5 Comparson between K-means and EM Next, we compare the K-means and the EM algorthm for a mxture of gaussans. The EM algorthm for a mxture of gaussans s le a soft verson of the K means algorthm. For K-means, n the E-Step, we assgn ponts to clusters and n the M-step, we recompute the cluster centers assumng each pont belongs to a sngle cluster. In the E-Step of the EM verson, we assgn ponts to clusters n a soft, probablstc manner and n the M-step, we recompute class centers assumng that the ponts are assgned to clusters n a probablstc manner where the weght of each pont s gven by τ (t). 6 Theory underlyng EM Algorthm After the nce ntroducton, we analyze the EM algorthm and the theory underlyng t. We have seen how the EM algorthm can be effectvely used to compute the Maxmum Lelhood (ML) estmate n the presence of unobserved (.e. hdden) data. So, not surprsngly, our nvestgaton began from the lelhood functon. Let X denote the observable varables(s), and Z denote the latent varables(s). Had Z been observed, then we could have defned the complete log lelhood (l c (θ; X, Z) = log ) and proceeded wth ML estmaton. However, n a partally observed case, the lelhood s a random quantty and hence cannot be maxmzed drectly. Wth Z unobserved, we defne the ncomplete log lelhood, whch s the log of a margnal probablty. However, the ncomplete log lelhood s coupled! l c (θ; X) = logp (X θ) = log Z To counter ths, we ntroduce the expected complete log lelhood wth respect to a dstrbuton q(z): < l c (θ; X, Z) > q = z q(z X, θ) log It s easy to see that the expected complete log lelhood s a determnstc functon of θ. It s lnear and nherts the factorzablty of the complete log lelhood. We clam that maxmzng ths surrogate could

5 9 : Learnng Partally Observed GM : EM Algorthm 5 yeld a maxmzer of the lelhood n the partally observed scenaro. Ths s due to a nce relatonshp between the complete lelhood and the expected complete lelhood as derved below: l(θ; X) = log P (X θ) = log Z = log Z Z q(z X) q(z X) q(z X) log q(z X) l(θ; X) < l c (θ; X, Z) > q +H q (4) where, H q s the entropy of random varable Z and does not depend on θ. From the thrd lne to fouth lne, probablty theory verson of the Jensens Inequalty s used. From the dervaton, we see that < l c (θ; X, Z) > q +H q s a lower bound for the orgnal lelhood. Note that H q s constant for a chosen dstrbuton q. Hence, to maxmze the orgnal lelhood, t suffces to maxmze the lower bound (expected complete lelhood). Ths technque s actvely used n Physcs. Next, we descrbe ths beautful connecton. For fxed data X, we defne a functonal called the free energy: F (q, θ) = Z q(z X) log q(z X) l(θ; X) Another way to see the EM algorthm s as a jont maxmzaton procedure that teratvely maxmzes a better and better lower bound F to the true lelhood l(θ; X). Then, the EM algorthm s just a coordnate-ascent method on F gven by the followng EM steps: 1. E step: q t+1 = argmax q F (q, θ t ) 2. M step: θ t+1 = argmax θ F (q t+1, θ t ) Ths jont maxmzaton vew of EM s useful as t leads to many varants of the EM algorthm that use alternatve strateges to maxmze the free energy functon for example, by performng partal maxmzaton n the frst maxmzaton step. Next, we show that ths s exactly the EM algorthm. Frst, we loo at the E-step. Here we want to show that q whch maxmzes F (q, θ t ) s the actually the posteror dstrbuton over latent varables gven the data and the current value of parameters. Ths could be useful for testng (e.g. for classfcaton). In other words: q t+1 = argmax q F (q, θ t ) = P (Z X, θ t ) It can be easly proved that ths settng attans the bound.e. l(θ; x) F (q, ). F (p(z X, θ t ), θ t ) = Z = Z P (Z X, θ t ) log p(x, Z θt ) p(z X, θ t ) q(z X) log p(x θ t ) = log P (X θ t ) = l(θ t ; X) Ths result can be alternatvely shown usng varatonal calculus or the fact that l(θ; X) F (q, θ) = KL(q p(z X, θ)).

6 6 9 : Learnng Partally Observed GM : EM Algorthm Wthout loss of generalty, let us assume that p(x, z θ) s a generalzed exponental famly dstrbuton. Ths leads to: p(x, Z θ) = 1 Z(θ) h(x, Z) exp { θ f (X, Z)} As a specal case, when p(x Z) are GLM s, f (X, Z) = η T (z)ξ (z). The expected complete log lelhood under q t+1 = p(z X, θ t ) s: < l c (θ t ; X, Z) > q t+1 = Z = = q(z X, θ t ) log p(x, Z θ t ) A(θ) θ t < f (X, Z) > q(z X,θt ) A(θ) θ t < η (z) > q(z X,θt ) ξ (x) A(θ) Next, we loo at the M-step. As we saw when we derved equaton 4, we now that the free energy breas nto the followng two terms: F (q, θ) = Z q(z X) log p(x, Z θ) p(z X) = Z q(z X) log p(x, Z θ) Z q(z X) log q(z X) = < l c (θ; X, Z) > q +H q The frst term s the expected complete log lelhood (energy) and the second term, whch does not depend on θ, s the entropy. Thus, agan, n the M-step maxmzng wth respect to θ for a fxed dstrbuton q s equvalent to maxmzng the frst term. θ t+1 = argmax θ < l c (θ; X, Z) > q t+1= argmax θ q(z X) log p(x, Z θ) Under optmal q t+1, ths s equvalent to solvng a standard MLE of fully observed model p(x, Z θ), wth the suffcent statstcs nvolvng Z replaced by ther expectatons w.r.t. p(z X, θ). Z 7 Example: Hdden Marov Model We have seen nference n the Hdden Marov Model n the prevous lectures. Now, we are well equpped to tacle the ssue of learnng n HMMs. Learnng n HMMs can be categorzed nto supervsed and unsupervsed learnng. Supervsed Learnng s the case when the rght answer s nown e.g. a genomc regon x = x 1... x 1,000,000 where we have good (expermental) annotatons of the CpG slands and the casno player who allows us to observe hm one evenng, as he changes dce and produces (say) 10,000 rolls. In ths case, the estmaton can be acheved by ML estmaton as all the varables are fully observed. In the unsupervsed learnng scenaro, we only observe some of the random varables e.g. the porcupne genome when we dont now how frequent are the CpG slands there, nether do we now ther composton and the case when we observe the 10,000 rolls of the casno player, but we dont see when he changes the dce.

7 9 : Learnng Partally Observed GM : EM Algorthm 7 Here, we ntroduce a well now EM algorthm to do learnng n HMMs, the Bomb-Welch algorthm. Recall the complete log-lelhood of HMMs: l c (θ; X, Y ) = logp (X, Y ) = log n The expected complete log lelhood can be wrtten as: T T (p(y n,1 ) p(y n,t y n,t 1 ) p(x n,t y n,t )) t=2 t=1 (< yn,1 > p(yn,1 x n) log π ) + n n T t=2(< y n,t 1y j n,t > p(yn,t 1,y n,t x n)) + n T (x n,t < yn,t > p(yn,t x n) log b, ) t=1 The EM updates can be obtaned as: E Step: M Step: γ n,t = < y n,t >= p(y n,t = 1 X n ) ξ,j n,t = < y n,t 1y j n,t >= p(y n,t 1 = 1, y j n,t = 1 X n ) π ML = n γ n,1 N a ML j = T n t=2 ξj n,t T 1 t=1 γ n,t n b ML = T n t=1 γ n,txn,t T 1 n t=1 γ n,t In the unsupervsed settng, we are gven x = x 1... x N for whch the true state path y = y 1... y N s unnown. We can use the EM algorthm here too. The algorthm s as follows: Start wth the best guess of a model M and parameters θ. Compute the statstcs (A j = n,t < y n,t 1y j n,t > and B j = n,t < y n,t >) from the tranng data. Update θ usng A j and B as computed before, treatng t as a supervsed learnng problem. Repeat untl Convergence. EM algorthm s a very powerful algorthm and s wdely used n practce. Next, we lay down the general technque for EM n Bayesan Networs: 8 Summary of EM algorthm EM algorthm s a way of maxmzng lelhood functon for latent varable models. It fnds MLE of parameters when the orgnal (hard) problem can be broen up nto two (easy) peces: the estmaton of some mssng or unobserved data from observed data and current parameters and usng ths complete data to fnd the maxmum lelhood parameter estmates. The algorthm alternates between fllng n the latent varables usng the best guess (posteror) and updatng the parameters based on ths guess. In the M-step, we optmze a lower bound on the lelhood and n the E-step we close the gap, mang the bound tght.

8 8 9 : Learnng Partally Observed GM : EM Algorthm Algorthm 1: EM for general BNs Whle not converged: % E-step For each node : ESS = 0 % Reset expected suffcent statstcs For each data sample n Do nference wth X n,h For each node : ESS + =< SS (X n,, X n,π ) > p(xn,h X n, H ) % M-step For each node : θ = MLE(ESS ) 9 EM Varants: Next, we dscuss the applcaton of EM to specfc models le Condtonal mxture model: Mxture of experts model whch models p(y X) usng dfferent experts, each responsble for dfferent regons of the nput space, the condtonal mxture model, herarchcal mxture of experts and mxture of overlappng experts models. We pont the readers to the lecture sldes for detals. Fnally, two varatons of the EM algorthm, namely the Sparse EM algorthm and Generalzed (Incomplete) EM were ntroduced. In sparse EM, we do not recompute exactly the posteror probablty on each data pont under all models, because t s almost zero. Instead, we eep an actve lst whch s updated once n a whle. In the Generalzed (Incomplete) EM, we see that t mght be hard to fnd the ML parameters n the M-step, even gven the completed data. However, we can stll mae progress by the completed data by dong an M-step that mproves the lelhood a bt (e.g. gradent step). Ths s n lne wth the IRLS step n the mxture of experts model. 10 EM: Report Card EM s one of the most popular algorthms to perform learnng n a partally observed settng. It does not requre any cumbersome tunng of learnng rate or step-sze parameters as n smlar gradent based methods. The parameter constrants are automatcally enforced. It s very fast for low dmensons and convergence s guaranteed as the lelhood mproves at each teraton. However, there are some ssues wth the EM approach. The EM algorthm can get stuc n local mnmas as the optmzaton can be non-convex n general. It can be slower than conjugate gradent (especally near convergence). It requres expensve nference steps and s a maxmum lelhood/map method. These ssues lead us to see better nference technques le Gbbs Samplng and Varatonal Inference that we wll see n the lectures to follow. Dsclamer: Some portons of the content has been drectly taen from the lecture sldes 1, last year s scrbe notes 2 and Kevn Murphy s ntroducton to GMs and BNs 3. The authors do not clam orgnalty of the scrbe. 1 epxng/class/10708/lecture/lecture9-em.pdf 2 epxng/class/ /lecture/scrbe10.pdf 3 murphy/bayes/bnntro.html

EM and Structure Learning

EM and Structure Learning EM and Structure Learnng Le Song Machne Learnng II: Advanced Topcs CSE 8803ML, Sprng 2012 Partally observed graphcal models Mxture Models N(μ 1, Σ 1 ) Z X N N(μ 2, Σ 2 ) 2 Gaussan mxture model Consder