The Expectation-Maximization Algorithm


Charles Elkan
November 16, 2007

This chapter explains the EM algorithm at multiple levels of generality. Section 1 gives the standard high-level version of the algorithm. Section 2 then extends this explanation to make EM applicable to problems with many training examples. Next, Section 3 explains how EM can be used for fitting a mixture of arbitrary component distributions. Finally, Section 4 explains how the EM algorithm can be viewed as a double maximization, and Section 5 explains Jensen's inequality, the basic mathematical fact that underlies all versions of the EM algorithm.

At the lowest, most concrete level, there are different EM algorithms for fitting many particular probabilistic models; a mixture of Gaussians is just one example. The previous chapter describes EM at this lowest level. At an intermediate level, EM for any mixture model involves an E-step that computes degrees of membership, and an M-step that does weighted maximum likelihood; this level is the topic of Section 3 below. More abstractly, EM is an iterative method for maximizing likelihood; this level is explained in Section 1. At an even higher level, EM involves two maximizations; this point of view is explained in Section 4.

Many other tutorial explanations of expectation-maximization exist, including [Bil98, Del02, Bor04]. Three are especially recommended: [Min98, Rus98, Roc07].

1 The general EM algorithm

To simplify notation, assume initially that the entire training data constitute one outcome $x$ of a random variable $X$. Also let $\theta$ be all the parameters of the model $p(x; \theta)$. The goal, according to the principle of maximum likelihood, is to choose $\theta$ to maximize the likelihood function, which is $L(\theta; x) = p(x; \theta)$.

Let $Z$ be any discrete auxiliary random variable whose distribution, like that of $X$, is a function of $\theta$. Let $z$ range over the possible outcomes of $Z$ and note that by definition $p(x; \theta) = \sum_z p(x, z; \theta)$. Suppose we have a current estimate $\theta_t$ for the parameters. Multiplying inside this sum by $p(z \mid x; \theta_t)/p(z \mid x; \theta_t)$ gives that the log likelihood is

$$D = \log p(x; \theta) = \log \sum_z p(x, z; \theta)\,\frac{p(z \mid x; \theta_t)}{p(z \mid x; \theta_t)}.$$

Note that $\sum_z p(z \mid x; \theta_t) = 1$ and $p(z \mid x; \theta_t) \ge 0$ for all $z$. Therefore $D$ is the logarithm of a weighted sum, so we can apply Jensen's inequality, which says $\log \sum_j w_j v_j \ge \sum_j w_j \log v_j$, given $\sum_j w_j = 1$ and each $w_j \ge 0$. Here, we let the sum range over the values $z$ of $Z$, with the weight $w_j$ being $p(z \mid x; \theta_t)$. We get

$$D \ge E = \sum_z p(z \mid x; \theta_t) \log \frac{p(x, z; \theta)}{p(z \mid x; \theta_t)}.$$

Separating the fraction inside the logarithm to obtain two sums gives

$$E = \Big( \sum_z p(z \mid x; \theta_t) \log p(x, z; \theta) \Big) - \Big( \sum_z p(z \mid x; \theta_t) \log p(z \mid x; \theta_t) \Big).$$

Since $E \le D$ and we want to maximize $D$, consider maximizing $E$. The weights $p(z \mid x; \theta_t)$ do not depend on $\theta$, so we only need to maximize the first sum, which is

$$\sum_z p(z \mid x; \theta_t) \log p(x, z; \theta).$$

In general, the E-step of an EM algorithm is to compute $p(z \mid x; \theta_t)$ for all $z$. The M-step is then to find $\theta$ to maximize $\sum_z p(z \mid x; \theta_t) \log p(x, z; \theta)$.

How do we know that maximizing $E$ actually leads to an improvement in the likelihood? With $\theta = \theta_t$,

$$E = \sum_z p(z \mid x; \theta_t) \log \frac{p(x, z; \theta_t)}{p(z \mid x; \theta_t)} = \sum_z p(z \mid x; \theta_t) \log p(x; \theta_t) = \log p(x; \theta_t)$$

which is the log likelihood at $\theta_t$. Since $D \ge E$ for every $\theta$, any $\theta$ that maximizes $E$ must lead to a likelihood that is at least as high as the likelihood at $\theta_t$.
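The bound derived in this section can be checked numerically on a very small model. The sketch below is only an illustration under stated assumptions: the binary auxiliary variable, the fixed probabilities in `P_X_GIVEN_Z`, and all function names are hypothetical and do not come from the text. It verifies that E never exceeds D on a grid of parameter values, that the two coincide at the current estimate, and that one EM update does not lower the likelihood.

```python
import math

# Hypothetical toy model (an assumption for illustration): Z is binary with
# p(Z=1; theta) = theta, and p(x | z) is fixed for the one observed outcome x.
P_X_GIVEN_Z = (0.2, 0.7)

def joint(z, theta):
    """p(x, z; theta) for the observed x."""
    prior = theta if z == 1 else 1.0 - theta
    return prior * P_X_GIVEN_Z[z]

def log_likelihood(theta):
    """D(theta) = log sum_z p(x, z; theta)."""
    return math.log(joint(0, theta) + joint(1, theta))

def posterior(theta_t):
    """E-step weights p(z | x; theta_t) for z = 0, 1."""
    total = joint(0, theta_t) + joint(1, theta_t)
    return [joint(z, theta_t) / total for z in (0, 1)]

def lower_bound(theta, theta_t):
    """E(theta) = sum_z p(z|x; theta_t) * log(p(x, z; theta) / p(z|x; theta_t))."""
    w = posterior(theta_t)
    return sum(w[z] * math.log(joint(z, theta) / w[z]) for z in (0, 1))

theta_t = 0.3
grid = [i / 100 for i in range(1, 100)]  # stay inside (0, 1) to avoid log(0)

# Jensen's inequality: E(theta) <= D(theta) everywhere on the grid.
bound_holds = all(lower_bound(th, theta_t) <= log_likelihood(th) + 1e-12 for th in grid)

# The bound is tight at theta = theta_t.
gap_at_theta_t = log_likelihood(theta_t) - lower_bound(theta_t, theta_t)

# For this toy model the M-step has a closed form: maximizing
# w_0 * log(1 - theta) + w_1 * log(theta) over theta gives theta = w_1.
theta_next = posterior(theta_t)[1]

print(bound_holds, gap_at_theta_t, log_likelihood(theta_t), log_likelihood(theta_next))
```

Because the bound touches the log likelihood at the current estimate and lies below it elsewhere, raising the bound cannot lower the likelihood, which is the argument made at the end of the section.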

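2 EM with independent training examples

The EM algorithm derived above can be extended to the case where we have a training set $\{x_1, \ldots, x_n\}$ such that each $x_i$ is independent. In this case the log likelihood is $D = \sum_i \log p(x_i; \theta)$. Let the auxiliary random variables be a set $\{Z_1, \ldots, Z_n\}$ such that the distribution of each $Z_i$ is a function only of $x_i$ and $\theta$. By an argument similar to above,

$$D = \sum_i \log \sum_{z_i} p(x_i, z_i; \theta)\,\frac{p(z_i \mid x_i; \theta_t)}{p(z_i \mid x_i; \theta_t)}.$$

Using Jensen's inequality separately for each $i$ gives

$$D \ge E = \sum_i \sum_{z_i} p(z_i \mid x_i; \theta_t) \log \frac{p(x_i, z_i; \theta)}{p(z_i \mid x_i; \theta_t)}.$$

As before, to maximize $E$ we want to maximize the sum

$$\sum_i \sum_{z_i} p(z_i \mid x_i; \theta_t) \log p(x_i, z_i; \theta).$$

The E-step is to compute $p(z_i \mid x_i; \theta_t)$ for all $z_i$ for each $i$. The M-step is then to find

$$\theta_{t+1} = \operatorname{argmax}_\theta \sum_i \sum_{z_i} p(z_i \mid x_i; \theta_t) \log p(x_i, z_i; \theta).$$

3 EM for mixture models

For a mixture model with $K$ components each $z_i$ is between 1 and $K$. The sum to maximize is

$$\sum_i \sum_{k=1}^{K} p(k \mid x_i; \theta_t)\,[\log p(k; \theta) + \log p(x_i \mid k; \theta)].$$

Using the mixture model notation from above, we have $p(k; \theta) = \alpha_k$ and $p(x_i \mid k; \theta) = f(x_i; \lambda_k)$. The sum to maximize is then

$$E = \sum_i \sum_{k=1}^{K} p(k \mid x_i; \theta_t)\,[\log \alpha_k + \log f(x_i; \lambda_k)].$$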
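The per-example weights and the double sum above can be evaluated directly for a small mixture. The following is a minimal sketch under stated assumptions: the Bernoulli component density, the function names, and the data are hypothetical and chosen only for illustration, and the weights are obtained from the definition of conditional probability, p(k | x_i) = p(x_i, k) / p(x_i). The component density is passed in as a function because the derivation does not depend on its form.

```python
import math

def bernoulli(x, lam):
    """Hypothetical component density f(x; lam) for x in {0, 1} (an assumption)."""
    return lam if x == 1 else 1.0 - lam

def responsibilities(xs, alphas, lams, f):
    """Return w[i][k] = p(k | x_i; theta_t) for every example i and component k."""
    rows = []
    for x in xs:
        joint = [a * f(x, lam) for a, lam in zip(alphas, lams)]  # p(x_i, k; theta_t)
        total = sum(joint)                                        # p(x_i; theta_t)
        rows.append([j / total for j in joint])
    return rows

def objective(xs, w, alphas, lams, f):
    """sum_i sum_k w[i][k] * (log alpha_k + log f(x_i; lambda_k))."""
    return sum(
        w[i][k] * (math.log(alphas[k]) + math.log(f(x, lams[k])))
        for i, x in enumerate(xs)
        for k in range(len(alphas))
    )

xs = [1, 1, 0, 1, 0, 0, 1, 1]             # illustrative data, not from the text
theta_t = ([0.5, 0.5], [0.3, 0.8])        # (alphas, lambdas) at iteration t
candidate = ([0.4, 0.6], [0.25, 0.9])     # some other parameter setting to score

# The weights are computed once from theta_t and then held fixed
# while different candidate parameters are scored.
w = responsibilities(xs, *theta_t, bernoulli)
score_at_theta_t = objective(xs, w, *theta_t, bernoulli)
score_at_candidate = objective(xs, w, *candidate, bernoulli)
print(score_at_theta_t, score_at_candidate)
```

Note that the weights depend only on the current estimate, while the objective is a function of the candidate parameters; this separation is what allows the M-step to be an ordinary maximization.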

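For the E-step we use Bayes' rule:

$$w_{ik} = p(k \mid x_i; \theta_t) = \frac{f(x_i; \lambda_k)\,\alpha_k}{\sum_j f(x_i; \lambda_j)\,\alpha_j},$$

where the $\alpha_j$ and $\lambda_j$ on the right-hand side are the current estimates contained in $\theta_t$.

For the M-step, the two terms inside the square brackets in $E$ involve disjoint sets of parameters, so we can do two separate maximizations. The first one is to maximize

$$\sum_i \sum_k w_{ik} \log \alpha_k = \sum_k c_k \log \alpha_k \quad \text{where } c_k = \sum_i w_{ik},$$

subject to the constraint $\sum_k \alpha_k = 1$. Using a Lagrange multiplier, one can show that the solution is

$$\alpha_k = \frac{c_k}{\sum_j c_j}.$$

The second one is to maximize $\sum_i \sum_k w_{ik} \log f(x_i; \lambda_k)$. This can be divided into $K$ separate maximizations, each of the form

$$\lambda_k = \operatorname{argmax}_\lambda \sum_i w_{ik} \log f(x_i; \lambda).$$

Each of these maximizations is a weighted maximum-likelihood problem, as claimed previously.

4 EM as double maximization

This section gives a simplified explanation of a point of view on the EM algorithm due originally to [NH98]. The standard EM algorithm uses the weights $p(z \mid x; \theta_t)$, but other weights $g_z$ may also be used. For any such weights, the log likelihood can be written

$$D = \log p(x; \theta) = \log \sum_z p(x, z; \theta)\,\frac{g_z}{g_z}$$

and Jensen's inequality is applicable if $\sum_z g_z = 1$ and $g_z \ge 0$ for all $z$. In this case

$$D \ge E = \sum_z g_z \log \frac{p(x, z; \theta)}{g_z}.$$

The overall goal is to maximize $D$, so consider choosing $g$ and choosing $\theta$ to maximize $E$. However, rather than choosing $g$ and $\theta$ simultaneously, suppose we choose $g$ first based on $\theta = \theta_t$, and then choose $\theta$ based on the new $g$. The first maximization is of

$$E = \sum_z g_z \big( \log p(x, z; \theta_t) - \log g_z \big)$$

with $\theta_t$ fixed. Introducing a Lagrange multiplier $\lambda$ for the constraint $\sum_z g_z = 1$ gives the unconstrained objective function

$$F = \lambda \Big( 1 - \sum_z g_z \Big) + \sum_z g_z \big( \log p(x, z; \theta_t) - \log g_z \big).$$

The partial derivatives are

$$\frac{\partial F}{\partial g_z} = -\lambda + (-1) + \log p(x, z; \theta_t) - \log g_z.$$

Solving for when the partial derivatives equal zero yields

$$\log g_z = \text{constant} + \log p(x, z; \theta_t).$$

The constraint $\sum_z g_z = 1$ gives

$$g_z = \frac{p(x, z; \theta_t)}{\sum_{z'} p(x, z'; \theta_t)} = \frac{p(x, z; \theta_t)}{p(x; \theta_t)} = p(z \mid x; \theta_t)$$

which are the weights used in the standard EM algorithm. As shown above, with these weights and with $\theta = \theta_t$, $E = \log p(x; \theta_t)$, which is the log likelihood $D$ at $\theta_t$.

In the double maximization version of EM, both the E and M steps are maximizations. The E-step is to solve

$$w = \operatorname{argmax}_g \sum_z g_z \log \frac{p(x, z; \theta_t)}{g_z}$$

while the M-step solves

$$\theta_{t+1} = \operatorname{argmax}_\theta \sum_z w_z \log p(x, z; \theta).$$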
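For a mixture model, the two alternating maximizations take the concrete form derived in Section 3. The sketch below is a minimal illustration, assuming one-dimensional Gaussian components (the text allows an arbitrary density f), synthetic data, and hypothetical function names; it includes no safeguards against degenerate components. It records the gap between the bound and the log likelihood after each E-step and the log likelihood after each M-step.

```python
import math
import random

def normal_pdf(x, mu, var):
    """Gaussian density, assumed here as the component density f(x; lambda_k)."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def joint(x, theta):
    """List of p(x, k; theta) = alpha_k * f(x; lambda_k) over components k."""
    alphas, mus, variances = theta
    return [a * normal_pdf(x, m, v) for a, m, v in zip(alphas, mus, variances)]

def log_likelihood(xs, theta):
    """D = sum_i log p(x_i; theta)."""
    return sum(math.log(sum(joint(x, theta))) for x in xs)

def e_step(xs, theta_t):
    """w[i][k] = p(k | x_i; theta_t): the weights that maximize E for fixed theta_t."""
    w = []
    for x in xs:
        p = joint(x, theta_t)
        total = sum(p)
        w.append([pk / total for pk in p])
    return w

def lower_bound(xs, w, theta):
    """E = sum_i sum_k w_ik * log(p(x_i, k; theta) / w_ik); zero weights add 0."""
    total = 0.0
    for x, row in zip(xs, w):
        for wk, pk in zip(row, joint(x, theta)):
            if wk > 0.0:
                total += wk * math.log(pk / wk)
    return total

def m_step(xs, w):
    """theta_{t+1}: alpha_k = c_k / sum_j c_j, then weighted ML per component."""
    n_components = len(w[0])
    c = [sum(row[k] for row in w) for k in range(n_components)]  # c_k = sum_i w_ik
    alphas = [ck / sum(c) for ck in c]
    mus = [sum(row[k] * x for row, x in zip(w, xs)) / c[k] for k in range(n_components)]
    variances = [
        sum(row[k] * (x - mus[k]) ** 2 for row, x in zip(w, xs)) / c[k]
        for k in range(n_components)
    ]
    return alphas, mus, variances

# Synthetic data (an assumption): two well-separated clusters.
random.seed(0)
xs = [random.gauss(-2.0, 1.0) for _ in range(150)]
xs += [random.gauss(3.0, 0.5) for _ in range(100)]

theta = ([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])  # (alphas, means, variances)
history = [log_likelihood(xs, theta)]
gaps = []
for _ in range(30):
    w = e_step(xs, theta)                       # first maximization: over g
    gaps.append(abs(lower_bound(xs, w, theta) - log_likelihood(xs, theta)))
    theta = m_step(xs, w)                       # second maximization: over theta
    history.append(log_likelihood(xs, theta))
print(theta)
```

Each E-step makes the bound tight at the current parameters, so the entries of `gaps` should be zero up to rounding, and each M-step can only raise the bound, so `history` should be non-decreasing.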

5 Jensen's inequality

The mathematical fact on which the EM algorithm is based is known as Jensen's inequality. It is the following lemma.

Lemma: Suppose the weights $w_j$ are nonnegative and sum to one, and let each $x_j$ be any real number for $j = 1$ to $j = n$. Let $f : \mathbb{R} \to \mathbb{R}$ be any concave function. Then

$$f\Big( \sum_j w_j x_j \Big) \ge \sum_j w_j f(x_j).$$

Proof: The proof is by induction on $n$. For the base case $n = 2$, the definition of being concave says that

$$f(wa + (1 - w)b) \ge w f(a) + (1 - w) f(b).$$

For the inductive step with $n > 2$, assume $w_n < 1$ (otherwise the claim is immediate) and write $\sum_{j=1}^{n} w_j x_j = (1 - w_n)\,y + w_n x_n$ with $y = \sum_{j<n} w_j x_j / (1 - w_n)$. The base case gives $f\big(\sum_j w_j x_j\big) \ge (1 - w_n) f(y) + w_n f(x_n)$, and the induction hypothesis applied to the $n - 1$ weights $w_j / (1 - w_n)$, which are nonnegative and sum to one, gives $(1 - w_n) f(y) \ge \sum_{j<n} w_j f(x_j)$. Combining the two inequalities yields the lemma.

The logarithm function is concave, so Jensen's inequality applies to it.

References

[Bil98] Jeff A. Bilmes. A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. Technical Report TR , U. C. Berkeley, April

[Bor04] Sean Borman. The expectation maximization algorithm: A short tutorial. Unpublished paper available at

[Del02] Frank Dellaert. The expectation maximization algorithm. Technical Report GIT-GVU-02-20, Georgia Institute of Technology,

[Min98] Thomas P. Minka. Expectation-maximization as lower bound maximization. Unpublished paper available at microsoft.com/ minka,

[NH98] Radford M. Neal and Geoffrey E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Proceedings of the NATO Advanced Study Institute on Learning in graphical models, pages , Norwell, MA, USA, Kluwer Academic Publishers.

[Roc07] Alexis Roche. EM algorithms and variants: An informal tutorial. Unpublished paper available at people/roche/,

[Rus98] Stuart Russell. The EM algorithm. Unpublished note available at russell/classes/cs281/s98/em.ps,
