Expectation Maximization
Aaron C. Courville, Université de Montréal
Note: Material for these slides is taken directly from a presentation prepared by Christopher M. Bishop.
Learning in DAGs

Two things could be learned:
- Graph structure
- Parameters governing the conditional probability distributions

Learning the structure often involves a search over candidate structures and a method to score each structure.
- In practice, it is often difficult to extract the conditional independence relationships that make DAGs so appealing in the first place.
- MCMC methods are also used to search over the space of structures.

We will focus on the problem of learning the parameters.
Maximum Likelihood Learning

Consider the parameter set θ = (θ_1, ..., θ_M) which governs the conditional probability distributions P(X_i | Pa_i, θ). One way to learn the parameters is to maximize the likelihood (or probability) of the data D (the set of observed variables):

$$\begin{aligned}
\theta_{ML} &= \arg\max_\theta \,\{\ln P(D \mid \theta)\} \\
&= \arg\max_\theta \left\{ \ln \prod_{n=1}^N P(x_n \mid \theta) \right\} \\
&= \arg\max_\theta \left\{ \sum_{n=1}^N \ln P(x_n \mid \theta) \right\} \\
&= \arg\max_\theta \left\{ \sum_{n=1}^N \ln \sum_{h_n} P(x_n, h_n \mid \theta) \right\}
\end{aligned}$$

The sum over H (the latents) inside the log could be problematic.

When there are no latent variables, the situation is much easier:
- If we can condition on all the variables, the graph factors by d-separation and we can estimate the parameters of each P(X_i | Pa_i, θ) independently (e.g., the Naïve Bayes classifier).
ML Example: Multivariate Gaussian

Multivariate Gaussian distribution:

$$p(\mathbf{x}) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right\}$$

with mean µ and covariance Σ.

Maximum likelihood solution:

$$\mu_{ML} = \frac{1}{N} \sum_{n=1}^N x_n \qquad \Sigma_{ML} = \frac{1}{N} \sum_{n=1}^N (x_n - \mu_{ML})(x_n - \mu_{ML})^T$$

Nice, simple closed-form solutions.
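As a minimal illustration (not from the original slides), the two closed-form ML estimates are one line each in NumPy. The data here is made up for the example; note the 1/N normalization of the covariance, matching the formula above rather than the unbiased 1/(N-1) estimator:

```python
import numpy as np

# Hypothetical data: 500 samples from a 2-D Gaussian
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[1.0, -2.0],
                            cov=[[2.0, 0.6], [0.6, 1.0]], size=500)

mu_ml = X.mean(axis=0)             # mu_ML = (1/N) sum_n x_n
diff = X - mu_ml
sigma_ml = diff.T @ diff / len(X)  # Sigma_ML = (1/N) sum_n (x_n - mu)(x_n - mu)^T

print(mu_ml)     # close to [1, -2]
print(sigma_ml)  # close to [[2, 0.6], [0.6, 1]]
```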
Gaussian Mixture Models

Now let's consider a random variable distributed according to a mixture of Gaussians. Conditional distributions for a D-dimensional X:

$$P(I = i) = w_i$$

$$p(x \mid I = i) = \mathcal{N}(x \mid \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{D/2} |\Sigma_i|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right)$$

where I is an index over the multivariate Gaussian components in the mixture and the mixing proportion, w_i, is the marginal probability that X is generated by mixture component i.

Marginal distribution:

$$p(x) = \sum_i p(x \mid I = i)\, P(I = i) = \sum_i w_i\, \mathcal{N}(x \mid \mu_i, \Sigma_i)$$

[Figure: (a) the component densities p(x | i); (b) the resulting marginal density p(x).]
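To make the generative story concrete, here is a small sketch (my own illustration, with made-up 1-D parameters) of ancestral sampling from a GMM: first draw the component index I with probabilities w, then draw x from that component's Gaussian:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 1-D mixture with K = 2 components
w = np.array([0.3, 0.7])      # mixing proportions, sum to 1
mu = np.array([-2.0, 3.0])    # component means
sigma = np.array([0.5, 1.0])  # component standard deviations

def sample_gmm(n):
    """Ancestral sampling: I ~ Categorical(w), then x ~ N(mu_I, sigma_I^2)."""
    i = rng.choice(len(w), size=n, p=w)  # latent component indices
    return rng.normal(mu[i], sigma[i]), i

x, i = sample_gmm(1000)
```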
Gaussian Mixture Models (cont.)

[Graphical model: parameters w, µ, and Σ; for each of the N data points, the latent component index I selects the Gaussian component that generates the observed X (plate over N).]
Maximum Likelihood of GMM

Log likelihood function:

$$\ln p(D \mid w, \mu, \Sigma) = \sum_{n=1}^N \ln \left\{ \sum_i w_i\, \mathcal{N}(x_n \mid \mu_i, \Sigma_i) \right\}$$

The sum over mixture components appears inside the log:
- No closed-form ML solution.
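Although there is no closed-form maximizer, the log-likelihood itself is easy to evaluate. A sketch (not from the slides) using a log-sum-exp over component log-densities for numerical stability, with scipy.stats supplying the Gaussian density:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, w, mus, Sigmas):
    """ln p(D | w, mu, Sigma) = sum_n ln sum_i w_i N(x_n | mu_i, Sigma_i)."""
    # log_pdf[n, i] = ln w_i + ln N(x_n | mu_i, Sigma_i)
    log_pdf = np.stack([np.log(w_i) + multivariate_normal.logpdf(X, m, S)
                        for w_i, m, S in zip(w, mus, Sigmas)], axis=1)
    return logsumexp(log_pdf, axis=1).sum()
```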
Complete and Incomplete Data

If we knew the mixture component identities, things would be easier.
- This is the difference between complete data and incomplete data.

[Figure: (a) complete data, with points labeled by their mixture component; (b) incomplete data, with the component identities unobserved.]

The complete-data picture treats the latent variables as missing data.
The makings of an iterative scheme

Problem: we don't know the values of the latent variables (they're missing!)

The EM idea: instead, maximize the expected value of the complete-data log-likelihood,
- with the expectation taken w.r.t. P(latents | observed, parameters).

[Figure: (a) incomplete data; (b) data with soft component assignments from the posterior.]
Expectation-Maximization Algorithm

E-step (expectation): evaluate the posterior distribution P(Z | X, θ_old) using the current estimate, θ_old, of the parameters.

M-step (maximization): re-estimate θ by maximizing the expected complete-data log-likelihood:

$$\theta_{new} = \arg\max_\theta Q(\theta, \theta_{old}) = \arg\max_\theta \left\{ \sum_Z P(Z \mid X, \theta_{old}) \ln P(X, Z \mid \theta) \right\}$$

Note that the log and the summation have been exchanged; this will often make the summation tractable.

Iterate the E and M steps until convergence. EM is guaranteed to converge to a local optimum, with a linear convergence rate.
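The overall structure is model-agnostic. A schematic Python loop (my own sketch; e_step and m_step are placeholders for the model-specific computations derived on the following slides):

```python
def em(X, theta, e_step, m_step, max_iters=100, tol=1e-6):
    """Generic EM loop: alternate posterior evaluation and parameter updates.

    e_step(X, theta)     -> (posterior P(Z | X, theta), current log-likelihood)
    m_step(X, posterior) -> new theta maximizing the expected complete-data
                            log-likelihood Q(theta, theta_old)
    """
    prev_ll = -float("inf")
    for _ in range(max_iters):
        posterior, ll = e_step(X, theta)  # E step
        theta = m_step(X, posterior)      # M step
        if ll - prev_ll < tol:            # the log-likelihood never decreases
            break
        prev_ll = ll
    return theta
```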
E Step: Mixture of Gaussians

Calculate P(I_n | X_n, θ) for each observed example X_n = [X_{1,n}, ..., X_{d,n}, ..., X_{D,n}]^T:

$$\begin{aligned}
P(I_n = i \mid X_n, \theta) &= \frac{P(I_n = i \mid w_i)\, P(X_n \mid I_n = i, \mu_i, \Sigma_i)}{P(X_n)} \\
&= \frac{P(I_n = i \mid w_i)\, P(X_n \mid I_n = i, \mu_i, \Sigma_i)}{\sum_i P(I_n = i \mid w_i)\, P(X_n \mid I_n = i, \mu_i, \Sigma_i)} \\
&= \frac{w_i\, \mathcal{N}(X_n \mid \mu_i, \Sigma_i)}{\sum_j w_j\, \mathcal{N}(X_n \mid \mu_j, \Sigma_j)}
\end{aligned}$$

where θ = {θ_1, ..., θ_i, ..., θ_K}, θ_i = {w_i, µ_i, Σ_i}, and N(·) is the multivariate Gaussian probability density function.
M Step: Mixture of Gaussians

For mixtures of Gaussians:

$$\theta' \leftarrow \arg\max_{\theta'} \left\{ \sum_n \sum_i P(I_n = i \mid X_n = x_n, \theta) \ln P(X_n = x_n, I_n = i \mid \theta') \right\}$$

We already computed P(I_n = i | X_n = x_n, θ) in the E step, and we can decompose the joint P(X_n = x_n, I_n = i | θ'):

$$\begin{aligned}
\sum_n \sum_i P(i_n \mid x_n, \theta) \ln p(x_n, i_n \mid \theta') &= \sum_n \sum_i P(i_n \mid x_n, \theta) \ln p(x_n \mid i_n, \theta')\, P(i_n \mid \theta') \\
&= \sum_n \sum_i P(i_n \mid x_n, \theta) \ln w'_i + \sum_n \sum_i P(i_n \mid x_n, \theta) \ln \mathcal{N}(x_n \mid \mu'_i, \Sigma'_i)
\end{aligned}$$

Now we maximize this expression w.r.t. θ' (on to the M step).
M Step: Mixture of Gaussians (cont.)

Let's consider updating w_i (subject to the constraint Σ_i w_i = 1), using a Lagrange multiplier λ:

$$\frac{\partial}{\partial w_i} \left[ \sum_n \sum_i P(i_n \mid x_n, \theta) \ln w_i + \lambda \left( \sum_i w_i - 1 \right) \right] = 0$$

$$\sum_{n=1}^N \frac{1}{w_i} P(i_n \mid x_n, \theta) + \lambda = 0$$

Multiplying by w_i and summing over i (using Σ_i w_i = 1 and Σ_i P(I_n = i | x_n, θ) = 1) gives λ = -N, so

$$w_i \leftarrow \frac{1}{N} \sum_{n=1}^N P(i_n \mid x_n, \theta)$$
M Step: Mixture of Gaussians (cont.)

Now consider updating the mean vectors µ_i:

$$\frac{\partial}{\partial \mu_i} \left[ \sum_n \sum_i P(i_n \mid x_n, \theta) \ln \mathcal{N}(x_n \mid \mu_i, \Sigma_i) \right] = 0$$

$$\frac{\partial}{\partial \mu_i} \left[ \sum_n \sum_i P(i_n \mid x_n, \theta) \left( -\frac{1}{2} \ln |\Sigma_i| - \frac{1}{2} (x_n - \mu_i)^T \Sigma_i^{-1} (x_n - \mu_i) \right) \right] = 0$$

$$\mu_i \leftarrow \frac{\sum_{n=1}^N P(i_n \mid x_n, \theta)\, x_n}{\sum_{n=1}^N P(i_n \mid x_n, \theta)}$$
M Step: Mixture of Gaussians (cont.)

Finally, let's consider updating the covariance matrices Σ_i:

$$\frac{\partial}{\partial \Sigma_i} \left[ \sum_n \sum_i P(i_n \mid x_n, \theta) \ln \mathcal{N}(x_n \mid \mu_i, \Sigma_i) \right] = 0$$

$$\frac{\partial}{\partial \Sigma_i} \left[ \sum_n \sum_i P(i_n \mid x_n, \theta) \left( -\frac{1}{2} \ln |\Sigma_i| - \frac{1}{2} (x_n - \mu_i)^T \Sigma_i^{-1} (x_n - \mu_i) \right) \right] = 0$$

$$\Sigma_i \leftarrow \frac{\sum_{n=1}^N P(i_n \mid x_n, \theta)(x_n - \mu_i)(x_n - \mu_i)^T}{\sum_{n=1}^N P(i_n \mid x_n, \theta)}$$

where µ_i here is the new mean just computed on the previous slide.
EM Gaussian Mixture: Summary

Given observed X_1 to X_N and hidden variables I_1 to I_N (the mixture component identities), iterate the E and M steps until convergence.

E step: for each data point n, compute

$$P(i_n \mid x_n, \theta) = \frac{w_i\, \mathcal{N}(x_n \mid \mu_i, \Sigma_i)}{\sum_{j=1}^K w_j\, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$$

M step: update the parameters of each component i (from 1 to K) with

$$w_i \leftarrow \frac{1}{N} \sum_{n=1}^N P(i_n \mid x_n, \theta)$$

$$\mu_i \leftarrow \frac{\sum_{n=1}^N P(i_n \mid x_n, \theta)\, x_n}{\sum_{n=1}^N P(i_n \mid x_n, \theta)}$$

$$\Sigma_i \leftarrow \frac{\sum_{n=1}^N P(i_n \mid x_n, \theta)(x_n - \mu_i)(x_n - \mu_i)^T}{\sum_{n=1}^N P(i_n \mid x_n, \theta)}$$
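The summary translates almost line for line into NumPy. A compact sketch (my own implementation, not from the slides); resp[n, i] holds the posterior P(i_n = i | x_n, θ), and the small ridge added to each covariance is a pragmatic guard against the singularity problem discussed on the over-fitting slide below:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, seed=0):
    """EM for a K-component Gaussian mixture on data X of shape (N, D)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w = np.full(K, 1.0 / K)                  # mixing proportions
    mu = X[rng.choice(N, K, replace=False)]  # init means at random data points
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(D)] * K)

    for _ in range(n_iters):
        # E step: resp[n, i] proportional to w_i N(x_n | mu_i, Sigma_i)
        resp = np.stack([w[i] * multivariate_normal.pdf(X, mu[i], Sigma[i])
                         for i in range(K)], axis=1)
        resp /= resp.sum(axis=1, keepdims=True)

        # M step: the weighted-average updates from the summary slide
        Nk = resp.sum(axis=0)                # effective counts per component
        w = Nk / N
        mu = (resp.T @ X) / Nk[:, None]
        for i in range(K):
            diff = X - mu[i]
            Sigma[i] = (resp[:, i, None] * diff).T @ diff / Nk[i]
            Sigma[i] += 1e-6 * np.eye(D)     # small ridge for numerical stability
    return w, mu, Sigma
```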
EM I: GMM model of Old Faithful data

[Figure: scatter plot of the Old Faithful data, duration of eruption (minutes) vs. time between eruptions (minutes), followed by a sequence of slides showing successive EM iterations of the GMM fit. Slide images credit: Christopher M. Bishop, Machine Learning Summer School, Berder Island, 2004.]
Over-fitting with Gaussian Mixtures

Singularities (infinities) occur in the likelihood function when a component collapses onto a data point:

$$\mathcal{N}(x_n \mid x_n, \sigma^2 I) = \frac{1}{(2\pi)^{D/2}} \frac{1}{\sigma^D} \to \infty \quad \text{as } \sigma \to 0$$

Also, maximum likelihood cannot determine the number of mixture components (the likelihood always increases with more components).
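A quick numeric check of the collapse (my own illustration): the density of an isotropic Gaussian centered exactly on a data point blows up as its scale shrinks, so the total likelihood can be made arbitrarily large:

```python
import numpy as np

D = 2
for sigma in [1.0, 0.1, 0.01, 0.001]:
    # Density at the mean of an isotropic Gaussian N(x_n | x_n, sigma^2 I)
    density = 1.0 / ((2 * np.pi) ** (D / 2) * sigma ** D)
    print(f"sigma = {sigma:6.3f}  ->  N(x_n | x_n, sigma^2 I) = {density:.3e}")
# The likelihood term N(x_n | x_n, sigma^2 I) diverges as sigma -> 0.
```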
EM II: Clustering Documents with Naïve Bayes

Suppose you have a collection of D unlabeled documents. Build an initial naïve Bayes classifier with parameters θ, then use EM to find the maximum likelihood estimate of the parameters.

Naïve Bayes assumption for document clustering:
- The probability of a document d_i given class c_j is the product of the probabilities of the words w_{d_i,k} in the document given that class:

$$P(d_i \mid c_j, \theta) = \prod_k P(w_{d_i,k} \mid c_j, \theta)$$

- The model parameters are the probabilities of the words w_t given the class c_j, θ_{w_t|c_j} (think of this as a class-specific vocabulary), and the marginal probabilities of the classes, θ_{c_j}.

Repeat until convergence:
- E step: use the current classifier (θ) to estimate the component membership of each unlabeled document, i.e., the probability that each class generated each document, P(c_j | d_i, θ).
- M step: re-estimate the classifier (θ) given the estimated component membership of each document.
EM II: Clustering Documents with Naïve Bayes (cont.)

E step:

$$P(y_i = c_j \mid d_i, \theta) = \frac{P(c_j \mid \theta)\, P(d_i \mid c_j, \theta)}{P(d_i \mid \theta)} = \frac{P(c_j \mid \theta) \prod_{k=1}^{|d_i|} P(w_{d_i,k} \mid c_j, \theta)}{\sum_{r=1}^{C} P(c_r \mid \theta) \prod_{k=1}^{|d_i|} P(w_{d_i,k} \mid c_r, \theta)}$$

M step:

$$\theta_{w_t \mid c_j} = P(w_t \mid c_j, \theta) = \frac{\sum_{i=1}^{D} \mathrm{Num}(w_t, d_i)\, P(y_i = c_j \mid d_i)}{\sum_{s=1}^{V} \sum_{i=1}^{D} \mathrm{Num}(w_s, d_i)\, P(y_i = c_j \mid d_i)}$$

$$\theta_{c_j} = P(c_j \mid \theta) = \frac{\sum_{i=1}^{D} P(y_i = c_j \mid d_i)}{D}$$

where D is the number of documents, C is the number of classes, |d_i| is the number of words in document d_i, and w_t is the t-th word in the vocabulary of size V.
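A compact sketch of both steps (my own implementation, not from the slides). The +1 Laplace smoothing and the random Dirichlet initialization are practical additions beyond the slide formulas; the E step works in log space to avoid underflow on long documents:

```python
import numpy as np

def em_naive_bayes(counts, C, n_iters=50, seed=0):
    """EM document clustering with naïve Bayes.

    counts: (D, V) array, counts[i, t] = Num(w_t, d_i), word counts per document.
    C: number of classes. Returns (theta_c, theta_wc, posterior).
    """
    rng = np.random.default_rng(seed)
    Dn, V = counts.shape
    # Random soft initialization of the posterior P(y_i = c_j | d_i)
    post = rng.dirichlet(np.ones(C), size=Dn)                 # (D, C)

    for _ in range(n_iters):
        # M step: theta_c[j] = sum_i post[i, j] / D
        theta_c = post.sum(axis=0) / Dn                       # (C,)
        # theta_wc[t, j] prop. to sum_i Num(w_t, d_i) post[i, j], +1 smoothing
        num = counts.T @ post + 1.0                           # (V, C)
        theta_wc = num / num.sum(axis=0, keepdims=True)

        # E step: log P(c_j) + sum_t Num(w_t, d_i) log P(w_t | c_j)
        log_post = np.log(theta_c) + counts @ np.log(theta_wc)  # (D, C)
        log_post -= log_post.max(axis=1, keepdims=True)          # stabilize
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)
    return theta_c, theta_wc, post
```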
EM: optimizing a lower bound

Recall: our original goal is to maximize the likelihood p(X | θ). Suppose that direct optimization of p(X | θ) is difficult, but that optimizing the complete-data likelihood function p(X, Z | θ) is significantly easier.

Introduce a distribution q(Z) over the latents. For any choice of q(Z):

$$\ln p(X \mid \theta) = \mathcal{L}(q, \theta) + \mathrm{KL}(q \,\|\, p)$$

where

$$\mathcal{L}(q, \theta) = \sum_Z q(Z) \ln \left\{ \frac{p(X, Z \mid \theta)}{q(Z)} \right\} \qquad \mathrm{KL}(q \,\|\, p) = -\sum_Z q(Z) \ln \left\{ \frac{p(Z \mid X, \theta)}{q(Z)} \right\}$$
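The decomposition follows in two lines from p(X, Z | θ) = p(Z | X, θ) p(X | θ); a worked check (my own expansion, not on the slide):

```latex
\mathcal{L}(q,\theta) + \mathrm{KL}(q\,\|\,p)
  = \sum_Z q(Z)\ln\frac{p(X,Z\mid\theta)}{q(Z)}
    - \sum_Z q(Z)\ln\frac{p(Z\mid X,\theta)}{q(Z)}
  = \sum_Z q(Z)\ln\frac{p(X,Z\mid\theta)}{p(Z\mid X,\theta)}
  = \sum_Z q(Z)\ln p(X\mid\theta)
  = \ln p(X\mid\theta).
```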
EM: optimizing a lower bound (cont.)

Maximizing L(q, θ) with respect to a free-form q distribution, we obtain the true posterior distribution:

$$q(Z) = p(Z \mid X, \theta)$$

The lower bound L(q, θ) then becomes

$$\mathcal{L}(q, \theta) = \sum_Z p(Z \mid X, \theta_{old}) \ln \left\{ \frac{p(X, Z \mid \theta)}{p(Z \mid X, \theta_{old})} \right\} = Q(\theta, \theta_{old}) + \text{const}$$

which, as a function of θ, is the expected complete-data log likelihood (up to an additive constant).
EM: optimizing a lower bound (cont.)

Initial Configuration: [Figure: ln p(X | θ) shown as the sum of the lower bound L(q, θ) and the gap KL(q‖p).]
EM: optimizing a lower bound (cont.)

E-step: [Figure: q is set to the posterior p(Z | X, θ_old), driving KL(q‖p) to zero; the bound L(q, θ) rises to touch ln p(X | θ_old).]
EM: optimizing a lower bound (cont.)

M-step: [Figure: θ is updated to maximize L(q, θ), raising the bound and hence ln p(X | θ); at θ_new the KL gap is nonzero again, setting up the next E-step.]
Acknowledgments

The material presented here is taken from tutorials, notes and lecture slides from Yoshua Bengio, Christopher Bishop, Andrew Moore, Tom Mitchell and Scott Davies.

Note to other teachers and users of these slides. Andrew and Scott would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.