EM Algorithm

Lukáš Cerman, Václav Hlaváč
Czech Technical University, Faculty of Electrical Engineering
Department of Cybernetics, Center for Machine Perception
121 35 Praha 2, Karlovo nám. 13, Czech Republic
cermal1@fel.cvut.cz
http://cmp.felk.cvut.cz/~cermal1/files/lectureem.pdf

LECTURE OUTLINE
  Task Formulation
  EM as Lower Bound Maximization
  Examples
  Relation to Unsupervised Learning
  Relation to K-Means
  Including Known Priors
TASK FORMULATION

Consider an experiment with the probability model P(x, y | θ), where
  x ∈ X are observed data,
  y ∈ Y are unobserved data,
  θ ∈ Θ are parameters of the distribution.

The task is to estimate θ given a set of measurements X = {x_1, ..., x_n}, x_i ∈ X.

Marginalize over the missing data:

  P(x \mid \theta) = \sum_{y \in Y} P(x, y \mid \theta).

The ML principle defines the likelihood of the observed data,

  l(\theta) = P(X \mid \theta) = \prod_{i=1}^{n} P(x_i \mid \theta) = \prod_{i=1}^{n} \sum_{y \in Y} P(x_i, y \mid \theta).

Maximize the log-likelihood L(θ) = log l(θ):

  \theta^{*} = \operatorname{argmax}_{\theta} L(\theta) = \operatorname{argmax}_{\theta} \sum_{i=1}^{n} \log \sum_{y \in Y} P(x_i, y \mid \theta).
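A minimal sketch of evaluating this observed-data log-likelihood, assuming a hypothetical two-component 1-D Gaussian mixture as the model P(x, y | θ); the numbers and names are made up for illustration only.

# Sketch: L(theta) = sum_i log sum_y P(x_i, y | theta) for a hypothetical
# two-component 1-D Gaussian mixture (weights, means, sigmas are made up).
import numpy as np
from scipy.stats import norm

def log_likelihood(x, weights, means, sigmas):
    # P(x_i, y | theta) = P(y) * N(x_i | mu_y, sigma_y); the sum over y sits inside the log
    joint = np.array([w * norm.pdf(x, m, s) for w, m, s in zip(weights, means, sigmas)])
    return np.sum(np.log(joint.sum(axis=0)))

x = np.array([-1.2, 0.3, 2.5, 3.1])   # observed data
print(log_likelihood(x, [0.5, 0.5], [0.0, 3.0], [1.0, 1.0]))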
EM AS A LOWER BOUND MAXIMIZATION

No closed-form solution exists for θ* = argmax_θ L(θ).

Option 1: numerical solution using gradient-based optimization techniques.
Option 2: the EM algorithm, a rather simple alternative solution to the problem.

Instead of maximizing L(θ), maximize its lower bound F(θ, α), where α(x_i, y) is a distribution over y for each x_i:

  F(\theta, \alpha) = \sum_{i=1}^{n} \sum_{y \in Y} \bigl[ \alpha(x_i, y) \log P(x_i, y \mid \theta) - \alpha(x_i, y) \log \alpha(x_i, y) \bigr]
                    = \sum_{i=1}^{n} \sum_{y \in Y} \alpha(x_i, y) \log \frac{P(x_i, y \mid \theta)}{\alpha(x_i, y)}

Proof by Jensen's inequality:

  \log \sum_{y \in Y} P(x_i, y \mid \theta) = \log \sum_{y \in Y} \alpha(x_i, y) \frac{P(x_i, y \mid \theta)}{\alpha(x_i, y)} \ge \sum_{y \in Y} \alpha(x_i, y) \log \frac{P(x_i, y \mid \theta)}{\alpha(x_i, y)} \quad \text{for each } x_i \in X.
FINDING THE OPTIMAL BOUND

Fix θ and maximize F(θ, α) with respect to α(x_i, y). Introduce a Lagrange multiplier λ to enforce \sum_{y \in Y} \alpha(x_i, y) = 1:

  G(\alpha) = \lambda \Bigl( 1 - \sum_{y \in Y} \alpha(x_i, y) \Bigr) + \sum_{y \in Y} \alpha(x_i, y) \log P(x_i, y \mid \theta) - \sum_{y \in Y} \alpha(x_i, y) \log \alpha(x_i, y)

Take the derivative and set it to zero:

  \frac{\partial G(\alpha)}{\partial \alpha(x_i, y)} = -\lambda + \log P(x_i, y \mid \theta) - \log \alpha(x_i, y) - 1 = 0

Solve for α(x_i, y):

  \alpha(x_i, y) = \frac{P(x_i, y \mid \theta)}{\sum_{y' \in Y} P(x_i, y' \mid \theta)} = P(y \mid x_i, \theta)
EXAMINING THE OPTIMAL BOUND

By examining the optimal bound we see that it indeed touches the objective function, the log-likelihood L(θ). Substituting α(x_i, y) = P(y | x_i, θ):

  F(\theta, \alpha) = \sum_{i=1}^{n} \sum_{y \in Y} \alpha(x_i, y) \log \frac{P(x_i, y \mid \theta)}{\alpha(x_i, y)}
                    = \sum_{i=1}^{n} \sum_{y \in Y} P(y \mid x_i, \theta) \log \frac{P(x_i, y \mid \theta) \sum_{y' \in Y} P(x_i, y' \mid \theta)}{P(x_i, y \mid \theta)}
                    = \sum_{i=1}^{n} \sum_{y \in Y} P(y \mid x_i, \theta) \log \sum_{y' \in Y} P(x_i, y' \mid \theta)
                    = \sum_{i=1}^{n} \log \sum_{y' \in Y} P(x_i, y' \mid \theta) = L(\theta)
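A numerical sanity check of this fact (my own sketch, using a made-up two-component 1-D mixture, not part of the lecture): with α set to the posterior P(y | x_i, θ), the bound F(θ, α) equals L(θ).

# Check that the optimal bound touches the log-likelihood on toy data.
import numpy as np
from scipy.stats import norm

x = np.array([-1.0, 0.5, 2.0])
weights, means, sigmas = np.array([0.4, 0.6]), np.array([0.0, 2.0]), np.array([1.0, 1.0])

joint = weights * norm.pdf(x[:, None], means, sigmas)    # P(x_i, y | theta), shape (n, K)
L = np.log(joint.sum(axis=1)).sum()                      # log-likelihood L(theta)
alpha = joint / joint.sum(axis=1, keepdims=True)         # P(y | x_i, theta)
F = (alpha * (np.log(joint) - np.log(alpha))).sum()      # lower bound F(theta, alpha)
print(np.isclose(L, F))                                  # True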
MAXIMIZING THE BOUND

Fix α and maximize F(θ, α) with respect to θ:

  \operatorname{argmax}_{\theta} F(\theta, \alpha) = \operatorname{argmax}_{\theta} \sum_{i=1}^{n} \sum_{y \in Y} \bigl[ \alpha(x_i, y) \log P(x_i, y \mid \theta) - \alpha(x_i, y) \log \alpha(x_i, y) \bigr]
                                                   = \operatorname{argmax}_{\theta} \sum_{i=1}^{n} \sum_{y \in Y} \alpha(x_i, y) \log P(x_i, y \mid \theta)
                                                   = \operatorname{argmax}_{\theta} \sum_{i=1}^{n} E_{P(y \mid x_i, \theta^{t})} [ \log P(x_i, y \mid \theta) ]

This maximizes the expectation of the complete-data log-likelihood log P(x_i, y | θ) under the current estimate P(y | x_i, θ^t), the distribution at which α was fixed.
EM ALGORITHM

At each iteration, find the optimal lower bound F(θ^t, α) at the current guess θ^t, then maximize this bound to obtain an improved estimate θ^{t+1}.

E-step: calculate

  \alpha^{t}(x_i, y) = \frac{P(x_i, y \mid \theta^{t})}{\sum_{y' \in Y} P(x_i, y' \mid \theta^{t})} = P(y \mid x_i, \theta^{t})

M-step: calculate

  \theta^{t+1} = \operatorname{argmax}_{\theta} \sum_{i=1}^{n} E_{P(y \mid x_i, \theta^{t})} [ \log P(x_i, y \mid \theta) ]

Initial values θ^0 may be chosen randomly.
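A sketch of this generic iteration as a loop; the interface (e_step, m_step callables returning the responsibilities and the improved parameters) is my own assumption, not the lecture's code.

# Generic EM loop: alternate the E-step and M-step until the likelihood stops improving.
def em(x, theta0, e_step, m_step, n_iters=100, tol=1e-6):
    theta = theta0
    prev = -float("inf")
    for _ in range(n_iters):
        alpha = e_step(x, theta)          # E-step: alpha^t = P(y | x_i, theta^t)
        theta, loglik = m_step(x, alpha)  # M-step: improved estimate theta^{t+1}
        if loglik - prev < tol:           # L(theta) never decreases under EM
            break
        prev = loglik
    return theta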
EXAMPLE: GAUSSIAN MIXTURE MODELS

The Gaussian mixture model is defined by

  P(x, y \mid \theta) = P(x \mid y, \theta) P(y \mid \theta) = N(x \mid \mu_y, \Sigma_y) P(y)

E-step: calculate

  \alpha^{t}(x_i, y) = \frac{N(x_i \mid \mu_y^{t}, \Sigma_y^{t}) P^{t}(y)}{\sum_{y' \in Y} N(x_i \mid \mu_{y'}^{t}, \Sigma_{y'}^{t}) P^{t}(y')}

M-step: calculate

  P^{t+1}(y) = \frac{1}{n} \sum_{i=1}^{n} \alpha^{t}(x_i, y)

  \mu_y^{t+1} = \frac{\sum_{i=1}^{n} \alpha^{t}(x_i, y) \, x_i}{\sum_{i=1}^{n} \alpha^{t}(x_i, y)}

  \Sigma_y^{t+1} = \frac{\sum_{i=1}^{n} \alpha^{t}(x_i, y) (x_i - \mu_y^{t+1})(x_i - \mu_y^{t+1})^{\top}}{\sum_{i=1}^{n} \alpha^{t}(x_i, y)}
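One EM iteration for the mixture, written out with numpy/scipy as a sketch of the formulas above; the function name and variable names are mine, not the lecture's.

# One E-step + M-step for a Gaussian mixture with full covariances.
import numpy as np
from scipy.stats import multivariate_normal

def em_step_gmm(X, priors, means, covs):
    n, K = X.shape[0], len(priors)
    # E-step: alpha^t(x_i, y) proportional to N(x_i | mu_y, Sigma_y) P(y)
    alpha = np.column_stack([
        priors[y] * multivariate_normal.pdf(X, means[y], covs[y]) for y in range(K)
    ])
    alpha /= alpha.sum(axis=1, keepdims=True)
    # M-step: re-estimate P(y), mu_y, Sigma_y from the responsibilities
    Nk = alpha.sum(axis=0)
    priors = Nk / n
    means = (alpha.T @ X) / Nk[:, None]
    covs = []
    for y in range(K):
        d = X - means[y]
        covs.append((alpha[:, y, None] * d).T @ d / Nk[y])
    return priors, means, covs, alpha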
EXAMPLE: IMAGE RECONSTRUCTION

Each pixel x_{irc} in image i at position (r, c) is observed with Gaussian noise N(0, σ). The probability of observing value x_{irc}, assuming the face is at position k_i, is therefore

  P(x_{irc} \mid k_i, f, b, \sigma) =
    \begin{cases}
      N(x_{irc} \mid f_{r,\, c - k_i + 1}, \sigma) & \text{for } c \in [k_i, k_i + w), \\
      N(x_{irc} \mid b, \sigma) & \text{elsewhere},
    \end{cases}

where b is the background intensity, f_{rc} are the face pixels, and w is the face width.

The probability of observing the set of m images X = {X_1, ..., X_m} is

  P(X \mid f, b, \sigma) = \prod_{i} \sum_{k} P(X_i, k \mid f, b, \sigma) = \prod_{i} \sum_{k} P(k) \, P(X_i \mid k, f, b, \sigma).

The unobserved data here are the face positions k_i. The parameters of the probability model are the face pixels f_{r,c}, the background intensity b, and the noise standard deviation σ.
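A rough sketch of the E-step for this model, reduced to a single image row (the 1-D simplification, function and variable names, and the explicit prior over positions are my own assumptions, not the lecture's code): it evaluates the posterior over the unknown face position k.

# Posterior over face position k for one image row, given current f, b, sigma.
import numpy as np

def position_posterior(x_row, f_row, b, sigma, prior_k):
    # log N(x | mu, sigma) up to an additive constant shared by all k
    def log_norm(x, mu):
        return -0.5 * ((x - mu) / sigma) ** 2
    w = len(f_row)
    log_post = []
    for k in range(len(x_row) - w + 1):
        mu = np.full(len(x_row), float(b))   # background everywhere ...
        mu[k:k + w] = f_row                  # ... except under the face
        log_post.append(np.log(prior_k[k]) + log_norm(x_row, mu).sum())
    log_post = np.array(log_post)
    post = np.exp(log_post - log_post.max())  # subtract max for numerical stability
    return post / post.sum()                  # P(k | x_row, f, b, sigma)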
RELATION TO UNSUPERVISED LEARNING

Consider a classification problem with measurements x ∈ X and classes y ∈ Y. The relation between each measurement x and its class assignment y can be described by the probability P(x, y | θ).

Given an unlabeled training set of measurements X = {x_1, ..., x_n}, one can use the EM algorithm to estimate the probability model and even to classify the observed data without any information from the teacher.

To classify the data one can use the output of the E-step,

  \alpha^{t}(x_i, y) = \frac{P(x_i, y \mid \theta^{t})}{\sum_{y' \in Y} P(x_i, y' \mid \theta^{t})} = P(y \mid x_i, \theta^{t}),

which can be interpreted as the probability of x_i belonging to class y.
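A tiny illustration of this reading (my own, with made-up numbers): the E-step output is a soft class assignment, and a hard label is simply its per-row argmax.

import numpy as np
alpha = np.array([[0.9, 0.1],    # P(y | x_1, theta): x_1 most likely class 0
                  [0.2, 0.8]])   # P(y | x_2, theta): x_2 most likely class 1
labels = alpha.argmax(axis=1)    # -> array([0, 1])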
RELATION TO K-MEANS

K-means is an unsupervised clustering algorithm. It iterates a classification step

  \alpha^{t}(x_i, y) =
    \begin{cases}
      1 & \text{for } y = \operatorname{argmin}_{y' \in Y} \| x_i - \mu_{y'}^{t} \|, \\
      0 & \text{elsewhere},
    \end{cases}

and a learning step

  \mu_y^{t+1} = \frac{\sum_{i=1}^{n} \alpha^{t}(x_i, y) \, x_i}{\sum_{i=1}^{n} \alpha^{t}(x_i, y)}.

Whereas the K-means algorithm performs a hard assignment of data points to clusters, the EM algorithm makes a soft assignment based on the posterior probabilities P(y | x_i, θ).

One can derive the K-means algorithm as a particular limit of EM for the Gaussian mixture model as follows. Take

  P(x, y \mid \theta) = P(x \mid y, \theta) P(y \mid \theta) = N(x \mid \mu_y, \epsilon I) P(y),

so that the E-step becomes

  \alpha^{t}(x_i, y) = \frac{P^{t}(y) \exp\bigl( -\| x_i - \mu_y^{t} \|^2 / 2\epsilon \bigr)}{\sum_{y' \in Y} P^{t}(y') \exp\bigl( -\| x_i - \mu_{y'}^{t} \|^2 / 2\epsilon \bigr)}.

Letting ε → 0 one obtains a hard assignment, just as in K-means.
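A sketch of this limit on made-up data (my own illustration): as ε shrinks, the GMM responsibility for the nearest mean tends to 1, reproducing the hard K-means assignment.

# Responsibilities of a GMM with covariance eps*I, for shrinking eps.
import numpy as np

def soft_assignment(x, means, eps, priors):
    d2 = np.array([np.sum((x - m) ** 2) for m in means])
    a = priors * np.exp(-(d2 - d2.min()) / (2 * eps))   # shift by min for numerical stability
    return a / a.sum()

x = np.array([1.0, 0.0])
means = [np.array([0.0, 0.0]), np.array([3.0, 0.0])]
for eps in (10.0, 1.0, 0.01):
    print(eps, soft_assignment(x, means, eps, np.array([0.5, 0.5])))
# as eps -> 0 the responsibility vector tends to [1, 0]: a hard assignment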
INCLUDING KNOWN PRIORS

EM can also be used to find MAP solutions for models with a defined prior P(θ):

  P(\theta \mid X) = \frac{P(X \mid \theta) P(\theta)}{P(X)} \propto P(X \mid \theta) P(\theta) = P(X, \theta)

The optimized lower bound is then

  F(\theta, \alpha) = \sum_{i=1}^{n} \sum_{y \in Y} \alpha(x_i, y) \log \frac{P(x_i, y, \theta)}{\alpha(x_i, y)}

E-step: calculate

  \alpha^{t}(x_i, y) = \frac{P(x_i, y, \theta^{t})}{\sum_{y' \in Y} P(x_i, y', \theta^{t})} = P(y \mid x_i, \theta^{t})

M-step: calculate

  \theta^{t+1} = \operatorname{argmax}_{\theta} \sum_{i=1}^{n} E_{P(y \mid x_i, \theta^{t})} [ \log P(x_i, y, \theta) ]
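As a concrete illustration (my own, assuming a Dirichlet(β) prior on the mixture weights; neither the prior nor the names are from the lecture), the MAP M-step for the weights P(y) smooths the ML update with pseudo-counts β_y − 1.

# MAP update of mixture weights under an assumed Dirichlet(beta) prior, beta_y >= 1.
import numpy as np

def map_weight_update(alpha, beta):
    # alpha: (n, K) responsibilities from the E-step; beta: (K,) Dirichlet parameters
    counts = alpha.sum(axis=0) + beta - 1.0
    return counts / counts.sum()   # MAP estimate of P(y); beta = 1 recovers the ML update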