The Expectation Maximization Algorithm


Frank Dellaert
College of Computing, Georgia Institute of Technology
Technical Report number GIT-GVU--, February

Abstract

This note represents my attempt at explaining the EM algorithm (Hartley, 1958; Dempster et al., 1977; McLachlan and Krishnan, 1997). It is just a slight variation on Tom Minka's tutorial (Minka, 1998), perhaps a little easier (or perhaps not). It includes a graphical example to provide some intuition.

1 Intuitive Explanation of EM

EM is an iterative optimization method to estimate some unknown parameters $\Theta$, given measurement data $U$. However, we are not given some hidden nuisance variables $J$, which need to be integrated out. In particular, we want to maximize the posterior probability of the parameters $\Theta$ given the data $U$, marginalizing over $J$:

\[
\Theta^{*} = \operatorname*{argmax}_{\Theta} \sum_{J \in \mathcal{J}^{n}} P(\Theta, J \mid U) \qquad (1)
\]

The intuition behind EM is an old one: alternate between estimating the unknowns $\Theta$ and the hidden variables $J$. This idea has been around for a long time. However, instead of finding the best $J \in \mathcal{J}^{n}$ given an estimate $\Theta$ at each iteration, EM computes a distribution over the space $\mathcal{J}^{n}$. One of the earliest papers on EM is (Hartley, 1958), but the seminal reference that formalized EM and provided a proof of convergence is the DLR paper by Dempster, Laird, and Rubin (Dempster et al., 1977). A recent book devoted entirely to EM and its applications is (McLachlan and Krishnan, 1997), whereas (Tanner, 1996) is another popular and very useful reference.

One of the most insightful explanations of EM, which provides a deeper understanding of its operation than the intuition of alternating between variables, is in terms of lower-bound maximization (Neal and Hinton, 1998; Minka, 1998). In this derivation, the E-step can be interpreted as constructing a local lower bound to the posterior distribution, whereas the M-step optimizes the bound, thereby improving the estimate for the unknowns. This is demonstrated below for a simple example.
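To make (1) concrete for the example that follows, one possible instantiation is sketched below. The equal mixing weights, unit-variance Gaussian components, and the notation $x_i$ for the samples and $j_i$ for the unknown component of sample $i$ are assumptions made here for illustration; they are not specified in the note:

\[
P(\Theta, J \mid U) \;\propto\; P(\Theta) \prod_{i=1}^{6} \tfrac{1}{2}\, \mathcal{N}(x_i;\, \theta_{j_i},\, 1),
\qquad J = (j_1, \ldots, j_6) \in \{1, 2\}^{6},
\]

so that the sum in (1) ranges over the $2^6 = 64$ possible data associations of the 6 samples used in the example.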

Figure 1: EM example: mixture components and data. The data consists of three samples drawn from each mixture component, shown above as circles and triangles.

Figure 2: The true likelihood function of the two component means $\theta_1$ and $\theta_2$, given the data in Figure 1.
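For readers who want to reproduce figures of this kind, the following sketch generates a toy data set and evaluates the true log-likelihood surface by marginalizing over each sample's data association. The component means, the random samples, and the unit-variance, equal-weight parameterization are illustrative assumptions; the note does not give the actual values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: two unit-variance components with equal mixing weights.
true_means = np.array([1.0, 3.0])
x = np.concatenate([rng.normal(m, 1.0, size=3) for m in true_means])  # 3 samples each

def log_likelihood(theta1, theta2, x):
    """True log-likelihood of the component means (up to an additive constant),
    obtained by marginalizing over each sample's data association."""
    means = np.array([theta1, theta2])
    per_sample = 0.5 * np.exp(-0.5 * (x[:, None] - means[None, :]) ** 2)
    return np.sum(np.log(per_sample.sum(axis=1)))

# Evaluate on a grid of (theta1, theta2), as in Figure 2; the surface has
# two symmetric modes because the two components can be swapped.
grid = np.linspace(-1.0, 5.0, 61)
surface = np.array([[log_likelihood(t1, t2, x) for t2 in grid] for t1 in grid])
print(surface.shape)  # (61, 61)
```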

Figure 3: Lower bounds. Each panel shows the lower bound constructed at one EM iteration, annotated with the iteration index $i$ and the corresponding bound value $Q$.

Consider the mixture estimation problem shown in Figure 1, where the goal is to estimate the two component means $\theta_1$ and $\theta_2$ given 6 samples drawn from the mixture, but without knowing from which mixture component each sample was drawn. The state space is two-dimensional, and the true likelihood function is shown in Figure 2. Note that there are two modes, located symmetrically about the diagonal $\theta_1 = \theta_2$. This makes perfect sense, as we can switch the mixture components without affecting the quality of the solution. Note also that the true likelihood is computed by integrating over all possible data associations, and hence we can find a maximum likelihood solution without solving a correspondence problem. However, even for only 6 samples, this requires summing over the space of $2^6 = 64$ possible data associations.

EM proceeds as follows in this example. In the E-step, a soft assignment is computed that assigns a posterior probability to each possible association of each individual sample. In the current example, there are 2 mixture components and 6 samples, so the computed probabilities can be represented in a $2 \times 6$ table. Given these probabilities, EM computes a tight lower bound to the true likelihood function of Figure 2. The bound is constructed such that it touches the likelihood function at the current estimate, and it is only close to the true likelihood in the neighborhood of this estimate. The bound and its corresponding probability table are computed in each iteration, as shown in Figure 3. In this case, EM was run for 5 iterations.

In the M-step, the lower bound is maximized (shown by a black asterisk in the figure), and the corresponding new estimate $(\theta_1, \theta_2)$ is guaranteed to lie closer to the location of the nearest local maximum of the likelihood. Each next bound is an increasingly better approximation to the mode of the likelihood, until at convergence the bound touches the likelihood at the local maximum, and progress can no longer be made. This is shown in the last panel of Figure 3.

2 EM as Lower Bound Maximization

EM can be derived in many different ways, one of the most insightful being in terms of lower bound maximization (Neal and Hinton, 1998; Minka, 1998), as illustrated with the example from Section 1. In this section, we derive the EM algorithm on that basis, closely following (Minka, 1998).

The goal is to maximize the posterior probability (1) of the parameters $\Theta$ given the data $U$, in the presence of hidden data $J$. Equivalently, we can maximize the logarithm of the joint distribution (which is proportional to the posterior):

\[
\Theta^{*} = \operatorname*{argmax}_{\Theta} \log P(U, \Theta) = \operatorname*{argmax}_{\Theta} \log \sum_{J \in \mathcal{J}^{n}} P(U, J, \Theta) \qquad (2)
\]

The idea behind EM is to start with a guess $\Theta^t$ for the parameters, compute an easily computed lower bound $B(\Theta; \Theta^t)$ to the function $\log P(\Theta \mid U)$, and maximize that bound instead. If iterated, this procedure will converge to a local maximizer of the objective function, provided the bound improves at each iteration. To motivate this, note that the key problem with maximizing (2) is that it involves the logarithm of a (big) sum, which is difficult to deal with.

Fortunately, we can construct a tractable lower bound $B(\Theta; \Theta^t)$ that instead contains a sum of logarithms. To derive the bound, first trivially rewrite $\log P(U, \Theta)$ as

\[
\log P(U, \Theta) = \log \sum_{J \in \mathcal{J}^{n}} P(U, J, \Theta)
                  = \log \sum_{J \in \mathcal{J}^{n}} f^t(J)\, \frac{P(U, J, \Theta)}{f^t(J)}
\]

where $f^t(J)$ is an arbitrary probability distribution over the space $\mathcal{J}^{n}$ of hidden variables $J$. By Jensen's inequality, we have

\[
B(\Theta; \Theta^t) = \sum_{J \in \mathcal{J}^{n}} f^t(J) \log \frac{P(U, J, \Theta)}{f^t(J)}
\;\leq\; \log \sum_{J \in \mathcal{J}^{n}} f^t(J)\, \frac{P(U, J, \Theta)}{f^t(J)} = \log P(U, \Theta)
\]

Note that we have transformed a log of sums into a sum of logs, which was the prime motivation.

2.1 Finding an Optimal Bound

EM goes one step further and tries to find the best bound, defined as the bound $B(\Theta; \Theta^t)$ that touches the objective function $\log P(U, \Theta)$ at the current guess $\Theta^t$. Intuitively, finding the best bound at each iteration will guarantee that we obtain an improved estimate $\Theta^{t+1}$ when we locally maximize the bound with respect to $\Theta$. Since we know $B(\Theta; \Theta^t)$ to be a lower bound, the optimal bound at $\Theta^t$ can be found by maximizing

\[
B(\Theta^t; \Theta^t) = \sum_{J \in \mathcal{J}^{n}} f^t(J) \log \frac{P(U, J, \Theta^t)}{f^t(J)}
\]

with respect to the distribution $f^t(J)$. Introducing a Lagrange multiplier $\lambda$ to enforce the constraint $\sum_{J \in \mathcal{J}^{n}} f^t(J) = 1$, the objective becomes

\[
G(f^t) = \lambda \left[ 1 - \sum_{J \in \mathcal{J}^{n}} f^t(J) \right]
       + \sum_{J \in \mathcal{J}^{n}} f^t(J) \log P(U, J, \Theta^t)
       - \sum_{J \in \mathcal{J}^{n}} f^t(J) \log f^t(J)
\]

Taking the derivative with respect to $f^t(J)$ and solving, we obtain

\[
\frac{\partial G}{\partial f^t(J)} = -\lambda + \log P(U, J, \Theta^t) - \log f^t(J) - 1 = 0
\quad\Longrightarrow\quad
f^t(J) = \frac{P(U, J, \Theta^t)}{\sum_{J \in \mathcal{J}^{n}} P(U, J, \Theta^t)} = P(J \mid U, \Theta^t) \qquad (3)
\]

By examining the value of the resulting optimal bound at $\Theta^t$, we see that it indeed touches the objective function:

\[
B(\Theta^t; \Theta^t) = \sum_{J \in \mathcal{J}^{n}} P(J \mid U, \Theta^t) \log \frac{P(U, J, \Theta^t)}{P(J \mid U, \Theta^t)} = \log P(U, \Theta^t)
\]
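As a numerical sanity check of this result (not part of the original note), the sketch below enumerates all $2^6$ assignments for an illustrative unit-variance, equal-weight mixture and verifies that, with the optimal choice $f^t(J) = P(J \mid U, \Theta^t)$, the bound equals $\log P(U, \Theta)$ at $\Theta = \Theta^t$ and lies below it elsewhere. The data values and modelling choices are assumptions made only for this check.

```python
import numpy as np
from itertools import product

x = np.array([0.8, 1.2, 1.5, 2.7, 3.1, 3.4])        # illustrative 6-sample data set

def log_joint(theta, J):
    """log P(U, J, Theta) for fixed assignments J, assuming unit variances,
    equal mixing weights, and a flat prior on Theta."""
    mu = np.array(theta)[list(J)]
    return np.sum(-0.5 * (x - mu) ** 2 - 0.5 * np.log(2 * np.pi) + np.log(0.5))

assignments = list(product([0, 1], repeat=len(x)))   # all 2^6 = 64 data associations

def log_evidence(theta):
    """log P(U, Theta), summing over all assignments J."""
    return np.logaddexp.reduce([log_joint(theta, J) for J in assignments])

def bound(theta, theta_t):
    """B(Theta; Theta^t) = sum_J f^t(J) log[P(U, J, Theta) / f^t(J)]
    with the optimal f^t(J) = P(J | U, Theta^t)."""
    lj_t = np.array([log_joint(theta_t, J) for J in assignments])
    log_f = lj_t - np.logaddexp.reduce(lj_t)          # log P(J | U, Theta^t)
    lj = np.array([log_joint(theta, J) for J in assignments])
    return np.sum(np.exp(log_f) * (lj - log_f))

theta_t = (0.5, 3.5)
print(np.isclose(bound(theta_t, theta_t), log_evidence(theta_t)))  # True: touches at Theta^t
print(bound((1.5, 2.5), theta_t) <= log_evidence((1.5, 2.5)))      # True: a lower bound
```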

2.2 Maximizing The Bound

In order to maximize $B(\Theta; \Theta^t)$ with respect to $\Theta$, note that we can write it as

\[
B(\Theta; \Theta^t) = \langle \log P(U, J, \Theta) \rangle + H^t
                    = \langle \log P(U, J \mid \Theta) \rangle + \log P(\Theta) + H^t
                    = Q^t(\Theta) + \log P(\Theta) + H^t
\]

where $\langle \cdot \rangle$ denotes the expectation with respect to $f^t(J) = P(J \mid U, \Theta^t)$, and $Q^t(\Theta)$ is the expected complete log-likelihood, defined as

\[
Q^t(\Theta) = \langle \log P(U, J \mid \Theta) \rangle
\]

Here $P(\Theta)$ is the prior on the parameters $\Theta$, and $H^t = -\langle \log f^t(J) \rangle$ is the entropy of the distribution $f^t(J)$. Since $H^t$ does not depend on $\Theta$, we can maximize the bound with respect to $\Theta$ using the first two terms only:

\[
\Theta^{t+1} = \operatorname*{argmax}_{\Theta} B(\Theta; \Theta^t)
             = \operatorname*{argmax}_{\Theta} \left[ Q^t(\Theta) + \log P(\Theta) \right] \qquad (4)
\]

2.3 The EM Algorithm

At each iteration, the EM algorithm first finds an optimal lower bound $B(\Theta; \Theta^t)$ at the current guess $\Theta^t$ (equation 3), and then maximizes this bound to obtain an improved estimate $\Theta^{t+1}$ (equation 4). Because the bound is expressed as an expectation, the first step is called the expectation-step or E-step, whereas the second step is called the maximization-step or M-step. The EM algorithm can thus be conveniently summarized as:

E-step: calculate $f^t(J) = P(J \mid U, \Theta^t)$
M-step: $\Theta^{t+1} = \operatorname*{argmax}_{\Theta} \left[ Q^t(\Theta) + \log P(\Theta) \right]$

It is important to remember that $Q^t(\Theta)$ is calculated in the E-step by evaluating $f^t(J)$ using the current guess $\Theta^t$ (hence the superscript $t$), whereas in the M-step we are optimizing $Q^t(\Theta)$ with respect to the free variable $\Theta$ to obtain the new estimate $\Theta^{t+1}$. It can be proved that the EM algorithm converges to a local maximum of $\log P(U, \Theta)$, and thus equivalently maximizes the log-posterior $\log P(\Theta \mid U)$ (Dempster et al., 1977; McLachlan and Krishnan, 1997).
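To connect this summary back to the example of Section 1, here is a minimal sketch of the two steps for the two-mean mixture, again assuming unit variances, equal mixing weights, and a flat prior on $\Theta$ (so the M-step reduces to maximizing $Q^t(\Theta)$ alone); these modelling choices and the data are illustrative, not taken from the note.

```python
import numpy as np

def em_two_means(x, theta0, n_iter=5):
    """EM for the two-mean mixture example: unit variances, equal mixing
    weights, and a flat prior on Theta (so the M-step maximizes Q^t alone)."""
    theta = np.array(theta0, dtype=float)
    for _ in range(n_iter):
        # E-step: f^t(J) factorizes over samples; the per-sample posterior
        # association probabilities form the 2 x 6 table described in Section 1.
        p = np.exp(-0.5 * (x[None, :] - theta[:, None]) ** 2)
        resp = p / p.sum(axis=0, keepdims=True)
        # M-step: maximizing Q^t(theta) gives responsibility-weighted means.
        theta = (resp @ x) / resp.sum(axis=1)
    return theta

x = np.array([0.8, 1.2, 1.5, 2.7, 3.1, 3.4])       # same illustrative data as above
print(em_two_means(x, theta0=(0.0, 6.0)))           # estimates of (theta1, theta2)
```

With a non-flat prior, the M-step would instead maximize $Q^t(\Theta) + \log P(\Theta)$, for example shrinking each weighted mean toward a prior mean.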

3 A Relation to the Expected Log-Posterior

Note that we have chosen to define $Q^t(\Theta)$ as the expected log-likelihood, as in (Dempster et al., 1977; McLachlan and Krishnan, 1997), i.e.,

\[
Q^t(\Theta) = \langle \log P(U, J \mid \Theta) \rangle
\]

An alternative route is to compute the expected log-posterior (Tanner, 1996):

\[
\langle \log P(\Theta \mid U, J) \rangle
  = \langle \log P(U, J \mid \Theta) \rangle + \langle \log P(\Theta) \rangle - \langle \log P(U, J) \rangle \qquad (5)
\]

Here the second term does not depend on $J$ and can be taken out of the expectation, and the last term does not depend on $\Theta$. Hence, maximizing (5) with respect to $\Theta$ is equivalent to (4):

\[
\operatorname*{argmax}_{\Theta} \langle \log P(\Theta \mid U, J) \rangle
  = \operatorname*{argmax}_{\Theta} \left[ \langle \log P(U, J \mid \Theta) \rangle + \log P(\Theta) \right]
  = \operatorname*{argmax}_{\Theta} \left[ Q^t(\Theta) + \log P(\Theta) \right]
\]

This is of course identical to (4).

References

[1] Dempster, A., Laird, N., and Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1-38.

[2] Hartley, H. (1958). Maximum likelihood estimation from incomplete data. Biometrics, 14:174-194.

[3] McLachlan, G. and Krishnan, T. (1997). The EM Algorithm and Extensions. Wiley Series in Probability and Statistics. John Wiley & Sons.

[4] Minka, T. (1998). Expectation-Maximization as lower bound maximization. Tutorial published on the web at http://www-white.media.mit.edu/~tpminka/papers/em.html.

[5] Neal, R. and Hinton, G. (1998). A view of the EM algorithm that justifies incremental, sparse, and other variants. In Jordan, M., editor, Learning in Graphical Models. Kluwer Academic Press.

[6] Tanner, M. (1996). Tools for Statistical Inference. Springer Verlag, New York. Third edition.