Technical Details about the Expectation Maximization (EM) Algorithm

Dawen Liang
Columbia University
dliang@ee.columbia.edu
February 25, 2015

1 Introduction

Maximum Likelihood Estimation (MLE) is widely used as a method for estimating the parameters of a probabilistic model. The basic idea is to compute the parameter θ_MLE where:

    \theta_{\mathrm{MLE}} = \arg\max_{\theta \in \Theta} P(X \mid \theta)

P(X | θ) is the (observable) data likelihood. The parameter θ is sometimes omitted for notational simplicity. MLE is normally carried out by taking the derivative of the data likelihood P(X) with respect to the model parameter θ and solving the resulting equation. However, in cases where the model contains hidden (unobserved) variables, the derivative w.r.t. the model parameters does not have a closed-form solution. We illustrate this problem with a simple example of a mixture model with hidden variables.

1.1 An Example: Mixture of Bernoulli Distributions

Suppose we have N binary data points x_1, x_2, ..., x_N, each of which is drawn i.i.d. from one of K Bernoulli distributions with parameter q_k, so that p(x_n | q_k) = q_k^{x_n} (1 - q_k)^{1 - x_n}. The probability of picking the k-th Bernoulli component out of K is π_k, often referred to as the mixing proportion. We do not know beforehand which of the K components each data point is drawn from, so the variable representing these component assignments (defined below) is hidden in this mixture model. Write the data likelihood and log-likelihood:

    P(x) = \prod_{n=1}^{N} \sum_{k=1}^{K} \pi_k \, p(x_n \mid q_k)

    \log P(x) = \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k \, p(x_n \mid q_k)

with parameters θ = {π_k, q_k} and \sum_{k} \pi_k = 1. Notice that the second summation sits inside the log, which makes it hard to decouple the parameters (and leads to a non-closed-form solution).
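As a concrete illustration (not part of the original note), the following minimal numpy sketch evaluates this log-likelihood for a toy Bernoulli mixture; the data and parameter values are made up for the example.

    import numpy as np

    x = np.array([1, 0, 1, 1, 0])       # N = 5 binary observations (toy data)
    pi = np.array([0.4, 0.6])           # mixing proportions, sum to 1
    q = np.array([0.2, 0.9])            # Bernoulli parameter of each component

    # p(x_n | q_k) for every data point / component pair: shape (N, K)
    p_xn_given_qk = q[None, :] ** x[:, None] * (1.0 - q[None, :]) ** (1 - x[:, None])

    # log P(x) = sum_n log sum_k pi_k p(x_n | q_k): the inner sum sits inside the log,
    # which is exactly what prevents a closed-form maximum w.r.t. pi_k and q_k.
    log_lik = np.log(p_xn_given_qk @ pi).sum()
    print(log_lik)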

Taking the derivative w.r.t. q_k and setting it to 0, we obtain:

    0 = \sum_{n=1}^{N} \frac{\pi_k}{\sum_{j=1}^{K} \pi_j \, p(x_n \mid q_j)} \frac{\partial p(x_n \mid q_k)}{\partial q_k}

where

    \frac{\partial p(x_n \mid q_k)}{\partial q_k} = \Big( \frac{x_n}{q_k} - \frac{1 - x_n}{1 - q_k} \Big) p(x_n \mid q_k) \triangleq f(q_k; x_n) \, p(x_n \mid q_k)

Thus we have:

    0 = \sum_{n=1}^{N} \frac{\pi_k \, p(x_n \mid q_k)}{\sum_{j=1}^{K} \pi_j \, p(x_n \mid q_j)} \, f(q_k; x_n)

The exact form of f(q_k; x_n) is not important here; we only need to know that it is a function of q_k. As for the first term in the summation, we define the hidden variable z_n, representing the component assignment of data point x_n, as a vector of size K x 1: if x_n is drawn from the k-th component, z_nk = 1 while the remaining entries are all 0. We can evaluate the posterior distribution of z_nk:

    p(z_{nk} = 1 \mid x_n, \theta) = \frac{p(z_{nk} = 1) \, p(x_n \mid z_{nk} = 1)}{\sum_{j=1}^{K} p(z_{nj} = 1) \, p(x_n \mid z_{nj} = 1)} = \frac{\pi_k \, p(x_n \mid q_k)}{\sum_{j=1}^{K} \pi_j \, p(x_n \mid q_j)}

which is exactly the first term in the summation. Since z_nk is a binary variable, its expectation is E[z_nk] = p(z_nk = 1 | x_n, θ). We can think of this term as the responsibility of component k for data point x_n. Let us denote it γ(z_nk) for simplicity, following the notation of [1]. Now we know the MLE solution q_k^MLE must satisfy:

    0 = \sum_{n=1}^{N} \gamma(z_{nk}) \, f(q_k; x_n)

An analytical solution is impossible: in order to compute q_k^MLE we need γ(z_nk), which depends on q_k itself. This is not a coincidence; we will see it again shortly when we derive the MLE solution for the mixing proportions. For the mixing proportion π_k, take the derivative with a Lagrange multiplier λ (enforcing \sum_k \pi_k = 1) and set it to 0:

    0 = \sum_{n=1}^{N} \frac{p(x_n \mid q_k)}{\sum_{j=1}^{K} \pi_j \, p(x_n \mid q_j)} + \lambda

We can multiply both sides by π_k and sum over k:

    0 = \sum_{n=1}^{N} \sum_{k=1}^{K} \frac{\pi_k \, p(x_n \mid q_k)}{\sum_{j=1}^{K} \pi_j \, p(x_n \mid q_j)} + \lambda \sum_{k=1}^{K} \pi_k

Making use of the fact that \sum_k \pi_k = 1, we obtain λ = -N. Substituting λ back into the derivative and multiplying both sides by π_k, we obtain:

    N \pi_k = \sum_{n=1}^{N} \frac{\pi_k \, p(x_n \mid q_k)}{\sum_{j=1}^{K} \pi_j \, p(x_n \mid q_j)} = \sum_{n=1}^{N} \gamma(z_{nk})

We can define N_k ≜ \sum_{n} \gamma(z_{nk}), where N_k can be interpreted as the expected number of data points drawn from component k. Therefore:

    \pi_k = \frac{N_k}{N}

which again depends on γ(z_nk), while γ(z_nk) in turn depends on π_k.

This example of a mixture of Bernoulli distributions illustrates the difficulty of directly maximizing the likelihood for models with hidden variables. However, the derivation above suggests the intuition for an iterative algorithm: start by randomly initializing the parameters; then, in each iteration, compute γ(z_nk) based on the old parameters, and update the parameters using the current values of γ(z_nk). This is the basic intuition behind the Expectation Maximization (EM) algorithm.

2 EM in General

One of the problems with directly maximizing the observable data likelihood, as shown above, is that the summation sits inside the logarithm. So what if we work with the complete data likelihood P(X, Z) and then marginalize the hidden variable Z? Let's do the derivation. (There are many different derivations of the EM algorithm; the one presented here gives the most intuition to the author.) Start from the log-likelihood:

    \log P(X \mid \theta) = \log \sum_{Z} P(X, Z \mid \theta)

Here we take the variational point of view by introducing a variational distribution q(Z) and using the fact that the logarithm is concave:

    \log P(X \mid \theta) = \log \sum_{Z} q(Z) \frac{P(X, Z \mid \theta)}{q(Z)} \ge \sum_{Z} q(Z) \log \frac{P(X, Z \mid \theta)}{q(Z)}

By Jensen's inequality, we successfully move the summation out of the logarithm. We could of course determine q(Z) by working out when equality holds in Jensen's inequality; here, however, we take a different approach and first ask how much we have lost through Jensen's inequality:

    \log P(X \mid \theta) - \sum_{Z} q(Z) \log \frac{P(X, Z \mid \theta)}{q(Z)}
    = \sum_{Z} q(Z) \log P(X \mid \theta) - \sum_{Z} q(Z) \log \frac{P(X, Z \mid \theta)}{q(Z)}
    = \sum_{Z} q(Z) \log \frac{P(X \mid \theta) \, q(Z)}{P(X, Z \mid \theta)}
    = \sum_{Z} q(Z) \log \frac{q(Z)}{P(Z \mid X, \theta)}
    = \mathrm{KL}(q \,\|\, p)

Thus, the difference is exactly the KL-divergence between the variational distribution q(Z) and the posterior distribution P(Z | X, θ). We can therefore rearrange the log-likelihood as:

    \log P(X \mid \theta) = \underbrace{\sum_{Z} q(Z) \log \frac{P(X, Z \mid \theta)}{q(Z)}}_{\mathcal{L}(q, \theta)} + \mathrm{KL}(q \,\|\, p)

Since the KL-divergence is non-negative for any q and p, L(q, θ) acts as a lower bound on the log-likelihood. Let us denote the log-likelihood by l(θ) for simplicity.

The EM algorithm has two steps, as its name suggests: an Expectation (E) step and a Maximization (M) step. In the E step, from the variational point of view, our goal is to choose a distribution q(Z) that best approximates the log-likelihood. At this point we have the existing parameter θ_old, so we set the variational distribution q(Z) equal to the posterior distribution P(Z | X, θ_old), which makes KL(q || p) = 0 and the lower bound L(q, θ) equal to l(θ). In the M step, we hold q(Z) fixed and maximize L(q, θ), which is equivalent to maximizing l(θ), w.r.t. the parameter θ. Unless we have reached convergence, the lower bound L(q, θ) increases with the new parameter θ_new. Since θ has changed since the E step, the KL-divergence no longer equals 0, which re-opens a gap between L(q, θ) and l(θ); this gap is closed again in the next E step.

To see the M-step objective analytically, substitute q(Z) = P(Z | X, θ_old) from the E step:

    \mathcal{L}(q, \theta) = \sum_{Z} P(Z \mid X, \theta_{\mathrm{old}}) \log \frac{P(X, Z \mid \theta)}{P(Z \mid X, \theta_{\mathrm{old}})}
    = \underbrace{\sum_{Z} P(Z \mid X, \theta_{\mathrm{old}}) \log P(X, Z \mid \theta)}_{Q(\theta, \theta_{\mathrm{old}})} - H\{P(Z \mid X, \theta_{\mathrm{old}})\}

where H{P(Z | X, θ_old)} denotes the negative entropy of P(Z | X, θ_old), which does not involve the parameter θ and can therefore be treated as a constant. What really matters is Q(θ, θ_old), which we can view as the expectation of the complete data log-likelihood log P(X, Z | θ) under the posterior distribution P(Z | X, θ_old). Chapter 9 of [1] contains several nicely drawn figures visualizing the whole EM procedure.

The sweet spot of the M step is that, instead of directly maximizing P(X), which has no closed-form solution, working with P(X, Z) is generally much simpler because we can treat it as a model with no hidden variables. The EM algorithm is often cited as a typical example of coordinate ascent: in each E/M step we hold one variable fixed (θ_old in the E step, q(Z) in the M step) and maximize with respect to the other. Coordinate ascent is widely used in numerical optimization.
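To make the decomposition log P(X | θ) = L(q, θ) + KL(q || p) concrete, here is a small numerical check (my own illustration, not part of the original note) for a toy Bernoulli mixture, where each z_n takes only K values; the variational distribution q_var is chosen arbitrarily to show that the identity holds for any q and that the bound becomes tight exactly when q equals the posterior.

    import numpy as np

    x = np.array([1, 0, 1, 1, 0])
    pi = np.array([0.4, 0.6])
    q_param = np.array([0.2, 0.9])

    # joint p(x_n, z_n = k | theta) and marginal p(x_n | theta)
    joint = pi[None, :] * q_param[None, :] ** x[:, None] * (1 - q_param[None, :]) ** (1 - x[:, None])
    marginal = joint.sum(axis=1)
    posterior = joint / marginal[:, None]

    # an arbitrary variational distribution q(z_n) (rows sum to 1)
    q_var = np.tile(np.array([0.5, 0.5]), (len(x), 1))

    lower_bound = np.sum(q_var * np.log(joint / q_var))    # L(q, theta)
    kl = np.sum(q_var * np.log(q_var / posterior))         # KL(q || p)
    log_lik = np.log(marginal).sum()                       # log P(X | theta)

    assert np.isclose(log_lik, lower_bound + kl)           # the decomposition holds
    # setting q_var = posterior would make kl = 0 and lower_bound = log_lik (the E step)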

3 EM Applications in Mixture Models

3.1 Mixture of Bernoulli Revisited

Now let's go back to the problem we encountered earlier with the mixture of Bernoulli distributions. Assume we have pre-set initial values θ_old = {π_k^old, q_k^old} for the parameters. We first write the complete likelihood and log-likelihood:

    P(x, z) = \prod_{n=1}^{N} \prod_{k=1}^{K} \big( \pi_k \, q_k^{x_n} (1 - q_k)^{1 - x_n} \big)^{z_{nk}}

    \log P(x, z) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \big( \log \pi_k + x_n \log q_k + (1 - x_n) \log (1 - q_k) \big)

Note how this differs from the observable data log-likelihood in Section 1.1. What we want to maximize in the M step is actually E_z[log P(x, z)], where Z has been decomposed into the z_nk in the complete log-likelihood. We have already shown in Section 1.1 that E[z_nk] = γ(z_nk), thus:

    E_z[\log P(x, z)] = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \big( \log \pi_k + x_n \log q_k + (1 - x_n) \log (1 - q_k) \big)

Taking the derivative w.r.t. q_k and setting it to 0, we obtain:

    q_k^{\mathrm{new}} = \frac{\sum_{n} \gamma(z_{nk}) \, x_n}{\sum_{n} \gamma(z_{nk})}

As for π_k, we similarly obtain:

    \pi_k^{\mathrm{new}} = \frac{\sum_{n} \gamma(z_{nk})}{N}

To summarize the two steps of the EM algorithm for the mixture of Bernoulli distributions:

E step: Compute γ(z_nk) with the current parameters θ_old = {π_k^old, q_k^old}:

    \gamma(z_{nk}) = p(z_{nk} = 1 \mid x_n, \theta) = \frac{\pi_k^{\mathrm{old}} \, p(x_n \mid q_k^{\mathrm{old}})}{\sum_{j=1}^{K} \pi_j^{\mathrm{old}} \, p(x_n \mid q_j^{\mathrm{old}})}

M step: Update π_k and q_k:

    \pi_k^{\mathrm{new}} = \frac{\sum_{n} \gamma(z_{nk})}{N}, \qquad q_k^{\mathrm{new}} = \frac{\sum_{n} \gamma(z_{nk}) \, x_n}{\sum_{n} \gamma(z_{nk})}

and set {π_k^old, q_k^old} ← {π_k^new, q_k^new} for the next iteration.
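The two updates above translate almost line for line into code. The following is a minimal numpy sketch of this EM loop (not from the original note; the data, initialization, and fixed iteration count are illustrative choices):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.integers(0, 2, size=200).astype(float)    # toy binary data, N = 200

    K = 2
    pi = np.full(K, 1.0 / K)                          # initial mixing proportions
    q = rng.uniform(0.25, 0.75, size=K)               # initial Bernoulli parameters

    for _ in range(50):                               # fixed number of EM iterations
        # E step: responsibilities gamma(z_nk), shape (N, K)
        lik = q[None, :] ** x[:, None] * (1 - q[None, :]) ** (1 - x[:, None])
        gamma = pi[None, :] * lik
        gamma /= gamma.sum(axis=1, keepdims=True)

        # M step: closed-form updates derived above
        Nk = gamma.sum(axis=0)                        # expected counts per component
        pi = Nk / len(x)
        q = (gamma * x[:, None]).sum(axis=0) / Nk

    print(pi, q)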

3.2 Mixture of Gaussian Distributions

A common scenario for applying the EM algorithm is estimating the parameters of a mixture of Gaussian distributions, or Gaussian Mixture Model (GMM). The EM solution for the GMM is very similar to the one for the mixture of Bernoulli distributions derived above. Now assume we have D-dimensional vectors {x_1, ..., x_N}, each of which is drawn i.i.d. from a Gaussian distribution N(µ_k, Σ_k) with mixing proportion π_k. Define the hidden variable z_n the same way as in Section 1.1. Write the complete log-likelihood:

    \log P(X, Z) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \big( \log \pi_k + \log \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \big)

and its expectation under the posterior distribution:

    E_z[\log P(X, Z)] = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \big( \log \pi_k + \log \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \big)

Note that in the GMM the k-th component follows N(µ_k, Σ_k), thus:

    \gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}

Following the same derivation as for the mixture of Bernoulli distributions, we obtain µ_k, Σ_k, and π_k by taking derivatives and setting them to 0:

    \mu_k^{\mathrm{new}} = \frac{\sum_{n} \gamma(z_{nk}) \, x_n}{\sum_{n} \gamma(z_{nk})}

    \Sigma_k^{\mathrm{new}} = \frac{\sum_{n} \gamma(z_{nk}) (x_n - \mu_k^{\mathrm{new}})(x_n - \mu_k^{\mathrm{new}})^{T}}{\sum_{n} \gamma(z_{nk})}

    \pi_k^{\mathrm{new}} = \frac{\sum_{n} \gamma(z_{nk})}{N}

These derivations are almost exactly the same as for the mixture of Bernoulli distributions, except that in the GMM there is one more parameter to estimate, the covariance matrix Σ_k. Formally, the two steps of the EM algorithm for the GMM are:

E step: Compute γ(z_nk) with the current parameters θ_old = {π_k^old, µ_k^old, Σ_k^old}:

    \gamma(z_{nk}) = p(z_{nk} = 1 \mid x_n, \theta) = \frac{\pi_k^{\mathrm{old}} \, \mathcal{N}(x_n \mid \mu_k^{\mathrm{old}}, \Sigma_k^{\mathrm{old}})}{\sum_{j=1}^{K} \pi_j^{\mathrm{old}} \, \mathcal{N}(x_n \mid \mu_j^{\mathrm{old}}, \Sigma_j^{\mathrm{old}})}

M step: Update π_k, µ_k, and Σ_k:

    \pi_k^{\mathrm{new}} = \frac{\sum_{n} \gamma(z_{nk})}{N}, \qquad \mu_k^{\mathrm{new}} = \frac{\sum_{n} \gamma(z_{nk}) \, x_n}{\sum_{n} \gamma(z_{nk})}, \qquad \Sigma_k^{\mathrm{new}} = \frac{\sum_{n} \gamma(z_{nk}) (x_n - \mu_k^{\mathrm{new}})(x_n - \mu_k^{\mathrm{new}})^{T}}{\sum_{n} \gamma(z_{nk})}

and set {π_k^old, µ_k^old, Σ_k^old} ← {π_k^new, µ_k^new, Σ_k^new} for the next iteration.
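As with the Bernoulli case, these updates map directly onto code. Below is a minimal numpy sketch of EM for a GMM (again my own illustration, not from the original note); it uses scipy.stats.multivariate_normal for the Gaussian density, synthetic 2-D data, and omits safeguards such as covariance regularization or a convergence check.

    import numpy as np
    from scipy.stats import multivariate_normal

    rng = np.random.default_rng(1)
    # synthetic 2-D data from two well-separated clusters
    X = np.vstack([rng.normal(0, 1, size=(150, 2)),
                   rng.normal(4, 1, size=(150, 2))])
    N, D = X.shape
    K = 2

    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)]      # initialize means at random points
    Sigma = np.stack([np.eye(D) for _ in range(K)])   # initialize covariances at identity

    for _ in range(100):
        # E step: responsibilities gamma(z_nk)
        dens = np.column_stack([multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
                                for k in range(K)])
        gamma = pi[None, :] * dens
        gamma /= gamma.sum(axis=1, keepdims=True)

        # M step: closed-form updates derived above
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]

    print(pi)
    print(mu)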

3.3 Thoughts on the Mixture Models

A recent blog post by Larry Wasserman (http://normaldeviate.wordpress.com/2012/08/04/mixture-models-the-twilightzone-of-statistics/) pointed out the merits and defects of mixture models, mainly from the point of view of a statistician. Clearly Larry is not a fan of mixture models, and I do agree with some of his points. However, as a not-too-deep-into-math machine learner, I still find mixture models (especially GMMs) capable of doing the job in some applications, which the theoretical results do not care about.

4 Variations of the EM

The EM algorithm can be thought of as a special case of a more general algorithm named GEM (generalized EM), in which the maximization in the M step may be only partially carried out; the likelihood can still be increased in this case. Similarly, the E step can also be performed partially, which leads to an incremental algorithm, as summarized in [2].

The EM algorithm described in Section 2 (call it the normal EM) keeps the two steps completely separate, which is computationally inefficient. In the E step, we calculate the component responsibility γ(z_nk) for every data point and every possible component assignment and store all N x K values, since the M-step update rules require all of them. What incremental EM does is make these two steps coherent. As proved in [2], both q(Z) and log P(X, Z | θ) factorize over the data points, assuming the data points are independent. This in turn leads to a factorization of L(q, θ), with local maxima unchanged:

    \mathcal{L}(q, \theta) = \sum_{n=1}^{N} \mathcal{L}_n(q_n, \theta)

where

    \mathcal{L}_n(q_n, \theta) = E_{q_n}[\log P(x_n, z_n \mid \theta)] + H\{q_n\}

with H{q_n} the entropy of q_n. Note that the factorization (a product) becomes a summation in the logarithm domain. We have shown that in the EM algorithm both the E and the M step can be seen as maximizing L(q, θ) in a coordinate ascent manner; an incremental version of EM with a partial E step is therefore given in Algorithm 1.

Algorithm 1: EM algorithm with a partially implemented E step
Initialization: Randomly initialize θ_old and q_n (q_n does not have to be consistent with θ_old).
Repeat until convergence:
  E step: Select some data point i to be updated:
    - Leave q_j unchanged for the data points j ≠ i.
    - Update q_i to maximize L_i(q_i, θ_old), given by the value p(z_i | x_i, θ_old).
  M step: Update the parameters in the same way as in the normal EM algorithm.

We can make strategic decisions about how to choose data points, e.g. preferring data points whose component assignments we are more uncertain about (q_i has not yet stabilized). Essentially, however, this variation does not remove the inefficiency mentioned earlier: the bottleneck is in the M step. To address this problem, assume the complete log-likelihood can be summarized by sufficient statistics S(X, Z) (e.g. for the exponential family) that factorize as:

    S(X, Z) = \sum_{n=1}^{N} S_n(x_n, z_n)    (1)

We can rewrite the normal EM algorithm in terms of these sufficient statistics (omitted here), but more importantly we can exploit their additivity: in each iteration we only compute how much S(X, Z) changes for the chosen data points, and then we can directly obtain the updated parameters from the updated sufficient statistics.
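To make the sufficient-statistics idea concrete, here is a small sketch (my own illustration, not from [2] or the original note) for the Bernoulli mixture, where the per-point statistics are the responsibilities γ(z_nk) and γ(z_nk) x_n; refreshing a single data point only touches its own contribution before the parameters are re-read from the running totals.

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.integers(0, 2, size=200).astype(float)
    N, K = len(x), 2

    pi = np.full(K, 1.0 / K)
    q = rng.uniform(0.25, 0.75, size=K)

    # per-point responsibilities and running sufficient statistics
    gamma = np.full((N, K), 1.0 / K)       # q_i, need not be consistent with theta initially
    S0 = gamma.sum(axis=0)                 # sum_n gamma(z_nk)
    S1 = gamma.T @ x                       # sum_n gamma(z_nk) x_n

    for step in range(20 * N):
        i = step % N                       # sweep through the data points
        # incremental E step: refresh only q_i = p(z_i | x_i, theta)
        lik = q ** x[i] * (1 - q) ** (1 - x[i])
        new_gamma = pi * lik
        new_gamma /= new_gamma.sum()
        # update the running statistics: remove the old contribution, add the new one
        S0 += new_gamma - gamma[i]
        S1 += (new_gamma - gamma[i]) * x[i]
        gamma[i] = new_gamma
        # M step: parameters follow directly from the running statistics
        pi = S0 / N
        q = S1 / S0

    print(pi, q)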

This incremental EM is described in Algorithm 2.

Algorithm 2: Incremental EM algorithm
Initialization: Randomly initialize θ_old and S_n^old (S_n^old does not have to be consistent with θ_old); S^old is computed via Eq. 1.
Repeat until convergence:
  E step: Select some data point i to be updated:
    - Leave S_j unchanged from the previous iteration for j ≠ i.
    - Set S_i = E_{q_i}[S_i(x_i, z_i)], where q_i = p(z_i | x_i, θ_old).
    - Set S = S^old − S_i^old + S_i.
  M step: Update θ based on the current value of S.

Note that in Algorithm 2 both the E and the M step are independent of the number of data points. As mentioned above, the benefit comes from the incremental sufficient statistics, which reflect changes to the complete log-likelihood immediately; speedy convergence can be achieved because we fully utilize the intermediate computation.

A sparse variant of incremental EM is further proposed in [2]. The intuition behind this variant is that in many cases only a small portion of all possible values of the hidden variable has non-negligible probability. Substantial computation can be saved by freezing the negligible probabilities for many iterations and updating only the values chosen as plausible. Such an EM algorithm still guarantees that the log-likelihood increases after each iteration while, most importantly, being computationally efficient.

References

[1] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.

[2] R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. NATO ASI Series D, Behavioural and Social Sciences, 89:355-370, 1998.