Statistical Machine Learning Notes 10
Expectation Maximization and Mixtures of Gaussians
Instructor: Justin Domke

Contents

1 Introduction
2 Preliminary: Jensen's Inequality
3 Expectation Maximization in General
  3.1 M-Step
  3.2 E-Step
4 Convergence
5 Mixtures of Gaussians
  5.1 E-Step
  5.2 M-Step
6 Examples
  6.1 2 clusters
  6.2 3 clusters
  6.3 5 clusters

1 Introduction

Suppose we have decided to fit a maximum-likelihood model. How should we do it? There are three major approaches:

1. Closed-form solution.

2. Apply a nonlinear optimization tool. In general, there is no guarantee of convergence to the global optimum, though there are such guarantees in many special cases.

3. Expectation Maximization.

2 Preliminary: Jensen's Inequality

For any distribution q(x), and for any concave function f,

    f( \sum_x x q(x) ) ≥ \sum_x q(x) f(x).

Intuitively speaking, this says that if we compute f at the expected value of x, the result will be at least as high as the expected value of f.

3 Expectation Maximization in General

Consider fitting some model p(x, z; θ), but we only have data for x. (We will return to the issue of when this might come up later on.) What to do? A reasonable approach is to maximize the log-likelihood of x alone. Here, we assume that z is discrete.

    log p(x; θ) = log \sum_z p(x, z; θ)                                   (3.1)

How should we maximize this with respect to θ? One option is, well, just to maximize it. For a given θ, we can compute the right-hand side of Eq. 3.1. So, there is nothing to prevent us from just taking the gradient of this expression and running gradient descent (or whatever). There is absolutely nothing wrong with this approach. However, in certain cases, it can get quite complicated.
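To make Eq. 3.1 concrete, here is a minimal sketch (mine, not from the notes) of evaluating the marginal log-likelihood for a hypothetical toy model in which z selects one of two biased coins and x is the number of heads seen in 10 flips of that coin. This function is exactly what the "just maximize it directly" option would hand to a generic optimizer, e.g. scipy.optimize.minimize applied to its negation.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import binom

# Hypothetical toy model (not from the notes): z in {0, 1} picks one of two
# biased coins, and x is the number of heads seen in 10 flips of that coin.
# theta = (pi0, q0, q1): the weight of coin 0 and the two head-probabilities.

def log_joint(x, theta):
    """log p(x, z; theta) as a (2, len(x)) array, one row per value of z."""
    pi0, q0, q1 = theta
    return np.stack([
        np.log(pi0) + binom.logpmf(x, 10, q0),
        np.log(1.0 - pi0) + binom.logpmf(x, 10, q1),
    ])

def log_marginal(x, theta):
    """log p(x; theta) = log sum_z p(x, z; theta), i.e. Eq. 3.1, per data point."""
    return logsumexp(log_joint(x, theta), axis=0)

x = np.array([3, 9, 8, 2, 10, 1])               # observed head counts
print(log_marginal(x, (0.5, 0.2, 0.85)).sum())  # log-likelihood of the data
```

The same two helpers are reused in the sketches below.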

Now, consider any distribution r(z|x). We have

    log p(x; θ) = log \sum_z r(z|x) [ p(x, z; θ) / r(z|x) ].

Applying Jensen's inequality, we have

    log p(x; θ) ≥ \sum_z r(z|x) log [ p(x, z; θ) / r(z|x) ].

Putting this all together, we can create a lower bound on the log-likelihood:

    l(θ) = \sum_x log p(x; θ)  ≥  q(θ, r) = \sum_x \sum_z r(z|x) log [ p(x, z; θ) / r(z|x) ],

where the sums over x range over the observed data. The basic idea of EM is very simple. Instead of directly maximizing l, maximize the lower bound q. One quick way to do this would be to just pick some random distribution r, and then maximize with respect to θ. This works, but we can do better. Why not also maximize with respect to the distribution r? Thus, EM alternates between two steps:

• Maximize q with respect to r. This is called the Expectation (E) step.

• Maximize q with respect to θ. This is called the Maximization (M) step.

The names for these steps may seem somewhat confusing at the moment, but will become clear later on. Before continuing, here is the algorithm in the abstract.

Abstract Expectation Maximization. Iterate until convergence:

    For all x,  r(z|x) ← p(z|x; θ)                                        (E-step)

    θ ← arg max_θ \sum_x \sum_z r(z|x) log p(x, z; θ)                     (M-step)
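Continuing the hypothetical coin model from the sketch above, the following lines check numerically that q(θ, r) really is a lower bound on l(θ) for an arbitrary distribution r; they reuse the log_joint and log_marginal helpers defined there.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = (0.5, 0.2, 0.85)
x = np.array([3, 9, 8, 2, 10, 1])

lj = log_joint(x, theta)              # (2, n): log p(x, z; theta)
l = log_marginal(x, theta).sum()      # l(theta)

# An arbitrary distribution r(z|x): each column sums to one.
r = rng.uniform(0.1, 1.0, size=lj.shape)
r /= r.sum(axis=0, keepdims=True)

# q(theta, r) = sum_x sum_z r(z|x) log [ p(x, z; theta) / r(z|x) ]
q = np.sum(r * (lj - np.log(r)))
print(q <= l)                         # True: q lower-bounds l
```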

Let's look at how these rules emerge, and try to understand the properties of EM. First, let us consider the M-step.

3.1 M-Step

So far, we haven't really discussed the point of EM. We introduce a lower bound on l, then alternate between two different optimization steps. So what? Why should this help us? Consider the M-step,

    θ ← arg max_θ \sum_x \sum_z r(z|x) log p(x, z; θ).

This is essentially just a regular maximum-likelihood problem, just weighted by r(z|x). This can be very convenient. As we will see below, in some circumstances where maximizing l with respect to θ would entail a difficult numerical optimization, the above optimization can be done in closed form. This increased convenience is the real gain of our odd lower-bounding strategy.

3.2 E-Step

On the other hand, we appear to pay a price for the convenience of the M-step. Namely, we aren't fitting l any more, just a lower bound on l. How steep is this price? Well, let's consider the maximization with respect to r. What we need to do is, independently for each x, maximize

    \sum_z r(z|x) log p(x, z; θ) - \sum_z r(z|x) log r(z|x).

Theorem 1. This is maximized by setting

    r(z|x) = p(z|x; θ).

Proof. Our problem is

    max_{r(·|x)}  \sum_z r(z|x) log p(x, z; θ) - \sum_z r(z|x) log r(z|x)
    s.t.  r(z|x) ≥ 0,  \sum_z r(z|x) = 1.

Form the Lagrangian

    L = \sum_z r(z|x) log p(x, z; θ) - \sum_z r(z|x) log r(z|x) + \sum_z λ_z r(z|x) + ν ( 1 - \sum_z r(z|x) ).

For a given set of Lagrange multipliers {λ_z} and ν, we can find the maximizing r(z|x) from

    dL/dr(z|x) = log p(x, z; θ) - log r(z|x) - 1 + λ_z - ν = 0.

Since the maximizing r(z|x) turns out to be strictly positive, we can take λ_z = 0. From this it follows that

    r(z|x) ∝ p(x, z; θ).

Now, the only way that this can be true while also having r be a normalized distribution is for it to be the conditional distribution of p:

    r(z|x) = p(z|x; θ).

So this proof explains the meaning of the expectation step. All that we need to do, in order to maximize with respect to r, is set each conditional distribution r(z|x) to be the conditional distribution p(z|x; θ).

4 Convergence

How well should EM work, compared to a traditional maximization of the likelihood in Eq. 3.1? To answer this, consider the value of q at the completion of the E-step:

    q(θ, r) = \sum_x \sum_z r(z|x) log [ p(x, z; θ) / r(z|x) ]
            = \sum_x \sum_z p(z|x; θ) log [ p(x, z; θ) / p(z|x; θ) ]
            = \sum_x \sum_z p(z|x; θ) log [ p(z|x; θ) p(x; θ) / p(z|x; θ) ]
            = \sum_x \sum_z p(z|x; θ) log p(x; θ)
            = \sum_x log p(x; θ)
            = l(θ).

After the completion of the E-step, our lower bound on the likelihood is tight. This is fantastic news.
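To see this tightness, and the monotonicity it implies, numerically, here is a minimal EM loop for the same hypothetical coin model, reusing log_joint and log_marginal from the earlier sketch. The closed-form M-step for this toy model is just a weighted maximum-likelihood estimate, derived here only for illustration; it is not an example from the notes.

```python
import numpy as np
from scipy.special import logsumexp

x = np.array([3, 9, 8, 2, 10, 1])     # head counts out of 10 flips each
theta = (0.5, 0.3, 0.6)               # initial (pi0, q0, q1)
prev_l = -np.inf

for it in range(50):
    lj = log_joint(x, theta)          # (2, n): log p(x, z; theta)
    l = log_marginal(x, theta).sum()  # l(theta)
    assert l >= prev_l - 1e-9         # the likelihood never decreases
    prev_l = l

    # E-step: r(z|x) = p(z|x; theta)
    r = np.exp(lj - logsumexp(lj, axis=0, keepdims=True))

    # After the E-step the bound is tight: q(theta, r) = l(theta).
    q = np.sum(r * (lj - np.log(np.maximum(r, 1e-300))))  # guard against log(0)
    assert np.allclose(q, l)

    # M-step: weighted maximum likelihood for the toy coin model.
    pi0 = r[0].mean()
    q0 = (r[0] * x).sum() / (10.0 * r[0].sum())
    q1 = (r[1] * x).sum() / (10.0 * r[1].sum())
    theta = (pi0, q0, q1)

print(theta, prev_l)
```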

Suppose that EM converges to some r and θ. By the definition of convergence, θ must be a (local) maximum of q. However, from the tightness of the bound, we know that if it were possible to locally improve l, it would also be possible¹ to locally improve q. Thus, at convergence, we have also achieved a local maximum of l. Thus, we should not expect to pay a penalty in accuracy for the convenience of EM. On the other hand, note that Expectation-Maximization does not remove the issue of local optima: both in theory and in practice, local optima remain a real problem.

¹ To be precise, it is possible to show that not only will q(θ, r) = l(θ) at convergence, but also that dq(θ, r)/dθ = dl(θ)/dθ:

    dq/dθ = \sum_x \sum_z r(z|x) d/dθ log [ p(x, z; θ) / r(z|x) ]
          = \sum_x \sum_z r(z|x) d/dθ log p(x, z; θ)
          = \sum_x \sum_z r(z|x) [ 1 / p(x, z; θ) ] d/dθ p(x, z; θ)
          = \sum_x \sum_z p(z|x; θ) [ 1 / p(x, z; θ) ] d/dθ p(x, z; θ)
          = \sum_x \sum_z [ 1 / p(x; θ) ] d/dθ p(x, z; θ),

    dl/dθ = \sum_x d/dθ log \sum_z p(x, z; θ)
          = \sum_x [ 1 / \sum_z p(x, z; θ) ] \sum_z d/dθ p(x, z; θ)
          = \sum_x \sum_z [ 1 / p(x; θ) ] d/dθ p(x, z; θ).

5 Mixtures of Gaussians

The above is all very abstract. Let's remember the supposed advantages of EM. We have claimed that there are some circumstances where a direct maximization of l(θ) would be hard (in implementation, running time, or otherwise), while performing each of the steps in EM is easy. This would all be more credible if we actually saw at least one such case. The example we consider here is by far the most common example of EM, namely learning mixtures of Gaussians. A mixture of Gaussians is defined by

    p(x) = \sum_z π_z N(x; µ_z, Σ_z),

where π_z ≥ 0 and \sum_z π_z = 1. The basic argument for mixtures of Gaussians is that they can approximate almost anything.

Now, once again, we could directly maximize the log-likelihood of a mixture of Gaussians. It would be a little painful, but certainly doable. The following application of EM gives a simpler algorithm.
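Before the EM version, here is a rough sketch of what that direct route would operate on: the mixture log-likelihood assembled from standard library calls (the parameters and data below are made up purely to show the computation). One would hand this, suitably parameterized, to a gradient-based optimizer, with the extra annoyance of keeping the π_z on the simplex and each Σ_z positive definite.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pis, mus, Sigmas):
    """sum_x log p(x), where p(x) = sum_z pi_z N(x; mu_z, Sigma_z)."""
    # log [ pi_z N(x; mu_z, Sigma_z) ] for every component z and data point x
    log_joint = np.stack([
        np.log(pi) + multivariate_normal.logpdf(X, mean=mu, cov=Sigma)
        for pi, mu, Sigma in zip(pis, mus, Sigmas)
    ])
    return logsumexp(log_joint, axis=0).sum()

# Made-up parameters and data, purely to show the call.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
pis = [0.5, 0.5]
mus = [np.zeros(2), np.ones(2)]
Sigmas = [np.eye(2), np.eye(2)]
print(gmm_log_likelihood(X, pis, mus, Sigmas))
```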

To apply EM, we introduce a latent component label z and define

    p(z) = π_z,    p(x|z) = N(x; µ_z, Σ_z),    p(x, z) = π_z N(x; µ_z, Σ_z).

This is quite a clever idea. We do not really have hidden variables, but we introduce them for the convenience of applying the EM algorithm.

5.1 E-Step

    r(z|x) ← p(z|x) = p(x, z) / p(x) = π_z N(x; µ_z, Σ_z) / [ \sum_{z'} π_{z'} N(x; µ_{z'}, Σ_{z'}) ].

This is not hard at all, and can simply be computed in closed form.

5.2 M-Step

Do the maximization

    arg max_{ {π_z}, {µ_z}, {Σ_z} }  \sum_x \sum_z r(z|x) log p(x, z; θ)
      = arg max  \sum_x \sum_z r(z|x) log ( π_z N(x; µ_z, Σ_z) )
      = arg max  \sum_x \sum_z r(z|x) log N(x; µ_z, Σ_z) + \sum_x \sum_z r(z|x) log π_z.

Maximizing with respect to π_z is quite easy². This is done by setting

    π_z = [ \sum_x r(z|x) ] / [ \sum_x \sum_{z'} r(z'|x) ].

² Consider the Lagrangian

    L = \sum_x \sum_z r(z|x) log π_z + λ ( 1 - \sum_z π_z ).

Here, we aren't explicitly enforcing that π_z ≥ 0. As we see below, however, we can get away with this, since log π_z → -∞ as π_z → 0. For some fixed Lagrange multiplier λ, we can find the best weights π_z from

    dL/dπ_z = \sum_x r(z|x) ( 1 / π_z ) - λ = 0.

This results in the condition that π_z ∝ \sum_x r(z|x). However, since we know that \sum_z π_z = 1, we can just normalize these scores and recover the answer.

The maximization with respect to µ_z and Σ_z can be accomplished by doing (separately for each z)

    arg max_{µ_z, Σ_z}  \sum_x r(z|x) log N(x; µ_z, Σ_z).

It turns out that this can be done by setting

    µ_z ← [ 1 / \sum_x r(z|x) ] \sum_x r(z|x) x,

    Σ_z ← [ 1 / \sum_x r(z|x) ] \sum_x r(z|x) (x - µ_z)(x - µ_z)^T.

Note the similarity of these updates to fitting a normal Gaussian distribution. The only difference here is that we compute the weighted mean and the weighted covariance matrix.
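Putting the E-step and M-step together, a minimal NumPy implementation of these updates might look as follows. This is a sketch of the equations above, not the code behind the figures in Section 6; the function name, the initialization, and the small regularizer added to each covariance for numerical stability are my own choices.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iters=50, seed=0, reg=1e-6):
    """EM for a mixture of k Gaussians. Returns (pis, mus, Sigmas, log_likelihoods)."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    # Simple initialization: uniform weights, random data points as means, identity covariances.
    pis = np.full(k, 1.0 / k)
    mus = X[rng.choice(n, size=k, replace=False)].copy()
    Sigmas = np.stack([np.eye(d) for _ in range(k)])
    log_liks = []

    for _ in range(n_iters):
        # log [ pi_z N(x; mu_z, Sigma_z) ] for each component z and point x: shape (k, n)
        log_joint = np.stack([
            np.log(pis[z]) + multivariate_normal.logpdf(X, mean=mus[z], cov=Sigmas[z])
            for z in range(k)
        ])
        log_px = logsumexp(log_joint, axis=0)   # log p(x; theta) for each x
        log_liks.append(log_px.sum())           # l(theta) before this iteration's update

        # E-step: r(z|x) = pi_z N(x; mu_z, Sigma_z) / sum_z' pi_z' N(x; mu_z', Sigma_z')
        r = np.exp(log_joint - log_px)          # (k, n), normalized in the log domain to avoid underflow

        # M-step: weighted maximum-likelihood updates
        Nz = r.sum(axis=1)                      # sum_x r(z|x)
        pis = Nz / Nz.sum()
        mus = (r @ X) / Nz[:, None]
        for z in range(k):
            diff = X - mus[z]
            Sigmas[z] = (r[z, :, None] * diff).T @ diff / Nz[z] + reg * np.eye(d)

    return pis, mus, Sigmas, log_liks
```

Working in the log domain for the E-step is a standard precaution: when a component is far from a data point, its density underflows to zero in linear scale, while the log-domain normalization stays well behaved.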

6 Examples

6.1 2 clusters

In this first experiment, we use only two Gaussians. We can see that little happens in the first few iterations, as the two Gaussians greatly overlap. As they start to cover different areas around 10 iterations, the likelihood rapidly increases.

[Figure: likelihood vs. iteration (0 to 50) for the 2-cluster run.]
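The notes do not specify the dataset behind these figures. As a stand-in, the following usage sketch runs the em_gmm function from the previous section on made-up two-dimensional data with two components and records the likelihood trace over 50 iterations, which is the quantity plotted above.

```python
import numpy as np

# Made-up 2-D data drawn from two well-separated Gaussians (not the notes' data).
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(loc=[-2.0, 0.0], scale=0.8, size=(150, 2)),
    rng.normal(loc=[+2.0, 1.0], scale=0.8, size=(150, 2)),
])

pis, mus, Sigmas, log_liks = em_gmm(X, k=2, n_iters=50)
print(mus)                          # typically lands near (-2, 0) and (+2, 1)
print(log_liks[0], log_liks[-1])    # the likelihood trace is non-decreasing
```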

[Figure: for iterations 1, 2, 3, and 5 of the 2-cluster run, columns show max_z p(x, z), p(x|z), and p(x).]

[Figure: the same quantities at iterations 10, 20, 30, and 50 of the 2-cluster run.]

6.2 3 clusters

Next, the experiment is repeated with a mixture of 3 Gaussians. Here, we see a major increase in the likelihood of the final mixture.

[Figure: likelihood vs. iteration (0 to 50) for the 3-cluster run.]

[Figure: for iterations 1, 2, 3, and 5 of the 3-cluster run, columns show max_z p(x, z), p(x|z), and p(x).]

[Figure: the same quantities at iterations 10, 20, 30, and 50 of the 3-cluster run.]

6.3 5 clusters

Finally, the experiment is repeated with a mixture of 5 Gaussians. In this case, only a small increase in the likelihood is obtained over the mixture of 3 Gaussians.

[Figure: likelihood vs. iteration (0 to 50) for the 5-cluster run.]

[Figure: for iterations 1, 2, 3, and 5 of the 5-cluster run, columns show max_z p(x, z), p(x|z), and p(x).]

[Figure: the same quantities at iterations 10, 20, 30, and 50 of the 5-cluster run.]