A minimalist's exposition of EM


Karl Stratos

1 What EM optimizes

Let O, H be random variables representing the space of samples. Let θ be the parameter of a generative model with an associated probability function P(O, H | θ). In many cases, it is easy to find the maximum-likelihood estimate (MLE) solution if both O and H are observed:

    θ* = argmax_θ log P(O, H | θ)    (1)

This is largely due to the log acting directly on P(O, H | θ). For instance, if P(O, H | θ) is a member of the exponential family, applying the log results in a substantially simpler expression (e.g., one that is smooth and concave in θ). However, suppose we are only given partial samples O. The MLE solution is now

    θ* = argmax_θ log Σ_H P(O, H | θ)    (2)

A critical change is that the log is no longer applied directly to the distribution. As a result, this objective is often difficult to optimize. By convention, we will write log P(O | θ) to refer to log Σ_H P(O, H | θ).

The Expectation-Maximization (EM) algorithm allows us to optimize a significantly simpler objective which is nevertheless guaranteed to improve log P(O | θ):

EM
Input: model P(O, H | θ), partial samples O, number of iterations T
Output: (not necessarily exact in the limit) estimate of (2)

Initialize θ^0 (e.g., randomly). For t = 1 ... T:
    E-step: Calculate P(H | O, θ^{t-1}).
    M-step: θ^t ← argmax_θ Q(θ, θ^{t-1}) where

        Q(θ, θ^{t-1}) := Σ_H P(H | O, θ^{t-1}) log P(O, H | θ)

Return θ^T.

Optimizing Q(θ, θ^{t-1}) over θ is typically much easier since the log is back on P(O, H | θ). Thus we have reduced a difficult problem into a series of easy problems.
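To make the loop above concrete, here is a minimal Python sketch of the generic EM driver (not part of the note). The callables e_step and m_step and the starting parameter theta0 are hypothetical placeholders that a specific model must supply; the GMM and HMM examples in Section 2 are concrete instances of this structure.

```python
def em(e_step, m_step, theta0, num_iters):
    """Generic EM driver (sketch).

    e_step(theta)     -> some representation of the posterior P(H | O, theta)
    m_step(posterior) -> argmax_theta of Q(theta, theta_prev), i.e. the
                         expected complete-data log-likelihood under `posterior`
    """
    theta = theta0
    for _ in range(num_iters):
        posterior = e_step(theta)   # E-step: compute P(H | O, theta^{t-1})
        theta = m_step(posterior)   # M-step: maximize Q(theta, theta^{t-1})
    return theta                    # theta^T
```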

1.1 Proof that EM converges

At first glance, EM does not seem to be optimizing log P(O | θ). But we can show that it always increases this value unless it has reached a stationary point. To show this, let q be any distribution over H and define the following quantity:

    L(q, θ) := Σ_H q(H) log [ P(O, H | θ) / q(H) ]

First, we show that the target objective log P(O | θ) is lower bounded by L(q, θ) for any choice of q.

Lemma 1.1. log P(O | θ) = L(q, θ) + KL{q ‖ P(H | O, θ)} for any q, where

    KL{q ‖ P(H | O, θ)} := Σ_H q(H) log [ q(H) / P(H | O, θ) ] ≥ 0

is the Kullback-Leibler (KL) divergence between q(H) and P(H | O, θ).

Proof. This is easily verified by substituting P(O, H | θ) = P(H | O, θ) P(O | θ) into L(q, θ):

    L(q, θ) = Σ_H q(H) log [ P(H | O, θ) P(O | θ) / q(H) ] = log P(O | θ) − KL{q ‖ P(H | O, θ)}.

Next, we claim that for a fixed θ, the lower bound L(q, θ) can be tightened to log P(O | θ) by choosing q(H) = P(H | O, θ).

Lemma 1.2.

    max_q L(q, θ) = log P(O | θ)
    argmax_q L(q, θ) = P(H | O, θ)

Proof. By Lemma 1.1, we have L(q, θ) + KL{q ‖ P(H | O, θ)} = C(θ) where C(θ) is constant in q. This means the maximizing q for L(q, θ) is the one with zero KL divergence from P(H | O, θ). This gives the desired result.

The final lemma shows that when q(H) = P(H | O, θ') is fixed, maximizing the lower bound L(q, θ) over θ is equivalent to maximizing Q(θ, θ') over θ.

Lemma 1.3. When q(H) is fixed as P(H | O, θ'),

    argmax_θ L(q, θ) = argmax_θ Q(θ, θ')

Proof. Plugging q(H) = P(H | O, θ') into L(q, θ) gives the desired result, since

    L(q, θ) = Σ_H P(H | O, θ') log P(O, H | θ) − Σ_H P(H | O, θ') log P(H | O, θ'),

where the first sum is Q(θ, θ') and the second is independent of θ.
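As a quick numerical sanity check of Lemmas 1.1 and 1.2 (a sketch, not part of the note), the snippet below evaluates log P(O | θ), L(q, θ), and the KL term on a small, made-up discrete joint distribution, and verifies both the decomposition and the tightness of the bound at q = P(H | O, θ).

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary joint distribution P(O, H) over 3 observed values and 4 hidden
# values (made up for illustration); rows index O, columns index H.
joint = rng.random((3, 4))
joint /= joint.sum()

o = 1                                  # a fixed observed value
p_o = joint[o].sum()                   # P(O = o)
post = joint[o] / p_o                  # P(H | O = o)

def lower_bound(q):
    """L(q, theta) = sum_H q(H) log [P(O = o, H) / q(H)]."""
    return np.sum(q * np.log(joint[o] / q))

def kl(q, p):
    """KL{q || p} = sum_H q(H) log [q(H) / p(H)]."""
    return np.sum(q * np.log(q / p))

q = rng.random(4)                      # an arbitrary distribution over H
q /= q.sum()

# Lemma 1.1: log P(O = o) = L(q, theta) + KL{q || P(H | O = o)} for any q.
assert np.isclose(np.log(p_o), lower_bound(q) + kl(q, post))
# Lemma 1.2: the bound is tight when q = P(H | O = o).
assert np.isclose(np.log(p_o), lower_bound(post))
```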

These lemmas together yield the following statement.

Theorem 1.4. In EM, we have log P(O | θ^t) ≥ log P(O | θ^{t-1}), with equality iff θ^{t-1} is a stationary point of log P(O | θ).

Proof. In the t-th iteration, we fix q*(H) = P(H | O, θ^{t-1}) to compute θ^t = argmax_θ Q(θ, θ^{t-1}). Then θ^t = argmax_θ L(q*, θ) by Lemma 1.3. Regarding L(q*, θ^t) and L(q*, θ^{t-1}), we have:

1. L(q*, θ^t) ≤ log P(O | θ^t) by Lemma 1.1.
2. L(q*, θ^t) ≥ L(q*, θ) for any θ, since θ^t = argmax_θ L(q*, θ).
3. L(q*, θ^{t-1}) = log P(O | θ^{t-1}) by Lemma 1.2, since q*(H) = P(H | O, θ^{t-1}).

These observations give us

    log P(O | θ^t) ≥ L(q*, θ^t) ≥ L(q*, θ^{t-1}) = log P(O | θ^{t-1})

To see why L(q*, θ^t) > L(q*, θ^{t-1}) unless θ^{t-1} is a stationary point of log P(O | θ), suppose θ^{t-1} is not a stationary point. Then the gradient of log P(O | θ) at θ^{t-1} is nonzero. By Lemmas 1.1 and 1.2, evaluated at θ = θ^{t-1},

    ∇_θ L(q*, θ) + ∇_θ KL{q* ‖ P(H | O, θ)} = ∇_θ L(q*, θ) ≠ 0

(the KL term attains its minimum of zero at θ^{t-1}, so its gradient vanishes there). Thus the gradient of L(q*, θ) at θ^{t-1} is also nonzero, so θ^t = argmax_θ L(q*, θ) will satisfy L(q*, θ^t) > L(q*, θ^{t-1}) and therefore log P(O | θ^t) > log P(O | θ^{t-1}).

EM can be viewed as an (unconventional) alternating optimization algorithm that iteratively improves a lower bound on the objective instead of the objective itself. It repeats the following two steps:

    E-step: q* ← argmax_q L(q, θ)
    M-step: θ ← argmax_θ L(q*, θ)

The E-step defines a new (tight) lower bound function that is tangent to log P(O | θ) at the current θ. The M-step modifies θ to optimize this lower bound, in the process making the bound loose again. The argument about strict improvement can be visualized as follows: if θ is not a stationary point of log P(O | θ), then it cannot be a stationary point of L(q*, θ), which is tangent to log P(O | θ) and shares its gradient at θ.

2 Examples of EM

2.1 Gaussian mixture model (GMM)

Consider a mixture of m univariate spherical Gaussians θ = {(γ_j, μ_j)}_{j=1}^m: the j-th Gaussian has mean μ_j and standard deviation 1, and is associated with a prior probability γ_j of being selected. Assume we have n i.i.d. samples from this model. Complete samples would have the form X = {(o^(i), h^(i))}_{i=1}^n where o^(i) is generated by the h^(i)-th Gaussian; in that case, solving (1) is trivial. However, assume instead we have partial samples O = {o^(i)}_{i=1}^n; we do not observe the corresponding H = {h^(i)}_{i=1}^n. Then we must maximize P(O | θ) over θ:

    Σ_{i=1}^n log Σ_{j=1}^m (γ_j / √(2π)) exp( −(o^(i) − μ_j)² / 2 )

in which complicated interactions between parameters are left unresolved by the log. Thus we turn to EM: we optimize instead

    Q(θ, θ^{t-1}) = Σ_{i=1}^n Σ_{j=1}^m p(h^(i) = j | o^(i), θ^{t-1}) ( log γ_j − log √(2π) − (o^(i) − μ_j)² / 2 )

over θ. This leads to a simple formula for setting θ^t using θ^{t-1}.
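Maximizing Q(θ, θ^{t-1}) over γ (subject to Σ_j γ_j = 1) and over each μ_j gives the usual closed-form updates: γ_j becomes the average posterior responsibility of component j, and μ_j the responsibility-weighted mean of the samples. Below is a minimal numpy sketch of the resulting procedure (not part of the note); the toy data and the initialization scheme are made up for illustration.

```python
import numpy as np

def em_gmm(obs, num_components, num_iters, seed=0):
    """EM for a mixture of univariate Gaussians with fixed standard deviation 1.

    obs: 1-D array of partial samples o^(1), ..., o^(n).
    Returns (gamma, mu): mixing weights and component means.
    """
    obs = np.asarray(obs, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(obs)
    gamma = np.full(num_components, 1.0 / num_components)     # prior probabilities
    mu = rng.choice(obs, size=num_components, replace=False)  # initial means

    for _ in range(num_iters):
        # E-step: responsibilities r[i, j] = p(h^(i) = j | o^(i), theta^{t-1}),
        # proportional to gamma_j * exp(-(o^(i) - mu_j)^2 / 2).
        log_r = np.log(gamma) - 0.5 * (obs[:, None] - mu[None, :]) ** 2
        log_r -= log_r.max(axis=1, keepdims=True)             # numerical stability
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)

        # M-step: maximize Q(theta, theta^{t-1}) in closed form.
        gamma = r.sum(axis=0) / n
        mu = (r * obs[:, None]).sum(axis=0) / r.sum(axis=0)

    return gamma, mu

# Toy usage with made-up data: samples drawn near -2 and +3.
rng = np.random.default_rng(3)
data = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 300)])
print(em_gmm(data, num_components=2, num_iters=50))
```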

2.2 Hidden Markov model (HMM)

Consider an HMM with a discrete observation space X and a discrete state space Y. It is parameterized by π, t, and o, and defines the probability distribution

    p(x_1 ... x_N, y_1 ... y_N | θ) = π(y_1) ∏_{j=1}^N o(x_j | y_j) ∏_{j=2}^N t(y_j | y_{j-1})

for a sequence pair x_1 ... x_N ∈ X^N and y_1 ... y_N ∈ Y^N. Assume we have n i.i.d. sample sequences of length N from this model. Complete samples would have the form X = {(x^(i), y^(i))}_{i=1}^n where x^(i) ∈ X^N is an observation sequence and y^(i) ∈ Y^N is the corresponding state sequence. Again, solving (1) is trivial if we are given X. Assume instead we have partial samples O = {x^(i)}_{i=1}^n. It is unclear how to maximize P(O | θ) over θ:

    Σ_{i=1}^n log Σ_{y ∈ Y^N} π(y_1) ∏_{j=1}^N o(x^(i)_j | y_j) ∏_{j=2}^N t(y_j | y_{j-1})

In comparison, we can optimize a substantially simpler objective using EM:

    Q(θ, θ^{t-1}) = Σ_{i=1}^n Σ_{y ∈ Y^N} p(y | x^(i), θ^{t-1}) [ log π(y_1) + Σ_{j=1}^N log o(x^(i)_j | y_j) + Σ_{j=2}^N log t(y_j | y_{j-1}) ]

The sum over the elements of Y^N can be carried out with dynamic programming.
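The dynamic program in question is the forward-backward algorithm. The sketch below (not part of the note) computes the posterior state marginals p(y_j = s | x, θ) for a single observation sequence, which is the main quantity the E-step needs; the parameter arrays pi, trans, and emit are assumed to be given as numpy arrays, and a practical implementation would rescale alpha and beta (or work in log space) to avoid underflow on long sequences.

```python
import numpy as np

def state_posteriors(x, pi, trans, emit):
    """Posterior marginals p(y_j = s | x_1..x_N, theta) for one sequence.

    x:     length-N sequence of observation indices.
    pi:    pi[s]       = p(y_1 = s)
    trans: trans[s, t] = p(y_j = t | y_{j-1} = s)
    emit:  emit[s, v]  = p(x_j = v | y_j = s)
    """
    N, S = len(x), len(pi)
    alpha = np.zeros((N, S))   # alpha[j, s] = p(x_1..x_j, y_j = s)
    beta = np.zeros((N, S))    # beta[j, s]  = p(x_{j+1}..x_N | y_j = s)

    alpha[0] = pi * emit[:, x[0]]
    for j in range(1, N):
        alpha[j] = emit[:, x[j]] * (alpha[j - 1] @ trans)

    beta[N - 1] = 1.0
    for j in range(N - 2, -1, -1):
        beta[j] = trans @ (emit[:, x[j + 1]] * beta[j + 1])

    marginal = alpha[N - 1].sum()      # p(x_1..x_N | theta)
    return alpha * beta / marginal     # shape (N, S)
```

The transition terms in Q additionally require the pairwise posteriors p(y_{j-1} = s, y_j = s' | x, θ), which can be read off from the same alpha and beta tables.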

3 Discussion

Despite the widespread use of EM today, it is often understood only heuristically. In particular, it is often mistaken for plugging in conditional expectations in place of the unobserved values in (1). This is in general incorrect (Flury and Zoppe, 2000), even though it is the case for a wide class of practical models (Appendix). It is important to understand EM from an optimization point of view to avoid this pitfall.

Reference: Bishop, Christopher M. Pattern Recognition and Machine Learning (2006).

4 Appendix: trick for categorical distributions

Whenever P(O, H | θ) is a categorical distribution, i.e., it defines a distribution over counts of events as a product of parameters associated with the events, we can in fact plug in conditional expectations in place of the unobserved values in (1) and solve that problem. This makes the derivation of EM almost trivial. Since many useful models in the world are categorical (e.g., HMMs), this trick often comes in handy.

To make this concrete, suppose we have a categorical distribution over n + m events, of which n are observed and m are unobserved. Let o_i be the count of the i-th observed event and h_j the count of the j-th unobserved event. The model defines a probability distribution as

    p(o_1 ... o_n, h_1 ... h_m | α, β) = ∏_{i=1}^n α_i^{o_i} ∏_{j=1}^m β_j^{h_j}

where θ = (α, β) is the model parameter. We will now examine the form of Q(θ, θ') (θ' is fixed):

    Q(θ, θ') := Σ_{h_1 ... h_m} p(h_1 ... h_m | o_1 ... o_n, θ') [ Σ_{i=1}^n o_i log α_i + Σ_{j=1}^m h_j log β_j ]
              = Σ_{i=1}^n o_i log α_i + Σ_{h_1 ... h_m} p(h_1 ... h_m | o_1 ... o_n, θ') Σ_{j=1}^m h_j log β_j
              = Σ_{i=1}^n o_i log α_i + Σ_{j=1}^m Σ_{h_j} p(h_j | o_1 ... o_n, θ') h_j log β_j
              = Σ_{i=1}^n o_i log α_i + Σ_{j=1}^m ĥ_j(θ') log β_j

where ĥ_j(θ') = Σ_{h_j} p(h_j | o_1 ... o_n, θ') h_j is the expected count of the j-th unobserved event under θ'. Compare this to the setting in which h_1 ... h_m are fully observed and we maximize

    log p(o_1 ... o_n, h_1 ... h_m | α, β) = Σ_{i=1}^n o_i log α_i + Σ_{j=1}^m h_j log β_j

We see that this is exactly the same as Q(θ, θ') except that ĥ_j(θ') takes the place of h_j. Hence in this particular setup, each iteration of EM amounts to solving the fully observed MLE problem in (1) with the unobserved values replaced by their corresponding conditional expectations.
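As an illustration of this trick (a sketch, not part of the note), here is what the M-step looks like for an HMM once expected counts have been accumulated from the posteriors of Section 2.2: the fully observed MLE for a categorical distribution simply normalizes counts, so EM normalizes expected counts instead. The count arrays are assumed to have been filled in by an E-step such as the forward-backward sketch above.

```python
import numpy as np

def hmm_m_step(init_counts, trans_counts, emit_counts):
    """MLE for HMM parameters from (expected) event counts.

    init_counts[s]     : expected number of sequences starting in state s
    trans_counts[s, t] : expected number of s -> t transitions
    emit_counts[s, v]  : expected number of times state s emits symbol v

    With fully observed state sequences these would be integer counts and the
    normalizations below would be the ordinary MLE; under EM the same formulas
    are applied to the conditional expected counts.
    """
    pi = init_counts / init_counts.sum()
    trans = trans_counts / trans_counts.sum(axis=1, keepdims=True)
    emit = emit_counts / emit_counts.sum(axis=1, keepdims=True)
    return pi, trans, emit
```

Combined with a forward-backward E-step for accumulating the expected counts, this is the familiar Baum-Welch instantiation of EM for HMMs.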