IEOR E4570: Machine Learning for OR&FE, Spring 2015. © 2015 by Martin Haugh.

The EM Algorithm

The EM algorithm is used for obtaining maximum likelihood estimates of parameters when some of the data is missing. More generally, however, the EM algorithm can also be applied when there is latent, i.e. unobserved, data which was never intended to be observed in the first place. In that case, we simply assume that the latent data is missing and proceed to apply the EM algorithm. The EM algorithm has many applications throughout statistics. It is used, for example, in machine learning and data mining applications, and in Bayesian statistics where it is often used to obtain the mode of the posterior marginal distributions of parameters.

1 The Classical EM Algorithm

We begin by assuming that the complete data-set consists of Z = (X, Y) but that only X is observed. The complete-data log-likelihood is then denoted by l(θ; X, Y) where θ is the unknown parameter vector for which we wish to find the MLE.

E-Step: The E-step of the EM algorithm computes the expected value of l(θ; X, Y) given the observed data, X, and the current parameter estimate, θ_old say. In particular, we define

    Q(θ; θ_old) := E[l(θ; X, Y) | X, θ_old] = ∫ l(θ; X, y) p(y | X, θ_old) dy          (1)

where p(· | X, θ_old) is the conditional density of Y given the observed data, X, and assuming θ = θ_old.

M-Step: The M-step consists of maximizing over θ the expectation computed in (1). That is, we set

    θ_new := arg max_θ Q(θ; θ_old).

We then set θ_old = θ_new.

The two steps are repeated as necessary until the sequence of θ_new's converges. Indeed, under very general circumstances convergence to a local maximum can be guaranteed, and we explain why this is the case below. If it is suspected that the log-likelihood function has multiple local maxima, then the EM algorithm should be run many times, using a different starting value of θ_old on each occasion. The ML estimate of θ is then taken to be the best of the set of local maxima obtained from the various runs of the EM algorithm.
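As a purely illustrative sketch (not part of the original notes), the iteration above can be written as a short generic loop. The functions e_step and m_step below are hypothetical placeholders that must be supplied for a particular model: e_step computes whatever expected statistics are needed to form Q(θ; θ_old), and m_step returns the maximizing θ.

```python
import numpy as np

def em(theta_init, e_step, m_step, max_iter=100, tol=1e-8):
    """Generic EM skeleton (illustrative sketch, not from the notes).

    e_step(theta_old) -> expected statistics needed to form Q(. ; theta_old)
    m_step(stats)     -> theta_new maximizing Q(. ; theta_old)
    """
    theta_old = np.asarray(theta_init, dtype=float)
    for _ in range(max_iter):
        stats = e_step(theta_old)                       # E-step
        theta_new = np.asarray(m_step(stats), float)    # M-step
        if np.max(np.abs(theta_new - theta_old)) < tol: # stop once the sequence converges
            return theta_new
        theta_old = theta_new                           # set theta_old = theta_new and repeat
    return theta_old
```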

Why Does the EM Algorithm Work?

We use p(·) to denote a generic conditional PDF. Now observe that

    l(θ; X) = ln p(X | θ)
            = ln ∫ p(X, y | θ) dy
            = ln ∫ [ p(X, y | θ) / p(y | X, θ_old) ] p(y | X, θ_old) dy
            = ln E[ p(X, Y | θ) / p(Y | X, θ_old) | X, θ_old ]
            ≥ E[ ln( p(X, Y | θ) / p(Y | X, θ_old) ) | X, θ_old ]                       (2)
            = E[ ln p(X, Y | θ) | X, θ_old ] − E[ ln p(Y | X, θ_old) | X, θ_old ]
            = Q(θ; θ_old) − E[ ln p(Y | X, θ_old) | X, θ_old ]                          (3)

where (2) follows from Jensen's inequality since the ln function is concave. It is also clear (because the term inside the expectation becomes a constant) that the inequality in (2) becomes an equality if we take θ = θ_old.

Letting g(θ | θ_old) denote the right-hand-side of (3), we therefore have l(θ; X) ≥ g(θ | θ_old) for all θ, with equality when θ = θ_old. Therefore any value of θ that increases g(θ | θ_old) beyond g(θ_old | θ_old) must also increase l(θ; X) beyond l(θ_old; X). The M-step finds such a θ by maximizing Q(θ; θ_old) over θ, which is equivalent (why?) to maximizing g(θ | θ_old) over θ. It is also worth mentioning that in many applications the function Q(θ; θ_old) will be a concave function of θ and therefore easy to optimize.
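The minorization property above is easy to check numerically. The sketch below is not from the notes: it assumes a hypothetical two-component mixture of known N(−2, 1) and N(2, 1) densities with a single unknown mixing weight θ, and verifies that g(θ | θ_old) ≤ l(θ; X) on a grid of θ values, with equality at θ = θ_old.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
# Toy model (illustrative assumption): X_i ~ theta*N(-2,1) + (1-theta)*N(2,1)
x = np.concatenate([rng.normal(-2, 1, 30), rng.normal(2, 1, 70)])
phi0, phi1 = norm.pdf(x, -2, 1), norm.pdf(x, 2, 1)

def loglik(theta):
    # l(theta; X) = sum_i ln( theta*phi0(x_i) + (1-theta)*phi1(x_i) )
    return np.sum(np.log(theta * phi0 + (1 - theta) * phi1))

def g(theta, theta_old):
    # g(theta | theta_old) = Q(theta; theta_old) - E[ln p(Y | X, theta_old) | X, theta_old]
    r = theta_old * phi0 / (theta_old * phi0 + (1 - theta_old) * phi1)  # P(Y_i = 0 | x_i, theta_old)
    Q = np.sum(r * np.log(theta * phi0) + (1 - r) * np.log((1 - theta) * phi1))
    entropy = -np.sum(r * np.log(r) + (1 - r) * np.log(1 - r))
    return Q + entropy

theta_old = 0.5
grid = np.linspace(0.05, 0.95, 19)
assert all(g(t, theta_old) <= loglik(t) + 1e-9 for t in grid)    # g minorizes l
assert abs(g(theta_old, theta_old) - loglik(theta_old)) < 1e-6   # equality at theta_old
```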

2 Examples

Example 1 (Missing Data in a Multinomial Model)

Suppose x := (x_1, x_2, x_3, x_4) is a sample from a Mult(n, π_θ) distribution where

    π_θ = ( 1/2 + θ/4,  (1 − θ)/4,  (1 − θ)/4,  θ/4 ).

The likelihood, L(θ; x), is then given by

    L(θ; x) = [ n! / (x_1! x_2! x_3! x_4!) ] (1/2 + θ/4)^{x_1} ((1 − θ)/4)^{x_2} ((1 − θ)/4)^{x_3} (θ/4)^{x_4}

so that the log-likelihood l(θ; x) is

    l(θ; x) = C + x_1 ln(1/2 + θ/4) + (x_2 + x_3) ln(1 − θ) + x_4 ln(θ)

where C is a constant that does not depend on θ. We could try to maximize l(θ; x) over θ directly using standard non-linear optimization algorithms. However, in this example we will perform the optimization instead using the EM algorithm. To do this we assume that the complete data is given by y := (y_1, y_2, y_3, y_4, y_5) and that y has a Mult(n, π*_θ) distribution where

    π*_θ = ( 1/2,  θ/4,  (1 − θ)/4,  (1 − θ)/4,  θ/4 ).

However, instead of observing y we only observe (y_1 + y_2, y_3, y_4, y_5), i.e., we only observe x. We therefore take X = (y_1 + y_2, y_3, y_4, y_5) and take Y = y_2. The log-likelihood of the complete data is then given by

    l(θ; X, Y) = C + y_2 ln(θ) + (y_3 + y_4) ln(1 − θ) + y_5 ln(θ)

where again C is a constant containing all terms that do not depend on θ. It is also clear that the conditional density of Y satisfies

    f(y_2 | X, θ) = Bin( y_1 + y_2,  (θ/4) / (1/2 + θ/4) ).

We can now implement the E-step and M-step.

E-Step: Recalling that Q(θ; θ_old) := E[l(θ; X, Y) | X, θ_old], we have

    Q(θ; θ_old) = C + E[ y_2 ln(θ) | X, θ_old ] + (y_3 + y_4) ln(1 − θ) + y_5 ln(θ)
                = C + (y_1 + y_2) p_old ln(θ) + (y_3 + y_4) ln(1 − θ) + y_5 ln(θ)

where

    p_old := (θ_old / 4) / (1/2 + θ_old / 4).                                           (4)

M-Step: We now maximize Q(θ; θ_old) to find θ_new. Taking the derivative we obtain

    dQ/dθ = (y_1 + y_2) p_old / θ − (y_3 + y_4) / (1 − θ) + y_5 / θ

which is zero when we take θ = θ_new where

    θ_new := [ y_5 + p_old (y_1 + y_2) ] / [ y_3 + y_4 + y_5 + p_old (y_1 + y_2) ].     (5)

Equations (4) and (5) now define the EM iteration, which begins with some (judiciously) chosen value of θ_old.
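The iteration (4)-(5) takes only a few lines of code. The sketch below is illustrative and not from the notes; the counts x = (125, 18, 20, 34) are the classic genetic-linkage example commonly used with this model, and any non-negative counts would do.

```python
import numpy as np

def em_multinomial(x, theta_init=0.5, max_iter=200, tol=1e-10):
    """EM iteration (4)-(5) for Example 1; x = (x1, x2, x3, x4) are the observed counts."""
    x1, x2, x3, x4 = x      # x1 = y1 + y2, x2 = y3, x3 = y4, x4 = y5
    theta = theta_init
    for _ in range(max_iter):
        # E-step (4): expected fraction of x1 attributable to the theta/4 cell
        p_old = (theta / 4) / (0.5 + theta / 4)
        # M-step (5)
        theta_new = (x4 + p_old * x1) / (x2 + x3 + x4 + p_old * x1)
        if abs(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta

# Illustrative counts; with these, the iteration converges to theta ~ 0.6268.
print(em_multinomial((125, 18, 20, 34)))
```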

Example 2 (A Simple Normal-Mixture Model)

An extremely common application of the EM algorithm is to compute the MLE of normal mixture models. This is often used, for example, in clustering algorithms. Suppose, for example, that X = (X_1, ..., X_n) are IID random variables, each with PDF

    f_x(x) = Σ_{j=1}^m p_j (1 / √(2π σ_j²)) e^{−(x − µ_j)² / (2σ_j²)}

where p_j ≥ 0 for all j and Σ_j p_j = 1. The parameters in this model are the p_j's, the µ_j's and the σ_j's. Instead of trying to find the maximum likelihood estimates of these parameters directly via numerical optimization, we can use the EM algorithm. We do this by assuming the presence of an additional random variable, Y say, where P(Y = j) = p_j for j = 1, ..., m. The realized value of Y then determines which of the m normal distributions generates the corresponding value of X. There are n such random variables, (Y_1, ..., Y_n) =: Y. Note that

    f_{x|y}(x_i | y_i = j, θ) = (1 / √(2π σ_j²)) e^{−(x_i − µ_j)² / (2σ_j²)}            (6)

where θ := (p_1, ..., p_m, µ_1, ..., µ_m, σ_1, ..., σ_m) is the unknown parameter vector, and that the complete-data likelihood is given by

    L(θ; X, Y) = Π_{i=1}^n p_{y_i} (1 / √(2π σ_{y_i}²)) e^{−(x_i − µ_{y_i})² / (2σ_{y_i}²)}.

The EM algorithm starts with an initial guess, θ_old, and then iterates the E-step and M-step as described below until convergence.

E-Step: We need to compute Q(θ; θ_old) := E[l(θ; X, Y) | X, θ_old]. Indeed, it is straightforward to show that

    Q(θ; θ_old) = Σ_{i=1}^n Σ_{j=1}^m P(Y_i = j | x_i, θ_old) ln( f_{x|y}(x_i | y_i = j, θ) P(Y_i = j | θ) ).    (7)

Note that f_{x|y}(x_i | y_i = j, θ) is given by (6) and that P(Y_i = j | θ_old) = p_{j,old}. Finally, since

    P(Y_i = j | x_i, θ_old) = P(Y_i = j, X_i = x_i | θ_old) / P(X_i = x_i | θ_old)
                            = f_{x|y}(x_i | y_i = j, θ_old) P(Y_i = j | θ_old) / Σ_{k=1}^m f_{x|y}(x_i | y_i = k, θ_old) P(Y_i = k | θ_old),

it is clear that we can compute (7) analytically.

M-Step: We can now maximize Q(θ; θ_old) by setting the vector of partial derivatives, ∂Q/∂θ, equal to 0 and solving for θ_new. After some algebra, we obtain

    µ_{j,new} = Σ_{i=1}^n x_i P(Y_i = j | x_i, θ_old) / Σ_{i=1}^n P(Y_i = j | x_i, θ_old)                      (8)

    σ²_{j,new} = Σ_{i=1}^n (x_i − µ_{j,new})² P(Y_i = j | x_i, θ_old) / Σ_{i=1}^n P(Y_i = j | x_i, θ_old)      (9)

    p_{j,new} = (1/n) Σ_{i=1}^n P(Y_i = j | x_i, θ_old).                                                       (10)

Given an initial estimate, θ_old, the EM algorithm cycles through (8) to (10) repeatedly, setting θ_old = θ_new after each cycle, until the estimates converge.
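The updates (8)-(10) translate directly into code. The sketch below is illustrative and not from the notes: it fits a univariate m-component mixture, and the synthetic data, starting values and fixed iteration count are arbitrary choices made only for the demonstration.

```python
import numpy as np
from scipy.stats import norm

def em_gmm(x, m=2, n_iter=200, seed=0):
    """EM for a univariate normal mixture, following (6)-(10). Illustrative sketch."""
    rng = np.random.default_rng(seed)
    n = len(x)
    # Arbitrary initial guess theta_old = (p, mu, sigma)
    p = np.full(m, 1.0 / m)
    mu = rng.choice(x, m, replace=False)
    sigma = np.full(m, np.std(x))
    for _ in range(n_iter):
        # E-step: responsibilities P(Y_i = j | x_i, theta_old), an n x m matrix
        dens = norm.pdf(x[:, None], mu[None, :], sigma[None, :]) * p[None, :]
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: updates (8), (9), (10)
        nj = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nj                                  # (8)
        sigma = np.sqrt((resp * (x[:, None] - mu[None, :])**2).sum(axis=0) / nj)   # (9)
        p = nj / n                                                                 # (10)
    return p, mu, sigma

# Synthetic data (illustrative): a 30/70 mixture of N(-2, 1) and N(3, 0.5^2)
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1.0, 300), rng.normal(3, 0.5, 700)])
print(em_gmm(x, m=2))
```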

3 A More General Version of the EM Algorithm

The EM algorithm is often stated more generally using the language of information theory. (The material in this section is drawn from Pattern Recognition and Machine Learning (2006) by Chris Bishop.) In this section we will describe this more general formulation and relate it back to the EM algorithm as described in Section 1. As before, the goal is to maximize the likelihood function, L(θ; X), which is given by

    L(θ; X) = p(X | θ) = ∫ p(X, y | θ) dy.                                              (11)

(If Y is discrete then we replace the integral in (11) with a summation.) The implicit assumption underlying the EM algorithm is that it is difficult to optimize p(X | θ) with respect to θ but that it is much easier to optimize p(X, Y | θ). We first introduce an arbitrary distribution, q(Y), over the latent variables, Y, and note that we can decompose the log-likelihood, l(θ; X), into two terms (the first of which is sometimes called the energy term) according to

    l(θ; X) := ln p(X | θ) = L(q, θ) + KL(q ‖ p_{Y|X})                                  (12)

where L(q, θ) and KL(q ‖ p_{Y|X}) are the energy term (a lower bound on the log-likelihood) and the Kullback-Leibler (KL) divergence (also often called the relative entropy), which are given by

    L(q, θ) = Σ_Y q(Y) ln( p(X, Y | θ) / q(Y) )                                         (13)

    KL(q ‖ p_{Y|X}) = − Σ_Y q(Y) ln( p(Y | X, θ) / q(Y) ).

It is well-known (see the Appendix) that the KL divergence satisfies KL(q ‖ p_{Y|X}) ≥ 0 and equals 0 if and only if q(Y) = p(Y | X, θ). It therefore follows that L(q, θ) ≤ l(θ; X) for all distributions, q(·). We can now use the decomposition in (12) to define the EM algorithm. We begin with an initial parameter estimate, θ_old.

E-Step: The E-step maximizes the lower bound, L(q, θ_old), with respect to q(·) while keeping θ_old fixed. In principle this is a variational problem since we are optimizing a functional, but the solution is easily found. First note that l(θ_old; X) does not depend on q(·). It then follows from (12) (with θ = θ_old) that maximizing L(q, θ_old) is equivalent to minimizing KL(q ‖ p_{Y|X}). Since this latter term is always non-negative, we see that L(q, θ_old) is optimized when KL(q ‖ p_{Y|X}) = 0 which, by our earlier observation, is the case when we take q(Y) = p(Y | X, θ_old). At this point we see that the lower bound, L(q, θ_old), will now equal the current value of the log-likelihood, l(θ_old; X).

M-Step: In the M-step we keep q(Y) fixed and maximize L(q, θ) over θ to obtain θ_new. This will therefore cause the lower bound to increase (if it is not already at a maximum), which in turn means that the log-likelihood must also increase. Moreover, at this new value θ_new it will no longer be the case that KL(q ‖ p_{Y|X}) = 0 and so, by (12), the increase in the log-likelihood will be greater than the increase in the lower bound.

Comparing the General EM Algorithm with the Classical EM Algorithm

It is instructive to compare the E-step and M-step of the general EM algorithm with the corresponding steps of Section 1. To do this, first substitute q(Y) = p(Y | X, θ_old) into (13) to obtain

    L(q, θ) = Q(θ; θ_old) + constant                                                    (14)

where Q(θ; θ_old) is the expected complete-data log-likelihood as defined in (1), with the expectation taken assuming θ = θ_old. The M-step of the general EM algorithm is therefore identical to the M-step of Section 1 since the constant term in (14) does not depend on θ. The E-step in the general EM algorithm takes q(Y) = p(Y | X, θ_old) which, at first glance, appears to be different from the E-step in Section 1. But there is no practical difference: the E-step in Section 1 simply uses p(Y | X, θ_old) to compute Q(θ; θ_old) and, while not explicitly stated, the general E-step must also do this since it is required for the M-step.

3.1 The Case of Independent and Identically Distributed Observations

When the data set consists of N IID observations, X = {x_n}, with corresponding latent or unobserved variables, Y = {y_n}, then we can simplify the calculation of p(Y | X, θ_old). In particular we obtain

    p(Y | X, θ_old) = p(X, Y | θ_old) / Σ_Y p(X, Y | θ_old)
                    = Π_{n=1}^N p(x_n, y_n | θ_old) / Σ_Y Π_{n=1}^N p(x_n, y_n | θ_old)
                    = Π_{n=1}^N p(x_n, y_n | θ_old) / Π_{n=1}^N p(x_n | θ_old)
                    = Π_{n=1}^N p(y_n | x_n, θ_old)

so that the posterior distribution also factorizes, which makes its calculation much easier. Note that the E-step of Example 2 is consistent with this observation.
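Decomposition (12) is easy to verify numerically. The sketch below is not from the notes: it assumes a hypothetical two-component mixture with N IID observations and an arbitrary factorized q(Y), echoing Section 3.1, and checks that l(θ; X) = L(q, θ) + KL(q ‖ p_{Y|X}) and that the bound becomes tight when q(Y) = p(Y | X, θ).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy latent-variable model (illustrative assumption):
# Y_n in {0, 1} with P(Y_n = j) = p[j], and X_n | Y_n = j ~ N(mu[j], 1).
p, mu = np.array([0.3, 0.7]), np.array([-2.0, 3.0])
y = rng.choice(2, size=50, p=p)
x = rng.normal(mu[y], 1.0)

joint = norm.pdf(x[:, None], mu[None, :], 1.0) * p[None, :]   # p(x_n, y_n = j | theta)
post = joint / joint.sum(axis=1, keepdims=True)               # p(y_n = j | x_n, theta)
loglik = np.sum(np.log(joint.sum(axis=1)))                    # l(theta; X)

q = rng.dirichlet([1.0, 1.0], size=50)        # an arbitrary factorized q(Y)

energy = np.sum(q * np.log(joint / q))        # L(q, theta), equation (13)
kl = np.sum(q * np.log(q / post))             # KL(q || p_{Y|X})

assert np.isclose(energy + kl, loglik)        # decomposition (12)
assert energy <= loglik                       # L(q, theta) is a lower bound

energy_at_post = np.sum(post * np.log(joint / post))
assert np.isclose(energy_at_post, loglik)     # with q = p_{Y|X} the bound is tight (E-step)
```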

3.2 Bayesian Applications

The EM algorithm can also be used to compute the mode of the posterior distribution, p(θ | X), in a Bayesian setting where we are given a prior, p(θ), on the unknown parameter (vector), θ. To see this, first note that we can write p(θ | X) = p(X | θ) p(θ) / p(X), which upon taking logs yields

    ln p(θ | X) = ln p(X | θ) + ln p(θ) − ln p(X).                                      (15)

If we now use (12) to substitute for ln p(X | θ) on the right-hand-side of (15) we obtain

    ln p(θ | X) = L(q, θ) + KL(q ‖ p_{Y|X}) + ln p(θ) − ln p(X).                        (16)

We can now find the posterior mode of ln p(θ | X) using a version of the EM algorithm. The E-step is exactly the same as before since the final two terms on the right-hand-side of (16) do not depend on q(·). The M-step, where we keep q(·) fixed and optimize over θ, must be modified, however, to include the ln p(θ) term.

There are also related methods that can be used to estimate the variance-covariance matrix, Σ, of θ. In this case it is quite common to approximate the posterior distribution of θ with a Gaussian distribution centered at the mode and with variance-covariance matrix, Σ. This is called a Laplacian approximation and it is a simple but commonly used framework for deterministic inference. It only works well, however, when the posterior is unimodal with contours that are approximately elliptical. It is also worth pointing out, however, that it is often straightforward to compute the mode of the posterior and determine a suitable Σ for the Gaussian approximation, so that the Laplacian approximation need not rely on the EM algorithm.

We also note in passing that the decomposition in (12) also forms the basis of another commonly used method of deterministic inference called variational Bayes. The goal with variational Bayes is to select q(·) from some parametric family of distributions, Q, to approximate p(Y | X). The dependence on θ is omitted since we are now in a Bayesian setting and θ can be subsumed into the latent or hidden variables, Y. In choosing q(·) we seek to maximize the lower bound, L(q), or equivalently by (12), to minimize KL(q ‖ p_{Y|X}). A common choice of Q is the set of distributions under which the latent variables are independent.

Appendix: Kullback-Leibler Divergence

Let P and Q be two probability distributions such that if P(x) > 0 then Q(x) > 0. The Kullback-Leibler (KL) divergence or relative entropy of Q from P is defined to be

    KL(P ‖ Q) = Σ_x P(x) ln( P(x) / Q(x) )                                              (17)

with the understanding that 0 log 0 = 0. (This is consistent with the fact that lim_{x→0} x log x = 0.) The KL divergence is a fundamental concept in information theory and machine learning. One can imagine P representing some true but unknown distribution that we approximate with Q, and that KL(P ‖ Q) measures the "distance" between P and Q. This interpretation is valid because we will see below that KL(P ‖ Q) ≥ 0 with equality if and only if P = Q. Note, however, that the KL divergence is not a true measure of distance since it is asymmetric, in that KL(P ‖ Q) ≠ KL(Q ‖ P), and does not satisfy the triangle inequality.

In order to see that KL(P ‖ Q) ≥ 0 we first recall that a function f(·) is convex on R if

    f(αx + (1 − α)y) ≤ α f(x) + (1 − α) f(y)   for all α ∈ [0, 1].

We also recall Jensen's inequality:

Jensen's Inequality: Let f(·) be a convex function on R and suppose E[X] < ∞ and E[f(X)] < ∞. Then f(E[X]) ≤ E[f(X)].

Noting that − ln(x) is a convex function, we have

    KL(P ‖ Q) = − Σ_x P(x) ln( Q(x) / P(x) )
              ≥ − ln( Σ_x P(x) Q(x) / P(x) )        by Jensen's inequality
              = − ln( Σ_x Q(x) )
              ≥ 0

since the sum is over x with P(x) > 0 and so Σ_x Q(x) ≤ 1. Moreover, it is clear from (17) that KL(P ‖ Q) = 0 if P = Q. In fact, because − ln(x) is strictly convex, it is easy to see that KL(P ‖ Q) = 0 only if P = Q.
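A short numerical check of these properties (illustrative, not from the notes): the pmfs below are arbitrary, and the code confirms non-negativity, the value zero at P = Q, and the asymmetry KL(P ‖ Q) ≠ KL(Q ‖ P).

```python
import numpy as np

def kl(P, Q):
    """KL(P || Q) = sum_x P(x) ln(P(x)/Q(x)), with the convention 0 log 0 = 0."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    mask = P > 0                     # terms with P(x) = 0 contribute 0
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))

P = np.array([0.5, 0.3, 0.2])        # arbitrary illustrative pmfs
Q = np.array([0.2, 0.5, 0.3])

assert kl(P, Q) >= 0 and kl(Q, P) >= 0      # non-negativity
assert np.isclose(kl(P, P), 0.0)            # equals 0 when P = Q
assert not np.isclose(kl(P, Q), kl(Q, P))   # asymmetric: not a true distance
```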

Example 3 (An Optimization Trick that's Worth Remembering)

Suppose c is a non-negative vector in R^n, i.e. c ∈ R^n_+, and we wish to maximize Σ_i c_i ln(q_i) over probability mass functions q = {q_1, ..., q_n}. Let p = {p_1, ..., p_n} where p_i := c_i / Σ_j c_j, so that p is a (known) pmf. We then have:

    max_q Σ_{i=1}^n c_i ln(q_i) = ( Σ_{i=1}^n c_i ) max_q { Σ_{i=1}^n p_i ln(q_i) }
                                = ( Σ_{i=1}^n c_i ) max_q { Σ_{i=1}^n p_i ln(p_i) − Σ_{i=1}^n p_i ln( p_i / q_i ) }
                                = ( Σ_{i=1}^n c_i ) ( Σ_{i=1}^n p_i ln(p_i) − min_q KL(p ‖ q) ),

from which it follows (why?) that the optimal q satisfies q = p.
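A quick numerical sanity check of this trick (illustrative, not from the notes): for an arbitrary non-negative vector c, the objective Σ_i c_i ln(q_i) evaluated at q = p = c / Σ_j c_j dominates its value at randomly drawn pmfs.

```python
import numpy as np

rng = np.random.default_rng(0)
c = np.array([3.0, 1.0, 4.0, 2.0])        # arbitrary non-negative weights (illustrative)
p = c / c.sum()                           # the claimed maximizer

def objective(q):
    return np.sum(c * np.log(q))

# The objective at q = p should dominate its value at any other pmf.
for _ in range(1000):
    q = rng.dirichlet(np.ones(len(c)))    # a random pmf
    assert objective(q) <= objective(p) + 1e-12
```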