An Introduction to Expectation-Maximization


Dahua Lin

Abstract. This note reviews the basics of the Expectation-Maximization (EM) algorithm, a popular approach to model estimation for generative models with latent variables. We first describe the E-steps and M-steps, and then use the finite mixture model as an example to illustrate the procedure in practice. Finally, we discuss its intrinsic relation to an optimization problem, which reveals the nature of E-M.

1 The Expectation and Maximization

Consider a generative model with parameter $\theta$. The model generates a set of data $D$, which comprises two parts: (1) the observed data $X$, and (2) the latent data $Z$, which is not observed. With this model, the complete likelihood of $D$ is given by

    p(D \mid \theta) = p(X, Z \mid \theta),   (1)

and the marginal likelihood of the observed data $X$ is given by

    p(X \mid \theta) = \sum_z p(X, z \mid \theta).   (2)

Here, we sum over all possible assignments $z$ to the variable $Z$. If $Z$ is continuous, the sum is replaced by an integral over the domain of $Z$.

The maximum likelihood estimation (MLE) of $\theta$ given $X$ is to find the parameter $\hat{\theta}$ that maximizes the marginal likelihood, as

    \hat{\theta} = \arg\max_{\theta \in \Theta} p(X \mid \theta) = \arg\max_{\theta \in \Theta} \log p(X \mid \theta).   (3)

Here, $\Theta$ is the parameter domain, i.e. the set of all valid parameters. In practice, it is usually easier to work with the log-likelihood than with the likelihood itself.

For many problems, including all the examples that we shall see later, the size of the domain of $Z$ grows exponentially as the problem scale increases, making it computationally intractable to exactly evaluate (or even optimize) the marginal likelihood as above. The expectation-maximization (E-M) algorithm was developed to address this issue; it provides an iterative approach to perform MLE. The E-M algorithm, as described below, alternates between E-steps and M-steps until convergence.

1. Initialize $\hat{\theta}^{(0)}$. One can simply set $\hat{\theta}^{(0)}$ to some random value in $\Theta$, or employ some problem-specific heuristic to derive an initial guess.

2. For $t = 1, 2, \ldots$, repeat:

   a. E-step. Compute the posterior distribution of $Z$ given $X$ and $\hat{\theta}^{(t-1)}$, as

       q^{(t)}(Z) = p(Z \mid X; \hat{\theta}^{(t-1)}).   (4)

      Here, we use $q^{(t)}$ to denote the posterior distribution of $Z$ obtained at the $t$-th iteration.

   b. M-step. Solve for the optimal $\hat{\theta}^{(t)}$ by maximizing the expectation of the complete log-likelihood with respect to $q^{(t)}$, as

       \hat{\theta}^{(t)} = \arg\max_{\theta \in \Theta} E_{q^{(t)}}[\log p(X, Z \mid \theta)] = \arg\max_{\theta \in \Theta} \sum_z q^{(t)}(z) \log p(X, z \mid \theta).   (5)

      The computation of this sum can often be greatly simplified by taking advantage of the independence between variables, as we shall see in the next section.

3. The iterations stop when some convergence criterion is met. For example, they can be stopped when the difference between $\hat{\theta}^{(t)}$ and $\hat{\theta}^{(t-1)}$ falls below some threshold.
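To make the loop structure concrete, here is a minimal sketch of the outer E-M iteration in Python (an illustration, not part of the original note). The callables `e_step` and `m_step` are hypothetical placeholders for the model-specific computations in Eqs. (4) and (5), and the parameter is assumed to be a flat NumPy array so that the stopping test of step 3 is a simple comparison.

```python
import numpy as np

def run_em(X, theta0, e_step, m_step, max_iter=100, tol=1e-6):
    """Generic E-M loop (steps 1-3 above).

    theta0 is an initial parameter vector (assumed to be a flat NumPy array
    so the convergence test is simple); e_step(X, theta) returns the posterior
    q(Z) of Eq. (4), and m_step(X, q) returns the maximizer of Eq. (5).
    Both callables are model-specific placeholders.
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        q = e_step(X, theta)                  # E-step: q(Z) = p(Z | X; theta)
        theta_new = np.asarray(m_step(X, q))  # M-step: maximize E_q[log p(X, Z | theta)]
        if np.max(np.abs(theta_new - theta)) < tol:  # stopping criterion of step 3
            return theta_new
        theta = theta_new
    return theta
```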

2 E-M in Practice: Estimation of Finite Mixture Models

One of the most important applications of E-M is the maximum likelihood estimation of finite mixture models. Generally, a finite mixture model is composed of $K$ component models, whose parameters are respectively $\theta_1, \ldots, \theta_K$, and a prior distribution over these components, denoted by $\pi$. The probability density function (pdf) of this finite mixture model is given by

    p(x \mid M) = \sum_{k=1}^K \pi_k \, p_c(x \mid \theta_k).   (6)

Here, we use $M = (\pi, \theta_1, \ldots, \theta_K)$ to denote all parameters of the mixture model, and $p_c$ is the pdf of a component model with parameter $\theta_k$. When each component model is a Gaussian model, the finite mixture model is called a Gaussian mixture model.

Generating a sample $x$ from a finite mixture model as described above can be done in two steps: (1) draw an indicator variable $z$ from the prior distribution $\pi$, as $z \sim \pi = (\pi_1, \ldots, \pi_K)$; here $z$ takes an integer value in $\{1, \ldots, K\}$, indicating which particular component is going to be used to generate $x$; (2) given $z$, draw $x$ from the $z$-th component model. Thus, we can consider this as a process that generates two values: $z$ and $x$.

Given a set of samples $X = \{x_1, \ldots, x_n\}$, we can treat the associated set of indicators $Z = \{z_1, \ldots, z_n\}$ as latent variables. To estimate the model parameters, we can use the E-M algorithm. To derive the E-steps and M-steps, we go through three steps as follows.

1. Write down the complete log-likelihood. The complete log-likelihood is the log of the joint pdf of both the observed and latent variables, given the model parameters. For a finite mixture model, it is given by

       \log p(X, Z \mid M) = \sum_{i=1}^n \log\big( p(x_i \mid z_i; M)\, p(z_i \mid M) \big).   (7)

   Here, $p(x_i \mid z_i; M) = p_c(x_i \mid \theta_{z_i})$ and $p(z_i \mid M) = \pi_{z_i}$. Substituting these into the formula above results in

       \log p(X, Z \mid M) = \sum_{i=1}^n \big( \log \pi_{z_i} + \log p_c(x_i \mid \theta_{z_i}) \big).   (8)

   This form is not easy to work with when deriving the M-steps. The general trick is to introduce an indicator function $\delta_k$, defined by

       \delta_k(z) = \begin{cases} 1 & (z = k) \\ 0 & (z \neq k) \end{cases}.   (9)

   With this notation, we can rewrite the complete log-likelihood as

       \log p(X, Z \mid M) = \sum_{i=1}^n \sum_{k=1}^K \delta_k(z_i) \big( \log \pi_k + \log p_c(x_i \mid \theta_k) \big).   (10)

2. Derive the E-steps. Given the model parameters $M$ and $x_i$, the indicator $z_i$ is independent of the other samples, and as a result we have

       p(Z \mid X; M) = \prod_{i=1}^n p(z_i \mid x_i; M).   (11)

   Following Bayes' rule, we get

       p(z_i = k \mid x_i; M) = \frac{\pi_k \, p_c(x_i \mid \theta_k)}{\sum_{l=1}^K \pi_l \, p_c(x_i \mid \theta_l)}.   (12)

   The model parameters $M$ are updated at each iteration $t$. To simplify notation, we use $M^{(t)}$ to denote the model parameters obtained at iteration $t$, and $q^{(t)}$ to denote the posterior distribution obtained based on $M^{(t-1)}$. Then we have

       q^{(t)}(Z) = \prod_{i=1}^n q_i^{(t)}(z_i), \quad \text{with} \quad q_i^{(t)}(z_i) = p(z_i \mid x_i; M^{(t-1)}).   (13)

3. Derive the M-steps. With respect to $q^{(t)}$, the expectation of the complete log-likelihood is given by

       E_{q^{(t)}}\Big[ \sum_{i=1}^n \sum_{k=1}^K \delta_k(z_i) \big( \log \pi_k + \log p_c(x_i \mid \theta_k) \big) \Big] = \sum_{i=1}^n \sum_{k=1}^K E_{q^{(t)}}\big[ \delta_k(z_i) \big( \log \pi_k + \log p_c(x_i \mid \theta_k) \big) \big].   (14)

   Here, each term satisfies

       E_{q^{(t)}}\big[ \delta_k(z_i) \big( \log \pi_k + \log p_c(x_i \mid \theta_k) \big) \big] = E_{q^{(t)}}\big[ \delta_k(z_i) \big] \big( \log \pi_k + \log p_c(x_i \mid \theta_k) \big),   (15)

   and since the value of $\delta_k(z_i)$ can only be either 0 or 1, we have

       E_{q^{(t)}}\big[ \delta_k(z_i) \big] = \Pr\big( \delta_k(z_i) = 1 \big) = q_i^{(t)}(k).   (16)

   Consequently, the expected complete log-likelihood can be written as

       \sum_{i=1}^n \sum_{k=1}^K q_i^{(t)}(k) \big( \log \pi_k + \log p_c(x_i \mid \theta_k) \big).   (17)

   Hence, the optimal value of $\theta_k$ at time $t$, denoted by $\hat{\theta}_k^{(t)}$, can be obtained by

       \hat{\theta}_k^{(t)} = \arg\max_{\theta_k} \sum_{i=1}^n q_i^{(t)}(k) \log p_c(x_i \mid \theta_k).   (18)

   This has a similar form to the ordinary MLE problem, except that the term for each sample is modulated by a coefficient $q_i^{(t)}(k)$. In addition, we have to estimate the value of $\pi$, which can be done by solving the following problem:

       \hat{\pi}^{(t)} = \arg\max_{\pi} \sum_{i=1}^n \sum_{k=1}^K q_i^{(t)}(k) \log \pi_k, \quad \text{s.t.} \quad \sum_{k=1}^K \pi_k = 1, \ \pi_k \geq 0, \ k = 1, \ldots, K.   (19)

   It is not difficult to derive (with Lagrange multipliers) that the optimal solution to this problem is given by

       \hat{\pi}_k^{(t)} = \frac{1}{n} \sum_{i=1}^n q_i^{(t)}(k).   (20)

Summary: the E-step calculates the posterior probabilities of $z_i$, as

    q_i^{(t)}(z_i = k) = p(z_i = k \mid x_i; M^{(t-1)}) = \frac{\pi_k \, p_c(x_i \mid \theta_k)}{\sum_{l=1}^K \pi_l \, p_c(x_i \mid \theta_l)}.   (21)

The M-step optimizes the model parameters, as

    \hat{\theta}_k^{(t)} = \arg\max_{\theta_k} \sum_{i=1}^n q_i^{(t)}(k) \log p_c(x_i \mid \theta_k),   (22)

and

    \hat{\pi}_k^{(t)} = \frac{1}{n} \sum_{i=1}^n q_i^{(t)}(k).   (23)

In general, Eq.(21) and Eq.(23) are the same for all finite mixture models, while Eq.(22) depends on the particular form of the component model.
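The summary above translates almost directly into code. The sketch below (an illustration, not from the original note) implements the model-independent parts, Eq. (21) and Eq. (23), for an arbitrary finite mixture: the component density $p_c$ is passed in as a function, and the component-specific update of Eq. (22) is left to the examples that follow.

```python
import numpy as np

def mixture_e_step(X, pi, thetas, component_pdf):
    """Eq. (21): responsibilities q_i(k) for a generic finite mixture.

    X:             (n, ...) array of samples
    pi:            (K,) mixing proportions
    thetas:        list of K component parameters (whatever component_pdf expects)
    component_pdf: hypothetical interface; (X, theta_k) -> (n,) densities p_c(x_i | theta_k)
    Returns an (n, K) array whose rows sum to one.
    """
    weighted = np.column_stack(
        [pi[k] * component_pdf(X, thetas[k]) for k in range(len(pi))]
    )
    # Normalize over components: the denominator of Eq. (21).
    return weighted / weighted.sum(axis=1, keepdims=True)

def update_mixing_proportions(resp):
    """Eq. (23): pi_k = (1/n) * sum_i q_i(k)."""
    return resp.mean(axis=0)
```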

Here are some examples:

1. Univariate Gaussian distribution. Here, $\theta_k = (\mu_k, \sigma_k^2)$, and the pdf is given by

       p_c(x \mid \theta_k) = \frac{1}{\sqrt{2\pi\sigma_k^2}} \exp\Big( -\frac{(x - \mu_k)^2}{2\sigma_k^2} \Big).   (24)

   The optimal solution in the M-step is given by

       \hat{\mu}_k^{(t)} = \frac{1}{n \pi_k^{(t)}} \sum_{i=1}^n q_i^{(t)}(k)\, x_i, \quad \text{and} \quad \big(\hat{\sigma}_k^{(t)}\big)^2 = \frac{1}{n \pi_k^{(t)}} \sum_{i=1}^n q_i^{(t)}(k)\, \big(x_i - \hat{\mu}_k^{(t)}\big)^2.   (25)

   Here, $\pi_k^{(t)} = \frac{1}{n} \sum_{i=1}^n q_i^{(t)}(k)$.

2. Multivariate Gaussian distribution over $\mathbb{R}^d$. Here, $\theta_k = (\mu_k, \Sigma_k)$, and the pdf is given by

       p_c(x \mid \theta_k) = \frac{1}{\sqrt{(2\pi)^d |\Sigma_k|}} \exp\Big( -\frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) \Big).   (26)

   The optimal solution in the M-step is given by

       \hat{\mu}_k^{(t)} = \frac{1}{n \pi_k^{(t)}} \sum_{i=1}^n q_i^{(t)}(k)\, x_i, \quad \text{and} \quad \hat{\Sigma}_k^{(t)} = \frac{1}{n \pi_k^{(t)}} \sum_{i=1}^n q_i^{(t)}(k)\, \big(x_i - \hat{\mu}_k^{(t)}\big)\big(x_i - \hat{\mu}_k^{(t)}\big)^T.   (27)

   (A code sketch of this update appears after these examples.)

3. Simple topic model. Suppose each component model is a topic characterized by a word distribution, with $\theta_k(w)$ being the probability of generating the word $w$ from the $k$-th topic, and each sample $x$ is a bag of words, with $x(w)$ being the number of times the word $w$ appears in the bag. Then

       p_c(x \mid \theta_k) = \prod_{w \in W} \theta_k(w)^{x(w)}.   (28)

   Here, $W$ is the vocabulary, i.e. the set of all possible words. Under this model, the optimal solution in the M-step is given by

       \hat{\theta}_k^{(t)}(w) = \frac{1}{N_k} \sum_{i=1}^n q_i^{(t)}(k)\, x_i(w).   (29)

   Here, $N_k = \sum_{i=1}^n \sum_{w \in W} q_i^{(t)}(k)\, x_i(w)$ is the expected total number of words assigned to the $k$-th topic, which normalizes $\hat{\theta}_k^{(t)}$ to a distribution over $W$.
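As a concrete instance of Eq. (22), here is a sketch of the M-step updates for the multivariate Gaussian case, Eqs. (23) and (27); it consumes the responsibility matrix produced by an E-step such as the `mixture_e_step` sketch above (illustrative code, not part of the original note).

```python
import numpy as np

def gaussian_m_step(X, resp):
    """M-step for a multivariate Gaussian mixture (Eqs. 23 and 27).

    X:    (n, d) data matrix
    resp: (n, K) responsibilities q_i(k)
    Returns mixing proportions (K,), means (K, d), covariances (K, d, d).
    """
    n, d = X.shape
    K = resp.shape[1]
    n_k = resp.sum(axis=0)                     # n * pi_k: effective count per component
    pi = n_k / n                               # Eq. (23)
    mu = (resp.T @ X) / n_k[:, None]           # Eq. (27): responsibility-weighted means
    cov = np.empty((K, d, d))
    for k in range(K):
        diff = X - mu[k]                       # (n, d) deviations from the k-th mean
        cov[k] = (resp[:, k, None] * diff).T @ diff / n_k[k]  # Eq. (27): weighted covariance
    return pi, mu, cov
```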

3 An Optimization View: Relating E-M and MLE

In this section, we show that the E-M algorithm is actually maximizing the following objective function via coordinate ascent:

    Q(\theta; q) = E_q[\log p(X, Z \mid \theta)] + H(q) = E_q[\log p(X, Z \mid \theta)] - E_q[\log q(Z)].   (30)

Here, $q$ is a distribution over $Z$, and $\theta$ is the model parameter. $H(q)$ is the entropy of the distribution $q$, defined to be

    H(q) \triangleq -E_q[\log q(Z)] = -\sum_z q(z) \log q(z).   (31)

The complete log-likelihood here can be decomposed into

    \log p(X, Z \mid \theta) = \log p(X \mid \theta) + \log p(Z \mid X; \theta).   (32)

Substituting Eq.(32) into Eq.(30) leads to

    Q(\theta; q) = E_q[\log p(X \mid \theta)] + E_q[\log p(Z \mid X; \theta)] - E_q[\log q(Z)] = \log p(X \mid \theta) - \mathrm{KL}\big(q(Z) \,\|\, p(Z \mid X; \theta)\big).   (33)

Therefore, with $\theta$ fixed, we have

    \hat{q} = \arg\max_q Q(\theta; q) = \arg\max_q \, E_q[\log p(Z \mid X; \theta)] - E_q[\log q(Z)].   (34)

Here,

    E_q[\log p(Z \mid X; \theta)] - E_q[\log q(Z)] = E_q[\log p(Z \mid X; \theta) - \log q(Z)] = -\mathrm{KL}\big(q(Z) \,\|\, p(Z \mid X; \theta)\big).   (35)

Therefore, maximizing $Q(\theta; q)$ with $\theta$ fixed is equivalent to minimizing the Kullback-Leibler divergence between $q(Z)$ and the posterior distribution $p(Z \mid X; \theta)$, which attains its minimum (zero) when $q(Z) = p(Z \mid X; \theta)$. In other words, the optimal solution $\hat{q}$ to this problem is given by the posterior distribution $p(Z \mid X; \theta)$. Recall that the E-step computes exactly this posterior distribution. In this sense, the E-step can be considered as the step that maximizes $Q(\theta; q)$ with respect to $q$, with $\theta$ fixed.

From Eq.(30), we can see that with $q$ fixed, the optimal $\theta$ is given by

    \hat{\theta} = \arg\max_\theta Q(\theta; q) = \arg\max_\theta E_q[\log p(X, Z \mid \theta)].   (36)

This maximizes the expectation of the complete log-likelihood, which is exactly what the M-step does.

From the analysis above, we can see that E-M is a coordinate ascent procedure for maximizing $Q(\theta; q)$. In particular, the E-step optimizes $q$ with $\theta$ fixed, while the M-step optimizes $\theta$ with $q$ fixed. Consequently, the value of $Q(\theta; q)$ keeps increasing during this process, until it reaches a local optimum.

To gain further insight, we revisit Eq.(33) and see that $Q(\theta; q)$ equals the marginal log-likelihood of $X$ with respect to $\theta$ minus the KL divergence between $q(Z)$ and $p(Z \mid X; \theta)$. Since the KL divergence is always non-negative, $Q(\theta; q)$ is a lower bound of $\log p(X \mid \theta)$, and E-M maximizes this lower bound. Furthermore, it is worth noting that at each E-step, when we obtain the optimal $q$ by driving the KL divergence to zero, we close the gap between the lower bound and the true marginal log-likelihood with respect to $\hat{\theta}^{(t-1)}$. To make this explicit,

    Q(\hat{\theta}^{(t-1)}; \hat{q}^{(t)}) = \log p(X \mid \hat{\theta}^{(t-1)}).   (37)

Since $Q(\hat{\theta}^{(t-1)}; \hat{q}^{(t)}) \leq Q(\hat{\theta}^{(t)}; \hat{q}^{(t)}) \leq Q(\hat{\theta}^{(t)}; \hat{q}^{(t+1)}) = \log p(X \mid \hat{\theta}^{(t)})$, we have

    \log p(X \mid \hat{\theta}^{(t-1)}) \leq \log p(X \mid \hat{\theta}^{(t)}).   (38)

This shows that the actual value of the marginal log-likelihood of $X$ increases as $t$ increases, implying that the E-M algorithm is in fact maximizing the marginal log-likelihood of $X$, i.e. $\log p(X \mid \theta)$, which is the goal of MLE.
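To make the decomposition in Eq. (33) concrete, the following sketch (illustrative, not from the original note) numerically checks, for a tiny two-component univariate Gaussian mixture and a single observation, that $E_q[\log p(x, z \mid \theta)] + H(q) = \log p(x \mid \theta) - \mathrm{KL}(q \,\|\, p(z \mid x; \theta))$, and that the bound is tight exactly when $q$ is the posterior.

```python
import numpy as np
from scipy.stats import norm

# A tiny two-component univariate Gaussian mixture and one observation x.
pi = np.array([0.3, 0.7])
mu = np.array([-1.0, 2.0])
sigma = np.array([1.0, 0.5])
x = 0.4

joint = pi * norm.pdf(x, loc=mu, scale=sigma)   # p(x, z = k | theta) for k = 1, 2
log_marginal = np.log(joint.sum())              # log p(x | theta)
posterior = joint / joint.sum()                 # p(z | x; theta), the E-step result

def q_objective(q):
    """Q(theta; q) = E_q[log p(x, z | theta)] + H(q), as in Eq. (30)."""
    return np.sum(q * np.log(joint)) - np.sum(q * np.log(q))

def kl(q, p):
    """KL(q || p) for two discrete distributions with full support."""
    return np.sum(q * np.log(q / p))

q_arbitrary = np.array([0.5, 0.5])
# Eq. (33): Q(theta; q) = log p(x | theta) - KL(q || posterior), a lower bound for any q...
assert np.isclose(q_objective(q_arbitrary), log_marginal - kl(q_arbitrary, posterior))
# ...which becomes tight (equal to the marginal log-likelihood) when q is the posterior.
assert np.isclose(q_objective(posterior), log_marginal)
```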