The Expectation-Maximization Algorithm


1  The Expectation-Maximization Algorithm
Mihaela van der Schaar, Department of Engineering Science, University of Oxford
Outline: EM & Latent Variable Models; Gaussian Mixture Models; EM Theory

2  MLE for Latent Variable Models - Latent Variables and Marginal Likelihoods
Many probabilistic models have hidden variables that are not observable in the dataset $D$; these models are known as latent variable models. Examples: hidden Markov models and mixture models.
How would MLE be carried out for such models? Each data point is drawn from a joint distribution $P_\theta(X, Z)$. For a realization $((X_1, Z_1), \dots, (X_n, Z_n))$, we only observe the variables in the dataset $D = (X_1, \dots, X_n)$.
Complete-data likelihood:
$$P_\theta((X_1, Z_1), \dots, (X_n, Z_n)) = \prod_{i=1}^{n} P_\theta(X_i, Z_i)$$
Marginal likelihood:
$$P_\theta(X_1, \dots, X_n) = \prod_{i=1}^{n} \sum_{z} P_\theta(X_i, Z_i = z)$$
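As a concrete instance of these two quantities (a worked example added here, not taken from the slides), consider a two-component scalar Gaussian mixture with $Z_i \in \{1, 2\}$ and mixing weight $\pi$:
$$P_\theta(X_i, Z_i = k) = \pi_k\, \mathcal{N}(X_i \mid \mu_k, \sigma_k^2), \qquad \pi_1 = \pi,\ \pi_2 = 1 - \pi,$$
$$P_\theta(X_i) = \sum_{k=1}^{2} P_\theta(X_i, Z_i = k) = \pi\, \mathcal{N}(X_i \mid \mu_1, \sigma_1^2) + (1 - \pi)\, \mathcal{N}(X_i \mid \mu_2, \sigma_2^2).$$
The complete-data likelihood keeps the label $Z_i$ and picks out a single Gaussian term; the marginal likelihood sums the label out and produces the familiar mixture density.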

3  MLE for Latent Variable Models - The Hardness of Maximizing Marginal Likelihoods (I)
The MLE is obtained by maximizing the marginal likelihood:
$$\hat{\theta}_n = \arg\max_{\theta \in \Theta} \sum_{i=1}^{n} \log\left(\sum_{z} P_\theta(X_i, Z_i = z)\right)$$
Solving this optimization problem is often a hard task: the objective is non-convex, has many local maxima, and admits no analytic solution.
Figure: $\log(P_\theta(X))$ plotted against $\theta$ for the complete-data likelihood (left) and the marginal likelihood (right).
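The local-maxima claim is easy to check numerically. The sketch below is an illustration added for this write-up (the data values and the parameter grid are arbitrary): it evaluates the marginal log-likelihood of a two-component 1-D Gaussian mixture along the symmetric slice $\mu_1 = m$, $\mu_2 = -m$ with everything else fixed. The resulting profile has two separated peaks (near $m = \pm 4.5$) and a deep dip at $m = 0$, and a function that is concave along every line cannot behave like that.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    # Scalar Gaussian density N(x | mu, sigma^2), vectorized over x.
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)

def slice_loglik(m, data, sigma=1.0, pi=0.5):
    # Marginal log-likelihood of a two-component mixture with means at +m and -m.
    mix = pi * normal_pdf(data, m, sigma) + (1.0 - pi) * normal_pdf(data, -m, sigma)
    return np.sum(np.log(mix))

data = np.array([-5.0, -4.0, 4.0, 5.0])          # two well-separated clusters
grid = np.linspace(-8.0, 8.0, 401)
profile = np.array([slice_loglik(m, data) for m in grid])

print("maximizer on the grid:", grid[np.argmax(profile)])
print("log-lik at the maximizer:", profile.max(), "   at m = 0:", slice_loglik(0.0, data))
```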

4  MLE for Latent Variable Models - The Hardness of Maximizing Marginal Likelihoods (II)
The MLE for $\theta$ is obtained by maximizing the marginal log-likelihood function:
$$\hat{\theta}_n = \arg\max_{\theta \in \Theta} \sum_{i=1}^{n} \log\left(\sum_{z} P_\theta(X_i, Z_i = z)\right)$$
Solving this optimization problem is often a hard task: the methods used in the previous lecture would not work, and we need a simpler, approximate procedure.
The Expectation-Maximization algorithm is an iterative algorithm that computes an approximate solution to the MLE optimization problem.

5  MLE for Latent Variable Models - Exponential Families (I)
The EM algorithm is well suited for exponential-family distributions.
Exponential family: a single-parameter exponential family is a set of probability distributions that can be expressed in the form
$$P_\theta(X) = h(X) \exp\big(\eta(\theta)\, T(X) - A(\theta)\big),$$
where $h(X)$, $A(\theta)$ and $T(X)$ are known functions. An alternative, equivalent form is often given as
$$P_\theta(X) = h(X)\, g(\theta) \exp\big(\eta(\theta)\, T(X)\big).$$
The variable $\theta$ is called the parameter of the family.

6  MLE for Latent Variable Models - Exponential Families (II)
Exponential-family distributions: $P_\theta(X) = h(X) \exp\big(\eta(\theta)\, T(X) - A(\theta)\big)$
$T(X)$ is a sufficient statistic of the distribution: a function of the data that fully summarizes the data $X$ within the density function $P_\theta(X)$. This means that for any data sets $D_1$ and $D_2$, the density function is the same whenever $T(D_1) = T(D_2)$, even if $D_1$ and $D_2$ are quite different.
The sufficient statistic of a set of independent, identically distributed observations is simply the sum of the individual sufficient statistics, i.e. $T(D) = \sum_{i=1}^{n} T(X_i)$.
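A quick numerical check of this point (an example added here, not from the slides): the two datasets below are different, but they share the same sample size and the same sufficient statistics $(\sum_i X_i, \sum_i X_i^2)$ for the Gaussian family, so their Gaussian log-likelihoods agree at every $(\mu, \sigma)$.

```python
import numpy as np

def gaussian_loglik(data, mu, sigma):
    # log prod_i N(x_i | mu, sigma^2); depends on the data only through n, sum(x), sum(x^2).
    n = len(data)
    return -n * np.log(np.sqrt(2.0 * np.pi) * sigma) - np.sum((data - mu) ** 2) / (2.0 * sigma ** 2)

D1 = np.array([0.0, 3.0, 3.0])   # sum = 6, sum of squares = 18
D2 = np.array([1.0, 1.0, 4.0])   # sum = 6, sum of squares = 18

for mu, sigma in [(0.0, 1.0), (2.0, 1.5), (-1.0, 3.0)]:
    print(mu, sigma, gaussian_loglik(D1, mu, sigma), gaussian_loglik(D2, mu, sigma))
# The last two columns agree for every (mu, sigma) tried.
```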

7  MLE for Latent Variable Models - Exponential Families (III)
Exponential-family distributions: $P_\theta(X) = h(X) \exp\big(\eta(\theta)\, T(X) - A(\theta)\big)$
$\eta(\theta)$ is called the natural parameter; the set of values of $\eta(\theta)$ for which $P_\theta(X)$ is finite is called the natural parameter space.
$A(\theta)$ is called the log-partition function; the mean, variance and other moments of the sufficient statistic $T(X)$ can be derived by differentiating $A(\theta)$.

8  MLE for Latent Variable Models - Exponential Families (IV)
Exponential-family example: the normal distribution.
$$P_\theta(X) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(X - \mu)^2}{2\sigma^2}\right)
= \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{X^2 - 2X\mu + \mu^2}{2\sigma^2} - \log(\sigma)\right)
= \frac{1}{\sqrt{2\pi}} \exp\left(\left[\frac{\mu}{\sigma^2},\, -\frac{1}{2\sigma^2}\right] \cdot \left[X,\, X^2\right]^{T} - \left(\frac{\mu^2}{2\sigma^2} + \log(\sigma)\right)\right)$$
$$\eta(\theta) = \left[\frac{\mu}{\sigma^2},\, -\frac{1}{2\sigma^2}\right]^{T}, \quad h(X) = (2\pi)^{-\frac{1}{2}}, \quad T(X) = \left[X,\, X^2\right]^{T}, \quad A(\theta) = \frac{\mu^2}{2\sigma^2} + \log(\sigma)$$
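To connect this example with the moment property stated on the previous slide, rewrite $A$ in terms of the natural parameters $\eta_1 = \mu/\sigma^2$ and $\eta_2 = -1/(2\sigma^2)$ and differentiate (a short derivation added here for completeness; it is not on the original slide):
$$A(\eta) = -\frac{\eta_1^2}{4\eta_2} - \frac{1}{2}\log(-2\eta_2), \qquad \frac{\partial A}{\partial \eta_1} = -\frac{\eta_1}{2\eta_2} = \mu = E[X], \qquad \frac{\partial A}{\partial \eta_2} = \frac{\eta_1^2}{4\eta_2^2} - \frac{1}{2\eta_2} = \mu^2 + \sigma^2 = E[X^2],$$
which are exactly the means of the two components of the sufficient statistic $T(X) = [X, X^2]^T$.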

9  MLE for Latent Variable Models - Exponential Families (V)
Properties of exponential families:
Exponential families have sufficient statistics that can summarize arbitrary amounts of independent, identically distributed data using a fixed number of values.
Exponential families have conjugate priors, an important property in Bayesian statistics.
The posterior predictive distribution of an exponential-family random variable with a conjugate prior can always be written in closed form.

10  MLE for Latent Variable Models - Exponential Families (VI)
The canonical form of exponential families: if $\eta(\theta) = \theta$, the exponential family is said to be in canonical form. The canonical form is non-unique, since $\eta(\theta)$ can be multiplied by any nonzero constant, provided that $T(X)$ is multiplied by that constant's reciprocal; alternatively, a constant $c$ can be added to $\eta(\theta)$ and $h(X)$ multiplied by $\exp(-c\, T(X))$ to offset it.

11  EM: The Algorithm - Expectation-Maximization (I)
Two unknowns: the latent variables $Z = (Z_1, \dots, Z_n)$ and the parameter $\theta$.
Complications arise because we do not know the latent variables $(Z_1, \dots, Z_n)$; if they were observed, maximizing the complete-data likelihood $P_\theta((X_1, Z_1), \dots, (X_n, Z_n))$ would often be a simpler task.
Recall that maximizing the complete-data likelihood is often simpler than maximizing the marginal likelihood.
Figure: $\log(P_\theta(X))$ plotted against $\theta$ for the complete-data likelihood (left) and the marginal likelihood (right).

12  EM: The Algorithm - Expectation-Maximization (II)
The EM algorithm:
1. Start with an initial guess $\hat{\theta}^{(0)}$ for $\theta$. For every iteration $t$, do the following:
2. E-step: $Q(\theta, \hat{\theta}^{(t)}) = \sum_{z} \log\big(P_\theta(Z = z, D)\big)\, P(Z = z \mid D, \hat{\theta}^{(t)})$
3. M-step: $\hat{\theta}^{(t+1)} = \arg\max_{\theta \in \Theta} Q(\theta, \hat{\theta}^{(t)})$
4. Go to step 2 if the stopping criterion is not met.

13  EM: The Algorithm - Expectation-Maximization (III)
Two unknowns: the latent variables $Z = (Z_1, \dots, Z_n)$ and the parameter $\theta$.
Expected likelihood: $\sum_{z} \log\big(P_\theta(Z = z, D)\big)\, P(Z = z \mid D, \theta)$
Here the logarithm acts directly on the complete-data likelihood, so the corresponding M-step will be tractable.
Figure: $\log(P_\theta(X))$ plotted against $\theta$ for the complete-data likelihood (left) and the marginal likelihood (right).

14  EM: The Algorithm - Expectation-Maximization (III)
Two unknowns: the latent variables $Z = (Z_1, \dots, Z_n)$ and the parameter $\theta$.
Expected likelihood: $\sum_{z} \log\big(P_\theta(Z = z, D)\big)\, P(Z = z \mid D, \theta)$
Here the logarithm acts directly on the complete-data likelihood, so the corresponding M-step will be tractable. But we still have two terms, $\log\big(P_\theta(Z = z, D)\big)$ and $P(Z = z \mid D, \theta)$, that depend on the two unknowns $Z$ and $\theta$.
The EM algorithm:
E-step: fix the posterior $Z \mid D, \theta$ by conditioning on the current guess for $\theta$, i.e. use $Z \mid D, \hat{\theta}^{(t)}$.
M-step: update the guess for $\theta$ by solving a tractable optimization problem.
The EM algorithm breaks down the intractable MLE optimization problem into simpler, tractable iterative steps; a minimal worked sketch follows below.
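To make the two steps concrete, here is a minimal runnable sketch (an illustration added for this write-up, not code from the lecture) of EM for the simplest latent variable model one can write down: a two-component 1-D Gaussian mixture in which the component means and the variance are known and only the mixing weight $\pi$ is unknown. The E-step computes the posterior over each $Z_i$ given the current $\pi$; the M-step maximizes the resulting expected complete-data log-likelihood, which here reduces to averaging the responsibilities.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)

# Known component parameters; only the mixing weight pi is treated as unknown.
MU1, MU2, SIGMA = -2.0, 3.0, 1.0

rng = np.random.default_rng(0)
true_pi = 0.7
z = rng.random(2000) < true_pi                       # latent labels (not shown to EM)
x = np.where(z, rng.normal(MU1, SIGMA, 2000), rng.normal(MU2, SIGMA, 2000))

pi_hat = 0.5                                         # initial guess theta_hat^(0)
for t in range(50):
    # E-step: posterior probability that each point came from component 1,
    # computed under the current guess pi_hat.
    p1 = pi_hat * normal_pdf(x, MU1, SIGMA)
    p2 = (1.0 - pi_hat) * normal_pdf(x, MU2, SIGMA)
    gamma = p1 / (p1 + p2)
    # M-step: maximizing Q over pi gives the average responsibility.
    pi_hat = gamma.mean()

print("estimated pi:", pi_hat, "   true pi:", true_pi)
```

With the means and covariances unknown as well, the M-step becomes the Gaussian mixture updates derived later in the lecture; the alternating structure stays exactly the same.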

15  EM: The Algorithm - EM for Exponential Family (I)
The critical points of the marginal likelihood function satisfy
$$\frac{\partial \log(P_\theta(D))}{\partial \theta} = \frac{1}{P_\theta(D)} \sum_{z} \frac{\partial P_\theta(D, Z = z)}{\partial \theta} = 0$$
For the complete-data likelihood written in the canonical form of the exponential family,
$$\frac{\partial \log(P_\theta(D, Z))}{\partial \theta} = \frac{\partial}{\partial \theta} \log\Big(\underbrace{h(D, Z)\, \exp\big(\langle \eta(\theta), T(D, Z)\rangle - A(\theta)\big)}_{\text{canonical form of exponential family}}\Big)$$
For $\eta(\theta) = \theta$, we have that
$$\frac{\partial P_\theta(D, Z)}{\partial \theta} = \left(T(D, Z) - \frac{\partial A(\theta)}{\partial \theta}\right) P_\theta(D, Z)$$

16  EM: The Algorithm - EM for Exponential Family (II)
For exponential families, $E_\theta[T(D, Z)] = \frac{\partial A(\theta)}{\partial \theta}$, so
$$\frac{\partial P_\theta(D, Z)}{\partial \theta} = \big(T(D, Z) - E_\theta[T(D, Z)]\big)\, P_\theta(D, Z)$$
Since $\frac{1}{P_\theta(D)} \sum_{z} \frac{\partial P_\theta(D, Z = z)}{\partial \theta} = 0$, we have that
$$\frac{1}{P_\theta(D)} \sum_{z} \big(T(D, Z = z) - E_\theta[T(D, Z)]\big)\, P_\theta(D, Z = z) = 0$$
$$\sum_{z} T(D, Z = z)\, \frac{P_\theta(D, Z = z)}{P_\theta(D)} - E_\theta[T(D, Z)] = 0$$
$$E_\theta[T(D, Z) \mid D] - E_\theta[T(D, Z)] = 0$$

17  EM: The Algorithm - EM for Exponential Family (III)
At the critical values of $\theta$, the following condition is satisfied:
$$E_\theta[T(D, Z) \mid D] = E_\theta[T(D, Z)]$$
How is this related to the EM objective $Q(\theta, \hat{\theta}^{(t)})$?
$$Q(\theta, \hat{\theta}^{(t)}) = \sum_{z} \log\big(P_\theta(Z = z, D)\big)\, P_{\hat{\theta}^{(t)}}(Z = z \mid D) = \theta\, E_{\hat{\theta}^{(t)}}[T(D, Z) \mid D] - A(\theta) + \text{const.}$$
$$\frac{\partial Q(\theta, \hat{\theta}^{(t)})}{\partial \theta} = 0 \;\Longleftrightarrow\; E_{\hat{\theta}^{(t)}}[T(D, Z) \mid D] = E_\theta[T(D, Z)]$$
Since it is difficult to solve this equation for $\theta$ analytically, the EM algorithm solves it via successive approximations, i.e. at iteration $t$ it solves the following for $\hat{\theta}^{(t+1)}$:
$$E_{\hat{\theta}^{(t)}}[T(D, Z) \mid D] = E_{\hat{\theta}^{(t+1)}}[T(D, Z)]$$

18  Multivariate Gaussian Mixture Models - Example: Multivariate Gaussian Mixtures
Parameters for a mixture of $K$ Gaussians: mixture proportions $\{\pi_k\}_{k=1}^{K}$, mean vectors and covariance matrices $\{(\mu_k, \Sigma_k)\}_{k=1}^{K}$.
Figure: contour plot for the density of a mixture of $K = 3$ bivariate Gaussian distributions (axes $X_1$, $X_2$).

19  Multivariate Gaussian Mixture Models - The Generative Process
$$Z_i = z \sim \text{Categorical}(\pi_1, \dots, \pi_K), \qquad X_i \sim \mathcal{N}(\mu_z, \Sigma_z)$$
Figure: a sample from a mixture model; every data point is colored according to its component membership (axes $X_1$, $X_2$).
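The generative process translates directly into code. A small sketch added for this write-up (the parameter values are arbitrary illustrations, not the lecture's) draws $n$ points from a $K$-component bivariate Gaussian mixture:

```python
import numpy as np

def sample_gmm(n, pis, mus, Sigmas, seed=0):
    """Draw n samples: Z_i ~ Categorical(pis), X_i | Z_i = z ~ N(mus[z], Sigmas[z])."""
    rng = np.random.default_rng(seed)
    z = rng.choice(len(pis), size=n, p=pis)                       # latent component labels
    x = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
    return x, z

# Arbitrary example parameters for K = 3 bivariate components.
pis = np.array([0.5, 0.3, 0.2])
mus = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 2.0]])
Sigmas = np.array([np.eye(2), 0.5 * np.eye(2), np.diag([1.0, 0.3])])

X, Z = sample_gmm(1000, pis, mus, Sigmas)
print(X.shape, np.bincount(Z) / len(Z))    # empirical proportions roughly match pis
```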

20  Multivariate Gaussian Mixture Models - The Dataset
We need to learn the parameters $(\pi_k, \mu_k, \Sigma_k)_{k=1}^{K}$ from the data points $D = (X_1, \dots, X_n)$, which are not colored by their component memberships, i.e. we do not observe the latent variables $Z = (Z_1, \dots, Z_n)$.
Figure: (a) $(D, Z)$: the data points and their component memberships; (b) $D$: the dataset with the observed data points only (component memberships are latent).

21  EM for Gaussian Mixture Models - MLE for Gaussian Mixture Models
The complete-data likelihood function is given by
$$P_\theta(D, Z) = \prod_{i=1}^{n} \pi_{z_i}\, \mathcal{N}(X_i \mid \mu_{z_i}, \Sigma_{z_i})$$
The marginal likelihood function is
$$P_\theta(D) = \prod_{i=1}^{n} \sum_{k=1}^{K} \pi_k\, \mathcal{N}(X_i \mid \mu_k, \Sigma_k)$$
The MLE can be obtained by maximizing the marginal log-likelihood function:
$$\hat{\theta}_n = \arg\max_{\theta \in \Theta} \sum_{i=1}^{n} \log\left(\sum_{k=1}^{K} \pi_k\, \mathcal{N}(X_i \mid \mu_k, \Sigma_k)\right)$$
Exercise: is the objective function above concave?
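For later use (monitoring convergence of EM), here is a sketch of a helper that evaluates this marginal log-likelihood; the function and variable names are chosen for this write-up, not taken from the lecture.

```python
import numpy as np

def log_gaussian(X, mu, Sigma):
    # Row-wise log N(x | mu, Sigma) for an (n, d) array X.
    d = mu.shape[0]
    diff = X - mu
    _, logdet = np.linalg.slogdet(Sigma)
    maha = np.sum(diff * np.linalg.solve(Sigma, diff.T).T, axis=1)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)

def marginal_loglik(X, pis, mus, Sigmas):
    # log P_theta(D) = sum_i log sum_k pi_k N(X_i | mu_k, Sigma_k), via the log-sum-exp trick.
    logp = np.stack([np.log(pis[k]) + log_gaussian(X, mus[k], Sigmas[k])
                     for k in range(len(pis))], axis=1)           # shape (n, K)
    m = logp.max(axis=1, keepdims=True)
    return np.sum(m.squeeze(1) + np.log(np.exp(logp - m).sum(axis=1)))
```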

22  EM for Gaussian Mixture Models - Implementing EM for the Gaussian Mixture Model (I)
The expected complete-data log-likelihood function is
$$E_Z\big[\log P_\theta(D, Z)\big] = \sum_{i=1}^{n} \sum_{k=1}^{K} \gamma(k, X_i \mid \theta)\, \big(\log(\pi_k) + \log\big(\mathcal{N}(X_i \mid \mu_k, \Sigma_k)\big)\big), \qquad \gamma(k, X_i \mid \theta) = P_\theta(Z_i = k \mid X_i)$$
$\gamma(k, X_i \mid \theta)$ is called the responsibility of component $k$ towards data point $X_i$:
$$\gamma(k, X_i \mid \theta) = \frac{\pi_k\, \mathcal{N}(X_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(X_i \mid \mu_j, \Sigma_j)}$$
Try to work out the derivation above yourself!
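A vectorized sketch of the responsibility computation, reusing the log_gaussian helper from the sketch after the previous slide (the names are my own; the computation is done in log space to avoid underflow, a numerical detail the slide's formula does not need):

```python
import numpy as np

def responsibilities(X, pis, mus, Sigmas):
    # gamma[i, k] = pi_k N(X_i | mu_k, Sigma_k) / sum_j pi_j N(X_i | mu_j, Sigma_j).
    # Assumes the log_gaussian helper defined in the previous sketch.
    logp = np.stack([np.log(pis[k]) + log_gaussian(X, mus[k], Sigmas[k])
                     for k in range(len(pis))], axis=1)           # shape (n, K)
    logp -= logp.max(axis=1, keepdims=True)                       # stabilize before exponentiating
    gamma = np.exp(logp)
    return gamma / gamma.sum(axis=1, keepdims=True)
```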

23  EM for Gaussian Mixture Models - Implementing EM for the Gaussian Mixture Model (II)
(E-step) Approximate the expected complete-data log-likelihood by fixing the responsibilities $\gamma(k, X_i \mid \theta)$ using the parameter estimates obtained from the previous iteration:
$$Q(\theta, \hat{\theta}^{(t)}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \gamma(k, X_i \mid \hat{\theta}^{(t)})\, \big(\log(\pi_k) + \log\big(\mathcal{N}(X_i \mid \mu_k, \Sigma_k)\big)\big)$$
$$\gamma(k, X_i \mid \hat{\theta}^{(t)}) = \frac{\hat{\pi}_k^{(t)}\, \mathcal{N}(X_i \mid \hat{\mu}_k^{(t)}, \hat{\Sigma}_k^{(t)})}{\sum_{j=1}^{K} \hat{\pi}_j^{(t)}\, \mathcal{N}(X_i \mid \hat{\mu}_j^{(t)}, \hat{\Sigma}_j^{(t)})}$$
(M-step) Solve a tractable optimization problem:
$$(\hat{\pi}^{(t+1)}, \hat{\mu}^{(t+1)}, \hat{\Sigma}^{(t+1)}) = \arg\max_{(\pi, \mu, \Sigma)} \sum_{i=1}^{n} \sum_{k=1}^{K} \gamma(k, X_i \mid \hat{\theta}^{(t)})\, \big(\log(\pi_k) + \log\big(\mathcal{N}(X_i \mid \mu_k, \Sigma_k)\big)\big)$$

24  EM for Gaussian Mixture Models - Implementing EM for the Gaussian Mixture Model (III)
The M-step yields the following parameter updating equations:
$$\hat{\pi}_k^{(t+1)} = \frac{1}{n} \sum_{i=1}^{n} \gamma(k, X_i \mid \hat{\theta}^{(t)})$$
$$\hat{\mu}_k^{(t+1)} = \frac{\sum_{i=1}^{n} \gamma(k, X_i \mid \hat{\theta}^{(t)})\, X_i}{\sum_{j=1}^{n} \gamma(k, X_j \mid \hat{\theta}^{(t)})}$$
$$\hat{\Sigma}_k^{(t+1)} = \frac{\sum_{i=1}^{n} \gamma(k, X_i \mid \hat{\theta}^{(t)})\, (X_i - \hat{\mu}_k^{(t+1)})(X_i - \hat{\mu}_k^{(t+1)})^{T}}{\sum_{j=1}^{n} \gamma(k, X_j \mid \hat{\theta}^{(t)})}$$
Try to work out the updating equations by yourself!
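A direct transcription of these updates into a vectorized helper (my own sketch; the small ridge added to each covariance is a common numerical safeguard and is not part of the slide's equations):

```python
import numpy as np

def m_step(X, gamma, ridge=1e-6):
    """M-step updates for a Gaussian mixture, given responsibilities gamma of shape (n, K)."""
    n, d = X.shape
    Nk = gamma.sum(axis=0)                              # effective number of points per component
    pis = Nk / n                                        # pi_k^(t+1) = (1/n) sum_i gamma_ik
    mus = (gamma.T @ X) / Nk[:, None]                   # responsibility-weighted means
    Sigmas = np.empty((len(Nk), d, d))
    for k in range(len(Nk)):
        diff = X - mus[k]
        Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + ridge * np.eye(d)
    return pis, mus, Sigmas
```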

25  EM for Gaussian Mixture Models - EM in Practice
Consider a Gaussian mixture model with $K = 3$ and the following parameters:
$\pi_1 = 0.6$, $\pi_2 = 0.05$, $\pi_3 = 0.35$
$\mu_1 = [1.4, 1.8]^T$, $\mu_2 = [1.4, 2.8]^T$, $\mu_3 = [1.9, 0.55]^T$
and $2 \times 2$ covariance matrices $\Sigma_1$, $\Sigma_2$, $\Sigma_3$ (with entries on the order of 0.8, 1.2, 2.3 and 0.4).
Try writing MATLAB code that generates a random dataset of 5000 data points drawn from the model specified above, and implement the EM algorithm to learn the model parameters from this dataset; a sketch of the same exercise appears below.
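Below is one way to carry out this exercise in Python rather than MATLAB, as a sketch added for this write-up. It strings together the helpers sketched earlier (sample_gmm after slide 19, marginal_loglik after slide 21, responsibilities after slide 22, m_step after slide 24), and the "true" parameters are illustrative stand-ins, since the slide's covariance matrices are not fully legible in this transcription.

```python
import numpy as np

# Illustrative "true" parameters (stand-ins for the slide's exact values).
true_pis = np.array([0.6, 0.05, 0.35])
true_mus = np.array([[1.4, 1.8], [1.4, 2.8], [1.9, 0.55]])
true_Sigmas = np.array([0.8 * np.eye(2), np.diag([1.2, 2.3]), 0.4 * np.eye(2)])

# Generate the dataset of 5000 points with the sampler sketched after slide 19.
X, _ = sample_gmm(5000, true_pis, true_mus, true_Sigmas, seed=1)

# Crude initialization: uniform weights, random data points as means, identity covariances.
K = 3
rng = np.random.default_rng(2)
pis = np.full(K, 1.0 / K)
mus = X[rng.choice(len(X), size=K, replace=False)]
Sigmas = np.array([np.eye(2) for _ in range(K)])

for t in range(100):
    gamma = responsibilities(X, pis, mus, Sigmas)       # E-step
    pis, mus, Sigmas = m_step(X, gamma)                 # M-step
    if t % 10 == 0:
        print(t, marginal_loglik(X, pis, mus, Sigmas))  # should never decrease

print("estimated mixing proportions:", np.sort(pis))
```

Printing the marginal log-likelihood every few iterations makes the monotone improvement discussed on the next slide directly visible.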

26  EM for Gaussian Mixture Models - EM in Practice
The marginal log-likelihood is non-decreasing after every EM iteration, so each new iteration produces an estimate at least as good (in likelihood) as the previous one.
Figure: log-likelihood plotted against EM iteration.

27  EM for Gaussian Mixture Models - EM in Practice
Compare the true density function with the estimated one.
Figure: contour plot for the true density function (left) and for the estimated density function (right), over axes $X_1$, $X_2$.

28  EM Performance Guarantees - What Does EM Guarantee?
The EM algorithm does not guarantee that $\hat{\theta}^{(t)}$ will converge to $\hat{\theta}_n$. EM guarantees the following:
$\hat{\theta}^{(t)}$ always converges (to a local optimum).
Every iteration improves the marginal likelihood $P_{\hat{\theta}^{(t)}}(D)$.
Does the initial value matter?
1. The initial value $\hat{\theta}^{(0)}$ affects the speed of convergence and the value to which $\hat{\theta}^{(t)}$ converges, so smart initialization methods are often needed.
2. The K-means algorithm is often used to initialize the parameters of a Gaussian mixture model before applying the EM algorithm; a sketch of this heuristic follows below.
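A minimal sketch of that initialization heuristic, added for this write-up (hand-rolled Lloyd iterations rather than any particular library routine): run a few K-means passes and turn the resulting hard clusters into initial mixture weights, means and covariances for EM.

```python
import numpy as np

def kmeans_init(X, K, n_iter=20, seed=0, ridge=1e-6):
    """Initialize GMM parameters from a crude K-means clustering of X (shape (n, d))."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(n, size=K, replace=False)].copy()
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                      # hard assignments
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)   # recompute cluster centers
    pis = np.array([(labels == k).mean() for k in range(K)])
    Sigmas = np.array([np.cov(X[labels == k].T) + ridge * np.eye(d)
                       if (labels == k).sum() > d else np.eye(d)
                       for k in range(K)])
    return pis, centers, Sigmas
```

The returned (pis, centers, Sigmas) can replace the crude random initialization used in the driver sketch after slide 25.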

29  References
1. Robert W. Keener, Statistical Theory: Notes for a Course in Theoretical Statistics.
2. Robert W. Keener, Theoretical Statistics: Topics for a Core Course.
3. Christopher Bishop, Pattern Recognition and Machine Learning, 2007.
