Statistical Pattern Recognition


Statistical Pattern Recognition: Expectation Maximization (EM) and Mixture Models. Hamid R. Rabiee, Jafar Muhammadi, Mohammad J. Hosseini. Spring 2014. http://ce.sharif.edu/courses/92-93/2/ce725-2

Agenda: Expectation-Maximization (EM) Overview; EM Applications; EM Algorithm; EM Examples; Mixture Models; Gaussian Mixtures.

Expectation-Maximization (EM). The EM algorithm is a general technique for finding maximum likelihood estimators under missing (unobserved) data. EM is perhaps the most often used, and most often half-understood, algorithm for unsupervised learning. It is very intuitive, and many people rely on their intuition to apply the algorithm in different problem domains. The EM algorithm estimates the parameters of a model iteratively: starting from some initial guess, each iteration consists of an Expectation step and a Maximization step.

Missing Data Problem. Occurs whenever part of the data is unknown: either the data is intrinsically inaccessible (example: which component does a data point belong to in a mixture model?) or the data is lost / erroneous (example: some faulty / noisy process has generated the data). If the missing data is correlated in any way with the observed data, we can hope to extract information about the missing data from the observed. If the missing data is independent of the observed data, everything is lost.

EM Applications: Application Examples. PoS (Part of Speech) tagging. Complete data: a sentence (a sequence of words) and a corresponding sequence of PoS tags. Observed data: the sentence. Unobserved data: the sequence of tags. Model: an HMM with transition/emission probability tables. Model building with partial observations (we'll discuss this example today). Our goal is to build a probabilistic model whose parameters can be estimated from a set of training examples $x_1, x_2, \dots, x_n$, where the $x_i$ are i.i.d. (independently and identically distributed). Unfortunately, we only get to observe part of each training example: $x_i = (x_i^o, x_i^u)$, and we can only observe $x_i^o$. How do we build the model?

EM Applications: More Applications. Filling in missing data in samples; discovering the value of latent variables; estimating the parameters of HMMs; estimating the parameters of finite mixtures; unsupervised learning of clusters.

EM Algorithm: General Idea. Given a set of incomplete (observed) data, assume the observed data come from a specific model, and iterate the following steps until convergence. Expectation step: using some parameter values for that model, guess the missing (latent / unobserved) data. Maximization step: from the missing data and the observed data, find the most likely parameters. (Diagram: starting from an initial guess, the E step maps the current guess of the unknown parameters to a guess of the unknown hidden structure, and the M step maps the guessed hidden structure plus the observed structure back to updated parameters.)

EM Algorithm. Assumptions: suppose the observations are X, the latent data are Z, and the unknown parameters are θ. Initialization: initialize θ to some random value. General algorithm. E step: compute the best structure for Z given the current parameter values. M step: use the just-computed values of Z to compute a better estimate of the parameters.

EM Algorithm: Intuition. Consider a model $p(X \mid \theta)$. The MLE of $\theta$ can be found as $\hat{\theta} = \arg\max_\theta p(X \mid \theta) = \arg\max_\theta \log p(X \mid \theta)$. Sometimes there is a hidden variable $Z$, so the model is $p(X, Z \mid \theta)$, and marginalizing over the latent variable $Z$ yields $p(X \mid \theta) = \sum_z p(X, z \mid \theta)$. Now we can use MLE as before, but we need to perform the above summation, which may be computationally intractable. EM was proposed to address this issue.

EM Algorithm. Initialize $\theta^0$ with some random value in the domain of $\theta$. For $t = 1, 2, \dots$ repeat: E-step: compute the posterior distribution of $Z$ given $X$ and $\theta^{t-1}$, $q^t(Z) = p(Z \mid X; \theta^{t-1})$. M-step: find the optimal $\theta^t$ by maximizing the expectation of the complete log-likelihood with respect to $q^t(Z)$: $\theta^t = \arg\max_\theta \mathbb{E}_{q^t(Z)}\left[\log p(X, Z \mid \theta)\right] = \arg\max_\theta \sum_z q^t(z) \log p(X, z \mid \theta)$. The computation of this sum can often be greatly simplified by taking advantage of independence. The iterations stop when some convergence criterion is met, for example when the difference between $\theta^t$ and $\theta^{t-1}$ is below some threshold.
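As a generic illustration of this loop, here is a minimal Python sketch (not from the original slides). The functions `e_step` and `m_step` are hypothetical placeholders that a concrete model, such as the Gaussian mixture discussed later, would supply; θ is assumed to be a flat numpy array of parameters.

```python
import numpy as np

def run_em(X, theta0, e_step, m_step, tol=1e-6, max_iter=100):
    """Generic EM loop: alternate E and M steps until theta stops changing.

    e_step(X, theta) should return q, the posterior over the latent
    variables given the data and the current parameters.
    m_step(X, q) should return the theta that maximizes the expected
    complete-data log-likelihood under q.
    """
    theta = np.asarray(theta0, dtype=float)
    for t in range(max_iter):
        q = e_step(X, theta)           # E-step: q^t(Z) = p(Z | X; theta^{t-1})
        new_theta = m_step(X, q)       # M-step: argmax_theta E_q[log p(X, Z | theta)]
        if np.max(np.abs(new_theta - theta)) < tol:   # convergence criterion
            return new_theta
        theta = new_theta
    return theta
```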

EM Algorithm: A Simple Example, Maximum Likelihood. Assume that after an exam the numbers of students with grades A, B, C, D are a, b, c, d respectively, and suppose that we know P(A) = ½, P(B) = μ, P(C) = 2μ, P(D) = ½ − 3μ. What is the maximum likelihood estimate of μ? $p(a,b,c,d \mid \mu) = K \left(\tfrac{1}{2}\right)^a \mu^b (2\mu)^c \left(\tfrac{1}{2} - 3\mu\right)^d$, so $\ln p(a,b,c,d \mid \mu) = \ln K + a\ln\tfrac{1}{2} + b\ln\mu + c\ln 2\mu + d\ln\!\left(\tfrac{1}{2} - 3\mu\right)$. Setting $\frac{\partial \ln p}{\partial \mu} = \frac{b}{\mu} + \frac{2c}{2\mu} - \frac{3d}{\tfrac{1}{2} - 3\mu} = 0$ gives $\mu = \frac{b+c}{6\,(b+c+d)}$.
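As a quick sanity check of this closed form (with made-up counts, not from the slides), the estimate can also be verified numerically:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical grade counts, chosen only to illustrate the formula.
a, b, c, d = 14, 6, 9, 10

# Closed-form MLE derived above: mu = (b + c) / (6 (b + c + d))
mu_closed = (b + c) / (6 * (b + c + d))

# Numerical check: maximize the log-likelihood over the valid range 0 < mu < 1/6
# (the a*ln(1/2) term is constant in mu and can be dropped).
def neg_log_lik(mu):
    return -(b * np.log(mu) + c * np.log(2 * mu) + d * np.log(0.5 - 3 * mu))

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 1/6 - 1e-6), method="bounded")
print(mu_closed, res.x)  # the two estimates should agree closely
```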

EM Algorithm: A Simple Example, Hidden Information. Suppose, as before, that P(A) = ½, P(B) = μ, P(C) = 2μ, P(D) = ½ − 3μ, but now we only observe the number of high grades (A's + B's) = h, the number of C's = c, and the number of D's = d. What is the maximum likelihood estimate of μ now? Expectation: if we knew the value of μ, we could compute the expected values of a and b: $a = \frac{1/2}{\tfrac{1}{2} + \mu}\, h$ and $b = \frac{\mu}{\tfrac{1}{2} + \mu}\, h$.

EM Algorithm: A Simple Example, Hidden Information (cont.). Maximization: if we knew the expected values of a and b, we could compute the maximum likelihood value of μ as before. So we begin with a first estimate for μ and iterate between expectation and maximization to improve our estimates of μ, a and b: $\mu^{(0)}$ = initial guess; $b^{(t)} = \mathbb{E}\!\left[b \mid \mu^{(t)}\right] = \frac{\mu^{(t)} h}{\tfrac{1}{2} + \mu^{(t)}}$; $\mu^{(t+1)}$ = ML estimate of $\mu$ given $b^{(t)}$ = $\frac{b^{(t)} + c}{6\,(b^{(t)} + c + d)}$.
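A minimal sketch of this iteration in Python; the observed counts h, c, d and the initial guess are made up for illustration:

```python
# EM for the grades example: observe h = a + b, c, d; estimate mu.
h, c, d = 20, 9, 10   # hypothetical observed counts, not from the slides
mu = 0.05             # initial guess mu^(0)

for t in range(50):
    # E-step: expected number of B grades given the current mu.
    b = mu * h / (0.5 + mu)
    # M-step: closed-form ML estimate of mu given the expected b.
    new_mu = (b + c) / (6 * (b + c + d))
    if abs(new_mu - mu) < 1e-10:
        mu = new_mu
        break
    mu = new_mu

print(mu)  # converged estimate of mu
```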

EM Algorithm: Another Example, K-means Clustering. Goal: represent a data set $\{x_1, \dots, x_N\}$ in terms of K clusters, each of which is summarized by a prototype $\mu_k$. Initialize the prototypes, then iterate between two phases. E-step: assign each data point to the nearest prototype. M-step: update the prototypes to be the cluster means. The simplest version is based on Euclidean distance. HW: derive the EM equations, $P(X^u \mid X^o, \theta^{(t)})$ and $Q(\theta, \theta^{(t)})$, for the k-means algorithm.
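A minimal numpy sketch of this E-step / M-step alternation (an illustrative implementation, not the course's reference code):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Simple K-means: X is (N, D); returns prototypes (K, D) and assignments (N,)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]   # initialize prototypes
    for _ in range(n_iter):
        # E-step: assign each point to the nearest prototype (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        z = dists.argmin(axis=1)
        # M-step: move each prototype to the mean of its assigned points.
        new_mu = np.array([X[z == k].mean(axis=0) if np.any(z == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return mu, z
```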

Mixture Models: Mixture Density Model Estimation. Models the data with a mixture density $P(x \mid \theta) = \sum_{j=1}^{m} p(x \mid c_j, \theta_j)\, P(c_j)$, where $\theta = \{\theta_1, \dots, \theta_m\}$ and $P(c_1) + \dots + P(c_m) = 1$. To generate a sample from the distribution $P(X \mid \theta)$, first select class $j$ with probability $P(c_j)$, then generate $x$ according to $p(x \mid c_j, \theta_j)$. Mixtures provide a framework for building more complex probability distributions and can be used to cluster data (how?).
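This two-stage generative process is easy to sketch in code. Here is an illustrative example that samples from a 1-D Gaussian mixture; the mixture weights, means, and standard deviations are made-up values, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D Gaussian mixture: P(c_j), component means and std deviations.
weights = np.array([0.5, 0.3, 0.2])
means = np.array([-2.0, 0.0, 3.0])
stds = np.array([0.5, 1.0, 0.8])

N = 1000
# Step 1: pick a component j for each sample with probability P(c_j).
z = rng.choice(len(weights), size=N, p=weights)
# Step 2: draw x from the chosen component p(x | c_j, theta_j).
x = rng.normal(means[z], stds[z])
```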

Gaussian Mixtures. A linear superposition of Gaussians: $P(x) = \sum_{k=1}^{K} P(c_k)\, \mathcal{N}(x \mid \mu_k, \Sigma_k)$. Normalization and positivity require $\sum_{k=1}^{K} P(c_k) = 1$ and $0 \le P(c_k) \le 1$. Example: a mixture of 3 Gaussians. (Figure: the separated components and the mixed density.)
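The superposition formula translates directly into code. A sketch using scipy's multivariate normal density; the mixture parameters below are illustrative values, not from the slides:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative 2-D mixture of 3 Gaussians.
pis = [0.4, 0.35, 0.25]
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0]), np.array([-3.0, 2.0])]
covs = [np.eye(2), 0.5 * np.eye(2), np.diag([2.0, 0.5])]

def gmm_density(x):
    """P(x) = sum_k pi_k N(x | mu_k, Sigma_k)."""
    return sum(pi * multivariate_normal.pdf(x, mean=mu, cov=cov)
               for pi, mu, cov in zip(pis, mus, covs))

print(gmm_density(np.array([1.0, 1.0])))
```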

Gaussian Mixtures: Fitting the Gaussian Mixture Model. The goal: given the data set, find the corresponding parameters: mixing coefficients (or prior probabilities), means, and covariances. If we knew which component generated each data point, the maximum likelihood solution would involve fitting each component to the corresponding cluster. Problem: the data set is unlabelled. We'll refer to the labels as latent (= hidden) variables. (Figure: a synthetic data set without labels.)

Gaussian Mixtures: Maximum Likelihood for the GMM. The log likelihood function takes the form $\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left( \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right)$. Note: the sum over components appears inside the log, so there is no closed-form solution for maximum likelihood. Then how can we maximize the log likelihood? Using the EM algorithm.

Gaussian Mixtures: EM Algorithm. Initialize the means $\mu_k$, covariances $\Sigma_k$ and mixing coefficients $\pi_k$, and repeat the following steps until convergence. E step: evaluate the $z_{ij}$'s (latent variables) using the current parameter values, where $z_{ij}$ is a binary variable which is 1 if $x_i$ is drawn from the $j$-th distribution: $z_{ij} \leftarrow p(c_j \mid x_i) = \frac{p(c_j)\, p(x_i \mid c_j)}{p(x_i)} = \frac{\pi_j\, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}{\sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}$. M step: re-estimate the parameters using the current $z_{ij}$'s (equations in the next slides).

Gaussian Mixtures: EM Algorithm, M Step. Let us proceed by simply differentiating the log likelihood. Setting the derivative with respect to $\mu_k$ equal to zero gives $\sum_{i=1}^{N} \frac{\pi_k \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_j \pi_j \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}\, \Sigma_k^{-1}(x_i - \mu_k) = \sum_{i=1}^{N} z_{ik}\, \Sigma_k^{-1}(x_i - \mu_k) = 0$ (we suppose that the $z_{ik}$ values are known in the M step). Multiplying both sides by $\Sigma_k$ gives $\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} z_{nk} x_n$, where $N_k = \sum_{n=1}^{N} z_{nk}$, which is simply the weighted mean of the data. Similarly, for the covariances we obtain $\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} z_{nk} (x_n - \mu_k)(x_n - \mu_k)^T$. Note that the condition requiring the mixing coefficients to sum to 1 must be satisfied when maximizing the log-likelihood with respect to the $\pi_k$; we therefore use the Lagrange multiplier method, as shown in the next slide.

Gaussian Mixtures: EM Algorithm, M Step. Estimating the $\pi_k$'s: using the Lagrange multiplier method, we must maximize $\ln p(X \mid \pi, \mu, \Sigma) + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right)$, which gives $\sum_{n=1}^{N} \frac{\mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_j \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} + \lambda = 0$. Multiplying both sides by $\pi_k$ and summing over $k$, we find $\lambda = -N$. So $\pi_k = \frac{N_k}{N}$.

Gaussian Mixtures: EM Algorithm. Latent variable view to obtain the M step estimates. We have $P(z_{nk} = 1) = \pi_k$, so $P(z_n) = \prod_{k=1}^{K} \pi_k^{z_{nk}}$, and $P(x_n \mid z_{nk} = 1) = \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$, so $P(x_n \mid z_n) = \prod_{k=1}^{K} \mathcal{N}(x_n \mid \mu_k, \Sigma_k)^{z_{nk}}$. Then $p(X, Z \mid \pi, \mu, \Sigma) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{nk}}\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)^{z_{nk}}$, so $\ln p(X, Z \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk}\left[ \ln \pi_k + \ln \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right]$. Keeping the $z_{nk}$'s fixed and maximizing with respect to the parameters gives the previous results: $\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} z_{nk} x_n$, $\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} z_{nk} (x_n - \mu_k)(x_n - \mu_k)^T$, $\pi_k = \frac{N_k}{N}$.
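Putting the E step and M step together, here is a compact numpy sketch of EM for a Gaussian mixture (an illustrative implementation under the assumptions above, not the course's reference code; it assumes scipy is available and adds a small ridge to the covariances for numerical stability):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, K, n_iter=100, seed=0):
    """EM for a Gaussian mixture: X is (N, D); returns (pi, mu, Sigma, Z)."""
    N, D = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)]        # initialize means from data
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])

    for _ in range(n_iter):
        # E-step: responsibilities z_nk = pi_k N(x_n | mu_k, Sigma_k) / sum_j pi_j N(x_n | mu_j, Sigma_j)
        Z = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                             for k in range(K)])
        Z /= Z.sum(axis=1, keepdims=True)

        # M-step: weighted means, covariances, and mixing coefficients.
        Nk = Z.sum(axis=0)                               # effective counts N_k
        mu = (Z.T @ X) / Nk[:, None]                     # mu_k = (1/N_k) sum_n z_nk x_n
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (Z[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        pi = Nk / N                                      # pi_k = N_k / N
    return pi, mu, Sigma, Z
```

For instance, running `gmm_em(x[:, None], K=3)` on the 1-D samples drawn in the earlier sampling sketch should roughly recover mixing coefficients, means, and variances close to the ones used to generate the data.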

Gaussian Mixtures: Example, Mixture of Two Gaussians. (Figure: after 20 cycles the algorithm is close to convergence.)

Any questions? End of Lecture 14. Thank you! Spring 2014, http://ce.sharif.edu/courses/92-93/2/ce725-2