Latent Variable View of EM
Sargur Srihari, srihari@cedar.buffalo.edu
Examples of Latent Variables

1. Mixture Model
   - The joint distribution is p(x, z), but we do not have values for z.
2. Hidden Markov Model
   - A single time slice is a mixture with components p(x|z); an HMM is thus an extension of the mixture model.
   - The choice of mixture component depends on the choice of component for the previous distribution.
   - The latent variables are multinomial variables z_n that describe which component is responsible for generating x_n.
Another Example of Latent Variables

3. Topic Models (Latent Dirichlet Allocation)
   - In NLP, unobserved groups explain why some observed data are similar.
   - Each document is a mixture of various topics (the latent variables), and topics generate words:
     - CAT-related: milk, meow, kitten
     - DOG-related: puppy, bark, bone
   - Topics are multinomial distributions over words with Dirichlet priors.
Main Idea of EM

The goal of EM is to find maximum likelihood models for distributions p(x) that have latent (or missing) data, e.g., GMMs and HMMs.

In the case of Gaussian mixture models:
- We have a complex distribution of observed variables x and wish to estimate its parameters µ_k, Σ_k, π_k:

  p(x) = Σ_{k=1}^K π_k N(x | µ_k, Σ_k)

- We introduce latent variables z so that the joint distribution p(x, z) is more tractable, since we know the form of each component:

  p(x | z_k = 1) = N(x | µ_k, Σ_k)

- The complicated form is built from simpler components; the original distribution is obtained by marginalizing the joint distribution:

  p(x) = Σ_z p(x, z)
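The marginalization just described can be sketched numerically. A minimal 1-D sketch, where the two-component parameter values are illustrative assumptions, not taken from the slides: the mixture density p(x) is recovered by summing the joint p(x, z_k = 1) = π_k N(x | µ_k, σ_k²) over components.

```python
import numpy as np

def gaussian(x, mu, var):
    """Univariate normal density N(x | mu, var)."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Illustrative two-component mixture (parameter values are assumptions)
pi = np.array([0.3, 0.7])        # mixing coefficients pi_k, sum to 1
mu = np.array([-2.0, 3.0])       # component means mu_k
var = np.array([1.0, 2.0])       # component variances

x = 0.5
joint = pi * gaussian(x, mu, var)   # p(x, z_k = 1) = pi_k * N(x | mu_k, var_k)
px = joint.sum()                    # p(x) = sum_z p(x, z): marginalize out z
```

The joint over (x, z) factors into the simple prior π_k and the simple Gaussian component; only the final sum reintroduces the complex mixture shape.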
Alternative View of EM

This view recognizes the key role of latent variables.

- Observed data: matrix X = {x_1, x_2, ..., x_N}, where the n-th row is the sample vector x_n^T = [x_n1, x_n2, ..., x_nd].
- Latent variables: matrix Z = {z_1, z_2, ..., z_N}, with corresponding row z_n^T = [z_n1, z_n2, ..., z_nK].

The goal of the EM algorithm is to find the maximum likelihood solution for p(X|θ) given some X, when we do not have Z.
Likelihood with Latent Variables

The likelihood function (from the sum rule) is

  p(X | θ) = Σ_Z p(X, Z | θ)

where θ is the set of all model parameters, e.g., the means, covariances, and mixing coefficients. The log-likelihood function is

  ln p(X | θ) = ln { Σ_Z p(X, Z | θ) }

and we wish to find the θ that maximizes it. The joint likelihood function can be written as

  p(X, Z | θ) = p(Z | X, θ) p(X | θ)

We choose this factorization from the graph, since we know X and not Z.
Complication due to Latent Variables

The log-likelihood function is

  ln p(X | θ) = ln { Σ_Z p(X, Z | θ) }

Key observation: the summation inside the braces is due to marginalization, not to the log-likelihood itself. Because the summation over the latent variables appears inside the logarithm:
- Even if the joint distribution p(X, Z | θ) belongs to the exponential family, the marginal distribution p(X | θ) does not.
- Taking the log of a sum of Gaussians does not give a simple quadratic form.
- The result is a complicated expression for the maximum likelihood solution, i.e., for the value of θ that maximizes the likelihood.
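As a quick numerical illustration (the data points and parameter values below are assumptions, not from the slides), the incomplete-data log-likelihood of a 1-D GMM keeps the sum over components inside each logarithm, so it cannot be split into per-component quadratic terms:

```python
import numpy as np

def gaussian(x, mu, var):
    """Univariate normal density N(x | mu, var)."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Illustrative data and parameters (assumptions for the sketch)
data = np.array([-2.0, 0.5, 3.1])
pi, mu, var = np.array([0.4, 0.6]), np.array([-2.0, 3.0]), np.array([1.0, 1.0])

# ln p(X | theta) = sum_n ln [ sum_k pi_k N(x_n | mu_k, var_k) ]
# The inner sum over k sits INSIDE the log for every data point n.
ll = np.log((pi * gaussian(data[:, None], mu, var)).sum(axis=1)).sum()
```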
Complete and Incomplete Data Sets

The log-likelihood function is ln p(X | θ) = ln { Σ_Z p(X, Z | θ) }.

Complete data {X, Z}:
- For each observation in X we know the corresponding value of the latent variable in Z.
- Since we can evaluate p(X, Z | θ) = p(Z | X, θ) p(X | θ), maximization over θ is straightforward.

Incomplete data {X}:
- This is the actual data set.
- Since we do not know Z, we cannot evaluate p(X, Z | θ) in order to maximize over θ.
Maximizing the Expectation of ln p(X, Z | θ)

Since we do not have the complete data set {X, Z} needed to evaluate ln p(X, Z | θ), we instead evaluate its expectation.

- Since we are given X, for a given θ we first determine the distribution of the latent variables, p(Z | X, θ).
- The expected log-likelihood of the complete data is then

  E_Z[ ln p(X, Z | θ) ] = Σ_Z p(Z | X, θ) ln p(X, Z | θ)

- We maximize this by considering every value of θ, and for each θ summing over every value of Z.
- The summation here is due to the expectation, not the sum rule! Since the logarithm acts directly on the joint p(X, Z | θ), and not on a summation, the expression is tractable.
E and M Steps

E step: estimate the missing values. Use the current parameter value θ_old to find the posterior distribution of the latent variables, given by p(Z | X, θ_old).

M step: determine the revised parameter estimate θ_new by maximizing

  θ_new = argmax_θ Q(θ, θ_old)

where

  Q(θ, θ_old) = Σ_Z p(Z | X, θ_old) ln p(X, Z | θ)

The summation is due to the expectation: Q(θ, θ_old) is the expectation of ln p(X, Z | θ) for some general parameter value θ, taken under the posterior computed with θ_old.
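The two steps can be sketched for a 1-D two-component Gaussian mixture; the toy data, the initial parameter setting θ_old, and the use of a GMM as the concrete model are all illustrative assumptions. The E step builds the posterior p(Z | X, θ_old), and the M step maximizes Q(θ, θ_old) in closed form, so Q can only go up:

```python
import numpy as np

def log_gaussian(x, mu, var):
    """Log of the univariate normal density, ln N(x | mu, var)."""
    return -0.5 * ((x - mu) ** 2 / var + np.log(2 * np.pi * var))

# Toy 1-D data and an assumed current parameter setting theta_old
data = np.array([-2.1, -1.9, 2.0, 2.2])
pi_old, mu_old, var_old = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

# E step: posterior of the latent variables, p(Z | X, theta_old)
joint = np.exp(np.log(pi_old) + log_gaussian(data[:, None], mu_old, var_old))
gamma = joint / joint.sum(axis=1, keepdims=True)   # responsibilities, rows sum to 1

def Q(pi, mu, var):
    """Q(theta, theta_old): expected complete-data log-likelihood
    under the responsibilities computed with theta_old."""
    return (gamma * (np.log(pi) + log_gaussian(data[:, None], mu, var))).sum()

# M step: closed-form maximizer of Q for a Gaussian mixture
Nk = gamma.sum(axis=0)
mu_new = (gamma * data[:, None]).sum(axis=0) / Nk
var_new = (gamma * (data[:, None] - mu_new) ** 2).sum(axis=0) / Nk
pi_new = Nk / len(data)
```

Because θ_new is the argmax of Q, evaluating Q at (π_new, µ_new, var_new) can never give a smaller value than at θ_old.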
General EM Algorithm

Given a joint distribution p(X, Z | θ) over observed variables X and latent variables Z, governed by parameters θ, the goal is to maximize the likelihood function p(X | θ).

Step 1: Choose an initial setting for the parameters, θ_old.
Step 2 (E step): Evaluate p(Z | X, θ_old).
Step 3 (M step): Evaluate θ_new given by

  θ_new = argmax_θ Q(θ, θ_old), where Q(θ, θ_old) = Σ_Z p(Z | X, θ_old) ln p(X, Z | θ)

Step 4: Check for convergence of either the log-likelihood or the parameter values. If the criterion is not satisfied, let θ_old ← θ_new and return to Step 2.
Missing Variables

EM has been described for maximizing the likelihood function when there are discrete latent variables. It can also be applied when there are unobserved variables corresponding to missing values in the data set:
- Take the joint distribution of all variables, then marginalize over the missing ones.
- EM is then used to maximize the corresponding likelihood function.

The method is valid when data is missing at random, but not if the missingness depends on the unobserved values, e.g., when a value goes unrecorded because the quantity exceeds some threshold.
Gaussian Mixtures Revisited

We now apply EM (the latent variable view) to the GMM.

In the E step we compute the expectation of the log-likelihood of the complete data {X, Z} with respect to the posterior of the latent variables:

  Q(θ, θ_old) = Σ_Z p(Z | X, θ_old) ln p(X, Z | θ)

What is the form of the two terms in this product?

In the M step we maximize Q(θ, θ_old) with respect to θ. We will show that this leads to the same maximum likelihood estimates for the GMM parameters π, µ, Σ as before.
Likelihood for Complete Data

The likelihood function for the complete data set is

  p(X, Z | π, µ, Σ) = ∏_{n=1}^N ∏_{k=1}^K π_k^{z_nk} N(x_n | µ_k, Σ_k)^{z_nk}

and the log-likelihood is

  ln p(X, Z | π, µ, Σ) = Σ_{n=1}^N Σ_{k=1}^K z_nk { ln π_k + ln N(x_n | µ_k, Σ_k) }

This is much simpler than the log-likelihood for incomplete data:

  ln p(X | π, µ, Σ) = Σ_{n=1}^N ln { Σ_{k=1}^K π_k N(x_n | µ_k, Σ_k) }

The maximum likelihood solution for complete data can be obtained in closed form. Since we do not have values for the latent variables, we instead take the expectation of the complete-data log-likelihood with respect to the posterior distribution of the latent variables.
Posterior Distribution of Latent Variables

From p(z) = ∏_{k=1}^K π_k^{z_k} and p(x | z) = ∏_{k=1}^K N(x | µ_k, Σ_k)^{z_k}, we have

  p(Z | X, µ, Σ) ∝ ∏_{n=1}^N ∏_{k=1}^K [ π_k N(x_n | µ_k, Σ_k) ]^{z_nk}

from which we can get the expected value of the indicator variable:

  E[z_nk] = γ(z_nk) = π_k N(x_n | µ_k, Σ_k) / Σ_{j=1}^K π_j N(x_n | µ_j, Σ_j)

Substituting into the complete-data log-likelihood:

  E_Z[ ln p(X, Z | π, µ, Σ) ] = Σ_{n=1}^N Σ_{k=1}^K γ(z_nk) { ln π_k + ln N(x_n | µ_k, Σ_k) }

Final procedure: choose initial values π_old, µ_old, Σ_old; evaluate the responsibilities (E step); then, keeping the responsibilities fixed, use the closed-form solutions (M step)

  µ_k = (1/N_k) Σ_{n=1}^N γ(z_nk) x_n
  Σ_k = (1/N_k) Σ_{n=1}^N γ(z_nk) (x_n − µ_k)(x_n − µ_k)^T
  π_k = N_k / N, where N_k = Σ_{n=1}^N γ(z_nk)

to obtain π_new, µ_new, Σ_new.
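The full procedure can be sketched end-to-end for a 1-D two-component mixture. The synthetic data, the initialization, and the fixed iteration count (standing in for a convergence check) are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data from two well-separated Gaussians (an assumption for the demo)
data = np.concatenate([rng.normal(-2.0, 1.0, 150), rng.normal(3.0, 1.0, 150)])

def gaussian(x, mu, var):
    """Univariate normal density N(x | mu, var)."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Initial values pi_old, mu_old, var_old
pi, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for _ in range(50):
    # E step: responsibilities gamma(z_nk) = E[z_nk]
    joint = pi * gaussian(data[:, None], mu, var)     # N x K array of pi_k N(x_n | mu_k, var_k)
    gamma = joint / joint.sum(axis=1, keepdims=True)
    # M step: closed-form updates with the responsibilities held fixed
    Nk = gamma.sum(axis=0)
    mu = (gamma * data[:, None]).sum(axis=0) / Nk     # weighted means
    var = (gamma * (data[:, None] - mu) ** 2).sum(axis=0) / Nk
    pi = Nk / len(data)                               # mixing coefficients
```

With data this well separated, the estimated means settle close to the generating values of −2 and 3.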
Relation to K-means

EM for Gaussian mixtures has a close similarity to K-means:
- K-means performs a hard assignment of data points to clusters: each data point is associated uniquely with one cluster.
- EM makes a soft assignment based on posterior probabilities.
- K-means does not estimate the covariances of the clusters, only the cluster means.
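The hard/soft distinction can be made concrete for a single point; the mixture parameters and the query point below are assumptions chosen for illustration:

```python
import numpy as np

def gaussian(x, mu, var):
    """Univariate normal density N(x | mu, var)."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Assumed 1-D mixture with two equally weighted, unit-variance components
pi, mu, var = np.array([0.5, 0.5]), np.array([0.0, 4.0]), np.array([1.0, 1.0])
x = 1.5   # closer to the first mean, but not overwhelmingly so

# EM: soft assignment via posterior responsibilities (roughly [0.88, 0.12])
joint = pi * gaussian(x, mu, var)
gamma = joint / joint.sum()

# K-means: hard assignment to the nearest mean; covariances play no role
hard = int(np.argmin((x - mu) ** 2))
```

Both methods favor the first cluster here, but EM keeps a nonzero responsibility for the second component, which then still influences that component's parameter updates.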