SC4/SM8 Advanced Topics in Statistical Machine Learning
Latent Variable Models and EM Algorithm
Dino Sejdinovic, Department of Statistics, Oxford
Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/atsml/
Probabilistic Unsupervised Learning
Probabilistic Methods
Algorithmic approach: Data -> Algorithm -> Analysis / Interpretation.
Probabilistic modelling approach: an unobserved process and a generative model are posited to have produced the data, and analysis / interpretation proceed through that model.
Mixture Models
Mixture models suppose that our dataset X was created by sampling iid from K distinct populations (called mixture components).
Samples in population k can be modelled using a distribution F_{\mu_k} with density f(x|\mu_k), where \mu_k is the model parameter for the k-th component. For a concrete example, consider a Gaussian with unknown mean \mu_k and known spherical covariance \sigma^2 I,
    f(x|\mu_k) = (2\pi\sigma^2)^{-p/2} \exp\left( -\frac{1}{2\sigma^2} \| x - \mu_k \|_2^2 \right).
Generative model: for i = 1, 2, ..., n:
First determine the assignment variable independently for each data item i:
    Z_i \sim \text{Discrete}(\pi_1, \dots, \pi_K), i.e., P(Z_i = k) = \pi_k,
where the mixing proportions satisfy \pi_k \geq 0 for each k and \sum_{k=1}^K \pi_k = 1.
Given the assignment Z_i = k, X_i = (X_i^{(1)}, \dots, X_i^{(p)}) is then sampled (independently) from the corresponding k-th component:
    X_i | Z_i = k \sim f(x|\mu_k).
We observe X_i = x_i for each i but not the Z_i's (latent variables), and would like to infer the parameters \{\mu_k\}_{k=1}^K and \{\pi_k\}_{k=1}^K (\sigma^2 can also be estimated).
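A minimal sketch of this generative process for spherical Gaussian components (the specific mixing proportions, means, sigma and sample size below are illustrative choices, not taken from the slides):

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative parameters: K = 3 spherical Gaussian components in p = 2 dimensions.
    pis = np.array([0.3, 0.5, 0.2])             # mixing proportions, sum to 1
    mus = np.array([[-4.0, 0.0],
                    [0.0, 3.0],
                    [5.0, -2.0]])               # component means mu_k
    sigma = 1.0                                 # shared spherical standard deviation
    n = 500

    # Generative model: Z_i ~ Discrete(pi), then X_i | Z_i = k ~ N(mu_k, sigma^2 I).
    z = rng.choice(len(pis), size=n, p=pis)     # latent assignments (unobserved in practice)
    x = mus[z] + sigma * rng.standard_normal((n, mus.shape[1]))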
Mixture Models
Unknowns to learn given data are:
Parameters: \theta = (\pi_k, \mu_k)_{k=1}^K, where \pi_1, \dots, \pi_K \in [0, 1], \mu_1, \dots, \mu_K \in \mathbb{R}^p, and
Latent variables: z_1, \dots, z_n.
The joint probability over all cluster indicator variables \{Z_i\} is:
    p_Z((z_i)_{i=1}^n) = \prod_{i=1}^n \pi_{z_i} = \prod_{i=1}^n \prod_{k=1}^K \pi_k^{1(z_i = k)}.
The joint density at observations X_i = x_i given Z_i = z_i is:
    p_X((x_i)_{i=1}^n | (Z_i = z_i)_{i=1}^n) = \prod_{i=1}^n f(x_i|\mu_{z_i}) = \prod_{i=1}^n \prod_{k=1}^K f(x_i|\mu_k)^{1(z_i = k)}.
Mixture Models: Joint pmf/pdf of observed and latent variables
Unknowns to learn given data are:
Parameters: \theta = (\pi_k, \mu_k)_{k=1}^K, where \pi_1, \dots, \pi_K \in [0, 1], \mu_1, \dots, \mu_K \in \mathbb{R}^p, and
Latent variables: z_1, \dots, z_n.
The joint probability mass function/density is:
    p_{X,Z}((x_i, z_i)_{i=1}^n) = p_Z((z_i)_{i=1}^n) \, p_X((x_i)_{i=1}^n | (Z_i = z_i)_{i=1}^n) = \prod_{i=1}^n \prod_{k=1}^K (\pi_k f(x_i|\mu_k))^{1(z_i = k)}.
And the marginal density of x_i (the resulting model on the observed data) is:
    p(x_i) = \sum_{j=1}^K p(z_i = j, x_i) = \sum_{j=1}^K \pi_j f(x_i|\mu_j).
Mixture Models: Gaussian Mixtures with Unequal Covariances
The most widely used mixture model is the mixture of Gaussians (MoG), also called a Gaussian mixture model (GMM): each component is a multivariate Gaussian with its own mean \mu_k and covariance matrix \Sigma_k.
Here \theta = (\pi_k, \mu_k, \Sigma_k)_{k=1}^K are all the model parameters and
    f(x | (\mu_k, \Sigma_k)) = (2\pi)^{-p/2} |\Sigma_k|^{-1/2} \exp\left( -\frac{1}{2} (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) \right),
    p(x) = \sum_{k=1}^K \pi_k f(x | (\mu_k, \Sigma_k)).
(Figure from Murphy, 2012, Ch. 11, Fig. 11.3: a mixture of 3 Gaussians in 2d; (a) contours of constant probability for each component, (b) a surface plot of the overall density.)
Mixture Models: Responsibility
Suppose we know the parameters \theta = (\pi_k, \mu_k)_{k=1}^K.
Z_i is a random variable and its conditional distribution given the data set X is:
    Q_{ik} := p(z_i = k | x_i) = \frac{p(z_i = k, x_i)}{p(x_i)} = \frac{\pi_k f(x_i|\mu_k)}{\sum_{j=1}^K \pi_j f(x_i|\mu_j)}.
The conditional probability Q_{ik} is called the responsibility of mixture component k for data point x_i.
These conditionals softly partition the dataset among the K components: \sum_{k=1}^K Q_{ik} = 1.
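A sketch of the responsibility computation for the spherical-Gaussian case above (the function name and the log-space formulation are choices made here for numerical stability, not prescribed by the slides):

    import numpy as np
    from scipy.special import logsumexp

    def responsibilities(x, pis, mus, sigma):
        """Q[i, k] = p(z_i = k | x_i) for a spherical Gaussian mixture.

        x: (n, p) data, pis: (K,) mixing proportions, mus: (K, p) means, sigma: scalar std."""
        n, p = x.shape
        # log pi_k + log f(x_i | mu_k) for every (i, k) pair, computed in log space
        sq_dists = ((x[:, None, :] - mus[None, :, :]) ** 2).sum(-1)          # (n, K)
        log_f = -0.5 * p * np.log(2 * np.pi * sigma**2) - sq_dists / (2 * sigma**2)
        log_joint = np.log(pis)[None, :] + log_f                              # log p(z_i = k, x_i)
        # Normalise each row: Q[i, k] = p(z_i = k, x_i) / p(x_i)
        return np.exp(log_joint - logsumexp(log_joint, axis=1, keepdims=True))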
Mixture Models: Maximum Likelihood
How can we learn about the parameters \theta = (\pi_k, \mu_k)_{k=1}^K from data? Standard statistical methodology asks for the maximum likelihood estimator (MLE).
The goal is to maximise the marginal probability of the data over the parameters:
    \hat\theta_{ML} = \operatorname*{argmax}_{\theta} p(X|\theta)
                    = \operatorname*{argmax}_{(\pi_k,\mu_k)_{k=1}^K} \prod_{i=1}^n p(x_i | (\pi_k, \mu_k)_{k=1}^K)
                    = \operatorname*{argmax}_{(\pi_k,\mu_k)_{k=1}^K} \prod_{i=1}^n \sum_{k=1}^K \pi_k f(x_i|\mu_k)
                    = \operatorname*{argmax}_{(\pi_k,\mu_k)_{k=1}^K} \underbrace{\sum_{i=1}^n \log \sum_{k=1}^K \pi_k f(x_i|\mu_k)}_{:= l((\pi_k, \mu_k)_{k=1}^K)}.
Mixture Models: Maximum Likelihood
Marginal log-likelihood:
    l((\pi_k, \mu_k)_{k=1}^K) := \log p(X | (\pi_k, \mu_k)_{k=1}^K) = \sum_{i=1}^n \log \sum_{k=1}^K \pi_k f(x_i|\mu_k).
The gradient w.r.t. \mu_k:
    \nabla_{\mu_k} l((\pi_k, \mu_k)_{k=1}^K) = \sum_{i=1}^n \frac{\pi_k f(x_i|\mu_k)}{\sum_{j=1}^K \pi_j f(x_i|\mu_j)} \nabla_{\mu_k} \log f(x_i|\mu_k) = \sum_{i=1}^n Q_{ik} \nabla_{\mu_k} \log f(x_i|\mu_k).
Difficult to solve, as Q_{ik} depends implicitly on \mu_k.
Likelihood Surface for a Simple Example
If the latent variables z_i were all observed, we would have a unimodal likelihood surface; but when we marginalise out the latents, the likelihood surface becomes multimodal: there is no unique MLE.
(Figure from Murphy, 2012, Fig. 11.6. Left: n = 200 data points sampled from a mixture of two 1D Gaussians with \pi_1 = \pi_2 = 0.5, \sigma = 5, \mu_1 = -10, \mu_2 = 10. Right: the observed-data log-likelihood surface l(\mu_1, \mu_2), all other parameters set to their true values; the two symmetric modes reflect the unidentifiability of the parameters.)
Mixture Models: Maximum Likelihood
Recall we would like to solve:
    \nabla_{\mu_k} l((\pi_k, \mu_k)_{k=1}^K) = \sum_{i=1}^n Q_{ik} \nabla_{\mu_k} \log f(x_i|\mu_k) = 0.
What if we ignore the dependence of Q_{ik} on the parameters? Taking the mixture of Gaussians with covariance \sigma^2 I as an example,
    \sum_{i=1}^n Q_{ik} \nabla_{\mu_k} \left( -\tfrac{p}{2} \log(2\pi\sigma^2) - \tfrac{1}{2\sigma^2} \|x_i - \mu_k\|_2^2 \right)
    = \tfrac{1}{\sigma^2} \sum_{i=1}^n Q_{ik} (x_i - \mu_k)
    = \tfrac{1}{\sigma^2} \left( \sum_{i=1}^n Q_{ik} x_i - \mu_k \sum_{i=1}^n Q_{ik} \right) = 0
    \quad\Rightarrow\quad \mu_k^{ML?} = \frac{\sum_{i=1}^n Q_{ik} x_i}{\sum_{i=1}^n Q_{ik}}.
Mixture Models: Maximum Likelihood
The estimate is a weighted average of data points: the estimated mean of cluster k uses its responsibilities to the data points as weights,
    \mu_k^{ML?} = \frac{\sum_{i=1}^n Q_{ik} x_i}{\sum_{i=1}^n Q_{ik}}.
Makes sense: suppose we knew that data point x_i came from population z_i. Then Q_{i z_i} = 1 and Q_{ik} = 0 for k \neq z_i, and:
    \mu_k^{ML?} = \frac{\sum_{i: z_i = k} x_i}{\sum_{i: z_i = k} 1} = \text{avg}\{x_i : z_i = k\}.
Our best guess of the originating population is given by Q_{ik}.
A soft K-means algorithm?
Mixture Models: Maximum Likelihood
Gradient w.r.t. the mixing proportion \pi_k (including a Lagrange multiplier term \lambda(\sum_k \pi_k - 1) to enforce the constraint \sum_k \pi_k = 1):
    \nabla_{\pi_k} \left( l((\pi_k, \mu_k)_{k=1}^K) - \lambda \left( \sum_{k=1}^K \pi_k - 1 \right) \right) = \sum_{i=1}^n \frac{f(x_i|\mu_k)}{\sum_{j=1}^K \pi_j f(x_i|\mu_j)} - \lambda = \sum_{i=1}^n \frac{Q_{ik}}{\pi_k} - \lambda = 0
    \quad\Rightarrow\quad \pi_k \propto \sum_{i=1}^n Q_{ik}.
Note:
    \sum_{k=1}^K \sum_{i=1}^n Q_{ik} = \sum_{i=1}^n \underbrace{\sum_{k=1}^K Q_{ik}}_{=1} = n
    \quad\Rightarrow\quad \pi_k^{ML?} = \frac{\sum_{i=1}^n Q_{ik}}{n}.
Again makes sense: the estimate is simply (our best guess of) the proportion of data points coming from population k.
Mixture Models: The EM Algorithm
Putting all the derivations together, we get an iterative algorithm for learning about the unknowns in the mixture model.
Start with some initial parameters (\pi_k^{(0)}, \mu_k^{(0)})_{k=1}^K. Iterate for t = 1, 2, ...:
Expectation Step:
    Q_{ik}^{(t)} := \frac{\pi_k^{(t-1)} f(x_i|\mu_k^{(t-1)})}{\sum_{j=1}^K \pi_j^{(t-1)} f(x_i|\mu_j^{(t-1)})}
Maximization Step:
    \pi_k^{(t)} = \frac{\sum_{i=1}^n Q_{ik}^{(t)}}{n}, \qquad \mu_k^{(t)} = \frac{\sum_{i=1}^n Q_{ik}^{(t)} x_i}{\sum_{i=1}^n Q_{ik}^{(t)}}
Will the algorithm converge? What does it converge to?
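A minimal, self-contained sketch of these E and M updates for the spherical-Gaussian mixture (known, shared sigma as on the earlier slide; the function name and the initialisation scheme are illustrative choices, not prescribed by the slides):

    import numpy as np
    from scipy.special import logsumexp

    def em_spherical_gmm(x, K, sigma=1.0, n_iter=50, seed=0):
        """EM for a mixture of spherical Gaussians with known, shared sigma.

        Returns mixing proportions pis (K,), means mus (K, p), responsibilities Q (n, K)."""
        rng = np.random.default_rng(seed)
        n, p = x.shape
        # Initialise with K randomly chosen data points as means, uniform mixing proportions.
        mus = x[rng.choice(n, size=K, replace=False)]
        pis = np.full(K, 1.0 / K)

        for _ in range(n_iter):
            # E-step: Q[i, k] = p(z_i = k | x_i, theta), in log space for stability.
            sq_dists = ((x[:, None, :] - mus[None, :, :]) ** 2).sum(-1)
            log_f = -0.5 * p * np.log(2 * np.pi * sigma**2) - sq_dists / (2 * sigma**2)
            log_joint = np.log(pis)[None, :] + log_f
            Q = np.exp(log_joint - logsumexp(log_joint, axis=1, keepdims=True))

            # M-step: exactly the weighted-average updates on the slide.
            Nk = Q.sum(axis=0)                  # effective number of points per component
            pis = Nk / n
            mus = (Q.T @ x) / Nk[:, None]

        return pis, mus, Q

Run on the data simulated earlier, e.g. pis_hat, mus_hat, Q = em_spherical_gmm(x, K=3, sigma=1.0); different initialisations may converge to different local optima, which is why multiple restarts are recommended later in these slides.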
Example: Mixture of 3 Gaussians
An example with 3 clusters. (Scatter plot of the data; axes X1 and X2.)
Example: Mixture of 3 Gaussians
After the 1st E and M step (Iteration 1).
Example: Mixture of 3 Gaussians
After the 2nd E and M step (Iteration 2).
Example: Mixture of 3 Gaussians
After the 3rd E and M step (Iteration 3).
Example: Mixture of 3 Gaussians
After the 4th E and M step (Iteration 4).
Example: Mixture of 3 Gaussians
After the 5th E and M step (Iteration 5).
EM Algorithm
In the maximum likelihood framework, the objective function is the log likelihood,
    l(\theta) = \sum_{i=1}^n \log \sum_{k=1}^K \pi_k f(x_i|\mu_k).
Direct maximisation is not feasible.
Consider another objective function F(\theta, q), where q is any probability distribution on the latent variables z, such that:
    F(\theta, q) \leq l(\theta) for all \theta, q, \qquad \max_q F(\theta, q) = l(\theta).
F(\theta, q) is a lower bound on the log likelihood.
We can construct an alternating maximisation algorithm as follows. For t = 1, 2, ... until convergence:
    q^{(t)} := \operatorname*{argmax}_q F(\theta^{(t-1)}, q),
    \theta^{(t)} := \operatorname*{argmax}_\theta F(\theta, q^{(t)}).
EM Algorithm
The lower bound we use is called the variational free energy.
q is a probability mass function for a distribution over z := (z_i)_{i=1}^n:
    F(\theta, q) = E_q[\log p(x, z|\theta) - \log q(z)]
                 = E_q\left[ \left( \sum_{i=1}^n \sum_{k=1}^K 1(z_i = k) (\log \pi_k + \log f(x_i|\mu_k)) \right) - \log q(z) \right]
                 = \sum_z q(z) \left[ \left( \sum_{i=1}^n \sum_{k=1}^K 1(z_i = k) (\log \pi_k + \log f(x_i|\mu_k)) \right) - \log q(z) \right].
Lemma: F(\theta, q) \leq l(\theta) for all q and for all \theta.
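The Lemma is stated without proof here; one standard way to see it (a sketch, not taken from the slides) writes the free energy in terms of a KL divergence:
    F(\theta, q) = \sum_z q(z) \log \frac{p(x, z | \theta)}{q(z)}
                 = \sum_z q(z) \log \frac{p(z | x, \theta) \, p(x | \theta)}{q(z)}
                 = l(\theta) - \mathrm{KL}\big( q(z) \,\|\, p(z | x, \theta) \big).
Since \mathrm{KL}(q \| p) \geq 0, we get F(\theta, q) \leq l(\theta) for all q and \theta, with equality exactly when q(z) = p(z | x, \theta), which is the optimal q identified on the next slide.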
EM Algorithm - Solving for q
Lemma: F(\theta, q) = l(\theta) for q(z) = p(z | x, \theta).
In combination with the previous Lemma, this implies that q(z) = p(z | x, \theta) maximizes F(\theta, q) for fixed \theta, i.e., the optimal q is simply the conditional distribution of the latents given the data and that fixed \theta.
In the mixture model,
    q^*(z) = \frac{p(z, x | \theta)}{p(x | \theta)} = \frac{\prod_{i=1}^n \pi_{z_i} f(x_i|\mu_{z_i})}{\sum_{z'} \prod_{i=1}^n \pi_{z'_i} f(x_i|\mu_{z'_i})} = \prod_{i=1}^n \frac{\pi_{z_i} f(x_i|\mu_{z_i})}{\sum_k \pi_k f(x_i|\mu_k)} = \prod_{i=1}^n p(z_i | x_i, \theta).
EM Algorithm - Solving for θ
Setting the derivative with respect to \mu_k to 0,
    \nabla_{\mu_k} F(\theta, q) = \sum_z q(z) \sum_{i=1}^n 1(z_i = k) \nabla_{\mu_k} \log f(x_i|\mu_k) = \sum_{i=1}^n q(z_i = k) \nabla_{\mu_k} \log f(x_i|\mu_k) = 0.
This equation can be solved quite easily. E.g., for a mixture of Gaussians,
    \mu_k = \frac{\sum_{i=1}^n q(z_i = k) x_i}{\sum_{i=1}^n q(z_i = k)}.
If it cannot be solved exactly, we can use a gradient ascent step (generalized EM):
    \mu_k \leftarrow \mu_k + \alpha \sum_{i=1}^n q(z_i = k) \nabla_{\mu_k} \log f(x_i|\mu_k).
A similar derivation gives the optimal \pi_k as before.
EM Algorithm
Start with some initial parameters (\pi_k^{(0)}, \mu_k^{(0)})_{k=1}^K. Iterate for t = 1, 2, ...:
Expectation Step:
    q^{(t)}(z_i = k) := p(z_i = k | x_i, \theta^{(t-1)}) = \frac{\pi_k^{(t-1)} f(x_i|\mu_k^{(t-1)})}{\sum_{j=1}^K \pi_j^{(t-1)} f(x_i|\mu_j^{(t-1)})}
Maximization Step:
    \pi_k^{(t)} = \frac{\sum_{i=1}^n q^{(t)}(z_i = k)}{n}, \qquad \mu_k^{(t)} = \frac{\sum_{i=1}^n q^{(t)}(z_i = k) x_i}{\sum_{i=1}^n q^{(t)}(z_i = k)}
Theorem: The EM algorithm does not decrease the log likelihood.
Proof: l(\theta^{(t-1)}) = F(\theta^{(t-1)}, q^{(t)}) \leq F(\theta^{(t)}, q^{(t)}) \leq F(\theta^{(t)}, q^{(t+1)}) = l(\theta^{(t)}).
An additional assumption, that the Hessians \nabla^2_\theta F(\theta^{(t)}, q^{(t)}) are negative definite with eigenvalues bounded above by -\epsilon < 0, implies that \theta^{(t)} \to \theta^* where \theta^* is a local MLE.
Notes on Probabilistic Approach and EM Algorithm
Some good things:
Guaranteed convergence to locally optimal parameters.
Formal reasoning about uncertainties, using both Bayes' Theorem and maximum likelihood theory.
Rich language of probability theory to express a wide range of generative models, and straightforward derivation of algorithms for ML estimation.
Some bad things:
Can get stuck in local optima, so multiple starts are recommended.
Slower and more expensive than K-means.
Choice of K is still problematic, but a rich array of methods for model selection comes to the rescue.
Flexible Gaussian Mixture Models
We can allow each cluster to have its own mean and covariance structure to enable greater flexibility in the model:
Different covariances;
Different, but diagonal, covariances;
Identical covariances;
Identical and spherical covariances.
Probabilistic PCA
A probabilistic model related to PCA (also known as sensible PCA) has the following generative model: for i = 1, 2, ..., n:
Let k < n, p be given. Let Y_i be a (latent) k-dimensional normally distributed random variable with zero mean and identity covariance:
    Y_i \sim N(0, I_k).
We model the distribution of the i-th data point given Y_i as a p-dimensional normal:
    X_i \sim N(\mu + L Y_i, \sigma^2 I),
where the parameters are a vector \mu \in \mathbb{R}^p, a matrix L \in \mathbb{R}^{p \times k} and \sigma^2 > 0.
(Tipping and Bishop, 1999)
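A short sketch of sampling from this generative model (the dimensions and parameter values below are illustrative choices, not from the slides):

    import numpy as np

    rng = np.random.default_rng(1)

    # Illustrative sizes and parameters: latent dimension k, observed dimension p.
    n, p, k = 300, 5, 2
    mu = np.zeros(p)
    L = rng.standard_normal((p, k))       # loading matrix L in R^{p x k}
    sigma = 0.5

    # Generative model: Y_i ~ N(0, I_k), then X_i | Y_i ~ N(mu + L Y_i, sigma^2 I_p).
    Y = rng.standard_normal((n, k))
    X = mu + Y @ L.T + sigma * rng.standard_normal((n, p))

    # Marginally, X_i ~ N(mu, L L^T + sigma^2 I), the covariance targeted by the MLE below.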
Probabilistic PCA: EM vs MLE
The EM algorithm can be used for ML estimation (lecture notes), but PPCA also admits a direct MLE (which is not unique).
Let \lambda_1 \geq \dots \geq \lambda_p be the eigenvalues of the sample covariance and V_{1:k} \in \mathbb{R}^{p \times k} the top k eigenvectors as before. Let Q \in \mathbb{R}^{k \times k} be any orthogonal matrix. Then an MLE is given by:
    \mu^{MLE} = \bar{x}, \qquad (\sigma^2)^{MLE} = \frac{1}{p - k} \sum_{j=k+1}^p \lambda_j,
    L^{MLE} = V_{1:k} \, \mathrm{diag}\left( (\lambda_1 - (\sigma^2)^{MLE})^{1/2}, \dots, (\lambda_k - (\sigma^2)^{MLE})^{1/2} \right) Q.
However, EM can be faster, can be implemented online, can handle missing data, and can be extended to more complicated models!
(Tipping and Bishop, 1999)
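A sketch of this closed-form MLE, taking Q = I (the function name is an assumption made here; it can be applied to the X simulated in the previous sketch or any n x p data array):

    import numpy as np

    def ppca_mle(X, k):
        """Closed-form PPCA MLE with Q = I: returns (mu_hat, sigma2_hat, L_hat)."""
        n, p = X.shape
        mu_hat = X.mean(axis=0)
        S = np.cov(X, rowvar=False, bias=True)      # sample covariance (MLE, divide by n)
        lam, V = np.linalg.eigh(S)                  # eigenvalues in ascending order
        lam, V = lam[::-1], V[:, ::-1]              # sort descending
        sigma2_hat = lam[k:].mean()                 # average of the p - k discarded eigenvalues
        L_hat = V[:, :k] * np.sqrt(np.maximum(lam[:k] - sigma2_hat, 0.0))
        return mu_hat, sigma2_hat, L_hat

    # e.g. mu_hat, sigma2_hat, L_hat = ppca_mle(X, k=2)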
Probabilistic PCA
(Sequence of figures illustrating PPCA: the latents and the principal subspace; the PCA projection; the PPCA noise and latent prior; and the PPCA posterior and projection. Figures from M. Sahani's UCL course on Unsupervised Learning.)
Mixture of Probabilistic PCAs
We have learnt two types of unsupervised learning techniques:
Dimensionality reduction, e.g. PCA, MDS, Isomap.
Clustering, e.g. K-means, linkage and mixture models.
Probabilistic models allow us to construct more complex models from simpler pieces.
A mixture of probabilistic PCAs allows both clustering and dimensionality reduction at the same time:
    Z_i \sim \text{Discrete}(\pi_1, \dots, \pi_K),
    Y_i \sim N(0, I_d),
    X_i | Z_i = k, Y_i = y_i \sim N(\mu_k + L_k y_i, \sigma^2 I_p).
This allows flexible modelling of the covariance structure without using too many parameters.
(Ghahramani and Hinton, 1996)
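A brief generative sketch of the mixture of PPCAs, assuming per-component loading matrices L_k as written above (the sizes and parameter values are illustrative choices, not from the slides):

    import numpy as np

    rng = np.random.default_rng(2)

    # Illustrative sizes: K components, latent dimension d, observed dimension p.
    K, d, p, n = 3, 2, 5, 400
    pis = np.array([0.4, 0.4, 0.2])
    mus = rng.standard_normal((K, p)) * 3.0
    Ls = rng.standard_normal((K, p, d))          # per-component loading matrices L_k
    sigma = 0.5

    # Generative model: Z_i ~ Discrete(pi), Y_i ~ N(0, I_d),
    # X_i | Z_i = k, Y_i = y_i ~ N(mu_k + L_k y_i, sigma^2 I_p).
    z = rng.choice(K, size=n, p=pis)
    Y = rng.standard_normal((n, d))
    X = mus[z] + np.einsum('ipd,id->ip', Ls[z], Y) + sigma * rng.standard_normal((n, p))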
Further reading
Hastie et al., Section 8.5
Bishop, Chapter 9
Roweis and Ghahramani: A unifying review of linear Gaussian models