Latent Variable Models and EM algorithm

SC4/SM4 Data Mining and Machine Learning, Hilary Term 2017
Dino Sejdinovic

3.1 Clustering and Mixture Modelling

K-means and hierarchical clustering are non-probabilistic algorithms based on the intuitive notions of grouping similar instances together and dissimilar instances apart. Their goal is not to model the probability of the observed data items. In contrast, probabilistic unsupervised learning constructs a generative model that describes the clustering of the items. We assume that there is some latent/unobserved process governing the data generation, and based on the data we will try to answer questions about this generating process.

Mixture models assume that our dataset X was created by sampling i.i.d. from K distinct populations (called mixture components). In other words, the data come from a mixture of several sources, and the model for the data can be viewed as a convex combination of several distinct probability distributions, often modelled with a given parametric family. Samples in population k can be modelled using a distribution F_{\mu_k} with density f(x \mid \mu_k), where \mu_k is the model parameter for the k-th component. For a concrete example, consider a p-dimensional multivariate normal density with unknown mean \mu_k and known diagonal covariance \sigma^2 I,

    f(x \mid \mu_k) = (2\pi\sigma^2)^{-p/2} \exp\left( -\tfrac{1}{2\sigma^2} \| x - \mu_k \|_2^2 \right).    (3.1)

Such a model corresponds to the following generative process, whereby for each data item i = 1, 2, \dots, n, we

(i) first determine the assignment variable (independently for each data item i):

    Z_i \overset{\text{i.i.d.}}{\sim} \text{Discrete}(\pi_1, \dots, \pi_K), \quad \text{i.e.,} \quad \mathbb{P}(Z_i = k) = \pi_k,

where, for k = 1, \dots, K, the \pi_k \geq 0 with \sum_{k=1}^K \pi_k = 1 are the mixing proportions, additional model parameters to be inferred;

(ii) then, given the assignment Z_i = k of the mixture component, X_i = (X_i^{(1)}, \dots, X_i^{(p)}) is sampled (independently) from the corresponding k-th component:

    (X_i \mid Z_i = k) \sim f(x \mid \mu_k).

We observe X_i = x_i for each i but do not observe its assignment Z_i (latent variables), and would like to infer the parameters \theta = (\mu_1, \dots, \mu_K, \pi_1, \dots, \pi_K) as well as the latent variables.
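
The two-stage generative process above is easy to simulate. The following is a minimal sketch of sampling from the normal mixture (3.1); the particular choices of K, the mixing proportions and the component means are arbitrary illustrative values, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (arbitrary choices): K components in p = 2 dimensions.
K, p, n = 3, 2, 500
pi = np.array([0.5, 0.3, 0.2])            # mixing proportions, sum to 1
mu = np.array([[0.0, 0.0],
               [4.0, 4.0],
               [-4.0, 3.0]])              # component means mu_k, shape (K, p)
sigma2 = 1.0                              # known common variance sigma^2

# (i) sample the latent assignments Z_i ~ Discrete(pi_1, ..., pi_K)
z = rng.choice(K, size=n, p=pi)

# (ii) sample X_i | Z_i = k from N(mu_k, sigma^2 I)
x = mu[z] + np.sqrt(sigma2) * rng.standard_normal((n, p))

print(x.shape)   # (500, 2); in practice z is unobserved ("latent") and discarded
```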

Note that the complete data log-likelihood in the model,

    \log p(z, X \mid \theta) = \log \prod_{i=1}^n \pi_{z_i} f(x_i \mid \mu_{z_i}) = \sum_{i=1}^n \left( \log \pi_{z_i} + \log f(x_i \mid \mu_{z_i}) \right),    (3.2)

is not available, as the z_i are not observed. We can instead marginalise over the latent variables,

    p(X \mid \theta) = \sum_{z_1=1}^{K} \cdots \sum_{z_n=1}^{K} \prod_{i=1}^n \pi_{z_i} f(x_i \mid \mu_{z_i}) = \prod_{i=1}^n \sum_{k=1}^{K} \pi_k f(x_i \mid \mu_k),    (3.3)

giving the marginal log-likelihood of the observations,

    \ell(\theta) = \log p(X \mid \theta) = \sum_{i=1}^n \log \sum_{k=1}^{K} \pi_k f(x_i \mid \mu_k).

However, direct maximisation is not feasible, and the marginal log-likelihood will often have many local optima. Fortunately, there is a simple algorithm for local maximisation of the marginal log-likelihood, called Expectation Maximisation (EM), which we will describe in Section 3.3.
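
For the normal mixture (3.1), the marginal log-likelihood \ell(\theta) is straightforward to evaluate numerically. Below is a minimal sketch; the helper name marginal_log_likelihood, the data and the parameter values are arbitrary illustrative choices, and the log-sum-exp trick is used only for numerical stability.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def marginal_log_likelihood(x, pi, mu, sigma2):
    """l(theta) = sum_i log sum_k pi_k f(x_i | mu_k) for a normal mixture
    with known common covariance sigma^2 * I."""
    p = x.shape[1]
    # log pi_k + log f(x_i | mu_k) for each i, k -> shape (n, K)
    log_terms = np.stack(
        [np.log(pi_k) + multivariate_normal.logpdf(x, mean=mu_k, cov=sigma2 * np.eye(p))
         for pi_k, mu_k in zip(pi, mu)],
        axis=1,
    )
    # log-sum-exp over k avoids underflow of the individual component densities
    return logsumexp(log_terms, axis=1).sum()

# Quick check on arbitrary data and parameters
rng = np.random.default_rng(0)
x = rng.standard_normal((100, 2))
pi = np.array([0.5, 0.3, 0.2])
mu = np.array([[0.0, 0.0], [4.0, 4.0], [-4.0, 3.0]])
print(marginal_log_likelihood(x, pi, mu, sigma2=1.0))
```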

3.2 KL Divergence and Gibbs Inequality

Before we describe the EM algorithm, we review the notion of Kullback-Leibler (KL) divergence, or relative entropy, between probability distributions P and Q.

KL divergence. Let P and Q be two absolutely continuous probability distributions on \mathcal{X} \subseteq \mathbb{R}^d with densities p and q respectively. Then the KL divergence from Q to P is defined as

    D_{KL}(P \,\|\, Q) = \int_{\mathcal{X}} p(x) \log \frac{p(x)}{q(x)} \, dx.    (3.4)

Let P and Q be two discrete probability distributions with probability mass functions p and q respectively. Then the KL divergence from Q to P is defined as

    D_{KL}(P \,\|\, Q) = \sum_i p(x_i) \log \frac{p(x_i)}{q(x_i)}.    (3.5)

In both cases, we can write

    D_{KL}(P \,\|\, Q) = \mathbb{E}_p \left[ \log \frac{p(X)}{q(X)} \right],    (3.6)

where \mathbb{E}_p denotes that the expectation is taken with respect to p. By convexity of f(x) = -\log(x) and Jensen's inequality (3.8), we have that

    D_{KL}(P \,\|\, Q) = \mathbb{E}_p \left[ -\log \frac{q(X)}{p(X)} \right] \geq -\log \mathbb{E}_p \left[ \frac{q(X)}{p(X)} \right] = 0,    (3.7)

where in the last step we used that \int_{\mathcal{X}} q(x) \, dx = 1 in the continuous case and \sum_i q(x_i) = 1 in the discrete case.

Jensen's inequality. Let f be a convex function and X be a random variable. Then

    \mathbb{E}[f(X)] \geq f(\mathbb{E} X).    (3.8)

If f is strictly convex, then equality holds if and only if X is almost surely a constant.

Thus, we conclude that the KL divergence is always non-negative. This consequence of Jensen's inequality is called the Gibbs inequality. Moreover, since f(x) = -\log(x) is strictly convex on x > 0, equality holds if and only if p(x) = q(x) almost everywhere, i.e. P = Q. Note that in general the KL divergence is not symmetric: D_{KL}(P \,\|\, Q) \neq D_{KL}(Q \,\|\, P).
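
As a quick numerical illustration of the Gibbs inequality and of the asymmetry of the KL divergence, here is a minimal sketch with two arbitrary discrete distributions (the values of p and q are made up for illustration).

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i p_i log(p_i / q_i) for discrete distributions
    (assumes q_i > 0 wherever p_i > 0)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                      # 0 * log(0/q) = 0 by convention
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.3, 0.3, 0.4])

print(kl_divergence(p, q))   # non-negative (Gibbs inequality)
print(kl_divergence(q, p))   # generally different: KL is not symmetric
print(kl_divergence(p, p))   # 0, since equality holds iff P = Q
```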

3.3 EM Algorithm

The EM algorithm is a general-purpose iterative strategy for local maximisation of the likelihood under missing data/hidden variables. The method has been proposed many times for specific models; it was given its name and studied as a general framework by [1].

Let (X, z) be a pair of observed variables X and latent variables z. Our probabilistic model is given by p(X, z \mid \theta), but we have no access to z. Therefore, we would like to maximise the observed data log-likelihood (marginal log-likelihood)

    \ell(\theta) = \log p(X \mid \theta) = \log \int p(X, z \mid \theta) \, dz

over \theta. However, marginalisation of latent variables typically results in an intractable optimisation problem, and we need to resort to approximations.

Now, assume for a moment that we have access to another objective function F(\theta, q), where q(z) is a distribution on the latent variables z which we are free to choose and will call the variational distribution. Moreover, assume that F satisfies

    F(\theta, q) \leq \ell(\theta) \quad \text{for all } \theta, q,    (3.9)
    \max_q F(\theta, q) = \ell(\theta),    (3.10)

i.e. F(\theta, q) is a lower bound on the log-likelihood for any variational distribution q (3.9), which also matches the log-likelihood at a particular choice of q (3.10). Given these two properties, we can construct an alternating maximisation (coordinate ascent) algorithm as follows.

Coordinate ascent on the lower bound. For t = 1, 2, \dots until convergence:

    q^{(t)} := \arg\max_q F(\theta^{(t-1)}, q),
    \theta^{(t)} := \arg\max_\theta F(\theta, q^{(t)}).

Theorem 3.1. Assuming (3.9) and (3.10), coordinate ascent on the lower bound F(\theta, q) does not decrease the log-likelihood \ell(\theta).

Proof. \ell(\theta^{(t-1)}) = F(\theta^{(t-1)}, q^{(t)}) \leq F(\theta^{(t)}, q^{(t)}) \leq F(\theta^{(t)}, q^{(t+1)}) = \ell(\theta^{(t)}).

Under the additional assumption that the Hessians \nabla^2_\theta F(\theta^{(t)}, q^{(t)}) are negative definite with eigenvalues bounded above by some -\epsilon < 0, one obtains \theta^{(t)} \to \theta^\star, where \theta^\star is a local MLE.

But how do we find such a lower bound F? It is given by the so-called variational free energy, which we define next.

Definition 3.1. The variational free energy in a latent variable model p(X, z \mid \theta) is defined as

    F(\theta, q) = \mathbb{E}_q \left[ \log p(X, z \mid \theta) - \log q(z) \right],    (3.11)

where q is any probability density/mass function over the latent variables z.

Consider the KL divergence between q(z) and the true conditional p(z \mid X, \theta) = p(X, z \mid \theta) / p(X \mid \theta) based on our model, for the observations X and a fixed parameter vector \theta. Since KL is non-negative,

    0 \leq D_{KL}\left[ q(z) \,\|\, p(z \mid X, \theta) \right] = \mathbb{E}_{z \sim q} \log \frac{q(z)}{p(z \mid X, \theta)} = \log p(X \mid \theta) + \mathbb{E}_{z \sim q} \log \frac{q(z)}{p(X, z \mid \theta)}.

Thus, we have obtained a lower bound on the marginal log-likelihood which holds for any parameter value \theta and any choice of the variational distribution q:

    \ell(\theta) = \log p(X \mid \theta) \geq \mathbb{E}_{z \sim q} \log \frac{p(X, z \mid \theta)}{q(z)} = \underbrace{\mathbb{E}_{z \sim q} \log p(X, z \mid \theta)}_{\text{energy}} \; \underbrace{- \, \mathbb{E}_{z \sim q} \log q(z)}_{\text{entropy}}.    (3.12)

The right hand side in (3.12) is precisely the variational free energy, and we see that it decomposes into two terms. The first term is usually referred to as the energy (using physics terminology); more precisely, it is the expected complete data log-likelihood (if we observed z, we would just maximise the complete data log-likelihood \log p(X, z \mid \theta), but since z is not observed we need to integrate it out; recall that q here is any distribution over the latent variables). The second term is the Shannon entropy H(q) = -\mathbb{E}_q \log q(z) of the variational distribution q(z), and it does not depend on \theta (it can be thought of as a complexity penalty on q). The inequality becomes an equality when the KL divergence is zero, i.e. when q(z) = p(z \mid X, \theta), which means that the optimal choice of variational distribution q for a fixed parameter value \theta is the true conditional of the latent variables given the observations and that \theta.
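
The bound (3.12) and its tightness at q(z) = p(z | X, \theta) are easy to verify numerically. Below is a minimal sketch for a toy model with a single observation and a single discrete latent variable z taking K values, with joint p(x, z = k | \theta) = \pi_k N(x; \mu_k, \sigma^2); all parameter values are arbitrary illustrative choices.

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

rng = np.random.default_rng(1)

# Toy model: one observation x, one discrete latent z in {1,...,K}.
pi = np.array([0.5, 0.3, 0.2])
mu = np.array([-2.0, 0.0, 3.0])
sigma = 1.0
x = 0.7

log_joint = np.log(pi) + norm.logpdf(x, loc=mu, scale=sigma)   # log p(x, z=k | theta)
log_lik = logsumexp(log_joint)                                  # l(theta) = log p(x | theta)

def free_energy(q):
    """F(theta, q) = E_q[log p(x, z | theta)] - E_q[log q(z)] (energy + entropy)."""
    q = np.asarray(q, dtype=float)
    return np.sum(q * log_joint) - np.sum(q * np.log(q))

q_random = rng.dirichlet(np.ones(3))           # an arbitrary variational distribution
q_post = np.exp(log_joint - log_lik)           # the true conditional p(z | x, theta)

print(log_lik)
print(free_energy(q_random))   # <= log_lik for any q            (3.9)
print(free_energy(q_post))     # equals log_lik when q = p(z|x)  (3.10)
```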

Thus, we have proved the following lemma.

Lemma 3.1. Let F be the variational free energy in a latent variable model p(X, z \mid \theta). Then

- F(\theta, q) \leq \ell(\theta) for all q and for all \theta, and
- for any \theta, F(\theta, q) = \ell(\theta) if and only if q(z) = p(z \mid X, \theta).

Thus, properties (3.9) and (3.10) are satisfied, and we can recast the alternating maximisation of the variational free energy as iterative updates of q (E-step: plug in the full conditional of z using the current estimate of \theta) and of \theta (M-step: maximise the energy for the current estimate of q). Provided that both the E-step and the M-step can be solved exactly, the EM algorithm converges to a local maximum likelihood solution.

EM Algorithm. Initialise \theta^{(0)}. At time t \geq 1:

    E-step: Set q^{(t)}(z) = p(z \mid X, \theta^{(t-1)}).
    M-step: Set \theta^{(t)} = \arg\max_\theta \mathbb{E}_{z \sim q^{(t)}} \log p(X, z \mid \theta).

3.4 EM Algorithm for Mixtures

Consider again our mixture model from Section 3.1, with

    p(z, X \mid \theta) = \prod_{i=1}^n \pi_{z_i} f(x_i \mid \mu_{z_i}).

Recall that our latent variables z are discrete (they correspond to cluster assignments), so q is a probability mass function over z = (z_i)_{i=1}^n. Using the expression (3.2), we can write the variational free energy as

    F(\theta, q) = \mathbb{E}_q \left[ \log p(X, z \mid \theta) - \log q(z) \right]
                 = \mathbb{E}_q \left[ \sum_{i=1}^n \sum_{k=1}^K \mathbf{1}(z_i = k) \left( \log \pi_k + \log f(x_i \mid \mu_k) \right) - \log q(z) \right]
                 = \sum_z q(z) \left[ \sum_{i=1}^n \sum_{k=1}^K \mathbf{1}(z_i = k) \left( \log \pi_k + \log f(x_i \mid \mu_k) \right) - \log q(z) \right]
                 = \sum_{i=1}^n \sum_{k=1}^K q(z_i = k) \left( \log \pi_k + \log f(x_i \mid \mu_k) \right) + H(q).

We will denote Q_{ik} = q(z_i = k), which is called the responsibility of cluster k for data item i.

Now, the E-step simplifies because

    p(z \mid X, \theta) = \frac{p(X, z \mid \theta)}{p(X \mid \theta)} = \frac{\prod_{i=1}^n \pi_{z_i} f(x_i \mid \mu_{z_i})}{\sum_{z'} \prod_{i=1}^n \pi_{z'_i} f(x_i \mid \mu_{z'_i})} = \prod_{i=1}^n \frac{\pi_{z_i} f(x_i \mid \mu_{z_i})}{\sum_{k=1}^K \pi_k f(x_i \mid \mu_k)} = \prod_{i=1}^n p(z_i \mid x_i, \theta).

Thus, for a fixed \theta^{(t-1)} = (\mu_1^{(t-1)}, \dots, \mu_K^{(t-1)}, \pi_1^{(t-1)}, \dots, \pi_K^{(t-1)}) we can set

    Q_{ik}^{(t)} = p(z_i = k \mid x_i, \theta^{(t-1)}) = \frac{\pi_k^{(t-1)} f(x_i \mid \mu_k^{(t-1)})}{\sum_{j=1}^K \pi_j^{(t-1)} f(x_i \mid \mu_j^{(t-1)})}.    (3.13)

Now, consider the M-step. For the mixing proportions we have the constraint \sum_{j=1}^K \pi_j = 1, so we introduce a Lagrange multiplier and obtain

    \frac{\partial}{\partial \pi_k} \left( F(\theta, q) - \lambda \left( \sum_{j=1}^K \pi_j - 1 \right) \right) = \sum_{i=1}^n \frac{Q_{ik}}{\pi_k} - \lambda = 0 \quad \Longrightarrow \quad \pi_k \propto \sum_{i=1}^n Q_{ik}.

Since \sum_{k=1}^K \sum_{i=1}^n Q_{ik} = \sum_{i=1}^n \underbrace{\sum_{k=1}^K Q_{ik}}_{=1} = n, the M-step update for the mixing proportions is

    \pi_k^{(t)} = \frac{\sum_{i=1}^n Q_{ik}^{(t)}}{n},    (3.14)

i.e. they are simply given by the total responsibility of each cluster. Note that this update holds regardless of the form of the parametric family f(\cdot \mid \mu_k) used for the mixture components. Setting the derivative with respect to \mu_k to 0, we obtain

    \frac{\partial}{\partial \mu_k} F(\theta, q) = \sum_{i=1}^n Q_{ik} \frac{\partial}{\partial \mu_k} \log f(x_i \mid \mu_k) = 0.    (3.15)

This equation can be solved quite easily for the mixture of normals in (3.1), giving the M-step update

    \mu_k^{(t)} = \frac{\sum_{i=1}^n Q_{ik}^{(t)} x_i}{\sum_{i=1}^n Q_{ik}^{(t)}},    (3.16)

which implies that the k-th cluster mean estimate is simply a weighted average of all the data items, where the weights correspond to the responsibilities of cluster k for these points.
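
As an implementation aside (not part of the derivation): in practice the responsibilities (3.13) are best computed in the log domain, since the component densities f(x_i | \mu_k) can underflow in high dimensions; the normalisation over k then becomes a log-sum-exp. A minimal sketch for the normal mixture (3.1), with an arbitrary helper name and toy inputs:

```python
import numpy as np
from scipy.special import logsumexp

def responsibilities(x, pi, mu, sigma2):
    """E-step (3.13) for the normal mixture (3.1), computed in the log domain.

    x: (n, p) data, pi: (K,) mixing proportions, mu: (K, p) means,
    sigma2: known common variance. Returns Q of shape (n, K)."""
    p = x.shape[1]
    # log pi_k + log f(x_i | mu_k); the constant -(p/2) log(2 pi sigma^2)
    # cancels in the normalisation but is kept here for clarity.
    sq_dist = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (n, K)
    log_num = (np.log(pi)[None, :]
               - 0.5 * p * np.log(2 * np.pi * sigma2)
               - sq_dist / (2 * sigma2))
    return np.exp(log_num - logsumexp(log_num, axis=1, keepdims=True))

# Quick check: rows of Q sum to one.
rng = np.random.default_rng(0)
x = rng.standard_normal((5, 2))
Q = responsibilities(x, pi=np.array([0.6, 0.4]),
                     mu=np.array([[0.0, 0.0], [3.0, 3.0]]), sigma2=1.0)
print(Q.sum(axis=1))   # all ones
```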

Put together, EM for the normal mixture model with known (fixed) covariance is very similar to the K-means algorithm, but with soft cluster assignments: rather than assigning each data item x_i to a single cluster at each iteration, we carry forward a responsibility vector (Q_{i1}, \dots, Q_{iK}) giving the probabilities of x_i belonging to each cluster. Indeed, the K-means algorithm can be understood as EM in the limit \sigma^2 \to 0, in which the E-step assigns exactly one entry of (Q_{i1}, \dots, Q_{iK}) the value one (the entry corresponding to the nearest mean vector) and the rest zero.

EM for Normal Mixtures (known covariance) / Soft K-means

1. Initialise K cluster means \mu_1, \dots, \mu_K and mixing proportions \pi_1, \dots, \pi_K.

2. Update responsibilities (E-step): for each i = 1, \dots, n and k = 1, \dots, K,

       Q_{ik} = \frac{\pi_k \exp\left( -\tfrac{1}{2\sigma^2} \| x_i - \mu_k \|_2^2 \right)}{\sum_{j=1}^K \pi_j \exp\left( -\tfrac{1}{2\sigma^2} \| x_i - \mu_j \|_2^2 \right)}.    (3.17)

3. Update parameters (M-step): set \mu_1, \dots, \mu_K and \pi_1, \dots, \pi_K based on the new cluster responsibilities:

       \pi_k = \frac{\sum_{i=1}^n Q_{ik}}{n}, \qquad \mu_k = \frac{\sum_{i=1}^n Q_{ik} x_i}{\sum_{i=1}^n Q_{ik}}.    (3.18)

4. Repeat steps 2-3 until convergence.

5. Return the responsibilities \{Q_{ik}\} and the parameters \mu_1, \dots, \mu_K, \pi_1, \dots, \pi_K.

In some cases, depending on the form of the parametric family f(\cdot \mid \mu_k), the M-step update for mixtures cannot be solved exactly. In these cases, we can use a gradient ascent algorithm inside the M-step:

    \mu_k^{(r+1)} = \mu_k^{(r)} + \alpha \sum_{i=1}^n Q_{ik} \frac{\partial}{\partial \mu_k} \log f(x_i \mid \mu_k^{(r)}).

This leads to the generalised EM algorithm.
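
The algorithm box above translates directly into code. Below is a minimal sketch of the full loop, monitoring the marginal log-likelihood, which by Theorem 3.1 should never decrease across iterations. The function name, the initialisation from randomly chosen data points, and the stopping tolerance are arbitrary illustrative choices.

```python
import numpy as np
from scipy.special import logsumexp

def em_normal_mixture(x, K, sigma2, n_iter=100, tol=1e-8, seed=0):
    """EM / soft K-means for a normal mixture with known covariance sigma^2 * I."""
    rng = np.random.default_rng(seed)
    n, p = x.shape
    mu = x[rng.choice(n, size=K, replace=False)]   # initialise means at random data points
    pi = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step (3.17): responsibilities, computed in the log domain
        sq_dist = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        log_num = (np.log(pi)[None, :]
                   - 0.5 * p * np.log(2 * np.pi * sigma2)
                   - sq_dist / (2 * sigma2))
        log_norm = logsumexp(log_num, axis=1, keepdims=True)
        Q = np.exp(log_num - log_norm)
        # M-step (3.18): mixing proportions and means
        Nk = Q.sum(axis=0)                          # total responsibility of each cluster
        pi = Nk / n
        mu = (Q.T @ x) / Nk[:, None]
        # Marginal log-likelihood, for monitoring convergence
        ll = log_norm.sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pi, mu, Q, ll

# Example on synthetic data drawn from a 3-component mixture (as in Section 3.1)
rng = np.random.default_rng(0)
true_mu = np.array([[0.0, 0.0], [4.0, 4.0], [-4.0, 3.0]])
z = rng.choice(3, size=500, p=[0.5, 0.3, 0.2])
x = true_mu[z] + rng.standard_normal((500, 2))
pi_hat, mu_hat, Q, ll = em_normal_mixture(x, K=3, sigma2=1.0)
print(pi_hat, mu_hat, ll)
```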

3.5 Probabilistic PCA

So far, we have considered the application of EM to clustering, but it can be applied to latent variable models more broadly. Here, we will derive EM for Probabilistic PCA [3], a latent variable model for probabilistic dimensionality reduction. Just like in PCA, we try to model a collection of n p-dimensional vectors using a k-dimensional representation with k < p. Probabilistic PCA corresponds to the following generative model. For each data item i = 1, 2, \dots, n:

- Let Y_i be a (latent) k-dimensional normally distributed random vector with mean 0 and identity covariance:

      Y_i \sim \mathcal{N}(0, I_k);

- Given Y_i, the distribution of the i-th data item is a p-dimensional normal:

      X_i \sim \mathcal{N}(\mu + L Y_i, \sigma^2 I),

  where the parameters \theta = (\mu, L, \sigma^2) correspond to a vector \mu \in \mathbb{R}^p, a matrix L \in \mathbb{R}^{p \times k} and \sigma^2 > 0.

Note that, unlike in clustering, the latent variables Y_1, \dots, Y_n are now continuous. From the equivalent representation X_i = \mu + L Y_i + \epsilon, where \epsilon \sim \mathcal{N}(0, \sigma^2 I_p) is independent of Y_i, we see that the marginal model on the X_i is

    f(x \mid \theta) = \mathcal{N}\left( x; \mu, L L^\top + \sigma^2 I \right),

where the parameters are denoted \theta = (\mu, L, \sigma^2). From here it is clear that the maximum marginal likelihood estimator of \mu is available directly as \hat{\mu} = \frac{1}{n} \sum_{i=1}^n X_i, and thus we do not require EM to estimate \mu. We will henceforth assume that the data are centred, to simplify notation and remove \mu from the parameters. On the other hand, the maximum marginal likelihood solution for L is unique only up to orthogonal transformations, which is why a certain form of L is usually enforced (e.g. lower-triangular, or orthogonal columns). [3] shows that the MLE for PPCA has the following form. Let \lambda_1 \geq \dots \geq \lambda_p be the eigenvalues of the sample covariance and V_{1:k} \in \mathbb{R}^{p \times k} the top k eigenvectors as before, and let Q \in \mathbb{R}^{k \times k} be any orthogonal matrix. Then we have:

    \mu_{MLE} = \bar{x},
    (\sigma^2)_{MLE} = \frac{1}{p - k} \sum_{j=k+1}^{p} \lambda_j,
    L_{MLE} = V_{1:k} \, \mathrm{diag}\left( (\lambda_1 - (\sigma^2)_{MLE})^{1/2}, \dots, (\lambda_k - (\sigma^2)_{MLE})^{1/2} \right) Q.

We note that standard PCA is recovered when \sigma^2 \to 0. However, the EM algorithm we derive below can be faster than an eigendecomposition, can be implemented online, can handle missing data, and can be extended to more complicated models. We now proceed by deriving the EM algorithm.

E-step. By Gaussian conditioning (exercise), q(y_i) = p(y_i \mid x_i, \theta) = \mathcal{N}(y_i; b_i, R), where

    b_i = (L^\top L + \sigma^2 I)^{-1} L^\top x_i,    (3.19)
    R = \sigma^2 (L^\top L + \sigma^2 I)^{-1}.    (3.20)
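
The E-step (3.19)-(3.20) amounts to a single linear solve shared across data items, since R does not depend on i. A minimal sketch for centred data; the helper name and the dimensions and parameter values in the check are arbitrary illustrative choices.

```python
import numpy as np

def ppca_e_step(x, L, sigma2):
    """E-step for PPCA: posterior means b_i (3.19) and shared covariance R (3.20).

    x: (n, p) centred data, L: (p, k) loading matrix, sigma2: noise variance.
    Returns B of shape (n, k) with rows b_i, and R of shape (k, k)."""
    k = L.shape[1]
    M = L.T @ L + sigma2 * np.eye(k)          # (L^T L + sigma^2 I), shape (k, k)
    B = np.linalg.solve(M, L.T @ x.T).T       # b_i = M^{-1} L^T x_i, stacked as rows
    R = sigma2 * np.linalg.inv(M)
    return B, R

# Quick check on arbitrary centred data
rng = np.random.default_rng(0)
x = rng.standard_normal((10, 5))
x -= x.mean(axis=0)
L = rng.standard_normal((5, 2))
B, R = ppca_e_step(x, L, sigma2=0.5)
print(B.shape, R.shape)   # (10, 2) (2, 2)
```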

M-step. Recall that the parameters of interest are \theta = (L, \sigma^2), since the marginal maximum likelihood estimate of the mean parameter \mu is directly available. We would like to maximise the variational free energy, given by

    F(\theta, q) = \sum_{i=1}^n \mathbb{E}_{y_i \sim q} \left[ \log p(x_i, y_i \mid \theta) \right] + \text{const}.

Ignoring terms that do not depend on \theta, and writing =_c to mean "equal up to a constant independent of \theta",

    \log p(x_i, y_i \mid \theta) =_c -\frac{p}{2} \log \sigma^2 - \frac{1}{2\sigma^2} (x_i - L y_i)^\top (x_i - L y_i)
                                 =_c -\frac{p}{2} \log \sigma^2 - \frac{1}{2\sigma^2} \left\{ x_i^\top x_i - 2 x_i^\top L y_i + y_i^\top L^\top L y_i \right\}
                                 =_c -\frac{p}{2} \log \sigma^2 - \frac{1}{2\sigma^2} \left\{ x_i^\top x_i - 2 x_i^\top L y_i + \mathrm{Tr}\left[ L^\top L \, y_i y_i^\top \right] \right\}.

Taking the expectation over q(y_i) = \mathcal{N}(y_i; b_i, R) gives

    \mathbb{E}_{y_i \sim q} \left[ \log p(x_i, y_i \mid \theta) \right] =_c -\frac{p}{2} \log \sigma^2 - \frac{1}{2\sigma^2} \left\{ x_i^\top x_i - 2 x_i^\top L b_i + \mathrm{Tr}\left[ L^\top L \left( b_i b_i^\top + R \right) \right] \right\}.

It remains to sum over all observations to get

    F(\theta, q) =_c -\frac{np}{2} \log \sigma^2 - \frac{1}{2\sigma^2} \left\{ \sum_{i=1}^n x_i^\top x_i - 2 \sum_{i=1}^n x_i^\top L b_i + \mathrm{Tr}\left[ L^\top L \left( \sum_{i=1}^n b_i b_i^\top + n R \right) \right] \right\}.

Now, we have

    \frac{\partial F}{\partial L} = \frac{1}{\sigma^2} \left\{ \sum_{i=1}^n x_i b_i^\top - L \left( \sum_{i=1}^n b_i b_i^\top + n R \right) \right\},

which, by setting to 0, gives the update rule

    L^{(\text{new})} = \left( \sum_{i=1}^n x_i b_i^\top \right) \left( \sum_{i=1}^n b_i b_i^\top + n R \right)^{-1}.    (3.21)

Letting \tau = \sigma^{-2}, we have

    \frac{\partial F}{\partial \tau} = \frac{np}{2\tau} - \frac{1}{2} \left\{ \sum_{i=1}^n x_i^\top x_i - 2 \sum_{i=1}^n x_i^\top L b_i + \mathrm{Tr}\left[ L^\top L \left( \sum_{i=1}^n b_i b_i^\top + n R \right) \right] \right\},

and thus

    (\sigma^2)^{(\text{new})} = \frac{1}{np} \left\{ \sum_{i=1}^n x_i^\top x_i - 2 \sum_{i=1}^n x_i^\top L^{(\text{new})} b_i + \mathrm{Tr}\left[ L^{(\text{new})\top} L^{(\text{new})} \left( \sum_{i=1}^n b_i b_i^\top + n R \right) \right] \right\}.    (3.22)

Both Probabilistic PCA and normal mixtures are examples of linear Gaussian models, all of which have corresponding learning algorithms based on EM. For a unifying review of these and a number of other models from the same family, including factor analysis and hidden Markov models, cf. [2].
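
Putting the E-step (3.19)-(3.20) and the M-step (3.21)-(3.22) together gives the full EM iteration for PPCA. Below is a minimal sketch on synthetic centred data; the latent dimension, initialisation, and number of iterations are arbitrary illustrative choices, and the marginal log-likelihood under N(0, L L^T + \sigma^2 I) is tracked only to check that it does not decrease.

```python
import numpy as np
from scipy.stats import multivariate_normal

def ppca_em(x, k, n_iter=50, seed=0):
    """EM for Probabilistic PCA on centred data x of shape (n, p)."""
    rng = np.random.default_rng(seed)
    n, p = x.shape
    L = rng.standard_normal((p, k))               # arbitrary initialisation
    sigma2 = 1.0
    xx = np.sum(x * x)                            # sum_i x_i^T x_i
    lls = []
    for _ in range(n_iter):
        # E-step: b_i (3.19) and R (3.20)
        M = L.T @ L + sigma2 * np.eye(k)
        B = np.linalg.solve(M, L.T @ x.T).T       # (n, k), rows b_i
        R = sigma2 * np.linalg.inv(M)
        S = B.T @ B + n * R                       # sum_i b_i b_i^T + n R
        # M-step: L (3.21) and sigma^2 (3.22)
        L = (x.T @ B) @ np.linalg.inv(S)          # (sum_i x_i b_i^T) S^{-1}
        sigma2 = (xx - 2 * np.sum((x @ L) * B) + np.trace(L.T @ L @ S)) / (n * p)
        # Marginal log-likelihood under N(0, L L^T + sigma^2 I), for monitoring
        lls.append(multivariate_normal.logpdf(x, mean=np.zeros(p),
                                              cov=L @ L.T + sigma2 * np.eye(p)).sum())
    return L, sigma2, lls

# Synthetic example: data generated from the PPCA model itself
rng = np.random.default_rng(1)
n, p, k = 200, 5, 2
L_true = rng.standard_normal((p, k))
x = rng.standard_normal((n, k)) @ L_true.T + 0.3 * rng.standard_normal((n, p))
x -= x.mean(axis=0)
L_hat, sigma2_hat, lls = ppca_em(x, k)
print(sigma2_hat, lls[0], lls[-1])                # log-likelihood should be non-decreasing
```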

References

[1] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1-38, 1977.

[2] Sam Roweis and Zoubin Ghahramani. A unifying review of linear Gaussian models. Neural Computation, 11(2):305-345, February 1999.

[3] M. E. Tipping and Christopher Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B, 61(3):611-622, 1999.