Gaussian Mixture Models David Rosenberg, Brett Bernstein New York University April 26, 2017 David Rosenberg, Brett Bernstein (New York University) DS-GA 1003 April 26, 2017 1 / 42
Intro Question
Intro Question
Suppose we begin with a dataset $D = \{x_1,\dots,x_n\} \subset \mathbb{R}^2$ and we run k-means (or k-means++) to obtain $k$ cluster centers. The cluster centers are drawn below. If we are given a new $x \in \mathbb{R}^2$, we can assign it a label based on which cluster center is closest. What regions of the plane correspond to each possible labeling?
[Figure: the cluster centers plotted in the plane.]
Intro Solution
Note that the cells are disjoint (except for the borders) and convex. This can be thought of as a limitation of k-means: neither will be true for GMMs.
[Figure: the plane divided into one cell per cluster center.]
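The labeling rule can be sketched in a few lines of Python. The centers below are hypothetical (they are not the ones from the figure), and `nearest_center` is an illustrative name. Each cell is the set of points closer to one center than to every other center, which is an intersection of half-planes, and that is why the cells come out convex.

```python
import math

def nearest_center(x, centers):
    """Return the index of the cluster center closest to x (Euclidean distance)."""
    distances = [math.dist(x, c) for c in centers]
    return distances.index(min(distances))

# Hypothetical cluster centers in the plane, for illustration only.
centers = [(0.2, 0.2), (0.8, 0.3), (0.5, 0.9)]

# Label a new point by its closest center.
label = nearest_center((0.7, 0.4), centers)
```

Ties on a border are broken in favor of the lowest index here; that choice is arbitrary.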
Gaussian Mixture Models
Yesterday's Intro Question
Consider the following probability model for generating data.
1. Roll a weighted $k$-sided die to choose a label $z \in \{1,\dots,k\}$. Let $\pi$ denote the PMF for the die.
2. Draw $x \in \mathbb{R}^d$ randomly from the multivariate normal distribution $\mathcal{N}(\mu_z, \Sigma_z)$.
Solve the following questions.
1. What is the joint distribution of $x, z$ given $\pi$ and the $\mu_z, \Sigma_z$ values?
2. Suppose you were given the dataset $D = \{(x_1,z_1),\dots,(x_n,z_n)\}$. How would you estimate the die weightings and the $\mu_z, \Sigma_z$ values?
3. How would you determine the label for a new datapoint $x$?
Yesterday's Intro Solution
1. The joint PDF/PMF is given by $p(x,z) = \pi(z)\, f(x;\mu_z,\Sigma_z)$, where
$$f(x;\mu_z,\Sigma_z) = \frac{1}{\sqrt{|2\pi\Sigma_z|}} \exp\left(-\frac{1}{2}(x-\mu_z)^T \Sigma_z^{-1} (x-\mu_z)\right).$$
2. We could use maximum likelihood estimation. Our estimates are
$$n_z = \sum_{i=1}^n 1(z_i = z), \qquad \hat\pi(z) = \frac{n_z}{n}, \qquad \hat\mu_z = \frac{1}{n_z}\sum_{i:z_i=z} x_i, \qquad \hat\Sigma_z = \frac{1}{n_z}\sum_{i:z_i=z} (x_i-\hat\mu_z)(x_i-\hat\mu_z)^T.$$
3. $\arg\max_z p(x,z)$
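As a rough illustration of the labeled-data estimates, here is a minimal sketch for the one-dimensional case, where the covariance matrix reduces to a scalar variance. The function name `fit_labeled_gaussians` and the toy data are assumptions made for this example.

```python
from collections import defaultdict

def fit_labeled_gaussians(xs, zs):
    """MLE for a 1-D Gaussian mixture when the labels z_i are observed.

    Returns three dicts keyed by label: mixing weights, means, variances.
    """
    groups = defaultdict(list)
    for x, z in zip(xs, zs):
        groups[z].append(x)
    n = len(xs)
    pi, mu, var = {}, {}, {}
    for z, points in groups.items():
        n_z = len(points)
        pi[z] = n_z / n
        mu[z] = sum(points) / n_z
        # The MLE divides by n_z, not n_z - 1.
        var[z] = sum((x - mu[z]) ** 2 for x in points) / n_z
    return pi, mu, var

# Toy labeled dataset (hypothetical values).
xs = [0.0, 2.0, 10.0, 12.0, 14.0]
zs = [1, 1, 2, 2, 2]
pi, mu, var = fit_labeled_gaussians(xs, zs)
```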
Probabilistic Model for Clustering
Let's consider a generative model for the data. Suppose
1. There are $k$ clusters.
2. We have a probability density for each cluster.
Generate a point as follows:
1. Choose a random cluster $z \in \{1,2,\dots,k\}$.
2. Choose a point from the distribution for cluster $z$.
The clustering algorithm is then:
1. Use training data to fit the parameters of the generative model.
2. For each point, choose the cluster with the highest likelihood based on the model.
Gaussian Mixture Model ($k = 3$)
1. Choose $z \in \{1,2,3\}$.
2. Choose $x \mid z \sim \mathcal{N}(\mu_z, \Sigma_z)$.
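The two-step generative process might be sketched as follows for a one-dimensional mixture with $k = 3$. The parameter values, the name `sample_gmm`, and the use of standard deviations rather than covariance matrices are all assumptions made to keep the example small.

```python
import random

def sample_gmm(n, pi, mu, std, seed=0):
    """Draw n pairs (x, z) from a 1-D Gaussian mixture.

    pi: mixing weights, mu: component means, std: component standard deviations.
    """
    rng = random.Random(seed)
    components = range(len(pi))
    samples = []
    for _ in range(n):
        # Step 1: roll the weighted die to pick a component.
        z = rng.choices(components, weights=pi)[0]
        # Step 2: draw x from that component's Gaussian.
        x = rng.gauss(mu[z], std[z])
        samples.append((x, z))
    return samples

# Hypothetical parameters for three components.
samples = sample_gmm(1000, pi=[0.5, 0.3, 0.2], mu=[-4.0, 0.0, 5.0], std=[1.0, 0.5, 2.0])
```

In a clustering setting only the `x` values would be kept; the `z` values are the latent labels.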
Gaussian Mixture Model Parameters ($k$ Components)
Cluster probabilities: $\pi = (\pi_1,\dots,\pi_k)$
Cluster means: $\mu = (\mu_1,\dots,\mu_k)$
Cluster covariance matrices: $\Sigma = (\Sigma_1,\dots,\Sigma_k)$
What if one cluster had many more points than another cluster?
Gaussian Mixture Model: Joint Distribution
Factorize the joint distribution:
$$p(x,z) = p(z)\,p(x \mid z) = \pi_z\, \mathcal{N}(x \mid \mu_z, \Sigma_z)$$
$\pi_z$ is the probability of choosing cluster $z$.
$x \mid z$ has distribution $\mathcal{N}(\mu_z,\Sigma_z)$.
The $z$ corresponding to $x$ is the true cluster assignment.
Suppose we know all the parameters of the model. Then we can easily compute the joint $p(x,z)$ and the conditional $p(z \mid x)$.
Latent Variable Model
We observe $x$. In the intro problem we had labeled data, but here we don't observe $z$, the cluster assignment.
Cluster assignment $z$ is called a hidden variable or latent variable.
Definition: A latent variable model is a probability model for which certain variables are never observed.
e.g. The Gaussian mixture model is a latent variable model.
The GMM Inference Problem
We observe $x$. We want to know $z$.
The conditional distribution of the cluster $z$ given $x$ is $p(z \mid x) = p(x,z)/p(x)$.
The conditional distribution is a soft assignment to clusters.
A hard assignment is $z^* = \arg\max_{z \in \{1,\dots,k\}} p(z \mid x)$.
So if we have the model, clustering is trivial.
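Assuming the parameters are known, the soft and hard assignments could be computed as in this one-dimensional sketch. The helper names and parameter values are hypothetical.

```python
import math

def normal_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def posterior(x, pi, mu, var):
    """Soft assignment: p(z | x) for every component z."""
    joint = [p * normal_pdf(x, m, v) for p, m, v in zip(pi, mu, var)]
    p_x = sum(joint)  # marginal p(x)
    return [j / p_x for j in joint]

def hard_assignment(x, pi, mu, var):
    """Hard assignment: index of the component maximizing p(z | x)."""
    post = posterior(x, pi, mu, var)
    return max(range(len(post)), key=post.__getitem__)

# Hypothetical two-component model with equal weights and variances.
pi, mu, var = [0.5, 0.5], [0.0, 4.0], [1.0, 1.0]
soft = posterior(2.0, pi, mu, var)  # x = 2.0 is equidistant from both means
```

For very distant points the densities can underflow to zero; a log-space version would be more robust.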
Mixture Models
Gaussian Mixture Model: Marginal Distribution
The marginal distribution for a single observation $x$ is
$$p(x) = \sum_{z=1}^k p(x,z) = \sum_{z=1}^k \pi_z\,\mathcal{N}(x \mid \mu_z,\Sigma_z).$$
Note that $p(x)$ is a convex combination of probability densities.
This is a common form for a probability model...
Mixture Distributions (or Mixture Models)
Definition: A probability density $p(x)$ represents a mixture distribution or mixture model if we can write it as a convex combination of probability densities. That is,
$$p(x) = \sum_{i=1}^k w_i\, p_i(x),$$
where $w_i \ge 0$, $\sum_{i=1}^k w_i = 1$, and each $p_i$ is a probability density.
In our Gaussian mixture model, $x$ has a mixture distribution.
More constructively, let $S$ be a set of probability distributions:
1. Choose a distribution randomly from $S$.
2. Sample $x$ from the chosen distribution.
Then $x$ has a mixture distribution.
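One quick sanity check on the definition: a convex combination of densities should itself integrate to 1. The sketch below checks this numerically for a hypothetical three-component Gaussian mixture using a plain Riemann sum; the weights, means, variances, and grid are all illustrative choices.

```python
import math

def normal_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def mixture_pdf(x, weights, mus, variances):
    """Convex combination of Gaussian densities evaluated at x."""
    return sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, mus, variances))

# Hypothetical mixture; the weights are nonnegative and sum to 1.
weights, mus, variances = [0.2, 0.5, 0.3], [-3.0, 0.0, 4.0], [1.0, 0.25, 2.0]

# Riemann sum over [-20, 20], which holds essentially all of the mass.
step = 0.001
grid = [-20.0 + i * step for i in range(40001)]
total = sum(mixture_pdf(x, weights, mus, variances) for x in grid) * step
```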
Learning in Gaussian Mixture Models
The GMM Learning Problem
Given data $x_1,\dots,x_n$ drawn from a GMM, estimate the parameters:
Cluster probabilities: $\pi = (\pi_1,\dots,\pi_k)$
Cluster means: $\mu = (\mu_1,\dots,\mu_k)$
Cluster covariance matrices: $\Sigma = (\Sigma_1,\dots,\Sigma_k)$
Once we have the parameters, we're done. Just do inference to get cluster assignments.
Estimating/Learning the Gaussian Mixture Model
One approach to learning is maximum likelihood: find parameter values that give the observed data the highest likelihood.
The model likelihood for $D = \{x_1,\dots,x_n\}$ is
$$L(\pi,\mu,\Sigma) = \prod_{i=1}^n p(x_i) = \prod_{i=1}^n \sum_{z=1}^k \pi_z\, \mathcal{N}(x_i \mid \mu_z,\Sigma_z).$$
As usual, we'll take our objective function to be the log of this:
$$J(\pi,\mu,\Sigma) = \sum_{i=1}^n \log\left\{\sum_{z=1}^k \pi_z\, \mathcal{N}(x_i \mid \mu_z,\Sigma_z)\right\}$$
Properties of the GMM Log-Likelihood
GMM log-likelihood:
$$J(\pi,\mu,\Sigma) = \sum_{i=1}^n \log\left\{\sum_{z=1}^k \pi_z\, \mathcal{N}(x_i \mid \mu_z,\Sigma_z)\right\}$$
Let's compare to the log-likelihood for a single Gaussian:
$$\sum_{i=1}^n \log \mathcal{N}(x_i \mid \mu,\Sigma) = -\frac{nd}{2}\log(2\pi) - \frac{n}{2}\log|\Sigma| - \frac{1}{2}\sum_{i=1}^n (x_i-\mu)^T \Sigma^{-1}(x_i-\mu)$$
For a single Gaussian, the log cancels the exp in the Gaussian density. $\implies$ Things simplify a lot.
For the GMM, the sum inside the log prevents this cancellation. $\implies$ Expression more complicated. No closed-form expression for the MLE.
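Even without a closed-form MLE, the objective $J$ is easy to evaluate. The sketch below does so for a one-dimensional mixture, using the log-sum-exp trick for the inner sum so that tiny densities do not underflow before the log is taken. The trick is a numerical-stability choice made for this example, not something the slides prescribe, and the function names are hypothetical.

```python
import math

def log_normal_pdf(x, mu, var):
    """Log density of a 1-D Gaussian with mean mu and variance var."""
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * (x - mu) ** 2 / var

def gmm_log_likelihood(xs, pi, mu, var):
    """J(pi, mu, var) = sum_i log sum_z pi_z N(x_i | mu_z, var_z).

    Assumes every mixing weight is strictly positive.
    """
    total = 0.0
    for x in xs:
        # log(pi_z) + log N(x | mu_z, var_z) for each component z.
        terms = [math.log(p) + log_normal_pdf(x, m, v) for p, m, v in zip(pi, mu, var)]
        # log-sum-exp: factor out the largest term before exponentiating.
        top = max(terms)
        total += top + math.log(sum(math.exp(t - top) for t in terms))
    return total

xs = [0.3, -1.2, 2.5]  # hypothetical observations
value = gmm_log_likelihood(xs, [0.5, 0.5], [0.0, 2.0], [1.0, 1.0])
```

With a single component ($k = 1$) the inner sum has one term and the function reduces to the single-Gaussian log-likelihood.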
Issues with MLE for GMM
Identifiability Issues for GMM
Suppose we have found parameters
Cluster probabilities: $\pi = (\pi_1,\dots,\pi_k)$
Cluster means: $\mu = (\mu_1,\dots,\mu_k)$
Cluster covariance matrices: $\Sigma = (\Sigma_1,\dots,\Sigma_k)$
that are at a local optimum of the likelihood.
What happens if we shuffle the clusters? e.g. Switch the labels for clusters 1 and 2.
We'll get the same likelihood. How many such equivalent settings are there?
Assuming all clusters are distinct, there are $k!$ equivalent solutions.
Not a problem per se, but something to be aware of.
Singularities for GMM
Consider the following GMM for 7 data points:
[Figure: from Bishop's Pattern Recognition and Machine Learning, Figure 9.7.]
Let $\sigma^2$ be the variance of the skinny component.
What happens to the likelihood as $\sigma^2 \to 0$?
In practice, we end up in local optima that do not have this problem.
Or keep restarting optimization until we do.
A Bayesian approach or regularization will also solve the problem.
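The question about $\sigma^2 \to 0$ can be explored numerically. In the sketch below, which uses seven hypothetical one-dimensional points rather than the data in Bishop's figure, one "skinny" component is centered exactly on a data point while a broad component covers the rest. As the skinny variance shrinks, the density at that one point grows without bound while the other points keep positive density from the broad component, so the log-likelihood keeps increasing.

```python
import math

def normal_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def log_likelihood(xs, pi, mu, var):
    """GMM log-likelihood for 1-D data."""
    return sum(
        math.log(sum(p * normal_pdf(x, m, v) for p, m, v in zip(pi, mu, var)))
        for x in xs
    )

# Seven hypothetical data points; the skinny component sits on x = 0.5.
xs = [-2.0, -1.0, 0.0, 0.5, 1.0, 2.0, 3.0]
skinny_variances = [1e-2, 1e-4, 1e-6, 1e-8]
values = [
    log_likelihood(xs, pi=[0.5, 0.5], mu=[0.5, 0.0], var=[s2, 4.0])
    for s2 in skinny_variances
]
```

A single Gaussian does not behave this way: shrinking its variance drives the density of every other data point toward zero, which pulls the likelihood down.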
Gradient Descent / SGD for GMM
What about running gradient descent or SGD on
$$J(\pi,\mu,\Sigma) = \sum_{i=1}^n \log\left\{\sum_{z=1}^k \pi_z\, \mathcal{N}(x_i \mid \mu_z,\Sigma_z)\right\}?$$
Can be done, but we need to be clever about it.
Each matrix $\Sigma_1,\dots,\Sigma_k$ has to be positive semidefinite.
How to maintain that constraint?
Rewrite $\Sigma_i = M_i M_i^T$, where $M_i$ is an unconstrained matrix.
Then $\Sigma_i$ is positive semidefinite.
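A small sketch of the reparameterization: whatever values an optimizer puts in the unconstrained matrix $M$, the product $M M^T$ is symmetric and satisfies $v^T M M^T v = \|M^T v\|^2 \ge 0$. The helpers `gram` and `quad_form` and the matrix entries are hypothetical, and plain lists are used to avoid any dependency.

```python
def gram(M):
    """Return M M^T for a square matrix M given as a list of rows."""
    n = len(M)
    return [[sum(M[i][k] * M[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]

def quad_form(S, v):
    """Return v^T S v."""
    n = len(v)
    return sum(v[i] * S[i][j] * v[j] for i in range(n) for j in range(n))

# An arbitrary unconstrained matrix, e.g. the current iterate of gradient descent.
M = [[1.0, -2.0],
     [0.5, 3.0]]
Sigma = gram(M)

# The quadratic form is nonnegative for every direction tried.
checks = [quad_form(Sigma, v) for v in [(1, 0), (0, 1), (1, 1), (-1, 2), (3, -0.5)]]
```

Note that this only guarantees positive semidefiniteness: if $M$ is singular then so is $\Sigma$, and the Gaussian density is then not defined.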
The EM Algorithm for GMM
MLE for GMM
From yesterday's intro questions, we know that we can solve the MLE problem if the cluster assignments $z_i$ are known:
$$n_z = \sum_{i=1}^n 1(z_i = z), \qquad \hat\pi(z) = \frac{n_z}{n}, \qquad \hat\mu_z = \frac{1}{n_z}\sum_{i:z_i=z} x_i, \qquad \hat\Sigma_z = \frac{1}{n_z}\sum_{i:z_i=z} (x_i-\hat\mu_z)(x_i-\hat\mu_z)^T.$$
In the EM algorithm we will modify the equations to handle our evolving soft assignments, which we will call responsibilities.
Cluster Responsibilities: Some New Notation
Denote the probability that observed value $x_i$ comes from cluster $j$ by $\gamma_i^j = P(Z = j \mid X = x_i)$.
This is the responsibility that cluster $j$ takes for observation $x_i$.
Computationally,
$$\gamma_i^j = P(Z = j \mid X = x_i) = \frac{p(Z = j, X = x_i)}{p(x_i)} = \frac{\pi_j\, \mathcal{N}(x_i \mid \mu_j,\Sigma_j)}{\sum_{c=1}^k \pi_c\, \mathcal{N}(x_i \mid \mu_c,\Sigma_c)}$$
The vector $(\gamma_i^1,\dots,\gamma_i^k)$ is exactly the soft assignment for $x_i$.
Let $n_c = \sum_{i=1}^n \gamma_i^c$ be the number of points soft assigned to cluster $c$.
EM Algorithm for GMM: Overview
If we know $\pi$ and $\mu_j, \Sigma_j$ for all $j$, then we can easily find $\gamma_i^j = P(Z = j \mid X = x_i)$.
If we know the (soft) assignments, we can easily find estimates for $\pi$ and $\mu_j, \Sigma_j$ for all $j$.
Repeatedly alternate the previous 2 steps.
EM Algorithm for GMM: Overview
1. Initialize parameters $\mu$, $\Sigma$, $\pi$.
2. E step. Evaluate the responsibilities using current parameters:
$$\gamma_i^j = \frac{\pi_j\, \mathcal{N}(x_i \mid \mu_j,\Sigma_j)}{\sum_{c=1}^k \pi_c\, \mathcal{N}(x_i \mid \mu_c,\Sigma_c)},$$
for $i = 1,\dots,n$ and $j = 1,\dots,k$.
3. M step. Re-estimate the parameters using responsibilities. [Compare with intro question.]
$$\mu_c^{\text{new}} = \frac{1}{n_c}\sum_{i=1}^n \gamma_i^c x_i, \qquad \Sigma_c^{\text{new}} = \frac{1}{n_c}\sum_{i=1}^n \gamma_i^c (x_i - \mu_c^{\text{new}})(x_i - \mu_c^{\text{new}})^T, \qquad \pi_c^{\text{new}} = \frac{n_c}{n}$$
4. Repeat from Step 2, until log-likelihood converges.
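The four steps might look like this for a one-dimensional mixture, where each covariance matrix is a scalar variance. This is a minimal sketch under that assumption: the data, the initial parameters, and the fixed count of 20 iterations (instead of a convergence test on the log-likelihood) are illustrative choices, and there is no guard against a variance collapsing to zero.

```python
import math

def normal_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def em_step(xs, pi, mu, var):
    """One EM iteration for a 1-D GMM; returns updated (pi, mu, var)."""
    k, n = len(pi), len(xs)
    # E step: gamma[i][c] is the responsibility of cluster c for x_i.
    gamma = []
    for x in xs:
        joint = [pi[c] * normal_pdf(x, mu[c], var[c]) for c in range(k)]
        total = sum(joint)
        gamma.append([j / total for j in joint])
    # M step: responsibility-weighted versions of the labeled-data MLE.
    n_c = [sum(g[c] for g in gamma) for c in range(k)]
    new_mu = [sum(g[c] * x for g, x in zip(gamma, xs)) / n_c[c] for c in range(k)]
    new_var = [sum(g[c] * (x - new_mu[c]) ** 2 for g, x in zip(gamma, xs)) / n_c[c]
               for c in range(k)]
    new_pi = [n_c[c] / n for c in range(k)]
    return new_pi, new_mu, new_var

def log_likelihood(xs, pi, mu, var):
    """GMM log-likelihood, tracked to monitor progress."""
    return sum(
        math.log(sum(p * normal_pdf(x, m, v) for p, m, v in zip(pi, mu, var)))
        for x in xs
    )

# Hypothetical data with two well-separated groups, and a rough initialization.
xs = [-0.3, 0.1, 0.4, -0.1, 5.2, 4.8, 5.1, 4.9]
pi, mu, var = [0.5, 0.5], [1.0, 3.0], [1.0, 1.0]
history = [log_likelihood(xs, pi, mu, var)]
for _ in range(20):
    pi, mu, var = em_step(xs, pi, mu, var)
    history.append(log_likelihood(xs, pi, mu, var))
```

The `history` list records the log-likelihood after each iteration, which is the quantity Step 4 would monitor.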
EM for GMM
[Figures from Bishop's Pattern Recognition and Machine Learning, Figure 9.8, one per slide: Initialization; First soft assignment; First soft assignment; After 5 rounds of EM; After 20 rounds of EM.]
Relation to k-Means
EM for GMM seems a little like k-means.
In fact, there is a precise correspondence.
First, fix each cluster covariance matrix to be $\sigma^2 I$.
Then the density for each Gaussian only depends on distance to the mean.
As we take $\sigma^2 \to 0$, the update equations converge to doing k-means.
If you do a quick experiment yourself, you'll find
Soft assignments converge to hard assignments.
Has to do with the tail behavior (exponential decay) of the Gaussian.
Can use k-means++ to initialize parameters of the EM algorithm.
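The quick experiment could look like the following sketch. It assumes one-dimensional data, equal mixing weights, and a shared variance $\sigma^2$, so the Gaussian normalizing constants cancel and each responsibility depends only on squared distance to the means. The means, the query point, and the function name are hypothetical.

```python
import math

def responsibilities(x, means, sigma2):
    """Responsibilities under equal weights and shared variance sigma2.

    Computed as a softmax of -(x - mean)^2 / (2 * sigma2), shifted by the
    maximum so the exponentials do not underflow for tiny sigma2.
    """
    logits = [-(x - m) ** 2 / (2.0 * sigma2) for m in means]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

means = [0.0, 1.0]
x = 0.4  # slightly closer to the first mean

# Shrink sigma2 and watch the soft assignment harden.
sweep = {s2: responsibilities(x, means, s2) for s2 in [1.0, 0.1, 0.01, 0.001]}
```

At $\sigma^2 = 1$ the point is shared almost evenly between the two clusters; by $\sigma^2 = 0.001$ essentially all responsibility goes to the nearer mean, which is the k-means assignment rule.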
Math Prerequisites for General EM Algorithm
Jensen's Inequality
Which is larger: $E[X^2]$ or $E[X]^2$?
Must be $E[X^2]$, since $\mathrm{Var}[X] = E[X^2] - E[X]^2 \ge 0$.
A more general result is true:
Theorem (Jensen's Inequality): If $f : \mathbb{R} \to \mathbb{R}$ is convex and $X$ is a random variable, then $E[f(X)] \ge f(E[X])$. If $f$ is strictly convex, then we have equality iff $X = E[X]$ with probability 1 (i.e., $X$ is constant).
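A numeric check of the inequality on a small discrete distribution, as a sketch. The values and probabilities are made up for illustration; the convex functions tried are $f(v) = v^2$ and $f(v) = e^v$.

```python
import math

def expectation(values, probs, f=None):
    """E[f(X)] for a discrete X; f defaults to the identity."""
    if f is None:
        return sum(p * v for v, p in zip(values, probs))
    return sum(p * f(v) for v, p in zip(values, probs))

def square(v):
    return v ** 2

# Hypothetical discrete distribution.
values = [-1.0, 0.0, 2.0, 5.0]
probs = [0.1, 0.4, 0.3, 0.2]

mean = expectation(values, probs)           # E[X]
lhs = expectation(values, probs, square)    # E[X^2]
rhs = square(mean)                          # (E[X])^2
gap = lhs - rhs                             # equals Var[X] for f(v) = v^2

# Same check with another convex function.
exp_gap = expectation(values, probs, math.exp) - math.exp(mean)
```

Both gaps are strictly positive here because $X$ is not constant and both functions are strictly convex.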
Proof of Jensen
Exercise: Suppose $X$ can take exactly two values: $x_1$ with probability $\pi_1$ and $x_2$ with probability $\pi_2$. Then prove Jensen's inequality.
Let's compute $E[f(X)]$:
$$E[f(X)] = \pi_1 f(x_1) + \pi_2 f(x_2) \ge f(\pi_1 x_1 + \pi_2 x_2) = f(E[X]),$$
where the inequality is the definition of convexity, since $\pi_1 + \pi_2 = 1$.
For the general proof, what do we know is true about all convex functions $f : \mathbb{R} \to \mathbb{R}$?
Proof of Jensen
1. Let $e = E[X]$. (Remember $e$ is just a number.)
2. Since $f$ has a subgradient at $e$, there is an underestimating line $g(x) = ax + b$ that passes through the point $(e, f(e))$.
3. Then we have
$$E[f(X)] \ge E[g(X)] = E[aX + b] = aE[X] + b = ae + b = f(e) = f(E[X]).$$
4. If $f$ is strictly convex then $f = g$ at exactly 1 point, so we have equality iff $X$ is constant.
KL-Divergence
Let $p(x)$ and $q(x)$ be probability mass functions (PMFs) on $\mathcal{X}$.
We want to measure how different they are.
The Kullback-Leibler or KL divergence is defined by
$$\mathrm{KL}(p \| q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}.$$
(Assumes absolute continuity: $q(x) = 0$ implies $p(x) = 0$.)
Can also write
$$\mathrm{KL}(p \| q) = E_{x \sim p} \log \frac{p(x)}{q(x)}.$$
Note that the KL divergence is not symmetric and doesn't satisfy the triangle inequality.
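A direct translation of the definition for PMFs given as lists, as a sketch. Two conventions here are assumptions of this example rather than part of the slide: terms with $p(x) = 0$ contribute zero, and the function returns infinity when absolute continuity fails. The distributions `p` and `q` are made up.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two PMFs over the same finite set, given as lists."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0.0:
            continue  # convention: 0 * log(0 / q) = 0
        if qx == 0.0:
            return math.inf  # absolute continuity violated (assumed convention)
        total += px * math.log(px / qx)
    return total

# Hypothetical distributions on a three-element set.
p = [0.5, 0.4, 0.1]
q = [0.2, 0.3, 0.5]

forward = kl_divergence(p, q)  # KL(p || q)
reverse = kl_divergence(q, p)  # KL(q || p), generally a different number
```

Comparing `forward` and `reverse` shows the asymmetry noted above.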
Gibbs' Inequality
Theorem (Gibbs' Inequality): Let $p(x)$ and $q(x)$ be PMFs on $\mathcal{X}$. Then $\mathrm{KL}(p \| q) \ge 0$, with equality iff $p(x) = q(x)$ for all $x \in \mathcal{X}$.
Since
$$\mathrm{KL}(p \| q) = E_p\left[-\log\left(\frac{q(x)}{p(x)}\right)\right],$$
this is screaming for Jensen's inequality.
Gibbs' Inequality: Proof
$$\mathrm{KL}(p \| q) = E_p\left[-\log\left(\frac{q(x)}{p(x)}\right)\right] \ge -\log\left(E_p\left[\frac{q(x)}{p(x)}\right]\right) = -\log\left(\sum_{x : p(x) > 0} p(x)\,\frac{q(x)}{p(x)}\right) = -\log\left(\sum_{x : p(x) > 0} q(x)\right) \ge -\log 1 = 0.$$
The last inequality holds because the sum of $q(x)$ over any subset of $\mathcal{X}$ is at most 1.
Since $-\log$ is strictly convex, we have equality iff $q/p$ is constant, i.e., $q = p$.