Clustering and Gaussian Mixture Models

1 Clustering and Gaussian Mixture Models
Piyush Rai, IIT Kanpur
Probabilistic Machine Learning (CS772A), Jan 25, 2016

2 Recap of last lecture

3 Clustering
Usually an unsupervised learning problem.
Given: N unlabeled examples $\{x_1, \ldots, x_N\}$; the number of partitions K.
Goal: Group the examples into K partitions.
Clustering groups examples based on their mutual similarities.
A good clustering is one that achieves:
  High within-cluster similarity
  Low inter-cluster similarity
Examples: K-means, Spectral Clustering, Gaussian Mixture Model, etc.
Picture courtesy: Data Clustering: 50 Years Beyond K-Means, A.K. Jain (2008)

4 Refresher: K-means Clustering
Input: N examples $\{x_1, \ldots, x_N\}$, $x_n \in \mathbb{R}^D$; the number of partitions K.
Initialize: K cluster means $\mu_1, \ldots, \mu_K$, $\mu_k \in \mathbb{R}^D$. The means are usually initialized randomly, but good initialization is crucial; many smarter initialization heuristics exist (e.g., K-means++, Arthur & Vassilvitskii, 2007).
Iterate (repeat while not converged):
  (Re)-assign each example $x_n$ to its closest cluster center:
  $$C_k = \{n : k = \arg\min_{k'} \|x_n - \mu_{k'}\|^2\}$$
  ($C_k$ is the set of examples assigned to cluster k with center $\mu_k$)
  Update the cluster means:
  $$\mu_k = \text{mean}(C_k) = \frac{1}{|C_k|} \sum_{n \in C_k} x_n$$
A possible convergence criterion: the cluster means do not change anymore.
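The two alternating steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the lecture's code: the function name `kmeans`, the choice of K random data points as initial means, the handling of empty clusters, and the `np.allclose` stopping test are all assumptions made for the example.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Plain K-means with random initialization (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Assumption: initialize the means with K distinct, randomly chosen examples.
    mu = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(n_iters):
        # Assignment step: squared distance of every point to every mean, shape (N, K).
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        z = d2.argmin(axis=1)
        # Update step: each mean becomes the average of its assigned points
        # (an empty cluster keeps its previous mean).
        new_mu = np.array([X[z == k].mean(axis=0) if np.any(z == k) else mu[k]
                           for k in range(K)])
        converged = np.allclose(new_mu, mu)
        mu = new_mu
        if converged:
            break
    # z holds the assignments computed in the last assignment step.
    return mu, z

# Two well-separated synthetic blobs (made-up data for the example).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])
mu, z = kmeans(X, K=2)
```

On data this well separated, the result should not depend much on the seed; on harder data, a smarter initialization such as K-means++ would be the natural replacement for the random choice.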

5 The K-means Objective Function
Notation: a size-K one-hot vector $z_n$ denotes the membership of $x_n$ to cluster k:
$$z_n = [0\ 0\ \ldots\ 1\ \ldots\ 0\ 0] \quad \text{(all zeros except the k-th bit)}$$
This is equivalent to just saying $z_n = k$.
The K-means objective can be written in terms of the total distortion:
$$J(\mu, Z) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \|x_n - \mu_k\|^2$$
Distortion: the loss suffered on assigning points $\{x_n\}_{n=1}^{N}$ to clusters $\{\mu_k\}_{k=1}^{K}$.
Goal: minimize the objective w.r.t. $\mu$ and Z.
Note: the objective is non-convex, and exact optimization is NP-hard.
The K-means algorithm is a heuristic: it alternates between minimizing J w.r.t. $\mu$ and w.r.t. Z, and converges to a local minimum.
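As a small worked example of the distortion, the sketch below evaluates J on four hand-picked points. The helper name `distortion` and the toy data are assumptions for illustration; integer labels stand in for the one-hot rows of Z.

```python
import numpy as np

def distortion(X, mu, z):
    """J(mu, Z): sum of squared distances from each point to its assigned mean."""
    # mu[z] picks, for every point, the mean of the cluster it is assigned to.
    return float(((X - mu[z]) ** 2).sum())

X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
mu = np.array([[0.0, 1.0], [10.0, 1.0]])

# Every point sits at squared distance 1 from its own mean.
J_good = distortion(X, mu, np.array([0, 0, 1, 1]))
# Sending the first point to the far cluster costs 10^2 + 1^2 = 101 instead of 1.
J_bad = distortion(X, mu, np.array([1, 0, 1, 1]))
print(J_good, J_bad)  # 4.0 104.0
```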

6 K-means: Some Limitations
Makes hard assignments of points to clusters:
  A point either totally belongs to a cluster or not at all.
  No notion of a soft/fractional assignment (i.e., a probability of being assigned to each cluster: say K = 3 and for some point $x_n$, $p_1 = 0.7$, $p_2 = 0.2$, $p_3 = 0.1$).
K-means often doesn't work when clusters are not round shaped, and/or may overlap, and/or are unequal.
Gaussian Mixture Model: a probabilistic approach to clustering (and density estimation) that addresses many of these problems.

7 Mixture Models
The data distribution p(x) is assumed to be a weighted sum of K distributions:
$$p(x) = \sum_{k=1}^{K} \pi_k\, p(x \mid \theta_k)$$
where the $\pi_k$'s are the mixing weights: $\sum_{k=1}^{K} \pi_k = 1$, $\pi_k \geq 0$ (intuitively, $\pi_k$ is the proportion of data generated by the k-th distribution).
Each component distribution $p(x \mid \theta_k)$ represents a cluster in the data.
Gaussian Mixture Model (GMM): the component distributions are Gaussians:
$$p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$
Mixture models are used in many data modeling problems, e.g.:
  Unsupervised learning: clustering (+ density estimation)
  Supervised learning: Mixture of Experts models
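To make the weighted-sum definition concrete, here is a minimal NumPy sketch that evaluates a GMM density. The helper names `gaussian_pdf` and `gmm_pdf` and the 1-D example parameters are assumptions for illustration. As a sanity check, the mixture density is summed over a fine grid, which should come out very close to 1.

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Density of N(mu, Sigma) evaluated at each row of X (shape (N, D))."""
    D = X.shape[1]
    diff = X - mu
    # Mahalanobis term (x - mu)^T Sigma^{-1} (x - mu), via a linear solve.
    maha = np.einsum("nd,nd->n", diff, np.linalg.solve(Sigma, diff.T).T)
    log_norm = -0.5 * (D * np.log(2.0 * np.pi) + np.linalg.slogdet(Sigma)[1])
    return np.exp(log_norm - 0.5 * maha)

def gmm_pdf(X, pi, mus, Sigmas):
    """Mixture density p(x) = sum_k pi_k N(x | mu_k, Sigma_k)."""
    return sum(p * gaussian_pdf(X, m, S) for p, m, S in zip(pi, mus, Sigmas))

# A made-up 1-D mixture with two components.
pi = np.array([0.3, 0.7])
mus = np.array([[-2.0], [3.0]])
Sigmas = np.array([[[1.0]], [[0.25]]])

grid = np.linspace(-10.0, 10.0, 4001)[:, None]
dx = grid[1, 0] - grid[0, 0]
area = gmm_pdf(grid, pi, mus, Sigmas).sum() * dx  # Riemann-sum approximation of the integral
```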

8 GMM Clustering: Pictorially
Notice the mixed-colored points in the overlapping regions.

9 GMM as a Generative Model of Data
We can think of the data $\{x_1, x_2, \ldots, x_N\}$ using a generative story.
For each example $x_n$, first choose its cluster assignment $z_n \in \{1, 2, \ldots, K\}$ as
$$z_n \sim \text{Multinoulli}(\pi_1, \pi_2, \ldots, \pi_K)$$
Now generate $x_n$ from the Gaussian with id $z_n$:
$$x_n \mid z_n \sim \mathcal{N}(\mu_{z_n}, \Sigma_{z_n})$$
Note: $p(z_{nk} = 1) = \pi_k$ is the prior probability of $x_n$ going to cluster k, and
$$p(z_n) = \prod_{k=1}^{K} \pi_k^{z_{nk}}$$
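The generative story translates almost line by line into sampling code. The sketch below is illustrative only: the function name `sample_gmm` and the parameter values are assumptions, and cluster ids are 0-based integers rather than one-hot vectors.

```python
import numpy as np

def sample_gmm(N, pi, mus, Sigmas, seed=0):
    """Draw N examples from a GMM by following the two-step generative story."""
    rng = np.random.default_rng(seed)
    K = len(pi)
    # Step 1: draw a cluster id for every example from the multinoulli prior.
    z = rng.choice(K, size=N, p=pi)
    # Step 2: draw each x_n from the Gaussian selected by z_n.
    X = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
    return X, z

# Made-up parameters for a 2-D, two-component mixture.
pi = np.array([0.2, 0.8])
mus = np.array([[0.0, 0.0], [6.0, 6.0]])
Sigmas = np.array([np.eye(2), 0.5 * np.eye(2)])

X, z = sample_gmm(5000, pi, mus, Sigmas)
frac = np.bincount(z, minlength=2) / len(z)  # empirical cluster proportions
```

With enough samples, `frac` should approach the mixing weights, which matches the reading of each mixing weight as the proportion of data generated by that component.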

10 GMM as a Generative Model of Data
Joint distribution of data and cluster assignments:
$$p(x, z) = p(z)\, p(x \mid z)$$
Marginal distribution of data:
$$p(x) = \sum_{k=1}^{K} p(z_k = 1)\, p(x \mid z_k = 1) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$
Thus the generative model leads to exactly the same p(x) that we defined.

11 Learning GMM
Given N observations $\{x_1, x_2, \ldots, x_N\}$ drawn from the mixture distribution p(x):
$$p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$
Learning the GMM involves the following:
  Learning the cluster assignments $\{z_1, z_2, \ldots, z_N\}$
  Estimating the mixing weights $\pi = \{\pi_1, \ldots, \pi_K\}$ and the parameters $\theta = \{\mu_k, \Sigma_k\}_{k=1}^{K}$ of each of the K Gaussians
GMM, being probabilistic, allows learning probabilities of cluster assignments.

12 GMM: Learning Cluster Assignment Probabilities
For now, assume $\pi = \{\pi_1, \ldots, \pi_K\}$ and $\theta = \{\mu_k, \Sigma_k\}_{k=1}^{K}$ are known.
Given $\pi$ and $\theta$, the posterior probabilities of cluster assignments follow from Bayes' rule:
$$\gamma_{nk} = p(z_{nk} = 1 \mid x_n) = \frac{p(z_{nk} = 1)\, p(x_n \mid z_{nk} = 1)}{\sum_{j=1}^{K} p(z_{nj} = 1)\, p(x_n \mid z_{nj} = 1)} = \frac{\pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$$
Here $\gamma_{nk}$ denotes the posterior probability that $x_n$ belongs to cluster k.
Posterior probability $\gamma_{nk} \propto$ prior probability $\pi_k$ times likelihood $\mathcal{N}(x_n \mid \mu_k, \Sigma_k)$.
Note that unlike K-means, there is a non-zero posterior probability of $x_n$ belonging to each of the K clusters (i.e., probabilistic/soft clustering).
Therefore, for each example $x_n$, we have a vector $\gamma_n$ of cluster probabilities:
$$\gamma_n = [\gamma_{n1}\ \gamma_{n2}\ \ldots\ \gamma_{nK}], \quad \sum_{k=1}^{K} \gamma_{nk} = 1, \quad \gamma_{nk} > 0$$
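The Bayes-rule formula for $\gamma_{nk}$ can be sketched as follows. The function names and the toy 1-D parameters are assumptions for illustration. Working in log space and subtracting the row maximum is an implementation choice of this sketch, not something the slide prescribes; it only guards against numerical underflow and does not change the result.

```python
import numpy as np

def log_gaussian(X, mu, Sigma):
    """log N(x | mu, Sigma) for each row of X (shape (N, D))."""
    D = X.shape[1]
    diff = X - mu
    maha = np.einsum("nd,nd->n", diff, np.linalg.solve(Sigma, diff.T).T)
    return -0.5 * (D * np.log(2.0 * np.pi) + np.linalg.slogdet(Sigma)[1] + maha)

def responsibilities(X, pi, mus, Sigmas):
    """gamma[n, k] = pi_k N(x_n | mu_k, Sigma_k) / sum_j pi_j N(x_n | mu_j, Sigma_j)."""
    # log(prior * likelihood) for every point and component, shape (N, K).
    log_w = np.stack([np.log(p) + log_gaussian(X, m, S)
                      for p, m, S in zip(pi, mus, Sigmas)], axis=1)
    log_w -= log_w.max(axis=1, keepdims=True)  # stabilize before exponentiating
    w = np.exp(log_w)
    return w / w.sum(axis=1, keepdims=True)

# Two equally weighted unit-variance components centered at -1 and +1.
pi = np.array([0.5, 0.5])
mus = np.array([[-1.0], [1.0]])
Sigmas = np.array([[[1.0]], [[1.0]]])
X = np.array([[0.0], [1.0]])

gamma = responsibilities(X, pi, mus, Sigmas)
# The point x = 0 is equidistant from both means, so its row is [0.5, 0.5];
# for x = 1 the second component gets 1 / (1 + exp(-2)).
```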

13 GMM: Estimating Parameters
Now assume the cluster probabilities $\gamma_1, \ldots, \gamma_N$ are known.
Let us write down the log-likelihood of the model:
$$\mathcal{L} = \log p(X) = \log \prod_{n=1}^{N} p(x_n) = \sum_{n=1}^{N} \log p(x_n) = \sum_{n=1}^{N} \log \left\{ \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\}$$
Taking the derivative w.r.t. $\mu_k$ (done on the blackboard) and setting it to zero:
$$\sum_{n=1}^{N} \underbrace{\frac{\pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}}_{\gamma_{nk}}\, \Sigma_k^{-1} (x_n - \mu_k) = 0$$
Plugging and chugging, we get
$$\mu_k = \frac{\sum_{n=1}^{N} \gamma_{nk}\, x_n}{\sum_{n=1}^{N} \gamma_{nk}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk}\, x_n$$
Thus the mean of the k-th Gaussian is the weighted empirical mean of all examples.
$N_k = \sum_{n=1}^{N} \gamma_{nk}$: the effective number of examples assigned to the k-th Gaussian (note that each example belongs to each Gaussian, but "partially").

14 GMM: Estimating Parameters
Doing the same, this time w.r.t. the covariance matrix $\Sigma_k$ of the k-th Gaussian:
$$\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk}\, (x_n - \mu_k)(x_n - \mu_k)^\top$$
using similar computations as the MLE of the covariance matrix of a single Gaussian (shown on the board).
Thus $\Sigma_k$ is the weighted empirical covariance of all examples.
Finally, the MLE objective for estimating $\pi = \{\pi_1, \pi_2, \ldots, \pi_K\}$:
$$\sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right)$$
($\lambda$ is the Lagrange multiplier for the constraint $\sum_{k=1}^{K} \pi_k = 1$)
Taking the derivative w.r.t. $\pi_k$ and setting it to zero gives the Lagrange multiplier $\lambda = -N$. Plugging it back and chugging, we get
$$\pi_k = \frac{N_k}{N}$$
which makes intuitive sense (fraction of examples assigned to cluster k).
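The step from the Lagrangian to $\pi_k = N_k / N$ is only summarized on the slide, so one way to fill it in is sketched below. It uses only quantities already defined ($\gamma_{nk}$, $N_k$, and the fact that $\sum_k \gamma_{nk} = 1$). With the constraint term written as $+\lambda(\sum_k \pi_k - 1)$, the multiplier comes out negative.

```latex
\begin{align*}
\frac{\partial}{\partial \pi_k}:\quad
  & \sum_{n=1}^{N} \frac{\mathcal{N}(x_n \mid \mu_k, \Sigma_k)}
                        {\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} + \lambda = 0 \\
\text{multiply by } \pi_k:\quad
  & \sum_{n=1}^{N} \gamma_{nk} + \lambda \pi_k = 0
    \;\Longrightarrow\; N_k + \lambda \pi_k = 0 \\
\text{sum over } k:\quad
  & \sum_{k=1}^{K} N_k + \lambda \sum_{k=1}^{K} \pi_k = N + \lambda = 0
    \;\Longrightarrow\; \lambda = -N \\
\text{substitute back}:\quad
  & \pi_k = -\frac{N_k}{\lambda} = \frac{N_k}{N}
\end{align*}
```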

15 Summary of GMM Estimation
Initialize the parameters $\theta = \{\mu_k, \Sigma_k\}_{k=1}^{K}$ and the mixing weights $\pi = \{\pi_1, \ldots, \pi_K\}$, and alternate between the following steps until convergence:
Given the current estimates of $\theta = \{\mu_k, \Sigma_k\}_{k=1}^{K}$ and $\pi$:
  Estimate the posterior probabilities of cluster assignments
  $$\gamma_{nk} = \frac{\pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} \quad \forall n, k$$
Given the current estimates of the cluster assignment probabilities $\{\gamma_{nk}\}$:
  Estimate the mean of each Gaussian
  $$\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk}\, x_n \quad \forall k, \quad \text{where } N_k = \sum_{n=1}^{N} \gamma_{nk}$$
  Estimate the covariance matrix of each Gaussian
  $$\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk}\, (x_n - \mu_k)(x_n - \mu_k)^\top \quad \forall k$$
  Estimate the mixing proportion of each Gaussian
  $$\pi_k = \frac{N_k}{N} \quad \forall k$$
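Putting the summarized steps together gives a short alternating loop. The sketch below is one possible NumPy rendering under stated assumptions, none of which come from the slides: the names `fit_gmm` and `init_mus`, the initialization (given or random data points as means, a shared spherical covariance, uniform weights), the small `reg` term added to each covariance for numerical stability, and the log-likelihood-based stopping test. It also does not guard against a component collapsing to zero effective count.

```python
import numpy as np

def log_gaussian(X, mu, Sigma):
    """log N(x | mu, Sigma) for each row of X (shape (N, D))."""
    D = X.shape[1]
    diff = X - mu
    maha = np.einsum("nd,nd->n", diff, np.linalg.solve(Sigma, diff.T).T)
    return -0.5 * (D * np.log(2.0 * np.pi) + np.linalg.slogdet(Sigma)[1] + maha)

def fit_gmm(X, K, init_mus=None, n_iters=200, tol=1e-6, reg=1e-6, seed=0):
    """Alternate the posterior step and the parameter updates (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Assumed initialization: data points as means, spherical covariances, uniform weights.
    if init_mus is None:
        mus = X[rng.choice(N, size=K, replace=False)].astype(float)
    else:
        mus = np.array(init_mus, dtype=float)
    Sigmas = np.stack([np.var(X, axis=0).mean() * np.eye(D)] * K)
    pi = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(n_iters):
        # Posterior probabilities gamma[n, k], computed in log space for stability.
        log_w = np.stack([np.log(pi[k]) + log_gaussian(X, mus[k], Sigmas[k])
                          for k in range(K)], axis=1)
        m = log_w.max(axis=1, keepdims=True)
        log_px = m[:, 0] + np.log(np.exp(log_w - m).sum(axis=1))
        gamma = np.exp(log_w - log_px[:, None])
        # Weighted means, covariances and mixing proportions.
        Nk = gamma.sum(axis=0)
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + reg * np.eye(D)
        pi = Nk / N
        # Stop when the log-likelihood of the parameters used above stops improving.
        ll = log_px.sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    # gamma was computed from the parameters before the final update.
    return pi, mus, Sigmas, gamma

# Made-up data: 300 points around (-3, -3) and 700 points around (3, 3).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3.0, 1.0, size=(300, 2)),
               rng.normal(3.0, 1.0, size=(700, 2))])
# One initial mean taken from each blob to keep the example well behaved.
pi, mus, Sigmas, gamma = fit_gmm(X, K=2, init_mus=X[[0, -1]])
```

On this data the estimated mixing weights should land near 0.3 and 0.7. With a poor initialization the same loop can settle in a worse local optimum, which is why the example fixes the initial means.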

16 K-means: A Special Case of GMM
Assume the covariance matrix of each Gaussian to be spherical: $\Sigma_k = \sigma^2 I$.
Consider the posterior probabilities of cluster assignments:
$$\gamma_{nk} = \frac{\pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} = \frac{\pi_k \exp\{-\frac{1}{2\sigma^2} \|x_n - \mu_k\|^2\}}{\sum_{j=1}^{K} \pi_j \exp\{-\frac{1}{2\sigma^2} \|x_n - \mu_j\|^2\}}$$
As $\sigma^2 \to 0$, the sum in the denominator will be dominated by the term with the smallest $\|x_n - \mu_j\|^2$. For that j,
$$\gamma_{nj} \approx \frac{\pi_j \exp\{-\frac{1}{2\sigma^2} \|x_n - \mu_j\|^2\}}{\pi_j \exp\{-\frac{1}{2\sigma^2} \|x_n - \mu_j\|^2\}} = 1$$
For $l \neq j$, $\gamma_{nl} \approx 0$ $\Rightarrow$ hard assignment with $\gamma_{nj} \approx 1$ for a single cluster j.
Thus, for $\Sigma_k = \sigma^2 I$ (spherical) and $\sigma^2 \to 0$, GMM reduces to K-means.
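The limiting argument can be checked numerically. The sketch below evaluates the spherical-covariance posterior for one point at several values of $\sigma^2$. The function name and the toy means and weights are assumptions chosen so that the nearest mean has the smallest mixing weight.

```python
import numpy as np

def spherical_responsibilities(x, mus, pi, sigma2):
    """gamma_k for a single point x when every Sigma_k = sigma2 * I."""
    d2 = ((mus - x) ** 2).sum(axis=1)
    # The Gaussian normalizing constant is the same for all k and cancels.
    log_w = np.log(pi) - d2 / (2.0 * sigma2)
    log_w -= log_w.max()  # keeps the exponentials finite for tiny sigma2
    w = np.exp(log_w)
    return w / w.sum()

x = np.array([1.0, 0.0])
mus = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 5.0]])  # squared distances 1, 4, 26
pi = np.array([0.2, 0.5, 0.3])

for sigma2 in [10.0, 1.0, 0.1, 0.001]:
    print(sigma2, np.round(spherical_responsibilities(x, mus, pi, sigma2), 4))
```

For large $\sigma^2$ the mixing weights still matter, so the heaviest component can win even though it is not the nearest. As $\sigma^2$ shrinks, the vector approaches a one-hot assignment to the nearest mean, which is the K-means rule.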

17 Next class: The Expectation Maximization Algorithm
