Clustering and Gaussian Mixture Models

Piyush Rai
IIT Kanpur

Probabilistic Machine Learning (CS772A)
Jan 25, 2016
Recap of last lecture...
Clustering

Usually an unsupervised learning problem.

Given: N unlabeled examples {x_1, ..., x_N}; the number of partitions K.
Goal: Group the examples into K partitions.

Clustering groups examples based on their mutual similarities. A good clustering is one that achieves:
- High within-cluster similarity
- Low inter-cluster similarity

Examples: K-means, Spectral Clustering, Gaussian Mixture Model, etc.

Picture courtesy: Data Clustering: 50 Years Beyond K-Means, A.K. Jain (2008)
Refresher: K-means Clustering

Input: N examples {x_1, ..., x_N}, with x_n ∈ R^D; the number of partitions K.

Initialize: K cluster means µ_1, ..., µ_K, with µ_k ∈ R^D. Many ways to initialize: usually initialized randomly, but good initialization is crucial; many smarter initialization heuristics exist (e.g., K-means++, Arthur & Vassilvitskii, 2007).

Iterate:
- (Re-)assign each example x_n to its closest cluster center:

  C_k = \{ n : k = \arg\min_j \| x_n - \mu_j \|^2 \}

  (C_k is the set of examples assigned to cluster k with center µ_k)
- Update the cluster means:

  \mu_k = \text{mean}(C_k) = \frac{1}{|C_k|} \sum_{n \in C_k} x_n

- Repeat while not converged.

A possible convergence criterion: cluster means do not change anymore.
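As a rough illustration of the loop above, here is a minimal NumPy sketch (the function name kmeans, its arguments, and the random initialization scheme are my own choices, not part of the lecture):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Plain K-means on data X of shape (N, D); returns the means and assignments."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Initialize: pick K distinct examples at random as the initial means
    mu = X[rng.choice(N, size=K, replace=False)].copy()
    z = np.full(N, -1)
    for _ in range(n_iters):
        # Assignment step: each point goes to its closest cluster mean
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (N, K)
        z_new = dists.argmin(axis=1)
        # Update step: each mean becomes the average of its assigned points
        for k in range(K):
            if np.any(z_new == k):
                mu[k] = X[z_new == k].mean(axis=0)
        if np.array_equal(z_new, z):   # converged: assignments did not change
            break
        z = z_new
    return mu, z
```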
The K-means Objective Function

Notation: a size-K one-hot vector z_n denotes the membership of x_n, with z_nk = 1 if x_n is assigned to cluster k (equivalent to just saying z_n = k):

  z_n = [0 \; 0 \; \ldots \; 1 \; \ldots \; 0 \; 0] \quad \text{(all zeros except the k-th bit)}

The K-means objective can be written in terms of the total distortion:

  J(\mu, Z) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \| x_n - \mu_k \|^2

Distortion: the loss suffered on assigning the points {x_n}_{n=1}^N to the clusters {µ_k}_{k=1}^K.

Goal: minimize the objective w.r.t. µ and Z.

Note: the objective is non-convex, and exact optimization is NP-hard. The K-means algorithm is a heuristic; it alternates between minimizing J w.r.t. µ and Z, and converges to a local minimum.
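For concreteness, the distortion can be evaluated directly from the one-hot matrix Z; a small sketch (the helper name distortion is mine, assuming X is (N, D), mu is (K, D), and Z is (N, K) one-hot):

```python
import numpy as np

def distortion(X, mu, Z):
    """J(mu, Z) = sum_n sum_k z_nk * ||x_n - mu_k||^2, with Z one-hot of shape (N, K)."""
    sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (N, K)
    return float((Z * sq_dists).sum())
```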
K-means: Some Limitations

Makes hard assignments of points to clusters:
- A point either totally belongs to a cluster or not at all
- No notion of a soft/fractional assignment (i.e., a probability of being assigned to each cluster: say K = 3 and for some point x_n, p_1 = 0.7, p_2 = 0.2, p_3 = 0.1)

K-means often doesn't work well when clusters are not round shaped, and/or may overlap, and/or are of unequal sizes.

Gaussian Mixture Model: a probabilistic approach to clustering (and density estimation) that addresses many of these problems.
Mixture Models

The data distribution p(x) is assumed to be a weighted sum of K distributions:

  p(x) = \sum_{k=1}^{K} \pi_k \, p(x \mid \theta_k)

where the π_k's are the mixing weights: \sum_{k=1}^{K} \pi_k = 1, \; \pi_k \geq 0 (intuitively, π_k is the proportion of data generated by the k-th distribution).

Each component distribution p(x | θ_k) represents a cluster in the data.

Gaussian Mixture Model (GMM): the component distributions are Gaussians:

  p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)

Mixture models are used in many data modeling problems, e.g.,
- Unsupervised learning: clustering (+ density estimation)
- Supervised learning: Mixture of Experts models
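As a small illustration, the mixture density can be evaluated directly from this formula; a sketch using scipy.stats.multivariate_normal (the function name gmm_density and the argument layout are my own):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, pis, mus, Sigmas):
    """p(x) = sum_k pi_k * N(x | mu_k, Sigma_k) for a single point x."""
    return sum(pi * multivariate_normal.pdf(x, mean=mu, cov=Sigma)
               for pi, mu, Sigma in zip(pis, mus, Sigmas))
```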
GMM Clustering: Pictorially

Notice the mixed-colored points in the overlapping regions.
GMM as a Generative Model of Data

We can think of the data {x_1, x_2, ..., x_N} as arising from the following generative story.

For each example x_n, first choose its cluster assignment z_n ∈ {1, 2, ..., K} as

  z_n \sim \text{Multinoulli}(\pi_1, \pi_2, \ldots, \pi_K)

Then generate x_n from the Gaussian with id z_n:

  x_n \mid z_n \sim \mathcal{N}(\mu_{z_n}, \Sigma_{z_n})

Note: p(z_nk = 1) = π_k is the prior probability of x_n going to cluster k, and

  p(z_n) = \prod_{k=1}^{K} \pi_k^{z_{nk}}
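The generative story translates directly into a sampler; a minimal sketch (sample_gmm and its arguments are hypothetical names, not from the lecture):

```python
import numpy as np

def sample_gmm(N, pis, mus, Sigmas, seed=0):
    """Draw N points: first z_n ~ Multinoulli(pi), then x_n | z_n ~ N(mu_{z_n}, Sigma_{z_n})."""
    rng = np.random.default_rng(seed)
    K = len(pis)
    z = rng.choice(K, size=N, p=pis)                                  # cluster assignments
    X = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
    return X, z
```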
GMM as a Generative Model of Data

Joint distribution of data and cluster assignments:

  p(x, z) = p(z) \, p(x \mid z)

Marginal distribution of data:

  p(x) = \sum_{k=1}^{K} p(z_k = 1) \, p(x \mid z_k = 1) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)

Thus the generative model leads to exactly the same p(x) that we defined earlier.
Learning GMM

Given N observations {x_1, x_2, ..., x_N} drawn from the mixture distribution

  p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)

learning the GMM involves the following:
- Learning the cluster assignments {z_1, z_2, ..., z_N}
- Estimating the mixing weights π = {π_1, ..., π_K} and the parameters θ = {µ_k, Σ_k}_{k=1}^K of each of the K Gaussians

GMM, being probabilistic, allows learning probabilities of cluster assignments.
GMM: Learning Cluster Assignment Probabilities

For now, assume π = {π_1, ..., π_K} and θ = {µ_k, Σ_k}_{k=1}^K are known.

Given θ, the posterior probabilities of cluster assignments follow from Bayes' rule:

  \gamma_{nk} = p(z_{nk} = 1 \mid x_n) = \frac{p(z_{nk} = 1)\, p(x_n \mid z_{nk} = 1)}{\sum_{j=1}^{K} p(z_{nj} = 1)\, p(x_n \mid z_{nj} = 1)} = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}

Here γ_nk denotes the posterior probability that x_n belongs to cluster k.

Posterior probability γ_nk ∝ prior probability π_k times likelihood N(x_n | µ_k, Σ_k).

Note that unlike K-means, there is a non-zero posterior probability of x_n belonging to each of the K clusters (i.e., probabilistic/soft clustering).

Therefore, for each example x_n, we have a vector γ_n of cluster probabilities:

  \gamma_n = [\gamma_{n1} \; \gamma_{n2} \; \ldots \; \gamma_{nK}], \quad \sum_{k=1}^{K} \gamma_{nk} = 1, \quad \gamma_{nk} > 0
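This posterior computation is the first of the two alternating steps; a vectorized sketch (the helper name responsibilities is mine; the returned gamma has shape (N, K) with rows summing to 1):

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pis, mus, Sigmas):
    """gamma_nk = pi_k N(x_n|mu_k,Sigma_k) / sum_j pi_j N(x_n|mu_j,Sigma_j)."""
    K = len(pis)
    # Unnormalized posteriors: prior times likelihood, for every point and every component
    gamma = np.stack([pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
                      for k in range(K)], axis=1)                     # (N, K)
    return gamma / gamma.sum(axis=1, keepdims=True)                   # normalize each row
```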
GMM: Estimating Parameters

Now assume the cluster probabilities γ_1, ..., γ_N are known.

Let us write down the log-likelihood of the model:

  \mathcal{L} = \log p(\mathbf{X}) = \log \prod_{n=1}^{N} p(x_n) = \sum_{n=1}^{N} \log p(x_n) = \sum_{n=1}^{N} \log \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\}

Taking the derivative w.r.t. µ_k (done on the blackboard) and setting it to zero:

  \sum_{n=1}^{N} \underbrace{\frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}}_{\gamma_{nk}} \; \Sigma_k^{-1} (x_n - \mu_k) = 0

Plugging and chugging, we get

  \mu_k = \frac{\sum_{n=1}^{N} \gamma_{nk} x_n}{\sum_{n=1}^{N} \gamma_{nk}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk} x_n

Thus the mean of the k-th Gaussian is the weighted empirical mean of all the examples.

Here N_k = \sum_{n=1}^{N} \gamma_{nk} is the effective number of examples assigned to the k-th Gaussian (note that each example belongs to each Gaussian, but "partially").
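The mean update is a one-liner given the responsibility matrix gamma from the previous slide; a sketch (update_means is my own name):

```python
import numpy as np

def update_means(X, gamma):
    """mu_k = (1/N_k) sum_n gamma_nk x_n, where N_k = sum_n gamma_nk."""
    Nk = gamma.sum(axis=0)                  # effective counts per cluster, shape (K,)
    return (gamma.T @ X) / Nk[:, None]      # weighted empirical means, shape (K, D)
```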
GMM: Estimating Parameters

Doing the same, this time w.r.t. the covariance matrix Σ_k of the k-th Gaussian:

  \Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk} (x_n - \mu_k)(x_n - \mu_k)^\top

... using similar computations as the MLE of the covariance matrix of a single Gaussian (shown on the board). Thus Σ_k is the weighted empirical covariance of all the examples.

Finally, the MLE objective for estimating π = {π_1, π_2, ..., π_K}:

  \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right)

(λ is the Lagrange multiplier for the constraint \sum_{k=1}^{K} \pi_k = 1)

Taking the derivative w.r.t. π_k and setting it to zero gives the Lagrange multiplier λ = -N. Plugging it back and chugging, we get

  \pi_k = \frac{N_k}{N}

which makes intuitive sense (the fraction of examples assigned to cluster k).
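The covariance and mixing-weight updates follow the same pattern; a sketch (again the helper name is mine, reusing gamma and the freshly updated means):

```python
import numpy as np

def update_covs_and_weights(X, gamma, mus):
    """Sigma_k = (1/N_k) sum_n gamma_nk (x_n - mu_k)(x_n - mu_k)^T,  pi_k = N_k / N."""
    N, D = X.shape
    K = gamma.shape[1]
    Nk = gamma.sum(axis=0)
    Sigmas = np.zeros((K, D, D))
    for k in range(K):
        diff = X - mus[k]                                      # (N, D)
        Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
    pis = Nk / N                                               # fraction of examples per cluster
    return Sigmas, pis
```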
Summary of GMM Estimation

Initialize the parameters θ = {µ_k, Σ_k}_{k=1}^K and the mixing weights π = {π_1, ..., π_K}, and alternate between the following steps until convergence:

Given the current estimates of θ = {µ_k, Σ_k}_{k=1}^K and π, estimate the posterior probabilities of cluster assignments:

  \gamma_{nk} = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} \quad \forall n, k

Given the current estimates of the cluster assignment probabilities {γ_nk}:

- Estimate the mean of each Gaussian:

  \mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk} x_n \quad \forall k, \quad \text{where } N_k = \sum_{n=1}^{N} \gamma_{nk}

- Estimate the covariance matrix of each Gaussian:

  \Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk} (x_n - \mu_k)(x_n - \mu_k)^\top \quad \forall k

- Estimate the mixing proportion of each Gaussian:

  \pi_k = \frac{N_k}{N} \quad \forall k
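Putting it together, a compact sketch of the alternating procedure, reusing the responsibilities, update_means, and update_covs_and_weights helpers sketched on the previous slides (the initialization and the convergence test below are my own simple choices, not prescribed by the lecture):

```python
import numpy as np

def fit_gmm(X, K, n_iters=100, seed=0):
    """Alternate between the two steps until the means stop changing."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Simple initialization: random data points as means, identity covariances, uniform weights
    mus = X[rng.choice(N, size=K, replace=False)].copy()
    Sigmas = np.stack([np.eye(D) for _ in range(K)])
    pis = np.full(K, 1.0 / K)
    for _ in range(n_iters):
        gamma = responsibilities(X, pis, mus, Sigmas)          # posterior cluster probabilities
        mus_new = update_means(X, gamma)                       # weighted means
        Sigmas, pis = update_covs_and_weights(X, gamma, mus_new)
        converged = np.allclose(mus_new, mus, atol=1e-6)
        mus = mus_new
        if converged:
            break
    return pis, mus, Sigmas
```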
K-means: A Special Case of GMM

Assume the covariance matrix of each Gaussian to be spherical: Σ_k = σ²I.

Consider the posterior probabilities of cluster assignments (the Gaussian normalizing constants cancel between numerator and denominator):

  \gamma_{nk} = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} = \frac{\pi_k \exp\{-\frac{1}{2\sigma^2} \|x_n - \mu_k\|^2\}}{\sum_{j=1}^{K} \pi_j \exp\{-\frac{1}{2\sigma^2} \|x_n - \mu_j\|^2\}}

As σ² → 0, the sum in the denominator is dominated by the term with the smallest \|x_n - \mu_j\|^2. For that j,

  \gamma_{nj} \to \frac{\pi_j \exp\{-\frac{1}{2\sigma^2} \|x_n - \mu_j\|^2\}}{\pi_j \exp\{-\frac{1}{2\sigma^2} \|x_n - \mu_j\|^2\}} = 1

For l ≠ j, γ_nl → 0: a hard assignment, with γ_nj → 1 for a single cluster j.

Thus, for Σ_k = σ²I (spherical) and σ² → 0, GMM reduces to K-means.
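A tiny numerical check of this limit: as σ² shrinks, the responsibility vectors approach one-hot K-means assignments. In the sketch below (all names and the toy numbers are my own), the normalizing constants of the spherical Gaussians are omitted since they cancel:

```python
import numpy as np

def spherical_gamma(X, mus, pis, sigma2):
    """Responsibilities for Sigma_k = sigma2 * I (normalizing constants cancel)."""
    sq = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)        # (N, K)
    logits = np.log(pis)[None, :] - sq / (2.0 * sigma2)
    logits -= logits.max(axis=1, keepdims=True)                      # numerical stability
    g = np.exp(logits)
    return g / g.sum(axis=1, keepdims=True)

X = np.array([[0.2, 0.1], [1.8, 2.1]])
mus = np.array([[0.0, 0.0], [2.0, 2.0]])
pis = np.array([0.5, 0.5])
for s2 in [1.0, 0.1, 0.001]:
    print(s2, spherical_gamma(X, mus, pis, s2).round(3))   # rows harden toward one-hot
```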
Next class: The Expectation Maximization Algorithm