Unsupervised Learning: K-Means, Gaussian Mixture Models

These slides were assembled by Eric Eaton, with grateful acknowledgement of the many others who made their course materials freely available online. Feel free to reuse or adapt these slides for your own academic purposes, provided that you include proper attribution. Please send comments and corrections to Eric.

Robot Image Credit: Viktoriya Sukhanova
Unsupervised Learning
Unsupervised Learning
- Supervised learning used labeled data pairs (x, y) to learn a function f : X → Y
- But what if we don't have labels? Most data doesn't!
- No labels = unsupervised learning
- Only some points are labeled = semi-supervised learning
  - Labels may be expensive to obtain, so we only get a few
Unsupervised Learning: Major Categories (and their uses)
- Clustering
- Density estimation
- Dimensionality reduction (feature selection)
Kernel Density Estimation Some material adapted from slides by Le Song, GT.
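Kernel density estimation places a kernel (e.g., a Gaussian of bandwidth h) on every data point and averages them to obtain a smooth, nonparametric estimate of the density. A minimal 1-D NumPy sketch; the bandwidth and the synthetic bimodal sample are illustrative assumptions, not values from the slides:

```python
import numpy as np

def gaussian_kde_1d(x_query, data, h):
    """Estimate p(x) at each query point by averaging Gaussian kernels
    of bandwidth h centered on the observed data points."""
    data = np.asarray(data, dtype=float)
    x_query = np.asarray(x_query, dtype=float)
    # Pairwise standardized differences between query points and data points
    diffs = (x_query[:, None] - data[None, :]) / h
    kernels = np.exp(-0.5 * diffs ** 2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / h

# Illustrative usage: density estimate for a small bimodal sample
data = np.concatenate([np.random.normal(-2, 0.5, 100),
                       np.random.normal(1, 0.8, 100)])
xs = np.linspace(-4, 3, 50)
density = gaussian_kde_1d(xs, data, h=0.3)
```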
K-Means Clustering
Some material adapted from slides by Andrew Moore, CMU. Visit http://www.autonlab.org/tutorials/ for Andrew's repository of Data Mining tutorials.
Clustering Data
K-Means Clustering
K-Means(k, X):
- Randomly choose k cluster center locations (centroids)
- Loop until convergence:
  - Assign each point to the cluster of the closest centroid
  - Re-estimate the cluster centroids based on the data assigned to each cluster
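A minimal NumPy sketch of the loop above; the convergence test (centroids stop moving) and the random-point initialization are illustrative choices, not prescribed by the slides:

```python
import numpy as np

def kmeans(X, k, max_iters=100, tol=1e-6, rng=None):
    """Plain K-means: random initial centroids, then alternate
    assignment and centroid re-estimation until convergence."""
    rng = np.random.default_rng(rng)
    # Randomly choose k data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Assign each point to the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Re-estimate each centroid as the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return centroids, labels
```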
K-Means Animation
Example generated by Andrew Moore using Dan Pelleg's superduper fast K-means system: Dan Pelleg and Andrew Moore. Accelerating Exact k-means Algorithms with Geometric Reasoning. Proc. Conference on Knowledge Discovery in Databases, 1999.
K-Means Objective Function
K-means finds a local optimum of the following objective function:
$J(C, \mu) = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2$
where $C_j$ is the set of points assigned to cluster $j$ and $\mu_j$ is its centroid.
Problems with K-Means
- Very sensitive to the initial points
  - Do many runs of K-Means, each with different initial centroids
  - Seed the centroids using a better method than choosing them at random, e.g., farthest-first sampling (a sketch follows below)
- Must manually choose k
  - Learn the optimal k for the clustering (note that this requires a performance measure)
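A small sketch of the farthest-first seeding mentioned above: pick one centroid at random, then repeatedly add the point farthest from its nearest already-chosen centroid. The random first pick is an illustrative choice:

```python
import numpy as np

def farthest_first_init(X, k, rng=None):
    """Farthest-first seeding: start from a random point, then greedily
    add the point with the largest distance to its closest chosen centroid."""
    rng = np.random.default_rng(rng)
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Distance from every point to its closest chosen centroid
        chosen = np.array(centroids)
        dists = np.min(
            np.linalg.norm(X[:, None, :] - chosen[None, :, :], axis=2), axis=1)
        centroids.append(X[dists.argmax()])
    return np.array(centroids)
```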
Choosing the key parameter k
1. The Elbow Method
   - $SSE = \sum_{i=1}^{K} \sum_{x \in c_i} \mathrm{dist}(x, c_i)^2$
2. X-means clustering
   - repeatedly attempting subdivision
3. Others:
   - an Information Criterion Approach
   - an Information Theoretic Approach
   - the Silhouette Method
   - cross-validation
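A small sketch of the SSE computation behind the elbow method: run the clustering for a range of k and look for the value where the SSE curve bends (stops dropping sharply). The `kmeans` call in the comment refers to the earlier K-means sketch, and the range of k values is an illustrative choice:

```python
import numpy as np

def sse(X, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    return float(np.sum((X - centroids[labels]) ** 2))

# Elbow method (illustrative usage, assuming the kmeans sketch above):
# sse_by_k = {k: sse(X, *kmeans(X, k)) for k in range(1, 11)}
```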
Problems with K-Means
- How do you tell it which clustering you want? (e.g., k = 2)
- Constrained clustering techniques (semi-supervised); a constraint-checking sketch follows below:
  - Same-cluster constraint (must-link)
  - Different-cluster constraint (cannot-link)
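For intuition, a minimal sketch of the constraint check used by constrained K-means variants (in the spirit of COP-KMeans, not an algorithm from these slides): before assigning a point to a cluster, verify that no must-link or cannot-link constraint would be violated. The data layout (pairs of point indices, -1 meaning "not yet assigned") is an assumption for illustration:

```python
def violates_constraints(point_idx, cluster, labels, must_link, cannot_link):
    """Return True if assigning point_idx to cluster would break a
    must-link or cannot-link constraint, given the current assignments."""
    for i, j in must_link:
        other = j if i == point_idx else (i if j == point_idx else None)
        # Must-link: the partner, if already assigned, must be in the same cluster
        if other is not None and labels[other] != -1 and labels[other] != cluster:
            return True
    for i, j in cannot_link:
        other = j if i == point_idx else (i if j == point_idx else None)
        # Cannot-link: the partner must not be in this cluster
        if other is not None and labels[other] == cluster:
            return True
    return False
```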
Gaussian Mixture Models and Expectation Maximization
Gaussian Mixture Models
Recall the Gaussian distribution:
$N(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x - \mu)^{\top} \Sigma^{-1} (x - \mu)\right)$
The GMM assumption
- There are k components; the i-th component is called ω_i
- Component ω_i has an associated mean vector μ_i
- Each component generates data from a Gaussian with mean μ_i and covariance matrix σ²I
- Each datapoint is assumed to be generated by the following recipe:
  1. Pick a component at random: choose component i with probability P(ω_i)
  2. Draw the datapoint from N(μ_i, σ²I)
The General GMM assumption
- There are k components; the i-th component is called ω_i
- Component ω_i has an associated mean vector μ_i
- Each component generates data from a Gaussian with mean μ_i and covariance matrix Σ_i
- Each datapoint is assumed to be generated by the following recipe:
  1. Pick a component at random: choose component i with probability P(ω_i)
  2. Draw the datapoint from N(μ_i, Σ_i)
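The recipe above can be written directly as a sampler: choose a component according to its prior P(ω_i), then draw from that component's Gaussian. A minimal NumPy sketch with illustrative mixture parameters (the weights, means, and covariances are assumptions for the example, not from the slides):

```python
import numpy as np

def sample_gmm(n, weights, means, covs, rng=None):
    """Draw n points from a Gaussian mixture: choose a component by its
    mixing weight, then sample from that component's Gaussian."""
    rng = np.random.default_rng(rng)
    components = rng.choice(len(weights), size=n, p=weights)
    points = np.array([rng.multivariate_normal(means[c], covs[c])
                       for c in components])
    return points, components

# Illustrative 2-D mixture with three components
weights = [0.5, 0.3, 0.2]
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0]), np.array([-3.0, 2.0])]
covs = [np.eye(2), 0.5 * np.eye(2), np.diag([1.0, 0.3])]
X, z = sample_gmm(500, weights, means, covs, rng=0)
```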
Expectation-Maximization for GMMs
Iterate until convergence. On the t-th iteration, let our estimates be λ_t = { μ_1(t), μ_2(t), …, μ_c(t) }.
E-step: Compute the expected class of each datapoint, for each class (just evaluate a Gaussian at x_k):
$P(\omega_i \mid x_k, \lambda_t) = \frac{p(x_k \mid \omega_i, \lambda_t)\, P(\omega_i)}{\sum_{j=1}^{c} p(x_k \mid \omega_j, \lambda_t)\, P(\omega_j)}$
M-step: Estimate μ given our data's class membership distributions:
$\mu_i(t+1) = \frac{\sum_k P(\omega_i \mid x_k, \lambda_t)\, x_k}{\sum_k P(\omega_i \mid x_k, \lambda_t)}$
E.M. for General GMMs
Iterate. On the t-th iteration, let our estimates be λ_t = { μ_1(t), …, μ_c(t), Σ_1(t), …, Σ_c(t), p_1(t), …, p_c(t) }, where p_i(t) is shorthand for the estimate of P(ω_i) on the t-th iteration.
E-step: Compute the expected cluster of each datapoint (just evaluate a Gaussian at x_k):
$P(\omega_i \mid x_k, \lambda_t) = \frac{p_i(t)\, N(x_k \mid \mu_i(t), \Sigma_i(t))}{\sum_{j=1}^{c} p_j(t)\, N(x_k \mid \mu_j(t), \Sigma_j(t))}$
M-step: Estimate μ, Σ, and p given our data's class membership distributions (R = # records):
$\mu_i(t+1) = \frac{\sum_k P(\omega_i \mid x_k, \lambda_t)\, x_k}{\sum_k P(\omega_i \mid x_k, \lambda_t)}$
$\Sigma_i(t+1) = \frac{\sum_k P(\omega_i \mid x_k, \lambda_t)\, (x_k - \mu_i(t+1))(x_k - \mu_i(t+1))^{\top}}{\sum_k P(\omega_i \mid x_k, \lambda_t)}$
$p_i(t+1) = \frac{\sum_k P(\omega_i \mid x_k, \lambda_t)}{R}$
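A compact NumPy sketch of the E- and M-steps above for the general (full-covariance) case; the initialization scheme and the small ridge added to each covariance for numerical stability are illustrative choices, not part of the slides:

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Multivariate normal density N(x | mu, Sigma) evaluated at each row of X."""
    d = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(Sigma)
    norm = np.sqrt(((2 * np.pi) ** d) * np.linalg.det(Sigma))
    return np.exp(-0.5 * np.einsum('nd,de,ne->n', diff, inv, diff)) / norm

def em_gmm(X, k, n_iters=100, rng=None):
    """EM for a GMM with full covariances: alternate responsibilities (E-step)
    and parameter re-estimation (M-step)."""
    rng = np.random.default_rng(rng)
    R, d = X.shape
    # Initialize: random data points as means, identity covariances, uniform priors
    mu = X[rng.choice(R, size=k, replace=False)].copy()
    Sigma = np.array([np.eye(d) for _ in range(k)])
    p = np.full(k, 1.0 / k)
    for _ in range(n_iters):
        # E-step: responsibilities P(w_i | x_k), normalized over components
        resp = np.stack([p[i] * gaussian_pdf(X, mu[i], Sigma[i])
                         for i in range(k)], axis=1)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate means, covariances, and mixing weights
        Nk = resp.sum(axis=0)                      # effective counts per component
        mu = (resp.T @ X) / Nk[:, None]
        for i in range(k):
            diff = X - mu[i]
            Sigma[i] = (resp[:, i, None] * diff).T @ diff / Nk[i] + 1e-6 * np.eye(d)
        p = Nk / R
    return mu, Sigma, p, resp
```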
Gaussian Mixture Example: Start, then after the 1st, 2nd, 3rd, 4th, 5th, 6th, and 20th EM iterations. [Figures not reproduced; as the original slides apologize, in black and white this example is incomprehensible.]
Some Bio Assay data [figure not reproduced]
GMM clustering of the assay data [figure not reproduced]
Resulting Density Estimator [figure not reproduced]
Other Methods and Reading
- Hidden Markov Models: Andrew Ng's Notes
- Mean Shift: Mean Shift Tutorial
- Deep Methods: Tutorial on Variational Autoencoders