Lecture Slides for Machine Learning. Ethem Alpaydin, The MIT Press, http://



1 Lecture Slides for Introduction to Machine Learning, Ethem Alpaydin, The MIT Press, 2010, http://

2 CHAPTER 7: Clustering

3 Semiparametric Density Estimation
- Parametric: assume a single model for $p(x \mid C_i)$ (Chapters 4 and 5).
- Semiparametric: $p(x \mid C_i)$ is a mixture of densities. There are multiple possible explanations/prototypes, e.g., different handwriting styles or accents in speech.
- Nonparametric: no model; the data speaks for itself (Chapter 8).

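4 Mixture Densities
$p(x) = \sum_{i=1}^{k} p(x \mid G_i)\, P(G_i)$
where $G_i$ are the components/groups/clusters, $P(G_i)$ the mixture proportions (priors), and $p(x \mid G_i)$ the component densities.
Gaussian mixture: $p(x \mid G_i) \sim \mathcal{N}(\mu_i, \Sigma_i)$, with parameters $\Phi = \{P(G_i), \mu_i, \Sigma_i\}_{i=1}^{k}$.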
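The mixture density can be evaluated directly from its definition. The following is a minimal Python sketch for the univariate case, assuming scalar variances in place of covariance matrices; the function names and the two-component parameters are hypothetical and chosen only for illustration.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian N(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def mixture_density(x, priors, means, variances):
    """p(x) = sum_i P(G_i) * p(x | G_i) for a 1-D Gaussian mixture."""
    return sum(p * gaussian_pdf(x, m, v) for p, m, v in zip(priors, means, variances))

# Hypothetical two-component mixture (illustrative values, not from the slides).
priors = [0.3, 0.7]        # P(G_i), must sum to 1
means = [-2.0, 1.0]        # mu_i
variances = [0.5, 1.5]     # sigma_i^2
density_at_zero = mixture_density(0.0, priors, means, variances)
```

Because the priors sum to 1 and each component integrates to 1, the mixture is itself a valid density.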

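5 Classes vs. Clusters
Supervised: $X = \{x^t, r^t\}_t$
- Classes $C_i$, $i = 1, \ldots, K$
- $p(x) = \sum_{i=1}^{K} p(x \mid C_i)\, P(C_i)$, where $p(x \mid C_i) \sim \mathcal{N}(\mu_i, \Sigma_i)$
- $\Phi = \{P(C_i), \mu_i, \Sigma_i\}_{i=1}^{K}$
Unsupervised: $X = \{x^t\}_t$
- Clusters $G_i$, $i = 1, \ldots, k$
- $p(x) = \sum_{i=1}^{k} p(x \mid G_i)\, P(G_i)$, where $p(x \mid G_i) \sim \mathcal{N}(\mu_i, \Sigma_i)$
- $\Phi = \{P(G_i), \mu_i, \Sigma_i\}_{i=1}^{k}$
- Labels $r_i^t$?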

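6 k-Means Clustering
- Find k reference vectors (prototypes/codebook vectors/codewords) that best represent the data.
- Reference vectors: $m_j$, $j = 1, \ldots, k$.
- Use the nearest (most similar) reference: $\|x^t - m_i\| = \min_j \|x^t - m_j\|$.
- Reconstruction error: $E(\{m_i\}_{i=1}^{k} \mid X) = \sum_t \sum_i b_i^t \|x^t - m_i\|^2$, where $b_i^t = 1$ if $m_i$ is the nearest reference to $x^t$ and 0 otherwise.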

7 Encoding/Decoding

8 k-Means Clustering
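A minimal Python sketch of the usual k-means iteration (assign each instance to its nearest reference vector, then move each reference vector to the mean of its instances). It assumes squared Euclidean distance and initialization from k randomly chosen instances; the function names, the `seed` parameter, and the toy data are illustrative and not taken from the slides.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def nearest(x, centers):
    """Index of the reference vector closest to x."""
    return min(range(len(centers)), key=lambda j: squared_distance(x, centers[j]))

def k_means(data, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Assumption: initialize the reference vectors with k distinct random instances.
    centers = [list(x) for x in rng.sample(data, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each instance goes to its nearest reference vector.
        new_labels = [nearest(x, centers) for x in data]
        if new_labels == labels:
            break  # assignments no longer change, so the centers are stable
        labels = new_labels
        # Update step: each reference vector becomes the mean of its instances.
        for j in range(k):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:  # keep the old center if a cluster ends up empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels

def reconstruction_error(data, centers, labels):
    """Sum of squared distances from each instance to its assigned center."""
    return sum(squared_distance(x, centers[lab]) for x, lab in zip(data, labels))

# Toy data: two well-separated groups of three points each (illustrative).
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = k_means(data, k=2)
error = reconstruction_error(data, centers, labels)
```

k-means converges to a local minimum of the reconstruction error, so in practice the result depends on the initialization; running it from several seeds and keeping the lowest error is a common safeguard.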


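10 Expectation-Maximization (EM)
- Log likelihood with a mixture model: $\mathcal{L}(\Phi \mid X) = \sum_t \log p(x^t \mid \Phi) = \sum_t \log \sum_{i=1}^{k} p(x^t \mid G_i)\, P(G_i)$.
- Assume hidden variables z which, when known, make optimization much simpler.
- Complete likelihood, $\mathcal{L}_c(\Phi \mid X, Z)$, in terms of x and z.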

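11 E- and M-steps
Iterate the two steps:
1. E-step: estimate z given X and the current Φ: $Q(\Phi \mid \Phi^l) = E[\mathcal{L}_c(\Phi \mid X, Z) \mid X, \Phi^l]$.
2. M-step: find the new Φ given z, X, and the old Φ: $\Phi^{l+1} = \arg\max_{\Phi} Q(\Phi \mid \Phi^l)$.
An increase in Q increases the incomplete likelihood: $\mathcal{L}(\Phi^{l+1} \mid X) \ge \mathcal{L}(\Phi^l \mid X)$.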

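12 EM in Gaussian Mixtures
- $z_i^t = 1$ if $x^t$ belongs to $G_i$, 0 otherwise (the labels $r_i^t$ of supervised learning); assume $p(x \mid G_i) \sim \mathcal{N}(\mu_i, \Sigma_i)$.
- E-step: $h_i^t = E[z_i^t \mid X, \Phi^l] = \dfrac{p(x^t \mid G_i, \Phi^l)\, P(G_i)}{\sum_j p(x^t \mid G_j, \Phi^l)\, P(G_j)} = P(G_i \mid x^t, \Phi^l)$
- M-step: $P(G_i) = \dfrac{\sum_t h_i^t}{N}$, $\quad m_i^{l+1} = \dfrac{\sum_t h_i^t x^t}{\sum_t h_i^t}$, $\quad S_i^{l+1} = \dfrac{\sum_t h_i^t (x^t - m_i^{l+1})(x^t - m_i^{l+1})^T}{\sum_t h_i^t}$
- Use the estimated labels $h_i^t$ in place of the unknown labels.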
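The two steps can be sketched for a univariate Gaussian mixture as follows. This is a minimal illustration, assuming scalar variances instead of covariance matrices and a small variance floor (`min_var`, an added safeguard that is not part of the slides); the function names, starting values, and toy data are hypothetical.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian N(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def log_likelihood(data, priors, means, variances):
    """Incomplete-data log likelihood: sum_t log sum_i P(G_i) p(x^t | G_i)."""
    return sum(
        math.log(sum(p * gaussian_pdf(x, m, v) for p, m, v in zip(priors, means, variances)))
        for x in data
    )

def em_step(data, priors, means, variances, min_var=1e-6):
    k, n = len(priors), len(data)
    # E-step: h[t][i] = P(G_i | x^t) under the current parameters.
    h = []
    for x in data:
        joint = [p * gaussian_pdf(x, m, v) for p, m, v in zip(priors, means, variances)]
        total = sum(joint)
        h.append([j / total for j in joint])
    # M-step: re-estimate the parameters with h in place of the unknown labels.
    new_priors, new_means, new_vars = [], [], []
    for i in range(k):
        weight = sum(h[t][i] for t in range(n))
        mean = sum(h[t][i] * data[t] for t in range(n)) / weight
        var = sum(h[t][i] * (data[t] - mean) ** 2 for t in range(n)) / weight
        new_priors.append(weight / n)
        new_means.append(mean)
        new_vars.append(max(var, min_var))  # assumption: floor to avoid a collapsed component
    return new_priors, new_means, new_vars

# Toy data with two groups, and hypothetical starting parameters.
data = [-2.1, -1.9, -2.0, -2.2, -1.8, 3.0, 3.2, 2.8, 3.1, 2.9]
priors, means, variances = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
history = []  # log likelihood before each iteration
for _ in range(30):
    history.append(log_likelihood(data, priors, means, variances))
    priors, means, variances = em_step(data, priors, means, variances)
```

Recording the log likelihood at each iteration is a cheap way to check an implementation, since EM should never decrease it.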

13 [Figure] Contour where $P(G_1 \mid x) = h_1 = 0.5$

14 Mixtures of Latent Variable Models
Regularize clusters:
1. Assume shared/diagonal covariance matrices.
2. Use PCA/FA to decrease dimensionality (mixtures of PCA/FA): $p(x^t \mid G_i) = \mathcal{N}(m_i, V_i V_i^T + \Psi_i)$
EM can be used to learn $V_i$ (Ghahramani and Hinton, 1997; Tipping and Bishop, 1999).

15 After Clustering
- Dimensionality reduction methods find correlations between features and group features.
- Clustering methods find similarities between instances and group instances.
- Clustering allows knowledge extraction through the number of clusters, prior probabilities, and cluster parameters, i.e., center and range of features.
- Example: CRM, customer segmentation.

16 Clustering as Preprocessing
- Estimated group labels $h_j$ (soft) or $b_j$ (hard) may be seen as the dimensions of a new k-dimensional space, where we can then learn our discriminant or regressor.
- Local representation (only one $b_j$ is 1, all others are 0; only a few $h_j$ are nonzero) vs. distributed representation (after PCA; all $z_j$ are nonzero).
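The mapping from an input to the new k-dimensional space can be sketched as below. The hard encoding follows the nearest-center rule; the soft encoding is computed here under a simplifying assumption that is not stated in the slides, namely equal priors and a shared spherical covariance `variance * I`. The function names and the centers are hypothetical.

```python
import math

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def hard_encoding(x, centers):
    """b_j: 1 for the nearest center, 0 for all others (local representation)."""
    winner = min(range(len(centers)), key=lambda j: squared_distance(x, centers[j]))
    return [1.0 if j == winner else 0.0 for j in range(len(centers))]

def soft_encoding(x, centers, variance=1.0):
    """h_j: posterior of each cluster given x.

    Assumption: equal priors and a shared spherical covariance, so the
    posterior reduces to a softmax over negative scaled squared distances.
    """
    scores = [-squared_distance(x, m) / (2.0 * variance) for m in centers]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical cluster centers, e.g. as returned by a clustering run.
centers = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
x = (0.2, 0.1)
b = hard_encoding(x, centers)
h = soft_encoding(x, centers)
```

Either vector can then be fed to a discriminant or regressor in place of, or alongside, the original input.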

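17 Mixture of Mixtures
- In classification, the input comes from a mixture of classes (supervised).
- If each class is also a mixture, e.g., of Gaussians (unsupervised), we have a mixture of mixtures:
  $p(x \mid C_i) = \sum_{j=1}^{k_i} p(x \mid G_{ij})\, P(G_{ij})$
  $p(x) = \sum_{i=1}^{K} p(x \mid C_i)\, P(C_i)$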
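A minimal Python sketch of classification with a mixture of mixtures, assuming univariate Gaussian components with known parameters; the class layout (one bimodal class and one unimodal class) and all names are hypothetical and serve only to illustrate the two-level sum.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian N(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def class_likelihood(x, components):
    """p(x | C_i) = sum_j P(G_ij) p(x | G_ij); components holds (prior, mean, var)."""
    return sum(p * gaussian_pdf(x, m, v) for p, m, v in components)

def class_posteriors(x, class_priors, class_components):
    """P(C_i | x) by Bayes' rule, with p(x) given by the mixture of mixtures."""
    joint = [pc * class_likelihood(x, comps)
             for pc, comps in zip(class_priors, class_components)]
    evidence = sum(joint)  # p(x) = sum_i p(x | C_i) P(C_i)
    return [j / evidence for j in joint]

# Hypothetical setup: class 0 is bimodal (around -3 and 3), class 1 is unimodal (around 0).
class_priors = [0.5, 0.5]
class_components = [
    [(0.5, -3.0, 1.0), (0.5, 3.0, 1.0)],
    [(1.0, 0.0, 1.0)],
]
posterior_at_zero = class_posteriors(0.0, class_priors, class_components)
posterior_at_three = class_posteriors(3.0, class_priors, class_components)
```

A single Gaussian per class could not represent the bimodal class in this setup, which is the motivation for letting each class be a mixture.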

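18 Hierarchical Clustering
- Cluster based on similarities/distances.
- Distance measure between instances $x^r$ and $x^s$:
  - Minkowski ($L_p$) (Euclidean for $p = 2$): $d_m(x^r, x^s) = \left[\sum_{j=1}^{d} |x_j^r - x_j^s|^p\right]^{1/p}$
  - City-block distance: $d_{cb}(x^r, x^s) = \sum_{j=1}^{d} |x_j^r - x_j^s|$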
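Both distance measures are short to write down in Python. This is a small sketch with illustrative function names; it uses absolute differences so that odd values of p behave sensibly.

```python
def minkowski_distance(xr, xs, p=2):
    """L_p distance between two equal-length vectors (Euclidean for p = 2)."""
    return sum(abs(a - b) ** p for a, b in zip(xr, xs)) ** (1.0 / p)

def city_block_distance(xr, xs):
    """Sum of absolute coordinate differences (the L_1 case)."""
    return sum(abs(a - b) for a, b in zip(xr, xs))

# Illustrative pair of 2-D instances.
euclidean = minkowski_distance((0.0, 0.0), (3.0, 4.0), p=2)
manhattan = city_block_distance((0.0, 0.0), (3.0, 4.0))
```

City-block distance is the Minkowski distance with p = 1.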

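19 Agglomerative Clustering
- Start with N groups, each with one instance, and merge the two closest groups at each iteration.
- Distance between two groups $G_i$ and $G_j$:
  - Single-link: $d(G_i, G_j) = \min_{x^r \in G_i,\, x^s \in G_j} d(x^r, x^s)$
  - Complete-link: $d(G_i, G_j) = \max_{x^r \in G_i,\, x^s \in G_j} d(x^r, x^s)$
  - Average-link, centroid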
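A minimal Python sketch of the merging loop with single-link and complete-link group distances. It assumes Euclidean distance between instances and a brute-force search over all pairs of groups, which is fine for small data but not efficient; the function names, the `n_groups` stopping rule, and the toy points are illustrative.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def group_distance(gi, gj, linkage):
    """Single-link uses the closest pair, complete-link the farthest pair."""
    pairwise = [euclidean(a, b) for a in gi for b in gj]
    if linkage == "single":
        return min(pairwise)
    if linkage == "complete":
        return max(pairwise)
    raise ValueError("linkage must be 'single' or 'complete'")

def agglomerative(data, n_groups, linkage="single"):
    groups = [[x] for x in data]  # start with one group per instance
    merge_distances = []          # distance at which each merge happened
    while len(groups) > n_groups:
        # Brute-force search for the two closest groups.
        i, j = min(
            ((a, b) for a in range(len(groups)) for b in range(a + 1, len(groups))),
            key=lambda ab: group_distance(groups[ab[0]], groups[ab[1]], linkage),
        )
        merge_distances.append(group_distance(groups[i], groups[j], linkage))
        groups[i] = groups[i] + groups[j]
        del groups[j]
    return groups, merge_distances

# Toy 1-D instances written as 1-tuples (illustrative).
points = [(0.0,), (1.0,), (2.5,), (6.0,), (7.0,)]
single_groups, single_merges = agglomerative(points, n_groups=2, linkage="single")
complete_groups, complete_merges = agglomerative(points, n_groups=1, linkage="complete")
```

The recorded merge distances are what a dendrogram plots on its vertical axis.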

20 Example: Single-Link Clustering
Dendrogram

21 Choosing k
- Defined by the application, e.g., image quantization.
- Plot the data (after PCA) and check for clusters.
- Incremental (leader-cluster) algorithm: add one cluster at a time until an "elbow" appears (in reconstruction error/log likelihood/intergroup distances).
- Manually check for meaning.
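The elbow heuristic can be illustrated by computing the reconstruction error for increasing k. The sketch below is a simplified stand-in rather than the leader-cluster algorithm itself: it runs a 1-D k-means for each k, and it uses a deterministic farthest-point initialization, which is an assumption made here so that the example is reproducible. All names and the toy data are hypothetical.

```python
def farthest_point_init(data, k):
    """Assumed initialization: start from the first instance, then repeatedly
    add the instance farthest from the centers chosen so far."""
    centers = [data[0]]
    while len(centers) < k:
        centers.append(max(data, key=lambda x: min(abs(x - c) for c in centers)))
    return centers

def k_means_error_1d(data, k, max_iter=100):
    """Run 1-D k-means and return the final reconstruction error."""
    centers = farthest_point_init(data, k)
    for _ in range(max_iter):
        # Assignment step: nearest center for each instance.
        labels = [min(range(k), key=lambda j: abs(x - centers[j])) for x in data]
        # Update step: mean of each cluster (old center kept if a cluster is empty).
        new_centers = []
        for j in range(k):
            members = [x for x, lab in zip(data, labels) if lab == j]
            new_centers.append(sum(members) / len(members) if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min((x - c) ** 2 for c in centers) for x in data)

# Toy data with three visible groups (illustrative).
data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]
errors = [k_means_error_1d(data, k) for k in range(1, 5)]  # k = 1, 2, 3, 4
```

On data like this the error drops sharply up to k = 3 and only slightly afterwards; that bend is the elbow. The error alone cannot choose k, because it keeps decreasing as k grows.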
