Clustering K-means
Machine Learning CSE546
Sham Kakade, University of Washington
November 15, 2016

Announcements:
Project Milestones due date passed.
HW3 due on Monday. It'll be collaborative.
HW2 grades posted today; out of 82 points.

Today:
Review: PCA
Start: unsupervised learning

Dimensionality Reduction: PCA
Machine Learning CSE546
Sham Kakade, University of Washington
November 8, 2016

Linear projections, a review
Project a point into a (lower-dimensional) space:
  point: x = (x_1, ..., x_d)
  select a basis: a set of basis vectors (u_1, ..., u_k); we consider an orthonormal basis: u_i · u_i = 1, and u_i · u_j = 0 for i ≠ j
  select a center x̄, which defines the offset of the space
  the best coordinates in the lower-dimensional space are given by dot products: (z_1, ..., z_k), with z_i = (x − x̄) · u_i
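As a concrete illustration of the projection above, here is a minimal NumPy sketch (not from the slides; the names project, reconstruct, X, U, and x_bar are illustrative):

```python
import numpy as np

def project(X, U, x_bar):
    """Coordinates z_i^j = (x_i - x_bar) . u_j for each point.

    X     : (N, d) data points, one per row
    U     : (d, k) orthonormal basis vectors u_1, ..., u_k as columns
    x_bar : (d,) center defining the offset of the space
    """
    return (X - x_bar) @ U           # (N, k) coordinates

def reconstruct(Z, U, x_bar):
    """Map coordinates back into R^d: x_hat = x_bar + sum_j z^j u_j."""
    return x_bar + Z @ U.T
```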

PCA finds the projection that minimizes reconstruction error
Given N data points: x_i = (x_i^1, ..., x_i^d), i = 1..N
Represent each point as a projection:
  x̂_i = x̄ + Σ_{j=1}^k z_i^j u_j, where z_i^j = (x_i − x̄) · u_j and x̄ = (1/N) Σ_{i=1}^N x_i
PCA: given k << d, find (u_1, ..., u_k) minimizing the reconstruction error:
  error_k = Σ_{i=1}^N ||x_i − x̂_i||²

Understanding the reconstruction error
Note that x_i can be represented exactly by a d-dimensional projection:
  x_i = x̄ + Σ_{j=1}^d z_i^j u_j
Given k << d, find (u_1, ..., u_k) minimizing the reconstruction error. Rewriting the error:
  error_k = Σ_{i=1}^N Σ_{j=k+1}^d (z_i^j)² = Σ_{i=1}^N Σ_{j=k+1}^d (u_j · (x_i − x̄))²

Reconstruction error and covariance matrix
  error_k = Σ_{i=1}^N Σ_{j=k+1}^d u_j^T (x_i − x̄)(x_i − x̄)^T u_j = N Σ_{j=k+1}^d u_j^T Σ u_j,
  where Σ = (1/N) Σ_{i=1}^N (x_i − x̄)(x_i − x̄)^T is the empirical covariance matrix.

Minimizing reconstruction error and eigenvectors
Minimizing the reconstruction error is equivalent to picking an orthonormal basis (u_1, ..., u_d) minimizing:
  Σ_{j=k+1}^d u_j^T Σ u_j
Eigenvector definition: Σ u = λ u
Solution: use the top-k eigenvectors of the covariance matrix, i.e., the eigenvectors obtained from the SVD of the centered data.
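A sketch of that SVD solution, under the assumption that the top-k right singular vectors of the centered data matrix are taken as the basis (names are illustrative):

```python
import numpy as np

def pca(X, k):
    """PCA via SVD of the centered data matrix X (shape (N, d)), k << d."""
    x_bar = X.mean(axis=0)
    Xc = X - x_bar
    # The right singular vectors of Xc are the eigenvectors of the
    # empirical covariance (1/N) Xc^T Xc.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    U = Vt[:k].T          # (d, k) top-k principal directions
    Z = Xc @ U            # (N, k) coordinates z_i
    return x_bar, U, Z
```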

Clustering K-means
Machine Learning CSE546
Sham Kakade, University of Washington
November 15, 2016

Clustering images
Set of images [Goldberger et al.]

Clustering web search results [figure]

Some Data [figure]

K-means
1. Ask user how many clusters they'd like (e.g., k = 5).
2. Randomly guess k cluster center locations.
3. Each datapoint finds out which center it's closest to (thus each center "owns" a set of datapoints).
4. Each center finds the centroid of the points it owns...
5. ...and jumps there.
6. Repeat until terminated!

K-means
Randomly initialize k centers: μ^(0) = μ_1^(0), ..., μ_k^(0)
Classify: assign each point j ∈ {1, ..., N} to the nearest center:
  C(j) ← argmin_i ||μ_i − x_j||²
Recenter: μ_i becomes the centroid of its points:
  μ_i^(t+1) ← argmin_μ Σ_{j : C(j) = i} ||μ − x_j||²
Equivalent to μ_i ← average of its points!
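A minimal NumPy sketch of the classify/recenter loop above (assumptions: centers are initialized at k random data points, and an empty cluster simply keeps its previous center):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain k-means on X (shape (N, d)); returns centers mu and assignments C."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iters):
        # Classify: assign each point to its nearest center.
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (N, k)
        C = dists.argmin(axis=1)
        # Recenter: each center becomes the centroid of the points it owns.
        new_mu = np.array([X[C == i].mean(axis=0) if np.any(C == i) else mu[i]
                           for i in range(k)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return mu, C
```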

What is K-means optimizing?
Potential function F(μ, C) of centers μ and point allocations C:
  F(μ, C) = Σ_{j=1}^N ||μ_{C(j)} − x_j||²
Optimal K-means: min_μ min_C F(μ, C)

Does K-means converge??? Part 1
Optimize the potential function: fix μ, optimize C.
Assigning each point to its nearest center minimizes F over C for the current centers, so this step can only decrease F.
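Continuing the sketch above, the potential F(μ, C) can be evaluated directly; each classify step minimizes it over C and each recenter step minimizes it over μ, so it is non-increasing across iterations (helper name is illustrative):

```python
import numpy as np

def kmeans_objective(X, mu, C):
    """F(mu, C) = sum_j ||x_j - mu_{C(j)}||^2."""
    return ((X - mu[C]) ** 2).sum()
```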

Does K-means converge??? Part 2
Optimize the potential function: fix C, optimize μ.
Setting each μ_i to the mean of the points assigned to it minimizes F over μ for the current assignments, so this step can also only decrease F.

Coordinate descent algorithms
Want: min_a min_b F(a, b)
Coordinate descent:
  fix a, minimize over b
  fix b, minimize over a
  repeat
Converges!!!
  if F is bounded
  to a (often good??) local optimum
  (For LASSO it converged to the global optimum, because of convexity.)
  Some theory exists on the quality of the local optimum.
K-means is a coordinate descent algorithm!

Mixtures of Gaussians
Machine Learning CSE546
Sham Kakade, University of Washington
November 8, 2016

(One) bad case for k-means
Clusters may overlap
Some clusters may be wider than others

Density Estimation
Estimate a density based on x_1, ..., x_N

[Figure: contour plot of the joint density]

Density as Mixture of Gaussians
Approximate the density with a mixture of Gaussians
[Figures: mixture of 3 Gaussians; contour plot of the joint density]

Gaussians in d Dimensions
  P(x) = 1 / ((2π)^{d/2} |Σ|^{1/2}) · exp( −(1/2) (x − µ)^T Σ^{−1} (x − µ) )
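The d-dimensional Gaussian density above, as a small NumPy sketch (the helper name gaussian_density is illustrative):

```python
import numpy as np

def gaussian_density(x, mu, Sigma):
    """N(x | mu, Sigma) = (2 pi)^(-d/2) |Sigma|^(-1/2)
       * exp(-0.5 (x - mu)^T Sigma^{-1} (x - mu))."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)             # log |Sigma|
    quad = diff @ np.linalg.solve(Sigma, diff)       # (x-mu)^T Sigma^{-1} (x-mu)
    return float(np.exp(-0.5 * (d * np.log(2 * np.pi) + logdet + quad)))
```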

Density as Mixture of Gaussians
Approximate the density with a mixture of Gaussians
[Figure: mixture of 3 Gaussians]
  p(x_i | π, µ, Σ) = Σ_{k=1}^K π_k N(x_i | µ_k, Σ_k)

Density as Mixture of Gaussians
Approximate the density with a mixture of Gaussians
[Figures: mixture of 3 Gaussians; our actual observations]

Clustering our Observations
Imagine we have an assignment of each x_i to a Gaussian
[Figures: our actual observations; complete data labeled by true cluster assignments]

Clustering our Observations
Imagine we have an assignment of each x_i to a Gaussian
Introduce a latent cluster indicator variable z_i
Then we have:
  p(x_i | z_i, π, µ, Σ) = N(x_i | µ_{z_i}, Σ_{z_i})
[Figure: complete data labeled by true cluster assignments]

Clustering our Observations
We must infer the cluster assignments from the observations
Posterior probabilities of assignments to each cluster *given* model parameters:
  r_ik = p(z_i = k | x_i, π, µ, Σ) = π_k N(x_i | µ_k, Σ_k) / Σ_{j=1}^K π_j N(x_i | µ_j, Σ_j)
[Figure: soft assignments to clusters]

Unsupervised Learning: not as hard as it looks
Sometimes easy
Sometimes impossible
and sometimes in between

Summary of GMM Concept
Estimate a density based on x_1, ..., x_N:
  p(x_i | π, µ, Σ) = Σ_{z_i=1}^K π_{z_i} N(x_i | µ_{z_i}, Σ_{z_i})
[Figures: complete data labeled by true cluster assignments; surface plot of the joint density, marginalizing cluster assignments]

Summary of GMM Components
Observations: x_i ∈ R^d, i = 1, 2, ..., N
Hidden cluster labels: z_i ∈ {1, 2, ..., K}, i = 1, 2, ..., N
Hidden mixture means: µ_k ∈ R^d, k = 1, 2, ..., K
Hidden mixture covariances: Σ_k ∈ R^{d×d}, k = 1, 2, ..., K
Hidden mixture probabilities: π_k, with Σ_{k=1}^K π_k = 1
Gaussian mixture marginal and conditional likelihood:
  p(x_i | π, µ, Σ) = Σ_{z_i=1}^K π_{z_i} p(x_i | z_i, µ, Σ)
  p(x_i | z_i, µ, Σ) = N(x_i | µ_{z_i}, Σ_{z_i})
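The marginal likelihood of the mixture is then just the weighted sum of component densities; this sketch reuses the gaussian_density helper from the earlier sketch (names are illustrative, not from the slides):

```python
def gmm_marginal_density(x, pi, mus, Sigmas):
    """p(x | pi, mu, Sigma) = sum_k pi_k N(x | mu_k, Sigma_k).

    pi : (K,) mixture probabilities summing to 1
    mus : (K, d) component means; Sigmas : (K, d, d) component covariances
    """
    return sum(pi_k * gaussian_density(x, mu_k, Sigma_k)
               for pi_k, mu_k, Sigma_k in zip(pi, mus, Sigmas))
```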

Expectation Maximization
Machine Learning CSE546
Sham Kakade, University of Washington
November 8, 2016

Next... back to Density Estimation
What if we want to do density estimation with multimodal or clumpy data?

But we don't see class labels!!!
MLE: argmax ∏_i P(z_i, x_i)
But we don't know z_i
Maximize the marginal likelihood:
  argmax ∏_i P(x_i) = argmax ∏_i Σ_{k=1}^K P(z_i = k, x_i)

Special case: spherical Gaussians and hard assignments
  P(z_i = k, x_i) = 1 / ((2π)^{d/2} |Σ_k|^{1/2}) · exp( −(1/2) (x_i − µ_k)^T Σ_k^{−1} (x_i − µ_k) ) · P(z_i = k)
If P(x | z = k) is spherical, with the same σ for all classes:
  P(x_i | z_i = k) ∝ exp( −(1/(2σ²)) ||x_i − µ_k||² )
If each x_i belongs to one class C(i) (hard assignment), the marginal likelihood is:
  ∏_{i=1}^N Σ_{k=1}^K P(x_i, z_i = k) ∝ ∏_{i=1}^N exp( −(1/(2σ²)) ||x_i − µ_{C(i)}||² )
Same as K-means!!!
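To make the "same as K-means" claim concrete, here is the hard-assignment log likelihood written out (a standard step, assuming equal class priors and a shared spherical covariance σ²I as on the slide):

```latex
\log \prod_{i=1}^{N} P\big(x_i, z_i = C(i)\big)
  \;=\; -\frac{1}{2\sigma^2} \sum_{i=1}^{N} \lVert x_i - \mu_{C(i)} \rVert^2 \;+\; \text{const}.
```

Maximizing this over the assignments C and the centers µ is therefore exactly minimizing the K-means potential F(µ, C) = Σ_i ||x_i − µ_{C(i)}||².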

Supervised Learning of Mixtures of Gaussians
Mixtures of Gaussians:
  Prior class probabilities: P(z = k)
  Likelihood function per class: P(x | z = k)
Suppose, for each data point, we know both the location x and the class z.
Learning is easy :)
  For the prior P(z): use the empirical class fractions.
  For the likelihood function: fit a Gaussian to the points in each class (sample mean and covariance).

EM: Reducing Unsupervised Learning to Supervised Learning
If we knew the assignment of points to classes → Supervised Learning!
Expectation-Maximization (EM):
  Guess the assignment of points to classes
  (In standard "soft" EM: each point is associated with a probability of being in each class.)
  Recompute the model parameters
  Iterate

Form of Likelihood
Conditioned on the class of point x_i:
  p(x_i | z_i, µ, Σ) = N(x_i | µ_{z_i}, Σ_{z_i})
Marginalizing the class assignment:
  p(x_i | π, µ, Σ) = Σ_{k=1}^K π_k N(x_i | µ_k, Σ_k)

Gaussian Mixture Model
The most commonly used mixture model
Observations: x_i
Hidden labels: z_i
Parameters: π, µ, Σ
Likelihood: as above
Example: z_i = country of origin, x_i = height of the i-th person;
the k-th mixture component = distribution of heights in country k

Example (taken from Kevin Murphy's ML textbook)
Data: gene expression levels
Goal: cluster genes with similar expression trajectories

Mixture models are useful for...
Density estimation: allows for a multimodal density
Clustering: want membership information for each observation (e.g., the topic of the current document)
  Soft clustering: p(z_i = k | x_i, θ)
  Hard clustering: z_i* = argmax_k p(z_i = k | x_i, θ)

Issues
Label switching:
  Color = label does not matter; we can switch the labels and the likelihood is unchanged.
The log likelihood is not convex in the parameters.
The problem is simpler for the complete-data likelihood.

ML Estimate of Mixture Model Parameters
Log likelihood:
  L_x(θ) ≜ log p({x_i} | θ) = Σ_i log Σ_{z_i} p(x_i, z_i | θ)
Want the ML estimate: θ̂_ML = argmax_θ L_x(θ)
Neither convex nor concave, and has local optima.

If complete data were observed
Assume the class labels z_i were observed in addition to x_i:
  L_{x,z}(θ) = Σ_i log p(x_i, z_i | θ)
Compute ML estimates: the objective separates over clusters k!
Example: mixture of Gaussians (MoG), θ = {π_k, µ_k, Σ_k}_{k=1}^K

Iterative Algorithm
Motivates a coordinate-ascent-like algorithm:
1. Infer the missing values z_i given an estimate θ̂ of the parameters
2. Optimize the parameters to produce a new θ̂ given the filled-in data z_i
3. Repeat
Example: MoG (derivation soon + HW); a code sketch follows below.
1. Infer responsibilities:
  r_ik = p(z_i = k | x_i, θ̂^(t−1)) = π̂_k N(x_i | µ̂_k, Σ̂_k) / Σ_j π̂_j N(x_i | µ̂_j, Σ̂_j)
2. Optimize parameters:
  max w.r.t. π_k:  π̂_k = (1/N) Σ_i r_ik
  max w.r.t. µ_k, Σ_k:  µ̂_k = (Σ_i r_ik x_i) / (Σ_i r_ik),  Σ̂_k = (Σ_i r_ik (x_i − µ̂_k)(x_i − µ̂_k)^T) / (Σ_i r_ik)
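A minimal sketch of these E/M updates for a mixture of Gaussians (assumes SciPy for the Gaussian pdf; the initialization, iteration count, and small diagonal regularizer are illustrative choices, not prescribed by the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, seed=0):
    """EM for a mixture of Gaussians on X (shape (N, d)).

    Returns (pi, mus, Sigmas, R): mixture weights, means, covariances,
    and the (N, K) responsibilities r_ik from the final E-step.
    """
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Initialize: random observations as means, shared covariance, uniform weights.
    mus = X[rng.choice(N, size=K, replace=False)].copy()
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iters):
        # E-step: r_ik = pi_k N(x_i | mu_k, Sigma_k) / sum_j pi_j N(x_i | mu_j, Sigma_j)
        R = np.column_stack([pi[k] * multivariate_normal.pdf(X, mus[k], Sigmas[k])
                             for k in range(K)])
        R /= R.sum(axis=1, keepdims=True)
        # M-step: weighted MLE updates for pi_k, mu_k, Sigma_k.
        Nk = R.sum(axis=0)                      # effective counts per cluster
        pi = Nk / N
        mus = (R.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (R[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return pi, mus, Sigmas, R
```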

E.M. Convergence
EM is coordinate ascent on an interesting potential function.
Coordinate ascent on a bounded potential function → convergence to a local optimum guaranteed.
This algorithm is REALLY USED. And in high-dimensional state spaces, too. E.g., vector quantization for speech data.

Gaussian Mixture Example: Start [figure]

[Figures: the Gaussian mixture fit after the 1st, 2nd, 3rd, 4th, 5th, 6th, and 20th iterations]

Some Bio Assay data [figure]

GMM clustering of the assay data [figure]

Resulting Density Estimator [figure]

Expectation Maximization (EM): Setup
More broadly applicable than just to the mixture models considered so far.
Model:
  x: observable, "incomplete" data
  y: not (fully) observable, "complete" data
  θ: parameters
Interested in maximizing (w.r.t. θ):
  p(x | θ) = Σ_y p(x, y | θ)
Special case: x = g(y)

Expectation Maximization (EM): Derivation
Step 1: Rewrite the desired likelihood in terms of complete-data terms:
  p(y | θ) = p(y | x, θ) p(x | θ)
Step 2: Assume an estimate θ̂ of the parameters and take the expectation with respect to p(y | x, θ̂).
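Written out, Steps 1-2 give the decomposition used in the rest of the derivation (the sign convention for V is an assumption chosen here so that it matches the "Coordinate Ascent Behavior" slide below):

```latex
% Definitions (Step 2), with the sign of V chosen to match the later slides:
U(\theta, \hat\theta) = \mathbb{E}\!\left[\log p(y \mid \theta) \mid x, \hat\theta\right],
\qquad
V(\theta, \hat\theta) = -\,\mathbb{E}\!\left[\log p(y \mid x, \theta) \mid x, \hat\theta\right].
% From Step 1, \log p(x \mid \theta) = \log p(y \mid \theta) - \log p(y \mid x, \theta);
% the left-hand side does not depend on y, so taking E[\,\cdot \mid x, \hat\theta] gives:
L_x(\theta) = U(\theta, \hat\theta) + V(\theta, \hat\theta).
```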

Expectation Maximization (EM): Derivation
Step 3: Consider the log likelihood of the data at any θ, L_x(θ) − L_x(θ̂), relative to the log likelihood at θ̂.
Aside, Gibbs Inequality: E_p[log p(x)] ≥ E_p[log q(x)]
(Proof: the difference is the KL divergence between p and q, which is non-negative.)

Motivates the EM Algorithm
Initial guess: θ̂^(0)
Estimate at iteration t: θ̂^(t)
E-Step: compute U(θ, θ̂^(t)) = E[log p(y | θ) | x, θ̂^(t)]
M-Step: compute θ̂^(t+1) = argmax_θ U(θ, θ̂^(t))

Expectation Maximization (EM): Derivation
  L_x(θ) − L_x(θ̂) = [U(θ, θ̂) − U(θ̂, θ̂)] + [V(θ, θ̂) − V(θ̂, θ̂)]
Step 4: Determine conditions under which the log likelihood at θ exceeds that at θ̂.
Using the Gibbs inequality, V(θ, θ̂) ≥ V(θ̂, θ̂), so:
  If U(θ, θ̂) ≥ U(θ̂, θ̂)
  Then L_x(θ) ≥ L_x(θ̂)

Example: Mixture Models
E-Step: compute U(θ, θ̂^(t)) = E[log p(y | θ) | x, θ̂^(t)]
M-Step: compute θ̂^(t+1) = argmax_θ U(θ, θ̂^(t))
Consider y_i = {z_i, x_i} i.i.d.:
  p(x_i, z_i | θ) = π_{z_i} p(x_i | z_i) = π_{z_i} N(x_i | µ_{z_i}, Σ_{z_i})
  E_{q_t}[log p(y | θ)] = Σ_i E_{q_t}[log p(x_i, z_i | θ)] = Σ_i Σ_k r_ik ( log π_k + log N(x_i | µ_k, Σ_k) )
  (where r_ik = p(z_i = k | x_i, θ̂^(t)) are the responsibilities under q_t)

Coordinate Ascent Behavior
Bound on the log likelihood:
  L_x(θ) = U(θ, θ̂^(t)) + V(θ, θ̂^(t))
  L_x(θ̂^(t)) = U(θ̂^(t), θ̂^(t)) + V(θ̂^(t), θ̂^(t))
[Figure from the Kevin Murphy textbook]

Comments on EM
Since the Gibbs inequality is satisfied with equality only if p = q, any step that changes θ should strictly increase the likelihood.
In practice, the M-Step can be replaced by merely increasing U instead of maximizing it (Generalized EM).
Under certain conditions (e.g., in the exponential family), one can show that EM converges to a stationary point of L_x(θ).
Often there is a natural choice for y; it has physical meaning.
If you want to choose any y, not necessarily with x = g(y), replace p(y | θ) in U with p(y, x | θ).

Initialization
In the mixture model case, where y_i = {z_i, x_i}, there are many ways to initialize the EM algorithm.
Examples (see the sketch below):
  Choose K observations at random to define each cluster; assign the other observations to the nearest centroid to form initial parameter estimates.
  Pick the centers sequentially to provide good coverage of the data.
  Grow the mixture model by splitting (and sometimes removing) clusters until K clusters are formed.
Initialization can be quite important to convergence rates in practice.

What you should know
K-means for clustering:
  The algorithm converges because it is coordinate descent on a bounded potential function.
EM for mixtures of Gaussians:
  How to learn maximum likelihood parameters (locally maximizing the likelihood) in the case of unlabeled data.
Be happy with this kind of probabilistic analysis.
Remember: E.M. can get stuck in local optima, and empirically it DOES.
EM is coordinate ascent.
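A sketch of the first initialization listed above: choose K observations at random, assign the rest to the nearest one, and form initial parameter estimates from the resulting hard clusters. It assumes every cluster ends up with at least a couple of points; the function name and regularizer are illustrative.

```python
import numpy as np

def init_from_random_observations(X, K, seed=0):
    """Initial (pi, mus, Sigmas) for EM from K randomly chosen observations."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    centers = X[rng.choice(N, size=K, replace=False)]
    # Assign every observation to the nearest of the K chosen centers.
    C = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    pi = np.array([(C == k).mean() for k in range(K)])
    mus = np.array([X[C == k].mean(axis=0) for k in range(K)])
    Sigmas = np.array([np.cov(X[C == k].T) + 1e-6 * np.eye(d) for k in range(K)])
    return pi, mus, Sigmas
```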