Clustering K-means. Clustering images. Machine Learning CSE546 Carlos Guestrin University of Washington. November 4, 2014.


Clustering K-means
Machine Learning CSE546, Carlos Guestrin, University of Washington, November 4, 2014

Clustering images
Set of images [Goldberger et al.]

K-means
Randomly initialize k centers: \mu^{(0)} = \mu_1^{(0)}, \ldots, \mu_k^{(0)}
Classify: assign each point j \in \{1, \ldots, N\} to the nearest center:
  C(j) \leftarrow \arg\min_i \|\mu_i - x_j\|^2
Recenter: \mu_i becomes the centroid of its points:
  \mu_i^{(t+1)} \leftarrow \arg\min_{\mu} \sum_{j : C(j) = i} \|\mu - x_j\|^2
Equivalent to \mu_i = average of its points! (A NumPy sketch of this loop follows after the next title slide.)

Mixtures of Gaussians
Machine Learning CSE546, Carlos Guestrin, University of Washington, November 4, 2014
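Here is the K-means sketch referenced above: a minimal NumPy implementation of the classify/recenter loop. The function name kmeans, the fixed iteration count, and the random initialization are illustrative choices, not part of the lecture slides.

    import numpy as np

    def kmeans(X, k, n_iters=20, seed=0):
        # X: (N, d) data matrix; k: number of clusters.
        rng = np.random.default_rng(seed)
        # Initialize centers by picking k observations at random.
        centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
        for _ in range(n_iters):
            # Classify: assign each point to its nearest center.
            dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            assign = dists.argmin(axis=1)
            # Recenter: each center becomes the centroid (mean) of its points.
            for i in range(k):
                if np.any(assign == i):
                    centers[i] = X[assign == i].mean(axis=0)
        return centers, assign

Usage: centers, assign = kmeans(X, k=3) for an (N, d) array X.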

(One) bad case for k-means
Clusters may overlap
Some clusters may be wider than others

Density Estimation
Estimate a density based on x_1, \ldots, x_N

Density as Mixture of Gaussians
Approximate the density with a mixture of Gaussians
[Figures: mixture of 3 Gaussians; contour plot of the joint density]

Gaussians in d Dimensions
P(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)
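To make the d-dimensional density concrete, here is a small NumPy sketch that evaluates it (scipy.stats.multivariate_normal.pdf computes the same quantity); the helper name gaussian_pdf is just illustrative.

    import numpy as np

    def gaussian_pdf(x, mu, Sigma):
        # N(x | mu, Sigma) for a d-dimensional x.
        d = len(mu)
        diff = x - mu
        norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
        quad = diff @ np.linalg.solve(Sigma, diff)  # (x - mu)^T Sigma^{-1} (x - mu)
        return np.exp(-0.5 * quad) / norm_const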

Density as Mixture of Gaussians
Approximate the density with a mixture of Gaussians
[Figure: mixture of 3 Gaussians]
p(x_i | \pi, \mu, \Sigma) = \sum_{z_i = 1}^{K} \pi_{z_i} N(x_i | \mu_{z_i}, \Sigma_{z_i})

Density as Mixture of Gaussians
Approximate the density with a mixture of Gaussians
[Figures: mixture of 3 Gaussians; our actual observations. C. Bishop, Pattern Recognition & Machine Learning]

Clustering our Observations
Imagine we have an assignment of each x_i to a Gaussian
[Figures: (a) complete data labeled by true cluster assignments; (b) our actual observations. C. Bishop, Pattern Recognition & Machine Learning]

Clustering our Observations
Imagine we have an assignment of each x_i to a Gaussian
Introduce a latent cluster indicator variable z_i
Then we have p(x_i | z_i, \pi, \mu, \Sigma) = N(x_i | \mu_{z_i}, \Sigma_{z_i})
[Figure: (a) complete data labeled by true cluster assignments. C. Bishop, Pattern Recognition & Machine Learning]

Clustering our Observations
We must infer the cluster assignments from the observations
Posterior probabilities of assignments to each cluster *given* model parameters:
r_{ik} = p(z_i = k | x_i, \pi, \mu, \Sigma) = \frac{\pi_k N(x_i | \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j N(x_i | \mu_j, \Sigma_j)}
Soft assignments to clusters (a short NumPy sketch of this computation appears after the next slide)
[Figure: (c) soft assignments. C. Bishop, Pattern Recognition & Machine Learning]

Unsupervised Learning: not as hard as it looks
Sometimes easy
Sometimes impossible
And sometimes in between
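Here is the short NumPy/SciPy sketch of the soft-assignment computation referenced above; the function name responsibilities is illustrative.

    import numpy as np
    from scipy.stats import multivariate_normal

    def responsibilities(X, pis, mus, Sigmas):
        # r[i, k] proportional to pi_k * N(x_i | mu_k, Sigma_k); each row normalized to sum to 1.
        N, K = len(X), len(pis)
        r = np.zeros((N, K))
        for k in range(K):
            r[:, k] = pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
        return r / r.sum(axis=1, keepdims=True)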

Summary of GMM Concept
Estimate a density based on x_1, \ldots, x_N:
p(x_i | \pi, \mu, \Sigma) = \sum_{z_i = 1}^{K} \pi_{z_i} N(x_i | \mu_{z_i}, \Sigma_{z_i})
[Figures: complete data labeled by true cluster assignments; surface plot of the joint density, marginalizing cluster assignments]

Summary of GMM Components
Observations: x_i \in R^d, i = 1, 2, \ldots, N
Hidden cluster labels: z_i \in \{1, 2, \ldots, K\}, i = 1, 2, \ldots, N
Hidden mixture means: \mu_k \in R^d, k = 1, 2, \ldots, K
Hidden mixture covariances: \Sigma_k \in R^{d \times d}, k = 1, 2, \ldots, K
Hidden mixture probabilities: \pi_k, with \sum_{k=1}^{K} \pi_k = 1
Gaussian mixture marginal and conditional likelihood:
p(x_i | \pi, \mu, \Sigma) = \sum_{z_i = 1}^{K} \pi_{z_i} p(x_i | z_i, \mu, \Sigma)
p(x_i | z_i, \mu, \Sigma) = N(x_i | \mu_{z_i}, \Sigma_{z_i})
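As a quick companion to this summary, a small NumPy/SciPy sketch that evaluates the marginal log likelihood \sum_i \log p(x_i | \pi, \mu, \Sigma); the name gmm_log_likelihood is illustrative.

    import numpy as np
    from scipy.stats import multivariate_normal

    def gmm_log_likelihood(X, pis, mus, Sigmas):
        # Sum over components of pi_k * N(x_i | mu_k, Sigma_k), then log and sum over points.
        K = len(pis)
        mix = sum(pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k]) for k in range(K))
        return np.log(mix).sum()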

Expectation Maximization
Machine Learning CSE546, Carlos Guestrin, University of Washington, November 6, 2014

Next... back to Density Estimation
What if we want to do density estimation with multimodal or "clumpy" data?

But we don't see class labels!!!
MLE: \arg\max_\theta \prod_i P(z_i, x_i)
But we don't know z_i
Maximize the marginal likelihood:
\arg\max_\theta \prod_i P(x_i) = \arg\max_\theta \prod_i \sum_{k=1}^{K} P(z_i = k, x_i)

Special case: spherical Gaussians and hard assignments
P(z_i = k, x_i) = \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} \exp\left( -\frac{1}{2} (x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k) \right) P(z_i = k)
If P(x | z = k) is spherical, with the same \sigma for all classes:
P(x_i | z_i = k) \propto \exp\left( -\frac{1}{2\sigma^2} \|x_i - \mu_k\|^2 \right)
If each x_i belongs to one class C(i) (hard assignment), the marginal likelihood is:
\prod_{i=1}^{N} \sum_{k=1}^{K} P(x_i, z_i = k) \propto \prod_{i=1}^{N} \exp\left( -\frac{1}{2\sigma^2} \|x_i - \mu_{C(i)}\|^2 \right)
Same as K-means!!!
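Taking logs makes the reduction explicit; this one-step derivation is an added note, not from the slides:

    \log \prod_{i=1}^{N} \exp\left( -\frac{1}{2\sigma^2} \|x_i - \mu_{C(i)}\|^2 \right)
      = -\frac{1}{2\sigma^2} \sum_{i=1}^{N} \|x_i - \mu_{C(i)}\|^2 ,

so maximizing this hard-assignment likelihood over the assignments C(i) and the means \mu_k is exactly minimizing the K-means objective \sum_i \|x_i - \mu_{C(i)}\|^2.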

EM: Reducing Unsupervised Learning to Supervised Learning
If we knew the assignment of points to classes ⇒ Supervised Learning!
Expectation-Maximization (EM):
Guess an assignment of points to classes
In standard ("soft") EM: each point is associated with a probability of being in each class
Recompute the model parameters
Iterate

Generic Mixture Models
Observations: x_i, i = 1, \ldots, N
Parameters: \theta = mixture weights \pi_k plus component parameters (for MoG: \mu_k, \Sigma_k)
Likelihood: p(x_i | \theta) = \sum_{k=1}^{K} \pi_k p(x_i | z_i = k, \theta)
MoG Example: z_i = country of origin, x_i = height of the i-th person; the k-th mixture component = distribution of heights in country k

ML Estimate of Mixture Model Params
Log likelihood: L_x(\theta) \triangleq \log p(\{x_i\} | \theta) = \sum_i \log \sum_{z_i} p(x_i, z_i | \theta)
Want the ML estimate: \hat{\theta}_{ML} = \arg\max_\theta L_x(\theta)
Neither convex nor concave, and local optima

If complete data were observed...
Assume the class labels z_i were observed in addition to the x_i:
L_{x,z}(\theta) = \sum_i \log p(x_i, z_i | \theta)
Compute ML estimates
Separates over clusters k!
Example: mixture of Gaussians (MoG), \theta = \{\pi_k, \mu_k, \Sigma_k\}_{k=1}^{K}

Iterative Algorithm
Motivates a coordinate-ascent-like algorithm:
1. Infer the missing values z_i given an estimate of the parameters \hat{\theta}
2. Optimize the parameters to produce a new \hat{\theta} given the "filled in" data z_i
3. Repeat
Example: MoG (derivation soon; a NumPy sketch of this loop follows after the convergence slide below)
1. Infer "responsibilities":
r_{ik} = p(z_i = k | x_i, \hat{\theta}^{(t-1)}) = \frac{\hat{\pi}_k N(x_i | \hat{\mu}_k, \hat{\Sigma}_k)}{\sum_{j=1}^{K} \hat{\pi}_j N(x_i | \hat{\mu}_j, \hat{\Sigma}_j)}
2. Optimize the parameters:
max w.r.t. \pi_k: \hat{\pi}_k = \frac{1}{N} \sum_{i=1}^{N} r_{ik}
max w.r.t. \mu_k, \Sigma_k: \hat{\mu}_k = \frac{\sum_i r_{ik} x_i}{\sum_i r_{ik}}, \quad \hat{\Sigma}_k = \frac{\sum_i r_{ik} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T}{\sum_i r_{ik}}

E.M. Convergence
EM is coordinate ascent on an interesting potential function
Coordinate ascent on a bounded potential function ⇒ convergence to a local optimum guaranteed
This algorithm is REALLY USED. And in high dimensional state spaces, too. E.g., vector quantization for speech data
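Here is the NumPy/SciPy sketch of the E-step/M-step loop referenced above, specialized to a mixture of Gaussians. The function name em_gmm, the fixed iteration count, the random initialization, and the small ridge added to each covariance are illustrative choices, not prescribed by the lecture.

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_gmm(X, K, n_iters=50, seed=0):
        N, d = X.shape
        rng = np.random.default_rng(seed)
        # Initialize: K random observations as means, identity covariances, uniform weights.
        mus = X[rng.choice(N, size=K, replace=False)].astype(float)
        Sigmas = np.stack([np.eye(d) for _ in range(K)])
        pis = np.full(K, 1.0 / K)
        for _ in range(n_iters):
            # E-step: responsibilities r[i, k] = p(z_i = k | x_i, theta_hat).
            resp = np.zeros((N, K))
            for k in range(K):
                resp[:, k] = pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
            resp /= resp.sum(axis=1, keepdims=True)
            # M-step: weighted maximum likelihood estimates of pi_k, mu_k, Sigma_k.
            Nk = resp.sum(axis=0)               # effective number of points per cluster
            pis = Nk / N
            mus = (resp.T @ X) / Nk[:, None]
            for k in range(K):
                diff = X - mus[k]
                Sigmas[k] = (resp[:, k, None] * diff).T @ diff / Nk[k]
                Sigmas[k] += 1e-6 * np.eye(d)   # keep covariances well conditioned
        return pis, mus, Sigmas, resp

Usage: pis, mus, Sigmas, resp = em_gmm(X, K=3); resp.argmax(axis=1) gives hard cluster labels if needed.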

Gaussian Mixture Example: Start
After first iteration
After 2nd iteration
After 3rd iteration
After 4th iteration
After 5th iteration
After 6th iteration
After 20th iteration
[Figures only: the fitted mixture at each stage of EM]

Some Bio Assay data

GMM clustering of the assay data

Resulting Density Estimator

E.M.: The General Case
E.M. is widely used beyond mixtures of Gaussians; the recipe is the same.
Expectation step: fill in the missing data, given the current values of the parameters \theta^{(t)}
If variable y is missing (could be many variables), compute, for each data point x_j and each value i of y: P(y = i | x_j, \theta^{(t)})
Maximization step: find maximum likelihood parameters for the (weighted) "completed data":
For each data point x_j, create k weighted data points
Set \theta^{(t+1)} to the maximum likelihood parameter estimate for this weighted data
Repeat

Initialization
In the mixture model case where y_i = \{z_i, x_i\}, there are many ways to initialize the EM algorithm. Examples:
Choose K observations at random to define each cluster; assign other observations to the nearest centroid to form initial parameter estimates (a sketch of this strategy appears after the next slide)
Pick the centers sequentially to provide good coverage of the data
Grow the mixture model by splitting (and sometimes removing) clusters until K clusters are formed
Can be quite important to the quality of the solution in practice

What you should know
K-means for clustering: the algorithm converges because it is coordinate ascent
EM for mixtures of Gaussians: how to learn maximum likelihood parameters (locally maximum likelihood) in the case of unlabeled data
Remember, E.M. can get stuck in local minima, and empirically it DOES
EM is coordinate ascent
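Here is the sketch of the first initialization strategy referenced above (K random observations as centers, nearest-centroid hard assignment to seed the parameters); all names are illustrative, and the identity-covariance fallback for tiny clusters is an assumption of this sketch, not from the slides.

    import numpy as np

    def init_from_random_points(X, K, seed=0):
        N, d = X.shape
        rng = np.random.default_rng(seed)
        mus = X[rng.choice(N, size=K, replace=False)].astype(float)
        # Assign every observation to its nearest initial center.
        dists = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        # Seed mixture weights and covariances from these hard assignments.
        pis = np.array([(assign == k).mean() for k in range(K)])
        Sigmas = []
        for k in range(K):
            members = X[assign == k]
            if len(members) > d:
                Sigmas.append(np.atleast_2d(np.cov(members, rowvar=False)) + 1e-6 * np.eye(d))
            else:
                Sigmas.append(np.eye(d))   # fallback for clusters with too few points
        return pis, mus, np.stack(Sigmas)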

Expectation Maximization (EM) Setup
More broadly applicable than just the mixture models considered so far
Model:
x: observable, "incomplete" data
y: not (fully) observable, "complete" data
\theta: parameters
Interested in maximizing (w.r.t. \theta): p(x | \theta) = \sum_y p(x, y | \theta)
Special case: x = g(y)

Expectation Maximization (EM) Derivation
Step 1: rewrite the desired likelihood in terms of complete-data terms:
p(y | \theta) = p(y | x, \theta) p(x | \theta)
Step 2: assume an estimate \hat{\theta} of the parameters; take expectations with respect to p(y | x, \hat{\theta})

Expectation Maximization (EM) Derivation
Step 3: consider the log likelihood of the data at any \theta relative to the log likelihood at \hat{\theta}: L_x(\theta) - L_x(\hat{\theta})
Aside: Gibbs inequality: E_p[\log p(x)] \geq E_p[\log q(x)]
Proof: (see the one-line Jensen argument below)

Expectation Maximization (EM) Derivation
L_x(\theta) - L_x(\hat{\theta}) = [U(\theta, \hat{\theta}) - U(\hat{\theta}, \hat{\theta})] + [V(\theta, \hat{\theta}) - V(\hat{\theta}, \hat{\theta})]
Step 4: determine conditions under which the log likelihood at \theta exceeds that at \hat{\theta}
Using the Gibbs inequality, V(\theta, \hat{\theta}) \geq V(\hat{\theta}, \hat{\theta}), so:
If U(\theta, \hat{\theta}) \geq U(\hat{\theta}, \hat{\theta})
Then L_x(\theta) \geq L_x(\hat{\theta})
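The Gibbs inequality cited above follows from Jensen's inequality applied to the concave log; this one-line proof is an added note, not on the slide:

    E_p[\log q(x)] - E_p[\log p(x)]
      = E_p\left[ \log \frac{q(x)}{p(x)} \right]
      \le \log E_p\left[ \frac{q(x)}{p(x)} \right]
      = \log \sum_x p(x) \frac{q(x)}{p(x)}
      = \log 1 = 0 ,

with equality iff p = q.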

Motivates EM Algorithm
Initial guess: \hat{\theta}^{(0)}
Estimate at iteration t: \hat{\theta}^{(t)}
E-Step: compute U(\theta, \hat{\theta}^{(t)}) = E[\log p(y | \theta) | x, \hat{\theta}^{(t)}]
M-Step: compute \hat{\theta}^{(t+1)} = \arg\max_\theta U(\theta, \hat{\theta}^{(t)})

Example: Mixture Models
E-Step: compute U(\theta, \hat{\theta}^{(t)}) = E[\log p(y | \theta) | x, \hat{\theta}^{(t)}]
M-Step: compute \hat{\theta}^{(t+1)} = \arg\max_\theta U(\theta, \hat{\theta}^{(t)})
Consider y_i = \{z_i, x_i\} i.i.d., with p(x_i, z_i | \theta) = \pi_{z_i} p(x_i | z_i, \theta) = \pi_{z_i} N(x_i | \mu_{z_i}, \Sigma_{z_i})
E_{q_t}[\log p(y | \theta)] = \sum_i E_{q_t}[\log p(x_i, z_i | \theta)]
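The last expectation above is left to be expanded in lecture; writing q_t(z_i = k) = r_{ik} for the E-step responsibilities, a standard expansion (added here for completeness) is:

    E_{q_t}[\log p(x_i, z_i | \theta)]
      = \sum_{k=1}^{K} r_{ik} \left( \log \pi_k + \log N(x_i | \mu_k, \Sigma_k) \right),

and maximizing \sum_i of this expression over \pi_k, \mu_k, \Sigma_k gives exactly the weighted M-step updates shown in the Iterative Algorithm slide.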

Coordinate Ascent Behavior
Bound on the log likelihood:
L_x(\theta) = U(\theta, \hat{\theta}^{(t)}) + V(\theta, \hat{\theta}^{(t)})
L_x(\hat{\theta}^{(t)}) = U(\hat{\theta}^{(t)}, \hat{\theta}^{(t)}) + V(\hat{\theta}^{(t)}, \hat{\theta}^{(t)})
[Figure from the KM textbook]

Comments on EM
Since the Gibbs inequality is satisfied with equality only if p = q, any step that changes \theta should strictly increase the likelihood
In practice, the M-Step can be replaced with simply increasing U instead of maximizing it (Generalized EM)
Under certain conditions (e.g., in the exponential family), one can show that EM converges to a stationary point of L_x(\theta)
Often there is a natural choice for y; it has physical meaning
If you want to choose any y, not necessarily with x = g(y), replace p(y | \theta) in U with p(y, x | \theta)

Initialization
In the mixture model case where y_i = \{z_i, x_i\}, there are many ways to initialize the EM algorithm. Examples:
Choose K observations at random to define each cluster; assign other observations to the nearest centroid to form initial parameter estimates
Pick the centers sequentially to provide good coverage of the data
Grow the mixture model by splitting (and sometimes removing) clusters until K clusters are formed
Can be quite important to convergence rates in practice

What you should know
K-means for clustering: the algorithm converges because it is coordinate ascent
EM for mixtures of Gaussians: how to learn maximum likelihood parameters (locally maximum likelihood) in the case of unlabeled data
Be happy with this kind of probabilistic analysis
Remember, E.M. can get stuck in local minima, and empirically it DOES
EM is coordinate ascent