K-Means and Gaussian Mixture Models


David Rosenberg, New York University
DS-GA 1003, October 29, 2016

K-Means Clustering

Example: Old Faithful Geyser
Looks like two clusters. How to find these clusters algorithmically?

k-means: By Example
Standardize the data. Choose two cluster centers.
From Bishop's Pattern Recognition and Machine Learning, Figure 9.1(a).

k-means: By Example
Assign each point to the closest center.
From Bishop's Pattern Recognition and Machine Learning, Figure 9.1(b).

k-means: By Example
Compute new class centers.
From Bishop's Pattern Recognition and Machine Learning, Figure 9.1(c).

k-means: By Example
Assign points to the closest center.
From Bishop's Pattern Recognition and Machine Learning, Figure 9.1(d).

k-means: By Example
Compute cluster centers.
From Bishop's Pattern Recognition and Machine Learning, Figure 9.1(e).

k-means: By Example
Iterate until convergence.
From Bishop's Pattern Recognition and Machine Learning, Figure 9.1(i).

k-means Algorithm: Standardizing the Data
Without standardizing: blue and black show the results of k-means clustering. The wait time dominates the distance metric.

k-means Algorithm: Standardizing the Data
With standardizing: note that several points have been reassigned from the black cluster to the blue cluster.
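
The standardize-then-iterate procedure above can be written down compactly. Below is a minimal NumPy sketch of the k-means (Lloyd's) algorithm: assign each point to the closest center, then recompute the centers. The random initialization and variable names are illustrative choices, not part of the lecture.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means (Lloyd's algorithm) sketch; X has shape (n, d)."""
    rng = np.random.default_rng(seed)
    # Standardize each feature so no single feature dominates the distance metric.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    # Initialize centers by picking k distinct data points at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: index of the closest center for each point.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned points
        # (for simplicity, this sketch assumes no cluster goes empty).
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```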

k-means: Failure Cases

k-means: Suboptimal Local Minimum
The clustering for k = 3 below is a local minimum, but suboptimal.
From Sontag's DS-GA 1003, 2014, Lecture 8.

Gaussian Mixture Models

Probabilistic Model for Clustering
Let's consider a generative model for the data. Suppose:
1. There are k clusters.
2. We have a probability density for each cluster.
Generate a point as follows:
1. Choose a random cluster z ∈ {1, 2, ..., k}.
2. Choose a point from the distribution for cluster z.

Gaussian Mixture Model (k = 3)
1. Choose z ∈ {1, 2, 3} with p(1) = p(2) = p(3) = 1/3.
2. Choose x | z ~ N(x | µ_z, Σ_z).
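
As a concrete illustration of this two-step generative process, here is a small sketch that samples from a three-component mixture in two dimensions. The particular means, covariances, and sample size are made up for the example; only the two-step structure comes from the slide.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters for k = 3 components (not from the lecture).
pi = np.array([1/3, 1/3, 1/3])                        # cluster probabilities
mus = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 3.0]])  # cluster means
Sigmas = np.array([np.eye(2), 0.5 * np.eye(2), np.diag([1.0, 0.2])])

def sample_gmm(n):
    # Step 1: choose a cluster z with probabilities pi.
    z = rng.choice(3, size=n, p=pi)
    # Step 2: draw x | z from N(mu_z, Sigma_z).
    x = np.array([rng.multivariate_normal(mus[zi], Sigmas[zi]) for zi in z])
    return x, z

X, Z = sample_gmm(500)
```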

Gaussian Mixture Model Parameters (k Components)
Cluster probabilities: π = (π_1, ..., π_k)
Cluster means: µ = (µ_1, ..., µ_k)
Cluster covariance matrices: Σ = (Σ_1, ..., Σ_k)
For now, suppose all these parameters are known. We'll discuss how to learn or estimate them later.

Gaussian Mixture Model: Joint Distribution
Factorize the joint distribution:
p(x, z) = p(z) p(x | z) = π_z N(x | µ_z, Σ_z)
π_z is the probability of choosing cluster z.
x | z has distribution N(µ_z, Σ_z).
The z corresponding to x is the true cluster assignment.
Suppose we know the model parameters π_z, µ_z, Σ_z. Then we can easily compute the joint p(x, z).

Latent Variable Model
We observe x. We don't observe z (the cluster assignment).
The cluster assignment z is called a hidden variable or latent variable.
Definition: A latent variable model is a probability model for which certain variables are never observed.
e.g. the Gaussian mixture model is a latent variable model.

The GMM Inference Problem
We observe x. We want to know z.
The conditional distribution of the cluster z given x is
p(z | x) = p(x, z) / p(x).
The conditional distribution is a soft assignment to clusters.
A hard assignment is
z* = arg max_{z ∈ {1,...,k}} p(z | x).
So if we have the model, clustering is trivial.
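
With known parameters, both assignments follow directly from Bayes' rule. The sketch below (using scipy.stats for the Gaussian densities) assumes parameter arrays pi, mus, Sigmas are given; the function name is illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def cluster_posterior(x, pi, mus, Sigmas):
    """Soft assignment p(z | x) and hard assignment arg max_z p(z | x)."""
    # Joint p(x, z) = pi_z * N(x | mu_z, Sigma_z) for each z.
    joint = np.array([pi[z] * multivariate_normal.pdf(x, mus[z], Sigmas[z])
                      for z in range(len(pi))])
    posterior = joint / joint.sum()          # p(z | x) = p(x, z) / p(x)
    return posterior, posterior.argmax()
```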

Mixture Models

Gaussian Mixture Model: Marginal Distribution
The marginal distribution for a single observation x is
p(x) = ∑_{z=1}^{k} p(x, z) = ∑_{z=1}^{k} π_z N(x | µ_z, Σ_z).
Note that p(x) is a convex combination of probability densities. This is a common form for a probability model...

Mixture Distributions (or Mixture Models)
Definition: A probability density p(x) represents a mixture distribution or mixture model if we can write it as a convex combination of probability densities. That is,
p(x) = ∑_{i=1}^{k} w_i p_i(x),
where w_i ≥ 0, ∑_{i=1}^{k} w_i = 1, and each p_i is a probability density.
In our Gaussian mixture model, x has a mixture distribution.
More constructively, let S be a set of probability distributions:
1. Choose a distribution randomly from S.
2. Sample x from the chosen distribution.
Then x has a mixture distribution.
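
The definition can be checked numerically in one dimension. The sketch below mixes two Gaussian densities with illustrative weights (all values are made up for the example) and confirms that the convex combination is itself a normalized density.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Two component densities and illustrative mixture weights (w_i >= 0, summing to 1).
w = [0.3, 0.7]
components = [norm(loc=-2.0, scale=1.0), norm(loc=1.0, scale=0.5)]

def mixture_pdf(x):
    return sum(wi, 0) if False else sum(wi * pi.pdf(x) for wi, pi in zip(w, components))

# The mixture integrates to 1, so it is a valid probability density.
total, _ = quad(mixture_pdf, -np.inf, np.inf)
print(total)  # approximately 1.0
```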

Learning in Gaussian Mixture Models

The GMM Learning Problem
Given data x_1, ..., x_n drawn from a GMM, estimate the parameters:
Cluster probabilities: π = (π_1, ..., π_k)
Cluster means: µ = (µ_1, ..., µ_k)
Cluster covariance matrices: Σ = (Σ_1, ..., Σ_k)
Once we have the parameters, we're done. Just do inference to get cluster assignments.

Estimating/Learning the Gaussian Mixture Model
One approach to learning is maximum likelihood: find parameter values that give the observed data the highest likelihood.
The model likelihood for D = {x_1, ..., x_n} is
L(π, µ, Σ) = ∏_{i=1}^{n} p(x_i) = ∏_{i=1}^{n} ∑_{z=1}^{k} π_z N(x_i | µ_z, Σ_z).
As usual, we'll take our objective function to be the log of this:
J(π, µ, Σ) = ∑_{i=1}^{n} log { ∑_{z=1}^{k} π_z N(x_i | µ_z, Σ_z) }.

Properties of the GMM Log-Likelihood
GMM log-likelihood:
J(π, µ, Σ) = ∑_{i=1}^{n} log { ∑_{z=1}^{k} π_z N(x_i | µ_z, Σ_z) }.
Let's compare to the log-likelihood for a single Gaussian:
∑_{i=1}^{n} log N(x_i | µ, Σ) = -(nd/2) log(2π) - (n/2) log|Σ| - (1/2) ∑_{i=1}^{n} (x_i - µ)ᵀ Σ⁻¹ (x_i - µ).
For a single Gaussian, the log cancels the exp in the Gaussian density, so things simplify a lot.
For the GMM, the sum inside the log prevents this cancellation, so the expression is more complicated. There is no closed-form expression for the MLE.
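
Even though there is no closed form, the objective itself is easy to evaluate. A minimal sketch, using a log-sum-exp over components for numerical stability (the parameters pi, mus, Sigmas are assumed given):

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_log_likelihood(X, pi, mus, Sigmas):
    """J(pi, mu, Sigma) = sum_i log sum_z pi_z N(x_i | mu_z, Sigma_z)."""
    k = len(pi)
    # log_probs[i, z] = log pi_z + log N(x_i | mu_z, Sigma_z)
    log_probs = np.column_stack([
        np.log(pi[z]) + multivariate_normal.logpdf(X, mus[z], Sigmas[z])
        for z in range(k)
    ])
    # Sum over z inside the log, then sum over the data points.
    return logsumexp(log_probs, axis=1).sum()
```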

Issues with MLE for GMM

Identifiability Issues for GMM
Suppose we have found parameters
Cluster probabilities: π = (π_1, ..., π_k)
Cluster means: µ = (µ_1, ..., µ_k)
Cluster covariance matrices: Σ = (Σ_1, ..., Σ_k)
that are at a local optimum of the likelihood.
What happens if we shuffle the clusters? e.g. switch the labels for clusters 1 and 2.
We'll get the same likelihood. How many such equivalent settings are there?
Assuming all clusters are distinct, there are k! equivalent solutions.
Not a problem per se, but something to be aware of.

Singularities for GMM
Consider the following GMM for 7 data points (from Bishop's Pattern Recognition and Machine Learning, Figure 9.7). Let σ² be the variance of the skinny component. What happens to the likelihood as σ² → 0? It blows up: the likelihood is unbounded.
In practice, we end up at local optima that do not have this problem, or we keep restarting the optimization until we do.
A Bayesian approach or regularization will also solve the problem.

Gradient Descent / SGD for GMM
What about running gradient descent or SGD on
J(π, µ, Σ) = ∑_{i=1}^{n} log { ∑_{z=1}^{k} π_z N(x_i | µ_z, Σ_z) }?
It can be done, but we need to be clever about it.
Each matrix Σ_1, ..., Σ_k has to be positive semidefinite. How do we maintain that constraint?
Rewrite Σ_i = M_i M_iᵀ, where M_i is an unconstrained matrix. Then Σ_i is positive semidefinite.
But we actually prefer positive definite, to avoid singularities.

Cholesky Decomposition for SPD Matrices
Theorem: Every symmetric positive definite matrix A ∈ R^{d×d} has a unique Cholesky decomposition A = LLᵀ, where L is a lower triangular matrix with positive diagonal elements.
A lower triangular matrix has half the number of parameters.
Symmetric positive definite is better because it avoids singularities.
Requires a non-negativity constraint on the diagonal elements; e.g. use a projected SGD method like we did for the lasso.
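
One way to see the parameterization idea concretely: optimize an unconstrained lower-triangular matrix and push its diagonal through a softplus so it stays positive; then Σ = LLᵀ is symmetric positive definite by construction. This is a sketch of the reparameterization idea only, not the projected-SGD method mentioned above; the function name and softplus choice are illustrative.

```python
import numpy as np

def spd_from_unconstrained(theta, d):
    """Map an unconstrained vector theta of length d*(d+1)/2 to an SPD matrix."""
    L = np.zeros((d, d))
    L[np.tril_indices(d)] = theta            # fill the lower triangle
    diag = np.diag_indices(d)
    L[diag] = np.log1p(np.exp(L[diag]))      # softplus keeps the diagonal strictly positive
    return L @ L.T                           # Sigma = L L^T is symmetric positive definite

Sigma = spd_from_unconstrained(np.random.randn(6), d=3)
print(np.linalg.eigvalsh(Sigma))             # all eigenvalues are strictly positive
```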

The EM Algorithm for GMM

MLE for the Gaussian Model
Let's start by considering the MLE for the Gaussian model. For data D = {x_1, ..., x_n}, the log-likelihood is given by
∑_{i=1}^{n} log N(x_i | µ, Σ) = -(nd/2) log(2π) - (n/2) log|Σ| - (1/2) ∑_{i=1}^{n} (x_i - µ)ᵀ Σ⁻¹ (x_i - µ).
With some calculus, we find that the MLE parameters are
µ_MLE = (1/n) ∑_{i=1}^{n} x_i
Σ_MLE = (1/n) ∑_{i=1}^{n} (x_i - µ_MLE)(x_i - µ_MLE)ᵀ.
For the GMM: if we knew the cluster assignment z_i for each x_i, we could compute the MLEs for each cluster.
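
These two formulas translate directly into code; a quick sketch (X is an (n, d) data matrix, and the function name is illustrative):

```python
import numpy as np

def gaussian_mle(X):
    """MLE for a single Gaussian: sample mean and (1/n, i.e. biased) covariance."""
    n = len(X)
    mu = X.mean(axis=0)                  # mu_MLE = (1/n) sum_i x_i
    centered = X - mu
    Sigma = centered.T @ centered / n    # Sigma_MLE = (1/n) sum_i (x_i - mu)(x_i - mu)^T
    return mu, Sigma
```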

Cluster Responsibilities: Some New Notation
Denote the probability that observed value x_i comes from cluster j by
γ_i^j = P(Z = j | X = x_i),
the responsibility that cluster j takes for observation x_i.
Computationally,
γ_i^j = P(Z = j | X = x_i) = p(Z = j, X = x_i) / p(x_i) = π_j N(x_i | µ_j, Σ_j) / ∑_{c=1}^{k} π_c N(x_i | µ_c, Σ_c).
The vector (γ_i^1, ..., γ_i^k) is exactly the soft assignment for x_i.
Let n_c = ∑_{i=1}^{n} γ_i^c be the number of points soft-assigned to cluster c.
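
The responsibilities are the same posterior computed earlier, evaluated for every data point at once. A vectorized sketch (parameters assumed given, function name illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pi, mus, Sigmas):
    """Return the (n, k) matrix gamma with gamma[i, j] = P(Z = j | X = x_i)."""
    k = len(pi)
    # Unnormalized responsibilities: pi_j * N(x_i | mu_j, Sigma_j).
    unnorm = np.column_stack([pi[j] * multivariate_normal.pdf(X, mus[j], Sigmas[j])
                              for j in range(k)])
    gamma = unnorm / unnorm.sum(axis=1, keepdims=True)
    return gamma   # row i is the soft assignment for x_i; gamma.sum(axis=0) gives the n_c
```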

EM Algorithm for GMM: Overview
1. Initialize the parameters µ, Σ, π.
2. E step: evaluate the responsibilities using the current parameters, for i = 1, ..., n and j = 1, ..., k:
γ_i^j = π_j N(x_i | µ_j, Σ_j) / ∑_{c=1}^{k} π_c N(x_i | µ_c, Σ_c).
3. M step: re-estimate the parameters using the responsibilities:
µ_c^new = (1/n_c) ∑_{i=1}^{n} γ_i^c x_i
Σ_c^new = (1/n_c) ∑_{i=1}^{n} γ_i^c (x_i - µ_c^new)(x_i - µ_c^new)ᵀ
π_c^new = n_c / n.
4. Repeat from step 2 until the log-likelihood converges.
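
Putting the E and M steps together gives the sketch below. The responsibilities are computed in log space for stability, and a small ridge is added to each covariance to avoid the singularities discussed earlier; the initialization (random data points as means, identity covariances, uniform weights) is an illustrative choice, not prescribed by the lecture.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def em_gmm(X, k, n_iters=100, tol=1e-6, seed=0):
    """Minimal EM sketch for a GMM; returns (pi, mus, Sigmas, gamma)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Illustrative initialization.
    pi = np.full(k, 1.0 / k)
    mus = X[rng.choice(n, size=k, replace=False)].copy()
    Sigmas = np.array([np.eye(d) for _ in range(k)])
    prev_ll = -np.inf
    for _ in range(n_iters):
        # E step: responsibilities gamma[i, j] = P(Z = j | X = x_i), in log space.
        log_probs = np.column_stack([
            np.log(pi[j]) + multivariate_normal.logpdf(X, mus[j], Sigmas[j])
            for j in range(k)
        ])
        ll = logsumexp(log_probs, axis=1).sum()          # current log-likelihood
        gamma = np.exp(log_probs - logsumexp(log_probs, axis=1, keepdims=True))
        # M step: re-estimate the parameters from the responsibilities.
        n_c = gamma.sum(axis=0)                          # soft counts per cluster
        pi = n_c / n
        mus = (gamma.T @ X) / n_c[:, None]
        for j in range(k):
            centered = X - mus[j]
            Sigmas[j] = (gamma[:, j, None] * centered).T @ centered / n_c[j]
            Sigmas[j] += 1e-6 * np.eye(d)                # small ridge to avoid singularities
        # Stop once the log-likelihood stops improving.
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pi, mus, Sigmas, gamma
```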

EM for GMM: Initialization
From Bishop's Pattern Recognition and Machine Learning, Figure 9.8.

EM for GMM: First Soft Assignment
From Bishop's Pattern Recognition and Machine Learning, Figure 9.8.

EM for GMM: First Soft Assignment (continued)
From Bishop's Pattern Recognition and Machine Learning, Figure 9.8.

EM for GMM: After 5 Rounds of EM
From Bishop's Pattern Recognition and Machine Learning, Figure 9.8.

EM for GMM: After 20 Rounds of EM
From Bishop's Pattern Recognition and Machine Learning, Figure 9.8.

Relation to k-Means
EM for GMM seems a little like k-means. In fact, there is a precise correspondence.
First, fix each cluster covariance matrix to be σ²I. As we take σ² → 0, the update equations converge to those of k-means.
If you do a quick experiment yourself (see the sketch below), you'll find that the soft assignments converge to hard assignments. This has to do with the tail behavior (exponential decay) of the Gaussian.
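
A minimal version of that experiment: with shared covariance σ²I and equal weights, the responsibilities reduce to a softmax of the negative squared distances scaled by 1/(2σ²), and shrinking σ² drives them toward a one-hot (hard) assignment. The means and the data point below are made up for the illustration.

```python
import numpy as np
from scipy.special import logsumexp

# Two illustrative cluster means and one data point slightly closer to the first.
mus = np.array([[0.0, 0.0], [3.0, 0.0]])
x = np.array([1.4, 0.0])

for sigma2 in [1.0, 0.1, 0.01]:
    # With Sigma_z = sigma^2 I and equal weights, the responsibilities are
    # a softmax of -||x - mu_z||^2 / (2 sigma^2).
    logits = -((x - mus) ** 2).sum(axis=1) / (2 * sigma2)
    gamma = np.exp(logits - logsumexp(logits))
    print(sigma2, gamma)   # as sigma^2 -> 0, gamma approaches a one-hot (hard) assignment
```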