Clustering and Gaussian Mixture Models


Piyush Rai, IIT Kanpur
Probabilistic Machine Learning (CS772A), Jan 25, 2016

Recap of last lecture...

Clustering
Clustering is usually an unsupervised learning problem. Given: N unlabeled examples {x_1, ..., x_N} and the number of partitions K. Goal: group the examples into K partitions.
Clustering groups examples based on their mutual similarities. A good clustering is one that achieves:
- High within-cluster similarity
- Low inter-cluster similarity
Examples: K-means, Spectral Clustering, Gaussian Mixture Model, etc.
(Picture courtesy: Data Clustering: 50 Years Beyond K-Means, A.K. Jain, 2008)

Refresher: K-means Clustering
Input: N examples {x_1, ..., x_N}, x_n ∈ R^D; the number of partitions K.
Initialize: K cluster means µ_1, ..., µ_K, µ_k ∈ R^D. Usually initialized randomly, but good initialization is crucial; many smarter initialization heuristics exist (e.g., K-means++, Arthur & Vassilvitskii, 2007).
Iterate:
- (Re-)assign each example x_n to its closest cluster center: C_k = {n : k = arg min_j ||x_n − µ_j||²} (C_k is the set of examples assigned to cluster k with center µ_k)
- Update the cluster means: µ_k = mean(C_k) = (1/|C_k|) Σ_{n ∈ C_k} x_n
- Repeat while not converged
A possible convergence criterion: the cluster means do not change anymore.
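A minimal NumPy sketch of this loop is given below (my own illustrative code, not from the lecture; the function name kmeans and the choice of random data points as initial means are assumptions):

```python
# A minimal K-means sketch (illustrative; not the lecture's code). X: (N, D).
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]  # random initialization
    for _ in range(n_iters):
        # Assignment step: each point goes to its closest cluster mean
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)  # (N, K)
        z = dists.argmin(axis=1)
        # Update step: each mean becomes the average of its assigned points
        new_mu = np.array([X[z == k].mean(axis=0) if np.any(z == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):  # convergence: means no longer change
            break
        mu = new_mu
    return mu, z
```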

The K-means Objective Function
Notation: z_n is a size-K one-hot vector denoting the membership of x_n to cluster k,
z_n = [0 0 ... 1 ... 0 0] (all zeros except the k-th bit)
which is also equivalent to just saying z_n = k.
The K-means objective can be written in terms of the total distortion:
J(µ, Z) = Σ_{n=1}^N Σ_{k=1}^K z_nk ||x_n − µ_k||²
Distortion: the loss suffered on assigning the points {x_n}_{n=1}^N to the clusters {µ_k}_{k=1}^K.
Goal: minimize the objective w.r.t. µ and Z.
Note: this is a non-convex objective, and exact optimization is NP-hard. The K-means algorithm is a heuristic: it alternates between minimizing J w.r.t. µ and Z, and converges to a local minimum.
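For concreteness, the distortion itself is a one-liner given a hard assignment (again an illustrative sketch; shapes follow the kmeans sketch above):

```python
# Total distortion J(mu, Z) for a hard assignment z: sum of squared distances
# of each point to its assigned cluster mean.
import numpy as np

def distortion(X, mu, z):
    return np.sum((X - mu[z]) ** 2)  # mu[z] is each point's assigned mean
```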

K-means: Some Limitations
K-means makes hard assignments of points to clusters: a point either totally belongs to a cluster or not at all. There is no notion of a soft/fractional assignment (i.e., a probability of being assigned to each cluster: say K = 3 and, for some point x_n, p_1 = 0.7, p_2 = 0.2, p_3 = 0.1).
K-means often doesn't work when clusters are not round shaped, and/or may overlap, and/or are unequal in size.
The Gaussian Mixture Model is a probabilistic approach to clustering (and density estimation) that addresses many of these problems.

Mixture Models
The data distribution p(x) is assumed to be a weighted sum of K distributions:
p(x) = Σ_{k=1}^K π_k p(x | θ_k)
where the π_k's are the mixing weights: Σ_{k=1}^K π_k = 1, π_k ≥ 0 (intuitively, π_k is the proportion of data generated by the k-th distribution).
Each component distribution p(x | θ_k) represents a cluster in the data.
Gaussian Mixture Model (GMM): the component distributions are Gaussians,
p(x) = Σ_{k=1}^K π_k N(x | µ_k, Σ_k)
Mixture models are used in many data modeling problems, e.g.:
- Unsupervised learning: clustering (+ density estimation)
- Supervised learning: Mixture of Experts models
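As a quick illustration, the GMM density can be evaluated directly from this definition; the sketch below uses SciPy's multivariate_normal, and the parameter values are made up:

```python
# Evaluating p(x) = sum_k pi_k * N(x | mu_k, Sigma_k) directly from the
# definition, using SciPy's Gaussian pdf. Parameter values are made up.
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, pis, mus, Sigmas):
    return sum(pi * multivariate_normal.pdf(x, mean=mu, cov=Sigma)
               for pi, mu, Sigma in zip(pis, mus, Sigmas))

pis = [0.5, 0.5]
mus = [np.zeros(2), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 2.0 * np.eye(2)]
print(gmm_density(np.array([1.0, 1.0]), pis, mus, Sigmas))
```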

GMM Clustering: Pictorially
(Figure omitted in transcription.) Notice the mixed-colored points in the overlapping regions.

GMM as a Generative Model of Data
We can think of the data {x_1, x_2, ..., x_N} as produced by the following generative story. For each example x_n, first choose its cluster assignment z_n ∈ {1, 2, ..., K} as
z_n ~ Multinoulli(π_1, π_2, ..., π_K)
Now generate x_n from the Gaussian with id z_n:
x_n | z_n ~ N(µ_{z_n}, Σ_{z_n})
Note: p(z_nk = 1) = π_k is the prior probability of x_n going to cluster k, and (in one-hot notation) p(z_n) = Π_{k=1}^K π_k^{z_nk}.
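The generative story translates directly into sampling code; here is a sketch with made-up parameters (pis, mus, Sigmas are assumptions for illustration):

```python
# The generative story as code: draw z_n from Multinoulli(pi), then x_n from
# the z_n-th Gaussian. Parameters below are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
pis = np.array([0.3, 0.7])
mus = [np.zeros(2), np.array([4.0, 0.0])]
Sigmas = [np.eye(2), np.diag([2.0, 0.5])]

def sample_gmm(N):
    z = rng.choice(len(pis), size=N, p=pis)   # z_n ~ Multinoulli(pi_1, ..., pi_K)
    X = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
    return X, z                               # x_n | z_n ~ N(mu_{z_n}, Sigma_{z_n})

X, z = sample_gmm(500)
```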

GMM as a Generative Model of Data
Joint distribution of data and cluster assignments: p(x, z) = p(z) p(x | z)
Marginal distribution of data:
p(x) = Σ_{k=1}^K p(z_k = 1) p(x | z_k = 1) = Σ_{k=1}^K π_k N(x | µ_k, Σ_k)
Thus the generative model leads to exactly the same p(x) that we defined.

Learning GMM
Given N observations {x_1, x_2, ..., x_N} drawn from the mixture distribution
p(x) = Σ_{k=1}^K π_k N(x | µ_k, Σ_k)
learning the GMM involves the following:
- Learning the cluster assignments {z_1, z_2, ..., z_N}
- Estimating the mixing weights π = {π_1, ..., π_K} and the parameters θ = {µ_k, Σ_k}_{k=1}^K of each of the K Gaussians
GMM, being probabilistic, allows learning the probabilities of cluster assignments.

GMM: Learning Cluster Assignment Probabilities
For now, assume π = {π_1, ..., π_K} and θ = {µ_k, Σ_k}_{k=1}^K are known.
Given θ, the posterior probabilities of the cluster assignments follow from Bayes rule:
γ_nk = p(z_nk = 1 | x_n) = p(z_nk = 1) p(x_n | z_nk = 1) / Σ_{j=1}^K p(z_nj = 1) p(x_n | z_nj = 1) = π_k N(x_n | µ_k, Σ_k) / Σ_{j=1}^K π_j N(x_n | µ_j, Σ_j)
Here γ_nk denotes the posterior probability that x_n belongs to cluster k: the posterior probability γ_nk is proportional to the prior probability π_k times the likelihood N(x_n | µ_k, Σ_k).
Note that, unlike K-means, there is a non-zero posterior probability of x_n belonging to each of the K clusters (i.e., probabilistic/soft clustering). Therefore, for each example x_n, we have a vector γ_n of cluster probabilities:
γ_n = [γ_n1 γ_n2 ... γ_nK], with Σ_{k=1}^K γ_nk = 1 and γ_nk > 0
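In code, these posterior probabilities are just Bayes rule followed by a normalization over k; a sketch (the function name responsibilities is my own):

```python
# Posterior cluster probabilities (responsibilities) via Bayes rule:
# gamma_nk is proportional to pi_k * N(x_n | mu_k, Sigma_k), normalized over k.
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pis, mus, Sigmas):
    # (N, K) matrix of prior-times-likelihood terms, one column per cluster
    unnorm = np.stack([pi * multivariate_normal.pdf(X, mean=mu, cov=Sigma)
                       for pi, mu, Sigma in zip(pis, mus, Sigmas)], axis=1)
    return unnorm / unnorm.sum(axis=1, keepdims=True)  # each row sums to 1
```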

GMM: Estimating Parameters
Now assume the cluster probabilities γ_1, ..., γ_N are known. Let us write down the log-likelihood of the model:
L = log p(X) = log Π_{n=1}^N p(x_n) = Σ_{n=1}^N log p(x_n) = Σ_{n=1}^N log { Σ_{k=1}^K π_k N(x_n | µ_k, Σ_k) }
Taking the derivative w.r.t. µ_k (done on the blackboard) and setting it to zero:
Σ_{n=1}^N [ π_k N(x_n | µ_k, Σ_k) / Σ_{j=1}^K π_j N(x_n | µ_j, Σ_j) ] Σ_k^{-1} (x_n − µ_k) = 0, where the term in brackets is exactly γ_nk.
Plugging and chugging, we get
µ_k = ( Σ_{n=1}^N γ_nk x_n ) / ( Σ_{n=1}^N γ_nk ) = (1/N_k) Σ_{n=1}^N γ_nk x_n
Thus the mean of the k-th Gaussian is the weighted empirical mean of all the examples. Here N_k = Σ_{n=1}^N γ_nk is the effective number of examples assigned to the k-th Gaussian (note that each example belongs to each Gaussian, but only "partially").
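In code, this weighted-mean update is a single matrix product (a sketch; gamma is the N × K responsibility matrix from above, and update_means is my own name):

```python
# Weighted-mean update: mu_k = (1/N_k) * sum_n gamma_nk * x_n, as one matrix product.
import numpy as np

def update_means(X, gamma):
    Nk = gamma.sum(axis=0)              # effective counts N_k = sum_n gamma_nk
    return (gamma.T @ X) / Nk[:, None]  # row k is the weighted mean mu_k
```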

GMM: Estimating Parameters
Doing the same, this time w.r.t. the covariance matrix Σ_k of the k-th Gaussian:
Σ_k = (1/N_k) Σ_{n=1}^N γ_nk (x_n − µ_k)(x_n − µ_k)^T
... using computations similar to the MLE of the covariance matrix of a single Gaussian (shown on board). Thus Σ_k is the weighted empirical covariance of all the examples.
Finally, the MLE objective for estimating π = {π_1, π_2, ..., π_K}, with λ the Lagrange multiplier enforcing Σ_{k=1}^K π_k = 1:
Σ_{n=1}^N log Σ_{k=1}^K π_k N(x_n | µ_k, Σ_k) + λ ( Σ_{k=1}^K π_k − 1 )
Taking the derivative w.r.t. π_k and setting it to zero gives the Lagrange multiplier λ = −N. Plugging it back and chugging, we get
π_k = N_k / N
which makes intuitive sense (the fraction of examples assigned to cluster k).
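The covariance and mixing-weight updates can be sketched the same way (illustrative code; update_covs_and_weights is my own name, and mus is assumed to come from update_means above):

```python
# Weighted covariance and mixing-weight updates: Sigma_k is the gamma_nk-weighted
# empirical covariance around mu_k, and pi_k = N_k / N.
import numpy as np

def update_covs_and_weights(X, gamma, mus):
    N = X.shape[0]
    Nk = gamma.sum(axis=0)
    Sigmas = []
    for k in range(gamma.shape[1]):
        diff = X - mus[k]                                           # (N, D)
        Sigmas.append((gamma[:, k, None] * diff).T @ diff / Nk[k])  # (D, D)
    return np.array(Sigmas), Nk / N
```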

Summary of GMM Estimation
Initialize the parameters θ = {µ_k, Σ_k}_{k=1}^K and the mixing weights π = {π_1, ..., π_K}, and alternate between the following two steps until convergence (see the sketch below):
Given the current estimates of θ = {µ_k, Σ_k}_{k=1}^K and π, estimate the posterior probabilities of the cluster assignments:
γ_nk = π_k N(x_n | µ_k, Σ_k) / Σ_{j=1}^K π_j N(x_n | µ_j, Σ_j), for all n, k
Given the current estimates of the cluster assignment probabilities {γ_nk}:
- Estimate the mean of each Gaussian: µ_k = (1/N_k) Σ_{n=1}^N γ_nk x_n, where N_k = Σ_{n=1}^N γ_nk
- Estimate the covariance matrix of each Gaussian: Σ_k = (1/N_k) Σ_{n=1}^N γ_nk (x_n − µ_k)(x_n − µ_k)^T
- Estimate the mixing proportion of each Gaussian: π_k = N_k / N
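Putting the pieces together, the whole procedure is a short alternating loop. The sketch below reuses the responsibilities, update_means, and update_covs_and_weights functions from above; the simple initialization is an assumption (K-means is a common alternative), and this alternation is exactly what the EM algorithm of the next lecture formalizes:

```python
# A sketch of the full alternating procedure for fitting a GMM.
import numpy as np

def fit_gmm(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    mus = X[rng.choice(N, size=K, replace=False)]             # initial means
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(D)] * K)   # initial covariances
    pis = np.full(K, 1.0 / K)                                 # initial mixing weights
    for _ in range(n_iters):
        gamma = responsibilities(X, pis, mus, Sigmas)         # soft assignments
        mus = update_means(X, gamma)                          # weighted means
        Sigmas, pis = update_covs_and_weights(X, gamma, mus)  # covariances, weights
    return pis, mus, Sigmas, gamma
```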

K-means: A Special Case of GMM
Assume the covariance matrix of each Gaussian to be spherical: Σ_k = σ² I. Consider the posterior probabilities of the cluster assignments:
γ_nk = π_k N(x_n | µ_k, Σ_k) / Σ_{j=1}^K π_j N(x_n | µ_j, Σ_j) = π_k exp{−||x_n − µ_k||² / (2σ²)} / Σ_{j=1}^K π_j exp{−||x_n − µ_j||² / (2σ²)}
As σ² → 0, the sum in the denominator is dominated by the term with the smallest ||x_n − µ_j||². For that j,
γ_nj → π_j exp{−||x_n − µ_j||² / (2σ²)} / π_j exp{−||x_n − µ_j||² / (2σ²)} = 1
and for l ≠ j, γ_nl → 0: a hard assignment, with γ_nj → 1 for a single cluster j.
Thus, for Σ_k = σ² I (spherical) and σ² → 0, GMM reduces to K-means.
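A small numerical check of this limit (illustrative code; the point and the two means are made up):

```python
# With spherical covariances sigma^2 * I, the responsibilities harden into a
# one-hot assignment as sigma^2 -> 0.
import numpy as np

x = np.array([0.9, 0.0])
mus = np.array([[0.0, 0.0], [2.0, 0.0]])   # x is closer to the first mean
pis = np.array([0.5, 0.5])
for sigma2 in [1.0, 0.1, 0.01]:
    logits = np.log(pis) - ((x - mus) ** 2).sum(axis=1) / (2 * sigma2)
    gamma = np.exp(logits - logits.max())  # subtract max for numerical stability
    gamma /= gamma.sum()
    print(sigma2, gamma)                   # tends to [1, 0] as sigma2 shrinks
```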

Next class: The Expectation Maximization Algorithm