Gaussian Mixture Models

Gaussian Mixture Models Pradeep Ravikumar Co-instructor: Manuela Veloso Machine Learning 10-701 Some slides courtesy of Eric Xing, Carlos Guestrin

(One) bad case for K-means: clusters may overlap, some clusters may be wider than others, and clusters may not be linearly separable.

Partitioning Algorithms. K-means: hard assignment, each object belongs to only one cluster. Mixture modeling: soft assignment, probability that an object belongs to a cluster. Generative approach: think of each cluster as a component distribution, and any data point is drawn from a mixture of multiple component distributions.

Gaussian Mixture Model. Mixture of K Gaussian distributions (a multi-modal distribution), each with spherical covariance:

p(x | y = i) ~ N(µ_i, σ² I),    p(x) = Σ_i p(x | y = i) P(y = i)

where p(x | y = i) is the mixture component and P(y = i) is the mixture proportion. [Figure: three components with means µ_1, µ_2, µ_3.]

General GMM. GMM = Gaussian Mixture Model (a multi-modal distribution), now with a full covariance per component:

p(x | y = i) ~ N(µ_i, Σ_i),    p(x) = Σ_i p(x | y = i) P(y = i)

where p(x | y = i) is the mixture component and P(y = i) is the mixture proportion.

General GMM. GMM = Gaussian Mixture Model (a multi-modal distribution). There are k components. Component i has an associated mean vector µ_i. Each component generates data from a Gaussian with mean µ_i and covariance matrix Σ_i. Each data point is generated according to the following recipe (see the sketch below):
1) Pick a component at random: choose component i with probability P(y = i).
2) Draw a data point x ~ N(µ_i, Σ_i).
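
A minimal sketch of this generative recipe in Python. The mixture parameters (weights, means, covs) below are illustrative placeholders, not values from the slides:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical 3-component GMM in 2D (illustrative values only).
    weights = np.array([0.5, 0.3, 0.2])                 # mixture proportions P(y = i)
    means = np.array([[0., 0.], [4., 4.], [-4., 3.]])   # component means mu_i
    covs = np.array([np.eye(2),
                     [[2.0, 0.5], [0.5, 1.0]],
                     [[1.0, -0.3], [-0.3, 2.0]]])       # component covariances Sigma_i

    def sample_gmm(n):
        # 1) Pick a component at random with probability P(y = i)
        ys = rng.choice(len(weights), size=n, p=weights)
        # 2) Draw x ~ N(mu_i, Sigma_i) from the chosen component
        xs = np.stack([rng.multivariate_normal(means[i], covs[i]) for i in ys])
        return xs, ys

    X, y = sample_gmm(500)   # 500 points drawn from the mixture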

General GMM. With p(x | y = i) ~ N(µ_i, Σ_i), the Gaussian Bayes classifier compares two components via the log-odds:

log [ P(y = i | x) / P(y = j | x) ] = log [ p(x | y = i) P(y = i) / ( p(x | y = j) P(y = j) ) ] = x^T W x + w^T x + const

Quadratic decision boundary: the second-order terms don't cancel out. The boundaries depend on µ_1, µ_2, ..., µ_K, Σ_1, Σ_2, ..., Σ_K, P(y = 1), ..., P(y = k).
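
Writing out the two Gaussian densities makes the quadratic form explicit (a standard expansion; the slide only names W and w):

\[
\log\frac{p(x\mid y=i)\,P(y=i)}{p(x\mid y=j)\,P(y=j)}
= -\tfrac{1}{2}\,x^{\top}\big(\Sigma_i^{-1}-\Sigma_j^{-1}\big)\,x
+ \big(\Sigma_i^{-1}\mu_i-\Sigma_j^{-1}\mu_j\big)^{\top} x + c_{ij},
\]
\[
c_{ij} = -\tfrac{1}{2}\mu_i^{\top}\Sigma_i^{-1}\mu_i + \tfrac{1}{2}\mu_j^{\top}\Sigma_j^{-1}\mu_j
- \tfrac{1}{2}\log\frac{|\Sigma_i|}{|\Sigma_j|} + \log\frac{P(y=i)}{P(y=j)},
\]

so W = -(1/2)(Σ_i^{-1} - Σ_j^{-1}) and w = Σ_i^{-1} µ_i - Σ_j^{-1} µ_j; when Σ_i = Σ_j the quadratic term vanishes and the boundary becomes linear.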

Learning General GMM. Given data x_1, ..., x_m drawn from the Gaussian mixture model

p(x) = Σ_{i=1}^{k} p(x | Y = i) P(Y = i),    p(x | Y = i) ~ N(µ_i, Σ_i)

where p(x | Y = i) is the mixture component and P(Y = i) = p_i is the mixture proportion. Parameters: {p_i, µ_i, Σ_i}_{i=1}^{K}. How to estimate the parameters? Maximum likelihood, but we don't know the labels Y (recall the Gaussian Bayes classifier).

Learning General GMM. Maximize the marginal likelihood:

argmax Π_j P(x_j) = argmax Π_j Σ_{i=1}^{K} P(y_j = i, x_j) = argmax Π_j Σ_{i=1}^{K} P(y_j = i) p(x_j | y_j = i)    (marginalizing out y_j)

where P(y_j = i) = P(y = i): mixture component i is chosen with probability P(y = i). Plugging in the Gaussians:

argmax Π_{j=1}^{m} Σ_{i=1}^{K} P(y = i) · (1 / sqrt(det(2π Σ_i))) · exp( -(1/2) (x_j - µ_i)^T Σ_i^{-1} (x_j - µ_i) )

How do we find the µ_i's and P(y = i)'s which give the maximum marginal likelihood?
* Set ∂ log Prob(.) / ∂µ_i = 0 and solve for the µ_i's: non-linear, not analytically solvable.
* Use gradient descent: doable, but often slow.
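
A minimal sketch of evaluating this marginal log-likelihood in Python (using scipy; the function and argument names are illustrative):

    import numpy as np
    from scipy.stats import multivariate_normal
    from scipy.special import logsumexp

    def gmm_log_likelihood(X, weights, means, covs):
        # Sum over data points of log sum_i P(y = i) * N(x_j; mu_i, Sigma_i),
        # computed in log space for numerical stability.
        m, k = X.shape[0], len(weights)
        log_joint = np.empty((m, k))
        for i in range(k):
            log_joint[:, i] = np.log(weights[i]) + multivariate_normal.logpdf(X, means[i], covs[i])
        return logsumexp(log_joint, axis=1).sum()

For example, gmm_log_likelihood(X, weights, means, covs) with the synthetic data and parameters from the sampling sketch above evaluates the objective being maximized here.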

GMM vs. k-means. GMM maximizes the marginal likelihood:

argmax Π_j P(x_j) = argmax Π_j Σ_{i=1}^{K} P(y_j = i, x_j) = argmax Π_j Σ_{i=1}^{K} P(y_j = i) p(x_j | y_j = i)

K-means instead commits to a hard assignment C(j) for each point:

argmax Π_j P(x_j) ≈ argmax Π_j p(x_j | y_j = C(j))

Same as k-means: with spherical, equal-variance components this argmax becomes an argmin of squared distances (see below).
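
Spelling out that last step (assuming spherical components p(x | y = i) = N(µ_i, σ²I) with σ fixed, as in k-means):

\[
\arg\max_{C,\,\mu}\ \prod_{j=1}^{m} p\big(x_j \mid y_j = C(j)\big)
= \arg\max_{C,\,\mu}\ \sum_{j=1}^{m} -\frac{\lVert x_j - \mu_{C(j)}\rVert^2}{2\sigma^2}
= \arg\min_{C,\,\mu}\ \sum_{j=1}^{m} \lVert x_j - \mu_{C(j)}\rVert^2,
\]

which is exactly the k-means objective.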

Expectation-Maximization (EM). A general algorithm to deal with hidden data; we will study it in the context of unsupervised learning (hidden labels) first. No need to choose a step size as in gradient methods. EM is an iterative algorithm with two linked steps:
E-step: fill in the hidden data (Y) using inference.
M-step: apply standard MLE/MAP methods to estimate the parameters {p_i, µ_i, Σ_i}_{i=1}^{k}.
We will see that this procedure monotonically improves the likelihood (or leaves it unchanged), so it always converges to a local optimum of the likelihood.

EM for spherical, same-variance GMMs

E-step: compute the expected class of every data point, for each class:

P(y = i | x_j, µ_1 ... µ_k) ∝ exp( -||x_j - µ_i||² / 2σ² ) P(y = i)

In the K-means "E-step" we do hard assignment; EM does soft assignment.

M-step: compute the MLE for µ given our data's class membership distributions (weights):

µ_i = Σ_{j=1}^{m} P(y = i | x_j) x_j / Σ_{j=1}^{m} P(y = i | x_j)

Similar to K-means, but with weighted data. Iterate.
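
A minimal sketch of these two updates in Python (spherical components with a fixed, shared σ; all names are illustrative):

    import numpy as np
    from scipy.special import logsumexp

    def e_step_spherical(X, means, weights, sigma):
        # Soft assignments: P(y = i | x_j) proportional to exp(-||x_j - mu_i||^2 / (2 sigma^2)) * P(y = i)
        sq_dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)   # (m, k)
        log_resp = -sq_dists / (2 * sigma ** 2) + np.log(weights)
        return np.exp(log_resp - logsumexp(log_resp, axis=1, keepdims=True))

    def m_step_spherical(X, resp):
        # Weighted mean of the data: mu_i = sum_j P(y=i|x_j) x_j / sum_j P(y=i|x_j)
        return (resp.T @ X) / resp.sum(axis=0)[:, None]

K-means would instead take the argmax over components in the E-step (hard assignment); EM keeps the full soft assignment.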

EM for general GMMs. Iterate. On iteration t let our estimates be

λ_t = { µ_1^(t), µ_2^(t), ..., µ_k^(t), Σ_1^(t), Σ_2^(t), ..., Σ_k^(t), p_1^(t), p_2^(t), ..., p_k^(t) }

E-step: compute the expected class of every data point, for each class:

P(y = i | x_j, λ_t) ∝ p_i^(t) p(x_j | µ_i^(t), Σ_i^(t))

p_i^(t) is shorthand for the estimate of P(y = i) on the t-th iteration; p(x_j | µ_i^(t), Σ_i^(t)) is just a Gaussian evaluated at x_j.

M-step: compute MLEs given our data's class membership distributions (weights):

µ_i^(t+1) = Σ_j P(y = i | x_j, λ_t) x_j / Σ_j P(y = i | x_j, λ_t)

Σ_i^(t+1) = Σ_j P(y = i | x_j, λ_t) (x_j - µ_i^(t+1)) (x_j - µ_i^(t+1))^T / Σ_j P(y = i | x_j, λ_t)

p_i^(t+1) = Σ_j P(y = i | x_j, λ_t) / m,    where m = #data points
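
A minimal sketch of one full EM iteration for a general GMM in Python (building on the earlier snippets; all names are illustrative, not a reference implementation):

    import numpy as np
    from scipy.stats import multivariate_normal
    from scipy.special import logsumexp

    def em_step(X, weights, means, covs):
        m, d = X.shape
        k = len(weights)

        # E-step: responsibilities P(y = i | x_j, lambda_t) proportional to p_i * N(x_j; mu_i, Sigma_i)
        log_joint = np.empty((m, k))
        for i in range(k):
            log_joint[:, i] = np.log(weights[i]) + multivariate_normal.logpdf(X, means[i], covs[i])
        resp = np.exp(log_joint - logsumexp(log_joint, axis=1, keepdims=True))   # (m, k)

        # M-step: weighted MLEs
        Nk = resp.sum(axis=0)                       # effective number of points per component
        new_weights = Nk / m                        # p_i^(t+1)
        new_means = (resp.T @ X) / Nk[:, None]      # mu_i^(t+1)
        new_covs = np.empty((k, d, d))
        for i in range(k):
            diff = X - new_means[i]
            new_covs[i] = (resp[:, i, None] * diff).T @ diff / Nk[i]
            new_covs[i] += 1e-6 * np.eye(d)         # small ridge for numerical stability
        return new_weights, new_means, new_covs

Iterating em_step until the marginal log-likelihood (e.g., gmm_log_likelihood above) stops improving gives the EM fit.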

EM for general GMMs: Example. [Figure: three components with means µ_1, µ_2, µ_3 and covariances Σ_1, Σ_2, Σ_3; each point x_j is shaded by its responsibilities P(y = i | x_j, µ_1, µ_2, µ_3, Σ_1, Σ_2, Σ_3, p_1, p_2, p_3).]

[Figures: the fitted mixture after the 1st, 2nd, 3rd, 4th, 5th, 6th, and 20th EM iterations.]

Example: GMM clustering

General GMM. GMM = Gaussian Mixture Model (a multi-modal distribution):

p(x) = Σ_i p(x | y = i) P(y = i),    p(x | y = i) ~ N(µ_i, Σ_i)

where p(x | y = i) is the mixture component and P(y = i) is the mixture proportion.

Resulting Density Estimator

General EM algorithm. Marginal likelihood, where x is observed and z is missing:

log Π_{j=1}^{m} P(x_j | θ) = Σ_{j=1}^{m} log Σ_z P(z, x_j | θ)

How do we maximize this marginal likelihood using EM?

Lower-bound on marginal likelihood: a variational approach. Jensen's inequality: for a distribution P(z) and a function f(z),

log Σ_z P(z) f(z) ≥ Σ_z P(z) log f(z)

because log is a concave function: log(a x + (1-a) y) ≥ a log(x) + (1-a) log(y). [Figure: the chord of the log curve between x and y lies below the curve.]
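
A quick numeric check of the inequality (with a = 1/2, x = 1, y = 4, natural log):

\[
\log\!\big(\tfrac{1}{2}\cdot 1 + \tfrac{1}{2}\cdot 4\big) = \log 2.5 \approx 0.916
\;\ge\;
\tfrac{1}{2}\log 1 + \tfrac{1}{2}\log 4 \approx 0.693 .
\]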

Lower-bound on marginal likelihood. Apply Jensen's inequality with any distribution Q(z | x_j) over the missing data:

Σ_j log P(x_j | θ) = Σ_j log Σ_z Q(z | x_j) [ P(z, x_j | θ) / Q(z | x_j) ]
                   ≥ Σ_j Σ_z Q(z | x_j) log [ P(z, x_j | θ) / Q(z | x_j) ] =: F(θ, Q)

EM as Coordinate Ascent on F(θ, Q).
E-step: Fix θ, maximize F over Q:  Q^{t+1} = argmax_Q F(θ^t, Q)
M-step: Fix Q, maximize F over θ:  θ^{t+1} = argmax_θ F(θ, Q^{t+1})

Convergence of EM. [Figure: the marginal likelihood function together with the sequence of EM lower-bound F-functions F(θ, Q^{t+1}).] EM monotonically converges to a local maximum of the likelihood!
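
Why the convergence is monotone (a standard argument, sketched here; ℓ denotes the marginal log-likelihood):

\[
\ell(\theta^{t+1}) \;\ge\; F(\theta^{t+1}, Q^{t+1}) \;\ge\; F(\theta^{t}, Q^{t+1}) \;=\; \ell(\theta^{t}),
\]

where the first inequality holds because F lower-bounds the marginal log-likelihood for any Q, the second because the M-step maximizes F over θ, and the final equality because the E-step sets Q^{t+1}(z | x_j) = P(z | x_j, θ^t), which closes the gap.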

EM & Local Maxima. [Figure: a typical likelihood function, with different sequences of EM lower-bound F-functions depending on the initialization.] Use multiple, randomized initializations in practice.

EM as Coordinate Ascent on F(θ, Q).
E-step: Fix θ, maximize F over Q:  Q^{t+1} = argmax_Q F(θ^t, Q)
M-step: Fix Q, maximize F over θ:  θ^{t+1} = argmax_θ F(θ, Q^{t+1})

E-step. E-step: Fix θ, maximize F over Q. Decompose the bound (using Σ_z Q(z | x_j) = 1):

F(θ, Q) = Σ_j Σ_z Q(z | x_j) log [ P(z | x_j, θ) P(x_j | θ) / Q(z | x_j) ]
        = Σ_j log P(x_j | θ) - Σ_j Σ_z Q(z | x_j) log [ Q(z | x_j) / P(z | x_j, θ) ]

The second term is a KL divergence between two distributions, KL( Q(z | x_j) || P(z | x_j, θ) ).

E-step. E-step: Fix θ, maximize F over Q:

F(θ, Q) = Σ_j log P(x_j | θ) - Σ_j KL( Q(z | x_j) || P(z | x_j, θ) )

Since KL ≥ 0, the expression above is maximized when the KL divergence is 0, and KL(Q, P) = 0 iff Q = P. Therefore, the E-step sets Q^{t+1}(z | x_j) = P(z | x_j, θ^t).

E-step. E-step: Fix θ, maximize F over Q:

Q^{t+1}(z | x_j) = P(z | x_j, θ^t)

i.e., compute the probability of the missing data z given the current choice of θ. This re-aligns F with the marginal likelihood: F(θ^t, Q^{t+1}) = Σ_j log P(x_j | θ^t).

M-step. M-step: Fix Q, maximize F over θ:

F(θ, Q^{(t+1)}) = Σ_{j=1}^{m} Σ_z Q^{(t+1)}(z | x_j) log [ P(z, x_j | θ) / Q^{(t+1)}(z | x_j) ]
               = Σ_{j=1}^{m} Σ_z Q^{(t+1)}(z | x_j) log P(z, x_j | θ) + Σ_{j=1}^{m} H( Q^{(t+1)}(z | x_j) )

The entropy term is fixed (independent of θ). The first term is the expected (w.r.t. Q) log-likelihood we would have if z were known.

M-step. M-step: Fix Q, maximize F over θ:

F(θ, Q^{(t+1)}) = Σ_{j=1}^{m} E_{Q^{(t+1)}(z | x_j)}[ log P(z, x_j | θ) ] + Σ_{j=1}^{m} H( Q^{(t+1)}(z | x_j) )

The entropy term is fixed (independent of θ), so the M-step maximizes the expected log-likelihood with respect to Q^{(t+1)} over θ.
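
Specializing the M-step to the GMM (a standard derivation; here the hidden variable z is the component label y):

\[
\theta^{t+1} = \arg\max_{\theta}\ \sum_{j=1}^{m}\sum_{i=1}^{k}
Q^{(t+1)}(i \mid x_j)\,\Big[\log p_i + \log N(x_j;\ \mu_i, \Sigma_i)\Big].
\]

Setting the gradients to zero (with a Lagrange multiplier enforcing Σ_i p_i = 1) recovers exactly the weighted-mean, weighted-covariance, and mixing-proportion updates given earlier for general GMMs.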

EM as Coordinate Ascent.
E-step: Fix θ, maximize F over Q:  Q^{t+1} = argmax_Q F(θ^t, Q). Compute the probability of the missing data given the current choice of θ, e.g. P(y = i | x_j, µ_t).
M-step: Fix Q, maximize F over θ:  θ^{t+1} = argmax_θ F(θ, Q^{t+1}). Compute the estimate of θ by maximizing the lower bound on the marginal likelihood using Q^{(t+1)}(z | x_j).

Summary: EM Algorithm

A way of maximizing the likelihood function for hidden-variable models. Finds the MLE of the parameters when the original (hard) problem can be broken up into two (easy) pieces:
1. Estimate some missing or unobserved data from the observed data and current parameters.
2. Using this completed data, find the maximum likelihood parameter estimates.

Alternate between filling in the latent variables using the best guess (posterior) and updating the parameters based on this guess:
1. E-step: Q^{t+1} = argmax_Q F(θ^t, Q)
2. M-step: θ^{t+1} = argmax_θ F(θ, Q^{t+1})

In the M-step we optimize a lower bound on the likelihood. In the E-step we close the gap, making bound = likelihood. EM performs coordinate ascent on F and can get stuck in local optima, BUT it is extremely popular in practice.
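
In practice, libraries implement exactly this EM loop; a brief usage sketch with scikit-learn's GaussianMixture (here fit to the synthetic data X from the sampling snippet above):

    from sklearn.mixture import GaussianMixture

    # Fit a 3-component GMM by EM, with several random restarts to guard against local optima.
    gmm = GaussianMixture(n_components=3, covariance_type='full', n_init=10, random_state=0)
    gmm.fit(X)                        # X: (m, d) data matrix

    soft = gmm.predict_proba(X)       # E-step responsibilities P(y = i | x_j)
    hard = gmm.predict(X)             # hard cluster assignments (argmax of the above)
    print(gmm.weights_, gmm.means_)   # learned mixture proportions and component means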