Mixtures of Gaussians. Sargur Srihari


Mixtures of Gaussians. Sargur Srihari (srihari@cedar.buffalo.edu)

9. Mixture Models and EM
0. Mixture Models Overview
1. K-Means Clustering
2. Mixtures of Gaussians
3. An Alternative View of EM
4. The EM Algorithm in General

Topics in Mixtures of Gaussians
- Goal of Gaussian Mixture Modeling
- Latent Variables
- Maximum Likelihood
- EM for Gaussian Mixtures

Goal of Gaussian Mixture Modeling
A linear superposition of Gaussians of the form
  p(x) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)
Goal of modeling: find the maximum likelihood parameters π_k, µ_k, Σ_k.
Examples of data sets and models: 1-D data with K=2 subclasses; 2-D data with K=3.
Parameters of the 1-D, K=2 example:
  k    π     µ      σ
  1    0.4   28     0.48
  2    0.6   1.86   0.88
Each data point is associated with a subclass k with probability π_k.
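To make the superposition concrete, here is a minimal NumPy/SciPy sketch that evaluates p(x) for a small mixture; the component parameters below are illustrative assumptions, not the values from the example table. Code in Python:

import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, pis, mus, Sigmas):
    # p(x) = sum_k pi_k N(x | mu_k, Sigma_k)
    return sum(pi * multivariate_normal.pdf(x, mean=mu, cov=Sigma)
               for pi, mu, Sigma in zip(pis, mus, Sigmas))

# Illustrative 1-D mixture with K = 2 components (made-up parameters)
pis    = [0.4, 0.6]
mus    = [np.array([0.0]), np.array([4.0])]
Sigmas = [np.array([[1.0]]), np.array([[0.5]])]

print(gmm_density(np.array([1.0]), pis, mus, Sigmas))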

GMMs and Latent Variables
- A GMM is a linear superposition of Gaussian components.
- It provides a richer class of density models than the single Gaussian.
- We formulate the GMM in terms of discrete latent variables.
- This provides deeper insight into the distribution and serves to motivate the EM algorithm, which gives a maximum likelihood solution for the components' mixing coefficients, means and covariances.

Latent Variable Representation
Linear superposition of K Gaussians:
  p(x) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)
Introduce a K-dimensional binary latent variable z with a 1-of-K representation (one-hot vector): z = (z_1,..,z_K), whose elements satisfy z_k ∈ {0,1} and \sum_k z_k = 1.
There are K possible states of z, corresponding to the K components.
Example with K=2:
  k      1      2
  z      10     01
  π_k    0.4    0.6
  µ_k    28     1.86
  σ_k    0.48   0.88
Example with K=3:
  k    1     2     3
  z    100   010   001
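As a small illustration of the 1-of-K representation (an addition of this write-up, not part of the original slides), the sketch below draws a one-hot latent vector z with p(z_k = 1) = π_k. Code in Python:

import numpy as np

rng = np.random.default_rng(0)

def sample_z(pis):
    # Draw a 1-of-K (one-hot) latent vector z with p(z_k = 1) = pi_k
    K = len(pis)
    k = rng.choice(K, p=pis)      # pick a component index
    z = np.zeros(K, dtype=int)
    z[k] = 1                      # exactly one element equals 1
    return z

print(sample_z([0.4, 0.6]))       # e.g. [0 1]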

Joint Distribution
Define the joint distribution of the latent and observed variables: p(x, z) = p(x|z) p(z), where x is the observed variable and z is the hidden or missing variable.
We need to specify the marginal distribution p(z) and the conditional distribution p(x|z).

Graphical Representation of Mixture Model
The joint distribution p(x, z) is represented in the form p(z) p(x|z).
- Latent variable z = [z_1,..,z_K] represents the subclass.
- Observed variable x.
We now specify the marginal p(z) and the conditional p(x|z); using them we express p(x) in terms of observed and latent variables.

Specifying the marginal p(z)
Associate a probability with each component z_k: denote p(z_k = 1) = π_k, where the parameters {π_k} satisfy 0 ≤ π_k ≤ 1 and \sum_k \pi_k = 1.
Because z uses the 1-of-K representation, it follows that
  p(z) = \prod_{k=1}^{K} \pi_k^{z_k}
since z_k ∈ {0,1} and exactly one z_k equals 1: the state with z_k = 1 has probability π_k, which the product form expresses compactly.
With one component, p(z_1) = \pi_1^{z_1}; with two components, p(z_1, z_2) = \pi_1^{z_1} \pi_2^{z_2}.

Specifying the Conditional p(x|z)
For a particular component (value of z): p(x \mid z_k = 1) = \mathcal{N}(x \mid \mu_k, \Sigma_k).
Thus p(x|z) can be written in the form
  p(x \mid z) = \prod_{k=1}^{K} \mathcal{N}(x \mid \mu_k, \Sigma_k)^{z_k}
Due to the exponent z_k, all product terms except one are equal to one.

Marginal distribution p(x)
The joint distribution p(x, z) is given by p(z) p(x|z). Thus the marginal distribution of x is obtained by summing over all possible states of z:
  p(x) = \sum_z p(z)\, p(x \mid z) = \sum_z \prod_{k=1}^{K} \left[ \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k) \right]^{z_k} = \sum_{k=1}^{K} \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)
since z_k ∈ {0,1}. This is the standard form of a Gaussian mixture.

Value of Introducing the Latent Variable
Suppose we have observations x_1,..,x_N. Because the marginal distribution has the form p(x) = \sum_z p(x, z), it follows that for every observed data point x_n there is a corresponding latent vector z_n, i.e., its subclass.
Thus we have a formulation of the Gaussian mixture involving an explicit latent variable. We are now able to work with the joint distribution p(x, z) instead of the marginal p(x), which leads to significant simplification through the expectation maximization algorithm.

Another Conditional Probability (Responsibility)
In EM the posterior p(z|x) plays a central role. The probability p(z_k = 1 | x) is denoted γ(z_k). From Bayes' theorem:
  \gamma(z_k) \equiv p(z_k = 1 \mid x) = \frac{p(z_k = 1)\, p(x \mid z_k = 1)}{\sum_{j=1}^{K} p(z_j = 1)\, p(x \mid z_j = 1)} = \frac{\pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x \mid \mu_j, \Sigma_j)}
View π_k = p(z_k = 1) as the prior probability of component k, and γ(z_k) as the corresponding posterior probability once x has been observed. γ(z_k) is also the responsibility that component k takes for explaining the observation x.
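A minimal sketch of computing the responsibilities from Bayes' theorem; the 2-D parameters here are arbitrary assumptions for illustration. Code in Python:

import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(x, pis, mus, Sigmas):
    # gamma(z_k) = pi_k N(x|mu_k,Sigma_k) / sum_j pi_j N(x|mu_j,Sigma_j)
    weighted = np.array([pi * multivariate_normal.pdf(x, mean=mu, cov=S)
                         for pi, mu, S in zip(pis, mus, Sigmas)])
    return weighted / weighted.sum()

# Two arbitrary 2-D components
pis    = [0.5, 0.5]
mus    = [np.zeros(2), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), np.eye(2)]

print(responsibilities(np.array([2.5, 2.5]), pis, mus, Sigmas))  # mostly component 2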

Plan of Discussion
Next we look at:
1. How to synthetically generate data from a mixture model, and then
2. Given a data set {x_1,..,x_N}, how to model the data using a mixture of Gaussians.

Synthesizing Data from the Mixture
Use ancestral sampling: start with the lowest-numbered node and draw a sample from it, then move to the successor node and draw a sample given the parent value, etc. Here, generate a sample of z, called ẑ, and then generate a value for x from the conditional p(x | ẑ).
Samples from the joint p(x, z) are plotted according to the value of x and colored by the value of z; samples from the marginal p(x) are obtained by ignoring the values of z.
(Figure: 500 points from three Gaussians; complete data set vs. incomplete data set.)
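The following sketch implements ancestral sampling for a GMM under assumed parameters (the three Gaussians below are illustrative, not the ones in the figure): draw ẑ from p(z), then x from p(x | ẑ); keeping (x, ẑ) gives the complete data set, discarding ẑ gives the incomplete data set. Code in Python:

import numpy as np

rng = np.random.default_rng(0)

def sample_gmm(n, pis, mus, Sigmas):
    # Ancestral sampling: z_hat ~ p(z), then x ~ p(x | z_hat)
    z_hat = rng.choice(len(pis), size=n, p=pis)                       # component labels
    x = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z_hat])
    return x, z_hat

# Three assumed 2-D Gaussians
pis    = [0.3, 0.4, 0.3]
mus    = [np.array([0.0, 0.0]), np.array([4.0, 0.0]), np.array([2.0, 3.0])]
Sigmas = [0.5 * np.eye(2)] * 3

X, Z = sample_gmm(500, pis, mus, Sigmas)   # (X, Z): complete data; X alone: incomplete data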

Illustration of Responsibilities
Evaluate, for every data point, the posterior probability of each component: the responsibility γ(z_nk) associated with data point x_n.
Color each point using proportions of red, blue and green ink:
- If for a data point γ(z_n1) = 1, it is colored red.
- If for another point γ(z_n2) = γ(z_n3) = 0.5, it has equal blue and green and appears cyan.

Maximum Likelihood for GMM
We wish to model the data set {x_1,..,x_N} using a mixture of Gaussians (N items, each of dimension D).
- Represent the data by an N × D matrix X whose n-th row is x_n^T.
- Represent the latent variables by an N × K matrix Z whose n-th row is z_n^T.
The goal is to state the likelihood function so as to estimate the three sets of parameters by maximizing the likelihood.

Graphical Representation of GMM
For a set of i.i.d. data points {x_n} with corresponding latent points {z_n}, where n = 1,..,N.
(Figure: Bayesian network for p(x, z) using plate notation, with the N × D matrix X and the N × K matrix Z.)

Likelihood Function for GMM
The mixture density function is
  p(x) = \sum_z p(z)\, p(x \mid z) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)
Therefore the likelihood function is
  p(X \mid \pi, \mu, \Sigma) = \prod_{n=1}^{N} \sum_{k=1}^{K} \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)
since z takes values {z_k} with probabilities {π_k}, and the product is over the i.i.d. samples. Therefore the log-likelihood function is
  \ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)
which we wish to maximize, a more difficult problem than for a single Gaussian.
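A short sketch of evaluating this log-likelihood with NumPy/SciPy; the data and parameter values are placeholders, not anything from the lecture. Code in Python:

import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pis, mus, Sigmas):
    # ln p(X | pi, mu, Sigma) = sum_n ln sum_k pi_k N(x_n | mu_k, Sigma_k)
    N, K = X.shape[0], len(pis)
    dens = np.zeros((N, K))
    for k in range(K):
        dens[:, k] = pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
    return np.sum(np.log(dens.sum(axis=1)))

# Placeholder data and parameters (2-D, K = 2)
X = np.array([[0.1, 0.2], [3.0, 2.9], [0.4, -0.1]])
pis    = [0.5, 0.5]
mus    = [np.zeros(2), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), np.eye(2)]

print(gmm_log_likelihood(X, pis, mus, Sigmas))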

Maximization of the Log-Likelihood
  \ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)
The goal is to estimate the three sets of parameters π_k, µ_k, Σ_k by taking derivatives in turn w.r.t. each while keeping the others constant. But there are no closed-form solutions: the task is not straightforward because the summation over components appears inside the logarithm, so the logarithm no longer acts directly on the Gaussian. While gradient-based optimization is possible, we consider the iterative EM algorithm.

Some Issues with the GMM m.l.e.
Before proceeding with the m.l.e., we briefly mention two technical issues:
1. The problem of singularities with Gaussian mixtures
2. The problem of identifiability of mixtures

Problem of Singularities with Gaussian Mixtures
Consider a Gaussian mixture whose components have covariance matrices Σ_k = σ_k² I. A data point that falls exactly on a component mean, µ_j = x_n, contributes to the likelihood function a term
  \mathcal{N}(x_n \mid x_n, \sigma_j^2 I) = \frac{1}{(2\pi)^{1/2}} \frac{1}{\sigma_j}
since the exponential factor equals 1 when µ_j = x_n. As σ_j → 0 this term goes to infinity, so maximization of the log-likelihood
  \ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)
is not well posed.
- This does not happen with a single Gaussian: as it collapses onto one data point, the multiplicative factors from the other data points go to zero, so the overall likelihood goes to zero rather than infinity.
- In a mixture, one component can collapse onto a single point while another component assigns finite values to the remaining points.
- The problem does not arise in the Bayesian approach.
- In maximum likelihood it is avoided using heuristics, e.g., detecting when a component is collapsing and resetting its mean and covariance before continuing the optimization.
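A tiny numerical illustration of the singularity, with made-up 1-D data: one broad component keeps every point's density positive, while a second component sits exactly on one data point, so the log-likelihood grows without bound as that component's standard deviation shrinks. Code in Python:

import numpy as np
from scipy.stats import norm

x = np.array([0.0, 1.0, 2.0, 5.0])       # made-up 1-D data; the last point will be "hit"

for sigma in [1.0, 0.1, 0.01, 1e-4]:
    # component 1: broad, mean 1; component 2: mean exactly on x = 5, shrinking scale
    dens = 0.5 * norm.pdf(x, loc=1.0, scale=1.0) + 0.5 * norm.pdf(x, loc=5.0, scale=sigma)
    print(sigma, np.sum(np.log(dens)))    # log-likelihood increases without bound as sigma -> 0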

Problem of Identifiability
A density p(x|θ) is identifiable if, whenever θ ≠ θ', there is an x for which p(x|θ) ≠ p(x|θ').
A K-component mixture has a total of K! equivalent solutions, corresponding to the K! ways of assigning K sets of parameters to K components. E.g., for K=3, K! = 6: 123, 132, 213, 231, 312, 321.
For any given point in the space of parameter values there are a further K!−1 points all giving exactly the same distribution. However, any of the equivalent solutions is as good as any other.
(Figure: two ways of labeling three Gaussian subclasses, e.g., A B C vs. B A C.)
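As a quick check of this equivalence (with arbitrary 1-D parameters chosen for illustration), the sketch below relabels a K=3 mixture in all K! = 6 ways and confirms that the density is unchanged. Code in Python:

import numpy as np
from itertools import permutations
from scipy.stats import norm

# Arbitrary 1-D mixture with K = 3 components
pis  = np.array([0.2, 0.3, 0.5])
mus  = np.array([-1.0, 0.0, 2.0])
sigs = np.array([0.5, 1.0, 0.7])

x = np.linspace(-4.0, 5.0, 7)
base = (pis * norm.pdf(x[:, None], loc=mus, scale=sigs)).sum(axis=1)

# All 3! = 6 relabelings of the components give exactly the same density p(x)
for perm in permutations(range(3)):
    p = list(perm)
    relabeled = (pis[p] * norm.pdf(x[:, None], loc=mus[p], scale=sigs[p])).sum(axis=1)
    assert np.allclose(relabeled, base)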

EM for Gaussian Mixtures
EM is a method for finding maximum likelihood solutions for models with latent variables. Begin with the log-likelihood function
  \ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)
We wish to find the parameters π, µ, Σ that maximize this quantity. The task is not straightforward, since the summation over components appears inside the logarithm. We take derivatives in turn w.r.t.:
- the means µ_k, and set them to zero
- the covariance matrices Σ_k, and set them to zero
- the mixing coefficients π_k, and set them to zero

EM for GMM: Derivative w.r.t. µ_k
Begin with the log-likelihood function
  \ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)
Take the derivative w.r.t. the means µ_k and set it to zero, making use of the exponential form of the Gaussian and the formulas d/dx ln u = u'/u and d/dx e^u = e^u u'. We get
  0 = \sum_{n=1}^{N} \frac{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} \, \Sigma_k^{-1} (x_n - \mu_k)
where the fraction is γ(z_nk), the posterior probability, and Σ_k^{-1} is the inverse of the covariance matrix.

M.L.E. Solution for the Means
Multiplying by Σ_k (assuming non-singularity) and rearranging:
  \mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n, \quad \text{where we have defined } N_k = \sum_{n=1}^{N} \gamma(z_{nk})
The mean of the k-th Gaussian component is the weighted mean of all the points in the data set, where data point x_n is weighted by the posterior probability that component k was responsible for generating x_n. N_k is the effective number of points assigned to cluster k.

M.L.E. Solution for the Covariances
Set the derivative w.r.t. Σ_k to zero, making use of the m.l.e. solution for the covariance matrix of a single Gaussian:
  \Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k)(x_n - \mu_k)^T
This is similar to the result for a single Gaussian fitted to the data set, but with each data point weighted by the corresponding posterior probability. The denominator N_k is the effective number of points in the component.

M.L.E. Solution for the Mixing Coefficients
Maximize ln p(X|π, µ, Σ) w.r.t. π_k, taking into account that the mixing coefficients sum to one. This is achieved using a Lagrange multiplier and maximizing
  \ln p(X \mid \pi, \mu, \Sigma) + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right)
Setting the derivative w.r.t. π_k to zero and solving gives
  \pi_k = \frac{N_k}{N}

Summary of m.l.e. Expressions
GMM maximum likelihood parameter estimates:
- Means: \mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n
- Covariance matrices: \Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k)(x_n - \mu_k)^T
- Mixing coefficients: \pi_k = \frac{N_k}{N}, where N_k = \sum_{n=1}^{N} \gamma(z_{nk})
All three are in terms of the responsibilities, so we have not completely solved the problem.
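The three estimates translate directly into a single M-step routine; this is a sketch assuming the responsibilities are supplied as an N × K array named gamma (a name introduced here for illustration). Code in Python:

import numpy as np

def m_step(X, gamma):
    # Re-estimate pi_k, mu_k, Sigma_k from responsibilities gamma (N x K)
    N, D = X.shape
    Nk = gamma.sum(axis=0)                     # N_k: effective number of points per component
    mus = (gamma.T @ X) / Nk[:, None]          # mu_k = (1/N_k) sum_n gamma_nk x_n
    Sigmas = []
    for k in range(gamma.shape[1]):
        diff = X - mus[k]                      # (x_n - mu_k)
        Sigmas.append((gamma[:, k, None] * diff).T @ diff / Nk[k])
    pis = Nk / N                               # pi_k = N_k / N
    return pis, mus, np.array(Sigmas)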

EM Formulation
The results for µ_k, Σ_k, π_k are not closed-form solutions for the parameters, since the responsibilities γ(z_nk) depend on those parameters in a complex way. The results suggest an iterative solution, an instance of the EM algorithm for the particular case of the GMM.

Informal EM for GMM
First choose initial values for the means, covariances and mixing coefficients. Then alternate between the following two updates, called the E step and the M step:
- In the E step, use the current values of the parameters to evaluate the posterior probabilities, or responsibilities.
- In the M step, use these posterior probabilities to re-estimate the means, covariances and mixing coefficients.

EM Using the Old Faithful Data
(Figure panels: data points and initial mixture model; initial E step, determining responsibilities; after the first M step, re-evaluating parameters; after 2 cycles; after 5 cycles; after 20 cycles.)

Comparison with K-Means
(Figure: K-means result vs. EM result.)

Animation of EM for Old Faithful Data
http://en.wikipedia.org/wiki/File:Em_old_faithful.gif
Code in R:
# initial parameter estimates (chosen to be deliberately bad)
theta <- list(
  tau = c(0.5, 0.5),
  mu1 = c(2.8, 75),
  mu2 = c(3.6, 58),
  sigma1 = matrix(c(0.8, 7, 7, 70), ncol = 2),
  sigma2 = matrix(c(0.8, 7, 7, 70), ncol = 2)
)

Practical Issues with EM
- EM takes many more iterations than K-means, and each cycle requires significantly more computation.
- It is common to run K-means first in order to find a suitable initialization; the covariance matrices can be initialized to the covariances of the clusters found by K-means (a sketch of this follows below).
- EM is not guaranteed to find the global maximum of the log-likelihood function.
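One possible way to follow this advice with scikit-learn (an assumption of this write-up; the slides do not prescribe any particular library): run K-means, then read off initial means, cluster covariances and mixing coefficients. Code in Python:

import numpy as np
from sklearn.cluster import KMeans

def kmeans_init(X, K, seed=0):
    # Initialize GMM parameters from a K-means clustering of X
    km = KMeans(n_clusters=K, n_init=10, random_state=seed).fit(X)
    mus = km.cluster_centers_
    pis, Sigmas = [], []
    for k in range(K):
        members = X[km.labels_ == k]
        pis.append(len(members) / len(X))
        # cluster covariance plus a small ridge to keep it well-conditioned
        Sigmas.append(np.cov(members, rowvar=False) + 1e-6 * np.eye(X.shape[1]))
    return np.array(pis), mus, np.array(Sigmas)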

Summary of EM for GMM
Given a Gaussian mixture model, the goal is to maximize the likelihood function w.r.t. the parameters (the means, covariances and mixing coefficients).
Step 1: Initialize the means µ_k, covariances Σ_k and mixing coefficients π_k, and evaluate the initial value of the log-likelihood.

EM Continued
Step 2 (E step): Evaluate the responsibilities using the current parameter values:
  \gamma(z_{nk}) = \frac{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}
Step 3 (M step): Re-estimate the parameters using the current responsibilities:
  \mu_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n
  \Sigma_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k^{new})(x_n - \mu_k^{new})^T
  \pi_k^{new} = \frac{N_k}{N}, \quad \text{where } N_k = \sum_{n=1}^{N} \gamma(z_{nk})

EM Continued
Step 4: Evaluate the log-likelihood
  \ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)
and check for convergence of either the parameters or the log-likelihood. If the convergence criterion is not satisfied, return to Step 2.
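Putting Steps 1-4 together, here is a compact NumPy/SciPy sketch of the whole loop, under simple assumptions introduced here (random data points as initial means, a shared initial covariance, and a small ridge on each covariance for numerical stability); it is an illustration, not Bishop's or the lecture's reference implementation. Code in Python:

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, max_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Step 1: initialize means, covariances and mixing coefficients
    mus = X[rng.choice(N, size=K, replace=False)].copy()
    Sigmas = np.array([np.cov(X, rowvar=False) for _ in range(K)])
    pis = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(max_iter):
        # Step 2 (E step): responsibilities gamma_nk under the current parameters
        dens = np.array([pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
                         for k in range(K)]).T                       # shape (N, K)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # Step 4 (evaluated here for convenience): log-likelihood and convergence check
        ll = np.sum(np.log(dens.sum(axis=1)))
        if ll - prev_ll < tol:
            break
        prev_ll = ll
        # Step 3 (M step): re-estimate the parameters from the responsibilities
        Nk = gamma.sum(axis=0)
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        pis = Nk / N
    return pis, mus, Sigmas, gamma

# Usage on synthetic 2-D data (assumed, for illustration only)
rng = np.random.default_rng(1)
X = np.vstack([rng.multivariate_normal([0, 0], np.eye(2), 200),
               rng.multivariate_normal([4, 4], np.eye(2), 300)])
pis, mus, Sigmas, gamma = em_gmm(X, K=2)
print(pis, mus, sep="\n")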