Computer Vision Group Prof. Daniel Cremers. 6. Mixture Models and Expectation-Maximization


Motivation
Often the introduction of latent (unobserved) random variables into a model can help to express complex (marginal) distributions. A very common example is the mixture model, in particular the Gaussian mixture model (GMM). Mixture models can be used for clustering (unsupervised learning) and to express more complex probability distributions. As we will see, the parameters of mixture models can be estimated with maximum-likelihood techniques such as expectation-maximization.

K-means Clustering
Given: a data set $\{x_1,\dots,x_N\}$ and the number of clusters $K$.
Goal: find cluster centers $\{\mu_1,\dots,\mu_K\}$ so that
$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\, \lVert x_n - \mu_k \rVert^2$
is minimal, where $r_{nk} = 1$ if $x_n$ is assigned to cluster $k$ (and $0$ otherwise).
Idea: compute the $r_{nk}$ and $\mu_k$ iteratively:
- Start with some values for the cluster centers $\mu_k$
- Find the optimal assignments $r_{nk}$
- Update the cluster centers using these assignments
- Repeat until the assignments or centers do not change

K-means Clustering
Initialize the cluster means $\{\mu_1,\dots,\mu_K\}$.

K-means Clustering
Find the optimal assignments:
$r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_j \lVert x_n - \mu_j \rVert \\ 0 & \text{otherwise} \end{cases}$

K-means Clustering
Find the new optimal means:
$\frac{\partial J}{\partial \mu_k} = 2 \sum_{n=1}^{N} r_{nk}(x_n - \mu_k) \overset{!}{=} 0 \;\Rightarrow\; \mu_k = \frac{\sum_{n=1}^{N} r_{nk}\, x_n}{\sum_{n=1}^{N} r_{nk}}$

K-means Clustering
Find the new optimal assignments:
$r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_j \lVert x_n - \mu_j \rVert \\ 0 & \text{otherwise} \end{cases}$

K-means Clustering
Iterate these steps until the means and assignments do not change any more.
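
To make the iteration concrete, here is a minimal NumPy sketch of the procedure above (the function name, the random initialization from data points, and the handling of empty clusters are my own choices, not from the slides):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-means: X is an (N, D) data matrix, K the number of clusters."""
    rng = np.random.default_rng(seed)
    # Start with some values for the cluster centers (here: K random data points)
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Assignment step: r_nk = 1 for the nearest center
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)  # (N, K)
        r = np.argmin(dists, axis=1)                                    # (N,)
        # Update step: mu_k = mean of the points assigned to cluster k
        new_mu = np.array([X[r == k].mean(axis=0) if np.any(r == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):   # centers (and hence assignments) no longer change
            break
        mu = new_mu
    return mu, r
```

For the 2D example on the following slides, this would be called as mu, r = kmeans(X, K=2).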

2D Example
Real data set, random initialization; the magenta line is the decision boundary.

The Cost Function
After every step the cost function J is minimized (blue steps: update of the assignments; red steps: update of the means). Convergence after 4 rounds.

K-means for Segmentation

K-Means: Additional Remarks
K-means always converges, but the minimum is not guaranteed to be a global one.
There is an online version of K-means: after each addition of a point $x_n$, the nearest center $\mu_k$ is updated as
$\mu_k^{\text{new}} = \mu_k^{\text{old}} + \eta_n (x_n - \mu_k^{\text{old}})$
The K-medoid variant replaces the squared Euclidean distance by a general dissimilarity measure $\mathcal{V}$:
$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\, \mathcal{V}(x_n, \mu_k)$
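
A minimal sketch of the online update; the decaying learning rate $\eta_n = 1/N_k$ is one common choice, not prescribed by the slide:

```python
import numpy as np

def online_kmeans_update(mu, x, counts):
    """Sequential K-means: move the nearest center towards the newly added point x.

    mu: (K, D) current centers; counts: (K,) number of points absorbed per center."""
    k = int(np.argmin(np.linalg.norm(mu - x, axis=1)))   # nearest center
    counts[k] += 1
    eta = 1.0 / counts[k]                # decaying learning rate (my choice)
    mu[k] = mu[k] + eta * (x - mu[k])    # mu_new = mu_old + eta * (x - mu_old)
    return mu, counts
```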

Mixtures of Gaussians
Assume that the data consists of K clusters and that the data within each cluster is Gaussian. For any data point x we introduce a K-dimensional binary random variable z so that
$p(x) = \sum_{k=1}^{K} \underbrace{p(z_k = 1)}_{=:\,\pi_k}\, \mathcal{N}(x \mid \mu_k, \Sigma_k)$
where $z_k \in \{0,1\}$ and $\sum_{k=1}^{K} z_k = 1$.
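
To illustrate this latent-variable view, the following sketch samples from a GMM by first drawing the indicator z and then x given z; the parameter values are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-D GMM with K = 3 components (made-up parameters)
pi = np.array([0.5, 0.3, 0.2])                        # mixing coefficients pi_k
mus = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 2.0]])
Sigmas = np.array([np.eye(2), 0.5 * np.eye(2), np.diag([2.0, 0.3])])

def sample_gmm(n):
    # Draw the latent indicator z_n (which component), then x_n given z_n
    z = rng.choice(len(pi), size=n, p=pi)
    X = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
    return X, z

X, z = sample_gmm(500)   # 500 samples plus their (normally unobserved) component labels
```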

A Simple Example
A mixture of three Gaussians with given mixing coefficients. Left: all three Gaussians as a contour plot. Right: samples from the mixture model; the red component has the most samples.

Parameter Estimation
From a given set of training data $\{x_1,\dots,x_N\}$ we want to find the parameters $(\pi_{1,\dots,K}, \mu_{1,\dots,K}, \Sigma_{1,\dots,K})$ so that the likelihood is maximized (MLE):
$p(x_1,\dots,x_N \mid \pi, \mu, \Sigma) = \prod_{n=1}^{N} \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$
or, applying the logarithm:
$\log p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$
However, this is not as easy as maximum likelihood for single Gaussians!
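
A small sketch of how this log-likelihood can be evaluated numerically (using scipy and a log-sum-exp over the components for numerical stability; the function name is mine):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pi, mus, Sigmas):
    """log p(X | pi, mu, Sigma) = sum_n log sum_k pi_k N(x_n | mu_k, Sigma_k)."""
    K = len(pi)
    # log(pi_k) + log N(x_n | mu_k, Sigma_k) for every point n and component k
    log_terms = np.stack([np.log(pi[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
                          for k in range(K)], axis=1)   # shape (N, K)
    return logsumexp(log_terms, axis=1).sum()           # sum over the data points
```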

Problems with MLE for Gaussian Mixtures
Assume that for one component $k$ the mean $\mu_k$ lies exactly at a data point: $\mu_k = x_n$. For simplicity, assume that $\Sigma_k = \sigma_k^2 I$. Then:
$\mathcal{N}(x_n \mid x_n, \sigma_k^2 I) = \frac{1}{(2\pi)^{D/2}\, \sigma_k^{D}}$
This means that the overall log-likelihood can be made arbitrarily large by letting $\sigma_k \to 0$ (overfitting).
Another problem is identifiability: the order of the Gaussians is not fixed, therefore there are $K!$ equivalent solutions to the MLE problem.
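
A tiny numeric illustration of this degeneracy, with made-up one-dimensional data and one component pinned to a data point:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

X = np.array([0.0, 1.0, 2.5, 4.0])     # made-up 1-D data
pi = np.array([0.5, 0.5])
mu = np.array([2.0, X[0]])             # the second mean sits exactly on a data point

for sigma2 in [1.0, 1e-2, 1e-4, 1e-8]: # shrink the variance of the second component
    sig = np.array([1.0, np.sqrt(sigma2)])
    log_terms = np.log(pi) + norm.logpdf(X[:, None], mu[None, :], sig[None, :])
    print(sigma2, logsumexp(log_terms, axis=1).sum())  # the log-likelihood grows without bound
```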

Overfitting with MLE for Gaussian Mixtures
One Gaussian fits exactly to one data point. It has a very small variance, i.e. it contributes strongly to the overall likelihood. In standard MLE, there is no way to avoid this!

Expectation-Maximization
EM is an elegant and powerful method for MLE problems with latent variables. Main idea: model parameters and latent variables are estimated iteratively, where we average over the latent variables (expectation). A typical example application of EM is the Gaussian mixture model (GMM), but EM has many other applications. First, we consider EM for GMMs.

Expectation-Maximization for GMM
First, we define the responsibilities:
$\gamma(z_{nk}) = p(z_{nk} = 1 \mid x_n)$
where $z_{nk} \in \{0,1\}$ and $\sum_k z_{nk} = 1$.

Expectation-Maximization for GMM
First, we define the responsibilities:
$\gamma(z_{nk}) = p(z_{nk} = 1 \mid x_n) = \frac{\pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$
Next, we take the derivative of the log-likelihood with respect to $\mu_k$ and set it to zero:
$\frac{\partial \log p(X \mid \pi, \mu, \Sigma)}{\partial \mu_k} \overset{!}{=} 0$

Expectation-Maximization for GMM
First, we define the responsibilities:
$\gamma(z_{nk}) = p(z_{nk} = 1 \mid x_n) = \frac{\pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$
Next, we take the derivative of the log-likelihood with respect to $\mu_k$ and set it to zero:
$\frac{\partial \log p(X \mid \pi, \mu, \Sigma)}{\partial \mu_k} \overset{!}{=} 0$
and we obtain:
$\mu_k = \frac{\sum_{n=1}^{N} \gamma(z_{nk})\, x_n}{\sum_{n=1}^{N} \gamma(z_{nk})}$

Expectation-Maximization for GMM
We can do the same for the covariances:
$\frac{\partial \log p(X \mid \pi, \mu, \Sigma)}{\partial \Sigma_k} \overset{!}{=} 0$
and we obtain:
$\Sigma_k = \frac{\sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k)(x_n - \mu_k)^T}{\sum_{n=1}^{N} \gamma(z_{nk})}$
Finally, we take the derivative with respect to the mixing coefficients $\pi_k$, subject to $\sum_{k=1}^{K} \pi_k = 1$:
$\frac{\partial \log p(X \mid \pi, \mu, \Sigma)}{\partial \pi_k} \overset{!}{=} 0$

Expectation-Maximization for GMM
We can do the same for the covariances:
$\frac{\partial \log p(X \mid \pi, \mu, \Sigma)}{\partial \Sigma_k} \overset{!}{=} 0$
and we obtain:
$\Sigma_k = \frac{\sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k)(x_n - \mu_k)^T}{\sum_{n=1}^{N} \gamma(z_{nk})}$
Finally, we take the derivative with respect to the mixing coefficients $\pi_k$, subject to $\sum_{k=1}^{K} \pi_k = 1$:
$\frac{\partial \log p(X \mid \pi, \mu, \Sigma)}{\partial \pi_k} \overset{!}{=} 0$
and the result is:
$\pi_k = \frac{1}{N} \sum_{n=1}^{N} \gamma(z_{nk})$

Algorithm Summary
1. Initialize the means $\mu_k$, covariance matrices $\Sigma_k$, and mixing coefficients $\pi_k$.
2. Compute the initial log-likelihood $\log p(X \mid \pi, \mu, \Sigma)$.
3. E-Step. Compute the responsibilities:
$\gamma(z_{nk}) = \frac{\pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$
4. M-Step. Update the parameters:
$\mu_k^{\text{new}} = \frac{\sum_{n=1}^{N} \gamma(z_{nk})\, x_n}{\sum_{n=1}^{N} \gamma(z_{nk})}$
$\Sigma_k^{\text{new}} = \frac{\sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k^{\text{new}})(x_n - \mu_k^{\text{new}})^T}{\sum_{n=1}^{N} \gamma(z_{nk})}$
$\pi_k^{\text{new}} = \frac{1}{N} \sum_{n=1}^{N} \gamma(z_{nk})$
5. Recompute the log-likelihood; if not converged, go to 3.
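
The summary translates fairly directly into code. A minimal NumPy/scipy sketch, assuming random initialization and a small ridge term on the covariances for numerical stability (both my own choices, not part of the algorithm on the slide):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=200, tol=1e-6, seed=0):
    """EM for a Gaussian mixture model; returns (pi, mus, Sigmas, gamma)."""
    N, D = X.shape
    rng = np.random.default_rng(seed)
    # 1. Initialize means (random data points), covariances, and mixing coefficients
    mus = X[rng.choice(N, size=K, replace=False)]
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(n_iters):
        # 3. E-step: responsibilities gamma(z_nk)
        log_terms = np.stack([np.log(pi[k]) +
                              multivariate_normal.logpdf(X, mus[k], Sigmas[k])
                              for k in range(K)], axis=1)   # (N, K)
        log_norm = logsumexp(log_terms, axis=1, keepdims=True)
        gamma = np.exp(log_terms - log_norm)
        # 4. M-step: update mu_k, Sigma_k, pi_k
        Nk = gamma.sum(axis=0)                               # effective cluster sizes
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        pi = Nk / N
        # 2./5. Log-likelihood and convergence check
        ll = log_norm.sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pi, mus, Sigmas, gamma
```

For example, pi, mus, Sigmas, gamma = em_gmm(X, K=2); as the Observations slide below notes, initializing the means with the result of K-means usually speeds up convergence.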

The Same Example Again

Observations
Compared to K-means, points can now belong to both clusters (soft assignment). In addition to the cluster center, a covariance is estimated by EM. Initialization is the same as used for K-means. The number of iterations needed for EM is much higher, and each cycle requires much more computation. Therefore: start with K-means and run EM on the result of K-means (the covariances can be initialized to the sample covariances of the K-means clusters). EM only finds a local maximum of the likelihood!

A More General View of EM
Assume for a moment that we observe X and the binary latent variables Z. The likelihood is then:
$p(X, Z \mid \pi, \mu, \Sigma) = \prod_{n=1}^{N} p(z_n)\, p(x_n \mid z_n, \mu, \Sigma)$
where
$p(z_n) = \prod_{k=1}^{K} \pi_k^{z_{nk}}$ and $p(x_n \mid z_n, \mu, \Sigma) = \prod_{k=1}^{K} \mathcal{N}(x_n \mid \mu_k, \Sigma_k)^{z_{nk}}$
(remember: $z_{nk} \in \{0,1\}$, $\sum_{k=1}^{K} z_{nk} = 1$), which leads to the log-formulation:
$\log p(X, Z \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \left( \log \pi_k + \log \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right)$

The Complete-Data Log-Likelihood
$\log p(X, Z \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \left( \log \pi_k + \log \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right)$
This is called the complete-data log-likelihood.
Advantage: solving for the parameters $(\pi_k, \mu_k, \Sigma_k)$ is much simpler, as the log is inside the sum! We could switch the sums and then, for every mixture component k, only look at the points that are associated with that component. This leads to simple closed-form solutions for the parameters.
However: the latent variables Z are not observed!

The Main Idea of EM
Instead of maximizing the joint log-likelihood, we maximize its expectation under the latent variable distribution:
$\mathbb{E}_Z[\log p(X, Z \mid \pi, \mu, \Sigma)] = \sum_{n=1}^{N} \sum_{k=1}^{K} \mathbb{E}_Z[z_{nk}] \left( \log \pi_k + \log \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right)$
where the latent variable distribution per point is, with $\theta = (\pi, \mu, \Sigma)$:
$p(z_n \mid x_n, \theta) = \frac{p(x_n \mid z_n, \theta)\, p(z_n \mid \theta)}{p(x_n \mid \theta)} = \frac{\prod_{l=1}^{K} \left( \pi_l\, \mathcal{N}(x_n \mid \mu_l, \Sigma_l) \right)^{z_{nl}}}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$

The Main Idea of EM
The expected value of the latent variables is $\mathbb{E}[z_{nk}] = \gamma(z_{nk})$. Plugging this in, we obtain:
$\mathbb{E}_Z[\log p(X, Z \mid \pi, \mu, \Sigma)] = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \left( \log \pi_k + \log \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right)$
We compute this iteratively:
1. Initialize $i = 0$ and $(\pi_k^i, \mu_k^i, \Sigma_k^i)$.
2. Compute $\mathbb{E}[z_{nk}] = \gamma(z_{nk})$.
3. Find parameters $(\pi_k^{i+1}, \mu_k^{i+1}, \Sigma_k^{i+1})$ that maximize this expectation.
4. Increase i; if not converged, go to 2.
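
For given responsibilities, this expected complete-data log-likelihood can be evaluated directly, as in this small sketch (the function name is mine):

```python
import numpy as np
from scipy.stats import multivariate_normal

def expected_complete_ll(X, gamma, pi, mus, Sigmas):
    """E_Z[log p(X, Z | theta)] = sum_n sum_k gamma_nk (log pi_k + log N(x_n | mu_k, Sigma_k))."""
    K = len(pi)
    log_comp = np.stack([np.log(pi[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
                         for k in range(K)], axis=1)   # (N, K)
    return float((gamma * log_comp).sum())
```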

The Theory Behind EM
We have seen that EM maximizes the expected complete-data log-likelihood, but actually we need to maximize the log-marginal
$\log p(X \mid \theta) = \log \sum_Z p(X, Z \mid \theta)$
It turns out that the log-marginal is maximized implicitly!

The Theory Behind EM
We have seen that EM maximizes the expected complete-data log-likelihood, but actually we need to maximize the log-marginal
$\log p(X \mid \theta) = \log \sum_Z p(X, Z \mid \theta)$
It turns out that the log-marginal is maximized implicitly, using the decomposition
$\log p(X \mid \theta) = \mathcal{L}(q, \theta) + \mathrm{KL}(q \,\|\, p)$
where
$\mathcal{L}(q, \theta) = \sum_Z q(Z) \log \frac{p(X, Z \mid \theta)}{q(Z)}$
$\mathrm{KL}(q \,\|\, p) = -\sum_Z q(Z) \log \frac{p(Z \mid X, \theta)}{q(Z)}$
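
This decomposition is easy to verify numerically on a toy mixture. A sketch with made-up parameters and an arbitrary q that factorizes over the data points (which is sufficient here, since the z_n are conditionally independent given X):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)

# Toy 2-component, 2-D GMM (made-up parameters) and some data
pi = np.array([0.4, 0.6])
mus = np.array([[0.0, 0.0], [2.0, 1.0]])
Sigmas = np.array([np.eye(2), 0.5 * np.eye(2)])
X = rng.normal(size=(20, 2))

# log pi_k + log N(x_n | mu_k, Sigma_k) for all n, k
log_joint = np.stack([np.log(pi[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
                      for k in range(2)], axis=1)      # (N, K)
log_px = logsumexp(log_joint, axis=1)                   # log p(x_n | theta)
gamma = np.exp(log_joint - log_px[:, None])             # posterior p(z_n | x_n, theta)

# An arbitrary distribution q(Z) that factorizes over the points
q = rng.dirichlet(np.ones(2), size=len(X))

L  = np.sum(q * (log_joint - np.log(q)))                # lower bound L(q, theta)
KL = np.sum(q * (np.log(q) - np.log(gamma)))            # KL(q || p(Z | X, theta))

print(np.allclose(log_px.sum(), L + KL))                # True: log p(X|theta) = L + KL
```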

Visualization
The KL-divergence is positive or 0. Thus, the log-likelihood is at least as large as $\mathcal{L}$; in other words, $\mathcal{L}$ is a lower bound of the log-likelihood:
$\log p(X \mid \theta) \geq \mathcal{L}(q, \theta)$

What Happens in the E-Step?
The log-likelihood is independent of q. Thus, $\mathcal{L}$ is maximized iff the KL-divergence is minimal. This is the case iff $q(Z) = p(Z \mid X, \theta)$.

What Happens in the M-Step?
In the M-step we keep q fixed and find new parameters $\theta$:
$\mathcal{L}(q, \theta) = \sum_Z p(Z \mid X, \theta^{\text{old}}) \log p(X, Z \mid \theta) - \sum_Z q(Z) \log q(Z)$
We maximize the first term; the second is independent of $\theta$. This implicitly makes the KL-divergence non-zero, so the log-likelihood increases even more than the lower bound!

Visualization in Parameter-Space
In the E-step we compute the concave lower bound for the given old parameters $\theta^{\text{old}}$ (blue curve). In the M-step, we maximize this lower bound and obtain new parameters $\theta^{\text{new}}$. This is repeated (green curve) until convergence.

Variants of EM
Instead of maximizing the log-likelihood, we can use EM to maximize a posterior when a prior is given (MAP instead of MLE); this leads to less overfitting. In Generalized EM, the M-step only increases the lower bound instead of maximizing it (useful if the standard M-step is intractable). Similarly, the E-step can be generalized in that the optimization with respect to q is not carried out completely. Furthermore, there are incremental versions of EM, where data points are given sequentially and the parameters are updated after each data point.

Example 1: Learn a Sensor Model
A radar range finder pointed at a metallic target will return 3 types of measurement: the distance to the target, the distance to the wall behind the target, or a completely random value.

Example 1: Learn a Sensor Model
Which point corresponds to which model? What are the different model parameters? Solution: expectation-maximization.

Example 2: Environment Classification
From each image, the robot extracts features, i.e. points in n-dimensional space. K-means only finds the cluster centers, not their extent and shape. The centers and covariances can be obtained with EM.

Example 3: Plane Fitting in 3D
Given a set of 3D points, fit planes to the data (this has been done in the paper referenced on the slide). Idea: the model parameters $\theta$ are the normal vectors and distances to the origin of a set of planes. Gaussian noise model:
$p(z \mid \theta) = \mathcal{N}(d(z, \theta) \mid 0, \sigma)$
where $d(z, \theta)$ is the point-to-plane distance and $\sigma$ the noise variance. Introduce latent correspondence variables $C_{ij}$ and maximize the expected log-likelihood $\mathbb{E}[\log p(z, C \mid \theta)]$. The maximization can be done in closed form.
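
The exact formulation from the referenced paper is not reproduced here; the following is only a generic sketch of how such an EM scheme could look, assuming a fixed noise variance and a closed-form weighted least-squares plane fit per component (both simplifications of mine):

```python
import numpy as np

def em_planes(P, J, n_iters=50, sigma=0.05, seed=0):
    """Fit J planes (unit normal n_j, offset d_j with n_j . x = d_j) to 3-D points P."""
    rng = np.random.default_rng(seed)
    N = len(P)
    # Initialize each plane from a random point and a random unit normal
    normals = rng.normal(size=(J, 3))
    normals /= np.linalg.norm(normals, axis=1, keepdims=True)
    offsets = np.einsum('jd,jd->j', normals, P[rng.choice(N, J, replace=False)])
    pi = np.full(J, 1.0 / J)
    for _ in range(n_iters):
        # E-step: responsibilities from the Gaussian model on point-to-plane distances
        dist = P @ normals.T - offsets                    # (N, J) signed distances d(z_i, theta_j)
        log_w = np.log(pi) - 0.5 * (dist / sigma) ** 2
        log_w -= log_w.max(axis=1, keepdims=True)
        gamma = np.exp(log_w)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: weighted least-squares plane fit per component (closed form)
        for j in range(J):
            w = gamma[:, j]
            c = (w[:, None] * P).sum(axis=0) / w.sum()    # weighted centroid
            S = (w[:, None] * (P - c)).T @ (P - c)        # weighted scatter matrix
            _, vecs = np.linalg.eigh(S)
            normals[j] = vecs[:, 0]                        # direction of smallest scatter
            offsets[j] = normals[j] @ c
        pi = gamma.mean(axis=0)
    return normals, offsets, gamma
```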


Summary
- K-means is an iterative method for clustering.
- Mixture models can be formalized using latent (unobserved) variables.
- A very common example is the Gaussian mixture model (GMM).
- To estimate the parameters of a GMM we can use expectation-maximization (EM).
- In general, EM can be interpreted as maximizing a lower bound of the log-likelihood.
- EM is guaranteed to converge, but it may run into local maxima.