MIXTURE MODELS AND EM

Last updated: November 6, 2012 MIXTURE MODELS AND EM

Credits 2 Some of these slides were sourced and/or modified from: Christopher Bishop, Microsoft UK; Simon Prince, University College London; Sergios Theodoridis, University of Athens; and Konstantinos Koutroumbas, National Observatory of Athens.

Mixture Models and EM: Topics 3 1. Intuition 2. Equations 3. Examples 4. Applications

Motivation 5 What do we do if a distribution is not well approximated by a standard parametric model?

Mixtures of Gaussians 6 Combine simple models into a complex model:
p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x; \mu_k, \Sigma_k)
where \mathcal{N}(x; \mu_k, \Sigma_k) is the k-th component and \pi_k is its mixing coefficient. (Figure: an example density with K = 3 components.)
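To make the definition concrete, here is a minimal MATLAB sketch (not from the lecture) that evaluates and plots a one-dimensional mixture density; the mixing coefficients, means, and standard deviations below are made-up values for illustration, and normpdf requires the Statistics Toolbox.

% Minimal sketch (illustrative parameter values, not from the lecture):
% evaluate p(x) = sum_k pi_k * N(x; mu_k, sigma_k^2) for a K = 3 mixture.
x      = linspace(-5, 10, 500)';   % evaluation points
pis    = [0.5 0.3 0.2];            % mixing coefficients (sum to 1)
mus    = [0 3 6];                  % component means
sigmas = [1 0.5 2];                % component standard deviations

p = zeros(size(x));
for k = 1:3
    p = p + pis(k) * normpdf(x, mus(k), sigmas(k));   % add the k-th weighted component
end
plot(x, p); xlabel('x'); ylabel('p(x)');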

Mixtures of Gaussians 7

Mixtures of Gaussians 8 Determining the parameters \mu, \Sigma and \pi by maximum likelihood: the log likelihood
\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n; \mu_k, \Sigma_k)
involves the log of a sum, so there is no closed-form maximum. Solution: use standard, iterative, numeric optimization methods or the expectation maximization algorithm.
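As a hedged illustration of why there is no closed form, the sketch below (again with made-up one-dimensional parameters) computes the log likelihood directly; each term is the log of a sum over components, so the component parameters cannot be isolated by setting derivatives to zero.

% Sketch: GMM log likelihood for assumed 1-D data and parameters.
X      = randn(100, 1);            % placeholder data (N = 100 points)
pis    = [0.5 0.3 0.2];
mus    = [0 3 6];
sigmas = [1 0.5 2];

L = 0;
for n = 1:numel(X)
    pxn = 0;
    for k = 1:3
        pxn = pxn + pis(k) * normpdf(X(n), mus(k), sigmas(k));
    end
    L = L + log(pxn);              % log of a sum over components
end
fprintf('log likelihood = %.3f\n', L);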

Hidden Variable Interpretation 9 Assumptions: for each training observation x_n there is a hidden variable z_n; z_n \in \{1, \dots, K\} represents which Gaussian x_n came from. With this interpretation, we have:
p(x_n) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n; \mu_k, \Sigma_k) = \sum_{k=1}^{K} p(x_n \mid z_n = k)\, p(z_n = k)
where p(z_n = k) = \pi_k and p(x_n \mid z_n = k) = \mathcal{N}(x_n; \mu_k, \Sigma_k).
(Figure: the mixture density p(x) and its components for z = 1, 2, 3.)

Hidden Variable Interpretation 10
p(x_n \mid \pi_1, \dots, \pi_K, \mu_1, \dots, \mu_K, \Sigma_1, \dots, \Sigma_K) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n; \mu_k, \Sigma_k) = \sum_{k=1}^{K} p(z_n = k)\, p(x_n \mid z_n = k)
(Figure: the mixture density and its components for z = 1, 2, 3, together with the graphical model relating x and z.)

Hidden Variable Interpretation 11 OUR GOAL: to estimate the parameters \theta, namely the means \mu_k, the covariances \Sigma_k, and the weights (mixing coefficients) \pi_k for all K components of the model
p(x_n \mid \pi_1, \dots, \pi_K, \mu_1, \dots, \mu_K, \Sigma_1, \dots, \Sigma_K) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n; \mu_k, \Sigma_k) = \sum_{k=1}^{K} p(z_n = k)\, p(x_n \mid z_n = k).
THING TO NOTICE #1: if we knew the hidden variables z_n for the training data, it would be easy to estimate the parameters \theta: just estimate the individual Gaussians separately.
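To see why the complete-data case is easy, here is a hedged MATLAB sketch (toy data, not from the lecture): with known labels z, each component's maximum-likelihood parameters are just the sample mean, sample covariance, and fraction of the points assigned to it.

% Sketch: complete-data ML estimates, assuming the hidden labels z were known.
K = 2;  N = 200;  D = 2;
z = [ones(N/2,1); 2*ones(N/2,1)];          % known component labels (toy example)
X = [randn(N/2,D); 3 + randn(N/2,D)];      % toy data, one block per label

pis = zeros(1,K);  mu = zeros(K,D);  S = zeros(K,D,D);
for k = 1:K
    Xk       = X(z == k, :);               % points generated by component k
    pis(k)   = size(Xk,1) / N;             % mixing coefficient = fraction of points
    mu(k,:)  = mean(Xk, 1);                % sample mean
    S(k,:,:) = cov(Xk, 1);                 % ML covariance (normalized by N_k)
end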

Hidden Variable Interpretation 12 THING TO NOTICE #2: if we knew the parameters \theta, it would be very easy to estimate the posterior distribution over each hidden variable z_n using Bayes' rule:
p(z = k \mid x, \theta) = \frac{p(x \mid z = k, \theta)\, \pi_k}{\sum_{j=1}^{K} p(x \mid z = j, \theta)\, \pi_j}
(Figure: the mixture density and the resulting posterior p(z \mid x) for z = 1, 2, 3.)

Expectation Maximization 13 Chicken-and-egg problem: we could find z_1, \dots, z_N if we knew \theta, and we could find \theta if we knew z_1, \dots, z_N. Solution: the Expectation Maximization (EM) algorithm (Dempster, Laird and Rubin 1977). EM for Gaussian mixtures can be viewed as alternation between 2 steps:
1. Expectation Step (E-Step): for fixed \theta, find the posterior distribution (the responsibilities) over the hidden variables z_1, \dots, z_N.
2. Maximization Step (M-Step): use these posterior probabilities to re-estimate \theta.

K-Means: A Poor Man's Approximation 14 The K-means algorithm is very similar to EM, except that: we assume the mixing coefficients are the same for each component; we assume each component is isotropic with common variance; and we do not admit any uncertainty about the responsibilities at each iteration.

K-Means: A Poor Man's Approximation 15 K-means consists of the following steps: 1. Randomly select the mean for each component. 2. Associate each input with the nearest component mean. 3. Recompute the means based upon the new association. Steps 2 and 3 are alternated until convergence.
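A hedged MATLAB sketch of these three steps, assuming an N x D data matrix X and a given K (steps 2 and 3 repeat until the assignments stop changing; the sketch does not guard against empty clusters):

% K-means sketch. Assumes X (N x D) and K are already defined.
[N, D] = size(X);
perm = randperm(N);
mu   = X(perm(1:K), :);                        % 1. random initial means (K data points)
idx  = zeros(N, 1);
while true
    d = zeros(N, K);
    for k = 1:K                                % 2. squared distance to each mean
        d(:,k) = sum((X - repmat(mu(k,:), N, 1)).^2, 2);
    end
    [~, newIdx] = min(d, [], 2);               %    assign each point to the nearest mean
    if isequal(newIdx, idx), break; end        %    converged: assignments unchanged
    idx = newIdx;
    for k = 1:K                                % 3. recompute each mean
        mu(k,:) = mean(X(idx == k, :), 1);
    end
end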

Example 16 (Figure, panels (a)-(i): successive K-means iterations on a two-dimensional data set, alternating the assignment and mean-update steps until convergence.)

Mixture Models and EM: Topics 17 1. Intuition 2. Equations 3. Examples 4. Applications

Responsibilities 18 The responsibility r_{nk} of component k for observation x_n is the posterior probability that component k generated x_n:
r_{nk} \equiv P(z_n = k \mid x_n; \theta)
Responsibility Update Equation:
r_{nk} = \frac{p(x_n \mid z_n = k; \theta^{(t)})\, \pi_k^{(t)}}{\sum_{j=1}^{K} p(x_n \mid z_n = j; \theta^{(t)})\, \pi_j^{(t)}}
Let N_k = the effective number of observations explained by component k:
N_k \equiv \sum_{n=1}^{N} r_{nk}
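A hedged MATLAB sketch of this update, written to match the variable names used in the MATLAB example later in the lecture (X is N x D, mu is K x D, S is K x D x D, and alphas holds the mixing coefficients); mvnpdf requires the Statistics Toolbox.

% E-step sketch: responsibilities r(n,k) = P(z_n = k | x_n; theta).
% Assumes X, mu, S, alphas are already defined as described above.
[N, D] = size(X);
K = numel(alphas);
r = zeros(N, K);
for k = 1:K
    r(:,k) = alphas(k) .* mvnpdf(X, mu(k,:), squeeze(S(k,:,:)));  % pi_k * N(x_n; mu_k, Sigma_k)
end
r  = r ./ repmat(sum(r, 2), 1, K);   % normalize over components (Bayes' rule)
Nk = sum(r, 1);                      % effective number of observations per component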

Example: Mixture of Gaussians 19 Differentiating the log likelihood does not allow us to isolate the parameters analytically. However, it does generate equations that can be used for iterative estimation of these parameters by EM.

Example: Mixture of Gaussians 20 Parameter Update Equations:
\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk}\, x_n
\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk}\, (x_n - \mu_k)(x_n - \mu_k)^t
\pi_k = \frac{1}{N} \sum_{n=1}^{N} r_{nk} = \frac{N_k}{N}

Mixture of (Multivariate) Gaussians 21 Here we will derive the formula for the mixing coefficients \pi_k. Textbook Problem 11.2: derive the EM formulae for the mean \mu_k and covariance \Sigma_k. Please do this at home! When solving for \Sigma_k, recall that:
\frac{d}{dA} |A| = |A|\,(A^{-1})^t \quad \text{and} \quad \frac{d}{dA}\, x^t A x = x x^t
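For reference, here is a sketch of the standard Lagrange-multiplier derivation for the mixing coefficients (the case this slide works through); the mean and covariance updates quoted on the previous slide follow from analogous, unconstrained differentiations.

% Maximize the log likelihood subject to \sum_k \pi_k = 1, using a multiplier \lambda:
\begin{align*}
J &= \sum_{n=1}^{N} \ln \sum_{j=1}^{K} \pi_j \,\mathcal{N}(x_n; \mu_j, \Sigma_j)
     + \lambda \Big( \sum_{j=1}^{K} \pi_j - 1 \Big), \\
\frac{\partial J}{\partial \pi_k}
  &= \sum_{n=1}^{N} \frac{\mathcal{N}(x_n; \mu_k, \Sigma_k)}
         {\sum_{j=1}^{K} \pi_j \,\mathcal{N}(x_n; \mu_j, \Sigma_j)} + \lambda
   = \frac{1}{\pi_k}\sum_{n=1}^{N} r_{nk} + \lambda = 0.
\end{align*}
% Multiplying by \pi_k and summing over k gives \lambda = -N, hence
\[
  \pi_k = \frac{1}{N}\sum_{n=1}^{N} r_{nk} = \frac{N_k}{N}.
\]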

Initialization 22 EM typically takes longer to converge than K-means, and the computations are more elaborate. It is thus common to use the K-means solution to initialize the parameters for EM.
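A hedged MATLAB sketch of this initialization (kmeans requires the Statistics Toolbox; variable names follow the example code later in the lecture):

% Sketch: initialize EM from a K-means solution. Assumes X (N x D) and K.
[N, D] = size(X);
[idx, mu] = kmeans(X, K);                 % cluster labels and cluster centres
alphas = zeros(1, K);
S      = zeros(K, D, D);
for k = 1:K
    Xk        = X(idx == k, :);
    alphas(k) = size(Xk, 1) / N;          % initial weight = cluster fraction
    S(k,:,:)  = cov(Xk, 1);               % initial covariance = within-cluster covariance
end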

Convergence 23 EM is guaranteed to monotonically increase the likelihood. However, since in general the likelihood is nonconvex, we are not guaranteed to find the globally optimal parameters.
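In practice this suggests a simple stopping rule: evaluate the log likelihood after each M-step and stop once its (guaranteed non-negative) increase falls below a tolerance. A hedged sketch, reusing the variable names from the MATLAB example; prevLogLik is assumed to hold the value from the previous iteration.

% Convergence-test sketch: stop EM when the log likelihood stops improving.
% Assumes X, mu, S, alphas (and prevLogLik from the previous iteration) exist.
K  = numel(alphas);
px = zeros(size(X, 1), 1);
for k = 1:K
    px = px + alphas(k) .* mvnpdf(X, mu(k,:), squeeze(S(k,:,:)));
end
logLik = sum(log(px));
if logLik - prevLogLik < 1e-6 * abs(prevLogLik)   % relative improvement threshold
    converged = true;                             % exit the EM loop
end
prevLogLik = logLik;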

End of Lecture October 24, 2012

Mixture Models and EM: Topics 25 1. Intuition 2. Equations 3. Examples 4. Applications

Bivariate Gaussian Mixture Example 26 (Figure, panels (a)-(c): (a) samples from the joint distribution p(x, z \mid \theta); (b) samples from the marginal p(x \mid \theta); (c) responsibilities P(z = k \mid x; \theta^{(t)}).)

2-Component Bivariate MATLAB Example 27 (Figure: scatter plot of exam grade (%) vs. assignment grade (%).)

2-Component Bivariate MATLAB Example 28

%update responsibilities (E-step)
for i = 1:K
    p(:,i) = alphas(i) .* mvnpdf(X, mu(i,:), squeeze(S(i,:,:)));
end
p = p ./ repmat(sum(p,2), 1, K);

%update parameters (M-step)
for i = 1:K
    Nk = sum(p(:,i));
    mu(i,:) = p(:,i)' * X / Nk;
    dev = X - repmat(mu(i,:), N, 1);
    S(i,:,:) = (repmat(p(:,i),1,D) .* dev)' * dev / Nk;
    alphas(i) = Nk / N;
end

(Figure: scatter plot of exam grade (%) vs. assignment grade (%) with the fitted 2-component mixture.)

Old Faithful Example 29 (Figure, panels (a)-(f): successive EM iterations (L = 1, 2, 5, ...) on the Old Faithful data; axes are duration of eruption (min) vs. time to next eruption (min).)

Face Detection Example: 2 Components 30 (Figure: prior, mean, and standard-deviation images for the two components of the face model and of the non-face model; the two priors are roughly equal in each model.) Each component is still assumed to have diagonal covariance. The face model and the non-face model have each divided their data into two clusters, and in each case these clusters have roughly equal weights. The primary thing these components seem to have captured is the photometric (luminance) variation. Note that the standard deviations are smaller than for the single Gaussian model, since any given data point is likely to be close to one mean or the other.

Results for MOG 2 Model 31 (Figure: ROC curves, Pr(Hit) vs. Pr(False Alarm), for the MOG 2, diagonal Gaussian, and uniform models.) Performance improves relative to a single Gaussian model, although it is not a dramatic improvement. We have a better description of the data likelihood.

MOG 5 Components 32 (Figure: prior, mean, and standard-deviation images for the five components of the face model and of the non-face model.)

MOG 10 Components 33 (Figure: component priors and mean/standard-deviation images for the ten-component face and non-face models.)

Results for MOG 10 Model 34 (Figure: ROC curves, Pr(Hit) vs. Pr(False Alarm), for the MOG 10, MOG 2, diagonal Gaussian, and uniform models.) Performance improves slightly more, particularly at low false alarm rates.

Background Subtraction 35 (Figure: test image and desired segmentation.) GOAL: (i) learn a background model; (ii) use this model to segment regions where the background has been occluded.

What if the scene isn't static? 36 A Gaussian is no longer a good fit to the data, and it is not obvious exactly what probability model would fit better.

Background Mixture Model 37

Background Subtraction Example 38