Expectation Maximization

Expectation Maximization. Bishop PRML Ch. 9. Alireza Ghane (slides: Ghane/Mori).

Learning Parameters of Probability Distributions. Given fully observed (labelled) training data, setting the parameters θ_i of the probability distributions is straightforward. However, in many settings not all variables are observed in the training data: x_i = (x_i, h_i). E.g. speech recognition: we have the speech signals, but not the phoneme labels. E.g. object recognition: we have object labels (car, bicycle), but not part labels (wheel, door, seat). Unobserved variables are called latent variables. [Figure: output of a feature detector on two typical images from the motorbike dataset; figures from Fergus et al.]

Outline: K-Means, Gaussian Mixture Models, Expectation-Maximization.

Unsupervised Learning. We will start with an unsupervised learning (clustering) problem: given a dataset $\{x_1, \dots, x_N\}$ with each $x_i \in \mathbb{R}^D$, partition the dataset into K clusters. Intuitively, a cluster is a group of points which are close together and far from others.

Distortion Measure. Formally, introduce prototypes (cluster centers) $\mu_k \in \mathbb{R}^D$ and binary indicators $r_{nk}$, equal to 1 if point n is in cluster k and 0 otherwise (the 1-of-K coding scheme again). Find $\{\mu_k\}$ and $\{r_{nk}\}$ to minimize the distortion measure
$$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \| x_n - \mu_k \|^2 .$$
E.g. for two clusters $k = 1, 2$:
$$J = \sum_{x_n \in C_1} \| x_n - \mu_1 \|^2 + \sum_{x_n \in C_2} \| x_n - \mu_2 \|^2 .$$
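
To make the distortion measure concrete, here is a minimal NumPy sketch (the function name `distortion` is illustrative, not from the slides) that evaluates J for a given one-hot assignment matrix:

```python
import numpy as np

def distortion(X, mu, r):
    """J = sum_n sum_k r[n, k] * ||x_n - mu_k||^2 for one-hot assignments r (N x K)."""
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K) squared distances
    return float((r * d2).sum())
```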

Minimizing Distortion Measure. Minimizing
$$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \| x_n - \mu_k \|^2$$
directly is hard. However, two things are easy: if we know the $\mu_k$, minimizing J w.r.t. the $r_{nk}$; and if we know the $r_{nk}$, minimizing J w.r.t. the $\mu_k$. This suggests an iterative procedure: start with an initial guess for the $\mu_k$, then iterate two steps (minimize J w.r.t. $r_{nk}$, minimize J w.r.t. $\mu_k$). Rinse and repeat until convergence.

Determining Membership Variables. Step 1 of a K-means iteration minimizes the distortion measure J w.r.t. the cluster membership variables $r_{nk}$. Since the terms for different data points $x_n$ are independent, for each data point we set $r_{nk}$ to minimize $\sum_{k=1}^{K} r_{nk} \, \| x_n - \mu_k \|^2$: simply set $r_{nk} = 1$ for the cluster center $\mu_k$ with the smallest distance, and 0 otherwise.

Determining Cluster Centers. Step 2: fix the $r_{nk}$ and minimize J w.r.t. the cluster centers $\mu_k$. Switching the order of the sums,
$$J = \sum_{k=1}^{K} \sum_{n=1}^{N} r_{nk} \, \| x_n - \mu_k \|^2 ,$$
so we can minimize w.r.t. each $\mu_k$ separately. Taking the derivative and setting it to zero,
$$2 \sum_{n=1}^{N} r_{nk} (x_n - \mu_k) = 0 \;\Rightarrow\; \mu_k = \frac{\sum_n r_{nk} x_n}{\sum_n r_{nk}} ,$$
i.e. the mean of the data points $x_n$ assigned to cluster k.

K-means Algorithm. Start with an initial guess for the $\mu_k$. Iterate two steps: minimize J w.r.t. the $r_{nk}$ (assign points to the nearest cluster center), then minimize J w.r.t. the $\mu_k$ (set each cluster center to the average of the points in its cluster). Rinse and repeat until convergence.
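
A minimal NumPy sketch of the two-step iteration described above (the function and variable names are illustrative, and random initialization from the data is just one common choice):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Alternate the two minimizations of J until the assignments stop changing."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)].astype(float)  # initial centers
    assign = np.full(len(X), -1)
    for _ in range(n_iters):
        # Step 1: minimize J w.r.t. r_nk -- assign each point to its nearest center.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        new_assign = d2.argmin(axis=1)
        if np.array_equal(new_assign, assign):
            break                                   # no change in membership: stop
        assign = new_assign
        # Step 2: minimize J w.r.t. mu_k -- set each center to the mean of its points.
        for k in range(K):
            if np.any(assign == k):
                mu[k] = X[assign == k].mean(axis=0)
    return mu, assign
```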

[Figures: K-means example, panels (a) through (i), showing alternating assignment and re-centering steps on a 2-D dataset. At panel (i) the next step doesn't change any memberships, so the algorithm stops.]

K-means Convergence. Repeat the two steps until there is no change in the cluster assignments. In each step the value of J either goes down or we stop. There is a finite number of possible assignments of data points to clusters, so we are guaranteed to converge eventually. Note that it may be a local minimum of J rather than the global minimum to which we converge.

K-means Example - Image Segmentation. [Figures: original image, and K-means clusterings on pixel colour values, where pixels in a cluster are coloured by the cluster mean.] Represent each pixel (e.g. a 24-bit colour value) by a cluster number (e.g. 4 bits suffice for K up to 16), giving a compressed version. This technique is known as vector quantization: represent a vector (in this case an RGB value in $\mathbb{R}^3$) by a single discrete value.
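
As a usage sketch of vector quantization with the `kmeans` function from the earlier code block (the random image array and the value K = 10 are placeholders for a real image and a chosen codebook size):

```python
import numpy as np

img = np.random.rand(64, 64, 3)              # stand-in for a real H x W x 3 image
pixels = img.reshape(-1, 3)                  # each pixel is a vector in R^3
mu, assign = kmeans(pixels, K=10)            # mu is the "codebook" of cluster means
compressed = mu[assign].reshape(img.shape)   # every pixel replaced by its cluster mean
```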

Outline: K-Means, Gaussian Mixture Models, Expectation-Maximization.

Hard Assignment vs. Soft Assignment. In the K-means algorithm, a hard assignment of points to clusters is made. However, for points near the decision boundary this may not be such a good idea. Instead, we could think about making a soft assignment of points to clusters.

Gaussian Mixture Model. [Figure: a dataset generated by drawing samples from three different Gaussians.] The Gaussian mixture model (or mixture of Gaussians, MoG) models the data as a combination of Gaussians.

Generative Model. The mixture of Gaussians is a generative model. To generate a datapoint $x_n$, we first generate a value for a discrete variable $z_n \in \{1, \dots, K\}$, and we then generate a value $x_n \sim \mathcal{N}(x \mid \mu_k, \Sigma_k)$ from the corresponding Gaussian.
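
A minimal sketch of this generative process; the mixture parameters below are made-up illustrations, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.5, 0.3, 0.2])                              # mixing coefficients
mu = np.array([[0.0, 0.0], [3.0, 3.0], [0.0, 4.0]])         # component means
Sigma = np.stack([s * np.eye(2) for s in (0.5, 1.0, 0.3)])  # component covariances

def sample_mog(n):
    """Draw n points: first a component z_n ~ pi, then x_n ~ N(mu_k, Sigma_k)."""
    z = rng.choice(len(pi), size=n, p=pi)
    x = np.array([rng.multivariate_normal(mu[k], Sigma[k]) for k in z])
    return x, z
```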

Graphical Model. [Figure: plate notation with parameters $\pi, \mu, \Sigma$ and variables $z_n \rightarrow x_n$ inside a plate of size N.] Note that $z_n$ is a latent variable, unobserved. We need to give the conditional distributions $p(z_n)$ and $p(x_n \mid z_n)$. The one-of-K representation is helpful here: $z_{nk} \in \{0, 1\}$, $z_n = (z_{n1}, \dots, z_{nK})$.

Graphical Model - Latent Component Variable. Use a Bernoulli distribution for $p(z_n)$, i.e. $p(z_{nk} = 1) = \pi_k$. The parameters of this distribution are the $\{\pi_k\}$; we must have $\pi_k \ge 0$ and $\sum_{k=1}^{K} \pi_k = 1$. Then
$$p(z_n) = \prod_{k=1}^{K} \pi_k^{z_{nk}} .$$

Graphical Model - Observed Variable. Use a Gaussian distribution for $p(x_n \mid z_n)$, with parameters $\{\mu_k, \Sigma_k\}$:
$$p(x_n \mid z_{nk} = 1) = \mathcal{N}(x_n \mid \mu_k, \Sigma_k), \qquad p(x_n \mid z_n) = \prod_{k=1}^{K} \mathcal{N}(x_n \mid \mu_k, \Sigma_k)^{z_{nk}} .$$

Graphical Model - Joint Distribution. The full joint distribution is given by
$$p(x, z) = \prod_{n=1}^{N} p(z_n) \, p(x_n \mid z_n) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{nk}} \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)^{z_{nk}} .$$

MoG Marginal over Observed Variables. The marginal distribution $p(x_n)$ for this model is
$$p(x_n) = \sum_{z_n} p(x_n, z_n) = \sum_{z_n} p(z_n) \, p(x_n \mid z_n) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) ,$$
a mixture of Gaussians.

MoG Conditional over Latent Variable. The conditional $p(z_{nk} = 1 \mid x_n)$ will play an important role for learning. It is denoted by $\gamma(z_{nk})$ and can be computed as
$$\gamma(z_{nk}) \equiv p(z_{nk} = 1 \mid x_n) = \frac{p(z_{nk} = 1) \, p(x_n \mid z_{nk} = 1)}{\sum_{j=1}^{K} p(z_{nj} = 1) \, p(x_n \mid z_{nj} = 1)} = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} .$$
$\gamma(z_{nk})$ is the responsibility of component k for datapoint n.
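
A sketch of this responsibility computation using SciPy's multivariate normal density (the function name, and the reuse of the parameter arrays from the sampling sketch above, are assumptions for illustration):

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pi, mu, Sigma):
    """gamma[n, k] = pi_k N(x_n | mu_k, Sigma_k) / sum_j pi_j N(x_n | mu_j, Sigma_j)."""
    dens = np.column_stack([multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
                            for k in range(len(pi))])    # (N, K) component densities
    weighted = pi * dens
    return weighted / weighted.sum(axis=1, keepdims=True)
```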

MoG Learning. Given a set of observations $\{x_1, \dots, x_N\}$, without the latent variables $z_n$, how can we learn the parameters? The model parameters are $\theta = \{\pi_k, \mu_k, \Sigma_k\}$. The answer will be similar to K-means: if we know the latent variables $z_n$, fitting the Gaussians is easy; if we know the Gaussians $\mu_k, \Sigma_k$, finding the latent variables is easy. Rather than the latent variables themselves, we will use the responsibilities $\gamma(z_{nk})$.

MoG Maximum Likelihood Learning. Given a set of observations $\{x_1, \dots, x_N\}$, without the latent variables $z_n$, how can we learn the parameters $\theta = \{\pi_k, \mu_k, \Sigma_k\}$? We can use the maximum likelihood criterion:
$$\theta_{ML} = \arg\max_{\theta} \prod_{n=1}^{N} \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) = \arg\max_{\theta} \sum_{n=1}^{N} \log \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\} .$$
Unfortunately, a closed-form solution is not possible this time: we have a log of a sum rather than a log of a product.
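
For later monitoring of EM it is useful to evaluate this log-likelihood; a sketch using a numerically stable log-sum-exp over components (the function name is illustrative):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def log_likelihood(X, pi, mu, Sigma):
    """sum_n log sum_k pi_k N(x_n | mu_k, Sigma_k), computed via logsumexp."""
    log_terms = np.column_stack(
        [np.log(pi[k]) + multivariate_normal.logpdf(X, mean=mu[k], cov=Sigma[k])
         for k in range(len(pi))])                # (N, K): log pi_k + log N(x_n | ...)
    return float(logsumexp(log_terms, axis=1).sum())
```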

MoG Maximum Likelihood Learning - Problem. The maximum likelihood criterion in 1-D:
$$\theta_{ML} = \arg\max_{\theta} \sum_{n=1}^{N} \log \left\{ \sum_{k=1}^{K} \pi_k \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\!\left\{ -(x_n - \mu_k)^2 / (2\sigma_k^2) \right\} \right\} .$$
Suppose we set $\mu_k = x_n$ for some k and n; then we have one term in the sum:
$$\pi_k \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\!\left\{ -(x_n - \mu_k)^2 / (2\sigma_k^2) \right\} = \pi_k \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\!\left\{ -(0)^2 / (2\sigma_k^2) \right\} .$$
In the limit as $\sigma_k \to 0$, this goes to $\infty$. So the ML solution is to set some $\mu_k = x_n$ and $\sigma_k = 0$!

ML for Gaussian Mixtures. Keeping this problem in mind, we will develop an algorithm for ML estimation of the parameters of a MoG model, searching for a local optimum. Consider the log-likelihood function
$$\ell(\theta) = \sum_{n=1}^{N} \log \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\} .$$
We can try taking derivatives and setting them to zero, even though no closed-form solution exists.

Maximizing Log-Likelihood - Means. Differentiating
$$\ell(\theta) = \sum_{n=1}^{N} \log \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\}$$
w.r.t. $\mu_k$ gives
$$\frac{\partial \ell}{\partial \mu_k} = \sum_{n=1}^{N} \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_j \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} \, \Sigma_k^{-1} (x_n - \mu_k) = \sum_{n=1}^{N} \gamma(z_{nk}) \, \Sigma_k^{-1} (x_n - \mu_k) .$$
Setting the derivative to 0 and multiplying by $\Sigma_k$:
$$\sum_{n=1}^{N} \gamma(z_{nk}) \, \mu_k = \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n \;\Rightarrow\; \mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n , \quad \text{where } N_k = \sum_{n=1}^{N} \gamma(z_{nk}) .$$

Maximizing Log-Likelihood - Means and Covariances. Note that the mean $\mu_k$ is a weighted combination of the points $x_n$, using the responsibilities $\gamma(z_{nk})$ for cluster k:
$$\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n .$$
$N_k = \sum_{n=1}^{N} \gamma(z_{nk})$ is the effective number of points in the cluster. A similar result comes from taking derivatives w.r.t. the covariance matrices $\Sigma_k$:
$$\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, (x_n - \mu_k)(x_n - \mu_k)^T .$$

Maximizing Log-Likelihood - Mixing Coefficients. We can also maximize w.r.t. the mixing coefficients $\pi_k$. Note the constraint $\sum_k \pi_k = 1$; using a Lagrange multiplier we end up with
$$\pi_k = \frac{N_k}{N} ,$$
the average responsibility that component k takes.

Three Parameter Types and Three Equations. These three equations a solution do not make:
$$\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n , \qquad \Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, (x_n - \mu_k)(x_n - \mu_k)^T , \qquad \pi_k = \frac{N_k}{N} .$$
All depend on $\gamma(z_{nk})$, which in turn depends on all three! But an iterative scheme can be used.

EM for Gaussian Mixtures. Initialize the parameters, then iterate:
E step: calculate the responsibilities using the current parameters,
$$\gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} .$$
M step: re-estimate the parameters using these $\gamma(z_{nk})$,
$$\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n , \qquad \Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, (x_n - \mu_k)(x_n - \mu_k)^T , \qquad \pi_k = \frac{N_k}{N} .$$
This algorithm is known as the expectation-maximization (EM) algorithm. Next we describe its general form, why it works, and why it's called EM (but first an example).
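
A compact sketch of these E and M steps, reusing the `responsibilities` helper from above. The initialization, iteration count, and the small diagonal term added to the covariances (a common safeguard against the singularity discussed earlier) are implementation choices, not part of the slides:

```python
import numpy as np

def em_gmm(X, K, n_iters=50, seed=0):
    """EM for a mixture of Gaussians: alternate responsibilities and parameter updates."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)].astype(float)
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    for _ in range(n_iters):
        # E step: gamma[n, k] is the responsibility of component k for point n.
        gamma = responsibilities(X, pi, mu, Sigma)
        # M step: re-estimate parameters from the responsibilities.
        Nk = gamma.sum(axis=0)                        # effective number of points N_k
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        pi = Nk / N
    return pi, mu, Sigma
```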

[Figures: MoG EM - Example. (a) Same initialization as with K-means before; often, K-means is actually used to initialize EM. (b) Calculate the responsibilities $\gamma(z_{nk})$. (c) Calculate the model parameters $\{\pi_k, \mu_k, \Sigma_k\}$ using these responsibilities. (d) After another iteration. (e) Iteration 5. (f) Converged.]

Outline: K-Means, Gaussian Mixture Models, Expectation-Maximization.

General Version of EM. In general, we are interested in maximizing the likelihood
$$p(X \mid \theta) = \sum_{Z} p(X, Z \mid \theta) ,$$
where X denotes all observed variables and Z denotes all latent (hidden, unobserved) variables. Assume that maximizing $p(X \mid \theta)$ is difficult (e.g. mixture of Gaussians), but that maximizing $p(X, Z \mid \theta)$ is tractable (everything observed). $p(X, Z \mid \theta)$ is referred to as the complete-data likelihood function, which we don't have.

A Lower Bound. The strategy for optimization will be to introduce a lower bound on the likelihood. This lower bound will be based on the complete-data likelihood, which is easy to optimize. We iteratively increase this lower bound, making sure we're increasing the likelihood while doing so.

A Decomposition Trick. To obtain the lower bound, we use a decomposition:
$$\ln p(X, Z \mid \theta) = \ln p(X \mid \theta) + \ln p(Z \mid X, \theta) \quad \text{(product rule)} ,$$
$$\ln p(X \mid \theta) = \mathcal{L}(q, \theta) + \mathrm{KL}(q \,\|\, p) ,$$
$$\mathcal{L}(q, \theta) \equiv \sum_{Z} q(Z) \ln \left\{ \frac{p(X, Z \mid \theta)}{q(Z)} \right\} , \qquad \mathrm{KL}(q \,\|\, p) \equiv - \sum_{Z} q(Z) \ln \left\{ \frac{p(Z \mid X, \theta)}{q(Z)} \right\} .$$
$\mathrm{KL}(q \,\|\, p)$ is known as the Kullback-Leibler divergence (KL-divergence), and is $\ge 0$ (see p. 55 of PRML and the next slide). Hence $\ln p(X \mid \theta) \ge \mathcal{L}(q, \theta)$.
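
A small numeric sanity check of this decomposition on a toy discrete model (the probability values are arbitrary illustrations):

```python
import numpy as np

p_xz = np.array([0.10, 0.25, 0.15])       # p(X = x, Z = z | theta) for z = 0, 1, 2
p_x = p_xz.sum()                          # p(X = x | theta), marginalizing over Z
p_post = p_xz / p_x                       # posterior p(Z | X = x, theta)
q = np.array([0.5, 0.3, 0.2])             # any averaging distribution q(Z)

L = np.sum(q * np.log(p_xz / q))          # lower bound L(q, theta)
KL = -np.sum(q * np.log(p_post / q))      # KL(q || p(Z | X, theta)), always >= 0
assert np.isclose(np.log(p_x), L + KL)    # ln p(X | theta) = L(q, theta) + KL(q || p)
```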

Kullback-Leibler Divergence. $\mathrm{KL}(p(x) \,\|\, q(x))$ is a measure of the difference between the distributions $p(x)$ and $q(x)$:
$$\mathrm{KL}(p(x) \,\|\, q(x)) = - \sum_x p(x) \log \frac{q(x)}{p(x)} .$$
Motivation: it is the average additional amount of information required to encode x using a code built assuming distribution $q(x)$ when x actually comes from $p(x)$. Note it is not symmetric: $\mathrm{KL}(q(x) \,\|\, p(x)) \ne \mathrm{KL}(p(x) \,\|\, q(x))$ in general. It is non-negative, by Jensen's inequality $\ln\left(\sum_x x\, p(x)\right) \ge \sum_x p(x) \ln x$ applied to KL:
$$\mathrm{KL}(p \,\|\, q) = - \sum_x p(x) \log \frac{q(x)}{p(x)} \ge - \ln \left( \sum_x \frac{q(x)}{p(x)} \, p(x) \right) = - \ln \sum_x q(x) = 0 .$$
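
A short sketch that checks the non-negativity and asymmetry numerically (the two example distributions are arbitrary):

```python
import numpy as np

def kl(p, q):
    """KL(p || q) = sum_x p(x) log(p(x) / q(x)) for discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.3, 0.4, 0.3])
print(kl(p, q), kl(q, p))   # both are >= 0 and generally not equal to each other
```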

Increasing the Lower Bound - E step. EM is an iterative optimization technique which tries to maximize this lower bound: $\ln p(X \mid \theta) \ge \mathcal{L}(q, \theta)$. E step: fix $\theta^{old}$ and maximize $\mathcal{L}(q, \theta^{old})$ w.r.t. q, i.e. choose the distribution q to maximize $\mathcal{L}$. Reordering the bound:
$$\mathcal{L}(q, \theta^{old}) = \ln p(X \mid \theta^{old}) - \mathrm{KL}(q \,\|\, p) .$$
$\ln p(X \mid \theta^{old})$ does not depend on q, so the maximum is obtained when $\mathrm{KL}(q \,\|\, p)$ is as small as possible. This occurs when $q = p$, i.e. $q(Z) = p(Z \mid X, \theta^{old})$. This is the posterior over Z; recall that these are the responsibilities in the MoG model.

Increasing the Lower Bound - M step. M step: fix q and maximize $\mathcal{L}(q, \theta)$ w.r.t. θ. The maximization problem is over
$$\mathcal{L}(q, \theta) = \sum_{Z} q(Z) \ln p(X, Z \mid \theta) - \sum_{Z} q(Z) \ln q(Z) = \sum_{Z} p(Z \mid X, \theta^{old}) \ln p(X, Z \mid \theta) - \sum_{Z} p(Z \mid X, \theta^{old}) \ln p(Z \mid X, \theta^{old}) .$$
The second term is constant with respect to θ. The first term is the expectation of the log of the complete-data likelihood, which is assumed easy to optimize: the expected complete log-likelihood, i.e. what we think the complete-data log-likelihood will be.

Why does EM work? In the M-step we changed from $\theta^{old}$ to $\theta^{new}$. This increased the lower bound $\mathcal{L}$, unless we were already at a maximum (in which case we would have stopped). This must have caused the log-likelihood to increase. The E-step set q to make the KL-divergence zero:
$$\ln p(X \mid \theta^{old}) = \mathcal{L}(q, \theta^{old}) + \mathrm{KL}(q \,\|\, p) = \mathcal{L}(q, \theta^{old}) .$$
Since the lower bound $\mathcal{L}$ increased when we moved from $\theta^{old}$ to $\theta^{new}$:
$$\ln p(X \mid \theta^{old}) = \mathcal{L}(q, \theta^{old}) < \mathcal{L}(q, \theta^{new}) = \ln p(X \mid \theta^{new}) - \mathrm{KL}(q \,\|\, p^{new}) \le \ln p(X \mid \theta^{new}) .$$
So the log-likelihood has increased in going from $\theta^{old}$ to $\theta^{new}$.
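
This monotone increase is easy to observe with the sketches above; a hypothetical monitoring loop (reusing `sample_mog`, `em_gmm`, and `log_likelihood` from the earlier blocks, and noting that the tiny covariance regularizer in `em_gmm` can perturb exact monotonicity by a negligible amount):

```python
import numpy as np

X, _ = sample_mog(500)                    # synthetic data from the generative sketch
for t in (1, 2, 5, 10, 20):
    # Re-running from the same seed for t iterations traces one EM trajectory.
    pi_t, mu_t, Sigma_t = em_gmm(X, K=3, n_iters=t, seed=1)
    print(t, log_likelihood(X, pi_t, mu_t, Sigma_t))   # should be (near-)non-decreasing
```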

Bounding Example. Consider a 2-component, 1-D MoG with known variances (example from F. Dellaert). [Figure 1: the mixture components and the data; the data consists of three samples drawn from each mixture component, shown as circles and triangles.] [Figure 2: the true likelihood function of the two component means $\theta_1$ and $\theta_2$, given the data in Figure 1.] Recall that we are fitting the means $\theta_1, \theta_2$. We lower bound the likelihood function using an averaging distribution $q(Z)$:
$$\ln p(X \mid \theta) = \mathcal{L}(q, \theta) + \mathrm{KL}(q(Z) \,\|\, p(Z \mid X, \theta)) .$$
Since $q(Z) = p(Z \mid X, \theta^{old})$, the bound is tight (equal to the actual likelihood) at $\theta = \theta^{old}$. [Figure 3: the lower bounds over successive iterations i = 1, ..., 5.]

EM - Summary. EM finds a local maximum of the likelihood
$$p(X \mid \theta) = \sum_{Z} p(X, Z \mid \theta) .$$
It iterates two steps: the E step fills in the missing variables Z (calculates their distribution), and the M step maximizes the expected complete log-likelihood (the expectation taken w.r.t. the E-step distribution). This works because these two steps perform coordinate-wise hill-climbing on a lower bound on the likelihood $p(X \mid \theta)$.

Conclusion. Readings: Ch. 9.1, 9.2, 9.4. We covered K-means clustering and the Gaussian mixture model. What about K? Model selection: either cross-validation or a Bayesian version (average over all values of K). Expectation-maximization is a general method for learning the parameters of models when not all variables are observed.