Review and Motivation


Review and Motivation. We can model and visualize multimodal datasets by using multiple unimodal (Gaussian-like) clusters. K-means gives us a way of partitioning points into K clusters. Once we know which points go to which cluster, we can estimate a Gaussian mean and covariance for that cluster. We have also introduced the idea of writing what you want to do as a function to be optimized (maximized or minimized); maximum likelihood estimation of the mean and covariance parameters of a Gaussian is a good example of this.

Review and Motivation. We want to do MLE of the mixture-of-Gaussians parameters. But this is hard, because of the summation inside the mixture-of-Gaussians equation (we can't take the log of a sum). If we knew which points contribute to which Gaussian component, the problem would be a lot easier (we could rewrite it so that the summation goes away). So... let's guess which point goes with which component, and proceed with the estimation. We are unlikely to guess right the first time, but based on our initial estimate of the parameters, we can now make a better guess at pairing points with components. Iterate. This is the basic idea underlying the EM algorithm.

EM Algorithm. What makes this estimation problem hard? 1) It is a mixture, so the log-likelihood is messy. 2) We don't directly see what the underlying generating process is.

EM Algorithm. What is the underlying process? A procedure to generate a mixture of Gaussians: for i = 1:n, generate a uniform U(0,1) random number to determine which of the K components to draw a sample from (based on the probabilities pi_k), then generate a sample from that component's Gaussian N(mu_k, Sigma_k); end.
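A minimal sketch of this per-point (ancestral sampling) procedure; the names sample_gmm, pis, mus, and Sigmas are illustrative, not from the slides.

```python
import numpy as np

def sample_gmm(n, pis, mus, Sigmas, seed=0):
    """Draw n points from a mixture of Gaussians, one point at a time."""
    rng = np.random.default_rng(seed)
    X = np.empty((n, len(mus[0])))
    z = np.empty(n, dtype=int)
    cum_pis = np.cumsum(pis)                                 # cumulative mixing probabilities
    for i in range(n):
        k = int(np.searchsorted(cum_pis, rng.uniform()))     # U(0,1) number picks a component via the pi_k
        z[i] = k
        X[i] = rng.multivariate_normal(mus[k], Sigmas[k])    # sample from N(mu_k, Sigma_k)
    return X, z
```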

EM Algorithm. An equivalent procedure to generate a mixture of Gaussians: for k = 1:K, compute the number of samples n_k = round(n * pi_k) to draw from the k-th component Gaussian, then generate n_k samples from the Gaussian N(mu_k, Sigma_k); end. (Figure: the K component sample sets added together give the mixture sample.)
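A sketch of this equivalent per-component procedure under the same illustrative names (sample_gmm_by_component is hypothetical):

```python
import numpy as np

def sample_gmm_by_component(n, pis, mus, Sigmas, seed=0):
    """Draw a fixed count n_k = round(n * pi_k) from each component, then stack."""
    rng = np.random.default_rng(seed)
    parts = []
    for pi_k, mu_k, Sigma_k in zip(pis, mus, Sigmas):
        n_k = int(round(n * pi_k))                                # samples owed to component k
        parts.append(rng.multivariate_normal(mu_k, Sigma_k, size=n_k))
    return np.vstack(parts)                                       # the K pieces together form the mixture sample
```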

EM Algorithm: labels. Suppose some oracle told us which point comes from which Gaussian. How? By providing a latent variable z_nk which is 1 if point n comes from the k-th component Gaussian, and 0 otherwise (a 1-of-K representation).

EM Algorithm: labels. This lets us recover the underlying generating-process decomposition (figure: the mixture sample shown as the sum of its per-component samples).

EM Algorithm: labels. And we can easily estimate each Gaussian, along with the mixture weights! With N_1, N_2, N_3 points labeled as coming from each of the three components: estimate pi_k = N_k / N, and estimate (mu_1, Sigma_1), (mu_2, Sigma_2), (mu_3, Sigma_3) from their respective points.
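A sketch of this labeled (oracle) estimation; fit_gmm_labeled and its arguments are illustrative names, and the hard labels z play the role of the oracle's z_nk.

```python
import numpy as np

def fit_gmm_labeled(X, z, K):
    """Estimate mixing weights, means, and covariances from hard-labeled points."""
    N = X.shape[0]
    pis, mus, Sigmas = [], [], []
    for k in range(K):
        Xk = X[z == k]                                       # the N_k points labeled as component k
        pis.append(Xk.shape[0] / N)                          # pi_k = N_k / N
        mus.append(Xk.mean(axis=0))                          # sample mean
        Sigmas.append(np.cov(Xk, rowvar=False, bias=True))   # maximum-likelihood covariance
    return np.array(pis), np.array(mus), np.array(Sigmas)
```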

EM Algorithm. Remember that this was a problem... how can I make that inner sum be a product instead???

EM Algorithm. Remember that this was a problem... how can I make that inner sum be a product instead??? Again, if an oracle gave us the values of the latent variables (the component that generated each point), we could work with the complete-data log likelihood, and the log of that looks much better!

Latent Variable View. Note: for a given n, there are K of these latent variables, and only ONE of them is 1 (all the rest are 0).

Latent Variable View. This is thus equivalent to writing the mixture with an explicit 1-of-K latent variable z. Note: for a given n, there are K of these latent variables z_nk, and only ONE of them is 1 (all the rest are 0).
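The equation on this slide did not survive the transcription; the standard 1-of-K latent-variable formulation it refers to (assumed here, following the Bishop notation cited later) is:

```latex
p(z) = \prod_{k=1}^{K} \pi_k^{z_k}, \qquad
p(x \mid z) = \prod_{k=1}^{K} \mathcal{N}(x \mid \mu_k, \Sigma_k)^{z_k}, \qquad
p(x) = \sum_{z} p(z)\, p(x \mid z) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k).
```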

Latent Variable View. In the complete-data log likelihood, each component's (mu_k, Sigma_k) appears in its own set of terms, so each can be estimated separately.

Latent Variable View. Each component's (mu_k, Sigma_k) can be estimated separately; the mixing weights are coupled because they all sum to 1, but it is no big deal to solve.
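The complete-data log likelihood these annotations refer to is also missing from the transcription; the assumed form (standard for a GMM with 1-of-K labels z_nk) is:

```latex
\ln p(X, Z \mid \pi, \mu, \Sigma)
  = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk}
    \bigl[\, \ln \pi_k + \ln \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \,\bigr],
```

so each (mu_k, Sigma_k) appears only in its own terms, while the pi_k are tied together only through the constraint sum_k pi_k = 1.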

EM Algorithm. Unfortunately, oracles don't exist (or if they do, they don't want to talk to us), so we don't know the values of the z_nk variables. What EM proposes to do: 1) compute p(Z | X, theta), the posterior distribution over the z_nk, given our current best guess at the values of theta; 2) compute the expected value of the complete-data log likelihood ln p(X, Z | theta) with respect to the distribution p(Z | X, theta); 3) find theta_new that maximizes that expected value. This is our new best guess at the values of theta; 4) iterate...

Insight. Since we don't know the latent variables, we instead take the expected value of the log likelihood with respect to their distribution. In the GMM case, this is equivalent to softening the binary latent variables to continuous ones (their expected values): z_nk, an unknown discrete value (0 or 1), is replaced by a known continuous value between 0 and 1.

Insight. So now, after replacing the binary latent variables with their continuous expected values: all points contribute to the estimation of all components; each point has unit mass to contribute, but splits it across the K components; and the amount of weight a point contributes to a component is proportional to the relative likelihood that the point was generated by that component.
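A sketch of these soft ownership weights (commonly written gamma_nk; the function name responsibilities is illustrative), assuming SciPy for the Gaussian density:

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pis, mus, Sigmas):
    """gamma_nk: expected value of z_nk, i.e. how much of point n's unit mass goes to component k."""
    weighted = np.column_stack([
        pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
        for k in range(len(pis))
    ])
    return weighted / weighted.sum(axis=1, keepdims=True)   # each row sums to 1 (unit mass per point)
```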

Latent Variable View (with oracle). Given the oracle's hard labels z_nk, each component's (mu_k, Sigma_k) can be estimated separately; the mixing weights are coupled because they all sum to 1, but it is no big deal to solve.

Latent Variable View (with EM, at iteration i). Replacing each z_nk with its expected value computed from the iteration-i parameters (a constant with respect to the new parameters), each component's (mu_k, Sigma_k) can still be estimated separately; the mixing weights remain coupled because they all sum to 1, but it is no big deal to solve.

EM Algorithm for GMM. E-step: compute the ownership weights. M-step: re-estimate the means, covariances, and mixing probabilities using those weights.
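A compact sketch of the whole loop under the same assumed names and SciPy Gaussian density (em_gmm is illustrative; the initialization and fixed iteration count are one choice among many):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=20, seed=0):
    """EM for a GMM: E-step computes ownership weights, M-step refits the weighted Gaussians."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    pis = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, size=K, replace=False)]         # initialize means at random data points
    Sigmas = np.array([np.cov(X, rowvar=False)] * K)      # start every covariance at the data covariance
    for _ in range(n_iter):
        # E-step: gamma_nk proportional to pi_k * N(x_n | mu_k, Sigma_k)
        weighted = np.column_stack([
            pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k]) for k in range(K)
        ])
        gamma = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: weighted versions of the labeled-data estimates
        Nk = gamma.sum(axis=0)                             # effective number of points per component
        pis = Nk / N
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            D = X - mus[k]
            Sigmas[k] = (gamma[:, k, None] * D).T @ D / Nk[k]
    return pis, mus, Sigmas, gamma
```

This mirrors the slides' story: the M-step is just the labeled-data estimation from earlier, with the hard counts replaced by the soft ownership weights.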

Gaussian Mixture Example: Start. (Figures copyright 2001, Andrew W. Moore.)

After first iteration

After 2nd iteration

After 3rd iteration

After 4th iteration

After 5th iteration

After 6th iteration

After 20th iteration

Recall: Labeled vs. Unlabeled Data. Labeled: easy to estimate the parameters (do each color separately). Unlabeled: hard to estimate the parameters (we need to assign colors).

EM produces a soft labeling: each point makes a weighted contribution to the estimation of ALL components.

General EM. E-step: evaluate the posterior p(Z | X, theta_old). M-step: evaluate the theta_new that maximizes the expected complete-data log likelihood (steps 2-3 of the recipe above). Check for convergence and iterate.

Intuitive Explanation (in terms of function maximization and lower bounds). Figure sequence: the objective f(x) with the current estimate x1; a lower bound b1(x) constructed to touch f at x1; the maximizer x2 of b1(x); and a new lower bound b2(x) constructed at x2.

Intuitive Explanation. Why does this work? By construction, b1(x1) = f(x1). Since x2 maximizes b1, b1(x2) >= b1(x1). Since b1 is a lower bound on f, f(x2) >= b1(x2). So it is guaranteed that f(x2) >= f(x1), and in general, at each iteration f(x_new) >= f(x_old). If f(x) is bounded above, then the process should converge to a (local) maximum.

More Rigorous Proof. We will use Jensen's inequality for convex functions (see, for example, Bishop, PRML, p. 56), with some manipulation, and reversing the inequality because log is a concave rather than convex function...

Proof that EM works. Start from the log likelihood ln p(X | theta): this is the function f(x) in our earlier picture. Using the definition of probability (marginalizing over Z) and introducing any distribution q(Z), we can write ln p(X | theta) = ln sum_Z q(Z) [p(X, Z | theta) / q(Z)]. Now use Jensen's inequality... which gives ln p(X | theta) >= sum_Z q(Z) ln [p(X, Z | theta) / q(Z)]. This right-hand side is the lower bound b(x) in our earlier picture; Bishop calls this L(q,\theta).
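The same bound written out in LaTeX (a restatement of the slide's equations, which did not survive the transcription; notation follows Bishop as cited above):

```latex
\ln p(X \mid \theta)
  = \ln \sum_{Z} q(Z)\, \frac{p(X, Z \mid \theta)}{q(Z)}
  \;\ge\; \sum_{Z} q(Z)\, \ln \frac{p(X, Z \mid \theta)}{q(Z)}
  \;=\; \mathcal{L}(q, \theta),
```

with the inequality coming from Jensen's inequality applied to the concave ln.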

Proof that EM works. Note that when theta = theta_old and we take q(Z) = p(Z | X, theta_old), expanding this equation shows that L(q,\theta_old) becomes just ln p(X | theta_old), showing that our lower bound touches the function ln p(X | theta) at the current estimate theta_old (as promised by our earlier picture!).

Proof that EM works. So... by maximizing the lower bound over theta we have increased our log likelihood. Therefore, finding the argmax of L(theta, theta_old) is a good thing to do. But wait, there's more...

Proof that EM works. Expanding the lower bound with q(Z) = p(Z | X, theta_old), it splits into two terms: the first is the expected value Q(theta, theta_old) computed in the E-step of EM! The second term doesn't depend on \theta, so ignore it.
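The split referred to here, again written out under the same assumed Bishop notation:

```latex
\mathcal{L}(q, \theta)
  = \underbrace{\sum_{Z} p(Z \mid X, \theta^{\text{old}})\, \ln p(X, Z \mid \theta)}_{Q(\theta,\, \theta^{\text{old}})}
  \;-\; \sum_{Z} p(Z \mid X, \theta^{\text{old}})\, \ln p(Z \mid X, \theta^{\text{old}}),
```

where the second sum does not involve theta.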

Proof that EM works. Since that second term does not depend on \theta, maximizing the lower bound over theta is exactly maximizing Q(theta, theta_old). Therefore, EM is optimizing the right thing!