Gaussian Mixture Models, Expectation Maximization


Gaussian Mixture Models, Expectation Maximization
Instructor: Jessica Wu, Harvey Mudd College
The instructor gratefully acknowledges Andrew Ng (Stanford), Andrew Moore (CMU), Eric Eaton (UPenn), David Kauchak (Pomona), and the many others who made their course materials freely available online.
Robot Image Credit: Viktoriya Sukhanova 123RF.com

Gaussian Mixture Models: Overview
Learning Goals
- Describe the differences between k-means and GMMs

K-Means: Another View
- Initialize cluster centers
- Assign examples to the closest center
- Update cluster centers
- (k-means assumes spherical clusters)
Based on slides by David Kauchak

Gaussian Mixture Models
- Assume the data came from a mixture of Gaussians (elliptical data)
- Assign data to a cluster with a certain probability (soft clustering), e.g. p(blue) = 0.8, p(red) = 0.2 for one point and p(blue) = 0.9, p(red) = 0.1 for another, rather than a hard k-means assignment
- Very similar at a high level to k-means: iterate between assigning examples and updating cluster centers
Based on slides by David Kauchak

GMM Example
- Initialize cluster centers
- Soft-cluster examples
- Update cluster centers (based on the weighted contribution of examples)
- Keep iterating
Based on slides by David Kauchak [Images by Chris Bishop, PRML]

Learning GMMs
Learning Goals
- Describe the technical details of GMMs

Univariate Gaussian Distribution
- (scalar) random variable X
- parameters: (scalar) mean μ, (scalar) variance σ²
- X ~ N(μ, σ²)
Wikipedia [Normal Distribution]

Multivariate Gaussian Distribution
- random variable vector X = [X_1, ..., X_n]^T
- parameters: mean vector μ ∈ ℝ^n, covariance matrix Σ (symmetric, positive definite)
- X ~ N(μ, Σ)
- contour line: typically drawn at 1/e of peak height
Wikipedia [Multivariate Normal Distribution]
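
The density formula itself does not survive in this transcription; for reference, the standard multivariate Gaussian density (the quantity the E step later evaluates at each x^(i)) is

\[ \mathcal{N}(x;\, \mu, \Sigma) \;=\; \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\!\left( -\tfrac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right), \]

which reduces in the univariate case to \( \mathcal{N}(x;\, \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) \).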

Covariance Matrix
- Recall for a pair of r.v.'s X and Y, covariance is defined as cov(X, Y) = E[(X − E[X])(Y − E[Y])]
- For X = [X_1, ..., X_n]^T, the covariance matrix summarizes covariances across all pairs of variables:
  Σ = E[(X − E[X])(X − E[X])^T] is an n×n matrix s.t. Σ_ij = cov(X_i, X_j)
[Figure: covariance matrix structures (full vs. diagonal, with off-diagonal zeros) and their parameter counts]

GMMs as Generative Model
- There are k components
- Component j has an associated mean vector μ_j and covariance matrix Σ_j and generates data from N(μ_j, Σ_j)
- Each example x^(i) is generated according to the following recipe:
  1. pick component j at random with probability φ_j
  2. sample x^(i) ~ N(μ_j, Σ_j)
[Figure: three Gaussian components μ_1, μ_2, μ_3 and a sampled point x]
Based on slides by Andrew Moore
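
As a concrete illustration of this recipe, here is a minimal sampling sketch in Python; the mixture weights, means, and covariances below are made-up values for illustration, not taken from the slides:

    import numpy as np

    rng = np.random.default_rng(0)

    # hypothetical 2-D mixture with k = 3 components (illustrative parameters only)
    phi = np.array([0.5, 0.3, 0.2])                        # component probabilities, sum to 1
    mus = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 2.0]])  # mean vectors mu_j
    Sigmas = np.array([np.eye(2), 0.5 * np.eye(2), np.diag([1.0, 0.3])])  # covariances Sigma_j

    def sample_gmm(n):
        """Generate n examples: pick component j with probability phi_j, then sample x ~ N(mu_j, Sigma_j)."""
        z = rng.choice(len(phi), size=n, p=phi)            # latent component for each example
        X = np.array([rng.multivariate_normal(mus[j], Sigmas[j]) for j in z])
        return X, z

    X, z = sample_gmm(500)   # X has shape (500, 2); z records which component generated each row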

GMMs as Generative Model
- We are given a training set {x^(1), ..., x^(n)} (w/o labels)
- We model the data by specifying the joint distribution p(x^(i), z^(i)) = p(x^(i) | z^(i)) p(z^(i))
- Here, for k components, z^(i) ~ Multinomial(φ), where φ_j = p(z^(i) = j), φ_j ≥ 0 (and Σ_j φ_j = 1),
  and x^(i) | z^(i) = j ~ N(μ_j, Σ_j)
- Goals:
  - Determine the z^(i) (soft cluster assignments)
  - Determine the model parameters φ_j, μ_j, Σ_j (1 ≤ j ≤ k)
- Note: the z^(i) are latent r.v.'s (they are hidden/unobserved); this is what makes the estimation problem difficult
Based on notes by Andrew Ng

GMM Optimization
- Assume the supervised setting (known cluster assignments)
- MLE for the univariate Gaussian (sum over the points generated from this Gaussian)
- MLE for the multivariate Gaussian
Based on slides by David Sontag
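
The MLE expressions referenced on this slide do not survive in the transcription; the standard closed forms, for m points x^(1), ..., x^(m) drawn from a single Gaussian, are

\[ \hat\mu = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}, \qquad \hat\sigma^2 = \frac{1}{m} \sum_{i=1}^{m} \left(x^{(i)} - \hat\mu\right)^2 \ \text{(univariate)}, \qquad \hat\Sigma = \frac{1}{m} \sum_{i=1}^{m} \left(x^{(i)} - \hat\mu\right)\left(x^{(i)} - \hat\mu\right)^T \ \text{(multivariate)}. \]

In the supervised GMM setting, each sum runs only over the points generated from (assigned to) that Gaussian.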

GMM Optimization
- What if we have unobserved data (unknown cluster assignments)? Now what?
Based on notes by Andrew Ng

Expectation Maximization
Based on slides by Andrew Moore

Expectation Maximization
Learning Goals
- Describe when EM is useful
- Describe the two steps of EM
- Practice EM on a toy problem

Expectation Maximization
- Clever method for maximizing marginal likelihoods
- Excellent approach for unsupervised learning
- Can do trivial things (upcoming example)
- One of the most general unsupervised approaches, with many other uses (e.g. HMM inference)
Overview
- Begin with a guess for the model parameters
- Repeat until convergence:
  - Update latent variables based on our expectations [E step]
  - Update model parameters to maximize the log likelihood [M step]
Based on notes by Andrew Ng and slides by Andrew Moore
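
In symbols (standard EM notation; the slides state the steps only in words), one iteration with current parameters θ^(t) is

\[ \text{E step:}\quad Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{z \sim p(z \mid x,\, \theta^{(t)})}\!\left[ \log p(x, z \mid \theta) \right], \qquad \text{M step:}\quad \theta^{(t+1)} = \arg\max_{\theta}\, Q(\theta \mid \theta^{(t)}). \]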

Silly Example
- Let events be grades in a class:
  component 1 = gets an A, P(A) = ½
  component 2 = gets a B, P(B) = p
  component 3 = gets a C, P(C) = 2p
  component 4 = gets a D, P(D) = ½ − 3p
  (note 0 ≤ p ≤ 1/6)
- Assume we want to estimate p from data. In a given class, there were a A's, b B's, c C's, d D's.
- What is the MLE of p given a, b, c, d? (See the worked derivation below.)
  So if the class got: a = 14, b = 6, c = 9, d = 10
Based on slides by Andrew Moore [Clustering with Gaussian Mixtures]

Same Problem with Hidden Information
- Someone tells us that
  # of high grades (A's + B's) = h
  # of C's = c
  # of D's = d
- What is the MLE of p now?
- Remember P(A) = ½, P(B) = p, P(C) = 2p, P(D) = ½ − 3p
- We can answer this question circularly:
  EXPECTATION: If we know the value of p, we could compute the expected values of a and b.
  MAXIMIZATION: If we know the expected values of a and b, we could compute the maximum likelihood value of p.
Based on slides by Andrew Moore [Clustering with Gaussian Mixtures]
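
The MLE expression on this slide does not survive in the transcription; a standard derivation from the multinomial likelihood (an added step, not slide text) gives

\[ \log L(p) \;\propto\; b \log p + c \log 2p + d \log\!\left(\tfrac{1}{2} - 3p\right), \qquad \frac{d}{dp} \log L = \frac{b + c}{p} - \frac{3d}{\tfrac{1}{2} - 3p} = 0 \;\Rightarrow\; \hat{p} = \frac{b + c}{6\,(b + c + d)}. \]

For the class above with (a, b, c, d) = (14, 6, 9, 10), this gives p̂ = 15/150 = 1/10.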

EM for Our Silly Example
- Begin with an initial guess for p
- Iterate between Expectation and Maximization to improve our estimates of p and of a & b
- Define p^(t) = estimate of p on the t-th iteration, b^(t) = estimate of b on the t-th iteration
- Repeat until convergence:
  E step: b^(t) = E[b | p^(t)]
  M step: p^(t+1) = MLE of p given b^(t)
Based on slides by Andrew Moore [Clustering with Gaussian Mixtures]

EM Convergence
- Good news: convergence to a local optimum is guaranteed
- Bad news: local optimum (not necessarily the global one)
- Aside (idea behind the convergence proof):
  - the likelihood must increase or remain the same between iterations [not obvious]
  - the likelihood can never exceed 1 [obvious]
  - so the likelihood must converge [obvious]
- In our example, suppose we had h = 20, c = 10, d = 10, and p^(0) = 0:

  t   p^(t)    b^(t)
  0   0        0
  1   0.0833   2.857
  2   0.0937   3.158
  3   0.0947   3.185
  4   0.0948   3.187
  5   0.0948   3.187
  6   0.0948   3.187

- Error generally decreases by a constant factor each time step (i.e. convergence is linear)
Based on slides by Andrew Moore [Clustering with Gaussian Mixtures]
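
A minimal Python sketch of this loop, reproducing the table above up to rounding; the E-step expression E[b | p] = h·p / (½ + p) and the M-step formula come from the derivation earlier, not from the slide text:

    h, c, d = 20, 10, 10   # observed counts from the example above
    p = 0.0                # initial guess p^(0)

    for t in range(7):
        # E step: expected number of B's among the h pooled high grades, given current p
        b = h * p / (0.5 + p)
        print(f"t={t}  p={p:.4f}  b={b:.3f}")
        # M step: maximum likelihood value of p given the expected count b
        p = (b + c) / (6 * (b + c + d))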

EM Applied to GMMs
Learning Goals
- Describe how to optimize GMMs using EM

Learning GMMs
- Recall z^(i) indicates which of the k Gaussians each x^(i) comes from
- If the z^(i)'s were known, maximizing the likelihood would be easy
- Maximizing w.r.t. φ, μ, Σ gives:
  - φ_j = fraction of examples assigned to component j
  - μ_j, Σ_j = mean and covariance of the examples assigned to component j
Based on notes by Andrew Ng

Learning GMMs
- Since the z^(i)'s are not known, use EM!
- Repeat until convergence:
  [E step] Know model parameters, guess values of the z^(i)'s
  [M step] Know class probabilities, update model parameters
Based on notes by Andrew Ng

(This slide intentionally left blank.)

Learning GMMs
- Since the z^(i)'s are not known, use EM!
- Repeat until convergence:
  [E step] Know model parameters, guess values of the z^(i)'s
  - for each i, j, set w_j^(i) = P(z^(i) = j | x^(i); φ, μ, Σ), computing this posterior probability with Bayes' Rule: evaluate the Gaussian with μ_j and Σ_j at x^(i) and weight by the prior probability φ_j of being assigned to component j
  - compare to k-means: k-means makes a hard assignment c^(i) to the nearest cluster, whereas the w_j^(i)'s are soft guesses for the values of the z^(i)'s
  [M step] Know class probabilities, update model parameters
  - update φ, μ, Σ using the same equations as when the z^(i)'s are known, except the indicator [[z^(i) = j]] is replaced with the probability w_j^(i)
Based on notes by Andrew Ng

Final Comments
- EM is not magic: we are still optimizing a non-convex function with lots of local optima; the computations are just easier (often, significantly so!)
- Problems: EM is susceptible to local optima; reinitialize at several different initial parameters
- Extensions: EM maximizes the log likelihood of the data; it is also possible to maximize the posterior (maximum a posteriori)
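
Putting the two steps together, here is a minimal generic EM-for-GMMs sketch in Python; this is an illustration under the usual update equations, not the course's reference implementation, and the function name `em_gmm` and the small regularization constant are my own choices:

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_gmm(X, k, n_iters=100, seed=0):
        """Fit a k-component GMM to data X of shape (n, d); returns (phi, mus, Sigmas)."""
        n, d = X.shape
        rng = np.random.default_rng(seed)
        phi = np.full(k, 1.0 / k)                      # mixing probabilities phi_j
        mus = X[rng.choice(n, size=k, replace=False)]  # initialize means at random data points
        Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
        for _ in range(n_iters):
            # E step: responsibilities w[i, j] = P(z^(i) = j | x^(i); phi, mu, Sigma) via Bayes' Rule
            w = np.array([phi[j] * multivariate_normal.pdf(X, mus[j], Sigmas[j]) for j in range(k)]).T
            w /= w.sum(axis=1, keepdims=True)
            # M step: same equations as the fully observed case, with indicators replaced by w
            Nj = w.sum(axis=0)                         # effective number of points per component
            phi = Nj / n
            mus = (w.T @ X) / Nj[:, None]
            for j in range(k):
                diff = X - mus[j]
                Sigmas[j] = (w[:, j, None] * diff).T @ diff / Nj[j] + 1e-6 * np.eye(d)
        return phi, mus, Sigmas

    # e.g. phi_hat, mus_hat, Sigmas_hat = em_gmm(X, k=3) on the sampled data from the earlier sketch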

GMM Exercise
We estimated a mixture of two Gaussians based on the two-dimensional data shown below. The mixture was initialized randomly in two different ways and run for three iterations based on each initialization. However, the figures got mixed up. Please draw an arrow from one figure to another to indicate how they follow from each other (you should only draw four arrows).
Exercise by Tommi Jaakkola