Expectation Maximization Algorithm

Expectation Maximization Algorithm Vibhav Gogate The University of Texas at Dallas Slides adapted from Carlos Guestrin, Dan Klein, Luke Zettlemoyer and Dan Weld

The Evils of Hard Assignments? Clusters may overlap; some clusters may be wider than others; distances can be deceiving!

Probabilistic Clustering. Try a probabilistic model! It allows overlaps, clusters of different size, etc., and can tell a generative story for the data: P(X | Y) P(Y). Challenge: we need to estimate the model parameters without labeled Y's.

Y     X1     X2
??    0.1    2.1
??    0.5   -1.1
??    0.0    3.0
??   -0.1   -2.0
??    0.2    1.5

The General GMM assumption. P(Y): there are k components. P(X | Y): each component generates data from a multivariate Gaussian with mean μ_i and covariance matrix Σ_i. Each data point is sampled from a generative process: 1. Choose component i with probability P(y = i). 2. Generate a datapoint ~ N(μ_i, Σ_i). This is the Gaussian mixture model (GMM). (Figure: three components with means μ_1, μ_2, μ_3.)
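
To make the generative story concrete, here is a minimal sketch in Python (NumPy) that samples data from a hypothetical 3-component GMM; the weights, means, and covariances below are made-up illustration values, not parameters from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical k = 3 mixture in 2-D: mixing weights P(y = i), means mu_i,
# and covariance matrices Sigma_i (illustration values only).
weights = np.array([0.5, 0.3, 0.2])
means = np.array([[0.0, 0.0], [4.0, 4.0], [-4.0, 3.0]])
covs = np.array([np.eye(2), 0.5 * np.eye(2), [[1.0, 0.6], [0.6, 1.0]]])

def sample_gmm(n):
    """Follow the generative story: choose a component, then draw from its Gaussian."""
    ys = rng.choice(len(weights), size=n, p=weights)            # 1. choose component i ~ P(y = i)
    xs = np.array([rng.multivariate_normal(means[i], covs[i])   # 2. x ~ N(mu_i, Sigma_i)
                   for i in ys])
    return xs, ys

X, y = sample_gmm(500)
print(X.shape, np.bincount(y) / len(y))   # empirical component frequencies roughly match weights
```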

What Model Should We Use? Depends on X! Here, maybe Gaussian Naïve Bayes: a multinomial over clusters Y, and a Gaussian over each X_i given Y. (Same unlabeled data table as above.)

Could we make fewer assumptions? What if the X_i co-vary? What if there are multiple peaks? Gaussian Mixture Models! P(Y) is still multinomial; P(X | Y) is a multivariate Gaussian distribution:

P(X = x_j | Y = i) = 1 / ((2π)^{m/2} |Σ_i|^{1/2}) · exp( -½ (x_j - μ_i)^T Σ_i^{-1} (x_j - μ_i) )
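
As an illustration, a small sketch that evaluates the multivariate Gaussian density formula above; `gaussian_density` is a hypothetical helper name, and the SciPy comparison is just a sanity check.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_density(x, mu, Sigma):
    """Multivariate Gaussian density from the slide:
    (2*pi)^(-m/2) * |Sigma|^(-1/2) * exp(-0.5 * (x - mu)^T Sigma^{-1} (x - mu))."""
    m = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (m / 2) * np.sqrt(np.linalg.det(Sigma))
    return float(np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm)

x = np.array([0.5, -1.1])
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
print(gaussian_density(x, mu, Sigma))
print(multivariate_normal(mu, Sigma).pdf(x))   # SciPy agrees
```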

The General GMM assumption. 1. What's a Multivariate Gaussian? 2. What's a Mixture Model? (Figure: three components with means μ_1, μ_2, μ_3.)

Review: Gaussians

Learning Gaussian Parameters (given fully-observable data)

Geometry of the Multivariate Gaussian. The covariance matrix Σ measures the degree to which the x_i vary together; its eigenvectors u_i give the principal axes of the density's elliptical contours, and the eigenvalues λ give the elongation along those axes.

Multivariate Gaussians: Σ = identity matrix, giving spherical contours (all X_i independent with equal variance).

Multivariate Gaussians: Σ = diagonal matrix, so the X_i are independent, à la Gaussian Naïve Bayes.

Multivariate Gaussians: Σ = arbitrary (positive semidefinite) matrix; the eigenvectors specify a rotation (change of basis) and the eigenvalues specify the relative elongation.

The General GMM assumption. 1. What's a Multivariate Gaussian? 2. What's a Mixture Model? (Figure: three components with means μ_1, μ_2, μ_3.)

Mixtures of Gaussians (1): the Old Faithful data set. (Figure: scatter plot with axes "Duration of Last Eruption" and "Time to Eruption".)

Mixtures of Gaussians (1): the Old Faithful data set. (Figure: the data fit with a single Gaussian vs. a mixture of two Gaussians.)

Mixtures of Gaussians (2). Combine simple models into a complex model: p(x) = ∑_{k=1}^{K} π_k N(x | μ_k, Σ_k), where N(x | μ_k, Σ_k) is the k-th component and π_k is its mixing coefficient (K = 3 in the figure).
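
A quick numerical illustration of combining simple models: a hypothetical 1-D mixture with K = 3 components, checking that the mixture density integrates to approximately one. The parameter values are made up.

```python
import numpy as np

def normal_pdf(x, mu, var):
    """Univariate Gaussian density N(x | mu, var)."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Hypothetical K = 3 mixture: mixing coefficients pi_k and component parameters.
pis = [0.4, 0.35, 0.25]
mus = [-2.0, 0.5, 3.0]
variances = [0.5, 1.0, 0.3]

def mixture_pdf(x):
    # p(x) = sum_k pi_k * N(x | mu_k, sigma_k^2)
    return sum(p * normal_pdf(x, m, v) for p, m, v in zip(pis, mus, variances))

xs = np.linspace(-6.0, 7.0, 2001)
dx = xs[1] - xs[0]
print((mixture_pdf(xs) * dx).sum())   # close to 1: the mixture is a valid density
```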

Mixtures of Gaussians (3)

Eliminating Hard Assignments to Clusters. Model data as a mixture of multivariate Gaussians.

Eliminating Hard Assignments to Clusters. Model data as a mixture of multivariate Gaussians, where π_i is the probability that a point was generated from the i-th Gaussian.

Detour/Review: Supervised MLE for GMM. How do we estimate parameters for Gaussian mixtures with fully supervised data? We have to define an objective and solve the optimization problem.

P(y = i, x_j) = 1 / ((2π)^{m/2} |Σ_i|^{1/2}) · exp( -½ (x_j - μ_i)^T Σ_i^{-1} (x_j - μ_i) ) · P(y = i)

For example, the MLE estimate has a closed-form solution:

μ_ML = (1/n) ∑_{j=1}^{n} x_j        Σ_ML = (1/n) ∑_{j=1}^{n} (x_j - μ_ML)(x_j - μ_ML)^T

Compare the univariate Gaussian, μ_ML = (1/n) ∑_{j=1}^{n} x_j and σ²_ML = (1/n) ∑_{j=1}^{n} (x_j - μ_ML)², with the mixture of multivariate Gaussians case: μ_ML = (1/n) ∑_{j=1}^{n} x_j and Σ_ML = (1/n) ∑_{j=1}^{n} (x_j - μ_ML)(x_j - μ_ML)^T.
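
A minimal sketch of the closed-form MLE above on synthetic data (for a fully supervised GMM you would apply the same formulas within each labeled class); the true mean and covariance here are made-up values.

```python
import numpy as np

rng = np.random.default_rng(1)
true_mu = np.array([1.0, -2.0])
true_Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=2000)

# Closed-form MLE from the slide:
#   mu_ML    = (1/n) * sum_j x_j
#   Sigma_ML = (1/n) * sum_j (x_j - mu_ML)(x_j - mu_ML)^T
mu_ml = X.mean(axis=0)
diff = X - mu_ml
Sigma_ml = diff.T @ diff / len(X)

print(np.round(mu_ml, 2))      # close to [1, -2]
print(np.round(Sigma_ml, 2))   # close to [[2, 0.5], [0.5, 1]]
```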

That was easy! MLE: argmax_θ ∏_j P(y_j, x_j), where θ denotes all model parameters (e.g., class probabilities, means, and variances for naïve Bayes). But what if there is unobserved data? We don't know the y_j's! Maximize the marginal likelihood instead: argmax_θ ∏_j P(x_j) = argmax_θ ∏_j ∑_{i=1}^{k} P(y_j = i, x_j).

How do we optimize? Is there a closed form? Maximize the marginal likelihood: argmax_θ ∏_j P(x_j) = argmax_θ ∏_j ∑_{i=1}^{k} P(y_j = i, x_j). This is almost always a hard problem! There is usually no closed-form solution; even when P(X, Y) is convex, P(X) generally isn't. For all but the simplest P(X), we will have to do gradient ascent in a big, messy space with lots of local optima.

Learning general mixtures of Gaussians. P(y = i | x_j) ∝ 1 / ((2π)^{m/2} |Σ_i|^{1/2}) · exp( -½ (x_j - μ_i)^T Σ_i^{-1} (x_j - μ_i) ) · P(y = i). Marginal likelihood:

∏_{j=1}^{m} P(x_j) = ∏_{j=1}^{m} ∑_{i=1}^{k} P(x_j, y = i) = ∏_{j=1}^{m} ∑_{i=1}^{k} 1 / ((2π)^{m/2} |Σ_i|^{1/2}) · exp( -½ (x_j - μ_i)^T Σ_i^{-1} (x_j - μ_i) ) · P(y = i)

We need to differentiate and solve for μ_i, Σ_i, and P(Y = i) for i = 1..k. There is no closed-form solution, the gradient is complex, and there are lots of local optima. Wouldn't it be nice if there were a better way!?
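
For reference, a small sketch that evaluates this marginal log-likelihood for a GMM; it is the objective EM will maximize and is handy for monitoring convergence. The function name and the toy parameters are hypothetical.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_marginal_log_likelihood(X, weights, means, covs):
    """sum_j log sum_i P(y = i) N(x_j | mu_i, Sigma_i): the marginal log-likelihood,
    which has no closed-form maximizer when the y's are unobserved."""
    joint = np.column_stack([
        weights[i] * multivariate_normal(means[i], covs[i]).pdf(X)
        for i in range(len(weights))
    ])                                   # shape (n, k): joint P(y = i, x_j)
    return float(np.log(joint.sum(axis=1)).sum())

# Hypothetical parameters and data, just to exercise the function.
X = np.random.default_rng(2).normal(size=(100, 2))
weights = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0], [2.0, 2.0]])
covs = np.array([np.eye(2), np.eye(2)])
print(gmm_marginal_log_likelihood(X, weights, means, covs))
```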

Expectation Maximization

The EM Algorithm. A clever method for maximizing the marginal likelihood: argmax_θ ∏_j P(x_j) = argmax_θ ∏_j ∑_{i=1}^{k} P(y_j = i, x_j). A type of gradient ascent that can be easy to implement (e.g., no line search, no learning rates, etc.). Alternate between two steps: compute an expectation, then compute a maximization. It is not magic: we are still optimizing a non-convex function with lots of local optima; the computations are just easier (often significantly so!).

EM: Two Easy Steps. Objective: argmax_θ ∏_j ∑_{i=1}^{k} P(y_j = i, x_j | θ), equivalently argmax_θ ∑_j log ∑_{i=1}^{k} P(y_j = i, x_j | θ). Data: {x_j | j = 1..n}. (The notation is a bit inconsistent; θ denotes the parameters.) E-step: compute expectations to fill in the missing y values according to the current parameters: for all examples j and values i for y, compute P(y_j = i | x_j, θ). M-step: re-estimate the parameters with weighted MLE estimates: set θ = argmax_θ ∑_j ∑_i P(y_j = i | x_j) log P(y_j = i, x_j | θ), with the E-step probabilities P(y_j = i | x_j) held fixed. EM is especially useful when the E and M steps have closed-form solutions!

EM algorithm: Pictorial View. Given a set of parameters and training data: (1) estimate the class of each training example using the current parameters, yielding new (weighted) training data; the class assignment is probabilistic or weighted (soft EM) or hard (hard EM); (2) relearn the parameters from the new training data, as in a supervised learning problem; repeat.

Simple example: learn means only! Consider 1D data, a mixture of k = 2 Gaussians, variances fixed to σ = 1, and a uniform distribution over classes; we just need to estimate μ_1 and μ_2. The marginal likelihood is

∏_{j=1}^{m} ∑_{i=1}^{k} P(x_j, y = i) ∝ ∏_{j=1}^{m} ∑_{i=1}^{k} exp( -‖x_j - μ_i‖² / (2σ²) ) P(y = i)

EM for GMMs: only learning means. Iterate: on the t-th iteration, let our estimates be θ_t = { μ_1^(t), μ_2^(t), ..., μ_k^(t) }.

E-step: compute the expected classes of all datapoints:
P(y = i | x_j, μ_1 ... μ_k) ∝ exp( -‖x_j - μ_i‖² / (2σ²) ) P(y = i)

M-step: compute the most likely new μ's given the class expectations:
μ_i = ∑_{j=1}^{m} P(y = i | x_j) x_j / ∑_{j=1}^{m} P(y = i | x_j)
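
A minimal sketch of this means-only EM in Python, under the slide's assumptions (1-D data, k = 2, σ = 1, uniform class prior); the synthetic data and initial guesses are made up.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical 1-D data from two unit-variance Gaussians with means -2 and 3.
x = np.concatenate([rng.normal(-2.0, 1.0, 150), rng.normal(3.0, 1.0, 150)])

mu = np.array([-1.0, 1.0])              # initial guesses for mu_1, mu_2
for t in range(50):
    # E-step: P(y = i | x_j) proportional to exp(-(x_j - mu_i)^2 / 2), uniform prior
    resp = np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: mu_i = sum_j P(y = i | x_j) x_j / sum_j P(y = i | x_j)
    mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)

print(mu)   # approaches the true means (-2, 3), up to a label swap
```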

E.M. for General GMMs. Iterate: on the t-th iteration, let our estimates be θ_t = { μ_1^(t), ..., μ_k^(t), Σ_1^(t), ..., Σ_k^(t), p_1^(t), ..., p_k^(t) }, where p_i^(t) is shorthand for the estimate of P(y = i) on the t-th iteration.

E-step: compute the expected classes of all datapoints for each class:
P(y = i | x_j, θ_t) ∝ p_i^(t) · p(x_j | μ_i^(t), Σ_i^(t))   (just evaluate a Gaussian at x_j)

M-step: compute the weighted MLE for the parameters given the expected classes above (m = #training examples):
μ_i^(t+1) = ∑_j P(y = i | x_j, θ_t) x_j / ∑_j P(y = i | x_j, θ_t)
Σ_i^(t+1) = ∑_j P(y = i | x_j, θ_t) (x_j - μ_i^(t+1)) (x_j - μ_i^(t+1))^T / ∑_j P(y = i | x_j, θ_t)
p_i^(t+1) = ∑_j P(y = i | x_j, θ_t) / m
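
Putting the E and M updates together, a compact sketch of soft EM for a general GMM (no convergence test, covariance regularization, or random restarts); `em_gmm` is a hypothetical function name and the demo data are synthetic.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, iters=100, seed=0):
    """Soft EM for a general GMM, following the E/M updates above (a minimal sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    means = X[rng.choice(n, size=k, replace=False)]   # initialize means at random data points
    covs = np.stack([np.eye(d)] * k)                  # identity covariances
    weights = np.full(k, 1.0 / k)                     # uniform mixing weights

    for _ in range(iters):
        # E-step: responsibilities P(y = i | x_j, theta_t) ∝ p_i * N(x_j | mu_i, Sigma_i)
        resp = np.column_stack([
            weights[i] * multivariate_normal(means[i], covs[i]).pdf(X)
            for i in range(k)
        ])
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: weighted MLE for means, covariances, and mixing weights
        nk = resp.sum(axis=0)                         # effective number of points per component
        means = (resp.T @ X) / nk[:, None]
        for i in range(k):
            diff = X - means[i]
            covs[i] = (resp[:, i, None] * diff).T @ diff / nk[i]
        weights = nk / n
    return weights, means, covs

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(c, 1.0, size=(200, 2)) for c in (-4.0, 0.0, 4.0)])
w, mu, S = em_gmm(X, k=3)
print(np.round(mu, 1))   # roughly the three cluster centers, in some order
```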

Gaussian Mixture Example: Start

After first iteration

After 2nd iteration

After 3rd iteration

After 4th iteration

After 5th iteration

After 6th iteration

After 20th iteration

What if we do hard assignments? Iterate: on the t-th iteration, let our estimates be θ_t = { μ_1^(t), μ_2^(t), ..., μ_k^(t) }.

E-step: compute the expected classes of all datapoints:
P(y = i | x_j, μ_1 ... μ_k) ∝ exp( -‖x_j - μ_i‖² / (2σ²) ) P(y = i)

M-step: compute the most likely new μ's given the class expectations, but replace the soft weights with δ(y = i, x_j), a hard assignment to the most likely or nearest cluster:
μ_i = ∑_{j=1}^{m} δ(y = i, x_j) x_j / ∑_{j=1}^{m} δ(y = i, x_j)

This is equivalent to the k-means clustering algorithm!
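
For comparison, a sketch of the hard-assignment variant, which reduces to Lloyd's k-means: the soft responsibilities are replaced by δ(y = i, x_j), the indicator of the nearest mean. The function name and demo data are made up.

```python
import numpy as np

def hard_em_kmeans(X, k, iters=50, seed=0):
    """Hard-assignment EM for means only: delta(y = i, x_j) picks the nearest mean,
    which is exactly k-means (a minimal sketch with only a trivial empty-cluster guard)."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # "E-step": hard-assign each point to its most likely (nearest) mean.
        d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        # "M-step": mu_i = sum_j delta(y = i, x_j) x_j / sum_j delta(y = i, x_j)
        for i in range(k):
            if np.any(assign == i):
                means[i] = X[assign == i].mean(axis=0)
    return means, assign

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(c, 0.5, size=(150, 2)) for c in (-2.0, 2.0)])
print(np.round(hard_em_kmeans(X, k=2)[0], 2))   # roughly [-2, -2] and [2, 2]
```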

Let's look at the math behind the magic! We will argue that EM: optimizes a bound on the likelihood; is a type of coordinate ascent; and is guaranteed to converge to an (often local) optimum.

The general learning problem with missing data. Marginal likelihood, where x is observed and z (e.g., the class labels y) is missing: ℓ(θ : Data) = ∑_j log P(x_j | θ) = ∑_j log ∑_z P(x_j, z | θ). Objective: find argmax_θ ℓ(θ : Data).

A Key Computation: E-step. x is observed, z is missing. Compute the probability of the missing data given the current choice of parameters, Q(z | x_j) = P(z | x_j, θ^(t)) for each x_j; e.g., this is the probability computed during the classification step, and it corresponds to the classification step in K-means.

Properties of EM. We will prove that EM converges to a local optimum, and that each iteration improves the log-likelihood. How? (Same as for k-means.) The E-step can never decrease the likelihood; the M-step can never decrease the likelihood.

Jensen's inequality. Theorem: log ∑_z P(z) f(z) ≥ ∑_z P(z) log f(z). (E.g., the binary case for a convex function f.)
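
A tiny numerical check of the inequality, using an arbitrary made-up distribution P(z) and positive values f(z).

```python
import numpy as np

# Check log sum_z P(z) f(z) >= sum_z P(z) log f(z) on made-up numbers.
P = np.array([0.2, 0.5, 0.3])          # a distribution over z
f = np.array([1.0, 4.0, 9.0])          # positive values f(z)

lhs = np.log((P * f).sum())            # log of the mean
rhs = (P * np.log(f)).sum()            # mean of the logs
print(lhs, rhs, lhs >= rhs)            # True: log of the mean >= mean of the logs
```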

Applying Jensen's inequality. Use log ∑_z P(z) f(z) ≥ ∑_z P(z) log f(z) with P = Q(z | x_j) and f = P(x_j, z | θ) / Q(z | x_j):

log P(x_j | θ) = log ∑_z Q(z | x_j) [ P(x_j, z | θ) / Q(z | x_j) ] ≥ ∑_z Q(z | x_j) log [ P(x_j, z | θ) / Q(z | x_j) ]

The M-step. Lower bound (from Jensen): F(θ, Q) = ∑_j ∑_z Q(z | x_j) log [ P(x_j, z | θ) / Q(z | x_j) ]. Maximization step: θ^(t+1) = argmax_θ ∑_j ∑_z Q^(t+1)(z | x_j) log P(x_j, z | θ). We are optimizing a lower bound! Use expected counts to do weighted learning: if learning requires Count(x, z), use E_{Q^(t+1)}[Count(x, z)]. Looks a bit like boosting!

Convergence of EM. Define the potential function F(θ, Q): the lower bound from Jensen's inequality. EM is coordinate ascent on F! Thus, it maximizes a lower bound on the marginal log likelihood.

The M-step can't decrease F(θ, Q): by definition! We are maximizing F over θ directly, ignoring a constant (the term that depends only on Q)!

E-step: it takes more work to show that F(θ, Q) doesn't decrease. KL divergence measures the distance between distributions, KL(Q ‖ P) = ∑_z Q(z) log [ Q(z) / P(z) ]; KL is zero if and only if Q = P.
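
A small sketch of the KL divergence used here, confirming it is zero exactly when Q = P (assuming both distributions have full support); the example distributions are made up.

```python
import numpy as np

def kl_divergence(Q, P):
    """KL(Q || P) = sum_z Q(z) log(Q(z) / P(z)); zero iff Q = P (assumes full support)."""
    Q, P = np.asarray(Q, dtype=float), np.asarray(P, dtype=float)
    return float((Q * np.log(Q / P)).sum())

P = np.array([0.25, 0.25, 0.5])
print(kl_divergence(P, P))                        # 0.0
print(kl_divergence([0.1, 0.3, 0.6], P))          # > 0
```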

E-step also doesn't decrease F, Step 1: fix θ to θ^(t) and take the max over Q. Rewriting the bound, F(θ^(t), Q) = ∑_j log P(x_j | θ^(t)) - ∑_j KL( Q(z | x_j) ‖ P(z | x_j, θ^(t)) ).

E-step also doesn't decrease F, Step 2: fixing θ to θ^(t), the max over Q yields Q(z | x_j) = P(z | x_j, θ^(t)). Why? The likelihood term is a constant, and the KL term is zero iff its arguments are the same distribution! So the E-step is actually a maximization / tightening of the bound: it ensures that F(θ^(t), Q^(t+1)) = ∑_j log P(x_j | θ^(t)).

EM is coordinate ascent. M-step: fix Q, maximize F over θ (a lower bound on ℓ(θ)). E-step: fix θ, maximize F over Q; this realigns F with the likelihood: F(θ^(t), Q^(t+1)) = ∑_j log P(x_j | θ^(t)).

What you should know. K-means for clustering: the algorithm converges because it is coordinate ascent. Know what agglomerative clustering is. EM for a mixture of Gaussians: also coordinate ascent; how to learn maximum likelihood parameters (locally maximum likelihood) in the case of unlabeled data; the relation to K-means (hard vs. soft clustering, probabilistic model). Remember, EM can get stuck in local optima, and empirically it DOES.

Acknowledgements. The K-means & Gaussian mixture models presentation contains material from an excellent tutorial by Andrew Moore: http://www.autonlab.org/tutorials/ K-means applet: http://www.elet.polimi.it/upload/matteucc/clustering/tutorial_html/appletkm.html Gaussian mixture models applet: http://www.neurosci.aist.go.jp/%7eakaho/mixtureem.html