Mixture Models & EM algorithm Lecture 21


Mixture Models & EM algorithm Lecture 21 David Sontag New York University Slides adapted from Carlos Guestrin, Dan Klein, Luke Zettlemoyer, Dan Weld, Vibhav Gogate, and Andrew Moore

The Evils of Hard Assignments? Clusters may overlap. Some clusters may be wider than others. Distances can be deceiving!

Probabilistic Clustering. Try a probabilistic model! It allows overlaps, clusters of different size, etc., and can tell a generative story for the data: P(X | Y) P(Y). Challenge: we need to estimate the model parameters without labeled Y's.

Y     X1    X2
??    0.1   2.1
??    0.5  -1.1
??    0.0   3.0
??   -0.1  -2.0
??    0.2   1.5

The General GMM assumpoon P(Y): There are k components P(X Y): Each component generates data from a mul<variate Gaussian with mean μ i and covariance matrix Σ i Each data point is sampled from a genera&ve process: 1. Choose component i with probability P(y=i) 2. Generate datapoint ~ N(m i, Σ i ) Gaussian mixture model (GMM) µ 1 µ 2 µ 3

What Model Should We Use? Depends on X! Here, maybe Gaussian Naïve Bayes? A multinomial over clusters Y, and an (independent) Gaussian for each X_i given Y, with P(Y = y_k) = θ_k. [Same table of unlabeled data as above.]

Could we make fewer assumptions? What if the X_i co-vary? What if there are multiple peaks? Gaussian Mixture Models! P(Y) is still multinomial, but P(X | Y) is a multivariate Gaussian distribution:

P(X = x_j \mid Y = i) = \frac{1}{(2\pi)^{m/2} |\Sigma_i|^{1/2}} \exp\left( -\frac{1}{2} (x_j - \mu_i)^T \Sigma_i^{-1} (x_j - \mu_i) \right)

Multivariate Gaussians.

P(X = x_j \mid Y = i) = \frac{1}{(2\pi)^{m/2} |\Sigma_i|^{1/2}} \exp\left( -\frac{1}{2} (x_j - \mu_i)^T \Sigma_i^{-1} (x_j - \mu_i) \right)

Σ = identity matrix.

Multivariate Gaussians.

P(X = x_j \mid Y = i) = \frac{1}{(2\pi)^{m/2} |\Sigma_i|^{1/2}} \exp\left( -\frac{1}{2} (x_j - \mu_i)^T \Sigma_i^{-1} (x_j - \mu_i) \right)

Σ = diagonal matrix: the X_i are independent, à la Gaussian NB.

Multivariate Gaussians.

P(X = x_j \mid Y = i) = \frac{1}{(2\pi)^{m/2} |\Sigma_i|^{1/2}} \exp\left( -\frac{1}{2} (x_j - \mu_i)^T \Sigma_i^{-1} (x_j - \mu_i) \right)

Σ = arbitrary (semidefinite) matrix: it specifies a rotation (change of basis), and its eigenvalues specify the relative elongation.

Multivariate Gaussians. The covariance matrix Σ captures the degree to which the x_i vary together. [Figure annotation: eigenvalue λ of Σ.]

P(X = x_j \mid Y = i) = \frac{1}{(2\pi)^{m/2} |\Sigma_i|^{1/2}} \exp\left( -\frac{1}{2} (x_j - \mu_i)^T \Sigma_i^{-1} (x_j - \mu_i) \right)
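As a quick sanity check of this density formula, here is a small Python/NumPy sketch that evaluates it directly and compares against scipy.stats.multivariate_normal; the particular mean and covariance values are arbitrary examples, not from the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_pdf(x, mu, Sigma):
    """Evaluate the multivariate Gaussian density at an m-dimensional point x."""
    m = len(mu)
    diff = x - mu
    norm_const = (2 * np.pi) ** (m / 2) * np.sqrt(np.linalg.det(Sigma))
    quad = diff @ np.linalg.inv(Sigma) @ diff          # (x - mu)^T Sigma^{-1} (x - mu)
    return np.exp(-0.5 * quad) / norm_const

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])             # arbitrary example covariance
x = np.array([0.5, -1.0])

print(gaussian_pdf(x, mu, Sigma))
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))  # the two values should agree
```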

Mixtures of Gaussians (1). Old Faithful Data Set. [Figure: scatter plot of time to eruption vs. duration of last eruption.]

Mixtures of Gaussians (1). Old Faithful Data Set. [Figure: the same data fit by a single Gaussian vs. a mixture of two Gaussians.]

Mixtures of Gaussians (2). Combine simple models into a complex model:

p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)

where \mathcal{N}(x \mid \mu_k, \Sigma_k) is the k-th component and \pi_k its mixing coefficient (example shown with K = 3).

Mixtures of Gaussians (3)

Eliminating Hard Assignments to Clusters. Model data as a mixture of multivariate Gaussians.

Eliminating Hard Assignments to Clusters. Model data as a mixture of multivariate Gaussians. Shown is the posterior probability that a point was generated from the i-th Gaussian: Pr(Y = i | x).

ML esomaoon in supervised semng Univariate Gaussian Mixture of Mul&variate Gaussians ML esomate for each of the MulOvariate Gaussians is given by: n k! x n! k ML = 1 n # k k j=1 n j=1 µ ML = 1 n ( x j " µ ) ML x j " µ ML ( ) T Just sums over x generated from the k th Gaussian

MLE: That was easy! But what if there is unobserved data?

\arg\max_\theta \prod_j P(y_j, x_j)

θ: all model parameters, e.g., class probabilities, means, and variances. But we don't know the y_j's! Maximize the marginal likelihood:

\arg\max_\theta \prod_j P(x_j) = \arg\max_\theta \prod_j \sum_{k=1}^{K} P(Y_j = k, x_j)

How do we opomize? Closed Form? Maximize marginal likelihood: argmax θ j P(x j ) = argmax j k=1 P(Y j =k, x j ) Almost always a hard problem! Usually no closed form soluoon Even when lgp(x,y) is convex, lgp(x) generally isn t For all but the simplest P(X), we will have to do gradient ascent, in a big messy space with lots of local opomum K

Learning general mixtures of Gaussians.

P(y = k \mid x_j) \propto \frac{1}{(2\pi)^{m/2} |\Sigma_k|^{1/2}} \exp\left( -\frac{1}{2} (x_j - \mu_k)^T \Sigma_k^{-1} (x_j - \mu_k) \right) P(y = k)

Marginal likelihood:

\prod_{j=1}^{m} P(x_j) = \prod_{j=1}^{m} \sum_{k=1}^{K} P(x_j, y = k) = \prod_{j=1}^{m} \sum_{k=1}^{K} \frac{1}{(2\pi)^{m/2} |\Sigma_k|^{1/2}} \exp\left( -\frac{1}{2} (x_j - \mu_k)^T \Sigma_k^{-1} (x_j - \mu_k) \right) P(y = k)

We need to differentiate and solve for μ_k, Σ_k, and P(Y = k) for k = 1..K. There is no closed-form solution, the gradient is complex, and there are lots of local optima. Wouldn't it be nice if there were a better way!?
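For concreteness, here is one way to evaluate this marginal likelihood numerically (in log space, with a log-sum-exp over components for stability). This is only a sketch, and the parameter names are assumptions for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_log_marginal_likelihood(X, weights, means, covs):
    """Compute sum_j log sum_k P(y = k) N(x_j | mu_k, Sigma_k) for data X of shape (n, m)."""
    K = len(weights)
    # log of P(x_j, y = k) for every point j and component k: shape (n, K)
    log_joint = np.column_stack([
        np.log(weights[k]) + multivariate_normal(means[k], covs[k]).logpdf(X)
        for k in range(K)
    ])
    # log-sum-exp over components, then sum over data points
    return logsumexp(log_joint, axis=1).sum()
```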

Expectation Maximization

The EM Algorithm. A clever method for maximizing the marginal likelihood:

\arg\max_\theta \prod_j P(x_j) = \arg\max_\theta \prod_j \sum_{k=1}^{K} P(Y_j = k, x_j)

A type of gradient ascent that can be easy to implement (e.g., no line search, no learning rates, etc.). Alternate between two steps: compute an expectation; compute a maximization. Not magic: still optimizing a non-convex function with lots of local optima. The computations are just easier (often, significantly so!).

EM: Two Easy Steps.

Objective: \arg\max_\theta \lg \prod_j \sum_{k=1}^{K} P(Y_j = k, x_j \mid \theta) = \arg\max_\theta \sum_j \lg \sum_{k=1}^{K} P(Y_j = k, x_j \mid \theta)

Data: {x_j | j = 1..n}. (Notation is a bit inconsistent: parameters θ = λ.)

E-step: Compute expectations to fill in the missing y values according to the current parameters θ. For all examples j and values k for Y_j, compute P(Y_j = k | x_j, θ).

M-step: Re-estimate the parameters with weighted MLE estimates. Set

\theta = \arg\max_\theta \sum_j \sum_k P(Y_j = k \mid x_j, \theta) \log P(Y_j = k, x_j \mid \theta)

Especially useful when the E and M steps have closed-form solutions!

EM Algorithm: Pictorial View. Start with a set of parameters and training data. Estimate the class of each training example using the parameters, yielding new (weighted) training data: the class assignment is probabilistic or weighted (soft EM), or hard (hard EM). Then relearn the parameters based on the new training data (now a supervised learning problem), and repeat.

Simple example: learn means only! Consider: 1-D data; a mixture of K = 2 Gaussians; variances fixed to σ = 1; the distribution over classes is uniform. We just need to estimate μ_1 and μ_2.

\prod_{j=1}^{m} \sum_{k=1}^{K=2} P(x_j, Y_j = k) \propto \prod_{j=1}^{m} \sum_{k=1}^{K=2} \exp\left( -\frac{1}{2\sigma^2} \lVert x_j - \mu_k \rVert^2 \right) P(Y_j = k)

EM for GMMs: only learning means. Iterate: on the t-th iteration, let our estimates be λ_t = { μ_1^(t), μ_2^(t), ..., μ_K^(t) }.

E-step: Compute the expected classes of all datapoints:

P(Y_j = k \mid x_j, \mu_1 \ldots \mu_K) \propto \exp\left( -\frac{1}{2\sigma^2} \lVert x_j - \mu_k \rVert^2 \right) P(Y_j = k)

M-step: Compute the most likely new μs given the class expectations:

\mu_k = \frac{\sum_{j=1}^{m} P(Y_j = k \mid x_j)\, x_j}{\sum_{j=1}^{m} P(Y_j = k \mid x_j)}
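A compact NumPy sketch of this means-only EM loop, assuming 1-D data, σ = 1, and a uniform class prior; the initialization and iteration count are arbitrary choices.

```python
import numpy as np

def em_means_only(x, K=2, sigma=1.0, iters=50):
    """EM for a 1-D GMM where only the means are learned (fixed variance, uniform prior)."""
    mus = np.random.default_rng(0).choice(x, size=K, replace=False)  # arbitrary initialization
    for _ in range(iters):
        # E-step: responsibilities P(Y_j = k | x_j); the uniform prior cancels
        log_r = -0.5 * (x[:, None] - mus[None, :]) ** 2 / sigma**2   # shape (n, K)
        r = np.exp(log_r - log_r.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted means
        mus = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)
    return mus
```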

E.M. for General GMMs. Iterate: on the t-th iteration, let our estimates be λ_t = { μ_1^(t), ..., μ_K^(t), Σ_1^(t), ..., Σ_K^(t), p_1^(t), ..., p_K^(t) }, where p_k^(t) is shorthand for the estimate of P(y = k) on the t-th iteration.

E-step: Compute the expected classes of all datapoints for each class (just evaluate a Gaussian at x_j):

P(Y_j = k \mid x_j, \lambda_t) \propto p_k^{(t)}\, p(x_j \mid \mu_k^{(t)}, \Sigma_k^{(t)})

M-step: Compute the weighted MLE for μ, Σ, and p given the expected classes above (m = number of training examples):

p_k^{(t+1)} = \frac{\sum_j P(Y_j = k \mid x_j, \lambda_t)}{m}

\mu_k^{(t+1)} = \frac{\sum_j P(Y_j = k \mid x_j, \lambda_t)\, x_j}{\sum_j P(Y_j = k \mid x_j, \lambda_t)}

\Sigma_k^{(t+1)} = \frac{\sum_j P(Y_j = k \mid x_j, \lambda_t)\, (x_j - \mu_k^{(t+1)})(x_j - \mu_k^{(t+1)})^T}{\sum_j P(Y_j = k \mid x_j, \lambda_t)}
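Putting the E- and M-step updates together, here is a minimal NumPy/SciPy sketch of one EM iteration for a general GMM (no convergence check or covariance regularization; the names are my own).

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, p, mus, Sigmas):
    """One EM iteration for a general GMM. X: (m_examples, d) data array."""
    m, K = len(X), len(p)
    # E-step: r[j, k] = P(Y_j = k | x_j, lambda_t), proportional to p_k N(x_j | mu_k, Sigma_k)
    r = np.column_stack([
        p[k] * multivariate_normal(mus[k], Sigmas[k]).pdf(X) for k in range(K)
    ])
    r /= r.sum(axis=1, keepdims=True)
    # M-step: weighted MLE updates
    Nk = r.sum(axis=0)                                   # effective counts per component
    p_new = Nk / m                                       # p_k^{(t+1)}
    mus_new = (r.T @ X) / Nk[:, None]                    # mu_k^{(t+1)}
    Sigmas_new = np.array([
        (r[:, k, None] * (X - mus_new[k])).T @ (X - mus_new[k]) / Nk[k]
        for k in range(K)
    ])                                                   # Sigma_k^{(t+1)}
    return p_new, mus_new, Sigmas_new
```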

Gaussian Mixture Example: Start

Aqer first iteraoon

Aqer 2nd iteraoon

Aqer 3rd iteraoon

Aqer 4th iteraoon

Aqer 5th iteraoon

Aqer 6th iteraoon

Aqer 20th iteraoon

What if we do hard assignments? Iterate: on the t-th iteration, let our estimates be λ_t = { μ_1^(t), μ_2^(t), ..., μ_K^(t) }.

E-step: Compute the expected classes of all datapoints:

P(Y_j = k \mid x_j, \mu_1 \ldots \mu_K) \propto \exp\left( -\frac{1}{2\sigma^2} \lVert x_j - \mu_k \rVert^2 \right) P(Y_j = k)

M-step: Compute the most likely new μs given the class expectations, but with δ representing a hard assignment to the most likely, or nearest, cluster:

\mu_k = \frac{\sum_{j=1}^{m} \delta(Y_j = k, x_j)\, x_j}{\sum_{j=1}^{m} \delta(Y_j = k, x_j)}

Equivalent to the k-means clustering algorithm!
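The hard-assignment variant replaces the responsibilities with an indicator of the nearest mean, which is exactly the k-means update; a brief sketch under the same illustrative conventions as the earlier snippets.

```python
import numpy as np

def hard_em_step(X, mus):
    """One hard-EM (k-means) iteration: assign each point to its nearest mean, then re-average."""
    # E-step (hard): delta(Y_j = k) = 1 for the closest mean, 0 otherwise
    dists = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)   # (m, K) squared distances
    assign = dists.argmin(axis=1)
    # M-step: mu_k = mean of the points assigned to cluster k
    # (a real implementation would also handle clusters that end up empty)
    return np.array([X[assign == k].mean(axis=0) for k in range(len(mus))])
```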

Let's look at the math behind the magic! Next lecture, we will argue that EM: optimizes a bound on the likelihood; is a type of coordinate ascent; and is guaranteed to converge to an (often local) optimum.