Mixture Models & EM. Nicholas Ruozzi, University of Texas at Dallas. Based on the slides of Vibhav Gogate.


Previously: We looked at k-means and hierarchical clustering as mechanisms for unsupervised learning. k-means was simply a block coordinate descent scheme for a specific objective function. Today: how to learn probabilistic models for unsupervised learning problems.

EM: Soft Clustering. Clustering (e.g., k-means) typically assumes that each instance is given a hard assignment to exactly one cluster. This does not allow uncertainty in class membership or for an instance to belong to more than one cluster, which is problematic because data points that lie roughly midway between cluster centers are forced into a single cluster. Soft clustering instead gives probabilities that an instance belongs to each of a set of clusters.

Probabilistic Clustering. Try a probabilistic model! It allows overlaps, clusters of different sizes, etc., and can tell a generative story for the data: $p(x \mid y)\, p(y)$. The challenge: we need to estimate the model parameters without labeled $y$'s (i.e., in the unsupervised setting).

Z      X1     X2
?      0.1    2.1
?      0.5   -1.1
?      0.0    3.0
?     -0.1   -2.0
?      0.2    1.5

Probabilistic Clustering. Clusters can have different shapes and sizes, and clusters can overlap! (k-means doesn't allow this.)

Finite Mixture Models. Given a dataset $x^{(1)}, \dots, x^{(N)}$, a mixture model with parameters $\Theta = \{\lambda_1, \dots, \lambda_k, \theta_1, \dots, \theta_k\}$ is

$$p(x \mid \Theta) = \sum_{y=1}^{k} \lambda_y \, p_y(x \mid \theta_y)$$

where $p_y(x \mid \theta_y)$ is a mixture component from some family of probability distributions parameterized by $\theta_y$, and the mixture weights satisfy $\lambda_y \geq 0$ and $\sum_y \lambda_y = 1$. We can think of $\lambda_y = p(Y = y \mid \Theta)$ for some random variable $Y$ that takes values in $\{1, \dots, k\}$.
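To make the generative story concrete, here is a minimal NumPy sketch that samples from a toy one-dimensional mixture; the weights, means, and standard deviations are made-up values for illustration, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical k = 2 mixture: lambda_y, and theta_y = (mu_y, sigma_y) per component.
weights = np.array([0.3, 0.7])   # mixture weights, nonnegative and summing to 1
means = np.array([-2.0, 3.0])
stds = np.array([0.5, 1.0])

# Generative story: draw Y ~ Categorical(lambda), then x ~ p_Y(x | theta_Y).
N = 1000
ys = rng.choice(len(weights), size=N, p=weights)   # latent assignments (unobserved)
xs = rng.normal(means[ys], stds[ys])               # observed data
```

In the unsupervised setting discussed below, only xs would be available; the assignments ys play the role of the latent variable Y.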

Finite Mixture Models (figure).

Multivariate Gaussian. A d-dimensional multivariate Gaussian distribution is defined by a $d \times d$ covariance matrix $\Sigma$ and a mean vector $\mu$:

$$p(x \mid \Sigma, \mu) = \frac{1}{\sqrt{(2\pi)^d \det \Sigma}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$

The covariance matrix describes the degree to which pairs of variables vary together; its diagonal elements are the variances of the individual variables.

Multivariate Gaussian. A d-dimensional multivariate Gaussian distribution is defined by a $d \times d$ covariance matrix $\Sigma$ and a mean vector $\mu$:

$$p(x \mid \Sigma, \mu) = \frac{1}{\sqrt{(2\pi)^d \det \Sigma}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$

The covariance matrix must be a symmetric positive definite matrix in order for the above to make sense. Positive definite means all eigenvalues are positive and the matrix is invertible; this ensures that the quadratic form in the exponent, $-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)$, is concave.
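As a quick check of this density formula, a minimal NumPy sketch (the helper name gaussian_density is our own; scipy.stats.multivariate_normal provides the same computation):

```python
import numpy as np

def gaussian_density(x, mu, Sigma):
    """Evaluate the d-dimensional Gaussian density p(x | Sigma, mu)."""
    d = len(mu)
    diff = x - mu
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    # Quadratic form in the exponent; solve() avoids forming Sigma^{-1} explicitly.
    quad = -0.5 * diff @ np.linalg.solve(Sigma, diff)
    return norm * np.exp(quad)

# Standard bivariate Gaussian evaluated at the origin: 1 / (2*pi) ~ 0.159
print(gaussian_density(np.zeros(2), np.zeros(2), np.eye(2)))
```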

Gaussian Mixture Models (GMMs). We can define a GMM by choosing the kth component of the mixture to be a Gaussian density with parameters $\theta_k = (\mu_k, \Sigma_k)$:

$$p(x \mid \Sigma_k, \mu_k) = \frac{1}{\sqrt{(2\pi)^d \det \Sigma_k}} \exp\left(-\frac{1}{2}(x-\mu_k)^T \Sigma_k^{-1} (x-\mu_k)\right)$$

We could cluster by fitting a mixture of Gaussians to our data. How do we learn these kinds of models?

Learning Gaussian Parameters. MLE for a supervised univariate Gaussian:

$$\mu_{MLE} = \frac{1}{N}\sum_i x^{(i)}, \qquad \sigma^2_{MLE} = \frac{1}{N}\sum_i \left(x^{(i)} - \mu_{MLE}\right)^2$$

MLE for a supervised multivariate Gaussian:

$$\mu_{MLE} = \frac{1}{N}\sum_i x^{(i)}, \qquad \Sigma_{MLE} = \frac{1}{N}\sum_i \left(x^{(i)} - \mu_{MLE}\right)\left(x^{(i)} - \mu_{MLE}\right)^T$$
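These estimates are one line each in NumPy; a small sketch with a hypothetical data matrix X (one row per observation):

```python
import numpy as np

X = np.random.default_rng(0).normal(size=(500, 2))   # hypothetical N x d data matrix

mu_mle = X.mean(axis=0)                    # (1/N) * sum_i x^(i)
diff = X - mu_mle
Sigma_mle = diff.T @ diff / X.shape[0]     # (1/N) * sum_i (x^(i) - mu)(x^(i) - mu)^T
```

Note that np.cov divides by N - 1 by default, whereas the MLE divides by N.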

Learning Gaussian Parameters. MLE for a supervised multivariate mixture of Gaussian distributions:

$$\mu_k^{MLE} = \frac{1}{N_k}\sum_{i : y^{(i)} = k} x^{(i)}, \qquad \Sigma_k^{MLE} = \frac{1}{N_k}\sum_{i : y^{(i)} = k} \left(x^{(i)} - \mu_k^{MLE}\right)\left(x^{(i)} - \mu_k^{MLE}\right)^T$$

The sums are over the observations that were generated by the kth mixture component, and $N_k$ is the number of such observations (this requires that we know which points were generated by which distribution!).

The Unsupervised Case. What if our observations do not include information about which of the mixture components generated them? Consider a joint probability distribution over data points $x^{(i)}$ and mixture assignments $y \in \{1, \dots, k\}$. The MLE is

$$\arg\max_{\Theta} \prod_i p\left(x^{(i)} \mid \Theta\right) = \arg\max_{\Theta} \prod_i \sum_{y=1}^{k} p\left(x^{(i)}, Y = y \mid \Theta\right) = \arg\max_{\Theta} \prod_i \sum_{y=1}^{k} p\left(x^{(i)} \mid Y = y, \Theta\right) p\left(Y = y \mid \Theta\right)$$

We only know how to compute the probabilities for each mixture component.

The Unsupervised Case. In the case of a Gaussian mixture model,

$$p\left(x^{(i)} \mid Y = y, \Theta\right) = \frac{1}{\sqrt{(2\pi)^d \det \Sigma_y}} \exp\left(-\frac{1}{2}\left(x^{(i)}-\mu_y\right)^T \Sigma_y^{-1} \left(x^{(i)}-\mu_y\right)\right), \qquad p(Y = y \mid \Theta) = \lambda_y$$

Differentiating the MLE objective yields a system of equations that is difficult to solve in general. The solution: modify the objective to make the optimization easier.
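To see why this objective is awkward to optimize directly, here is a hedged NumPy sketch of the GMM log-likelihood: each data point contributes the log of a sum over components, so the parameters are coupled inside the log (function and variable names are our own):

```python
import numpy as np

def gmm_log_likelihood(X, lam, mus, Sigmas):
    """Log-likelihood: sum_i log sum_y lambda_y * N(x^(i); mu_y, Sigma_y)."""
    N, d = X.shape
    k = len(lam)
    comp = np.empty((N, k))
    for y in range(k):
        diff = X - mus[y]
        # Quadratic form (x - mu_y)^T Sigma_y^{-1} (x - mu_y) for every row of X
        quad = np.sum(diff * np.linalg.solve(Sigmas[y], diff.T).T, axis=1)
        norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigmas[y]))
        comp[:, y] = lam[y] * np.exp(-0.5 * quad) / norm
    return np.sum(np.log(comp.sum(axis=1)))
```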

Expectation Maximization

Jensen's Inequality. For a convex function $f : \mathbb{R}^n \to \mathbb{R}$, any $a_1, \dots, a_k \in [0,1]$ such that $a_1 + \dots + a_k = 1$, and any $x^{(1)}, \dots, x^{(k)} \in \mathbb{R}^n$,

$$a_1 f\left(x^{(1)}\right) + \dots + a_k f\left(x^{(k)}\right) \geq f\left(a_1 x^{(1)} + \dots + a_k x^{(k)}\right)$$
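A quick numeric illustration with made-up weights and points; the second check uses the concave log, which flips the inequality and is the form the EM derivation uses below:

```python
import numpy as np

a = np.array([0.2, 0.5, 0.3])      # weights in [0, 1] summing to 1
x = np.array([-1.0, 0.5, 2.0])

# Jensen for the convex function f(x) = x**2
assert np.sum(a * x**2) >= np.sum(a * x) ** 2

# For the concave log the inequality flips: sum_y a_y log p_y <= log sum_y a_y p_y
p = np.array([0.4, 0.1, 0.5])
assert np.sum(a * np.log(p)) <= np.log(np.sum(a * p))
```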

EM Algorithm. Let $q_i(y)$ be an arbitrary positive probability distribution over the mixture assignments. Then

$$\begin{aligned}
\log \ell(\Theta) &= \sum_i \log \sum_{y=1}^{k} p\left(x^{(i)}, Y = y \mid \Theta\right) \\
&= \sum_i \log \sum_{y=1}^{k} q_i(y)\, \frac{p\left(x^{(i)}, Y = y \mid \Theta\right)}{q_i(y)} \\
&\geq \sum_i \sum_{y=1}^{k} q_i(y) \log \frac{p\left(x^{(i)}, Y = y \mid \Theta\right)}{q_i(y)} \qquad \text{(Jensen's inequality, since $\log$ is concave)} \\
&\equiv F(\Theta, q)
\end{aligned}$$

EM Algorithm. The new objective is

$$\arg\max_{\Theta,\, q_1, \dots, q_N} \sum_i \sum_{y=1}^{k} q_i(y) \log \frac{p\left(x^{(i)}, Y = y \mid \Theta\right)}{q_i(y)}$$

This objective is not jointly concave in $\Theta$ and $q_1, \dots, q_N$, so the best we can hope for is a local maximum (and there could be A LOT of them). The EM algorithm is a block coordinate ascent scheme that finds a local optimum of this objective, starting from an initialization $\Theta^0$ and $q_1^0, \dots, q_N^0$.

EM Algorithm. E step: with the $\theta$'s fixed, maximize the objective over $q$:

$$q^{t+1} \in \arg\max_{q_1, \dots, q_N} \sum_i \sum_{y=1}^{k} q_i(y) \log \frac{p\left(x^{(i)}, Y = y \mid \Theta^t\right)}{q_i(y)}$$

Using the method of Lagrange multipliers for the constraint that $\sum_y q_i(y) = 1$ gives

$$q_i^{t+1}(y) = p\left(Y = y \mid X = x^{(i)}, \Theta^t\right)$$

EM Algorithm. M step: with the $q$'s fixed, maximize the objective over $\Theta$:

$$\Theta^{t+1} \in \arg\max_{\Theta} \sum_i \sum_{y=1}^{k} q_i^{t+1}(y) \log \frac{p\left(x^{(i)}, Y = y \mid \Theta\right)}{q_i^{t+1}(y)}$$

For the case of a GMM, we can compute this update in closed form. This is not necessarily the case for every model; it may require gradient ascent.

EM Algorithm. Start with random parameters. The E-step maximizes a lower bound on the log-sum for fixed parameters, and the M-step solves the MLE estimation problem for fixed probabilities. Iterate between the E-step and M-step until convergence.

EM for Gaussian Mixtures.

E-step:

$$q_i^t(y) = \frac{\lambda_y^t \, p\left(x^{(i)} \mid \mu_y^t, \Sigma_y^t\right)}{\sum_{y'} \lambda_{y'}^t \, p\left(x^{(i)} \mid \mu_{y'}^t, \Sigma_{y'}^t\right)}$$

where $p\left(x^{(i)} \mid \mu_y^t, \Sigma_y^t\right)$ is the probability of $x^{(i)}$ under the appropriate multivariate normal distribution.

M-step:

$$\mu_y^{t+1} = \frac{\sum_i q_i^t(y)\, x^{(i)}}{\sum_i q_i^t(y)}, \qquad
\Sigma_y^{t+1} = \frac{\sum_i q_i^t(y)\left(x^{(i)} - \mu_y^{t+1}\right)\left(x^{(i)} - \mu_y^{t+1}\right)^T}{\sum_i q_i^t(y)}, \qquad
\lambda_y^{t+1} = \frac{1}{N}\sum_i q_i^t(y)$$
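Putting the E-step and M-step together, here is a compact NumPy sketch of EM for a GMM; the function names, the random initialization, and the small ridge added to each covariance for numerical stability are our own choices, not the instructor's code:

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Density of each row of X under N(mu, Sigma)."""
    d = X.shape[1]
    diff = X - mu
    quad = np.sum(diff * np.linalg.solve(Sigma, diff.T).T, axis=1)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

def em_gmm(X, k, iters=100, seed=0):
    """Fit a k-component GMM to the rows of X with EM."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    lam = np.full(k, 1.0 / k)                         # mixing weights lambda_y
    mus = X[rng.choice(N, size=k, replace=False)]     # means initialized at random data points
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])

    for _ in range(iters):
        # E-step: responsibilities q_i(y) proportional to lambda_y * N(x^(i); mu_y, Sigma_y)
        q = np.stack([lam[y] * gaussian_pdf(X, mus[y], Sigmas[y]) for y in range(k)], axis=1)
        q /= q.sum(axis=1, keepdims=True)

        # M-step: weighted MLE updates for each component
        Nk = q.sum(axis=0)                            # effective number of points per component
        mus = (q.T @ X) / Nk[:, None]
        for y in range(k):
            diff = X - mus[y]
            Sigmas[y] = (q[:, y] * diff.T) @ diff / Nk[y] + 1e-6 * np.eye(d)
        lam = Nk / N

    return lam, mus, Sigmas, q
```

For example, em_gmm(X, k=3) on a two-dimensional dataset returns mixing weights, means, covariances, and the soft responsibilities q used for soft clustering.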

Gaussian Mixture Example: a sequence of figures (not reproduced here) shows the fitted mixture at the start and after the 1st, 2nd, 3rd, 4th, 5th, 6th, and 20th iterations of EM.

Properties of EM. EM converges to a local optimum because each iteration improves the log-likelihood; the proof is the same as for k-means (it is just block coordinate ascent). The E-step can never decrease the likelihood, and the M-step can never decrease the likelihood. If we make hard assignments instead of soft ones, the algorithm is equivalent to k-means!
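As a small illustration of the last point, hardening the responsibilities from the EM sketch above (a hypothetical helper, not part of the slides) turns the weighted mean update into the k-means centroid update, because each point then contributes only to its single most probable component:

```python
import numpy as np

def harden(q):
    """Replace each row of the soft responsibilities q with a one-hot argmax row."""
    hard = np.zeros_like(q)
    hard[np.arange(q.shape[0]), q.argmax(axis=1)] = 1.0
    return hard

# With q_hard = harden(q), the mean update (q_hard.T @ X) / q_hard.sum(axis=0)[:, None]
# averages exactly the points assigned to each cluster, i.e., the k-means update.
```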