Gaussian Mixture Models


David Rosenberg, Brett Bernstein
New York University
DS-GA 1003, April 26, 2017

Intro Question

Suppose we begin with a dataset D = {x_1, ..., x_n} ⊂ R^2 and we run k-means (or k-means++) to obtain k cluster centers. Below we have drawn the cluster centers. If we are given a new x ∈ R^2, we can assign it a label based on which cluster center is closest. What regions of the plane below correspond to each possible labeling?

[Figure: k cluster centers plotted in the plane.]

Intro Solution

Note that each cell is convex and the cells are disjoint (except for their borders). This can be thought of as a limitation of k-means: neither property will hold for GMMs.

[Figure: the same cluster centers, with the plane partitioned into nearest-center regions.]

Gaussian Mixture Models

Yesterday's Intro Question

Consider the following probability model for generating data.
1. Roll a weighted k-sided die to choose a label z ∈ {1, ..., k}. Let π denote the PMF for the die.
2. Draw x ∈ R^d randomly from the multivariate normal distribution N(µ_z, Σ_z).

Solve the following questions.
1. What is the joint distribution of x, z given π and the µ_z, Σ_z values?
2. Suppose you were given the dataset D = {(x_1, z_1), ..., (x_n, z_n)}. How would you estimate the die weightings and the µ_z, Σ_z values?
3. How would you determine the label for a new datapoint x?

Yesterday's Intro Solution

1. The joint PDF/PMF is given by p(x, z) = \pi(z) f(x; \mu_z, \Sigma_z), where

   f(x; \mu_z, \Sigma_z) = \frac{1}{\sqrt{|2\pi\Sigma_z|}} \exp\left(-\frac{1}{2}(x - \mu_z)^T \Sigma_z^{-1} (x - \mu_z)\right).

2. We could use maximum likelihood estimation. Our estimates are

   n_z = \sum_{i=1}^{n} 1(z_i = z), \qquad \hat{\pi}(z) = \frac{n_z}{n}, \qquad
   \hat{\mu}_z = \frac{1}{n_z} \sum_{i: z_i = z} x_i, \qquad
   \hat{\Sigma}_z = \frac{1}{n_z} \sum_{i: z_i = z} (x_i - \hat{\mu}_z)(x_i - \hat{\mu}_z)^T.

3. Choose the label maximizing the joint: \arg\max_z p(x, z).
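To make part 2 concrete, here is a minimal NumPy sketch of these labeled-data MLE formulas. The helper name fit_labeled_gmm and the assumption that labels are coded 0, ..., k-1 are illustrative choices, not from the slides.

```python
import numpy as np

def fit_labeled_gmm(X, z, k):
    """MLE of (pi, mu, Sigma) when cluster labels z are observed (labels assumed 0..k-1)."""
    n, d = X.shape
    pi = np.zeros(k)
    mu = np.zeros((k, d))
    Sigma = np.zeros((k, d, d))
    for c in range(k):
        Xc = X[z == c]                  # points with label c
        n_c = len(Xc)
        pi[c] = n_c / n                 # hat{pi}(c) = n_c / n
        mu[c] = Xc.mean(axis=0)         # hat{mu}_c
        diff = Xc - mu[c]
        Sigma[c] = diff.T @ diff / n_c  # hat{Sigma}_c (MLE divides by n_c, not n_c - 1)
    return pi, mu, Sigma
```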

Probabilistic Model for Clustering

Let's consider a generative model for the data. Suppose:
1. There are k clusters.
2. We have a probability density for each cluster.

Generate a point as follows:
1. Choose a random cluster z ∈ {1, 2, ..., k}.
2. Choose a point from the distribution for cluster z.

The clustering algorithm is then:
1. Use training data to fit the parameters of the generative model.
2. For each point, choose the cluster with the highest likelihood based on the model.

Gaussian Mixture Model (k = 3)

1. Choose z ∈ {1, 2, 3}.
2. Choose x | z ∼ N(µ_z, Σ_z).

Gaussian Mixture Model Parameters (k Components)

Cluster probabilities: π = (π_1, ..., π_k)
Cluster means: µ = (µ_1, ..., µ_k)
Cluster covariance matrices: Σ = (Σ_1, ..., Σ_k)

What if one cluster had many more points than another cluster?

Gaussian Mixture Model: Joint Distribution

Factorize the joint distribution:

   p(x, z) = p(z) p(x | z) = \pi_z \, N(x \mid \mu_z, \Sigma_z)

π_z is the probability of choosing cluster z.
x | z has distribution N(µ_z, Σ_z).
The z corresponding to x is the true cluster assignment.

Suppose we know all the parameters of the model. Then we can easily compute the joint p(x, z) and the conditional p(z | x).

Latent Variable Model

We observe x. In the intro problem we had labeled data, but here we don't observe z, the cluster assignment.
The cluster assignment z is called a hidden variable or latent variable.

Definition. A latent variable model is a probability model for which certain variables are never observed.

e.g. The Gaussian mixture model is a latent variable model.

The GMM Inference Problem

We observe x. We want to know z.
The conditional distribution of the cluster z given x is

   p(z | x) = p(x, z) / p(x).

The conditional distribution gives a soft assignment to clusters.
A hard assignment is

   z^* = \arg\max_{z \in \{1, \dots, k\}} p(z \mid x).

So if we have the model, clustering is trivial.
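To make the inference step concrete, here is a small sketch of computing the soft and hard assignments when the parameters are already known. It assumes SciPy is available; the function name posterior_over_clusters is hypothetical.

```python
import numpy as np
from scipy.stats import multivariate_normal

def posterior_over_clusters(x, pi, mu, Sigma):
    """Soft assignment p(z | x) and hard assignment argmax_z p(z | x) for one point x."""
    k = len(pi)
    # joint p(x, z) = pi_z * N(x | mu_z, Sigma_z) for each z
    joint = np.array([pi[z] * multivariate_normal.pdf(x, mean=mu[z], cov=Sigma[z])
                      for z in range(k)])
    p_z_given_x = joint / joint.sum()            # divide by p(x) = sum_z p(x, z)
    return p_z_given_x, int(np.argmax(p_z_given_x))
```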

Mixture Models

Gaussian Mixture Model: Marginal Distribution

The marginal distribution for a single observation x is

   p(x) = \sum_{z=1}^{k} p(x, z) = \sum_{z=1}^{k} \pi_z \, N(x \mid \mu_z, \Sigma_z).

Note that p(x) is a convex combination of probability densities.
This is a common form for a probability model...

Mixture Distributions (or Mixture Models)

Definition. A probability density p(x) represents a mixture distribution or mixture model if we can write it as a convex combination of probability densities. That is,

   p(x) = \sum_{i=1}^{k} w_i p_i(x), \quad \text{where } w_i \ge 0, \ \sum_{i=1}^{k} w_i = 1, \text{ and each } p_i \text{ is a probability density.}

In our Gaussian mixture model, x has a mixture distribution.

More constructively, let S be a set of probability distributions:
1. Choose a distribution randomly from S.
2. Sample x from the chosen distribution.
Then x has a mixture distribution.
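The constructive description above translates directly into a sampler. Here is a minimal NumPy sketch for the Gaussian case; the function name sample_gmm and the fixed seed are illustrative assumptions.

```python
import numpy as np

def sample_gmm(n, pi, mu, Sigma, seed=0):
    """Draw n points from a Gaussian mixture: pick a component, then sample from it."""
    rng = np.random.default_rng(seed)
    k = len(pi)
    z = rng.choice(k, size=n, p=pi)    # step 1: choose a component for each draw
    X = np.array([rng.multivariate_normal(mu[c], Sigma[c]) for c in z])  # step 2: sample from it
    return X, z
```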

Learning in Gaussian Mixture Models

The GMM Learning Problem

Given data x_1, ..., x_n drawn from a GMM, estimate the parameters:
Cluster probabilities: π = (π_1, ..., π_k)
Cluster means: µ = (µ_1, ..., µ_k)
Cluster covariance matrices: Σ = (Σ_1, ..., Σ_k)

Once we have the parameters, we're done. Just do inference to get cluster assignments.

Estimating/Learning the Gaussian Mixture Model

One approach to learning is maximum likelihood: find parameter values that give the observed data the highest likelihood.
The model likelihood for D = {x_1, ..., x_n} is

   L(\pi, \mu, \Sigma) = \prod_{i=1}^{n} p(x_i) = \prod_{i=1}^{n} \sum_{z=1}^{k} \pi_z \, N(x_i \mid \mu_z, \Sigma_z).

As usual, we'll take our objective function to be the log of this:

   J(\pi, \mu, \Sigma) = \sum_{i=1}^{n} \log\left\{ \sum_{z=1}^{k} \pi_z \, N(x_i \mid \mu_z, \Sigma_z) \right\}

Properties of the GMM Log-Likelihood

GMM log-likelihood:

   J(\pi, \mu, \Sigma) = \sum_{i=1}^{n} \log\left\{ \sum_{z=1}^{k} \pi_z \, N(x_i \mid \mu_z, \Sigma_z) \right\}

Let's compare to the log-likelihood for a single Gaussian:

   \sum_{i=1}^{n} \log N(x_i \mid \mu, \Sigma) = -\frac{nd}{2}\log(2\pi) - \frac{n}{2}\log|\Sigma| - \frac{1}{2}\sum_{i=1}^{n} (x_i - \mu)^T \Sigma^{-1} (x_i - \mu)

For a single Gaussian, the log cancels the exp in the Gaussian density, so things simplify a lot.
For the GMM, the sum inside the log prevents this cancellation, so the expression is more complicated: there is no closed-form expression for the MLE.
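In practice the log of a sum of scaled densities is evaluated with a log-sum-exp for numerical stability. A short sketch of evaluating J(π, µ, Σ) that way, assuming SciPy is available (the function name gmm_log_likelihood is hypothetical):

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_log_likelihood(X, pi, mu, Sigma):
    """J(pi, mu, Sigma) = sum_i log sum_z pi_z N(x_i | mu_z, Sigma_z)."""
    k = len(pi)
    # log(pi_z) + log N(x_i | mu_z, Sigma_z), arranged as an (n, k) array
    log_terms = np.column_stack([
        np.log(pi[z]) + multivariate_normal.logpdf(X, mean=mu[z], cov=Sigma[z])
        for z in range(k)
    ])
    return logsumexp(log_terms, axis=1).sum()   # log-sum-exp over z, then sum over i
```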

Issues with MLE for GMM

Identifiability Issues for GMM

Suppose we have found parameters
Cluster probabilities: π = (π_1, ..., π_k)
Cluster means: µ = (µ_1, ..., µ_k)
Cluster covariance matrices: Σ = (Σ_1, ..., Σ_k)
that are at a local optimum of the likelihood.

What happens if we shuffle the clusters? e.g. Switch the labels for clusters 1 and 2.
We'll get the same likelihood. How many such equivalent settings are there?
Assuming all clusters are distinct, there are k! equivalent solutions.
Not a problem per se, but something to be aware of.

Singularities for GMM

Consider the following GMM for 7 data points:

[Figure: from Bishop's Pattern Recognition and Machine Learning, Figure 9.7, showing a fitted mixture in which one very narrow ("skinny") component is centered exactly on a single data point.]

Let σ² be the variance of the skinny component. What happens to the likelihood as σ² → 0? It diverges to +∞, since that component's density at its own mean grows without bound.
In practice, we end up in local optima that do not have this problem, or we keep restarting the optimization until we do.
A Bayesian approach or regularization will also solve the problem.
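A quick numeric sketch of the singularity, using hypothetical one-dimensional data and a two-component mixture with one component pinned to the first data point:

```python
import numpy as np
from scipy.stats import norm

x = np.array([0.0, 1.1, 1.9, 3.2, 4.0, 4.8, 6.1])   # 7 hypothetical data points

def log_lik(sigma_skinny):
    # component 1: centered exactly on x[0] with small variance; component 2: broad
    comp1 = norm.pdf(x, loc=x[0], scale=sigma_skinny)
    comp2 = norm.pdf(x, loc=x.mean(), scale=x.std())
    return np.sum(np.log(0.5 * comp1 + 0.5 * comp2))

for s in [1.0, 0.1, 0.01, 0.001]:
    print(s, log_lik(s))   # the log-likelihood keeps growing as the skinny variance shrinks
```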

Gradient Descent / SGD for GMM

What about running gradient descent or SGD on

   J(\pi, \mu, \Sigma) = \sum_{i=1}^{n} \log\left\{ \sum_{z=1}^{k} \pi_z \, N(x_i \mid \mu_z, \Sigma_z) \right\} ?

This can be done, but you need to be clever about it.
Each matrix Σ_1, ..., Σ_k has to be positive semidefinite. How do we maintain that constraint?
Rewrite Σ_i = M_i M_i^T, where M_i is an unconstrained matrix. Then Σ_i is positive semidefinite.
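A minimal sketch of that reparameterization (names are illustrative): optimize an unconstrained matrix M and form Σ = M Mᵀ, which is positive semidefinite by construction.

```python
import numpy as np

d = 2
rng = np.random.default_rng(0)
M = rng.normal(size=(d, d))     # unconstrained parameter, e.g. updated by gradient steps
Sigma = M @ M.T                 # always PSD: v^T (M M^T) v = ||M^T v||^2 >= 0

print(np.linalg.eigvalsh(Sigma))   # eigenvalues are nonnegative (up to numerical error)
```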

The EM Algorithm for GMM

MLE for GMM

From yesterday's intro questions, we know that we can solve the MLE problem if the cluster assignments z_i are known:

   n_z = \sum_{i=1}^{n} 1(z_i = z), \qquad \hat{\pi}(z) = \frac{n_z}{n}, \qquad
   \hat{\mu}_z = \frac{1}{n_z} \sum_{i: z_i = z} x_i, \qquad
   \hat{\Sigma}_z = \frac{1}{n_z} \sum_{i: z_i = z} (x_i - \hat{\mu}_z)(x_i - \hat{\mu}_z)^T.

In the EM algorithm we will modify these equations to handle our evolving soft assignments, which we will call responsibilities.

Cluster Responsibilities: Some New Notation

Denote the probability that observed value x_i comes from cluster j by

   \gamma_i^j = P(Z = j \mid X = x_i),

the responsibility that cluster j takes for observation x_i.
Computationally,

   \gamma_i^j = P(Z = j \mid X = x_i) = \frac{p(Z = j, X = x_i)}{p(x_i)} = \frac{\pi_j \, N(x_i \mid \mu_j, \Sigma_j)}{\sum_{c=1}^{k} \pi_c \, N(x_i \mid \mu_c, \Sigma_c)}.

The vector (\gamma_i^1, \dots, \gamma_i^k) is exactly the soft assignment for x_i.
Let n_c = \sum_{i=1}^{n} \gamma_i^c be the number of points soft-assigned to cluster c.

EM Algorithm for GMM: Overview

If we know π and µ_j, Σ_j for all j, then we can easily find \gamma_i^j = P(Z = j \mid X = x_i).
If we know the (soft) assignments, we can easily find estimates for π and µ_j, Σ_j for all j.
Repeatedly alternate the previous two steps.

EM Algorithm for GMM: Overview

1. Initialize the parameters µ, Σ, π.
2. E step. Evaluate the responsibilities using the current parameters:

   \gamma_i^j = \frac{\pi_j \, N(x_i \mid \mu_j, \Sigma_j)}{\sum_{c=1}^{k} \pi_c \, N(x_i \mid \mu_c, \Sigma_c)}, \quad \text{for } i = 1, \dots, n \text{ and } j = 1, \dots, k.

3. M step. Re-estimate the parameters using the responsibilities [compare with the intro question]:

   \mu_c^{\text{new}} = \frac{1}{n_c} \sum_{i=1}^{n} \gamma_i^c x_i, \qquad
   \Sigma_c^{\text{new}} = \frac{1}{n_c} \sum_{i=1}^{n} \gamma_i^c (x_i - \mu_c^{\text{new}})(x_i - \mu_c^{\text{new}})^T, \qquad
   \pi_c^{\text{new}} = \frac{n_c}{n}.

4. Repeat from Step 2 until the log-likelihood converges.
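Putting the E and M steps together, here is a minimal self-contained sketch of the whole loop. SciPy is assumed; the names em_gmm and n_iter are hypothetical, the initialization is deliberately naive, and a small diagonal term is added to each covariance as a crude guard against the singularities discussed earlier.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iter=100, seed=0):
    """EM for a Gaussian mixture: alternate soft assignments (E) and parameter updates (M)."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(k, 1.0 / k)
    mu = X[rng.choice(n, size=k, replace=False)]           # naive init: k random data points
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    for _ in range(n_iter):
        # E step: responsibilities gamma_i^j, an (n, k) array
        dens = np.column_stack([pi[j] * multivariate_normal.pdf(X, mean=mu[j], cov=Sigma[j])
                                for j in range(k)])
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M step: responsibility-weighted versions of the labeled-data MLE formulas
        n_c = resp.sum(axis=0)                              # soft counts
        pi = n_c / n
        mu = (resp.T @ X) / n_c[:, None]
        for j in range(k):
            diff = X - mu[j]
            Sigma[j] = (resp[:, j, None] * diff).T @ diff / n_c[j] + 1e-6 * np.eye(d)
    return pi, mu, Sigma, resp
```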

EM for GMM

Initialization. [Figure: from Bishop's Pattern Recognition and Machine Learning, Figure 9.8.]

EM for GMM

First soft assignment. [Figure: from Bishop's Pattern Recognition and Machine Learning, Figure 9.8.]

EM for GMM

After 5 rounds of EM. [Figure: from Bishop's Pattern Recognition and Machine Learning, Figure 9.8.]

EM for GMM

After 20 rounds of EM. [Figure: from Bishop's Pattern Recognition and Machine Learning, Figure 9.8.]

Relation to k-Means

EM for GMM seems a little like k-means. In fact, there is a precise correspondence.
First, fix each cluster covariance matrix to be σ²I. Then the density for each Gaussian depends only on the distance to the mean.
As we take σ² → 0, the update equations converge to doing k-means.
If you do a quick experiment yourself, you'll find that the soft assignments converge to hard assignments. This has to do with the tail behavior (exponential decay) of the Gaussian.
We can use k-means++ to initialize the parameters of the EM algorithm.
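That quick experiment might look like the following sketch (the data, means, and variable names are hypothetical): with shared covariance σ²I and equal weights, shrinking σ drives the responsibilities toward 0/1 indicators of the nearest mean.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 0.3, size=(50, 2)),
               rng.normal([3.0, 3.0], 0.3, size=(50, 2))])   # two hypothetical blobs
mu = np.array([[0.2, -0.1], [2.8, 3.1]])                      # fixed means for the experiment

for sigma in [2.0, 0.5, 0.1]:
    # with covariance sigma^2 * I, only the squared distance to each mean matters
    sqdist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    logits = -sqdist / (2 * sigma ** 2)
    resp = np.exp(logits - logits.max(axis=1, keepdims=True))
    resp /= resp.sum(axis=1, keepdims=True)
    print(sigma, resp.min(axis=1).max())   # the "softest" assignment approaches 0 as sigma shrinks
```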

Math Prerequisites for General EM Algorithm

Jensen's Inequality

Which is larger: E[X²] or E[X]²? It must be E[X²], since Var[X] = E[X²] − E[X]² ≥ 0.

A more general result is true.

Theorem (Jensen's Inequality). If f : R → R is convex and X is a random variable, then E[f(X)] ≥ f(E[X]). If f is strictly convex, then we have equality if and only if X = E[X] with probability 1 (i.e., X is constant).
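A tiny numeric sanity check of the inequality for the convex function f(x) = x², using an arbitrary (hypothetical) distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.exponential(scale=2.0, size=100_000)   # any random variable will do

lhs = np.mean(X ** 2)        # estimate of E[f(X)] with f(x) = x^2
rhs = np.mean(X) ** 2        # f(E[X])
print(lhs, rhs, lhs >= rhs)  # Jensen: E[X^2] >= (E[X])^2; the gap is Var[X]
```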

Proof of Jensen: Exercise

Suppose X can take exactly two values: x_1 with probability π_1 and x_2 with probability π_2. Then prove Jensen's inequality.

Let's compute E[f(X)]:

   E[f(X)] = \pi_1 f(x_1) + \pi_2 f(x_2) \ge f(\pi_1 x_1 + \pi_2 x_2) = f(E[X]),

where the inequality is exactly the definition of convexity.
For the general proof, what do we know is true about all convex functions f : R → R?

Proof of Jensen

1. Let e = E[X]. (Remember e is just a number.)
2. Since f has a subgradient at e, there is an underestimating line g(x) = ax + b that passes through the point (e, f(e)).
3. Then we have

   E[f(X)] \ge E[g(X)] = E[aX + b] = a E[X] + b = ae + b = f(e) = f(E[X]).

4. If f is strictly convex, then f = g at exactly one point, so we have equality if and only if X is constant.

KL Divergence

Let p(x) and q(x) be probability mass functions (PMFs) on X. We want to measure how different they are.
The Kullback-Leibler or KL divergence is defined by

   KL(p \| q) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)}.

(This assumes absolute continuity: q(x) = 0 implies p(x) = 0.)
We can also write

   KL(p \| q) = E_{x \sim p} \log \frac{p(x)}{q(x)}.

Note that the KL divergence is not symmetric and doesn't satisfy the triangle inequality.
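A small sketch computing the KL divergence between two hypothetical PMFs on a finite set:

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) log(p(x)/q(x)) for PMFs given as arrays."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                          # terms with p(x) = 0 contribute 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.5, 0.3, 0.2])             # hypothetical PMFs
q = np.array([0.4, 0.4, 0.2])
print(kl_divergence(p, q), kl_divergence(q, p))   # not symmetric; both >= 0 (Gibbs' inequality)
```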

Gibbs' Inequality

Theorem (Gibbs' Inequality). Let p(x) and q(x) be PMFs on X. Then KL(p \| q) ≥ 0, with equality if and only if p(x) = q(x) for all x ∈ X.

Since

   KL(p \| q) = E_p\left[ -\log\frac{q(x)}{p(x)} \right],

this is screaming for Jensen's inequality.

Gibbs' Inequality: Proof

   KL(p \| q) = E_p\left[ -\log\frac{q(x)}{p(x)} \right]
              \ge -\log\left( E_p\left[ \frac{q(x)}{p(x)} \right] \right)
              = -\log\left( \sum_{x: p(x) > 0} p(x) \frac{q(x)}{p(x)} \right)
              = -\log\left( \sum_{x: p(x) > 0} q(x) \right)
              \ge -\log\left( \sum_x q(x) \right) = -\log 1 = 0.

Since -\log is strictly convex, we have equality in the Jensen step if and only if q/p is constant, i.e., q = p.