EM for Spherical Gaussians

Karthekeyan Chandrasekaran, Hassan Kingravi
December 4, 2007

1 Introduction

In this project, we examine two aspects of the behavior of the EM algorithm for mixtures of spherical Gaussians: (1) the benefit of spectral projection for such mixtures, and (2) the general behavior of the EM algorithm under certain separability criteria. Our current results are for mixtures of two Gaussians, although they can be extended. For (1), we show that the value of the Q function used by EM (defined below) generally increases when one performs spectral projection; for (2), we give empirical results under different separability conditions and state a conjecture based on them.

2 The EM Algorithm

The Expectation-Maximization algorithm is a means of finding maximum likelihood estimates of the parameters of a distribution from data. We sketch a general outline of the method, following Bilmes' tutorial [1]. Assume we are given a density function $p(x \mid \Theta)$ governed by a set of parameters $\Theta$, together with a data set of size $N$, $X = \{x_1, \ldots, x_N\}$, whose vectors are drawn i.i.d. from $p$. The likelihood function is then defined as

$$p(X \mid \Theta) = \prod_{i=1}^{N} p(x_i \mid \Theta) = L(\Theta \mid X).$$

We think of this function as the likelihood of the parameters given the data, and we wish to find the $\Theta$ that maximizes $L$. (One can equivalently maximize the log-likelihood if that makes the analysis easier, as it does in the Gaussian case.) EM maximizes this by treating the observed data as incomplete, i.e., as having associated unobserved values. Assume that $X$ is the observed data, $Y$ is the unobserved data, and $Z = (X, Y)$ is the complete data. Then

$$p(z \mid \Theta) = p(x, y \mid \Theta) = p(y \mid x, \Theta)\, p(x \mid \Theta).$$

The new likelihood function becomes $L(\Theta \mid Z) = L(\Theta \mid X, Y) = p(X, Y \mid \Theta)$, the complete-data likelihood. This function is a random variable, since the missing information $Y$ is unknown. The EM algorithm first finds the expected value of $\log p(X, Y \mid \Theta)$ with respect to the unknown data $Y$, given $X$ and the current parameter estimates. We denote this by

$$Q(\Theta, \Theta^{(i-1)}) = E\big[\log p(X, Y \mid \Theta) \,\big|\, X, \Theta^{(i-1)}\big] = \int_{y \in \Upsilon} \log p(X, y \mid \Theta)\, f(y \mid X, \Theta^{(i-1)})\, dy,$$

where $\Theta^{(i-1)}$ are the current parameter estimates, $\Upsilon$ is the space of values that $y$ can take on, and $f(y \mid X, \Theta^{(i-1)})$ is the distribution governing the random variable $Y$. Once the expectation is evaluated, we maximize it:

$$\Theta^{(i)} = \operatorname*{argmax}_{\Theta}\, Q(\Theta, \Theta^{(i-1)}).$$
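To make the abstract E-step/M-step loop concrete, the following is a minimal sketch (not the authors' implementation) of EM for a mixture of spherical Gaussians with a shared variance. The function and variable names (em_spherical, log_density, n_iter) and the toy data are our own illustrative choices; the E-step computes the posterior responsibilities that play the role of $f(y \mid X, \Theta^{(i-1)})$, and the M-step is the closed-form argmax of $Q$ for this model.

```python
import numpy as np

def log_density(X, mu, sigma2):
    """Log of the spherical Gaussian density N(mu, sigma2 * I) at each row of X."""
    d = X.shape[1]
    sq = np.sum((X - mu) ** 2, axis=1)
    return -0.5 * d * np.log(2 * np.pi * sigma2) - sq / (2 * sigma2)

def em_spherical(X, mu0, sigma2=1.0, alpha=None, n_iter=50):
    """EM for a k-component mixture of spherical Gaussians with a shared variance."""
    mu = mu0.copy()
    k = mu.shape[0]
    if alpha is None:
        alpha = np.full(k, 1.0 / k)            # uniform initial weights, as assumed in Section 3.2
    for _ in range(n_iter):
        # E-step: responsibilities p(l | x_i, Theta^(i-1)).
        log_r = np.stack([np.log(alpha[l]) + log_density(X, mu[l], sigma2) for l in range(k)], axis=1)
        log_r -= log_r.max(axis=1, keepdims=True)   # numerical stability
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: closed-form argmax of Q(Theta, Theta^(i-1)) for this model.
        alpha = r.mean(axis=0)
        mu = (r.T @ X) / r.sum(axis=0)[:, None]
        sq = np.stack([np.sum((X - mu[l]) ** 2, axis=1) for l in range(k)], axis=1)
        sigma2 = np.sum(r * sq) / (X.shape[0] * X.shape[1])
    return alpha, mu, sigma2

# Toy usage: two well-separated spherical Gaussians in 10 dimensions.
rng = np.random.default_rng(0)
d = 10
X = np.vstack([rng.normal(-2.0, 1.0, size=(500, d)), rng.normal(2.0, 1.0, size=(500, d))])
alpha, mu, sigma2 = em_spherical(X, mu0=rng.normal(size=(2, d)))
print("weights:", alpha, " mean coordinates:", mu.mean(axis=1), " sigma^2:", sigma2)
```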

The EM update equations for spherical Gaussians perform the expectation and maximization steps simultaneously. For our current purposes, however, we are interested in the Q function for spherical Gaussians. Assume the following probabilistic model:

$$p(x \mid \Theta) = \sum_{i=1}^{M} \alpha_i\, p_i(x \mid \theta_i),$$

where the parameters are $\Theta = (\alpha_1, \ldots, \alpha_M, \theta_1, \ldots, \theta_M)$, the $p_i$ are density functions characterized by $\theta_i$, and the mixing weights $\alpha_i$ sum to 1. Then [1] shows that, given a guess of parameters $\Theta'$ (for assumptions concerning the guesses, see the next section),

$$Q(\Theta, \Theta') = \sum_{l=1}^{M} \sum_{i=1}^{N} \log(\alpha_l)\, p(l \mid x_i, \Theta') + \sum_{l=1}^{M} \sum_{i=1}^{N} \log\big(p_l(x_i \mid \theta_l)\big)\, p(l \mid x_i, \Theta').$$

Here $p(l \mid x_i, \Theta')$ denotes the probability of the $l$th component given the data point and the parameters, whereas $p_l(x_i \mid \theta_l)$ denotes the probability of $x_i$ coming from the $l$th component. For a spherical Gaussian with $\theta = (\mu, \sigma)$, the latter is given by

$$p_l(x \mid \mu, \sigma) = \frac{1}{\big((2\pi)^{1/2}\sigma\big)^{d}}\, e^{-\frac{\|x-\mu\|^2}{2\sigma^2}}.$$

3 The Benefit of Spectral Projection for Mixtures of Two Spherical Gaussians

In this section, we show the benefit of spectral projection for the first step of the EM algorithm.

3.1 Notation

$x$ — a point drawn from the first Gaussian (i.e. the one we are interested in).
$d$ — the dimension of the original space.
$k$ — the dimension of the SVD subspace (and thus the number of Gaussians).
$\pi_{SVD}(y)$ — the projection of an arbitrary point $y$ onto the SVD subspace.
$\alpha_i$ — the probability that the point $x$ comes from the $i$th Gaussian in the original $d$-dimensional space.
$\hat{\alpha}_i$ — the probability that the point $x$ comes from the $i$th Gaussian in the $k$-dimensional space.
$\mu_i$ — the true mean of the $i$th Gaussian.
$\sigma_i$ — the true standard deviation of the $i$th Gaussian.
$\mu'_i$ — the initial guess for the mean of the $i$th Gaussian.
$\bar{\mu}_k$ — the distance between the first and the $k$th mean, i.e. $\|\mu_1 - \mu_k\|_2$.

3.2 Assumptions

As noted, $x$ comes from the first Gaussian. We are trying to prove that, if $x$ comes from the first Gaussian, spectral projection increases the value of the Q function used in EM in the first iteration. To achieve this, we assume the following:

- For all $i, j \le k$, $\sigma_i = \sigma_j = \sigma$.
- We have an initial guess for the weights, which holds only in the first step of the EM algorithm: $w_1^0 = w_2^0 = \cdots = w_k^0 = \frac{1}{k}$, where the superscript 0 refers to the 0th iteration of the algorithm.
- We use a concentration lemma to bound the expected distances of our points to the guessed means $\mu'_i$. The assumption is that each guess $\mu'_i$ is chosen so that the lemma below applies; roughly speaking, this says that the distance between the guess for the $i$th mean and the true $i$th mean is not too large. Since k-means clustering is often run on the data as a preprocessing step to initialize the means, this is not overly restrictive.
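As a hedged sketch of how the Q sum above and the projection $\pi_{SVD}$ from the notation could be computed for a mixture of spherical Gaussians, consider the following; the function names (mixture_posteriors, q_value, svd_projector) are our own and do not appear in the paper.

```python
import numpy as np

def mixture_posteriors(X, mus, sigma, alphas):
    """p(l | x_i, Theta'): posterior component probabilities under the guessed parameters."""
    d = X.shape[1]
    log_joint = np.stack([np.log(a) - 0.5 * d * np.log(2 * np.pi * sigma**2)
                          - np.sum((X - m) ** 2, axis=1) / (2 * sigma**2)
                          for m, a in zip(mus, alphas)], axis=1)
    log_joint -= log_joint.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(log_joint)
    return p / p.sum(axis=1, keepdims=True)

def q_value(X, mus, sigma, alphas, mus_guess, sigma_guess, alphas_guess):
    """Q(Theta, Theta') = sum_l sum_i [log(alpha_l) + log p_l(x_i | theta_l)] p(l | x_i, Theta')."""
    w = mixture_posteriors(X, mus_guess, sigma_guess, alphas_guess)
    d = X.shape[1]
    log_dens = np.stack([-0.5 * d * np.log(2 * np.pi * sigma**2)
                         - np.sum((X - m) ** 2, axis=1) / (2 * sigma**2) for m in mus], axis=1)
    return np.sum(w * (np.log(alphas)[None, :] + log_dens))

def svd_projector(X, k):
    """Top-k right singular vectors of the data matrix; X @ Vk.T gives pi_SVD coordinates."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:k]

# Usage idea (k = 2, as in this section): project the data and the guessed means with the
# same Vk, then compare q_value in the original space against q_value in the subspace.
```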

3.3 Concentration Inequalities

Let $x \sim D_1$. Then

$$\sigma^2 d - \sigma^2\sqrt{d\log m} \;\le\; E\big(\|x-\mu_1\|^2\big) \;\le\; \sigma^2 d + \sigma^2\sqrt{d\log m}$$

and

$$\bar{\mu}^2 + \sigma^2 d - \sigma^2\sqrt{d\log m} - 2\bar{\mu}\sigma(d\log m)^{1/4} \;\le\; E\big(\|x-\mu_2\|^2\big) \;\le\; \bar{\mu}^2 + \sigma^2 d + \sigma^2\sqrt{d\log m} + 2\bar{\mu}\sigma(d\log m)^{1/4},$$

where $\bar{\mu} = \|\mu_1 - \mu_2\|_2$. By the assumption of the previous section, we take the same bounds to hold with the guessed means $\mu'_i$ in place of $\mu_i$. Bounds similar to the last two hold for the SVD projections of these quantities, with $d$ replaced by $k$. Since the means of spherical Gaussians are contained in the SVD subspace, they do not move upon projection [2]. This fact is used in the proofs of the next section.

3.4 Q Function Verification for the 2-Dimensional Case

Let us split the aforementioned Q function into two components,

$$Q(\Theta, \Theta') = R(\Theta') + S(\Theta, \Theta'),$$

where

$$R(\Theta') = \sum_{l=1}^{M}\sum_{i=1}^{N} \log(\alpha_l)\, p(l \mid x_i, \Theta') \quad\text{and}\quad S(\Theta, \Theta') = \sum_{l=1}^{M}\sum_{i=1}^{N} \log\big(p_l(x_i \mid \theta_l)\big)\, p(l \mid x_i, \Theta').$$

We now prove two theorems.

Theorem 1. Let $\hat{p}(l \mid x_i, \Theta')$ be the probability of a component $l \in \{1, 2\}$ after projection of a data set generated by a mixture of two Gaussians meeting the assumptions stated above. If $x_i \sim D_l$, then $\hat{p}(l \mid x_i, \Theta') \ge p(l \mid x_i, \Theta')$, where the latter is the probability in the original space.

Proof. For the purposes of this proof, let $l = 1$; a similar result can then be proved for $l = 2$. We can write the two probabilities as

$$\hat{p}(1 \mid x_i, \Theta') = \frac{e^{-\|\pi_{SVD}(x-\mu'_1)\|^2/2\sigma^2}}{e^{-\|\pi_{SVD}(x-\mu'_1)\|^2/2\sigma^2} + e^{-\|\pi_{SVD}(x-\mu'_2)\|^2/2\sigma^2}}, \qquad p(1 \mid x_i, \Theta') = \frac{e^{-\|x-\mu'_1\|^2/2\sigma^2}}{e^{-\|x-\mu'_1\|^2/2\sigma^2} + e^{-\|x-\mu'_2\|^2/2\sigma^2}}.$$

Assuming the concentration lemmas hold, with high probability $\|\pi_{SVD}(x-\mu'_2)\|^2$ is dominated by $\bar{\mu}$ to a much greater extent than its counterpart in the original space, because all the other components shrink with the dimension. Therefore, given a large enough $\bar{\mu}$, the theorem follows.

Theorem 2. Assume a data set is generated by a mixture of two Gaussians meeting the assumptions stated above. Then projecting the data onto the SVD subspace spanned by the first two principal components guarantees a higher S value in the first step of the EM algorithm.

Proof. Let $k = 2$. This allows us to express the probabilities in terms of each other, i.e. $\alpha_2 = 1 - \alpha_1 = 1 - \alpha$. (The fact that we cannot do this for larger $k$ poses an issue for extending the argument.) Note further that this allows us to state that $\hat{\alpha} = \alpha$, where $\hat{\alpha}$ represents the corresponding quantity in the projected space. We wish to show that

$$\hat{S}(\hat{\theta}, \hat{\theta}^0) \ge S(\theta, \theta^0),$$

where $\theta^0$ refers to the parameters' initial setting, before the first step of the EM algorithm. We prove the inequality for each point; since $S$ is a sum over the whole set of sample points, this proves the inequality in its entirety. For a single point, the claim becomes:
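As a rough empirical companion to Theorem 1 (our own sketch, not part of the paper's analysis), the following compares the posterior weight of the generating component before and after projection onto the top-2 SVD subspace, using perturbed mean guesses. The constants (dimension, separation, perturbation size) are arbitrary illustrative choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
d, sigma, n = 100, 1.0, 2000
mu1 = np.zeros(d)
mu2 = np.zeros(d); mu2[0] = 3.0                 # separation ||mu1 - mu2|| = 3

# Equal mixture; remember which component generated each point.
labels = rng.integers(0, 2, size=n)
X = np.where(labels[:, None] == 0, mu1, mu2) + sigma * rng.standard_normal((n, d))

def posterior_of_true(X, mus, labels, sigma):
    """Posterior weight of the generating component, assuming equal mixing weights."""
    sq = np.stack([np.sum((X - m) ** 2, axis=1) for m in mus], axis=1) / (2 * sigma**2)
    sq -= sq.min(axis=1, keepdims=True)         # numerical stability
    p = np.exp(-sq)
    p /= p.sum(axis=1, keepdims=True)
    return p[np.arange(len(labels)), labels]

# Guessed means mu'_l: true means plus a small perturbation, per the concentration assumption.
mus_guess = np.stack([mu1, mu2]) + 0.3 * rng.standard_normal((2, d))

# Spectral projection onto the top-2 SVD subspace of the data.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
Vk = Vt[:2]

p_orig = posterior_of_true(X, mus_guess, labels, sigma)
p_proj = posterior_of_true(X @ Vk.T, mus_guess @ Vk.T, labels, sigma)
print("mean posterior of generating component, original space :", p_orig.mean())
print("mean posterior of generating component, projected space:", p_proj.mean())
```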

$$\hat{\alpha}\big(-\|\pi_{SVD}(x-\mu'_1)\|^2\big) + (1-\hat{\alpha})\big(-\|\pi_{SVD}(x-\mu'_2)\|^2\big) \;\ge\; \alpha\big(-\|x-\mu'_1\|^2\big) + (1-\alpha)\big(-\|x-\mu'_2\|^2\big).$$

Substituting the concentration bounds (with $k = 2$ for the projected space and $d$ for the original space), the comparison becomes

$$\alpha\big(2\sigma^2 - \sigma^2\sqrt{2\log m}\big) + (1-\alpha)\big(\bar{\mu}^2 + 2\sigma^2 - \sigma^2\sqrt{2\log m} - 2\bar{\mu}\sigma(2\log m)^{1/4}\big) \;\le\; \alpha\big(\sigma^2 d + \sigma^2\sqrt{d\log m}\big) + (1-\alpha)\big(\bar{\mu}^2 + \sigma^2 d + \sigma^2\sqrt{d\log m} + 2\bar{\mu}\sigma(d\log m)^{1/4}\big).$$

Expanding and cancelling the common $\bar{\mu}^2$ terms gives

$$2\sigma^2 - \sigma^2\sqrt{2\log m} - 2\bar{\mu}\sigma(2\log m)^{1/4} + 2\bar{\mu}\sigma(2\log m)^{1/4}\alpha \;\le\; \sigma^2 d + \sigma^2\sqrt{d\log m} + 2\bar{\mu}\sigma(d\log m)^{1/4} - 2\bar{\mu}\sigma(d\log m)^{1/4}\alpha,$$

and rearranging,

$$2\bar{\mu}\sigma(\log m)^{1/4}\big(2^{1/4} + d^{1/4}\big)\alpha \;\le\; (d-2)\sigma^2 + \big(\sqrt{2}+\sqrt{d}\big)\sqrt{\log m}\,\sigma^2 + 2\bar{\mu}\sigma(\log m)^{1/4}\big(2^{1/4} + d^{1/4}\big).$$

Since $0 < \alpha < 1$, it is easy to see that the last inequality holds.

We would now like to prove a similar inequality for the R function; together with Theorem 2, this would show that the Q value is higher after spectral projection than before it. Unfortunately, we run into a problem here: the $\alpha_l$'s, i.e. $\alpha_l = \frac{1}{N}\sum_{i=1}^{N} p(l \mid x_i, \Theta')$, are enclosed within the log function, which makes their behavior more difficult to quantify. Intuitively, however, R should increase after projection, because all the $\log(\alpha_l)$ term does is assign a weight to the probabilities, which we have proved increase after projection. Failing that, one could try to bound the whole Q function directly; as can be seen in the proof of Theorem 2, there is a lot of slack in the inequality. For the moment, we leave this as a conjecture.

Conjecture 1. The value of $R(\Theta')$ increases after spectral projection.

4 The Behavior of the EM Algorithm Under Separability Conditions

In this section, we present a few graphs that outline how the convergence of the EM algorithm to the global optimum relates to the separability of two univariate component Gaussians, and we pose several questions about interpreting them. In the graphs below, the variance of each component is set to 1. This is without loss of generality, since the graphs simply scale with the percentage of overlap, which is determined by the variances. We refrain from giving the two components different variances, since even this simplest model is not yet completely understood. Two simple observations:

1. As we would expect of any clustering algorithm, the performance of the EM algorithm degrades as the overlap increases.

2. When the performance is good, i.e., when EM converges to the global optimum with high probability over random initializations, the bad initializations appear to be the ones in which both means are initialized to the same value.

Based on these observations, one can conjecture the following.

Conjecture 2. For an equal mixture of two univariate Gaussians, the EM algorithm identifies the true means with very high probability from random initial guesses only if the overlap between the two Gaussians is less than 7 percent, i.e., the distance between the means is at least $3\sigma$, where $\sigma$ is the common standard deviation.

We also noted some behavior that was not expected. The graph with 7% overlap shows some initialization points where the performance of EM is bad. Using a finer grid, we found that there are in fact quite a few such bad initialization points, although the region of convergence remains large. The number of bad initialization points is expected to increase in higher dimensions. However, we were unable to find obvious explanations for the existence of such points; it would be interesting to see further analysis on this question.
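To make the experiment behind Figures 1-4 concrete, here is a minimal sketch (our reconstruction, not the authors' code) that runs EM for an equal mixture of two unit-variance univariate Gaussians from a grid of initial mean guesses and records which initializations recover the true means. The grid range, convergence tolerance, and sample size are arbitrary choices for illustration.

```python
import numpy as np

def em_1d(x, m1, m2, n_iter=100):
    """EM for an equal-weight mixture of N(m1, 1) and N(m2, 1); only the means are updated."""
    for _ in range(n_iter):
        d1 = np.exp(-0.5 * (x - m1) ** 2)       # unnormalized component densities
        d2 = np.exp(-0.5 * (x - m2) ** 2)
        r1 = d1 / (d1 + d2)                     # responsibility of component 1
        m1 = np.sum(r1 * x) / np.sum(r1)        # M-step: responsibility-weighted means
        m2 = np.sum((1 - r1) * x) / np.sum(1 - r1)
    return m1, m2

rng = np.random.default_rng(2)
true1, true2 = -1.5, 1.5                        # the "7% overlap" setting of Figure 1(b)
x = np.concatenate([rng.normal(true1, 1.0, 1000), rng.normal(true2, 1.0, 1000)])

grid = np.linspace(-5.0, 5.0, 17)               # grid of initial guesses (m1^0, m2^0)
converged = np.zeros((len(grid), len(grid)), dtype=bool)
for i, a in enumerate(grid):
    for j, b in enumerate(grid):
        m1, m2 = em_1d(x, a, b)
        lo, hi = sorted((m1, m2))
        converged[i, j] = abs(lo - true1) < 0.25 and abs(hi - true2) < 0.25

# The red regions in the figures correspond to the entries of `converged` that are True;
# the diagonal (m1^0 == m2^0) never converges, since equal initial means stay equal under
# these symmetric updates, matching observation 2 above.
print("fraction of grid initializations that recover the true means:", converged.mean())
```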

References

[1] J. Bilmes. A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models. Technical Report ICSI-TR-97-021, International Computer Science Institute, University of California, Berkeley, 1997.

[2] R. Kannan and S. Vempala. Spectral Algorithms. Unpublished notes. http://www.cc.gatech.edu/~vempala/spectral/course.html

Figure 1: Empirical results for convergence, low overlap. (a) 2% overlap: true means mu1 = -2, mu2 = 2. (b) 7% overlap: true means mu1 = -1.5, mu2 = 1.5. Unit variance, equal mixing weights; the region of convergence is shown in red.

Figure 2: Empirical results for convergence, medium overlap. (a) 7.5% overlap: true means mu1 = -1.45, mu2 = 1.45. (b) 8% overlap: true means mu1 = -1.4, mu2 = 1.4. Unit variance, equal mixing weights; the region of convergence is shown in red.

Figure 3: Empirical results for convergence, high overlap. (a) 9% overlap: true means mu1 = -1.35, mu2 = 1.35. (b) 10% overlap: true means mu1 = -1.3, mu2 = 1.3. Unit variance, equal mixing weights; the region of convergence is shown in red.

Figure 4: Empirical results for convergence, higher overlap. (a) 11% overlap: true means mu1 = -1.25, mu2 = 1.25. (b) 15% overlap: true means mu1 = -1, mu2 = 1. Unit variance, equal mixing weights; the region of convergence is shown in red.