The Expectation-Maximization Algorithm

Francisco S. Melo

In these notes, we provide a brief overview of the formal aspects concerning k-means, EM and their relation. We closely follow the presentation in [1, 2] and we refer to these works for further details.

Throughout these notes, random variables are represented with upper-case letters, such as $X$ or $Z$. A sample of a random variable is represented by the corresponding lower-case letter, such as $x$ or $z$. When random variables are vector-valued, we use subscripts to indicate specific components, as in $X_k$ or $Z_k$. The corresponding samples are represented using bold-face letters, such as $\mathbf{x}$ or $\mathbf{z}$, and individual components as $x_k$ or $z_k$, respectively. When considering an indexed family of vector-valued data-points, we use indexed bold-face symbols to denote the elements of the family, as in $\mathbf{x}_n$ or $\mathbf{z}_n$.

1 k-Means

The k-means algorithm is a clustering algorithm in which we are given a dataset $D = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ consisting of $N$ observations of some random variable $X$ taking values in $\mathbb{R}^p$, and we want to partition this dataset into a given number $K$ of clusters in a meaningful way. Recall that a cluster is a set of elements such that the similarity between points within the cluster is much greater than the similarity between points in different clusters. For our purposes, we can think of the points in cluster $k$ as perturbed versions of some prototype point $\mu_k$, and our ultimate goal is to determine the $\mu_k$, for $k = 1, \ldots, K$.

To determine each prototype point $\mu_k$, we use the elements of $D$ that belong to cluster $k$, which in turn implies that we also need to compute the cluster assignment of each element $\mathbf{x}_n \in D$. This goal can be formalized as that of minimizing the distortion measure

$$J = \sum_{n=1}^{N} \sum_{k=1}^{K} C_{nk} \, \|\mathbf{x}_n - \mu_k\|^2, \qquad (1)$$

where each $C_{nk}$ is a binary-valued parameter such that $C_{nk} = 1$ if $\mathbf{x}_n$ belongs to cluster $k$. The value of $\sum_{n=1}^{N} C_{nk} \|\mathbf{x}_n - \mu_k\|^2$ represents the dispersion of cluster $k$, and $J$ thus denotes the total dispersion over all clusters.
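As a concrete aside (not part of the original notes), the distortion measure (1) is straightforward to evaluate numerically. The sketch below is a minimal illustration assuming numpy; the names X, C and mu are hypothetical and stand for the data matrix, the binary assignment matrix and the prototype matrix, respectively.

import numpy as np

def distortion(X, C, mu):
    # J from (1): X is (N, p) data, C is a binary (N, K) assignment matrix
    # with exactly one 1 per row, mu is (K, p) with one prototype per row.
    sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K) squared distances
    return float((C * sq_dists).sum())  # only the assigned cluster of each point contributes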

We want to determine values for the $\{C_{nk}\}$ and $\{\mu_k\}$ that minimize $J$.

Let us consider some initial assignment of values to the $\mu_k$ and proceed to compute the $C_{nk}$ that minimize $J$. Since $J$ depends linearly on each of the parameters $C_{nk}$, the trivial solution that minimizes $J$ is attained by setting all $C_{nk}$ to zero, in which case $J \equiv 0$. However, for each point $\mathbf{x}_n$ exactly one $C_{nk}$ must be non-zero (each data point must belong to exactly one cluster), so the minimum of $J$ is attained precisely when

$$C_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_i \|\mathbf{x}_n - \mu_i\|^2, \\ 0 & \text{otherwise.} \end{cases}$$

This optimization corresponds to the k-means step in which every point is assigned to the cluster with the nearest prototype vector.

Now let us consider some assignment of the points in $D$ to the clusters $1, \ldots, K$, and let us determine the prototype vectors $\mu_k$ that minimize $J$. Since $J$ is a quadratic function of $\mu_k$, we get

$$\nabla_{\mu_k} J = -2 \sum_{n=1}^{N} C_{nk} (\mathbf{x}_n - \mu_k).$$

Equating the above expression to zero and solving for $\mu_k$ finally yields

$$\mu_k = \frac{\sum_{n=1}^{N} C_{nk}\, \mathbf{x}_n}{\sum_{n=1}^{N} C_{nk}}.$$

This optimization corresponds to the k-means step in which, for every cluster $k = 1, \ldots, K$, the corresponding prototype point $\mu_k$ is recomputed as the mean of the points currently assigned to that cluster.

Note that each of these optimizations can only decrease the value of $J$. Since $J \geq 0$ and there is only a finite number of possible cluster assignments for the points in $D$, the k-means algorithm is guaranteed to converge. Optimality, however, depends on the initialization of the prototype vectors $\mu_1, \ldots, \mu_K$, and k-means may converge to a locally optimal clustering.
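The two alternating optimizations above are the whole of k-means. The following sketch is only an illustration of this section (it is not taken from [1] or [2]); it assumes numpy, runs a fixed number of iterations instead of testing for convergence, and keeps a prototype unchanged if its cluster becomes empty.

import numpy as np

def k_means(X, K, n_iters=100, rng=None):
    # X is (N, p); returns prototypes mu (K, p) and hard assignments (N,)
    rng = np.random.default_rng(rng)
    N = X.shape[0]
    mu = X[rng.choice(N, size=K, replace=False)].astype(float)  # K distinct data points as initial prototypes
    assign = np.zeros(N, dtype=int)
    for _ in range(n_iters):
        # Assignment step: give every point to the cluster of its nearest prototype
        sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K)
        assign = sq_dists.argmin(axis=1)
        # Update step: each prototype becomes the mean of the points assigned to it
        for k in range(K):
            members = X[assign == k]
            if len(members) > 0:  # keep the old prototype if its cluster is empty
                mu[k] = members.mean(axis=0)
    return mu, assign

Because each iteration performs exactly the two minimizations derived above, the distortion $J$ can never increase from one iteration to the next.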

2 Gaussian Mixtures

Let us now consider the process by which the points in $D$ could have been generated. Again adopting the interpretation that the points in a cluster are perturbations of the corresponding prototype vector $\mu_k$, we can consider that each data-point $\mathbf{x}_n$ in cluster $k$ is sampled according to some distribution centered at the corresponding cluster centroid $\mu_k$. This means that each point in the dataset is, in fact, generated from one of $K$ possible distributions, one for each cluster.

Formally, this can be modeled by considering a random variable $Z$ taking values in $\{0, 1\}^K$ and such that each $\mathbf{z}$ verifies $\sum_{k=1}^{K} z_k = 1$. A data-point $(\mathbf{z}, \mathbf{x})$, where $z_k = 1$ for some $k$, means that $\mathbf{x}$ was generated according to the distribution associated with cluster $k$. We now analyze the joint behavior of $Z$ and $X$, for which we define the joint distribution $P[X, Z]$ in terms of the marginal distribution $P[Z]$ and the conditional distribution $P[X \mid Z]$, as represented in the Bayesian network of Fig. 1.

Figure 1: Bayesian network representation of a mixture model (a node $z$ with an arrow into a node $x$), which explicitly factors the joint distribution $P[X, Z]$ as $P[X, Z] = P[X \mid Z]\, P[Z]$.

Starting with $P[Z]$: since each component $Z_k$ of $Z$ is a binary variable, we denote by $p_k$ the probability that the $k$-th component of $Z$ is non-zero, i.e., $p_k = P[Z_k = 1]$. Since we enforce $\sum_{k=1}^{K} z_k = 1$, we must also have $\sum_{k=1}^{K} p_k = 1$. Note also that we can write

$$P[Z = \mathbf{z}] = \prod_{k=1}^{K} p_k^{z_k},$$

since each component $z_k$ of $\mathbf{z}$ is either 0 or 1.

As for the conditional distribution of $X$ given $Z$, we will consider that $X$ follows a Gaussian distribution whose mean and covariance matrix depend on $Z$. In particular,

$$P[X = \mathbf{x} \mid Z = \mathbf{z}] = P[X = \mathbf{x} \mid Z_k = 1] = \mathcal{N}(\mathbf{x} \mid \mu_k, \Sigma_k),$$

where $\mathbf{z}$ is such that $z_k = 1$. The marginal distribution of $X$ is thus given by

$$P[X = \mathbf{x}] = \sum_{k=1}^{K} P[X = \mathbf{x} \mid Z_k = 1]\, P[Z_k = 1] = \sum_{k=1}^{K} p_k\, \mathcal{N}(\mathbf{x} \mid \mu_k, \Sigma_k), \qquad (2)$$

i.e., the distribution of $X$ is a linear superposition of Gaussian distributions, also known as a Gaussian mixture.

We can now interpret the generation of the points in $D$ as follows. For $n = 1, \ldots, N$, we sample $Z$ according to $P[Z]$, obtaining a vector $\mathbf{z}_n$. We then sample $X$ conditioned on $\mathbf{z}_n$, according to $P[X \mid Z = \mathbf{z}_n]$, obtaining a new data-point $\mathbf{x}_n$. The dataset $D$ is then composed of the sampled data-points $\mathbf{x}_1, \ldots, \mathbf{x}_N$.

To conclude this section, we note that the $\{\mathbf{z}_n\}$ remain unknown. We can, however, compute a distribution over any one $\mathbf{z}_n$ given the corresponding $\mathbf{x}_n$.
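To make the generative story concrete, the following sketch (an illustration, not from the original notes) samples a dataset from the mixture in (2) given hypothetical arrays p, mu and Sigma, using numpy.

import numpy as np

def sample_gmm(N, p, mu, Sigma, rng=None):
    # Sample N points from the mixture in (2).
    # p: (K,) mixing probabilities, mu: (K, d) means, Sigma: (K, d, d) covariances.
    rng = np.random.default_rng(rng)
    K, d = mu.shape
    # Sample the latent components z_n ~ P[Z] (stored as indices rather than one-hot vectors)
    z = rng.choice(K, size=N, p=p)
    # Sample x_n ~ N(mu_k, Sigma_k) conditioned on the chosen component
    X = np.stack([rng.multivariate_normal(mu[k], Sigma[k]) for k in z])
    return X, z

Discarding the sampled component indices z leaves exactly the kind of incomplete dataset that the EM algorithm of the next section is designed to fit.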

In fact, we can use Bayes' theorem and write, in general,

$$P[Z_k = 1 \mid X = \mathbf{x}] = \frac{P[X = \mathbf{x} \mid Z_k = 1]\, P[Z_k = 1]}{P[X = \mathbf{x}]} = \frac{P[X = \mathbf{x} \mid Z_k = 1]\, P[Z_k = 1]}{\sum_{j=1}^{K} P[X = \mathbf{x} \mid Z_j = 1]\, P[Z_j = 1]} = \frac{p_k\, \mathcal{N}(\mathbf{x} \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} p_j\, \mathcal{N}(\mathbf{x} \mid \mu_j, \Sigma_j)}.$$

For consistency of notation, we henceforth write $C_{nk}$ to denote the probability

$$C_{nk} = P[Z_k = 1 \mid X = \mathbf{x}_n] = \frac{p_k\, \mathcal{N}(\mathbf{x}_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} p_j\, \mathcal{N}(\mathbf{x}_n \mid \mu_j, \Sigma_j)}, \qquad (3)$$

where $\mathbf{x}_n$ is the $n$-th data-point in our dataset $D$.

3 The Expectation-Maximization Algorithm

Given the previous interpretation of the process by which the dataset $D$ is generated, we can now re-formulate the problem of partitioning the data-points in $D$ into a given number $K$ of clusters as that of finding the parameters $p_k$, $\mu_k$, and $\Sigma_k$ that maximize the likelihood of the data $D$. Recall that the likelihood of $D$ (given the above parameters) is given by

$$\ell(D) = P[D \mid p, \mu, \Sigma] = \prod_{n=1}^{N} P[\mathbf{x}_n \mid p, \mu, \Sigma] = \prod_{n=1}^{N} \sum_{k=1}^{K} P[\mathbf{x}_n \mid z_k = 1, p, \mu, \Sigma]\, P[z_k = 1 \mid p, \mu, \Sigma],$$

which can be simplified to

$$\ell(D) = \prod_{n=1}^{N} \sum_{k=1}^{K} P[\mathbf{x}_n \mid \mu_k, \Sigma_k]\, P[z_k = 1 \mid p_k] = \prod_{n=1}^{N} \sum_{k=1}^{K} \mathcal{N}(\mathbf{x}_n \mid \mu_k, \Sigma_k)\, p_k. \qquad (4)$$

Instead of maximizing the likelihood in (4), we can equivalently maximize the log-likelihood of $D$,

$$\ln \ell(D) = \sum_{n=1}^{N} \ln\!\left[\sum_{k=1}^{K} \mathcal{N}(\mathbf{x}_n \mid \mu_k, \Sigma_k)\, p_k\right], \qquad (5)$$

since maximizing (5) is significantly simpler than maximizing (4).
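As an illustration (not part of the original notes), the responsibilities (3) and the log-likelihood (5) can be computed jointly, since both are built from the weighted densities $p_k\, \mathcal{N}(\mathbf{x}_n \mid \mu_k, \Sigma_k)$. The sketch assumes numpy and scipy.stats, with hypothetical arrays X, p, mu and Sigma as in the earlier sampling example.

import numpy as np
from scipy.stats import multivariate_normal

def responsibilities_and_loglik(X, p, mu, Sigma):
    # E-step quantities: C[n, k] from (3) and ln l(D) from (5).
    # X: (N, d) data, p: (K,) mixing probabilities, mu: (K, d), Sigma: (K, d, d).
    K = len(p)
    # weighted[n, k] = p_k * N(x_n | mu_k, Sigma_k)
    weighted = np.stack(
        [p[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k]) for k in range(K)],
        axis=1,
    )
    totals = weighted.sum(axis=1, keepdims=True)  # mixture density (2) evaluated at each x_n
    C = weighted / totals                          # responsibilities (3)
    loglik = np.log(totals).sum()                  # log-likelihood (5)
    return C, loglik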

We start by differentiating $\ln \ell(D)$ with respect to $\mu_k$ and setting the derivative to zero, to get

$$\sum_{n=1}^{N} \frac{p_k\, \mathcal{N}(\mathbf{x}_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} p_j\, \mathcal{N}(\mathbf{x}_n \mid \mu_j, \Sigma_j)}\, \Sigma_k^{-1} (\mathbf{x}_n - \mu_k) = 0, \qquad (6)$$

where we have used the fact that

$$\nabla_{\mu_k} \mathcal{N}(\mathbf{x}_n \mid \mu_k, \Sigma_k) = \nabla_{\mu_k} \frac{1}{(2\pi)^{p/2} \det(\Sigma_k)^{1/2}} \exp\!\left[-\tfrac{1}{2} (\mathbf{x}_n - \mu_k)^{\top} \Sigma_k^{-1} (\mathbf{x}_n - \mu_k)\right] = \mathcal{N}(\mathbf{x}_n \mid \mu_k, \Sigma_k)\, \Sigma_k^{-1} (\mathbf{x}_n - \mu_k).$$

Putting (3) and (6) together, we get

$$\sum_{n=1}^{N} C_{nk}\, \Sigma_k^{-1} (\mathbf{x}_n - \mu_k) = 0,$$

and, isolating $\mu_k$, we finally get

$$\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} C_{nk}\, \mathbf{x}_n, \qquad (7)$$

with $N_k = \sum_{n=1}^{N} C_{nk}$. Following a similar approach, but differentiating $\ln \ell(D)$ with respect to $\Sigma_k$ and equating to zero, we get

$$\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} C_{nk}\, (\mathbf{x}_n - \mu_k)(\mathbf{x}_n - \mu_k)^{\top}. \qquad (8)$$

Finally, the maximization of the likelihood with respect to the $p_k$ must respect the constraint $\sum_{k=1}^{K} p_k = 1$. We can resort to a Lagrange multiplier formulation and maximize the Lagrangian

$$L(p, \lambda) = \ln P[D \mid p, \mu, \Sigma] + \lambda \left[\sum_{k=1}^{K} p_k - 1\right]$$

with respect to $p_k$. Differentiating with respect to $p_k$ and setting the result to zero, we get

$$\sum_{n=1}^{N} \frac{\mathcal{N}(\mathbf{x}_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} p_j\, \mathcal{N}(\mathbf{x}_n \mid \mu_j, \Sigma_j)} + \lambda = 0$$

or, equivalently,

$$\sum_{n=1}^{N} C_{nk} + p_k \lambda = 0.$$

Summing over all $k$, it follows that $\lambda = -N$ and

$$p_k = \frac{N_k}{N}. \qquad (9)$$

The above results cannot be used directly to maximize $\ln \ell(D)$, since the values $\{C_{nk}\}$ depend on $p_k$, $\mu_k$ and $\Sigma_k$ through (3) and the latter, in turn, depend on the $C_{nk}$. However, it is possible to estimate all these quantities in an iterative process similar to the one used in k-means. Let us consider some initial assignment of values to $p_k$, $\mu_k$, and $\Sigma_k$ and compute the corresponding $\{C_{nk}\}$. This is done in the expectation step (E-step), where each $C_{nk}$ is computed through (3). Then, given the $\{C_{nk}\}$, we proceed with the maximization step (M-step), where $p_k$, $\mu_k$, and $\Sigma_k$ are recomputed according to (7), (8), and (9). The iterative application of the E- and M-steps yields the celebrated expectation-maximization (EM) algorithm.
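Combining the E-step (3) with the M-step updates (7)-(9) gives the following self-contained sketch. It is an illustration of this section rather than a reference implementation: the initialization scheme and the small diagonal term added to each covariance for numerical stability are assumptions of mine, not part of the derivation above.

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, rng=None):
    # EM for a Gaussian mixture: E-step (3), M-step (7)-(9).
    # X: (N, d) data; returns mixing probabilities p, means mu and covariances Sigma.
    rng = np.random.default_rng(rng)
    N, d = X.shape
    # Assumed initialization: uniform weights, random data points as means, shared covariance
    p = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)].astype(float)
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * K)
    for _ in range(n_iters):
        # E-step: responsibilities C[n, k] from (3)
        weighted = np.stack(
            [p[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k]) for k in range(K)],
            axis=1,
        )
        C = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: (7), (8) and (9), with N_k = sum_n C[n, k]
        Nk = C.sum(axis=0)
        p = Nk / N
        mu = (C.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (C[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return p, mu, Sigma

In practice one would monitor the log-likelihood (5) and stop once it no longer increases, rather than running a fixed number of iterations.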

4 EM and k-Means

It is possible to derive k-means as a limit case of the EM algorithm. Let us consider that $\Sigma_k$ is initially set to $\Sigma_k = \epsilon I$ for $k = 1, \ldots, K$, where $\epsilon$ is some positive constant and $I$ is the identity matrix. We now apply the EM algorithm but treat $\epsilon$ as a fixed constant, i.e., we do not recompute $\Sigma_k$ at each iteration. This means that

$$\mathcal{N}(\mathbf{x}_n \mid \mu_k, \Sigma_k) = \frac{1}{(2\pi\epsilon)^{p/2}} \exp\!\left[-\frac{1}{2\epsilon} \|\mathbf{x}_n - \mu_k\|^2\right]$$

and

$$C_{nk} = \frac{p_k \exp\!\left[-\|\mathbf{x}_n - \mu_k\|^2 / 2\epsilon\right]}{\sum_{j=1}^{K} p_j \exp\!\left[-\|\mathbf{x}_n - \mu_j\|^2 / 2\epsilon\right]}.$$

Taking the limit as $\epsilon \to 0$, the dominating term is the one for which $\|\mathbf{x}_n - \mu_j\|^2$ is smallest, yielding the already familiar result:

$$C_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_j \|\mathbf{x}_n - \mu_j\|^2, \\ 0 & \text{otherwise.} \end{cases}$$

References

[1] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer Science, 2006.

[2] Tom M. Mitchell. Machine Learning. McGraw-Hill, 1997.