The Expectation-Maximization Algorithm

Francisco S. Melo

In these notes, we provide a brief overview of the formal aspects concerning k-means, EM and their relation. We closely follow the presentation in [1, 2] and we refer to these works for further details.

Throughout these notes, random variables are represented with upper-case letters, such as $X$ or $Z$. A sample of a random variable is represented by the corresponding lower-case letter, such as $x$ or $z$. When random variables are vector valued, we use subscripts to indicate specific components, as in $X_k$ or $Z_k$. The corresponding samples are represented using bold-face letters, such as $\mathbf{x}$ or $\mathbf{z}$, and individual components as $x_k$ or $z_k$, respectively. When considering an indexed family of vector-valued data-points, we use indexed bold-face symbols to denote the elements in the family, as in $\mathbf{x}_n$ or $\mathbf{z}_n$.

1 k-Means

The k-means algorithm is a clustering algorithm where we are given a dataset $D = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ consisting of $N$ observations of some random variable $X$ taking values in $\mathbb{R}^p$. We want to partition this data set into a given number $K$ of clusters in a meaningful way. Recall that a cluster is a set of elements such that the similarity between points within the cluster is much greater than the similarity between points in different clusters.

For our purposes, we can think of the points in cluster $k$ as perturbed versions of some prototype point $\mu_k$, and our ultimate goal is to determine the $\mu_k$, for $k = 1, \ldots, K$. To determine each prototype point $\mu_k$, we use the elements in $D$ that belong to cluster $k$, which in turn implies that we also need to compute the cluster assignments for each element $\mathbf{x}_n \in D$. This goal can be formalized as that of minimizing the distortion measure

$$J = \sum_{n=1}^{N} \sum_{k=1}^{K} C_{nk} \, \|\mathbf{x}_n - \mu_k\|^2, \tag{1}$$

where each $C_{nk}$ is a binary-valued parameter such that $C_{nk} = 1$ if $\mathbf{x}_n$ belongs to cluster $k$. The value of $\sum_{n=1}^{N} C_{nk} \, \|\mathbf{x}_n - \mu_k\|^2$ represents the
dispersion of cluster $k$, and $J$ thus denotes the total dispersion in all clusters. We want to determine values for the $\{C_{nk}\}$ and $\{\mu_k\}$ to minimize $J$.

Let us consider some initial assignment of values to the $\mu_k$ and proceed to compute the $C_{nk}$ that minimize $J$. Since $J$ depends linearly on each of the parameters $C_{nk}$, the trivial solution that minimizes $J$ is attained by setting all $C_{nk}$ to zero, in which case $J = 0$. However, for each point $\mathbf{x}_n$ exactly one $C_{nk}$ must be non-zero (each data point must belong to exactly one cluster), so the minimum value for $J$ is attained precisely when

$$C_{nk} = \begin{cases} 1 & \text{if } k = \operatorname{argmin}_i \|\mathbf{x}_n - \mu_i\|^2 \\ 0 & \text{otherwise.} \end{cases}$$

This optimization corresponds to the k-means step in which every point is assigned to the cluster corresponding to the nearest prototype vector.

Now let us consider some assignment of points in $D$ to the clusters $1, \ldots, K$, and let us determine the prototype vectors $\mu_k$ that minimize $J$. Since $J$ is a quadratic function of $\mu_k$, we get

$$\nabla_{\mu_k} J = -2 \sum_{n=1}^{N} C_{nk} (\mathbf{x}_n - \mu_k).$$

Equating the above expression to zero and solving for $\mu_k$ finally yields

$$\mu_k = \frac{\sum_{n=1}^{N} C_{nk} \mathbf{x}_n}{\sum_{n=1}^{N} C_{nk}} = \frac{1}{N_k} \sum_{n=1}^{N} C_{nk} \mathbf{x}_n,$$

where $N_k = \sum_{n=1}^{N} C_{nk}$ is the number of points assigned to cluster $k$. This optimization corresponds to the k-means step in which, for every cluster $k = 1, \ldots, K$, the corresponding prototype point $\mu_k$ is recomputed as the mean of the points in the corresponding cluster.

Note that neither optimization step can increase the value of $J$. This means that, since $J \geq 0$ and there is a finite number of possible cluster assignments for the points in $D$, the k-means algorithm is guaranteed to converge. Optimality, however, depends on the initialization of the prototype vectors $\mu_1, \ldots, \mu_K$, and k-means may converge to a clustering that is only locally optimal.

2 Gaussian Mixtures

Let us now consider the process by which the points in $D$ could be generated.
Again adopting the interpretation that the points in a cluster $k$ are perturbations of the corresponding prototype vector $\mu_k$, we can consider that each data-point $\mathbf{x}_n$ in cluster $k$ is sampled according to some distribution centered in the corresponding cluster centroid, $\mu_k$. This means that each point in the data-set is, in fact, generated from one of $K$ possible distributions, one for each cluster.
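A minimal sketch of this interpretation, tied back to Section 1: points are drawn as perturbations of a few prototypes, and the two alternating k-means steps then recover prototype estimates from the points alone. The function names (`sample_perturbed`, `assign`, `update`, `distortion`, `kmeans`), the Gaussian noise, the toy prototypes, and the rule that an empty cluster keeps its previous prototype are assumptions made for illustration; they are not part of the notes.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def assign(points, centroids):
    """Assignment step: index of the nearest prototype for each point."""
    return [min(range(len(centroids)), key=lambda k: sq_dist(x, centroids[k]))
            for x in points]

def update(points, labels, centroids):
    """Update step: each prototype becomes the mean of its assigned points."""
    new = []
    for k, old in enumerate(centroids):
        members = [x for x, c in zip(points, labels) if c == k]
        if not members:
            # Assumption: an empty cluster simply keeps its old prototype.
            new.append(list(old))
            continue
        new.append([sum(col) / len(members) for col in zip(*members)])
    return new

def distortion(points, labels, centroids):
    """Distortion J of equation (1) for hard (0/1) assignments."""
    return sum(sq_dist(x, centroids[c]) for x, c in zip(points, labels))

def kmeans(points, centroids, max_iter=100):
    """Alternate the two steps until the assignments stop changing."""
    labels, history = None, []
    for _ in range(max_iter):
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
        centroids = update(points, labels, centroids)
        history.append(distortion(points, labels, centroids))
    return centroids, labels, history

def sample_perturbed(prototypes, n_per_cluster, noise, rng):
    """Toy generator: Gaussian perturbations of each prototype."""
    return [[m + rng.gauss(0.0, noise) for m in mu]
            for mu in prototypes for _ in range(n_per_cluster)]

rng = random.Random(0)                       # fixed seed, repeatable run
true_prototypes = [[0.0, 0.0], [5.0, 5.0]]   # hypothetical prototypes
data = sample_perturbed(true_prototypes, 20, 0.5, rng)
centroids, labels, history = kmeans(data, [data[0], data[-1]])
```

Here `history` records $J$ after each update step, so it can be inspected to confirm that the distortion never increases. Initializing the prototypes from two data points is only one possible choice; as noted in Section 1, a different initialization may lead to a different local optimum.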

Figure 1: Bayesian network representation of a mixture model that explicitly factors the joint distribution $P[X, Z]$ as $P[X, Z] = P[X \mid Z] \, P[Z]$. (Diagram: node $\mathbf{z}$ with an arrow to node $\mathbf{x}$.)

Formally, this can be modeled by considering a random variable $Z$ taking values in $\{0, 1\}^K$ and such that each $\mathbf{z}$ verifies $\sum_{k=1}^{K} z_k = 1$. A data-point $(\mathbf{z}, \mathbf{x})$, where $z_k = 1$ for some $k$, means that $\mathbf{x}$ was generated according to the distribution associated with cluster $k$. We now analyze the joint behavior of $Z$ and $X$, for which we define the joint distribution of $Z$ and $X$, $P[X, Z]$, in terms of the marginal distribution $P[Z]$ and the conditional distribution $P[X \mid Z]$, as represented in the Bayesian network of Fig. 1.

Starting with $P[Z]$, since each component $Z_k$ of $Z$ is a binary variable, we denote by $p_k$ the probability that this $k$th component of $Z$ is non-zero, i.e., $p_k = P[Z_k = 1]$. Since we enforce $\sum_{k=1}^{K} z_k = 1$, we must also have that $\sum_{k=1}^{K} p_k = 1$. Note also that we can write

$$P[Z = \mathbf{z}] = \prod_{k=1}^{K} p_k^{z_k},$$

since each component $z_k$ of $\mathbf{z}$ is either 0 or 1.

As for the conditional distribution of $X$ given $Z$, we will consider that $X$ follows a Gaussian distribution whose mean and covariance matrix depend on $Z$. In particular,

$$P[X = \mathbf{x} \mid Z = \mathbf{z}] = P[X = \mathbf{x} \mid Z_k = 1] = \mathcal{N}(\mathbf{x} \mid \mu_k, \Sigma_k),$$

where $\mathbf{z}$ is such that $z_k = 1$. The marginal distribution of $X$ is thus given by

$$P[X = \mathbf{x}] = \sum_{k=1}^{K} P[X = \mathbf{x} \mid Z_k = 1] \, P[Z_k = 1] = \sum_{k=1}^{K} p_k \, \mathcal{N}(\mathbf{x} \mid \mu_k, \Sigma_k), \tag{2}$$

i.e., the distribution of $X$ is a linear superposition of Gaussian distributions, also known as a Gaussian mixture.

We can now interpret the generation of the points in $D$ as follows. For $n = 1, \ldots, N$, we sample $Z$ according to $P[Z]$, obtaining a vector $\mathbf{z}_n$. We then sample $X$ conditioned on $\mathbf{z}_n$, according to $P[X \mid Z = \mathbf{z}_n]$, obtaining a new data-point $\mathbf{x}_n$. The data-set $D$ is then composed of the sampled data-points $\mathbf{x}_1, \ldots, \mathbf{x}_N$.

To conclude this section, we note that the $\{\mathbf{z}_n\}$ remain unknown. We can, however, compute a distribution over any one $\mathbf{z}_n$ given the corresponding $\mathbf{x}_n$. In
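fact, we can use Bayes' theorem and write, in general,

$$\begin{aligned}
P[Z_k = 1 \mid X = \mathbf{x}]
&= \frac{P[X = \mathbf{x} \mid Z_k = 1] \, P[Z_k = 1]}{P[X = \mathbf{x}]} \\
&= \frac{P[X = \mathbf{x} \mid Z_k = 1] \, P[Z_k = 1]}{\sum_{j=1}^{K} P[X = \mathbf{x} \mid Z_j = 1] \, P[Z_j = 1]} \\
&= \frac{p_k \, \mathcal{N}(\mathbf{x} \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} p_j \, \mathcal{N}(\mathbf{x} \mid \mu_j, \Sigma_j)}.
\end{aligned}$$

For consistency of notation, we henceforth write $C_{nk}$ to denote the probability

$$C_{nk} = P[Z_k = 1 \mid X = \mathbf{x}_n] = \frac{p_k \, \mathcal{N}(\mathbf{x}_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} p_j \, \mathcal{N}(\mathbf{x}_n \mid \mu_j, \Sigma_j)}, \tag{3}$$

where $\mathbf{x}_n$ is the $n$th data-point in our data-set $D$.

3 The Expectation-Maximization Algorithm

Given the previous interpretation of the process by which the data-set $D$ is generated, we can now re-formulate the problem of partitioning the data-points in $D$ into a given number $K$ of clusters as that of finding the parameters $p$, $\mu$, and $\Sigma$ that maximize the likelihood of the data, $D$. Recall that the likelihood of $D$ (given the above parameters) is given by

$$\ell(D) \triangleq P[D \mid p, \mu, \Sigma] = \prod_{n=1}^{N} P[\mathbf{x}_n \mid p, \mu, \Sigma] = \prod_{n=1}^{N} \sum_{k=1}^{K} P[\mathbf{x}_n \mid z_k = 1, p, \mu, \Sigma] \, P[z_k = 1 \mid p, \mu, \Sigma],$$

which can be simplified to

$$\ell(D) = \prod_{n=1}^{N} \sum_{k=1}^{K} P[\mathbf{x}_n \mid \mu_k, \Sigma_k] \, P[z_k = 1 \mid p] = \prod_{n=1}^{N} \sum_{k=1}^{K} \mathcal{N}(\mathbf{x}_n \mid \mu_k, \Sigma_k) \, p_k. \tag{4}$$

Instead of maximizing the likelihood in (4), we can equivalently maximize the log-likelihood of $D$, given by

$$\ln \ell(D) = \sum_{n=1}^{N} \ln \left[ \sum_{k=1}^{K} \mathcal{N}(\mathbf{x}_n \mid \mu_k, \Sigma_k) \, p_k \right], \tag{5}$$
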
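since maximizing (5) is significantly simpler than maximizing (4). We start by differentiating $\ln \ell(D)$ with respect to $\mu_k$ and setting the derivatives to zero, to get

$$\sum_{n=1}^{N} \frac{p_k \, \mathcal{N}(\mathbf{x}_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \mathcal{N}(\mathbf{x}_n \mid \mu_j, \Sigma_j) \, p_j} \, \Sigma_k^{-1} (\mathbf{x}_n - \mu_k) = 0, \tag{6}$$

where we have used the fact that

$$\nabla_{\mu_k} \mathcal{N}(\mathbf{x}_n \mid \mu_k, \Sigma_k) = \nabla_{\mu_k} \left[ \frac{1}{(2\pi)^{p/2} \det(\Sigma_k)^{1/2}} \exp\left[ -\frac{1}{2} (\mathbf{x}_n - \mu_k)^\top \Sigma_k^{-1} (\mathbf{x}_n - \mu_k) \right] \right] = \mathcal{N}(\mathbf{x}_n \mid \mu_k, \Sigma_k) \, \Sigma_k^{-1} (\mathbf{x}_n - \mu_k).$$

Putting (3) and (6) together, we get

$$\sum_{n=1}^{N} C_{nk} \, \Sigma_k^{-1} (\mathbf{x}_n - \mu_k) = 0$$

and, isolating $\mu_k$, we finally get

$$\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} C_{nk} \mathbf{x}_n, \tag{7}$$

with $N_k = \sum_{n=1}^{N} C_{nk}$. Following a similar approach, but differentiating $\ln \ell(D)$ with respect to $\Sigma_k$ and equating to zero, we get

$$\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} C_{nk} (\mathbf{x}_n - \mu_k)(\mathbf{x}_n - \mu_k)^\top. \tag{8}$$

Finally, maximizing the likelihood with respect to $p$ must be subject to the constraint that $\sum_{k=1}^{K} p_k = 1$. We can resort to a Lagrange multiplier formulation and maximize the Lagrangian

$$L(p, \lambda) = \ln P[D \mid p, \mu, \Sigma] + \lambda \left[ \sum_{k=1}^{K} p_k - 1 \right]$$

with respect to $p$. Differentiating with respect to $p_k$ and setting the result to 0, we get

$$\sum_{n=1}^{N} \frac{\mathcal{N}(\mathbf{x}_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \mathcal{N}(\mathbf{x}_n \mid \mu_j, \Sigma_j) \, p_j} + \lambda = 0$$

or, equivalently,

$$\sum_{n=1}^{N} C_{nk} + p_k \lambda = 0.$$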
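The passage from the stationarity condition to its equivalent form can be spelled out in two steps. The following is a worked version of that same argument, using only the definition of $C_{nk}$ in (3); it introduces no new symbols.

```latex
\begin{align*}
% Stationarity of the Lagrangian with respect to p_k:
\sum_{n=1}^{N}
  \frac{\mathcal{N}(\mathbf{x}_n \mid \mu_k, \Sigma_k)}
       {\sum_{j=1}^{K} \mathcal{N}(\mathbf{x}_n \mid \mu_j, \Sigma_j)\, p_j}
  + \lambda &= 0 \\
% Multiply both sides by p_k:
\sum_{n=1}^{N}
  \frac{p_k\, \mathcal{N}(\mathbf{x}_n \mid \mu_k, \Sigma_k)}
       {\sum_{j=1}^{K} p_j\, \mathcal{N}(\mathbf{x}_n \mid \mu_j, \Sigma_j)}
  + p_k \lambda &= 0 \\
% Each fraction is exactly C_{nk} as defined in (3):
\sum_{n=1}^{N} C_{nk} + p_k \lambda &= 0
\end{align*}
```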
Summing over all $k$, and using $\sum_{k=1}^{K} C_{nk} = 1$ and $\sum_{k=1}^{K} p_k = 1$, it follows that $\lambda = -N$ and

$$p_k = \frac{N_k}{N}. \tag{9}$$

The above results cannot be used directly to compute the maximum of $\ln \ell(D)$, since the values $\{C_{nk}\}$ depend on $p$, $\mu$ and $\Sigma$ through (3) and the latter, in turn, depend on $C_{nk}$. However, it is possible to estimate all these parameters in an iterative process similar to the one used in k-means.

Let us consider some initial assignment of values to $p$, $\mu$, and $\Sigma$ and proceed to compute the corresponding $\{C_{nk}\}$. This is achieved in the expectation step (E-step), where each $C_{nk}$ is computed through (3). Then, given the $\{C_{nk}\}$, we proceed with the maximization step (M-step), where $p$, $\mu$, and $\Sigma$ are recomputed according to (7), (8), and (9). The iterative application of the E- and M-steps yields the acclaimed expectation-maximization (EM) algorithm.

4 EM and k-Means

It is possible to derive k-means as a limit case of the EM algorithm. Let us consider that $\Sigma_k$ is initially set to $\Sigma_k = \epsilon I$ for $k = 1, \ldots, K$, where $\epsilon$ is some positive constant and $I$ is the identity matrix. We now apply the EM algorithm but treating $\epsilon$ as a fixed constant, i.e., we do not recompute $\Sigma_k$ at each iteration. This means that

$$\mathcal{N}(\mathbf{x}_n \mid \mu_k, \Sigma_k) = \frac{1}{(2\pi\epsilon)^{p/2}} \exp\left[ -\frac{1}{2\epsilon} \|\mathbf{x}_n - \mu_k\|^2 \right]$$

and

$$C_{nk} = \frac{p_k \exp\left[ -\|\mathbf{x}_n - \mu_k\|^2 / 2\epsilon \right]}{\sum_{j=1}^{K} p_j \exp\left[ -\|\mathbf{x}_n - \mu_j\|^2 / 2\epsilon \right]}.$$

Taking the limit as $\epsilon \to 0$, it follows that the dominating term in the denominator is the one where $\|\mathbf{x}_n - \mu_j\|^2$ is smallest, yielding the already familiar result:

$$C_{nk} = \begin{cases} 1 & \text{if } k = \operatorname{argmin}_j \|\mathbf{x}_n - \mu_j\|^2 \\ 0 & \text{otherwise.} \end{cases}$$

References

[1] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer Science, 2006.

[2] Tom M. Mitchell. Machine Learning. McGraw-Hill, 1997.
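
As a closing illustration of the E- and M-steps of Section 3, the following is a minimal sketch under simplifying assumptions: the data are one-dimensional, so each $\Sigma_k$ reduces to a scalar variance; the loop runs for a fixed number of iterations; and there is no safeguard against a component collapsing onto a single point. The function names (`normal_pdf`, `e_step`, `m_step`, `log_likelihood`, `em`), the toy data and the initial values are hypothetical and not part of the notes.

```python
import math

def normal_pdf(x, mu, var):
    """Density of a one-dimensional Gaussian N(x | mu, var)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def e_step(xs, p, mu, var):
    """E-step, equation (3): C[n][k] = P[Z_k = 1 | X = x_n]."""
    C = []
    for x in xs:
        weighted = [p[k] * normal_pdf(x, mu[k], var[k]) for k in range(len(p))]
        total = sum(weighted)
        C.append([w / total for w in weighted])
    return C

def m_step(xs, C):
    """M-step, equations (7)-(9): re-estimate p, mu and var from C."""
    N, K = len(xs), len(C[0])
    Nk = [sum(C[n][k] for n in range(N)) for k in range(K)]
    mu = [sum(C[n][k] * xs[n] for n in range(N)) / Nk[k] for k in range(K)]
    var = [sum(C[n][k] * (xs[n] - mu[k]) ** 2 for n in range(N)) / Nk[k]
           for k in range(K)]
    p = [Nk[k] / N for k in range(K)]
    return p, mu, var

def log_likelihood(xs, p, mu, var):
    """Log-likelihood of equation (5)."""
    return sum(math.log(sum(p[k] * normal_pdf(x, mu[k], var[k])
                            for k in range(len(p))))
               for x in xs)

def em(xs, p, mu, var, n_iter=50):
    """Alternate E- and M-steps, recording the log-likelihood after each pass."""
    history = [log_likelihood(xs, p, mu, var)]
    for _ in range(n_iter):
        C = e_step(xs, p, mu, var)
        p, mu, var = m_step(xs, C)
        history.append(log_likelihood(xs, p, mu, var))
    return p, mu, var, history

# Toy data: two well-separated groups (an assumption for illustration).
xs = [-0.1, 0.0, 0.1, 0.2, 4.9, 5.0, 5.1, 5.2]
p, mu, var, history = em(xs, [0.5, 0.5], [1.0, 4.0], [1.0, 1.0])
```

The recorded `history` makes it easy to check that the log-likelihood does not decrease from one iteration to the next. Replacing the soft `e_step` by a hard nearest-mean assignment, with the variances held fixed, gives the k-means limit described in Section 4.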