EBEM: An Entropy-based EM Algorithm for Gaussian Mixture Models


Antonio Peñalver Benavent, Francisco Escolano Ruiz and Juan M. Sáez Martínez
Robot Vision Group, Alicante University, 03690 Alicante, Spain
a.penalver@umh.es, {sco, jmsaez}@dccia.ua.es

Abstract

In this paper we address the problem of estimating the parameters of a Gaussian mixture model. Although the EM algorithm yields the maximum-likelihood solution, it requires a careful initialization of the parameters, and the optimal number of kernels in the mixture may be unknown beforehand. We propose a criterion based on the entropy of the pdf (probability density function) associated with each kernel to measure the quality of a given mixture model. A novel method for estimating Shannon entropy based on Entropic Spanning Graphs is developed, and a modification of the classical EM algorithm to find the optimal number of kernels in the mixture is presented. We test our algorithm on probability density estimation, pattern recognition and color image segmentation.

1 Introduction

Gaussian mixture models have been widely used for density estimation, pattern recognition and function approximation. One of the most common methods for fitting mixtures to data is the EM algorithm [5]. However, this algorithm is prone to initialization errors and may converge to local maxima of the log-likelihood function. In addition, the algorithm requires that the number of elements (kernels) in the mixture be known beforehand (model selection). The so-called model-selection problem has been addressed in many ways [13][14][7][6]. In this paper we propose a method that, starting with only one kernel, finds the maximum-likelihood solution. In order to do so, it tests whether the underlying pdf of each kernel is Gaussian and, if it is not, replaces that kernel with two kernels adequately separated from each other. In order to detect non-Gaussianity we compare the entropy of the underlying pdf with the theoretical entropy of a Gaussian. After the kernel with the worst degree of Gaussianity has been split in two, new EM steps are performed in order to obtain a new maximum-likelihood solution.

The paper is organized as follows: In Section 2 we review the basic principles of Gaussian mixture models and the usual EM algorithm. In Section 3 we introduce Entropic Spanning Graphs, how they relate to the estimation of Renyi's entropy of order α, and how Shannon entropy can be estimated through them. In Section 4 we present our maximum-entropy based criterion and the modified EM algorithm that exploits it. In Section 5 we show some experimental results. Conclusions and future work are presented in Section 6.

2 Gaussian-Mixture Models

2.1 Definition

A d-dimensional random variable y follows a finite-mixture distribution when its pdf p(y|Θ) can be described by a weighted sum of known pdfs named kernels. When all these kernels are Gaussian, the mixture is named in the same way:

    p(y|\Theta) = \sum_{i=1}^{K} \pi_i \, p(y|\Theta_i)    (1)

where 0 ≤ π_i ≤ 1, i = 1, ..., K, and \sum_{i=1}^{K} \pi_i = 1, K being the number of kernels, π_1, ..., π_K the a priori probabilities of each kernel, and Θ_i the parameters describing kernel i. In Gaussian mixtures, Θ_i = {µ_i, Σ_i}, that is, the mean vector and the covariance matrix. The set of parameters of a given mixture is Θ ≡ {Θ_1, ..., Θ_K, π_1, ..., π_K}. Obtaining the optimal set of parameters Θ* is usually posed in terms of maximizing the log-likelihood of the pdf to be estimated:

    l(Y|\Theta) = \log p(Y|\Theta) = \log \prod_{n=1}^{N} p(y_n|\Theta) = \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k \, p(y_n|\Theta_k)    (2)

with Θ* = arg max_Θ l(Θ) and Y = {y_1, ..., y_N} a set of N i.i.d. samples of the variable Y.
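As an illustration of Eqs. (1)-(2), the following minimal Python sketch (assuming NumPy and SciPy are available; the function name and argument layout are ours, not part of the paper) evaluates the mixture density and its log-likelihood for a given parameter set:

    # Sketch of Eqs. (1)-(2): Gaussian mixture density and log-likelihood.
    import numpy as np
    from scipy.stats import multivariate_normal

    def mixture_log_likelihood(Y, pis, mus, Sigmas):
        # weighted[n, k] = pi_k * p(y_n | Theta_k), the terms of Eq. (1)
        weighted = np.column_stack([
            pi * multivariate_normal.pdf(Y, mean=mu, cov=Sigma)
            for pi, mu, Sigma in zip(pis, mus, Sigmas)
        ])
        # l(Y|Theta) = sum_n log sum_k pi_k p(y_n | Theta_k), Eq. (2)
        return np.log(weighted.sum(axis=1)).sum()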

2.2 EM Algorithm

The EM (Expectation-Maximization) algorithm [5] is an iterative procedure that allows us to find maximum-likelihood solutions to problems in which there are hidden variables. In the case of Gaussian mixtures [11], these variables are a set of N labels Z = {z_1, ..., z_N} associated with the samples. Each label is a binary vector z^{(i)} = [z_1^{(i)}, ..., z_K^{(i)}], with z_m^{(i)} = 1 and z_p^{(i)} = 0 for p ≠ m, indicating that y^{(i)} has been generated by kernel m. The EM algorithm generates a sequence of estimates of the set of parameters {Θ(t), t = 1, 2, ...} by alternating an expectation step and a maximization step until convergence.

The Expectation step consists of estimating the expected value of the hidden variables given the visible data Y and the current estimate of the parameters Θ(t). The probability of generating y_n with kernel k is given by

    p(k|y_n) = \frac{\pi_k \, p(y_n|k)}{\sum_{j=1}^{K} \pi_j \, p(y_n|j)}    (3)

In the Maximization step we re-estimate the new parameters Θ(t+1) given the expected Z:

    \pi_k = \frac{1}{N} \sum_{n=1}^{N} p(k|y_n), \quad
    \mu_k = \frac{\sum_{n=1}^{N} p(k|y_n) \, y_n}{\sum_{n=1}^{N} p(k|y_n)}, \quad
    \Sigma_k = \frac{\sum_{n=1}^{N} p(k|y_n)(y_n - \mu_k)(y_n - \mu_k)^T}{\sum_{n=1}^{N} p(k|y_n)}    (4)

A detailed description of this classic algorithm is given in [11]. Here we focus on the fact that if K is unknown beforehand it cannot be estimated by maximizing the log-likelihood, because l(Θ) grows with K. In a classical EM algorithm with a fixed number of kernels the density can be underestimated, giving a poor description of the data. In the next section we describe the use of entropy to test whether a given kernel describes the underlying data properly.

3 Shannon Entropy Estimation

Several methods for the nonparametric estimation of the differential entropy of a continuous random variable have been widely used (see [1] for a detailed overview). Entropic Spanning Graphs obtained from data to estimate Renyi's α-entropy [9] belong to a class of methods called non plug-in. Renyi's α-entropy of a probability density function f is defined as:

    H_\alpha(f) = \frac{1}{1-\alpha} \ln \int_z f^\alpha(z) \, dz    (5)

for α ∈ (0, 1). The α-entropy converges to the Shannon entropy -\int f(z) \ln f(z) \, dz as α → 1, so it is possible to obtain the latter from the former.

3.1 Renyi's Entropy and Entropic Spanning Graphs

A graph G consists of a set of vertices X_n = {x_1, ..., x_n}, with x_i ∈ R^d, and edges {e} that connect pairs of vertices: e_{ij} = (x_i, x_j). If we denote by M(X_n) the possible sets of edges in the class of acyclic graphs spanning X_n (spanning trees), the total edge length functional of the Euclidean power-weighted Minimal Spanning Tree is:

    L_{MST,\gamma}(X_n) = \min_{M(X_n)} \sum_{e \in M(X_n)} \|e\|^\gamma    (6)

with γ ∈ (0, d) and ||e|| the Euclidean distance between graph vertices. Intuitively, the length of the MST spanning a concentrated, nonuniform set of points increases at a slower rate than that of the MST spanning uniformly distributed points. This fact motivates the application of the MST as a way to test for randomness of a set of points [10]. In [8] it was shown that, in a d-dimensional feature space with d ≥ 2,

    H_\alpha(X_n) = \frac{d}{\gamma} \left[ \ln \frac{L_\gamma(X_n)}{n^\alpha} - \ln \beta_{L_\gamma,d} \right]    (7)

is an asymptotically unbiased, and almost surely consistent, estimator of the α-entropy of f, where α = (d-γ)/d and β_{L_γ,d} is a constant bias correction depending on the graph minimization criterion but independent of f. Closed-form expressions are not available for β_{L_γ,d}; only approximations and bounds are known: (i) Monte Carlo simulation of uniform random samples on the unit cube [0, 1]^d; (ii) the large-d approximation (γ/2) ln(d/(2πe)) given in [2].
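As a rough illustration of Eqs. (6)-(7), the minimal Python sketch below (assuming SciPy; it uses the large-d expression above as an approximation of ln β_{L_γ,d}, which is a simplifying assumption rather than the paper's exact bias correction) computes the power-weighted MST length and the corresponding α-entropy estimate:

    # Sketch of the MST-based alpha-entropy estimator, Eqs. (6)-(7).
    import numpy as np
    from scipy.spatial import distance_matrix
    from scipy.sparse.csgraph import minimum_spanning_tree

    def renyi_entropy_mst(X, gamma=1.0):
        n, d = X.shape
        alpha = (d - gamma) / d
        # L_gamma(X_n): total length of the Euclidean power-weighted MST, Eq. (6)
        W = distance_matrix(X, X) ** gamma
        L = minimum_spanning_tree(W).sum()
        # Assumed bias term: large-d approximation (gamma/2) * ln(d / (2*pi*e))
        ln_beta = (gamma / 2.0) * np.log(d / (2 * np.pi * np.e))
        # Eq. (7)
        return (d / gamma) * (np.log(L / n ** alpha) - ln_beta), alpha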
We can estimate H_α(f) for different values of α = (d-γ)/d by changing the edge-weight exponent γ. As γ modifies the edge weights monotonically, the graph is the same for different values of γ, and only the total length in expression (7) needs to be recomputed.

3.2 Renyi Extrapolation of Shannon Entropy

Entropic spanning graphs are suitable for estimating the α-entropy with α ∈ [0, 1), so Shannon entropy cannot be directly estimated with this method. Figure 1 shows that the shape of the function depends neither on the nature of the data nor on their size.

Figure 1. H_α for Gaussian distributions with different covariance matrices.

Since it is not possible to directly calculate the value of H_α for α = 1, we approximate this value by means of a continuous function that captures the tendency of H_α in the neighborhood of 1. From a value of α ∈ [0, 1), we can calculate the tangent line y = mx + b to H_α at this point, using m = H'_α, x = α and y = H_α. In any case, this line will be continuous and we will be able to calculate its value at x = 1.
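A minimal sketch of this tangent-line extrapolation (our own illustration; H_of_alpha stands for any estimator of H_α, such as the MST-based one above, alpha_star is the working value of α whose empirical choice is discussed below, and the slope is approximated by a finite difference):

    # Sketch: extrapolate H_alpha to alpha = 1 along the tangent at alpha_star.
    def shannon_extrapolation(H_of_alpha, alpha_star, delta=1e-3):
        y = H_of_alpha(alpha_star)
        # Finite-difference slope m = H'_alpha at alpha_star
        m = (H_of_alpha(alpha_star + delta) - y) / delta
        # Tangent line y = m*x + b evaluated at x = 1
        return y + m * (1.0 - alpha_star)

With the MST-based estimator above, H_of_alpha(α) would be obtained by setting γ = d(1 - α), since α = (d - γ)/d.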

The point obtained at α = 1 will be different depending on the α used. From now on, we will call α* the value of α that yields the correct entropy value at α = 1, following the described procedure. Experimentally, we have verified that this value does not depend on the nature of the probability distribution associated with the data, but it does depend on the number of samples used for the estimation and on the problem dimension. As H_α is a monotonically decreasing function, we can estimate the value of α* in the Gaussian case by means of a dichotomic search between two well-separated α values, for a constant number of samples and problem dimension and different covariance matrices. Experimentally, we have verified that α* is almost constant for diagonal covariance matrices with variance values greater than 0.5.

In order to appreciate the effects of the dimension and the number of samples on the problem, we generated a new experiment in which we randomly calculated α* for a set of 1000 distributions with 2 ≤ D ≤ 5. Experimentally, we have verified that the shape of the underlying curve adjusts suitably to a function of the type

    \alpha^* = 1 - \frac{a + b \, e^{cD}}{N}

where N is the number of samples, D is the problem dimension, and a, b, c are three constants to estimate. In order to estimate these values we used Monte Carlo simulation, minimizing the mean square error between the expression and the data. We obtained a = 1.271, b = 1.3912 and c = 0.2488. Figure 2 shows the form of the function for different values of the dimension and the number of samples.

Figure 2. α* for dimensions between 2 and 5 and different numbers of samples.

4 The EBEM Algorithm

Given a set of kernels for the mixture (initially one kernel), we evaluate the real global entropy H(Y) and the theoretical maximum entropy H_max(Y) of the mixture by considering the pair of individual entropies of each kernel and the prior probabilities: H(Y) = \sum_{k=1}^{K} \pi_k H_k(Y) and H_max(Y) = \sum_{k=1}^{K} \pi_k H_{max,k}(Y), with H_{max,k}(Y) = \frac{1}{2} \log[(2\pi e)^d |\Sigma_k|]. If the ratio H(Y)/H_max(Y) is above a given threshold, we consider that all kernels are well fitted. Otherwise, we select the kernel with the lowest individual ratio and replace it with two other kernels that are conveniently placed and initialized. Then, a new EM with K + 1 kernels starts.

4.1 Introducing a New Kernel

A low local ratio H(Y)/H_max(Y) indicates that multimodality arises and thus the kernel must be replaced by two other kernels. The difficulty in the split step arises because the original covariance matrix needs to generate two new matrices under two constraints: the overall dispersion must remain almost constant and the new matrices must be positive definite. In [12] the split is based on the constraint of preserving the first two moments before and after the operation, in one-dimensional settings. This is an ill-posed problem, because the number of equations is less than the number of unknowns, together with the requirement of retaining the positive definiteness of the covariance matrices.

4.2 The Split Equations

From the definition of the mixture in equation (1), considering that component K is the one with the lowest Gaussianity ratio, it must be decomposed into components K_1 and K_2 with parameters Θ_{k1} = (µ_{k1}, Σ_{k1}) and Θ_{k2} = (µ_{k2}, Σ_{k2}).
The corresponding priors, mean vectors and covariance matrices should satisfy the following split equations:

    \pi = \pi_1 + \pi_2
    \pi \mu = \pi_1 \mu_1 + \pi_2 \mu_2
    \pi (\Sigma + \mu \mu^T) = \pi_1 (\Sigma_1 + \mu_1 \mu_1^T) + \pi_2 (\Sigma_2 + \mu_2 \mu_2^T)    (8)
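As a quick sanity check of Eq. (8), the small sketch below (our own illustration; the function and variable names are hypothetical) verifies whether a proposed split (π_1, µ_1, Σ_1), (π_2, µ_2, Σ_2) preserves the prior, the mean and the second moment of the original component:

    # Sketch: verify the moment-preserving split constraints of Eq. (8).
    import numpy as np

    def split_preserves_moments(pi, mu, Sigma, pi1, mu1, Sigma1, pi2, mu2, Sigma2, tol=1e-8):
        ok_prior = np.isclose(pi, pi1 + pi2, atol=tol)
        ok_mean = np.allclose(pi * mu, pi1 * mu1 + pi2 * mu2, atol=tol)
        ok_cov = np.allclose(pi * (Sigma + np.outer(mu, mu)),
                             pi1 * (Sigma1 + np.outer(mu1, mu1)) +
                             pi2 * (Sigma2 + np.outer(mu2, mu2)), atol=tol)
        return bool(ok_prior and ok_mean and ok_cov)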

In [15] an RJMCMC-based algorithm is presented. To solve the equations shown above, a spectral decomposition of the current covariance matrix is used, and the original problem is replaced by that of estimating the new eigenvalues and eigenvectors of the new covariance matrices from the original one. The main limitation is that all covariance matrices of the mixture are constrained to share the same eigenvector matrix. More recently, in [4] new eigenvector matrices are obtained by applying a rotation matrix.

Let Σ = V Λ V^T be the spectral decomposition of the covariance matrix of the component with the lowest entropy ratio, with Λ = diag(λ_1, ..., λ_d) a diagonal matrix containing the eigenvalues of Σ in increasing order; π, π_1, π_2 are the priors of the original and new components, µ, µ_1, µ_2 the means, and Σ, Σ_1, Σ_2 the covariance matrices. Let D also be a d×d rotation matrix whose columns are orthonormal unit vectors; D is built by generating its lower triangular part independently from d(d-1)/2 different uniform U(0, 1) densities. The proposed split operation is given by:

    \pi_1 = u_1 \pi, \quad \pi_2 = (1 - u_1) \pi
    \mu_1 = \mu - \left( \sum_{i=1}^{d} u_2^i \sqrt{\lambda_i} V_i \right) \sqrt{\frac{\pi_2}{\pi_1}}
    \mu_2 = \mu + \left( \sum_{i=1}^{d} u_2^i \sqrt{\lambda_i} V_i \right) \sqrt{\frac{\pi_1}{\pi_2}}
    \Lambda_1 = \mathrm{diag}(u_3) \, \mathrm{diag}(\iota - u_2) \, \mathrm{diag}(\iota + u_2) \, \Lambda \, \frac{\pi}{\pi_1}
    \Lambda_2 = \mathrm{diag}(\iota - u_3) \, \mathrm{diag}(\iota - u_2) \, \mathrm{diag}(\iota + u_2) \, \Lambda \, \frac{\pi}{\pi_2}
    V_1 = D V, \quad V_2 = D^T V    (9)

where ι is a d×1 vector of ones, and u_1, u_2 = (u_2^1, u_2^2, ..., u_2^d)^T and u_3 = (u_3^1, u_3^2, ..., u_3^d)^T are 2d + 1 random variables needed to construct the priors, means and eigenvalues of the new components in the mixture. They are drawn as u_1 ~ be(2, 2), u_2^1 ~ be(1, 2d), u_2^j ~ U(-1, 1), u_3^1 ~ be(1, d), u_3^j ~ U(0, 1), with j = 2, ..., d, and be() a Beta distribution.

A graphical description of the splitting process in the 2-D case is shown in Fig. 3. The directions and magnitudes of variability are defined by the eigenvectors and eigenvalues of the covariance matrix. A full algorithmic description of the process is shown in Fig. 4.

Figure 3. 2-D example of splitting one kernel.

EBEM ALGORITHM
Initialization: Start with a unique kernel: K <- 1, Θ_1 <- {µ_1, Σ_1} with random values.
repeat:                          // Main loop
    repeat:                      // E, M steps
        Estimate log-likelihood in iteration i: l_i
    until: |l_i - l_{i-1}| < CONVERGENCE_TH
    Evaluate H(Y) and H_max(Y) globally
    if (H(Y)/H_max(Y) < ENTROPY_TH)
        Select the kernel with the lowest ratio and decompose it into K_1 and K_2
        Initialize parameters Θ_1 and Θ_2 (Eq. 9):
            new means µ_1, µ_2; new eigenvalue matrices Λ_1, Λ_2;
            new eigenvector matrices V_1, V_2; new priors π_1, π_2
    else
        Final <- True
until: Final = True

Figure 4. Entropy-Based EM algorithm.

5 Experiments

In order to test our approach we have performed several experiments with synthetic and image data. In the first one we generated 2500 samples from 5 bi-dimensional Gaussians with prior probabilities π_k = 0.2 for all k. Their means are: µ_1 = [-1, -1]^T, µ_2 = [6, 3]^T, µ_3 = [3, 6]^T, µ_4 = [2, 2]^T, µ_5 = [0, 0]^T. We used a Gaussianity threshold of 0.15 and a convergence threshold of 0.001 for the EM algorithm. Our algorithm converges after 30 iterations, correctly finding K = 5. Figure 5 shows the evolution of the algorithm.

Figure 5. Evolution of our algorithm.
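For reference, a comparable synthetic data set can be generated as in the sketch below (an illustration only: the paper does not list the covariance matrices of the five components, so identity covariances are assumed here):

    # Sketch: synthetic data similar to the first experiment (assumed identity covariances).
    import numpy as np

    rng = np.random.default_rng(0)
    means = np.array([[-1.0, -1.0], [6.0, 3.0], [3.0, 6.0], [2.0, 2.0], [0.0, 0.0]])
    N = 2500
    labels = rng.integers(0, len(means), size=N)            # equal priors: pi_k = 0.2
    samples = means[labels] + rng.standard_normal((N, 2))   # identity covariance per kernel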

We have also tested our approach on unsupervised segmentation of color images. At each pixel i in the image we compute a 3-dimensional feature vector x_i with the components in the RGB color space. As a result of our algorithm we obtain the number of components (classes) M and a label y_i ∈ [1, 2, ..., M] indicating from which class the i-th pixel came. Our image model therefore assumes that each pixel is generated by one of the Gaussian densities in the Gaussian mixture model. We have used different entropy thresholds and a convergence threshold of 0.1 for the EM algorithm. In Fig. 6 we show some color image segmentation results for three different images of size 189x189 pixels. In each row we show the original image and the resulting segmented images with increasing entropy threshold. The greater the demanded threshold, the higher the number of kernels generated and therefore the number of colors detected in the image. No post-processing has been applied to the resulting images.

Figure 6. Color image segmentation results.

Finally, we have applied the proposed method to the well-known Iris data set [3], which contains 3 classes of 50 (4-dimensional) instances each, referring to a type of iris plant: Versicolor, Virginica and Setosa. We generated 300 training samples from the means and covariances of the original classes and checked the performance in a classification problem with the original 150 samples. Starting with K = 1, the method correctly selected K = 3. Then, a maximum a posteriori classifier was built, with a classification performance of 98%.

6 Conclusions and Future Work

In this paper we have presented a method for finding the optimal number of kernels in a Gaussian mixture based on maximum entropy. We start the algorithm with only one kernel and then decide whether to split it on the basis of the entropy of the underlying pdf. The algorithm converges in few iterations and is suitable for density estimation, pattern recognition and unsupervised color image segmentation. We are currently exploring new methods of estimating the MST with O(log n) cost and how to select the data features efficiently.

References

[1] E. Beirlant, E. Dudewicz, L. Gyorfi, and E. V. der Meulen. Nonparametric entropy estimation. International Journal on Mathematical and Statistical Sciences, 6(1):17-39, 1996.
[2] D. Bertsimas and G. V. Ryzin. An asymptotic determination of the minimum spanning tree and minimum matching constants in geometrical probability. Operations Research Letters, 9(1):223-231, 1990.
[3] C. Blake and C. Merz. UCI repository of machine learning databases. University of California, Irvine, Dept. of Information and Computer Sciences, 1998.
[4] P. Dellaportas and I. Papageorgiou. Multivariate mixtures of normals with unknown number of components. Statistics and Computing, to appear.
[5] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood estimation from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1):1-38, 1977.
[6] M. Figueiredo and A. Jain. Unsupervised selection and estimation of finite mixture models. In International Conference on Pattern Recognition, ICPR 2000, Barcelona, Spain, 2000. IEEE.
[7] M. Figueiredo, J. Leitao, and A. Jain. On fitting mixture models. Energy Minimization Methods in Computer Vision and Pattern Recognition, Lecture Notes in Computer Science, 1654(1):54-69, 1999.
[8] A. Hero and O. Michel. Asymptotic theory of greedy approximations to minimal k-point random graphs. IEEE Transactions on Information Theory, 45(6):1921-1939, 1999.
[9] A. Hero and O. Michel. Applications of entropic graphs. IEEE Signal Processing Magazine, 19(5):85-95, 2002.
[10] R. Hoffman and A. Jain. A test of randomness based on the minimal spanning tree. Pattern Recognition Letters, 1(1):175-180, 1983.
[11] R. Redner and H. Walker. Mixture densities, maximum likelihood, and the EM algorithm. SIAM Review, 26(2):195-239, 1984.
[12] S. Richardson and P. Green. On Bayesian analysis of mixtures with an unknown number of components (with discussion). Journal of the Royal Statistical Society B, (1), 1997.
[13] N. Vlassis and A. Likas. A kurtosis-based dynamic approach to Gaussian mixture modeling. IEEE Trans. Systems, Man, and Cybernetics, 29(4):393-399, 1999.
[14] N. Vlassis and A. Likas. A greedy EM algorithm for Gaussian mixture learning. Neural Processing Letters, 15(1):77-87, 2000.
[15] Z. Zhang, K. Chan, Y. Wu, and C. Chen. Learning a multivariate Gaussian mixture model with the reversible jump MCMC algorithm. Statistics and Computing, (1), 2004.