EBEM: An Entropy-based EM Algorithm for Gaussian Mixture Models


Antonio Peñalver Benavent, Francisco Escolano Ruiz and Juan M. Sáez Martínez
Robot Vision Group, Alicante University, 03690 Alicante, Spain
a.penalver@umh.es, {sco, jmsaez}@dccia.ua.es

Abstract

In this paper we address the problem of estimating the parameters of a Gaussian mixture model. Although the EM algorithm yields the maximum-likelihood solution, it requires a careful initialization of the parameters, and the optimal number of kernels in the mixture may be unknown beforehand. We propose a criterion based on the entropy of the pdf (probability density function) associated with each kernel to measure the quality of a given mixture model. A novel method for estimating Shannon entropy based on Entropic Spanning Graphs is developed, and a modification of the classical EM algorithm to find the optimal number of kernels in the mixture is presented. We test our algorithm on probability density estimation, pattern recognition and color image segmentation.

1 Introduction

Gaussian mixture models have been widely used for density estimation, pattern recognition and function approximation. One of the most common methods for fitting mixtures to data is the EM algorithm [5]. However, this algorithm is prone to initialization errors and may converge to local maxima of the log-likelihood function. In addition, the algorithm requires that the number of elements (kernels) in the mixture be known beforehand (model selection). The so-called model-selection problem has been addressed in many ways [13][14][7][6]. In this paper we propose a method that, starting with only one kernel, finds the maximum-likelihood solution. In order to do so, it tests whether the underlying pdf of each kernel is Gaussian and, if it is not, replaces that kernel with two kernels adequately separated from each other. In order to detect non-Gaussianity we compare the entropy of the underlying pdf with the theoretical entropy of a Gaussian. After the kernel with the worst degree of Gaussianity has been split in two, new EM steps are performed in order to obtain a new maximum-likelihood solution.

The paper is organized as follows: In Section 2 we review the basic principles of Gaussian mixture models and the usual EM algorithm. In Section 3 we introduce Entropic Spanning Graphs, how they relate to the estimation of Renyi's entropy of order α, and how Shannon entropy can be estimated through them. In Section 4 we present our maximum-entropy based criterion and the modified EM algorithm that exploits it. In Section 5 we show some experimental results. Conclusions and future work are presented in Section 6.

2 Gaussian-Mixture Models

2.1 Definition

A d-dimensional random variable y follows a finite-mixture distribution when its pdf p(y|Θ) can be described by a weighted sum of known pdfs named kernels. When all these kernels are Gaussian, the mixture is named in the same way:

    p(y|\Theta) = \sum_{i=1}^{K} \pi_i \, p(y|\Theta_i)    (1)

where 0 ≤ π_i ≤ 1, i = 1, ..., K, and \sum_{i=1}^{K} \pi_i = 1, K being the number of kernels, π_1, ..., π_K the a priori probabilities of each kernel, and Θ_i the parameters describing kernel i. In Gaussian mixtures, Θ_i = {µ_i, Σ_i}, that is, the mean vector and the covariance matrix. The set of parameters of a given mixture is Θ ≡ {Θ_1, ..., Θ_K, π_1, ..., π_K}. Obtaining the optimal set of parameters Θ* is usually posed in terms of maximizing the log-likelihood of the pdf to be estimated:

    l(Y|\Theta) = \log p(Y|\Theta) = \log \prod_{n=1}^{N} p(y_n|\Theta) = \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k \, p(y_n|\Theta_k)    (2)

with Θ* = arg max_Θ l(Θ) and Y = {y_1, ..., y_N} a set of N i.i.d. samples of the variable Y.
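As an illustration of Eqs. (1)-(2), the following minimal Python sketch (assuming NumPy and SciPy are available; the function name and argument layout are ours, not part of the paper) evaluates the mixture density and its log-likelihood for a given parameter set:

    # Sketch of Eqs. (1)-(2): Gaussian mixture density and log-likelihood.
    import numpy as np
    from scipy.stats import multivariate_normal

    def mixture_log_likelihood(Y, pis, mus, Sigmas):
        # weighted[n, k] = pi_k * p(y_n | Theta_k), the terms of Eq. (1)
        weighted = np.column_stack([
            pi * multivariate_normal.pdf(Y, mean=mu, cov=Sigma)
            for pi, mu, Sigma in zip(pis, mus, Sigmas)
        ])
        # l(Y|Theta) = sum_n log sum_k pi_k p(y_n | Theta_k), Eq. (2)
        return np.log(weighted.sum(axis=1)).sum()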

2.2 EM Algorithm

The EM (Expectation-Maximization) algorithm [5] is an iterative procedure that allows us to find maximum-likelihood solutions to problems in which there are hidden variables. In the case of Gaussian mixtures [11], these variables are a set of N labels Z = {z_1, ..., z_N} associated with the samples. Each label is a binary vector z^{(i)} = [z_1^{(i)}, ..., z_K^{(i)}], with z_m^{(i)} = 1 and z_p^{(i)} = 0 for p ≠ m, indicating that y^{(i)} has been generated by kernel m. The EM algorithm generates a sequence of estimates of the set of parameters {Θ(t), t = 1, 2, ...} by alternating an expectation step and a maximization step until convergence.

The Expectation step consists of estimating the expected value of the hidden variables given the visible data Y and the current estimate of the parameters Θ(t). The probability of generating y_n with kernel k is given by

    p(k|y_n) = \frac{\pi_k \, p(y_n|k)}{\sum_{j=1}^{K} \pi_j \, p(y_n|j)}    (3)

In the Maximization step we re-estimate the new parameters Θ(t+1) given the expected Z:

    \pi_k = \frac{1}{N} \sum_{n=1}^{N} p(k|y_n), \quad
    \mu_k = \frac{\sum_{n=1}^{N} p(k|y_n) \, y_n}{\sum_{n=1}^{N} p(k|y_n)}, \quad
    \Sigma_k = \frac{\sum_{n=1}^{N} p(k|y_n)(y_n - \mu_k)(y_n - \mu_k)^T}{\sum_{n=1}^{N} p(k|y_n)}    (4)

A detailed description of this classic algorithm is given in [11]. Here we focus on the fact that if K is unknown beforehand it cannot be estimated by maximizing the log-likelihood, because l(Θ) grows with K. In a classical EM algorithm with a fixed number of kernels the density can be underestimated, giving a poor description of the data. In the next section we describe the use of entropy to test whether a given kernel describes the underlying data properly.

3 Shannon Entropy Estimation

Several methods for the nonparametric estimation of the differential entropy of a continuous random variable have been widely used (see [1] for a detailed overview). Entropic Spanning Graphs obtained from data to estimate Renyi's α-entropy [9] belong to a class of methods called non plug-in. Renyi's α-entropy of a probability density function f is defined as:

    H_\alpha(f) = \frac{1}{1-\alpha} \ln \int_z f^\alpha(z) \, dz    (5)

for α ∈ (0, 1). The α-entropy converges to the Shannon entropy -\int f(z) \ln f(z) \, dz as α → 1, so it is possible to obtain the latter from the former.

3.1 Renyi's Entropy and Entropic Spanning Graphs

A graph G consists of a set of vertices X_n = {x_1, ..., x_n}, with x_i ∈ R^d, and edges {e} that connect pairs of vertices: e_{ij} = (x_i, x_j). If we denote by M(X_n) the possible sets of edges in the class of acyclic graphs spanning X_n (spanning trees), the total edge length functional of the Euclidean power-weighted Minimal Spanning Tree is:

    L_{MST,\gamma}(X_n) = \min_{M(X_n)} \sum_{e \in M(X_n)} \|e\|^\gamma    (6)

with γ ∈ (0, d) and ||e|| the Euclidean distance between graph vertices. Intuitively, the length of the MST spanning a concentrated, nonuniform set of points increases at a slower rate than that of the MST spanning uniformly distributed points. This fact motivates the application of the MST as a way to test for randomness of a set of points [10]. In [8] it was shown that, in a d-dimensional feature space with d ≥ 2,

    H_\alpha(X_n) = \frac{d}{\gamma} \left[ \ln \frac{L_\gamma(X_n)}{n^\alpha} - \ln \beta_{L_\gamma,d} \right]    (7)

is an asymptotically unbiased, and almost surely consistent, estimator of the α-entropy of f, where α = (d-γ)/d and β_{L_γ,d} is a constant bias correction depending on the graph minimization criterion but independent of f. Closed-form expressions are not available for β_{L_γ,d}; only approximations and bounds are known: (i) Monte Carlo simulation of uniform random samples on the unit cube [0, 1]^d; (ii) the large-d approximation (γ/2) ln(d/(2πe)) given in [2].
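As a rough illustration of Eqs. (6)-(7), the minimal Python sketch below (assuming SciPy; it uses the large-d expression above as an approximation of ln β_{L_γ,d}, which is a simplifying assumption rather than the paper's exact bias correction) computes the power-weighted MST length and the corresponding α-entropy estimate:

    # Sketch of the MST-based alpha-entropy estimator, Eqs. (6)-(7).
    import numpy as np
    from scipy.spatial import distance_matrix
    from scipy.sparse.csgraph import minimum_spanning_tree

    def renyi_entropy_mst(X, gamma=1.0):
        n, d = X.shape
        alpha = (d - gamma) / d
        # L_gamma(X_n): total length of the Euclidean power-weighted MST, Eq. (6)
        W = distance_matrix(X, X) ** gamma
        L = minimum_spanning_tree(W).sum()
        # Assumed bias term: large-d approximation (gamma/2) * ln(d / (2*pi*e))
        ln_beta = (gamma / 2.0) * np.log(d / (2 * np.pi * np.e))
        # Eq. (7)
        return (d / gamma) * (np.log(L / n ** alpha) - ln_beta), alpha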
We can estimate H_α(f) for different values of α = (d-γ)/d by changing the edge-weight exponent γ. As γ modifies the edge weights monotonically, the graph is the same for different values of γ, and only the total length in expression (7) needs to be recomputed.

3.2 Renyi Extrapolation of Shannon Entropy

Entropic spanning graphs are suitable for estimating the α-entropy with α ∈ [0, 1), so Shannon entropy cannot be directly estimated with this method. Figure 1 shows that the shape of the function depends neither on the nature of the data nor on their size.

Figure 1. H_α for Gaussian distributions with different covariance matrices.

Since it is not possible to directly calculate the value of H_α for α = 1, we approximate this value by means of a continuous function that captures the tendency of H_α in the neighborhood of 1. From a value of α ∈ [0, 1), we can calculate the tangent line y = mx + b to H_α at this point, using m = H'_α, x = α and y = H_α. In any case, this line will be continuous and we will be able to calculate its value at x = 1.
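A minimal sketch of this tangent-line extrapolation (our own illustration; H_of_alpha stands for any estimator of H_α, such as the MST-based one above, alpha_star is the working value of α whose empirical choice is discussed below, and the slope is approximated by a finite difference):

    # Sketch: extrapolate H_alpha to alpha = 1 along the tangent at alpha_star.
    def shannon_extrapolation(H_of_alpha, alpha_star, delta=1e-3):
        y = H_of_alpha(alpha_star)
        # Finite-difference slope m = H'_alpha at alpha_star
        m = (H_of_alpha(alpha_star + delta) - y) / delta
        # Tangent line y = m*x + b evaluated at x = 1
        return y + m * (1.0 - alpha_star)

With the MST-based estimator above, H_of_alpha(α) would be obtained by setting γ = d(1 - α), since α = (d - γ)/d.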

The point obtained at α = 1 will be different depending on the α used. From now on, we will call α* the value of α that yields the correct entropy value at α = 1, following the described procedure. Experimentally, we have verified that this value does not depend on the nature of the probability distribution associated with the data, but it does depend on the number of samples used for the estimation and on the problem dimension. As H_α is a monotonically decreasing function, we can estimate the value of α* in the Gaussian case by means of a dichotomic search between two well-separated α values, for a constant number of samples and problem dimension and different covariance matrices. Experimentally, we have verified that α* is almost constant for diagonal covariance matrices with variance values greater than 0.5.

In order to appreciate the effects of the dimension and the number of samples on the problem, we generated a new experiment in which we randomly calculated α* for a set of 1000 distributions with 2 ≤ D ≤ 5. Experimentally, we have verified that the shape of the underlying curve adjusts suitably to a function of the type

    \alpha^* = 1 - \frac{a + b \, e^{cD}}{N}

where N is the number of samples, D is the problem dimension, and a, b, c are three constants to estimate. In order to estimate these values we used Monte Carlo simulation, minimizing the mean square error between the expression and the data. We obtained a = 1.271, b = 1.3912 and c = 0.2488. Figure 2 shows the form of the function for different values of the dimension and the number of samples.

Figure 2. α* for dimensions between 2 and 5 and different numbers of samples.

4 The EBEM Algorithm

Given a set of kernels for the mixture (initially one kernel), we evaluate the real global entropy H(Y) and the theoretical maximum entropy H_max(Y) of the mixture by considering the pair of individual entropies of each kernel and the prior probabilities: H(Y) = \sum_{k=1}^{K} \pi_k H_k(Y) and H_max(Y) = \sum_{k=1}^{K} \pi_k H_{max,k}(Y), with H_{max,k}(Y) = \frac{1}{2} \log[(2\pi e)^d |\Sigma_k|]. If the ratio H(Y)/H_max(Y) is above a given threshold, we consider that all kernels are well fitted. Otherwise, we select the kernel with the lowest individual ratio and replace it with two other kernels that are conveniently placed and initialized. Then, a new EM with K + 1 kernels starts.

4.1 Introducing a New Kernel

A low local ratio H(Y)/H_max(Y) indicates that multimodality arises and thus the kernel must be replaced by two other kernels. The difficulty in the split step arises because the original covariance matrix needs to generate two new matrices under two constraints: the overall dispersion must remain almost constant and the new matrices must be positive definite. In [12] the split is based on the constraint of preserving the first two moments before and after the operation, in one-dimensional settings. This is an ill-posed problem, because the number of equations is less than the number of unknowns, together with the requirement of retaining the positive definiteness of the covariance matrices.

4.2 The Split Equations

From the definition of the mixture in equation (1), considering that component K is the one with the lowest Gaussianity ratio, it must be decomposed into components K_1 and K_2 with parameters Θ_{k1} = (µ_{k1}, Σ_{k1}) and Θ_{k2} = (µ_{k2}, Σ_{k2}).
The corresponding priors, mean vectors and covariance matrices should satisfy the following split equations:

    \pi = \pi_1 + \pi_2
    \pi \mu = \pi_1 \mu_1 + \pi_2 \mu_2
    \pi (\Sigma + \mu \mu^T) = \pi_1 (\Sigma_1 + \mu_1 \mu_1^T) + \pi_2 (\Sigma_2 + \mu_2 \mu_2^T)    (8)
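As a quick sanity check of Eq. (8), the small sketch below (our own illustration; the function and variable names are hypothetical) verifies whether a proposed split (π_1, µ_1, Σ_1), (π_2, µ_2, Σ_2) preserves the prior, the mean and the second moment of the original component:

    # Sketch: verify the moment-preserving split constraints of Eq. (8).
    import numpy as np

    def split_preserves_moments(pi, mu, Sigma, pi1, mu1, Sigma1, pi2, mu2, Sigma2, tol=1e-8):
        ok_prior = np.isclose(pi, pi1 + pi2, atol=tol)
        ok_mean = np.allclose(pi * mu, pi1 * mu1 + pi2 * mu2, atol=tol)
        ok_cov = np.allclose(pi * (Sigma + np.outer(mu, mu)),
                             pi1 * (Sigma1 + np.outer(mu1, mu1)) +
                             pi2 * (Sigma2 + np.outer(mu2, mu2)), atol=tol)
        return bool(ok_prior and ok_mean and ok_cov)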

In [15] an RJMCMC-based algorithm is presented. To solve the equations shown above, a spectral decomposition of the current covariance matrix is used, and the original problem is replaced by that of estimating the new eigenvalues and eigenvectors of the new covariance matrices from the original one. The main limitation is that all covariance matrices of the mixture are constrained to share the same eigenvector matrix. More recently, in [4] new eigenvector matrices are obtained by applying a rotation matrix.

Let Σ = V Λ V^T be the spectral decomposition of the covariance matrix of the component with the lowest entropy ratio, with Λ = diag(λ_1, ..., λ_d) a diagonal matrix containing the eigenvalues of Σ in increasing order; π, π_1, π_2 are the priors of the original and new components, µ, µ_1, µ_2 the means, and Σ, Σ_1, Σ_2 the covariance matrices. Let D also be a d×d rotation matrix whose columns are orthonormal unit vectors; D is built by generating its lower triangular part independently from d(d-1)/2 different uniform U(0, 1) densities. The proposed split operation is given by:

    \pi_1 = u_1 \pi, \quad \pi_2 = (1 - u_1) \pi
    \mu_1 = \mu - \left( \sum_{i=1}^{d} u_2^i \sqrt{\lambda_i} V_i \right) \sqrt{\frac{\pi_2}{\pi_1}}
    \mu_2 = \mu + \left( \sum_{i=1}^{d} u_2^i \sqrt{\lambda_i} V_i \right) \sqrt{\frac{\pi_1}{\pi_2}}
    \Lambda_1 = \mathrm{diag}(u_3) \, \mathrm{diag}(\iota - u_2) \, \mathrm{diag}(\iota + u_2) \, \Lambda \, \frac{\pi}{\pi_1}
    \Lambda_2 = \mathrm{diag}(\iota - u_3) \, \mathrm{diag}(\iota - u_2) \, \mathrm{diag}(\iota + u_2) \, \Lambda \, \frac{\pi}{\pi_2}
    V_1 = D V, \quad V_2 = D^T V    (9)

where ι is a d×1 vector of ones, and u_1, u_2 = (u_2^1, u_2^2, ..., u_2^d)^T and u_3 = (u_3^1, u_3^2, ..., u_3^d)^T are 2d + 1 random variables needed to construct the priors, means and eigenvalues of the new components in the mixture. They are drawn as u_1 ~ be(2, 2), u_2^1 ~ be(1, 2d), u_2^j ~ U(-1, 1), u_3^1 ~ be(1, d), u_3^j ~ U(0, 1), with j = 2, ..., d, and be() a Beta distribution.

A graphical description of the splitting process in the 2-D case is shown in Fig. 3. The directions and magnitudes of variability are defined by the eigenvectors and eigenvalues of the covariance matrix. A full algorithmic description of the process is shown in Fig. 4.

Figure 3. 2-D example of splitting one kernel.

EBEM ALGORITHM
Initialization: Start with a unique kernel: K <- 1, Θ_1 <- {µ_1, Σ_1} with random values.
repeat:                          // Main loop
    repeat:                      // E, M steps
        Estimate log-likelihood in iteration i: l_i
    until: |l_i - l_{i-1}| < CONVERGENCE_TH
    Evaluate H(Y) and H_max(Y) globally
    if (H(Y)/H_max(Y) < ENTROPY_TH)
        Select the kernel with the lowest ratio and decompose it into K_1 and K_2
        Initialize parameters Θ_1 and Θ_2 (Eq. 9):
            new means µ_1, µ_2; new eigenvalue matrices Λ_1, Λ_2;
            new eigenvector matrices V_1, V_2; new priors π_1, π_2
    else
        Final <- True
until: Final = True

Figure 4. Entropy-Based EM algorithm.

5 Experiments

In order to test our approach we have performed several experiments with synthetic and image data. In the first one we generated 2500 samples from 5 bi-dimensional Gaussians with prior probabilities π_k = 0.2 for all k. Their means are: µ_1 = [-1, -1]^T, µ_2 = [6, 3]^T, µ_3 = [3, 6]^T, µ_4 = [2, 2]^T, µ_5 = [0, 0]^T. We used a Gaussianity threshold of 0.15 and a convergence threshold of 0.001 for the EM algorithm. Our algorithm converges after 30 iterations, correctly finding K = 5. Figure 5 shows the evolution of the algorithm.

Figure 5. Evolution of our algorithm.
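For reference, a comparable synthetic data set can be generated as in the sketch below (an illustration only: the paper does not list the covariance matrices of the five components, so identity covariances are assumed here):

    # Sketch: synthetic data similar to the first experiment (assumed identity covariances).
    import numpy as np

    rng = np.random.default_rng(0)
    means = np.array([[-1.0, -1.0], [6.0, 3.0], [3.0, 6.0], [2.0, 2.0], [0.0, 0.0]])
    N = 2500
    labels = rng.integers(0, len(means), size=N)            # equal priors: pi_k = 0.2
    samples = means[labels] + rng.standard_normal((N, 2))   # identity covariance per kernel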

We have also tested our approach on unsupervised segmentation of color images. At each pixel i in the image we compute a 3-dimensional feature vector x_i with the components in the RGB color space. As a result of our algorithm we obtain the number of components (classes) M and a label y_i ∈ [1, 2, ..., M] indicating from which class the i-th pixel came. Our image model therefore assumes that each pixel is generated by one of the Gaussian densities in the Gaussian mixture model. We have used different entropy thresholds and a convergence threshold of 0.1 for the EM algorithm. In Fig. 6 we show some color image segmentation results for three different images of size 189x189 pixels. In each row we show the original image and the resulting segmented images with increasing entropy threshold. The greater the demanded threshold, the higher the number of kernels generated and therefore the number of colors detected in the image. No post-processing has been applied to the resulting images.

Figure 6. Color image segmentation results.

Finally, we have applied the proposed method to the well-known Iris data set [3], which contains 3 classes of 50 (4-dimensional) instances each, referring to a type of iris plant: Versicolor, Virginica and Setosa. We generated 300 training samples from the means and covariances of the original classes and checked the performance in a classification problem with the original 150 samples. Starting with K = 1, the method correctly selected K = 3. Then, a maximum a posteriori classifier was built, with a classification performance of 98%.

6 Conclusions and Future Work

In this paper we have presented a method for finding the optimal number of kernels in a Gaussian mixture based on maximum entropy. We start the algorithm with only one kernel and then decide whether to split it on the basis of the entropy of the underlying pdf. The algorithm converges in few iterations and is suitable for density estimation, pattern recognition and unsupervised color image segmentation. We are currently exploring new methods of estimating the MST with O(log n) cost and how to select the data features efficiently.

References

[1] E. Beirlant, E. Dudewicz, L. Gyorfi, and E. V. der Meulen. Nonparametric entropy estimation. International Journal on Mathematical and Statistical Sciences, 6(1):17-39, 1996.
[2] D. Bertsimas and G. V. Ryzin. An asymptotic determination of the minimum spanning tree and minimum matching constants in geometrical probability. Operations Research Letters, 9(1):223-231, 1990.
[3] C. Blake and C. Merz. UCI repository of machine learning databases. University of California, Irvine, Dept. of Information and Computer Sciences, 1998.
[4] P. Dellaportas and I. Papageorgiou. Multivariate mixtures of normals with unknown number of components. Statistics and Computing, to appear.
[5] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood estimation from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1):1-38, 1977.
[6] M. Figueiredo and A. Jain. Unsupervised selection and estimation of finite mixture models. In International Conference on Pattern Recognition, ICPR 2000, Barcelona, Spain, 2000. IEEE.
[7] M. Figueiredo, J. Leitao, and A. Jain. On fitting mixture models. Energy Minimization Methods in Computer Vision and Pattern Recognition, Lecture Notes in Computer Science, 1654(1):54-69, 1999.
[8] A. Hero and O. Michel. Asymptotic theory of greedy approximations to minimal k-point random graphs. IEEE Transactions on Information Theory, 45(6):1921-1939, 1999.
[9] A. Hero and O. Michel. Applications of entropic graphs. IEEE Signal Processing Magazine, 19(5):85-95, 2002.
[10] R. Hoffman and A. Jain. A test of randomness based on the minimal spanning tree. Pattern Recognition Letters, 1(1):175-180, 1983.
[11] R. Redner and H. Walker. Mixture densities, maximum likelihood, and the EM algorithm. SIAM Review, 26(2):195-239, 1984.
[12] S. Richardson and P. Green. On Bayesian analysis of mixtures with an unknown number of components (with discussion). Journal of the Royal Statistical Society B, (1), 1997.
[13] N. Vlassis and A. Likas. A kurtosis-based dynamic approach to Gaussian mixture modeling. IEEE Trans. Systems, Man, and Cybernetics, 29(4):393-399, 1999.
[14] N. Vlassis and A. Likas. A greedy EM algorithm for Gaussian mixture learning. Neural Processing Letters, 15(1):77-87, 2000.
[15] Z. Zhang, K. Chan, Y. Wu, and C. Chen. Learning a multivariate Gaussian mixture model with the reversible jump MCMC algorithm. Statistics and Computing, (1), 2004.