UvA-DARE (Digital Academic Repository) Recursive unsupervised learning of finite mixture models Zivkovic, Z.; van der Heijden, F.


UvA-DARE (Digital Academic Repository)

Recursive unsupervised learning of finite mixture models
Zivkovic, Z.; van der Heijden, F.

Published in: IEEE Transactions on Pattern Analysis and Machine Intelligence

DOI: /TPAMI

Link to publication

Citation for published version (APA): Zivkovic, Z., & van der Heijden, F. (2004). Recursive unsupervised learning of finite mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(5). DOI: /TPAMI

General rights: It is not permitted to download or to forward/distribute the text or part of it without the consent of the author(s) and/or copyright holder(s), other than for strictly personal, individual use, unless the work is under an open content license (like Creative Commons).

Disclaimer/Complaints regulations: If you believe that digital publication of certain material infringes any of your rights or (privacy) interests, please let the Library know, stating your reasons. In case of a legitimate complaint, the Library will make the material inaccessible and/or remove it from the website. Please Ask the Library, or send a letter to: Library of the University of Amsterdam, Secretariat, Singel 425, 1012 WP Amsterdam, The Netherlands. You will be contacted as soon as possible.

UvA-DARE is a service provided by the library of the University of Amsterdam.

Download date: 19 Jun 2018

Recursive Unsupervised Learning of Finite Mixture Models

Zoran Zivkovic, Member, IEEE Computer Society, and Ferdinand van der Heijden, Member, IEEE Computer Society

Abstract: There are two open problems when finite mixture densities are used to model multivariate data: the selection of the number of components and the initialization. In this paper, we propose an online (recursive) algorithm that estimates the parameters of the mixture and that simultaneously selects the number of components. The new algorithm starts with a large number of randomly initialized components. A prior is used as a bias for maximally structured models. A stochastic approximation recursive learning algorithm is proposed to search for the maximum a posteriori (MAP) solution and to discard the irrelevant components.

Index Terms: Online (recursive) estimation, unsupervised learning, finite mixtures, model selection, EM algorithm.

1 INTRODUCTION

FINITE mixture probability density models have been analyzed many times and used extensively for modeling multivariate data [16], [8]. In [3] and [6], an efficient heuristic was used to simultaneously estimate the parameters of a mixture and select the appropriate number of its components. The idea is to start with a large number of components and introduce a prior to express our preference for compact models. During some iterative search procedure for the MAP solution, the prior drives the irrelevant components to extinction. The entropic prior from [3] leads to a MAP estimate that minimizes the entropy and, hence, leads to a compact model. The Dirichlet prior from [6] gives a solution that is related to model selection using the Minimum Message Length (MML) criterion [20].

This paper is inspired by the aforementioned papers [3], [6]. Our contribution is in developing an online version, which is potentially very useful in many situations since it is highly memory and time efficient. We use a stochastic approximation procedure to estimate the parameters of the mixture recursively. More on the behavior of approximate recursive equations can be found in [13], [5], [15]. We propose a way to include the suggested prior from [6] in the recursive equations. This enables the online selection of the number of components of the mixture. We show that the new algorithm can reach solutions similar to those obtained by batch algorithms.

In Sections 2 and 3 of the paper, we introduce the notation and discuss some standard problems associated with finite mixture fitting. In Section 4, we describe the mentioned heuristic that enables us to estimate the parameters of the mixture and to simultaneously select the number of its components. Further, in Section 5, we develop an online version. The final practical algorithm we used in our experiments is described in Section 6. In Section 7, we demonstrate how the new algorithm performs for a number of standard problems and compare it to some batch algorithms.

Z. Zivkovic is with the Informatics Institute, University of Amsterdam, Kruislaan 403, 1098 SJ Amsterdam, The Netherlands. E-mail: zivkovic@science.uva.nl.

F. van der Heijden is with the Laboratory for Measurement and Instrumentation, University of Twente, PO Box 217, 7500 AE Enschede, The Netherlands. E-mail: f.vanderheijden@utwente.nl.

Manuscript received 18 Nov. 2002; revised 24 June 2003; accepted 3 Nov. 2003. Recommended for acceptance by Y. Amit. For information on obtaining reprints of this article, please send e-mail to: tpami@computer.org, and reference the IEEECS Log Number.
2 PARAMETER ESTIMATION

A mixture density with M components for a d-dimensional random variable $\vec{x}$ is given by:

$$p(\vec{x}; \theta) = \sum_{m=1}^{M} \pi_m \, p_m(\vec{x}; \theta_m), \quad \text{with} \quad \sum_{m=1}^{M} \pi_m = 1, \qquad (1)$$

where $\theta = \{\pi_1, \ldots, \pi_M, \theta_1, \ldots, \theta_M\}$ are the parameters. The number of parameters depends on the number of components M, and the notation $\theta(M)$ will be used to stress this when needed. The m-th component of the mixture is denoted by $p_m(\vec{x}; \theta_m)$ and $\theta_m$ are its parameters. The mixing weights, denoted by $\pi_m$, are nonnegative and add up to one. Given a set of t data samples $X = \{\vec{x}^{(1)}, \ldots, \vec{x}^{(t)}\}$, the maximum likelihood (ML) estimate of the parameter values is:

$$\hat{\theta} = \arg\max_{\theta} \left( \log p(X; \theta) \right). \qquad (2)$$

The Expectation Maximization (EM) algorithm [4] is commonly used to search for the solution. The EM algorithm is an iterative procedure that searches for a local maximum of the log-likelihood function. In order to apply the EM algorithm, we need to introduce for each $\vec{x}$ a discrete unobserved indicator vector $\vec{y} = [y_1 \ldots y_M]^T$. The indicator vector specifies (by means of position coding) the mixture component from which the observation $\vec{x}$ is drawn. The new joint density function can be written as a product:

$$p(\vec{x}, \vec{y}; \theta) = p(\vec{y}; \pi_1, \ldots, \pi_M) \, p(\vec{x} \mid \vec{y}; \theta_1, \ldots, \theta_M) = \prod_{m=1}^{M} \left( \pi_m \, p_m(\vec{x}; \theta_m) \right)^{y_m},$$

where exactly one of the $y_m$ from $\vec{y}$ is equal to 1 and the others are zero. The indicators $\vec{y}$ have a multinomial distribution defined by the mixing weights $\pi_1, \ldots, \pi_M$. The EM algorithm starts with some initial parameter estimate $\hat{\theta}^{(0)}$. If we denote the set of unobserved data by $Y = \{\vec{y}^{(1)}, \ldots, \vec{y}^{(t)}\}$, the estimate $\hat{\theta}^{(k)}$ from the k-th iteration of the EM algorithm is obtained using the previous estimate $\hat{\theta}^{(k-1)}$:

$$\text{E step:} \quad Q(\theta; \hat{\theta}^{(k-1)}) = E_Y\left( \log p(X, Y; \theta) \mid X, \hat{\theta}^{(k-1)} \right) = \sum_{\text{all possible } Y} p(Y \mid X; \hat{\theta}^{(k-1)}) \log p(X, Y; \theta),$$

$$\text{M step:} \quad \hat{\theta}^{(k)} = \arg\max_{\theta} \left( Q(\theta; \hat{\theta}^{(k-1)}) \right).$$

The attractiveness of the EM algorithm is that it is easy to implement and that it converges to a local maximum of the log-likelihood function. However, one of its serious limitations is that it can end up in a poor local maximum if not properly initialized. The selection of the initial parameter values is still an open question that has been studied many times. Some recent efforts were reported in [3], [6], [17], [18], [19].

3 MODEL SELECTION

Note that, in order to use the EM algorithm, we need to know the appropriate number of components M. Too many components lead to overfitting and too few to underfitting. Choosing an appropriate number of components is important. Sometimes, for example, the appropriate number of components can reveal some important underlying structure that characterizes the data.
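To make the E and M steps concrete, here is a minimal batch-EM sketch for a Gaussian mixture (an illustration in NumPy/SciPy, not code from the paper; the function name and array layout are our own choices):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pi, mu, C):
    """One EM iteration for a Gaussian mixture.

    X  : (t, d) data matrix
    pi : (M,) mixing weights
    mu : (M, d) component means
    C  : (M, d, d) component covariance matrices
    """
    t, d = X.shape
    M = len(pi)

    # E step: posterior p(y_m = 1 | x) for every sample (the "ownerships").
    resp = np.empty((t, M))
    for m in range(M):
        resp[:, m] = pi[m] * multivariate_normal.pdf(X, mu[m], C[m])
    resp /= resp.sum(axis=1, keepdims=True)

    # M step: re-estimate the parameters from the expected statistics.
    Nm = resp.sum(axis=0)                  # expected samples per component
    pi_new = Nm / t
    mu_new = (resp.T @ X) / Nm[:, None]
    C_new = np.empty_like(C)
    for m in range(M):
        diff = X - mu_new[m]
        C_new[m] = (resp[:, m, None] * diff).T @ diff / Nm[m]
    return pi_new, mu_new, C_new
```

Iterating `em_step` until the log-likelihood stops improving yields a local ML solution; as noted above, the quality of that local maximum depends strongly on the initialization.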

Full Bayesian approaches sample from the full a posteriori distribution with the number of components M considered unknown. This is possible using Markov chain Monte Carlo methods, as reported in [11], [10]. However, these methods are still far too computationally demanding. Most of the practical model selection techniques are based on maximizing the following type of criterion:

$$J(M, \theta(M)) = \log p(X; \theta(M)) - P(M). \qquad (3)$$

Here, $\log p(X; \theta(M))$ is the log-likelihood for the available data. This part can be maximized using the EM algorithm. However, introducing more mixture components always increases the log-likelihood. The balance is achieved by introducing $P(M)$, which penalizes complex solutions. Some examples of such criteria are the Akaike Information Criterion [1], the Bayesian Inference Criterion [14], the Minimum Description Length [12], the Minimum Message Length (MML) [20], etc. For a detailed review see, for example, [8].

4 SOLUTION USING MAP ESTIMATION

The standard procedure for selecting M is the following: find the ML estimate for different values of M and choose the M that maximizes (3). Suppose that we introduce a prior $p(\theta(M))$ for the mixture parameters that penalizes complex solutions in a similar way as $P(M)$ from (3). Instead of (3), we could then use:

$$\log p(X; \theta(M)) + \log p(\theta(M)). \qquad (4)$$

As in [6] and [3], we use the simplest prior choice: a prior only on the mixing weights $\pi_m$. For example, the Dirichlet prior (see [7], chapter 16) for the mixing weights is given by:

$$p(\theta(M)) \propto \exp\left( \sum_{m=1}^{M} c_m \log \pi_m \right) = \prod_{m=1}^{M} \pi_m^{c_m}. \qquad (5)$$

The procedure is then as follows: we start with a large number M of randomly initialized components and search for the MAP solution using some iterative procedure, for example, the EM algorithm. The prior drives the irrelevant components to extinction. In this way, while searching for the MAP solution, the number of components M is reduced until the balance is achieved. It can be shown that the standard MML model selection criterion can be approximated by the Dirichlet prior with the coefficients $c_m$ equal to $-N/2$, where N is the number of parameters per component of the mixture. See [6] for details.

The parameters $c_m$ have a meaningful interpretation. For a multinomial distribution, $c_m$ presents the prior evidence (in the MAP sense) for the class m: the number of samples a priori belonging to that class. Negative prior evidence means that we accept that a class exists only if there is enough evidence from the data for its existence. If there are many parameters per component, we will need many data samples to estimate them. In this sense, the presented linear connection between $c_m$ and N seems very logical.

The procedure from [6] starts with all the $\pi_m$-s equal to 1/M. Although there is no proof of optimality, it seems reasonable to discard a component when its weight becomes negative. This also ensures that the mixing weights stay nonnegative.

The entropic prior from [3] has a similar form: $p(\theta(M)) \propto \exp(-\beta H(\pi_1, \ldots, \pi_M))$, where $H(\pi_1, \ldots, \pi_M) = -\sum_{m=1}^{M} \pi_m \log \pi_m$ is the entropy measure for the underlying multinomial distribution and $\beta$ is a parameter. We use the mentioned Dirichlet prior because it leads to a closed-form solution.

5 RECURSIVE (ONLINE)

For the ML estimate, the following holds: $\frac{\partial}{\partial \hat{\theta}} \log p(X; \hat{\theta}) = 0$. The mixing weights are constrained to sum up to 1. We take this into account by introducing the Lagrange multiplier $\lambda$:

$$\frac{\partial}{\partial \hat{\pi}_m} \left( \log p(X; \hat{\theta}) + \lambda \left( \sum_{m=1}^{M} \hat{\pi}_m - 1 \right) \right) = 0.$$
From here, after getting rid of $\lambda$, it follows that the ML estimate for t data samples should satisfy:

$$\hat{\pi}_m = \frac{1}{t} \sum_{i=1}^{t} o_m(\vec{x}^{(i)}), \qquad (6)$$

with the ownerships defined as $o_m(\vec{x}) = \hat{\pi}_m \, p_m(\vec{x}; \hat{\theta}_m) / p(\vec{x}; \hat{\theta})$.

Similarly, for the MAP solution we solve $\frac{\partial}{\partial \hat{\pi}_m} \left( \log p(X; \hat{\theta}) + \log p(\hat{\theta}) + \lambda \left( \sum_{m=1}^{M} \hat{\pi}_m - 1 \right) \right) = 0$, where $p(\hat{\theta})$ is the mentioned Dirichlet prior (5). For t data samples, we get:

$$\hat{\pi}_m = \frac{1}{K} \left( \sum_{i=1}^{t} o_m(\vec{x}^{(i)}) - c \right), \qquad (7)$$

where $K = \sum_{m=1}^{M} \left( \sum_{i=1}^{t} o_m(\vec{x}^{(i)}) - c \right) = t - Mc$ (since $\sum_{m=1}^{M} o_m = 1$). The parameters of the prior are $c_m = -c$ (and $c = N/2$ as mentioned before). We rewrite (7) as:

$$\hat{\pi}_m = \frac{\hat{\Pi}_m - c/t}{1 - Mc/t}, \qquad (8)$$

where $\hat{\Pi}_m = \frac{1}{t} \sum_{i=1}^{t} o_m(\vec{x}^{(i)})$ is the mentioned ML estimate and the bias from the prior is introduced through $c/t$. The bias decreases for larger data sets (larger t). However, if a small bias is acceptable, we can keep it constant by fixing $c/t$ to $c_T = c/T$ with some large T. This means that the bias will always be the same as it would have been for a data set with T samples.

If we assume that the parameter estimates do not change much when a new sample $\vec{x}^{(t+1)}$ is added and, therefore, $o_m^{(t+1)}(\vec{x}^{(i)})$ can be approximated by $o_m^{(t)}(\vec{x}^{(i)})$, which uses the previous parameter estimates, we get the following well-behaved and easy-to-use recursive update equation:

$$\hat{\pi}_m^{(t+1)} = \hat{\pi}_m^{(t)} + (1+t)^{-1} \left( \frac{o_m^{(t)}(\vec{x}^{(t+1)})}{1 - Mc_T} - \hat{\pi}_m^{(t)} \right) - (1+t)^{-1} \frac{c_T}{1 - Mc_T}. \qquad (9)$$

Here, T should be sufficiently large to make sure that $Mc_T < 1$. We start with initial $\hat{\pi}_m^{(0)} = 1/M$ and discard the m-th component when $\hat{\pi}_m^{(t+1)} < 0$.

Note that the straightforward recursive version of (7), given by $\hat{\pi}_m^{(t+1)} = \hat{\pi}_m^{(t)} + (1 + t - Mc)^{-1} \left( o_m^{(t)}(\vec{x}^{(t+1)}) - \hat{\pi}_m^{(t)} \right)$, is not very useful. For small t (when $1 + t - Mc < 0$), the update is negative and the weights of the components with high $o_m(\vec{x}^{(t+1)})$ are decreased instead of increased. In order to avoid the negative update, we could start with a larger value for t, but then we cancel out the influence of the prior. This motivates the important choice we made to fix the influence of the prior.

The most commonly used mixture is the Gaussian mixture. A mixture component $p_m(\vec{x}; \theta_m) = N(\vec{x}; \vec{\mu}_m, C_m)$ has its mean $\vec{\mu}_m$ and its covariance matrix $C_m$ as parameters. The prior has influence only on the mixing weights, and for the rest of the parameters we can use the recursive equations from [15]:

$$\hat{\vec{\mu}}_m^{(t+1)} = \hat{\vec{\mu}}_m^{(t)} + (t+1)^{-1} \frac{o_m^{(t)}(\vec{x}^{(t+1)})}{\hat{\pi}_m^{(t)}} \left( \vec{x}^{(t+1)} - \hat{\vec{\mu}}_m^{(t)} \right), \qquad (10)$$

$$\hat{C}_m^{(t+1)} = \hat{C}_m^{(t)} + (t+1)^{-1} \frac{o_m^{(t)}(\vec{x}^{(t+1)})}{\hat{\pi}_m^{(t)}} \left( (\vec{x}^{(t+1)} - \hat{\vec{\mu}}_m^{(t)})(\vec{x}^{(t+1)} - \hat{\vec{\mu}}_m^{(t)})^T - \hat{C}_m^{(t)} \right). \qquad (11)$$
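As an illustration of the weight update (9) together with the discarding rule, the following is a sketch under our own naming, not the authors' code; the ownerships are assumed to be computed from the current mixture as defined above:

```python
import numpy as np

def update_weights(pi, o, t, c_T):
    """One application of the MAP weight update (9).

    pi  : (M,) current weight estimates pi_m^(t), summing to one
    o   : (M,) ownerships o_m(x^(t+1)) of the new sample
    t   : number of samples processed so far
    c_T : fixed prior bias c/T; T must be large enough that M * c_T < 1
    """
    M = len(pi)
    gain = 1.0 / (1.0 + t)
    pi = pi + gain * (o / (1.0 - M * c_T) - pi) - gain * c_T / (1.0 - M * c_T)

    # Discard components driven to negative weight by the prior and
    # renormalize the survivors (the rule adopted in Section 6).
    keep = pi >= 0
    pi = pi[keep] / pi[keep].sum()
    return pi, keep
```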

6 A SIMPLE PRACTICAL ALGORITHM

For an online procedure, it is reasonable to fix the influence of the new samples by replacing the term $(1+t)^{-1}$ from the recursive update equations (9), (10), and (11) by $\alpha = 1/T$. There are also some practical reasons for using a fixed small constant. It reduces the problems with instability of the equations for small t. Furthermore, a fixed $\alpha$ helps in forgetting the out-of-date statistics (random initialization and component deletion) more rapidly. It is equivalent to introducing an exponentially decaying envelope: a weight $(1-\alpha)^{t-i}$ is applied to the influence of the sample $\vec{x}^{(i)}$.

For the sake of clarity, we present here the whole algorithm we used in our experiments. We start with a large number of components M and with a random initialization of the parameters (see the next section for an example). We have $c_T = \alpha N/2$. Furthermore, we use Gaussian mixture components with full covariance matrices. Therefore, if the data is d-dimensional, we have $N = d + d(d+1)/2$ (the number of parameters of a Gaussian with a full covariance matrix). The online algorithm is then given by:

- Input: new data sample $\vec{x}^{(t+1)}$, current parameter estimates $\hat{\theta}^{(t)}$.
- Calculate the ownerships: $o_m(\vec{x}^{(t+1)}) = \hat{\pi}_m \, p_m(\vec{x}^{(t+1)}; \hat{\theta}_m) / p(\vec{x}^{(t+1)}; \hat{\theta})$.
- Update the mixing weights: $\hat{\pi}_m^{(t+1)} = \hat{\pi}_m + \alpha \left( \frac{o_m(\vec{x}^{(t+1)})}{1 - Mc_T} - \hat{\pi}_m \right) - \alpha \frac{c_T}{1 - Mc_T}$.
- Check if there are irrelevant components: if $\hat{\pi}_m^{(t+1)} < 0$, discard the component m, set $M = M - 1$, and renormalize the remaining mixing weights.
- Update the rest of the parameters: $\hat{\vec{\mu}}_m^{(t+1)} = \hat{\vec{\mu}}_m + w \vec{\delta}$ and $\hat{C}_m^{(t+1)} = \hat{C}_m + w \left( \vec{\delta}\vec{\delta}^T - \hat{C}_m \right)$, where $w = \alpha \, o_m(\vec{x}^{(t+1)}) / \hat{\pi}_m^{(t+1)}$ and $\vec{\delta} = \vec{x}^{(t+1)} - \hat{\vec{\mu}}_m$ (tip: limit the update speed, $w = \min(20\alpha, w)$).
- Output: new parameter estimates $\hat{\theta}^{(t+1)}$.

This simple algorithm can be implemented in only a few lines of code (see the sketch after this section's initialization details). The recommended upper limit $20\alpha$ for w simply means that the updating speed is limited for the covariance matrices of the components representing less than 5 percent of the data. This was necessary since $\vec{\delta}\vec{\delta}^T$ is a singular matrix and the covariance matrices may become singular if updated too fast.

7 EXPERIMENTS

In this section, we demonstrate the algorithm performance on a few standard problems. We show summary results from 100 trials for each data set. For the real-world data sets, we randomly sample from the data to generate the longer sequences needed for our sequential algorithm. First, for each of the problems, we present in Fig. 1 how the selected number of components of the mixture changes as new samples are sequentially added. The number of components that was finally selected is presented in the form of a histogram over the 100 trials. In Fig. 2, we present a comparison with some batch algorithms and study the influence of the parameter $\alpha$.

The random initialization of the parameters is the same as in [6]. The means $\hat{\vec{\mu}}_m^{(0)}$ of the mixture components are initialized by some randomly chosen data points. The initial covariance matrices are a fraction (1/10 here) of the mean global diagonal covariance matrix:

$$\hat{C}_m^{(0)} = \frac{1}{10d} \operatorname{trace}\left( \frac{1}{n} \sum_{i=1}^{n} (\vec{x}^{(i)} - \hat{\vec{\mu}})(\vec{x}^{(i)} - \hat{\vec{\mu}})^T \right) I,$$

where $\hat{\vec{\mu}} = \frac{1}{n} \sum_{i=1}^{n} \vec{x}^{(i)}$ is the global mean of the data and I is the identity matrix of proper dimensions. We used the first n = 100 samples (it is also possible to estimate this initial covariance matrix recursively). Finally, we set the initial mixing weights to $\hat{\pi}_m^{(0)} = 1/M$. The initial number of components M should be large enough so that the initialization reasonably covers the data. We used here the same initial number of components as in [6].
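Putting the steps together, the following is a compact sketch of the whole online procedure as we read it (NumPy-based; the class name, the density-evaluation details, and the default parameter values are our own choices, not the paper's):

```python
import numpy as np

class RecursiveMixture:
    """Online MAP mixture estimation (a sketch of the Section 6 algorithm).

    A simplified reading of the paper, not the authors' implementation:
    full-covariance Gaussian components, fixed learning rate alpha = 1/T.
    """

    def __init__(self, X_init, M=30, alpha=1.0 / 150):
        n, d = X_init.shape
        self.alpha = alpha
        N = d + d * (d + 1) / 2            # parameters per component
        self.c_T = alpha * N / 2           # fixed prior bias
        # Random initialization as in Section 7: means at random data
        # points, covariances a fraction of the global covariance.
        idx = np.random.choice(n, M, replace=False)
        self.mu = X_init[idx].astype(float).copy()
        gmean = X_init.mean(axis=0)
        var = np.trace((X_init - gmean).T @ (X_init - gmean) / n) / (10 * d)
        self.C = np.tile(var * np.eye(d), (M, 1, 1))
        self.pi = np.full(M, 1.0 / M)

    def _ownerships(self, x):
        M, d = self.mu.shape
        p = np.empty(M)
        for m in range(M):
            diff = x - self.mu[m]
            inv = np.linalg.inv(self.C[m])
            det = np.linalg.det(self.C[m])
            p[m] = self.pi[m] * np.exp(-0.5 * diff @ inv @ diff) \
                / np.sqrt((2 * np.pi) ** d * det)
        return p / p.sum()

    def update(self, x):
        o = self._ownerships(x)
        M = len(self.pi)
        denom = 1.0 - M * self.c_T
        self.pi += self.alpha * (o / denom - self.pi) \
            - self.alpha * self.c_T / denom
        # Discard irrelevant components and renormalize the weights.
        keep = self.pi >= 0
        self.pi, self.mu, self.C, o = (a[keep] for a in
                                       (self.pi, self.mu, self.C, o))
        self.pi /= self.pi.sum()
        # Update means and covariances; cap the update speed at 20 * alpha.
        w = np.minimum(self.alpha * o / self.pi, 20 * self.alpha)
        for m in range(len(self.pi)):
            delta = x - self.mu[m]
            self.mu[m] += w[m] * delta
            self.C[m] += w[m] * (np.outer(delta, delta) - self.C[m])
```

Streaming the data through `update` one sample at a time then yields both the parameter estimates and, through the discarding rule, the selected number of components M.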
7.1 The Three Gaussians Data Set

First, we analyze a Gaussian mixture with mixing weights $\pi_1 = \pi_2 = \pi_3 = 1/3$, means $\vec{\mu}_1 = [0\ {-2}]^T$, $\vec{\mu}_2 = [0\ 0]^T$, $\vec{\mu}_3 = [0\ 2]^T$, and covariance matrices

$$C_1 = C_2 = C_3 = \begin{bmatrix} 2 & 0 \\ 0 & 0.2 \end{bmatrix}.$$

A modified version of the EM called DAEM from [17] was able to find the correct solution from a bad initialization. For a data set with 900 samples, it needed more than 200 iterations to get close to the solution. Here, we start with M = 30 mixture components. With random initialization, we performed 100 trials and the new algorithm was always able to find the correct solution while simultaneously estimating the parameters of the mixture and selecting the number of components. A similar batch algorithm from [6] needs about 200 iterations to identify the three components (on a data set with 900 samples). From the plot in Fig. 1, we see that after 9,000 samples the new algorithm is usually already able to identify the three components. The computation costs for 9,000 samples are approximately the same as for only 10 iterations of the EM algorithm on a data set with 900 samples. Consequently, for this data set, the new algorithm is about 20 times faster in finding a similar solution (a typical solution is presented in Fig. 1 by the $2\sigma$ contours of the Gaussian components). In [9], some approximate recursive versions of the EM algorithm were compared to the standard EM algorithm and it was shown that the recursive versions are usually faster. This is in correspondence with our results. Empirically, we decided that 50 samples per class are enough and used $\alpha = 1/150$.

7.2 The Iris Data Set

We disregard the class information from the well-known 3-class, 4-dimensional Iris data set [2]. In the 100 trials, the clusters were properly identified 81 times. This shows that the order in which the data is presented can influence the recursive solution. The data set has only 150 samples (50 per class), which were repeated many times. We expect that the algorithm would perform better with more data samples. We used $\alpha = 1/150$. The typical solution in Fig. 1 is presented by projecting the 4-dimensional data onto the first two principal components.

7.3 The Shrinking Spiral Data Set

This data set presents a 1-dimensional manifold (a "shrinking spiral") in three dimensions with added noise: $\vec{x} = [(13 - 0.5t)\cos t \quad (0.5t - 13)\sin t \quad t]^T + \vec{n}$, with $t \sim \text{Uniform}[0, 4\pi]$ and the noise $\vec{n} \sim N(0, I)$. The modified EM called SMEM from [18] was reported to be able to fit a 10-component mixture in about 350 iterations. The batch algorithm from [6] fits the mixture and selects 11, 12, or 13 components, using typically 300 to 400 iterations on a 900-sample data set. From the graph in Fig. 1, it is clear that we achieve similar results, but much faster. About 18,000 samples were enough to arrive at a similar solution. Consequently, again, the new algorithm is about 20 times faster. There are no clusters in this data set. The fixed $\alpha$ has the effect that the influence of the old data is downweighted by the exponentially decaying envelope $(1-\alpha)^{t-k}$ (for $k < t$). For comparison with the other algorithms that used 900 samples, we limited the influence of the older samples to 5 percent of the influence of the current sample by $\alpha = -\log(0.05)/900$.
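For reference, the two synthetic data sets can be generated as follows (our own sketch of the definitions above; the seed and function names are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def three_gaussians(n):
    """Sample n points from the Three Gaussians mixture of Section 7.1."""
    means = np.array([[0.0, -2.0], [0.0, 0.0], [0.0, 2.0]])
    cov = np.diag([2.0, 0.2])
    labels = rng.integers(0, 3, size=n)          # equal mixing weights 1/3
    return rng.multivariate_normal(np.zeros(2), cov, size=n) + means[labels]

def shrinking_spiral(n):
    """Sample n noisy points on the shrinking spiral of Section 7.3."""
    t = rng.uniform(0.0, 4.0 * np.pi, size=n)
    clean = np.stack([(13 - 0.5 * t) * np.cos(t),
                      (0.5 * t - 13) * np.sin(t),
                      t], axis=1)
    return clean + rng.standard_normal((n, 3))   # additive N(0, I) noise
```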

In Fig. 1, we present a typical solution by showing for each component the eigenvector corresponding to the largest eigenvalue of the covariance matrix.

Fig. 1. Model selection results for a few standard problems (summary from 100 trials).

7.4 The Enzyme Data Set

The 1-dimensional Enzyme data set has 245 data samples. It was shown in [11], using MCMC, that the number of components supported by the data is most likely four, but two and three are also good choices. Our algorithm arrived at similar solutions. In a similar way as before, we used $\alpha = -\log(0.05)/245$.

7.5 Comparison with Some Batch Algorithms

The following standard batch methods were considered for comparison: the EM algorithm initialized using the result from k-means clustering; the SMEM method [18]; and the greedy EM method [19], which starts with a single component and adds new ones, reported to be faster than the more elaborate SMEM. We used 900 samples for the Three Gaussians and the Shrinking Spiral data sets. The batch algorithms assume a known number of components: three for the Three Gaussians and the Iris data, 13 for the Shrinking Spiral, and four for the Enzyme data set. Our new unsupervised recursive algorithm, RUEM, selected on average approximately the same number of components for the chosen $\alpha$. All the iterative batch algorithms in our experiments stop when the change in the log-likelihood drops below a fixed small threshold. The results are presented in Fig. 2a. The best likelihood and the lowest standard deviation are reported in bold. We also added the ideal ML result obtained using a carefully initialized EM. For the Iris data, the EM was initialized using the means and the covariances of the three classes. However, the solution where the two close clusters are modeled using one component was better in terms of likelihood. This wrong solution was found occasionally by some of the algorithms. The results from the RUEM are biased. Furthermore, the parameter $\alpha$ controls the speed of updating the parameters and, therefore, also the effective amount of data that is considered. Therefore, we also present results polished by additionally applying the EM algorithm, using the same sample size as for the batch algorithms. The RUEM results and the polished results are better than or similar to the batch results. We also observe that the greedy EM algorithm has problems with the Iris and the Shrinking Spiral data.

7.6 The Influence of the Parameter $\alpha$

In Figs. 2b and 2c, we show the influence of the parameter $\alpha$ on the selected number of components. We also plot the log-likelihood per sample for different values of $\alpha$.

Fig. 2. Comparison with some standard batch methods and some experiments to study the influence of the parameter $\alpha$ (summary from 100 trials). (a) The mean and the standard deviation (between the brackets) of the log-likelihood over the number of samples, calculated on new test data for the synthetic data sets. (b) The Three Gaussians data set: the influence of $\alpha$. (c) The Shrinking Spiral data set: the influence of $\alpha$.

For the Three Gaussians data set, there is a range of values of $\alpha$ for which the same number of components is finally selected. We can expect similar results for other data sets where the clusters are well described by the mixture components and the components are well separated. For the Shrinking Spiral data set, there are no clear clusters and the number of selected components slowly declines for larger $\alpha$. Similarly, the log-likelihood also decreases with $\alpha$. For comparison, we also plotted some log-likelihood values from some batch algorithms (see the previous section). The new unsupervised procedure simultaneously estimates the parameters and selects a compact model. We observe from the log-likelihood values that, for a wide range of values of $\alpha$, we get a good representation of the data with a compact model. The graphs for the real-world data sets Iris and Enzyme are not included since they look similar to the graphs for the Shrinking Spiral data.

8 DISCUSSION AND CONCLUSIONS

We have proposed an online method for fitting mixture models which relies on a description-length-reducing prior and a MAP estimation procedure for selecting a compact model. The experimental results indicated that the recursive algorithm is able to solve difficult problems and to obtain solutions similar to those of other elaborate batch algorithms. However, the theoretical support for the finally selected number of components is questionable. Some arguments in favor of the entropic prior and its connections to other model selection criteria are given in [3]. The Dirichlet prior we used is related to the well-founded MML principle, but it can perhaps be better viewed as an efficient heuristic.

Therefore, if selecting the correct model is critical, we suggest, as in the much slower batch version [6], performing an additional check with some standard model selection criterion (the full MML, for example). An additional problem when compared to the batch version [6] is the introduced parameter $\alpha$ that balances the influence of the data against the influence of the prior. This is similar to the parameter of the entropic prior in [3]. Some experiments were performed to show the influence of this parameter. The parameter $\alpha = 1/T$ is related to the number of data samples T that are considered, and some heuristic choices were used in the previous section. If selecting the correct number of components is not critical, the new recursive procedure is highly time and memory efficient and potentially very useful for giving a quick, up-to-date, compact description of the data.

ACKNOWLEDGMENTS

This work was done while Z. Zivkovic was with the Laboratory for Measurement and Instrumentation, University of Twente, Enschede, The Netherlands.

REFERENCES

[1] H. Akaike, "A New Look at the Statistical Model Identification," IEEE Trans. Automatic Control, vol. 19, no. 6, pp. 716-723, 1974.
[2] E. Anderson, "The Irises of the Gaspe Peninsula," Bull. of the Am. Iris Soc., vol. 59, pp. 2-5, 1935.
[3] M.E. Brand, "Structure Learning in Conditional Probability Models via an Entropic Prior and Parameter Extinction," Neural Computation J., vol. 11, no. 5, pp. 1155-1182, 1999.
[4] A.P. Dempster, N. Laird, and D.B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm," J. Royal Statistical Soc., Series B (Methodological), vol. 39, no. 1, pp. 1-38, 1977.
[5] V. Fabian, "On Asymptotically Efficient Recursive Estimation," Annals of Statistics, vol. 6, pp. 854-866, 1978.
[6] M. Figueiredo and A.K. Jain, "Unsupervised Learning of Finite Mixture Models," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 3, pp. 381-396, Mar. 2002.
[7] A. Gelman, J.B. Carlin, H.S. Stern, and D.B. Rubin, Bayesian Data Analysis. Chapman and Hall, 1995.
[8] G. McLachlan and D. Peel, Finite Mixture Models. John Wiley and Sons, 2000.
[9] R.M. Neal and G.E. Hinton, "A New View of the EM Algorithm that Justifies Incremental, Sparse and Other Variants," Learning in Graphical Models, pp. 355-368, M.I. Jordan, ed., 1998.
[10] C. Rasmussen, "The Infinite Gaussian Mixture Model," Advances in Neural Information Processing Systems, vol. 12, pp. 554-560, 2000.
[11] S. Richardson and P. Green, "On Bayesian Analysis of Mixture Models with Unknown Number of Components," J. Royal Statistical Soc., Series B (Methodological), vol. 59, no. 4, pp. 731-792, 1997.
[12] J. Rissanen, "Stochastic Complexity," J. Royal Statistical Soc., Series B (Methodological), vol. 49, no. 3, pp. 223-239, 1987.
[13] J. Sacks, "Asymptotic Distribution of Stochastic Approximation Procedures," Annals of Math. Statistics, vol. 29, pp. 373-405, 1958.
[14] G. Schwarz, "Estimating the Dimension of a Model," Annals of Statistics, vol. 6, no. 2, pp. 461-464, 1978.
[15] D.M. Titterington, "Recursive Parameter Estimation Using Incomplete Data," J. Royal Statistical Soc., Series B (Methodological), vol. 46, no. 2, pp. 257-267, 1984.
[16] D.M. Titterington, A.F.M. Smith, and U.E. Makov, Statistical Analysis of Finite Mixture Distributions. John Wiley and Sons, 1985.
[17] N. Ueda and R. Nakano, "Deterministic Annealing EM Algorithm," Neural Networks, vol. 11, pp. 271-282, 1998.
[18] N. Ueda, R. Nakano, Z. Ghahramani, and G.E. Hinton, "SMEM Algorithm for Mixture Models," Neural Computation, vol. 12, no. 9, pp. 2109-2128, 2000.
[19] J.J. Verbeek, N. Vlassis, and B. Krose, "Efficient Greedy Learning of Gaussian Mixture Models," Neural Computation, vol. 15, no. 1, 2003.
[20] C. Wallace and P. Freeman, "Estimation and Inference by Compact Coding," J. Royal Statistical Soc., Series B (Methodological), vol. 49, no. 3, pp. 240-265, 1987.


More information

Use of PSO in Parameter Estimation of Robot Dynamics; Part One: No Need for Parameterization

Use of PSO in Parameter Estimation of Robot Dynamics; Part One: No Need for Parameterization Use of PSO in Paraeter Estiation of Robot Dynaics; Part One: No Need for Paraeterization Hossein Jahandideh, Mehrzad Navar Abstract Offline procedures for estiating paraeters of robot dynaics are practically

More information

In this chapter, we consider several graph-theoretic and probabilistic models

In this chapter, we consider several graph-theoretic and probabilistic models THREE ONE GRAPH-THEORETIC AND STATISTICAL MODELS 3.1 INTRODUCTION In this chapter, we consider several graph-theoretic and probabilistic odels for a social network, which we do under different assuptions

More information

Fast Montgomery-like Square Root Computation over GF(2 m ) for All Trinomials

Fast Montgomery-like Square Root Computation over GF(2 m ) for All Trinomials Fast Montgoery-like Square Root Coputation over GF( ) for All Trinoials Yin Li a, Yu Zhang a, a Departent of Coputer Science and Technology, Xinyang Noral University, Henan, P.R.China Abstract This letter

More information

A Self-Organizing Model for Logical Regression Jerry Farlow 1 University of Maine. (1900 words)

A Self-Organizing Model for Logical Regression Jerry Farlow 1 University of Maine. (1900 words) 1 A Self-Organizing Model for Logical Regression Jerry Farlow 1 University of Maine (1900 words) Contact: Jerry Farlow Dept of Matheatics Univeristy of Maine Orono, ME 04469 Tel (07) 866-3540 Eail: farlow@ath.uaine.edu

More information

Supervised Baysian SAR image Classification Using The Full Polarimetric Data

Supervised Baysian SAR image Classification Using The Full Polarimetric Data Supervised Baysian SAR iage Classification Using The Full Polarietric Data (1) () Ziad BELHADJ (1) SUPCOM, Route de Raoued 3.5 083 El Ghazala - TUNSA () ENT, BP. 37, 100 Tunis Belvedere, TUNSA Abstract

More information

Training an RBM: Contrastive Divergence. Sargur N. Srihari

Training an RBM: Contrastive Divergence. Sargur N. Srihari Training an RBM: Contrastive Divergence Sargur N. srihari@cedar.buffalo.edu Topics in Partition Function Definition of Partition Function 1. The log-likelihood gradient 2. Stochastic axiu likelihood and

More information

DEPARTMENT OF ECONOMETRICS AND BUSINESS STATISTICS

DEPARTMENT OF ECONOMETRICS AND BUSINESS STATISTICS ISSN 1440-771X AUSTRALIA DEPARTMENT OF ECONOMETRICS AND BUSINESS STATISTICS An Iproved Method for Bandwidth Selection When Estiating ROC Curves Peter G Hall and Rob J Hyndan Working Paper 11/00 An iproved

More information

An Improved Particle Filter with Applications in Ballistic Target Tracking

An Improved Particle Filter with Applications in Ballistic Target Tracking Sensors & ransducers Vol. 72 Issue 6 June 204 pp. 96-20 Sensors & ransducers 204 by IFSA Publishing S. L. http://www.sensorsportal.co An Iproved Particle Filter with Applications in Ballistic arget racing

More information

Algorithms for parallel processor scheduling with distinct due windows and unit-time jobs

Algorithms for parallel processor scheduling with distinct due windows and unit-time jobs BULLETIN OF THE POLISH ACADEMY OF SCIENCES TECHNICAL SCIENCES Vol. 57, No. 3, 2009 Algoriths for parallel processor scheduling with distinct due windows and unit-tie obs A. JANIAK 1, W.A. JANIAK 2, and

More information

Ştefan ŞTEFĂNESCU * is the minimum global value for the function h (x)

Ştefan ŞTEFĂNESCU * is the minimum global value for the function h (x) 7Applying Nelder Mead s Optiization Algorith APPLYING NELDER MEAD S OPTIMIZATION ALGORITHM FOR MULTIPLE GLOBAL MINIMA Abstract Ştefan ŞTEFĂNESCU * The iterative deterinistic optiization ethod could not

More information

Spine Fin Efficiency A Three Sided Pyramidal Fin of Equilateral Triangular Cross-Sectional Area

Spine Fin Efficiency A Three Sided Pyramidal Fin of Equilateral Triangular Cross-Sectional Area Proceedings of the 006 WSEAS/IASME International Conference on Heat and Mass Transfer, Miai, Florida, USA, January 18-0, 006 (pp13-18) Spine Fin Efficiency A Three Sided Pyraidal Fin of Equilateral Triangular

More information

arxiv: v1 [cs.ds] 29 Jan 2012

arxiv: v1 [cs.ds] 29 Jan 2012 A parallel approxiation algorith for ixed packing covering seidefinite progras arxiv:1201.6090v1 [cs.ds] 29 Jan 2012 Rahul Jain National U. Singapore January 28, 2012 Abstract Penghui Yao National U. Singapore

More information