Model-based mixture discriminant analysis — an experimental study

Zohar Halbe and Mayer Aladjem
Department of Electrical and Computer Engineering, Ben-Gurion University of the Negev,
P.O. Box 653, Beer-Sheva, 84105, Israel

Abstract

The subject of this paper is an experimental study of a discriminant analysis (DA) based on Gaussian mixture estimation of the class-conditional densities. Five parameterizations of the covariance matrices of the Gaussian components are studied. A recommendation for the selection of a suitable parameterization of the covariance matrices is given.

Keywords: Discriminant analysis, Gaussian mixture model, Density estimation, Model selection.

1. Introduction

Discriminant analysis (DA) is a powerful technique for classifying observations into known preexisting classes. In the Bayesian decision framework [1], a common assumption is that the observed d-dimensional patterns x (x ∈ R^d) are characterized by the class-conditional density f_c(x), for each class c = 1, 2, ..., C. Let P_c denote the prior probability of class c. According to Bayes' theorem, the posterior probability that an arbitrary observation x belongs to class c is

  P(Class c | x) = P_c f_c(x) / Σ_{j=1}^{C} P_j f_j(x).

The classification rule allocating x to the class having the highest posterior probability P(Class c | x) minimizes the expected misclassification rate [1]. The latter rule is named the Bayes classification rule.

In this paper we assume that f_c(x) is a mixture of M_c normal (Gaussian) densities

  f_c(x) = Σ_{j=1}^{M_c} π_{cj} N(x | μ_{cj}, Σ_{cj}).   (1)

Here the π_{cj} are mixing coefficients, which are non-negative and sum to one, and N(x | μ_{cj}, Σ_{cj}) denotes a multivariate normal density with mean vector μ_{cj} and covariance matrix Σ_{cj}. Usually the covariance matrices Σ_{cj} are taken to be full (unrestricted), diagonal or spherical. The fitting of the parameters π_{cj}, μ_{cj} and Σ_{cj} is carried out by maximizing the likelihood of the parameters on the training data {x_1, x_2, ..., x_{n_c}} of observations from class c.
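To make the decision rule concrete, the following minimal sketch fits one Gaussian mixture per class and applies the Bayes classification rule. It is illustrative rather than the authors' implementation: it uses scikit-learn's GaussianMixture, which provides only the conventional spherical, diagonal and full covariance types (not the pooled parameterizations studied below), and the helper names fit_class_mixtures and bayes_classify are ours.

```python
# Bayes classification with Gaussian-mixture class-conditional densities.
# Minimal sketch using scikit-learn; hypothetical helper names.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_class_mixtures(X, y, n_components=2, covariance_type="full"):
    """Fit one mixture f_c(x) per class c and estimate the priors P_c."""
    models, priors = {}, {}
    for c in np.unique(y):
        Xc = X[y == c]
        models[c] = GaussianMixture(n_components=n_components,
                                    covariance_type=covariance_type,
                                    random_state=0).fit(Xc)
        priors[c] = len(Xc) / len(X)        # P_c estimated by class frequency
    return models, priors

def bayes_classify(X, models, priors):
    """Allocate each x to the class maximizing P_c f_c(x) (Bayes rule)."""
    classes = sorted(models)
    # log P_c + log f_c(x); the normalizer sum_j P_j f_j(x) is common to all
    # classes and cancels in the argmax.
    scores = np.column_stack([np.log(priors[c]) + models[c].score_samples(X)
                              for c in classes])
    return np.asarray(classes)[np.argmax(scores, axis=1)]
```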
The subject of this paper is an experimental study of a DA based on an extended parameterization of Σ_{cj}, proposed in [3] and named model-based DA [2]. In addition, we modify the model selection procedure proposed in [2], which overcomes the setting of the maximal number of components for the mixture (1). To the best of our knowledge this is the first extensive experimental study of the model-based DA.

The paper is organized as follows: in Section 2 we describe the model-based DA, experimental results are provided in Section 3 and conclusions in Section 4.

2. Model-based DA

Following [2, 3], the parameterizations of the covariance matrices Σ_{cj} are taken to be: Σ_{cj} = λ_{cj} I, where I denotes the identity matrix and the λ_{cj} are adjustable parameters; Σ_{cj} = λ_{cj} B_c and Σ_{cj} = λ_{cj} B_{cj}, where B_c and B_{cj} are adjustable diagonal matrices having positive diagonal elements; Σ_{cj} = λ_{cj} D_c and Σ_{cj} = D_{cj}, where D_c and D_{cj} are adjustable positive definite symmetric matrices, D_c being restricted to have unit determinant (|D_c| = 1). Note that Σ_{cj} = λ_{cj} I, λ_{cj} B_{cj} and D_{cj} are the spherical, diagonal and full covariance matrices, respectively, used in the conventional mixture models (1). The other parameterizations, Σ_{cj} = λ_{cj} B_c and λ_{cj} D_c, define f_c(x) (1) having the same matrices B_c and D_c for all mixture components of f_c(x). The goal is to reduce the number ν_c of adjustable parameters of f_c(x) while ensuring flexibility of the mixture model by adjusting the parameters λ_{cj} for each component. Table 1 gives the expressions for ν_c for the parameterizations of Σ_{cj} used in this paper (these counts are evaluated programmatically in the sketch below).

Table 1: Expressions for the parameterizations of Σ_{cj} and the number ν_c of adjustable parameters of f_c(x).

  Σ_{cj} = λ_{cj} I:
    λ_{cj} = tr(W_{cj}) / (d n_{cj});
    ν_c = d(M_c + 1) + M_c − 1

  Σ_{cj} = λ_{cj} B_c:
    λ_{cj} = tr(W_{cj} B_c^{-1}) / (d n_{cj}),
    B_c = diag(Σ_j W_{cj}/λ_{cj}) / |diag(Σ_j W_{cj}/λ_{cj})|^{1/d};
    ν_c = d(M_c + 1) + 2(M_c − 1)

  Σ_{cj} = λ_{cj} B_{cj}:
    λ_{cj} = tr(W_{cj} B_{cj}^{-1}) / (d n_{cj}),
    B_{cj} = diag(W_{cj}) / |diag(W_{cj})|^{1/d};
    ν_c = 2 M_c d + M_c − 1

  Σ_{cj} = λ_{cj} D_c:
    λ_{cj} = tr(W_{cj} D_c^{-1}) / (d n_{cj}),
    D_c = (Σ_j W_{cj}/λ_{cj}) / |Σ_j W_{cj}/λ_{cj}|^{1/d};
    ν_c = 2(M_c − 1) + M_c d + d(d + 1)/2

  Σ_{cj} = D_{cj}:
    D_{cj} = W_{cj} / n_{cj};
    ν_c = M_c d + M_c − 1 + M_c d(d + 1)/2

(Here W_{cj} is the weighted scatter matrix of component j, defined in the M-step below, and Σ_j denotes summation over the M_c components.)

For a predefined number of components M_c and parameterization of Σ_{cj} (described above), the parameters μ_{cj}, Σ_{cj} and π_{cj} are determined by an expectation-maximization (EM) algorithm proposed in [3]. The EM algorithm is initialized by the k-means clustering algorithm [1]: using the latter algorithm, the training data {x_1, x_2, ..., x_{n_c}} is partitioned into M_c groups G_{cj}, j = 1, 2, ..., M_c.
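For reference, the parameter counts of Table 1 can be evaluated mechanically. The sketch below transcribes the ν_c expressions as reconstructed above; the string labels are our own shorthand.

```python
# Number nu_c of adjustable parameters of f_c(x) per covariance
# parameterization (Table 1, reconstructed expressions).
def nu_c(model, M, d):
    """M: number of components M_c; d: data dimension."""
    if model == "lambda I":       # spherical, one lambda_cj per component
        return d * (M + 1) + M - 1
    if model == "lambda B_c":     # diagonal B_c shared by all components
        return d * (M + 1) + 2 * (M - 1)
    if model == "lambda B_cj":    # diagonal, per component
        return 2 * M * d + M - 1
    if model == "lambda D_c":     # full D_c shared by all components, |D_c| = 1
        return 2 * (M - 1) + M * d + d * (d + 1) // 2
    if model == "D_cj":           # full (unrestricted), per component
        return M * d + M - 1 + M * d * (d + 1) // 2
    raise ValueError(f"unknown model: {model}")
```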
An indicator vector z_{ic} = (z_{ic1}, z_{ic2}, ..., z_{icM_c}) is associated with each observation x_i from class c: z_{icj} = 1 for x_i ∈ G_{cj} and z_{icj} = 0 otherwise, for c = 1, 2, ..., C and j = 1, 2, ..., M_c. The EM algorithm alternates between two steps, an expectation step (E-step) and a maximization step (M-step), until a convergence criterion is satisfied. Using the initial z_{icj} and the initial λ_{cj} = 1, for j = 1, ..., M_c, the following M- and E-steps are cycled:

M-step:

  n_{cj} = Σ_{i=1}^{n_c} z_{icj},   μ_{cj} = (1/n_{cj}) Σ_{i=1}^{n_c} z_{icj} x_i,   π_{cj} = n_{cj}/n_c,

  W_{cj} = Σ_{i=1}^{n_c} z_{icj} (x_i − μ_{cj})(x_i − μ_{cj})^T.

Calculate Σ_{cj} using the expressions in Table 1.

E-step:

  z_{icj} = π_{cj} N(x_i | μ_{cj}, Σ_{cj}) / Σ_{j'=1}^{M_c} π_{cj'} N(x_i | μ_{cj'}, Σ_{cj'}).

In the model-based DA [2], each combination of the number M_c of components of f_c(x) (1) and the parameterization of Σ_{cj} corresponds to a certain Gaussian mixture model. In [2] the Bayesian information criterion (BIC) is used for model selection. The BIC(M_c) for f_c(x) (1) having M_c components is

  BIC(M_c) = −2 Σ_{i=1}^{n_c} log f_c(x_i) + ν_c log(n_c),   (2)

where f_c(x_i) is the mixture model (1) fitted to the training data by the EM algorithm. Using BIC(M_c) (2), the number M_c and the parameterization of Σ_{cj} are set by the following procedure. For each parameterization of Σ_{cj}:

(1) Set M_c = 0, BIC(0) = ∞.
(2) Update M_c = M_c + 1. Apply the EM algorithm for fitting f_c(x). Then compute BIC(M_c) (2).
(3) If BIC(M_c) > BIC(M_c − 1), then set M_c = M_c − 1; else repeat steps (2)-(3).

Finally, we select the model (the parameterization of Σ_{cj} and the number M_c of components) corresponding to the minimal value of BIC. This procedure is a modification of the original proposal [2]. It overcomes the problem of setting the maximal number M_max of components required in [2] and reduces the computational complexity.
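The fitting-and-selection loop above can be sketched compactly for a single class and the spherical parameterization Σ_{cj} = λ_{cj} I; the M-steps for the other parameterizations follow the expressions of Table 1 (see Celeux and Govaert [3]). This is a minimal illustration under our own choices of tolerance, guards and initialization details, not the authors' code.

```python
# EM (k-means initialized) and BIC-driven selection of M_c for one class,
# spherical model Sigma_cj = lambda_cj * I. Minimal sketch.
import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.stats import multivariate_normal

def em_spherical(X, M, n_iter=100, tol=1e-6):
    n, d = X.shape
    # k-means initialization of the indicators z_icj
    _, labels = kmeans2(X, M, minit="++", seed=0)
    z = np.zeros((n, M))
    z[np.arange(n), labels] = 1.0
    loglik_old = -np.inf
    for _ in range(n_iter):
        # M-step: n_cj, mu_cj, pi_cj and lambda_cj = tr(W_cj)/(d n_cj)
        n_j = z.sum(axis=0) + 1e-10           # guard against empty components
        mu = (z.T @ X) / n_j[:, None]
        pi = n_j / n
        lam = np.array([(z[:, j] * ((X - mu[j]) ** 2).sum(axis=1)).sum()
                        / (d * n_j[j]) for j in range(M)])
        lam = np.maximum(lam, 1e-10)          # keep Sigma_cj invertible
        # E-step: responsibilities z_icj
        dens = np.column_stack([
            pi[j] * multivariate_normal.pdf(X, mu[j], lam[j] * np.eye(d))
            for j in range(M)])
        total = dens.sum(axis=1, keepdims=True)
        z = dens / total
        loglik = np.log(total).sum()
        if loglik - loglik_old < tol:         # convergence criterion
            break
        loglik_old = loglik
    return loglik

def select_M_by_bic(X):
    """Increase M_c until BIC(M_c) (Eq. (2)) stops decreasing."""
    n, d = X.shape
    M, bic_prev = 0, np.inf
    while True:
        M += 1
        # nu_c for the spherical model, per Table 1
        bic = -2.0 * em_spherical(X, M) + (d * (M + 1) + M - 1) * np.log(n)
        if bic > bic_prev:
            return M - 1                      # previous model was the minimum
        bic_prev = bic
```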
3. Experiments

In this section we investigated the performance of the model-based DA. For comparison we ran the probabilistic principal component analysis (PPCA) mixture density estimation [4], which is a popular latent variable method in the pattern recognition literature. We carried out experiments on various data sets, artificial and real-world, from the UCI Machine Learning Repository at http://www.ics.uci.edu/~mlearn/mlrepository.html and from Gunnar Rätsch at http://www.first.gmd.de/~raetsch/data/banana.txt.

In addition to those benchmark data sets we composed a data set named MODIFIED LETTER. It merges the classes (26 letters) of the benchmark data set LETTER into two groups. We set the letters O, U, P, S, X, Z, E, B, F, T, W, A, Q to be Group 1 and the letters H, D, N, C, R, G, K, Y, V, M, I, J, L to be Group 2, using cluster analysis of the letter means. The goal was to obtain complicated (highly overlapping) groups. In our experiments with MODIFIED LETTER we set a training sample size of 300 observations, selected randomly from the original LETTER data.

Table 2 shows the characteristics of the data sets: C is the number of classes, S is the averaged number of observations per class (S = n/C, with n the total number of observations), p is the original dimension of the observations and d is the reduced dimension after preprocessing by principal component analysis [1] (sketched below). We indicate by * the data sets having small sample sizes (S/d ≤ 10).

Table 2: Data set characteristics.

  DATA               C    p    d    S/d
  SONAR*             2    60   29   2.86
  GLASS*             6    9    6    5.83
  IONOSPHERE*        2    34   30   5.83
  VOWEL*             11   13   9    9.88
  MODIFIED LETTER*   2    16   15   10
  IRIS               3    4    3    16.66
  WINE               3    13   2    29.66
  LIVER DISORDERS    2    7    5    27.6
  GAUSSIAN           2    8    7    35.7
  WAVEFORM-NOISE     3    40   40   41.65
  LETTER             26   16   15   51.26
  WAVEFORM           3    21   21   79.33
  BANANA             2    2    2    250
  BANANA-NOISE       2    2    2    250

Figure 1: Correct classification rate versus the number of components for the MODIFIED LETTER data set.
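The PCA preprocessing that reduces the original dimension p to d (Table 2) can be sketched as follows; the rule used to choose d is not stated in the paper, so d is taken as an input here.

```python
# Reduce n x p data to its first d principal components (minimal sketch).
import numpy as np

def pca_reduce(X, d):
    Xc = X - X.mean(axis=0)                    # center the data
    # rows of Vt are the principal directions (right singular vectors)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:d].T                       # n x d component scores
```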
Each experiment consisted of 10 runs of the following procedure. We randomly drew a data set with replacement from the original data and split the data into five roughly equal-sized parts. Then we fitted the parameters of f_c(x) (1) and selected the best model (M_c and parameterization of Σ_{cj}) using four parts of the data, and calculated the correct classification rate, allocating the observations from the left-out part by the Bayes classification rule (Section 1). We carried out five replications using a different part for classification each time. Finally, we averaged the classification rate over the 10 × 5 random runs. (When a covariance matrix Σ_{cj} happened to be singular, the zero eigenvalues were replaced by a small number just large enough to permit numerically stable inversion; a sketch of this fix follows Table 3.)

In Table 3 we report the obtained averaged percentage of the correct classification rate along with the standard errors. In parentheses we report the averaged number of the model parameters (ν̄, summed over the classes). The highest score in each row indicates the best model for that data set.

Table 3: Averaged percentage of correct classification rates. Columns λ_{cj}I through D_{cj} are the DA with f_c(x) having different parameterizations of Σ_{cj}; the last column is the PPCA mixture.

  DATA              λ_cj I           λ_cj B_c         λ_cj B_cj        λ_cj D_c         D_cj              PPCA
  SONAR*            71.9±0.89(135)   74.8±0.85(197)   74.7±0.92(232)   81.3±0.64(996)   80.9±0.62(928)    79.8±0.73(456)
  GLASS*            57.9±0.98(147)   62.2±0.7(75)     61.5±0.82(94)    61.0±0.82(279)   61.0±0.82(279)    49.0±0.92(90)
  IONOSPHERE*       91.4±0.42(375)   92.5±0.40(404)   92.5±0.39(47)    91.9±0.40(1006)  83.1±0.69(1496)   81.7±0.52(26)
  VOWEL*            85.3±0.40(760)   86.2±0.40(808)   87.2±0.40(1084)  89.9±0.28(1083)  87.7±0.35(868)    85.4±0.28(680)
  MODIFIED LETTER*  79.4±0.73(203)   81.6±0.52(229)   79.8±0.52(242)   82.0±0.56(330)   81.7±0.50(273)    81.0±0.6(258)
  IRIS              96.1±0.46(40)    95.5±0.53(36)    94.8±0.50(35)    96.9±0.4(27)     96.9±0.4(27)      97.4±0.35(28)
  WINE              96.0±0.36(37)    96.8±0.36(37)    96.6±0.4(105)    98.3±0.30(279)   98.7±0.28(270)    98.4±0.28(37)
  LIVER DISORDERS   66.1±0.50(7)     65.1±0.49(63)    64.1±0.63(70)    67.4±0.65(66)    67.9±0.69(95)     60.7±0.7(56)
  GAUSSIAN          69.8±0.57(100)   71.6±0.64(84)    69.8±0.62(86)    81.0±0.48(73)    84.8±0.35(105)    73.4±0.57(103)
  WAVEFORM-NOISE    86.7±0.4(446)    86.5±0.3(550)    86.5±0.3(807)    84.0±0.4(2580)   84.0±0.4(2580)    86.0±0.4(249)
  LETTER            82.7±0.2(5450)   89.0±0.10(564)   90.0±0.07(833)   91.7±0.07(6987)  94.73±0.07(3384)  84.0±0.07(3380)
  WAVEFORM          86.5±0.10(297)   86.4±0.09(355)   86.3±0.1(427)    85.1±0.3(787)    85.1±0.2(756)     86.0±0.2(35)
  BANANA            92.9±0.2(65)     94.2±0.10(74)    94.2±0.10(72)    94.3±0.10(75)    95.1±0.06(6)      95.1±0.07(7)
  BANANA-NOISE      75.3±0.4(37)     75.1±0.5(36)     75.4±0.5(36)     75.4±0.4(46)     75.8±0.6(37)      75.8±0.6(45)
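The singular-covariance fix mentioned above can be sketched as an eigenvalue flooring; the size of the floor is our assumption, since the paper only asks for "a small number just large enough to permit numerically stable inversion".

```python
# Replace (near-)zero eigenvalues of a covariance matrix by a small floor
# so that it can be inverted stably. Minimal sketch; floor is an assumption.
import numpy as np

def regularize_covariance(S, rel_floor=1e-8):
    w, V = np.linalg.eigh(S)                   # S is symmetric
    w = np.maximum(w, rel_floor * max(w.max(), 1.0))
    return (V * w) @ V.T                       # reassemble V diag(w) V^T
```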
For the MODIFIED LETTER data set we ran an additional experiment. We trained the class-conditional densities (1) using the selected 300 training observations and calculated the correct classification rate, using the Bayes classification rule, on the remaining 19700 LETTER observations. In Fig. 1 we show the averaged classification rates versus the number ν_1 + ν_2 of parameters of the model. We constructed the graphs using the same number of components for each class (M_1 = M_2 = M), varying M = 1, 2, ..., 20.

Observing the results in Table 3 and Fig. 1, we conclude that the model λ_{cj} D_c outperforms the other models for the data sets SONAR, VOWEL and MODIFIED LETTER, which have small sample sizes (S/d ≤ 10), and that the model λ_{cj} B_c outperforms the model λ_{cj} B_{cj} for most of the data sets. Consequently, it seems that the models Σ_{cj} = λ_{cj} I, λ_{cj} B_c and Σ_{cj} = λ_{cj} D_c cover a wide range of data structures and could be considered as the basic set of models for the model-based DA.

Looking at Table 3 for the MODIFIED LETTER data set, we observe that using BIC we selected the models λ_{cj} B_c and λ_{cj} D_c having ν_1 + ν_2 = 229 and 330, respectively. Observing the results in Fig. 1, we conclude that for those models (λ_{cj} B_c, λ_{cj} D_c) the best generalization performance (the highest classification rate on the 19700 test observations) is obtained for ν_1 + ν_2 > 800. Consequently, the model selection using BIC underestimates the complexity of f_c(x) used for Bayes classification. An improvement of the model selection is needed; the latter is a subject of our future research.

4. Conclusions

In this paper we have investigated the performance of the model-based DA [2, 3] on various data sets, artificial and real-world. The results obtained in the experimental study show that Σ_{cj} = λ_{cj} I, λ_{cj} B_c and λ_{cj} D_c seem to be a universal set of models that can be recommended for practical applications to small sample size data sets. In addition, we proposed a modification of the model selection procedure of [2], which overcomes the setting of the maximal number of components in the mixture (1) and reduces the computational complexity of the model selection.

References

[1] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, Springer, Canada, 2001.
[2] C. Fraley, A.E. Raftery, Model-Based Clustering, Discriminant Analysis, and Density Estimation, Journal of the American Statistical Association, Vol. 97, pp. 611-631, June 2002.
[3] G. Celeux, G. Govaert, Gaussian parsimonious clustering models, Pattern Recognition, Vol. 28, No. 5, pp. 781-793, 1995.
[4] M.E. Tipping, C.M. Bishop, Mixtures of Probabilistic Principal Component Analyzers, Neural Computation, Vol. 11, pp. 443-482, February 1999.