Heteroscedastic Variance Covariance Matrices for Unbiased Two Groups Linear Classification Methods

Applied Mathematical Sciences, Vol. 7, 2013, no. 138, 6855-6865
HIKARI Ltd, www.m-hikari.com
http://dx.doi.org/10.12988/ams.2013.39486

Friday Zinzendoff Okwonu
Department of Mathematics and Computer Science, Faculty of Science, Delta State University, P.M.B. 1, Abraka, Nigeria
& School of Distance Education, Universiti Sains Malaysia, Pulau Pinang, Malaysia
Fzokwonu_delsu@yahoo.com

Abdul Rahman Othman
School of Distance Education, Universiti Sains Malaysia, Pulau Pinang, Malaysia
oarahman@usm.my

Copyright © 2013 Friday Zinzendoff Okwonu and Abdul Rahman Othman. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

This paper investigates the comparative classification performance of the conventional Fisher linear classification analysis, the robust Fisher linear classification analysis based on the minimum covariance determinant, and the M linear classification technique. These procedures are investigated based on heteroscedastic variance covariance matrices and data sets generated using symmetric, asymmetric and combined contamination models. The performance analysis is based on the comparison between the mean of the optimal probability and the mean probabilities of correct classification obtained for each technique. The comparative analysis revealed that the M linear classification rule is robust and stable for increasing proportions of contamination, outperformed the other techniques for the small sample size and dimension, and performed comparably with the Fisher linear classification analysis for the large sample size. On the other hand, Fisher's technique outperformed its robust version for all cases investigated.

Mathematics Subject Classification: 62H99, 6M0

Keywords: Linear Classification, Robust, Mean Probability

1. Introduction

The Fisher linear classification analysis (FLCA) [1] was proposed for the purpose of discrimination and classification. In this paper, we focus on the classification aspect. This procedure was advanced under the assumptions that the data set comes from a multivariate normal distribution and that the variance covariance matrices are homoscedastic [2]. It has been suggested that when the data are not normally distributed, the mean vectors and covariance matrices are influenced by outliers; hence various proposals have been made to robustify these parameters. The maximum likelihood estimator (M estimator) [3], the generalized maximum likelihood estimator (GM estimator) [4], the smooth estimator (S estimator) [5], the minimum volume ellipsoid (MVE) [6] and the minimum covariance determinant estimator (MCD) [7] were proposed to robustify the mean vectors and covariance matrices. The robustified mean vectors and covariance matrices are plugged into the conventional multivariate procedures to obtain robust multivariate techniques, including the FLCA technique. The MCD procedure has been applied to robustify the linear discriminant analysis and the quadratic discriminant analysis [8].
The MCD procedure depends strictly on information gleaned from the half set, which does not utilize the entire data set. This procedure is a data cleaning technique that is used as a preprocessing step before implementing the technique of interest. Details of this robust high breakdown method and its application to classification are contained in [9]. As discussed above, robustification of the conventional procedures has been the major consideration in recent times, while little attention has been paid to heteroscedastic variance covariance matrices with respect to linear classification techniques. Most authors suggest that when the variance covariance matrices are not equal, the quadratic discriminant analysis be applied [2].

Hubert and Van Driessen [8] robustified the mean vectors and covariance matrices when the variance covariance matrices are heteroscedastic and applied these parameters to the quadratic discriminant analysis. They compared both the conventional and robust quadratic discrimination procedures. Lachenbruch [10] also applied the quadratic discriminant function when the variance covariance matrices are heteroscedastic. Gilbert [11] studied unequal variance covariance matrices for the quadratic discriminant function when the sample means and covariance matrices are known, and concluded that this technique is optimal but that, if the variance covariance matrices are not too different, Fisher's procedure performs almost the same as the quadratic discriminant function. Marks and Dunn [12] also investigated unequal variance covariance matrices when the separation parameters are estimated from initial samples. Misra [13] studied the effect of unequal covariance matrices on the linear discriminant function by citing the case of natural hybridization between organisms. Kumar and Andreou [14] developed heteroscedastic discriminant analysis as a theoretical framework for the generalization of the linear discriminant analysis, using maximum likelihood to handle unequal variance covariance matrices. They observed that Fisher's technique is not a technique of choice when the variance covariance matrices are unequal. Kumar and Andreou [15, 16] proposed the heteroscedastic procedure by dropping the homogeneity assumption of the covariance matrix.

Having considered the various proposals and justifications for using the quadratic discriminant analysis instead of the conventional Fisher procedure when the variance covariance matrices are heterogeneous, a comparable and robust M linear classification rule is proposed. This technique uses the within-group median to compute its separation parameters and hence develops the classification rule.
This approach and the known methods are investigated using the mean of the optimal probability as the performance benchmark. This paper considers the classification procedures based on Fisher's procedure, its robust version based on the minimum covariance determinant, and the M linear classification rule. The classification performance of these techniques is investigated when the assumptions are violated. The remainder of this paper is organized as follows: the linear classification techniques, the Monte Carlo simulations and the conclusion are presented sequentially.

2. Conventional Fisher Linear Classification Analysis (FLCA)

Conventionally, the Fisher linear classification analysis (FLCA) is fundamentally a dimension reduction technique that encompasses separation. The FLCA procedure is a linear combination of measured variables that best describes the allocation of individuals or observations to known groups. The coefficient of this procedure is obtained by post-multiplying the inverse of the pooled covariance matrix by the difference of the within-group mean vectors. In mathematical form, let $w$ denote the classification score, $u$ the coefficient vector, which is a non-zero ($u \neq 0$) $d$-dimensional vector, $u'$ the transpose of the coefficient vector, $x$ the vector of observations, $n_i$ the sample size of group $i$, and $\bar{w}$ the midpoint, a scalar. The Fisher linear classification rule assigns an observation $x$ to group one $\Pi_1$ if

\[
w = u'x \geq \bar{w}, \tag{1}
\]

otherwise to group two $\Pi_2$ if

\[
w = u'x < \bar{w}, \tag{2}
\]

where

\[
u = S_{\mathrm{pooled}}^{-1}(\bar{x}_1 - \bar{x}_2), \qquad
\bar{w} = \frac{u'(\bar{x}_1 + \bar{x}_2)}{2},
\]

\[
S_{\mathrm{pooled}} = \frac{\sum_{i=1}^{2}(n_i - 1)S_i}{\sum_{i=1}^{2} n_i - 2}, \qquad
S_i = \frac{\sum_{j=1}^{n_i}(x_{ij} - \bar{x}_i)(x_{ij} - \bar{x}_i)'}{n_i - 1}, \qquad
i = 1, 2; \; j = 1, 2, \ldots, n_i,
\]

and $\bar{x}_i$ is the mean vector of group $i$. This classification procedure allocates an observation $x$ to group one if the classification score is greater than or equal to the midpoint, that is, $w \geq \bar{w}$; otherwise the observation is assigned to group two if the classification score is less than the midpoint, that is, $w < \bar{w}$.

3. Fisher Linear Classification Analysis Based on Minimum Covariance Determinant (FMCD)

The minimum covariance determinant procedure searches for the subset of $h$ observations (out of $n$) whose covariance matrix has the minimum determinant [8]. The sample observations in the half set are chosen from the multivariate data set to obtain the MCD estimates of the mean vectors and covariance matrices. These robust estimates are computed from the clean data selected by the half set. The MCD estimates are plugged into Fisher's equations, Eq. (1) and Eq. (2), to obtain the robust Fisher linear classification rule [8]. This procedure can be expressed mathematically as follows:

\[
v = \left[S_{\mathrm{mcdpooled}}^{-1}(\bar{x}_{\mathrm{mcd}1} - \bar{x}_{\mathrm{mcd}2})\right]' x = u_{\mathrm{mcd}}'x, \tag{3}
\]

\[
\bar{v} = \frac{u_{\mathrm{mcd}}'(\bar{x}_{\mathrm{mcd}1} + \bar{x}_{\mathrm{mcd}2})}{2},
\]

\[
\bar{x}_{\mathrm{mcd}} = \frac{\sum_{j=1}^{n} l_j x_j}{\sum_{j=1}^{n} l_j}, \qquad
S_{\mathrm{mcd}} = \Omega\,\frac{\sum_{j=1}^{n} l_j (x_j - \bar{x}_{\mathrm{mcd}})(x_j - \bar{x}_{\mathrm{mcd}})'}{\sum_{j=1}^{n} l_j}, \qquad
\Omega = \frac{\alpha}{F_{\chi^2_{d+2}}(\chi^2_{d,\alpha})},
\]

\[
l_j = \begin{cases} 1 & \text{if } \Lambda(\bar{x}_{\mathrm{mcd}}, S_{\mathrm{mcd}}) \leq \chi^2_{d,\alpha}, \; \alpha = 0.975, \\ 0 & \text{otherwise.} \end{cases}
\]

The symbol $\Omega$ is the correction factor required to obtain unbiased and consistent estimates if the data set comes from a multivariate normal distribution [17-23]; $\bar{x}_{\mathrm{mcd}}$ and $S_{\mathrm{mcd}}$ are the MCD estimates and $\Lambda$ is the squared Mahalanobis distance. The correction factor is used by the FAST-MCD algorithm to compute the MCD estimates. A detailed description and the theorem used to compute the concentration steps based on the half set of the MCD technique are contained in [20, 22], respectively. The robust Fisher linear classification score is denoted by $v$, $u_{\mathrm{mcd}}$ is the robust linear classification coefficient and $\bar{v}$ is the robust midpoint. The classification procedure is described as follows: an observation $x$ is classified to group one $\Pi_1$ if

\[
v - \bar{v} \geq 0,
\]

otherwise the observation $x$ is assigned to group two $\Pi_2$ if

\[
v - \bar{v} < 0. \tag{4}
\]

4. M Linear Classification Rule (MLCR)

It has been observed that an unstable linear classification coefficient leads to a high misclassification rate. The objective is to develop a linear classification procedure with a stable coefficient. To achieve this objective, we modify the conventional linear classification coefficient. In this regard, the proposed robust procedure uses the inverse of the square root of the generalized variance to obtain the coefficient. In this section, we use a robust measure, the within-group median, in place of the mean vector. As noted in [24], the median has a bounded influence function. Stromberg [25] observed that the probability of the median taking an influential observation as its center is equal to the probability of taking a regular observation as its center. The proposed technique is described as follows:

\[
z = \frac{(\hat{x}_1 - \hat{x}_2)'}{\left(\sum_{i=1}^{2} S_i\right)^{1/2}}\, x, \qquad
S_i = \frac{\sum_{j=1}^{n_i}(x_{ij} - \hat{x}_i)(x_{ij} - \hat{x}_i)'}{n_i - 1}, \qquad i = 1, 2. \tag{5}
\]

From the above equation, the unbiased sample covariance matrices are computed about the within-group median $\hat{x}_i$ of the sample observations. The comparative cutoff point $\bar{z}$ is defined as

\[
\bar{z} = \frac{(\hat{x}_1 + \hat{x}_2)'(\hat{x}_1 - \hat{x}_2)}{2\left(\sum_{i=1}^{2} S_i\right)^{1/2}}. \tag{6}
\]

The classification rule is obtained by comparing the classification score with the comparative cutoff point, that is,

\[
\frac{(\hat{x}_1 - \hat{x}_2)'}{\left(\sum_{i=1}^{2} S_i\right)^{1/2}}\, x \;\geq\; \frac{(\hat{x}_1 + \hat{x}_2)'(\hat{x}_1 - \hat{x}_2)}{2\left(\sum_{i=1}^{2} S_i\right)^{1/2}}. \tag{7}
\]

When Eq. (7) holds, an observation in group one $\Pi_1$ is correctly assigned to group one; otherwise the observation is assigned to group two $\Pi_2$ if the following inequality is satisfied:

\[
\frac{(\hat{x}_1 - \hat{x}_2)'}{\left(\sum_{i=1}^{2} S_i\right)^{1/2}}\, x \;<\; \frac{(\hat{x}_1 + \hat{x}_2)'(\hat{x}_1 - \hat{x}_2)}{2\left(\sum_{i=1}^{2} S_i\right)^{1/2}}. \tag{8}
\]

5. Simulation

The Monte Carlo simulation is designed to investigate the comparative classification performance of the above techniques for heteroscedastic variance covariance matrices based on symmetric, asymmetric and combined contamination models. The contamination model $(1 - \varepsilon)N_d(0, 1) + \varepsilon N_d(\mu, \sigma^2 I_d)$ requires that the majority of the data come from the uncontaminated distribution while the rest come from the contaminating distribution. The robustness of these procedures is investigated by varying the proportion of contamination across different sample sizes and corresponding dimensions. The data set is divided into a training set (60%) and a validation set (40%). In each case, the data are randomly reshuffled. To determine the performance of each procedure, the mean of the optimal probability (OPT) is used as the performance benchmark. The comparative analyses are based on comparing the mean of the optimal probability with the mean probabilities of correct classification obtained from each technique. In the following tables, the best procedure appears in bold, the second best in bold italics, and the misclassification rate for each method is underlined.

In Table 1 below, the simulation revealed that the MLCR technique is robust for all values of ε. For 10% and 20% contamination, the FLCA outperformed its robust version. The FMCD outperformed the FLCA for 30% contamination.

Table 1. Mean Probability of Correct Classification and Standard Deviation (in Brackets), Symmetric Contamination (Optimal = 0.9049)

Con. Dist.  n    d   ε    FLCA      FMCD      MLCR      OPT-FLCA  OPT-FMCD  OPT-MLCR
(0,6)       30   2   10   0.8979    0.8787    0.9003    0.007     0.06      0.0046
                          (0.00)    (0.07)    (0.0039)
(0,6)       30   2   20   0.8638    0.869     0.8799    0.04      0.047     0.050
                          (0.0044)  (0.0073)  (0.006)
(0,6)       30   2   30   0.8093    0.83      0.8488    0.956     0.0737    0.056
                          (0.006)   (0.0074)  (0.006)
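The simulation design described in this section can be sketched in code. The following Python sketch is illustrative only and is not the authors' implementation: it draws two groups from the contamination model, makes the 60/40 training/validation split, and compares an FLCA rule with a median-based rule in the spirit of the MLCR. Several ingredients are assumptions that the paper does not state: the second group is taken to be the same model shifted by a hypothetical constant `shift`; the contamination parameters passed to `draw` (μ = 0, σ = 4) are placeholder values; and the scaling term in Eqs. (5)-(8) is read as a positive scalar, in which case it cancels from both sides of rule (7) and is omitted. All function names are hypothetical, and the FMCD rule is left out because it needs an MCD estimator.

```python
import random
import statistics


def center(rows, how):
    """Column-wise mean (FLCA) or median (MLCR) of a list of d-vectors."""
    f = statistics.mean if how == "mean" else statistics.median
    return [f(col) for col in zip(*rows)]


def scatter(rows, c):
    """Sum of outer products of deviations from the centre c (d x d list)."""
    d = len(c)
    s = [[0.0] * d for _ in range(d)]
    for r in rows:
        dev = [r[k] - c[k] for k in range(d)]
        for a in range(d):
            for b in range(d):
                s[a][b] += dev[a] * dev[b]
    return s


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_flca(g1, g2):
    """Eqs. (1)-(2): u = S_pooled^{-1}(mean1 - mean2), midpoint u'(mean1 + mean2)/2."""
    m1, m2 = center(g1, "mean"), center(g2, "mean")
    s1, s2 = scatter(g1, m1), scatter(g2, m2)
    dof = len(g1) + len(g2) - 2
    pooled = [[(p + q) / dof for p, q in zip(r1, r2)] for r1, r2 in zip(s1, s2)]
    u = solve(pooled, [p - q for p, q in zip(m1, m2)])
    cut = sum(w * (p + q) / 2 for w, p, q in zip(u, m1, m2))
    return u, cut


def fit_mlcr(g1, g2):
    """Median-based rule; ASSUMPTION: the scalar scaling in Eq. (7) cancels."""
    c1, c2 = center(g1, "median"), center(g2, "median")
    u = [p - q for p, q in zip(c1, c2)]
    cut = sum(w * (p + q) / 2 for w, p, q in zip(u, c1, c2))
    return u, cut


def accuracy(rule, g1, g2):
    """Proportion of validation points assigned to their own group."""
    u, cut = rule
    score = lambda x: sum(w * v for w, v in zip(u, x))
    hits = sum(score(x) >= cut for x in g1) + sum(score(x) < cut for x in g2)
    return hits / (len(g1) + len(g2))


def draw(n, d, eps, mu, sigma, shift, rng):
    """n points from (1 - eps) N_d(0, I) + eps N_d(mu, sigma^2 I), moved by shift."""
    rows = []
    for _ in range(n):
        loc, sd = (mu, sigma) if rng.random() < eps else (0.0, 1.0)
        rows.append([rng.gauss(loc, sd) + shift for _ in range(d)])
    return rows


def one_run(n, d, eps, mu, sigma, shift, rng):
    """One replication: simulate both groups, split 60/40, score both rules."""
    g1 = draw(n, d, eps, mu, sigma, 0.0, rng)
    g2 = draw(n, d, eps, mu, sigma, shift, rng)  # ASSUMPTION: shifted copy of group one
    rng.shuffle(g1)
    rng.shuffle(g2)
    k = (6 * n) // 10  # 60% training, 40% validation
    fits = (("FLCA", fit_flca), ("MLCR", fit_mlcr))
    return {name: accuracy(fit(g1[:k], g2[:k]), g1[k:], g2[k:]) for name, fit in fits}


rng = random.Random(0)
runs = [one_run(30, 2, 0.10, 0.0, 4.0, 3.0, rng) for _ in range(200)]
for name in ("FLCA", "MLCR"):
    print(name, round(statistics.mean(r[name] for r in runs), 4))
```

Averaging the per-run proportions gives quantities that play the role of the mean probabilities of correct classification compared against OPT; because the group separation and contamination parameters here are placeholders, the printed numbers are not expected to reproduce the tables.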

In Table 2 below, the FLCA outperformed the other procedures for ε = 10. The MLCR performed better than the other techniques for ε = 20, 30. The second best for this table can easily be seen by inspection.

Table 2. Mean Probability of Correct Classification and Standard Deviation (in Brackets), Asymmetric Contamination (Optimal = 0.9049)

Con. Dist.  n    d   ε    FLCA      FMCD      MLCR      OPT-FLCA  OPT-FMCD  OPT-MLCR
(,)         30   2   10   0.907     0.8837    0.8993    0.003     0.0       0.0056
                          (0.005)   (0.0053)  (0.0056)
(,)         30   2   20   0.7886    0.8059    0.808     0.63      0.099     0.84
                          (0.006)   (0.0044)  (0.0084)
(,)         30   2   30   0.695     0.687     0.7659    0.4       0.3       0.390
                          (0.0070)  (0.03)    (0.0073)

In Table 3, for all values of ε, the MLCR method is robust over the other classification techniques. The second best is the FMCD technique.

Table 3. Mean Probability of Correct Classification and Standard Deviation (in Brackets), Combined Contamination (Optimal = 0.9049)

Con. Dist.  n    d   ε    FLCA      FMCD      MLCR      OPT-FLCA  OPT-FMCD  OPT-MLCR
(,6)        30   2   10   0.8835    0.873     0.890     0.04      0.036     0.09
                          (0.0084)  (0.0070)  (0.04)
(,6)        30   2   20   0.8073    0.849     0.8584    0.0976    0.0557    0.0465
                          (0.0065)  (0.0087)  (0.06)
(,6)        30   2   30   0.689     0.8       0.896     0.57      0.098     0.0853
                          (0.0093)  (0.0)     (0.0079)

In Table 4, for ε = 10, 20, the FLCA is robust over the other procedures, and the MLCR method performed better for ε = 30. In this category, the FLCA and the MLCR are the best, in that order.

Table 4. Mean Probability of Correct Classification and Standard Deviation (in Brackets), Symmetric Contamination (Optimal = 0.9564)

Con. Dist.  n    d   ε    FLCA      FMCD      MLCR      OPT-FLCA  OPT-FMCD  OPT-MLCR
(0, 5)      60   4   10   0.975     0.837     0.903     0.089     0.9       0.0533
                          (0.0068)  (0.004)   (0.007)
(0, 5)      60   4   20   0.878     0.859     0.874     0.078     0.305     0.08
                          (0.0048)  (0.000)   (0.0037)
(0, 5)      60   4   30   0.839     0.7998    0.843     0.45      0.566     0.3
                          (0.008)   (0.007)   (0.0058)
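As a worked check of how the gap to the benchmark is read, take the ε = 20 row of Table 3, where the optimal probability is 0.9049 and the first and third mean probabilities are 0.8073 and 0.8584. Reading these as the FLCA and MLCR entries is an assumption about the column order, and the symbols $\mathrm{OPT}$ and $\bar{p}$ are introduced here only for illustration:

```latex
\[
\mathrm{OPT} - \bar{p}_{\mathrm{FLCA}} = 0.9049 - 0.8073 = 0.0976,
\qquad
\mathrm{OPT} - \bar{p}_{\mathrm{MLCR}} = 0.9049 - 0.8584 = 0.0465 .
\]
```

Both differences agree with the values printed beside these means in that row, so the smaller gap, 0.0465, marks the rule that comes closer to the optimal probability.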

In Table 5 below, the FLCA method is the overall best, followed by the MLCR technique.

Table 5. Mean Probability of Correct Classification and Standard Deviation (in Brackets), Asymmetric Contamination (Optimal = 0.9564)

Con. Dist.  n    d   ε    FLCA      FMCD      MLCR      OPT-FLCA  OPT-FMCD  OPT-MLCR
(5,)        60   4   10   0.9563    0.8693    0.959     0.000     0.087     0.0405
                          (0.006)   (0.003)   (0.0093)
(5,)        60   4   20   0.955     0.859     0.90      0.0049    0.045     0.054
                          (0.0075)  (0.009)   (0.0088)
(5,)        60   4   30   0.9455    0.7574    0.7934    0.009     0.990     0.630
                          (0.0073)  (0.09)    (0.0)

In Table 6, the FLCA is the best technique for ε = 10. The MLCR technique is robust for ε = 20, 30.

Table 6. Mean Probability of Correct Classification and Standard Deviation (in Brackets), Combined Contamination (Optimal = 0.9564)

Con. Dist.  n    d   ε    FLCA      FMCD      MLCR      OPT-FLCA  OPT-FMCD  OPT-MLCR
(5, 5)      60   4   10   0.97      0.8364    0.8976    0.09      0.00      0.0588
                          (0.007)   (0.005)   (0.0095)
(5, 5)      60   4   20   0.8660    0.85      0.8673    0.0904    0.339     0.089
                          (0.0039)  (0.0057)  (0.008)
(5, 5)      60   4   30   0.793     0.8000    0.8344    0.63      0.564     0.0
                          (0.037)   (0.04)    (0.046)

The comparative analyses revealed that the MLCR technique performed better for all contamination models for the small sample size; the second best in this category is the FLCA procedure. For the large sample size, the FLCA performed better for the asymmetric contamination model. In general, for increasing ε the MLCR is more robust and stable than the other procedures compared.

6. Conclusion

The comparative analyses based on the mean of the optimal probability revealed that the conventional FLCA technique can be applied to perform classification for unequal variance covariance matrices, provided the performance benchmark is well defined. The Monte Carlo simulation revealed, though, that its shortcomings arise with small sample size and increasing proportion of contamination. However, its performance improved for the large sample size, which is due to the central limit theorem. The analyses indicate that the robust Fisher's procedure based on the minimum covariance determinant does not perform well for heteroscedastic variance covariance matrices. In general, the FMCD procedure has an increased misclassification rate compared to its conventional counterpart. Thus, the MLCR was able to bridge the gap between the FLCA and the FMCD. In all, the MLCR showed robust performance over the known methods with regard to heteroscedastic variance covariance matrices and increasing contamination proportion. It can easily be observed that as the contamination level increases, the MLCR technique has a lower misclassification rate than the other techniques. In general, we have devised a linear classification procedure that can handle homoscedastic and heteroscedastic variance covariance matrices with respect to two groups linear classification. The comparison between the mean of the optimal probability and the mean probability of correct classification for each method reveals that the FLCA can be applied to perform classification under this condition, and that the overall breakthrough relative to the conventional quadratic discriminant procedure is the capability of the MLCR to perform very well under these conditions.

Acknowledgment

The authors thank Universiti Sains Malaysia for providing the short term grant for the publication of this work.

References

[1] R. A. Fisher, The use of multiple measurements in taxonomic problems, Annals of Eugenics 7 (1936), 179-188.
[2] R. A. Johnson and D. W. Wichern, Applied multivariate statistical analysis, New York: Prentice Hall, 2002.
[3] P. J. Huber, Robust estimation of a location parameter, Annals of Mathematical Statistics 35 (1964), 73-101.
[4] C. L. Mallows, On some topics in robustness: technical memorandum, Murray Hill, New Jersey: Bell Telephone Laboratories, 1975.
[5] H. P. Lopuhaä, On the relation between S-estimators and M-estimators of multivariate location and covariance, The Annals of Statistics 17 (1989), 1662-1683.
[6] P. J. Rousseeuw, Least median of squares regression, Journal of the American Statistical Association 79 (1984), 871-880.
[7] P. J. Rousseeuw, Multivariate estimation with high breakdown point, Mathematical Statistics and Applications B (1985), 283-297.
[8] M. Hubert and K. Van Driessen, Fast and robust discriminant analysis, Computational Statistics and Data Analysis 45 (2004), 301-320.
[9] P. J. Rousseeuw and K. Van Driessen, A fast algorithm for the minimum covariance determinant estimator, Technometrics 41 (1999), 212-223.

[10] P. A. Lachenbruch, Some unsolved practical problems in discriminant analysis, Chapel Hill: University of North Carolina, 1975.
[11] E. Gilbert, The effect of unequal variance covariance matrices on Fisher's linear discriminant function, Biometrics 25 (1969), 505-516.
[12] S. Marks and O. J. Dunn, Discriminant functions when covariance matrices are unequal, Journal of the American Statistical Association 69 (1974), 555-559.
[13] R. K. Misra, Discriminant functions for linear comparison of means when covariance matrices are unequal, Biometrics (1980), 755-758.
[14] N. Kumar and A. G. Andreou, Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition, Speech Communication 5(1998), -4.
[15] N. Kumar and A. G. Andreou, On the generalizations of linear discriminant analysis, Electrical and Computer Engineering Technical Report 96-97, 1996.
[16] N. Kumar and A. G. Andreou, Generalization of linear discriminant analysis in maximum likelihood framework, In Proceedings Joint Meeting of American Statistical Association, Chicago, Illinois, 1996.
[17] C. Croux and G. Haesbroeck, Influence function and efficiency of the minimum covariance determinant scatter matrix estimator, Journal of Multivariate Analysis 71 (1999), 161-190.
[18] G. Fauconnier and G. Haesbroeck, Outliers detection with minimum covariance determinant estimator in practice, Statistical Methodology 6 (2009), 363-379.
[19] M. Hubert, P. J. Rousseeuw and T. Verdonck, A deterministic algorithm for the MCD, citeseerx.ist.psu.edu/viewdoc/summary?doi=0...7.34, (0).
[20] M. Hubert, P. J. Rousseeuw and S. Van Aelst, High breakdown robust multivariate methods, Statistical Science 23 (2008), 92-119.
[21] G. Pison, S. Van Aelst and G. Willems, Small sample corrections for LTS and MCD, Metrika 55 (2002), 111-123.
[22] P. J. Rousseeuw and K. Van Driessen, A fast algorithm for the minimum covariance determinant estimator, Technometrics 41 (1999), 212-223.
[23] R. Maronna, R. D. Martin and V. J. Yohai, Robust statistics: Theory and methods, New York: John Wiley, 2006.
[24] H. P. Lopuhaä and P. J. Rousseeuw, Breakdown points of affine equivariant estimators of multivariate location and covariance matrices, The Annals of Statistics 19 (1991), 229-248.
[25] A. J. Stromberg, Robust covariance estimates based on resampling, Journal of Statistical Planning and Inference 57 (1997), 321-334.

Received: September 5, 2013