
CLASSIFICATION AND DISCRIMINATION FOR POPULATIONS WITH MIXTURE OF MULTIVARIATE NORMAL DISTRIBUTIONS

Emerson WRUCK 1, Jorge Alberto ACHCAR 1, Josmar MAZUCHELI 2

ABSTRACT: In this paper, we consider mixtures of multivariate normal distributions to be used in classification and discrimination rules. Using Markov Chain Monte Carlo methods, we obtain the posterior summaries of interest and the predictive densities needed in the classification rules. A numerical example is introduced to illustrate the proposed methodology.

KEYWORDS: Mixture of multivariate normal distributions, classification and discrimination, Bayesian analysis.

1 Introduction

Assume we wish to classify a unit into one of g groups based on a vector x of observed data (see, for example, Cacoullos, 1973; Lachenbruch, 1975; Goldstein and Dillon, 1978; or Johnson and Wichern, 1982). This problem appears in many different areas, such as economics, medicine, ecology, archaeology and physics.

1 Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo, C.P. 668, São Carlos, SP, Brazil
2 Departamento de Estatística, Universidade Estadual de Maringá, Maringá, PR, Brazil

In general, x is assumed to have a multivariate normal distribution (see, for example, Anderson, 1984; or Johnson and Wichern, 1982). The classification rule can be built on Fisher's discriminant function or on Bayesian approaches based on the predictive density for a future observation (see, for example, Lavine and West, 1992). For many applications, a preliminary analysis of a training data set from the g populations may indicate the need for other multivariate distributions for x, which could improve the performance of the classification rules. Evaluation of classification rules can be based on "error rates" or misclassification probabilities.

In this paper, we assume a mixture of multivariate distributions for x in each population, with density

$$f(x \mid \theta, p) = \sum_{j=1}^{K} p_j\, f(x \mid \theta_j), \qquad (1)$$

where $\theta = (\theta_1, \dots, \theta_K)'$, $p = (p_1, \dots, p_K)'$, $\theta_j$ is the vector of parameters associated with the $j$th component distribution, $p_j$ is the probability that $x$ belongs to the $j$th component, and $\sum_{j=1}^{K} p_j = 1$. Bayesian inference for mixtures of distributions has been considered by many authors (see, for example, Robert, 1996; or Titterington, Smith and Makov, 1985). As a special case, we consider a Bayesian approach for classification assuming a mixture of multivariate normal distributions for each population, using MCMC (Markov Chain Monte Carlo) methods as in Gelfand and Smith (1990) to develop the classification rules.
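As an aside (not part of the original paper), a minimal Python sketch of how the mixture density (1) can be evaluated for K = 2 bivariate normal components; the parameter values are the population-1 values used later in the numerical illustration of Section 4:

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_density(x, means, covs, weights):
    """Evaluate the K-component multivariate normal mixture density (1) at x."""
    return sum(w * multivariate_normal(mean=m, cov=S).pdf(x)
               for m, S, w in zip(means, covs, weights))

# Population-1 parameters from Section 4 (K = 2 components, q = 2).
means = [np.array([2.5, 4.5]), np.array([4.0, 10.0])]
covs = [np.array([[1.0, 0.3], [0.3, 1.5]]),
        np.array([[2.0, 0.4], [0.4, 2.5]])]
weights = [0.4, 0.6]

print(mixture_density(np.array([3.0, 6.0]), means, covs, weights))
```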

2 Bayesian Analysis Assuming a Mixture of K = 2 Multivariate Normal Distributions

First, we consider the special case where each population has a mixture of K = 2 multivariate normal distributions,

$$f(x \mid \theta, p) = \sum_{j=1}^{2} p_j\, f_j(x \mid \theta_j), \qquad (2)$$

where $x = (x_1, \dots, x_q)'$, $\sum_{j=1}^{2} p_j = 1$, $f_j(x \mid \theta_j)$ denotes a multivariate normal density $N_q(\mu_j, \Sigma_j)$, $\theta = \{\theta_1, \theta_2\}$, $\theta_1 = \{\mu_1, \Sigma_1\}$ and $\theta_2 = \{\mu_2, \Sigma_2\}$.

The likelihood function for $\theta$ and $p = (p_1, p_2)$, based on a random sample $x_1, \dots, x_n$, is given by

$$L(\theta, p) = \prod_{i=1}^{n} \left\{ \sum_{j=1}^{2} p_j\, f_j(x_i \mid \theta_j) \right\}. \qquad (3)$$

To simplify the conditional distributions needed for the Gibbs sampling algorithm, we introduce latent variables (see Tanner and Wong, 1987) $Z_i = (Z_{i1}, Z_{i2})$, where $Z_{ij} = 1$ if the $i$th observation was generated from the $j$th component distribution and $Z_{ij} = 0$ otherwise, $i = 1, \dots, n$. Observe that, for the special case of K = 2 component distributions, $Z_{ij} \mid x, \theta, p \sim b(1, v_{ij})$ (a Bernoulli distribution) with

$$v_{ij} = \frac{p_j\, f_j(x_i \mid \theta_j)}{\sum_{j=1}^{2} p_j\, f_j(x_i \mid \theta_j)}. \qquad (4)$$

Thus,

$$f(z_i \mid x, \theta, p) = v_{i1}^{z_{i1}} (1 - v_{i1})^{1 - z_{i1}}. \qquad (5)$$

Considering a sample $z_1, \dots, z_n$, we have

$$f(z_1, \dots, z_n \mid x, \theta, p) = \frac{\prod_{i=1}^{n} \prod_{j=1}^{2} \left[ p_j\, f_j(x_i \mid \theta_j) \right]^{z_{ij}}}{\prod_{i=1}^{n} \left\{ \sum_{j=1}^{2} p_j\, f_j(x_i \mid \theta_j) \right\}}. \qquad (6)$$

Let us assume the following prior distributions for $\theta$ and $p_1$ (see, for example, Lavine and West, 1992):

$$\pi(\theta) \propto |\Sigma_1|^{-\frac{1}{2}(q+1)}\, |\Sigma_2|^{-\frac{1}{2}(q+1)}, \qquad \pi(p_1) \sim B(a, b), \quad a, b \text{ known}, \qquad (7)$$

where $B(a, b)$ denotes a Beta distribution with mean $a/(a+b)$ and variance $ab/[(a+b)^2 (a+b+1)]$.
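A small Python sketch of the data-augmentation step (4)-(5) (an illustration only, not the authors' code; the function name and argument layout are my own, with `x` an n-by-q data array and `mu`, `Sigma` lists holding the two component parameters):

```python
import numpy as np
from scipy.stats import multivariate_normal

def sample_latent_memberships(x, mu, Sigma, p1, rng):
    """Draw z_i1 ~ Bernoulli(v_i1), with v_ij given by (4), for every row of x."""
    f1 = multivariate_normal(mean=mu[0], cov=Sigma[0]).pdf(x)   # component-1 densities
    f2 = multivariate_normal(mean=mu[1], cov=Sigma[1]).pdf(x)   # component-2 densities
    v1 = p1 * f1 / (p1 * f1 + (1.0 - p1) * f2)                  # membership probabilities v_i1
    z1 = rng.binomial(1, v1)                                    # z_i1 draws; z_i2 = 1 - z_i1
    return z1, v1

# Example call:
# z1, v1 = sample_latent_memberships(x, mu, Sigma, p1=0.5, rng=np.random.default_rng(0))
```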

Combining (3) with (6) and the prior distribution (7), and assuming prior independence, the joint posterior distribution for $\theta$ and $p_1$ is given by

$$\pi(\theta, p_1 \mid x, z) \propto |\Sigma_1|^{-\frac{1}{2}(q+1)} \prod_{i=1}^{n} \left[ f_1(x_i \mid \theta_1) \right]^{z_{i1}} \; |\Sigma_2|^{-\frac{1}{2}(q+1)} \prod_{i=1}^{n} \left[ f_2(x_i \mid \theta_2) \right]^{z_{i2}} \; p_1^{(r+a)-1} (1 - p_1)^{(n+b-r)-1}, \qquad (8)$$

where $r = \sum_{i=1}^{n} z_{i1}$ and $r_2 = n - r = \sum_{i=1}^{n} z_{i2}$. The conditional posterior distributions for the Gibbs sampling algorithm are given by

(i) $p_1 \mid \theta_1, \theta_2, x, z \sim B(a + r,\; b + n - r)$;
(ii) $\mu_1 \mid \Sigma_1, p_1, \theta_2, x, z \sim N_q\!\left(\bar{x}_1, \tfrac{1}{r}\Sigma_1\right)$;
(iii) $\Sigma_1 \mid \mu_1, p_1, \theta_2, x, z \sim \text{Inv-Wishart}_{r-1}(V_1^{-1})$;    (9)
(iv) $\mu_2 \mid \Sigma_2, p_1, \theta_1, x, z \sim N_q\!\left(\bar{x}_2, \tfrac{1}{n-r}\Sigma_2\right)$;
(v) $\Sigma_2 \mid \mu_2, p_1, \theta_1, x, z \sim \text{Inv-Wishart}_{n-r-1}(V_2^{-1})$;

where $\bar{x}_1 = \tfrac{1}{r}\sum_{i=1}^{n} z_{i1} x_i$, $\bar{x}_2 = \tfrac{1}{r_2}\sum_{i=1}^{n} z_{i2} x_i$, $V_1 = \sum_{i=1}^{n} z_{i1}(x_i - \bar{x}_1)(x_i - \bar{x}_1)'$, $V_2 = \sum_{i=1}^{n} z_{i2}(x_i - \bar{x}_2)(x_i - \bar{x}_2)'$, and $\text{Inv-Wishart}_v(V^{-1})$ denotes an Inverse-Wishart distribution with $v$ degrees of freedom and density

$$f(W) \propto |W|^{-\frac{1}{2}(v+q+1)} \exp\left\{ -\tfrac{1}{2}\,\text{tr}\!\left(V W^{-1}\right) \right\},$$

where $V$ is a $q \times q$ symmetric positive definite scale matrix and $W$ is positive definite.

To generate samples from the joint posterior distribution (8), we follow the steps:

i- Start with initial values $p_1^{(0)}, \mu_1^{(0)}, \mu_2^{(0)}, \Sigma_1^{(0)}$ and $\Sigma_2^{(0)}$;
ii- Generate a sample $Z_1^{(1)}, \dots, Z_n^{(1)}$ from Bernoulli distributions with success probabilities $v_{ij}$ given in (4);
iii- Generate a sample of $p_1, \mu_1, \mu_2, \Sigma_1$ and $\Sigma_2$ from the conditional distributions (9).
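A compact Python sketch of one sweep of the Gibbs sampler described above, under the non-informative prior (7) (an illustration of my reading of the conditionals (9), not the authors' Ox code; it reuses the hypothetical `sample_latent_memberships` helper from the previous sketch, and `state` is an ad hoc container for the current parameter values):

```python
import numpy as np
from scipy.stats import invwishart

def gibbs_step(x, state, a, b, rng):
    """One sweep of the Gibbs sampler for the K = 2 mixture (conditionals (9))."""
    mu, Sigma, p1 = state["mu"], state["Sigma"], state["p1"]

    # Step ii: latent memberships z_i1 ~ Bernoulli(v_i1), eq. (4).
    z1, _ = sample_latent_memberships(x, mu, Sigma, p1, rng)
    z = np.column_stack([z1, 1 - z1])
    r = z1.sum()

    # (i) p1 | ... ~ Beta(a + r, b + n - r).
    p1 = rng.beta(a + r, b + len(x) - r)

    for j in range(2):
        nj = z[:, j].sum()                            # component sample size (r or n - r)
        xbar = (z[:, j, None] * x).sum(axis=0) / nj   # component mean xbar_j
        dev = x - xbar
        V = (z[:, j, None] * dev).T @ dev             # scatter matrix V_j
        # (iii)/(v): Sigma_j ~ Inv-Wishart_{n_j - 1}(V_j^{-1}); scipy's `scale` is V_j.
        # (A real implementation should guard against nearly empty components, nj - 1 < q.)
        Sigma[j] = invwishart(df=nj - 1, scale=V).rvs(random_state=rng)
        # (ii)/(iv): mu_j | Sigma_j ~ N_q(xbar_j, Sigma_j / n_j).
        mu[j] = rng.multivariate_normal(xbar, Sigma[j] / nj)

    return {"mu": mu, "Sigma": Sigma, "p1": p1}
```

Iterating `gibbs_step` and storing the successive states gives the posterior sample used in the classification rules below.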

We could also consider an informative prior distribution for $\theta$. A conjugate prior distribution for $\theta$ is given by

$$\pi(\theta) \propto |\Sigma_1|^{-\left(\frac{g_1+q}{2}+1\right)} \exp\left\{ -\tfrac{1}{2}\,\text{tr}\!\left(G_1 \Sigma_1^{-1}\right) - \tfrac{k_1}{2} (\mu_1 - m_1)' \Sigma_1^{-1} (\mu_1 - m_1) \right\} \times |\Sigma_2|^{-\left(\frac{g_2+q}{2}+1\right)} \exp\left\{ -\tfrac{1}{2}\,\text{tr}\!\left(G_2 \Sigma_2^{-1}\right) - \tfrac{k_2}{2} (\mu_2 - m_2)' \Sigma_2^{-1} (\mu_2 - m_2) \right\}, \qquad (10)$$

where $g_j$ and $k_j$ are known constants, $G_j$ is a symmetric positive definite matrix of known constants, and $m_j$ is a vector of known constants, $j = 1, 2$.

With the prior (10) for $\theta$ and the same Beta prior for $p_1$ given in (7), the conditional posterior distributions for the Gibbs algorithm are given by

(i) $p_1 \mid \theta_1, \theta_2, x, z \sim B(a + r,\; b + n - r)$;
(ii) $\mu_1 \mid \Sigma_1, p_1, \theta_2, x, z \sim N_q\!\left(a_1, \tfrac{\Sigma_1}{r + k_1}\right)$;
(iii) $\Sigma_1 \mid \mu_1, p_1, \theta_2, x, z \sim \text{Inv-Wishart}_{g_1 + r}(G_{n1}^{-1})$;    (11)
(iv) $\mu_2 \mid \Sigma_2, p_1, \theta_1, x, z \sim N_q\!\left(a_2, \tfrac{\Sigma_2}{n - r + k_2}\right)$;
(v) $\Sigma_2 \mid \mu_2, p_1, \theta_1, x, z \sim \text{Inv-Wishart}_{g_2 + n - r}(G_{n2}^{-1})$;

where $a_1 = \tfrac{r}{r + k_1}\bar{x}_1 + \tfrac{k_1}{r + k_1} m_1$, $a_2 = \tfrac{n - r}{n - r + k_2}\bar{x}_2 + \tfrac{k_2}{n - r + k_2} m_2$, $G_{n1} = G_1 + V_1 + \tfrac{k_1 r}{k_1 + r}(\bar{x}_1 - m_1)(\bar{x}_1 - m_1)'$ and $G_{n2} = G_2 + V_2 + \tfrac{k_2 (n - r)}{k_2 + n - r}(\bar{x}_2 - m_2)(\bar{x}_2 - m_2)'$.

Similar results can be obtained for K > 2.
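A short Python sketch of the conjugate updates in (11) for a single component (again an illustration, not the authors' code; `zj` is the vector of component-j membership indicators and the remaining arguments are the prior constants $m_j$, $k_j$, $g_j$ and $G_j$):

```python
import numpy as np
from scipy.stats import invwishart

def conjugate_component_update(x, zj, m, k, g, G, rng):
    """Draw (mu_j, Sigma_j) from conditionals (ii)-(v) of (11) for one component."""
    nj = zj.sum()
    xbar = (zj[:, None] * x).sum(axis=0) / nj
    dev = x - xbar
    V = (zj[:, None] * dev).T @ dev
    a_post = (nj * xbar + k * m) / (nj + k)                           # posterior mean a_j
    Gn = G + V + (k * nj / (k + nj)) * np.outer(xbar - m, xbar - m)   # scale matrix G_nj
    Sigma = invwishart(df=g + nj, scale=Gn).rvs(random_state=rng)     # Inv-Wishart_{g_j + n_j}(G_nj^{-1})
    mu = rng.multivariate_normal(a_post, Sigma / (nj + k))            # N_q(a_j, Sigma_j / (n_j + k_j))
    return mu, Sigma
```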

3 Classification for Two Populations

Let us classify a new object into one of two populations, based on q measurements on the random variables $X' = (X_1, \dots, X_q)$, assuming a mixture of normal distributions $f^{(1)}(x \mid \theta^{(1)}) = \sum_{j=1}^{2} p_j^{(1)} f_j^{(1)}(x \mid \theta_j^{(1)})$ for population 1 and $f^{(2)}(x \mid \theta^{(2)}) = \sum_{j=1}^{2} p_j^{(2)} f_j^{(2)}(x \mid \theta_j^{(2)})$ for population 2, where $\theta_j^{(l)} = (\mu_j^{(l)}, \Sigma_j^{(l)})$ and $f_j^{(l)}(x \mid \theta_j^{(l)})$ denotes a multivariate normal density $N_q(\mu_j^{(l)}, \Sigma_j^{(l)})$, $j = 1, 2$, $l = 1, 2$.

The predictive density for a vector x is given by

$$f^{(l)}(x) = \int f^{(l)}(x \mid \theta^{(l)})\, \pi(\theta^{(l)} \mid x)\, d\theta^{(l)}, \qquad (12)$$

where $l = 1$ or $2$ ($l$ indexes populations 1 and 2). A Monte Carlo estimate for $f^{(l)}(x)$, based on the generated Gibbs samples, is given by

$$\hat{f}^{(l)}(x) = \frac{1}{S} \sum_{s=1}^{S} f^{(l)}(x \mid \theta^{(l)s}), \qquad (13)$$

where $S$ is the number of generated Gibbs samples.

To classify a new object with observed measurements x, we consider the following allocation rule:

i- Allocate x to population 1 if

$$\frac{\hat{f}^{(1)}(x)}{\hat{f}^{(2)}(x)} \ge \frac{c(1 \mid 2)}{c(2 \mid 1)} \cdot \frac{\xi_2}{\xi_1}, \qquad (14)$$

where $c(1 \mid 2)$ is the misclassification cost when an observation from population 2 is incorrectly classified into population 1, $c(2 \mid 1)$ is the misclassification cost when an observation from population 1 is incorrectly classified into population 2, and $\xi_1$ and $\xi_2$ are the prior probabilities of classification into populations 1 and 2, respectively.

ii- Allocate x to population 2, otherwise.

In the special case $c(1 \mid 2) = c(2 \mid 1)$ and $\xi_1 = \xi_2$, the allocation rule (14) reduces to:

i- Allocate x to population 1 if

$$\frac{\hat{f}^{(1)}(x)}{\hat{f}^{(2)}(x)} \ge 1; \qquad (15)$$

ii- Allocate x to population 2, otherwise.
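A minimal Python sketch of the Monte Carlo predictive estimate (13) and the equal-cost allocation rule (15) (an illustration with hypothetical names; `draws` is assumed to be the list of retained Gibbs draws for one population, each draw holding the two component means, covariance matrices and the weight $p_1$):

```python
import numpy as np
from scipy.stats import multivariate_normal

def predictive_density(x_new, draws):
    """Monte Carlo estimate (13): average the mixture density over the S posterior draws."""
    vals = []
    for d in draws:
        f1 = multivariate_normal(mean=d["mu"][0], cov=d["Sigma"][0]).pdf(x_new)
        f2 = multivariate_normal(mean=d["mu"][1], cov=d["Sigma"][1]).pdf(x_new)
        vals.append(d["p1"] * f1 + (1.0 - d["p1"]) * f2)
    return np.mean(vals)

def allocate(x_new, draws_pop1, draws_pop2):
    """Allocation rule (15): equal misclassification costs and equal prior probabilities."""
    f_hat_1 = predictive_density(x_new, draws_pop1)
    f_hat_2 = predictive_density(x_new, draws_pop2)
    return 1 if f_hat_1 >= f_hat_2 else 2
```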

4 A Numerical Illustration

As an illustrative example, let us consider two simulated samples of size 100 generated from populations 1 and 2, each with a mixture of two bivariate normal distributions with density (2). For population 1, we assume

$$\mu_1 = (2.5,\; 4.5)', \quad \Sigma_1 = \begin{pmatrix} 1 & 0.3 \\ 0.3 & 1.5 \end{pmatrix}, \quad p_1 = 0.4,$$
$$\mu_2 = (4.0,\; 10.0)', \quad \Sigma_2 = \begin{pmatrix} 2.0 & 0.4 \\ 0.4 & 2.5 \end{pmatrix}, \quad p_2 = 0.6.$$

Figure 1: Data from populations 1 and 2.

For population 2, we assume

$$\mu_1 = (3.5,\; 5.5)', \quad \Sigma_1 = \begin{pmatrix} 1.0 & 0.3 \\ 0.3 & 2.0 \end{pmatrix}, \quad p_1 = 0.5,$$
$$\mu_2 = (6.5,\; 14.6)', \quad \Sigma_2 = \begin{pmatrix} 2.0 & 0.4 \\ 0.4 & 3.0 \end{pmatrix}, \quad p_2 = 0.5.$$

In Figure 1, we plot the data $x = (x_1, x_2)$ from both populations. We clearly observe two clusters in each sample, which indicates a mixture of two bivariate normal distributions for each population.

If we consider the usual linear discriminant function, assuming multivariate normal distributions for each population with the same covariance matrix $\Sigma$ (see, for example, Johnson and Wichern, 1982), the same misclassification costs and the same prior probabilities, we obtain the classification results for the whole data set given in Table 1.

Table 1 - Classification table (linear discriminant function)

    Actual        Predicted membership
    membership    Pop1         Pop2         Total
    Pop1          n_1c = 70    n_1m = 30    n_1 = 100
    Pop2          n_2m = 40    n_2c = 60    n_2 = 100

In Table 1, $n_{1c}$ is the number of Pop1 items correctly classified as Pop1; $n_{1m}$ is the number of Pop1 items misclassified as Pop2; $n_{2c}$ is the number of Pop2 items correctly classified; $n_{2m}$ is the number of Pop2 items misclassified; $n_1$ and $n_2$ are the totals of actual items in each population. The apparent error rate (APER) is given by

$$\text{APER} = \frac{n_{1m} + n_{2m}}{n_1 + n_2} = \frac{30 + 40}{200} = 0.35.$$
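For readers who want to reproduce a comparable benchmark, a hedged Python sketch follows: it simulates data from the two mixture populations specified above and computes resubstitution (apparent) error rates for the linear discriminant rule of Table 1 and for the quadratic rule considered next, using scikit-learn. This is not the authors' procedure, and the APER values will differ from those reported here because the simulated samples are different:

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(42)

def simulate_mixture(n, mus, Sigmas, p1, rng):
    """Simulate n points from a two-component bivariate normal mixture."""
    comp = rng.binomial(1, 1.0 - p1, size=n)          # 0 -> component 1, 1 -> component 2
    return np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in comp])

x_pop1 = simulate_mixture(
    100,
    [np.array([2.5, 4.5]), np.array([4.0, 10.0])],
    [np.array([[1.0, 0.3], [0.3, 1.5]]), np.array([[2.0, 0.4], [0.4, 2.5]])],
    p1=0.4, rng=rng)
x_pop2 = simulate_mixture(
    100,
    [np.array([3.5, 5.5]), np.array([6.5, 14.6])],
    [np.array([[1.0, 0.3], [0.3, 2.0]]), np.array([[2.0, 0.4], [0.4, 3.0]])],
    p1=0.5, rng=rng)

X = np.vstack([x_pop1, x_pop2])
y = np.array([1] * 100 + [2] * 100)

for name, clf in [("linear", LinearDiscriminantAnalysis()),
                  ("quadratic", QuadraticDiscriminantAnalysis())]:
    y_hat = clf.fit(X, y).predict(X)                  # resubstitution, as in the APER
    print(name, "APER =", np.mean(y_hat != y))
```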

Considering a quadratic discriminant function, assuming multivariate normal distributions for each population with different covariance matrices $\Sigma_1 \neq \Sigma_2$, the same misclassification costs and the same prior probabilities, we obtain the classification results for the whole data set given in Table 2.

Table 2 - Classification table (quadratic discriminant function)

    Actual        Predicted membership
    membership    Pop1         Pop2         Total
    Pop1          n_1c = 76    n_1m = 24    n_1 = 100
    Pop2          n_2m = 41    n_2c = 59    n_2 = 100

Using the quadratic discriminant function, the apparent error rate is $\text{APER} = (24 + 41)/200 = 0.325$.

From the APER values obtained with the usual linear discriminant function and with the quadratic discriminant function, we observe that a large proportion of items is misclassified, which indicates that these classification rules are not appropriate for this data set.

We now consider a mixture of two bivariate normal distributions (2) with $\theta_1^{(l)} = (\mu_1^{(l)}, \Sigma_1^{(l)})$ and $\theta_2^{(l)} = (\mu_2^{(l)}, \Sigma_2^{(l)})$, where

$$\mu_1^{(l)} = (\mu_{11}^{(l)},\; \mu_{12}^{(l)})', \quad \mu_2^{(l)} = (\mu_{21}^{(l)},\; \mu_{22}^{(l)})', \quad \Sigma_1^{(l)} = \begin{pmatrix} \sigma_{111}^{(l)} & \sigma_{112}^{(l)} \\ \sigma_{121}^{(l)} & \sigma_{122}^{(l)} \end{pmatrix}, \quad \Sigma_2^{(l)} = \begin{pmatrix} \sigma_{211}^{(l)} & \sigma_{212}^{(l)} \\ \sigma_{221}^{(l)} & \sigma_{222}^{(l)} \end{pmatrix},$$

for $l = 1$ (Pop1) and $l = 2$ (Pop2), with the prior distributions (7) with $a = 2$, $b = 3$ for Pop1 and $a = 2$, $b = 2$ for Pop2. We generated Gibbs samples from the joint posterior distribution (8) using the conditional posterior distributions (9). The convergence of the Gibbs samples was monitored using the Geweke (1992) method. The results were generated using the Ox package, version 2.10 (see Doornik, 1999). For each parameter, we discarded the first 4000 iterations (burn-in samples) and kept every 20th iteration (the 20th, 40th, ... iterations), giving a final sample of size S = 300.

In Table 3, we present the posterior summaries for all parameters, together with the values of the Geweke (1992) convergence criterion GW. We observe convergence for all parameters, since |GW| < 2 in every case.

Using the Monte Carlo estimates (13) of the predictive densities of x in both populations, based on the S = 300 generated Gibbs samples, we apply rule (15) to classify the items into populations 1 and 2.
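A minimal sketch of the burn-in, thinning and posterior summaries just described (hypothetical names; `chain` stands for the raw Gibbs iterations of a single parameter, and the Geweke-type check below uses plain sample variances as a simplification of the spectral-density version used in the paper):

```python
import numpy as np

def posterior_summary(chain, burn_in=4000, thin=20):
    """Discard the burn-in, keep every `thin`-th draw and summarize (Table 3 columns)."""
    kept = chain[burn_in::thin]
    lo, hi = np.percentile(kept, [2.5, 97.5])
    return kept.mean(), kept.std(ddof=1), (lo, hi)

def geweke_z(kept, first=0.1, last=0.5):
    """Crude Geweke-type statistic comparing the means of the first 10% and last 50%."""
    a = kept[: int(first * len(kept))]
    b = kept[-int(last * len(kept)):]
    return (a.mean() - b.mean()) / np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
```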

Table 3 - Posterior summaries (mixture of two bivariate normal distributions; prior distributions (7) for $\theta$). For each parameter of both populations, the table reports the posterior mean, standard deviation, 95% credible interval and the Geweke statistic |GW|; all values of |GW| are smaller than 2.

In Table 4, we present the classification results for the whole data set.

Table 4 - Classification table (mixture of two bivariate normal distributions)

    Actual        Predicted membership
    membership    Pop1         Pop2         Total
    Pop1          n_1c = 83    n_1m = 17    n_1 = 100
    Pop2          n_2m = 19    n_2c = 81    n_2 = 100

Considering a mixture of two bivariate normal distributions for both populations, the apparent error rate is $\text{APER} = (17 + 19)/200 = 0.18$. That is, we observe a great improvement in the classification rule based on the mixture of two bivariate normal distributions, since the APER is much smaller than the values obtained with the linear or quadratic discriminant functions.

We could also assume the conjugate prior distribution (10) for $\theta$.

Considering, for population 1,

$$m_1 = (2.5,\; 4.5)', \quad m_2 = (4.0,\; 10.0)', \quad k_1 = k_2 = 3, \quad g_1 = g_2 = 7, \quad a = b = 10,$$
$$G_1 = \begin{pmatrix} 1 & 0.3 \\ 0.3 & 1.5 \end{pmatrix}, \quad G_2 = \begin{pmatrix} 2.0 & 0.4 \\ 0.4 & 2.5 \end{pmatrix},$$

and, for population 2,

$$m_1 = (3.5,\; 5.5)', \quad m_2 = (6.5,\; 14.6)', \quad k_1 = k_2 = 3, \quad g_1 = g_2 = 7, \quad a = b = 10,$$
$$G_1 = \begin{pmatrix} 1 & 0.3 \\ 0.3 & 2.0 \end{pmatrix}, \quad G_2 = \begin{pmatrix} 2.0 & 0.4 \\ 0.4 & 3.0 \end{pmatrix},$$

we present in Table 5 the posterior summaries for all parameters, based on the Gibbs samples generated from the conditional posterior distributions (11). The simulation procedure was similar to the one used with the prior distribution (7).

Table 5 - Posterior summaries (mixture of two bivariate normal distributions; prior distributions (10) for $\theta$). For each parameter of both populations, the table reports the posterior mean, standard deviation, 95% credible interval and the Geweke statistic |GW|.

In Table 6, we present the classification results for the whole data set.

Table 6 - Classification table (mixture of two bivariate normal distributions and the prior distributions (10) for $\theta$)

    Actual        Predicted membership
    membership    Pop1         Pop2         Total
    Pop1          n_1c = 86    n_1m = 14    n_1 = 100
    Pop2          n_2m = 18    n_2c = 82    n_2 = 100

In this case, the apparent error rate is $\text{APER} = (14 + 18)/200 = 0.16$. We observe an even better performance of the classification rule when we assume a mixture of bivariate normal distributions for both populations together with the conjugate prior distribution (10).

5 Concluding Remarks

For many classification and discrimination problems, the use of standard linear or quadratic discriminant functions may not be appropriate. Usually, a preliminary analysis of existing training data can indicate different shapes for the multivariate distribution to be used in the classification rules, in place of the usual assumption of a multivariate normal distribution for the data of each population. In such cases, mixtures of multivariate normal distributions can be very useful in building the classification rules. It is important to point out that the use of MCMC methods to obtain the posterior summaries of interest does not require sophisticated computational expertise, and this approach can be extended to mixtures of more than two multivariate normal distributions and to higher dimensions.

Acknowledgments: E. Wruck thanks FAPESP (São Paulo, Brazil) for financial support, grant 99/. J. Mazucheli is a graduate student at COPPE/UFRJ and thanks CAPES for partial support. The authors also thank the referees for their useful comments.

WRUCK, E.; ACHCAR, J. A.; MAZUCHELI, J. Classificação e discriminação para populações com misturas de distribuições normais multivariadas. Rev. Mat. Estat. (São Paulo), v. 19, 2000.

RESUMO: Neste artigo, consideramos misturas de distribuições normais multivariadas para serem usadas em regras de classificação e discriminação. Considerando métodos de Monte Carlo em Cadeias de Markov, obtemos sumários a posteriori de interesse e densidades preditivas para serem usadas nas regras de classificação. Um exemplo é discutido.

PALAVRAS-CHAVE: Mistura de distribuições normais multivariadas, classificação e discriminação, análise Bayesiana.

References

ANDERSON, T. W. An introduction to multivariate statistical analysis. New York: John Wiley, 1984.

CACOULLOS, T. Discriminant analysis and applications. New York: Academic Press, 1973.

DOORNIK, J. A. Object-oriented matrix programming using Ox. 3rd ed. London: Timberlake Consultants, 1999.

GELFAND, A.; SMITH, A. Sampling-based approaches to calculating marginal densities. J. Am. Stat. Assoc., v. 85, 1990.

GEWEKE, J. Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments. In: Bayesian Statistics 4. New York: Oxford University Press, 1992.

GOLDSTEIN, M.; DILLON, W. R. Discrete discriminant analysis. New York: Wiley, 1978.

JOHNSON, R. A.; WICHERN, D. W. Applied multivariate statistical analysis. New Jersey: Prentice Hall, 1982.

LACHENBRUCH, P. A. Discriminant analysis. New York: Hafner, 1975.

LAVINE, M.; WEST, M. A Bayesian method for classification and discrimination. Can. J. Stat., v. 20, n. 4, 1992.

ROBERT, C. P. Mixture of distributions: inference and estimation. In: Markov chain Monte Carlo in practice. London: Chapman and Hall, 1996.

TANNER, M.; WONG, W. The calculation of posterior distributions by data augmentation. J. Am. Stat. Assoc., v. 82, 1987.

TITTERINGTON, D. M.; SMITH, A. F. M.; MAKOV, U. V. Statistical analysis of finite mixture distributions. New York: John Wiley, 1985.

Recebido em
