FRST 531 -- Multivariate Statistics

Multivariate Discriminant Analysis (MDA)

Purpose:

1. To predict which group (Y) an observation belongs to, based on the characteristics of p predictor (X) variables, using linear composites of the predictor variables. The criterion for these composites is that the between-group variance is maximized relative to the within-group variance. Each new linear composite is uncorrelated with the previous ones (see Figures 1, 3 and 4), but they are not necessarily orthogonal (not all at 90-degree angles).

2. To minimize misclassification error rates. Once the discriminating functions are found, Fisher's linear discriminating functions (one per group) can be used to predict group membership for another data set (Figure 2).

3. To determine whether the group centroids are statistically different. The group centroid is the average value (average discriminant score) for the linear composite of the predictor variables. This can also be found by inputting the averages of each of the predictor variables to find the average discriminant score.

4. To determine the number of statistically significant discriminant axes (see Figures 3 and 4).

5. To determine which of the predictor variables contributes most to discriminating among groups.
Relationship of MDA to Other Techniques:

Unlike Cluster Analysis, the group to which each entity belongs is known. As with Regression Analysis, a prediction model is wanted; however, the dependent variable in MDA is a category (ordinal or nominal scale), rather than a continuous variable as in regression analysis. MDA is the reverse of Multivariate Analysis of Variance (MANOVA): the continuous variables are the dependent variables in MANOVA and the classes are the predictor variables. PCA can be used as an initial step in discriminant analysis to reduce the number of predictor variables. A related procedure is to fit a series of logistic models. These include Probit and Logit analysis, which predict the probability of a yes or no (2 classes). This can be extended to a multinomial logit by fitting a series of 2-class models. Probit, Logit, and multinomial logit are not covered in this course.

Procedure, Canonical Discriminant Analysis:

The prediction models (linear combinations of the predictor variables) are based on a Learning Data Set. This set is comprised of sample observations of the X variables for each of the k groups:

\[
X_j =
\begin{bmatrix}
x_{11} & x_{12} & x_{13} & \cdots & x_{1p} \\
x_{21} & x_{22} & x_{23} & \cdots & x_{2p} \\
x_{31} & x_{32} & x_{33} & \cdots & x_{3p} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
x_{n(j)1} & x_{n(j)2} & x_{n(j)3} & \cdots & x_{n(j)p}
\end{bmatrix},
\qquad n = n_1 + n_2 + n_3 + \cdots + n_k, \qquad j = 1, 2, 3, \ldots, k
\]
The idea is to determine functions of the X variables that in some way separate the k groups as well as possible. The simplest is to take linear combinations of the X variables:

\[ z_1 = b_{11} x_1 + b_{12} x_2 + b_{13} x_3 + \cdots + b_{1p} x_p \]

where z_1 is a vector of discriminant scores, one value for each of the n observations; there is one such vector for each of the r discriminating functions. There can be z_2, z_3, etc., up to the smaller of p or k-1 different linear functions.

In order to obtain the "best" values for the coefficients, we want to maximize the ratio of the between-group variance to the within-group variance:

\[ \lambda = \frac{b^T B\, b}{b^T W\, b} \]

where T is the total variation of all the predictor variables, and T can be divided into the variation between groups (B) and the variation within groups (W). The calculation of these is as follows: (see presentation in class)
We want to find a coefficient matrix b such that this ratio is maximized. We also have the constraint that the resulting discriminant scores are uncorrelated. To begin, we take first derivatives and set them equal to zero. Then, using a Lagrangian multiplier as we did in finding the principal components of a matrix, we obtain:

\[ (B - \lambda W)\, b = 0 \]

which is equivalent to:

\[ (W^{-1} B - \lambda I)\, b = 0 \]

b are the eigenvectors of W^{-1}B (the discriminant weights); λ are the eigenvalues of W^{-1}B, ordered from largest to smallest. W^{-1}B is nonsymmetric, so the eigenvectors are uncorrelated but will not be orthogonal.

The relative weight of a function can be expressed as the ratio of its associated eigenvalue to the sum of the eigenvalues over all r discriminating functions:

\[ RW_i = \frac{\lambda_i}{\sum_{v=1}^{r} \lambda_v} \]

This indicates which axis captures the most variation. The cosine of the angle between two discriminating functions can be found from the inner product of the two eigenvectors:

\[ \cos(\theta_{uv}) = b_u^T b_v \]
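The eigen-decomposition above can be sketched in Python with NumPy. This is a minimal illustration, not part of the notes: the function name, the SSCP formulation of B and W, and the toy data in the usage are all assumptions. Because W^{-1}B is nonsymmetric, a general eigen-solver is used and the real parts are kept.

```python
import numpy as np

def discriminant_weights(X, groups):
    """Discriminant weights b as eigenvectors of W^-1 B.

    X: (n, p) array of predictor variables; groups: length-n label array.
    Returns eigenvalues (largest first) and the matrix whose columns are b.
    """
    labels = np.unique(groups)
    grand_mean = X.mean(axis=0)
    p = X.shape[1]
    B = np.zeros((p, p))  # between-group sums of squares and cross-products
    W = np.zeros((p, p))  # within-group sums of squares and cross-products
    for g in labels:
        Xg = X[groups == g]
        d = (Xg.mean(axis=0) - grand_mean).reshape(-1, 1)
        B += len(Xg) * d @ d.T
        Xc = Xg - Xg.mean(axis=0)
        W += Xc.T @ Xc
    # W^-1 B is nonsymmetric, so use the general (not symmetric) solver
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(W, B))
    order = np.argsort(eigvals.real)[::-1]  # largest eigenvalue first
    return eigvals.real[order], eigvecs.real[:, order]
```

The relative weights then follow directly as `vals / vals.sum()`, matching the RW ratio above; at most min(p, k-1) eigenvalues are nonzero.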
Assumptions:

1. Multivariate normality of the predictor variables (all continuous and normally distributed).

2. Homogeneity of the variance-covariance matrix over all k groups.

Discriminant analysis is not robust to violations of these assumptions. If they are not met, the resulting tests of significance will not be reliable (the package could report a "p value" of 0.001 when the real value is 0.30). The discriminant analysis can still be used as a descriptive tool in this case, but cannot be used to test hypotheses about the discriminant functions. If #1 holds but #2 does not, a quadratic discriminating function can be used instead of a linear discriminating function. If #1 does not hold, the estimates of the misclassification error rates may also be biased.

Misclassification Error Rates

Fisher's linear discriminating functions (one function for each of the k groups) can be used to predict group membership, based on an observation of the predictor variables. The value of each of the k Fisher's linear discriminating functions is determined using the values of the predictor variables; the highest value indicates the group membership.

Alternatively, the vector of discriminating scores (z) could be found using the r linear discriminating functions (r less than k), by inputting the set of values for the predictor variables. Then, the vector of average discriminating scores using the
average values of the predictor variables in the learning data set could be found for each group. For each group m, using the vector of averages (z̄_m), the Mahalanobis distance would then be calculated:

\[ D_m^2 = (z - \bar{z}_m)^T C^{-1} (z - \bar{z}_m) \]

where C is the covariance matrix for the X variables. The new data point, represented by the vector z, is then predicted to belong to the group having the lowest Mahalanobis distance.

Based on the prediction models obtained using the learning data set, the number of incorrectly classified observations (misclassification error rate) can be determined using the learning set data and Fisher's linear discriminating functions. This error rate will be underestimated, since the same data were used to establish the prediction models. There are several alternatives for estimating the misclassification error rate:

1. Calculate the error rate using a new data set.

2. Split the original data set into two subsets. One part of the data would be used to fit the discriminating functions, and the other would be used to calculate the misclassification error rate.

3. Cross-Validation. The process for cross-validation is to 1) fit the discriminating functions using all but one of the observations in the data set; 2) calculate the error rate for the reserved observation; 3) repeat 1 and 2 by reserving a different observation, until each observation has been reserved once (i.e., fit the discriminating functions n times).
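The Mahalanobis classification rule and the cross-validation procedure in alternative 3 can be sketched as follows. This is a minimal illustration, not the notes' own code: the function names are hypothetical, and a pooled within-group covariance matrix is assumed for C.

```python
import numpy as np

def mahalanobis_classify(x, means, C_inv):
    """Assign x to the group whose centroid gives the smallest D^2."""
    d2 = [(x - m) @ C_inv @ (x - m) for m in means]
    return int(np.argmin(d2))

def loo_error_rate(X, groups):
    """Leave-one-out (cross-validation) misclassification error rate."""
    n = len(X)
    labels = np.unique(groups)
    errors = 0
    for i in range(n):
        mask = np.arange(n) != i            # reserve observation i
        Xt, gt = X[mask], groups[mask]
        means = [Xt[gt == g].mean(axis=0) for g in labels]
        # pooled within-group covariance from the remaining n-1 points
        C = sum((Xt[gt == g] - Xt[gt == g].mean(axis=0)).T
                @ (Xt[gt == g] - Xt[gt == g].mean(axis=0)) for g in labels)
        C /= (len(Xt) - len(labels))
        pred = mahalanobis_classify(X[i], means, np.linalg.inv(C))
        if labels[pred] != groups[i]:
            errors += 1
    return errors / n
```

As the notes observe, refitting on all but one observation n times avoids the optimistic bias of scoring the same data that built the model.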
4. n-way validation. A modification of cross-validation is to divide the data into groups. Discriminant analysis is performed using all but one of the groups of data, and the reserved group is then used to test the functions. This is repeated by reserving a different group each time, and the average error rate is then calculated.

Considerations in data splitting include: 1) random split? 2) random split by group? 3) enough data left to obtain a reasonable discriminating model?

Which centroids differ?

For 2 groups, the difference between the group centroids can be tested (H0: μ1 = μ2). The Mahalanobis distance between the two centroids is defined as:

\[ D^2 = (\bar{z}_1 - \bar{z}_2)^T C^{-1} (\bar{z}_1 - \bar{z}_2) \]

A transformation of this distance can be used to test for differences between the two group centroids:

\[ F = \frac{n_1 n_2 (n_1 + n_2 - p - 1)}{(n_1 + n_2)(n_1 + n_2 - 2)\, p}\, D^2 \]

Under the null hypothesis that the two group centroids are the same, this is distributed as an F distribution, compared at the 1-α percentile with p and n1 + n2 - p - 1 degrees of freedom. For more than two groups, this two-group test is often performed for every pair of groups; however, the 1 - α/(no. of pairs) percentile should then be used instead of the 1 - α percentile.
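The D² calculation and its F transformation can be sketched like this (a hedged illustration: the function name and the pooled covariance estimate of C are assumptions; SciPy's F distribution supplies the percentile):

```python
import numpy as np
from scipy import stats

def two_group_centroid_test(X1, X2):
    """F test of H0: mu1 = mu2 via the Mahalanobis distance D^2.

    X1: (n1, p) and X2: (n2, p) arrays of observations for the two groups.
    Returns the F statistic and its p-value on (p, n1+n2-p-1) df.
    """
    n1, p = X1.shape
    n2 = X2.shape[0]
    diff = X1.mean(axis=0) - X2.mean(axis=0)
    # pooled covariance matrix (assumed common to both groups)
    S = ((X1 - X1.mean(axis=0)).T @ (X1 - X1.mean(axis=0)) +
         (X2 - X2.mean(axis=0)).T @ (X2 - X2.mean(axis=0))) / (n1 + n2 - 2)
    D2 = diff @ np.linalg.inv(S) @ diff
    F = (n1 * n2 * (n1 + n2 - p - 1)) / ((n1 + n2) * (n1 + n2 - 2) * p) * D2
    pval = stats.f.sf(F, p, n1 + n2 - p - 1)   # upper-tail probability
    return F, pval
```

For the many-group case, the same function would be called for every pair, with the Bonferroni-style α/(no. of pairs) adjustment the notes describe.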
An alternative, related test is Hotelling's T-squared test:

\[ T^2 = \frac{n_1 n_2}{n_1 + n_2}\, D^2 \]

The test statistic is then calculated as:

\[ \frac{n_1 + n_2 - p - 1}{p\,(n_1 + n_2 - 2)}\, T^2 \]

which is distributed as the F distribution, compared at the 1-α percentile with p and n1 + n2 - p - 1 degrees of freedom. Again, this can be used for more groups by testing every pair of groups, but using the 1 - α/(no. of pairs) percentile instead of the 1 - α percentile.

Both of these tests assume a multivariate normal distribution of the data, and that the covariance matrix is the same for the two groups.

Which linear composites (discriminant functions) should be retained?

1. Significance of the eigenvalues as a group (all r discriminant functions): Calculate

\[ V = \left[ n - 1 - \tfrac{1}{2}(p + k) \right] \sum_{\ell=1}^{r} \ln(1 + \lambda_\ell) \]

where n, p, and k are as defined above. Compare to the Chi-square distribution with p(k-1) degrees of freedom at the 1-α percentile. If V is greater than the critical value, the discriminant functions as a group are significant.
9 2. Significance of function : Cacuate { } V = ( n 1) 1/ 2( p+ k) n( 1+ λ ) Compare to Chi Square distribution with p+k-2 degrees of freedom. If V greater than critica vaue, discriminant function is significant. is Which X Variabes are most important to the Discriminant Scores? 1. Discriminant weight: probems with these are that they reate to variabe size and reationship among variabes (dependence of predictor variabes). 2. Discriminant oadings: gives simpe correation coefficient of variabe with the discriminant scores. Cacuation of Discriminant oadings: Let C -1/2 be the square root of diagona eements of the variance-covariance matrix of the predictor variabes (standard deviations) and et R be the correation matrix for X (pairwise correations between the origina variabes). Then: 1. b * C = 1/2 b to get scaed weights 2. R b to get scaed weights * correation for X which resuts in = * correations between each X with each discriminating function (discriminant scores). where the are vectors of discriminant oadings for the discriminant function.
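Both retention tests above apply the same scaling constant, n - 1 - (p+k)/2, to log terms in the eigenvalues. A minimal sketch following the notes' formulas (the function name and the default α = 0.05 are assumptions):

```python
import numpy as np
from scipy import stats

def retention_tests(eigvals, n, p, k, alpha=0.05):
    """Chi-square tests for retaining discriminant functions.

    eigvals: eigenvalues of W^-1 B, largest first; n observations,
    p predictor variables, k groups.
    """
    c = n - 1 - 0.5 * (p + k)              # common scaling constant
    # 1. All r functions as a group: df = p(k-1)
    V_all = c * np.log1p(eigvals).sum()
    sig_all = V_all > stats.chi2.ppf(1 - alpha, p * (k - 1))
    # 2. Each function separately: df = p + k - 2
    V_each = c * np.log1p(eigvals)
    sig_each = V_each > stats.chi2.ppf(1 - alpha, p + k - 2)
    return V_all, sig_all, V_each, sig_each
```

A typical pattern is that the group test and the first function are significant while trailing eigenvalues near zero are not, which fixes the number of axes to retain.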
Tools for Interpretation:

1. Plot the group centroids: one centroid for each group, for each discriminant function (Figure 5).

2. Plot the group overlaps (Figure 6).

3. It is also possible to alter the prior probabilities (equal, sample based, other).

4. Stepwise discriminant analysis is possible, but it is based on the multivariate normal distribution function.

References

Dillon, W.R. and Goldstein, M. 1984. Multivariate analysis: methods and applications. John Wiley and Sons, Toronto. [and textbooks for the course]