Discriminant Analysis
V.Čekanavičius, G.Murauskas

Discriminant analysis

One categorical variable depends on one or more normally distributed variables.
Discriminant analysis can be used for forecasting.
Data

$(X_{11}, X_{21}, X_{31}, \dots, X_{k1}, Y_1), \dots, (X_{1n}, X_{2n}, X_{3n}, \dots, X_{kn}, Y_n)$.

The dependent variable $Y$ is categorical.
The independent variables $X_1, \dots, X_k$ are normally distributed within each category.
The covariance matrices of the independent variables are equal across categories.

[Diagram: the normal variables $X_1$, $X_2$, $X_3$ point to the categorical variable $Y$.]
Checking for assumptions

In the social sciences, assumptions are rarely checked.
Slight violations of the assumptions are tolerated.

Discriminant properties of each independent variable

The smaller a variable's Wilks' lambda, the better its discrimination properties.
A variable is statistically significant if the p-value of its Wilks' test satisfies p < 0.05.
Typically, non-significant variables are dropped from the model.
Sometimes a non-significant variable is retained (if the model with it is much better).
Large Wilks' lambda

If some variables have much larger Wilks' lambdas than the others (5-7 times larger), one should try discriminant analysis without those variables.
Note that dropping variables changes ALL characteristics of the model (Box's test, classification table, etc.).

Canonical functions

Just as in MANOVA, special functions are constructed that account for the variance of the independent variables:

$f_1(x) = a_1 + b_{11} X_1 + b_{21} X_2 + \dots + b_{k1} X_k$,
$f_2(x) = a_2 + b_{12} X_1 + b_{22} X_2 + \dots + b_{k2} X_k$,
...

If $Y$ has $m$ possible values, then $m - 1$ canonical functions are constructed (provided there are at least $m - 1$ independent variables).
The functions are ordered by importance: first > second > ...
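As a minimal sketch of how canonical functions are evaluated, the snippet below computes $f_j(x) = a_j + \sum_i b_{ij} X_i$ for one respondent. The intercepts and coefficients are hypothetical numbers chosen for illustration only (they are not estimates from any data set), and the helper names are assumptions of this sketch.

```python
def n_canonical_functions(n_categories, n_variables):
    """Number of canonical functions: one fewer than the number of
    categories, but never more than the number of independent variables."""
    return min(n_categories - 1, n_variables)


def canonical_scores(x, intercepts, coefficients):
    """Evaluate every canonical function at the point x.

    coefficients[j][i] is the weight of variable i in function j,
    intercepts[j] is the constant a_j of function j.
    """
    return [
        a + sum(b * xi for b, xi in zip(row, x))
        for a, row in zip(intercepts, coefficients)
    ]


# HYPOTHETICAL coefficients, for illustration only.
intercepts = [-10.0, 2.0]
coefficients = [
    [0.5, 0.25, 0.0],    # weights of X1, X2, X3 in f_1
    [-0.5, 0.25, 0.5],   # weights of X1, X2, X3 in f_2
]

n_functions = n_canonical_functions(n_categories=3, n_variables=3)
scores = canonical_scores([30, 80, 70], intercepts, coefficients)
print(n_functions, scores)
```

With three categories and three independent variables the sketch reports two canonical functions, which matches the example analysed later in these slides.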
Taking canonical functions into account

Canonical functions help to check which of the model's variables are most important.
In many cases only the first canonical function really matters.

Checking for a dominant first canonical function

If there is more than one canonical function, we check what percentage of the common variance is explained by the first canonical function.
No percentage is checked if there is only one canonical function (Y has two values).
If the first canonical function dominates

The more important a variable, the larger the absolute value of its standardized coefficient.
The more important a variable, the larger the absolute value of its correlation with the first canonical function.
Unimportant variables are candidates for dropping from the model.

Classification table

One of the main indicators of model fit.
The classification table shows the correct and incorrect classifications obtained when the discriminant analysis model is applied to the initial data.
Standard investigation:

Classification table.
Checking which canonical functions are more important.
Checking which variables are more important.
Wilks' test for significant variables.
(Forecasting.)
Checking for normality (Kolmogorov-Smirnov p > 0.05).
Checking for equality of covariance matrices (Box's test p > 0.05).

Example:

Is it possible to distinguish among Lithuanians, Latvians and Estonians on the basis of their answers to questions about:
the sea (test1),
sport (test2),
neighboring countries (test3)?
Data

[Screenshot of the data set.]

Analyze -> Classify -> Discriminant

[Screenshot of the menu path.]
Analyze -> Classify -> Discriminant

[Dialog screenshot: annotations mark where the dependent variable and the independent variables are entered.]

Statistics

[Dialog screenshot: annotations mark the boxes to check.]
Classify -> Discriminant

[Dialog screenshot: an annotation marks the button to press next.]

Classify

[Dialog screenshot: annotations mark the boxes to check.]
General statistics

Tests of Equality of Group Means

         Wilks' Lambda   F         df1   df2   Sig.
TEST1    .039            406.803   2     33    .000
TEST2    .572            12.364    2     33    .000
TEST3    .311            36.491    2     33    .000

(The Sig. column contains the p-values.)

All variables are statistically significant. However, Wilks' lambda is small for TEST1 (sea) only.
TEST1 is the most important variable in the model.
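For a single variable, Wilks' lambda is the ratio of the within-group sum of squares to the total sum of squares, so the one-way ANOVA F statistic can be recovered as $F = \frac{1 - \Lambda}{\Lambda} \cdot \frac{df_2}{df_1}$. The sketch below applies this identity to the lambdas printed in the table; because those lambdas are rounded to three decimals, the recomputed F values agree with the table only approximately. The function name is an assumption of this sketch.

```python
def f_from_wilks_lambda(wilks_lambda, df1, df2):
    """Univariate F statistic implied by a Wilks' lambda.

    Lambda = SS_within / SS_total, hence
    F = (SS_between / df1) / (SS_within / df2)
      = ((1 - Lambda) / Lambda) * (df2 / df1).
    """
    return (1.0 - wilks_lambda) / wilks_lambda * df2 / df1


# Lambdas as printed in the table (rounded to three decimals).
lambdas = {"TEST1": 0.039, "TEST2": 0.572, "TEST3": 0.311}

# df1 = groups - 1 = 2, df2 = cases - groups = 36 - 3 = 33.
f_values = {name: f_from_wilks_lambda(lam, 2, 33) for name, lam in lambdas.items()}

for name, f in f_values.items():
    print(f"{name}: F is approximately {f:.1f}")
```

The identity also makes the interpretation visible: a lambda close to 0 corresponds to a very large F (strong group separation), while a lambda close to 1 corresponds to an F near 0.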
REMARK

Since Wilks' lambda is small for TEST1 only, one should also try discriminant analysis without TEST2 and TEST3 and then compare the two models.
(The results of this comparison are omitted here.)

Test Results

Box's M           9.094
F     Approx.     .880
      df1         12
      df2         1791.789
      Sig.        .682

Tests null hypothesis of equal population covariance matrices.

Box's test p > 0.05: the covariance matrices do not differ significantly.
Summary of Canonical Discriminant Functions

Eigenvalues

Function   Eigenvalue   % of Variance   Cumulative %   Canonical Correlation
1          33.751 a     99.6            99.6           .986
2          .129 a       .4              100.0          .338

a. First 2 canonical discriminant functions were used in the analysis.

$f_1$ explains 99.6% of the common variance, $f_2$ explains 0.4%. $f_1$ dominates.

Structure Matrix

         Function
         1         2
TEST1    .854 *    .498
TEST2    -.136     .987 *
TEST3    .254      .514 *

Pooled within-groups correlations between discriminating variables and standardized canonical discriminant functions.
Variables ordered by absolute size of correlation within function.
*. Largest absolute correlation between each variable and any discriminant function.

The strongest correlation of $f_1$ is with TEST1 (sea).
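The percentages of variance and the canonical correlations in the eigenvalue table follow directly from the eigenvalues: each function's share is its eigenvalue divided by the sum of all eigenvalues, and the canonical correlation is $\sqrt{\lambda / (1 + \lambda)}$. A minimal sketch using the two eigenvalues printed in the table (variable names are assumptions of this sketch):

```python
import math

# Eigenvalues of the two canonical functions, as printed in the table.
eigenvalues = [33.751, 0.129]

total = sum(eigenvalues)

# Share of the common variance explained by each canonical function.
pct_of_variance = [100.0 * ev / total for ev in eigenvalues]

# Canonical correlation of each function with the grouping variable.
canonical_corr = [math.sqrt(ev / (1.0 + ev)) for ev in eigenvalues]

for j, (pct, corr) in enumerate(zip(pct_of_variance, canonical_corr), start=1):
    print(f"f{j}: {pct:.1f}% of variance, canonical correlation {corr:.3f}")
```

A first function whose share is close to 100% is what the slides call a dominant first canonical function.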
Canonical Discriminant Functions

[Scatter plot: Function 1 (horizontal axis) against Function 2 (vertical axis); cases and group centroids are labelled by SALIS: lietuviai (Lithuanians), latviai (Latvians), estai (Estonians).]

$f_1$: good discrimination. $f_2$: bad discrimination.

Classification Results a

                               Predicted Group Membership
                 SALIS         1 lietuviai   2 latviai   3 estai   Total
Original  Count  1 lietuviai   16            0           0         16
                 2 latviai     0             11          2         13
                 3 estai       0             2           5         7
          %      1 lietuviai   100.0         .0          .0        100.0
                 2 latviai     .0            84.6        15.4      100.0
                 3 estai       .0            28.6        71.4      100.0

a. 88.9% of original grouped cases correctly classified.
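The overall rate of 88.9% and the per-group percentages can be recomputed from the counts in the classification table: correct classifications lie on the diagonal. A minimal sketch, with the counts copied from the table (variable names are assumptions of this sketch):

```python
# Rows: original group; columns: predicted group.
# Order in both directions: lietuviai, latviai, estai.
groups = ["lietuviai", "latviai", "estai"]
counts = [
    [16, 0, 0],
    [0, 11, 2],
    [0, 2, 5],
]

# Correctly classified cases are on the main diagonal.
correct = sum(counts[i][i] for i in range(len(groups)))
total = sum(sum(row) for row in counts)
overall_pct = 100.0 * correct / total

# Percentage of correct classifications within each original group.
per_group_pct = [100.0 * counts[i][i] / sum(counts[i]) for i in range(len(groups))]

print(f"overall: {overall_pct:.1f}%")
for name, pct in zip(groups, per_group_pct):
    print(f"{name}: {pct:.1f}%")
```

Note that these percentages are obtained on the same data that were used to fit the model, as stated on the classification table slide.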
Classification Results: percentages of correct classification

The diagonal of the % block of the classification table gives the percentage of correct classifications for each group: 100.0% for lietuviai, 84.6% for latviai and 71.4% for estai.
Overall, 88.9% of original grouped cases are correctly classified.

(forecasting)

Classification Function Coefficients

              SALIS
              1 lietuviai   2 latviai   3 estai
TEST1         -1.234        .461        .163
TEST2         7.881         6.221       6.221
TEST3         1.101         .685        .780
(Constant)    -351.724      -301.126    -278.343

Fisher's linear discriminant functions

The rows TEST1, TEST2 and TEST3 are the variables; each column defines the function of one group.
For Lithuanians, Fisher's function is -1.23*TEST1 + 7.88*TEST2 + 1.10*TEST3 - 351.72.
(forecasting)

For Latvians, Fisher's function is 0.46*TEST1 + 6.22*TEST2 + 0.68*TEST3 - 301.12.

For Estonians, Fisher's function is 0.16*TEST1 + 6.22*TEST2 + 0.78*TEST3 - 278.34.
Forecasting

Let TEST1 = 30, TEST2 = 80, TEST3 = 70. Fisher's functions then give:
For Lithuanians = 318.78.
For Latvians = 257.88.
For Estonians = 278.66.
We forecast that this respondent is Lithuanian (the largest value of the corresponding Fisher's function).
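The forecasting rule can be sketched in a few lines: evaluate each group's Fisher function at the respondent's answers and pick the group with the largest value. The coefficients below are the two-decimal values used in the Fisher functions written out on the slides; the function and variable names are assumptions of this sketch.

```python
# Fisher's linear discriminant functions with the two-decimal coefficients
# written out on the slides: (TEST1, TEST2, TEST3, constant).
fisher = {
    "Lithuanians": (-1.23, 7.88, 1.10, -351.72),
    "Latvians": (0.46, 6.22, 0.68, -301.12),
    "Estonians": (0.16, 6.22, 0.78, -278.34),
}


def fisher_scores(test1, test2, test3):
    """Value of every group's Fisher function for one respondent."""
    return {
        group: b1 * test1 + b2 * test2 + b3 * test3 + const
        for group, (b1, b2, b3, const) in fisher.items()
    }


def forecast(test1, test2, test3):
    """Forecast the group whose Fisher function is largest."""
    scores = fisher_scores(test1, test2, test3)
    return max(scores, key=scores.get)


scores = fisher_scores(30, 80, 70)
predicted = forecast(30, 80, 70)

for group, value in scores.items():
    print(f"{group}: {value:.2f}")
print("forecast:", predicted)
```

Only the ranking of the three values matters for the forecast, so small differences caused by rounding the coefficients do not change the predicted group here.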
Checking for normality

Normality of the variables should be checked within each category of the dependent variable.
One can use Select Cases (three times) or Split File.
We demonstrate the latter option.

Data -> Split File

[Dialog screenshot: annotations mark the variable to move and the option to check.]
Analyze -> Descriptive Statistics -> Explore

[Dialog screenshots: annotations mark the variables to move, the button to press and the boxes to check.]
salis = 1 lietuviai

For Lithuanians: test1 is not normal (p < 0.05); test2 and test3 are normal.

The Q-Q plot for test2 also shows similarity to the normal distribution (all points are close to the line).
salis = 2 latviai

For Latvians: test1, test2 and test3 are normal.

Checking for normality

One should check normality for ALL variables and all categories (Lithuanians, Latvians and Estonians).
Kolmogorov-Smirnov and/or Shapiro-Wilk tests, and sometimes Q-Q plots, suffice.