Running Head: PROBLEMS WITH STEPWISE IN DA

Problems with Stepwise Procedures in Discriminant Analysis

James M. Graham

Texas A&M University

Graham, J. M. (2001, January). Problems with stepwise procedures in discriminant analysis. Paper presented at the annual meeting of the Southwest Educational Research Association, New Orleans, LA.
Abstract

Stepwise procedures are a common analytic technique used in discriminant analysis to reduce the number of variables. Despite the frequency of their use, these procedures are rife with errors and are best replaced with more accurate procedures such as all-possible-subsets analyses. The rationale behind stepwise procedures is discussed. Problems with stepwise procedures for both descriptive and predictive discriminant analysis are outlined, and alternative solutions to the stepwise problem are explored. Heuristic examples are used throughout to make the discussion clearer.
Problems with Stepwise Procedures in Discriminant Analysis

Stepwise procedures are a common analytic procedure used in psychological and educational research to reduce the number of variables and to order variables in a given analysis (Thompson, 1995). As noted by Huberty (1994), "It is quite common to find the use of stepwise analyses reported in empirically based journal articles" (p. 261). However, the use of stepwise procedures entails a number of problems which can lead to misleading and inaccurate results. As a result, Cliff (1987) noted that "a large proportion of the published results using this method probably present conclusions that are not supported by the data." The practice of employing stepwise procedures is particularly confounding, as software is available to conduct all-possible-subsets analyses, a better alternative to stepwise procedures (Lautenschlager, 1991; McCabe, 1975; Morris & Meshbane, 1994).

Stepwise procedures are most commonly employed in multiple regression and discriminant analysis. As a detailed description of the problems with stepwise procedures as they apply to multiple regression already exists (Thompson, 1995), the present paper focuses on stepwise procedures as applied to descriptive and predictive discriminant analysis (DDA and PDA, respectively). The problems occurring when stepwise procedures
are applied to both DDA and PDA are examined, and heuristic data sets are used to provide examples. Throughout the discussion, stepwise results are compared to the results of all-possible-subsets analyses, which are presented as a viable alternative to the stepwise problem.

Stepwise Procedures

Variable Selection

At times, researchers with a large number of dependent variables are not sure whether all variables in their analysis are necessary or useful (Klecka, 1980). Stepwise procedures are often used to reduce the number of dependent variables to the best set of variables for describing group differences (DDA) or for predicting group membership (PDA). Selecting a smaller set of variables is often necessary, for a variety of reasons. The concept of parsimony is especially relevant here, as a smaller number of variables may be easier to interpret and may provide a simpler solution than a larger number of variables. Researchers may also have a large number of dependent measures and want to remove variables which are redundant or which measure the same aspect of group differences. Finally, variable reduction may be necessary due to the cost of administering a large number of instruments in a research or clinical setting. In the case of an ongoing research project, it may be useful for researchers to
determine which measures are unnecessary for future studies (Huberty, 1989).

Variable Ordering

Results from stepwise procedures are also often used to order variables in terms of their importance in describing group differences in DDA, or in predicting group membership in PDA (Huberty, 1989). For instance, the variable selected in the first step of a stepwise procedure might be labeled the single most important variable (the one with the best descriptive or predictive ability) in the study. The variable selected in the second step might be labeled the second most important, and so on.

How Stepwise Procedures Work

To better understand the problems inherent in stepwise procedures, it is important to understand how they work. Stepwise procedures in discriminant analysis seek to select or order variables by their contribution to separation between groups. The default method for many statistical software programs is to examine the Wilks' Lambda for each variable; consequently, this is the most frequently used criterion, though there are other methods (Mahalanobis distance, unexplained variance, Rao's V, etc.). The astute reader will recognize that the Wilks' Lambda is of no interest
when using PDA. This will be discussed at greater length later; the following description applies only to DDA.

Forward stepwise procedures. In the first step of a forward stepwise analysis, each variable is entered into a separate analysis, and the variable with the best univariate discrimination (the lowest Wilks' Lambda, in most cases) is selected. Next, each remaining variable is paired with the first and entered into a separate analysis. The variable which, when paired with the first, provides the best multivariate discrimination (again, most often the lowest Wilks' Lambda) is selected next. The third step matches each remaining variable with the first two, and so on. This process continues until either all variables are selected or the decrease in Wilks' Lambda is insufficient to warrant further variable selection, as determined by the F-ratio. In addition, stepwise procedures may also remove variables already included if they are found to become irrelevant (better explained by other variables) over the course of the analysis. It is important to note that while stepwise procedures can remove variables, they often do not.

Backward stepwise procedures. Stepwise procedures can also be used in a reverse manner, to similar effect. Initially, all variables are included in a discriminant analysis. Next, separate analyses are run, each removing a single variable. The variable which contributes the least to
group differences (the variable which, when removed, results in the smallest increase in Wilks' Lambda) is removed. This process continues until either all variables are removed or the change in the Wilks' Lambda is too great.

Problems With Stepwise Procedures in DDA

Variable Selection

Stepwise procedures in DDA often seek to answer the question, "Which subset of variables will best describe group differences?" However, stepwise procedures do not necessarily select the best set of variables for describing group differences (Huberty & Barton, 1989). By entering variables one at a time, stepwise procedures do not include all of the information supplied jointly by two or more variables not already included in the analysis (Huberty, 1989). To determine the subset of variables which best describes the differences between groups, it is necessary to conduct an all-possible-subsets analysis, which compares all possible groupings of the variables, rather than to use stepwise procedures.

The following example uses data included in version 9.0 of SPSS (SPSS, 1998), entitled "Employee data.sav." In this example, five variables (educational level, employment category, current salary, beginning salary, and previous experience) are used to describe the differences between minorities and nonminorities. Table 1 summarizes the SPSS output for this
analysis, using stepwise discriminant analysis. As shown in Table 1, the use of stepwise discriminant analysis has reduced the initial set of five dependent variables to two: variables 3 and 5. During the first step of the analysis, stepwise procedures selected variable 3 as the single variable which best describes the differences between minorities and non-minorities. In the next step, variable 5 was selected as the variable which best describes the differences between groups, given the presence of variable 3.

INSERT TABLE 1 ABOUT HERE

Next, consider the all-possible-subsets analysis conducted using MCCABEPC (Lautenschlager, 1991) on the same data, presented in Table 2. These results show, once again, that variable 3 provides the best univariate description of group differences. The best group of two variables, however, is not variables 3 and 5, as shown in the stepwise analysis. While the stepwise analysis has identified the best set of two variables given the presence of variable 3, the all-possible-subsets analysis has identified variables 4 and 5 as the best group of two variables. In fact, of the top ten subsets of two variables, less than half even contain variable 3!
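The greedy logic just described can be sketched in a few lines. The code below is an illustrative reimplementation, not SPSS's algorithm: the function names and the synthetic example data are the author's of this sketch, and Wilks' Lambda is computed directly as det(W)/det(T) from the within-groups and total SSCP matrices.

```python
import numpy as np

def wilks_lambda(X, y, cols):
    """Wilks' Lambda for the variables in `cols`: det(W) / det(T), where
    W is the pooled within-groups SSCP matrix and T the total SSCP matrix.
    Lower values indicate better group separation."""
    Xs = X[:, cols]
    dev_total = Xs - Xs.mean(axis=0)
    T = dev_total.T @ dev_total
    W = np.zeros_like(T)
    for g in np.unique(y):
        dev_g = Xs[y == g] - Xs[y == g].mean(axis=0)
        W += dev_g.T @ dev_g
    return np.linalg.det(W) / np.linalg.det(T)

def forward_stepwise(X, y, n_select):
    """Greedy forward selection: at each step, add the variable that yields
    the lowest multivariate Wilks' Lambda together with those already chosen.
    Subsets that exclude an earlier pick are never examined -- the flaw
    discussed in the text."""
    selected = []
    remaining = list(range(X.shape[1]))
    for _ in range(n_select):
        best = min(remaining, key=lambda j: wilks_lambda(X, y, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected
```

Because each step conditions on all earlier picks, the first variable selected is guaranteed to appear in every subset the procedure ever evaluates, which is exactly why the best pair in the example above (variables 4 and 5) was never considered.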
INSERT TABLE 2 ABOUT HERE

Stepwise procedures do not always select the best subset of variables of a given size. To determine the best subset, it is important to run an all-possible-subsets analysis. In the above example, variable 3 was initially selected by the stepwise procedure because it provides the best univariate description of group differences. Because stepwise procedures select all variables after the first based on the presence of all previous variables, they fail to consider subsets which do not include the previous variables. In the above example, the best subset of two variables was never considered by the stepwise procedure, because it did not include the initial variable selected by the stepwise procedure (variable 3)!

Variable Ordering

Results from stepwise procedures are often used to place variables in order of importance. For instance, in the above example, variable 3 might be called the most important variable in describing group differences and variable 5 the second most important. However, it is important to realize that, due to shared variance, a variable which describes a large amount of shared differences may be entered late, or not at all (Huberty, 1989). Again, in the above example, variable 4, which provides
the second best univariate description of group differences, is never entered into the stepwise DDA. It is a mistake to use the results of a stepwise procedure to order variables according to their importance.

Capitalization on Sampling Error

While all statistical analyses are affected by sampling error to some extent, stepwise procedures are especially susceptible to it. This is because stepwise procedures select the variable with the lowest Wilks' Lambda to be entered, no matter how small the difference. If an error is made in selecting a variable (due to sampling error), all subsequent steps will be incorrect. As a result, stepwise procedures often produce results which are unlikely to replicate in subsequent samples (Thompson, 1995).

All-possible-subsets analyses are less susceptible to sampling error than stepwise procedures, as error within the sample is not compounded as it is in stepwise procedures. In addition, while stepwise analyses provide only a single set of variables, all-possible-subsets analyses allow the researcher to examine a number of equally plausible subsets, providing an opportunity to select a set of variables based on theory. For example, consider the scree plot of the best subsets of a given size shown in Figure 1, taken from the all-possible-subsets analysis above. From this, we can see that there is little
difference between the best subsets of 5, 4, 3, and 2 variables. From this, the researcher might discern that a two-variable solution provides the best subset size. Figure 2 shows the Wilks' Lambda values for the best ten variable pairs. This figure shows that the first three pairs of variables appear equally plausible in comparison with the others. To invoke the work of William of Ockham, it would make sense for the researcher to choose, from among the first three, the pair of variables which is the easiest to explain.

INSERT FIGURES 1 AND 2 ABOUT HERE

McCabe (1975) provides an excellent demonstration of how stepwise methods capitalize on sampling error. In McCabe's example, a population of 400 individuals was used to draw 100 samples. Table 3 shows, for a given subset size, the number (out of 100) of samples with the selected subset in the top grouping of best subsets for the population. For example, for the best subset of two variables, 30 out of 100 all-possible-subsets analyses actually selected the best subset of two predictors, while only 5 out of 100 stepwise analyses did so. This remarkable difference, which holds across all subset sizes, demonstrates the extent to which stepwise DDA procedures are affected by sampling error.
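An exhaustive scan of the kind MCCABEPC performs can be sketched as follows. This is a self-contained toy, not the published program: it simply scores every subset of a given size by Wilks' Lambda (computed as det(W)/det(T)) and returns the top candidates so that near-ties, like those visible in the scree plots, can be inspected and resolved on theoretical grounds.

```python
from itertools import combinations

import numpy as np

def wilks_lambda(X, y, cols):
    """Wilks' Lambda (det(W) / det(T)) for the variables in `cols`."""
    Xs = X[:, cols]
    dev = Xs - Xs.mean(axis=0)
    T = dev.T @ dev
    W = np.zeros_like(T)
    for g in np.unique(y):
        d = Xs[y == g] - Xs[y == g].mean(axis=0)
        W += d.T @ d
    return np.linalg.det(W) / np.linalg.det(T)

def best_subsets(X, y, size, top=10):
    """Score every subset of `size` variables and return the `top` best
    (lowest Wilks' Lambda first). Unlike a stepwise pass, no subset is
    skipped, so the researcher sees all nearly-equivalent candidates."""
    scored = [(wilks_lambda(X, y, list(c)), c)
              for c in combinations(range(X.shape[1]), size)]
    scored.sort(key=lambda t: t[0])
    return scored[:top]
```

Because the scan is exhaustive, the returned best subset is by construction at least as good (within the sample) as whatever a greedy stepwise pass would have produced, and the runner-up list is what makes a Figure 2-style plot of equally plausible pairs possible.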
INSERT TABLE 3 ABOUT HERE

Problems with Stepwise in PDA

Selection Criteria

The problems with stepwise procedures inherent in DDA also exist when stepwise procedures are applied to PDA. However, these problems are overshadowed by the fact that the stepwise procedures in common statistical packages are designed not for PDA, but for DDA (Huberty & Wisenbaker, 1992). While DDA describes group differences by examining the discriminant function coefficients, PDA is concerned only with predicting group membership (Thompson, 1998). As a result, PDA does not utilize tests of statistical significance such as Wilks' Lambda. Instead, PDA is concerned only with hit rates, the number of cases correctly classified (Huberty, 1994). This distinction is important because, while Wilks' Lambda cannot be adversely affected (made higher) by adding variables to a DDA, hit rates can be made worse (Huberty, 1984; Thompson, 1998). In DDA, a completely worthless variable (one which does not contribute to group separation or description) would be given a weight of 0 and its impact essentially removed from the analysis. In PDA, however, the same worthless variable would
contribute noise to the prediction analysis, making group prediction less accurate.

Perhaps the most commonly used statistical software package, SPSS (1998), includes PDA as an option under DDA. As a result, it is possible to run a stepwise PDA analysis and receive classification results at each step. The variables selected by this analysis, however, are still selected in terms of group separation (e.g., Wilks' Lambda), not group classification (Huberty, 1984)! Using DDA stepwise selection procedures to obtain results for PDA is both inaccurate and inappropriate.

To demonstrate the gross problems with the use of stepwise procedures in PDA, consider the following example. The data were taken from Holzinger and Swineford (1939), a classic data set often used for heuristic examples. The analyses were conducted using all 301 cases, with scores on the first 14 ability measures used to predict track (June or February promotion). Table 4 shows the results of the stepwise procedure performed using SPSS with equal priors assumed. As shown, the stepwise analysis selected 5 of the original 14 variables, for a 68.4% hit rate. It is also important to note that, while the discriminant function coefficients are reported by SPSS and reproduced in Table 4, they are not relevant to PDA. This
information is provided to demonstrate that stepwise procedures select variables not because of their contribution to classification, but on the basis of their contribution to group description, which is strictly a DDA concept.

INSERT TABLE 4 ABOUT HERE

Table 5 shows the results of the all-possible-subsets analysis of the same data, performed using a program developed by Morris and Meshbane (1994). As shown, the actual best subset of five variables, with a hit rate of 69.1%, includes only three of the variables selected by the stepwise analysis. Furthermore, it can be seen that 2 additional subsets of five variables perform equally well as the subset selected by the stepwise procedure. By looking at the best ten subsets of any size, presented in Table 6, it can be seen that 6 subsets outperform the stepwise subset, and that at least 4 other subsets do just as well! In this example, the stepwise analysis has not selected the best subset of five variables, primarily because the selection criterion used is irrelevant to PDA.

INSERT TABLES 5 AND 6 ABOUT HERE
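The hit-rate criterion that PDA actually cares about can be computed directly. The sketch below is illustrative, not the Morris and Meshbane program: it uses a nearest-centroid rule as a simplified stand-in for full linear classification functions, and the function name and data are hypothetical.

```python
import numpy as np

def hit_rate(X, y, cols):
    """Resubstitution hit rate of a nearest-centroid classifier using only
    the variables in `cols`. This is the quantity PDA should optimize;
    note that Wilks' Lambda appears nowhere in the computation."""
    Xs = X[:, cols]
    groups = np.unique(y)
    centroids = np.array([Xs[y == g].mean(axis=0) for g in groups])
    # Squared distance from every case to every group centroid.
    d2 = ((Xs[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    predicted = groups[d2.argmin(axis=1)]
    return float((predicted == y).mean())
```

Ranking subsets by this quantity, rather than by Wilks' Lambda, is what an all-possible-subsets PDA does; a worthless variable can lower the hit rate outright, whereas it can never worsen Wilks' Lambda.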
Ignores Priors

In PDA, it is often of interest to the researcher to discover the rates of group membership in the general population and to use these rates as the basis for the analysis. Doing so can greatly increase the predictive ability of the analysis. However, while using non-equal priors does change hit rates, it has no effect on how stepwise procedures select variables for analysis. Table 7 shows the results of a stepwise analysis, using the same data as above but with the priors set to the rates found in the sample. It can be seen here that the stepwise analysis has selected the same subset of five variables as when equal priors were assumed, with a hit rate of 75.4%. These variables have been selected, once again, by the same descriptive criteria used when equal priors were assumed. A comparison of the discriminant function coefficients (again, of no interest in PDA) shown in Tables 4 and 7 reveals that they are identical.

INSERT TABLE 7 ABOUT HERE

An all-possible-subsets analysis with the same unequal priors reveals that, when the rates of the priors are changed, a completely different set of variables than that found while assuming equal priors provides the best predictive ability. As seen in
Tables 8 and 9, a hit rate of 76.7% is obtained from the actual best subset of five variables. Seven other subsets of five variables, and at least 10 subsets of any size, outperform the stepwise subset.

INSERT TABLES 8 AND 9 ABOUT HERE

While stepwise procedures can often provide inaccurate and misleading results in DDA, their use in PDA is always incorrect. The distinction between DDA and PDA, which is of great importance, is ignored when the stepwise procedures found in common software packages are performed. In addition, the use of such procedures ignores prior group membership rates when selecting variables. Even if one were to develop a stepwise procedure which used hit rates as its selection criterion, the analysis would still be riddled with the problems inherent in all stepwise procedures: capitalization on sampling error and failure to always select the best subset of variables.

Summary

As demonstrated, stepwise procedures as they are commonly applied in DDA not only capitalize on sampling error, but also do not always fulfill their primary function: to select the best subset of variables for describing group differences.
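How priors enter the classification rule, while leaving a Wilks'-Lambda-driven selection untouched, can be shown with a minimal sketch. Assumptions: a nearest-centroid score with a log-prior adjustment stands in for SPSS's full classification functions, and the function name and toy data are hypothetical.

```python
import numpy as np

def classify_with_priors(X, y, X_new, priors):
    """Assign each row of X_new to the group with the highest score, where
    score = -0.5 * squared distance to the group centroid + log(prior).
    Changing the priors shifts the decision boundary and hence the hit
    rate -- yet a stepwise routine driven by Wilks' Lambda would select
    the same variables either way, since priors never enter that criterion."""
    groups = np.unique(y)
    centroids = np.array([X[y == g].mean(axis=0) for g in groups])
    d2 = ((X_new[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    scores = -0.5 * d2 + np.log(np.asarray(priors, dtype=float))
    return groups[scores.argmax(axis=1)]
```

A borderline case slightly nearer group 0's centroid is assigned to group 0 under equal priors, but flips to group 1 once group 1 receives a sufficiently large prior, which is why a variable subset tuned under one set of priors need not be best under another.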
The use of stepwise procedures in PDA is never appropriate, as the selection criteria apply only to DDA, not PDA. Stepwise procedures as they are commonly applied in DA are rife with problems. With the availability of all-possible-subsets software, there is no excuse not to use the more accurate method of selecting variables. In addition to providing more accurate results, all-possible-subsets analyses provide a number of equally plausible solutions to the analysis, allowing the researcher to invoke conscious thought, theory, and parsimony when selecting the best subset of variables.
References

Cliff, N. (1987). Analyzing multivariate data. San Diego, CA: Harcourt Brace Jovanovich.

Holzinger, K. L., & Swineford, F. (1939). A study in factor analysis: The stability of the bi-factor solution (No. 48). Chicago: University of Chicago.

Huberty, C. J. (1984). Issues in the use and interpretation of discriminant analysis. Psychological Bulletin, 95.

Huberty, C. J. (1989). Problems with stepwise methods: Better alternatives. In B. Thompson (Ed.), Advances in social science methodology (Vol. 1). Greenwich, CT: JAI Press.

Huberty, C. J. (1994). Applied discriminant analysis. New York: Wiley.

Huberty, C. J., & Barton, R. M. (1989). An introduction to discriminant analysis. Measurement and Evaluation in Counseling and Development, 22.

Huberty, C. J., & Wisenbaker, J. M. (1992). Discriminant analysis: Potential improvements in typical practice. In B. Thompson (Ed.), Advances in social science methodology (Vol. 2). Greenwich, CT: JAI Press.

Klecka, W. R. (1980). Discriminant analysis. Thousand Oaks, CA: Sage Publications.

Lautenschlager, G. J. (1991). MCCABEPC: Computing all possible subsets for discriminant analyses [Computer software]. Based on G. P. McCabe's mainframe FORTRAN program.

McCabe, G. P. (1975). Computations for variable selection in discriminant analysis. Technometrics, 17.

Morris, J. D., & Meshbane, A. (1994). CLASSVSP.EXE [Computer software]. In C. J. Huberty, Applied discriminant analysis. New York: Wiley.

SPSS (Version 9.0) [Computer software]. (1998). Chicago: SPSS, Inc.

Thompson, B. (1995). Stepwise regression and stepwise discriminant analysis need not apply here: A guidelines editorial. Educational and Psychological Measurement, 55.

Thompson, B. (1998, April). Five methodology errors in educational research: The pantheon of statistical significance and other faux pas. Invited address presented at the annual meeting of the American Educational Research Association, San Diego, CA.
Table 1

Stepwise DDA Results

Step | Variables in Analysis | Wilks' Lambda

Note: At each step, the variable that minimizes the overall Wilks' Lambda is entered. a. F level, tolerance, or VIN insufficient for further computation.
Table 2

All Possible Subsets DDA Results

Subset Size | Variables | Wilks' Lambda

Note: Best 10 subsets of size 3, 4, and 5 computed but excluded from this table.
Table 3

Simulated Comparison of Stepwise and All Possible Subsets in DDA

Subset Size | Procedure (APSS / Stepwise) | Number of samples with selected subset in the top grouping

Note: Table adapted from McCabe, 1975.
Table 4

Stepwise PDA Results (Equal Priors Assumed)

Standardized discriminant function coefficients (Function 1) for the variables entered, including T5 (.447) and T8 (.557)

Classification results: predicted group membership (1, 2) versus original group

Note: 68.4% of original cases correctly classified.
Table 5

All Possible Subsets PDA Results (Equal Priors Assumed): Best Subsets of 5 Variables

Variables | Hits (#1, #2, Total) | % Hits
Table 6

All Possible Subsets PDA Results (Equal Priors Assumed): Best 10 Subsets

Variables | Hits (#1, #2, Total) | % Hits
Table 7

Stepwise PDA Results (Priors From Sample)

Standardized discriminant function coefficients (Function 1) for the variables entered, including T5 (.447) and T8 (.557)

Classification results: predicted group membership (1, 2) versus original group

Note: 75.4% of original cases correctly classified.
Table 8

All Possible Subsets PDA Results (Priors From Sample): Best Subsets of 5 Variables

Variables | Hits (#1, #2, Total) | % Hits
Table 9

All Possible Subsets PDA Results (Priors From Sample): Best 10 Subsets

Variables | Hits (#1, #2, Total) | % Hits
Figure 1. Scree plot of best subsets of a given size in DDA (Wilks' Lambda by best subset size).
Figure 2. Scree plot of the best ten subsets of 2 variables in DDA (Wilks' Lambda for each pair; pairs shown, best first: 4,5; 3,5; 2,5; 1,3; 2,3; 3,4; 1,5; 1,4; 2,4; 1,2).
More informationLinear Model Selection and Regularization
Linear Model Selection and Regularization Recall the linear model Y = β 0 + β 1 X 1 + + β p X p + ɛ. In the lectures that follow, we consider some approaches for extending the linear model framework. In
More informationOhio s State Tests ITEM RELEASE SPRING 2016 ALGEBRA I
Ohio s State Tests ITEM RELEASE SPRING 2016 ALGEBRA I Table of Contents Questions 1 4: Content Summary and Answer Key... ii Question 1: Question and Scoring Guidelines... 1 Question 1: Sample Response...
More informationNeuendorf MANOVA /MANCOVA. Model: X1 (Factor A) X2 (Factor B) X1 x X2 (Interaction) Y4. Like ANOVA/ANCOVA:
1 Neuendorf MANOVA /MANCOVA Model: X1 (Factor A) X2 (Factor B) X1 x X2 (Interaction) Y1 Y2 Y3 Y4 Like ANOVA/ANCOVA: 1. Assumes equal variance (equal covariance matrices) across cells (groups defined by
More informationFigure 1: Conventional labelling of axes for diagram of frequency distribution. Frequency of occurrence. Values of the variable
1 Social Studies 201 September 20-22, 2004 Histograms See text, section 4.8, pp. 145-159. Introduction From a frequency or percentage distribution table, a statistical analyst can develop a graphical presentation
More informationDescribing distributions with numbers
Describing distributions with numbers A large number or numerical methods are available for describing quantitative data sets. Most of these methods measure one of two data characteristics: The central
More informationNeuendorf MANOVA /MANCOVA. Model: X1 (Factor A) X2 (Factor B) X1 x X2 (Interaction) Y4. Like ANOVA/ANCOVA:
1 Neuendorf MANOVA /MANCOVA Model: X1 (Factor A) X2 (Factor B) X1 x X2 (Interaction) Y1 Y2 Y3 Y4 Like ANOVA/ANCOVA: 1. Assumes equal variance (equal covariance matrices) across cells (groups defined by
More informationNeuendorf MANOVA /MANCOVA. Model: MAIN EFFECTS: X1 (Factor A) X2 (Factor B) INTERACTIONS : X1 x X2 (A x B Interaction) Y4. Like ANOVA/ANCOVA:
1 Neuendorf MANOVA /MANCOVA Model: MAIN EFFECTS: X1 (Factor A) X2 (Factor B) Y1 Y2 INTERACTIONS : Y3 X1 x X2 (A x B Interaction) Y4 Like ANOVA/ANCOVA: 1. Assumes equal variance (equal covariance matrices)
More informationIntroduction to Statistical modeling: handout for Math 489/583
Introduction to Statistical modeling: handout for Math 489/583 Statistical modeling occurs when we are trying to model some data using statistical tools. From the start, we recognize that no model is perfect
More informationLogistic Regression: Regression with a Binary Dependent Variable
Logistic Regression: Regression with a Binary Dependent Variable LEARNING OBJECTIVES Upon completing this chapter, you should be able to do the following: State the circumstances under which logistic regression
More informationMachine Learning. Lecture 9: Learning Theory. Feng Li.
Machine Learning Lecture 9: Learning Theory Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Why Learning Theory How can we tell
More informationApplication of Indirect Race/ Ethnicity Data in Quality Metric Analyses
Background The fifteen wholly-owned health plans under WellPoint, Inc. (WellPoint) historically did not collect data in regard to the race/ethnicity of it members. In order to overcome this lack of data
More informationTypes of Statistical Tests DR. MIKE MARRAPODI
Types of Statistical Tests DR. MIKE MARRAPODI Tests t tests ANOVA Correlation Regression Multivariate Techniques Non-parametric t tests One sample t test Independent t test Paired sample t test One sample
More informationCOMBINING ROBUST VARIANCE ESTIMATION WITH MODELS FOR DEPENDENT EFFECT SIZES
COMBINING ROBUST VARIANCE ESTIMATION WITH MODELS FOR DEPENDENT EFFECT SIZES James E. Pustejovsky, UT Austin j o i n t wo r k w i t h : Beth Tipton, Nor t h we s t e r n Unive r s i t y Ariel Aloe, Unive
More information10. Alternative case influence statistics
10. Alternative case influence statistics a. Alternative to D i : dffits i (and others) b. Alternative to studres i : externally-studentized residual c. Suggestion: use whatever is convenient with the
More informationANOVA, ANCOVA and MANOVA as sem
ANOVA, ANCOVA and MANOVA as sem Robin Beaumont 2017 Hoyle Chapter 24 Handbook of Structural Equation Modeling (2015 paperback), Examples converted to R and Onyx SEM diagrams. This workbook duplicates some
More informationOhio s State Tests ITEM RELEASE SPRING 2016 INTEGRATED MATHEMATICS II
Ohio s State Tests ITEM RELEASE SPRING 2016 INTEGRATED MATHEMATICS II Table of Contents Questions 1 4: Content Summary and Answer Key... ii Question 1: Question and Scoring Guidelines... 1 Question 1:
More informationDISCRIMINANT ANALYSIS IN THE STUDY OF ROMANIAN REGIONAL ECONOMIC DEVELOPMENT
ANALELE ŞTIINŢIFICE ALE UNIVERSITĂŢII ALEXANDRU IOAN CUZA DIN IAŞI Tomul LIV Ştiinţe Economice 007 DISCRIMINANT ANALYSIS IN THE STUDY OF ROMANIAN REGIONAL ECONOMIC DEVELOPMENT ELISABETA JABA * DĂNUŢ VASILE
More informationwe first add 7 and then either divide by x - 7 = 1 Adding 7 to both sides 3 x = x = x = 3 # 8 1 # x = 3 # 4 # 2 x = 6 1 =?
. Using the Principles Together Applying Both Principles a Combining Like Terms a Clearing Fractions and Decimals a Contradictions and Identities EXAMPLE Solve: An important strategy for solving new problems
More informationDescribing distributions with numbers
Describing distributions with numbers A large number or numerical methods are available for describing quantitative data sets. Most of these methods measure one of two data characteristics: The central
More informationConservative variance estimation for sampling designs with zero pairwise inclusion probabilities
Conservative variance estimation for sampling designs with zero pairwise inclusion probabilities Peter M. Aronow and Cyrus Samii Forthcoming at Survey Methodology Abstract We consider conservative variance
More informationChapter 13 Section D. F versus Q: Different Approaches to Controlling Type I Errors with Multiple Comparisons
Explaining Psychological Statistics (2 nd Ed.) by Barry H. Cohen Chapter 13 Section D F versus Q: Different Approaches to Controlling Type I Errors with Multiple Comparisons In section B of this chapter,
More informationMultiple Regression of Students Performance Using forward Selection Procedure, Backward Elimination and Stepwise Procedure
ISSN 2278 0211 (Online) Multiple Regression of Students Performance Using forward Selection Procedure, Backward Elimination and Stepwise Procedure Oti, Eric Uchenna Lecturer, Department of Statistics,
More informationThe Application and Promise of Hierarchical Linear Modeling (HLM) in Studying First-Year Student Programs
The Application and Promise of Hierarchical Linear Modeling (HLM) in Studying First-Year Student Programs Chad S. Briggs, Kathie Lorentz & Eric Davis Education & Outreach University Housing Southern Illinois
More informationreview session gov 2000 gov 2000 () review session 1 / 38
review session gov 2000 gov 2000 () review session 1 / 38 Overview Random Variables and Probability Univariate Statistics Bivariate Statistics Multivariate Statistics Causal Inference gov 2000 () review
More informationDetermination of Density 1
Introduction Determination of Density 1 Authors: B. D. Lamp, D. L. McCurdy, V. M. Pultz and J. M. McCormick* Last Update: February 1, 2013 Not so long ago a statistical data analysis of any data set larger
More information2012 Assessment Report. Mathematics with Calculus Level 3 Statistics and Modelling Level 3
National Certificate of Educational Achievement 2012 Assessment Report Mathematics with Calculus Level 3 Statistics and Modelling Level 3 90635 Differentiate functions and use derivatives to solve problems
More informationChapter 4: Regression Models
Sales volume of company 1 Textbook: pp. 129-164 Chapter 4: Regression Models Money spent on advertising 2 Learning Objectives After completing this chapter, students will be able to: Identify variables,
More informationThree Factor Completely Randomized Design with One Continuous Factor: Using SPSS GLM UNIVARIATE R. C. Gardner Department of Psychology
Data_Analysis.calm Three Factor Completely Randomized Design with One Continuous Factor: Using SPSS GLM UNIVARIATE R. C. Gardner Department of Psychology This article considers a three factor completely
More informationBayesian Analysis of Multivariate Normal Models when Dimensions are Absent
Bayesian Analysis of Multivariate Normal Models when Dimensions are Absent Robert Zeithammer University of Chicago Peter Lenk University of Michigan http://webuser.bus.umich.edu/plenk/downloads.htm SBIES
More informationTest Yourself! Methodological and Statistical Requirements for M.Sc. Early Childhood Research
Test Yourself! Methodological and Statistical Requirements for M.Sc. Early Childhood Research HOW IT WORKS For the M.Sc. Early Childhood Research, sufficient knowledge in methods and statistics is one
More informationHow to Write a Laboratory Report
How to Write a Laboratory Report For each experiment you will submit a laboratory report. Laboratory reports are to be turned in at the beginning of the lab period, one week following the completion of
More informationCorrelation and Regression (Excel 2007)
Correlation and Regression (Excel 2007) (See Also Scatterplots, Regression Lines, and Time Series Charts With Excel 2007 for instructions on making a scatterplot of the data and an alternate method of
More informationDimensionality Reduction Techniques (DRT)
Dimensionality Reduction Techniques (DRT) Introduction: Sometimes we have lot of variables in the data for analysis which create multidimensional matrix. To simplify calculation and to get appropriate,
More informationCHAPTER 4 VARIABILITY ANALYSES. Chapter 3 introduced the mode, median, and mean as tools for summarizing the
CHAPTER 4 VARIABILITY ANALYSES Chapter 3 introduced the mode, median, and mean as tools for summarizing the information provided in an distribution of data. Measures of central tendency are often useful
More informationMultiple regression: Model building. Topics. Correlation Matrix. CQMS 202 Business Statistics II Prepared by Moez Hababou
Multiple regression: Model building CQMS 202 Business Statistics II Prepared by Moez Hababou Topics Forward versus backward model building approach Using the correlation matrix Testing for multicolinearity
More informationStatistical Tools for Multivariate Six Sigma. Dr. Neil W. Polhemus CTO & Director of Development StatPoint, Inc.
Statistical Tools for Multivariate Six Sigma Dr. Neil W. Polhemus CTO & Director of Development StatPoint, Inc. 1 The Challenge The quality of an item or service usually depends on more than one characteristic.
More informationProblems with parallel analysis in data sets with oblique simple structure
Methods of Psychological Research Online 2001, Vol.6, No.2 Internet: http://www.mpr-online.de Institute for Science Education 2001 IPN Kiel Problems with parallel analysis in data sets with oblique simple
More informationLogistic Regression Models for Multinomial and Ordinal Outcomes
CHAPTER 8 Logistic Regression Models for Multinomial and Ordinal Outcomes 8.1 THE MULTINOMIAL LOGISTIC REGRESSION MODEL 8.1.1 Introduction to the Model and Estimation of Model Parameters In the previous
More informationRobustness of the Quadratic Discriminant Function to correlated and uncorrelated normal training samples
DOI 10.1186/s40064-016-1718-3 RESEARCH Open Access Robustness of the Quadratic Discriminant Function to correlated and uncorrelated normal training samples Atinuke Adebanji 1,2, Michael Asamoah Boaheng
More informationEXAMINATION: QUANTITATIVE EMPIRICAL METHODS. Yale University. Department of Political Science
EXAMINATION: QUANTITATIVE EMPIRICAL METHODS Yale University Department of Political Science January 2014 You have seven hours (and fifteen minutes) to complete the exam. You can use the points assigned
More information2 Prediction and Analysis of Variance
2 Prediction and Analysis of Variance Reading: Chapters and 2 of Kennedy A Guide to Econometrics Achen, Christopher H. Interpreting and Using Regression (London: Sage, 982). Chapter 4 of Andy Field, Discovering
More informationCHAPTER 2. Types of Effect size indices: An Overview of the Literature
CHAPTER Types of Effect size indices: An Overview of the Literature There are different types of effect size indices as a result of their different interpretations. Huberty (00) names three different types:
More informationExploratory Factor Analysis and Principal Component Analysis
Exploratory Factor Analysis and Principal Component Analysis Today s Topics: What are EFA and PCA for? Planning a factor analytic study Analysis steps: Extraction methods How many factors Rotation and
More informationIrrational Thoughts. Aim. Equipment. Irrational Investigation: Teacher Notes
Teacher Notes 7 8 9 10 11 12 Aim Identify strategies and techniques to express irrational numbers in surd form. Equipment For this activity you will need: TI-Nspire CAS TI-Nspire CAS Investigation Student
More informationFigure Figure
Figure 4-12. Equal probability of selection with simple random sampling of equal-sized clusters at first stage and simple random sampling of equal number at second stage. The next sampling approach, shown
More informationEstimating Coefficients in Linear Models: It Don't Make No Nevermind
Psychological Bulletin 1976, Vol. 83, No. 2. 213-217 Estimating Coefficients in Linear Models: It Don't Make No Nevermind Howard Wainer Department of Behavioral Science, University of Chicago It is proved
More informationMachine Learning Linear Regression. Prof. Matteo Matteucci
Machine Learning Linear Regression Prof. Matteo Matteucci Outline 2 o Simple Linear Regression Model Least Squares Fit Measures of Fit Inference in Regression o Multi Variate Regession Model Least Squares
More informationChapter 19: Logistic regression
Chapter 19: Logistic regression Self-test answers SELF-TEST Rerun this analysis using a stepwise method (Forward: LR) entry method of analysis. The main analysis To open the main Logistic Regression dialog
More informationIB Physics STUDENT GUIDE 13 and Processing (DCP)
IB Physics STUDENT GUIDE 13 Chapter Data collection and PROCESSING (DCP) Aspect 1 Aspect Aspect 3 Levels/marks Recording raw data Processing raw data Presenting processed data Complete/ Partial/1 Not at
More informationResearchers often record several characters in their research experiments where each character has a special significance to the experimenter.
Dimension reduction in multivariate analysis using maximum entropy criterion B. K. Hooda Department of Mathematics and Statistics CCS Haryana Agricultural University Hisar 125 004 India D. S. Hooda Jaypee
More informationAbility Metric Transformations
Ability Metric Transformations Involved in Vertical Equating Under Item Response Theory Frank B. Baker University of Wisconsin Madison The metric transformations of the ability scales involved in three
More informationSRMR in Mplus. Tihomir Asparouhov and Bengt Muthén. May 2, 2018
SRMR in Mplus Tihomir Asparouhov and Bengt Muthén May 2, 2018 1 Introduction In this note we describe the Mplus implementation of the SRMR standardized root mean squared residual) fit index for the models
More informationGroup Dependence of Some Reliability
Group Dependence of Some Reliability Indices for astery Tests D. R. Divgi Syracuse University Reliability indices for mastery tests depend not only on true-score variance but also on mean and cutoff scores.
More informationMachine Learning Linear Classification. Prof. Matteo Matteucci
Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)
More informationOhio s State Tests ITEM RELEASE SPRING 2016 INTEGRATED MATHEMATICS I
Ohio s State Tests ITEM RELEASE SPRING 2016 INTEGRATED MATHEMATICS I Table of Contents Questions 1 3: Content Summary and Answer Key... ii Question 1: Question and Scoring Guidelines... 1 Question 1: Sample
More informationRobustness of factor analysis in analysis of data with discrete variables
Aalto University School of Science Degree programme in Engineering Physics and Mathematics Robustness of factor analysis in analysis of data with discrete variables Student Project 26.3.2012 Juha Törmänen
More informationTesting and Interpreting Interaction Effects in Multilevel Models
Testing and Interpreting Interaction Effects in Multilevel Models Joseph J. Stevens University of Oregon and Ann C. Schulte Arizona State University Presented at the annual AERA conference, Washington,
More informationSystematic error, of course, can produce either an upward or downward bias.
Brief Overview of LISREL & Related Programs & Techniques (Optional) Richard Williams, University of Notre Dame, https://www3.nd.edu/~rwilliam/ Last revised April 6, 2015 STRUCTURAL AND MEASUREMENT MODELS:
More informationIs economic freedom related to economic growth?
Is economic freedom related to economic growth? It is an article of faith among supporters of capitalism: economic freedom leads to economic growth. The publication Economic Freedom of the World: 2003
More informationDevelopment of a Data Mining Methodology using Robust Design
Development of a Data Mining Methodology using Robust Design Sangmun Shin, Myeonggil Choi, Youngsun Choi, Guo Yi Department of System Management Engineering, Inje University Gimhae, Kyung-Nam 61-749 South
More informationConventional And Robust Paired And Independent-Samples t Tests: Type I Error And Power Rates
Journal of Modern Applied Statistical Methods Volume Issue Article --3 Conventional And And Independent-Samples t Tests: Type I Error And Power Rates Katherine Fradette University of Manitoba, umfradet@cc.umanitoba.ca
More informationCollege Teaching Methods & Styles Journal Second Quarter 2005 Volume 1, Number 2
Illustrating the Central Limit Theorem Through Microsoft Excel Simulations David H. Moen, (Email: dmoen@usd.edu ) University of South Dakota John E. Powell, (Email: jpowell@usd.edu ) University of South
More informationMultivariate analysis
Multivariate analysis Prof dr Ann Vanreusel -Multidimensional scaling -Simper analysis -BEST -ANOSIM 1 2 Gradient in species composition 3 4 Gradient in environment site1 site2 site 3 site 4 site species
More informationCorrelation & Simple Regression
Chapter 11 Correlation & Simple Regression The previous chapter dealt with inference for two categorical variables. In this chapter, we would like to examine the relationship between two quantitative variables.
More informationContrast Analysis: A Tutorial
A peer-reviewed electronic journal. Copyright is retained by the first or sole author, who grants right of first publication to Practical Assessment, Research & Evaluation. Permission is granted to distribute
More information