Logistic Regression: Regression with a Binary Dependent Variable


LEARNING OBJECTIVES

Upon completing this chapter, you should be able to do the following:

- State the circumstances under which logistic regression should be used instead of multiple regression.
- Identify the types of variables used for dependent and independent variables in the application of logistic regression.
- Describe the method used to transform binary measures into the likelihood and probability measures used in logistic regression.
- Interpret the results of a logistic regression analysis and assess predictive accuracy, with comparisons to both multiple regression and discriminant analysis.
- Understand the strengths and weaknesses of logistic regression compared to discriminant analysis and multiple regression.

CHAPTER PREVIEW

Logistic regression is a specialized form of regression that is formulated to predict and explain a binary (two-group) categorical variable rather than a metric dependent measure. The form of the logistic regression variate is similar to the variate in multiple regression. The variate represents a single multivariate relationship, with regression-like coefficients indicating the relative impact of each predictor variable. The differences between logistic regression and discriminant analysis will become more apparent in our discussion of logistic regression's unique characteristics. Yet many similarities also exist between the two methods. When the basic assumptions of both methods are met, they each give comparable predictive and classificatory results and employ similar diagnostic measures. Logistic regression, however, has the advantage of being less affected than discriminant analysis when the basic assumptions, particularly normality of the variables, are not met. It also can accommodate nonmetric variables through dummy-variable coding, just as regression can. Logistic regression is limited, however, to prediction of only a two-group dependent measure.

From Chapter 6 of Multivariate Data Analysis, 7/e. Joseph F. Hair, Jr., William C. Black, Barry J. Babin, Rolph E. Anderson. Copyright 2010 by Pearson Prentice Hall. All rights reserved.

Thus, in cases for which three or more groups form the dependent measure, discriminant analysis is better suited.

Logistic regression may be described as estimating the relationship between a single nonmetric (binary) dependent variable and a set of metric or nonmetric independent variables, in this general form:

Y1 = X1 + X2 + X3 + ... + Xn

where Y1 is binary (nonmetric) and X1 through Xn are nonmetric or metric.

Logistic regression has widespread application in situations in which the primary objective is to identify the group to which an object (e.g., person, firm, or product) belongs. Potential applications include predicting any outcome that is binary (e.g., yes/no). Such situations include the success or failure of a new product, deciding whether a person should be granted credit, or predicting whether a firm will be successful. In each instance, the objects fall into one of two groups, and the objective is to predict and explain the bases for each object's group membership through a set of independent variables selected by the researcher.

KEY TERMS

Before starting the chapter, review the key terms to develop an understanding of the concepts and terminology to be used. Throughout the chapter the key terms appear in boldface. Other points of emphasis in the chapter and key term cross-references are italicized.

Analysis sample: Group of cases used in estimating the logistic regression model. When constructing classification matrices, the original sample is divided randomly into two groups, one for model estimation (the analysis sample) and the other for validation (the holdout sample).

Categorical variable: See nonmetric variable.

Classification matrix: Means of assessing the predictive ability of the logistic regression model. Created by cross-tabulating actual group membership with predicted group membership, this matrix consists of numbers on the diagonal representing correct classifications and off-diagonal numbers representing incorrect classifications.

Cross-validation: Procedure of dividing the sample into two parts: the analysis sample used in estimation of the logistic regression model and the holdout sample used to validate the results. Cross-validation avoids the overfitting of the logistic regression by allowing its validation on a totally separate sample.

Exponentiated logistic coefficient: Antilog of the logistic coefficient, which is used for interpretation purposes in logistic regression. The exponentiated coefficient minus 1.0 equals the percentage change in the odds. For example, an exponentiated coefficient of .20 represents a negative 80 percent change in the odds (.20 - 1.0 = -.80) for each unit change in the independent variable (the same as if the odds were multiplied by .20). Thus, a value of 1.0 equates to no change in the odds, and values above 1.0 represent increases in the predicted odds.

Hit ratio: Percentage of objects (individuals, respondents, firms, etc.) correctly classified by the logistic regression model. It is calculated as the number of objects in the diagonal of the classification matrix divided by the total number of objects. Also known as the percentage correctly classified.

Holdout sample: Group of objects not used to compute the logistic regression model. This group is then used to validate the logistic regression model with a separate sample of respondents. Also called the validation sample.

Likelihood value: Measure used in logistic regression to represent the lack of predictive fit.
Even though this method does not use the least squares procedure in model estimation, as is done in multiple regression, the likelihood value is similar to the sum of squared error in regression analysis.

Logistic coefficient: Coefficient in the logistic regression model that acts as the weighting factor for the independent variables in relation to their discriminatory power. Similar to a regression weight or discriminant coefficient.

Logistic curve: An S-shaped curve formed by the logit transformation that represents the probability of an event. The S-shaped form is nonlinear, because the probability of an event must approach 0 and 1 but never fall outside these limits. Thus, although the midrange involves a linear component, the probabilities as they approach the lower and upper bounds of probability (0 and 1) must flatten out and become asymptotic to these bounds.

Logistic regression: Special form of regression in which the dependent variable is a nonmetric, dichotomous (binary) variable. Although some differences exist, the general manner of interpretation is quite similar to linear regression.

Logit analysis: See logistic regression.

Logit transformation: Transformation of the values of the discrete binary dependent variable of logistic regression into an S-shaped curve (logistic curve) representing the probability of an event. This probability is then used to form the odds ratio, which acts as the dependent variable in logistic regression.

Maximum chance criterion: Measure of predictive accuracy in the classification matrix that is calculated as the percentage of respondents in the largest group. The rationale is that the best uninformed choice is to classify every observation into the largest group.

Nonmetric variable: Variable with values that serve merely as a label or means of identification; also referred to as a categorical, nominal, binary, qualitative, or taxonomic variable. The number on a football jersey is an example.

Odds: The ratio of the probability of an event occurring to the probability of the event not happening, which is used as a measure of the dependent variable in logistic regression.

Percentage correctly classified: See hit ratio.

Proportional chance criterion: Another criterion for assessing the hit ratio, in which the average probability of classification is calculated considering all group sizes.

Pseudo R²: A value of overall model fit that can be calculated for logistic regression; comparable to the R² measure used in multiple regression.

Validation sample: See holdout sample.

Variate: Linear combination that represents the weighted sum of two or more independent variables that comprise the discriminant function. Also called linear combination or linear compound.

Wald statistic: Test used in logistic regression for the significance of the logistic coefficient. Its interpretation is like the F or t values used for the significance testing of regression coefficients.

WHAT IS LOGISTIC REGRESSION?

Logistic regression, along with discriminant analysis, is the appropriate statistical technique when the dependent variable is a categorical (nominal or nonmetric) variable and the independent variables are metric or nonmetric variables. When compared to discriminant analysis, logistic regression is limited in its basic form to two groups for the dependent variable, although other formulations can handle more groups. It does have the advantage, however, of easily incorporating nonmetric variables as independent variables, much like in multiple regression. In a practical sense, logistic regression may be preferred for two reasons.
First, discriminant analysis relies on strictly meeting the assumptions of multivariate normality and equal variance-covariance matrices across groups, assumptions that are not met in many situations. Logistic regression does not face these strict assumptions and is much more robust when they are not met, making its application appropriate in many situations.

Second, even if the assumptions are met, many researchers prefer logistic regression because it is similar to multiple regression. It has straightforward statistical tests, similar approaches to incorporating metric and nonmetric variables and nonlinear effects, and a wide range of diagnostics. Thus, for these and more technical reasons, logistic regression is equivalent to two-group discriminant analysis and may be more suitable in many situations.

THE DECISION PROCESS FOR LOGISTIC REGRESSION

The application of logistic regression can be viewed from a six-stage model-building perspective. As with all multivariate applications, setting the objectives is the first step in the analysis. Then the researcher must address specific design issues and make sure the underlying assumptions are met. The analysis proceeds with the estimation of the probability of occurrence in each of the groups by use of the logistic curve as the underlying relationship. The binary measure is translated into the odds of occurrence and then a logit value that acts as the dependent measure. The model form in terms of the independent variables is almost identical to multiple regression. Model fit is assessed much like discriminant analysis, by first looking for statistical significance of the overall model and then determining predictive accuracy by developing a classification matrix. Then, given the unique nature of the transformed dependent variable, logistic coefficients are given in their original scale, which is in logarithmic terms, and in a transformed scale, which is interpreted more like regression coefficients. Each form of the coefficient details a certain characteristic of the independent variable's impact. Finally, the logistic regression model should be validated with a holdout sample. Each of these stages is discussed in the following sections. Our discussion focuses to a large extent on the differences between logistic regression and discriminant analysis or multiple regression. Thus, the reader should also review the underlying principles of models with nonmetric dependent variables and even the basics of multiple regression models.

STAGE 1: OBJECTIVES OF LOGISTIC REGRESSION

Logistic regression is identical to discriminant analysis in terms of the basic objectives it can address. Logistic regression is best suited to address two research objectives:

- Identifying the independent variables that impact group membership in the dependent variable
- Establishing a classification system, based on the logistic model, for determining group membership

The first objective is quite similar to the primary objectives of discriminant analysis and even multiple regression in that emphasis is placed on the explanation of group membership in terms of the independent variables in the model. In the classification process, logistic regression, like discriminant analysis, provides a basis for classifying not only the sample used to estimate the model but also any other observations that have values for all the independent variables. In this way, the logistic regression analysis can classify other observations into the defined groups.

STAGE 2: RESEARCH DESIGN FOR LOGISTIC REGRESSION

Logistic regression has several unique features that impact the research design. First is the unique nature of the binary dependent variable, which ultimately impacts the model specification and estimation. The second issue relates to sample size, which is impacted by several factors.

Among them are the use of maximum likelihood as the estimation technique and the need for estimation and holdout samples, as in discriminant analysis.

Representation of the Binary Dependent Variable

In discriminant analysis, the nonmetric character of a dichotomous dependent variable is accommodated by making predictions of group membership based on discriminant Z scores. This requires the calculation of cutting scores and the assignment of observations to groups. Logistic regression approaches this task in a manner more similar to that found with multiple regression. Logistic regression represents the two groups of interest as a binary variable with values of 0 and 1. It does not matter which group is assigned the value of 1 versus 0, but this assignment must be noted for the interpretation of the coefficients. If the groups represent characteristics (e.g., gender), then either group can be assigned the value of 1 (e.g., females) and the other group the value of 0 (e.g., males). In such a situation, the coefficients would reflect the impact of the independent variable(s) on the likelihood of the person being female (i.e., the group coded as 1). If the groups represent outcomes or events (e.g., success or failure, purchase or nonpurchase), the assignment of the group codes impacts interpretation as well. Assume that the group with success is coded as 1, with failure coded as 0. Then the coefficients represent the impacts on the likelihood of success. Just as easily, the codes could be reversed (a code of 1 now denotes failure), and the coefficients would represent the forces increasing the likelihood of failure. Logistic regression differs from multiple regression, however, in being specifically designed to predict the probability of an event occurring (i.e., the probability of an observation being in the group coded 1). Although probability values are metric measures, there are fundamental differences between multiple regression and logistic regression.

USE OF THE LOGISTIC CURVE

Because the binary dependent variable has only the values of 0 and 1, the predicted value (probability) must be bounded to fall within the same range. To define a relationship bounded by 0 and 1, logistic regression uses the logistic curve to represent the relationship between the independent and dependent variables (see Figure 1). At very low levels of the independent variable, the probability approaches 0, but never reaches it. Likewise, as the independent variable increases, the predicted values increase up the curve, but then the slope starts decreasing, so that the probability approaches 1.0 but never exceeds it. The linear models of regression cannot accommodate such a relationship, because it is inherently nonlinear. The linear relationship of regression, even with additional terms or transformations for nonlinear effects, cannot guarantee that the predicted values will remain within the range of 0 and 1.

UNIQUE NATURE OF THE DEPENDENT VARIABLE

The binary nature of the dependent variable (0 or 1) has properties that violate the assumptions of multiple regression. First, the error term of a discrete variable follows the binomial distribution instead of the normal distribution, thus invalidating all statistical testing based on the assumptions of normality. Second, the variance of a dichotomous variable is not constant, creating instances of heteroscedasticity as well. Moreover, neither violation can be remedied through transformations of the dependent or independent variables.
Logistic regression was developed to deal specifically with these issues. Its unique relationship between dependent and independent variables, however, requires a somewhat different approach in estimating the variate, assessing goodness-of-fit, and interpreting the coefficients when compared to multiple regression.
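To make this bounded relationship concrete, here is a minimal Python sketch (not from the chapter) of the logistic curve. The intercept and slope values are illustrative assumptions, chosen so that a probability of .50 falls at X = 6.0, matching the example discussed with Figure 2 later in the chapter.

```python
import numpy as np

def logistic_probability(x, b0=-6.0, b1=1.0):
    """Map the linear variate b0 + b1*x onto a probability via the logistic curve."""
    logit = b0 + b1 * x                   # unbounded linear variate
    return 1.0 / (1.0 + np.exp(-logit))   # bounded to the open interval (0, 1)

for xi in np.linspace(0, 12, 7):
    print(f"X = {xi:4.1f} -> predicted probability = {logistic_probability(xi):.3f}")
# The probabilities approach 0 and 1 asymptotically but never reach or exceed them.
```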

[Figure 1: Form of the Logistic Relationship Between Dependent and Independent Variables. The S-shaped curve plots the probability of the event (dependent variable), from 0 to 1.0, against the level of the independent variable, from low to high.]

Sample Size

Logistic regression, like every other multivariate technique, must consider the size of the sample being analyzed. Very small samples have so much sampling error that identification of all but the largest differences is improbable. Very large sample sizes increase the statistical power so that any difference, whether practically relevant or not, will be considered statistically significant. Yet most research situations fall somewhere in between these extremes, meaning the researcher must consider the impact of sample size on the results, both at the overall level and on a group-by-group basis.

OVERALL SAMPLE SIZE

The first aspect of sample size is the overall sample size needed to adequately support estimation of the logistic model. One factor that distinguishes logistic regression from the other techniques is its use of maximum likelihood estimation (MLE) as the estimation technique. MLE requires larger samples, such that, all things being equal, logistic regression will require a larger sample size than multiple regression. For example, Hosmer and Lemeshow recommend sample sizes greater than 400 [4]. Moreover, the researcher should strongly consider dividing the sample into analysis and holdout samples as a means of validating the logistic model (see a more detailed discussion in stage 6). In making this split of the sample, the sample size requirements still hold for both the analysis and holdout samples separately, thus effectively doubling the overall sample size needed for a given model specification (number of parameter estimates, etc.).

SAMPLE SIZE PER CATEGORY OF THE DEPENDENT VARIABLE

The second consideration is that the overall sample size is important, but so is the sample size per group of the dependent variable. As we discussed for discriminant analysis, there are considerations on the minimum group size as well. The recommended sample size for each group is at least 10 observations per estimated parameter. This is much greater than the requirement for multiple regression, which had a minimum of five observations per parameter, and that was for the overall sample, not the sample size for each group as seen with logistic regression.

IMPACT OF NONMETRIC INDEPENDENT VARIABLES

A final consideration comes into play with the use of nonmetric independent variables. When they are included in the model, they further subdivide the sample into cells created by the combination of dependent and nonmetric independent variables. For example, a simple binary independent variable, when combined with the binary dependent variable, creates four groups.

Although it is not necessary for each of these groups to meet the sample size requirements described above, the researcher must still be aware that if any one of these cells has a very small sample size, then it is effectively eliminated from the analysis. Moreover, if too many of these cells have zero or very small sample sizes, then the model may have trouble converging and reaching a solution.

STAGE 3: ASSUMPTIONS OF LOGISTIC REGRESSION

The advantages of logistic regression compared to discriminant analysis and even multiple regression stem in large degree from the general lack of assumptions required in a logistic regression analysis. It does not require any specific distributional form for the independent variables, and issues such as heteroscedasticity do not come into play as they did in discriminant analysis. Moreover, logistic regression does not require linear relationships between the independent variables and the dependent variable, as multiple regression does. Because of the logistic relationship, it can address nonlinear effects even when exponential and polynomial terms are not explicitly added as additional independent variables.

STAGE 4: ESTIMATION OF THE LOGISTIC REGRESSION MODEL AND ASSESSING OVERALL FIT

One of the unique characteristics of logistic regression is its use of the logistic relationship described earlier both in estimating the logistic model and in establishing the relationship between dependent and independent variables. The result is a unique transformation of the dependent variable, which impacts not only the estimation process but also the resulting coefficients for the independent variables. And yet logistic regression shares approaches to assessing overall model fit with both discriminant analysis (i.e., use of classification matrices) and multiple regression (i.e., R² measures). The following sections discuss the estimation process, followed by the various ways in which model fit is evaluated.

Estimating the Logistic Regression Model

Logistic regression has a single variate composed of estimated coefficients for each independent variable, as found in multiple regression. However, this variate is estimated in a different manner. Logistic regression derives its name from the logit transformation used with the dependent variable, which creates several differences in the estimation process (as well as in the interpretation process discussed in a following section).

TRANSFORMING THE DEPENDENT VARIABLE

As shown earlier, the logit model uses the specific form of the logistic curve, which is S-shaped, to stay within the range of 0 to 1. To estimate a logistic regression model, this curve of predicted values is fitted to the actual data, just as was done with a linear relationship in multiple regression. However, because the actual data values of the dependent variable can only be either 1 or 0, the process is somewhat different. Figure 2 portrays two hypothetical examples of fitting a logistic relationship to sample data. The actual data represent whether an event happened or not by assigning values of either 1 or 0 to the outcomes (in this case a 1 is assigned when the event happened, 0 otherwise, but the codes could just as easily have been reversed). Observations are represented by the dots at either the top or bottom of the graph. These outcomes (happened or not) occur at each value of the independent variable (the X axis).
In part (a), the logistic curve cannot fit the data well, because a number of values of the independent variable have both outcomes (1 and 0). In this case the independent variable does not distinguish between the two outcomes, as shown by the high overlap of the two groups. In part (b), however, a much more well-defined relationship is based on the independent variable. Lower values of the independent variable correspond to the observations with 0 for the dependent variable, whereas larger values of the independent variable correspond well with those observations with a value of 1 on the dependent variable.

[Figure 2: Examples of Fitting the Logistic Curve to Sample Data. Panel (a) shows a poorly fitted relationship and panel (b) a well-defined relationship; each panel plots the dependent variable Y (0 or 1) against the independent variable X (0 to 10).]

Thus, the logistic curve should be able to fit the data quite well. But how do we predict group membership from the logistic curve? For each observation, the logistic regression technique predicts a probability value between 0 and 1. Plotting the predicted values for all values of the independent variable generates the curve shown in Figure 2. This predicted probability is based on the value(s) of the independent variable(s) and the estimated coefficients. If the predicted probability is greater than .50, then the prediction is that the outcome is 1 (the event happened); otherwise, the outcome is predicted to be 0 (the event did not happen). Let's return to our example and see how it works. In parts (a) and (b) of Figure 2, a value of 6.0 for X (the independent variable) corresponds to a probability of .50. In part (a), we can see that a number of observations of both groups fall on both sides of this value, resulting in a number of misclassifications.

The misclassifications are most noticeable for the group with values of 1.0, yet even several observations in the other group (dependent variable = 0.0) are misclassified. In part (b), we achieve perfect classification of the two groups when using the probability value of .50 as a cutoff value. Thus, with an estimated logistic curve we can estimate the probability for any observation based on its values for the independent variable(s) and then predict group membership using .50 as a cutoff value. Once we have the predicted membership, we can create a classification matrix just as was done for discriminant analysis and assess predictive accuracy.

ESTIMATING THE COEFFICIENTS

Where does the curve come from? In multiple regression, we estimate a linear relationship that best fits the data. In logistic regression, we follow the same process of predicting the dependent variable by a variate composed of the logistic coefficient(s) and the corresponding independent variable(s). What differs is that in logistic regression the predicted values can never be outside the range of 0 to 1. Although a complete discussion of the conceptual and statistical issues involved in the estimation process is beyond the scope of this chapter, several excellent sources with complete treatments of these issues are available [1, 5, 6]. We can describe the estimation process in two basic steps as we introduce some common terms and provide a brief overview of the process.

TRANSFORMING A PROBABILITY INTO ODDS AND LOGIT VALUES

Just as with multiple regression, logistic regression predicts a metric dependent variable, in this case probability values constrained to the range between 0 and 1. But how can we ensure that estimated values do not fall outside this range? The logistic transformation accomplishes this in two steps.

Restating a Probability as Odds. In its original form, a probability is constrained to values between 0 and 1, a constraint that direct estimation cannot guarantee. So, what if we were to restate the probability in a form that removes this constraint? We do so by expressing a probability as odds: the ratio of the probabilities of the two outcomes or events, Prob_i / (1 - Prob_i). In this form, any probability value is now stated as a metric variable that can be directly estimated, and any odds value can be converted back into a probability that falls between 0 and 1. We have solved our problem of constraining the predicted values to within 0 and 1 by predicting the odds value and then converting it into a probability.

Let us use some examples of the probability of success or failure to illustrate how the odds are calculated. If the probability of success is .80, then we also know that the probability of the alternative outcome (i.e., failure) is .20 (.20 = 1.0 - .80). This means that the odds of success are 4.0 (.80 / .20), or that success is four times more likely to happen than failure. Conversely, we can state the odds of failure as .25 (.20 / .80); in other words, failure happens at one-fourth the rate of success. Thus, no matter which outcome we look at (success or failure), we can state the probability as odds. As you can probably surmise, a probability of .50 results in odds of 1.0 (both outcomes have an equal chance of occurring). Odds less than 1.0 represent probabilities less than .50, and odds greater than 1.0 correspond to probabilities greater than .50. We now have a metric variable that can always be converted back to a probability value within 0 and 1.

Calculating the Logit Value.
The odds variable solves the problem of keeping probability estimates between 0 and 1, but we have another problem: How do we keep the predicted odds values from going below 0, the lower limit of the odds (there is no upper limit)? The solution is to compute what is termed the logit value, which is calculated by taking the logarithm of the odds. Odds less than 1.0 have a negative logit value, odds greater than 1.0 have positive logit values, and odds of 1.0 (corresponding to a probability of .50) have a logit value of 0. Moreover, no matter how low the negative value gets, it can still be transformed, by taking the antilog, into an odds value greater than 0. The following table shows some typical probability values and the associated odds and log odds values.

Probability    Odds     Log Odds (Logit)
.00            NC       NC
.10            .111     -2.197
.30            .429     -.847
.50            1.000    .000
.70            2.333    .847
.90            9.000    2.197
1.00           NC       NC

NC = Cannot be calculated.

With the logit value, we now have a metric variable that can have both positive and negative values but that can always be transformed back to a probability value between 0 and 1. Note, however, that the logit cannot be calculated for probabilities of exactly 0 or 1. This logit value now becomes the dependent variable of the logistic regression model.

MODEL ESTIMATION

Once we understand how to interpret the values of either the odds or the logit measure, we can proceed to using them as the dependent measure in our logistic regression. The process of estimating the logistic coefficients is similar to that used in regression, although in this case only two actual values are used for the dependent variable (0 and 1). Moreover, instead of using ordinary least squares as a means of estimating the model, the maximum likelihood method is used.

Estimating the Coefficients. The coefficients for the independent variables are estimated using either the logit value or the odds value as the dependent measure. Each of these model formulations is shown here:

Logit_i = ln(prob_event / (1 - prob_event)) = b0 + b1*X1 + ... + bn*Xn

or

Odds_i = prob_event / (1 - prob_event) = e^(b0 + b1*X1 + ... + bn*Xn)

Both model formulations are equivalent, but whichever is chosen affects how the coefficients are estimated. Many software programs provide the logistic coefficients in both forms, so the researcher must understand how to interpret each form. We will discuss interpretation issues in a later section. This process can accommodate one or more independent variables, and the independent variables can be either metric or nonmetric (binary). As we will see later in our discussion of interpreting the coefficients, both forms of the coefficients reflect both the direction and the magnitude of the relationship, but they are interpreted differently.

Using Maximum Likelihood for Estimation. Multiple regression employs the method of least squares, which minimizes the sum of the squared differences between the actual and predicted values of the dependent variable. The nonlinear nature of the logistic transformation requires that another procedure, the maximum likelihood procedure, be used in an iterative manner to find the most likely estimates for the coefficients. Instead of minimizing the squared deviations (least squares), logistic regression maximizes the likelihood that an event will occur. The likelihood value, instead of the sum of squares, is then used when calculating a measure of overall model fit. Using this alternative estimation technique also requires that we assess model fit in different ways.
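The transformations and estimation steps just described can be verified numerically. Below is a hedged sketch using numpy and statsmodels (a common Python implementation, not the chapter's own software); the synthetic data-generating values are assumptions for illustration only.

```python
import numpy as np
import statsmodels.api as sm

# Step 1: probability -> odds -> logit, and back again
p = np.array([0.10, 0.30, 0.50, 0.70, 0.90])
odds = p / (1 - p)            # ratio of event probability to nonevent probability
logit = np.log(odds)          # log odds: unbounded in both directions
p_back = odds / (1 + odds)    # any odds value converts back to a valid probability
print(np.column_stack([p, odds, logit, p_back]))

# Step 2: maximum likelihood estimation on synthetic data (assumed values)
rng = np.random.default_rng(42)
x = rng.normal(size=400)                            # one metric independent variable
true_logit = -0.5 + 1.2 * x                         # assumed population relationship
y = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))  # binary (0/1) outcomes

X = sm.add_constant(x)                # adds the intercept term b0
model = sm.Logit(y, X).fit(disp=0)    # iterative maximum likelihood estimation
print(model.params)                   # original coefficients (logit scale)
print(np.exp(model.params))           # exponentiated coefficients (odds scale)
```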

Assessing the Goodness-of-Fit of the Estimated Model

The goodness-of-fit of a logistic regression model can be assessed in two ways. One way is to assess model estimation fit using pseudo R² values, similar to those found in multiple regression. The second approach is to examine predictive accuracy (as with the classification matrix in discriminant analysis). The two approaches examine model fit from different perspectives but should yield similar conclusions.

MODEL ESTIMATION FIT

The basic measure of how well the maximum likelihood estimation procedure fits is the likelihood value, similar to the sums of squares values used in multiple regression. Logistic regression measures model estimation fit with the value of -2 times the log of the likelihood value, referred to as -2LL or -2 log likelihood. The minimum value for -2LL is 0, which corresponds to a perfect fit (likelihood = 1, so -2LL = 0). Thus, the lower the -2LL value, the better the fit of the model. As will be discussed in the following section, the -2LL value can be used to compare equations for the change in fit or to calculate measures comparable to the R² measure in multiple regression.

Between-Model Comparisons. The likelihood value can be compared between equations to assess the difference in predictive fit from one equation to another, with statistical tests for the significance of these differences. The basic approach follows three steps:

1. Estimate a null model. The first step is to calculate a null model, which acts as the baseline for making comparisons of improvement in model fit. The most common null model is one without any independent variables, which is similar to calculating the total sum of squares using only the mean in multiple regression. The logic behind this form of null model is that it can act as a baseline against which any model containing independent variables can be compared.

2. Estimate the proposed model. This model contains the independent variables to be included in the logistic regression model. Hopefully, model fit will improve from the null model and result in a lower -2LL value. Any number of proposed models can be estimated (e.g., models with one, two, and three independent variables can all be separate proposed models).

3. Assess the -2LL difference. The final step is to assess the statistical significance of the difference in -2LL between the two models (null model versus proposed model). If the statistical tests support significant differences, then we can state that the set of independent variable(s) in the proposed model is significant in improving model estimation fit.

In a similar fashion, any two proposed models can be compared. In these instances, the -2LL difference reflects the difference in model fit due to the different model specifications. For example, a model with two independent variables may be compared to a model with three independent variables to assess the improvement gained by adding one independent variable. In these instances, one model is selected to act as the null model and then compared against another model. For example, assume that we wanted to test the significance of a set of independent variables collectively to see if they improved model fit. The null model would be specified as a model without these variables, and the proposed model would include the variables to be evaluated. The difference in -2LL would signify the improvement from the set of independent variables.
We could perform similar tests of the differences in -2LL between other pairs of models varying in the number of independent variables included in each model. The chi-square test and the associated test for statistical significance are used to evaluate the reduction in the log likelihood value. These statistical tests are particularly sensitive to sample size, however: for small samples it is harder to show statistical significance, whereas for large samples even trivial differences become significant. Therefore, researchers must be particularly careful in drawing conclusions based solely on the significance of the chi-square test in logistic regression.

Pseudo R² Measures. In addition to the statistical chi-square tests, several different R²-like measures have been developed and are presented in various statistical programs to represent overall model fit.

These pseudo R² measures are interpreted in a manner similar to the coefficient of determination in multiple regression. A pseudo R² value can be easily derived for logistic regression, similar to the R² value in regression analysis [3]. The pseudo R² for a logit model (R²_LOGIT) can be calculated as:

R²_LOGIT = (-2LL_null - (-2LL_model)) / -2LL_null

Just like its multiple regression counterpart, the logit R² value ranges from 0.0 to 1.0. As the proposed model increases model fit, the -2LL value decreases. A perfect fit has a -2LL value of 0.0 and an R²_LOGIT of 1.0. Two other measures are similar in design to the pseudo R² value and are generally categorized as pseudo R² measures as well. The Cox and Snell R² measure operates in the same manner, with higher values indicating greater model fit. However, this measure is limited in that it cannot reach the maximum value of 1, so Nagelkerke proposed a modification that has the range of 0 to 1. Both of these additional measures are interpreted as reflecting the amount of variation accounted for by the logistic model, with 1.0 indicating perfect model fit.

A Comparison to Multiple Regression. In discussing the procedures for assessing model fit in logistic regression, we made several references to similarities with multiple regression in terms of various measures of model fit. The following table shows the correspondence between concepts used in multiple regression and their counterparts in logistic regression.

Correspondence of Primary Elements of Model Fit

Multiple Regression                      Logistic Regression
Total sum of squares                     -2LL of base model
Error sum of squares                     -2LL of proposed model
Regression sum of squares                Difference of -2LL for base and proposed models
F test of model fit                      Chi-square test of -2LL difference
Coefficient of determination (R²)        Pseudo R² measures

As we can see, the concepts in multiple regression and logistic regression are similar. The basic approaches to testing overall model fit are comparable, with the differences arising from the estimation methods used in the two techniques.

PREDICTIVE ACCURACY

Just as we borrowed the concept of R² from regression as a measure of overall model fit, we can look to discriminant analysis for a measure of overall predictive accuracy. The two most common approaches are the classification matrix and chi-square-based measures of fit.

Classification Matrix. The classification matrix approach is identical to that used with discriminant analysis: measuring how well group membership is predicted and developing a hit ratio, which is the percentage correctly classified. The case of logistic regression will always include only two groups, but all of the chance-related measures (e.g., maximum chance or proportional chance) used earlier for discriminant analysis are applicable here as well.

Chi-Square-Based Measure. Hosmer and Lemeshow [4] developed a classification test in which the cases are first divided into approximately 10 equal classes. Then the numbers of actual and predicted events are compared in each class with the chi-square statistic. This test provides a comprehensive measure of predictive accuracy that is based not on the likelihood value, but rather on the actual prediction of the dependent variable.
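Continuing the hedged statsmodels sketch from the estimation section, the fit measures described above can be computed from a fitted model. The attribute names (llf, llnull) are statsmodels conventions, and the .50 cutoff for the classification matrix follows the chapter's earlier discussion; treat this as an illustrative sketch rather than the chapter's own procedure.

```python
import numpy as np
from scipy import stats

# 'model', 'X', and 'y' come from the earlier estimation sketch
neg2ll_null = -2 * model.llnull    # -2LL of the null (intercept-only) model
neg2ll_model = -2 * model.llf      # -2LL of the proposed model

# Chi-square test of the -2LL difference (df = number of added predictors)
chi_sq = neg2ll_null - neg2ll_model
print(f"-2LL difference = {chi_sq:.2f}, p = {stats.chi2.sf(chi_sq, df=1):.4f}")

# Pseudo R^2 as defined in the text: the proportional reduction in -2LL
r2_logit = (neg2ll_null - neg2ll_model) / neg2ll_null
print(f"Pseudo R^2 (logit) = {r2_logit:.3f}")

# Classification matrix and hit ratio using a .50 probability cutoff
predicted = (model.predict(X) > 0.50).astype(int)
matrix = np.zeros((2, 2), dtype=int)
for actual, pred in zip(y, predicted):
    matrix[actual, pred] += 1      # rows: actual group, columns: predicted group
print(matrix)
print(f"Hit ratio = {np.trace(matrix) / matrix.sum():.1%}")
```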

The appropriate use of this test requires a sample size of at least 50 cases, to ensure that each class has at least 5 observations, and generally an even larger sample, because the number of predicted events should never fall below 1. Also, the chi-square statistic is sensitive to sample size, enabling this measure to find small differences statistically significant when the sample size becomes large. We typically examine as many of these measures of model fit as possible. Hopefully, a convergence of indications from these measures will provide the necessary support for the researcher in evaluating overall model fit.

STAGE 5: INTERPRETATION OF THE RESULTS

As discussed earlier, the logistic regression model results in coefficients for the independent variables much like regression coefficients and quite different from the loadings of discriminant analysis. Moreover, most of the diagnostics associated with multiple regression for influential observations are also available in logistic regression. What does differ from multiple regression, however, is the interpretation of the coefficients. Because the dependent variable has been transformed in the process described in the previous stage, the coefficients must be evaluated in a specific manner. The following discussion first addresses how the directionality and then the magnitude of the coefficients are determined. Then the differences in interpretation between metric and nonmetric independent variables are covered, just as was needed in multiple regression.

Testing for Significance of the Coefficients

Logistic regression tests hypotheses about individual coefficients just as was done in multiple regression. In multiple regression, the statistical test was to see whether the coefficient was significantly different from 0. A coefficient of 0 indicates that the independent variable has no impact on the dependent variable. In logistic regression, we also use a statistical test to see whether the logistic coefficient is different from 0. Remember, however, that in logistic regression using the logit as the dependent measure, a value of 0 corresponds to odds of 1.00, or a probability of .50, values that indicate the probability is equal for each group (i.e., again, no effect of the independent variable on predicting group membership). In multiple regression, the t value is used to assess the significance of each coefficient. Logistic regression uses a different statistic, the Wald statistic. It provides the statistical significance for each estimated coefficient so that hypothesis testing can occur just as it does in multiple regression. If the logistic coefficient is statistically significant, we can interpret it in terms of how it impacts the estimated probability, and thus the prediction of group membership.

Interpreting the Coefficients

One of the advantages of logistic regression is that we need to know only whether an event (purchase or not, good credit risk or not, firm failure or success) occurred to define a dichotomous value as our dependent variable. When we analyze these data using the logistic transformation, however, the logistic regression and its coefficients take on a somewhat different meaning from those found in regression with a metric dependent variable. Similarly, discriminant loadings from a two-group discriminant analysis are interpreted differently from a logistic coefficient. From the estimation process described earlier, we know that the coefficients (b0, b1, b2, ..., bn) are actually measures of the change in the ratio of the probabilities (the odds).
However, logistic coefficients are difficult to interpret in their original form, because they are expressed in terms of logarithms when we use the logit as the dependent measure. Thus, most computer programs also provide an exponentiated logistic coefficient, which is just a transformation (antilog) of the original logistic coefficient. In this way, we can use either the original or the exponentiated logistic coefficients for interpretation.

The two types of logistic coefficient differ in that they reflect the relationship of the independent variable with the two forms of the dependent variable, as shown here:

Logistic Coefficient    Reflects Changes in...
Original                Logit (log of the odds)
Exponentiated           Odds

We will discuss in the next section how each form of the coefficient reflects both the direction and the magnitude of the independent variable's relationship but requires a different method of interpretation.

DIRECTIONALITY OF THE RELATIONSHIP

The direction of the relationship (positive or negative) reflects the changes in the dependent variable associated with changes in the independent variable. A positive relationship means that an increase in the independent variable is associated with an increase in the predicted probability, and vice versa for a negative relationship. We will see that the direction of the relationship is reflected differently in the original and exponentiated logistic coefficients.

Interpreting the Direction of Original Coefficients. The sign of the original coefficients (positive or negative) indicates the direction of the relationship, just as seen in regression coefficients. A positive coefficient increases the probability, whereas a negative value decreases the predicted probability, because the original coefficients are expressed in terms of logit values, where a value of 0.0 equates to an odds value of 1.0 and a probability of .50. Thus, negative values relate to odds less than 1.0 and probabilities less than .50.

Interpreting the Direction of Exponentiated Coefficients. Exponentiated coefficients must be interpreted differently, because they are the antilogs of the original coefficients. By taking the antilog, we are actually stating the exponentiated coefficient in terms of odds, which means that exponentiated coefficients will not have negative values. Because the antilog of 0 (no effect) is 1.0, an exponentiated coefficient of 1.0 actually corresponds to a relationship with no direction. Thus, exponentiated coefficients above 1.0 reflect a positive relationship and values less than 1.0 represent negative relationships.

An Example of Interpretation. Let us look at a simple example to see what we mean in terms of the differences between the two forms of logistic coefficients. If b_i (the original coefficient) is positive, its transformation (the exponentiated coefficient) will be greater than 1.0, meaning that the odds will increase for any positive change in the independent variable. Thus, the model will have a higher predicted probability of occurrence. Likewise, if b_i is negative, the exponentiated coefficient is less than 1.0 and the odds will decrease. A coefficient of zero equates to an exponentiated coefficient of 1.0, resulting in no change in the odds. A more detailed discussion of the interpretation of coefficients, the logistic transformation, and estimation procedures can be found in numerous texts [4, 5, 6].

MAGNITUDE OF THE RELATIONSHIP OF METRIC INDEPENDENT VARIABLES

To determine how much the probability will change given a one-unit change in the independent variable, the numeric value of the coefficient must be evaluated. Just as in multiple regression, the coefficients for metric and nonmetric variables must be interpreted differently, because each reflects a different impact on the dependent variable. For metric variables, the question is: How much will the estimated probability change for each unit change in the independent variable?
In multiple regression, we knew that the regression coefficient was the slope of the linear relationship between the independent and dependent measures. A coefficient of 1.35 indicated that the dependent variable increased by 1.35 units each time the independent variable increased by one unit. In logistic regression, we know that we have a nonlinear relationship bounded between 0 and 1, so the coefficients must be interpreted somewhat differently. Moreover, we have both the original and exponentiated coefficients to consider.

Original Logistic Coefficients. Although most appropriate for determining the direction of the relationship, the original logistic coefficients are less useful in determining the magnitude of the relationship. They reflect the change in the logit (logged odds) value, a unit of measure that is not particularly understandable in depicting how much the probabilities actually change.

Exponentiated Logistic Coefficients. Exponentiated coefficients directly reflect the magnitude of the change in the odds value. Because they are exponents, they are interpreted slightly differently. Their impact is multiplicative, meaning that the coefficient's effect is not added to the dependent variable (the odds) but multiplied for each unit change in the independent variable. As such, an exponentiated coefficient of 1.0 denotes no change (1.0 × the odds = no change). This outcome corresponds to our earlier discussion, where exponentiated coefficients less than 1.0 reflect negative relationships and values above 1.0 denote positive relationships.

An Example of Assessing Magnitude of Change. Perhaps an easier approach to determining the amount of change in odds from these values is:

Percentage change in odds = (Exponentiated coefficient - 1.0) × 100

The following values illustrate the percentage change in odds due to a one-unit change in the independent variable for a range of exponentiated coefficients:

Exponentiated Coefficient (e^bi)    .20     .50     1.0     1.5     1.7
Percentage change in odds           -80%    -50%    0%      50%     70%

If the exponentiated coefficient is .20, a one-unit change in the independent variable will reduce the odds by 80 percent (the same as if the odds were multiplied by .20). Likewise, an exponentiated coefficient of 1.5 denotes a 50 percent increase in the odds. A researcher who knows the existing odds and wishes to calculate the new odds value for a change in the independent variable can do so directly through the exponentiated coefficient as follows:

New odds value = Old odds value × Exponentiated coefficient × Change in independent variable

Let us use a simple example to illustrate the manner in which the exponentiated coefficient affects the odds value. Assume that the odds are 1.0 (i.e., 50/50) when the independent variable has a value of 5.5 and the exponentiated coefficient is 2.35. We know that if the exponentiated coefficient is greater than 1.0, then the relationship is positive, but we would like to know how much the odds would change. If we expected the value of the independent variable to increase 1.5 points to 7.0, we could calculate the following:

New odds = 1.0 × 2.35 × 1.5 = 3.525

Odds can be translated into probability values by the simple formula Probability = Odds / (1 + Odds). Thus, the odds of 3.525 translate into a probability of 77.9 percent (3.525 / (1 + 3.525) = .779), indicating that increasing the independent variable by 1.5 points will increase the probability from 50 percent to 78 percent, an increase of 28 percentage points.

The nonlinear nature of the logistic curve is demonstrated, however, when we apply the same increase to the odds again. This time, assume that the independent variable increased another 1.5 points, to 8.5. Would we also expect the probability to increase by another 28 percentage points? It cannot, because that would make the probability greater than 100 percent (78% + 28% = 106%). Thus, the same change in the odds produces a smaller change in probability as the probability approaches 1.0: applying the formula again gives new odds of 3.525 × 2.35 × 1.5 = 12.43 and a probability of 92.6 percent (12.43 / 13.43 = .926), an increase of only about 15 percentage points.
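The odds arithmetic in this section is easy to reproduce. The short sketch below uses the chapter's simplified formula and the values from the worked example above (the exponentiated coefficient of 2.35 is reconstructed from the example's arithmetic, so treat the numbers as illustrative).

```python
def pct_change_in_odds(exp_coef):
    """Percentage change in the odds for a one-unit change in the predictor."""
    return (exp_coef - 1.0) * 100

for c in (0.20, 0.50, 1.0, 1.5, 1.7):
    print(f"exp(coef) = {c:4.2f} -> {pct_change_in_odds(c):+5.0f}% change in odds")

# Worked example: old odds of 1.0, exponentiated coefficient 2.35,
# independent variable increasing by 1.5 points (chapter's simplified formula)
old_odds = 1.0
new_odds = old_odds * 2.35 * 1.5            # = 3.525
probability = new_odds / (1 + new_odds)     # = .779, about 78 percent
print(new_odds, round(probability, 3))
```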


Logistic Regression Models for Multinomial and Ordinal Outcomes

Logistic Regression Models for Multinomial and Ordinal Outcomes CHAPTER 8 Logistic Regression Models for Multinomial and Ordinal Outcomes 8.1 THE MULTINOMIAL LOGISTIC REGRESSION MODEL 8.1.1 Introduction to the Model and Estimation of Model Parameters In the previous

More information

Calculating Effect-Sizes. David B. Wilson, PhD George Mason University

Calculating Effect-Sizes. David B. Wilson, PhD George Mason University Calculating Effect-Sizes David B. Wilson, PhD George Mason University The Heart and Soul of Meta-analysis: The Effect Size Meta-analysis shifts focus from statistical significance to the direction and

More information

Basic Medical Statistics Course

Basic Medical Statistics Course Basic Medical Statistics Course S7 Logistic Regression November 2015 Wilma Heemsbergen w.heemsbergen@nki.nl Logistic Regression The concept of a relationship between the distribution of a dependent variable

More information

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation. CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.

More information

Chapter 10 Logistic Regression

Chapter 10 Logistic Regression Chapter 10 Logistic Regression Data Mining for Business Intelligence Shmueli, Patel & Bruce Galit Shmueli and Peter Bruce 2010 Logistic Regression Extends idea of linear regression to situation where outcome

More information

Multivariate Data Analysis Joseph F. Hair Jr. William C. Black Barry J. Babin Rolph E. Anderson Seventh Edition

Multivariate Data Analysis Joseph F. Hair Jr. William C. Black Barry J. Babin Rolph E. Anderson Seventh Edition Multivariate Data Analysis Joseph F. Hair Jr. William C. Black Barry J. Babin Rolph E. Anderson Seventh Edition Pearson Education Limited Edinburgh Gate Harlow Essex CM20 2JE England and Associated Companies

More information

Bayesian Decision Theory

Bayesian Decision Theory Introduction to Pattern Recognition [ Part 4 ] Mahdi Vasighi Remarks It is quite common to assume that the data in each class are adequately described by a Gaussian distribution. Bayesian classifier is

More information

Ron Heck, Fall Week 3: Notes Building a Two-Level Model

Ron Heck, Fall Week 3: Notes Building a Two-Level Model Ron Heck, Fall 2011 1 EDEP 768E: Seminar on Multilevel Modeling rev. 9/6/2011@11:27pm Week 3: Notes Building a Two-Level Model We will build a model to explain student math achievement using student-level

More information

Advanced Quantitative Data Analysis

Advanced Quantitative Data Analysis Chapter 24 Advanced Quantitative Data Analysis Daniel Muijs Doing Regression Analysis in SPSS When we want to do regression analysis in SPSS, we have to go through the following steps: 1 As usual, we choose

More information

Classification: Linear Discriminant Analysis

Classification: Linear Discriminant Analysis Classification: Linear Discriminant Analysis Discriminant analysis uses sample information about individuals that are known to belong to one of several populations for the purposes of classification. Based

More information

PubH 7470: STATISTICS FOR TRANSLATIONAL & CLINICAL RESEARCH

PubH 7470: STATISTICS FOR TRANSLATIONAL & CLINICAL RESEARCH PubH 7470: STATISTICS FOR TRANSLATIONAL & CLINICAL RESEARCH The First Step: SAMPLE SIZE DETERMINATION THE ULTIMATE GOAL The most important, ultimate step of any of clinical research is to do draw inferences;

More information

Course Introduction and Overview Descriptive Statistics Conceptualizations of Variance Review of the General Linear Model

Course Introduction and Overview Descriptive Statistics Conceptualizations of Variance Review of the General Linear Model Course Introduction and Overview Descriptive Statistics Conceptualizations of Variance Review of the General Linear Model EPSY 905: Multivariate Analysis Lecture 1 20 January 2016 EPSY 905: Lecture 1 -

More information

Lecture 10: Alternatives to OLS with limited dependent variables. PEA vs APE Logit/Probit Poisson

Lecture 10: Alternatives to OLS with limited dependent variables. PEA vs APE Logit/Probit Poisson Lecture 10: Alternatives to OLS with limited dependent variables PEA vs APE Logit/Probit Poisson PEA vs APE PEA: partial effect at the average The effect of some x on y for a hypothetical case with sample

More information

Hierarchical Generalized Linear Models. ERSH 8990 REMS Seminar on HLM Last Lecture!

Hierarchical Generalized Linear Models. ERSH 8990 REMS Seminar on HLM Last Lecture! Hierarchical Generalized Linear Models ERSH 8990 REMS Seminar on HLM Last Lecture! Hierarchical Generalized Linear Models Introduction to generalized models Models for binary outcomes Interpreting parameter

More information

Stat 587: Key points and formulae Week 15

Stat 587: Key points and formulae Week 15 Odds ratios to compare two proportions: Difference, p 1 p 2, has issues when applied to many populations Vit. C: P[cold Placebo] = 0.82, P[cold Vit. C] = 0.74, Estimated diff. is 8% What if a year or place

More information

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) = Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

More information

Binary Dependent Variables

Binary Dependent Variables Binary Dependent Variables In some cases the outcome of interest rather than one of the right hand side variables - is discrete rather than continuous Binary Dependent Variables In some cases the outcome

More information

Do not copy, post, or distribute

Do not copy, post, or distribute 14 CORRELATION ANALYSIS AND LINEAR REGRESSION Assessing the Covariability of Two Quantitative Properties 14.0 LEARNING OBJECTIVES In this chapter, we discuss two related techniques for assessing a possible

More information

International Journal of Statistics: Advances in Theory and Applications

International Journal of Statistics: Advances in Theory and Applications International Journal of Statistics: Advances in Theory and Applications Vol. 1, Issue 1, 2017, Pages 1-19 Published Online on April 7, 2017 2017 Jyoti Academic Press http://jyotiacademicpress.org COMPARING

More information

Short Note: Naive Bayes Classifiers and Permanence of Ratios

Short Note: Naive Bayes Classifiers and Permanence of Ratios Short Note: Naive Bayes Classifiers and Permanence of Ratios Julián M. Ortiz (jmo1@ualberta.ca) Department of Civil & Environmental Engineering University of Alberta Abstract The assumption of permanence

More information

ECE521 Lecture7. Logistic Regression

ECE521 Lecture7. Logistic Regression ECE521 Lecture7 Logistic Regression Outline Review of decision theory Logistic regression A single neuron Multi-class classification 2 Outline Decision theory is conceptually easy and computationally hard

More information

STAT 7030: Categorical Data Analysis

STAT 7030: Categorical Data Analysis STAT 7030: Categorical Data Analysis 5. Logistic Regression Peng Zeng Department of Mathematics and Statistics Auburn University Fall 2012 Peng Zeng (Auburn University) STAT 7030 Lecture Notes Fall 2012

More information

An Introduction to Mplus and Path Analysis

An Introduction to Mplus and Path Analysis An Introduction to Mplus and Path Analysis PSYC 943: Fundamentals of Multivariate Modeling Lecture 10: October 30, 2013 PSYC 943: Lecture 10 Today s Lecture Path analysis starting with multivariate regression

More information

ISQS 5349 Spring 2013 Final Exam

ISQS 5349 Spring 2013 Final Exam ISQS 5349 Spring 2013 Final Exam Name: General Instructions: Closed books, notes, no electronic devices. Points (out of 200) are in parentheses. Put written answers on separate paper; multiple choices

More information

Two Correlated Proportions Non- Inferiority, Superiority, and Equivalence Tests

Two Correlated Proportions Non- Inferiority, Superiority, and Equivalence Tests Chapter 59 Two Correlated Proportions on- Inferiority, Superiority, and Equivalence Tests Introduction This chapter documents three closely related procedures: non-inferiority tests, superiority (by a

More information

Midterm 2 - Solutions

Midterm 2 - Solutions Ecn 102 - Analysis of Economic Data University of California - Davis February 23, 2010 Instructor: John Parman Midterm 2 - Solutions You have until 10:20am to complete this exam. Please remember to put

More information

Review of Statistics 101

Review of Statistics 101 Review of Statistics 101 We review some important themes from the course 1. Introduction Statistics- Set of methods for collecting/analyzing data (the art and science of learning from data). Provides methods

More information

Lecture 12: Effect modification, and confounding in logistic regression

Lecture 12: Effect modification, and confounding in logistic regression Lecture 12: Effect modification, and confounding in logistic regression Ani Manichaikul amanicha@jhsph.edu 4 May 2007 Today Categorical predictor create dummy variables just like for linear regression

More information

Statistical Distribution Assumptions of General Linear Models

Statistical Distribution Assumptions of General Linear Models Statistical Distribution Assumptions of General Linear Models Applied Multilevel Models for Cross Sectional Data Lecture 4 ICPSR Summer Workshop University of Colorado Boulder Lecture 4: Statistical Distributions

More information

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages:

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages: Glossary The ISI glossary of statistical terms provides definitions in a number of different languages: http://isi.cbs.nl/glossary/index.htm Adjusted r 2 Adjusted R squared measures the proportion of the

More information

LOGISTIC REGRESSION Joseph M. Hilbe

LOGISTIC REGRESSION Joseph M. Hilbe LOGISTIC REGRESSION Joseph M. Hilbe Arizona State University Logistic regression is the most common method used to model binary response data. When the response is binary, it typically takes the form of

More information

LOGISTICS REGRESSION FOR SAMPLE SURVEYS

LOGISTICS REGRESSION FOR SAMPLE SURVEYS 4 LOGISTICS REGRESSION FOR SAMPLE SURVEYS Hukum Chandra Indian Agricultural Statistics Research Institute, New Delhi-002 4. INTRODUCTION Researchers use sample survey methodology to obtain information

More information

Statistics Boot Camp. Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018

Statistics Boot Camp. Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018 Statistics Boot Camp Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018 March 21, 2018 Outline of boot camp Summarizing and simplifying data Point and interval estimation Foundations of statistical

More information

One-sample categorical data: approximate inference

One-sample categorical data: approximate inference One-sample categorical data: approximate inference Patrick Breheny October 6 Patrick Breheny Biostatistical Methods I (BIOS 5710) 1/25 Introduction It is relatively easy to think about the distribution

More information

Linear Discrimination Functions

Linear Discrimination Functions Laurea Magistrale in Informatica Nicola Fanizzi Dipartimento di Informatica Università degli Studi di Bari November 4, 2009 Outline Linear models Gradient descent Perceptron Minimum square error approach

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

where Female = 0 for males, = 1 for females Age is measured in years (22, 23, ) GPA is measured in units on a four-point scale (0, 1.22, 3.45, etc.

where Female = 0 for males, = 1 for females Age is measured in years (22, 23, ) GPA is measured in units on a four-point scale (0, 1.22, 3.45, etc. Notes on regression analysis 1. Basics in regression analysis key concepts (actual implementation is more complicated) A. Collect data B. Plot data on graph, draw a line through the middle of the scatter

More information

Introduction to Basic Statistics Version 2

Introduction to Basic Statistics Version 2 Introduction to Basic Statistics Version 2 Pat Hammett, Ph.D. University of Michigan 2014 Instructor Comments: This document contains a brief overview of basic statistics and core terminology/concepts

More information

Single-level Models for Binary Responses

Single-level Models for Binary Responses Single-level Models for Binary Responses Distribution of Binary Data y i response for individual i (i = 1,..., n), coded 0 or 1 Denote by r the number in the sample with y = 1 Mean and variance E(y) =

More information

Module 03 Lecture 14 Inferential Statistics ANOVA and TOI

Module 03 Lecture 14 Inferential Statistics ANOVA and TOI Introduction of Data Analytics Prof. Nandan Sudarsanam and Prof. B Ravindran Department of Management Studies and Department of Computer Science and Engineering Indian Institute of Technology, Madras Module

More information

Introduction to Statistical Analysis

Introduction to Statistical Analysis Introduction to Statistical Analysis Changyu Shen Richard A. and Susan F. Smith Center for Outcomes Research in Cardiology Beth Israel Deaconess Medical Center Harvard Medical School Objectives Descriptive

More information

Biost 518 Applied Biostatistics II. Purpose of Statistics. First Stage of Scientific Investigation. Further Stages of Scientific Investigation

Biost 518 Applied Biostatistics II. Purpose of Statistics. First Stage of Scientific Investigation. Further Stages of Scientific Investigation Biost 58 Applied Biostatistics II Scott S. Emerson, M.D., Ph.D. Professor of Biostatistics University of Washington Lecture 5: Review Purpose of Statistics Statistics is about science (Science in the broadest

More information

Section IX. Introduction to Logistic Regression for binary outcomes. Poisson regression

Section IX. Introduction to Logistic Regression for binary outcomes. Poisson regression Section IX Introduction to Logistic Regression for binary outcomes Poisson regression 0 Sec 9 - Logistic regression In linear regression, we studied models where Y is a continuous variable. What about

More information

11. Generalized Linear Models: An Introduction

11. Generalized Linear Models: An Introduction Sociology 740 John Fox Lecture Notes 11. Generalized Linear Models: An Introduction Copyright 2014 by John Fox Generalized Linear Models: An Introduction 1 1. Introduction I A synthesis due to Nelder and

More information

NON-PARAMETRIC STATISTICS * (http://www.statsoft.com)

NON-PARAMETRIC STATISTICS * (http://www.statsoft.com) NON-PARAMETRIC STATISTICS * (http://www.statsoft.com) 1. GENERAL PURPOSE 1.1 Brief review of the idea of significance testing To understand the idea of non-parametric statistics (the term non-parametric

More information

Basic IRT Concepts, Models, and Assumptions

Basic IRT Concepts, Models, and Assumptions Basic IRT Concepts, Models, and Assumptions Lecture #2 ICPSR Item Response Theory Workshop Lecture #2: 1of 64 Lecture #2 Overview Background of IRT and how it differs from CFA Creating a scale An introduction

More information

Statistical Modelling with Stata: Binary Outcomes

Statistical Modelling with Stata: Binary Outcomes Statistical Modelling with Stata: Binary Outcomes Mark Lunt Arthritis Research UK Epidemiology Unit University of Manchester 21/11/2017 Cross-tabulation Exposed Unexposed Total Cases a b a + b Controls

More information

CS6375: Machine Learning Gautam Kunapuli. Decision Trees

CS6375: Machine Learning Gautam Kunapuli. Decision Trees Gautam Kunapuli Example: Restaurant Recommendation Example: Develop a model to recommend restaurants to users depending on their past dining experiences. Here, the features are cost (x ) and the user s

More information

Assessing the Calibration of Dichotomous Outcome Models with the Calibration Belt

Assessing the Calibration of Dichotomous Outcome Models with the Calibration Belt Assessing the Calibration of Dichotomous Outcome Models with the Calibration Belt Giovanni Nattino The Ohio Colleges of Medicine Government Resource Center The Ohio State University Stata Conference -

More information

REVIEW 8/2/2017 陈芳华东师大英语系

REVIEW 8/2/2017 陈芳华东师大英语系 REVIEW Hypothesis testing starts with a null hypothesis and a null distribution. We compare what we have to the null distribution, if the result is too extreme to belong to the null distribution (p

More information

Principal component analysis

Principal component analysis Principal component analysis Motivation i for PCA came from major-axis regression. Strong assumption: single homogeneous sample. Free of assumptions when used for exploration. Classical tests of significance

More information

Y (Nominal/Categorical) 1. Metric (interval/ratio) data for 2+ IVs, and categorical (nominal) data for a single DV

Y (Nominal/Categorical) 1. Metric (interval/ratio) data for 2+ IVs, and categorical (nominal) data for a single DV 1 Neuendorf Discriminant Analysis The Model X1 X2 X3 X4 DF2 DF3 DF1 Y (Nominal/Categorical) Assumptions: 1. Metric (interval/ratio) data for 2+ IVs, and categorical (nominal) data for a single DV 2. Linearity--in

More information

Support Vector Machines. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington

Support Vector Machines. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington Support Vector Machines CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 A Linearly Separable Problem Consider the binary classification

More information

Chapter 19: Logistic regression

Chapter 19: Logistic regression Chapter 19: Logistic regression Smart Alex s Solutions Task 1 A display rule refers to displaying an appropriate emotion in a given situation. For example, if you receive a Christmas present that you don

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

ECLT 5810 Linear Regression and Logistic Regression for Classification. Prof. Wai Lam

ECLT 5810 Linear Regression and Logistic Regression for Classification. Prof. Wai Lam ECLT 5810 Linear Regression and Logistic Regression for Classification Prof. Wai Lam Linear Regression Models Least Squares Input vectors is an attribute / feature / predictor (independent variable) The

More information

Article from. Predictive Analytics and Futurism. July 2016 Issue 13

Article from. Predictive Analytics and Futurism. July 2016 Issue 13 Article from Predictive Analytics and Futurism July 2016 Issue 13 Regression and Classification: A Deeper Look By Jeff Heaton Classification and regression are the two most common forms of models fitted

More information

Bivariate Relationships Between Variables

Bivariate Relationships Between Variables Bivariate Relationships Between Variables BUS 735: Business Decision Making and Research 1 Goals Specific goals: Detect relationships between variables. Be able to prescribe appropriate statistical methods

More information

Generalization to Multi-Class and Continuous Responses. STA Data Mining I

Generalization to Multi-Class and Continuous Responses. STA Data Mining I Generalization to Multi-Class and Continuous Responses STA 5703 - Data Mining I 1. Categorical Responses (a) Splitting Criterion Outline Goodness-of-split Criterion Chi-square Tests and Twoing Rule (b)

More information

Review of the General Linear Model

Review of the General Linear Model Review of the General Linear Model EPSY 905: Multivariate Analysis Online Lecture #2 Learning Objectives Types of distributions: Ø Conditional distributions The General Linear Model Ø Regression Ø Analysis

More information

Stat 642, Lecture notes for 04/12/05 96

Stat 642, Lecture notes for 04/12/05 96 Stat 642, Lecture notes for 04/12/05 96 Hosmer-Lemeshow Statistic The Hosmer-Lemeshow Statistic is another measure of lack of fit. Hosmer and Lemeshow recommend partitioning the observations into 10 equal

More information

Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project

Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Devin Cornell & Sushruth Sastry May 2015 1 Abstract In this article, we explore

More information

Factor analysis. George Balabanis

Factor analysis. George Balabanis Factor analysis George Balabanis Key Concepts and Terms Deviation. A deviation is a value minus its mean: x - mean x Variance is a measure of how spread out a distribution is. It is computed as the average

More information

Chapter 11. Regression with a Binary Dependent Variable

Chapter 11. Regression with a Binary Dependent Variable Chapter 11 Regression with a Binary Dependent Variable 2 Regression with a Binary Dependent Variable (SW Chapter 11) So far the dependent variable (Y) has been continuous: district-wide average test score

More information

An Introduction to Path Analysis

An Introduction to Path Analysis An Introduction to Path Analysis PRE 905: Multivariate Analysis Lecture 10: April 15, 2014 PRE 905: Lecture 10 Path Analysis Today s Lecture Path analysis starting with multivariate regression then arriving

More information

Lecture 14: Introduction to Poisson Regression

Lecture 14: Introduction to Poisson Regression Lecture 14: Introduction to Poisson Regression Ani Manichaikul amanicha@jhsph.edu 8 May 2007 1 / 52 Overview Modelling counts Contingency tables Poisson regression models 2 / 52 Modelling counts I Why

More information

Modelling counts. Lecture 14: Introduction to Poisson Regression. Overview

Modelling counts. Lecture 14: Introduction to Poisson Regression. Overview Modelling counts I Lecture 14: Introduction to Poisson Regression Ani Manichaikul amanicha@jhsph.edu Why count data? Number of traffic accidents per day Mortality counts in a given neighborhood, per week

More information

Cluster Analysis CHAPTER PREVIEW KEY TERMS

Cluster Analysis CHAPTER PREVIEW KEY TERMS LEARNING OBJECTIVES Upon completing this chapter, you should be able to do the following: Define cluster analysis, its roles, and its limitations. Identify the types of research questions addressed by

More information

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis. 401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis

More information

Algebra & Trig Review

Algebra & Trig Review Algebra & Trig Review 1 Algebra & Trig Review This review was originally written for my Calculus I class, but it should be accessible to anyone needing a review in some basic algebra and trig topics. The

More information

A COEFFICIENT OF DETERMINATION FOR LOGISTIC REGRESSION MODELS

A COEFFICIENT OF DETERMINATION FOR LOGISTIC REGRESSION MODELS A COEFFICIENT OF DETEMINATION FO LOGISTIC EGESSION MODELS ENATO MICELI UNIVESITY OF TOINO After a brief presentation of the main extensions of the classical coefficient of determination ( ), a new index

More information

Basics of Multivariate Modelling and Data Analysis

Basics of Multivariate Modelling and Data Analysis Basics of Multivariate Modelling and Data Analysis Kurt-Erik Häggblom 2. Overview of multivariate techniques 2.1 Different approaches to multivariate data analysis 2.2 Classification of multivariate techniques

More information

Ecn Analysis of Economic Data University of California - Davis February 23, 2010 Instructor: John Parman. Midterm 2. Name: ID Number: Section:

Ecn Analysis of Economic Data University of California - Davis February 23, 2010 Instructor: John Parman. Midterm 2. Name: ID Number: Section: Ecn 102 - Analysis of Economic Data University of California - Davis February 23, 2010 Instructor: John Parman Midterm 2 You have until 10:20am to complete this exam. Please remember to put your name,

More information