A COEFFICIENT OF DETERMINATION FOR LOGISTIC REGRESSION MODELS

RENATO MICELI
UNIVERSITY OF TORINO

After a brief presentation of the main extensions of the classical coefficient of determination ($R^2$), a new index is proposed that can be used with logistic regression models for ungrouped data. This index is a direct extension of the classical coefficient of determination for linear models (link function identity and normal distribution for errors), and it shares the same properties. The performance of the indexes, including the one proposed here, is compared by means of simulated data.

Key words: Model fit; Coefficient of determination; Logistic regression models; Generalized Linear Models; Log likelihood.

Correspondence concerning this article should be addressed to Renato Miceli, Dipartimento di Psicologia, Università degli Studi di Torino, Via Verdi 10, 10124 Torino (TO), Italy. E-mail: miceli@psych.unito.it

INTRODUCTION

A large number of research studies in psychology apply models with categorical and limited dependent variables in statistical analysis. Such models usually belong to the large family of Generalized Linear Models (GLM) (McCullagh & Nelder, 1983; Nelder & Wedderburn, 1972). When data are gathered with non-experimental research methods (as in many studies using logistic regression models), the assessment of goodness-of-fit raises problems because there is no summary measure that can be interpreted as easily as the coefficient of determination in classical linear regression models.
The coefficient of determination ($R^2$) in classical linear models (link function identity and normal distribution for errors) is widely used as a goodness-of-fit measure because of its interesting properties (Rao, 1973): (i) it ranges between 0 and 1 (the better the fit, the closer $R^2$ is to 1, which is reached when the model perfectly reproduces the observed data); (ii) it is dimensionless, i.e., it is independent of the unit of measurement used for the variables; (iii) it is independent of sample size ($N$); (iv) it can be immediately and easily interpreted, in that it can be expressed as the proportion of the deviance explained by the model with respect to the total deviance to be explained. In classical linear models, the parameters ($\hat{\theta}_0, \hat{\theta}_1, \ldots, \hat{\theta}_K$) can be estimated by the Ordinary Least Squares (OLS) criterion, and $R^2$ can be expressed as the ratio between explained deviance and deviance to explain (with $N$ observations and $K$ variables), as in equation (1).

TPM Vol. 14, No. 2, 83-98, Summer 2007. © 2007 Cises.

\[ R^2 = \frac{\sum_{i=1}^{N} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2} \qquad (1) \]

where $\hat{y}_i = \hat{\theta}_0 + \sum_{k=1}^{K} \hat{\theta}_k x_{ik}$ and $\bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i$.

Numerous suggestions were made for generalizing $R^2$ to models other than the classical linear one, even when deviance has to be replaced by the more general concept of variability and the parameters are Maximum Likelihood (ML) estimates. Efforts were primarily made to extend $R^2$ to discrete models, in particular to logistic regression models for ungrouped data (Aldrich & Nelson, 1984; Cox & Snell, 1989; Maddala, 1983; Magee, 1990; Nagelkerke, 1991).

The index originally suggested by Maddala (1983), and subsequently by Cox and Snell (1989) and Magee (1990), denoted here by $R^2_{CS}$, can be expressed as:

\[ R^2_{CS} = 1 - \left( \frac{L_0}{L} \right)^{2/N} \qquad (2) \]

where $N$ is the sample size, and $L$ and $L_0$ denote the likelihoods of the fitted and the null (intercept only) model, respectively.

The index proposed by Aldrich and Nelson (1984), here referred to as $R^2_A$, can be expressed as:

\[ R^2_A = \frac{c}{N + c} \]

where $c = -2 \log (L_0 / L)$ is generally referred to as the likelihood ratio.

Even if both indexes present interesting aspects, they do not have property (i). In both cases, the maximum value is less than 1. In particular, the maximum value of $R^2_{CS}$ equals:

\[ \max R^2_{CS} = 1 - L_0^{2/N} \qquad (3) \]

Nagelkerke (1991) proposed to correct $R^2_{CS}$ by means of (3), suggesting an index (denoted here by $R^2_N$) that satisfies property (i) and that can be expressed as:

\[ R^2_N = \frac{R^2_{CS}}{\max R^2_{CS}} \qquad (4) \]

It is easily found (Nagelkerke, 1991) that $R^2_{CS}$ has properties (ii), (iii) and (iv), but not (i); and while properties (i), (ii) and (iii) hold for $R^2_N$, the same is not true of property (iv), which is of fundamental importance in providing a clear interpretation of the index values. Given that $R^2_N$ is a popular diagnostic tool in research and that it varies between 0 and 1, there is a high risk that its values may be interpreted as explained variation. Furthermore, this risk could be even higher if, as our simulations suggest, the index values always tend to suggest an optimistic interpretation of the explanatory power of the model under consideration. Obviously, in order to claim this, a measure having all four properties mentioned above is needed. For this reason it appears useful to propose a new index, here referred to as MR or Maximal Ratio Index.

THE MAXIMAL RATIO (MR) INDEX

It is useful to start by thinking about a metric dependent variable ($y$) and a group of $K$ metric explanatory variables (independent variables, or covariates). In such a context, $K$ nested linear models (link function identity and normal distribution for errors) and the intercept only model can be estimated: besides the intercept, model M1 will contain only the variable $x_1$; model M2 will contain $x_1$ and $x_2$, and so forth. Equation (1) shows the strict proportionality linking the $K$ values of $R^2$ to as many values of the deviance explained by each model. In addition, when the parameters are obtained through the ML estimator, the explained deviance is equivalent to the likelihood ratio (omitting the scale factor $1/\sigma^2$) often referred to as $c$ (Aldrich & Nelson, 1984, p. 55). Such ratio can be expressed as $\Lambda = L_0 / L$ (where $L$ denotes the likelihood of the fitted model, and $L_0$ denotes the likelihood of the null or intercept only model). The deviance explained by the fitted model can thus be expressed as:

\[ c = -2 \log \Lambda = 2 \left[ \log(L) - \log(L_0) \right] \qquad (5) \]

Therefore, within classical linear models (link function identity and normal distribution for errors), $R^2$ can be interpreted as in (iv) by taking into account the increments in the explained deviance as well as the increments in the likelihood ratio. In the context of logistic regression models, the concept of explained deviance has to be replaced by the more general concept of explained variability and, given that $c$ measures the latter, it seems natural to develop a measure of fit proportional to this statistic.
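The indexes reviewed above depend only on the sample size and on the log-likelihoods of the fitted and the null model. The following minimal sketch, which is not part of the original article, shows how they could be computed from equations (2) to (5); the function name `pseudo_r2` and the dictionary keys are illustrative assumptions.

```python
import math

def pseudo_r2(loglik_fitted, loglik_null, n):
    """Fit indexes computed from two log-likelihoods (names are illustrative).

    loglik_fitted : log-likelihood of the fitted model, log(L)
    loglik_null   : log-likelihood of the intercept only model, log(L0)
    n             : sample size N
    """
    # Likelihood ratio statistic, equation (5): c = 2[log(L) - log(L0)]
    c = 2.0 * (loglik_fitted - loglik_null)
    # Maddala / Cox and Snell index, equation (2): 1 - (L0/L)^(2/N) = 1 - exp(-c/N)
    r2_cs = 1.0 - math.exp(-c / n)
    # Its maximum, equation (3): 1 - L0^(2/N)
    r2_cs_max = 1.0 - math.exp(2.0 * loglik_null / n)
    # Nagelkerke's corrected index, equation (4)
    r2_n = r2_cs / r2_cs_max
    # Aldrich and Nelson's index: c / (N + c)
    r2_a = c / (n + c)
    return {"c": c, "r2_cs": r2_cs, "r2_cs_max": r2_cs_max, "r2_n": r2_n, "r2_a": r2_a}

# Hypothetical example: 100 ungrouped observations with a balanced binary outcome,
# so that log(L0) = N * log(0.5); the fitted log-likelihood -50 is made up.
n = 100
loglik_null = n * math.log(0.5)
example = pseudo_r2(-50.0, loglik_null, n)
# A model reproducing every observation has likelihood 1, i.e. log(L) = 0.
perfect = pseudo_r2(0.0, loglik_null, n)
```

In this hypothetical balanced case the ceiling in equation (3) is 0.75, so the index in equation (2) cannot reach 1 even for a perfectly fitting model, whereas the corrected index in equation (4) does.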
On the other hand, within GLM, a statistic also referred to as likelihood ratio (see Dobson, 1990) is often used, but its meaning is completely different from that of statistic $c$. Such ratio can be expressed as $\lambda = L / L_{max}$ (where $L$ denotes the likelihood of the fitted model, but $L_{max}$ denotes the likelihood of the maximal or full model). Nelder and Wedderburn (1972) proposed to use minus twice the logarithm of such ratio as a measure of fit of any generalized linear model. They called this statistic deviance, so as to evoke the statistic that has the same name in classical linear models, and to underline the extension of the concept to the whole family of generalized linear models, even when the simple residual sum of squares can no longer be calculated, or is meaningless. For a generic fitted model, this statistic can thus be expressed as:

\[ D = -2 \log \lambda = 2 \left[ \log(L_{max}) - \log(L) \right] \qquad (6) \]

While statistic $c$ expresses the contribution of the covariates to the fit of the dependent variable (so to speak, the distance already covered thanks to the model), statistic $D$ expresses the amount of discrepancy that, in spite of the model, is still present (the distance still to be covered). The use of a maximal model in the assessment of fit is commonly associated with a certain type of models (for example, log-linear models), or with particular research contexts (confirmatory or experimental methods), where the model may comprise as many covariates as there are observations. Conversely, a maximal model is not suitable for studies conducted with non-experimental methods when, for exploratory purposes, researchers deal with a large number of observations and no a priori defined group of covariates, as often happens when using a logistic regression model. This may be the reason why, in research practice, each statistic ($c$ and $D$) is restricted to its own domain.

Nonetheless, there is a point at which the two domains meet: the intercept only model. The calculation of statistic $D$ for the intercept only model (of any model belonging to the GLM family, and hence also of logistic regression models) yields a measure of the variability that the covariates still have to explain. This statistic is here referred to as $D_0$:

\[ D_0 = -2 \log \lambda_0 = 2 \left[ \log(L_{max}) - \log(L_0) \right] \qquad (7) \]

Now, in the context of logistic regression models, having at our disposal a measure of explained variability ($c$) and a measure of variability to explain ($D_0$), the Maximal Ratio (MR) can be expressed as:

\[ \mathrm{MR} = \frac{c}{D_0} \qquad (8) \]

It is easy to demonstrate that in the case of classical linear models (link function identity and normal distribution for errors) this ratio coincides with $R^2$ (Miceli, 2001), and it therefore has the same well known properties, including that of varying between 0 and 1 and that of being proportional to the amount of explained variability.
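The coincidence between the Maximal Ratio and the classical coefficient of determination can also be checked numerically. The sketch below is only an illustration under stated assumptions: the data, the coefficients, and the function names are hypothetical, and the dispersion parameter $\sigma^2$ is treated as a known constant shared by the fitted, null, and maximal models.

```python
import numpy as np

def gaussian_loglik(y, y_hat, sigma2):
    """Gaussian log-likelihood with fitted values y_hat and dispersion sigma2."""
    n = len(y)
    return -np.sum((y - y_hat) ** 2) / (2.0 * sigma2) - 0.5 * n * np.log(2.0 * np.pi * sigma2)

def maximal_ratio_linear(X, y, sigma2=1.0):
    """Return (MR, R2) for a linear model with intercept fitted by least squares."""
    A = np.column_stack([np.ones(len(y)), X])          # design matrix with intercept
    theta, *_ = np.linalg.lstsq(A, y, rcond=None)      # OLS estimates (also ML here)
    y_hat = A @ theta
    y_bar = np.full_like(y, y.mean())
    l_fit = gaussian_loglik(y, y_hat, sigma2)          # fitted model
    l_null = gaussian_loglik(y, y_bar, sigma2)         # intercept only model
    l_max = gaussian_loglik(y, y, sigma2)              # maximal model: fitted = observed
    c = 2.0 * (l_fit - l_null)                         # explained variability
    d0 = 2.0 * (l_max - l_null)                        # variability to explain
    r2 = np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)
    return c / d0, r2

# Hypothetical data: 200 observations, 3 covariates, made-up coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 1.0 + X @ np.array([0.5, -1.0, 0.25]) + rng.normal(size=200)

mr, r2 = maximal_ratio_linear(X, y)
mr_other_scale, _ = maximal_ratio_linear(X, y, sigma2=4.0)
```

Because $\sigma^2$ cancels in the ratio, the value of `mr` should not depend on the dispersion chosen, which is why the sketch evaluates it for two different values.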
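The main steps of the demonstration are reported below. For classical linear models (link function identity and normal distribution for errors), the log-likelihood function of the generic model with $k$ covariates ($k < N$) and dispersion parameter $\sigma^2$ can be written as:

\[ l = -\frac{1}{2\sigma^2} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 - \frac{N}{2} \log(2\pi\sigma^2) \]

where $\hat{y}_i = \hat{\theta}_0 + \sum_{k=1}^{K} \hat{\theta}_k x_{ik}$.

The log-likelihood function of the maximal or full model, when $y_i = \hat{y}_i$ ($\forall i$), is:

\[ l_{max} = -\frac{N}{2} \log(2\pi\sigma^2) \]

The log-likelihood function of the null or intercept only model, when $\hat{y}_i = \bar{y}$ ($\forall i$) and $\bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i$, is:

\[ l_0 = -\frac{1}{2\sigma^2} \sum_{i=1}^{N} (y_i - \bar{y})^2 - \frac{N}{2} \log(2\pi\sigma^2) \]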

Then

\[ c = 2(l - l_0) = -\frac{1}{\sigma^2} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 - N \log(2\pi\sigma^2) + \frac{1}{\sigma^2} \sum_{i=1}^{N} (y_i - \bar{y})^2 + N \log(2\pi\sigma^2) = \frac{1}{\sigma^2} \left[ \sum_{i=1}^{N} (y_i - \bar{y})^2 - \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 \right] \]

\[ D_0 = 2(l_{max} - l_0) = -N \log(2\pi\sigma^2) + \frac{1}{\sigma^2} \sum_{i=1}^{N} (y_i - \bar{y})^2 + N \log(2\pi\sigma^2) = \frac{1}{\sigma^2} \sum_{i=1}^{N} (y_i - \bar{y})^2 \]

and

\[ \mathrm{MR} = \frac{c}{D_0} = \frac{\sum_{i=1}^{N} (y_i - \bar{y})^2 - \sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2} = \frac{\sum_{i=1}^{N} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2} = R^2 \]

where the last step uses the decomposition $\sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2$, which holds for least squares fits that include an intercept.

EMPIRICAL COMPARISON BETWEEN THE DIFFERENT INDEXES

Through simulated data, it is now possible to compare the performance of the different indexes. Simulations were conducted by generating, for three different sample sizes $N$, a continuous latent variable ($y$), obtained from

\[ y_i = \frac{\exp(X_i)}{1 + \exp(X_i)} \]

where $X_i$ denotes the linear combination of 15 normally distributed random variables and as many coefficients (plus the intercept). For each sample size, two types of continuous variable ($y$) were generated, as shown in Figure 1: simulation type A, with about 36% of its values falling into a fixed interval of the (0, 1) range, thus presenting a clear-cut logistic trend; and simulation type B, with about 86% of its values falling into the same interval, thus presenting a nearly linear trend. For each simulation type (A and B) and for each sample size, nine cutting points were then defined in order to generate as many dummy variables (D1, D2, ..., D9), so that each of them had a different frequency of value 1, as illustrated below:

Dummy variable          D1   D2   D3   D4   D5   D6   D7   D8   D9
Frequency of 1 (%)       3   10   25   40   50   60   75   90   97

Each of the 54 dummies thus generated was then used as dependent variable in 15 logistic regression models, for an overall total of 810 models estimated by ML. The variables of the various logistic regression models were organized so as to define, for each dummy, a group of 15 nested models (M1, M2, ..., M15).
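The simulation design described above can be sketched in a few lines. This is a scaled-down illustration and not the code used in the study: the sample size, the number of covariates, the coefficients, the use of a sample quantile as cutting point, and the Newton-Raphson fitting routine are all assumptions made here. The sketch also relies on the fact that, for ungrouped binary data, the maximal model reproduces every observation, so that $\log(L_{max}) = 0$ and the variability to explain reduces to $-2 \log(L_0)$. The last covariate is deliberately never entered: since each dummy is a deterministic function of the full linear combination, the complete model would separate the data perfectly and the ML estimates would diverge.

```python
import numpy as np

def logistic_loglik(X, d, n_iter=100, tol=1e-10):
    """Maximised log-likelihood of a logistic model with intercept (Newton-Raphson)."""
    A = np.column_stack([np.ones(len(d)), X])
    beta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(A @ beta)))
        gradient = A.T @ (d - p)
        hessian = A.T @ (A * (p * (1.0 - p))[:, None])
        step = np.linalg.solve(hessian, gradient)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    eta = A @ beta
    # Bernoulli log-likelihood: sum of d*eta - log(1 + exp(eta))
    return float(np.sum(d * eta - np.logaddexp(0.0, eta)))

def null_loglik(d):
    """Log-likelihood of the intercept only model for a 0/1 variable."""
    p = d.mean()
    return float(len(d) * (p * np.log(p) + (1.0 - p) * np.log(1.0 - p)))

# Hypothetical, scaled-down design: 2000 observations and 5 covariates.
rng = np.random.default_rng(1)
n, K = 2000, 5
theta = np.array([1.0, 0.8, 0.6, 0.4, 1.0])     # made-up coefficients, intercept 0
X = rng.normal(size=(n, K))
lin = X @ theta
y = np.exp(lin) / (1.0 + np.exp(lin))           # continuous latent variable in (0, 1)

# One dummy variable: the cutting point is chosen so that about 25% of values are 1.
freq = 0.25
cut = np.quantile(y, 1.0 - freq)
d = (y > cut).astype(float)

l0 = null_loglik(d)
d0 = -2.0 * l0                                  # variability to explain, log(L_max) = 0
results = []                                    # (covariates entered, c, MR)
for k in range(1, K):                           # nested models M1, ..., M4
    l_fit = logistic_loglik(X[:, :k], d)
    c = 2.0 * (l_fit - l0)                      # explained variability
    results.append((k, c, c / d0))
```

Since `d0` is the same for all the nested models of a given dummy variable, the MR values in `results` are the `c` values rescaled by a constant.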

FIGURE 1. Two types of latent dependent variable $y$ (one panel for simulation type A and one for simulation type B).

The results obtained, partially reported in Tables 1a, 1b, 1c, and in Figure 2, support the following considerations. Due to space limitations, Tables 1a, 1b, and 1c only report some results from simulation type A estimates with $N = 3000$ (dependent variables D1, D3, and D5), and Figure 2 reports simulation type A graphs; the remaining results are in line with the ones presented here.
(a) The four indexes provide different indications on the model fit; $R^2_{CS}$ and $R^2_A$ even show discordant values;
(b) $R^2_N$ offers a model-data fit value closer to MR than the other indexes do, while yielding higher values on all occasions;

(c) MR and $R^2_N$ yield very similar values in almost all simulations; however, the discrepancies (with increasingly higher values of $R^2_N$) become larger in the proximity of central values, and when the frequency of value 1 in the dependent variable is more or less balanced (40%-60%).

TABLE 1A
Comparison among Fit Indexes from Simulation Type A ($N$ = 3000)

Model  D0  c  MR  other indexes
M1   8853 39 353 54 36 363
M2   8853 4833 594 67 59 576
M3   8853 95 35 5 33 99
M4   8853 869 93 54 63 5858
M5   8853 946 4 5 66 676
M6   8853 367 796 3 97 973
M7   8853 3358 47 476 57 5
M8   8853 344 39 57 8 5
M9   8853 454 3 349 4
M10  8853 43 7 448 7 8
M11  8853 5685 3836 69 8 678
M12  8853 694 537 78 38 88
M13  8853 6987 793 7 94 353
M14  8853 7787 8785 8 37
M15  8853 7986 875 89 37 9

Note. Fifteen nested models were simulated for dependent variable D1 (frequency of value 1 = 3%).

TABLE 1B
Comparison among Fit Indexes from Simulation Type A ($N$ = 3000)

Model  D0  c  MR  other indexes
M1   3374 838 474 46 74 77
M2   3374 464 734 8 789 759
M3   3374 3695 965 8 978
M4   3374 6393 966 45 58
M5   3374 6684 84 58 98 4
M6   3374 9 35 77 88 49
M7   3374 3 3545 654 43 393
M8   3374 635 449 76 5 95
M9   3374 398 44 57 75 79
M10  3374 469 3548 735 87 875
M11  3374 8753 5599 885 649 8473
M12  3374 87 7577 884 33 383
M13  3374 437 44 64 5 4483
M14  3374 8633 488 9 5 884
M15  3374 336 964 98 739 844

Note. Fifteen nested models were simulated for dependent variable D3 (frequency of value 1 = 25%).

TABLE 1C
Comparison among Fit Indexes from Simulation Type A ($N$ = 3000)

Model  D0  c  MR  other indexes
M1   4588 964 36 4 36 3
M2   4588 5 679 78 88 777
M3   4588 4 4 7 4
M4   4588 7848 865 68 73
M5   4588 89 459 5 64 5
M6   4588 567 9 56 4 57
M7   4588 39 3464 949 7 69
M8   4588 477 55 85 889 995
M9   4588 7683 534 94 455 793
M10  4588 864 4758 64 63 89
M11  4588 3787 797 3 475 45
M12  4588 886 94 39 79 934
M13  4588 347 377 55 379 393
M14  4588 3589 63 33 977 447
M15  4588 443 963 983 487 8

Note. Fifteen nested models were simulated for dependent variable D5 (frequency of value 1 = 50%).


FIGURE 2. Comparison among fit indexes from simulation type A ($N$ = 3000); one panel for each dependent variable (D1 to D9), with the nested models M1 to M15 on the abscissa.

If it is important that the fit index may be interpreted as a proportion of explained variation, then it should be noted that $R^2_N$ always tends to suggest an optimistic interpretation of the explanatory power of the fitted model, that is to say, a larger proportion of explained variation. In addition, this optimistic bias is not constant when data and models vary. This aspect can be verified by assessing the congruence between the increments in the variability explained by each model (expressed by statistic $c$) and the corresponding increments in the fit index. Such an evaluation can be done with nested models, as in this study. The strict proportionality between $c$ and MR follows from formula (8). On the contrary, as shown in Table 2, $R^2_N$ never strictly follows the increments in the explained variability: above all, the relation is not constant, and larger differences (with $r$ values considerably lower than +1) are observed for those dependent variables that present a higher balance between 0 and 1 (D4, D5, and D6). Further, Table 2 suggests that increasingly larger discrepancies can be observed as the sample size increases, and the more the latent variable ($y$) moves away from linearity (discrepancies in simulation type A are larger than in simulation type B).

Table 2 reports the $r$ values calculated across the increments of $c$ and of $R^2_N$ over the 15 nested models estimated for each dependent variable. The values for the other two indexes ($R^2_{CS}$ and $R^2_A$) are not reported due to space limitations. However, they are very similar to those of $R^2_N$; usually, $R^2_A$ values are remarkably lower.

TABLE 2
Pearson Correlations between Likelihood Ratio $c$ and $R^2_N$ (for each simulation type and each dependent variable)

N   Simulation type   D1    D2    D3    D4    D5
    A                 9843  7663  44    8679  66
    B                 9843  854   469   889   75
3   A                 9633  646   838   7664  936
3   B                 9753  7586  4     3887  547
3   A                 949   478   439   7345  5768
3   B                 9556  69    54    5746  54

N   Simulation type   D6    D7    D8    D9
    A                 456   386   7786  967
    B                 859   68    7498  9883
3   A                 4979  443   587   9448
3   B                 77    348   8364  988
3   A                 995   464   368   94
3   B                 66    9877  6763  9448

The results of the present study are summarized in Figure 3. For each dependent variable, the values of the four fit indexes (on the ordinate) for each estimated model are shown, so that the trend of these values can be compared with the trend of the likelihood ratio $c$ (the explained variation) on the abscissa.

CONCLUSIONS

The assessment of goodness-of-fit for logistic regression models (ungrouped data) can be facilitated by an index allowing an easy interpretation, such as the coefficient of determination for classical linear regression models. The new index developed in this study (MR) can be used as an alternative to the common indexes (proposed by Cox & Snell, 1989, and by Nagelkerke, 1991) that are currently supplied by the most common statistical software packages.

FIGURE 3. Comparison among fit indexes, with likelihood ratio $c$ on the abscissa, from simulation type A ($N$ = 3000); one panel for each dependent variable (D1 to D9).

This paper compares the performance of MR with the other known indexes by means of simulated data.

In particular, the main distinctive features of the MR index are the following: it is easy to compute; in the case of classical linear models (link function identity and normal distribution for errors) it coincides with the classical coefficient of determination ($R^2$); it varies between 0 and 1; its values may be interpreted as the variation explained by the fitted model with respect to the total variation to be explained.

REFERENCES

Aldrich, J. H., & Nelson, F. D. (1984). Linear Probability, Logit, and Probit Models. Sage University Paper Series on Quantitative Applications in the Social Sciences (07-045). Beverly Hills and London: Sage Publications.
Cox, D. R., & Snell, E. J. (1989). The Analysis of Binary Data (2nd ed.). London: Chapman & Hall.
Dobson, A. J. (1990). An Introduction to Generalized Linear Models. London: Chapman & Hall.
Maddala, G. S. (1983). Limited-dependent and Qualitative Variables in Econometrics. New York: Cambridge University Press.
Magee, L. (1990). R2 Measures Based on Wald and Likelihood Ratio Joint Significance Tests. American Statistician, 44, 250-253.
McCullagh, P., & Nelder, J. A. (1983). Generalized Linear Models. New York: Chapman & Hall.
Miceli, R. (2001). Percorsi di Ricerca e Analisi dei Dati [Research methods and data analysis]. Torino: Bollati Boringhieri.
Nagelkerke, N. J. D. (1991). A Note on a General Definition of the Coefficient of Determination. Biometrika, 78, 691-692.
Nelder, J. A., & Wedderburn, R. W. M. (1972). Generalized Linear Models. Journal of the Royal Statistical Society, A, 135, 370-384.
Rao, C. R. (1973). Linear Statistical Inference and its Applications (2nd ed.). New York: Wiley.