INFORMATION AS A UNIFYING MEASURE OF FIT IN SAS STATISTICAL MODELING PROCEDURES


Ernest S. Shtatland, PhD
Mary B. Barton, MD, MPP
Harvard Medical School, Harvard Pilgrim Health Care, Boston, MA

ABSTRACT

This paper shows that there is a common denominator for many goodness-of-fit criteria in such popular statistical modeling procedures as REG, LOGISTIC, GENMOD, PHREG and others, and that the most natural common denominator is an information-based measure. The need for a common, universal and easy-to-interpret criterion is almost obvious. PROC REG has eleven criteria of fit, LOGISTIC has four, GENMOD nine, and PHREG three. Some of these criteria look alike, but others look entirely unrelated. With this abundance of criteria, the questions of which one to use, why, and how to interpret the criterion of our choice are far from trivial. As a unifying criterion, it is proposed to use (together with other criteria such as R-Square, deviance, log-likelihood and so on) the information-difference statistic, which:
a) is common to all types of regression analysis;
b) is easy to interpret in terms of information gain/loss (in bits);
c) has a very convenient additivity property that allows SAS users to evaluate the contribution of an individual covariate or a group of covariates in terms of information.

SAS USERS AND SAS STATISTICAL MODELING PROCEDURES

One of the most difficult points in using SAS statistical modeling procedures (after the type of model is chosen) is to understand whether the model we have built is good enough in terms of some goodness-of-fit criterion. This means that, at a minimum, we have to understand and interpret printouts. The difficulty of this task is compounded because, for example, PROC REG has at least 11 statistics of fit ([1], p. 1369), PROC GENMOD 9 statistics ([2], p. 273), PROC LOGISTIC 4 ([1], p. 1088, and [2], pp ), PROC PHREG 3 ([2], p. 832), PROC MIXED 6 ([2], pp. 547, ), and so on. The expertise required to understand and properly interpret each of these statistics is not available to all users. Therefore it would be desirable to have a universal criterion, with an easy-to-understand meaning, that would be common to many applications.

POSSIBLE CANDIDATES FOR THE ROLE OF UNIVERSAL CRITERION OF FIT

There are two clear candidates for the role of a universal criterion of fit: R-Square and a statistic based on likelihood.

1. R-SQUARE

R-Square may be the most popular statistic of fit. Its popularity is so widespread that you can find it in your Vanguard or Fidelity mutual fund booklets. R-Square dominates in PROC REG and PROC GLM (in the latter it is the sole criterion). R-Square has been added to the original list of goodness-of-fit criteria for PROC LOGISTIC. R-Square is also sometimes used in survival analysis. Nevertheless, the most natural area of application of R-Square is linear modeling. Here, R-Square is not just a rescaled measure of variation, as it is usually thought of, but a comparison between two models: a null model with intercept only and our current model of interest (see [4], p. 114).
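The reading of R-Square as a comparison between two models can be illustrated with a minimal sketch. It assumes a single-covariate least-squares fit; the data values and the function names `fit_simple_ols` and `r_square_as_model_comparison` are hypothetical and are not taken from the paper or from any SAS procedure.

```python
def fit_simple_ols(x, y):
    """Least-squares intercept and slope for one covariate."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_square_as_model_comparison(x, y):
    """R-Square as 1 - SSE(current model) / SSE(null model).

    The null model has an intercept only, so its fitted value is the
    mean of y for every observation.
    """
    mean_y = sum(y) / len(y)
    sse_null = sum((yi - mean_y) ** 2 for yi in y)
    intercept, slope = fit_simple_ols(x, y)
    sse_fit = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return 1.0 - sse_fit / sse_null


# Hypothetical data, chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
r2 = r_square_as_model_comparison(x, y)
print(f"R-Square = {r2:.4f}")
```

Written this way, R-Square is the proportional reduction in residual sum of squares obtained by moving from the intercept-only model to the current model.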

We should also mention some drawbacks of R-Square. The basic shortcomings are:
1) R-Square always has a positive bias, sometimes rather substantial. This is why the ADJUSTED R-Square has been introduced.
2) There are many interpretations of R-Square; this is not bad in itself, but often leads to significant confusion.

As a result, we can find the advice ([3], p. 14): "It may be easier to interpret $1 - R^2$ [rather than R-Square itself]." Perhaps this is the reason why, in spite of all the popularity of R-Square, it is "a measure many statisticians love to hate" ([4], p. 113).

2. LIKELIHOOD

This is certainly a very fundamental characteristic, even more fundamental than R-Square. It appears in the form of the logarithm of the likelihood (log-likelihood) in printouts of almost all nonlinear regression procedures, either by itself or as a part of AIC, BIC, SC or DEVIANCE (see [1], pp , 1369; [2], p. 273). But sometimes it is used with the sign '+', sometimes with the sign '-'; sometimes you can see the log-likelihood with the factor 2, sometimes without it. This is confusing, especially if you don't know whether this statistic, the log-likelihood, is meaningful in itself or is used instead of the likelihood because of some mathematical convenience (by the way, the latter is the most typical explanation). We will find the answer to this question through the concepts of information and entropy when trying to find a relationship between R-Square and likelihood statistics.

R-SQUARE AND LIKELIHOOD - WHAT THEY HAVE IN COMMON

It has been noted ([5]-[7]) that there exists a simple formula relating R-Square and log-likelihood statistics:

\[ -\tfrac{1}{2}\log(1 - R^2) = \frac{1}{N}\,[\log L(b) - \log L(0)] \tag{1} \]

where N is the sample size, and $\log L(b)$ and $\log L(0)$ denote the log-likelihoods of the fitted and null (intercept-only) models respectively. Here log is either the natural or the binary logarithm. The natural logarithm is used more often with the Gaussian or normal distribution (in linear regression) and the base 2 logarithm with discrete distributions (binomial, Poisson, etc.).

Now, how can the information concept appear in equation (1)? It is well known that for any random event E we have two numbers: the event's probability P(E) and its information I(E), the information contained in the message that E has occurred. These quantities are connected according to the formula

\[ I(E) = -\log P(E) \tag{2} \]

Information is as fundamental a concept as probability, and for a general audience information rather than probability may be the more intuitive of the two. Sometimes I(E) is called the Shannon information or Shannon complexity ([8], p. 38). For a discrete-valued random variable X with the probability distribution {P(x)} we can define the entropy as the information average:

\[ H(X) = -\sum_i P(x_i)\log P(x_i) \tag{3} \]

For a continuous random variable we will have an integral instead of the sum. According to these formulas, we can think of the right side of (1) as a sample estimate of the difference between the entropy characteristics of the null model (b = 0) and the model under consideration. The most natural interpretation of this difference is the information gain (IG) we have when moving from the simplest 'null' model to the fitted model ([9]-[11]). Only now have we justified our use of the log-likelihood instead of just the likelihood, not only from a technical but also from a conceptual point of view.

INFORMATION AND EVALUATING SCIENTIFICALLY IMPORTANT DIFFERENCES IN R-SQUARE

Whatever criterion of fit we are using, we should distinguish 3 types of differences in our models (see [12], p. 319): (1) numeric differences, (2) statistically significant differences, and (3) scientifically important differences. Numerical differences may or may not correspond to significant or important differences. For R-Square, the statistical significance of the difference between two models and two R-Square values (the so-called uniqueness index, see [13], p. 407) is usually determined by using F-ratio statistics. The authors of the above-mentioned book expressed their concern about the absence of a direct relationship between statistical significance and practical or scientific importance of uniqueness indices ([13], p. 440): "With a large sample, it is possible to obtain a uniqueness index that is statistically significant even though the predictor variable accounts for only 2% or 3% of the variance in the criterion. When reviewing your results, it is important to assess whether a significant uniqueness index is large enough to be of substantive importance. How large is large enough?... When doing research with a criterion variable that traditionally has been difficult to predict, a uniqueness index of 0.05 (and possibly even smaller) may be viewed as being of substantive importance. When researching a criterion that is relatively easy to predict, the uniqueness index may have to be larger to be considered important." A similar point of view is expressed in [12], p . Various sources recommend using a 2%, 3% or 5% level of merit for the uniqueness index to be considered important. This advice is vague and furthermore may be incorrect in some cases. Below we demonstrate this by showing that the only objective way of resolving the "important vs. unimportant" dilemma is to use information statistics. The following table contains the values of R-Square and corresponding values of information gain (see also [14], p. 250):

TABLE 1. R-Square and the corresponding Information Gain (IG), in bits.

(We have used the binary logarithm in TABLE 1 to measure information in bits.) From these data we can conclude that the R-Square scale is extremely inhomogeneous: closer to 1 the scale becomes steeper, and a change in R-Square of less than 1%, starting from 0.99, has resulted in 1.67 bits. At the same time, the change of R-Square from 0.50 to 0.75 is equivalent to 0.5 bit of information. Thus, the information gain (IG) can be much more helpful than the uniqueness index in determining which difference is practically or scientifically important and which one is not.

It is not difficult to show that there exists a critical value of R-Square equal to

\[ 1 - \frac{1}{2\ln 2} \approx 0.279 \]

below which the uniqueness index slightly overestimates the information gain, and above which it underestimates the information increment. With R-Square closer to 1, this underestimate becomes stronger and stronger. But for $0 < R^2 < 0.5$, the uniqueness index provides us with a good approximation of the information gain.

It is interesting to note that by introducing the information or entropy measure we can not only measure the importance of changes in our model more precisely and objectively, but we can also make some fundamental relationships in the theory of regression more understandable. For example, in a linear regression problem with a dependent variable Y and independent variables $X_1, \dots, X_K$, a well-known and fundamental relationship between R-Square and conditional partial correlations does not look encouraging or easily interpretable (see [15], for example):

\[ R^2 = 1 - (1 - R^2_{YX_1})(1 - R^2_{YX_2 \mid X_1}) \cdots (1 - R^2_{YX_K \mid X_1,\dots,X_{K-1}}) \tag{4} \]

where $R_{YX_1}$ is the correlation coefficient between Y and $X_1$, the partial correlation coefficient $R_{YX_2 \mid X_1}$ is a measure of the strength of the linear relationship between Y and $X_2$ after controlling for the variable $X_1$, and so on. After simple transformations and taking the binary (or natural) logarithm, we get a very simple and easy-to-interpret additive relationship

\[ I(Y; X_1,\dots,X_K) = I(Y; X_1) + I(Y; X_2 \mid X_1) + I(Y; X_3 \mid X_1, X_2) + \dots + I(Y; X_K \mid X_1,\dots,X_{K-1}) \tag{5} \]

Here $I(Y; X_1,\dots,X_K)$ is the information about Y contained in all the covariates $X_1, X_2, \dots, X_K$; $I(Y; X_1)$ is the information about Y contained in $X_1$ alone; $I(Y; X_2 \mid X_1)$ is the information about Y contained in $X_2$ after controlling for $X_1$, and so on.

And to finish our list of benefits from using the information-difference or information-gain statistics instead of the classical R-Square criterion, we will mention the following. It is well known that the most powerful test of significance for overall regression is performed by using an F-statistic, which in terms of R-Square is expressed by the formula (see, for example, [12], p. 126):

\[ F = \frac{N-K-1}{K}\cdot\frac{R^2}{1-R^2} \tag{6} \]

where K is the number of explanatory or independent variables in the model and N is the number of observations. We see that (6) is equivalent to

\[ -\tfrac{1}{2}\log(1 - R^2) = \tfrac{1}{2}\log\Bigl(1 + \frac{K}{N-K-1}\,F\Bigr) \tag{7} \]

from which we can find quite a remarkable fact: 5% significance levels for the F-statistic are transformed to the significance levels for the information gain according to the formula $IG(5\%) = -\tfrac{1}{2}\log(1 - R^2)$, and these significance levels approximately obey the formula (K + 1)/N. The approximation improves with increasing N, as demonstrated in the table below (see also [16]).

TABLE 2. Columns: N, K, IG(5%), (K+1)/N, DIFFERENCE.

Thus, using this fact, we can approximately test the overall significance of our model in advance, without even referring to the table of critical values for the F-distribution (as a preliminary or exploratory analysis, to obtain a rough idea of significance).

Note that the parameter (K+1)/N (or K/N, which is close to it) appears in many cases in statistical modeling:
1) in formulas for the statistics JP and PC in PROC REG ([1], p. 1369) and in some variants of AIC, for example (see [9], pp , [11], p. 276, [17], p. 423)

\[ AIC = -\frac{2}{N}\log L + \frac{2K}{N} \]

2) in determining the importance or lack of importance of the information gain (see [9], p. 170, [11], p. 276);
3) in the formula for the average bias of R-Square (see [15], p. 1031), etc.

Before starting with nonlinear procedures, it is worth noting that ideas similar to those discussed above can be used in canonical correlation analysis and PROC CANCORR.

INFORMATION STATISTICS AS CRITERIA OF FIT IN NONLINEAR MODELS

Equation (1) can serve as a prompt for us to introduce and utilize information-measuring statistics in PROC LOGISTIC, GENMOD, PHREG and other SAS statistical procedures for nonlinear modeling.
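As a rough illustration of how equation (1) might be applied to printed output, the sketch below converts either an R-Square value or a pair of log-likelihoods into an information gain. It assumes the information gain is defined as -(1/2) log(1 - R-Square), as used above; the function names and the log-likelihood values are hypothetical and do not come from any SAS printout.

```python
import math


def info_gain_from_r2(r2, base=2.0):
    """Information gain -(1/2) * log(1 - R^2); base 2 gives bits."""
    if not 0.0 <= r2 < 1.0:
        raise ValueError("r2 must lie in [0, 1)")
    return -0.5 * math.log(1.0 - r2) / math.log(base)


def info_gain_from_loglik(loglik_fitted, loglik_null, n, base=2.0):
    """Right side of equation (1): (1/N) * [logL(b) - logL(0)].

    The log-likelihoods are assumed to be natural logarithms, so the
    result is divided by ln(base) to change units (base 2 gives bits).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return (loglik_fitted - loglik_null) / n / math.log(base)


def generalized_r2(loglik_fitted, loglik_null, n):
    """R-Square implied by equation (1) for natural log-likelihoods."""
    return 1.0 - math.exp(-2.0 * (loglik_fitted - loglik_null) / n)


def f_statistic(r2, n, k):
    """Overall F-statistic of equation (6)."""
    return (n - k - 1) / k * r2 / (1.0 - r2)


# Hypothetical values, chosen only for illustration.
LOGLIK_NULL = -130.0    # intercept-only model
LOGLIK_FITTED = -110.0  # fitted model
N_OBS = 200

ig_bits = info_gain_from_loglik(LOGLIK_FITTED, LOGLIK_NULL, N_OBS)
r2 = generalized_r2(LOGLIK_FITTED, LOGLIK_NULL, N_OBS)
print(f"information gain = {ig_bits:.4f} bits, generalized R-Square = {r2:.4f}")

# The R-Square scale is inhomogeneous: equal steps are worth unequal bits.
step_low = info_gain_from_r2(0.75) - info_gain_from_r2(0.50)
step_high = info_gain_from_r2(0.99) - info_gain_from_r2(0.74)
print(f"0.50 -> 0.75: {step_low:.2f} bits; 0.74 -> 0.99: {step_high:.2f} bits")
```

Both routes give the same number for the same model, which is the sense in which the information gain has the same meaning in linear and nonlinear settings. The helper `f_statistic` is included so that the two sides of equation (7) can be compared numerically.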

Frequently, applications of these procedures belong to medicine and the health and behavioral sciences ([1], [2], [12], [18]-[22]). For a typical application of logistic regression and PROC LOGISTIC in health services research see also [23].

PROC LOGISTIC

For PROC LOGISTIC, equation (1) was used to introduce the generalized coefficient of determination (see [2], p. 414). The most typically used criterion for assessing overall model fit is $-2\log L$. The difference

\[ 2\,(\log L(b) - \log L(0)) \tag{8} \]

has an approximate Chi-Square distribution with K degrees of freedom (which is the only justification for introducing the factor 2 in the criterion). Based on this fact, we can find the corresponding p-value and test the hypotheses. However, it has been noted (see [24] and references in it) that in spite of its practical utility the chi-square statistic is not an estimate of anything meaningful. At the same time, if we divide (8) by 2N we obtain the familiar expression

\[ \frac{1}{N}\,[\log L(b) - \log L(0)] \]

which is the estimate of the information difference between the current fitted model and the model with intercept only. In PROC PHREG we have the same situation.

PROC GENMOD

The deviance (or scaled deviance) plays a central role in the assessment of goodness of fit for PROC GENMOD. It can be defined as

\[ D = 2\,\{\log L(\text{saturated}) - \log L(\text{current})\} \tag{9} \]

where L(saturated) is the likelihood of a saturated model that contains as many parameters as there are data points, and L(current) is the likelihood of the fitted model. At the same time, SAS does not calculate p-values for the overall model deviances, as it recommends ([2], p. 281) that evaluation of model adequacy is better done by comparison with simpler models (for example, the model with intercept only) than with the saturated model. Thus, it is proposed to use the statistic (8) rather than (9). The reason is that the Chi-Square approximation for the deviance distribution is often very poor. In this rather uncertain situation, an additional recommendation is given ([2], p. 281): "a deviance that is small relative to its degrees of freedom is a possible indication of a good model fit." The meaning of a new characteristic, Value/DF, that appears in GENMOD printouts is not explained and remains unclear. Taking into account that typically the sample size N is much greater than the number of explanatory variables K, we can see that Value/DF is approximately equal to twice the information difference between the saturated model and the model under consideration, which provides us with a new, meaningful interpretation of this characteristic.

CONCLUSIONS

1. The facts discussed above support the suggestion that SAS users, in their statistical analyses and especially in their interpretations, should think in terms of the information gain or information differences rather than R-Square and uniqueness indices. The information gain, not the uniqueness index, gives us a real value (in universal information units) of our model, or of improvements in our model if we add some new explanatory variables.
2. Information statistics provide us with an interpretation bridge between linear and nonlinear regressions. These statistics have the same appearance and meaning in both linear and nonlinear settings.

SAS and SAS/STAT are registered trademarks of SAS Institute Inc. ® indicates USA registration.

REFERENCES

1. SAS Institute Inc. (1990). SAS/STAT User's Guide, Version 6, Fourth Edition, Volume 2, Cary, NC: SAS Institute Inc.
2. SAS Institute Inc. (1996). SAS/STAT Software Changes and Enhancements Through Release 6.11, Cary, NC: SAS Institute Inc.
3. SAS Institute Inc. (1988). SAS/STAT User's Guide, Release 6.03 Edition, Cary, NC: SAS Institute Inc.
4. Anderson-Sprecher, R. (1994). Model comparisons and R^2. The American Statistician, 48.
5. Cox, D. R. & Snell, E. J. (1989). The Analysis of Binary Data, Second Edition, London: Chapman and Hall.
6. Maddala, G. S. (1983). Limited-Dependent and Quantitative Variables in Econometrics, Cambridge: University Press.
7. Nagelkerke, N. J. D. (1991). A note on a general definition of the coefficient of determination. Biometrika, 78.
8. Rissanen, J. (1989). Stochastic Complexity in Statistical Inquiry, London: World Scientific.
9. Kent, J. T. (1983). Information gain and a general measure of correlation. Biometrika, 70.
10. Kent, J. T. & O'Quigley, J. (1988). Measures of dependence for censored survival data. Biometrika, 75.
11. El Hasnaoui, A. & Jais, J. P. (1992). Study and applications of an informational measure of dependence in survival models. Methods of Information in Medicine, 31.
12. Kleinbaum, D. G., Kupper, L. L., & Muller, K. E. (1988). Applied Regression Analysis and Other Multivariable Methods, Second Edition, Boston: PWS-Kent.
13. Hatcher, L. & Stepanski, E. J. (1994). A Step-by-Step Approach to Using the SAS System for Univariate and Multivariate Statistics, Cary, NC: SAS Institute Inc.
14. Theil, H. & Chung, C. F. (1988). Information-theoretic measures of fit for univariate and multivariate linear regression. The American Statistician, 42.
15. Stuart, A. & Ord, J. K. (1991). Advanced Theory of Statistics, Fifth Edition, New York: Oxford University Press.
16. Parzen, E. (1983). Time series model identification by estimating information. In Studies in Econometric Time Series, New York: Academic Press.
17. Judge, G. G., Griffiths, W. E., Hill, R. C., & Lee, T. (1980). The Theory and Practice of Econometrics, New York: John Wiley & Sons, Inc.
18. Hosmer, D. W. & Lemeshow, S. (1989). Applied Logistic Regression, New York: John Wiley & Sons, Inc.
19. McCullagh, P. & Nelder, J. A. (1989). Generalized Linear Models, Second Edition, New York: Chapman and Hall.
20. Dobson, A. J. (1983). An Introduction to Statistical Modelling, London: Chapman and Hall.
21. Dobson, A. J. (1990). An Introduction to Generalized Linear Models, London: Chapman and Hall.
22. Firth, D. (1991). Generalized Linear Models. In Statistical Theory and Modelling. Ed. Hinkley, D. V., Reid, N., and Snell, E. J., London: Chapman and Hall.
23. Barton, M. B. & Fletcher, S. W. (1995). Preventive Services for Medicaid Enrollees in an HMO. J Gen Int Med, 10 (Suppl).
24. Akaike, H. (1977). On Entropy Maximization Principle. In Applications of Statistics. Ed. Krishnaiah, P. R., New York: North-Holland Publishing Company.

CONTACT INFORMATION

Ernest S. Shtatland
Department of Ambulatory Care and Prevention
Harvard Pilgrim Health Care & Harvard Medical School
126 Brookline Avenue, Suite 200
Boston, MA
tel: (617)

Mary B. Barton, MD, MPP
Department of Ambulatory Care and Prevention
Harvard Medical School & Harvard Pilgrim Health Care
126 Brookline Avenue, Suite 200
Boston, MA
tel: (617)

ONE MORE TIME ABOUT R^2 MEASURES OF FIT IN LOGISTIC REGRESSION

Ernest S. Shtatland, Ken Kleinman, Emily M. Cain
Harvard Medical School, Harvard Pilgrim Health Care, Boston, MA

ABSTRACT

In logistic regression,