INFORMATION AS A UNIFYING MEASURE OF FIT IN SAS STATISTICAL MODELING PROCEDURES
|
|
- Sheryl Chambers
- 5 years ago
- Views:
Transcription
1 INFORMATION AS A UNIFYING MEASURE OF FIT IN SAS STATISTICAL MODELING PROCEDURES Ernest S. Shtatland, PhD Mary B. Barton, MD, MPP Harvard Medical School, Harvard Pilgrim Health Care, Boston, MA ABSTRACT This paper shows that there is a common denominator for many goodness-of-fit criteria in such popular statistical modeling procedures as REG, LOGISTIC, GENMOD, PHREG and others, and that the most natural common denominator is an information-based measure. The necessity to have some common, universal and easy to interpret criterion is almost obvious. PROC REG has eleven criteria of fit, LOGISTIC has four, GENMOD - nine, and PHREG - three. Some of these criteria look alike, but others look absolutely unrelated. With this abundance of criteria, the questions of which one to use, why and how to interpret the criterion of our choice are far from being trivial. As a unifying criterion, it is proposed to use (together with other criteria as R-Square, deviance, loglikelihood and so on) the information-difference statistic which: a) is common for all the types of regression analysis; b) is easy to interpret in terms of information gain/loss (in bits); c) has a very convenient property of additivity that allows SAS users to evaluate the contribution of an individual covariate or a group of covariates in terms of information. SAS USERS AND SAS STATISTICAL MODELING PROCEDURES One of the most difficult points in using SAS Statistical Modeling Procedures (after the type of the model is chosen) is to understand whether the model we have built is good enough for us in terms of some goodness-of-fit criterion. This means that at a minimum, we have to understand and interpret printouts. The difficulty of this task is compounded because, for example, PROC REG has at least 11 statistics of fit ([1], p. 1369), PROC GENMOD - 9 statistics ([2], p. 273), PROC LOGISTIC - 4 ([1], p. 1088, and [2], pp ), PROC PHREG - 3 ([2], p. 832), PROC MIXED - 6 ([2], pp. 547, ), and so on. The expertise required to understand and interpret properly each of these statistics is not available to all users. Therefore it would be desirable to have a universal criterion with an easy to understand meaning that would be common for many applications. POSSIBLE CANDIDATES FOR THE ROLE OF UNIVERSAL CRITERION OF FIT There are two clear candidates for the role of a universal criterion of fit: R-Square and a statistic based on likelihood. 1. R-SQUARE R-Square may be the most popular statistic of fit. Its popularity is so widespread that you can find it in your Vanguard or Fidelity mutual funds booklets. R-Square dominates in PROC REG and PROC GLM (in the latter it is the sole criterion). R-Square has been added to the original list of goodness-of-fit criteria for PROC LOGISTIC. Also, R-Square is sometimes used in Survival Analysis. Nevertheless, the most natural area of applications of R-Square is linear modeling. Here, R-Square is not just a rescaled measure of variation as it is usually thought of, but a comparison between two models: a null model with intercept only and our current model of interest (see [4], p. 114). 1
2 intercept only) models respectively. Here log is either We should mention also some drawbacks of the natural or binary logarithm. The natural R-Square. The basic shortcomings are: logarithm is used more often with the Gaussian or 1) R-Square always has a positive bias, sometimes normal distribution (in linear regression) and the rather substantial. This is why the ADJUSTED base 2 logarithm with discrete distributions R-Square has been introduced. (binomial, Poisson etc.). 2)There are many interpretations of R-Square; this is not bad in itself, but often leads to significant Now, how can the information concept appear in confusion. equation (1)? It is well-known that for any random As a result, we can find the advice ([3], p. 14): event E we have two numbers: the event s probability It may be easier to interpret 1-R 2 [rather than P(E) and its information I(E), the information R-Square itself]. Perhaps this is the reason why in contained in the spite of all the popularity of R-Square, it is a IE ( ) = log PE ( ) message that E has measure many statisticians love to hate. ([4], p. occurred. These 113). quantities are connected according to the formula 2. LIKELIHOOD (2) This is certainly a very fundamental characteristic, Information is as fundamental a concept as even more fundamental than R-Square. It appears in probability and for a general audience, information the form of the logarithm of the likelihood (log- rather than probability may have more intuitive ease. likelihood) in printouts of almost all nonlinear Sometimes I(E) regression procedures as itself or as a part of AIC, HX ( ) = ( Px ( i)log Px ( i)) is called the i BIC, SC or DEVIANCE (see [1], pp , Shannon 1369; [2], p. 273). But sometimes it is used with the information or Shannon complexity ([8], p. 38). For a sign '+', sometimes with the sign '-'; sometimes you discrete-valued random variable X with the can see the log-likelihood with the factor 2, probability distribution {P(x)} we can define the sometimes without it. This is confusing, especially if entropy as the information average: you don't know whether this statistic, the loglikelihood, is meaningful in itself or it is used instead (3) of likelihood because of some mathematical convenience (by the way, the latter is the most typical For a continuous random variable we will have an explanation). We will find the answer to this question integral instead of the sum. According to these through the concepts of information and entropy formulas, we can think of the right side of (1) as a when trying to find a relationship between R-Square sample estimate of the difference between the entropy and likelihood statistics. characteristics of the null model (b=0) and the model under consideration. The most natural R-SQUARE AND LIKELIHOOD - interpretation of this difference is the information WHAT THEY HAVE IN COMMON gain (IG) we have when moving from the simplest 'null' model to the fitted model ([9]-[11]). Only now It has been noted ([5]-[7]) that there exists a simple we have justified our using the log-likelihood instead formula relating R-Square and log-likelihood of just the likelihood not only from a technical but statistics from a conceptual point of view log(1- R 2 )= 1 [logl(b)-logl(0)] (1) N where N is a sample size, logl(b) and logl(0) denote the log-likelihoods of the fitted and null (with INFORMATION AND EVALUATING SCIENTIFICALLY IMPORTANT DIFFERENCES IN R-SQUARE Whatever criterion of fit we are using, we should distinguish 3 types of differences in our models (see [12], p. 319): (1) numeric differences, (2) statistically 2
3 significant differences, and (3) scientifically conclude that the R-Square scale is extremely important differences. Numerical differences may or inhomogeneous: closer to 1 the scale becomes steeper, may not correspond to significant or important and the change in R-Square from 0.99 to (less differences. For R-Square, the statistical significance than 1%!) has resulted in 1.67 bits. At the same of the difference between two models and two time, the change of R-Square from 0.50 to 0.75 is R-Square values (the so-called uniqueness index, see equivalent to 0.5 bit information. Thus, the [13], p. 407) usually is determined by using F-ratio information gain (IG) can be much more helpful statistics. The authors of the above-mentioned book than the uniqueness index in determining which expressed their concern about the absence of a direct difference is practically or scientifically important relationship between statistical significance and and which one is not. practical or scientific importance of uniqueness indices ([13], p. 440): With a large sample, it is It is not difficult to show that there exists a critical possible to obtain a uniqueness index that is value of R-Square equal to statistically significant even though the predictor variable accounts for only 2% or 3% of the variance 1-1/(2ln2) = in the criterion. When reviewing your results, it is important to assess whether a significant uniqueness below which the uniqueness index slightly overindex is large enough to be of substantive importance. estimates the information gain, and above which it How large is large enough?... When doing underestimates the information increment. With research with a criterion variable that traditionally R-Square closer to 1, this underestimate becomes has been difficult to predict, a uniqueness index of stronger and stronger. But for 0 < R 2 < 0.5, 0.05 (and possibly even smaller) may be viewed as the uniqueness index provides us with a good being of substantive importance. When researching a approximation of the information gain. criterion that is relatively easy to predict, the uniqueness index may have to be larger It is interesting to note that with introducing the to be considered important. A similar point of view information or entropy measure we can not only is expressed in [12], p Various sources measure the importance of changes in our model recommend using a 2%, 3% or 5% level of merit for more precisely and objectively, but we can make some the uniqueness index to be considered important. This fundamental relationships in the theory of regression advice is vague and furthermore may be incorrect in more understandable. For example, in a linear some cases. Below we demonstrate this by showing regression problem with a dependent variable Y and that the only objective way of quantifying an independent variables X 1,...,X K, a well-known important vs. unimportant dilemma is in using and fundamental relationship between R-Square and information statistics. The following table contains conditional partial correlations does not look the values of R-Square and corresponding values of encouraging and easy interpretable (see [15], for information gain (see also, [14], p. 250): example): TABLE R 2 =1-(1-R 2 )(1 - R 2 )... YX 1 YX 2 X 1 (4) R-Square Information Gain (IG)-in bits (1 - R 2 YX K X 1,..., X ) K where R YX1 is the correlation coefficient between Y and X1, the partial correlation coefficient R YX2 is X a measure of the strength of the linear relationship between Y and X 2 after controlling for the variable X 1, and so on. After simple transformations and (We have used the binary logarithm in TABLE 1 to taking the binary (or natural) logarithm, we are measure information in bits). From this data we can getting a very simple and easy to interpret additive 3
4 relationship N K IG(5%) (K+1)/N DIFFERENCE I(Y X 1,...,X K )=I(Y X 1 ) I(Y X 2 )+I(Y X 3 ) +... (5) I(Y X K,...,X K-1 ) Here I(Y X 1, X 2,...,X K ) is the the information about Y contained in all the covariates X 1, X 2,... Thus, using this fact, we can approximately test the, X K ; I(Y X 1 ) is the information about Y contained overall significance of our model in advance, without in X 1 alone; I(Y X 2 ) is the information about Y even referring to the table of critical values for the F- contained in X 2 after controlling for X 1, and so on. distribution (as a preliminary or exploratory analysis - to obtain a rough idea of significance). And to finish our list of benefits from using the information-difference or information-gain statistics Note that the parameter (K+1)/N (or K/N which is instead of the classical R-Square criterion, we will close to it) appears in many cases in statistical mention the following. It is well-known that the most modeling: powerful test of significance for overall regression is 1) in formulas for the statistics JP and PC performed by using an F-statistic which in terms of in PROC REG ([1], p. 1369) and in some variants of R-Square is expressed by the formula (see, for AIC, for example (see [9], pp , [11], p. 276, example [12], p. 126): [17], p. 423) F = N-K-1 [R 2 (1-R 2 )] (6) K AIC =- 2 N logl + 2 K N 2) in determining the importance or lack of where K is the number of explanatory or independent importance of the information gain (see [9], p. 170, variables in the model and N is the number of [11], p. 276); observations. We see that (6) is equivalent to 3) in the formula for the average bias of R-Square (see [15], p. 1031), etc log(1- R 2 K ) = 2log(1 + F N-K-1 ) (7) from which we can find quite a remarkable fact: 5% significance levels for the F-statistic are transformed to the significance levels for the information gain according to the formula IG(5%) = -1/2[log(1 - R-Square)], and these significance levels obey approximately the formula (K + 1)/N. The approximation improves with increase in N as demonstrated in the table below (see also [16]) TABLE 2 Before starting with nonlinear procedures, it is worth noting that the ideas similar to those discussed above can be used in canonical correlation analysis and PROC CANCORR. INFORMATION STATISTICS AS CRITERIA OF FIT IN NONLINEAR MODELS Equation (1) can serve as a prompt for us to introduce and utilize information- measuring statistics in PROC LOGISTIC, GENMOD, PHREG and other SAS statistical procedures for nonlinear modeling. 4
5 Frequently applications of these procedures belong to adequacy is better done by comparison with simpler medicine and health and behavioral sciences ([1], [2], models (for example, the model with intercept only) [12], [18]-[22]. For a typical application of logistic than the saturated model. Thus, it is proposed to use regression and PROC LOGISTIC in health services the statistic (8) rather than (9). The reason is that the research see also [23]. Chi-Square approximation for the deviance distribution is often very poor. In this rather uncertain PROC LOGISTIC situation, an additional recommendation is given ([2], p. 281): a deviance that is small relative to its For PROC LOGISTIC, equation (1) was used to degrees of freedom is a possible indication of a good introduce the generalized coefficient of determination model fit. The meaning of a new characteristic, (see [2], p. 414). The most typically used criterion for Value/DF, that appears in GENMOD printouts, is assessing overall model fit is (-2logL). The difference not explained and remains unclear. Taking into account that typically the sample size N is much 2(logL(b) - logl(0)) (8) greater than the number of explanatory variables K we can see that Value/DF is equal (approximately) to has an approximate Chi-Square distribution with k twice the information difference between the degrees of freedom (which is the only justification of saturated model and the model under consideration, introducing the factor 2 in the criterion). Based on which provides us with a new meaningful this fact, we can find the corresponding p-value and interpretation of this characteristic. test the hypotheses. However, it has been noted (see [24] and references in it) that in spite of its practical utility the chi-square statistic is not an estimate of anything meaningful. At the same time, if we divide CONCLUSIONS (8) by 2N we obtain the familiar expression 1 N [logl(b)-logl(0)] which is the estimate of the information difference between the current fitted model and the model with intercept only. In PROC PHREG we have the same situation. PROC GENMOD The deviance (or scaled deviance) plays a central role in assessment of goodness of fit for PROC GENMOD. It can be defined as 1. The facts discussed above support the suggestion that SAS users in their statistical analyses and especially interpretations should think in terms of the information gain or information differences rather than R-Square and uniqueness indices. The information gain, not the uniqueness index, gives us a real value (in universal information units) of our model or improvements in our model if we add some new explanatory variables. 2. Information statistics provide us with an interpretation bridge between linear and nonlinear regressions. These statistics have the same appearance and meaning in both linear and nonlinear settings. D = 2{logL(saturated) - logl(current)} (9) where L(saturated) is the likelihood of a saturated model that contains as many parameters as there are data points, and L(current) is the likelihood of the fitted model. At the same time SAS does not calculate p-values for the overall model deviances as it recommends ( [2], p. 281) that evaluation of model SAS and SAS/STAT are registered trademarks of SAS Institute Inc. indicates USA registration. REFERENCES 1. SAS Institute Inc. (1990). SAS/STAT User s Guide, Version 6, Fourth Edition, Volume 2, Cary, 5
6 NC: SAS Institute Inc. 2. SAS Institute Inc. (1996). SAS/STAT Software Changes and Enhancements Through Release 6.11, Cary, NC: SAS Institute Inc. 3. SAS Institute Inc. (1988). SAS/STAT User s Guide, Release 6.03 Edition, Cary, NC: SAS Institute Inc. 4. Anderson-Sprecher, R. (1994). Model comparisons and. The American Statistician, 48, R 2 5. Cox, D. R. & Snell, E. J. (1989). The Analysis of Binary Data, Second Edition, London: Chapman and Hall. 6. Maddala, G. S. (1983). Limited-Dependent and Quantitative Variables in Econometrics, Cambridge: University Press. 7. Nagelkerke, N. J. D. (1991). A note on a general definition of the coefficient of determination. Biometrika, 78, Rissanen J. (1989). Stochastic Complexity in Statistical Inquiry, London: World Scientific. 9. Kent J. T. (1983). Information gain and a general measure of correlation. Biometrika, 70, Kent J. T. & O Quigley, J. (1988). Measures of dependence for censored survival data. Biometrika,75, El Hasnaoui A. & Jais, J. P. (1992). Study and applications of an informational measure of dependence in survival models. Methods of Information in Medicine, 31, Kleinbaum, D. G., Kupper, L. L., & Muller, K. E. (1988). Applied Regression Analysis and Other Multivariable Methods, Second Edition, Boston: PWS-Kent. 13. Hatcher, L. & Stepanski, E. J. (1994). A Stepby -Step Approach to Using the SAS System for Univariate and Multivariate Statistics, Cary, NC: SAS Institute Inc. 14. Theil, H. & Chung, C. F. (1988). Information- theoretic measures of fit for univariate and multivariate linear regression. The American Statistician, 42, Stuart, A. & Ord, J. K. (1991). Advanced Theory of Statistics, Fifth edition, New York: Oxford University Press. 16. Parzen E. (1983). Time series model identification by estimating information. In Studies in Econometric Time Series, New York: Academic Press. 17. Judge, G. G., Griffiths, W. E., Hill, R. C., and Lee, T. (1980). The Theory and Practice of Econometrics, New York: John Wiley & Sons, Inc. 18. Hosmer, D. W. & Lemeshow, S. (1989). Applied Logistic Regression, New York: John Wiley & Sons, Inc. 19. McCullagh, P. & Nelder, J.A. (1989). Generalized Linear Models, Second Edition, New York: Chapman and Hall. 20. Dobson, A. J. (1983). An Introduction to Statistical Modelling, London: Chapman and Hall. 21. Dobson, A. J. (1990). An Introduction to Generalized Linear Models, London: Chapman and Hall. 22. Firth, D. (1991). Generalized Linear Models. In Statistical Theory and Modelling. Ed. Hinkley, D.V., Reid, N., and Snell, E. J., London: Chapman and Hall. 23. Barton, M. B., Fletcher, S. W. (1995) Preventive Services for Medicaid Enrollees in an HMO. J Gen Int Med, 10 (Suppl): Akaike, H. (1977). On Entropy Maximization Principle. In Applications of Statistics. Ed. Krishnaiah, P. R., New York: North-Holland Publishing Company. CONTACT INFORMATION: Ernest S. Shtatland Department of Ambulatory Care and Prevention 6
7 Harvard Pilgrim Health Care & Harvard Medical School 126 Brookline Avenue, Suite 200 Boston, MA tel: (617) Mary B. Barton, MD, MPP Department of Ambulatory Care and Prevention Harvard Medical School & Harvard Pilgrim Health Care 126 Brookline Avenue, Suite 200 Boston, MA tel: (617)
ONE MORE TIME ABOUT R 2 MEASURES OF FIT IN LOGISTIC REGRESSION
ONE MORE TIME ABOUT R 2 MEASURES OF FIT IN LOGISTIC REGRESSION Ernest S. Shtatland, Ken Kleinman, Emily M. Cain Harvard Medical School, Harvard Pilgrim Health Care, Boston, MA ABSTRACT In logistic regression,
More informationLOGISTIC REGRESSION Joseph M. Hilbe
LOGISTIC REGRESSION Joseph M. Hilbe Arizona State University Logistic regression is the most common method used to model binary response data. When the response is binary, it typically takes the form of
More informationBootstrap Simulation Procedure Applied to the Selection of the Multiple Linear Regressions
JKAU: Sci., Vol. 21 No. 2, pp: 197-212 (2009 A.D. / 1430 A.H.); DOI: 10.4197 / Sci. 21-2.2 Bootstrap Simulation Procedure Applied to the Selection of the Multiple Linear Regressions Ali Hussein Al-Marshadi
More informationA COEFFICIENT OF DETERMINATION FOR LOGISTIC REGRESSION MODELS
A COEFFICIENT OF DETEMINATION FO LOGISTIC EGESSION MODELS ENATO MICELI UNIVESITY OF TOINO After a brief presentation of the main extensions of the classical coefficient of determination ( ), a new index
More informationSAS Software to Fit the Generalized Linear Model
SAS Software to Fit the Generalized Linear Model Gordon Johnston, SAS Institute Inc., Cary, NC Abstract In recent years, the class of generalized linear models has gained popularity as a statistical modeling
More informationCOLLABORATION OF STATISTICAL METHODS IN SELECTING THE CORRECT MULTIPLE LINEAR REGRESSIONS
American Journal of Biostatistics 4 (2): 29-33, 2014 ISSN: 1948-9889 2014 A.H. Al-Marshadi, This open access article is distributed under a Creative Commons Attribution (CC-BY) 3.0 license doi:10.3844/ajbssp.2014.29.33
More informationChapter 1 Statistical Inference
Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations
More informationDISPLAYING THE POISSON REGRESSION ANALYSIS
Chapter 17 Poisson Regression Chapter Table of Contents DISPLAYING THE POISSON REGRESSION ANALYSIS...264 ModelInformation...269 SummaryofFit...269 AnalysisofDeviance...269 TypeIII(Wald)Tests...269 MODIFYING
More informationGeneralized Linear Models (GLZ)
Generalized Linear Models (GLZ) Generalized Linear Models (GLZ) are an extension of the linear modeling process that allows models to be fit to data that follow probability distributions other than the
More informationCHOOSING AMONG GENERALIZED LINEAR MODELS APPLIED TO MEDICAL DATA
STATISTICS IN MEDICINE, VOL. 17, 59 68 (1998) CHOOSING AMONG GENERALIZED LINEAR MODELS APPLIED TO MEDICAL DATA J. K. LINDSEY AND B. JONES* Department of Medical Statistics, School of Computing Sciences,
More informationIntroduction to the Logistic Regression Model
CHAPTER 1 Introduction to the Logistic Regression Model 1.1 INTRODUCTION Regression methods have become an integral component of any data analysis concerned with describing the relationship between a response
More informationAnalysis of Longitudinal Data: Comparison between PROC GLM and PROC MIXED.
Analysis of Longitudinal Data: Comparison between PROC GLM and PROC MIXED. Maribeth Johnson, Medical College of Georgia, Augusta, GA ABSTRACT Longitudinal data refers to datasets with multiple measurements
More information8 Nominal and Ordinal Logistic Regression
8 Nominal and Ordinal Logistic Regression 8.1 Introduction If the response variable is categorical, with more then two categories, then there are two options for generalized linear models. One relies on
More informationClass Notes: Week 8. Probit versus Logit Link Functions and Count Data
Ronald Heck Class Notes: Week 8 1 Class Notes: Week 8 Probit versus Logit Link Functions and Count Data This week we ll take up a couple of issues. The first is working with a probit link function. While
More informationModel comparison. Patrick Breheny. March 28. Introduction Measures of predictive power Model selection
Model comparison Patrick Breheny March 28 Patrick Breheny BST 760: Advanced Regression 1/25 Wells in Bangladesh In this lecture and the next, we will consider a data set involving modeling the decisions
More informationBIAS OF MAXIMUM-LIKELIHOOD ESTIMATES IN LOGISTIC AND COX REGRESSION MODELS: A COMPARATIVE SIMULATION STUDY
BIAS OF MAXIMUM-LIKELIHOOD ESTIMATES IN LOGISTIC AND COX REGRESSION MODELS: A COMPARATIVE SIMULATION STUDY Ingo Langner 1, Ralf Bender 2, Rebecca Lenz-Tönjes 1, Helmut Küchenhoff 2, Maria Blettner 2 1
More informationLOGISTICS REGRESSION FOR SAMPLE SURVEYS
4 LOGISTICS REGRESSION FOR SAMPLE SURVEYS Hukum Chandra Indian Agricultural Statistics Research Institute, New Delhi-002 4. INTRODUCTION Researchers use sample survey methodology to obtain information
More informationEstimating terminal half life by non-compartmental methods with some data below the limit of quantification
Paper SP08 Estimating terminal half life by non-compartmental methods with some data below the limit of quantification Jochen Müller-Cohrs, CSL Behring, Marburg, Germany ABSTRACT In pharmacokinetic studies
More informationRepeated ordinal measurements: a generalised estimating equation approach
Repeated ordinal measurements: a generalised estimating equation approach David Clayton MRC Biostatistics Unit 5, Shaftesbury Road Cambridge CB2 2BW April 7, 1992 Abstract Cumulative logit and related
More informationPractice of SAS Logistic Regression on Binary Pharmacodynamic Data Problems and Solutions. Alan J Xiao, Cognigen Corporation, Buffalo NY
Practice of SAS Logistic Regression on Binary Pharmacodynamic Data Problems and Solutions Alan J Xiao, Cognigen Corporation, Buffalo NY ABSTRACT Logistic regression has been widely applied to population
More informationLogistic Regression. Fitting the Logistic Regression Model BAL040-A.A.-10-MAJ
Logistic Regression The goal of a logistic regression analysis is to find the best fitting and most parsimonious, yet biologically reasonable, model to describe the relationship between an outcome (dependent
More informationA note on R 2 measures for Poisson and logistic regression models when both models are applicable
Journal of Clinical Epidemiology 54 (001) 99 103 A note on R measures for oisson and logistic regression models when both models are applicable Martina Mittlböck, Harald Heinzl* Department of Medical Computer
More informationApproximate Median Regression via the Box-Cox Transformation
Approximate Median Regression via the Box-Cox Transformation Garrett M. Fitzmaurice,StuartR.Lipsitz, and Michael Parzen Median regression is used increasingly in many different areas of applications. The
More informationLISA Short Course Series Generalized Linear Models (GLMs) & Categorical Data Analysis (CDA) in R. Liang (Sally) Shan Nov. 4, 2014
LISA Short Course Series Generalized Linear Models (GLMs) & Categorical Data Analysis (CDA) in R Liang (Sally) Shan Nov. 4, 2014 L Laboratory for Interdisciplinary Statistical Analysis LISA helps VT researchers
More information\It is important that there be options in explication, and equally important that. Joseph Retzer, Market Probe, Inc., Milwaukee WI
Measuring The Information Content Of Regressors In The Linear Model Using Proc Reg and SAS IML \It is important that there be options in explication, and equally important that the candidates have clear
More informationAnalysis of the AIC Statistic for Optimal Detection of Small Changes in Dynamic Systems
Analysis of the AIC Statistic for Optimal Detection of Small Changes in Dynamic Systems Jeremy S. Conner and Dale E. Seborg Department of Chemical Engineering University of California, Santa Barbara, CA
More informationIntroduction to Statistical Analysis
Introduction to Statistical Analysis Changyu Shen Richard A. and Susan F. Smith Center for Outcomes Research in Cardiology Beth Israel Deaconess Medical Center Harvard Medical School Objectives Descriptive
More informationModel Estimation Example
Ronald H. Heck 1 EDEP 606: Multivariate Methods (S2013) April 7, 2013 Model Estimation Example As we have moved through the course this semester, we have encountered the concept of model estimation. Discussions
More informationAnalysis of variance, multivariate (MANOVA)
Analysis of variance, multivariate (MANOVA) Abstract: A designed experiment is set up in which the system studied is under the control of an investigator. The individuals, the treatments, the variables
More informationModel Based Statistics in Biology. Part V. The Generalized Linear Model. Chapter 16 Introduction
Model Based Statistics in Biology. Part V. The Generalized Linear Model. Chapter 16 Introduction ReCap. Parts I IV. The General Linear Model Part V. The Generalized Linear Model 16 Introduction 16.1 Analysis
More informationLogistic Regression. Interpretation of linear regression. Other types of outcomes. 0-1 response variable: Wound infection. Usual linear regression
Logistic Regression Usual linear regression (repetition) y i = b 0 + b 1 x 1i + b 2 x 2i + e i, e i N(0,σ 2 ) or: y i N(b 0 + b 1 x 1i + b 2 x 2i,σ 2 ) Example (DGA, p. 336): E(PEmax) = 47.355 + 1.024
More informationA Note on Bayesian Inference After Multiple Imputation
A Note on Bayesian Inference After Multiple Imputation Xiang Zhou and Jerome P. Reiter Abstract This article is aimed at practitioners who plan to use Bayesian inference on multiplyimputed datasets in
More informationInvestigating Models with Two or Three Categories
Ronald H. Heck and Lynn N. Tabata 1 Investigating Models with Two or Three Categories For the past few weeks we have been working with discriminant analysis. Let s now see what the same sort of model might
More informationCzech J. Anim. Sci., 50, 2005 (4):
Czech J Anim Sci, 50, 2005 (4: 163 168 Original Paper Canonical correlation analysis for studying the relationship between egg production traits and body weight, egg weight and age at sexual maturity in
More informationLogistic Regression: Regression with a Binary Dependent Variable
Logistic Regression: Regression with a Binary Dependent Variable LEARNING OBJECTIVES Upon completing this chapter, you should be able to do the following: State the circumstances under which logistic regression
More informationUNIVERSITY OF TORONTO. Faculty of Arts and Science APRIL 2010 EXAMINATIONS STA 303 H1S / STA 1002 HS. Duration - 3 hours. Aids Allowed: Calculator
UNIVERSITY OF TORONTO Faculty of Arts and Science APRIL 2010 EXAMINATIONS STA 303 H1S / STA 1002 HS Duration - 3 hours Aids Allowed: Calculator LAST NAME: FIRST NAME: STUDENT NUMBER: There are 27 pages
More informationThe GENMOD Procedure. Overview. Getting Started. Syntax. Details. Examples. References. SAS/STAT User's Guide. Book Contents Previous Next
Book Contents Previous Next SAS/STAT User's Guide Overview Getting Started Syntax Details Examples References Book Contents Previous Next Top http://v8doc.sas.com/sashtml/stat/chap29/index.htm29/10/2004
More informationIntroduction to Statistical modeling: handout for Math 489/583
Introduction to Statistical modeling: handout for Math 489/583 Statistical modeling occurs when we are trying to model some data using statistical tools. From the start, we recognize that no model is perfect
More informationStatistics Boot Camp. Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018
Statistics Boot Camp Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018 March 21, 2018 Outline of boot camp Summarizing and simplifying data Point and interval estimation Foundations of statistical
More informationConfirmatory and Exploratory Data Analyses Using PROC GENMOD: Factors Associated with Red Light Running Crashes
Confirmatory and Exploratory Data Analyses Using PROC GENMOD: Factors Associated with Red Light Running Crashes Li wan Chen, LENDIS Corporation, McLean, VA Forrest Council, Highway Safety Research Center,
More informationFEEG6017 lecture: Akaike's information criterion; model reduction. Brendan Neville
FEEG6017 lecture: Akaike's information criterion; model reduction Brendan Neville bjn1c13@ecs.soton.ac.uk Occam's razor William of Occam, 1288-1348. All else being equal, the simplest explanation is the
More informationRANDOM and REPEATED statements - How to Use Them to Model the Covariance Structure in Proc Mixed. Charlie Liu, Dachuang Cao, Peiqi Chen, Tony Zagar
Paper S02-2007 RANDOM and REPEATED statements - How to Use Them to Model the Covariance Structure in Proc Mixed Charlie Liu, Dachuang Cao, Peiqi Chen, Tony Zagar Eli Lilly & Company, Indianapolis, IN ABSTRACT
More informationProcedia - Social and Behavioral Sciences 109 ( 2014 )
Available online at www.sciencedirect.com ScienceDirect Procedia - Social and Behavioral Sciences 09 ( 04 ) 730 736 nd World Conference On Business, Economics And Management - WCBEM 03 Categorical Principal
More informationSigmaplot di Systat Software
Sigmaplot di Systat Software SigmaPlot Has Extensive Statistical Analysis Features SigmaPlot is now bundled with SigmaStat as an easy-to-use package for complete graphing and data analysis. The statistical
More informationGeneralized Linear Models 1
Generalized Linear Models 1 STA 2101/442: Fall 2012 1 See last slide for copyright information. 1 / 24 Suggested Reading: Davison s Statistical models Exponential families of distributions Sec. 5.2 Chapter
More informationDetermining the number of components in mixture models for hierarchical data
Determining the number of components in mixture models for hierarchical data Olga Lukočienė 1 and Jeroen K. Vermunt 2 1 Department of Methodology and Statistics, Tilburg University, P.O. Box 90153, 5000
More informationOpen Problems in Mixed Models
xxiii Determining how to deal with a not positive definite covariance matrix of random effects, D during maximum likelihood estimation algorithms. Several strategies are discussed in Section 2.15. For
More informationThe GENMOD Procedure (Book Excerpt)
SAS/STAT 9.22 User s Guide The GENMOD Procedure (Book Excerpt) SAS Documentation This document is an individual chapter from SAS/STAT 9.22 User s Guide. The correct bibliographic citation for the complete
More informationApplication of Ghosh, Grizzle and Sen s Nonparametric Methods in. Longitudinal Studies Using SAS PROC GLM
Application of Ghosh, Grizzle and Sen s Nonparametric Methods in Longitudinal Studies Using SAS PROC GLM Chan Zeng and Gary O. Zerbe Department of Preventive Medicine and Biometrics University of Colorado
More informationCase of single exogenous (iv) variable (with single or multiple mediators) iv à med à dv. = β 0. iv i. med i + α 1
Mediation Analysis: OLS vs. SUR vs. ISUR vs. 3SLS vs. SEM Note by Hubert Gatignon July 7, 2013, updated November 15, 2013, April 11, 2014, May 21, 2016 and August 10, 2016 In Chap. 11 of Statistical Analysis
More information1. Hypothesis testing through analysis of deviance. 3. Model & variable selection - stepwise aproaches
Sta 216, Lecture 4 Last Time: Logistic regression example, existence/uniqueness of MLEs Today s Class: 1. Hypothesis testing through analysis of deviance 2. Standard errors & confidence intervals 3. Model
More informationGeneralized Models: Part 1
Generalized Models: Part 1 Topics: Introduction to generalized models Introduction to maximum likelihood estimation Models for binary outcomes Models for proportion outcomes Models for categorical outcomes
More informationSAS/STAT 14.2 User s Guide. The GENMOD Procedure
SAS/STAT 14.2 User s Guide The GENMOD Procedure This document is an individual chapter from SAS/STAT 14.2 User s Guide. The correct bibliographic citation for this manual is as follows: SAS Institute Inc.
More informationBias-corrected AIC for selecting variables in Poisson regression models
Bias-corrected AIC for selecting variables in Poisson regression models Ken-ichi Kamo (a), Hirokazu Yanagihara (b) and Kenichi Satoh (c) (a) Corresponding author: Department of Liberal Arts and Sciences,
More informationBivariate Weibull-power series class of distributions
Bivariate Weibull-power series class of distributions Saralees Nadarajah and Rasool Roozegar EM algorithm, Maximum likelihood estimation, Power series distri- Keywords: bution. Abstract We point out that
More informationStatistics 262: Intermediate Biostatistics Model selection
Statistics 262: Intermediate Biostatistics Model selection Jonathan Taylor & Kristin Cobb Statistics 262: Intermediate Biostatistics p.1/?? Today s class Model selection. Strategies for model selection.
More informationStatistical Practice
Statistical Practice A Note on Bayesian Inference After Multiple Imputation Xiang ZHOU and Jerome P. REITER This article is aimed at practitioners who plan to use Bayesian inference on multiply-imputed
More informationData Analyses in Multivariate Regression Chii-Dean Joey Lin, SDSU, San Diego, CA
Data Analyses in Multivariate Regression Chii-Dean Joey Lin, SDSU, San Diego, CA ABSTRACT Regression analysis is one of the most used statistical methodologies. It can be used to describe or predict causal
More informationModel Selection for Semiparametric Bayesian Models with Application to Overdispersion
Proceedings 59th ISI World Statistics Congress, 25-30 August 2013, Hong Kong (Session CPS020) p.3863 Model Selection for Semiparametric Bayesian Models with Application to Overdispersion Jinfang Wang and
More informationExtensions of Cox Model for Non-Proportional Hazards Purpose
PhUSE 2013 Paper SP07 Extensions of Cox Model for Non-Proportional Hazards Purpose Jadwiga Borucka, PAREXEL, Warsaw, Poland ABSTRACT Cox proportional hazard model is one of the most common methods used
More informationTesting Equality of Two Intercepts for the Parallel Regression Model with Non-sample Prior Information
Testing Equality of Two Intercepts for the Parallel Regression Model with Non-sample Prior Information Budi Pratikno 1 and Shahjahan Khan 2 1 Department of Mathematics and Natural Science Jenderal Soedirman
More informationAnalysis of Count Data A Business Perspective. George J. Hurley Sr. Research Manager The Hershey Company Milwaukee June 2013
Analysis of Count Data A Business Perspective George J. Hurley Sr. Research Manager The Hershey Company Milwaukee June 2013 Overview Count data Methods Conclusions 2 Count data Count data Anything with
More informationSAS/STAT 13.1 User s Guide. Introduction to Survey Sampling and Analysis Procedures
SAS/STAT 13.1 User s Guide Introduction to Survey Sampling and Analysis Procedures This document is an individual chapter from SAS/STAT 13.1 User s Guide. The correct bibliographic citation for the complete
More informationGeneralized Linear Models
Generalized Linear Models Lecture 3. Hypothesis testing. Goodness of Fit. Model diagnostics GLM (Spring, 2018) Lecture 3 1 / 34 Models Let M(X r ) be a model with design matrix X r (with r columns) r n
More informationGraphical Procedures, SAS' PROC MIXED, and Tests of Repeated Measures Effects. H.J. Keselman University of Manitoba
1 Graphical Procedures, SAS' PROC MIXED, and Tests of Repeated Measures Effects by H.J. Keselman University of Manitoba James Algina University of Florida and Rhonda K. Kowalchuk University of Manitoba
More informationGroup Comparisons: Differences in Composition Versus Differences in Models and Effects
Group Comparisons: Differences in Composition Versus Differences in Models and Effects Richard Williams, University of Notre Dame, https://www3.nd.edu/~rwilliam/ Last revised February 15, 2015 Overview.
More informationApplication of Poisson and Negative Binomial Regression Models in Modelling Oil Spill Data in the Niger Delta
International Journal of Science and Engineering Investigations vol. 7, issue 77, June 2018 ISSN: 2251-8843 Application of Poisson and Negative Binomial Regression Models in Modelling Oil Spill Data in
More informationTreatment Variables INTUB duration of endotracheal intubation (hrs) VENTL duration of assisted ventilation (hrs) LOWO2 hours of exposure to 22 49% lev
Variable selection: Suppose for the i-th observational unit (case) you record ( failure Y i = 1 success and explanatory variabales Z 1i Z 2i Z ri Variable (or model) selection: subject matter theory and
More informationGLM models and OLS regression
GLM models and OLS regression Graeme Hutcheson, University of Manchester These lecture notes are based on material published in... Hutcheson, G. D. and Sofroniou, N. (1999). The Multivariate Social Scientist:
More informationARIC Manuscript Proposal # PC Reviewed: _9/_25_/06 Status: A Priority: _2 SC Reviewed: _9/_25_/06 Status: A Priority: _2
ARIC Manuscript Proposal # 1186 PC Reviewed: _9/_25_/06 Status: A Priority: _2 SC Reviewed: _9/_25_/06 Status: A Priority: _2 1.a. Full Title: Comparing Methods of Incorporating Spatial Correlation in
More informationComment on Article by Scutari
Bayesian Analysis (2013) 8, Number 3, pp. 543 548 Comment on Article by Scutari Hao Wang Scutari s paper studies properties of the distribution of graphs ppgq. This is an interesting angle because it differs
More informationSESUG 2011 ABSTRACT INTRODUCTION BACKGROUND ON LOGLINEAR SMOOTHING DESCRIPTION OF AN EXAMPLE. Paper CC-01
Paper CC-01 Smoothing Scaled Score Distributions from a Standardized Test using PROC GENMOD Jonathan Steinberg, Educational Testing Service, Princeton, NJ Tim Moses, Educational Testing Service, Princeton,
More informationComparison of Estimators in GLM with Binary Data
Journal of Modern Applied Statistical Methods Volume 13 Issue 2 Article 10 11-2014 Comparison of Estimators in GLM with Binary Data D. M. Sakate Shivaji University, Kolhapur, India, dms.stats@gmail.com
More informationH-LIKELIHOOD ESTIMATION METHOOD FOR VARYING CLUSTERED BINARY MIXED EFFECTS MODEL
H-LIKELIHOOD ESTIMATION METHOOD FOR VARYING CLUSTERED BINARY MIXED EFFECTS MODEL Intesar N. El-Saeiti Department of Statistics, Faculty of Science, University of Bengahzi-Libya. entesar.el-saeiti@uob.edu.ly
More informationSTA 303 H1S / 1002 HS Winter 2011 Test March 7, ab 1cde 2abcde 2fghij 3
STA 303 H1S / 1002 HS Winter 2011 Test March 7, 2011 LAST NAME: FIRST NAME: STUDENT NUMBER: ENROLLED IN: (circle one) STA 303 STA 1002 INSTRUCTIONS: Time: 90 minutes Aids allowed: calculator. Some formulae
More informationREVISED PAGE PROOFS. Logistic Regression. Basic Ideas. Fundamental Data Analysis. bsa350
bsa347 Logistic Regression Logistic regression is a method for predicting the outcomes of either-or trials. Either-or trials occur frequently in research. A person responds appropriately to a drug or does
More informationGenerating Half-normal Plot for Zero-inflated Binomial Regression
Paper SP05 Generating Half-normal Plot for Zero-inflated Binomial Regression Zhao Yang, Xuezheng Sun Department of Epidemiology & Biostatistics University of South Carolina, Columbia, SC 29208 SUMMARY
More informationKULLBACK-LEIBLER INFORMATION THEORY A BASIS FOR MODEL SELECTION AND INFERENCE
KULLBACK-LEIBLER INFORMATION THEORY A BASIS FOR MODEL SELECTION AND INFERENCE Kullback-Leibler Information or Distance f( x) gx ( ±) ) I( f, g) = ' f( x)log dx, If, ( g) is the "information" lost when
More information36-309/749 Experimental Design for Behavioral and Social Sciences. Dec 1, 2015 Lecture 11: Mixed Models (HLMs)
36-309/749 Experimental Design for Behavioral and Social Sciences Dec 1, 2015 Lecture 11: Mixed Models (HLMs) Independent Errors Assumption An error is the deviation of an individual observed outcome (DV)
More informationGeneralized Linear Models for Non-Normal Data
Generalized Linear Models for Non-Normal Data Today s Class: 3 parts of a generalized model Models for binary outcomes Complications for generalized multivariate or multilevel models SPLH 861: Lecture
More information1. Introduction Over the last three decades a number of model selection criteria have been proposed, including AIC (Akaike, 1973), AICC (Hurvich & Tsa
On the Use of Marginal Likelihood in Model Selection Peide Shi Department of Probability and Statistics Peking University, Beijing 100871 P. R. China Chih-Ling Tsai Graduate School of Management University
More informationIntroduction to Generalized Models
Introduction to Generalized Models Today s topics: The big picture of generalized models Review of maximum likelihood estimation Models for binary outcomes Models for proportion outcomes Models for categorical
More informationMohammed. Research in Pharmacoepidemiology National School of Pharmacy, University of Otago
Mohammed Research in Pharmacoepidemiology (RIPE) @ National School of Pharmacy, University of Otago What is zero inflation? Suppose you want to study hippos and the effect of habitat variables on their
More informationMultiple regression and inference in ecology and conservation biology: further comments on identifying important predictor variables
Biodiversity and Conservation 11: 1397 1401, 2002. 2002 Kluwer Academic Publishers. Printed in the Netherlands. Multiple regression and inference in ecology and conservation biology: further comments on
More informationEPSY 905: Fundamentals of Multivariate Modeling Online Lecture #7
Introduction to Generalized Univariate Models: Models for Binary Outcomes EPSY 905: Fundamentals of Multivariate Modeling Online Lecture #7 EPSY 905: Intro to Generalized In This Lecture A short review
More informationLogistic regression analysis. Birthe Lykke Thomsen H. Lundbeck A/S
Logistic regression analysis Birthe Lykke Thomsen H. Lundbeck A/S 1 Response with only two categories Example Odds ratio and risk ratio Quantitative explanatory variable More than one variable Logistic
More informationGeneralized Linear. Mixed Models. Methods and Applications. Modern Concepts, Walter W. Stroup. Texts in Statistical Science.
Texts in Statistical Science Generalized Linear Mixed Models Modern Concepts, Methods and Applications Walter W. Stroup CRC Press Taylor & Francis Croup Boca Raton London New York CRC Press is an imprint
More informationChapter 5. Introduction to Path Analysis. Overview. Correlation and causation. Specification of path models. Types of path models
Chapter 5 Introduction to Path Analysis Put simply, the basic dilemma in all sciences is that of how much to oversimplify reality. Overview H. M. Blalock Correlation and causation Specification of path
More informationMIXED MODELS FOR REPEATED (LONGITUDINAL) DATA PART 2 DAVID C. HOWELL 4/1/2010
MIXED MODELS FOR REPEATED (LONGITUDINAL) DATA PART 2 DAVID C. HOWELL 4/1/2010 Part 1 of this document can be found at http://www.uvm.edu/~dhowell/methods/supplements/mixed Models for Repeated Measures1.pdf
More informationApplied Multivariate Analysis
Department of Mathematics and Statistics, University of Vaasa, Finland Spring 2017 Dimension reduction Exploratory (EFA) Background While the motivation in PCA is to replace the original (correlated) variables
More informationSCHOOL OF MATHEMATICS AND STATISTICS. Linear and Generalised Linear Models
SCHOOL OF MATHEMATICS AND STATISTICS Linear and Generalised Linear Models Autumn Semester 2017 18 2 hours Attempt all the questions. The allocation of marks is shown in brackets. RESTRICTED OPEN BOOK EXAMINATION
More informationSAS/STAT 14.2 User s Guide. Introduction to Analysis of Variance Procedures
SAS/STAT 14.2 User s Guide Introduction to Analysis of Variance Procedures This document is an individual chapter from SAS/STAT 14.2 User s Guide. The correct bibliographic citation for this manual is
More informationHolzmann, Min, Czado: Validating linear restrictions in linear regression models with general error structure
Holzmann, Min, Czado: Validating linear restrictions in linear regression models with general error structure Sonderforschungsbereich 386, Paper 478 (2006) Online unter: http://epub.ub.uni-muenchen.de/
More informationModels for Binary Outcomes
Models for Binary Outcomes Introduction The simple or binary response (for example, success or failure) analysis models the relationship between a binary response variable and one or more explanatory variables.
More informationStatistical Distribution Assumptions of General Linear Models
Statistical Distribution Assumptions of General Linear Models Applied Multilevel Models for Cross Sectional Data Lecture 4 ICPSR Summer Workshop University of Colorado Boulder Lecture 4: Statistical Distributions
More informationDr. Junchao Xia Center of Biophysics and Computational Biology. Fall /1/2016 1/46
BIO5312 Biostatistics Lecture 10:Regression and Correlation Methods Dr. Junchao Xia Center of Biophysics and Computational Biology Fall 2016 11/1/2016 1/46 Outline In this lecture, we will discuss topics
More informationGeneralized linear models
Generalized linear models Douglas Bates November 01, 2010 Contents 1 Definition 1 2 Links 2 3 Estimating parameters 5 4 Example 6 5 Model building 8 6 Conclusions 8 7 Summary 9 1 Generalized Linear Models
More informationCOMPLEMENTARY LOG-LOG MODEL
COMPLEMENTARY LOG-LOG MODEL Under the assumption of binary response, there are two alternatives to logit model: probit model and complementary-log-log model. They all follow the same form π ( x) =Φ ( α
More informationDetermining Sufficient Number of Imputations Using Variance of Imputation Variances: Data from 2012 NAMCS Physician Workflow Mail Survey *
Applied Mathematics, 2014,, 3421-3430 Published Online December 2014 in SciRes. http://www.scirp.org/journal/am http://dx.doi.org/10.4236/am.2014.21319 Determining Sufficient Number of Imputations Using
More informationOutline of GLMs. Definitions
Outline of GLMs Definitions This is a short outline of GLM details, adapted from the book Nonparametric Regression and Generalized Linear Models, by Green and Silverman. The responses Y i have density
More information