Paper SD174

PROC LOGISTIC: Traps for the unwary

Peter L. Flom, Independent statistical consultant, New York, NY

ABSTRACT

Keywords: Logistic

INTRODUCTION

This paper covers some gotchas in SAS(R) PROC LOGISTIC. A gotcha is a mistake that isn't obviously a mistake: the program runs, there may be a note or a warning, but there are no errors. Output appears. But it is the wrong output. This is not a primer on PROC LOGISTIC, much less on logistic regression. There are many good books on logistic regression; one such is [?].

Each section has several subsections. First, I identify the gotcha. Then I give an example. Third, I show what evidence you have that it occurred. Fourth, I show how to fix it, in some cases referring to other resources. In some cases, I offer an explanation between the evidence and the solution.

GOTCHA #1: CODING 0 AND 1

PROC LOGISTIC can be used to run logistic regression on a dichotomous dependent variable. Often, these are coded 0 and 1, with 0 for "no" or the equivalent and 1 for "yes" or the equivalent. In this case, we are usually interested in modeling the probability of a "yes". However, by default, SAS models the probability of a 0 (which would be a "no").

EXAMPLE

For example, we might be interested in modeling the presence of a disease, with 0 meaning the person is not infected and 1 meaning he or she is infected. To keep it simple, I will use one independent variable: sex, coded as 1 for female and 0 for male. So:

   data today;
      input disease female weight;
      datalines;
   0 0 100
   1 0 200
   0 1 200
   1 1 100
   ;

We then run PROC LOGISTIC:

   proc logistic data=today;
      model disease = female;
      weight weight;
   run;

and get, among other output, a positive parameter estimate of 1.39 for FEMALE (an odds ratio of about 4), while it is clear from the data that men are much more likely to be infected.

EVIDENCE

The evidence that this is happening is one line in the output:

   Probability modeled is disease=0.

and a note in the log:

   NOTE: PROC LOGISTIC is modeling the probability that disease=0. One way to
         change this to model the probability that disease=1 is to specify the
         response variable option EVENT='1'.

SOLUTION

There are several solutions. The simplest is not the one mentioned in the log, but rather the DESCENDING option:

   proc logistic data=today descending;
      model disease = female;
      weight weight;
   run;

Another method is the one mentioned in the log, which is more general:

   proc logistic data=today;
      model disease(event='1') = female;
      weight weight;
   run;
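Whichever fix you use, it is easy to confirm which level is being modeled: the Response Profile table and the "Probability modeled" line in the output report it directly. A minimal sketch, assuming the TODAY data set above, captures the Response Profile table with ODS OUTPUT (table name RESPONSEPROFILE) so it can be printed or checked programmatically:

   * Minimal check: capture the Response Profile table to confirm which
     level of DISEASE is being treated as the event. ;
   ods output ResponseProfile=rp;

   proc logistic data=today;
      model disease(event='1') = female;
      weight weight;
   run;

   proc print data=rp;
   run;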

GOTCHA #2: EFFECT CODING AND CLASS VARIABLES

When you have a categorical independent variable with more than 2 levels, you need to define it with a CLASS statement. In PROC GLM the default coding for this is dummy coding. In PROC LOGISTIC, it is effect coding. To me, effect coding is quite unnatural.

EXAMPLE

Continuing with the same example of modeling the probability of infection, suppose you now have race/ethnicity as an IV, with 6 categories, as defined by the Census Bureau: White, Black/African American, Hispanic/Latino, American Indian/Alaskan Native, Native Hawaiian or other Pacific Islander, and Asian.

   proc logistic data=today2;
      class race;
      model disease(event='1') = race;
   run;

We get parameter estimates that include (among much else):

   Parameter         DF    Estimate
   Intercept          1     -0.8527
   race      AIAN     1     -0.0636
   race      AfrA     1     -0.7568
   race      Asian    1      0.1595
   race      Lat      1      0.5650
   race      NHPI     1      0.1595

and odds ratio estimates:

                              Point
   Effect                     Estimate
   race AIAN  vs White          1.000
   race AfrA  vs White          0.500
   race Asian vs White          1.250
   race Lat   vs White          1.875
   race NHPI  vs White          1.250

But we know that the odds ratio estimate should be exp(parameter estimate), and, for example, exp(-0.0636) = 0.94, not 1.000.

EXPLANATION: THE DESIGN MATRIX

With the default (effect coding), the design matrix looks like this:

   race
   AIAN    1  0  0  0  0
   AfrA    0  1  0  0  0
   Asian   0  0  1  0  0
   Lat     0  0  0  1  0
   NHPI    0  0  0  0  1
   White  -1 -1 -1 -1 -1

and each parameter estimates the difference between that level and the average over all the levels. On the other hand, with dummy (or reference) coding, it looks like this:

   race
   AIAN    1  0  0  0  0
   AfrA    0  1  0  0  0
   Asian   0  0  1  0  0
   Lat     0  0  0  1  0
   NHPI    0  0  0  0  1
   White   0  0  0  0  0

and each parameter estimates the difference between that level and the reference group.

SOLUTION

Use the PARAM=REF option on the CLASS statement:

   proc logistic data=today2;
      class race / param=ref;
      model disease(event='1') = race;
   run;

For more information (and other possible parameterizations) see the SAS documentation for PROC LOGISTIC, in particular the section "CLASS Variable Parameterization" under DETAILS.
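By default, PARAM=REF takes the last level (here, White) as the reference. If you want to be explicit, or to use a different reference level, you can name it on the CLASS statement. A minimal sketch:

   * Sketch: make White the explicit reference level. ;
   proc logistic data=today2;
      class race (ref='White') / param=ref;
      model disease(event='1') = race;
   run;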

GOTCHA #3: QUASI-COMPLETE AND COMPLETE SEPARATION

If you picture the data as a 2x2 crosstab, then quasi-complete separation occurs when one of the cells is 0. Complete separation occurs when one cell in each row and column is 0. When separation occurs, the maximum likelihood estimate for the affected parameter does not exist (it is infinite), but PROC LOGISTIC still produces output.

EXAMPLE: QUASI-COMPLETE SEPARATION

An example of quasi-complete separation is:

   data today7a;
      input x $ y $ @@;
      datalines;
   A C A C A C A C A C
   B C B C B C B C B C
   B C B C B C B C B C
   B D B D B D B D B D
   ;

EVIDENCE

It varies. Confidence intervals will be extremely wide. Sometimes there is a warning that there was complete or quasi-complete separation, sometimes only a note. Sometimes you get a warning that convergence was not attained. With a WEIGHT statement you sometimes get a note that observations with nonpositive frequencies or weights were excluded. For example, if you run the data step creating TODAY7A above and then

   proc logistic data=today7a;
      class x;
      model y = x;
   run;

you get a warning in the log. But if you run this data step:

   data today7;
      input x $ y $ weight @@;
      datalines;
   A C 5   A D 0   B C 10   B D 5
   ;

and then

   proc logistic data=today7;
      class x;
      model y = x;
      weight weight;
   run;

you get only a note in the log.

SOLUTION

Sometimes you can delete the offending variable. In the example there was only one independent variable, but usually there will be more, and only one of them will be problematic. Alternatively, if the variable has multiple categories, it may be sensible to combine some of them. A third possibility is to leave the offending variable in the equation and simply report the results for the other variables; these are still correct. You can then report the coefficients for the offending variable as infinite. Finally, you can use exact inference with the EXACT statement.

REFERENCES

Paul Allison's paper at SGF 2008 has a great deal of lucid explanation of this problem [?], although his statement that PROC LOGISTIC gives clear diagnostic messages is, as we have seen, not always true.
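As a minimal sketch of the last option, the EXACT statement can be added for the separated data; ESTIMATE=BOTH requests exact parameter and odds ratio estimates:

   * Sketch: exact inference for the effect of X in the separated data. ;
   proc logistic data=today7a;
      class x / param=ref;
      model y = x;
      exact x / estimate=both;
   run;

Exact methods can be computationally expensive for larger data sets.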

GOTCHA #4: CONCORDANT AND DISCORDANT

Part of the default output from PROC LOGISTIC is a table with entries including percent concordant and percent discordant. To me, this implies the percent of observations that would be correctly assigned, based on the results of the logistic regression. But that is not what it is. It looks at all possible pairs of observations with different observed responses. A pair is concordant if the observation with the larger value of X also has the larger value of Y, and discordant if the observation with the larger value of X has the smaller value of Y; here, X is the predicted value and Y is the actual value.

EXAMPLE

For our first example, the output looks like this:

   Association of Predicted Probabilities and Observed Responses

   Percent Concordant     25.0    Somers' D    0.000
   Percent Discordant     25.0    Gamma        0.000
   Percent Tied           50.0    Tau-a        0.000
   Pairs                     4    c            0.500

It is hard to find documentation of this. I could not find it explained in the LOGISTIC documentation. I found a mention of concordant and discordant in the FREQ documentation, but it was not clear what X and Y were until I searched SAS-L and found an explanation from David Cassell.

SOLUTION

For what I was thinking of, you need the CTABLE option on the MODEL statement, which gives the proportion correctly classified, the sensitivity, the specificity, and other measures at each of a number of cutpoints of the predicted probability. By default, it gives probability levels from 0 to 1 at intervals of 0.02, but if you just want a few, you can request them:

   proc logistic data=today3;
      class race sex / param=ref;
      model disease(event='1') = race sex / ctable pprob=(.25 .5 .75);
   run;

which yields

                        Correct        Incorrect              Percentages
    Prob           Non-            Non-           Sensi-  Speci-  False   False
   Level   Event  Event   Event   Event  Correct  tivity  ficity   POS     NEG
   0.250       1      0      12      11      4.2     8.3     0.0   92.3   100.0
   0.500       0      6       6      12     25.0     0.0    50.0  100.0    66.7
   0.750       0     11       1      12     45.8     0.0    91.7  100.0    52.2
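If what you want is a classification table at a single cutoff of your own choosing, you can also save the predicted probabilities with the OUTPUT statement and cross-tabulate them yourself. A minimal sketch, using an illustrative cutoff of 0.5 (note that CTABLE applies an approximate leave-one-out bias correction, so its counts can differ slightly from a table built this way):

   * Sketch: predicted probabilities and a simple classification table
     at an illustrative cutoff of 0.5. ;
   proc logistic data=today3;
      class race sex / param=ref;
      model disease(event='1') = race sex;
      output out=preds p=phat;
   run;

   data preds;
      set preds;
      classified = (phat >= 0.5);
   run;

   proc freq data=preds;
      tables disease*classified / norow nocol nopercent;
   run;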

GOTCHA #5: INTERACTIONS

When you add interactions to a logistic model, no odds ratios are printed.

EXAMPLE

   data today3;
      input disease race $ sex $ weight @@;
      datalines;
   0 White F 500   0 White M 400   1 White F 200   1 White M 250
   0 AfrA  F 100   0 AfrA  M 125   1 AfrA  F  20   1 AfrA  M  28
   0 Lat   F 100   0 Lat   M  90   1 Lat   F  75   1 Lat   M  30
   0 AIAN  F  25   0 AIAN  M  15   1 AIAN  F  10   1 AIAN  M   8
   0 Asian F 100   0 Asian M 110   1 Asian F  50   1 Asian M  40
   0 NHPI  F  10   0 NHPI  M   8   1 NHPI  F   5   1 NHPI  M   3
   ;
   run;

and then

   proc logistic data=today3;
      class race sex / param=ref;
      model disease(event='1') = race sex race*sex;
   run;

There are no odds ratios printed.

EXPLANATION

The reason no odds ratios are printed is a bit complex. Let's take a simpler case, with two independent variables, each with two levels, and, for simplicity, equal cell sizes. If the dependent variable is continuous, then we might get something like this (with DV means in each cell):

             Male   Female   Average
   White       40       60        50
   Black       90       50        70
   Average     65       55        60

Now, we can see that White males are lower than White females, but Black males are higher than Black females. The average effect of being male is to add 5, and the average effect of being White is to subtract 10. If there were no interaction, the main effects would give:

            Male                  Female
   White    60 + 5 - 10 = 55      60 - 5 - 10 = 45
   Black    60 + 5 + 10 = 75      60 - 5 + 10 = 65

But if we put proportions in the table and estimate odds ratios, things are trickier. Start with proportions:

             Male   Female   Average
   White      .4      .6       .5
   Black      .9      .5       .7
   Average    .65     .55      .6

and it still works just like means. But change to odds:

             Male       Female     Average
   White     2 to 3     3 to 2     1 to 1
   Black     9 to 1     1 to 1     7 to 3
   Average   13 to 7    11 to 9    3 to 2

or, dividing through:

             Male   Female   Average
   White     0.67    1.50      1.00
   Black     9.00    1.00      2.33
   Average   1.86    1.22      1.50

Notice that the average odds is not the average of the odds: e.g., 1.86, not (0.67 + 9)/2 = 4.83. What would the main effects model predict?
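To make this concrete, take the sex odds ratio within each row of the hypothetical proportions above:

   OR(female vs male | White) = (0.6/0.4) / (0.4/0.6) = 1.50 / 0.67 = 2.25
   OR(female vs male | Black) = (0.5/0.5) / (0.9/0.1) = 1.00 / 9.00 = 0.11

The odds ratio for sex depends on race, so there is no single number for PROC LOGISTIC to print.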

SOLUTION

For categorical variables there is now a simple solution, the ODDSRATIO statement:

   proc logistic data=today3;
      class race sex / param=ref;
      model disease(event='1') = race sex race*sex;
      oddsratio sex;
   run;

which yields, among other things:

                              Odds Ratio Estimates

                                       Point           95% Wald
   Odds Ratio                         Estimate     Confidence Limits
   sex F vs M at race = AIAN            0.75        0.24       2.32
   sex F vs M at race = AfrA            0.89        0.47       1.68
   sex F vs M at race = Asian           1.37        0.84       2.26
   sex F vs M at race = Lat             2.25        1.35       3.75
   sex F vs M at race = NHPI            1.33        0.24       7.35
   sex F vs M at race = White           0.64        0.51       0.80

For continuous variables, we need to add the AT option to the ODDSRATIO statement.
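For example, if the model included a continuous covariate interacting with SEX, the odds ratios for SEX could be requested at chosen values of that covariate. A hypothetical sketch (the TODAY4 data set and the AGE variable are invented for illustration):

   * Hypothetical sketch: TODAY4 and AGE are invented for illustration. ;
   proc logistic data=today4;
      class race sex / param=ref;
      model disease(event='1') = race sex age sex*age;
      oddsratio sex / at(age = 30 50 70);
   run;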

SUMMARY

PROC LOGISTIC offers many traps for the unwary. In addition to the above, one needs to be careful about Wald vs. likelihood ratio tests (SAS defaults to likelihood ratio tests for the confidence intervals and Wald tests for the parameter estimates). For more on these, see [?]. In addition to the gotchas presented here, there are the usual issues of model building, variable selection, model interpretation, and so on. For more on these issues see [?], [?], [?], and [?].

REFERENCES

ACKNOWLEDGMENTS

CONTACT INFORMATION

Peter L. Flom
515 West End Ave
New York, NY 10024
Phone: (917) 488-7176
peterflomconsulting@mindspring.com
Personal webpage: www.peterflom.com

SAS(R) and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. (R) indicates USA registration. Other brand names and product names are registered trademarks or trademarks of their respective companies.