SESUG Paper 140-2017

Backward Variable Selection for Logistic Regression Based on Percentage Change in Odds Ratio

Evan Kwiatkowski, University of North Carolina at Chapel Hill; Hannah Crooke, PAREXEL International and University of North Carolina at Charlotte; Kathy Roggenkamp, University of North Carolina at Chapel Hill

ABSTRACT

Variable selection is a fundamental component of statistical modeling. A common variable selection method used in health sciences is backward variable selection, which iteratively removes variables based on their relevance to the model. Often, automated backward variable selection procedures determine variable relevance based on overall statistical significance. However, many epidemiologists, including formative thinkers Greenland and Robins, favor a "change-in-estimate" approach to variable selection rather than an overall significance approach.

We developed a SAS software macro to implement a backward variable selection procedure for logistic regression using the "change-in-estimate" method. Our macro implements backward variable selection in the logistic regression model in the situation where there is a single independent variable (IV) and a single dependent variable (DV) of interest, with additional covariates that are eligible for removal based on their relevance to the model. This relevance is based on the percentage change in the odds ratio between the IV and DV when a full model including all additional covariates is compared with a reduced model that removes a single covariate at a time.

This macro provides epidemiologists and other health science professionals with a theoretically sound option for automated backward variable selection in logistic regression, and is an extension of the backward variable selection options provided in the LOGISTIC procedure. The macro is easily applied to any dataset: the user specifies the IV, the DV, the additional covariates, and the threshold of difference in odds ratio used for removal of additional covariates.

INTRODUCTION

Variable selection, or identification of confounders, is a fundamental component of statistical modeling in epidemiology. A number of variable selection procedures have been suggested, such as forward and backward selection, which are both step-wise methods [1,2]. Frequently, automated regression procedures employ a step-wise approach based on overall statistical significance for inclusion of covariates, using the p-value as the metric [3]. However, it is commonly agreed that a step-wise approach relying on change-in-estimate for covariate inclusion is a superior method for maximizing the relevance of covariates included in the model [3,4].

When using logistic regression to model the effect of an exposure of interest (IV) on a binary outcome variable (DV), the change-in-estimate procedure examines the percentage change in the adjusted odds ratio (aOR) for the association between the IV and DV upon removal of a particular covariate [1-6]. The standard convention is that a change in the aOR of 10% or more suggests the covariate is important to the model and should be left in, though newer research suggests that a 5% change may be a sufficient cut-off depending on the size of the exposure-outcome relationship [5].

While a variable selection procedure based on percentage change in odds ratio has many statistical and epidemiologic advantages, implementation is computationally intensive. For instance, if there is a statistical model with a single IV, a single DV, and 10 additional covariates eligible for removal, as many as 55 separate reduced models (10 + 9 + ... + 1) must be fit, in addition to the full model at each iteration.

We present a macro that automates the covariate selection process using backward variable selection based on change-in-estimate. The macro enables the evaluation of an arbitrary number of additional covariates with a user-specified threshold for inclusion in the model based on change in aOR upon removal.
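The selection rule and the model count can be written compactly. The notation is introduced here for illustration only and is not the authors' own: $\Delta_j$ is the relative change in the aOR when covariate $j$ is left out, $\tau$ is the user-specified threshold, and $p$ is the number of covariates eligible for removal.

```latex
% Relative change in the adjusted odds ratio when covariate j is removed
\[
\Delta_j \;=\;
\frac{\bigl|\,\mathrm{aOR}_{\mathrm{reduced}}^{(-j)} - \mathrm{aOR}_{\mathrm{full}}\,\bigr|}
     {\mathrm{aOR}_{\mathrm{full}}}
\]

% Backward step: drop the least influential covariate only if it falls below the threshold
\[
j^{*} = \arg\min_{j} \Delta_j ,
\qquad
\text{remove covariate } j^{*} \text{ if } \Delta_{j^{*}} < \tau .
\]

% Worst-case number of reduced models when every covariate is eventually removed
\[
\sum_{k=1}^{p} k \;=\; \frac{p(p+1)}{2},
\qquad
p = 10 \;\Rightarrow\; 55 .
\]
```

With $\tau = 0.10$ this corresponds to the conventional 10% cut-off, and with $\tau = 0.05$ to the stricter 5% cut-off mentioned above.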

MACRO BACKWARD_OR_ELIM

Macro backward_or_elim implements the change-in-estimate procedure and produces output which thoroughly details every iteration, including all full and reduced models.

Macro backward_or_elim procedure

STEP 0: Set macro arguments: IV, DV, additional covariates, and threshold for inclusion in the model based on change in aOR upon removal.

STEP 1: Run logistic regression with the full set of additional covariates. Compute the odds ratio between IV and DV.

STEP 2: Run logistic regression with a reduced set of additional covariates, running a separate model for the full set minus one covariate at a time in a leave-one-out manner. Compute the odds ratio between IV and DV in each of these reduced models.

STEP 3: Identify the additional covariate that has the lowest effect on the odds ratio between IV and DV upon removal from the set of additional covariates. If this impact is less than the user-defined threshold, then delete this covariate from the additional covariate set and return to STEP 1. Otherwise, proceed to STEP 4.

STEP 4: End; display final iteration table.

EXAMPLE

The ICU dataset is used to demonstrate macro backward_or_elim [7,8].

Name    Description                          Codes/Values
STA     Vital Status                         0 = Lived, 1 = Died
INF     Infection Probable at ICU Admission  0 = No, 1 = Yes
GENDER  Gender                               0 = Male, 1 = Female
CAN     Cancer Part of Present Problem       0 = No, 1 = Yes
CPR     CPR Prior to ICU Admission           0 = No, 1 = Yes

Figure 1: Variables used in ICU dataset

This example is for illustrative purposes only and must not be interpreted to have any scientific relevance. The macro is invoked using:

    %backward_or_elim(iv=inf, DV=STA, covariates=gender CAN CPR, threshold=0.05, dataset=icu_data);

Backwards elimination procedure for independent variable INF, dependent variable STA, and additional covariates at threshold 0.05

Iteration  Full Model aOR  Reduced Variable  Reduced Model aOR  Change in aOR
1          2.241           CAN               2.236              0.209%
                           CPR               2.502              11.64%
                           GENDER            2.242              0.025%
2          2.242           CAN               2.236              0.246%
                           CPR               2.506              11.76%
3          2.236           CPR               2.500              11.81%

Output 1: Output from backward_or_elim macro using ICU dataset [7,8]

ITERATION 1

Iteration 1 corresponds to the model with dependent variable STA, independent variable INF, and three additional covariates (CAN, CPR, GENDER). The full model aOR is the odds ratio between IV and DV adjusted for these three additional covariates, and is equal to 2.241. Note that the IV and DV are fixed for every model in this procedure: both remain in the model regardless of which additional covariates are included.

In iteration 1 there are three reduced models, which are indexed by the additional covariate that is removed. The model with removed variable CAN includes the two additional covariates (CPR, GENDER); the model with removed variable CPR includes the two additional covariates (CAN, GENDER); the model with removed variable GENDER includes the two additional covariates (CAN, CPR). For each of these reduced models, the odds ratio between IV and DV is computed while adjusting for one less additional covariate.

The reduced models with a change in aOR less than the threshold of 5% are shown in bold type in the macro output, and the model with the lowest change in aOR among models eligible for removal is highlighted. The model corresponding to GENDER, which adjusts for (CAN, CPR), has an aOR that is only 0.025% different from the full model aOR; therefore GENDER is removed from the set of additional covariates.

ITERATION 2

Iteration 2 begins with the updated full set of additional covariates (CAN, CPR). Note that the full model aOR in iteration 2 is 2.242, which is the same as the reduced model aOR in iteration 1 for the reduced model which excludes GENDER, since in both models the odds ratio is adjusted for (CAN, CPR).

In iteration 2 there are two reduced models: the model with removed variable CAN, which includes the single additional covariate CPR, and the model with removed variable CPR, which includes the single additional covariate CAN. The model corresponding to CAN, which adjusts only for CPR, has an aOR that is only 0.246% different from the full model aOR; therefore CAN is removed from the set of additional covariates.

ITERATION 3

In iteration 3 the set of additional covariates is only CPR. The reduced model aOR corresponds to a model with no additional covariates, and the odds ratio between IV and DV is 11.81% different from the full model aOR. Therefore, the additional covariate CPR is not removed and the procedure ends.

Note that the initial model considered included IV, DV, and the three initial covariates (CAN, CPR, GENDER), while the final model includes IV, DV, and only the additional covariate CPR.
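A single leave-one-out comparison from iteration 1 can be checked by hand. The sketch below is illustrative rather than part of the authors' macro: it assumes the ICU data sit in a data set named icu_data (as in the macro call), and the data set names or_full and or_no_cpr and the column aliases are hypothetical. It fits the full model and the model without CPR, then computes the relative change in the aOR for INF.

```sas
/* Full model: INF adjusted for all three additional covariates */
ods output OddsRatios=or_full;
proc logistic data=icu_data descending;
  model STA = INF CAN CPR GENDER;
run;

/* Reduced model: same as above with CPR left out */
ods output OddsRatios=or_no_cpr;
proc logistic data=icu_data descending;
  model STA = INF CAN GENDER;
run;

/* Relative change in the aOR for INF between the two fits */
proc sql;
  select f.OddsRatioEst as full_aor,
         r.OddsRatioEst as reduced_aor,
         abs(r.OddsRatioEst - f.OddsRatioEst) / f.OddsRatioEst
           as pct_change format=percent8.3
    from or_full as f, or_no_cpr as r
    where upcase(f.Effect) = "INF" and upcase(r.Effect) = "INF";
quit;
```

If the data match those used in the paper, the result should correspond to the CPR row of iteration 1 in Output 1. Repeating the second fit with CAN or GENDER left out instead gives the other two rows.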

SOURCE CODE

%macro backward_or_elim(iv=, dv=, covariates=, threshold=, dataset=);

  /* Initialize: upper-case the space-delimited covariate list and count its members */
  %local cov_list ncov iteration done minimum removedvar removedidx fullor i j;
  %let cov_list=%upcase(&covariates);
  %let ncov=%sysfunc(countw(&cov_list,%str( )));
  %let iteration=1;
  %let done=0;

  /* Clear leftovers from any earlier, interrupted run */
  proc datasets nolist nowarn;
    delete OddsData_Merged OddsData_Final;
  quit;

  ods exclude all;

  /* Stop when no covariate falls below the threshold or none are left */
  %do %while(&done = 0 and &ncov > 0);

    /*** Step 1: run full model ***/
    ods output OddsRatios=OddsData_Full;
    proc logistic data=&dataset descending;
      class &dv / param=ref;
      model &dv = &iv &cov_list;
    run;

    proc sql noprint;
      select OddsRatioEst format=best32. into :fullor trimmed
        from OddsData_Full
        where upcase(Effect)="%upcase(&iv)";
    quit;

    /*** Step 2: run reduced models ***/
    %do i = 1 %to &ncov;
      /* Covariate list without its i-th word (whole words only, so CAN never clips CANCER) */
      %local reduced&i;
      %let reduced&i=;
      %do j = 1 %to &ncov;
        %if &j ne &i %then %let reduced&i=&&reduced&i %scan(&cov_list,&j,%str( ));
      %end;

      ods output OddsRatios=OddsData_Reduced;
      proc logistic data=&dataset descending;
        class &dv / param=ref;
        model &dv = &iv &&reduced&i;
      run;

      /* Keep only the IV row and record which covariate was left out */
      data OddsData_Reduced;
        length Effect $32 Removed $32;
        set OddsData_Reduced;
        removed="%scan(&cov_list,&i,%str( ))";
        idx=&i;
        if upcase(Effect) ne "%upcase(&iv)" then delete;
      run;

      proc append data=OddsData_Reduced base=OddsData_Merged;
      run;
    %end;

    /*** Step 3: compute effect of deleting one variable on OR ***/
    data OddsData_Merged;
      set OddsData_Merged;
      delta=abs((OddsRatioEst-&fullor)/&fullor);
      iteration=&iteration;
      oddsratio=&fullor;
    run;

    /* Covariate whose removal changes the OR least (first row wins on ties) */
    proc sql noprint;
      select delta format=best32., removed, idx
        into :minimum trimmed, :removedvar trimmed, :removedidx trimmed
        from OddsData_Merged
        having delta=min(delta);
    quit;

    data OddsData_Merged;
      length elim $32;
      set OddsData_Merged;
      elim="&removedvar";
    run;

    proc append data=OddsData_Merged base=OddsData_Final;
    run;

    proc datasets nolist;
      delete OddsData_Merged;
    quit;

    /*** Remove &removedvar, the variable with lowest effect on OR ***/
    /* %sysevalf compares decimals numerically; %eval would compare them as text */
    %if %sysevalf(&minimum < &threshold) %then %do;
      %let cov_list=&&reduced&removedidx;
      %let ncov=%eval(&ncov-1);
      %let iteration=%eval(&iteration+1);
    %end;
    %else %let done=1;
  %end;

  ods exclude none;

  data final;
    set OddsData_Final;
  run;

  proc datasets nolist nowarn;
    delete OddsData_Full OddsData_Reduced OddsData_Final;
  quit;

  /*** Step 4: display final iteration table ***/
  ods rtf;
  title "Backwards elimination procedure for independent variable &iv, dependent variable &dv, and additional covariates at threshold &threshold";

  proc report data=Final nowd;
    columns iteration oddsratio removed oddsratioest delta elim;
    define iteration    / group 'Iteration';
    define oddsratio    / group format=8.3 'Full Model aOR';
    define removed      / group 'Reduced Variable';
    define oddsratioest / 'Reduced Model aOR';
    define delta        / format=percent8.3 group 'Change in aOR';
    define elim         / group noprint;
    break after iteration / ;
    compute after iteration;
      line '';
    endcomp;
    /* Bold: change in aOR below the threshold */
    compute delta;
      if (delta<&threshold) then do;
        call define(_col_,"STYLE","STYLE=[FONT_WEIGHT=BOLD]");
      end;
    endcomp;
    /* Shaded row: the covariate removed in this iteration */
    compute elim;
      if (elim = removed and delta<&threshold) then do;
        call define(_row_,"STYLE","STYLE=[BACKGROUND=cxdddddd]");
      end;
    endcomp;
  run;

  ods rtf close;
  title;

%mend backward_or_elim;
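The macro reports the selection path but does not refit the retained model for reporting. As a minimal sketch, assuming the ICU example and the icu_data data set name from the macro call, the model retained at the end of the example (IV INF plus the one covariate that was kept, CPR) could be refit directly:

```sas
/* Refit the model retained by the example run: INF adjusted for CPR only */
proc logistic data=icu_data descending;
  model STA = INF CPR;
run;
```

The odds ratio reported for INF in this fit is the quantity shown as the full model aOR in iteration 3 of Output 1.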

CONCLUSION

This flexible macro implements a variable selection technique which is of substantial epidemiologic interest. The current implementation is limited to the case of logistic regression with binary IV, DV, and additional covariates. This macro has already been extended to the cases of:

- categorical IV and DV, and categorical or continuous additional covariates
- adding additional covariates that are not eligible for removal, therefore creating models with an IV, DV, additional covariates eligible for removal, and additional non-removable covariates
- using additional options within PROC LOGISTIC (such as WEIGHT)
- using this framework for any generalized linear regression method in the GLM procedure

These extensions are available by request from the author.

REFERENCES

1. Lee P.H. 2014. Is a Cutoff of 10% Appropriate for the Change-in-Estimate Criterion of Confounder Identification? American Journal of Epidemiology, 24(2):161-167.
2. McNamee R. 2005. Regression modelling and other methods to control confounding. Occupational and Environmental Medicine, 62(7):500-506.
3. Greenland S. 1989. Modeling and Variable Selection in Epidemiologic Analysis. American Journal of Public Health, 79(3):340-349.
4. Walter S., Tiemeier H. 2009. Variable Selection: Current Practice in Epidemiological Studies. European Journal of Epidemiology, 24(12):733-736.
5. Robins J.M., Mark S.D., Newey W.K. 1992. Estimating Exposure Effects by Modelling the Expectation of Exposure Conditional on Confounders. Biometrics, 48(2):479-495.
6. Greenland S., Daniel R., Pearce N. 2016. Outcome Modeling Strategies in Epidemiology: Traditional Methods and Basic Alternatives. International Journal of Epidemiology, 45(2):565-575.
7. Lemeshow S., Teres D., Avrunin J.S., Pastides H. 1988. Predicting the Outcome of Intensive Care Unit Patients. Journal of the American Statistical Association, 83(402):348-356.
8. Hosmer D.W., Lemeshow S., Sturdivant R.X. 2013. Applied Logistic Regression. 3rd ed. Hoboken, NJ: John Wiley & Sons.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the authors at:

Evan Kwiatkowski
University of North Carolina at Chapel Hill
ekwiatkowski@unc.edu

Hannah Crooke
PAREXEL International and University of North Carolina at Charlotte
hannah.crooke@parexel.com

Kathy Roggenkamp
University of North Carolina at Chapel Hill
kathy_roggenkamp@unc.edu