
Model Selection in GLMs

Last class: estimability/identifiability, analysis of deviance, standard errors & confidence intervals (should be able to implement frequentist GLM analyses!)

Today: standard frequentist methods for model selection

Model Uncertainty

In most data analyses, there is uncertainty about the model & you need to do some form of model comparison:

1. Select q out of p predictors to form a parsimonious model
2. Select the link function (e.g., logit or probit)
3. Select the distributional form (e.g., normal or t-distributed)

We focus first on problem 1 - variable selection.

Frequentist Strategies for Variable Selection

It is standard practice to sequentially add or drop variables from a model one at a time and examine the change in model fit.

If model fit does not improve much (or significantly at some arbitrary level) when adding a predictor, then that predictor is left out of the model (forward selection). Similarly, if model fit does not decrease much when removing a predictor, then that predictor is removed (backwards elimination).

Implementation: the step() function in S-PLUS.
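As a point of reference, here is a minimal sketch of how the two search directions can be invoked in R with step(); the data frame dat and response y are hypothetical placeholders, and the S-PLUS step() function behaves similarly but its arguments differ slightly.

# backward elimination: start from the full model and drop one term at a time
fit.full <- glm(y ~ ., family = binomial, data = dat)
fit.back <- step(fit.full, direction = "backward", trace = 0)

# forward selection: start from the intercept-only model and add one term at a time
fit.null <- glm(y ~ 1, family = binomial, data = dat)
fit.fwd  <- step(fit.null, scope = formula(fit.full), direction = "forward", trace = 0)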

Illustration

Suppose we have 30 candidate predictors, X_1, ..., X_30. The true model is

logit Pr(y_i = 1 | x_i) = x_i' β,

where x_i = (x_{i1}, ..., x_{i,30})', i = 1, ..., n = 1000.

Note that predictors with β_j = 0 are effectively not in the model. Let γ_j = 1(β_j ≠ 0) be a 0/1 indicator that the jth predictor is included.

Interest focuses on selection of the subset of q ≤ p important predictors out of the p candidate predictors:

X_γ = {X_j : γ_j = 1, β_j ≠ 0}   (of size q).

There is a list of 2^p possible models corresponding to the different subsets X_γ ⊆ X. The size of the model list, 2^p, follows by noting that each of the p candidate predictors can either be included or excluded.

Letting \mathcal{M} denote the model list, we can introduce a model index M ∈ \mathcal{M}.

The trick is how to decide between competing models M ∈ \mathcal{M} and M' ∈ \mathcal{M}.

An important issue is also efficient searching of the (potentially huge!) model space \mathcal{M} for good models. Often it is not possible to visit and perform calculations for all models in \mathcal{M} (2^30 = 1,073,741,824).

Returning to the logistic regression example with 30 candidate predictors, we simulate data & run stepwise selection.

Simulation conditions: n = 1000, q = 5, p = 30, x_i ~ N_p(0, I_p); X_1-X_5 have coefficients β_γ = (1, 2, 3, 4, 5)/2, and X_6-X_30 have coefficients 0.

S-PLUS implementation:

n <- 1000                             # sample size
p <- 30                               # number of candidate predictors
X <- matrix(0, n, p)                  # simulate predictor values
for(j in 1:p) X[,j] <- rnorm(n, 0, 1)
Xl <- data.frame(X)
beta <- c((1:5)/2, rep(0, 25))        # true values of parameters
py <- 1/(1 + exp(0.5 - X %*% beta))   # probability of response
                                      # note that the intercept is assumed to be -0.5
y <- rbinom(n, 1, py)                 # simulate response from true model

fit.true <- glm(y ~ X.1 + X.2 + X.3 + X.4 + X.5,
                family=binomial, data=Xl)       # fit true model

ind <- sample(1:p)                    # random permutation of the predictor columns
X2 <- Xl[, ind]                       # scramble order of predictors
fit.full <- glm(y ~ ., family=binomial, data=X2)

# implement stepwise selection using the AIC criterion
fit.step <- step(fit.full, scope=list(upper = ~., lower = ~1),  # intercept in all models
                 trace=F)
fit.step$anova                        # summarize results

Comments: we start with the full model & perform backwards elimination - predictors are removed sequentially when doing so reduces the AIC.

Definition of Akaike's Information Criterion (AIC):

AIC = -2 (maximized log likelihood) + 2 (# parameters).

Why put in the number of parameters?
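As a quick check of the definition, here is a sketch (in R, assuming a fitted glm object called fit) of computing the AIC directly from the maximized log likelihood:

ll <- logLik(fit)                          # maximized log likelihood
k  <- attr(ll, "df")                       # number of estimated parameters
aic.byhand <- -2 * as.numeric(ll) + 2 * k  # -2 log L + 2 (# parameters)
aic.byhand                                 # should agree with AIC(fit)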

Results from stepwise analysis of simulated data:

Stepwise Model Path
Analysis of Deviance Table

Initial Model:
y ~ X.4 + X.15 + X.26 + X.13 + X.16 + X.25 + X.9 + X.3 + X.18 +
    X.29 + X.1 + X.30 + X.24 + X.11 + X.12 + X.7 + X.21 + X.10 +
    X.17 + X.8 + X.5 + X.28 + X.22 + X.27 + X.2 + X.23 + X.6 +
    X.14 + X.19 + X.20

Final Model:
y ~ X.4 + X.15 + X.26 + X.25 + X.3 + X.1 + X.12 + X.17 + X.5 +
    X.28 + X.22 + X.2

     Step      Df   Deviance   Resid. Df   Resid. Dev       AIC
 1                                   969     528.2075  590.2075
 2   - X.9      1   0.018536        970     528.2260  588.2260
 3   - X.20     1   0.020528        971     528.2466  586.2466
 4   - X.11     1   0.024500        972     528.2711  584.2711
 5   - X.18     1   0.045945        973     528.3170  582.3170
 6   - X.13     1   0.057255        974     528.3743  580.3743
 7   - X.16     1   0.080202        975     528.4545  578.4545
 8   - X.24     1   0.100360        976     528.5548  576.5548
 9   - X.21     1   0.278245        977     528.8331  574.8331
10   - X.30     1   0.296931        978     529.1300  573.1300
11   - X.27     1   0.440982        979     529.5710  571.5710
12   - X.10     1   0.445923        980     530.0169  570.0169
13   - X.7      1   0.447855        981     530.4648  568.4648
14   - X.23     1   0.625145        982     531.0899  567.0899
15   - X.8      1   1.000717        983     532.0906  566.0906
16   - X.6      1   1.025047        984     533.1157  565.1157
17   - X.29     1   1.366004        985     534.4817  564.4817
18   - X.19     1   1.445814        986     535.9275  563.9275
19   - X.14     1   1.623275        987     537.5508  563.5508

The selected model contains predictors X_1-X_5 belonging to the true model (we got lucky & the coefficients are large).

It also contains predictors X_12, X_15, X_17, X_22, X_25, X_26, X_28, which should have been excluded (not so good).

Results of maximum likelihood estimation for the true model:

                  Value  Std. Error    t value
(Intercept)  -0.4494802   0.1106592  -4.061844
X.1           0.5525122   0.1115489   4.953095
X.2           0.9541896   0.1176940   8.107376
X.3           1.7647030   0.1489397  11.848442
X.4           2.0024375   0.1638457  12.221483
X.5           2.8375060   0.2018184  14.059703

Results of maximum likelihood estimation for the full model (focusing on the important predictors):

                   Value  Std. Error     t value
(Intercept)  -0.47183494   0.1181678  -3.9929235
X.1           0.58449537   0.1214352   4.8132281
X.2           1.06182854   0.1289719   8.2330242
X.3           1.91001906   0.1626665  11.7419354
X.4           2.13608265   0.1772401  12.0519146
X.5           3.04082332   0.2230194  13.6347913

The full-model MLEs differ somewhat from the MLEs under the true model, and the standard errors are higher.

Now, consider the estimates from the final selected model:

                  Value  Std. Error    t value
(Intercept)  -0.4661101   0.1148807  -4.057341
X.1           0.5675410   0.1157651   4.902522
X.2           1.0170453   0.1233606   8.244489
X.3           1.8653303   0.1574395  11.847921
X.4           2.1040769   0.1731561  12.151332
X.5           2.9939133   0.2168262  13.807893
X.12          0.1725280   0.1124232   1.534630
X.15          0.1593174   0.1071552   1.486790
X.17         -0.2245846   0.1059882  -2.118958
X.22         -0.2398270   0.1128883  -2.124464
X.25         -0.1953539   0.1166990  -1.673999
X.26         -0.2685661   0.1112931  -2.413143
X.28         -0.1726031   0.1110436  -1.554373

Comparing the parameters for the predictors that should have been included to the earlier results, we find estimates in between those for the true model and the full model.

Interesting result: many of the predictors known to be unimportant have coefficients significantly different from zero. This is a very common occurrence when using stepwise selection!

If we had chosen smaller values for the coefficients of the important predictors, we would have missed one or more of them (based on repeating the simulation).

Some Issues with Stepwise Procedures: Normal Linear & Orthogonal Case

In normal linear models with orthogonal X, forward and backwards selection will yield the same model (i.e., the selection process is not order-dependent).

However, the choice of the significance level for inclusion in the model is arbitrary and can have a large impact on the final model selected. Potentially, one can instead use some goodness-of-fit criterion (e.g., AIC, BIC).

In addition, if interest focuses on inference on the β's, stepwise procedures can result in biased estimates and invalid hypothesis tests (i.e., if one naively uses the final model selected without correction for the selection process).

Issues with Stepwise Procedures (General Case)

For GLMs other than orthogonal, linear Gaussian models, the order in which parameters are added to or dropped from the model can have a large impact on the final model selected.

This is not good, since choices of the order of selection are typically (if not always) arbitrary.

Model selection is a challenging area, but there are alternatives.

Goodness-of-Fit Criteria

There are a number of criteria that have been proposed for comparing models based on a measure of goodness of fit penalized by model complexity:

1. Akaike's Information Criterion (AIC): AIC_M = D_M + 2 p_M, where D_M is the deviance for model M and p_M is the number of predictors.

2. Bayesian Information Criterion (BIC): BIC_M = D_M + p_M log(n).

Note that the deviance decreases (and the likelihood increases) as variables are added.

The AIC and BIC differ in the penalty for model complexity: the AIC penalty is twice the number of parameters, while the BIC penalty is the number of parameters multiplied by the logarithm of the sample size.

The BIC tends to place a larger penalty on the number of predictors, and hence more parsimonious models are selected.
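A minimal sketch (assuming a fitted glm object fit and sample size n) of the two criteria as defined above, computed from the deviance:

p.M   <- length(coef(fit))             # number of estimated coefficients in model M
aic.M <- deviance(fit) + 2 * p.M       # AIC-type penalty: 2 per parameter
bic.M <- deviance(fit) + p.M * log(n)  # BIC-type penalty: log(n) per parameter
# once n exceeds exp(2) (about 7.4), log(n) > 2, so the BIC penalty is the larger one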

Uncertainty in the Link Function & Distribution

Note that we can similarly use the AIC or BIC criteria to select the link function or the distribution.

For example, suppose that we have binary outcome data. There are many possible link functions - any smooth, monotone function mapping from R to [0, 1] (i.e., the cumulative distribution function of any continuous density).

We simulated data from a logistic regression model with x_i = (1, dose_i)', where dose_i ~ Uniform(0, 1), i = 1, ..., 100. The parameters were chosen to be β = (-3, 5).

As an alternative to the logistic model, we considered the probit:

Pr(y_i = 1 | x_i) = Φ(x_i' β),

where Φ(z) = ∫_{-∞}^{z} (2π)^{-1/2} exp(-u^2/2) du is the standard normal cdf.

We also considered the complementary log-log model:

Pr(y_i = 1 | x_i) = 1 - exp{ -exp(x_i' β) }.
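A sketch of how the three candidate links can be fit and compared in R, assuming vectors y (binary response) and dose are available; binomial() accepts "logit", "probit", and "cloglog" links:

fits <- list(
  logit   = glm(y ~ dose, family = binomial(link = "logit")),
  probit  = glm(y ~ dose, family = binomial(link = "probit")),
  cloglog = glm(y ~ dose, family = binomial(link = "cloglog"))
)
sapply(fits, AIC)   # compare fits; smaller values are preferred
sapply(fits, BIC)   # same ranking here, since all three models have 2 parameters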

[Figure: Pr(response) versus dose.]

Summary of results:

Logistic regression:     β = (-3.67, 6.14), AIC = 94.0211,  BIC = 99.23145
Probit regression:       β = (-2.14, 3.60), AIC = 93.9993,  BIC = 99.20964
Complementary log-log:   β = (-3.14, 4.39), AIC = 94.21893, BIC = 99.42927

Low values of AIC & BIC are preferred, so we select the probit model (which happens to be wrong).

Some things to think about

It is very often the case that many models are consistent with the data. This is particularly true when there is a large number of models in the list of plausible models.

Ideally, substantive information can be brought to bear to reduce the size of the list. For example, certain models may be more consistent with biology.

However, one is typically still faced with many possible models & concerned about sensitivity of inferences to the model chosen or selected by some algorithm.

Given the selection bias that occurs, it may be better to focus a priori on a single model rather than run stepwise selection.

However, better yet would be to formally account for model uncertainty (e.g., through Bayesian model averaging).

DDE and Pre-Term Birth Example (Longnecker et al., 2001, Lancet)

Scientific interest: the association between DDE exposure (continuous predictor) and preterm birth (binary response).

Previous data: DDT is a reproductive toxin in birds, rabbits, and possibly sea lions. In humans, DDT has been associated with preterm birth in small studies.

Current data: data drawn from the US Collaborative Perinatal Project (CPP) - complete data for n = 2380 children, of which 361 were born preterm.

Possible confounding factors: study center, serum triglycerides and cholesterol, infant ethnicity and gender, mother's age, height, body mass index before pregnancy, rate of weight gain during pregnancy, parity, socioeconomic status, and smoking during pregnancy.

Notation:

y_i = 1 if preterm birth and y_i = 0 if full-term birth
d_i = dose of DDE for the ith woman (i = 1, ..., n)
z_i = (z_{i1}, ..., z_{ip})' = vector of confounding factors
x_i = vector of predictors (i.e., covariates)

What model should we use to investigate the association between the dose of DDE and the probability of preterm birth?

Logistic Regression Analysis

log [ Pr(y_i = 1 | x_i) / Pr(y_i = 0 | x_i) ] = β d_i + α_1 z_{i1} + ... + α_p z_{ip} = β d_i + z_i' α,

where β is the increase in the log-odds of a response associated with a unit increase in the dose of DDE.

This log-odds ratio is adjusted for the possible confounding variables z_i.
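A hedged sketch of how such an adjusted analysis might be coded in R; the data frame cpp and all variable names below are hypothetical placeholders, not the actual CPP variable names:

# adjusted logistic regression for preterm birth on DDE dose plus confounders
fit.dde <- glm(preterm ~ dde + center + triglycerides + cholesterol +
                 ethnicity + gender + mother.age + height + bmi +
                 weight.gain + parity + ses + smoking,
               family = binomial, data = cpp)
exp(coef(fit.dde)["dde"])                # adjusted odds ratio per unit increase in dose
exp(confint.default(fit.dde)["dde", ])   # Wald 95% CI on the odds-ratio scale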

Typically, investigators are interested in the exposure odds ratio

[ Pr(y_i = 1 | d_i = c, z_i) / Pr(y_i = 0 | d_i = c, z_i) ] / [ Pr(y_i = 1 | d_i = c-1, z_i) / Pr(y_i = 0 | d_i = c-1, z_i) ]

associated with a unit increase in exposure (e.g., DDE dose).

Exponentiating the logit model, the exposure odds ratio is equal to

exp{ β c + z_i' α } / exp{ β (c-1) + z_i' α } = exp(β),

which is simply the exponentiated regression coefficient for exposure.

Rare Disease Assumption

Often, investigators are interested in the relative risk,

RR = Pr(y_i = 1 | d_i = c, z_i) / Pr(y_i = 1 | d_i = c-1, z_i),

which (unfortunately) depends on the value of c.

However, under the assumption that the disease (i.e., the response) is rare,

Pr(y_i = 1 | d_i = c, z_i) ≈ Pr(y_i = 1 | d_i = c, z_i) / Pr(y_i = 0 | d_i = c, z_i),

and it follows that the relative risk is approximately equal to the exposure odds ratio.
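A small numerical illustration of the rare-disease approximation (made-up risks, not results from the DDE analysis):

p1 <- 0.020                              # Pr(y = 1 | d = c,     z): risk at dose c
p0 <- 0.010                              # Pr(y = 1 | d = c - 1, z): risk at dose c - 1
RR <- p1 / p0                            # relative risk = 2.00
OR <- (p1 / (1 - p1)) / (p0 / (1 - p0))  # odds ratio = 2.02
# because the outcome is rare, 1 - p is close to 1 and the OR closely approximates the RR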

Returning to the Longnecker et al. example, what assumption about the dose-response relationship are we making by using the model

log [ Pr(y_i = 1 | x_i) / Pr(y_i = 0 | x_i) ] = β d_i + z_i' α ?

Do they use this model in the published analysis?