A tool to demystify regression modelling behaviour

Thomas Alexander Gerds

Appetizer
Every child knows how regression analysis works. The essentials of regression modelling strategy, such as which variables to include in which way, however, are typically based on family tradition rather than mathematical theory. Every child knows, too, that it is pretty difficult to change tradition.

I have the best data in the world. Here are the guidelines! What is the question? Who asks the question? Parameter? Identifiability?

A classical misunderstanding

A statistics teacher states
When the sample size is large, then the test statistic is often normally distributed.
When the data are normally distributed, then the t-test statistic is optimal.

Two applied researchers remember together
"If the sample size is large, then I should apply the t-test; if the sample size is small, then the non-parametric Wilcoxon rank sum test is preferred."
"Yes, for a t-test you need at least 30 samples."

Family traditions

Target audience [1]
Our coverage is not intended for highly skilled practitioners; rather, we target teachers, students and working epidemiologists who would like to do better with data analysis, but who lack resources such as R programming skills or a bona fide modelling expert committed to their project.
[1] Greenland, Daniel, Pearce. Int J Epidemiol, Apr; 45(2).

Still Greenland et al.
Throughout, we assume that we are applying a conventional risk or rate regression model (e.g. logistic, Cox or Poisson regression) to estimate the effects of an exposure variable X on the distribution of a disease variable Y while controlling for other variables. The other variables include forced variables, such as age and sex, which we may always want to control, and may also include unforced variables about which we are unsure whether to control.

Statistical modelling: the generalized linear model
Here is a (favourite) multiple regression model [2]:
$$g(Y) = \alpha + \beta X + \gamma Z$$
Some spoilsport thinks that the model gets better when Z is removed:
$$g(Y) = \alpha^* + \beta^* X$$
[2] g is a non-linear link function such as logit.

Tradition: X
Step 1. Use the data to test the crude null hypothesis $H_0: \beta^* = 0$.
Step 2. Use the same data to test the adjusted null hypothesis $H_0: \beta = 0$.
Step 3. Depends on the results of Steps 1 and 2:
Case crude p < 0.05 and adjusted p < 0.05: X is an independent predictor.
Case crude p < 0.05 and adjusted p > 0.05: X reflects other variables.

How would an applied statistician with a strong fundamental mathematical foundation react to this?
What is the question?

Tradition: Z
Step 1. Use the data to test the null hypothesis $H_0: \gamma = 0$.
Step 2. Depends on the result of Step 1:
Case p < 0.05: use the same data to fit this model: $g(Y) = \alpha + \beta X + \gamma Z$
Case p > 0.05: use the same data to fit that model: $g(Y) = \alpha^* + \beta^* X$
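The two-step procedure above can be sketched in base R. This is only an illustration, not code from the talk: the simulated data, the helper name `pretest_fit` and its `level` argument are assumptions made for the example.

```r
# Simulated data (assumed for illustration): binary exposure X, covariate Z
set.seed(1)
n <- 2000
Z <- rnorm(n)
X <- rbinom(n, 1, 0.5)
Y <- rbinom(n, 1, plogis(-1 + log(2) * X + log(3) * Z))
d <- data.frame(Y = Y, X = X, Z = Z)

# Hypothetical helper implementing the "Tradition: Z" pretest strategy
pretest_fit <- function(data, level = 0.05) {
  full <- glm(Y ~ X + Z, data = data, family = binomial)
  # Step 1: Wald test of H0: gamma = 0 in the adjusted model
  p_gamma <- coef(summary(full))["Z", "Pr(>|z|)"]
  # Step 2: choose the model based on the test result, using the same data
  if (p_gamma < level) {
    list(model = "adjusted", coef_X = unname(coef(full)["X"]))
  } else {
    crude <- glm(Y ~ X, data = data, family = binomial)
    list(model = "crude", coef_X = unname(coef(crude)["X"]))
  }
}

res <- pretest_fit(d)
print(res)
```

Note that the function reports a coefficient of X from a different model depending on the outcome of the pretest.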

How would an applied statistician with a strong fundamental mathematical foundation react to this?
What is the parameter of interest: $\beta$ or $\beta^*$?

Frank E Harrell Jr [3] about stepwise variable selection
1. Stepwise selection yields $R^2$ values that are biased high.
2. The ordinary F and $\chi^2$ test statistics do not have the claimed distribution.
...
4. It yields P-values that are too small ... and the proper correction for them is a very difficult problem.
5. It provides regression coefficients that are biased high in absolute value and need shrinkage.
...
7. It allows us to not think about the problem.
[3] Regression Modelling Strategies (Springer, pages 56-57).

Hauck et al. [4]
About the two rival models
$$g(Y) = \alpha^* + \beta^* X \qquad (1)$$
$$g(Y) = \alpha + \beta X + \gamma Z \qquad (2)$$
What makes model (2) the correct one? Why not model (1)? The issue is not one of bias! Rather, models (1) and (2) are estimating different measures of treatment effect. Thus, there is no reason for them to yield the same estimates.
[4] Controlled Clinical Trials 19 (1998).

Tradition: X and Z
Step 1. Use expert knowledge, a DAG, and/or the literature (but not the data, phew!) to find the known predictors Z and decide whether they are real confounders: Z → X?
Step 2. Depends on the result of Step 1:
Case Z → X: fit this model: $g(Y) = \alpha + \beta X + \gamma Z$
Case Z ↛ X: fit that model: $g(Y) = \alpha^* + \beta^* X$

How would an applied statistician with a strong fundamental mathematical foundation react to this?
Well, this depends on g.

Principal criteria
Classical criterion: A covariate is a confounder if it is associated with the exposure and, causally, with the outcome.
Operational criterion: A covariate is a confounder if the estimate of exposure effect is changed by inclusion of the covariate.

Mavericks [5]
A maverick is a covariate that satisfies the operational but not the classical criterion.
[Diagram of the variables Y, X and Z]
[5] Hauck, Neuhaus, Kalbfleisch, Anderson. J Clin Epidemiol Vol. 44, No. 1.

Hauck, Neuhaus, Kalbfleisch, Anderson
Rule 1. The effect of omitting a maverick is to bias the odds ratio [6] towards no effect.
Rule 2. The magnitude of the bias caused by omitting a maverick increases with the variance of the omitted maverick and with the magnitude of the effect of the maverick on the outcome.
...
Rule 5. Tests of the hypothesis of no exposure-outcome association remain valid when a maverick is omitted.
[6] The same holds for the hazard ratio in Cox regression.
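A small worked example may help to see Rule 1 at work. The numbers are assumptions chosen for easy arithmetic and are not taken from the talk: a binary maverick Z with P(Z = 1) = 1/2, independent of X, and a logistic model with intercept 0 and conditional odds ratio 3 for both X and Z.

```latex
\operatorname{logit} P(Y = 1 \mid X, Z) = \log(3)\, X + \log(3)\, Z,
\qquad \operatorname{expit}(\log 3) = 0.75, \quad \operatorname{expit}(\log 9) = 0.9

P(Y = 1 \mid X = 0) = \tfrac12 \{\operatorname{expit}(0) + \operatorname{expit}(\log 3)\}
                    = \tfrac12 (0.5 + 0.75) = 0.625

P(Y = 1 \mid X = 1) = \tfrac12 \{\operatorname{expit}(\log 3) + \operatorname{expit}(\log 9)\}
                    = \tfrac12 (0.75 + 0.9) = 0.825

\mathrm{OR}_{\text{crude}} = \frac{0.825 / 0.175}{0.625 / 0.375}
                           = \frac{33/7}{5/3} = \frac{99}{35} \approx 2.83 < 3
```

Although Z is independent of X in this setup, the odds ratio that ignores Z is closer to 1 than the conditional odds ratio of 3.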

Demonstration of the attenuation effect

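    library(lava)
    m <- lvm()
    # Y and X are binary with logit link; Z keeps lava's default (normal) distribution
    distribution(m, ~ Y + X) <- binomial.lvm("logit")
    # conditional odds ratios: 3 for X and 3 for Z
    regression(m, Y ~ X + Z) <- c(log(3), log(3))
    d <- sim(m, n = 50000)
    model1 <- glm(Y ~ X, data = d, family = binomial)      # Z omitted
    model2 <- glm(Y ~ X + Z, data = d, family = binomial)  # Z included

    Variable   Model 1                   Model 2
               OddsRatio  CI.95          OddsRatio  CI.95
    X          2.40       [2.31;2.49]    3.03       [2.91;3.16]
    Z                                    3.04       [2.96;3.12]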

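    library(lava)
    m <- lvm()
    # Y and X are binary with logit link; Z keeps lava's default (normal) distribution
    distribution(m, ~ Y + X) <- binomial.lvm("logit")
    # conditional odds ratios: 2 for X and 0.5 for Z
    regression(m, Y ~ X + Z) <- c(log(2), log(.5))
    d <- sim(m, n = 50000)
    model1 <- glm(Y ~ X, data = d, family = binomial)      # Z omitted
    model2 <- glm(Y ~ X + Z, data = d, family = binomial)  # Z included

    Variable   Model 1                   Model 2
               OddsRatio  CI.95          OddsRatio  CI.95
    X          1.89       [1.82;1.96]    2.04       [1.96;2.12]
    Z                                    0.50       [0.49;0.51]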

A new recipe

(Breaking with flawed) regression modelling traditions
Van der Laan & Rose:
It should not be overlooked that the process of looking at the data, examining coefficient p-values, and trying multiple statistical models is not only incredibly prevalent but is taught to students learning statistics.

The superlearner recipe
Step 1. Ask the research question.
Step 2. Define the parameter of interest without specifying the rest of the model.
Step 3. Estimate the parameter; use cross-validation to distinguish alternative estimators of the same parameter.
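Step 3 can be illustrated with a rough base-R sketch. It compares two candidate logistic models by their K-fold cross-validated Brier score. The simulated data, the choice of the Brier score as loss and the helper name `cv_brier` are assumptions for this example; a full super learner would go further and combine several candidate learners.

```r
# Simulated data (assumed for illustration)
set.seed(7)
n <- 5000
Z <- rnorm(n)
X <- rbinom(n, 1, plogis(log(1.5) * Z))
Y <- rbinom(n, 1, plogis(log(3) * X + log(3) * Z))
d <- data.frame(Y = Y, X = X, Z = Z)

# One random fold assignment shared by all candidate models
K <- 5
fold <- sample(rep(seq_len(K), length.out = n))

# Hypothetical helper: K-fold cross-validated Brier score of a logistic model
cv_brier <- function(formula, data, fold) {
  scores <- sapply(sort(unique(fold)), function(k) {
    train <- data[fold != k, ]
    test  <- data[fold == k, ]
    fit <- glm(formula, data = train, family = binomial)
    p <- predict(fit, newdata = test, type = "response")
    mean((test$Y - p)^2)   # squared error of the predicted risks on held-out data
  })
  mean(scores)
}

brier1 <- cv_brier(Y ~ X, d, fold)      # candidate without Z
brier2 <- cv_brier(Y ~ X + Z, d, fold)  # candidate with Z
print(c(brier1 = brier1, brier2 = brier2))
```

A lower cross-validated Brier score indicates better out-of-sample risk prediction, which is one possible basis for choosing among the candidates.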

Example
Research question: Is there a difference in the 30-day survival chances of cardiac arrest patients who received bystander CPR compared to cardiac arrest patients who did not receive bystander CPR?
Target parameter:
$$\theta = P(Y_{30} = 1 \mid X = 1) - P(Y_{30} = 1 \mid X = 0)$$
Note: not specified if or how Z should be included.

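Two alternative estimators [7]
Estimator 1: Y30 ~ X
$$\hat\theta_1 = \frac{1}{n_1} \sum_{i: X_i = 1} \operatorname{expit}(\hat\alpha^* + \hat\beta^*) - \frac{1}{n_0} \sum_{i: X_i = 0} \operatorname{expit}(\hat\alpha^*)$$
Estimator 2: Y30 ~ X + Z
$$\hat\theta_2 = \frac{1}{n_1} \sum_{i: X_i = 1} \operatorname{expit}(\hat\alpha + \hat\beta + \hat\gamma Z_i) - \frac{1}{n_0} \sum_{i: X_i = 0} \operatorname{expit}(\hat\alpha + \hat\gamma Z_i)$$
BUT: possibly comparing different distributions of Z.
[7] $Z_i$ are demographics, comorbidities and other risk factors.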
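The two estimators can be computed with a few lines of base R. This is a sketch on simulated data; the data-generating values and the helper name `group_diff` are assumptions for illustration only.

```r
# Simulated data (assumed): Z influences both exposure X and outcome Y30
set.seed(3)
n <- 20000
Z <- rnorm(n)
X <- rbinom(n, 1, plogis(log(1.5) * Z))
Y30 <- rbinom(n, 1, plogis(-1 + log(3) * X + log(3) * Z))
d <- data.frame(Y30 = Y30, X = X, Z = Z)

fit1 <- glm(Y30 ~ X, data = d, family = binomial)      # model behind Estimator 1
fit2 <- glm(Y30 ~ X + Z, data = d, family = binomial)  # model behind Estimator 2

# Hypothetical helper: average the fitted risks within each exposure group
group_diff <- function(fit, data) {
  p <- predict(fit, newdata = data, type = "response")
  mean(p[data$X == 1]) - mean(p[data$X == 0])
}

theta1 <- group_diff(fit1, d)
theta2 <- group_diff(fit2, d)
# Observed difference in 30-day survival proportions, for comparison
raw <- mean(d$Y30[d$X == 1]) - mean(d$Y30[d$X == 0])
print(c(theta1 = theta1, theta2 = theta2, raw = raw))
```

Because the maximum likelihood equations of a logistic model with an intercept and a binary X match the average fitted risk to the observed event proportion within each exposure group, both estimates reproduce the observed risk difference up to numerical tolerance. Each group is averaged over its own distribution of Z.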

Example (continued)
Research question: Is there a difference in the 30-day survival chances of a cardiac arrest patient who received bystander CPR compared to a cardiac arrest patient who did not receive bystander CPR, when both patients have the same demographics, comorbidities and other risk factors?
Target parameter:
$$\theta_z = P(Y_{30} = 1 \mid X = 1, Z = z) - P(Y_{30} = 1 \mid X = 0, Z = z)$$
Note: result depends on z.

Example (continued)
Research question: What is the average causal effect of bystander CPR on the 30-day survival chances in cardiac arrest patients?
Target parameter:
$$\theta = E_Z\{P(Y_{30} = 1 \mid do(X = 1), Z)\} - E_Z\{P(Y_{30} = 1 \mid do(X = 0), Z)\}$$
Note: result is an average across Z and does not depend on z.

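Three alternative (G-formula) estimators [8]
Estimator 1:
$$\frac{1}{n} \sum_{i=1}^{n} \left( \operatorname{expit}(\hat\alpha^* + \hat\beta^*) - \operatorname{expit}(\hat\alpha^*) \right)$$
Estimator 2:
$$\frac{1}{n} \sum_{i=1}^{n} \left( \operatorname{expit}(\hat\alpha + \hat\beta + \hat\gamma Z_i) - \operatorname{expit}(\hat\alpha + \hat\gamma Z_i) \right)$$
Estimator 3:
$$\frac{1}{n} \sum_{i=1}^{n} \left( \operatorname{expit}(\hat\alpha + \hat\beta + \hat\gamma_1 Z_i + \hat\gamma_2 Z_i^2) - \operatorname{expit}(\hat\alpha + \hat\gamma_1 Z_i + \hat\gamma_2 Z_i^2) \right)$$
[8] Note: The estimates rely on different prediction models.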
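The three G-formula estimators can also be written with `predict()` in base R, by setting X to 1 and to 0 for every subject and averaging the difference in predicted risks. This is a sketch on simulated data; the simulation settings and the helper name `g_formula` are assumptions for illustration.

```r
# Simulated data (assumed): Z confounds the effect of X on Y
set.seed(18)
n <- 50000
Z <- rnorm(n)
X <- rbinom(n, 1, plogis(log(1.5) * Z))
Y <- rbinom(n, 1, plogis(log(3) * X + log(3) * Z))
d <- data.frame(Y = Y, X = X, Z = Z)

# Hypothetical helper: predict everyone's risk under X = 1 and under X = 0,
# then average the individual differences over the observed Z values
g_formula <- function(fit, data) {
  p1 <- predict(fit, newdata = transform(data, X = 1), type = "response")
  p0 <- predict(fit, newdata = transform(data, X = 0), type = "response")
  mean(p1 - p0)
}

rd1 <- g_formula(glm(Y ~ X, data = d, family = binomial), d)               # Estimator 1
rd2 <- g_formula(glm(Y ~ X + Z, data = d, family = binomial), d)           # Estimator 2
rd3 <- g_formula(glm(Y ~ X + Z + I(Z^2), data = d, family = binomial), d)  # Estimator 3
print(c(rd1 = rd1, rd2 = rd2, rd3 = rd3))
```

In this simulation Z raises both the chance of exposure and the chance of the outcome, so the estimator that ignores Z gives a larger risk difference than the two that average over Z.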

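Lava's g-formula (model1)

    library(lava)
    m <- lvm()
    distribution(m, ~ Y + X) <- binomial.lvm("logit")
    regression(m, Y ~ X + Z) <- c(log(3), log(3))
    regression(m, X ~ Z) <- log(1.5)   # Z also affects the exposure
    set.seed(18)
    d <- sim(m, n = 50000)
    model1 <- glm(Y ~ X, data = d, family = binomial())
    estimate(model1, function(p, data) {
      a <- p["(Intercept)"]
      b <- p["X"]
      R.X1 <- expit(a + b)   # predicted risk with X = 1
      R.X0 <- expit(a)       # predicted risk with X = 0
      list(riskdiff = R.X1 - R.X0)
    }, average = TRUE)

Output columns: Estimate, Std.Err, 2.5%, 97.5%, P-value (row: riskdiff).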

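Lava's g-formula (model2)

    library(lava)
    m <- lvm()
    distribution(m, ~ Y + X) <- binomial.lvm("logit")
    regression(m, Y ~ X + Z) <- c(log(3), log(3))
    regression(m, X ~ Z) <- log(1.5)   # Z also affects the exposure
    set.seed(18)
    d <- sim(m, n = 50000)
    model2 <- glm(Y ~ X + Z, data = d, family = binomial())
    estimate(model2, function(p, data) {
      a <- p["(Intercept)"]
      b <- p["X"]
      c <- p["Z"]
      R.X1 <- expit(a + b + c * data[, "Z"])   # each subject's risk with X = 1
      R.X0 <- expit(a + c * data[, "Z"])       # each subject's risk with X = 0
      list(riskdiff = R.X1 - R.X0)
    }, average = TRUE)

Output columns: Estimate, Std.Err, 2.5%, 97.5%, P-value (row: riskdiff).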

Lava's g-formula (model3)

    library(lava)
    m <- lvm()
    distribution(m, ~ Y + X) <- binomial.lvm("logit")
    regression(m, Y ~ X + Z) <- c(log(3), log(3))
    regression(m, X ~ Z) <- log(1.5)   # Z also affects the exposure
    set.seed(18)
    d <- sim(m, n = 50000)
    d$Q <- d$Z^2                       # quadratic term of Z
    model3 <- glm(Y ~ X + Z + Q, data = d, family = binomial())
    estimate(model3, function(p, data) {
      a <- p["(Intercept)"]
      b <- p["X"]
      c <- p["Z"]
      q <- p["Q"]
      R.X1 <- expit(a + b + c * data[, "Z"] + q * data[, "Q"])
      R.X0 <- expit(a + c * data[, "Z"] + q * data[, "Q"])
      list(riskdiff = R.X1 - R.X0)
    }, average = TRUE)

Output columns: Estimate, Std.Err, 2.5%, 97.5%, P-value (row: riskdiff).

Conclusions
The current tradition
There is hardly any mathematical theory to back up some of the very prevalent strategies (traditions). We should STOP teaching backward elimination.
Attenuation effect
The scale on which we measure effects matters. Randomization does not prevent bias in logistic regression and Cox regression on the odds ratio and hazard ratio scale. But the direction of the bias is known, and the variance is reduced when all mavericks are omitted.
Lava
Simulate data like the real data.
Study the performance (bias and variance) of a modelling strategy under controlled conditions.
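As a rough sketch of how performance under controlled conditions might be studied, the following base-R Monte Carlo loop compares the crude and the adjusted estimate of the log odds ratio of a randomized exposure. It uses plain `glm()` instead of lava, and the sample size, number of repetitions and effect sizes are assumptions for illustration. Bias is measured against the conditional log odds ratio log(3), which is itself a choice of target.

```r
set.seed(2024)
n_rep <- 200   # number of simulated data sets (assumed)
n <- 2000      # sample size per data set (assumed)

one_run <- function() {
  X <- rbinom(n, 1, 0.5)   # randomized exposure
  Z <- rnorm(n)            # maverick: affects Y, independent of X
  Y <- rbinom(n, 1, plogis(log(3) * X + log(3) * Z))
  c(crude    = unname(coef(glm(Y ~ X, family = binomial))["X"]),
    adjusted = unname(coef(glm(Y ~ X + Z, family = binomial))["X"]))
}

est <- t(replicate(n_rep, one_run()))   # n_rep x 2 matrix of log odds ratios
bias <- colMeans(est) - log(3)          # relative to the conditional log odds ratio
variance <- apply(est, 2, var)
print(rbind(bias = bias, variance = variance))
```

The same loop can wrap any other modelling strategy, for example a pretest or stepwise procedure, in place of the two `glm()` fits.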

Discussion of the new recipe
The subject matter perspective matters when we define what to estimate (average effect versus prediction).
Put in several models/methods, including machine learning and expert opinion, and let the data decide.
The superlearner recipe has a built-in validation mechanism.
Many possible confounders
The more covariates $Z_1, \dots, Z_p$, the more possibilities to model.
Many exposure variables (AKA: What are the predictors?)
You should know about the Table 2 fallacy [9] and possibly treat each of $X_1, \dots, X_p$ separately using the new recipe, or make a prediction model.
[9] Westreich & Greenland. Am J Epidemiol. 2013;177(4).

Antiparsimony principle [10]
Models should be rich enough to reflect the complexity of the relations under study.
Countervailing principle: you cannot estimate anything if you try to estimate everything.
[10] Greenland (2000), citing L. J. Savage.
