A tool to demystify regression modelling behaviour

1 A tool to demystify regression modelling behaviour Thomas Alexander Gerds 1 / 38

2 Appetizer
Every child knows how regression analysis works. The essentials of a regression modelling strategy, such as which variables to include and in which way, are however typically based on family tradition rather than mathematical theory. Every child knows, too, that it is pretty difficult to change tradition.

3 I have the best data in the world. Here are the guidelines! What is the question? Who asks the question? Parameter? Identifiability?

4 A classical misunderstanding

5 A statistics teacher states
When the sample size is large, then the test statistic is often normally distributed. When the data are normally distributed, then the t-test statistic is optimal.

6 Two applied researchers remember together
"If the sample size is large, then I should apply the t-test; if the sample size is small, then the non-parametric Wilcoxon rank sum test is preferred."
"Yes, for a t-test you need at least 30 samples."

7 Family traditions

8 Target audience [1]
"Our coverage is not intended for highly skilled practitioners; rather, we target teachers, students and working epidemiologists who would like to do better with data analysis, but who lack resources such as R programming skills or a bona fide modelling expert committed to their project."
[1] Greenland, Daniel, Pearce. Int J Epidemiol. Apr; 45(2):

9 Still Greenland et al.
"Throughout, we assume that we are applying a conventional risk or rate regression model (e.g. logistic, Cox or Poisson regression) to estimate the effects of an exposure variable X on the distribution of a disease variable Y while controlling for other variables. The other variables include forced variables, such as age and sex, which we may always want to control, and may also include unforced variables about which we are unsure whether to control."

11 Statistical modelling: the generalized linear model
Here is a (favourite) multiple regression model [2]:
$g(Y) = \alpha + \beta X + \gamma Z$
Some spoilsport thinks that the model gets better when Z is removed:
$g(Y) = \tilde\alpha + \tilde\beta X$
[2] g is a non-linear link function such as logit.

13 Tradition: X
Step 1. Use the data to test the crude null hypothesis: $H_0: \tilde\beta = 0$ (p-value $\tilde p$)
Step 2. Use the same data to test the adjusted null hypothesis: $H_0: \beta = 0$ (p-value $p$)
Step 3. Depends on the results of Steps 1 and 2:
  Case $\tilde p < 0.05$ and $p < 0.05$: X is an independent predictor
  Case $\tilde p < 0.05$ and $p > 0.05$: X reflects other variables

15 How would an applied statistician with a strong mathematical foundation react to this?
What is the question?

17 Tradition: Z
Step 1. Use the data to test the null hypothesis $H_0: \gamma = 0$
Step 2. Depends on the result of Step 1:
  Case p < 0.05: use the same data to fit this model: $g(Y) = \alpha + \beta X + \gamma Z$
  Case p > 0.05: use the same data to fit that model: $g(Y) = \tilde\alpha + \tilde\beta X$
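A small simulation, offered only as a sketch, shows what this tradition does in practice: the model that ends up being fitted, and hence the parameter being reported, changes from data set to data set. The settings below (n = 500, a weak effect of Z with coefficient 0.15, a randomized binary X, the helper name `one_run`) are illustrative assumptions and are not taken from the slides.

```r
# Sketch: "test gamma first, then choose the model" on repeated data sets.
# All numeric settings are hypothetical choices for illustration.
set.seed(7)

one_run <- function(n = 500) {
  Z <- rnorm(n)
  X <- rbinom(n, 1, 0.5)                              # randomized exposure
  Y <- rbinom(n, 1, plogis(log(3) * X + 0.15 * Z))    # weak effect of Z

  full <- glm(Y ~ X + Z, family = binomial)
  p_gamma <- summary(full)$coefficients["Z", "Pr(>|z|)"]

  keep_z <- p_gamma < 0.05                            # Step 1: test gamma = 0
  fit <- if (keep_z) full else glm(Y ~ X, family = binomial)  # Step 2
  c(keep_z = keep_z, beta_hat = unname(coef(fit)["X"]))
}

runs <- t(replicate(200, one_run()))

# Share of data sets in which the adjusted model was selected
mean(runs[, "keep_z"])

# Average reported coefficient of X, by selected model (0 = crude, 1 = adjusted)
tapply(runs[, "beta_hat"], runs[, "keep_z"], mean)
```

The reported coefficient is a data-dependent mixture of an estimate of the adjusted and of the crude parameter, which is the point of the question on the next slide.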

19 How would an applied statistician with a strong mathematical foundation react to this?
What is the parameter of interest: $\beta$ or $\tilde\beta$?

20 Frank E Harrell Jr [3] about stepwise variable selection
1. Stepwise selection yields $R^2$ values that are biased high.
2. The ordinary F and $\chi^2$ test statistics do not have [...]
[...] It yields P-values that are too small [...] and the proper correction for them is a very difficult problem.
5. It provides regression coefficients that are biased high in absolute value and need shrinkage.
[...] It allows us to not think about the problem.
[3] Regression Modeling Strategies (Springer, pages 56-57)

21 Hauck et al. [4]
About the two rival models
$g(Y) = \tilde\alpha + \tilde\beta X$   (1)
$g(Y) = \alpha + \beta X + \gamma Z$   (2)
"What makes model (2) the correct one? Why not model (1)? The issue is not one of bias! Rather, models (1) and (2) are estimating different measures of treatment effect. Thus, there is no reason for them to yield the same estimates."
[4] Controlled Clinical Trials 19: (1998)
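A short derivation, added here as a sketch, may make the quote concrete. It assumes that model (2) holds with the logit link, that X is binary, and that Z is independent of X (for example because X is randomized); writing the crude parameters with a tilde is a notational choice of this sketch.

```latex
\begin{align*}
P(Y = 1 \mid X = x) &= E_Z\{\operatorname{expit}(\alpha + \beta x + \gamma Z)\}, \\
\tilde\beta &= \operatorname{logit} E_Z\{\operatorname{expit}(\alpha + \beta + \gamma Z)\}
             - \operatorname{logit} E_Z\{\operatorname{expit}(\alpha + \gamma Z)\}.
\end{align*}
```

Because expit is non-linear, the expectation cannot be moved inside, so $\tilde\beta \neq \beta$ in general when $\gamma \neq 0$. With the identity link the same steps give $E(Y \mid X = x) = \alpha + \gamma E(Z) + \beta x$, hence $\tilde\beta = \beta$: whether the two models target the same quantity depends on g.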

23 Tradition: X Z
Step 1. Use expert knowledge, a DAG, and/or literature (but not the data, pyha!) to find the known predictors Z and decide whether they are real confounders: $Z \to X$?
Step 2. Depends on the result of Step 1:
  Case $Z \to X$: fit this model: $g(Y) = \alpha + \beta X + \gamma Z$
  Case $Z \not\to X$: fit that model: $g(Y) = \tilde\alpha + \tilde\beta X$

25 How would an applied statistician with a strong mathematical foundation react to this?
Well, this depends on g.

26 Principal criteria
Classical criterion: A covariate is a confounder if it is associated with the exposure and, causally, with the outcome.
Operational criterion: A covariate is a confounder if the estimate of the exposure effect is changed by inclusion of the covariate.

28 Mavericks [5]
A maverick is a covariate that satisfies the operational but not the classical criterion.
[Diagram with nodes Y, X and Z]
[5] Hauck, Neuhaus, Kalbfleisch, Anderson. J Clin Epidemiol Vol. 44, No. 1, pp ,

29 Hauck, Neuhaus, Kalbfleisch, Anderson
Rule 1: The effect of omitting a maverick is to bias the odds ratio [6] towards no effect.
Rule 2: The magnitude of the bias caused by omitting a maverick increases with the variance of the omitted maverick and with the magnitude of the effect of the maverick on the outcome.
...
Rule 5: Tests of the hypothesis of no exposure-outcome association remain valid when a maverick is omitted.
[6] The same holds for the hazard ratio in Cox regression.
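Rule 1 can be checked without simulation by integrating the omitted covariate out of the logistic model. The sketch below uses base R only and rests on assumptions that are not stated on the slides: Z is standard normal, Z is independent of X, and the intercept is 0; the helper names `marginal_risk` and `odds` are hypothetical.

```r
# Exact crude odds ratio of X when a maverick Z is omitted.
# Assumptions of this sketch: Z ~ N(0, 1), Z independent of X, intercept 0.
alpha <- 0
beta  <- log(3)   # conditional log odds ratio of X
gamma <- log(3)   # conditional log odds ratio of Z

# P(Y = 1 | X = x): average the conditional risk over the distribution of Z
marginal_risk <- function(x) {
  integrate(function(z) plogis(alpha + beta * x + gamma * z) * dnorm(z),
            lower = -Inf, upper = Inf)$value
}

odds <- function(p) p / (1 - p)

or_conditional <- exp(beta)
or_crude <- odds(marginal_risk(1)) / odds(marginal_risk(0))

round(c(conditional = or_conditional, crude = or_crude), 2)
```

Increasing `gamma`, or the standard deviation of Z inside `dnorm`, moves the crude odds ratio further towards 1, in line with Rule 2.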

30 Demonstration of the attenuation effect

31
    library(lava)
    m <- lvm()
    # binary outcome Y and binary exposure X, both with logit link
    distribution(m, ~Y+X) <- binomial.lvm("logit")
    # conditional odds ratios: 3 for X and 3 for Z
    regression(m, Y~X+Z) <- c(log(3), log(3))
    d <- sim(m, n=50000)
    model1 <- glm(Y~X, data=d, family=binomial)
    model2 <- glm(Y~X+Z, data=d, family=binomial)

            Model 1                   Model 2
  Variable  OddsRatio  CI.95          OddsRatio  CI.95
  X         2.40       [2.31;2.49]    3.03       [2.91;3.16]
  Z                                   3.04       [2.96;3.12]

32
    library(lava)
    m <- lvm()
    distribution(m, ~Y+X) <- binomial.lvm("logit")
    # conditional odds ratios: 2 for X and 0.5 for Z
    regression(m, Y~X+Z) <- c(log(2), log(.5))
    d <- sim(m, n=50000)
    model1 <- glm(Y~X, data=d, family=binomial)
    model2 <- glm(Y~X+Z, data=d, family=binomial)

            Model 1                   Model 2
  Variable  OddsRatio  CI.95          OddsRatio  CI.95
  X         1.89       [1.82;1.96]    2.04       [1.96;2.12]
  Z                                   0.50       [0.49;0.51]

33 A new recipe

34 (Breaking with flawed) regression modelling traditions
Van der Laan & Rose: "It should not be overlooked that the process of looking at the data, examining coefficient p-values, and trying multiple statistical models is not only incredibly prevalent but is taught to students learning statistics."

35 The superlearner recipe
Step 1. Ask the research question.
Step 2. Define the parameter of interest without specifying the rest of the model.
Step 3. Estimate the parameter; use cross-validation to distinguish alternative estimators of the same parameter.
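Step 3 can be illustrated with a deliberately simplified sketch: two candidate logistic models are compared by their 5-fold cross-validated Brier score. This is not the full super learner (which combines a library of learners with cross-validated weights); the simulated data, the candidate list and the Brier score as loss function are all assumptions of this example.

```r
# Sketch: cross-validation to compare two candidate models (base R only).
# Data-generating values are hypothetical.
set.seed(1)
n <- 2000
Z <- rnorm(n)
X <- rbinom(n, 1, plogis(log(1.5) * Z))
Y <- rbinom(n, 1, plogis(log(3) * X + log(3) * Z))
dat <- data.frame(Y, X, Z)

candidates <- list(crude = Y ~ X, adjusted = Y ~ X + Z)
fold <- sample(rep(1:5, length.out = n))   # random assignment to 5 folds

cv_brier <- sapply(candidates, function(f) {
  mean(sapply(1:5, function(k) {
    fit <- glm(f, data = dat[fold != k, ], family = binomial)   # train
    p <- predict(fit, newdata = dat[fold == k, ], type = "response")
    mean((dat$Y[fold == k] - p)^2)                              # validate
  }))
})

cv_brier   # lower is better
```

The candidate with the lower cross-validated loss would then supply the predicted risks that enter the estimator of the target parameter.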

37 Example
Research question: Is there a difference in the 30-day survival chances of cardiac arrest patients who received bystander CPR compared to cardiac arrest patients who did not receive bystander CPR?
Target parameter: $\theta = P(Y_{30} = 1 \mid X = 1) - P(Y_{30} = 1 \mid X = 0)$
Note: it is not specified if or how Z should be included.

39 Two alternative estimators [7]
Estimator 1 (model $Y_{30} \sim X$):
$\hat\theta_1 = \frac{1}{n_1} \sum_{i: X_i = 1} \operatorname{expit}(\hat{\tilde\alpha} + \hat{\tilde\beta}) - \frac{1}{n_0} \sum_{i: X_i = 0} \operatorname{expit}(\hat{\tilde\alpha})$
Estimator 2 (model $Y_{30} \sim X + Z$):
$\hat\theta_2 = \frac{1}{n_1} \sum_{i: X_i = 1} \operatorname{expit}(\hat\alpha + \hat\beta + \hat\gamma Z_i) - \frac{1}{n_0} \sum_{i: X_i = 0} \operatorname{expit}(\hat\alpha + \hat\gamma Z_i)$
BUT: possibly comparing different distributions of Z
[7] $Z_i$ are demographics, comorbidities and other risk factors.

40 Example (continued)
Research question: Is there a difference in the 30-day survival chances of a cardiac arrest patient who received bystander CPR compared to a cardiac arrest patient who did not receive bystander CPR when both patients have the same demographics, comorbidities and other risk factors?
Target parameter: $\theta_z = P(Y_{30} = 1 \mid X = 1, Z = z) - P(Y_{30} = 1 \mid X = 0, Z = z)$
Note: the result depends on z.

41 Example (continued)
Research question: What is the average causal effect of bystander CPR on the 30-day survival chances in cardiac arrest patients?
Target parameter: $\theta = E_Z\{P(Y_{30} = 1 \mid do(X = 1), Z)\} - E_Z\{P(Y_{30} = 1 \mid do(X = 0), Z)\}$
Note: the result is an average across Z and does not depend on z.

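42 Three alternative (G-formula) estimators [8]
Estimator 1: $\frac{1}{n} \sum_{i=1}^{n} \left( \operatorname{expit}(\hat{\tilde\alpha} + \hat{\tilde\beta}) - \operatorname{expit}(\hat{\tilde\alpha}) \right)$
Estimator 2: $\frac{1}{n} \sum_{i=1}^{n} \left( \operatorname{expit}(\hat\alpha + \hat\beta + \hat\gamma Z_i) - \operatorname{expit}(\hat\alpha + \hat\gamma Z_i) \right)$
Estimator 3: $\frac{1}{n} \sum_{i=1}^{n} \left( \operatorname{expit}(\hat\alpha + \hat\beta + \hat\gamma_1 Z_i + \hat\gamma_2 Z_i^2) - \operatorname{expit}(\hat\alpha + \hat\gamma_1 Z_i + \hat\gamma_2 Z_i^2) \right)$
[8] Note: The estimates rely on different prediction models.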
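The three estimators share one computation: fit a prediction model, predict every subject's risk with X set to 1 and with X set to 0, and average the difference. The base R sketch below illustrates this under assumed simulation settings (Z standard normal, n = 20000); the helper `g_formula` is hypothetical and, unlike the lava function `estimate` used on the following slides, it returns no standard errors.

```r
# Sketch: G-formula risk difference from three logistic prediction models.
# Simulation settings are illustrative assumptions.
set.seed(18)
n <- 20000
Z <- rnorm(n)
X <- rbinom(n, 1, plogis(log(1.5) * Z))             # Z affects exposure
Y <- rbinom(n, 1, plogis(log(3) * X + log(3) * Z))  # X and Z affect outcome
dat <- data.frame(Y, X, Z)

g_formula <- function(formula, data) {
  fit <- glm(formula, data = data, family = binomial)
  # predicted risk for every subject under X = 1 and under X = 0
  risk1 <- predict(fit, newdata = transform(data, X = 1), type = "response")
  risk0 <- predict(fit, newdata = transform(data, X = 0), type = "response")
  mean(risk1 - risk0)
}

riskdiff <- c(
  estimator1 = g_formula(Y ~ X, dat),
  estimator2 = g_formula(Y ~ X + Z, dat),
  estimator3 = g_formula(Y ~ X + Z + I(Z^2), dat)
)
riskdiff
```

In this setup Z raises both the chance of exposure and the risk of the outcome, so Estimator 1, which ignores Z, is expected to come out larger than the two adjusted versions.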

43 Lava's g-formula (model1)
    regression(m, Y~X+Z) <- c(log(3), log(3))
    regression(m, X~Z) <- log(1.5)
    set.seed(18)
    d <- sim(m, n=50000)
    model1 <- glm(Y~X, data=d, family=binomial())
    estimate(model1, function(p, data){
      a <- p["(Intercept)"]
      b <- p["X"]
      R.X1 <- expit(a + b)   # predicted risk with X = 1
      R.X0 <- expit(a)       # predicted risk with X = 0
      list(riskdiff=R.X1-R.X0)},
      average=TRUE)
Output: one row, riskdiff, with columns Estimate, Std.Err, 2.5%, 97.5% and P-value.

44 Lava's g-formula (model2)
    regression(m, Y~X+Z) <- c(log(3), log(3))
    regression(m, X~Z) <- log(1.5)
    set.seed(18)
    d <- sim(m, n=50000)
    model2 <- glm(Y~X+Z, data=d, family=binomial())
    estimate(model2, function(p, data){
      a <- p["(Intercept)"]
      b <- p["X"]
      c <- p["Z"]
      R.X1 <- expit(a + b + c * data[,"Z"])   # risk with X = 1, own Z
      R.X0 <- expit(a + c * data[,"Z"])       # risk with X = 0, own Z
      list(riskdiff=R.X1-R.X0)},
      average=TRUE)
Output: one row, riskdiff, with columns Estimate, Std.Err, 2.5%, 97.5% and P-value.

45 Lava's g-formula (model3)
    regression(m, Y~X+Z) <- c(log(3), log(3))
    regression(m, X~Z) <- log(1.5)
    set.seed(18)
    d <- sim(m, n=50000)
    d$Q <- d$Z^2   # quadratic term of Z
    model3 <- glm(Y~X+Z+Q, data=d, family=binomial())
    estimate(model3, function(p, data){
      a <- p["(Intercept)"]
      b <- p["X"]
      c <- p["Z"]
      d <- p["Q"]
      R.X1 <- expit(a + b + c * data[,"Z"] + d * data[,"Q"])
      R.X0 <- expit(a + c * data[,"Z"] + d * data[,"Q"])
      list(riskdiff=R.X1-R.X0)},
      average=TRUE)
Output: one row, riskdiff, with columns Estimate, Std.Err, 2.5%, 97.5% and P-value.

46 Conclusions
The current tradition
  There is hardly any mathematical theory to back up some of the very prevalent strategies (traditions).
  We should STOP teaching backward elimination.
Attenuation effect
  The scale on which we measure effects matters.
  Randomization does not prevent bias in logistic regression and Cox regression on the odds ratio and hazard ratio scale.
  But the direction of the bias is known, and the variance is reduced when all mavericks are omitted.
Lava
  Simulate data that resemble the real data.
  Study the performance (bias and variance) of a modelling strategy under controlled conditions.

47 Discussion of the new recipe
The subject matter perspective matters when we define what to estimate (average effect versus prediction).
Put in several models/methods, including machine learning and expert's opinion, and let the data decide.
The superlearner recipe has a built-in validation mechanism.
Many possible confounders
  The more covariates $Z_1, \dots, Z_p$, the more possibilities to model.
Many exposure variables (AKA: What are the predictors?)
  You should know about the Table 2 fallacy [9] and possibly treat each of $X_1, \dots, X_p$ separately using the new recipe, or make a prediction model.
[9] Westreich & Greenland. Am J Epidemiol. 2013;177(4):

48 Antiparsimony principle [10]
Models should be rich enough to reflect the complexity of the relations under study.
Countervailing principle: you cannot estimate anything if you try to estimate everything.
[10] Greenland (2000), citing L. J. Savage


More information

Specification Errors, Measurement Errors, Confounding

Specification Errors, Measurement Errors, Confounding Specification Errors, Measurement Errors, Confounding Kerby Shedden Department of Statistics, University of Michigan October 10, 2018 1 / 32 An unobserved covariate Suppose we have a data generating model

More information

Valida&on of Predic&ve Classifiers

Valida&on of Predic&ve Classifiers Valida&on of Predic&ve Classifiers 1! Predic&ve Biomarker Classifiers In most posi&ve clinical trials, only a small propor&on of the eligible popula&on benefits from the new rx Many chronic diseases are

More information

TMA 4275 Lifetime Analysis June 2004 Solution

TMA 4275 Lifetime Analysis June 2004 Solution TMA 4275 Lifetime Analysis June 2004 Solution Problem 1 a) Observation of the outcome is censored, if the time of the outcome is not known exactly and only the last time when it was observed being intact,

More information

Comparison of Two Samples

Comparison of Two Samples 2 Comparison of Two Samples 2.1 Introduction Problems of comparing two samples arise frequently in medicine, sociology, agriculture, engineering, and marketing. The data may have been generated by observation

More information

High-dimensional regression

High-dimensional regression High-dimensional regression Advanced Methods for Data Analysis 36-402/36-608) Spring 2014 1 Back to linear regression 1.1 Shortcomings Suppose that we are given outcome measurements y 1,... y n R, and

More information

Basic Medical Statistics Course

Basic Medical Statistics Course Basic Medical Statistics Course S7 Logistic Regression November 2015 Wilma Heemsbergen w.heemsbergen@nki.nl Logistic Regression The concept of a relationship between the distribution of a dependent variable

More information

Stat/F&W Ecol/Hort 572 Review Points Ané, Spring 2010

Stat/F&W Ecol/Hort 572 Review Points Ané, Spring 2010 1 Linear models Y = Xβ + ɛ with ɛ N (0, σ 2 e) or Y N (Xβ, σ 2 e) where the model matrix X contains the information on predictors and β includes all coefficients (intercept, slope(s) etc.). 1. Number of

More information

Lecture 7 Time-dependent Covariates in Cox Regression

Lecture 7 Time-dependent Covariates in Cox Regression Lecture 7 Time-dependent Covariates in Cox Regression So far, we ve been considering the following Cox PH model: λ(t Z) = λ 0 (t) exp(β Z) = λ 0 (t) exp( β j Z j ) where β j is the parameter for the the

More information

Logistic Regression. James H. Steiger. Department of Psychology and Human Development Vanderbilt University

Logistic Regression. James H. Steiger. Department of Psychology and Human Development Vanderbilt University Logistic Regression James H. Steiger Department of Psychology and Human Development Vanderbilt University James H. Steiger (Vanderbilt University) Logistic Regression 1 / 38 Logistic Regression 1 Introduction

More information

Business Statistics 41000: Homework # 5

Business Statistics 41000: Homework # 5 Business Statistics 41000: Homework # 5 Drew Creal Due date: Beginning of class in week # 10 Remarks: These questions cover Lectures #7, 8, and 9. Question # 1. Condence intervals and plug-in predictive

More information

Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model

Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model Centre for Molecular, Environmental, Genetic & Analytic (MEGA) Epidemiology School of Population

More information

Harvard University. A Note on the Control Function Approach with an Instrumental Variable and a Binary Outcome. Eric Tchetgen Tchetgen

Harvard University. A Note on the Control Function Approach with an Instrumental Variable and a Binary Outcome. Eric Tchetgen Tchetgen Harvard University Harvard University Biostatistics Working Paper Series Year 2014 Paper 175 A Note on the Control Function Approach with an Instrumental Variable and a Binary Outcome Eric Tchetgen Tchetgen

More information

Lecture 25. Ingo Ruczinski. November 24, Department of Biostatistics Johns Hopkins Bloomberg School of Public Health Johns Hopkins University

Lecture 25. Ingo Ruczinski. November 24, Department of Biostatistics Johns Hopkins Bloomberg School of Public Health Johns Hopkins University Lecture 25 Department of Biostatistics Johns Hopkins Bloomberg School of Public Health Johns Hopkins University November 24, 2015 1 2 3 4 5 6 7 8 9 10 11 1 Hypothesis s of homgeneity 2 Estimating risk

More information

Generalized Linear Modeling - Logistic Regression

Generalized Linear Modeling - Logistic Regression 1 Generalized Linear Modeling - Logistic Regression Binary outcomes The logit and inverse logit interpreting coefficients and odds ratios Maximum likelihood estimation Problem of separation Evaluating

More information

Solutions for Examination Categorical Data Analysis, March 21, 2013

Solutions for Examination Categorical Data Analysis, March 21, 2013 STOCKHOLMS UNIVERSITET MATEMATISKA INSTITUTIONEN Avd. Matematisk statistik, Frank Miller MT 5006 LÖSNINGAR 21 mars 2013 Solutions for Examination Categorical Data Analysis, March 21, 2013 Problem 1 a.

More information

TECHNICAL REPORT # 59 MAY Interim sample size recalculation for linear and logistic regression models: a comprehensive Monte-Carlo study

TECHNICAL REPORT # 59 MAY Interim sample size recalculation for linear and logistic regression models: a comprehensive Monte-Carlo study TECHNICAL REPORT # 59 MAY 2013 Interim sample size recalculation for linear and logistic regression models: a comprehensive Monte-Carlo study Sergey Tarima, Peng He, Tao Wang, Aniko Szabo Division of Biostatistics,

More information

Asymptotic equivalence of paired Hotelling test and conditional logistic regression

Asymptotic equivalence of paired Hotelling test and conditional logistic regression Asymptotic equivalence of paired Hotelling test and conditional logistic regression Félix Balazard 1,2 arxiv:1610.06774v1 [math.st] 21 Oct 2016 Abstract 1 Sorbonne Universités, UPMC Univ Paris 06, CNRS

More information

Beyond GLM and likelihood

Beyond GLM and likelihood Stat 6620: Applied Linear Models Department of Statistics Western Michigan University Statistics curriculum Core knowledge (modeling and estimation) Math stat 1 (probability, distributions, convergence

More information

Assignment 2: K-Nearest Neighbors and Logistic Regression

Assignment 2: K-Nearest Neighbors and Logistic Regression Assignment 2: K-Nearest Neighbors and Logistic Regression SDS293 - Machine Learning Due: 4 Oct 2017 by 11:59pm Conceptual Exercises 4.4 parts a-d (p. 168-169 ISLR) When the number of features p is large,

More information

INTERVAL ESTIMATION AND HYPOTHESES TESTING

INTERVAL ESTIMATION AND HYPOTHESES TESTING INTERVAL ESTIMATION AND HYPOTHESES TESTING 1. IDEA An interval rather than a point estimate is often of interest. Confidence intervals are thus important in empirical work. To construct interval estimates,

More information

Chapter 11. Correlation and Regression

Chapter 11. Correlation and Regression Chapter 11. Correlation and Regression The word correlation is used in everyday life to denote some form of association. We might say that we have noticed a correlation between foggy days and attacks of

More information

Advanced Quantitative Research Methodology, Lecture Notes: Research Designs for Causal Inference 1

Advanced Quantitative Research Methodology, Lecture Notes: Research Designs for Causal Inference 1 Advanced Quantitative Research Methodology, Lecture Notes: Research Designs for Causal Inference 1 Gary King GaryKing.org April 13, 2014 1 c Copyright 2014 Gary King, All Rights Reserved. Gary King ()

More information

STA6938-Logistic Regression Model

STA6938-Logistic Regression Model Dr. Ying Zhang STA6938-Logistic Regression Model Topic 2-Multiple Logistic Regression Model Outlines:. Model Fitting 2. Statistical Inference for Multiple Logistic Regression Model 3. Interpretation of

More information

Machine Learning. Module 3-4: Regression and Survival Analysis Day 2, Asst. Prof. Dr. Santitham Prom-on

Machine Learning. Module 3-4: Regression and Survival Analysis Day 2, Asst. Prof. Dr. Santitham Prom-on Machine Learning Module 3-4: Regression and Survival Analysis Day 2, 9.00 16.00 Asst. Prof. Dr. Santitham Prom-on Department of Computer Engineering, Faculty of Engineering King Mongkut s University of

More information

Propensity Score Methods for Causal Inference

Propensity Score Methods for Causal Inference John Pura BIOS790 October 2, 2015 Causal inference Philosophical problem, statistical solution Important in various disciplines (e.g. Koch s postulates, Bradford Hill criteria, Granger causality) Good

More information

Lecture 10: Introduction to Logistic Regression

Lecture 10: Introduction to Logistic Regression Lecture 10: Introduction to Logistic Regression Ani Manichaikul amanicha@jhsph.edu 2 May 2007 Logistic Regression Regression for a response variable that follows a binomial distribution Recall the binomial

More information

Subgroup analysis using regression modeling multiple regression. Aeilko H Zwinderman

Subgroup analysis using regression modeling multiple regression. Aeilko H Zwinderman Subgroup analysis using regression modeling multiple regression Aeilko H Zwinderman who has unusual large response? Is such occurrence associated with subgroups of patients? such question is hypothesis-generating:

More information

Bias Variance Trade-off

Bias Variance Trade-off Bias Variance Trade-off The mean squared error of an estimator MSE(ˆθ) = E([ˆθ θ] 2 ) Can be re-expressed MSE(ˆθ) = Var(ˆθ) + (B(ˆθ) 2 ) MSE = VAR + BIAS 2 Proof MSE(ˆθ) = E((ˆθ θ) 2 ) = E(([ˆθ E(ˆθ)]

More information

STAT331. Cox s Proportional Hazards Model

STAT331. Cox s Proportional Hazards Model STAT331 Cox s Proportional Hazards Model In this unit we introduce Cox s proportional hazards (Cox s PH) model, give a heuristic development of the partial likelihood function, and discuss adaptations

More information

TABLE OF CONTENTS CHAPTER 1 COMBINATORIAL PROBABILITY 1

TABLE OF CONTENTS CHAPTER 1 COMBINATORIAL PROBABILITY 1 TABLE OF CONTENTS CHAPTER 1 COMBINATORIAL PROBABILITY 1 1.1 The Probability Model...1 1.2 Finite Discrete Models with Equally Likely Outcomes...5 1.2.1 Tree Diagrams...6 1.2.2 The Multiplication Principle...8

More information

Lecture 7: Interaction Analysis. Summer Institute in Statistical Genetics 2017

Lecture 7: Interaction Analysis. Summer Institute in Statistical Genetics 2017 Lecture 7: Interaction Analysis Timothy Thornton and Michael Wu Summer Institute in Statistical Genetics 2017 1 / 39 Lecture Outline Beyond main SNP effects Introduction to Concept of Statistical Interaction

More information

CAUSAL INFERENCE IN THE EMPIRICAL SCIENCES. Judea Pearl University of California Los Angeles (www.cs.ucla.edu/~judea)

CAUSAL INFERENCE IN THE EMPIRICAL SCIENCES. Judea Pearl University of California Los Angeles (www.cs.ucla.edu/~judea) CAUSAL INFERENCE IN THE EMPIRICAL SCIENCES Judea Pearl University of California Los Angeles (www.cs.ucla.edu/~judea) OUTLINE Inference: Statistical vs. Causal distinctions and mental barriers Formal semantics

More information

Answers to Problem Set #4

Answers to Problem Set #4 Answers to Problem Set #4 Problems. Suppose that, from a sample of 63 observations, the least squares estimates and the corresponding estimated variance covariance matrix are given by: bβ bβ 2 bβ 3 = 2

More information

LOGISTIC REGRESSION Joseph M. Hilbe

LOGISTIC REGRESSION Joseph M. Hilbe LOGISTIC REGRESSION Joseph M. Hilbe Arizona State University Logistic regression is the most common method used to model binary response data. When the response is binary, it typically takes the form of

More information

High-dimensional regression modeling

High-dimensional regression modeling High-dimensional regression modeling David Causeur Department of Statistics and Computer Science Agrocampus Ouest IRMAR CNRS UMR 6625 http://www.agrocampus-ouest.fr/math/causeur/ Course objectives Making

More information

Sigmaplot di Systat Software

Sigmaplot di Systat Software Sigmaplot di Systat Software SigmaPlot Has Extensive Statistical Analysis Features SigmaPlot is now bundled with SigmaStat as an easy-to-use package for complete graphing and data analysis. The statistical

More information

Statistics Boot Camp. Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018

Statistics Boot Camp. Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018 Statistics Boot Camp Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018 March 21, 2018 Outline of boot camp Summarizing and simplifying data Point and interval estimation Foundations of statistical

More information

Lecture 12: Effect modification, and confounding in logistic regression

Lecture 12: Effect modification, and confounding in logistic regression Lecture 12: Effect modification, and confounding in logistic regression Ani Manichaikul amanicha@jhsph.edu 4 May 2007 Today Categorical predictor create dummy variables just like for linear regression

More information

Advanced Statistics I : Gaussian Linear Model (and beyond)

Advanced Statistics I : Gaussian Linear Model (and beyond) Advanced Statistics I : Gaussian Linear Model (and beyond) Aurélien Garivier CNRS / Telecom ParisTech Centrale Outline One and Two-Sample Statistics Linear Gaussian Model Model Reduction and model Selection

More information

Dimension Reduction Methods

Dimension Reduction Methods Dimension Reduction Methods And Bayesian Machine Learning Marek Petrik 2/28 Previously in Machine Learning How to choose the right features if we have (too) many options Methods: 1. Subset selection 2.

More information

Clinical Trials. Olli Saarela. September 18, Dalla Lana School of Public Health University of Toronto.

Clinical Trials. Olli Saarela. September 18, Dalla Lana School of Public Health University of Toronto. Introduction to Dalla Lana School of Public Health University of Toronto olli.saarela@utoronto.ca September 18, 2014 38-1 : a review 38-2 Evidence Ideal: to advance the knowledge-base of clinical medicine,

More information