Ph.D. course: Regression models. Introduction. 19 April 2012

Ph.D. course: Regression models
Introduction
PKA & LTS Sect. 1.1, 1.2, 1.4
19 April 2012
www.biostat.ku.dk/~pka/regrmodels12
Per Kragh Andersen

Regression models

The distribution of one outcome variable is modelled in relation to one or, more often, several explanatory variables (or covariates).

Well-known (perhaps?) examples of regression models include:
- linear regression
- logistic regression
- (Cox) proportional hazards regression

The type of regression model to use in a given situation depends on the type of the outcome variable:
- linear regression: quantitative outcome
- logistic regression: binary outcome
- (Cox) proportional hazards regression: survival time outcome

Explanatory variables

All types of regression models may include two types of explanatory variables:
- categorical explanatory variables
- quantitative explanatory variables

This means that many features are common for linear, logistic, and Cox regression.

The book "Regression with linear predictors" highlights such similarities by focussing on the type of explanatory variables and the way in which these are combined into the linear predictor (examples to follow).

Example 1.1: Body mass index and vitamin D status

- European study in Ireland, Poland, Finland, and Denmark.
- Data on vitamin D status (25OHD in serum, nmol/l).
- Data on (among other factors) age and body mass index: $\text{BMI} = \text{weight in kg} / (\text{height in m})^2$.
- Purpose: assess whether vitamin D status depends on BMI and age and how it varies among countries.
- Outcome variable, vit D: quantitative.
- Quantitative explanatory variables: age and BMI.
- Categorical explanatory variables: country and categorized BMI, e.g. Normal (<25) vs. Overweight (≥25) (a binary explanatory variable). Overweight women could be further divided into Slight overweight and Obese (≥30).

Table 1: Average 25OHD vitamin D values for 41 adult Irish women in subgroups given by body mass index.

BMI group               n    Vitamin D (nmol/l)
Normal                 16    56.138
Overweight             25    42.804
  Slight overweight    16    45.831
  Obese                 9    37.422

Vit D seems to decrease with increasing BMI.
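As a consistency check on Table 1, the Overweight average should be the size-weighted average of its two subgroups. The sketch below uses only the numbers printed in the table; the function name `pooled_mean` is introduced here for illustration.

```python
# Subgroup sizes and mean 25OHD values copied from Table 1.
subgroups = {
    "Slight overweight": (16, 45.831),
    "Obese": (9, 37.422),
}


def pooled_mean(groups):
    """Size-weighted average of subgroup means; groups is a list of (n, mean)."""
    groups = list(groups)
    total_n = sum(n for n, _ in groups)
    return sum(n * mean for n, mean in groups) / total_n


overweight_mean = pooled_mean(subgroups.values())
# Table 1 reports 42.804 for the Overweight group (n = 25).
print(round(overweight_mean, 3))
```

The same weighting applied to Normal and Overweight would give the overall average for all 41 women, which the table does not report.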

Example 1.2: Fever in early pregnancy and risk of fetal death

- The Danish National Birth Cohort Study recruited pregnant women 1997-2002 for telephone interviews scheduled to take place in weeks 12-16.
- Here: data on women recruited before 31 March 1999, interviewed before week 17, and who were still pregnant at week 17.
- Aim: study the relation between risk of fetal death and episodes of fever in early pregnancy.
- Outcome variable, fetal death: binary.
- Both categorical and quantitative explanatory variables are relevant.

Example 1.4 on surgery complications also has a binary outcome.

Table 2: Distribution of fetal death by number of fever episodes before pregnancy week 17 in 11,778 women recruited to the Danish National Birth Cohort Study.

                 Number of fever episodes
Fetal death      0       1      2     3+     Total
No            9595    1852    182     30     11659
Yes             98      20      1      0       119
Total         9693    1872    183     30     11778

98/9693 = 1.0% of women without fever episodes experienced fetal death, roughly the same percentage as for women with fever episodes: (20+1+0)/(1872+183+30) = 1.0%.
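The two percentages quoted under Table 2 can be reproduced directly from the counts. This is a small sketch using only the table entries; the variable names are illustrative.

```python
# Fever episodes -> (fetal deaths, number of women), copied from Table 2.
counts = {
    "0": (98, 9693),
    "1": (20, 1872),
    "2": (1, 183),
    "3+": (0, 30),
}

# Women without fever episodes.
deaths_no_fever, women_no_fever = counts["0"]
risk_no_fever = deaths_no_fever / women_no_fever

# Women with one or more fever episodes, pooled over the three columns.
fever_deaths = sum(d for k, (d, _) in counts.items() if k != "0")
fever_women = sum(w for k, (_, w) in counts.items() if k != "0")
risk_fever = fever_deaths / fever_women

print(f"no fever: {risk_no_fever:.1%}, fever: {risk_fever:.1%}")
```

Both relative frequencies round to 1.0%, as stated on the slide; they estimate the failure probability in each group.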

Note: Confounding

In both examples there may be confounding, i.e. simple comparisons between
- Normal weight and Overweight women
- women with or without reported episodes of fever in early pregnancy
may not be fair because other factors associated with the outcome may be unevenly distributed in the groups to be compared.

This calls for suitable adjustment when the groups are to be compared.

Example 1.3: The PBC-3 trial in liver cirrhosis

- PBC-3: multi-centre randomized trial in patients with primary biliary cirrhosis.
- Patients were recruited 1983-1987 from six European hospitals and randomized to CyA or placebo.
- Followed until death or liver transplantation (no longer than 1989); 4 patients were lost to follow-up before that date.
- Outcome variable: time to treatment failure (death or transplantation).
- Main explanatory variable, treatment, is binary.
- Other risk factors (serum bilirubin, age, gender etc.) may, in spite of the randomization, not be quite balanced between the two treatment groups.
- Both categorical and quantitative explanatory variables are relevant.

What about the outcome variable?

Table 3: Average observation times in years (and numbers of patients) by treatment group and failure status in the PBC3 trial in liver cirrhosis.

               Treatment failure
Treatment      No        Yes       Total
Placebo        2.86      1.80      2.58
               (127)     (46)      (173)
CyA            2.77      2.02      2.58
               (132)     (44)      (176)
Total          2.81      1.91      2.58
               (259)     (90)      (349)

Table 4: Number (%) of observation times less than two years by treatment group and failure status in the PBC3 trial in liver cirrhosis.

               Treatment failure      All observation
Treatment      No        Yes          times               Patients
Placebo        40        27           67                  173
               (23%)     (16%)        (39%)               (100%)
CyA            41        24           65                  176
               (23%)     (14%)        (37%)               (100%)
Total          81        51           132                 349
               (23%)     (15%)        (38%)               (100%)

Because of the incomplete information on the outcome variable (censoring), neither averages nor percentages are reasonable descriptions of the distribution.

Instead, survival ("Kaplan-Meier") curves are used for estimating survival probabilities as a function of time, t.

Figure 1: Comparison of estimated survival curves for CyA (dashed) and placebo (solid) treated patients with PBC. (Vertical axis: survival probability, 0 to 1; horizontal axis: years, 0 to 6.)
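To make the idea of a Kaplan-Meier curve concrete, here is a minimal sketch of the product-limit calculation in plain Python. The six observation times and censoring indicators are invented for illustration and are not the PBC-3 data; the function name `kaplan_meier` is likewise an assumption of this sketch.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of S(t).

    times:  observation times
    events: 1 if failure was observed at that time, 0 if censored
    Returns a list of (t, S(t)) at each distinct failure time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        failures = 0
        leaving = 0
        # Collect everyone whose observation time equals t.
        while i < len(data) and data[i][0] == t:
            failures += data[i][1]
            leaving += 1
            i += 1
        if failures > 0:
            # Multiply by the conditional probability of surviving past t.
            surv *= 1 - failures / at_risk
            curve.append((t, surv))
        # Failures and censored subjects both leave the risk set after t.
        at_risk -= leaving
    return curve


# Hypothetical data: times in years, 0 marks a censored observation.
times = [1, 2, 2, 3, 4, 5]
events = [1, 0, 1, 1, 0, 1]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(t, round(s, 3))
```

Censored subjects contribute to the number at risk up to their censoring time but never count as failures, which is why the estimate differs from a simple percentage of failures.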

Describing the relation between outcome and one covariate

Notation: outcome variable: $y$ or $(y_i,\ i = 1, \dots, n)$. Covariate: $x$ or $(x_i,\ i = 1, \dots, n)$.

- For a quantitative outcome (vitamin D), averages were used for estimating the expected value, $m = E(y)$.
- For a binary outcome (fetal death, $y = 0$ or $1$), relative frequencies were used for estimating the failure probability, $p = \mathrm{pr}(y = 1)$.
- For a survival time outcome (time to treatment failure), the Kaplan-Meier curve was used for estimating the survival function, $S(t) = \mathrm{pr}(y > t)$, as a function of time $t$.

One binary covariate

$m_0$: mean vit D for Normal weight women ($x_i = 0$),
$m_1$: mean vit D for Overweight women ($x_i = 1$).

$$E(y_i) = \begin{cases} m_0 & \text{if } x_i = 0 \\ m_1 & \text{if } x_i = 1 \end{cases} \tag{1}$$

That is,

$$E(y_i) = m_0 + (m_1 - m_0)\,x_i = a + b\,x_i$$

$a = m_0$: intercept; $b = m_1 - m_0$: slope (the effect of $x$ on $y$).

Interpretation of $a$ and $b$. Effect?
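As a numerical illustration of the intercept and slope for a binary covariate, the sketch below plugs in the two group averages from Table 1 as estimates of the two means. Treating the sample averages as the parameter values is an assumption made here for illustration only.

```python
# Group averages from Table 1, used as estimates of the two means.
m0 = 56.138  # Normal weight women (x = 0)
m1 = 42.804  # Overweight women (x = 1)

a = m0       # intercept: mean in the reference group
b = m1 - m0  # slope: difference in means between the two groups


def expected(x):
    """Expected vitamin D level for covariate value x (0 or 1)."""
    return a + b * x


print(round(a, 3), round(b, 3))
print(round(expected(0), 3), round(expected(1), 3))
```

Here b is negative: the estimated mean vit D is about 13.3 nmol/l lower for Overweight than for Normal weight women.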

Binary outcome

$p_0$: pr(fetal death) for women without fever episodes ($x_i = 0$),
$p_1$: pr(fetal death) for women with fever episodes ($x_i = 1$).

$$\mathrm{pr}(y_i = 1) = \begin{cases} p_0 & \text{if } x_i = 0 \\ p_1 & \text{if } x_i = 1 \end{cases} \tag{2}$$

We could follow the lines from above and define some measure of discrepancy between $p_1$ and $p_0$. However, that will wait for link functions to be introduced later.

Survival time outcome

$S_0(t)$: survival function for patients treated with placebo ($x_i = 0$),
$S_1(t)$: survival function for patients treated with CyA ($x_i = 1$).

$$\mathrm{pr}(y_i > t) = \begin{cases} S_0(t) & \text{if } x_i = 0 \\ S_1(t) & \text{if } x_i = 1 \end{cases} \tag{3}$$

Again, we could follow the lines from above and define some measure of discrepancy between $S_1(t)$ and $S_0(t)$, and again that will wait for link functions to be introduced.

One categorical covariate

For a covariate with $k + 1$ values, $g_0, g_1, \dots, g_k$:

$$E(y_i) = \begin{cases} m_0 & \text{if } x_i = g_0 \\ m_1 & \text{if } x_i = g_1 \\ \ \vdots & \\ m_k & \text{if } x_i = g_k \end{cases} \tag{4}$$

Example: mean vitamin D status in 3 BMI categories.

One categorical covariate: dummy variables

Introducing $k$ indicator or dummy variables $I(x_i = g_j)$, $j = 1, \dots, k$, where $I(x_i = g_j) = 1$ if $x_i = g_j$ and $I(x_i = g_j) = 0$ otherwise:

$$E(y_i) = m_0 + (m_1 - m_0)\,I(x_i = g_1) + (m_2 - m_0)\,I(x_i = g_2) + \dots + (m_k - m_0)\,I(x_i = g_k)$$

$$E(y_i) = a + b_1 I(x_i = g_1) + b_2 I(x_i = g_2) + \dots + b_k I(x_i = g_k), \tag{5}$$

where $a = m_0$ and $b_j = m_j - m_0$, $j = 1, \dots, k$. (Details later.)
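The dummy-variable coding in (5) can be sketched for the three BMI categories of Table 1, with Normal as the reference category. The helper names `dummies` and `expected` are introduced for this illustration, and the table averages are again used in place of the unknown means.

```python
# Categories g_0, g_1, g_2; the first one is the reference category.
levels = ["Normal", "Slight overweight", "Obese"]

# Group averages from Table 1, used as estimates of m_0, m_1, m_2.
means = {"Normal": 56.138, "Slight overweight": 45.831, "Obese": 37.422}


def dummies(x):
    """Indicators I(x = g_j) for j = 1, ..., k (reference category omitted)."""
    return [1 if x == g else 0 for g in levels[1:]]


a = means[levels[0]]                    # a = m_0
b = [means[g] - a for g in levels[1:]]  # b_j = m_j - m_0


def expected(x):
    """Equation (5): intercept plus the coefficient of the active dummy."""
    return a + sum(bj * dj for bj, dj in zip(b, dummies(x)))


for g in levels:
    print(g, dummies(g), round(expected(g), 3))
```

With k + 1 = 3 categories only k = 2 dummies are needed; the reference category corresponds to all dummies being zero.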

One quantitative covariate

The BMI groups Normal weight, Slight overweight, and Obese are ordered. Monotonic relationship? (See Figure 2.)

The straight line in the scatterplot (Figure 3) corresponds to the simple linear regression model:

Covariate: $x_i$ = BMI for woman $i$.

$$E(y_i) = a + b\,x_i$$

- Slope $b$: difference in mean response between women differing 1 unit in $x$.
- Intercept $a$: the expected vit D level for women with BMI = 0.

The model is often re-parametrized into, e.g.,

$$E(y_i) = a + b\,(x_i - 25),$$

where $a$ is now the expected vit D level for women with BMI = 25.
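The effect of re-parametrizing around BMI = 25 can be checked numerically: the slope is unchanged and only the intercept moves. The sketch below fits a least-squares line to six invented (BMI, vitamin D) pairs; the data and the function name `fit_line` are assumptions for illustration, not values from the study.

```python
def fit_line(x, y):
    """Ordinary least squares for E(y) = a + b*x; returns (a, b)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


# Hypothetical data, for illustration only.
bmi = [20.0, 22.5, 25.0, 27.5, 30.0, 35.0]
vitd = [62.0, 55.0, 51.0, 47.0, 40.0, 33.0]

a_raw, b_raw = fit_line(bmi, vitd)                    # E(y) = a + b*x
a_ctr, b_ctr = fit_line([x - 25 for x in bmi], vitd)  # E(y) = a + b*(x - 25)

print(round(b_raw, 4), round(b_ctr, 4))               # same slope
print(round(a_raw + 25 * b_raw, 4), round(a_ctr, 4))  # same fitted value at BMI = 25
```

The centred intercept equals the uncentred fitted value at BMI = 25, which is a more interpretable quantity than the extrapolated level at BMI = 0.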

Figure 2: Average 25OHD-values plotted against the BMI scores 21.75, 27.5, and 32.5. (Vertical axis: vitamin D; horizontal axis: BMI score.)

Figure 3: Scatterplot: values of the quantitative outcome y (25OHD) plotted against the quantitative covariate x (BMI). (Vertical axis: vitamin D; horizontal axis: BMI.)

Several covariates

For a single categorical (binary) or quantitative explanatory variable, building blocks
- $b_1 I(x_i = g_1) + b_2 I(x_i = g_2) + \dots + b_k I(x_i = g_k)$ (for a binary $x$: $b\,I(x_i = g_1)$)
- $b\,x_i$
were added to the intercept $a$.

Multiple regression models are obtained by adding such building blocks for the different covariates to obtain the linear predictor.

Vitamin D example, women from Ireland or Poland

Linear predictor: $x_{i,1}$ = BMI for woman $i$, $x_{i,2} = I(\text{woman } i \text{ is from Ireland})$.

$$E(y_i) = a + b_1 I(x_{i,1} \geq 25) + b_2\,x_{i,2}$$

or

$$E(y_i) = a + b_1\,x_{i,1} + b_2\,x_{i,2}.$$

First model leads to expected values:

Table 5: Expected values in four groups according to BMI and country.

              Normal weight     Overweight
Poland        $a$               $a + b_1$
Ireland       $a + b_2$         $a + b_1 + b_2$

Effects of BMI for women from Ireland or Poland are the same: $(a + b_1 + b_2) - (a + b_2) = b_1$ and $(a + b_1) - a = b_1$.

Effects of country for Overweight and Normal weight women are the same: $(a + b_1 + b_2) - (a + b_1) = b_2$ and $(a + b_2) - a = b_2$.

No interaction between country and BMI.
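The "no interaction" property of Table 5 can be verified numerically for any choice of coefficients. The values of a, b1, and b2 below are invented for illustration and are not estimates from the study.

```python
# Hypothetical coefficient values, for illustration only.
a, b1, b2 = 40.0, -8.0, 10.0


def expected(overweight, ireland):
    """First model: a + b1*I(BMI >= 25) + b2*I(Ireland); arguments are 0 or 1."""
    return a + b1 * overweight + b2 * ireland


# Effect of BMI (Overweight minus Normal weight) within each country.
bmi_effect_poland = expected(1, 0) - expected(0, 0)
bmi_effect_ireland = expected(1, 1) - expected(0, 1)

# Effect of country (Ireland minus Poland) within each BMI group.
country_effect_normal = expected(0, 1) - expected(0, 0)
country_effect_overweight = expected(1, 1) - expected(1, 0)

print(bmi_effect_poland, bmi_effect_ireland)
print(country_effect_normal, country_effect_overweight)
```

Both BMI effects equal b1 and both country effects equal b2, whatever numbers are chosen; an interaction would require an extra term that lets these differences vary between groups.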

Second model leads to parallel lines (vit D vs. BMI) for women from Ireland or Poland (Figure 4).

- $b_1$, the common slope, is the common effect of BMI for both countries.
- $b_2$, the (constant) distance between the two lines, is the effect of country for any given value of BMI.

Again: No interaction between country and BMI.

Figure 4: Expected values from the second model: two parallel lines with slope $b_1$ and vertical distance $b_2$. Dashed curve is for Ireland, solid for Poland. (Vertical axis: vitamin D; horizontal axis: BMI.)

Summary

Multiple regression models are obtained by adding building blocks for the different covariates to obtain the linear predictor.

For multiple covariates $(x_{i,1}, x_{i,2}, \dots, x_{i,nc};\ i = 1, \dots, n)$ the linear predictor is

$$LP_i = a + b_1 x_{i,1} + b_2 x_{i,2} + \dots + b_{nc}\,x_{i,nc}.$$

- For quantitative covariates an assumption of linearity is imposed.
- A consequence of adding the terms corresponding to each building block is assuming no interaction.

These modelling assumptions need careful consideration.
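The linear predictor itself is a one-line calculation once the coefficients are known. The sketch below is a generic helper (the name `linear_predictor` and all numerical values are assumptions for illustration), applied to one woman with a quantitative BMI and a binary country indicator.

```python
def linear_predictor(a, b, x):
    """LP = a + b_1*x_1 + ... + b_nc*x_nc for one subject.

    a: intercept
    b: list of coefficients, one per covariate
    x: list of covariate values for the subject, in the same order
    """
    if len(b) != len(x):
        raise ValueError("need exactly one coefficient per covariate")
    return a + sum(bj * xj for bj, xj in zip(b, x))


# Hypothetical coefficients: intercept, slope for BMI, effect of Ireland.
a = 90.0
b = [-1.5, 10.0]

# One woman with BMI 28 who is from Ireland (indicator = 1).
lp = linear_predictor(a, b, [28.0, 1])
print(lp)
```

Categorical covariates enter through their dummy variables, so the same function covers both types of building block.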