Ph.D. course: Regression models
Introduction
PKA & LTS Sect. 1.1, 1.2, 1.4
19 April 2012
www.biostat.ku.dk/~pka/regrmodels12
Per Kragh Andersen

Regression models
The distribution of one outcome variable is modelled in relation to one or, more often, several explanatory variables (or covariates).
Well-known (perhaps?) examples of regression models include:
- linear regression
- logistic regression
- (Cox) proportional hazards regression
The type of regression model to use in a given situation depends on the type of the outcome variable:
- linear regression: quantitative outcome
- logistic regression: binary outcome
- (Cox) proportional hazards regression: survival time outcome

Explanatory variables
All types of regression models may include two types of explanatory variables:
- categorical explanatory variables
- quantitative explanatory variables
This means that many features are common to linear, logistic, and Cox regression.
The book Regression with linear predictors highlights such similarities by focussing on the type of explanatory variables and the way in which these are combined into the linear predictor (examples to follow).

Example 1.1: Body mass index and vitamin D status
- European study in Ireland, Poland, Finland, and Denmark.
- Data on vitamin D status (25OHD in serum, nmol/l).
- Data on (among other factors) age and body mass index: $\mathrm{BMI} = \text{weight in kg} / (\text{height in m})^2$.
Purpose: assess whether vitamin D status depends on BMI and age, and how it varies among countries.
Outcome variable, vit D: quantitative.
Quantitative explanatory variables: age and BMI.
Categorical explanatory variables: country and categorized BMI, e.g. Normal (<25) vs. Overweight (≥25) (a binary explanatory variable). Overweight women could be further divided into Slight overweight and Obese (≥30).

Table 1: Average 25OHD vitamin D values for 41 adult Irish women in subgroups given by body mass index.

BMI group               n    Vitamin D (nmol/l)
Normal                 16    56.138
Overweight             25    42.804
  Slight overweight    16    45.831
  Obese                 9    37.422

Vit D seems to decrease with increasing BMI.

Example 1.2: Fever in early pregnancy and risk of fetal death
The Danish National Birth Cohort Study recruited pregnant women 1997-2002 for telephone interviews scheduled to take place in weeks 12-16.
Here: data on women recruited before 31 March 1999, interviewed before week 17, and who were still pregnant at week 17.
Aim: study the relation between risk of fetal death and episodes of fever in early pregnancy.
Outcome variable, fetal death: binary.
Both categorical and quantitative explanatory variables are relevant.
Example 1.4, on surgery complications, also has a binary outcome.

Table 2: Distribution of fetal death by number of fever episodes before pregnancy week 17 in 11,778 women recruited to the Danish National Birth Cohort Study.

                 Number of fever episodes
Fetal death       0       1       2      3+     Total
No             9595    1852     182      30     11659
Yes              98      20       1       0       119
Total          9693    1872     183      30     11778

98/9693 = 1.0% of women without fever episodes experienced fetal death, roughly the same percentage as for women with fever episodes: (20+1+0)/(1872+183+30) = 1.0%.

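The two percentages quoted for Table 2 can be reproduced directly from the counts. A minimal Python sketch; the names `deaths`, `totals`, and `risk` are illustrative choices, not part of the course material:

```python
# Counts from Table 2: fetal deaths and total women by number of fever episodes.
deaths = {"0": 98, "1": 20, "2": 1, "3+": 0}
totals = {"0": 9693, "1": 1872, "2": 183, "3+": 30}


def risk(groups):
    """Relative frequency of fetal death, pooling the given fever groups."""
    d = sum(deaths[g] for g in groups)
    n = sum(totals[g] for g in groups)
    return d / n


risk_no_fever = risk(["0"])           # 98 / 9693
risk_fever = risk(["1", "2", "3+"])   # (20 + 1 + 0) / (1872 + 183 + 30)

print(f"no fever: {100 * risk_no_fever:.1f}%")
print(f"fever:    {100 * risk_fever:.1f}%")
```

Pooling the three fever groups mirrors the binary comparison (fever episodes yes/no) used on the slide.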
Note: Confounding
In both examples there may be confounding, i.e. simple comparisons between
- Normal weight and Overweight women
- women with or without reported episodes of fever in early pregnancy
may not be fair because other factors associated with the outcome may be unevenly distributed in the groups to be compared.
This calls for suitable adjustment when the groups are to be compared.

Example 1.3: The PBC-3 trial in liver cirrhosis
PBC-3: multi-centre randomized trial in patients with primary biliary cirrhosis.
Patients were recruited 1983-1987 from six European hospitals and randomized to CyA or placebo.
They were followed until death or liver transplantation (no later than 1989); 4 patients were lost to follow-up before that date.
Outcome variable: time to treatment failure (death or transplantation).
Main explanatory variable, treatment, is binary.
Other risk factors (serum bilirubin, age, gender etc.) may, in spite of the randomization, not be quite balanced between the two treatment groups.
Both categorical and quantitative explanatory variables are relevant.
What about the outcome variable?

Table 3: Average observation times in years (and numbers of patients) by treatment group and failure status in the PBC3 trial in liver cirrhosis.

               Treatment failure
Treatment      No        Yes       Total
Placebo        2.86      1.80      2.58
               (127)     (46)      (173)
CyA            2.77      2.02      2.58
               (132)     (44)      (176)
Total          2.81      1.91      2.58
               (259)     (90)      (349)

Table 4: Number (%) of observation times less than two years by treatment group and failure status in the PBC3 trial in liver cirrhosis.

               Treatment failure     All
Treatment      No        Yes         Observation times    Patients
Placebo        40        27          67                   173
               (23%)     (16%)       (39%)                (100%)
CyA            41        24          65                   176
               (23%)     (14%)       (37%)                (100%)
Total          81        51          132                  349
               (23%)     (15%)       (38%)                (100%)

Because of the incomplete information on the outcome variable (censoring), neither averages nor percentages are reasonable descriptions of the distribution.
Instead, survival (Kaplan-Meier) curves are used for estimating survival probabilities as a function of time, $t$.

[Figure 1: Comparison of estimated survival curves for CyA (dashed) and placebo (solid) treated patients with PBC. Vertical axis: Survival (0.0 to 1.0); horizontal axis: Years (0 to 6).]

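As a rough illustration of how a Kaplan-Meier curve is computed: at each observed failure time the current survival estimate is multiplied by (1 - failures / number at risk), while censored observations only reduce the number at risk. The sketch below uses made-up toy data, NOT the PBC-3 data, and the function name `kaplan_meier` is hypothetical. It assumes the common convention that a subject censored at a time tied with a failure is still counted as at risk at that time.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of S(t).

    times  : observation times
    events : 1 if a failure was observed at that time, 0 if censored
    Returns a list of (t, S(t)) pairs, one per distinct failure time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        failures = 0
        removed = 0
        # Process all subjects sharing the same observation time together.
        while i < len(data) and data[i][0] == t:
            failures += data[i][1]
            removed += 1
            i += 1
        if failures > 0:
            surv *= 1 - failures / at_risk
            curve.append((t, surv))
        # Both failures and censored subjects leave the risk set after t.
        at_risk -= removed
    return curve


# Hypothetical toy data: years of observation and failure indicator.
toy_times = [1.0, 2.0, 2.0, 3.0, 4.0, 5.0]
toy_events = [1, 1, 0, 1, 0, 1]
km = kaplan_meier(toy_times, toy_events)
for t, s in km:
    print(f"t = {t:.1f}  S(t) = {s:.3f}")
```

The estimate only changes at failure times; the subject censored at 4.0 years contributes by being in the risk set up to that time.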
Describing the relation between outcome and one covariate
Notation: outcome variable: $y$ or $(y_i, i = 1, \ldots, n)$. Covariate: $x$ or $(x_i, i = 1, \ldots, n)$.
- For a quantitative outcome (vitamin D), averages were used for estimating the expected value, $m = E(y)$.
- For a binary outcome (fetal death, $y = 0$ or $1$), relative frequencies were used for estimating the failure probability, $p = \mathrm{pr}(y = 1)$.
- For a survival time outcome (time to treatment failure), the Kaplan-Meier curve was used for estimating the survival function, $S(t) = \mathrm{pr}(y > t)$, as a function of time $t$.

One binary covariate
$m_0$: mean vit D for Normal weight women ($x_i = 0$),
$m_1$: mean vit D for Overweight women ($x_i = 1$).
\[ E(y_i) = \begin{cases} m_0 & \text{if } x_i = 0 \\ m_1 & \text{if } x_i = 1 \end{cases} \tag{1} \]
That is,
\[ E(y_i) = m_0 + (m_1 - m_0) x_i = a + b x_i \]
$a = m_0$: intercept; $b = m_1 - m_0$: slope (the effect of $x$ on $y$).
Interpretation of $a$ and $b$. Effect?

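A small numerical check of the identities $a = m_0$ and $b = m_1 - m_0$. The sketch assumes the line is fitted by ordinary least squares (the slide does not name an estimation method), and the data are invented for illustration only:

```python
from statistics import mean

# Hypothetical 25OHD values (nmol/l), invented for illustration only.
# x = 0: Normal weight, x = 1: Overweight.
x = [0, 0, 0, 1, 1, 1, 1]
y = [60.0, 52.0, 56.0, 40.0, 47.0, 38.0, 45.0]

# Group means m0 and m1.
m0 = mean(yi for xi, yi in zip(x, y) if xi == 0)
m1 = mean(yi for xi, yi in zip(x, y) if xi == 1)

# Least-squares estimates for E(y_i) = a + b * x_i.
xbar, ybar = mean(x), mean(y)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)
b = sxy / sxx
a = ybar - b * xbar

print(a, m0)       # intercept vs. mean in the reference group
print(b, m1 - m0)  # slope vs. difference in group means
```

With a 0/1 covariate the fitted line passes through both group means, so the two printed pairs agree up to rounding error.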
Binary outcome
$p_0$: pr(fetal death) for women without fever episodes ($x_i = 0$),
$p_1$: pr(fetal death) for women with fever episodes ($x_i = 1$).
\[ \mathrm{pr}(y_i = 1) = \begin{cases} p_0 & \text{if } x_i = 0 \\ p_1 & \text{if } x_i = 1 \end{cases} \tag{2} \]
We could follow the lines from above and define some measure of discrepancy between $p_1$ and $p_0$. However, that will wait for link functions to be introduced later.

Survival time outcome
$S_0(t)$: survival function for patients treated with placebo ($x_i = 0$),
$S_1(t)$: survival function for patients treated with CyA ($x_i = 1$).
\[ \mathrm{pr}(y_i > t) = \begin{cases} S_0(t) & \text{if } x_i = 0 \\ S_1(t) & \text{if } x_i = 1 \end{cases} \tag{3} \]
Again, we could follow the lines from above and define some measure of discrepancy between $S_1(t)$ and $S_0(t)$, and again that will wait for link functions to be introduced.

One categorical covariate
For a covariate with $k + 1$ values, $g_0, g_1, \ldots, g_k$:
\[ E(y_i) = \begin{cases} m_0 & \text{if } x_i = g_0 \\ m_1 & \text{if } x_i = g_1 \\ \;\vdots & \quad\vdots \\ m_k & \text{if } x_i = g_k \end{cases} \tag{4} \]
Example: mean vitamin D status in 3 BMI categories.

One categorical covariate: dummy variables
Introducing $k$ indicator or dummy variables $I(x_i = g_j)$, $j = 1, \ldots, k$, where $I(x_i = g_j) = 1$ if $x_i = g_j$ and $I(x_i = g_j) = 0$ otherwise:
\[ E(y_i) = m_0 + (m_1 - m_0) I(x_i = g_1) + (m_2 - m_0) I(x_i = g_2) + \cdots + (m_k - m_0) I(x_i = g_k) \]
\[ E(y_i) = a + b_1 I(x_i = g_1) + b_2 I(x_i = g_2) + \cdots + b_k I(x_i = g_k), \tag{5} \]
where $a = m_0$ and $b_j = m_j - m_0$, $j = 1, \ldots, k$. (Details later.)

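The dummy-variable parametrization can be illustrated with the three group averages from Table 1, taking Normal as the reference group $g_0$. This is a sketch only; the helper names `dummies` and `expected` are hypothetical:

```python
# Group averages from Table 1; the first level is the reference group g_0.
means = {"Normal": 56.138, "Slight overweight": 45.831, "Obese": 37.422}
levels = list(means)
reference = levels[0]

# a = m_0 and b_j = m_j - m_0 for the non-reference levels.
a = means[reference]
b = {g: means[g] - a for g in levels[1:]}


def dummies(x):
    """Indicator variables I(x = g_j) for the non-reference levels."""
    return {g: int(x == g) for g in levels[1:]}


def expected(x):
    """E(y) = a + sum_j b_j * I(x = g_j), as in equation (5)."""
    d = dummies(x)
    return a + sum(b[g] * d[g] for g in b)


for g in levels:
    print(g, dummies(g), round(expected(g), 3))
```

Each $b_j$ is a difference from the reference group, so changing the reference level changes $a$ and all $b_j$ but not the expected values themselves.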
One quantitative covariate
The BMI groups Normal weight, Slight overweight, and Obese are ordered. Monotonic relationship? See Figure 2.
The straight line in the scatterplot corresponds to the simple linear regression model:
Covariate: $x_i$ = BMI for woman $i$.
\[ E(y_i) = a + b x_i \]
Slope $b$: difference in mean response between women differing 1 unit in $x$.
Intercept $a$: the expected vit D level for women with BMI = 0.
The model is often re-parametrized into, e.g.,
\[ E(y_i) = a + b(x_i - 25) \]
where now $a$ is the expected vit D level for women with BMI = 25.

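The re-parametrization only shifts the intercept. A short derivation, writing $a^{*}$ for the intercept of the centred model (a symbol introduced here for clarity; the slide reuses $a$):

\[ a + b x_i = (a + 25b) + b(x_i - 25) = a^{*} + b(x_i - 25), \qquad a^{*} = a + 25b. \]

The slope $b$ is unchanged, and $a^{*}$ is the expected vit D level at BMI = 25, a value inside the observed range, unlike BMI = 0.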
[Figure 2: Average 25OHD-values plotted against the BMI scores 21.75, 27.5, and 32.5. Vertical axis: Vitamin D; horizontal axis: BMI score.]

[Figure 3: Scatterplot: values of the quantitative outcome y (25OHD) plotted against the quantitative covariate x (BMI). Vertical axis: Vitamin D; horizontal axis: BMI.]

Several covariates
For a single categorical (binary) or quantitative explanatory variable, building blocks
- $b_1 I(x_i = g_1) + b_2 I(x_i = g_2) + \cdots + b_k I(x_i = g_k)$ (for a binary $x$: $b\,I(x_i = g_1)$)
- $b x_i$
were added to the intercept $a$.
Multiple regression models are obtained by adding such building blocks for the different covariates to obtain the linear predictor.

Vitamin D example, women from Ireland or Poland
Covariates: $x_{i,1}$ = BMI for woman $i$, $x_{i,2} = I(\text{woman } i \text{ is from Ireland})$.
Linear predictor:
\[ E(y_i) = a + b_1 I(x_{i,1} \ge 25) + b_2 x_{i,2} \]
or
\[ E(y_i) = a + b_1 x_{i,1} + b_2 x_{i,2}. \]

First model leads to expected values:
Table 5: Expected values in four groups according to BMI and country.

             Normal weight    Overweight
Poland       a                a + b_1
Ireland      a + b_2          a + b_1 + b_2

Effects of BMI for women from Ireland or Poland are the same: $(a + b_1 + b_2) - (a + b_2) = b_1$ and $(a + b_1) - a = b_1$.
Effects of country for Overweight and Normal weight women are the same: $(a + b_1 + b_2) - (a + b_1) = b_2$ and $(a + b_2) - a = b_2$.
No interaction between country and BMI.

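The four cells of Table 5 and the "same effect in both strata" property can be checked numerically. The coefficient values below are hypothetical, chosen only for illustration and not estimated from the study:

```python
# Hypothetical coefficient values, chosen only for illustration.
a, b1, b2 = 50.0, -8.0, -5.0


def expected(overweight, ireland):
    """First model: E(y) = a + b1 * I(BMI >= 25) + b2 * I(Ireland)."""
    return a + b1 * overweight + b2 * ireland


# Effect of BMI (Overweight vs. Normal weight) within each country.
bmi_effect_poland = expected(1, 0) - expected(0, 0)
bmi_effect_ireland = expected(1, 1) - expected(0, 1)

# Effect of country (Ireland vs. Poland) within each BMI group.
country_effect_normal = expected(0, 1) - expected(0, 0)
country_effect_overweight = expected(1, 1) - expected(1, 0)

print(bmi_effect_poland, bmi_effect_ireland)
print(country_effect_normal, country_effect_overweight)
```

Because the two building blocks are simply added, each difference cancels the other covariate's term, whatever values are chosen for `a`, `b1`, and `b2`.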
Second model leads to parallel lines (vit D vs. BMI) for women from Ireland or Poland (Figure 4).
- $b_1$, the common slope, is the effect of BMI in both countries.
- $b_2$, the (constant) vertical distance between the two lines, is the effect of country for any given value of BMI.
Again: no interaction between country and BMI.

[Figure 4: Expected values from the second model: two parallel lines with slope $b_1$ and vertical distance $b_2$. Dashed curve is for Ireland, solid for Poland. Vertical axis: Vitamin D; horizontal axis: BMI.]

Summary
Multiple regression models are obtained by adding building blocks for the different covariates to obtain the linear predictor.
For multiple covariates $(x_{i,1}, x_{i,2}, \ldots, x_{i,\mathrm{nc}};\ i = 1, \ldots, n)$, where nc is the number of covariates, the linear predictor is
\[ \mathrm{LP}_i = a + b_1 x_{i,1} + b_2 x_{i,2} + \cdots + b_{\mathrm{nc}} x_{i,\mathrm{nc}}. \]
For quantitative covariates an assumption of linearity is imposed.
A consequence of adding the terms corresponding to each building block is assuming no interaction.
These modelling assumptions need careful consideration.
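The linear predictor is just an intercept plus a sum of coefficient-times-covariate terms. A minimal sketch; the function name `linear_predictor` and the coefficient values are hypothetical, and the example uses the covariates of the second vitamin D model (BMI and an Ireland indicator):

```python
def linear_predictor(a, b, x):
    """LP = a + b[0]*x[0] + ... + b[nc-1]*x[nc-1] for one individual."""
    if len(b) != len(x):
        raise ValueError("need one coefficient per covariate")
    return a + sum(bj * xj for bj, xj in zip(b, x))


# Hypothetical coefficients: intercept, slope for BMI, effect of Ireland.
a = 70.0
b = [-1.0, -5.0]

# A woman from Ireland with BMI 30 and a woman from Poland with BMI 30.
lp_ireland = linear_predictor(a, b, [30.0, 1])
lp_poland = linear_predictor(a, b, [30.0, 0])
print(lp_ireland, lp_poland)
```

Categorical covariates enter through their dummy variables, so a covariate with k + 1 levels contributes k entries to `b` and `x`.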