Part V: Binary response data
BIO 233, Spring 2015
Western Collaborative Group Study

Prospective study of coronary heart disease (CHD)
- Recruited 3,524 men in a pre-specified age range, employed at 10 companies in California
- Baseline survey at intake; annual surveys until December 1969

Exclusions:
- 78 men who were actually outside the pre-specified age range
- 141 subjects with CHD manifest at intake
- 106 employees at one firm that excluded itself from follow-up
- 45 subjects lost to follow-up (non-CHD death or self-exclusion) prior to the first follow-up

n = 3,524 - 78 - 141 - 106 - 45 = 3,154 study participants at risk for CHD
Our primary goal is to investigate the relationship between behavior pattern and risk of CHD

Participants were categorized into one of two behavior pattern groups:
- Type A: characterized by enhanced aggressiveness, ambitiousness, competitive drive, and a chronic sense of urgency
- Type B: characterized by a more relaxed, non-competitive manner

Data and documentation are available on the class website

  > ##
  > load("wcgs_data.dat")
  >
  > dim(wcgs)
  [1]
  > names(wcgs)
   [1] "age"    "ht"     "wt"     "sbp"    "dbp"    "chol"   "ncigs"  "behave"
   [9] "chd"    "type"   "time"
The variables (in column order) are:

   1  age     age, years
   2  ht      height, in
   3  wt      weight, lbs
   4  sbp     systolic blood pressure, mmHg
   5  dbp     diastolic blood pressure, mmHg
   6  chol    cholesterol, mg/dL
   7  ncigs   number of cigarettes smoked per day
   8  behave  behavior type; 0/1 = B/A
   9  chd     occurrence of a CHD event during follow-up
  10  type    type of CHD event
  11  time    time post-recruitment of the CHD event, days

Values for the risk-factor covariates are those measured at the intake visit

The three CHD-related variables were measured prospectively over approximately 8.5 years of follow-up
Important note:
- 423 men were lost to follow-up
- 140 men died during the follow-up

For our purposes, we are going to ignore these issues and consider the binary outcome

  Y = 1, if CHD occurred during follow-up
      0, otherwise

In the dataset, the response variable is chd:

  > ##
  > table(wcgs$chd)

  > round(mean(wcgs$chd) * 100, 1)
  [1]
The primary exposure of interest is behave:

  > ##
  > table(wcgs$behave)

  > round(mean(wcgs$behave) * 100, 1)
  [1] 50.4

Cross-tabulation and exposure-specific incidence:

  > ##
  > table(wcgs$behave, wcgs$chd)

  > round(tapply(wcgs$chd, list(wcgs$behave), FUN=mean) * 100, 1)
The probability of the occurrence of CHD during follow-up among type B men is estimated to be 0.050
- the expected percentage of type B men who will develop CHD during follow-up is 5.0%

The probability of the occurrence of CHD during follow-up among type A men is estimated to be 0.112
- the expected percentage of type A men who will develop CHD during follow-up is 11.2%

We often use the generic term "risk" for these probabilities

Either way, it's important to remember that these statements refer to populations of men, rather than to the individuals themselves
- we've estimated a common or average risk of CHD
- referred to as the marginal risk
- marginal in the sense that it does not condition on anything else
Contrasts

As stated at the start, the primary goal is to investigate the relationship between behavior pattern and risk of CHD
- we've characterized the risk for each type, but the goal requires a comparison of the risks

To perform such a comparison we need to choose a contrast

Risk difference:

  RD = 0.112 - 0.050 = 0.062

- the difference in the estimated risk of CHD during follow-up between type A and type B men is 0.062 (or 6.2%)
- the additional risk of CHD of being a type A person manifests through an absolute increase
Relative risk:

  RR = 0.112 / 0.050 = 2.24

- the ratio of the estimated risk of CHD for type A men during follow-up to the estimated risk for type B men
- the additional risk of CHD of being a type A person manifests through a relative increase

As with the interpretation of the risks themselves, these statements refer to contrasts between populations
- population of Type A men vs. population of Type B men

Contrasts are marginal in the sense that we don't condition on anything else when comparing the two populations
- i.e. we don't adjust for anything
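These contrasts are quick arithmetic on the two estimated risks; a minimal sketch in Python (rather than the R used elsewhere in these notes), using the risks quoted above:

```python
# Estimated risks of CHD from the WCGS cross-tabulation (quoted above)
p0 = 0.050   # type B (referent)
p1 = 0.112   # type A

rd = p1 - p0                                     # risk difference
rr = p1 / p0                                     # relative risk
odds_ratio = (p1 / (1 - p1)) / (p0 / (1 - p0))   # odds ratio, for later comparison

print(round(rd, 3))   # 0.062
print(round(rr, 2))   # 2.24
print(round(odds_ratio, 2))
```

The odds ratio is included only to foreshadow the logit-link contrast introduced later; it is computed from the rounded risks, so it need not match an odds ratio estimated from the raw data.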
Important to note that the RD and RR are related
- the relationship depends on the value of the response probability for the referent group: since RD = P(Y = 1 | X = 0)(RR - 1), fixing the referent risk and the RR determines the RD

[Table: RD across different combinations of P(Y = 1 | X = 0) and RR; combinations where P(Y = 1 | X = 0) x RR exceeds 1 are not possible (NA)]
The RD may be small even if the RR is big
- holds for either protective or detrimental effects

When the RR is small (close to 1), the RD is also small unless P(Y = 1 | X = 0) is big
- i.e. unless the outcome is common

However, a small RR operating on a large population could correspond to a big public health impact
- this rationale is often cited in studies of air pollution

To move beyond simple contrasts, we need a more general framework for modeling the relationship between the binary response and a vector of covariates
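The table relating RD to the referent risk can be regenerated from the identity RD = P(Y = 1 | X = 0)(RR - 1); a small Python sketch with illustrative grid values (the original table's entries were lost in transcription, so these are assumptions):

```python
# RD implied by each (P0, RR) pair via RD = P0 * (RR - 1);
# pairs with P0 * RR > 1 imply a risk above 1 and are marked NA
p0s = [0.001, 0.01, 0.1, 0.5]   # referent risks (illustrative)
rrs = [1.1, 2.0, 5.0]           # relative risks (illustrative)
for p0 in p0s:
    row = [round(p0 * (rr - 1), 4) if p0 * rr <= 1 else "NA" for rr in rrs]
    print(p0, row)
```

The last row shows the pattern discussed above: with a common outcome (P0 = 0.5), even RR = 2 yields a large RD, while RR = 5 is impossible.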
GLMs for binary data

We've noted that the Bernoulli distribution is the only possible distribution for binary data:

  Y ~ Bernoulli(µ)

  f_Y(y; µ) = µ^y (1 - µ)^(1-y)

In exponential family form:

  f_Y(y; θ, φ) = exp{ yθ - log(1 + exp{θ}) }

where

  θ = log( µ / (1 - µ) ),  a(φ) = 1,  b(θ) = log(1 + exp{θ}),  c(y, φ) = 0
The log-likelihood is

  l(β; y) = Σ_{i=1}^n [ y_i θ_i - b(θ_i) ]
          = Σ_{i=1}^n [ y_i θ_i - log(1 + exp{θ_i}) ]

where θ_i is a function of β via

  g(µ_i) = X_i^T β  and  µ_i = exp{θ_i} / (1 + exp{θ_i})
The score function for β_j is

  ∂l(β; y)/∂β_j = Σ_{i=1}^n (∂µ_i/∂η_i) X_{j,i} [ 1 / (µ_i(1 - µ_i)) ] (y_i - µ_i)

where the expression for ∂µ_i/∂η_i depends on the choice of the link function g(·)

Since the log-likelihood is only a function of β, the expected information matrix is given by the (p+1) x (p+1) matrix:

  I_ββ = X^T W X

where X is the design matrix for the model and W is a diagonal matrix with i-th diagonal element

  W_i = (∂µ_i/∂η_i)^2 / ( µ_i(1 - µ_i) )
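For the logit link the weight W_i simplifies, since ∂µ_i/∂η_i = µ_i(1 - µ_i) so that W_i = µ_i(1 - µ_i); a quick numerical check of that simplification in Python:

```python
import math

def expit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

# For the logit link, dmu/deta = mu * (1 - mu), so the weight
# W_i = (dmu/deta)^2 / (mu * (1 - mu)) collapses to mu * (1 - mu)
eta = 0.7   # arbitrary value of the linear predictor
mu = expit(eta)
h = 1e-6
dmu_deta = (expit(eta + h) - expit(eta - h)) / (2 * h)   # numerical derivative
w = dmu_deta**2 / (mu * (1 - mu))
print(abs(w - mu * (1 - mu)) < 1e-6)   # True
```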
Link functions

In a GLM, the systematic component is given by

  g(µ_i) = η_i = X_i^T β

We've noted previously that, for binary data, there are various options for link functions, including:
- linear: g(µ_i) = µ_i
- log: g(µ_i) = log(µ_i)
- logit: g(µ_i) = log( µ_i / (1 - µ_i) )
- probit: g(µ_i) = probit(µ_i)
- complementary log-log: g(µ_i) = log{ -log(1 - µ_i) }
Q: How do we make a choice from among these options?

Balance between interpretability and mathematical properties
- interpretability of contrasts
- mathematical properties in terms of fitted values being in the appropriate range
Linear (identity) link function

  µ_i = β_0 + β_1 X_i

Interpret β_0 as the probability of response when X = 0

Interpret β_1 as the change in the probability of response, comparing two populations whose value of X differs by 1 unit
- the contrast we are modeling is the risk difference (RD)

As we've noted, a potential problem is that this specification of the model doesn't respect the fact that the (true) response probability is bounded
Log link function

  log(µ_i) = β_0 + β_1 X_i

Interpret β_0 as the log of the probability of response when X = 0
- exp{β_0} is the probability of response when X = 0

Interpret β_1 as the change in the log of the probability of response, comparing two populations whose value of X differs by 1 unit
- exp{β_1} is the ratio of the probability of response when X = 1 to that when X = 0
- the contrast we are modeling is the risk ratio (RR)
As with the linear link, this choice of link function doesn't necessarily respect the fact that the (true) response probability is bounded

We can see this explicitly by considering the inverse of the link function:

  µ_i = exp{X_i^T β}

which takes values on (0, ∞)
Logit link function

  logit(µ_i) = log( µ_i / (1 - µ_i) ) = X_i^T β

The functional

  µ_i / (1 - µ_i) = P(Y_i = 1 | X_i) / P(Y_i = 0 | X_i)

is the odds of response

Interpret β_0 as the log of the odds of response when X = 0
- exp{β_0} is the odds of response when X = 0
Interpret β_1 as the change in the log of the odds of response, comparing two populations whose value of X differs by 1 unit
- exp{β_1} is the ratio of the odds of response when X = 1 to that when X = 0
- the contrast we are modeling is the odds ratio (OR)

Considering the inverse of the link function yields:

  µ_i = exp{X_i^T β} / (1 + exp{X_i^T β})

referred to as the expit function
The expit function is the CDF of the standard logistic distribution
- a distribution for a continuous random variable with support on (-∞, ∞)
- the pdf is given by

  f_X(x) = exp{-x} / (1 + exp{-x})^2

The CDF (of any distribution) provides a mapping from the support of the random variable to the (0, 1) interval:

  F_X(·): (-∞, ∞) → (0, 1)

We could use the inverse CDF of any distribution as a link function:

  F_X^{-1}(·): (0, 1) → (-∞, ∞)

- g(·) ≡ F^{-1}(·) maps µ ∈ (0, 1) to η ∈ (-∞, ∞)
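The CDF/inverse-CDF pairing is exactly the inverse-link/link pairing; a minimal Python sketch of the logistic case, showing that logit and expit are inverses:

```python
import math

# The expit function (standard logistic CDF) and its inverse, the logit:
# expit maps (-inf, inf) -> (0, 1) and logit maps (0, 1) -> (-inf, inf)
def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    return math.log(p / (1.0 - p))

eta = 1.3                 # arbitrary linear predictor value
mu = expit(eta)           # a probability in (0, 1)
print(abs(logit(mu) - eta) < 1e-9)   # True: the round trip recovers eta
```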
Probit link function

  probit(µ_i) = Φ^{-1}(µ_i) = X_i^T β

where Φ(·) is the CDF of the standard normal distribution

Interpret β_0 as the probit of the probability of response when X = 0

Interpret β_1 as the change in the probit of the probability of response, comparing two populations whose value of X differs by 1 unit

Interpretation is tricky
- the contrast is in terms of the inverse CDF of a standard normal distribution
- no easy way of relating this contrast to more intuitive measures
Complementary log-log link function

  log{ -log(1 - µ_i) } = X_i^T β

- the inverse CDF of the extreme value (or log-Weibull) distribution

As with the probit link function, there isn't any intuitive way of interpreting regression parameters based on this link function

Has the distinction that it is asymmetric
- may be useful if the primary purpose is prediction
Comparisons

Over values of µ ∈ (0.1, 0.9), models based on the linear, logit and probit link functions agree approximately
- considering their inverse link functions, over the range η_i ∈ (-2, 2):

  1/2 + η_i/4 ≈ expit(η_i) ≈ Φ( √(2π) η_i / 4 )

- so their fitted values will be approximately equal over this range

We can also use these relationships to provide approximate relationships between the regression parameters:

  β_1^linear ≈ (1/4) β_1^logit ≈ (1/√(2π)) β_1^probit
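These approximations can be checked numerically; a Python sketch comparing the three inverse links on a grid over (-2, 2) (the grid and error summaries are my own, not from the original slides):

```python
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def Phi(x):  # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Compare the three inverse links over eta in [-2, 2]
c = math.sqrt(2.0 * math.pi) / 4.0
worst_linear, worst_probit = 0.0, 0.0
for i in range(-20, 21):
    eta = i / 10.0
    worst_linear = max(worst_linear, abs(expit(eta) - (0.5 + eta / 4.0)))
    worst_probit = max(worst_probit, abs(expit(eta) - Phi(c * eta)))
print(round(worst_linear, 3))   # the linear approximation drifts near the endpoints
print(round(worst_probit, 3))   # the scaled probit tracks the expit closely
```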
[Figure: conditional mean µ_i against the linear predictor η_i, for the linear, logit and probit links]
[Figure: g(µ_i) plotted against logit(µ_i), for the complementary log-log, logit, probit and log link functions]
From the figures, differences across these link functions manifest primarily in the tails
- when the probability of response is small or large

Also, the logit and probit functions are almost linearly related
- noted this from the approximations as well

For small values of µ_i, the complementary log-log, logit and log functions are close to each other
- equally good for rare events: for µ_i ≤ 0.1,

  log( µ_i / (1 - µ_i) ) ≈ log(µ_i)

- the log link has the best interpretation
- the OR and RR are close numerically
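The rare-event closeness of the logit and log links follows from log(µ/(1-µ)) - log(µ) = -log(1-µ), which is near 0 for small µ; a quick Python check of the gap:

```python
import math

# For small mu the odds and the risk nearly coincide:
# log(mu / (1 - mu)) - log(mu) = -log(1 - mu), which is close to 0
for mu in (0.01, 0.05, 0.10):
    gap = math.log(mu / (1 - mu)) - math.log(mu)
    print(mu, round(gap, 4))
```

The gap grows with µ, which is why µ_i ≤ 0.1 is the usual rule of thumb for treating the OR and RR interchangeably.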
Modeling: WCGS

Returning to the WCGS, the dataset has a number of covariates that we might consider including in a model

  > ##
  > names(wcgs)
   [1] "age"    "ht"     "wt"     "sbp"    "dbp"    "chol"   "ncigs"  "behave"
   [9] "chd"    "type"   "time"

Q: How do we approach making decisions about what to include in the model?
- depends on the purpose of the analysis

Towards this, it's useful to classify the analysis into one of two types:
- association studies
- prediction studies
Association studies

The goal is to characterize the relationship between some exposure of interest and the response
- establish cause-and-effect

Understanding the underlying (data-generating) mechanisms is crucial
- need to be attentive to the possibility of alternative explanations
- control of confounding is crucial

Model selection, in terms of the choice of potential confounders, should be based on scientific considerations

Despite this ideal, it's not always clear which covariates are confounders and which aren't
One strategy is to fit and report the following three models:
(1) an unadjusted or minimally adjusted model
(2) a model that includes core confounders
    - clear indication from scientific knowledge and/or the literature
    - consensus among investigators
(3) a model that includes core confounders plus any potential confounders
    - indication is less certain

Report results from model (2) as primary
- base conclusions on the results of this model
- interpret models (1) and (3) in terms of sensitivity analyses

There are, of course, other philosophies on this!
Prediction studies

The goal is to estimate the response Y
- as opposed to the goal of estimating β

In contrast to association studies, prediction is typically not hypothesis-driven
- there is no single exposure or association or parameter that is of interest
- mechanisms and confounding are less of a concern, if at all

The choice of which covariates to include in the model is driven by the extent to which inclusion improves our ability to predict future outcomes
- care is needed not to overfit the data
- these issues typically don't come up in association studies
- requires different analysis strategies and different statistical tools
Confounding

The data for the WCGS are observational
- as a study of Type A vs Type B behavior patterns, the investigators didn't randomize behavior pattern

As such, an analysis based on these data may be subject to confounding bias

A confounder is defined as a covariate that is (causally) associated with both the exposure of interest and the outcome of interest, while not being on the causal pathway

[Causal diagram: C with arrows into both X and Y; the X → Y association of interest is marked "?"]
Intuitively, from the causal diagram, there is a backdoor association between X and Y, through C

If one does not block this pathway then one cannot isolate the (direct) association between X and Y
- the unadjusted association is spurious in the sense that it is a mixture of the true association and the association carried by the backdoor pathway
- confounding bias

Note, we haven't introduced any estimators yet
- we haven't even introduced a contrast yet!

As such, confounding is a scientific issue
- distinct from statistical bias, which is an operating characteristic of an estimator
The control of confounding bias must, therefore, be approached from a scientific perspective
- we cannot use statistical techniques to determine whether or not a covariate is a confounder
- we must use scientific knowledge to make these decisions

Given a collection of (potential) confounders, the standard approach to controlling confounding bias is to include them in the linear predictor
- referred to as regression adjustment
- e.g., η_i = β_0 + β_x X_i + β_c C_i
- interpret β_x conditional on C, or within strata of C
Going back to the causal diagram, conditioning on the confounder blocks the backdoor pathway
- the effect of including C in the model is to break the association between C and Y

[Causal diagram: as before, but with the backdoor pathway through C blocked by conditioning on C]
Exploratory data analysis

Whatever the purpose of the study, it is often useful to perform some preliminary exploratory data analysis

Q: Why?

  > ##
  > apply(wcgs[,1:7], 2, FUN=summary)
  $age
     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.

  $ht
     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.

  $wt
     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.

  $sbp
     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.

  $dbp
     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.

  $chol
     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's

  $ncigs
     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
[Histogram: age, years]

[Histograms: weight, lbs; height, in]

[Histograms: systolic blood pressure, mmHg; diastolic blood pressure, mmHg]

[Plots: cholesterol, mg/dL, against study id; histogram of cholesterol, mg/dL]

[Histogram: number of cigarettes, per day]
  > ##
  > table(wcgs$ncigs)

Study participants seem to be reporting round numbers
- likely some misclassification of actual smoking
Overall, nothing too worrying pops out

Some instances of large values:
- weight of 320 lbs
- diastolic blood pressure of 150 mmHg
- cholesterol of 645 mg/dL
- smoking 99 cigarettes per day

There is also some missingness in the data
- in a real collaborative setting, we'd want to know more about the cholesterol values
- in particular, why were they missing?
- only 12 out of 3,154 observations with missing values
Based on the EDA, perform the following data manipulations:

  > ##
  > wcgs$chol[wcgs$chol > 500] <- NA   ## Take out (particularly) strange value
  > wcgs <- na.omit(wcgs)              ## Remove observations with missing chol
  >
  > ## Standardize continuous variables to make the intercept interpretable
  > ##
  > wcgs$age  <- (wcgs$age - 40) / 5
  > wcgs$ht   <- (wcgs$ht - 70) / 2
  > wcgs$wt   <- (wcgs$wt - 170) / 10
  > wcgs$sbp  <- (wcgs$sbp - 125) / 10
  > wcgs$dbp  <- (wcgs$dbp - 80) / 10
  > wcgs$chol <- (wcgs$chol - 200) / 20
  >
  > ## Smoker 0/1 = No/Yes
  > ##
  > wcgs$smoker <- as.numeric(wcgs$ncigs > 0)
Unadjusted analysis

Fit the logistic regression model:

  logit(µ_i) = β_0 + β_1 behave_i

  > ##
  > fit0 <- glm(chd ~ behave, family=binomial(), data=wcgs)
  > summary(fit0)

  Call:
  glm(formula = chd ~ behave, family = binomial(), data = wcgs)

  Deviance Residuals:
      Min      1Q  Median      3Q     Max

  Coefficients:
              Estimate Std. Error z value Pr(>|z|)
  (Intercept)                              <2e-16 ***
  behave                                    e-09 ***
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

  (Dispersion parameter for binomial family taken to be 1)

      Null deviance:  on 3140 degrees of freedom
  Residual deviance:  on 3139 degrees of freedom
  AIC:

  Number of Fisher Scoring iterations: 5

  > summary(fit0$fitted)
     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
Core adjustment

Add core adjustment variables into the linear predictor and fit

  logit(µ_i) = β_0 + β_1 behave_i + β_2 age_i + β_3 wt_i + β_4 sbp_i + β_5 chol_i + β_6 smoker_i

  > ##
  > fit1 <- glm(chd ~ behave + age + wt + sbp + chol + smoker,
                family=binomial(), data=wcgs)
  > summary(fit1)
  ...
  Coefficients:
              Estimate Std. Error z value Pr(>|z|)
  (Intercept)                             < 2e-16 ***
  behave                                    e-06 ***
  age                                       e-07 ***
  wt                                               **
  sbp                                       e-05 ***
  chol                                      e-12 ***
  smoker                                    e-05 ***
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

  (Dispersion parameter for binomial family taken to be 1)

      Null deviance:  on 3140 degrees of freedom
  Residual deviance:  on 3134 degrees of freedom
  AIC:

  Number of Fisher Scoring iterations: 6

  > summary(fit1$fitted)
     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
Full adjustment

Add the remaining adjustment variables into the linear predictor and fit

  logit(µ_i) = β_0 + β_1 behave_i + β_2 age_i + β_3 wt_i + β_4 sbp_i + β_5 chol_i + β_6 smoker_i + β_7 ht_i + β_8 dbp_i

  > fit2 <- glm(chd ~ behave + age + wt + sbp + chol + smoker + ht + dbp,
                family=binomial(), data=wcgs)
  > summary(fit2)
  ...
  Coefficients:
              Estimate Std. Error z value Pr(>|z|)
  (Intercept)                             < 2e-16 ***
  behave                                    e-06 ***
  age                                       e-07 ***
  wt                                               *
  sbp                                              **
  chol                                      e-12 ***
  smoker                                    e-05 ***
  ht
  dbp
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

  (Dispersion parameter for binomial family taken to be 1)

      Null deviance:  on 3140 degrees of freedom
  Residual deviance:  on 3132 degrees of freedom
  AIC:

  Number of Fisher Scoring iterations: 6

  > summary(fit2$fitted)
     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
Interpretation of results

Characterizing the effect of behavior type is the primary scientific goal
- typically report results on the odds ratio scale
- denote the odds ratio by θ_1 = exp{β_1}

95% CIs can be obtained in a number of ways:
(i) compute the 95% CI for β̂_1 and exponentiate
(ii) compute a 95% CI directly for θ̂_1
    - glm() returns the standard error estimates for the β̂'s
    - use the delta method to get the standard error for θ̂_1

The approaches are equivalent asymptotically
- in small samples, the first approach results in an asymmetric CI
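The two CI constructions can be sketched side by side; in Python for illustration, with a made-up log-OR estimate and standard error (these are placeholders, not the WCGS results). The delta method gives se(θ̂) = exp(β̂) se(β̂), since dθ/dβ = exp(β):

```python
import math

# Hypothetical log-OR estimate and standard error (placeholders, not WCGS values)
beta_hat, se_beta = 0.86, 0.14
z = 1.96

# (i) 95% CI for beta, then exponentiate -> asymmetric about theta_hat
lo1 = math.exp(beta_hat - z * se_beta)
hi1 = math.exp(beta_hat + z * se_beta)

# (ii) delta method: se(theta_hat) = exp(beta_hat) * se(beta_hat) -> symmetric CI
theta_hat = math.exp(beta_hat)
se_theta = theta_hat * se_beta
lo2, hi2 = theta_hat - z * se_theta, theta_hat + z * se_theta

print(round(lo1, 2), round(hi1, 2))   # asymmetric interval
print(round(lo2, 2), round(hi2, 2))   # symmetric interval
```

Both intervals cover θ̂ = exp(β̂), but only the first respects the positivity of the OR by construction.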
The getci() function implements the first approach
- code is available on the class website

  > ##
  > getci(fit0)
              exp{beta} lower upper
  (Intercept)
  behave

Interpretation of θ̂_1 = 2.36:

  > ##
  > getci(fit1)[1:2,]
              exp{beta} lower upper
  (Intercept)
  behave

Interpretation of θ̂_1 = 1.99:

  > ##
  > getci(fit2)[1:2,]
              exp{beta} lower upper
  (Intercept)
  behave

Interpretation of θ̂_1 = 1.98:
Flexible adjustment

When we include potential confounders in the model, we are less concerned with their interpretation
- the primary purpose is the control of confounding bias
- if we don't model the effects of confounders properly, there may be residual confounding

This suggests including these covariates in the model in as flexible a manner as possible
- go beyond linearity

Two simple strategies for flexibly modeling continuous covariates are:
(i) including additional polynomial terms
(ii) categorization
  > ## Polynomial
  > ##
  > wcgs$age2 <- wcgs$age^2
  > wcgs$age3 <- wcgs$age^3
  ...
  >
  > ## Categorization
  > ##
  > wcgs$cigscat <- 0
  > wcgs$cigscat[wcgs$ncigs >= 10] <- 1
  > wcgs$cigscat[wcgs$ncigs >= 20] <- 2
  > wcgs$cigscat[wcgs$ncigs >= 30] <- 3
  > wcgs$cigscat[wcgs$ncigs >= 40] <- 4
  >
  > ##
  > flex1 <- glm(chd ~ behave + age + age2 + age3 + wt + wt2 + wt3 +
                 sbp + sbp2 + sbp3 + chol + chol2 + chol3 + factor(cigscat),
                 family=binomial(), data=wcgs)
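The sequential reassignments above implement a simple cut-point rule; the same rule can be written as a single function, sketched here in Python (the category labels mirror the R code's cut points):

```python
# Same cut points as the R categorization above:
# 0: <10, 1: 10-19, 2: 20-29, 3: 30-39, 4: 40+ cigarettes/day
def cigs_cat(ncigs):
    for cat, cut in ((4, 40), (3, 30), (2, 20), (1, 10)):
        if ncigs >= cut:
            return cat
    return 0

print([cigs_cat(n) for n in (0, 5, 10, 25, 99)])   # [0, 0, 1, 2, 4]
```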
  > summary(flex1)
  ...
                   Estimate Std. Error z value Pr(>|z|)
  (Intercept)                                  < 2e-16 ***
  behave                                         e-06 ***
  age
  age2
  age3
  wt                                                   *
  wt2                                                  **
  wt3                                                  **
  sbp                                                  *
  sbp2
  sbp3
  chol                                           e-05 ***
  chol2
  chol3
  factor(cigscat)1
  factor(cigscat)2                                     ***
  factor(cigscat)3                               e-06 ***
  factor(cigscat)4                                     **
  > ##
  > getci(fit1)[1:2,]
              exp{beta} lower upper
  (Intercept)
  behave
  >
  > getci(flex1)[1:2,]
              exp{beta} lower upper
  (Intercept)
  behave
  >
  > ##
  > LRtest(fit1, flex1)
  Test Statistic = 25.5 on 11 df => p-value = 0.01
  [1] 0.01

The likelihood ratio test suggests a better fit
- but there is virtually no impact on estimation of, or inference for, the behavior-type OR
Link functions

So far, we've only considered the logit link function

  g(µ_i) = log( µ_i / (1 - µ_i) ) = X_i^T β

By far the most common link function used for GLMs of binary data
- guaranteed that fitted values are in (0, 1)
- reasonable interpretation of contrasts in terms of odds ratios
- when the event is rare: OR ≈ RR
- ability to analyze case-control data as if it had been collected prospectively

Q: What about other link functions?
Potential choices include:
- linear: g(µ_i) = µ_i
- log: g(µ_i) = log(µ_i)
- probit: g(µ_i) = probit(µ_i)
- complementary log-log: g(µ_i) = log{ -log(1 - µ_i) }

We've noted that there is a trade-off between interpretability and mathematical properties

For the goal of characterizing the association between behavior type and risk of CHD, interpretability is crucial
- examine the linear and log link functions

If the goal is prediction, then we'd be more likely to entertain the probit and complementary log-log link functions
In R we use the family argument to change the link
- other components of the GLM that are functions of the link are appropriately adjusted

Let's first consider changing the link function for the unadjusted analysis
- for the binomial family, the logit link is the default
- but just to show you how it works:

  > ##
  > logitf0 <- glm(chd ~ behave, family=binomial(link="logit"), data=wcgs)
  > summary(logitf0$fitted)
     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.

  > getci(logitf0)
              exp{beta} lower upper
  (Intercept)
  behave
Now let's fit the model using the linear link:

  > ##
  > linearf0 <- glm(chd ~ behave, family=binomial(link="identity"), data=wcgs)
  > summary(linearf0$fitted)
     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.

  > getci(linearf0, expo=FALSE, digits=4) * 100
              beta lower upper
  (Intercept)
  behave

Notice that the fitted values are the same as those obtained using the logit link

Q: Why?

Interpretation of β̂_1 = 6.11:
Finally, let's fit the model using the log link:

  > ##
  > logf0 <- glm(chd ~ behave, family=binomial(link="log"), data=wcgs)
  > summary(logf0$fitted)
     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.

  > getci(logf0)
              exp{beta} lower upper
  (Intercept)
  behave

Again, notice that the fitted values are the same

Interpretation of θ̂_1 = 2.21:
Q: How does changing the link function impact the adjusted analysis?

  > ##
  > logitf1 <- glm(chd ~ behave + age + wt + sbp + chol + smoker,
                   family=binomial(), data=wcgs)
  > getci(logitf1)[1:2,]
              exp{beta} lower upper
  (Intercept)
  behave
  >
  > ##
  > linearf1 <- glm(chd ~ behave + age + wt + sbp + chol + smoker,
                    family=binomial(link="identity"), data=wcgs)
  Error: no valid set of coefficients has been found: please supply starting values

The IWLS algorithm is having trouble finding valid starting values
Taking a closer look at the glm() function:

  > args(glm)
  function (formula, family = gaussian, data, weights, subset,
      na.action, start = NULL, etastart, mustart, offset,
      control = list(...), model = TRUE, method = "glm.fit",
      x = FALSE, y = TRUE, contrasts = NULL, ...)
  NULL

We can provide our own starting values via:
- start, for the regression coefficients β
- etastart, for the linear predictors {η_1, ..., η_n}
- mustart, for the fitted values {µ_1, ..., µ_n}

Use values from some other fit that was successful
- a fit using some other link function
- a fit based on a different mean model
Using a linear link with binary data, we also have to be careful about the mean-variance relationship specified by the binomial() family:

  > names(binomial())
   [1] "family"     "link"       "linkfun"    "linkinv"    "variance"
   [6] "dev.resids" "aic"        "mu.eta"     "initialize" "validmu"
  [11] "valideta"   "simulate"

  > binomial()$variance
  function (mu)
  mu * (1 - mu)

If, at any point during the IWLS algorithm, one of the fitted values is outside (0, 1) then the variance will be negative
- unlikely that the algorithm will converge
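The negative-variance problem can be seen directly from the mean-variance function; a one-line Python sketch of V(µ) = µ(1 - µ) evaluated at a valid and an invalid fitted value:

```python
# The binomial mean-variance function V(mu) = mu * (1 - mu);
# a "fitted value" outside (0, 1) yields a negative variance,
# which makes the IWLS weights invalid
def binom_variance(mu):
    return mu * (1 - mu)

print(round(binom_variance(0.30), 4))    # 0.21
print(round(binom_variance(-0.05), 4))   # -0.0525, negative
```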
An alternative is to use OLS with an appropriate variance estimator to account for the heteroskedasticity induced by the mean-variance relationship
- Huber-White variance estimator
  - sandwich estimator
  - robust estimator
- bootstrap variance estimator

In R, use the lm() function
- the function robustci(), available on the class website, computes robust- and bootstrap-based 95% confidence intervals

  > ##
  > linearf1 <- lm(chd ~ behave + age + wt + sbp + chol + smoker, data=wcgs)
  > robustci(linearf1, digits=4, B=1000) * 100
              betahat Naive Lo Naive Up Robust Lo Robust Up Boot Lo Boot Up
  (Intercept)
  behave
  age
  wt
  sbp
  chol
  smoker

Interpretation of β̂_1 = 4.59:

Q: What about the negative fitted values?

  > ##
  > summary(linearf1$fitted)
     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
[Figure: fitted values using a linear link against fitted values using a logit link]
Clearly, some of the fitted values are < 0

  > ##
  > range(logitf1$fitted[linearf1$fitted <= 0])
  [1]

The fitted values that are < 0 are all small in magnitude

Fitted values that are > 0 are in a much tighter range of values
- maximum value of 0.326, as opposed to that for the logistic model

Turning to the log link:

  > ##
  > logf1 <- glm(chd ~ behave + age + wt + sbp + chol + smoker,
                 family=binomial(link="log"), data=wcgs)
  Error: no valid set of coefficients has been found: please supply starting values
This time we can't use the lm() function
- but we can provide starting values from the (successful) fit of the logistic regression:

  > ##
  > logf1 <- glm(chd ~ behave + age + wt + sbp + chol + smoker,
                 family=binomial(link="log"), mustart=fitted(logitf1), data=wcgs)
  > getci(logf1)[1:2,]
              exp{beta} lower upper
  (Intercept)
  behave
  > summary(logf1$fitted)
     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.

All of the fitted values are in (0, 1)

Interpretation of θ̂_1 = 1.78:
[Figure: fitted values using a log link against fitted values using a logit link]
Summary of results:

  Link      Contrast   Unadjusted model     Adjusted model
  logit     OR         2.36 (1.79, 3.10)    1.99 (1.50, 2.64)
  linear    RD         6.11 (4.21, 8.01)    4.59 (2.75, 6.43)
  log       RR         2.21 (1.71, 2.85)    1.78 (1.38, 2.29)

- 95% CI based on the Huber-White robust standard error estimate

Convincing evidence of a statistically significant difference between Type A and Type B behavior types in CHD risk
- however you define the contrast

Q: Do you think we can claim clinical significance?
The Bayesian Solution

GLMs for binary data are specified by:

  Y_i | X_i ~ Bernoulli(µ_i)
  g(µ_i) = X_i^T β

The unknown parameters are the regression coefficients β
- p + 1 parameters

In the absence of prior knowledge, it is typical to adopt a flat prior:

  π(β) ∝ 1
Computation

Generate samples from the posterior

  π(β | y) ∝ L(β; y) π(β)

via the Metropolis-Hastings algorithm

Use the asymptotic sampling distribution of the MLE as a proposal distribution:

  q(β; y) = Normal( β̂_MLE, Î^{-1}_ββ )

- from the (usual) frequentist fit of the GLM

Also use this distribution for starting values
  ## rmvnorm() and dmvnorm() are from the mvtnorm package
  library(mvtnorm)

  ##
  fit1 <- glm(chd ~ behave + age + wt + sbp + chol + smoker,
              family=binomial(), data=wcgs)

  ##
  betahat <- fit1$coef
  betavar <- summary(fit1)$cov.unscaled
  X <- model.matrix(fit1)
  Y <- model.frame(fit1)[,1]

  ## 3 chains, each for 1,000 scans
  ##
  M <- 3
  R <- 1000
  startvals <- rmvnorm(M, betahat, betavar)
  posterior <- array(NA, dim=c(R, length(betahat), M))
  accept <- array(0, dim=c(R, M))

  for(m in 1:M) {
    ##
    beta <- startvals[m,]
    mu <- as.vector(expit(X %*% beta))

    ##
    for(r in 1:R) {
      ##
      betastar <- as.vector(rmvnorm(1, betahat, betavar))
      mustar <- as.vector(expit(X %*% betastar))

      ##
      logpiratio <- sum(dbinom(Y, 1, mustar, log=TRUE)) -
                    sum(dbinom(Y, 1, mu, log=TRUE))
      logqratio <- log(dmvnorm(beta, betahat, betavar)) -
                   log(dmvnorm(betastar, betahat, betavar))
      ar <- exp(logpiratio + logqratio)
      if(runif(1) < ar) {
        beta <- betastar
        mu <- mustar
        accept[r,m] <- 1
      }
      posterior[r,,m] <- beta
    }
  }
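The acceptance-ratio bookkeeping in the code above is an independence sampler: the proposal does not depend on the current state, so the Hastings correction is the ratio of proposal densities at the current and proposed values. A toy Python sketch of the same scheme, targeting a standard normal with a wider normal proposal (all choices here are illustrative, not from the slides):

```python
import math
import random

# Minimal independence-sampler Metropolis-Hastings, the scheme used above:
# target N(0, 1), proposal N(0, 2^2), log densities up to additive constants
random.seed(1)

def log_target(x):
    return -0.5 * x * x

def log_proposal(x):
    return -0.5 * (x / 2.0) ** 2

x, draws = 0.0, []
for _ in range(20000):
    xstar = random.gauss(0.0, 2.0)
    # log acceptance ratio: target ratio plus proposal (Hastings) correction
    log_ar = (log_target(xstar) - log_target(x)) + (log_proposal(x) - log_proposal(xstar))
    if random.random() < math.exp(min(0.0, log_ar)):
        x = xstar
    draws.append(x)

mean = sum(draws) / len(draws)
var = sum(d * d for d in draws) / len(draws) - mean ** 2
print(round(mean, 2), round(var, 2))   # should be near 0 and 1
```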
Examine trace plots for evidence of convergence (or lack thereof)

[Trace plots: intercept β_0 and behave log-OR β_1 against scan number, for the 3 chains]
Acceptance rate for the Metropolis-Hastings algorithm:

  > ##
  > accrate <- round(apply(accept, 2, mean) * 100, 1)
  > accrate
  [1]

[Figure: proposal and posterior distribution for the log-OR of behave, β_1]
Summaries of the posterior distribution
- potential scale reduction (PSR)
- results based on the Bayesian analysis pool samples from the 3 chains, each with 10% burn-in
- MLE and 95% confidence interval shown for comparison

                PSR   Median   2.5%   97.5%     exp{beta} lower upper
  (Intercept)
  behave
  age
  wt
  sbp
  chol
  smoker

Numerical results based on the Bayesian and frequentist analyses are virtually identical
- they differ in their interpretation
[Figure: posterior distribution for the OR of behave, θ_1 = exp{β_1}, with the posterior median/mean and (central) 95% credible interval marked]
Log link

Suppose we want to model the RR, rather than the OR
- log link, rather than the logit link

In terms of the model specification, the only thing that changes is the dependence of the mean on the linear predictor:

  Y_i | X_i ~ Bernoulli(µ_i)
  log(µ_i) = X_i^T β

- the form of the likelihood is the same

Retain the flat prior for β
- even though the parameters are different
Operationally, we need to modify the Metropolis-Hastings algorithm:
(1) change how the µ_i's are calculated to evaluate the likelihood/posterior

  µ_i = expit(X_i^T β)  becomes  µ_i = exp(X_i^T β)

(2) check that the proposed value of β yields a valid set of µ_i's
- if the proposal yields any µ_i outside (0, 1) then we automatically reject the proposal
- such a proposal has zero posterior probability
At the r-th scan for the m-th chain, the algorithm proceeds as:

  ##
  betastar <- as.vector(rmvnorm(1, betahat, betavar))
  mustar <- as.vector(exp(X %*% betastar))   ## change to the link

  ##
  if(sum(mustar <= 0 | mustar >= 1) == 0) {
    logpiratio <- sum(dbinom(Y, 1, mustar, log=TRUE)) -
                  sum(dbinom(Y, 1, mu, log=TRUE))
    logqratio <- log(dmvnorm(beta, betahat, betavar)) -
                 log(dmvnorm(betastar, betahat, betavar))
    ar <- exp(logpiratio + logqratio)
    if(runif(1) < ar) {
      beta <- betastar
      mu <- mustar
      accept[r,m] <- 1
    }
  }
  posterior[r,,m] <- beta
86 Examine trace plots for evidence of convergence (or lack thereof)

[Figure: trace plots of the intercept, β0, and the behave coefficient, β1, against scan number]

360 BIO 233, Spring 2015
87 Acceptance rate for the Metropolis-Hastings algorithm:

> ##
> accrate <- round(apply(accept, 2, mean) * 100, 1)
> accrate

Results:

[Table: PSR, posterior median, 2.5% and 97.5% quantiles, and exp{beta} with lower/upper limits, for (Intercept), behave, age, wt, sbp, chol, and smoker; numerical values not recovered]

Again, the numerical results are virtually identical
- although the interpretation differs

361 BIO 233, Spring 2015
88 Confounding and Collapsibility

Linear regression

For a continuous response variable, consider two models:

E[Y | X, Z] = β0 + β1 X + β2 Z   (1)
E[Y | X] = α0 + α1 X             (2)

In model (1), β1 is a conditional parameter
- the contrast conditions on the value of Z

In model (2), α1 is a marginal parameter
- the contrast does not condition on anything

Q: How are these parameters related?

362 BIO 233, Spring 2015
89 It's straightforward to show that

E[Y | X] = E[ E[Y | X, Z] ]
         = Σ_z E[Y | X, Z = z] f_{Z|X}(Z = z | X)
         = β0 + β1 X + β2 E[Z | X]

So the marginal contrast equals

α1 = E[Y | X = x+1] − E[Y | X = x]
   = β1 + β2 { E[Z | X = x+1] − E[Z | X = x] }

- the expression within the brackets is the slope from a linear regression of Z on X

363 BIO 233, Spring 2015
90 Using this fact, we can write

α1 = β1 + β2 COV[X, Z] / V[X]

- the marginal contrast is the conditional contrast plus a bias term

Bias requires both β2 ≠ 0 and COV[X, Z] ≠ 0
- Z is related to Y
- Z is related to X
- i.e. Z is a confounder

The direction of the bias depends on the interplay between β2 and COV[X, Z]
- confounding bias may be positive or negative
- confounding may result in an estimate that is too big or too small

364 BIO 233, Spring 2015
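The omitted-variable identity above can be verified empirically; the following R sketch (not from the slides; the simulated data and coefficient values are arbitrary) fits both models to the same data:

```r
## Check alpha1 = beta1 + beta2 * COV[X,Z] / V[X] on simulated data
## (illustrative sketch; coefficient values chosen arbitrarily)
set.seed(233)
n <- 10000
X <- rnorm(n)
Z <- 0.5 * X + rnorm(n)               ## Z is associated with X
Y <- 1 + 2 * X + 3 * Z + rnorm(n)     ## beta1 = 2, beta2 = 3

alpha1 <- coef(lm(Y ~ X))["X"]        ## marginal contrast
beta   <- coef(lm(Y ~ X + Z))         ## conditional contrasts
alpha1.implied <- beta["X"] + beta["Z"] * cov(X, Z) / var(X)
## alpha1 and alpha1.implied agree exactly: for least squares the
## identity holds in-sample, with sample covariance and variance
```

Note that for OLS fits the identity is algebraic, not just asymptotic, so the two quantities match up to floating-point error.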
91 If either β2 = 0 or COV[X, Z] = 0 then β1 = α1

Therefore, if Z is a precision variable then β1 and α1 have
- different interpretations
- the same numerical value

However, as the name suggests, the standard error of β̂1 will be smaller than the standard error of α̂1

Suggests that adjusting for a precision variable is a good thing, even if one is interested in the marginal association

365 BIO 233, Spring 2015
92 Logistic regression

Q: Does the same hold for logistic regression?
- how are the marginal and conditional parameters related?

For a binary outcome, consider two models:

logit E[Y | X, Z] = β0 + β1 X + β2 Z   (3)
logit E[Y | X] = α0 + α1 X             (4)

The conditional odds ratio for a binary X is

θ_x^c = exp{β1} = [ P(Y = 1 | X = 1, Z) / P(Y = 0 | X = 1, Z) ] / [ P(Y = 1 | X = 0, Z) / P(Y = 0 | X = 0, Z) ]

- conditional on the value of Z

366 BIO 233, Spring 2015
93 The marginal odds ratio for X is

θ_x^m = exp{α1} = [ P(Y = 1 | X = 1) / P(Y = 0 | X = 1) ] / [ P(Y = 1 | X = 0) / P(Y = 0 | X = 0) ]

where

E[Y | X] = Σ_z E[Y | X, Z = z] f_{Z|X}(Z = z | X)

The relationship between the conditional contrast θ_x^c and the marginal contrast θ_x^m is not straightforward
- no simple, closed-form expression for θ_x^m as a function of θ_x^c

In particular, unlike in the setting of linear regression, they are not linearly related

367 BIO 233, Spring 2015
94 We can, however, calculate θ_x^m numerically

To do so, from the expression for E[Y | X], we need to specify
- E[Y | X, Z]
- f_{Z|X}(Z = z | X)

The first component is given by the logistic regression model:

logit E[Y | X, Z] = β0 + β1 X + β2 Z

For binary X and Z, it's convenient to represent f_{Z|X}(Z = z | X) via the logistic regression

logit E[Z | X] = γ0 + γ1 X

- notationally, let φ_XZ = exp{γ1} denote the X/Z odds ratio

368 BIO 233, Spring 2015
95 The following slides consider the percent difference:

100 × (θ_x^m − θ_x^c) / θ_x^c

under various scenarios for
- the conditional odds ratio for X, θ_x^c
- the conditional odds ratio for Z, θ_z^c
- the X/Z odds ratio, φ_XZ

Throughout, the following are held fixed:
- P(X = 1) = 0.2
- P(Z = 1 | X = 0) = 0.2
- P(Y = 1) = 0.1

R code is available on the course website

369 BIO 233, Spring 2015
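The calculation can be coded directly; below is a minimal R sketch for a single scenario (not the course code; for simplicity β0 is held fixed rather than calibrated, so P(Y = 1) will not equal 0.1 exactly):

```r
## Numerical marginal OR for binary X and Z given the conditional models
## logit E[Y|X,Z] = b0 + b1*X + b2*Z and logit E[Z|X] = g0 + g1*X
## (illustrative values; b0 fixed rather than solved for P(Y=1) = 0.1)
b0 <- -2.5; b1 <- log(2.0); b2 <- log(3.0)  ## theta_x^c = 2, theta_z^c = 3
g0 <- qlogis(0.2); g1 <- log(3.0)           ## P(Z=1|X=0) = 0.2, phi_XZ = 3

pY <- function(x) {                         ## P(Y=1|X=x), averaging over Z
  pZ <- plogis(g0 + g1 * x)
  (1 - pZ) * plogis(b0 + b1 * x) + pZ * plogis(b0 + b1 * x + b2)
}
odds <- function(p) p / (1 - p)

theta.m <- odds(pY(1)) / odds(pY(0))        ## marginal OR
theta.c <- exp(b1)                          ## conditional OR
100 * (theta.m - theta.c) / theta.c         ## percent difference
```

Sweeping `b1`, `b2`, and `g1` over grids reproduces the shape of the curves on the following slides.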
96-102 [Figures: percentage difference between θ_x^m and θ_x^c as a function of the conditional odds ratio for Z, θ_z^c, with curves for several values of θ_x^c (0.20, 0.50, 0.67, 1.00, 1.50, 2.00, ...); one panel per confounder/exposure association:
- strong: φ_XZ = 0.33 and φ_XZ = 3.00
- moderate: φ_XZ = 0.50 and φ_XZ = 2.00
- weak: φ_XZ = 0.80 and φ_XZ = 1.20
- none: φ_XZ = 1.00]

370-376 BIO 233, Spring 2015
103 As with linear regression, confounding bias may lead to marginal contrasts that are either bigger or smaller than the conditional contrast
- the true association may be of the opposite sign to the estimated association
- depends on whether θ_z^c and φ_XZ lie on the same side of 1 or on opposite sides

The magnitude of confounding bias depends on an interplay between θ_x^c, θ_z^c and φ_XZ

If φ_XZ = 1, so that Z is a precision variable, then θ_x^m still may not equal θ_x^c
- this difference is not confounding bias
- it is due to the non-collapsibility of the odds ratio

377 BIO 233, Spring 2015
104 In contrast to linear regression, if Z is a precision variable then θ_x^m and θ_x^c have
- different interpretations
- different numerical values

Q: How does one choose between the target parameters?

378 BIO 233, Spring 2015
105 Stratified designs

So far, we've considered estimation and inference based on an independent sample of size n, {(X_i, Y_i); i = 1, ..., n}, and the likelihood:

L = Π_{i=1}^n P(Y_i | X_i)

- parameterize P(Y | X) in terms of a regression model, µ = E[Y | X; β]
- learn about the regression coefficients, β

Prospective sampling: choose individuals on the basis of their covariates and observe their outcomes
- Y is random, conditional on X

379 BIO 233, Spring 2015
106 Cross-sectional sampling: choose individuals completely at random and observe their outcomes/covariates
- (Y, X) are jointly random, so that the likelihood is

L = Π_{i=1}^n P(Y_i, X_i) = Π_{i=1}^n P(Y_i | X_i) P(X_i)

- assume that the marginal covariate distribution does not provide information about the prospective association(s)
- base estimation/inference on

L = Π_{i=1}^n P(Y_i | X_i)

380 BIO 233, Spring 2015
107 In many settings, these sampling schemes are perfectly reasonable

However, there are settings where we may need a surprisingly large sample size to have reasonable power

King County birth weight data: examine power to detect an association between lbw and welfare based on the logistic model:

lbw ~ welfare + married + college + age + smoker + wpre

- use simulation to estimate power under a range of scenarios
- odds ratio: 1.5, 2.0, and 3.0
- sample size: 3,000 to 8,000
- Homework #6

381 BIO 233, Spring 2015
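A simulation-based power calculation of this kind can be sketched as follows (a simplified stand-in, not the homework solution: a single binary exposure and no adjustment covariates):

```r
## Estimate power for a logistic-regression Wald test by simulation
## (simplified sketch: one binary exposure, no other covariates)
power.sim <- function(n, b0, logOR, p.x = 0.1, nsim = 500) {
  reject <- replicate(nsim, {
    x <- rbinom(n, 1, p.x)                       ## exposure
    y <- rbinom(n, 1, plogis(b0 + logOR * x))    ## outcome
    fit <- summary(glm(y ~ x, family = binomial))
    fit$coefficients["x", "Pr(>|z|)"] < 0.05     ## Wald test at 5%
  })
  mean(reject)
}

set.seed(233)
## e.g., power to detect OR = 1.5 with n = 4,000 and ~5% baseline risk:
## power.sim(4000, qlogis(0.05), log(1.5))
```

Varying `n`, `logOR`, and `b0` traces out power curves like those on the next slide.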
108 [Figure: power for the welfare effect as a function of sample size, n]

- with a sample size of n = 8,000, we would have an estimated 67% power to detect an odds ratio of [value not recovered]

382 BIO 233, Spring 2015
109 That the outcome is rare is a key reason why power is so low
- incidence of 5.1% in the observed sample
- controlled in the simulation by manipulating the value of β0

As we draw random samples, we get very few LBW events
- see the direct impact on the standard error for the odds ratio association between a binary X and binary outcome Y:

se[θ̂] ≈ θ̂ √(1/n_00 + 1/n_01 + 1/n_10 + 1/n_11)

Q: What happens if we increase the incidence?

383 BIO 233, Spring 2015
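For a 2×2 table of counts n_yx, this delta-method standard error is easy to compute; a small R sketch with hypothetical counts:

```r
## Delta-method SE of the odds ratio from a 2x2 table of counts n[y, x]
## (hypothetical counts, chosen for illustration)
n <- matrix(c(1800, 150, 180, 30), nrow = 2,
            dimnames = list(y = c("0", "1"), x = c("0", "1")))

theta.hat <- (n["1", "1"] * n["0", "0"]) / (n["1", "0"] * n["0", "1"])
se.theta  <- theta.hat * sqrt(sum(1 / n))   ## small cells inflate the SE
```

With few events, the 1/n_1x terms dominate the sum, which is why rare outcomes translate directly into large standard errors.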
110 Repeat the simulations for the association between welfare and lbw
- manipulate β0 such that the incidence increases from 0.05 to 0.20
- fix the sample size at n = 4,000

Estimated power based on a Wald test:

[Table: estimated power by odds ratio (rows) and incidence (columns); numerical values not recovered]

As incidence increases, power increases
- the rate of increase is not dramatic because the exposure of interest (welfare) is also rare

384 BIO 233, Spring 2015
111 In practice, of course, we cannot manipulate incidence

But we can manipulate the (relative) number of cases and non-cases that we observe in the data
- i.e., artificially inflate the observed incidence
- for example, via a case-control design

The problem is that the sample is no longer representative of the target population
- the sample is non-random

But this non-randomness is by design
- under the control of the researcher
- such designs are referred to as biased sampling schemes
- use statistical techniques to account for the non-random sampling

385 BIO 233, Spring 2015
112 Case-control studies

In a case-control study, we initially stratify the population by outcome status
- know Y = 0/1 for everyone
- for any given individual, we can (easily) determine Y

Proceed by sampling, at random,
- n1 cases, i.e. individuals for whom Y = 1
- n0 non-cases or controls, i.e. individuals for whom Y = 0

For all n = n0 + n1 sampled individuals, observe the value of their covariates
- crucial: X is random and not Y

386 BIO 233, Spring 2015
113 The appropriate likelihood is

L_R = Π_{i=1}^n P(X_i | Y_i)
    = Π_{i=1}^{n0} P(X_i | Y_i = 0) × Π_{i=n0+1}^{n0+n1} P(X_i | Y_i = 1)

- n independent, outcome-specific contributions
- the retrospective likelihood

However, the scientific goal is (most often) to learn about prospective associations
- i.e., P(Y | X)

Q: How do we learn about prospective associations from the retrospective likelihood?

387 BIO 233, Spring 2015
114 Consider the logistic regression model:

logit P(Y = 1 | X) = X^T β

- the model corresponds to the target population of interest

As we've noted, case-control sampling is non-random with respect to the target population

Formalize this by introducing a random variable S that indicates selection by the sampling scheme:

S = 1 if selected, 0 if not selected

- a binary random variable with some probability, P(S = 1)

388 BIO 233, Spring 2015
115 Cross-sectional sampling
- selection is independent of (Y, X)
- P(S = 1) is constant

Prospective sampling
- selection depends on the covariate values, X
- write P(S = 1 | X)

Case-control sampling
- selection depends on outcome status, Y
- write P(S = 1 | Y = y)

389 BIO 233, Spring 2015
116 Now consider the distribution of the outcome, conditional on being selected:

P(Y = 1 | X, S = 1)

390 BIO 233, Spring 2015
117 Using Bayes' Theorem and noting that selection depends solely on Y:

P(Y = 1 | X, S = 1)
  = P(S = 1 | X, Y = 1) P(Y = 1 | X) / P(S = 1 | X)
  = P(S = 1 | X, Y = 1) P(Y = 1 | X) / Σ_{y=0}^{1} P(S = 1 | X, Y = y) P(Y = y | X)
  = P(S = 1 | Y = 1) P(Y = 1 | X) / Σ_{y=0}^{1} P(S = 1 | Y = y) P(Y = y | X)
  = π1 P(Y = 1 | X) / Σ_{y=0}^{1} π_y P(Y = y | X)

391 BIO 233, Spring 2015
118 Dividing the numerator and denominator by π0 P(Y = 0 | X):

P(Y = 1 | X, S = 1)
  = (π1/π0) exp{X^T β} / [ 1 + (π1/π0) exp{X^T β} ]
  = exp{β0* + β1 X_1 + ... + βK X_K} / [ 1 + exp{β0* + β1 X_1 + ... + βK X_K} ]

where

β0* = β0 + log(π1/π0)

392 BIO 233, Spring 2015
119 We see that P(Y = 1 | X, S = 1) has the same functional form as the desired logistic regression model
- if P(Y = 1 | X) is of logistic form then so is P(Y = 1 | X, S = 1)

The odds ratio relationships between X and Y are preserved despite the selection process
- in Homework #5, we saw that bias (for odds ratios) only arises when selection depends on both Y and X

The intercepts of the two logistic models are different, however

393 BIO 233, Spring 2015
120 All this suggests that, if the primary goal is to learn about odds ratio parameters, estimation/inference could proceed by forming a likelihood using these probabilities:

L_P = Π_{i=1}^n P(Y_i | X_i, S_i = 1)

- ignores the fact that the sample was obtained via a case-control scheme
- i.e., pretend that the sample was obtained prospectively

Use L_P to learn about {β0*, β1, ..., βK}

In principle, we can also learn about the intercept, β0, if we have information on the probabilities of selection π0 and π1:

β0 = β0* − log(π1/π0)

394 BIO 233, Spring 2015
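This can be seen in a small simulation; the R sketch below (not from the slides; population size and coefficient values are arbitrary) draws a case-control sample from a simulated population and fits the prospective logistic model anyway:

```r
## Case-control sampling: the log-OR is preserved, while the intercept
## shifts by log(pi1/pi0) (simulated population; values arbitrary)
set.seed(233)
N <- 200000
x <- rbinom(N, 1, 0.2)
y <- rbinom(N, 1, plogis(-3 + log(2) * x))  ## true: beta0 = -3, beta1 = log(2)

n1 <- 1000; n0 <- 1000                      ## sample cases and controls
cc <- c(sample(which(y == 1), n1), sample(which(y == 0), n0))
fit <- glm(y[cc] ~ x[cc], family = binomial)

## coef(fit)[2] estimates log(2); coef(fit)[1] estimates
## -3 + log(pi1/pi0), where pi1 = n1/sum(y == 1) and pi0 = n0/sum(y == 0)
```

Repeating the draw many times shows the slope centering on log(2) and the intercept on the shifted value, matching the algebra above.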
121 While this seems reasonable, showing that P(Y = 1 | X, S = 1) and P(Y = 1 | X) have the same functional form is not sufficient

Recall the retrospective likelihood:

L_R = Π_{i=1}^n P(X_i | Y_i)
    = Π_{i=1}^n P(X_i | Y_i, S_i = 1)
    = Π_{i=1}^n P(Y_i | X_i, S_i = 1) P(X_i | S_i = 1) / P(Y_i | S_i = 1)

- the components of L_P correspond to the first component of L_R, but L_P ignores the other terms

Crucially, the P(Y_i | X_i, S_i = 1) contributions are not independent of each other, as is assumed by L_P

395 BIO 233, Spring 2015
122 The true joint distribution of the outcomes {Y_1, ..., Y_n} is constrained by the sampling scheme
- the case-control sampling scheme dictates that there will be n0 controls and n1 cases
- so the {Y_1, ..., Y_n} cannot freely vary

To see this more formally, note that

L_R = Π_{i=1}^{n0} P(X_i | Y_i = 0) × Π_{i=n0+1}^{n0+n1} P(X_i | Y_i = 1)
    = Π_{i=1}^{n0} [ P(Y_i = 0 | X_i, S_i = 1) P(X_i | S_i = 1) / P(Y_i = 0 | S_i = 1) ]
      × Π_{i=n0+1}^{n0+n1} [ P(Y_i = 1 | X_i, S_i = 1) P(X_i | S_i = 1) / P(Y_i = 1 | S_i = 1) ]

396 BIO 233, Spring 2015
More informationLecture 4 Multiple linear regression
Lecture 4 Multiple linear regression BIOST 515 January 15, 2004 Outline 1 Motivation for the multiple regression model Multiple regression in matrix notation Least squares estimation of model parameters
More information12 Modelling Binomial Response Data
c 2005, Anthony C. Brooms Statistical Modelling and Data Analysis 12 Modelling Binomial Response Data 12.1 Examples of Binary Response Data Binary response data arise when an observation on an individual
More informationLinear model A linear model assumes Y X N(µ(X),σ 2 I), And IE(Y X) = µ(x) = X β, 2/52
Statistics for Applications Chapter 10: Generalized Linear Models (GLMs) 1/52 Linear model A linear model assumes Y X N(µ(X),σ 2 I), And IE(Y X) = µ(x) = X β, 2/52 Components of a linear model The two
More informationHST.582J / 6.555J / J Biomedical Signal and Image Processing Spring 2007
MIT OpenCourseWare http://ocw.mit.edu HST.582J / 6.555J / 16.456J Biomedical Signal and Image Processing Spring 2007 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
More informationNeural networks (not in book)
(not in book) Another approach to classification is neural networks. were developed in the 1980s as a way to model how learning occurs in the brain. There was therefore wide interest in neural networks
More informationBiostatistics Advanced Methods in Biostatistics IV
Biostatistics 140.754 Advanced Methods in Biostatistics IV Jeffrey Leek Assistant Professor Department of Biostatistics jleek@jhsph.edu 1 / 35 Tip + Paper Tip Meet with seminar speakers. When you go on
More informationCPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017
CPSC 340: Machine Learning and Data Mining MLE and MAP Fall 2017 Assignment 3: Admin 1 late day to hand in tonight, 2 late days for Wednesday. Assignment 4: Due Friday of next week. Last Time: Multi-Class
More informationLecture 10: Alternatives to OLS with limited dependent variables. PEA vs APE Logit/Probit Poisson
Lecture 10: Alternatives to OLS with limited dependent variables PEA vs APE Logit/Probit Poisson PEA vs APE PEA: partial effect at the average The effect of some x on y for a hypothetical case with sample
More informationR Hints for Chapter 10
R Hints for Chapter 10 The multiple logistic regression model assumes that the success probability p for a binomial random variable depends on independent variables or design variables x 1, x 2,, x k.
More informationLOGISTIC REGRESSION Joseph M. Hilbe
LOGISTIC REGRESSION Joseph M. Hilbe Arizona State University Logistic regression is the most common method used to model binary response data. When the response is binary, it typically takes the form of
More informationProportional hazards regression
Proportional hazards regression Patrick Breheny October 8 Patrick Breheny Survival Data Analysis (BIOS 7210) 1/28 Introduction The model Solving for the MLE Inference Today we will begin discussing regression
More informationIP WEIGHTING AND MARGINAL STRUCTURAL MODELS (CHAPTER 12) BIOS IPW and MSM
IP WEIGHTING AND MARGINAL STRUCTURAL MODELS (CHAPTER 12) BIOS 776 1 12 IPW and MSM IP weighting and marginal structural models ( 12) Outline 12.1 The causal question 12.2 Estimating IP weights via modeling
More informationGeneralized Linear Models. stat 557 Heike Hofmann
Generalized Linear Models stat 557 Heike Hofmann Outline Intro to GLM Exponential Family Likelihood Equations GLM for Binomial Response Generalized Linear Models Three components: random, systematic, link
More informationSTA102 Class Notes Chapter Logistic Regression
STA0 Class Notes Chapter 0 0. Logistic Regression We continue to study the relationship between a response variable and one or more eplanatory variables. For SLR and MLR (Chapters 8 and 9), our response
More informationBIOS 312: Precision of Statistical Inference
and Power/Sample Size and Standard Errors BIOS 312: of Statistical Inference Chris Slaughter Department of Biostatistics, Vanderbilt University School of Medicine January 3, 2013 Outline Overview and Power/Sample
More informationGeneral Regression Model
Scott S. Emerson, M.D., Ph.D. Department of Biostatistics, University of Washington, Seattle, WA 98195, USA January 5, 2015 Abstract Regression analysis can be viewed as an extension of two sample statistical
More informationGeneralized Estimating Equations
Outline Review of Generalized Linear Models (GLM) Generalized Linear Model Exponential Family Components of GLM MLE for GLM, Iterative Weighted Least Squares Measuring Goodness of Fit - Deviance and Pearson
More informationPoisson regression: Further topics
Poisson regression: Further topics April 21 Overdispersion One of the defining characteristics of Poisson regression is its lack of a scale parameter: E(Y ) = Var(Y ), and no parameter is available to
More informationStatistical Distribution Assumptions of General Linear Models
Statistical Distribution Assumptions of General Linear Models Applied Multilevel Models for Cross Sectional Data Lecture 4 ICPSR Summer Workshop University of Colorado Boulder Lecture 4: Statistical Distributions
More informationPOLI 8501 Introduction to Maximum Likelihood Estimation
POLI 8501 Introduction to Maximum Likelihood Estimation Maximum Likelihood Intuition Consider a model that looks like this: Y i N(µ, σ 2 ) So: E(Y ) = µ V ar(y ) = σ 2 Suppose you have some data on Y,
More informationGeneralized Linear Models Introduction
Generalized Linear Models Introduction Statistics 135 Autumn 2005 Copyright c 2005 by Mark E. Irwin Generalized Linear Models For many problems, standard linear regression approaches don t work. Sometimes,
More informationToday. HW 1: due February 4, pm. Aspects of Design CD Chapter 2. Continue with Chapter 2 of ELM. In the News:
Today HW 1: due February 4, 11.59 pm. Aspects of Design CD Chapter 2 Continue with Chapter 2 of ELM In the News: STA 2201: Applied Statistics II January 14, 2015 1/35 Recap: data on proportions data: y
More informationRecap. HW due Thursday by 5 pm Next HW coming on Thursday Logistic regression: Pr(G = k X) linear on the logit scale Linear discriminant analysis:
1 / 23 Recap HW due Thursday by 5 pm Next HW coming on Thursday Logistic regression: Pr(G = k X) linear on the logit scale Linear discriminant analysis: Pr(G = k X) Pr(X G = k)pr(g = k) Theory: LDA more
More informationNinth ARTNeT Capacity Building Workshop for Trade Research "Trade Flows and Trade Policy Analysis"
Ninth ARTNeT Capacity Building Workshop for Trade Research "Trade Flows and Trade Policy Analysis" June 2013 Bangkok, Thailand Cosimo Beverelli and Rainer Lanz (World Trade Organization) 1 Selected econometric
More information8 Nominal and Ordinal Logistic Regression
8 Nominal and Ordinal Logistic Regression 8.1 Introduction If the response variable is categorical, with more then two categories, then there are two options for generalized linear models. One relies on
More informationMultiple Regression Analysis
Multiple Regression Analysis y = 0 + 1 x 1 + x +... k x k + u 6. Heteroskedasticity What is Heteroskedasticity?! Recall the assumption of homoskedasticity implied that conditional on the explanatory variables,
More informationMultiple Regression Analysis
Multiple Regression Analysis y = β 0 + β 1 x 1 + β 2 x 2 +... β k x k + u 2. Inference 0 Assumptions of the Classical Linear Model (CLM)! So far, we know: 1. The mean and variance of the OLS estimators
More informationStat 579: Generalized Linear Models and Extensions
Stat 579: Generalized Linear Models and Extensions Yan Lu Jan, 2018, week 3 1 / 67 Hypothesis tests Likelihood ratio tests Wald tests Score tests 2 / 67 Generalized Likelihood ratio tests Let Y = (Y 1,
More informationBeyond GLM and likelihood
Stat 6620: Applied Linear Models Department of Statistics Western Michigan University Statistics curriculum Core knowledge (modeling and estimation) Math stat 1 (probability, distributions, convergence
More information