Lecture 9. Statistics 255 - Survival Analysis
Presented February 23, 2016
Dan Gillen, Department of Statistics, University of California, Irvine

Survival analysis involves subjects moving through time:
- The hazard may change over time
- The relative hazard may change over time
- Factors (covariates) determining the relative hazard may change over time

Why should we only examine covariates that are fixed at their baseline values?

Examples

1. In a study of Medicare patients, interest is in the effect of heart failure on mortality (age at death). We wish to allow the relative hazard to change as subjects experience sentinel events such as:
   - ischemia
   - arrhythmia
   - pump failure
   The health status of subjects is dynamic.
2. In HIV studies of the time from infection (sero-conversion) to AIDS, the risk of AIDS onset varies dramatically with the subject's CD4 T-cell count, which is a smooth function of time.
3. The relative hazard between high- and low-risk AML (leukemia) patients may decline with time.

These effects can be expressed with time-varying covariates:

1. $x_i(t) = \text{ischemia}_i(t)$: indicator for the $i$th subject of a history of ischemia before age $t$
2. $x_i(t) = \text{CD4}_i(t)$: the $i$th HIV patient's CD4 count at time $t$ since HIV sero-conversion
3. $x_{i1} = I(\text{risk}_i = \text{high})$ is the baseline high-risk indicator, and $x_{i2}(t) = x_{i1}\log(t)$ is the interaction of risk status with time

Conceptually, these effects are easy to include in a proportional hazards model. Why?

The partial likelihood for the proportional hazards model with time-fixed covariates is

$$L_P = \prod_{\text{failure times } j} \left( \frac{\exp(\beta^T x_{(j)})}{\sum_{i \in R_{(j)}} \exp(\beta^T x_i)} \right)$$

$L_P$ compares the covariate $x_{(j)}$ of the failed subject to the covariates $x_i$ in the risk set $R_{(j)}$ at time $t_{(j)}$.

But there is absolutely no reason that the covariate $x_i$ for the $i$th subject has to be the same for all times $t_{(j)}$.

The partial likelihood for the proportional hazards model with time-varying covariates is

$$L_P = \prod_{\text{failure times } j} \left( \frac{\exp\{\beta^T x_{(j)}(t_{(j)})\}}{\sum_{i \in R_{(j)}} \exp\{\beta^T x_i(t_{(j)})\}} \right)$$

where $x_i(t_{(j)})$ is the value of the $i$th subject's covariate at time $t_{(j)}$.

$L_P$ now compares the covariate $x_{(j)}(t_{(j)})$ of the failed subject at time $t_{(j)}$ to the covariates $x_i(t_{(j)})$ at time $t_{(j)}$ in the risk set $R_{(j)}$.

To formalize, consider the data

$$(y_1, \delta_1, x_1(t)), \ldots, (y_n, \delta_n, x_n(t)),$$

where
- $y_i$ = observed follow-up time
- $\delta_i$ = failure (vs. censoring) indicator
- $x_i(t)$ = vector covariate function of time $t$

Note that $x_i(t)$ is a set of values that (may) vary with $t$.

$R_{(j)}$ is the set of subjects at risk for failure at $t_{(j)}$.

Q: Who is in the risk set?
A: The covariate values at risk at $t_{(j)}$ are $\{x_i(t_{(j)}) : i \in R_{(j)}\}$.

The partial likelihood is constructed by comparing the risk given $x_{(j)}(t_{(j)})$ to the risk given all other $x_i(t_{(j)})$ for $i \in R_{(j)}$:

$$L_{(j)} = \frac{\Pr\{\text{subject with } x_{(j)}(t_{(j)}) \text{ fails at } t_{(j)}\}}{\Pr\{\text{some subject in } R_{(j)} \text{ failed at } t_{(j)}\}}$$

and

$$L_P = \prod_{\text{failure times } j} L_{(j)}$$

The model is

$$\lambda_i\{t \mid x_i(\cdot)\} = \lambda_0(t) \exp\{\beta^T x_i(t)\}$$
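The construction above can be checked numerically. The following is a minimal sketch, assuming a small made-up data set in (start, stop] form with a single covariate `x` and no tied failure times; the data values and the helper name `log.pl` are hypothetical and not part of the lecture. It evaluates the log partial likelihood by hand and compares it with the value `coxph()` reports when held at a fixed coefficient.

```r
library(survival)

# Hypothetical toy data in (start, stop] form; subjects 2 and 4 change x once
d <- data.frame(
  id    = c(1, 2, 2, 3, 4, 4, 5),
  start = c(0, 0, 3, 0, 0, 2, 0),
  stop  = c(4, 3, 7, 5, 2, 9, 6),
  event = c(1, 0, 1, 0, 0, 1, 1),
  x     = c(0, 0, 1, 1, 0, 1, 0)
)

# Log partial likelihood for a scalar beta, assuming no tied failure times
log.pl <- function(beta, d) {
  ev <- d[d$event == 1, ]
  sum(sapply(seq_len(nrow(ev)), function(j) {
    tj <- ev$stop[j]
    # Risk set: records whose interval (start, stop] contains t_(j)
    at.risk <- d$start < tj & d$stop >= tj
    beta * ev$x[j] - log(sum(exp(beta * d$x[at.risk])))
  }))
}

# Hold coxph at beta = 0.5 (zero iterations) and read off its log-likelihood
fit0 <- coxph(Surv(start, stop, event) ~ x, data = d, init = 0.5,
              control = coxph.control(iter.max = 0))

c(by.hand = log.pl(0.5, d), coxph = fit0$loglik[1])
```

The only place the time-varying covariate enters is the `at.risk` line: each record contributes its own value of `x` to exactly those risk sets that fall inside its interval.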

Note: The cumulative hazard function for the $i$th subject is then

$$\Lambda_i\{t \mid x_i(\cdot)\} = \int_0^t \lambda_i\{s \mid x_i(\cdot)\}\, ds = \int_0^t \lambda_0(s) \exp\{\beta^T x_i(s)\}\, ds$$

which no longer factors into $\Lambda_0(t)$ and $\exp(\beta^T x_i)$.

- The baseline cumulative hazard $\Lambda_0(t)$ now refers to a subject with $x_i(t) = 0$ for all $t$
- $\lambda_0(t)$, $\Lambda_0(t)$, $S_0(t)$ may no longer have a clear interpretation
- Recovering $\Lambda_i\{t \mid x_i(\cdot)\}$ is more difficult than before, and perhaps less meaningful
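As a worked instance of this integral, consider (as an illustrative assumption, not a case taken from the slides) a single binary covariate that switches from 0 to 1 at a subject-specific time $t_i^*$, so that $x_i(s) = I(s > t_i^*)$ with scalar coefficient $\beta$. Splitting the integral at $t_i^*$ gives:

```latex
\Lambda_i\{t \mid x_i(\cdot)\} =
\begin{cases}
  \Lambda_0(t), & t \le t_i^* \\[4pt]
  \Lambda_0(t_i^*) + e^{\beta}\,\{\Lambda_0(t) - \Lambda_0(t_i^*)\}, & t > t_i^*
\end{cases}
```

The result depends on the whole covariate path through $t_i^*$, which is why it cannot be written as $\Lambda_0(t)$ times a single constant.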

Consider the bone-marrow transplant data (Section 1.3, K & M):
- the time origin is the time of transplant
- the failure event is relapse or death

Intermediate events that could affect the hazard for death or relapse include:
- development of acute graft-versus-host disease (agvhd)
- development of chronic graft-versus-host disease (cgvhd)
- return of platelet count to a self-sustaining level (platelet recovery)

We are interested here in the effect of platelet recovery (iplate), a binary variable taking value 0 at t = 0 and, for some subjects, taking value 1 at some time tplate.

Let's look at the data:

    > bmt <- read.csv( "http://www.ics.uci.edu/~dgillen/STAT295/Data/bmt.csv" )
    > bmt <- bmt[ order(bmt$tnodis), ]
    > bmt <- bmt[,-1]
    > bmt <- cbind( 1:dim(bmt)[1], bmt )
    > names(bmt)[1] <- "id"

    ### Center age of patient and donor for later interpretation
    > bmt$agep.c = bmt$agep - 28
    > bmt$aged.c = bmt$aged - 28
    > bmt[1:25,c("id","tnodis","inodis", "iplate", "tplate")]
        id tnodis inodis iplate tplate
    35   1      1      1      0      1
    108  2      2      1      0      2
    84   3     10      1      0     10
    129  4     16      1      0     16
    114  5     32      1      1     16
    87   6     35      1      0     35
    109  7     47      1      1     11
    116  8     47      1      1     28
    80   9     48      1      1     14
    132 10     48      1      1     30

Data analysis set-up

For a subject who never experiences platelet recovery, iplate = 0 for all times t up to the end of observation, tnodis. This is like one observation with one exit time.

For a subject experiencing platelet recovery, think of the data as two observations:
- first, an observation with iplate = 0, entering at time t = 0 and censored at time t = tplate
- then, an observation with iplate = 1, entering the risk set just after time t = tplate and exiting at time t = tnodis

Data analysis set-up

    ### Create a duplicate record for each subject
    > bmt.tvc <- bmt[rep(1:dim(bmt)[1],each=2),]
    > id.1 <- !duplicated( bmt.tvc$id )
    > id.2 <- duplicated( bmt.tvc$id )

    ### Deal with first record, set start, stop, event indicator,
    ### and covariate value
    ### (start is 0 for every first record and for anyone who never recovers)
    > bmt.tvc$start[ id.1 | bmt.tvc$iplate==0 ] <- 0
    > bmt.tvc$stop[ id.1 & bmt.tvc$iplate==1 ] <- bmt.tvc[ id.1 & bmt.tvc$iplate==1, ]$tplate
    > bmt.tvc$inodis[ id.1 & bmt.tvc$iplate==1 ] <- 0
    > bmt.tvc$iplate[ id.1 ] <- 0

    ### Deal with second record, set start and stop
    > bmt.tvc$start[ id.2 & bmt.tvc$iplate==1 ] <- bmt.tvc[ id.2 & bmt.tvc$iplate==1, ]$tplate
    > bmt.tvc$stop[ id.2 ] <- bmt.tvc[ id.2, ]$tnodis

Data analysis set-up

    ### Remove records with missing stop value
    ### (these did not change iplate status)
    > bmt.tvc <- bmt.tvc[!is.na(bmt.tvc$stop), ]
    > bmt.tvc[1:20,c("id","tnodis", "start", "stop", "inodis", "iplate")]
          id tnodis start stop inodis iplate
    35.1   1      1     0    1      1      0
    108.1  2      2     0    2      1      0
    84.1   3     10     0   10      1      0
    129.1  4     16     0   16      1      0
    114    5     32     0   16      0      0
    114.1  5     32    16   32      1      1
    87.1   6     35     0   35      1      0
    109    7     47     0   11      0      0
    109.1  7     47    11   47      1      1
    116    8     47     0   28      0      0
    116.1  8     47    28   47      1      1
    80     9     48     0   14      0      0
    80.1   9     48    14   48      1      1
    132   10     48     0   30      0      0
    132.1 10     48    30   48      1      1
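The same (start, stop] layout can also be built with `tmerge()` from the survival package rather than by duplicating records by hand. This is a sketch of an alternative, not the method used in the lecture. The toy data frame copies four rows of the printed bmt output (ids 3, 5, 6, 7, renumbered 1 to 4); the column `tplate.na` and the new variable names `fail` and `plate` are assumptions made for illustration.

```r
library(survival)

# Four subjects copied from the printed bmt rows (ids renumbered 1..4)
toy <- data.frame(
  id     = 1:4,
  tnodis = c(10, 32, 35, 47),
  inodis = c(1, 1, 1, 1),
  iplate = c(0, 1, 0, 1),
  tplate = c(10, 16, 35, 11)
)

# tplate equals tnodis when there was no recovery; recode those as NA so
# that only true recovery times create a change point
toy$tplate.na <- ifelse(toy$iplate == 1, toy$tplate, NA)

# event() sets follow-up and the failure indicator; tdc() creates a 0/1
# covariate that switches to 1 at the recovery time
toy.tvc <- tmerge(toy, toy, id = id,
                  fail  = event(tnodis, inodis),
                  plate = tdc(tplate.na))

toy.tvc[, c("id", "tstart", "tstop", "fail", "plate")]
```

Subjects with a recovery time contribute two rows and the others one, so the result should have the same shape as the hand-built `bmt.tvc` records for these subjects, with `tstart`, `tstop`, `fail`, and `plate` playing the roles of `start`, `stop`, `inodis`, and `iplate`.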

Data analysis set-up

Interpretation of what we have so far:
- id indicates multiple observations on a single subject
- start is the entry time for each observation
- stop is the exit time for each observation
- the exit time for the previous observation is used as the entry time for the following observation

To create the Surv() response we can now specify Surv(start, stop, event):

    ### Look at the Surv() object
    > Surv( bmt.tvc$start, bmt.tvc$stop, bmt.tvc$inodis)
    [1] ( 0, 1 ] ( 0, 2 ] ( 0, 10 ] ( 0, 16 ] ( 0, 16+] ( 16, 32 ] ( 0, 35 ]

Model fitting then proceeds by calling coxph() as before, with the "expanded" Surv() object:

    ### Fit Cox model with time-varying covariate iplate
    > fit <- coxph( Surv( start, stop, inodis) ~ fab + agep.c*aged.c + factor(g) + imtx + iplate, data=bmt.tvc )
    > summary(fit)
                      coef exp(coef) se(coef)     z Pr(>|z|)
    fab            0.82003   2.27056  0.28379  2.89   0.0039 **
    agep.c         0.00588   1.00590  0.01985  0.30   0.7671
    aged.c         0.00505   1.00506  0.01785  0.28   0.7773
    factor(g)2    -0.95837   0.38352  0.36382 -2.63   0.0084 **
    factor(g)3    -0.36327   0.69540  0.37352 -0.97   0.3308
    imtx           0.23177   1.26083  0.25775  0.90   0.3685
    iplate        -0.94348   0.38927  0.34109 -2.77   0.0057 **
    agep.c:aged.c  0.00288   1.00289  0.00093  3.10   0.0019 **

                  exp(coef) exp(-coef) lower.95 upper.95
    fab               2.271      0.440    1.302    3.960
    agep.c            1.006      0.994    0.968    1.046
    aged.c            1.005      0.995    0.971    1.041
    factor(g)2        0.384      2.607    0.188    0.782
    factor(g)3        0.695      1.438    0.334    1.446
    imtx              1.261      0.793    0.761    2.090
    iplate            0.389      2.569    0.199    0.760
    agep.c:aged.c     1.003      0.997    1.001    1.005

From the Wald test on the previous page, we obtain a p-value of 0.0057, indicating that platelet recovery is significantly associated with a lower hazard of relapse or death.

We can also perform a likelihood ratio test...

    ### Fit Cox model without time-varying covariate for LRT
    > fit.red <- coxph( Surv( start, stop, inodis) ~ fab + agep.c*aged.c + factor(g) + imtx, data=bmt.tvc )
    > anova( fit.red, fit )
    Analysis of Deviance Table
     Cox model: response is Surv(start, stop, inodis)
     Model 1: ~ fab + agep.c * aged.c + factor(g) + imtx
     Model 2: ~ fab + agep.c * aged.c + factor(g) + imtx + iplate
      loglik Chisq Df P(>|Chi|)
    1   -356
    2   -353  6.53  1     0.011 *

Interpretation: Adjusting for FAB, group, ages of patient and donor, and MTX assignment, a randomly sampled subject who has had a platelet recovery by any given time t has about 2/5 the hazard of relapse or death of a (similar) subject at the same time t who has not yet experienced platelet recovery.

Conclusion: Platelet recovery appears to be an important indicator for successful recovery, or at least for delayed relapse.
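The "about 2/5" figure and its confidence interval follow directly from the printed coefficient and standard error for iplate. A minimal sketch of that arithmetic, using only the two numbers shown in the summary output (the variable names are illustrative):

```r
# Coefficient and standard error for iplate, copied from the summary output
b.iplate  <- -0.94348
se.iplate <-  0.34109

# Hazard ratio and Wald 95% confidence interval on the hazard-ratio scale
hr <- exp(b.iplate)
ci <- exp(b.iplate + c(-1, 1) * qnorm(0.975) * se.iplate)

round(c(hr = hr, lower = ci[1], upper = ci[2]), 3)
```

The rounded values reproduce the exp(coef), lower.95, and upper.95 entries printed for iplate (0.389, 0.199, 0.760).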

Recall we had a model with several covariates, and we were considering the effect of MTX (0 or 1), indicating receipt of a GVH prophylactic.

We had concluded via a graphical investigation:
- The hazards for the two MTX groups do not appear to be proportional

Can we model / test this conclusion?

Our original model was

$$\lambda_i(t \mid x_i) = \lambda_0(t) \exp(\eta_i)$$

where
- $\eta_i = \beta^T x_i$
- $x_{ij}$ = ages of patient and donor, fab, group (for $j = 1, \ldots, 6$), and
- $x_{i7} = \text{imtx}_i$

$\beta_7$ is the log hazard ratio at all times $t$ for imtx = 1 versus imtx = 0, holding the other $x$'s constant.

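Consider the alternative

$$\frac{\lambda_i(t \mid \text{imtx}_i = 1)}{\lambda_i(t \mid \text{imtx}_i = 0)} = t^{\beta_8} \exp(\beta_7)$$

This expresses a departure from the proportional hazards assumption:

$$\beta_8 = 0 \;\Rightarrow\; t^{\beta_8} = 1 \text{ for all } t, \quad\text{so}\quad \frac{\lambda_i(t \mid \text{imtx} = 1)}{\lambda_i(t \mid \text{imtx} = 0)} = \exp(\beta_7)$$

Testing $H_0: \beta_8 = 0$ will provide a test of the proportional hazards assumption.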
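To see where the factor $t^{\beta_8}$ comes from, suppose (as the form of the ratio suggests) that the linear predictor gains one term, $\beta_8\,\text{imtx}_i \log(t)$, and write $\eta_i^{*}$ for the part of the linear predictor that does not involve imtx. The symbol $\eta_i^{*}$ is introduced here only for this sketch:

```latex
\frac{\lambda_i(t \mid \text{imtx}_i = 1)}{\lambda_i(t \mid \text{imtx}_i = 0)}
  = \frac{\lambda_0(t)\exp\{\eta_i^{*} + \beta_7 + \beta_8 \log(t)\}}
         {\lambda_0(t)\exp\{\eta_i^{*}\}}
  = \exp(\beta_7)\,\exp\{\beta_8 \log(t)\}
  = t^{\beta_8}\exp(\beta_7)
```

The baseline hazard and the other covariates cancel, so the hazard ratio depends on time only through $\log(t)$, and only when $\beta_8 \neq 0$.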

The model is now

$$\lambda_i\{t \mid x_i(\cdot)\} = \lambda_0(t) \exp\{\beta^T x_i(t)\}$$

with $x_{i7} = \text{imtx}_i$ and $x_{i8}(t) = \text{imtx}_i \log(t)$, i.e. a covariate by (log) time interaction...

Q: How do we test this in R?
A: Create a data set with one observation for each person at each observed failure time:
- entry: just after the last failure time $t_{(j-1)}$
- exit: the observed failure time $t_{(j)}$

Note: Fitting such a model is computationally intensive and can be difficult for large datasets (with many failures)...

Data analysis set-up

Here is one way we can obtain the necessary dataset...

    ##### Create dataset to look at an interaction with time
    u.evtimes <- unique( bmt$tnodis[ bmt$inodis==1 ] )
    num.event <- length( u.evtimes )
    bmt.texpand <- bmt[, c("id", "tnodis", "inodis", "fab", "agep.c", "aged.c", "g", "imtx") ]

    ##### Replicate each record for the number of observed failures
    bmt.texpand <- bmt.texpand[ rep(bmt.texpand$id,each=num.event), ]
    bmt.texpand$start <- rep( c(0,u.evtimes[1:(num.event-1)]), sum(!duplicated(bmt.texpand$id)) )
    bmt.texpand$stop <- rep( u.evtimes, sum(!duplicated(bmt.texpand$id)) )

    ##### Remove unnecessary rows and create event indicators
    ##### (reverse the rows so that !duplicated() flags each subject's last interval)
    bmt.texpand <- bmt.texpand[ bmt.texpand$tnodis > bmt.texpand$start, ]
    bmt.texpand <- bmt.texpand[ dim(bmt.texpand)[1]:1, ]
    bmt.texpand$stop <- ifelse( !duplicated(bmt.texpand$id), bmt.texpand$tnodis, bmt.texpand$stop )
    bmt.texpand$inodis <- ifelse( !duplicated(bmt.texpand$id), bmt.texpand$inodis, 0 )
    bmt.texpand <- bmt.texpand[ dim(bmt.texpand)[1]:1, ]

Now, our scientific question is whether the effect of imtx, as measured by the hazard ratio, changes with respect to time.

This can be parameterized and tested using a multiplicative interaction:

    ### Test whether effect of imtx varies with time
    > fit <- coxph( Surv( start, stop, inodis) ~ fab + agep.c*aged.c + factor(g) + imtx + imtx:log(stop), data=bmt.texpand )
    > summary(fit)
                        coef exp(coef)  se(coef)     z Pr(>|z|)
    fab             0.881860  2.415388  0.278461  3.17   0.0015 **
    agep.c          0.003399  1.003405  0.019918  0.17   0.8645
    aged.c          0.000496  1.000496  0.018114  0.03   0.9782
    factor(g)2     -1.014874  0.362448  0.362198 -2.80   0.0051 **
    factor(g)3     -0.329875  0.719014  0.368384 -0.90   0.3705
    imtx            2.709578 15.022935  1.154639  2.35   0.0189 *
    agep.c:aged.c   0.003050  1.003054  0.000955  3.19   0.0014 **
    imtx:log(stop) -0.479984  0.618793  0.225561 -2.13   0.0333 *

    ### Test whether effect of imtx varies with time (summary output, continued)
    > summary(fit)
                   exp(coef) exp(-coef) lower.95 upper.95
    fab                2.415     0.4140    1.399    4.169
    agep.c             1.003     0.9966    0.965    1.043
    aged.c             1.000     0.9995    0.966    1.037
    factor(g)2         0.362     2.7590    0.178    0.737
    factor(g)3         0.719     1.3908    0.349    1.480
    imtx              15.023     0.0666    1.563  144.406
    agep.c:aged.c      1.003     0.9970    1.001    1.005
    imtx:log(stop)     0.619     1.6160    0.398    0.963

The Wald test indicates a time-varying effect of imtx (z-statistic of -2.13 and p-value of 0.0333, from the imtx:log(stop) row of the summary output).

We can also look at a likelihood ratio test...

    ### Fit Cox model without interaction for LRT
    Analysis of Deviance Table
     Cox model: response is Surv(start, stop, inodis)
     Model 1: ~ fab + agep.c * aged.c + factor(g) + imtx
     Model 2: ~ fab + agep.c * aged.c + factor(g) + imtx + imtx:log(stop)
      loglik Chisq Df P(>|Chi|)
    1   -356
    2   -354  5.36  1     0.021 *
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
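For larger data sets, the survival package offers two ways to fit the same covariate-by-log(time) interaction without writing the expansion by hand: the `tt` argument of `coxph()`, and `survSplit()` to cut follow-up at the failure times. The sketch below is an alternative to the lecture's code, run on simulated data rather than the bmt data; the data-generating choices (Weibull shapes, a 365-day administrative censoring time) and all object names are assumptions for illustration.

```r
library(survival)

# Simulated two-arm data (hypothetical): the Weibull shape differs by arm,
# so the hazard ratio for trt changes over time
set.seed(255)
n     <- 300
trt   <- rbinom(n, 1, 0.5)
t.raw <- rweibull(n, shape = ifelse(trt == 1, 0.7, 1), scale = 100)
sim   <- data.frame(trt    = trt,
                    time   = pmin(t.raw, 365),
                    status = as.numeric(t.raw <= 365))

# (a) tt(): coxph evaluates trt * log(t) at each failure time internally
fit.tt <- coxph(Surv(time, status) ~ trt + tt(trt), data = sim,
                tt = function(x, t, ...) x * log(t))

# (b) explicit expansion: one record per subject per failure-time interval
cuts      <- sort(unique(sim$time[sim$status == 1]))
sim.split <- survSplit(Surv(time, status) ~ ., data = sim, cut = cuts)
fit.split <- coxph(Surv(tstart, time, status) ~ trt + trt:log(time),
                   data = sim.split)

# Both fits maximize the same partial likelihood
rbind(tt = coef(fit.tt), split = coef(fit.split))
```

The two sets of coefficients should agree up to numerical tolerance. The `tt()` route avoids storing the expanded data set, which is the main cost noted above when there are many failures.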

Interpretation:
- When log(t) = 0, i.e. at t = 1, the estimated relative hazard for an imtx = 1 subject compared to an imtx = 0 subject is 15.0, CI [1.6, 144]
- At 1 year (t = 365), the estimated log relative hazard for an imtx = 1 subject compared to an imtx = 0 subject is log(15.02) + log(0.619) × log(365) = -0.122, giving a relative hazard estimate of exp(-0.122) = 0.885
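The estimated hazard ratio at any time t is exp(b7 + b8 log t), so it can be tabulated directly from the two printed coefficients. A small sketch using the imtx and imtx:log(stop) coefficients as printed in the summary(fit) output; the function name `hr.at` and the chosen time points are illustrative.

```r
# Coefficients copied from the summary(fit) output
b.imtx <-  2.709578   # imtx
b.int  <- -0.479984   # imtx:log(stop)

# Estimated hazard ratio (imtx = 1 vs imtx = 0) at time t, in days
hr.at <- function(t) exp(b.imtx + b.int * log(t))

round(hr.at(c(1, 30, 100, 365)), 3)

# Time at which the estimated hazard ratio crosses 1: about 283 days
t.cross <- exp(-b.imtx / b.int)
t.cross
```

Point estimates only: a confidence interval at a given t also needs the covariance between the two coefficient estimates, which the printed summary does not show.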

The lincontr.coxph() function makes this easy...

    ### Estimated effect of imtx at 1 year (365 days)
    > lincontr.coxph( fit, contr.names=c("imtx","imtx:log(stop)"), contr.coef=c(1,log(365)) )
    Test of H_0: exp( 1*imtx + 5.8999*imtx:log(stop) ) = 1 :
      exp( Est ) se.est  zstat  pval ci95.lo ci95.hi
    1      0.885  0.336 -0.363 0.716   0.458   1.711

Conclusion: The relative risk of relapse or death for MTX patients is very high early in the study, but drops off later in the study. The hazards are not proportional (likelihood ratio $X^2 = 5.36$ on 1 df).
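lincontr.coxph() is not part of the survival package, so the following is a hedged sketch of what such a linear-contrast helper might compute, written from the standard Wald formulas rather than from the course's source code. The function name `lin.contrast`, its arguments, and the simulated data are all hypothetical.

```r
library(survival)

# Hypothetical helper: Wald inference for exp(c' beta) from a coxph fit
lin.contrast <- function(fit, contr.names, contr.coef, level = 0.95) {
  b   <- coef(fit)[contr.names]
  V   <- vcov(fit)[contr.names, contr.names, drop = FALSE]
  est <- sum(contr.coef * b)
  se  <- sqrt(as.numeric(t(contr.coef) %*% V %*% contr.coef))
  z   <- qnorm(1 - (1 - level) / 2)
  c(exp.est = exp(est), se.est = se, zstat = est / se,
    pval  = 2 * pnorm(-abs(est / se)),
    ci.lo = exp(est - z * se), ci.hi = exp(est + z * se))
}

# Simulated example data (not the bmt data)
set.seed(1)
n      <- 200
x1     <- rbinom(n, 1, 0.5)
x2     <- rnorm(n)
time   <- rexp(n, rate = 0.01 * exp(0.5 * x1 - 0.3 * x2))
status <- rbinom(n, 1, 0.8)
fit.sim <- coxph(Surv(time, status) ~ x1 + x2)

# Combined effect of x1 = 1 together with a one-unit increase in x2
lin.contrast(fit.sim, c("x1", "x2"), c(1, 1))

# Sanity check: a contrast on x1 alone should match summary()
res.x1 <- lin.contrast(fit.sim, c("x1", "x2"), c(1, 0))
```

With a fit containing imtx and imtx:log(stop), the same call with contrast coefficients c(1, log(365)) would give the hazard ratio at one year, provided the coefficient names match those in coef(fit).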

Modeling with time-varying covariates

- With time-varying covariates, it is much more difficult to interpret the model in terms of survival time
  - the baseline survival function may not have a real meaning
  - you are always safe to maintain interpretation at the hazard level
- Keep your eyes on the risk sets!
- Time-varying covariate methods can be used to test the proportional hazards assumption

Modeling with time-varying covariates

Model-building strategy:
- Follow a similar strategy to that for time-fixed covariates
- Treat true time-varying covariates on equal footing with other covariates
- Treat interactions of time-fixed covariates with log(t) (used to test the proportional hazards assumption) on equal footing with other interactions:
  - Examine their effects later in the model-building strategy
  - Go for parsimony, if possible

Modeling with time-varying covariates

- Need to be careful about the functional form of t in the model:
  - It is a good idea to categorize time. For example, consider the hazard ratio associated with your predictor of interest over 0-6 months, 6-12 months, etc. (see Homework 4)
- Can also consider Schoenfeld residuals... coming soon
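One way to categorize time is to cut follow-up into bands and estimate a separate hazard ratio within each band. A minimal sketch using `survSplit()` on simulated data; the cut points (182 and 365 days), the 730-day censoring time, and all object names are assumptions for illustration, not the Homework 4 solution.

```r
library(survival)

# Simulated two-arm data (hypothetical) with a hazard ratio that changes over time
set.seed(9)
n     <- 400
trt   <- rbinom(n, 1, 0.5)
t.raw <- rweibull(n, shape = ifelse(trt == 1, 0.7, 1), scale = 300)
d     <- data.frame(id     = 1:n,
                    trt    = trt,
                    time   = pmin(t.raw, 730),
                    status = as.numeric(t.raw <= 730))

# Split follow-up at roughly 6 and 12 months; 'band' indexes the interval
d.band <- survSplit(Surv(time, status) ~ ., data = d,
                    cut = c(182, 365), episode = "band")

# One log hazard ratio for trt within each time band
fit.band <- coxph(Surv(tstart, time, status) ~ trt:factor(band), data = d.band)

round(exp(coef(fit.band)), 3)
```

Each coefficient is the log hazard ratio for trt within one band, so a step-function view of the time-varying effect is read off directly, without committing to a log(t) form.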