Statistics 255 - Survival Analysis
Dan Gillen
Department of Statistics, University of California, Irvine
Presented March 3, 2016
First question: Are the data truly discrete?
- Number of attempts at a puzzle before it is solved
- Number of grades completed before dropping out of school
- Number of doses required for a given effect to be observed
- Number of inseminations of cows required to achieve pregnancy
- Years measured by the number of rings on a tree (time is the number of seasonal cycles)
- Number of screening visits to detect recurrence of disease
If failures are really instantaneous and the time measure is really continuous, then the data are not truly discrete, but are interval censored...
1. Alzheimer's disease cohort study
   - Enter into cohort with normal cognition
   - Annual neuropsychological testing to assess conversion to a demented state
2. Breast cancer screening
   - Time from birth until the development of BC
   - Subjects screened every 5 years until age 40, then each year after that
3. Maintenance therapy studies
   - Cohort of cancer patients in remission
   - Time to disease recurrence
   - Patients regularly screened
Types of interval censoring
Fixed interval censoring
- Researcher (or some external process) selects a fixed screening interval
- Every subject is screened according to the defined intervals
Random interval censoring
- Screening intervals vary from screening to screening and from person to person
Independent interval censoring
- Occurrence of screening and lengths of intervals are independent of the failure times (conditional on covariates)
Types of interval censoring
Example of where independence does not hold...
- Relapse prevention for ulcers comparing two treatments
- Regular screening intervals for endoscopy (6 months)
- Additional endoscopies at other visits, due to symptoms or other health problems
- Effective screening intervals are not independent of relapse
We will consider fixed interval censoring...
Notation
Suppose patients are followed up or screened at times t_1, t_2, ..., t_j, ..., t_J
"Complete" right-censored data are given by:
(Y_1, δ_1, x_1), ..., (Y_n, δ_n, x_n)
"Observed" interval-censored data are given by:
(Y*_1, δ_1, x_1), ..., (Y*_n, δ_n, x_n), where Y*_i = j if Y_i ∈ (t_{j-1}, t_j]
Analysis Strategies
Univariate Analysis
- Cohort life table analysis
- Assume censoring and death are uniformly distributed within each interval
Regression Analysis
- Fixed-interval proportional hazards model
Consider the proportional hazards model for right-censored data:
λ_i(t | x_i) = λ_0(t) exp(β^T x_i)
so that
S_i(t | x_i) = [S_0(t)]^{exp(β^T x_i)}
Now consider this model in the interval-censored setting, so that
S_i(t_j | x_i) = [S_0(t_j)]^{exp(β^T x_i)} = Pr[T > t_j | x_i] = Pr[surviving the j-th interval | x_i]
Now consider the conditional probability of failing in an interval given survival up to the start of the interval:
Pr[failing in j-th interval | survived the first j-1 intervals, x_i]
  = [S_i(t_{j-1} | x_i) - S_i(t_j | x_i)] / S_i(t_{j-1} | x_i)
  = 1 - S_i(t_j | x_i) / S_i(t_{j-1} | x_i)
  = 1 - [S_0(t_j) / S_0(t_{j-1})]^{exp(β^T x_i)}
Now, let
π_j = Pr[failing in j-th interval | survived the first j-1 intervals, x_i]
so that
π_j = 1 - [S_0(t_j) / S_0(t_{j-1})]^{exp(β^T x_i)}
1 - π_j = [S_0(t_j) / S_0(t_{j-1})]^{exp(β^T x_i)}
log(1 - π_j) = exp(β^T x_i) log[S_0(t_j) / S_0(t_{j-1})] = -exp(β^T x_i) [Λ_0(t_j) - Λ_0(t_{j-1})]
Thus we have
log[-log(1 - π_j)] = log[Λ_0(t_j) - Λ_0(t_{j-1})] + β^T x_i ≡ γ_j + β^T x_i
This is just a binary regression model (i.e. a GLM) with a complementary log-log (cloglog) link and interval-specific intercepts.
Provided that J (the number of intervals) is not too large, this model can be fit with ordinary software.
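A minimal sketch of this recipe (hypothetical objects: a data frame long with one row per subject-interval at risk, an event indicator fail equal to 1 only in the interval where the failure occurred, an interval index interval, and a covariate x; names are illustrative, not from the example that follows):

## Fixed-interval PH model fit as a binary GLM with a cloglog link
fit.sketch <- glm( fail ~ factor(interval) + x,
                   family = binomial(link="cloglog"), data = long )
## exp(coef) for x estimates the hazard ratio from the underlying
## proportional hazards model; the interval terms estimate the gamma_j

The worked example below constructs exactly this kind of person-interval data set.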
(Section 1.14, K & M)
- Outcome: time to cessation of breast feeding (weeks)
- Covariate of interest: smoking, adjusting for race (race = 1 (White), race = 2 (Black), race = 3 (other))
- This is illustrative, so let's do the Cox model using the Efron approximation for ties (for comparison)
- I cannot compute the exact partial likelihood in R (on my computer)
(Section 1.14, K & M)
Cox model with Efron approximation

> bfeed <- read.table( "http://www.ics.uci.edu/~dgillen/STAT255/bfeed.txt" )
> names( bfeed ) <- c( "duration", "icompbf", "racemom", "poverty",
+                      "momsmoke", "momdrink", "momage",
+                      "yob", "momeduc", "precare" )
> bfeed <- bfeed[ order(bfeed$duration), ]
> bfeed$id <- 1:dim(bfeed)[1]
>
> ##
> ##### Fit Cox model with Efron adjustment for ties
> ##
> fit <- coxph( Surv(duration,icompbf) ~ factor(racemom) + momsmoke, data=bfeed, method="efron" )
> summary(fit)

                 exp(coef) exp(-coef) lower.95 upper.95
factor(racemom)2      1.17      0.858    0.952     1.43
factor(racemom)3      1.38      0.727    1.144     1.66
momsmoke              1.32      0.756    1.140     1.53
(Section 1.14, K & M)
Now recode the data to be intervals

> bfeed$int.durat <- cut( bfeed$duration, c(0,1,2,4,6,10,16,24,36,52,192),
+                         include.lowest=TRUE, ordered_result=TRUE )
> xtabs( ~ int.durat + icompbf, data=bfeed )
          icompbf
int.durat    0   1
  [0,1]      2  77
  (1,2]      3  71
  (2,4]      6 119
  (4,6]      9  75
  (6,10]     7 109
  (10,16]    5 148
  (16,24]    3 107
  (24,36]    0  74
  (36,52]    0  85
  (52,192]   0  27
(Section 1.14, K & M)
Now, expand the data to set up for the fixed-interval censored proportional hazards model

> ##
> ##### Expand dataset to consider interval censoring
> ##
> u.evtimes <- as.ordered( unique( bfeed$int.durat[ bfeed$icompbf==1 ] ) )
> num.event <- length( u.evtimes )
> bfeed.texpand <- bfeed[, c("id", "int.durat", "icompbf", "racemom", "momsmoke") ]
> bfeed.texpand <- bfeed.texpand[ rep(bfeed.texpand$id, each=num.event), ]
> bfeed.texpand$interval <- rep( u.evtimes, sum(!duplicated(bfeed.texpand$id)) )
> bfeed.texpand <- bfeed.texpand[ bfeed.texpand$int.durat >= bfeed.texpand$interval, ]
> bfeed.texpand <- bfeed.texpand[ dim(bfeed.texpand)[1]:1, ]
> bfeed.texpand$icompbf <- ifelse( !duplicated(bfeed.texpand$id), bfeed.texpand$icompbf, 0 )
> bfeed.texpand <- bfeed.texpand[ dim(bfeed.texpand)[1]:1, ]
(Section 1.14, K & M)
Let's have a look at the data...

> bfeed.texpand[ c(1:5,300:305), ]
       id int.durat icompbf racemom momsmoke interval
2       1     [0,1]       1       1        1    [0,1]
28      2     [0,1]       1       1        1    [0,1]
35      3     [0,1]       1       1        0    [0,1]
86      4     [0,1]       1       1        1    [0,1]
88      5     [0,1]       1       1        1    [0,1]
564   178     (2,4]       0       1        1    [0,1]
564.1 178     (2,4]       0       1        1    (1,2]
564.2 178     (2,4]       1       1        1    (2,4]
620   179     (2,4]       0       3        1    [0,1]
620.1 179     (2,4]       0       3        1    (1,2]
620.2 179     (2,4]       1       3        1    (2,4]
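As a quick sanity check of the expansion (a suggested addition, not from the original slides), one can tabulate the expanded rows by interval: each interval's total row count gives the number at risk, and the icompbf = 1 count should reproduce the event counts from the table two slides back.

## Events and at-risk rows per interval in the expanded data
xtabs( ~ interval + icompbf, data = bfeed.texpand )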
(Section 1.14, K & M)
Now, fit the fixed-interval survival model

> fit.intph <- glm( icompbf ~ factor(as.numeric(interval)) + factor(racemom) + momsmoke,
+                   data=bfeed.texpand, family=binomial(link="cloglog") )
> summary(fit.intph)

Coefficients:
                               Estimate Std. Error z value Pr(>|z|)
(Intercept)                    -2.61481    0.12085  -21.64  < 2e-16 ***
factor(as.numeric(interval))2   0.00843    0.16467    0.05  0.95916
factor(as.numeric(interval))3   0.66003    0.14633    4.51  6.5e-06 ***
factor(as.numeric(interval))4   0.35893    0.16241    2.21  0.02710 *
factor(as.numeric(interval))5   0.91960    0.14908    6.17  6.9e-10 ***
factor(as.numeric(interval))6   1.55781    0.14094   11.05  < 2e-16 ***
factor(as.numeric(interval))7   1.68622    0.15050   11.20  < 2e-16 ***
factor(as.numeric(interval))8   1.81313    0.16433   11.03  < 2e-16 ***
factor(as.numeric(interval))9   2.84750    0.16520   17.24  < 2e-16 ***
factor(as.numeric(interval))10  5.33164   20.19268    0.26  0.79175
factor(racemom)2                0.14344    0.10511    1.36  0.17235
factor(racemom)3                0.33313    0.09762    3.41  0.00064 ***
momsmoke                        0.29272    0.07717    3.79  0.00015 ***
(Section 1.14, K & M)
Use glmci() (on the course webpage) to exponentiate results and produce CIs

> signif( glmci( fit.intph ), 3 )
                               exp( Est )  ci95.lo  ci95.hi  z value Pr(>|z|)
(Intercept)                        0.0732   0.0577 9.27e-02 -21.6000   0.0000
factor(as.numeric(interval))2      1.0100   0.7300 1.39e+00   0.0512   0.9590
factor(as.numeric(interval))3      1.9300   1.4500 2.58e+00   4.5100   0.0000
factor(as.numeric(interval))4      1.4300   1.0400 1.97e+00   2.2100   0.0271
factor(as.numeric(interval))5      2.5100   1.8700 3.36e+00   6.1700   0.0000
factor(as.numeric(interval))6      4.7500   3.6000 6.26e+00  11.1000   0.0000
factor(as.numeric(interval))7      5.4000   4.0200 7.25e+00  11.2000   0.0000
factor(as.numeric(interval))8      6.1300   4.4400 8.46e+00  11.0000   0.0000
factor(as.numeric(interval))9     17.2000  12.5000 2.38e+01  17.2000   0.0000
factor(as.numeric(interval))10   207.0000   0.0000 3.19e+19   0.2640   0.7920
factor(racemom)2                   1.1500   0.9390 1.42e+00   1.3600   0.1720
factor(racemom)3                   1.4000   1.1500 1.69e+00   3.4100   0.0006
momsmoke                           1.3400   1.1500 1.56e+00   3.7900   0.0001
(Section 1.14, K & M)
Interpretation: the estimated relative hazard for cessation of breast feeding for smoking versus non-smoking mothers is exp(0.29) = 1.34 (95% CI: 1.15-1.56; p-value = 0.0001).
This compares to an estimate of 1.32 when the complete data were used with the Efron approximation.
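A short way to place the two smoking estimates side by side (a sketch, assuming the objects fit and fit.intph from the fits above are still in the workspace):

## Estimated hazard ratio for momsmoke under the two approaches
exp( c( coef(fit)["momsmoke"], coef(fit.intph)["momsmoke"] ) )
## Efron-approximation Cox fit: ~1.32; fixed-interval cloglog fit: ~1.34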
What about situations where time truly is discrete:
- Number of attempts a child takes to solve a puzzle before it is solved
- Number of grades completed before dropping out of school
- Number of doses required for a given effect to be observed
It is natural to think of these settings as continuation trials:
- A child will only attempt to solve a puzzle a second time if they failed to solve it the first time
- A student can only fail the 10th grade if they passed the 9th grade
- A patient is only given dose j+1 if they did not show benefit at dose j
Discrete-time hazard function
Here we may wish to focus on conditional probabilities:
- What is the probability of success on attempt j+1 given failure on attempt j?
Notice how this is similar to a hazard...
- What is the probability of failure at time t, given survival up to time t?
Discrete-time hazard function
The survival function for discrete-time data is defined as
S(t_j) = Pr[T > t_j]
(same as the survival function for continuous-time data).
The hazard function for discrete-time data is
λ(t_j) = Pr[T = t_j] / Pr[T ≥ t_j] = Pr[T = t_j] / Pr[T > t_{j-1}] = [S(t_{j-1}) - S(t_j)] / S(t_{j-1})
(different from continuous-time data, but a natural extension).
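One identity that follows directly from these definitions (a small added step, not on the original slide): the discrete-time survival function is the product of the conditional survival probabilities over the earlier failure times,

S(t_j) = ∏_{k=1}^{j} Pr[T > t_k | T > t_{k-1}] = ∏_{k=1}^{j} [1 - λ(t_k)]

which is why Kaplan-Meier-type estimation carries over directly to the discrete case.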
Analysis Strategies
Univariate Analysis
- Ordinary Kaplan-Meier / Nelson-Aalen estimators
Regression Analysis
- Discrete-time proportional hazards model (aka continuation ratio model)
Let λ(t_j) denote the discrete hazard at time t_j. Therefore λ(t_j) is the probability of failure at t_j, given survival up to t_j (i.e. past t_{j-1}).
The odds of failure at t_j, given survival up to t_j, are then given by
λ(t_j) / [1 - λ(t_j)]
Now suppose that
λ_i(t_j | x_i) / [1 - λ_i(t_j | x_i)] = { λ_0(t_j) / [1 - λ_0(t_j)] } exp(β^T x_i)
This can be thought of as a proportional odds model on the conditional probability of failure at time t_j, given survival up to t_j. It is referred to as a continuation ratio model.
We can also rewrite this so that
log{ λ_i(t_j | x_i) / [1 - λ_i(t_j | x_i)] } = log{ λ_0(t_j) / [1 - λ_0(t_j)] } + β^T x_i = γ_j + β^T x_i
This has the form of a logistic regression model with separate intercepts for each follow-up time...
How do we fit the model?
- If the number of unique failure (follow-up) times t_1, ..., t_J is reasonable in size with many ties, we can use ordinary logistic regression with standard software.
- If the number of unique failure (follow-up) times t_1, ..., t_J is large with few ties, we can use the Cox PH model with the "exact" ties option.
- What if we have both many ties and a large number of unique failure times? ...best to go with the logistic model in most packages.
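A minimal sketch of the two routes (hypothetical objects: a long-format data frame long with event indicator fail, discrete time index interval, and covariate x; and the original one-row-per-subject data dat with observed time time and status fail; all names are illustrative):

## Route 1: continuation ratio model via ordinary logistic regression
## on person-interval data (handles heavy ties naturally)
fit.logit <- glm( fail ~ factor(interval) + x, family = binomial, data = long )

## Route 2: Cox PH model with the exact partial likelihood for ties
## (feasible when distinct times are many but ties are few)
library( survival )
fit.exact <- coxph( Surv(time, fail) ~ x, data = dat, method = "exact" )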
Survey data considering factors that affect high school graduation, available on N = 1,691 9th-grade students enrolled in a single school district
Covariates available:
1. Race (White, Black, Hispanic, other)
2. Gender
3. Family income (low, medium, high; cutoffs at the lowest and highest 20%)
4. Parents' education (no HS grad, HS grad, some college, college grad)
Goal: estimate the effect of these covariates on the cumulative probability of HS dropout (we'll focus on mom's education as an example)
A brief look at the data...

##
##### Read in HS graduation data
##
> hsgrad <- read.table( "http://www.ics.uci.edu/~dgillen/stat255/hsgrad_comp.txt", header=TRUE )
> nsubjects <- nrow( hsgrad )
> nsubjects
[1] 1691
> hsgrad[1:5,]
   id race male mom.ed dad.ed inc graduate maxgrade
1 101    1    0      2      1   1        1       12
3 103    1    0      2      2   2        1       12
5 105    1    1      4      4   1        1       12
6 106    1    0      4      2   3        1       12
7 107    1    0      2      2   2        3       10

> table( hsgrad$maxgrade )
   8    9   10   11   12
   8   46   55  160 1422
Set the data up for a CRM fit

> ###
> ### Construct CRM data...
> ###
> #
> ##### STEP (1) construct the pairs (Y,H)
> #
> hsgrad$maxgrade <- hsgrad$maxgrade - 7
> ncuts <- max(hsgrad$maxgrade) - 1
> print( paste("ncuts =", ncuts) )
[1] "ncuts = 4"

> y.crm <- NULL
> h.crm <- NULL
> id <- NULL
> for( j in 1:nsubjects ){
+   yj <- rep( 0, ncuts )
+   if( hsgrad$maxgrade[j] <= ncuts ) yj[ hsgrad$maxgrade[j] ] <- 1
+   hj <- 1 - c( 0, cumsum(yj)[1:(ncuts-1)] )
+   y.crm <- c( y.crm, yj )
+   h.crm <- c( h.crm, hj )
+   id <- c( id, rep(j,ncuts) )
+ }
Set the data up for a CRM fit

> ###
> ### Construct CRM data...
> ###
> #
> ##### STEP (2) construct the intercepts
> #
> level <- factor( rep(1:ncuts, nsubjects), levels=c(1:ncuts),
+                  labels=paste(": ",(1:ncuts)+7) )
> int.mat <- NULL
> for( j in 1:ncuts ){
+   intj <- rep( 0, ncuts )
+   intj[ j ] <- 1
+   int.mat <- cbind( int.mat, rep( intj, nsubjects) )
+ }
> dimnames(int.mat) <- list( NULL, paste("Int",c(1:ncuts),sep="") )
Set the data up for a CRM fit

> #
> ##### STEP (3) expand the X's
> #
> race <- rep( hsgrad$race, rep(ncuts,nsubjects) )
> male <- rep( hsgrad$male, rep(ncuts,nsubjects) )
> mom.ed <- rep( hsgrad$mom.ed, rep(ncuts,nsubjects) )
> dad.ed <- rep( hsgrad$dad.ed, rep(ncuts,nsubjects) )
> inc <- rep( hsgrad$inc, rep(ncuts,nsubjects) )
> #
> print( cbind( id, y.crm, h.crm, level, int.mat, mom.ed )[1:22,] )
      id y.crm h.crm level Int1 Int2 Int3 Int4 mom.ed
 [1,]  1     0     1     1    1    0    0    0      2
 [2,]  1     0     1     2    0    1    0    0      2
 [3,]  1     0     1     3    0    0    1    0      2
 [4,]  1     0     1     4    0    0    0    1      2
 [5,]  2     0     1     1    1    0    0    0      2
 [6,]  2     0     1     2    0    1    0    0      2
 [7,]  2     0     1     3    0    0    1    0      2
 [8,]  2     0     1     4    0    0    0    1      2
 [9,]  3     0     1     1    1    0    0    0      4
[10,]  3     0     1     2    0    1    0    0      4
[11,]  3     0     1     3    0    0    1    0      4
...
Set the data up for a CRM fit

> #
> ##### STEP (4) drop the H=0 and build dataframe
> #
> keep <- h.crm==1
> hsgrad.crm.data <- data.frame(
+   id     = id[keep],
+   y      = y.crm[keep],
+   level  = level[keep],
+   race   = race[keep],
+   male   = male[keep],
+   mom.ed = mom.ed[keep],
+   dad.ed = dad.ed[keep],
+   inc    = inc[keep] )
> print( hsgrad.crm.data[1:15,] )
   id y level race male mom.ed dad.ed inc
1   1 0   : 8    1    0      2      1   1
2   1 0   : 9    1    0      2      1   1
3   1 0  : 10    1    0      2      1   1
4   1 0  : 11    1    0      2      1   1
5   2 0   : 8    1    0      2      2   2
6   2 0   : 9    1    0      2      2   2
7   2 0  : 10    1    0      2      2   2
...
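A suggested check (not on the original slides) that the expanded data reproduce the dropout counts: the number of rows at each level is the number of students still enrolled at that grade, and the y = 1 rows are the dropouts before completing that grade.

## Compare with table( hsgrad$maxgrade ) shown earlier
xtabs( ~ level + y, data = hsgrad.crm.data )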
First consider fitting separate logistic regressions by subsetting on individuals in each grade level
i.e. in 4 separate models, consider the odds of dropping out of school before completing grade j+1 given that one has passed grade j

> #
> #####
> ##### Fit separate logistic regressions for cumulative
> ##### probability of dropout
> #####
> #
> table( hsgrad.crm.data$level )
  : 8   : 9  : 10  : 11
 1691  1683  1637  1582
First consider fitting separate logistic regressions by subsetting on individuals in each grade level
i.e. in 4 separate models, consider the odds of dropping out of school before completing grade j+1 given that one has passed grade j

> fit1 <- glm( y ~ factor( mom.ed ), family=binomial,
+              subset=(as.integer(level)==1), data = hsgrad.crm.data )
> summary( fit1 )

Coefficients:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)      -4.9836     1.0034  -4.967 6.81e-07 ***
factor(mom.ed)2  -0.1561     1.0837  -0.144    0.885
factor(mom.ed)3 -15.5825  1438.1233  -0.011    0.991
factor(mom.ed)4  -0.9053     1.4176  -0.639    0.523

> glmci( fit1 )
                exp( Est ) ci95.lo ci95.hi z value Pr(>|z|)
(Intercept)         0.0068  0.0010  0.0490 -4.9666   0.0000
factor(mom.ed)2     0.8555  0.1023  7.1562 -0.1440   0.8855
factor(mom.ed)3     0.0000  0.0000     Inf -0.0108   0.9914
factor(mom.ed)4     0.4044  0.0251  6.5091 -0.6386   0.5231

(The huge coefficient and standard error for factor(mom.ed)3 suggest that no dropouts before grade 9 were observed in that education group, so the estimate diverges and the Wald output for that contrast is not meaningful.)
First consider fitting separate logistic regressions by subsetting on individuals in each grade level
i.e. in 4 separate models, consider the odds of dropping out of school before completing grade j+1 given that one has passed grade j

> fit2 <- glm( y ~ factor( mom.ed ), family=binomial,
+              subset=(as.integer(level)==2), data = hsgrad.crm.data )
> glmci( fit2 )
                exp( Est ) ci95.lo ci95.hi z value Pr(>|z|)
(Intercept)         0.0429  0.0189  0.0970 -7.5554   0.0000
factor(mom.ed)2     0.8257  0.3412  1.9986 -0.4245   0.6712
factor(mom.ed)3     0.3111  0.0618  1.5670 -1.4154   0.1569
factor(mom.ed)4     0.1955  0.0482  0.7926 -2.2855   0.0223

> fit3 <- glm( y ~ factor( mom.ed ), family=binomial,
+              subset=(as.integer(level)==3), data = hsgrad.crm.data )
> glmci( fit3 )
                exp( Est ) ci95.lo ci95.hi z value Pr(>|z|)
(Intercept)         0.1111  0.0640  0.1930 -7.7994   0.0000
factor(mom.ed)2     0.3009  0.1563  0.5793 -3.5937   0.0003
factor(mom.ed)3     0.2466  0.0791  0.7683 -2.4146   0.0158
factor(mom.ed)4     0.1275  0.0450  0.3611 -3.8775   0.0001
First consider fitting separate logistic regressions by subsetting on individuals in each grade level
i.e. in 4 separate models, consider the odds of dropping out of school before completing grade j+1 given that one has passed grade j

> fit4 <- glm( y ~ factor( mom.ed ), family=binomial,
+              subset=(as.integer(level)==4), data = hsgrad.crm.data )
> glmci( fit4 )
                exp( Est ) ci95.lo ci95.hi z value Pr(>|z|)
(Intercept)         0.1250  0.0717  0.2179 -7.3356   0.0000
factor(mom.ed)2     0.9754  0.5398  1.7626 -0.0826   0.9342
factor(mom.ed)3     1.1250  0.5351  2.3651  0.3107   0.7560
factor(mom.ed)4     0.5836  0.2918  1.1671 -1.5229   0.1278
Now let's consider fitting the continuation ratio model, which simultaneously models all grades

> #
> #####
> ##### Fit continuation ratio (logit) model
> #####
> #
> fit5 <- glm( y ~ level + factor(mom.ed), family=binomial,
+              data = hsgrad.crm.data )
> glmci( fit5 )
                exp( Est ) ci95.lo ci95.hi  z value Pr(>|z|)
(Intercept)         0.0078  0.0036  0.0167 -12.4296   0.0000
level: 9            5.9254  2.7874 12.5961   4.6242   0.0000
level: 10           7.3585  3.4930 15.5015   5.2501   0.0000
level: 11          24.0970 11.8006 49.2065   8.7357   0.0000
factor(mom.ed)2     0.6614  0.4508  0.9703  -2.1142   0.0345
factor(mom.ed)3     0.5866  0.3405  1.0103  -1.9230   0.0545
factor(mom.ed)4     0.3249  0.1981  0.5329  -4.4523   0.0000
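Since the stated goal involves the cumulative probability of HS dropout, the fitted conditional probabilities from the CRM can be multiplied across grades. A sketch (assumes fit5 above; the covariate profile mom.ed = 4 is just an illustration):

## Conditional dropout probability at each grade for mom.ed = 4
newdat <- data.frame( level = factor( levels(hsgrad.crm.data$level),
                                      levels = levels(hsgrad.crm.data$level) ),
                      mom.ed = 4 )
p.cond <- predict( fit5, newdata = newdat, type = "response" )
## Cumulative probability of dropping out before completing grade 11
1 - prod( 1 - p.cond )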
Interpretation of coefficients from the CRM model:
- Each exp(coefficient) for mother's education is a conditional odds ratio assumed common across grades: comparing students who have reached the same grade, it is the ratio of the odds of dropping out before completing that grade.
- For example, the estimate of 0.32 for factor(mom.ed)4 says that students whose mothers are college graduates have an estimated 0.32 times the odds of dropping out in any given grade, relative to the reference education category (95% CI 0.20-0.53).
- The level terms (the γ_j) describe how the baseline conditional odds of dropout change across grades; e.g. the conditional odds of dropping out during grade 11 are estimated to be about 24 times those during grade 8.
We can also test the assumption of a common (conditional) odds ratio across grade levels using an LRT
- That is, we wish to test whether the effect of mother's education varies by grade level
- This is called a test of the "proportional odds" assumption
- As with any diagnostic test, be careful of underpowered tests...

> #
> #####
> ##### Test the null hypothesis of a single odds ratio across levels
> #####
> #
> fit6 <- glm( y ~ level * factor( mom.ed ), family=binomial,
+              data = hsgrad.crm.data )
> anova( fit5, fit6 )
Analysis of Deviance Table

Model 1: y ~ level + factor(mom.ed)
Model 2: y ~ level * factor(mom.ed)
  Resid. Df Resid. Dev Df Deviance
1      6586    2018.30
2      6577    2003.96  9    14.34
Might be better to use my lrtest() function since it automatically computes the p-value...

> #
> #####
> ##### Test the null hypothesis of a single odds ratio across levels
> #####
> lrtest( fit5, fit6 )

Assumption: Model 1 nested within Model 2

  Resid. Df Resid. Dev Df Deviance pvalue
1      6586   2018.301
2      6577   2003.958  9   14.343 0.1106
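Alternatively, base R's anova() method for glm fits will also report the p-value if a test is requested:

anova( fit5, fit6, test="Chisq" )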
Conclusion from the above test: with an LRT statistic of 14.34 on 9 df (p = 0.11), there is no strong evidence that the association between mother's education and the conditional odds of dropout differs across grade levels. The common odds ratio ("proportional odds") assumption of the CRM therefore appears reasonable here, bearing in mind that this test may have limited power.