Survival Analysis I (CHL5209H)

Dalla Lana School of Public Health, University of Toronto
olli.saarela@utoronto.ca
February 3, 2015

Time matching/risk set sampling/incidence density sampling/nested case-control design

Motivation

Suppose that a cohort (that is, a closed population) of size n has been recruited through participation in a health examination survey. Some baseline characteristics (e.g. blood pressure, cholesterol, BMI, smoking habits, prevalent health conditions and their treatment) are recorded on all cohort members. Denote these characteristics by X. We are interested in whether particular genetic/biomarkers (denoted by Z), determined from plasma/serum samples collected at the examination, are associated with disease outcomes in the cohort. However, carrying out the measurements on all cohort members would be expensive, and no association study can be carried out at this point anyway. (Why?) Solution: store (freeze) the biological material to wait for later use.

[Figure: Before the follow-up. Horizontal axis: follow-up years (0 to 10); vertical axis: population (0 to 6000).]

[Figure: After 10 years of follow-up. Same axes.]

[Figure: Case series (184 incident stroke events). Same axes.]

[Figure: Zoom in to see more. Horizontal axis: follow-up years (0 to 10); vertical axis: population (0 to 80).]

[Figure: Identify risk sets. Zoomed axes.]

[Figure: Sample one control per case. Zoomed axes.]

[Figure: Sample 5 controls per case. Zoomed axes.]

[Figure: Or, sample the whole risk set. Zoomed axes.]

Risk set sampling

From each risk set, m controls are selected at random, independently of the risk sets sampled earlier or later. Because of this, one individual can serve as a control for more than one case, and individuals who later experience an outcome event can serve as controls before their event time. The case and the m controls are now matched by time. Selecting more than m = 5 controls per case no longer substantially improves efficiency. This is why cost savings are possible through nested case-control designs. The covariate measurements Z are collected on all individuals with a disease event of interest, and on the pooled set of individuals contributing the sampled controls. To ensure comparability of the measurements, the cases and controls may need to be further matched w.r.t. factors such as storage time, storage conditions, freeze-thaw cycles, and analytic batch (e.g. plate).
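The sampling step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sample_risk_sets`, the toy cohort, and the convention that a risk set contains everyone whose follow-up time is at least the case's event time are all hypothetical choices, and tied event times are assumed not to occur.

```python
import random

def sample_risk_sets(times, events, m, seed=0):
    """Return a list of (case, controls) pairs, one per observed event.

    times[i]  : follow-up time of individual i (event or censoring time)
    events[i] : 1 if individual i had the event at times[i], 0 if censored
    m         : number of controls to draw per case
    """
    rng = random.Random(seed)
    sampled = []
    for case, (t, d) in enumerate(zip(times, events)):
        if d != 1:
            continue
        # Risk set at the case's event time: everyone else still under
        # follow-up at t, including individuals who become cases later.
        at_risk = [j for j, tj in enumerate(times) if tj >= t and j != case]
        # Controls are drawn separately for each case, so the same
        # individual may be selected for several cases.
        controls = rng.sample(at_risk, min(m, len(at_risk)))
        sampled.append((case, controls))
    return sampled

# Toy cohort (made-up data): five individuals, events at t = 2.0 and t = 5.0.
times = [2.0, 3.5, 5.0, 7.0, 10.0]
events = [1, 0, 1, 0, 0]
sets = sample_risk_sets(times, events, m=2)
```

In this toy cohort, individual 2 is eligible as a control for the case at t = 2.0 even though the same individual becomes a case at t = 5.0, which is the behaviour the slide describes.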


Model

Suppose that we are interested in the parameter $\beta$ in the proportional hazards model

$$\lambda_i(t) = \lambda_0(t) \exp\{\beta z_i + \gamma^\top x_i\},$$

where $z_i$ is the genetic/biomarker measurement collected under the nested case-control design, and $x_i$ are other factors that we want to include in the model. How can this model be estimated under the nested case-control design? It turns out that a certain conditioning argument gives a very useful conditional likelihood expression. For deriving this expression, recall the notation of counting process jumps $dN_i(t)$, indicating whether an event occurred to individual $i$ at time $t$, and the history $\mathcal{F}_t$, representing all observed information just before $t$.
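As a brief illustration of how $\beta$ is read in this model (a standard property of proportional hazards models, written out here rather than taken from the slides), consider two hypothetical individuals $i$ and $j$ with the same value of $x$. The baseline hazard cancels from the ratio of their hazards:

```latex
\frac{\lambda_i(t)}{\lambda_j(t)}
  = \frac{\lambda_0(t)\exp\{\beta z_i + \gamma^\top x\}}
         {\lambda_0(t)\exp\{\beta z_j + \gamma^\top x\}}
  = \exp\{\beta (z_i - z_j)\}.
```

So $\beta$ is the log-hazard ratio per unit difference in $z$, and the ratio does not depend on $t$.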

Conditional logistic regression likelihood contribution

Now use $i$ to index a sampled risk set, rather than an individual, with $i1, i2, \ldots$ indexing individuals within the sampled risk set. Consider the special case of a 1-1 matched design, where one control per case is selected randomly. The likelihood contribution is given by the conditional probability of individual 1 in risk set $i$ experiencing an outcome event at time $t_i$, given that we know that one of the two individuals experienced an outcome event, that is,

$$P(dN_{i1}(t_i) = 1 \mid dN_{i1}(t_i) + dN_{i2}(t_i) = 1, \mathcal{F}_{t_i}, \theta),$$

where $\theta$ represents the collection of all model parameters.

Conditional logistic regression likelihood contribution (2)

Using the basic rules of probability, and assuming the outcomes of the two individuals to be independent, we can write this as

$$
\begin{aligned}
&P(dN_{i1}(t_i) = 1 \mid dN_{i1}(t_i) + dN_{i2}(t_i) = 1, \mathcal{F}_{t_i}, \theta) \\
&\quad = \frac{P(dN_{i1}(t_i) = 1, dN_{i2}(t_i) = 0 \mid \mathcal{F}_{t_i}, \theta)}
              {P(dN_{i1}(t_i) + dN_{i2}(t_i) = 1 \mid \mathcal{F}_{t_i}, \theta)} \\
&\quad = \frac{P(dN_{i1}(t_i) = 1 \mid \mathcal{F}_{t_i}, \theta)\, P(dN_{i2}(t_i) = 0 \mid \mathcal{F}_{t_i}, \theta)}
              {\sum_{(e_1, e_2) \in \{(0,1),(1,0)\}} P(dN_{i1}(t_i) = e_1 \mid \mathcal{F}_{t_i}, \theta)\, P(dN_{i2}(t_i) = e_2 \mid \mathcal{F}_{t_i}, \theta)}.
\end{aligned}
$$

Note further that the observed information $\mathcal{F}_{t_i}$ records the covariate values $\{z_{i1}, x_{i1}, z_{i2}, x_{i2}\}$ and that the two individuals are still at risk at this moment, that is, $\{N_{i1}(t_i) = 0, N_{i2}(t_i) = 0\}$.

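Conditional logistic regression likelihood contribution (2, continued)

Substituting these in, we get

$$
\begin{aligned}
&P(dN_{i1}(t_i) = 1 \mid dN_{i1}(t_i) + dN_{i2}(t_i) = 1, \mathcal{F}_{t_i}, \theta) \\
&\quad = \frac{P(dN_{i1}(t_i) = 1, N_{i1}(t_i) = 0 \mid z_{i1}, x_{i1}, \theta)\, P(dN_{i2}(t_i) = 0, N_{i2}(t_i) = 0 \mid z_{i2}, x_{i2}, \theta)}
              {\sum_{(e_1, e_2) \in \{(0,1),(1,0)\}} P(dN_{i1}(t_i) = e_1, N_{i1}(t_i) = 0 \mid z_{i1}, x_{i1}, \theta)\, P(dN_{i2}(t_i) = e_2, N_{i2}(t_i) = 0 \mid z_{i2}, x_{i2}, \theta)} \\
&\quad = \frac{\lambda_{i1}(t_i) S_{i1}(t_i) S_{i2}(t_i)}
              {\sum_{(e_1, e_2) \in \{(0,1),(1,0)\}} \lambda_{i1}(t_i)^{e_1} S_{i1}(t_i)\, \lambda_{i2}(t_i)^{e_2} S_{i2}(t_i)} \\
&\quad = \frac{\lambda_{i1}(t_i)}{\lambda_{i1}(t_i) + \lambda_{i2}(t_i)} \\
&\quad = \frac{\exp\{\beta z_{i1} + \gamma^\top x_{i1}\}}
              {\exp\{\beta z_{i1} + \gamma^\top x_{i1}\} + \exp\{\beta z_{i2} + \gamma^\top x_{i2}\}}.
\end{aligned}
$$

Note how the survival probabilities $S_{i1}(t_i) S_{i2}(t_i)$ and the baseline hazard $\lambda_0(t_i)$ canceled out of the expression.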

Conditional logistic regression likelihood contribution (3)

If there are $d$ cases in total, the complete conditional likelihood expression is

$$L(\beta, \gamma) = \prod_{i=1}^{d} \frac{\exp\{\beta z_{i1} + \gamma^\top x_{i1}\}}{\exp\{\beta z_{i1} + \gamma^\top x_{i1}\} + \exp\{\beta z_{i2} + \gamma^\top x_{i2}\}}. \qquad (1)$$

The resulting expression should look familiar; if we sample the whole risk set, using the same conditioning argument, we would get

$$L(\beta, \gamma) = \prod_{i=1}^{d} \frac{\exp\{\beta z_{i1} + \gamma^\top x_{i1}\}}{\sum_{l \in R(t_i)} \exp\{\beta z_{il} + \gamma^\top x_{il}\}},$$

where $R(t_i)$ is the set of individuals at risk at the event time $t_i$. This is equivalent to the Cox partial likelihood.
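To make expression (1) concrete, the following Python sketch evaluates its logarithm on a small made-up data set. The function name `log_conditional_likelihood`, the data layout (case listed first in each matched set), and the data themselves are assumptions for illustration only; the same function covers the whole-risk-set version, because the denominator is simply a sum over whatever members the set contains.

```python
import math

def log_conditional_likelihood(coefs, matched_sets):
    """Log of expression (1), for any number of controls per case.

    coefs        : coefficient list, e.g. [beta, gamma_1, ...]
    matched_sets : list of sets; each set is a list of covariate rows
                   [z, x_1, ...] with the case's row FIRST.
    """
    total = 0.0
    for rows in matched_sets:
        # Linear predictor beta*z + gamma'x for every member of the set.
        eta = [sum(c * v for c, v in zip(coefs, row)) for row in rows]
        # Case's term minus the log of the sum over the sampled risk set.
        total += eta[0] - math.log(sum(math.exp(e) for e in eta))
    return total

# Made-up 1-1 matched data with a single binary covariate z (no x).
pairs = [[[1], [0]], [[1], [0]], [[0], [1]], [[1], [1]]]
ll_null = log_conditional_likelihood([0.0], pairs)         # each pair contributes log(1/2)
ll_log2 = log_conditional_likelihood([math.log(2)], pairs)
```

With a single binary covariate and 1-1 matching, only the discordant pairs carry information about $\beta$: here two pairs have an exposed case and one has an exposed control, so the log-likelihood is highest at $\beta = \log 2$, and the concordant fourth pair contributes $\log(1/2)$ whatever the value of $\beta$.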

The expression (1) is called the conditional logistic regression likelihood. The reason for this is that in non-time-matched case-control designs, a likelihood expression of the same form would be obtained by substituting in a logistic regression model. In that case, the likelihood expression would be used to estimate log-odds ratios. With time matching, the model substituted in the general probability expression is the proportional hazards survival model, and as a result, we can estimate log-hazard ratios. To sum up, with time-matched sampling of the controls, conditional logistic regression has nothing to do with logistic regression. Also, confusingly, the terms time matching/risk set sampling/incidence density sampling/nested case-control design are used interchangeably in the literature. They all mean the same thing.

Conditional logistic regression in R

The conditional likelihood can be maximized using the clogit function of the R survival package,

    clogit(formula, data, weights, subset, na.action,
           method=c("exact", "approximate", "efron", "breslow"), ...)

which uses a model formula of the form

    case.status ~ exposure + strata(matched.set)

Here the strata variable identifies the time-matched risk sets. Ties in the data mean that more than one event occurred at the same time; the argument method specifies how the ties are handled.
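The formula above expects one row per member of each sampled risk set, with a 0/1 case indicator and a set identifier. As a rough sketch of that layout (in Python rather than R, and with hypothetical column names `matched_set`, `case_status`, and `exposure` that merely echo the variables in the formula), the sampled sets could be expanded as follows. The sampled sets and exposure values are made up.

```python
import csv
import io

def to_long_format(sampled_sets, exposure):
    """Expand (case, controls) pairs into one row per set member."""
    rows = []
    for set_id, (case, controls) in enumerate(sampled_sets, start=1):
        rows.append({"id": case, "matched_set": set_id,
                     "case_status": 1, "exposure": exposure[case]})
        for c in controls:
            rows.append({"id": c, "matched_set": set_id,
                         "case_status": 0, "exposure": exposure[c]})
    return rows

# Made-up example: cases 0 and 2, two controls each; individual 3 is
# sampled as a control for both cases, so it appears in two sets.
sampled_sets = [(0, [3, 1]), (2, [3, 4])]
exposure = [1, 0, 1, 0, 1]
rows = to_long_format(sampled_sets, exposure)

# Write the table as CSV text, e.g. for reading into other software.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "matched_set", "case_status", "exposure"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```

An individual sampled for several cases is repeated once per risk set, so the number of rows can exceed the number of distinct individuals on whom Z was measured.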

Exact method for ties

The conditioning approach generalizes if ties are present. For example, if two events occurred at time $t_i$ and one control was sampled from the risk set, we would get

$$
\begin{aligned}
&P\left(dN_{i1}(t_i) = 1, dN_{i2}(t_i) = 1 \,\Big|\, \sum_{l=1}^{3} dN_{il}(t_i) = 2, \mathcal{F}_{t_i}, \theta\right) \\
&\quad = \frac{\exp\{\beta z_{i1} + \gamma^\top x_{i1}\} \exp\{\beta z_{i2} + \gamma^\top x_{i2}\}}
              {\exp\{\beta z_{i1} + \gamma^\top x_{i1}\} \exp\{\beta z_{i2} + \gamma^\top x_{i2}\}
               + \exp\{\beta z_{i1} + \gamma^\top x_{i1}\} \exp\{\beta z_{i3} + \gamma^\top x_{i3}\}
               + \exp\{\beta z_{i2} + \gamma^\top x_{i2}\} \exp\{\beta z_{i3} + \gamma^\top x_{i3}\}}
\end{aligned}
$$

(homework). This is the so-called exact method for handling ties; the others are approximations.
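The denominator above is a sum over every way of choosing which two of the three set members are the cases. A minimal Python sketch of that enumeration follows; the function name `exact_tie_contribution` and the numeric values are hypothetical, and the code only evaluates the expression (it does not replace the derivation left as homework).

```python
import math
from itertools import combinations

def exact_tie_contribution(eta, n_cases):
    """Conditional probability that the first n_cases members are the cases.

    eta     : linear predictors beta*z + gamma'x, with the cases listed first
    n_cases : number of tied events in the sampled risk set
    """
    def weight(idx):
        # Product of exp(eta_j) over the chosen members.
        return math.exp(sum(eta[j] for j in idx))

    numerator = weight(range(n_cases))
    # Sum over every way of choosing n_cases members of the set as cases.
    denominator = sum(weight(idx)
                      for idx in combinations(range(len(eta)), n_cases))
    return numerator / denominator

# Two tied cases and one control, with exp(eta) equal to 2, 3 and 1.
p_tied = exact_tie_contribution([math.log(2), math.log(3), 0.0], n_cases=2)
# One case and one control: reduces to a single factor of expression (1).
p_pair = exact_tie_contribution([math.log(2), 0.0], n_cases=1)
```

The number of terms in the denominator grows combinatorially with the size of the risk set and the number of tied events, which is one reason the approximate methods exist.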