ADVANCED STATISTICAL ANALYSIS OF EPIDEMIOLOGICAL STUDIES. Cox's regression analysis. Time-dependent explanatory variables

Henrik Ravn, Bandim Health Project, Statens Serum Institut. 4 November 2011

Outline
- Survival Data
- Example: Malignant Melanoma Data
- The Cox Model
- Cox in SAS
- Choice of Time-Scale
- Example: Guinea-Bissau Data
- Delayed entries
- Time dependent explanatory variables

$$\prod_{i=1}^{d} \frac{\exp(\beta X_i)}{\sum_{j \in R(t_i)} \exp(\beta X_j)}$$

Survival Data
Time to death or other event of interest. One time-scale with a well-defined starting time (time-origin):
- Time from start of randomized clinical trial to death.
- Time from first employment to pension.
- Time from filling of a tooth until the filling falls out.
What is special about survival data?
- Right-skewed. No problem.
- CENSORING: For some individuals we only know a lower bound on the lifetime.

[Figure: Simple data. Follow-up of 12 individuals. x-axis: Times (months), 0 to 25; y-axis: Individual, 1 to 12.]

Survival and hazard function
Let T be the TIME to the event of interest:
- $S(t) = P(T > t)$ = probability of survival to time t after entry at time 0
- $\lambda(t)$ = incidence, rate, or hazard
Relationship:
$$S(t) = \exp\left(-\int_0^t \lambda(s)\,ds\right) = \exp(-\Lambda(t))$$
$\Lambda(t)$ is called the integrated hazard function.
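The relationship between hazard and survival can be checked numerically. The following is a minimal Python sketch (not part of the original slides); the function name `survival_from_hazard` and the hazard value 0.001 are assumptions chosen for illustration.

```python
import math

def survival_from_hazard(hazard, t, steps=10_000):
    """Approximate S(t) = exp(-integral_0^t hazard(s) ds) by the trapezoidal rule."""
    if t == 0:
        return 1.0
    h = t / steps
    integral = 0.5 * (hazard(0.0) + hazard(t))
    for k in range(1, steps):
        integral += hazard(k * h)
    integral *= h                      # integrated hazard Lambda(t)
    return math.exp(-integral)

lam = 0.001                            # hypothetical constant hazard per time unit
s_numeric = survival_from_hazard(lambda s: lam, 100.0)
s_closed = math.exp(-lam * 100.0)      # closed form for a constant hazard
print(s_numeric, s_closed)
```

For a constant hazard the numerical value should agree with the closed form $e^{-\lambda t}$; any other non-negative hazard function can be passed in the same way.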

$\lambda(t) = \lambda$, $S(t) = e^{-\lambda t}$, $\Lambda(t) = \lambda t$
[Figure: three panels plotted against Time (t), 0 to 150: Hazard rate, Survival Function, Integrated hazard.]

Kaplan-Meier estimate of survival function
Death times $t_1, \dots, t_d$ (ordered). $Y(t_i)$ = number alive just before $t_i$.
$$\hat{S}(t) = \prod_{t_i \le t} \left(1 - \frac{1}{Y(t_i)}\right)$$
[Figure: Risk sets. x-axis: Times (months), 0 to 25; y-axis: Individual, 1 to 12.]
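The product formula can be sketched in a few lines of Python. This is an illustration only: the data are hypothetical, the function name `kaplan_meier` is an assumption, and the factor is written as deaths divided by number at risk so that tied death times are also handled (with no ties it reduces to $1 - 1/Y(t_i)$).

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct death time.

    times  : follow-up time for each individual
    events : 1 if a death was observed at that time, 0 if censored
    """
    death_times = sorted({t for t, e in zip(times, events) if e == 1})
    surv = 1.0
    curve = []
    for t in death_times:
        at_risk = sum(1 for u in times if u >= t)   # Y(t): alive just before t
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        surv *= 1.0 - deaths / at_risk
        curve.append((t, surv))
    return curve

# Hypothetical data: five individuals, times in months
times = [2, 3, 3, 5, 8]
events = [1, 0, 1, 1, 0]
km = kaplan_meier(times, events)
print(km)
```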

[Figure: Kaplan-Meier survival estimate. x-axis: Time (months), 0 to 25; y-axis: Survival probability, 0.00 to 1.00. Number at risk: 12, 10, 9, 6, 5, 4, 1.]

Malignant Melanoma Data
In the period 1962-77 a total of 205 patients had their tumor removed and were followed until 1977. At the end of 1977:
- 57 had died of malignant melanoma (status=1)
- 134 were still alive (status=2)
- 14 had died of causes unrelated to malignant melanoma (status=3), a competing risk
Purpose: Study the effect on survival of sex, age, thickness of tumor, ulceration, etc.
[Figure: follow-up timeline, 1962 to 1977.]

Malignant melanoma

      N   time  status  sex  age  year  thickness  ulcer
      1     10       3    1   76  1972       6.76      1
      2     30       3    1   56  1968       0.65      0
      3     35       2    1   41  1977       1.34      0
      4     99       3    0   71  1968       2.90      0
      5    185       1    1   52  1965      12.08      1
      6    204       1    1   28  1971       4.84      1
      7    210       1    1   77  1972       5.16      1
      8    232       3    0   60  1974       3.22      1
      9    232       1    1   49  1968      12.88      1
     10    279       1    0   68  1971       7.41      1
    ...
    203   4688       2    0   42  1965       0.48      0
    204   4926       2    0   50  1964       2.26      0
    205   5565       2    0   41  1962       2.90      0

The Cox Model
The Cox model assumes that the rate for the ith individual is
$$\lambda_i(t) = \lambda_0(t) \exp(\beta_1 X_{i1} + \beta_2 X_{i2} + \dots + \beta_p X_{ip})$$
where $\beta_1, \beta_2, \dots, \beta_p$ are regression parameters, $X_{i1}$ is the value of covariate 1 for individual i, etc. Finally, $\lambda_0(t)$ is the baseline hazard.
Time t is the time-scale of choice, e.g. age, time since randomization, or time since operation.
As formulated here, the only quantity on the right-hand side that depends on time is the baseline hazard $\lambda_0(t)$.
If all covariates (the X's) are zero we get $\lambda_i(t) = \lambda_0(t)$. The baseline hazard is thus interpreted as the hazard of an individual who has all covariates equal to zero.

The Cox model
$$\lambda_i(t) = \lambda_0(t) \exp(\beta_1 X_{i1} + \beta_2 X_{i2} + \dots + \beta_p X_{ip})$$
can also be written on the log-scale (natural log):
$$\log(\lambda_i(t)) = \log\bigl(\lambda_0(t) \exp(\beta_1 X_{i1} + \beta_2 X_{i2} + \dots + \beta_p X_{ip})\bigr) = \log(\lambda_0(t)) + \beta_1 X_{i1} + \beta_2 X_{i2} + \dots + \beta_p X_{ip}.$$
The Cox model assumes that
- the effects of covariates are additive and linear on the log rate scale, just like Poisson regression.
- the CORNER, i.e. the baseline hazard, is non-parametric and depends on time, and time is thus adjusted for.
We now turn to the interpretation of the regression parameters $\beta_1, \beta_2, \dots, \beta_p$.

One binary covariate
To keep things simple we study the effect of one single binary covariate, e.g. sex, on the risk of dying:
$$X_i = \begin{cases} 0 & \text{if individual } i \text{ is a female} \\ 1 & \text{if individual } i \text{ is a male} \end{cases}$$
The Cox model is $\lambda_i(t) = \lambda_0(t) \exp(\beta X_i)$. With $X_i$ defined as above we get
$$\lambda_i(t) = \begin{cases} \lambda_0(t) & \text{if individual } i \text{ is a female} \\ \lambda_0(t) \exp(\beta) & \text{if individual } i \text{ is a male} \end{cases}$$

Mortality Rate Ratio / Hazard Ratio
If
$$\lambda_i(t) = \begin{cases} \lambda_0(t) & \text{if individual } i \text{ is a female} \\ \lambda_0(t) \exp(\beta) & \text{if individual } i \text{ is a male} \end{cases}$$
then the RATE RATIO (RR) between males and females is
$$RR = \frac{\lambda_0(t) \exp(\beta)}{\lambda_0(t)} = \exp(\beta).$$
Importantly, the ratio is independent of time, i.e. we have PROPORTIONAL HAZARDS over time. The Cox model is also called the proportional hazards model.
How to estimate $\beta$? And what about the baseline hazard $\lambda_0(t)$?

Likelihood Function
The baseline hazard is regarded as a nuisance and is not in general estimated, although it is possible.
Let $t_1, \dots, t_d$ be the ordered death times. It can be shown that all we need is to find the $\beta$ that maximizes the following function, called Cox's partial likelihood function:
$$L(\beta) = \prod_{i=1}^{d} \frac{\exp(\beta X_i)}{\sum_{j \in R(t_i)} \exp(\beta X_j)}$$
where $R(t_i)$ is the RISK SET at death time $t_i$, i.e. the set of individuals at risk of dying (under observation) just before time $t_i$.
The resulting estimate $\hat{\beta}$ is called the MAXIMUM LIKELIHOOD ESTIMATE of $\beta$.

Likelihood Function: a closer look
Death times $t_1, \dots, t_d$, numbering individuals with deaths first:
$$i = 1, 2, \dots, d, d+1, \dots, n$$
with times and covariates
$$t_1, t_2, \dots, t_d, t_{d+1}, \dots, t_n \qquad X_1, X_2, \dots, X_d, X_{d+1}, \dots, X_n.$$
At each death time we have the RISK SET, the individuals alive and at risk of dying just before the death time:
$$R(t_1), R(t_2), \dots, R(t_d)$$

[Figure: Risk sets. x-axis: Times (months), 0 to 25; y-axis: Individual, 1 to 12.]

For the Cox model $\lambda_i(t) = \lambda_0(t) \exp(\beta X_i)$ we use the Cox likelihood function to estimate $\beta$:
$$L(\beta) = \prod_{i=1}^{d} \frac{\exp(\beta X_i)}{\sum_{j \in R(t_i)} \exp(\beta X_j)} = \frac{\exp(\beta X_1)}{\sum_{j \in R(t_1)} \exp(\beta X_j)} \times \frac{\exp(\beta X_2)}{\sum_{j \in R(t_2)} \exp(\beta X_j)} \times \cdots \times \frac{\exp(\beta X_d)}{\sum_{j \in R(t_d)} \exp(\beta X_j)}$$
We index individuals in the risk sets using the letter j. Writing $\sum_{j \in R(t_1)} \exp(\beta X_j)$ means summing over the individuals in the risk set for death time $t_1$.
If we assume that no one was censored before the first death time, all individuals are in the risk set $R(t_1)$ and the sum is
$$\exp(\beta X_1) + \exp(\beta X_2) + \dots + \exp(\beta X_n).$$

For example, for the Cox model $\lambda_i(t) = \lambda_0(t) \exp(\beta \cdot \text{sex})$ with sex coded 1=male, 0=female, the likelihood function takes the form
$$\frac{\exp(\beta)}{\sum_{j \in R(t_1)} \exp(\beta X_j)} \times \frac{1}{\sum_{j \in R(t_2)} \exp(\beta X_j)} \times \cdots \times \frac{\exp(\beta)}{\sum_{j \in R(t_d)} \exp(\beta X_j)}.$$
If we again assume that no one was censored before the first death time, all individuals are in the risk set $R(t_1)$ and the sum is
$$\exp(\beta) + 1 + \dots + \exp(\beta) = N_M \exp(\beta) + N_F,$$
where $N_M$ and $N_F$ are the numbers of males and females, respectively, in $R(t_1)$.
The risk sets also play a crucial role in nested case-control studies; more on this later in the course.
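To make the partial likelihood concrete, here is a minimal Python sketch (not SAS, and not the routine SAS uses internally). It evaluates the log partial likelihood for one binary covariate on a small hypothetical data set with no ties and no delayed entry, and finds the maximizing $\beta$ by a simple golden-section search. All names and data are assumptions for illustration.

```python
import math

def log_partial_likelihood(beta, times, events, x):
    """Cox log partial likelihood for one covariate, assuming no tied death times."""
    ll = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if d_i != 1:
            continue                                  # censored: no own factor
        # Risk set R(t_i): everyone still under observation just before t_i
        denom = sum(math.exp(beta * x[j]) for j, t_j in enumerate(times) if t_j >= t_i)
        ll += beta * x[i] - math.log(denom)
    return ll

def maximize(f, lo=-5.0, hi=5.0, tol=1e-8):
    """Golden-section search for the maximum of a concave function on [lo, hi]."""
    g = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c, d = b - g * (b - a), a + g * (b - a)
        if f(c) < f(d):
            a = c
        else:
            b = d
    return (a + b) / 2

# Hypothetical data: x = 1 for males, 0 for females
times = [1, 2, 3, 4, 5, 6]
events = [1, 1, 1, 0, 1, 0]
x = [1, 0, 1, 0, 1, 0]

beta_hat = maximize(lambda b: log_partial_likelihood(b, times, events, x))
print(beta_hat, math.exp(beta_hat))   # estimate and the corresponding rate ratio
```

For this toy data set the score equation can be solved by hand, giving $\exp(\hat\beta) = 1 + \sqrt{5.5}$, which is a useful check of the search.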

So far the following assumptions have been made for the Cox model:
- The baseline hazard is assumed non-parametric, i.e. assumed to vary freely.
- The effects of covariates are additive and linear on the log rate scale.
- The ratio of the hazard rates for two subjects is constant over time. In other words, there is no interaction between the covariates and the time variable.
Let us look at the Melanoma data using SAS.

[Figure: Kaplan-Meier survival estimates, by sex (female, male). x-axis: Time (years), 0 to 15; y-axis: 0.00 to 1.00.]
What is the estimate of the RR between males and females?

Cox in SAS 9.1.3
In SAS, proc phreg and proc tphreg can be used for estimation in the Cox model. We will use proc tphreg, as this procedure handles categorical variables much more easily than proc phreg.
Using proc tphreg we define the variable sex to be categorical with the class statement. For the variable sex, 1 is males and 0 is females.

    proc tphreg data=melanom;
      class sex;
      model time*status(2,3) = sex;
    run;

Please note that we have two censoring codes, namely 2 and 3.
NB: In SAS 9.2 proc phreg handles class variables and proc tphreg is obsolete.

Part of output from proc tphreg:

    Analysis of Maximum Likelihood Estimates

                          Parameter   Standard                             Hazard
    Parameter        DF    Estimate      Error   Chi-Square   Pr > ChiSq    Ratio
    sex        0      1    -0.66214    0.26513       6.2370       0.0125    0.516

The column Parameter Estimate is $\hat{\beta}$. For a class variable SAS automatically chooses the highest level (here 1) as the reference. Thus, the rate ratio or Hazard Ratio is for females compared to males.
There is no estimate statement in proc (t)phreg, but a similar so-called contrast statement exists. Instead we can use the ref option in the class statement.
Note also the option risklimits in the model statement, which calculates the confidence interval for the hazard ratio.

    proc tphreg data=melanom;
      class sex(ref="0");
      model time*status(2,3) = sex / risklimits;
    run;
    ...
    Analysis of Maximum Likelihood Estimates

                          Parameter   Standard                             Hazard   95% Hazard Ratio
    Parameter        DF    Estimate      Error   Chi-Square   Pr > ChiSq    Ratio   Confidence Limits
    sex        1      1     0.66214    0.26513       6.2370       0.0125    1.939     1.153    3.260
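The hazard ratio and its confidence limits in the output follow directly from the parameter estimate and its standard error. The short Python sketch below recomputes them from the printed values; it assumes the usual Wald construction $\exp(\hat\beta \pm z \cdot SE)$ with $z \approx 1.96$, which reproduces the printed limits.

```python
import math

beta_hat = 0.66214     # parameter estimate for sex (males vs females), from the output
se = 0.26513           # standard error, from the output
z = 1.959964           # 97.5% quantile of the standard normal distribution

hazard_ratio = math.exp(beta_hat)
lower = math.exp(beta_hat - z * se)
upper = math.exp(beta_hat + z * se)
wald_chi2 = (beta_hat / se) ** 2   # Wald chi-square statistic with 1 df

print(round(hazard_ratio, 3), round(lower, 3), round(upper, 3), round(wald_chi2, 3))
```

Changing the reference level only flips the sign of the estimate: $\exp(-0.66214) \approx 0.516$ is the hazard ratio for females compared to males.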

Melanoma data, thickness of tumor given by variable gtyk:
    gtyk = 1 if <2 mm
    gtyk = 2 if 2-5 mm
    gtyk = 3 if >5 mm

    proc tphreg data=melanom;
      class gtyk;
      model time*status(2,3) = gtyk / risklimits;
    run;

    Type 3 Tests
                        Wald
    Effect    DF  Chi-Square   Pr > ChiSq
    gtyk       2     25.6749       <.0001

    Analysis of Maximum Likelihood Estimates

                          Parameter   Standard                             Hazard   95% Hazard Ratio
    Parameter        DF    Estimate      Error   Chi-Square   Pr > ChiSq    Ratio   Confidence Limits
    gtyk       1      1    -1.67324    0.38572      18.8176       <.0001    0.188     0.088    0.400
    gtyk       2      1    -0.11055    0.32391       0.1165       0.7329    0.895     0.475    1.689

Melanoma data, + age in years

    proc tphreg data=melanom;
      class gtyk sex;
      model time*status(2,3) = gtyk sex age / risklimits;
    run;

    Type 3 Tests
                        Wald
    Effect    DF  Chi-Square   Pr > ChiSq
    sex        1      2.3660       0.1240
    gtyk       2     21.3752       <.0001
    age        1      1.5241       0.2170

    Analysis of Maximum Likelihood Estimates

                          Parameter   Standard                             Hazard   95% Hazard Ratio
    Parameter        DF    Estimate      Error   Chi-Square   Pr > ChiSq    Ratio   Confidence Limits
    sex        0      1    -0.41608    0.27050       2.3660       0.1240    0.660     0.388    1.121
    gtyk       1      1    -1.53827    0.39232      15.3738       <.0001    0.215     0.100    0.463
    gtyk       2      1    -0.08180    0.32849       0.0620       0.8033    0.921     0.484    1.754
    age               1     0.01052    0.00852       1.5241       0.2170    1.011     0.994    1.028

Likelihood Ratio Test.

    proc tphreg data=melanom;
      class gtyk sex;
      model time*status(2,3) = gtyk sex;
    run;

    Model Fit Statistics
                     Without         With
    Criterion     Covariates   Covariates
    -2 LOG L         566.398      532.244
    AIC              566.398      538.244
    SBC              566.398      544.373

    -------------------------------------------

    proc tphreg data=melanom;
      class sex;
      model time*status(2,3) = sex;
    run;

    Model Fit Statistics
                     Without         With
    Criterion     Covariates   Covariates
    -2 LOG L         566.398      560.248
    AIC              566.398      562.248
    SBC              566.398      564.291

LR = 560.248 - 532.244 = 28.0, to be compared with a $\chi^2_2$ distribution (2 degrees of freedom).

SAS: p-value from chi-square test

    data temp;
      chisquare=28;
      df=2;
      p=1-probchi(chisquare,df);
    run;
    proc print data=temp;
    run;

    Obs   chisquare   df            p
      1          28    2   .000000832
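The same p-value can be checked outside SAS. For 2 degrees of freedom the chi-square survival function has the closed form $P(\chi^2_2 > x) = e^{-x/2}$, so no statistical library is needed. The Python sketch below is an illustration under that special case only; the function name `chi2_sf_2df` is an assumption.

```python
import math

def chi2_sf_2df(x):
    """Upper tail probability of a chi-square distribution with exactly 2 df."""
    return math.exp(-x / 2.0)

# Likelihood ratio statistic: difference in -2 log L between the nested models
lr = 560.248 - 532.244

p_rounded = chi2_sf_2df(28.0)   # the value used in the SAS data step
p_exact = chi2_sf_2df(lr)       # using the unrounded difference 28.004

print(p_rounded, p_exact)
```

For other degrees of freedom a general chi-square routine is required; the closed form above does not apply.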

Choice of Time-Scale
A study may be conducted over calendar time even though the natural time-scale is time since treatment (Melanoma study).
Cohort studies are often conducted by recruiting a random sample of the population at the start of the study and then following these subjects for a number of years (Framingham).
A natural time-scale may be age rather than time in study, which is most often an artificial time-scale constructed by the investigators.
What would the time-origin be if age was chosen as time-scale?

Vaccinations in Guinea-Bissau 1990-96
Rural Guinea-Bissau: 5274 children under 7 months of age were visited twice at home, with an interval of six months.
Information about vaccination (BCG, DTP, measles vaccine) was collected at each visit, and at the second visit death during follow-up was registered.
Some children moved away during follow-up and were censored; those who survived until the next visit were also censored.
Below are some of the variable names from the bissau data.

    fuptime   Follow-up time in days
    dead      0 = censored, 1 = dead
    bcg       1 = Yes, 2 = No
    agem      Age at first visit in months

Is the risk of dying associated with vaccination?

                                 Outcome
    Exposure              Died          Survived   Total
    BCG vaccinated        125 (3.8%)        3176    3301
    not BCG vaccinated     97 (4.9%)        1876    1973
    Total                 222 (4.2%)        5052    5274
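The percentages in the table, and a crude risk ratio that ignores follow-up time, can be recomputed directly from the counts. The Python sketch below is only an arithmetic check; the crude risk ratio is not the hazard ratio from the Cox model, which accounts for time at risk.

```python
# Counts taken from the 2x2 table
died_bcg, total_bcg = 125, 3301
died_no_bcg, total_no_bcg = 97, 1973

risk_bcg = died_bcg / total_bcg            # cumulative risk among BCG vaccinated
risk_no_bcg = died_no_bcg / total_no_bcg   # cumulative risk among not vaccinated
risk_total = (died_bcg + died_no_bcg) / (total_bcg + total_no_bcg)

crude_risk_ratio = risk_bcg / risk_no_bcg  # vaccinated vs not vaccinated

print(round(100 * risk_bcg, 1), round(100 * risk_no_bcg, 1), round(100 * risk_total, 1))
print(round(crude_risk_ratio, 2))
```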

    proc tphreg data=bissau;
      class bcg;
      model fuptime*dead(0)=bcg / rl;
    run;

    Testing Global Null Hypothesis: BETA=0
    Test                Chi-Square   DF   Pr > ChiSq
    Likelihood Ratio        4.2824    1       0.0385
    Score                   4.3761    1       0.0364
    Wald                    4.3474    1       0.0371

    Type 3 Tests
                        Wald
    Effect    DF  Chi-Square   Pr > ChiSq
    bcg        1      4.3474       0.0371

    Analysis of Maximum Likelihood Estimates

                          Parameter   Standard                             Hazard   95% Hazard Ratio
    Parameter        DF    Estimate      Error   Chi-Square   Pr > ChiSq    Ratio   Confidence Limits
    bcg        1      1    -0.28214    0.13532       4.3474       0.0371    0.754     0.578    0.983

    proc tphreg data=bissau;
      class bcg agem;
      model fuptime*dead(0)=bcg agem / rl;
    run;

    Type 3 Tests
                        Wald
    Effect    DF  Chi-Square   Pr > ChiSq
    bcg        1      5.6510       0.0174
    agem       6      7.7246       0.2590

    Analysis of Maximum Likelihood Estimates

                          Parameter   Standard                             Hazard   95% Hazard Ratio
    Parameter        DF    Estimate      Error   Chi-Square   Pr > ChiSq    Ratio   Confidence Limits
    bcg        1      1    -0.34720    0.14605       5.6510       0.0174    0.707     0.531    0.941
    agem       0      1     0.01053    0.35339       0.0009       0.9762    1.011     0.506    2.020
    agem       1      1     0.12553    0.34494       0.1324       0.7159    1.134     0.577    2.229
    agem       2      1    -0.24631    0.35903       0.4707       0.4927    0.782     0.387    1.580
    agem       3      1     0.20946    0.34502       0.3686       0.5438    1.233     0.627    2.425
    agem       4      1     0.34300    0.34265       1.0020       0.3168    1.409     0.720    2.758
    agem       5      1     0.34118    0.34699       0.9668       0.3255    1.407     0.713    2.777

Delayed entries
[Figure: two panels showing the same 12 individuals. Left panel: Time in study, x-axis Times (months), 0 to 25. Right panel: Age as time, x-axis Age (months), 0 to 25. y-axis in both: Individual, 1 to 12.]

Subjects are only at risk from age of entry and onwards. They are not at risk in our World of analysis before age of entry!
Delayed entries are easily handled by careful control of the RISK SET $R(t_i)$ at death time $t_i$ in the likelihood function:
$$L(\beta) = \prod_{i=1}^{d} \frac{\exp(\beta X_i)}{\sum_{j \in R(t_i)} \exp(\beta X_j)}$$
Only individuals at risk and under observation are included in the risk set $R(t_i)$ at time $t_i$.
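How delayed entry changes the risk set can be sketched in Python. The records and the function name `risk_set` below are hypothetical, and the sketch assumes the convention that an individual with entry age a and exit age b is at risk on the interval (a, b].

```python
def risk_set(records, t):
    """Ids of individuals at risk just before age t: entered before t, not yet exited."""
    return [r["id"] for r in records if r["entry"] < t <= r["exit"]]

# Hypothetical records with entry and exit ages in months
records = [
    {"id": 1, "entry": 0, "exit": 10},
    {"id": 2, "entry": 4, "exit": 12},
    {"id": 3, "entry": 8, "exit": 9},
    {"id": 4, "entry": 2, "exit": 6},
]

print(risk_set(records, 6))   # individual 3 has not yet entered at age 6
print(risk_set(records, 9))   # individual 4 has already left at age 9
```

With time in study as time-scale every entry would be 0 and the entry condition would always hold; with age as time-scale it is the entry condition that removes individuals who are not yet under observation.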

Delayed entries in SAS

    data bissau2;
      set bissau;
      outage=age+fuptime;
    run;

    proc tphreg data=bissau2;
      class bcg;
      model (age,outage)*dead(0)= bcg / rl;
    run;

    Analysis of Maximum Likelihood Estimates

                          Parameter   Standard                             Hazard   95% Hazard Ratio
    Parameter        DF    Estimate      Error   Chi-Square   Pr > ChiSq    Ratio   Confidence Limits
    bcg        1      1    -0.35542    0.14065       6.3854       0.0115    0.701     0.532    0.923

Time dependent explanatory variables
The Cox model can be expanded to include time-varying covariates:
$$\lambda_i(t) = \lambda_0(t) \exp(\beta X_i(t)).$$
The likelihood function for death times $t_1, \dots, t_d$ becomes
$$L(\beta) = \prod_{i=1}^{d} \frac{\exp(\beta X_i(t_i))}{\sum_{j \in R(t_i)} \exp(\beta X_j(t_i))}.$$
From this we can see that we only need to know the values of the covariates at the death times:
$$X_i(t_1), X_i(t_2), \dots, X_i(t_d).$$
The covariate values at any time other than a death time are not used in the likelihood function.

The simplest time-varying covariate is a binary variable that is allowed to change once during follow-up, e.g. new BCG vaccinations registered between visits in the Bissau data:
$$X_i(t) = \begin{cases} 0 & \text{if no BCG before time } t \\ 1 & \text{if BCG-time} \le t \end{cases}$$

A child being BCG-vaccinated after 3 months of follow-up.
[Figure: step function of BCG (0 or 1) against Follow up (months), 0 to 6, jumping from 0 to 1 at 3 months.]
The time-varying covariate is 0 in the time interval 0 to 3 months and 1 for the rest of follow-up.
For a child who was BCG vaccinated before the first visit, the time-varying covariate is 1 during the whole follow-up.

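Multi-state Model
[Diagram: three states, 0 = Unexposed, 1 = Exposed, 2 = Dead, with transitions 0 to 1 at rate $\lambda_{01}(t)$, 0 to 2 at rate $\lambda_{02}(t)$, and 1 to 2 at rate $\lambda_{12}(t)$.]
We want to compare $\lambda_{02}(t)$ and $\lambda_{12}(t)$. The transition $\lambda_{01}(t)$ is not modeled here.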

Instead of time of follow-up we will use age as time-scale to illustrate the use of BCG as a time-varying covariate in the Bissau data.
At visit 2 the vaccination cards were seen for the children at home, and an age at BCG vaccination (bcgage) was calculated:

     id  fuptime  dead  age  bcg  bcgage  outage
    ...
    486      159     0  199    1     107     358
    487      183     0   97    1      20     280
    488      183     0   43    2     174     226
    489      137     1  140    1      40     277
    490      183     0  165    1      46     348
    ...
    499      157     0  186    1      64     343
    500       25     1  191    2       .     216
    501      157     0  183    1      61     340
    ...

Binary time-varying covariate in SAS (I)

    proc tphreg data=bcg;
      if .<bcgage<outage then bcg_t=1; else bcg_t=0;
      model (age,outage)*dead(0)=bcg_t / rl;
    run;

    Analysis of Maximum Likelihood Estimates

                          Parameter   Standard                             Hazard   95% Hazard Ratio
    Parameter        DF    Estimate      Error   Chi-Square   Pr > ChiSq    Ratio   Confidence Limits
    bcg_t             1    -1.08278    0.14046      59.4286       <.0001    0.339     0.257    0.446

The if-statement

    if .<bcgage<outage then bcg_t=1; else bcg_t=0;

is recalculated at each death time. Inside the if-statement, outage (the time variable from the model statement) refers to the death time currently being evaluated (i.e. a $t_i$ in the likelihood), not to the child's own exit age.
For the first death time, which is $t_1 = 23$ days of age, the if-statement becomes

    if .<bcgage<23 then bcg_t=1; else bcg_t=0;

and is calculated for all children at risk at age 23 days (in $R(t_1 = 23)$) with their individual bcgage-values.
This is a recalculation of the time-varying covariate at each death time, cf. the likelihood function.
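The recalculation at each death time can be mimicked in Python. The sketch below uses the eight rows shown in the Bissau listing (a missing bcgage is coded as None) and evaluates the covariate for the children at risk at a given age. The function names are assumptions, and the age of 150 days is chosen only for illustration; it is not necessarily a death time in the data, whereas 216 days is the exit age of child 500, who died.

```python
# (id, age at entry, bcgage, outage, dead) copied from the listing; None = missing
rows = [
    (486, 199, 107, 358, 0),
    (487, 97, 20, 280, 0),
    (488, 43, 174, 226, 0),
    (489, 140, 40, 277, 1),
    (490, 165, 46, 348, 0),
    (499, 186, 64, 343, 0),
    (500, 191, None, 216, 1),
    (501, 183, 61, 340, 0),
]

def bcg_t(bcgage, t):
    """Mirror of 'if .<bcgage<t then bcg_t=1; else bcg_t=0;' at evaluation age t."""
    return 1 if bcgage is not None and bcgage < t else 0

def covariates_at(rows, t):
    """Covariate value at age t for every child in the risk set (age < t <= outage)."""
    return {cid: bcg_t(bcgage, t)
            for cid, age, bcgage, outage, dead in rows
            if age < t <= outage}

print(covariates_at(rows, 150))
print(covariates_at(rows, 216))
```

Child 488 illustrates the point: the covariate is 0 when evaluated at 150 days and 1 when evaluated at 216 days, because the vaccination took place at 174 days.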

Binary time-varying covariate in SAS (II)
Split persons with a changing time-varying covariate into two records:
- from age to bcgage: bcgvacc=0, status=0
- from bcgage to outage: bcgvacc=1, status=dead
and use delayed entries. Thus, we need to generate a new data set.

    data splitbcg;
      set bcg;
      /* never vaccinated, or vaccinated after exit: one unexposed record */
      if bcgage=. or bcgage>outage then do;
        bcgvacc=0; entryage=age; exitage=outage; status=dead; output;
      end;
      /* vaccinated at or before entry: one exposed record */
      if .<bcgage<=age then do;
        bcgvacc=1; entryage=age; exitage=outage; status=dead; output;
      end;
      /* vaccinated during follow-up: split into two records */
      if age<bcgage<=outage then do;
        bcgvacc=0; entryage=age;    exitage=bcgage; status=0;    output;
        bcgvacc=1; entryage=bcgage; exitage=outage; status=dead; output;
      end;
    run;

     id  fuptime  dead  age  bcg  bcgage  outage  bcgvacc  entryage  exitage  status
    488      183     0   43    2     174     226        0        43      174       0
    488      183     0   43    2     174     226        1       174      226       0
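The three branches of the data step can be mirrored one-to-one in Python, which makes it easy to check the splitting on single children. This is a sketch for illustration only; the function name `split_record` and the tuple layout (bcgvacc, entryage, exitage, status) are assumptions.

```python
def split_record(age, bcgage, outage, dead):
    """Return a list of (bcgvacc, entryage, exitage, status) records for one child."""
    if bcgage is None or bcgage > outage:
        # never vaccinated, or vaccinated after exit: one unexposed record
        return [(0, age, outage, dead)]
    if bcgage <= age:
        # vaccinated at or before entry: one exposed record
        return [(1, age, outage, dead)]
    # vaccinated during follow-up: unexposed until bcgage (censored), exposed after
    return [(0, age, bcgage, 0), (1, bcgage, outage, dead)]

print(split_record(43, 174, 226, 0))    # child 488: vaccinated during follow-up
print(split_record(191, None, 216, 1))  # child 500: never vaccinated, died
print(split_record(199, 107, 358, 0))   # child 486: vaccinated before entry
```

Note that the first of the two split records always has status 0: the child did not die at the age of vaccination, so that record ends in censoring.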

    proc tphreg data=splitbcg;
      class bcgvacc(ref="0");
      model (entryage,exitage)*status(0)=bcgvacc / rl;
    run;

                          Parameter   Standard                             Hazard   95% Hazard Ratio
    Parameter        DF    Estimate      Error   Chi-Square   Pr > ChiSq    Ratio   Confidence Limits
    bcgvacc    1      1    -1.08278    0.14046      59.4286       <.0001    0.339     0.257    0.446

Other time-varying covariates
Effect of binary X (0,1) changes at $t_0$:
$$\lambda_i(t) = \lambda_0(t) \exp(\beta_1 X_i + \beta_2 X_i I(t \ge t_0)),$$
where
$$I(t \ge t_0) = \begin{cases} 1 & \text{if } t \ge t_0 \\ 0 & \text{if } t < t_0 \end{cases}$$
Can be handled by method I+II.
Effect of binary X (0,1) decreases or increases with time:
$$\lambda_i(t) = \lambda_0(t) \exp(\beta_1 X_i + \beta_2 (X_i \cdot t)).$$
Can be handled by method I, by splitting at each failure, or by special options.
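As a worked step (a derivation from the two models above, not taken from the slides), the hazard ratio between an individual with $X = 1$ and one with $X = 0$ follows by dividing the two hazards, so that the baseline hazard cancels:

```latex
% Effect changes at t_0: the hazard ratio is piecewise constant
\mathrm{HR}(t)
  = \frac{\lambda_0(t)\exp\bigl(\beta_1 + \beta_2 I(t \ge t_0)\bigr)}{\lambda_0(t)}
  = \begin{cases}
      \exp(\beta_1)           & \text{if } t < t_0 \\
      \exp(\beta_1 + \beta_2) & \text{if } t \ge t_0
    \end{cases}

% Effect changes linearly with time on the log scale
\mathrm{HR}(t)
  = \frac{\lambda_0(t)\exp(\beta_1 + \beta_2 t)}{\lambda_0(t)}
  = \exp(\beta_1 + \beta_2 t)
```

In both models $\beta_2 = 0$ gives back the proportional hazards model with constant hazard ratio $\exp(\beta_1)$, so a test of $\beta_2 = 0$ can be read as a check of proportionality for X.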

Stanford Heart Transplant Data (p. 235)
In a report (Crowley and Hu, J Amer. Statist Assoc. 1977) on the Stanford Heart Transplantation Study, patients identified as being eligible (N=103) for a heart transplant were followed until death or censoring.
In total 65 received a transplant during follow-up, whereas 38 did not.
Assess whether transplanted patients survive better.
On the next slide you will find the variables in the transplant data set. Here we will discuss how to analyse the data, and in the exercises we will do some of the analyses.

Stanford Heart Transplant Data: variables

    age        age (in years) at entry into the study.
    cens       0 = censoring, 1 = dead.
    days       number of days from entry to death/censoring.
    trans      1 = the person had a heart transplantation, 0 = otherwise.
    wait       number of days from entry to transplantation.
               NB: if trans = 0 then wait = -1.
    mismatch   1 = mismatch between HLA type in donor and patient, 0 = no mismatch.
               NB: if trans = 0 then mismatch = -1.

    Obs  age  cens  days  trans  wait  mismatch
     52   56     1    90      1    27         1
     53   53     1    96      1    67         0
     54   48     1   100      1    46         0
     55   41     1   102      0    -1        -1
     56   28     0   109      1    96         1
     57   46     1   110      1    60         0
     58   23     0   131      1    21         1
     59   41     1   149      0    -1        -1
     60   47     1   153      1    26         0
     61   43     1   165      1     4         0
     62   26     0   180      1    13         0
     63   52     1   186      1   160         1
     64   47     1   188      1    41         0
     65   51     1   207      1   139         1
     66   51     1   219      1    83         1
     67    8     1   263      0    -1        -1
     68   47     0   265      1    28         0
     69   48     1   285      1    32         1
     70   19     1   285      1    57         0
     71   49     1   308      1    28         0

Piecewise Constant Hazard Rate = Poisson regression
Divide the time scale into K pieces and assume piecewise constant but different hazard rates in each of the intervals. This may provide a sensible summary of many phenomena and is often used in epidemiology.
[Diagram: Age axis with cut points $c_0 = 0, c_1, c_2, c_3, \dots, c_{K-1}, c_K$ and rates $\lambda_1, \lambda_2, \lambda_3, \dots, \lambda_K$ on the successive intervals.]
Thus
$$\lambda(t) = \lambda_k \quad \text{for } t \in (c_{k-1}, c_k], \quad k = 1, \dots, K$$
The intervals do not need to be of the same length. We only need to keep a record of the total number of deaths and the exposure time in each group.
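The bookkeeping of deaths and exposure time per interval can be sketched in Python. The data, cut points and function name `deaths_and_exposure` below are hypothetical; the sketch assumes that everyone enters at time 0 and estimates each rate as deaths divided by person-time in the interval.

```python
def deaths_and_exposure(subjects, cuts):
    """Count deaths and person-time in each interval (c[k-1], c[k]].

    subjects : list of (exit_time, died) with entry at time 0
    cuts     : increasing cut points [c0=0, c1, ..., cK]
    """
    deaths = [0] * (len(cuts) - 1)
    exposure = [0.0] * (len(cuts) - 1)
    for exit_time, died in subjects:
        for k in range(1, len(cuts)):
            lo, hi = cuts[k - 1], cuts[k]
            exposure[k - 1] += max(0.0, min(exit_time, hi) - lo)  # time spent in (lo, hi]
            if died and lo < exit_time <= hi:
                deaths[k - 1] += 1
    return deaths, exposure

# Hypothetical follow-up data: (exit time, 1 = died / 0 = censored)
subjects = [(5, 1), (15, 1), (20, 0), (12, 0)]
cuts = [0, 10, 20]

deaths, exposure = deaths_and_exposure(subjects, cuts)
rates = [d / e for d, e in zip(deaths, exposure)]   # estimated lambda_k per interval
print(deaths, exposure, rates)
```

These per-interval death counts and exposure times are the inputs a Poisson regression would use, with the log of the exposure time as offset.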

We can further divide each interval into categories of covariates, e.g. sex (F=females, M=males):
[Diagram: Age axis with cut points $c_0 = 0, c_1, c_2, c_3, \dots, c_{K-1}, c_K$, with rates $\lambda_{1F}, \lambda_{2F}, \lambda_{3F}, \dots, \lambda_{KF}$ for females and $\lambda_{1M}, \lambda_{2M}, \lambda_{3M}, \dots, \lambda_{KM}$ for males.]
It is not straightforward to split the time-scale in SAS, but user-written SAS macros exist. See for example: http://staff.pubhealth.ku.dk/~bxc/lexis/lexis.sas
Stata: use the stsplit command.
R: packages exist (e.g. the Epi package).
SPSS?