Survival Analysis
732G34 Statistisk analys av komplexa data (Statistical Analysis of Complex Data)
Krzysztof Bartoszek (krzysztof.bartoszek@liu.se)
10 and 11 January 2018
Department of Computer and Information Science, Linköping University
Survival analysis
In brief: the study of time-to-event questions.
- What are the chances of a person surviving until pension age? What are the chances of a pension being drawn for at least 10 years? (demographics, economics)
- What are the chances of cancer remission within 5 years of successful treatment? (medicine)
- What are the chances of a computer not requiring replacement within 2 years? (engineering, economics)
- Do the survival times of two groups receiving different treatments differ significantly? (medicine, biology)
- What are the chances of a person being arrested within a year of release?
Lesson structure
- Two lectures: 10 January (13:15-15), 11 January (08:15-10)
- Computer examination: 15 January (10:15-12)
- Hand-in of examination: at the latest 22 January, 10:15
- Answer in Swedish or English, as you prefer.
- Electronic reports as .pdf (or .txt).
- INCLUDE YOUR NAME IN THE REPORT!
- Work individually. Disclose ALL collaborations and sources.
- Provide source code (if used).
- E-mail contact: krzysztof.bartoszek@liu.se
Course materials, software
- Lecture slides
- 2015/16 lecture slides (in Swedish, Karl Wahlin)
- Articles on Kaplan-Meier curves, Cox regression and the hazard ratio on the course website
- http://www.ats.ucla.edu/stat/examples/alda.htm
- https://cran.r-project.org/web/packages/survival/index.html (vignettes)
- Paul D. Allison, Event History and Survival Analysis, SAGE Publishing, 2014 (if anyone feels they need a book)
- R (survival package)
- R (KMsurv package, data sets)
- SPSS (e.g. Karl Wahlin's lecture)
- SAS (e.g. P. D. Allison's books)
Survival analysis: the applications
Where?
- Medical studies
- Engineering studies (reliability theory)
- Social studies (duration analysis)
- Any time-to-event questions
What are people interested in?
- Visualize data
- Test hypotheses
- Make predictions
- Model phenomena
- Estimate parameters
Important terms
- Event: something whose occurrence we are interested in, e.g. death, birth, getting arrested, obtaining a salary raise, onset of disease, being cured of disease, etc.
- Failure of an individual: the event occurs for a given individual, e.g. dying, giving birth, being born, landing in prison, getting a pay increase, falling ill, becoming healthy, etc.
- Individual at risk: the event has not yet occurred but may occur in the future.
- Population at risk: the collection of individuals for which the event can occur.
Experimental setups
The survival function: definition
- $T$: random variable, time to event
- $T$'s cumulative distribution function: $F(t) := \Pr(T \le t)$
- Survival function: $S(t) := \Pr(T > t)$
- $S(t)$ is the probability of surviving longer than $t$.
Recap questions: How is $S(t)$ related to the CDF? And to $T$'s density (if it exists)?
Question: Is $S(t)$ a monotonic function?
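One way to answer the recap questions, written out as a short derivation. It uses only the definitions on this slide and assumes that $T$ has a density $f$ wherever the derivative is taken.

```latex
\begin{align*}
S(t) &= \Pr(T > t) = 1 - \Pr(T \le t) = 1 - F(t), \\
f(t) &= \frac{d}{dt} F(t) = -\frac{d}{dt} S(t) \quad \text{(if the density exists)}.
\end{align*}
```

Because $F$ is non-decreasing, $S$ is non-increasing in $t$, so it is monotonic.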
The survival function: toy example
Discrete time, we have minute bins.

      individ   time   status
 1       1.00   1.00     2.00
 2       2.00   2.00     2.00
 3       3.00   1.00     2.00
 4       4.00   3.00     2.00
 5       5.00   3.00     2.00
 6       6.00   2.00     2.00
 7       7.00   1.00     2.00
 8       8.00   4.00     2.00
 9       9.00   2.00     2.00
10      10.00   5.00     2.00
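Life table: toy example
Columns: $x$ time bin; $n_x$ number at start; $w_x$ censored; $r_x$ at risk; $d_x$ deaths; $q_x$ probability of death; $p_x$ probability of surviving bin $x$; $S_x$ probability of surviving beyond $x$. There is no censoring in this example, so $w_x = 0$ and $r_x = n_x$.

  x   n_x   w_x   r_x   d_x   q_x   p_x   S_x
  1   10    0     10    3     0.3   0.7   7/10
  2    7    0      7    3     3/7   4/7   28/70
  3    4    0      4    2     0.5   0.5   14/70
  4    2    0      2    1     0.5   0.5   7/70
  5    1    0      1    1     1     0     0

$q_x = d_x / r_x$, $p_x = 1 - q_x$, $S_x = \prod_{y \le x} p_y$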
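The life table can be reproduced with a few lines of base R. This is a minimal sketch, not the lecturer's code: the variable names are chosen here for illustration, and the death times are the ten values from the toy example (all deaths observed, no censoring).

```r
# Death times (minute bins) of the ten individuals in the toy example
times <- c(1, 2, 1, 3, 3, 2, 1, 4, 2, 5)

x   <- sort(unique(times))                              # time bins
d_x <- as.vector(table(factor(times, levels = x)))      # deaths in each bin
n_x <- length(times) - c(0, cumsum(d_x)[-length(d_x)])  # number at start of bin
r_x <- n_x                                              # no censoring: at risk = at start
q_x <- d_x / r_x                                        # probability of death in bin
p_x <- 1 - q_x                                          # probability of surviving the bin
S_x <- cumprod(p_x)                                     # probability of surviving beyond x

life_table <- data.frame(x, n_x, r_x, d_x, q_x, p_x, S_x)
print(life_table)
```

The cumulative product in the last step is the formula $S_x = \prod_{y \le x} p_y$ applied row by row.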
The survival function: toy example
[Figure: cumulative survival (y-axis, 0.0 to 1.0) against time (x-axis, 0 to 5)]
The survival function: R code

    library(survival)
    ## data for survival analysis needs a binary status variable,
    ## e.g. 1 alive, 2 dead, see ?Surv
    ## death times of the toy example (the values printed below)
    times.toydata <- c(1, 2, 1, 3, 3, 2, 1, 4, 2, 5)
    dftoydata <- cbind(individ = 1:10, time = times.toydata, status = 2)
    surv.toydata <- with(as.data.frame(dftoydata), Surv(time, status))
    print(surv.toydata)
    ## [1] 1 2 1 3 3 2 1 4 2 5
    dftoydata[, "time"]
    ## [1] 1 2 1 3 3 2 1 4 2 5
    ## Old versions would accept survfit(surv.toydata)
    sfit.toydata <- survfit(surv.toydata ~ 1)
    plot(sfit.toydata, conf.int = FALSE, xlab = "time",
         ylab = "Cumulative survival", lwd = 3)
The survival function: toy example 2

      individ   time   status   treatment
 1       1       3      2.00       a
 2       2       3      2.00       a
 3       3       2      2.00       b
 4       4       4      2.00       b
 5       5       1      2.00       b
 6       6       1      2.00       a
 7       7       5      2.00       a
 8       8       4      2.00       b
 9       9       1      2.00       a
10      10       4      2.00       b

Exercise: Construct separate life tables for groups a and b.
The survival function: toy example 2
[Figure: cumulative survival (y-axis, 0.0 to 1.0) against time (x-axis, 0 to 5), one curve per treatment group a and b]
The survival function: R code

    library(survival)
    ## now we see why we need a data frame
    dftoydata2 <- data.frame(individ = 1:10,
                             time = sample(1:5, 10, replace = TRUE),
                             status = 2,
                             treatment = sample(c("a", "b"), 10, replace = TRUE))
    times.toydata2 <- with(dftoydata2, Surv(time, status))
    ## and
    sfit.toydata2 <- survfit(times.toydata2 ~ treatment, data = dftoydata2)
    plot(sfit.toydata2, lwd = 3, lty = c(1, 2), xlab = "time",
         ylab = "Cumulative survival", cex.lab = 1.5)
    legend("topright", legend = c("a", "b"), lty = c(1, 2), lwd = 3,
           cex = 2, bty = "n")
Censored data
Right censoring
- What if an entity does not fail during the study? E.g. the patient does not die within the observation period.
- What if an entity dropped out of the analysis? E.g. the patient died of other causes.
Left censoring
- What if we know an entity failed but not when the failure occurred? E.g. the exact time of death is unknown.
Interval censoring
- We only know that the failure occurred inside a given time interval.
Censored data: toy example 3

      individ     time   status
 1       1       93.06     2.00
 2       2      266.68     1.00
 3       3       87.56     1.00
 4       4      270.87     1.00
 5       5      206.27     2.00
 6       6       25.69     2.00
 7       7      153.26     2.00
 8       8       16.75     2.00
 9       9      170.52     1.00
10      10      134.20     2.00
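Life table: toy example
Time intervals can be arbitrary (as useful). The number at risk is corrected for censoring: $r_x = n_x - w_x/2$.
Columns: $x$ time interval; $n_x$ number at start; $w_x$ censored; $r_x$ at risk; $d_x$ deaths; $q_x$ probability of death; $p_x$ probability of surviving interval $x$; $S_x$ probability of surviving beyond $x$.

  x    n_x   w_x   r_x   d_x   q_x     p_x     S_x
  20   10    0     10    1     0.1     0.9     0.9
  30    9    0      9    1     1/9     8/9     72/90
 100    8    1      7.5  1     10/75   65/75   0.693
 150    6    0      6    2     1/3     2/3     0.462
 200    4    1      3.5  0     0       1       0.462
 250    3    0      3    1     1/3     2/3     0.308
 300    2    2      1    0     0       1       0.308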
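The arithmetic of this table can be checked in base R. The sketch below takes the counts $n_x$, $w_x$ and $d_x$ exactly as they appear in the table (it does not re-bin the raw data) and applies the censoring correction $r_x = n_x - w_x/2$. The variable names are illustrative.

```r
# Counts copied from the life table: interval label, number at start,
# number censored, number of deaths
x   <- c(20, 30, 100, 150, 200, 250, 300)
n_x <- c(10, 9, 8, 6, 4, 3, 2)
w_x <- c(0, 0, 1, 0, 1, 0, 2)
d_x <- c(1, 1, 1, 2, 0, 1, 0)

r_x <- n_x - w_x / 2   # at risk: censored individuals count as half
q_x <- d_x / r_x       # probability of death in the interval
p_x <- 1 - q_x         # probability of surviving the interval
S_x <- cumprod(p_x)    # probability of surviving beyond the interval

censored_table <- data.frame(x, n_x, w_x, r_x, d_x, q_x, p_x, S_x)
print(round(censored_table, 3))
```

Counting each censored individual as half a person at risk reflects the assumption that, on average, they were observed for half of the interval.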
Kaplan-Meier estimation
In 1958 Edward Kaplan and Paul Meier published a paper on how to deal with incomplete observations.
We need to know the exact times of censoring, i.e. we work in continuous time.

records   n.max   n.start   events   *rmean   *se(rmean)   median   0.95LCL   0.95UCL
   10.0    10.0      10.0      6.0    161.7         30.3    153.3      93.1        NA
* restricted mean with upper limit = 271

To obtain 95% CIs, use the normal approximation: estimate ± 1.96 SE.
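To see where the numbers in this summary come from, the Kaplan-Meier estimate can be computed by hand in base R. This is a sketch under the assumption that the summary refers to toy example 3 (status 2 = death, status 1 = censored); the variable names are illustrative, and the standard error and confidence limits are not reproduced.

```r
# Toy example 3: status 2 = death observed, status 1 = censored
time   <- c(93.06, 266.68, 87.56, 270.87, 206.27, 25.69, 153.26, 16.75, 170.52, 134.20)
status <- c(2, 1, 1, 1, 2, 2, 2, 2, 1, 2)

event_times <- sort(unique(time[status == 2]))
n_at_risk   <- sapply(event_times, function(t) sum(time >= t))
n_deaths    <- sapply(event_times, function(t) sum(time == t & status == 2))

# Kaplan-Meier: product over event times of (1 - deaths / at risk)
km_surv <- cumprod(1 - n_deaths / n_at_risk)
km <- data.frame(time = event_times, n_at_risk, n_deaths, surv = km_surv)
print(km)

# Median: first event time at which the estimate drops to 0.5 or below
km_median <- min(event_times[km_surv <= 0.5])

# Restricted mean: area under the step function up to the largest observed time
steps    <- c(0, event_times, max(time))
heights  <- c(1, km_surv)
km_rmean <- sum(diff(steps) * heights)

print(c(median = km_median, rmean = km_rmean))
```

Censored individuals leave the risk set without causing a drop in the curve, which is why the estimate only changes at the six observed death times.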
Kaplan-Meier estimation: toy example 3
[Figure: cumulative survival (y-axis, 0.0 to 1.0) against time (x-axis, 0 to 250)]
Kaplan-Meier estimation: R code

    library(survival)
    dftoydata4 <- data.frame(individ = 1:30,
                             time = runif(30, min = 0, max = 300),
                             status = sample(c(1, 2), 30, prob = c(0.2, 0.8),
                                             replace = TRUE),
                             treatment = sample(c("a", "b"), 30, replace = TRUE))
    times.toydata4 <- with(dftoydata4, Surv(time, status))
    sfit.toydata4 <- survfit(times.toydata4 ~ treatment, data = dftoydata4)
    print(sfit.toydata4, print.rmean = TRUE)
    plot(sfit.toydata4, lwd = 1, lty = c(1, 2), xlab = "time",
         ylab = "Cumulative survival", cex.lab = 1.5, conf.int = TRUE)
    legend("bottomleft", legend = c("a", "b"), lty = c(1, 2), lwd = 3,
           cex = 2, bty = "n")
Kaplan-Meier estimation: two group comparison
[Figure: cumulative survival (y-axis, 0.0 to 1.0) against time (x-axis, 0 to 250), one curve per group a and b]
Risk, odds

               # people      # events (hay fever)   # non-events
Medication A   N_A = 144     D_A = 19               H_A = 125
Medication B   N_B = 146     D_B = 33               H_B = 113

Risk: R_A = ??, R_B = ??
Risk ratio: RR_{A/B} = ??
Odds: O_A = ??, O_B = ??
Odds ratio: OR_{A/B} = ??
Risk, odds
https://en.wikipedia.org/wiki/odds_ratio

               # people      # events (hay fever)   # non-events
Medication A   N_A = 144     D_A = 19               H_A = 125
Medication B   N_B = 146     D_B = 33               H_B = 113

Risk: R_A = D_A/N_A = 0.132, R_B = D_B/N_B = 0.226
Risk ratio: RR_{A/B} = R_A/R_B = 0.584, RR_{B/A} = R_B/R_A = 1.713
Odds (alternative formula?): O_A = D_A/H_A = 0.152, O_B = D_B/H_B = 0.292
Odds ratio: OR_{A/B} = O_A/O_B = 0.520, OR_{B/A} = O_B/O_A = 1.921
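The same quantities can be computed in base R from the counts on the slide. This is a small sketch with illustrative variable names; it also checks one possible answer to the "alternative formula" question, namely that the odds can be written as risk / (1 - risk).

```r
# Counts from the slide: people, events and non-events per medication
N <- c(A = 144, B = 146)
D <- c(A = 19,  B = 33)
H <- N - D                      # non-events: 125 and 113

risk <- D / N                   # R_A, R_B
odds <- D / H                   # O_A, O_B
odds_alt <- risk / (1 - risk)   # alternative formula for the odds

RR_AB <- unname(risk["A"] / risk["B"])
RR_BA <- unname(risk["B"] / risk["A"])
OR_AB <- unname(odds["A"] / odds["B"])
OR_BA <- unname(odds["B"] / odds["A"])

print(round(c(RR_AB = RR_AB, RR_BA = RR_BA, OR_AB = OR_AB, OR_BA = OR_BA), 3))
```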
Hazard rate (function)
We are interested in the chances of failure given that survival occurred up to time $t$:
$$h(t) := \lim_{\Delta t \to 0} \frac{\Pr(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}$$
This is the conditional density of failing at time $t$ given survival until $t$.
If time is discrete, then $\Delta t = 1$.
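The definition links the hazard to the survival function. The derivation below is a sketch under two assumptions that are not stated on the slide: $T$ is continuous with density $f$ (so that $\Pr(T \ge t) = S(t)$), and $T \ge 0$ (so that $S(0) = 1$).

```latex
\begin{align*}
h(t) &= \lim_{\Delta t \to 0} \frac{\Pr(t \le T < t + \Delta t)}{\Delta t \, \Pr(T \ge t)}
      = \frac{f(t)}{S(t)}
      = -\frac{d}{dt} \log S(t), \\
S(t) &= \exp\left( -\int_0^t h(u) \, du \right).
\end{align*}
```

For example, a constant hazard $h(t) = \lambda$ gives $S(t) = e^{-\lambda t}$, i.e. exponentially distributed survival times.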
Log-rank test
We have two groups (e.g. smokers/non-smokers, no chemo/chemo, men/women).
Consider $HR := h_1(t)/h_2(t)$.
$H_0: HR = 1$
$H_a: HR \ne 1$
- The test compares observed and expected numbers of events at each event time.
- Non-parametric.
- Under $H_0$ the test statistic has (asymptotically) a $\chi^2$ distribution.
- If the p-value is below the significance level (e.g. 0.05), we reject $H_0$ and conclude that the survival distributions differ.
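The test statistic can be computed by hand. The base R sketch below applies the standard two-group log-rank formula (observed minus expected events in one group, with a hypergeometric variance at each event time) to the data of toy example 2. The function `logrank` and its arguments are written for this illustration and are not part of the survival package, where `survdiff` is the function to use in practice.

```r
# Toy example 2: all deaths observed (no censoring)
time  <- c(3, 3, 2, 4, 1, 1, 5, 4, 1, 4)
group <- c("a", "a", "b", "b", "b", "a", "a", "b", "a", "b")
event <- rep(TRUE, length(time))

# Two-group log-rank test; 'ref' names the group whose events are counted
logrank <- function(time, event, group, ref) {
  in_ref <- group == ref
  obs <- 0; expd <- 0; v <- 0
  for (t in sort(unique(time[event]))) {
    n  <- sum(time >= t)                    # at risk, both groups
    n1 <- sum(time >= t & in_ref)           # at risk, reference group
    d  <- sum(time == t & event)            # events, both groups
    d1 <- sum(time == t & event & in_ref)   # events, reference group
    obs  <- obs + d1
    expd <- expd + d * n1 / n               # expected events under H0
    if (n > 1) {                            # variance term is 0 when n = 1
      v <- v + d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    }
  }
  chisq <- (obs - expd)^2 / v
  c(observed = obs, expected = expd, chisq = chisq,
    p = pchisq(chisq, df = 1, lower.tail = FALSE))
}

res_a <- logrank(time, event, group, ref = "a")
res_b <- logrank(time, event, group, ref = "b")
print(res_a)
```

With only ten individuals the statistic is small (about 0.07), so the test gives no evidence of a difference between groups a and b; the choice of reference group does not change the statistic.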
Cox regression
- To estimate $h(t)$ from data we would need a parametric form for it in terms of explanatory variables, and justifying a given model can be difficult.
- Regression methods are popular for relating predictor and response variables (here survival).
- Cox regression can be used to study the effects of predictors on survival time, to compare groups under different treatments and to predict survival time.
- It is a semi-parametric method: it does not assume a particular form for the hazard function but does assume a regression framework for the explanatory variables.
- Explanatory variables can be discrete or continuous.
- Explanatory variables can be fixed (e.g. race, ever smoked) or time dependent (e.g. blood pressure, salary).
- Response variable: the hazard rate.
Cox regression
$$h(t) = h_0(t) \exp(b_1 x_1 + b_2 x_2 + \dots + b_k x_k)$$
- $h(t)$: hazard rate for an individual at time $t$
- $h_0(t)$: baseline hazard (all $x_i = 0$), the intercept sits here
- the $b_i$ are coefficients that modify the baseline hazard
- the $x_i$ are explanatory variables
- the $b_i$ are obtained by e.g. (partial) maximum likelihood ($h_0(t)$ is usually not of interest)
- It is a proportional hazards model: covariates are multiplicatively related to the hazard.
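Why the model is called proportional can be seen by comparing two individuals. In the sketch below, $x$ and $x'$ denote the covariate vectors of two hypothetical individuals (notation introduced here, not on the slide); the baseline hazard cancels in the ratio.

```latex
\begin{align*}
\frac{h(t \mid x)}{h(t \mid x')}
  &= \frac{h_0(t) \exp(b_1 x_1 + \dots + b_k x_k)}{h_0(t) \exp(b_1 x'_1 + \dots + b_k x'_k)}
   = \exp\left( \sum_{i=1}^{k} b_i (x_i - x'_i) \right).
\end{align*}
```

The ratio does not depend on $t$. In particular, increasing a single covariate $x_j$ by one unit, with the others held fixed, multiplies the hazard by $e^{b_j}$.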
Cox regression: example
veteran data set in the survival package: randomized trial of two treatment regimens for lung cancer.
D. Kalbfleisch and R. L. Prentice (1980), The Statistical Analysis of Failure Time Data. Wiley, New York.
- trt: 1=standard, 2=test
- celltype: 1=squamous, 2=smallcell, 3=adeno, 4=large
- time: survival time
- status: censoring status
- karno: Karnofsky performance score (100=good)
- diagtime: months from diagnosis to randomization
- age: in years. MAY WE USE IT?
- prior: prior therapy 0=no, 1=yes
Cox regression: example
[Figure: cumulative survival (y-axis, 0.0 to 1.0) against time (x-axis, 0 to 1000), one curve each for trt=1 and trt=2]
Cox regression: R code

    library(survival)
    v.times <- with(veteran, Surv(time, status))
    ## Kaplan-Meier curves per treatment
    v.survfit <- survfit(v.times ~ trt, data = veteran)
    plot(v.survfit, lty = c(1, 2), lwd = 1, xlab = "time",
         ylab = "Cumulative survival", cex.lab = 1.5, conf.int = TRUE)
    legend("topright", legend = c("trt=1", "trt=2"), lty = c(1, 2), lwd = 3,
           cex = 2, bty = "n")
    print(v.survfit, print.rmean = TRUE)
    ## log-rank tests
    print(survdiff(v.times ~ veteran$trt))
    print(survdiff(v.times ~ veteran$celltype))
    ## Cox regressions with an increasing number of covariates
    summary(coxph(v.times ~ karno, data = veteran))
    summary(coxph(v.times ~ karno + trt, data = veteran))
    summary(coxph(v.times ~ karno + trt + celltype, data = veteran))
Model validation
- Schoenfeld residuals, also called partial residuals.
- We want to test whether the covariates' effect on the hazard is independent of time.
- Schoenfeld (1982), Partial Residuals for The Proportional Hazards Regression Model.
- There is a residual for each covariate $c$ and each individual $i$:
$$r_{ci} = x_{ci} - \sum_{k \in R_i} x_{ck} \, p_{ck}, \qquad p_{ck} = \frac{\exp(\beta^T X_k)}{\sum_{r \in R_i} \exp(\beta^T X_r)}$$
- The sum is the expectation of the covariate with respect to the likelihood of failure of each individual in the risk set; $R_i$ is the set of individuals at risk when $i$ fails.
Schoenfeld residuals
- Only defined for uncensored individuals, i.e. those that failed.
- Should be independent of time. In other words, the slope of the regression of the residuals on time should be 0.

    library(survival)
    v.times <- with(veteran, Surv(time, status))
    CoxKarnoTrtCell <- coxph(v.times ~ karno + trt + celltype, data = veteran)
    ZPHktc <- cox.zph(CoxKarnoTrtCell)
    ## one plot per coefficient; note that from survival 3.0 on cox.zph()
    ## tests per term by default, terms=FALSE gives one result per coefficient
    plot(ZPHktc, var = 1); plot(ZPHktc, var = 2); plot(ZPHktc, var = 3)
    plot(ZPHktc, var = 4); plot(ZPHktc, var = 5)
Schoenfeld residuals
[Figure: five panels of Beta(t) (y-axis) against Time (x-axis, 8.2 to 390), for karno, trt2, celltypesmallcell, celltypeadeno and celltypelarge]
http://www.ats.ucla.edu/stat/examples/asa/test proportionality.htm
Extending the proportional hazards model WHY?
Extending the proportional hazards model
- The basic assumption is that the hazard ratio between different groups is constant over time.
- But covariates can be time dependent, e.g. age, salary, etc.
- This is modelled as an interaction term between time and the covariate.
- Discrete case (i.e. fixed within time intervals): create a separate entry for every time interval in which the individual has a constant value, using the (start, stop] version of Surv.
- Continuous case: we need a parametric form for the interaction, e.g. $h(t) = h_0(t) \exp(b_1 x_1 + \dots + b_k x_k + c \, x_{k+1} t)$.
- See the Using Time Dependent Covariates vignette at https://cran.r-project.org/web/packages/survival/index.html, especially for the veteran data analysis.
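The discrete case can be illustrated with the data layout alone. The base R sketch below builds the (start, stop] entries for one hypothetical individual whose covariate changes twice during follow-up; all values, the column names and the 0/1 event coding are assumptions made for this illustration.

```r
# Hypothetical individual: followed from time 0 to 10, event at time 10;
# the covariate (here called salary) changes at times 4 and 7
follow_up_end <- 10
event_at_end  <- 1            # 1 = event, 0 = censored (illustrative coding)
change_times  <- c(0, 4, 7)   # times at which a new covariate value starts
values        <- c(20, 25, 30)

start <- change_times
stop  <- c(change_times[-1], follow_up_end)

# One row per interval with constant covariate value; the event can only
# occur at the end of the last interval
long <- data.frame(id = 1,
                   start = start,
                   stop = stop,
                   event = c(rep(0, length(start) - 1), event_at_end),
                   salary = values)
print(long)

# On a full data set in this layout, the model would be fitted with the
# survival package as: coxph(Surv(start, stop, event) ~ salary, data = long)
```

Each row describes the individual only over an interval in which the covariate is constant, so the rows of one individual never overlap and at most the last one carries the event.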
Recap
- Modelling time to event in a population.
- Kaplan-Meier curves.
- Modelling how different treatments affect failure time.
- Proportional hazards model, Cox regression.