Survival analysis in R

Similar documents
Survival analysis in R

Goodness-Of-Fit for Cox s Regression Model. Extensions of Cox s Regression Model. Survival Analysis Fall 2004, Copenhagen

Part [1.0] Measures of Classification Accuracy for the Prediction of Survival Times

The coxvc_1-1-1 package

β j = coefficient of x j in the model; β = ( β1, β2,

Survival Regression Models

In contrast, parametric techniques (fitting exponential or Weibull, for example) are more focussed, can handle general covariates, but require

Faculty of Health Sciences. Cox regression. Torben Martinussen. Department of Biostatistics University of Copenhagen. 20. september 2012 Slide 1/51

Lecture 8 Stat D. Gillen

Lecture 12. Multivariate Survival Data Statistics Survival Analysis. Presented March 8, 2016

Time-dependent coefficients

Fitting Cox Regression Models

Relative-risk regression and model diagnostics. 16 November, 2015

MAS3301 / MAS8311 Biostatistics Part II: Survival

Survival Analysis. Stat 526. April 13, 2018

Survival Analysis. STAT 526 Professor Olga Vitek

Lecture 7. Proportional Hazards Model - Handling Ties and Survival Estimation Statistics Survival Analysis. Presented February 4, 2016

Outline. Cox's regression model Goodness-of-t methods. Cox's proportional hazards model: Survival analysis

REGRESSION ANALYSIS FOR TIME-TO-EVENT DATA THE PROPORTIONAL HAZARDS (COX) MODEL ST520

Multistate models and recurrent event models

Typical Survival Data Arising From a Clinical Trial. Censoring. The Survivor Function. Mathematical Definitions Introduction

Multistate models and recurrent event models

Tied survival times; estimation of survival probabilities

Exercises. (a) Prove that m(t) =

Chapter 4 Regression Models

Lecture 7 Time-dependent Covariates in Cox Regression

A COMPARISON OF POISSON AND BINOMIAL EMPIRICAL LIKELIHOOD Mai Zhou and Hui Fang University of Kentucky

DAGStat Event History Analysis.

Residuals and model diagnostics

FULL LIKELIHOOD INFERENCES IN THE COX MODEL

Multistate Modeling and Applications

Survival Analysis. 732G34 Statistisk analys av komplexa data. Krzysztof Bartoszek

Survival Analysis I (CHL5209H)

Introduction to Statistical Analysis

Lecture 9. Statistics Survival Analysis. Presented February 23, Dan Gillen Department of Statistics University of California, Irvine

STAT331. Cox s Proportional Hazards Model

Lecture 22 Survival Analysis: An Introduction

Multivariable Fractional Polynomials

Outline. Frailty modelling of Multivariate Survival Data. Clustered survival data. Clustered survival data

Multi-state models: prediction

Extensions of Cox Model for Non-Proportional Hazards Purpose

TMA 4275 Lifetime Analysis June 2004 Solution

Ph.D. course: Regression models. Introduction. 19 April 2012

LSS: An S-Plus/R program for the accelerated failure time model to right censored data based on least-squares principle

Cox s proportional hazards/regression model - model assessment

Part III Measures of Classification Accuracy for the Prediction of Survival Times

Lecture 3. Truncation, length-bias and prevalence sampling

Duration Analysis. Joan Llull

Philosophy and Features of the mstate package

Ph.D. course: Regression models. Regression models. Explanatory variables. Example 1.1: Body mass index and vitamin D status

STAT Sample Problem: General Asymptotic Results

Lecture 6 PREDICTING SURVIVAL UNDER THE PH MODEL

4. Comparison of Two (K) Samples

Lecture 10. Diagnostics. Statistics Survival Analysis. Presented March 1, 2016

SSUI: Presentation Hints 2 My Perspective Software Examples Reliability Areas that need work

Booklet of Code and Output for STAD29/STA 1007 Midterm Exam

Semiparametric Regression

ADVANCED STATISTICAL ANALYSIS OF EPIDEMIOLOGICAL STUDIES. Cox s regression analysis Time dependent explanatory variables

On a connection between the Bradley-Terry model and the Cox proportional hazards model

A Handbook of Statistical Analyses Using R 2nd Edition. Brian S. Everitt and Torsten Hothorn

UNIVERSITY OF CALIFORNIA, SAN DIEGO

Booklet of Code and Output for STAD29/STA 1007 Midterm Exam

Outline. Frailty modelling of Multivariate Survival Data. Clustered survival data. Clustered survival data

University of California, Berkeley

Matched Pair Data. Stat 557 Heike Hofmann

PhD course in Advanced survival analysis. One-sample tests. Properties. Idea: (ABGK, sect. V.1.1) Counting process N(t)

Part III. Hypothesis Testing. III.1. Log-rank Test for Right-censored Failure Time Data

Chapter 7: Hypothesis testing

Multivariable Fractional Polynomials

Log-linearity for Cox s regression model. Thesis for the Degree Master of Science

Proportional hazards regression

Statistics 262: Intermediate Biostatistics Non-parametric Survival Analysis

Chapter 4 Fall Notations: t 1 < t 2 < < t D, D unique death times. d j = # deaths at t j = n. Y j = # at risk /alive at t j = n

Statistical Analysis of Competing Risks With Missing Causes of Failure

Survival Analysis Math 434 Fall 2011

[Part 2] Model Development for the Prediction of Survival Times using Longitudinal Measurements

Accelerated Failure Time Models

Cox s proportional hazards model and Cox s partial likelihood

Journal of Statistical Software

Empirical Processes & Survival Analysis. The Functional Delta Method

STAT 526 Spring Final Exam. Thursday May 5, 2011

STAT 331. Martingale Central Limit Theorem and Related Results

Least Absolute Deviations Estimation for the Accelerated Failure Time Model. University of Iowa. *

Survival Analysis. Lu Tian and Richard Olshen Stanford University

Analysis of Time-to-Event Data: Chapter 6 - Regression diagnostics

Lecture 5 Models and methods for recurrent event data

Survival Analysis for Case-Cohort Studies

The Weibull Distribution

Analysis of Time-to-Event Data: Chapter 4 - Parametric regression models

Comparison of Hazard, Odds and Risk Ratio in the Two-Sample Survival Problem

Exercise 5.4 Solution

Part IV Extensions: Competing Risks Endpoints and Non-Parametric AUC(t) Estimation

Statistics in medicine

Kernel density estimation in R

Support Vector Hazard Regression (SVHR) for Predicting Survival Outcomes. Donglin Zeng, Department of Biostatistics, University of North Carolina

THESIS for the degree of MASTER OF SCIENCE. Modelling and Data Analysis

ST495: Survival Analysis: Maximum likelihood

11 Survival Analysis and Empirical Likelihood

pensim Package Example (Version 1.2.9)

On a connection between the Bradley Terry model and the Cox proportional hazards model

Transcription:

Survival analysis in R Niels Richard Hansen This note describes a few elementary aspects of practical analysis of survival data in R. For further information we refer to the book Introductory Statistics with R by Peter Dalgaard and Dynamic Regression Models for Survival Data by Torben Martinussen and Thomas Scheike and to the R help files. There is also the classical reference Statistical Models Based on Counting Processes by Andersen, Borgan, Gill and Keiding. The analyses that we will do can all be done using the functions in library survival. That library also contains some example data sets. > library(survival) > data(pbc) > pbc[1:10, c("time", "status", "age", "edema", "alb", "bili", + "protime")] time status age edema alb bili protime 1 400 1 58.7652 1 2.60 14.5 12.2 2 4500 0 56.4463 0 4.14 1.1 10.6 3 1012 1 70.0726 1 3.48 1.4 12.0 4 1925 1 54.7406 1 2.54 1.8 10.3 5 1504 0 38.1054 0 3.53 3.4 10.9 6 2503 1 66.2587 0 3.98 0.8 11.0 7 1832 0 55.5346 0 4.09 1.0 9.7 8 2466 1 53.0568 0 4.00 0.3 11.0 9 2400 1 42.5079 0 3.08 3.2 11.0 10 51 1 70.5599 1 2.74 12.6 11.5 The data set contains the main variable time, which is the observed, potentially censored survival time and the status, which indicates if the observation has been censored. In addition there are 17 observed covariates, where we show above five that in previous analyses have been found to be important. We can produce a plot of the survival times with : > nr <- 50 > plot(c(0, pbc$time[1]), c(1, 1), type = "l", ylim = c(0, nr + + 1), xlim = c(0, max(pbc$time[1:nr]) + 10), ylab = "nr", xlab = "Survival tim > for (i in 2:nr) lines(c(0, pbc$time[i]), c(i, i)) > for (i in 1:nr) { 1

+ if (pbc$status[i] == 0) + points(pbc$time[i], i, col = "red", pch = 20) + if (pbc$status[i] == 1) + points(pbc$time[i], i) + } nr 0 10 20 30 40 50 0 1000 2000 3000 4000 Survival time Estimation of the survival function using the Kaplan-Meier estimator can be done using the survfit function. > pbc.surv <- survfit(surv(pbc$time, pbc$status)) > pbc.surv Call: survfit(formula = Surv(pbc$time, pbc$status)) n events median 0.95LCL 0.95UCL 418 161 3395 3090 3853 > plot(pbc.surv) 2

0.0 0.2 0.4 0.6 0.8 1.0 0 1000 2000 3000 4000 The result from survfit is an R object of type survfit. Taking summary of a survfit object provides the non-parametric estimate of the survival function including additional information at a long list. The plot is perhaps more informative with the added, pointwise confidence bands. The function Surv applied to the time and status variables for the PBC data is a function that create a survival object. A major point in a data set like the PBC data above is that we record covariate information on the individuals. It is no problem to split the estimation of the survival function according to a factor such as the binary edema variable in the PBC data. > pbc.surv <- survfit(surv(pbc$time, pbc$status) ~ pbc$edema) > plot(pbc.surv, conf.int = TRUE, col = c("red", "blue")) 3

0.0 0.2 0.4 0.6 0.8 1.0 0 1000 2000 3000 4000 From the plot it seems clear that individuals without edema has a larger chance of a large survival time than those with edema. This can also be testet formally with the so-called log-rank test. > survdiff(surv(pbc$time, pbc$status) ~ pbc$edema) Call: survdiff(formula = Surv(pbc$time, pbc$status) ~ pbc$edema) N Observed Expected (O-E)^2/E (O-E)^2/V pbc$edema=0 368 121 151.0 5.94 96 pbc$edema=1 50 40 10.0 89.28 96 Chisq= 96 on 1 degrees of freedom, p= 0 But what about including the other covariates also especially the continuous ones? This can be handled in the framework of the Cox proportional hazards model. > pbc.surv <- coxph(surv(pbc$time, pbc$status) ~ pbc$edema) > summary(pbc.surv) 4

Call: coxph(formula = Surv(pbc$time, pbc$status) ~ pbc$edema) n= 418 coef exp(coef) se(coef) z p pbc$edema 1.63 5.09 0.185 8.82 0 exp(coef) exp(-coef) lower.95 upper.95 pbc$edema 5.09 0.197 3.54 7.3 Rsquare= 0.129 (max possible= 0.985 ) Likelihood ratio test= 57.7 on 1 df, p=3e-14 Wald test = 77.7 on 1 df, p=0 Score (logrank) test = 96 on 1 df, p=0 We can tell from the summary that the β corresponding to the covariate edema is significantly different from 0. We also see that the hazard rate for those with edema=1 is estimated as roughly 5 times the base-line hazard rate α 0 (t). We can include more covariates. > pbc.surv <- coxph(surv(time, status) ~ edema + age + alb + bili + + protime, data = pbc) > summary(pbc.surv) Call: coxph(formula = Surv(time, status) ~ edema + age + alb + bili + protime, data = pbc) n= 418 coef exp(coef) se(coef) z p edema 0.7222 2.059 0.20764 3.48 5.1e-04 age 0.0365 1.037 0.00811 4.50 6.8e-06 alb -0.9266 0.396 0.20898-4.43 9.2e-06 bili 0.1246 1.133 0.01271 9.80 0.0e+00 protime 0.1928 1.213 0.05770 3.34 8.3e-04 exp(coef) exp(-coef) lower.95 upper.95 edema 2.059 0.486 1.371 3.093 age 1.037 0.964 1.021 1.054 alb 0.396 2.526 0.263 0.596 bili 1.133 0.883 1.105 1.161 protime 1.213 0.825 1.083 1.358 5

Rsquare= 0.36 (max possible= 0.985 ) Likelihood ratio test= 186 on 5 df, p=0 Wald test = 218 on 5 df, p=0 Score (logrank) test = 309 on 5 df, p=0 It is of course also possible to transform the covariates before using them in the regression. For example; > pbc.surv <- coxph(surv(time, status) ~ edema + age + log(bili) + + log(alb) + log(protime), data = pbc) > summary(pbc.surv) Call: coxph(formula = Surv(time, status) ~ edema + age + log(bili) + log(alb) + log(protime), data = pbc) n= 418 coef exp(coef) se(coef) z p edema 0.6613 1.937 0.20595 3.21 1.3e-03 age 0.0382 1.039 0.00768 4.97 6.5e-07 log(bili) 0.8975 2.453 0.08271 10.85 0.0e+00 log(alb) -2.4524 0.086 0.65707-3.73 1.9e-04 log(protime) 2.3458 10.442 0.77425 3.03 2.4e-03 exp(coef) exp(-coef) lower.95 upper.95 edema 1.937 0.5162 1.2938 2.901 age 1.039 0.9625 1.0234 1.055 log(bili) 2.453 0.4076 2.0863 2.885 log(alb) 0.086 11.6167 0.0237 0.312 log(protime) 10.442 0.0958 2.2895 47.624 Rsquare= 0.429 (max possible= 0.985 ) Likelihood ratio test= 234 on 5 df, p=0 Wald test = 233 on 5 df, p=0 Score (logrank) test = 297 on 5 df, p=0 A Theory We consider n individuals, T1,...,T n (survival times). independent, positive random variables 6

Estimation of the distribution by the empirical distribution: ˆF(t) = 1 n n i=1 and the survival function S(t) = 1 F(t) Ŝ(t) = 1 ˆF(t) = 1 n 1(T i t) n i=1 1(T i > t). Censoring by positive, random variables C 1,...,C n : We observe only T i = min{t i, C i} and Thus we observe the pairs i = 1(T i C i ). (T 1, 1 ),..., (T n, n ). Define the counting process of deaths (non censored events) N(t) = i=1 1(T i t, i = 1). Define also the process of individuals at risk n Y (t) = 1(t T i ). i=1 The jumps for the counting process are given by the process N(t) = N(t) N(t ) = N(t) lim ɛ 0+ N(t ɛ) {0, 1}, which is simply an indicator process for each jump (death). The Kaplan-Meier estimator of the survival function, based on censored survival times, is Ŝ(t) = ( 1 N(s) ). s t Y (s) The factors are equal to 1 for all s where N(s) = 0. If we denote by τ i the time for the i th jump, we can also write the estimator out as ( Ŝ(t) = 1 1 ) ( 1 1 ) ( ) 1... 1 Y (τ 1 ) Y (τ 2 ) Y (τ N(t) ) Each factor can be interpreted as an estimator of the conditional probability of dying in the time interval (τ i, τ i+1 ] given survival to time τ i. 7

If F is continuous differentiable with derivative f (the density for the survival distribution then), we can introduce the hazard rate: α(t) = f(t) S(t). We see that α(t) = 1 lim ɛ 0+ ɛ F(t + ɛ) F(t) S(t) 1 = lim ɛ 0+ ɛ P(T i (t, t + ɛ] Ti > t). Thus α(t) is the instantaneous rate of death at time t. The exponential distribution provides an example where α(t) = λ is constant. In general, we can always get back the the survival function from the hazard rate by ( t ) S(t) = exp α(s)ds. 0 This means that we can specify a distribution via its hazard rate as well as its survival function. The Cox proportional hazards rate model is given as follows. We assume that we for each individual has a vector X i = (X 1i,...,X mi ) T of real valued covariates (they may be indicator variables as well). Then we specify the hazard rate for the i th individual as m α i (t, X i ) = α 0 (t) exp β j X ij = α 0 (t) exp ( X T β ) j=1 for an m-dimensional vector β of parameters. The model is a semi-parametric model. The estimation of α 0 (t) has secundary interest whereas the estimation of the β s is of primary interest. We see that α i (t, X i ) α i (t, X i + (0,..., 0, δ, 0,..., 0) T = e δβ j if the δ above is at the j th position in the vector. This gives an interpretation of the parameters as the log-relative risks for a given covariate. The parameters are estimated by solving the estimation equation where E(β, T i ) = n (X i E(β, T i )) i = 0 i=1 i 1(t T i )X i exp(x T i β) i 1(t T i ) exp(x T i β). 8