Survival Prediction Under Dependent Censoring: A Copula-based Approach
Yi-Hau Chen
Institute of Statistical Science, Academia Sinica
2013 AMMS, National Sun Yat-Sen University, December 7, 2013
Joint work with Takeshi Emura (National Central University)
Survival Analysis
survival data on the onset time of some event of interest (disease, death, cancer recurrence, ...) are commonly collected in studies in medicine and many other fields of science
owing to the limited observation period and other factors such as dropout of study subjects, data on the survival time $T$ are usually censored by some censoring time $U$, such as the study termination time or the dropout time
[Figure: follow-up timelines from the time origin. Subject 1: event observed, $\tilde T = T$, $\delta = 1$. Subject 2: censored ($T > U$), $\tilde T = U$, $\delta = 0$.]
the survival data we actually observe take the form $(\tilde T, \delta)$, where $\tilde T = \min(T, U)$ and $\delta = I(T \le U)$
one of the aims of survival analysis is to identify factors that explain and predict the survival time $T$ well, based on the censored survival data $(\tilde T, \delta)$
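As a small illustration of the observed-data structure, the sketch below builds the observed time (the minimum of the survival and censoring times) and the event indicator from latent times. The function name `observe` and the toy numbers are hypothetical, introduced only for this example.

```python
from typing import List, Tuple


def observe(t: List[float], u: List[float]) -> List[Tuple[float, int]]:
    """Return (observed time, event indicator) pairs.

    The observed time is min(T, U); the indicator is 1 when T <= U
    (event observed) and 0 when the subject is censored.
    """
    return [(min(ti, ui), int(ti <= ui)) for ti, ui in zip(t, u)]


# Toy latent survival and censoring times (made-up values).
latent_t = [2.0, 5.0, 3.5]
latent_u = [4.0, 1.5, 3.5]
print(observe(latent_t, latent_u))
```

In real data only the output pairs are available; the latent times are shown here just to make the censoring mechanism visible.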
Censoring Mechanism
conventional survival analysis relies on the independent censoring assumption: the censoring time is independent of the survival time, conditional on the covariates
this assumption may be easily violated in practice
the independent censoring assumption is even more stringent in univariate analysis than in multivariate analysis, since
$T \perp U \mid X_1, X_2 \;\not\Rightarrow\; T \perp U \mid X_1$
[Diagram: $T$ linked to $X_1$ and $U$ linked to $X_2$; dependence between $T$ and $U$ induced by $X_2$.]
even if the independent censoring assumption is satisfied with multivariate X, it may not hold for some univariate covariate X
Effects of Dependent Censoring
estimation of survival rates and regression parameters under a wrongly assumed independent censoring is subject to serious bias (Zheng and Klein 1995 Biometrika; Huang and Zhang 2008 Biometrics; Chen 2010 JRSSB)
adverse effects on variable (gene) selection
Hazard Rates
the net hazard rate (given covariate $X$) for death is defined as
$h(t \mid X) = \Pr(t \le T \le t + dt \mid T \ge t, X)/dt$
the apparent hazard rate in the presence of censoring is
$h^{*}(t \mid X) = \Pr(t \le T \le t + dt,\ T \le U \mid T \ge t,\ U \ge t,\ X)/dt$,
which is the hazard rate we can directly estimate from the censored survival data
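Apparent vs. Net Hazards
when $T \perp U \mid X$, the net hazard $h$ and the apparent hazard $h^{*}$ coincide: $h(\cdot \mid X) = h^{*}(\cdot \mid X)$
$h(\cdot \mid X) \ne h^{*}(\cdot \mid X)$ generally under dependent censoring
dependent censoring generally leads to biased estimation of the net hazard rate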
A Copula Framework of Dependent Censoring
by Sklar's theorem, the joint survival function of $T$ and $U$ (conditional on covariate $X$) can be written as
$\Pr(T > t, U > u \mid X) = C(\Pr(T > t \mid X), \Pr(U > u \mid X))$,
where $C : [0,1]^2 \to [0,1]$ is called a copula
this representation makes explicit the marginal survival functions and the dependence structure between $T$ and $U$
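Some Examples
Clayton copula: $C_\alpha(u,v) = (u^{-\alpha} + v^{-\alpha} - 1)^{-1/\alpha}$, $(u,v) \in [0,1]^2$, $\alpha > 0$
Frank copula: $C_\alpha(u,v) = \log_\alpha\left\{1 + \dfrac{(\alpha^{u} - 1)(\alpha^{v} - 1)}{\alpha - 1}\right\}$, $(u,v) \in [0,1]^2$, $\alpha > 0$, $\alpha \ne 1$
Gumbel copula: $C_\alpha(u,v) = \exp\left[-\{(-\log u)^{\alpha} + (-\log v)^{\alpha}\}^{1/\alpha}\right]$, $(u,v) \in [0,1]^2$, $\alpha > 1$
$\alpha$: dependence parameter (in 1-1 correspondence with Kendall's $\tau$)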
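The three copulas can be coded directly from their formulas. The sketch below is only illustrative: the function names are hypothetical, arguments are assumed to lie in (0, 1], and the Kendall's tau maps use the standard closed forms for the Clayton copula (alpha / (alpha + 2)) and the Gumbel copula (1 - 1/alpha). The Frank copula's tau has no elementary closed form and is omitted.

```python
import math


def clayton(u: float, v: float, alpha: float) -> float:
    """Clayton copula, alpha > 0, for u, v in (0, 1]."""
    return (u ** -alpha + v ** -alpha - 1.0) ** (-1.0 / alpha)


def frank(u: float, v: float, alpha: float) -> float:
    """Frank copula in the log-base-alpha form, alpha > 0, alpha != 1."""
    inner = 1.0 + (alpha ** u - 1.0) * (alpha ** v - 1.0) / (alpha - 1.0)
    return math.log(inner, alpha)


def gumbel(u: float, v: float, alpha: float) -> float:
    """Gumbel copula for u, v in (0, 1]."""
    s = (-math.log(u)) ** alpha + (-math.log(v)) ** alpha
    return math.exp(-(s ** (1.0 / alpha)))


def clayton_tau(alpha: float) -> float:
    """Kendall's tau implied by the Clayton dependence parameter."""
    return alpha / (alpha + 2.0)


def gumbel_tau(alpha: float) -> float:
    """Kendall's tau implied by the Gumbel dependence parameter."""
    return 1.0 - 1.0 / alpha


# Clayton parameters used later in the simulation and their Kendall's tau.
for a in (0.5, 2.0, 8.0):
    print(a, round(clayton_tau(a), 3))
```

Every copula satisfies C(u, 1) = u, which gives a quick sanity check on any implementation.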
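Bias Assessment
under the copula model
$\Pr(T > t, U > u \mid X) = C_\alpha(S_T(t \mid X), S_U(u \mid X))$, with $S_T(t \mid X) = \Pr(T > t \mid X)$ and $S_U(u \mid X) = \Pr(U > u \mid X)$,
the apparent hazard $h^{*}$ and the net hazard $h$ are related by
$h^{*}(t \mid X) = r_\alpha(t \mid X)\, h(t \mid X)$,
$r_\alpha(t \mid X) = \dfrac{C_\alpha^{(1,0)}(S_T(t \mid X), S_U(t \mid X))\, S_T(t \mid X)}{C_\alpha(S_T(t \mid X), S_U(t \mid X))}$, where $C_\alpha^{(1,0)}(u,v) = \partial C_\alpha(u,v)/\partial u$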
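the apparent effect of $X$ on survival is
$\beta^{*}_\alpha(t) = \log\dfrac{h^{*}(t \mid X = 1)}{h^{*}(t \mid X = 0)} = \beta(t) + \log\dfrac{r_\alpha(t \mid X = 1)}{r_\alpha(t \mid X = 0)}$,
where $h^{*}$ is the apparent hazard and $\beta(t) = \log\dfrac{h(t \mid X = 1)}{h(t \mid X = 0)}$ is the net effect of $X$
when $C_\alpha(u,v) = uv$, i.e., $T$ and $U$ are independent, $r_\alpha(\cdot \mid X) = 1$, hence the apparent effect $\beta^{*}$ coincides with the net effect $\beta$
in general, bias arises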
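To make the bias formula concrete, the sketch below evaluates the ratio of apparent to net hazard, and the resulting apparent effect, for a Clayton copula. It assumes exponential proportional-hazards margins with unit baseline rate for both $T$ and $U$; that choice, and all function names, are assumptions made for this illustration rather than the setting behind the figures. For the Clayton copula the ratio simplifies to $\{C_\alpha(u,v)/u\}^{\alpha}$ with $u = S_T$, $v = S_U$.

```python
import math


def clayton(u: float, v: float, alpha: float) -> float:
    return (u ** -alpha + v ** -alpha - 1.0) ** (-1.0 / alpha)


def clayton_du(u: float, v: float, alpha: float) -> float:
    """Partial derivative of the Clayton copula with respect to u."""
    s = u ** -alpha + v ** -alpha - 1.0
    return u ** (-alpha - 1.0) * s ** (-1.0 / alpha - 1.0)


def ratio(t: float, x: float, beta: float, gamma: float, alpha: float) -> float:
    """Apparent hazard divided by net hazard at time t for covariate x.

    Assumed margins: S_T = exp(-t e^{beta x}), S_U = exp(-t e^{gamma x}).
    """
    s_t = math.exp(-t * math.exp(beta * x))
    s_u = math.exp(-t * math.exp(gamma * x))
    return clayton_du(s_t, s_u, alpha) * s_t / clayton(s_t, s_u, alpha)


def apparent_effect(t: float, beta: float, gamma: float, alpha: float) -> float:
    """Net effect beta plus the log ratio of r at X = 1 versus X = 0."""
    r1 = ratio(t, 1.0, beta, gamma, alpha)
    r0 = ratio(t, 0.0, beta, gamma, alpha)
    return beta + math.log(r1 / r0)


# Net effect 1; the apparent effect drifts away as dependence grows.
for a in (1e-6, 0.5, 2.0, 4.0):
    print(a, round(apparent_effect(0.5, 1.0, 1.0, a), 3))
```

With a dependence parameter near zero the Clayton copula approaches independence, so the apparent effect stays close to the net effect.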
[Figure: Clayton copula. Two panels, each titled "Net effect = 1", plotting the apparent effect (vertical axis) against the association parameter $\alpha$ (horizontal axis, 0 to 4), with one curve each for not censored, 40% censored, 50% censored, and 60% censored.]
[Figure: Frank copula. Two panels, each titled "Net effect = 1", plotting the apparent effect (vertical axis) against the association parameter $\alpha$ (horizontal axis, 0 to 15), with one curve each for not censored, 40% censored, 50% censored, and 60% censored.]
Survival Prediction with High Dimensional Data
a recent focus of research, owing to the abundance of high-throughput genomic/genetic data
such detailed, personalized data provide potentially useful predictors for survival analysis
the hope is to achieve more accurate prognosis and develop personalized treatment strategies
a challenge in statistical analysis due to the $p \gg n$ nature of the data
Examples
van de Vijver et al. (2002 New England Journal of Medicine) used expression profiles of 24,885 genes from 295 breast cancer patients to identify patients who would benefit from adjuvant therapy, leading to a new criterion that reduces patients' risk compared with traditional guidelines based only on histological and clinical characteristics
Chen et al. (2007 New England Journal of Medicine) examined expression profiles of 672 genes from 125 non-small-cell lung cancer patients to identify a gene signature closely associated with survival outcome in patients with non-small-cell lung cancer
Compound-Covariate (CC) Method for Prediction (Tukey, 1993 Controlled Clinical Trials)
genes are pre-selected or screened, one by one, by univariate (Cox) regression analysis
the risk score is formed as a linear combination of the pre-selected genes, with the weight of each gene given by its univariate regression coefficient estimate
an easy way to tackle high-dimensional covariates
has been widely adopted in real applications (Beer et al., 2002 Nature Medicine; Chen et al., 2007 New England Journal of Medicine; Matsui et al. 2012 Clinical Cancer Research)
Comparative Performances of Existing Methods
comparative studies by Wessels et al. (2002 Bioinformatics), Lai et al. (2006 BMC Bioinformatics), Lecocke et al. (2006 Cancer Informatics), and Sun and Li (2012 Biometrics) concluded that CC often yields consistently better results on microarray datasets than more sophisticated multivariate approaches (PCA, Ridge, Lasso, ...)
Theoretical Justifications: Shrinkage Method (Emura et al. 2012 PLoS ONE)
under independent censoring and mutually independent covariates, we can show that the univariate Cox regression estimator
(i) has limiting value 0 when the true coefficient is 0
(ii) has a limiting value lying between 0 and the true coefficient value when the true coefficient is not 0
univariate estimates thus shrink the multivariate parameters towards zero, which avoids over-fitting
Theoretical Justifications: Model Averaging (Buckland et al. 1997 Biometrics)
each constituent model is a simple univariate model
each model is given an equal weight
Gene Selection Accommodating Dependent Censoring
a common copula model for each gene, with a single dependence parameter $\alpha$:
$\Pr(T > t, U > u \mid X_j) = C_\alpha(S_T(t \mid X_j), S_U(u \mid X_j))$
proportional hazards models for the marginal survival functions of $T$ and $U$ given each covariate ($\Lambda_{0j}$, $\Gamma_{0j}$: baseline cumulative hazards):
$S_T(t \mid X_j) = \exp\{-\Lambda_{0j}(t)\exp(\beta_j X_j)\}$
$S_U(u \mid X_j) = \exp\{-\Gamma_{0j}(u)\exp(\gamma_j X_j)\}$
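Parameter Estimation
semiparametric maximum likelihood estimation (Chen 2010 JRSSB): for fixed $\alpha$, maximize with respect to $\Omega_j = (\beta_j, \gamma_j, \Lambda_{0j}, \Gamma_{0j})$ the log likelihood $\sum_i \ell_i(\Omega_j)$, where
$\ell_i(\Omega_j) = \delta_i\left[\beta_j X_{ij} + \log\eta_{1ij}(\tilde T_i; \Omega_j) + \log d\Lambda_{0j}(\tilde T_i)\right] + (1 - \delta_i)\left[\gamma_j X_{ij} + \log\eta_{2ij}(\tilde T_i; \Omega_j) + \log d\Gamma_{0j}(\tilde T_i)\right] - \Phi_\alpha\left\{\exp\left(-\Lambda_{0j}(\tilde T_i)e^{\beta_j X_{ij}}\right), \exp\left(-\Gamma_{0j}(\tilde T_i)e^{\gamma_j X_{ij}}\right)\right\}$
$\eta_{1ij}(t; \Omega_j) = -\Phi_\alpha^{(1,0)}\left\{\exp\left(-\Lambda_{0j}(t)e^{\beta_j X_{ij}}\right), \exp\left(-\Gamma_{0j}(t)e^{\gamma_j X_{ij}}\right)\right\}\exp\left(-\Lambda_{0j}(t)e^{\beta_j X_{ij}}\right)$
$\eta_{2ij}(t; \Omega_j) = -\Phi_\alpha^{(0,1)}\left\{\exp\left(-\Lambda_{0j}(t)e^{\beta_j X_{ij}}\right), \exp\left(-\Gamma_{0j}(t)e^{\gamma_j X_{ij}}\right)\right\}\exp\left(-\Gamma_{0j}(t)e^{\gamma_j X_{ij}}\right)$
$\Phi_\alpha = -\log C_\alpha$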
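For intuition, a single subject's log-likelihood contribution can be written out for the Clayton copula, where $\Phi_\alpha = -\log C_\alpha$ and the two $\eta$ terms have closed forms. The sketch below is a hedged illustration only: the function names, the argument layout (cumulative baseline hazards and their jumps at the observed time passed in as plain numbers), and the restriction to the Clayton family are assumptions of this example, not the estimation code of Chen (2010).

```python
import math


def clayton_terms(u: float, v: float, alpha: float):
    """Return (Phi, eta1, eta2) for the Clayton copula at (u, v).

    Phi = -log C; eta1 = -dPhi/du * u; eta2 = -dPhi/dv * v.
    """
    s = u ** -alpha + v ** -alpha - 1.0
    phi = math.log(s) / alpha
    eta1 = u ** -alpha / s
    eta2 = v ** -alpha / s
    return phi, eta1, eta2


def loglik_contribution(delta: int, x: float, beta: float, gamma: float,
                        cum_haz_t: float, cum_haz_u: float,
                        jump_t: float, jump_u: float, alpha: float) -> float:
    """Log-likelihood contribution of one subject at the observed time.

    cum_haz_t, cum_haz_u: baseline cumulative hazards at the observed time.
    jump_t, jump_u: baseline hazard jumps at the observed time.
    """
    u = math.exp(-cum_haz_t * math.exp(beta * x))   # S_T at the observed time
    v = math.exp(-cum_haz_u * math.exp(gamma * x))  # S_U at the observed time
    phi, eta1, eta2 = clayton_terms(u, v, alpha)
    if delta == 1:
        return beta * x + math.log(eta1) + math.log(jump_t) - phi
    return gamma * x + math.log(eta2) + math.log(jump_u) - phi


print(loglik_contribution(1, 1.0, 0.5, 0.3, 0.4, 0.6, 0.05, 0.05, 2.0))
```

For an observed event, this contribution equals the log of $C_\alpha^{(1,0)}(S_T, S_U)\, S_T\, e^{\beta_j X_{ij}}\, d\Lambda_{0j}$, which is a convenient check on the signs.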
Prognosis Index
$\hat\beta_j(\alpha)$: the SMLE of $\beta_j$ for fixed $\alpha$
the standard error of $\hat\beta_j(\alpha)$ is obtained from the inverse of the observed information matrix (Chen 2010 JRSSB)
gene selection is based on the significance test for $\beta_j$
risk score, or prognosis index (PI), for survival prediction for subject $i$:
$PI_i = \sum_{j=1}^{K} \hat\beta_j(\alpha) X_{ij}$,
where $K$ is the number of genes selected for prediction
when $\alpha$ is chosen as the value leading to the independence copula, $C_\alpha(u,v) = uv$, the copula method reduces to the traditional compound covariate method
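A minimal sketch of the prognosis index: given per-gene coefficient estimates and p values from the univariate fits (at a fixed dependence parameter), keep the K genes with the smallest p values and form the weighted sum. The function name, the argument names, and the "K smallest p values" selection rule are assumptions for illustration; the toy numbers are made up.

```python
from typing import List, Tuple


def prognosis_index(x: List[List[float]], beta_hat: List[float],
                    p_values: List[float], k: int) -> Tuple[List[float], List[int]]:
    """Return PI for each subject and the indices of the K selected genes.

    x[i][j] is the expression of gene j for subject i.
    """
    # Rank genes by p value and keep the K most significant ones.
    selected = sorted(range(len(p_values)), key=lambda j: p_values[j])[:k]
    pi = [sum(beta_hat[j] * row[j] for j in selected) for row in x]
    return pi, selected


# Toy example: 2 subjects, 3 genes, K = 2.
x = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
beta_hat = [0.5, -1.0, 2.0]
p_values = [0.01, 0.50, 0.02]
print(prognosis_index(x, beta_hat, p_values, 2))
```

Subjects can then be split into good and poor prognosis groups by thresholding the PI, for example at its median.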
Determination of the Value of $\alpha$
due to the non-identifiability of $\alpha$ from censored survival data (Tsiatis 1975 PNAS), the likelihood may provide little information on $\alpha$
a practical approach is to choose the $\alpha$ that maximizes predictive power
we measure predictive power by Harrell's concordance measure (c-index) (Harrell et al. 1996 Statistics in Medicine) combined with M-fold cross-validation
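M-fold Cross-validated c-index
divide the whole sample into $M$ subsamples of about equal size
each time, remove one subsample and use the remaining subsamples to estimate the parameters with a chosen $\alpha$
obtain $PI(\alpha)$ for the subjects in the removed subsample, and calculate the c-index
$c_m(\alpha) = \dfrac{\sum_{(i,k)} \left[\delta_i I(\tilde T_i < \tilde T_k) I\{PI_i(\alpha) > PI_k(\alpha)\} + \delta_k I(\tilde T_k < \tilde T_i) I\{PI_k(\alpha) > PI_i(\alpha)\}\right]}{\sum_{(i,k)} \left[\delta_i I(\tilde T_i < \tilde T_k) + \delta_k I(\tilde T_k < \tilde T_i)\right]}$,
where the sums run over pairs $(i,k)$ of subjects in the removed subsample $m$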
sum the c-index values over the $M$ subsamples to obtain $CV(\alpha) = \sum_{m=1}^{M} c_m(\alpha)$
the value of $\alpha$ maximizing $CV(\alpha)$ is chosen
$M = 5$ is sufficient for good performance in our experience
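The cross-validated c-index can be sketched as follows. The c-index function follows the pairwise formula; the cross-validation wrapper takes a user-supplied `fit` callable that returns coefficient estimates for a given dependence parameter. That callable, the fold assignment by index modulo M, and all names are hypothetical stand-ins for the actual estimation step. The sketch assumes every fold contains at least one comparable pair.

```python
from typing import Callable, List, Sequence


def c_index(times: Sequence[float], deltas: Sequence[int],
            pi: Sequence[float]) -> float:
    """Harrell-type concordance between PI and observed survival."""
    num = den = 0
    n = len(times)
    for i in range(n):
        for k in range(i + 1, n):
            if deltas[i] and times[i] < times[k]:
                den += 1
                num += pi[i] > pi[k]
            if deltas[k] and times[k] < times[i]:
                den += 1
                num += pi[k] > pi[i]
    if den == 0:
        raise ValueError("no comparable pairs")
    return num / den


def cv_c_index(x: List[List[float]], times: List[float], deltas: List[int],
               alpha: float, fit: Callable, m: int = 5) -> float:
    """Sum of fold-wise c-indices for a given dependence parameter."""
    n = len(times)
    total = 0.0
    for fold in range(m):
        test = [i for i in range(n) if i % m == fold]
        train = [i for i in range(n) if i % m != fold]
        # fit is assumed to return one coefficient per covariate.
        betas = fit([x[i] for i in train], [times[i] for i in train],
                    [deltas[i] for i in train], alpha)
        pi = [sum(b * xj for b, xj in zip(betas, x[i])) for i in test]
        total += c_index([times[i] for i in test],
                         [deltas[i] for i in test], pi)
    return total


def select_alpha(grid, x, times, deltas, fit, m: int = 5) -> float:
    """Pick the grid value with the largest cross-validated c-index."""
    return max(grid, key=lambda a: cv_c_index(x, times, deltas, a, fit, m))
```

A grid search over candidate values then reduces to one call to `select_alpha` with a fitting routine of the reader's choice.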
Simulation
$n = 100$, $p = \dim(X) = 100$
$(T, U)$ follows $\Pr(T > t, U > u \mid X) = \left\{e^{t\alpha\exp(\beta'X)} + e^{u\alpha\exp(\gamma'X)} - 1\right\}^{-1/\alpha}$
$\alpha = 0.5, 2, 8$ (Kendall's $\tau = 0.2, 0.5, 0.8$, respectively)
$\gamma = \beta$; 50% censoring
the first $q = 5, 10, 20$ genes have non-zero coefficients (informative genes); the remaining $p - q$ genes have zero coefficients (noninformative genes)
the analysis is based on the Clayton copula and the 5-fold cross-validated c-index
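One way to generate survival and censoring times with Clayton dependence and exponential margins is conditional inversion: draw the first survival probability uniformly, draw the second from its conditional distribution given the first, then invert the exponential survival functions. The sketch below assumes unit baseline rates and takes the linear predictors as plain numbers; the function names and the seed are hypothetical, and the covariate structure of the actual simulation is not reproduced.

```python
import math
import random


def simulate_pair(alpha: float, lin_pred_t: float, lin_pred_u: float,
                  rng: random.Random):
    """Draw (T, U) whose survival copula is Clayton with parameter alpha."""
    v1 = 1.0 - rng.random()  # uniform on (0, 1]
    w = 1.0 - rng.random()   # uniform on (0, 1]
    # Conditional inversion of the Clayton copula given v1.
    v2 = (v1 ** -alpha * (w ** (-alpha / (1.0 + alpha)) - 1.0) + 1.0) ** (-1.0 / alpha)
    # Invert exponential survival functions exp(-t * exp(linear predictor)).
    t = -math.log(v1) / math.exp(lin_pred_t)
    u = -math.log(v2) / math.exp(lin_pred_u)
    return t, u


def kendall_tau(pairs) -> float:
    """Sample Kendall's tau from concordant and discordant pairs."""
    conc = disc = 0
    for a in range(len(pairs)):
        for b in range(a + 1, len(pairs)):
            s = (pairs[a][0] - pairs[b][0]) * (pairs[a][1] - pairs[b][1])
            if s > 0:
                conc += 1
            elif s < 0:
                disc += 1
    return (conc - disc) / (conc + disc)


rng = random.Random(2013)
pairs = [simulate_pair(2.0, 0.0, 0.0, rng) for _ in range(2000)]
censored = sum(1 for t, u in pairs if t > u) / len(pairs)
print(round(censored, 2), round(kendall_tau(pairs[:300]), 2))
```

With equal linear predictors for the two margins, the censoring proportion should sit near 50% and the sample Kendall's tau near alpha / (alpha + 2).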
Predictor Structure
scenario 1 (tag genes): each informative gene is positively correlated with some non-informative genes. There are several sets of correlated genes. In each set, only one tag gene is associated with survival, while the other genes are not associated with survival given the tag gene
[Diagram: genes $X_1, \ldots, X_q$ and survival time $T$ under the tag gene structure.]
scenario 2 (gene pathway): the informative genes are positively correlated with one another. There is a set of correlated genes jointly associated with survival
[Diagram: genes $X_1, \ldots, X_q$ and survival time $T$ under the gene pathway structure.]
Evaluation Criteria for Gene Selection
sensitivity $= \dfrac{\sum_{j=1}^{p} I(P_j \le P_{(q)},\ \beta_j \ne 0)}{\sum_{j=1}^{p} I(\beta_j \ne 0)} \times 100\%$
specificity $= \dfrac{\sum_{j=1}^{p} I(P_j > P_{(q)},\ \beta_j = 0)}{\sum_{j=1}^{p} I(\beta_j = 0)} \times 100\%$
$P_j$: p value of the Wald test for $H_0: \beta_j = 0$; $P_{(j)}$: the $j$th smallest value among $\{P_1, \ldots, P_p\}$
higher sensitivity (specificity) means better ability to identify informative (noninformative) genes
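The two criteria translate directly into code. In the sketch below, a gene is called selected when its p value is at most the q-th smallest p value; the function name and the toy inputs are hypothetical.

```python
from typing import List, Tuple


def sensitivity_specificity(p_values: List[float], beta: List[float],
                            q: int) -> Tuple[float, float]:
    """Return (sensitivity, specificity) in percent.

    beta holds the true coefficients; q is the number of genes selected.
    """
    threshold = sorted(p_values)[q - 1]  # q-th smallest p value
    informative = [b != 0 for b in beta]
    true_pos = sum(1 for p, inf in zip(p_values, informative)
                   if inf and p <= threshold)
    true_neg = sum(1 for p, inf in zip(p_values, informative)
                   if not inf and p > threshold)
    n_inf = sum(informative)
    n_non = len(beta) - n_inf
    return 100.0 * true_pos / n_inf, 100.0 * true_neg / n_non


# Toy example: genes 1-2 informative, genes 3-4 noninformative, q = 2.
print(sensitivity_specificity([0.01, 0.20, 0.03, 0.50], [1.0, 1.0, 0.0, 0.0], 2))
```

In a simulation study these percentages would be averaged over replications.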
Simulation Results (tag gene structure: q = 5, p = 100)
$\beta = (\underbrace{0.8, \ldots, 0.8}_{5}, \underbrace{0, \ldots, 0}_{95})$

method         Kendall's τ   sensitivity (%)   specificity (%)
independence   0.2           50.4              97.4
independence   0.5           47.6              97.2
independence   0.8           48.4              97.3
copula         0.2           60.4              97.9
copula         0.5           60.8              97.9
copula         0.8           57.6              97.8

the copula method improves sensitivity by 9 to 12%
the values of $\hat\alpha$ identified by CV are 4.0 to 4.6
Simulation Results (tag gene structure: q = 10, p = 100)
$\beta = (\underbrace{0.4, \ldots, 0.4}_{10}, \underbrace{0, \ldots, 0}_{90})$

method         Kendall's τ   sensitivity (%)   specificity (%)
independence   0.2           32.8              92.5
independence   0.5           32.8              92.5
independence   0.8           33.6              92.6
copula         0.2           42.6              93.6
copula         0.5           42.8              93.6
copula         0.8           44.6              93.8

the copula method improves sensitivity by 10 to 11%
the values of $\hat\alpha$ identified by CV are 4.5 to 5.2
Simulation Results (tag gene structure: q = 20, p = 100)
$\beta = (\underbrace{0.2, \ldots, 0.2}_{10}, \underbrace{-0.2, \ldots, -0.2}_{10}, \underbrace{0, \ldots, 0}_{80})$

method         Kendall's τ   sensitivity (%)   specificity (%)
independence   0.2           31.0              82.7
independence   0.5           30.6              82.6
independence   0.8           31.7              82.7
copula         0.2           36.1              84.0
copula         0.5           35.3              83.8
copula         0.8           37.6              84.4

the copula method improves sensitivity by 5 to 6%
the values of $\hat\alpha$ identified by CV are 3.9 to 4.1
Simulation Results (pathway structure: q = 5, p = 100)
$\beta = (\underbrace{0.4, \ldots, 0.4}_{5}, \underbrace{0, \ldots, 0}_{95})$

method         Kendall's τ   sensitivity (%)   specificity (%)
independence   0.2           96.8              99.8
independence   0.5           98.4              99.9
independence   0.8           97.2              99.8
copula         0.2           100               100
copula         0.5           99.6              99.9
copula         0.8           99.2              99.9

both methods perform equally well
the values of $\hat\alpha$ identified by CV are 4.8 to 5.3
Simulation Results (pathway structure: q = 10, p = 100)
$\beta = (\underbrace{0.2, \ldots, 0.2}_{5}, \underbrace{-0.2, \ldots, -0.2}_{5}, \underbrace{0, \ldots, 0}_{90})$

method         Kendall's τ   sensitivity (%)   specificity (%)
independence   0.2           66.2              96.2
independence   0.5           64.4              96.0
independence   0.8           66.6              96.3
copula         0.2           83.4              98.2
copula         0.5           81.0              97.9
copula         0.8           82.2              98.0

the copula method improves sensitivity by 16 to 17%
the values of $\hat\alpha$ identified by CV are 4.1 to 4.7
Simulation Results (pathway structure: q = 20, p = 100)
$\beta = (\underbrace{0.1, \ldots, 0.1}_{10}, \underbrace{-0.1, \ldots, -0.1}_{10}, \underbrace{0, \ldots, 0}_{80})$

method         Kendall's τ   sensitivity (%)   specificity (%)
independence   0.2           72.5              93.1
independence   0.5           71.5              92.9
independence   0.8           73.6              93.4
copula         0.2           85.0              96.2
copula         0.5           83.8              96.0
copula         0.8           85.9              96.5

the copula method improves sensitivity by 12%
the values of $\hat\alpha$ identified by CV are 4.4 to 4.9
Summary of Simulation Results
univariate gene selection based on the copula framework improves the ability to identify informative genes, compared with the conventional selection procedure under independent censoring
the improvement is more pronounced in Scenario 2 (correlated informative genes) than in Scenario 1 (independent informative genes)
similar conclusions hold for $p = 500$
similar conclusions hold for the analysis based on the Frank copula
Non-Small-Cell Lung Cancer Data (Chen et al. 2007 NEJM)
the data contain expression values of 672 genes for $n = 125$ patients (38 died and the others were censored; 70% censoring)
the patients are divided into training and test datasets of sizes 63 and 62, in the same way as in Chen et al.
$p = 485$ genes with coefficient of variation (CV) > 3% are included in the analysis
Chen et al. reported the 16 genes most predictive of survival in NSCLC patients
Results: Gene Selection
top 16 genes selected by Chen et al. (independent censoring) and by the copula method (6 genes appear in both lists)

       Independence                  Copula
No.    Gene      β       p value     Gene      β       p value
1      ANXA5     -1.09   0.0039      ZNF264    0.51    0.0004
2      DLG2      1.32    0.0041      MMP16     0.50    0.0005
3      ZNF264    0.55    0.0079      HGF       0.50    0.0010
4      DUSP6     0.75    0.0086      HCK       -0.49   0.0012
5      CPEB4     0.59    0.0162      NF1       0.47    0.0016
6      LCK       -0.84   0.0171      ERBB3     0.46    0.0016
7      STAT1     -0.58   0.0198      NR2F6     0.57    0.0030
8      RNF4      0.65    0.0220      AXL       0.77    0.0035
9      IRF4      0.52    0.0299      CDC23     0.51    0.0050
10     STAT2     0.58    0.0311      DLG2      0.92    0.0055
11     HGF       0.51    0.0334      IGF2      -0.34   0.0081
12     ERBB3     0.55    0.0335      RBBP6     0.54    0.0082
13     NF1       0.47    0.0380      COX11     0.51    0.0118
14     FRAP1     -0.77   0.0408      DUSP6     0.40    0.0121
15     MMD       0.92    0.0419      ENG       -0.37   0.0139
16     HMMR      0.52    0.0481      CKMT1A    -0.41   0.0155
Results: Survival Prediction
Kaplan-Meier estimates of the survival functions for the good and poor prognosis groups in the test data, classified by PI values (cutpoint = median)
a smaller p value for the test comparing the two survival functions means better survival prediction
based on a grid search over $K \in \{10, 20, \ldots, 100\}$ (the number of genes selected for prediction), both the compound covariate and copula methods attain the minimum p value at $K = 80$
Results: Kaplan-Meier Plots of Good vs. Poor Prognosis Groups based on 80 genes
[Figure: two Kaplan-Meier panels, survival probability against months. Univariate Cox: P value = 0.129. Proposed method: P value = 0.018.]
Results: Kaplan-Meier Plots of Good vs. Poor Prognosis Groups based on 16 genes
[Figure: two Kaplan-Meier panels, survival probability against months. Univariate Cox: P value = 0.146. Proposed method: P value = 0.112.]
Summary
univariate gene selection is a convenient and effective tool when $p \gg n$
the issue of dependent censoring is more prominent in such univariate analysis
dependent censoring may lead to substantial bias in regression coefficient estimation; the bias can be assessed analytically under a copula framework
a univariate gene selection procedure accommodating dependent censoring can be carried out under a copula framework, together with a predictive performance measure for identifying the level of dependence between survival and censoring times
such a method has greater power to identify informative genes and achieves better survival prediction performance, compared with conventional methods based on the independent censoring assumption
Thank You!!