Survival Prediction Under Dependent Censoring: A Copula-based Approach

Yi-Hau Chen, Institute of Statistical Science, Academia Sinica

2013 AMMS, National Sun Yat-Sen University, December 7, 2013

Joint work with Takeshi Emura (National Central University)

Survival Analysis

Survival data on the onset time of some event of interest (disease, death, cancer recurrence, ...) are commonly collected in studies in medicine and many other fields of science.

Owing to the limited observation period and other factors such as dropout of study subjects, the survival time $T$ is usually censored by some censoring time $U$, such as the study termination time or the dropout time.

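[Figure: follow-up timelines from the time origin for two subjects. Subject 1: event observed, $\tilde T = T$, $\delta = 1$. Subject 2: censored ($T > U$), $\tilde T = U$, $\delta = 0$.]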

The survival data we actually observe take the form $(\tilde T, \delta)$, where
\[ \tilde T = \min(T, U), \qquad \delta = I(T \le U). \]

One of the aims of survival analysis is to identify factors that explain and predict the survival time $T$ well, based on the censored survival data $(\tilde T, \delta)$.
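The construction of the observed data can be illustrated with a minimal sketch. The function name `observe` is hypothetical and only mirrors the definitions of the observed time (the minimum of T and U) and the event indicator δ.

```python
def observe(t, u):
    """Return (observed time, event indicator) for a latent survival
    time t and a latent censoring time u.

    The observed time is min(t, u); the indicator is 1 when the event
    is seen (t <= u) and 0 when the subject is censored.
    """
    return (min(t, u), 1 if t <= u else 0)


# Subject 1: event at 2.0 before censoring at 5.0 -> event observed.
# Subject 2: censoring at 2.0 before the event at 5.0 -> censored.
subject_1 = observe(2.0, 5.0)
subject_2 = observe(5.0, 2.0)
```

Only the pair returned by `observe` is available to the analyst; the latent times themselves are never both seen.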

Censoring Mechanism

Conventional survival analysis relies on the independent censoring assumption: the censoring time is independent of the survival time, conditional on the covariates.

This assumption may be easily violated in practice.

The independent censoring assumption is even more stringent in univariate analysis than in multivariate analysis, since
\[ T \perp U \mid X_1, X_2 \;\not\Rightarrow\; T \perp U \mid X_1. \]

[Diagram: $T$ is linked to $X_1$; $T$ and $U$ are both linked to $X_2$, labeled "dependence induced by $X_2$".]

Even if the independent censoring assumption is satisfied given the multivariate covariate $X$, it may not hold given only a single (univariate) covariate.

Effects of Dependent Censoring

Estimation of survival rates and regression parameters under a wrongly assumed independent censoring mechanism is subject to serious bias (Zheng and Klein 1995 Biometrika; Huang and Zhang 2008 Biometrics; Chen 2010 JRSSB).

Dependent censoring also has adverse effects on variable (gene) selection.

Hazard Rates

The net hazard rate (given covariate $X$) for death is defined as
\[ h(t \mid X) = \Pr(t \le T \le t + dt \mid T \ge t, X)/dt. \]

The apparent hazard rate in the presence of censoring is
\[ h^{*}(t \mid X) = \Pr(t \le T \le t + dt,\; T \le U \mid T \ge t,\; U \ge t,\; X)/dt, \]
which is the hazard rate we can directly estimate from the censored survival data.

Apparent vs. Net Hazards

When $T \perp U \mid X$, $h(\cdot \mid X) = h^{*}(\cdot \mid X)$.

Under dependent censoring, $h(\cdot \mid X) \ne h^{*}(\cdot \mid X)$ in general.

Dependent censoring therefore generally leads to biased estimation of the net hazard rate.

A Copula Framework of Dependent Censoring

By Sklar's theorem, the joint survival function of $T$ and $U$ (conditional on covariate $X$) can be written as
\[ \Pr(T > t, U > u \mid X) = C\{\Pr(T > t \mid X), \Pr(U > u \mid X)\}, \]
where $C : [0,1]^2 \to [0,1]$ is called a copula.

This representation explicitly separates the marginal survival functions from the dependence structure between $T$ and $U$.

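Some Examples

Clayton copula:
\[ C_\alpha(u, v) = \left(u^{-\alpha} + v^{-\alpha} - 1\right)^{-1/\alpha}, \quad (u, v) \in [0,1]^2,\ \alpha > 0 \]

Frank copula:
\[ C_\alpha(u, v) = \log_\alpha\left\{1 + \frac{(\alpha^{u} - 1)(\alpha^{v} - 1)}{\alpha - 1}\right\}, \quad (u, v) \in [0,1]^2,\ \alpha > 0,\ \alpha \ne 1 \]

Gumbel copula:
\[ C_\alpha(u, v) = \exp\left[-\left\{(-\log u)^{\alpha} + (-\log v)^{\alpha}\right\}^{1/\alpha}\right], \quad (u, v) \in [0,1]^2,\ \alpha > 1 \]

$\alpha$ is the dependence parameter, in one-to-one correspondence with Kendall's $\tau$.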
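As an illustration of the one-to-one correspondence between α and Kendall's τ, the sketch below uses the standard closed forms for two of the families: τ = α/(α + 2) for the Clayton copula and τ = 1 - 1/α for the Gumbel copula. The function names are hypothetical.

```python
def clayton_tau(alpha):
    """Kendall's tau implied by a Clayton copula with parameter alpha > 0."""
    return alpha / (alpha + 2.0)


def clayton_alpha(tau):
    """Inverse map: Clayton parameter for a target Kendall's tau in (0, 1)."""
    return 2.0 * tau / (1.0 - tau)


def gumbel_tau(alpha):
    """Kendall's tau implied by a Gumbel copula with parameter alpha >= 1."""
    return 1.0 - 1.0 / alpha


# Clayton parameters 0.5, 2 and 8 map to tau = 0.2, 0.5 and 0.8.
taus = [clayton_tau(a) for a in (0.5, 2.0, 8.0)]
```

These are the same (α, τ) pairs used in the simulation settings later in the talk.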

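Bias Assessment under the Copula Model

Under the copula model
\[ \Pr(T > t, U > u \mid X) = C_\alpha\{S_T(t \mid X), S_U(u \mid X)\}, \]
with $S_T(t \mid X) = \Pr(T > t \mid X)$ and $S_U(u \mid X) = \Pr(U > u \mid X)$, the apparent and net hazards are related by
\[ h^{*}(t \mid X) = r_\alpha(t \mid X)\, h(t \mid X), \]
where
\[ r_\alpha(t \mid X) = \frac{C_\alpha^{(1,0)}\{S_T(t \mid X), S_U(t \mid X)\}\, S_T(t \mid X)}{C_\alpha\{S_T(t \mid X), S_U(t \mid X)\}}, \qquad C_\alpha^{(1,0)}(u, v) = \frac{\partial C_\alpha(u, v)}{\partial u}. \]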

The apparent effect of $X$ on survival is
\[ \beta^{*}_\alpha(t) = \log\frac{h^{*}(t \mid X = 1)}{h^{*}(t \mid X = 0)} = \beta(t) + \log\frac{r_\alpha(t \mid X = 1)}{r_\alpha(t \mid X = 0)}, \]
where $\beta(t) = \log\{h(t \mid X = 1)/h(t \mid X = 0)\}$ is the net effect of $X$.

When $C_\alpha(u, v) = uv$, i.e., $T$ and $U$ are independent, $r_\alpha(\cdot \mid X) = 1$, so the apparent effect $\beta^{*}$ coincides with the net effect $\beta$.

In general, bias arises.
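The bias formula can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' code: it assumes a Clayton copula, a binary covariate, and exponential margins with hazards exp(βx) for T and `cens_rate` * exp(γx) for U. All function and parameter names are hypothetical. For the Clayton copula the ratio r simplifies to u^(-α) / (u^(-α) + v^(-α) - 1), with u and v the two marginal survival probabilities at time t.

```python
import math


def clayton(u, v, alpha):
    """Clayton copula C_alpha(u, v) for alpha > 0."""
    return (u ** -alpha + v ** -alpha - 1.0) ** (-1.0 / alpha)


def clayton_r(u, v, alpha):
    """Ratio r = C^(1,0)(u, v) * u / C(u, v), simplified for Clayton."""
    return u ** -alpha / (u ** -alpha + v ** -alpha - 1.0)


def apparent_effect(t, beta, gamma, alpha, cens_rate=1.0):
    """Apparent log hazard ratio at time t for a binary covariate.

    Assumes exponential margins: hazard exp(beta * x) for T and
    cens_rate * exp(gamma * x) for U (an illustrative choice).
    """
    def r(x):
        s_t = math.exp(-t * math.exp(beta * x))
        s_u = math.exp(-cens_rate * t * math.exp(gamma * x))
        return clayton_r(s_t, s_u, alpha)

    return beta + math.log(r(1)) - math.log(r(0))


# Net effect 1, strong dependence (alpha = 2): the apparent effect
# at t = 1 differs from the net effect.
biased = apparent_effect(t=1.0, beta=1.0, gamma=1.0, alpha=2.0)
# Near-independence (alpha close to 0): apparent effect is close to 1.
nearly_unbiased = apparent_effect(t=1.0, beta=1.0, gamma=1.0, alpha=1e-9)
```

As α approaches 0 the Clayton copula approaches the independence copula, so the second call recovers the net effect up to rounding.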

[Figure: Clayton copula. Two panels, each labeled "Net effect = 1", plot the apparent effect $\beta^{*}$ against the association parameter $\alpha$ (0 to 4), with one curve each for no censoring and for 40%, 50%, and 60% censoring.]

[Figure: Frank copula. Two panels, each labeled "Net effect = 1", plot the apparent effect $\beta^{*}$ against the association parameter $\alpha$ (0 to 15), with one curve each for no censoring and for 40%, 50%, and 60% censoring.]

Survival Prediction with High Dimensional Data

Survival prediction with high-dimensional data has become a recent focus of research owing to the abundance of high-throughput genomic/genetic data.

Such detailed, personalized data provide potentially useful predictors for survival analysis.

The hope is to achieve more accurate prognosis and to develop personalized treatment strategies.

The $p \gg n$ nature of the data makes the statistical analysis challenging.

Examples

van de Vijver et al. (2002 New England Journal of Medicine) used expression profiles from 24,885 genes for 295 breast cancer patients to identify patients who would benefit from adjuvant therapy, leading to a new criterion that reduces patients' risk relative to traditional guidelines based only on histological and clinical characteristics.

Chen et al. (2007 New England Journal of Medicine) examined expression profiles of 672 genes for 125 non-small-cell lung cancer patients to identify a gene signature closely associated with survival outcome in these patients.

Compound-Covariate (CC) Method for Prediction (Tukey, 1993 Controlled Clinical Trials)

Genes are pre-selected, or screened, one by one by univariate (Cox) regression analysis.

A risk score is formed as a linear combination of the pre-selected genes, with the weight of each gene given by its univariate regression coefficient estimate.

This is an easy way to tackle high-dimensional covariates.

The method has been widely adopted in real applications (Beer et al., 2002 Nature Medicine; Chen et al., 2007 New England Journal of Medicine; Matsui et al. 2012 Clinical Cancer Research).

Comparative Performances of Existing Methods

Comparative studies by Wessels et al. (2002 Bioinformatics), Lai et al. (2006 BMC Bioinformatics), Lecocke et al. (2006 Cancer Informatics), and Sun and Li (2012 Biometrics) concluded that CC often yields consistently better results on microarray datasets than more sophisticated multivariate approaches (PCA, Ridge, Lasso, ...).

Theoretical Justifications: Shrinkage Method (Emura et al. 2012 PLoS ONE)

Under independent censoring, and with all covariates mutually independent, we can show that the univariate Cox regression estimator
(i) has limiting value 0 when the true coefficient is 0;
(ii) has a limiting value lying between the true coefficient value and 0 when the true coefficient is not 0.

Univariate estimates therefore shrink the multivariate parameters towards zero, which helps avoid over-fitting.

Theoretical Justifications: Model Averaging (Buckland et al. 1997 Biometrics)

Each constituent model is a simple univariate model.

Each model is given equal weight.

Gene Selection Accommodating Dependent Censoring

A common copula model is assumed for each gene, with a single dependence parameter $\alpha$:
\[ \Pr(T > t, U > u \mid X_j) = C_\alpha\{S_T(t \mid X_j), S_U(u \mid X_j)\}. \]

A proportional hazards model is assumed for the marginal survival functions of $T$ and $U$ given each covariate ($\Lambda_{0j}$, $\Gamma_{0j}$: baseline cumulative hazards):
\[ S_T(t \mid X_j) = \exp\{-\Lambda_{0j}(t) \exp(\beta_j X_j)\}, \qquad S_U(u \mid X_j) = \exp\{-\Gamma_{0j}(u) \exp(\gamma_j X_j)\}. \]

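Parameter Estimation

Semiparametric maximum likelihood estimation (Chen 2010 JRSSB): for fixed $\alpha$, maximize with respect to $\Omega_j = (\beta_j, \gamma_j, \Lambda_{0j}, \Gamma_{0j})$ the log likelihood $\sum_i \ell_i(\Omega_j)$, where
\[
\begin{aligned}
\ell_i(\Omega_j) ={}& \delta_i \left[\beta_j X_{ij} + \log \eta_{1ij}(\tilde T_i; \Omega_j) + \log d\Lambda_{0j}(\tilde T_i)\right] \\
&+ (1 - \delta_i) \left[\gamma_j X_{ij} + \log \eta_{2ij}(\tilde T_i; \Omega_j) + \log d\Gamma_{0j}(\tilde T_i)\right] \\
&- \Phi_\alpha\left\{\exp\left(-\Lambda_{0j}(\tilde T_i) e^{\beta_j X_{ij}}\right), \exp\left(-\Gamma_{0j}(\tilde T_i) e^{\gamma_j X_{ij}}\right)\right\},
\end{aligned}
\]
with
\[
\begin{aligned}
\eta_{1ij}(t; \Omega_j) &= -\Phi_\alpha^{(1,0)}\left\{\exp\left(-\Lambda_{0j}(t) e^{\beta_j X_{ij}}\right), \exp\left(-\Gamma_{0j}(t) e^{\gamma_j X_{ij}}\right)\right\} \exp\left(-\Lambda_{0j}(t) e^{\beta_j X_{ij}}\right), \\
\eta_{2ij}(t; \Omega_j) &= -\Phi_\alpha^{(0,1)}\left\{\exp\left(-\Lambda_{0j}(t) e^{\beta_j X_{ij}}\right), \exp\left(-\Gamma_{0j}(t) e^{\gamma_j X_{ij}}\right)\right\} \exp\left(-\Gamma_{0j}(t) e^{\gamma_j X_{ij}}\right),
\end{aligned}
\]
and $\Phi_\alpha = -\log C_\alpha$.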

Prognosis Index

$\hat\beta_j(\alpha)$ denotes the SMLE of $\beta_j$ for fixed $\alpha$.

The standard error of $\hat\beta_j(\alpha)$ is obtained from the inverse of the observed information matrix (Chen 2010 JRSSB).

Gene selection is based on the significance test for $\beta_j$.

The risk score, or prognosis index (PI), for survival prediction for subject $i$ is
\[ \mathrm{PI}_i = \sum_{j=1}^{K} \hat\beta_j(\alpha) X_{ij}, \]
where $K$ is the number of genes selected for prediction.

When $\alpha$ is chosen as the value leading to the independence copula, $C_\alpha(u, v) = uv$, the copula method reduces to the traditional compound covariate method.
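A minimal sketch of the selection and scoring steps follows, assuming the univariate estimates and their p-values have already been computed for a fixed α. The names `select_top_k` and `prognosis_index` are hypothetical, and the numbers are made up for illustration only.

```python
def select_top_k(p_values, k):
    """Indices of the k genes with the smallest p-values."""
    order = sorted(range(len(p_values)), key=lambda j: p_values[j])
    return order[:k]


def prognosis_index(x_row, beta_hat, selected):
    """PI for one subject: sum over selected genes of beta_hat[j] * x[j]."""
    return sum(beta_hat[j] * x_row[j] for j in selected)


# Illustrative (made-up) inputs: three genes, keep the best two.
p_values = [0.5, 0.01, 0.2]
beta_hat = [0.1, -0.5, 1.0]
selected = select_top_k(p_values, k=2)
pi = prognosis_index([1.0, 2.0, 3.0], beta_hat, selected)
```

Only the way the coefficients are estimated depends on α; the scoring rule itself is the same weighted sum used by the compound covariate method.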

Determination of the Value of $\alpha$

Because $\alpha$ is not identifiable from censored survival data (Tsiatis 1975 PNAS), the likelihood may provide little information on $\alpha$.

A practical approach is to choose the $\alpha$ that maximizes predictive power.

We measure predictive power by Harrell's concordance measure (c-index) (Harrell et al. 1996 Statistics in Medicine), combined with M-fold cross-validation.

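M-fold Cross-validated c-index

Divide the whole sample into $M$ subsamples of about equal size.

Each time, remove one subsample and use the remaining subsamples to estimate the parameters with a chosen $\alpha$.

Obtain $\mathrm{PI}(\alpha)$ for the subjects in the removed subsample and calculate the c-index
\[ c_m(\alpha) = \frac{\sum_{(i,k)} \left[\delta_i I(\tilde T_i < \tilde T_k) I\{\mathrm{PI}_i(\alpha) > \mathrm{PI}_k(\alpha)\} + \delta_k I(\tilde T_k < \tilde T_i) I\{\mathrm{PI}_k(\alpha) > \mathrm{PI}_i(\alpha)\}\right]}{\sum_{(i,k)} \left[\delta_i I(\tilde T_i < \tilde T_k) + \delta_k I(\tilde T_k < \tilde T_i)\right]}. \]

Sum the c-index values over the $M$ subsamples to obtain
\[ \mathrm{CV}(\alpha) = \sum_{m=1}^{M} c_m(\alpha). \]

The value of $\alpha$ maximizing $\mathrm{CV}(\alpha)$ is chosen.

In our experience, $M = 5$ is sufficient for good performance.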
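The cross-validated criterion can be sketched as follows. This is an illustration, not the authors' implementation: `score_fn` is a hypothetical callback standing in for the step that fits the model on the training subjects with a given α and returns the PI values of the held-out subjects, and the interleaved fold assignment is an arbitrary simplification (folds would normally be formed at random).

```python
def c_index(times, events, scores):
    """Harrell-type c-index over all pairs (i, k), as defined above."""
    num = den = 0
    n = len(times)
    for i in range(n):
        for k in range(i + 1, n):
            # Pair is usable when the earlier time is an observed event.
            if events[i] and times[i] < times[k]:
                den += 1
                num += scores[i] > scores[k]
            if events[k] and times[k] < times[i]:
                den += 1
                num += scores[k] > scores[i]
    return num / den if den else float("nan")


def cv_c_index(times, events, score_fn, alpha, m=5):
    """CV(alpha): sum of held-out c-indices over m folds."""
    n = len(times)
    folds = [list(range(f, n, m)) for f in range(m)]  # interleaved folds
    total = 0.0
    for test in folds:
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        scores = score_fn(train, test, alpha)  # PI(alpha) for held-out subjects
        total += c_index([times[i] for i in test],
                         [events[i] for i in test], scores)
    return total


def choose_alpha(times, events, score_fn, grid, m=5):
    """Grid value of alpha with the largest CV(alpha)."""
    return max(grid, key=lambda a: cv_c_index(times, events, score_fn, a, m))
```

A higher PI is treated as higher risk, so a pair is concordant when the subject who fails first has the larger score.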

Simulation

$n = 100$, $p = \dim(X) = 100$.

$(T, U)$ follows
\[ \Pr(T > t, U > u) = \left(e^{t \alpha \exp(\beta' X)} + e^{u \alpha \exp(\gamma' X)} - 1\right)^{-1/\alpha}. \]

$\alpha = 0.5, 2, 8$ (Kendall's $\tau = 0.2, 0.5, 0.8$, respectively).

$\gamma = \beta$; 50% censoring.

The first $q = 5, 10, 20$ genes have non-zero coefficients (informative genes); the remaining $p - q$ genes have zero coefficients (noninformative genes).

The analysis is based on the Clayton copula and the 5-fold cross-validated c-index.
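One way to draw (T, U) from this joint survival function is sketched below. It assumes the standard conditional-inversion sampler for the Clayton copula combined with exponential margins whose rates are exp(β'X) and exp(γ'X), which is what the displayed formula implies. The function names are hypothetical, and the sketch does not tune the censoring proportion to 50%.

```python
import math
import random


def sample_clayton_pair(alpha, rng):
    """Draw (u, v) whose joint distribution function is the Clayton copula."""
    u = 1.0 - rng.random()  # uniform on (0, 1]
    w = 1.0 - rng.random()
    # Solve dC(u, v)/du = w for v (conditional inversion).
    v = ((w ** (-alpha / (1.0 + alpha)) - 1.0) * u ** -alpha + 1.0) ** (-1.0 / alpha)
    return u, v


def simulate_subject(x, beta, gamma, alpha, rng):
    """Return (observed time, event indicator) for covariate vector x."""
    u, v = sample_clayton_pair(alpha, rng)
    rate_t = math.exp(sum(b * xi for b, xi in zip(beta, x)))
    rate_u = math.exp(sum(g * xi for g, xi in zip(gamma, x)))
    t = -math.log(u) / rate_t  # inverts S_T(t) = exp(-rate_t * t) = u
    c = -math.log(v) / rate_u  # inverts S_U(c) = exp(-rate_u * c) = v
    return min(t, c), int(t <= c)


rng = random.Random(2013)
obs_time, delta = simulate_subject([1.0, 0.0], [0.8, 0.0], [0.8, 0.0], 2.0, rng)
```

Because u and v play the role of the marginal survival probabilities, transforming them through the inverse survival functions yields a pair (T, U) with the intended Clayton dependence.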

Predictor Structure

Scenario 1 (tag genes): each informative gene is positively correlated with non-informative genes. There are several sets of correlated genes. In each set, only one tag gene is associated with survival, while the other genes are not associated with survival given the tag gene.

[Diagram: genes $X_1, \ldots, X_q$ and survival time $T$ under the tag gene structure.]

Scenario 2 (gene pathway): the informative genes are positively correlated with each other. There is one set of correlated genes jointly associated with survival.

[Diagram: genes $X_1, \ldots, X_q$ and survival time $T$ under the pathway structure.]

Evaluation Criteria for Gene Selection

Sensitivity:
\[ \text{sensitivity} = \frac{\sum_{j=1}^{p} I(P_j \le P_{(q)},\ \beta_j \ne 0)}{\sum_{j=1}^{p} I(\beta_j \ne 0)} \times 100\% \]

$P_j$: p value of the Wald test for $H_0 : \beta_j = 0$.

$P_{(j)}$: the $j$th smallest value among $\{P_1, \ldots, P_p\}$.

Specificity:
\[ \text{specificity} = \frac{\sum_{j=1}^{p} I(P_j > P_{(q)},\ \beta_j = 0)}{\sum_{j=1}^{p} I(\beta_j = 0)} \times 100\% \]

Higher sensitivity (specificity) indicates better ability to identify informative (noninformative) genes.
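The two criteria can be computed directly from the p values and the true coefficients, as in the sketch below. The function name `selection_accuracy` and the example numbers are hypothetical.

```python
def selection_accuracy(p_values, beta, q):
    """Return (sensitivity, specificity) in percent.

    A gene counts as selected when its p-value is at most P_(q),
    the q-th smallest p-value; beta holds the true coefficients.
    """
    cutoff = sorted(p_values)[q - 1]  # P_(q)
    informative = [b != 0 for b in beta]
    true_pos = sum(1 for p, inf in zip(p_values, informative)
                   if inf and p <= cutoff)
    true_neg = sum(1 for p, inf in zip(p_values, informative)
                   if not inf and p > cutoff)
    n_inf = sum(informative)
    n_non = len(beta) - n_inf
    return 100.0 * true_pos / n_inf, 100.0 * true_neg / n_non


# Made-up example: two informative genes out of four, q = 2.
sens, spec = selection_accuracy([0.01, 0.2, 0.03, 0.5], [1.0, 1.0, 0.0, 0.0], q=2)
```

In this example one informative gene is missed and one noninformative gene is selected in its place, so both criteria equal 50%.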

Simulation Results (tag gene structure: q = 5, p = 100)

$\beta = (\underbrace{0.8, \ldots, 0.8}_{5}, \underbrace{0, \ldots, 0}_{95})$

method         Kendall's τ   sensitivity (%)   specificity (%)
independence   0.2           50.4              97.4
independence   0.5           47.6              97.2
independence   0.8           48.4              97.3
copula         0.2           60.4              97.9
copula         0.5           60.8              97.9
copula         0.8           57.6              97.8

The copula method improves sensitivity by 9-12%.

The values of $\hat\alpha$ identified by CV range from 4.0 to 4.6.

Simulation Results (tag gene structure: q = 10, p = 100)

$\beta = (\underbrace{0.4, \ldots, 0.4}_{10}, \underbrace{0, \ldots, 0}_{90})$

method         Kendall's τ   sensitivity (%)   specificity (%)
independence   0.2           32.8              92.5
independence   0.5           32.8              92.5
independence   0.8           33.6              92.6
copula         0.2           42.6              93.6
copula         0.5           42.8              93.6
copula         0.8           44.6              93.8

The copula method improves sensitivity by 10-11%.

The values of $\hat\alpha$ identified by CV range from 4.5 to 5.2.

Simulation Results (tag gene structure: q = 20, p = 100)

$\beta = (\underbrace{0.2, \ldots, 0.2}_{10}, \underbrace{-0.2, \ldots, -0.2}_{10}, \underbrace{0, \ldots, 0}_{80})$

method         Kendall's τ   sensitivity (%)   specificity (%)
independence   0.2           31.0              82.7
independence   0.5           30.6              82.6
independence   0.8           31.7              82.7
copula         0.2           36.1              84.0
copula         0.5           35.3              83.8
copula         0.8           37.6              84.4

The copula method improves sensitivity by 5-6%.

The values of $\hat\alpha$ identified by CV range from 3.9 to 4.1.

Simulation Results (pathway structure: q = 5, p = 100)

$\beta = (\underbrace{0.4, \ldots, 0.4}_{5}, \underbrace{0, \ldots, 0}_{95})$

method         Kendall's τ   sensitivity (%)   specificity (%)
independence   0.2           96.8              99.8
independence   0.5           98.4              99.9
independence   0.8           97.2              99.8
copula         0.2           100               100
copula         0.5           99.6              99.9
copula         0.8           99.2              99.9

Both methods perform equally well.

The values of $\hat\alpha$ identified by CV range from 4.8 to 5.3.

Simulation Results (pathway structure: q = 10, p = 100)

$\beta = (\underbrace{0.2, \ldots, 0.2}_{5}, \underbrace{-0.2, \ldots, -0.2}_{5}, \underbrace{0, \ldots, 0}_{90})$

method         Kendall's τ   sensitivity (%)   specificity (%)
independence   0.2           66.2              96.2
independence   0.5           64.4              96.0
independence   0.8           66.6              96.3
copula         0.2           83.4              98.2
copula         0.5           81.0              97.9
copula         0.8           82.2              98.0

The copula method improves sensitivity by 16-17%.

The values of $\hat\alpha$ identified by CV range from 4.1 to 4.7.

Simulation Results (pathway structure: q = 20, p = 100)

$\beta = (\underbrace{0.1, \ldots, 0.1}_{10}, \underbrace{-0.1, \ldots, -0.1}_{10}, \underbrace{0, \ldots, 0}_{80})$

method         Kendall's τ   sensitivity (%)   specificity (%)
independence   0.2           72.5              93.1
independence   0.5           71.5              92.9
independence   0.8           73.6              93.4
copula         0.2           85.0              96.2
copula         0.5           83.8              96.0
copula         0.8           85.9              96.5

The copula method improves sensitivity by 12%.

The values of $\hat\alpha$ identified by CV range from 4.4 to 4.9.

Summary of Simulation Results

Univariate gene selection based on the copula framework improves the ability to identify informative genes, compared with the conventional selection procedure under independent censoring.

The improvement is larger in Scenario 2 (correlated informative genes) than in Scenario 1 (independent informative genes).

Similar conclusions hold for $p = 500$.

Similar conclusions hold for the analysis based on the Frank copula.

Non-Small-Cell Lung Cancer Data (Chen et al. 2007 NEJM)

The data contain expression values of 672 genes for $n = 125$ patients (38 died and the others were censored; 70% censoring).

The patients are divided into training and test datasets (63:62) in the same way as in Chen et al.

The $p = 485$ genes with coefficient of variation (CV) > 3% are included in the analysis.

Chen et al. reported 16 genes most predictive of the survival of NSCLC patients.

Results: Gene Selection

Top 16 genes selected by Chen et al. (independent censoring) and by the copula method (6 genes appear in both lists):

       Independence                  Copula
No.    Gene     β       p value      Gene     β       p value
1      ANXA5    -1.09   0.0039       ZNF264   0.51    0.0004
2      DLG2     1.32    0.0041       MMP16    0.50    0.0005
3      ZNF264   0.55    0.0079       HGF      0.50    0.0010
4      DUSP6    0.75    0.0086       HCK      -0.49   0.0012
5      CPEB4    0.59    0.0162       NF1      0.47    0.0016
6      LCK      -0.84   0.0171       ERBB3    0.46    0.0016
7      STAT1    -0.58   0.0198       NR2F6    0.57    0.0030
8      RNF4     0.65    0.0220       AXL      0.77    0.0035
9      IRF4     0.52    0.0299       CDC23    0.51    0.0050
10     STAT2    0.58    0.0311       DLG2     0.92    0.0055
11     HGF      0.51    0.0334       IGF2     -0.34   0.0081
12     ERBB3    0.55    0.0335       RBBP6    0.54    0.0082
13     NF1      0.47    0.0380       COX11    0.51    0.0118
14     FRAP1    -0.77   0.0408       DUSP6    0.40    0.0121
15     MMD      0.92    0.0419       ENG      -0.37   0.0139
16     HMMR     0.52    0.0481       CKMT1A   -0.41   0.0155

Results: Survival Prediction

Kaplan-Meier estimates of the survival functions are obtained for the good and poor prognosis groups in the test data, classified by the PI values (cutpoint = median).

A smaller p value for the test comparing the two survival functions means better survival prediction.

Based on a grid search for $K$ (number of genes selected for prediction) over $\{10, 20, \ldots, 100\}$, both the compound covariate and copula methods attain the minimum p value at $K = 80$.

Results: Kaplan-Meier Plots of Good vs. Poor Prognosis Groups based on 80 genes

[Figure: two panels of Kaplan-Meier curves, survival probability against months. Univariate Cox: P value = 0.129. Proposed method: P value = 0.018.]

Results: Kaplan-Meier Plots of Good vs. Poor Prognosis Groups based on 16 genes

[Figure: two panels of Kaplan-Meier curves, survival probability against months. Univariate Cox: P value = 0.146. Proposed method: P value = 0.112.]

Summary

Univariate gene selection is a convenient and effective tool when $p \gg n$.

The issue of dependent censoring is more prominent in such univariate analysis.

Dependent censoring may lead to substantial bias in regression coefficient estimation; the bias can be assessed analytically under a copula framework.

A univariate gene selection procedure accommodating dependent censoring can be performed under a copula framework, together with a predictive performance measure for identifying the level of dependence between survival and censoring times.

Such a method has greater power to identify informative genes and achieves better survival prediction performance, compared with conventional methods based on the independent censoring assumption.

Thank You!!