PhD course: Statistical evaluation of diagnostic and predictive models


1 PhD course: Statistical evaluation of diagnostic and predictive models
Tianxi Cai (Harvard University, Boston), Paul Blanche (University of Copenhagen), Thomas Alexander Gerds (University of Copenhagen)
March 18-22

2 Day 4: Survival Prediction

3 Prediction of Survival Outcomes
- Survival Prediction with a Single Marker
  - evaluating the accuracy
  - estimating the accuracy
- Survival Prediction with Multiple Markers
  - constructing composite scores through survival regression models
  - evaluating the accuracy

4 Survival Prediction with a Single Marker
In many clinical studies, the outcome of interest is the time to the occurrence of a clinical condition. Examples: time to disease diagnosis, recurrence, or death.
Figure: estimated survival curves $\hat S(t)$ against time (years) for (a) survival and (b) metastasis-free survival.

5 Standard Survival Analysis
- Kaplan-Meier plots
- Log-rank test for two-group comparisons (e.g. assessing treatment effect)
- Association analysis: Cox proportional hazards model, hazard ratio estimates

6 The PEACE Trial
Figure: survival probability against time (months) for the Placebo and ACEi arms.
Table (Placebo): Est, SE and p-value for the covariates egfr, Age, Gender, lveejf, Hypertension, Diabetes and MI. The only entries that survive are p-values of <0.01 for Age, lveejf, Hypertension and Diabetes.
Questions beyond association:
- How well can we predict survival?
- How do we evaluate the prediction performance with survival outcomes?
- How do we combine information from multiple markers?

7 Survival Prediction Accuracy Measures
To assess the accuracy of a marker $X$ in predicting the event time $T$, various accuracy measures have been suggested:
- Time-dependent TPR, FPR, PPV and NPV: Heagerty & Pepe (2000); Heagerty & Zheng (2005); Cai et al (2005); Zheng et al (2007).
- Proportion of explained variation: Korn & Simon (1990); Henderson (1995); Schemper & Stare (1996).
- (Integrated) Brier score: Graf et al (1999); Gerds & Schumacher (2006).
- Overall concordance measures (C-index): Harrell et al (1982); Begg et al (2000); Uno et al (2011).

8 Time Dependent TPR, FPR and ROC
When interest lies in the prediction of $t$-year survival, one may assess the accuracy of $X$ in classifying the binary outcome $D_t = I(T \le t)$ by constructing binary prediction rules $I(X \ge c)$.
The classification accuracy of $I(X \ge c)$ in predicting $D_t$ may be summarized by
$\mathrm{TPR}_t(c) = P(X \ge c \mid D_t = 1)$, $\mathrm{FPR}_t(c) = P(X \ge c \mid D_t = 0)$.
This corresponds to a time-dependent ROC curve
$\mathrm{ROC}_t(u) = \mathrm{TPR}_t\{\mathrm{FPR}_t^{-1}(u)\}$.

9 Defining "Cases" and "Controls" for a given t
In general, several types of time-dependent ROC curves have been proposed by defining $D_t$ and the population of interest differently.
- Entire population: $D_t = 1$ if $T \le t$, $D_t = 0$ if $T > t$
- $\{T \le t\} \cup \{T > \tau\}$: $D_t = 1$ if $T \le t$, $D_t = 0$ if $T > \tau$
- $\{T \ge t\}$: $D_t = 1$ if $T = t$, $D_t = 0$ if $T > t$
- $\{T = t\} \cup \{T > \tau\}$: $D_t = 1$ if $T = t$, $D_t = 0$ if $T > \tau$
Here $\tau$ is a pre-defined time point such that subjects with $T > \tau$ are considered controls. Classification accuracy measures can be defined accordingly.

10 Overall Prediction Performance Measure
- Area under the ROC curve for classifying $D_t = I(T \le t)$:
  $\mathrm{AUC}_t = \int_0^1 \mathrm{ROC}_t(u)\,du = P(X_1 \ge X_2 \mid T_1 \le t, T_2 > t)$
- Concordance statistic (Harrell's C-statistic):
  $C_\tau = P(X_1 \ge X_2 \mid T_1 \le T_2, T_1 \le \tau)$
- Integrated Brier score:
  $\mathrm{IBS}_\tau = \int_0^\tau E\{I(T > t) - P(T > t \mid X)\}^2\,dw(t)$
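The two concordance measures can be illustrated by direct pair counting. This is a minimal sketch that assumes every event time is observed (no censoring) and has no tied event times; the function names `auc_t` and `c_stat` and the toy data are made up for illustration.

```python
def auc_t(x, event_time, t):
    """Estimate P(X1 >= X2 | T1 <= t, T2 > t) over all case-control pairs."""
    cases = [xi for xi, ti in zip(x, event_time) if ti <= t]
    controls = [xi for xi, ti in zip(x, event_time) if ti > t]
    concordant = sum(xc >= xn for xc in cases for xn in controls)
    return concordant / (len(cases) * len(controls))


def c_stat(x, event_time, tau):
    """Estimate P(X1 >= X2 | T1 <= T2, T1 <= tau) over ordered pairs of distinct subjects."""
    num = den = 0
    n = len(x)
    for i in range(n):
        for j in range(n):
            if i != j and event_time[i] <= event_time[j] and event_time[i] <= tau:
                den += 1
                num += x[i] >= x[j]
    return num / den


# Toy data (made up): higher marker values tend to go with earlier events.
x = [2.1, 1.7, 0.3, 1.2, 0.4, 0.9]
event_time = [1.0, 2.5, 8.0, 3.0, 9.5, 6.0]
print(auc_t(x, event_time, t=5.0))
print(c_stat(x, event_time, tau=10.0))
```

With censored data these raw counts are no longer available for every pair, which is why the later slides introduce weighting.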

11 Estimation of the Time Dependent Accuracy Measures
In most studies with event time outcomes, the event time is subject to censoring due to loss to follow-up or end of study. Consequently, for event time $T$, we observe $(\tilde T, \delta)$, where
$\tilde T = \min(T, C)$, $\delta = I(T \le C)$,
and $C$ is the follow-up (censoring) time.
Estimation of the accuracy measures requires assumptions about the censoring variable:
- A stronger assumption requires $C$ to be independent of both $T$ and $X$, with a common survival function $G(t) = P(C \ge t)$.
- A weaker assumption requires $C$ to be independent of the event time $T$ conditional on the marker value $X$; $C$ may depend on $X$.

12 Estimation of the Time Dependent Accuracy
Suppose we are interested in estimating
$\mathrm{TPR}_t(c) = P(X \ge c \mid T \le t) = \dfrac{P(T \le t \mid X \ge c)\,P(X \ge c)}{P(T \le t)}$.
Without censoring, we may estimate $\mathrm{TPR}_t(c)$ empirically:
$\dfrac{\sum_{i=1}^n I(X_i \ge c, T_i \le t)}{\sum_{i=1}^n I(T_i \le t)}$.
Due to censoring, $D_t = I(T \le t)$ is not always observed. Various approaches may be taken to account for censoring.
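The empirical estimator for the uncensored case is a ratio of two counts, and the FPR is obtained the same way from the controls. The sketch below assumes all event times are observed and that there is at least one case and one control at time t; `tpr_fpr` and the toy data are hypothetical.

```python
def tpr_fpr(x, event_time, t, c):
    """Empirical TPR_t(c) and FPR_t(c) when every event time is observed."""
    cases = [xi for xi, ti in zip(x, event_time) if ti <= t]      # D_t = 1
    controls = [xi for xi, ti in zip(x, event_time) if ti > t]    # D_t = 0
    tpr = sum(xi >= c for xi in cases) / len(cases)
    fpr = sum(xi >= c for xi in controls) / len(controls)
    return tpr, fpr


# Toy data (made up): higher marker values tend to go with earlier events.
x = [2.1, 1.7, 0.3, 1.2, 0.4, 0.9]
event_time = [1.0, 2.5, 8.0, 3.0, 9.5, 6.0]
print(tpr_fpr(x, event_time, t=5.0, c=1.0))  # (1.0, 0.0)
```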

13 Estimation of the Time Dependent Accuracy
If $C$ is independent of $T$ and $X$, a consistent estimator of $\mathrm{TPR}_t(c)$ may be obtained
- based on Kaplan-Meier estimates of $P(T \le t)$ and $P(T \le t \mid X \ge c)$. For any $c$, $P(T \le t \mid X \ge c)$ may be estimated using observations from the subset of patients with $\{X \ge c\}$;
- by inverse probability weighting (IPW) with weights
  $W_i(t) = \dfrac{I(\tilde T_i \le t)\,\delta_i}{G(\tilde T_i)} + \dfrac{I(\tilde T_i > t)}{G(t)}$.
Note that $I(T_i \le t)$ is observable if $I(\tilde T_i \le t)\,\delta_i = 1$ or $I(\tilde T_i > t) = 1$.

14 Estimation of the Time Dependent Accuracy
For the IPW approach, one may show that
$E\{W_i(t)\,I(T_i \le t, X_i \ge c) \mid T_i, X_i\} = I(T_i \le t, X_i \ge c)$.
Thus, $\mathrm{TPR}_t(c)$ may be estimated by
$\widehat{\mathrm{TPR}}_t(c) = \dfrac{\sum_{i=1}^n \hat W_i(t)\,I(X_i \ge c, T_i \le t)}{\sum_{i=1}^n \hat W_i(t)\,I(T_i \le t)}$,
where $\hat W_i(t)$ is obtained by replacing $G(\cdot)$ in $W_i(t)$ by $\hat G(\cdot)$, and $\hat G(\cdot)$ is the Kaplan-Meier estimator of $G(\cdot)$.
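A minimal sketch of the IPW estimator of the TPR follows, assuming censoring is independent of T and X and that event and censoring times are not tied. The helper names `km_censoring` and `ipw_tpr` and the toy data are made up; only subjects with an observed event by time t receive a nonzero weight in the TPR sums.

```python
def km_censoring(obs_time, delta):
    """Return G_hat(s), a Kaplan-Meier estimate of P(C >= s).

    Censoring (delta == 0) is treated as the event of interest.
    """
    cens_times = sorted({u for u, d in zip(obs_time, delta) if d == 0})

    def G_hat(s):
        surv = 1.0
        for u in cens_times:
            if u >= s:  # only censoring times strictly before s contribute
                break
            at_risk = sum(v >= u for v in obs_time)
            n_cens = sum(v == u and d == 0 for v, d in zip(obs_time, delta))
            surv *= 1.0 - n_cens / at_risk
        return surv

    return G_hat


def ipw_tpr(x, obs_time, delta, t, c):
    """IPW estimate of TPR_t(c) = P(X >= c | T <= t)."""
    G_hat = km_censoring(obs_time, delta)
    num = den = 0.0
    for xi, ui, di in zip(x, obs_time, delta):
        if ui <= t and di == 1:        # event observed by time t: a known case
            w = 1.0 / G_hat(ui)        # weight delta_i / G_hat(T~_i)
            den += w
            num += w * (xi >= c)
        # ui > t: known control, not part of the TPR sums.
        # ui <= t and di == 0: censored before t, weight zero.
    return num / den


# Toy data (made up): delta = 1 marks an observed event, 0 a censored time.
x = [2.0, 1.0, 0.5, 1.8, 0.2, 0.1]
obs_time = [1.0, 2.0, 3.0, 4.0, 6.0, 7.0]
delta = [1, 0, 1, 1, 1, 0]
print(ipw_tpr(x, obs_time, delta, t=5.0, c=1.0))
```

When no observation is censored, every weight equals 1 and the function reduces to the empirical proportion of cases with marker value at least c.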

15 Estimation of the Time Dependent Accuracy
If $C$ depends on $X$ but is independent of $T$ conditional on $X$, one may estimate $\mathrm{TPR}_t(c)$ by first estimating $S_y(t) = P(T \le t \mid X = y)$
- non-parametrically, via methods such as the kernel-smoothed conditional Nelson-Aalen or Kaplan-Meier estimator;
- semi-parametrically, by assuming a regression model for $T \mid X$, e.g. fitting a Cox proportional hazards model.
Subsequently, one may obtain a plug-in estimate of $\mathrm{TPR}_t(c)$ based on
$P(T \le t \mid X \ge c) = \dfrac{\int_c^\infty S_y(t)\,dF(y)}{1 - F(c)}$, where $F(y) = P(X \le y)$.

16 Framingham Offspring Study for CVD Prediction
Framingham Heart Study:
- Goal: identifying risk factors for CVD
- Framingham Risk Score for CHD/stroke prediction
- 3 generations: original cohort (1948); Offspring cohort (1971), Omni cohort (1994); 3rd generation cohort (2002), 2nd generation Omni cohort (2003)
Framingham Offspring Study, female participants:
- 1687 women out of a total of 5124 participants
- 261 events (death/CVD), with a 10-year event rate of 6%
- Framingham risk score (Wilson et al, 1998)
- Risk score with C-reactive protein (CRP) (Cook et al, 2006; Ridker et al, 2007)

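17 Framingham Offspring Study for CVD Prediction
Table: Estimated accuracy measures ($\times 100$) for 5-year survival based on non-parametric kernel smoothing (NP), IPW and the Cox model. Here $c_p$ is the $p$th percentile of the risk score.
Columns: Est and SE under each of NP, IPW and Cox.
Rows: $\mathrm{FPR}_5(c_{.2})$, $\mathrm{FPR}_5(c_{.8})$, $\mathrm{TPR}_5(c_{.2})$, $\mathrm{TPR}_5(c_{.8})$, $\mathrm{NPV}_5(c_{.2})$, $\mathrm{NPV}_5(c_{.8})$, $\mathrm{PPV}_5(c_{.2})$, $\mathrm{PPV}_5(c_{.8})$, AUC, and FPR, NPV and PPV at a fixed TPR level.
(The numeric entries and the TPR level are missing from the source.)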

18 Framingham Offspring Study for CVD Prediction
Figure: Time-dependent ROC curve (a) and PPV curve (b) of the risk score for predicting 5-year CVD events. Panel (a) plots TPR_5yrs against FPR_5yrs; panel (b) plots PPV_5yrs against v. Each panel shows the Semi-Cox, CNA and IPW estimates.

19 Survival Prediction with Multiple Markers
When multiple markers are available to assist in prediction, one may construct composite scores as for binary outcomes.
1. Fit a survival regression model to combine the markers into a risk score $S(\hat\beta)$.
2. Evaluate the performance of $S(\hat\beta)$ in predicting survival as in the single-marker case.

20 Survival Prediction with Multiple Markers
A wide range of survival regression models have been proposed in the literature:
- Cox proportional hazards model;
- Proportional odds model;
- Time-specific generalized linear model.

21 Survival Regression Models
Cox Proportional Hazards Model (Cox, 1972)
$\lambda_X(t) = \lambda_0(t)\exp(\beta_0^T X)$
Here $\lambda_X(t)$ is the hazard function for a subject with marker value $X$, and $\lambda_0(t)$ is the baseline hazard function.
An equivalent form of the model is
$P(T \le t \mid X) = g(h_0(t) + \beta_0^T X)$,
where $g(x) = 1 - e^{-e^{x}}$ and $h_0(\cdot)$ is an unknown increasing function.
$\beta_0$ may be estimated by maximizing the partial likelihood.
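The equivalence of the two forms follows in a few lines from the relation between the hazard and the survival function. The symbol $\Lambda_0$ for the cumulative baseline hazard is introduced here for the derivation and does not appear on the slide.

```latex
\begin{align*}
\Lambda_0(t) &= \int_0^t \lambda_0(s)\,ds, \\
P(T > t \mid X) &= \exp\Big\{-\int_0^t \lambda_X(s)\,ds\Big\}
                 = \exp\big\{-\Lambda_0(t)\,e^{\beta_0^T X}\big\}, \\
P(T \le t \mid X) &= 1 - \exp\big[-\exp\{\log\Lambda_0(t) + \beta_0^T X\}\big]
                   = g\big(h_0(t) + \beta_0^T X\big),
\end{align*}
```

with $h_0(t) = \log\Lambda_0(t)$, which is increasing in $t$ because $\Lambda_0$ is.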

22 Survival Regression Models
Proportional Odds Model
$\mathrm{logit}\,P(T \le t \mid X) = h_0(t) + \beta_0^T X$
For any fixed $t$, this is a logistic regression with response $I(T \le t)$.
A rank-based estimator (Pettitt, 1984) and a non-parametric maximum likelihood estimator (Murphy et al, 1997) have been proposed for $\beta_0$.
Under either the proportional hazards or the proportional odds model, the risk score $\beta_0^T X$ is the optimal score for classifying $D_t = I(T \le t)$ for any $t$.

23 Time-specific Generalized Linear Model
Markers useful for identifying short-term survivors may not be useful for identifying long-term survivors.
To construct a time-dependent optimal score, one may consider a time-specific generalized linear model (GLM):
$P(T \le t \mid X) = g\{h_0(t) + \beta_{0t}^T X\}$
Without censoring, for any given time $t$, one may fit a usual GLM to the synthetic data $\{D_t = I(T \le t), X\}$ to obtain an estimate of $\beta_{0t}$.
Zheng et al (2006) considered estimators based on inverse probability weighting for the time-specific logistic regression model.
$\beta_{0t}^T X$ is the optimal score for distinguishing $\{T \le t\}$ from $\{T > t\}$ and achieves the highest $\mathrm{ROC}_t(\cdot)$.

24 Estimating the Accuracy of the Composite Score
By fitting the survival models, one may obtain an estimate of the regression coefficient.
- Cox proportional hazards model: one may estimate $\beta_0$ as the maximizer of the partial likelihood function.
- Time-specific GLM: one may estimate $\beta_{0t}$ as the solution to the weighted estimating equation
  $\sum_{i=1}^n \hat W_i(t) \begin{pmatrix} 1 \\ X_i \end{pmatrix} \{I(T_i \le t) - g(\alpha + \beta^T X_i)\} = 0$,
  where $\hat W_i(t)$ is the weight to account for censoring as defined earlier. For example, with the logistic link this is equivalent to fitting a logistic regression with $I(\tilde T \le t)$ as the outcome, $X$ as the predictor, and weights $\hat W_i(t)$.
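For a single marker and the logistic link, the weighted estimating equation can be solved with a few Newton-Raphson steps. The sketch below is an illustration under stated assumptions: the function name `weighted_logistic` is made up, the weights are taken as already computed (subjects censored before t simply carry weight 0), and the toy outcomes and weights are invented numbers.

```python
import math


def weighted_logistic(x, y, w, n_iter=25):
    """Solve sum_i w_i (1, x_i)^T {y_i - expit(a + b x_i)} = 0 by Newton-Raphson."""
    a = b = 0.0
    for _ in range(n_iter):
        s0 = s1 = h00 = h01 = h11 = 0.0
        for xi, yi, wi in zip(x, y, w):
            p = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            r = wi * (yi - p)            # weighted residual (score contribution)
            v = wi * p * (1.0 - p)       # weighted variance (information contribution)
            s0 += r
            s1 += r * xi
            h00 += v
            h01 += v * xi
            h11 += v * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: (a, b) += H^{-1} s, with the 2x2 inverse written out.
        a += (h11 * s0 - h01 * s1) / det
        b += (h00 * s1 - h01 * s0) / det
    return a, b


# Toy data (made up): y_i = I(T~_i <= t) and w_i plays the role of W_hat_i(t).
x = [0.0, 1.0, 2.0, 3.0, 0.5, 2.5]
y = [0, 0, 1, 1, 1, 0]
w = [1.0, 1.25, 1.0, 1.25, 1.0, 1.0]
alpha_hat, beta_hat = weighted_logistic(x, y, w)
print(alpha_hat, beta_hat)
```

The fixed iteration count is a simplification; the toy outcomes overlap in the marker, so the solution is finite, which would not hold for perfectly separated data.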

25 Estimating the Accuracy of the Composite Risk Score
Suppose $\hat\beta_t$ is the estimator of $\beta_{0t}$ ($\hat\beta_t = \hat\beta$ if $\beta_{0t} = \beta_0$). We may estimate the accuracy of the risk score $\beta_{0t}^T X$ by replacing $\beta_{0t}^T X$ with $\hat\beta_t^T X$ and using the tools for the single-marker setting.
For example, assuming that the censoring is independent of $T$ and $X$,
$\mathrm{TPR}_t(c; \beta_{0t}) = P(\beta_{0t}^T X \ge c \mid T \le t)$
may be estimated by
$\widehat{\mathrm{TPR}}_t(c; \hat\beta_t) = \dfrac{\sum_{i=1}^n \hat W_i(t)\,I(\hat\beta_t^T X_i \ge c, T_i \le t)}{\sum_{i=1}^n \hat W_i(t)\,I(T_i \le t)}$.

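26 Estimating the Accuracy of the Composite Score
Similarly, assuming that the censoring is independent of $T$ and $X$,
$\mathrm{FPR}_t(c; \beta_{0t}) = P(\beta_{0t}^T X \ge c \mid T > t)$
may be estimated by
$\widehat{\mathrm{FPR}}_t(c; \hat\beta_t) = \dfrac{\sum_{i=1}^n \hat W_i(t)\,I(\hat\beta_t^T X_i \ge c, T_i > t)}{\sum_{i=1}^n \hat W_i(t)\,I(T_i > t)}$.
Consequently, we may estimate
$\mathrm{ROC}_t(u; \beta_{0t}) = \mathrm{TPR}_t\{\mathrm{FPR}_t^{-1}(u; \beta_{0t}); \beta_{0t}\}$
by plugging in $\widehat{\mathrm{TPR}}_t(c; \hat\beta_t)$ and $\widehat{\mathrm{FPR}}_t(c; \hat\beta_t)$.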
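Once weighted TPR and FPR estimates are available at every threshold, the ROC curve and its area follow by sweeping the threshold over the observed scores. This sketch takes the scores, case indicators and weights as given; the names `weighted_roc` and `auc` and the toy numbers are hypothetical, and the trapezoidal rule is one simple choice for the area.

```python
def weighted_roc(score, is_case, weight):
    """ROC points (FPR, TPR) from weighted cases and controls, sweeping c downward."""
    case_total = sum(w for d, w in zip(is_case, weight) if d)
    ctrl_total = sum(w for d, w in zip(is_case, weight) if not d)
    points = [(0.0, 0.0)]
    for c in sorted(set(score), reverse=True):
        tp = sum(w for s, d, w in zip(score, is_case, weight) if d and s >= c)
        fp = sum(w for s, d, w in zip(score, is_case, weight) if not d and s >= c)
        points.append((fp / ctrl_total, tp / case_total))
    return points


def auc(points):
    """Trapezoidal area under a list of (FPR, TPR) points ordered by FPR."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy data (made up): three weighted cases (T <= t) and two controls (T > t).
score = [2.0, 0.5, 1.8, 1.0, 0.1]
is_case = [1, 1, 1, 0, 0]
weight = [1.0, 1.25, 1.25, 1.0, 1.0]
roc_points = weighted_roc(score, is_case, weight)
print(roc_points)
print(auc(roc_points))
```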

27 Example: Breast Cancer Gene Expression Study
"A Gene-Expression Signature as a Predictor of Survival in Breast Cancer." The New England Journal of Medicine, Volume 347, Number 25, December 19, 2002. Copyright 2002 by the Massachusetts Medical Society.
Marc J. van de Vijver, M.D., Ph.D., Yudong D. He, Ph.D., Laura J. van 't Veer, Ph.D., Hongyue Dai, Ph.D., Augustinus A.M. Hart, M.Sc., Dorien W. Voskuil, Ph.D., George J. Schreiber, M.Sc., Johannes L. Peterse, M.D., Chris Roberts, Ph.D., Matthew J. Marton, Ph.D., Mark Parrish, Douwe Atsma, Anke Witteveen, Annuska Glas, Ph.D., Leonie Delahaye, Tony van der Velde, Harry Bartelink, M.D., Ph.D., Sjoerd Rodenhuis, M.D., Ph.D., Emiel T. Rutgers, M.D., Ph.D., Stephen H. Friend, M.D., Ph.D., and René Bernards, Ph.D.

28 Example: Breast Cancer Gene Expression Study
- 295 patients diagnosed with breast cancer between 1984 and [year missing]. The median survival time is 3.8 years for these patients.
- Outcome: time to death
- Markers: gene expression markers
- The gene expression measurement is the logarithm of the intensity ratio between the red and the green fluorescent dyes, where the green dye is used for the reference pool and the red dye for the experimental tissue.
- The prognosis rule developed by van 't Veer et al (2002) and van de Vijver et al (2002) was derived from 70 gene expression markers.
- For illustration, we selected 6 of the 70 gene expression markers for prediction.

29 Example: Breast Cancer Gene Expression Study
Obtain a linear score $\hat\beta_t^T X$ for classifying $I(T \le t)$ by fitting various regression models:
- proportional hazards model: $\lambda_X(t) = \lambda_0(t)\,e^{\beta_0^T X}$
- proportional odds model: $\mathrm{logit}\,P(T \le t \mid X) = h_0(t) + \beta_0^T X$
- time-specific logistic regression model: $\mathrm{logit}\,P(T \le t \mid X) = h_0(t) + \beta_{0t}^T X$

30 Example: Breast Cancer Gene Expression Study
- Estimate the ROC curve, $\mathrm{ROC}_t(\cdot)$, for distinguishing $\{T \le t\}$ from $\{T > t\}$ by estimating $\mathrm{TPR}_t(c)$ and $\mathrm{FPR}_t(c)$ non-parametrically using inverse probability weighting.
- Summarize the overall accuracy of $\hat\beta_t^T X$ by estimating $\mathrm{AUC}_t = \int_0^1 \mathrm{ROC}_t(u)\,du$.

31 Example: Breast Cancer Gene Expression Study
Table: Estimated $\mathrm{AUC}_t$ (95% CI) at t = 2, 5 and 8 years after diagnosis using a 6-gene classifier with linear composite scores derived from different regression models.

                         t = 2 years      t = 5 years      t = 8 years
Cox                      .78 (.62, .87)   .84 (.78, .88)   .77 (.71, .84)
Proportional Odds        .78 (.59, .87)   .83 (.68, .88)   .77 (.65, .84)
Time-specific Logistic   .85 (.80, .91)   .84 (.80, .89)   .77 (.71, .84)


32 Example: Breast Cancer Gene Expression Study
Figure: ROC curves (sensitivity against 1 - specificity) at t = 2, 5 and 8 years for the composite score from (a) the time-specific logistic model and (b) the Cox model.

33 Survival Prediction with Multiple Markers
Estimating the Accuracy of the Composite Score: Bias Correction
When the sample size $n$ is not large relative to the number of markers, one may use cross-validation to obtain less biased accuracy estimators.
- Randomly split the data into $K$ disjoint sets of about equal size and label them $I_k$, $k = 1, \ldots, K$.
- For each $k$, obtain an estimate $\hat\beta_{(-k)}(t)$ of $\beta_0(t)$ based on all observations that are not in $I_k$, and then estimate the accuracy based on the data in $I_k$.
- A bias-corrected estimator of the accuracy measure may be obtained by averaging over the $K$ accuracy estimates.
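The cross-validation recipe can be written generically, with the model fit and the accuracy estimator passed in as functions. The sketch below is a bare-bones illustration: `cross_validated_accuracy`, `fit` and `evaluate` are hypothetical names, and the toy usage replaces the survival model and accuracy measure with a training mean and a mean absolute error purely so the example runs.

```python
import random


def cross_validated_accuracy(data, fit, evaluate, K=5, seed=0):
    """Average of K held-out accuracy estimates from a random K-fold split."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::K] for k in range(K)]       # K disjoint sets of about equal size
    estimates = []
    for held_out_idx in folds:
        held = set(held_out_idx)
        train = [data[i] for i in idx if i not in held]
        test = [data[i] for i in held_out_idx]
        model = fit(train)                      # plays the role of beta_hat_(-k)
        estimates.append(evaluate(model, test)) # accuracy on I_k only
    return sum(estimates) / K


# Toy usage (placeholders, not a survival model).
data = [0.2, 0.5, 0.9, 1.4, 2.2, 3.1, 3.3, 4.0, 4.8, 5.5]
fit = lambda train: sum(train) / len(train)
evaluate = lambda m, test: sum(abs(v - m) for v in test) / len(test)
print(cross_validated_accuracy(data, fit, evaluate, K=5))
```

In the setting of the slides, `fit` would return the coefficient estimate from the training folds and `evaluate` would compute, for example, an IPW estimate of the AUC on the held-out fold.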

34 Survival Prediction with Multiple Markers
Estimating the Accuracy of the Composite Score: Interval Estimation
In addition to obtaining a point estimator for the accuracy, it is crucial to assess the variability in the estimated accuracy measure. The variability may be assessed via procedures such as the bootstrap:
- Treat the observed data from $n$ subjects as $n$ units $\{D_1, \ldots, D_n\}$.
- Randomly sample $n$ units from $\{D_1, \ldots, D_n\}$ with replacement to obtain $\{D_1^*, \ldots, D_n^*\}$.
- Construct accuracy estimators based on each set of resampled data.
- Repeat the procedure $M_0$ times to obtain $M_0$ perturbed estimates of the accuracy.
- Construct interval estimates based on the empirical percentiles of the $M_0$ perturbed replications.
Other types of resampling methods, such as the wild bootstrap, have also been considered in the literature: Parzen et al (1994); Jin et al (2003); Cai et al (2005); Tian et al (2007).
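The bootstrap steps translate almost line by line into code. This is a minimal sketch of a percentile interval; the function name `bootstrap_percentile_ci`, the default of 1000 replications and the toy estimator (a sample mean) are assumptions made for illustration, and the estimator passed in is expected to cope with any resample it receives.

```python
import math
import random


def bootstrap_percentile_ci(units, estimator, M0=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for estimator(units)."""
    rng = random.Random(seed)
    n = len(units)
    reps = []
    for _ in range(M0):
        # Draw n units with replacement: {D_1*, ..., D_n*}.
        resample = [units[rng.randrange(n)] for _ in range(n)]
        reps.append(estimator(resample))
    reps.sort()
    # Empirical percentiles of the M0 perturbed estimates.
    lo = reps[int(math.floor((1 - level) / 2 * (M0 - 1)))]
    hi = reps[int(math.ceil((1 + level) / 2 * (M0 - 1)))]
    return lo, hi


# Toy usage (made up): interval for a sample mean.
data = [0.2, 0.5, 0.9, 1.4, 2.2, 3.1]
mean = lambda u: sum(u) / len(u)
print(bootstrap_percentile_ci(data, mean, M0=500))
```

For the accuracy measures in these slides, each unit would be one subject's full record (observed time, event indicator and markers), so that the model fit and the censoring weights are recomputed within every resample.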

35 Summary
- Classification accuracy measures such as the TPR, FPR and ROC can be extended to the setting with survival outcomes.
- Different types of time-dependent accuracy measures may be defined by defining the "diseased" and "non-diseased" populations at any given time $t$.
- To obtain estimators for the classification accuracy measures with survival outcomes, one needs to incorporate censoring appropriately.
- When multiple markers are available, various survival regression models may be used to construct composite scores for prediction. Such scores may be optimal with respect to certain accuracy measures when the imposed model holds.
- Bias correction and variance estimation should be considered when assessing the accuracy.
