
Journal of the American Statistical Association
Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/uasa20

Evaluating Prediction Rules for t-year Survivors With Censored Regression Models
Hajime Uno, Tianxi Cai, Lu Tian and L. J. Wei

Hajime Uno is Associate Professor, Division of Biostatistics, School of Pharmaceutical Sciences, Kitasato University, Tokyo, Japan 108-8641. Tianxi Cai is Associate Professor, Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115. Lu Tian is Assistant Professor, Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, Chicago, IL 60611. L. J. Wei is Professor, Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115. The authors are grateful to the associate editor, two referees, and the joint editor for insightful comments on the article. This work was supported in part by National Institutes of Health I2B2 and AIDS grants.

Version of record first published: 01 Jan 2012.

To cite this article: Hajime Uno, Tianxi Cai, Lu Tian and L. J. Wei (2007): Evaluating Prediction Rules for t-year Survivors With Censored Regression Models, Journal of the American Statistical Association, 102:478, 527-537

To link to this article: http://dx.doi.org/10.1198/016214507000000149

Evaluating Prediction Rules for t-year Survivors With Censored Regression Models

Hajime UNO, Tianxi CAI, Lu TIAN, and L. J. WEI

Suppose that we are interested in establishing simple but reliable rules for predicting future t-year survivors through censored regression models. In this article we present inference procedures for evaluating such binary classification rules based on various prediction precision measures quantified by the overall misclassification rate, sensitivity and specificity, and positive and negative predictive values. Specifically, under various working models, we derive consistent estimators for the above measures through substitution and cross-validation estimation procedures. Furthermore, we provide large-sample approximations to the distributions of these nonsmooth estimators without assuming that the working model is correctly specified. Confidence intervals, for example, for the difference of the precision measures between two competing rules can then be constructed. All of the proposals are illustrated with real examples, and their finite-sample properties are evaluated through a simulation study.

KEY WORDS: Cross-validation; Gene expression; Model selection; Positive and negative predictive values; Prediction error; Receiver operating characteristic curve; Survival analysis.

1. INTRODUCTION

Suppose that we are interested in establishing reliable and parsimonious classification rules for predicting future patients' survival based on the data collected from a current study. The data consist of a set of survival times, possibly censored, and their corresponding baseline covariates. For example, in a recent breast cancer study, van de Vijver et al. (2002) developed prognostic rules based on microarray gene expression data from 295 patients. For each patient, we have survival data, baseline lymph node status, estrogen receptor status, and a gene signature score (http://www.rii.com/publications).
This univariate, continuous gene score was derived from gene expression data based on 70 selected genes through a supervised classification algorithm for predicting distant metastases within 5 years after the patient's surgery. It would be interesting to use such gene signatures, in conjunction with clinical markers, to construct a prediction rule for survival that can accurately identify future high-risk patients. Moreover, it is important to measure the added predictive value that gene signatures contribute on top of routinely available clinical markers. To predict covariate-specific survival, we fit the data with a regression model (Cox 1972; Wei 1992; Cheng, Wei, and Ying 1995; Kalbfleisch and Prentice 2002). With this fitted model, we can estimate the survival function for a future subject using his covariate information and then predict, for example, whether the patient would survive more than t years. Often the aforementioned survival models assume that the covariate effects on the patient's survival or hazard function are constant over the entire follow-up study period. This modeling assumption may be reasonable for a global assessment of the covariate effects on survival. From a prediction point of view, however, a good classification rule for predicting short-term survivors may perform poorly in predicting long-term survivors. In this article we address this issue through a rather simple time-varying binary regression modeling approach.

Hajime Uno is Associate Professor, Division of Biostatistics, School of Pharmaceutical Sciences, Kitasato University, Tokyo, Japan 108-8641 (E-mail: unoh@pharm.kitasato-u.ac.jp). Tianxi Cai is Associate Professor, Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115 (E-mail: tcai@hsph.harvard.edu). Lu Tian is Assistant Professor, Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, Chicago, IL 60611 (E-mail: lutian@northwestern.edu). L. J. Wei is Professor, Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115 (E-mail: wei@sdac.harvard.edu). The authors are grateful to the associate editor, two referees, and the joint editor for insightful comments on the article. This work was supported in part by National Institutes of Health I2B2 and AIDS grants.

When there is no censoring, standard methods for binary outcomes proposed by Breiman, Friedman, Olshen, and Stone (1984), McLachlan (1992), and Ripley (1996) may be used to construct prediction rules. To evaluate a classifier, various prediction precision measures, which quantify the discordance or concordance between the observed and the predicted outcomes, have been used by, for example, Spiegelhalter (1986), Korn and Simon (1990), Mittlböck and Schemper (1996), Zhou, Obuchowski, and McClish (2002), and Pepe (2003). Specifically, the overall misclassification rate (OMR) is a useful simple summary measure, but it may not be appropriate when the costs associated with false positive and false negative errors are not equal. The sensitivity (SE) and the specificity (SP) describe the capacity of a prediction rule for distinguishing subjects who will fail by time t from those who will not. These measures are helpful in the process of developing a promising classification rule. On the other hand, the positive predictive value (PPV) and negative predictive value (NPV) have direct clinical utility. An excellent discussion of these topics has been given by Pepe (2003, chap. 2). In the presence of censoring, especially when the support of the censoring variable is shorter than that of its survival counterpart, very few methods are available for constructing and evaluating t-year survivor prediction rules.
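For uncensored binary outcomes, the five precision measures discussed above reduce to simple counts. A minimal sketch (the function name and the example data are ours, purely for illustration):

```python
def precision_measures(event, pred):
    """event[i] = 1 if subject i fails by time t; pred[i] = 1 if the rule
    predicts failure by time t. Returns OMR, SE, SP, PPV, and NPV."""
    tp = sum(e == 1 and p == 1 for e, p in zip(event, pred))  # true positives
    tn = sum(e == 0 and p == 0 for e, p in zip(event, pred))  # true negatives
    fp = sum(e == 0 and p == 1 for e, p in zip(event, pred))  # false positives
    fn = sum(e == 1 and p == 0 for e, p in zip(event, pred))  # false negatives
    return {
        "OMR": (fp + fn) / len(event),  # overall misclassification rate
        "SE": tp / (tp + fn),           # sensitivity
        "SP": tn / (tn + fp),           # specificity
        "PPV": tp / (tp + fp),          # positive predictive value
        "NPV": tn / (tn + fn),          # negative predictive value
    }
```

With censored data these naive fractions are biased, which is exactly what the weighting scheme of Section 2 corrects.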
For the case of a univariate covariate, Heagerty, Lumley, and Pepe (2000) proposed nonparametric estimators for the SE and SP, and Moskowitz and Pepe (2004) considered marginal regression models for comparing the PPV and NPV of two competing prediction rules. When multiple covariates are involved, Heagerty and Zheng (2005) developed a prediction rule through a proportional hazards model with time-varying coefficients. Recently Zheng, Cai, and Feng (2006) proposed a prediction rule based on a time-varying logistic regression model and evaluated its overall accuracy through simulation. Note that all of the aforementioned procedures were derived under the assumption that the working model is correctly specified. Moreover, there are no theoretically justified methods for constructing interval estimates of the prediction precision measures when more than one covariate is available.

© 2007 American Statistical Association, Journal of the American Statistical Association, June 2007, Vol. 102, No. 478, Theory and Methods, DOI 10.1198/016214507000000149

In this article we propose classification rules for predicting t-year survival based on a class of simple working models that relate the covariates only to the patient's t-year survival probability. Under the assumption that the censoring distribution of the current study is independent of the covariates or can be modeled reasonably well, for each prediction rule we show how to consistently estimate its OMR, SE, SP, PPV, and NPV. Note that most existing estimation procedures for the commonly used survival models may not be able to provide such consistent estimators when the model is incorrectly specified (O'Quigley and Xu 2001). Besides providing point estimates for the prediction precision measures, we also derive the large-sample distribution of the proposed estimators. Furthermore, because these estimators are not smooth, we use a perturbation-resampling technique to approximate their distributions without involving any nonparametric density-like function estimates. Based on these large-sample approximations, confidence intervals for the OMR, SE, SP, PPV, and NPV, or functions thereof, can be constructed accordingly, which provide much more information than their point estimate counterparts for evaluating regression models and their resulting prediction rules.

If the same dataset is used to construct the prediction rules and to evaluate their performance, then the above substitution or apparent error estimates may be biased (Efron 1983, 1986), especially when the sample size is not large with respect to the number of the covariates in the model.
To reduce the potential bias of the apparent error, such methods as cross-validation, the bootstrap, and covariance penalties have been proposed for certain regression models with noncensored data (Mallows 1973; Efron 1986, 2004; Shao 1996; Efron and Tibshirani 1997; Ye 1998; Tibshirani and Knight 1999). In this article we also study properties of bias-corrected estimators for the OMR, SE, SP, PPV, and NPV through various cross-validation schemes. Lastly, we provide interval estimates for the difference of the prediction precision measures between two competing classification rules or models. Note that our procedures can easily be generalized to the case where we are interested in making joint inferences about the performance of prediction rules for a set of time points t. All of the proposals are illustrated and evaluated through two examples and a simulation study.

2. EVALUATING PREDICTION RULES FOR t-YEAR SURVIVORS BASED ON OVERALL MISCLASSIFICATION RATE

2.1 Consistent Estimators for the Overall Misclassification Rate

Let T be a continuous failure time and Z be a set of bounded potential predictors. Moreover, let C be the corresponding censoring variable. Assume that T and C are independent and that the survival function G(·) of C is free of Z. Let {(T_i, Z_i, C_i), i = 1, ..., n} be n independent copies of (T, Z, C). For the ith subject, we only observe (X_i, Z_i, Δ_i), where X_i = min(T_i, C_i), Δ_i = I(X_i = T_i), and I(·) is the indicator function. Suppose that based on the data {(X_i, Z_i, Δ_i), i = 1, ..., n}, we are interested in establishing a rule that can accurately predict whether or not the survival time T_0 of a future subject with Z = Z_0 is shorter than t years, where t is a prespecified time point and Pr(X > t) > 0.
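The observables (X_i, Z_i, Δ_i), together with the censoring weight w_i = I(X_i ≤ t)Δ_i + I(X_i > t) that appears later in this section, can be sketched as follows. This is a simulation under assumed exponential failure and censoring rates, not the paper's data, and the helper name is ours:

```python
import numpy as np

# Assumptions: exponential rates and the time point t0 are illustrative only.
rng = np.random.default_rng(0)
n, t0 = 200, 1.0
T = rng.exponential(2.0, n)           # latent failure times T_i
C = rng.exponential(4.0, n)           # independent censoring times C_i
X = np.minimum(T, C)                  # observed X_i = min(T_i, C_i)
delta = (T <= C).astype(int)          # Δ_i = I(X_i = T_i)

def km_censoring(X, delta, s):
    """Kaplan-Meier estimate Ĝ(s) of the censoring survival function,
    with censoring (Δ_i = 0) playing the role of the event."""
    G = 1.0
    for u in np.sort(np.unique(X[delta == 0])):
        if u > s:
            break
        d = np.sum((X == u) & (delta == 0))  # censorings at time u
        r = np.sum(X >= u)                   # number at risk just before u
        G *= 1.0 - d / r
    return G

# w_i = I(X_i <= t0)Δ_i + I(X_i > t0): zero only if censored before t0.
w = np.where(X <= t0, delta, 1)
# Inverse-probability-of-censoring weights w_i / Ĝ(X_i ∧ t0).
ipcw = w / np.array([km_censoring(X, delta, min(x, t0)) for x in X])
```

Subjects censored before t0 get weight zero, and everyone else is up-weighted by 1/Ĝ to compensate, which is what makes the estimating function below free of the censoring distribution.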
To this end, let Z, a function of the original predictors, be a p-dimensional vector with the first component being 1, and consider the following working model:

Pr(T ≤ t | Z) = g(β′Z),  (1)

where g(·) is a known, strictly increasing, differentiable function and β is a p-dimensional vector of unknown parameters. Note that if we assume model (1) for all t > 0 and let only the first component of β depend on t, then (1) is called the linear transformation model (Cheng, Wei, and Ying 1995). In particular, if g(·) is 1 − exp{−exp(·)}, then (1) is the proportional hazards model. On the other hand, if g(·) is the anti-logit function, then (1) is the so-called proportional odds model. In this article, for each time point t of interest, we let g(·) and all of the components of β in (1) depend on t. With this more flexible modeling, we may find, for example, that a good prediction rule for short-term survivors may be quite different from that for long-term survivors.

Suppose that β in (1) is estimated by β̂ based on the data {(X_i, Z_i, Δ_i)}. For a future subject with a covariate vector Z = Z_0, consider a class of binary prediction rules indexed by c: I(g(β̂′Z_0) > c), where 0 ≤ c ≤ 1. For example, if c = 0.5 and g(β̂′Z_0) > 0.5, then we predict that this subject would die by time t. To evaluate this class of prediction rules, consider the OMR,

D_n(c) = E|I(T_0 ≤ t) − I(g(β̂′Z_0) > c)|,

where the expectation is taken over {(X_i, Z_i, Δ_i)} and (T_0, Z_0). Suppose that as n → ∞, β̂ converges to a constant vector β̄, which is free of G(·), and D_n(c) goes to

D(c) = E|I(T_0 ≤ t) − I(g(β̄′Z_0) > c)|.  (2)

Now let c_0 be a minimizer of D(c) for 0 ≤ c ≤ 1, and let D(c_0) = D_0, which does not depend on the nuisance censoring distribution. To evaluate the adequacy of model (1) as a prediction tool, we need to estimate D_0 and c_0.

When a working survival model is not correctly specified, it is not clear that the existing estimator of the vector of regression parameters would converge to a constant vector as n → ∞.
Moreover, even when the estimator is stabilized for large n, its limit may depend on the distribution of the nuisance censoring variable C. Consequently, the corresponding D_n(c) converges to a quantity that may also depend on the censoring and may not be a meaningful criterion for evaluating prediction rules. Here we propose a simple estimator β̂ for β in the working model (1), which converges to a constant vector β̄ that is free of the censoring distribution. Our estimator β̂ is based on the estimating function (Zheng et al. 2006)

U(β) = n^{-1} Σ_i {w_i / Ĝ(X_i ∧ t)} Z_i {I(X_i ≤ t) − g(β′Z_i)},  (3)

where w_i = I(T_i ∧ t ≤ C_i) = I(X_i ≤ t)Δ_i + I(X_i > t) and Ĝ(·) is the Kaplan-Meier estimator of G(·). Note that if the ith subject is censored before time t, then w_i = 0. On the other hand, such censored observations are included in the construction of Ĝ(·). Note that conditional on (T_i, Z_i), the expected value of w_i{G(X_i ∧ t)}^{-1} is 1. Therefore, conditional on {(T_i, Z_i)}, asymptotically the expected value of U(β) is u(β) = E[Z{I(T ≤ t) − g(β′Z)}], which is free of the censoring variable C. Under the rather mild condition that there does not exist a β such that Pr(β′Z_1 > β′Z_2 | T_1 ≤ t < T_2) = 1, using a similar argument given by Tian, Cai, Goetghebeur, and Wei (2007, app. A), we can show that u(β) = 0 has a unique solution, say, β̄. Moreover, if there does not exist a β such that Δ_i I(X_i ≤ t ∧ X_j) = 1 implies I(β′Z_i ≥ β′Z_j) = 1 for any pair 1 ≤ i ≠ j ≤ n, then U(β) = 0 has a unique solution β̂ for any finite n. Because Ĝ(s) converges uniformly to G(s) for s ≤ t, it follows from the uniform law of large numbers (Pollard 1990, p. 41) that U(β) converges uniformly to u(β) in probability in a neighborhood of β̄. This implies that β̂ converges to β̄ in probability as n → ∞ even when model (1) is not correctly specified.

Now, to estimate D(c), first consider the so-called apparent error (Davison and Hinkley 1997, p. 292),

D̂(c) = n^{-1} Σ_i {w_i / Ĝ(X_i ∧ t)} |I(X_i ≤ t) − I(g(β̂′Z_i) > c)|.  (4)

Let ĉ be a minimizer of D̂(c) for 0 ≤ c ≤ 1. In Appendix A, under the mild condition that Pr(T ≤ t | β̄′Z = y) is strictly increasing in y in the support of β̄′Z, we show that ĉ and D̂(ĉ) are consistent for c_0 and D_0. Note that we can check this condition empirically by estimating Pr(T ≤ t | β̄′Z = y) through a nonparametric function estimate based on the data {(X_i, Δ_i, β̂′Z_i), i = 1, ..., n}.

2.2 Large-Sample Approximations to the Distribution of D̂(ĉ)

To make further inferences about D_0, consider a standardized transformation of D̂(ĉ),

n^{1/2} {log(−log D̂(ĉ)) − log(−log D_0)},  (5)

which is asymptotically equivalent to {D̂(ĉ) log(D̂(ĉ))}^{-1} W, where W = n^{1/2}{D̂(ĉ) − D_0}.
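As a concrete, deliberately simplified illustration of solving U(β) = 0 and then minimizing the apparent error over c: the sketch below takes g to be the anti-logit and sets the weights ω_i = w_i/Ĝ(X_i ∧ t) to 1 (i.e., it assumes no censoring before t). The simulated data and all function names are ours, not the paper's software:

```python
import numpy as np

def expit(u):
    # anti-logit link: the "proportional odds" choice of g in model (1)
    return 1.0 / (1.0 + np.exp(-u))

def solve_score(Z, y, omega, iters=25):
    """Newton iteration for U(β) = n^{-1} Σ_i ω_i Z_i {y_i - g(β'Z_i)} = 0,
    which for g = expit is a weighted logistic-regression score."""
    beta = np.zeros(Z.shape[1])
    for _ in range(iters):
        p = expit(Z @ beta)
        score = Z.T @ (omega * (y - p)) / len(y)
        info = (Z * (omega * p * (1.0 - p))[:, None]).T @ Z / len(y)
        beta += np.linalg.solve(info, score)
    return beta

def apparent_error(Z, y, omega, beta, c):
    """Weighted misclassification rate, the analogue of D̂(c), at cutoff c."""
    pred = expit(Z @ beta) > c
    return float(np.mean(omega * np.abs(y - pred)))

rng = np.random.default_rng(3)
n = 2000
z = rng.normal(0.0, 1.0, n)
Z = np.column_stack([np.ones(n), z])     # first component of Z is 1
beta_true = np.array([0.5, 1.0])
y = (rng.random(n) < expit(Z @ beta_true)).astype(float)  # y_i = I(T_i <= t)
omega = np.ones(n)                       # assumption: no censoring before t

beta_hat = solve_score(Z, y, omega)
grid = np.linspace(0.01, 0.99, 99)
c_hat = grid[np.argmin([apparent_error(Z, y, omega, beta_hat, c) for c in grid])]
```

With censored data one would plug in the Kaplan-Meier-based weights instead of ones; the Newton iteration is unchanged because U(β) keeps the same weighted-score form.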
In Appendix B we show that W is asymptotically equivalent to n^{1/2}{D̂(c_0) − D_0} and converges in distribution to a normal with mean 0. However, directly estimating the variance of W, which involves unknown density-like functions, is difficult. A perturbation-resampling method may be used to obtain a good approximation to the distribution of W. To be specific, let {V_i, i = 1, ..., n} be n independent copies of a positive random variable V from a known distribution with mean 1 and variance 1. Let D*(c) be a perturbed version of D̂(c), where

D*(c) = n^{-1} Σ_i {w_i / G*(X_i ∧ t)} |I(X_i ≤ t) − I(g(β*′Z_i) > c)| V_i  (6)

and G*(·) and β* are the corresponding perturbed versions of Ĝ(·) and β̂. To construct G*(·), we use the martingale representation formula for the Kaplan-Meier estimate (Fleming and Harrington 1991, p. 98). Specifically, for t > 0, the unconditional distribution of Ĝ(t) − G(t) can be approximated by the conditional distribution (given the data) of

−Ĝ(t) Σ_i V_i ∫_0^t {Σ_j I(X_j > s)}^{-1} dM̂_i(s),

where M̂_i(t) = I(X_i ≤ t, Δ_i = 0) − ∫_0^t I(X_i > s) dΛ̂(s) and Λ̂(·) is the standard Nelson-Aalen estimator of the cumulative hazard function for the censoring variable C. It follows that

G*(t) = Ĝ(t) − Ĝ(t) Σ_i V_i ∫_0^t {Σ_j I(X_j > s)}^{-1} dM̂_i(s).  (7)

To obtain a perturbed β*, we solve the equation

U*(β) = n^{-1} Σ_i {w_i / G*(X_i ∧ t)} Z_i {I(X_i ≤ t) − g(β′Z_i)} V_i = 0.  (8)

Note that because U(β) is a differentiable function in β, an alternative way to obtain (β* − β̂) is by perturbing the first-order expansion of n^{1/2}(β̂ − β̄). It follows from similar arguments given by Park and Wei (2003) or Cai, Tian, and Wei (2005) that the distribution of (5) can be approximated by the conditional distribution of {D̂(ĉ) log(D̂(ĉ))}^{-1} W*, given the data, where W* = n^{1/2}{D*(ĉ) − D̂(ĉ)}. In practice, we can generate a large number M of random samples W* to approximate the distribution of W. Confidence interval estimates of D_0 can then be constructed accordingly through (5).
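The perturbation idea can be seen in miniature on a plain sample mean, our toy stand-in for D̂ (the paper applies it to the full functional): multiply the i.i.d. terms by unit-exponential weights V_i, which have mean 1 and variance 1, recompute the statistic M times, and read off the spread, with no density estimation anywhere.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 400)   # i.i.d. terms of a toy estimator (assumption)
est = x.mean()                  # the "point estimate", analogous to D̂(ĉ)

M = 2000
V = rng.exponential(1.0, (M, len(x)))  # unit exponential: mean 1, variance 1
perturbed = (V * x).mean(axis=1)       # M perturbed copies of the estimator

se_perturbation = perturbed.std(ddof=1)          # resampling-based SE
se_classical = x.std(ddof=1) / np.sqrt(len(x))   # textbook SE of the mean
```

Here n^{1/2}(perturbed − est) plays the role of W* approximating the distribution of W, and the two standard errors agree closely.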
Note that if we let (V_1, ..., V_n) be a multinomial random vector with size n and cell probabilities n^{-1}, then the above resampling method is similar to standard bootstrapping (Efron 1982). However, it is not clear how to justify the large-sample approximation to the distribution of W using perturbation with such dependent V's.

2.3 Large-Sample Properties of Cross-Validated Counterparts of D̂(ĉ)

When the sample size n is not large with respect to the dimension of the covariate vector Z, we can use cross-validation methods to estimate the prediction error D_0. To this end, we first consider the commonly used K-fold cross-validation, which randomly splits the data into K disjoint sets of about equal size, labeled I_k, k = 1, ..., K. For each k, an estimate β̂^{(−k)} for β is obtained through (3) based on all observations that are not in I_k. We then compute the prediction error estimate D̂^{(k)}(c) through (4) based on the observations in I_k. An average prediction error estimate for D(c) is then

D̃(c) = K^{-1} Σ_{k=1}^{K} D̂^{(k)}(c).  (9)
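The K-fold mechanics behind (9) can be sketched with a deliberately trivial classifier (predict the training folds' majority class); the toy data and rule are illustrative assumptions, not the estimator of Section 2:

```python
import numpy as np

rng = np.random.default_rng(2)
n, K = 120, 10
y = rng.integers(0, 2, n)                      # toy binary outcomes I(T_i <= t)
folds = np.array_split(rng.permutation(n), K)  # disjoint index sets I_1,...,I_K

errs = []
for k in range(K):
    held_out = folds[k]
    train = np.setdiff1d(np.arange(n), held_out)
    rule = int(y[train].mean() > 0.5)          # "fit" using everything but I_k
    errs.append(np.mean(y[held_out] != rule))  # per-fold error, like D̂^{(k)}(c)
D_tilde = sum(errs) / K                        # eq. (9): K^{-1} Σ_k D̂^{(k)}(c)
```

Each observation is used exactly once for validation and K − 1 times for fitting, so the averaged error is computed on data not used to build the rule being scored.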

Let ĉ_v be a minimizer of D̃(c) for 0 ≤ c ≤ 1. For any fixed small integer K, it is straightforward to show that ĉ_v and D̃(ĉ_v) are consistent for c_0 and D_0. Moreover, in Appendix C, we show that the standardized D̃(ĉ_v),

W̃ = n^{1/2}{D̃(ĉ_v) − D_0},

has the same limiting distribution as that of W based on the apparent error. Therefore, one may use the standard error estimate of the apparent error to construct interval estimates for D_0 that are centered around the cross-validation estimate.

For a general cross-validation, let n_t and n_v be the sizes of the training and validation subsamples, where n/n_v is roughly a fixed positive integer and n_t → ∞, n_v → ∞ as n → ∞. We randomly choose a training set to obtain an estimate for β through (3), then compute D̂(c) in (4) with the validation set. We repeat this process by taking a fresh random training-validation partition. Let D̄(c) be the average of all such D̂(c) over the entire set of possible random splits of the training-validation subsamples, and let ĉ_rv be a minimizer of D̄(c). In Appendix C we show that the distribution of n^{1/2}{D̄(ĉ_rv) − D_0} is the same as that of W in the limit, and thus it can be approximated well by that of W*.

2.4 Comparisons Between Two Prediction Rules

Now suppose that we are interested in comparing two working models (1) with possibly different covariate vectors, say Z^{(l)}, l = 1, 2. To this end, all of the above notation is subindexed by l, l = 1, 2. For example, for model l with the optimal cutoff point c_0 = c_{0l}, the link function in model (1) is g_l(·). Let τ_0 = D_2(c_{02}) − D_1(c_{01}) and τ̂ = D̂_2(ĉ_2) − D̂_1(ĉ_1). Then the distribution of W_τ = n^{1/2}(τ̂ − τ_0) is approximately normal with mean 0. Now let τ* = D*_2(ĉ_2) − D*_1(ĉ_1). Note that for D*_1(·) and D*_2(·), we need to use the same set of perturbation variables {V_i, i = 1, ..., n} in (6), (7), and (8).
Then the distribution of W_τ can be well approximated by the conditional distribution of W*_τ = n^{1/2}(τ* − τ̂). Confidence intervals for τ_0 can then be constructed through this approximation. For the aforementioned K-fold and random cross-validation schemes, we can construct the corresponding estimates D̃_l(ĉ_v) and D̄_l(ĉ_rv), l = 1, 2, to make inferences about τ_0.

3. EXAMPLES

3.1 Mayo Data

Here we use two examples to illustrate our proposals. The first example is from the well-known Mayo primary biliary cirrhosis study (Fleming and Harrington 1991, app. D). The dataset used here consists of 418 patient records, each of which contains the survival data and 17 potential prognostic factors. To simplify the illustration, we consider only five covariates: age, log(albumin), log(bilirubin), edema, and log(protime), which were selected as the most important predictors based on a Cox regression model (Dickson, Fleming, Grambsch, Fisher, and Langworthy 1989; Fleming and Harrington 1991, p. 195). Suppose that we are interested in establishing prediction rules for 10-year survivors based on these five covariates.

First, we consider two different models (1) with g(y) = 1 − exp{−exp(y)} to fit the data. The first model uses age only, and the second takes the above five covariates additively. With apparent errors D̂(c), for all cases studied here, ĉ ≈ 0.5. D̂(ĉ) and the corresponding standard error estimates are reported in Table 1. All standard error estimates are constructed based on M = 2,000 sets of {V_i}, where V is the unit exponential. The table also reports point estimates based on the 10-fold and random cross-validation procedures. For the random cross-validation procedure, we let the training set size be 2n/3 for each of 200 iterations. For model II, the apparent error estimate appears noticeably small compared with its random cross-validation counterpart.
From Table 1, based on the random cross-validation point estimates, 95% intervals for the misclassification rate for models I and II are (0.24, 0.44) and (0.14, 0.31). Table 1 also reports 95%

Table 1. Comparing Various Model-Based Prediction Rules for Several Time Points With Mayo Biliary Cirrhosis Data

Year  Model(a)  Apparent error D̂(ĉ) (SE)  10-fold CV D̂(ĉ_v)  Random CV D̂(ĉ_rv)  CI for difference(b)  Cox
 2    I         .12 (.016)                .12                 .12
      II        .08 (.014)                .09                 .09                 (.00, .05)            .12
      III       .10 (.015)                .11                 .11                 (-.04, .01)           .11
      IV        .11 (.016)                .12                 .12                 (-.03, .01)           .12
 5    I         .27 (.025)                .29                 .28
      II        .14 (.021)                .15                 .16                 (.07, .17)            .28
      III       .14 (.020)                .14                 .15                 (-.01, .03)           .18
      IV        .14 (.020)                .15                 .15                 (-.03, .03)           .17
 8    I         .39 (.034)                .41                 .41
      II        .20 (.034)                .23                 .24                 (.10, .25)            .37
      III       .20 (.035)                .24                 .24                 (-.04, .04)           .26
      IV        .20 (.033)                .21                 .22                 (-.02, .05)           .24
10    I         .30 (.050)                .30                 .34
      II        .16 (.042)                .18                 .22                 (.03, .21)            .36
      III       .16 (.043)                .18                 .21                 (-.03, .05)           .24
      IV        .17 (.038)                .18                 .21                 (-.07, .06)           .22

NOTE: SE, estimated standard error.
(a) Model I: g(intercept + age); model II: g(intercept + age + log(bilirubin) + log(albumin) + edema + log(protime)); model III: g(intercept + age + log(bilirubin) + log(albumin)); model IV: g(intercept + age + log(bilirubin)), where g(y) = 1 − exp{−exp(y)}.
(b) 95% confidence interval for the difference of the OMRs of two competing models: models I and II; models II and III; models III and IV.

Uno et al.: Predicting t-year Survivors With Censored Regression Models 531 confidence intervals for the difference of error rates between two fitted models. For example, the interval estimate for the difference of two rates D, model I minus model II, is (.3,.21), indicating that model II, which includes clinical biomarkers, is better than model I with respect to the 1-year survival prediction. On the other hand, the degree of improvement ranges from 3% to 21%, reflecting rather large sampling variation. It is interesting to note that for model II, unlike the results from the standard Cox model fitting, edema and log(protime) are not statistically significant; the p values for testing no covariate effect are.37 and.23. To explore whether these two clinical markers are needed for prediction, we fit the data with model III, which consists of three covariates: age, log(bilirubin), and log(albumin). The resulting point and the standard error estimates are reported in Table 1. The 95% interval estimate for the difference of the error rates between models III and II is (.3,.5), indicating that edema and protime have no added value over the other three covariates for predicting 1- year survivors with respect to the OMR. It is also interesting to demonstrate that a statistically significant covariate may not add any substantial value for prediction (Pepe, Janes, Longton, Leisenring, and Newcomb 24). To this end, we create model IV by deleting a highly, statistically significant covariate, log(albumin), from model III. The point and standard error estimates for this model are reported in Table 1. The 95% confidence interval for the difference of the OMR between models IV and III is (.7,.6), indicating that age and bilirubin appears to be sufficient for predicting the 1-year survivors with respect to the OMR. 
To examine how the prediction changes over time, we also construct classification rules for predicting 2-, 5-, and 8-year survivors based on the aforementioned procedures. As shown in Table 1, the classification appears to be more accurate, with respect to the OMR, for identifying short-term survivors than long-term survivors. For example, for model II with random cross-validation, the estimated OMRs are .09, .16, .24, and .22 for predicting 2-, 5-, 8-, and 10-year survivors. For comparison, we also obtain classification rules by fitting the data with a Cox proportional hazards model and use the resulting linear risk score for prediction. As shown in Table 1, the estimated OMRs of the prediction rules based on the Cox model appear to be higher than those obtained through model (1). For example, for 8-year survivors, the estimated OMR using the Cox model with all five covariates is about 50% higher than that using model (1).

3.2 Gene Expression Breast Cancer Data

The second example is from the breast cancer study described in Section 1. van de Vijver et al. (2002) proposed a binary prediction rule based solely on the gene score. For illustration, suppose that we are interested in predicting 10-year survivors and consider three models (1). The first model does not use any covariate, the second uses two clinical markers, and the third uses the clinical markers along with the gene score. Again, for all cases studied here, ĉ ≈ .5. The 10-year survival rate for this study is approximately .7. Model I essentially produces a rule predicting that all future patients would survive beyond 10 years. The error rate for this naive rule is .3. Table 2 presents the apparent errors D̂(ĉ) and the corresponding standard error estimates for D̂(ĉ).

Table 2. Comparing Various Model-Based Prediction Rules for Several Time Points With the Breast Cancer Data

Year  Model(a)          Apparent error D̂(ĉ) (SE)  10-fold CV D̂(ĉ_v)  Random CV D̂(ĉ_rv)
 2    I                 .04 (.011)                 .04                 .04
      II                .04 (.011)                 .04                 .04
      III               .04 (.011)                 .04                 .04
      van de Vijver(b)  .58 (.012)
 5    I                 .17 (.022)                 .17                 .16
      II                .17 (.022)                 .17                 .16
      III               .17 (.022)                 .17                 .17
      van de Vijver(b)  .46 (.028)
 8    I                 .25 (.028)                 .25                 .25
      II                .25 (.028)                 .25                 .25
      III               .24 (.031)                 .25                 .26
      van de Vijver(b)  .38 (.041)
10    I                 .30 (.031)                 .29                 .30
      II                .28 (.033)                 .30                 .28
      III               .25 (.036)                 .27                 .28
      van de Vijver(b)  .35 (.050)

NOTE: SE, estimated standard error. (a) Model I: g(intercept); model II: g(intercept + node + ER); model III: g(intercept + node + ER + gene), where g(y) = 1 − exp{−exp(y)}. (b) Based on the classification rule in van de Vijver et al. (2002).

In the present case, the estimates based on cross-validation are quite similar to the apparent errors. With respect to the OMR, it is interesting to note that the prediction rules, which use the baseline clinical or gene expression information, do not perform better than the aforementioned naive rule (model I). In fact, the error rate for the rule of van de Vijver et al. (2002) is 35%, higher than the 10-year mortality rate of 30%. For the present case, in which accurately identifying future breast cancer patients who would likely die within 10 years after surgery is critical, the OMR may not be a good criterion for evaluating prediction rules. We discuss other evaluation criteria in Section 5. Note that for both examples, for all models considered here, the nonparametric function estimates for Pr(T ≤ 10 | β̂′Z) appear to be monotone in β̂′Z.

4. SIMULATION STUDIES

To examine finite-sample properties of the proposed estimation procedures based on D̂(ĉ), D̂(ĉ_v), and D̂(ĉ_rv), we conducted a simulation study. Specifically, we mimicked the Mayo study to generate realizations of T, Z, and C.
Here Z = (1, Z_cov′)′ and Z_cov is a multivariate normal whose mean and covariance matrix are based on the 282 completely observed vectors consisting of age, log(bilirubin), log(albumin), log(sgot), log(protime), log(cholesterol), and log(copper) from the Mayo study. Now let Z* = (age, log(bilirubin), log(albumin))′. For each realized Z from the above normal, the survival time T is generated through an exponential with scale parameter exp(b₀′Z*), where b₀ is estimated from the 282 observed censored failure times in the Mayo study and their corresponding covariate vectors of age, log(bilirubin), and log(albumin) with this exponential model. Lastly, the censoring time C is generated from the estimated censoring distribution for the Mayo study using the Kaplan–Meier method. This configuration results in about 60% censoring. In our numerical study, we consider six working models (1). For model I, we let g(·) be the inverse function of

log{−log(1 − ·)} and Z = Z*. Note that model I is the correct model for Pr(T ≤ t | Z*). For model II, we delete the covariate log(albumin) from the above Z*. For model III, we let Z = (1, age, bilirubin, albumin)′, a setting in which incorrect transformations of covariates are used. For model IV, we consider a setting in which the link function is incorrect; that is, we let g(·) be the anti-logit function with Z = Z*. In model V we let g(·) be the same incorrect link and also omit the log-transformation of bilirubin and albumin. Model VI is an overfitting model; that is, we keep the correct link function but let the fitted model include two unnecessary covariates in addition to Z*. For each working model, we generate 10,000 realizations of (T, Z) to obtain its model-specific β₀ and use another fresh 10,000 realizations to estimate the true OMR D₀. We then generate 2,000 sets of realizations {(T_i, C_i, Z_i), i = 1, ..., n} from the aforementioned true model and obtain 2,000 sets of realized D̂(ĉ), D̂(ĉ_v), and D̂(ĉ_rv) with n_t = 2n/3. Note that for the random cross-validation, we use 100 random splits of the sample. Based on these realized point estimates, we obtain the average bias and root mean squared error (RMSE). Furthermore, we obtain 2,000 standard error estimates, each of which is based on 2,000 perturbed W*'s. Then, for each of the three types of point estimates, we construct 2,000 95% confidence intervals for D₀. Table 3 reports the results with n = 300 and t = 10 years under the heading "Observed censoring." Cross-validation indeed reduces the bias of the apparent error. However, the bias of the apparent error seems rather small compared with the true error D₀ for each working model. Moreover, the three estimation procedures are similar with respect to RMSE. On the other hand, the empirical coverage level of the confidence interval centered about the apparent error tends to be lower than its nominal counterpart.
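The data-generating scheme above can be sketched as follows (Python; `simulate_mayo_like` is a hypothetical name, and the coefficient vector, normal parameters, and exponential censoring distribution used below are placeholders rather than the Mayo-based estimates in the paper):

```python
import numpy as np

rng = np.random.default_rng(2024)

def simulate_mayo_like(n, b, mean, cov, cens_scale):
    """Generate (X, delta, Z): Z is multivariate normal covariates with an
    intercept prepended, T | Z is exponential with scale exp(b'Z), and C is
    an independent exponential stand-in for the Kaplan-Meier-based
    censoring distribution used in the paper."""
    Zcov = rng.multivariate_normal(mean, cov, size=n)
    Z = np.column_stack([np.ones(n), Zcov])
    T = rng.exponential(scale=np.exp(Z @ b))
    C = rng.exponential(scale=cens_scale, size=n)
    X = np.minimum(T, C)
    delta = (T <= C).astype(int)     # 1 = failure observed, 0 = censored
    return X, delta, Z
```

In a study mimicking the paper's design, `cens_scale` would be tuned so that roughly 60% of the observations are censored.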
Table 3 also reports results for the case in which no censoring is involved. Again, with respect to coverage probability, the interval estimate centered about the cross-validation point estimate appears to be better than its apparent error counterpart.

Table 3. Empirical Bias, Root Mean Squared Error (RMSE), and Coverage Probability Based on Apparent Error (AE), 10-Fold Cross-Validation (CV_10), and Random Cross-Validation (CV_1/3) With Sample Size 300 and t = 10 Years

                    Observed censoring           No censoring
          Model    AE      CV_10   CV_1/3      AE      CV_10   CV_1/3
Bias      I       −.038   −.016   −.004       −.015   −.007   −.002
          II      −.036   −.021   −.003       −.014   −.009   −.001
          III     −.039   −.016   −.006       −.015   −.007   −.003
          IV      −.038   −.016   −.005       −.015   −.007   −.002
          V       −.039   −.015   −.006       −.015   −.007   −.003
          VI      −.053   −.030   −.025       −.020   −.002   −.001
RMSE      I        .054    .044    .042        .027    .025    .024
          II       .053    .045    .041        .027    .025    .024
          III      .055    .045    .043        .028    .025    .025
          IV       .054    .044    .042        .027    .025    .024
          V        .054    .045    .043        .028    .025    .024
          VI       .065    .044    .050        .030    .024    .026
Coverage  I        .887    .939    .962        .926    .949    .958
          II       .912    .944    .968        .932    .945    .962
          III      .882    .935    .962        .919    .941    .947
          IV       .888    .945    .959        .925    .947    .959
          V        .884    .938    .955        .929    .944    .954
          VI       .773    .926    .914        .884    .938    .923

NOTE: Model I (true): D₀ = .262; model II (covariate omission): D₀ = .271; model III (wrong functional form): D₀ = .268; model IV (wrong link function): D₀ = .262; model V (wrong link and wrong functional form): D₀ = .267; model VI (overfitting): D₀ = .262.

5. EVALUATION BASED ON OTHER MEASURES OF PREDICTIVE ACCURACY

5.1 Sensitivity and Specificity

To evaluate a prediction rule for a binary outcome, one may also consider its sensitivity and specificity. For the prediction rule I(g(β̂′Z) > c), the sensitivity is SE(c) = Pr{g(β₀′Z) > c | T ≤ t}, and the specificity is SP(c) = Pr{g(β₀′Z) ≤ c | T > t}. These conditional probabilities can be estimated consistently by

ŜE(c) = {Σ_{i=1}^n δ_i I(g(β̂′Z_i) > c, X_i ≤ t)/Ĝ(X_i)} / {Σ_{i=1}^n δ_i I(X_i ≤ t)/Ĝ(X_i)}    (10)

and

ŜP(c) = {Σ_{i=1}^n I(g(β̂′Z_i) ≤ c, X_i > t)} / {Σ_{i=1}^n I(X_i > t)}.    (11)

We show in Appendix D that the processes n^{1/2}{ŜE(c) − SE(c)} and n^{1/2}{ŜP(c) − SP(c)} converge weakly to mean-zero Gaussian processes.
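Estimators (10) and (11) translate almost verbatim into code. The sketch below is illustrative only: `sens_spec` is a hypothetical name, and `G` stands for any estimate of the censoring survival function (e.g., Kaplan–Meier applied with the censoring indicator).

```python
import numpy as np

def sens_spec(X, delta, score, t, c, G):
    """IPCW estimates of SE(c) and SP(c) as in (10)-(11): deaths by t that
    are uncensored (delta == 1) are reweighted by 1/G(X_i); survivors past
    t need no reweighting because I(X_i > t) identifies them directly."""
    dead = (X <= t) & (delta == 1)
    w = 1.0 / G(X[dead])                             # inverse censoring weights
    se = np.sum(w * (score[dead] > c)) / np.sum(w)
    alive = X > t
    sp = np.sum(alive & (score <= c)) / np.sum(alive)
    return se, sp
```

With no censoring (G ≡ 1, all delta = 1), these reduce to the usual empirical sensitivity and specificity.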
To evaluate a specific working model (1), the commonly used receiver operating characteristic (ROC) curve based on (10) and (11) can be constructed (Heagerty and Zheng 2005). Furthermore, the K-fold and random cross-validation estimates can be obtained for SE(c) and SP(c). To illustrate our proposal, we fit the breast cancer data (van de Vijver et al. 2002) with models I, II, and III presented in Table 2 and then construct the corresponding ROC curves based on {(1 − ŜP(c), ŜE(c)), 0 ≤ c ≤ 1}. These are presented in Figure 1.

Figure 1. ROC Curves of Various Prediction Models for 10-Year Survivors With the Breast Cancer Data (model I; model II; model III; van de Vijver et al.).

For model I, because the choice of the threshold is trivial, the single prediction rule is given. We plot only a single point (the black circle) to display the sensitivity/specificity based on

this rule. For model II, due to the discrete nature of the clinical marker values (both are binary), only three distinct nontrivial values (denoted by open circles) are plotted in Figure 1. The true- and false-positive rates of the prediction rule proposed by van de Vijver et al. (2002) are also indicated in Figure 1. Based on the ROC curves, model III appears to be better than models I and II. Moreover, model III can produce a rule that has almost identical SE and SP to those proposed by van de Vijver et al. (2002). We find that the 10-fold and random cross-validation estimates for SE(c) and SP(c) are similar to their apparent counterparts with these data. Suppose that it was required that a proportion γ of subjects not surviving to 10 years be predicted correctly; that is, a threshold c* must be chosen such that SE(c*) = γ. It is straightforward to show that when Pr(T ≤ t | β₀′Z = y) is positive for y in the support of β₀′Z, c* is unique between 0 and 1. Let ĉ* be a solution to ŜE(c) = γ. Then ĉ* is consistent for c*. Moreover, ŜP(ĉ*) converges to SP(c*). To obtain confidence intervals for SP(c*), we use the perturbation-resampling method discussed in Section 2 to obtain an estimated standard error of ŜP(ĉ*) = ŜP{ŜE^{−1}(γ)} or a transformation thereof. To be specific, the perturbed SE*(c) is

SE*(c) = {Σ_{i=1}^n δ_i I(g(β*′Z_i) > c, X_i ≤ t)V_i/G*(X_i)} / {Σ_{i=1}^n δ_i I(X_i ≤ t)V_i/G*(X_i)}.

The perturbed SP*(c) can be obtained similarly. Now let c̃* be a solution to the equation SE*(c̃*) = γ. It follows from similar arguments as those given for the OMR, and the weak convergence of the processes n^{1/2}{ŜE(c) − SE(c)} and n^{1/2}{ŜP(c) − SP(c)}, that when n is large, the distribution of n^{1/2}{ŜP(ĉ*) − SP(c*)} can be well approximated by the conditional distribution of n^{1/2}{SP*(c̃*) − ŜP(ĉ*)} given the data. Confidence intervals for SP(c*) can then be obtained through this large-sample approximation.
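Empirically, solving ŜE(c*) = γ amounts to a scan over the observed score values. A sketch under the same IPCW weighting as estimators (10) and (11), with the hypothetical name `cutoff_for_sensitivity`:

```python
import numpy as np

def cutoff_for_sensitivity(X, delta, score, t, gamma, G):
    """Return the largest cutoff c among the observed scores whose IPCW
    sensitivity reaches gamma, with the sensitivity and specificity
    attained there -- an empirical stand-in for solving SE(c*) = gamma.
    G is an estimate of the censoring survival function."""
    dead = (X <= t) & (delta == 1)
    w = 1.0 / G(X[dead])
    alive = X > t
    for c in np.sort(np.unique(score))[::-1]:    # SE(c) grows as c decreases
        se = np.sum(w * (score[dead] > c)) / np.sum(w)
        if se >= gamma:
            sp = np.sum(alive & (score <= c)) / np.sum(alive)
            return c, se, sp
    return None                                   # gamma not attainable
```

For an attainable γ this mirrors reading the threshold off the ROC curve, as done for models II and III above.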
Note that for the cross-validation methods discussed in Section 2, the corresponding standardized ŜP(ĉ*) has the same limiting distribution as that of the above standardized apparent estimate. Moreover, any reasonable summary of prediction precision constructed from SE(c) and SP(c) (e.g., the area under the ROC curve) can be estimated consistently through ŜE(c) and ŜP(c), and a large-sample approximation to the resulting estimator can be obtained based on SE*(c) and SP*(c). We use the breast cancer data to illustrate the above procedure. From the ROC curve for model II in Figure 1, we let γ = .69, an attainable value for this working model empirically. The corresponding ĉ* = .23 and ŜP(ĉ*) = .45. On the other hand, for model III with the same γ, ĉ* = .29 and ŜP(ĉ*) = .75. Furthermore, the 95% confidence interval for the difference of the two SP(c*)'s (model III minus model II) is (.11, .45), indicating that the gene score adds substantial value to the two clinical markers in predicting 10-year survivors.

5.2 Positive and Negative Predictive Values

An alternative way to define the accuracy of a prediction rule is with the positive and negative predictive values, denoted by PPV(c) and NPV(c), where PPV(c) = Pr{T ≤ t | g(β₀′Z) > c} and NPV(c) = Pr{T > t | g(β₀′Z) ≤ c}.

Table 4. Estimated (1 − PPV) and NPV Based on Various Prediction Models for 10-Year Survivors With the Breast Cancer Data

                 c     SE    1 − PPV   NPV
Model II        .54   .17    .45       .73
                .53   .42    .47       .77
                .23   .69    .65       .77
Model III       .72   .10    .36       .72
                .61   .20    .49       .73
                .53   .30    .46       .75
                .47   .40    .41       .78
                .38   .50    .46       .80
                .30   .60    .49       .82
                .27   .70    .48       .85
                .22   .80    .53       .88
                .19   .90    .55       .92
van de Vijver         .93    .55       .94

NOTE: Model II: g(intercept + node + ER); model III: g(intercept + node + ER + gene), where g(y) = 1 − exp{−exp(y)}; van de Vijver: based on the classification rule of van de Vijver et al. (2002).

These conditional probabilities can be consistently estimated by

PPV̂(c) = {Σ_{i=1}^n δ_i I(g(β̂′Z_i) > c, X_i ≤ t)/Ĝ(X_i)} / {Σ_{i=1}^n I(g(β̂′Z_i) > c)}    (12)

and

NPV̂(c) = {Σ_{i=1}^n I(g(β̂′Z_i) ≤ c, X_i > t)} / {Ĝ(t) Σ_{i=1}^n I(g(β̂′Z_i) ≤ c)}.
(13)

Note that for c close to the two ends of the interval [0, 1], PPV̂(c) and NPV̂(c) may not be able to estimate their theoretical counterparts well. For each working model, the tradeoff between PPV(c) and NPV(c) may be examined for various c's in an interval [c_L, c_U] ⊂ (0, 1), where c_L and c_U are given constants. Table 4 presents the estimated PPV(c), NPV(c), and SE(c) for a range of cutoff points with the breast cancer gene expression data based on models II and III and the prediction rule of van de Vijver et al. (2002) presented in Table 2. Note that model II assumes only three nontrivial points, and its largest NPV is only .77. In comparison, model III appears to be more flexible and can reach rather high NPV levels. Moreover, model III can produce a rule that matches the PPV and NPV of the scheme proposed by van de Vijver et al. (2002). To make further inferences about evaluating a working model, we may choose a cutoff point d such that NPV̂(d) = γ, an acceptably large value, and then make inferences about PPV(d). When the underlying risk function Pr{T ≤ t | g(β₀′Z) = y} is a monotone function in y, the PPV(c) and NPV(c) functions are monotone in c. However, even under this assumption, NPV̂(c) may not estimate NPV(c) well when c is close to 0 or 1. Unlike the estimates for SE(c) and SP(c), NPV̂(c) may not be monotone in c, and the above cutoff point d may not be well defined empirically. Moreover, NPV(c) depends on the prevalence of the disease, and consequently NPV̂(c) may not be able to reach a prespecified γ. An alternative approach is to choose the cutoff point ĉ* such that ŜE(ĉ*) = γ, an acceptable level of sensitivity, as we did in Section 5.1, and then compute the corresponding PPV̂(ĉ*) and NPV̂(ĉ*). For example, for model II with γ = .69, ĉ* = .23, and (12) and (13) are .35 and .77. On the other hand, for model III with the same γ, ĉ* = .29 and (12) and (13) are .54 and .85.
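Estimators (12) and (13) can be sketched the same way (`ppv_npv` is a hypothetical name; `G` is again an estimated censoring survival function):

```python
import numpy as np

def ppv_npv(X, delta, score, t, c, G):
    """IPCW estimates of PPV(c) and NPV(c) as in (12)-(13): observed deaths
    by t are reweighted by 1/G(X_i), and the survivor count I(X_i > t) is
    deflated by G(t) to undo censoring before t."""
    pos = score > c                          # rule predicts death by t
    dead = (X <= t) & (delta == 1)
    ppv = np.sum((pos & dead) / G(X)) / np.sum(pos)
    Gt = G(np.array([t]))[0]
    npv = np.sum(~pos & (X > t)) / (Gt * np.sum(~pos))
    return ppv, npv
```

As the surrounding text cautions, these estimates degrade for cutoffs near 0 or 1, where the positive or negative group becomes very small.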

Journal of the American Statistical Association, June 2007

To construct confidence intervals for PPV(c*) and NPV(c*), where SE(c*) = γ, the perturbation-resampling scheme can be used to obtain the perturbed versions of PPV̂(ĉ*) and NPV̂(ĉ*). Specifically, first let c̃* be the solution of SE*(c̃*) = γ, as we did in Section 5.1. Then let

PPV*(c̃*) = {Σ_{i=1}^n δ_i I(g(β*′Z_i) > c̃*, X_i ≤ t)V_i/G*(X_i)} / {Σ_{i=1}^n I(g(β*′Z_i) > c̃*)V_i}    (14)

and

NPV*(c̃*) = {Σ_{i=1}^n I(g(β*′Z_i) ≤ c̃*, X_i > t)V_i} / {G*(t) Σ_{i=1}^n I(g(β*′Z_i) ≤ c̃*)V_i}.    (15)

It follows from the same argument used for the SE and SP estimators that for large n, the joint distribution of n^{1/2}{PPV̂(ĉ*) − PPV(c*)} and n^{1/2}{NPV̂(ĉ*) − NPV(c*)} can be approximated well by the conditional joint distribution of n^{1/2}{PPV*(c̃*) − PPV̂(ĉ*)} and n^{1/2}{NPV*(c̃*) − NPV̂(ĉ*)} given the data. For the gene expression example, for model II with γ = .69, 95% confidence intervals for PPV(c*) and NPV(c*) are (.28, .46) and (.62, .80). For model III, the corresponding intervals are (.35, .62) and (.78, .89). Furthermore, for the differences of PPV(c*) and NPV(c*) between these two models, 95% intervals are (.10, .24) and (−.06, .19). Note that one can obtain the cross-validation counterparts of (12) and (13), and their distributions can be approximated through (14) and (15) as we did for the apparent error estimates.

6. REMARKS

In this article we show how to obtain interval estimates for various measures of predictive accuracy that can be used to evaluate censored regression models. Based on the results of our numerical studies, we recommend for practical use the interval estimator centered around a cross-validated point estimate. It is important to note that whether or not survival times are subject to censoring, our estimator β̂ converges to the same value β₀, a root of E[Z{I(T ≤ t) − g(β′Z)}] = 0. Furthermore, regardless of the adequacy of the working model (1), the proposed procedure provides consistent estimates of the true accuracy measures of the prediction rule I(g(β₀′Z) > c).
Therefore, at least in the large-sample case, a nuisance censoring distribution does not contaminate the development and evaluation of prediction rules. In this article we evaluate the performance of a working model by estimating the predictive accuracy of its resulting classification rule. If the working model (1) is correctly specified, then it can be shown (by the same argument as that given in McIntosh and Pepe 2002) that the binary classification rule given by I(g(β̂′Z) > c) has the optimal limiting ROC curve among all rules based on the covariate vector Z. It follows that this is also true with respect to the other accuracy measures, namely the OMR, and PPV and NPV at a given SP or SE level. Without censoring, the prediction accuracy for survival time can be evaluated based on the absolute difference between T and its predicted value obtained through a fitted model for the continuous response T with covariate vector Z (Tian et al. 2007). With censoring, the support of the censoring variable is often significantly shorter than that of the survival time. Unfortunately, this makes it difficult, if not impossible, to provide good estimates of the mean absolute prediction error for survival time (Sinisi, Neugebauer, and van der Laan 2006). If our interest lies in the truncated mean absolute prediction error within the support of the censoring variable, then we may use an inverse weighting scheme similar to the one utilized in this article to obtain a consistent estimator. On the other hand, such an average prediction error may not be useful for evaluating prediction rules for long-term or short-term survivors. The proposed procedure does, however, require the assumption that the censoring variable either is free of the covariates or has a conditional distribution that can be estimated consistently using semiparametric or nonparametric methods when some of the covariates are continuous.
If the covariate vector is discrete, then a purely nonparametric estimator for the covariate-specific censoring distribution can be constructed, and our procedure can easily be generalized to incorporate covariate-dependent censoring. When some of the covariates are continuous and the fitted model may not be correctly specified, it is a rather challenging, if not impossible, task to generalize our procedures to handle covariate-dependent censoring in a truly nonparametric fashion.

APPENDIX A: CONSISTENCY OF ĉ AND D̂(ĉ)

To show that ĉ is a consistent estimator of c₀, it suffices to show that D̂(c) converges to D(c) uniformly in c, and that D(c) has a unique minimizer c₀ (Newey and McFadden 1994, thm. 2.1). To show the uniform consistency of D̂(c), we let

D̂(c, β) = n^{−1} Σ_{i=1}^n w_i |I(X_i ≤ t) − I(g(β′Z_i) > c)| / Ĝ(X_i ∧ t)

and

D(c, β) = E|I(T ≤ t) − I(g(β′Z) > c)|.

Then it follows from the uniform consistency of Ĝ(·) (Kalbfleisch and Prentice 2002) and a uniform law of large numbers (Pollard 1990) that sup_{c, β∈B} |D̂(c, β) − D(c, β)| → 0 almost surely, where B is the compact parameter space for β around β₀. This, coupled with the fact that β̂ converges to β₀, implies that D̂(c) = D̂(c, β̂) is uniformly consistent for D(c) = D(c, β₀). Now, to show that D(c) has a unique minimizer, we write

D(c) = Pr(T > t) + E[{2I(T ≤ t) − 1} I(g(β₀′Z) ≤ c)]
     = Pr(T > t) + E[{2h₀(g(β₀′Z)) − 1} I(g(β₀′Z) ≤ c)]
     = Pr(T > t) + ∫_0^{F₀(c)} {2h₀(F₀^{−1}(x)) − 1} dx,

where F₀(y) = Pr(g(β₀′Z) ≤ y) and h₀(y) = Pr(T ≤ t | g(β₀′Z) = y). Thus, assuming that F₀(y) is strictly increasing, D(c) has a unique minimizer if and only if

ζ(u) = ∫_0^u {2h₀(F₀^{−1}(x)) − 1} dx = 2 ∫_0^u h₀(F₀^{−1}(x)) dx − u

has a unique minimizer, which is guaranteed if h₀(·) is an increasing function. This demonstrates that ĉ is a consistent estimator of c₀. The consistency of D̂(ĉ) follows directly from the consistency of ĉ and the uniform convergence of D̂(c) to D(c).

APPENDIX B: LARGE-SAMPLE DISTRIBUTION OF W = n^{1/2}(D̂(ĉ) − D₀)

To derive the limiting distribution of W, we let W(c, β) = n^{1/2}{D̂(c, β) − D(c, β)} and note that

W = W(ĉ, β̂) + n^{1/2}{D(ĉ, β̂) − D(ĉ, β₀)} + n^{1/2}{D(ĉ) − D₀}.    (B.1)

We first derive the large-sample distribution for W(c, β). To this end, we note that

Ŵ_G(t) = n^{1/2}{G(t) − Ĝ(t)}/G(t) ≃ n^{−1/2} Σ_{i=1}^n ψ_i(t),

and that Ŵ_G(t) converges weakly to a mean-zero Gaussian process indexed by t (Kalbfleisch and Prentice 2002), where ψ_i(t) = ∫_0^t dM_i(u)/π_X(u), π_X(t) = Pr(X_i > t), M_i(t) = I(X_i ≤ t, δ_i = 0) − ∫_0^t I(X_i ≥ u) dΛ_C(u), and Λ_C(·) is the cumulative hazard function for the common censoring variable. This, together with a uniform law of large numbers and lemma A.1 of Bilias, Gu, and Ying (1997), implies that

W(c, β) ≃ n^{−1/2} Σ_{i=1}^n {D_i(c, β) − D(c, β)} + ∫_0^t Ŵ_G(s) dγ̂(s; β) ≃ n^{−1/2} Σ_{i=1}^n W_{1i}(c, β),    (B.2)

where D_i(c, β) = w_i |I(T_i ≤ t) − I(g(β′Z_i) > c)| / G(T_i ∧ t), γ̂(s; β) = n^{−1} Σ_{i=1}^n D_i(c, β) I(T_i ∧ t ≥ s), and W_{1i}(c, β) = D_i(c, β) − D(c, β) + ∫_0^t ψ_i(s) dE{γ̂(s; β)}. Here and throughout, we use the notation ≃ to denote equality up to o_p(1). It follows from a functional central limit theorem (Pollard 1990, chap. 10) that W(c, β) converges weakly to a mean-zero Gaussian process in (c, β), and thus W(ĉ, β̂) is asymptotically equivalent to W(c₀, β₀). It follows from the consistency of ĉ and a Taylor series expansion that the second term in (B.1) is asymptotically equivalent to

n^{1/2}{D(c₀, β̂) − D(c₀, β₀)} ≃ Ḋ₂(c₀, β₀)′ n^{1/2}(β̂ − β₀),

where Ḋ₂(c, β) = ∂D(c, β)/∂β. Now, by a Taylor series expansion of U(β) around β₀ and the uniform consistency of Ĝ(·), we have that n^{1/2}(β̂ − β₀) ≃ A(β₀) n^{1/2} U(β₀), where A(β) = {∂u(β)/∂β}^{−1}. This implies that

n^{1/2}(β̂ − β₀) ≃ A(β₀){n^{−1/2} Σ_{i=1}^n e_i(β₀) + ∫_0^t Ŵ_G(s) dK̂(s; β₀)} ≃ n^{−1/2} Σ_{i=1}^n W_{Bi}(β₀),

where e_i(β) = w_i Z_i {I(T_i ≤ t) − g(β′Z_i)}/G(T_i ∧ t), K̂(s; β) = n^{−1} Σ_{i=1}^n e_i(β) I(T_i ∧ t ≥ s), and W_{Bi}(β) = A(β)[e_i(β) + ∫_0^t ψ_i(s) dE{K̂(s; β)}].
Therefore,

n^{1/2}{D(ĉ, β̂) − D(ĉ, β₀)} ≃ n^{−1/2} Σ_{i=1}^n Ḋ₂(c₀, β₀)′ W_{Bi}(β₀).    (B.3)

The weak convergence of the process W(c, β) and the convergence of n^{1/2}(β̂ − β₀) imply that the process n^{1/2}{D̂(c) − D(c)} = W(c, β̂) + n^{1/2}{D(c, β̂) − D(c)} is asymptotically equivalent to n^{−1/2} Σ_{i=1}^n {W_{1i}(c, β₀) + Ḋ₂(c, β₀)′ W_{Bi}(β₀)} and is tight in c. Now, because

0 ≤ n^{1/2}{D(ĉ) − D(c₀)} = n^{1/2}{D̂(ĉ) − D̂(c₀)} − n^{1/2}{D̂(ĉ) − D(ĉ) − D̂(c₀) + D(c₀)} ≤ −n^{1/2}{D̂(ĉ) − D(ĉ) − D̂(c₀) + D(c₀)},

we have n^{1/2}{D(ĉ) − D(c₀)} ≤ |n^{1/2}{D̂(ĉ) − D(ĉ) − D̂(c₀) + D(c₀)}|. This, together with the tightness of the process n^{1/2}{D̂(c) − D(c)}, implies that n^{1/2}{D(ĉ) − D(c₀)} = o_p(1). Note that when Z is discrete, it is straightforward to show that n^{1/2}{D̂(ĉ) − D(ĉ) − D̂(c₀) + D(c₀)} = o_p(1), because Pr(ĉ = c₀) → 1. It then follows from (B.2) and (B.3) that

W ≃ n^{−1/2} Σ_{i=1}^n {W_{1i}(c₀, β₀) + Ḋ₂(c₀, β₀)′ W_{Bi}(β₀)}.

By the central limit theorem, W converges in distribution to a normal with mean 0 and variance E[{W_{1i}(c₀, β₀) + Ḋ₂(c₀, β₀)′ W_{Bi}(β₀)}²].

APPENDIX C: LARGE-SAMPLE DISTRIBUTION OF n^{1/2}{D̂(ĉ_v) − D₀} AND n^{1/2}{D̂(ĉ_rv) − D₀}

Let {ξ_i; i = 1, ..., n} be n exchangeable discrete random variables uniformly distributed over {1, 2, ..., K}, independent of the data, that satisfy Σ_{i=1}^n I(ξ_i = k) = n/K, k = 1, ..., K. Let D̂^{(k)}(c, β) denote D̂(c, β) evaluated based on the observations in I_k; then D̂^{(k)}(c) = D̂^{(k)}(c, β̂^{(−k)}). Then, for the kth partition, we have

D̂^{(k)}(ĉ_v) − D₀ = {D̂^{(k)}(ĉ_v, β̂^{(−k)}) − D(ĉ_v, β̂^{(−k)})} + {D(ĉ_v, β̂^{(−k)}) − D(ĉ_v, β₀)} + {D(ĉ_v, β₀) − D₀}.

It follows from the same argument as given in Appendix B that

β̂^{(−k)} − β₀ = K/{n(K − 1)} Σ_{i=1}^n I(ξ_i ≠ k) W_{Bi}(β₀) + o_p(n^{−1/2}),    (C.1)

D̂^{(k)}(ĉ_v, β̂^{(−k)}) − D(ĉ_v, β̂^{(−k)}) = (K/n) Σ_{i=1}^n I(ξ_i = k) W_{1i}(c₀, β₀) + o_p(n^{−1/2}),    (C.2)

D(ĉ_v, β̂^{(−k)}) − D(ĉ_v, β₀) = K/{n(K − 1)} Ḋ₂(c₀, β₀)′ Σ_{i=1}^n I(ξ_i ≠ k) W_{Bi}(β₀) + o_p(n^{−1/2}),    (C.3)

and D(ĉ_v, β₀) − D₀ = o_p(n^{−1/2}), where the o_p(·) is with respect to the product probability measure generated by that of {ξ_1, ..., ξ_n} and the data.
Therefore,

D̂^{(k)}(ĉ_v) − D₀ ≃ (K/n) Σ_{i=1}^n {I(ξ_i = k) W_{1i}(c₀, β₀) + (K − 1)^{−1} I(ξ_i ≠ k) Ḋ₂(c₀, β₀)′ W_{Bi}(β₀)}.

It follows that

D̂(ĉ_v) − D₀ ≃ n^{−1} Σ_{k=1}^K Σ_{i=1}^n {I(ξ_i = k) W_{1i}(c₀, β₀) + (K − 1)^{−1} I(ξ_i ≠ k) Ḋ₂(c₀, β₀)′ W_{Bi}(β₀)}.

Now, because Σ_{k=1}^K I(ξ_i = k) = 1 and Σ_{k=1}^K I(ξ_i ≠ k) = K − 1, it is straightforward to show that

Ŵ = n^{1/2}{D̂(ĉ_v) − D₀} = n^{−1/2} Σ_{i=1}^n {W_{1i}(c₀, β₀) + Ḋ₂(c₀, β₀)′ W_{Bi}(β₀)} + o_p(1).

Thus Ŵ is asymptotically equivalent to W. For the general cross-validation procedure, without loss of generality, we assume that n/n_v = K. Then n^{1/2}(D̂(ĉ_rv) − D₀) = E_ξ[n^{1/2}