Prediction Error Estimation for Cure Probabilities in Cure Models and Its Applications


Prediction Error Estimation for Cure Probabilities in Cure Models and Its Applications

by

Haoyu Sun

A thesis submitted to the Department of Mathematics and Statistics in conformity with the requirements for the degree of Master of Science

Queen's University
Kingston, Ontario, Canada
September 2014

Copyright Haoyu Sun, 2014

Abstract

Cure models are often used to describe survival data with a cure fraction. Many researchers have proposed different methods for model fitting; however, little research has been conducted on the assessment of prediction error for cure models. This report proposes an estimate of the expected Brier score as a measurement of prediction error for mixture cure models, with particular regard to the prediction of cure status. Both resubstitution and cross-validation methods were used to calculate the value of the proposed estimate. Simulation studies demonstrated that both of these methods work well in terms of assessing prediction error, especially when the sample size is large. The proposed prediction error estimates are shown to be able to detect differences between prediction models in the presence of model misspecification. Application of the proposed estimate to a data set of bone marrow transplant patients demonstrates the usefulness of this method in practice.

Acknowledgments

First and foremost, I would like to acknowledge and thank my supervisors Dr. Wenyu Jiang and Dr. Paul Peng for all their guidance. No matter how busy they were, Dr. Jiang and Dr. Peng always made themselves available to discuss and review my master's project. I am grateful for their advice and their care for my studies. I am thankful to have had them as my mentors, and my gratitude cannot be captured in words alone. I would also like to thank Dr. Dongsheng Tu, who introduced me to the area of biostatistics and inspired me to pursue my master's degree in this area. I am also grateful for his instruction in the course Advanced Biostatistics, which taught me how to apply statistical methods to real problems in clinical trials. A number of individuals have supported me in finishing this master's program. The department graduate assistant Jennifer Read helped me greatly with daily routines so that I could stay focused on my study. I would like to thank my friends and colleagues from both the Department of Mathematics and Statistics and the Department of Public Health for making me feel at home from day one. My undergraduate classmate Jiadong Mao from the University of Melbourne and Benyu Wang from George Washington University helped me with some coding issues in my project. Last but not least, I would like to thank my parents, who have been my constant

support throughout my life. Their support encouraged me whenever I came across any problems in my study or my life. Their trust in me has been a great encouragement for me to overcome all the difficulties and finish this master's program.

Contents

Abstract
Acknowledgments
Contents
List of Tables
List of Figures

Chapter 1: Introduction
  1.1 Background
  1.2 Contribution
  1.3 Organization of the Report

Chapter 2: Review of Mixture Cure Model and Prediction Error Measurement
  2.1 Chapter Overview
  2.2 Notation and Problem Setting for Survival Data
  2.3 Mixture Cure Model
  2.4 Prediction Error of a Survival Model
      Expected Brier Score

Chapter 3: Prediction Error Measurement for Mixture Cure Model
  3.1 Chapter Overview
  3.2 Measurement of Prediction Error for Mixture Cure Models
  3.3 Other Error Estimates

Chapter 4: Simulation Study
  4.1 Chapter Overview
  4.2 Data Generation
  4.3 Resubstitution Method
  4.4 Cross-Validation Method
  4.5 Model Misspecification
      Model with Redundant Variables
      Model with Variable Left Out
  4.6 Model Comparison

Chapter 5: Application to Bone Marrow Transplant Data
  5.1 Chapter Overview
  5.2 Data Description and Problem Setting
  5.3 Analysis
  5.4 Conclusion

Chapter 6: Summary and Future Work
  6.1 Summary
  6.2 Future Work

Appendix A: R Codes
  A.1 Codes for Simulation Studies
  A.2 Codes for Model Comparison
  A.3 Codes for the Bone Marrow Transplant Data

List of Tables

4.1 Key Features of the Simulated Data
4.2 Prediction Error Measurement Using Resubstitution Method
4.3 Prediction Error Measurement Using Cross-Validation Method
4.4 Prediction Error Measurement for Models with Redundant Variables
4.5 Prediction Error Measurement for Models with Missing Variable
4.6 Results for Model Comparison
4.7 Results for Model Comparison with a Narrower Censoring Distribution
5.1 Variable Description
5.2 Prediction Error for Each Model
5.3 Results from Model 1 and Model 2
5.4 Results from Model 3 and Model 4

List of Figures

4.1 K-M Plot of the Simulated Data
4.2 Plots with Different Sample Size using Resubstitution Method
4.3 Plots with Different Sample Size using Cross-Validation Method
4.4 Plots of Prediction Error for Models with Redundant Variables
4.5 Plots of Prediction Error for Models with Missing Variable
4.6 K-M Plot of the Simulated Data with a Narrower Censoring Distribution
5.1 K-M Plot of the Entire Data
5.2 K-M Plot for Each Disease Group

Chapter 1

Introduction

1.1 Background

In some cancer clinical trial settings, where participants are grouped and receive different treatments, survival data with a sizable cure fraction are commonly encountered. The cure fraction is a useful measure for monitoring trends in survival of curable diseases. Traditional survival analysis techniques, such as the Cox proportional hazards model, provide no direct estimation of the cure fraction. If it is believed that a proportion of individuals will not experience the event of interest, then it may be appropriate to fit models that explicitly allow for the cure fraction to be estimated and directly modeled (Lambert et al., 2007). For this purpose, cure rate models, or cure models, are often used. Because of the mixture nature of patients in the trials, the most popular type of cure model is the mixture model (Peng, 2003). For example, Weston et al. (2004) used a cure model to study leukemia in children and found the proportion of children cured with the ET-2 protocol to be 70% among those who had a nonpelvic primary site for less than 10 years without metastatic disease. Li et al.

(2010) analyzed smoking cessation data with a novel cure model and found a positive but non-significant association between the lapse and recovery frailties. Yilmaz et al. (2013) applied cure models in the analysis of molecular genetic prognostic factors for disease-free survival and time to disease recurrence in a cohort of patients with axillary lymph node-negative breast cancer. Many researchers have proposed different estimation methods for cure models, such as parametric models (Farewell, 1986; Yamaguchi, 1992; Peng et al., 1998; Wileyto et al., 2013), semiparametric models (Peng, 2003; Niu and Peng, 2013; Zhang et al., 2013), and nonparametric models (Peng and Dear, 2000). In addition, Corbiere and Joly (2007) and Cai et al. (2012) have developed a SAS macro and an R package, respectively, for fitting these models. It is of great interest to assess the performance of a mixture cure model in terms of its prediction error, which evaluates the extent to which the predicted event outcomes agree with the observed event outcomes for future patients. The performance of point predictors and predictive distributions is often assessed through loss functions and their expectations, which are termed prediction error or prediction accuracy (Lawless and Yuan, 2010). For the purpose of evaluating risk prediction models in survival analysis, time-dependent residuals are defined as the difference between the time-dependent survival status and the predicted survival probabilities. Korn and Simon (1990) introduced a general loss function approach for assessing survival models. However, one assumption underlying their approach is that the survival models are correctly specified. This is problematic when one compares different survival models, because when models are misspecified, the bias depends on the degree of misspecification. Graf et al. (1999) proposed estimators of the expected Brier score that avoid any dependence on the assumed survival model by using the

inverse probability-of-censoring weight (IPCW). The estimator was later proved to be consistent with the expected Brier score (Gerds and Schumacher, 2006). However, little research has been conducted concerning the estimation of prediction error for a mixture cure model. In this report, I will propose a measurement of the prediction error and its estimation methods for a mixture cure model. Since the availability of future data is always an issue, we usually need to estimate the prediction error of a model based on current data, by allocating one part of the data as the training data set to build the prediction model and the other part as the testing data set to assess the model. For example, resubstitution or cross-validation methods are often used when estimating the prediction error for survival data (Lawless and Yuan, 2010). In this report, I used both resubstitution and cross-validation methods to calculate the proposed measurement of prediction error for mixture cure models.

1.2 Contribution

This project was inspired by the estimators proposed by Graf et al. (1999). It proposes an estimator of prediction error for the mixture cure model. The proposed estimator can be applied to multicovariate cases and can be used to compare different cure models. A simulation study shows that this measurement converges to the expected Brier score when the sample size is large, and that it works well for moderate to large sample sizes, which are typical in application areas involving survival predictions.

1.3 Organization of the Report

In Chapter 2, I will review the mixture cure model and prediction error measurement, particularly the assessment method proposed by Graf et al. (1999). Chapter 3 describes the details of the newly proposed prediction error measurement and some methods to assess the measurement. Results of simulation studies, with both resubstitution and cross-validation methods, are reported and discussed in Chapter 4. Chapter 5 presents an application of the proposed method to a data set of bone marrow transplant patients extracted from Klein and Moeschberger (2003). In Chapter 6, some conclusions are made regarding the use and properties of the proposed measurement of prediction error for mixture cure models. Future work is also discussed.

Chapter 2

Review of Mixture Cure Model and Prediction Error Measurement

2.1 Chapter Overview

This chapter reviews the mixture cure model and prediction error measurement for survival models. Specifically, Section 2.2 presents notation and the problem setting. Section 2.3 reviews the mixture cure model. Section 2.4 reviews prediction error measurement for survival models.

2.2 Notation and Problem Setting for Survival Data

Suppose that there is a random sample of n subjects from an underlying population. Let T_i and C_i denote the potential failure time and the potential censoring time, respectively. For example, T_i could be the time from receiving treatment to the relapse of a particular disease, and C_i could be the time the subject leaves the study, or the end of the study for those disease-free subjects. Assume that T_i and C_i are independent given covariates x_i and z_i, where x is a covariate vector that contains covariates that affect the survival time of the uncured subjects, and z is a covariate

vector that affects the uncure probabilities of the subjects. Let T̃_i = min(T_i, C_i) denote the observed time for the i-th individual, let δ_i be the censoring indicator: δ_i = 1 if T_i ≤ C_i and δ_i = 0 otherwise, and let U_i be an indicator of uncured status, i.e., U_i = 1 if the patient is not cured and U_i = 0 otherwise. Obviously, U_i is a latent variable that is only observable for uncensored subjects. Then, the observed data form the set D = {T̃_i, δ_i, x_i, z_i}_{i=1}^n.

In survival analysis, the hazard function is defined as

    λ(t | Z) = lim_{ε→0} Pr{t ≤ T ≤ t + ε | Z} / (ε Pr{T ≥ t | Z}),    (2.1)

for a given covariate vector Z. The proportional hazards model was proposed by Cox (1972). It assumes that the hazard function is affected by the covariates Z in the form λ(t) = λ_0(t) exp{g(Z)}, where λ_0(·) is the baseline hazard and g(Z) reflects the covariate effects. Generally, g(Z) = β^T Z, where β is an unknown parameter vector. This leads to the usual Cox proportional hazards model

    λ(t) = λ_0(t) exp{β^T Z}.    (2.2)

2.3 Mixture Cure Model

In some clinical studies, a substantial proportion of patients who respond favorably to treatment may turn out to be free of any signs or symptoms of the disease and may be considered cured, while the remaining patients may eventually relapse. Long-term censored survival times usually appear in such data (Peng et al., 1998). Farewell (1986) presented a typical case in a study of breast cancer. The problems of interest in that study included the proportion of patients that could be cured and

the effects of the treatment methods and other factors on the cure rate and on the failure time of uncured patients. Given the notation in the previous section, the mixture cure model can be presented as follows:

    S(t | x, z) = π(z) S_u(t | x) + 1 − π(z),    (2.3)

where S(t | x, z) is the unconditional survival function of T for the entire population, S_u(t | x) = P(T > t | U = 1, x) is the survival function for uncured patients given the covariate vector x = (x_1, ..., x_m), and π(z) = P(U = 1 | z) is the probability of being uncured given the covariate vector z = (z_1, ..., z_q). The probability of being uncured is often expressed in a logistic form as follows:

    logit[π(z)] = log[π(z) / (1 − π(z))] = (1, z′)γ,    (2.4)

where γ describes the effect of z on the probability π(z). Let h_u(t | x) denote the hazard function of an uncured patient with covariate x at time t. If the proportional hazards model is considered for h_u(t | x), then h_u(t | x) = h_u0(t) exp(ζ), where h_u0(t) is the baseline hazard function and ζ = β′x. The proportional hazards mixture cure model can be written as

    S(t | x, z) = π(z) S_u0(t)^{exp(ζ)} + 1 − π(z),    (2.5)

where S_u0(t) = exp{−∫_0^t h_u0(w) dw} is the baseline survival function. The likelihood method is often applied for parameter estimation. For proportional hazards mixture cure models, given the cure statuses u = (u_1, ..., u_n), the complete likelihood function for equation (2.5) is

    ∏_{i=1}^n π(z_i)^{u_i} {1 − π(z_i)}^{1 − u_i} h_u(t_i | x_i)^{δ_i} S_u(t_i | x_i)^{u_i}.
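To make the mixture formulation concrete, here is a small Python sketch (the thesis's own code is in R; this translation and its parameter choices are illustrative only) that evaluates the logistic uncure probability (2.4) and the proportional hazards mixture survival function (2.5) with an exponential baseline S_u0(t) = e^{−t}:

```python
import math

def uncure_prob(z, gamma):
    """pi(z) from the logistic form (2.4): logit[pi(z)] = (1, z') gamma."""
    eta = gamma[0] + sum(g * zj for g, zj in zip(gamma[1:], z))
    return 1.0 / (1.0 + math.exp(-eta))

def mixture_survival(t, x, z, beta, gamma):
    """S(t|x,z) = pi(z) * S_u0(t)^exp(beta'x) + 1 - pi(z), with S_u0(t) = exp(-t)."""
    pi = uncure_prob(z, gamma)
    s_u0 = math.exp(-t)                          # exponential baseline survival
    s_u = s_u0 ** math.exp(sum(b * xj for b, xj in zip(beta, x)))
    return pi * s_u + (1.0 - pi)

# parameter values borrowed from the simulation design of Chapter 4
gamma = [2.0, -1.0, -1.5]
beta = [math.log(0.5), math.log(0.4)]
pi0 = uncure_prob([0, 0], gamma)                 # ~0.88: most such subjects uncured
s_inf = mixture_survival(50.0, [0, 0], [0, 0], beta, gamma)
# as t grows, S(t|x,z) approaches the cure fraction 1 - pi(z)
```

Note the defining feature of a cure model: the survival function plateaus at 1 − π(z) rather than decaying to zero.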

Peng (2003) proposed that the EM algorithm could be used to fit the semiparametric cure model. The E-step in the (r+1)-th iteration calculates the conditional expectation of the complete log-likelihood function given the estimates at the r-th iteration, β^(r), γ^(r), and S_u0^(r)(t), which is the sum of the following two functions:

    L_1(γ) = log ∏_{i=1}^n π(z_i)^{p_i^(r)} {1 − π(z_i)}^{1 − p_i^(r)},    (2.6)

    L_2(β, S_u0(t)) = log ∏_{i=1}^n [h_u0(t_i) exp(ζ_i)]^{δ_i} S_u0(t_i)^{p_i^(r) exp(ζ_i)},    (2.7)

where p_i^(r) = E{u_i | γ^(r), β^(r), S_u0^(r)(t)} = P{u_i = 1 | γ^(r), β^(r), S_u0^(r)(t)} is given by

    p_i^(r) = δ_i + (1 − δ_i) · π^(r)(z_i) S_u0^(r)(t_i)^{exp(ζ_i^(r))} / [1 − π^(r)(z_i) + π^(r)(z_i) S_u0^(r)(t_i)^{exp(ζ_i^(r))}],

with logit[π^(r)(z_i)] = (1, z_i′)γ^(r) and ζ_i^(r) = x_i′β^(r). The M-step in the (r+1)-th iteration maximizes equations (2.6) and (2.7) separately to obtain γ^(r+1), β^(r+1), and S_u0^(r+1)(t). The algorithm is iterated until it converges.

A log-normal mixture cure model is also considered in this work, where for individuals who are uncured (U = 1), the time to the event is modeled with a log-normal distribution, with density function

    f(t | U = 1, x) = φ((log(t) − μ)/σ) / (σt),    (2.8)

where φ(·) is the probability density function of the standard normal distribution, σ > 0 is a shape parameter, and μ = β′x. The log-normal mixture cure model can then be
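The E-step weight p_i^(r) has a simple closed form, and computing it is the heart of the algorithm: observed events are known to be uncured, while censored subjects receive a fractional uncured probability. The sketch below (illustrative Python, not the thesis's R code) computes this weight for a single subject given the current parameter estimates:

```python
def e_step_uncured_prob(delta, pi, s_u):
    """E-step weight: p_i = delta_i + (1 - delta_i) * pi*S_u / (1 - pi + pi*S_u).

    delta: censoring indicator (1 = event observed, so the subject is uncured);
    pi:    current estimate of the uncure probability pi^(r)(z_i);
    s_u:   current estimate S_u0^(r)(t_i)^exp(zeta_i^(r)) of conditional survival.
    """
    if delta == 1:
        return 1.0                  # observed events are certainly uncured
    num = pi * s_u
    return num / (1.0 - pi + num)

# hypothetical values: a subject censored late (S_u near 0) vs. censored early
p_late = e_step_uncured_prob(0, 0.8, 0.01)   # long event-free follow-up -> likely cured
p_early = e_step_uncured_prob(0, 0.8, 0.9)   # censored early -> likely still uncured
```

The weight shrinks toward zero as the censored follow-up time grows, which is exactly the intuition that long event-free survival is evidence of cure.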

written as

    S(t | x, z) = π(z) [1 − Φ((log(t) − μ)/σ)] + 1 − π(z),    (2.9)

where Φ(·) is the cumulative distribution function of the standard normal distribution. The likelihood method is also applied to obtain the estimates of the parameters.

When the survival time for uncured patients is assumed to follow an exponential distribution, an exponential mixture cure model is used to fit the data. The density function for uncured patients in exponential mixture cure models is

    f(t | U = 1, x) = λ e^{−λt},    (2.10)

where λ > 0 is called an inverse scale parameter. The exponential distribution is a log location-scale distribution of the form

    Y = log(T) = μ + W,    (2.11)

where μ = β′x and W ~ EV(0, 1). Thus,

    S_T(t | x) = P(Y ≥ log(t)) = e^{−e^{log(t) − μ(x)}} = [e^{−e^{log(t)}}]^{e^{−μ(x)}} = [S_0(t)]^{e^{−μ(x)}} = [S_0(t)]^{e^{−β′x}},

where S_0(t) is the baseline survival function with x = 0. Thus, the exponential distribution satisfies the proportional hazards assumption, and the exponential mixture cure model is a special case of the proportional hazards mixture cure model.
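As an illustration, the log-normal mixture survival function (2.9) can be evaluated with nothing more than the error function. The helper names below are hypothetical, and π and μ are passed directly rather than computed from fitted coefficients:

```python
from math import erf, exp, log, sqrt

def std_normal_cdf(x):
    """Phi(x) via the error function: Phi(x) = (1 + erf(x / sqrt(2))) / 2."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def lognormal_mixture_survival(t, pi, mu, sigma):
    """S(t|x,z) = pi * [1 - Phi((log t - mu)/sigma)] + 1 - pi, as in (2.9)."""
    s_u = 1.0 - std_normal_cdf((log(t) - mu) / sigma)
    return pi * s_u + (1.0 - pi)

# at t = exp(mu), the uncured survival is exactly 0.5,
# so S = 0.6 * 0.5 + 0.4 = 0.7 for these illustrative values
s_med = lognormal_mixture_survival(exp(0.5), pi=0.6, mu=0.5, sigma=1.0)
```

As in the proportional hazards version, S(t | x, z) levels off at the cure fraction 1 − π(z) for large t.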

2.4 Prediction Error of a Survival Model

Let Y be the actual value of the survival time and Ŷ the predicted value of Y. The error of a prediction Ŷ of Y can be quantified using a loss function L(Y, Ŷ), assumed to be nonnegative with L(Y, Y) = 0 (Lawless and Yuan, 2010). Some of the most common loss functions are: squared error loss (Y − Ŷ)², absolute error loss |Y − Ŷ|, and misclassification error loss I(W_t ≠ Ŵ_t), where W_t is the actual survival status and Ŵ_t is the estimate of W_t. One may also consider the squared and absolute error losses on the log scale, i.e., taking the logarithm of Y, because the distribution of log(Y) is typically more symmetric. The performance of a predictor Ŷ = Ĝ(z) is measured by the expected loss, or prediction error:

    P = E{L(Y, Ĝ(z))}.    (2.12)

Expected Brier Score

The Brier index, or Brier score (Brier, 1950), has been widely used to assess the quality of probability estimates. Suppose that on each of n occasions an event can occur in only one of r possible classes or categories, and that on each occasion the probability that the event will occur in class j is f_ij, i = 1, ..., n, j = 1, ..., r. The Brier score is defined as

    BS = (1/n) ∑_{j=1}^r ∑_{i=1}^n (f_ij − E_ij)²,    (2.13)

where E_ij indicates whether the event of occasion i occurred in class j or not: E_ij = 1 if the event of occasion i occurred in class j and E_ij = 0 otherwise.
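Equation (2.13) is straightforward to compute; a minimal sketch (illustrative Python, with made-up forecasts):

```python
def brier_score(f, e):
    """BS = (1/n) * sum_j sum_i (f_ij - E_ij)^2, as in (2.13).

    f: n x r matrix of forecast probabilities f_ij;
    e: n x r matrix of outcome indicators E_ij (1 if occasion i fell in class j).
    """
    n = len(f)
    return sum((fij - eij) ** 2
               for fi, ei in zip(f, e)
               for fij, eij in zip(fi, ei)) / n

# two occasions, two classes: a confident correct forecast and a 50/50 miss
f = [[0.9, 0.1], [0.5, 0.5]]
e = [[1, 0], [0, 1]]
bs = brier_score(f, e)   # ((0.1^2 + 0.1^2) + (0.5^2 + 0.5^2)) / 2 = 0.26
```

A perfectly confident and correct forecaster scores 0; larger values indicate worse probability estimates.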

This score quantifies the accuracy of a set of judgments by comparing the expressed probabilities to the actual outcomes (Redelmeier et al., 1991). In survival analysis, time-dependent residuals are defined as the difference between the time-dependent survival status and the predicted survival probabilities; i.e., at a fixed time t*, the time-dependent residuals are defined as I(T > t*) − S(t*). For observed data D, the Brier score at time t* can be defined as

    BS(t*) = (1/n) ∑_{i=1}^n [I(T_i > t*) − S_D(t* | x_i)]²,    (2.14)

where S_D(t* | x_i) is the survival probability for subject i at time t*, given the observed covariates x_i in data set D, if all T_i's are observed, and I(T_i > t*) indicates whether subject i is still alive at time t*. The expected Brier score at time t* is defined as

    EBS(t*) = E{[I(T > t*) − S_D(t* | x)]²}.    (2.15)

It can be interpreted as a mean squared error of prediction when the survival probabilities S_D(t* | x) are viewed as predictions of the event status at time t*; the expectation is taken over the data set D on which the prediction model is built, and over T and x from the underlying distribution.

Censoring often occurs in survival data, which makes it difficult to calculate the Brier score (2.14) directly. Graf et al. (1999) proposed an estimator of the expected Brier score as the measure of prediction error for right-censored survival data with prognostic classification schemes, which was later proved to be consistent with the

expected Brier score (Gerds and Schumacher, 2006). They proposed that, given observed covariates x_i, the prediction error for right-censored survival data at time point t* can be measured as follows:

    BS_c(t*) = (1/n) ∑_{i=1}^n { (0 − Ŝ_D(t* | x_i))² I(T̃_i ≤ t*, δ_i = 1)(1/Ĝ(T̃_i)) + (1 − Ŝ_D(t* | x_i))² I(T̃_i > t*)(1/Ĝ(t*)) },    (2.16)

where Ŝ_D(t* | x_i) are the estimated event-free probabilities, and Ĝ(t) denotes the Kaplan-Meier estimate of the censoring distribution G, i.e., the Kaplan-Meier estimate based on (T̃_i, 1 − δ_i), i = 1, ..., n. The contributions of the observations in the data set to the Brier score can be classified into three categories:

Category 1: T̃_i ≤ t* and δ_i = 1;
Category 2: T̃_i > t* (δ_i = 1 or δ_i = 0);
Category 3: T̃_i ≤ t* and δ_i = 0.

For an uncensored observation in category 1, the event occurred before t*, and the event status at t* is I(T_i > t*) = 0; thus the contribution to the Brier score is (0 − Ŝ_D(t* | x_i))². In category 2, the observed event status at t* is equal to 1, since all of these patients are known to be event-free at t*; the resulting contribution to the Brier score is (1 − Ŝ_D(t* | x_i))². For a censored observation in category 3, the censoring occurred before t*, and the event status at t* is unknown; thus this observation does not contribute to the Brier score.

Since some censored observations do not contribute to the Brier score calculation, the remaining individual contributions have to be reweighted to compensate for the loss of information due to the exclusion of some censored observations. Observations

in category 1 get the weight 1/Ĝ(T̃_i), where G(T̃_i) = P(C_i > T̃_i), which corresponds to the situation of category 1; those in category 2 get the weight 1/Ĝ(t*), where G(t*) = P(C_i > t*), which corresponds to the situation of category 2; and those in category 3 get weight zero. This weighting scheme is called the inverse probability-of-censoring weight (IPCW), which was first proposed by Robins et al. (1995). Gerds and Schumacher (2006) proved that this weighting scheme does not depend on the estimated event-free probabilities Ŝ_D(t* | x_i), and hence the estimator is robust against misspecification of the survival model.
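The following Python sketch puts the pieces of (2.16) together: a Kaplan-Meier estimate of the censoring survival function Ĝ computed from (T̃_i, 1 − δ_i), and the category-based IPCW sum. It is an illustrative reimplementation (the thesis uses R), and it handles ties naively:

```python
def km_censoring_survival(times, delta):
    """Kaplan-Meier estimate of the censoring survival function G(t),
    treating censored observations (delta = 0) as the 'events' for G."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    at_risk = len(times)
    steps = []                        # (time, value of G just after that time)
    g = 1.0
    for i in order:
        if delta[i] == 0:             # a censoring counts as an event for G
            g *= 1.0 - 1.0 / at_risk
        steps.append((times[i], g))
        at_risk -= 1
    def G(t):
        value = 1.0
        for time, g_val in steps:
            if time <= t:
                value = g_val
            else:
                break
        return value
    return G

def ipcw_brier(times, delta, surv_pred, t_star):
    """Graf et al.'s estimator BS_c(t*) of (2.16); surv_pred[i] plays the
    role of the model-based event-free probability S_D(t* | x_i)."""
    G = km_censoring_survival(times, delta)
    total = 0.0
    for ti, di, si in zip(times, delta, surv_pred):
        if ti <= t_star and di == 1:      # category 1: event observed before t*
            total += (0.0 - si) ** 2 / G(ti)
        elif ti > t_star:                 # category 2: known event-free at t*
            total += (1.0 - si) ** 2 / G(t_star)
        # category 3: censored before t* -> weight zero
    return total / len(times)
```

A useful sanity check: with no censoring at all, Ĝ ≡ 1 and BS_c(t*) reduces to the ordinary Brier score (2.14).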

Chapter 3

Prediction Error Measurement for Mixture Cure Model

3.1 Chapter Overview

This chapter describes a method for measuring prediction error for mixture cure models. Specifically, Section 3.2 describes the details of the method for prediction error measurement for mixture cure models. Section 3.3 introduces some other assessment methods for prediction error to compare with the proposed estimate.

3.2 Measurement of Prediction Error for Mixture Cure Models

In the setting of mixture cure models, the expected Brier score is defined as

    EBS(z) = E{[U − π̂(z | D)]²},    (3.1)

where π̂(z | D) represents the estimated uncure probability based on the covariate vector z, given data D. The expectation is taken over the data D and over the covariate vector z and uncure status U of future patients. The expected Brier score can be interpreted

as a mean squared error of prediction when the estimated probabilities π̂_i are viewed formally as predictions of the uncured status. It can be used as a measurement of prediction error for the cure rate in mixture cure models.

Using the estimated Brier score has a remarkable advantage. At the baseline time point, the survival status at a later time t is unknown. This means that the predictions are made in terms of predictive values of a diagnostic test, i.e., probabilities of a positive or negative cure status, instead of classifying the patient as cured or uncured. It has been documented in the literature that, even in the diagnostic setting, the predictive value associated with the result of a diagnostic test is more relevant than the test result itself (Graf et al., 1999). Thus, to judge the quality of classification, the Brier score, which measures the average discrepancy between the true uncure status and the estimated predictive values, may be preferable to the misclassification rate, which only considers the proportion of observations allocated to the incorrect group.

Resubstitution and cross-validation are often used when estimating the prediction error for survival data. In this report, I use both methods to estimate the Brier score for the mixture cure model in the presence of censored subjects. Let L denote the proposed estimate of prediction error, and let L_R and L_CV denote the value of L calculated by the resubstitution and cross-validation methods, respectively.

In resubstitution, one simply builds a mixture cure model based on the entire data set, and then calculates the value of the estimate of prediction error using the predicted uncure rates from the model on the same data set. That is, the prediction error

for the mixture cure model, using the resubstitution method, can be calculated as

    L_R = (1/n) ∑_{i=1}^n { (1 − π̂(z_i | D))² I(t_i ≤ c_i)(1/Ĝ_D(t_i)) + (0 − π̂(z_i | D))² I(t_i > t*)(1/Ĝ_D(t*)) },    (3.2)

where π̂(z_i | D) are the estimated uncure probabilities based on the covariate vectors z_i in data D; t_i and c_i are, respectively, the observed failure time and censoring time; t* is the largest observed failure time; and Ĝ_D(·) is the Kaplan-Meier estimate of the survival function for the censoring time based on data D. The idea underlying this estimate is that the subjects fall into the following categories:

Category 1: δ_i = 1;
Category 2: t_i > t*;
Category 3: t_i ≤ t* and δ_i = 0.

All patients in Category 1 are uncured, since their failure times are observed; thus their contribution to the Brier score is (1 − π̂(z_i | D))². It is assumed that those censored after t* are cured, so all patients in Category 2 are assumed to be cured, and the resulting contribution of this category to the Brier score is (0 − π̂(z_i | D))². However, the cure status of patients in Category 3 is not known, so the contribution of this category to the Brier score is not included. I use the same weighting scheme as in Graf et al.'s paper to recover the loss of information due to censoring. Divided by n, these weights sum up to 1. Patients in Categories 1 and 2 have C_i > T_i and C_i > t*, respectively, and they are assigned the weights 1/(n Ĝ_D(T̃_i)) and 1/(n Ĝ_D(t*)) according to the IPCW scheme. Category 3 is dropped because of its unknown status.
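A direct transcription of (3.2) into Python (illustrative; the thesis's implementation is in R, and Ĝ_D is passed in as a function, e.g. the output of a Kaplan-Meier fit):

```python
def resub_prediction_error(obs_times, delta, pi_hat, G_D, t_star):
    """Resubstitution estimate L_R of equation (3.2).

    obs_times[i]: observed time min(t_i, c_i); delta[i] = I(t_i <= c_i);
    pi_hat[i]: estimated uncure probability pi-hat(z_i | D);
    G_D: estimate of the censoring survival function;
    t_star: largest observed failure time."""
    total = 0.0
    for ti, di, pi in zip(obs_times, delta, pi_hat):
        if di == 1:                  # category 1: event observed -> uncured
            total += (1.0 - pi) ** 2 / G_D(ti)
        elif ti > t_star:            # category 2: censored after t* -> treated as cured
            total += (0.0 - pi) ** 2 / G_D(t_star)
        # category 3: censored at or before t* -> unknown status, weight zero
    return total / len(obs_times)

# toy example with no censoring before t*, so G_D is identically 1:
# one observed event predicted uncured (0.9), one late censoring predicted cured (0.1)
L_R = resub_prediction_error([1.0, 5.0], [1, 0], [0.9, 0.1],
                             lambda t: 1.0, t_star=1.0)
# ((1 - 0.9)^2 + (0 - 0.1)^2) / 2 = 0.01
```

In practice `pi_hat` would come from the fitted mixture cure model and `G_D` from the Kaplan-Meier fit on (T̃_i, 1 − δ_i).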

However, when assessing the prediction accuracy of a model on the same data on which the model is built, over-fitting may occur; that is, the model adapts to the particular data at hand more closely than the underlying population warrants. Generally, when over-fitting occurs, the prediction error will be underestimated. Cross-validation techniques are commonly used to overcome this issue. These techniques do not use the entire data set when building a model: some cases are removed before the data are modeled. The removed cases are often called a testing set, and the remaining cases are called a training set. Once the model has been built using the training set, the testing set can be used to test the performance of the model on unseen data.

There are several strategies for conducting cross-validation, such as split-half cross-validation, leave-one-out cross-validation (LOOCV), bootstrapped LOOCV, etc. In this report, I used the K-fold cross-validation method. This method splits the data into K folds of the same size, denoted by D_1, ..., D_K, and repeatedly leaves one fold out as the testing data set while using the remaining K − 1 folds as the training data set. The cross-validation estimate of L is given by

    L_CV = (1/n) ∑_{k=1}^K ∑_{l∈F_k} { (1 − π̂(z_l | D_−k))² I(t_l ≤ c_l)(1/Ĝ_−k(t_l)) + (0 − π̂(z_l | D_−k))² I(t_l > t*_−k)(1/Ĝ_−k(t*_−k)) },    (3.3)

where F_k is the collection of patients in fold k, D_−k denotes the data set with fold k left out, and π̂(z_l | D_−k) is the predicted uncure rate for patient l in fold k, calculated from the coefficients of the mixture cure model built on the training data set and the covariates of patient l in the testing data set. Ĝ_−k and t*_−k are, respectively, the Kaplan-Meier estimate of the survival function for the censoring
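The fold bookkeeping behind (3.3) can be sketched generically (illustrative Python; `fit` and `score` are hypothetical stand-ins for fitting a mixture cure model on the training folds and accumulating the weighted squared errors on the left-out fold):

```python
import random

def k_fold_indices(n, k, seed=0):
    """Randomly partition subject indices 0..n-1 into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[j::k] for j in range(k)]

def cv_prediction_error(data, k, fit, score):
    """K-fold scheme of (3.3): for each fold, fit on the data with that fold
    left out, accumulate the fold's total score, and divide by n at the end."""
    folds = k_fold_indices(len(data), k)
    total = 0.0
    for fold in folds:
        hold_out = set(fold)
        train = [data[i] for i in range(len(data)) if i not in hold_out]
        test = [data[i] for i in fold]
        total += score(fit(train), test)
    return total / len(data)

# toy check: the 'model' is the training mean, the score is a sum of squared errors
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
err = cv_prediction_error(values, 5,
                          fit=lambda tr: sum(tr) / len(tr),
                          score=lambda m, te: sum((x - m) ** 2 for x in te))
```

The same skeleton yields L_CV when `fit` builds the mixture cure model and `score` computes the inner sum of (3.3) with the fold-specific Ĝ_−k and t*_−k.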

time and the largest observed failure time in the data set with fold k left out. In this report, I set K = 5 in the calculations.

3.3 Other Error Estimates

To assess the performance of the proposed estimate of prediction error L, I compare it with two error estimators:

    BS_1 = (1/n) ∑_{i=1}^n (U_i − π_i)²    (3.4)

and

    BS_2 = (1/n) ∑_{i=1}^n (U_i − π̂(z_i | D))²,    (3.5)

where U_i are the true uncure statuses of the subjects in the generated data, and π_i and π̂(z_i | D) are, respectively, the true uncure rates and the estimated uncure rates from the mixture cure models. BS_1 is based on the true uncure probabilities and represents the mean squared difference between cure status and cure rate without any estimation error, while BS_2 is a prediction error based on π̂(z_i | D) when the cure statuses are available. Neither BS_1 nor BS_2 can be calculated for real data, though; they are proposed for the simulation study in the next chapter to investigate the properties of the proposed estimator (3.2).

The expectation of BS_1 can be calculated as follows:

    E(BS_1) = E[(1/n) ∑_{i=1}^n (U_i − π_i)²] = E[(U − π)²] = E[E[(U − π)² | π]] = E[π(1 − π)] = E(π) − E(π²),    (3.6)

where π is a random variable depending on the covariate vector z. Similarly, the variance of BS_1 can be calculated as follows:

    Var(BS_1) = Var[(1/n) ∑_{i=1}^n (U_i − π_i)²] = (1/n) Var[(U − π)²] = (1/n) {E[(U − π)⁴] − (E[(U − π)² | π])²}.    (3.7)

Given π, U follows a binary distribution with rate π, and E[(U − π)⁴ | π] is the fourth central moment of U:

    E[(U − π)⁴ | π] = π(1 − π) − 3π²(1 − π)² = π − 4π² + 6π³ − 3π⁴.

Therefore,

    Var(BS_1) = (1/n) {E(π) − 5E(π²) + 8E(π³) − 4E(π⁴)}.

BS_2 calculated by the cross-validation method is given by

    BS_2^CV = (1/n) ∑_{k=1}^K ∑_{l∈F_k} (U_l − π̂(z_l | D_−k))²,    (3.8)

where F_k is the collection of patients in fold k, U_l is the true uncure status of patient l in fold k, and π̂(z_l | D_−k) is calculated from the covariates of the subjects in fold k (the testing data set) and the coefficients of the cure model built on the data set with fold k left out (the training data set).

Though it cannot be observed, the value of BS_1 represents the true value of the prediction error, measured by the expected Brier score. Thus, the difference between L_R (or L_CV) and the expectation of BS_1 measures how accurately the proposed estimate assesses the prediction error of the model. Since we assume that the cure statuses of all subjects are known when calculating BS_2 (or BS_2^CV), there is no loss of information due to censoring. Thus, the comparison between L_R and BS_2, or that between L_CV
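The identity E(BS_1) = E(π) − E(π²) in (3.6) is easy to verify numerically. The Monte Carlo sketch below (illustrative Python with made-up uncure rates π_i) draws U_i ~ Bernoulli(π_i) repeatedly and compares the average of BS_1 with the theoretical value E[π(1 − π)]:

```python
import random

def simulate_bs1(pis, seed):
    """One realization of BS_1 = (1/n) sum_i (U_i - pi_i)^2, U_i ~ Bernoulli(pi_i)."""
    rng = random.Random(seed)
    return sum((int(rng.random() < p) - p) ** 2 for p in pis) / len(pis)

pis = [0.2, 0.5, 0.7, 0.9] * 250                       # n = 1000 hypothetical rates
theory = sum(p * (1.0 - p) for p in pis) / len(pis)    # E(pi) - E(pi^2) = 0.1775
mc = sum(simulate_bs1(pis, seed=s) for s in range(200)) / 200
# mc should agree with theory up to Monte Carlo error
```

The agreement illustrates why BS_1 serves as the benchmark in the simulations: its expectation is known exactly from the design.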

and BS_2^CV, examines the appropriateness of the weighting scheme for dealing with right censoring.

The prediction error measures the accuracy of the prediction model built on the observed data and should ideally be assessed on a large independent test set. This ideal assessment is reflected by L_New, although in applications it is not possible because of the lack of an independent test set. L_New can be calculated as

    L_New = (1/m) ∑_{j=1}^m { (1 − π̂(z_j^New | D))² I(t_j^New ≤ c_j^New)(1/Ĝ_D(t_j^New)) + (0 − π̂(z_j^New | D))² I(t_j^New > t*)(1/Ĝ_D(t*)) },    (3.9)

where π̂(z_j^New | D) are the predicted uncure probabilities for patients in the new data set, calculated from the coefficients of the mixture cure model built on data D and the covariate vector z_j^New in the new test set, and t_j^New and c_j^New represent the failure time and censoring time in the test set, respectively.

Similarly, the value of BS_2 for the independent test set can be calculated as

    BS_2^New = (1/m) ∑_{j=1}^m (U_j − π̂(z_j^New | D))².    (3.10)

L_New and BS_2^New are only computable for simulated data and are used for comparison purposes.

Chapter 4

Simulation Study

4.1 Chapter Overview

This chapter investigates the properties of the proposed measurement of prediction error for mixture cure models via simulation. Specifically, section 4.2 describes the data generation for the simulation study. Sections 4.3 and 4.4 present the results using the resubstitution and cross-validation methods, respectively. Results for the proposed estimate under model misspecification are presented in section 4.5. Section 4.6 presents the results of model comparison using the measurement of prediction error proposed in this report; it compares the prediction errors of a proportional hazards mixture cure model and a log-normal mixture cure model built on the same data. The R code for all the simulations in this chapter can be found in Appendix A.

4.2 Data Generation

Two binary covariates z_1 and z_2 are generated, each following a Bernoulli distribution with rate 0.5. For the cure part of the model, we set γ_0 = 2, γ_1 = −1, and γ_2 = −1.5, which corresponds to the logistic form of the uncure rate π:

logit(π) = 2 − z_1 − 1.5 z_2    (4.1)

i.e.,

π(z_1, z_2) = exp(2 − z_1 − 1.5 z_2) / (1 + exp(2 − z_1 − 1.5 z_2))    (4.2)

The uncure status U_i for each subject i is generated from a Bernoulli distribution with rate π_i given by equation (4.2). For the uncured subjects, the standard exponential distribution is used as the baseline distribution. The coefficients of z_1 and z_2 that affect the survival time are β_1 = log(0.5) and β_2 = log(0.4), respectively. Thus, the survival function for the uncured subjects is:

S_u(t) = S_u0(t)^{exp{log(0.5) z_1 + log(0.4) z_2}}    (4.3)

where S_u0(t) = e^{−t}. Referring to equation (2.5), the survival function for all subjects can be written as:

S(t | z_1, z_2) = π(z_1, z_2) S_u0(t)^{exp{log(0.5) z_1 + log(0.4) z_2}} + 1 − π(z_1, z_2)    (4.4)

The censoring times are generated from a uniform distribution on [0, 60]. Under these assumptions, 200 data sets (each with sample size 100) are generated using R and fitted using the mixcure package written by Yingwei Peng.

Since both z_1 and z_2 are binary variables with rate 0.5, the subjects in each data set
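The generating mechanism above can be sketched in code. The thesis does this in R (Appendix A, with the mixcure package); below is a Python sketch of the data-generation step only — the function name and the seed are ours.

```python
import numpy as np

rng = np.random.default_rng(2014)               # seed is ours, for reproducibility

def generate(n):
    """One simulated data set from the mixture cure model of section 4.2."""
    z1 = rng.binomial(1, 0.5, n)
    z2 = rng.binomial(1, 0.5, n)
    # incidence part: logit(pi) = 2 - z1 - 1.5*z2
    pi = 1.0 / (1.0 + np.exp(-(2.0 - z1 - 1.5 * z2)))
    U = rng.binomial(1, pi)                     # uncure status
    # latency part: Exp(1) baseline, PH effect exp(log(0.5)*z1 + log(0.4)*z2)
    rate = np.exp(np.log(0.5) * z1 + np.log(0.4) * z2)
    T = np.where(U == 1, rng.exponential(1.0 / rate), np.inf)  # cured never fail
    C = rng.uniform(0.0, 60.0, n)               # censoring times
    return z1, z2, U, np.minimum(T, C), (T <= C).astype(int)

z1, z2, U, time, delta = generate(1000)
```

The mean of U estimates E(π) ≈ 0.653 under this design; cured subjects (U = 0) are always censored at their uniform censoring time.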

can be split into four groups:

Group 1: z_1 = 0, z_2 = 0; S(t) = exp(−t)
Group 2: z_1 = 1, z_2 = 0; S(t) = exp(−0.5t)
Group 3: z_1 = 0, z_2 = 1; S(t) = exp(−0.4t)
Group 4: z_1 = 1, z_2 = 1; S(t) = exp(−0.2t)

After the data sets are generated, Kaplan-Meier survival curves for all the data sets are plotted in Figure 4.1, together with the theoretical survival curve for each group. Some key features of the generated data are described in Table 4.1.

Figure 4.1: K-M Plot of the Simulated Data
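Per equation (4.4), each group's overall survival curve levels off at its cure fraction 1 − π(z_1, z_2). A quick numerical check (Python; the function names are ours, and the group hazard rates come from the list above):

```python
import numpy as np

def pi_fn(z1, z2):
    """Uncure probability from equation (4.2)."""
    return np.exp(2 - z1 - 1.5 * z2) / (1 + np.exp(2 - z1 - 1.5 * z2))

def S(t, z1, z2, lam):
    """Overall survival from equation (4.4), with Exp(lam) latency."""
    p = pi_fn(z1, z2)
    return p * np.exp(-lam * t) + (1 - p)       # uncured part + cure fraction

groups = [(0, 0, 1.0), (1, 0, 0.5), (0, 1, 0.4), (1, 1, 0.2)]
plateaus = [S(1e6, z1, z2, lam) for z1, z2, lam in groups]
# plateaus ≈ [0.1192, 0.2689, 0.3775, 0.6225]: the cure fractions in Table 4.1
```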

Table 4.1: Key Features of the Simulated Data

Group     Cure Rate   1st Quartile of T   Median of T   3rd Quartile of T
Group 1   (0.1192)    (0.2877)            (0.7297)      (1.3402)
Group 2   (0.2689)    (0.6895)            (1.2824)      (2.2959)
Group 3   (0.3775)    (0.8252)            (1.4562)      (2.5349)
Group 4   (0.6225)    (0.9205)            (1.6858)      (2.9941)

Note: The values in brackets are from the theoretical distributions in each group.

Graphically, one feature that makes the data suitable for a mixture cure model is that the survival curves level off toward a positive asymptote. Figure 4.1 and Table 4.1 demonstrate that: 1) the data sets are correctly generated, since the survival functions of the generated data are quite close to the theoretical survival functions for each group; and 2) both the generated and the theoretical survival curves have a flat tail, which means the data are suitable for a mixture cure model. The key features presented in Table 4.1 also demonstrate that the cure rate of the generated data is very close to the theoretical value E[1 − π(z_1, z_2)] = 0.3470.

From equation (3.6), under the assumptions in this report, the expectation of BS_1 can be calculated as follows:

E(BS_1) = E(π) − E(π^2)

Since π = exp(2 − z_1 − 1.5 z_2) / (1 + exp(2 − z_1 − 1.5 z_2)) and both z_1 and z_2 can be 0 or 1 with probability 0.5, π takes the values 0.8808, 0.7311, 0.6225, or 0.3775, each with probability 0.25. Hence

E(π) = (0.8808 + 0.7311 + 0.6225 + 0.3775)/4 = 0.6530;
E(π^2) = (0.8808^2 + 0.7311^2 + 0.6225^2 + 0.3775^2)/4 = 0.4601;
E(BS_1) = 0.6530 − 0.4601 = 0.1929.
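The arithmetic behind E(BS_1) can be verified directly (Python sketch, ours):

```python
import numpy as np

# the four equally likely covariate patterns and their uncure probabilities
patterns = [(0, 0), (1, 0), (0, 1), (1, 1)]
pi = np.array([np.exp(2 - a - 1.5 * b) / (1 + np.exp(2 - a - 1.5 * b))
               for a, b in patterns])
E_pi, E_pi2 = pi.mean(), np.mean(pi ** 2)
E_BS1 = E_pi - E_pi2
# pi ≈ [0.8808, 0.7311, 0.6225, 0.3775]; E_BS1 ≈ 0.1929
```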

From equation (3.7), the variance of BS_1 can be calculated as follows:

Var(BS_1) = (1/n) {E(π) − 5E(π^2) + 8E(π^3) − 4E(π^4)}

With E(π) = 0.6530, E(π^2) = 0.4601, E(π^3) = 0.3423, and E(π^4) = 0.2645,

Var(BS_1) = (1/n) {0.6530 − 5(0.4601) + 8(0.3423) − 4(0.2645)} ≈ 0.0328 (1/n),

so SD(BS_1) ≈ 0.1810/√n.

4.3 Resubstitution Method

I first used the resubstitution method to calculate the proposed estimate of prediction error. The resubstitution estimates of the prediction error, L^R, were calculated as in equation (3.2). To check whether over-fitting is a problem when calculating L by the resubstitution method, an independent data set with the same sample size is generated as a test set. The prediction error measured on the test set was calculated as L^New in equation (3.9). BS_2^R and BS_2^New were calculated as in equations (3.5) and (3.10), respectively. The comparison plots and the main statistical features of the different statistics are shown in Figure 4.2 and Table 4.2. Different sample sizes are used in the plots and calculations in order to identify the value to which L^R and L^New converge. The dashed lines in the plots indicate the theoretical value of E(BS_1). The red line in each plot has intercept 0 and slope 1; it indicates perfect agreement between the statistics on the horizontal and vertical axes — if the two statistics were exactly the same, the simulated circles would lie on the red line.

Results in Figure 4.2 show that the black circles in the plots of L^R vs. BS_2^R and of L^New vs. BS_2^New agree well with the ideal red line. Thus, the weighting scheme in the proposed estimate is appropriate for right-censored data. Besides, when sample
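The moment arithmetic for Var(BS_1) can be checked numerically (Python sketch, ours):

```python
import numpy as np

# moments of pi over the four equally likely covariate patterns
pi = np.array([np.exp(2 - a - 1.5 * b) / (1 + np.exp(2 - a - 1.5 * b))
               for a, b in [(0, 0), (1, 0), (0, 1), (1, 1)]])
m1, m2, m3, m4 = (np.mean(pi ** k) for k in (1, 2, 3, 4))
n_var = m1 - 5 * m2 + 8 * m3 - 4 * m4        # n * Var(BS_1)
# n_var ≈ 0.0328, so SD(BS_1) = sqrt(n_var / n) ≈ 0.181 / sqrt(n)
```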

size increases, the black circles in both the plots of L^R vs. BS_2^R and L^New vs. BS_2^New become more concentrated around the intersection of the two dashed lines, which means that L^R, L^New, BS_2^R, and BS_2^New all converge to E(BS_1) as the sample size increases. In addition, from the mean values of L^R and L^New in Table 4.2 we can conclude that L^R is always smaller than L^New; that is, the value of L calculated by the resubstitution method slightly underestimates the prediction error. The comparison between L^R and BS_2^New also demonstrates this underestimation, especially when the sample size is small.

Table 4.2: Prediction Error Measurement Using the Resubstitution Method

The table reports the mean and standard deviation of L^R, L^New, BS_2^R, and BS_2^New for sample sizes n = 60, 80, 100, 200, 500, and 1000.
Note: E(BS_1) = 0.1929; SD(BS_1) ≈ 0.1810/√n.

Figure 4.2: Plots with Different Sample Sizes Using the Resubstitution Method

4.4 Cross-Validation Method

The values of L^CV were calculated as in equation (3.3), and the values of BS_2^CV were calculated as in equation (3.8) with K = 5. The results are shown in Figure 4.3 and Table 4.3.

Table 4.3: Prediction Error Measurement Using the Cross-Validation Method

The table reports the mean and standard deviation of L^CV and BS_2^CV, together with the comparison statistics involving BS_2^New and L^New, for sample sizes n = 60, 80, 100, 200, 500, and 1000.
Note: E(BS_1) = 0.1929; SD(BS_1) ≈ 0.1810/√n.

Figure 4.3 presents results similar to those in Figure 4.2. The black circles in the plots of L^CV vs. BS_2^CV generally lie around the ideal red line, which means the weighting scheme still works well with the cross-validation method. As the sample size increases, the black circles concentrate more tightly around the ideal line and converge to E(BS_1). Table 4.3 compares the values of L calculated by the resubstitution and cross-validation methods on the same data sets. The results demonstrate that the value of L calculated by the cross-validation method is slightly larger than L^New; that is, the cross-validation method may tend to overestimate the prediction error. This is because each training set contains fewer patients than the whole observed data set. The difference between L^CV

and BS_2^New is small, especially when the sample size is large. This implies that the cross-validation method is quite accurate. Thus, in the following sections, I present values calculated by both the resubstitution and cross-validation methods.

Figure 4.3: Plots with Different Sample Sizes Using the Cross-Validation Method

4.5 Model Misspecification

The sensitivity of the proposed prediction error measurement to model misspecification is also of interest. In this section, I examine the performance of L^R and L^CV both with redundant variables and with a variable left out. The values calculated under misspecification are compared with the values from correctly specified models to assess the sensitivity of the proposed method. Similarly, I also compare the values calculated under misspecification with BS_1 and BS_2 (or BS_2^CV) to check the accuracy of the estimate and the performance of the IPCW scheme.

4.5.1 Model with Redundant Variables

Two redundant covariates z_3 and z_4 are generated, with z_3 ~ Bernoulli(0.5) and z_4 ~ N(0, 1). Neither z_3 nor z_4 is correlated with the uncure rate or the survival time in the simulation, but both are included when building the mixture cure models. The values of z_1 and z_2 in the data sets are exactly the same as those used in the previous two sections. Both the resubstitution and cross-validation methods are used to calculate the estimate of prediction error; the former is denoted L_R^R and the latter L_R^CV. BS_2R and BS_2R^CV represent the values of BS_2 for models with redundant variables calculated by the resubstitution and cross-validation methods, respectively. L_R^New and BS_2R^New denote the values of L and BS_2 for models with redundant variables calculated on a new test set. The results are presented in Figure 4.4 and Table 4.4.

Compared with the results in sections 4.3 and 4.4, we find that the weighting scheme still works well in models with redundant variables. All of L_R^R, L_R^CV, BS_2R, and BS_2R^CV converge to E(BS_1). Both the difference between L_R^R and L^R, and the

difference between L_R^CV and L^CV, are small, especially when the sample size is large. This means that models with redundant variables are not very different from the correctly specified models in terms of prediction error. In addition, the comparison between L_R^New and L_R^R demonstrates that, in models with redundant variables, the value of L calculated by the resubstitution method still underestimates the prediction error, while the value calculated by the cross-validation method remains accurate. The difference between L_R^R and L_R^New in Table 4.4 is larger than the difference between L^R and L^New in Table 4.2. This suggests that the resubstitution method underestimates the prediction error more severely when redundant variables are included in the model.

Table 4.4: Prediction Error Measurement for Models with Redundant Variables

The table reports the mean and standard deviation of L_R^CV, BS_2R^CV, L_R^R, BS_2R, L_R^New, and BS_2R^New, together with the comparisons L_R^CV − L^CV and L_R^R − L^R, for sample sizes n = 60, 80, 100, 200, 500, and 1000.
Note: E(BS_1) = 0.1929; SD(BS_1) ≈ 0.1810/√n.

Figure 4.4: Plots of Prediction Error for Models with Redundant Variables

4.5.2 Model with a Variable Left Out

Consider the same data sets as in sections 4.3 and 4.4. When building the prediction models, I omit one covariate (z_2) from the model. The values of the estimate calculated by the resubstitution and cross-validation methods are denoted L_M^R and L_M^CV, respectively. BS_2M and BS_2M^CV represent the values of BS_2 calculated by the resubstitution and cross-validation methods, respectively. L_M^New and BS_2M^New denote the values of L and BS_2 for models with the variable left out, calculated on a new test set. The results are presented in Figure 4.5 and Table 4.5.

Table 4.5: Prediction Error Measurement for Models with a Missing Variable

The table reports the mean and standard deviation of L_M^CV, BS_2M^CV, L_M^R, BS_2M, L_M^New, and BS_2M^New, together with the comparisons L_M^CV − L^CV and L_M^R − L^R, for sample sizes n = 60, 80, 100, 200, 500, and 1000.
Note: E(BS_1) = 0.1929; SD(BS_1) ≈ 0.1810/√n.

The comparison between L_M^R and L_M^New shows that when models are built with a covariate left out, the value of L calculated by the resubstitution method still underestimates the prediction error. The comparison between L_M^CV and BS_2M^New demonstrates that the value of L calculated by the cross-validation method overcomes the underestimation issue. Figure 4.5 demonstrates that when an important variable is missing, the

weighting scheme still works well, as the black circles in all the plots lie around the red line. This is similar to the finding in Graf et al. (1999) that the weighting scheme is robust against model misspecification. However, as the sample size increases, all the statistics converge to a value greater than E(BS_1). The bias is caused by the omission of important information (z_2) when building the cure models. The results also indicate that mixture cure models with a missing covariate tend to have larger prediction errors.

The results in sections 4.5.1 and 4.5.2 indicate that when models are misspecified, the values of L calculated by the resubstitution method underestimate the prediction error, while the values calculated by the cross-validation method overcome the underestimation problem. The IPCW weighting scheme works well even when models are misspecified. When models are built with redundant variables, the values of L, whether calculated by the resubstitution or the cross-validation method, are similar to those of the correct model, which means the models are not very different in terms of prediction error. However, when models are built with an important variable left out, both L^R and L^CV detect a larger prediction error in the misspecified models. Thus, the proposed estimate can be used to assess the importance of a variable in a model and to compare prediction models: if the prediction error of the model including a specific variable is smaller than that of the model without it, the variable is important and should be included in the prediction model.

Figure 4.5: Plots of Prediction Error for Models with a Missing Variable

4.6 Model Comparison

In this section, we compare the prediction errors of two models: one a semiparametric PH mixture cure model, the other a log-normal mixture cure model. The data are generated from the exponential mixture cure model described in section 4.2. Model 1 is built by fitting a semiparametric cure model in which the survival times of the uncured subjects follow Cox's proportional hazards model. Model 2 is built by fitting a parametric model that assumes the survival times of the uncured subjects follow a log-normal distribution. From the assumptions used in data generation, model 1 is a correct model and model 2 is a wrong model for these data. The values of L^R and L^CV calculated under model 1 are denoted L_m1^R and L_m1^CV, while those calculated under model 2 are denoted L_m2^R and L_m2^CV. Results under different sample sizes are presented in Table 4.6.

Table 4.6: Results for Model Comparison

The table reports L_m1^R, L_m2^R, L_m1^CV, and L_m2^CV, together with the differences L_m2^R − L_m1^R and L_m2^CV − L_m1^CV, for sample sizes n = 60, 80, 100, 200, 500, and 1000.

The results do not show any difference between the two models in terms of prediction error at any sample size. The reason is that both models produce similar cure rate estimates because of the long follow-up in the data. When the follow-up time is long, it is clear which subjects in the data set are cured and which are uncured. The models are

built on the uncured subjects in the data set. Thus, whatever distribution we assume for the survival times of the uncured subjects, the coefficients of the two models are similar, which in turn gives similar estimates of the cure rates.

To further compare the prediction error estimates of the two models when the follow-up in the data is shorter, I reduced the range of the censoring distribution of the simulated data to U[0, 13]. The censoring rate for the newly generated data is 47%. The K-M plot of the newly generated data is presented in Figure 4.6, and the results of the model comparison are presented in Table 4.7.

Figure 4.6: K-M Plot of the Simulated Data with a Narrower Censoring Distribution


Tied survival times; estimation of survival probabilities Tied survival times; estimation of survival probabilities Patrick Breheny November 5 Patrick Breheny Survival Data Analysis (BIOS 7210) 1/22 Introduction Tied survival times Introduction Breslow approximation

More information

Lecture 11. Interval Censored and. Discrete-Time Data. Statistics Survival Analysis. Presented March 3, 2016

Lecture 11. Interval Censored and. Discrete-Time Data. Statistics Survival Analysis. Presented March 3, 2016 Statistics 255 - Survival Analysis Presented March 3, 2016 Motivating Dan Gillen Department of Statistics University of California, Irvine 11.1 First question: Are the data truly discrete? : Number of

More information

Multistate models and recurrent event models

Multistate models and recurrent event models Multistate models Multistate models and recurrent event models Patrick Breheny December 10 Patrick Breheny Survival Data Analysis (BIOS 7210) 1/22 Introduction Multistate models In this final lecture,

More information

Residuals and model diagnostics

Residuals and model diagnostics Residuals and model diagnostics Patrick Breheny November 10 Patrick Breheny Survival Data Analysis (BIOS 7210) 1/42 Introduction Residuals Many assumptions go into regression models, and the Cox proportional

More information

ADVANCED STATISTICAL ANALYSIS OF EPIDEMIOLOGICAL STUDIES. Cox s regression analysis Time dependent explanatory variables

ADVANCED STATISTICAL ANALYSIS OF EPIDEMIOLOGICAL STUDIES. Cox s regression analysis Time dependent explanatory variables ADVANCED STATISTICAL ANALYSIS OF EPIDEMIOLOGICAL STUDIES Cox s regression analysis Time dependent explanatory variables Henrik Ravn Bandim Health Project, Statens Serum Institut 4 November 2011 1 / 53

More information

Semiparametric Models for Joint Analysis of Longitudinal Data and Counting Processes

Semiparametric Models for Joint Analysis of Longitudinal Data and Counting Processes Semiparametric Models for Joint Analysis of Longitudinal Data and Counting Processes by Se Hee Kim A dissertation submitted to the faculty of the University of North Carolina at Chapel Hill in partial

More information

Lecture 5 Models and methods for recurrent event data

Lecture 5 Models and methods for recurrent event data Lecture 5 Models and methods for recurrent event data Recurrent and multiple events are commonly encountered in longitudinal studies. In this chapter we consider ordered recurrent and multiple events.

More information

Quantile Regression for Residual Life and Empirical Likelihood

Quantile Regression for Residual Life and Empirical Likelihood Quantile Regression for Residual Life and Empirical Likelihood Mai Zhou email: mai@ms.uky.edu Department of Statistics, University of Kentucky, Lexington, KY 40506-0027, USA Jong-Hyeon Jeong email: jeong@nsabp.pitt.edu

More information

Chapter 2 Inference on Mean Residual Life-Overview

Chapter 2 Inference on Mean Residual Life-Overview Chapter 2 Inference on Mean Residual Life-Overview Statistical inference based on the remaining lifetimes would be intuitively more appealing than the popular hazard function defined as the risk of immediate

More information

Multistate models in survival and event history analysis

Multistate models in survival and event history analysis Multistate models in survival and event history analysis Dorota M. Dabrowska UCLA November 8, 2011 Research supported by the grant R01 AI067943 from NIAID. The content is solely the responsibility of the

More information

Lecture 3. Truncation, length-bias and prevalence sampling

Lecture 3. Truncation, length-bias and prevalence sampling Lecture 3. Truncation, length-bias and prevalence sampling 3.1 Prevalent sampling Statistical techniques for truncated data have been integrated into survival analysis in last two decades. Truncation in

More information

Multistate Modeling and Applications

Multistate Modeling and Applications Multistate Modeling and Applications Yang Yang Department of Statistics University of Michigan, Ann Arbor IBM Research Graduate Student Workshop: Statistics for a Smarter Planet Yang Yang (UM, Ann Arbor)

More information

Multi-state models: prediction

Multi-state models: prediction Department of Medical Statistics and Bioinformatics Leiden University Medical Center Course on advanced survival analysis, Copenhagen Outline Prediction Theory Aalen-Johansen Computational aspects Applications

More information

Statistical aspects of prediction models with high-dimensional data

Statistical aspects of prediction models with high-dimensional data Statistical aspects of prediction models with high-dimensional data Anne Laure Boulesteix Institut für Medizinische Informationsverarbeitung, Biometrie und Epidemiologie February 15th, 2017 Typeset by

More information

e author and the promoter give permission to consult this master dissertation and to copy it or parts of it for personal use. Each other use falls

e author and the promoter give permission to consult this master dissertation and to copy it or parts of it for personal use. Each other use falls e author and the promoter give permission to consult this master dissertation and to copy it or parts of it for personal use. Each other use falls under the restrictions of the copyright, in particular

More information

A new strategy for meta-analysis of continuous covariates in observational studies with IPD. Willi Sauerbrei & Patrick Royston

A new strategy for meta-analysis of continuous covariates in observational studies with IPD. Willi Sauerbrei & Patrick Royston A new strategy for meta-analysis of continuous covariates in observational studies with IPD Willi Sauerbrei & Patrick Royston Overview Motivation Continuous variables functional form Fractional polynomials

More information

On Measurement Error Problems with Predictors Derived from Stationary Stochastic Processes and Application to Cocaine Dependence Treatment Data

On Measurement Error Problems with Predictors Derived from Stationary Stochastic Processes and Application to Cocaine Dependence Treatment Data On Measurement Error Problems with Predictors Derived from Stationary Stochastic Processes and Application to Cocaine Dependence Treatment Data Yehua Li Department of Statistics University of Georgia Yongtao

More information

Statistics in medicine

Statistics in medicine Statistics in medicine Lecture 4: and multivariable regression Fatma Shebl, MD, MS, MPH, PhD Assistant Professor Chronic Disease Epidemiology Department Yale School of Public Health Fatma.shebl@yale.edu

More information

Longitudinal + Reliability = Joint Modeling

Longitudinal + Reliability = Joint Modeling Longitudinal + Reliability = Joint Modeling Carles Serrat Institute of Statistics and Mathematics Applied to Building CYTED-HAROSA International Workshop November 21-22, 2013 Barcelona Mainly from Rizopoulos,

More information

Instrumental variables estimation in the Cox Proportional Hazard regression model

Instrumental variables estimation in the Cox Proportional Hazard regression model Instrumental variables estimation in the Cox Proportional Hazard regression model James O Malley, Ph.D. Department of Biomedical Data Science The Dartmouth Institute for Health Policy and Clinical Practice

More information

For right censored data with Y i = T i C i and censoring indicator, δ i = I(T i < C i ), arising from such a parametric model we have the likelihood,

For right censored data with Y i = T i C i and censoring indicator, δ i = I(T i < C i ), arising from such a parametric model we have the likelihood, A NOTE ON LAPLACE REGRESSION WITH CENSORED DATA ROGER KOENKER Abstract. The Laplace likelihood method for estimating linear conditional quantile functions with right censored data proposed by Bottai and

More information

Mixture modelling of recurrent event times with long-term survivors: Analysis of Hutterite birth intervals. John W. Mac McDonald & Alessandro Rosina

Mixture modelling of recurrent event times with long-term survivors: Analysis of Hutterite birth intervals. John W. Mac McDonald & Alessandro Rosina Mixture modelling of recurrent event times with long-term survivors: Analysis of Hutterite birth intervals John W. Mac McDonald & Alessandro Rosina Quantitative Methods in the Social Sciences Seminar -

More information

Chapter 4 Fall Notations: t 1 < t 2 < < t D, D unique death times. d j = # deaths at t j = n. Y j = # at risk /alive at t j = n

Chapter 4 Fall Notations: t 1 < t 2 < < t D, D unique death times. d j = # deaths at t j = n. Y j = # at risk /alive at t j = n Bios 323: Applied Survival Analysis Qingxia (Cindy) Chen Chapter 4 Fall 2012 4.2 Estimators of the survival and cumulative hazard functions for RC data Suppose X is a continuous random failure time with

More information

PENALIZED LIKELIHOOD PARAMETER ESTIMATION FOR ADDITIVE HAZARD MODELS WITH INTERVAL CENSORED DATA

PENALIZED LIKELIHOOD PARAMETER ESTIMATION FOR ADDITIVE HAZARD MODELS WITH INTERVAL CENSORED DATA PENALIZED LIKELIHOOD PARAMETER ESTIMATION FOR ADDITIVE HAZARD MODELS WITH INTERVAL CENSORED DATA Kasun Rathnayake ; A/Prof Jun Ma Department of Statistics Faculty of Science and Engineering Macquarie University

More information

Missing Covariate Data in Matched Case-Control Studies

Missing Covariate Data in Matched Case-Control Studies Missing Covariate Data in Matched Case-Control Studies Department of Statistics North Carolina State University Paul Rathouz Dept. of Health Studies U. of Chicago prathouz@health.bsd.uchicago.edu with

More information

Two-stage Adaptive Randomization for Delayed Response in Clinical Trials

Two-stage Adaptive Randomization for Delayed Response in Clinical Trials Two-stage Adaptive Randomization for Delayed Response in Clinical Trials Guosheng Yin Department of Statistics and Actuarial Science The University of Hong Kong Joint work with J. Xu PSI and RSS Journal

More information

Estimating a time-dependent concordance index for survival prediction models with covariate dependent censoring

Estimating a time-dependent concordance index for survival prediction models with covariate dependent censoring Noname manuscript No. (will be inserted by the editor) Estimating a time-dependent concordance index for survival prediction models with covariate dependent censoring Thomas A. Gerds 1, Michael W Kattan

More information

Continuous Time Survival in Latent Variable Models

Continuous Time Survival in Latent Variable Models Continuous Time Survival in Latent Variable Models Tihomir Asparouhov 1, Katherine Masyn 2, Bengt Muthen 3 Muthen & Muthen 1 University of California, Davis 2 University of California, Los Angeles 3 Abstract

More information

Approximation of Survival Function by Taylor Series for General Partly Interval Censored Data

Approximation of Survival Function by Taylor Series for General Partly Interval Censored Data Malaysian Journal of Mathematical Sciences 11(3): 33 315 (217) MALAYSIAN JOURNAL OF MATHEMATICAL SCIENCES Journal homepage: http://einspem.upm.edu.my/journal Approximation of Survival Function by Taylor

More information

NONPARAMETRIC ADJUSTMENT FOR MEASUREMENT ERROR IN TIME TO EVENT DATA: APPLICATION TO RISK PREDICTION MODELS

NONPARAMETRIC ADJUSTMENT FOR MEASUREMENT ERROR IN TIME TO EVENT DATA: APPLICATION TO RISK PREDICTION MODELS BIRS 2016 1 NONPARAMETRIC ADJUSTMENT FOR MEASUREMENT ERROR IN TIME TO EVENT DATA: APPLICATION TO RISK PREDICTION MODELS Malka Gorfine Tel Aviv University, Israel Joint work with Danielle Braun and Giovanni

More information

Lecture 7 Time-dependent Covariates in Cox Regression

Lecture 7 Time-dependent Covariates in Cox Regression Lecture 7 Time-dependent Covariates in Cox Regression So far, we ve been considering the following Cox PH model: λ(t Z) = λ 0 (t) exp(β Z) = λ 0 (t) exp( β j Z j ) where β j is the parameter for the the

More information

Survival Analysis for Case-Cohort Studies

Survival Analysis for Case-Cohort Studies Survival Analysis for ase-ohort Studies Petr Klášterecký Dept. of Probability and Mathematical Statistics, Faculty of Mathematics and Physics, harles University, Prague, zech Republic e-mail: petr.klasterecky@matfyz.cz

More information

Multistate models and recurrent event models

Multistate models and recurrent event models and recurrent event models Patrick Breheny December 6 Patrick Breheny University of Iowa Survival Data Analysis (BIOS:7210) 1 / 22 Introduction In this final lecture, we will briefly look at two other

More information

POWER AND SAMPLE SIZE DETERMINATIONS IN DYNAMIC RISK PREDICTION. by Zhaowen Sun M.S., University of Pittsburgh, 2012

POWER AND SAMPLE SIZE DETERMINATIONS IN DYNAMIC RISK PREDICTION. by Zhaowen Sun M.S., University of Pittsburgh, 2012 POWER AND SAMPLE SIZE DETERMINATIONS IN DYNAMIC RISK PREDICTION by Zhaowen Sun M.S., University of Pittsburgh, 2012 B.S.N., Wuhan University, China, 2010 Submitted to the Graduate Faculty of the Graduate

More information

3003 Cure. F. P. Treasure

3003 Cure. F. P. Treasure 3003 Cure F. P. reasure November 8, 2000 Peter reasure / November 8, 2000/ Cure / 3003 1 Cure A Simple Cure Model he Concept of Cure A cure model is a survival model where a fraction of the population

More information

Semiparametric Mixed Effects Models with Flexible Random Effects Distribution

Semiparametric Mixed Effects Models with Flexible Random Effects Distribution Semiparametric Mixed Effects Models with Flexible Random Effects Distribution Marie Davidian North Carolina State University davidian@stat.ncsu.edu www.stat.ncsu.edu/ davidian Joint work with A. Tsiatis,

More information

Classification: Linear Discriminant Analysis

Classification: Linear Discriminant Analysis Classification: Linear Discriminant Analysis Discriminant analysis uses sample information about individuals that are known to belong to one of several populations for the purposes of classification. Based

More information

Analysis of competing risks data and simulation of data following predened subdistribution hazards

Analysis of competing risks data and simulation of data following predened subdistribution hazards Analysis of competing risks data and simulation of data following predened subdistribution hazards Bernhard Haller Institut für Medizinische Statistik und Epidemiologie Technische Universität München 27.05.2013

More information

Statistics 262: Intermediate Biostatistics Regression & Survival Analysis

Statistics 262: Intermediate Biostatistics Regression & Survival Analysis Statistics 262: Intermediate Biostatistics Regression & Survival Analysis Jonathan Taylor & Kristin Cobb Statistics 262: Intermediate Biostatistics p.1/?? Introduction This course is an applied course,

More information

Semiparametric Generalized Linear Models

Semiparametric Generalized Linear Models Semiparametric Generalized Linear Models North American Stata Users Group Meeting Chicago, Illinois Paul Rathouz Department of Health Studies University of Chicago prathouz@uchicago.edu Liping Gao MS Student

More information

Machine Learning. Module 3-4: Regression and Survival Analysis Day 2, Asst. Prof. Dr. Santitham Prom-on

Machine Learning. Module 3-4: Regression and Survival Analysis Day 2, Asst. Prof. Dr. Santitham Prom-on Machine Learning Module 3-4: Regression and Survival Analysis Day 2, 9.00 16.00 Asst. Prof. Dr. Santitham Prom-on Department of Computer Engineering, Faculty of Engineering King Mongkut s University of

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 2009 Paper 248 Application of Time-to-Event Methods in the Assessment of Safety in Clinical Trials Kelly

More information

Faculty of Health Sciences. Regression models. Counts, Poisson regression, Lene Theil Skovgaard. Dept. of Biostatistics

Faculty of Health Sciences. Regression models. Counts, Poisson regression, Lene Theil Skovgaard. Dept. of Biostatistics Faculty of Health Sciences Regression models Counts, Poisson regression, 27-5-2013 Lene Theil Skovgaard Dept. of Biostatistics 1 / 36 Count outcome PKA & LTS, Sect. 7.2 Poisson regression The Binomial

More information

Introduction to Empirical Processes and Semiparametric Inference Lecture 01: Introduction and Overview

Introduction to Empirical Processes and Semiparametric Inference Lecture 01: Introduction and Overview Introduction to Empirical Processes and Semiparametric Inference Lecture 01: Introduction and Overview Michael R. Kosorok, Ph.D. Professor and Chair of Biostatistics Professor of Statistics and Operations

More information

Proportional hazards model for matched failure time data

Proportional hazards model for matched failure time data Mathematical Statistics Stockholm University Proportional hazards model for matched failure time data Johan Zetterqvist Examensarbete 2013:1 Postal address: Mathematical Statistics Dept. of Mathematics

More information

BIAS OF MAXIMUM-LIKELIHOOD ESTIMATES IN LOGISTIC AND COX REGRESSION MODELS: A COMPARATIVE SIMULATION STUDY

BIAS OF MAXIMUM-LIKELIHOOD ESTIMATES IN LOGISTIC AND COX REGRESSION MODELS: A COMPARATIVE SIMULATION STUDY BIAS OF MAXIMUM-LIKELIHOOD ESTIMATES IN LOGISTIC AND COX REGRESSION MODELS: A COMPARATIVE SIMULATION STUDY Ingo Langner 1, Ralf Bender 2, Rebecca Lenz-Tönjes 1, Helmut Küchenhoff 2, Maria Blettner 2 1

More information

Extending causal inferences from a randomized trial to a target population

Extending causal inferences from a randomized trial to a target population Extending causal inferences from a randomized trial to a target population Issa Dahabreh Center for Evidence Synthesis in Health, Brown University issa dahabreh@brown.edu January 16, 2019 Issa Dahabreh

More information

Improving Efficiency of Inferences in Randomized Clinical Trials Using Auxiliary Covariates

Improving Efficiency of Inferences in Randomized Clinical Trials Using Auxiliary Covariates Improving Efficiency of Inferences in Randomized Clinical Trials Using Auxiliary Covariates Anastasios (Butch) Tsiatis Department of Statistics North Carolina State University http://www.stat.ncsu.edu/

More information

Lecture 9. Statistics Survival Analysis. Presented February 23, Dan Gillen Department of Statistics University of California, Irvine

Lecture 9. Statistics Survival Analysis. Presented February 23, Dan Gillen Department of Statistics University of California, Irvine Statistics 255 - Survival Analysis Presented February 23, 2016 Dan Gillen Department of Statistics University of California, Irvine 9.1 Survival analysis involves subjects moving through time Hazard may

More information

STAT 6350 Analysis of Lifetime Data. Probability Plotting

STAT 6350 Analysis of Lifetime Data. Probability Plotting STAT 6350 Analysis of Lifetime Data Probability Plotting Purpose of Probability Plots Probability plots are an important tool for analyzing data and have been particular popular in the analysis of life

More information