A Comparison of Multiple Imputation Methods for Missing Covariate Values in Recurrent Event Data

By Zhao Huo
Department of Statistics, Uppsala University
Supervisor: Ronnie Pingel
2015

Abstract

Multiple imputation (MI) is a commonly used approach to impute missing data. This thesis studies missing covariates in recurrent event data and discusses ways to include the survival outcomes in the imputation model. The MI methods under consideration combine the event indicator D with, respectively, the observed event or censoring time T, the logarithm of T, and the cumulative baseline hazard H_0(T). After imputation, the complete data analysis can proceed. The Cox proportional hazards (PH) model and the PWP model are chosen as the analysis models, and their coefficient estimates are of substantive interest. A Monte Carlo simulation study is conducted to compare the different MI methods, using relative bias and mean square error as evaluation criteria. Furthermore, an empirical study is conducted on cardiovascular disease event data containing missing values. Overall, the results show that MI based on the Nelson-Aalen estimate of H_0(T) is preferred in most circumstances.

Keywords: Missing data; Multiple imputation; Missing covariates; Recurrent event data; Cox PH model; PWP model

Contents

1 Introduction
2 Theory and Methods for Missing Data
3 Recurrent Event Data
  3.1 Analyzing Data with Recurrent Events
4 Missing Data in Survival Analysis
5 Simulation Study
  5.1 Study Design
  5.2 Generating Complete Data Sets
  5.3 Generating Missing Data
  5.4 Results
6 Empirical Study
7 Conclusion
Acknowledgements
References

1 Introduction

Missing data are a common problem in many research areas. In survey-based research, for example, respondents may refuse to answer certain questions for privacy reasons, or may simply be unable to attend an evaluation meeting. In clinical research, missing data are also a major issue, especially in longitudinal clinical trials. In survival analysis, the data deficiency usually occurs in the covariates. This is because the survival outcomes often consist of easily measured data, such as deaths or event times recorded in registries, whereas some covariate measurements can be missing. For instance, a blood sample may be accidentally dropped on the floor, making post-processing of the sample impossible, or a question related to a clinical study may be left unanswered in a questionnaire.

This thesis considers the issue of missing covariates in recurrent event data. Recurrent event data are one type of multiple failure time data. They arise in survival analysis when each study subject may experience two or more events (failures) of the same kind. For example, patients with heart disease having a second (or further) myocardial infarction can be seen as experiencing a recurrent event, whereas first having a myocardial infarction and then a bone fracture involves events of different kinds. The data are assumed to be right censored: a subject is censored if it has not experienced any event by the end of the study or can no longer be followed up, whichever occurs first. For example, in a project examining stroke occurrence, patients who have never experienced a stroke by the end of the project are right censored. It is also assumed that the censoring mechanism is non-informative, meaning that the censoring time of each subject is independent of its failure time.

Rubin (1987) proposed multiple imputation (MI) as a basis for statistical inference, and it is now a commonly used method to impute missing data. Rubin (1996) further argued that MI is the method of choice since complete data sets are available to the end user after imputation. The basic idea of MI is to replace each missing value with a number of plausible values based on the distribution of the observed data, so that several data sets containing no missing observations are generated. Moons et al. (2006) pointed out that, regardless of the missingness mechanism, imputation of missing predictor values using the outcome is preferred over imputation without the outcome. To perform MI in survival data using the outcome, several papers have considered outcome information including the event indicator D, the observed event or censoring time T, and/or the logarithm of T.

In White and Royston (2009), different imputation models were compared in the setting of a single incomplete covariate, and the authors recommended that MI be based on the Nelson-Aalen estimate of the cumulative hazard. Considering recurrent event data with missing covariates, we are interested in how the extra information from recurrent events can be incorporated into MI when analyzing survival data. A Monte Carlo simulation study is conducted to evaluate the different MI methods based on two analysis models: the Cox proportional hazards (PH) model and the PWP model. The coefficient estimates of the analysis models are of substantive interest, and the assessment of the different methods provides guidance for handling missing data. Thus, our primary aim is to develop a more efficient way of including survival outcomes in the imputation model.

The structure of the thesis is as follows: Section 2 provides the necessary methodological background, covering missing data and MI. Section 3 introduces recurrent event data and the analysis models. Section 4 describes missing data in survival analysis. Section 5 compares the methods through a comprehensive simulation study. Section 6 contains an empirical application. Section 7 concludes.

2 Theory and Methods for Missing Data

Missing data are usually classified into three categories according to the reason for the incompleteness. Missing completely at random (MCAR) means that the probability of data being missing is independent of any characteristics of the subjects. Missing at random (MAR) means that the probability of data being missing does not depend on the unobserved data, conditional on the observed data. Finally, missing not at random (MNAR) means that the probability of data being missing depends on the unobserved data. MCAR is convenient for data analysis, since parameter estimates remain unbiased despite the absence of data. An example of MCAR is a survey participant who is unable to attend a meeting on time because of a traffic jam. MAR occurs if the cause of missingness is related to another observed variable. For instance, if people refuse to report their salary based entirely on their age, then the salary data are MAR.

MNAR is more complex than MAR and MCAR. An example is when both low- and high-income people are more likely to omit their salary than those with an ordinary income; the nonresponse probability for salary then depends on values that might themselves be missing.

According to Little and Rubin (1987), standard methods have been developed to analyze rectangular data sets, where the rows are seen as subjects and the columns as variables measured for each subject. Since missing data arise in almost all kinds of trials, some of the entries in the data matrix will not be observed. In general, missing data may cause bias and loss of efficiency. Depending on the type of missingness, different approaches to handling missing data have been suggested. In the case of MCAR, a simple approach is to exclude subjects with missing values, usually referred to as complete case (CC) analysis. Under MCAR, CC analysis remains unbiased but is less efficient; when data are not MCAR, CC analysis can be biased. CC analysis is convenient, but it has one major drawback: discarding cases can lead to a reduction of statistical power. Instead of wasting data, ad hoc methods such as mean imputation, last observation carried forward, and random hot deck imputation have been used historically. However, not only do these methods fail to reduce bias, they are also single imputation methods and may therefore underestimate the standard errors of the estimates (Little and Rubin, 1987).

To overcome the weaknesses of the ad hoc methods, two principled techniques for handling missing data have been proposed: maximum likelihood (ML) and multiple imputation (MI). ML analyzes the full incomplete data set: it separates the complete and incomplete data, computes a likelihood function for each part, and then maximizes these likelihoods to obtain the parameter estimates. ML is also model specific, and different types of models tend to have different likelihoods. Enders (2001) showed that under MCAR and MAR, the ML estimates are less biased and generally more efficient than those of ad hoc methods. In practice, the ML estimates are slightly more efficient than the MI estimates; nonetheless, MI is more general than ML. Because the focus here is on survival analysis, MI is the method of choice for handling missing data. Donders et al. (2006) concluded that when data are MCAR or MAR, the MI approach leads to unbiased results with correct standard errors. For this reason, our study is limited to situations where the missingness mechanism is MCAR or MAR.

Originally proposed by Rubin (1978, 1987), MI is now a commonly used approach to the missing data problem. Based on the Bayesian paradigm, MI involves drawing missing data from a posterior predictive distribution conditional on the observed data. The idea is to replace each missing value with a set of m acceptable values (m > 1), generating m imputed data sets. According to Schafer (1999), a small number of imputations is usually enough, normally between 5 and 10. A more recent study by White et al. (2011) suggested that m should be at least equal to the percentage of incomplete cases; for instance, a sample with 20 percent incomplete cases would indicate m = 20. This rule of thumb is used to choose m in our simulation study. After imputation, the imputed data sets are analyzed separately, and the results of the m analyses are combined to obtain the estimates of interest. One advantage of MI is that any analysis method can be applied once the complete data sets are available. MI thus provides a computationally feasible approach that can be applied to a wide range of situations. MI consists of three steps:

1. Generating m imputed data sets. Consider the simple case of a single incomplete variable X, a vector of complete covariates Z and a complete outcome Y. A model f(X | Y, Z; α) is fitted among the subjects with observed X, yielding the estimate α̂ with variance-covariance matrix S_α. The imputation parameter α* is then drawn from an approximate N(α̂, S_α) distribution, the missing values are generated from the posterior predictive distribution f(X | Y, Z; α*), and the whole process is repeated m times.

2. Analyzing each of the imputed data sets separately. After imputation, we can proceed to the complete data analysis. Each imputed data set is analyzed separately to obtain the estimates of interest, e.g. regression coefficients.

3. Combining the estimates using Rubin's rules. Suppose β̂^(j) is the point estimate from the jth imputed data set (j = 1, 2, ..., m) and W^(j) is the estimated variance of β̂^(j). Then the overall estimate is

\hat{\beta} = \frac{1}{m} \sum_{j=1}^{m} \hat{\beta}^{(j)}, \quad (1)

with variance

Var(\hat{\beta}) = \bar{W} + \left(1 + \frac{1}{m}\right) B, \quad (2)

where W̄ and B are the within- and between-imputation variances, respectively:

\bar{W} = \frac{1}{m} \sum_{j=1}^{m} W^{(j)}, \quad (3)

B = \frac{1}{m-1} \sum_{j=1}^{m} \left(\hat{\beta}^{(j)} - \hat{\beta}\right)^2. \quad (4)
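In practice the pool() function of the mice package performs this pooling; the sketch below simply writes equations (1)-(4) out in R, under the assumption that the m point estimates and their within-imputation variances have already been collected in the (hypothetical) vectors est and var_w.

```r
# Minimal sketch of Rubin's rules, equations (1)-(4).
pool_rubin <- function(est, var_w) {
  m <- length(est)
  beta_bar  <- mean(est)                     # overall estimate, eq. (1)
  W_bar     <- mean(var_w)                   # within-imputation variance, eq. (3)
  B         <- sum((est - beta_bar)^2) / (m - 1)  # between-imputation variance, eq. (4)
  total_var <- W_bar + (1 + 1 / m) * B       # total variance, eq. (2)
  c(estimate = beta_bar, variance = total_var)
}

# Example with m = 5 hypothetical imputations
pool_rubin(est   = c(0.48, 0.52, 0.50, 0.47, 0.53),
           var_w = c(0.010, 0.012, 0.011, 0.009, 0.010))
```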

If more than one variable is incomplete, multiple imputation by chained equations (MICE) can be used. MICE is an iterative imputation method that uses separate, independent chains to obtain the imputations. Lee and Carlin (2010) compared MICE with multivariate normal imputation (MVN); MICE is more flexible than MVN since it does not rely on the assumption of multivariate normality. Furthermore, the variable-by-variable specification of MICE makes it easy to build models even when the conditionals are incompatible. Azur et al. (2011) describe the procedure for carrying out MICE. Initially, all missing values are filled in by simple single imputations, for example the variable means. Then, for one of the incomplete variables, the filled-in values are set back to missing and the variable is imputed by performing the same process as Step 1 of MI, conditional on the other variables. This is repeated for each variable that has missing data, and looping through all variables once constitutes one cycle, so every missing value is updated in each cycle. Royston (2004) considered fewer than 10 cycles to be enough; an imputed data set is created after 10 cycles, and the whole procedure is repeated m times. Further aspects of using MICE, including the principles of the method, guidance on model specification and the practical analysis of multiply imputed data, are discussed by White et al. (2011).
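A minimal sketch of this workflow with the R package mice is given below; the data frame dat and its variables are hypothetical, and the settings mirror the choices described in the text (m = 20 imputed data sets, linear regression imputation, 10 cycles).

```r
# Minimal MICE sketch with the mice package; `dat` is a hypothetical data frame
# containing the incomplete covariates together with the fully observed variables.
library(mice)

imp <- mice(dat,
            m      = 20,      # number of imputed data sets (rule of thumb from the text)
            method = "norm",  # Bayesian linear regression for continuous variables
            maxit  = 10,      # cycles per imputed data set
            seed   = 123)

imputed_1 <- complete(imp, action = 1)   # extract the first imputed data set
```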

3 Recurrent Event Data

Survival data are generally defined as data consisting of the time to some event and the censoring information for each subject. The event can be, for example, the occurrence of a specific disease or death. One special case of survival data, recurrent event data, is the main emphasis of this thesis. To make it easier to understand, we first introduce multiple failure time data. Multiple failure time data arise from time-to-event studies where each study subject may experience two or more events (failures). The failures can be either of the same kind or of an entirely different nature. One characteristic of this kind of data is that the survival times may be correlated, so the analysis is complicated by the dependence between related failure times. Wei and Glidden (1997) summarized statistical methods for analyzing multiple failure time data.

Multiple failure time data can be classified into two categories: ordered and unordered (e.g. Therneau 1997, chapter 8); other names for these two categories include longitudinal and clustered, or serial and parallel. Ordered failure time data refers to data in which the events are naturally ordered and occur in a certain sequence over time. For example, define K as the maximum number of events within a subject and N as the total sample size. The event time of the kth (k = 1, 2, ..., K) event for the ith (i = 1, 2, ..., N) subject can then be written T_ik. For ordered data, the occurrence times satisfy the constraint T_ik ≥ T_i,k−1. The most popular type of ordered data is recurrent event data. Recurrent event data arise in survival analysis when each study subject experiences two or more failures of the same kind. They are commonly encountered in longitudinal studies where each subject may experience the same event repeatedly over time. An example is the analysis of a cohort of infants with bronchial obstruction: Villegas et al. (2013) studied cases where a physician revisits the children at various times, and each visit can be seen as a recurrent event. Unordered failure time data are data in which each subject is at risk of several failure processes simultaneously. This is the case when a subject experiences several entirely different events, such as a patient who suffers a heart attack and is also infected with other diseases. An important feature of unordered data is that every event can occur at any time during the study, so there is no restriction on the sequential order.

3.1 Analyzing Data with Recurrent Events

Focusing on ordered recurrent event data, Cox-based models are used for the analysis. There are four key components characterizing the Cox-based models: risk interval, risk set, baseline hazard and correlation adjustment. Among these, the risk interval and risk set are pivotal in choosing a model.

Figure 1: Illustrations of risk interval formulations.

The risk interval refers to the time scale used to define when a subject is at risk of having an event. There are three possible ways of defining a risk interval: gap time, total time and counting process, each representing a different substantive type of risk process. The gap time is the time from the prior event, i.e. the clock is reset to zero when each event begins. Total time and counting process use the same time scale; what distinguishes them is how they set up the "left time". Under the counting process formulation, a subject is not considered to be at risk for the kth event until the (k−1)th event has happened, while total time uses the time from a common starting point, e.g. the beginning of the study. When dealing with recurrent event data, both gap time and counting process are applicable definitions of the risk interval. Figure 1 gives a more intuitive picture of the risk interval formulations.

The kth risk set contains the individuals who are at risk for the kth event. The risk set definition is tied to the choice of baseline hazard: common or event-specific. A common baseline hazard means a model with the same underlying hazard for all events, whereas an event-specific formulation allows the baseline hazard to differ between events. There are three possible risk sets: restricted, semi-restricted and unrestricted. With a restricted risk set, only subjects who have experienced the (k−1)th event contribute to the kth risk set; the restricted risk set has event-specific baseline hazards and is the preferred risk set for recurrent event data. A semi-restricted risk set also has event-specific baseline hazards and allows subjects to contribute to the risk set of the kth event at time t as long as they have not yet experienced the kth event. An unrestricted risk set allows the risk intervals of all subjects to contribute to the risk set for any event and has a common baseline hazard. The partial likelihood functions of the Cox-based models differ in the composition of the risk set.

The Cox PH model developed by Cox (1972) is the most widely used model for survival data and a powerful method for analyzing time-to-event occurrence. The usual Cox PH model requires the events to be independent, but in recurrent event data the events may be correlated within subjects. Moreover, the observation time in the standard model ends at the first occurrence or censoring, so repeated events are disregarded. Hence, several extensions of the Cox PH approach for analyzing multiple events have been proposed. Ezell et al. (2003) summarized the Cox-based models as hybrid versions of the single-event Cox model, specified in one of two ways:

h_{ik}(t) = h_0(t) \exp(\beta' Z_{ik}), \quad (5)

h_{ik}(t) = h_{0k}(t) \exp(\beta_{(k)}' Z_{ik}). \quad (6)

Here Z_ik = (Z_i1k, ..., Z_ipk)' is a p-dimensional covariate vector for the kth event of the ith subject, β denotes the vector of common regression coefficients, and β_(k) denotes the vector of event-specific regression coefficients. In Equation (5), h_0(t) represents a common baseline hazard for all events; in Equation (6), h_0k(t) is an event-specific baseline hazard for the kth event. Both h_0(t) and h_0k(t) are non-negative. Finally, h_ik(t) is the hazard function of the ith subject with respect to the kth event.

For analyzing recurrent event data, the most common Cox-based models are: Andersen and Gill (AG); Prentice, Williams and Peterson with counting process (PWP-CP) or gap time (PWP-GT) intervals; and Wei, Lin and Weissfeld (WLW). These models differ essentially in the risk interval and risk set. The AG model is specified with a common baseline hazard, an unrestricted risk set and the counting process formulation. The PWP models are specified with event-specific baseline hazards, restricted risk sets, and either the counting process or the gap time risk interval; PWP-CP differs from PWP-GT by using the counting process instead of the gap time formulation. The WLW model is specified with event-specific baseline hazards, a semi-restricted risk set and the total time formulation. Villegas et al. (2013) concluded that no model can be recommended as the best in all situations. Castañeda and Gerritse (2010) suggested that both the AG and the PWP models are applicable when analyzing repeated failures of the same type, and Kelly and Lim (2000) suggested that the PWP model is more appropriate for the analysis of recurrent event data.

Thus, the PWP model is selected as the analysis model rather than the other Cox-based models. Based on our empirical data, the counting process is the more suitable formulation of the risk interval. Liu (2012) pointed out that although the PWP models specify event-specific regression coefficients β_(k), overall estimates β̂ can be obtained by fitting a common covariate vector. This can be achieved by assuming that the covariates are not event-varying, such as sex. Hence, the analysis model for the recurrent event data is

h_{ik}(t) = h_{0k}(t) \exp(\beta' Z_i), \quad (7)

where t_{k−1} < t ≤ t_k and Z_i = (Z_i1, ..., Z_ip)' is a p-dimensional covariate vector of the ith subject; the other quantities are defined as above. The overall estimates β̂ are of substantive interest. When we are instead interested in the regression coefficients of the first event, the analysis model is the usual Cox PH model

h_{i1}(t) = h_{01}(t) \exp(\beta_{(1)}' Z_i). \quad (8)

For this model, the coefficient estimates of the first event, β̂_(1), are of substantive interest.
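Models (7) and (8) can be fitted with the survival package in R. The sketch below is a minimal illustration under the assumption of a hypothetical data frame recurrent_dat in counting-process form, with columns id, enum (event number), tstart, tstop, status and covariates x and z.

```r
library(survival)

# PWP-CP version of model (7): event-specific baseline hazards via strata(enum),
# counting-process risk intervals via (tstart, tstop], and robust standard errors
# clustered on subject id.
pwp_fit <- coxph(Surv(tstart, tstop, status) ~ x + z + cluster(id) + strata(enum),
                 data = recurrent_dat)

# Model (8): ordinary Cox PH model restricted to each subject's first event
first_fit <- coxph(Surv(tstop, status) ~ x + z,
                   data = subset(recurrent_dat, enum == 1))

summary(pwp_fit)
summary(first_fit)
```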

4 Missing Data in Survival Analysis

So far, relatively few studies have investigated missing data in survival analysis. White and Royston (2009) compared different MI models in the setting of a single incomplete covariate and recommended that MI be based on the Nelson-Aalen estimate of the cumulative hazard. Paik (1997) presented three MI estimators for the Cox model with missing covariates. Mbougua et al. (2013) compared different MI methods for handling nonlinear continuous covariates, and their simulation results showed that MI by splines should be used in such situations. Van Buuren et al. (1999) used MI for missing blood pressure covariates. Giorgi et al. (2008) performed MI in regression analysis of relative survival and concluded that missing data in covariates should be modeled, with MICE offering an attractive choice. However, the effect of missing covariate data in multivariate survival models, which is the aim of this thesis, has not previously been investigated.

Figure 2: Recurrent event data with a missing covariate.

A hypothetical example is used to illustrate the missing data patterns in recurrent event data. Suppose there are two covariates X and Z in the model, neither of which is event-varying. X denotes the incomplete covariate, with missing values at random positions, and Z denotes the complete covariate. Assume that each subject experiences at most two events, with observed event or censoring times T_1 and T_2 and corresponding event indicators D_1 and D_2. Additionally, assume that for subjects 1 and 3 both events occur, for subject 2 only the first event occurs, and for subject 4 no event occurs; missing data arise for subject 3. Figure 2 gives a pictorial representation of the missing data patterns.

As mentioned previously, carrying out MI involves choosing the variables in the imputation model. Three imputation models are formed using the event indicator D combined with, respectively, the observed event or censoring time T, the logarithm of T (i.e. log T) and the cumulative baseline hazard H_0(T). However, H_0(T) is unknown and can only be estimated. A possible way to estimate it is to use the Nelson-Aalen estimator: H_0(T) is approximately equal to H(T) when the coefficients β are small. An estimator of H(T) was suggested by Nelson (1972) and studied by Aalen (1978), and it is therefore known as the Nelson-Aalen estimator. It is given by

\hat{H}(t) = \sum_{t_i \le t} \frac{d_i}{n_i}, \quad (9)

where i = 1, 2, ..., δ and t_1 < t_2 < ... < t_δ are the δ distinct failure times in a sample of n subjects; at time t_i there are d_i failures and n_i subjects at risk. Since the outcomes are fully observed, the Nelson-Aalen estimator can be computed before imputation.
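Equation (9) is simple to compute directly; the sketch below gives a small hand-rolled version using hypothetical column names time and status.

```r
# Hand-rolled Nelson-Aalen estimate of the cumulative hazard, eq. (9):
# H(t) = sum over distinct event times t_i <= t of d_i / n_i.
nelson_aalen_est <- function(time, status) {
  event_times <- sort(unique(time[status == 1]))
  d <- sapply(event_times, function(s) sum(time == s & status == 1))  # events at s
  n <- sapply(event_times, function(s) sum(time >= s))                # at risk just before s
  H <- cumsum(d / n)
  # evaluate the step function H(t) at each subject's observed time
  stepfun(event_times, c(0, H))(time)
}

# Hypothetical usage on observed first-event data with columns time1 and d1:
# dat$H1 <- nelson_aalen_est(dat$time1, dat$d1)
# (the mice package provides a similar helper, nelsonaalen(data, timevar, statusvar))
```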

5 Simulation Study

In this section, the main aim is to explore two issues of major concern in MI: What is the best way to include the survival outcomes in the imputation model? And how many events should be considered in the imputation, only the outcomes from the first event or from all events? Let K be the maximum number of events within a subject; the survival outcomes differ between events. If each event is treated as a stratum, the missing values can be imputed using outcomes drawn only from the first stratum or from both strata. A series of Monte Carlo simulations is conducted to compare different approaches to including the survival outcomes in the imputation model for the covariates in a survival analysis setting. The study is limited to the case where each subject experiences a maximum of two recurrent events, that is, K = 2. If only the first event is considered in the imputation, D_1 can be combined with, respectively, T_1, log T_1 and Ĥ(T_1), where Ĥ(T_1) is the Nelson-Aalen estimate of H(T_1). If both events are considered, D_2 together with, respectively, T_2, log T_2 or Ĥ(T_2) is also included in the model, where Ĥ(T_2) is the Nelson-Aalen estimate of H(T_2). This gives six ways of including the observed data in the imputation methods, summarized in Table 1.

Table 1: Overview of the imputation methods for imputing incomplete X

  T1      Regression of X on Z, D_1 and T_1
  LOGT1   Regression of X on Z, D_1 and log T_1
  NA1     Regression of X on Z, D_1 and Ĥ(T_1)
  T2      Regression of X on Z, D_1, D_2, T_1 and T_2
  LOGT2   Regression of X on Z, D_1, D_2, log T_1 and log T_2
  NA2     Regression of X on Z, D_1, D_2, Ĥ(T_1) and Ĥ(T_2)

  Note: Here, regression means linear regression.
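As an illustration, the sketch below sets up the NA2 method of Table 1 with the mice package; the data frame and column names (x, z, d1, d2, and H1, H2 for the Nelson-Aalen estimates) are hypothetical.

```r
library(mice)

# Hypothetical imputation data: incomplete covariate x, complete covariate z,
# event indicators d1, d2 and Nelson-Aalen estimates H1 = H(T1), H2 = H(T2).
imp_dat <- dat[, c("x", "z", "d1", "d2", "H1", "H2")]

# NA2: x is imputed by linear regression on all other columns, which are complete.
imp_na2 <- mice(imp_dat, m = 20, method = "norm", maxit = 10, seed = 2015)

# The remaining methods in Table 1 differ only in which outcome summaries are
# supplied, e.g. NA1 drops d2 and H2, and T1 uses the raw time t1 instead of H1.
```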

To assess the performance of the different MI methods, we compare the estimated coefficients from the analysis models, which are the PWP model and the Cox PH model. The accuracy of the coefficient estimates is evaluated by estimating the relative bias and the mean square error. The relative bias is defined as the expected ratio between the bias and the true value,

RB(\hat{\beta}) = E\left[\frac{\hat{\beta} - \beta}{\beta}\right], \quad (10)

and the mean square error measures the expected squared difference between the estimate and the true value of the parameter,

MSE(\hat{\beta}) = E\left[(\hat{\beta} - \beta)^2\right]. \quad (11)
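In the simulation these two measures are estimated by averaging over the replications; a minimal sketch, assuming `estimates` holds the coefficient estimates from the replications and `beta_true` the true value:

```r
# Monte Carlo estimates of relative bias (10) and mean square error (11).
relative_bias <- function(estimates, beta_true) {
  mean((estimates - beta_true) / beta_true)
}

mse <- function(estimates, beta_true) {
  mean((estimates - beta_true)^2)
}
```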

5.1 Study Design

For the covariates, we first consider the simple case of a single incomplete, normally distributed covariate X and no other covariates. We then add a complete, normally distributed covariate Z to the model. Note that although Z is listed in all imputation methods, it is dropped when considering the case with only X and no Z. The methods are examined with respect to four factors: sample size, event correlation level, censoring level and missingness level. By comparing the results obtained with the different methods under each setting, a proposal for handling missing data can then be given. The simulations are implemented using the package mice in R (run in RStudio).

For comparison, we also carry out the calculations before introducing missing values (ALL) and using complete cases (CC) only. In the empirical data analyzed below, the percentage of incomplete cases is approximately 17% and the censoring percentage is about 68%. Based on the rule of thumb described in Section 2, the number of repeated imputations is chosen as m = 20. The sample sizes under consideration are N = 250 and N = 2500, corresponding to small and large samples, respectively. The correlation levels between adjacent recurrence times are ρ = 0 and ρ = 0.5, corresponding to low and high correlation. The censoring level is set to 70% with reference to the empirical data, and 40% censoring is also considered to analyze the sensitivity to low censoring. The missingness level is set to 20% with reference to the empirical data, and 50% is also considered to analyze the sensitivity to high missingness. All combinations of N ∈ {250, 2500}, ρ ∈ {0, 0.5}, censoring ∈ {40%, 70%} and missingness ∈ {20%, 50%} are taken into consideration. To reduce Monte Carlo error, 1000 independent replications are performed for each design.

5.2 Generating Complete Data Sets

Let T_ik be the event time of the kth event for the ith subject, measured from the time origin 0 to the occurrence of the event, and let C_i be the censoring time for the ith subject. A subject is censored if C_i < T_ik, and has experienced q events (q < K) if T_iq ≤ C_i while C_i < T_i,q+1. Let T*_ik be the corresponding observation time; to generate right-censored failure times, define T*_ik = min(T_ik, C_i). Let D_ik be the event indicator, equal to 1 if the kth event of the ith subject occurs and 0 otherwise. The survival times are drawn from a Weibull distribution with hazard h(t) = λκt^{κ−1} exp(β'Z) and parameters λ and κ. Given U ~ Uniform(0, 1), the corresponding T is

T = \left(\frac{-\log(U)}{\lambda \exp(\beta' Z)}\right)^{1/\kappa}. \quad (12)

Let U = Φ(z), where z ~ N(0, 1). For the same subject, define the correlation between two successive recurrence times as ρ (ρ > 0). Villegas et al. (2013) showed that, for a given ρ, the correlation ρ_0 = Corr(z_i1, z_i2) is given by

\rho_0 = \frac{-w + \sqrt{w^2 + 2\rho(1 - w)}}{1 - w} > 0, \quad (13)

where w is a constant that does not depend on any other parameters.

First, consider the simulation steps for the one-covariate case. For each subject i (i = 1, 2, ..., N), k recurrence times (k = 1, 2) and the corresponding event indicators are generated. The simulation steps are as follows:

1. Generate the standard normal covariate X.

2. For a given correlation ρ, calculate ρ_0 using formula (13). For subject i, generate a random variate z_i1 from a N(0, 1) distribution and then a random variate z_i2 from a N(ρ_0 z_i1, 1 − ρ_0²) distribution. Transform the z_ik into uniform random variables using the standard normal cumulative distribution function Φ, such that U_ik = Φ(z_ik).

3. Generate survival times from the hazard h_T(t) = λ_T κ t^{κ−1} exp(β_X X). The parameter values are κ = 1 and λ_T chosen with reference to the empirical data. According to Section 4, β should be small; the coefficient is set to β_X = 1. The gap time is then drawn as

t_{ik} = \left(\frac{-\log(U_{ik})}{\lambda_T \exp(\beta_X X)}\right)^{1/\kappa}. \quad (14)

To get the recurrence times, let T_ik = t_i1 + ... + t_ik.

4. Random censoring times are drawn from a Weibull distribution with hazard h_C(t) = λ_C κ t^{κ−1}, with κ = 1 and λ_C chosen to give approximately 40% and 70% censoring, respectively.

5. The corresponding observation times are then generated as T*_ik = min(T_ik, C_i), along with the event indicators D_ik.

Then a complete covariate Z is added to the model. In the two-covariate case, both X and Z are standard normal with correlation ρ_c between them, ρ_c = 0 or ρ_c = 0.5. The survival times are drawn from h_T(t) = λ_T κ t^{κ−1} exp(β_X X + β_Z Z) with β_X = β_Z = 0.5; the other parameter values are the same as above. The data generation process follows Steps 1 to 5, as sketched in the code below.
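A minimal sketch of Steps 1 to 5 for the one-covariate case; lambda_t, lambda_c and rho0 are illustrative placeholders, since their exact values are derived from the empirical data.

```r
# Sketch of the data-generating process, Steps 1-5, one-covariate case.
simulate_recurrent <- function(N, rho0, beta_x = 1,
                               lambda_t = 0.1, kappa = 1, lambda_c = 0.05) {
  x  <- rnorm(N)                                   # Step 1: standard normal covariate
  z1 <- rnorm(N)                                   # Step 2: correlated normals
  z2 <- rnorm(N, mean = rho0 * z1, sd = sqrt(1 - rho0^2))
  u1 <- pnorm(z1); u2 <- pnorm(z2)                 # transform to uniforms

  rate <- lambda_t * exp(beta_x * x)               # Step 3: Weibull gap times, eq. (14)
  gap1 <- (-log(u1) / rate)^(1 / kappa)
  gap2 <- (-log(u2) / rate)^(1 / kappa)
  t1 <- gap1; t2 <- gap1 + gap2                    # recurrence times

  cens <- (-log(runif(N)) / lambda_c)^(1 / kappa)  # Step 4: Weibull censoring times

  data.frame(id = 1:N, x = x,                      # Step 5: observed times and indicators
             time1 = pmin(t1, cens), d1 = as.integer(t1 <= cens),
             time2 = pmin(t2, cens), d2 = as.integer(t2 <= cens))
}

dat <- simulate_recurrent(N = 250, rho0 = 0.7)
```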

5.3 Generating Missing Data

After the complete data have been generated, two missingness mechanisms are considered for generating missing data: MCAR and MAR. Define M_X as the missing data indicator: for subject i, M_X = 1 if X_i is observed and 0 otherwise. For a single incomplete covariate X and no Z, only the MCAR mechanism is used to generate missing values. By definition, MCAR means that the distribution of missingness depends neither on the observed data nor on the unobserved data, that is,

f(M_X \mid Z, X, \varphi) = f(M_X \mid \varphi), \quad (15)

where φ denotes unknown parameters. Since the data deficiency is completely random, missing values are generated entirely by chance at the given missingness levels of 20% and 50%.

In the two-covariate case, in order to compare the missingness mechanisms, incomplete X is generated using both the MCAR and the MAR mechanism. Under MAR, the missingness depends only on the observed data:

f(M_X \mid Z, X, \varphi) = f(M_X \mid Z, \varphi). \quad (16)

Using the logit as link function, the inverse link gives

\Pr(M_X \mid Z) = \frac{e^{f(Z)}}{1 + e^{f(Z)}}. \quad (17)

Setting f(Z) = −1.5 + Z yields 20% missing values. Because the differences between sample sizes are small, for simplicity only N = 2500 is used in the two-covariate case. The censoring level is set to 40%. Furthermore, both ρ = 0 and ρ = 0.5 are considered, since the correlation level has a large impact on the results.
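A minimal sketch of the two mechanisms, assuming a hypothetical data frame dat with covariates x and z and treating −1.5 as the intercept that gives roughly 20% missingness:

```r
# MAR: the probability that x is missing depends only on the observed covariate z
# through the logistic model of eq. (17) with f(Z) = -1.5 + Z.
make_mar <- function(dat) {
  p_miss <- plogis(-1.5 + dat$z)              # inverse logit
  dat$x[runif(nrow(dat)) < p_miss] <- NA      # set x to missing with probability p_miss
  dat
}

# MCAR by comparison: every x value is missing with the same fixed probability.
make_mcar <- function(dat, p = 0.20) {
  dat$x[runif(nrow(dat)) < p] <- NA
  dat
}
```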

5.4 Results

For the one-covariate case, the simulation results are presented in Tables 2 and 3. The common coefficient estimate β̂ is shown in Table 2. We also consider the event-specific coefficients and are interested in the coefficient estimate of the first event, β̂_(1), shown in Table 3.

Table 2 shows that T2, LOGT2 and NA2 perform better than T1, LOGT1 and NA1. Moreover, the bias of NA2 is the smallest in every setting and very similar to that of CC. Furthermore, the performance of the MI methods is greatly affected by the correlation level. Under low correlation, the sample bias is relatively small and there are clear differences between the imputation methods. Under high correlation, all methods are strongly biased towards the null, even ALL. As the correlation decreases, the differences between methods become more evident, so NA2 is even more clearly superior under low correlation.

In Table 3, the first-event coefficient estimates are presented. In this case, all imputation methods show a small bias towards the null, and with respect to MSE they give very similar results. Taking the relative bias into account, T1, LOGT1 and NA1 work slightly better than T2, LOGT2 and NA2. Because of its relatively low bias, NA1 might outperform all other methods. Results with a high missingness level show greater bias than results with a low missingness level. The sample size, correlation and censoring levels appear to have little impact on the imputation results.

As for the first question, which way of including the survival outcomes in the imputation model is best, the Nelson-Aalen method is apparently the best choice for both β and β_(1). The only difference is that NA2 is more effective for the common parameter β, while NA1 performs better for the first-event parameter β_(1); the advantage of the Nelson-Aalen method is more apparent for NA2 than for NA1. In response to the second question, how many events should be considered in the imputation, the answer depends on the circumstances. If we are mainly interested in the first-event parameter, methods using the outcomes from the first event perform better. If the common parameter is the main focus, methods using the outcomes from both events considerably decrease the relative bias.

Overall, all imputation methods underestimate the parameter. When comparing small and large sample sizes, the relative bias does not differ much between methods, while the MSE is smaller for N = 2500 than for N = 250. The correlation between survival times has a great effect on the results when analyzing the common parameter, but little effect when analyzing the first-event parameter. Regardless of the other factors, a high degree of missingness leads to greater bias, whereas censoring has very little influence on the parameter estimation; results with low censoring show slightly weaker patterns than results with high censoring.

Table 2: Simulation results for parameter β in the one-covariate model, MCAR. For each combination of N ∈ {250, 2500}, ρ ∈ {0, 0.5}, censoring ∈ {40%, 70%} and missingness ∈ {20%, 50%}, the tabulated values are the estimated relative bias and MSE (in parentheses) under ALL, CC and the six imputation methods (T1, LOGT1, NA1, T2, LOGT2, NA2). Except for ALL and CC, the smallest values in each setting are marked in bold.

Table 3: Simulation results for parameter β_(1) in the one-covariate model, MCAR. For each combination of N ∈ {250, 2500}, ρ ∈ {0, 0.5}, censoring ∈ {40%, 70%} and missingness ∈ {20%, 50%}, the tabulated values are the estimated relative bias and MSE (in parentheses) under ALL, CC and the six imputation methods. Except for ALL and CC, the smallest values in each setting are marked in bold.

Table 4: Simulation results for parameter β_X in the two-covariate model. For each combination of missingness mechanism (MCAR, MAR), ρ ∈ {0, 0.5} and ρ_c ∈ {0, 0.5}, the tabulated values are the estimated relative bias and MSE (in parentheses) under ALL, CC and the six imputation methods. Except for ALL and CC, the smallest values in each setting are marked in bold.

For the models with two covariates, the results are given in Tables 4, 5, 6 and 7. Looking at the common parameters first, Tables 4 and 5 show the results for β̂_X and β̂_Z, respectively. The results for β̂_X are given in Table 4. T2, LOGT2 and NA2 show smaller sample bias than T1, LOGT1 and NA1. The relative bias and MSE are smallest for NA2 and greatest for T1 and LOGT1, so NA2 performs better than any other imputation method. Results with high ρ_c tend to be more strongly biased towards the null. Again, the correlation between survival times has a great effect on the performance of the MI methods: the sample bias increases as ρ increases. Considering the effect of the missingness mechanism, the sample bias is larger under MAR than under MCAR.

Table 5: Simulation results for parameter β_Z in the two-covariate model. For each combination of missingness mechanism (MCAR, MAR), ρ ∈ {0, 0.5} and ρ_c ∈ {0, 0.5}, the tabulated values are the estimated relative bias and MSE (in parentheses) under ALL, CC and the six imputation methods. Except for ALL and CC, the smallest values in each setting are marked in bold.

The results for β̂_Z are given in Table 5. The differences between the methods are less apparent than for β̂_X. Again, T2, LOGT2 and NA2 show smaller sample bias than T1, LOGT1 and NA1, and among all imputation methods NA2 shows the smallest bias towards the null. All methods show greater sample bias as ρ and ρ_c increase. While the bias under MAR is larger than under MCAR, the relative performance of the methods is similar for both missingness mechanisms. Next, we discuss the coefficient estimates of the first event.

Table 6: Simulation results for parameter β_X(1) in the two-covariate model. For each combination of missingness mechanism (MCAR, MAR), ρ ∈ {0, 0.5} and ρ_c ∈ {0, 0.5}, the tabulated values are the estimated relative bias and MSE (in parentheses) under ALL, CC and the six imputation methods. Except for ALL and CC, the smallest values in each setting are marked in bold.

The results for β̂_X(1) are given in Table 6. All methods perform very adequately, and NA1 is the best of them all. With respect to the relative bias, T1, LOGT1 and NA1 perform slightly better than T2, LOGT2 and NA2, and the MSE values are almost zero for all methods. Under MCAR, the sample bias increases with growing correlation; under MAR the sample bias also increases with growing correlation, but not as much as under MCAR.

The results for β̂_Z(1) are given in Table 7. Considering the relative bias, all imputation methods show a small bias towards the null, and the MSE values remain almost zero for all methods. With increasing values of ρ and ρ_c, the sample bias does not increase as much as for β̂_X(1). In this case, NA1 outperforms all other methods in terms of the relative bias.

In conclusion, the missingness mechanism has an impact on the imputation results: the estimates under MCAR are generally less biased than those under MAR. When analyzing the common coefficients, NA2 is the best MI method; when analyzing the first-event coefficients, NA1 is preferred for imputing the missing data.

Table 7: Simulation results for parameter β_Z(1) in the two-covariate model. For each combination of missingness mechanism (MCAR, MAR), ρ ∈ {0, 0.5} and ρ_c ∈ {0, 0.5}, the tabulated values are the estimated relative bias and MSE (in parentheses) under ALL, CC and the six imputation methods. Except for ALL and CC, the smallest values in each setting are marked in bold.

6 Empirical Study

In this section, an empirical study of handling missing values in cardiovascular disease (CVD) event data is conducted. The data come from a Swedish cohort of men followed for several decades; for confidentiality reasons, the data had to be anonymized before we could use them. In this study, each subject experiences a maximum of two CVD events. The outcomes are the follow-up time and the corresponding event indicator. The follow-up time is calculated from the beginning of the study to the date of failure, the end of follow-up, or loss to follow-up, whichever occurs first. All variables are presented in Table 8: the first part of the table provides descriptive statistics for the continuous variables, and the second part for the categorical variables.

Table 8: Descriptive statistics

Continuous variables (X_1 to X_6): minimum, maximum, mean, SD, the 25%, 50% (median) and 75% quantiles, and the number and percentage of missing values are reported for each variable.

Categorical variables: value frequencies in percent, and missing values.

  X_7   0 (84.50), 1 (15.50)                       missing: 0 (0%)
  X_8   0 (95.92), 1 (4.08)                        missing: 0 (0%)
  X_9   0 (98.57), 1 (1.43)                        missing: 0 (0%)
  X_10  0 (25.18), 1 (51.02), 2 (23.80)            missing: 0 (0%)
  X_11  1 (62.74), 2 (26.40), 3 (10.68)
  X_12  1 (13.98), 2 (34.65), 3 (41.68), 4 (4.78)

  Note: N = 2303.

The original data set contains 2303 individuals and a total of 12 variables: six continuous and six categorical. Four of the continuous variables are incomplete, with 13.46% missing values for the incomplete variables. Two of the categorical variables are incomplete: the percentage of missing values is 0.18% for X_11 and 4.91% for X_12. Before imputing the missing data, a complete case (CC) analysis is performed; the missing values are then imputed using MI. The analysis models are the PWP model for the common effects and the Cox PH model for the coefficients of the first event. The variables are common risk factors, and we are interested in X_1 and X_2.
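The analysis pipeline for the empirical data can be sketched as follows, with hypothetical variable and column names (the anonymized data are not reproduced here): the data are imputed with mice, each imputed data set is analyzed with coxph, and the results are pooled with Rubin's rules via pool().

```r
library(mice)
library(survival)

# Hypothetical CVD data in counting-process form, with incomplete covariates x1, x2,
# follow-up intervals (tstart, tstop], event indicators and event number enum;
# Nelson-Aalen estimates of the cumulative hazard are assumed to be added beforehand.
imp <- mice(cvd_dat, m = 20, method = "norm", maxit = 10, seed = 1)

# PWP model on each imputed data set (common effects), pooled with Rubin's rules
pwp_fits <- with(imp, coxph(Surv(tstart, tstop, status) ~ x1 + x2 +
                              cluster(id) + strata(enum)))
summary(pool(pwp_fits))

# Cox PH model for the first event only, fitted on each completed data set
first_fits <- lapply(1:20, function(j) {
  dj <- complete(imp, j)
  coxph(Surv(tstop, status) ~ x1 + x2, data = subset(dj, enum == 1))
})
summary(pool(as.mira(first_fits)))
```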


More information

Cox s proportional hazards model and Cox s partial likelihood

Cox s proportional hazards model and Cox s partial likelihood Cox s proportional hazards model and Cox s partial likelihood Rasmus Waagepetersen October 12, 2018 1 / 27 Non-parametric vs. parametric Suppose we want to estimate unknown function, e.g. survival function.

More information

Analysis of Gamma and Weibull Lifetime Data under a General Censoring Scheme and in the presence of Covariates

Analysis of Gamma and Weibull Lifetime Data under a General Censoring Scheme and in the presence of Covariates Communications in Statistics - Theory and Methods ISSN: 0361-0926 (Print) 1532-415X (Online) Journal homepage: http://www.tandfonline.com/loi/lsta20 Analysis of Gamma and Weibull Lifetime Data under a

More information

Comparing Group Means When Nonresponse Rates Differ

Comparing Group Means When Nonresponse Rates Differ UNF Digital Commons UNF Theses and Dissertations Student Scholarship 2015 Comparing Group Means When Nonresponse Rates Differ Gabriela M. Stegmann University of North Florida Suggested Citation Stegmann,

More information

Plausible Values for Latent Variables Using Mplus

Plausible Values for Latent Variables Using Mplus Plausible Values for Latent Variables Using Mplus Tihomir Asparouhov and Bengt Muthén August 21, 2010 1 1 Introduction Plausible values are imputed values for latent variables. All latent variables can

More information

A Bayesian Nonparametric Approach to Monotone Missing Data in Longitudinal Studies with Informative Missingness

A Bayesian Nonparametric Approach to Monotone Missing Data in Longitudinal Studies with Informative Missingness A Bayesian Nonparametric Approach to Monotone Missing Data in Longitudinal Studies with Informative Missingness A. Linero and M. Daniels UF, UT-Austin SRC 2014, Galveston, TX 1 Background 2 Working model

More information

Philosophy and Features of the mstate package

Philosophy and Features of the mstate package Introduction Mathematical theory Practice Discussion Philosophy and Features of the mstate package Liesbeth de Wreede, Hein Putter Department of Medical Statistics and Bioinformatics Leiden University

More information

Lecture Notes: Some Core Ideas of Imputation for Nonresponse in Surveys. Tom Rosenström University of Helsinki May 14, 2014

Lecture Notes: Some Core Ideas of Imputation for Nonresponse in Surveys. Tom Rosenström University of Helsinki May 14, 2014 Lecture Notes: Some Core Ideas of Imputation for Nonresponse in Surveys Tom Rosenström University of Helsinki May 14, 2014 1 Contents 1 Preface 3 2 Definitions 3 3 Different ways to handle MAR data 4 4

More information

Analyzing Pilot Studies with Missing Observations

Analyzing Pilot Studies with Missing Observations Analyzing Pilot Studies with Missing Observations Monnie McGee mmcgee@smu.edu. Department of Statistical Science Southern Methodist University, Dallas, Texas Co-authored with N. Bergasa (SUNY Downstate

More information

Introduction to Survey Data Analysis

Introduction to Survey Data Analysis Introduction to Survey Data Analysis JULY 2011 Afsaneh Yazdani Preface Learning from Data Four-step process by which we can learn from data: 1. Defining the Problem 2. Collecting the Data 3. Summarizing

More information

Whether to use MMRM as primary estimand.

Whether to use MMRM as primary estimand. Whether to use MMRM as primary estimand. James Roger London School of Hygiene & Tropical Medicine, London. PSI/EFSPI European Statistical Meeting on Estimands. Stevenage, UK: 28 September 2015. 1 / 38

More information

A Regression Model For Recurrent Events With Distribution Free Correlation Structure

A Regression Model For Recurrent Events With Distribution Free Correlation Structure A Regression Model For Recurrent Events With Distribution Free Correlation Structure J. Pénichoux(1), A. Latouche(2), T. Moreau(1) (1) INSERM U780 (2) Université de Versailles, EA2506 ISCB - 2009 - Prague

More information

Methodology and Statistics for the Social and Behavioural Sciences Utrecht University, the Netherlands

Methodology and Statistics for the Social and Behavioural Sciences Utrecht University, the Netherlands Methodology and Statistics for the Social and Behavioural Sciences Utrecht University, the Netherlands MSc Thesis Emmeke Aarts TITLE: A novel method to obtain the treatment effect assessed for a completely

More information

Stock Sampling with Interval-Censored Elapsed Duration: A Monte Carlo Analysis

Stock Sampling with Interval-Censored Elapsed Duration: A Monte Carlo Analysis Stock Sampling with Interval-Censored Elapsed Duration: A Monte Carlo Analysis Michael P. Babington and Javier Cano-Urbina August 31, 2018 Abstract Duration data obtained from a given stock of individuals

More information

Bayesian methods for missing data: part 1. Key Concepts. Nicky Best and Alexina Mason. Imperial College London

Bayesian methods for missing data: part 1. Key Concepts. Nicky Best and Alexina Mason. Imperial College London Bayesian methods for missing data: part 1 Key Concepts Nicky Best and Alexina Mason Imperial College London BAYES 2013, May 21-23, Erasmus University Rotterdam Missing Data: Part 1 BAYES2013 1 / 68 Outline

More information

Markov Chain Monte Carlo methods

Markov Chain Monte Carlo methods Markov Chain Monte Carlo methods By Oleg Makhnin 1 Introduction a b c M = d e f g h i 0 f(x)dx 1.1 Motivation 1.1.1 Just here Supresses numbering 1.1.2 After this 1.2 Literature 2 Method 2.1 New math As

More information

A weighted simulation-based estimator for incomplete longitudinal data models

A weighted simulation-based estimator for incomplete longitudinal data models To appear in Statistics and Probability Letters, 113 (2016), 16-22. doi 10.1016/j.spl.2016.02.004 A weighted simulation-based estimator for incomplete longitudinal data models Daniel H. Li 1 and Liqun

More information

A simulation study for comparing testing statistics in response-adaptive randomization

A simulation study for comparing testing statistics in response-adaptive randomization RESEARCH ARTICLE Open Access A simulation study for comparing testing statistics in response-adaptive randomization Xuemin Gu 1, J Jack Lee 2* Abstract Background: Response-adaptive randomizations are

More information

Determining Sufficient Number of Imputations Using Variance of Imputation Variances: Data from 2012 NAMCS Physician Workflow Mail Survey *

Determining Sufficient Number of Imputations Using Variance of Imputation Variances: Data from 2012 NAMCS Physician Workflow Mail Survey * Applied Mathematics, 2014,, 3421-3430 Published Online December 2014 in SciRes. http://www.scirp.org/journal/am http://dx.doi.org/10.4236/am.2014.21319 Determining Sufficient Number of Imputations Using

More information

CTDL-Positive Stable Frailty Model

CTDL-Positive Stable Frailty Model CTDL-Positive Stable Frailty Model M. Blagojevic 1, G. MacKenzie 2 1 Department of Mathematics, Keele University, Staffordshire ST5 5BG,UK and 2 Centre of Biostatistics, University of Limerick, Ireland

More information

Other Survival Models. (1) Non-PH models. We briefly discussed the non-proportional hazards (non-ph) model

Other Survival Models. (1) Non-PH models. We briefly discussed the non-proportional hazards (non-ph) model Other Survival Models (1) Non-PH models We briefly discussed the non-proportional hazards (non-ph) model λ(t Z) = λ 0 (t) exp{β(t) Z}, where β(t) can be estimated by: piecewise constants (recall how);

More information

UNIVERSITY OF CALIFORNIA, SAN DIEGO

UNIVERSITY OF CALIFORNIA, SAN DIEGO UNIVERSITY OF CALIFORNIA, SAN DIEGO Estimation of the primary hazard ratio in the presence of a secondary covariate with non-proportional hazards An undergraduate honors thesis submitted to the Department

More information

Time-Invariant Predictors in Longitudinal Models

Time-Invariant Predictors in Longitudinal Models Time-Invariant Predictors in Longitudinal Models Today s Class (or 3): Summary of steps in building unconditional models for time What happens to missing predictors Effects of time-invariant predictors

More information

Power and Sample Size Calculations with the Additive Hazards Model

Power and Sample Size Calculations with the Additive Hazards Model Journal of Data Science 10(2012), 143-155 Power and Sample Size Calculations with the Additive Hazards Model Ling Chen, Chengjie Xiong, J. Philip Miller and Feng Gao Washington University School of Medicine

More information

FULL LIKELIHOOD INFERENCES IN THE COX MODEL

FULL LIKELIHOOD INFERENCES IN THE COX MODEL October 20, 2007 FULL LIKELIHOOD INFERENCES IN THE COX MODEL BY JIAN-JIAN REN 1 AND MAI ZHOU 2 University of Central Florida and University of Kentucky Abstract We use the empirical likelihood approach

More information

Biostat 2065 Analysis of Incomplete Data

Biostat 2065 Analysis of Incomplete Data Biostat 2065 Analysis of Incomplete Data Gong Tang Dept of Biostatistics University of Pittsburgh September 13 & 15, 2005 1. Complete-case analysis (I) Complete-case analysis refers to analysis based on

More information

Statistical Inference and Methods

Statistical Inference and Methods Department of Mathematics Imperial College London d.stephens@imperial.ac.uk http://stats.ma.ic.ac.uk/ das01/ 31st January 2006 Part VI Session 6: Filtering and Time to Event Data Session 6: Filtering and

More information

Alexina Mason. Department of Epidemiology and Biostatistics Imperial College, London. 16 February 2010

Alexina Mason. Department of Epidemiology and Biostatistics Imperial College, London. 16 February 2010 Strategy for modelling non-random missing data mechanisms in longitudinal studies using Bayesian methods: application to income data from the Millennium Cohort Study Alexina Mason Department of Epidemiology

More information

University of Bristol - Explore Bristol Research. Peer reviewed version. Link to published version (if available): /rssa.

University of Bristol - Explore Bristol Research. Peer reviewed version. Link to published version (if available): /rssa. Goldstein, H., Carpenter, J. R., & Browne, W. J. (2014). Fitting multilevel multivariate models with missing data in responses and covariates that may include interactions and non-linear terms. Journal

More information

A Bayesian Nonparametric Approach to Causal Inference for Semi-competing risks

A Bayesian Nonparametric Approach to Causal Inference for Semi-competing risks A Bayesian Nonparametric Approach to Causal Inference for Semi-competing risks Y. Xu, D. Scharfstein, P. Mueller, M. Daniels Johns Hopkins, Johns Hopkins, UT-Austin, UF JSM 2018, Vancouver 1 What are semi-competing

More information

Applied Survival Analysis Lab 10: Analysis of multiple failures

Applied Survival Analysis Lab 10: Analysis of multiple failures Applied Survival Analysis Lab 10: Analysis of multiple failures We will analyze the bladder data set (Wei et al., 1989). A listing of the dataset is given below: list if id in 1/9 +---------------------------------------------------------+

More information

Missing Data and Multiple Imputation

Missing Data and Multiple Imputation Maximum Likelihood Methods for the Social Sciences POLS 510 CSSS 510 Missing Data and Multiple Imputation Christopher Adolph Political Science and CSSS University of Washington, Seattle Vincent van Gogh

More information

Bayesian Inference on Joint Mixture Models for Survival-Longitudinal Data with Multiple Features. Yangxin Huang

Bayesian Inference on Joint Mixture Models for Survival-Longitudinal Data with Multiple Features. Yangxin Huang Bayesian Inference on Joint Mixture Models for Survival-Longitudinal Data with Multiple Features Yangxin Huang Department of Epidemiology and Biostatistics, COPH, USF, Tampa, FL yhuang@health.usf.edu January

More information

Nonrespondent subsample multiple imputation in two-phase random sampling for nonresponse

Nonrespondent subsample multiple imputation in two-phase random sampling for nonresponse Nonrespondent subsample multiple imputation in two-phase random sampling for nonresponse Nanhua Zhang Division of Biostatistics & Epidemiology Cincinnati Children s Hospital Medical Center (Joint work

More information

Analysis of Incomplete Non-Normal Longitudinal Lipid Data

Analysis of Incomplete Non-Normal Longitudinal Lipid Data Analysis of Incomplete Non-Normal Longitudinal Lipid Data Jiajun Liu*, Devan V. Mehrotra, Xiaoming Li, and Kaifeng Lu 2 Merck Research Laboratories, PA/NJ 2 Forrest Laboratories, NY *jiajun_liu@merck.com

More information

Time-Invariant Predictors in Longitudinal Models

Time-Invariant Predictors in Longitudinal Models Time-Invariant Predictors in Longitudinal Models Topics: What happens to missing predictors Effects of time-invariant predictors Fixed vs. systematically varying vs. random effects Model building strategies

More information

Maximum Likelihood Estimation; Robust Maximum Likelihood; Missing Data with Maximum Likelihood

Maximum Likelihood Estimation; Robust Maximum Likelihood; Missing Data with Maximum Likelihood Maximum Likelihood Estimation; Robust Maximum Likelihood; Missing Data with Maximum Likelihood PRE 906: Structural Equation Modeling Lecture #3 February 4, 2015 PRE 906, SEM: Estimation Today s Class An

More information

Multistate Modeling and Applications

Multistate Modeling and Applications Multistate Modeling and Applications Yang Yang Department of Statistics University of Michigan, Ann Arbor IBM Research Graduate Student Workshop: Statistics for a Smarter Planet Yang Yang (UM, Ann Arbor)

More information

Improving Efficiency of Inferences in Randomized Clinical Trials Using Auxiliary Covariates

Improving Efficiency of Inferences in Randomized Clinical Trials Using Auxiliary Covariates Improving Efficiency of Inferences in Randomized Clinical Trials Using Auxiliary Covariates Anastasios (Butch) Tsiatis Department of Statistics North Carolina State University http://www.stat.ncsu.edu/

More information

Bayesian inference for sample surveys. Roderick Little Module 2: Bayesian models for simple random samples

Bayesian inference for sample surveys. Roderick Little Module 2: Bayesian models for simple random samples Bayesian inference for sample surveys Roderick Little Module : Bayesian models for simple random samples Superpopulation Modeling: Estimating parameters Various principles: least squares, method of moments,

More information

Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood

Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood Jonathan Gruhl March 18, 2010 1 Introduction Researchers commonly apply item response theory (IRT) models to binary and ordinal

More information

ARIC Manuscript Proposal # PC Reviewed: _9/_25_/06 Status: A Priority: _2 SC Reviewed: _9/_25_/06 Status: A Priority: _2

ARIC Manuscript Proposal # PC Reviewed: _9/_25_/06 Status: A Priority: _2 SC Reviewed: _9/_25_/06 Status: A Priority: _2 ARIC Manuscript Proposal # 1186 PC Reviewed: _9/_25_/06 Status: A Priority: _2 SC Reviewed: _9/_25_/06 Status: A Priority: _2 1.a. Full Title: Comparing Methods of Incorporating Spatial Correlation in

More information

[Part 2] Model Development for the Prediction of Survival Times using Longitudinal Measurements

[Part 2] Model Development for the Prediction of Survival Times using Longitudinal Measurements [Part 2] Model Development for the Prediction of Survival Times using Longitudinal Measurements Aasthaa Bansal PhD Pharmaceutical Outcomes Research & Policy Program University of Washington 69 Biomarkers

More information

E(Y ij b i ) = f(x ijβ i ), (13.1) β i = A i β + B i b i. (13.2)

E(Y ij b i ) = f(x ijβ i ), (13.1) β i = A i β + B i b i. (13.2) 1 Advanced topics 1.1 Introduction In this chapter, we conclude with brief overviews of several advanced topics. Each of these topics could realistically be the subject of an entire course! 1. Generalized

More information

Longitudinal analysis of ordinal data

Longitudinal analysis of ordinal data Longitudinal analysis of ordinal data A report on the external research project with ULg Anne-Françoise Donneau, Murielle Mauer June 30 th 2009 Generalized Estimating Equations (Liang and Zeger, 1986)

More information

Computationally Efficient Estimation of Multilevel High-Dimensional Latent Variable Models

Computationally Efficient Estimation of Multilevel High-Dimensional Latent Variable Models Computationally Efficient Estimation of Multilevel High-Dimensional Latent Variable Models Tihomir Asparouhov 1, Bengt Muthen 2 Muthen & Muthen 1 UCLA 2 Abstract Multilevel analysis often leads to modeling

More information

THESIS for the degree of MASTER OF SCIENCE. Modelling and Data Analysis

THESIS for the degree of MASTER OF SCIENCE. Modelling and Data Analysis PROPERTIES OF ESTIMATORS FOR RELATIVE RISKS FROM NESTED CASE-CONTROL STUDIES WITH MULTIPLE OUTCOMES (COMPETING RISKS) by NATHALIE C. STØER THESIS for the degree of MASTER OF SCIENCE Modelling and Data

More information

Indirect estimation of a simultaneous limited dependent variable model for patient costs and outcome

Indirect estimation of a simultaneous limited dependent variable model for patient costs and outcome Indirect estimation of a simultaneous limited dependent variable model for patient costs and outcome Per Hjertstrand Research Institute of Industrial Economics (IFN), Stockholm, Sweden Per.Hjertstrand@ifn.se

More information

Multilevel Statistical Models: 3 rd edition, 2003 Contents

Multilevel Statistical Models: 3 rd edition, 2003 Contents Multilevel Statistical Models: 3 rd edition, 2003 Contents Preface Acknowledgements Notation Two and three level models. A general classification notation and diagram Glossary Chapter 1 An introduction

More information

Analysing geoadditive regression data: a mixed model approach

Analysing geoadditive regression data: a mixed model approach Analysing geoadditive regression data: a mixed model approach Institut für Statistik, Ludwig-Maximilians-Universität München Joint work with Ludwig Fahrmeir & Stefan Lang 25.11.2005 Spatio-temporal regression

More information

Multi-state models: prediction

Multi-state models: prediction Department of Medical Statistics and Bioinformatics Leiden University Medical Center Course on advanced survival analysis, Copenhagen Outline Prediction Theory Aalen-Johansen Computational aspects Applications

More information

Models for Multivariate Panel Count Data

Models for Multivariate Panel Count Data Semiparametric Models for Multivariate Panel Count Data KyungMann Kim University of Wisconsin-Madison kmkim@biostat.wisc.edu 2 April 2015 Outline 1 Introduction 2 3 4 Panel Count Data Motivation Previous

More information

Frailty Modeling for clustered survival data: a simulation study

Frailty Modeling for clustered survival data: a simulation study Frailty Modeling for clustered survival data: a simulation study IAA Oslo 2015 Souad ROMDHANE LaREMFiQ - IHEC University of Sousse (Tunisia) souad_romdhane@yahoo.fr Lotfi BELKACEM LaREMFiQ - IHEC University

More information

A STRATEGY FOR STEPWISE REGRESSION PROCEDURES IN SURVIVAL ANALYSIS WITH MISSING COVARIATES. by Jia Li B.S., Beijing Normal University, 1998

A STRATEGY FOR STEPWISE REGRESSION PROCEDURES IN SURVIVAL ANALYSIS WITH MISSING COVARIATES. by Jia Li B.S., Beijing Normal University, 1998 A STRATEGY FOR STEPWISE REGRESSION PROCEDURES IN SURVIVAL ANALYSIS WITH MISSING COVARIATES by Jia Li B.S., Beijing Normal University, 1998 Submitted to the Graduate Faculty of the Graduate School of Public

More information

Robust estimates of state occupancy and transition probabilities for Non-Markov multi-state models

Robust estimates of state occupancy and transition probabilities for Non-Markov multi-state models Robust estimates of state occupancy and transition probabilities for Non-Markov multi-state models 26 March 2014 Overview Continuously observed data Three-state illness-death General robust estimator Interval

More information

A COMPARISON OF POISSON AND BINOMIAL EMPIRICAL LIKELIHOOD Mai Zhou and Hui Fang University of Kentucky

A COMPARISON OF POISSON AND BINOMIAL EMPIRICAL LIKELIHOOD Mai Zhou and Hui Fang University of Kentucky A COMPARISON OF POISSON AND BINOMIAL EMPIRICAL LIKELIHOOD Mai Zhou and Hui Fang University of Kentucky Empirical likelihood with right censored data were studied by Thomas and Grunkmier (1975), Li (1995),

More information

Nonresponse weighting adjustment using estimated response probability

Nonresponse weighting adjustment using estimated response probability Nonresponse weighting adjustment using estimated response probability Jae-kwang Kim Yonsei University, Seoul, Korea December 26, 2006 Introduction Nonresponse Unit nonresponse Item nonresponse Basic strategy

More information

Sparse Linear Models (10/7/13)

Sparse Linear Models (10/7/13) STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine

More information

Probability and Probability Distributions. Dr. Mohammed Alahmed

Probability and Probability Distributions. Dr. Mohammed Alahmed Probability and Probability Distributions 1 Probability and Probability Distributions Usually we want to do more with data than just describing them! We might want to test certain specific inferences about

More information

A note on multiple imputation for general purpose estimation

A note on multiple imputation for general purpose estimation A note on multiple imputation for general purpose estimation Shu Yang Jae Kwang Kim SSC meeting June 16, 2015 Shu Yang, Jae Kwang Kim Multiple Imputation June 16, 2015 1 / 32 Introduction Basic Setup Assume

More information

Chapter 11. Correlation and Regression

Chapter 11. Correlation and Regression Chapter 11. Correlation and Regression The word correlation is used in everyday life to denote some form of association. We might say that we have noticed a correlation between foggy days and attacks of

More information

BIOS 312: Precision of Statistical Inference

BIOS 312: Precision of Statistical Inference and Power/Sample Size and Standard Errors BIOS 312: of Statistical Inference Chris Slaughter Department of Biostatistics, Vanderbilt University School of Medicine January 3, 2013 Outline Overview and Power/Sample

More information