A Comparison of Multiple Imputation Methods for Missing Covariate Values in Recurrent Event Data

By Zhao Huo
Department of Statistics, Uppsala University
Supervisor: Ronnie Pingel
2015

Abstract

Multiple imputation (MI) is a commonly used approach to impute missing data. This thesis studies missing covariates in recurrent event data and discusses ways to include the survival outcomes in the imputation model. The MI methods under consideration combine the event indicator D with, respectively, the observed event or censoring time T, the logarithm of T, and the cumulative baseline hazard H_0(T). After imputation, the complete data analysis can proceed. The Cox proportional hazards (PH) model and the PWP model are chosen as the analysis models, and their coefficient estimates are of substantive interest. A Monte Carlo simulation study is conducted to compare the different MI methods, using relative bias and mean square error as evaluation criteria. Furthermore, an empirical study is conducted on cardiovascular disease event data containing missing values. Overall, the results show that MI based on the Nelson-Aalen estimate of H_0(T) is preferred in most circumstances.

Keywords: Missing data; Multiple imputation; Missing covariates; Recurrent event data; Cox PH model; PWP model

Contents

1 Introduction
2 Theory and Methods for Missing Data
3 Recurrent Event Data
  3.1 Analyzing Data with Recurrent Events
4 Missing Data in Survival Analysis
5 Simulation Study
  5.1 Study Design
  5.2 Generating Complete Data Sets
  5.3 Generating Missing Data
  5.4 Results
6 Empirical Study
7 Conclusion
Acknowledgements
References

1 Introduction

Missing data are a common problem in many research areas. In survey-based research, for example, respondents may refuse to answer certain questions for privacy reasons, or may simply be unable to attend an evaluation meeting. In clinical research, missing data are also a major issue, especially in longitudinal clinical trials. In survival analysis, the data deficiency usually occurs in the covariates. This is because the survival outcomes often consist of easily measured data, such as deaths or event times recorded in registries, whereas some covariate measurements can be missing. For instance, a blood sample may be accidentally dropped on the floor, making post-processing of the sample impossible, or a question related to a clinical study may be left unanswered in a questionnaire.

This thesis considers the issue of missing covariates in recurrent event data. Recurrent event data are one type of multiple failure time data. They arise in survival analysis when each study subject may experience two or more events (failures) of the same kind. For example, patients with heart disease having a second (or further) myocardial infarction can be seen as experiencing a recurrent event, whereas first having a myocardial infarction and then a bone fracture involves events of different kinds. The data are assumed to be right censored: a subject is censored if it has not experienced any event by the end of the study or can no longer be followed up, whichever occurs first. For example, in a project examining stroke occurrence, patients who have never experienced a stroke by the end of the project are right censored. It is also assumed that the censoring mechanism is non-informative, meaning that the censoring time of each subject is independent of its failure time.

Rubin (1987) proposed multiple imputation (MI) as a basis for statistical inference, and it is now a commonly used method to impute missing data. Rubin (1996) further argued that MI is the method of choice since complete data sets are available to the end user after imputation. The basic idea of MI is to replace each missing value with a number of plausible values based on the distribution of the observed data, so that several data sets containing no missing observations are generated. Moons et al. (2006) pointed out that, regardless of the missingness mechanism, imputation of missing predictor values using the outcome is preferred over imputation without the outcome. To perform MI in survival data using the outcome, several papers have considered outcome information including the event indicator D, the observed event or censoring time T, and/or the logarithm of T.

In White and Royston (2009), different imputation models were compared in the setting of a single incomplete covariate, and the authors recommended that MI be based on the Nelson-Aalen estimate of the cumulative hazard. Considering recurrent event data with missing covariates, we are interested in how the extra information from recurrent events can be incorporated into MI when analyzing survival data. A Monte Carlo simulation study is conducted to evaluate the different MI methods based on two analysis models: the Cox proportional hazards (PH) model and the PWP model. The coefficient estimates of the analysis models are of substantive interest, and the assessment of the different methods provides guidance for handling missing data. Thus, our primary aim is to develop a more efficient way of including survival outcomes in the imputation model.

The structure of the thesis is as follows: Section 2 provides the necessary methodological background, covering missing data and MI. Section 3 introduces recurrent event data and the analysis models. Section 4 describes missing data in survival analysis. Section 5 compares the methods through a comprehensive simulation study. Section 6 contains an empirical application. Section 7 concludes.

2 Theory and Methods for Missing Data

Missing data are usually classified into three categories according to the reason for the incompleteness. Missing completely at random (MCAR) means that the probability of data being missing is independent of any characteristics of the subjects. Missing at random (MAR) means that the probability of data being missing does not depend on the unobserved data, conditional on the observed data. Finally, missing not at random (MNAR) means that the probability of data being missing depends on the unobserved data. MCAR is convenient for data analysis, since parameter estimates remain unbiased despite the absence of data. An example of MCAR is a survey participant who is unable to attend a meeting on time because of a traffic jam. MAR occurs if the cause of missingness is related to another observed variable. For instance, if people refuse to report their salary based entirely on their age, then the salary data are MAR.

MNAR is more complex than MAR and MCAR. An example is when both low- and high-income people are more likely to omit their salary than those with an ordinary income; the nonresponse probability for salary then depends on values that might themselves be missing.

According to Little and Rubin (1987), standard methods have been developed to analyze rectangular data sets, where the rows are seen as subjects and the columns as variables measured for each subject. Since missing data arise in almost all kinds of trials, some of the entries in the data matrix will not be observed. In general, missing data may cause bias and loss of efficiency. Depending on the type of missingness, different approaches to handling missing data have been suggested. In the case of MCAR, a simple approach is to exclude subjects with missing values, usually referred to as complete case (CC) analysis. Under MCAR, CC analysis remains unbiased but is less efficient; when data are not MCAR, CC analysis can be biased. CC analysis is convenient, but it has one major drawback: discarding cases can lead to a reduction of statistical power. Instead of wasting data, ad hoc methods such as mean imputation, last observation carried forward, and random hot deck imputation have been used historically. However, not only do these methods fail to reduce bias, they are also single imputation methods and may therefore underestimate the standard errors of the estimates (Little and Rubin, 1987).

To overcome the weaknesses of the ad hoc methods, two principled techniques for handling missing data have been proposed: maximum likelihood (ML) and multiple imputation (MI). ML analyzes the full incomplete data set: it separates the complete and incomplete data, computes a likelihood function for each part, and then maximizes these likelihoods to obtain the parameter estimates. ML is also model specific, and different types of models tend to have different likelihoods. Enders (2001) showed that under MCAR and MAR, the ML estimates are less biased and generally more efficient than those of ad hoc methods. In practice, the ML estimates are slightly more efficient than the MI estimates; nonetheless, MI is more general than ML. Because the focus here is on survival analysis, MI is the method of choice for handling missing data. Donders et al. (2006) concluded that when data are MCAR or MAR, the MI approach leads to unbiased results with correct standard errors. For this reason, our study is limited to situations where the missingness mechanism is MCAR or MAR.

Originally proposed by Rubin (1978, 1987), MI is now a commonly used approach to the missing data problem. Based on the Bayesian paradigm, MI involves drawing missing data from a posterior predictive distribution conditional on the observed data. The idea is to replace each missing value with a set of m acceptable values (m > 1), generating m imputed data sets. According to Schafer (1999), a small number of imputations is usually enough, normally between 5 and 10. A more recent study by White et al. (2011) suggested that m should be at least equal to the percentage of incomplete cases; for instance, a sample with 20 percent incomplete cases would indicate m = 20. This rule of thumb is used to choose m in our simulation study. After imputation, the imputed data sets are analyzed separately, and the results of the m analyses are combined to obtain the estimates of interest. One advantage of MI is that any analysis method can be applied once the complete data sets are available. MI thus provides a computationally feasible approach that can be applied to a wide range of situations. MI consists of three steps:

1. Generating m imputed data sets. Consider the simple case of a single incomplete variable X, a vector of complete covariates Z and a complete outcome Y. A model f(X | Y, Z; α) is fitted among the subjects with observed X, yielding the estimate α̂ with variance-covariance matrix S_α. The imputation parameter α* is then drawn from an approximate N(α̂, S_α) distribution, the missing values are generated from the posterior predictive distribution f(X | Y, Z; α*), and the whole process is repeated m times.

2. Analyzing each of the imputed data sets separately. After imputation, we can proceed to the complete data analysis. Each imputed data set is analyzed separately to obtain the estimates of interest, e.g. regression coefficients.

3. Combining the estimates using Rubin's rules. Suppose β̂^(j) is the point estimate from the jth imputed data set (j = 1, 2, ..., m) and W^(j) is the estimated variance of β̂^(j). Then the overall estimate is

\hat{\beta} = \frac{1}{m} \sum_{j=1}^{m} \hat{\beta}^{(j)}, \quad (1)

with variance

Var(\hat{\beta}) = \bar{W} + \left(1 + \frac{1}{m}\right) B, \quad (2)

where W̄ and B are the within- and between-imputation variances, respectively:

\bar{W} = \frac{1}{m} \sum_{j=1}^{m} W^{(j)}, \quad (3)

B = \frac{1}{m-1} \sum_{j=1}^{m} \left(\hat{\beta}^{(j)} - \hat{\beta}\right)^2. \quad (4)
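In practice the pool() function of the mice package performs this pooling; the sketch below simply writes equations (1)-(4) out in R, under the assumption that the m point estimates and their within-imputation variances have already been collected in the (hypothetical) vectors est and var_w.

```r
# Minimal sketch of Rubin's rules, equations (1)-(4).
pool_rubin <- function(est, var_w) {
  m <- length(est)
  beta_bar  <- mean(est)                     # overall estimate, eq. (1)
  W_bar     <- mean(var_w)                   # within-imputation variance, eq. (3)
  B         <- sum((est - beta_bar)^2) / (m - 1)  # between-imputation variance, eq. (4)
  total_var <- W_bar + (1 + 1 / m) * B       # total variance, eq. (2)
  c(estimate = beta_bar, variance = total_var)
}

# Example with m = 5 hypothetical imputations
pool_rubin(est   = c(0.48, 0.52, 0.50, 0.47, 0.53),
           var_w = c(0.010, 0.012, 0.011, 0.009, 0.010))
```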

If more than one variable is incomplete, multiple imputation by chained equations (MICE) can be used. MICE is an iterative imputation method that uses separate, independent chains to obtain the imputations. Lee and Carlin (2010) compared MICE with multivariate normal imputation (MVN); MICE is more flexible than MVN since it does not rely on the assumption of multivariate normality. Furthermore, the variable-by-variable specification of MICE makes it easy to build models even when the conditionals are incompatible. Azur et al. (2011) describe the procedure for carrying out MICE. Initially, all missing values are filled in by simple single imputations, for example the variable means. Then, for one of the incomplete variables, the filled-in values are set back to missing and the variable is imputed by performing the same process as Step 1 of MI, conditional on the other variables. This is repeated for each variable that has missing data, and looping through all variables once constitutes one cycle, so every missing value is updated in each cycle. Royston (2004) considered fewer than 10 cycles to be enough; an imputed data set is created after 10 cycles, and the whole procedure is repeated m times. Further aspects of using MICE, including the principles of the method, guidance on model specification and the practical analysis of multiply imputed data, are discussed by White et al. (2011).
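A minimal sketch of this workflow with the R package mice is given below; the data frame dat and its variables are hypothetical, and the settings mirror the choices described in the text (m = 20 imputed data sets, linear regression imputation, 10 cycles).

```r
# Minimal MICE sketch with the mice package; `dat` is a hypothetical data frame
# containing the incomplete covariates together with the fully observed variables.
library(mice)

imp <- mice(dat,
            m      = 20,      # number of imputed data sets (rule of thumb from the text)
            method = "norm",  # Bayesian linear regression for continuous variables
            maxit  = 10,      # cycles per imputed data set
            seed   = 123)

imputed_1 <- complete(imp, action = 1)   # extract the first imputed data set
```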

3 Recurrent Event Data

Survival data are generally defined as data consisting of the time to some event and the censoring information for each subject. The event can be, for example, the occurrence of a specific disease or death. One special case of survival data, recurrent event data, is the main emphasis of this thesis. To make it easier to understand, we first introduce multiple failure time data. Multiple failure time data arise from time-to-event studies where each study subject may experience two or more events (failures). The failures can be either of the same kind or of an entirely different nature. One characteristic of this kind of data is that the survival times may be correlated, so the analysis is complicated by the dependence between related failure times. Wei and Glidden (1997) summarized statistical methods for analyzing multiple failure time data.

Multiple failure time data can be classified into two categories: ordered and unordered (e.g. Therneau 1997, chapter 8); other names for these two categories include longitudinal and clustered, or serial and parallel. Ordered failure time data refers to data in which the events are naturally ordered and occur in a certain sequence over time. For example, define K as the maximum number of events within a subject and N as the total sample size. The event time of the kth (k = 1, 2, ..., K) event for the ith (i = 1, 2, ..., N) subject can then be written T_ik. For ordered data, the occurrence times satisfy the constraint T_ik ≥ T_i,k−1. The most popular type of ordered data is recurrent event data. Recurrent event data arise in survival analysis when each study subject experiences two or more failures of the same kind. They are commonly encountered in longitudinal studies where each subject may experience the same event repeatedly over time. An example is the analysis of a cohort of infants with bronchial obstruction: Villegas et al. (2013) studied cases where a physician revisits the children at various times, and each visit can be seen as a recurrent event. Unordered failure time data are data in which each subject is at risk of several failure processes simultaneously. This is the case when a subject experiences several entirely different events, such as a patient who suffers a heart attack and is also infected with other diseases. An important feature of unordered data is that every event can occur at any time during the study, so there is no restriction on the sequential order.

3.1 Analyzing Data with Recurrent Events

Focusing on ordered recurrent event data, Cox-based models are used for the analysis. There are four key components characterizing the Cox-based models: risk interval, risk set, baseline hazard and correlation adjustment. Among these, the risk interval and risk set are pivotal in choosing a model.

Figure 1: Illustrations of risk interval formulations.

The risk interval refers to the time scale used to define when a subject is at risk of having an event. There are three possible ways of defining a risk interval: gap time, total time and counting process, each representing a different substantive type of risk process. The gap time is the time from the prior event, i.e. the clock is reset to zero when each event begins. Total time and counting process use the same time scale; what distinguishes them is how they set up the "left time". Under the counting process formulation, a subject is not considered to be at risk for the kth event until the (k−1)th event has happened, while total time uses the time from a common starting point, e.g. the beginning of the study. When dealing with recurrent event data, both gap time and counting process are applicable definitions of the risk interval. Figure 1 gives a more intuitive picture of the risk interval formulations.

The kth risk set contains the individuals who are at risk for the kth event. The risk set definition is tied to the choice of baseline hazard: common or event-specific. A common baseline hazard means a model with the same underlying hazard for all events, whereas an event-specific formulation allows the baseline hazard to differ between events. There are three possible risk sets: restricted, semi-restricted and unrestricted. With a restricted risk set, only subjects who have experienced the (k−1)th event contribute to the kth risk set; the restricted risk set has event-specific baseline hazards and is the preferred risk set for recurrent event data. A semi-restricted risk set also has event-specific baseline hazards and allows subjects to contribute to the risk set of the kth event at time t as long as they have not yet experienced the kth event. An unrestricted risk set allows the risk intervals of all subjects to contribute to the risk set for any event and has a common baseline hazard. The partial likelihood functions of the Cox-based models differ in the composition of the risk set.

The Cox PH model developed by Cox (1972) is the most widely used model for survival data and a powerful method for analyzing time-to-event occurrence. The usual Cox PH model requires the events to be independent, but in recurrent event data the events may be correlated within subjects. Moreover, the observation time in the standard model ends at the first occurrence or censoring, so repeated events are disregarded. Hence, several extensions of the Cox PH approach for analyzing multiple events have been proposed. Ezell et al. (2003) summarized the Cox-based models as hybrid versions of the single-event Cox model, specified in one of two ways:

h_{ik}(t) = h_0(t) \exp(\beta' Z_{ik}), \quad (5)

h_{ik}(t) = h_{0k}(t) \exp(\beta_{(k)}' Z_{ik}). \quad (6)

Here Z_ik = (Z_i1k, ..., Z_ipk)' is a p-dimensional covariate vector for the kth event of the ith subject, β denotes the vector of common regression coefficients, and β_(k) denotes the vector of event-specific regression coefficients. In Equation (5), h_0(t) represents a common baseline hazard for all events; in Equation (6), h_0k(t) is an event-specific baseline hazard for the kth event. Both h_0(t) and h_0k(t) are non-negative. Finally, h_ik(t) is the hazard function of the ith subject with respect to the kth event.

For analyzing recurrent event data, the most common Cox-based models are: Andersen and Gill (AG); Prentice, Williams and Peterson with counting process (PWP-CP) or gap time (PWP-GT) intervals; and Wei, Lin and Weissfeld (WLW). These models differ essentially in the risk interval and risk set. The AG model is specified with a common baseline hazard, an unrestricted risk set and the counting process formulation. The PWP models are specified with event-specific baseline hazards, restricted risk sets, and either the counting process or the gap time risk interval; PWP-CP differs from PWP-GT by using the counting process instead of the gap time formulation. The WLW model is specified with event-specific baseline hazards, a semi-restricted risk set and the total time formulation. Villegas et al. (2013) concluded that no model can be recommended as the best in all situations. Castañeda and Gerritse (2010) suggested that both the AG and the PWP models are applicable when analyzing repeated failures of the same type, and Kelly and Lim (2000) suggested that the PWP model is more appropriate for the analysis of recurrent event data.

Thus, the PWP model is selected as the analysis model rather than the other Cox-based models. Based on our empirical data, the counting process is the more suitable formulation of the risk interval. Liu (2012) pointed out that although the PWP models specify event-specific regression coefficients β_(k), overall estimates β̂ can be obtained by fitting a common covariate vector. This can be achieved by assuming that the covariates are not event-varying, such as sex. Hence, the analysis model for the recurrent event data is

h_{ik}(t) = h_{0k}(t) \exp(\beta' Z_i), \quad (7)

where t_{k−1} < t ≤ t_k and Z_i = (Z_i1, ..., Z_ip)' is a p-dimensional covariate vector of the ith subject; the other quantities are defined as above. The overall estimates β̂ are of substantive interest. When we are instead interested in the regression coefficients of the first event, the analysis model is the usual Cox PH model

h_{i1}(t) = h_{01}(t) \exp(\beta_{(1)}' Z_i). \quad (8)

For this model, the coefficient estimates of the first event, β̂_(1), are of substantive interest.
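Models (7) and (8) can be fitted with the survival package in R. The sketch below is a minimal illustration under the assumption of a hypothetical data frame recurrent_dat in counting-process form, with columns id, enum (event number), tstart, tstop, status and covariates x and z.

```r
library(survival)

# PWP-CP version of model (7): event-specific baseline hazards via strata(enum),
# counting-process risk intervals via (tstart, tstop], and robust standard errors
# clustered on subject id.
pwp_fit <- coxph(Surv(tstart, tstop, status) ~ x + z + cluster(id) + strata(enum),
                 data = recurrent_dat)

# Model (8): ordinary Cox PH model restricted to each subject's first event
first_fit <- coxph(Surv(tstop, status) ~ x + z,
                   data = subset(recurrent_dat, enum == 1))

summary(pwp_fit)
summary(first_fit)
```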

4 Missing Data in Survival Analysis

So far, relatively few studies have investigated missing data in survival analysis. White and Royston (2009) compared different MI models in the setting of a single incomplete covariate and recommended that MI be based on the Nelson-Aalen estimate of the cumulative hazard. Paik (1997) presented three MI estimators for the Cox model with missing covariates. Mbougua et al. (2013) compared different MI methods for handling nonlinear continuous covariates, and their simulation results showed that MI by splines should be used in such situations. Van Buuren et al. (1999) used MI for missing blood pressure covariates. Giorgi et al. (2008) performed MI in regression analysis of relative survival and concluded that missing data in covariates should be modeled, with MICE offering an attractive choice. However, the effect of missing covariate data in multivariate survival models, which is the aim of this thesis, has not previously been investigated.

Figure 2: Recurrent event data with a missing covariate.

A hypothetical example is used to illustrate the missing data patterns in recurrent event data. Suppose there are two covariates X and Z in the model, neither of which is event-varying. X denotes the incomplete covariate, with missing values at random positions, and Z denotes the complete covariate. Assume that each subject experiences at most two events, with observed event or censoring times T_1 and T_2 and corresponding event indicators D_1 and D_2. Additionally, assume that for subjects 1 and 3 both events occur, for subject 2 only the first event occurs, and for subject 4 no event occurs; missing data arise for subject 3. Figure 2 gives a pictorial representation of the missing data patterns.

As mentioned previously, carrying out MI involves choosing the variables in the imputation model. Three imputation models are formed using the event indicator D combined with, respectively, the observed event or censoring time T, the logarithm of T (i.e. log T) and the cumulative baseline hazard H_0(T). However, H_0(T) is unknown and can only be estimated. A possible way to estimate it is to use the Nelson-Aalen estimator: H_0(T) is approximately equal to H(T) when the coefficients β are small. An estimator of H(T) was suggested by Nelson (1972) and studied by Aalen (1978), and it is therefore known as the Nelson-Aalen estimator. It is given by

\hat{H}(t) = \sum_{t_i \le t} \frac{d_i}{n_i}, \quad (9)

where i = 1, 2, ..., δ and t_1 < t_2 < ... < t_δ are the δ distinct failure times in a sample of n subjects; at time t_i there are d_i failures and n_i subjects at risk. Since the outcomes are fully observed, the Nelson-Aalen estimator can be computed before imputation.
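Equation (9) is simple to compute directly; the sketch below gives a small hand-rolled version using hypothetical column names time and status.

```r
# Hand-rolled Nelson-Aalen estimate of the cumulative hazard, eq. (9):
# H(t) = sum over distinct event times t_i <= t of d_i / n_i.
nelson_aalen_est <- function(time, status) {
  event_times <- sort(unique(time[status == 1]))
  d <- sapply(event_times, function(s) sum(time == s & status == 1))  # events at s
  n <- sapply(event_times, function(s) sum(time >= s))                # at risk just before s
  H <- cumsum(d / n)
  # evaluate the step function H(t) at each subject's observed time
  stepfun(event_times, c(0, H))(time)
}

# Hypothetical usage on observed first-event data with columns time1 and d1:
# dat$H1 <- nelson_aalen_est(dat$time1, dat$d1)
# (the mice package provides a similar helper, nelsonaalen(data, timevar, statusvar))
```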

5 Simulation Study

In this section, the main aim is to explore two issues of major concern in MI: What is the best way to include the survival outcomes in the imputation model? And how many events should be considered in the imputation, only the outcomes from the first event or from all events? Let K be the maximum number of events within a subject; the survival outcomes differ between events. If each event is treated as a stratum, the missing values can be imputed using outcomes drawn only from the first stratum or from both strata. A series of Monte Carlo simulations is conducted to compare different approaches to including the survival outcomes in the imputation model for the covariates in a survival analysis setting. The study is limited to the case where each subject experiences a maximum of two recurrent events, that is, K = 2. If only the first event is considered in the imputation, D_1 can be combined with, respectively, T_1, log T_1 and Ĥ(T_1), where Ĥ(T_1) is the Nelson-Aalen estimate of H(T_1). If both events are considered, D_2 together with, respectively, T_2, log T_2 or Ĥ(T_2) is also included in the model, where Ĥ(T_2) is the Nelson-Aalen estimate of H(T_2). This gives six ways of including the observed data in the imputation methods, summarized in Table 1.

Table 1: Overview of the imputation methods for imputing incomplete X

  T1      Regression of X on Z, D_1 and T_1
  LOGT1   Regression of X on Z, D_1 and log T_1
  NA1     Regression of X on Z, D_1 and Ĥ(T_1)
  T2      Regression of X on Z, D_1, D_2, T_1 and T_2
  LOGT2   Regression of X on Z, D_1, D_2, log T_1 and log T_2
  NA2     Regression of X on Z, D_1, D_2, Ĥ(T_1) and Ĥ(T_2)

  Note: Here, regression means linear regression.
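As an illustration, the sketch below sets up the NA2 method of Table 1 with the mice package; the data frame and column names (x, z, d1, d2, and H1, H2 for the Nelson-Aalen estimates) are hypothetical.

```r
library(mice)

# Hypothetical imputation data: incomplete covariate x, complete covariate z,
# event indicators d1, d2 and Nelson-Aalen estimates H1 = H(T1), H2 = H(T2).
imp_dat <- dat[, c("x", "z", "d1", "d2", "H1", "H2")]

# NA2: x is imputed by linear regression on all other columns, which are complete.
imp_na2 <- mice(imp_dat, m = 20, method = "norm", maxit = 10, seed = 2015)

# The remaining methods in Table 1 differ only in which outcome summaries are
# supplied, e.g. NA1 drops d2 and H2, and T1 uses the raw time t1 instead of H1.
```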

To assess the performance of the different MI methods, we compare the estimated coefficients from the analysis models, which are the PWP model and the Cox PH model. The accuracy of the coefficient estimates is evaluated by estimating the relative bias and the mean square error. The relative bias is defined as the expected ratio between the bias and the true value,

RB(\hat{\beta}) = E\left[\frac{\hat{\beta} - \beta}{\beta}\right], \quad (10)

and the mean square error measures the expected squared difference between the estimate and the true value of the parameter,

MSE(\hat{\beta}) = E\left[(\hat{\beta} - \beta)^2\right]. \quad (11)
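In the simulation these two measures are estimated by averaging over the replications; a minimal sketch, assuming `estimates` holds the coefficient estimates from the replications and `beta_true` the true value:

```r
# Monte Carlo estimates of relative bias (10) and mean square error (11).
relative_bias <- function(estimates, beta_true) {
  mean((estimates - beta_true) / beta_true)
}

mse <- function(estimates, beta_true) {
  mean((estimates - beta_true)^2)
}
```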

5.1 Study Design

For the covariates, we first consider the simple case of a single incomplete, normally distributed covariate X and no other covariates. We then add a complete, normally distributed covariate Z to the model. Note that although Z is listed in all imputation methods, it is dropped when considering the case with only X and no Z. The methods are examined with respect to four factors: sample size, event correlation level, censoring level and missingness level. By comparing the results obtained with the different methods under each setting, a proposal for handling missing data can then be given. The simulations are implemented using the package mice in R (run in RStudio).

For comparison, we also carry out the calculations before introducing missing values (ALL) and using complete cases (CC) only. In the empirical data analyzed below, the percentage of incomplete cases is approximately 17% and the censoring percentage is about 68%. Based on the rule of thumb described in Section 2, the number of repeated imputations is chosen as m = 20. The sample sizes under consideration are N = 250 and N = 2500, corresponding to small and large samples, respectively. The correlation levels between adjacent recurrence times are ρ = 0 and ρ = 0.5, corresponding to low and high correlation. The censoring level is set to 70% with reference to the empirical data, and 40% censoring is also considered to analyze the sensitivity to low censoring. The missingness level is set to 20% with reference to the empirical data, and 50% is also considered to analyze the sensitivity to high missingness. All combinations of N ∈ {250, 2500}, ρ ∈ {0, 0.5}, censoring ∈ {40%, 70%} and missingness ∈ {20%, 50%} are taken into consideration. To reduce Monte Carlo error, 1000 independent replications are performed for each design.

5.2 Generating Complete Data Sets

Let T_ik be the event time of the kth event for the ith subject, measured from the time origin 0 to the occurrence of the event, and let C_i be the censoring time for the ith subject. A subject is censored if C_i < T_ik, and has experienced q events (q < K) if T_iq ≤ C_i while C_i < T_i,q+1. Let T*_ik be the corresponding observation time; to generate right-censored failure times, define T*_ik = min(T_ik, C_i). Let D_ik be the event indicator, equal to 1 if the kth event of the ith subject occurs and 0 otherwise. The survival times are drawn from a Weibull distribution with hazard h(t) = λκt^{κ−1} exp(β'Z) and parameters λ and κ. Given U ~ Uniform(0, 1), the corresponding T is

T = \left(\frac{-\log(U)}{\lambda \exp(\beta' Z)}\right)^{1/\kappa}. \quad (12)

Let U = Φ(z), where z ~ N(0, 1). For the same subject, define the correlation between two successive recurrence times as ρ (ρ > 0). Villegas et al. (2013) showed that, for a given ρ, the correlation ρ_0 = Corr(z_i1, z_i2) is given by

\rho_0 = \frac{-w + \sqrt{w^2 + 2\rho(1 - w)}}{1 - w} > 0, \quad (13)

where w is a constant that does not depend on any other parameters.

First, consider the simulation steps for the one-covariate case. For each subject i (i = 1, 2, ..., N), k recurrence times (k = 1, 2) and the corresponding event indicators are generated. The simulation steps are as follows:

1. Generate the standard normal covariate X.

2. For a given correlation ρ, calculate ρ_0 using formula (13). For subject i, generate a random variate z_i1 from a N(0, 1) distribution and then a random variate z_i2 from a N(ρ_0 z_i1, 1 − ρ_0²) distribution. Transform the z_ik into uniform random variables using the standard normal cumulative distribution function Φ, such that U_ik = Φ(z_ik).

3. Generate survival times from the hazard h_T(t) = λ_T κ t^{κ−1} exp(β_X X). The parameter values are κ = 1 and λ_T chosen with reference to the empirical data. According to Section 4, β should be small; the coefficient is set to β_X = 1. The gap time is then drawn as

t_{ik} = \left(\frac{-\log(U_{ik})}{\lambda_T \exp(\beta_X X)}\right)^{1/\kappa}. \quad (14)

To get the recurrence times, let T_ik = t_i1 + ... + t_ik.

4. Random censoring times are drawn from a Weibull distribution with hazard h_C(t) = λ_C κ t^{κ−1}, with κ = 1 and λ_C chosen to give approximately 40% and 70% censoring, respectively.

5. The corresponding observation times are then generated as T*_ik = min(T_ik, C_i), along with the event indicators D_ik.

Then a complete covariate Z is added to the model. In the two-covariate case, both X and Z are standard normal with correlation ρ_c between them, ρ_c = 0 or ρ_c = 0.5. The survival times are drawn from h_T(t) = λ_T κ t^{κ−1} exp(β_X X + β_Z Z) with β_X = β_Z = 0.5; the other parameter values are the same as above. The data generation process follows Steps 1 to 5, as sketched in the code below.
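A minimal sketch of Steps 1 to 5 for the one-covariate case; lambda_t, lambda_c and rho0 are illustrative placeholders, since their exact values are derived from the empirical data.

```r
# Sketch of the data-generating process, Steps 1-5, one-covariate case.
simulate_recurrent <- function(N, rho0, beta_x = 1,
                               lambda_t = 0.1, kappa = 1, lambda_c = 0.05) {
  x  <- rnorm(N)                                   # Step 1: standard normal covariate
  z1 <- rnorm(N)                                   # Step 2: correlated normals
  z2 <- rnorm(N, mean = rho0 * z1, sd = sqrt(1 - rho0^2))
  u1 <- pnorm(z1); u2 <- pnorm(z2)                 # transform to uniforms

  rate <- lambda_t * exp(beta_x * x)               # Step 3: Weibull gap times, eq. (14)
  gap1 <- (-log(u1) / rate)^(1 / kappa)
  gap2 <- (-log(u2) / rate)^(1 / kappa)
  t1 <- gap1; t2 <- gap1 + gap2                    # recurrence times

  cens <- (-log(runif(N)) / lambda_c)^(1 / kappa)  # Step 4: Weibull censoring times

  data.frame(id = 1:N, x = x,                      # Step 5: observed times and indicators
             time1 = pmin(t1, cens), d1 = as.integer(t1 <= cens),
             time2 = pmin(t2, cens), d2 = as.integer(t2 <= cens))
}

dat <- simulate_recurrent(N = 250, rho0 = 0.7)
```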

5.3 Generating Missing Data

After the complete data have been generated, two missingness mechanisms are considered for generating missing data: MCAR and MAR. Define M_X as the missing data indicator: for subject i, M_X = 1 if X_i is observed and 0 otherwise. For a single incomplete covariate X and no Z, only the MCAR mechanism is used to generate missing values. By definition, MCAR means that the distribution of missingness depends neither on the observed data nor on the unobserved data, that is,

f(M_X \mid Z, X, \varphi) = f(M_X \mid \varphi), \quad (15)

where φ denotes unknown parameters. Since the data deficiency is completely random, missing values are generated entirely by chance at the given missingness levels of 20% and 50%.

In the two-covariate case, in order to compare the missingness mechanisms, incomplete X is generated using both the MCAR and the MAR mechanism. Under MAR, the missingness depends only on the observed data:

f(M_X \mid Z, X, \varphi) = f(M_X \mid Z, \varphi). \quad (16)

Using the logit as link function, the inverse link gives

\Pr(M_X \mid Z) = \frac{e^{f(Z)}}{1 + e^{f(Z)}}. \quad (17)

Setting f(Z) = −1.5 + Z yields 20% missing values. Because the differences between sample sizes are small, for simplicity only N = 2500 is used in the two-covariate case. The censoring level is set to 40%. Furthermore, both ρ = 0 and ρ = 0.5 are considered, since the correlation level has a large impact on the results.
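A minimal sketch of the two mechanisms, assuming a hypothetical data frame dat with covariates x and z and treating −1.5 as the intercept that gives roughly 20% missingness:

```r
# MAR: the probability that x is missing depends only on the observed covariate z
# through the logistic model of eq. (17) with f(Z) = -1.5 + Z.
make_mar <- function(dat) {
  p_miss <- plogis(-1.5 + dat$z)              # inverse logit
  dat$x[runif(nrow(dat)) < p_miss] <- NA      # set x to missing with probability p_miss
  dat
}

# MCAR by comparison: every x value is missing with the same fixed probability.
make_mcar <- function(dat, p = 0.20) {
  dat$x[runif(nrow(dat)) < p] <- NA
  dat
}
```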

5.4 Results

For the one-covariate case, the simulation results are presented in Tables 2 and 3. The common coefficient estimate β̂ is shown in Table 2. We also consider the event-specific coefficients and are interested in the coefficient estimate of the first event, β̂_(1), shown in Table 3.

Table 2 shows that T2, LOGT2 and NA2 perform better than T1, LOGT1 and NA1. Moreover, the bias of NA2 is the smallest in every setting and very similar to that of CC. Furthermore, the performance of the MI methods is greatly affected by the correlation level. Under low correlation, the sample bias is relatively small and there are clear differences between the imputation methods. Under high correlation, all methods are strongly biased towards the null, even ALL. As the correlation decreases, the differences between methods become more evident, so NA2 is even more clearly superior under low correlation.

In Table 3, the first-event coefficient estimates are presented. In this case, all imputation methods show a small bias towards the null, and with respect to MSE they give very similar results. Taking the relative bias into account, T1, LOGT1 and NA1 work slightly better than T2, LOGT2 and NA2. Because of its relatively low bias, NA1 might outperform all other methods. Results with a high missingness level show greater bias than results with a low missingness level. The sample size, correlation and censoring levels appear to have little impact on the imputation results.

As for the first question, which way of including the survival outcomes in the imputation model is best, the Nelson-Aalen method is apparently the best choice for both β and β_(1). The only difference is that NA2 is more effective for the common parameter β, while NA1 performs better for the first-event parameter β_(1); the advantage of the Nelson-Aalen method is more apparent for NA2 than for NA1. In response to the second question, how many events should be considered in the imputation, the answer depends on the circumstances. If we are mainly interested in the first-event parameter, methods using the outcomes from the first event perform better. If the common parameter is the main focus, methods using the outcomes from both events considerably decrease the relative bias.

Overall, all imputation methods underestimate the parameter. When comparing small and large sample sizes, the relative bias does not differ much between methods, while the MSE is smaller for N = 2500 than for N = 250. The correlation between survival times has a great effect on the results when analyzing the common parameter, but little effect when analyzing the first-event parameter. Regardless of the other factors, a high degree of missingness leads to greater bias, whereas censoring has very little influence on the parameter estimation; results with low censoring show slightly weaker patterns than results with high censoring.

Table 2: Simulation results for parameter β in the one-covariate model, MCAR. For each combination of N ∈ {250, 2500}, ρ ∈ {0, 0.5}, censoring ∈ {40%, 70%} and missingness ∈ {20%, 50%}, the tabulated values are the estimated relative bias and MSE (in parentheses) under ALL, CC and the six imputation methods (T1, LOGT1, NA1, T2, LOGT2, NA2). Except for ALL and CC, the smallest values in each setting are marked in bold.

Table 3: Simulation results for parameter β_(1) in the one-covariate model, MCAR. For each combination of N ∈ {250, 2500}, ρ ∈ {0, 0.5}, censoring ∈ {40%, 70%} and missingness ∈ {20%, 50%}, the tabulated values are the estimated relative bias and MSE (in parentheses) under ALL, CC and the six imputation methods. Except for ALL and CC, the smallest values in each setting are marked in bold.

Table 4: Simulation results for parameter β_X in the two-covariate model. For each combination of missingness mechanism (MCAR, MAR), ρ ∈ {0, 0.5} and ρ_c ∈ {0, 0.5}, the tabulated values are the estimated relative bias and MSE (in parentheses) under ALL, CC and the six imputation methods. Except for ALL and CC, the smallest values in each setting are marked in bold.

For the models with two covariates, the results are given in Tables 4, 5, 6 and 7. Looking at the common parameters first, Tables 4 and 5 show the results for β̂_X and β̂_Z, respectively. The results for β̂_X are given in Table 4. T2, LOGT2 and NA2 show smaller sample bias than T1, LOGT1 and NA1. The relative bias and MSE are smallest for NA2 and greatest for T1 and LOGT1, so NA2 performs better than any other imputation method. Results with high ρ_c tend to be more strongly biased towards the null. Again, the correlation between survival times has a great effect on the performance of the MI methods: the sample bias increases as ρ increases. Considering the effect of the missingness mechanism, the sample bias is larger under MAR than under MCAR.

Table 5: Simulation results for parameter β_Z in the two-covariate model. For each combination of missingness mechanism (MCAR, MAR), ρ ∈ {0, 0.5} and ρ_c ∈ {0, 0.5}, the tabulated values are the estimated relative bias and MSE (in parentheses) under ALL, CC and the six imputation methods. Except for ALL and CC, the smallest values in each setting are marked in bold.

The results for β̂_Z are given in Table 5. The differences between the methods are less apparent than for β̂_X. Again, T2, LOGT2 and NA2 show smaller sample bias than T1, LOGT1 and NA1, and among all imputation methods NA2 shows the smallest bias towards the null. All methods show greater sample bias as ρ and ρ_c increase. While the bias under MAR is larger than under MCAR, the relative performance of the methods is similar for both missingness mechanisms. Next, we discuss the coefficient estimates of the first event.

Table 6: Simulation results for parameter β_X(1) in the two-covariate model. For each combination of missingness mechanism (MCAR, MAR), ρ ∈ {0, 0.5} and ρ_c ∈ {0, 0.5}, the tabulated values are the estimated relative bias and MSE (in parentheses) under ALL, CC and the six imputation methods. Except for ALL and CC, the smallest values in each setting are marked in bold.

The results for β̂_X(1) are given in Table 6. All methods perform very adequately, and NA1 is the best of them all. With respect to the relative bias, T1, LOGT1 and NA1 perform slightly better than T2, LOGT2 and NA2, and the MSE values are almost zero for all methods. Under MCAR, the sample bias increases with growing correlation; under MAR the sample bias also increases with growing correlation, but not as much as under MCAR.

The results for β̂_Z(1) are given in Table 7. Considering the relative bias, all imputation methods show a small bias towards the null, and the MSE values remain almost zero for all methods. With increasing values of ρ and ρ_c, the sample bias does not increase as much as for β̂_X(1). In this case, NA1 outperforms all other methods in terms of the relative bias.

In conclusion, the missingness mechanism has an impact on the imputation results: the estimates under MCAR are generally less biased than those under MAR. When analyzing the common coefficients, NA2 is the best MI method; when analyzing the first-event coefficients, NA1 is preferred for imputing the missing data.

Table 7: Simulation results for parameter β_Z(1) in the two-covariate model. For each combination of missingness mechanism (MCAR, MAR), ρ ∈ {0, 0.5} and ρ_c ∈ {0, 0.5}, the tabulated values are the estimated relative bias and MSE (in parentheses) under ALL, CC and the six imputation methods. Except for ALL and CC, the smallest values in each setting are marked in bold.

6 Empirical Study

In this section, an empirical study of handling missing values in cardiovascular disease (CVD) event data is conducted. The data come from a Swedish cohort of men followed for several decades; for confidentiality reasons, the data had to be anonymized before we could use them. In this study, each subject experiences a maximum of two CVD events. The outcomes are the follow-up time and the corresponding event indicator. The follow-up time is calculated from the beginning of the study to the date of failure, the end of follow-up, or loss to follow-up, whichever occurs first. All variables are presented in Table 8: the first part of the table provides descriptive statistics for the continuous variables, and the second part for the categorical variables.

Table 8: Descriptive statistics

Continuous variables (X_1 to X_6): minimum, maximum, mean, SD, the 25%, 50% (median) and 75% quantiles, and the number and percentage of missing values are reported for each variable.

Categorical variables: value frequencies in percent, and missing values.

  X_7   0 (84.50), 1 (15.50)                       missing: 0 (0%)
  X_8   0 (95.92), 1 (4.08)                        missing: 0 (0%)
  X_9   0 (98.57), 1 (1.43)                        missing: 0 (0%)
  X_10  0 (25.18), 1 (51.02), 2 (23.80)            missing: 0 (0%)
  X_11  1 (62.74), 2 (26.40), 3 (10.68)
  X_12  1 (13.98), 2 (34.65), 3 (41.68), 4 (4.78)

  Note: N = 2303.

The original data set contains 2303 individuals and a total of 12 variables: six continuous and six categorical. Four of the continuous variables are incomplete, with 13.46% missing values for the incomplete variables. Two of the categorical variables are incomplete: the percentage of missing values is 0.18% for X_11 and 4.91% for X_12. Before imputing the missing data, a complete case (CC) analysis is performed; the missing values are then imputed using MI. The analysis models are the PWP model for the common effects and the Cox PH model for the coefficients of the first event. The variables are common risk factors, and we are interested in X_1 and X_2.
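The analysis pipeline for the empirical data can be sketched as follows, with hypothetical variable and column names (the anonymized data are not reproduced here): the data are imputed with mice, each imputed data set is analyzed with coxph, and the results are pooled with Rubin's rules via pool().

```r
library(mice)
library(survival)

# Hypothetical CVD data in counting-process form, with incomplete covariates x1, x2,
# follow-up intervals (tstart, tstop], event indicators and event number enum;
# Nelson-Aalen estimates of the cumulative hazard are assumed to be added beforehand.
imp <- mice(cvd_dat, m = 20, method = "norm", maxit = 10, seed = 1)

# PWP model on each imputed data set (common effects), pooled with Rubin's rules
pwp_fits <- with(imp, coxph(Surv(tstart, tstop, status) ~ x1 + x2 +
                              cluster(id) + strata(enum)))
summary(pool(pwp_fits))

# Cox PH model for the first event only, fitted on each completed data set
first_fits <- lapply(1:20, function(j) {
  dj <- complete(imp, j)
  coxph(Surv(tstop, status) ~ x1 + x2, data = subset(dj, enum == 1))
})
summary(pool(as.mira(first_fits)))
```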


More information

Cox s proportional hazards model and Cox s partial likelihood

Cox s proportional hazards model and Cox s partial likelihood Cox s proportional hazards model and Cox s partial likelihood Rasmus Waagepetersen October 12, 2018 1 / 27 Non-parametric vs. parametric Suppose we want to estimate unknown function, e.g. survival function.

More information

Analysis of Gamma and Weibull Lifetime Data under a General Censoring Scheme and in the presence of Covariates

Analysis of Gamma and Weibull Lifetime Data under a General Censoring Scheme and in the presence of Covariates Communications in Statistics - Theory and Methods ISSN: 0361-0926 (Print) 1532-415X (Online) Journal homepage: http://www.tandfonline.com/loi/lsta20 Analysis of Gamma and Weibull Lifetime Data under a

More information

Comparing Group Means When Nonresponse Rates Differ

Comparing Group Means When Nonresponse Rates Differ UNF Digital Commons UNF Theses and Dissertations Student Scholarship 2015 Comparing Group Means When Nonresponse Rates Differ Gabriela M. Stegmann University of North Florida Suggested Citation Stegmann,

More information

Plausible Values for Latent Variables Using Mplus

Plausible Values for Latent Variables Using Mplus Plausible Values for Latent Variables Using Mplus Tihomir Asparouhov and Bengt Muthén August 21, 2010 1 1 Introduction Plausible values are imputed values for latent variables. All latent variables can

More information

A Bayesian Nonparametric Approach to Monotone Missing Data in Longitudinal Studies with Informative Missingness

A Bayesian Nonparametric Approach to Monotone Missing Data in Longitudinal Studies with Informative Missingness A Bayesian Nonparametric Approach to Monotone Missing Data in Longitudinal Studies with Informative Missingness A. Linero and M. Daniels UF, UT-Austin SRC 2014, Galveston, TX 1 Background 2 Working model

More information

Philosophy and Features of the mstate package

Philosophy and Features of the mstate package Introduction Mathematical theory Practice Discussion Philosophy and Features of the mstate package Liesbeth de Wreede, Hein Putter Department of Medical Statistics and Bioinformatics Leiden University

More information

Lecture Notes: Some Core Ideas of Imputation for Nonresponse in Surveys. Tom Rosenström University of Helsinki May 14, 2014

Lecture Notes: Some Core Ideas of Imputation for Nonresponse in Surveys. Tom Rosenström University of Helsinki May 14, 2014 Lecture Notes: Some Core Ideas of Imputation for Nonresponse in Surveys Tom Rosenström University of Helsinki May 14, 2014 1 Contents 1 Preface 3 2 Definitions 3 3 Different ways to handle MAR data 4 4

More information

Analyzing Pilot Studies with Missing Observations

Analyzing Pilot Studies with Missing Observations Analyzing Pilot Studies with Missing Observations Monnie McGee mmcgee@smu.edu. Department of Statistical Science Southern Methodist University, Dallas, Texas Co-authored with N. Bergasa (SUNY Downstate

More information

Introduction to Survey Data Analysis

Introduction to Survey Data Analysis Introduction to Survey Data Analysis JULY 2011 Afsaneh Yazdani Preface Learning from Data Four-step process by which we can learn from data: 1. Defining the Problem 2. Collecting the Data 3. Summarizing

More information

Whether to use MMRM as primary estimand.

Whether to use MMRM as primary estimand. Whether to use MMRM as primary estimand. James Roger London School of Hygiene & Tropical Medicine, London. PSI/EFSPI European Statistical Meeting on Estimands. Stevenage, UK: 28 September 2015. 1 / 38

More information

A Regression Model For Recurrent Events With Distribution Free Correlation Structure

A Regression Model For Recurrent Events With Distribution Free Correlation Structure A Regression Model For Recurrent Events With Distribution Free Correlation Structure J. Pénichoux(1), A. Latouche(2), T. Moreau(1) (1) INSERM U780 (2) Université de Versailles, EA2506 ISCB - 2009 - Prague

More information

Methodology and Statistics for the Social and Behavioural Sciences Utrecht University, the Netherlands

Methodology and Statistics for the Social and Behavioural Sciences Utrecht University, the Netherlands Methodology and Statistics for the Social and Behavioural Sciences Utrecht University, the Netherlands MSc Thesis Emmeke Aarts TITLE: A novel method to obtain the treatment effect assessed for a completely

More information

Stock Sampling with Interval-Censored Elapsed Duration: A Monte Carlo Analysis

Stock Sampling with Interval-Censored Elapsed Duration: A Monte Carlo Analysis Stock Sampling with Interval-Censored Elapsed Duration: A Monte Carlo Analysis Michael P. Babington and Javier Cano-Urbina August 31, 2018 Abstract Duration data obtained from a given stock of individuals

More information

Bayesian methods for missing data: part 1. Key Concepts. Nicky Best and Alexina Mason. Imperial College London

Bayesian methods for missing data: part 1. Key Concepts. Nicky Best and Alexina Mason. Imperial College London Bayesian methods for missing data: part 1 Key Concepts Nicky Best and Alexina Mason Imperial College London BAYES 2013, May 21-23, Erasmus University Rotterdam Missing Data: Part 1 BAYES2013 1 / 68 Outline

More information

Markov Chain Monte Carlo methods

Markov Chain Monte Carlo methods Markov Chain Monte Carlo methods By Oleg Makhnin 1 Introduction a b c M = d e f g h i 0 f(x)dx 1.1 Motivation 1.1.1 Just here Supresses numbering 1.1.2 After this 1.2 Literature 2 Method 2.1 New math As

More information

A weighted simulation-based estimator for incomplete longitudinal data models

A weighted simulation-based estimator for incomplete longitudinal data models To appear in Statistics and Probability Letters, 113 (2016), 16-22. doi 10.1016/j.spl.2016.02.004 A weighted simulation-based estimator for incomplete longitudinal data models Daniel H. Li 1 and Liqun

More information

A simulation study for comparing testing statistics in response-adaptive randomization

A simulation study for comparing testing statistics in response-adaptive randomization RESEARCH ARTICLE Open Access A simulation study for comparing testing statistics in response-adaptive randomization Xuemin Gu 1, J Jack Lee 2* Abstract Background: Response-adaptive randomizations are

More information

Determining Sufficient Number of Imputations Using Variance of Imputation Variances: Data from 2012 NAMCS Physician Workflow Mail Survey *

Determining Sufficient Number of Imputations Using Variance of Imputation Variances: Data from 2012 NAMCS Physician Workflow Mail Survey * Applied Mathematics, 2014,, 3421-3430 Published Online December 2014 in SciRes. http://www.scirp.org/journal/am http://dx.doi.org/10.4236/am.2014.21319 Determining Sufficient Number of Imputations Using

More information

CTDL-Positive Stable Frailty Model

CTDL-Positive Stable Frailty Model CTDL-Positive Stable Frailty Model M. Blagojevic 1, G. MacKenzie 2 1 Department of Mathematics, Keele University, Staffordshire ST5 5BG,UK and 2 Centre of Biostatistics, University of Limerick, Ireland

More information

Other Survival Models. (1) Non-PH models. We briefly discussed the non-proportional hazards (non-ph) model

Other Survival Models. (1) Non-PH models. We briefly discussed the non-proportional hazards (non-ph) model Other Survival Models (1) Non-PH models We briefly discussed the non-proportional hazards (non-ph) model λ(t Z) = λ 0 (t) exp{β(t) Z}, where β(t) can be estimated by: piecewise constants (recall how);

More information

UNIVERSITY OF CALIFORNIA, SAN DIEGO

UNIVERSITY OF CALIFORNIA, SAN DIEGO UNIVERSITY OF CALIFORNIA, SAN DIEGO Estimation of the primary hazard ratio in the presence of a secondary covariate with non-proportional hazards An undergraduate honors thesis submitted to the Department

More information

Time-Invariant Predictors in Longitudinal Models

Time-Invariant Predictors in Longitudinal Models Time-Invariant Predictors in Longitudinal Models Today s Class (or 3): Summary of steps in building unconditional models for time What happens to missing predictors Effects of time-invariant predictors

More information

Power and Sample Size Calculations with the Additive Hazards Model

Power and Sample Size Calculations with the Additive Hazards Model Journal of Data Science 10(2012), 143-155 Power and Sample Size Calculations with the Additive Hazards Model Ling Chen, Chengjie Xiong, J. Philip Miller and Feng Gao Washington University School of Medicine

More information

FULL LIKELIHOOD INFERENCES IN THE COX MODEL

FULL LIKELIHOOD INFERENCES IN THE COX MODEL October 20, 2007 FULL LIKELIHOOD INFERENCES IN THE COX MODEL BY JIAN-JIAN REN 1 AND MAI ZHOU 2 University of Central Florida and University of Kentucky Abstract We use the empirical likelihood approach

More information

Biostat 2065 Analysis of Incomplete Data

Biostat 2065 Analysis of Incomplete Data Biostat 2065 Analysis of Incomplete Data Gong Tang Dept of Biostatistics University of Pittsburgh September 13 & 15, 2005 1. Complete-case analysis (I) Complete-case analysis refers to analysis based on

More information

Statistical Inference and Methods

Statistical Inference and Methods Department of Mathematics Imperial College London d.stephens@imperial.ac.uk http://stats.ma.ic.ac.uk/ das01/ 31st January 2006 Part VI Session 6: Filtering and Time to Event Data Session 6: Filtering and

More information

Alexina Mason. Department of Epidemiology and Biostatistics Imperial College, London. 16 February 2010

Alexina Mason. Department of Epidemiology and Biostatistics Imperial College, London. 16 February 2010 Strategy for modelling non-random missing data mechanisms in longitudinal studies using Bayesian methods: application to income data from the Millennium Cohort Study Alexina Mason Department of Epidemiology

More information

University of Bristol - Explore Bristol Research. Peer reviewed version. Link to published version (if available): /rssa.

University of Bristol - Explore Bristol Research. Peer reviewed version. Link to published version (if available): /rssa. Goldstein, H., Carpenter, J. R., & Browne, W. J. (2014). Fitting multilevel multivariate models with missing data in responses and covariates that may include interactions and non-linear terms. Journal

More information

A Bayesian Nonparametric Approach to Causal Inference for Semi-competing risks

A Bayesian Nonparametric Approach to Causal Inference for Semi-competing risks A Bayesian Nonparametric Approach to Causal Inference for Semi-competing risks Y. Xu, D. Scharfstein, P. Mueller, M. Daniels Johns Hopkins, Johns Hopkins, UT-Austin, UF JSM 2018, Vancouver 1 What are semi-competing

More information

Applied Survival Analysis Lab 10: Analysis of multiple failures

Applied Survival Analysis Lab 10: Analysis of multiple failures Applied Survival Analysis Lab 10: Analysis of multiple failures We will analyze the bladder data set (Wei et al., 1989). A listing of the dataset is given below: list if id in 1/9 +---------------------------------------------------------+

More information

Missing Data and Multiple Imputation

Missing Data and Multiple Imputation Maximum Likelihood Methods for the Social Sciences POLS 510 CSSS 510 Missing Data and Multiple Imputation Christopher Adolph Political Science and CSSS University of Washington, Seattle Vincent van Gogh

More information

Bayesian Inference on Joint Mixture Models for Survival-Longitudinal Data with Multiple Features. Yangxin Huang

Bayesian Inference on Joint Mixture Models for Survival-Longitudinal Data with Multiple Features. Yangxin Huang Bayesian Inference on Joint Mixture Models for Survival-Longitudinal Data with Multiple Features Yangxin Huang Department of Epidemiology and Biostatistics, COPH, USF, Tampa, FL yhuang@health.usf.edu January

More information

Nonrespondent subsample multiple imputation in two-phase random sampling for nonresponse

Nonrespondent subsample multiple imputation in two-phase random sampling for nonresponse Nonrespondent subsample multiple imputation in two-phase random sampling for nonresponse Nanhua Zhang Division of Biostatistics & Epidemiology Cincinnati Children s Hospital Medical Center (Joint work

More information

Analysis of Incomplete Non-Normal Longitudinal Lipid Data

Analysis of Incomplete Non-Normal Longitudinal Lipid Data Analysis of Incomplete Non-Normal Longitudinal Lipid Data Jiajun Liu*, Devan V. Mehrotra, Xiaoming Li, and Kaifeng Lu 2 Merck Research Laboratories, PA/NJ 2 Forrest Laboratories, NY *jiajun_liu@merck.com

More information

Time-Invariant Predictors in Longitudinal Models

Time-Invariant Predictors in Longitudinal Models Time-Invariant Predictors in Longitudinal Models Topics: What happens to missing predictors Effects of time-invariant predictors Fixed vs. systematically varying vs. random effects Model building strategies

More information

Maximum Likelihood Estimation; Robust Maximum Likelihood; Missing Data with Maximum Likelihood

Maximum Likelihood Estimation; Robust Maximum Likelihood; Missing Data with Maximum Likelihood Maximum Likelihood Estimation; Robust Maximum Likelihood; Missing Data with Maximum Likelihood PRE 906: Structural Equation Modeling Lecture #3 February 4, 2015 PRE 906, SEM: Estimation Today s Class An

More information

Multistate Modeling and Applications

Multistate Modeling and Applications Multistate Modeling and Applications Yang Yang Department of Statistics University of Michigan, Ann Arbor IBM Research Graduate Student Workshop: Statistics for a Smarter Planet Yang Yang (UM, Ann Arbor)

More information

Improving Efficiency of Inferences in Randomized Clinical Trials Using Auxiliary Covariates

Improving Efficiency of Inferences in Randomized Clinical Trials Using Auxiliary Covariates Improving Efficiency of Inferences in Randomized Clinical Trials Using Auxiliary Covariates Anastasios (Butch) Tsiatis Department of Statistics North Carolina State University http://www.stat.ncsu.edu/

More information

Bayesian inference for sample surveys. Roderick Little Module 2: Bayesian models for simple random samples

Bayesian inference for sample surveys. Roderick Little Module 2: Bayesian models for simple random samples Bayesian inference for sample surveys Roderick Little Module : Bayesian models for simple random samples Superpopulation Modeling: Estimating parameters Various principles: least squares, method of moments,

More information

Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood

Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood Jonathan Gruhl March 18, 2010 1 Introduction Researchers commonly apply item response theory (IRT) models to binary and ordinal

More information

ARIC Manuscript Proposal # PC Reviewed: _9/_25_/06 Status: A Priority: _2 SC Reviewed: _9/_25_/06 Status: A Priority: _2

ARIC Manuscript Proposal # PC Reviewed: _9/_25_/06 Status: A Priority: _2 SC Reviewed: _9/_25_/06 Status: A Priority: _2 ARIC Manuscript Proposal # 1186 PC Reviewed: _9/_25_/06 Status: A Priority: _2 SC Reviewed: _9/_25_/06 Status: A Priority: _2 1.a. Full Title: Comparing Methods of Incorporating Spatial Correlation in

More information

[Part 2] Model Development for the Prediction of Survival Times using Longitudinal Measurements

[Part 2] Model Development for the Prediction of Survival Times using Longitudinal Measurements [Part 2] Model Development for the Prediction of Survival Times using Longitudinal Measurements Aasthaa Bansal PhD Pharmaceutical Outcomes Research & Policy Program University of Washington 69 Biomarkers

More information

E(Y ij b i ) = f(x ijβ i ), (13.1) β i = A i β + B i b i. (13.2)

E(Y ij b i ) = f(x ijβ i ), (13.1) β i = A i β + B i b i. (13.2) 1 Advanced topics 1.1 Introduction In this chapter, we conclude with brief overviews of several advanced topics. Each of these topics could realistically be the subject of an entire course! 1. Generalized

More information

Longitudinal analysis of ordinal data

Longitudinal analysis of ordinal data Longitudinal analysis of ordinal data A report on the external research project with ULg Anne-Françoise Donneau, Murielle Mauer June 30 th 2009 Generalized Estimating Equations (Liang and Zeger, 1986)

More information

Computationally Efficient Estimation of Multilevel High-Dimensional Latent Variable Models

Computationally Efficient Estimation of Multilevel High-Dimensional Latent Variable Models Computationally Efficient Estimation of Multilevel High-Dimensional Latent Variable Models Tihomir Asparouhov 1, Bengt Muthen 2 Muthen & Muthen 1 UCLA 2 Abstract Multilevel analysis often leads to modeling

More information

THESIS for the degree of MASTER OF SCIENCE. Modelling and Data Analysis

THESIS for the degree of MASTER OF SCIENCE. Modelling and Data Analysis PROPERTIES OF ESTIMATORS FOR RELATIVE RISKS FROM NESTED CASE-CONTROL STUDIES WITH MULTIPLE OUTCOMES (COMPETING RISKS) by NATHALIE C. STØER THESIS for the degree of MASTER OF SCIENCE Modelling and Data

More information

Indirect estimation of a simultaneous limited dependent variable model for patient costs and outcome

Indirect estimation of a simultaneous limited dependent variable model for patient costs and outcome Indirect estimation of a simultaneous limited dependent variable model for patient costs and outcome Per Hjertstrand Research Institute of Industrial Economics (IFN), Stockholm, Sweden Per.Hjertstrand@ifn.se

More information

Multilevel Statistical Models: 3 rd edition, 2003 Contents

Multilevel Statistical Models: 3 rd edition, 2003 Contents Multilevel Statistical Models: 3 rd edition, 2003 Contents Preface Acknowledgements Notation Two and three level models. A general classification notation and diagram Glossary Chapter 1 An introduction

More information

Analysing geoadditive regression data: a mixed model approach

Analysing geoadditive regression data: a mixed model approach Analysing geoadditive regression data: a mixed model approach Institut für Statistik, Ludwig-Maximilians-Universität München Joint work with Ludwig Fahrmeir & Stefan Lang 25.11.2005 Spatio-temporal regression

More information

Multi-state models: prediction

Multi-state models: prediction Department of Medical Statistics and Bioinformatics Leiden University Medical Center Course on advanced survival analysis, Copenhagen Outline Prediction Theory Aalen-Johansen Computational aspects Applications

More information

Models for Multivariate Panel Count Data

Models for Multivariate Panel Count Data Semiparametric Models for Multivariate Panel Count Data KyungMann Kim University of Wisconsin-Madison kmkim@biostat.wisc.edu 2 April 2015 Outline 1 Introduction 2 3 4 Panel Count Data Motivation Previous

More information

Frailty Modeling for clustered survival data: a simulation study

Frailty Modeling for clustered survival data: a simulation study Frailty Modeling for clustered survival data: a simulation study IAA Oslo 2015 Souad ROMDHANE LaREMFiQ - IHEC University of Sousse (Tunisia) souad_romdhane@yahoo.fr Lotfi BELKACEM LaREMFiQ - IHEC University

More information

A STRATEGY FOR STEPWISE REGRESSION PROCEDURES IN SURVIVAL ANALYSIS WITH MISSING COVARIATES. by Jia Li B.S., Beijing Normal University, 1998

A STRATEGY FOR STEPWISE REGRESSION PROCEDURES IN SURVIVAL ANALYSIS WITH MISSING COVARIATES. by Jia Li B.S., Beijing Normal University, 1998 A STRATEGY FOR STEPWISE REGRESSION PROCEDURES IN SURVIVAL ANALYSIS WITH MISSING COVARIATES by Jia Li B.S., Beijing Normal University, 1998 Submitted to the Graduate Faculty of the Graduate School of Public

More information

Robust estimates of state occupancy and transition probabilities for Non-Markov multi-state models

Robust estimates of state occupancy and transition probabilities for Non-Markov multi-state models Robust estimates of state occupancy and transition probabilities for Non-Markov multi-state models 26 March 2014 Overview Continuously observed data Three-state illness-death General robust estimator Interval

More information

A COMPARISON OF POISSON AND BINOMIAL EMPIRICAL LIKELIHOOD Mai Zhou and Hui Fang University of Kentucky

A COMPARISON OF POISSON AND BINOMIAL EMPIRICAL LIKELIHOOD Mai Zhou and Hui Fang University of Kentucky A COMPARISON OF POISSON AND BINOMIAL EMPIRICAL LIKELIHOOD Mai Zhou and Hui Fang University of Kentucky Empirical likelihood with right censored data were studied by Thomas and Grunkmier (1975), Li (1995),

More information

Nonresponse weighting adjustment using estimated response probability

Nonresponse weighting adjustment using estimated response probability Nonresponse weighting adjustment using estimated response probability Jae-kwang Kim Yonsei University, Seoul, Korea December 26, 2006 Introduction Nonresponse Unit nonresponse Item nonresponse Basic strategy

More information

Sparse Linear Models (10/7/13)

Sparse Linear Models (10/7/13) STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine

More information

Probability and Probability Distributions. Dr. Mohammed Alahmed

Probability and Probability Distributions. Dr. Mohammed Alahmed Probability and Probability Distributions 1 Probability and Probability Distributions Usually we want to do more with data than just describing them! We might want to test certain specific inferences about

More information

A note on multiple imputation for general purpose estimation

A note on multiple imputation for general purpose estimation A note on multiple imputation for general purpose estimation Shu Yang Jae Kwang Kim SSC meeting June 16, 2015 Shu Yang, Jae Kwang Kim Multiple Imputation June 16, 2015 1 / 32 Introduction Basic Setup Assume

More information

Chapter 11. Correlation and Regression

Chapter 11. Correlation and Regression Chapter 11. Correlation and Regression The word correlation is used in everyday life to denote some form of association. We might say that we have noticed a correlation between foggy days and attacks of

More information

BIOS 312: Precision of Statistical Inference

BIOS 312: Precision of Statistical Inference and Power/Sample Size and Standard Errors BIOS 312: of Statistical Inference Chris Slaughter Department of Biostatistics, Vanderbilt University School of Medicine January 3, 2013 Outline Overview and Power/Sample

More information