Diagnostic Measures for the Cox Regression Model with Missing Covariates

Size: px

Start display at page:

Download "Diagnostic Measures for the Cox Regression Model with Missing Covariates"

Constance Moore
6 years ago
Views:

1 Biometrika (2015), 102, 1, pp C 2007 Biometrika Trust Printed in Great Britain Diagnostic Measures for the Cox Regression Model with Missing Covariates BY HONGTU ZHU, JOSEPH G. IBRAHIM Department of Biostatistics, University of North Carolina at Chapel Hill, 3109 McGavran-Greenberg Hall, Campus Box 7420, Chapel Hill, North Carolina 27516, U.S.A. hzhu@bios.unc.edu ibrahim@bios.unc.edu AND MING-HUI CHEN Department of Statistics, University of Connecticut, 215 Glenbrook Road, U-4120, Storrs, Connecticut 06269, U.S.A. ming-hui.chen@uconn.edu SUMMARY This paper investigates diagnostic measures for assessing the influence of observations and model misspecification in the presence of missing covariate data for the Cox regression model. Our diagnostics include case-deletion measures, conditional martingale residuals, and score residuals. The Q-distance is proposed to examine the effects of deleting individual observations on the estimates of finite-dimensional and infinite-dimensional parameters. Conditional martingale residuals are used to construct goodness of fit statistics for testing possible misspecification of the model assumptions. A resampling method is developed to approximate the p-values of the goodness of fit statistics. Simulation studies are conducted to evaluate our methods, and a real data set is analyzed to illustrate their use. Some key words: Case-deletion measure; Conditional martingale residual; Goodness-of-fit statistic; Model misspecification. 1. INTRODUCTION In surveys, clinical trials, and longitudinal studies, complete data are often not available for every subject. There is a very large literature on statistical methods for missing data. These methods, however, strongly depend on the missing-data mechanism as well as on other distributional and modeling assumptions, and can be very sensitive to them. For this reason, analyses are carried out to check the sensitivity of the parameter estimates with respect to the model assumptions. See, for example, Verbeke et al. (2001), Jansen et al. (2003), Troxel, Ma & Heitjan (2004), Copas & Eguchi (2005), and Daniels & Hogan (2007). Diagnostic measures including martingale residuals and Cook s distance have been widely used to identify influential observations and to test model misspecification in survival models (Storer & Crowley, 1985; Pettitt & Daud, 1989; Therneau, Grambsch, & Fleming, 1990; Escobar & Meeker, 1992; Henderson & Oman, 1993; Lin, Wei & Ying, 1993; Barlow, 1997; Marzec & Marzec, 1997; Klein and Moeschberger, 2003; Martinussen and Scheike, 2006). For instance, Pettitt & Daud (1989) applied Cook s (1986) local influence method to the proportional hazards model and derived several useful diagnostics. Martingale residuals have been widely used to construct goodness of fit statistics to examine the functional form of a covariate and the proportional

2 2 H. ZHU, J.IBRAHIM, AND M.CHEN hazards assumption (Barlow & Prentice, 1988; Therneau, Grambsch & Fleming, 1990; Lin, Wei & Ying, 1993). However, to the best of our knowledge, virtually no literature exists for developing diagnostic measures in the Cox regression model (Cox, 1972, 1975) with missing covariate data, except for Scheike, Martinussen & Silver (2010). 2. COX REGRESSION WITH MISSING COVARIATES 2 1. Model setup Consider n observations (x 1, z 1, r 1, y 1, δ 1 ),..., (x n, z n, r n, y n, δ n ), which are independent realizations of (X, Z, R, Y, ), where Y = T C is the minimum of the censoring time C and the survival time T, = 1(T Y ), which equals 1 if the observed event is a failure and 0 otherwise, and each X is a p 1 1 vector of completely observed covariates. Each Z = (Z m, Z o ) is a p 2 1 vector of partially observed covariates, where Z m and Z o denote the missing and observed components of Z. Let R be a p 2 1 random vector, whose k th component, R k, equals 1 if Z k is observed, and 0 if Z k is missing, where Z k is the kth component of Z. Under a general missing data mechanism, it is common to specify the joint density of (X, Z, R, Y, ) as a product of three conditional densities as follows: p(x, Z, R, Y, ) = p(y, X, Z)p(X, Z)p(R X, Z, Y, ). (1) The conditional density of (Y, ) = (y i, δ i ) given v i = (x i, z i ) is assumed to be p(y i, δ i v i ) λ t (y i v i ) δ i S t (y i v i )λ c (y i v i ) 1 δ i S c (y i v i ), i = 1,..., n, (2) where λ t ( ) and S t ( ) are the hazard and survivor functions of the failure time and λ c ( ) and S c ( ) are the hazard and survivor functions of the censoring time. We also assume the Cox model for the failure time, λ t (y i v i ) = h 0 (y i ) exp(v T i β), S t (y i v i ) = exp{ exp(v T i β)h 0 (y i )}, (3) where h 0 (y) is a baseline hazard function and H 0 (y) = y 0 h 0(u)du. We also need to specify a joint distribution for the covariate vector V = (X, Z). It is assumed that p(v i ; α) p(z i x i ; α)p(x i ), where α contains all the unknown parameters in p(v i ; α). Since the x i s are fully observed, it is not necessary to specify a distribution for X. We may follow Lipsitz & Ibrahim (1996) to further model p(z i x i ; α) as the product of one-dimensional conditional distributions. We need to consider different ways of modeling the missing data mechanism p(r i v i, y i, δ i ; ξ) (Ibrahim, Lipsitz & Chen, 1999), where ξ contains all the unknown parameters. It is common to use logistic regression models for the binary variables in r i. We calculate the conditional distribution of Z m = z m,i given D o = d o,i as follows: = p(z m,i d o,i ) λ t (y i v i ) δ i S t (y i v i )λ c (y i v i ) 1 δ i S c (y i v i )p(z i x i ; α)p(r i v i, y i, δ i ; ξ) λt (y i v i ) δ i St (y i v i )λ c (y i v i ) 1 δ i Sc (y i v i )p(z i x i ; α)p(r i v i, y i, δ i ; ξ)dz m,i, where d o,i = (x i, z o,i, r i, y i, δ i ). If the censoring time does not depend on the missing data and all unknown parameters, then we can drop the hazard and survivor functions of the censoring times from the model. Moreover, if the missing data are missing at random, then p(z m,i d o,i ) = λ t (y i v i ) δ i S t (y i v i )p(z i x i ; α) λt (y i v i ) δ i St (y i v i )p(z i x i ; α)dz m,i. (4)

3 Diagnostic Measures for the Cox Regression Model 3 The expectation-maximization algorithm is a popular technique for obtaining the maximum likelihood estimates of η = {h 0 ( ), γ}, denoted by ˆη = {ĥ0( ), ˆγ}, in the Cox regression model with missing covariate data (Chen & Little, 1999; Herring & Ibrahim, 2001), where γ = (β T, α T, ξ T ) T. Let D c and D o denote the complete and observed data, respectively. We calculate the nonparametric maximum likelihood estimator of H 0 ( ), which is a step function with jumps only at the y i such that δ i = 1 for i = 1,..., n. Without loss of generality, we assume that y 1,..., y d are d distinct failure times. At the sth step of the expectation maximization algorithm, given η (s), the expectation-step involves evaluating the Q-function Q(η η (s) ) = E{L c (η D c ) D o, η (s) }, given by Q(η η (s) ) = log[p{y i, δ i x i, z i ; β, h 0 ( )}]p(z m,i x i, z o,i, r i, y i, δ i ; η (s) )dz m,i + + log{p(x i, z i ; α)}p(z m,i x i, z o,i, r i, y i, δ i ; η (s) )dz m,i log{p(r i x i, z i, y i, δ i ; ξ)}p(z m,i x i, z o,i, r i, y i, δ i ; η (s) )dz m,i = Q 1 {β, h 0 ( ) η (s) } + Q 2 (α η (s) ) + Q 3 (ξ η (s) ), (5) where L c (η D c ) = log p(d c ; η) is the complete-data log-likelihood function. The maximization step consists of maximizing Q 1 {β, h 0 ( ) η (s) }, Q 2 (α η (s) ), and Q 3 (ξ η (s) ), separately (Chen & Little, 1999; Herring & Ibrahim, 2001). Our main interest is to make valid inferences about β and H 0 (y), and this requires the correct specification of all three levels of the assumptions in (1), otherwise there may be serious bias in estimation of β and H 0 ( ). Thus, it is crucial to assess the potential misspecification of all the assumptions in (1) Assumptions The following assumptions are needed to facilitate development of our methods, although they may not be the weakest possible conditions. Assumption 1. The C i and T i given V = v i are independent and the hazard and survivor functions of C i do not depend on z m,i and η. Assumption 2. The true value (α, β, ξ ) of (α, β, ξ) is an interior point of the compact parameter space for (α, β, ξ). Assumption 3. The functions log p(v; α) and log p(r x, z, y, δ; ξ) are twice continuously differentiable on γ and the absolute values of their first- and second-order derivatives are dominated by a function B(d). For each i, B(d i ) is integrable such that sup η E{B(d i ) 2 D o ; η} = O p (1). Moreover, v is bounded, p(v; α) is uniformly bounded and identifiable, and var(v) and { 2 α log p(v; α )}p(v; α )dv are positive definite. Assumption 4. Let τ be a finite time point at which any individual still under study is censored. Assume pr(y τ) > 0. The function H 0 (t) = t 0 h 0(s)ds is absolutely continuous and a nondecreasing function such that H 0 (0) = 0 and H 0 (τ) <. Moreover, h 0 (s) 0 is twice continuously differentiable. Assumption 5. The missing covariate data are missing at random, that is, pr(r x, z, y, δ) = pr(r x, z o, y, δ). In addition, the fully observed complete covariates can be observed for all

4 4 H. ZHU, J.IBRAHIM, AND M.CHEN possible covariate values. That is, pr(r = 1 p2 x, z, y, δ) > 0 holds for almost all (x, z) and almost all y [t 1, t 2 ] such that H 0 (t 1 ) H 0 (t 2 ), where 1 p2 is a p 2 1 vector of ones. Assumption 6. The probability function F ϕ (dt)dϕ is absolutely continuous with respect to Lebesgue measure on Π = {ϕ R p 1 : ϕ T ϕ = 1} [, ]. Assumption 7. As n, for any sequences {(ϕ n, u n, t n )} and {(ϕ n,1, u n,1, t n,1 )}, ρ n (ϕ n, u n, t n ; ϕ n,1, u n,1, t n,1 ) converges to zero when ρ(ϕ n, u n, t n ; ϕ n,1, u n,1, t n,1 ) 0. Moreover, ρ(ϕ, u, t; ϕ 1, u 1, t 1 ) is the limit of ρ n (ϕ n, u n, t n ; ϕ n,1, u n,1, t n,1 ), which is defined as (n 1 E[{R i (t n )1(ϕ T n x i u n ) R i (t n,1 )1(ϕ T n,1x i u n,1 )} 2 ]) 1/2, where R i (t) is a conditional martingale residual, which will be introduced later. Assumption 8. For any small a 0 > 0, sup pr[ δ < {v i (α) T ϕ t}/v i (x i, z o,i ) < δ] C 0 δ c 1, (α,ϕ,t) A Π where C 0 and c 1 are two positive scalars, A = {α : α α a 0 }, and V i (x i, z o,i ) 2 = sup α A α v i (α) 2 + sup α A v i (α) Moreover, v i (ˆα) = {x i, r i1 z i1 + (1 r i1 )E(z i1 x i, z o,i ; ˆα),..., r ip2 z ip2 + (1 r ip2 )E(z ip2 x i, z o,i ; ˆα)} T. Assumption 9. Let λ min ( ) be the smallest eigenvalue of a matrix. For a fixed ɛ 0 > 0, we have n 1 [Q{ˆγ, ĥ0( ) ˆη} sup Q{γ, ĥ0( ) ˆη}] = C 0 + o p (1), γ ˆγ =ɛ 0 sup n 1 γq{γ, 2 ĥ0( ) ˆη} A(γ) = o p (1), γ ˆγ ɛ 0 where min γ ˆγ ɛ0 λ min {A(γ) 2 } > 0 and C 0 is a positive scalar. Assumptions 1 5 have been used to establish consistency and asymptotic normality of the nonparametric maximum likelihood estimator in a proportional hazards regression model with covariates missing at random (Chen & Little, 1999). Assumption 6 is required to establish the asymptotic distributions of the Cramer von Mises test statistics introduced below. Assumption 7 is required to invoke the central limit theory for the sums of independent but not identically distributed stochastic processes (van der Vaart & Wellner, 1996; Kosorok, 2007; Pollard, 1990). Assumption 8 is required to invoke Ossiander s entropy conditions (Ossiander, 1987; Andrews, 1994). Assumption 9 is required to establish the asymptotic accuracy of approximating casedeletion measures introduced below. 3. DIAGNOSTIC MEASURES 3 1. Case-deletion influence measures To quantify the effects of deleting the ith observation on the maximum likelihood estimate, ˆη of η, it is common to compute the maximum likelihood estimate of η for a subsample D c[i], in which the ith observation d i = (y i, δ i, v i, r i ) is deleted from D c = {D o, D m } = {(y j, δ j, c j, r j ) : j = 1,..., n}, where D o and D m denote the observed and missing data. However, it is computationally intensive to directly maximize the likelihood function based on the subsample D c[i] for each i. Instead, we define Q [i] (η ˆη) as

5 Diagnostic Measures for the Cox Regression Model 5 Q [i] (η ˆη) = E{L c (η D c[i] ) D o ; ˆη}, where L c (η D c[i] ) denotes the complete-data loglikelihood function for D c[i] and the expectation is taken with respect to p(d m D o ; ˆη). Similar to (5), Q [i] (η ˆη) equals the sum of Q 1[i] {β, h 0 ( ) ˆη} = j i E(log[p{y j, δ j x j, z j ; β, h 0 ( )}] D o ; ˆη), Q 2[i] (α ˆη) = j i E[log{p(x j, z j ; α)} D o ; ˆη], and Q 3[i] (ξ ˆη) = j i E[log{p(r j x j, z j, y j, δ j ; ξ)} D o ; ˆη]. Let ω = (ω 1,..., ω n ) T with ω k 0 for all k. We define Q 1 {ω, β, h 0 ( ) ˆη} to be ω k δ k {log h 0 (y k ) + E(c T k β D o; ˆη)} k=1 ω k H 0 (y k )E{exp(c T k β) D o; ˆη}. (6) k=1 First, by substituting ĥ0( ) into (6), we can obtain Q 1 (ω, β, ĥ0( )) as δ k ω k E(c T k β D o; ˆη) k=1 ω k Ĥ 0 (y k )E{exp(c T k β) D o; ˆη}. (7) k=1 We first calculate ˆβ(ω) = argmax β Q 1 {ω, β, ĥ0( )} and then maximize Q 1 {ω, ˆβ(ω), h 0 ( )} with respect to h 0 ( ), which leads to ĥ 0 (y k β, ω) = δ k ω k j R k ω j E{exp(c T j β) D o; ˆη}, (8) in which R k = {j : y j y k }. If ω = 1 n is an n 1 vector of ones, then ˆβ(1 n ) = ˆβ and ĥ 0 (y k ) = ĥ0(y k ˆβ, 1 n ). Furthermore, if ω = 1 n e i, then we define ˆβ [i] = ˆβ(1 n e i ) and ĥ 0[i] (y k ) = ĥ0(y k ˆβ [i], 1 n e i ). Similarly, we define ˆα [i] and ˆξ [i] as the maximizers of Q 2[i] (α ˆη) and Q 3[i] (ξ ˆη), respectively. Now, we are able to calculate a one-step approximation ˆη 1 [i] = {ĥ1 0[i] ( ), ˆβ 1 [i], ˆα1 [i], ˆξ 1 [i] } of ˆη [i] = {ĥ0[i]( ), ˆβ [i], ˆα [i], ˆξ [i] } as below. We obtain the following theorem, whose proof can be found in the Supplementary Material. THEOREM 1. Under Assumptions 3 and 9, ˆβ 1 [i] = ˆβ [ 2 β Q 1{1 n, ˆβ, ĥ0( )}] 1 2 βω i Q 1 {1 n, ˆβ, ĥ0( )} = ˆβ [i] + o p (n 1 ), ˆα 1 [i] = ˆα { 2 αq 2 (ˆα ˆη)} 1 E{ α log p(v i ; ˆα) D o ; ˆη} = ˆα [i] + o p (n 1 ), ˆξ 1 [i] = ˆξ { 2 ξ Q 3(ˆξ ˆη)} 1 E{ ξ log p(r i d o,i ; ˆξ) D o ; ˆη} = ˆξ [i] + o p (n 1 ), (9) ĥ 1 0[i] (y k) = ĥ0(y k ˆβ 1 [i], 1 n e i ) = ĥ0[i](y k ) + o p (n 1 ). Theorem 1 gives the one-step approximation ˆη 1 [i] of ˆη [i] for each major component of η. It is straightforward to compute ˆη [i] 1 using (9). We introduce a Q-distance for the finite-dimensional parameter γ in the presence of an infinitedimensional parameter h 0 ( ) to quantify the distance between the maximum likelihood estimates of γ with and without the ith observation deleted from the full sample (Cook & Weisberg, 1982; Zhu et al., 2001). The Q-distance for the ith subject is defined as QD i (M) = (ˆγ [i] 1 ˆγ)T M(ˆγ [i] 1 ˆγ), (10) where M is a positive definite matrix. According to (5), we assume that M = diag[ 2 β Q 1{1 n, ˆβ, ĥ0( )}, 2 αq 2 (ˆα ˆη), 2 ξ Q 3(ˆξ ˆη)]. Thus, QD i can be decomposed as a sum

6 6 H. ZHU, J.IBRAHIM, AND M.CHEN of three diagnostic measures based on (1)-(3), that is QD i = QD i,1 + QD i,2 + QD i,3, where QD i,1 = [ ω 2 i β Q 1{1 n, ˆβ, ĥ0( )}] T [ β 2 Q 1{1 n, ˆβ, ĥ0( )}] 1 βω 2 i Q 1 {1 n, ˆβ, ĥ0( )}, QD i,2 = E{ α log p(v i ; ˆα) D o ; ˆη} T { 2 αq 2 (ˆα ˆη)} 1 E{ α log p(v i ; ˆα) D o ; ˆη}, (11) QD i,3 = E{ ξ log p(r i d o,i ; ˆξ) D o ; ˆη} T { 2 ξ Q 3(ˆξ ˆη)} 1 E{ ξ log p(r i d o,i ; ˆξ) D o ; ˆη}. Intuitively, QD i,1, QD i,2, and QD i,3 are associated with the effects of removing the ith observation on the assumptions of p{y i, δ i c i ; β, h 0 ( )}, p(v i ; α), and p(r i v i, y i, δ i ; ξ). If QD i is large, then the ith observation is influential. Similarly, we can also quantify the effects of deleting two or more observations on ˆη (Cook & Weisberg, 1982). For simplicity, we omit those details here. We also define a distance function of ĥ0( ) and ĥ1 0[i] ( ) to quantify the effect of deleting the ith observation on the infinite-dimensional parameter h 0 ( ). Let be the sup-norm for functions. Specifically, we define QD i,h0 ( ) = max Y k (y j ){ĥ0(y k ) 1 j n ĥ1 0[i] (y k)} = Ĥ0 Ĥ1 0[i], (12) k=1 where Y k (u) = 1(y k u), Ĥ 0 (y) = y j y ĥ0(y j ), and Ĥ1 0[i] (y) = y j y ĥ1 0[i] (y j). A challenging issue is the quantification of the magnitude of these case-deletion measures for detecting influential observations. A common approach is to sort these measures for all observations then classify observations with larger measures as influential. However, this approach may not identify truly influential observations and it does not reveal why an observation is influential. To address this issue, we introduce a detection probability of being influential for each observation and for any case-deletion measure. The key idea is to measure the standardized influential level of each observation for a case-deletion measure under the assumption that (1) is the true data generator. We compute the detection probabilities of all observations based on the fitted model p(d i ; ˆη) as follows: We use a semi-bootstrap method described in the Appendix to generate multiple bootstrapped data sets. Then, for each bootstrapped data set, we calculate all of the case-deletion diagnostic measures across all observations. For each observation, the detection probability is calculated as the proportion of the bootstrapped case-deletion diagnostic measures smaller than the corresponding observed case-deletion diagnostic measure. Observations with large detection probabilities, say 0.95 or greater, can be regarded as influential Residuals We consider two types of residuals: conditional martingale residuals and score residuals for the Cox regression model with missing covariates. In the absence of missing covariates, the martingale residual for the ith observation at time t is defined as M i (t) = N i (t) t 0 Y i (u) exp(v T i β)h 0 (u)du, (13) where N i (t) = δ i 1(T i t). However, since z m,i is missing, M i (t) cannot be directly calculated for those cases with missing covariates. Although there are many ways of integrating out z m,i, we define a conditional martingale residual for the ith observation at t as R i (t) = N i (t) t 0 Y i (u)e{exp(v T i β) d o,i }h 0 (u)du, i = 1,..., n (14) where d o,i = (y i, δ i, x i, z o,i, r i ) and the expectation is taken with respect to p(z m,i d o,i ; η). If there are no missing covariates in z i, then R i (t) reduces to M i (t). Thus, R i (t) can be regarded

7 Diagnostic Measures for the Cox Regression Model 7 as a generalization of the martingale residuals used in Cox regression. Computationally, the conditional expectation involved in (14) can be easily carried out using Markov chain Monte Carlo methods (Chen, Shao & Ibrahim, 2000). The R i (t) evaluated at ˆη is given by R i (t) = N i (t) t 0 Y i (u)e{exp(v T i ˆβ) d o,i ; ˆη}ĥ0(u)du. (15) In particular, when t = τ = sup{u : pr{y (u) = 1} > 0}, i.e., the end time of the study, we can obtain the corresponding conditional martingale residual as follows: R i = R i (τ) = δ i ˆr i = δ i yi 0 E{exp(v T i ˆβ) d o,i ; ˆη}ĥ0(u)du, (16) where ˆr i is a generalization of the Cox Snell residual in the presence of missing covariates (Cox & Snell, 1968). We consider the score residual. We define S (r) (β, u; ˆη) = n 1 n Y i(u)e{exp(vi Tβ)v r i D o ; ˆη} for r = 0, 1, 2, where a 0 = 1, a 1 = a, and a 2 = a a T for a vector a. The score function associated with β is β Q(ˆη ˆη) = [δ i E(v i d o,i ; ˆη) Ĥ0(y i )E{v i exp(vi T ˆβ) d o,i ; ˆη}] = 0 U i (u, ˆη)dN i (u), where U i (u; η) = {U i,1 (u; η) T, U i,2 (u; η) T } T = E(v i d o,i ; η) S (1) (β, u; η)/s (0) (β, u; η), in which U i,1 (u; η) denotes the first p 1 components of U i (u; η) associated with x i. Furthermore, we can define a score process for β Ŝ i,2 U(t η) = {U 1 (t η) T, U 2 (t η) T } T = t 0 U i (u; η)dn i (u), (17) where U 1 (t η) denotes the first p 1 components of U(t η) associated with x i. Finally, we have 0 = β Q{ ˆβ, ĥ0( ) ˆη} = U(τ ˆη) = n Ŝi, where Ŝi = (ŝ i1,..., ŝ ip ) is given by ( ) ( ) ( Ŝi,1 x = δ i i E(z i d o,i ; ˆη) Ĥ0(y i ) exp(x T ˆβ xi E{exp(z i 1 ) i T ˆβ ) 2 ) d o,i ; ˆη} E{z i exp(zi T ˆβ, (18) 2 ) d o,i ; ˆη} and Ŝi,1 is the first p 1 1 subvector of Ŝi associated with β 1. Score residuals are useful tools in detecting influential observations and in assessing model assumptions (Therneau et al., 1990). Following the case-deletion diagnostic measures, we can use the same semi-bootstrap method to generate random samples and then calculate the detection probabilities of ŝ ik for k = 1,..., p. We examine several properties of the proposed conditional martingale residuals and score residuals. Through a better understanding of the properties of those residuals, we may develop both formal and informal diagnostic tools to examine of the adequacy of the Cox regression model with missing covariates. THEOREM 2. Suppose that Assumption 3 is true. Then (a) E{R i (t) x i, z o,i } = E{R i (t) x i } = E{R i (t)} = 0. (b) In general, E{R i (t) x i, z o,i, r i } may not equal zero. If p(r i v i, y i, δ i, ξ) is independent of y i and δ i, then E{R i (t) x i, z o,i, r i } = 0.

8 8 H. ZHU, J.IBRAHIM, AND M.CHEN (c) If the missing data are missing at random, then R i (t) = N i (t) t and R i (t) is independent of ξ. For any t, n R i (t) = 0. (d) We have U 1 (t η) = n t 0 U i,1(u; η)d{r i (u)}. (e) We have U 2 (t η) n t 0 U i,2(u; η)d{r i (u)}. 0 Y i (u)e[exp{(x T i, z T i )β} x i, z o,i, δ i, y i ]h 0 (u)du (19) Theorem 2 characterizes the behavior of R i (t) and R i (t). First, E{R i (t)}, E{R i (t) x i }, and E{R i (t) x i, z o,i } are unbiased, whereas E{R i (t) x i, z o,i, r i } is biased. Second, the missing data indicators can be dropped from R i (t) under the missing-at-random assumption. Third, the conditional martingale residuals share some properties with ordinary residuals in linear models and martingale residuals in the Cox regression model. Fourth, in the presence of missing covariates, we cannot replace N i (t) by R i (t) in the score residual process Conditional residual process without incorporating missing data We use the conditional martingale residuals to develop test statistics to check model assumptions in the Cox regression model with missing covariates. These statistics are designed to test the null hypothesis H (0) 0 : E{M(t) x, z} = 0 for some η and all t against the alternative H (0) 1 : E{M(t) x, z} 0 for all η and some t. However, since some components of z are missing, we may wish to test the equality h(t x) = E{R(t) x} = 0. Instead, we test H (1) 0 : h(t x) = 0 for some η and all t, H (1) 1 : h(t x) 0 for all η and some t. (20) Note that h(t x) = 0 is only a necessary condition for E{M(t) x, z} = 0, so accepting h(t x) = 0 does not imply the acceptance of H (0) 0. We can construct statistics for testing H (1) 0 as follows. Following the reasoning in Escanciano (2006) and Zhu, Ibrahim, and Shi (2009), we can show that H (1) 0 is equivalent to testing E{R(t)1(x T ϕ u)} = 0 for almost every (ϕ, u) and all t [0, τ]. Thus, we may define a stochastic process I 1 (ϕ, u, t; η) = n 1/2 1(x T i ϕ u)r i (t), (21) where (ϕ, u) Π and t [0, τ]. We regard I 1 (ϕ, u, t; η) as a stochastic process indexed by (ϕ, u, t) and use it to construct a Cramer von Mises test statistic CM 1 (t) = I 1 (ϕ, u, t; ˆη) 2 F n,ϕ (du)dϕ, (22) Π where F n,ϕ (u) is the empirical distribution function of {x T i ϕ : i = 1,..., n}. Large values of CM 1 (t) lead to rejection of H (1) 0. Compared with other test statistics based on the martingale residual processes (Lin, Wei & Ying, 1993), CM 1 (t) avoids both numerical integration in high dimensions and high-dimensional maximization. THEOREM 3. Under Assumptions 1-7, I 1 (ϕ, u, t; ˆη) is asymptotically equivalent to the sum of I 1 (ϕ, u, t; η ) and n 1/2 [h 1 (ϕ, u, t; η ) T ( ˆβ β ) + h 2 (ϕ, u, t; η ) T (ˆα α ) + τ 0 h 3(ϕ, u, t; η )(s)d{ĥ0(s) H 0 (s)}], where h 1 (ϕ, u, t; η ), h 2 (ϕ, u, t; η ) and

9 Diagnostic Measures for the Cox Regression Model 9 h 3 (ϕ, u, t; η )(s) are defined in the Supplementary Material. Moreover, I 1 (ϕ, u, t; ˆη) converges in distribution to a zero-mean Gaussian process G 1 (ϕ, u, t) and CM 1 (t) converges in distribution to Π G 1(ϕ, u, t) 2 F ϕ (du)dϕ, as n. Theorem 3 (a) characterizes the asymptotic null distributions of I 1 (ϕ, u, t; ˆη) and CM 1 (t). Based on this result, we can develop a resampling method to approximate the null distribution of CM 1. Let {v (b) i : i = 1,..., n} be a random sample from the N(0, 1) distribution. Then, we calculate I 1 (ϕ, u, t; ˆη) (b) = n 1/2 v (b) i { R i (t)1(x T i ϕ u) + n ˆ n (ϕ, u, t) T Jn 1 ψ n,i }, where ψ n,i denotes the score vector for (β, α) and the ĥ0(y i ) for all uncensored observations and ˆ n (ϕ, u, t) includes h 1 (ϕ, u, t; ˆη), h 2 (ϕ, u, t; ˆη), and all h 3 (ϕ, u, t; ˆη)(y i ) for all δ i = 1. We then calculate the test statistics {CM 1 (t) (b) : b = 1,..., B} and approximate the p value of CM 1 (t). Theoretically, we can show that this resampling method is asymptotically valid. COROLLARY 1. Suppose that Assumptions 1 7 are true. As n, conditional on the observed data, I 1 (ϕ, u, t; ˆη) (q) converges weakly to the same Gaussian process as I 1 (ϕ, u, t; ˆη) Conditional residual process incorporating missing data We consider using the missing covariates z i to improve the power of I 1 (ϕ, u, t; η) in detecting potential model misspecification. Since 1(x T ϕ u) in I 1 (ϕ, u, t; η) does not involve the missing covariates z, we may lose power in detecting the misspecification of H (0) 0 in the missing covariate space. In particular, if the fraction of missing covariates is small, then it is very inefficient to drop all the information in z. We first suppose that p(r i x i, z i, y i, δ i ; ξ) is independent of y i and δ i. It can be shown that E{R i (t)1(ˆv T i ϕ u) x i, z o,i } = 0, i = 1,..., n, (23) where ( ϕ, u) Π = { ϕ R p 1+p 2 : ϕ T ϕ = 1} [, ] and ˆv i = {x i, z o,i, z m,i (ˆα)}. We are thus able to incorporate the additional information from z o,i into the indicator function 1(ˆv i T ϕ t). We propose the stochastic process I 2 ( ϕ, u, t; η) = n 1/2 1(ˆv i T ϕ u)r i (t). (24) Plotting I 2 ( ϕ, u, t; ˆη) against t for a specific ϕ is an exploratory tool for detecting the form of misspecification of assumption (1). Then, we develop the corresponding Cramer von Mises test statistic based on I 2 ( ϕ, u, t; ˆη), denoted by CM 2. Large values of CM 2 lead to rejection of the hypothesis that E{R i (t) x i, z o,i, r i } = 0, which may be caused by either that p(r i x i, z i, y i, δ i ; ξ) depends on (y i, δ i ) or E{R(t) x, z} 0. Second, we develop a general strategy for incorporating the information in the missing data. We examine whether we can use the imputed missing covariate data ˆv i when p(r i x i, z i, y i, δ i, ξ) depends on y i and δ i. It can be shown that E{R i (t)1(ˆv T i ϕ t) x i, z o,i } = E[E{R i (t) x i, z o,i, r i }1(ˆv T i ϕ t) x i, z o,i ] 0, which is caused by the facts that v i (α) is a function of v i and r i and E{R i (t) x i, z o,i, r i } 0. We propose to construct a density function for z i given x i, denoted by ˆp(z i x i ), using either parametric methods or nonparametric methods based on all the observed data, and then

10 10 H. ZHU, J.IBRAHIM, AND M.CHEN we simulate z i in the space of the missing covariate data for all observations instead of only imputing the missing covariates z m,i. For simplicity, we use p(z i x i ; ˆα) to simulate {z (b) i : i = 1,..., n} for b = 1,..., B 3. Let v (b) i = (x i, z (b) i ). Then, we propose a conditional martingale residual process, I 3 ( ϕ, u, t; η) (b) = n 1/2 1(v (b)t i ϕ u)r i (t). (25) We may plot I 3 ( ϕ, u, t; ˆη) (b) against u for a specific ϕ as an exploratory tool for detecting possible model misspecification. Similar to the above, we can develop the corresponding Cramer von Mises test statistic based on I 3 ( ϕ, u, t; ˆη) (b) and denote it by CM (b) 3. Large values of CM(b) 3 lead to rejection of the hypothesis that E{R(t) x i, z o,i } = 0. Following Zhu, Ibrahim, and Shi (2009), we can establish the asymptotic distributions of I 2 ( ϕ, u, t; ˆη), I 3 ( ϕ, u, t; ˆη) (b), CM 2, and CM (b) 3. For simplicity, we only include the asymptotic null distribution of I 2 ( ϕ, u, t; ˆη) below. COROLLARY 2. If Assumptions 1-8 hold, then I 2 ( ϕ, u, t; ˆη) converges in distribution to a zero-mean Gaussian process G 2 (ϕ, u, t) defined in the Supplementary Material. Based on the results in Corollary 2, we can also develop a resampling method to approximate the null distribution of CM 2 (t) in order to calculate the p-values of the test statistic CM 2 (t). 4. SIMULATION STUDIES 4 1. Case-deletion measures and martingale residuals We generated 100 data sets from a Cox regression model with missing covariates. Each data set consists of n = 200 observations {(x i, z i, δ i, y i ) : i = 1,..., n} with a completely observed covariate x i and two missing covariates z i = (z i1, z i2 ) T. Moreover, x i was generated from a Bernoulli(0.5) distribution; conditional on x i, z i1 was generated from the logistic regression model logit{pr(z i1 = 1 x i, α 1 )} = α 10 + α 11 x i with α 10 = 1.0 and α 11 = 0.5; and conditional on (x i, z i1 ), z i2 was generated from a N(α 20 + α 21 x i1 + α 22 z i1, α 23 ) distribution, where (α 20, α 21, α 22, α 23 ) = (0.2, 0.1, 0.4, 1). The survival time T i was independently generated from λ(t c i ; β) = h 0 (t) exp(x i β 1 + z i1 β 2 + z i2 β 3 ) with h 0 (t) = 0.56 and β = (0.5, 0.5, 1.0) T and the censoring times C i were independently generated from a U(0, 3) distribution. Then, we set y i = min(t i, C i ), and δ i = 1 when T i C i and 0 otherwise. The missing data z i1 were generated from the logistic regression model logit{pr(r i1 = 1 y i, c i ; ξ 1 )} = ξ 10 + ξ 11 y i + ξ 12 x i + ξ 13 z i2 with ξ 1 = (0.5, 0.3, 0.5, 0.5) T ; and the missing data z i2 were generated from the logistic regression model logit{pr(r i2 = 1 y i, r i1, x i, z i, ξ 2 )} = ξ 20 + ξ 21 y i + ξ 22 r i1 + ξ 23 x i + ξ 24 z i1 with ξ 2 = (0.3, 0.3, 0.4, 0.2, 0.2) T. Under the above simulation design, each simulated data set has about 44% censored values of the y i s, about 23% missing z i1 s and about 37% missing z i2 s. We investigate the performance of different diagnostic measures in the simulated data sets. We introduce two outliers in each simulated data set. In the first, we perturbed the 41st observation by adding s to each survival time, i.e., y i + s, and perturbed the 90th observation by adding s to z i,2, i.e., z i,2 + s. For all other 99 data sets, we selected the two observations closest to the 41st and 90th observations according to the values of (y i, δ i, x i, z i, r i ) and perturbed these two observations in the same way as in the first data set. We varied the value of s in order to introduce different degrees of perturbation. We fitted the same missing-not-at-random model used to

11 Diagnostic Measures for the Cox Regression Model 11 Table 1. Summary of detection probabilities (%) of QD i,1, QD i,2, QD i,3, QD i,h0( ), martingale residual Ri, and score residual s i1 for 100 simulated datasets under the missing-not-atrandom model. Q1, median, and Q3, respectively, denote the 25th percentile, the 50th percentile, and the 75th percentile of 100 detection probabilities. The maximum value is calculated as max(pqd i,1, pqd i,2, pqd i,3, pqd, i,h0( ) pmc i, ps i1 ). y 41 + s pqd i,1 pqd i,2 pqd i,3 pqd i,h0 ( ) pmc i ps i1 Maximum s = 0.1 Median Q Q % % s = 0.2 Median Q Q % % z 90,2 + s pqd i,1 pqd i,2 pqd i,3 pqd i,h0 ( ) pmc i ps i1 Maximum s = 1.5 Median Q Q % % s = 2.5 Median Q Q % % Table 2. Rejection rates (%) of CM 1 (τ) and CM 2 (τ) at the 5% significance level. Average Missing Data Fraction 25% 50% CC Analysis Analysis of All Cases CC Analysis Analysis of All Cases c CM 1(τ) CM 2(τ) CM 1(τ) CM 2(τ) CM 1(τ) CM 2(τ) CM 1(τ) CM 2(τ) generate the simulated data sets, and then calculated various diagnostic measures and their detection probabilities. Table 1 presents the summaries of detection probabilities of various diagnostic measures for 100 simulated data sets. As the degree of perturbation increases, the detection probabilities increase in magnitude. Lowering the threshold for detection probabilities from 97.5% to 90% increases the probability of detecting the induced outliers. Overall, our detection probability is effective for detecting outliers Cramer-von Mises Goodness-of-fit Test Statistics The goal of this simulation is to assess the empirical performance of CM 1 (τ) and CM 2 (τ) and their associated resampling method. We generated data sets from a Cox regression model with

12 12 H. ZHU, J.IBRAHIM, AND M.CHEN two completely observed covariates x i = (x i1, x i2 ) T and one missing covariate z i as follows. In this simulation study, x i1 and x i2 were independently generated from the normal distributions N(0, 1) and N(0, ), respectively; conditional on x i, z i was generated from a normal distribution N(α 0 + α 1 x i1 + α 2 x i2, α 3 ), where (α 0, α 1, α 2, α 3 ) = (0.5, 0.1, 0.4, 1). The survival times T i were independently generated from λ(t x i, z i ; β) = h 0 (t) exp(x i1 β 1 + x i2 β 2 + z i β 3 + cx 2 i1 ) with h 0(t) = 1.0 and β = (1.0, 1.0, 1.0) T and the censoring times C i were independently generated from a U(0, 11) distribution. Then, we set y i = min(t i, C i ), and δ i = 1 when T i C i and 0 otherwise. The missing data z i were assumed to be missing at random and generated from the logistic regression model logit{pr(r i = 1 x i ; ξ)} = ξ 0 + ξ 1 x i1 + ξ 2 x i2. We considered two sets of ξ: (I) ξ = (1.085, 0.2, 0.1) T and (II) ξ = ( 0.015, 0.2, 0.1) T. The averages and ranges of missing data fractions are, respectively, 25% and (19.2%, 31.6%) under (I) and 50% and (44.0%, 58.4%) under (II). We considered c = 0, 0.25, 0.50, and The average censoring percentages are, respectively, 24.6%, 21.4%, 18.7%, and 16.7% for c = 0, 0.25, 0.50, and For each combination of c and ξ, we generated 100 data sets. For all simulated data sets, we fit the Cox regression model (1) with λ(t x i, z i, β) = h 0 (t) exp(x i1 β 1 + x i2 β 2 + z i β 3 ) assuming missing at random. We carried out the completecase analysis and the all-case analysis. Thus, the model would be misspecified if c 0 and the misspecification would be due to the covariate x 2 i1. We set B = 500 to calculate the p-values of all test statistics. The significance level was fixed at The rejection rates are shown in Table 2. As expected, the power of both CM 1 (τ) and CM 2 (τ) to detect model misspecification increases in c and decreases with the missing data fraction. Moreover, the power of CM 1 (τ) is higher than that of CM 2 (τ), but CM 1 (τ) has slightly higher type I errors than CM 2 (τ) when the missing data fraction is low. For the complete-case analysis, the type I error rates of CM 1 and CM 2 are close to In contrast, for the all-case analysis, the type I error rates of CM 1 and CM 2 are 0.07 and 0.05 when the average missing data fraction is 25% and 0.04 and 0.04 when the average missing data fraction is 50%. When c = 0.25, the all-case analysis outperforms the complete-case analysis in terms of the detection of model misspecification. However, this is not the case when c 0.5, which may be caused by the robustness of the complete-case analysis when the data are truly missing at random. 5. ANALYSIS OF SMALL CELL LUNG CANCER DATA We consider data from a phase III advanced non-small-cell lung cancer clinical trial conducted at the University of North Carolina at Chapel Hill (Socinski et al. 2002). The goal of this trial was to compare a defined duration of therapy to continuous therapy followed by second-line therapy in order to determine optimal duration of therapy in non-small-cell lung cancer patients. This study had n = 230 patients. We consider five prognostic factors: x 1 = treatment, which takes a value of 1 if the subject received a defined duration of therapy and 0 otherwise, x 2 = gender, which equals 1 if the subject is male and 0 otherwise, x 3 = age in years, z 1 = Apex, which equals 1 if the tumor was at the top of the lung and 0 otherwise, and z 2 = FACT-G score. For these five prognostic factors, z 1 and z 2 had missing information and x 1, x 2, and x 3 were completely observed for all cases. In this data set, 52.74% of the subjects had missing values in at least one of z 1 and z 2. The outcome variable is time to progression, which is continuous and subject to right censoring, and δ i denotes the censoring indicator, which equals 1 if the ith subject had disease progression, and 0 otherwise. The median follow-up time is 3.94 months and the range of the follow-up time is (0.1, 27.61) months. A summary of the data set was given in Chen, Ibrahim & Shao (2009).

13 Diagnostic Measures for the Cox Regression Model 13 We fit the Cox regression model (1) to the data, where v i = (x T i, zt i )T and β = (β 1,..., β 5 ) T with p 1 = 3 and p 2 = 2. We model two missing covariates z i conditional on x i as p(z i1 x i ; α)p(z i2 x i, z i1 ; α). We use a logistic regression model for z i1 and a normal linear regression model for z i2. Specifically, we have logit{p(z i1 x i ; α)} = z i1 (α k=1 α 1kx ik ) and z i2 N(α k=1 α 2kx ik + α 24 z i1, α 25 ), where α 1 = (α 10, α 11, α 12, α 13 ) T and α 2 = (α 20,..., α 25 ) T. We consider both missing-at-random and missing-not-at-random models for r i. Under the missing-not-at-random model, we take p(r i v i, y i, δ i ; ξ) = p(r i1 v i, y i, δ i ; ξ 1 )p(r i2 r i1, v i, y i, δ i ; ξ 2 ), where ξ = (ξ1 T, ξt 2 )T. Moreover, logistic regression models are specified for p(r i1 v i, y i, δ i ; ξ 1 ) and p(r i2 r i1, v i, y i, δ i ; ξ 2 ), where ξ 1 and ξ 2 are the vectors of the corresponding regression coefficients. However, under the missing-at-random model, p(r i v i, y i, δ i ; ξ) = p(r i x i, y i, δ i ; ξ) and a logistic regression model is specified for p(r i x i, y i, δ i ; ξ). As a comparison, we also consider the complete-case analysis. Table 3 shows the maximum partial likelihood estimate of β for the complete-case analysis and the maximum likelihood estimates of β under both missing-at-random and missing-notat-random models for the missing data mechanism. We can see some differences between the estimates in Table 3. In the complete-case analysis, the p-value for β 1 is while that for β 4 is Thus, in the complete-case analysis, treatment is not significant, but Apex is significant at the 0.05 significance level. However, the p-values are and for β 1 and and for β 4 under the missing-at-random and missing-not-at-random models, respectively, implying that treatment and Apex may be significantly associated with time to disease progression. Thus, continuous therapy followed by second-line therapy may be more beneficial than defined duration of therapy with respect to time to progression based on the analyses incorporating all of the cases. Also, the standard errors from the analysis incorporating all of the cases are consistently smaller than those from the complete-case analysis for all of the β j s. In addition, both sets of the maximum likelihood estimates of β are very similar. Under the missing-not-at-random model, the p-values for the coefficients associated with z i in p(r i v i, y i, δ i ; ξ) are greater than 0.34, which may suggest that there is no evidence against the missing-at-random assumption. We also calculated the test statistics CM 1 and CM 2 to be and 0.637, respectively. By setting B = 1, 000, we approximated the p-values of CM 1 and CM 2 as and 0.127, respectively. These results may also indicate neither that E{R i (t) x i } 0 or E{R i (t) x i, z i } 0 nor that p(r i x i, z i, y i, δ i ; ξ) depends on (y i, δ i ). Figure 1 shows the detection probabilities of some selected diagnostic measures under the missing-at-random and missing-not-at-random models. Additional results can be found in the Supplementary Material. The purpose of the plots of the detection probabilities corresponding to QD i,1, QD i,2, QD i,3, and QD i,h0 ( ) is to identify influential observations due to the specifications of the regression component of the Cox model, the covariate model, the logistic regression models for the missing data binary indicators, and the baseline hazard component of the Cox model, respectively. In addition, the plots corresponding to R i are used to determine the appropriateness of the entire Cox model while the plots corresponding to ŝ i3 and ŝ i5 are used to check the proportional hazard assumptions for age and FACT-G score, respectively. For the 111 complete cases, both the missing-at-random and missing-not-at-random models detected the same 13 outlying cases with maximum detection probabilities greater than For the 119 subjects who had at least one missing values in Apex or quality of life score, the same 12 subjects had maximum detection probabilities greater than 0.95 under both the missing-at-random and missing-not-atrandom models, 6 subjects had maximum detection probabilities greater than 0.95 only under the missing-at-random model, and 4 subjects had maximum detection probabilities greater than 0.95 only under the missing-not-at-random model. For the 6 missing-at-random outlying cases,

14 14 H. ZHU, J.IBRAHIM, AND M.CHEN Table 3. Maximum likelihood estimates of β based on complete-case, missing-atrandom, and missing-not-at-random analyses of the real data. For each β k, efficiency denotes the ratio of standard error of ˆβ k for the complete-case analysis over that for the missing-at-random (or missing-not-at-random) analysis. Model Parameter Estimate SE Z-statistic p-value 95% CI Efficiency complete β ( 0.02, 0.97) 1.00 case β ( 0.41, 0.55) 1.00 β ( 0.28, 0.24) 1.00 β ( 0.07, 1.68) 1.00 β ( 0.37, 0.10) 1.00 missing β ( 0.13, 0.82) 1.45 at β ( 0.18, 0.53) 1.35 random β ( 0.20, 0.16) 1.44 β ( 0.17, 1.66) 1.08 β ( 0.26, 0.16) 1.12 missing β ( 0.14, 0.82) 1.45 not at β ( 0.18, 0.53) 1.35 random β ( 0.20, 0.16) 1.44 β ( 0.18, 1.67) 1.08 β ( 0.26, 0.16) 1.12 the maximum detection probabilities range from to under the missing-not-at-random model and from to under the missing-at-random model. For the 4 missing-not-atrandom outlying cases, the maximum detection probabilities are 0.80, 0.58, 0.825, and under the missing-at-random model and 0.96, 0.958, 0.984, and under the missing-not-atrandom model. The disagreements between the missing-at-random and missing-not-at-random models for those four cases were in the values and the detection probabilities of QD i,2. Overall, the detection probabilities under the missing-at-random model are very close to those under the missing-not-at-random model. The outlying case with the largest maximum detection probabilities is the subject whose quality of life score is 34, which is the smallest value among all subjects with the mean quality of life score of In this case, the values of s i5 are 5.77, 4.84, and 4.82 and the corresponding detection probabilities all are 1.0 under the complete-case analysis, and the missing-at-random and missing-not-at-random models. SUPPLEMENTARY MATERIAL Supplementary material available at Biometrika online includes details of the semi-bootstrap method, proofs and additional simulations and real data analysis results. ACKNOWLEDGEMENT We thank the Editor, the Associate Editor, and two anonymous referees for valuable suggestions, which greatly helped to improve our presentation. We would like to thank Dr. Dongling Zeng for helpful discussions. The research was supported by US National Institute of Health and National Science Foundation Grants.

15 Diagnostic Measures for the Cox Regression Model 15 (a) QD i,1 (b) QD i,1 Time (y) in Months Detection Probability Time (y) in Months Detection Probability (c) QD i,2 (d) QD i,2 Time (y) in Months Detection Probability Time (y) in Months Detection Probability (e) R i (f) R i Time (y) in Months Detection Probability Time (y) in Months Detection Probability (g) ŝ i5 (h) ŝ i5 Time (y) in Months Detection Probability Time (y) in Months Detection Probability Fig. 1. Plots of detection probabilities of QD i,1, QD i,2, conditional martingale residuals R i, and score residuals ŝ i5, respectively, for the missing-at-random (panels (a), (c), (e), and (g)) and missing-not-at-random (panels (b), (d), (f), and (h)) analyses of the real data. The points represent the detection probabilities for progression subjects and the points represent those for censored subjects.

16 16 H. ZHU, J.IBRAHIM, AND M.CHEN REFERENCES ANDREWS, D. W. K. (1994). Empirical process methods in econometrics. In Handbook of Econometrics IV (Eds. R. F. Engle and D. L. McFadden). Amsterdam: Elsevier, pp BARLOW, W. E. (1997). Global measures of local influence for proportional hazards regression models. Biometrics 53, BARLOW, W. E. & PRENTICE, R. L. (1988). Residuals for relative risk regression. Biometrika 75, CHEN, H. Y. & LITTLE, R. J. (1999). Proportional hazards regression with missing covariate. J. Am. Statist. Assoc. 94, CHEN, M., IBRAHIM, J. G. & SHAO, Q. M. (2009). Maximum Likelihood Inference for the Cox Regression Model with Applications to Missing Covariates. J. Multivariate Anal. 100, CHEN, M. H., SHAO, Q. M. & IBRAHIM, J. G. (2000). Monte Carlo Methods in Bayesian Computation. New York: Springer-Verlag. COOK, R. D. (1986). Assessment of local influence (with discussion). J. R. Statist. Soc. B 48, COOK, R. D. & WEISBERG, S. (1982). Residuals and Influence in Regression. Boca Raton, FL: Chapman & Hall. COPAS, J. B. & EGUCHI, S. (2005). Local model uncertainty and incomplete-data bias (with discussion). J. R. Statist. Soc. B 67, COX, D. R. (1972). Regression models and life-tables (with discussion). J. R. Statist. Soc. B 34, COX, D. R. (1975). Partial likelihood. Biometrika 62, COX, D. R. & SNELL, E. J. (1968). A general definition of residuals (with discussion). J. R. Statist. Soc. B 30, DANIELS, M. J. & HOGAN, J. W. (2008). Missing Data in Longitudinal Studies: Strategies for Bayesian Modeling and Sensitivity Analysis. Boca Raton, FL: Chapman & Hall. ESCANCIANO, J. C. (2006). A consistent diagnostic test for regression models using projection. Econometric Theory 22, ESCOBAR, L. A. & MEEKER, W. Q. (1992). Assessing influence in regression analysis with censored data. Biometrics 48, HENDERSON, R. & OMAN, P. (1993). Inference in linear hazard models. Scand. J. Stat. 20, HERRING, A. & IBRAHIM, J. G. (2001). Likelihood-based methods for missing covariates in the Cox proportional hazards model. J. Am. Statist. Assoc. 96, IBRAHIM, J. G., LIPSITZ, S. R. & CHEN, M. (1999). Missing covariates in generalized linear models when the missing data mechanism is nonignorable. J. R. Statist. Soc. B 61, JANSEN, I., MOLENBERGHS, G., AERTS, M., THJIS, H. & VAN STEEN, K. (2003). A local influence approach to binary data from a psychiatric study. Biometrics 59, KLEIN, J. P. & MOESCHBERGER, M. L. (2003). Survival Analysis: Techniques for Censored and Truncated Data. New York: Springer-Verlag. KOSOROK, M. R. (2007). Introduction to Empirical Processes and Semiparametric Inference. New York: Springer-Verlag. LIN, D. Y., WEI, L. J. & YING, Z. L. (1993). Checking the Cox model with cumulative sums of martingale-based residuals. Biometrika 80,

17 Diagnostic Measures for the Cox Regression Model 17 LIPSITZ, S. R. & IBRAHIM, J. G. (1996). A conditional model for incomplete covariates in parametric regression models. Biometrika 83, MARTINUSSEN, T. & SCHEIKE, T. H. (2006). Dynamic Regression Models for Survival Data. New York: Springer-Verlag. MARZEC, L. & MARZEC, P. (1997). Generalized martingale-residual processes for goodnessof-fit inference in Cox s type regression models. Ann. Statist. 25, OSSIANDER, M. (1987). A central limit theorem under metric entropy with bracketing. Ann. Prob. 15, PETTITT, A. N. & DAUD, I. B. (1989). Case-weighted measures of influence for proportional hazards regression. Applied Statist. 38, POLLARD, D. (1990). Empirical Processes: Theory and Applications. NSF-CBMS Regional Conference Series in Probability and Statistics. Hayward, CA: Institute of Mathematical Statistics and Alexandria, VA: American Statistical Association. SCHEIKE, T., MARTINUSSEN, T. & SILVER, J. (2010). Estimating haplotype effects for survival data. Biometrics 66, SHI, X. Y., ZHU, H. T. & IBRAHIM, J. G. (2009). Local influence for generalized linear models with missing covariates. Biometrics 65, SOCINSKI, M. A., SCHELL, M. J., PETERMAN, A., BAKRI, K., YATES, S., GITTEN, R., UNGER, P., LEE, J., LEE, JI., TYNAN, M., MOORE, M. & KIES, M. (2002). Phase III Trial Comparing Defined Duration of Therapy Versus Continuous Therapy Followed by Second-Line Therapy in Advanced-Stage IIIB/IV Non-Small-Cell Lung Cancer. J. Clin. Oncol. 20, STORER, B. E. & CROWLEY, J. (1985). A diagnostic for Cox regression and general conditional likelihoods. J. Am. Statist. Assoc. 80, THERNEAU, T. M., GRAMBSCH, P. M. & FLEMING, T. R. (1990). Martingale-based residuals for survival models. Biometrika 77, TROXEL, A. B., MA, G. & HEITJAN, D. F. (2004). An index of local sensitivity to nonignorability. Statistica Sinica 14, VAN DER VAART, A. W. & WELLNER, J. A. (1996). Weak Convergence and Empirical Processes. New York: Springer-Verlag. VERBEKE, G., MOLENBERGHS, G., THIJS, H., LASAFFRE, E. & KENWARD, M. G. (2001). Sensitivity analysis for non-random dropout: a local influence approach. Biometrics 57, ZHU, H. T., LEE, S. Y., WEI, B. C. & ZHOU, J. (2001). Case-deletion measures for models with incomplete data. Biometrika 88, ZHU, H. T., IBRAHIM, J. G. & SHI, X. Y. (2009). Diagnostic measures for generalized linear models with missing covariates. Scand. J. Stat. 36,

1 Introduction. 2 Residuals in PH model

1 Introduction. 2 Residuals in PH model Supplementary Material for Diagnostic Plotting Methods for Proportional Hazards Models With Time-dependent Covariates or Time-varying Regression Coefficients BY QIQING YU, JUNYI DONG Department of Mathematical