Modelling Survival Events with Longitudinal Data Measured with Error

Hongsheng Dai, Jianxin Pan & Yanchun Bao

First version: 14 December 2009

Research Report No. 16, 2009, Probability and Statistics Group, School of Mathematics, The University of Manchester

Hongsheng Dai, University of Brighton, UK. E-mail: h.dai@brighton.ac.uk

Jianxin Pan, University of Manchester, UK

Yanchun Bao, Yunnan Normal University, China

Supported by grant Z26-1-651 from the Chun Hui Plan Scientific Research Project of the Ministry of Education of the People's Republic of China.

Abstract

In survival analysis, time-dependent covariates are usually available as longitudinal data collected periodically and with error. The longitudinal data can be assumed to follow a linear mixed effect model, and Cox regression models are usually used for survival events. The hazard rate of survival times depends on the underlying time-dependent covariate, which is described by random effects. Most methods proposed for such models impose a parametric distribution on the random effects and specify a normally distributed error term for the linear mixed effect model. These assumptions may not be valid in practice. This paper proposes a new likelihood method for Cox regression models with error-contaminated time-dependent covariates. The new method does not require any parametric distribution assumption on random effects and random

errors. Large sample properties of the estimator are derived, and simulation results demonstrate that the new method is sometimes more efficient than existing methods.

Key words: Censored data; Longitudinal measurements; Partial likelihood; Proportional hazard model.

1 Introduction

The Cox proportional hazard model (Cox, 1972) is widely used to study the relationship between survival events and time-dependent or time-independent covariates. It assumes that the failure time hazard rate function $\lambda(s)$ relates to the covariates through $\lambda(s) = \lambda_0(s)\exp(\gamma W(s))$. To implement this model, most existing methods require the time-dependent covariate process $W(s)$ to be fully observed. In practice, however, $W(s)$ is measured intermittently and with error. In other words, we only observe longitudinal measurements $\tilde W_j$ at some time points $t_j$, $j = 1, \ldots, m$, where $\tilde W_j = W(t_j) + \epsilon_j$ and $\epsilon_j$ is the error term. Substituting mis-measured values for the true covariates in Cox models leads to biased estimates (Prentice, 1982).

Recent studies focus on joint modelling of survival events and longitudinal measurements. The latent time-dependent process $W(s)$ is usually assumed to be $W(s) = \omega_0 + \omega_1 s$, which may be generalized to more complex polynomial models. Here $\omega_0$ and $\omega_1$ are random effects. Such assumptions can be used to study the effects of potentially mis-measured time-dependent covariates on the failure time. Many existing research

works (DeGruttola and Tu, 1994; Faucett and Thomas, 1996; Henderson et al., 2000; Wulfsohn and Tsiatis, 1997) assumed that the random effects and the random error follow Gaussian distributions, and used EM algorithms or Bayesian approaches to deal with the unobserved covariate process $W(s)$. However, the normal distribution assumption on random effects and random errors may not hold in practice. Misspecification of the distributions of random effects or random errors results in biased estimates (Tsiatis and Davidian, 2001; Song and Huang, 2005; Wang, 2006). Hu et al. (1998) and Song et al. (2002b) relaxed the normality assumption by assuming that the density of the underlying covariates belongs to a smooth class. These approaches involve intensive EM-algorithm computation.

Huang and Wang (2000) and Song and Huang (2005) proposed a corrected score approach which does not require any distribution assumption on the random effects or on the error term $\epsilon$. Their models assume that the underlying covariate $W$ is time-independent. In addition, the corrected score method requires replicated covariate observations for each subject. In many applications, however, the underlying covariates are time-dependent and replicated covariate observations for each subject may not be available.

Another interesting method was recently proposed by Wang (2006). Under the normal distribution assumption for the error terms, Wang's method does not put any distribution restriction on $W(s)$. This method is based on the assumption that the hazard rate depends on the time-independent random effects, not on the time-dependent underlying process $W(s)$. Another estimator, the conditional score estimator, was proposed by Tsiatis and Davidian (2001). When the hazard rate $\lambda(s)$ depends on the time-dependent underlying process and the error term is normally distributed, the conditional score (CS) estimator does not require any distribution assumption on $W(s)$.

In this paper, we assume that the failure time hazard rate depends on a time-dependent covariate process. Under the normal distribution assumption on $\epsilon$, a simple working-likelihood (SWL) estimator is proposed without any distribution assumption on $\omega_0$ and $\omega_1$. The simple working-likelihood estimator is proved to be consistent and asymptotically normal under some regular conditions. A consistent covariance estimator is also provided. We then relax the normal distribution assumption on $\epsilon$ and introduce a generalized working-likelihood (GWL) estimator for such cases. Consistency and the asymptotic distribution of the GWL estimator are also shown. Simulation studies demonstrate that when the error term $\epsilon$ follows a normal distribution, the SWL estimator is as efficient as the CS estimator. Numerical studies also show that the GWL estimator works very well and is more efficient than the SWL and CS estimators when $\epsilon$ and $(\omega_0, \omega_1)$ are both non-normally distributed.

This paper is organized as follows. Models and notations are given in Section 2. In Section 3 we propose the simple working-likelihood estimator. The generalized working-likelihood estimator is provided in Section 4. Numerical studies and discussions are given in Sections 5 and 6.

2 Models and notations

Let $Z_i(s)$ be the observed time-dependent covariate and $W_i(s)$ be the unobserved time-dependent process. Throughout this paper, for simplicity, we assume that $W_i(s)$ is a univariate process, though the proposed methods can be extended to a multivariate unobserved time-dependent process. We also assume that $W_i(s) = \sum_{j=0}^{q-1} \omega_{ij} s^j$. Here $\omega_i = (\omega_{i,0}, \ldots, \omega_{i,q-1})$ is the random effect for the $i$th subject. We cannot observe $W_i(s)$

but observe $m_i$ longitudinal measurements $\tilde W_i = \{\tilde W_{ij}, j = 1, \ldots, m_i\}$ as

$$\tilde W_{ij} = W_i(t_{ij}) + \epsilon_{ij}, \tag{1}$$

at ordered times $t_i = (t_{i1}, \ldots, t_{i,m_i})^T$. We assume that $\omega_i$ is independent of $t_i$. Let $\epsilon_i = (\epsilon_{i1}, \ldots, \epsilon_{i,m_i})^T$. Throughout this paper we do not put any distribution assumption on $\omega_i$.

Let $T_i$ be the survival time of the $i$th subject. In practice, we may not observe $T_i$ for all subjects. Instead, we only observe $\tilde T_i = \min(T_i, C_i)$ and $\delta_i = I(T_i \le C_i)$, where the censoring variable $C_i$ is independent of $T_i$. There is additional censoring at $\tau$, which is the end time of the experiment. We assume that $T_i$ and $C_i$ are independent of $t_i$ and $\epsilon_i$.

The Cox proportional hazard model assumes that the hazard rate is a function of the covariates of the form

$$\lambda_i(s) = \lambda_0(s)\exp(\gamma W_i(s) + \beta^T Z_i(s)),$$

where $\lambda_0(s)$ is an arbitrary baseline hazard function. Define the counting processes $dN_i(s) = I[s \le \tilde T_i < s + ds, \delta_i = 1, t_{iq} \le s]$ and the at-risk processes $Y_i(s) = I[\tilde T_i \ge s, t_{iq} \le s]$. We have

$$E(dN_i(s) \mid \mathcal{F}_{i,s}) = \lambda_0(s)\,ds\,\exp(\gamma W_i(s) + \beta^T Z_i(s))\,Y_i(s), \tag{2}$$

where $\mathcal{F}_{i,s}$ is the filtration generated from the $\sigma$-fields $\sigma\{\tilde T_i \le u, \delta_i, \omega_i, Z_i(u), u \le s\}$ and $\sigma\{t_{iq} \le u, u \le \tau\}$. The log-partial likelihood function for the parameter $\theta = (\gamma, \beta^T)^T$ is

$$l(\theta) = n^{-1}\sum_i \int_0^\tau \left[\gamma W_i(s) + \beta^T Z_i(s) - \log E^{(0)}(W, \theta, s)\right] dN_i(s), \tag{3}$$

where

$$E^{(0)}(W, \theta, s) = n^{-1}\sum_i E_i^{(0)}(W_i, \theta, s) := n^{-1}\sum_i \exp[\gamma W_i(s) + \beta^T Z_i(s)]\,Y_i(s).$$

If

$W_i(s)$ is fully observed, then we can maximize $l(\theta)$ by solving the following equations:

$$U^{(1)}(\theta, \tau) := n^{-1}\sum_i \int_0^\tau \left[\begin{pmatrix} W_i(s) \\ Z_i(s) \end{pmatrix} - \frac{E^{(1)}(W, \theta, s)}{E^{(0)}(W, \theta, s)}\right] dN_i(s) = 0, \tag{4}$$

where $E^{(1)}(W, \theta, s) = \partial E^{(0)}(W, \theta, s)/\partial(\gamma, \beta^T)^T$ and

$$E^{(1)}(W, \theta, s) = n^{-1}\sum_i E_i^{(1)}(W_i, \theta, s) := n^{-1}\sum_i \left(W_i(s), Z_i(s)^T\right)^T E_i^{(0)}(W_i, \theta, s).$$

Using martingale theory and under some regular conditions, it is straightforward to show that the maximum likelihood estimate based on (3) is consistent.

When $W_i(s)$ is measured periodically and with error, the score functions in (4) are not available. A naive approach to this problem is to estimate $W_i(s)$ by least squares and then to use these estimates as the covariates. Another method is the regression calibration estimator, which replaces $W_i(s)$ with its conditional expectation given the longitudinal measurements. These approaches, however, introduce a severe bias into the estimate of $\gamma$. Detailed discussions and comparisons can be found in Tsiatis and Davidian (2001) and Wang (2006). Based on a sufficient statistic for $W_i(s)$, Tsiatis and Davidian (2001) proposed a conditional score estimator that makes no distribution assumption on the random effect $\omega_i$. The conditional score estimator is more efficient than the naive approach and regression calibration.

3 A simple working likelihood estimator

Throughout this section, we assume that $\epsilon_i$ in (1) has the normal distribution $N(0, \sigma^2 I_{m_i})$. We assume that $\omega_i$, $i = 1, \ldots, n$, are i.i.d. with an unknown common distribution.

3.1 Unbiased working log-likelihood function

Let $\hat W_i(s)$ be the ordinary least squares estimate (LSE) of $W_i(s)$ using all the longitudinal observations $\tilde W_i$. This requires at least $q$ longitudinal measurements on subject $i$. Let $\mathbf{s} = (1, s, \ldots, s^{q-1})^T$ and

$$A_i = \begin{pmatrix} 1 & t_{i,1} & \cdots & t_{i,1}^{q-1} \\ \vdots & \vdots & & \vdots \\ 1 & t_{i,m_i} & \cdots & t_{i,m_i}^{q-1} \end{pmatrix}.$$

Then we have $\mathrm{Var}(\hat W_i(s) \mid W_i(s)) = \sigma^2 v_i(s)$, where $v_i(s) := \mathbf{s}^T (A_i^T A_i)^{-1} \mathbf{s}$. A consistent estimator (Tsiatis and Davidian, 2001) for $\sigma^2$ is

$$\hat\sigma^2 = \frac{\sum_i I[m_i > q]\, R_i}{\sum_i I[m_i > q]\,(m_i - q)},$$

where $R_i$ is the residual sum of squares for subject $i$ from the least squares fit to all the $m_i$ observations. Let $\hat E^{(0)}(\theta, \sigma^2, s) = n^{-1}\sum_i \hat E_i^{(0)}(\theta, \sigma^2, s)$ and

$$\hat E_i^{(0)}(\theta, \sigma^2, s) := \exp\left[\gamma \hat W_i(s) - \frac{\gamma^2\sigma^2 v_i(s)}{2} + \beta^T Z_i(s)\right] Y_i(s).$$

Given $W_i(s)$, the LSE $\hat W_i(s)$ is normally distributed, so the random variable $\exp[\gamma \hat W_i(s)]$ has a log-normal distribution. Using the well-known expression for the expectation of a log-normal distribution, we have

$$E\{\exp[\gamma \hat W_i(s)] \mid W_i(s)\} = \exp[\gamma W_i(s) + \gamma^2\sigma^2 v_i(s)/2],$$

where $E$ denotes expectation. Therefore

$$E[\hat E_i^{(0)}(\theta, \sigma^2, s) \mid W_i(s)] = E_i^{(0)}(W_i, \theta, s), \tag{5}$$

which implies $E\hat E_i^{(0)}(\theta, \sigma^2, s) = E E_i^{(0)}(W_i, \theta, s)$.

Under some regular conditions (see Appendix A, C.2) and according to (5) we have

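that, in probability,

$$\lim_{n\to\infty} \hat E^{(0)}(\theta, \sigma^2, s) = \lim_{n\to\infty} E^{(0)}(W, \theta, s) := e^{(0)}(\theta, s). \tag{6}$$

When $\sigma^2$ is known we consider the working likelihood function

$$\hat l_n(\theta, \sigma^2) = n^{-1}\sum_i \int_0^\tau \left[\gamma \hat W_i(s) + \beta^T Z_i(s) - \log \hat E^{(0)}(\theta, \sigma^2, s)\right] dN_i(s). \tag{7}$$

From (6) we know that $\hat l_n(\theta, \sigma^2)$ is a working likelihood function that is asymptotically unbiased for $l(\theta)$ given in (3).

3.2 The maximum likelihood estimator

To maximize $\hat l_n(\theta, \sigma^2)$ given in (7), we solve the following equations:

$$\begin{aligned} \hat U_\gamma^{(1)}(\theta, \sigma^2, \tau) &:= n^{-1}\sum_i \int_0^\tau \left[\hat W_i(s) - \frac{\hat E_\gamma^{(1)}(\theta, \sigma^2, s)}{\hat E^{(0)}(\theta, \sigma^2, s)}\right] dN_i(s) = 0, \\ \hat U_\beta^{(1)}(\theta, \sigma^2, \tau) &:= n^{-1}\sum_i \int_0^\tau \left[Z_i(s) - \frac{\hat E_\beta^{(1)}(\theta, \sigma^2, s)}{\hat E^{(0)}(\theta, \sigma^2, s)}\right] dN_i(s) = 0, \end{aligned} \tag{8}$$

where

$$\begin{aligned} \hat E_\gamma^{(1)}(\theta, \sigma^2, s) &= \frac{\partial \hat E^{(0)}(\theta, \sigma^2, s)}{\partial\gamma} = n^{-1}\sum_i \left[\hat W_i(s) - \sigma^2 v_i(s)\gamma\right] \hat E_i^{(0)}(\theta, \sigma^2, s), \\ \hat E_\beta^{(1)}(\theta, \sigma^2, s) &= \frac{\partial \hat E^{(0)}(\theta, \sigma^2, s)}{\partial\beta} = n^{-1}\sum_i Z_i(s)\, \hat E_i^{(0)}(\theta, \sigma^2, s). \end{aligned}$$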
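The ingredients of the working likelihood, namely $\hat W_i(s)$, $v_i(s)$, $\hat\sigma^2$ and the corrected term $\exp[\gamma\hat W_i(s) - \gamma^2\sigma^2 v_i(s)/2]$, can be illustrated with a minimal Python sketch. It assumes a straight-line trajectory ($q = 2$) and no covariate $Z_i$, so that $\mathbf{s}^T(A_i^TA_i)^{-1}\mathbf{s}$ has a simple closed form; all function names are hypothetical and chosen only for this illustration.

```python
import math

def fit_subject(times, values):
    """Least-squares straight line (q = 2) through one subject's measurements."""
    m = len(times)
    t_bar = sum(times) / m
    w_bar = sum(values) / m
    sxx = sum((t - t_bar) ** 2 for t in times)
    sxy = sum((t - t_bar) * (w - w_bar) for t, w in zip(times, values))
    slope = sxy / sxx
    # Residual sum of squares R_i of the fit.
    rss = sum((w - w_bar - slope * (t - t_bar)) ** 2
              for t, w in zip(times, values))
    return {"m": m, "t_bar": t_bar, "w_bar": w_bar,
            "sxx": sxx, "slope": slope, "rss": rss}

def w_hat(fit, s):
    """Fitted value of the latent process at time s."""
    return fit["w_bar"] + fit["slope"] * (s - fit["t_bar"])

def v(fit, s):
    """Closed form of s^T (A^T A)^{-1} s when the trajectory is a straight line."""
    return 1.0 / fit["m"] + (s - fit["t_bar"]) ** 2 / fit["sxx"]

def pooled_sigma2(fits, q=2):
    """Pooled residual variance over subjects with more than q measurements."""
    usable = [f for f in fits if f["m"] > q]
    return sum(f["rss"] for f in usable) / sum(f["m"] - q for f in usable)

def corrected_risk_term(fit, s, gamma, sigma2):
    """exp(gamma * W_hat - gamma^2 sigma^2 v / 2) for a subject at risk at s."""
    return math.exp(gamma * w_hat(fit, s)
                    - 0.5 * gamma ** 2 * sigma2 * v(fit, s))
```

For example, `fit_subject([0, 1, 2, 3], [1, 3, 5, 7])` fits noise-free data on the line $1 + 2s$, so `w_hat(fit, 4)` extrapolates to 9 and the residual sum of squares is zero.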

Let $\hat E^{(1)}(\theta, \sigma^2, s) = (\hat E_\gamma^{(1)}(\theta, \sigma^2, s), \hat E_\beta^{(1)}(\theta, \sigma^2, s)^T)^T$ and $\hat U^{(1)}(\theta, \sigma^2, \tau) = (\hat U_\gamma^{(1)}(\theta, \sigma^2, \tau), \hat U_\beta^{(1)}(\theta, \sigma^2, \tau)^T)^T$. Then we can write the estimating equations as

$$\hat U^{(1)}(\theta, \sigma^2, \tau) := n^{-1}\sum_i \int_0^\tau \left[\begin{pmatrix} \hat W_i(s) \\ Z_i(s) \end{pmatrix} - \frac{\hat E^{(1)}(\theta, \sigma^2, s)}{\hat E^{(0)}(\theta, \sigma^2, s)}\right] dN_i(s) = 0. \tag{9}$$

Since $\hat\sigma^2$ is a consistent estimator for $\sigma^2$ and (9) is an estimating equation asymptotically unbiased for (4), we conclude that (9) with $\sigma^2$ replaced by $\hat\sigma^2$ is still an asymptotically unbiased estimating equation.

Theorem 3.1. Let $\hat\theta$ be the solution of $\hat U^{(1)}(\theta, \hat\sigma^2, \tau) = 0$ and let $\theta_0$ be the true parameter. Under some regular conditions given in Appendix A, we have $\lim_{n\to\infty} \hat\theta = \theta_0$ in probability.

Proof. Under some regular conditions $l(\theta)$ is concave. The theorem then follows from the facts that $\hat l_n(\theta, \sigma^2)$ is asymptotically unbiased for $l(\theta)$ and that, under some mild regular conditions, $l(\theta)$ has a unique maximum at $\theta = \theta_0$.

Let $\bar N(s) = \sum_i N_i(s)/n$. A consistent estimator for $\lambda_0(s)ds$ is

$$\hat\lambda_0(s)\,ds = \frac{d\bar N(s)}{\hat E^{(0)}(\hat\theta, \hat\sigma^2, s)}.$$
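To make the estimating equation (9) concrete, the following Python sketch evaluates the $\gamma$-component of the score when there is no covariate $Z_i$, and locates a root by bisection. The data layout (one entry per observed event, holding the case's fitted value and the fitted values and $v_j(s)$ of everyone at risk) and the function names are assumptions made for this illustration; bisection is used only because it is simple, not because the paper prescribes it.

```python
import math

def swl_score(gamma, events, sigma2, n):
    """gamma-component of the working score (9) with no Z covariate.

    events: list of (w_case, risk_set), one entry per observed failure, where
    w_case is W_hat of the failing subject at its failure time and risk_set is
    a list of (W_hat_j, v_j) at that time for every subject still at risk
    (the failing subject included).
    """
    total = 0.0
    for w_case, risk_set in events:
        num = 0.0
        den = 0.0
        for w_j, v_j in risk_set:
            e_j = math.exp(gamma * w_j - 0.5 * gamma ** 2 * sigma2 * v_j)
            num += (w_j - sigma2 * v_j * gamma) * e_j
            den += e_j
        total += w_case - num / den
    return total / n

def bisect_root(f, lo, hi, tol=1e-10, max_iter=200):
    """Root of f on [lo, hi] by bisection; f must change sign on the bracket."""
    f_lo, f_hi = f(lo), f(hi)
    if f_lo == 0.0:
        return lo
    if f_hi == 0.0:
        return hi
    if f_lo * f_hi > 0:
        raise ValueError("no sign change on the bracket")
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if f_mid == 0.0 or hi - lo < tol:
            return mid
        if f_lo * f_mid < 0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return 0.5 * (lo + hi)
```

Because the score may have more than one root, the bracket passed to `bisect_root` matters; a bracket around a sensible starting value is one way to select the root of interest.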

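3.3 Covariance estimation

Let

$$\hat E^{(2)}(\theta, \sigma^2, s) := \frac{\partial \hat E^{(1)}(\theta, \sigma^2, s)}{\partial\theta} = n^{-1}\sum_i \begin{pmatrix} [\hat W_i(s) - \sigma^2 v_i(s)\gamma]^2 - \sigma^2 v_i(s) & [\hat W_i(s) - \sigma^2 v_i(s)\gamma]\, Z_i(s)^T \\ Z_i(s)\,[\hat W_i(s) - \sigma^2 v_i(s)\gamma] & Z_i(s) Z_i(s)^T \end{pmatrix} \hat E_i^{(0)}(\theta, \sigma^2, s)$$

and

$$\hat V(\theta, \sigma^2, s) := \frac{\hat E^{(2)}(\theta, \sigma^2, s)}{\hat E^{(0)}(\theta, \sigma^2, s)} - \left\{\frac{\hat E^{(1)}(\theta, \sigma^2, s)}{\hat E^{(0)}(\theta, \sigma^2, s)}\right\}^{\otimes 2}.$$

For any vector $a$, the notation $a^{\otimes 2}$ denotes the outer product $aa^T$. Under regular conditions C.2 and C.3 in Appendix A, $\hat V(\theta, \sigma^2, s)$ converges uniformly to $v(\theta, s)$ given in C.3 in Appendix A, and $n^{-1}\hat V(\theta, \hat\sigma^2, s)\sum_i dN_i(s)$ converges to $v(\theta, s)e^{(0)}(\theta, s)\lambda_0(s)ds$ in probability. The asymptotic normality of $\hat\theta$ is established by the following theorem.

Theorem 3.2. We have $n^{1/2}(\hat\theta - \theta_0) \to N(0, R)$, with

$$R = I(\theta, \tau)^{-1}\, \Sigma_U(\theta, \sigma^2, \tau)\, I(\theta, \tau)^{-1},$$

where $\Sigma_U(\theta, \sigma^2, \tau) = \lim_{n\to\infty} \mathrm{Var}[\sqrt{n}\,\hat U^{(1)}(\theta, \hat\sigma^2, \tau)]$ and $I(\theta, \tau) = \int_0^\tau v(\theta, s)e^{(0)}(\theta, s)\lambda_0(s)ds$.

A consistent estimator for $I(\theta, \tau)$ is

$$\hat I(\hat\theta, \hat\sigma^2, \tau) = \frac{1}{n}\sum_i \int_0^\tau \hat V(\hat\theta, \hat\sigma^2, s)\, dN_i(s).$$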
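The sandwich form $R = I^{-1}\Sigma_U I^{-1}$ in Theorem 3.2 is straightforward to evaluate once estimates of $I$ and $\Sigma_U$ are available. The sketch below assumes a two-dimensional parameter $\theta = (\gamma, \beta)$ with scalar $\beta$, and converts $R$ into standard errors via $\sqrt{R_{kk}/n}$; the helper names are hypothetical.

```python
import math

def inv2(a):
    """Inverse of a 2x2 matrix given as nested lists."""
    (p, q), (r, s) = a
    det = p * s - q * r
    if det == 0:
        raise ValueError("singular information matrix")
    return [[s / det, -q / det], [-r / det, p / det]]

def matmul2(a, b):
    """Product of two 2x2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def sandwich(info, sigma_u):
    """R = I^{-1} Sigma_U I^{-1}."""
    inv = inv2(info)
    return matmul2(matmul2(inv, sigma_u), inv)

def standard_errors(r, n):
    """Standard errors of theta_hat, since sqrt(n)(theta_hat - theta_0) ~ N(0, R)."""
    return [math.sqrt(r[k][k] / n) for k in range(2)]
```

When $\Sigma_U = I$, as for an ordinary partial likelihood, `sandwich(info, info)` collapses to the usual inverse information matrix.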

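Define

$$\begin{aligned} \hat\Phi_i(\theta, \hat\sigma^2, t) = {} & \int_0^t \left[\begin{pmatrix} \hat W_i(s) \\ Z_i(s) \end{pmatrix} - \frac{\hat E^{(1)}(\theta, \hat\sigma^2, s)}{\hat E^{(0)}(\theta, \hat\sigma^2, s)}\right] dN_i(s) \\ & - \int_0^t \left[\begin{pmatrix} \hat W_i(s) - \hat\sigma^2 v_i(s)\gamma \\ Z_i(s) \end{pmatrix} - \frac{\hat E^{(1)}(\theta, \hat\sigma^2, s)}{\hat E^{(0)}(\theta, \hat\sigma^2, s)}\right] \hat E_i^{(0)}(\theta, \hat\sigma^2, s)\,\hat\lambda_0(s)\,ds. \end{aligned} \tag{10}$$

A consistent estimator for $\Sigma_U(\theta, \sigma^2, t)$ is

$$\hat\Sigma_U(\hat\theta, \hat\sigma^2, t) = \frac{1}{n}\sum_{i=1}^n \hat\Phi_i(\hat\theta, \hat\sigma^2, t)^{\otimes 2} + \sum_{i=2}^n \hat\Phi_i(\hat\theta, \hat\sigma^2, t)\,\hat\Phi_1(\hat\theta, \hat\sigma^2, t)^T. \tag{11}$$

Thus a consistent estimator for $R$ is $\hat R = \hat I(\hat\theta, \hat\sigma^2, \tau)^{-1}\,\hat\Sigma_U(\hat\theta, \hat\sigma^2, \tau)\,\hat I(\hat\theta, \hat\sigma^2, \tau)^{-1}$.

4 A general working likelihood estimator

Throughout this section, we relax the normal distribution assumption for $\epsilon_i$ and only assume that $\epsilon_{ij}$, $j = 1, \ldots, m_i$, $i = 1, \ldots, n$, are i.i.d. with mean 0 and finite variance. We also assume that, given $\omega_i$, the pairs $(\tilde W_{ij}, t_{ij})$, $j = 1, \ldots, m_i$, are i.i.d.

4.1 Unbiased log-likelihood function and unbiased estimating equation

Let $\xi_i(s) = \hat W_i(s) - W_i(s)$. We have

$$\xi_i(s) = \mathbf{s}^T (A_i^T A_i)^{-1} A_i^T \epsilon_i. \tag{12}$$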
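Identity (12) says that the least squares error at time $s$ is a fixed linear combination of the measurement errors, whatever their distribution. A small Python sketch, again assuming a straight-line trajectory ($q = 2$) so that the weights $\mathbf{s}^T(A_i^TA_i)^{-1}A_i^T$ have a closed form, makes this easy to check numerically; the function names are hypothetical.

```python
def lse_weights(times, s):
    """Row vector s^T (A^T A)^{-1} A^T for a straight-line fit (q = 2)."""
    m = len(times)
    t_bar = sum(times) / m
    sxx = sum((t - t_bar) ** 2 for t in times)
    return [1.0 / m + (s - t_bar) * (t - t_bar) / sxx for t in times]

def lse_prediction(times, values, s):
    """Least-squares fitted value W_hat(s): the weights applied to the data."""
    return sum(h * w for h, w in zip(lse_weights(times, s), values))

def lse_error(times, noise, s):
    """xi(s): the same weights applied to the measurement errors alone."""
    return sum(h * e for h, e in zip(lse_weights(times, s), noise))
```

For any noise vector, `lse_prediction(times, values, s)` minus the true $W_i(s)$ equals `lse_error(times, noise, s)` up to rounding, and the sum of squared weights equals $v_i(s)$. No normality is needed for either fact, which is what the generalized method relies on.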

Note that $\xi_i(s)$, $i = 1, \ldots, n$, may not have the same distribution, since the number of longitudinal measurements $m_i$ may differ between subjects. Suppose each subject has $M$ extra longitudinal observations, i.e. $(\tilde W_{ij,1}, t_{ij,1})$, $j = 1, \ldots, M$, which are i.i.d. pairs and are also independent of $(\tilde W_{ij}, t_{ij})$, $j = 1, \ldots, m_i$. The least squares estimate based on the extra longitudinal observations is denoted by $\hat W_{i,1}(s)$. Let $\xi_{i,1}(s) = \hat W_{i,1}(s) - W_i(s)$. Since $\xi_{i,1}(s)$ has an expression similar to (12) and each subject has $M$ replicated longitudinal measurements, $\xi_{i,1}(s)$, $i = 1, \ldots, n$, are i.i.d. random variables. Let

$$\varphi^{(k)}(\gamma, s) = E[\xi_{i,1}(s)^k \exp(\gamma\xi_{i,1}(s))], \quad k = 0, 1, 2,$$

and

$$\breve E_i^{(0)}(\theta, s) = E_i^{(0)}(\hat W_{i,1}, \theta, s).$$

Note that $\breve E_i^{(0)}(\theta, s)$ is $E_i^{(0)}(W_i, \theta, s)$ with $W_i$ replaced by $\hat W_{i,1}$, the least squares estimate based on the $M$ extra longitudinal observations. Since $\xi_i(s) - \xi_{i,1}(s) = \hat W_i(s) - \hat W_{i,1}(s)$, we have

$$\begin{aligned} \frac{\breve E_i^{(0)}(\theta, s)}{\varphi^{(0)}(\gamma, s)} &= \frac{E_i^{(0)}(\hat W_i, \theta, s)\exp(-\gamma\xi_i(s) + \gamma\xi_{i,1}(s))}{\varphi^{(0)}(\gamma, s)} \\ &= \frac{E_i^{(0)}(\hat W_i, \theta, s)}{\exp(\gamma\xi_i(s))} \cdot \frac{\exp[\gamma\xi_{i,1}(s)]}{\varphi^{(0)}(\gamma, s)} = E_i^{(0)}(W_i, \theta, s)\,\frac{\exp[\gamma\xi_{i,1}(s)]}{\varphi^{(0)}(\gamma, s)}. \end{aligned}$$

Therefore $E[\breve E_i^{(0)}(\theta, s)/\varphi^{(0)}(\gamma, s) \mid W_i(s)] = E_i^{(0)}(W_i, \theta, s)$. If we let $\breve E^{(0)}(\theta, s) = n^{-1}\sum_i \breve E_i^{(0)}(\theta, s)$, then we have

$$\lim_{n\to\infty} \breve E^{(0)}(\theta, s)/\varphi^{(0)}(\gamma, s) = \lim_{n\to\infty} E^{(0)}(W, \theta, s) = e^{(0)}(\theta, s).$$

As in Section 3, we obtain an unbiased log-likelihood function

$$\breve l_n(\theta) = n^{-1}\sum_i \int_0^\tau \left[\gamma \hat W_i(s) + \beta^T Z_i(s) - \log\frac{\breve E^{(0)}(\theta, s)}{\varphi^{(0)}(\gamma, s)}\right] dN_i(s). \tag{13}$$

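Let

$$\Psi = \left\{\psi_1(\gamma, s) = \frac{\varphi^{(1)}(\gamma, s)}{\varphi^{(0)}(\gamma, s)},\ \psi_2(\gamma, s) = \frac{\varphi^{(2)}(\gamma, s)}{\varphi^{(0)}(\gamma, s)}\right\}.$$

Then the unbiased estimating equations are

$$\breve U^{(1)}(\theta, \Psi, \tau) := n^{-1}\sum_i \int_0^\tau \left[\begin{pmatrix} \hat W_i(s) \\ Z_i(s) \end{pmatrix} - \frac{\breve E^{(1)}(\theta, \Psi, s)}{\breve E^{(0)}(\theta, s)}\right] dN_i(s) = 0, \tag{14}$$

where

$$\breve E^{(1)}(\theta, \Psi, s) := n^{-1}\sum_i \begin{pmatrix} \hat W_{i,1}(s) - \psi_1(\gamma, s) \\ Z_i(s) \end{pmatrix} \breve E_i^{(0)}(\theta, s),$$

so that $\breve E^{(1)}(\theta, \Psi, s)/\breve E^{(0)}(\theta, s)$ is the derivative of $\log\{\breve E^{(0)}(\theta, s)/\varphi^{(0)}(\gamma, s)\}$ with respect to $\theta$. Differentiating once more in the same way, with the common factor $1/\varphi^{(0)}(\gamma, s)$ cancelling in every ratio, gives

$$\breve E^{(2)}(\theta, \Psi, s) = n^{-1}\sum_i \left[\begin{pmatrix} \hat W_{i,1}(s) - \psi_1(\gamma, s) \\ Z_i(s) \end{pmatrix}^{\otimes 2} - \begin{pmatrix} \psi_2(\gamma, s) - \psi_1(\gamma, s)^2 & 0 \\ 0 & 0 \end{pmatrix}\right] \breve E_i^{(0)}(\theta, s).$$

Thus minus the derivative of $\breve U^{(1)}(\theta, \Psi, \tau)$ with respect to $\theta$ is

$$\breve I(\theta, \Psi, \tau) = \frac{1}{n}\sum_i \int_0^\tau \breve V(\theta, \Psi, s)\, dN_i(s),$$

where

$$\breve V(\theta, \Psi, s) = \frac{\breve E^{(2)}(\theta, \Psi, s)}{\breve E^{(0)}(\theta, s)} - \left\{\frac{\breve E^{(1)}(\theta, \Psi, s)}{\breve E^{(0)}(\theta, s)}\right\}^{\otimes 2}.$$

Note that if we replace $\Psi = \{\psi_1(\gamma, s), \psi_2(\gamma, s)\}$ in (14) by a consistent estimator $\hat\Psi$, then we can compute the estimator by solving the score equations $\breve U^{(1)}(\theta, \hat\Psi, \tau) = 0$.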
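As a consistency check on these definitions, consider the special case, assumed here purely for illustration and not required by the method, in which $\xi_{i,1}(s)$ is normal with mean 0 and variance $\tau^2(s)$ (the symbol $\tau^2(s)$ is introduced only for this check). Differentiating the normal moment generating function gives:

```latex
\begin{align*}
\varphi^{(0)}(\gamma,s) &= E\,e^{\gamma\xi_{i,1}(s)} = \exp\{\gamma^2\tau^2(s)/2\},\\
\varphi^{(1)}(\gamma,s) &= \frac{\partial\varphi^{(0)}}{\partial\gamma}
   = \gamma\tau^2(s)\,\exp\{\gamma^2\tau^2(s)/2\},\\
\varphi^{(2)}(\gamma,s) &= \frac{\partial^2\varphi^{(0)}}{\partial\gamma^2}
   = \{\tau^2(s)+\gamma^2\tau^4(s)\}\,\exp\{\gamma^2\tau^2(s)/2\},\\
\psi_1(\gamma,s) &= \gamma\tau^2(s),\qquad
\psi_2(\gamma,s)-\psi_1(\gamma,s)^2 = \tau^2(s).
\end{align*}
```

With $\tau^2(s) = \sigma^2 v_i(s)$, these are exactly the correction terms $\sigma^2 v_i(s)\gamma$ and $\sigma^2 v_i(s)$ that appear in the simple working likelihood of Section 3, so in this special case the generalized equations have the same structure as the simple ones.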

4.2 Consistent estimator for Ψ

Suppose that, in addition to $(\tilde W_{ij,1}, t_{ij,1})$, $j = 1, \ldots, M$, there are another two replicated data sets $(\tilde W_{ij,r}, t_{ij,r})$, $j = 1, \ldots, M$, $r = 2, 3$. Based on these two replicated data sets we can find the LSEs $\hat W_{i,r}(s)$, $r = 2, 3$. Let $\xi_{i,r}(s) = \hat W_{i,r}(s) - W_i(s)$. For $r = 2, 3$, we can calculate the i.i.d. values $\xi_{i,1}(s) - \xi_{i,r}(s) = \hat W_{i,1}(s) - \hat W_{i,r}(s)$, $i = 1, \ldots, n$. We then have the following theorem.

Theorem 4.1. Let

$$\hat\psi_1(\gamma, s) := \frac{\sum_i (\xi_{i,1}(s) - \xi_{i,2}(s))\exp(\gamma\xi_{i,1}(s) - \gamma\xi_{i,3}(s))}{\sum_i \exp(\gamma\xi_{i,1}(s) - \gamma\xi_{i,3}(s))};$$

$$\hat\psi_2(\gamma, s) := \frac{\sum_i (\xi_{i,1}(s) - \xi_{i,2}(s))^2\exp(\gamma\xi_{i,1}(s) - \gamma\xi_{i,3}(s))}{\sum_i \exp(\gamma\xi_{i,1}(s) - \gamma\xi_{i,3}(s))} - \frac{\sum_i \sigma^2\,\mathbf{s}^T[A_{i,2}^T A_{i,2}]^{-1}\mathbf{s}\,\exp(\gamma\xi_{i,1}(s) - \gamma\xi_{i,3}(s))}{\sum_i \exp(\gamma\xi_{i,1}(s) - \gamma\xi_{i,3}(s))},$$

where $A_{i,2}$ is the regressor matrix for the second replicated data set $(\tilde W_{ij,2}, t_{ij,2})$, $j = 1, \ldots, M$, of subject $i$. We have $\hat\psi_1(\gamma, s) \to \psi_1(\gamma, s)$ and $\hat\psi_2(\gamma, s) \to \psi_2(\gamma, s)$ in probability as $n \to \infty$.

With $\hat\Psi = (\hat\psi_1(\gamma, s), \hat\psi_2(\gamma, s))$ as defined in the above theorem, the unbiased estimating equations become

$$\breve U^{(1)}(\theta, \hat\Psi, \tau) := n^{-1}\sum_i \int_0^\tau \left[\begin{pmatrix} \hat W_i(s) \\ Z_i(s) \end{pmatrix} - \frac{\breve E^{(1)}(\theta, \hat\Psi, s)}{\breve E^{(0)}(\theta, s)}\right] dN_i(s) = 0. \tag{15}$$

As in the proof of Theorem 3.1, we can show that the estimate $\breve\theta$ obtained by solving (15) is consistent.
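The estimators in Theorem 4.1 only need the observable differences between the three replicate fits. A minimal Python sketch, at one fixed time point $s$ and one value of $\gamma$, is given below; the argument names are hypothetical, and `sigma2` stands for whatever value of $\sigma^2$ is plugged in.

```python
import math

def estimate_psi(d12, d13, v2, sigma2, gamma):
    """Estimates (psi1_hat, psi2_hat) of Theorem 4.1 at one time point s.

    d12[i] = W_hat_{i,1}(s) - W_hat_{i,2}(s)   (equals xi_{i,1} - xi_{i,2})
    d13[i] = W_hat_{i,1}(s) - W_hat_{i,3}(s)   (equals xi_{i,1} - xi_{i,3})
    v2[i]  = s^T (A_{i,2}^T A_{i,2})^{-1} s for the second replicate set
    """
    weights = [math.exp(gamma * d) for d in d13]
    total = sum(weights)
    psi1 = sum(a * w for a, w in zip(d12, weights)) / total
    second_moment = sum(a * a * w for a, w in zip(d12, weights)) / total
    correction = sum(sigma2 * b * w for b, w in zip(v2, weights)) / total
    return psi1, second_moment - correction
```

At $\gamma = 0$ all weights equal one, so `psi1` is the plain average of the differences and `psi2` is their mean square minus the average of $\sigma^2\,\mathbf{s}^T[A_{i,2}^TA_{i,2}]^{-1}\mathbf{s}$.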

4.3 Covariance estimation

As in Section 3, we can show that $\sqrt{n}(\breve\theta - \theta_0)$ converges weakly to a normal distribution $N(0, \breve R)$. Let $\breve\lambda_0(s)ds = d\bar N(s)/\breve E^{(0)}(\breve\theta, s)$ and

$$\begin{aligned} \breve\Phi_i(\theta, \hat\Psi, t) = {} & \int_0^t \left[\begin{pmatrix} \hat W_i(s) \\ Z_i(s) \end{pmatrix} - \frac{\breve E^{(1)}(\theta, \hat\Psi, s)}{\breve E^{(0)}(\theta, s)}\right] dN_i(s) \\ & - \int_0^t \left[\begin{pmatrix} \hat W_i(s) - \hat\psi_1(\gamma, s) \\ Z_i(s) \end{pmatrix} - \frac{\breve E^{(1)}(\theta, \hat\Psi, s)}{\breve E^{(0)}(\theta, s)}\right] \breve E_i^{(0)}(\theta, s)\,\breve\lambda_0(s)\,ds. \end{aligned}$$

A consistent estimate for $\breve R$ is

$$\breve R = \breve I(\breve\theta, \hat\Psi, \tau)^{-1}\left[\frac{1}{n}\sum_{i=1}^n \breve\Phi_i(\breve\theta, \hat\Psi, \tau)^{\otimes 2} + \sum_{i=2}^n \breve\Phi_i(\breve\theta, \hat\Psi, \tau)\,\breve\Phi_1(\breve\theta, \hat\Psi, \tau)^T\right]\breve I(\breve\theta, \hat\Psi, \tau)^{-1}.$$

4.4 Constructing the three groups of replicated longitudinal measurements

The advantage of the above method is that it makes no distribution assumption on the random effects or the random errors. Its drawback is that, for each subject, it needs three extra longitudinal data sets, $\{(\tilde W_{ij,r}, t_{ij,r}), j = 1, \ldots, M\}$, $r = 1, 2, 3$. Although in practice the replicated longitudinal observations $(\tilde W_{ij,r}, t_{ij,r})$, $j = 1, \ldots, M$, $r = 1, 2, 3$, are not available directly, we can obtain replicated data sets in the following way. We may choose $M = q + 2$ or $M = q + 3$ and select $3M$ longitudinal measurements from each individual, provided the individual has at least $3M$ longitudinal observations. The $3M$ longitudinal measurements will be partitioned randomly into three groups as

$(\tilde W_{ij,r}, t_{ij,r})$, $j = 1, \ldots, M$, $r = 1, 2, 3$, which can be viewed as replicated longitudinal observations. These measurements are used to calculate the consistent estimator $\hat\Psi$. We then keep the first group of longitudinal measurements $(\tilde W_{ij,1}, t_{ij,1})$, $j = 1, \ldots, M$, unchanged, and the rest of the longitudinal observations for each subject are denoted by $(\tilde W_{ij}, t_{ij})$, $j = 1, \ldots, m_i$.

The larger the value of $M$, the smaller the variance of $\xi_{i,r}(s)$. A larger value of $M$ therefore gives smaller variances for $\hat\Psi$ and for the estimating equation (15), and we expect it to lead to better estimates of $\theta$. On the other hand, with the above method for constructing the three replicated data sets, subjects with fewer than $3M$ longitudinal observations have to be ignored when estimating $\hat\Psi$. If we choose a very large value of $M$, then too many subjects will be excluded from the estimation of $\hat\Psi$. This may result in a larger variance of $\hat\Psi$ and hence a poor estimate of $\theta$. In summary, we need to choose $M$ as large as possible, subject to most subjects having at least $3M$ longitudinal measurements. The effects of different values of $M$ on the parameter estimates are discussed in the following section through simulation studies.

5 Simulation studies and data analysis

5.1 Simulation studies

We consider simulation scenarios in Tsiatis and Davidian (2001) where, for simplicity, there is a single time-dependent covariate $W_i(s)$ and no time-independent covariates are involved in the proportional hazard model. We choose a modified version of the

scenarios in Tsiatis and Davidian (2001). We assume $W_i(s) = \omega_{i0} + \omega_{i1}s$. Two different distributions for $(\omega_{i0}, \omega_{i1})$ are considered:

(i) $(\omega_{i0}, \omega_{i1})$ is from a bivariate normal distribution with mean (3.173, .13) and covariance matrix $D$ with elements $D = (D_{11}, D_{12}, D_{22})$ = (1.24, .39, .3);

(ii) $(\omega_{i0}, \omega_{i1})$ follows a mixture of bivariate normal distributions, with mixing proportion .5 and mixture components $N(\mu_k, D_k)$, $k = 1, 2$, where $\mu_1 = (6.173, .13)^T$, $\mu_2 = (2.173, .13)^T$ and $D_1 = D_2 = D$.

The maximum number of longitudinal observations for each subject is 24, and the nominal times of observation for $W_i(s)$ are $t_i = \{8 + 3j, j = 0, \ldots, 23\}$. Survival times are generated from the model $\lambda_i(s) = \exp(\gamma W_i(s))$ with $\gamma$ = 1. The censoring distribution is exponential with mean 15, with additional censoring at 8. We also consider two scenarios for the distribution of the error terms:

(a) a normally distributed error with distribution N(0, .5);

(b) an error term with a mixture normal distribution, .7N(.7, .1) + .3N(1.633, .1).

In each scenario the sample size is chosen to be n = 2, and 5 Monte Carlo data sets were generated. The parameter $\gamma$ was estimated using four different methods:

(1) the ideal estimator, obtained by fitting the partial likelihood with the true values of $W_i(s)$;

(2) the conditional score (CS) estimator;

(3) the simple working likelihood (SWL) method;

(4) the generalized working likelihood (GWL) method with $M$ = 4, 5.

Other methods, such as naive regression or the method of last value carried forward, are not considered in the simulation studies since they are much less efficient than the conditional score estimator (Tsiatis and Davidian, 2001; Huang and Wang, 2000).

Table 1 is about here.
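Survival times under the hazard $\lambda_i(s) = \exp(\gamma W_i(s))$ with a straight-line $W_i(s) = \omega_{i0} + \omega_{i1}s$ can be drawn by inverting the cumulative hazard at a unit exponential variate. The sketch below illustrates this standard inversion; it is not taken from the paper, the function names are hypothetical, and any parameter values passed to it are the reader's own choice rather than the settings of the simulation study.

```python
import math

def cumulative_hazard(t, omega0, omega1, gamma):
    """Integral of exp(gamma * (omega0 + omega1 * u)) over u in [0, t]."""
    rate = gamma * omega1
    if rate == 0.0:
        return math.exp(gamma * omega0) * t
    return math.exp(gamma * omega0) * math.expm1(rate * t) / rate

def survival_time(e, omega0, omega1, gamma):
    """Solve cumulative_hazard(T) = e, where e is a unit exponential draw.

    Returns math.inf when the total cumulative hazard never reaches e,
    which can happen when gamma * omega1 is negative.
    """
    rate = gamma * omega1
    scaled = e * math.exp(-gamma * omega0)
    if rate == 0.0:
        return scaled
    if rate * scaled <= -1.0:
        return math.inf
    return math.log1p(rate * scaled) / rate

def observe(t, c, tau):
    """Observed time and event indicator under censoring at c and at tau."""
    observed = min(t, c, tau)
    return observed, 1 if t <= min(c, tau) else 0
```

A full data set would combine `survival_time(random.expovariate(1.0), ...)` with a draw of the random effects, a censoring time, and noisy measurements at the observation times that precede the observed time.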

When the error term is normally distributed, Table 1 shows that the CS estimator and the SWL estimator perform as well as the ideal estimator with respect to bias. The GWL estimators with $M$ = 4, 5 also have very small bias. The GWL estimators have larger standard error estimates than the other two estimators, since the GWL method does not make the normality assumption for the random error term. The results for a non-normal error term are summarized in Table 2.

Table 2 is about here.

We can see that the bias of the CS and SWL estimates increases, since these estimators are valid only for normal random errors, whereas the GWL estimates change little and are very stable. We conclude that when the random errors are not normally distributed, the GWL estimators with $M$ = 4, 5 have much smaller bias than the SWL and CS estimators.

We also investigated other choices for $\sigma^2$ and found similar results. As the variance of $\epsilon_{ij}$ increases, the standard errors of all estimators increase. This is because a larger variance of $\epsilon_{ij}$ results in a larger variance of the estimating equations, which in turn leads to larger standard errors for our estimators.

Tsiatis and Davidian (2001) pointed out that the estimating equation of the conditional score method may have multiple roots. We also investigated the multiple-roots problem for the estimating equations of both proposed methods. Typical score plots are shown in Figure 1. We can see that all methods have a solution close to the truth. The generalized working likelihood estimating equations have a single root. The conditional score method and the simple working likelihood method have multiple solutions. In practice, as Tsiatis and Davidian (2001) suggested, we may choose the naive regression estimate as the starting value to locate the correct estimate.

Figure 1 is about here.

We also found that when using the generalized working likelihood method with $M$ = 4, the score function may have an outlying solution. In Figure 2 the solutions for $M$ = 4 are outliers, while the solution for $M$ = 5 is close to the truth. Such outliers result in poor estimates of standard errors and means. As in Song and Huang (2005), when all estimates exist and are not outliers, the general working likelihood estimates are stable and have small bias, regardless of the distributions of the random effects and error terms.

Figure 2 is about here.

5.2 Data analysis

To demonstrate the proposed methods we use the primary biliary cirrhosis (PBC) data set collected by the Mayo Clinic from 1974 to 1984. PBC is a chronic disease that can gradually destroy some of the bile ducts linking the liver to the gut. When PBC damages the bile ducts, bile can no longer flow through them. Instead it builds up in the liver, damaging the liver cells and causing inflammation and scarring. In the clinical trial, survival status and laboratory results such as serum bilirubin were recorded for 312 patients. Of these, 158 patients took the drug D-penicillamine and the other patients were in the control group. Serum bilirubin was measured at irregular time points and recorded until death or censoring. Among the 312 patients, the maximum number of repeated measurements is 16. More detailed discussion of the trial and the data set can be found in Ding and Wang (2008) and Fleming and Harrington (1991).

We here consider the biomarker serum bilirubin as the covariate process $W_i(s)$ and

treatment type as the time-independent covariate $Z_i$ in the Cox regression model $\lambda_i(s) = \lambda_0(s)\exp(\gamma W_i(s) + \beta Z_i)$, with the longitudinal model $W_i(s) = \omega_{i0} + \omega_{i1}s$. In our analysis we took the logarithm of the serum bilirubin values as the longitudinal measurements. The serum bilirubin values of six typical patients are plotted in Figure 3.

Figure 3 is about here.

We fit the model using the conditional score method, the simple working-likelihood method and the generalized working-likelihood method. The maximum number of longitudinal observations for each subject is 16, and there are 32 patients who have more than 12 measurements of serum bilirubin. We choose $M$ = 4 when using the generalized working-likelihood method. This means that $3M$ = 12 longitudinal measurements are selected and randomly partitioned into three groups as replicated observations to estimate $\Psi$.

The results are shown in Table 3. The three approaches provide similar results. All three methods suggest that the coefficient estimate $\hat\gamma$ for the latent process of serum bilirubin is non-significant. The coefficient estimate based on baseline serum bilirubin in Fleming and Harrington (1991) is .8, significantly different from 0. This suggests that baseline serum bilirubin is a risk factor for survival times but serum bilirubin at later times after treatment is not. Fleming and Harrington (1991) also studied the treatment effect of D-penicillamine on PBC and found it to be non-significant. From Table 3 we can see that all three methods agree that the coefficient estimate $\hat\beta$ for treatment $Z_i$ is not significantly different from 0.

Table 3 is about here.
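The random partition of $3M$ selected measurements into three replicate groups can be sketched as follows. The function name and data layout (a list of `(measurement, time)` pairs per subject) are assumptions made for illustration; subjects with fewer than $3M$ observations are returned as `None`, so the caller can leave them out of the estimation of $\Psi$.

```python
import random

def partition_replicates(observations, m_rep, rng):
    """Randomly pick 3 * m_rep observations and split them into three groups.

    observations: list of (measurement, time) pairs for one subject.
    Returns a list of three lists of m_rep pairs each (kept in the original
    order within a group), or None if the subject has too few observations.
    """
    if len(observations) < 3 * m_rep:
        return None
    # A random sample is in random order, so consecutive slices of it
    # form a random partition of the chosen indices.
    chosen = rng.sample(range(len(observations)), 3 * m_rep)
    groups = []
    for r in range(3):
        idx = sorted(chosen[r * m_rep:(r + 1) * m_rep])
        groups.append([observations[j] for j in idx])
    return groups
```

Passing an explicit `random.Random(seed)` instance as `rng` keeps the partition reproducible, which is convenient when the sensitivity of the estimates to the partition is of interest.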

6 Discussion

We have proposed new methods for joint modelling of survival events and error-contaminated time-dependent covariates. The estimators are easily computed and their large sample properties are established. We suggest using the generalized working likelihood method in practice, since it does not lose much efficiency when the random error is normally distributed and it gives the least bias when the error term is not normally distributed.

The general working likelihood method needs replicated observations. Even if replicated observations do not exist, we can still obtain replicates from the longitudinal observations using the method in Section 4.4. The $3M$ longitudinal measurements must be partitioned into three groups randomly for each individual. Note that we cannot partition the $3M$ measurements into three groups such that $t_{ij,1} < t_{ik,2} < t_{il,3}$ for $j, k, l = 1, \ldots, M$. With such a partition, $(\tilde W_{ij,r}, t_{ij,r})$, $j = 1, \ldots, M$, would not have the same distribution for different values of $r$, and the large sample properties of the estimator could not be guaranteed.

All the existing parametric or nonparametric correction methods assume that the observation times $t_{ij}$ are non-informative. In practice we may observe error-contaminated longitudinal measurements collected at informative observation times (Liang et al., 2009). For such problems the existing methods are not valid. This is left for future research.

A Regular conditions

C.1 The time $\tau$ is such that $\int_0^\tau \lambda_0(s)ds < \infty$.

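C.2 Let

$$E^{(k)}(W, \theta, s) = n^{-1}\sum_i E_i^{(k)}(W_i, \theta, s) := n^{-1}\sum_i \begin{pmatrix} W_i(s) \\ Z_i(s) \end{pmatrix}^{\otimes k} E_i^{(0)}(W_i, \theta, s), \quad k = 0, 1, 2.$$

There exists a neighborhood $\Theta$ of $\theta_0$ and, respectively, scalar, vector and matrix functions $e^{(0)}$, $e^{(1)}$ and $e^{(2)}$ defined on $\Theta \times [0, \tau]$ such that

$$\sup_{s\in[0,\tau],\,\theta\in\Theta} \left\| E^{(k)}(W, \theta, s) - e^{(k)}(\theta, s) \right\| \to 0$$

in probability as $n \to \infty$.

C.3 Let $v = e^{(2)}/e^{(0)} - [e^{(1)}/e^{(0)}]^{\otimes 2}$. Then for all $\theta \in \Theta$ and $0 \le s \le \tau$,

$$\frac{\partial}{\partial\theta} e^{(0)}(\theta, s) = e^{(1)}(\theta, s), \qquad \frac{\partial^2}{\partial\theta^2} e^{(0)}(\theta, s) = e^{(2)}(\theta, s).$$

C.4 The matrix $\Sigma(\theta_0, \tau) = \int_0^\tau v(\theta_0, s)e^{(0)}(\theta_0, s)\lambda_0(s)ds$ is positive definite.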

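B Proof of Theorem 3.2

Proof. If $\sqrt{n}\,\hat U^{(1)}(\theta, \hat\sigma^2, \tau)$ converges weakly to a normal distribution, then a first-order Taylor expansion of $\hat U^{(1)}(\theta, \hat\sigma^2, t)$ gives $\sqrt{n}(\hat\theta - \theta_0) \to N(0, R)$, where $R = I(\theta, \tau)^{-1}\,\Sigma_U(\theta, \sigma^2, \tau)\,I(\theta, \tau)^{-1}$ and $\Sigma_U(\theta, \sigma^2, \tau) = \lim_{n\to\infty} \mathrm{Var}[\sqrt{n}\,\hat U^{(1)}(\theta, \hat\sigma^2, \tau)]$. So we only need to show the asymptotic property of $\sqrt{n}\,\hat U^{(1)}(\theta, \hat\sigma^2, \tau)$. We can write

$$\begin{aligned} \sqrt{n}\,\hat U^{(1)}(\theta, \hat\sigma^2, t) = {} & n^{-1/2}\sum_i \int_0^t \left[\begin{pmatrix} \hat W_i(s) \\ Z_i(s) \end{pmatrix} - \frac{\hat E^{(1)}(\theta, \hat\sigma^2, s)}{\hat E^{(0)}(\theta, \hat\sigma^2, s)}\right] dN_i(s) \\ & - n^{-1/2}\sum_i \int_0^t \left[\begin{pmatrix} \hat W_i(s) - \hat\sigma^2 v_i(s)\gamma \\ Z_i(s) \end{pmatrix} - \frac{\hat E^{(1)}(\theta, \hat\sigma^2, s)}{\hat E^{(0)}(\theta, \hat\sigma^2, s)}\right] \hat E_i^{(0)}(\theta, \hat\sigma^2, s)\lambda_0(s)ds, \end{aligned}$$

which is equivalent to

$$\begin{aligned} \sqrt{n}\,\hat U^{(1)}(\theta, \hat\sigma^2, t) = {} & n^{-1/2}\sum_i \int_0^t \left[\begin{pmatrix} \hat W_i(s) \\ Z_i(s) \end{pmatrix} - \frac{e^{(1)}(\theta, s)}{e^{(0)}(\theta, s)}\right] dN_i(s) \\ & - n^{-1/2}\sum_i \int_0^t \left[\begin{pmatrix} \hat W_i(s) - \hat\sigma^2 v_i(s)\gamma \\ Z_i(s) \end{pmatrix} - \frac{e^{(1)}(\theta, s)}{e^{(0)}(\theta, s)}\right] \hat E_i^{(0)}(\theta, \hat\sigma^2, s)\lambda_0(s)ds \\ & + n^{-1/2}\sum_i \int_0^t \left(\frac{e^{(1)}(\theta, s)}{e^{(0)}(\theta, s)} - \frac{\hat E^{(1)}(\theta, \hat\sigma^2, s)}{\hat E^{(0)}(\theta, \hat\sigma^2, s)}\right)\left(dN_i(s) - \hat E_i^{(0)}(\theta, \hat\sigma^2, s)\lambda_0(s)ds\right) \\ := {} & I - II + III. \end{aligned}$$

We know that $III = o_p(1)$, since under regular conditions $E[dN_i(s) - \hat E_i^{(0)}(\theta, \sigma^2, s)\lambda_0(s)ds \mid W_i(s)] = 0$ and

$$\sup_{\theta, s}\left\| \frac{e^{(1)}(\theta, s)}{e^{(0)}(\theta, s)} - \frac{\hat E^{(1)}(\theta, s)}{\hat E^{(0)}(\theta, s)} \right\| = o_p(1).$$

Thus we can write

$$\sqrt{n}\,\hat U^{(1)}(\theta, \hat\sigma^2, t) = I - II + o_p(1) = n^{-1/2}\sum_i \Phi_i(\theta, \hat\sigma^2, t) + o_p(1),$$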

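where $\Phi_i(\theta, \sigma^2, t)$ is

$$\int_0^t \left[\begin{pmatrix} \hat W_i(s) \\ Z_i(s) \end{pmatrix} - \frac{e^{(1)}(\theta, s)}{e^{(0)}(\theta, s)}\right] dN_i(s) - \int_0^t \left[\begin{pmatrix} \hat W_i(s) - \sigma^2 v_i(s)\gamma \\ Z_i(s) \end{pmatrix} - \frac{e^{(1)}(\theta, s)}{e^{(0)}(\theta, s)}\right] \hat E_i^{(0)}(\theta, \sigma^2, s)\lambda_0(s)ds.$$

Since $\Phi_i(\theta, \sigma^2, t)$, $i = 1, \ldots, n$, are i.i.d. random variables, we know that $n^{-1/2}\sum_i \Phi_i(\theta, \sigma^2, t)$ and $n^{-1/2}\sum_i [\Phi_i(\theta, \hat\sigma^2, t) - \Phi_i(\theta, \sigma^2, t)]$ both converge weakly to a Gaussian process. Therefore $\sqrt{n}\,\hat U^{(1)}(\theta, \hat\sigma^2, t)$ converges weakly to a normal distribution with mean 0 and covariance matrix

$$\Sigma_U(\theta, \sigma^2, t) = \lim_{n\to\infty}\left[\frac{1}{n}\sum_{i=1}^n \Phi_i(\theta, \hat\sigma^2, t)^{\otimes 2} + \sum_{i=2}^n \Phi_i(\theta, \hat\sigma^2, t)\,\Phi_1(\theta, \hat\sigma^2, t)^T\right].$$

According to the definition of $\hat\Phi_i(\theta, \hat\sigma^2, t)$ given in (10), a consistent estimator for $\Sigma_U(\theta, \sigma^2, t)$ is

$$\hat\Sigma_U(\hat\theta, \hat\sigma^2, t) = \frac{1}{n}\sum_{i=1}^n \hat\Phi_i(\hat\theta, \hat\sigma^2, t)^{\otimes 2} + \sum_{i=2}^n \hat\Phi_i(\hat\theta, \hat\sigma^2, t)\,\hat\Phi_1(\hat\theta, \hat\sigma^2, t)^T.$$

Thus a consistent estimator for $R$ is $\hat R = \hat I(\hat\theta, \hat\sigma^2, \tau)^{-1}\,\hat\Sigma_U(\hat\theta, \hat\sigma^2, \tau)\,\hat I(\hat\theta, \hat\sigma^2, \tau)^{-1}$.

C Proof of Theorem 4.1

The theorem follows from the results

$$\frac{\varphi^{(1)}(\gamma, s)}{\varphi^{(0)}(\gamma, s)} = \frac{E[\xi_{i,1}(s)\exp(\gamma\xi_{i,1}(s))]}{E[\exp(\gamma\xi_{i,1}(s))]} = \frac{E[(\xi_{i,1}(s) - \xi_{i,2}(s))\exp(\gamma\xi_{i,1}(s) - \gamma\xi_{i,3}(s))]}{E[\exp(\gamma\xi_{i,1}(s) - \gamma\xi_{i,3}(s))]},$$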

$$\frac{\varphi^{(2)}(\gamma, s)}{\varphi^{(0)}(\gamma, s)} = \frac{E[\xi_{i,1}(s)^2\exp(\gamma\xi_{i,1}(s))]}{E[\exp(\gamma\xi_{i,1}(s))]} = \frac{E[\{(\xi_{i,1}(s) - \xi_{i,2}(s))^2 - \xi_{i,2}(s)^2\}\exp(\gamma\xi_{i,1}(s) - \gamma\xi_{i,3}(s))]}{E[\exp(\gamma\xi_{i,1}(s) - \gamma\xi_{i,3}(s))]}$$

and $E[\xi_{i,2}(s)^2] = \sigma^2 E\left[\mathbf{s}^T[A_{i,2}^T A_{i,2}]^{-1}\mathbf{s}\right]$.

References

Cox D. R. (1972). Regression models and life tables (with discussion). Journal of the Royal Statistical Society, B, 34:187-220.

Dafni U. G. and Tsiatis A. A. (1998). Evaluating surrogate markers of clinical outcome measured with error. Biometrics, 54:1445-1462.

DeGruttola V. and Tu X. M. (1994). Modeling progression of CD-4 lymphocyte count and its relationship to survival time. Biometrics, 50:1003-1014.

Faucett C. J. and Thomas D. C. (1996). Simultaneously modelling censored survival data and repeatedly measured covariates: A Gibbs sampling approach. Statistics in Medicine, 15:1663-1685.

Fleming T. R. and Harrington D. P. (1991). Counting Processes and Survival Analysis, John Wiley & Sons, Inc.

Henderson R., Diggle P. and Dobson A. (2000). Joint modelling of longitudinal measurements and event time data. Biostatistics, 1:465-480.

Hu P., Tsiatis A. A. and Davidian M. (1998). Estimating the parameters in the Cox model when covariate variables are measured with error. Biometrics, 54:1407-1419.

Huang Y. and Wang C. Y. (2000). Cox regression with accurate covariates unascertainable: A nonparametric-correction approach. Journal of the American Statistical Association, 95:1209-1219.

Liang Y., Lu W. and Ying Z. (2009). Joint modeling and analysis of longitudinal data with informative observation times. Biometrics, 65:377-384.

Nakamura T. (1990). Corrected score function for errors in variables models: methodology and application to generalized linear models. Biometrika, 77:127-137.

Prentice R. (1982). Covariate measurement errors and parameter estimation in a failure time regression model. Biometrika, 69:331-342.

Song X., Davidian M. and Tsiatis A. A. (2002a). An estimator for the proportional hazards model with multiple longitudinal covariates measured with error. Biostatistics, 3:511-528.

Song X., Davidian M. and Tsiatis A. A. (2002b). A semiparametric likelihood approach to joint modeling of longitudinal and time-to-event data. Biometrics, 58:742-753.

Ding J. and Wang J.-L. (2008). Modeling longitudinal data with nonparametric multiplicative random effects jointly with survival data. Biometrics, 64:546-556.

Tsiatis A. A. and Davidian M. (2001). A semiparametric estimator for the proportional hazards model with longitudinal covariates measured with error. Biometrika, 88:447-458.

Tsiatis A. A., DeGruttola V. and Wulfsohn M. S. (1995). Modeling the relationship of survival to longitudinal data measured with error: Applications to survival and CD4 counts in patients with AIDS. Journal of the American Statistical Association, 90:27-37.

Song X. and Huang Y. (2005). On corrected score approach for proportional hazards model with covariate measurement error. Biometrics, 61:702-714.

Wang C. Y., Hsu L., Feng Z. D. and Prentice R. L. (1997). Regression calibration in failure time regression. Biometrics, 53:131-145.

Wang C. Y. (2006). Corrected score estimator for joint modelling of longitudinal and failure time data. Statistica Sinica, 16:235-253.

Wulfsohn M. S. and Tsiatis A. A. (1997). A joint model for survival and longitudinal data measured with error. Biometrics, 53:330-339.

Table 1: Simulation results for two underlying random effect distributions, when the error term is N(0, .5). I, ideal; CS, conditional score estimator; SWL, simple working likelihood estimator; GWL, generalized working likelihood estimator with M = 4, 5; SD, Monte Carlo standard deviation; SE, average of estimated standard errors. n = 2.

                  Normal covariate            Mixture covariate
Method            Mean      SD      SE        Mean      SD      SE
I                 -.9978    .87     .86       -.991     .83     .78
CS                -.9888    .127    .118      -1.27     .129    .112
SWL               -.9971    .143    .14       -1.13     .14     .129
GWL, M = 4        -.9957    .249    .268      -1.553    .344    .382
GWL, M = 5        -.9852    .19     .193      -1.243    .217    .227
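The Mean, SD and SE columns of the simulation tables are standard Monte Carlo summaries. As a minimal sketch, assuming each simulated data set yields one point estimate of γ and one estimated standard error, they could be computed as below. The function name `summarize_replicates`, the number of replicates and the generated values are hypothetical and serve only to illustrate the arithmetic; they are not the authors' simulation code or results.

```python
import random
import statistics


def summarize_replicates(estimates, std_errors):
    """Return (mean, Monte Carlo SD, average SE) over simulation replicates."""
    if len(estimates) != len(std_errors) or len(estimates) < 2:
        raise ValueError("need at least two replicates with matching lengths")
    mean_est = statistics.fmean(estimates)   # "Mean" column
    mc_sd = statistics.stdev(estimates)      # "SD": sample SD of the point estimates
    avg_se = statistics.fmean(std_errors)    # "SE": mean of the per-replicate SEs
    return mean_est, mc_sd, avg_se


# Hypothetical replicates for illustration only (assumed values, not from the paper).
rng = random.Random(0)
true_gamma, true_sd = -1.0, 0.1
estimates = [rng.gauss(true_gamma, true_sd) for _ in range(500)]
std_errors = [true_sd * rng.uniform(0.9, 1.1) for _ in range(500)]

mean_est, mc_sd, avg_se = summarize_replicates(estimates, std_errors)
print(f"Mean={mean_est:.4f}  SD={mc_sd:.4f}  SE={avg_se:.4f}")
```

When a variance estimator is well calibrated, the SE column should be close to the SD column, which is the comparison the tables invite for each method.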

Table 2: Simulation results for two underlying random effect distributions, when the error term is mixed normal as .7N(.7, .1) + .3N(1.633, .1). I, ideal; CS, conditional score estimator; SWL, simple working likelihood estimator; GWL, generalized working likelihood estimator with M = 4, 5; SD, Monte Carlo standard deviation; SE, average of estimated standard errors. n = 2.

                  Normal covariate            Mixture covariate
Method            Mean      SD      SE        Mean      SD      SE
I                 -.9978    .87     .86       -.991     .83     .78
CS                -1.88     .135    .124      -1.81     .132    .119
SWL               -1.938    .236    .273      -1.787    .212    .29
GWL, M = 4        -1.263    .419    .49       -1.269    .43     .431
GWL, M = 5        -.9794    .246    .234      -.9937    .244    .247

Table 3: Results for the PBC data.

Method    γ̂ (sd)         β̂ (sd)
CS        -.29 (.69)     .53 (.265)
SWL       -.33 (.55)     .54 (.16)
GWL       -.4 (.99)      .77 (.248)

[Figure 1, left panel: score U(γ) plotted against γ. Solid line: I; dash line: CS; dot line: SWL.]
[Figure 1, right panel: score U(γ) plotted against γ. Solid line: I; dot line: GWL, M = 4; dash-dot line: GWL, M = 5.]

Figure 1: Typical score plots for simulation data sets. I, ideal method; CS, conditional score method; SWL, simple working likelihood method; GWL, generalized working likelihood method.

[Figure 2: score U(γ) plotted against γ. Solid line: GWL, M = 5; dash line: GWL, M = 4.]

Figure 2: Typical score plots for a simulation data set. GWL, generalized working likelihood method.

[Figure 3: log(serum bilirubin) plotted against time (in days).]

Figure 3: Longitudinal observation plot for six patients.