Maximum Likelihood Estimation for the Proportional Odds Model with Random Effects

Size: px

Start display at page:

Download "Maximum Likelihood Estimation for the Proportional Odds Model with Random Effects"

Elaine Eaton
5 years ago
Views:

1 Maximum Likelihood Estimation for the Proportional Odds Model with Random Effects DONGLIN ZENG, D. Y. LIN, and GUOSHENG YIN In this article, we study the semiparametric proportional odds model with random effects for correlated, right-censored failure time data. We estalish that the maximum likelihood estimators for the parameters of this model are consistent and asymptotically Gaussian. Furthermore, the limiting variances achieve the semiparametric efficiency ounds and can e consistently estimated. Simulation studies show that the asymptotic approximations are accurate for practical sample sizes and that the efficiency gains of the proposed estimators over those of Cai, Cheng and Wei (22, JASA) can e sustantial. A real example is provided to illustrate the proposed methods. KEY WORDS: Correlated failure time data; Frailty model; Linear transformation model; Proportional hazards; Semiparametric efficiency; Survival data. Donglin Zeng is Assistant Professor and D. Y. Lin is Dennis Gillings Distinguished Professor, Department of Biostatistics, CB# 742, University of North Carolina, Chapel Hill, NC Guosheng Yin is Assistant Professor, Department of Biostatistics, M. D. Anderson Cancer Center, Houston, TX 773. This research was supported y the National Institutes of Health. 1

2 1. INTRODUCTION In many scientific studies, there exists natural or artificial clustering of study sujects such that the survival times or failure times of the sujects within the same cluster are correlated. A common approach to accommodating the intra-class dependence is to incorporate an unoserved random effect, the so-called frailty, into the Cox (1972) proportional hazards model. Specifically, the hazard function for the jth suject of the ith cluster associated with a d 1 -vector of covariates X ij is postulated to take the form λ(t X ij, ξ i ) = ξ i λ (t)e XT ijα, i = 1,..., n; j = 1,..., ni, (1) where λ ( ) is an unspecified aseline hazard function, α is a vector of unknown regression parameters, and ξ i is the unoserved frailty for the ith cluster. Although various parametric distriutions for the frailty have een suggested, the existing literature has een focused on the simple case of gamma frailty. The consistency and asymptotic distriution of the maximum likelihood estimator for the gamma frailty model have een rigorously studied y Murphy (1994; 1995) for the case of no covariates and y Parner (1998) for the case with covariates. Model (1) imposes a common gamma frailty on all memers of the same cluster. Several authors have extended this shared gamma frailty model to accommodate more flexile dependence among cluster memers. In particular, Pertersen (1998) allowed different additive frailties for different memers of the same cluster. Parner (1998) assumed that the frailty for each cluster consists of two independent components: a common cluster-level effect and a suject-specific effect, and showed that the maximum likelihood estimator is efficient. Under model (1), the conditional hazard functions given frailties are required to e proportionate over time among different sets of covariate values. This assumption of proportional hazards may not e satisfied in certain applications. For independent failure time data, an attractive alternative to the proportional hazards model is the proportional odds model (Pettitt 1982; Bennett 1983). The proportional odds model constraints the ratio of the odds of survival associated with two sets of covariate values to e constant over time, and consequently the ratio of the hazards to converge to unity as time increases. By contrast, the proportional hazards model constraints the hazard ratio to e constant while the odds ratio tends to or infinity. Physical and iological rationale ehind the proportional odds model was provided y Bennett (1983) and others. Statistical inference is much more challenging under the proportional odds model than under the proportional hazards model. Important contriutions have een made 2

3 y Bennett (1983), Pettitt (1984), Cuzick (1988), Darowska and Doksum (1988), Cheng, Wei and Ying (1995), Wu (1995), Murphy, Rossini and van der Vaart (1997), Shen (1998), Lam and Leung (21), and Chen, Jin and Ying (22) among others. In this article, we consider the proportional odds model with random effects for correlated failure time data. Specifically, logit{s(t X ij, Z ij, i )} = G(t) + X T ijβ + Z T ij i, i = 1,..., n; j = 1,..., n i, (2) where X ij is a d 1 -vector of covariates, as defined earlier, Z ij is a d 2 -vector of covariates, which usually contains 1 and part of X ij, G( ) is an unspecified strictly increasing function, β is a set of unknown regression parameters, i is a set of unoserved random effects, and S( X ij, Z ij, i ) is the survival function conditional on X ij Z ij, and i. We assume that i follows a normal distriution with mean zero and unknown covariance matrix Σ. Note that model (2) allows covariate-specific or suject-specific random effects whereas model (1) only allows a clusterspecific frailty. Two recent articles are concerned with special versions of model (2). Specifically, Cai, Cheng and Wei (22) studied model (2) with a scalar random effect (i.e., Z ij 1). The parameter estimators are otained y minimizing the empirical sum of squares of the differences etween certain oserved quantities and their expected values. The estimators are not asymptotically efficient, and the variance estimation is computationally demanding. The censoring mechanism is required to e purely random and independent of covariates. Lam, Lee and Leung (22) considered the proportional odds model with scalar random effects µ ij, i = 1,..., n and j = 1,..., n i. Within the ith cluster, µ ij, j = 1,..., n i, are multivariate normal with a specific covariance structure. Lam et al. (22) otained the estimators for the regression parameters y maximizing a marginal likelihood ased on the ranks of the failure times. They did not provide formal asymptotic results or consider the prolem of survival function estimation. In this article, we study the maximum likelihood estimation of model (2). The estimators are shown to e consistent and asymptotically efficient. The asymptotic distriutions of the estimators and consistent variance estimators are also otained. Numerical studies reveal that the proposed estimators perform well for practical sample sizes and the efficiency gains over the estimators of Cai et al. (22) can e sustantial. We descrie in greater detail the data structure and model assumptions in the next section, and develop the estimation theory in Section 3. We then present the results of our numerical studies in Section 4 and provide an application to a real medical study in Section 5. Some 3

4 concluding remarks are given in Section 6. Most of the technical details are relegated to the appendix. 2. DATA STRUCTURE AND MODEL ASSUMPTIONS Suppose that there is a random sample of n clusters with potentially different sizes. For i = 1,..., n and j = 1,..., n i, let T ij and Cij e the latent failure time and censoring time for the jth memer of the ith cluster, and let X ij and Z ij e the corresponding d 1 - and d 2 - vectors of covariates. The regression relationship etween T ij and (X ij, Z ij ) is given y model (2). The data consist of (Y ij, ij, X ij, Z ij ) (i = 1,..., n; j = 1,..., n i ), where Y ij = T ij C ij, ij = I(T ij C ij ), and C ij = Cij τ. Here and in the sequel, a = min(a, ), a = max(a, ), I( ) is the indicator function, and τ is a fixed constant denoting the end of the study. We impose the following regularity conditions. C.1. Conditional on covariates X ij and Z ij, the censoring time Cij is independent of the failure time T ij and random effects i. C.2. There exists some positive constant δ such that Pr(Cij τ X ij, Z ij ) δ almost surely. C.3. All the X ij and Z ij are ounded. In addition, if there exist a constant vector c and a symmetric matrix Σ such that [1, X T ij]c + Z T ijσz ij =, j = 1,..., n i, and Z T ijσz ij =, j j ; j, j = 1,..., n i almost surely, then c = and Σ =. C.4. The true value G (t) of G(t) is a strictly increasing function in [, τ] and is continuously differentiale. In addition, G () =, de G(t) /dt t=+ >, and G (τ) <. C.5. The true values of β and Σ, β and Σ, elong to the interior of a known compact set Θ = { (β, Σ) : β B for some constant B, Σ is positive definite and its eigenvalues are ounded away from and }. C.6. The cluster size is completely random. In addition, there exists a positive integer n such that 1 n i n and Pr(n i 2) >. 4

5 Remark 1. Conditions C.3, C.4 and C.6 ensure the identifiaility of the parameters in model (2). If Z ij = Z ij in condition C.3 for continuous covariates, then the two displays in this condition are equivalent to the linear independence of [1, X T ij] and the linear independence of Z ij. In condition C.4, the equality G () = follows from the fact that S( X ij, Z ij, i ) = 1, and the inequality G (τ) < implies that Pr(T ij > τ X ij, Z ij, i ) >. The ound G (τ) is unknown in practice. Condition C.6 implies that the cluster size is ounded and some clusters have at least two sujects. 3. MAXIMUM LIKELIHOOD ESTIMATION Define H(t) = e G(t) and H (t) = e G (t). Note that H () =. Under model (2) and condition C.1, the likelihood function for the parameters (β, Σ, H ) is proportional to n n i e (XT ijβ+z T ij ) 1 ij i=1 H(Y ij ) + e (XT ijβ+z T ij ) e (XT ijβ+z T ij ) H ij (Y ij ) Σ 1/2 e T Σ 1 /2 d, (H(Y ij ) + e (XT ijβ+z T ij ) ) 2 where H (t) is the derivative of H(t). It would seem natural to calculate the maximum likelihood estimators of (β, Σ, H ) y maximizing the aove likelihood function. The maximum of this function, however, is infinity since we can always choose some function H( ) with fixed values at the Y ij while letting H (Y ij ) go to infinity for some Y ij with ij = 1. Thus, we relax H( ) to e right-continuous and allow H( ) to have jumps at the Y ij. We then maximize the following function n n i e (XT ijβ+z T ij ) 1 ij L n (β, Σ, H) i=1 H(Y ij ) + e (XT ijβ+z T ij ) e (XT ijβ+z T ij ) ij H{Y ij } Σ 1/2 e T Σ 1 /2 d, (3) (H(Y ij ) + e (XT ijβ+z T ij ) ) 2 where H{t} denotes the jump size of H(t) at t. To e specific, we maximize L n (β, Σ, H) over the parameter space {(β, Σ, H) : (β, Σ) Θ, H(t) is an increasing right-continuous function in [, τ] with H() = }. The resulting estimators, denoted y β n, Σ n and Ĥn, are referred to as the nonparametric maximum likelihood estimators (NPMLEs) (Parner, 1998) or the sieve maximum likelihood estimators (Huang and Rossini, 1997; Murphy, Rossini and van der Vaart, 1997). 5

6 The existence of the maximizers follows from the following arguments. First, for any (β, Σ, H) in the parameter space, the ith term on the right-hand side of (3) is ounded y max X ij,z ij,(β,σ) Θ n i e XT ijβ+z T ij Σ 1/2 e T Σ 1 /2 d <, where the inequality follows from the oundedness of X ij and Z ij, and the compactness of Θ. Second, for any H, we can always construct a new increasing function H which is a step function with jumps only at the Y ij for which ij = 1 such that H (Y ij ) = H(Y ij ). Clearly, H {Y ij } H{Y ij } for ij = 1 so that L n (β, Σ, H ) L n (β, Σ, H). This implies that the function H which maximizes L n (β, Σ, H) should e a step function with positive jumps only at the Y ij for which ij = 1. Third, if H{Y ij } = for some Y ij, then it is easy to see that L n (β, Σ, H) =. Therefore, we conclude that the maximizers exist. The aove arguments imply that Ĥn(t) is a step function with jumps only at the Y ij for which ij = 1. Thus, the NPMLEs for (β, Σ, H ) can e otained y maximizing L n (β, Σ, H) over the parameter space (β, Σ) Θ and the jump sizes of H at the Y ij. This maximization can e realized via many optimization algorithms such as the large-scale unconstrained optimization function fminunc in MATLAB, which is descried in the next section. The asymptotic properties of the proposed estimators are stated in the following theorems. Theorem 1. Under conditions C.1 C.6, β n β, Σ n Σ and sup t [,τ] Ĥn(t) H (t) almost surely, where is the Euclidean norm. Theorem 2. Under conditions C.1 C.6, the random element n( β T n β T, Σ T n Σ T, Ĥn( ) H ( )) T converges weakly to a zero-mean Gaussian process in the metric space R d 1 R d 2(d 2 +1)/2 l [, τ], where Σ n and Σ are treated as extended column vectors consisting of the upper triangle elements, and l [, τ] is a normed space consisting of all the ounded functions and the norm is defined as the supremum norm on [, τ]. Furthermore, β n and Σ n are asymptotically efficient. Remark 2. Theorem 1 presents the consistency of the maximum likelihood estimators. In conditions C.1 C.6, H( ) is not assumed to e a ounded function, which means that the weakcompactness of the parameter H( ) is not assumed. Thus, otaining a ound for the maximum likelihood estimator Ĥn( ) is a key to the proof of Theorem 1. The proof of Theorem 1 adopts some ideas from Murphy s (1994) proof of the consistency for the gamma frailty model, ut the technical details are quite different. Once the consistency is estalished, the asymptotic 6

7 distriutions of the maximum likelihood estimators stated in Theorem 2 can e derived along the lines of Murphy (1995) and Parner (1998), although the verification of the continuous invertiility of the information operator is sustantially different from theirs. In the statement of Theorem 2, asymptotically efficient estimators mean that the asymptotic variances attain the semiparametric efficiency ounds as defined in Bickel et al. (1993, Ch. 3). The proofs of Theorems 1 and 2 are given in the appendix. It is essential to estimate the asymptotic covariance matrices of β n and Σ n. Intuitively, the variation in estimating the parameter H( ) arises from the variation in estimating the jump sizes of H( ) at the Y ij for which ij = 1. Thus, we can regard the oserved likelihood function as a likelihood function indexed y the parameters β, Σ, and the parameters which represent the jump sizes of H( ) at the Y ij for which ij = 1. From the Fisher information theory in the parametric setting, the asymptotic covariance matrix in Theorem 2 can e estimated y the inverse of the oserved information matrix for all the parameters. constant vector (h 1, h 2 ) R d 1 Specifically, for any R d 2(d 2 +1)/2 and any ounded function h 3, the asymptotic variance of h T 1 β n + h T 2 Σ n + τ h 3(t)dĤn(t) is equal to the asymptotic variance of h T 1 β n + h T Σ 2 n + ij =1 h 3 (Y ij )Ĥn{Y ij } so that it can e estimated y h T nj 1 n h n, where h n is the vector comprising of h 1, h 2 and the h 3 (Y ij ) for which ij = 1, and J n is the negative Hessian matrix of log L n ( β, Σ, Ĥ) with respect to (β, Σ) and the jump sizes of H at the Y ij for which ij = 1. The next theorem formalizes this approximation. Theorem 3. Let V (h 1, h 2, h 3 ) e the asymptotic variance of the random variale n 1/2 {h T 1 ( β n β ) + h T 2 ( Σ n Σ ) + τ h 3(t)d(Ĥn(t) H (t))}. Under conditions C.1 C.6, the estimator nh T nj 1 n h n V (h 1, h 2, h 3 ) uniformly in (h 1, h 2, h 3 ) in proaility. Theorem 3 implies that, when the numer of uncensored oservations is not too large, one can simply invert the oserved information matrix for all the parameters, including β, Σ, and the H{Y ij } for which ij = 1 to calculate the variances and covariances. Our numerical studies revealed that this approximation is satisfactory for practical sample sizes. 4. NUMERICAL STUDIES Simulation studies were conducted to evaluate the finite-sample properties of the proposed methods. In the first set of studies, the failure times were generated from the following special case of model (2): logit{s(t X 1ij, X 2ij, i )} = log t X 1ij + X 2ij + i, i = 1,..., n; j = 1, 2, 7

8 where X 1i1 =, X 1i2 = 1, X 2i1 X 2i2 is a uniform(, 1) random variale, and i is zeromean normal with variance σ 2. The censoring times were generated from the uniform(, 15) distriution, corresponding to approximately 33% censoring rate. We used the optimization algorithm fminunc in the optimization toolox of MATLAB to otain the maximum likelihood estimates of β 1, β 2, σ and H. When the gradients and the Hessian derivatives of the likelihood function are provided, the search algorithm is a suspace trust region method and is ased on the interior-reflective Newton method descried in Coleman and Li (1994; 1996). In each iteration of the search, a large linear system is approximately solved y using the method of preconditioned conjugate gradients. The algorithm converges when the search step size and the norm of the search gradients are smaller than certain thresholds. To avoid negative estimates of the jump sizes for H or negative estimates of σ, we used the logarithms of the jumps sizes and log σ as the parameters during the search. The starting values for (β 1, β 2, σ) were set to e (,, 1). The starting value for the jump size H{Y ij } at the failure time Y ij was given y expression (A.1) in Appendix A.1, on the right-hand side of which the values for (β 1, β 2, σ) were set to e the initial values and H(t) was set to e t. In general, the search algorithm converged within 1 iterations. After the algorithm converged, the variance estimates were calculated y inverting the oserved information matrix. Tale 1 displays the results of these simulation studies with n = 2. likelihood estimators for all the parameters show little ias. The maximum The proposed standard error estimators agree well with the empirical standard errors, and the confidence intervals provide reasonale coverages. For comparisons, we also computed the estimates ased on the method of Cai et al. (22), which minimizes the following criterion function nρ n n i i=1 l k=1 ili(y il Y ik t ) Ĝ w (Y il ) α, + n α n i j i=1 k=1 l=1 e (t+xt ikβ+) (t+x e T jlβ+ ) 1 + e (t+xt ik β+) e (t+xt ikβ+) e (t+xt ilβ+) 1 e 2 2σ 2 dtd 1 + e (t+xt ik β+) {1 + e (t+xt il β+) } 2 2πσ n j [ jl I(Y jl Y ik t ) Ĝ 2 c(y jl ) 1 e 2 1 2σ 2 e 2 2σ 2 dtdd {1 + e (t+xt jl β+ ) } 2 2πσ 2πσ 2, (4) where t is the minimum of the 95th percentile of the oserved Y ij and the 98th percentile of the oserved Y ij for which ij = 1. In (4), ρ is chosen to minimize the asymptotic variance of the estimator for β, Ĝc(t) = e Λc (t) and Ĝw(t) = e Λw (t), where Λ c is the Nelson-Aalen estimator 8 2

9 ased on (Y ij, 1 ij ), i = 1,..., n; j = 1,..., n i, and Λ w is the Nelson-Aalen estimator ased on {Y ik Y il, 1 (I(Y ik Y il ) ik + I(Y ik > Y il ) il )}, i = 1,..., n, 1 k < l n i. The mean squared errors for Cai et al. s estimators of β 1, β 2 and σ turned out to e.96,.749 and.935 under σ = 3, and.55,.192 and.311 under σ = 1. Thus, these estimators can e consideraly less efficient than the maximum likelihood estimators, especially when there is strong intra-class dependence. It would e interesting to make comparisons with Lam et al. s method. Their estimators, however, are not easy to program. Tale 1. Summary Statistics for the Simulation Studies With One Random Effect Paramter Value Mean SE SEE 95% CP MSE σ = 3 β β σ H(τ/4) H(τ/2) H(3τ/4) H(τ) σ = 1 β β σ H(τ/4) H(τ/2) H(3τ/4) H(τ) NOTE: Mean and SE stand for the mean and standard error of the estimator. SEE is the mean of the standard error estimator, and 95% CP is the coverage proaility of the 95% confidence interval. MSE is the mean squared error. Each entry is ased on 5 simulated data sets. In our second set of studies, we considered ivariate normal random effects. The failure times were generated from the following model: logit{s(t X i, 1i, 2i )} = log { (1 + t/2) 2 1 } +.5X 1ij.5X 2ij + 1i + X 2ij 2i, i = 1,..., n; j = 1, 2, where X 1i1 =, X 1i2 = 1, X 2i1 X 2i2 is a uniform(, 1) random variale, and ( 1i, 2i ) has a ivariate normal distriution with zero means, unit variances and covariance.4. censoring times were set to e min(3, C ), where C is uniform(3/8, 11/8), so that approximately 9 The

10 34% of the failure times were censored. The optimization algorithm fminunc was again used to find the maximum likelihood estimates. We used the logarithms of the jump sizes of H and the elements in the square-root of the covariate matrix of the random effects as the parameters during the search. To ensure that the covariance matrix estimate is positive definite, we let the ojective function e a large negative value (i.e., 1 5 ) if any condition for a positive definite matrix was violated. This penalization essentially restricts the search within the meaningful regions of the parameters. As efore, oth the gradients and the Hessian derivatives of the ojective function were supplied in the search algorithm. The starting values for the regression parameters and the covariance matrix were zeros and the identity matrix respectively, while the starting values for the jump sizes of H were determined y (A.1), in which the parametric components were set to e the initial values and H(t) to e t. Tale 2. Summary Statistics for the Simulation Studies With Two Random Effects Parameter Value Mean SE SEE 95% CP MSE n = 2 β β σ σ σ H(τ/4) H(τ/2) H(3τ/4) H(τ) n = 4 β β σ σ σ H(τ/4) H(τ/2) H(3τ/4) H(τ) NOTE: See Note to Tale 1. Tale 2 displays the results for n = 2 and n = 4. For n = 2, the search usually converged after aout 1 iterations, and it took less than 2 hours to complete 5 repetitions on GHz Athlon machines. For n = 4, it took aout 1 hours to complete. In the tale, 1

11 σ 11, σ 22 and σ 12 are the elements in the square-root of the covariance matrix for 1 and 2 so that σ 11 = σ 22 =.979 and σ 12 =.24. For the regression parameters and the function H, the maximum likelihood estimators show little ias and the proposed standard error estimators agree well with the empirical standard errors. The parameters for the covariance matrix of the random effects are estimated less well, although there is a trend for improvement as n increases. 5. AN EXAMPLE We now illustrate the proposed methods with the well-known Diaetic Retinopathy Study (Huster et al. 1989), which was conducted to assess the effectiveness of laser photocoagulation in delaying visual loss among patients with diaetic retinopathy. One eye of each patient was randomly selected to receive the laser treatment while the other eye was used as a control. The failure time of interest is the time to visual loss as measured y visual acuity less than 5/2. Following previous authors, we confine our attention to a suset of 197 high-risk patients, and consider three covariates: X 1ij indicates, y the values 1 versus, whether or not the jth eye (j = 1 for the left eye and j = 2 for the right eye) of the ith patient was treated with laser photocoagulation, X 2i1 X 2i2 indicates, y the values 1 versus, whether the ith patient had adult-onset or juvenile-onset diaetics, and X 3ij = X 1ij X 2ij. We fit model (2) with these three covariates, along with random effects i to account for the correlation etween the two eyes of the same patient. We used the fminunc function with the starting values descried in the previous section. The results of the analysis are shown in Tale 3. There is a high degree of dependence etween the failure times of the two eyes from the same patient. Both the treatment indicator and the interaction term are significant, whereas the diaetic type is not. Cai et al. (22) reported estimates of β 1, β 2 and β 3 of.46,.74 and 1.41 with estimated standard errors of.3,.38 and.54. The conclusions ased on the two sets of results would e somewhat different. Tale 3. Maximum Likelihood Estimates of the Random-Effect Proportional Odds Model for the Diaetic Retinopathy Study Parameter Estimate SE Est/SE p-value β β β σ <.1 11

12 NOTE: SE is the estimated standard error, and p-value pertains to the two-sided test of zero parameter value. To compare the fit of competing models, Cai et al. (22, p. 516) proposed a distance measure which summarizes the differences etween the oserved and fitted values of the failure times. This distance measure turns out to e for the proportional odds model with normal random effect as opposed to for the proportional hazards model with gamma frailty. Thus, the former model appears to fit the data slightly etter than the latter. One important application of random-effects models is to predict the future survival experience of one memer given the survival history of the other memers of the same cluster. In the Diaetic Retinopathy Study, one may e interested in estimating, for example, the conditional survival proailities of the treated eye given that it has not failed efore 3 months while the untreated eye failed etween 24 and 3 months, i.e., Pr(T 2 > t T 2 > 3, 24 < T 1 < 3, X 11 =, X 12 = 1, X 2 ) for t > 3, where T 2 is the failure time for the treated eye and T 1 is the failure time for the untreated eye, X 1k is the treatment status for the kth eye, and X 2 is the diaetic type for this patient. It is straightforward to show that = Pr(T 2 > t T 2 > 3, 24 < T 1 < 3, X 11 =, X 12 = 1, X 2 ) u g(u, t, 1, X 2; β, σ, H){g(u, 24,, X 2 ; β, σ, H) g(u, 3,, X 2 ; β, σ, H)}φ(u)du u g(u, 3, 1, X 2; β, σ, H){g(u, 24,, X 2 ; β, σ, H) g(u, 3,, X 2 ; β, σ, H)}φ(u)du, (5) where φ( ) is the standard normal density function, and g(u, t, X 1, X 2 ; β, σ, H) = e β 1X 1 β 2 X 2 β 3 X 1 X 2 σu H(t) + e β 1X 1 β 2 X 2 β 3 X 1 X 2 σu. We can easily estimate this proaility function y replacing β 1, β 2, σ and H in (5) y their respective maximum likelihood estimates and then evaluating the integration via the Gaussianquadrature formula. The variance function is given y D T J 1 n D, where D is the derivative of (5) with respect to (β 1, β 2, σ) and the jump sizes of H at the Y ij for which ij = 1. Figure 1 displays the estimated survival curves along with the 95% confidence intervals for the two diaetic types. 6. DISCUSSION We have developed consistent and efficient estimators for the proportional odds model with random effects, which is a useful alternative to the popular proportional hazards model with 12

13 Survival proailities Follow-up time (months) Figure 1: Estimated conditional survival proailities for the diaetic retinopathy patients: the curve represents the point estimate of the survival function for the adult-onset diaetics, and the curves the corresponding 95% confidence limits; the curve represents the the point estimate of the survival function for the juvenile-onset diaetics, and the curves the corresponding 95% confidence limits. 13

14 gamma frailty. The proposed estimators are more efficient than those of Cai et al. (22). It is computationally less demanding to evaluate the variances of the proposed estimators than those of Cai et al. s estimators as the latter requires a multi-layer summation. The proposed numerical algorithm does not guarantee a gloal maximum. This is a common prolem for all maximum likelihood estimators in complex settings. Our experience, however, indicates that the proposed algorithm works well in practice. One approach to increase one s confidence in the estimates is to employ different starting values. We have tried different starting values in our simulated and real data and otained very similar answers. Note that the existing ad hoc estimating equations may have multiple solutions as well. For the variance estimation, we invert the oserved information matrix on the asis of Theorem 3. When the numer of uncensored oservations is large, the matrix inversion may potentially e unstale. An alternative approach is to use the numerical differentiation of the profile log-likelihood function, as implemented y Huang and Rossini (1997) and Murphy, Rossini and van der Vaart (1997). In the latter approach, the choice of the neighorhood is aritrary and no variance estimates are availale for the survival function estimators. It would e worthwhile to study the maximum likelihood estimation for a general class of linear transformation models with random effects ψ{s(t X ij, Z ij, i )} = G(t) + X T ijβ + Z T ij i, (6) where ψ is a given link function. A versatile family of link functions is ψ(s) = log{λ 1 (s λ 1)}, λ, which contains oth the proportional hazards model (λ = ) and the proportional odds model (λ = 1). General linear transformation models have een studied y Bickel (1986), Darowka and Doksum (1988), Cheng et al. (1995) and Chen et al. (23) among others for independent failure time data and y Cai et al. (22) for clustered failure time data (with a scalar random effect), although asymptotically efficient estimators have yet to e developed. It is expected that the asymptotic normality and the efficiency of the maximum likelihood estimators for this class of models depend on the smoothness property of ψ. We are currently investigating the conditions for ψ and developing the requisite asymptotic theory. The proposed methods are ased on the normality of the random effects. The normality assumption may not e satisfied in some applications. It would e desirale to relax this assumption and to require only that the random effects have zero means. One possile approach is to approximate the density of random effects with a truncated series expansion (Davidian and Giltinana 1995, Ch. 7). We will pursue this generalization in our future work. 14

15 APPENDIX. PROOFS OF THEOREMS A.1 Proof of Theorem 1 The proof of Theorem 1 mimics Murphy s (1994) proof of consistency for the proportional hazards model with gamma frailty. Sustantial technical complications arise from the fact that, unlike the gamma frailty model, the random effects in our setting cannot e integrated out explicitly. The proof consists of two major steps: in the first step, we show that Ĥn( ) has an upper ound in [, τ] with proaility one; in the second step, we show that any convergent susequence of ( β n, Σ n, Ĥn) must converge to (β, Σ, H ). Step 1. We prove that Ĥn( ) has an upper ound in [, τ] with proaility one. Our approach is to show that since Ĥn maximizes L n it cannot diverge. Let l n (β, Σ, H) = log L n (β, Σ, H). By definition, l n ( β n, Σ n, Ĥn) l n (β, Σ, H) for any β, Σ and H. We wish to show that if Ĥn diverges, then the difference in the log-likelihood must e negative, which will e a contradiction. If H is continuous, l n (β, Σ, H) will e infinite for finite n. Thus, the choice of H = H is excluded. The key is to construct a suitale function H n that uniformly converges to H. Suppose that Ĥn(τ) in some sample space with positive proaility. We will show that n 1 {l n ( β n, Σ n, Ĥn) l n (β, Σ, H n )} diverges to if Ĥn(τ). We will construct the function H n y imitating Ĥn. By differentiating l n (β, Σ, H) with respect to H{Y ij } and setting the derivative to zero, we see that Ĥn{Y ij } satisfies the equation n ij H{Y ij } = R 1k( β n, H, )R 2k (Y ij, β 1 T n, H, )e Σ n /2 Σ n 1/2 d k=1 R 1k( β 1 T n, H, )e Σ n /2 Σ n 1/2, d (A.1) where R 1k (β, H, ) = n k l=1 e (XT klβ+z T kl ) { H(Ykl ) + e (XT kl β+zt kl )} 1+ kl, R 2k (t, β, H, ) = n k l=1 (1 + kl )I(Y kl t) H(Y kl ) + e (XT kl β+zt kl ). Thus, we define H n (t) as a step function with jumps only at the Y ij for which ij = 1 and the jump size H n {Y ij } satisfies the equation ij H n {Y ij } = n k=1 R 1k(β, H, )R 2k (Y ij, β, H, )e T Σ 1 /2 Σ 1/2 d Specifically, H n (t) = n i=1 ni I(Y ij t) H n {Y ij }. R 1k(β, H, )e T Σ 1 /2 Σ 1/2 d 15. (A.2)

16 We will show that H n (t) converges to H (t) uniformly in t [, τ] with proaility one. By the Glivenko-Cantelli theorem (van der Vaart and Wellner 1996, p. 122), H n (t) converges almost surely to E { ni I(Y ij t) ij /µ(y ij ) }, where µ(y) = E R 1k(β, H, )R 2k (y, β, H, )e T Σ 1 /2 Σ 1/2 d R 1k(β, H, )e T Σ 1 /2 Σ 1/2 d n k (1 + = E kl )I(Y kl y) E H (Y kl ) + e (XT β kl +ZT kl ) n k. l=1 If we denote y S c ( X kl, Z kl ) the survival function of C kl given (X kl, Z kl ), then (1 + kl )I(Y kl y) E H (Y kl ) + e (XT β kl +ZT kl ) n k 1 H = E 2 (t) y H (t) + e (XT β kl +ZT kl ) { H (t) + e (XT β kl +ZT )} 2 S c(t X kl, Z kl )dt kl 1 1 E y H (t) + e (XT β kl +ZT kl ) H (t) + e (XT β ds c(t X kl, Z kl ) kl +ZT kl ) n k S = E c (y X kl, Z kl ) {H (y) + e (XT β kl +ZT kl ) } 2 n k, where the second equality follows from integration y part. Thus, n i I(Y ij t) n i ij E µ(y ij ) = E t S E c (y X ij, Z ij )H (y) dy µ(y){h (y) + e (XT ijβ +Z T ij ) } 2 n i = t H (y)dy = H (t). Consequently, H n (t) uniformly converges to H (t) in [, τ]. By plugging equation (A.1) into l n ( β n, Σ n, Ĥn), we otain { l n ( β n, Σ n } n, Ĥn) = log R 1i ( β n, Ĥn, ) Σ 1 n 1/2 T e Σ n /2 d i=1 n n i n ij log R 1k( β n, Ĥn, )R 2k (Y ij, β 1 n, Ĥn, T )e Σ n /2 Σ n 1/2 d i=1 k=1 R 1k( β 1 n, Ĥn, T )e Σ n /2 Σ n 1/2. d Likewise, y plugging equation (A.2) into l n (β, Σ, H n ) and applying the Glivenko-Cantelli theorem, we see that n 1 l n (β, Σ, H n ) = O(1) n 1 n i=1 ni ij log(n), where O(1) denotes a random variale ounded away from infinity almost surely. Thus, n 1 {l n ( β n, Σ n, Ĥn) l n (β, Σ, H n { n )} = O(1)+n 1 log R 1i ( β n, Ĥn, ) Σ n 1/2 e 16 i=1 } 1 T Σ n /2 d

17 n n 1 n i n ij log i=1 n 1 R 1k( β n, Ĥn, )R 2k (Y ij, β 1 n, Ĥn, T )e Σ n /2 Σ n 1/2 d k=1 R 1k( β 1 n, Ĥn, T )e Σ n /2 Σ n 1/2. d (A.3) We will show that if Ĥn(τ), then the right-hand side of (A.3) will diverge to. To this end, we will ound each term in (A.3). Let m and M e constants such that < m e XT kl β n M < almost surely for all k = 1,..., n; l = 1,..., nk. Since Ĥ n (y) + e (XT kl β Ĥ n +Z T kl ) n (y) + e XT kl β n, if Z T kl, } e ZT kl {Ĥ n (y) + e XT kl β n, if Z T kl >, we have Ĥn(y) + e (XT kl β n +Z T kl ) e ZT kl {Ĥn(y) + m}. Similarly, Ĥn(y) + e (XT kl β n +Z T kl ) e ZT kl {Ĥn(y) + M}. Thus, there exist constants C 1 and C 2 such that the following results hold: R 1i ( β n, Ĥn, ) Σ n 1/2 e 1 T Σ n /2 d n i n i C 1 R 1k ( β n, Ĥn, )R 2k (Y ij, β n, Ĥn, ) Σ n 1/2 e n k l=1 n k C 2 e (XT kl β n +Z T kl ) (1+ kl) Z T kl l =1 (Ĥn(Y kl ) + M) 1+ kl I(Y kl Y ij ) Ĥ n (Y kl ) + M n k l=1 e (XT ij β n +Z T ij )+(1+ ij) Z T ij (Ĥn(Y ij ) + m) 1+ ij Σ n 1/2 e 1 T Σ n /2 d 1, (A.4) (Ĥn(Y ij ) + m) 1+ ij 1 T Σ n /2 d After plugging (A.4) and (A.5) into (A.3), we otain n 1 {l n ( β n, Σ n, Ĥn) l n (β, Σ, H n )} = O(1) n 1 n 1 n n i i=1 n k (1 + kl )I(Y kl Y ij )e ZT kl l =1 Ĥ n (Y kl ) + M Σ n 1/2 e T Σ 1 n /2 d 1. (A.5) (Ĥn(Y kl ) + M) 1+ kl log [ n 1 n n k k=1 l =1 Because there exists a constant C 3 such that n n i i=1 { I(Ykl Y ij ) Ĥ n (Y kl ) + M (1 + ij ) log(ĥn(y ij ) + m) } nk l=1 C 2(Ĥn(Y kl ) + m) 1+ ] kl nk l=1 C. 1(Ĥn(Y kl ) + M) 1+ kl we conclude that nk l=1 C 2(Ĥn(Y kl ) + m) 1+ kl nk l=1 C 1(Ĥn(Y kl ) + M) 1+ kl C 3 >, n 1 {l n ( β n, Σ n, Ĥn) l n (β, Σ, H n )} = O(1) n 1 17 n n i i=1 (1 + ij ) log(ĥn(y ij ) + m)

18 n 1 n n i i=1 ij log { n 1 n n k k=1 l=1 } I(Y kl Y ij ). (A.6) Ĥ n (Y kl ) + M It remains to show that, if Ĥn(τ), the right-hand side of (A.6) diverges to. To this end, we choose a partition of [, τ] as follows: with s = τ, choose s 1 < s such that 1 2 E n i (1 + ij )I(Y ij = s ) > E n i ij I(Y ij [s 1, s )). By conditions C.2 and C.4, such an s 1 exists. Define a constant ɛ (, 1) such that ɛ 1 ɛ < E { ni I(Y ij [s 1, s )) } E { ni ij I(Y ij [, τ)) }. If s 1 >, we can choose s 2 max(, s) such that s is the minimum value less than s 1 satisfying that n i (1 ɛ)e (1 + ij )I(Y ij [s 1, s )) E n i ij I(Y ij [s, s 1 )). Clearly, s 2 exists under condition C.4, and s 2 < s 1. This process is continued so that we otain a sequence: τ s > s 1 > s 2 >... such that 1 2 E n i (1 + ij )I(Y ij = s ) E n i ij I(Y ij [s 1, s )), n i (1 ɛ)e (1 + ij )I(Y ij [s p, s p 1 )) E n i ij I(Y ij [s p+1, s p )), p 1. We claim that such a sequence cannot e infinite, i.e., there exists a finite N such that s N+1 = ; otherwise, s p s for some s [, τ). By the definition of s p, n i n i (1 ɛ)e (1 + ij )I(Y ij [s p, s p 1 )) = E ij I(Y ij [s p+1, s p )), p 1. We sum the aove equations over p = 1, 2,..., and y the continuity of true densities, we otain n i (1 ɛ)e (1 + ij )I(Y ij [s, τ)) = E n i ij I(Y ij [s, s 1 )). Thus, n i (1 ɛ)e I(Y ij [s, τ)) ɛe n i ij I(Y ij [s, s 1 )), 18

19 which contradicts the choice of ɛ. Therefore, the sequence is finite: τ = s >... > s N+1 =. Now the right-hand side of (A.6) can e ounded y n n 1 n i I(Y ij = τ)(1 + ij ) log(ĥn(τ) + m) i=1 N n 1 n n i (1 + ij )I(Y ij [s p+1, s p )) log(ĥn(s p+1 ) + m) p= i=1 N n { n 1 n i n } ij I(Y ij [s p+1, s p )) log n 1 n k I(Y kl Y ij, Y kl [s p+1, s p )) + O(1) p= i=1 k=1 l=1 Ĥ n (s p ) + M 1 n n i (1 + ij )I(Y ij = τ) log(ĥn(τ) + m) 2n i=1 1 n n i (1 + ij )I(Y ij = τ) log(ĥn(τ) + m) 2n i=1 n n 1 n i ij I(Y ij [s 1, s )) log(ĥn(τ) + M) i=1 N n n i n 1 (1 + ij )I(Y ij [s p, s p 1 )) log(ĥn(s p ) + m) p=1 i=1 n n 1 n i ij I(Y ij [s p+1, s p )) log(ĥn(s p ) + M) i=1 N n { n 1 n i n } ij I(Y ij [s p+1, s p )) log n 1 n k I(Y kl Y ij, Y kl [s p+1, s p )) + O(1). p= i=1 k=1 l=1 (A.7) The first term on the right-hand side of (A.7) diverges to as Ĥn(τ). The second term is negative as n is large due to the choice of s 1. By the selection of s p, p = 1,..., N, the third term cannot diverge to +. Finally, the fourth term is ounded ecause of the Glivenko- Cantelli theorem. Hence, the right-hand side of (A.7) diverges to. This contradicts the fact that the left-hand side (A.6) is non-negative. In conclusion, we have shown that Ĥn(τ) has an upper ound with proaility 1. Thus, it follows from Helly s selection theorem that there exists a convergent susequence, still denoted y Ĥn( ), which converges point-wise to a monotone function H ( ) in [, τ]. Since β n and Σ n elong to a compact set, y choosing a further susequence, we can assume that β n β and Σ n Σ for some random vectors β and Σ. Step 2. We will show that β = β, Σ = Σ and H (t) = H (t). Define R 3k (t, β, Σ, H) = R 1k(β, H, )R 2k (t, β, H, )e T Σ 1 /2 d R. 1k(β, H, )e T Σ 1 /2 d 19

20 In view of equations (A.1) and (A.2), we see that Ĥn(t) is asolutely continuous with respect to H n (t), and Ĥ n (t) = t nk=1 R 3k (u, β, Σ, H ) nk=1 R 3k (u, β n, Σ n, Ĥn) d H n (u). By taking the limits on oth sides of the aove display, we conclude that H (t) is asolutely continuous with respect to H (t) so that H (t) is differentiale with respect to t. In addition, dĥn(t)/d H n (t) converges to dh (t)/dh (t) uniformly in t. On the other hand, n 1 {l n ( β n, Σ n, Ĥn) l n (β, Σ, H n )} n { = n 1 log R 1i ( β n, Ĥn, ) Σ n 1/2 e i=1 n n 1 i=1 { log } 1 T Σ n /2 d R 1i (β, H } n, ) Σ 1/2 e T Σ 1 /2 d n i n +n 1 ij log (Ĥn{Y ij }/ H n {Y ij } ). i=1 (A.8) By letting n in (A.8), we have E log R 1i(β, H, ) Σ 1/2 e T Σ 1 /2 d n i H (Y ij ) ij R 1i(β, H, ) Σ 1/2 e T Σ 1 /2 d n i H (Y ij ) ij Because the right-hand side is the negative Kullack-Leiler information, we have = n i n i almost surely. In other words, n i H (Y ij ) ij H (Y ij ) ij R 1i (β, H, ) Σ 1/2 e T Σ 1 /2 d R 1i (β, H, ) Σ 1/2 e T Σ 1 /2 d e (XT ijβ +Z T ij ) H (Y ij ) ij { } H (Y ij ) + e (XT ijβ 1+ ij Σ 1/2 e T Σ +Z T ij ) 1 /2 d. = n i e (XT ijβ +Z T ij ) H (Y ij ) ij { } 1+ ij Σ 1/2 e T Σ H (Y ij ) + e (XT ijβ +Z T ij ) 1 /2 d. (A.9) We will show that equation (A.9) entails that β = β, Σ = Σ and H = H. Fix an integer k such that 1 k n i. We let ij = 1, Y ij = in (A.9) for j = 1,..., k; for those 2

21 j such that j > k, we perform the following action on the jth term on oth sides of (A.9): if ij =, we replace Y ij with τ; if ij = 1, we integrate Y ij from to τ. Thus, we otain k = {H ()e XTijβ } ni +Z Tij k j=k+1 Σ 1/2 e T Σ 1 /2 d {H ()e XTijβ } ni +Z Tij H (τ)e XT ijβ +Z T ij H (τ)e XT ijβ +Z T ij ij 1 H (τ)e XT ijβ +Z T ij + 1 H (τ)e XT ijβ +Z T ij ij ij 1 1 ij j=k+1 H (τ)e XT ijβ +Z T ij + 1 H (τ)e XT ijβ +Z T ij + 1 Σ 1/2 e T Σ 1 /2 d. (A.1) Since { ij : j = k + 1,..., n i } are aritrary, we sum the two sides of (A.1) over all possile ij, j = k + 1,..., n i to yield Thus, k = k k exp k = exp {H ()e XTijβ } +Z Tij Σ 1/2 e T Σ 1 /2 d {H ()e XTijβ +Z Tij } Σ 1/2 e T Σ 1 /2 d. X T ijβ + ( k Z ij ) T Σ ( k Z ij ) 2 H () k X T ijβ + ( k Z ij ) T Σ ( k Z ij ) 2 H () k. (A.11) Condition C.4 implies that H () >. Note that the index set {1,..., k} in equation (A.11) can e replaced y any suset of {1,..., n i }. Thus, it is easy to derive from (A.11) that and Z T ijσ Z ij = Z T ijσ Z ij, j j ; j, j = 1,..., n i, X T ijβ + ZT ijσ Z ij 2 + log H () = X T ijβ + ZT ijσ Z ij 2 + log H (), j = 1,..., n i. According to condition C.3, Σ = Σ, β = β and H () = H (). To show that H = H, we let i1 = 1 in (A.1) and integrate Y i1 from to y; we also perform the following action on the jth term on oth sides of (A.1) for j = 2,... n i : if ij =, 21

22 we replace Y ij with τ; if ij = 1, we integrate Y ij from to τ. Then we sum the resulting equalities over all possile { ij : j = 2,..., n i } to yield = H (y)e XT i1β +Z T i1 H (y)e XT i1β +Z T i1 + 1 Σ 1/2 e T Σ 1 /2 d H (y)e XT i1β +Z T i1 H (y)e XT i1β +Z T i1 + 1 Σ 1/2 e T Σ 1 /2 d. Because the two sides of the aove equation are strictly monotone in H (y) and H (y), respectively, we have H (y) = H (y). Comining the results from Step 1 and Step 2, we conclude that, almost surely, β n β, Σ n Σ, and Ĥn(y) H (y). The uniform convergence of Ĥn to H follows from the fact that H is a continuous function. A.2. Proof of Theorem 2 Consider the set H = {(h 1, h 2, h 3 ) : h 1 R d 1, h 2 R d 2(d 2 +1)/2, h 3 ( ) is a function on [, τ]; h 1 1, h 2 1, h 3 V 1}, where h 3 V denotes the total variation of h 3 ( ) in [, τ]. We define a sequence of maps S n mapping a neighorhood of (β, Σ, H ), denoted y U, in the parameter space for (β, Σ, H) into l (H) (i.e., the space consisting of ounded functionals on H) as follows: S n (β, Σ, H)[h 1, h 2, h 3 ] n 1 d dɛ l n(β + ɛh 1, Σ + ɛh 2, H(t) + ɛ A n1 [h 1 ] + A n2 [h 2 ] + A n3 [h 3 ], t h 3 (s)dh(s)) where A np, p = 1, 2, 3, are linear functionals on R d 1, R d 2(d 2 +1)/2, and BV [, τ], respectively, and BV [, τ] is the space of functions with finite total variation in [, τ]. In fact, if we let l β, l Σ, l H [h 3 ] e the score function for β, the score function for Σ, and the score for H along the path H(t) + ɛ t h 3(s)dH(s) for a single cluster, then ɛ= A n1 [h 1 ] = P n [h T 1 l β ], A n2 [h 2 ] = P n [h T 2 l Σ ], A n3 [h 3 ] = P n [l H [h 3 ]], where P n denotes the empirical measure ased on n independent clusters. 22

23 We can explicitly write the functionals A np,p = 1, 2, 3, as follows. Define the operation etween two matrices M 1 and M 2 of the same size as the trace of (M 1 M T 2 ), and for each h 2 R d 2(d 2 +1)/2, let D(h 2 ) e the symmetric matrix such that the extended vector taken from D(h 2 ) is the same as h 2. Then, A n1 [h 1 ] = n 1 n i=1 A n2 [h 2 ] = n 1 n A n3 [h 3 ] = n 1 n R 1i(β, H, ) n i i=1 { n i i=1 X T ijh 1 [(1 + ij )/{1 + H(Y ij )e XT ijβ+z T ij } 1]e T Σ 1 /2 d R, 1i(β, H, )e T Σ 1 /2 d R 1i(β, H, )e T Σ 1 /2 { T Σ 1 D(h 2 )Σ 1 /2 Σ 1 D(h 2 )/2}d R 1i(β, H, )e T Σ 1 /2 d ij h 3 (Y ij ) Yij (1+ ij ) h 3 (y)dh(y) R 1i(β, H, )/(H(Y ij ) + e (XT ijβ+z T ij ) )e T Σ 1 } /2 d R. 1i(β, H, )e T Σ 1 /2 d Correspondingly, we define the limit map S : (β, Σ, H) l (H) as S(β, Σ, H)[h 1, h 2, h 3 ] = A 1 [h 1 ] + A 2 [h 2 ] + A 3 [h 3 ], where the linear functionals A p, p = 1, 2, 3, are otained y replacing the the empirical sum in the A np y the expectation. Clearly, S n ( β n, Σ n, Ĥn) =, and S(β, Σ, H ) =. The desired asymptotic normality will follow if we can verify the four conditions stated in Theorem 2 of Murphy (1995). The first condition that n(s n (β, Σ, H ) S(β, Σ, H )) weakly converges to a tight Gaussian process on l (H) holds since H is a Donsker class and the functionals A np are ounded Lipschitz functionals with respect to H. By the smoothness of S(β, Σ, H), the Fréchet differentiaility holds and the derivative of S(β, Σ, H) at (β, Σ, H ), denoted y Ṡ(β, Σ, H ), is a map from the space {(β β, Σ Σ, H H ) : (β, Σ, H) is in the neighorhood U of (β, Σ, H )} to l (H). The approximation condition that sup (h 1,h 2,h 3 ) H (S n S)( β n, Σ n, Ĥn)[h 1, h 2, h 3 ] (S n S)(β, Σ, H )[h 1, h 2, h 3 ] = o p (n 1/2 { β n β + Σ n Σ + sup Ĥn(y) H (y) }) y [,τ] can e verified along the lines of Murphy (1995, appendix)., 23

24 It remains to show that the linear map Ṡ(β, Σ, H ), denoted y Λ, is continuously invertile on its range. Note that Λ maps (β β, Σ Σ, H H ) to a ounded functional on H. Algeraic manipulations yield Λ(β β, Σ Σ, H H )[h 1, h 2, h 3 ] = (β β ) T Q 1 (h 1, h 2, h 3 ) + (Σ Σ ) T Q 2 (h 1, h 2, h 3 ) + τ Q 3 (h 1, h 2, h 3 )d(h H ), where ( ) h1 Q 1 (h 1, h 2, h 3 ) = B 1 + h 2 ) Q 2 (h 1, h 2, h 3 ) = B 2 ( h1 h 2 ) Q 3 (h 1, h 2, h 3 ) = B 3 ( h1 h 2 + τ τ h 3 (y)d 1 (y)dy, h 3 (y)d 2 (y)dy, + 4 h 3 (y) + τ h 3 (t)d 3 (t, y)dt, B 1, B 2, B 3 are constant matrices, D 1 (y), D 2 (y), D 3 (t, y) are continuously differentiale functions depending on the true densities, and 4 >. Therefore, the operator Q(h 1, h 2, h 3 ) (Q 1 (h 1, h 2, h 3 ), Q 2 (h 1, h 2, h 3 ), Q 3 (h 1, h 2, h 3 )) T can e considered as a sum of a continuously invertile linear operator and a compact operator from the linear span of H to itself. The invertiility of Λ Ṡ(β, Σ, H ) is equivalent to the invertiility of the linear operator Q(h 1, h 2, h 3 ). It suffices to prove that Q(h 1, h 2, h 3 ) is a one-to-one map (Rudin 1973, pp ). If Q(h 1, h 2, h 3 ) =, then Λ(β β, Σ Σ, H H )[h 1, h 2, h 3 ] = for any (β, Σ, H) in the neighorhood U. In particular, we choose y β = β + ɛh 1, Σ = Σ + ɛh 2, H(y) = H (y) + ɛ h 3 (t)dh (t) for a small constant ɛ. By the definition of Λ, = Λ(β β, Σ Σ, H H )[h 1, h 2, h 3 ] = ɛe{(l β [h 1 ] + l Σ [h 2 ] + l H [h 3 ]) 2 }. Thus, l β [h 1 ] + l Σ [h 2 ] + l H [h 3 ] = almost surely. After writing out the expression of this equation, we otain n i R 4i (β, H, ) X T ijh (1 + ij ) d (H (Y ij )e (XT ijβ +Z T ij ) N(, Σ ) + 1) n i + { Σ 1 D(h 2 ) + 1 } 2 T Σ 1 D(h 2 )Σ 1 R 4i (β, H, )d N(, Σ ) R 4i (β, H, ) ijh 3 (Y ij ) (1 + ij) Yij h 3 (y)dh (y) (H (Y ij ) + e (XT ijβ +Z T ij ) d N(, Σ ) =, ) 24 (A.12)

25 where R 4i (β, H, ) = R 1i (β, H, ) n i {H (Y ij )} ij. We will show that (A.12) entails h 1 =, h 2 = and h 3 = y adopting the ideas used in the proof of the identifiaility for Theorem 1. First, we let X ij and Z ij e fixed. Then for a fixed k such that 1 k n i, we define measures µ 1,... µ ni for any Borel set A [, τ], µ m ({} A) =, µ m ({1} A) = I( A), m k, on the set {, 1} [, τ] as follows: µ m ({} A) = I(τ A), µ m ({1} A) = I A dx, m > k. We integrate oth sides of (A.12) over {( i,1, Y 1 ),..., ( i,ni, Y i,ni )} with respect to the product measure dµ 1 dµ ni. In other words, we let im = 1 and Y im = for m k; we sum all the equalities of (A.12) for all possile cominations of { i,k+1,..., i,ni } {, 1} n i k, in which we choose Y im = τ if im = and integrate Y im from to τ if im = 1. The resulting integration is. We study the integral of each term on the left-hand side of (A.12) with respect to the measure n i m=1 µ m. For the first term on the left-hand side of (A.12), from the expression of R 4i (β, H, ), we have that for any, if j k, = X T ijh 1 R 4i (β, H, )X T m k = X T ijh 1 ijh 1 {H ()e XTimβ +Z Tim } δ im {,1},m>k m>k m k if j > k, it holds that = X T ijh 1 {H ()e XTimβ +Z Tim } ; R 4i (β, H, )X T m k 1 + (1 + ij ) d( (H (Y ij )e (XT ijβ +Z T ij ) + 1) n i m=1 µ m ) δ im (H (τ)e XT imβ +Z T im + 1) 1 δ im H (τ)e XT imβ +Z T im + 1 ijh 1 {H ()e XTimβ +Z Tim } δ im {,1},m>k,m j m>k,m j 1 + (1 + ij ) d( (H (Y ij )e (XT ijβ +Z T ij ) + 1) n i m=1 µ m ) δ im (H (τ)e XT imβ +Z T im + 1) 1 δ im H (τ)e XT imβ +Z T im

26 (1 δ 1 1 ij) 1 + δ ij {,1} (H (τ)e XT ijβ +Z T ij + 1) H (τ)e XT ijβ +Z T ij + 1 τ H +δ (t)e (XT ijβ +Z T ij ) 2 ij 1 + dt (H (t) + e (XT ijβ +Z T ij ) ) 2 H (t)e XT ijβ +Z T ij + 1 = X T ijh 1 {H ()e XTimβ } +Z Tim =. m k Therefore, δ ij {,1} n i = X T ijh 1 j k Likewise, = (1 δ 1 1 ij) 1 + (H (τ)e XT ijβ +Z T ij + 1) H (τ)e XT ijβ +Z T ij + 1 τ H +δ (t)e (XT ijβ +Z T ij ) 2 ij 1 + dt (H (t) + e (XT ijβ +Z T ij ) ) 2 H (t)e XT ijβ +Z T ij + 1 R 4i (β, H, ) X T ijh (1 + ij ) d (H (Y ij )e (XT ijβ +Z T ij ) N(, Σ )d( + 1) {H ()e XTimβ } +Z Tim d N(, Σ ). m k { 1 2 Σ 1 D(h 2 ) + 1 } 2 T Σ 1 D(h 2 )Σ 1 R 4i (β, H, )d N(, Σ )d( { 1 2 Σ 1 D(h 2 ) + 1 } 2 T Σ 1 D(h 2 )Σ 1 {H ()e XTimβ } +Z Tim d N(, Σ ). m k Furthermore, if j k, if j > k, = m k R 4i (β, H, ) ijh 3 (Y ij ) (1 + ij) Yij h 3 (y)dh (y) ni H (Y ij ) + e (XT ijβ +Z T ij ) d( µ m ) m=1 = h 3 () {H ()e XTimβ } +Z Tim ; m k R 4i (β, H, ) ijh 3 (Y ij ) (1 + ij) Yij h 3 (y)dh (y) ni H (Y ij ) + e (XT ijβ +Z T ij ) d( µ m ) m=1 {H ()e XTimβ } +Z Tim δ ij {,1} n i m=1 n i m=1 µ m ) µ m ) (A.13) (A.14) (1 δ 1 τ ij) h 3(y)dH (y) H (τ)e XT ijβ +Z T ij + 1 H (τ) + e (XT ijβ +Z T ij ) 26

27 Thus, τ H +δ (t)e (XT ijβ +Z T ij ) 2 t ij h 3 (t) h 3(y)dH (y) dt =. (H (t) + e (XT ijβ +Z T ij ) ) 2 H (t) + e (XT ijβ +Z T ij ) n i = h 3 () j k R 4i (β, H, ) ijh 3 (Y ij ) (1 + ij) Yij h 3 (y)dh (y) (H (Y ij ) + e (XT ijβ +Z T ij ) d N(, Σ )d( ) {H ()e XTimβ } +Z Tim d N(, Σ ). m k Comining (A.13), (A.14) and (A.15) and integrating over, we otain k X T ijh ( k ) T Z ij D(h2 ) ( k ) Z ij + kh3 () =. 2 Since the order for the suscripts j = 1,..., k is aritrary, it holds that k 2 j=k 1 +1 X T ijh ( k 2 j=k 1 +1 ) T Z ij D(h2 ) ( k 2 j=k 1 +1 Z ij ) + (k2 k 1 )h 3 () = n i m=1 µ m ) (A.15) for any 1 k 1 < k 2 n i. Thus, Z T ijd(h 2 )Z ij = for j j and X T ijh 1 +Z T ijd(h 2 )Z ij /2+h 3 () =. By condition C.3, D(h 2 ) =. As a result, h 2 = and h 1 =. In (A.13), we set Y ij =, j = 2,..., n i, and ij = 1, j = 1,..., n i, so as to otain h 3 (Y i1 ) = Yi1 2 h 3 (y)dh (y) i1β +Z T )+ n i e (XT i1 j=2 (XT ijβ +Z T ij ) e T Σ 1 /2 /(H (Y i1 ) + e (XT i1β +Z T i1 ) ) 3 d e (XT i1β +Z T i1 )+ n i j=2 (XT ijβ +Z T ij ) e T Σ 1 /2 /(H (Y i1 ) + e (XT i1β +Z T i1 ) ) 2 d That is, g(y) y h 3(t)dH (t) satisfies the homogeneous equation g (y) H (y) g(y) e (XT i1β +Z T i1 )+ n i j=2 (XT ijβ +Z T ij ) e T Σ 1 /2 /(H (Y i1 ) + e (XT i1β +Z T i1 ) ) 3 d e (XT i1β +Z T i1 )+ n i j=2 (XT ijβ +Z T ij ) e T Σ 1 /2 /(H (Y i1 ) + e (XT i1β +Z T i1 ) ) 2 d with oundary condition g() =. Thus, it is clear that g(y) =, i.e., h 3 (y) =. Hence, we have verified that Q is one-to-one map and have thus shown the invertiility of Ṡ(β, Σ, H ). The asymptotic distriution stated in Theorem 2 now follows from Theorem 2 of Murphy (1995). Furthermore, n Ṡ(β, Σ, H )( β n β, Σ n Σ, Ĥn H )[h 1, h 2, h 3 ] = n( β n β ) T Q 1 (h) + n( Σ n Σ ) T Q 2 (h) + τ n Q 3 (h)d(ĥn H ) = n(p n P)[h T 1 l β + h T 2 l Σ + l H [h 3 ]] + o p (1) (A.16) 27 =.

Tests of independence for censored bivariate failure time data

Tests of independence for censored bivariate failure time data Abstract Bivariate failure time data is widely used in survival analysis, for example, in twins study. This article presents a class of χ