MAXIMUM LIKELIHOOD METHOD FOR LINEAR TRANSFORMATION MODELS WITH COHORT SAMPLING DATA

Size: px

Start display at page:

Download "MAXIMUM LIKELIHOOD METHOD FOR LINEAR TRANSFORMATION MODELS WITH COHORT SAMPLING DATA"

Jasmin Agatha Perry
5 years ago
Views:

1 Statistica Sinica 25 (215), doi: MAXIMUM LIKELIHOOD METHOD FOR LINEAR TRANSFORMATION MODELS WITH COHORT SAMPLING DATA Yuan Yao Hong Kong Baptist University Abstract: Three widely used sampling designs the nested case-control, case-cohort, and classical case-control designs can be categorized as generalized case-cohort designs. Maximum likelihood methods are used to perform regression analysis of linear transformation models with these sampling designs, and the resulting estimator is proved to be consistent, asymptotically normal and semiparametrically efficient. Simulation studies and an application to the Stanford heart transplant data are presented. Key words and phrases: Linear transformation models, maximum likelihood estimation, missing at random, nested case-control sampling. 1. Introduction Cohort sampling is a popular methodology in epidemiological studies and clinical trials. The main advantage of such a design, compared with prospective studies, is that collecting covariate information is relatively quick and inexpensive. A number of publications have addressed the analysis of cohort designs, and in particular on nested case-control (n-c-c) designs (see, e.g., Breslow and Day (198), Langholz and Goldstein (1996), and Breslow (1996)). Thomas (1977) proposed a partial likelihood approach to estimating the regression parameter of the Cox model. Because of its partial likelihood nature, his estimation method possesses useful properties similar to those of Cox s partial likelihood estimation based on full cohort data. The estimator is easy to compute, and its variance estimator is simply the negative of the derivative of the log-partial likelihood. The asymptotic properties of Thomas estimator can be formally established using the counting process martingale theory of Goldstein and Langholz (1992). However, Thomas estimator is not semiparametrically efficient, and it is likely that other more efficient and easy-to-compute estimates exist. Samuelsen (1997) presented an entirely new estimator based on maximizing pseudo-likelihood. His key observation is that the conditional probability

2 1232 YUAN YAO of a censored subject s ever being selected as a control can be explicitly computed with an expression similar to that of the Kaplan Meier estimator. The resulting estimator and its inference are also very easy to compute. Empirical evidence shows that Samuelsen s estimator is sometimes better than Thomas. As Samuelsen (1997) has pointed out, however, the weighting techniques used to construct the pseudo-likelihood can be inefficient. Chen (21) proposed an estimation method based on local averaging that is essentially an alternative sample reuse method other than those of Thomas and Samuelsen. This method is applicable to so-called generalized case-cohort designs, which includes not only n-c-c, but case-control and case-cohort designs as well. Chen (21) discusses how the accuracy of this estimator is comparable to those of Thomas and Samuelsen, but the efficient estimator can be quite difficult to obtain, as it involves estimating and computing an integral operator. There are also some other methods in the literature. Almost all use Cox model. This puts serious limitations on practitioners because of the lack of statistical tools to analyze n-c-c or more general cohort designs. More general statistical methodologies for this purpose are badly needed. The transformation model, as a natural generalization of the Cox model, is an ideal choice. Typically a linear transformation model can be written as log H(T ) = βz + ϵ, where (T, Z) is the response-covariates pair, H is an unknown monotone function, ϵ is the unobserved random variable with known distribution and β is the unknown regression parameter of main interest. When ϵ follows the extreme value distribution or the standard logistic distribution, the model reduces to the Cox s proportional hazards model or the proportional odds model, respectively. Cheng, Wei, and Ying (1995) contains a very simple and elegant idea based on pairwise comparison of the survival times. This type of rank based method relies on the assumption that the censoring variables are independent of the covariates. In this paper, we propose an efficient estimation method for linear transformation models with cohort sampling designs. The rest of the paper is organized as follows. We describe the maximum likelihood-based estimation method and asymptotic properties in Section 2. Simulation studies and an application to data are given in Sections 3 and 4. Concluding remarks are given in Section 5. Technical details are provided in the Appendix. 2. Estimation and Inference Consider a cohort of n i.i.d. individuals. Let T i and C i denote the failure time and censoring time of the ith individual respectively. We assume that C i is independent of (T i, Z i ). We set

3 MAXIMUM LIKELIHOOD METHOD 1233 (i) the event time: Y i = min(t i, C i ); (ii) the failure/censoring index: δ i = I(T i C i ); (iii) the indicator of the ith individual being sampled for covariate ascertainment: i, with conditional probability π i = P ( i = 1 (Y j, δ j ), 1 j n). It is necessary in the inference procedure that ( 1,..., n ) are conditionally independent of (Z 1,..., Z n ), given (Y j, δ j ), 1 j n. In the linear transformation model log H(T ) = βz + ϵ, (2.1) it is supposed that β is the unknown parameter of main interest, H is an unknown increasing function, and ϵ is a continuous random variable whose hazard function λ ϵ is known. We take ϵ to be independent of Z and C. The likelihood function from (2.1) can then be written as n L(β, H, F ) = f(y i, δ i ) f( 1,..., n (Y j, δ j ), 1 j n) = i=1 n f(z i (Y j, δ j, j ), 1 j n) i i=1 n f(y i, δ i ) f( 1,..., n (Y j, δ j ), 1 j n) i=1 n f(z i Y i, δ i ) i, i=1 where the second equation comes from the conditional independence of ( 1,..., n ) and (Z 1,..., Z n ), and independence between individuals. The last term in the likelihood function can be calculated as ( ) f(z i Y i, δ i ) f(yi i, δ i Z i )f(z i ) i =. f(y i, δ i ) For simplicity, denote the hazard and cumulative hazard functions of exp(ϵ) as λ and Λ, and the probability density and distribution of covariate Z as f and F, respectively. The log-likelihood then takes a final form l n (β, H, F ) = n i=1 } i {δ i [βz i + log λ(h(y i )e βz i )] Λ(H(Y i )e βz i ) { +(1 i ) log +δ i log h(y i ) + i log f(z i ) } [e βz λ(h(y i )e βz )] δ i e Λ(H(Y i)e βz) f(z)dz

4 1234 YUAN YAO + log(λ C (Y i ) 1 δ i e Λ C(Y i ) ) + log f ( 1,..., n (Y j, δ j ), 1 j n), where h( ) is the derivative function of H( ). Using the method of discretization of H and F, let q j represent the size of the increment of H at the jth smallest observed failure times, say s j, j = 1,..., n 1, where n 1 = n i=1 δ i is the number of failures. Set n 1 n 1 H(t) = q j I(s j t), and h(t) = q j I(t = s j ). To estimate F, put probability mass on all known covariates, with f(z j ) = P (Z = Z j ) = p j satisfying n 2 p j = 1, where n 2 = n i=1 i is the number of individuals with known covariates. Then the integral over Z in the likelihood can be estimated as [e βz λ(h(y i )e βz )] δ i e Λ(H(Y i)e βz) f(z)dz n 2 = [e βz j λ(h(y i )e βz j )] δ i e Λ(H(Y i)e βzj ) p j. We show that maximizing the log-likelihood function over (β, q 1,..., q n1, p 1,..., p n2 ) leads to its being consistent, asymptotically normal, and semiparametrically efficient for β under certain regularity conditions. Regularity conditions and the proof of the theorems are in the Appendix. Theorem 1. Under the regularity conditions (C1) (C5), β ˆ n β, sup t [,τ] H ˆ n (t) H (t) and sup Z M F ˆ n (Z) F (Z) almost surely. To describe the variance estimation, let τ denote the duration of the study and suppose Z lies in a bounded set M. Let Q 1 = {p BV [, τ] : p 1} and Q 2 = {q BV [M] : q 1}, where BV [D] is the set of functions on D with bounded total variation. Then H ˆ n can be treated as a bounded linear functional in L (Q 1 ) as τ Hˆ n (p) = p(t)dh ˆ n (t), and similarly, Fn ˆ can be treated as a bounded linear functional in L (Q 2 ). Theorem 2. Under the conditions (C1) (C6), n( β ˆ n β, Hn ˆ H, F ˆ n F ) converges weakly to a zero-mean Gaussian process in the metric space R d L (Q 1 ) L (Q 2 ). The limiting covariance matrix of n( β ˆ n β ) attains the semiparametric efficiency bound.

5 MAXIMUM LIKELIHOOD METHOD 1235 Theorem 3. For any (b, p, q) V Q 1 Q 2, where V = {v R d : v 1}, the asymptotic variance for nv T ( ˆβ n β ) + τ n p(t)d [ Hn ˆ (t) H (t) ] + n q(z)d [ Fn ˆ (Z) F (Z) ] Z can be consistently estimated by (v T, p T, q T )In 1 (v T, p T, q T ) T, where ni n is the negative Hessian matrix of the log-likelihood function l n (β, H, F ) with respect to (β, q 1,..., q n1, p 1,..., p n2 ), and the vectors p = (p(s 1 ),..., p(s n1 )) T, q = (q(z 1 ),..., q(z n2 )) T. Taking p = and q =, the variance matrix of n( ˆβ n β ) can be estimated by the upper left d d matrix of In 1. Remark. The computation can be carried out in many scientific computing packages. For example, the algorithm of fmincon in the optimization toolbox of Matlab can be used to find a minimizer and calculate the Hessian matrix as well. Our simulation shows that this algorithm does well when handling moderate sample size like 2. For the initial value, one can use estimates from the Cox model or try different initial values to make the maximization guaranteed. 3. Simulation Study We took independent covariates Z 1 uniform distributed over [, 1] and Z 2 Bernoulli with success probability.5. The hazard function of the error term ϵ was chosen as exp(t) λ ϵ (t) = 1 + r exp(t), with r =, 1 and 2 (Dabrowska and Doksum (1988); Chen, Jin, and Ying (22)). Here the proportional hazards and proportional odds models correspond to r = and r = 1, respectively. For the transformation function in the model log H(t) = βz + ϵ, we take H(t) = t for r =, H(t) = exp(t) 1 for r = 1, and H(t) =.5 exp(2t).5 for r = 2. The censoring time C was uniform on [, c], where c was chosen such that the expected proportion of censoring was.7 and.8, respectively. The sample size n was set at 2 and all simulations were based on 5 replications. For the case-cohort design, we selected all failures and a subcohort of size 85 and 5 when the censoring rate was.7 and.8, respectively. For the classical case-control design, we selected all failures and a group of non-failures of the same size as the failures. For the nested case-control design, we selected all failures and two non-failures in each risk set of failure times. Table 1 summarizes the simulation results estimating β 1 and β 2. The true values were β 1 = 1 and β 2 = 1. Results include the mean of the bias (Bias) of the estimates, the sample standard deviations (SSD) of the estimates, the mean

6 1236 YUAN YAO Table 1. Summary of simulation results. Censoring rate=.7 Case-cohort design r Bias SSD ESE CP RE RE* β 1 β 2 β 1 β 2 β 1 β 2 β 1 β 2 β 1 β 2 β 1 β Classical case-control design r Bias SSD ESE CP RE RE* β 1 β 2 β 1 β 2 β 1 β 2 β 1 β 2 β 1 β 2 β 1 β Nested case-control design r Bias SSD ESE CP RE RE* β 1 β 2 β 1 β 2 β 1 β 2 β 1 β 2 β 1 β 2 β 1 β Censoring rate=.8 Case-cohort design r Bias SSD ESE CP RE RE* β 1 β 2 β 1 β 2 β 1 β 2 β 1 β 2 β 1 β 2 β 1 β Classical case-control design r Bias SSD ESE CP RE RE* β 1 β 2 β 1 β 2 β 1 β 2 β 1 β 2 β 1 β 2 β 1 β Nested case-control design r Bias SSD ESE CP RE RE* β 1 β 2 β 1 β 2 β 1 β 2 β 1 β 2 β 1 β 2 β 1 β of the estimated standard errors (ESE) of the estimates, and the 95% empirical coverage probabilities (CP) for β 1 and β 2 based on the asymptotic normality in Theorem 2. The proposed estimation procedures perform well in all cases. We compared our approach with that of Chen, Sun, and Tong (212), a

7 MAXIMUM LIKELIHOOD METHOD 1237 likelihood-based method involving inverse probability that can apply to different cohort sampling schemes. A series of simulation with the same settings as above was conducted and it found that the Chen et al. estimators less efficient in most of the senarios, especially when the censoring rate was large. This is mainly because the Chen et al. estimator does not involve the event times whose covariates are not sampled. This can be seen in Table 1 by comparing the relative efficiency of our estimators (RE) with that of the Chen et al. estimators (RE*), where the relative efficiency was computed by comparing the sample variance of an estimator to the sample variance of the full-cohort estimator. 4. Application The Stanford heart transplant data, consisting of censored or uncensored survival times in February 198 of 184 patients who had received heart transplants, was reported in Miller and Halpern (1982). In the data set, the patients T5 mismatch score is a measure of the degree of tissue incompatability between the initial donor and the recipient with respect to HLA antigens. The goal of this study was to analyze the relationship among the survival time, T5 mismatch score and the age of the patients. We compared the results using the full cohort data with those from different sampling schemes. As in Miller and Halpern (1982) and Jin, Lin, and Ying (26), only the 157 patients with complete records were used. Following Miller and Halpern (1982), in Model 1 we regressed the logarithm of the survival time against the ages and T5 mismatch scores for the 157 patients. Then the T5 mismatch score was deleted due to its nonsignificance in Model 1 and a quadratic age model was tried to achieve better fit. In Model 2 we regressed the logarithm of the survival time against the age and age 2 for the 152 patients whose survival times were not less than 1 days. The sampling designs for both models were set as follows: For the case-cohort design we selected all failures and a subcohort of size 75 among the whole set of subjects. For the classical case-control design we selected all failures and 25 non-failures. For the nested case-control design we selected all failures and one non-failure in each risk set. Table 2 shows the average of the estimates of parameters with standard errors and p-values based on 4 replications. The proposed method under all the sampling designs led to the same conclusion as using the full data set. For comparison, we also present the estimates from Chen, Sun, and Tong (212) in the table, from which one can draw similar conclusions in general. It can also be seen from the comparison of standard error that the proposed method has the advantage of being asymptotically efficient.

8 1238 YUAN YAO Full data Table 2. Comparison of parameter estimation (estimated standard error, p-value) for the Stanford heart transplant data. Model 1 Age T5.295 (.115,.56).1692 (.211,.212) Model 2 Age Age (.528,.33).23 (.7,.4) Model 1 Design Proposed Chen Age T5 Age T5 C-C.281 (.179,.4).1175 (.2817,.3368).292 (.182,.677).1621 (.2756,.296) C-C-C.275 (.186,.459).1145 (.2882,.346).295 (.178,.638).1636 (.2757,.2947) N-C-C.263 (.146,.31).1825 (.2426,.2263).291 (.174,.542).1691 (.2565,.2693) Model 2 Design Proposed Chen Age Age 2 Age Age 2 C-C (.593,.294).23 (.8,.18) (.83,.599).24 (.1,.317) C-C-C (.652,.32).23 (.8,.125) (.85,.614).24 (.11,.337) N-C-C (.552,.15).23 (.7,.65) (.697,.355).24 (.9,.136) Note: Proposed represents the proposed method; Chen represents the Chen, Sun, and Tong (212) method; C-C represents case-cohort sampling; C-C-C represents classical case-control sampling; N-C-C represents nested case-control sampling. 5. Concluding Remarks This paper presents an efficient estimation technique for a broad class of sampling designs using linear transformation models. The computation procedure is based on the maximization of the discretized likelihood function. The variance estimation is obtained from the inverse of the negative Hessian matrix. We prove that such estimation method attains the semiparametric efficiency bound. The performance of the estimator is illustrated with simulation studies and a Stanford heart transplant data study. Further work will consider extending this method to partially linear transformation models, where kernel smoothing could be applied. Acknowledgement The author is grateful to the referee and an associate editor for their constructive comments which greatly improved this manuscript. The author also deeply thanks Professor Kani Chen for his guidance during the study. Appendix: Regularity conditions and proofs of theorems Some regularity conditions are needed:

9 MAXIMUM LIKELIHOOD METHOD 1239 (C1) The function H (t) and the distribution F (t) are strictly increasing and differentiable, with derivatives h(t) and f(t) that are absolutely continuous, and β lies in the interior of a known compact set in the domain of B. (C2) The covariate Z is bounded, and P (C τ) > δ > for some constant δ. (C3) λ(t) >, and λ is twice continuously differentiable. (C4) lim sup Λ(C x) 1 log(x sup y x λ(y)) = holds for every C >. x (C5) (First Identifiability) Let If Ψ i (β, H, F ) = {[e } βz i λ(h(y i )e βz i )] δ i e Λ(H(Y i)e βz i) i { } [e βz λ(h(y i )e βz )] δ i e Λ(H(Y i)e βz) (1 i ) f(z)dz. Ψ i (β, H, F )h (Y i ) δ i f (Z i ) i = Ψ i (β, H, F )h (Y i ) δ i f (Z i ) i, almost surely, then β = β, H = H and F = F. (C6) (Second Identifiability) If [ ] [ ] v T l β (β, H, F ) + l H (β, H, F ) pdh + l F (β, H, F ) qdf = almost surely for some v R d, p BV [, τ] and q BV [M], then (v, p, q) =, where v T l β, l H [g 1 ] and l F [g 2 ] denote the partial derivatives of l along direction of v, g 1, and g 2, respectively. Remark. Conditions (C1) (C3) assume certain smoothness and identifiability of the model, which are standard requirements in censored data analysis. (C4) is a technical condition on the structure of the model that is used in the proof of consistency. (C5) is the usual parameter identifiability condition. (C6) ensures that the Fisher information is non-singular. Proof of Theorem 1. The jump size of Hn ˆ must be finite, otherwise the loglikelihood would diverge to by (C4). Ĥ n is also bounded almost surely, otherwise if a new estimator H n = Ĥn/Ĥn(τ) were considered, it would contradict the maximum property of Ĥ n. Since Ĥn is uniformly bounded and monotone, Helly s Selection Theorem requires that for any subsequence of {Ĥn}, there is a further subsequence which converges to some monotone function H pointwise. Without loss of generality, assume that ˆF n converges to F and ˆβ n converges

10 124 YUAN YAO to β for the same subsequence. Then consistency is proved if we can show that H = H, F = F, and β = β with probability one. Furthermore, the continuity of H and F ensures that the convergence is uniform in t and Z. Take the derivative of the log-likelihood with respect to H along H + ϵi( Y i ), and let it be zero. Then Hence Ĥ n (t) = n i=1 t ĥ(y i ) = δ i n Ψ jh ( ˆβ n, Ĥn, ˆF. n )[I( Y i )] h(u)dn i (u) = Ψ j ( ˆβ n, Ĥn, ˆF n ) n i=1 t dn i (u) n Ψ jh ( ˆβ n, Ĥn, ˆF, n )[I( u)] Ψ j ( ˆβ n, Ĥn, ˆF n ) where we use N i (u) to denote the counting process of Y i. Now define n t dn i (u) H n (t) =. i=1 n Ψ jh (β, H, F )[I( u)] Ψ j (β, H, F ) The Glivenko-Cantelli Theorem leads to lim be written as Ĥ n (t) = t 1 n 1 n n H n (t) = H (t). Here Ĥn(t) can n Ψ jh (β, H, F )[I( u)] Ψ j (β, H, F ) n Ψ jh ( ˆβ n, Ĥn, ˆF d H n (u). n )[I( u)] Ψ j ( ˆβ n, Ĥn, ˆF n ) (A.1) It is now possible to show that H is continuously differentiable. For simplicity, define E (Ψ jh (β, H, F )[I( u)]/ψ j (β, H, F )) as g 1j (u) and E (Ψ jh (β, H, F )[I( u)]/ψ j (β, H, F )) as g 2j (u). It can be shown that they are both bounded away from zero. Taking the limit of (A.1), we get H (t) = t g 1j (u) g 2j (u) dh (u). We have now shown that H (t) is absolutely continuous with respect to H (t). By assumption, H (t) is continuously differentiable, and so is H (t). Then dh lim ˆ n (t) n dh n (t) = h (t) h (t)

11 MAXIMUM LIKELIHOOD METHOD 1241 uniformly in t [, τ], where h is the derivative of H. Repeating the same process with F similarly yields d lim ˆF n (Z) n d F n (Z) = f (Z) f (Z) uniformly. It follows from the inequality l n ( ˆβ n, Ĥn, ˆF n ) l n (β, H n, F n ) that 1 n log Ψ j( ˆβ n, Ĥn, ˆF n ) n Ψ j (β, H n, F n ) + 1 n δ j log dĥn(y j ) n d H n (Y j ) + 1 n j log d ˆF n (Z j ) n d F n (Z j ). Let n, then E ( log Ψ j(β, H, F )h (Y j ) δ j f (Z j ) ) j Ψ j (β, H, F )h (Y j ) δ jf (Z j ). j The left-hand side is the negative Kullback-Leibler distance, therefore condition C5 requires that β = β, H = H and F = F with probability one. Proof of Theorem 2. This proof is based on the argument on maximum likelihood estimators of Van der Vaart (1998, pp ). Let L(β, H, F ) = log Ψ j + δ j log h(y j ) + j log f(z j ), [ Φ n (β, H, F ) = P n {v ] [ ]} T L β + L H pdh + L F qdf, { [ ] [ ]} Φ(β, H, F ) = P v T L β + L H pdh + L F qdf, where we use v T L β, L H [g 1 ], and L F [g 2 ] to denote the partial derivatives of L along direction of β + ϵv, H + ϵg 1 and F + ϵg 2 respectively as above, and let P n denote the empirical measure based on n i.i.d. observations with P as its expectation. For any δ >, when n large enough, } {(β, H, F ) : β β + H H + F F < δ ( ˆβ n, Ĥn, ˆF n ) N = almost surely. By the Donsker theorem, n(φn Φ)( ˆβ n, Ĥn, ˆF n ) n(φ n Φ)(β, H, F ) = o p (1). Direct calculations show [ n(pn P){ ] v T L β (β, H, F ) + L H (β, H, F ) pdh

12 1242 YUAN YAO [ ]} +L F (β, H, F ) qdf = { n(p n P) v T L β ( ˆβ n, Ĥn, ˆF n ) + L H ( ˆβ n, Ĥn, ˆF n )[ ] pdĥn +L F ( ˆβ n, Ĥn, ˆF [ n ) qd ˆF ]} n + o p (1) = { np v T L β ( ˆβ n, Ĥn, ˆF n ) v T L β (β, H, F ) + L H ( ˆβ n, Ĥn, ˆF n )[ ] pdĥn [ ] L H (β, H, F ) pdh + L F ( ˆβ n, Ĥn, ˆF [ n ) qd ˆF ] n [ ]} L F (β, H, F ) qdf + o p (1) = { v T Ψ jβ ( np ˆβ n, Ĥn, ˆF Ψ n ) Ψ j ( ˆβ n, Ĥn, ˆF vt Ψ jβ (β, H, F ) jh ( ˆβ n, Ĥn, ˆF n )[ pdĥ] + n ) Ψ j (β, H, F ) Ψ j ( ˆβ n, Ĥn, ˆF n ) Ψ jh (β, H, F )[ pdh ] Ψ jf ( ˆβ n, Ĥn, ˆF n )[ qd ˆF ] + Ψ j (β, H, F ) Ψ j ( ˆβ n, Ĥn, ˆF n ) Ψ jf (β, H, F )[ qdf ] } + o p (1). (A.2) Ψ j (β, H, F ) For the continuous linear functional from BV [, τ] to R: ( ) ΨjH [p] p P (β, H, F ), Ψ j by Theorem 4.2 in Edwards and Wayment (1971), there exists a bounded function η H such that ( ) ΨjH [p] τ P (β, H, F ) = η H (t; β, H, F )dp(t). Ψ j To be specific, let p(s) = I(s t), so that ( ) ΨjH [I( t)] η H (t; β, H, F ) = P (β, H, F ). Ψ j Analogously, we define η F (z; β, H, F ) = P and second order partial derivatives functions: ( ) ΨjF [I( z)] (β, H, F ), Ψ j

13 MAXIMUM LIKELIHOOD METHOD 1243 We also set η Hβ (t; β, H, F ) = β η H(t; β, H, F ), η F β (t; β, H, F ) = β η F (t; β, H, F ), η HH (s, t; β, H, F ) = H η H(s; β, H, F )[I( t)], η HF (s, z; β, H, F ) = F η H(s; β, H, F )[I( z)], η F H (x, t; β, H, F ) = H η F (x; β, H, F )[I( t)], η F F (x, z; β, H, F ) = F η F (x; β, H, F )[I( z)]. ζ β (β, H, F ) = β P ζ H (t; β, H, F ) = H P ζ F (z; β, H, F ) = F P ( Ψjβ Ψ j ( Ψjβ Ψ j ( Ψjβ Ψ j ) (β, H, F ), ) (β, H, F )[I( t)], ) (β, H, F )[I( z)]. The right-hand side of (A.2) is now = ( n B 1 [v, p, q] T ( β ˆ n β )+ B 21 [v, p, q]d(ĥn H )+ n +o( ˆβn β + n Ĥn H + n ˆF ) n F, B 22 [v, p, q]d( ˆF ) n F ) with linear operators B 1, B 21, and B 22 defined as τ B 1 [v, p, q] = v T ζ β (β, H, F ) + η Hβ (t; β, H, F )p(t)dh (t) + η F β (z; β, H, F )q(z)df (z); M B 21 [v, p, q] = v T ζ H (t; β, H, F ) + η H (t; β, H, F )p(t) τ + + M η HH (s, t; β, H, F )p(s)dh (s) η F H (x, t; β, H, F )q(x)df (x); B 22 [v, p, q] = v T ζ F (z; β, H, F ) + η F (z; β, H, F )q(z) τ + + M η HF (s, z; β, H, F )p(s)dh (s) η F F (x, z; β, H, F )q(x)df (x).

14 1244 YUAN YAO The operator [B 1, B 21, B 22 ] can then be written as B 1 [v, p, q] T ṽ + B 21 [v, p, q] pdh + B 22 [v, p, q] qdf = d { P v T L dϵ β (β + ϵṽ, H + ϵ pdh, F + ϵ qdf ) ϵ= [ ] +L H (β + ϵṽ, H + ϵ pdh, F + ϵ qdf ) pdh [ ]} +L F (β + ϵṽ, H + ϵ pdh, F + ϵ qdf ) qdf. We now show that (B 1, B 21, B 22 ) is continuously invertible. By the Open Mapping Theorem, we need only prove that the linear operator is one-to-one. To show (B 1, B 21, B 22 ) is injective, suppose that B 1 [v, p, q] =, B 21 [v, p, q] = and B 22 [v, p, q] =. Then = B 1 [v, p, q] T v + B 21 [v, p, q]pdh + B 22 [v, p, q]qdf, where the right-hand side is the derivative of { [ ] P v T L β (β, H, F ) + L H (β, H, F ) pdh [ ]} + L F (β, H, F ) qdf along the path (β + ϵv, H + ϵ pdh, F + ϵ qdf ). This implies that the information along this path is zero, hence [ ] [ ] v T L β (β, H, F ) + L H (β, H, F ) pdh + L F (β, H, F ) qdf = almost surely. By the identifiability condition (C6), we get (v, p, q) =. To show (B 1, B 21, B 22 ) is surjective, write (B 1, B 21, B 22 )[v, p, q] = I 1 [v, p, q] + I 2 [v, p, q], where I 1 [v, p, q] is defined as v η H (t; β, H, F )p(t) η F (z; β, H, F )q(z) and I 2 [v, p, q] is defined as v T ζ β (β, H, F ) + τ η Hβ(t; β, H, F )p(t)dh (t) + M η F β(z; β, H, F )q(z)df (z) v v T ζ H (t; β, H, F ) + τ η HH(s, t; β, H, F )p(s)dh (s) + M η F H(x, t; β, H, F )q(x)df (x). v T ζ F (z; β, H, F ) + τ η HF (s, z; β, H, F )p(s)dh (s) + M η F F (x, z; β, H, F )q(x)df (x)

15 MAXIMUM LIKELIHOOD METHOD 1245 Since η H (t; β, H, F ) = P(L H (β, H, F )[I( t)]), the score function along the direction of H +ϵi( t), η H (t; β, H, F ) is negative and continuous for all t. Similarly, η F (z; β, H, F ) is the score function along the direction F +ϵi( z), hence η F (z; β, H, F ) is negative and continuous for all z. Thus I 1 [v, p, q] is a bijective continuous operator, and is continuously invertible. By the smoothness conditions (C1) and (C3), the image of I 2 is a set of uniformly bounded and equi-continuous functions, hence I 2 is a compact operator by the Arzelá-Ascoli theorem. Now we can write (B 1, B 21, B 22 ) as I 1 + I 2 = I 1 (I + I1 1 I 2), where I is identity mapping. It is clear that I1 1 I 2 is a compact operator by the continuity of I1 1. Then I + I 1 1 I 2 is a Fredholm operator. Now since ker(i + I1 1 I 2) = ker(i 1 + I 2 ) =, the Fredholm operator theory gives that I + I1 1 I 2 is surjective, and so is I 1 + I 2. The continuous invertibility of (B 1, B 21, B 22 ) has been proved. Now with (ṽ, p, q) = (B 1, B 21, B 22 ) 1 (v, p, q), from (A.2) we have { n v T ( ˆβ n β ) + pd(ĥn H ) + qd( ˆF } n F ) = { [ ] n(p n P) ṽ T L β (β, H, F ) + L H (β, H, F ) pdh (A.3) [ ]} n +L F (β, H, F ) qdf +o( ˆβn β + n Ĥn H + n ˆF ) n F. The first term of the right side of (A.3) converges in distribution to a zero-mean Gaussian process in the metric space R d L (Q 1 ) L (Q 2 ). By Slutsky s theorem, we now need only show that n ˆβn β + n Ĥn H + n ˆF n F = O p (1). (A.4) By definition, ˆβ n β + Ĥn H + ˆF n F = sup vt ( ˆβ n β ) + (v,p,q) V Q 1 Q 2 τ pd(ĥn H ) + Z qd( ˆF n F ), so (A.3) gives n ˆβn β + n Ĥn H + n ˆF n F n = O p (1) + o( ˆβn β + n Ĥn H + n ˆF ) n F, therefore (A.4) holds immediately. Now we have { n v T ( ˆβ n β ) + pd(ĥn H ) + qd( ˆF } n F )

16 1246 YUAN YAO = { [ ] n(p n P) ṽ T L β (β, H, F ) + L H (β, H, F ) pdh [ ]} +L F (β, H, F ) qdf + o p (1). (A.5) Since the right side of (A.5) converges to normal by the Central Limit Theorem, we have proved that n( ˆβ n β, Ĥn H, ˆF n F ) converges weakly to a zeromean Gaussian process. If p = q = in equality (A.5), then the estimate ˆβ n is an asymptotically linear estimator with influence function ṽ T L β (β, H, F ) + L H (β, H, F )[ pdh ] + L F (β, H, F )[ qdf ], which lies in the linear space spanned by the score functions: {ṽ T L β + L H [ pdh] + L F [ qdf ] : ṽ R d, p Q 1, q Q 2 }. By Proposition 1 in Bickel et al. (1993), the estimate ˆβ n is semiparametrically efficient. Proof of Theorem 3. Let H n (t) be a step function with a jump size h n (s i ) = H (s i ) max sj <s i H (s j ) at each failure time s i, and F n (z) be a step function with a jump size f n (Z i ) = F (Z i ) max Zj1 Z i,...,z jd Z i F (Z j1 (1),..., Z jd (d)) at each observed covariate Z i, where Z k (m) is the value on the mth dimension of the covariate Z k, and the order Z k < Z i is taken as Z k (m) < Z i (m) for all 1 m d. Then H n (s i ) = H (s i ) and F n (Z i ) = F (Z i ) holds for every s i and Z i. Choose v 1 R d and bounded variation functions p 1 Q 1, q 1 Q 2 such that v 1 p1 dĥn q1 d ˆF n = I 1 n v p, q where I n, p, and q are as defined in Theorem 3 and p 1 dĥn = (p 1 (s 1 )ĥn(s 1 ),..., p 1 (s n1 )ĥn(s n1 )) T, q 1 d ˆF n = (q 1 (Z 1 ) ˆf n (Z 1 ),..., q 1 (Z n2 ) ˆf n (Z n2 )) T. Let and ĥ n h = (ĥn(s 1 ) h(s 1 ),..., ĥn(s n1 ) h(s n1 )) T ˆf n f = ( ˆf n (Z 1 ) f(z 1 ),..., ˆf n (Z n2 ) f(z n2 )) T, then we have nv T ( ˆβ n β ) + n 1 n p(s i )(ĥn(s i ) h(s i )) + n 2 n q(z i )( ˆf n (Z i ) f(z i )) i= i=

17 = n = n ˆβ n β ĥ n h ˆf n f ˆβ n β ĥ n h ˆf n f T T MAXIMUM LIKELIHOOD METHOD 1247 v p q I n v 1 p1 dĥn q1 d ˆF n = L ββ L βh L βf v np n L Hβ L HH L HF ( ˆβ n, Ĥn, ˆF 1 ˆβ n β n ) p1 dĥn, L F β L F H L F F q1 d ˆF Ĥ n H n n ˆF n F n = L ββ L βh L βf np L Hβ L HH L HF v 1 ˆβ n β (β, H, F ) p1 dh, Ĥ n H + o p (1) L F β L F H L F F q1 df ˆF n F = { [ ] n(p n P) v T 1 L β (β, H, F ) + L H (β, H, F ) p 1 dh [ ]} +L F (β, H, F ) q 1 df + o p (1), where the last equality is from the proof of Theorem 2. Note that = { [ ] n(p n P) v T 1 L β (β, H, F ) + L H (β, H, F ) p 1 dh [ ]} +L F (β, H, F ) q 1 df = { n(p n P) v T 1 L β (β, H, F ) + L H (β, H, F )[ ] p 1 dĥ +L F (β, H, F )[ q 1 d ˆF ]} + o p (1), and the variance of the right side is consistently estimated by L ββ L βh L βf v 1 v 1 P L Hβ L HH L HF (β, H, F ) p1 dĥn, p1 dĥn. L F β L F H L F F q1 d ˆF n q1 d ˆF n This can be further estimated by L ββ L βh L βf v 1 v P n L Hβ L HH L HF ( ˆβ n, Ĥn, ˆF 1 n ) p1 dĥn, p1 dĥn L F β L F H L F F q1 d ˆF n q1 d ˆF n

18 1248 YUAN YAO v 1 = p1 dĥn q1 d ˆF n T I n v 1 p1 dĥn q1 d ˆF n = (v T, p T, q T )In 1 I n In 1 (v T, p T, q T ) T = (v T, p T, q T )In 1 (v T, p T, q T ) T. The proof of Theorem 3 is complete. References Bickel, P. J., Klaassen, C. A. J., Ritov, Y. and Wellner, J. A. (1993). Efficient and Adaptive Estimation for Semiparametric Models. John Hopkins University Press, Baltimore. Breslow, N. E. (1996). Statistics in epidemiology: The case-control studies. J. Amer. Statist. Assoc. 91, Breslow, N. E. and Day, N. E. (198). Statistical methods in cancer research. Vol.I, The Design and Analysis of Case-Control Studies. International Agency for Research on Cancer, Lyon. Chen, K. (21). Generalized case-cohort sampling. J. Roy. Statist. Soc. Ser. B 63, Chen, K., Jin, Z. and Ying, Z. (22). Semiparametric analysis of transformation models with censored data. Biometrika 89, Chen, K., Sun, L. and Tong, X. (212). Analysis of cohort survival data with transformation model. Statist. Sinica 22, Cheng, S. C., Wei, L. J. and Ying, Z. (1995). Analysis of transformation models with censored data. Biometrika 82, Dabrowska, D. M. and Doksum, K. A. (1988). Estimation and testing in the two-sample generalized odds rate model. J. Amer. Statist. Assoc. 83, Edwards, J. R. and Wayment, S. G. (1971). Representations for transformations continuous in the BV norm. Trans. Amer. Math. Soc. 154, Goldstein, L. and Langholz, B. (1992). Asymptotic theory for nested case-control sampling in the Cox regression model. Ann. Statist. 2, Jin, Z., Lin, D. Y. and Ying, Z. (26). On least-squares regression with censored data. Biometrika 93, Langholz, B. and Goldstein, L. (1996). Risk set sampling in epidemiologic cohort studies. Statist. Sci. 11, Miller, R. G. and Halpern, J. (1982). Regression with censored data. Biometrika 69, Samuelsen, S. (1997). A pseudo-likelihood approach to analysis of nested case-control data. Biometrika 84, Thomas, D. C. (1977). Appendum to Methods of cohort analysis: appraisal by application to asbestos mining, by Liddell, F. D. K., McDonald, J. C., and Thomas, D. C. J. Roy. Statist. Soc. Ser. A 14, Van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press, Cambridge, U.K. Department of Mathematics, Hong Kong Baptist University, Kowloon Tong, Kowloon, Hong Kong. yaoyuan@hkbu.edu.hk (Received August 211; accepted August 214)

END-POINT SAMPLING. Yuan Yao, Wen Yu and Kani Chen. Hong Kong Baptist University, Fudan University and Hong Kong University of Science and Technology

END-POINT SAMPLING. Yuan Yao, Wen Yu and Kani Chen. Hong Kong Baptist University, Fudan University and Hong Kong University of Science and Technology Statistica Sinica 27 (2017), 000-000 415-435 doi:http://dx.doi.org/10.5705/ss.202015.0294 END-POINT SAMPLING Yuan Yao, Wen Yu and Kani Chen Hong Kong Baptist University, Fudan University and Hong Kong