Semiparametric Additive Transformation Model under Current Status Data

Guang Cheng and Xiao Wang

Purdue University

Abstract

We consider the efficient estimation of the semiparametric additive transformation model with current status data. A wide range of survival models and econometric models can be incorporated into this general transformation framework. We apply the B-spline approach to simultaneously estimate the linear regression vector, the nondecreasing transformation function, and a set of nonparametric regression functions. We show that the parametric estimate is semiparametric efficient in the presence of multiple nonparametric nuisance functions. An explicit consistent B-spline estimate of the asymptotic variance is also provided. All nonparametric estimates are smooth, and are shown to be uniformly consistent with faster than cube-root rates of convergence. Interestingly, we observe a convergence-rate interference phenomenon: the convergence rates of the B-spline estimators are all slowed down to equal the slowest one. Constrained optimization is not required in our implementation. Numerical results are used to illustrate the finite-sample performance of the proposed estimators.

Key Words: B-spline; Consistent variance estimation; Current status data; Efficient estimation; Semiparametric transformation models

1 Introduction

We consider the efficient estimation of the following semiparametric additive transformation model:

$H(U) = Z^\top \beta + \sum_{j=1}^{d} h_j(W_j) + \epsilon,$    (1)

(Corresponding Author: Department of Statistics, Purdue University, West Lafayette, IN 47907, chengg@purdue.edu. Department of Statistics, Purdue University, West Lafayette, IN 47907, wangxiao@purdue.edu.)

where $H(\cdot)$ is a monotone transformation function, the $h_j(\cdot)$'s are smooth regression functions (with possibly different degrees of smoothness), and $\epsilon$ has a known distribution $F(\cdot)$ with support $\mathbb{R}$. A wide range of survival models and econometric models can be incorporated into the above general transformation framework, e.g., Huang & Rossini (1997); Shen (1998); Huang (1999); Banerjee et al. (2006, 2009). In particular, the model (1) can be readily applied to a failure time $T$ by letting $U = \log T$. We can obtain the partly linear additive Cox model, i.e., Huang (1999), by assuming $F(s) = 1 - \exp(-e^s)$ and $H(u) = \log A(e^u)$, where $A$ is an unspecified cumulative hazard function. Specifically, the hazard function of $T$, given the covariates $(z, w)$, has the form

$\lambda(t \mid z, w) = a(t) \exp\Big\{\beta^\top z + \sum_{j=1}^{d} h_j(w_j)\Big\},$    (2)

where $a(t)$ is the baseline hazard function, and the regression coefficient and additive functions appearing in (2) are the negatives of $\beta$ and the $h_j$'s in (1). However, if we change the form of $F(s)$ to $e^s/(1+e^s)$, the model (1) becomes the partly linear additive proportional odds model.

Motivated by the close connection with survival models, we focus in this paper on current status data, which arise not only in survival analysis but also in demography, epidemiology, econometrics and bioassay. More specifically, we observe $X = (V, \Delta, Z, W)$, where $V \in \mathbb{R}$ is a random examination time and $\Delta = 1\{U \le V\}$. We assume that $U$ and $V$ are independent given $(Z, W)$. Under current status data, the model (1) is also related to the semiparametric binary model studied in econometrics. Using the link function $F(\cdot)$, we assume that the probability of $\Delta = 1$, given the covariates $(Z, W, V)$, has the expression

$P(\Delta = 1 \mid Z, W, V) = F\Big(\beta^\top Z + \sum_{j=1}^{d} h_j(W_j) + H(V)\Big).$    (3)

Note that Banerjee et al. (2006) and Banerjee et al. (2009) have done a great deal of statistical estimation and hypothesis testing on the model (3) (without the $h_j$ terms), assuming $F(\cdot)$ to be the complementary log-log function and the logistic function, respectively. An extensive discussion of the relation between (3) and survival models can be found in Doksum & Gasko (1990). Recently, a similar transformation model has been considered by Chen & Tong (2010), but for right censored data. They showed that the monotone transformation function is root-$n$ estimable there, which can never be achieved in the case of current status data. This is the key theoretical difference between the two types of survival data.
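To make the observation scheme concrete, the following minimal Python sketch (our illustration, not code from the paper) generates a single current status observation $X = (V, \Delta, Z, W)$ from model (1); the particular choices of $H$, $h$, $\beta$ and the logistic error distribution are assumptions made purely for illustration.

```python
# A minimal sketch (assumed quantities, not the authors' code) of how a current
# status observation X = (V, Delta, Z, W) arises from model (1): the event time
# U is never observed directly, only the indicator Delta = 1{U <= V}.
import numpy as np

rng = np.random.default_rng(0)

beta = np.array([0.5, -0.3])               # assumed regression vector
H = np.sinh                                 # an assumed strictly increasing H
H_inv = np.arcsinh                          # its inverse
h = lambda w: np.sin(np.pi * w) / 2         # an assumed smooth additive function
F = lambda s: 1.0 / (1.0 + np.exp(-s))      # logistic error distribution

Z = rng.uniform(0.0, 1.0, size=2)
W = rng.uniform(0.0, 1.0)
eps = rng.logistic()
U = H_inv(Z @ beta + h(W) + eps)            # model (1): H(U) = Z'beta + h(W) + eps
V = rng.uniform(0.5, 2.0)                   # random examination time
Delta = int(U <= V)                         # current status indicator 1{U <= V}
# Under (1), P(Delta = 1 | Z, W, V) = F(H(V) - Z'beta - h(W)); flipping the
# signs of beta and h, as done in Section 2.2, gives exactly the form in (3).
print(Delta, F(H(V) - Z @ beta - h(W)))
```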

In this paper, we employ the B-spline approach to simultaneously estimate the vector $\beta$, the monotone $H$ and the smooth $h_j$'s. The corresponding estimates are denoted $\hat\beta$, $\hat H$ and $\hat h_j$. In contrast, Ma & Kosorok (2005a) apply the penalized NPMLE approach to (1) (with $d = 1$), which yields a non-smooth step function $\check H$ and a penalized estimate $\check h$. Our B-spline framework has the following theoretical and computational advantages over the existing penalized NPMLE approach:

1. Our B-spline estimate $\hat H$ is smooth and uniformly consistent. However, $\check H$ is always discontinuous (regardless of the smoothness of the true function $H_0$) and has a bias which does not vanish asymptotically; see Page 2258 of Ma & Kosorok (2005a). We can also obtain a faster rate of convergence for $\hat H$ than that for $\check H$, i.e., $O_P(n^{-1/3})$, by using the B-spline estimation approach. Therefore, we expect more accurate inferences drawn from $\hat H$.

2. We are able to give an explicit B-spline estimate for the asymptotic variance of $\hat\beta$, based on which the asymptotic confidence interval for $\beta$ can be easily constructed. Under very weak conditions, its consistency is proven. However, the block jackknife approach in Ma & Kosorok (2005a) requires more computation, and is not even theoretically justified.

3. Our spline estimation algorithm requires much less computation than the isotonic type algorithm used in Ma & Kosorok (2005a), since the number of jumps in the step function is typically much larger than the number of knots we choose for estimating $H$ and the $h_j$'s.

In contrast with Huang (1999), we deal with current status data rather than right censored data, and thus we also need to estimate the monotone transformation function, which has been profiled out in their partial likelihood framework. Despite the non-root-$n$ convergence rates of $\hat H$ and the $\hat h_j$'s, we are able to show that $\hat\beta$ is root-$n$ consistent, asymptotically normal and semiparametric efficient. We derive the efficient information bound by taking the general two-stage projection approach from Sasieni (1992), which is needed due to the involvement of multiple nonparametric functions in semiparametric models. Interestingly, we observe a convergence-rate interference phenomenon for the B-spline estimators, i.e., the convergence rates of the nonparametric estimators are all slowed down to equal the slowest one. Moreover, by approximating $\log \dot H$ with a B-spline, we avoid the monotonicity constraint in the implementation, which is usually required in the literature, e.g., Zhang et al. (2010).

The remainder of the paper is organized as follows. Section 2 describes the B-spline estimation procedure. The asymptotic properties such as consistency and convergence rates of the estimates are obtained in Section 3. The asymptotic distribution of the parametric component is studied in Section 4, and its efficient information and the corresponding explicit B-spline estimate are given in Section 5. Numerical results, including simulation studies and a real data application, are presented in Section 6. We close with an appendix containing technical details.

2 Semiparametric B-spline Estimation

2.1 Assumptions

We first define some notation. For any vector $v$, $v^{\otimes 2} = v v^\top$. The notations $\gtrsim$ and $\lesssim$ mean greater than, or smaller than, up to a universal constant. We write $A_n \asymp B_n$ if $A_n \lesssim B_n$ and $A_n \gtrsim B_n$. The notations $\mathbb{P}_n$ and $\mathbb{G}_n$ are used for the empirical distribution and the empirical process of the observations, respectively. Furthermore, we use the operator notation for evaluating expectations. Thus, for every measurable function $f$ and true probability $P$,

$\mathbb{P}_n f = \frac{1}{n}\sum_{i=1}^{n} f(X_i), \qquad P f = \int f \, dP \qquad \text{and} \qquad \mathbb{G}_n f = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} \{f(X_i) - P f\}.$

We next present some model assumptions.

M1. $U$ and $V$ are independent given $(Z, W)$.

M2. (a) The covariates $(Z, W)$ are assumed to belong to a bounded subset of $\mathbb{R}^{l+d}$, say $[0,1]^l \times [0,1]^d$. The support of $V$ is $[l_v, u_v]$, where $0 < l_v < u_v < +\infty$; (b) The joint density of $(Z, V, W)$ w.r.t. Lebesgue measure stays away from zero, and the joint density of $(V, W)$ stays away from infinity.

M3. $E\{Z - E(Z \mid V, W)\}^{\otimes 2}$ is strictly positive definite.

M4. The residual error distribution $F(\cdot)$ is assumed to be known and has support $\mathbb{R}$. Denote the first, second and third derivatives of $F$ as $f$, $\dot f$ and $\ddot f$, respectively. We assume that (a) $|f(u)| \vee |\dot f(u)| \vee |\ddot f(u)| \le M < \infty$ over the whole of $\mathbb{R}$ and $f(u)$ stays away from zero on any compact subset of $\mathbb{R}$; (b) $\{f^2(v) - \dot f(v) F(v)\} \wedge \{f^2(v) + \dot f(v)(1 - F(v))\} > 0$ for all $v \in \mathbb{R}$.

Since we employ smooth B-spline estimation rather than penalized NPML estimation, our residual error Condition M4 is much less restrictive than that in Ma & Kosorok (2005a). Note that Condition M4(b) ensures the concavity of the function $s \mapsto \delta \log F(s) + (1-\delta)\log\{1 - F(s)\}$ for $\delta = 0, 1$. It is easy to verify, after some algebra, that Condition M4 is satisfied in the following two general classes of residual error distribution functions.

F1. $F(s) = \gamma [2\Gamma(\gamma^{-1})]^{-1} \int_{-\infty}^{s} \exp(-|t|^{\gamma})\, dt$ for $\gamma > 1$ is a family of distributions which includes the standard normal distribution after appropriate rescaling ($\gamma = 2$); this corresponds to the probit model, see Kalbfleisch & Prentice (1980).

F2. $F(s) = 1 - [1 + \gamma e^{s}]^{-1/\gamma}$ is a Pareto distribution with parameter $\gamma \in (0, \infty)$ and corresponds to the odds-rate transformation family, see Dabrowska & Doksum (1988a,b). It includes the following two well-known special cases: (a) in the limit $\gamma \to 0$, it yields the extreme value distribution, i.e., $F(s) = 1 - \exp(-e^{s})$, which corresponds to the complementary log-log transformation, see Banerjee et al. (2006); (b) for $\gamma = 1$, it gives the logistic distribution, i.e., $F(s) = e^{s}/(1+e^{s})$, which corresponds to the logit transformation, see Banerjee et al. (2009).

2.2 B-spline Estimation Framework

From now on, we change the signs of $\beta$ and the $h_j$'s for simplicity of exposition. In addition, we recenter $H(v)$ to $H(v) - H(l_v)$ so that $H(l_v) = 0$ for the purpose of identifiability. The additional parameter $H(l_v)$ is absorbed into the vector $\beta$, i.e., the first coordinate of $z$ is set to one. Given a single observation at $x = (v, \delta, z, w)$, the log-likelihood of model (1) is written as

$l(\beta, h_1, \ldots, h_d, H) = \delta \log F\Big[H(v) + \beta^\top z + \sum_{j=1}^{d} h_j(w_j)\Big] + (1-\delta)\log\Big\{1 - F\Big[H(v) + \beta^\top z + \sum_{j=1}^{d} h_j(w_j)\Big]\Big\}.$    (4)

We assume that $\beta \in \mathcal{B}$, a bounded open subset of $\mathbb{R}^{l}$, and that its true value $\beta_0$ is an interior point of $\mathcal{B}$. Before specifying the parameter spaces for $H$ and the $h_j$'s, we first introduce the Hölder ball $\mathcal{H}^{r}_{c}(\mathcal{Y})$, a class of smooth functions widely used in nonparametric estimation, e.g., Stone (1982, 1985). Any $f \in \mathcal{H}^{r}_{c}(\mathcal{Y})$ is $J$ times continuously differentiable on $\mathcal{Y}$, where $J$ is the largest integer smaller than $r$, and its $J$-th derivative is uniformly Hölder continuous with exponent $\kappa = r - J \in (0, 1]$, i.e.,

$\sup_{y_1, y_2 \in \mathcal{Y},\, y_1 \neq y_2} \frac{|f^{(J)}(y_1) - f^{(J)}(y_2)|}{|y_1 - y_2|^{\kappa}} \le c.$

The functions in the Hölder ball can always be approximated by a basis expansion, i.e.,

$f(t) \approx \sum_{k=1}^{K} \gamma_k B_k(t) = \gamma^\top B(t),$    (5)

where $\gamma = (\gamma_1, \ldots, \gamma_K)^\top$ and $B(t) = (B_1(t), \ldots, B_K(t))^\top$. Indeed, if the degree of the B-spline is at least $r - 1$, there exists $\gamma$ such that

$\|f - \gamma^\top B\|_{\infty} \lesssim K^{-r} \quad \text{as } K \to \infty,$    (6)

where $\|\cdot\|_{\infty}$ denotes the supremum norm. Assume the following parameter space Condition P1 for the smooth $h_j$'s.

P1. For $j = 1, \ldots, d$ and some known $c_j$, we assume that the parameter space for $h_j$ is $\mathcal{H}_j$, where

$\mathcal{H}_j = \Big\{ h_j : h_j \in \mathcal{H}^{r_j}_{c_j}[0,1] \text{ with } r_j > 1/2 \text{ and } \int_0^1 h_j(w_j)\, dw_j = 0 \Big\},$

and that the corresponding spline space is

$\mathcal{H}_{jn} = \Big\{ h_j : h_j(w) = \gamma_j^\top B_j(w) \text{ with } \|h_j\|_{\infty} \le c_j \text{ and } \int_0^1 h_j(w_j)\, dw_j = 0 \Big\},$

based on a system of basis functions $B_j = (B_{j1}, \ldots, B_{jK_j})^\top$ of degree $d_j \ge r_j - 1$.

As seen from the previous examples, it is reasonable to assume that $H(\cdot)$ is differentiable and strictly increasing over $[l_v, u_v]$, i.e., $\dot H(v) \ge C_0 > 0$. Considering that $H(l_v) = 0$, we can thus write $H(v) = \int_{l_v}^{v} \exp(g(s))\, ds$, where $g(v) \equiv \log \dot H(v)$ is well defined. Such a reparametrization gets around the strict monotonicity and positivity constraints on $H$, and thus avoids constrained optimization in the computation. The parameter space Condition P2 for $g$ is specified below.

P2. For some known $c_0$, we assume that the parameter space for $g$ is $\mathcal{G}$, where

$\mathcal{G} = \big\{ g : g \in \mathcal{H}^{r_0}_{c_0}[l_v, u_v] \text{ with } r_0 > 1/2 \big\},$

and that the corresponding spline space is

$\mathcal{G}_n = \big\{ g : g(v) = \gamma_0^\top B_0(v) \text{ and } \|g\|_{\infty} \le c_0 \big\},$

based on a system of basis functions $B_0 = (B_{01}, \ldots, B_{0K_0})^\top$ of degree $d_0 \ge r_0 - 1$. Similarly, the corresponding space of transformation functions is $\{H(v) = \int_{l_v}^{v} \exp(g(s))\, ds : g \in \mathcal{G}_n\}$. By some algebra, we can show that any such $H$ belongs to $\mathcal{H}^{r_0+1}_{c}[l_v, u_v]$ for some $c < \infty$.

REMARK 1. Note that in the theoretical proofs and numerical calculations the exact values of the $c_j$'s are not necessary. Instead, only the boundedness condition, equivalently the compactness of the parameter spaces and spline spaces, is needed. Here we assume this boundedness condition, which can be relaxed by invoking chaining arguments, only to simplify our theoretical derivations.
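The reparametrization $H(v) = \int_{l_v}^{v} \exp\{\gamma_0^\top B_0(s)\}\, ds$ can be made concrete with a few lines of code. The sketch below is our own illustration under assumed values (support, degree, knot placement, coefficients), not the authors' implementation: it builds a quadratic B-spline basis with equally spaced knots, as later used in Section 6, and evaluates the induced transformation function by numerical integration. Any coefficient vector $\gamma_0$ yields a valid nondecreasing $H$ with $H(l_v) = 0$, which is why no monotonicity constraint is needed.

```python
# A minimal sketch (assumptions, not the authors' code) of the reparametrization
# H(v) = int_{l_v}^{v} exp(g(s)) ds with g approximated by a quadratic B-spline
# g(s) = gamma' B_0(s).  Monotonicity and positivity of H hold automatically.
import numpy as np
from scipy.interpolate import BSpline

l_v, u_v = 0.2, 1.8          # assumed support of the examination time V
degree = 2                   # quadratic splines, as used later in Section 6
n_basis = 5                  # K_0, the number of basis functions
# Equally spaced interior knots plus repeated boundary knots.
inner = np.linspace(l_v, u_v, n_basis - degree + 1)
knots = np.r_[[l_v] * degree, inner, [u_v] * degree]

def basis_matrix(s):
    """Evaluate the K_0 B-spline basis functions B_0(s) at the points s."""
    s = np.atleast_1d(s)
    eye = np.eye(n_basis)
    return np.column_stack([BSpline(knots, eye[k], degree)(s) for k in range(n_basis)])

def H_of(v, gamma, n_grid=200):
    """H(v) = int_{l_v}^{v} exp(gamma' B_0(s)) ds via the trapezoidal rule."""
    grid = np.linspace(l_v, v, n_grid)
    g_vals = basis_matrix(grid) @ gamma
    return np.trapz(np.exp(g_vals), grid)

gamma = np.array([-0.5, 0.0, 0.3, 0.1, 0.4])   # arbitrary spline coefficients
print(H_of(l_v, gamma), H_of(1.0, gamma), H_of(u_v, gamma))  # 0, then increasing
```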

In this paper, we propose the B-spline approach to estimate $H$ and the $h_j$'s as follows. Let $\mathcal{A} = \mathcal{B} \times \mathcal{G} \times \prod_{j=1}^{d} \mathcal{H}_j$ and $\mathcal{A}_n = \mathcal{B} \times \mathcal{G}_n \times \prod_{j=1}^{d} \mathcal{H}_{jn}$. Denote $\alpha = (\beta, g, h_1, \ldots, h_d)$ and its true value $\alpha_0 = (\beta_0, g_0, h_{01}, \ldots, h_{0d})$, where $g_0(\cdot) = \log \dot H_0(\cdot)$. The log-likelihood (4) for observation $i$ can thus be reparametrized as

$l_i(\alpha) = \delta_i \log F\Big[\beta^\top z_i + \int_{l_v}^{v_i} \exp(g(s))\, ds + \sum_{j=1}^{d} h_j(w_{ij})\Big] + (1-\delta_i)\log\Big\{1 - F\Big[\beta^\top z_i + \int_{l_v}^{v_i} \exp(g(s))\, ds + \sum_{j=1}^{d} h_j(w_{ij})\Big]\Big\}.$    (7)

The corresponding B-spline estimate $\hat\alpha$ is defined as

$\hat\alpha = \arg\max_{\alpha \in \mathcal{A}_n} \sum_{i=1}^{n} l_i(\alpha).$    (8)

We can also write $\hat\alpha = (\hat\beta, \hat g, \hat h_1, \ldots, \hat h_d) = (\hat\beta, \hat\gamma_0^\top B_0, \hat\gamma_1^\top B_1, \ldots, \hat\gamma_d^\top B_d)$. Then the estimate $\hat H(v) = \int_{l_v}^{v} \exp(\hat\gamma_0^\top B_0(s))\, ds$. Some tedious algebra reveals that the Hessian matrix of $l_i(\alpha)$ w.r.t. $(\beta, \gamma_0, \gamma_1, \ldots, \gamma_d)$ is indeed negative semidefinite under Condition M4(b), which guarantees the existence of $\hat\alpha$. See more discussion of computational feasibility in the simulation section. The above estimation procedure also applies to other linear sieves approximating the Hölder ball (or, more generally, the Hölder space), e.g., wavelets.
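The following self-contained sketch illustrates the estimator (8) in the simplest special case of one covariate and one additive function, with the complementary log-log link $F(s) = 1 - \exp(-e^s)$. The data-generating quantities, basis sizes and quadrature are assumptions made only for illustration (they are not the paper's settings), and the optimization is deliberately carried out without any monotonicity constraint thanks to the reparametrization above.

```python
# A minimal end-to-end sketch of the estimator (8), not the authors' code.
# It maximizes the reparametrized log-likelihood (7) with F(s) = 1 - exp(-e^s)
# by unconstrained optimization over (beta, gamma_0, gamma_1).
import numpy as np
from scipy.interpolate import BSpline
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n, K0, K1, degree = 400, 5, 5, 2
l_v, u_v = 0.2, 1.8

def bspline_basis(x, a, b, K):
    """K clamped B-spline basis functions of the given degree on [a, b]."""
    inner = np.linspace(a, b, K - degree + 1)
    knots = np.r_[[a] * degree, inner, [b] * degree]
    return np.column_stack([BSpline(knots, np.eye(K)[k], degree)(x) for k in range(K)])

# --- simulated current status data (assumed design) ---------------------
Z = rng.uniform(0.5, 1.5, size=n)
W = rng.uniform(0.0, 1.0, size=n)
V = rng.uniform(l_v, u_v, size=n)                    # examination times
beta_true, h_true = 0.3, lambda w: np.sin(2 * np.pi * w) / 2
H_true = lambda v: np.log1p(v - l_v)                 # H(l_v) = 0, increasing
eps = np.log(rng.exponential(size=n))                # P(eps <= s) = 1 - exp(-e^s)
lin = beta_true * Z + h_true(W) + H_true(V)
Delta = (eps <= lin).astype(float)                   # so P(Delta=1|Z,W,V)=F(lin), as in (3)

# --- log-likelihood (7) after the reparametrization H(v) = int exp(g) ---
grid = np.linspace(l_v, u_v, 101)                    # quadrature grid for H
B0_grid, B1_W = bspline_basis(grid, l_v, u_v, K0), bspline_basis(W, 0.0, 1.0, K1)
B1_W = B1_W - B1_W.mean(axis=0)                      # crude centering of h
idx = np.searchsorted(grid, V)

def neg_loglik(par):
    beta, g0, g1 = par[0], par[1:1 + K0], par[1 + K0:]
    expg = np.exp(B0_grid @ g0)
    H_grid = np.r_[0.0, np.cumsum((expg[1:] + expg[:-1]) / 2 * np.diff(grid))]
    theta = beta * Z + B1_W @ g1 + H_grid[idx]        # theta_i in (7)
    F = np.clip(1.0 - np.exp(-np.exp(theta)), 1e-10, 1 - 1e-10)
    return -np.sum(Delta * np.log(F) + (1 - Delta) * np.log(1 - F))

fit = minimize(neg_loglik, np.zeros(1 + K0 + K1), method="BFGS")
print("estimated beta:", fit.x[0])
```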

3 Consistency and Rates of Convergence

In this section, we show that our B-spline estimate is consistent and that the convergence rates of the nonparametric estimates interfere with one another. Define

$d(\alpha, \alpha_0) = \|\beta - \beta_0\| + \|H - H_0\|_2 + \sum_{j=1}^{d} \|h_j - h_{0j}\|_2,$

where $\|\cdot\|_2$ is the $L_2$ norm. We now give the main theorem of this section.

THEOREM 1. Suppose that Conditions M1-M4 and P1-P2 hold. If $K_j \to \infty$ and $K_j/n \to 0$ for $j = 0, 1, \ldots, d$, then we have

$d(\hat\alpha, \alpha_0) = o_P(1).$    (9)

More specifically, we further prove that

$d(\hat\alpha, \alpha_0) = O_P\Big( \max_{0 \le j \le d}\big\{ K_j^{-r_j} \vee \sqrt{K_j/n} \big\} \Big).$    (10)

If we further require that $K_j \asymp n^{1/(2r_j+1)}$ for $j = 0, \ldots, d$, then we have

$d(\hat\alpha, \alpha_0) = O_P(n^{-r/(2r+1)}),$    (11)

where $r = \min_{0 \le j \le d}\{r_j\}$.

Under right censored data, Huang (1999) derived a convergence rate result similar to (10) for the partly linear additive Cox model by assuming equal $r_j$'s. According to Theorem 1, the smooth $\hat H$ achieves the rate of convergence $O_P(n^{-r/(2r+1)})$, which is no slower than the $n^{-1/3}$ rate derived in the penalized estimation context, see Ma & Kosorok (2005a), when we assume that $g_0$ and the $h_{0j}$'s are all at least continuously differentiable, i.e., $r \ge 1$. More importantly, we can further show that $\hat H$ is uniformly consistent, i.e., $\|\hat H - H_0\|_{\infty} = o_P(1)$, by applying Lemma 2 in Chen & Shen (1998), which gives $\|f\|_{\infty} \lesssim \|f\|_{L_2(\mathrm{Leb})}^{2r/(2r+d)}$ for any $f \in \mathcal{H}^{r}_{c}[a,b]^d$, and noting that $\hat H, H_0 \in \mathcal{H}^{r_0+1}_{c}[l_v, u_v]$ for some $c > 0$. The above theorem also holds when we employ a constrained monotone B-spline approximation for the transformation function, i.e., a spline with nondecreasing coefficients $\gamma_{01} \le \gamma_{02} \le \cdots \le \gamma_{0K_0}$. However, such constrained optimization usually requires additional computational effort, see Zhang et al. (2010).

REMARK 2. From the above Theorem 1, we observe an interesting convergence-rate interference phenomenon, i.e., the convergence rate of each B-spline estimate is forced to equal the slowest one. In Ma & Kosorok (2005a), the convergence rate of the penalized estimate $\check h$ is likewise slowed down to $O_P(n^{-1/3})$ by the NPMLE $\check H$, regardless of the smoothness of $h_0$. One possible route to achieving the optimal rate for each nonparametric estimate is to extend the recent mixed rate asymptotic results of Radchenko (2008) to the semiparametric setup.

Since we assume that $r_j > 1/2$, the convergence rate given in (11) is always $o_P(n^{-1/4})$. Such a rate is usually fast enough to guarantee the regular asymptotic behavior of $\hat\beta$, i.e., $\sqrt{n}$-consistency and asymptotic normality. Indeed, in the next section we improve the suboptimal rate for $\hat\beta$ in (11) to the optimal $\sqrt{n}$ rate, and further show that $\hat\beta$ is semiparametric efficient.

4 Weak Convergence of the Parametric Estimate

In this section, we study the weak convergence of the spline estimate $\hat\beta$ in the presence of multiple nonparametric nuisance functions. We first calculate the semiparametric efficient information based on the projection onto a nonorthogonal sumspace.

Let

$Q_\theta(x) = f(\theta)\Big( \frac{\delta}{F(\theta)} - \frac{1-\delta}{1-F(\theta)} \Big),$

where $\theta(z, v, w) = \beta^\top z + H(v) + \sum_{j=1}^{d} h_j(w_j)$. Denote $\theta_0$ as the true value of $\theta$. The score functions (operators) for $\beta$, $g$ and $h_j$ are separately calculated as

$\dot l_\beta(X; \alpha) = Z Q_\theta(X),$    (12)

$\dot l_g[a](X; \alpha) = \Big[\int_{l_v}^{V} \exp(g(s))\, a(s)\, ds\Big] Q_\theta(X),$    (13)

$\dot l_{h_j}[b_j](X; \alpha) = b_j(W_j) Q_\theta(X).$    (14)

We assume that $a \in L_2(H) \equiv \{a : \int_{l_v}^{u_v} a^2(s)\, dH(s) < \infty\}$ and $b_j \in L_2^{0}(w_j) \equiv \{b_j : \int_0^1 b_j(w_j)\, dw_j = 0 \text{ and } \int_0^1 b_j^2(w_j)\, dw_j < \infty\}$ so that all the score functions defined above are square integrable. To calculate the efficient score function $\tilde l_\beta$, we need to find the projection of $\dot l_\beta$ onto the sumspace $\mathcal{A} = \mathcal{A}_g + \mathcal{A}_{h_1} + \cdots + \mathcal{A}_{h_d}$, where $\mathcal{A}_g = \{\dot l_g[a] : a \in L_2(H)\}$ and $\mathcal{A}_{h_j} = \{\dot l_{h_j}[b_j] : b_j \in L_2^{0}(w_j)\}$. For simplicity, we write $\dot l_\beta(X; \alpha_0)$ and $\dot l_\beta(X; \alpha)$ as $\dot l_\beta^{0}$ and $\dot l_\beta$, respectively. The same notation rule applies to $\dot l_g[a](X; \alpha)$ and $\dot l_{h_j}[b_j](X; \alpha)$. We define

$\tilde l_\beta(X; \alpha) = \dot l_\beta(X; \alpha) - \dot l_g[\bar a^*](X; \alpha) - \sum_{j=1}^{d} \dot l_{h_j}[\bar b_j^*](X; \alpha),$

where $\bar a^* = (a_1^*, \ldots, a_l^*)$ and $\bar b_j^* = (b_{j1}^*, \ldots, b_{jl}^*)$, and $(a_k^*, b_{1k}^*, \ldots, b_{dk}^*)$ is the minimizer of

$(a_k, b_{1k}, \ldots, b_{dk}) \mapsto E\Big\{ [\dot l_\beta]_k - \dot l_g[a_k] - \sum_{j=1}^{d} \dot l_{h_j}[b_{jk}] \Big\}^2$

for $k = 1, \ldots, l$. Similarly, denote $\tilde l_\beta(X; \alpha_0)$ and $\tilde l_\beta(X; \alpha)$ as $\tilde l_\beta^{0}$ and $\tilde l_\beta$, respectively. By taking the two-stage projection approach from Sasieni (1992), we have

$\tilde l_\beta(X) = \Big( Z - b^*(W) - \frac{E\{(Z - b^*(W)) Q^2_{\theta_0}(X) \mid V\}}{E\{Q^2_{\theta_0}(X) \mid V\}} \Big) Q_{\theta_0}(X),$    (15)

where $b^*(W) = \sum_{j=1}^{d} b_j^*(W_j)$ satisfies

$E\Big\{ \Big[ Z - b^*(W) - \frac{E\{(Z - b^*(W)) Q^2_{\theta_0} \mid V\}}{E\{Q^2_{\theta_0} \mid V\}} \Big]_k Q^2_{\theta_0}\, b_{jk}(W_j) \Big\} = 0$    (16)

for every $b_{jk} \in L_2^{0}(w_j)$, $j = 1, \ldots, d$ and $k = 1, \ldots, l$. By slightly modifying the proof of Lemma 4 in Ma & Kosorok (2005a), we can show that the above nonorthogonal projection is well defined and that $b^*(\cdot)$ exists, by the alternating projection Theorem A.4.2 in Bickel et al. (1993).
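As a small sanity check on the building block $Q_\theta$, note that for the logistic link the score factor reduces to the familiar binary-regression residual. The snippet below verifies this numerically; it is our illustration, not part of the paper's derivation.

```python
# Small check (an illustration, not from the paper): for the logistic link
# F(s) = e^s / (1 + e^s) we have f = F(1 - F), so the score factor
# Q_theta(x) = f(theta) * (delta/F(theta) - (1-delta)/(1-F(theta)))
# simplifies to the binary-regression residual delta - F(theta).
import numpy as np

F = lambda s: 1.0 / (1.0 + np.exp(-s))
f = lambda s: F(s) * (1.0 - F(s))

def Q(theta, delta):
    return f(theta) * (delta / F(theta) - (1 - delta) / (1 - F(theta)))

theta = np.linspace(-2, 2, 5)
for delta in (0, 1):
    print(np.allclose(Q(theta, delta), delta - F(theta)))  # True, True
```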

Define $\Pi_j$ and $\Pi_a$ as the projection operators

$\Pi_j g \equiv \frac{E[g(V, W) Q^2_{\theta_0} \mid W_j = w_j]}{E[Q^2_{\theta_0} \mid W_j = w_j]}, \qquad \Pi_a g \equiv \frac{E[g(V, W) Q^2_{\theta_0} \mid V = v]}{E[Q^2_{\theta_0} \mid V = v]},$

respectively. Define

$D(v, w) = \frac{E[Z Q^2_{\theta_0} \mid V = v, W = w]}{E[Q^2_{\theta_0} \mid V = v, W = w]}, \qquad S(v, w_j) = \frac{E[Q^2_{\theta_0} \mid V = v, W_j = w_j]}{E[Q^2_{\theta_0} \mid W_j = w_j]},$

$T(w_i, w_j) = \frac{E[Q^2_{\theta_0} \mid W_i = w_i, W_j = w_j]}{E[Q^2_{\theta_0} \mid W_j = w_j]}, \qquad U(w_j, v) = \frac{E[Q^2_{\theta_0} \mid W_j = w_j, V = v]}{E[Q^2_{\theta_0} \mid V = v]}.$

We say a function $f(s, t)$ belongs to a uniform Hölder ball $\mathcal{H}^{r}_{c}(\mathcal{S} \times \mathcal{T})$ in $t$ relative to $s$ if it is $J$ times continuously differentiable w.r.t. $t$, where $J$ is the largest integer smaller than $r$, and its $J$-th partial derivative satisfies, with $\kappa = r - J$,

$\sup_{s \in \mathcal{S}} \sup_{t_1 \neq t_2} \frac{|f_t^{(J)}(s, t_1) - f_t^{(J)}(s, t_2)|}{|t_1 - t_2|^{\kappa}} \le c.$

Define $Sf(v, w_j) = S(v, w_j) f_{V|W_j}(v, w_j)$, $Tf(w_i, w_j) = T(w_i, w_j) f_{W_i|W_j}(w_i, w_j)$ and $Uf(w_j, v) = U(w_j, v) f_{W_j|V}(w_j, v)$, where $f_{V|W_j}$, $f_{W_i|W_j}$ and $f_{W_j|V}$ are the conditional densities of $V$ given $W_j$, of $W_i$ given $W_j$ and of $W_j$ given $V$ w.r.t. Lebesgue measure, respectively. Here, we impose model assumptions implying that both $b^*_{jk}$ and $a^*_k$ belong to some Hölder balls for any $j = 1, \ldots, d$ and $k = 1, \ldots, l$.

M5. We assume that $[\Pi_j D(v, w)]_k \in \mathcal{H}^{r_j}_{c_j}[0,1]$, that $Sf(v, w_j) \in \mathcal{H}^{r_j}_{c_j}([l_v, u_v] \times [0,1])$ in $w_j$ relative to $v$, and that $Tf(w_i, w_j) \in \mathcal{H}^{r_j}_{c_j}([0,1]^2)$ in $w_j$ relative to $w_i$ for some $0 < c_j < \infty$ and $j = 1, \ldots, d$.

M6. We assume that $[\Pi_a D(v, w)]_k \in \mathcal{H}^{r_0+1}_{c_0}[l_v, u_v]$ and that $Uf(w_j, v) \in \mathcal{H}^{r_0+1}_{c_0}([0,1] \times [l_v, u_v])$ in $v$ relative to $w_j$ for some $0 < c_0 < \infty$.

Note that we can simplify $Sf(v, w_j)$ ($Tf(w_i, w_j)$) to $S(v, w_j)$ ($T(w_i, w_j)$) in Condition M5, and simplify $Uf(w_j, v)$ to $U(w_j, v)$ in Condition M6, when we assume that $V$ and $W$ are independent and that the components of $W$ are pairwise independent.

THEOREM 2. Suppose that Conditions M1-M6 and P1-P2 hold. If $K_j \asymp n^{1/(2r_j+1)}$ and $\tilde I$ is invertible, then we have

$\sqrt{n}(\hat\beta - \beta_0) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \tilde I^{-1} \tilde l_\beta(X_i) + o_P(1) \stackrel{d}{\longrightarrow} N(0, \tilde I^{-1}),$    (17)

where $\tilde I$ is the efficient information matrix defined as $E\, \tilde l_\beta \tilde l_\beta^\top$.

5 B-spline Estimate of the Efficient Information

In this section, we give an explicit B-spline estimate of the efficient information as a by-product of establishing the asymptotic normality of $\hat\beta$. Indeed, it is simply the observed information matrix if we treat the semiparametric model as a parametric one after the B-spline approximation, i.e., $\mathcal{H}_j = \mathcal{H}_{jn}$ and $\mathcal{G} = \mathcal{G}_n$. Specifically, we treat $l_i(\alpha)$ defined in (7) as if it were a parametric likelihood $l_i(\beta, \gamma_0, \gamma_1, \ldots, \gamma_d)$. We construct the corresponding information estimator for $(\beta^\top, \gamma_0^\top, \gamma_1^\top, \ldots, \gamma_d^\top)^\top$:

$\hat J = \begin{pmatrix} \hat I_{11} & \hat I_{12} \\ \hat I_{21} & \hat I_{22} \end{pmatrix}_{(l + \sum_{j=0}^{d} K_j) \times (l + \sum_{j=0}^{d} K_j)},$

where $\hat I_{j,k} = \sum_{i=1}^{n} A_j(X_i; \hat\alpha) A_k(X_i; \hat\alpha)^\top / n$ for $j, k = 1, 2$, and

$A_1(X; \alpha) = \dot l_\beta(X; \alpha),$

$A_2(X; \alpha) = \big( \dot l_g[B_{01}], \ldots, \dot l_g[B_{0K_0}], \dot l_{h_1}[B_{11}], \ldots, \dot l_{h_d}[B_{dK_d}] \big)^\top.$

Parametric inference implies that the information estimator for $\beta$ is of the form

$\hat I = \hat I_{11} - \hat I_{12} \hat I_{22}^{-1} \hat I_{21}.$    (18)

Some calculations further reveal that

$\hat I = \mathbb{P}_n \Big[ \dot l_{\hat\beta} - \dot l_{\hat g}[(\hat\gamma_0^*)^\top B_0] - \sum_{j=1}^{d} \dot l_{\hat h_j}[(\hat\gamma_j^*)^\top B_j] \Big]^{\otimes 2},$    (19)

where $[\hat\gamma_j^*]_{K_j \times l} = (\hat\gamma_{j1}^*, \ldots, \hat\gamma_{jl}^*)$ for $j = 0, 1, \ldots, d$ and $(\hat\gamma_{0k}^{*\top}, \ldots, \hat\gamma_{dk}^{*\top})^\top = \hat I_{22}^{-1} \hat I_{21} 1_k$, where $1_k$ represents the $l$-vector with its $k$-th element equal to one and the others zero. We will use (18) as our estimator of $\tilde I$. We need the following additional assumption for Theorem 3.

M7. We assume that

$E \sup_{a_k \in \mathcal{G}_n} \Big[ \int_{l_v}^{V} \{\exp(g(s)) - \exp(g_0(s))\}\, a_k(s)\, ds \Big]^2 \lesssim \|H - H_0\|_2^2.$

THEOREM 3. Under Conditions M1-M7 and P1-P2, we have $\hat I \stackrel{P}{\to} \tilde I$.

Note that the consistency of a similar random-sieve efficient information estimate was also proven for linear regression models with current status data; see Theorem 3 of Shen (2000).
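In matrix form, (18) is simply a Schur complement of the averaged outer product of per-observation scores. The sketch below is our illustration with placeholder inputs, not the authors' code: in practice the score factors $Q_i = Q_{\theta}(X_i; \hat\alpha)$ and the derivative blocks $\partial\theta_i/\partial(\text{parameters})$ would be evaluated at the fitted spline coefficients.

```python
# A sketch (not the authors' code) of the explicit variance estimator (18):
# after the spline approximation, the observed information for
# (beta, gamma_0, ..., gamma_d) is the average outer product of per-observation
# scores, and the information for beta is the Schur complement
# I_hat = I_11 - I_12 I_22^{-1} I_21.
import numpy as np

def beta_information(Q, D_beta, D_nuis):
    """I_hat from (18).  Q: (n,) score factors; D_beta: (n, l) derivatives of
    theta w.r.t. beta; D_nuis: (n, m) derivatives w.r.t. all spline coefficients."""
    n = Q.shape[0]
    A1 = Q[:, None] * D_beta            # per-observation scores for beta
    A2 = Q[:, None] * D_nuis            # per-observation scores for the nuisance
    I11 = A1.T @ A1 / n
    I12 = A1.T @ A2 / n
    I22 = A2.T @ A2 / n
    return I11 - I12 @ np.linalg.solve(I22, I12.T)   # I_11 - I_12 I_22^{-1} I_21

# Shape-only demonstration with synthetic placeholders (l = 2, m = K_0 + K_1 = 10).
rng = np.random.default_rng(3)
n, l, m = 400, 2, 10
Q, D_beta, D_nuis = rng.normal(size=n), rng.normal(size=(n, l)), rng.normal(size=(n, m))
I_hat = beta_information(Q, D_beta, D_nuis)
se = np.sqrt(np.diag(np.linalg.inv(I_hat)) / n)      # standard errors for beta_hat
print(se)                                            # basis of Wald intervals beta_hat +/- 1.96*se
```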

6 Numerical Results

6.1 Simulations

We perform a Monte Carlo study to assess the finite-sample performance of our proposed method. To compare with the penalized NPMLE in Ma & Kosorok (2005a), we adopt the same setting used in their paper. We simulate the current status data from the partly linear additive Cox model, which is a special case of the general transformation model. We choose $H(u) = \log A(e^u)$, where $A(u) = e^{k}\{\exp(u/3) - 1\}$ with $k$ a fixed constant. The errors $\epsilon$ follow an extreme value distribution with $F(s) = 1 - \exp(-e^s)$. The regression coefficients are $\beta_1 = 0.3$ and $\beta_2 = 0.25$. The covariate $Z_1$ is Uniform$[0.5, 1.5]$ and $Z_2$ is Bernoulli with success probability 0.5. We choose $W$ as Uniform$[1, 10]$ and $h(w) = \sin(w/1.2 - 1) - k$. Censoring times follow a standard exponential distribution conditional on being in the interval $[0.2, 1.8]$. The sample sizes are $n = 400$ and $n = 1600$. We simulate 400 realizations for both sample sizes.

In practice, the locations and numbers of knots for $H$ and the $h_j$'s need to be determined. For simplicity, we use equally spaced knots for all functions. Common model selection methods, such as the Akaike information criterion (AIC) and the Bayesian information criterion (BIC), can be employed to select the number of knots. In this paper, we determine $K_0, K_1, \ldots, K_d$ by the AIC, given by

$\mathrm{AIC} = -2 \sum_{i=1}^{n} l_i(\hat\alpha) + 2\Big(l + \sum_{j=0}^{d} K_j\Big).$

In our simulation, we use a quadratic spline to approximate both the function $h$ and the function $g$ in $H$. Then $\mathrm{AIC} = -2\sum_{i=1}^{n} l_i(\hat\alpha) + 2(K_0 + K_1 + 2)$. Based on our experience, it is generally adequate to choose fewer than ten knots to achieve a reasonable approximation, provided that $h$ and $H$ are not overly erratic. Figure 1 shows the AIC scores under different combinations of $K_0$ and $K_1$ for one realization of the simulation with sample size $n = 1600$. It shows that the optimal choices for $K_0$ and $K_1$ are 5 and 5, respectively. The estimated $h$ and $H$ with various values of $K_0$ and $K_1$ are plotted in Figure 2. In the left panel of Figure 2, we fix $K_0 = 5$ and plot the estimated $h$ with $K_1 = 3, 5, 10$. When $K_1$ is small (e.g., $K_1 = 3$), there seems to be a large bias in our estimator. On the other hand, when $K_1$ is large (e.g., $K_1 = 10$), the estimator displays wiggly behavior. In the right panel of Figure 2, we fix $K_1 = 5$ and plot the estimated $H$ with $K_0 = 5, 7, 10$. As the number of knots increases, the estimated $H$ shows a similarly wiggly shape. Hence, the numbers of knots should be chosen with caution. It is worth noting that the values $K_0, K_1, \ldots, K_d$ selected by the AIC criterion can be regarded only as the minimum numbers of knots required. They may not be the optimal choices, since the concept of optimality is not well defined here. See Xue et al. (2004) for similar discussions.
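The knot-selection rule is easy to automate: refit the model over a grid of $(K_0, K_1)$ and keep the combination minimizing the AIC. The sketch below illustrates only the selection step; the maximized log-likelihood values are placeholder numbers for illustration, since in practice each one comes from re-solving (8) with the corresponding basis sizes.

```python
# A small sketch of the AIC-based knot selection described above (not the
# authors' code).  The maximized log-likelihood values below are placeholder
# numbers purely for illustration; in practice each value comes from refitting
# the model of Section 2.2 with the corresponding (K_0, K_1).
l_dim = 2                      # dimension of beta (two regression coefficients)
loglik = {(3, 3): -455.2, (3, 5): -452.9, (5, 3): -451.8,
          (5, 5): -448.1, (7, 5): -447.6, (5, 10): -446.9}

def aic(K0, K1, ll):
    # AIC = -2 * sum_i l_i(alpha_hat) + 2 * (l + K_0 + K_1)
    return -2.0 * ll + 2.0 * (l_dim + K0 + K1)

scores = {Ks: aic(*Ks, ll) for Ks, ll in loglik.items()}
best = min(scores, key=scores.get)
print(scores)
print("selected (K_0, K_1):", best)
```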

Figure 1: AIC scores under different combinations of $K_0$ and $K_1$.

Figure 2: Plot of the estimated $h$ and $H$ with various values of $K_0$ and $K_1$.

Table 1: Monte Carlo results for the partly linear Cox model with current status data, based on 400 replicates. For each of $\beta_1$ and $\beta_2$ and for sample sizes $n = 400$ and $n = 1600$, the reported quantities are Bias, SD, ESD, Coverage and ESD-WB, together with the joint coverage. SD: standard error; ESD: estimated standard error; ESD-WB: estimated standard error from the weighted bootstrap method.

Simulation results show that our B-spline estimation procedure performs quite well in the semiparametric transformation model. The biases and standard errors of the spline estimates of $\beta_1$ and $\beta_2$ are given in Table 1. The table shows that the sample biases of both $\hat\beta_1$ and $\hat\beta_2$ are small. The ratio of the standard errors for the two sample sizes is close to 2, a result consistent with the $\sqrt{n}$ convergence rate of $\hat\beta_1$ and $\hat\beta_2$. The estimated standard errors from (18) (denoted ESD) are also displayed in Table 1 and are very close to the simulation results. Although our proposed method tends to overestimate the standard error slightly, the overestimation lessens as the sample size increases. We also compare our results with the weighted bootstrap method in Ma & Kosorok (2005b), with weights drawn from the exponential distribution with mean one. The resulting estimated standard errors are again similar to those obtained using our explicit B-spline estimate. The 95% confidence intervals constructed from (18) generally have coverage close to the nominal value. Histograms of $\hat\beta_1$ and $\hat\beta_2$ are shown in Figure 3. It is clear that the marginal distributions of $\hat\beta_1$ and $\hat\beta_2$ are approximately Gaussian. The left panel of Figure 4 displays the spline estimate of $h(w)$, and the monotone estimate $\hat H$ is given in the right panel of Figure 4. The dashed line is the true function, the solid line is the average estimate over 400 realizations, and the dash-dotted lines form the 95% pointwise confidence band for $h(w)$ or $H(v)$ when the true model is known, obtained by taking the 2.5 and 97.5 percentiles of these 400 estimates at each $w$ or $v$.

As suggested by one of the referees, we also perform a Monte Carlo study including two nonparametric functions in the model.

Figure 3: Histograms of $\hat\beta_1$ and $\hat\beta_2$ based on samples of size 1600 and 400 replicates.

Figure 4: Left: estimate and pointwise confidence interval for $h$. Right: estimate and pointwise confidence interval for $H$. The solid line is the average estimate over 400 realizations with sample size $n = 1600$, and the dashed line is the true function. The dash-dotted lines are the 95% pointwise confidence intervals.

Figure 5: AIC scores under different combinations of $K_0$, $K_1$ and $K_2$.

Under the same setting as in the last study, the two nonparametric functions are $h_1(w_1) = \sin(w_1/1.2 - 1) - k$, with $w_1$ following a uniform distribution on $[1, 10]$, and $h_2(w_2) = 3w_2^2 - 1$, with $w_2$ following a uniform distribution on $[-1, 1]$. Figure 5 shows the AIC scores under different combinations of $K_0$, $K_1$ and $K_2$ for one realization of the simulation with sample size $n = 1600$. For illustration, we only plot two choices of $K_0$, where the top surface is for $K_0 = 10$ and the bottom surface is for $K_0 = 4$. The optimal choice by the AIC criterion is $(K_0, K_1, K_2) = (4, 5, 3)$. The spline estimates of $h_1$, $h_2$ and $H$ under the optimal numbers of knots are displayed in Figure 6, where the dotted lines are the true functions.

Compared with the penalized method in Ma & Kosorok (2005a), our spline-based method has four obvious advantages. First, the computation of our spline estimate $\hat H$ is much less expensive than the cumulative sum diagram approach used in Ma & Kosorok (2005a). This is because the number of B-spline basis functions, i.e., $K_0$, is often taken much smaller than the sample size $n$, so the dimension of the estimation problem is greatly reduced. Second, our estimate of the transformation function $H$ is smooth with a higher convergence rate, and we obtain a narrower confidence interval for $H$, shown in the right panel of Figure 4. Third, we can obtain an explicit consistent estimate $\hat I$, whereas the block jackknife approach proposed in Ma & Kosorok (2005a) is not theoretically justified. Last, we do not require constrained optimization in our implementation.

Figure 6: Estimates of $h_1$, $h_2$ and $H$. The dotted lines are the true functions.

Table 2: Estimates and their corresponding estimated standard errors for the parametric part for the calcification data. For each of the extreme value and logistic error distributions, the table reports $\hat\beta_1$, ESD($\hat\beta_1$), $\hat\beta_2$ and ESD($\hat\beta_2$). ESD: estimated standard error.

6.2 Application: Calcification Data

We illustrate the proposed method on a dataset from a calcification study. Yu et al. (2001) investigated the calcification of intraocular lenses, an infrequently reported complication of cataract treatment. The objective of the study is to understand the effect of some clinical variables on the time to calcification of the lenses after implantation. The patients were examined by an ophthalmologist to determine the status of calcification at a random time ranging from zero to thirty-six months after implantation of the intraocular lenses. The severity of calcification was graded into five categories ranging from zero to four. In our analysis, we simply treat those with severity > 1 as calcified and those with severity <= 1 as not calcified. This dataset can be treated as current status data because only the examination time and the calcification status at examination are available.

Figure 7: The spline estimates of $h(w)$ and $H(v)$ under two different assumptions on the error distribution: extreme value distribution (solid) and logistic distribution (small dashes).

The covariates of interest include $Z_1$, incision length; $Z_2$, gender (0 for female and 1 for male); and $W$, age at implantation divided by 10. The original dataset has 379 records. We remove the one record with a missing measurement, resulting in a sample size of $n = 378$. This dataset has been studied by Xue et al. (2004), Lam & Xue (2005) and Ma (2009). Xue et al. (2004) and Lam & Xue (2005) modeled the event time by a log-transformation, so a straightforward estimate of the hazard function is not available. Ma (2009) used a cure model to fit the data, assuming a generalized linear model for the cure probability; for subjects not cured, linear and partly linear Cox proportional hazards models are used to model the survival risk.

We fit this dataset using the semiparametric additive transformation model. We assume the error distribution $F$ to be one of two distributions: the extreme value distribution and the logistic distribution. We approximate $h$ and $\log \dot H$ by quadratic splines. The optimal numbers of knots for $h$ and $\log \dot H$ are 6 and 5, respectively. The estimates and their corresponding estimated standard errors for the parametric part are summarized in Table 2. The estimates of $h(w)$ based on the two error distributions are displayed in the left panel of Figure 7, and the estimates of $H(v)$ are plotted in the right panel of Figure 7. The analysis shows very similar results for the two error distributions. From Table 2, both incision length and gender are insignificant at the 5% level of significance. From the left panel of Figure 7, $h(w)$ increases steadily from age 50, achieves a peak at age 60, and decreases gradually thereafter, which means that patients aged around 60 tend to enjoy a longer time to calcification. The estimated transformation function $\hat H$ in the right panel of Figure 7 displays a nonlinear behavior, which indicates that the transformation is necessary. We can incorporate an unknown scale parameter into the residual error distribution $F(\cdot)$ to further improve the above analysis.

Our general B-spline estimation framework can also handle this type of transformation model easily.

Acknowledgement

The first author's research is supported by the National Science Foundation under grant DMS. The second author's research is supported by the National Science Foundation under grants CMMI and DMS. The authors would like to thank Professor Alexis K. F. Yu for providing the calcification data and Professor Donglin Zeng for helpful discussions. The authors also would like to thank the editor, the associate editor, and two reviewers for their helpful comments and suggestions, which led to a much improved presentation.

Appendix

Some useful Lemmas

We denote the $\epsilon$-covering number ($\epsilon$-bracketing number) of a class $\mathcal{A}$ with respect to a metric $d$ by $N(\epsilon, \mathcal{A}, d)$ ($N_B(\epsilon, \mathcal{A}, d)$). The corresponding $\epsilon$-entropy ($\epsilon$-bracketing entropy) is defined as $H(\epsilon, \mathcal{A}, d) = \log N(\epsilon, \mathcal{A}, d)$ ($H_B(\epsilon, \mathcal{A}, d) = \log N_B(\epsilon, \mathcal{A}, d)$). Define $\mathcal{G}_n(\delta_0; \|\cdot\|_\infty) = \{g : g(v) = \gamma_0^\top B_0(v) \text{ satisfying } \|g\|_\infty \le \delta_0\}$ and $\mathcal{H}_{jn}(\delta_j; \|\cdot\|_\infty) = \{h_j : h_j(w_j) = \gamma_j^\top B_j(w_j) \text{ satisfying } \|h_j\|_\infty \le \delta_j \text{ and } \int_0^1 h_j(w_j)\, dw_j = 0\}$. Obviously, $\mathcal{G}_n(c_0; \|\cdot\|_\infty) = \mathcal{G}_n$ and $\mathcal{H}_{jn}(c_j; \|\cdot\|_\infty) = \mathcal{H}_{jn}$. Lemma 1 follows from the B-spline approximation property (6). Lemma 2 is directly implied by Lemma 2.5 in Van de Geer (2000). Lemma 4 is adapted from Proposition 1 in Cheng & Huang (2010).

LEMMA 1. There exist $g_n \in \mathcal{G}_n$ and $h_{jn} \in \mathcal{H}_{jn}$ such that

$\|g_n - g_0\|_\infty \lesssim K_0^{-r_0},$    (A.1)

$\|H_n - H_0\|_\infty = O(K_0^{-r_0}),$    (A.2)

$\|h_{jn} - h_{0j}\|_\infty \lesssim K_j^{-r_j},$    (A.3)

$\sum_{j=1}^{d} \|h_{jn} - h_{0j}\|_\infty = O\big(\max_{j=0,\ldots,d}\{K_j^{-r_j}\}\big),$    (A.4)

where $H_n(v) = \int_{l_v}^{v} \exp(g_n(s))\, ds$.

LEMMA 2. For $1 \le j \le d$,

$H(\epsilon, \mathcal{G}_n(\delta_0; \|\cdot\|_\infty), \|\cdot\|_\infty) \lesssim K_0 \log(1 + 4\delta_0/\epsilon),$    (A.5)

$H(\epsilon, \mathcal{H}_{jn}(\delta_j; \|\cdot\|_\infty), \|\cdot\|_\infty) \lesssim K_j \log(1 + 4\delta_j/\epsilon).$    (A.6)

20 LEMMA 3. Let h = (h 1,..., h d ). Define K = {ζ(β, h, H) : β B, h d H jn, g G n, where the form of ζ is defined in (A.12). We have sup G n ζ = O P ( max ζ K j=,1,...,d {K1/2 j ). (A.7) Proof: Define l (β, h, H) = δf (β z + d h j(w j ) + H(v)) + (1 δ)[1 F (β z + d h j(w j ) + H(v))]. The construction of l ( ) implies that l (β, h n, H n ) l (β, h, H ) = O( max j=,1,...,d {K r j j ) (A.8) based on (A.2), (A.4) and M4. Thus, l (β, h n, H n ) is bounded away from zero for sufficiently large n. For any β 1, β 2 B, h 1, h 2 d H jn and g 1, g 2 G n, we have ζ(β 1, h 1, H 1 ) ζ(β 2, h 2, H 2 ) < l (β 1, h 1, H 1 ) l (β 2, h 2, H 2 ) < β 1 β 2 + h 1j h 2j + g 1 g 2. (A.9) The first and second inequalities in the above follow from the fact that l (β, h n, H n ) is strictly positive for sufficiently large n by (A.8), and Condition M4(a), respectively. As shown in (A.9), the functions in the class K are Lipschitz continuous in (β, h, g). Therefore, by combining Lemma 2 and Theorem in (Van de Geer & Wellner, 1996), we obtain that H B (ɛ, K, L 2 (P )) < max j d {K j log(1 + M/ɛ), where M = max j d {4c j. In the end, we apply Lemma in (Van de Geer & Wellner, 1996) to this uniformly bounded class of functions K to obtain (A.7). LEMMA 4. Suppose the following Conditions (B1)-(B3) hold. B1. P n l β = o P (n 1/2 ), P n lĝ[ā ] = o P (n 1/2 ) and P n lĥj [ b j ] = o P (n 1/2 ); B2. sup {α:d(α,α ) C 1 n r/(2r+1) G n ( l β (X; α) l β (X; α )) = o P (1); B3. P ( l β (X; α) l β (X; α )) = Ĩ(β β ) + o( β β ) + o(n 1/2 ) for α satisfying d(α, α ) C 1 n r/(2r+1). 2

21 If α is consistent and Ĩ is invertible, then we have n( β β ) = 1 n n i=1 Ĩ 1 l β (X i ) + o P (1) d N(, Ĩ 1 ). LEMMA 5. (i) If a(s, t) = a(s 1, s 2, t) H r c(s 1 S 2 T ) in t relative to s 1 and s 2, then S 1 a(s 1, s 2, t) ds 1 H r c (S 2 T ) in t relative to s 2. (ii) If a(s, t), b(s, t) H r c(s T ) in t relative to s, then c(s, t) a(s, t)b(s, t) H r c (S T ) in t relative to s. (iii) If a(s, t) H r c(s T ) in t relative to s and f( ) C β, then f(a(s, t)) H r c (S T ) in t relative to s. Proof: Let r be the largest integer smaller than r. Denote the m-th derivative of a(s, t) w.r.t. t as Dt m a(s, t) for m =, 1,..., r. (i) Note that Dt m a(s 1, s 2, t) is bounded for m r, by the dominated convergence theorem, we can take derivative inside the integral to obtain ( ) Dt m a(s 1, s 2, t) ds 1 = Dt m a(s 1, s 2, t) ds 1, S 1 S 1 which implies that Dt m ( S 1 a(s 1, s 2, t) ds 1 ) is bounded for m r. Using this and the fact that r ) r ( ) D t ( S 1 a(s 1, s 2, t 2 ) ds 1 D t S 1 a(s 1, s 2, t 1 ) ds 1 t 2 t 1 r r S 1 sup s 1,s 2 sup t 1 t 2 D r t a(s 1, s 2, t 2 ) D mα t a(s 1, s 2, t 1 ) t 2 t 1 r r ds 1 c <, for all s 2 and t 1 t 2, we conclude that S 1 a(s 1, s 2, t) ds 1 H r c (S 2 T ) in t relative to s 2 for some c <. (ii) The result is true because Dt m c = DtaD i j t b i+j=m is bounded for m r. Also we note that for i < r, It can then be easily verified that D i ta(s, t 2 ) D i ta(s, t 1 ) t 2 t 1 r r = sup s D sup t 1 t 2 r t t2 a(s, t) dt. t 2 t 1 r r t 1 Dt i+1 c(s, t 2 ) D r t c(s, t 1 ) t 2 t 1 r r <. 21

22 (iii) When < α 1, the result follows from the observation that f(a(s, t 2 )) f(a(s, t 1 )) = f(a(s, t 2)) f(a(s, t 1 )) a(s, t 2) a(s, t 1 ). t 2 t 1 β a(s, t 2 ) a(s, t 1 ) t 2 t 1 β Using the chain rule, the above observation and part (ii) of the lemma, the desired result can be obtained by induction for general β. Denote S k (X; α, w k ) = [ l β (X; α)] k l g [a k ](X; α) l hj [b jk ](X; α), where w k = (a k, b 1k,..., b dk ). Let W n = G n d H jn and N = {α A : d(α, α ) = o(1). LEMMA 6. Under Conditions M1-M7 & P1-P2, we have for all α N and k = 1,..., l. E sup w k W n S k (X; α, w k ) S k (X; α, w k ) 2 < d 2 (α, α ) Proof: In view of (12)-(14), we can bound the left hand side of (A.1) by { [ V ] 2 < Q θ Q θ E sup (exp(g(s)) exp(g (s)))a k (s)ds (Q θ Q θ) 2 a k G n l v [ V ] 2 +E sup a k G n l v [ V exp(g (s))a k (s)ds(q θ Q θ ) ] 2 +E sup a k G n (exp(g(s)) exp(g (s)))a k (s)dsq θ + E l v [ sup b 2 jk (Q θ Q θ ) 2] b jk H jn (A.1) after some algebra. The compactness of G n and H jn imply that the third and fifth term in the above are both of the order Q θ Q θ 2 2. For the second term, we can further bound it by [ V V ] E sup a 2 k(s)ds [exp(g(s)) exp(g (s))] 2 ds(q θ Q θ ) 2. a k G n l V l V Considering the compactness of G and G n, we know the second term is also of the order Q θ Q θ 2 2. Assumption M4(a) together with Cauchy-Schwartz inequality implies that Q θ Q θ 2 2 < β β 2 + H H d (h j h j ) 2 2. Since we assume that the density for W is bounded away from zero and infinity, we have that d (h j h j ) 2 2 < d h j h j 2 2 considering the identifiability condition 1 h j(w j )dw j =. Assumption M7 implies that the fourth term is of the order H H 2 2. Considering the form of d(α, α ), we conclude the whole proof. 22

23 Proof of Theorem 1 We show the estimation consistency (9) by first establishing { 2 P ( β β ) Z + (ĥj h jn )(W j ) + Ĥ(V ) H n(v ) = o P (1). (A.11) Combining (A.11) with the identifiability Condition M3, we directly obtain ( β β ) = o P (1) { d 2 which, in turn, implies that P (ĥj h jn )(W j ) + Ĥ(V ) H n(v ) = op (1). Considering the assumption M2(b) and that 1 h j(w j )dw j = for h j H j H jn, we can further show d ĥj h jn 2 + Ĥ H n 2 = o P (1). The spline approximation result (A.2) and (A.3) conclude the proof of (9). In the below, we will show (A.11) to complete the proof of (9). Recall that h = (h 1,..., h d ). Denote h, h n and ĥ as the corresponding true value, B-spline approximation and sieve estimate, respectively. Recall that l (β, h n, H n ) is bounded away from zero for sufficiently large n as implied by (A.8). Then, by the definition of α, we have P n log{l ( β, ĥ, Ĥ)/l (β, h n, H n ), which implies that, by the inequality that α log(x) log(1 + α(x 1)) for any x > and α (, 1), [ { ] l ( β, ĥ, Ĥ) P n log 1 + α l (β, h n, H n ) 1 P n ζ( β, ĥ, Ĥ). (A.12) Lemma 3 implies that (P n P )ζ( β, ĥ, Ĥ) = o P (1) since K j /n = o(1) for any j =, 1,..., d. Thus, P ζ( β, ĥ, Ĥ) o P (1) based on (A.12). Let U n (X) = l ( β, ĥ, Ĥ)/l (β, h n, H n ). Based on (A.8) we know P U n (X) = 1 + o P (1), which further implies P ζ( β, ĥ, Ĥ) o P (1) by the concavity of s log(s). This in turn implies that P ζ( β, ĥ, Ĥ) = o P (1). This forces P (β Z + d h jn(w j ) + H n (V )) ( β Z + d ĥj(w j ) + Ĥ(V )) = o P (1) by the strict concavity of s log s, Conditions M4(a), P1 and P2. It is easy to verify that ERn 2 = o P (1) if E R n = o P (1). Thus, we have shown (A.11) in the end. As for the convergence rate results (1) & (11), we first apply Theorem in Van de Geer & Wellner (1996) to establish θ θ 2 = O P (δ 1n δ 2n ), (A.13) where θ is the plug-in sieve estimate of θ and δ 1n = max j d { K j / n and δ 2n = max j d {K r j j. (A.14) 23

24 Following similar arguments in proving the consistency, we know that (A.13) implies (1) and (11) by choosing K j n 1/(2r j+1). In the below, we show (A.13) by verifying the conditions of Theorem in Van de Geer & Wellner (1996). We first need to show that P [l(α ) l(α)] > θ θ 2 2 (A.15) for every α in the neighborhood of α. Define q(δ, t) = δ log(f (t)) + (1 δ) log(1 F (t)) and q(δ, t) as its second derivative w.r.t. t. Since α maximizes α P l(α), we have [ q(δ, P [l(α ) l(α)] = P θ) ] (θ θ ) 2, 2 where θ is on the line segment between θ and θ. The compactness of the parameter spaces imply that P [l(α ) l(α)] θ θ 2 2. This completes the proof of (A.15). We next calculate the order of E sup θ θ 2 δ G n (l(α) l(α )) as a function of δ, denoted as φ n (δ), by the use of Lemma of Van de Geer & Wellner (1996). Let F 1n (δ) = {l(α) l(α ) : g G n, h j H jn, θ θ 2 δ. Using the same argument as that in the proof of Lemma 3, we obtain that H B (ɛ, F 1n (δ), L 2 (P )) is bounded by C max j d {K j log(1 + δ/ɛ). This leads to J B (δ, F 1n (δ), L 2 (P )) = δ 1 + HB (ɛ, F 1n (δ), L 2 (P ))dɛ C max j d { K j δ. The compactness of G n and H jn implies the uniform boundedness of any f F 1n (δ). Thus, Lemma of Van de Geer & Wellner (1996) gives φ n (δ) = max j d { K j δ + max j d {K j/ n. By solving δ1n 2 φ n (δ 1n ) n, we get the form of δ 1n in (A.14). We next show that P n l( α) P n l(α ) O P (δ2n). 2 The definition of α implies that P n [l( α) l(α )] A n + B n, where A n = (P n P ){l(β, H n, h n ) l(α ) and B n = P {l(β, H n, h n ) l(α ). A straightforward Taylor expansion gives { A n = (P n P ) l 2 (β, H n, h n )(H n H ) + l 2+j (β, H n, h n )(h jn h j ), where l t is the Fréchet derivative of l(β, H n, h n ) w.r.t. the t-th argument. Considering (A.2), (A.3) and the fact that < ɛ 1 q(δ, t) ɛ 2 < for t in some compacta of R 1, we have { l2 (β, H n, h n )(H n H ) + d l 2+j (β, H n, h 2 n )(h jn h j ) P max j d {K r (A.16) j j n ɛ 24

25 for any ɛ >. Let F 2n = {l(β, H, h) l(α ) : g G n, h j H jn, g g C K r, h j h j C j K r j j. Similar analysis in Lemma 3 show that the bracketing entropy integral (in terms of L 2 (P )) for F 2n is finite, thus yields that F 2n is P-Donsker. Combining this P-Donsker result and (A.16), we use Corollary of Van de Geer & Wellner (1996) to conclude that nan /(max j d {K r j j n ɛ ) = o P (1). By choosing some proper < ɛ < 1/2 satisfying n ɛ 1/2 = max j d {K r j j O(max j d {K 2r j j thus concludes the whole proof., we have A n = o P (max j d {K 2r j j ). We can also show B n ) by similar analysis of (A.15). This gives the form of δ 2n in (A.14), and Proof of Theorem 2 We apply Lemma 4 to prove this theorem by checking their Conditions B1 B3. To facilitate the understanding, we first sketch the verification of Condition B1 and then provide the details. To verify B1, we first know that P n l β = since β maximizes l(β, ĝ, ĥ1,..., ĥd), β is consistent and β is an interior point of B. We next show that b jk (a k ) belongs to Hr j c j [, 1] (H r c [l v, u v ]) for some < c j < and j =, 1,..., d such that there exists a b jkn H jn (a kn G n) satisfying b jk b jkn = O(n r j/(2r j +1) ) (A.17) a kn a k = O(n r /(2r +1) ) (A.18) by (6) and the assumption that K j n 1/(2r j+1). Since P n lĥj [b jkn ] = and P n lĝ[a kn ] = for any b jkn H jn and a kn G n, it remains to show P n { lĥj [b jkn ] lĥj [b jk ] = o P (n 1/2 ), (A.19) P n { lĝ[a kn ] lĝ[a k ] = o P (n 1/2 ) (A.2) for verifying Condition B1. Now we show b jk Hr j c j [, 1] and (A.19). Following the analysis in Page 2282 of Ma & Kosorok (25a), we can write, with ā I (v) = v l v exp(g (s))ā (s)ds, b j = Π j D(v, w) Π j ā I (v) Π j b i i j = Π j D(v, w) uv l v ā I (v)sf(v, w j)dv i j 1 b i (w i)t f(w i, w j )dw i. According to Lemma 5 and dominated convergence theorem, we know that b jk (w j) H r j c j [, 1] under Condition M5, b jk L 2(w j ) and a k L 2(H) (thus a Ik is uniformly bounded) for some 25

26 < c j <. As for (A.19), we first decompose its left hand side as I 1n + I 2n, where { I 1n = P lĥj [b jkn b jk ] l hj [b jkn b jk ], { I 2n = (P n P ) lĥj [b jkn b jk ]. By Cauchy-Schwartz Inequality, we have I < 1n b kjn b kj θ θ 2 based on Conditions M4(a), P1 & P2. Thus, (A.13) and (A.17) imply that I 1n = O P (n 2r/(2r+1) ) = o P (n 1/2 ) since r > 1/2. To show I 2n = o P (n 1/2 ), we need to make use of Lemma in Van de Geer & Wellner (1996). We first construct the following class of functions: I n = { f θ,bjkn (x) = l hj [b jkn b jk ](x; α) : α A n(n r 2r+1 ) and bjkn H jn ( n r j 2r j +1 ), where A n (δ) {α A n : d(α, α ) C 1 δ and H jn(δ) {b jkn H jn : b jkn b jk C 2 δ for some < C 1, C 2 <. Let Θ n (δ) = {β z + H(v) + d h j(w j ) : α A n (δ). It is easy to verify that, for every x, f θ1,b jkn1 (x) f θ2,b jkn2 (x) < θ 1 θ 2 + b jkn1 b jkn2, (A.21) where θ j Θ n (n r/(2r+1) ) for j = 1, 2. Let θ 1,..., θ N(ɛ,Θn(n r/(2r+1) ), ) and b 1 jkn,..., b N(ɛ,H jn (n r j /(2r j +1) ), ) jkn be the ɛ-cover for Θ n (n r/(2r+1) ) and H jn(n r j/(2r j +1) ), respectively. Thus, we can construct the bracket [f θ i,b l jkn 2Cɛ, f θ i,b l jkn + 2Cɛ] covering I n. The bracket size is 4Cɛ. Hence, we obtain < H B (ɛ, I n, L 2 (P X )) H(ɛ/(4C), Θ n (n r 2r+1 ), ) + H(ɛ/(4C), H jn(n 2r j +1 ), ) max {K j log(1 + n r/(2r+1) /ɛ) j d r j based on Lemma 2. The corresponding δ-bracketing entropy integral is calculated as J B (δ, I n, L 2 (P X )) δ 1 + HB (ɛ, I n, L 2 (P X )) < max { K j n r 4r+2 δ 1/2. (A.22) j d Now, it is ready to apply Lemma in Van de Geer & Wellner (1996) to show E G n In = o(1) implying I 2n = o P (n 1/2 ). Note that f 2 < b jkn bjk 2 and f b jkn bjk for any f I n, and thus δ and M in Lemma of Van de Geer & Wellner (1996) are both 26

27 chosen as K r j j, i.e., n r j/(2r j +1). Then, by Lemma of Van de Geer & Wellner (1996) and (A.22), we have that E G n In = O ( ( ) n r 1 4r+2 + r j 4r j +2 n 4r 1 4r+2 ) = o(1). This completes the proof of (A.19). We next show (A.2) by similar arguments. Similarly, we have ā I (v) = Π ad(v, w) 1 b j (w j)uf(w j, v)dw j. Recall that ā I (v) = v l v exp(g (s))ā (s)ds. Under Condition M6 and the assumption that g H r c [l v, u v ], we can show that a Ik Hr +1 c [l v, u v ], which implies that a k Hr c [l v, u v ] for some < c <, based on Lemma 5. We next show that I 1n = o P (n 1/2 ) and I 2n = o P (n 1/2 ), where { I 1n = P lĝ[a kn a k ] l g [a kn a k ], { I 2n = (P n P ) lĝ[a kn a k ]. Similarly, by Cauchy-Schwartz Inequality, we can show that [ v ] I 1n < a kn a k θ θ 2 + P (exp(ĝ) exp(g ))(s)(a kn a k )(s)ds l v ) a kn a k ( θ θ 2 + Ĥ H 2 < < O P (n r/(2r+1) ) = o P (n 1/2 ) by choosing K j n 1/(2rj+1). Following similar arguments in analyzing I 2n, we can show that I 2n = o P (n 1/2 ). Thus, we have verified Condition B1 in Lemma 4. We again apply Lemma of Van de Geer & Wellner (1996) to verify Assumption B2. The details are skipped due to the similarity of the previous analysis. It remains to verify Assumption B3. This can be easily established using the Taylor expansion in Banach space. However, we first need to reparameterize the efficient score function l β (X; α) as [ ] V l β (X; α ) = ZQ θ (X) ā (s)dh(s) + b j (W j) Q θ (X) l v l β (X; α ) l η [ c ](X; α ), 27

28 where α = (β, H, h 1,..., h d ), η = (H, h 1,..., h d ) and c = (ā, b 1,..., b d ). We first derive two useful equalities (A.26)-(A.27). Let E α be the expectation corresponding to the reparametrized likelihood under the parameter α. Since E α lβ (X; α ) =, we have t t=e α t lβ (X; α t ) =, (A.23) where αt = α + tɛ. Define l β,β and l β,η [c] as the first derivative of l β w.r.t. β and η (along the direction c), respectively. By setting ɛ = (ɛ β,,..., ) and ɛ = (, e) = (, H, b 1,..., b d ), respectively, some calculations reveal that E { lβ,β (X; α)ɛ β + E { lβ (X; α) l β(x; α)ɛ β =, (A.24) E { lβ,η [e](x; α ) + E { lβ (X; α ) l η[e](x; α ) = (A.25) based on (A.23). By considering the orthogonal property of l β and the above reparametrization, we obtain the following two useful facts: Ĩ = E { lβ,β (X; α), (A.26) E { lβ,η [e](x; α) = (A.27) based on (A.24) and (A.25). Define l β,α,α [h 1, h 2 ](X; α ) as the second order Fréchet derivative of l β w.r.t. α along the direction [h 1, h 2 ] at the point α. The same notation rule applies to l β,α,α [h 1, h 2 ](X; α ) and l η,α,α [h 1, h 2, h 3 ](X; α ). Now we are ready to express the Taylor expansion as follows. E[ l β (X; α) l β (X; α )] = E[ l β (X; α ) l β (X; α)] = E { lβ,β (X; α) (β β ) + E { lβ,η [η η ](X; α) + 1 { lβ,α 2 E,α [ α, α ](X; α ) = Ĩ(β β ) + 1 { 2 E lβ,α,α [ α, α ](X; α ) l η,α,α [ c, α, α ](X; α ), where α = α α and α lies between α and α. The last equation in the above follows from (A.26) & (A.27). Now we only need to show that the second term in the last equation is of the order o( β β ) + o(n 1/2 ). 28

CURE MODEL WITH CURRENT STATUS DATA

CURE MODEL WITH CURRENT STATUS DATA Statistica Sinica 19 (2009), 233-249 CURE MODEL WITH CURRENT STATUS DATA Shuangge Ma Yale University Abstract: Current status data arise when only random censoring time and event status at censoring are

More information

Introduction to Empirical Processes and Semiparametric Inference Lecture 25: Semiparametric Models

Introduction to Empirical Processes and Semiparametric Inference Lecture 25: Semiparametric Models Introduction to Empirical Processes and Semiparametric Inference Lecture 25: Semiparametric Models Michael R. Kosorok, Ph.D. Professor and Chair of Biostatistics Professor of Statistics and Operations

More information

Efficiency of Profile/Partial Likelihood in the Cox Model

Efficiency of Profile/Partial Likelihood in the Cox Model Efficiency of Profile/Partial Likelihood in the Cox Model Yuichi Hirose School of Mathematics, Statistics and Operations Research, Victoria University of Wellington, New Zealand Summary. This paper shows

More information

1 Local Asymptotic Normality of Ranks and Covariates in Transformation Models

1 Local Asymptotic Normality of Ranks and Covariates in Transformation Models Draft: February 17, 1998 1 Local Asymptotic Normality of Ranks and Covariates in Transformation Models P.J. Bickel 1 and Y. Ritov 2 1.1 Introduction Le Cam and Yang (1988) addressed broadly the following

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 24 Paper 153 A Note on Empirical Likelihood Inference of Residual Life Regression Ying Qing Chen Yichuan

More information

Standard Errors & Confidence Intervals. N(0, I( β) 1 ), I( β) = [ 2 l(β, φ; y) β i β β= β j

Standard Errors & Confidence Intervals. N(0, I( β) 1 ), I( β) = [ 2 l(β, φ; y) β i β β= β j Standard Errors & Confidence Intervals β β asy N(0, I( β) 1 ), where I( β) = [ 2 l(β, φ; y) ] β i β β= β j We can obtain asymptotic 100(1 α)% confidence intervals for β j using: β j ± Z 1 α/2 se( β j )

More information

FULL LIKELIHOOD INFERENCES IN THE COX MODEL

FULL LIKELIHOOD INFERENCES IN THE COX MODEL October 20, 2007 FULL LIKELIHOOD INFERENCES IN THE COX MODEL BY JIAN-JIAN REN 1 AND MAI ZHOU 2 University of Central Florida and University of Kentucky Abstract We use the empirical likelihood approach

More information

STAT331. Cox s Proportional Hazards Model

STAT331. Cox s Proportional Hazards Model STAT331 Cox s Proportional Hazards Model In this unit we introduce Cox s proportional hazards (Cox s PH) model, give a heuristic development of the partial likelihood function, and discuss adaptations

More information

Other Survival Models. (1) Non-PH models. We briefly discussed the non-proportional hazards (non-ph) model

Other Survival Models. (1) Non-PH models. We briefly discussed the non-proportional hazards (non-ph) model Other Survival Models (1) Non-PH models We briefly discussed the non-proportional hazards (non-ph) model λ(t Z) = λ 0 (t) exp{β(t) Z}, where β(t) can be estimated by: piecewise constants (recall how);

More information

Likelihood Based Inference for Monotone Response Models

Likelihood Based Inference for Monotone Response Models Likelihood Based Inference for Monotone Response Models Moulinath Banerjee University of Michigan September 5, 25 Abstract The behavior of maximum likelihood estimates (MLE s) the likelihood ratio statistic

More information

Cox s proportional hazards model and Cox s partial likelihood

Cox s proportional hazards model and Cox s partial likelihood Cox s proportional hazards model and Cox s partial likelihood Rasmus Waagepetersen October 12, 2018 1 / 27 Non-parametric vs. parametric Suppose we want to estimate unknown function, e.g. survival function.

More information

Chapter 2 Inference on Mean Residual Life-Overview

Chapter 2 Inference on Mean Residual Life-Overview Chapter 2 Inference on Mean Residual Life-Overview Statistical inference based on the remaining lifetimes would be intuitively more appealing than the popular hazard function defined as the risk of immediate

More information

Efficient Estimation of Population Quantiles in General Semiparametric Regression Models

Efficient Estimation of Population Quantiles in General Semiparametric Regression Models Efficient Estimation of Population Quantiles in General Semiparametric Regression Models Arnab Maity 1 Department of Statistics, Texas A&M University, College Station TX 77843-3143, U.S.A. amaity@stat.tamu.edu

More information

Can we do statistical inference in a non-asymptotic way? 1

Can we do statistical inference in a non-asymptotic way? 1 Can we do statistical inference in a non-asymptotic way? 1 Guang Cheng 2 Statistics@Purdue www.science.purdue.edu/bigdata/ ONR Review Meeting@Duke Oct 11, 2017 1 Acknowledge NSF, ONR and Simons Foundation.

More information

Statistical Properties of Numerical Derivatives

Statistical Properties of Numerical Derivatives Statistical Properties of Numerical Derivatives Han Hong, Aprajit Mahajan, and Denis Nekipelov Stanford University and UC Berkeley November 2010 1 / 63 Motivation Introduction Many models have objective

More information

Smooth simultaneous confidence bands for cumulative distribution functions

Smooth simultaneous confidence bands for cumulative distribution functions Journal of Nonparametric Statistics, 2013 Vol. 25, No. 2, 395 407, http://dx.doi.org/10.1080/10485252.2012.759219 Smooth simultaneous confidence bands for cumulative distribution functions Jiangyan Wang

More information

arxiv:submit/ [math.st] 6 May 2011

arxiv:submit/ [math.st] 6 May 2011 A Continuous Mapping Theorem for the Smallest Argmax Functional arxiv:submit/0243372 [math.st] 6 May 2011 Emilio Seijo and Bodhisattva Sen Columbia University Abstract This paper introduces a version of

More information

Spring 2012 Math 541B Exam 1

Spring 2012 Math 541B Exam 1 Spring 2012 Math 541B Exam 1 1. A sample of size n is drawn without replacement from an urn containing N balls, m of which are red and N m are black; the balls are otherwise indistinguishable. Let X denote

More information

Verifying Regularity Conditions for Logit-Normal GLMM

Verifying Regularity Conditions for Logit-Normal GLMM Verifying Regularity Conditions for Logit-Normal GLMM Yun Ju Sung Charles J. Geyer January 10, 2006 In this note we verify the conditions of the theorems in Sung and Geyer (submitted) for the Logit-Normal

More information

CURRENT STATUS LINEAR REGRESSION. By Piet Groeneboom and Kim Hendrickx Delft University of Technology and Hasselt University

CURRENT STATUS LINEAR REGRESSION. By Piet Groeneboom and Kim Hendrickx Delft University of Technology and Hasselt University CURRENT STATUS LINEAR REGRESSION By Piet Groeneboom and Kim Hendrickx Delft University of Technology and Hasselt University We construct n-consistent and asymptotically normal estimates for the finite

More information

Introduction to Empirical Processes and Semiparametric Inference Lecture 13: Entropy Calculations

Introduction to Empirical Processes and Semiparametric Inference Lecture 13: Entropy Calculations Introduction to Empirical Processes and Semiparametric Inference Lecture 13: Entropy Calculations Michael R. Kosorok, Ph.D. Professor and Chair of Biostatistics Professor of Statistics and Operations Research

More information

The International Journal of Biostatistics

The International Journal of Biostatistics The International Journal of Biostatistics Volume 1, Issue 1 2005 Article 3 Score Statistics for Current Status Data: Comparisons with Likelihood Ratio and Wald Statistics Moulinath Banerjee Jon A. Wellner

More information

A note on L convergence of Neumann series approximation in missing data problems

A note on L convergence of Neumann series approximation in missing data problems A note on L convergence of Neumann series approximation in missing data problems Hua Yun Chen Division of Epidemiology & Biostatistics School of Public Health University of Illinois at Chicago 1603 West

More information

The deterministic Lasso

The deterministic Lasso The deterministic Lasso Sara van de Geer Seminar für Statistik, ETH Zürich Abstract We study high-dimensional generalized linear models and empirical risk minimization using the Lasso An oracle inequality

More information

A note on profile likelihood for exponential tilt mixture models

A note on profile likelihood for exponential tilt mixture models Biometrika (2009), 96, 1,pp. 229 236 C 2009 Biometrika Trust Printed in Great Britain doi: 10.1093/biomet/asn059 Advance Access publication 22 January 2009 A note on profile likelihood for exponential

More information

11 Survival Analysis and Empirical Likelihood

11 Survival Analysis and Empirical Likelihood 11 Survival Analysis and Empirical Likelihood The first paper of empirical likelihood is actually about confidence intervals with the Kaplan-Meier estimator (Thomas and Grunkmeier 1979), i.e. deals with

More information

UNIVERSITY OF CALIFORNIA, SAN DIEGO

UNIVERSITY OF CALIFORNIA, SAN DIEGO UNIVERSITY OF CALIFORNIA, SAN DIEGO Estimation of the primary hazard ratio in the presence of a secondary covariate with non-proportional hazards An undergraduate honors thesis submitted to the Department

More information

Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices

Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices Article Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices Fei Jin 1,2 and Lung-fei Lee 3, * 1 School of Economics, Shanghai University of Finance and Economics,

More information

Concentration behavior of the penalized least squares estimator

Concentration behavior of the penalized least squares estimator Concentration behavior of the penalized least squares estimator Penalized least squares behavior arxiv:1511.08698v2 [math.st] 19 Oct 2016 Alan Muro and Sara van de Geer {muro,geer}@stat.math.ethz.ch Seminar

More information

A Shape Constrained Estimator of Bidding Function of First-Price Sealed-Bid Auctions

A Shape Constrained Estimator of Bidding Function of First-Price Sealed-Bid Auctions A Shape Constrained Estimator of Bidding Function of First-Price Sealed-Bid Auctions Yu Yvette Zhang Abstract This paper is concerned with economic analysis of first-price sealed-bid auctions with risk

More information

LOGISTIC REGRESSION Joseph M. Hilbe

LOGISTIC REGRESSION Joseph M. Hilbe LOGISTIC REGRESSION Joseph M. Hilbe Arizona State University Logistic regression is the most common method used to model binary response data. When the response is binary, it typically takes the form of

More information

Bayesian Aggregation for Extraordinarily Large Dataset

Bayesian Aggregation for Extraordinarily Large Dataset Bayesian Aggregation for Extraordinarily Large Dataset Guang Cheng 1 Department of Statistics Purdue University www.science.purdue.edu/bigdata Department Seminar Statistics@LSE May 19, 2017 1 A Joint Work

More information

Support Vector Hazard Regression (SVHR) for Predicting Survival Outcomes. Donglin Zeng, Department of Biostatistics, University of North Carolina

Support Vector Hazard Regression (SVHR) for Predicting Survival Outcomes. Donglin Zeng, Department of Biostatistics, University of North Carolina Support Vector Hazard Regression (SVHR) for Predicting Survival Outcomes Introduction Method Theoretical Results Simulation Studies Application Conclusions Introduction Introduction For survival data,

More information

Regression, Ridge Regression, Lasso

Regression, Ridge Regression, Lasso Regression, Ridge Regression, Lasso Fabio G. Cozman - fgcozman@usp.br October 2, 2018 A general definition Regression studies the relationship between a response variable Y and covariates X 1,..., X n.

More information

Estimation and Inference of Quantile Regression. for Survival Data under Biased Sampling

Estimation and Inference of Quantile Regression. for Survival Data under Biased Sampling Estimation and Inference of Quantile Regression for Survival Data under Biased Sampling Supplementary Materials: Proofs of the Main Results S1 Verification of the weight function v i (t) for the lengthbiased

More information

5.1 Approximation of convex functions

5.1 Approximation of convex functions Chapter 5 Convexity Very rough draft 5.1 Approximation of convex functions Convexity::approximation bound Expand Convexity simplifies arguments from Chapter 3. Reasons: local minima are global minima and

More information

Statistics 203: Introduction to Regression and Analysis of Variance Course review

Statistics 203: Introduction to Regression and Analysis of Variance Course review Statistics 203: Introduction to Regression and Analysis of Variance Course review Jonathan Taylor - p. 1/?? Today Review / overview of what we learned. - p. 2/?? General themes in regression models Specifying

More information

Inference For High Dimensional M-estimates. Fixed Design Results

Inference For High Dimensional M-estimates. Fixed Design Results : Fixed Design Results Lihua Lei Advisors: Peter J. Bickel, Michael I. Jordan joint work with Peter J. Bickel and Noureddine El Karoui Dec. 8, 2016 1/57 Table of Contents 1 Background 2 Main Results and

More information

asymptotic normality of nonparametric M-estimators with applications to hypothesis testing for panel count data

asymptotic normality of nonparametric M-estimators with applications to hypothesis testing for panel count data asymptotic normality of nonparametric M-estimators with applications to hypothesis testing for panel count data Xingqiu Zhao and Ying Zhang The Hong Kong Polytechnic University and Indiana University Abstract:

More information

log T = β T Z + ɛ Zi Z(u; β) } dn i (ue βzi ) = 0,

log T = β T Z + ɛ Zi Z(u; β) } dn i (ue βzi ) = 0, Accelerated failure time model: log T = β T Z + ɛ β estimation: solve where S n ( β) = n i=1 { Zi Z(u; β) } dn i (ue βzi ) = 0, Z(u; β) = j Z j Y j (ue βz j) j Y j (ue βz j) How do we show the asymptotics

More information

Effective Dimension and Generalization of Kernel Learning

Effective Dimension and Generalization of Kernel Learning Effective Dimension and Generalization of Kernel Learning Tong Zhang IBM T.J. Watson Research Center Yorktown Heights, Y 10598 tzhang@watson.ibm.com Abstract We investigate the generalization performance

More information

Asymptotic statistics using the Functional Delta Method

Asymptotic statistics using the Functional Delta Method Quantiles, Order Statistics and L-Statsitics TU Kaiserslautern 15. Februar 2015 Motivation Functional The delta method introduced in chapter 3 is an useful technique to turn the weak convergence of random

More information

Introduction to Empirical Processes and Semiparametric Inference Lecture 02: Overview Continued

Introduction to Empirical Processes and Semiparametric Inference Lecture 02: Overview Continued Introduction to Empirical Processes and Semiparametric Inference Lecture 02: Overview Continued Michael R. Kosorok, Ph.D. Professor and Chair of Biostatistics Professor of Statistics and Operations Research

More information

Additive Isotonic Regression

Additive Isotonic Regression Additive Isotonic Regression Enno Mammen and Kyusang Yu 11. July 2006 INTRODUCTION: We have i.i.d. random vectors (Y 1, X 1 ),..., (Y n, X n ) with X i = (X1 i,..., X d i ) and we consider the additive

More information

A Least-Squares Approach to Consistent Information Estimation in Semiparametric Models

A Least-Squares Approach to Consistent Information Estimation in Semiparametric Models A Least-Squares Approach to Consistent Information Estimation in Semiparametric Models Jian Huang Department of Statistics University of Iowa Ying Zhang and Lei Hua Department of Biostatistics University

More information

STAT 992 Paper Review: Sure Independence Screening in Generalized Linear Models with NP-Dimensionality J.Fan and R.Song

STAT 992 Paper Review: Sure Independence Screening in Generalized Linear Models with NP-Dimensionality J.Fan and R.Song STAT 992 Paper Review: Sure Independence Screening in Generalized Linear Models with NP-Dimensionality J.Fan and R.Song Presenter: Jiwei Zhao Department of Statistics University of Wisconsin Madison April

More information

Hypothesis Testing Based on the Maximum of Two Statistics from Weighted and Unweighted Estimating Equations

Hypothesis Testing Based on the Maximum of Two Statistics from Weighted and Unweighted Estimating Equations Hypothesis Testing Based on the Maximum of Two Statistics from Weighted and Unweighted Estimating Equations Takeshi Emura and Hisayuki Tsukuma Abstract For testing the regression parameter in multivariate

More information

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A. 1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n

More information

Product-limit estimators of the survival function with left or right censored data

Product-limit estimators of the survival function with left or right censored data Product-limit estimators of the survival function with left or right censored data 1 CREST-ENSAI Campus de Ker-Lann Rue Blaise Pascal - BP 37203 35172 Bruz cedex, France (e-mail: patilea@ensai.fr) 2 Institut

More information

The Restricted Likelihood Ratio Test at the Boundary in Autoregressive Series

The Restricted Likelihood Ratio Test at the Boundary in Autoregressive Series The Restricted Likelihood Ratio Test at the Boundary in Autoregressive Series Willa W. Chen Rohit S. Deo July 6, 009 Abstract. The restricted likelihood ratio test, RLRT, for the autoregressive coefficient

More information

Lecture Notes in Advanced Calculus 1 (80315) Raz Kupferman Institute of Mathematics The Hebrew University

Lecture Notes in Advanced Calculus 1 (80315) Raz Kupferman Institute of Mathematics The Hebrew University Lecture Notes in Advanced Calculus 1 (80315) Raz Kupferman Institute of Mathematics The Hebrew University February 7, 2007 2 Contents 1 Metric Spaces 1 1.1 Basic definitions...........................

More information

Estimation of Time Transformation Models with. Bernstein Polynomials

Estimation of Time Transformation Models with. Bernstein Polynomials Estimation of Time Transformation Models with Bernstein Polynomials Alexander C. McLain and Sujit K. Ghosh June 12, 2009 Institute of Statistics Mimeo Series #2625 Abstract Time transformation models assume

More information

Pairwise rank based likelihood for estimating the relationship between two homogeneous populations and their mixture proportion

Pairwise rank based likelihood for estimating the relationship between two homogeneous populations and their mixture proportion Pairwise rank based likelihood for estimating the relationship between two homogeneous populations and their mixture proportion Glenn Heller and Jing Qin Department of Epidemiology and Biostatistics Memorial

More information

Fall, 2007 Nonlinear Econometrics. Theory: Consistency for Extremum Estimators. Modeling: Probit, Logit, and Other Links.

Fall, 2007 Nonlinear Econometrics. Theory: Consistency for Extremum Estimators. Modeling: Probit, Logit, and Other Links. 14.385 Fall, 2007 Nonlinear Econometrics Lecture 2. Theory: Consistency for Extremum Estimators Modeling: Probit, Logit, and Other Links. 1 Example: Binary Choice Models. The latent outcome is defined

More information

Nuisance parameter elimination for proportional likelihood ratio models with nonignorable missingness and random truncation

Nuisance parameter elimination for proportional likelihood ratio models with nonignorable missingness and random truncation Biometrika Advance Access published October 24, 202 Biometrika (202), pp. 8 C 202 Biometrika rust Printed in Great Britain doi: 0.093/biomet/ass056 Nuisance parameter elimination for proportional likelihood

More information

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) = Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

More information

Chapter 6. Order Statistics and Quantiles. 6.1 Extreme Order Statistics

Chapter 6. Order Statistics and Quantiles. 6.1 Extreme Order Statistics Chapter 6 Order Statistics and Quantiles 61 Extreme Order Statistics Suppose we have a finite sample X 1,, X n Conditional on this sample, we define the values X 1),, X n) to be a permutation of X 1,,

More information

Survival Analysis I (CHL5209H)

Survival Analysis I (CHL5209H) Survival Analysis Dalla Lana School of Public Health University of Toronto olli.saarela@utoronto.ca January 7, 2015 31-1 Literature Clayton D & Hills M (1993): Statistical Models in Epidemiology. Not really

More information

A FRAILTY MODEL APPROACH FOR REGRESSION ANALYSIS OF BIVARIATE INTERVAL-CENSORED SURVIVAL DATA

A FRAILTY MODEL APPROACH FOR REGRESSION ANALYSIS OF BIVARIATE INTERVAL-CENSORED SURVIVAL DATA Statistica Sinica 23 (2013), 383-408 doi:http://dx.doi.org/10.5705/ss.2011.151 A FRAILTY MODEL APPROACH FOR REGRESSION ANALYSIS OF BIVARIATE INTERVAL-CENSORED SURVIVAL DATA Chi-Chung Wen and Yi-Hau Chen

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Analysing geoadditive regression data: a mixed model approach

Analysing geoadditive regression data: a mixed model approach Analysing geoadditive regression data: a mixed model approach Institut für Statistik, Ludwig-Maximilians-Universität München Joint work with Ludwig Fahrmeir & Stefan Lang 25.11.2005 Spatio-temporal regression

More information

Minimax Rate of Convergence for an Estimator of the Functional Component in a Semiparametric Multivariate Partially Linear Model.

Minimax Rate of Convergence for an Estimator of the Functional Component in a Semiparametric Multivariate Partially Linear Model. Minimax Rate of Convergence for an Estimator of the Functional Component in a Semiparametric Multivariate Partially Linear Model By Michael Levine Purdue University Technical Report #14-03 Department of

More information

Approximation of Survival Function by Taylor Series for General Partly Interval Censored Data

Approximation of Survival Function by Taylor Series for General Partly Interval Censored Data Malaysian Journal of Mathematical Sciences 11(3): 33 315 (217) MALAYSIAN JOURNAL OF MATHEMATICAL SCIENCES Journal homepage: http://einspem.upm.edu.my/journal Approximation of Survival Function by Taylor

More information

Semiparametric Regression

Semiparametric Regression Semiparametric Regression Patrick Breheny October 22 Patrick Breheny Survival Data Analysis (BIOS 7210) 1/23 Introduction Over the past few weeks, we ve introduced a variety of regression models under

More information

NONPARAMETRIC CONFIDENCE INTERVALS FOR MONOTONE FUNCTIONS. By Piet Groeneboom and Geurt Jongbloed Delft University of Technology

NONPARAMETRIC CONFIDENCE INTERVALS FOR MONOTONE FUNCTIONS. By Piet Groeneboom and Geurt Jongbloed Delft University of Technology NONPARAMETRIC CONFIDENCE INTERVALS FOR MONOTONE FUNCTIONS By Piet Groeneboom and Geurt Jongbloed Delft University of Technology We study nonparametric isotonic confidence intervals for monotone functions.

More information

Survival Analysis Math 434 Fall 2011

Survival Analysis Math 434 Fall 2011 Survival Analysis Math 434 Fall 2011 Part IV: Chap. 8,9.2,9.3,11: Semiparametric Proportional Hazards Regression Jimin Ding Math Dept. www.math.wustl.edu/ jmding/math434/fall09/index.html Basic Model Setup

More information

1 Lyapunov theory of stability

1 Lyapunov theory of stability M.Kawski, APM 581 Diff Equns Intro to Lyapunov theory. November 15, 29 1 1 Lyapunov theory of stability Introduction. Lyapunov s second (or direct) method provides tools for studying (asymptotic) stability

More information

Step-Stress Models and Associated Inference

Step-Stress Models and Associated Inference Department of Mathematics & Statistics Indian Institute of Technology Kanpur August 19, 2014 Outline Accelerated Life Test 1 Accelerated Life Test 2 3 4 5 6 7 Outline Accelerated Life Test 1 Accelerated

More information

Euler Equations: local existence

Euler Equations: local existence Euler Equations: local existence Mat 529, Lesson 2. 1 Active scalars formulation We start with a lemma. Lemma 1. Assume that w is a magnetization variable, i.e. t w + u w + ( u) w = 0. If u = Pw then u

More information

FULL LIKELIHOOD INFERENCES IN THE COX MODEL: AN EMPIRICAL LIKELIHOOD APPROACH

FULL LIKELIHOOD INFERENCES IN THE COX MODEL: AN EMPIRICAL LIKELIHOOD APPROACH FULL LIKELIHOOD INFERENCES IN THE COX MODEL: AN EMPIRICAL LIKELIHOOD APPROACH Jian-Jian Ren 1 and Mai Zhou 2 University of Central Florida and University of Kentucky Abstract: For the regression parameter

More information

Lecture 7 Introduction to Statistical Decision Theory

Lecture 7 Introduction to Statistical Decision Theory Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7

More information

Tests of independence for censored bivariate failure time data

Tests of independence for censored bivariate failure time data Tests of independence for censored bivariate failure time data Abstract Bivariate failure time data is widely used in survival analysis, for example, in twins study. This article presents a class of χ

More information

Some Theories about Backfitting Algorithm for Varying Coefficient Partially Linear Model

Some Theories about Backfitting Algorithm for Varying Coefficient Partially Linear Model Some Theories about Backfitting Algorithm for Varying Coefficient Partially Linear Model 1. Introduction Varying-coefficient partially linear model (Zhang, Lee, and Song, 2002; Xia, Zhang, and Tong, 2004;

More information

Notes, March 4, 2013, R. Dudley Maximum likelihood estimation: actual or supposed

Notes, March 4, 2013, R. Dudley Maximum likelihood estimation: actual or supposed 18.466 Notes, March 4, 2013, R. Dudley Maximum likelihood estimation: actual or supposed 1. MLEs in exponential families Let f(x,θ) for x X and θ Θ be a likelihood function, that is, for present purposes,

More information

MAS3301 / MAS8311 Biostatistics Part II: Survival

MAS3301 / MAS8311 Biostatistics Part II: Survival MAS3301 / MAS8311 Biostatistics Part II: Survival M. Farrow School of Mathematics and Statistics Newcastle University Semester 2, 2009-10 1 13 The Cox proportional hazards model 13.1 Introduction In the

More information

Parameters Estimation for a Linear Exponential Distribution Based on Grouped Data

Parameters Estimation for a Linear Exponential Distribution Based on Grouped Data International Mathematical Forum, 3, 2008, no. 33, 1643-1654 Parameters Estimation for a Linear Exponential Distribution Based on Grouped Data A. Al-khedhairi Department of Statistics and O.R. Faculty

More information

Regression models for multivariate ordered responses via the Plackett distribution

Regression models for multivariate ordered responses via the Plackett distribution Journal of Multivariate Analysis 99 (2008) 2472 2478 www.elsevier.com/locate/jmva Regression models for multivariate ordered responses via the Plackett distribution A. Forcina a,, V. Dardanoni b a Dipartimento

More information

INDEX. Bolzano-Weierstrass theorem, for sequences, boundary points, bounded functions, 142 bounded sets, 42 43

INDEX. Bolzano-Weierstrass theorem, for sequences, boundary points, bounded functions, 142 bounded sets, 42 43 INDEX Abel s identity, 131 Abel s test, 131 132 Abel s theorem, 463 464 absolute convergence, 113 114 implication of conditional convergence, 114 absolute value, 7 reverse triangle inequality, 9 triangle

More information

A Bootstrap Test for Conditional Symmetry

A Bootstrap Test for Conditional Symmetry ANNALS OF ECONOMICS AND FINANCE 6, 51 61 005) A Bootstrap Test for Conditional Symmetry Liangjun Su Guanghua School of Management, Peking University E-mail: lsu@gsm.pku.edu.cn and Sainan Jin Guanghua School

More information

Least squares under convex constraint

Least squares under convex constraint Stanford University Questions Let Z be an n-dimensional standard Gaussian random vector. Let µ be a point in R n and let Y = Z + µ. We are interested in estimating µ from the data vector Y, under the assumption

More information

Modelling geoadditive survival data

Modelling geoadditive survival data Modelling geoadditive survival data Thomas Kneib & Ludwig Fahrmeir Department of Statistics, Ludwig-Maximilians-University Munich 1. Leukemia survival data 2. Structured hazard regression 3. Mixed model

More information

Variable Selection and Model Choice in Survival Models with Time-Varying Effects

Variable Selection and Model Choice in Survival Models with Time-Varying Effects Variable Selection and Model Choice in Survival Models with Time-Varying Effects Boosting Survival Models Benjamin Hofner 1 Department of Medical Informatics, Biometry and Epidemiology (IMBE) Friedrich-Alexander-Universität

More information

Likelihood Based Inference for Monotone Response Models

Likelihood Based Inference for Monotone Response Models Likelihood Based Inference for Monotone Response Models Moulinath Banerjee University of Michigan September 11, 2006 Abstract The behavior of maximum likelihood estimates (MLEs) and the likelihood ratio

More information

Nonparametric Identification and Estimation of a Transformation Model

Nonparametric Identification and Estimation of a Transformation Model Nonparametric and of a Transformation Model Hidehiko Ichimura and Sokbae Lee University of Tokyo and Seoul National University 15 February, 2012 Outline 1. The Model and Motivation 2. 3. Consistency 4.

More information

Exact Inference for the Two-Parameter Exponential Distribution Under Type-II Hybrid Censoring

Exact Inference for the Two-Parameter Exponential Distribution Under Type-II Hybrid Censoring Exact Inference for the Two-Parameter Exponential Distribution Under Type-II Hybrid Censoring A. Ganguly, S. Mitra, D. Samanta, D. Kundu,2 Abstract Epstein [9] introduced the Type-I hybrid censoring scheme

More information

Topological properties

Topological properties CHAPTER 4 Topological properties 1. Connectedness Definitions and examples Basic properties Connected components Connected versus path connected, again 2. Compactness Definition and first examples Topological

More information

Adaptive Piecewise Polynomial Estimation via Trend Filtering

Adaptive Piecewise Polynomial Estimation via Trend Filtering Adaptive Piecewise Polynomial Estimation via Trend Filtering Liubo Li, ShanShan Tu The Ohio State University li.2201@osu.edu, tu.162@osu.edu October 1, 2015 Liubo Li, ShanShan Tu (OSU) Trend Filtering

More information

1. Nonlinear Equations. This lecture note excerpted parts from Michael Heath and Max Gunzburger. f(x) = 0

1. Nonlinear Equations. This lecture note excerpted parts from Michael Heath and Max Gunzburger. f(x) = 0 Numerical Analysis 1 1. Nonlinear Equations This lecture note excerpted parts from Michael Heath and Max Gunzburger. Given function f, we seek value x for which where f : D R n R n is nonlinear. f(x) =

More information

Optimization. The value x is called a maximizer of f and is written argmax X f. g(λx + (1 λ)y) < λg(x) + (1 λ)g(y) 0 < λ < 1; x, y X.

Optimization. The value x is called a maximizer of f and is written argmax X f. g(λx + (1 λ)y) < λg(x) + (1 λ)g(y) 0 < λ < 1; x, y X. Optimization Background: Problem: given a function f(x) defined on X, find x such that f(x ) f(x) for all x X. The value x is called a maximizer of f and is written argmax X f. In general, argmax X f may

More information

Asymptotics for posterior hazards

Asymptotics for posterior hazards Asymptotics for posterior hazards Pierpaolo De Blasi University of Turin 10th August 2007, BNR Workshop, Isaac Newton Intitute, Cambridge, UK Joint work with Giovanni Peccati (Université Paris VI) and

More information

Model Selection and Geometry

Model Selection and Geometry Model Selection and Geometry Pascal Massart Université Paris-Sud, Orsay Leipzig, February Purpose of the talk! Concentration of measure plays a fundamental role in the theory of model selection! Model

More information

PENALIZED LOG-LIKELIHOOD ESTIMATION FOR PARTLY LINEAR TRANSFORMATION MODELS WITH CURRENT STATUS DATA 1

PENALIZED LOG-LIKELIHOOD ESTIMATION FOR PARTLY LINEAR TRANSFORMATION MODELS WITH CURRENT STATUS DATA 1 The Annals of Statistics 2005, Vol. 33, No. 5, 2256 2290 DOI: 10.1214/009053605000000444 Institute of Mathematical Statistics, 2005 PENALIZED LOG-LIKELIHOOD ESTIMATION FOR PARTLY LINEAR TRANSFORMATION

More information

Continuous Functions on Metric Spaces

Continuous Functions on Metric Spaces Continuous Functions on Metric Spaces Math 201A, Fall 2016 1 Continuous functions Definition 1. Let (X, d X ) and (Y, d Y ) be metric spaces. A function f : X Y is continuous at a X if for every ɛ > 0

More information

Improving Efficiency of Inferences in Randomized Clinical Trials Using Auxiliary Covariates

Improving Efficiency of Inferences in Randomized Clinical Trials Using Auxiliary Covariates Improving Efficiency of Inferences in Randomized Clinical Trials Using Auxiliary Covariates Anastasios (Butch) Tsiatis Department of Statistics North Carolina State University http://www.stat.ncsu.edu/

More information

Calculus of Variations. Final Examination

Calculus of Variations. Final Examination Université Paris-Saclay M AMS and Optimization January 18th, 018 Calculus of Variations Final Examination Duration : 3h ; all kind of paper documents (notes, books...) are authorized. The total score of

More information

Kneib, Fahrmeir: Supplement to "Structured additive regression for categorical space-time data: A mixed model approach"

Kneib, Fahrmeir: Supplement to Structured additive regression for categorical space-time data: A mixed model approach Kneib, Fahrmeir: Supplement to "Structured additive regression for categorical space-time data: A mixed model approach" Sonderforschungsbereich 386, Paper 43 (25) Online unter: http://epub.ub.uni-muenchen.de/

More information

Maximum likelihood estimation of a log-concave density based on censored data

Maximum likelihood estimation of a log-concave density based on censored data Maximum likelihood estimation of a log-concave density based on censored data Dominic Schuhmacher Institute of Mathematical Statistics and Actuarial Science University of Bern Joint work with Lutz Dümbgen

More information

Lecture 3. Truncation, length-bias and prevalence sampling

Lecture 3. Truncation, length-bias and prevalence sampling Lecture 3. Truncation, length-bias and prevalence sampling 3.1 Prevalent sampling Statistical techniques for truncated data have been integrated into survival analysis in last two decades. Truncation in

More information

Consider Table 1 (Note connection to start-stop process).

Consider Table 1 (Note connection to start-stop process). Discrete-Time Data and Models Discretized duration data are still duration data! Consider Table 1 (Note connection to start-stop process). Table 1: Example of Discrete-Time Event History Data Case Event

More information

BIAS OF MAXIMUM-LIKELIHOOD ESTIMATES IN LOGISTIC AND COX REGRESSION MODELS: A COMPARATIVE SIMULATION STUDY

BIAS OF MAXIMUM-LIKELIHOOD ESTIMATES IN LOGISTIC AND COX REGRESSION MODELS: A COMPARATIVE SIMULATION STUDY BIAS OF MAXIMUM-LIKELIHOOD ESTIMATES IN LOGISTIC AND COX REGRESSION MODELS: A COMPARATIVE SIMULATION STUDY Ingo Langner 1, Ralf Bender 2, Rebecca Lenz-Tönjes 1, Helmut Küchenhoff 2, Maria Blettner 2 1

More information