A Least-Squares Approach to Consistent Information Estimation in Semiparametric Models

Size: px

Start display at page:

Download "A Least-Squares Approach to Consistent Information Estimation in Semiparametric Models"

Philomena Park
5 years ago
Views:

1 A Least-Squares Approach to Consistent Information Estimation in Semiparametric Models Jian Huang Department of Statistics University of Iowa Ying Zhang and Lei Hua Department of Biostatistics University of Iowa May 5, 2008 Abstract A method is proposed for consistent information estimation in a class of semiparametric models. The method is based on the geometric interpretation of the efficient score function, that it is the residual of the orthogonal projection of the score function for the finite-dimensional parameter onto the tangent space for the infinitedimensional parameter. The empirical version of this projection is a least-squares nonparametric regression problem. Under appropriate conditions, the sum of squared residuals of this regression is shown to be a consistent estimator of the efficient Fisher information and is actually the observed information for a class of sieve maximum likelihood estimators. Simulations studies are conducted to evaluate finite sample performance of the estimator in two illustrating examples: Poisson proportional mean model for panel count data and Cox model for interval-censored data. Finally the method is applied to two real-life examples: bladder tumor study and breast cosmesis study. Key words and phrases. counting process, Cox model, empirical processes, information, interval censoring, maximum (profile) likelihood, monotone polynomial splines, panel count data, proportional hazards model, semiparametric model 1

2 1. Introduction In a regular parametric model, the maximum likelihood estimator (MLE) is asymptotically normal with variance equal to the inverse of the Fisher information, and the Fisher information can be estimated by the observed information. This result provides large sample justification for the use of normal approximation to the distribution of MLE. An important factor making this approximation useful in statistical inference is that the observed information can be readily computed and is consistent. In many situations, consistency of the observed information follows directly from the law of large numbers and consistency of MLE. Asymptotic normality of MLE of the regular parameters continues to hold in many semiparametric and nonparametric models. See for example, Chen (1988, 1995), Chang (1990), Geskus and Groeneboom (1996), Gill (1989), Groeneboom (1996), Groeneboom and Wellner (1992), Gu and Zhang (1993), Murphy (1995), Murphy, Rossini and van der Vaart (1997), Qin and Lawless (1993), Severini and Wong (1991), van der Laan (1993), van der Vaart (1994, 1996), Wong and Severini (1991), Huang (1996), Huang and Rossini (1997) and Wellner and Zhang (2007). In all these examples, the MLE or a smooth functional of the MLE is asymptotically normal with variance equal to the inverse of the efficient Fisher information. The asymptotic normality and efficiency results provide insight to the theoretical properties of maximum likelihood estimators. Unfortunately, in many semiparametric models studied in the aforementioned articles, the efficient Fisher information is either very complicated or may not have an explicit expression and they often can not be estimated directly as in the case of parametric MLE. Thus effort is required to estimate the asymptotic variance in order to apply these results to statistical inference. In this paper, we consider consistent information estimation in a class of semiparametric 2

3 models that are parametrized in terms of a finite-dimensional parameter θ and a parameter φ whose dimension increases with sample size. Hence φ is often called an infinite-dimensional parameter. Two important examples are Cox s (1972) proportional hazards model for interval-censored data studied by Huang and Wellner (1995) and the proportional mean model for panel count data proposed by Sun and Wei (2000) and Wellner and Zhang (2007). In these two examples, θ is the finite-dimensional regression coefficient, φ is the log of baseline hazard function or the log of baseline mean function. At least two methods used in parametric models for estimating the variance have been suggested in semiparametric models. The first method is to use the inverse observed information based on the likelihood, correcting for the presence of the infinite-dimensional parameter φ. However, when φ cannot be estimated at the usual root-n rate, consistency of this estimator has not been proved in general. The second method is to use the second derivative of the profile likelihood of θ at the maximum likelihood estimate. For a fixed value of θ, the profile likelihood is the maximum of the likelihood with respect to φ. Because the profile likelihood often can only be computed numerically, discretized versions of its second derivative are proposed by Nielson, Gill, Andersen and Sorensen (1992), and used by Huang and Wellner (1995) and Murphy and van der Vaart (1996). Using an ingenious sandwich inequality, Murphy and van der Vaart (2000) showed that the discretized versions of the second derivative of profile likelihood provides consistent variance estimator in a large class of semiparametric models. Murphy and van der Vaart (1999) also proved that the profile likelihood resembles the ordinary likelihood in many essential aspects. When the dimension of θ is one or two, it is relatively easy to compute and visualize the profile likelihood. For moderate to high dimensional θ, implementation of the profile likelihood approach may be computationally demanding and 3

4 sometimes difficult. In a general semiparametric maximum likelihood estimation problem, the joint estimation of (θ, φ) is often a quite challenging problem numerically. Sieve semiparametric M-estimation using polynomial splines proposed by Lu, Zhang and Huang (2008) for panel count data is shown numerically efficient. However, the estimation of standard error for the M-estimator of regression parameter still remains a big task and alternatively, they used the bootstrap standard error by taking the computation advantage in spline-based sieve M-estimation. In this article, we propose a least-squares approach to consistent estimation of the information matrix of the semiparametric maximum likelihood estimator of θ. The proposed method is based on the geometric interpretation of the efficient score function, that it is the residual of the projection of the score function for θ onto the tangent space for φ (van der Vaart, 1991, Bickel, Klaassen, Ritov and Wellner, 1993, Chapter 3). Thus the theoretical information calculation is a least-squares problem in a Hilbert space. When a sample from the model is available, this theoretical information can be estimated by its empirical version. It turns out that this empirical version is essentially a least-squares nonparametric regression problem, due to the fact that the score function for the infinite-dimensional parameter is a linear operator. In this nonparametric regression problem, the response is the score function for the finite-dimensional parameter θ, the covariate is the linear score operator for the infinite-dimensional parameter φ, and the regression parameter is the least favorable direction which is used to define the efficient score. Computationally, the proposed method can be implemented with a least-squares nonparametric regression fitting program. In a class of sieve MLE s using a linear approximation space, the proposed estimator of information matrix is shown to be the same as the inverse of the observed information matrix for the sieve MLE. This equivalence is useful from both the theoretical and the computational 4

5 point of view. First, this equivalence enables a simple and indirect consistency proof of the observed information matrix in the semiparametric setting. Second, it provides two ways of computing the observed information matrix: one can either directly compute the observed information matrix or fit a least-squares nonparametric regression. Because of its numerical convenience and good theoretical properties, the class of sieve MLE s using polynomial splines is utilized in our numerical demonstration and is recommended for applications of general semiparametric estimation. The paper is organized as follows. Section 2 describes the motivation and the leastsquares approach. Section 3 specializes the general approach to a class of sieve MLE s. Section 4 applies the proposed method along with the spline-based sieve MLE to two models, semiparametric Poisson mean model for panel count data and Cox proportional hazards model for interval censored data, studied in Wellner and Zhang (2007) and Huang and Wellner (1995), respectively. Section 5 renders numerical results via simulations and applications in real-life examples for the models discussed in Section 4. Section 6 concludes with some discussions. Some technical details are included in appendices. 2. The Least-Squares Approach Let X 1,..., X n be independent random variables with a common probability measure P θ,φ, where (θ, φ) Θ Φ. Here Θ is a subset of R d and Φ is a general space. Assume that P θ,φ has a density p( ; θ, φ) with respect to a σ-finite measure. Denote τ = (θ, φ) and let τ 0 = (θ 0, φ 0 ) Θ Φ be the true parameter value under which the data are generated. The maximum likelihood estimator of τ 0 is the value ˆτ n (ˆθ n, ˆφ n ) that maximizes the 5

6 log-likelihood l n (τ) = n log p(x i ; θ, φ) i=1 over the parameter space T Θ Φ. Let be the Euclidean distance of R d, and Φ be an appropriate norm defined on Φ. We assume it has been shown that (2.1) ˆτ n τ 0 T { ˆθ n θ ˆφ } 1/2 n φ 0 2 Φ = Op (rn 1 ), where r n is a sequence of numbers converging to infinity. Consistency and rate of convergence in nonparametric and semiparametric models have been addressed by many authors, see for example, Birgé and Massart (1993), van der Geer (1993), Shen and Wong (1995), and van der Vaart and Wellner (1996). The results and methods developed by these authors can often be used to verify (2.1). The motivation to study consistent information estimation is the following. In many semiparametric models, in addition to (2.1), it can be shown that (2.2) n 1/2 (ˆθn θ 0 ) d N ( 0, I 1 (θ 0 ) ), where I(θ 0 ) is the efficient Fisher information for θ 0, adjusting for the presence of nuisance parameter φ. The definition of I(θ 0 ) is given below. This holds for models cited in the previous section and for the examples in Section 4. Thus estimation of the asymptotic variance of ˆθ n is equivalent to estimation of I(θ 0 ) provided I(θ 0 ) is nonsingular. Of course, for the problem of estimating I(θ 0 ) to be meaningful, we need to first establish (2.2). The calculation of I(θ 0 ) and its central role in asymptotic efficiency theory for semiparametric models have been systematically studied by Begun, Hall, Huang and Wellner 6

7 (1983), van der Vaart (1991), and BKRW (1993) and the references therein. In the following, We first briefly describe how the information I(θ 0 ) is defined in a general semiparametric MLE setting in order to motivate the proposed information estimator. Let l(θ, φ; x) = log p(x; θ, φ) be the log-likelihood for a sample of size one. Consider a parametric smooth submodel with parameter (θ, φ (s) ), where φ (0) = φ and φ (s) s = h. s=0 Let H be the class of functions h defined by this equation. Usually, H is a Hilbert space. The score operator for φ is (2.3) l 2 (τ; x)(h) = s l(θ, φ (s); x). s=0 Observe that l 2 is a linear operator mapping H to L 2 (P θ,φ ). So for constants c 1, c 2 and h 1, h 2 H, (2.4) l 2 (τ; x)(c 1 h 1 + c 2 h 2 ) = c 1 l2 (τ; x)(h 1 ) + c 2 l2 (τ; x)(h 2 ). The linearity of l 2 is crucial to the proposed method. For a d-dimensional θ, l 1 (τ; x) is the vector of partial derivatives of l(τ; x) with respect to θ. For each component of l 1, a score operator for φ is defined as in (2.3). So the score operator for φ corresponding to l 1 is defined as (2.5) l 2 (τ; x)(h) ( l 2 (τ; x)(h 1 ),..., l 2 (τ; x)(h d )), 7

8 where h (h 1,..., h d ) with h k H, 1 k d. Let Ṗ2 be the closed linear span of { l 2 (h) : h H}. Then the efficient score function for the kth component of θ is l 1,k Π( l 1,k Ṗ2), where l 1,k is the kth component of l 1 (τ; x) and Π( l 1,k Ṗ2) is the projection of l 1,k onto Ṗ2. Equivalently, Π( l 1,k Ṗ2) is the minimizer of the squared residual P [ l 1,k (τ 0 ; X) η] 2 over η Ṗ2. See for example, van der Vaart (1991), Section 6, and BKRW (1993), Theorem 1, page 70. In general, η may not be a score function for φ, that is, there may not exist a h H such that η = l 2 (h). However, in models with regularity conditions, η either is a score function or can be approximated by a score function. So we assume that the efficient score vector for θ is l 1 (τ; x) l 2 (τ; x)(ξ 0 ), where ξ 0 is an element of H d that minimizes (2.6) ρ(h) E l 1 (τ; X) l 2 (τ; X)(h) 2 over H d. The minimizer ξ 0 = (ξ 01, ξ 02,..., ξ 0d ) is called the least favorable direction. Denote the efficient score by l (τ; x) l 1 (τ; x) l 2 (τ; x)(ξ 0 ). Then the information for θ is (2.7) I(θ) = E l (τ; X) 2 = E l 1 (τ; X) l 2 (τ; X)(ξ 0 ) 2. Therefore, to estimate I(θ), it is natural to consider minimizing an empirical version of (2.6). In particular, with the random sample X 1,..., X n and the consistent estimator τ n, we can estimate I(θ) by the minimum value of (2.8) ρ n (h) n 1 n l 1 ( τ n ; X i ) l 2 ( τ n ; X i )(h) 2 i=1 over H d. That is, if ξ n is a minimizer of ρ n over H d, then a natural estimator of I(θ 0 ) 8

9 is În ρ n ( ξ n ). Because l 2 is a linear operator, this minimization problem is essentially a least-squares nonparametric regression problem and it can be solved for each component separately. In the next section, we show that, in a class of sieve MLE s, the minimum value ρ n ( ξ n ) is actually the observed information based on the outer product of the first derivatives of the log-likelihood. In the following, we state a simple proposition which provides sufficient conditions ensuring consistency of the estimated least favorable direction and În as an estimator of I(θ 0 ). This proposition appears to be useful in a large class of models and is easy to apply. As in nonparametric regression, if the space H is too large, minimization over this space may not yield consistent estimators of ξ 0 and I(θ 0 ). We can use an approximation space H n (a sieve) which is smaller than H and converges to H as n tends to infinity. Under appropriate conditions, minimization of ρ n over H n will yield a consistent estimator of ξ 0. On the other hand, if minimizing over H does yield consistent estimators, then H n can be taken to be H. For simplicity and without loss of generality, it suffices to consider the case of onedimensional θ. In the following and throughout, we will use the linear functional notations for integrals. So for any probability measure Q, Qf = fdq as long as the integral is well defined. Below, P = P θ0,φ 0. P n is the empirical measure of X i, 1 i n. So P n f = n 1 n i=1 f(x i). Proposition 2.1. Denote ξ n = argmin Hn ρ n (h) and l(τ, h; x) = [ l 1 (τ; x) l 2 (τ; x)(h)] 2. If the class of functions I = {l(x, τ, h) : τ T, h H} is Glivenko-Cantelli and τ n p τ 0, then The proof is given is Appendix A. ρ n ( ξ n ) p I(θ 0 ). 9

10 3. Observed Information in Sieve MLE We now apply the method described in Section 2 to a class of sieve MLE s. We show that, if the parameter space Φ and the space H can be approximated by a common approximation space, and if this space has a basis, then the least-squares calculation in Section 2 yields the observed information matrix. In other words, computation of the observed information matrix is equivalent to solving the least-squares problem of Section 2. So there is no need to actually carry out the least-squares computation when the observed information can be computed as in the ordinary setting of parametric estimation. This is computationally convenient, because the observed information matrix is based on either the first derivatives or the second derivatives of the log-likelihood function and these derivatives are often already computed in a numerical algorithm for computing the MLE of unknown regression parameters. On the other hand, for problems in which direct computation of the observed information matrix is difficult, one can instead solve the least-squares nonparametric regression problem to obtain the observed information matrix. These nonparametric regression problems can be solved by using standard least-squares fitting programs for linear regression. As in finite-dimensional parametric models, some regularity conditions are required for the MLE θ n to be root-n consistent and asymptotically normal. These regularity conditions usually include certain smoothness assumptions on the infinite-dimensional parameter φ and the underlying probability model. Consequently, the least favorable direction will be a smooth function such as a bounded Lipschitz function. Then we can take H to be the class of such smooth functions. From the approximation theory, many spaces designed for efficient computation can be used to approximate any element in H with arbitrary precision under appropriately defined distance. For example, we may use the space of polynomial spline 10

11 functions (Schumaker, 1981). This class not only has good approximation power, but is also computationally convenient. We will use this approximation space in the next section. Let Φ n be an approximation space for both Φ and H. Suppose it has a set of basis functions B n = (b 1,..., b qn ), such that every φ Φ n can be represented as φ = q n j=1 β jb j B nβ, where β = (β 1,..., β qn ) B n R qn is a vector of real numbers. So every φ Φ n can be identified with a vector β B n. Here the dimension q n is a positive integer depending on sample size n. To ensure consistency of τ n, we need q n as n. In general, for θ n to be asymptotically normal, we need to control the growth rate of q n appropriately. The sieve MLE of τ 0 = (θ 0, φ 0 ) is defined to be the ( θ n, φ n ) that maximizes the loglikelihood l n (θ, φ) over Θ Φ n. Equivalently, one can find ( θ n, β n ) that maximizes l n (θ, B nβ) over Θ B n. Then φ n = B n β n. Now consider estimation of I(θ 0 ). First we introduce some convenient notations. Denote l 2 ( τ n ; x)(b n ) = ( l 2 ( τ n ; x)(b 1 ),..., l 2 ( τ n ; x)(b qn )) T, and ) 2 A 11 = P n ( l1 ( τ n ; X), A12 = P n l1 ( τ n ; X) l 2 T ( τ n ; X)(B n ), A 21 = A T 12, A 22 = P n ( l2 ( τ n ; X)(B n )) 2, where a 2 = aa T. The outer product version of the observed information for θ is given by (3.1) Ô n = A 11 A 12 A 22A 21. Here for any matrix A, A denotes its generalized inverse. Although A 22 may not be unique, A 12 A 22A 21 is unique by the results on the generalized inverse, see for example, 11

12 Rao (1973), Chapter 1, result 1b.5 (vii), page 26. The use of the generalized inverse in (3.1) is for generality of these formulas. When the nonparametric component φ is a smooth function, then for any fixed sample size, the sieve MLE is obtained over a finite-dimensional approximation space whose (theoretical) dimension is O(n ν ), where ν is typically less than 1/2, A 22 is usually invertible. We now show that Ôn equals În, which is the minimum value of ρ n (h) = P n l 1 ( τ n ; X) l 2 ( τ n ; X)(h) 2 over h Φ n for h = (h 1, h 2,, h d ). Write h j = B nc j for j = 1, 2,, d. Let Y i,j = l 1,j ( τ n ; X i ), the jth component of Y i = l 1 ( τ n ; X i ) and Z i = l 2 ( τ n ; X i )(B n ). This minimization problem becomes a least-squares problem of finding ĉ n = (ĉ n,1, ĉ n,2,, ĉ n,d ) that minimizes [ d ] P n (Y i,j Z ic j ) 2. j=1 By standard least-squares calculation, ( n ) ( n ) ĉ n,j = Z i Z i Z i Y i,j i=1 i=1 for j = 1, 2,, n. Hence, we have Î n = ρ n (ˆξ n ) = P n [ l1 ( τ n ; X) l ] 2 2 ( τ n ; X)( ξ n ) ] 2 = P n [ l1 ( τ n ; X) (ĉ n,1, ĉ n,2,, ĉ n,d ) l2 ( τ n ; X)(B n ) ( n ) ( n ) = P n [ l 1 ( τ n ; X) Y i Z i Z i Z i l 2 ( τ n ; X)(B n ) i=1 12 i=1 ] 2

13 = P n [ l1 ( τ n ; X) A 12 A l ] ( τ n ; X)(B n ) = A 11 A 12 A 22A 21 A 12 A 22A 21 + A 12 A 22A 22 A 22A 21 = A 11 A 12 A 22A 21 = Ôn We summarize the above calculation in the following proposition. Proposition 3.1. Assume that there exist β n B n and c n,j R q n for j = 1, 2,, d such that (3.2) B nβ n φ 0 Φ = O(k 1 1n ), and B nc n,j ξ 0,j Φ = O(k 1 2n ), j = 1, 2, d, where k 1n and k 2n are two sequences of numbers satisfying k 1n and k 2n as n. Suppose that the conditions of Proposition 2.1 are satisfied. Then the observed information matrix Ôn defined in (3.1) is a consistent estimator of I(θ 0 ). 4. Examples In this section, we illustrate the proposed method in two semiparametric regression models, including Poisson proportional mean model for panel count data studied in Wellner and Zhang (2007) and Cox model (1972) for interval-censored data studied in Huang and Wellner (1995). In these examples, the parameter space Φ will be the space of smooth functions defined below. The sieve Φ n is the space of polynomial splines. The polynomial splines have been used in many fully nonparametric regression models, see for example, Stone (1985, 1986). 13

14 Let a = d 0 < d 1 < < d Kn < d Kn +1 = b be a partition of [a, b] into K n subintervals I Kt = [d t, d t+1 ), t = 0,..., K 1 and I KK = [d K, d K+1 ], where K K n n v is a positive integer such that max 1 k K+1 d k d k 1 = O(n v ). Denote the set of partition points by D n = {d 1,..., d Kn }. Let S n (D n, K n, m) be the space of polynomial splines of order m 1 consisting of functions s satisfying: (i) the restriction of s to I Kt is a polynomial of order m for m K; (ii) for m 2 and 0 m m 2, s is m times continuously differentiable on [a, b]. This definition is phrased after Stone (1985), which is a descriptive version of Schumaker (1981), page 108, Definition 4.1. According to Schumaker (1981), page 117, Corollary 4.10, there exists a local basis B n {b t, 1 t q n }, so called B-splines, for S n (D n, K n, m), where q n K n + m. These basis functions are nonnegative and sum up to one at each point in [a, b], and each b t is zero outside the interval [d t, d t+m ] Poisson Proportional Mean Model for Panel Count Data Let {N(t) : t 0} be a univariate counting process. K is the total number of observations on the counting process and T = (T K,1,, T K,K ) is a sequence of random observation times with 0 < T K,1 < < T K,K. The counting process is only observed at those times with the cumulative events denoted by N = {N(T K,1 ),, N(T K,K )}. This type of data is referred to to panel count data by Sun and Kalbfleish (1995). In this manuscript, we assume that (K, T ) is conditionally independent of N given a vector of covariates Z and we denote the observed data consisting of independent and identically distributed X 1,, X n, where X i = (K i, T (i), N (i), Z i ) with T (i) = (T (i) K i,1,, T (i) K i,k i ) and N (i) = (N (i) (T (i) K i,1 ),, N(i) (T (i) K i,k i )), for i = 1,, n. Panel count data are often seen in clinical trials, social demographic and industrial reliability studies. Sun and Wei (2000) and Zhang (2002) proposed the proportional mean 14

15 model (4.1) Λ(t Z) = Λ 0 (t) exp(θ 0Z) to analyze panel count data semiparametrically, where Λ(t Z) = E(N(t) Z) is the expected cumulative events observed at time t, conditional on Z with the true baseline mean function given by Λ 0 (t). Wellner and Zhang (2007) proposed a nonhomogeneous Poisson process with the conditional mean function given by (4.1) to study the MLE of τ 0 = (θ 0, Λ 0 (t)). The log likelihood for (θ, Λ(t)) under the Poisson proportional mean model is given by l n (θ, Λ) = n K i i=1 j=1 [ ] N (i) K i,j log Λ K i,j + N (i) K i,j θ Z i exp(θ Z i ) Λ Ki,j, where N (i) K i,j = N(i) (T (i) K i,j ) N(i) (T (i) K i,j 1 ) and Λ Ki,j = Λ(T (i) (i) K i,j) Λ(T K i,j 1 ), for j = 1, 2,, K. To study the asymptotic properties of the MLE, Wellner and Zhang (2007) defined the following L 2 -norm, d(τ 1, τ 2 ) = { θ 1 θ } 1/2 Λ 1 (t) Λ 2 (t) 2 dµ 1 (t) with k µ 1 (t) = P (K = k Z = z) P (T K,j t K = k, Z = z)df (z). R d k=1 j=1 15

16 They showed that the semiparametric MLE, τ n = (ˆθ n, ˆΛ n ) converges to the true parameters τ 0 = (θ 0, Λ 0 ) (under some mild regularity conditions) in a rate lower than n 1/2, i.e. d( τ n, τ 0 ) = O p (n 1/3 ), however, the MLE of θ 0 is still semiparametric efficient, that is n(ˆθn θ 0 ) d N ( 0, I 1 (θ 0 ) ) with the Fisher information matrix given by I(θ 0 ) = E [ ] { Λ 0 (T K,j )e θ 0 Z Z E(Zeθ 0 Z } K, T K,j 1, T K,j ). E(e θ 0 Z K, T K,j 1, T K,j ) The computation of the semiparametric MLE is very time consuming as it requires the joint estimation of θ 0 and the infinite dimensional parameter Λ 0. Although the Fisher information has a nice explicit form, there is no easy method available to calculate the observed information. Lu (2007) studied the sieve MLE for the above semiparametric Poisson model using monotone polynomial splines. Let B n = (b 1,..., b qn ) be the basis of B-splines defined earlier. The monotone polynomial spline space is defined to be M n (D n, K n, m) = { φ n : φ n (t) = q n j=1 β j b j (t) S n (D n, K n, m), β B n, t [a, b] }. where B n = {β : β 1 β 2 β qn }. Each element of M n (D n, K n, m) is a nondecreasing function because of the monotonicity constraints on β 1,..., β qn. This fact is a consequence of the variation diminishing properties of B-splines. See for instance, Schumaker (1981), Example 4.75 and Theorem 4.76, pages Replaying Λ(t) by the exp( q n l=1 β lb j (t)) in the likelihood above, Lu (2007) solved the constraint optimization problem over the space 16

17 Θ B n. It turns out that, compared to Wellner and Zhang s method, the B-splines sieve MLE is much less computational demanding (Lu, Zhang and Huang, 2008) with a better overall convergence rate. In addition, the sieve MLE of θ is still semiparametric efficient with the same Fisher information matrix, I(θ 0 ). The bootstrap procedure was implemented to estimate I(θ 0 ) consistently by Lu, Huang and Zhang (2008) due to the computation advantage of the B-splines sieve MLE method. A straightforward algebra leads that l 1 (τ; x) l 2 (τ; x)(h) = K j=1 ( ) ( NK,j exp(β T Z) Λ Kj Z h ) Kj Λ Kj Under the regularity conditions given in Wellner and Zhang (2007), following the empirical process arguments used in Lu (2007) for monotone B-splines estimation, it can be easily shown that the class { l 1 (τ; x) l 2 (τ; x)(h) : τ T, h H} is Glivenko-Canteli, by showing the bracketing number with L 1 (P ) norm of this class is bounded. This will imply that {( l 1 (τ; x) l 2 (τ; x)(h)) 2 : τ T, h H} is Glivenko-Canteli as well, based on the given regularity conditions. Moreover, using monotone cubic B-splines in the sieve estimation automatically gives B nβ n φ 0 Φ = O(n pν ) and B nc n ξ 0 Φ = O(n pν ) due to Corollary 6.20 of Schumaker (1981). Hence the Fisher information I(θ 0 ) in semiparametric proportional mean model for panel count data can be consistently estimated using the proposed least-squares approach. 17

18 4.2. Cox Model for Interval-Censored Data Interval-censored data occur very frequently in long-term follow-up study for an event time of interest. The exact event time T is not observable; it is only known with certainty that T is bracketed between two adjacent examination times, or occurs before the first or after the last follow-up examination. Let (L, R) be the pair of examination times bracketing the event time T. That is, L is the last examination time before and R is the first examination time after the event. If 0 < L < R <, then T is interval-censored. If the event occurs before the first examination, then T is left-censored. If the event has not occurred after last examination, then T is right-censored. Such data is called case 2 interval-censored data. Nonparametric estimation of a distribution function and its smooth functionals with interval-censored data have been studied by Groeneboom and Wellner (1992), Groeneboom (1996), and Geskus and Groeneboom (1996). In this example, we consider the B-splines sieve MLE of the Cox proportional hazards model for the interval-censored data. With the proportional hazards model, the conditional hazard of T given a covariate vector Z R d is proportional to the baseline hazard. In terms of cumulative hazard functions, this model is (4.2) Λ(t z) = Λ 0 (t)e θ 0 z, where θ 0 is a d-dimensional regression parameter and Λ 0 is the unspecified baseline cumulative hazard function. Denote the two censoring variables by U and V, where P (U V ) = 1. Let G be the joint distribution function of (U, V ). Let δ 1 = 1 [T U], δ 2 = 1 [U<T V ] and δ 3 = 1 δ 1 δ 2. Assume that conditional on Z, T is independent of (U, V ), and that the joint distribution of 18

19 (U, V ) and Z does not depend on the parameters of interest. Then the density function of X (δ 1, δ 2, U, V, Z) with respect to the product of the counting measure on {0, 1} {0, 1} and the probability measure induced by the distribution of (Z, U, V ), is p(x; θ, Λ) = (1 exp( Λ(u)e θ z )) δ 1 (exp( Λ(u)e θ z ) exp( Λ(v)e θ z )) δ 2 (exp( Λ(v)e θ z ) δ 3. Let φ = log Λ. we reparametrize this log-likelihood in terms of (θ, φ). The resulting loglikelihood for a sample of size one is, up to an additive term not dependent on (θ, φ) is l(θ, φ; x) = log p(x, θ, φ) = δ 1 log(1 exp( e z θ+φ(u) )) + δ 2 log(exp( e z θ+φ(u) ) exp( e z θ+φ(v) )) δ 3 e z θ+φ(v). Let X = (X 1, X 2,, X n ) with X i = (δ 1i, δ 2i, U i, V i, Z i ), for 1 i n being a random sample with the same distribution as X = (δ 1, δ 2, U, V, Z). The log-likelihood for this random sample is l n (θ, φ) = n l(θ, φ; X i ). i=1 Because φ is a nondecreasing function, it is desirable to restrict its estimator to be also nondecreasing. Therefore, we seek an estimate of φ in the space of M n. (The abbreviation of M n (D n, K n, m)) The B-splines sieve MLE of τ 0 = (θ 0, φ 0 ) is the τ n = ( θ n, φ n ) that maximizes l n (θ, φ) over Θ M n. This is equivalent to maximizing l n (θ, B nβ) over Θ B n. No restriction will be placed on Θ. Thus Θ can be taken to be R d. As the same as in the model for panel count data, the B-splines sieve MLE is much easy to compute than the semiparametric MLE studied in Huang and Wellner (1995). In 19

20 addition, we can show that under some mild regularity conditions, the sieve MLE of θ, θ n, achieves the semiparametric efficiency as well, with the information given by I(θ 0 ) = P (l (θ 0, φ 0 ; X) 2 = P ( l1 (θ 0, φ 0 ; X) l ) 2 2 (θ 0, φ 0 ; X)(ξ 0 ), where ξ 0 (t) is the solution of a Fredholm integral equation of the second kind, ξ 0 (t) K(t, x)ξ 0 (x)dx = d(t) with two complicate functions K(t, x) and d(t) described in Huang and Wellner (1995). It is obvious that a direct estimation of the information matrix is impossible for this model. However, with the B-splines sieve MLE approach, variance of θ n can be readily estimated using the observed information matrix defined in (3.1) due to Propositions 2.1 and 3.1. For the asymptotic normality of θ n and consistency of the inverse observed information matrices, the following conditions are assumed. (C1) (a) E(ZZ ) is nonsingular; (b) Z is bounded, that is, there exists z 0 > 0 such that P ( Z z 0 ) = 1. (C2) (a) There exists a positive number η such that P (V U η) = 1; (b) the union of the supports of U and V is contained in an interval [a, b], where 0 < a < b <, and 0 < Λ 0 (a) < Λ 0 (b) <. (C3) Λ 0 belongs to Φ, a class of functions with bounded pth derivative in [a, b] for p 1 and the first derivative of Λ 0 is strictly positive and continuous on [a, b]. (C4) The conditional density g(u, v z) of (U, V ) given Z has bounded partial derivatives with respect to (u, v). The bounds of these partial derivatives do not depend on (u, v, z). 20

21 (C5) For some κ (0, 1), a T var(z U)a κa T E(ZZ T U)a and a T var(z V )a κa T E(ZZ T V )a a.s. for all a R d. It should be noted that in applications, implementation of the proposed estimation method does not require these conditions to be satisfied. These conditions are sufficient but may not be necessary to prove the following asymptotic theorem. Some conditions may be weakened but will make the proof considerably more difficulty. However, from a view point of practice, these conditions are usually satisfied. To study the asymptotic properties, we facilitate a metric defined as follows: for any φ 1, φ 2 Φ, define φ 1 φ 2 2 Φ = E[φ 1 (U) φ 2 (U)] 2 + E[φ 1 (V ) φ 2 (V )] 2. and for any τ 1 = (θ 1, φ 1 ) and τ 2 = (θ 2, φ 2 ) in the space of T = Θ Φ, define d(τ 1, τ 2 ) = τ 1 τ 2 T = { θ 1 θ φ 1 φ 2 2 Φ} 1/2. Theorem 4.1. Let K n = O(n ν ), where ν satisfies the restriction 1 2(1+p) < ν < 1 2p. Suppose that T and (U, V ) are conditionally independent given Z and that the distribution of (U, V, Z) does not involve (θ, Λ). Furthermore, suppose that conditions (C1) (C5) hold. Then (i) d ( τ n, τ 0 ) p 0. (ii) d ( τ n, τ 0 ) = O p ( n min(pν,(1 ν)/2) ). Thus if ν = 1/(1 + 2p), d( τ n, τ 0 ) = O p (n p/(1+2p) ). This is the optimal rate of convergence in nonparametric regression with comparable smoothness assumptions. 21

22 (iii) n 1/2 ( θ n θ 0 ) = n 1/2 I 1 (θ 0 ) n i=1 l (θ 0, φ 0 ; X i ) + o p (1) d N(0, I 1 (θ 0 )). Thus θ n is asymptotically normal and efficient. (iv) The inverse observed information matrix is a consistent estimator of I 1 (θ 0 ), the asymptotic variance of n 1/2 ( θ n θ 0 ). The proof of this theorem is considerably long, we will provide a sketch of the proof in Appendix A. 5. Numerical Results 5.1. Simulations Studies In this section, we conduct simulation studies for the examples discussed in the preceding section to evaluate the finite sample performance of the proposed estimator. In each example, we estimate the unknown parameters using the cubic B-splines sieve maximum likelihood estimation and estimate the standard error of the regression parameter estimates using the proposed least-squares method based on the cubic B-splines as well. For the B-splines sieve, the number of knots is chosen as K n = [N 1/3 ], the largest integer below N 1/3, where N is the number of distinct observation time points in the data, and the knots are evenly placed between (0, 1) in the first example and (0, 5) in the second example. Simulation 1: Panel Count Data. We generate the data with the setting given in Wellner and Zhang (2007). For each subject, we independently generate X i = (Z i, K i, T (i) K i, N (i) K i ), for i = 1, 2,, n, where Z i = (Z i,1, Z i,2, Z i,3 ) with Z i,1 Unif(0, 1), Z i,2 N(0, 1), and Z i,3 Bernoulli(0.5); K i is sampled randomly from the discrete set, {1, 2, 3, 4, 5, 6}; Given K i, T (i) K i = (T (i) K i,1, T (i) K i,2,, T (i) K i,k i ) are the order statistics of 22

23 Table 1: Monte-Carlo simulations results of the B-splines sieve MLE of θ 0 with 1000 repetitions for semiparametric analysis of panel count data n=50 n=100 n=200 θ 0,1 θ 0,2 θ 0,3 θ 0,1 θ 0,2 θ 0,3 θ 0,1 θ 0,2 θ 0,3 Bias M-C sd ASE %-CP 98.4% 98.5% 97.5% 97.6% 97.9% 96.5% 96.2% 95.1% 96.6% K i random draws from Unif(0, 1); The panel counts N (i) K i = (N (i) K i,1, N(i) K i,2,, N(i) K i,k i ) are generated from the Poisson process with the conditional mean function given by Λ(t Z i ) = 2t exp(θ T 0 Z i ) with θ 0 = ( 1.0, 0.5, 1.5) T. We conduct the simulation study for n =50, 100 and 200, respectively. In each case, we perform Monte-Carlo study with 1000 repetitions. Table 1 displays the estimation bias(bias), Monte-Carlo standard deviation(m-c s.d.), the average of standard errors using the proposed method(ase), and the coverage probability of 95% Wald-confidence interval using the proposed estimator of standard error(95%-pc). The results show that the B-splines sieve MLE performs quite well. It has very little bias with seemingly decrease of estimation variability as sample size increases. The proposed method tends to overestimate the standard error slightly but the overestimation lessens as sample size increases. As the result of overestimation, the coverage probability of 95% confidence interval exceeds the nominal value when sample size is 50 or 100 but gets closer to 95% with sample size increasing to 200. Simulation 2: Interval-Censored Data. We generate the data in a way similar to 23

24 Table 2: Monte-Carlo simulations results for B-splines sieve MLE of θ 0 with 1000 repetitions for semiparametric analysis of interval-censored data n=50 n=100 n=200 Bias M-C sd ASE %-CP 97.4% 96.4% 95.6% what was used in Huang and Rossini (1997). For each subject, we independently generate X i = (U i, V i, δ i,1 δ i,2, Z i ), for i = 1, 2,, n, where Z i Bernoulli(0.5); we simulate a series of examination times by the partial sum of interarrival times that are independently and identically distributed according to exp(1), then U i is the last examination time within 5 at which the event has not occurred yet and V i is the first observation time within 5 at which the event has occurred; the event time is generated according to Cox proportional hazards model Λ(t) = t exp(z) for which the true parameters are: θ 0 = 1 and Λ 0 (t) = t. Similarly, Monte- Carlo study with 1000 repetition is performed and the corresponding results are displayed in Table 2. The results show that both bias and Monte-Carlo standard deviation decrease as sample size increases. As observed earlier, the proposed least square estimate of the standard error may overestimate the true value but the overestimation lessens as sample size increasing to 200. With sample size 200, the Wald 95% confidence-interval achieves the desired coverage probability. 24

25 5.2. Applications This section illustrates the method in two real life examples: the bladder tumor randomized clinical trial conducted by Byar, Blackard and VACURG (1977) and the breast cosmesis trial described by Finkelstein and Wolfe (1985). We adopt the cubic B-splines sieve semiparametric MLE method and we estimate the asymptotic standard error of the estimates of regression parameter using the least-squares approach with cubic B-splines as well. The inference is made based on asymptotic theorem developed in this paper. Example 1: Bladder Tumor Study. The data set of the bladder tumor randomized clinical trial conducted by the Veterans Administration Cooperative Urological Research Group (Byar, Blackard and VACURG, 1977) is extracted from Andrews and Herzberg (1985, p ). In this study, a randomized clinical trial of three treatments : placebo, pyridoxine pill and thiotepa instillation into bladder, was conducted for patients with superficial bladder tumor (a total of 116 subjects: 40 were randomized to placebo, 31 to pyridoxine pill and 38 to thiotepa instillation) when entering the trial. At each follow-up visit, tumors were counted, measured and then removed after being found. The treatments as originally assigned will continue after each visit. The number of follow-up visits and follow-up times varied greatly from patient to patient and hence the observation of bladder tumor counts in this study falls in the framework of panel count data as described in Section 4. For this trial, the treatment effects, particularly the thiotepa instillation method, on reducing the bladder tumor recurrent have been the focal point of interest in many studies, see for example, Wei, Lin and Weissfeld (1989), Sun and Wei (2000), Zhang (2002) and Wellner and Zhang (2007). In this paper, we study the proportional mean model as proposed by Wellner and Zhang (2007), 25

26 (5.1) E{N(t) Z} = Λ 0 (t) exp(θ 0,1 Z 1 + θ 0,2 Z 2 + θ 0,3 Z 3 + θ 0,4 Z 4 ), where Z 1 and Z 2 are the baseline tumor count and size, measured when subjects entered the study, and Z 3 and Z 4 define the indicators of the pyridoxine pill and instillation treatments, respectively. Lu, Zhang and Huang (2008) has used the cubic B-splines sieve semiparametric MLE method for this model and estimated the asymptotic standard error of the estimate of θ 0 based on the bootstrap approach. In this paper, we reanalyze the data using the same method but estimate the asymptotic standard error using the least-squares approach. The semiparametric sieve MLE of θ 0 is θ n = (0.2076, , , ) with the asymptotic standard errors given by (0.0066, , , ) and the corresponding p- values=(0.0000,0.0004,0.0553,0.0000) based on the asymptotic theorem developed in Wellner and Zhang (2007). We note that the result of this analysis is very different from that of Lu, Zhang and Huang (2008). The conflict of the results indirectly indicates that the working assumption of Poisson process model to form the likelihood may not be valid as Lu-Zhang- Huang s inference procedure is shown to be robust against the underlying counting process. Example 2: Breast Cosmesis Study. The breast cosmesis study is the clinical trial for comparing radiotherapy alone with primary radiotherapy with adjuvant chemotherapy in terms of subsequent cosmetic deterioration of the breast following tumorectomy. Subjects (46 assigned to radiotherapy alone and 48 to radiotherapy plus chemotherapy) were followed for up to 60 months, with pre-scheduled follow-up visits for every 4-6 months. In this paper, we propose Cox proportional hazards model to analyze the difference in time until appearance 26

27 of breast retraction, (5.2) Λ(t Z) = Λ 0 (t) exp(θ 0 Z), where Λ 0 is the baseline hazard (the hazard of the time to appearance of breast retraction) for radiotherapy alone) and Z is the indicator for the treatment of radiotherapy plus chemotherapy. Using the method proposed in this paper, the cubic B-splines sieve semiparametric MLE of θ 0 is θ n = with asymptotic standard error given by The Wald test statistic is Z = with the associated p-value= This indicates that the treatment of radiotherapy with adjuvant chemotherapy significantly increases the risk of the breast retraction and the result is comparable with what has been concluded in Finkelstein and Wolfe (1985). 6. Conclusion and Discussion When the infinite dimensional parameter as nuisance parameter can not be eliminated in estimating the finite dimensional parameter, a general semiparametric maximum likelihood estimation is often a challenging task. Sieve MLE method, proposed originally by Geman and Hwang (1982), renders a practical approach for alleviating the difficulty in semiparametric estimation problem. Particularly, the spline-based sieve semiparametric method, as exemplified by Lu, Zhang and Huang (2008) has many attractions in practice. Not only it reduces the numerical difficulty in computing the semiparametric MLE, it also achieves the semiparametric asymptotic estimation efficiency for the finite dimensional parameter. However, the estimation of the information matrix remains a difficult task in general. In this paper, we propose an easy-to-implement least-squares approach to estimate 27

28 the semiparametric information matrix. We show that the estimator is asymptotic consistent, as often a byproduct of the establishment of asymptotic normality for sieve semiparametric MLE. Interestingly, this estimator is exactly the observed information matrix if we treat the semiparametric sieve MLE as a parametric MLE problem. Although the estimator of asymptotic error overestimates the true value slightly in finite sample situation as shown in our simulation studies, the overestimation issue is generally alleviated as sample size increases, say to 200. In addition to the expression of information matrix given in (2.7), we note that it can be expressed through the second derivatives. Denote l 11 (τ; x) = l θ 1 (τ; x), l12 (τ; x) = l s 1 (θ, φ (s) ; x) s=0, l 21 (τ; x)(h) = l θ 2 (τ; x)(h), and letting ( / s) φ 1(s) s=0 = h 1, l22 (τ; x)(h, h 1 ) = l s 2 (θ, φ 1(s) ; x)(h). Then the information matrix can be written as I(θ 0 ) = E [ l11 (τ 0 ; X) 2 l 12 (τ 0 ; X)(ξ 0 ) + l ] 22 (τ 0 ; X)(ξ 0, ξ 0 ). This expression leads to an alternative estimator of I(θ 0 ), given by (6.1) ρ n ( ξ n ) = n 1 n i=1 [ l11 ( τ n ; X i ) 2 l 12 ( τ n ; X i )( ξ n ) + l 22 ( τ n ; X i )( ξ n, ξ n )], where ξ n is the minimizer of ρ n (h) over H n. The consistency of ρ n ( ξ n ) can be similarly shown and it would be interesting to investigate how this estimator behaves compared to the one proposed in this paper. As implied in Example 1, this semiparametric inference procedure is not robust against the underlying probability model. If the true model is not the working model for forming the likelihood, the inference may be invalid. Wellner and Zhang (2007) developed a general theorem for making robust semiparametric inference and Lu, Zhang and Huang (2007) extend 28

29 the robust semiparametric inference to the spline-based sieve semiparametric estimation. It remains an open research problem on how to generalize the proposed least-squares method to estimate the asymptotic error of the regression parameter estimates in order to make robust semiparametric inference. 7. Appendix A This section contains the sketch of the proofs for Proposition 2.1 and Theorem 4.1. In the following, C represents a constant that may varies from place to place. The proof for Proposition 2.1: For the least favorable direction ξ 0, we define H n (ɛ) = {h : h H n and h ξ 0 H ɛ} for ɛ 0. Note that for any h H, ρ n (h) ρ n (ξ 0 ) = (P n P )l( τ n, h; X) (P n P )l( τ n, ξ 0 ; X) +P {l( τ n, h; X) l(τ 0, h; X)} P {l( τ n, ξ 0 ; X) l(τ 0, ξ 0 ; X)} +P {l(τ 0, h; X) l(τ 0, ξ 0 ; X)} Because the class I = {l(τ, h; X) : τ T and h H} is Glivenko-Cantelli and τ n H n H, we have that (P n P )l( τ n, h; X) = o P (1) and (P n P )l( τ n, ξ 0 ; X) = o p (1). In addition, by the weak consistency of τ n p τ 0, Continuous Mapping Theorem and 29

30 Dominate Convergence Theorem, we can conclude that P {l( τ n, h; X) l(τ 0, h; X)} p 0 and P {l( τ n, ξ 0 ; X) l(τ 0, ξ 0 ; X)} p 0. Hence for any h H, ρ n (h) ρ n (ξ 0 ) = ρ(h) ρ(ξ 0 ) + o p (1) > 0 in probability as n by the characteristics of the least favorable direction. This implies that ( ) P inf ρ n(h) > ρ n (ξ 0 ) p 1 h H n (ɛ) and hence let ɛ 0 we can conclude that ˆξ n ξ 0 H p 0. Subsequently, we can conclude ρ n (ˆξ n ) = P n l( τ n, ˆξ n ; X) = (P n P )l( τ n, ˆξ n ; X) + P l( τ n, ˆξ n ; X) p P l(τ 0, ξ 0 ; X) = I(θ 0 ), by the consistency of τ n and ˆξ n and I being a Glivenko-Cantelli. The proof for Theorem 4.1: To prove Part (i) of Theorem 4.1, we verify the conditions of Theorem 5.7 in van der Vaart (1998). Let M(τ) = P l(τ; X) = P l(θ, φ; X) and M n (τ) = P n l(τ; X) = P n l(θ, φ; X). Hence for any τ = T n = Θ M n, M n (τ) M(τ) = (P n P )l(τ; X). Let L 1 = {l(τ; X) : τ T n }. By the calculation of Shen and Wong (1994), page 597, ɛ > 0, the bracketing number of M n computed with L 1 (P )-norm is bounded by (1/ɛ) Cqn. 30

31 Since Θ R d is compact, Θ can be covered by C(1/ɛ) d balls with radius ɛ. Then by Conditions (C1)-(C3), we can easily construct ɛ-brackets for L 1 with the bracketing number with L 1 (P )-norm bounded by C(1/ɛ) Cqn+d. Hence L 1 is Glivenko-Cantelli by Theorem of van der Vaart and Wellner (1996). Therefore, sup τ Tn M n (τ) M(τ) p 0. Let g(z, t) = exp(θ z + φ(t)) and g 0 (z, t) = exp(θ 0z + φ 0 (t)). A straightforward algebra yields that M(τ 0 ) M(τ) = E { [1 exp( g 0 (Z, U))] log 1 exp( g 0(Z, U)) 1 exp( g(z, U)) +[exp( g 0 (Z, U)) exp( g 0 (Z, V ))] log exp( g 0(Z, U)) exp( g 0 (Z, V )) exp( g(z, U)) exp( g(z, V )) + exp( g 0 (Z, V )) log exp( g } 0(Z, V )) exp( g(z, V )) { ( ) 1 exp( g0 (Z, U)) = E [1 exp( g(z, U))]m 1 exp( g(z, U)) ( ) exp( g0 (Z, U)) exp( g 0 (Z, V )) +[exp( g(z, U)) exp( g(z, V ))]m exp( g(z, U)) exp( g(z, V )) ( )} exp( g0 (Z, V )) + exp( g(z, V ))m, exp( g(z, V )) where m(x) = x log x x + 1 (x 1) 2 /4 for 0 x 5. Further analysis using Taylor expansion and Conditions (C1)-(C3) leads to { 1 M(τ 0 ) M(τ) CE 1 exp( g(z, U)) [exp( g 0(Z, U)) exp( g(z, U))] 2 } 1 + exp( g(z, V )) [exp( g 0(Z, U)) exp( g(z, U))] 2 { CE [(θ 0 θ) Z + (φ 0 φ)(u)] 2 + [(θ 0 θ) Z + (φ 0 φ)(v )] 2}. With Conditions (C1)-(C5), using the same arguments as those in Wellner and Zhang (2007), 31

32 page leads to M(τ 0 ) M(τ) C ( θ θ φ φ 0 2 Φ) = Cd 2 (τ 0, τ). Then it implies that sup τ:d(τ,τ0 ) ɛ M(τ) M(τ 0 ) Cɛ 2 < M(τ 0 ). that For φ 0 Φ, Lu (2007) has shown that there exists a φ 0,n M n of order m p + 2 such This also implies that φ 0,n φ 0 Φ Cq p n φ 0,n φ 0 Cq p n = O ( n pν). = O (n pν ). Now let τ 0,n = (θ 0, φ 0,n ), we have M n ( τ n ) M n (τ 0 ) = M n ( τ n ) M n (τ 0,n ) + M n (τ 0,n ) M n (τ 0 ) P n l(τ 0,n ; X) P n l(τ 0 ; X) = (P n P ) {l(τ 0,n ; X) l(τ 0 ; X)} + M(τ 0,n ) M(τ 0 ). We can easily show that the class L 2 = {l(θ 0, φ; x) l(θ 0, φ 0 ; x) : φ M n and φ φ 0 Φ Cn pν } is a P -Donsker by calculating the bracketing number with L 2 (P )-norm. It is obvious that in this class P (l(θ 0, φ; X) l(θ 0, φ 0 ; X)) 2 0 as n. Hence (P n P ) {l(θ 0, φ 0,n ; X) l(θ 0, φ 0 ; X)} = o p (n 1/2 ) by the relationship between P-Donsker and asymptotic equicontinuity given by Corollary of van der Vaart and Wellner (1996). By the Dominated Convergence Theorem, it is easy to see that M(τ 0,n ) M(τ 0 ) > o(1). Therefore, M n ( τ n ) M n (τ 0 ) o p (n 1/2 ) o(1) = o p (1). 32

33 This completes the proof of d( τ n, τ 0 ) p 0. To prove Part (ii), we verify the conditions of Theorem of van der Vaart and Wellner (1996). First, we have already shown in the proof of consistency that M(τ 0 ) M(τ) Cd 2 (τ 0, τ). Next, we further explore M n ( τ n ) M n (τ 0 ). In the proof of Part (i), we know that M n ( τ n ) M n (τ 0 ) I 1,n + I 2,n, where I 1,n = (P n P ) {l(θ 0, φ 0,n ; X) l(θ 0, φ 0 ; X)} and I 2,n = P {l(θ 0, φ 0,n ; X) l(θ 0, φ 0 ; X)}. By Taylor expansion, we have I 1,n = (P n P ) { l2 (θ 0, φ; } { X)(φ 0,n φ 0 ) = n pν+ɛ (P n P ) l 2 (θ 0, φ; X) φ } 0,n φ 0 n pν+ɛ for any 0 < ɛ < 1/2 pν. Because φ 0,n φ 0 = O(n pν ), using Corollary of van der Vaart and Wellner (1996), we can show that (P n P ) { l2 (θ 0, φ; } X) φ 0,n φ 0 = o n pν+ɛ p (n 1/2 ). Hence I 1,n = o p (n pν+ɛ n 1/2 ) = o p (n 2pν ). Using the fact that the function m(x) = x log x x+1 (x 1) 2 in the neighborhood of x = 1, it can be easily argued that M(τ 0 ) M(τ 0,n ) C φ 0,n φ 0 2 Φ = O(n 2pν ), which implies that I 2,n = M(τ 0,n ) M(τ 0 ) O(n 2pν ). Thus we conclude that M n ( τ n ) M n (τ 0 ) O p (n 2pν ) = O p ( n 2 min(pν,(1 ν)/2) ). Let L 3 (η) = {l(τ; x) l(τ 0 ; x) : φ M n and d(τ, τ 0 ) η}. Using the same argument as that in the proof of consistency, we obtain that the logarithm of the ɛ- bracketing number of L 3 (η), log N [ ] (ɛ, L 3 (η), L 2 (P )) is bounded by Cq n log(η/ɛ). This leads to J [ ] (η, L 3 (η), L 2 (P )) = η log N[ ] (ɛ, L 3 (η), L 2 (P ))dɛ Cqn 1/2 η. Because Conditions (C1) and (C2) guarantee the uniform boundedness of l(τ; x), using Theorem of van der Vaart and Wellner (1996), the key function φ n (η) in Theorem of van der Vaart and 33

Estimation of the Mean Function with Panel Count Data Using Monotone Polynomial Splines

Estimation of the Mean Function with Panel Count Data Using Monotone Polynomial Splines By MINGGEN LU, YING ZHANG Department of Biostatistics, The University of Iowa, 200 Hawkins Drive, C22 GH Iowa City,