A Semiparametric Regression Model for Panel Count Data: When Do Pseudo-likelihood Estimators Become Badly Inefficient?

Size: px

Start display at page:

Download "A Semiparametric Regression Model for Panel Count Data: When Do Pseudo-likelihood Estimators Become Badly Inefficient?"

Stuart Short
5 years ago
Views:

1 A Semiparametric Regression Model for Panel Count Data: When Do Pseudo-likelihood stimators Become Badly Inefficient? Jon A. Wellner 1, Ying Zhang 2, and Hao Liu 3 1 Statistics, University of Washington jaw@stat.washington.edu 2 Statistics, University of Central Florida zhang@pegasus.cc.ucf.edu 3 Biostatistics, University of Washington hliu@u.washington.edu ABSTRACT. We consider estimation in a particular semiparametric regression model for the mean of a counting process under the assumption of panel count data. The basic model assumption is that the conditional mean function of the counting process is of the form Nt Z} expθ ZΛt where Z is a vector of covariates and Λ is the baseline mean function. The panel count observation scheme involves observation of the counting process N for an individual at a random number of random time points; both the number and the locations of these time points may differ across individuals. We study maximum pseudo-likelihood and maximum likelihood estimators and θ n of the regression parameter θ. The pseudo-likelihood estimators are fairly easy to compute, while the full maximum likelihood estimators pose more challenges from the computational perspective. We derive expressions for the asymptotic variances of both estimators under the proportional mean model. Our primary aim is to understand when the pseudo-likelihood estimators have very low efficiency relative to the full maximum likelihood estimators. The upshot is that the pseudo-likelihood estimators can have arbitrarily small efficiency relative to the full maximum likelihood estimators when the distribution of, the number of observation time points per individual, is very heavy-tailed. θ ps n

2 2 Jon A. Wellner, Ying Zhang, and Hao Liu 1Introduction Our goal in this paper is to study efficiency aspects of two types of estimators for a particular semiparametric model for panel count data. Panel count data arise in many fields including demographic studies, industrial reliability, and clinical trials; see for example [19], [7], [31], [32], [29], and [33] where the estimation of either the intensity of event recurrence or the mean function of a counting process with panel count data was studied. Many applications involve covariates whose effects on the underlying counting process are of interest. While there is considerable work on regression modeling for recurrent events based on continuous observations see, for example [22], [5], and [23], regression analysis with panel count data for counting processes has just started recently. [30] proposed estimating equation methods, while [35] and [36] proposed a pseudo-likelihood method for studying the multiplicative mean model 1 with panel count data. Here is a description of the model and the observation scheme. Suppose that N Nt :t 0} is a univariate counting process. In many applications, it is important to estimate the expected number of events Nt Z} which will occur by the time t, conditionally on a covariate vector Z. In this paper we consider the proportional mean regression model given by Λt Z Nt Z} e θ Z Λt, 1 where the monotone increasing function Λ is the baseline mean function. The parameters of primary interest are θ and Λ. The observation scheme we want to study is as follows: suppose that we observe the counting process N at a random number of random times 0 T,0 <T,1 < <T,. We write T T,1,...,T,, and we assume that, T Z G Z is conditionally independent of the counting process N given the covariate vector Z. Wefurther assume that Z H on R d, but we will make no further assumptions about G or H modulo mild integrability and boundedness requirements. The data for each individual will consist of X Z,, T, NT,1,...,NT, Z,, T, N. 2 We will assume that the data consist of X 1,...,X n i.i.d. as X. To derive useful estimators for this model we will often assume, in addition to 1, that the counting process N, conditionally on Z, is a non-homogeneous Poisson process. But our general perspective will be to study the estimators and other procedures when the Poisson assumption fails to hold and we assume only that the proportional mean assumption 1 holds. Such a program was carried out for estimation of Λ without any covariates for this panel count observation model by [33]. Briefly, [33] studied both the

3 Semiparametric Regression Model for Panel Count Data 3 maximum likelihood estimator Λ n and the pseudo-maximum likelihood estimator Λ n of Λ. They showed that both estimators are consistent in L 2 µ ps where µ is the measure defined in terms of the observation process by µb P k k1 k P T k,j B k. [33] also succeeded in showing under additional smoothness and boundedness assumptions that n 1/3 ps σ Λ 2 t 0 Λ } 1/3 n t 0 Λ 0 t 0 0t 0 d 2G 2Z t 0 where σ 2 t Var[Nt], G t k1 P k k G k,j t, and Z argmaxw t t 2 } for a two-sided Brownian motion process W starting at 0. They also proved a corresponding result for a toy estimator version of the maximum likelihood estimator Λ n under the Poisson process assumption, and made efficiency comparisons between the two estimators based on Monte- Carlo studies. The outline of the rest of the present paper is as follows: In section 2, we describe two methods of estimation, namely maximum pseudo-likelihood estimators and maximum likelihood estimators of θ and Λ. The basic picture is that the pseudo-likelihood estimators are computationally relatively straightforward and easy to implement, while the full, semiparametric maximum likelihood estimators are considerably more difficult, requiring an iterative algorithm in the computation of the profile likelihood. For other examples of the use of pseudo-likelihood to obtain computationally simple methods, see e.g. [3] and [27]. In section 3 we present information calculations for the semiparametric model described by the proportional mean function assumption 1 together with the non-homogeneous Poisson process assumption on N. This provides a baseline for comparisons of variances with the best possible asymptotic variance under the Poisson and proportional mean model assumptions. In section 4 we describe asymptotic normality results for the pseudo-likelihood and full maximum likelihood estimators and θ n of θ assuming only the proportional mean structure 1, but not assuming that N is a Poisson process. Proofs of these results will be presented in detail in [34]. Finally, in section 4wecompare the pseudo-likelihood and full likelihood estimators of θ under three different scenarios with the goal of determining situations under which the pseudo-likelihood estimators will lose considerable efficiency relative to the full maximum likelihood estimators. As will be seen, the rough upshot of the calculations here is that the efficiency of the pseudo-likelihood estimators relative to the full maximum likelihood estimators can be low when the distribution of, the number of observation times per subject, is heavy-tailed. θ ps n

4 4 Jon A. Wellner, Ying Zhang, and Hao Liu 2 Two Methods of stimation Maximum Pseudo-likelihood stimation: The natural pseudo-likelihood estimators for this model use the marginal distributions of N, conditional on Z, P Nt k Z Λt Zk exp Λt Z k! and ignore dependence between Nt 1, Nt 2 toobtain the pseudo-likelihood: n i ln ps θ, Λ N i T i i i,j log ΛT i,j i1 } + N i T i i,j θ Z i e θ Z i ΛT i i,j. Then the maximum pseudo-likelihood estimator θ n ps, by θ n ps ps, Λ n argmax θ,λ ln ps θ, Λ. Λ ps n ofθ, Λ isgiven This can be implemented in two steps via the usual pseudo- profile likelihood. For each fixed value of θ we set Λ ps n,θ argmax Λ l ps n θ, Λ, 3 and define Then ln ps,profile θ ln ps ps θ, Λ n,θ. θ ps n argmax θ ln ps,profile θ, and Λps n ps ps Λ n, θ n. In fact, the optimization problem in 3 is easily solved as follows: Let t 1 <...<t m denote the ordered distinct observation time points in the collection of all observations times, T i, j 1,..., i,j i,i1,...,n}, let N i i,j N i T i i,j, and set w l n i i1 A l θ 1 w l 1 [T i i,j t l], n i i1 Then it is easily shown that N l 1 w l n i i1 expθ Z i 1 [T i i,j t l]. N i i,j 1 [T i i,j t l],

5 Semiparametric Regression Model for Panel Count Data 5 Λ ps n,θ left-derivative of Greatest Convex Minorant of w l A l θ, w l N l } m i1 l i l i max min i l j l i p w pn p i p w pa p θ at t l, which is straightforward to compute. Maximum Likelihood stimation: Under the assumption that N is conditionally, given Z a non-homogeneous Poisson process, the likelihood can be calculated using the conditional independence of the increments of N, Ns, t] Nt Ns, and the Poisson distribution of these increments: P Ns, t] k Z to obtain the log-likelihood: l n θ, Λ where i1 [ Λs, t] Z]k k! exp Λs, t] Z n i } N i log Λ i,j i,j + N i i,j θ Z i e θ Z i Λ i,j N j NT,j NT,j 1, Λ j ΛT,j ΛT,j 1, j 1,...,, j 1,...,, with NT,0 0and ΛT,0 0.Then θ n, Λ n argmax θ,λ l n θ, Λ. This maximization can also be carried out in two steps via profile likelihood. For each fixed value of θ we set and define Then Λ n,θ argmax Λ l n θ, Λ, l profile n θ l n θ, Λ n,θ. θ n argmax θ l profile n θ, and Λn Λ n, θ n. Computation of the profile estimator Λ n,θiscomputationally involved, but possible via the iterative convex minorant algorithm; see e.g. [17]. For more on computation without covariates see [33].

6 6 Jon A. Wellner, Ying Zhang, and Hao Liu 3 Information bounds for θ under the Poisson model. We first compute information bounds for estimation of θ under the proportional mean non-homogeneous Poisson process model. Suppose that N Z PoissonΛ Z, and, T Z G Z are conditionally independent given Z.We will assume here that N is conditionally a nonhomogeneous Poisson process with conditional mean function [Nt Z] Λt Z e θ 0 Z Λ 0 t. 4 The second equality expresses the proportional mean regression model assumption. The likelihood for one observation is, using the same notation introduced in Section 2, px; θ, Λ exp Λ j Λ j N j N j!. Thus the log-likelihood for θ, Λ for one observation is given by log px; θ, Λ N j log Λ j Λ j log N j!}. Differentiating this with respect to θ and Λ respectively, the scores for θ and Λ are easily seen to be while where l θ x l Λ ax Z N j e θ 0 Z Λ 0j, 5 Nj } e θ 0 Z a j Λ 0j N j e θ 0 Z Λ 0j } aj Λ 0j, T,j a j adλ 0, a L 2 Λ 0. T,j 1 To compute the information bound for estimation of θ it follows from the results of [2] and [4] that we want to find a so that l θ l Λ a l Λ a

7 Semiparametric Regression Model for Panel Count Data 7 for all a L 2 Λ 0 ; i.e. 0 l θ l } Λ a l Λ a N j e θ 0 Z Λ 0j Z a j l Λ a Λ 0j N j e θ 0 Z Λ 0j 2 Z a j Λ 0j e θ 0 Z Λ 0j Z a j aj Λ 0j Λ 0j aj Λ 0j by conditioning on, T,Z e θ 0 Z Λ 0j Z a j aj } Λ 0j Λ 0j e θ 0 Z Λ 0j Z a j Λ 0j aj } }}, T,j 1,T,j Λ 0j a } j Λ 0j Ze θ 0 Z, T,j 1,T,j Λ 0j a } j e }} θ 0 Z, T,j 1,T,j. Λ 0j Thus we see that the desired orthogonality holds with } a Ze θ 0 Z, T,j 1,T,j j Λ 0j }. 6 e θ 0 Z, T,j 1,T,j Hence the efficient score function for θ is given by l θx l θ x l Λ a x } Ze θ Z, T,j 1,T,j N j e θ Z Λ 0j Z e θ Z,, T,j 1,T,j } and the information for θ is, by computing conditionally on Z,, T, Iθ 0 l θ X 2}

8 8 Jon A. Wellner, Ying Zhang, and Hao Liu } Ze θ 0 Z 2, T,j 1,T,j 0 e θ 0 Z Λ 0j Z } e θ 0 Z, T,j 1,T,j. In particular, we have the following corollary: Corollary 1. Current status data. If P 11so that the only T j of relevance is T 1,1 T while T 1,0 0, then the efficient score function is } l θx l θ x l Ze θ 0 Z T Λ a x NT e θ 0 Z Λ 0 T Z e θ 0 Z T }. and the information for θ is given by Iθ 0 l θ X 2} 0 eθ 0 Z Λ 0 T Z Ze θ 0 Z T e θ 0 Z T } } 2. This can be compared with the information for θ for the Cox proportional hazards model with current status data given by [15], page 547 with Huang s Y replaced by our present T for comparison: Iθ RT,Z Z ZRT,Z T } } 2 RT,Z T where RT,ZΛ 2 T,ZOT Z and 0 z Ot z F t z 1 F t z 1 F 0t expθ 1 1 F 0 t. expθ 0 z Corollary 2. Case 2 Interval-censored data. If P 21so that the only T j s of relevance are T 2,1 T 1 and T 2,2 T 2, while T 2,0 0, then the efficient score function is l θx l θ x l Λ a x } Ze θ 0 Z T 1 NT 1 e θ 0 Z Λ 0 T 1 Z } e θ 0 Z T 1 + NT 2 NT 1 } Ze θ 0 Z T 1,T 2 e θ 0 Z Λ 0 T 2 Λ 0 T 1 Z }, e θ 0 Z T 1,T 2 and the information for θ is given by

9 Semiparametric Regression Model for Panel Count Data 9 Iθ 0 l θ X 2} } Ze θ 0 Z 2 T 1 0 eθ 0 Z Λ 0 T 1 Z } e θ 0 Z T 1 } Ze θ 0 Z 2 T 1,T eθ 0 Z Λ 0 T 2 Λ 0 T 2 Z } e θ 0 Z T 1,T 2. This is much simpler than the information for θ for the Cox proportional hazards model with interval censored case II data given by [16]. The calculations of Huang and Wellner resulted in an integral equation to be solved, analogously to the results for the mean functional considered by [8], [9], and [10]. 4 Asymptotic normality of the two estimators of θ. Here is the crucial theorem concerning the asymptotic behavior of the maximum pseudo-likelihood and maximum likelihood estimators of θ when the proportional mean model holds, but the Poisson assumption concerning N may fail. Theorem 1. Under suitable regularity and integrability conditions, the estimators θ n and θ n are asymptotically normal: ps n θn θ 0 d Z N d 0,A 1 B A 1, 7 and where n θps n θ 0 d Z ps N d 0, A ps 1 B ps A ps 1 8 B m θ 0,Λ 0 ; X 2 Ze θ 0 Z 2, T,j,T,j C j,j Z Z e θ j,j 0 Z, T 1,j,T,j, Ze θ 0 Z 2, T,j 1,T,j A Λ 0j e θ 0 Z Z e θ 0 Z, T,j 1,T,j, C j,j Z Cov[ N j, N j Z,, T ], B ps m ps θ 0,Λ 0 ; X 2

10 10 Jon A. Wellner, Ying Zhang, and Hao Liu Ze θ 0 Z, T,j C ps j,j Z Z e θ j,j 0 Z, T 1,j Ze θ 0 Z, T,j Z e θ 0 Z, T,j, Ze θ 0 Z 2, T,j A ps Λ 0j e θ 0 Z Z e θ 0 Z, T,j, C ps j,j Z Cov[N j,n j Z,, T,j,T,j ]. Our proof of this theorem is based on the results of [35]. While we will not give the proof in detail, we will present here a sketch of the computation of the asymptotic variances given in 7 and 8. Based on the Poisson model, the log-likelihood for θ, Λ with one observation is given by mθ, Λ; X log px; θ, Λ N j log Λ j Λ j log N j!} N j log Λ j + N j θ Z } e θ Z Λ 0j log N j!. 9 Thus the log-likelihood l n θ, Λ for n i.i.d. observations is given by l n θ, Λ np n mθ, Λ;. 10 The maximum likelihood estimators θ, Λ are obtained by maximizing 10. A natural pseudo-likelihood is obtained by simply taking the product of the marginal distributions of the observed counts at the successive observation times. Thus a log-pseudo-likelihood for one observation is given by m ps θ, Λ; X } N j log Λ j + N j θ Z e θ Z Λ j logn j! 11 with Λ,j ΛT,j, and the log-pseudo-likelihood ln ps θ, Λ for n i.i.d. observations is given by ln ps θ, Λ np n m ps θ, Λ;, 12 and the corresponding pseudo-ml s θ ps, Λ ps are obtained by maximizing 12.

11 Semiparametric Regression Model for Panel Count Data Asymptotic variance of the ML Based on the Poisson model, the log-likelihood for θ, Λ with one observation is given by 9. Using the notation of [35], page 29, we have m 1 θ, Λ; X m 2 θ, Λ; X[h] m 11 θ, Λ; X [ ] Z N j Λ j e θ Z, [ ] Nj e θ Z h j, Λ j Λ j ZZ e θ Z, m 12 θ, Λ; X[h] m T 21θ, Λ; X[h] m 22 θ, Λ; X[h,h] Ze θ Z h j, N j Λ j 2 h j h j, where h j T,j T,j 1 hdλ 0 for h L 2 Λ. By A2 of [35], page 30, we need to find a h such that Ṡ 12 θ 0,Λ 0 [h] Ṡ22θ 0,Λ 0 [h,h] P m 12 θ 0,Λ 0 ; X[h] m 22 θ 0,Λ 0 ; X[h,h]} 0, for all h L 2 Λ 0. Note that P m 12 θ 0,Λ 0 ; X[h] m 22 θ 0,Λ 0 ; X[h,h]} [ ] Ze θ 0 Z N j Λ 0j 2 h j [,T,Z Ze θ 0 Z eθ ] 0 Z h j Λ 0j h j. h j Therefore, an obvious choice of h is Ze θ 0 Z, T,j 1,T,j h j Λ 0j. e θ 0 Z, T,j 1,T,j Hence m θ 0,Λ 0 ; X

12 12 Jon A. Wellner, Ying Zhang, and Hao Liu m 1 θ 0,Λ 0 ; X m 2 θ 0,Λ 0 ; X[h ] Z N j e θ 0 Z Λ 0j Nj Ze θ 0 Z, T e θ,j 1,T,j 0 Z Λ 0j Λ 0j e θ 0 Z, T,j 1,T,j Ze N θ 0 Z, T,j 1,T,j j e θ 0 Z Λ 0j Z. e θ 0 Z, T,j 1,T,j By Theorem of [35], page 32, the asymptotic variance will be A 1 B A 1, where B m θ 0,Λ 0 ; X 2,T,Z CT,j,T,j,T,j 1,T,j 1; Z j,j 1 Ze θ 0 Z, T,j 1,T,j Z e θ 0 Z, T,j 1,T,j Ze θ 0 Z, T,j Z 1,T,j e θ 0 Z, T,j 1,T,j, A Ṡ11θ 0,Λ 0 +Ṡ21θ 0,Λ 0 [h ] [ Λ 0j e θ 0 Z ZZ Ze θ 0 Z, T,j e θ 0 Z 1,T,j Λ 0j Z e θ 0 Z, T,j 1,T,j Ze θ 0 Z, T,j 1,T,j,T,Z Λ 0j e θ 0 Z Z Z e θ 0 Z, T,j 1,T,j Ze θ 0 Z 2, T,j 1,T,j,T,Z Λ 0j e θ 0 Z Z e θ 0 Z, T,j 1,T,j, and

13 Semiparametric Regression Model for Panel Count Data 13 CT,j,T,j,T,j 1,T,j 1; Z [ N j e θ 0 Z Λ 0j ] N j e θ 0 Z Λ 0j Z,, T,j 1,T,j,T,j 1,T,j. Note that if the counting process is indeed a conditional Poisson process with the mean function given as specified, Λ0j e CT,j,T,j,T,j 1,T,j 1; Z θ 0 Z, if j j 0, if j j. This yields B A Iθ 0 and thus A 1 B A 1 I 1 θ Asymptotic Variance of the Pseudo-ML Based on the Poisson model, the pseudo log-likelihood for θ, Λ with one observation is given by 11. Using the notation of Zhang 1998, page 29, we have m ps 1 θ, Λ; X [ ] Z N j Λ j e θ Z, m ps 2 θ, Λ; X[h] [ ] Nj e θ Z h j, Λ j m ps 11 θ, Λ; X Λ j ZZ e θ Z, m ps 12 θ, Λ; X[h] mt 21θ, Λ; X[h] Ze θ Z h j, m ps 22 θ, Λ; X[h,h] N j Λ j 2 h jh j, where h j T,j 0 hdλ for h L 2 Λ. By A2 of [35], page 30, we need to find a h such that Ṡ ps 12 θ 0,Λ 0 [h] Ṡps 22 θ 0,Λ 0 [h,h] P m ps 12 θ 0,Λ 0 ; X[h] m ps 22 θ 0,Λ 0 ; X[h,h]} 0, for all h L 2 Λ 0. Note that

14 14 Jon A. Wellner, Ying Zhang, and Hao Liu P m ps 12 θ 0,Λ 0 ; X[h] m ps 22 θ 0,Λ 0 ; X[h,h]} [ ] Ze θ 0 Z N j Λ 0j 2 h j h j,t,z [ Ze θ 0 Z eθ 0 Z h j Λ 0j Therefore, an obvious choice of h is Ze θ 0 Z, T,j h j Λ 0j. e θ 0 Z, T,j Hence ] h j. m ps θ 0,Λ 0 ; X m ps 1 θ 0,Λ 0 ; X m ps 2 θ 0,Λ 0 ; X[h ] Z N j e θ 0 Z Nj Ze θ 0 Z, T Λ 0j e θ,j 0 Z Λ 0j Λ 0j e θ 0 Z, T,j Ze N θ 0 Z, T,j j e θ 0 Z Λ 0j Z. e θ 0 Z, T,j By Theorem of [35], page 32, the asymptotic variance will be A ps 1 B ps A ps 1, where B ps m ps θ 0,Λ 0 ; X 2 Ze θ 0 Z, T,j,T,Z C ps T,j,T,j ; Z Z e θ j,j 0 Z, T 1,j Ze θ 0 Z, T,j Z e θ 0 Z, T,j, A ps Ṡps 11 θ 0,Λ 0 +Ṡps 21 θ 0,Λ 0 [h ] Ze θ 0 Z, T,j Λ 0j e θ 0 Z ZZ e θ 0 Z Λ 0j Z e θ 0 Z, T,j Ze θ 0 Z, T,j,T,Z Λ 0j e θ 0 Z Z Z e θ 0 Z, T,j

15 Semiparametric Regression Model for Panel Count Data 15 Ze θ 0 Z 2, T,j,T,Z Λ 0j e θ 0 Z Z e θ 0 Z, T,j, and C ps T,j,T,j ; Z ] [N j e θ 0 Z Λ 0j N j e θ 0 Z Λ 0j Z,, T,j,T,j. Note that if the counting process is indeed a conditional Poisson process with the mean function given as specified, C ps T,j,T,j ; Z e θ 0 Z Λ 0j j. This yields Ze θ 0 Z, T,j B ps A ps +2,T,Z e θ0z Λ 0j Z e θ j<j 0 Z, T,j Ze θ 0 Z, T,j Z e θ 0 Z, T,j A ps. 5 Comparisons: ML versus pseudo-ml. Scenario 1. We first suppose that the underlying counting process is in fact a standard homogeneous Poisson process conditionally given Z, with baseline mean function Λ 0 t λt, Wewill also assume that the distribution of, T isindependent of Z. Asaconsequence, Z is independent of, T, and the formulas in the preceding section simplify considerably. Because of the Poisson process assumption, A B Iθ 0, and this matrix is given by Ze θ 0 Z 2, T,j 1,T,j Iθ 0,T,Z Λ 0j e θ 0 Z Z e θ 0 Z, T,j 1,T,j Ze θ 0 Z 2,T Λ 0 T, } Z eθ 0 Z Z e θ Z 0,T Λ 0 T, } C, so that if C is nonsingular,

16 16 Jon A. Wellner, Ying Zhang, and Hao Liu Iθ 0 1 C 1 1,T Λ 0 T, }. On the other hand, Ze θ 0 Z 2, T,j A ps,t,z Λ 0j e θ 0 Z Z e θ 0 Z, T,j Ze θ 0 Z 2,T Λ 0 T j eθ 0 Z Z e θ Z 0, while B ps,t,z so that j,j 1 A ps 1 B ps A ps 1,T C 1,Z Ze θ 0 Z 2 Λ 0 T,j j Z eθ 0 Z Z e θ Z 0,,T } j,j 1 Λ 0T,j j } 2. Λ 0T j Thus it follows that the AR of the pseudo-ml of θ 0 relative to the ML of θ 0 under the above scenario is given by ARpseudo, mle A 1 BA 1 T A ps 1 B ps A ps 1 T }} 2 Λ 0T,j }. Λ 0 T, } j,j 1 Λ 0T,j j Note that this equals 1 if P 11.Actually, we have not yet used the assumption about Λ 0.Ifweassume that Λ 0 t λt, then }} 2 T,j ARpseudo, mle } T, } j,j 1 T,j j Scenario 1A. If we assume, further, that P k 1for a fixed integer k 2, and T T,1,...,T, are the order statistics of a sample of k uniformly distributed random variables on an interval [0,M], then T,j k j k +1 M k 2 M,

17 and Hence in this case as k. Semiparametric Regression Model for Panel Count Data 17 T,j j j,j 1 T, } ARpseudo, mle k j,j 1 k k +1 M, j j k2k +1 M M. k +1 6 k/22 k k2k+1 k+1 6 3k +1 22k k Iθ 1 kmλ 2 3/2 4/3 5/4 6/5 7/6 8/7 9/8 10/9 11/10 V ps kmλ 2 5/3 14/9 3/2 22/15 13/9 10/7 17/12 38/27 7/5 ARk Scenario 1B. If we assume instead that is random and for some 0 c< 1/2 T,j Uniform[j c, j + c], j 1,...,, and are independent conditionally on, then we calculate T,, +1 T,j, 2 j,j T,j j, 6 and hence ARpseudo, mle } Note that this depends only on the first three moments of, with the third moment appearing in the denominator. It does not depend on c, but only on T,j j, j 1,...,. When P k 1we find that under this scenario ARpseudo, mle 3k +1 22k It is clear that for random the AR in this scenario can be arbitrarily small if the distribution of is heavy-tailed. This will be shown in more detail in Scenario 2.

18 18 Jon A. Wellner, Ying Zhang, and Hao Liu Scenario 2. Avariant on these calculations is to repeat all the assumptions about Z and, T, assume, conditionally on, that T T,1,...,T, are the order statistics of a sample of uniformly distributed random variables on an interval [0,M], but now allow a distribution for. Then T,j j +1 M 2 M, and T,j j j,j 1 } T, j,j 1 +1 M, j j 2 +1 M M If we let be distributed according to 1+Poissonµ, then the AR will be asymptotically 3/4 again as µ.a more interesting choice of the distribution of is the Zetaα distribution given as follows: for α > 1, P k 1/kα, k 1, 2,..., ζα where ζα j α is the Riemann Zeta function. Then we can compute T,j j +1 M α M M ζα 1, 2 2 ζα and T, } α M, +1 T,j j j,j 1 α j,j 1 j j M +1 M 6 α2 + 1 M 6 2α 2 + α } } 2ζα 2 + ζα 1 M 6 ζα Hence in this case Iθ 1 α C 1 1 Mλ α +1,

19 and Semiparametric Regression Model for Panel Count Data 19 } 2ζα 2+ζα 1 V ps α C 1 6Mλ ζα ζα 1 2ζα 2, ARpseudo, mleα ARα α /2 2 α +1 } α ζα 1/ζα 2 α +1 } 2ζα 2+ζα 1 ζα 3 ζα 1 2 2ζα 2 + ζα 1} α +1 }, and this varies between 0 and 1 as α varies from 3 to ; note that for α 3, 2 ζ1, while ζ2/ζ See Figure 1. AR alpha Fig. 1. ARα If we change the distribution of to Unif1,...,k 0 }, then T,j j +1 M M k 0 +1 M, 2 4

20 20 Jon A. Wellner, Ying Zhang, and Hao Liu T,j j j j α M +1 j,j 1 j,j 1 M 6 M 6 M } 2 k k k } while Hence in this case and T, } M, +1 Iθ 1 α C 1 1 Mλ +1, k 0+12k 0+1/3+k 0+1/2 V ps k 0 C 1 6 Mλk /16, ARpseudo, mlek 0 ARk 0 / } 6 k /16 +1 } k0+12k0+1/3+k0+1/2 6 and this varies between 1 and 9/16 as k 0 varies from 1 to. Scenario 3. We now suppose that the underlying counting process is not a Poisson process, conditionally given Z, but is, instead, defined as in terms of the negative-binomialization of an empirical counting process, as follows: suppose that X 1,X 2,... are i.i.d. with distribution function F on R, and define n N n t 1 [Xi t] for t 0. i1 Suppose that N Z Negative BinomialrZ, γ, θ 0,p where rz, γ, θ 0 γe θ 0 Z, and define N by Nt N N t N 1 [Xi t]. i1,

21 Semiparametric Regression Model for Panel Count Data 21 Then, since it follows that N Z rz, γ, θ 0 q p, VarN Z rz, γ, θ 0 q p 2, Nt Z} [Nt Z, N] Z} NFt Z} γe θ 0 Z F t q p eθ 0 Z Λ 0 t, with Λ 0 t γftq/p. Alternatively, Nt Z Negative Binomialr, p p + qft by straightforward computation, and hence it follows that qft/p + qft Nt Z} r rft q p/p + qft p γeθ 0 Z q p F t eθ 0 Z Λ 0 t, in agreement with the above calculation. Moreover qft/p + qft VarNt Z} r p/p + qft 2 r q p F t1 + q p F t r q p F t+r q p 2 F t 2. Remark. It should be noted that this is somewhat different than the model obtained by supposing that the underlying counting process is a nonhomogeneous Poisson process conditionally given Z and an unobserved frailty Y Gammaγ,γ and conditional on Y and Z mean function In this model we have but Nt Z, Y } Ye θ 0 Z Λ 0 t. Nt Z} e θ 0 Z Λ 0 t, VarNt Z} Var[Nt Y,Z] Z} + Var[Nt Y,Z] Z} e θ 0 Z Λ 0 t+γ 1 e 2θ 0 Z Λ 2 0t. Now we want to calculate

22 22 Jon A. Wellner, Ying Zhang, and Hao Liu CT,j,T,j,T,j 1,T,j 1; Z [ N j e θ 0 Z Λ 0j ] N j e θ 0 Z Λ 0j Z,, T,j 1,T,j,T,j 1,T,j. By computing conditionally on N, and using the fact that conditionally on Z,, T,j 1,T,j,T,j 1,T,j and N, N has increments with a joint multinomial distribution, we find that CT,j,T,j,T,j 1,T,j 1; Z [ N j e θ 0 Z Λ 0j ] N j e θ 0 Z Z, Λ 0j, T,j 1,T,j,T,j 1,T,j [ N j N j N+ N j N e θ 0 Z Λ 0j N j N j N+ N j N e θ 0 Z Λ 0j ] N,Z,,T,j 1,T,j,T,j 1,T,j } Z,, T,j 1,T,j,T,j 1,T,j N F j 1 F j Z,, T } + } N rq/p 2 F j 2 Z,, T, if j j N F j F j Z,, T } + } N rq/p 2 F j F j Z,, T, if j j r q p F,j F,j 2 + r q p F 2 2,j, if j j } r q p r q 2 p F j F j, if j j r q p F,j + r q2 p F 2,j 2, if j j r q2 p F 2 j F j, if j<j e θ 0 Z Λ 0,,j + γ 1 e θ 0 Z Λ 0,,j 2, if j j, γe θ 0 Z q2 p F 2 j F j, if j j, e θ 0 Z Λ 0,,j + γ 1 e θ 0 Z Λ 0,,j 2, if j j, γ 1 e θ 0 Z Λ 0j Λ 0j, if j j. Remark. Note that if N Z PoissonrZ, γ, θ 0, then the process N N N is conditionally, given Z, anon-homogeneous Poisson process with conditional mean function Nt Z} γe θ 0 Z F t e θ 0 Z Λ 0 t, and conditional variance function VarNt Z} γe θ 0 Z F t e θ 0 Z Λ 0 t

23 Semiparametric Regression Model for Panel Count Data 23 where Λ 0 t γft. We will also assume, as in Scenarios 1 and 2, that the distribution of, T isindependent of Z. Asaconsequence, Z is independent of, T, and the formulas in the preceding section again simplify. By taking F uniform on [0,M], Λ 0 t λt where λ γ/mq/p, and we compute, using moments and covariances of uniform spacings, as found on page 721 of [28], B m θ 0,Λ 0 ; X 2,T,Z CT,j,T,j,T,j 1,T,j 1; Z j,j 1 Ze θ 0 Z, T,j 1,T,j Z e θ 0 Z, T,j 1,T,j Ze θ 0 Z, T,j Z 1,T,j e θ 0 Z, T,j 1,T,j,T,Z e θ 0 Z Λ 0j Ze θ 0 Z 2 + γ 1 e θ 0 Z Λ 0j 2 Z e Z θ 0 Ze θ 0 Z 2 +,T,Z γ 1 e θ 0 Z Λ 0j Λ 0j Z e Z θ j j 0 C,T Λ 0j + γ 1 Λ 0j 2 + γ 1,T Λ 0j Λ 0j j j } C λm + γ 1 λ 2 M } λmc + γ 1 λm where, as in scenario 1,

24 24 Jon A. Wellner, Ying Zhang, and Hao Liu Ze θ 0 Z 2 C Z eθ 0 Z Z e θ Z 0. On the other hand, we find that Ze θ 0 Z 2, T,j 1,T,j A,T,Z Λ 0j e θ 0 Z Z e θ 0 Z, T,j 1,T,j } CλM. +1 Thus the asymptotic variance of the ML for this scenario is } } A 1 BA 1 λmc λmγ +2 } Now for the asymptotic variance of the pseudo-ml under scenario 3. To calculate B ps we first need to calculate C ps T,j,T,j ; Z ] [N j e θ 0 Z Λ 0j N j e θ 0 Z Λ 0j Z,, T,j,T,j [N j e θ 0 Z Λ 0j N j e θ 0 Z Λ 0j } N,Z,, T,j,T,j ] Z,, T,j,T,j NF T,j T,j F T,j F T,j } +N rq/p 2 F T,j F T,j Z,, T,j,T,j r q p F T,j T,j F T,j F T,j } + r q p 2 F T,jF T,j e θ 0 Z Λ 0 T,j T,j +γ 1 e θ 0 Z Λ 0 T,j Λ 0 T,j e θ 0 Z Λ 0 T,j 1+γ 1 Λ 0 T,j if j j. We can then calculate B ps m ps θ 0,Λ 0 ; X 2 Ze θ 0 Z, T,j,T,Z C ps T,j,T,j ; Z Z e θ 0 Z, T,j j,j 1

25 Semiparametric Regression Model for Panel Count Data 25 Ze θ 0 Z, T,j Z e θ 0 Z, T,j,T,Z Λ0 T,j 1+γ 1 Λ 0 T,j e θ 0 Z j,j 1 Ze θ 0 Z 2 Z e θ Z 0 C λm j j + λ2 M 2 U,j U,j k +1 γ j,j 1 j,j 2 +1 C λm + λ2 M 2 U,j U,j 6 γ j,j 2 +1 C λm + λ2 M 2 } γ λmc + λm } γ 12 Ze θ 0 Z 2, T,j A ps,t,z Λ 0j e θ 0 Z Z e θ 0 Z, T,j λmc. 2 Thus we find that the asymptotic variance of the pseudo-ml is given by 2+1 A ps 1 B ps A ps 1 λmc λm γ } 2, 2 θ ps n and the asymptotic relative efficiency of the pseudo-mle to the mle is, under scenario 3, ARpseudo, mlenegbin +1}+ λm γ +1} λm γ 2 } 2 +2}

26 26 Jon A. Wellner, Ying Zhang, and Hao Liu } } +1 + λmγ 2 +2 / λm γ } } +1 + λmγ λm γ ARpseudo, mlep oisson 1+ λm +2 γ +1 ARpseudo, mlep oisson. 1+ λm γ Note that when factor λm/γ q/p 0, is zero then we recover our earlier result for the Poisson case. This is to be expected since P oissonλ 0 t expθ 0Z becomes the limiting distribution of Negative Binomialr, p/p + qft as p 1.

27 Semiparametric Regression Model for Panel Count Data 27 6 Conclusions and Further Problems Conclusions As in the case of panel count data without covariates studied in [29] and [33], the pseudo likelihood estimation method for the semiparametric proportional mean model with panel count data proposed and studied in [35] and [36] also has advantages in terms of computational simplicity. The results of section 5 above show that the maximum pseudo-likelihood estimator of the regression parameters can be very inefficient relative to the maximum likelihood estimator, especially when the distribution of is heavy - tailed. In such cases it is clear that we will want to avoid the pseudo-likelihood estimator, and the computational effort required by the full maximum likelihood estimators can be justified by the consequent gain in efficiency. Our derivation of the asymptotic normality of the maximum likelihood estimator of the regression parameters results in a relatively complicated expression for the asymptotic variance which may be difficult to estimate directly. Hence it becomes important to develop efficient algorithms for computation of the maximum likelihood estimator in order to allow implementation, for example, of bootstrap inference procedures. Alternatively, profile likelihood inference may be quite feasible in this model; see e.g. [24], [25], [26] for likelihood ratio procedures in some related interval censoring models. Further problems The asymptotic normality results stated in section 4 will be developed and given in detail in [34]. There are quite a large number of interesting problems still open concerning the semiparametric model for panel count data which has been studied here. Here is a short list: Find a fast and reliable algorithm for computation of the ML θ of θ. Although reasonable algorithms for computation of the maximum pseudolikelihood estimators have been proposed in [35] and [36] based on the earlier work of [29], good algorithms for computation of the maximum likelihood estimators have yet to be developed and implemented. Show that the natural semiparametric profile likelihood ratio procedures are valid for inference about the regression parameter θ via the theorems of [24], [25], and [26]. Do the non-standard likelihood ratio procdures and methods of [1] extend to the present model to give tests and confidence intervals for Λ 0 t? Are there compromise or hybrid estimators between the maximum pseudolikelihood estimators and the full maximum likelihood estimators which have the computational advantages of the former and the efficiency advantages of the latter? Do similar results continue to hold for panel count data with covariates, but with other models for the mean function replacing the proportional mean model given by 1?

28 28 Jon A. Wellner, Ying Zhang, and Hao Liu Are there computational or efficiency advantages to using the ML s for one of the class of Mixed Poisson Process N Z, for example the Negative- Binomial model? Further comparisons with the work of [6], [14], and [20], [21] would be useful. References [1] Banerjee, M. and Wellner, J. A Likelihood ratio tests for monotone functions. Ann. Statist. 29, [2] Begun, J. M., Hall, W. J., Huang, W.M., and Wellner, J. A Information and asymptotic efficiency in parametric-nonparametric models. Ann. Statist. 11, [3] Betensky, R.A., Rabinowitz, D., and Tsiatis, A. A Computationally simple accelerated failure time regression for interval censored data. Biometrika 88, [4] Bickel, P. J., laassen, C.A.J., Ritov, Y., and Wellner, J. A fficient and Adaptive stimation for Semiparametric Models. Johns Hopkins University Press, Baltimore. [5] Cook, R. J., Lawless, J. F., and Nadeau, C Robust tests for treatment comparisons based on recurrent event responses. Biometrics 52, [6] Dean, C. B. and Balshaw, R fficiency lost by analyzing counts rather than event times in Poisson and overdispersed Poisson regression models. J. Amer. Statist. Assoc. 92, [7] Gaver, D. P., and O Muircheartaigh, I. G Robust mpirical Bayes analysis of event rates, Technometrics, 29, [8] Geskus, R. and Groeneboom, P Asymptotically optimal estimation of smooth functionals for interval censoring, part 1. Statist. Neerlandica 50, [9] Geskus, R. and Groeneboom, P Asymptotically optimal estimation of smooth functionals for interval censoring, part 2. Statist. Neerlandica 51, [10] Geskus, R. and Groeneboom, P Asymptotically optimal estimation of smooth functionals for interval censoring case 2. Ann. Statist. 27, [11] Groeneboom, P Nonparametric maximum likelihood estimators for interval censoring and deconvolution. Technical Report 378, Department of Statistics, Stanford University. [12] Groeneboom, P Inverse problems in statistics. Proceedings of the St. Flour Summer School in Probability, Lecture Notes in Math. 1648, Springer Verlag, Berlin. [13] Groeneboom, P. and Wellner, J Information Bounds and Nonparametric Maximum Likelihood stimation. Birkhäuser, Basel. [14] Hougaard, P., Lee, M.T., and Whitmore, G. A Analysis of overdispersed count data by mixtures of Poisson variables and Poisson processes. Biometrics 53, [15] Huang, J fficient estimation for the Cox model with interval censoring, Annals of Statistics, 24,

29 Semiparametric Regression Model for Panel Count Data 29 [16] Huang, J., and Wellner, J. A fficient estimation for the Cox model with case 2 interval censoring, Technical Report 290, University of Washington Department of Statistics, 35 pages. [17] Jongbloed, G The iterative convex minorant algorithm for nonparametric estimation. Journal of Computation and Graphical Statistics 7, [18] albfleisch, J. D. and Lawless, J. F Statistical inference for observational plans arising in the study of life history processes. In Symposium on Statistical Inference and Applications In Honour of George Barnard s 65th Birthday. University of Waterloo, August 5-8, [19] albfleisch, J. D. and Lawless, J. F The analysis of panel count data under a Markov assumption. Journal of the American Statistical Association 80, [20] Lawless, J. F. 1987a. Regression methods for Poisson process data. J. Amer. Statist. Assoc. 82, [21] Lawless, J. F. 1987b. Negative binomial and mixed Poisson regression. Canad. J. Statist. 15, [22] Lawless, J. F. and Nadeau, C Some simple robust methods for the analysis of recurrent events. Technometrics 37, [23] Lin, D. Y., Wei, L. J., Yang, I., and Ying, Z Semiparametric regression for the mean and rate functions of recurrent events. J. Roy. Statist. Soc. B, [24] Murphy, S. and Van der Vaart, A. W Semiparametric likelihood ratio inference. Ann. Statist. 25, [25] Murphy, S. and Van der Vaart, A. W Observed information in semiparametric models. Bernoulli 5, [26] Murphy, S. and Van der Vaart, A. W On profile likelihood. J. Amer. Statist. Assoc. 95, [27] Rabinowitz, D., Betensky, R. A., and Tsiatis, A. A Using conditional logistic regression to fit proportional odds models to interval censored data. Biometrics 56, [28] Shorack, G. R. and Wellner, J. A mpirical Processes with Applications to Statistics. Wiley, New York. [29] Sun, J. and albfleisch, J. D stimation of the mean function of point processes based on panel count data. Statistica Sinica 5, [30] Sun, J. and Wei, L. J Regression analysis of panel count data with covariate-dependent observation and censoring times. J. R. Stat. Soc. Ser. B 62, [31] Thall, P. F., and Lachin, J. M Analysis of Recurrent vents: Nonparametric Methods for Random-Interval Count Data. J. Amer. Statist. Assoc. 83, [32] Thall, P. F Mixed Poisson likelihood regression models for longitudinal interval count data. Biometrics 44, [33] Wellner, J. A. and Zhang, Y Two estimators of the mean of a counting process with panel count data. Ann. Statist. 28, [34] Wellner, J. A., Zhang, Y., and Liu Large sample theory for two estimators in a semiparametric model for panel count data. Manuscript in progress. [35] Zhang, Y stimation for Counting Processes Based on Incomplete Data. Unpublished Ph.D. dissertation, University of Washington. [36] Zhang, Y A semiparametric pseudo likelihood estimation method for panel count data. Biometrika 89,

30 30 Jon A. Wellner, Ying Zhang, and Hao Liu University of Washington Statistics Box Seattle, Washington U.S.A. Department of Statistics University of Central Florida P.O. Box Orlando, Florida University of Washington Biostatistics Box Seattle, Washington U.S.A.

A Semiparametric Regression Model for Panel Count Data: When Do Pseudo-likelihood Estimators Become Badly Inefficient?

A Semiparametric Regression Model for Panel Count Data: When Do Pseudo-likelihood stimators Become Badly Inefficient? Jon A. Wellner 1, Ying Zhang 2, and Hao Liu Department of Statistics, University of