ESTIMATION OF A LOGISTIC REGRESSION MODEL WITH MISMEASURED OBSERVATIONS

Size: px

Start display at page:

Download "ESTIMATION OF A LOGISTIC REGRESSION MODEL WITH MISMEASURED OBSERVATIONS"

Jennifer Griffin
5 years ago
Views:

1 Statistica Sinica 13(2003, ESTIMATION OF A LOGISTIC REGRESSION MODEL WITH MISMEASURED OBSERVATIONS K. F. Cheng and H. M. Hsueh National Cental Univesity and National Chengchi Univesity Abstact: We conside the estimation poblem of a logistic egession model. We assume the esponse obsevations and covaiate values ae both subject to measuement eos. We discuss some paametic and semipaametic estimation methods using mismeasued obsevations with validation data and deive thei asypmtotic distibutions. Ou esults ae extentions of some well known esults in the liteatue. Compaisons of the asymptotic covaiance matices of the studied estimatos ae made, and some lowe and uppe bounds fo the asymptotic elative efficiencies ae given to show the advantages of the semipaametic method. Some simulation esults also show the method pefoms well. Key wods and phases: Kenel estimation, estimated likelihood, logistic egession, measuement eo, misclassification. 1. Intoduction Logistic egession is the most popula fom of binay egession; see Cox (1970 and Pegibon (1981. Reseaches often use logistic egession to estimate the effect of vaious pedictos on some binay outcome of inteest. Basically, the model assumes that the log of the odds of the outcome is a linea function of the pedictos. That is, suppose that the vaiables (X i, Y i follow the model P (Y i = 1 X i = x i = exp(x T i β0 1 + exp(x T i β0 } F (xt i β 0, whee the X i ae andom d-vecto pedictos and the Y i ae Benoulli esponse vaiables. Usually, the maximum likelihood (ML method is used to estimate the egession coefficients β 0. Unde some egulaity conditions, the maximum likelihood estimato of β 0 is asymptotically nomal. The ML method equies that the data consist of pecise measuements fo the binay outcomes and pedictos. Howeve, the data ae often not measued pefectly. Fo instance, Golm, Holloan and Longini (1998, 1999 mentioned that collecting infomation on exposue to infection fo estimating vaccine efficacy may be mismeasued. On the othe hand, Albet, Hunsbege and Bid (1997 and Bollinge and David (1997 gave examples showing that the binay outcome of

2 112 K. F. CHENG AND H. M. HSUEH inteest may also be misclassified. It is geneally tue that the usual analyses based on the mismeasued obsevations lead to inconsistent estimation. The topic of binay egession when pedictos X i ae measued with eo has been the subject of seveal ecent papes, see Caoll, Spiegelman, Lan, Bailey and Abbott (1984, Caoll and Wand (1991, Reilly and Pepe (1995, and Lawless, Kalbfleisch and Wild(1999, etc. When the binay esponses Y i ae subject to misclassification, Pepe(1992 and Cheng and Hsueh (1999 discussed bias coection methods in the estimation of logistic egession paametes. In this pape, we study the estimation poblem of β 0 when both obsevations Y i and pedictos X i ae measued with eo. Paametic and semipaametic methods ae discussed. We find that the poposed semipaametic estimation method is a genealization of the pseudolikelihood method of Caoll and Wand (1991 and the estimated likelihood method of Pepe (1992. Many othe estimating appoaches have been poposed in the liteatue. A method based on the mean scoe was poposed by Pepe, Reilly and Fleming (1994 and Reilly and Pepe (1995 fo the mismeasued outcome data poblem and mismeasued covaiate data poblem, espectively. Howeve, Golm et al. (1999 and Lawless et al. (1999 agued that this semipaametic method is less efficient. Robins, Rotnitzky and Zhao (1994 poposed a class of semipaametic efficient estimatos fo the model with mismeasued pedictos based on the invese pobability weighted estimating equation appoach. When both esponse vaiable and pedictos ae subject to measuement eo, the semipaametic efficient estimato has not been fomally deived yet. In this pape we only emphasize vaious imputation appoaches. The weighting methods, such as the semipaametic efficient estimation deived by Robins et al. (1994, will not be discussed futhe. We assume in this pape that the complete data set consists of a pimay sample plus a smalle validation subsample which is obtained by double sampling scheme. Extension of the selection pobabilities fo the validation data set to depend on the obseved suogate covaiates is discussed biefly in Section 3. Asymptotically, we also suppose that the validation subsample size is a faction of the majo sample size. The estimatos unde discussion will be fomally defined in Section 2. In Section 3, some asymptotic esults ae deived fo semipaametic and paametic estimatos of β 0. Compaisons of thei asymptotic covaiance matices ae given in Section 4. Futhe, finite sample popeties ae exploed though a simulation study in Section Estimation Methods Suppose the tue andom vaiables (Y, X ae subject to mismeasuement and the suogate obsevations ae epesented by (Y 0, W, hee X = (1, X 1 T and W = (1, W 1 T. In addition to the pimay sample (Y 0 i, W i, i = 1,..., n},

3 LOGISTIC REGRESSION MODEL WITH MISMEASURED OBSERVATIONS 113 a smalle validation subsample is also obseved in ode to undestand the mismeasuement stuctue. The sampling scheme is to andomly select k units fom the pimay sample and at the selected units the tue measuement devices ae used to obtain the validation data. Hee the simple andom sampling scheme is consideed and fo the sake of simplicity, the fist k units of the pimay sample ae assumed to be the selected units. Thus we have the validation subsample (Yi 0, Y i, X i, W i, i = 1,..., k}. Late in Section 3 the theoetical esults will be given fo the case that the sampling scheme depends on the value of the suogate pedictos W. Fom the pimay sample, we denote the egession function of Y 0 on W = w as π 0 (w = P (Y 0 = 1 w. Fo Y and W being assumed mutually independent given X, and Y 0 and X being mutually independent given Y, W, the above egession function can be ewitten as π 0 (w = π(w, β 0, whee π(w, β = 1 θ 0 (w}ef (X T β w} + φ 0 (we F (X T β w} 1 π(w, β. Hee the expectation is taken with espect to the conditional density fx W 0, F ( = 1 F (, and the misclassification pobability functions θ 0 (w and φ 0 (w ae defined by θ 0 (w = P (Y 0 = 0 Y = 1, w and φ 0 (w = P (Y 0 = 1 Y = 0, w. Note that the suogate W is often a fallible measuement of X and coesponding to a coase patition in the sample space. The conditional independence of Y 0 and X given Y, W would be implied if the misclassification scheme depends on X only though the value of W. It is clea that the expectation EY 0 F (W T β 0 }W T ] is not necessaily zeo. Hence it is inappopiate to apply the usual likelihood method to the pimay sample fo infeence about β 0. In the following, we discuss some consistent estimates based on diffeent imputation methods. Suppose fist that fx W 0 and θ0 (w, φ 0 (w ae known a pioi. Then fo the obsevation (yj 0, w j, j = k + 1,..., n, in the pimay sample, the coesponding log-likelihood and scoe function ae L (β y 0 j, w j=y 0 j ln π(w j, β + (1 y 0 j ln π(w j, β, S (β y 0 j, w j L (β y 0 j, w j = y0 j π(w j, β π(w j, β π(w j, β 1 θ0 (w j φ 0 (w j }EF (X T j β F (X T j βx j w j }. Since, yj 0 π(w j, β π(w j, β π(w j, β = 2yj 0 1 yj 0π(w j, β + (1 yj 0 π(w j, β, 1 θ 0 (w j φ 0 (w j = 1 θ 0 EF (Xj T (w j π(w j, β} β w j} EF (Xj T β w j}e F (Xj T β w j},

4 114 K. F. CHENG AND H. M. HSUEH we have y 0 j π(w j, β π(w j, β π(w j, β 1 θ0 (w j φ 0 (w j } = E(Y j y 0 j, w j EF (X T j β w j} EF (X T j β w j}e F (X T j β w j}. Consequently, S (β yj 0, w j = E(Y j yj 0, w j EF (Xj T β w j} EF (Xj Tβ w j}e F (Xj Tβ w j} E F (Xj T β F } (Xj T βx j w j = A0 1 (y0 j, w j A 0 2 (y0 j, w j}ef (Xj T β F (Xj T βx j w j } A 0 1 (y0 j, w jef (Xj T β w j} + A 0 2 (y0 j, w je F ], (Xj T β w j} whee A 0 1 (y0 j, w j = P (Y 0 j = y 0 j Y j = 1, w j = θ 0 (w j } (1 y 0 j 1 θ 0 (w j } y 0 j, and A 0 2 (y0 j, w j = P (Y 0 j = y 0 j Y j = 0, w j = φ 0 (w j } y 0 j 1 φ 0 (w j } (1 y 0 j. Using this esult, we easily see that the MLE ˆβ f of β 0 can be obtained by solving the likelihood equations k n S(β y i, x i + S (β yj 0, w j = 0, (1 i=1 j=k+1 whee S(β y i, x i = y i F (x T i β}x i. Unfotunately, in applications fx W 0 (, θ0 ( and φ 0 ( ae aely known and ˆβ f can not be obtained. If the subsample size k is lage enough, a simple estimate ˆβ s can be obtained by using only the validation subsample, i.e., ˆβ s satisfies ki=1 S(β y i, x i = 0. Such a simple estimate is in geneal not efficient; see Cheng and Hsueh (1999. The basic eason is that the infomation contained in the pimay sample is not popely exploited. To do this, assume thee exist paametic functions and paametes (γ 0, α 0, independent of β 0, such that θ 0 (W = θ(w ; α 0, φ 0 (W = φ(w ; α 0 and fx W 0 (X W = f(x W ; γ0 almost suely. See Caoll and Wand (1991 and Cheng and Hsueh (1999 fo elated discussions. Then one can employ the usual appoach to obtain the joint MLE ( ˆβ m, ˆα m, ˆγ m. On the othe hand, the maximum pseudolikelihood estimate (MPLE ˆβ p can be obtained by solving (1 with α 0 and γ 0 in S (β yj 0, w j being eplaced by thei MLE ˆα p and ˆγ p deived fom the validation data. In geneal, ˆβm can be shown to be moe efficient than ˆβ p asymptotically. Howeve, in solving the equations fo finite sample cases, the computational complexity gows (and numeical stability deteioates with the numbe of unknown paametes, so that ˆβ m may not have pope pefomance, see the simulation study in Section 5.

5 LOGISTIC REGRESSION MODEL WITH MISMEASURED OBSERVATIONS 115 The last appoach to be discussed is a semipaametic method. Assuming fx W 0 ( w, θ0 (w and φ 0 (w in S (β yj 0, w j of (1 ae unknown functions, a nonpaametic method is used to estimate them. Fomally, we popose the MPLE ˆβ sp solving the following pseudolikelihood equations Ŝ (β y 0, w = k n S(β y i, x i + Ŝ (β yj 0, w j = 0, i=1 j=k+1 Â1(y 0, w Â2(y 0, w}êf (XT β F (X T βx w] Â1 (y 0, wêf (XT β w} + Â2(y 0, wê F (X T β w}]. Hee, fo any integable function g(x; β, the estimated expectation Êg(X; β w} is given by Êg(X; β w} = k i=1 K h (w w i g(x i ; β}/ k i=1 K h (w w i }. Moeove Â1(y 0, w and Â2(y 0, w ae estimates of A 0 1 (y0, w and A 0 2 (y0, w, with θ 0 (w and φ 0 (w being eplaced by thei nonpaametic estimates ˆθ(w = k i=1 K h (w w i y i (1 yi 0}/ k i=1 K h (w w i y i } and ˆφ(w = k i=1 K h (w w i (1 y i yi 0}/ k i=1 K h (w w i (1 y i }. The above kenel function K(t is taken to be a density function, K h (t = h 1 K(t/h, the bandwidth h depends on n and tends to zeo as n. We emak that the poposed semipaametic estimation method genealizes the methods of Caoll and Wand (1991 and Pepe (1992. If Y = Y 0 with pobability one, i.e., only measuement eo occus in the pedicto, ˆβ sp is the MPLE of Caoll and Wand (1991. On the othe hand, if X = W with pobability one, i.e., only misclassification occus in the esponse, ˆβsp educes to the maximum estimated likelihood estimate of Pepe ( Asymptotic Distibutions The asymptotic distibutions of the estimatos poposed in the peceding section will be pesented hee fo the case d = 1. Extension to geneal d is simple and the elated esults will be given in a emak. The asymptotic popeties depend on cetain egulaity conditions. A.1 β 0 Λ, an open set in R 2. A.2 E(X 2 <. A.3 The misclassification pobability functions θ 0 (w and φ 0 (w (0, 1, and the density function fw 0 (w of W ae stictly positive in the space of W. Futhemoe, these functions and thei l-th deivatives, l = 1, 2, ae in L 4 and also satisfy a weighted Lipschitz condition. (A function η( is said to satisfy a weighted Lipschitz condition if thee exists a constant c and a bounded function ψ in L 4 such that η(x η(y < ψ(x x y fo all x, y with x y < c.

6 116 K. F. CHENG AND H. M. HSUEH A.4 The paametic functions θ(w; α,φ(w; α and f(x w; γ, and thei patial deivatives θ(w; α/ α, φ(w; α/ α and f(x w; γ/ γ, ae unifomly continuous at (α 0, γ 0 fo all x and w in the suppot of X and W. A.5 The function K( is a second-ode kenel (see Gasse and Mülle (1979. A.6 h = h n 0 with nh 2 and nh 4 0, as n. Also, as n, k = n1 + O(h 2 }, whee (0, 1]. In the following, we give the basic asymptotic esults unde the assumption that α = (α 0, α 1 T and γ = (γ 0, γ 1 T. Except fo the asymptotic nomality of ˆβ sp, whose poof is given in the Appendix, the poofs fo othe estimatos ae standad and thus ae omitted. Theoem 1. Suppose conditions (A.1 (A.6 hold and let β be any estimato discussed in Section 2. Then as n, n 1/2 ( β β 0 conveges in distibution to a nomal with mean zeo. Thei asymptotic covaiance matices ae lim Cov n n1/2 ( ˆβ f β 0 } = Σ f = I v + (1 Iv} c 1, lim Cov n n1/2 ( ˆβ s β 0 } = Σ s = (I v 1, lim Cov n n1/2 ( ˆβ m β 0 } = Σ m = Σ f + lim Cov n n1/2 ( ˆβ p β 0 } = Σ p = Σ f + lim Cov n n1/2 ( ˆβ sp β 0 } = Σ sp = Σ f + Hee I v = EF (X T β 0 F (X T β 0 X 2 } and (1 2 Σ f I m Σ f, (1 2 Σ f I p Σ f, (1 2 Σ f I sp Σ f. Iv c 1 θ 0 (W φ 0 (W } 2 EF (X T β 0 F ] 2 (X T β 0 X W } = E π 0 (W 1 π 0, (W } I p = E 1 θ0 (W φ 0 (W }EF (X T β 0 F (X T β 0 X W } π 0 (W 1 π 0 (W } ( π 0 α (W ; α 0, γ 0 T π γ 0(W ; α0, γ 0 ] E F (X T β 0 θ 2 (W ;α 0 + F (X T β 0 φ 2 (W ;α θ 0 (W 1 θ 0 (W } φ 0 (W 1 φ 0 (W } 0 E f(x W ; γ 0 } 2 E 1 θ0 (W φ 0 (W }EF (X T β 0 F (X T β 0 X W } ( π 0 α (W ; α 0, γ 0 T T π 0 (W 1 π 0 (W } π γ(w 0 ; α 0, γ 0,

7 LOGISTIC REGRESSION MODEL WITH MISMEASURED OBSERVATIONS θ 0 (W φ 0 (W } EF (X T β 0 F ] 2 (X T β 0 X W } I sp = E π 0 (W 1 π 0 (W }] 2 ( EF (X T β 0 W }θ 0 (W 1 θ 0 (W }+E F (X T β 0 W }φ 0 (W 1 φ 0 (W } } + 1 θ 0 (W φ 0 (W } 2 EF 2 (X T β 0 W } E 2 F (X T β 0 W }], I m = (1 2 (β 0,α 0,γ 0 (α 0,γ 0 T (β 0,α 0,γ 0 Σ f (β 0,α 0,γ 0 } 1 T (β 0,α 0,γ 0, 1 θ 0 (W φ 0 (W }EF (X T β 0 (β 0,α 0,γ 0 = (1 E F (X T β 0 X W } π 0 (W 1 π 0 (W } ( π 0 α (W ; α 0, γ 0 T ] π γ(w 0 ; α 0, γ 0, ] (α 0,γ 0 = E F (X T β 0 θ 2 (W ;α 0 θ 0 (W 1 θ 0 (W } + F (X T β 0 φ 2 (W ;α 0 φ 0 (W 1 φ 0 (W } 0 0 E f(x W ; γ 0 } 2 ( 1 +(1 E π 0 α (W ; α 0, γ 0 2 π 0 (W 1 π 0 (W } π γ 0(W ; α0, γ 0. In this, π α (W ; α 0, γ 0 = π0 (W α (α 0,γ 0 = E F (X T β 0 W } φ(w ; α 0 EF (X T β 0 W } θ(w ; α 0, π γ (W ; α 0, γ 0 = π0 (W γ (α 0,γ 0 = 1 θ 0 (W φ 0 (W }EF (X T β 0 f(x W ; γ 0 W }, θ(w ; α 0 θ(w ; α = α, φ(w ; α 0 φ(w ; α = α 0 α, α 0 f(x W ; γ 0 = ln f(x W ; γ γ, γ 0 whee, fo any column vecto A, A 2 = AA T. Remak 1. It can be seen that I m, I p and I sp ae nonnegative definite matices, and thus (1 2 Σ f I m Σ f, (1 2 Σ f I p Σ f, (1 2 Σ f I sp Σ f can be egaded as addi-

8 118 K. F. CHENG AND H. M. HSUEH tional vaiations due to estimating the unknown functions θ 0 (W, φ 0 (W and fx W 0 by paametic and nonpaametic methods. Remak 2. It is well known that the infomation matix I v is the vaiance of the scoe; that is, I v = Va S(β 0 Y, X} = EVa S(β 0 Y, X X}]. We see that the infomation matices Iv c and I sp also have simila expessions: Iv c = EVa S (β 0 Y 0, W W }] = Eρ(Y, Y 0 W Va S (β 0 Y, W W }], and I sp = E(ρ(Y, Y 0 W 1 ρ(y, Y 0 W 1 ρ(y, F (X T β 0 W }]Va S (β 0 Y, W W }, whee S (β Y, W = ln P (Y W /, and ρ(y, Y 0 W = Co 2 (Y, Y 0 W = 1 θ 0 (W φ 0 (W } 2 EF (X T β 0 W }E F (X T β 0 W }}/π 0 (W 1 π 0 (W }], ρ(y, F (X T β 0 W = Co 2 (Y, F (X T β 0 W = EF 2 (X T β 0 W } EF (X T β 0 W }] 2 EF (X T β 0 W }E F (X T β 0. W } Thus these infomation matices depend not only on the scoe S (β 0 Y, W but also on the squaed coelation functions. Remak 3. Fo obtaining the validation subsample, a simple andom sampling design is used. Howeve, the selection pobabilities may depend on the value of W. Fo example, suppose g(w is the selection pobability. Then the asymptotic covaiance matix of ˆβ sp becomes Σ sp = Σ + Σ I sp Σ, whee Σ 1 =E g(w EF (X T β 0 F ] (X T β 0 X 2 W } +E (1 g(w } 1 θ0 (W φ 0 (W } 2 1 g(w } Isp=E 2 g(w ( 1 θ 0 (W φ 0 (W } EF (X T β 0 F (X T β 0 X W } π 0 (W 1 π 0 (W } ] 2 EF (X T β 0 F (X T β 0 X W } π 0 (W 1 π 0 (W }] 2 EF (X T β 0 W }θ 0 (W 1 θ 0 (W }+E F (X T β 0 W }φ 0 (W 1 φ 0 (W } } +1 θ 0 (W φ 0 (W } 2 EF 2 (X T β 0 W } E 2 F (X T β 0 W }]. Othe asymptotic covaiance matices in Theoem 1 can be modified accodingly. Remak 4. It is clea that Σ sp educes to the asymptotic covaiance matix of Caoll and Wand s (1991 estimato povided Y = Y 0 with pobability one. Suppose X = W with pobability one, then Σ sp is the asymptotic covaiance matix of Pepe s (1992 estimato; see also Cheng and Hsueh (1999., ] 2

9 LOGISTIC REGRESSION MODEL WITH MISMEASURED OBSERVATIONS 119 Remak 5. The esults in Theoem 1 can be extended to d > 1 vecto pedictos, povided the function K is a pth ode kenel, p > d, and the bandwidth paamete satisfies nh 2d and nh 2p 0 as n. Remak 6. All the asymptotic covaiance matices can be estimated by moment type estimates. 4. Compaisons of Asymptotic Covaiance Matices In this section, we discuss the behavios of diffeent estimatos by compaing thei asymptotic covaiance matices. The geneal esults basically agee with ou expectation, but some of thei poofs ae not tivial. Fo any matices A and B, we wite A B if A B is a nonnegative definite matix Results unde geneal conditions Fist, fom Remak 1 of Section 3, we have Σ m Σ f, Σ p Σ f and Σ sp Σ f. Futhe, suppose we let v (α 0,γ 0 } 1 be the asymptotic covaiance matix of the MLE (ˆα p, ˆγ p which is obtained fom the validation subsample. Then one can ewite Σ p as Σ p = Σ f Σ 1 f + (β 0,α 0,γ 0 v (α 0,γ 0 } 1 T (β 0,α 0,γ 0 ]Σ f, and consequently we have Σ p Σ m =Σ f (β 0,α 0,γ 0 v (α 0,γ 0 } 1 (α 0,γ 0 T (β 0,α 0,γ 0 Σ f (β 0,α 0,γ 0 } 1] T (β 0,α 0,γ 0 Σ f. Since v (α 0,γ 0 } 1 and (α 0,γ 0 T (β 0,α 0,γ 0 Σ f (β 0,α 0,γ 0 } 1 ae espectively, the asymptotic covaiance matices of (ˆα p, ˆγ p and (ˆα m, ˆγ m, the above esults yield the following theoem. Theoem 2. Given conditions (A.1 (A.6, we have Σ s Σ f, Σ sp Σ f and Σ p Σ m Σ f Results unde special constaints Cheng and Hsueh (1999 compaed the asymptotic covaiance matices when X = W with pobability one. Hee the compaisons focus on the case when Y = Y 0 with pobability one. As a consequence of Theoem 1, if Y = Y 0 with pobability one, the matices Σ f and I sp simplify to Σ f = EF (X T β 0 F (X T β 0 X 2 } ( EF (X T β 0 +(1 E F (X T β 0 X W }] 2 ] 1, EF (X T β 0 W }E F (X T β 0 W }

10 120 K. F. CHENG AND H. M. HSUEH ( ] I sp = E ρ(y, F (X T β 0 W E S (β 0 Y, W } 2 W. Then Coollay 1 follows easily fom Σ s Σ sp = Σ f ( (1 ES (β 0 Y, W } 2 (1 2 ( E ρ(y, F (X T β 0 W ES (β 0 Y, W } 2 W ] + (1 2 ES (β 0 Y, W } 2 ES(β 0 Y, X} 2] 1 ES (β 0 Y, W } Σ 2 f. Coollay 1. Suppose conditions (A.1 (A.6 ae satisfied, that Y = Y 0 with pobability one, and ρ(y, F (X T β 0 W min(/(1, 1. Then Σ s Σ sp. Accodingly, the semipaametic estimato ˆβ sp is always bette than the simple MLE ˆβ s unde ou conditions. Simila conclusion can be deived fo the case that X = W with pobability one. Now suppose β 0 is a scala. Then unde the conditions of Theoem 1 and ρ(y, F (X T β 0 W ρ, with pobability one fo some constant ρ, the ARE of ˆβ s with espect to ˆβ sp is always smalle than e 1 (ρ, I, = 1 (1 2 II ρ /(1 }] + (1 I} 2, whee the coefficient of eliability I = ES (β 0 Y, W } 2 /ES(β 0 Y, X} 2 1. This can be used to measue the quality of suogate pedictos W. Some cuves of e 1 ae given in Figue 1 fo ρ = 0.3. The esults clealy show that if the coefficient I is lage enough, then even fo a smalle sampling faction, the semipaametic estimato ˆβ sp is still much moe efficient than the simple MLE ˆβ s. 1.0 I= I=0.3 I= Bound of ARE 0.4 I=1.0 I=0.9 I= Figue 1. Fo ρ = 0.3, cuves of e 1 (ρ, I, at vaious I.

11 LOGISTIC REGRESSION MODEL WITH MISMEASURED OBSERVATIONS 121 Simila esults can be deived fo the ARE of ˆβ sp with espect to ˆβ f. We note that unde the same conditions, a lowe bound fo the ARE of ˆβsp with espect to ˆβ f is e 2 (ρ, I, = 1 (1 2 Iρ 2 + (1 I + (1 2 Iρ. Some cuves of e 2 ae also given in Figue 2 fo ρ = 0.3. The ARE is at least 0.60 fo 0.3. Futhe, fo fixed, e 2 inceases as the coefficient of eliability I deceases I=0.7 I= Bound of ARE 0.4 I=0.1 I=0.3 I= Figue 2. Fo ρ = 0.3, cuves of e 2 (ρ, I, at vaious I. 5. Simulation Studies In ode to study the finite sample pefomance of the estimates, some empiical studies wee caied out. The logistic egession model consideed was F (x T β 0 = exp(x/1 + exp(x} so (β0 0, β0 1 = (0, 1. A linea eo model fo the pedicto X was assumed: W i = X i + v U i, fo some v. X i and U i wee pseudo independent N(0, 0.25 andom vaiables. Thus given W i = w i, X i is a N(w i /(1 + v 2, (0.25v 2 /(1 + v 2 vaiate. We also assumed that given W i = w i, the misclassification pobabilities wee independent of X i and Y i, i.e., θ 0 (w i = φ 0 (w i, fo all w i, whee θ 0 (w i = θ(w i, α 0 and φ 0 (w i = φ(w i, α 0. Fou diffeent models wee consideed. (1a W i = X i + 0.1U i and Y i = Yi 0 with pobability one. In this case, no misclassification occus, and hence the only functional fom needed fo estimating ˆβ m and ˆβ p is f(x w; γ = N(w/(1 + 4γ, γ/(1 + 4γ, γ > 0. (1b W i = X i + 0.1U i and θ 0 (w i = φ 0 (w i = 0.1. In this case, the functional foms needed fo computing ˆβ m, ˆβ p ae θ(w, α = α 0, φ(w, α = α 1, and f(x w; γ = N(w/(1 + 4γ, γ/(1 + 4γ, γ > 0.

12 122 K. F. CHENG AND H. M. HSUEH (2a W i = X i + U i and Y i = Yi 0 with pobability one. In this case, no misclassification occus, and hence the only functional fom needed fo estimating ˆβ m and ˆβ p is f(x w; γ = N(w/(1 + 4γ, γ/(1 + 4γ, γ > 0. (2b W i = X i + U i and θ 0 (w i = φ 0 (w i = exp(w i 2.5/1 + exp(w i 2.5}. In this case, the misclassification pobabilities follow a logit model with egession coefficients α 0 0 = 2.5, α0 1 = 1. The functional foms needed fo estimating ˆβ m and ˆβ p ae θ(w, α = exp(α 0 + α 1 w/1 + exp(α 0 + α 1 w}, φ(w, α = exp(α 0 + α 1 w/1 + exp(α 0 + α 1 w}, and f(x w; γ = N(w/(1 + 4γ, γ/(1 + 4γ, γ > 0. Model (1b featues small measuement eo fo both esponses and covaiates. On the othe hand, Model (2b has moe sevee measuement eo. The pimay sample size used was n = 150, and the sampling factions fo validation subsamples wee = 0.2, 0.4 and 0.6. One thousand pseudo data sets wee geneated to compute the simulated mean squaed eos. The function K( used in computing the nonpaametic egession estimates was the Epanechnikov kenel, that is, K(t = (3/4(1 t 2 I 1,1] (t; see Eubank (1988. Seveal bandwidths h = aˆσ w k 1/3 wee used in ou simulations. Hee ˆσ w is the sample standad deviation of W based on the validation data set, and a is some constant value. Such choice of bandwidth was justified by Sepanski, Knickebocke and Caoll (1994; see also Caoll and Wand (1991 and Wang and Wang (1997. Fist, we investigate the pefomance of ˆβ sp fo vaious choices of a. The pefomance of ˆβsp = ( ˆβ sp,0, ˆβ sp,1 T is measued by its simulated total mean squaed eo (TMSE, defined to be MSE( ˆβ sp,0 +MSE( ˆβ sp,1. Table 1 epots the simulation esults. It is seen that the pefomance of ˆβ sp is not sensitive to the choice of a. This agees with many findings fo semipaametic estimation using kenel egession. Hee, howeve, a bette choice is a = 1 and hence we use this value fo h in emaining simulations. Table 1. Simulated TMSE of ˆβ sp fo diffeent a W = X +.1U W = X + U θ 0 (W = 0 θ 0 (W =.1 θ 0 (W = 0 θ 0 (W = exp(w exp(W 2.5} a =.2 =.4 =.6 =.2 =.4 =.6 =.2 =.4 =.6 =.2 =.4 = Next, we compae the simulated mean squaed eo of diffeent estimates fo β = (β 0, β 1 T. The esults ae tabulated in Table 2. Obviously, ˆβ sp has the best

13 LOGISTIC REGRESSION MODEL WITH MISMEASURED OBSERVATIONS 123 oveall pefomance among the competing estimates. Moeove, the behavio of the simulated TMSE fo ˆβ sp seems insensitive to the selection of values. This is paticulaly tue when the model has only mino measuement eos. Unepoted calculations also show that the simulated standad eos (SE fo ˆβ sp,0 and ˆβ sp,1 ae stable. Unde all cases consideed, the simulated SE s ange fom to fo ˆβ sp,0 and fom to fo ˆβ sp,1. In geneal, ˆβsp has much bette pefomance than the othe estimates, especially when 0.4. Fom Table 2, we also see that the estimates ˆβ f and ˆβ p ae quite competitive and bette than the simple estimate ˆβ s. The latte statement is especially clea when = 0.2. Howeve, the simulated SE s fo ˆβ f,0, ˆβ p,0 and ˆβ s,0 ae not vey diffeent. These values ange fom to On the othe hand, the simulated SE s fo ˆβ f,1, ˆβ p,1 ae smalle than those of ˆβ s,1, paticulaly when = 0.2. The lagest simulated SE fo ˆβ f,1 and ˆβ p,1 is compaed with the coesponding value fo ˆβ s,1. Table 2. Simulated TMSE of diffeent estimatos ˆβ s ˆβm ˆβp ˆβf ˆβsp W = X +.1U θ 0 (W = 0. = = = θ 0 (W = θ 0 (W = 0.1 = = = W = X + U θ 0 (W = 0. = = = exp(w exp(W 2.5 = = = Finally, we comment on the compaison between ˆβ m and ˆβ p. In geneal, ˆβ p is bette than ˆβ m unless = 0.6. This happens because thee ae moe paametes to be estimated simultaneously in computing ˆβ m, and the validation subsample size k is not lage enough. The lagest simulated SE is fo ˆβ m,1 and fo ˆβ m,0, showing that the estimates ˆβ m ae not vey stable. The pefomance of ˆβ m and ˆβ p ae expected to become bette if k is sufficiently lage and the paametic functions θ(w, α, φ(w, α and f X W (x w; γ ae coectly modeled. Cheng and Hsueh (1999 epoted that the pefomances of ˆβ m and ˆβ p depend heavily on

14 124 K. F. CHENG AND H. M. HSUEH the coect choice of the models fo misclassification pobabilities when X = W with pobability one. Theefoe, ˆβ m and ˆβ p ae not vey obust and exta cae needs to be taken to apply the paametic methods. Acknowledgements This eseach was suppoted in pat by the National Science Council, R.O.C. The authos thank the Edito, an associate edito and efeees fo thei useful comments and suggestions which geatly impove the pesentation of this pape. Appendix Poof of Theoem 1. By Taylo s Theoem and fo n, we have 1 n( ˆβsp β = 0 2ˆl(β } 0 n 2 1 3ˆl(β } 1 } 2n 3 ( ˆβ 1 ˆl(β sp β ] 0 0 +o p (1, n whee β is some quantity such that β β 0 ˆβ sp β 0. Hee ˆl is the pseudolikelihood, and ˆl(β 0 k l1,i (β 0 } n ˆl2,j (β 0 } = +, l 1,i (β 0 ˆl 2,j (β 0 Futhe, i=1 = Y i F (X T i β0 }X i, = j=k+1 Â1(Yj 0, W j Â2(Yj 0, Wj]ÊF (XT j β0 F (Xj T β0 X j W j ] ÊF (Xj T β0 W j }Â1(Yj 0, W j + Ê F (Xj T β0 W j }Â2(Yj 0, W j. 2ˆl(β 0 2 = k 2 l 1,i (β 0 2 i=1 } + n j=k+1 2 l 1,i (β 0 2 = F (X T i β 0 F (X T i β 0 X i X T i, 2ˆl2,j (β 0 2 2ˆl2,j (β 0 2 ( Â1(Yj 0 =, W j Â2(Yj 0, Wj]ÊF (XT j β0 F (Xj T β0 X j W j ] ÊF (Xj T β0 W j }Â1(Yj 0, W j + Ê F (Xj T β0 W j }Â2(Yj 0, W j }, 2 Â1(Y j 0, W j Â2(Yj 0, Wj]Ê1 2F (XT j β0 }F (Xj T β0 F (Xj T β0 X j W j ] ÊF (Xj T β0 W j }Â1(Yj 0, W j + Ê F (Xj T β0 W j }Â2(Yj 0, W. j

15 LOGISTIC REGRESSION MODEL WITH MISMEASURED OBSERVATIONS 125 By the Law of Lage Numbes, 1 2ˆl(β } 0 n 2 = 1 2 l(β 0 } 1 n 2 + 2ˆl(β 0 } n l(β 0 }] n 2 = EF (X T β 0 F (X T β 0 }XX T ] 1 θ 0 (W φ 0 (W } 2 } +(1 E π 0 (W 1 π 0 (W } (EF (XT β 0 F (X T β 0 X W ] 2 1 +o p (1 + O p ( n h h 2 n = Σ 1 f + o p (1, n, h 0 and n 2 h. Then as n, h 0, n 2 h, } n( ˆβ sp β 0 ˆl(β = Σ f + o p (1} 0 / n + o p (1. Futhe, by Taylo s Theoem again, as n, n ˆl2,j (β 0 } n l2,j (β 0 } = + H n,j + O p ( 1 + nh nh 3 2 h + nh 3 2, i=k+1 i=k+1 whee H n,j = 1 ki=1 k h i,j, and h i,j = K ( h(w i W j 2 l 2,j (β 0 f(w j θ 0 (W j 2 l 2,j (β 0 } (1 Yi Yi 0 φ 0 (W j } + φ 0 (W j E F (Xj T β0 W j } 2 l 2,j (β 0 + EF (Xj T β0 W j } 2 l 2,j (β 0 } + EF (Xj T β0 F (Xj T β0 X j W j ] } Yi 1 Y 0 i θ 0 (W j } EF (X T j β0 W j } } ] F (Xi T β 0 EF (Xj T β 0 W j } F (Xi T β0 F (Xi T β0 X i EF (Xj T β0 F (Xj T β0 X j W j ]]. As a consequence, fo n, 1 ˆl(β 0 } n k n k l1,i (β 0 } = n k(n k n i=1 j=k+1 1 +O p ( + n 1 n 3 2 h 3 2 h n 1 2 h + h n k n l2,j (β 0 } + n k n h i,j ]

16 126 K. F. CHENG AND H. M. HSUEH Define two i.i.d. andom vectos Z i = (X i, W i, Y i, Yi 0, i = 1,..., k and Z j = (W j, Yj 0, j = k + 1,..., n and let Q n (Z i ; Zj = k l1,i (β 0 } + n k l2,j (β 0 } + n k n n n h i,j. Then k 1 (n k 1 k nj=k+1 i=1 Q n (Z i ; Zj is a genealized U-statistic. Applying the Cental Limit Theoem fo genealized U-statistics (see Sefling(1980, we have, fo n, n k n Q n (Z i ; Zj k(n k i=1 j=k+1 k = n 1 k i=1 ( d N 0, Σ 1 f + EQ n (Z i ; Z j Z i} + 1 n k (1 2 I sp, n i=k+1 EQ n (Z i ; Z j Z j } + o p (1 EQ n (Z i ; Zj Z i} l1,i (β 0 } = k n + n k n 1 θ 0 (W i φ 0 (W i }EF (X T i β0 F (X T i β0 X i W i ] π 0 (W i 1 π 0 (W i } ( Y i 1 Yi 0 θ 0 (W i } (1 Y i Yi 0 φ 0 (W i } 1 θ 0 (W i φ 0 (W i }F (Xi T β0 EF (Xi T β0 W i }] + O p (h 2, EQ n (Z i ; Zj Zj } = n k l2,j (β 0 }. n Consequently, as n, n( ˆβ sp β 0 d N (0, Σ f + (1 2 Σ f I sp Σ f. Refeences Albet, P. S., Hunsbege, S. A. and Bid, F. M. (1997. Modeling epeated measues with monotonic odinal esponses and misclassification, with applications to studying matuation. J. Ame. Statist. Assoc. 92, Bollinge, C. R. and David, M. H. (1997. Modeling discete choice with esponse eo: Food stamp paticipation. J. Ame. Statist. Assoc. 92, Caoll, R. J., Spiegelman, C. H., Lan, K. K. G., Bailey, K. T. and Abbott, R. D. (1984. On eos-in-vaiables fo binay egession models. Biometika 71, Caoll, R. J. and Wand, M. P. (1991. Semipaametic estimation in logistic measuement eo models. J. Roy. Statist. Soc. Se. B 53, Cheng, K. F. and Hsueh, H. M. (1999. Coecting bias due to misclassification in the estimation of logistic egession models. Statist. Pobab. Lett. 44,

17 LOGISTIC REGRESSION MODEL WITH MISMEASURED OBSERVATIONS 127 Cox, D. R. (1970. The Analysis of Binay Data. Chapman and Hall, London. Eubank, R. L. (1988. Spline Smoothing and Nonpaametic Regession. Macel Dekke, New Yok. Gasse, T. and Mülle, H. G. (1979. Kenel Estimation of Regession Functions in Smoothing Techniques fo Cuve Estimation. (Edited by T. Gasse and M. Rosenblatt, Lectue Notes in Mathematics 757, Spinge-Velag, Belin. Golm, G. T., Halloan, M. E. and Longini, I. M. J. (1998. Semi-paametic models fo mismeasued exposue infomation in vaccine tials. Statist. Medicine 17, Golm, G. T., Halloan, M. E. and Longini, I. M. J. (1999. Semipaametic methods fo multiple exposue mismeasuement and bivaiate outcome in HIV vaccine tials. Biometics 55, Lawless, J. F., Kalbfleisch, J. D. and Wild, C. J. (1999. Semipaametic methods fo esponseselective and missing data poblems in egession. J. Roy. Statist. Soc. Se. B 61, Pepe, M. S. (1992. Infeence using suogate outcome data and a validation sample. Biometika 79, Pepe, M. S., Reilly, M. and Fleming, T. R. (1994. Auxiliay outcome data and the mean scoe method. J. Statist. Plann. Infeence 42, Pegibon, D. (1981. Logistic egession diagnostics. Ann. Statist. 9, Reilly, M. and Pepe, M. S. (1995. A mean scoe method fo missing and auxiliay covaiate data in egession models. Biometika 82, Robins, J. M., Rotnitzky, A., Zhao, L.-P. and Lipsitz, S. (1994. Estimation of egession coefficients when some egessos ae not always obseved. J. Ame. Statist. Assoc. 89, Sepanski, J. H., Knickebocke, R. and Caoll, R. J. (1994. A semipaametic coection fo attenuation. J. Ame. Statist. Assoc. 89, Sefling, R. J. (1980. Appoximation Theoems of Mathematical Statistics. John Wiley, New Yok. Wang, C. Y. and Wang, S. (1997. Semipaametic methods in logistic egession with measuement eo. Statist. Sinica 7, Gaduate Institute of Statistics, National Cental Univesity, Chungli, Taiwan, R.O.C. kfcheng@cc.ncu.edu.tw Depatment of Statistics, National Chengchi Univesity, Taipei, Taiwan, R.O.C. hsueh@nccu.edu.tw (Received Apil 2000; accepted July 2002

Central Coverage Bayes Prediction Intervals for the Generalized Pareto Distribution

Central Coverage Bayes Prediction Intervals for the Generalized Pareto Distribution Statistics Reseach Lettes Vol. Iss., Novembe Cental Coveage Bayes Pediction Intevals fo the Genealized Paeto Distibution Gyan Pakash Depatment of Community Medicine S. N. Medical College, Aga, U. P., India