A unified framework for studying parameter identifiability and estimation in biased sampling designs


Biometrika Advance Access published January 31, 2011
Biometrika (2011). © 2011 Biometrika Trust. Printed in Great Britain. doi: /biomet/asq059

BY HUA YUN CHEN
Division of Epidemiology and Biostatistics, School of Public Health, University of Illinois at Chicago, 1603 West Taylor Street, Chicago, Illinois 60612, U.S.A.
hychen@uic.edu

SUMMARY

Based on the odds ratio representation of a joint density, we propose a unified framework to study parameter identifiability in biased sampling designs. It is shown that most of these designs encountered in practice can be reformulated within the proposed framework and, as a result, the question of parameter identifiability can be largely clarified. Estimation of the identifiable parameters is considered and traditional results on the equivalence of the prospective and retrospective likelihoods are extended. Information contained in data on certain identifiable parameters is often very limited. Such parameters can be poorly estimated by the likelihood approach with practically attainable sample sizes, which can substantially affect the estimates of parameters of primary interest. A partially penalized likelihood approach is proposed to address this. Simulation results suggest that the proposed approach performs well.

Some key words: Case-control design; Matched case-control design; Outcome-dependent sampling design; Profile likelihood; Weak identifiability.

1. INTRODUCTION

Case-control and matched case-control designs have been extensively used in epidemiological studies. One key feature of these designs is that, when the outcome is modelled by logistic regression, parameter identifiability and estimation can be studied relatively easily (Anderson, 1972; Prentice & Pyke, 1979; Breslow & Day, 1980; Breslow, 1996; Rabinowitz, 1997).

The results on parameter identifiability and estimation have been extended to more general biased sampling designs (Weinberg & Wacholder, 1993; Scott & Wild, 1997; Chen, 2003, 2007). However, such designs can have many variants, and the models for data analysis may need to accommodate structures beyond traditional logistic regression. One such example is the study of gene-environment interaction using the case-control design under the assumptions of gene-environment independence and/or the Hardy-Weinberg equilibrium in the general population. The question of parameter identifiability in such a problem is more involved and the traditional prospective analysis of the outcome-dependent sampling design can be inefficient (Umbach & Weinberg, 1997; Chatterjee & Carroll, 2005). The important issues of parameter identifiability and estimation in complex statistical models fitted to biased sampling designs have not been systematically addressed.

Odds ratio parameters play an important role in studying biased sampling designs (Breslow, 1976, 1981; Liang, 1985; Liang & Qin, 2000; Satten & Carroll, 2000; Chen, 2003, 2007; Osius, 2005). Chen (2003, 2004, 2007) proposed odds ratio decompositions of a conditional density and a bivariate joint density for studying parameter identifiability and estimation. However, the results obtained there do not apply in general. In this article, we propose a unified framework based on the odds ratio representation of a joint density (Chen, 2010) to study parameter identifiability, estimation and inference in fitting complex models to data from biased sampling designs.

2. PARAMETER IDENTIFIABILITY IN GENERAL BIASED SAMPLING PROBLEMS

2.1. Odds ratio representation of a joint density

Let the density of $Y = (Y_1, \ldots, Y_q)$ given $X$ be $f(y \mid x)$. Let $p_j(y_j \mid x)$ be the marginal density of $Y_j$ given $X$. Assume the positivity condition holds, i.e., if $p_j(y_j \mid x) > 0$ for $j = 1, \ldots, q$, then $f(y_1, \ldots, y_q \mid x) > 0$. This assumption is satisfied by almost all applications in practice. Let $(y_{10}, \ldots, y_{q0}, x_0)$ be a fixed point in the sample space. Let $y_{-j} = (y_l, l \neq j)$. Define the conditional odds ratio function between $y_j$ and $y_{-j}$ given $x$ as

$$\eta_j\{(y_j, y_{j0}); (y_{-j}, y_{-j0}) \mid x\} = \frac{f(y_j, y_{-j} \mid x)\, f(y_{j0}, y_{-j0} \mid x)}{f(y_{j0}, y_{-j} \mid x)\, f(y_j, y_{-j0} \mid x)}.$$

In the following, we use $\eta_j(y_j; y_{-j} \mid x)$ to denote $\eta_j\{(y_j, y_{j0}); (y_{-j}, y_{-j0}) \mid x\}$. For $q = 2$, Chen (2007) derived an odds ratio representation of $f(y_1, y_2 \mid x)$ as

$$f(y_1, y_2 \mid x) = \frac{\eta_2(y_2; y_1 \mid x)\, f_1(y_1 \mid y_{20}, x)\, f_2(y_2 \mid y_{10}, x)}{\int \eta_2(y_2; y_1 \mid x)\, f_1(y_1 \mid y_{20}, x)\, f_2(y_2 \mid y_{10}, x)\, dy_1\, dy_2}, \qquad (1)$$

where $f_1$ and $f_2$ are, respectively, the conditional densities for $Y_1$ given $Y_2$ and $X$, and for $Y_2$ given $Y_1$ and $X$. One important property of the odds ratio representation is that the three components in the expression, $\eta_2(y_2; y_1 \mid x)$, $f_1(y_1 \mid y_{20}, x)$ and $f_2(y_2 \mid y_{10}, x)$, are uniquely identifiable, and are also variation independent if no additional constraint is imposed on $f(y_1, y_2 \mid x)$.

Following Chen (2010), this representation can be extended to the case with $q > 2$. Let $Y_q$ and $(Y_{q-1}, \ldots, Y_1)$ be the two sets of variables in applying (1). The three components of the decomposition are $\eta_q\{y_q; (y_{q-1}, \ldots, y_1) \mid x\}$, $f_q(y_q \mid y_{(q-1)0}, \ldots, y_{10}, x)$ and $p(y_{q-1}, \ldots, y_1 \mid y_{q0}, x)$.
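The representation (1) can be checked numerically in the discrete case. The sketch below is illustrative only: the 3 x 2 probability table and the choice of reference point are hypothetical, and there is no covariate $X$. It computes the odds ratio function and the two conditional distributions at the reference values, then rebuilds the joint distribution from those three components.

```python
from itertools import product

# Hypothetical 3 x 2 joint pmf f(y1, y2); rows index y1, columns index y2.
f = [[0.10, 0.20],
     [0.05, 0.25],
     [0.30, 0.10]]
Y10, Y20 = 0, 0  # reference point (y10, y20), chosen arbitrarily


def odds_ratio(f, y1, y2, y10=Y10, y20=Y20):
    """eta_2(y2; y1) = f(y1, y2) f(y10, y20) / {f(y10, y2) f(y1, y20)}."""
    return f[y1][y2] * f[y10][y20] / (f[y10][y2] * f[y1][y20])


def cond_y1_given_y20(f, y1, y20=Y20):
    """f_1(y1 | y20): conditional pmf of Y1 at the reference value of Y2."""
    return f[y1][y20] / sum(row[y20] for row in f)


def cond_y2_given_y10(f, y2, y10=Y10):
    """f_2(y2 | y10): conditional pmf of Y2 at the reference value of Y1."""
    return f[y10][y2] / sum(f[y10])


def rebuild(f):
    """Rebuild the joint pmf from the three components, as in (1)."""
    n1, n2 = len(f), len(f[0])
    unnorm = [[odds_ratio(f, a, b)
               * cond_y1_given_y20(f, a)
               * cond_y2_given_y10(f, b)
               for b in range(n2)] for a in range(n1)]
    total = sum(sum(row) for row in unnorm)  # discrete analogue of the integral
    return [[v / total for v in row] for row in unnorm]


rebuilt = rebuild(f)
max_abs_error = max(abs(rebuilt[a][b] - f[a][b])
                    for a, b in product(range(3), range(2)))
print(max_abs_error)
```

The rebuilt table agrees with the original up to floating-point error, which is the discrete version of the statement that the three components determine the joint density.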
By applying a similar representation to $p(y_{q-1}, \ldots, y_1 \mid y_{q0}, x)$ and repeating the process, a representation of the conditional joint density, $f(y_1, \ldots, y_q \mid x)$, can be obtained as

$$\frac{\prod_{j=2}^{q} \eta_j\{y_j; (y_{j-1}, \ldots, y_1) \mid y_{q0}, \ldots, y_{(j+1)0}, x\} \prod_{j=1}^{q} f_j(y_j \mid y_{-j0}, x)}{\int \prod_{j=2}^{q} \eta_j\{y_j; (y_{j-1}, \ldots, y_1) \mid y_{q0}, \ldots, y_{(j+1)0}, x\} \prod_{j=1}^{q} \{f_j(y_j \mid y_{-j0}, x)\, dy_j\}}. \qquad (2)$$

The following proposition states the identifiability of the components in the odds ratio representation.

PROPOSITION 1. The components in the representation, $f(y_j \mid y_{-j0}, x)$ $(j = 1, \ldots, q)$, and $\eta_j\{y_j; (y_{j-1}, \ldots, y_1) \mid y_{q0}, \ldots, y_{(j+1)0}, x\}$ $(j = 2, \ldots, q)$, are all uniquely identifiable.

Proof. The result follows from the identifiability result for $q = 2$ (Chen, 2007) and from the inductive argument used to derive the general representation.

2.2. Parameter identifiability in general biased sampling designs

Most biased sampling designs can be viewed as a selective sample from the general population. Let $S$ be the indicator of selection into the biased sample. Consider the biased sampling design having sampling probability

$$\mathrm{pr}(S = 1 \mid Y_1, \ldots, Y_q, X) = \pi(Y, X) \propto \prod_{j=1}^{q} \psi_j(Y_j, X), \qquad (3)$$

where $\psi_j$ $(j = 1, \ldots, q)$ are either known or unknown. The design can be viewed as a multistage biased sampling design. For example, in the first stage, the sampling probability depends on $(Y_1, X)$. For those selected in the first stage, the second stage of selection depends on $(Y_2, X)$, and so on. The design is very flexible and includes as special cases most outcome-dependent sampling designs, such as the case-control and matched case-control sampling designs. Under this design, $d\,\mathrm{pr}(y_1, \ldots, y_q \mid x, S = 1)$ can be expressed as

$$\frac{\prod_{j=2}^{q} \eta_j\{y_j; (y_{j-1}, \ldots, y_1) \mid y_{q0}, \ldots, y_{(j+1)0}, x\} \prod_{j=1}^{q} dG_j(y_j \mid x)}{\int \prod_{j=2}^{q} \eta_j\{y_j; (y_{j-1}, \ldots, y_1) \mid y_{q0}, \ldots, y_{(j+1)0}, x\} \prod_{j=1}^{q} dG_j(y_j \mid x)},$$

where $dG_j(y_j \mid x)$ denotes

$$d\,\mathrm{pr}(y_j \mid y_{-j0}, x, S = 1) = \frac{\psi_j(y_j, x)\, f_j(y_j \mid y_{-j0}, x)\, dy_j}{\int \psi_j(y_j, x)\, f_j(y_j \mid y_{-j0}, x)\, dy_j} \quad (j = 1, \ldots, q).$$

It can be seen that $f(y_1, \ldots, y_q \mid x)$ and $p(y_1, \ldots, y_q \mid x, S = 1)$ share the odds ratio functions, which means the odds ratio functions are always identifiable from the biased sampling design. Proposition 2 states the identifiability results and its proof is given in the Supplementary Material.

PROPOSITION 2. Under the biased sampling design (3): (i) the odds ratio functions in (2) are always identifiable; (ii) if $\psi_j$ is not parametrically modelled and is unknown, then $f(y_j \mid y_{-j0}, x)$ is not identifiable; and (iii) if $\psi_j$ is known and $\psi_j > 0$ for all $(y_j, x)$, then $f(y_j \mid y_{-j0}, x)$ is identifiable.

A parameter is identifiable from the biased sampling design (3) if it is identifiable from the union of the odds ratio functions and the identifiable $f(y_j \mid y_{-j0}, x)$ $(j = 1, \ldots, q)$.

2.3. Applications to specific biased sampling problems

Proposition 2 can be applied to different biased sampling designs to determine the identifiability of parameters. The following examples illustrate the applications.

Example 1. Let $W$, $Z$ and $X$ be, respectively, the true outcome, the risk factor and the confounder under control, i.e., the matched variable in a matched case-control design. Let $U$ denote the observed outcome subject to misclassification. Assume that the sampling probability, $\mathrm{pr}(S = 1 \mid U, W, X, Z) = \pi(U, X)$, is of unknown functional form. The observed data consist of $(U, X, Z, S = 1)$. Let the misclassification probabilities be $p_0 = \mathrm{pr}(U = 1 \mid W = 0, X, Z)$ and $p_1 = \mathrm{pr}(U = 0 \mid W = 1, X, Z)$. Assume for brevity that $p_0$ and $p_1$ are independent of $(X, Z)$. Assume a linear logistic regression model for the true outcome (Copeland et al., 1977), i.e.,

$$\mathrm{pr}(W = 1 \mid X, Z) = \frac{\exp(\beta_0 + \beta_1 Z + \beta_2 X)}{1 + \exp(\beta_0 + \beta_1 Z + \beta_2 X)}.$$
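Under these assumptions, the law of total probability gives $\mathrm{pr}(U = 1 \mid X, Z) = p_0\{1 - \mathrm{pr}(W = 1 \mid X, Z)\} + (1 - p_1)\,\mathrm{pr}(W = 1 \mid X, Z)$. The following minimal sketch computes this probability and the odds ratio between $U$ and $Z$ relative to a reference value $Z_0$. All function names and parameter values are hypothetical and chosen only for illustration.

```python
import math


def pr_true_outcome(z, x, b0, b1, b2):
    """Logistic model pr(W = 1 | X = x, Z = z)."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * z + b2 * x)))


def pr_observed_outcome(z, x, b0, b1, b2, p0, p1):
    """pr(U = 1 | X, Z) by total probability over the true outcome W."""
    pw = pr_true_outcome(z, x, b0, b1, b2)
    return p0 * (1.0 - pw) + (1.0 - p1) * pw


def odds_ratio_u_z(z, z0, x, b0, b1, b2, p0, p1):
    """Odds of U = 1 at Z = z divided by the odds of U = 1 at Z = z0."""
    def odds(zz):
        pu = pr_observed_outcome(zz, x, b0, b1, b2, p0, p1)
        return pu / (1.0 - pu)
    return odds(z) / odds(z0)


# Hypothetical parameter values.
or_with_effect = odds_ratio_u_z(0.7, 0.0, 1.0, -2.0, 1.0, 0.5, 0.05, 0.2)
# With b1 = 0 the odds ratio equals one for any (b0, b2, p0, p1).
or_null_a = odds_ratio_u_z(0.7, 0.0, 1.0, -2.0, 0.0, 0.5, 0.05, 0.2)
or_null_b = odds_ratio_u_z(0.7, 0.0, 1.0, 1.5, 0.0, -0.3, 0.20, 0.1)
print(or_with_effect, or_null_a, or_null_b)
```

When the coefficient of $Z$ is zero, the odds ratio is identically one whatever the values of the remaining parameters, so the odds ratio function then carries no information about them.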

The model for the misclassified outcome becomes

$$\mathrm{pr}(U = 1 \mid X, Z) = \frac{p_0 + (1 - p_1)\exp(\beta_0 + \beta_1 Z + \beta_2 X)}{1 + \exp(\beta_0 + \beta_1 Z + \beta_2 X)}.$$

This can be fitted into the proposed framework by setting $Y_2 = U$ and $Y_1 = Z$. The conditional odds ratio function, $\eta(U = 1; Z \mid X)$, is

$$\frac{\{p_0 + (1 - p_1)\exp(\beta_0 + \beta_1 Z + \beta_2 X)\}\{1 - p_0 + p_1\exp(\beta_0 + \beta_1 Z_0 + \beta_2 X)\}}{\{p_0 + (1 - p_1)\exp(\beta_0 + \beta_1 Z_0 + \beta_2 X)\}\{1 - p_0 + p_1\exp(\beta_0 + \beta_1 Z + \beta_2 X)\}},$$

where $U_0 = 0$. When $\beta_1 = 0$, none of $(p_0, p_1, \beta_0, \beta_2)$ is identifiable from the sampling design. Under the assumptions that $\beta_1 \neq 0$, that $(p_0, p_1)$ are known and that $X$ is absent, parameters $(\beta_0, \beta_1)$ are identifiable if $Z$ takes at least three distinct values. When $X$ is present, parameters $(\beta_0, \beta_1, \beta_2)$ are identifiable when $X$ takes at least two values, $Z$ takes at least three values and the $X$ values and the $Z$ values different from $Z_0$ form at least three distinct $(X, Z)$ points. When neither $p_0$ nor $p_1$ is known, parameters $(p_0, p_1, \beta_0, \beta_1, \beta_2)$ are identifiable when $Z$ takes at least five distinct values and $X$ takes at least two distinct values, and the $X$ values and the $Z$ values different from $Z_0$ form at least five distinct $(X, Z)$ points.

Example 2. In a case-control design or a matched case-control design, the association among covariates themselves in the population may be of interest (Nagelkerke et al., 1995; Lee et al., 1997). Let $W$ be the binary outcome and $U$, $Z$ and $X$ be covariates. Suppose that the relationship of $U$ to $Z$ and $X$ is of interest. The sampling design satisfies $\mathrm{pr}(S = 1 \mid W, U, Z, X) = \pi(W, X)$. This can be fitted into the proposed framework by setting $Y_3 = W$, $Y_2 = U$ and $Y_1 = Z$. The joint density for $(W, U, Z)$ given $X$ under the biased sampling design, $p(W, U, Z \mid X, S = 1)$, can be expressed as

$$\frac{\eta_1\{W; (U, Z) \mid X\}\, \eta_2(U; Z \mid W_0, X)\, dF_1(W \mid X)\, dF_2(U \mid X)\, dF_3(Z \mid X)}{\int \eta_1\{W; (U, Z) \mid X\}\, \eta_2(U; Z \mid W_0, X)\, dF_1(W \mid X)\, dF_2(U \mid X)\, dF_3(Z \mid X)}.$$

Let $g_1(W \mid U, Z, X, \beta)$ and $g_2(U \mid Z, X, \theta)$ be the models for the population with $\eta_1$ and $\eta_2$ being, respectively, the odds ratio functions. If $g_1$ is a linear logistic model and $g_2$ is a normal linear regression model with the regression coefficients $\alpha$ and residual variance $\sigma^2$, then

$$\log \eta_1\{W = 1; (U, Z) \mid X\} = \beta_1(U - U_0) + \beta_2(Z - Z_0),$$

$$\log \eta_2(U; Z \mid W_0, X) = \frac{\alpha_1}{\sigma^2}(Z - Z_0)(U - U_0) + \log\left[\frac{\{1 + \exp(\beta_0 + \beta_1 U_0 + \beta_2 Z + \beta_3 X)\}\{1 + \exp(\beta_0 + \beta_1 U + \beta_2 Z_0 + \beta_3 X)\}}{\{1 + \exp(\beta_0 + \beta_1 U + \beta_2 Z + \beta_3 X)\}\{1 + \exp(\beta_0 + \beta_1 U_0 + \beta_2 Z_0 + \beta_3 X)\}}\right],$$

where $W_0 = 0$. It can be seen from the above expression that $\beta_1$ and $\beta_2$ are always identifiable from $\eta_1$ when $U$ and $Z$ each take at least two distinct values. When either $\beta_1 = 0$ or $\beta_2 = 0$, $\beta_0$ is not identifiable from $\eta_2$. When $\beta_1 \neq 0$ and $\beta_2 \neq 0$, $(\beta_0, \beta_3)$ may be identifiable from $\eta_2$. In the latter case, when $X$ is absent and $\alpha_1 = 0$, $\beta_0$ is identifiable from $\eta_2$. When $X$ is present, $\beta_0$ and $\alpha_1/\sigma^2$ are identifiable only when either $Z$ or $U$ takes at least three distinct values. When $X$ is present and $\alpha_1 \neq 0$, $(\beta_0, \beta_3)$ and $\alpha_1/\sigma^2$ are identifiable when $X$ takes at least two distinct values and either $Z$ or $U$ takes at least three distinct values, and $X$, $Z \neq Z_0$ and $U \neq U_0$ form at least three distinct $(U, X, Z)$ points.

The identifiability results obtained in Chatterjee & Carroll (2005) for studying gene and environment effects under the assumption of gene-environment independence in the general population can be viewed as a special case of this example. By setting the disease status $D = W$, the genetic variants $G = U$, the environment factor $E = Z$ and $g_2(U \mid Z) = g_2(U)$, the odds ratio functions under their model are

$$\log \eta_1\{D = 1; (G, E)\} = m(G, E, \beta_1) - m(G_0, E_0, \beta_1),$$

$$\log \eta_2(G; E \mid D_0 = 0) = \log\left(\frac{[1 + \exp\{\beta_0 + m(G_0, E, \beta_1)\}][1 + \exp\{\beta_0 + m(G, E_0, \beta_1)\}]}{[1 + \exp\{\beta_0 + m(G, E, \beta_1)\}][1 + \exp\{\beta_0 + m(G_0, E_0, \beta_1)\}]}\right),$$

where $m(G, E, \beta_1)$ may take the form $\beta_{11} G + \beta_{12} E$ or the form $\beta_{11} G + \beta_{12} E + \beta_{13} G E$. Both $\beta_{11}$ and $\beta_{12}$ are identifiable from $\eta_1$. When $\beta_{13} \neq 0$, $\beta_0$ is identifiable even if $\beta_{11} = \beta_{12} = 0$. When $\beta_{13} = 0$, $\beta_0$ is also identifiable if neither $\beta_{11}$ nor $\beta_{12}$ is zero.

Example 3. Let $D$ taking 0 or 1 denote the disease status, $G$ denote the genetic factor, $E$ denote the environment factor and $Z$ denote other risk factors. In a case-only sampling design, $\mathrm{pr}(S = 1 \mid G, E, Z, D) = D\pi$ (Piegorsch et al., 1994). If the distribution of $Z$ is unspecified, inference on the gene-environment interaction can be based on the distribution

$$\mathrm{pr}(G, E \mid Z, D = 1, S = 1) = \frac{\eta(G; E \mid Z, D = 1)\, p(G \mid E_0, Z)\, p(E \mid G_0, Z)}{\int \eta(G; E \mid Z, D = 1)\, p(G \mid E_0, Z)\, p(E \mid G_0, Z)\, dG\, dE}.$$

Under the model that

$$\mathrm{pr}(D = 1 \mid G, E, Z) = \frac{\exp(\beta_0 + \beta_1 E + \beta_2 G + \beta_{12} E G + \beta_3 Z)}{1 + \exp(\beta_0 + \beta_1 E + \beta_2 G + \beta_{12} E G + \beta_3 Z)} \qquad (4)$$

and gene-environment independence in the control population, i.e., $\mathrm{pr}(G, E \mid D = 0) = p(G \mid D = 0)\, p(E \mid D = 0)$, $\log \eta(G; E \mid Z, D = 1) = \beta_{12}(E - E_0)(G - G_0)$. Under model (4) and gene-environment independence in the general population, $\log \eta(G; E \mid Z, D = 1)$ reduces to $\beta_{12}(E - E_0)(G - G_0)$ plus

$$\log \frac{\{1 + \exp(\beta_0 + \beta_1 E_0 + \beta_2 G + \beta_{12} E_0 G + \beta_3 Z)\}\{1 + \exp(\beta_0 + \beta_1 E + \beta_2 G_0 + \beta_{12} E G_0 + \beta_3 Z)\}}{\{1 + \exp(\beta_0 + \beta_1 E + \beta_2 G + \beta_{12} E G + \beta_3 Z)\}\{1 + \exp(\beta_0 + \beta_1 E_0 + \beta_2 G_0 + \beta_{12} E_0 G_0 + \beta_3 Z)\}}.$$

In practice, additional assumptions such as gene-environment independence given $Z$ ($\gamma = 0$) and/or the rare disease assumption, which implies that the second term in the foregoing displayed equation is approximately 0, make the estimation of $\beta_{12}$ easier. However, parameters $(\beta_0, \beta_1, \beta_2, \beta_{12}, \beta_3, \gamma)$ can become identifiable from the odds ratio model when $\beta_{12} \neq 0$ and $E$, $G$ and $Z$ take many distinct values even without the foregoing additional assumptions.

One feature of these examples is that the identifiability of some parameters requires other parameters not taking certain values in the parameter space. For example, in Example 1, when $\beta_1 = 0$, other parameters become unidentifiable. Such identifiability is termed weak identifiability in this paper. In contrast, in the linear logistic regression under a case-control design, identifiable parameters are not subject to the problem. Such identifiability is termed strong identifiability in this paper. The problem with the weak identifiability in Example 1 is that when $\beta_1$ is close to zero, information in the data for distinguishing different values of $\beta_0$ can be very limited, which may require extremely large sample sizes to estimate the parameter with reasonable accuracy. The poorly estimated $\beta_0$ can substantially affect the estimation of $\beta_1$, which is usually the parameter of primary interest. We tackle this difficult problem in §4 based on a modification of one of the equivalent likelihoods discussed in the next section.

3. EQUIVALENT LIKELIHOODS FOR PARAMETER ESTIMATION AND INFERENCE

3.1. Biased sampling designs of the case-control type

In a case-control design, $X$ is absent. Let $Y_q$ denote the outcome and let $Y_{q-1}, \ldots, Y_1$ denote the covariates. The sampling probability satisfies $\pi(Y_1, \ldots, Y_q) = \psi_q(Y_q)$. The marginal distribution for $Y_q$ is fixed by the sampling design. That is, $p(y_q \mid S = 1)$ is known. The joint density for the sample is $p(y_{q-1}, \ldots, y_1 \mid y_q)\, p(y_q \mid S = 1)$, which can be rewritten as $p(y_1, \ldots, y_q \mid S = 1)$ with known $p(y_q \mid S = 1)$. Let $(Y_1^i, \ldots, Y_q^i, S^i = 1)$ $(i = 1, \ldots, n)$ be the observed data. It is easy to see that the maximizer of the retrospective likelihood $\prod_{i=1}^{n} p(Y_{q-1}^i, \ldots, Y_1^i \mid Y_q^i)$ is the same as the maximizer of the joint likelihood $\prod_{i=1}^{n} p(Y_1^i, \ldots, Y_q^i \mid S^i = 1)$ subject to the constraints $p(Y_q = y_{kq} \mid S = 1) = n_{kq}/n$, for $k = 1, 0$, corresponding to cases and controls. Under the general sampling design (3), we maximize the joint likelihood

$$\prod_{i=1}^{n} \frac{\prod_{j=2}^{q} \eta_j\{Y_j^i; (Y_{j-1}^i, \ldots, Y_1^i) \mid y_{q0}, \ldots, y_{(j+1)0}\} \prod_{j=1}^{q} dG_j(Y_j^i)}{\int \prod_{j=2}^{q} \eta_j\{y_j; (y_{j-1}, \ldots, y_1) \mid y_{q0}, \ldots, y_{(j+1)0}\} \prod_{j=1}^{q} dG_j(y_j)}$$

subject to the constraints

$$\frac{\left[\int \prod_{j=2}^{q} \eta_j\{y_j; (y_{j-1}, \ldots, y_1) \mid y_{q0}, \ldots, y_{(j+1)0}\} \prod_{j=1}^{q-1} \{dG_j(y_j)\}\right]_{y_q = y_{kq}} dG_q(y_{kq})}{\int \prod_{j=2}^{q} \eta_j\{y_j; (y_{j-1}, \ldots, y_1) \mid y_{q0}, \ldots, y_{(j+1)0}\} \prod_{j=1}^{q} dG_j(y_j)} = \frac{n_{kq}}{n} \qquad (5)$$

for $k = 1, \ldots, N_q$, where $y_{1q}, \ldots, y_{N_q q}$ are all the distinct values $Y_q$ takes and $n_{1q}, \ldots, n_{N_q q}$ are the corresponding frequencies. We term this sampling design the case-control type.
Imposing the constraints (5) does not affect the identifiability results obtained in the previous section, because $p(y_q \mid S = 1)$ was assumed known in the parameter identifiability problem.

PROPOSITION 3. Assume that $G_q(y_q)$ and $G_k(y_k)$, $k \neq q$, are unconstrained and variation independent of each other and of

$$\prod_{j=2}^{q} \eta_j\{y_j; (y_{j-1}, \ldots, y_1) \mid y_{q0}, \ldots, y_{(j+1)0}\} \prod_{j \neq k, q} dG_j(y_j).$$

The profile likelihoods for parameters other than $G_k(y_k)$ and $G_q(y_q)$ under sampling design (3) based on the joint likelihood with or without constraints are identical. Furthermore, the unconstrained joint likelihood with $G_k(y_k)$ profiled out is the same as the conditional likelihood based on $Y_k$ given $Y_{-k}$.

Remark 1. The condition on $G_k(y_k)$ required by Proposition 3 is satisfied if either of the following conditions holds for the fixed $k$: (i) the function $\psi_k(y_k)$ is not constrained and is unknown; or (ii) the function $\psi_k(y_k)$ is known and $f_k(y_k \mid y_{-k0})$ is not constrained, and is variation independent of $G_q(y_q)$ and $\prod_{j=2}^{q} \eta_j\{y_j; (y_{j-1}, \ldots, y_1) \mid y_{q0}, \ldots, y_{(j+1)0}\} \prod_{j \neq k, q} dG_j(y_j)$. In a typical case-control design, $\psi_q(y_q)$ is unknown and $\psi_k(y_k)$ for $k \neq q$ is known. This means that the second condition is satisfied in the application to the case-control design.

The proof is given in the Appendix. Proposition 3 extends the classical equivalence result on the prospective and retrospective analyses of the case-control design to allow for more complex models. For example, in the case-control study of gene-environment interaction, if the distribution of $E$ is not parametrically modelled, Proposition 3 states that inferences based on the unconstrained conditional likelihood from $p(D, G \mid E, S = 1)$ for the parameters other than $p(E \mid D = 0, G_0)$ in $p(G, E \mid D)$ are exactly the same as those based on the retrospective likelihood from $p(G, E \mid D)$. This result is sharper than the result obtained in Chatterjee & Carroll (2005) when no additional information on the disease prevalence is available. Their result stated that the prospective likelihood from $p(D, G \mid E, S = 1)$ can be used for solving the likelihood estimator of the regression parameters, but cannot be used for inference. In contrast, our result implies that $p(D, G \mid E, S = 1)$ can be used as an ordinary likelihood for both estimation and inference on parameters in $p(G, E \mid D)$ other than $p(E \mid D = 0, G_0)$. Such parameters include $(\beta_0, \beta_1, \theta)$ in Chatterjee & Carroll (2005), but do not include $\kappa$ in their $\eta$. This result is also in agreement with those in Scott & Wild (1997) for multiplicative models.

3.2. Biased sampling designs of the matched case-control type

In a biased sampling design of the matched case-control type, $p(y_q, x \mid S = 1)$ may be considered fixed by the sampling design. Assume that the observed data are grouped by strata defined by the values $X$ takes. Within stratum $k$, the joint density of the observed data is

$$\prod_{i=1}^{m_k} \frac{\prod_{j=2}^{q} \eta_j\{Y_j^{ik}; (Y_{j-1}^{ik}, \ldots, Y_1^{ik}) \mid y_{q0}, \ldots, y_{(j+1)0}, X_k\} \prod_{j=1}^{q} dG_j(Y_j^{ik} \mid X_k)}{\int \prod_{j=2}^{q} \eta_j\{y_j; (y_{j-1}, \ldots, y_1) \mid y_{q0}, \ldots, y_{(j+1)0}, X_k\} \prod_{j=1}^{q} dG_j(y_j \mid X_k)},$$

where $(Y_q^{ik}, \ldots, Y_1^{ik}, X_k)$ $(i = 1, \ldots, m_k)$ are data from the stratum. If for a fixed $j$, $G_j(y_j \mid x)$ is unconstrained and is variation independent of other components in the numerator of the expression, then $G_j(y_j \mid x)$ can be eliminated from the conditional likelihood based on the permutation of $Y_j^{ik}$ $(i = 1, \ldots, m_k)$. If several such $G_j$ exist, they can all be eliminated from the conditional likelihood based on a combination of all the permutations on each of the observed $y_j$ within a stratum. The conditional logistic analysis of a matched case-control design is a special case of the permutation approach. This generalization is useful for analyzing genetic studies with a matched case-control design either when the gene-environment distribution is modelled or when correlated family members are modelled. Such applications will be discussed elsewhere.
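The likelihood constructions in this section rest on Proposition 2(i): selection of the product form (3) changes the conditional distributions at the reference point but leaves the odds ratio functions unchanged. The sketch below illustrates this for a discrete bivariate distribution. The probability table and the selection weights `psi1` and `psi2` are hypothetical values chosen only for the illustration.

```python
# Hypothetical population pmf f(y1, y2); rows index y1, columns index y2.
f = [[0.10, 0.20],
     [0.05, 0.25],
     [0.30, 0.10]]
psi1 = [0.9, 0.2, 0.5]   # hypothetical stage-one selection weights psi_1(y1)
psi2 = [0.05, 0.8]       # hypothetical stage-two selection weights psi_2(y2)


def odds_ratios(p, a0=0, b0=0):
    """Odds ratio function of a pmf table relative to reference cell (a0, b0)."""
    return [[p[a][b] * p[a0][b0] / (p[a0][b] * p[a][b0])
             for b in range(len(p[0]))] for a in range(len(p))]


# Distribution in the selected sample: proportional to f * psi1 * psi2.
weighted = [[f[a][b] * psi1[a] * psi2[b] for b in range(2)] for a in range(3)]
total = sum(sum(row) for row in weighted)
biased = [[w / total for w in row] for row in weighted]

or_population = odds_ratios(f)
or_sample = odds_ratios(biased)
max_or_gap = max(abs(or_population[a][b] - or_sample[a][b])
                 for a in range(3) for b in range(2))
max_pmf_gap = max(abs(f[a][b] - biased[a][b])
                  for a in range(3) for b in range(2))
print(max_or_gap, max_pmf_gap)
```

The two odds ratio tables coincide up to rounding error even though the sampled distribution differs markedly from the population distribution.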

4. THE PARTIALLY PENALIZED LIKELIHOOD APPROACH

4.1. Partially penalized likelihood for weakly identifiable parameters

When weakly identifiable parameters are involved, the maximum likelihood estimator can have very poor performance under practically attainable sample sizes. We propose the following partially penalized likelihood to tackle this problem. Let

$$l(\beta_0, \beta_1, \Lambda) = \sum_{i=1}^{n} l_i(\beta_0, \beta_1 \mid Y_i, X_i) - \frac{1}{2}(\beta_0 - m)^{t} \Lambda (\beta_0 - m),$$

where $l_i(\beta_0, \beta_1 \mid Y_i, X_i) = \log p(Y_{i1}, \ldots, Y_{iq} \mid X_i, S_i = 1, \beta_0, \beta_1)$ or an equivalent loglikelihood, both $\beta_0$ and $\beta_1$ can be vectors, $\Lambda$ is a diagonal matrix with positive diagonal elements, and $m$ is a vector of plausible values for $\beta_0$. Let $(\beta_0^*, \beta_1^*)$ be the true parameter values for $(\beta_0, \beta_1)$ and let $\beta_0$ be a scalar. Let

$$\Omega = E\left\{\frac{\partial l_i(\beta_0^*, \beta_1^*)}{\partial(\beta_0, \beta_1)}\right\}^{\otimes 2} = \begin{pmatrix} \omega_0 & \Omega_{01} \\ \Omega_{10} & \Omega_1 \end{pmatrix}.$$

The maximum likelihood estimator for $\beta_1$ has asymptotic variance $(\Omega_1 - \Omega_{10}\,\omega_0^{-1}\,\Omega_{01})^{-1}$. The information on $\beta_0$, $\omega_0$, is usually very small when $\beta_0$ is weakly identifiable. When $\Omega_{10}\Omega_{01}$ is large relative to $\omega_0$, the influence of the estimator of $\beta_0$ on the variation of the estimator of $\beta_1$ can be substantial in the maximum likelihood approach. Let $n^{1/2}(m_n - \beta_0^*) \to h$ and $\lambda_n/n \to \sigma$, where $m_n$ and $\lambda_n$ are, respectively, the guessed $\beta_0$ and the penalty parameter. The proposed partially penalized approach downweights this influence. The variance for the penalized likelihood estimator for $\beta_1$ is

$$V(\hat\beta_1) = \left(\Omega_1 - \frac{\Omega_{10}\Omega_{01}}{\omega_0 + \sigma}\right)^{-1} \left\{\Omega_1 - \frac{2\sigma + \omega_0}{(\sigma + \omega_0)^2}\,\Omega_{10}\Omega_{01}\right\} \left(\Omega_1 - \frac{\Omega_{10}\Omega_{01}}{\omega_0 + \sigma}\right)^{-1}.$$

This variance is smaller than that of the maximum likelihood estimator. As the penalized likelihood estimator can have bias, the mean square error is a better measure. The following proposition gives the optimal result. A proof is given in the Supplementary Material.

PROPOSITION 4. Let $\beta_0$ be a scalar parameter. Under the regularity conditions for the maximum likelihood estimator to be consistent and asymptotically normal, the penalized maximum likelihood estimator for $\beta_1$ has a minimum mean square error matrix, $\{\Omega_1 - \Omega_{10}\Omega_{01}/(\omega_0 + 1/h^2)\}^{-1}$, which is attained when $\sigma = h^{-2}$.

Proposition 4 can be used as a guideline for choosing the penalty parameter $\lambda_n$, because $\lambda_n \approx n\sigma = n h^{-2} \approx (m_n - \beta_0^*)^{-2}$. The optimal choice of $\lambda_n$ depends on the distance between the guessed value for $\beta_0$ and the true $\beta_0$. However, in practice, we do not know the true $\beta_0$. If the probable range of the true value for $\beta_0$ is known, we may take $\lambda_n$ as the reciprocal of the squared range. In cases where information on the range is not available, we may base the choice of $\lambda_n$ on the notion of controlling the influence of the $\beta_0$ estimator on the variance of the $\beta_1$ estimator. We can thus choose a suboptimal $\lambda_n$, as small as possible, such that the variance for the $\beta_1$ estimator has little change relative to the change in $\lambda_n$. This criterion can be implemented relatively easily.
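The trade-off behind Proposition 4 can be explored numerically when $\beta_1$ is also a scalar. The sketch below evaluates the asymptotic mean square error as a function of $\sigma$. The variance term follows the formula for $V(\hat\beta_1)$; the squared-bias term is not displayed in the text and is an assumption here, obtained from a first-order expansion of the penalized score. The information values and $h$ are hypothetical.

```python
def mse_beta1(sigma, omega0, omega01, omega1, h):
    """Asymptotic mean square error of the penalized estimator of a scalar beta_1.

    The squared-bias term is an assumption of this sketch (first-order
    expansion of the penalized score); the variance term follows V(beta_1-hat).
    """
    c = omega01 ** 2                  # Omega_10 * Omega_01 in the scalar case
    a = omega0 + sigma
    s = omega1 - c / a
    variance = (omega1 - c * (2.0 * sigma + omega0) / a ** 2) / s ** 2
    bias_sq = c * (sigma * h) ** 2 / (a * s) ** 2
    return variance + bias_sq


# Hypothetical information values with weak information on beta_0.
omega0, omega01, omega1, h = 0.05, 0.2, 1.0, 2.0

grid = [k / 100.0 for k in range(0, 201)]         # sigma from 0 to 2
best_sigma = min(grid, key=lambda s: mse_beta1(s, omega0, omega01, omega1, h))
best_mse = mse_beta1(best_sigma, omega0, omega01, omega1, h)
mle_variance = mse_beta1(0.0, omega0, omega01, omega1, h)   # sigma = 0: no penalty
print(best_sigma, best_mse, mle_variance)
```

In this scalar setting the grid minimizer can be compared with $h^{-2}$ and the minimum with $\{\Omega_1 - \Omega_{10}\Omega_{01}/(\omega_0 + 1/h^2)\}^{-1}$; setting $\sigma = 0$ recovers the variance of the unpenalized maximum likelihood estimator.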

Table 1. Simulation results on outcome misclassification in the case-control design with misclassification probabilities 0.1r for cases and 0.2 for controls

Column groups: $\beta_1 = 1$ ($r = 10\%$) and $\beta_1 = 1$ ($r = 0.1\%$), each with columns Bias, Evar, Mvar.
Rows: Logist; MLE; RMLE (L); PMLE (L); RMLE (U); PMLE (U).
[Numerical entries are not available in this copy.]

Bias, estimated − truth; Evar, empirical estimate of the variance based on the parameter estimates; Mvar, average of the estimated variance; Logist, logistic regression applied directly to the case-control data; MLE, maximum likelihood estimator; RMLE, maximum likelihood estimator with $\beta_0$ fixed at 0.8 (L) or 1.2 (U) times the true value; PMLE, penalized maximum likelihood estimator with $m$ at 0.8 (L) or 1.2 (U) times the true value and $\lambda$ taking the optimal value.

4.2. Applications of the partially penalized likelihood approach

We now apply the partially penalized likelihood approach to the outcome misclassification in a case-control design and to the estimation of genetic and environment effects in a case-control design with gene-environment independence in the general population. We demonstrate the applications using simulation.

For the outcome misclassification, the simulation model is the same as Example 1 except that $X$ is absent. The covariate $Z$ follows the uniform distribution on $[-1, 1]$. The disease risk ratio is fixed at $\beta_1 = 1$. To evaluate the impact of the different disease prevalence rates on the performance of the methods, we set $\beta_0 = -2.2$ or $-6.9$, which roughly corresponds to disease prevalence rates of $r = 10\%$ or $0.1\%$ at the exposure level of $z = 0$. The misclassification rates are set to $q_0 = 10\%\,r$, which roughly corresponds to 10% of the cases in the case-control sample being false positive, and $q_1 = 20\%$, which corresponds to 20% of the disease cases in the population not diagnosed. Each simulated dataset has 500 cases and 500 controls. The optimal value is set for $\lambda$.
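One simulated dataset under this design could be generated along the following lines. This is a minimal sketch, not the code used for the reported results: the text does not say how the data were drawn, so the draw-until-quota sampling scheme, the function name, and the seed are all assumptions made for illustration.

```python
import math
import random


def simulate_case_control(beta0, beta1, q0, q1, n_cases, n_controls, rng):
    """Draw subjects from the population until both sampling quotas are met.

    Selection depends only on the observed (misclassified) outcome U.
    Returns the exposure values z of the sampled cases and controls.
    """
    cases, controls = [], []
    while len(cases) < n_cases or len(controls) < n_controls:
        z = rng.uniform(-1.0, 1.0)                       # Z ~ Uniform[-1, 1]
        p_true = 1.0 / (1.0 + math.exp(-(beta0 + beta1 * z)))
        w = rng.random() < p_true                        # true outcome W
        flip = rng.random() < (q1 if w else q0)          # misclassification event
        u = w != flip                                    # observed outcome U
        if u and len(cases) < n_cases:
            cases.append(z)
        elif not u and len(controls) < n_controls:
            controls.append(z)
    return cases, controls


rng = random.Random(2011)   # seed chosen arbitrarily for reproducibility
r = 0.10                    # prevalence setting corresponding to beta0 = -2.2
cases, controls = simulate_case_control(-2.2, 1.0, 0.1 * r, 0.2, 500, 500, rng)
mean_z_cases = sum(cases) / len(cases)
mean_z_controls = sum(controls) / len(controls)
print(len(cases), len(controls), round(mean_z_cases, 3), round(mean_z_controls, 3))
```

Because $\beta_1 > 0$, the sampled cases tend to have larger exposure values than the sampled controls, although false positives among the cases dilute the contrast.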
In the analysis, the misclassification probabilities were assumed known and the guessed $\beta_0$ was set to 0.8 times or 1.2 times the true $\beta_0$. The simulation results based on 1000 replicates are listed in Table 1. It can be seen from the table that the maximum likelihood estimator of $\beta_1$ is subject to large bias and that the variance estimates based on the inverse of the information matrix poorly estimate the actual variance. The partially penalized maximum likelihood estimator of $\beta_1$ has a small variance that can be well estimated. There is relatively slight improvement of the maximum penalized likelihood estimator over the restricted maximum likelihood estimator, which may be due to the limited information in the data on $\beta_0$.

For the study of gene-environment effects, we simulated a special case of Example 2. In the simulation, we set $G$ to a binary variable with $\mathrm{pr}(G = 1) = 0.1$. The environment variable is uniformly distributed on $[0, 2]$. The intercept $\beta_0$ is set to $-2.2$ or $-6.9$. The relative risk parameters are set to $(0.5, 0.5, 1)$. Each simulated dataset has 500 cases and 500 controls. We computed the traditional prospective likelihood estimator based on $\mathrm{pr}(D \mid G, E, S = 1)$ and the penalized likelihood estimator with the guessed $\beta_0$ either at 0.8 times or 1.2 times the true $\beta_0$. The penalty $\lambda$ is set to the optimal values. Owing to the difficulty in computing the maximum likelihood estimator under the gene-environment independence assumption, we only computed it for the case $\beta_0 = -2.2$ using the penalized approach with $\lambda$ set to a very small value, 0.001. Simulation results based on 1000 replicates are listed in Table 2. From the table, we see that the partially penalized approach can substantially reduce the variance of the $\beta_1$ estimators at the cost of a low level of bias when compared with the traditional prospective analysis and the maximum likelihood estimator. The restricted maximum likelihood can also perform well in comparison with the penalized likelihood approach when the guessed $\beta_0$ is far away from the truth. This is likely due to the small amount of information on $\beta_0$ contained in the data. When the guessed $\beta_0$ is close to the truth, such as the case of $\beta_0 = -2.2$, the partially penalized approach has a noticeable improvement over the restricted maximum likelihood approach.

Table 2. Simulation results on gene-environment effects in the case-control design with gene-environment independence in the general population

Column groups: Gene (0.5), Environment (0.5) and Interaction (1.0), each with columns Bias, Evar, Mvar.
Rows, prevalence rate 10% ($\beta_0 = -2.2$): Logist; MLE; RMLE (L); PMLE (L); RMLE (U); PMLE (U).
Rows, prevalence rate 0.1% ($\beta_0 = -6.9$): Logist; RMLE (L); PMLE (L); RMLE (U); PMLE (U).
[Numerical entries are not available in this copy.]

Bias, estimated − truth; Evar, empirical estimate of the variance based on the parameter estimates; Mvar, average of the estimated variance; Logist, logistic regression applied directly to the case-control data; MLE, maximum likelihood estimator calculated using PMLE with $m$ the true $\beta_0$ and $\lambda = 0.001$; RMLE, maximum likelihood estimator with $\beta_0$ fixed at 0.8 (L) or 1.2 (U) times the true value; PMLE, penalized maximum likelihood estimator with $m$ at 0.8 (L) or 1.2 (U) times the true value and $\lambda$ taking the optimal value.

SUPPLEMENTARY MATERIAL

Supplementary Material available at Biometrika online includes proofs of Propositions 2 and 4.

ACKNOWLEDGEMENT

I thank the editor, the associate editor and two referees for very helpful comments and suggestions. I also thank Dr Sally Freels for reading and correcting an earlier version of this paper. The research was partially supported by a grant from the National Science Foundation, U.S.A.

APPENDIX

Proof of Proposition 3.
When G k (y k ) is unconstrained and is variation independent of q η j {y j ; (y j 1,...,y 1 ) y q0,...,y ( j+1)0 } dg j (y j ), j=2 j = k the maximizer for G k has to be achieved with all probability mass concentrated on the observed Y k values when the likelihood is unconstrained. For a constrained likelihood, the maximum of the likelihood has to be less than or equal to the maximum of the corresponding unconstrained likelihood. If the maximizer for

11 G k based on the unconstrained conditional likelihood, n i=1 Biased sampling designs 11 q j=2 η j{y j i; (Y j 1 i,...,y 1 i) y q0,...,y ( j+1)0 } q 1 dg j(y j i ) q j=2 η j{y j ; (y j 1,...,y 1 ) y q0,...,y ( j+1)0 } q 1 dg j(y j ), also satisfies the constraints, the maximizer for G k in the constrained conditional likelihood is achieved with all probability mass concentrated on the observed Y k values. Let y jk ( j = 1,...,N k ) be the observed distinct Y k values having frequencies of occurrence n jk ( j = 1,...,N k ).Let η jl = q η t {y t ; (y t 1,...,y 1 ) y q0,...,y (t+1)0 } dg t (y t ) (yk =y jk,y q =y lq ). t=2 t = k,q The loglikelihood with the Lagrange multipliers can be written as l(g k, G q,μ 1,...,μ Nq,λ 1,λ 2,θ) n q = log η j {Y j i ; (Y j 1 i,...,y 1 i ) y q0,...,y ( j+1)0 }+ log dg j (Y j i ) i=1 j = k,q N k N k + n jk log g jk + n lq log g lq n log η jl g jk g lq N q l=1 N q ( Nk + μ η ) l l=1 l=1 η n lq n N q N k + λ 1 g lq 1 + λ 2 g jk 1, l=1 N q l=1 where θ denote all the parameters other than G k and G q in the likelihood and g jk and g lq are probability masses for G k and G q, respectively. Note that l g l0 q l g j0 k l μ l0 = = n l 0 q g l0 q n η jl 0 g jk l=1 η + λ 1 + μ η N jl 0 g q jk l0 l=1 η μ η jlg jk g Nk lq η jl 0 g jk l l=1 ( N k = n j 0 k n g j0 k N q ( + μ l l=1 l=1 η j 0 lg lq l=1 η + λ 2 η j0 lg lq l=1 η η jl 0 g jk g l0 q l=1 η n l0q n l=1 η ) 2 = 0, (A1) η l=1 η ) j 0 lg lq ( N k l=1 η = 0, (A2) ) 2 = 0, (A3)

$$
\frac{\partial l}{\partial \lambda_1} = \sum_{l=1}^{N_q} g_{lq} - 1 = 0,
\tag{A4}
$$

$$
\frac{\partial l}{\partial \lambda_2} = \sum_{j=1}^{N_k} g_{jk} - 1 = 0,
\tag{A5}
$$

where $l_0 = 1,\ldots,N_q$ and $j_0 = 1,\ldots,N_k$. Multiplying both sides of (A1) by $g_{l_0 q}$ and summing over $l_0$, it can be seen from (A3) and (A4) that $\lambda_1 = 0$. Further, multiplying both sides of (A1) by $g_{l_0 q}$ and applying (A3), it can be seen that

$$
\mu_{l_0} = \frac{1}{n} \sum_{l=1}^{N_q} \mu_l n_{lq}.
\tag{A6}
$$

It follows from the arbitrariness of $l_0$ that $\mu_l$ is constant in $l$. Next, multiplying both sides of (A2) by $g_{j_0 k}$ and summing over $j_0$, it can be seen from (A3) and (A5) that $\lambda_2 = 0$. Further, multiplying both sides of (A2) by $g_{j_0 k}$ and applying (A3) and (A6), it can be seen that

$$
\frac{n_{j_0 k}}{g_{j_0 k}} = n \frac{\sum_{l=1}^{N_q} \eta_{j_0 l} g_{lq}}{\sum_{j=1}^{N_k} \sum_{l=1}^{N_q} \eta_{jl} g_{jk} g_{lq}}.
\tag{A7}
$$

Next, if we do not impose constraints (5) and instead maximize the likelihood over $G_k$ and $G_q$ using the Lagrange multiplier method, we obtain score equations that are equivalent to the score equations when the constraints are imposed. This implies that the profile likelihoods for $\theta$ with and without the constraints are identical, because each profile likelihood is the joint likelihood subject to the constraints defined by the score equations, which are the same in both cases. Finally, the score equations for maximizing the unconstrained joint likelihood with respect to $G_k$ are (A7). By applying (A7) to the joint likelihood, $G_k$ can be eliminated from the likelihood expression, and the resulting profile likelihood is exactly the conditional likelihood based on $\mathrm{pr}(Y_{-k} \mid Y_k, S = 1)$.

REFERENCES

ANDERSON, J. A. (1972). Separate sample logistic discrimination. Biometrika 59.
BRESLOW, N. E. (1976). Regression analysis of the log odds ratio: a method for retrospective studies. Biometrics 32.
BRESLOW, N. E. (1981). Odds ratio estimators when the data are sparse. Biometrika 68.
BRESLOW, N. E. (1996). Statistics in epidemiology: the case-control study. J. Am. Statist. Assoc. 91.
BRESLOW, N. E. & DAY, N. (1980). Statistical Methods in Cancer Research. Volume I: The Analysis of Case-control Studies. IARC Scientific Publications. Lyon: IARC.
CHATTERJEE, N. & CARROLL, R. J. (2005). Semiparametric maximum likelihood estimation exploiting gene-environment independence in case-control studies. Biometrika 92.
CHEN, H. Y. (2003). A note on prospective analysis of outcome-dependent samples. J. R. Statist. Soc. B 65.
CHEN, H. Y. (2004). Nonparametric and semiparametric models for missing covariates in parametric regressions. J. Am. Statist. Assoc. 99.
CHEN, H. Y. (2007). A semiparametric odds ratio model for measuring association. Biometrics 63.
CHEN, H. Y. (2010). Compatibility of conditionally specified models. Statist. Prob. Lett. 80.
COPELAND, K. T., CHECKOWAY, H., MCMICHAEL, A. J. & HOLBROOK, R. H. (1977). Bias due to misclassification in estimation of relative risk. Am. J. Epidemiol. 105.
LEE, A. J., MCMURCHY, L. & SCOTT, A. J. (1997). Re-using data from case-control studies. Statist. Med. 16.
LIANG, K. Y. (1985). Odds ratio inference with dependent data. Biometrika 72.
LIANG, K. Y. & QIN, J. (2000). Regression analysis under non-standard situations: a pairwise pseudolikelihood approach. J. R. Statist. Soc. B 62.
NAGELKERKE, N. J. D., MOSES, S., PLUMMER, F. A., BRUNHAM, R. C. & FISH, D. (1995). Logistic regression in case-control studies: the effect of using independent variables. Statist. Med. 14.
OSIUS, G. (2005). The association between two random elements: a complete characterization and odds ratio models. Metrika 60.

PIEGORSCH, W. W., WEINBERG, C. R. & TAYLOR, J. A. (1994). Non-hierarchical logistic models and case-only designs for assessing susceptibility in population based case-control studies. Statist. Med. 13.
PRENTICE, R. L. & PYKE, R. (1979). Logistic disease incidence models and case-control studies. Biometrika 66.
RABINOWITZ, D. (1997). A note on efficient estimation from case-control data. Biometrika 84.
SATTEN, G. A. & CARROLL, R. J. (2000). Conditional and unconditional categorical regression models with missing covariates. Biometrics 56.
SCOTT, A. J. & WILD, C. J. (1997). Fitting regression models to case-control data by maximum likelihood. Biometrika 84.
UMBACH, D. M. & WEINBERG, C. M. (1997). Designing and analyzing case-control studies to exploit independence of genotype and exposure. Statist. Med. 16.
WEINBERG, C. R. & WACHOLDER, S. (1993). Prospective analysis of case-control data under general multiplicative-intercept risk models. Biometrika 80.

[Received April Revised May 2010]
