A unified framework for studying parameter identifiability and estimation in biased sampling designs


Biometrika Advance Access published January 31, 2011
Biometrika (2011). © 2011 Biometrika Trust. Printed in Great Britain. doi: /biomet/asq059

BY HUA YUN CHEN
Division of Epidemiology and Biostatistics, School of Public Health, University of Illinois at Chicago, 1603 West Taylor Street, Chicago, Illinois 60612, U.S.A.
hychen@uic.edu

SUMMARY
Based on the odds ratio representation of a joint density, we propose a unified framework to study parameter identifiability in biased sampling designs. Most such designs encountered in practice can be reformulated within the proposed framework and, as a result, the question of parameter identifiability can be largely clarified. Estimation of the identifiable parameters is considered, and traditional results on the equivalence of the prospective and retrospective likelihoods are extended. The information that the data contain on certain identifiable parameters is often very limited. Such parameters can be poorly estimated by the likelihood approach with practically attainable sample sizes, which can substantially affect the estimates of the parameters of primary interest. A partially penalized likelihood approach is proposed to address this. Simulation results suggest that the proposed approach performs well.

Some key words: Case-control design; Matched case-control design; Outcome-dependent sampling design; Profile likelihood; Weak identifiability.

1. INTRODUCTION
Case-control and matched case-control designs have been extensively used in epidemiological studies. One key feature of these designs is that, when the outcome is modelled by logistic regression, parameter identifiability and estimation can be studied relatively easily (Anderson, 1972; Prentice & Pyke, 1979; Breslow & Day, 1980; Breslow, 1996; Rabinowitz, 1997).
The results on parameter identifiability and estimation have been extended to more general biased sampling designs (Weinberg & Wacholder, 1993; Scott & Wild, 1997; Chen, 2003, 2007). However, such designs can have many variants, and the models for data analysis may need to accommodate structures beyond traditional logistic regression. One such example is the study of gene-environment interaction using the case-control design under the assumptions of gene-environment independence and/or Hardy-Weinberg equilibrium in the general population. The question of parameter identifiability in such a problem is more involved, and the traditional prospective analysis of the outcome-dependent sampling design can be inefficient (Umbach & Weinberg, 1997; Chatterjee & Carroll, 2005). The important issues of parameter identifiability and estimation in complex statistical models fitted to biased sampling designs have not been systematically addressed.

Odds ratio parameters play an important role in studying biased sampling designs (Breslow, 1976, 1981; Liang, 1985; Liang & Qin, 2000; Satten & Carroll, 2000; Chen, 2003, 2007; Osius, 2005). Chen (2003, 2004, 2007) proposed odds ratio decompositions of a conditional density and a bivariate joint density for studying parameter identifiability and estimation. However, the

results obtained there do not apply in general. In this article, we propose a unified framework based on the odds ratio representation of a joint density (Chen, 2010) to study parameter identifiability, estimation and inference in fitting complex models to data from biased sampling designs.

2. PARAMETER IDENTIFIABILITY IN GENERAL BIASED SAMPLING PROBLEMS
2.1. Odds ratio representation of a joint density
Let the density of $Y = (Y_1, \ldots, Y_q)$ given $X$ be $f(y \mid x)$. Let $p_j(y_j \mid x)$ be the marginal density of $Y_j$ given $X$. Assume the positivity condition holds, i.e., if $p_j(y_j \mid x) > 0$ for $j = 1, \ldots, q$, then $f(y_1, \ldots, y_q \mid x) > 0$. This assumption is satisfied by almost all applications in practice. Let $(y_{10}, \ldots, y_{q0}, x_0)$ be a fixed point in the sample space. Let $y_{-j} = (y_l, l \neq j)$. Define the conditional odds ratio function between $y_j$ and $y_{-j}$ given $x$ as
$$\eta_j\{(y_j, y_{j0}); (y_{-j}, y_{-j0}) \mid x\} = \frac{f(y_j, y_{-j} \mid x)\, f(y_{j0}, y_{-j0} \mid x)}{f(y_{j0}, y_{-j} \mid x)\, f(y_j, y_{-j0} \mid x)}.$$
In the following, we use $\eta_j(y_j; y_{-j} \mid x)$ to denote $\eta_j\{(y_j, y_{j0}); (y_{-j}, y_{-j0}) \mid x\}$. For $q = 2$, Chen (2007) derived an odds ratio representation of $f(y_1, y_2 \mid x)$ as
$$f(y_1, y_2 \mid x) = \frac{\eta_2(y_2; y_1 \mid x)\, f_1(y_1 \mid y_{20}, x)\, f_2(y_2 \mid y_{10}, x)}{\int \eta_2(y_2; y_1 \mid x)\, f_1(y_1 \mid y_{20}, x)\, f_2(y_2 \mid y_{10}, x)\, dy_1\, dy_2}, \qquad (1)$$
where $f_1$ and $f_2$ are, respectively, the conditional densities for $Y_1$ given $Y_2$ and $X$, and for $Y_2$ given $Y_1$ and $X$. One important property of the odds ratio representation is that the three components in the expression, $\eta_2(y_2; y_1 \mid x)$, $f_1(y_1 \mid y_{20}, x)$ and $f_2(y_2 \mid y_{10}, x)$, are uniquely identifiable, and are also variation independent if no additional constraint is imposed on $f(y_1, y_2 \mid x)$.

Following Chen (2010), this representation can be extended to the case with $q > 2$. Let $Y_q$ and $(Y_{q-1}, \ldots, Y_1)$ be the two sets of variables in applying (1). The three components of the decomposition are $\eta_q\{y_q; (y_{q-1}, \ldots, y_1) \mid x\}$, $f_q(y_q \mid y_{(q-1)0}, \ldots, y_{10}, x)$ and $p(y_{q-1}, \ldots, y_1 \mid y_{q0}, x)$.
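Representation (1) can be checked numerically on a small discrete example. The sketch below is an illustration only: the joint table, the reference point and all function names are made up, there is no covariate $x$, and sums replace the integrals. It rebuilds the joint distribution from the odds ratio function and the two conditional densities at the reference point.

```python
# Made-up joint table f[y1][y2] with y1 in {0, 1} and y2 in {0, 1, 2}.
f = [[0.10, 0.05, 0.15],
     [0.20, 0.30, 0.20]]
Y10, Y20 = 0, 0  # reference point (y10, y20), chosen arbitrarily


def eta(y1, y2):
    """Odds ratio function eta_2(y2; y1) relative to (Y10, Y20)."""
    return f[y1][y2] * f[Y10][Y20] / (f[Y10][y2] * f[y1][Y20])


def f1(y1):
    """Conditional density of Y1 given Y2 = Y20."""
    return f[y1][Y20] / sum(f[a][Y20] for a in range(2))


def f2(y2):
    """Conditional density of Y2 given Y1 = Y10."""
    return f[Y10][y2] / sum(f[Y10][b] for b in range(3))


# Numerator of (1), then normalise by its total (the denominator of (1)).
unnorm = [[eta(a, b) * f1(a) * f2(b) for b in range(3)] for a in range(2)]
total = sum(sum(row) for row in unnorm)
rebuilt = [[v / total for v in row] for row in unnorm]

# Largest discrepancy between the rebuilt and the original joint table.
max_err = max(abs(rebuilt[a][b] - f[a][b]) for a in range(2) for b in range(3))
print(max_err)
```

Changing the reference point changes the three components but not the rebuilt table, which is one way to see that the representation does not depend on the particular choice of $(y_{10}, y_{20})$.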
By applying a similar representation to $p(y_{q-1}, \ldots, y_1 \mid y_{q0}, x)$ and repeating the process, a representation of the conditional joint density, $f(y_1, \ldots, y_q \mid x)$, can be obtained as
$$\frac{\prod_{j=2}^{q} \eta_j\{y_j; (y_{j-1}, \ldots, y_1) \mid y_{q0}, \ldots, y_{(j+1)0}, x\} \prod_{j=1}^{q} f_j(y_j \mid y_{-j0}, x)}{\int \prod_{j=2}^{q} \eta_j\{y_j; (y_{j-1}, \ldots, y_1) \mid y_{q0}, \ldots, y_{(j+1)0}, x\} \prod_{j=1}^{q} \{f_j(y_j \mid y_{-j0}, x)\, dy_j\}}. \qquad (2)$$
The following proposition states the identifiability of the components in the odds ratio representation.

PROPOSITION 1. The components in the representation, $f_j(y_j \mid y_{-j0}, x)$ $(j = 1, \ldots, q)$ and $\eta_j\{y_j; (y_{j-1}, \ldots, y_1) \mid y_{q0}, \ldots, y_{(j+1)0}, x\}$ $(j = 2, \ldots, q)$, are all uniquely identifiable.

Proof. The result follows from the identifiability result for $q = 2$ (Chen, 2007) and from the inductive argument used to derive the general representation.

2.2. Parameter identifiability in general biased sampling designs
Most biased sampling designs can be viewed as a selective sample from the general population. Let $S$ be the indicator of selection into the biased sample. Consider the biased sampling design

having sampling probability
$$\mathrm{pr}(S = 1 \mid Y_1, \ldots, Y_q, X) = \pi(Y, X) \propto \prod_{j=1}^{q} \psi_j(Y_j, X), \qquad (3)$$
where $\psi_j$ $(j = 1, \ldots, q)$ are either known or unknown. The design can be viewed as a multistage biased sampling design. For example, in the first stage, the sampling probability depends on $(Y_1, X)$. For those selected in the first stage, the second stage of selection depends on $(Y_2, X)$, and so on. The design is very flexible and includes as special cases most outcome-dependent sampling designs, such as the case-control and matched case-control sampling designs. Under this design, $d\,\mathrm{pr}(y_1, \ldots, y_q \mid x, S = 1)$ can be expressed as
$$\frac{\prod_{j=2}^{q} \eta_j\{y_j; (y_{j-1}, \ldots, y_1) \mid y_{q0}, \ldots, y_{(j+1)0}, x\} \prod_{j=1}^{q} dG_j(y_j \mid x)}{\int \prod_{j=2}^{q} \eta_j\{y_j; (y_{j-1}, \ldots, y_1) \mid y_{q0}, \ldots, y_{(j+1)0}, x\} \prod_{j=1}^{q} dG_j(y_j \mid x)},$$
where $dG_j(y_j \mid x)$ denotes
$$d\,\mathrm{pr}(y_j \mid y_{-j0}, x, S = 1) = \frac{\psi_j(y_j, x)\, f_j(y_j \mid y_{-j0}, x)\, dy_j}{\int \psi_j(y_j, x)\, f_j(y_j \mid y_{-j0}, x)\, dy_j} \quad (j = 1, \ldots, q).$$
It can be seen that $f(y_1, \ldots, y_q \mid x)$ and $p(y_1, \ldots, y_q \mid x, S = 1)$ share the odds ratio functions, which means that the odds ratio functions are always identifiable from the biased sampling design. Proposition 2 states the identifiability results; its proof is given in the Supplementary Material.

PROPOSITION 2. Under the biased sampling design (3): (i) the odds ratio functions in (2) are always identifiable; (ii) if $\psi_j$ is not parametrically modelled and is unknown, then $f_j(y_j \mid y_{-j0}, x)$ is not identifiable; and (iii) if $\psi_j$ is known and $\psi_j > 0$ for all $(y_j, x)$, then $f_j(y_j \mid y_{-j0}, x)$ is identifiable.

A parameter is identifiable from the biased sampling design (3) if it is identifiable from the union of the odds ratio functions and the identifiable $f_j(y_j \mid y_{-j0}, x)$ $(j = 1, \ldots, q)$.

2.3. Applications to specific biased sampling problems
Proposition 2 can be applied to different biased sampling designs to determine the identifiability of parameters. The following examples illustrate the applications.

Example 1.
Let $W$, $Z$ and $X$ be, respectively, the true outcome, the risk factor and the confounder under control, i.e., the matched variable in a matched case-control design. Let $U$ denote the observed outcome subject to misclassification. Assume that the sampling probability, $\mathrm{pr}(S = 1 \mid U, W, X, Z) = \pi(U, X)$, is of unknown functional form. The observed data consist of $(U, X, Z, S = 1)$. Let the misclassification probabilities be $p_0 = \mathrm{pr}(U = 1 \mid W = 0, X, Z)$ and $p_1 = \mathrm{pr}(U = 0 \mid W = 1, X, Z)$. Assume for brevity that $p_0$ and $p_1$ are independent of $(X, Z)$. Assume a linear logistic regression model for the true outcome (Copeland et al., 1977), i.e.,
$$\mathrm{pr}(W = 1 \mid X, Z) = \frac{\exp(\beta_0 + \beta_1 Z + \beta_2 X)}{1 + \exp(\beta_0 + \beta_1 Z + \beta_2 X)}.$$
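The set-up so far can be explored numerically. The sketch below is an illustration under assumed parameter values (all numbers and function names are hypothetical): it obtains $\mathrm{pr}(U = 1 \mid X, Z)$ from the logistic model for $W$ by the law of total probability and then computes the odds ratio of the observed outcome between $Z = z$ and a reference value $z_0$. In this sketch the odds ratio changes with $\beta_0$, unlike in an error-free logistic model, and it collapses to 1 when $\beta_1 = 0$.

```python
import math


def expit(t):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-t))


def pr_u1(z, x, beta, p0, p1):
    """pr(U = 1 | X, Z), summing over the true outcome W."""
    b0, b1, b2 = beta
    pw = expit(b0 + b1 * z + b2 * x)          # pr(W = 1 | X, Z)
    return p0 * (1.0 - pw) + (1.0 - p1) * pw  # false positives + true positives


def odds_ratio_u(z, z0, x, beta, p0, p1):
    """Odds ratio of U = 1 versus U = 0 between Z = z and Z = z0, given X = x."""
    a = pr_u1(z, x, beta, p0, p1)
    b = pr_u1(z0, x, beta, p0, p1)
    return (a / (1.0 - a)) / (b / (1.0 - b))


# Hypothetical values: p0 = 0.05, p1 = 0.2, z = 1, z0 = 0, x = 0.
or_a = odds_ratio_u(1.0, 0.0, 0.0, (-2.0, 1.0, 0.5), 0.05, 0.2)
or_b = odds_ratio_u(1.0, 0.0, 0.0, (-1.0, 1.0, 0.5), 0.05, 0.2)     # only beta0 changed
or_null = odds_ratio_u(1.0, 0.0, 0.0, (-2.0, 0.0, 0.5), 0.05, 0.2)  # beta1 = 0
print(or_a, or_b, or_null)
```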

The model for the misclassified outcome becomes
$$\mathrm{pr}(U = 1 \mid X, Z) = \frac{p_0 + (1 - p_1) \exp(\beta_0 + \beta_1 Z + \beta_2 X)}{1 + \exp(\beta_0 + \beta_1 Z + \beta_2 X)}.$$
This can be fitted into the proposed framework by setting $Y_2 = U$ and $Y_1 = Z$. The conditional odds ratio function, $\eta(U = 1; Z \mid X)$, is
$$\frac{\{p_0 + (1 - p_1) \exp(\beta_0 + \beta_1 Z + \beta_2 X)\}\{1 - p_0 + p_1 \exp(\beta_0 + \beta_1 Z_0 + \beta_2 X)\}}{\{p_0 + (1 - p_1) \exp(\beta_0 + \beta_1 Z_0 + \beta_2 X)\}\{1 - p_0 + p_1 \exp(\beta_0 + \beta_1 Z + \beta_2 X)\}},$$
where $U_0 = 0$. When $\beta_1 = 0$, none of $(p_0, p_1, \beta_0, \beta_2)$ is identifiable from the sampling design. Under the assumptions that $\beta_1 \neq 0$, that $(p_0, p_1)$ are known and that $X$ is absent, parameters $(\beta_0, \beta_1)$ are identifiable if $Z$ takes at least three distinct values. When $X$ is present, parameters $(\beta_0, \beta_1, \beta_2)$ are identifiable when $X$ takes at least two values, $Z$ takes at least three values, and the $X$ values and the $Z$ values different from $Z_0$ form at least three distinct $(X, Z)$ points. When neither $p_0$ nor $p_1$ is known, parameters $(p_0, p_1, \beta_0, \beta_1, \beta_2)$ are identifiable when $Z$ takes at least five distinct values and $X$ takes at least two distinct values, and the $X$ values and the $Z$ values different from $Z_0$ form at least five distinct $(X, Z)$ points.

Example 2. In a case-control design or a matched case-control design, the association among the covariates themselves in the population may be of interest (Nagelkerke et al., 1995; Lee et al., 1997). Let $W$ be the binary outcome and $U$, $Z$ and $X$ be covariates. Suppose that the relationship of $U$ to $Z$ and $X$ is of interest. The sampling design satisfies $\mathrm{pr}(S = 1 \mid W, U, Z, X) = \pi(W, X)$. This can be fitted into the proposed framework by setting $Y_3 = W$, $Y_2 = U$ and $Y_1 = Z$. The joint density for $(W, U, Z)$ given $X$ under the biased sampling design, $p(W, U, Z \mid X, S = 1)$, can be expressed as
$$\frac{\eta_1\{W; (U, Z) \mid X\}\, \eta_2(U; Z \mid W_0, X)\, dF_1(W \mid X)\, dF_2(U \mid X)\, dF_3(Z \mid X)}{\int \eta_1\{W; (U, Z) \mid X\}\, \eta_2(U; Z \mid W_0, X)\, dF_1(W \mid X)\, dF_2(U \mid X)\, dF_3(Z \mid X)}.$$
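The expression above relies on Proposition 2(i): selection that depends on the outcome alone leaves the odds ratio functions untouched. A small discrete check, with no $X$ and with a made-up population table and made-up sampling fractions (every name and number below is an assumption for illustration), compares $\eta_1$ and $\eta_2$ computed from the population with the same quantities computed from the selected sample.

```python
import itertools

# Made-up positive weights for (w, u, z): W and U binary, Z with three levels.
weights = {(w, u, z): 1.0 + w + 2 * u * z + 0.5 * w * z + 0.3 * w * u
           for w, u, z in itertools.product(range(2), range(2), range(3))}
norm = sum(weights.values())
pop = {k: v / norm for k, v in weights.items()}   # population joint

# Outcome-dependent selection pi(W): cases are heavily over-sampled.
pi = {0: 0.02, 1: 0.9}
sel = {k: pi[k[0]] * v for k, v in pop.items()}
norm_s = sum(sel.values())
sam = {k: v / norm_s for k, v in sel.items()}     # joint given S = 1


def eta1(p, w, u, z):
    """Odds ratio between W and (U, Z), reference point (0, 0, 0)."""
    return p[(w, u, z)] * p[(0, 0, 0)] / (p[(0, u, z)] * p[(w, 0, 0)])


def eta2(p, u, z):
    """Odds ratio between U and Z given W = W0 = 0, reference (0, 0)."""
    return p[(0, u, z)] * p[(0, 0, 0)] / (p[(0, 0, z)] * p[(0, u, 0)])


gap1 = max(abs(eta1(pop, 1, u, z) - eta1(sam, 1, u, z))
           for u in range(2) for z in range(3))
gap2 = max(abs(eta2(pop, u, z) - eta2(sam, u, z))
           for u in range(2) for z in range(3))

# The margin of W, by contrast, is distorted by the sampling.
pw_pop = sum(v for k, v in pop.items() if k[0] == 1)
pw_sam = sum(v for k, v in sam.items() if k[0] == 1)
print(gap1, gap2, pw_pop, pw_sam)
```

The two gaps are zero up to rounding, while the proportion of cases differs sharply between population and sample, so anything learned about the margin of $W$ must come from outside the odds ratio functions.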
Let $g_1(W \mid U, Z, X, \beta)$ and $g_2(U \mid Z, X, \theta)$ be the models for the population, with $\eta_1$ and $\eta_2$ being, respectively, the odds ratio functions. If $g_1$ is a linear logistic model and $g_2$ is a normal linear regression model with regression coefficients $\alpha$ and residual variance $\sigma^2$, then
$$\log \eta_1\{W = 1; (U, Z) \mid X\} = \beta_1 (U - U_0) + \beta_2 (Z - Z_0),$$
$$\log \eta_2(U; Z \mid W_0, X) = \frac{\alpha_1}{\sigma^2} (Z - Z_0)(U - U_0) + \log \left[ \frac{\{1 + \exp(\beta_0 + \beta_1 U_0 + \beta_2 Z + \beta_3 X)\}\{1 + \exp(\beta_0 + \beta_1 U + \beta_2 Z_0 + \beta_3 X)\}}{\{1 + \exp(\beta_0 + \beta_1 U + \beta_2 Z + \beta_3 X)\}\{1 + \exp(\beta_0 + \beta_1 U_0 + \beta_2 Z_0 + \beta_3 X)\}} \right],$$
where $W_0 = 0$. It can be seen from the above expression that $\beta_1$ and $\beta_2$ are always identifiable from $\eta_1$ when $U$ and $Z$ each take at least two distinct values. When either $\beta_1 = 0$ or $\beta_2 = 0$, $\beta_0$ is not identifiable from $\eta_2$. When $\beta_1 \neq 0$ and $\beta_2 \neq 0$, $(\beta_0, \beta_3)$ may be identifiable from $\eta_2$. In the latter case, when $X$ is absent and $\alpha_1 = 0$, $\beta_0$ is identifiable from $\eta_2$. When $X$ is present, $\beta_0$ and $\alpha_1/\sigma^2$ are identifiable only when either $Z$ or $U$ takes at least three distinct values. When $X$ is present and $\alpha_1 \neq 0$, $(\beta_0, \beta_3)$ and $\alpha_1/\sigma^2$ are identifiable when $X$ takes at least two distinct values

and either $Z$ or $U$ takes at least three distinct values, and $X$, $Z \neq Z_0$ and $U \neq U_0$ form at least three distinct $(U, X, Z)$ points.

The identifiability results obtained in Chatterjee & Carroll (2005) for studying gene and environment effects under the assumption of gene-environment independence in the general population can be viewed as a special case of this example. By setting the disease status $D = W$, the genetic variants $G = U$, the environment factor $E = Z$ and $g_2(U \mid Z) = g_2(U)$, the odds ratio functions under their model are
$$\log \eta_1\{D = 1; (G, E)\} = m(G, E, \beta_1) - m(G_0, E_0, \beta_1),$$
$$\log \eta_2(G; E \mid D_0 = 0) = \log \left( \frac{[1 + \exp\{\beta_0 + m(G_0, E, \beta_1)\}][1 + \exp\{\beta_0 + m(G, E_0, \beta_1)\}]}{[1 + \exp\{\beta_0 + m(G, E, \beta_1)\}][1 + \exp\{\beta_0 + m(G_0, E_0, \beta_1)\}]} \right),$$
where $m(G, E, \beta_1)$ may take the form $\beta_{11} G + \beta_{12} E$ or the form $\beta_{11} G + \beta_{12} E + \beta_{13} G E$. Both $\beta_{11}$ and $\beta_{12}$ are identifiable from $\eta_1$. When $\beta_{13} \neq 0$, $\beta_0$ is identifiable even if $\beta_{11} = \beta_{12} = 0$. When $\beta_{13} = 0$, $\beta_0$ is also identifiable if neither $\beta_{11}$ nor $\beta_{12}$ is zero.

Example 3. Let $D$, taking the value 0 or 1, denote the disease status, $G$ the genetic factor, $E$ the environment factor and $Z$ other risk factors. In a case-only sampling design, $\mathrm{pr}(S = 1 \mid G, E, Z, D) = D\pi$ (Piegorsch et al., 1994). If the distribution of $Z$ is unspecified, inference on the gene-environment interaction can be based on the distribution
$$\mathrm{pr}(G, E \mid Z, D = 1, S = 1) = \frac{\eta(G; E \mid Z, D = 1)\, p(G \mid E_0, Z)\, p(E \mid G_0, Z)}{\int \eta(G; E \mid Z, D = 1)\, p(G \mid E_0, Z)\, p(E \mid G_0, Z)\, dG\, dE}.$$
Under the model that
$$\mathrm{pr}(D = 1 \mid G, E, Z) = \frac{\exp(\beta_0 + \beta_1 E + \beta_2 G + \beta_{12} E G + \beta_3 Z)}{1 + \exp(\beta_0 + \beta_1 E + \beta_2 G + \beta_{12} E G + \beta_3 Z)} \qquad (4)$$
and gene-environment independence in the control population, i.e., $\mathrm{pr}(G, E \mid D = 0) = p(G \mid D = 0)\, p(E \mid D = 0)$,
$$\log \eta(G; E \mid Z, D = 1) = \beta_{12} (E - E_0)(G - G_0).$$
Under model (4) and gene-environment independence in the general population, $\log \eta(G; E \mid Z, D = 1)$ reduces to $\beta_{12} (E - E_0)(G - G_0)$ plus
$$\log \frac{\{1 + \exp(\beta_0 + \beta_1 E_0 + \beta_2 G + \beta_{12} E_0 G + \beta_3 Z)\}\{1 + \exp(\beta_0 + \beta_1 E + \beta_2 G_0 + \beta_{12} E G_0 + \beta_3 Z)\}}{\{1 + \exp(\beta_0 + \beta_1 E + \beta_2 G + \beta_{12} E G + \beta_3 Z)\}\{1 + \exp(\beta_0 + \beta_1 E_0 + \beta_2 G_0 + \beta_{12} E_0 G_0 + \beta_3 Z)\}}.$$
In practice, additional assumptions such as gene-environment independence given $Z$ ($\gamma = 0$) and/or the rare disease assumption, which implies that the second term in the foregoing displayed equation is approximately 0, make the estimation of $\beta_{12}$ easier. However, parameters $(\beta_0, \beta_1, \beta_2, \beta_{12}, \beta_3, \gamma)$ can become identifiable from the odds ratio model when $\beta_{12} \neq 0$ and $E$, $G$ and $Z$ take many distinct values, even without the foregoing additional assumptions.

One feature of these examples is that the identifiability of some parameters requires that other parameters do not take certain values in the parameter space. For example, in Example 1, when $\beta_1 = 0$, the other parameters become unidentifiable. Such identifiability is termed weak identifiability in this paper. In contrast, in linear logistic regression under a case-control design, the identifiable parameters are not subject to this problem. Such identifiability is termed strong identifiability in this paper. The problem with the weak identifiability in Example 1 is that when $\beta_1$

is close to zero, information in the data for distinguishing different values of $\beta_0$ can be very limited, so that extremely large sample sizes may be required to estimate the parameter with reasonable accuracy. The poorly estimated $\beta_0$ can substantially affect the estimation of $\beta_1$, which is usually the parameter of primary interest. We tackle this difficult problem in § 4 based on a modification of one of the equivalent likelihoods discussed in the next section.

3. EQUIVALENT LIKELIHOODS FOR PARAMETER ESTIMATION AND INFERENCE
3.1. Biased sampling designs of the case-control type
In a case-control design, $X$ is absent. Let $Y_q$ denote the outcome and let $Y_{q-1}, \ldots, Y_1$ denote the covariates. The sampling probability satisfies $\pi(Y_1, \ldots, Y_q) = \psi_q(Y_q)$. The marginal distribution for $Y_q$ is fixed by the sampling design. That is, $p(y_q \mid S = 1)$ is known. The joint density for the sample is $p(y_{q-1}, \ldots, y_1 \mid y_q)\, p(y_q \mid S = 1)$, which can be rewritten as $p(y_1, \ldots, y_q \mid S = 1)$ with known $p(y_q \mid S = 1)$. Let $(Y_1^i, \ldots, Y_q^i, S^i = 1)$ $(i = 1, \ldots, n)$ be the observed data. It is easy to see that the maximizer of the retrospective likelihood $\prod_{i=1}^{n} p(Y_{q-1}^i, \ldots, Y_1^i \mid Y_q^i)$ is the same as the maximizer of the joint likelihood $\prod_{i=1}^{n} p(Y_1^i, \ldots, Y_q^i \mid S^i = 1)$ subject to the constraints $p(y_q = y_{kq} \mid S = 1) = n_{kq}/n$, for $k = 1, 0$, corresponding to cases and controls. Under the general sampling design (3), we maximize the joint likelihood
$$\prod_{i=1}^{n} \frac{\prod_{j=2}^{q} \eta_j\{Y_j^i; (Y_{j-1}^i, \ldots, Y_1^i) \mid y_{q0}, \ldots, y_{(j+1)0}\} \prod_{j=1}^{q} dG_j(Y_j^i)}{\int \prod_{j=2}^{q} \eta_j\{y_j; (y_{j-1}, \ldots, y_1) \mid y_{q0}, \ldots, y_{(j+1)0}\} \prod_{j=1}^{q} dG_j(y_j)}$$
subject to the constraints
$$\frac{\left[ \int \prod_{j=2}^{q} \eta_j\{y_j; (y_{j-1}, \ldots, y_1) \mid y_{q0}, \ldots, y_{(j+1)0}\} \prod_{j=1}^{q-1} \{dG_j(y_j)\} \right]_{y_q = y_{kq}} dG_q(y_{kq})}{\int \prod_{j=2}^{q} \eta_j\{y_j; (y_{j-1}, \ldots, y_1) \mid y_{q0}, \ldots, y_{(j+1)0}\} \prod_{j=1}^{q} dG_j(y_j)} = \frac{n_{kq}}{n} \qquad (5)$$
for $k = 1, \ldots, N_q$, where $y_{1q}, \ldots, y_{N_q q}$ are all the distinct values $Y_q$ takes and $n_{1q}, \ldots, n_{N_q q}$ are the corresponding frequencies. We term this sampling design the case-control type.
Imposing the constraints does not affect the identifiability results obtained in the previous section, because $p(y_q \mid S = 1)$ was assumed known in the parameter identifiability problem.

PROPOSITION 3. Assume that $G_q(y_q)$ and $G_k(y_k)$, $k \neq q$, are unconstrained and variation independent of each other and of
$$\prod_{j=2}^{q} \eta_j\{y_j; (y_{j-1}, \ldots, y_1) \mid y_{q0}, \ldots, y_{(j+1)0}\} \prod_{j \neq k, q} dG_j(y_j).$$
The profile likelihoods for parameters other than $G_k(y_k)$ and $G_q(y_q)$ under sampling design (3), based on the joint likelihood with or without constraints, are identical. Furthermore, the unconstrained joint likelihood with $G_k(y_k)$ profiled out is the same as the conditional likelihood based on $Y_k$ given $Y_{-k}$.
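The classical special case behind this equivalence can be verified directly for a single binary exposure. The sketch below uses hypothetical counts and a small Newton-Raphson routine written for this illustration (it is not the author's code and the names are assumptions): it fits a logistic model prospectively, for disease given exposure, and retrospectively, for exposure given disease, and compares both slopes with the sample log odds ratio.

```python
import math


def fit_logistic(groups, iters=50):
    """Newton-Raphson for logit pr(y = 1 | x) = a + b * x on grouped data.

    groups: list of (x, number of successes, group size). Returns (a, b).
    """
    a = b = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, s, n in groups:
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            r = s - n * p            # score contribution
            w = n * p * (1.0 - p)    # information weight
            g0 += r
            g1 += r * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det   # solve the 2 x 2 Newton system
        b += (h00 * g1 - h01 * g0) / det
    return a, b


# Hypothetical case-control counts by exposure status (1 = exposed).
cases = {1: 120, 0: 380}      # 500 cases
controls = {1: 60, 0: 440}    # 500 controls

# Prospective analysis: disease given exposure, ignoring how data were sampled.
_, b_pro = fit_logistic([(e, cases[e], cases[e] + controls[e]) for e in (0, 1)])
# Retrospective analysis: exposure given disease status.
_, b_ret = fit_logistic([(0, controls[1], 500), (1, cases[1], 500)])
log_or = math.log(cases[1] * controls[0] / (cases[0] * controls[1]))
print(b_pro, b_ret, log_or)
```

The intercepts of the two fits differ, as expected, because only the odds ratio parameter is shared between the prospective and retrospective models.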

Remark 1. The condition on $G_k(y_k)$ required by Proposition 3 is satisfied if either of the following conditions holds for the fixed $k$: (i) the function $\psi_k(y_k)$ is not constrained and is unknown; or (ii) the function $\psi_k(y_k)$ is known and $f_k(y_k \mid y_{-k0})$ is not constrained, and is variation independent of $G_q(y_q)$ and $\prod_{j=2}^{q} \eta_j\{y_j; (y_{j-1}, \ldots, y_1) \mid y_{q0}, \ldots, y_{(j+1)0}\} \prod_{j \neq k, q} dG_j(y_j)$. In a typical case-control design, $\psi_q(y_q)$ is unknown and $\psi_k(y_k)$ for $k \neq q$ is known. This means that the second condition is satisfied in the application to the case-control design.

The proof is given in the Appendix. Proposition 3 extends the classical equivalence result on the prospective and retrospective analyses of the case-control design to allow for more complex models. For example, in the case-control study of gene-environment interaction, if the distribution of $E$ is not parametrically modelled, Proposition 3 states that inferences based on the unconstrained conditional likelihood from $p(D, G \mid E, S = 1)$ for the parameters other than $p(E \mid D = 0, G_0)$ in $p(G, E \mid D)$ are exactly the same as those based on the retrospective likelihood from $p(G, E \mid D)$. This result is sharper than the result obtained in Chatterjee & Carroll (2005) when no additional information on the disease prevalence is available. Their result stated that the prospective likelihood from $p(D, G \mid E, S = 1)$ can be used for solving for the likelihood estimator of the regression parameters, but cannot be used for inference. In contrast, our result implies that $p(D, G \mid E, S = 1)$ can be used as an ordinary likelihood for both estimation and inference on the parameters in $p(G, E \mid D)$ other than $p(E \mid D = 0, G_0)$. Such parameters include $(\beta_0, \beta_1, \theta)$ in Chatterjee & Carroll (2005), but do not include $\kappa$ in their $\eta$.
This result is also in agreement with those in Scott & Wild (1997) for multiplicative models.

3.2. Biased sampling designs of the matched case-control type
In a biased sampling design of the matched case-control type, $p(y_q, x \mid S = 1)$ may be considered fixed by the sampling design. Assume that the observed data are grouped by strata defined by the values $X$ takes. Within stratum $k$, the joint density of the observed data is
$$\prod_{i=1}^{m_k} \frac{\prod_{j=2}^{q} \eta_j\{Y_j^{ik}; (Y_{j-1}^{ik}, \ldots, Y_1^{ik}) \mid y_{q0}, \ldots, y_{(j+1)0}, X_k\} \prod_{j=1}^{q} dG_j(Y_j^{ik} \mid X_k)}{\int \prod_{j=2}^{q} \eta_j\{y_j; (y_{j-1}, \ldots, y_1) \mid y_{q0}, \ldots, y_{(j+1)0}, X_k\} \prod_{j=1}^{q} dG_j(y_j \mid X_k)},$$
where $(Y_q^{ik}, \ldots, Y_1^{ik}, X_k)$ $(i = 1, \ldots, m_k)$ are the data from the stratum. If, for a fixed $j$, $G_j(y_j \mid x)$ is unconstrained and is variation independent of the other components in the numerator of the expression, then $G_j(y_j \mid x)$ can be eliminated from the conditional likelihood based on the permutation of $Y_j^{ik}$ $(i = 1, \ldots, m_k)$. If several such $G_j$ exist, they can all be eliminated from the conditional likelihood based on a combination of all the permutations on each of the observed $y_j$ within a stratum. The conditional logistic analysis of a matched case-control design is a special case of the permutation approach. This generalization is useful for analyzing genetic studies with a matched case-control design, either when the gene-environment distribution is modelled or when correlated family members are modelled. Such applications will be discussed elsewhere.
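The remark that conditional logistic regression is a special case of the permutation approach can be made concrete for one matched set. The sketch below is an illustration with hypothetical covariate values, and the function name is an assumption: it computes the conditional likelihood of the observed case/control labels by enumerating all distinct rearrangements of the labels within the stratum, and compares it with the familiar closed form for a set with one case.

```python
import itertools
import math


def permutation_conditional_lik(beta, z, d):
    """Conditional likelihood of the labels d within one matched set,
    given the multiset of labels, under logit pr(D = 1 | z) = a + beta * z.
    The stratum-specific intercept a cancels and is therefore not needed."""
    def weight(labels):
        return math.exp(beta * sum(zi * di for zi, di in zip(z, labels)))

    distinct = set(itertools.permutations(d))  # distinct label rearrangements
    return weight(d) / sum(weight(p) for p in distinct)


# Hypothetical 1:2 matched set; the first subject is the case.
z = [1.3, 0.2, -0.4]
d = (1, 0, 0)
beta = 0.7

lik_perm = permutation_conditional_lik(beta, z, d)
lik_closed = math.exp(beta * z[0]) / sum(math.exp(beta * zi) for zi in z)
print(lik_perm, lik_closed)
```

Enumerating permutations is only practical for small strata; it is used here because it mirrors the conditioning argument rather than because it is efficient.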

4. THE PARTIALLY PENALIZED LIKELIHOOD APPROACH
4.1. Partially penalized likelihood for weakly identifiable parameters
When weakly identifiable parameters are involved, the maximum likelihood estimator can have very poor performance under practically attainable sample sizes. We propose the following partially penalized likelihood to tackle this problem. Let
$$l(\beta_0, \beta_1, \Lambda) = \sum_{i=1}^{n} l_i(\beta_0, \beta_1 \mid Y_i, X_i) - \frac{1}{2} (\beta_0 - m)^{T} \Lambda (\beta_0 - m),$$
where $l_i(\beta_0, \beta_1 \mid Y_i, X_i) = \log p(Y_{i1}, \ldots, Y_{iq} \mid X_i, S_i = 1, \beta_0, \beta_1)$ or an equivalent loglikelihood, both $\beta_0$ and $\beta_1$ can be vectors, $\Lambda$ is a diagonal matrix with positive diagonal elements, and $m$ is a vector of plausible values for $\beta_0$. Let $(\beta_0^*, \beta_1^*)$ be the true parameter values for $(\beta_0, \beta_1)$ and let $\beta_0$ be a scalar. Let
$$\Omega = E\left\{ \frac{\partial l_i(\beta_0^*, \beta_1^*)}{\partial (\beta_0, \beta_1)} \right\}^{\otimes 2} = \begin{pmatrix} \omega_0 & \Omega_{01} \\ \Omega_{10} & \Omega_1 \end{pmatrix}.$$
The maximum likelihood estimator for $\beta_1$ has asymptotic variance $(\Omega_1 - \Omega_{10} \Omega_{01} / \omega_0)^{-1}$. The information on $\beta_0$, $\omega_0$, is usually very small when $\beta_0$ is weakly identifiable. When $\Omega_{10} \Omega_{01}$ is large relative to $\omega_0$, the influence of the estimator of $\beta_0$ on the variation of the estimator of $\beta_1$ can be substantial in the maximum likelihood approach. Let $n^{1/2} (m_n - \beta_0^*) \to h$ and $\lambda_n / n \to \sigma$, where $m_n$ and $\lambda_n$ are, respectively, the guessed $\beta_0$ and the penalty parameter. The proposed partially penalized approach downweights the influence. The variance for the penalized likelihood estimator for $\beta_1$ is
$$V(\hat\beta_1) = \left( \Omega_1 - \frac{\Omega_{10} \Omega_{01}}{\omega_0 + \sigma} \right)^{-1} \left\{ \Omega_1 - \frac{(2\sigma + \omega_0)\, \Omega_{10} \Omega_{01}}{(\sigma + \omega_0)^2} \right\} \left( \Omega_1 - \frac{\Omega_{10} \Omega_{01}}{\omega_0 + \sigma} \right)^{-1}.$$
This variance is smaller than that of the maximum likelihood estimator. As the penalized likelihood estimator can have bias, the mean square error is a better measure. The following proposition gives the optimal result. A proof is given in the Supplementary Material.

PROPOSITION 4. Let $\beta_0$ be a scalar parameter. Under the regularity conditions for the maximum likelihood estimator to be consistent and asymptotically normal, the penalized maximum likelihood estimator for $\beta_1$ has a minimum mean square error matrix, $\{\Omega_1 - \Omega_{10} \Omega_{01} / (\omega_0 + 1/h^2)\}^{-1}$, which is attained when $\sigma = h^{-2}$.

Proposition 4 can be used as a guideline for choosing the penalty parameter $\lambda_n$, because
$$\lambda_n \approx n\sigma = n h^{-2} \approx \frac{1}{(m_n - \beta_0^*)^2}.$$
The optimal choice of $\lambda_n$ depends on the distance between the guessed value for $\beta_0$ and the true $\beta_0$. In practice, however, we do not know the true $\beta_0$. If the probable range of the true value for $\beta_0$ is known, we may take $\lambda_n$ as the reciprocal of the squared range. In cases where information on the range is not available, we may base the choice of $\lambda_n$ on the notion of controlling the influence of the $\beta_0$ estimator on the variance of the $\beta_1$ estimator. We can thus choose a suboptimal $\lambda_n$, as small as possible, such that the variance for the $\beta_1$ estimator changes little relative to the change in $\lambda_n$. This criterion can be implemented relatively easily.
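The trade-off described by Proposition 4 can be explored numerically in the scalar case. The sketch below is a reconstruction for illustration, not the author's code. It assumes that the penalized estimator has the usual sandwich variance and a first-order bias proportional to $\sigma h$, both written out for scalar $\beta_0$ and $\beta_1$. The information values and the names `omega0`, `omega1` and `c` (standing for $\omega_0$, $\Omega_1$ and the cross term $\Omega_{10}$) are hypothetical.

```python
def mse_beta1(sigma, h, omega0, omega1, c):
    """Asymptotic mean square error of the penalized estimator of a scalar beta1.

    sigma: limit of lambda_n / n; h: limit of sqrt(n) * (m_n - true beta0).
    omega0, omega1: information on beta0 and beta1; c: cross-information.
    """
    s = omega0 + sigma
    d = omega1 - c * c / s                        # penalized information for beta1
    var = (omega1 - (2 * sigma + omega0) * c * c / s ** 2) / d ** 2
    bias = -c * sigma * h / (s * d)               # assumed first-order bias
    return var + bias ** 2


# Hypothetical values with very little information on beta0.
omega0, omega1, c, h = 0.02, 1.0, 0.12, 2.0

grid = [k / 1000.0 for k in range(0, 2001)]       # sigma from 0 to 2
best = min(grid, key=lambda s: mse_beta1(s, h, omega0, omega1, c))

mse_mle = mse_beta1(0.0, h, omega0, omega1, c)    # sigma = 0: no penalty
mse_opt = mse_beta1(1.0 / h ** 2, h, omega0, omega1, c)
bound = 1.0 / (omega1 - c * c / (omega0 + 1.0 / h ** 2))
print(best, mse_mle, mse_opt, bound)
```

Under these assumptions the grid minimum sits at $\sigma = h^{-2}$ and the minimized value agrees with the bound in Proposition 4. The curve is quite flat to the right of the minimum, which is in line with the suggestion that a suboptimal but reasonable $\lambda_n$ can still work well.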

Table 1. Simulation results on outcome misclassification in the case-control design with misclassification probabilities 0.1r for cases and 0.2 for controls
Columns: Bias, Evar and Mvar for $\beta_1 = 1$ ($r \approx 10\%$); Bias, Evar and Mvar for $\beta_1 = 1$ ($r \approx 0.1\%$).
Rows (Methods): Logist; MLE; RMLE (L); PMLE (L); RMLE (U); PMLE (U).
[The numerical entries of the table are missing from this transcription.]
Bias, estimate minus truth; Evar, empirical estimate of the variance based on the parameter estimates; Mvar, average of the estimated variance; Logist, logistic regression applied directly to the case-control data; MLE, maximum likelihood estimator; RMLE, maximum likelihood estimator with $\beta_0$ fixed at 0.8 (L) or 1.2 (U) times the true value; PMLE, penalized maximum likelihood estimator with $m$ at 0.8 (L) or 1.2 (U) times the true value and $\lambda$ taking the optimal value.

4.2. Applications of the partially penalized likelihood approach
We now apply the partially penalized likelihood approach to outcome misclassification in a case-control design and to the estimation of genetic and environment effects in a case-control design with gene-environment independence in the general population. We demonstrate the applications using simulation.

For the outcome misclassification, the simulation model is the same as in Example 1 except that $X$ is absent. The covariate $Z$ follows the uniform distribution on $[-1, 1]$. The disease risk ratio is fixed at $\beta_1 = 1$. To evaluate the impact of different disease prevalence rates on the performance of the methods, we set $\beta_0 = -2.2$ or $-6.9$, which roughly corresponds to disease prevalence rates of $r = 10\%$ or $0.1\%$ at the exposure level $z = 0$. The misclassification rates are set to $q_0 = 10\% \times r$, which roughly corresponds to 10% of the cases in the case-control sample being false positives, and $q_1 = 20\%$, which corresponds to 20% of the disease cases in the population not being diagnosed. Each simulated dataset has 500 cases and 500 controls. The optimal value is set for $\lambda$.
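The correspondence between the intercepts and the quoted rates can be checked with a few lines of arithmetic. The sketch below is a rough check at $z = 0$ only, with a hypothetical helper name, and it reads $q_0$ as $\mathrm{pr}(U = 1 \mid W = 0)$ and $q_1$ as $\mathrm{pr}(U = 0 \mid W = 1)$. It converts each $\beta_0$ to a prevalence and computes the share of observed positives that are false positives.

```python
import math


def expit(t):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-t))


def false_positive_share(r, q0, q1):
    """Share of observed positives (U = 1) that are truly disease free,
    for prevalence r, q0 = pr(U = 1 | W = 0) and q1 = pr(U = 0 | W = 1)."""
    fp = q0 * (1.0 - r)      # disease free but classified as cases
    tp = (1.0 - q1) * r      # diseased and correctly classified
    return fp / (fp + tp)


shares = {}
for beta0 in (-2.2, -6.9):
    r = expit(beta0)                              # prevalence at z = 0
    shares[beta0] = false_positive_share(r, 0.10 * r, 0.20)
    print(f"beta0={beta0}: prevalence={r:.4f}, "
          f"false-positive share={shares[beta0]:.3f}")
```

Both shares come out near one tenth, in line with the description of $q_0$ as giving roughly 10% false positives among the sampled cases.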
In the analysis, the misclassification probabilities were assumed known and the guessed $\beta_0$ was set to 0.8 times or 1.2 times the true $\beta_0$. The simulation results based on 1000 replicates are listed in Table 1. It can be seen from the table that the maximum likelihood estimator of $\beta_1$ is subject to large bias, and that the variance estimates based on the inverse of the information matrix estimate the actual variance poorly. The partially penalized maximum likelihood estimator of $\beta_1$ has a small variance that can be well estimated. The improvement of the maximum penalized likelihood estimator over the restricted maximum likelihood estimator is relatively slight, which may be due to the limited information in the data on $\beta_0$.

For the study of gene-environment effects, we simulated a special case of Example 2. In the simulation, we set $G$ to be a binary variable with $\mathrm{pr}(G = 1) = 0.1$. The environment variable is uniformly distributed on $[0, 2]$. The intercept $\beta_0$ is set to $-2.2$ or $-6.9$. The relative risk parameters are set to $(0.5, 0.5, 1)$. Each simulated dataset has 500 cases and 500 controls. We computed the traditional prospective likelihood estimator based on $\mathrm{pr}(D \mid G, E, S = 1)$ and the penalized likelihood estimator with the guessed $\beta_0$ at either 0.8 times or 1.2 times the true $\beta_0$. The penalty $\lambda$ is set to the optimal values. Owing to the difficulty in computing the maximum likelihood estimator under the gene-environment independence assumption, we computed it only for the case $\beta_0 = -2.2$, using the penalized approach with $\lambda$ set to a very small value, 0.001. Simulation results based on 1000 replicates are listed in Table 2. From the table, we see that the partially penalized approach can substantially reduce the variance of the $\beta_1$ estimators at the cost of a low level of bias when compared with the traditional prospective analysis

and the maximum likelihood estimator. The restricted maximum likelihood can also perform well in comparison with the penalized likelihood approach when the guessed $\beta_0$ is far away from the truth. This is likely due to the small amount of information on $\beta_0$ contained in the data. When the guessed $\beta_0$ is close to the truth, such as in the case of $\beta_0 = -2.2$, the partially penalized approach shows a noticeable improvement over the restricted maximum likelihood approach.

Table 2. Simulation results on gene-environment effects in the case-control design with gene-environment independence in the general population
Columns: Bias, Evar and Mvar for each of Gene (0.5), Environment (0.5) and Interaction (1.0).
Rows (Methods), prevalence rate 10% ($\beta_0 = -2.2$): Logist; MLE; RMLE (L); PMLE (L); RMLE (U); PMLE (U).
Rows (Methods), prevalence rate 0.1% ($\beta_0 = -6.9$): Logist; RMLE (L); PMLE (L); RMLE (U); PMLE (U).
[The numerical entries of the table are missing from this transcription.]
Bias, estimate minus truth; Evar, empirical estimate of the variance based on the parameter estimates; Mvar, average of the estimated variance; Logist, logistic regression applied directly to the case-control data; MLE, maximum likelihood estimator calculated using PMLE with $m$ the true $\beta_0$ and $\lambda = 0.001$; RMLE, maximum likelihood estimator with $\beta_0$ fixed at 0.8 (L) or 1.2 (U) times the true value; PMLE, penalized maximum likelihood estimator with $m$ at 0.8 (L) or 1.2 (U) times the true value and $\lambda$ taking the optimal value.

SUPPLEMENTARY MATERIAL
Supplementary Material available at Biometrika online includes proofs of Propositions 2 and 4.

ACKNOWLEDGEMENT
I thank the editor, the associate editor and two referees for very helpful comments and suggestions. I also thank Dr Sally Freels for reading and correcting an earlier version of this paper. The research was partially supported by a grant from the National Science Foundation, U.S.A.

APPENDIX
Proof of Proposition 3.
When $G_k(y_k)$ is unconstrained and is variation independent of
$$\prod_{j=2}^{q} \eta_j\{y_j; (y_{j-1}, \ldots, y_1) \mid y_{q0}, \ldots, y_{(j+1)0}\} \prod_{j \neq k} dG_j(y_j),$$
the maximizer for $G_k$ has to be achieved with all probability mass concentrated on the observed $Y_k$ values when the likelihood is unconstrained. For a constrained likelihood, the maximum of the likelihood has to be less than or equal to the maximum of the corresponding unconstrained likelihood. If the maximizer for

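$G_k$ based on the unconstrained conditional likelihood,
$$\prod_{i=1}^{n} \frac{\prod_{j=2}^{q} \eta_j\{Y_j^i; (Y_{j-1}^i, \ldots, Y_1^i) \mid y_{q0}, \ldots, y_{(j+1)0}\} \prod_{j=1}^{q-1} dG_j(Y_j^i)}{\int \prod_{j=2}^{q} \eta_j\{y_j; (y_{j-1}, \ldots, y_1) \mid y_{q0}, \ldots, y_{(j+1)0}\} \prod_{j=1}^{q-1} dG_j(y_j)},$$
also satisfies the constraints, the maximizer for $G_k$ in the constrained conditional likelihood is achieved with all probability mass concentrated on the observed $Y_k$ values. Let $y_{jk}$ $(j = 1, \ldots, N_k)$ be the observed distinct $Y_k$ values, having frequencies of occurrence $n_{jk}$ $(j = 1, \ldots, N_k)$. Let
$$\eta_{jl} = \left[ \int \prod_{t=2}^{q} \eta_t\{y_t; (y_{t-1}, \ldots, y_1) \mid y_{q0}, \ldots, y_{(t+1)0}\} \prod_{t \neq k, q} dG_t(y_t) \right]_{(y_k = y_{jk},\, y_q = y_{lq})}.$$
The loglikelihood with the Lagrange multipliers can be written as
$$l(G_k, G_q, \mu_1, \ldots, \mu_{N_q}, \lambda_1, \lambda_2, \theta) = \sum_{i=1}^{n} \left[ \sum_{j=2}^{q} \log \eta_j\{Y_j^i; (Y_{j-1}^i, \ldots, Y_1^i) \mid y_{q0}, \ldots, y_{(j+1)0}\} + \sum_{j \neq k, q} \log dG_j(Y_j^i) \right]$$
$$+ \sum_{j=1}^{N_k} n_{jk} \log g_{jk} + \sum_{l=1}^{N_q} n_{lq} \log g_{lq} - n \log \left( \sum_{j=1}^{N_k} \sum_{l=1}^{N_q} \eta_{jl}\, g_{jk}\, g_{lq} \right)$$
$$+ \sum_{l=1}^{N_q} \mu_l \left( \frac{\sum_{j=1}^{N_k} \eta_{jl}\, g_{jk}\, g_{lq}}{\sum_{j=1}^{N_k} \sum_{l=1}^{N_q} \eta_{jl}\, g_{jk}\, g_{lq}} - \frac{n_{lq}}{n} \right) + \lambda_1 \left( \sum_{l=1}^{N_q} g_{lq} - 1 \right) + \lambda_2 \left( \sum_{j=1}^{N_k} g_{jk} - 1 \right),$$
where $\theta$ denotes all the parameters other than $G_k$ and $G_q$ in the likelihood, and $g_{jk}$ and $g_{lq}$ are the probability masses for $G_k$ and $G_q$, respectively. Note that
$$\frac{\partial l}{\partial g_{l_0 q}} = \frac{n_{l_0 q}}{g_{l_0 q}} - n \frac{\sum_{j} \eta_{j l_0}\, g_{jk}}{\sum_{j} \sum_{l} \eta_{jl}\, g_{jk}\, g_{lq}} + \lambda_1 + \mu_{l_0} \frac{\sum_{j} \eta_{j l_0}\, g_{jk}}{\sum_{j} \sum_{l} \eta_{jl}\, g_{jk}\, g_{lq}} - \sum_{l=1}^{N_q} \mu_l \frac{\left( \sum_{j} \eta_{jl}\, g_{jk}\, g_{lq} \right) \left( \sum_{j} \eta_{j l_0}\, g_{jk} \right)}{\left( \sum_{j} \sum_{l} \eta_{jl}\, g_{jk}\, g_{lq} \right)^2} = 0, \qquad (A1)$$
$$\frac{\partial l}{\partial g_{j_0 k}} = \frac{n_{j_0 k}}{g_{j_0 k}} - n \frac{\sum_{l} \eta_{j_0 l}\, g_{lq}}{\sum_{j} \sum_{l} \eta_{jl}\, g_{jk}\, g_{lq}} + \lambda_2 + \sum_{l=1}^{N_q} \mu_l \left\{ \frac{\eta_{j_0 l}\, g_{lq}}{\sum_{j} \sum_{l} \eta_{jl}\, g_{jk}\, g_{lq}} - \frac{\left( \sum_{j} \eta_{jl}\, g_{jk}\, g_{lq} \right) \left( \sum_{l'} \eta_{j_0 l'}\, g_{l' q} \right)}{\left( \sum_{j} \sum_{l} \eta_{jl}\, g_{jk}\, g_{lq} \right)^2} \right\} = 0, \qquad (A2)$$
$$\frac{\partial l}{\partial \mu_{l_0}} = \frac{\sum_{j} \eta_{j l_0}\, g_{jk}\, g_{l_0 q}}{\sum_{j} \sum_{l} \eta_{jl}\, g_{jk}\, g_{lq}} - \frac{n_{l_0 q}}{n} = 0, \qquad (A3)$$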

$$\frac{\partial l}{\partial \lambda_1} = \sum_{l=1}^{N_q} g_{lq} - 1 = 0, \qquad (A4)$$
$$\frac{\partial l}{\partial \lambda_2} = \sum_{j=1}^{N_k} g_{jk} - 1 = 0, \qquad (A5)$$
where $l_0 = 1, \ldots, N_q$ and $j_0 = 1, \ldots, N_k$. Multiplying both sides of (A1) by $g_{l_0 q}$ and summing over $l_0$, it can be seen from (A3) and (A4) that $\lambda_1 = 0$. Further, multiplying both sides of (A1) by $g_{l_0 q}$ and applying (A3), it can be seen that
$$\mu_{l_0} = \frac{1}{n} \sum_{l=1}^{N_q} \mu_l\, n_{lq}. \qquad (A6)$$
It follows from the arbitrariness of $l_0$ that $\mu_l$ is a constant. Next, multiplying both sides of (A2) by $g_{j_0 k}$ and summing over $j_0$, it can be seen from (A3) and (A5) that $\lambda_2 = 0$. Further, multiplying both sides of (A2) by $g_{j_0 k}$ and applying (A3) and (A6), it can be seen that
$$\frac{n_{j_0 k}}{g_{j_0 k}} = n \frac{\sum_{l} \eta_{j_0 l}\, g_{lq}}{\sum_{j} \sum_{l} \eta_{jl}\, g_{jk}\, g_{lq}}. \qquad (A7)$$
Next, if we do not impose constraints (5) and maximize the likelihood over $G_k$ and $G_q$ using the Lagrange multiplier method, we obtain score equations that are equivalent to the score equations when the constraints are imposed. This implies that the profile likelihoods for $\theta$ with or without the constraints are identical, because the profile likelihoods are the joint likelihood subject to the constraints defined by the score equations, which are the same in both cases. Finally, the score equations for maximizing the unconstrained joint likelihood with respect to $G_k$ are (A7). By applying (A7) to the joint likelihood, $G_k$ can be eliminated from the likelihood expression, and the resulting profile likelihood is exactly the conditional likelihood based on $\mathrm{pr}(Y_k \mid Y_{-k}, S = 1)$.

REFERENCES
ANDERSON, J. A. (1972). Separate sample logistic discrimination. Biometrika 59.
BRESLOW, N. E. (1976). Regression analysis of the log odds ratio: a method for retrospective studies. Biometrics 32.
BRESLOW, N. E. (1981). Odds ratio estimators when the data are sparse. Biometrika 68.
BRESLOW, N. E. (1996). Statistics in epidemiology: the case-control study. J. Am. Statist. Assoc. 91.
BRESLOW, N. E. & DAY, N. (1980). Statistical Methods in Cancer Research. Volume I: The Analysis of Case-control Studies.
IARC Scientific Publications. Lyon: IARC. CHATTERJEE, N. & CARROLL, R. J. (2005). Semiparametric maximum likelihood estimation exploiting geneenvironment independence in case-control studies. Biometrika 92, CHEN,H.Y.(2003). A note on prospective analysis of outcome-dependent samples. J. R. Statist. Soc. B 65, CHEN,H.Y.(2004). Nonparametric and semiparametric models for missing covariates in parametric regressions. J. Am. Statist. Assoc. 99, CHEN,H.Y.(2007). A semiparametric odds ratio model for measuring association. Biometrics 63, CHEN,H.Y.(2010). Compatibility of conditionally specified models. Statist. Prob. Lett. 80, COPELAND, K.T.,CHECKOWAY, H.,MCMICHAEL, A.J.&HOLBROOK, R.H.(1977). Bias due to misclassification in estimation of relative risk. Am. J. Epidemiol. 105, LEE,A.J.,MCMURCHY,L.&SCOTT,A.J.(1997). Re-using data from case-control studies. Statist. Med. 16, LIANG,K.Y.(1985). Odds ratio inference with dependent data. Biometrika 72, LIANG, K.Y.&QIN, J.(2000). Regression analysis under non-standard situations: a pairwise pseudolikelihood approach. J. R. Statist. Soc. B 62, NAGELKERKE, N.J.D.,MOSES, S.,PLUMMER, F.A.,BRUNHAM, R.C.&FISH, D.(1995). Logistic regression in case-control studies: the effect of using indepedent variables. Statist. Med. 14, OSIUS,G.(2005). The association between two random elements: a complete characterization and odds ratio models. Metrika 60,

13 Biased sampling designs 13 PIEGORSCH,W.W.,WEINBERG,C.R.&TAYLOR,J.A.(1994). Non-hierarchical logistic models and case-only designs for assessing susceptibility in population based case-control studies. Statist. Med. 13, PRENTICE, R.L.&PYKE, R.(1979). Logistic disease incidence models and case-control studies. Biometrika 66, RABINOWITZ,D.(1997). A note on efficient estimation from case-control data. Biometrika 84, SATTEN, G.A.&CARROLL, R.J.(2000). Conditional and unconditional categorical regression models with missing covariates. Biometrics 56, SCOTT,A.J.&WILD,C.J.(1997). Fitting regression models to case-control data by maximum likelihood. Biometrika 84, UMBACH, D.M.&WEINBERG, C.M.(1997). Designing and analyzing case-control studies to exploit independence of genotype and exposure. Statist. Med. 16, WEINBERG, C.R.&WACHOLDER, S.(1993). Prospective analysis of case-control data under general multiplicativeintercept risk models. Biometrika 80, [Received April Revised May 2010]
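
The Appendix argument, that the unconstrained score equations for the probability masses already imply the sampling constraints, can be checked numerically on a toy example. The sketch below is an illustration only and not part of the paper: it treats the $\eta_{jl}$ as known positive numbers (that is, $\theta$ is held fixed), the toy values and the function names `fit_masses` and `sample_margins` are hypothetical, and the alternating update is ordinary iterative proportional fitting used here for convenience rather than the author's algorithm. At convergence, the column margins of the normalized cells $\eta_{jl} g_{jk} g_{lq}$ should equal $n_{lq}/n$, which is (A3), and the row margins should equal $n_{jk}/n$, which is a rearrangement of (A7).

```python
def fit_masses(eta, n_k, n_q, iters=200):
    """Alternately solve the unconstrained score equations for g_k and g_q.

    eta : N_k x N_q nested list of positive eta_{jl} values (assumed known).
    n_k : frequencies n_{jk} of the distinct Y_k values.
    n_q : frequencies n_{lq} of the distinct Y_q values.
    """
    N_k, N_q = len(n_k), len(n_q)
    g_k = [1.0 / N_k] * N_k
    g_q = [1.0 / N_q] * N_q
    for _ in range(iters):
        # Score equation in g_k for fixed g_q: g_jk proportional to
        # n_jk / sum_l eta_jl g_lq, then rescaled to sum to one.
        g_k = [n_k[j] / sum(eta[j][l] * g_q[l] for l in range(N_q))
               for j in range(N_k)]
        s = sum(g_k)
        g_k = [g / s for g in g_k]
        # Same step for g_q with g_k fixed.
        g_q = [n_q[l] / sum(eta[j][l] * g_k[j] for j in range(N_k))
               for l in range(N_q)]
        s = sum(g_q)
        g_q = [g / s for g in g_q]
    return g_k, g_q


def sample_margins(eta, g_k, g_q):
    """Row and column margins of the normalized cells eta_jl * g_jk * g_lq."""
    cells = [[eta[j][l] * g_k[j] * g_q[l] for l in range(len(g_q))]
             for j in range(len(g_k))]
    total = sum(sum(row) for row in cells)
    rows = [sum(row) / total for row in cells]
    cols = [sum(row[l] for row in cells) / total for l in range(len(g_q))]
    return rows, cols


# Toy inputs (hypothetical): 3 distinct Y_k values, 2 distinct Y_q values.
eta = [[1.0, 2.0], [1.5, 0.5], [0.8, 1.2]]
n_k = [30.0, 50.0, 20.0]
n_q = [60.0, 40.0]
n = sum(n_k)

g_k, g_q = fit_masses(eta, n_k, n_q)
rows, cols = sample_margins(eta, g_k, g_q)
row_gap = max(abs(r - c / n) for r, c in zip(rows, n_k))  # check of (A7)
col_gap = max(abs(r - c / n) for r, c in zip(cols, n_q))  # check of (A3)
print("max row-margin gap:", row_gap)
print("max column-margin gap:", col_gap)
```

Both gaps should be negligible, which is consistent with the multipliers vanishing: no constraint had to be imposed for the fitted masses to reproduce the observed margins of $Y_k$ and $Y_q$.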

Previous lecture. P-value based combination. Fixed vs random effects models. Meta vs. pooled- analysis. New random effects testing.

Previous lecture. P-value based combination. Fixed vs random effects models. Meta vs. pooled- analysis. New random effects testing. Previous lecture P-value based combination. Fixed vs random effects models. Meta vs. pooled- analysis. New random effects testing. Interaction Outline: Definition of interaction Additive versus multiplicative

More information

Nuisance parameter elimination for proportional likelihood ratio models with nonignorable missingness and random truncation

Nuisance parameter elimination for proportional likelihood ratio models with nonignorable missingness and random truncation Biometrika Advance Access published October 24, 202 Biometrika (202), pp. 8 C 202 Biometrika rust Printed in Great Britain doi: 0.093/biomet/ass056 Nuisance parameter elimination for proportional likelihood

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 2008 Paper 241 A Note on Risk Prediction for Case-Control Studies Sherri Rose Mark J. van der Laan Division

More information

Constrained Maximum Likelihood Estimation for Model Calibration Using Summary-level Information from External Big Data Sources

Constrained Maximum Likelihood Estimation for Model Calibration Using Summary-level Information from External Big Data Sources Constrained Maximum Likelihood Estimation for Model Calibration Using Summary-level Information from External Big Data Sources Yi-Hau Chen Institute of Statistical Science, Academia Sinica Joint with Nilanjan

More information

A note on profile likelihood for exponential tilt mixture models

A note on profile likelihood for exponential tilt mixture models Biometrika (2009), 96, 1,pp. 229 236 C 2009 Biometrika Trust Printed in Great Britain doi: 10.1093/biomet/asn059 Advance Access publication 22 January 2009 A note on profile likelihood for exponential

More information

Compatibility of conditionally specified models

Compatibility of conditionally specified models Compatibility of conditionally specified models Hua Yun Chen Division of epidemiology & Biostatistics School of Public Health University of Illinois at Chicago 1603 West Taylor Street, Chicago, IL 60612

More information

Pairwise rank based likelihood for estimating the relationship between two homogeneous populations and their mixture proportion

Pairwise rank based likelihood for estimating the relationship between two homogeneous populations and their mixture proportion Pairwise rank based likelihood for estimating the relationship between two homogeneous populations and their mixture proportion Glenn Heller and Jing Qin Department of Epidemiology and Biostatistics Memorial

More information

Efficient designs of gene environment interaction studies: implications of Hardy Weinberg equilibrium and gene environment independence

Efficient designs of gene environment interaction studies: implications of Hardy Weinberg equilibrium and gene environment independence Special Issue Paper Received 7 January 20, Accepted 28 September 20 Published online 24 February 202 in Wiley Online Library (wileyonlinelibrary.com) DOI: 0.002/sim.4460 Efficient designs of gene environment

More information

Missing Covariate Data in Matched Case-Control Studies

Missing Covariate Data in Matched Case-Control Studies Missing Covariate Data in Matched Case-Control Studies Department of Statistics North Carolina State University Paul Rathouz Dept. of Health Studies U. of Chicago prathouz@health.bsd.uchicago.edu with

More information

Semiparametric maximum likelihood estimation exploiting gene-environment independence in case-control studies

Semiparametric maximum likelihood estimation exploiting gene-environment independence in case-control studies Biometrika (2005), 92, 2, pp. 399 418 2005 Biometrika Trust Printed in Great Britain Semiparametric maximum likelihood estimation exploiting gene-environment independence in case-control studies BY NILANJAN

More information

STAT331. Cox s Proportional Hazards Model

STAT331. Cox s Proportional Hazards Model STAT331 Cox s Proportional Hazards Model In this unit we introduce Cox s proportional hazards (Cox s PH) model, give a heuristic development of the partial likelihood function, and discuss adaptations

More information

Asymptotic equivalence of paired Hotelling test and conditional logistic regression

Asymptotic equivalence of paired Hotelling test and conditional logistic regression Asymptotic equivalence of paired Hotelling test and conditional logistic regression Félix Balazard 1,2 arxiv:1610.06774v1 [math.st] 21 Oct 2016 Abstract 1 Sorbonne Universités, UPMC Univ Paris 06, CNRS

More information

Survival Analysis for Case-Cohort Studies

Survival Analysis for Case-Cohort Studies Survival Analysis for ase-ohort Studies Petr Klášterecký Dept. of Probability and Mathematical Statistics, Faculty of Mathematics and Physics, harles University, Prague, zech Republic e-mail: petr.klasterecky@matfyz.cz

More information

STAT 5500/6500 Conditional Logistic Regression for Matched Pairs

STAT 5500/6500 Conditional Logistic Regression for Matched Pairs STAT 5500/6500 Conditional Logistic Regression for Matched Pairs Motivating Example: The data we will be using comes from a subset of data taken from the Los Angeles Study of the Endometrial Cancer Data

More information

Ignoring the matching variables in cohort studies - when is it valid, and why?

Ignoring the matching variables in cohort studies - when is it valid, and why? Ignoring the matching variables in cohort studies - when is it valid, and why? Arvid Sjölander Abstract In observational studies of the effect of an exposure on an outcome, the exposure-outcome association

More information

STAT 5500/6500 Conditional Logistic Regression for Matched Pairs

STAT 5500/6500 Conditional Logistic Regression for Matched Pairs STAT 5500/6500 Conditional Logistic Regression for Matched Pairs The data for the tutorial came from support.sas.com, The LOGISTIC Procedure: Conditional Logistic Regression for Matched Pairs Data :: SAS/STAT(R)

More information

Combining multiple observational data sources to estimate causal eects

Combining multiple observational data sources to estimate causal eects Department of Statistics, North Carolina State University Combining multiple observational data sources to estimate causal eects Shu Yang* syang24@ncsuedu Joint work with Peng Ding UC Berkeley May 23,

More information

Shrinkage Estimators for Robust and Efficient Inference in Haplotype-Based Case-Control Studies. Abstract

Shrinkage Estimators for Robust and Efficient Inference in Haplotype-Based Case-Control Studies. Abstract Shrinkage Estimators for Robust and Efficient Inference in Haplotype-Based Case-Control Studies YI-HAU CHEN Institute of Statistical Science, Academia Sinica, Taipei 11529, Taiwan R.O.C. yhchen@stat.sinica.edu.tw

More information

Constrained Maximum Likelihood Estimation for Model Calibration Using Summary-Level. Information From External Big Data Sources

Constrained Maximum Likelihood Estimation for Model Calibration Using Summary-Level. Information From External Big Data Sources Journal of the American Statistical Association ISSN: 0162-1459 (Print) 1537-274X (Online) Journal homepage: http://www.tandfonline.com/loi/uasa20 Constrained Maximum Likelihood Estimation for Model Calibration

More information

Misclassification in Logistic Regression with Discrete Covariates

Misclassification in Logistic Regression with Discrete Covariates Biometrical Journal 45 (2003) 5, 541 553 Misclassification in Logistic Regression with Discrete Covariates Ori Davidov*, David Faraggi and Benjamin Reiser Department of Statistics, University of Haifa,

More information

Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices

Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices Article Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices Fei Jin 1,2 and Lung-fei Lee 3, * 1 School of Economics, Shanghai University of Finance and Economics,

More information

1 Introduction A common problem in categorical data analysis is to determine the effect of explanatory variables V on a binary outcome D of interest.

1 Introduction A common problem in categorical data analysis is to determine the effect of explanatory variables V on a binary outcome D of interest. Conditional and Unconditional Categorical Regression Models with Missing Covariates Glen A. Satten and Raymond J. Carroll Λ December 4, 1999 Abstract We consider methods for analyzing categorical regression

More information

Harvard University. A Note on the Control Function Approach with an Instrumental Variable and a Binary Outcome. Eric Tchetgen Tchetgen

Harvard University. A Note on the Control Function Approach with an Instrumental Variable and a Binary Outcome. Eric Tchetgen Tchetgen Harvard University Harvard University Biostatistics Working Paper Series Year 2014 Paper 175 A Note on the Control Function Approach with an Instrumental Variable and a Binary Outcome Eric Tchetgen Tchetgen

More information

Equivalence of random-effects and conditional likelihoods for matched case-control studies

Equivalence of random-effects and conditional likelihoods for matched case-control studies Equivalence of random-effects and conditional likelihoods for matched case-control studies Ken Rice MRC Biostatistics Unit, Cambridge, UK January 8 th 4 Motivation Study of genetic c-erbb- exposure and

More information

Modification and Improvement of Empirical Likelihood for Missing Response Problem

Modification and Improvement of Empirical Likelihood for Missing Response Problem UW Biostatistics Working Paper Series 12-30-2010 Modification and Improvement of Empirical Likelihood for Missing Response Problem Kwun Chuen Gary Chan University of Washington - Seattle Campus, kcgchan@u.washington.edu

More information

Analysis of Matched Case Control Data in Presence of Nonignorable Missing Exposure

Analysis of Matched Case Control Data in Presence of Nonignorable Missing Exposure Biometrics DOI: 101111/j1541-0420200700828x Analysis of Matched Case Control Data in Presence of Nonignorable Missing Exposure Samiran Sinha 1, and Tapabrata Maiti 2, 1 Department of Statistics, Texas

More information

,..., θ(2),..., θ(n)

,..., θ(2),..., θ(n) Likelihoods for Multivariate Binary Data Log-Linear Model We have 2 n 1 distinct probabilities, but we wish to consider formulations that allow more parsimonious descriptions as a function of covariates.

More information

TESTS FOR EQUIVALENCE BASED ON ODDS RATIO FOR MATCHED-PAIR DESIGN

TESTS FOR EQUIVALENCE BASED ON ODDS RATIO FOR MATCHED-PAIR DESIGN Journal of Biopharmaceutical Statistics, 15: 889 901, 2005 Copyright Taylor & Francis, Inc. ISSN: 1054-3406 print/1520-5711 online DOI: 10.1080/10543400500265561 TESTS FOR EQUIVALENCE BASED ON ODDS RATIO

More information

Lecture 7: Interaction Analysis. Summer Institute in Statistical Genetics 2017

Lecture 7: Interaction Analysis. Summer Institute in Statistical Genetics 2017 Lecture 7: Interaction Analysis Timothy Thornton and Michael Wu Summer Institute in Statistical Genetics 2017 1 / 39 Lecture Outline Beyond main SNP effects Introduction to Concept of Statistical Interaction

More information

Power and Sample Size Calculations with the Additive Hazards Model

Power and Sample Size Calculations with the Additive Hazards Model Journal of Data Science 10(2012), 143-155 Power and Sample Size Calculations with the Additive Hazards Model Ling Chen, Chengjie Xiong, J. Philip Miller and Feng Gao Washington University School of Medicine

More information

Goodness-of-Fit Tests for the Ordinal Response Models with Misspecified Links

Goodness-of-Fit Tests for the Ordinal Response Models with Misspecified Links Communications of the Korean Statistical Society 2009, Vol 16, No 4, 697 705 Goodness-of-Fit Tests for the Ordinal Response Models with Misspecified Links Kwang Mo Jeong a, Hyun Yung Lee 1, a a Department

More information

A class of latent marginal models for capture-recapture data with continuous covariates

A class of latent marginal models for capture-recapture data with continuous covariates A class of latent marginal models for capture-recapture data with continuous covariates F Bartolucci A Forcina Università di Urbino Università di Perugia FrancescoBartolucci@uniurbit forcina@statunipgit

More information

Fair Inference Through Semiparametric-Efficient Estimation Over Constraint-Specific Paths

Fair Inference Through Semiparametric-Efficient Estimation Over Constraint-Specific Paths Fair Inference Through Semiparametric-Efficient Estimation Over Constraint-Specific Paths for New Developments in Nonparametric and Semiparametric Statistics, Joint Statistical Meetings; Vancouver, BC,

More information

Default Priors and Effcient Posterior Computation in Bayesian

Default Priors and Effcient Posterior Computation in Bayesian Default Priors and Effcient Posterior Computation in Bayesian Factor Analysis January 16, 2010 Presented by Eric Wang, Duke University Background and Motivation A Brief Review of Parameter Expansion Literature

More information

Introduction to mtm: An R Package for Marginalized Transition Models

Introduction to mtm: An R Package for Marginalized Transition Models Introduction to mtm: An R Package for Marginalized Transition Models Bryan A. Comstock and Patrick J. Heagerty Department of Biostatistics University of Washington 1 Introduction Marginalized transition

More information

Ann Arbor, Michigan 48109, U.S.A. 2 Division of Cancer Epidemiology and Genetics, National Cancer Institute,

Ann Arbor, Michigan 48109, U.S.A. 2 Division of Cancer Epidemiology and Genetics, National Cancer Institute, Biometrics 64, 685 694 September 2008 DOI: 10.1111/j.1541-0420.2007.00953.x Exploiting Gene-Environment Independence for Analysis of Case Control Studies: An Empirical Bayes-Type Shrinkage Estimator to

More information

Sample size calculations for logistic and Poisson regression models

Sample size calculations for logistic and Poisson regression models Biometrika (2), 88, 4, pp. 93 99 2 Biometrika Trust Printed in Great Britain Sample size calculations for logistic and Poisson regression models BY GWOWEN SHIEH Department of Management Science, National

More information

Power and sample size calculations for designing rare variant sequencing association studies.

Power and sample size calculations for designing rare variant sequencing association studies. Power and sample size calculations for designing rare variant sequencing association studies. Seunggeun Lee 1, Michael C. Wu 2, Tianxi Cai 1, Yun Li 2,3, Michael Boehnke 4 and Xihong Lin 1 1 Department

More information

PENALIZED LIKELIHOOD PARAMETER ESTIMATION FOR ADDITIVE HAZARD MODELS WITH INTERVAL CENSORED DATA

PENALIZED LIKELIHOOD PARAMETER ESTIMATION FOR ADDITIVE HAZARD MODELS WITH INTERVAL CENSORED DATA PENALIZED LIKELIHOOD PARAMETER ESTIMATION FOR ADDITIVE HAZARD MODELS WITH INTERVAL CENSORED DATA Kasun Rathnayake ; A/Prof Jun Ma Department of Statistics Faculty of Science and Engineering Macquarie University

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Statistical inference on the penetrances of rare genetic mutations based on a case family design

Statistical inference on the penetrances of rare genetic mutations based on a case family design Biostatistics (2010), 11, 3, pp. 519 532 doi:10.1093/biostatistics/kxq009 Advance Access publication on February 23, 2010 Statistical inference on the penetrances of rare genetic mutations based on a case

More information

Sensitivity analysis and distributional assumptions

Sensitivity analysis and distributional assumptions Sensitivity analysis and distributional assumptions Tyler J. VanderWeele Department of Health Studies, University of Chicago 5841 South Maryland Avenue, MC 2007, Chicago, IL 60637, USA vanderweele@uchicago.edu

More information

Figure 36: Respiratory infection versus time for the first 49 children.

Figure 36: Respiratory infection versus time for the first 49 children. y BINARY DATA MODELS We devote an entire chapter to binary data since such data are challenging, both in terms of modeling the dependence, and parameter interpretation. We again consider mixed effects

More information

Model Selection in GLMs. (should be able to implement frequentist GLM analyses!) Today: standard frequentist methods for model selection

Model Selection in GLMs. (should be able to implement frequentist GLM analyses!) Today: standard frequentist methods for model selection Model Selection in GLMs Last class: estimability/identifiability, analysis of deviance, standard errors & confidence intervals (should be able to implement frequentist GLM analyses!) Today: standard frequentist

More information

Matched-Pair Case-Control Studies when Risk Factors are Correlated within the Pairs

Matched-Pair Case-Control Studies when Risk Factors are Correlated within the Pairs International Journal of Epidemiology O International Epidemlologlcal Association 1996 Vol. 25. No. 2 Printed In Great Britain Matched-Pair Case-Control Studies when Risk Factors are Correlated within

More information

Lecture 5: LDA and Logistic Regression

Lecture 5: LDA and Logistic Regression Lecture 5: and Logistic Regression Hao Helen Zhang Hao Helen Zhang Lecture 5: and Logistic Regression 1 / 39 Outline Linear Classification Methods Two Popular Linear Models for Classification Linear Discriminant

More information

Simple Sensitivity Analysis for Differential Measurement Error. By Tyler J. VanderWeele and Yige Li Harvard University, Cambridge, MA, U.S.A.

Simple Sensitivity Analysis for Differential Measurement Error. By Tyler J. VanderWeele and Yige Li Harvard University, Cambridge, MA, U.S.A. Simple Sensitivity Analysis for Differential Measurement Error By Tyler J. VanderWeele and Yige Li Harvard University, Cambridge, MA, U.S.A. Abstract Simple sensitivity analysis results are given for differential

More information

Additive and multiplicative models for the joint effect of two risk factors

Additive and multiplicative models for the joint effect of two risk factors Biostatistics (2005), 6, 1,pp. 1 9 doi: 10.1093/biostatistics/kxh024 Additive and multiplicative models for the joint effect of two risk factors A. BERRINGTON DE GONZÁLEZ Cancer Research UK Epidemiology

More information

A note on L convergence of Neumann series approximation in missing data problems

A note on L convergence of Neumann series approximation in missing data problems A note on L convergence of Neumann series approximation in missing data problems Hua Yun Chen Division of Epidemiology & Biostatistics School of Public Health University of Illinois at Chicago 1603 West

More information

Linear Methods for Prediction

Linear Methods for Prediction Chapter 5 Linear Methods for Prediction 5.1 Introduction We now revisit the classification problem and focus on linear methods. Since our prediction Ĝ(x) will always take values in the discrete set G we

More information

Lecture 2: Poisson and logistic regression

Lecture 2: Poisson and logistic regression Dankmar Böhning Southampton Statistical Sciences Research Institute University of Southampton, UK S 3 RI, 11-12 December 2014 introduction to Poisson regression application to the BELCAP study introduction

More information

Regression analysis of biased case control data

Regression analysis of biased case control data Ann Inst Stat Math (2016) 68:805 825 DOI 10.1007/s10463-015-0511-3 Regression analysis of biased case control data Palash Ghosh Anup Dewani Received: 4 September 2013 / Revised: 2 February 2015 / Published

More information

A Robust Test for Two-Stage Design in Genome-Wide Association Studies

A Robust Test for Two-Stage Design in Genome-Wide Association Studies Biometrics Supplementary Materials A Robust Test for Two-Stage Design in Genome-Wide Association Studies Minjung Kwak, Jungnam Joo and Gang Zheng Appendix A: Calculations of the thresholds D 1 and D The

More information

Repeated ordinal measurements: a generalised estimating equation approach

Repeated ordinal measurements: a generalised estimating equation approach Repeated ordinal measurements: a generalised estimating equation approach David Clayton MRC Biostatistics Unit 5, Shaftesbury Road Cambridge CB2 2BW April 7, 1992 Abstract Cumulative logit and related

More information

Fitting regression models to case-control data by maximum likelihood

Fitting regression models to case-control data by maximum likelihood Biometrika (1997), 84,1, pp. 57-71 Printed in Great Britain Fitting regression models to case-control data by maximum likelihood BY A. J. SCOTT AND C. J. WILD Department of Statistics, University of Auckland,

More information

Modeling the scale parameter ϕ A note on modeling correlation of binary responses Using marginal odds ratios to model association for binary responses

Modeling the scale parameter ϕ A note on modeling correlation of binary responses Using marginal odds ratios to model association for binary responses Outline Marginal model Examples of marginal model GEE1 Augmented GEE GEE1.5 GEE2 Modeling the scale parameter ϕ A note on modeling correlation of binary responses Using marginal odds ratios to model association

More information

Marginal versus conditional effects: does it make a difference? Mireille Schnitzer, PhD Université de Montréal

Marginal versus conditional effects: does it make a difference? Mireille Schnitzer, PhD Université de Montréal Marginal versus conditional effects: does it make a difference? Mireille Schnitzer, PhD Université de Montréal Overview In observational and experimental studies, the goal may be to estimate the effect

More information

Exact McNemar s Test and Matching Confidence Intervals Michael P. Fay April 25,

Exact McNemar s Test and Matching Confidence Intervals Michael P. Fay April 25, Exact McNemar s Test and Matching Confidence Intervals Michael P. Fay April 25, 2016 1 McNemar s Original Test Consider paired binary response data. For example, suppose you have twins randomized to two

More information

Propensity Score Weighting with Multilevel Data

Propensity Score Weighting with Multilevel Data Propensity Score Weighting with Multilevel Data Fan Li Department of Statistical Science Duke University October 25, 2012 Joint work with Alan Zaslavsky and Mary Beth Landrum Introduction In comparative

More information

Estimating and contextualizing the attenuation of odds ratios due to non-collapsibility

Estimating and contextualizing the attenuation of odds ratios due to non-collapsibility Estimating and contextualizing the attenuation of odds ratios due to non-collapsibility Stephen Burgess Department of Public Health & Primary Care, University of Cambridge September 6, 014 Short title:

More information

Building a Prognostic Biomarker

Building a Prognostic Biomarker Building a Prognostic Biomarker Noah Simon and Richard Simon July 2016 1 / 44 Prognostic Biomarker for a Continuous Measure On each of n patients measure y i - single continuous outcome (eg. blood pressure,

More information

DIAGNOSTICS FOR STRATIFIED CLINICAL TRIALS IN PROPORTIONAL ODDS MODELS

DIAGNOSTICS FOR STRATIFIED CLINICAL TRIALS IN PROPORTIONAL ODDS MODELS DIAGNOSTICS FOR STRATIFIED CLINICAL TRIALS IN PROPORTIONAL ODDS MODELS Ivy Liu and Dong Q. Wang School of Mathematics, Statistics and Computer Science Victoria University of Wellington New Zealand Corresponding

More information

Biostatistics Advanced Methods in Biostatistics IV

Biostatistics Advanced Methods in Biostatistics IV Biostatistics 140.754 Advanced Methods in Biostatistics IV Jeffrey Leek Assistant Professor Department of Biostatistics jleek@jhsph.edu 1 / 35 Tip + Paper Tip Meet with seminar speakers. When you go on

More information

Multivariate Survival Analysis

Multivariate Survival Analysis Multivariate Survival Analysis Previously we have assumed that either (X i, δ i ) or (X i, δ i, Z i ), i = 1,..., n, are i.i.d.. This may not always be the case. Multivariate survival data can arise in

More information

STA 216, GLM, Lecture 16. October 29, 2007

STA 216, GLM, Lecture 16. October 29, 2007 STA 216, GLM, Lecture 16 October 29, 2007 Efficient Posterior Computation in Factor Models Underlying Normal Models Generalized Latent Trait Models Formulation Genetic Epidemiology Illustration Structural

More information

Normal distribution We have a random sample from N(m, υ). The sample mean is Ȳ and the corrected sum of squares is S yy. After some simplification,

Normal distribution We have a random sample from N(m, υ). The sample mean is Ȳ and the corrected sum of squares is S yy. After some simplification, Likelihood Let P (D H) be the probability an experiment produces data D, given hypothesis H. Usually H is regarded as fixed and D variable. Before the experiment, the data D are unknown, and the probability

More information

Bi-level feature selection with applications to genetic association

Bi-level feature selection with applications to genetic association Bi-level feature selection with applications to genetic association studies October 15, 2008 Motivation In many applications, biological features possess a grouping structure Categorical variables may

More information

Estimating the Marginal Odds Ratio in Observational Studies

Estimating the Marginal Odds Ratio in Observational Studies Estimating the Marginal Odds Ratio in Observational Studies Travis Loux Christiana Drake Department of Statistics University of California, Davis June 20, 2011 Outline The Counterfactual Model Odds Ratios

More information

Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach

Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach Jae-Kwang Kim Department of Statistics, Iowa State University Outline 1 Introduction 2 Observed likelihood 3 Mean Score

More information

Lecture 5: Poisson and logistic regression

Lecture 5: Poisson and logistic regression Dankmar Böhning Southampton Statistical Sciences Research Institute University of Southampton, UK S 3 RI, 3-5 March 2014 introduction to Poisson regression application to the BELCAP study introduction

More information

Discussion of Missing Data Methods in Longitudinal Studies: A Review by Ibrahim and Molenberghs

Discussion of Missing Data Methods in Longitudinal Studies: A Review by Ibrahim and Molenberghs Discussion of Missing Data Methods in Longitudinal Studies: A Review by Ibrahim and Molenberghs Michael J. Daniels and Chenguang Wang Jan. 18, 2009 First, we would like to thank Joe and Geert for a carefully

More information

Chapter 22: Log-linear regression for Poisson counts

Chapter 22: Log-linear regression for Poisson counts Chapter 22: Log-linear regression for Poisson counts Exposure to ionizing radiation is recognized as a cancer risk. In the United States, EPA sets guidelines specifying upper limits on the amount of exposure

More information

Tests of independence for censored bivariate failure time data

Tests of independence for censored bivariate failure time data Tests of independence for censored bivariate failure time data Abstract Bivariate failure time data is widely used in survival analysis, for example, in twins study. This article presents a class of χ

More information
