Statistical inference on the penetrances of rare genetic mutations based on a case-family design
Biostatistics (2010), 11, 3, doi: /biostatistics/kxq009
Advance Access publication on February 23, 2010

HONG ZHANG
Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD 20892, USA and Department of Statistics and Finance, University of Science and Technology of China, Hefei, Anhui, People's Republic of China

SYLVIANE OLSCHWANG
Institut National de la Santé et de la Recherche Médicale (INSERM), Unité 891, Centre de Recherches en Cancérologie de Marseille, Marseille, France and Department of Oncogenetics, Institut Paoli-Calmettes, Marseille, France

KAI YU*
Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD 20892, USA
yuka@mail.nih.gov

*To whom correspondence should be addressed.

© The Author. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org.

SUMMARY

We propose a formal statistical inference framework for the evaluation of the penetrance of a rare genetic mutation using family data generated under a kin-cohort type of design, where phenotype and genotype information from first-degree relatives (sibs and/or offspring) of case probands carrying the targeted mutation is collected. Our approach is built upon a likelihood model with some minor assumptions, and it can be used for age-dependent penetrance estimation that permits adjustment for covariates. Furthermore, the derived likelihood allows unobserved risk factors that are correlated within family members. The validity of the approach is confirmed by simulation studies. We apply the proposed approach to estimating the age-dependent cancer risk among carriers of the MSH2 or MLH1 mutation.

Keywords: Case-family design; Penetrance; Proportional hazards model; Rare mutation; Unobserved risk factors.

1. INTRODUCTION

An increasing number of mutations have been found to be associated with an elevated risk for various genetic disorders. A precise estimation of the age-dependent risk for people carrying the disease-causing mutations is essential for defining prevention strategies and understanding the underlying mechanisms of the diseases. When a disease-causing mutation is identified, a precise estimation of its penetrance is possible using the kin-cohort design (Wacholder and others, 1998), which has been studied extensively in the literature, for example, Gail, Pee, Benichou, and Carroll (1999), Chatterjee and Wacholder (2001), Chatterjee and others (2006), Wang and others (2007), among others. Gail, Pee, and Carroll (1999) studied the advantages and disadvantages of the kin-cohort design. They found that the kin-cohort design has several practical advantages, including comparatively rapid execution, modest reductions in required sample sizes compared with cohort or case-control designs, and the ability to study the effects of an autosomal dominant mutation on several disease outcomes. The disadvantages include 2 sources of bias, arising when a proband's decision to participate is influenced by the disease status of his or her relatives and when the proband is unable to recall the disease histories of relatives accurately.

In a standard kin-cohort design, a volunteer (either affected or unaffected) agrees to be genotyped, and the phenotype information on the disease histories of his or her first-degree relatives is obtained through a questionnaire. When the information on both phenotype and genotype for relatives is available, alternative approaches are needed in order to take full advantage of all available data while correcting for bias due to the effects of ascertainment. In this paper, we assume that all probands are affected carriers, though the proposed approach can be extended to include unaffected probands carrying the mutation.

Recently, Wang and others (2006) proposed a nonparametric method for estimating the penetrance of a rare mutation. Olschwang and others (2009) proposed an alternative parametric logistic regression model. Both approaches rely on the assumption that the penetrance of noncarriers is zero. This assumption might not hold for many genetic diseases, and the penetrance estimate could be severely biased if it is violated in real applications.
In this paper, we focus on rare mutations and aim at developing a rigorous statistical inference framework for such a case-family design. The main difference between this design and the standard kin-cohort design is that the former collects information on both phenotypes and genotypes of the probands' relatives, while the latter simply collects the phenotypes of the relatives through a questionnaire. The assumption of zero penetrance for the noncarriers is not required for our approach. Furthermore, the proposed approach is based on a likelihood model conditioned on the phenotypes of all individuals; therefore, the derived estimate should not suffer from the biases mentioned for the kin-cohort design. Covariates such as gender and ethnicity can be incorporated easily in our approach. Multiple rare mutations can also be handled in the context of the proposed conditional likelihood framework. The derivation of the conditional likelihood functions requires only minor assumptions. The maximum likelihood estimates (MLEs) can be obtained through standard optimization algorithms available in mathematical/statistical software. Statistical inferences, such as constructing confidence intervals and testing hypotheses for the parameters characterizing the penetrance, can be performed based on standard large-sample theory. The performance of the proposed approach is examined through simulation studies, which illustrate the desired properties of the approach. Finally, we demonstrate the proposed approach by applying it to a study of Lynch syndrome.

2. AGE-INDEPENDENT PENETRANCES

2.1 Notation

Throughout this paper, the mutations responsible for the disease of interest are assumed to be on an autosome.
In the case-family design considered, some unrelated affected individuals (cases) collected from a case-control study are genotyped, and those cases carrying the study mutation are termed "case probands"; the first-degree relatives (sibs and/or offspring) of the case probands are interviewed for phenotyping and genotyping. To motivate our approach, we first focus on congenital or early-onset diseases that manifest before the ages at which subjects are ascertained. We want to estimate the age-independent penetrance of a known disease-causing mutation. Suppose some case probands are ascertained, and several first-degree
relatives of each case proband are then collected for genotyping at the disease locus. Throughout this paper, we assume that the disease-causing mutation (allele $m$, a mutation of the wild-type allele $M$) is rare. Because the mutation is so rare, the homozygous genotype $mm$ is seldom seen under Hardy-Weinberg equilibrium, so we consider only 2 genotypes, namely $Mm$ (mutation, denoted by $g = 1$) and $MM$ (nonmutation, denoted by $g = 0$). Let the disease penetrances of $MM$ and $Mm$ be $f_0$ and $f_1$, respectively. Let $d$ denote the disease status of an individual, taking value 1 if affected and 0 otherwise.

2.2 Likelihood function

Suppose $I$ unrelated case probands carrying the mutation are ascertained. To derive the likelihood function of the observed data, we need to make the following assumptions:

(i) The study mutation is rare.
(ii) Hardy-Weinberg equilibrium holds for the corresponding allele, mating is random, and the Mendelian inheritance law holds.
(iii) The study mutation is independent of the unobserved risk factors.
(iv) The disease is rare.
(v) There is no interaction effect between the study mutation and the unobserved risk factors. That is, the joint disease penetrance satisfies the following relationship:

\[
P(d = 1 \mid g = 1, r) = c_1 P(d = 1 \mid g = 0, r), \tag{2.1}
\]

where $r$ is a vector of unobserved risk factor values and $c_1$ is a constant independent of $r$.

Under the assumptions (i)-(v), the likelihood function for the observed genotypes of the relatives can be approximated by

\[
L_i(f_0, f_1) = p_1^{a_i} (1 - p_1)^{n_{1i} - a_i} \, p_0^{b_i} (1 - p_0)^{n_{0i} - b_i},
\quad \text{with } p_1 = \frac{f_1}{f_1 + f_0} \text{ and } p_0 = \frac{1 - f_1}{2 - f_1 - f_0}, \tag{2.2}
\]

where $a_i$ ($b_i$) is the number of affected (unaffected) relatives carrying the mutation and $n_{1i}$ ($n_{0i}$) is the number of affected (unaffected) relatives of the $i$th case proband, $i = 1, \ldots, I$.
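A heuristic reading of $p_1$ and $p_0$ in (2.2) may help; it is offered only as a sketch and is not the derivation given in Appendix A. Assume that a first-degree relative of a carrier proband carries the mutation with prior probability of about 1/2 (which is what assumptions (i) and (ii) suggest), and ignore the unobserved risk factors $r$. Bayes' rule then gives

\[
P(g = 1 \mid d = 1) = \frac{\tfrac{1}{2} f_1}{\tfrac{1}{2} f_1 + \tfrac{1}{2} f_0} = \frac{f_1}{f_1 + f_0} = p_1,
\qquad
P(g = 1 \mid d = 0) = \frac{\tfrac{1}{2}(1 - f_1)}{\tfrac{1}{2}(1 - f_1) + \tfrac{1}{2}(1 - f_0)} = \frac{1 - f_1}{2 - f_1 - f_0} = p_0.
\]

For instance, $f_1 = 0.5$ and $f_0 = 0.03$ give $p_1 = 0.5/0.53 \approx 0.943$ and $p_0 = 0.5/1.47 \approx 0.340$.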
Refer to Appendix A of the supplementary material (available at Biostatistics online) for the derivation of (2.2). Notice that $p_1$ ($p_0$) is the probability of a relative being a carrier, given that he or she is affected (unaffected) and the case proband is a carrier. It is seen that $p_0$ has exactly the same value as that given in Wang and others (2006) when $f_0 = 0$. Furthermore, when $f_0 = 0$ (i.e. a noncarrier has penetrance 0), all the affected relatives are carriers and they provide no information on $f_1$. It can be seen from (2.2) that the relatives' genotypes within the same family are conditionally independent given the ascertainment scheme. We want to point out that this is not an assumption but a result derived from the assumptions (i)-(v). An important advantage of this likelihood is that it is independent of the unobserved risk factors, making it suitable for estimating marginal penetrances of carriers and noncarriers.

The assumption (i) is the key assumption and is the motivation for this study. The assumptions (ii) and (iii) are commonly seen in the literature and are used to derive the conditional mutation distribution of a proband's relatives. The assumption (iv) is a technical one, and our simulation study shows that the performance of the proposed approach is acceptable even when the disease is common, with a prevalence of 0.1. The assumption (v) is equivalent to the multiplicative model for multiple risk factors (see e.g. Gail and others, 2008; Yu and others, 2009). In particular, the following log-linear model satisfies the assumption (v):

\[
P(d = 1 \mid g, r) = c_2 \exp\{a g + b^{\tau} r\},
\]
where $c_2$ is a constant and $a$ and $b$ are regression parameters. Throughout this paper, $\tau$ stands for the transpose of a vector. Notice that we do not assume any correlation structure for the unobserved risk factors of family members. Furthermore, the unobserved risk factors can be of any type: discrete or continuous, environmental or genetic.

2.3 Identifiability of $f_1$ and $f_0$

When genotypes are available only for the unaffected relatives of case probands, we see from the likelihood function (2.2) that the penetrances $f_1$ and $f_0$ are not identifiable. However, the 2 penetrances $f_1$ and $f_0$ are identifiable when at least 1 affected relative and 1 unaffected relative are genotyped, provided that $f_1 > f_0 > 0$. Actually, there is a one-to-one relationship between the penetrances $\{f_1, f_0\}$ and the estimable parameters $\{p_1, p_0\}$ when $f_1 > f_0 > 0$. This is different from the situation in the standard case-control design, where only the relative risk $f_1/f_0$ is identifiable. Notice that in our case-family design, our retrospective likelihood function is conditioned on the mutation status of the proband and disease status. This additional conditioning, together with the assumption of a rare mutation, makes both $f_1$ and $f_0$ identifiable. Note also that $f_0$ and $f_1$ are not identifiable when $f_1 = f_0$, but this is not a problem since the major purpose of our case-family design is to estimate the penetrance function of a known risk mutation with $f_1 > f_0$.

2.4 Maximum likelihood estimates

Denote $A = \sum_{i=1}^{I} a_i$, $B = \sum_{i=1}^{I} b_i$, $N_1 = \sum_{i=1}^{I} n_{1i}$, and $N_0 = \sum_{i=1}^{I} n_{0i}$. Then the overall likelihood can be written as

\[
L(f_0, f_1) = p_1^{A} (1 - p_1)^{N_1 - A} \, p_0^{B} (1 - p_0)^{N_0 - B},
\quad \text{with } p_1 = \frac{f_1}{f_1 + f_0} \text{ and } p_0 = \frac{1 - f_1}{2 - f_1 - f_0}. \tag{2.3}
\]

Since the above likelihood function is the product of 2 binomial likelihood functions, the MLEs of $p_1$ and $p_0$ are $\hat{p}_1 = A/N_1$ and $\hat{p}_0 = B/N_0$, respectively.
Therefore, the MLEs of $f_1$ and $f_0$ are, respectively,

\[
\hat{f}_1 = \frac{\hat{p}_1 (1 - 2\hat{p}_0)}{\hat{p}_1 - \hat{p}_0}
\quad \text{and} \quad
\hat{f}_0 = \frac{(1 - \hat{p}_1)(1 - 2\hat{p}_0)}{\hat{p}_1 - \hat{p}_0},
\quad \text{with } \hat{p}_1 = \frac{A}{N_1} \text{ and } \hat{p}_0 = \frac{B}{N_0}, \tag{2.4}
\]

or equivalently,

\[
\hat{f}_1 = \frac{A N_0 - 2BA}{A N_0 - B N_1}
\quad \text{and} \quad
\hat{f}_0 = \frac{N_1 N_0 - A N_0 - 2 B N_1 + 2AB}{A N_0 - B N_1}. \tag{2.5}
\]

When $f_0 = 0$, the MLE of $f_1$ is $(1 - 2B/N_0)/(1 - B/N_0)$. This estimator is simpler than that of Wang and others (2006) since their method needs to estimate an additional offset for each family. When $f_0$ is not equal to 0, using $(1 - 2B/N_0)/(1 - B/N_0)$ as an estimator of $f_1$ could produce considerable bias. For example, if $f_0 = 0.1$ and $f_1 = 0.2$, then $p_0 = 0.8/1.7 = 8/17$, the estimator $(1 - 2B/N_0)/(1 - B/N_0)$ converges to $(1 - 2p_0)/(1 - p_0) = 1/9$ as the sample size goes to infinity, and the relative bias (Rbias) is $(1/9 - 0.2)/0.2 = -4/9$. If all the affected relatives are carriers so that $N_1 = A$, then the MLEs of $f_0$ and $f_1$ are 0 and $(1 - 2B/N_0)/(1 - B/N_0)$, respectively. This confirms the fact that the affected relatives provide no information on $f_1$ when $f_0 = 0$, as was mentioned in Section 2.2.

With a large sample size, the MLEs $\hat{f}_1$ and $\hat{f}_0$ converge to $f_1$ and $f_0$, respectively, so that they asymptotically lie within the interval $[0, 1]$. When the sample size is not large enough, however, the 2 estimates could be negative or greater than 1. In such a situation, we can estimate the penetrances by adding the constraint $0 \le f_0, f_1 \le 1$.
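The closed-form estimators (2.4)-(2.5) and the bias example above can be checked numerically. The following Python sketch is illustrative only: the function names (`carrier_probs`, `penetrance_mle`, `naive_f1_limit`) and the counts in the usage note are hypothetical and do not come from the paper, and no constraint to $[0, 1]$ is applied.

```python
def carrier_probs(f0, f1):
    """Map penetrances (f0, f1) to the carrier probabilities (p1, p0) of (2.2)."""
    p1 = f1 / (f1 + f0)
    p0 = (1.0 - f1) / (2.0 - f1 - f0)
    return p1, p0


def penetrance_mle(A, N1, B, N0):
    """Unconstrained closed-form MLEs (f1_hat, f0_hat) from pooled counts, as in (2.4).

    A  : affected relatives carrying the mutation
    N1 : affected relatives genotyped
    B  : unaffected relatives carrying the mutation
    N0 : unaffected relatives genotyped
    """
    p1_hat = A / N1
    p0_hat = B / N0
    if p1_hat == p0_hat:
        # Corresponds to f1 = f0, where the penetrances are not identifiable.
        raise ValueError("p1_hat equals p0_hat; f1 and f0 are not identifiable")
    scale = (1.0 - 2.0 * p0_hat) / (p1_hat - p0_hat)
    return p1_hat * scale, (1.0 - p1_hat) * scale


def naive_f1_limit(f0, f1):
    """Large-sample limit of the estimator of f1 that wrongly assumes f0 = 0."""
    _, p0 = carrier_probs(f0, f1)
    return (1.0 - 2.0 * p0) / (1.0 - p0)


if __name__ == "__main__":
    # Hypothetical counts chosen so that p1_hat = 2/3 and p0_hat = 8/17.
    print(penetrance_mle(A=34, N1=51, B=24, N0=51))
    # Relative bias of the naive estimator when f0 = 0.1 and f1 = 0.2.
    print((naive_f1_limit(0.1, 0.2) - 0.2) / 0.2)
```

With the hypothetical counts above, the estimates are approximately $\hat{f}_1 = 0.2$ and $\hat{f}_0 = 0.1$, and the relative bias of the naive estimator is approximately $-4/9$, in agreement with the worked example.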
2.5 Hypothesis testing and confidence interval

It is of interest to test the null hypothesis that the mutation has no effect on the disease ($f_0 = f_1$), provided that the genotypes of some affected relatives are available. To test this null hypothesis, we can construct a likelihood ratio test. Since the common penetrance under the null hypothesis is not identifiable, the likelihood ratio statistic no longer has a standard chi-square limiting null distribution. To assess the significance of the likelihood ratio test statistic, we can adopt a permutation test by permuting the disease status of the relatives. The confidence intervals of the penetrances can be constructed based on the asymptotic normality of the MLEs, with the variance-covariance matrix of the MLEs being estimated by the inverse of the observed information matrix.

3. AGE-DEPENDENT PENETRANCES

3.1 Notation

In most situations, the penetrances depend on age, and we are interested in estimating age-dependent penetrances. Suppose that we observe the ages at diagnosis for all the relatives and the ages at onset for those affected individuals. We will take this information into account in the evaluation of the age-dependent penetrances. For the $i$th proband, suppose the information on the phenotypes and genotypes of $n_i$ relatives is collected. Let the genotype and affection status of the $j$th relative (the zeroth relative is the case proband) of the $i$th case proband be coded by $g_{ij}$ and $d_{ij}$, respectively. That is, $g_{ij} = 1$ if the $j$th relative is a carrier and 0 otherwise, and $d_{ij} = 1$ if the $j$th relative is affected and 0 otherwise. Let $a_{ij}$ and $t_{ij}$ be the current age and the age at onset of the $j$th relative, respectively ($t_{ij}$ is an unobserved value greater than $a_{ij}$ if the $j$th relative is unaffected). Let $y_{ij} = \min\{t_{ij}, a_{ij}\}$.
3.2 Likelihood function

We can formulate a conditional likelihood for the $i$th family's data as $P(g_i \mid d_i, y_i, g_{i0} = 1, d_{i0} = 1, y_{i0}, a_i, a_{i0})$, where $g_i = (g_{i1}, \ldots, g_{in_i})$, $d_i = (d_{i1}, \ldots, d_{in_i})$, $y_i = (y_{i1}, \ldots, y_{in_i})$, and $a_i = (a_{i1}, \ldots, a_{in_i})$. To derive the likelihood function, we need the following assumption corresponding to the assumption (v) for the age-independent penetrances:

(v′) There is no interaction effect between the study mutation and the unobserved risk factors, that is, the density function of the age at onset $p(t \mid g, r)$ given the study mutation $g$ and unobserved risk factors $r$ satisfies the relationship

\[
p(t \mid g = 1, r) = c_3 \, p(t \mid g = 0, r), \tag{3.1}
\]

where $c_3$ is a constant.

Under Cox's proportional hazards model (Cox, 1972), the hazard function is multiplicative with respect to $g$ and $r$ if there is no interaction effect. Therefore, the Cox model together with the rare disease assumption implies the assumption (v′), since the density function is approximately the hazard function under these assumptions. Under the assumptions (i)-(iv) and (v′), we can show that the overall likelihood can be approximated by

\[
L = \prod_{i=1}^{I} P(g_i \mid d_i, y_i, g_{i0} = 1, d_{i0} = 1, y_{i0}, a_i, a_{i0})
= \prod_{i=1}^{I} \prod_{j=1}^{n_i}
\frac{\lambda^{d_{ij}}(y_{ij} \mid g_{ij}) \, S(y_{ij} \mid g_{ij})}
{\sum_{g=0}^{1} \lambda^{d_{ij}}(y_{ij} \mid g) \, S(y_{ij} \mid g)}, \tag{3.2}
\]
where $\lambda(\cdot \mid g)$ and $S(\cdot \mid g)$ are, respectively, the hazard function and survival function of the age at onset of individuals carrying genotype $g$. The derivation of (3.2) is similar to that of (2.2) and so is omitted. We can assume a suitable functional form for $\lambda(t \mid g)$. For example, under the given assumptions, the joint proportional hazards model implies a marginal proportional hazard function

\[
\lambda(t \mid g) = \lambda_0(t; \eta) e^{\beta g}, \tag{3.3}
\]

where $\lambda_0(t; \eta)$ is the baseline hazard function known up to a parameter vector $\eta$ of finite dimension. If only unaffected relatives are genotyped, then the likelihood function (3.2) reduces to

\[
\prod_{i=1}^{I} \prod_{j=1}^{n_i} \frac{S(y_{ij} \mid g_{ij})}{S(y_{ij} \mid 0) + S(y_{ij} \mid 1)}. \tag{3.4}
\]

It can be shown that $S(\cdot \mid 1)$ and $S(\cdot \mid 0)$ are not identifiable in (3.4), as in Section 2.3. For a rare disease, one can assume that the penetrance of noncarriers is nearly zero so that $S(y \mid 0) \approx 1$, and the likelihood function is approximately

\[
\prod_{i=1}^{I} \prod_{j=1}^{n_i}
\left( \frac{S(y_{ij} \mid 1)}{1 + S(y_{ij} \mid 1)} \right)^{g_{ij}}
\left( \frac{1}{1 + S(y_{ij} \mid 1)} \right)^{1 - g_{ij}}, \tag{3.5}
\]

which is equivalent to model (3) of Olschwang and others (2009). Making the additional assumption of a Weibull form for the survival function $S(y \mid 1)$ yields a logistic regression model given by (5) of Olschwang and others (2009).

3.3 MLE, hypothesis testing, and confidence interval

The MLEs of the unknown parameters can be obtained by the Newton-Raphson algorithm or any other optimization algorithm. To examine whether the study mutation has an effect on the disease, we can test the null hypothesis $\beta = 0$ using either a likelihood ratio test or a Wald test, where $\beta$ is given in (3.3). We can also estimate the variances of the MLEs and construct the confidence intervals of the unknown parameters based on large-sample theory.

4. COVARIATES AND MULTIPLE MUTATIONS ADJUSTMENT

In many real applications, we might be interested in comparing penetrances between 2 groups, for example, male versus female.
Also, when there are multiple known disease-causing mutations involved, we are interested in comparing the penetrances among multiple mutations. An example will be given in Section 6. We can extend the previous likelihood functions further to adjust for covariates and multiple disease-causing mutations. In the following, we illustrate how to incorporate covariates and multiple mutations in the situation where the genotypes from both affected and unaffected relatives are available.

4.1 Likelihood function

Assume that a covariate vector $Z$ is observed for each relative. Then we can incorporate the covariate effects in a proportional hazards model:

\[
\Lambda(t \mid g, Z) = \Lambda_0(t; \eta) e^{\beta g + \gamma^{\tau} Z}, \tag{4.1}
\]
where $\Lambda(t \mid g, Z)$ is the cumulative hazard function of the age at onset given covariate $Z$ and genotype $g$ and $\Lambda_0(t; \eta)$ is the baseline cumulative hazard function corresponding to $g = 0$ and $Z = 0$, which is known up to a parameter vector $\eta$ of finite dimension. The likelihood function is therefore approximately

\[
\prod_{i=1}^{I} \prod_{j=1}^{n_i}
\frac{\exp\{(\beta g_{ij} + \gamma^{\tau} Z_{ij}) d_{ij} - \Lambda_0(y_{ij}; \eta) e^{\beta g_{ij} + \gamma^{\tau} Z_{ij}}\}}
{\sum_{g=0}^{1} \exp\{(\beta g + \gamma^{\tau} Z_{ij}) d_{ij} - \Lambda_0(y_{ij}; \eta) e^{\beta g + \gamma^{\tau} Z_{ij}}\}}.
\]

Suppose $K$ types of disease-causing rare mutations are considered. We assume that each family can have at most one type of mutation segregating. Let $\delta_i$ be a mutation indicator, that is, $\delta_i = k$ if the $i$th case proband has the $k$th type of mutation. We assume that the cumulative hazard functions of these risk mutations are proportional:

\[
\Lambda_k(t) = \Lambda_0(t; \eta) e^{\beta_k}, \quad k = 1, \ldots, K, \tag{4.2}
\]

where $\beta_1 = 0$. The approximated likelihood function can be written as

\[
\prod_{i=1}^{I} \prod_{j=1}^{n_i}
\frac{\exp\{(\sum_{k=1}^{K} 1_k(\delta_i) \beta_k g_{ij}) d_{ij} - \Lambda_0(y_{ij}; \eta) e^{\sum_{k=1}^{K} 1_k(\delta_i) \beta_k g_{ij}}\}}
{\sum_{g=0}^{1} \exp\{(\sum_{k=1}^{K} 1_k(\delta_i) \beta_k g) d_{ij} - \Lambda_0(y_{ij}; \eta) e^{\sum_{k=1}^{K} 1_k(\delta_i) \beta_k g}\}}, \tag{4.3}
\]

where $1_k(\delta_i)$ is an indicator function taking value 1 if $\delta_i = k$ and 0 otherwise. Refer to Appendix B of the supplementary material (available at Biostatistics online) for the derivation of (4.3). This expression shows that the mutation behaves as a family-shared categorical covariate.

4.2 MLE, confidence interval, and hypothesis testing

The MLEs of $\eta$, $\beta$, $\gamma$, and $\beta_k$, $k = 2, \ldots, K$, in (4.1) and (4.2) can be obtained using the Newton-Raphson algorithm or any other optimization algorithm. The variance estimates of the MLEs and confidence intervals of the unknown parameters can be obtained as before. It is of interest to compare the penetrances for various disease-causing mutations, which can be done by the standard likelihood ratio test based on the likelihood (4.3).
The proportionality of the hazard functions can also be tested by a likelihood ratio test, with the alternative hypothesis being that the mutations have their own specific penetrance functions. If the null hypothesis that the penetrance functions are proportional is not rejected, we can apply the proportional hazards model (4.2); otherwise, we need to estimate mutation-specific penetrance functions.

5. SIMULATION STUDIES

We conducted simulation studies to assess the performance of the proposed approach. First, we studied the age-independent penetrances. We assumed 2 independent disease-related single-nucleotide polymorphisms (SNPs): one is the study mutation with minor allele frequency (MAF) 0.01 or 0.001 and the other one is unobserved with MAF 0.2. We assumed a dominant mode of inheritance for both SNPs. The disease and risk factors were related by a logistic regression model:

\[
P(d = 1 \mid g, r) = \frac{\exp\{a + bg + \log(\mathrm{OR}) r\}}{1 + \exp\{a + bg + \log(\mathrm{OR}) r\}}, \tag{5.1}
\]

where $g$ ($r$) is 1 if the genotype of the study SNP (unobserved SNP) is of higher risk and 0 otherwise and OR is the odds ratio parameter for the unobserved SNP, which takes value 1 or 2. The marginal penetrance $f_1$ for carriers was fixed at 0.5 and the other penetrance $f_0$ was 0.03 or 0.1. The values of
log-OR parameters $a$ and $b$ were determined by the other parameters. The genotypes of parents were generated under Hardy-Weinberg equilibrium and random mating, and the genotypes of offspring were independently generated given parental genotypes. From a large number of generated families with 3 offspring, we randomly selected families with the first offspring being affected and carrying the study mutation and treated them as the source population from which the study sample was collected. A sample of size 200, 500, or 1000 was drawn from this population and simulation results based on replications were produced. Reported in Table 1 are the Rbias of the estimates, defined as the mean estimated penetrance divided by the true penetrance minus 1, the empirical standard errors (SE) and mean estimated standard errors (SEE) of the estimates, and the empirical coverage probability (ECP) of the penetrances.

[Table 1. Age-independent penetrance estimates. Columns, reported separately for $f_0$ and $f_1$: MAF of the study mutation; true value of penetrance $f_0$; OR parameter of the unobserved risk factor; Rbias (mean MLE divided by true penetrance, minus 1); SE; SEE; ECP (95% empirical coverage probability). Panels: number of cases = number of controls = 200, 500, and 1000.]

Overall, the estimates have minor bias when the disease is rare ($f_0 = 0.03$) and the study mutation is rare (MAF = 0.001), with absolute relative biases no more than 1.2%. Common disease ($f_0 = 0.1$), increased MAF (0.01) of the study mutation, and a positive effect of the unobserved mutation (OR = 2) have a small impact on the estimates, with Rbias in the range 7.9% to 3.4%. In all situations, the SEE are very close to the empirical ones. The relative bias tends to be stable and remains small when the sample size increases. We also estimated the penetrance of carriers using only the genotypes of unaffected relatives by assuming zero penetrance of noncarriers. The resulting Rbias is generally small when $f_0 = 0.03$ but it could become considerably large when $f_0 = 0.1$ (results not shown). It is also seen from Table 1 that the Rbias for a mutation with MAF = 0.001 tends to be smaller than that observed for a mutation with MAF = 0.01. Additional simulation results show that the relative biases get larger when the MAF increases. For example, an MAF of 0.03 produces relative biases in the range of 10.3% to 23.7%, and an MAF of 0.1 produces relative biases in the range of 41.9% to 65.1%, with the other parameters the same as those in Table 1. It appears that the proposed approach is suitable for rare mutations whose MAF is small.

Second, we studied the proposed approach when the penetrance is age dependent. We generated data from the following Cox model with Weibull baseline hazard function:

\[
\lambda(t \mid g, r) = (t/e^{\psi})^{\xi} e^{\beta g + \log(\mathrm{OR}) r}, \tag{5.2}
\]

where $g$ and $r$ are the same as those in (5.1) with the same MAFs. The OR was fixed at 1 or 2.
The other parameters $\xi$, $\psi$, and $\beta$ were determined by 3 cumulative risk probabilities: $p_{30,0} = P(T \le 30 \mid g = 0)$, $p_{60,0} = P(T \le 60 \mid g = 0)$, and $p_{60,1} = P(T \le 60 \mid g = 1)$, where $T$ is the age at onset. To mimic common disease, we set $p_{30,0} = 0.03$ and $p_{60,0} = 0.09$; to mimic rare disease, we set $p_{30,0} = 0.01$ and a correspondingly smaller $p_{60,0}$. In both situations, we set $p_{60,1} = 0.5$. The ages of the relatives of a proband were generated from the uniform distribution on the interval $(a - 5, a + 5)$, where $a$ is the current age of the proband, which is uniformly distributed on the interval $(20, 70)$. The ages, genotypes, and disease status were generated for a large number of families similarly to the age-independent situation. In each family, data were generated for 2 parents and 3 offspring. We retained the families in which the proband (the first offspring) was affected and carried the mutation. From these families, we sampled 400 or 1000 families and estimated $\xi$, $\psi$, and $\beta$ in model (5.2) by ignoring the unobserved mutation. Substituting the estimated parameters gave the estimates of the marginal survival functions of carriers and noncarriers. Based on 5000 replications, we calculated the mean estimated survival functions of both carriers and noncarriers and the 90% confidence intervals of the survival functions. Presented in Figures 1 and 2 are the results for carriers and noncarriers, respectively, with sample size 1000 and OR = 1 (the unobserved mutation plays no role in the disease). We can see that the bias of the estimates reduces dramatically when the MAF of the study mutation decreases from 0.01 to 0.001, showing that the approximation of the likelihood function works well for a relatively rare mutation. When the disease becomes common, the proposed method using both affected and unaffected relatives does not produce extra bias. However, the method that uses only unaffected relatives has much larger bias for common disease.
This extra bias is due to the improper assumption of a zero penetrance function of noncarriers for common disease. Other results for sample size 400 or OR = 2 are presented in Figures s1-s6 of the supplementary material (available at Biostatistics online). In summary, the bias of the penetrance functions gets smaller as the sample size increases. The positive effect of the unobserved mutation (OR = 2) has only limited impact on the penetrance function estimates. In particular, the impact is minimal when the MAF of the study mutation is small and the disease is rare.
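The conditional likelihood (3.2) that is maximized in the age-dependent simulations can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it takes the hazard form written in (5.2) at face value with the unobserved factor ignored, obtains the cumulative hazard by integrating that hazard from 0 to $t$ (which requires $\xi > -1$), and uses hypothetical names (`cond_loglik`, `relatives`) and hypothetical parameter values.

```python
import math


def log_hazard(t, g, psi, xi, beta):
    """log of the hazard (t / e^psi)^xi * e^(beta * g), for t > 0."""
    return xi * (math.log(t) - psi) + beta * g


def cum_hazard(t, g, psi, xi, beta):
    """Integral of the hazard above from 0 to t (assumes xi > -1)."""
    return math.exp(beta * g) * t ** (xi + 1.0) / ((xi + 1.0) * math.exp(psi * xi))


def log_lambda_d_times_s(y, d, g, psi, xi, beta):
    """log of lambda(y | g)^d * S(y | g), one numerator term of (3.2)."""
    return d * log_hazard(y, g, psi, xi, beta) - cum_hazard(y, g, psi, xi, beta)


def cond_loglik(relatives, psi, xi, beta):
    """Log of the approximate conditional likelihood (3.2).

    relatives: iterable of (y, d, g) with y = min(age at onset, current age),
    d = 1 if affected, g = 1 if carrier. Probands are not included.
    """
    total = 0.0
    for y, d, g in relatives:
        terms = [log_lambda_d_times_s(y, d, k, psi, xi, beta) for k in (0, 1)]
        # Log-sum-exp over g = 0, 1 for the denominator of (3.2).
        m = max(terms)
        log_denom = m + math.log(sum(math.exp(v - m) for v in terms))
        total += terms[g] - log_denom
    return total


if __name__ == "__main__":
    # Hypothetical relatives: (y, d, g).
    relatives = [(45.0, 1, 1), (50.0, 0, 0), (38.0, 0, 1)]
    print(cond_loglik(relatives, psi=6.0, xi=2.0, beta=1.2))
```

The function can be handed to any general-purpose optimizer to obtain the MLEs of $\psi$, $\xi$, and $\beta$. When $\beta = 0$, carriers and noncarriers are indistinguishable and each relative contributes $\log(1/2)$, which is a convenient sanity check.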
[Fig. 1. Estimated survival functions of carriers with sample size 1000 and OR = 1. "Common mutation": MAF = 0.01; "rare mutation": MAF = 0.001; "common disease": $P(T \le 30 \mid g = 0) = 0.03$ and $P(T \le 60 \mid g = 0) = 0.09$; "rare disease": $P(T \le 30 \mid g = 0) = 0.01$, with $P(T \le 60 \mid g = 0)$ as set for the rare-disease scenario in Section 5.]

[Fig. 2. Estimated survival functions of noncarriers with sample size 1000 and OR = 1. Scenario labels as in Fig. 1.]

Finally, we examined the robustness to the specification of the baseline hazard function. Our simulation studies showed that misspecification of the baseline hazard function could result in bias, with its magnitude depending on the true and misspecified functions. Here, we do not present the simulation results but briefly summarize them. If the true baseline hazard function was gamma, Weibull, or log-normal but was misspecified as either of the other 2 functions, then the resulting penetrance estimate had small bias; if the baseline hazard function was piecewise constant but was misspecified as Weibull, then the bias could be relatively large.

6. APPLICATION TO A STUDY OF LYNCH SYNDROME

We applied the proposed approach to a study of Lynch syndrome (Olschwang and others, 2009). In this study, the carriers were identified in 8 genetic units of France and Switzerland. These units offered germline analysis of the MSH2 and MLH1 genes. A retrospective questionnaire was used to collect information on asymptomatic first-degree relatives of carriers. The collected information includes the type of disease-causing germline mutation identified in the proband, birth date, sex, and age at genetic diagnosis. The presence or absence of the disease-causing mutation was then assessed in these relatives. Phenotypes and genotypes from 856 asymptomatic first-degree relatives of MSH2 or MLH1 carriers were collected from those 8 centers. For each relative, the gender and mutation status at genes MSH2 and MLH1 were obtained, as summarized in Table 2. Furthermore, the ages of the relatives were available, so that we could estimate the age-dependent penetrances. With pooled data, we assumed a Weibull survival function $(t/e^{\psi})^{\xi}$ for carriers. With gender or mutation type adjusted, we assumed a proportional hazards model with survival function $(t/e^{\psi})^{\xi} e^{\beta_1 x_1}$ or $(t/e^{\psi})^{\xi} e^{\beta_2 x_2}$. Here, $x_1 = 1$ if male and 0 if female and $x_2 = 1$ if MLH1 and 0 if MSH2. We obtained the MLEs of the unknown parameters ($\psi$, $\xi$, $\beta_1$, and $\beta_2$), the estimated SE of the MLEs, and 95% confidence intervals of the parameters. The estimation and hypothesis-testing results are presented in Table 3.
[Table 2. Summary of genotypes. Counts of male and female relatives with the mutation absent or present, by gene (MLH1, MSH2).]

Table 3. Estimates of the parameters for the Lynch syndrome data

    Parameter    CI
    Pooled data
      ψ          (4.020, 4.314)
      ξ          (1.266, 4.614)
    Adjusted by gender (β_1: regression parameter)
      ψ          (4.010, 4.507)
      ξ          (1.184, 4.520)
      β_1        (−0.267, 1.333)
    Adjusted by gene type (β_2: regression parameter)
      ψ          (3.959, 4.487)
      ξ          (1.242, 4.503)
      β_2        (−0.568, 1.085)

The parameters are defined in Section 6. Table columns: Parameter; MLE (estimate of unknown parameter); SE (estimated standard error of the MLE); CI (confidence interval of unknown parameter); LRT (likelihood ratio test statistic); P-value (of likelihood ratio test).

The estimated survival function together with its confidence interval curves based on 5000 bootstrap replicates (Efron and Tibshirani, 1993) are plotted in Figure s7 of the supplementary material (available at Biostatistics online). The penetrance difference between males and females was moderately large, the penetrance difference between the 2 genes was minor, and neither difference was statistically significant (the P-value for the comparison between genes was 0.515). These results are consistent with those of Olschwang and others (2009). In this example, we fitted a more general Weibull baseline hazard function with 2 parameters, while Olschwang and others (2009) fitted an exponential baseline hazard function with a threshold value.

7. DISCUSSION

A precise estimation of the age-dependent risk for people carrying disease-causing mutations would have a tremendous impact on public health, because it is instrumental in the counseling of individuals who are identified by genetic testing as carriers and who are faced with different options for cancer prevention or early detection. We provide a rigorous statistical inference framework for the evaluation of the penetrance of a rare mutation. The approach can handle both covariates and multiple rare mutations.
It is helpful to check the parametric assumption of the baseline hazard function. Because the design is retrospective and the observations are subject to censoring, rigorously checking the parametric assumption is a great challenge. In practice, one can try some commonly used parametric baselines and choose the
one with the largest likelihood. This technique, however, could be misleading if the true baseline is very different from the candidates considered. A nonparametric approach that does not assume any parametric baseline is much more desirable, although it involves some computational and theoretical issues. We will pursue this in future research.

The proposed approach allows for unobserved risk factors that are correlated among family members, provided that there is no interaction effect between the study mutation and the unobserved risk factors. When such an interaction effect is present, the proposed approach can produce considerable bias in the penetrance estimates. More advanced methods such as the frailty model could be helpful in resolving this problem; see, for example, Hsu and others (2004) and Hsu and Gorfine (2006). The development of such an inference procedure is still under way.

When the disease is not rare, as demonstrated in Section 2.4 and the simulation studies, assuming zero penetrance for noncarriers can produce considerable bias in the penetrance estimate of carriers. In such situations, genotypes from affected relatives help to improve estimation with the proposed approach. Therefore, it is important to collect genotype information from both affected and unaffected relatives when adopting such a case family design for penetrance estimation. We hope that our proposed method will make this potentially very useful design more accessible for the future study of rare mutations.

SUPPLEMENTARY MATERIAL

Supplementary material is available at Biostatistics online.

ACKNOWLEDGMENTS

We would like to thank Dr. Gilles Thomas for helpful discussions, Dr. B. J. Stone for editorial help, and Drs. C. Lasset, Q. Wang, P. Hutter, M. P. Buisine, R. Etienne, C. Caron, V. Bourdon, and S. Baert-Desurmont for data collection. Conflict of Interest: None declared.
FUNDING

Intramural Program of the National Institutes of Health to H.Z. and K.Y.; Natural Science Foundation of China ( ) to H.Z.; Institut National du Cancer to S.O.

REFERENCES

CHATTERJEE, N., KALAYLIOGLU, Z., SHIH, J. H. AND GAIL, M. H. (2006). Case-control and case-only designs with genotype and family history data: estimating relative risk, residual familial aggregation, and cumulative risk. Biometrics 62,

CHATTERJEE, N. AND WACHOLDER, S. (2001). A marginal likelihood approach for estimating penetrance from kin cohort designs. Biometrics 57,

COX, D. R. (1972). Regression models and life-tables. Journal of the Royal Statistical Society, Series B (Methodological) 34,

EFRON, B. AND TIBSHIRANI, R. (1993). An Introduction to the Bootstrap. New York: Chapman & Hall.

GAIL, M. H., PEE, D., BENICHOU, J. AND CARROLL, R. (1999). Designing studies to estimate the penetrance of an identified autosomal dominant mutation: cohort, case-control, and genotyped-proband designs. Genetic Epidemiology 16,
GAIL, M. H., PEE, D. AND CARROLL, R. (1999). Kin cohort designs for gene characterization. Journal of the National Cancer Institute. Monographs 26,

GAIL, M. H., PFEIFFER, R. M., WHEELER, W. AND PEE, D. (2008). Probability of detecting disease-associated single nucleotide polymorphisms in case-control genome-wide association studies. Biostatistics 9,

HSU, L., CHEN, L., GORFINE, M. AND MALONE, K. (2004). Semiparametric estimation of marginal hazard function from the case-control family studies. Biometrics 60,

HSU, L. AND GORFINE, M. (2006). Multivariate survival analysis for case-control family data. Biostatistics 7,

OLSCHWANG, S., YU, K., LASSET, C., BAERT-DESURMONT, S., BUISINE, M. P., WANG, Q., HUTTER, P., ROULEAU, E., CARON, O., BOURDON, V. AND OTHERS (2009). Age-dependent cancer risk is not different in between MSH2 and MLH1 mutation carriers. Journal of Cancer Epidemiology doi: /2009/

WACHOLDER, S., HARTGE, P., STRUEWING, J. P., PEE, D., MCADAMS, M., BRODY, L. AND TUCKER, M. (1998). The kin-cohort study for estimating penetrance. American Journal of Epidemiology 148,

WANG, Y., CLARK, L. N., MARDER, K. AND RABINOWITZ, D. (2007). Nonparametric estimation of age-at-onset distributions from censored kin-cohort data. Biometrika 94,

WANG, Y., OTTMAN, R. AND RABINOWITZ, D. (2006). A method for estimating penetrance from families sampled for linkage analysis. Biometrics 62,

YU, K., LI, Q., BERGEN, A. W., PFEIFFER, R. M., ROSENBERG, P. S., CAPORASO, N., KRAFT, P. AND CHATTERJEE, N. (2009). Pathway analysis by adaptive combination of P-values. Genetic Epidemiology 33,

[Received November 22, 2009; revised January 11, 2010; accepted for publication January 18, 2010]
More informationStatistics in medicine
Statistics in medicine Lecture 4: and multivariable regression Fatma Shebl, MD, MS, MPH, PhD Assistant Professor Chronic Disease Epidemiology Department Yale School of Public Health Fatma.shebl@yale.edu
More information