Statistical inference on the penetrances of rare genetic mutations based on a case family design

Biostatistics (2010), 11, 3, doi: /biostatistics/kxq009
Advance Access publication on February 23, 2010

HONG ZHANG
Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD 20892, USA and Department of Statistics and Finance, University of Science and Technology of China, Hefei, Anhui, People's Republic of China

SYLVIANE OLSCHWANG
Institut National de la Santé et de la Recherche Médicale (INSERM), Unité 891, Centre de Recherches en Cancérologie de Marseille, Marseille, France and Department of Oncogenetics, Institut Paoli-Calmettes, Marseille, France

KAI YU
Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD 20892, USA
yuka@mail.nih.gov

SUMMARY

We propose a formal statistical inference framework for the evaluation of the penetrance of a rare genetic mutation using family data generated under a kin cohort type of design, where phenotype and genotype information from first-degree relatives (sibs and/or offspring) of case probands carrying the targeted mutation are collected. Our approach is built upon a likelihood model with some minor assumptions, and it can be used for age-dependent penetrance estimation that permits adjustment for covariates. Furthermore, the derived likelihood allows unobserved risk factors that are correlated within family members. The validity of the approach is confirmed by simulation studies. We apply the proposed approach to estimating the age-dependent cancer risk among carriers of the MSH2 or MLH1 mutation.

Keywords: Case family design; Penetrance; Proportional hazards model; Rare mutation; Unobserved risk factors.

1. INTRODUCTION

An increasing number of mutations have been found to be associated with an elevated risk for various genetic disorders.
A precise estimation of the age-dependent risk for people carrying the disease-causing mutations is essential for defining prevention strategies and understanding the underlying mechanisms of the diseases. When a disease-causing mutation is identified, a precise estimation of its penetrance is possible using the kin cohort design (Wacholder and others, 1998), which has been studied extensively in the literature, for example, by Gail, Pee, Benichou, and Carroll (1999), Chatterjee and Wacholder (2001), Chatterjee and others (2006), and Wang and others (2007). Gail, Pee, and Carroll (1999) studied the advantages and disadvantages of the kin cohort design. They found that the kin cohort design has several practical advantages, including comparatively rapid execution, modest reductions in required sample sizes compared with cohort or case-control designs, and the ability to study the effects of an autosomal dominant mutation on several disease outcomes. The disadvantages include 2 sources of bias: a proband's decision to participate is influenced by the disease status of his or her relatives, and the proband is unable to recall the disease histories of relatives accurately.

In a standard kin cohort design, a volunteer (either affected or unaffected) agrees to be genotyped, and the phenotype information on the disease histories of his or her first-degree relatives is obtained through a questionnaire. When the information on both phenotype and genotype for relatives is available, alternative approaches are needed in order to take full advantage of all available data while correcting for bias due to the effects of ascertainment. In this paper, we assume that all probands are affected carriers, though the proposed approach can be extended to include unaffected probands carrying the mutation.

Recently, Wang and others (2006) proposed a nonparametric method for estimating the penetrance of a rare mutation. Olschwang and others (2009) proposed an alternative parametric logistic regression model. Both approaches rely on the assumption that the penetrance of noncarriers is zero. This assumption might not hold for many genetic diseases, and the penetrance estimate could be severely biased when it is violated in real applications.

To whom correspondence should be addressed. © The Author. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org.
In this paper, we focus on rare mutations and aim at developing a rigorous statistical inference framework for such a case family design. The main difference between this design and the standard kin cohort design is that the former collects information on both phenotypes and genotypes of the probands' relatives, while the latter simply collects the phenotypes of the relatives through a questionnaire. The assumption of zero penetrance for the noncarriers is not required for our approach. Furthermore, the proposed approach is based on a likelihood model conditioned on the phenotypes of all individuals; therefore, the derived estimate should not suffer from the biases mentioned for the kin cohort design. Covariates such as gender and ethnicity can be incorporated easily in our approach. Multiple rare mutations can also be handled in the context of the proposed conditional likelihood framework. The derivation of the conditional likelihood functions requires minor assumptions. The maximum likelihood estimates (MLEs) can be obtained through standard optimization algorithms available in mathematical and statistical software. Statistical inferences, such as constructing confidence intervals and testing hypotheses for the parameters characterizing the penetrance, can be performed based on standard large-sample theory. The performance of the proposed approach is examined through simulation studies, which illustrate the desired properties of the approach. Finally, we demonstrate the proposed approach by applying it to a study of Lynch syndrome.

2. AGE-INDEPENDENT PENETRANCES

2.1 Notation

Throughout this paper, the mutations responsible for the disease of interest are assumed to be on an autosome.
In the case family design considered, some unrelated affected individuals (cases) collected from a case-control study are genotyped, and those cases carrying the study mutation are termed "case probands"; the first-degree relatives (sibs and/or offspring) of the case probands are interviewed for phenotyping and genotyping. To motivate our approach, we first focus on congenital or early-onset diseases that manifest before the ages at which subjects are ascertained. We want to estimate the age-independent penetrance of a known disease-causing mutation. Suppose some case probands are ascertained, and several first-degree relatives of each case proband are then collected for genotyping at the disease locus. Throughout this paper, we assume that the disease-causing mutation (allele $m$, the mutant form of the wild-type allele $M$) is rare. Because the mutation is so rare, the homozygous genotype $mm$ is seldom seen under Hardy-Weinberg equilibrium, and we consider only 2 genotypes, namely $Mm$ (mutation, denoted by $g = 1$) and $MM$ (nonmutation, denoted by $g = 0$). Let the disease penetrances of $MM$ and $Mm$ be $f_0$ and $f_1$, respectively. Let the disease status of an individual be $d$, which takes value 1 if affected and 0 otherwise.

2.2 Likelihood function

Suppose $I$ unrelated case probands carrying the mutation are ascertained. To derive the likelihood function of the observed data, we need to make the following assumptions:

(i) The study mutation is rare.
(ii) Hardy-Weinberg equilibrium holds for the corresponding allele, mating is random, and the Mendelian inheritance law holds.
(iii) The study mutation is independent of the unobserved risk factors.
(iv) The disease is rare.
(v) There is no interaction effect between the study mutation and the unobserved risk factors. That is, the joint disease penetrance satisfies the following relationship:

$$P(d = 1 \mid g = 1, r) = c_1 P(d = 1 \mid g = 0, r), \tag{2.1}$$

where $r$ is a vector of unobserved risk factor values and $c_1$ is a constant independent of $r$.

Under the assumptions (i)-(v), the likelihood function for the observed genotypes of the relatives of the $i$th case proband can be approximated by

$$L_i(f_0, f_1) = p_1^{a_i}(1 - p_1)^{n_{1i} - a_i}\, p_0^{b_i}(1 - p_0)^{n_{0i} - b_i}, \quad \text{with } p_1 = \frac{f_1}{f_1 + f_0} \text{ and } p_0 = \frac{1 - f_1}{2 - f_1 - f_0}, \tag{2.2}$$

where $a_i$ ($b_i$) is the number of affected (unaffected) relatives carrying the mutation and $n_{1i}$ ($n_{0i}$) is the number of affected (unaffected) relatives of the $i$th case proband, $i = 1, \ldots, I$.
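As a numerical illustration of (2.2), the following Python sketch evaluates the conditional carrier probabilities $p_1$ and $p_0$ and the log-likelihood contribution of one family. The function names `carrier_probs` and `family_loglik` and the helper `_xlogy` are illustrative choices that do not appear in the paper; the code assumes nothing beyond the formulas stated in (2.2).

```python
import math


def carrier_probs(f0, f1):
    """Carrier probabilities of a relative of a carrier proband, as in (2.2).

    p1: probability of carrying the mutation given the relative is affected.
    p0: probability of carrying the mutation given the relative is unaffected.
    """
    p1 = f1 / (f1 + f0)
    p0 = (1.0 - f1) / (2.0 - f1 - f0)
    return p1, p0


def _xlogy(x, y):
    """Return x * log(y) with the conventions 0 * log(0) = 0 and log(0) = -inf."""
    if x == 0:
        return 0.0
    if y <= 0:
        return float("-inf")
    return x * math.log(y)


def family_loglik(f0, f1, a, n1, b, n0):
    """Log of L_i(f0, f1) in (2.2) for one family.

    a, n1: carriers among affected relatives, number of affected relatives.
    b, n0: carriers among unaffected relatives, number of unaffected relatives.
    """
    p1, p0 = carrier_probs(f0, f1)
    return (_xlogy(a, p1) + _xlogy(n1 - a, 1.0 - p1)
            + _xlogy(b, p0) + _xlogy(n0 - b, 1.0 - p0))
```

For instance, `carrier_probs(0.1, 0.2)` gives $p_1 = 2/3$ and $p_0 = 8/17$, and setting $f_0 = f_1$ gives $p_1 = p_0 = 1/2$ whatever the common value is.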
Refer to Appendix A of the supplementary material (available at Biostatistics online) for the derivation of (2.2). Notice that $p_1$ ($p_0$) is the probability of a relative being a carrier, given that he/she is affected (unaffected) and the case proband is a carrier. It is seen that $p_0$ has exactly the same value as that given in Wang and others (2006) when $f_0 = 0$. Furthermore, when $f_0 = 0$ (i.e. a noncarrier has penetrance 0), all the affected relatives are carriers and they provide no information on $f_1$. It can be seen from (2.2) that the relatives' genotypes within the same family are conditionally independent given the ascertainment scheme. We want to point out that this is not an assumption but a result derived from the assumptions (i)-(v). An important advantage of this likelihood is that it is independent of the unobserved risk factors, making it suitable for estimating the marginal penetrances of carriers and noncarriers.

The assumption (i) is the key assumption, which is the motivation for this study. The assumptions (ii) and (iii) are commonly seen in the literature and are used to derive the conditional mutation distribution of a proband's relatives. The assumption (iv) is a technical one, and our simulation study shows that the performance of the proposed approach is acceptable even when the disease is common, with a prevalence of 0.1. The assumption (v) is equivalent to the multiplicative model for multiple risk factors (see e.g. Gail and others, 2008; Yu and others, 2009). In particular, the following log-linear model satisfies the assumption (v):

$$P(d = 1 \mid g, r) = c_2 \exp\{a g + b^{\tau} r\},$$

where $c_2$ is a constant and $a$ and $b$ are regression parameters. Throughout this paper, $\tau$ stands for the transpose of a vector. Notice that we do not assume any correlation structure for the unobserved risk factors of family members. Furthermore, the unobserved risk factors can be of any type: discrete or continuous, environmental or genetic.

2.3 Identifiability of $f_1$ and $f_0$

When genotypes are available only for the unaffected relatives of case probands, we see from the likelihood function (2.2) that the penetrances $f_1$ and $f_0$ are not identifiable. However, the 2 penetrances $f_1$ and $f_0$ are identifiable when at least 1 affected relative and 1 unaffected relative are genotyped, provided that $f_1 > f_0 > 0$. In fact, there is a one-to-one relationship between the penetrances $\{f_1, f_0\}$ and the estimable parameters $\{p_1, p_0\}$ when $f_1 > f_0 > 0$. This is different from the situation in the standard case-control design, where only the relative risk $f_1/f_0$ is identifiable. Notice that in our case family design, the retrospective likelihood function is conditioned on the mutation status of the proband and on disease status. This additional conditioning, together with the assumption of a rare mutation, makes both $f_1$ and $f_0$ identifiable. Note also that $f_0$ and $f_1$ are not identifiable when $f_1 = f_0$, but this is not a problem since the major purpose of our case family design is to estimate the penetrance function of a known risk mutation with $f_1 > f_0$.

2.4 Maximum likelihood estimates

Denote $A = \sum_{i=1}^{I} a_i$, $B = \sum_{i=1}^{I} b_i$, $N_1 = \sum_{i=1}^{I} n_{1i}$, and $N_0 = \sum_{i=1}^{I} n_{0i}$. Then the overall likelihood can be written as

$$L(f_0, f_1) = p_1^{A}(1 - p_1)^{N_1 - A}\, p_0^{B}(1 - p_0)^{N_0 - B}, \quad \text{with } p_1 = \frac{f_1}{f_1 + f_0} \text{ and } p_0 = \frac{1 - f_1}{2 - f_1 - f_0}. \tag{2.3}$$

Since the above likelihood function is the product of 2 binomial likelihood functions, the MLEs of $p_1$ and $p_0$ are $\hat{p}_1 = A/N_1$ and $\hat{p}_0 = B/N_0$, respectively.
Therefore, the MLEs of $f_1$ and $f_0$ are, respectively,

$$\hat{f}_1 = \frac{\hat{p}_1(1 - 2\hat{p}_0)}{\hat{p}_1 - \hat{p}_0} \quad \text{and} \quad \hat{f}_0 = \frac{(1 - \hat{p}_1)(1 - 2\hat{p}_0)}{\hat{p}_1 - \hat{p}_0}, \quad \text{with } \hat{p}_1 = \frac{A}{N_1} \text{ and } \hat{p}_0 = \frac{B}{N_0}, \tag{2.4}$$

or equivalently,

$$\hat{f}_1 = \frac{A N_0 - 2AB}{A N_0 - B N_1} \quad \text{and} \quad \hat{f}_0 = \frac{N_1 N_0 - A N_0 - 2 B N_1 + 2AB}{A N_0 - B N_1}. \tag{2.5}$$

When $f_0 = 0$, the MLE of $f_1$ is $(1 - 2B/N_0)/(1 - B/N_0)$. This estimator is simpler than that of Wang and others (2006) since their method needs to estimate an additional offset for each family. When $f_0$ is not equal to 0, using $(1 - 2B/N_0)/(1 - B/N_0)$ as an estimator of $f_1$ could produce considerable bias. For example, if $f_0 = 0.1$ and $f_1 = 0.2$, then $p_0 = 0.8/1.7 = 8/17$, so the estimator $(1 - 2B/N_0)/(1 - B/N_0)$ converges to $(1 - 2p_0)/(1 - p_0) = 1/9$ as the sample size goes to infinity, and the relative bias (Rbias) is $(1/9 - 0.2)/0.2 = -4/9$. If all the affected relatives are carriers so that $N_1 = A$, then the MLEs of $f_0$ and $f_1$ are 0 and $(1 - 2B/N_0)/(1 - B/N_0)$, respectively. This confirms the fact that the affected relatives provide no information on $f_1$ when $f_0 = 0$, as was mentioned in Section 2.2.

With a large sample size, the MLEs $\hat{f}_1$ and $\hat{f}_0$ converge to $f_1$ and $f_0$, respectively, so that they asymptotically lie within the interval $[0, 1]$. When the sample size is not large enough, however, the 2 estimates could be negative or greater than 1. In such a situation, we can estimate the penetrances by adding the constraint $0 \le f_0, f_1 \le 1$.
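The closed-form estimators in (2.5) and the bias example above can be checked numerically. The following Python sketch is illustrative only: the function names `penetrance_mle` and `naive_f1` are hypothetical, and the counts used in the usage note are chosen by hand so that $\hat{p}_1 = 2/3$ and $\hat{p}_0 = 8/17$, the values implied by $f_0 = 0.1$ and $f_1 = 0.2$.

```python
def penetrance_mle(A, N1, B, N0):
    """Unconstrained MLEs (f0_hat, f1_hat) from the pooled counts, as in (2.5).

    A, N1: carriers among affected relatives, number of affected relatives.
    B, N0: carriers among unaffected relatives, number of unaffected relatives.
    The estimates are not forced into [0, 1].
    """
    denom = A * N0 - B * N1
    if denom == 0:
        raise ValueError("p1_hat equals p0_hat: the penetrances are not identifiable")
    f1_hat = (A * N0 - 2 * A * B) / denom
    f0_hat = (N1 * N0 - A * N0 - 2 * B * N1 + 2 * A * B) / denom
    return f0_hat, f1_hat


def naive_f1(B, N0):
    """Estimator of f1 that assumes f0 = 0 and uses unaffected relatives only."""
    p0_hat = B / N0
    return (1.0 - 2.0 * p0_hat) / (1.0 - p0_hat)
```

With `A, N1, B, N0 = 2, 3, 8, 17`, `penetrance_mle` returns 0.1 and 0.2 for $\hat{f}_0$ and $\hat{f}_1$, whereas `naive_f1(8, 17)` returns 1/9, reproducing the relative bias of $-4/9$ computed in the text.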

2.5 Hypothesis testing and confidence interval

It is of interest to test the null hypothesis that the mutation has no effect on the disease ($f_0 = f_1$), provided that the genotypes of some affected relatives are available. To test this null hypothesis, we can construct a likelihood ratio test. Since the common penetrance under the null hypothesis is not identifiable, the limiting null distribution of the likelihood ratio statistic is no longer a standard chi-square distribution. To assess the significance of the likelihood ratio test statistic, we can adopt a permutation test by permuting the disease status of the relatives. The confidence intervals of the penetrances can be constructed based on the asymptotic normality of the MLEs, with the variance-covariance matrix of the MLEs being estimated by the inverse of the observed information matrix.

3. AGE-DEPENDENT PENETRANCES

3.1 Notation

In most situations, the penetrances depend on age, and we are interested in estimating age-dependent penetrances. Suppose that we observe the ages at diagnosis for all the relatives and the ages at onset for those affected individuals. We will take this information into account in the evaluation of the age-dependent penetrances. For the $i$th proband, suppose the information on the phenotypes and genotypes of $n_i$ relatives is collected. Let the genotype and affection status of the $j$th relative (the zeroth relative is the case proband) of the $i$th case proband be coded by $g_{ij}$ and $d_{ij}$, respectively. That is, $g_{ij} = 1$ if the $j$th relative is a carrier and 0 otherwise, and $d_{ij} = 1$ if the $j$th relative is affected and 0 otherwise. Let $a_{ij}$ and $t_{ij}$ be the current age and the age at onset of the $j$th relative, respectively ($t_{ij}$ is an unobserved value greater than $a_{ij}$ if the $j$th relative is unaffected). Let $y_{ij} = \min\{t_{ij}, a_{ij}\}$.
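Returning briefly to the permutation test of Section 2.5, the Python sketch below shows one way it could be organized. Several choices here are assumptions rather than details given in the paper: the statistic compares the unrestricted binomial MLEs $\hat{p}_1 = A/N_1$ and $\hat{p}_0 = B/N_0$ with the null values $p_1 = p_0 = 1/2$ (which is what $f_0 = f_1$ implies in (2.2)), disease labels are permuted over all relatives pooled across families, and the names `lrt_statistic` and `permutation_pvalue` are hypothetical.

```python
import math
import random


def _binom_loglik(k, n, p):
    """Binomial log-likelihood kernel with the convention 0 * log(0) = 0."""
    out = 0.0
    if k > 0:
        out += k * math.log(p)
    if n - k > 0:
        out += (n - k) * math.log(1.0 - p)
    return out


def lrt_statistic(genotypes, affected):
    """Likelihood ratio statistic for f0 = f1, i.e. p1 = p0 = 1/2 in (2.3).

    genotypes, affected: equal-length lists of 0/1 over the pooled relatives.
    Requires at least one affected and one unaffected relative.
    """
    n1 = sum(affected)
    n0 = len(affected) - n1
    a = sum(g for g, d in zip(genotypes, affected) if d == 1)
    b = sum(genotypes) - a
    loglik_alt = _binom_loglik(a, n1, a / n1) + _binom_loglik(b, n0, b / n0)
    loglik_null = (n1 + n0) * math.log(0.5)
    return 2.0 * (loglik_alt - loglik_null)


def permutation_pvalue(genotypes, affected, n_perm=999, seed=0):
    """Permutation P-value obtained by shuffling the disease labels."""
    rng = random.Random(seed)
    observed = lrt_statistic(genotypes, affected)
    shuffled = list(affected)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if lrt_statistic(genotypes, shuffled) >= observed - 1e-12:
            exceed += 1
    # The observed labelling counts as one permutation.
    return (exceed + 1) / (n_perm + 1)
```

Whether labels should be shuffled within families or across all relatives depends on the correlation structure one is willing to assume; the pooled version above is the simplest variant, not necessarily the one used by the authors.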
3.2 Likelihood function

We can formulate a conditional likelihood for the $i$th family's data as

$$P(g_i \mid d_i, y_i, g_{i0} = 1, d_{i0} = 1, y_{i0}, a_i, a_{i0}),$$

where $g_i = (g_{i1}, \ldots, g_{i n_i})$, $d_i = (d_{i1}, \ldots, d_{i n_i})$, $y_i = (y_{i1}, \ldots, y_{i n_i})$, and $a_i = (a_{i1}, \ldots, a_{i n_i})$. To derive the likelihood function, we need the following assumption, corresponding to the assumption (v) for the age-independent penetrances:

(v') There is no interaction effect between the study mutation and the unobserved risk factors, that is, the density function $p(t \mid g, r)$ of the age at onset given the study mutation $g$ and unobserved risk factors $r$ satisfies the relationship

$$p(t \mid g = 1, r) = c_3\, p(t \mid g = 0, r), \tag{3.1}$$

where $c_3$ is a constant.

Under Cox's proportional hazards model (Cox, 1972), the hazard function is multiplicative with respect to $g$ and $r$ if there is no interaction effect. Therefore, the Cox model together with the rare disease assumption implies the assumption (v'), since the density function is approximately equal to the hazard function under these assumptions. Under the assumptions (i)-(iv) and (v'), we can show that the overall likelihood can be approximated by

$$L = \prod_{i=1}^{I} P(g_i \mid d_i, y_i, g_{i0} = 1, d_{i0} = 1, y_{i0}, a_i, a_{i0}) = \prod_{i=1}^{I} \prod_{j=1}^{n_i} \frac{\lambda^{d_{ij}}(y_{ij} \mid g_{ij})\, S(y_{ij} \mid g_{ij})}{\sum_{g=0}^{1} \lambda^{d_{ij}}(y_{ij} \mid g)\, S(y_{ij} \mid g)}, \tag{3.2}$$

where $\lambda(\cdot \mid g)$ and $S(\cdot \mid g)$ are, respectively, the hazard function and survival function of the age at onset of individuals carrying genotype $g$. The derivation of (3.2) is similar to that of (2.2) and is omitted. We can assume a suitable functional form for $\lambda(t \mid g)$. For example, under the given assumptions, the joint proportional hazards model implies a marginal proportional hazards function

$$\lambda(t \mid g) = \lambda_0(t; \eta)\, e^{\beta g}, \tag{3.3}$$

where $\lambda_0(t; \eta)$ is the baseline hazard function, known up to a parameter vector $\eta$ of finite dimension. If only unaffected relatives are genotyped, then the likelihood function (3.2) reduces to

$$\prod_{i=1}^{I} \prod_{j=1}^{n_i} \frac{S(y_{ij} \mid g_{ij})}{S(y_{ij} \mid 0) + S(y_{ij} \mid 1)}. \tag{3.4}$$

It can be shown that $S(\cdot \mid 1)$ and $S(\cdot \mid 0)$ are not identifiable in (3.4), as in Section 2.3. For a rare disease, one can assume that the penetrance of noncarriers is nearly zero so that $S(y \mid 0) \approx 1$, and the likelihood function is approximately

$$\prod_{i=1}^{I} \prod_{j=1}^{n_i} \left( \frac{S(y_{ij} \mid 1)}{1 + S(y_{ij} \mid 1)} \right)^{g_{ij}} \left( \frac{1}{1 + S(y_{ij} \mid 1)} \right)^{1 - g_{ij}}, \tag{3.5}$$

which is equivalent to model (3) of Olschwang and others (2009). Making the additional assumption of a Weibull form for the survival function $S(y \mid 1)$ yields the logistic regression model given by (5) of Olschwang and others (2009).

3.3 MLE, hypothesis testing, and confidence interval

The MLEs of the unknown parameters can be obtained by the Newton-Raphson algorithm or any other optimization algorithm. To examine whether the study mutation has an effect on the disease, we can test the null hypothesis $\beta = 0$ using either the likelihood ratio test or the Wald test, where $\beta$ is given in (3.3). We can also estimate the variances of the MLEs and construct confidence intervals for the unknown parameters based on large-sample theory.

4. COVARIATES AND MULTIPLE MUTATIONS ADJUSTMENT

In many real applications, we might be interested in comparing penetrances between 2 groups, for example, male versus female.
Also, when there are multiple known disease-causing mutations involved, we are interested in comparing the penetrances among multiple mutations. An example will be given in Section 6. We can extend the previous likelihood functions further to adjust for covariates and multiple disease-causing mutations. In the following, we illustrate how to incorporate covariates and multiple mutations in the situation where the genotypes from both affected and unaffected relatives are available.

4.1 Likelihood function

Assume that a covariate vector $Z$ is observed for each relative. Then we can incorporate the covariate effects in a proportional hazards model:

$$\Lambda(t \mid g, Z) = \Lambda_0(t; \eta)\, e^{\beta g + \gamma^{\tau} Z}, \tag{4.1}$$

where $\Lambda(t \mid g, Z)$ is the cumulative hazard function of the age at onset given covariate $Z$ and genotype $g$, and $\Lambda_0(t; \eta)$ is the baseline cumulative hazard function corresponding to $g = 0$ and $Z = 0$, which is known up to a parameter vector $\eta$ of finite dimension. The likelihood function is therefore approximately

$$\prod_{i=1}^{I} \prod_{j=1}^{n_i} \frac{\exp\{(\beta g_{ij} + \gamma^{\tau} Z_{ij}) d_{ij} - \Lambda_0(y_{ij}; \eta)\, e^{\beta g_{ij} + \gamma^{\tau} Z_{ij}}\}}{\sum_{g=0}^{1} \exp\{(\beta g + \gamma^{\tau} Z_{ij}) d_{ij} - \Lambda_0(y_{ij}; \eta)\, e^{\beta g + \gamma^{\tau} Z_{ij}}\}}.$$

Suppose $K$ types of disease-causing rare mutations are considered. We assume that each family can have at most one type of mutation segregating. Let $\delta_i$ be a mutation indicator, that is, $\delta_i = k$ if the $i$th case proband has the $k$th type of mutation. We assume that the cumulative hazard functions of these risk mutations are proportional:

$$\Lambda_k(t) = \Lambda_0(t; \eta)\, e^{\beta_k}, \quad k = 1, \ldots, K, \tag{4.2}$$

where $\beta_1 = 0$. The approximated likelihood function can be written as

$$\prod_{i=1}^{I} \prod_{j=1}^{n_i} \frac{\exp\{(\sum_{k=1}^{K} 1_k(\delta_i) \beta_k g_{ij}) d_{ij} - \Lambda_0(y_{ij}; \eta)\, e^{\sum_{k=1}^{K} 1_k(\delta_i) \beta_k g_{ij}}\}}{\sum_{g=0}^{1} \exp\{(\sum_{k=1}^{K} 1_k(\delta_i) \beta_k g) d_{ij} - \Lambda_0(y_{ij}; \eta)\, e^{\sum_{k=1}^{K} 1_k(\delta_i) \beta_k g}\}}, \tag{4.3}$$

where $1_k(\delta_i)$ is an indicator function taking value 1 if $\delta_i = k$ and 0 otherwise. Refer to Appendix B of the supplementary material (available at Biostatistics online) for the derivation of (4.3). This expression shows that the mutation type behaves as a family-shared categorical covariate.

4.2 MLE, confidence interval, and hypothesis testing

The MLEs of $\eta$, $\beta$, $\gamma$, and $\beta_k$, $k = 2, \ldots, K$, in (4.1) and (4.2) can be obtained using the Newton-Raphson algorithm or any other optimization algorithm. The variance estimates of the MLEs and the confidence intervals of the unknown parameters can be obtained as before. It is of interest to compare the penetrances for various disease-causing mutations, which can be done by the standard likelihood ratio test based on the likelihood (4.3).
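To make the covariate-adjusted likelihood of Section 4.1 concrete, the Python sketch below evaluates the log-likelihood contribution of a single relative under model (4.1). It assumes a Weibull baseline cumulative hazard $\Lambda_0(t; \eta) = (t/e^{\psi})^{\xi}$ with $\eta = (\psi, \xi)$, which is one possible parametric choice and an assumption of this example; the function names and the record layout are likewise hypothetical.

```python
import math


def weibull_cumhaz(t, psi, xi):
    """Assumed Weibull baseline cumulative hazard (t / e^psi)^xi."""
    return (t / math.exp(psi)) ** xi


def relative_loglik(g, d, y, z, beta, gamma, psi, xi):
    """Log-likelihood contribution of one relative under (4.1).

    g: genotype (1 carrier, 0 noncarrier); d: disease status (1 affected);
    y: min(age at onset, current age); z: list of covariates;
    beta, gamma: regression parameters; psi, xi: baseline parameters.
    """
    cumhaz = weibull_cumhaz(y, psi, xi)
    gamma_z = sum(c * v for c, v in zip(gamma, z))

    def exponent(genotype):
        lin = beta * genotype + gamma_z
        return lin * d - cumhaz * math.exp(lin)

    # Numerator uses the observed genotype; denominator sums over g = 0, 1.
    log_den = math.log(math.exp(exponent(0)) + math.exp(exponent(1)))
    return exponent(g) - log_den


def total_loglik(records, beta, gamma, psi, xi):
    """Sum of contributions over relatives; records are (g, d, y, z) tuples."""
    return sum(relative_loglik(g, d, y, z, beta, gamma, psi, xi)
               for g, d, y, z in records)
```

Two sanity checks follow directly from the formula: with $\beta = 0$ every relative contributes $\log(1/2)$, and for fixed $(d, y, Z)$ the contributions for $g = 0$ and $g = 1$ exponentiate to probabilities that sum to 1. A general-purpose optimizer can then be applied to the negative of `total_loglik` to obtain the MLEs.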
The proportionality of the hazard functions can also be tested by a likelihood ratio test, with the alternative hypothesis being that the mutations have their own specific penetrance functions. If the null hypothesis that the penetrance functions are proportional is not rejected, we can apply the proportional hazards model (4.2); otherwise, we need to estimate mutation-specific penetrance functions.

5. SIMULATION STUDIES

We conducted simulation studies to assess the performance of the proposed approach. First, we studied the age-independent penetrances. We assumed 2 independent disease-related single-nucleotide polymorphisms (SNPs): one is the study mutation with minor allele frequency (MAF) 0.01 or 0.001 and the other is unobserved with MAF 0.2. We assumed a dominant mode of inheritance for both SNPs. The disease and risk factors were related by a logistic regression model:

$$P(d = 1 \mid g, r) = \frac{\exp\{a + b g + \log(\mathrm{OR})\, r\}}{1 + \exp\{a + b g + \log(\mathrm{OR})\, r\}}, \tag{5.1}$$

where $g$ ($r$) is 1 if the genotype of the study SNP (unobserved SNP) is of higher risk and 0 otherwise, and OR is the odds ratio parameter for the unobserved SNP, which takes value 1 or 2. The marginal penetrance $f_1$ for carriers was fixed at 0.5 and the other penetrance $f_0$ was 0.03 or 0.1. The values of the log-OR parameters $a$ and $b$ were determined by the other parameters. The genotypes of parents were generated under Hardy-Weinberg equilibrium and random mating, and the genotypes of offspring were independently generated given the parental genotypes. From a large number of generated families with 3 offspring, we randomly selected families with the first offspring being affected and carrying the study mutation and treated them as the source population from which the study sample was collected. A sample of size 200, 500, or 1000 was drawn from this population and simulation results based on replications were produced. Reported in Table 1 are the Rbias of the estimates, defined as the mean estimated penetrance divided by the true penetrance minus 1, the empirical standard errors (SE) and mean estimated standard errors (SEE) of the estimates, and the empirical coverage probabilities (ECP) of the penetrances.

Table 1. Age-independent penetrance estimates. Columns: MAF (MAF of the study mutation), $f_0$ (true value of the penetrance $f_0$), OR (OR parameter of the unobserved risk factor), and, for each of $\hat{f}_0$ and $\hat{f}_1$, Rbias (mean MLE divided by the true penetrance, minus 1), SE, SEE, and ECP (95% empirical coverage probability). Panels: number of cases = number of controls = 200, 500, and 1000.

Overall, the estimates have minor bias when the disease is rare ($f_0 = 0.03$) and the study mutation is rare (MAF = 0.001), with absolute relative biases of no more than 1.2%. A common disease ($f_0 = 0.1$), an increased MAF (0.01) of the study mutation, and a positive effect of the unobserved mutation (OR = 2) have only a small impact on the estimates, with Rbias ranging from -7.9% to 3.4%. In all situations, the SEE are very close to the empirical SE. The relative bias tends to be stable and remains small when the sample size increases. We also estimated the penetrance of carriers using only the genotypes of unaffected relatives by assuming zero penetrance of noncarriers. The resulting Rbias is generally small when $f_0 = 0.03$ but can become considerable when $f_0 = 0.1$ (results not shown). It is also seen from Table 1 that the Rbias for a mutation with MAF = 0.001 tends to be smaller than that observed for a mutation with MAF = 0.01. Additional simulation results show that the relative biases get larger when the MAF increases. For example, an MAF of 0.03 produces relative biases in the range 10.3%-23.7%, and an MAF of 0.1 produces relative biases in the range 41.9%-65.1%, with the other parameters the same as those in Table 1. It appears that the proposed approach is suitable for rare mutations with small MAF.

Second, we studied the proposed approach when the penetrance is age dependent. We generated data from the following Cox model with Weibull baseline hazard function:

$$\lambda(t \mid g, r) = (t/e^{\psi})^{\xi}\, e^{\beta g + \log(\mathrm{OR})\, r}, \tag{5.2}$$

where $g$ and $r$ are the same as those in (5.1) with the same MAFs. The OR was fixed at 1 or 2.
The other parameters $\xi$, $\psi$, and $\beta$ were determined by 3 cumulative risk probabilities: $p_{30,0} = P(T \le 30 \mid g = 0)$, $p_{60,0} = P(T \le 60 \mid g = 0)$, and $p_{60,1} = P(T \le 60 \mid g = 1)$, where $T$ is the age at onset. To mimic a common disease, we set $p_{30,0} = 0.03$ and $p_{60,0} = 0.09$; to mimic a rare disease, we set $p_{30,0} = 0.01$ and $p_{60,0} =$ …. In both situations, we set $p_{60,1} = 0.5$. The ages of the relatives of a proband were generated from the uniform distribution on the interval $(a - 5, a + 5)$, where $a$ is the current age of the proband, itself uniformly distributed on the interval $(20, 70)$. The ages, genotypes, and disease status were generated for a large number of families similarly to the age-independent situation. In each family, data were generated for 2 parents and 3 offspring. Altogether, families with 1 affected proband (the first offspring) carrying the mutation in each family were obtained. From these families, we sampled 400 or 1000 families and estimated $\xi$, $\psi$, and $\beta$ in model (5.2) by ignoring the unobserved mutation. Substituting the estimated parameters gave the estimates of the marginal survival functions of carriers and noncarriers. Based on 5000 replications, we calculated the mean estimated survival functions of both carriers and noncarriers and the 90% confidence intervals of the survival functions. Presented in Figures 1 and 2 are the results for carriers and noncarriers, respectively, with sample size 1000 and OR = 1 (the unobserved mutation plays no role in the disease). We can see that the bias of the estimates reduces dramatically when the MAF of the study mutation decreases from 0.01 to 0.001, showing that the approximation of the likelihood function works well for a relatively rare mutation. When the disease becomes common, the proposed method using both affected and unaffected relatives does not produce extra bias. However, the method that uses only unaffected relatives has much larger bias for a common disease.
This extra bias is due to the improper assumption of a zero penetrance function for noncarriers when the disease is common. Other results for sample size 400 or OR = 2 are presented in Figures s1-s6 of the supplementary material (available at Biostatistics online). In summary, the bias of the penetrance function estimates gets smaller as the sample size increases. The positive effect of the unobserved mutation (OR = 2) has only a limited impact on the penetrance function estimates. In particular, the impact is minimal when the MAF of the study mutation is small and the disease is rare.
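A much reduced version of the age-independent simulation can be sketched in Python to see the estimator (2.5) at work. This is not the authors' simulation code: it omits the unobserved SNP (equivalent to OR = 1), draws disease status directly from the penetrances $f_0$ and $f_1$ instead of model (5.1), and, purely for speed, samples parental genotypes conditional on at least one mutant allele among the 4 parental alleles, which does not change the distribution of the selected families. All function names are hypothetical.

```python
import math
import random


def penetrance_mle(A, N1, B, N0):
    """Unconstrained MLEs (f0_hat, f1_hat) from pooled counts, as in (2.5)."""
    denom = A * N0 - B * N1
    f1_hat = (A * N0 - 2 * A * B) / denom
    f0_hat = (N1 * N0 - A * N0 - 2 * B * N1 + 2 * A * B) / denom
    return f0_hat, f1_hat


def simulate_counts(n_probands, f0, f1, maf, rng):
    """Simulate case families with 3 offspring; return (A, N1, B, N0).

    A family is kept when its first offspring is an affected carrier; the
    other 2 offspring are the genotyped relatives.
    """
    ks = (1, 2, 3, 4)
    # Number of mutant alleles among the 4 parental alleles, given at least 1.
    weights = [math.comb(4, k) * maf ** k * (1 - maf) ** (4 - k) for k in ks]
    A = N1 = B = N0 = 0
    kept = 0
    while kept < n_probands:
        k = rng.choices(ks, weights=weights)[0]
        alleles = [0, 0, 0, 0]  # positions 0-1: father, 2-3: mother
        for pos in rng.sample(range(4), k):
            alleles[pos] = 1
        offspring = []
        for _ in range(3):
            paternal = alleles[rng.randrange(2)]
            maternal = alleles[2 + rng.randrange(2)]
            g = 1 if (paternal or maternal) else 0  # dominant carrier status
            d = 1 if rng.random() < (f1 if g else f0) else 0
            offspring.append((g, d))
        if offspring[0] != (1, 1):
            continue  # proband must be an affected carrier
        kept += 1
        for g, d in offspring[1:]:
            if d:
                N1 += 1
                A += g
            else:
                N0 += 1
                B += g
    return A, N1, B, N0
```

For example, `penetrance_mle(*simulate_counts(4000, 0.1, 0.5, 0.001, random.Random(2010)))` should return estimates close to the true values 0.1 and 0.5; increasing `maf` makes the rare-mutation approximation behind (2.2) less accurate.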

Fig. 1. Estimated survival functions of carriers with sample size 1000 and OR = 1. "Common mutation": MAF = 0.01; "rare mutation": MAF = 0.001; "common disease": $P(T \le 30 \mid g = 0) = 0.03$ and $P(T \le 60 \mid g = 0) = 0.09$; "rare disease": $P(T \le 30 \mid g = 0) = 0.01$ and $P(T \le 60 \mid g = 0) =$ ….

Fig. 2. Estimated survival functions of noncarriers with sample size 1000 and OR = 1. "Common mutation": MAF = 0.01; "rare mutation": MAF = 0.001; "common disease": $P(T \le 30 \mid g = 0) = 0.03$ and $P(T \le 60 \mid g = 0) = 0.09$; "rare disease": $P(T \le 30 \mid g = 0) = 0.01$ and $P(T \le 60 \mid g = 0) =$ ….

Finally, we examined the robustness of the approach to the specification of the baseline hazard function. Our simulation studies showed that misspecification of the baseline hazard function could result in bias, with its magnitude depending on the true and misspecified functions. Here, we do not present the simulation results but briefly summarize them. If the true baseline hazard function was gamma, Weibull, or log-normal but was misspecified as either of the other 2 functions, then the resulting penetrance estimate had small bias; if the baseline hazard function was piecewise constant but was misspecified as Weibull, then the bias could be relatively large.

6. APPLICATION TO A STUDY OF LYNCH SYNDROME

We applied the proposed approach to a study of Lynch syndrome (Olschwang and others, 2009). In this study, the carriers were identified in 8 genetic units of France and Switzerland. These units offered germline analysis of the MSH2 and MLH1 genes. A retrospective questionnaire was used to collect information on asymptomatic first-degree relatives of carriers. The collected information includes the type of disease-causing germline mutation identified in the proband, birth date, sex, and age at genetic diagnosis. The presence or absence of the disease-causing mutation was then assessed in these relatives. Phenotypes and genotypes from 856 asymptomatic first-degree relatives of MSH2 or MLH1 carriers were collected from those 8 centers. For each relative, the gender and mutation status at genes MSH2 and MLH1 were obtained, as summarized in Table 2. Furthermore, the ages of the relatives were available, so that we could estimate the age-dependent penetrances. With pooled data, we assumed a Weibull survival function $(t/e^{\psi})^{\xi}$ for carriers. With gender or mutation type adjusted, we assumed a proportional hazards model with survival function $(t/e^{\psi})^{\xi} e^{\beta_1 x_1}$ or $(t/e^{\psi})^{\xi} e^{\beta_2 x_2}$. Here, $x_1 = 1$ if male and 0 if female, and $x_2 = 1$ if MLH1 and 0 if MSH2. We obtained the MLEs of the unknown parameters ($\psi$, $\xi$, $\beta_1$, and $\beta_2$), the estimated SE of the MLEs, and 95% confidence intervals of the parameters. The estimation and hypothesis-testing results are presented in Table 3.

Table 2. Summary of genotypes. Columns: mutation absent (MLH1, MSH2) and mutation present (MLH1, MSH2); rows: males and females.

Table 3. Estimates of the parameters for the Lynch syndrome data

Parameter    CI
Pooled data
  ψ          (4.020, 4.314)
  ξ          (1.266, 4.614)
Adjusted by gender (β_1: regression parameter)
  ψ          (4.010, 4.507)
  ξ          (1.184, 4.520)
  β_1        (-0.267, 1.333)
Adjusted by gene type (β_2: regression parameter)
  ψ          (3.959, 4.487)
  ξ          (1.242, 4.503)
  β_2        (-0.568, 1.085)

The parameters are defined in Section 6. MLE: estimate of the unknown parameter; SE: estimated standard error of the MLE; CI: confidence interval of the unknown parameter; LRT: likelihood ratio test statistic; P-value: P-value of the likelihood ratio test.

The estimated survival function together with its confidence interval curves based on 5000 bootstrap replicates (Efron and Tibshirani, 1993) is plotted in Figure s7 of the supplementary material (available at Biostatistics online). The penetrance difference between males and females was moderately large, the penetrance difference between the 2 genes was minor, and neither difference was statistically significant (the P-value for the difference between genes was 0.515). These results are consistent with those of Olschwang and others (2009). In this example, we fitted a more general Weibull baseline hazard function with 2 parameters, while Olschwang and others (2009) fitted an exponential baseline hazard function with a threshold value.

7. DISCUSSION

A precise estimation of the age-dependent risk for people carrying disease-causing mutations would have a tremendous impact on public health and is instrumental in the counseling of individuals who are identified by genetic testing as carriers and who are faced with different options for cancer prevention or early detection. We provide a rigorous statistical inference framework for the evaluation of the penetrance of a rare mutation. The approach can handle both covariates and multiple rare mutations.
It is helpful to check the parametric assumption of the baseline hazard function. Because the design is retrospective and the observations are subject to censoring, rigorously checking the parametric assumption is a great challenge. In practice, one can try some commonly used parametric baselines and choose the one with the largest likelihood. This technique, however, could be misleading if the true baseline is very different from the candidates considered. A nonparametric approach that does not assume any parametric baseline is much more desirable, although it raises some computational and theoretical issues. We will pursue this in future research.

The proposed approach allows for unobserved risk factors that are correlated among family members, provided that there is no interaction effect between the study mutation and the unobserved risk factors. When such an interaction is present, the proposed approach can produce considerable bias in the penetrance estimates. More advanced methods such as frailty models (for example, Hsu and others, 2004; Hsu and Gorfine, 2006) could help resolve this problem. The development of such an inference procedure is still under way.

When the disease is not rare, as demonstrated in Section 2.4 and the simulation studies, assuming zero penetrance for noncarriers can produce considerable bias in the penetrance estimate for carriers. In such situations, genotypes from affected relatives help to improve estimation with the proposed approach. Therefore, it is important to collect genotype information from both affected and unaffected relatives when adopting such a case family design for penetrance estimation. We hope that our proposed method will make this potentially very useful design more accessible for future studies of rare mutations.

SUPPLEMENTARY MATERIAL

Supplementary material is available at Biostatistics online.

ACKNOWLEDGMENTS

We would like to thank Dr. Gilles Thomas for helpful discussions, Dr B. J. Stone for editorial help, and Drs C. Lasset, Q. Wang, P. Hutter, M. P. Buisine, R. Etienne, C. Caron, V. Bourdon, and S. Baert-Desurmont for data collection. Conflict of Interest: None declared.
FUNDING

Intramural Program of the National Institutes of Health to H.Z. and K.Y.; Natural Science Foundation of China ( ) to H.Z.; Institut National du Cancer to S.O.

REFERENCES

CHATTERJEE, N., KALAYLIOGLU, Z., SHIH, J. H. AND GAIL, M. H. (2006). Case-control and case-only designs with genotype and family history data: estimating relative risk, residual familial aggregation, and cumulative risk. Biometrics 62.

CHATTERJEE, N. AND WACHOLDER, S. (2001). A marginal likelihood approach for estimating penetrance from kin cohort designs. Biometrics 57.

COX, D. R. (1972). Regression models and life-tables. Journal of the Royal Statistical Society, Series B (Methodological) 34.

EFRON, B. AND TIBSHIRANI, R. (1993). An Introduction to the Bootstrap. New York: Chapman & Hall.

GAIL, M. H., PEE, D., BENICHOU, J. AND CARROLL, R. (1999). Designing studies to estimate the penetrance of an identified autosomal dominant mutation: cohort, case-control, and genotyped-proband designs. Genetic Epidemiology 16.

GAIL, M. H., PEE, D. AND CARROLL, R. (1999). Kin cohort designs for gene characterization. Journal of the National Cancer Institute. Monographs 26.

GAIL, M. H., PFEIFFER, R. M., WHEELER, W. AND PEE, D. (2008). Probability of detecting disease-associated single nucleotide polymorphisms in case-control genome-wide association studies. Biostatistics 9.

HSU, L., CHEN, L., GORFINE, M. AND MALONE, K. (2004). Semiparametric estimation of marginal hazard function from the case-control family studies. Biometrics 60.

HSU, L. AND GORFINE, M. (2006). Multivariate survival analysis for case-control family data. Biostatistics 7.

OLSCHWANG, S., YU, K., LASSET, C., BAERT-DESURMONT, S., BUISINE, M. P., WANG, Q., HUTTER, P., ROULEAU, E., CARON, O., BOURDON, V. AND OTHERS (2009). Age-dependent cancer risk is not different in between MSH2 and MLH1 mutation carriers. Journal of Cancer Epidemiology doi: /2009/

WACHOLDER, S., HARTGE, P., STRUEWING, J. P., PEE, D., MCADAMS, M., BRODY, L. AND TUCKER, M. (1998). The kin-cohort study for estimating penetrance. American Journal of Epidemiology 148.

WANG, Y., CLARK, L. N., MARDER, K. AND RABINOWITZ, D. (2007). Nonparametric estimation of age-at-onset distributions from censored kin-cohort data. Biometrika 94.

WANG, Y., OTTMAN, R. AND RABINOWITZ, D. (2006). A method for estimating penetrance from families sampled for linkage analysis. Biometrics 62.

YU, K., LI, Q., BERGEN, A. W., PFEIFFER, R. M., ROSENBERG, P. S., CAPORASO, N., KRAFT, P. AND CHATTERJEE, N. (2009). Pathway analysis by adaptive combination of P-values. Genetic Epidemiology 33.

[Received November 22, 2009; revised January 11, 2010; accepted for publication January 18, 2010]


[Part 2] Model Development for the Prediction of Survival Times using Longitudinal Measurements [Part 2] Model Development for the Prediction of Survival Times using Longitudinal Measurements Aasthaa Bansal PhD Pharmaceutical Outcomes Research & Policy Program University of Washington 69 Biomarkers

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

UNIVERSITÄT POTSDAM Institut für Mathematik

UNIVERSITÄT POTSDAM Institut für Mathematik UNIVERSITÄT POTSDAM Institut für Mathematik Testing the Acceleration Function in Life Time Models Hannelore Liero Matthias Liero Mathematische Statistik und Wahrscheinlichkeitstheorie Universität Potsdam

More information

Harvard University. Harvard University Biostatistics Working Paper Series

Harvard University. Harvard University Biostatistics Working Paper Series Harvard University Harvard University Biostatistics Working Paper Series Year 2014 Paper 174 Control Function Assisted IPW Estimation with a Secondary Outcome in Case-Control Studies Tamar Sofer Marilyn

More information

Quantile Regression for Residual Life and Empirical Likelihood

Quantile Regression for Residual Life and Empirical Likelihood Quantile Regression for Residual Life and Empirical Likelihood Mai Zhou email: mai@ms.uky.edu Department of Statistics, University of Kentucky, Lexington, KY 40506-0027, USA Jong-Hyeon Jeong email: jeong@nsabp.pitt.edu

More information

I Have the Power in QTL linkage: single and multilocus analysis

I Have the Power in QTL linkage: single and multilocus analysis I Have the Power in QTL linkage: single and multilocus analysis Benjamin Neale 1, Sir Shaun Purcell 2 & Pak Sham 13 1 SGDP, IoP, London, UK 2 Harvard School of Public Health, Cambridge, MA, USA 3 Department

More information

A unified framework for studying parameter identifiability and estimation in biased sampling designs

A unified framework for studying parameter identifiability and estimation in biased sampling designs Biometrika Advance Access published January 31, 2011 Biometrika (2011), pp. 1 13 C 2011 Biometrika Trust Printed in Great Britain doi: 10.1093/biomet/asq059 A unified framework for studying parameter identifiability

More information

Bustamante et al., Supplementary Nature Manuscript # 1 out of 9 Information #

Bustamante et al., Supplementary Nature Manuscript # 1 out of 9 Information # Bustamante et al., Supplementary Nature Manuscript # 1 out of 9 Details of PRF Methodology In the Poisson Random Field PRF) model, it is assumed that non-synonymous mutations at a given gene are either

More information

Ignoring the matching variables in cohort studies - when is it valid, and why?

Ignoring the matching variables in cohort studies - when is it valid, and why? Ignoring the matching variables in cohort studies - when is it valid, and why? Arvid Sjölander Abstract In observational studies of the effect of an exposure on an outcome, the exposure-outcome association

More information

11 Survival Analysis and Empirical Likelihood

11 Survival Analysis and Empirical Likelihood 11 Survival Analysis and Empirical Likelihood The first paper of empirical likelihood is actually about confidence intervals with the Kaplan-Meier estimator (Thomas and Grunkmeier 1979), i.e. deals with

More information

Combining dependent tests for linkage or association across multiple phenotypic traits

Combining dependent tests for linkage or association across multiple phenotypic traits Biostatistics (2003), 4, 2,pp. 223 229 Printed in Great Britain Combining dependent tests for linkage or association across multiple phenotypic traits XIN XU Program for Population Genetics, Harvard School

More information

Introduction to QTL mapping in model organisms

Introduction to QTL mapping in model organisms Introduction to QTL mapping in model organisms Karl W Broman Department of Biostatistics Johns Hopkins University kbroman@jhsph.edu www.biostat.jhsph.edu/ kbroman Outline Experiments and data Models ANOVA

More information

Preface Introduction to Statistics and Data Analysis Overview: Statistical Inference, Samples, Populations, and Experimental Design The Role of

Preface Introduction to Statistics and Data Analysis Overview: Statistical Inference, Samples, Populations, and Experimental Design The Role of Preface Introduction to Statistics and Data Analysis Overview: Statistical Inference, Samples, Populations, and Experimental Design The Role of Probability Sampling Procedures Collection of Data Measures

More information

Outline. Frailty modelling of Multivariate Survival Data. Clustered survival data. Clustered survival data

Outline. Frailty modelling of Multivariate Survival Data. Clustered survival data. Clustered survival data Outline Frailty modelling of Multivariate Survival Data Thomas Scheike ts@biostat.ku.dk Department of Biostatistics University of Copenhagen Marginal versus Frailty models. Two-stage frailty models: copula

More information

Harvard University. A Note on the Control Function Approach with an Instrumental Variable and a Binary Outcome. Eric Tchetgen Tchetgen

Harvard University. A Note on the Control Function Approach with an Instrumental Variable and a Binary Outcome. Eric Tchetgen Tchetgen Harvard University Harvard University Biostatistics Working Paper Series Year 2014 Paper 175 A Note on the Control Function Approach with an Instrumental Variable and a Binary Outcome Eric Tchetgen Tchetgen

More information

Constrained estimation for binary and survival data

Constrained estimation for binary and survival data Constrained estimation for binary and survival data Jeremy M. G. Taylor Yong Seok Park John D. Kalbfleisch Biostatistics, University of Michigan May, 2010 () Constrained estimation May, 2010 1 / 43 Outline

More information

Non-iterative, regression-based estimation of haplotype associations

Non-iterative, regression-based estimation of haplotype associations Non-iterative, regression-based estimation of haplotype associations Benjamin French, PhD Department of Biostatistics and Epidemiology University of Pennsylvania bcfrench@upenn.edu National Cancer Center

More information

Prerequisite: STATS 7 or STATS 8 or AP90 or (STATS 120A and STATS 120B and STATS 120C). AP90 with a minimum score of 3

Prerequisite: STATS 7 or STATS 8 or AP90 or (STATS 120A and STATS 120B and STATS 120C). AP90 with a minimum score of 3 University of California, Irvine 2017-2018 1 Statistics (STATS) Courses STATS 5. Seminar in Data Science. 1 Unit. An introduction to the field of Data Science; intended for entering freshman and transfers.

More information

Lecture 12: Effect modification, and confounding in logistic regression

Lecture 12: Effect modification, and confounding in logistic regression Lecture 12: Effect modification, and confounding in logistic regression Ani Manichaikul amanicha@jhsph.edu 4 May 2007 Today Categorical predictor create dummy variables just like for linear regression

More information

A Robust Test for Two-Stage Design in Genome-Wide Association Studies

A Robust Test for Two-Stage Design in Genome-Wide Association Studies Biometrics Supplementary Materials A Robust Test for Two-Stage Design in Genome-Wide Association Studies Minjung Kwak, Jungnam Joo and Gang Zheng Appendix A: Calculations of the thresholds D 1 and D The

More information

Econometric Analysis of Cross Section and Panel Data

Econometric Analysis of Cross Section and Panel Data Econometric Analysis of Cross Section and Panel Data Jeffrey M. Wooldridge / The MIT Press Cambridge, Massachusetts London, England Contents Preface Acknowledgments xvii xxiii I INTRODUCTION AND BACKGROUND

More information

Statistics in medicine

Statistics in medicine Statistics in medicine Lecture 4: and multivariable regression Fatma Shebl, MD, MS, MPH, PhD Assistant Professor Chronic Disease Epidemiology Department Yale School of Public Health Fatma.shebl@yale.edu

More information