Statistical inference on the penetrances of rare genetic mutations based on a case family design


Biostatistics (2010), 11, 3, pp. 519–532
doi:10.1093/biostatistics/kxq009
Advance Access publication on February 23, 2010

Statistical inference on the penetrances of rare genetic mutations based on a case family design

HONG ZHANG
Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD 20892, USA and Department of Statistics and Finance, University of Science and Technology of China, Hefei, Anhui 230026, People's Republic of China

SYLVIANE OLSCHWANG
Institut National de la Santé et de la Recherche Médicale (INSERM), Unité 891, Centre de Recherches en Cancérologie de Marseille, 13009 Marseille, France and Department of Oncogenetics, Institut Paoli-Calmettes, 13009 Marseille, France

KAI YU
Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD 20892, USA
yuka@mail.nih.gov

SUMMARY

We propose a formal statistical inference framework for the evaluation of the penetrance of a rare genetic mutation using family data generated under a kin cohort type of design, where phenotype and genotype information from first-degree relatives (sibs and/or offspring) of case probands carrying the targeted mutation are collected. Our approach is built upon a likelihood model with some minor assumptions, and it can be used for age-dependent penetrance estimation that permits adjustment for covariates. Furthermore, the derived likelihood allows unobserved risk factors that are correlated within family members. The validity of the approach is confirmed by simulation studies. We apply the proposed approach to estimating the age-dependent cancer risk among carriers of the MSH2 or MLH1 mutation.

Keywords: Case family design; Penetrance; Proportional hazards model; Rare mutation; Unobserved risk factors.

1. INTRODUCTION

An increasing number of mutations have been found to be associated with an elevated risk for various genetic disorders.
A precise estimation of the age-dependent risk for people carrying the disease-causing mutations is essential for defining prevention strategies and understanding the underlying mechanisms of the diseases. When a disease-causing mutation is identified, a precise estimation of its penetrance is possible using the kin cohort design (Wacholder and others, 1998), which has been studied extensively in the literature, for example, by Gail, Pee, Benichou, and Carroll (1999), Chatterjee and Wacholder (2001), Chatterjee and others (2006), and Wang and others (2007). Gail, Pee, and Carroll (1999) studied the advantages and disadvantages of the kin cohort design. They found that the kin cohort design has several practical advantages, including comparatively rapid execution, modest reductions in required sample sizes compared with cohort or case-control designs, and the ability to study the effects of an autosomal dominant mutation on several disease outcomes. The disadvantages include 2 sources of bias: a proband's decision to participate may be influenced by the disease status of his or her relatives, and the proband may be unable to recall the disease histories of relatives accurately.

To whom correspondence should be addressed. © The Author 2010. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org.

In a standard kin cohort design, a volunteer (either affected or unaffected) agrees to be genotyped, and the phenotype information on the disease histories of his or her first-degree relatives is obtained through a questionnaire. When the information on both phenotype and genotype for relatives is available, alternative approaches are needed in order to take full advantage of all available data while correcting for bias due to the effects of ascertainment. In this paper, we assume that all probands are affected carriers, though the proposed approach can be extended to include unaffected probands carrying the mutation.

Recently, Wang and others (2006) proposed a nonparametric method for estimating the penetrance of a rare mutation. Olschwang and others (2009) proposed an alternative parametric logistic regression model. Both approaches rely on the assumption that the penetrance of noncarriers is zero. This assumption might not hold for many genetic diseases, and the penetrance estimate can be severely biased when it is violated in real applications.
In this paper, we focus on rare mutations and aim at developing a rigorous statistical inference framework for the case family design. The main difference between this design and the standard kin cohort design is that the former collects information on both phenotypes and genotypes of the probands' relatives, while the latter simply collects the phenotypes of the relatives through a questionnaire. The assumption of zero penetrance for the noncarriers is not required for our approach. Furthermore, the proposed approach is based on a likelihood model conditioned on the phenotypes of all individuals; therefore, the derived estimate should not suffer from the biases mentioned for the kin cohort design. Covariates such as gender and ethnicity can be incorporated easily in our approach. Multiple rare mutations can also be handled in the context of the proposed conditional likelihood framework. The derivation of the conditional likelihood functions requires only minor assumptions. The maximum likelihood estimates (MLEs) can be obtained through standard optimization algorithms available in mathematical and statistical software. Statistical inferences, such as constructing confidence intervals and testing hypotheses for the parameters characterizing the penetrance, can be performed based on standard large-sample theory. The performance of the proposed approach is examined through simulation studies, which illustrate the desired properties of the approach. Finally, we demonstrate the proposed approach by applying it to a study of Lynch syndrome.

2. AGE-INDEPENDENT PENETRANCES

2.1 Notation

Throughout this paper, the mutations responsible for the disease of interest are assumed to be on an autosome.
In the case family design considered, some unrelated affected individuals (cases) collected from a case-control study are genotyped, and those cases carrying the study mutation are termed case probands; the first-degree relatives (sibs and/or offspring) of the case probands are interviewed for phenotyping and genotyping. To motivate our approach, we first focus on congenital or early-onset diseases that manifest before the ages at which subjects are ascertained. We want to estimate the age-independent penetrance of a known disease-causing mutation. Suppose some case probands are ascertained, and several first-degree relatives of each case proband are then collected for genotyping at the disease locus. Throughout this paper, we assume that the mutation causing disease (allele m is the mutation of wild allele M) is rare. Because the mutation is so rare, the homozygous genotype mm is seldom seen if Hardy-Weinberg equilibrium holds for the alleles, so we have only 2 genotypes, namely Mm (mutation, denoted by $g = 1$) and MM (nonmutation, denoted by $g = 0$). Let the disease penetrances of MM and Mm be $f_0$ and $f_1$, respectively. Let the disease status of an individual be $d$, which takes value 1 if affected and 0 otherwise.

2.2 Likelihood function

Suppose $I$ unrelated case probands carrying the mutation are ascertained. To derive the likelihood function of the observed data, we need to make the following assumptions:

(i) The study mutation is rare.
(ii) Hardy-Weinberg equilibrium holds for the corresponding allele, mating is random, and the Mendelian inheritance law holds.
(iii) The study mutation is independent of the unobserved risk factors.
(iv) The disease is rare.
(v) There is no interaction effect between the study mutation and the unobserved risk factors. That is, the joint disease penetrance satisfies the following relationship:

$$P(d = 1 \mid g = 1, r) = c_1 P(d = 1 \mid g = 0, r), \tag{2.1}$$

where $r$ is a vector of unobserved risk factor values and $c_1$ is a constant independent of $r$.

Under the assumptions (i)-(v), the likelihood function for the observed genotypes of the relatives can be approximated by

$$L_i(f_0, f_1) = p_1^{a_i} (1 - p_1)^{n_{1i} - a_i} \, p_0^{b_i} (1 - p_0)^{n_{0i} - b_i}, \quad \text{with } p_1 = \frac{f_1}{f_1 + f_0} \text{ and } p_0 = \frac{1 - f_1}{2 - f_1 - f_0}, \tag{2.2}$$

where $a_i$ ($b_i$) is the number of affected (unaffected) relatives carrying the mutation and $n_{1i}$ ($n_{0i}$) is the number of affected (unaffected) relatives of the $i$th case proband, $i = 1, \ldots, I$.
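As a minimal sketch of how (2.2) can be evaluated, the Python functions below compute $p_1$ and $p_0$ from a pair of penetrances and return the log of $L_i$ for one family. The function and argument names are illustrative assumptions, not part of the paper.

```python
import math


def carrier_probabilities(f0, f1):
    """Return (p1, p0) as defined in (2.2) for penetrances f0 and f1."""
    p1 = f1 / (f1 + f0)                # P(relative is a carrier | affected)
    p0 = (1.0 - f1) / (2.0 - f1 - f0)  # P(relative is a carrier | unaffected)
    return p1, p0


def _weighted_log(count, prob):
    """count * log(prob), treating a zero count as contributing nothing."""
    if count == 0:
        return 0.0
    if prob <= 0.0:
        return float("-inf")
    return count * math.log(prob)


def family_log_likelihood(f0, f1, a, n1, b, n0):
    """Log of L_i(f0, f1) in (2.2).

    a, n1: carriers among affected relatives, and number of affected relatives.
    b, n0: carriers among unaffected relatives, and number of unaffected relatives.
    """
    p1, p0 = carrier_probabilities(f0, f1)
    return (_weighted_log(a, p1) + _weighted_log(n1 - a, 1.0 - p1)
            + _weighted_log(b, p0) + _weighted_log(n0 - b, 1.0 - p0))
```

For example, `carrier_probabilities(0.1, 0.5)` gives $p_1 = 5/6$ and $p_0 = 5/14$; summing `family_log_likelihood` over families gives the overall log-likelihood.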
Refer to Appendix A of the supplementary material (available at Biostatistics online) for the derivation of (2.2). Notice that $p_1$ ($p_0$) is the probability of a relative being a carrier, given that he/she is affected (unaffected) and the case proband is a carrier. It is seen that $p_0$ has exactly the same value as that given in Wang and others (2006) when $f_0 = 0$. Furthermore, when $f_0 = 0$ (i.e. a noncarrier has penetrance 0), all the affected relatives are carriers and they provide no information on $f_1$. It can be seen from (2.2) that the relatives' genotypes within the same family are conditionally independent given the ascertainment scheme. We want to point out that this is not an assumption but a result derived from the assumptions (i)-(v). An important advantage of this likelihood is that it is independent of the unobserved risk factors, making it suitable for estimating marginal penetrances of carriers and noncarriers.

The assumption (i) is the key assumption and the motivation for this study. The assumptions (ii) and (iii) are common in the literature and are used to derive the conditional mutation distribution of a proband's relatives. The assumption (iv) is a technical one, and our simulation study shows that the performance of the proposed approach is acceptable even when the disease is common, with a prevalence of 0.1. The assumption (v) is equivalent to the multiplicative model for multiple risk factors (see e.g. Gail and others, 2008; Yu and others, 2009). In particular, the following log-linear model satisfies the assumption (v):

$$P(d = 1 \mid g, r) = c_2 \exp\{a g + b^{\tau} r\},$$

where $c_2$ is a constant and $a$ and $b$ are regression parameters. Throughout this paper, $\tau$ stands for the transpose of a vector. Notice that we do not assume any correlation structure for the unobserved risk factors of family members. Furthermore, the unobserved risk factors can be of any type: discrete or continuous, environmental or genetic.

2.3 Identifiability of $f_1$ and $f_0$

When genotypes are available only for the unaffected relatives of case probands, we see from the likelihood function (2.2) that the penetrances $f_1$ and $f_0$ are not identifiable. However, the 2 penetrances $f_1$ and $f_0$ are identifiable when at least 1 affected relative and 1 unaffected relative are genotyped, provided that $f_1 > f_0 > 0$. Indeed, there is a one-to-one relationship between the penetrances $\{f_1, f_0\}$ and the estimable parameters $\{p_1, p_0\}$ when $f_1 > f_0 > 0$. This is different from the situation in the standard case-control design, where only the relative risk $f_1/f_0$ is identifiable. Notice that in our case family design, the retrospective likelihood function is conditioned on the mutation status of the proband and disease status. This additional conditioning, together with the assumption of a rare mutation, makes both $f_1$ and $f_0$ identifiable. Note also that $f_0$ and $f_1$ are not identifiable when $f_1 = f_0$, but this is not a problem since the major purpose of our case family design is to estimate the penetrance function of a known risk mutation with $f_1 > f_0$.

2.4 Maximum likelihood estimates

Denote $A = \sum_{i=1}^{I} a_i$, $B = \sum_{i=1}^{I} b_i$, $N_1 = \sum_{i=1}^{I} n_{1i}$, and $N_0 = \sum_{i=1}^{I} n_{0i}$. Then the overall likelihood can be written as

$$L(f_0, f_1) = p_1^{A} (1 - p_1)^{N_1 - A} \, p_0^{B} (1 - p_0)^{N_0 - B}, \quad \text{with } p_1 = \frac{f_1}{f_1 + f_0} \text{ and } p_0 = \frac{1 - f_1}{2 - f_1 - f_0}. \tag{2.3}$$

Since the above likelihood function is the product of 2 binomial likelihood functions, the MLEs of $p_1$ and $p_0$ are $\hat{p}_1 = A/N_1$ and $\hat{p}_0 = B/N_0$, respectively.
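The passage from $(p_1, p_0)$ back to the penetrances rests on the one-to-one relationship noted in Section 2.3. As a worked sketch using only the definitions of $p_1$ and $p_0$ in (2.3), the map can be inverted as follows:

```latex
\begin{align*}
p_1 = \frac{f_1}{f_1 + f_0}
  &\;\Longrightarrow\; f_0 = \frac{1 - p_1}{p_1}\, f_1, \\
p_0 = \frac{1 - f_1}{2 - f_1 - f_0}
  &\;\Longrightarrow\; (1 - p_0)\, f_1 - p_0 f_0 = 1 - 2 p_0, \\
\text{substituting } f_0:\quad \frac{p_1 - p_0}{p_1}\, f_1 = 1 - 2 p_0
  &\;\Longrightarrow\; f_1 = \frac{p_1 (1 - 2 p_0)}{p_1 - p_0},
  \qquad f_0 = \frac{(1 - p_1)(1 - 2 p_0)}{p_1 - p_0}.
\end{align*}
```

The inversion requires $p_1 \neq p_0$, which holds whenever $f_1 > f_0$, since then $p_1 > 1/2 > p_0$.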
Therefore, the MLEs of $f_1$ and $f_0$ are, respectively,

$$\hat{f}_1 = \frac{\hat{p}_1 (1 - 2\hat{p}_0)}{\hat{p}_1 - \hat{p}_0} \quad \text{and} \quad \hat{f}_0 = \frac{(1 - \hat{p}_1)(1 - 2\hat{p}_0)}{\hat{p}_1 - \hat{p}_0}, \quad \text{with } \hat{p}_1 = \frac{A}{N_1} \text{ and } \hat{p}_0 = \frac{B}{N_0}, \tag{2.4}$$

or equivalently,

$$\hat{f}_1 = \frac{A N_0 - 2AB}{A N_0 - B N_1} \quad \text{and} \quad \hat{f}_0 = \frac{N_1 N_0 - A N_0 - 2 B N_1 + 2AB}{A N_0 - B N_1}. \tag{2.5}$$

When $f_0 = 0$, the MLE of $f_1$ is $(1 - 2B/N_0)/(1 - B/N_0)$. This estimator is simpler than that of Wang and others (2006) since their method needs to estimate an additional offset for each family. When $f_0$ is not equal to 0, using $(1 - 2B/N_0)/(1 - B/N_0)$ as an estimator of $f_1$ could produce considerable bias. For example, if $f_0 = 0.1$ and $f_1 = 0.2$, then the estimator $(1 - 2B/N_0)/(1 - B/N_0)$ converges to $(1 - 2p_0)/(1 - p_0) = 1/9$ as the sample size goes to infinity, and the relative bias (Rbias) is $(1/9 - 0.2)/0.2 = -4/9$. If all the affected relatives are carriers so that $N_1 = A$, then the MLEs of $f_0$ and $f_1$ are 0 and $(1 - 2B/N_0)/(1 - B/N_0)$, respectively. This confirms the fact that the affected relatives provide no information on $f_1$ when $f_0 = 0$, as was mentioned in Section 2.2.

With a large sample size, the MLEs $\hat{f}_1$ and $\hat{f}_0$ converge to $f_1$ and $f_0$, respectively, so that they asymptotically lie within the interval $[0, 1]$. When the sample size is not large enough, however, the 2 estimates could be negative or greater than 1. In such a situation, we can estimate the penetrances by adding the constraint $0 \le f_0, f_1 \le 1$.
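The closed-form estimates in (2.5) and the zero-penetrance estimator discussed above can be sketched in a few lines of Python. The function names and the example counts are illustrative assumptions; the counts were chosen so that $\hat{p}_1 = 5/6$ and $\hat{p}_0 = 5/14$, which correspond to $f_1 = 0.5$ and $f_0 = 0.1$.

```python
def penetrance_mle(A, N1, B, N0):
    """Unconstrained closed-form MLEs (f0_hat, f1_hat) from (2.5).

    A, N1: carriers among affected relatives, and number of affected relatives.
    B, N0: carriers among unaffected relatives, and number of unaffected relatives.
    The results may fall outside [0, 1] in small samples.
    """
    denom = A * N0 - B * N1
    if denom == 0:
        raise ValueError("p1_hat equals p0_hat; the penetrances are not identifiable")
    f1_hat = (A * N0 - 2 * A * B) / denom
    f0_hat = (N1 * N0 - A * N0 - 2 * B * N1 + 2 * A * B) / denom
    return f0_hat, f1_hat


def zero_noncarrier_estimate(B, N0):
    """Estimator of f1 that assumes f0 = 0: (1 - 2B/N0) / (1 - B/N0)."""
    p0_hat = B / N0
    return (1 - 2 * p0_hat) / (1 - p0_hat)


# Hypothetical pooled counts: 5 of 6 affected and 5 of 14 unaffected relatives carry the mutation.
f0_hat, f1_hat = penetrance_mle(A=5, N1=6, B=5, N0=14)
```

With these counts the sketch returns estimates of 0.1 and 0.5. Calling `zero_noncarrier_estimate(8, 17)`, where $8/17$ is the limiting value of $p_0$ for $f_0 = 0.1$ and $f_1 = 0.2$, reproduces the limit $1/9$ quoted in the text.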

2.5 Hypothesis testing and confidence interval

It is of interest to test the null hypothesis that the mutation has no effect on the disease ($f_0 = f_1$), provided that the genotypes of some affected relatives are available. To test this null hypothesis, we can construct a likelihood ratio test. Since the common penetrance under the null hypothesis is not identifiable, the limiting null distribution of the likelihood ratio statistic is no longer the standard chi-square distribution. To assess the significance of the likelihood ratio test statistic, we can adopt a permutation test by permuting the disease status of the relatives. The confidence intervals of the penetrances can be constructed based on the asymptotic normality of the MLEs, with the variance-covariance matrix of the MLEs being estimated by the inverse of the observed information matrix.

3. AGE-DEPENDENT PENETRANCES

3.1 Notation

In most situations, the penetrances depend on age, and we are interested in estimating age-dependent penetrances. Suppose that we observe the ages at diagnosis for all the relatives and the ages at onset for those affected individuals. We will take this information into account in the evaluation of the age-dependent penetrances. For the $i$th proband, suppose the information on the phenotypes and genotypes of $n_i$ relatives is collected. Let the genotype and affection status of the $j$th relative of the $i$th case proband be coded by $g_{ij}$ and $d_{ij}$, respectively, where the zeroth relative is the case proband. That is, $g_{ij} = 1$ if the $j$th relative is a carrier and 0 otherwise, and $d_{ij} = 1$ if the $j$th relative is affected and 0 otherwise. Let $a_{ij}$ and $t_{ij}$ be the current age and the age at onset of the $j$th relative, respectively ($t_{ij}$ is an unobserved value greater than $a_{ij}$ if the $j$th relative is unaffected). Let $y_{ij} = \min\{t_{ij}, a_{ij}\}$.
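The notation above maps naturally onto a small record type. The following Python sketch is illustrative only: the class and field names are assumptions, and an unaffected relative is represented by a missing onset age.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Relative:
    """One relative of a case proband, in the notation of Section 3.1."""
    carrier: int                       # g_ij: 1 if carrier, 0 otherwise
    current_age: float                 # a_ij
    onset_age: Optional[float] = None  # t_ij; None when onset has not been observed

    @property
    def affected(self) -> int:
        """d_ij: 1 if onset occurred by the current age, 0 otherwise."""
        return int(self.onset_age is not None and self.onset_age <= self.current_age)

    @property
    def observed_age(self) -> float:
        """y_ij = min(t_ij, a_ij); equals a_ij for an unaffected relative."""
        if self.affected:
            return min(self.onset_age, self.current_age)
        return self.current_age


# Hypothetical relatives: an affected carrier and an unaffected noncarrier.
sib = Relative(carrier=1, current_age=60.0, onset_age=45.0)
child = Relative(carrier=0, current_age=52.0)
```

Here `sib.observed_age` is the onset age 45.0 and `child.observed_age` is the current age 52.0, matching the definition of $y_{ij}$.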
3.2 Likelihood function

We can formulate a conditional likelihood for the $i$th family's data as

$$P(g_i \mid d_i, y_i, g_{i0} = 1, d_{i0} = 1, y_{i0}, a_i, a_{i0}),$$

where $g_i = (g_{i1}, \ldots, g_{in_i})$, $d_i = (d_{i1}, \ldots, d_{in_i})$, $y_i = (y_{i1}, \ldots, y_{in_i})$, and $a_i = (a_{i1}, \ldots, a_{in_i})$. To derive the likelihood function, we need the following assumption, corresponding to the assumption (v) for the age-independent penetrances:

(v′) There is no interaction effect between the study mutation and the unobserved risk factors, that is, the density function of the age at onset $p(t \mid g, r)$ given the study mutation $g$ and unobserved risk factors $r$ satisfies the relationship

$$p(t \mid g = 1, r) = c_3 \, p(t \mid g = 0, r), \tag{3.1}$$

where $c_3$ is a constant.

Under Cox's proportional hazards model (Cox, 1972), the hazard function is multiplicative with respect to $g$ and $r$ if there is no interaction effect. Therefore, the Cox model together with the rare disease assumption implies the assumption (v′), since the density function is approximately the hazard function under these assumptions. Under the assumptions (i)-(iv) and (v′), we can show that the overall likelihood can be approximated by

$$L = \prod_{i=1}^{I} P(g_i \mid d_i, y_i, g_{i0} = 1, d_{i0} = 1, y_{i0}, a_i, a_{i0}) = \prod_{i=1}^{I} \prod_{j=1}^{n_i} \frac{\lambda^{d_{ij}}(y_{ij} \mid g_{ij}) \, S(y_{ij} \mid g_{ij})}{\sum_{g=0}^{1} \lambda^{d_{ij}}(y_{ij} \mid g) \, S(y_{ij} \mid g)}, \tag{3.2}$$

where $\lambda(\cdot \mid g)$ and $S(\cdot \mid g)$ are, respectively, the hazard function and survival function of age at onset of individuals carrying genotype $g$. The derivation of (3.2) is similar to that of (2.2) and so is omitted. We can assume a suitable functional form for $\lambda(t \mid g)$. For example, under the given assumptions, the joint proportional hazards model implies a marginal proportional hazards function

$$\lambda(t \mid g) = \lambda_0(t; \eta) e^{\beta g}, \tag{3.3}$$

where $\lambda_0(t; \eta)$ is the baseline hazard function known up to a parameter vector $\eta$ of finite dimension.

If only unaffected relatives are genotyped, then the likelihood function (3.2) reduces to

$$\prod_{i=1}^{I} \prod_{j=1}^{n_i} \frac{S(y_{ij} \mid g_{ij})}{S(y_{ij} \mid 0) + S(y_{ij} \mid 1)}. \tag{3.4}$$

It can be shown that $S(\cdot \mid 1)$ and $S(\cdot \mid 0)$ are not identifiable in (3.4), as in Section 2.3. For a rare disease, one can assume that the penetrance of noncarriers is nearly zero so that $S(y \mid 0) \approx 1$, and the likelihood function is approximately

$$\prod_{i=1}^{I} \prod_{j=1}^{n_i} \left( \frac{S(y_{ij} \mid 1)}{1 + S(y_{ij} \mid 1)} \right)^{g_{ij}} \left( \frac{1}{1 + S(y_{ij} \mid 1)} \right)^{1 - g_{ij}}, \tag{3.5}$$

which is equivalent to model (3) of Olschwang and others (2009). Making the additional assumption of a Weibull survival function form of $S(y \mid 1)$ yields a logistic regression model given by (5) of Olschwang and others (2009).

3.3 MLE, hypothesis testing, and confidence interval

The MLEs of the unknown parameters can be obtained by the Newton-Raphson algorithm or any optimization algorithm. To examine whether the study mutation has an effect on the disease, we can test the null hypothesis $\beta = 0$ using either the likelihood ratio test or the Wald test, where $\beta$ is given in (3.3). We can also estimate the variances of the MLEs and construct the confidence intervals of the unknown parameters based on large-sample theory.

4. COVARIATES AND MULTIPLE MUTATIONS ADJUSTMENT

In many real applications, we might be interested in comparing penetrances between 2 groups, for example, male versus female.
Also, when there are multiple known disease-causing mutations involved, we are interested in comparing the penetrances among multiple mutations. An example will be given in Section 6. We can extend the previous likelihood functions further to adjust for covariates and multiple disease-causing mutations. In the following, we illustrate how to incorporate covariates and multiple mutations in the situation where the genotypes from both affected and unaffected relatives are available.

4.1 Likelihood function

Assume that a covariate vector $Z$ is observed for each relative. Then we can incorporate the covariate effects in a proportional hazards model:

$$\Lambda(t \mid g, Z) = \Lambda_0(t; \eta) e^{\beta g + \gamma^{\tau} Z}, \tag{4.1}$$

where $\Lambda(t \mid g, Z)$ is the cumulative hazard function of the age at onset given covariate $Z$ and genotype $g$, and $\Lambda_0(t; \eta)$ is the baseline cumulative hazard function corresponding to $g = 0$ and $Z = 0$, which is known up to a parameter vector $\eta$ of finite dimension. The likelihood function is therefore approximately

$$\prod_{i=1}^{I} \prod_{j=1}^{n_i} \frac{\exp\{(\beta g_{ij} + \gamma^{\tau} Z_{ij}) d_{ij} - \Lambda_0(y_{ij}; \eta) e^{\beta g_{ij} + \gamma^{\tau} Z_{ij}}\}}{\sum_{g=0}^{1} \exp\{(\beta g + \gamma^{\tau} Z_{ij}) d_{ij} - \Lambda_0(y_{ij}; \eta) e^{\beta g + \gamma^{\tau} Z_{ij}}\}}.$$

Suppose $K$ types of disease-causing rare mutations are considered. We assume that each family can have at most one type of mutation segregating. Let $\delta_i$ be a mutation indicator, that is, $\delta_i = k$ if the $i$th case proband has the $k$th type of mutation. We assume that the cumulative hazard functions of these risk mutations are proportional:

$$\Lambda_k(t) = \Lambda_0(t; \eta) e^{\beta_k}, \quad k = 1, \ldots, K, \tag{4.2}$$

where $\beta_1 = 0$. The approximated likelihood function can be written as

$$\prod_{i=1}^{I} \prod_{j=1}^{n_i} \frac{\exp\left\{\left(\sum_{k=1}^{K} 1_k(\delta_i) \beta_k g_{ij}\right) d_{ij} - \Lambda_0(y_{ij}; \eta) e^{\sum_{k=1}^{K} 1_k(\delta_i) \beta_k g_{ij}}\right\}}{\sum_{g=0}^{1} \exp\left\{\left(\sum_{k=1}^{K} 1_k(\delta_i) \beta_k g\right) d_{ij} - \Lambda_0(y_{ij}; \eta) e^{\sum_{k=1}^{K} 1_k(\delta_i) \beta_k g}\right\}}, \tag{4.3}$$

where $1_k(\delta_i)$ is an indicator function taking value 1 if $\delta_i = k$ and 0 otherwise. Refer to Appendix B of the supplementary material (available at Biostatistics online) for the derivation of (4.3). This expression shows that the mutation behaves as a family-shared categorical covariate.

4.2 MLE, confidence interval, and hypothesis testing

The MLEs of $\eta$, $\beta$, $\gamma$, and $\beta_k$, $k = 2, \ldots, K$, in (4.1) and (4.2) can be obtained using the Newton-Raphson algorithm or any optimization algorithm. The variance estimates of the MLEs and confidence intervals of the unknown parameters can be obtained as before. It is of interest to compare the penetrances for various disease-causing mutations, which can be done with the standard likelihood ratio test based on the likelihood (4.3).
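To make the covariate-adjusted conditional likelihood under model (4.1) concrete, the following Python sketch evaluates its logarithm for a list of relatives. It is a minimal illustration under stated assumptions: the baseline cumulative hazard is taken to be of Weibull form, $(t/\text{scale})^{\text{shape}}$, which is one possible choice of $\Lambda_0(t; \eta)$ and not one prescribed by the text, and all function and argument names are hypothetical.

```python
import math


def weibull_cum_hazard(t, shape, scale):
    """Assumed baseline cumulative hazard Lambda_0(t; eta) = (t / scale) ** shape."""
    return (t / scale) ** shape


def log_term(g, d, y, z, beta, gamma, shape, scale):
    """(beta*g + gamma'z) * d - Lambda_0(y) * exp(beta*g + gamma'z)."""
    lin = beta * g + sum(c * x for c, x in zip(gamma, z))
    return lin * d - weibull_cum_hazard(y, shape, scale) * math.exp(lin)


def relative_log_lik(g, d, y, z, beta, gamma, shape, scale):
    """Log conditional probability of genotype g given status d, age y, covariates z."""
    terms = [log_term(k, d, y, z, beta, gamma, shape, scale) for k in (0, 1)]
    top = max(terms)
    # Log of the denominator (sum over g = 0, 1), computed stably.
    log_denominator = top + math.log(sum(math.exp(v - top) for v in terms))
    return terms[g] - log_denominator


def conditional_log_lik(relatives, beta, gamma, shape, scale):
    """Sum of log contributions over relatives given as (g, d, y, z) tuples."""
    return sum(relative_log_lik(g, d, y, z, beta, gamma, shape, scale)
               for (g, d, y, z) in relatives)
```

The negative of `conditional_log_lik` could be handed to any general-purpose numerical optimizer to obtain the MLEs. When `beta` is 0 the two genotypes are equally likely, so each relative contributes $\log(1/2)$.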
The proportionality of the hazard functions can also be tested by a likelihood ratio test, with the alternative hypothesis being that the mutations have their own specific penetrance functions. If the null hypothesis that the penetrance functions are proportional is not rejected, we can feel free to apply the proportional hazards model (4.2); otherwise we need to estimate mutation-specific penetrance functions. 5. SIMULATION STUDIES We conducted simulation studies to assess the performance of the proposed approach. First, we studied the age-independent penetrances. We assumed 2 independent disease related singlenucleotide polymorphisms (SNPs): one is the study mutation with minor allele frequency (MAF) 0.01 or 0.001 and the other one is unobserved with MAF 0.2. We assumed dominant mode of inheritance for both the SNPs. The disease and risk factors were related by a logistic regression model: P(d = 1 g, r) = exp{a + bg + log(or)r} 1 + exp{a + bg + log(or)r}, (5.1) where g (r) is 1 if the genotype of the study SNP (unobserved SNP) is of higher risk and 0 otherwise and OR is the odds ratio parameter for the unobserved SNP, which takes value 1 or 2. The marginal penetrance f 1 for carriers was fixed at 0.5 and the other penetrance f 0 was 0.03 or 0.1. The values of

log-OR parameters $a$ and $b$ were determined by the other parameters. The genotypes of parents were generated under Hardy-Weinberg equilibrium and random mating, and the genotypes of offspring were generated independently given the parental genotypes. From a large number of generated families with 3 offspring, we randomly selected 1 000 000 families in which the first offspring was affected and carried the study mutation, and treated them as the source population from which the study sample was collected. A sample of size 200, 500, or 1000 was drawn from this population, and simulation results based on 100 000 replications were produced.

Reported in Table 1 are the relative bias (Rbias) of the estimates, defined as the mean estimated penetrance divided by the true penetrance minus 1, the empirical standard errors (SE) and mean estimated standard errors (SEE) of the estimates, and the empirical coverage probability (ECP) of the 95% confidence intervals for the penetrances.

Table 1. Age-independent penetrance estimates

| N | MAF | $f_0$ | OR | Rbias ($f_0$) | SE ($f_0$) | SEE ($f_0$) | ECP ($f_0$) | Rbias ($f_1$) | SE ($f_1$) | SEE ($f_1$) | ECP ($f_1$) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 200 | 0.01 | 0.1 | 1 | 0.071 | 0.030 | 0.030 | 0.891 | 0.053 | 0.077 | 0.076 | 0.954 |
| 200 | 0.01 | 0.1 | 2 | 0.042 | 0.031 | 0.030 | 0.906 | 0.039 | 0.076 | 0.075 | 0.953 |
| 200 | 0.01 | 0.03 | 1 | 0.050 | 0.013 | 0.013 | 0.882 | 0.042 | 0.067 | 0.066 | 0.952 |
| 200 | 0.01 | 0.03 | 2 | 0.038 | 0.013 | 0.013 | 0.887 | 0.027 | 0.066 | 0.066 | 0.952 |
| 200 | 0.001 | 0.1 | 1 | 0.009 | 0.032 | 0.032 | 0.930 | 0.008 | 0.073 | 0.073 | 0.948 |
| 200 | 0.001 | 0.1 | 2 | 0.034 | 0.032 | 0.032 | 0.938 | 0.004 | 0.073 | 0.072 | 0.942 |
| 200 | 0.001 | 0.03 | 1 | 0.003 | 0.014 | 0.013 | 0.902 | 0.012 | 0.065 | 0.065 | 0.949 |
| 200 | 0.001 | 0.03 | 2 | 0.022 | 0.014 | 0.014 | 0.909 | 0.002 | 0.065 | 0.064 | 0.943 |
| 500 | 0.01 | 0.1 | 1 | 0.076 | 0.019 | 0.019 | 0.887 | 0.049 | 0.048 | 0.047 | 0.939 |
| 500 | 0.01 | 0.1 | 2 | 0.049 | 0.019 | 0.019 | 0.910 | 0.035 | 0.047 | 0.047 | 0.949 |
| 500 | 0.01 | 0.03 | 1 | 0.057 | 0.008 | 0.008 | 0.904 | 0.038 | 0.042 | 0.042 | 0.941 |
| 500 | 0.01 | 0.03 | 2 | 0.044 | 0.008 | 0.008 | 0.912 | 0.024 | 0.042 | 0.041 | 0.948 |
| 500 | 0.001 | 0.1 | 1 | 0.003 | 0.020 | 0.020 | 0.941 | 0.004 | 0.046 | 0.046 | 0.949 |
| 500 | 0.001 | 0.1 | 2 | 0.027 | 0.020 | 0.020 | 0.950 | 0.008 | 0.045 | 0.045 | 0.944 |
| 500 | 0.001 | 0.03 | 1 | 0.003 | 0.009 | 0.009 | 0.927 | 0.008 | 0.041 | 0.041 | 0.949 |
| 500 | 0.001 | 0.03 | 2 | 0.014 | 0.009 | 0.009 | 0.935 | 0.006 | 0.040 | 0.040 | 0.945 |
| 1000 | 0.01 | 0.1 | 1 | 0.079 | 0.013 | 0.013 | 0.869 | 0.048 | 0.034 | 0.033 | 0.910 |
| 1000 | 0.01 | 0.1 | 2 | 0.052 | 0.013 | 0.013 | 0.905 | 0.033 | 0.033 | 0.033 | 0.934 |
| 1000 | 0.01 | 0.03 | 1 | 0.059 | 0.006 | 0.006 | 0.908 | 0.037 | 0.030 | 0.030 | 0.919 |
| 1000 | 0.01 | 0.03 | 2 | 0.047 | 0.006 | 0.006 | 0.916 | 0.023 | 0.029 | 0.029 | 0.940 |
| 1000 | 0.001 | 0.1 | 1 | 0.000 | 0.014 | 0.014 | 0.946 | 0.003 | 0.032 | 0.032 | 0.950 |
| 1000 | 0.001 | 0.1 | 2 | 0.024 | 0.014 | 0.014 | 0.953 | 0.009 | 0.032 | 0.032 | 0.941 |
| 1000 | 0.001 | 0.03 | 1 | 0.004 | 0.006 | 0.006 | 0.939 | 0.007 | 0.029 | 0.029 | 0.950 |
| 1000 | 0.001 | 0.03 | 2 | 0.010 | 0.006 | 0.006 | 0.942 | 0.006 | 0.029 | 0.028 | 0.944 |

N: number of cases = number of controls. MAF: minor allele frequency of the study mutation. $f_0$: true value of the penetrance $f_0$. OR: odds ratio parameter of the unobserved risk factor. Rbias: mean MLE divided by the true penetrance, minus 1. SE: empirical standard error of the estimated penetrance. SEE: mean estimated standard error of the estimated penetrance. ECP: 95% empirical coverage probability.

Overall, the estimates have minor bias when the disease is rare ($f_0 = 0.03$) and the study mutation is rare (MAF = 0.001), with absolute relative biases no more than 1.2%. A common disease ($f_0 = 0.1$), an increased MAF (0.01) of the study mutation, and a positive effect of the unobserved mutation (OR = 2) have a small impact on the estimates, with Rbias in the range 7.9% to 3.4%. In all situations, the SEE are very close to the empirical SE. The relative bias tends to be stable and remains small as the sample size increases.

We also estimated the penetrance of carriers using only the genotypes of unaffected relatives, assuming zero penetrance for noncarriers. The resulting Rbias is generally small when $f_0 = 0.03$ but can become considerably large when $f_0 = 0.1$ (results not shown).

It is also seen from Table 1 that the Rbias for a mutation with MAF = 0.001 tends to be smaller than that observed for a mutation with MAF = 0.01. Additional simulation results show that the relative biases get larger as the MAF increases. For example, an MAF of 0.03 produces relative biases in the range 10.3% to 23.7%, and an MAF of 0.1 produces relative biases in the range 41.9% to 65.1%, with the other parameters the same as those in Table 1. It appears that the proposed approach is suitable for rare mutations with MAF $\le$ 0.01.

Second, we studied the proposed approach when the penetrance is age dependent. We generated data from the following Cox model with Weibull baseline hazard function:

$$\lambda(t \mid g, r) = (t/e^{\psi})^{\xi}\, e^{\beta g + \log(\mathrm{OR})\, r}, \tag{5.2}$$

where $g$ and $r$ are the same as those in (5.1) with the same MAFs. The OR was fixed at 1 or 2.
The other parameters $\xi$, $\psi$, and $\beta$ were determined by 3 cumulative risk probabilities: $p_{30,0} = P(T \le 30 \mid g = 0)$, $p_{60,0} = P(T \le 60 \mid g = 0)$, and $p_{60,1} = P(T \le 60 \mid g = 1)$, where $T$ is the age at onset. To mimic a common disease, we set $p_{30,0} = 0.03$ and $p_{60,0} = 0.09$; to mimic a rare disease, we set $p_{30,0} = 0.01$ and $p_{60,0} = 0.03$. In both situations, we set $p_{60,1} = 0.5$. The ages of the relatives of a proband were generated from the uniform distribution on the interval $(a - 5, a + 5)$, where $a$ is the current age of the proband, itself uniformly distributed on the interval (20, 70). The ages, genotypes, and disease status were generated for a large number of families in the same way as in the age-independent situation. In each family, data were generated for 2 parents and 3 offspring. Altogether, 1 000 000 families were obtained, each with 1 affected proband (the first offspring) carrying the mutation. From these families, we sampled 400 or 1000 families and estimated $\xi$, $\psi$, and $\beta$ in model (5.2), ignoring the unobserved mutation. Substituting the estimated parameters gave the estimates of the marginal survival functions of carriers and noncarriers. Based on 5000 replications, we calculated the mean estimated survival functions of both carriers and noncarriers and the 90% confidence intervals of the survival functions. Figures 1 and 2 present the results for carriers and noncarriers, respectively, with sample size 1000 and OR = 1 (the unobserved mutation plays no role in the disease). The bias of the estimates decreases dramatically when the MAF of the study mutation decreases from 0.01 to 0.001, showing that the approximation of the likelihood function works well for relatively rare mutations. When the disease becomes common, the proposed method using both affected and unaffected relatives does not produce extra bias. However, the method that uses only unaffected relatives has much larger bias for a common disease.
This extra bias is due to the improper assumption of a zero penetrance function for noncarriers when the disease is common. Other results, for sample size 400 or OR = 2, are presented in Figures s1-s6 of the supplementary material (available at Biostatistics online). In summary, the bias of the penetrance function estimates gets smaller as the sample size increases. The positive effect of the unobserved mutation (OR = 2) has only a limited impact on the penetrance function estimates. In particular, the impact is minimal when the MAF of the study mutation is small and the disease is rare.
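As a worked illustration of how the 3 cumulative risk probabilities pin down $\xi$, $\psi$, and $\beta$, the sketch below solves for them in closed form. It rests on two assumptions that are made for this example only and are not stated in the text: that $(t/e^{\psi})^{\xi} e^{\beta g}$ acts as the cumulative hazard, so that $P(T \le t \mid g) = 1 - \exp\{-(t/e^{\psi})^{\xi} e^{\beta g}\}$, and that OR = 1, so the unobserved factor $r$ drops out. The function names `calibrate_weibull` and `cumulative_risk` are hypothetical.

```python
import math


def calibrate_weibull(p30_0, p60_0, p60_1):
    """Solve for (xi, psi, beta) from three cumulative risks.

    Assumption (for illustration only): with OR = 1,
    P(T <= t | g) = 1 - exp(-(t / e^psi)^xi * e^(beta * g)).
    """
    h30_0 = -math.log(1.0 - p30_0)  # cumulative hazard at age 30, noncarriers
    h60_0 = -math.log(1.0 - p60_0)  # cumulative hazard at age 60, noncarriers
    h60_1 = -math.log(1.0 - p60_1)  # cumulative hazard at age 60, carriers

    # H(60) / H(30) = (60 / 30)^xi for noncarriers
    xi = math.log(h60_0 / h30_0) / math.log(60.0 / 30.0)
    # log H(60) = xi * (log 60 - psi)
    psi = math.log(60.0) - math.log(h60_0) / xi
    # carriers' cumulative hazard is e^beta times that of noncarriers
    beta = math.log(h60_1 / h60_0)
    return xi, psi, beta


def cumulative_risk(t, g, xi, psi, beta):
    """P(T <= t | g) under the assumed model."""
    return 1.0 - math.exp(-((t / math.exp(psi)) ** xi) * math.exp(beta * g))


if __name__ == "__main__":
    # Common-disease and rare-disease settings from the simulation design.
    for label, risks in [("common", (0.03, 0.09, 0.5)), ("rare", (0.01, 0.03, 0.5))]:
        xi, psi, beta = calibrate_weibull(*risks)
        print(label, round(xi, 3), round(psi, 3), round(beta, 3))
```

Under a different reading of (5.2), or with OR = 2, the marginal risks would involve averaging over $r$ and the closed form above would no longer apply; a numerical root finder would be needed instead.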

Fig. 1. Estimated survival functions of carriers with sample size 1000 and OR = 1. Common mutation: MAF = 0.01; rare mutation: MAF = 0.001; common disease: $P(T \le 30 \mid g = 0) = 0.03$ and $P(T \le 60 \mid g = 0) = 0.09$; rare disease: $P(T \le 30 \mid g = 0) = 0.01$ and $P(T \le 60 \mid g = 0) = 0.03$.

Fig. 2. Estimated survival functions of noncarriers with sample size 1000 and OR = 1. Common mutation: MAF = 0.01; rare mutation: MAF = 0.001; common disease: $P(T \le 30 \mid g = 0) = 0.03$ and $P(T \le 60 \mid g = 0) = 0.09$; rare disease: $P(T \le 30 \mid g = 0) = 0.01$ and $P(T \le 60 \mid g = 0) = 0.03$.

Finally, we examined the robustness of the approach to the specification of the baseline hazard function. Our simulation studies showed that misspecification of the baseline hazard function could result in bias, with its magnitude depending on the true and misspecified functions. Here, we do not present the simulation results but briefly summarize them. If the true baseline hazard function is gamma, Weibull, or log-normal but is misspecified as either of the other 2 functions, then the resulting penetrance estimate has small bias; if the baseline hazard function is piecewise constant but is misspecified as Weibull, then the bias can be relatively large.

6. APPLICATION TO A STUDY OF LYNCH SYNDROME

We applied the proposed approach to a study of Lynch syndrome (Olschwang and others, 2009). In this study, the carriers were identified in 8 genetic units of France and Switzerland. These units offered germline analysis of the MSH2 and MLH1 genes. A retrospective questionnaire was used to collect information on asymptomatic first-degree relatives of carriers. The collected information includes the type of disease-causing germline mutation identified in the proband, birth date, sex, and age at genetic diagnosis. The presence or absence of the disease-causing mutation was then assessed in these relatives. Phenotypes and genotypes from 856 asymptomatic first-degree relatives of MSH2 or MLH1 carriers were collected from those 8 centers. For each relative, the gender and mutation status at genes MSH2 and MLH1 were obtained, as summarized in Table 2. Furthermore, the ages of the relatives were available, so that we could estimate the age-dependent penetrances. With pooled data, we assumed a Weibull survival function $(t/e^{\psi})^{\xi}$ for carriers. With gender or mutation type adjusted, we assumed a proportional hazards model with survival function $(t/e^{\psi})^{\xi} e^{\beta_1 x_1}$ or $(t/e^{\psi})^{\xi} e^{\beta_2 x_2}$. Here, $x_1 = 1$ if male and 0 if female, and $x_2 = 1$ if MLH1 and 0 if MSH2. We obtained the MLEs of the unknown parameters ($\psi$, $\xi$, $\beta_1$, and $\beta_2$), the estimated SE of the MLEs, and 95% confidence intervals of the parameters. The estimation and hypothesis-testing results are presented in Table 3.
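The interval and test summaries of the kind reported in Table 3 can be reproduced from an MLE, its SE, and a likelihood ratio statistic. The sketch below assumes Wald-type intervals (MLE plus or minus 1.96 SE) and a chi-square reference distribution with 1 degree of freedom for the likelihood ratio test; neither is stated explicitly in the text, and the helper names `wald_ci` and `lrt_pvalue_1df` are hypothetical. The inputs are the gender-adjusted entries for $\beta_1$ in Table 3.

```python
import math


def wald_ci(mle, se, z=1.96):
    """Wald-type confidence interval: mle +/- z * se (assumed form)."""
    return mle - z * se, mle + z * se


def lrt_pvalue_1df(stat):
    """Upper-tail probability of a chi-square variable with 1 df.

    If X = Z^2 with Z standard normal, then
    P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(stat / 2.0))


if __name__ == "__main__":
    # beta_1 (gender effect): MLE 0.533, SE 0.408, LRT statistic 1.897
    lower, upper = wald_ci(0.533, 0.408)
    print(round(lower, 3), round(upper, 3))
    print(round(lrt_pvalue_1df(1.897), 3))
```

Under these assumptions the computed interval and P-value agree with the tabulated values for $\beta_1$ up to rounding, which suggests, but does not prove, that the reported intervals are Wald-type.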

Table 2. Summary of genotypes

| | Mutation absent: MLH1 | Mutation absent: MSH2 | Mutation present: MLH1 | Mutation present: MSH2 |
|---|---|---|---|---|
| Males | 139 | 94 | 83 | 86 |
| Females | 132 | 105 | 120 | 97 |

Table 3. Estimates of the parameters for the Lynch syndrome data

| Model | Parameter | MLE | SE | CI | LRT | P-value |
|---|---|---|---|---|---|---|
| Pooled data | $\psi$ | 4.167 | 0.075 | (4.020, 4.314) | | |
| Pooled data | $\xi$ | 2.940 | 0.854 | (1.266, 4.614) | | |
| Adjusted by gender ($\beta_1$: regression parameter) | $\psi$ | 4.259 | 0.127 | (4.010, 4.507) | | |
| Adjusted by gender | $\xi$ | 2.852 | 0.851 | (1.184, 4.520) | | |
| Adjusted by gender | $\beta_1$ | 0.533 | 0.408 | (-0.267, 1.333) | 1.897 | 0.168 |
| Adjusted by gene type ($\beta_2$: regression parameter) | $\psi$ | 4.223 | 0.135 | (3.959, 4.487) | | |
| Adjusted by gene type | $\xi$ | 2.872 | 0.832 | (1.242, 4.503) | | |
| Adjusted by gene type | $\beta_2$ | 0.258 | 0.422 | (-0.568, 1.085) | 0.425 | 0.515 |

The parameters are defined in Section 6. MLE: estimate of the unknown parameter. SE: estimated standard error of the MLE. CI: confidence interval of the unknown parameter. LRT: likelihood ratio test statistic. P-value: P-value of the likelihood ratio test.

The estimated survival function, together with its confidence interval curves based on 5000 bootstrap replicates (Efron and Tibshirani, 1993), is plotted in Figure s7 of the supplementary material (available at Biostatistics online). The penetrance difference between males and females was moderately large, the penetrance difference between the 2 genes was minor, and neither difference was statistically significant (with P-values 0.168 and 0.515, respectively). These results are consistent with those of Olschwang and others (2009). In this example, we fitted a more general Weibull baseline hazard function with 2 parameters, whereas Olschwang and others (2009) fitted an exponential baseline hazard function with a threshold value.

7. DISCUSSION

Precise estimation of the age-dependent risk for people carrying disease-causing mutations would have a tremendous impact on public health: it is instrumental in counseling individuals who are identified by genetic testing as carriers and who face different options for cancer prevention or early detection.
We provide a rigorous statistical inference framework for the evaluation of the penetrance of a rare mutation. The approach can handle both covariates and multiple rare mutations. It is helpful to check the parametric assumption on the baseline hazard function. Because the design is retrospective and the observations are subject to censoring, rigorously checking the parametric assumption is a great challenge. In practice, one can try some commonly used parametric baselines and choose the one with the largest likelihood. This technique, however, could be misleading if the true baseline is very different from the candidate ones. A nonparametric approach that does not assume any parametric baseline would be much more desirable, although it raises some computational and theoretical issues. We will pursue this in future research.

The proposed approach allows for unobserved risk factors that are correlated among family members, provided that there is no interaction effect between the study mutation and the unobserved risk factors. When such an interaction is present, the proposed approach can produce considerably large bias in the penetrance estimates. More advanced methods such as the frailty model could be helpful in resolving this problem; see, for example, Hsu and others (2004) and Hsu and Gorfine (2006). The development of an inference procedure is still under way.

When the disease is not rare, as demonstrated in Section 2.4 and the simulation studies, assuming zero penetrance for noncarriers can produce considerably large bias in the penetrance estimate of carriers. In such situations, genotypes from affected relatives help to improve estimation with the proposed approach. Therefore, it is important to collect genotype information from both affected and unaffected relatives when adopting such a case family design for penetrance estimation. We hope our proposed method will make this potentially very useful design more accessible for future studies of rare mutations.

SUPPLEMENTARY MATERIAL

Supplementary material is available at http://biostatistics.oxfordjournals.org.

ACKNOWLEDGMENTS

We would like to thank Dr Gilles Thomas for helpful discussions, Dr B. J. Stone for editorial help, and Drs C. Lasset, Q. Wang, P. Hutter, M. P. Buisine, R. Etienne, C. Caron, V. Bourdon, and S. Baert-Desurmont for data collection. Conflict of Interest: None declared.
FUNDING

Intramural Program of the National Institutes of Health to H.Z. and K.Y.; Natural Science Foundation of China (10701067) to H.Z.; Institut National du Cancer to S.O.

REFERENCES

CHATTERJEE, N., KALAYLIOGLU, Z., SHIH, J. H. AND GAIL, M. H. (2006). Case-control and case-only designs with genotype and family history data: estimating relative risk, residual familial aggregation, and cumulative risk. Biometrics 62, 36-48.

CHATTERJEE, N. AND WACHOLDER, S. (2001). A marginal likelihood approach for estimating penetrance from kin cohort designs. Biometrics 57, 245-252.

COX, D. R. (1972). Regression models and life-tables. Journal of the Royal Statistical Society, Series B (Methodological) 34, 187-220.

EFRON, B. AND TIBSHIRANI, R. (1993). An Introduction to the Bootstrap. New York: Chapman & Hall.

GAIL, M. H., PEE, D., BENICHOU, J. AND CARROLL, R. (1999). Designing studies to estimate the penetrance of an identified autosomal dominant mutation: cohort, case-control, and genotyped-proband designs. Genetic Epidemiology 16, 15-39.

GAIL, M. H., PEE, D. AND CARROLL, R. (1999). Kin cohort designs for gene characterization. Journal of the National Cancer Institute. Monographs 26, 55-60.

GAIL, M. H., PFEIFFER, R. M., WHEELER, W. AND PEE, D. (2008). Probability of detecting disease-associated single nucleotide polymorphisms in case-control genome-wide association studies. Biostatistics 9, 201-215.

HSU, L., CHEN, L., GORFINE, M. AND MALONE, K. (2004). Semiparametric estimation of marginal hazard function from the case-control family studies. Biometrics 60, 936-944.

HSU, L. AND GORFINE, M. (2006). Multivariate survival analysis for case-control family data. Biostatistics 7, 387-398.

OLSCHWANG, S., YU, K., LASSET, C., BAERT-DESURMONT, S., BUISINE, M. P., WANG, Q., HUTTER, P., ROULEAU, E., CARON, O., BOURDON, V. and others (2009). Age-dependent cancer risk is not different in between MSH2 and MLH1 mutation carriers. Journal of Cancer Epidemiology doi:10.1155/2009/791754.

WACHOLDER, S., HARTGE, P., STRUEWING, J. P., PEE, D., MCADAMS, M., BRODY, L. AND TUCKER, M. (1998). The kin-cohort study for estimating penetrance. American Journal of Epidemiology 148, 623-630.

WANG, Y., CLARK, L. N., MARDER, K. AND RABINOWITZ, D. (2007). Nonparametric estimation of age-at-onset distributions from censored kin-cohort data. Biometrika 94, 403-414.

WANG, Y., OTTMAN, R. AND RABINOWITZ, D. (2006). A method for estimating penetrance from families sampled for linkage analysis. Biometrics 62, 1081-1088.

YU, K., LI, Q., BERGEN, A. W., PFEIFFER, R. M., ROSENBERG, P. S., CAPORASO, N., KRAFT, P. AND CHATTERJEE, N. (2009). Pathway analysis by adaptive combination of P-values. Genetic Epidemiology 33, 700-709.

[Received November 22, 2009; revised January 11, 2010; accepted for publication January 18, 2010]