On Identifying Rare Variants for Complex Human Traits

Ruixue Fan

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Graduate School of Arts and Sciences

COLUMBIA UNIVERSITY

2015

ABSTRACT

This thesis focuses on developing novel statistical tests for rare variants association analysis that incorporate both marginal effects and interaction effects among rare variants. Compared with common variants, rare variants have lower minor allele frequencies (typically less than 5%), and hence traditional association tests designed for common variants lose power when applied to rare variants. There is therefore a pressing need for new analytical tools to tackle the problem of associating rare variants with complex human traits.

Several collapsing methods have been proposed that aggregate information on rare variants in a region and test them together. They can be divided into burden tests and non-burden tests according to their aggregation strategies. They are all variations of regression-based methods that assume the phenotype is associated with the genotype via a (linear) regression model. Most of these methods consider only marginal effects of rare variants and fail to take into account gene-gene and gene-environmental interaction effects, which are ubiquitous and of utmost importance in biological systems.

In this thesis, we propose the summation of partition approach (SPA), a nonparametric strategy for rare variants association analysis. Extensive simulation studies show that SPA is powerful in detecting not only marginal effects but also gene-gene interaction effects of rare variants. Moreover, extensions of SPA are able to detect gene-environment interactions and other interactions that exist in complicated biological systems. We also obtain the asymptotic behavior of the marginal SPA score, which guarantees the power of the proposed method.

Inspired by the idea of stepwise variable selection, a significance-based backward dropping algorithm (SDA) is proposed to locate truly influential rare variants in a genetic region that has been identified as significant. Unlike traditional backward dropping approaches, which remove the least significant variables first, SDA eliminates the most significant variable at each round. The removed variables are collected and their effects are evaluated by an influence ratio score, the relative change in p-value. Our simulation studies show that SDA is powerful in detecting causal variables and has a lower false discovery rate than LASSO. We also demonstrate our method using the dataset provided by Genetic Analysis Workshop (GAW) 17, and the results support the superiority of SDA over LASSO.

The general partition-retention framework can also be applied to detect gene-environmental interaction effects for common variants. We demonstrate this method using the dataset from Genetic Analysis Workshop (GAW) 18. Our nonparametric approach identifies many more potentially influential gene-environmental pairs than traditional linear regression models.

We propose in this thesis a SPA-SDA two-step approach for rare variants association analysis at the genomic scale: first identify significant regions of moderate size using SPA, and then apply SDA to the identified regions to pinpoint truly influential variables. This approach is computationally efficient for genomic data and has the capacity to detect gene-gene and gene-environmental interactions.

Contents

List of Tables
List of Figures
Acknowledgments

Chapter 1 Introduction

Chapter 2 Rare Variants
    Common vs. Rare Variants
    Existing Methods for Rare Variants Association Analysis
    Burden Tests
    Non-burden Tests
    Why Another Approach?

Chapter 3 Summation of Partition Approach (SPA)
    Methods
    A General Framework for Common Variants Identification
    Rare Variants Marginal Association Score $I_1$
    Rare Variants G×G Association Score $I_2$
    Adaptive Test Score $p$
    3.2 Simulation Results
    Simulation Models
    Type I Error of $I_1$, $I_2$ and $p$
    Power Comparison in Marginal Effect Models for both Dichotomous and Continuous Traits
    Power Comparison in Gene-Gene Interaction Effect Models for both Dichotomous and Continuous Traits
    Properties of I
    Extension of SPA
    Rare Variants Gene-Environment Interaction Association Score I
    Type I Error and Power of I
    Application to GAW17 Dataset
    Comments about SPA

Chapter 4 A Significance-based Dropping Algorithm (SDA)
    The SDA Algorithm
    Simulation Results
    A Toy Example
    Demonstration of SDA with 500 SNPs
    Power of SDA
    Application to GAW17 Dataset
    Properties of SDA
    Discussion

Chapter 5 A Partition-based Approach to Identify Gene-Environment Interactions in Genome Wide Association Studies
    Gene-Environmental Interactions
    5.2 Methods
    Dataset
    A General Framework
    G×E Association Measure I
    Permutation Strategy
    Results
    Partitions Created by Environmental Factors
    SNPs with Significant G×E Interaction Effects
    Effect of Different Permutation Strategies
    Discussion

Chapter 6 Conclusion

Bibliography

List of Tables

2.1 Common vs. rare variants
Partitions by two SNPs
Type I error estimates of $I_1$, $I_2$ and $p$
Type I error estimates of $I_2$ in two different null settings
Power of different methods for genes FLT1 and ARNT in the GAW17 dataset
The first 10 rounds of BDA for the toy example
The influence score and its significance in the first 20 rounds of BDA for EXAMPLE
The influence score and its significance in the first 10 rounds of BDA for EXAMPLE
FDR of different methods for Example 1 and Simulation scenarios
5.1 Partitions created by genotypic and environmental factors
5.2 Partitions based on the summarized quantities of age, smoking status or medicine
Number of significant SNPs with p-value less than
P-values for testing the pedigree dependence of SBP and DBP

List of Figures

3.1 Power comparison in the marginal effect model when the effect sizes are constant. Powers were calculated for nominal α levels 0.05 (left) and 0.01 (right) and for dichotomous traits (upper) and continuous traits (lower). Powers were evaluated for $I_1$, $I_2$, $p$, SKAT, SKATint, VT, RB, WS and CMC. Scenarios with different sample sizes were considered. P-values were estimated using 10,000 permutations and power was evaluated using 1,000 replicates.

3.2 Power comparison in the marginal effect model when the effect sizes of causal variants are negatively correlated with MAFs. Powers were calculated for nominal α levels 0.05 (left) and 0.01 (right) and for dichotomous traits (upper) and continuous traits (lower). Powers were evaluated for $I_1$, $I_2$, $p$, SKAT, SKATint, VT, RB, WS and CMC. Scenarios with different sample sizes were considered. P-values were estimated using 10,000 permutations and power was evaluated using 1,000 replicates.

3.3 Power comparison in the marginal effect model with a mixture of protective and risk rare variants. Powers were calculated for nominal α levels 0.05 (left) and 0.01 (right) and for dichotomous traits (upper) and continuous traits (lower). Powers were evaluated for $I_1$, $I_2$, $p$, SKAT, SKATint, VT, RB, WS and CMC. Scenarios with different sample sizes were considered. P-values were estimated using 10,000 permutations and power was evaluated using 1,000 replicates.

3.4 Power comparison in the G×G interaction effect model when 50% of rare variants participate in the interaction effect. Powers were calculated for nominal α levels 0.05 (left) and 0.01 (right) and for dichotomous traits (upper) and continuous traits (lower). Powers were evaluated for $I_1$, $I_2$, $p$, SKAT, SKATint, VT, RB, WS and CMC. Scenarios with different sample sizes were considered. P-values were estimated using 10,000 permutations and power was evaluated using 1,000 replicates.

3.5 Power comparison in the G×G interaction effect model when 75% of rare variants participate in the interaction effect. Powers were calculated for nominal α levels 0.05 (left) and 0.01 (right) and for dichotomous traits (upper) and continuous traits (lower). Powers were evaluated for $I_1$, $I_2$, $p$, SKAT, SKATint, VT, RB, WS and CMC. Scenarios with different sample sizes were considered. P-values were estimated using 10,000 permutations and power was evaluated using 1,000 replicates.

3.6 Power comparison in the scenario with both main effect and G×G interaction effect. Powers were calculated for nominal α levels 0.05 (left) and 0.01 (right) and for dichotomous traits (upper) and continuous traits (lower). Powers were evaluated for $I_1$, $I_2$, $p$, SKAT, SKATint, VT, RB, WS and CMC. Scenarios with different sample sizes were considered. P-values were estimated using 10,000 permutations and power was evaluated using 1,000 replicates.

3.7 Power comparison in scenarios 1-6 for dichotomous traits with 500 cases and 500 controls when there are 30 SNPs under consideration. Powers were calculated for nominal α levels 0.05 (left) and 0.01 (right) when only positive G×E effects exist (upper) and when both positive and negative G×E effects exist (lower). Powers were evaluated for I2 (with both global and local permutations), $I_1$, $I_2$, $p$, SKAT, SKATint, VT, RB, WS and CMC. Scenarios with different sample sizes were considered. P-values were estimated using 10,000 permutations and power was evaluated using 1,000 replicates.

3.8 Power comparison in two G×E interaction models for a dichotomous trait. Powers were calculated for nominal α levels 0.05 (left) and 0.01 (right). Powers were evaluated for $I_1$, $I_2$, $p$, SKAT, SKATint, VT, RB, WS and CMC. P-values were estimated using 10,000 permutations and power was evaluated using 1,000 replicates.

4.1 Flowchart of the SDA algorithm.

4.2 P-values after removing the designated variable in the first round of BDA for the toy example. The largest p-value is attained after removing variable X.

4.3 P-value trace (left) and influence ratio (IR) score trace (right) in SDA for the toy example and its permutations. The red line is the p-value trace or IR score trace for the toy example. The green dotted lines are the p-value traces or IR score traces for 1,000 permutations. In the right plot, the blue line and the magenta line are traces of the 95% and 90% quantiles of each round in SDA. Causal SNPs are marked as crosses. The first five removed/retained variables in SDA are all true causal variables.

4.4 Return frequencies from SDA(Q95) of all variables among 100 replications for each scenario. The red line represents the return frequencies of true causal variables and the green lines represent the return frequencies of non-causal variables.

4.5 Average FDR from SDA(Q95), LASSO($\lambda_{\min}$) and LASSO($\lambda_{1se}$) with different numbers of selected variables in 100 repetitions of different scenarios.

4.6 Return frequencies of the causal rare variants in GAW17 using different methods. The x-axis is the index of the 148 causal rare variants in the simulation model of GAW17, and the y-axis is the selection frequency of these causal variables in 200 replications. The selection frequencies are calculated using SDA(Q90), SDA(Q95), LASSO($\lambda_{\min}$), LASSO($\lambda_{1se}$) and the Fisher exact test.

4.7 Average FDR of SDA(Q90) and LASSO($\lambda_{\min}$) in GAW17 data.

SPA-SDA two-step approach for rare variants association analysis. n is the number of regions.

The marginal effect of the genotype (left), the medication effect when genotype is 1 (middle) and the medication effect when genotype is 0 (right).

Positions of detected SNPs with significant interaction effects for SBP and DBP on chromosome

Acknowledgments

I owe my deepest appreciation to my advisor, Professor Shaw-Hwa Lo, for his continuous support, encouragement and guidance in both my graduate study and my life. This dissertation would not have been possible without his persistent help. I remember that many times I walked into Professor Lo's office with frustration, and he always encouraged me and helped me solve my research problems with great patience.

I would also like to thank Professors Tian Zheng, Lauren A. Hannah, Shuang Wang and Pei Wang for agreeing to serve on my defense committee.

In addition, I would like to thank all my friends at the Department of Statistics of Columbia University, especially members of the statistical genetics group, for their constant support throughout the past four years. I have had the opportunity to learn from many professors in the department and have drawn inspiration from my fellow Ph.D. classmates.

Last but most importantly, I want to express my deepest gratitude to my parents, Aicheng Fan and Bingqin Guan, for their most precious love and support. And to my husband, Su Chen, for his everlasting love. He has seen my ups and downs and I thank him for choosing to stand by me and hold my hands. I thank my son, Derrick Chen. He is the origin of my joy. Whenever I see his innocent smiling face, I am refilled with energy and never feel tired.

To my family

Chapter 1

Introduction

Since Mendel's pea plant experiments in the 1850s, scientists have put a great deal of effort into decoding how genetic variation contributes to phenotypic expression. Large-scale biological studies such as genome-wide association studies (GWAS) have succeeded in discovering many disease variants for diabetes, heart disease, Alzheimer disease and other conditions, most of which are common variants with minor allele frequencies (MAFs) greater than 0.05. Nevertheless, the variants identified thus far confer relatively small risk, explain a small fraction of familial clustering, and add little practical value in disease prediction. This issue of so-called missing heritability has been a serious concern that has attracted considerable attention and discussion recently (Eichler et al. 2010; Gibson et al. 2010; Maher 2008; Makowsky et al. 2011; Manolio et al. 2009). A number of explanations have been suggested for this phenomenon, including: (1) an as-yet undiscovered larger set of variants with smaller effects, (2) rare variants with larger effects that may be eluding current GWAS, (3) unaccounted effects due to gene-gene (G×G) and gene-environment (G×E) interactions, (4) undetected structural effects including copy number variations (CNVs), and (5) over-estimated heritability (Clayton 2009; de Los Campos et al. 2010; Jakobsdottir et al. 2009; Janssens and van Duijn 2008; McCarthy et al. 2008; Witte 2010).

In this thesis, we focus on (2) and (3) and propose a two-step approach that evaluates the association between rare variants and complex human traits while accounting for G×G and G×E interactions, and that identifies influential rare variants in a specific region. In Chapter 2, we review several existing methods for rare variants association analysis. Chapter 3 develops the summation of partition approach for rare variants association analysis and presents results from a simulation study and from a GAW17 dataset. Chapter 4 proposes a significance-based backward dropping algorithm to identify influential variables in a region that has been identified as significant. Results from simulation studies and from the GAW17 dataset show the superiority of our proposed methods. Chapter 5 is a study of gene-environmental interaction effects, an extension of the general framework. Chapter 6 concludes.

Chapter 2

Rare Variants

2.1 Common vs. Rare Variants

Genetic variations can be arbitrarily divided into two types, common variants and rare variants, based on their minor allele frequencies. Common variants are defined as variants with MAFs greater than 1%, and rare variants are those with MAFs below this threshold. Recently there has been considerable debate concerning the frequency of the genetic variations that contribute to complex human diseases such as hypertension, diabetes and cancer; this is the heart of the Common Disease Common Variants (CDCV) vs. Common Disease Rare Variants (CDRV) debate.

The CDCV hypothesis states that the genetic risk of complex traits is mostly contributed by a small number of common variants with moderately small effects, in the sense that if a disease occurs among a lot of people, it should be caused by variants present in a lot of people (Park et al. 2010). This is one of the motivations of genome-wide association studies (GWAS), including the Human Genome Project starting from the year 2000. GWAS have mapped over 1 million common variants in over 10,000 individuals, and have identified many genetic variants strongly associated with complex traits, with p-values below the Bonferroni-corrected threshold. However, the variants identified so far explain only a small proportion of the genetic variance of the traits, even for highly heritable traits. The most notable example is human height, a highly heritable trait with 80% heritability. GWAS have mapped about 50 genetic loci associated with human height, but put together they explain only about 5% of the phenotypic variability (Manolio et al. 2009). That is, the variants from GWAS do little of the prediction work that one could do solely by asking people the height of their parents. This phenomenon is called missing heritability. It suggests limitations in the ability of GWAS to identify the variants associated with complex traits. One reason is that the gene chips used in GWAS are typically designed with a focus on common SNPs, and contain a very small number of rare variants.

The alternative CDRV hypothesis claims that common diseases are caused by multiple rare variants with large effect sizes (Bodmer and Bonilla 2008; Pritchard 2001). There are fundamental differences in functional mechanisms between common and rare variants. Most rare variants are likely to be missense mutations in protein-coding regions or promoter regions, and they are able to change amino acid sequences, affecting protein-protein interactions or altering gene expression levels (Cohen et al. 2004; Frayling et al. 1998). In contrast, the common variants mapped in GWAS are unlikely to affect protein function directly. They are associated with a disease phenotype because they are in linkage disequilibrium (LD) with the functional variants.

Experimental studies also show that the effects of rare variants tend to be larger than those of common variants. Based on recent publications, there is a clear difference in the distribution of odds ratios (ORs) for common and rare variants. Only a few common variants have ORs greater than 2, and the majority fall between 1.1 and 1.5. On the other hand, most rare variants identified to date have an OR greater than 2 (Iyengar and Elston 2007; Bodmer and Bonilla 2008). Therefore, the detection and investigation of rare variants will help researchers understand disease etiology and provide insights for medical treatment.

Due to their very low frequencies, most rare SNPs will be missed by the tag-SNP method used in GWAS. Therefore, the best way to detect them is by direct sequencing. In the past few years, next-generation sequencing technologies have been developed and commercially introduced into biological research. Compared with conventional capillary-based sequencing, they can process millions of sequence reads in parallel, avoid the bias of old-fashioned vector cloning and sample preparation, and hence provide a time-efficient, cost-effective and more accurate sequencing readout (Mardis 2008). Recently, candidate gene resequencing has been applied to identify millions of rare mutations in the human genome. For example, the 1000 Genomes Project is being carried out to sequence at least 1000 genomes and is expected to detect most rare SNPs with frequencies of 1% to 5%. In addition, the Exome Sequencing Project of the National Heart, Lung and Blood Institute (NHLBI) sequenced 1,361 European Americans and 1,088 African Americans and discovered that more than 72% of the reads are variants with 3 or fewer minor allele counts. In the future, whole-genome sequencing will be possible for a larger pool of individuals, and a large amount of sequence data with rare variants will be generated. Methods that are capable of detecting these influential rare variants are very much needed. Table 2.1 summarizes the differences between common and rare variants.
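The 1% cutoff used in this chapter can be illustrated with a small sketch. It assumes genotypes are coded as 0, 1 or 2 copies of one designated allele per individual; the function names and the toy genotype vectors are hypothetical and are not part of the thesis.

```python
def minor_allele_frequency(genotypes):
    """Estimate the MAF from per-individual allele counts coded 0, 1 or 2."""
    n = len(genotypes)
    freq = sum(genotypes) / (2.0 * n)   # frequency of the counted allele
    return min(freq, 1.0 - freq)        # the minor allele is the less frequent one


def classify_variant(genotypes, threshold=0.01):
    """Label a variant 'rare' if its MAF is below the threshold, else 'common'."""
    return "rare" if minor_allele_frequency(genotypes) < threshold else "common"


# Toy data: 200 individuals, one heterozygous carrier (MAF = 1/400).
singleton = [1] + [0] * 199
# Toy data: 200 individuals, 40 heterozygous carriers (MAF = 40/400).
frequent = [1] * 40 + [0] * 160

print(classify_variant(singleton), classify_variant(frequent))
```

The threshold is left as a parameter because the thesis itself notes that the cutoff is arbitrary: the abstract mentions 5%, while this chapter uses 1%.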

Table 2.1: Common vs. rare variants

                 Common Variants                           Rare Variants
MAF              > 0.01                                    < 0.01
Disease Model    Common-Disease Common-Variants            Common-Disease Rare-Variants
Study            Genome-Wide Association Studies (GWAS)    1000 Genomes Project; Exome Sequencing Project (NHLBI)
Effect Size      Small                                     Large (OR > 2)

Existing Methods for Rare Variants Association Analysis

With increasing attention on rare variants, the question of how to evaluate their effects arises. Due to the extremely low frequencies of rare variants, the classical association tests for common variants are inappropriate for the analysis of rare variants (Morris and Zeggini 2010). Single-marker tests, such as the Pearson $\chi^2$ test, the Fisher exact test and the Cochran-Armitage trend test, will have inflated type II error and lose power for rare variants unless the sample size or the effect size is very large. For example, to achieve a power of 0.8, the sample size required to detect a rare variant with a relative risk of 2 is approximately 63,000 cases and 63,000 controls. On the other hand, multiple-marker tests such as multiple regression, although allowing for simultaneous analysis of all SNPs, generally suffer from power loss due to their large degrees of freedom. Therefore, various collapsing methods have been proposed to combine information across multiple rare variant loci. The basic idea is that low-frequency variants are rare, but aggregating them together may make them common enough to account for variation in common traits. These collapsing-based tests can be broadly classified as burden tests and non-burden tests.

Burden Tests

Burden tests group the rare variants in a region into a single score and then regress the phenotype on this score to assess the cumulative effect of the rare variants. If all variants are causal and their effects are in the same direction, burden tests can lead to a large increase in power.

One of the first burden tests, the Combined Multivariate and Collapsing (CMC) method, was proposed by Li and Leal (Li and Leal 2008). It tests whether the proportions of rare variant carriers in cases and controls are significantly different. In particular, CMC collapses genotypes across a region such that an individual is coded as 1 if at least one rare variant is present and as 0 otherwise, i.e. for the $j$th individual,

$$X_j = \begin{cases} 1 & \text{if rare variants are present} \\ 0 & \text{otherwise} \end{cases}$$

For regions with only rare variants, the chi-square test or Fisher's exact test is then used to evaluate the association between the collapsed score and the phenotype. For regions with both common and rare variants, rare variants are collapsed while common variants are left unchanged, and then Hotelling's $T^2$ test is applied to test for the association.
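The collapsing rule described above can be sketched in a few lines. This is a minimal illustration for a region containing only rare variants, assuming a plain Pearson chi-square test on the resulting 2×2 table without continuity correction; the function names and the toy genotype matrix are hypothetical, and Li and Leal's own implementation may differ.

```python
import math


def cmc_collapse(genotypes):
    """Return 1 for an individual carrying at least one rare allele, else 0."""
    return [1 if any(g > 0 for g in row) else 0 for row in genotypes]


def chi_square_2x2(carrier, case):
    """Pearson chi-square test (1 df, no continuity correction) on a 2x2 table."""
    pairs = list(zip(carrier, case))
    a = pairs.count((1, 1))  # carrier cases
    b = pairs.count((1, 0))  # carrier controls
    c = pairs.count((0, 1))  # non-carrier cases
    d = pairs.count((0, 0))  # non-carrier controls
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:  # an empty margin: the table carries no information
        return 0.0, 1.0
    stat = n * (a * d - b * c) ** 2 / denom
    # Upper tail of a chi-square with 1 df: P(Z^2 > stat) = erfc(sqrt(stat / 2))
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Toy region with three rare variants; rows are individuals (hypothetical data).
genotypes = [
    [1, 0, 0], [0, 2, 0], [0, 0, 1], [0, 0, 0],  # four cases
    [0, 1, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0],  # four controls
]
case = [1, 1, 1, 1, 0, 0, 0, 0]

carrier = cmc_collapse(genotypes)
stat, p_value = chi_square_2x2(carrier, case)
print(carrier, stat, round(p_value, 4))
```

With counts this small, Fisher's exact test would be the more appropriate choice in practice; the chi-square form is used here only because it can be written without external libraries.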

The weighted sum (WS) method (Madsen and Browning 2009) introduces the idea of weighting variants according to their MAFs, so that less frequent variants are given higher weights than more common variants. Consider a region with $K$ rare variants and $n$ individuals. Let $X_{ij}$ denote the genotype of the $i$th variant for individual $j$. The weight for each variant $i$ ($i = 1, \dots, K$) is calculated as

$$w_i = \sqrt{n_i q_i (1 - q_i)}$$

where $n_i$ is the number of individuals carrying rare variants $i$ and $q_i$ is the estimated frequency of the $i$th variant in controls. The weight $w_i$ is the estimated standard deviation of the total number of mutations in the sample. The genetic score of each individual $j$ is calculated as

$$\gamma_j = \sum_{i=1}^{K} \frac{X_{ij}}{w_i}$$

and the test statistic is the sum of ranks for affected individuals:

$$\chi = \sum_{j \in \text{affected}} \operatorname{rank}(\gamma_j)$$

Under the null hypothesis of no association, the standardized score

$$z = \frac{\chi - \hat{\mu}}{\hat{\sigma}}$$

approximately follows a standard normal distribution. Here $\hat{\mu}$ and $\hat{\sigma}$ are the mean and standard deviation of $\chi$ estimated from permutation.

Instead of using the conventional cutoff values 0.05 or 0.01 to define rare variants, Price et al. (Price et al. 2010) proposed choosing the threshold that gives optimal testing power; this approach is called the variable threshold (VT) method. For each allele-frequency threshold $T$, VT computes a z-score from linear regression using only the SNPs with MAF less than $T$:

$$z(T) = \frac{\sum_{i=1}^{K} \sum_{j=1}^{n} \xi_i^T X_{ij} (Y_j - \bar{Y})}{\sqrt{\sum_{i=1}^{K} \sum_{j=1}^{n} (\xi_i^T X_{ij})^2}}$$

where

$$\xi_i^T = \begin{cases} 1 & \text{if } \mathrm{MAF}_i < T \\ 0 & \text{otherwise} \end{cases}$$

The test statistic is defined as the maximum of $z(T)$ over all possible thresholds:

$$z_{\max} = \max_{T} \{ z(T) \}$$

Its significance is evaluated by permutation.

Burden tests have been shown to be powerful compared with single-marker tests, but a limitation is that they assume all rare variants influence the phenotype in the same direction and with the same magnitude of effect. In practice, however, it is highly possible that most variants have little or no effect, while some may be protective and others deleterious. Moreover, the magnitude of each variant's effect is likely to vary. Hence, the aggregated score might introduce noise or dilute the signal of association, leading to a decrease in power.

Non-burden Tests

Instead of aggregating genotypes into a single score, non-burden tests aggregate the scores from individual SNPs to evaluate the association between genotype and phenotype.

One non-burden test is the replication-based test (RB) (Ionita-Laza et al. 2011), which also introduces a weighted sum score. In RB, variants are classified according to their numbers of occurrences in cases and controls. It uses $n_k^{k'}$ to denote the number of variants with exactly $k$ copies of the minor allele in controls and $k'$ copies of the minor allele in cases. For risk variants, with $k'$ greater than $k$, the weighted-sum test statistic is defined as

$$S^{+} = -\sum_{k=0}^{N_r} \sum_{k' > k} n_k^{k'} \log[p(k, k')]$$

where $p(k, k')$ is the probability of a variant occurring at most $k$ times in controls and at least $k'$ times in cases when the null hypothesis is true, and $N_r$ is the upper threshold on the number of occurrences of a variant in controls. Assuming the number of mutations of rare variants follows a Poisson distribution, $p(k, k')$ is calculated as

$$p(k, k') = \operatorname{ppois}(k, \hat{f}) \, \bigl(1 - \operatorname{ppois}(k' - 1, \hat{f})\bigr)$$

where $\hat{f} = (k + k')/2$ is the estimated frequency in both cases and controls, and ppois is the Poisson cumulative distribution function. For protective variants, define

$$S^{-} = -\sum_{k'=0}^{N_r} \sum_{k > k'} n_k^{k'} \log[p(k', k)]$$

where $n_k^{k'}$, $p(k', k)$ and $N_r$ are defined similarly as before. A max-statistic $\max(S^{+}, S^{-})$ is calculated and its significance is assessed by permutation.

Another non-burden test is the sequence kernel association test (SKAT) (Wu et al. 2011), which is based on the random effect regression model. For the $j$th individual, let $y_j$ be the phenotype, $X_j = (X_{j1}, \dots, X_{jm})$ be the covariates and $G_j = (G_{j1}, \dots, G_{jK})$ be the genotype for the $K$ variants within the region under consideration. For continuous traits, consider the linear model

$$y_j = \alpha_0 + \alpha^\top X_j + \beta^\top G_j + \epsilon_j$$

where $\alpha_0$ is the intercept, $\alpha = [\alpha_1, \dots, \alpha_m]^\top$ is the vector of regression coefficients for the covariates and $\beta = [\beta_1, \dots, \beta_K]^\top$ is the vector of regression coefficients for the $K$ genetic variants. SKAT assumes that each $\beta_i$ follows an arbitrary distribution with mean 0 and variance $w_i \tau$, where $\tau$ is a variance component and $w_i$ is a pre-specified weight for variant $i$. SKAT aims to test $H_0 : \tau = 0$ using a variance component score test statistic:

$$Q = (y - \hat{\mu})^\top \mathbf{K} (y - \hat{\mu})$$

where $\mathbf{K} = \mathbf{G}\mathbf{W}\mathbf{G}^\top$ is the weighted linear kernel and $\hat{\mu}$ is the predicted mean of $y$ under $H_0$. Here $\mathbf{G}$ is an $n \times K$ matrix of genotypes and $\mathbf{W} = \operatorname{diag}\{w_1, \dots, w_K\}$ contains the weights of the $K$ variants. The default weight is set as $w_i = \operatorname{Beta}(\mathrm{MAF}_i; 1, 25)$, the beta density function with parameters 1 and 25 evaluated at the sample MAF of the $i$th variant in both cases and controls. The kernel matrix could have different forms. A weighted quadratic kernel

$$K(G_j, G_{j'}) = \Bigl( \sum_{i=1}^{K} w_i G_{ji} G_{j'i} \Bigr)^2$$

was also proposed that accounts for both marginal effects and G×G interaction effects of genetic factors. SKAT with this weighted quadratic kernel is called SKATint.

Many alternative methods that have also been proposed can be considered variations of these approaches (Han and Pan 2010; Morris and Zeggini 2010; Neale et al. 2011). Non-burden tests have been shown to be powerful in the presence of different directions of genetic effects, but they do not systematically address the issue of G×G and G×E interactions.

Why Another Approach?

The aforesaid methods have been shown to work well in different simulated models (mostly with marginal effects only). However, all these tests only consider marginal effects of rare variants, and they do not systematically address the issue of interactions among rare variants (G×G), or between rare variants and covariates such as environmental factors (G×E). Therefore, additional statistical methods are needed to generate scientific knowledge on the etiology of complex diseases, where genetic, biological and environmental variables interact to produce a phenotype.

In this thesis, we propose the summation of partition approach (SPA), a robust model-free method that not only tests for the marginal effects of rare SNPs but also naturally incorporates G×G and G×E interactions. As with existing methods, SPA is based on aggregating information across rare variants in a region of interest. We shall demonstrate the power of SPA and compare it with existing methods for both dichotomous and quantitative phenotypes. Simulation studies show that in disease models without interactions, the performance of SPA is comparable to or even better than that of the most competitive existing method in the current literature; in the presence of G×G interactions, SPA substantially outperforms all the other methods. Another advantage of our procedure is its simplicity and extensibility. We also demonstrate how to incorporate an environmental factor into the proposed framework and show that the augmented test score is powerful in detecting G×E interactions. Similar approaches can be taken to account for interactions with common variants or other covariates. When large volumes of data with rare variants become available in the near future, the proposed procedure, SPA, will become a powerful tool for detecting complicated interaction effects in various genetic regions, and it will help us to better understand the mechanisms of complex human diseases.
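As a recap of the contrast between burden and non-burden tests reviewed in this chapter, the sketch below compares a simple collapsed-count score with the variance-component statistic $Q$ on a toy region containing one risk variant and one protective variant. It is an illustration under stated assumptions, not the SKAT software: the null model is taken to be intercept-only (so $\hat{\mu}$ is the sample mean), the weights follow the Beta(1, 25) density as described in the text, and `burden_stat`, `skat_q` and the toy data are hypothetical names and values.

```python
def beta_1_25_density(maf):
    """Beta(1, 25) density evaluated at maf, i.e. 25 * (1 - maf)**24."""
    return 25.0 * (1.0 - maf) ** 24


def centered(y):
    """Residuals under an intercept-only null model (assumption of this sketch)."""
    mean = sum(y) / len(y)
    return [v - mean for v in y]


def burden_stat(G, y):
    """Squared score for the collapsed count: (sum_j c_j (y_j - ybar))^2."""
    r = centered(y)
    score = sum(sum(row) * r_j for row, r_j in zip(G, r))
    return score ** 2


def skat_q(G, y, weights=None):
    """Q = (y - mu)' G W G' (y - mu) = sum_i w_i (sum_j G_ji (y_j - mu_j))^2."""
    n, K = len(G), len(G[0])
    r = centered(y)
    if weights is None:
        mafs = [sum(G[j][i] for j in range(n)) / (2.0 * n) for i in range(K)]
        weights = [beta_1_25_density(m) for m in mafs]
    per_variant = [sum(G[j][i] * r[j] for j in range(n)) for i in range(K)]
    return sum(w * s * s for w, s in zip(weights, per_variant))


# Toy data: 8 individuals, 2 rare variants with opposite effects on the trait.
G = [[1, 0], [0, 1]] + [[0, 0]] * 6
y = [2.0, -2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

print(burden_stat(G, y))   # opposite effects cancel in the collapsed score
print(skat_q(G, y) > 0.0)  # squared per-variant scores do not cancel
```

Because $Q$ sums squared per-variant scores, effects of opposite sign add up rather than cancel, which is the robustness property attributed to non-burden tests in the text. In practice the significance of either statistic would still need to be assessed, for example by permutation.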

30 13 Chapter 3 Summation of Partition Approach (SPA) 3.1 Methods To better understand the motivation and rational behind SPA, we briefly review a general framework that has been adopted for detecting common variants with interactions. A core element in this framework is the influence score derived from what we now know as the Partition Retention (PR) method. (Chernoff et al. 2009) Several forms and variations were associated with the PR method before it was finally coined this name in A General Framework for Common Variants Identification We demonstrate a basic tool adopted by our method. Suppose there are n subjects with a response variable Y and K discrete explanatory variables {X 1,, X K }. If each X i can take three discrete values, we generate a par-

31 14 tition Π with 3 K non-overlapping partition elements. Let n i be the number of subjects in partition i, Ȳi the average response for subjects in partition i, and Ȳ the average response from all subjects. An influence measure between the response and the predictors is defined as: I(X 1,, X K ) = n 2 i (Ȳi Ȳ ) 2 i Π It has been shown that under the null hypothesis that none of the predictors has influence on Y, the normalized I, I/(nσ 2 ) (σ 2 denotes the variance of Y ) is asymptotically distributed as a weighted sum of χ 2 random variables of 1 degree of freedom each such that the total weight is less than 1. (Chernoff et al. 2009) The main structure of this measure is the partition formed by the K discrete variables with 3 K partition elements each containing non-overlapping observations. This influence measure captures any discrepancy between the conditional mean and the grand mean of Y and thus is able to detect X Y association regardless of the structure of dependence. It can be easily generalized to any discrete random variables with finite number of outcomes. In case-control studies, the influence measure can be rewritten as: I = ( ) 2 n 2 i ˆp D N A i N A + N U i Π where N A is the number of affected individuals, N U is the number of unaffected individuals, and ˆp D i is the proportion of cases in partition i. Several variations of this partition-based method have been successful at identifying influential common variants and their interactions in human diseases, such as Rheumatoid Arthritis (Qiao et al. 2009; Huang et al. 2009; Ding et al. 2007) and breast cancer (Lo et al. 2008; Zheng et al. 2007). Its success in detecting common variants relies on the essence that many partition cells contain more than singleton subjects, however, this property will diminish for rare variants due

to their extremely low frequencies. To deal effectively with rare variants, we need to modify this partition procedure to accommodate the sparseness, which can be achieved by the proposed summation of partition approach (SPA). We introduce below several test statistics of SPA, including the marginal test score $I_1$ and the G×G interaction score $I_2$.

Rare Variants Marginal Association Score $I_1$

The general framework described above can be extended to rare variants association analysis for both dichotomous and continuous phenotypes. In population-based case-control studies, suppose there are $N$ unrelated individuals, among which $N_A$ are cases and $N_U = N - N_A$ are controls. The region of interest $G$ contains $K$ rare variants and the genotype of the $j$-th individual is denoted $(X_1^{(j)}, \ldots, X_K^{(j)})$. Each $X_i^{(j)}$ ($i = 1, \ldots, K$) can take values 0, 1 or 2, indicating the number of rare variants at this position. The SPA test score $I_1$ that accounts for the marginal information contributed by these $K$ rare SNPs is defined as

$$I_1 = \sum_{i=1}^{K} n_i^2 \left( \hat{p}_i^D - \frac{N_A}{N_A + N_U} \right)^2,$$

where $\hat{p}_i^D$, for the $i$-th SNP, is the fraction of all observed rare variants that are from cases, and $n_i$ is the total number of the $i$-th rare variant observed in all subjects. For continuous traits, $I_1$ is defined as

$$I_1 = \sum_{i=1}^{K} n_i^2 (\bar{Y}_i - \bar{Y})^2,$$

where $\bar{Y}_i$, for the $i$-th SNP, is the average response for subjects bearing at least one rare variant, $\bar{Y}$ is the average response over all subjects and $n_i$ is

defined as above. Unlike the original influence measure, $I_1$ uses the partition elements formed by each individual SNP, so the partitions from different SNPs are no longer non-overlapping; therefore, $I_1$ does not suffer from the sparseness of rare variants. Under the null hypothesis of no influence, the differences between $\hat{p}_i^D$ and $N_A/(N_A + N_U)$ for dichotomous traits (or between $\bar{Y}_i$ and $\bar{Y}$ for continuous traits) are small for all $i$, so a large $I_1$ value indicates that some rare variants in the region might be associated with the disease phenotype. Additionally, since each term of $I_1$ is the squared difference between the conditional average and the grand average, it can detect departures from the expected difference of zero in both directions, which gives $I_1$ the ability to capture the association even in a region with both risk and protective rare variants. In our analyses we rely on permutation to assess its statistical significance.

Rare Variants G×G Association Score $I_2$

In order to increase the power of detecting genotype-phenotype associations, as well as to elucidate the biological pathways that underpin disease, more and more attention has been given to the identification of interactions between SNP loci (Marchini et al. 2005; Hu et al. 2011; Cordell 2009). A limitation of $I_1$ is that it takes little account of interactions among rare SNPs. From the general framework, we propose a second SPA test score $I_2$ that evaluates G×G interactions among rare variants. As the genotype at each SNP position can take 3 values, in theory we face a maximum of $3^K$ partition elements for all levels of interactions. However, due to the low frequencies of rare variants, the higher order (> 2) interaction

information among rare SNPs at current sample sizes will be small. For example, if the sample size is 1,000 and the SNP frequency is 0.01, the expected number of observations of one specific rare variant triplet is $1{,}000 \times 0.01^3 = 0.001$. If a region contains 20 independent rare SNPs, the expected total number of rare variant triplets would be $\binom{20}{3} \times 0.001 = 1.14$, which provides a very low signal for 3-way interaction detection. Therefore, for current sample sizes, we only consider an influence measure that takes into account 2-way interactions among rare variants.

For a pair of rare SNPs $i$ and $i'$, we consider three aggregated cells (Table 3.1): individuals with rare variants only on SNP $i$ (denoted mM), individuals with rare variants only on SNP $i'$ (denoted Mm) and individuals with rare variants on both SNPs (denoted mm). Note that we do not consider the cell MM, where individuals have no rare variant at either position.

Table 3.1: Partitions by two SNPs

                  SNP i = 0    SNP i = 1, 2
SNP i' = 0        MM           mM
SNP i' = 1, 2     Mm           mm

For a dichotomous trait, the SPA test score $I_2$ for G×G interaction is defined as

$$I_2 = \sum_{i=1}^{K} \sum_{i' > i} n_{ii'}^2 \left[ \left( \hat{p}^d_{ii',mM} - \frac{n_d}{n_d + n_u} \right)^2 + \left( \hat{p}^d_{ii',Mm} - \frac{n_d}{n_d + n_u} \right)^2 + \left( \hat{p}^d_{ii',mm} - \frac{n_d}{n_d + n_u} \right)^2 \right],$$

where $n_{ii'}$ is the number of subjects who have at least one rare variant in either SNP ($i$ or $i'$), $\hat{p}^d_{ii',mM}$ is the fraction of subjects that are cases in partition mM, $\hat{p}^d_{ii',Mm}$ is that fraction in partition Mm, and $\hat{p}^d_{ii',mm}$ is that fraction in partition mm.
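The pairwise score for a dichotomous trait can be sketched in a few lines of Python. This is only an illustration of the definition above, not the implementation used in the thesis: the function name `spa_i2` and the list-of-lists data layout are hypothetical, and the treatment of an empty cell (it contributes zero) is an assumption, since the text does not say how empty cells are handled.

```python
from itertools import combinations


def spa_i2(genotypes, is_case):
    """Pairwise SPA score I_2 for a dichotomous trait (illustrative sketch).

    genotypes: one list per subject with entries 0, 1 or 2 (rare allele counts).
    is_case:   list of 0/1 labels (1 = case), same length as genotypes.
    """
    n = len(is_case)
    p0 = sum(is_case) / n  # overall case fraction n_d / (n_d + n_u)
    n_snps = len(genotypes[0])
    score = 0.0
    for a, b in combinations(range(n_snps), 2):
        # [cases, total] for the cells mM (only a), Mm (only b), mm (both)
        cells = {"mM": [0, 0], "Mm": [0, 0], "mm": [0, 0]}
        for g, y in zip(genotypes, is_case):
            has_a, has_b = g[a] > 0, g[b] > 0
            if not (has_a or has_b):
                continue  # cell MM is not used by I_2
            key = "mm" if (has_a and has_b) else ("mM" if has_a else "Mm")
            cells[key][0] += y
            cells[key][1] += 1
        n_pair = sum(total for _, total in cells.values())
        # Assumption: an empty cell contributes 0 to the bracket.
        bracket = sum((cases / total - p0) ** 2
                      for cases, total in cells.values() if total > 0)
        score += n_pair ** 2 * bracket
    return score


# Toy data: two subjects in mM (both cases), one in Mm (control),
# one in mm (case), two in MM (controls).
G = [[1, 0], [1, 0], [0, 1], [2, 1], [0, 0], [0, 0]]
y = [1, 1, 0, 1, 0, 0]
print(spa_i2(G, y))  # 4^2 * (0.25 + 0.25 + 0.25) = 12.0
```

In the toy data the overall case fraction is 0.5, each of the three cells deviates from it by 0.5, and four subjects carry a rare variant at either SNP, which gives the value 12.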

For a quantitative trait, $I_2$ is defined as

$$I_2 = \sum_{i=1}^{K} \sum_{i' > i} n_{ii'}^2 \left[ (\bar{Y}_{ii',mM} - \bar{Y})^2 + (\bar{Y}_{ii',Mm} - \bar{Y})^2 + (\bar{Y}_{ii',mm} - \bar{Y})^2 \right],$$

where $\bar{Y}_{ii',mM}$ is the average response for individuals in partition mM, $\bar{Y}_{ii',Mm}$ that in partition Mm, and $\bar{Y}_{ii',mm}$ that in partition mm. If two rare variants interact, the difference between the conditional average and the unconditional average will be large, leading to a large $I_2$ value. Again, permutation is used to evaluate the significance of the test statistic $I_2$. Even though $I_2$ only considers 2-way interactions, it can be easily extended to include higher-order ($\geq 3$) interactions by generating partitions based on $m$-tuples ($m \geq 3$) of rare SNPs.

Adaptive Test Score $p^*$

When it is unclear whether G×G interaction is involved in the onset of disease, we propose an adaptive score $p^*$ that is a compromise between $I_1$ and $I_2$. We first evaluate the significance of $I_1$ and $I_2$. Then the adaptive test score is defined as

$$p^* = \min\left(p(I_1), p(I_2)\right),$$

where $p(I_1)$ and $p(I_2)$ are the p-values of $I_1$ and $I_2$ respectively. We evaluate the significance of $p^*$ by permutation.

3.2 Simulation Results

Simulation Models

We simulated several scenarios to evaluate our test scores and to compare them with several existing rare variants association methods. The

genotype consists of 20 independent rare variants in each scenario. For type I error evaluation, we simulated the null scenario where none of the 20 variants affects the phenotype. For empirical power comparisons, we generated two different sets of simulations. The first set consists of marginal effect models, in which the MAFs of the SNPs are drawn from a uniform distribution. In scenario 1, 5 out of the 20 rare SNPs are risk SNPs and the effect size is constant. Scenario 2 is similar to scenario 1 except that the risk effect is negatively correlated with MAF. Scenario 3 has 5 protective variants and 5 deleterious variants with effect sizes negatively correlated with MAF. The second set of simulations contains 2-way G×G interactions between rare variants, with MAF 0.01 for all 20 SNPs. In scenario 4, 50% of the SNPs (10 out of 20 SNPs) have interaction effects. Scenario 5 is similar to scenario 4 but 75% of the SNPs are involved in G×G interactions. Both main effects and G×G interaction effects exist in scenario 6.

Type I Error of $I_1$, $I_2$ and $p^*$

The empirical type I error rates for $I_1$, $I_2$ and $p^*$ are presented in Table 3.2 for nominal levels α = 0.05 and α = 0.01 with four different sample sizes, including 600, 1000 and 1500. The results show that the type I errors of $I_1$, $I_2$ and $p^*$ are well controlled at both significance levels for either dichotomous or continuous traits, even when the sample size is small, indicating that the proposed tests are valid.

Power Comparison in Marginal Effect Models for both Dichotomous and Continuous Traits

We compare the power of $I_1$, $I_2$ and $p^*$ with competing methods: CMC, WS, VT, RB, SKAT (with weighted linear kernel) and SKATint (a modified SKAT

score with weighted quadratic kernel) in three marginal effect models when (1) only risk variants exist and the effect size is constant, (2) only risk variants exist and the effect size is negatively correlated with MAF, or (3) a mixture of risk and protective rare variants exists. In all three marginal effect scenarios, the performance of $I_1$ and SKAT is comparable and both are superior to the other tests (Figures 3.1, 3.2 and 3.3). For dichotomous traits, $I_1$ is the most powerful method, followed by SKAT and $p^*$. For continuous traits, SKAT and $I_1$ are the most competitive; both are more powerful than the other methods. The power of the adaptive score $p^*$ is very close to that of $I_1$; $p^*$ is much more powerful than CMC, WS, VT and RB. In addition, $I_1$ and $p^*$ are quite robust across simulation scenarios, even in the presence of a mixture of risk and protective variants, while CMC, WS and VT suffer substantial power loss when causal rare variants have opposite effects (Figure 3.3). It is worth noting that although $I_1$ does not intentionally highlight less frequent variants by giving them higher weights, it is still the most powerful (for dichotomous traits) or the second most powerful (for quantitative traits) method even in scenarios where the effect size is negatively correlated with MAF, showing that its good performance is intrinsic and is not driven by a specific weighting scheme. The test score $I_2$ does not show high power in these marginal effect models, as it is designed to detect G×G interaction effects rather than marginal effects.

Table 3.2: Type I error estimates of $I_1$, $I_2$ and $p^*$ for dichotomous and continuous traits at α = 0.05 and α = 0.01, by sample size.

Power Comparison in Gene-Gene Interaction Effect Models for both Dichotomous and Continuous Traits

We evaluate the power of different methods in two G×G interaction effect models (scenarios 4 and 5). The advantage of the G×G interaction association score $I_2$ over all the other methods is apparent for both dichotomous and continuous traits (Figures 3.4, 3.5 and 3.6). For dichotomous traits, when the sample size is large, the power of $I_2$ is substantially higher than that of all the other methods. For continuous traits, $I_2$ is uniformly the most powerful method for all sample sizes; for example, when the sample size is 2000, $I_2$ is 38% more powerful than SKATint. Moreover, the adaptive score $p^*$ has a power that is only slightly less than that of $I_2$, and $p^*$ is substantially more powerful than the rest. On the other hand, VT, WS and CMC suffer from a significant loss of power in the presence of complicated G×G interaction effects.

We also examine the scenario in which the phenotypes are influenced by both genetic marginal and G×G interaction effects (scenario 6). Here the marginal effect is set to be small so that it will not mask the interaction effect. $I_2$ is

still consistently the most powerful test and $p^*$ is the second best, followed by SKATint (Figure 3.6). For continuous traits with sample size 2000, $I_2$ is 29% more powerful than SKATint, and $p^*$ is 28% more powerful than SKATint.

Properties of $I_1$

Assume there are $n$ individuals with $n_d$ cases and $n_u$ controls, and there are $K$ SNPs under consideration. Let $X_{ji}$ denote the genotype of the $i$-th SNP for the $j$-th individual ($X_{ji}$ can take values 0, 1 and 2). Suppose the minor allele frequency for SNP $i$ is $q_i$; the probability mass function of $X_{ji}$ is then

$$x_{ji} = \begin{cases} 0 & \text{with probability } (1 - q_i)^2 \\ 1 & \text{with probability } 2 q_i (1 - q_i) \\ 2 & \text{with probability } q_i^2 \end{cases}$$

It is easy to show that $E(X_{ji}) = 2q_i$ and $E(X_{ji}^2) = 2q_i + 2q_i^2$; therefore, the variance of $X_{ji}$ is $V(X_{ji}) = 2q_i(1 - q_i)$.

Definition. The normalized $I_1$ score is defined as

$$I_1^N = \frac{\sum_{i=1}^{K} n_i^2 \left( p_i^d - \frac{n_d}{n_d + n_u} \right)^2}{s_y^2 \sum_{i=1}^{K} n_i},$$

where $n_i = \sum_{j=1}^{n} x_{ji}$ is the number of observed rare variants at the $i$-th location, $p_i^d = \sum_{j=1}^{n} x_{ji} I(y_j = 1) / \sum_{j=1}^{n} x_{ji} = \sum_{j=1}^{n} x_{ji} y_j / n_i$ is the proportion of rare variant $i$ that is from cases, and $s_y^2$ is the sample variance of the phenotype $Y$.

Theorem 1. Under the null hypothesis of no association, $I_1^N$ is asymptotically a weighted sum of $K$ independent $\chi^2_1$ variables, and the sum of the weights is

$$\frac{\sum_{i=1}^{K} q_i (1 - q_i)}{\sum_{j=1}^{K} q_j}.$$
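One consequence of Theorem 1 is that, under the null hypothesis, the mean of $I_1^N$ should be close to the sum of the weights. The Python sketch below checks this numerically on simulated data. It is an illustrative check only, not part of the thesis: the function names, the MAF values, the sample size and the number of replicates are all assumptions chosen to keep the run short.

```python
import random


def normalized_i1(genotypes, is_case):
    """Normalized marginal score I_1^N for a dichotomous trait."""
    n = len(is_case)
    y_bar = sum(is_case) / n
    s2_y = sum((y - y_bar) ** 2 for y in is_case) / (n - 1)  # sample variance
    numerator, total_alleles = 0.0, 0
    for i in range(len(genotypes[0])):
        n_i = sum(g[i] for g in genotypes)
        # n_i * (p_i^d - n_d / n), written without dividing by n_i
        centred = sum(g[i] * y for g, y in zip(genotypes, is_case)) - n_i * y_bar
        numerator += centred ** 2
        total_alleles += n_i
    return numerator / (s2_y * total_alleles)


def simulate_null_mean(mafs, n_subjects, n_reps, seed=0):
    """Average I_1^N over null replicates with half cases and half controls."""
    rng = random.Random(seed)
    is_case = [1] * (n_subjects // 2) + [0] * (n_subjects - n_subjects // 2)
    total = 0.0
    for _ in range(n_reps):
        # Each genotype is the sum of two independent Bernoulli(q) alleles.
        genotypes = [[(rng.random() < q) + (rng.random() < q) for q in mafs]
                     for _ in range(n_subjects)]
        total += normalized_i1(genotypes, is_case)
    return total / n_reps


mafs = [0.02, 0.03, 0.04, 0.05] * 2  # illustrative values only
weight_sum = sum(q * (1 - q) for q in mafs) / sum(mafs)
null_mean = simulate_null_mean(mafs, n_subjects=400, n_reps=300)
print(f"sum of weights: {weight_sum:.3f}, simulated null mean: {null_mean:.3f}")
```

The genotypes are generated independently of the case labels, so every replicate is drawn under the null hypothesis; the two printed numbers should agree up to Monte Carlo error.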

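Proof. Let

$$Z_i = \frac{n_i}{s_y \sqrt{\sum_{k=1}^{K} n_k}} \left( p_i^d - \frac{n_d}{n} \right), \quad i = 1, \ldots, K,$$

then $I_1^N = \sum_{i=1}^{K} Z_i^2$, and

$$\begin{aligned}
Z_i &= \frac{n_i p_i^d - n_i n_d / n}{s_y \sqrt{\sum_{k} n_k}} \\
&= \frac{\sum_{j=1}^{n} x_{ji} y_j - \frac{n_i}{n} \sum_{j=1}^{n} y_j}{s_y \sqrt{\sum_{k} n_k}} \\
&= \frac{\sum_{j=1}^{n} y_j (x_{ji} - n_i/n)}{s_y \sqrt{\sum_{k} n_k}} \\
&= \frac{\sum_{j=1}^{n} (y_j - EY)(x_{ji} - n_i/n)}{s_y \sqrt{\sum_{k} n_k}} \\
&= \frac{\sum_{j=1}^{n} (y_j - EY)\left[(x_{ji} - 2q_i) + (2q_i - n_i/n)\right]}{s_y \sqrt{\sum_{k} n_k}} \\
&= \frac{\sum_{j=1}^{n} (y_j - EY)(x_{ji} - 2q_i)}{s_y \sqrt{\sum_{k} n_k}} + \frac{\sum_{j=1}^{n} (y_j - EY)(2q_i - n_i/n)}{s_y \sqrt{\sum_{k} n_k}} \\
&= \frac{\sum_{j=1}^{n} (y_j - EY)(x_{ji} - 2q_i)}{s_y \sqrt{\sum_{k} n_k}} - \frac{(n_i/n - 2q_i) \sum_{j=1}^{n} (y_j - EY)}{s_y \sqrt{\sum_{k} n_k}}.
\end{aligned}$$

The fourth line uses $\sum_{j} (x_{ji} - n_i/n) = 0$. The second term in the last line is of order $O_p(1/\sqrt{n})$, so it goes to 0 as $n \to \infty$. Therefore

$$Z_i \approx \frac{\sum_{j=1}^{n} (y_j - EY)(x_{ji} - 2q_i)}{s_y \sqrt{\sum_{k} n_k}}.$$

Define $Z = (Z_1, Z_2, \ldots, Z_K)^T$.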

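Therefore, with $s_y$ replaced by its limit $\sigma_y$,

$$Z = \frac{1}{\sigma_y \sqrt{\sum_{i=1}^{K} n_i}} \begin{pmatrix} \sum_{j=1}^{n} (y_j - EY)(x_{j1} - 2q_1) \\ \sum_{j=1}^{n} (y_j - EY)(x_{j2} - 2q_2) \\ \vdots \\ \sum_{j=1}^{n} (y_j - EY)(x_{jK} - 2q_K) \end{pmatrix} = \sqrt{\frac{n}{\sum_{i=1}^{K} n_i}} \cdot \frac{1}{\sqrt{n}} \sum_{j=1}^{n} \begin{pmatrix} T_{j1} \\ T_{j2} \\ \vdots \\ T_{jK} \end{pmatrix},$$

where $T_{ji} = (y_j - EY)(x_{ji} - 2q_i)/\sigma_y$ for $i = 1, \ldots, K$. As $n \to \infty$,

$$\frac{n}{\sum_{i=1}^{K} n_i} = \frac{1}{\sum_{i=1}^{K} n_i / n} \to \frac{1}{\sum_{i=1}^{K} 2q_i}.$$

By the multivariate CLT, the remaining part converges to $N(\mu, \Sigma)$, where $\mu$ is the expectation of the vector $T = (T_{j1}, \ldots, T_{jK})^T$ and $\Sigma$ is its covariance matrix,

$$\Sigma = \begin{pmatrix} Var(T_{11}) & Cov(T_{11}, T_{12}) & \cdots & Cov(T_{11}, T_{1K}) \\ Cov(T_{12}, T_{11}) & Var(T_{12}) & \cdots & Cov(T_{12}, T_{1K}) \\ \vdots & \vdots & \ddots & \vdots \\ Cov(T_{1K}, T_{11}) & Cov(T_{1K}, T_{12}) & \cdots & Var(T_{1K}) \end{pmatrix}.$$

Under the null hypothesis of no association between $X$ and $Y$, $\mu = 0$ and the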

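variances are computed as follows.

$$\begin{aligned}
Var(T_{11}) &= Var\left( \frac{y_1 - EY}{\sigma_y} (x_{11} - 2q_1) \right) \\
&= E\left[ \left( \frac{y_1 - EY}{\sigma_y} \right)^2 (x_{11} - 2q_1)^2 \right] - \left[ E\left( \frac{y_1 - EY}{\sigma_y} (x_{11} - 2q_1) \right) \right]^2 \\
&= E\left( \frac{y_1 - EY}{\sigma_y} \right)^2 E(x_{11} - 2q_1)^2 - \left[ E\left( \frac{y_1 - EY}{\sigma_y} \right) E(x_{11} - 2q_1) \right]^2 && \text{since } X \perp Y \\
&= E\left( \frac{y_1 - EY}{\sigma_y} \right)^2 E(x_{11} - 2q_1)^2 && \text{since } E\left( \frac{y_1 - EY}{\sigma_y} \right) = 0 \\
&= Var\left( \frac{y_1 - EY}{\sigma_y} \right) Var(x_{11}) \\
&= 1 \cdot Var(x_{11}) = 2q_1(1 - q_1),
\end{aligned}$$

$$\begin{aligned}
Cov(T_{11}, T_{12}) &= Cov\left( \frac{y_1 - EY}{\sigma_y} (x_{11} - 2q_1), \; \frac{y_1 - EY}{\sigma_y} (x_{12} - 2q_2) \right) \\
&= E\left( \left( \frac{y_1 - EY}{\sigma_y} \right)^2 (x_{11} - 2q_1)(x_{12} - 2q_2) \right) - E\left( \frac{y_1 - EY}{\sigma_y} (x_{11} - 2q_1) \right) E\left( \frac{y_1 - EY}{\sigma_y} (x_{12} - 2q_2) \right) \\
&= E\left[ (x_{11} - 2q_1)(x_{12} - 2q_2) \right] && \text{since } X \perp Y \\
&= Cov(x_{11}, x_{12}).
\end{aligned}$$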

Hence, $\Sigma$ is the covariance matrix of $X$: $Cov(X)$. Therefore,

$$I_1^N = Z^T Z \to \sum_{i=1}^{K} \lambda_i \chi^2_1, \quad \text{where} \quad \sum_{i=1}^{K} \lambda_i = \frac{\sum_{i=1}^{K} V(x_{ji})}{\sum_{j=1}^{K} 2q_j} = \frac{\sum_{i=1}^{K} q_i (1 - q_i)}{\sum_{j=1}^{K} q_j}.$$

If the $X_i$'s are independent,

$$\lambda_i = \frac{V(x_{ji})}{\sum_{j=1}^{K} 2q_j} = \frac{q_i (1 - q_i)}{\sum_{j=1}^{K} q_j}.$$

Remark: As $q_i$ is between 0 and 1, $q_i(1 - q_i) < q_i$, hence

$$\frac{\sum_{i=1}^{K} q_i (1 - q_i)}{\sum_{j=1}^{K} q_j} < \frac{\sum_{i=1}^{K} q_i}{\sum_{j=1}^{K} q_j} = 1.$$

This means that the sum of the weights is less than 1, which prevents the explosion of the degrees of freedom of the test statistic $I_1^N$. No matter how many SNPs are involved, the expectation of $I_1^N$ will be less than 1. This overcomes the problem of high degrees of freedom in multiple regression. Since the frequency of rare variants is very low, $q_i$ is close to 0 and $1 - q_i$ is close to 1. Therefore, the sum of the weights will be close to 1 for rare variants analysis.

Proposition. For case-control studies, $I_1$ is equivalent to SKAT with linear kernel and equal weights.

Proof. The SKAT statistic is defined as

$$Q = (y - \hat{\mu})' X W X' (y - \hat{\mu}),$$

where $y = (y_1, \ldots, y_n)'$ is the phenotype, $\hat{\mu}$ is the predicted mean of $y$ under $H_0$, $X$ is the $n \times K$ genotype matrix and $W = \mathrm{diag}(w_1, \ldots, w_K)$ is the diagonal matrix of variant weights.

If all $w_j$'s are set to 1, then

$$\begin{aligned}
Q &= \sum_{i=1}^{K} \left( \sum_{j=1}^{n} (y_j - \bar{y}) x_{ji} \right)^2 \\
&= \sum_{i=1}^{K} \left( \sum_{j=1}^{n} y_j x_{ji} - \sum_{j=1}^{n} \bar{y} x_{ji} \right)^2 \\
&= \sum_{i=1}^{K} \left[ \left( \sum_{j=1}^{n} x_{ji} \right) \left( \frac{\sum_{j=1}^{n} y_j x_{ji}}{\sum_{j=1}^{n} x_{ji}} - \frac{\sum_{j=1}^{n} \bar{y} x_{ji}}{\sum_{j=1}^{n} x_{ji}} \right) \right]^2 \\
&= \sum_{i=1}^{K} n_i^2 \left( \frac{\sum_{j=1}^{n} x_{ji} 1(y_j = 1)}{\sum_{j=1}^{n} x_{ji}} - \bar{y} \right)^2 && \text{for dichotomous traits} \\
&= \sum_{i=1}^{K} n_i^2 \left( \hat{p}_i^d - \frac{n_d}{n} \right)^2 \\
&= I_1.
\end{aligned}$$

Extension of SPA

Rare Variants Gene-Environment Interaction Association Score $I_2^*$

Increasing evidence has shown that gene-environment (G×E) interactions are widely involved in the etiology of complex diseases, including diabetes, cancer and psychiatric disorders (Hamza et al. 2011; Andreasen et al. 2008; Thomas 2010a,b). Conventional methods to detect G×E interactions are mostly based on regression models, which lose power for rare variants. SPA can be easily extended to incorporate covariates, such as environmental factors, in the testing procedure, considering both the environmental marginal

effect and the G×E interaction information. Here we focus on the case-control study design. Suppose an environmental factor $E$ has $J$ levels. The SPA test score for detecting the effect of the environmental factor is expressed as

$$I_2^* = \sum_{j=1}^{J} \sum_{i=1}^{K} n_{i,j}^2 \left( \hat{p}^D_{i,j} - \frac{n_d}{n_d + n_u} \right)^2,$$

where $\hat{p}^D_{i,j}$ is the fraction of rare variants at position $i$ on level $j$ that are from cases, and $n_{i,j}$ is the total number of the $i$-th rare variant observed at level $j$. $I_2^*$ is a modification of $I_1$ that builds additional overlapping rare variant partition cells on top of the $J$ non-overlapping partitions created by the environmental factor. The significance of $I_2^*$ is evaluated by permutation. We propose two permutation strategies: (1) global permutation, which permutes the phenotype among all individuals; and (2) local permutation, which permutes the phenotype within each stratum of the environmental factor. Both permutation strategies are investigated in our study.

Type I Error and Power of $I_2^*$

For the G×E interaction score $I_2^*$, we investigated its type I error and power for dichotomous traits using the two permutation strategies, global permutation and local permutation, denoted by $I_2^*$ Global and $I_2^*$ Local respectively. As $I_2^*$ considers both the genetic and environmental marginal effects as well as the G×E interaction effect, $I_2^*$ Global is appropriate for testing the null hypothesis of no association at all (no G marginal, E marginal or G×E interaction effects), and $I_2^*$ Local is appropriate for testing the null hypothesis of no genetic effect (neither G marginal nor G×E interaction) with the E marginal effect removed.

The type I error of $I_2^*$ is evaluated under two null hypotheses. The first null hypothesis (null-1) assumes the dichotomous traits are completely determined

by the baseline penetrance. The second null hypothesis (null-2) assumes that the phenotypes are affected by an environmental marginal (E marginal) effect. Table 3.3 presents the type I error of $I_2^*$ Global and $I_2^*$ Local in these two null settings. In null-1, both $I_2^*$ Global and $I_2^*$ Local are correctly controlled at levels α = 0.05 and 0.01. In null-2, $I_2^*$ Local still hits the target level while $I_2^*$ Global has significantly higher values. This is because $I_2^*$ Global is able to test any effect from genetic or environmental factors, including the E marginal effect; hence the results of $I_2^*$ Global in null-2 are in fact the power of $I_2^*$ Global in the presence of an E marginal effect. On the other hand, $I_2^*$ Local removes the E marginal effect, so it shows the correct type I error in both null-1 and null-2.

Table 3.3: Type I error estimates of $I_2^*$ (Global and Local) in two different null settings, null-1 (no G, E or G×E effects) and null-2 (marginal environmental effect only), at α = 0.05 and α = 0.01, by sample size.

Two scenarios are considered to compare the power of $I_2^*$ Global, $I_2^*$ Local and competing methods when (1) the phenotypes are affected by an E marginal effect and a positive G×E effect, and (2) the phenotypes are affected by an E marginal effect and both positive and negative G×E effects. In computation, SKAT

and SKATint regress the phenotype on the environmental factor when calculating the test statistic (Wu et al. 2011). $I_2^*$ Global and $I_2^*$ Local use the environmental factor as in their definition. All the other methods work on the phenotype and the genotype directly. The results show that $I_2^*$ Global has much higher power than all the other tests because it takes into account both E marginal and G×E interaction effects, and $I_2^*$ Local outperforms all the remaining methods, which do not consider G×E interaction effects (Figure 3.7).

Application to GAW17 dataset

The Genetic Analysis Workshop 17 (GAW17) provided genotypes of 3,205 autosomal genes on 697 individuals from the 1000 Genomes Project. A dichotomous phenotype was simulated from a linear model using SNPs from 34 genes, and most causal SNPs were rare variants. A total of 200 simulation replicates were carried out and the genotype was held fixed for all replicates. See Almasy et al. (2011) for more details of the simulation model.

Here we chose to re-analyze two causal genes, FLT1 and ARNT. In the workshop, FLT1 was shown to exhibit a strong signal in many well-known methods, while ARNT could not be identified by any existing approach. For both genes, an upper frequency of 0.05 was used as the MAF cutoff to define rare variants, and only nonsynonymous SNPs were examined. We computed the power of our test scores and competing methods using all 200 replicates. Power was calculated as the proportion of the 200 replicates with a p-value less than 0.05. As shown in Table 3.4, $I_1$ was fairly robust for detecting both genes. For FLT1, two count-based collapsing methods, CMC and WS, are the most powerful, followed by VT and $I_1$. For ARNT, $I_1$ is substantially more powerful than the other methods; its power is 47% higher than that of the second best method, SKAT. Given that the simulated model is a simple additive linear model with genetic marginal effects only, methods considering G×G interactions, including $I_2$ and SKATint, do not have apparent advantages in power for detecting either FLT1 or ARNT.

Table 3.4: Power of different methods for genes FLT1 and ARNT in the GAW17 dataset. FLT1 has 19 nonsynonymous rare SNPs (10 causal) and ARNT has 9 (5 causal). Methods compared: $I_1$, $I_2$, $p^*$, SKAT, SKATint, RB, VT, WS and CMC.

3.5 Comments about SPA

We propose here the summation of partition approach (SPA), a flexible, robust, model-free framework for rare variants association analysis that incorporates both G×G and G×E interactions. The proposed SPA test scores create partitions from individual SNPs and combine the information across all rare variants in a region of interest. $I_1$ is designed to detect marginal effects of rare variants and $I_2$ is designed to capture the G×G interaction effects among rare variants. In various marginal effect models, $I_1$ is more powerful than most approaches examined in our study. Its performance is comparable to SKAT,

which is regarded as the most competitive existing method. In G×G interaction models, or in the scenario with both marginal and G×G interaction effects, $I_2$ is superior to all the other methods in terms of detection power.

The adaptive score $p^*$ is a compromise between $I_1$ and $I_2$ and has the advantages of both test scores. Its performance is only slightly below that of the better of the two scores $I_1$ and $I_2$, for both marginal effect models and interaction effect models. Therefore, $p^*$ is a self-tuning adaptive score that is able to gain power automatically regardless of the simulation scenario. In practice, when we have no clue how the genotype affects the phenotype, we suggest using the adaptive score $p^*$. A significant p-value of $p^*$ indicates a potential true signal from either marginal or interaction effects of rare variants. In our study, we focus on the situation with 20 rare SNPs. If the number of SNPs changes to 30, the simulation results (Figure 3.8) are qualitatively similar in that $I_1$ is the most powerful in marginal effect models and $I_2$ is the most powerful in interaction effect models.

$I_2^*$ is an augmented version of $I_1$ that incorporates covariates. It can be used to test the hypothesis of no association at all (neither G marginal, E marginal nor G×E effect) using global permutation, or to test the hypothesis of no genetic effect with the E marginal effect removed using local permutation. With local permutation, $I_2^*$ removes the marginal effect of the environmental factor while still capturing variations of the genetic effect at different levels of the environmental factor. In a similar fashion, covariates can be incorporated into $I_2$, and the resulting augmented score could be used to detect E×G×G 3-way interactions between an environmental factor and two rare variants.

$I_2^*$ can also be used to test the interaction effect between common and rare

variants if one treats the common variant as an environmental factor. It can be further extended to detect 3-way interactions among the environmental factor, common variants and rare variants by building additional overlapping partitions based on rare variants on top of the non-overlapping partition cells generated by the environmental factor and the common variant. A global permutation can detect both main and interaction effects of these factors, and a local permutation that permutes the phenotype within each non-overlapping partition cell will capture the E × common × rare 3-way interaction effect.

$I_2^*$ deals with categorical covariates naturally. In order to handle continuous covariates, such as age, height and BMI, we suggest a discretization approach that divides continuous variables into distinct buckets. The pseudo-categorical variables generated by discretization can be used in $I_2^*$ directly. In practice, we usually set the number of buckets to between 2 and 5, and the results are quite satisfactory. Moreover, a new influence measure dealing with continuous covariates directly is under preparation.

The insight of SPA is similar to that of the partition retention (PR) influence measure (Chernoff et al. 2009). The PR method generates non-overlapping partition elements over the sample space and assigns each partition cell a weight that is proportional to the probability of falling into that cell. Its success in detecting influential variables relies on the weights not being too small for all partition elements, especially for those cells that generate signals. Therefore, the PR method may lose power for rare variants association studies, as the partition cells with true signals will have very low weights due to the extremely low frequencies of rare variants. SPA differs from the PR method by creating overlapping partition elements to avoid the sparseness and to boost

the signal from rare variants.

The information measure $I_1$ can be viewed as a special case of

$$I_1 = \sum_{i=1}^{K} w_i n_i \left( \hat{p}_i^D - \frac{N_A}{N_A + N_U} \right)^2,$$

where $\{w_i\}$ are weights for each rare variant that sum to 1. Weights can be defined in various ways. The inherent choice we take here is $w_i = n_i / \sum_{i=1}^{K} n_i$. If external information is available on the possible effect of a rare variant on disease, it is straightforward to incorporate such information in our test approach by tuning the weights. Some commonly used weights are based on (1) the MAF of the variant, as in Madsen and Browning (2009); or (2) externally defined weights such as predictions from SIFT and PolyPhen, as suggested by Price et al. (2010). In our study, even though we do not incorporate this weight information, SPA is still superior to the other methods. We believe that after tuning the weights, SPA will exhibit better performance.

Population stratification has been shown to be an important problem for common variant association analysis. For rare variants, this problem is more likely to occur due to their low frequency and possibly uneven distribution among populations. It is straightforward to control for population stratification in our approach, as we can consider population as an environmental factor and use it in $I_2^*$. An alternative is to summarize population structure with PCA and include the discretized principal components in our analysis.

A major advantage of SPA is that it is highly extensible. The building blocks of SPA are the partitions formed by individual rare variants, and it is easy to incorporate complex interactions. As demonstrated above, we are able to

take into account interactions with environmental factors. Similar approaches can be applied when considering interactions with common variants or other covariates. The framework can also be generalized to other research areas. We believe that the proposed framework of SPA will offer substantial opportunities for detecting potentially complicated interactions. When interaction effects do exist, our approach is capable of identifying them and thus adding to the detection power.
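To make the basic building block of this chapter concrete, the marginal score $I_1$ for a dichotomous trait and its permutation p-value can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the results above: the function names and data layout are hypothetical, and the add-one correction in the p-value is a common convention that the text does not specify.

```python
import random


def spa_i1(genotypes, is_case):
    """Marginal SPA score I_1 for a dichotomous trait (illustrative sketch)."""
    p0 = sum(is_case) / len(is_case)  # N_A / (N_A + N_U)
    score = 0.0
    for i in range(len(genotypes[0])):
        n_i = sum(g[i] for g in genotypes)  # rare alleles observed at SNP i
        from_cases = sum(g[i] for g, y in zip(genotypes, is_case) if y == 1)
        # n_i^2 * (p_i^D - p0)^2 equals (from_cases - n_i * p0)^2,
        # which stays well defined when n_i == 0.
        score += (from_cases - n_i * p0) ** 2
    return score


def permutation_p_value(genotypes, is_case, n_perm=1000, seed=0):
    """Permutation p-value of I_1 obtained by shuffling the case labels."""
    rng = random.Random(seed)
    observed = spa_i1(genotypes, is_case)
    labels = list(is_case)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if spa_i1(genotypes, labels) >= observed:
            hits += 1
    # Assumption: add-one correction so the estimate is never exactly 0.
    return (hits + 1) / (n_perm + 1)


# Toy data: SNP 0 has 3 rare alleles, all in cases; SNP 1 has 1, in a control.
G = [[1, 0], [2, 0], [0, 1], [0, 0]]
y = [1, 1, 0, 0]
print(spa_i1(G, y))  # (3 - 1.5)^2 + (0 - 0.5)^2 = 2.5
```

The adaptive score $p^*$ would need a further layer of permutation on top of this, since its own significance is assessed by permuting the minimum of the two p-values; that step is left out of the sketch.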

Figure 3.1: Power comparison in the marginal effect model when the effect sizes are constant. Powers were calculated for nominal α levels 0.05 (left) and 0.01 (right) and for dichotomous traits (upper) and continuous traits (lower). Powers were evaluated for $I_1$, $I_2$, $p^*$, SKAT, SKATint, VT, RB, WS and CMC. Scenarios with different sample sizes were considered. P-values were estimated using 10,000 permutations and power was evaluated using 1,000 replicates.

Figure 3.2: Power comparison in the marginal effect model when the effect sizes of causal variants are negatively correlated with MAFs. Powers were calculated for nominal α levels 0.05 (left) and 0.01 (right) and for dichotomous traits (upper) and continuous traits (lower). Powers were evaluated for $I_1$, $I_2$, $p^*$, SKAT, SKATint, VT, RB, WS and CMC. Scenarios with different sample sizes were considered. P-values were estimated using 10,000 permutations and power was evaluated using 1,000 replicates.

Figure 3.3: Power comparison in the marginal effect model with a mixture of protective and risk rare variants. Powers were calculated for nominal α levels 0.05 (left) and 0.01 (right) and for dichotomous traits (upper) and continuous traits (lower). Powers were evaluated for $I_1$, $I_2$, $p^*$, SKAT, SKATint, VT, RB, WS and CMC. Scenarios with different sample sizes were considered. P-values were estimated using 10,000 permutations and power was evaluated using 1,000 replicates.

Figure 3.4: Power comparison in the G×G interaction effect model when 50% of rare variants participate in the interaction effect. Powers were calculated for nominal α levels 0.05 (left) and 0.01 (right) and for dichotomous traits (upper) and continuous traits (lower). Powers were evaluated for $I_1$, $I_2$, $p^*$, SKAT, SKATint, VT, RB, WS and CMC. Scenarios with different sample sizes were considered. P-values were estimated using 10,000 permutations and power was evaluated using 1,000 replicates.

Figure 3.5: Power comparison in the G×G interaction effect model when 75% of rare variants participate in the interaction effect. Powers were calculated for nominal α levels 0.05 (left) and 0.01 (right) and for dichotomous traits (upper) and continuous traits (lower). Powers were evaluated for $I_1$, $I_2$, $p^*$, SKAT, SKATint, VT, RB, WS and CMC. Scenarios with different sample sizes were considered. P-values were estimated using 10,000 permutations and power was evaluated using 1,000 replicates.

Figure 3.6: Power comparison in the scenario with both main effect and G×G interaction effect. Powers were calculated for nominal α levels 0.05 (left) and 0.01 (right) and for dichotomous traits (upper) and continuous traits (lower). Powers were evaluated for $I_1$, $I_2$, $p^*$, SKAT, SKATint, VT, RB, WS and CMC. Scenarios with different sample sizes were considered. P-values were estimated using 10,000 permutations and power was evaluated using 1,000 replicates.

Figure 3.7: Power comparison in scenarios 1 to 6 for dichotomous traits with 500 cases and 500 controls when there are 30 SNPs under consideration. Powers were calculated for nominal α levels 0.05 (left) and 0.01 (right) when only positive G×E effects exist (upper) and when both positive and negative G×E effects exist (lower). Powers were evaluated for $I_2^*$ (with both global and local permutations), $I_1$, $I_2$, $p^*$, SKAT, SKATint, VT, RB, WS and CMC. Scenarios with different sample sizes were considered. P-values were estimated using 10,000 permutations and power was evaluated using 1,000 replicates.

Figure 3.8: Power comparison in two G×E interaction models for dichotomous trait. Powers were calculated for nominal α levels 0.05 (left) and 0.01 (right). Powers were evaluated for $I_1$, $I_2$, $p^*$, SKAT, SKATint, VT, RB, WS and CMC. P-values were estimated using 10,000 permutations and power was evaluated using 1,000 replicates.
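The G×E score $I_2^*$ and the two permutation strategies described earlier in this chapter differ only in how the case labels are shuffled. The sketch below illustrates that difference; it is an assumption-laden illustration rather than the thesis implementation, and all names (`spa_i2_star`, `permute_labels`, `i2_star_p_value`) and the add-one p-value correction are hypothetical choices.

```python
import random
from collections import defaultdict


def spa_i2_star(genotypes, is_case, env):
    """G x E score: the I_1-type sum computed within each environment level."""
    p0 = sum(is_case) / len(is_case)  # n_d / (n_d + n_u)
    alleles = defaultdict(int)        # (level, SNP) -> n_{i,j}
    from_cases = defaultdict(int)     # (level, SNP) -> rare alleles from cases
    for g, y, e in zip(genotypes, is_case, env):
        for i, x in enumerate(g):
            alleles[(e, i)] += x
            from_cases[(e, i)] += x * y
    # n_{i,j}^2 * (p_{i,j} - p0)^2 written as (from_cases - n_{i,j} * p0)^2
    return sum((from_cases[k] - alleles[k] * p0) ** 2 for k in alleles)


def permute_labels(is_case, env, rng, local):
    """Global permutation shuffles all labels; local shuffles within each level."""
    labels = list(is_case)
    if not local:
        rng.shuffle(labels)
        return labels
    strata = defaultdict(list)
    for idx, e in enumerate(env):
        strata[e].append(idx)
    for members in strata.values():
        shuffled = [labels[idx] for idx in members]
        rng.shuffle(shuffled)
        for idx, y in zip(members, shuffled):
            labels[idx] = y
    return labels


def i2_star_p_value(genotypes, is_case, env, local, n_perm=1000, seed=0):
    rng = random.Random(seed)
    observed = spa_i2_star(genotypes, is_case, env)
    hits = sum(
        spa_i2_star(genotypes, permute_labels(is_case, env, rng, local), env)
        >= observed
        for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)  # assumption: add-one correction


G = [[1], [1], [0], [1]]
E = [0, 0, 1, 1]
y = [1, 0, 1, 0]
print(spa_i2_star(G, y, E))  # level 0: (1 - 1.0)^2, level 1: (0 - 0.5)^2 -> 0.25
```

Because local permutation keeps the number of cases in each environment level fixed, any marginal effect of the environmental factor is preserved in every permuted dataset and therefore does not contribute to significance.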

Chapter 4 A Significance-based Dropping Algorithm (SDA)

The existing methods for rare variants association analysis are able to detect a region that contains causal rare variants; however, they cannot select which rare variants in this region are truly responsible for complex traits. Traditional methods to study the significance of a specific variant are single marker tests, such as Fisher's exact test for case-control studies and simple linear regression for quantitative traits. Due to the extremely low frequencies of rare variants, these methods will suffer from multiple comparisons even when the number of SNPs is reduced to a moderate number by collapsing approaches. In addition, most standard variable selection approaches are penalized regressions, such as LASSO (Tibshirani 1996). However, for rare variants with low frequencies, these methods could suffer from sparseness of the design matrix. Sun and Wang (2014) designed a set-based statistical selection procedure to locate susceptible rare variants within a region. But this method is computationally intensive because it assesses the association between

the phenotype and all combinations of rare variants in a region, and hence the computation time increases exponentially with the number of rare variants; it is therefore computationally infeasible for a set with hundreds of rare variants. In this chapter, we propose a significance-based dropping algorithm (SDA) that conducts variable selection in a stepwise elimination fashion. We show that SDA has a lower false discovery rate than LASSO both in our simulated data and in the data provided by GAW17. Theoretical results suggest that SDA is more powerful for variants with larger effect sizes.

4.1 The SDA algorithm

SDA is a greedy algorithm that searches for the variable that makes the largest change in significance (p-value) through stepwise eliminations. For a region with $K$ rare variants $X_1, \ldots, X_K$, the algorithm is described below, and Figure 4.1 shows the flowchart.

1. With the starting set $S = \{X_1, \ldots, X_K\}$ and retention set $R = \emptyset$, compute the influence score for this region, $I^{(K)} = \sum_{i=1}^{K} I_i$, and evaluate its significance $p^{(K)}$.

2. For each $j \in \{1, \ldots, K\}$, remove variant $X_j$ and compute the SPA marginal score for the remaining $K-1$ variables, $I^{(K\setminus j)} = \sum_{i \in \{1,\ldots,K\}\setminus j} I_i$, and its significance $p^{(K\setminus j)}$.

3. Remove the variable $X_j$ that gives the largest $p^{(K\setminus j)}$ from $S$ and add it to the retention set $R$. The deletion of this variant leads to the largest change in $p^{(K)}$, and hence it is the most influential. If more than one variable attains the same largest $p^{(K\setminus j)}$, randomly choose one among them and remove it. After one round of elimination, the resulting sets are $S = S \setminus X_j = \{X_1, \ldots, X_{j-1}, X_{j+1}, \ldots, X_K\}$ and $R = R \cup X_j = \{X_j\}$.

4. The influence ratio (IR) of the removed variable $j$ is calculated as $IR_j = \dfrac{p^{(K\setminus j)} - p^{(K)}}{p^{(K)}}$.

5. Repeat steps 2-4 until there is only one variable left or the p-value starts decreasing.

Figure 4.1: Flowchart of SDA algorithm

From steps 1-5 above, we obtain a series of removed variables and their associated influence ratios $IR_k$ ($k = 1, \ldots, K$). The same procedure is repeated on 10,000 permuted datasets, which generates 10,000 sets of removed variables. Based on these sets, we construct a 95% or 90% confidence

curve for each dropping round. Variables with an influence ratio above the threshold represent potentially influential rare variants.

Unlike traditional stepwise dropping algorithms, which remove non-significant variables and retain significant ones, SDA removes significant variables until none of the remaining variables provides a signal. We can expect the p-value to increase to 1 very quickly when only a small proportion of the variables are truly responsible for the phenotype. In fact, the dropping rounds can stop once the p-value is very large (close to 1), which reduces the computation time significantly.

4.2 Simulation Results

A Toy Example

We demonstrate SDA and related issues by considering a small artificial example. Our design problem consists of 20 independent rare variants with uniformly distributed minor allele frequencies (MAFs). The observed phenotype $Y$ is a binary variable. $Y$ is associated with the first 5 rare variants ($X_1, \ldots, X_5$) through a logistic relationship. The effect sizes of the 5 causal SNPs are negatively correlated with their MAFs. To measure the significance of this region with 20 rare SNPs, we compute the influence score $I$ and its p-value $p^{(20)}$, which is less than our threshold of 0.05, so this region is significantly associated with the phenotype. Our next goal is to locate rare SNPs in this region that are truly influential

for the phenotype. We start with the initial variable set $S = \{X_1, \ldots, X_{20}\}$, then take turns eliminating one of these 20 variables at a time and obtain the significance of the remaining 19 variables, $p^{(20\setminus k)}$, $k \in \{1, \ldots, 20\}$. We find that the maximum p-value is attained after removing variable $X_5$ (Figure 4.2). Therefore we remove (and collect) variable $X_5$ and calculate its influence ratio (IR):

$$IR_5 = \frac{p^{(20\setminus 5)} - p^{(20)}}{p^{(20)}}.$$

The sets are now $R = \{X_5\}$ and $S = \{X_1, \ldots, X_4, X_6, \ldots, X_{20}\}$. Repeating this process leads to discarding (and collecting) variable $X_2$ with $IR_2 = 1.618$, so $R = \{X_5, X_2\}$. We continue repeating the same process until there is only one variable left or the p-value is very close to 1. The abbreviated history of the first 10 rounds of this process is presented in Table 4.1, which lists the variables discarded at each stage, the p-values after each dropping round and their associated influence ratios.

For our toy example, the first five variables collected in the retention set are $X_5$, $X_2$, $X_3$, $X_1$ and $X_4$; they are indeed the five true causal variables in our simulation model. When all five variables are present, this region is significant; after removing variable $X_5$, the p-value for this region is greater than 0.05, so this region is no longer significant. When all five causal variables are removed, the p-value is very close to 1 because only non-causal variables are left.

The significance of a variable is determined by the relative change in p-value after it is removed, represented by its IR score. A large IR value indicates a possible signal, and low IR values are signs of noise. We see that the first 5 removed variables are truly causal (Table 4.1). In order to set a threshold to differentiate signal from noise, 10,000 permutations that randomly change the labels

of cases and controls are conducted. For each permutation, the same SDA procedure is performed, generating a sequence of p-values and influence ratios (IR), as shown in Figure 4.3. In Figure 4.3, the red line represents the p-value trace or influence ratio trace for the toy example and the green lines are the p-value traces or influence ratio traces for the permutations. We propose two possible thresholds: the 95% quantile (Q95) or the 90% quantile (Q90) of the permuted influence ratio sequence (the blue curve and magenta curve in Figure 4.3). Variables with an influence ratio greater than the threshold at the same dropping round are considered significant and are retained as signals. Table 4.1 also lists the first 10 rounds of Q95 and Q90 in our toy example. The first five retained variables $X_5$, $X_2$, $X_3$, $X_1$ and $X_4$ have IR scores greater than both Q95 and Q90, and hence they are identified as influential variables using either threshold. Using Q90 or Q95, none of the remaining variables is significant, so the FDR for both thresholds is 0%.

Table 4.1: The first 10 rounds of SDA for the toy example. Initial set: $\{X_1, \ldots, X_{20}\}$. Columns: Round; Variables removed; p-value after removing; Influence Ratio; Q95; Q90.
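The dropping rounds of Section 4.1 and the round-wise quantile thresholds described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: `region_pvalue` is a hypothetical user-supplied function that returns the permutation p-value of the SPA score for a set of variants (assumed strictly positive), the influence ratio of each round is taken relative to the p-value of the set before that round, and a variable is flagged whenever its IR exceeds the permutation quantile of the same round.

```python
import random
from typing import Callable, Hashable, List, Optional, Sequence, Tuple

# One record per dropping round: (removed variable, p-value after removal, IR).
Round = Tuple[Hashable, float, float]


def sda_rounds(variables: Sequence[Hashable],
               region_pvalue: Callable[[List[Hashable]], float],
               rng: Optional[random.Random] = None) -> List[Round]:
    """Run the dropping rounds (steps 1-5) for one dataset."""
    rng = rng or random.Random()
    remaining = list(variables)
    p_current = region_pvalue(remaining)      # step 1 (assumed > 0)
    history: List[Round] = []
    while len(remaining) > 1:
        # Step 2: p-value of the region with each variable left out in turn.
        p_without = {j: region_pvalue([v for v in remaining if v != j])
                     for j in remaining}
        p_max = max(p_without.values())
        if p_max < p_current:                 # step 5: p-value starts decreasing
            break
        # Step 3: drop the variable giving the largest p-value; ties at random.
        ties = [j for j in remaining if p_without[j] == p_max]
        dropped = rng.choice(ties)
        # Step 4: influence ratio relative to the p-value before this round.
        history.append((dropped, p_max, (p_max - p_current) / p_current))
        remaining.remove(dropped)
        p_current = p_max
    return history


def quantile(values: Sequence[float], q: float) -> float:
    """Empirical quantile with linear interpolation between order statistics."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])


def select_signals(observed: List[Round],
                   permuted: List[List[Round]],
                   q: float = 0.95) -> List[Hashable]:
    """Flag variables whose IR exceeds the round-wise permutation quantile."""
    selected = []
    for r, (var, _, ir) in enumerate(observed):
        null_ir = [run[r][2] for run in permuted if r < len(run)]
        if null_ir and ir > quantile(null_ir, q):
            selected.append(var)
    return selected
```

In use, `permuted` would hold the output of `sda_rounds` on each label-permuted dataset, and `q=0.95` or `q=0.90` corresponds to the Q95 and Q90 thresholds.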

We compare our results with Fisher's exact test, under which only variable $X_5$ is statistically significant while the other four causal variables are not. We also compute the results from LASSO using the R package glmnet. The penalty coefficient λ of LASSO is determined by cross-validation. Two λ values are provided by the cv.glmnet function in R: one is $\lambda_{min}$, which gives the minimum cross-validated prediction error; the other is $\lambda_{1se}$, the largest λ whose cross-validated error is within one standard error of the minimum. We evaluate the performance of LASSO with either $\lambda_{min}$ or $\lambda_{1se}$. The results show that LASSO($\lambda_{min}$) can identify 4 causal SNPs and LASSO($\lambda_{1se}$) can identify only 1 causal SNP. Therefore, in this example SDA outperforms Fisher's exact test and LASSO, because SDA is able to identify all five causal variables while the other methods cannot.

Demonstration of SDA with 500 SNPs

We present two more examples. Both have 500 rare variants, with 20 (EXAMPLE 1) and 10 (EXAMPLE 2) causal SNPs respectively. The sample size is 500 cases and 500 controls. SDA is applied to both datasets with both the Q95 and Q90 thresholds. As we know the true model for our simulations, the properties of the method can be evaluated. The first rounds of SDA for EXAMPLE 1 and EXAMPLE 2 are listed in Tables 4.2 and 4.3. For EXAMPLE 1 (Table 4.2), if we use the threshold Q95, 11 SNPs are identified as significant, with 9 true causal variables and 2 false positives. If the Q90 criterion is used, 15 SNPs are significant, with 10 true signals and 5 false positives. Therefore, for SDA with Q95 (Q90), the false discovery rate is 18% (33%). For EXAMPLE 2, the first removed variable is $X_3$ with influence ratio (IR) 12.6, which exceeds the 95% quantile in 10,000 permutations. The second and third removed variables are $X_9$ and $X_8$, which

are true causal SNPs in the simulation model. The fourth removed variable is an imposter, $X_{267}$. The fifth variable is $X_4$, which is the last variable with IR greater than the 95% quantile threshold. Therefore, using the Q95 threshold, 5 variables are retained as significant SNPs and 4 of them are truly causal, hence the false discovery rate is 20%. If the threshold is Q90, two more variables, $X_2$ and $X_{437}$, are identified, with one true signal and one imposter; thus the false discovery rate is 2/7 = 28.5%.

Table 4.2: The influence score and its significance in the first 20 rounds of SDA for EXAMPLE 1. Columns: Round; p-value after removing; Variables removed; Influence Ratio; Significance.

Our goal is to find influential rare variants. Whatever method we use, there is the possibility that among the variables selected as influential there will be imposters, which are not causally related to the dependent variable but are confused with influential variables. When our analysis locates some candidates for being influential, it helps to have a method for deciding how likely our results are to be true. Here we propose to permute the values of the dependent variable $Y$ and find the number of false positives under the null hypothesis. Given a group of

noninfluential variables with a phenotype variable $Y$, as in the permuted sample, we can apply SDA and count how many variables are selected at random to estimate the number of false discoveries in our analysis. In this example, we run SDA on 1,000 permuted samples and the average number of selected variables is 4.3 for Q95 and 2.7 for Q90. The estimated number of false discoveries closely matches the true number of false discoveries in this example. In practice, when no true model is available, this estimate provides a guideline for the correctness and validity of the result.

Table 4.3: The influence score and its significance in the first 10 rounds of SDA for EXAMPLE 2. Columns: Round; Variables removed; Influence Ratio; Significance.

As in the toy example, we have the opportunity to compare the results of SDA with those of LASSO and Fisher's exact test. The results of Fisher's exact test come from single-marker tests and are subject to the multiple-comparison issue; therefore a Bonferroni-corrected p-value (0.05/500) is applied as the cutoff threshold. When implementing LASSO, 5-fold cross-validation is used to determine the penalty coefficient λ. We evaluate the variable selection performance of LASSO using both $\lambda_{min}$ and $\lambda_{1se}$. For EXAMPLE 1, if we use Fisher's exact test, none of the 500 SNPs has a p-value less than the Bonferroni-corrected threshold (= 0.05/500). We

also implemented LASSO on this simulated dataset, which leads to 28 findings with 11 true signals. The false discovery rate for LASSO is therefore 60.7% (17/28), which is much higher than the FDR for SDA.

Table 4.4: FDR of different methods for Examples 1 and 2

Method                  Ex1 (20)       Ex2 (10)
Fisher Exact Test       N (0) a        N (0) b
LASSO (λ_min)           60.7% (11)     69.5% (5)
LASSO (λ_1se)           N (0)          50% (5)
SDA Q95                 18% (9)        20% (4)
SDA Q90                 33% (10)       28.5% (5)

a The number in brackets is the true positive frequency.
b N (0) means none of the 500 rare SNPs can be identified.

Table 4.4 lists the FDR of SDA(Q95), SDA(Q90), Fisher's exact test with the Bonferroni-corrected threshold, LASSO($\lambda_{min}$) and LASSO($\lambda_{1se}$) for these two examples. The number in brackets is the number of true positive discoveries in each case. An FDR entry of N (0) represents the situation in which none of the 500 SNPs is identified as significant. In both examples, none of the causal SNPs can be detected by Fisher's exact test. LASSO with $\lambda_{1se}$ gives a lower FDR than LASSO with $\lambda_{min}$ but identifies fewer causal SNPs in EXAMPLE 1 (none) and the same number in EXAMPLE 2. The number of causal SNPs identified by SDA is comparable with that from LASSO($\lambda_{min}$); furthermore, the FDR of SDA is much lower than that of LASSO($\lambda_{1se}$) and LASSO($\lambda_{min}$).

Power of SDA

We consider six scenarios to evaluate the power of SDA (Table 4.5). For each scenario, we replicate the simulation 1,000 times and select among the replicates 100 datasets with p-values less than 0.05 and greater than a lower cutoff. (It is

difficult to evaluate SDA if the p-value is below this lower cutoff, because in this chapter we use 10,000 permutations to assess significance. One way to solve this problem is to increase the number of permutations.)

Table 4.5: Simulation scenarios. Columns: Scenario; No. Cases + No. Controls; Total No. SNPs; No. Causal SNPs.

For each of the 100 replicates, we use SDA and LASSO to select potential causal variables and assess their capacity to identify the true known SNPs. Figure 4.4 shows the number of times each variable is selected by SDA(Q95) among 100 replications for the 6 scenarios. Red vertical lines represent the return frequencies of true causal variables and green lines represent the return frequencies of non-causal variables. True causal signals (red) are selected with considerably higher frequencies than non-causal variables (green), which suggests that SDA has the power to distinguish true signals from noise, regardless of the number of true signals among the variables. In practice, when replicates are available, a variable identified more frequently by SDA is more likely to be a true causal variable than a variable identified less frequently. In addition, as the sample size increases, the return frequencies become higher, leading to an increased power of SDA (scenario 1 vs. scenario 3 and scenario 2 vs. scenario 4). With the same number of causal SNPs, the fewer variables there are in the region of interest, the more frequently the true causal variables will be selected (scenario 1 vs. scenario 5 and scenario 2 vs. scenario 6).
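The return frequencies discussed above are simply the fraction of replicates in which each variable is selected. A minimal sketch, assuming the selections of each replicate are available as a list of variable names (the function name and input layout are illustrative, not from the thesis):

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def return_frequencies(selections: Iterable[Iterable[Hashable]]) -> Dict[Hashable, float]:
    """Fraction of replicates in which each variable was selected."""
    runs = [set(s) for s in selections]        # one set of selections per replicate
    counts = Counter(v for run in runs for v in run)
    return {v: c / len(runs) for v, c in counts.items()}
```

Variables that never appear in any replicate are absent from the result and can be treated as having frequency 0.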

We also compare the average FDR of SDA(Q95) and LASSO in different scenarios with different numbers of selected variables. For each repetition, we apply SDA and LASSO and select the top $x$ ($x = 2, 5, 10, 15, 20$) most significant variables. For SDA, these are the variables removed in the first $x$ dropping rounds. For LASSO, we rank SNPs by the absolute value of their coefficients and choose the top $x$ variables. As shown in Figure 4.5, SDA has a lower FDR than LASSO for almost all variable set sizes in all scenarios. The FDR decreases when the signal-to-noise ratio increases.

Application to GAW17 Dataset

The dataset from Genetic Analysis Workshop 17 (GAW17) includes genotypes of 3,205 autosomal genes for 697 individuals. A dichotomous phenotype was simulated from a linear model using SNPs from 34 genes. Most causal SNPs are rare variants. 200 simulation replicates are available with fixed genotypes (Almasy et al. 2011), so we can evaluate the power of different variable selection procedures using data with real genotypes and simulated phenotypes. In our analysis, we focus on the 34 genes with causal variants in the simulation model. In particular, there are 148 causal rare variants and 104 noncausal rare variants in the genetic region of these 34 genes. We first evaluate the return frequency of each of the 148 causal variants using different methods (Figure 4.6). SDA(Q90) has the highest return frequencies for these causal SNPs, followed by SDA(Q95). They outperform LASSO($\lambda_{min}$) and LASSO($\lambda_{1se}$). The single-marker Fisher's exact test has almost no power for these rare variants.
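The false discovery rates quoted in this chapter, and the top-$x$ comparison between SDA and LASSO, reduce to a simple count. The sketch below is illustrative only: the function names are hypothetical, and returning 0 when nothing is selected is a convention chosen here (the thesis reports that case as N (0) instead).

```python
from typing import Collection, Dict, Hashable, List, Sequence


def fdr(selected: Sequence[Hashable], causal: Collection[Hashable]) -> float:
    """False discoveries divided by total discoveries (0.0 if nothing is selected)."""
    if not selected:
        return 0.0
    false_pos = sum(1 for v in selected if v not in causal)
    return false_pos / len(selected)


def top_x_by_abs_coef(coefs: Dict[Hashable, float], x: int) -> List[Hashable]:
    """Rank variables by absolute coefficient, as done for LASSO, and keep the top x."""
    return sorted(coefs, key=lambda v: abs(coefs[v]), reverse=True)[:x]


# EXAMPLE 1 with Q95: 11 selected SNPs, of which 9 are causal and 2 are not.
causal = {f"c{i}" for i in range(20)}
selected_q95 = [f"c{i}" for i in range(9)] + ["n1", "n2"]
print(round(100 * fdr(selected_q95, causal)))  # 18
```

For SDA, the top $x$ variables are the first $x$ entries of the removal order, so `fdr(removal_order[:x], causal)` gives the corresponding point on the FDR curve.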

We further examine the average false discovery rate of SDA(Q90) and LASSO($\lambda_{min}$) in the 200 replicates when $k$ variables are selected ($k = 5, 10, 15, 20$). The $k$ variables selected in LASSO are the top $k$ variables with the largest absolute coefficients. As shown in Figure 4.7, SDA has a lower FDR than LASSO at different numbers of selected variables. In particular, when 10 variants are selected, SDA has an average FDR of less than 0.30, while LASSO has a higher average FDR.

Properties of SDA

$I_1^N$ can be written as

$$I_1^N = \frac{1}{\sum_{i=1}^{K} q_i} \sum_{i=1}^{K} Z_i^2, \qquad \text{where} \quad Z_i = \frac{n_i}{\sqrt{n}\, s_y}\left(p_i^d - \frac{n_d}{n}\right), \quad i = 1, \ldots, K.$$

Assume the genetic variables $X$ are independent. As shown in the proof of Theorem 1, $Z_i \approx U_i$, where

$$U_i = \frac{1}{\sqrt{n}} \sum_{j=1}^{n} (y_j - \mu_Y)(x_{ji} - 2q_i)/\sigma_Y.$$

Under the null hypothesis,

$$U_i \sim N\bigl(0,\, q_i(1-q_i)\bigr) \qquad \text{and} \qquad I_1^N \to \frac{\sum_{i=1}^{K} q_i(1-q_i)\, X_i}{\sum_{i=1}^{K} q_i},$$

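where $X_i \overset{iid}{\sim} \chi^2_1$. Under the alternative hypothesis that $X_i$ is correlated with $Y$,

$$U_i \sim N(\mu_{U_i}, \sigma^2_{U_i}),$$

that is, the expectation of $U_i$ is $\mu_{U_i}$ and its variance is $\sigma^2_{U_i}$. $I$ converges to a weighted sum of non-central $\chi^2$ distributions with weights $\sigma^2_{U_i} / \sum_{i=1}^{K} q_i$ and noncentrality parameters $(\mu_{U_i}/\sigma_{U_i})^2$. Therefore, the expectation of $I$ will converge to

$$\sum_{i=1}^{K} \frac{\sigma^2_{U_i}}{\sum_{k=1}^{K} q_k} \left\{ 1 + \left(\frac{\mu_{U_i}}{\sigma_{U_i}}\right)^2 \right\}$$

and its variance goes to

$$\sum_{i=1}^{K} \left(\frac{\sigma^2_{U_i}}{\sum_{k=1}^{K} q_k}\right)^2 \left( 2 + 4\left(\frac{\mu_{U_i}}{\sigma_{U_i}}\right)^2 \right).$$

The following theorem studies the properties of $\mu_{U_i}$ and $\sigma_{U_i}$ under the logistic regression setting when there is only one causal variable and $x$ follows a Bernoulli distribution.

Theorem 2. Assume that under the alternative hypothesis,

$$\log \frac{P(Y=1)}{P(Y=0)} = c + \beta x,$$

where $c \in \mathbb{R}$. For simplicity, let $x \sim \text{Bernoulli}(q)$ and $\beta \ge 0$. Define the test statistic

$$U = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \frac{(Y_i - \mu_Y)(x_i - q)}{\sigma_Y}.$$

Then we have

$$\frac{U - \sqrt{n}\,\mu_T}{\sigma_T} \to N(0, 1) \qquad (4.2.1)$$

in distribution, where

$$\mu_T = \frac{1}{\sigma_Y} E[(Y - \mu_Y)(x - q)]$$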

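and

$$\sigma_T = \frac{1}{\sigma_Y} SD[(Y - \mu_Y)(x - q)],$$

and the following hold:

(i) $\mu_T$ is a monotonically increasing function of $\beta$;

(ii) $\mu_T / \sigma_T$ is a monotonically increasing function of $\beta$ if $P(Y = 1) \le \frac{1}{2}$.

Proof. (4.2.1) is a direct result of the central limit theorem. We shall prove (i) and (ii) in the following.

Proof of (i):

$$\begin{aligned}
\mu_Y = E(Y) = P(Y = 1) &= \frac{q}{1 + e^{-c-\beta}} + \frac{1-q}{1 + e^{-c}} \\
&= \frac{1}{1 + e^{-c}} + \frac{q\,(e^{-c} - e^{-c-\beta})}{(1 + e^{-c})(1 + e^{-c-\beta})} \\
&= \frac{1}{1 + e^{-c}} + q \left(\frac{e^{-c}}{1 + e^{-c}}\right) \left(\frac{1 - e^{-\beta}}{1 + e^{-c-\beta}}\right).
\end{aligned}$$

Define $g(\beta) = \dfrac{1 - e^{-\beta}}{1 + e^{-c-\beta}}$. It is obvious that $g(\beta)$ is a monotonically increasing function of $\beta$ and $g(\beta) \in [0, 1)$. So

$$\mu_Y = \frac{1}{1 + e^{-c}} + q\, \frac{e^{-c}}{1 + e^{-c}}\, g(\beta)$$

and

$$\sigma_Y = \sqrt{Var(Y)} = \sqrt{\mu_Y (1 - \mu_Y)}.$$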

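We also have

$$\begin{aligned}
E(Y - \mu_Y)(x - q) &= E(XY) - q\mu_Y \\
&= P(X = 1, Y = 1) - q\mu_Y \\
&= \frac{q}{1 + e^{-c-\beta}} - q\left[\frac{q}{1 + e^{-c-\beta}} + \frac{1-q}{1 + e^{-c}}\right] \\
&= q(1-q)\, \frac{e^{-c}}{1 + e^{-c}} \cdot \frac{1 - e^{-\beta}}{1 + e^{-c-\beta}} \\
&= q(1-q)\, \frac{e^{-c}}{1 + e^{-c}}\, g(\beta).
\end{aligned}$$

Therefore,

$$\mu_T = \frac{E(Y - \mu_Y)(x - q)}{\sigma_Y} = q(1-q)\, \frac{e^{-c}}{1 + e^{-c}} \cdot \frac{g(\beta)}{\sqrt{\mu_Y(1 - \mu_Y)}} = \text{const} \cdot \frac{g(\beta)}{\sqrt{\mu_Y(1 - \mu_Y)}}.$$

Here $\text{const} = q(1-q)\, \dfrac{e^{-c}}{1 + e^{-c}}$ does not depend on $\beta$. Thus we just need to prove that $g(\beta)/\sqrt{\mu_Y(1 - \mu_Y)}$ is an increasing function of $\beta$. A rearrangement gives

$$\frac{g(\beta)}{\sqrt{\mu_Y(1 - \mu_Y)}} = \frac{1}{\sqrt{\left(\dfrac{\mu_Y}{g(\beta)}\right)\left(\dfrac{1 - \mu_Y}{g(\beta)}\right)}}.$$

Here $\dfrac{1 - \mu_Y}{g(\beta)}$ is obviously a decreasing function of $\beta$, since both $g(\beta)$ and $\mu_Y$ are increasing w.r.t. $\beta$. And

$$\frac{\mu_Y}{g(\beta)} = \frac{1}{1 + e^{-c}} \cdot \frac{1}{g(\beta)} + q\, \frac{e^{-c}}{1 + e^{-c}}.$$

Therefore $\dfrac{\mu_Y}{g(\beta)}$ is also a monotonically decreasing function of $\beta$. Combining the two, we conclude that $\mu_T$ is monotonically increasing w.r.t. $\beta$.

Proof of (ii): Write $T = (Y - \mu_Y)(x - q)/\sigma_Y$, so that $\mu_T = E(T)$ and $\sigma_T = SD(T)$. To prove that $\mu_T/\sigma_T$ is an increasing function of $\beta$, notice that

$$\frac{\mu_T}{\sigma_T} = \sqrt{\frac{\mu_T^2}{E(T^2) - \mu_T^2}} = \sqrt{\frac{1}{\dfrac{E(T^2)}{\mu_T^2} - 1}}.$$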

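Therefore we just need to prove that $\mu_T / \sqrt{E(T^2)}$ is increasing w.r.t. $\beta$. Recall that $\mu_T = \text{const} \cdot g(\beta)/\sqrt{\mu_Y(1 - \mu_Y)}$. We have

$$\begin{aligned}
\sigma_Y^2 E(T^2) &= E\left[(Y - \mu_Y)^2 (X - q)^2\right] \\
&= \mu_Y^2 q^2 (1-q) \frac{e^{-c}}{1 + e^{-c}} + (1 - \mu_Y)^2 q^2 (1-q) \frac{1}{1 + e^{-c}} \\
&\quad + \mu_Y^2 (1-q)^2 q\, \frac{e^{-c-\beta}}{1 + e^{-c-\beta}} + (1 - \mu_Y)^2 (1-q)^2 q\, \frac{1}{1 + e^{-c-\beta}} \\
&= q(1-q) \left[ \mu_Y^2 q\, \frac{e^{-c}}{1 + e^{-c}} + (1 - \mu_Y)^2 q\, \frac{1}{1 + e^{-c}} + \mu_Y^2 (1-q) \frac{e^{-c-\beta}}{1 + e^{-c-\beta}} + (1 - \mu_Y)^2 (1-q) \frac{1}{1 + e^{-c-\beta}} \right] \\
&= q(1-q) \left[ \mu_Y^2 - 2\mu_Y \left( \frac{q}{1 + e^{-c}} + \frac{1-q}{1 + e^{-c-\beta}} \right) + \left( \frac{q}{1 + e^{-c}} + \frac{1-q}{1 + e^{-c-\beta}} \right) \right] \\
&= q(1-q) \left[ \mu_Y^2 + (1 - 2\mu_Y) \left( \frac{q}{1 + e^{-c}} + \frac{1-q}{1 + e^{-c-\beta}} \right) \right].
\end{aligned}$$

Therefore,

$$\frac{\mu_T}{\sqrt{E(T^2)}} \propto \frac{g(\beta)}{\sqrt{\mu_Y^2 + (1 - 2\mu_Y)\left( \dfrac{q}{1 + e^{-c}} + \dfrac{1-q}{1 + e^{-c-\beta}} \right)}} = \frac{1}{\sqrt{\left(\dfrac{\mu_Y}{g(\beta)}\right)^2 + \dfrac{1 - 2\mu_Y}{g(\beta)} \cdot \dfrac{\dfrac{q}{1 + e^{-c}} + \dfrac{1-q}{1 + e^{-c-\beta}}}{g(\beta)}}}.$$

We have shown previously that $\dfrac{\mu_Y}{g(\beta)}$ is monotonically decreasing w.r.t. $\beta$, and $\dfrac{1 - 2\mu_Y}{g(\beta)}$ is monotonically decreasing w.r.t. $\beta$ if $\mu_Y \le \frac{1}{2}$. The last thing we want to check is that

$$h(\beta) = \frac{\dfrac{q}{1 + e^{-c}} + \dfrac{1-q}{1 + e^{-c-\beta}}}{g(\beta)}$$

is monotonically decreasing w.r.t. $\beta$.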

Recall that $g(\beta) = \dfrac{1 - e^{-\beta}}{1 + e^{-c-\beta}}$. Let $e^{-\beta} = \gamma \in (0, 1]$; then $g(\beta) = \dfrac{1 - \gamma}{1 + \gamma e^{-c}}$ and

$$h(\beta) = \frac{\dfrac{q}{1 + e^{-c}} + \dfrac{1-q}{1 + \gamma e^{-c}}}{\dfrac{1 - \gamma}{1 + \gamma e^{-c}}} = \frac{q(1 + \gamma e^{-c}) + (1-q)(1 + e^{-c})}{(1 + e^{-c})(1 - \gamma)} \propto \frac{1 + e^{-c} - q e^{-c}(1 - \gamma)}{1 - \gamma}.$$

This is a monotonically increasing function of $\gamma$, and thus a monotonically decreasing function of $\beta$.

Remark: Under $H_0: \beta = 0$, $\mu_T = 0$. When $\beta \ne 0$, the absolute value of $\mu_T$ will be greater than 0.

Discussion

We propose in this chapter a backward dropping algorithm, SDA, to locate possibly influential rare variants after a region has been identified as significant. It is based on the SPA score $I_1$, which determines the significance of a region. SDA is a greedy backward elimination approach which, at each round of dropping, removes the variable whose removal increases the p-value the most. An influence ratio score is calculated for each removed/retained variable. The stopping rule is determined by quantiles of the influence ratio score under permutation. Unlike traditional stepwise elimination, SDA chooses to remove the most significant variable at each round. Compared with LASSO, it has higher power and a lower false discovery rate. It is computationally efficient and can be applied to large regions with hundreds of rare variants.
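As a numerical sanity check on Theorem 2 from the preceding section, the moments of $T = (Y - \mu_Y)(x - q)/\sigma_Y$ can be computed exactly by enumerating the four $(x, Y)$ cells of the logistic model. The sketch below is illustrative and uses arbitrary parameter values ($c = -1$, $q = 0.05$, chosen here so that $P(Y = 1) \le 1/2$ over the grid); it is a check on a grid of $\beta$ values, not a proof.

```python
import math
from typing import Tuple


def theorem2_quantities(beta: float, c: float, q: float) -> Tuple[float, float]:
    """Return (mu_T, mu_T / sigma_T) for x ~ Bernoulli(q) and logit P(Y=1|x) = c + beta*x."""
    p1 = 1.0 / (1.0 + math.exp(-c - beta))    # P(Y = 1 | x = 1)
    p0 = 1.0 / (1.0 + math.exp(-c))           # P(Y = 1 | x = 0)
    mu_y = q * p1 + (1.0 - q) * p0
    sigma_y = math.sqrt(mu_y * (1.0 - mu_y))
    # (probability, value of (Y - mu_Y)(x - q)) for each (x, Y) cell.
    cells = [
        (q * p1,                (1.0 - mu_y) * (1.0 - q)),   # x = 1, Y = 1
        (q * (1.0 - p1),        (0.0 - mu_y) * (1.0 - q)),   # x = 1, Y = 0
        ((1.0 - q) * p0,        (1.0 - mu_y) * (0.0 - q)),   # x = 0, Y = 1
        ((1.0 - q) * (1.0 - p0), (0.0 - mu_y) * (0.0 - q)),  # x = 0, Y = 0
    ]
    m1 = sum(p * t for p, t in cells) / sigma_y              # mu_T
    m2 = sum(p * t * t for p, t in cells) / sigma_y ** 2     # E(T^2)
    return m1, m1 / math.sqrt(m2 - m1 * m1)


betas = [0.5 * k for k in range(7)]                          # 0, 0.5, ..., 3
values = [theorem2_quantities(b, c=-1.0, q=0.05) for b in betas]
mu_increasing = all(a[0] < b[0] for a, b in zip(values, values[1:]))
ratio_increasing = all(a[1] < b[1] for a, b in zip(values, values[1:]))
print(mu_increasing, ratio_increasing)
```

At $\beta = 0$ the function returns $\mu_T = 0$ up to floating-point error, in line with the Remark.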

SDA can be extended to account for G×G interaction effects. As with the $I_2$ score of SPA, for a region with $K$ variants there are $K(K-1)/2$ pairs of rare variants under consideration. A backward dropping procedure based on the score $I_2$ can be performed in two different forms. The first choice is to remove one variable at a time. After removing a variable, there will be $(K-1)(K-2)/2$ pairs left, and their significance can again be evaluated using permutation. The variable leading to the largest p-value increase is discarded, and we repeat the backward dropping process until there is only one variable left. The second choice is to remove the most significant pair of rare variants at a time. The second approach gives information about each SNP pair, while the first approach is more computationally efficient. Similar extensions can also be used to assess the interaction between two genetic regions.

The computationally efficient SDA algorithm is able to identify influential SNPs at genomic scale. For genomic data with large numbers of rare variants, we propose a two-step SPA-SDA approach (Figure 4.8): first divide all rare variants into genetic regions of moderate size, such as 500 SNPs, and perform SPA to evaluate the significance of each region; if a region is significantly associated with the phenotype, apply SDA to identify influential rare variants. If a region is not significant, we may try to evaluate its interaction effects with environmental factors or other possible covariates. This SPA-SDA approach has the power to identify influential rare variants and is a valuable tool for rare variants association analysis in the genomic era.
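The two-step SPA-SDA approach can be sketched as a short driver loop. This is a hypothetical outline, not the thesis code: `region_pvalue` stands in for the SPA test of a region and `select_variants` for an SDA run on that region, both supplied by the caller; regions are formed here by simply chunking the variant list, and the significance level 0.05 follows the threshold used in the toy example.

```python
from typing import Callable, Hashable, List, Sequence


def split_into_regions(variants: Sequence[Hashable],
                       region_size: int = 500) -> List[List[Hashable]]:
    """Divide the variants into consecutive regions of at most region_size."""
    return [list(variants[i:i + region_size])
            for i in range(0, len(variants), region_size)]


def spa_sda(variants: Sequence[Hashable],
            region_pvalue: Callable[[List[Hashable]], float],
            select_variants: Callable[[List[Hashable]], List[Hashable]],
            alpha: float = 0.05,
            region_size: int = 500) -> List[Hashable]:
    """Step 1: test each region; step 2: run variable selection on significant regions."""
    findings: List[Hashable] = []
    for region in split_into_regions(variants, region_size):
        if region_pvalue(region) < alpha:              # region is significant
            findings.extend(select_variants(region))   # locate influential variants
    return findings
```

Regions that are not significant are skipped in this sketch; the text suggests examining their interaction with environmental factors or other covariates instead.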

Figure 4.2: P-values after removing the designated variable in the first round of SDA for the toy example (x-axis: variable removed; y-axis: p-value). The largest p-value is attained after removing variable $X_5$.

Figure 4.3: P-value trace (left) and influence ratio (IR) score trace (right) in SDA for the toy example and its permutations (x-axis: drop round). The red line is the p-value trace or IR score trace for the toy example. Green dotted lines are the p-value traces or IR score traces for 1,000 permutations. In the right plot, the blue line and magenta line are traces of the 95% and 90% quantiles of each round in SDA. Causal SNPs are marked as crosses. The first five removed/retained variables in SDA are all true causal variables.

Figure 4.4: Return frequencies from SDA(Q95) of all variables among 100 replications for each scenario (panels: Scenarios 1-6; x-axis: variable index; y-axis: selection frequency). Red lines represent the return frequencies of true causal variables and green lines represent the return frequencies of non-causal variables.

Figure 4.5: Average FDR from SDA(Q95), LASSO($\lambda_{min}$) and LASSO($\lambda_{1se}$) with different numbers of selected variables in 100 repetitions of different scenarios (panels: Scenarios 1-6; x-axis: number of variables selected; y-axis: FDR).

Figure 4.6: Return frequencies of the causal rare variants in GAW17 using different methods. The x-axis is the index of the 148 causal rare variants in the simulation model of GAW17, and the y-axis is the selection frequency of these causal variables in 200 replications. The selection frequencies are calculated using SDA(Q90), SDA(Q95), LASSO($\lambda_{min}$), LASSO($\lambda_{1se}$) and Fisher's exact test.

Figure 4.7: Average FDR of SDA(Q90) and LASSO($\lambda_{min}$) in GAW17 data (x-axis: number of variables included; y-axis: FDR).

Figure 4.8: SPA-SDA two-step approach for rare variants association analysis. n is the number of regions.
