Power and sample size calculations for designing rare variant sequencing association studies.

Similar documents
Lecture 9: Kernel (Variance Component) Tests and Omnibus Tests for Rare Variants. Summer Institute in Statistical Genetics 2017

Association Testing with Quantitative Traits: Common and Rare Variants. Summer Institute in Statistical Genetics 2014 Module 10 Lecture 5


Lecture 1: Case-Control Association Testing. Summer Institute in Statistical Genetics 2015

Computational Systems Biology: Biology X

Test for interactions between a genetic marker set and environment in generalized linear models Supplementary Materials

Lecture 2: Genetic Association Testing with Quantitative Traits. Summer Institute in Statistical Genetics 2017

Association studies and regression

Lecture 7: Interaction Analysis. Summer Institute in Statistical Genetics 2017

Supplementary Materials for Molecular QTL Discovery Incorporating Genomic Annotations using Bayesian False Discovery Rate Control

Proportional Variance Explained by QLT and Statistical Power. Proportional Variance Explained by QTL and Statistical Power

On the limiting distribution of the likelihood ratio test in nucleotide mapping of complex disease

Case-Control Association Testing. Case-Control Association Testing

1 Preliminary Variance component test in GLM Mediation Analysis... 3

Relationship between Genomic Distance-Based Regression and Kernel Machine Regression for Multi-marker Association Testing

PQL Estimation Biases in Generalized Linear Mixed Models

The Generalized Higher Criticism for Testing SNP-sets in Genetic Association Studies

Comparison of Statistical Tests for Disease Association with Rare Variants

NIH Public Access Author Manuscript Stat Sin. Author manuscript; available in PMC 2013 August 15.

Linear Regression (1/1/17)

Previous lecture. Single variant association. Use genome-wide SNPs to account for confounding (population substructure)

Heritability estimation in modern genetics and connections to some new results for quadratic forms in statistics

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

Frequentist-Bayesian Model Comparisons: A Simple Example

THE STATISTICAL ANALYSIS OF GENETIC SEQUENCING AND RARE VARIANT ASSOCIATION STUDIES. Eugene Urrutia

Package LBLGXE. R topics documented: July 20, Type Package

. Also, in this case, p i = N1 ) T, (2) where. I γ C N(N 2 2 F + N1 2 Q)

Efficient designs of gene environment interaction studies: implications of Hardy Weinberg equilibrium and gene environment independence

The Generalized Higher Criticism for Testing SNP-sets in Genetic Association Studies

BTRY 4830/6830: Quantitative Genomics and Genetics

Binomial Mixture Model-based Association Tests under Genetic Heterogeneity

Pooled Association Tests for Rare Genetic Variants: A Review and Some New Results

STA 216, GLM, Lecture 16. October 29, 2007

Math 423/533: The Main Theoretical Topics

An Approximate Test for Homogeneity of Correlated Correlation Coefficients

Exact McNemar s Test and Matching Confidence Intervals Michael P. Fay April 25,

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

The purpose of this section is to derive the asymptotic distribution of the Pearson chi-square statistic. k (n j np j ) 2. np j.

Estimation in Generalized Linear Models with Heterogeneous Random Effects. Woncheol Jang Johan Lim. May 19, 2004

Methods for Cryptic Structure. Methods for Cryptic Structure

Prediction of double gene knockout measurements

An introduction to biostatistics: part 1

Application of Variance Homogeneity Tests Under Violation of Normality Assumption

Lecture 9. QTL Mapping 2: Outbred Populations

QUANTIFYING PQL BIAS IN ESTIMATING CLUSTER-LEVEL COVARIATE EFFECTS IN GENERALIZED LINEAR MIXED MODELS FOR GROUP-RANDOMIZED TRIALS

University of California, Berkeley

Theoretical and computational aspects of association tests: application in case-control genome-wide association studies.

Normal distribution We have a random sample from N(m, υ). The sample mean is Ȳ and the corrected sum of squares is S yy. After some simplification,

Binary choice 3.3 Maximum likelihood estimation

Statistical Analysis for QBIC Genetics Adapted by Ellen G. Dow 2017

Marginal Screening and Post-Selection Inference

Motivation Scale Mixutres of Normals Finite Gaussian Mixtures Skew-Normal Models. Mixture Models. Econ 690. Purdue University

Package KMgene. November 22, 2017

Bayesian Linear Regression

Bootstrapping high dimensional vector: interplay between dependence and dimensionality

Testing for group differences in brain functional connectivity

SUPPLEMENTARY INFORMATION

Statistics in medicine

Tables Table A Table B Table C Table D Table E 675

GWAS IV: Bayesian linear (variance component) models

I Have the Power in QTL linkage: single and multilocus analysis

A note on profile likelihood for exponential tilt mixture models

Hypothesis Testing One Sample Tests

(Genome-wide) association analysis

Asymptotic distribution of the largest eigenvalue with application to genetic data

22s:152 Applied Linear Regression. Example: Study on lead levels in children. Ch. 14 (sec. 1) and Ch. 15 (sec. 1 & 4): Logistic Regression

An Algorithm for Bayesian Variable Selection in High-dimensional Generalized Linear Models

Related Concepts: Lecture 9 SEM, Statistical Modeling, AI, and Data Mining. I. Terminology of SEM

Computationally Efficient Estimation of Multilevel High-Dimensional Latent Variable Models

How to analyze many contingency tables simultaneously?

Efficient Bayesian mixed model analysis increases association power in large cohorts

Research Article Sample Size Calculation for Controlling False Discovery Proportion

Preliminaries. Copyright c 2018 Dan Nettleton (Iowa State University) Statistics / 38

8 Nominal and Ordinal Logistic Regression

STAT 526 Spring Midterm 1. Wednesday February 2, 2011

A stationarity test on Markov chain models based on marginal distribution

Optimal Methods for Using Posterior Probabilities in Association Testing

Robust Bayesian Variable Selection for Modeling Mean Medical Costs

Good Confidence Intervals for Categorical Data Analyses. Alan Agresti

Negative Multinomial Model and Cancer. Incidence

Lecture 4: Types of errors. Bayesian regression models. Logistic regression

Statistical methods for analyzing genetic sequencing association studies

High-Throughput Sequencing Course

Asymptotic Statistics-VI. Changliang Zou

Bayesian spatial quantile regression

Regularization Parameter Selection for a Bayesian Multi-Level Group Lasso Regression Model with Application to Imaging Genomics

Multidimensional heritability analysis of neuroanatomical shape. Jingwei Li

FaST linear mixed models for genome-wide association studies

Lecture 25. Ingo Ruczinski. November 24, Department of Biostatistics Johns Hopkins Bloomberg School of Public Health Johns Hopkins University

Powerful Genetic Association Analysis for Common or Rare Variants with High Dimensional Structured Traits

Classification. Chapter Introduction. 6.2 The Bayes classifier

Generalized Linear Modeling - Logistic Regression

Bayesian Regression (1/31/13)

An Adaptive Association Test for Microbiome Data

Bayesian Inference of Interactions and Associations

Extreme inference in stationary time series

STAT 6350 Analysis of Lifetime Data. Failure-time Regression Analysis

Introduction to Analysis of Genomic Data Using R Lecture 6: Review Statistics (Part II)

Linear Methods for Prediction

Linear Models and Estimation by Least Squares

Transcription:

Power and sample size calculations for designing rare variant sequencing association studies. Seunggeun Lee 1, Michael C. Wu 2, Tianxi Cai 1, Yun Li 2,3, Michael Boehnke 4 and Xihong Lin 1 1 Department of Biostatistics, Harvard School of Public Health, Boston 2 Department of Biostatistics, The University of North Carolina at Chapel Hill, Chapel Hill 3 Department of Genetics, The University of North Carolina at Chapel Hill, Chapel Hill 4 Department of Biostatistics and Center for Statistical Genetics, University of Michigan, Ann Arbor 1 Introduction Recently, Wu et al. [4] have proposed the sequence kernel machine test (SKAT) to test association between genetic variants in a gene or region and a continuous or binary trait. SKAT, which uses the kernel machine regression framework, is very flexible and computationally efficient. From extensive simulation studies and real data application, it has been shown that SKAT is more powerful than the collapsing based burden tests under many circumstances [4]. To design new sequence association study with SKAT as a testing procedure, it is important to know the required sample size to achieve a proper statistical power. Power and sample size calculation can be done by simulations, however this computer intensive approach would be time consuming. Here, we derive the analytical formula of the statistical power of SKAT. Required sample size can be computed easily by inverting the power function. In addition, We have developed user friendly R package which implements the power and sample size calculation formula (Web Resources). 1

2 Power and Sample Size Calculation Formula for SKAT: Statistical Derivations 2.1 Continuous Traits For simplicity, we assume no covariates are present. However, the results presented can be easily extended to accommodate covariates. Suppose there are n individuals and y = (y 1,..., y n ) is a vector of continuous phenotype. We assume that p variants are observed in a particular gene or genomic region, and G is an n p genotype matrix with G i being the i th row of the G. To relate the SNP set to the phenotype, we consider the linear model y i = α 0 + G iβ + ɛ i, where ɛ i N(0, σ 2 ). Without loss of generality and for the ease of presentation, we set each entry of G i to be centered such that E(G i ) = 0, and σ = 1 for continuous traits. The SKAT test statistic with a kernel K(, ) is n Q = (y ȳ1) K(y ȳ1), where ȳ = n 1 y i. In the case of weighted linear kernel with a weight function w( ), i=1 K = GWG, where W = diag{w( ˆm 1 ),..., w( ˆm p )}, and ˆm j is an observed MAF of the j th variant. Denote µ β = Gβ and Z = y ȳ1 µ β, and then Q = (y ȳ1) K(y ȳ1) = (Z + µ β ) K(Z + µ β ). By the spectral decomposition, K = UΛU. Since each element of Z is an independent Gaussian with mean 0 and asymptotic variance 1, Q asymptotically follows m j=1 λ jχ 2 1 (δ j) with δ j = µ β u ju j µ β. Here, λ j is the j th diagonal element of Λ, and u j is the j th column of U. For computational efficiency, we approximate the mixture of chi-square distributions of Q us- 2

ing the non-central chi-square approximation with ν degrees of freedom and non-centrality parameter δ [2] under the null and alternative. calculations are similar. Specifically, we compute c k Results using the Davies method [1] for power = p j=1 λk j for the null distribution, and c k = p j=1 λk j + k p j=1 λk j δ j for the alternative distribution up to k = 4. These values can be obtained from p λ k j = trace(k k ) = trace{(g GW) k } (1) j=1 and p j=1 λ k j δ j = µ β Kk µ β = trace(µ β Kk µ β ) = trace{(g GW) k 1 G µ β µ βgw}. (2) Suppose A = E(G GW/n) and B = E(G µ β µ β GW/n2 ). Since the distribution of G can be inferred from population genetic simulations or existing data (e.g. 1000 genome project data), we can obtain both A and B. By the continuity of trace and matrix multiplication, trace(k k ) = n k trace(a k ) and trace(µ β Kk µ β ) = n k+1 trace(a k 1 B). After computing c 1,..., c 4, we obtain following values. µ Q = c 1, σ Q = 2c 2, s 1 = c 3 /c 3/2 2, s 2 = c 4 /c 2 2, ( 1/ s 1 ) s 2 1 a = s 2 if s 2 1 > s 2 1/, s 2 if s 2 1 s 2 s 1 a 3 a 2 if s 2 1 δ = > s 2, 0 if s 2 1 s 2 l = a 2 2δ, µ X = l + δ, and σ X = 2 l + 2δ. Note that we modified the approximation of Liu et al.(2009) [2] when s 2 1 s 2 by matching kurtosis, instead of skewness, to improve the estimation of tail probability. To estimate the power, we first compute µ Q, µ X, σ Q, and σ X under the null. A critical value with level α is q c = (q(1 α; χ 2 l (δ)) µ X) σ Q σ X + µ Q, where q( ; χ 2 l (δ)) is a quantile function of χ2 l (δ). Then, we recompute µ Q, µ X, σ Q, and σ X under 3

the alternative and estimate the power P ( χ 2 l (δ) > σ ) X (q c µ Q ) + µ X. σ Q 2.2 Dichotomous Traits in Cross-Sectional and Prospective Studies In the absence of covariates, the logistic model we consider is logit (π i ) = α 0 + G iβ, (3) where y i is a disease status (1 = disease, 0 = non-disease). We assume that the prevalence/incidence of disease is known. Our SKAT test statistic with a weighted linear kernel K is Q = (y ˆπ 0 1) K(y ˆπ 0 1), where ˆπ 0 = n 1 n i=1 y i, the estimated disease probability under H 0. Denote µ β = (π 1 ˆπ 0,..., π n ˆπ 0 ), where π satisfies (3), and V = diag[v 1,..., v n ], and v i = π i (1 π i ) is var(y i ). Then Q can be written as Q = (y ˆπ 0 1) K(y ˆπ 0 1) = (y ˆπ 0 1 µ β + µ β ) V 1/2 V 1/2 KV 1/2 V 1/2 (y ˆπ 0 1 µ β + µ β ) = (Z + V 1/2 µ β ) K(Z + V 1/2 µ β ), where Z = V 1/2 (y ˆπ 0 1 µ β ), and K = V 1/2 KV 1/2. Since each element of Z has mean 0 and variance 1, (u j Z)2 asymptotically follows independent χ 2 1 distribution. Now we apply the same argument shown in the Section 2.1 using K instead of K, and estimate the power. 2.3 Modifications of Power Calculations for Rare Variants With finite sample size n, causal variants that are rare may not be observed. Our power and sample size calculations can account for this uncertainty. Suppose the population MAF for the j th 4

variant is m j. Let θ j be the chance the variant j is observed (polymorphic) in sample size n. Then θ j = 1 (1 m j ) 2n, and the sample size required to observe this variant with at least π j chance is n > ln(1 θ j) 2ln(1 m j ). For example, to have θ j minimum sample size is = 99.9% chance to observe a variant with a given MAF, the required MAF 0.1 0.01 0.001 0.0001 Minimum n 33 344 3453 34537 One can see larger sample size is needed to observe a rarer variant. Suppose there are p variants in a region in the population, with rare variants, the model we fit is actually y i = α 0 + G i β + ɛ i for continuous traits, and logit (π i ) = α 0 + G i β for dichotomous traits, where G i = (G i1 1,..., G ip p ) = G i, and = diag[ 1,..., p ]. Here, j is an indicator that variant j is observed in sample size n. Under this model, the weighted linear kernel K = GW G = G W G, and trace(k k ) and trace(µ β Kk µ β ) can be approximated as trace(k k ) n k trace((aπ) k ) and trace(µ β Kk µ β ) n k+1 trace((aπ) k 1 BΠ), where Π = diag[θ 1,..., θ p ]. We further improve the approximation by incorporating the fact that E( i ) = E( 2 i ). Let A 2 be a p p matrix with the (i, j)th element being a ij θ i θ I(i j) j, where a ij is the (i, j) th element of A, and then E( G GW /n) A 2. Let us denote A 1 = Π, A 3 = ΠAA 2, A 4 = A 2 AA 2, and then trace(k k ) n k trace(aa k ), and trace(µ β Kk µ β ) n k+1 trace(ba k ). Now we compute the power from the χ 2 approximation method which is described in the previous section. 2.4 Power and Sample Size Calculations for Retrospective Case-Control Studies It is well known that logistic regression can be used to analyze case-control data [3]. However, it is necessary to incorporate the retrospective nature of case-control studies to properly estimate the power. Let S be a selection indicator such that S = 1 denotes a subject is selected in the case-control sample. Then the conditional distribution of G and y given S = 1, instead of the unconditional distribution of G and y, should be used to compute power. Denote by π i = P r(y i = 5

1 G i, S i = 1) the case-control probability, if the the population disease probability follows the logistic model (3), then the case-control probability π i also follows the same logistic model except for a different intercept [3] as logit ( π i ) = α 0 + G iβ, (4) where α 0 = α 0 + log { } { } P (S = 1 y = 1) ˆπ0 P (y = 0) = α 0 + log, (5) P (S = 1 y = 0) (1 ˆπ 0 )P (y = 1) where P (S = 1 y = 1) is the probability that a case is sampled and P (S = 1 y = 0) is the probability that a control is sampled, and P (y = 1) is the population disease prevalence/incidence. Further one can show that P (G S = 1) = P (G y = 1, S = 1)P (y = 1 S = 1) + P (G y = 0, S = 1)P (y = 0 S = 1) = ˆπ 0 P (y = 1) P (y = 1 G)P (G) + 1 ˆπ 0 P (y = 0 G)P (G). (6) P (y = 0) We compute A, B, W, A 2 and Π by estimating µ β and V using (5) and by using conditional distribution (6), and subsequently estimate the power. 3 Computing the average power over different regions The statistical power of rare variants analysis depends on the LD structure of genomic regions to be investigated, and MAFs of causal variants. If one is interested in only one known region and knows in advance which variants are causal, they can directly estimate power using the power formula given above. In practice, however, one is often interested in more than one region and only hypothesizes a disease model of causal variants. For example, one may hypothesize that a certain percentage of rare variants are causal, instead of selecting priori causal variants. In this case, we propose to use the average power of regions given a disease model. This average power can be easily computed from taking mean of obtained powers from the power formula using randomly selected regions/causal variants. Our experience shows that 50 100 sets of different regions/causal variants are enough to compute the average power stably. 6

4 Web Resources An implementation of SKAT and power/sample size calculations in the R language can be found at http://www.hsph.harvard.edu/ xlin/software.html. References [1] Davies, R. (1980). Algorithm AS 155: The distribution of a linear combination of χ 2 random variables. Applied Statistics 29(3), 323 333. [2] Liu, H., Y. Tang, and H. Zhang (2009). A new chi-square approximation to the distribution of non-negative definite quadratic forms in non-central normal variables. Computational Statistics & Data Analysis 53(4), 853 856. [3] Prentice, R. and R. Pyke (1979). Logistic disease incidence models and case-control studies. Biometrika 66(3), 403 411. [4] Wu, M., S. Lee, T. Cai, Y. Li, M. Boehnke, and X. Lin (2011). Rare Variant Association Testing for Sequencing Data Using the Sequence Kernel Association Test (SKAT). Manuscript. 7