The Generalized Higher Criticism for Testing SNP-sets in Genetic Association Studies

Similar documents
The Generalized Higher Criticism for Testing SNP-sets in Genetic Association Studies

Lecture 9: Kernel (Variance Component) Tests and Omnibus Tests for Rare Variants. Summer Institute in Statistical Genetics 2017

BTRY 4830/6830: Quantitative Genomics and Genetics Fall 2014

Association Testing with Quantitative Traits: Common and Rare Variants. Summer Institute in Statistical Genetics 2014 Module 10 Lecture 5

Power and sample size calculations for designing rare variant sequencing association studies.

Association studies and regression

Proportional Variance Explained by QLT and Statistical Power. Proportional Variance Explained by QTL and Statistical Power

Linear Regression (1/1/17)

Model-Free Knockoffs: High-Dimensional Variable Selection that Controls the False Discovery Rate

Heritability estimation in modern genetics and connections to some new results for quadratic forms in statistics

MODEL-FREE LINKAGE AND ASSOCIATION MAPPING OF COMPLEX TRAITS USING QUANTITATIVE ENDOPHENOTYPES

Lecture 2: Genetic Association Testing with Quantitative Traits. Summer Institute in Statistical Genetics 2017

Test for interactions between a genetic marker set and environment in generalized linear models Supplementary Materials

(Genome-wide) association analysis

Binomial Mixture Model-based Association Tests under Genetic Heterogeneity

Package KMgene. November 22, 2017

Previous lecture. Single variant association. Use genome-wide SNPs to account for confounding (population substructure)

Computational Systems Biology: Biology X

Marginal Screening and Post-Selection Inference

Case-Control Association Testing. Case-Control Association Testing

Lecture 7: Interaction Analysis. Summer Institute in Statistical Genetics 2017

1 Preliminary Variance component test in GLM Mediation Analysis... 3

Probability of Detecting Disease-Associated SNPs in Case-Control Genome-Wide Association Studies

Confidence Intervals for Low-dimensional Parameters with High-dimensional Data

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

Powerful multi-locus tests for genetic association in the presence of gene-gene and gene-environment interactions

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

New Penalized Regression Approaches to Analysis of Genetic and Genomic Data

Methods for Cryptic Structure. Methods for Cryptic Structure

Lecture 5: LDA and Logistic Regression


Variance Component Models for Quantitative Traits. Biostatistics 666

Propensity Score Weighting with Multilevel Data

Bayesian Inference of Interactions and Associations

Theoretical and computational aspects of association tests: application in case-control genome-wide association studies.

Primal-dual Covariate Balance and Minimal Double Robustness via Entropy Balancing

HERITABILITY ESTIMATION USING A REGULARIZED REGRESSION APPROACH (HERRA)

Causal Model Selection Hypothesis Tests in Systems Genetics

Introduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin

Research Statement on Statistics Jun Zhang

Stat 579: Generalized Linear Models and Extensions

Hypothesis Testing Based on the Maximum of Two Statistics from Weighted and Unweighted Estimating Equations

DNA polymorphisms such as SNP and familial effects (additive genetic, common environment) to

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

Chapter 2. Review of basic Statistical methods 1 Distribution, conditional distribution and moments

1 Springer. Nan M. Laird Christoph Lange. The Fundamentals of Modern Statistical Genetics

SUPPLEMENTARY INFORMATION

25 : Graphical induced structured input/output models

Course on Inverse Problems

1. Understand the methods for analyzing population structure in genomes

An Introduction to Expectation-Maximization

Bayesian Methods for Machine Learning

,..., θ(2),..., θ(n)

Feature Selection via Block-Regularized Regression

CS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS

Previous lecture. P-value based combination. Fixed vs random effects models. Meta vs. pooled- analysis. New random effects testing.

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

Correlation, z-values, and the Accuracy of Large-Scale Estimators. Bradley Efron Stanford University

A Robust Test for Two-Stage Design in Genome-Wide Association Studies

Ph.D. Qualifying Exam Friday Saturday, January 6 7, 2017

Lecture 1: Case-Control Association Testing. Summer Institute in Statistical Genetics 2015

Covariance and Correlation

Building a Prognostic Biomarker

Introduction to QTL mapping in model organisms

25 : Graphical induced structured input/output models

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

Some New Methods for Latent Variable Models and Survival Analysis. Latent-Model Robustness in Structural Measurement Error Models.

Efficient designs of gene environment interaction studies: implications of Hardy Weinberg equilibrium and gene environment independence

Generalized Linear Models Introduction

Random vectors X 1 X 2. Recall that a random vector X = is made up of, say, k. X k. random variables.

Ø Set of mutually exclusive categories. Ø Classify or categorize subject. Ø No meaningful order to categorization.

Data Mining Stat 588

USING LINEAR PREDICTORS TO IMPUTE ALLELE FREQUENCIES FROM SUMMARY OR POOLED GENOTYPE DATA. By Xiaoquan Wen and Matthew Stephens University of Chicago

Lecture 3: Basic Statistical Tools. Bruce Walsh lecture notes Tucson Winter Institute 7-9 Jan 2013

Friday Harbor From Genetics to GWAS (Genome-wide Association Study) Sept David Fardo

Nature Genetics: doi: /ng Supplementary Figure 1. Number of cases and proxy cases required to detect association at designs.

Lecture 9. QTL Mapping 2: Outbred Populations

Lecture WS Evolutionary Genetics Part I 1

Set-based Tests for Genetic Association and Gene-Environment Interaction

Methods for High Dimensional Inferences With Applications in Genomics

Statistical distributions: Synopsis

Gaussian Process Regression Forecasting of Computer Network Conditions

Lecture 1 Basic Statistical Machinery

Tutorial Session 2. MCMC for the analysis of genetic data on pedigrees:

Linear Regression. Volker Tresp 2018

Introduction to Statistical Genetics (BST227) Lecture 6: Population Substructure in Association Studies

Overlap Propensity Score Weighting to Balance Covariates

Bayesian construction of perceptrons to predict phenotypes from 584K SNP data.

Multiple regression. CM226: Machine Learning for Bioinformatics. Fall Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar

Statistical Analysis of Causal Mechanisms

Copula Regression RAHUL A. PARSA DRAKE UNIVERSITY & STUART A. KLUGMAN SOCIETY OF ACTUARIES CASUALTY ACTUARIAL SOCIETY MAY 18,2011

Lecture 11: Multiple trait models for QTL analysis

Genetics Studies of Multivariate Traits

High Dimensional Propensity Score Estimation via Covariate Balancing

Statistical Methods for Modeling Heterogeneous Effects in Genetic Association Studies

On Identifying Rare Variants for Complex Human Traits

Breeding Values and Inbreeding. Breeding Values and Inbreeding

Some models of genomic selection

Genotype Imputation. Biostatistics 666

Distinctive aspects of non-parametric fitting

Transcription:

The Generalized Higher Criticism for Testing SNP-sets in Genetic Association Studies Ian Barnett, Rajarshi Mukherjee & Xihong Lin Harvard University ibarnett@hsph.harvard.edu August 5, 2014 Ian Barnett JSM 2014 August 5, 2014 1 / 20

Background Genome-wide association studies (GWAS): millions of common (minor allele frequency > 0.05) SNPs genotyped. Gene-level/pathway-level analysis can provide power to detect these types of effects by combining information over the SNPs. Goal: Develop powerful, computationally efficient, statistical methodology for SNP-sets that have the power to detect joint SNP effects. Ian Barnett JSM 2014 August 5, 2014 2 / 20

Model Model n subjects, q covariates, p genetic variants. Y i is phenotype for ith individual X i contains q covariates for ith individual G i contains SNP information (minor allele counts) in a gene/pathway/snp-set for ith individual α and β contain regression coefficients. µ i = E(Y i G i, X i ) h(µ i ) = X i α + G i β h( ) is the link function. Ian Barnett JSM 2014 August 5, 2014 3 / 20

Marginal SNP test statistics The marginal score test statistic for the jth variant is: Z j = G T j (Y ˆµ 0) where ˆµ 0 is the MLE of E(Y H 0 ). Assume Z j is normalized. Letting UU T = Ĉov(Z) = ˆΣ, define the transformed (decorrelated) test statistics: Z = U 1 L Z MVN(0, n I p) Ian Barnett JSM 2014 August 5, 2014 4 / 20

Current popular methods Method SKAT MinP p Test statistic j=1 Z j 2 max j { Z j } High power when signal sparsity is low. Accurate Pros p-values can be obtained quickly. Cons Can have very low power when sparsity is high. High power when signal sparsity is high. Slightly lower power when sparsity is low. Difficult to obtain accurate analytic p- values. Ian Barnett JSM 2014 August 5, 2014 5 / 20

The higher criticism Let S(t) = p j=1 1 { Zj t} Assumes Σ = I p Under H 0, S(t) Binomial(p, 2 Φ(t)) where Φ(t) = 1 Φ(t) is the survival function of the normal distribution. The Higher Criticism test statistic is: HC = sup t>0 { } S(t) 2p Φ(t) 2p Φ(t)(1 2 Φ(t)) Ian Barnett JSM 2014 August 5, 2014 6 / 20

The higher criticism Histogram of the Z i Density 0.00 0.10 0.20 0.30 4 2 0 2 4 t argmax{hc(t)} Ian Barnett JSM 2014 August 5, 2014 7 / 20

Adjusting for correlation Recalling that Z = U 1 Z, let S (t) = p j=1 1 { Z j t} Note that under H 0, S (t) Binomial(p, 2 Φ(t)) regardless for general correlated Σ. The innovated Higher Criticism test statistic is: { } S (t) 2p Φ(t) ihc = sup t>0 2p Φ(t)(1 2 Φ(t)) Ian Barnett JSM 2014 August 5, 2014 8 / 20

Asymptotic p-values for ihc The supremum of this standardized empirical process follows a Gumbel distribution asymptotically. For 0 < ɛ < δ < 1, if the supremum is taken over the interval Φ 1 (1 δ/2) < t < Φ 1 (1 ɛ/2), then, as shown in Jaeschke (1979), we can write: pr(ihc < x) exp[ exp{ x(2 log ρ) 1/2 2 1 log π + 2 1 log 2 ρ + 2 log ρ}], where ρ = 2 1 log[δ(1 ɛ)/{ɛ(1 δ)}]. Jaeschke (1979) shows that this converges in distribution at an abysmal rate of O{(log p) 1/2 } Ian Barnett JSM 2014 August 5, 2014 9 / 20

Slow convergence to asymptotic distribution The supremum of the higher criticism test statistic is taken over the range Φ 1 (1 δ/2) < t < Φ 1 (1 ɛ/2) where ɛ = 0 01 and δ = 0 40. Each empirical distribution is constructed using 500 samples of p independent standard normal random variables. 1.0 0.8 CDF(x) 0.6 0.4 0.2 0.0 Theoretical; p= Empirical; p=10 2 Empirical; p=10 6 0 1 2 3 4 x Ian Barnett JSM 2014 August 5, 2014 10 / 20

Analytic p-values for ihc Letting h be the observed GHC statistic: ( { p-value = pr sup t>0 S (t) 2p Φ(t) 2p Φ(t)(1 2 Φ(t)) There exists 0 < t 1 < < t p, such that ( p ) p-value = 1 pr {S (t k ) p k} k=1 } h Then apply the chain rule of conditioning to get a product of binomial probabilities. ) Ian Barnett JSM 2014 August 5, 2014 11 / 20

Type I error rates Table: Empirical type I error rate percentages from 10 6 simulations for the higher criticism using the proposed analytic method over a range of significance levels, α. Asymptotic type I error rate percentages are provided for comparison: exact(asymptotic). p α 2 10 50 5 0 4 98(5 95) 4 98(3 24) 5 01(1 84) 1 0 1 00(2 24) 9 92 10 1 (7 31 10 1 ) 1 01(1 59 10 1 ) 1 0 10 1 9 97 10 2 (2 05 10 1 ) 1 01 10 1 (6 03 10 2 ) 9 75 10 2 (4 90 10 3 ) 1 0 10 2 1 01 10 2 (5 32 10 2 ) 1 12 10 2 (7 30 10 3 ) 9 80 10 3 (4 00 10 4 ) Ian Barnett JSM 2014 August 5, 2014 12 / 20

Decorrelating dampens true signals Cancer Genetic Markers of Susceptibility (CGEM) Breast Cancer GWAS: FGFR2 gene Frequency 0 2 4 6 8 Z Frequency 0 2 4 6 8 Z * 4 2 0 2 4 Decorrelating causes ihc to lose power. Marginal test statistics Ian Barnett JSM 2014 August 5, 2014 13 / 20

Comparison Method MinP SKAT ihc Robust to signal sparsity Robust to correlation/ld structure Computationally efficient Does not require decorrelating test statistics Ian Barnett JSM 2014 August 5, 2014 14 / 20

Comparison Method MinP SKAT ihc GHC Robust to signal sparsity Robust to correlation/ld structure Computationally efficient Doesn t require decorrelating test statistics *We will also consider the omnibus test, OMNI, in our power simulations. It is based on the minimum p-value of the SKAT, MinP, and GHC. Ian Barnett JSM 2014 August 5, 2014 15 / 20

Our contribution: the generalized higher critcism (GHC) Recall S(t) = p j=1 1 { Zj t} Now we allow Σ to have arbitrary correlation structure. S(t) is no longer binomial. Instead we approximate with Beta-binomial, matching on first two moments. The Generalized Higher Criticism test statistic is: S(t) 2p Φ(t) GHC = sup t>0 Var(S(t)) GHC achieves the same as detection boundary as HC. Ian Barnett JSM 2014 August 5, 2014 16 / 20

The variance estimator Var(S(t)) Theorem 1 Let r n = 2 p(1 p) 1 k<l p (Σ kl) n and let H i (t) be the Hermite polynomials: H 0 (t) = 1, H 1 (t) = t, H 2 (t) = t 2 1 and so on. Then ( ) Cov S(t k ), S(t j ) = p[2 Φ(max{t j, t k }) 4 Φ(t j ) Φ(t k )] +4p(p 1)φ(t j )φ(t k ) H 2i 1 (t j )H 2i 1 (t k )r 2i i=1 (2i)! Proof follows from Schwartzman and Lin (2009) where they showed: P(Z k > t i, Z l > t j ) = Φ(t i ) Φ(t j ) + φ(t i )φ(t j ) n=1 Σ n kl n! H n 1(t i )H n 1 (t j ) Ian Barnett JSM 2014 August 5, 2014 17 / 20

Analytic p-values for the GHC Letting h be the observed GHC statistic: S(t) 2p p-value = pr sup Φ(t) t>0 Var(S(t)) h There exists 0 < t 1 < < t p, such that ( p ) p-value = 1 pr {S(t k ) p k} k=1 Ian Barnett JSM 2014 August 5, 2014 18 / 20

Power simulations ρ 1=0.4; ρ 3=0; p=40; 2 causal ρ 1=0.4; ρ 3=0; p=40; 4 causal power 0.0 0.2 0.4 0.6 0.8 1.0 0.00 0.05 0.10 0.15 ρ 2 GHC MinP SKAT ihc OMNI power 0.0 0.2 0.4 0.6 0.8 1.0 0.00 0.05 0.10 0.15 ρ 2 GHC MinP SKAT ihc OMNI ρ 1=0.4; ρ 3=0.4; p=40; 2 causal ρ 1=0.4; ρ 3=0.4; p=40; 4 causal power 0.0 0.2 0.4 0.6 0.8 1.0 GHC MinP SKAT ihc OMNI 0.0 0.1 0.2 0.3 0.4 0.5 0.6 ρ 2 power 0.0 0.2 0.4 0.6 0.8 1.0 ρ 1 : correlation within causal variants. ρ 2 : correlation between causal and noncausal variants. ρ 3 : correlation within non-causal variants. 0.0 0.1 0.2 0.3 0.4 0.5 ρ 2 Ian Barnett JSM 2014 August 5, 2014 19 / 20 GHC MinP SKAT ihc OMNI

Data analysis The National Cancer Institute s Cancer Genetic Markers of Susceptibility (CGEM) breast cancer GWAS. Sample has 1145 cases, 1142 controls with european ancestry. 5 4 FGFR2 PTCD3 POLR1A Observed log 10(pvalue) 3 2 1 0 0 1 2 3 4 5 Expected log 10(pvalue) Ian Barnett JSM 2014 August 5, 2014 20 / 20