The Generalized Higher Criticism for Testing SNP-sets in Genetic Association Studies


Ian Barnett, Rajarshi Mukherjee & Xihong Lin
Harvard University
ibarnett@hsph.harvard.edu
BIRS, June 24, 2014

Background

Genome-wide association studies (GWAS): millions of common (minor allele frequency > 0.05) SNPs are genotyped.
Gene-level/pathway-level analysis can provide power to detect these types of effects by combining information over the SNPs.
Goal: develop powerful, computationally efficient statistical methodology for SNP-sets that can detect joint SNP effects.

Model

n subjects, q covariates, p genetic variants.
$Y_i$ is the phenotype for the $i$th individual.
$X_i$ contains the q covariates for the $i$th individual.
$G_i$ contains SNP information (minor allele counts) in a gene/pathway/SNP-set for the $i$th individual.
$\alpha$ and $\beta$ contain regression coefficients.

\[ \mu_i = E(Y_i \mid G_i, X_i), \qquad h(\mu_i) = X_i \alpha + G_i \beta, \]

where $h(\cdot)$ is the link function.

Marginal SNP test statistics

The marginal score test statistic for the $j$th variant is

\[ Z_j = G_j^T (Y - \hat{\mu}_0), \]

where $\hat{\mu}_0$ is the MLE of $E(Y \mid H_0)$. Assume $Z_j$ is normalized.
Letting $U U^T = \widehat{\mathrm{Cov}}(Z) = \hat{\Sigma}$, define the transformed (decorrelated) test statistics:

\[ Z^* = U^{-1} Z \xrightarrow{\;L\;} \mathrm{MVN}(0, I_p) \quad \text{as } n \to \infty. \]
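The two steps on this slide can be sketched numerically. The sketch below is illustrative only and rests on assumptions not stated in the slides: an identity link with Gaussian errors (so the null fit is ordinary least squares), and the Cholesky factor as one possible choice of $U$. The function names `marginal_score_stats` and `decorrelate` are hypothetical.

```python
import numpy as np

def marginal_score_stats(Y, X, G):
    """Normalized marginal score statistics Z and their null correlation.

    Assumption (not from the slides): identity link with Gaussian errors,
    so the null fit is ordinary least squares of Y on the covariates.
    """
    n = Y.shape[0]
    X1 = np.column_stack([np.ones(n), X])            # covariates plus intercept
    alpha_hat, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    resid = Y - X1 @ alpha_hat                       # Y - mu0_hat
    sigma2 = resid @ resid / (n - X1.shape[1])       # residual variance under H0
    # Remove the covariate component of each variant; this leaves
    # G_j^T (Y - mu0_hat) unchanged and gives its null variance.
    coef, *_ = np.linalg.lstsq(X1, G, rcond=None)
    G_adj = G - X1 @ coef
    norms = np.sqrt(np.sum(G_adj ** 2, axis=0))
    Z = (G_adj.T @ resid) / (np.sqrt(sigma2) * norms)
    Sigma = (G_adj.T @ G_adj) / np.outer(norms, norms)
    return Z, Sigma

def decorrelate(Z, Sigma):
    """Return Z* = U^{-1} Z, with U the lower Cholesky factor (Sigma = U U^T)."""
    U = np.linalg.cholesky(Sigma)                    # fails if Sigma is singular
    return np.linalg.solve(U, Z)
```

Variants in perfect LD make the estimated correlation matrix singular, in which case the Cholesky step fails and some pruning or regularization would be needed first.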

Current popular methods

SKAT
Test statistic: $\sum_{j=1}^{p} Z_j^2$
Pros: High power when signal sparsity is low. Accurate p-values can be obtained quickly.
Cons: Can have very low power when sparsity is high.

MinP
Test statistic: $\max_j \{ |Z_j| \}$
Pros: High power when signal sparsity is high.
Cons: Slightly lower power when sparsity is low. Difficult to obtain accurate analytic p-values.

The higher criticism

Let

\[ S(t) = \sum_{j=1}^{p} 1\{ |Z_j| \ge t \}. \]

Assumes $\Sigma = I_p$.
Under $H_0$, $S(t) \sim \mathrm{Binomial}(p, 2\bar{\Phi}(t))$, where $\bar{\Phi}(t) = 1 - \Phi(t)$ is the survival function of the normal distribution.
The Higher Criticism test statistic is:

\[ HC = \sup_{t > 0} \left\{ \frac{S(t) - 2p\bar{\Phi}(t)}{\sqrt{2p\bar{\Phi}(t)\,(1 - 2\bar{\Phi}(t))}} \right\} \]
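As a rough illustration of the HC formula, the sketch below evaluates the standardized count at each observed $|Z_j|$ and takes the maximum. Because $S(t)$ only changes at the observed $|Z_j|$ and the ratio increases with $t$ between jumps, searching over these points is enough. The function name `higher_criticism` is hypothetical, and the code assumes every $|Z_j|$ is nonzero and not so large that the tail probability underflows.

```python
import numpy as np
from scipy.stats import norm

def higher_criticism(Z):
    """HC statistic: max over observed |Z_j| of the standardized exceedance count."""
    a = np.sort(np.abs(np.asarray(Z, dtype=float)))
    p = a.size
    # S(t) at t = a[k]: number of |Z_j| >= a[k] (side="left" handles ties)
    S = p - np.searchsorted(a, a, side="left")
    pi = 2.0 * norm.sf(a)                    # null probability that |Z_j| >= t
    ok = (pi > 0.0) & (pi < 1.0)             # drop degenerate thresholds
    hc_t = (S[ok] - p * pi[ok]) / np.sqrt(p * pi[ok] * (1.0 - pi[ok]))
    return float(np.max(hc_t))
```

For example, `higher_criticism([0.3, -1.2, 2.5, -3.1, 0.8])` is driven by the largest statistic, since a single exceedance at 3.1 is far more than the binomial null expects among five tests.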

The higher criticism

[Figure: histogram of the $Z_i$ (vertical axis: density), with the threshold $t = \operatorname{argmax}_t \{HC(t)\}$ marked.]

Adjusting for correlation

Recalling that $Z^* = U^{-1} Z$, let

\[ S^*(t) = \sum_{j=1}^{p} 1\{ |Z^*_j| \ge t \}. \]

Note that under $H_0$, $S^*(t) \sim \mathrm{Binomial}(p, 2\bar{\Phi}(t))$ for a general correlated $\Sigma$.
The innovated Higher Criticism test statistic is:

\[ iHC = \sup_{t > 0} \left\{ \frac{S^*(t) - 2p\bar{\Phi}(t)}{\sqrt{2p\bar{\Phi}(t)\,(1 - 2\bar{\Phi}(t))}} \right\} \]
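A minimal sketch of the iHC computation follows: decorrelate, then apply the ordinary HC formula to the transformed statistics. The Cholesky factor is used as one possible $U$ with $U U^T = \Sigma$ (an assumption; the slides do not say which factorization is used), and the helper names `hc` and `innovated_hc` are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def hc(Z):
    """Ordinary HC statistic, evaluated at the observed |Z_j|."""
    a = np.sort(np.abs(np.asarray(Z, dtype=float)))
    p = a.size
    S = p - np.searchsorted(a, a, side="left")   # count of |Z_j| >= each threshold
    pi = 2.0 * norm.sf(a)
    return float(np.max((S - p * pi) / np.sqrt(p * pi * (1.0 - pi))))

def innovated_hc(Z, Sigma):
    """iHC: HC applied to Z* = U^{-1} Z, with Sigma = U U^T (Cholesky here)."""
    U = np.linalg.cholesky(Sigma)
    Z_star = np.linalg.solve(U, Z)
    return hc(Z_star)
```

When $\Sigma = I_p$ the transformation is the identity, so `innovated_hc` and `hc` agree; under correlation each $Z^*_j$ is a linear combination of several $Z_j$, which is how a sparse signal can be diluted.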

Adjusting for correlation

Cancer Genetic Markers of Susceptibility (CGEM) Breast Cancer GWAS: FGFR2 gene.

[Figure: frequency histograms of the marginal test statistics $Z$ and of the decorrelated statistics $Z^*$, both on a horizontal axis from -4 to 4.]

Decorrelating causes iHC to lose power.

Comparison

Methods: MinP, SKAT, iHC.
Criteria:
Robust to signal sparsity
Robust to correlation/LD structure
Computationally efficient
Does not require decorrelating test statistics

[Table: the per-method entries for each criterion are not preserved in the transcript.]

Comparison

Methods: MinP, SKAT, iHC, GHC.
Criteria:
Robust to signal sparsity
Robust to correlation/LD structure
Computationally efficient
Does not require decorrelating test statistics

[Table: the per-method entries for each criterion are not preserved in the transcript.]

*We will also consider the omnibus test, OMNI, in our power simulations. It is based on the minimum p-value of SKAT, MinP, and GHC.

Our contribution: the generalized higher criticism (GHC)

Recall

\[ S(t) = \sum_{j=1}^{p} 1\{ |Z_j| \ge t \}. \]

Now we allow $\Sigma$ to have arbitrary correlation structure.
$S(t)$ is no longer binomial. Instead we approximate it with a Beta-binomial distribution, matching on the first two moments.
The Generalized Higher Criticism test statistic is:

\[ GHC = \sup_{t > 0} \frac{S(t) - 2p\bar{\Phi}(t)}{\sqrt{\widehat{\mathrm{Var}}(S(t))}} \]

The variance estimator $\widehat{\mathrm{Var}}(S(t))$

Theorem 1. Let

\[ r_n = \frac{2}{p(p-1)} \sum_{1 \le k < l \le p} (\Sigma_{kl})^n \]

and let $H_i(t)$ be the Hermite polynomials: $H_0(t) = 1$, $H_1(t) = t$, $H_2(t) = t^2 - 1$, and so on. Then

\[ \mathrm{Cov}\big(S(t_k), S(t_j)\big) = p\,\big[ 2\bar{\Phi}(\max\{t_j, t_k\}) - 4\bar{\Phi}(t_j)\bar{\Phi}(t_k) \big] + 4p(p-1)\,\phi(t_j)\phi(t_k) \sum_{i=1}^{\infty} \frac{H_{2i-1}(t_j)\, H_{2i-1}(t_k)\, r_{2i}}{(2i)!} \]

The proof follows from Schwartzman and Lin (2009), who showed:

\[ P(Z_k > t_i, Z_l > t_j) = \bar{\Phi}(t_i)\bar{\Phi}(t_j) + \phi(t_i)\phi(t_j) \sum_{n=1}^{\infty} \frac{\Sigma_{kl}^{\,n}}{n!}\, H_{n-1}(t_i)\, H_{n-1}(t_j) \]
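The variance formula of Theorem 1 (with $t_j = t_k = t$) and the resulting GHC statistic can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: the infinite series is truncated at `n_terms` (a hypothetical parameter), the supremum is searched only over the observed $|Z_j|$, and $p \ge 2$ is required. The function names are hypothetical; `hermeval` from NumPy evaluates the probabilists' Hermite polynomials, which match $H_0 = 1$, $H_1 = t$, $H_2 = t^2 - 1$.

```python
import numpy as np
from math import factorial
from numpy.polynomial.hermite_e import hermeval
from scipy.stats import norm

def var_S(t, Sigma, n_terms=20):
    """Var(S(t)) from Theorem 1 with t_j = t_k = t, series truncated at n_terms."""
    p = Sigma.shape[0]                            # requires p >= 2
    off = Sigma[np.triu_indices(p, k=1)]          # Sigma_kl for k < l
    pi = 2.0 * norm.sf(t)
    independent_part = p * pi * (1.0 - pi)        # p[2*Phibar - 4*Phibar^2]
    series = 0.0
    for i in range(1, n_terms + 1):
        r_2i = 2.0 / (p * (p - 1)) * np.sum(off ** (2 * i))
        H = hermeval(t, [0] * (2 * i - 1) + [1])  # H_{2i-1}(t)
        series += H ** 2 * r_2i / factorial(2 * i)
    return independent_part + 4.0 * p * (p - 1) * norm.pdf(t) ** 2 * series

def ghc_statistic(Z, Sigma, n_terms=20):
    """GHC evaluated at the observed |Z_j| (a simplification of sup over t > 0)."""
    a = np.sort(np.abs(np.asarray(Z, dtype=float)))
    p = a.size
    S = p - np.searchsorted(a, a, side="left")    # count of |Z_j| >= threshold
    vals = [(s - 2.0 * p * norm.sf(t)) / np.sqrt(var_S(t, Sigma, n_terms))
            for s, t in zip(S, a)]
    return float(max(vals))
```

With $\Sigma = I_p$ every $r_{2i}$ is zero, the variance reduces to the binomial value $2p\bar{\Phi}(t)(1 - 2\bar{\Phi}(t))$, and the GHC coincides with the HC.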

Analytic p-values for the GHC

Letting $h$ be the observed GHC statistic:

\[ \text{p-value} = \mathrm{pr}\left( \sup_{t > 0} \frac{S(t) - 2p\bar{\Phi}(t)}{\sqrt{\widehat{\mathrm{Var}}(S(t))}} \ge h \right) \]

There exist $0 < t_1 < \dots < t_p$ such that

\[ \text{p-value} = 1 - \mathrm{pr}\left( \bigcap_{k=1}^{p} \{ S(t_k) \le p - k \} \right) \]

Power simulations

[Figure: four panels plotting power (0 to 1) against $\rho_2$ for GHC, MinP, SKAT, iHC and OMNI.
Panel 1: $\rho_1 = 0.4$; $\rho_3 = 0$; p = 40; 2 causal ($\rho_2$ from 0 to 0.15).
Panel 2: $\rho_1 = 0.4$; $\rho_3 = 0$; p = 40; 4 causal ($\rho_2$ from 0 to 0.15).
Panel 3: $\rho_1 = 0.4$; $\rho_3 = 0.4$; p = 40; 2 causal ($\rho_2$ from 0 to 0.6).
Panel 4: $\rho_1 = 0.4$; $\rho_3 = 0.4$; p = 40; 4 causal ($\rho_2$ from 0 to 0.5).]

$\rho_1$: correlation within causal variants.
$\rho_2$: correlation between causal and non-causal variants.
$\rho_3$: correlation within non-causal variants.

Data analysis

The National Cancer Institute's Cancer Genetic Markers of Susceptibility (CGEM) breast cancer GWAS.
The sample has 1145 cases and 1142 controls with European ancestry.

[Figure: QQ plot of observed against expected $-\log_{10}$(p-value), both axes from 0 to 5, with the genes FGFR2, PTCD3 and POLR1A labeled.]

Next (and final) step

Thresholding tests (GHC and MinP) and summing tests (SKAT) are good complements.
A more principled way (than OMNI) to combine these classes of tests is to use the following test statistic:

\[ \sup_{\gamma, t} \sum_{j=1}^{p} |Z_j|^{\gamma}\, I\{ |Z_j| > t \} \]

Special cases: $\gamma = 2$, $t = 0$ gives SKAT; $\gamma = 0$ gives GHC.
We label this test OPT.
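To make the two special cases concrete, the sketch below computes the raw sum $\sum_j |Z_j|^{\gamma} I\{|Z_j| > t\}$ over a grid of $(\gamma, t)$ pairs. It deliberately stops short of the supremum: the slides do not say how the components are standardized before they are compared, so no such step is assumed here. The function names `opt_component` and `opt_grid` are hypothetical.

```python
import numpy as np

def opt_component(Z, gamma, t):
    """Raw sum_j |Z_j|^gamma * I{|Z_j| > t} for a single (gamma, t) pair."""
    a = np.abs(np.asarray(Z, dtype=float))
    keep = a > t                                  # thresholding indicator
    return float(np.sum(a[keep] ** gamma))

def opt_grid(Z, gammas, ts):
    """Raw components on a grid; how to standardize them is not specified here."""
    return {(g, t): opt_component(Z, g, t) for g in gammas for t in ts}

# gamma = 2, t = 0 reproduces the SKAT-type sum of squares;
# gamma = 0 reproduces the exceedance count used by the GHC (with strict >).
Z_demo = np.array([0.5, -2.0, 3.0])
skat_like = opt_component(Z_demo, gamma=2, t=0.0)
count_like = opt_component(Z_demo, gamma=0, t=1.0)
```

Here `skat_like` is the sum of squared statistics and `count_like` is the number of statistics exceeding 1 in absolute value.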

Simulations

p = 20, exchangeable correlation $\rho$.
For OPT, the supremum is selected from $t \in \{0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4\}$ and $\gamma \in \{0, 0.5, 1, 1.5, 2\}$.
Non-zero $\beta$ decrease with $\rho$.

[Figure: two panels plotting power (0 to 1) against $\rho$ (0 to 0.4) for SKAT, MinP, GHC and OPT.
Panel 1: p = 20; k = 1; exchangeable correlation $\rho$.
Panel 2: p = 20; k = 4; exchangeable correlation $\rho$.]