Efficient Bayesian mixed model analysis increases association power in large cohorts

Po-Ru Loh, Harvard School of Public Health
ASHG 2014, 10/20/14

Mixed model association is hot
Linear mixed models (LMMs) for GWAS have attracted intense research interest in the past few years.
High-profile computational speedups:
  EMMAX (Kang et al. 2010 Nat Gen)
  P3D/TASSEL (Zhang et al. 2010 Nat Gen)
  FaST-LMM (Lippert et al. 2011 Nat Meth)
  GEMMA (Zhou & Stephens 2012 Nat Gen)
  GRAMMAR-Gamma (Svishcheva et al. 2012 Nat Gen)
Extensions to standard LMM:
  Multiple loci (Segura et al. 2012 Nat Gen)
  Multiple traits (Korte et al. 2012 Nat Gen)
  Sparse modeling (Listgarten et al. 2013 Nat Gen)
  GCTA-LOCO (Yang et al. 2014 Nat Gen)
See also: ASHG 2014 talk 169 (Fusi) and poster 1458S (Heckerman)

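Mixed model association corrects for confounding and increases power
Methods compared: linear regression; existing mixed model methods
Notation: M = # SNPs, N = # samples
  Time: $O(MN)$ for linear regression; $O(MN^2)$ for existing mixed model methods
  Corrects for confounding?: [graphical marks]
  Power: [graphical marks]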

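New mixed model method (BOLT-LMM) increases speed, further increases power
Methods compared: linear regression; existing mixed model methods; new method: BOLT-LMM
  Time: $O(MN)$ for linear regression; $O(MN^2)$ for existing mixed model methods; $O(MN^{1.5})$ for BOLT-LMM
  Corrects for confounding?: [graphical marks]
  Power: [graphical marks]
  Models non-infinitesimal genetic architectures: [graphical marks]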

BOLT-LMM's iterative algorithm increases speed; non-infinitesimal model increases power
Speed: New retrospective statistic allows rapid computation via iterative methods
$$ \frac{(x_m^T V^{-1} y)^2}{(V^{-1} y)^T \Theta (V^{-1} y)} $$
Power: Flexible Bayesian prior on SNP effect sizes models genetic architectures with large-effect loci
  Infinitesimal model: normal
  Non-infinitesimal model: non-normal, heavier tails

BOLT-LMM requires far less time and memory than existing methods
[Figure, panel a: CPU time (hrs) vs. N = # of samples (7,500 to 480,000) for FaST-LMM-Select, GEMMA, EMMAX, GCTA-LOCO, and BOLT-LMM; reference slopes $N^2$ and $N^{1.5}$; annotation "100x CPU"]
[Figure, panel b: Memory (GB) vs. N = # of samples (7,500 to 480,000) for the same methods; reference slopes $N^2$ and $N^1$; annotation "10x RAM"]

BOLT-LMM can handle big data
Number of samples (N)   BOLT-LMM: time, memory   Existing methods: time, memory
60,000                  5 hours, 4.6 GB          ~2 weeks, >60 GB
480,000                 3 days, 35 GB            ~4 years, >3.5 TB
Benchmark data sets:
  Simulated genotypes generated from WTCCC2 samples
  M = 300K SNPs
  Simulated phenotypes with SNP heritability $h_g^2 = 0.2$
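The memory column above can be sanity-checked with simple arithmetic. The sketch below is only a rough estimate under two assumptions that are not stated on the slide: genotypes are stored at 2 bits each, and a method that builds the N x N genetic relationship matrix stores it as 8-byte floats. The function names are hypothetical.

```python
def genotype_memory_gb(n_samples, n_snps, bits_per_genotype=2):
    """Memory for an N x M genotype matrix (assumed 2 bits per genotype)."""
    return n_samples * n_snps * bits_per_genotype / 8 / 1e9


def grm_memory_gb(n_samples, bytes_per_entry=8):
    """Memory for a dense N x N relationship matrix (assumed 8-byte floats)."""
    return n_samples ** 2 * bytes_per_entry / 1e9


M = 300_000  # number of SNPs in the benchmark
for n in (60_000, 480_000):
    print(
        f"N={n}: genotypes ~{genotype_memory_gb(n, M):.1f} GB, "
        f"dense GRM ~{grm_memory_gb(n):.1f} GB"
    )
```

Under these assumptions, genotype storage grows linearly in N (an 8x increase in samples costs 8x the memory), while a dense N x N matrix grows quadratically (64x), which is the same pattern as the two memory columns of the table.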

BOLT-LMM increases power in simulations (while controlling false positives)
Increased $\chi^2$ stats at causal SNPs = increased effective sample size
[Figure: mean $\chi^2$ at standardized effect SNPs for BOLT-LMM, standard LMM, and PCA. Panel a: $h^2_{causal} = 0.5$, N = 15,633, varying $M_{causal}$ = # of causal SNPs. Panel b: $M_{causal} = 2500$, N = 15,633, varying $h^2_{causal}$ (0.3, 0.5, 0.7). Panel c: $M_{causal} = 2500$, $h^2_{causal} = 0.5$, varying N = # of samples (5,211 and 15,633)]
Data:
  Real genotypes from N = 15,633 WTCCC2 samples
  Simulated phenotypes with varying numbers of causal SNPs + environmental stratification

BOLT-LMM increases association power in WGHS phenotypes
[Figure, panel a: Increase in $\chi^2$ statistics at known loci vs. PCA (%), for standard LMM vs. PCA and BOLT-LMM vs. PCA]
[Figure, panel b: Prediction $R^2$ (%), for PCA, standard LMM, and BOLT-LMM]
Phenotypes: ApoB, LDL, HDL, Cholesterol, Triglycerides, Height, BMI, Systolic BP, Diastolic BP
Data: Women's Genome Health Study, N = 23,294 European-ancestry samples

How does BOLT-LMM increase power? Under the hood, Part 1

Mixed models increase power by conditioning on polygenic effects
Joint modeling reveals association!
Example: Is phenotype $y$ associated with $x_1$? Hard to tell.

Mixed models increase power by conditioning on polygenic effects
Joint modeling reveals association!
Example: Is phenotype $y$ associated with $x_1$? How about if we also have $x_2$?

Mixed models increase power by conditioning on polygenic effects
Joint modeling reveals association!
Example: Is phenotype $y$ associated with $x_1$? Hard to tell from $x_1$ alone, but knowledge of $x_2$ reveals the association of $x_1$ through the joint model.
Note: In the mixed model, $x_1$ (test SNP) is modeled as a fixed effect; $x_2$ is modeled as a random effect.
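The conditioning effect in this example can be illustrated with a small simulation. This is a simplified sketch rather than the mixed model itself: both $x_1$ and $x_2$ are fitted as fixed effects by ordinary least squares, and the effect sizes, noise level, and sample size are made-up values chosen for illustration.

```python
import numpy as np


def t_stat_first_column(X, y):
    """OLS t-statistic for the coefficient on X[:, 0] (inputs are centered)."""
    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = X.shape[0] - X.shape[1]
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[0] / np.sqrt(cov[0, 0])


rng = np.random.default_rng(0)
n = 2000
x1 = rng.standard_normal(n)
x2 = rng.standard_normal(n)
# Hypothetical phenotype: small effect of x1, large effect of x2, some noise.
y = 0.1 * x1 + 1.0 * x2 + 0.2 * rng.standard_normal(n)

# Center everything so that no intercept column is needed.
x1, x2, y = x1 - x1.mean(), x2 - x2.mean(), y - y.mean()

t_marginal = t_stat_first_column(x1[:, None], y)
t_joint = t_stat_first_column(np.column_stack([x1, x2]), y)
print(f"t for x1 alone: {t_marginal:.2f}; t for x1 given x2: {t_joint:.2f}")
```

Conditioning on $x_2$ removes most of the residual variance, so the standard error of the $x_1$ coefficient shrinks and its test statistic grows even though the effect of $x_1$ is unchanged.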

Better joint model of phenotype => better conditioning => more power
Infinitesimal model (normal)
  Standard mixed model: all SNPs causal with normally distributed effect sizes: $\beta_m \sim N(0, h_g^2 / M)$
Non-infinitesimal model (non-normal: heavier tails)
  Reality: Only a small fraction of SNPs causal with larger effects
  BOLT-LMM: SNP effect sizes modeled with mixture of two Gaussians
See also: ASHG 2014 poster 1767S (Vilhjalmsson)
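To make the contrast between the two priors concrete, the following sketch draws effect sizes from a single Gaussian and from a mixture of two Gaussians with the same total variance, then compares tail heaviness via excess kurtosis. The mixture parameters (`p_large`, `var_ratio`) are hypothetical illustration values, not the parametrization or settings used by BOLT-LMM.

```python
import numpy as np


def sample_infinitesimal(rng, m, h2):
    """Effects beta_m ~ N(0, h2 / m) for all m SNPs."""
    return rng.normal(0.0, np.sqrt(h2 / m), size=m)


def sample_two_gaussian_mixture(rng, m, h2, p_large, var_ratio):
    """Effects from a two-Gaussian mixture with per-SNP variance h2 / m.

    A fraction p_large of SNPs comes from the wide component, whose
    variance is var_ratio times that of the narrow component.
    """
    # Solve p * var_large + (1 - p) * var_small = h2 / m.
    var_small = (h2 / m) / (p_large * var_ratio + (1.0 - p_large))
    var_large = var_ratio * var_small
    is_large = rng.random(m) < p_large
    sd = np.where(is_large, np.sqrt(var_large), np.sqrt(var_small))
    return rng.standard_normal(m) * sd


def excess_kurtosis(x):
    """Sample excess kurtosis (0 for a normal distribution)."""
    x = x - x.mean()
    return np.mean(x ** 4) / np.mean(x ** 2) ** 2 - 3.0


rng = np.random.default_rng(1)
m, h2 = 100_000, 0.5
beta_inf = sample_infinitesimal(rng, m, h2)
beta_mix = sample_two_gaussian_mixture(rng, m, h2, p_large=0.02, var_ratio=100.0)
print("excess kurtosis, infinitesimal:", round(excess_kurtosis(beta_inf), 2))
print("excess kurtosis, mixture:      ", round(excess_kurtosis(beta_mix), 2))
```

The single Gaussian has excess kurtosis near zero, while the mixture concentrates much of the heritability in a few large-effect SNPs and so has far heavier tails.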

How does BOLT-LMM increase speed? (And what exactly does it compute?) Under the hood, Part 2

Idea 1: New retrospective statistic + algorithm increases speed (for infinitesimal LMM)
$$ \frac{(x_m^T V^{-1} y)^2}{(V^{-1} y)^T \Theta (V^{-1} y)} $$
Retrospective mixed model association statistic
  Compute quasi-likelihood score test similar to MASTOR (Jakobsdottir & McPeek 2013 AJHG)
  Calibrate using approach similar to GRAMMAR-Gamma (Svishcheva et al. 2012 NG)
Fast iterative algorithm
  Replace expensive eigendecomposition with linear system solving (Legarra & Misztal 2008, VanRaden 2008 J Dairy Sci)
  No need to compute genetic relationship matrix (GRM) => save time, RAM
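The idea of replacing the eigendecomposition with linear system solving can be sketched as follows. This is a minimal illustration, not the BOLT-LMM implementation: it assumes the covariance has the form $V = \sigma_g^2 X X^T / M + \sigma_e^2 I$ with known variance parameters, uses a textbook conjugate gradient solver, and computes only the uncalibrated numerators $(x_m^T V^{-1} y)^2$. All function names and parameter values are hypothetical.

```python
import numpy as np


def apply_V(X, v, sigma2_g, sigma2_e):
    """Return V v for V = sigma2_g * X X^T / M + sigma2_e * I.

    Uses two matrix-vector products, so the N x N matrix is never formed.
    """
    M = X.shape[1]
    return sigma2_g * (X @ (X.T @ v)) / M + sigma2_e * v


def conjugate_gradient_solve(X, y, sigma2_g, sigma2_e, tol=1e-10, max_iter=1000):
    """Solve V a = y by conjugate gradient; each iteration costs O(MN)."""
    a = np.zeros_like(y, dtype=float)
    r = y.astype(float)  # residual y - V a at a = 0 (astype makes a copy)
    p = r.copy()
    rs_old = r @ r
    y_norm = np.linalg.norm(y)
    for _ in range(max_iter):
        Vp = apply_V(X, p, sigma2_g, sigma2_e)
        alpha = rs_old / (p @ Vp)
        a += alpha * p
        r -= alpha * Vp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol * y_norm:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return a


rng = np.random.default_rng(2)
N, M = 200, 1000
X = rng.standard_normal((N, M))  # stand-in for normalized genotypes
y = rng.standard_normal(N)       # stand-in for a phenotype
sigma2_g, sigma2_e = 0.2, 0.8    # assumed known here

Vinv_y = conjugate_gradient_solve(X, y, sigma2_g, sigma2_e)
numerators = (X.T @ Vinv_y) ** 2  # (x_m^T V^{-1} y)^2 for every SNP m
print(numerators[:5])
```

Because every step is a product with `X` or `X.T`, memory stays proportional to the genotype matrix, and the total cost is the number of iterations times $O(MN)$.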

Idea 2: Bayesian extension of linear mixed model (non-inf. model) increases power
Retrospective mixed model association statistic:
$$ \frac{(x_m^T V^{-1} y)^2}{(V^{-1} y)^T \Theta (V^{-1} y)} $$
Retrospective mixed model association statistic using Bayesian prior:
$$ \frac{(x_m^T y_{resid})^2}{y_{resid}^T \Theta \, y_{resid}} $$
  Compute fast variational approximation of posteriors (Meuwissen et al. 2009, Logsdon et al. 2010, Carbonetto & Stephens 2012)
  Calibrate using LD score regression (Bulik-Sullivan et al. bioRxiv; under revision, Nat Genet)
See also: ASHG 2014 talk 351 (Finucane) and poster 1787T (Bulik-Sullivan)

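New mixed model method (BOLT-LMM) increases speed, further increases power
Methods compared: linear regression; existing mixed model methods; new method: BOLT-LMM
  Time: $O(MN)$ for linear regression; $O(MN^2)$ for existing mixed model methods; $O(MN^{1.5})$ for BOLT-LMM
  Corrects for confounding?: [graphical marks]
  Power: [graphical marks]
  Models non-infinitesimal genetic architectures: [graphical marks]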

Limitations and future directions
Limitations
  Loss of power from case-control ascertainment (ASHG 2014 poster 1746S, Hayeck)
  Potential breakdown of approximations in family studies, rare variant tests, non-human data sets
Future directions
  Fast heritability estimation
  Multiple variance components
  Multiple phenotypes

Acknowledgments
Alkes Price, Nick Patterson, Bjarni Vilhjalmsson, Hilary Finucane, Sasha Gusev, George Tucker, Bonnie Berger, Daniel Chasman, Brendan Bulik-Sullivan, Benjamin Neale, Rany Salem
BOLT-LMM software: http://hsph.harvard.edu/alkes-price/software/ (v1.1 handles imputed SNPs)
Google: BOLT-LMM
Loh et al. bioRxiv (under revision, Nat Genet)