Efficient Bayesian mixed model analysis increases association power in large cohorts
  10/20/14: ASHG 2014
  Po-Ru Loh
  Harvard School of Public Health

                              Linear regression   Existing mixed model methods   New method: BOLT-LMM
  Time                        $O(MN)$             $O(MN^2)$                      $O(MN^{1.5})$
  Corrects for confounding?   No                  Yes                            Yes
  Power                       Baseline            Increased                      Further increased
Mixed model association is hot
  Linear mixed models (LMMs) for GWAS have attracted intense research interest in the past few years
  High-profile computational speedups:
    EMMAX (Kang et al. 2010 Nat Gen)
    P3D/TASSEL (Zhang et al. 2010 Nat Gen)
    FaST-LMM (Lippert et al. 2011 Nat Meth)
    GEMMA (Zhou & Stephens 2012 Nat Gen)
    GRAMMAR-Gamma (Svishcheva et al. 2012 Nat Gen)
  Extensions to standard LMM:
    Multiple loci (Segura et al. 2012 Nat Gen)
    Multiple traits (Korte et al. 2012 Nat Gen)
    Sparse modeling (Listgarten et al. 2013 Nat Gen)
    GCTA-LOCO (Yang et al. 2014 Nat Gen)
  See also: ASHG 2014 talk 169 (Fusi) and poster 1458S (Heckerman)
Mixed model association corrects for confounding and increases power
                              Linear regression   Existing mixed model methods
  Time                        $O(MN)$             $O(MN^2)$
  Corrects for confounding?   No                  Yes
  Power                       Baseline            Increased
  M = # SNPs, N = # samples
New mixed model method (BOLT-LMM) increases speed, further increases power
                              Linear regression   Existing mixed model methods   New method: BOLT-LMM
  Time                        $O(MN)$             $O(MN^2)$                      $O(MN^{1.5})$
  Corrects for confounding?   No                  Yes                            Yes
  Power                       Baseline            Increased                      Further increased
  BOLT-LMM models non-infinitesimal genetic architectures
BOLT-LMM's iterative algorithm increases speed; non-infinitesimal model increases power
  Speed: New retrospective statistic allows rapid computation via iterative methods
    $\frac{(x_m^T V^{-1} y)^2}{(V^{-1} y)^T \Theta (V^{-1} y)}$
  Power: Flexible Bayesian prior on SNP effect sizes models genetic architectures with large-effect loci
    Infinitesimal model: normal
    Non-infinitesimal model: non-normal, heavier tails
BOLT-LMM requires far less time and memory than existing methods
  [Figure: log-log plots against N = # of samples (7,500 to 480,000) for FaST-LMM-Select, GEMMA, EMMAX, GCTA-LOCO, and BOLT-LMM]
  (a) CPU time (hrs), with reference slopes $N^2$ and $N^{1.5}$; annotation: 100x CPU
  (b) Memory (GB), with reference slopes $N^2$ and $N^1$; annotation: 10x RAM
BOLT-LMM can handle big data
  Number of samples (N)   BOLT-LMM: time, memory   Existing methods: time, memory
  60,000                  5 hours, 4.6 GB          ~2 weeks, >60 GB
  480,000                 3 days, 35 GB            ~4 years, >3.5 TB
  Benchmark data sets:
    Simulated genotypes generated from WTCCC2 samples
    M = 300K SNPs
    Simulated phenotypes with $h_g^2 = 0.2$ SNP heritability
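As a rough check on the memory column, the sketch below assumes that genotypes are stored at 2 bits each (about MN/4 bytes) and that a GRM-based method holds at least one dense N x N matrix of 8-byte floats. Both storage layouts are assumptions made for illustration, not details of the benchmark, and the function names are hypothetical.

```python
# Back-of-envelope memory estimates in decimal GB.
# Both storage layouts below are assumptions, not details taken from the slides.
M = 300_000  # SNPs, as in the benchmark data sets

def packed_genotype_gb(n_samples, n_snps=M):
    """Memory for an N x M genotype matrix at 2 bits (1/4 byte) per genotype."""
    return n_samples * n_snps / 4 / 1e9

def dense_grm_gb(n_samples):
    """Memory for one dense N x N matrix of 8-byte floats."""
    return n_samples ** 2 * 8 / 1e9

for n in (60_000, 480_000):
    print(f"N={n}: packed genotypes {packed_genotype_gb(n):.1f} GB, "
          f"one dense N x N matrix {dense_grm_gb(n):.1f} GB")
```

Under these assumptions the packed genotypes come to 4.5 GB and 36 GB, close to the 4.6 GB and 35 GB listed for BOLT-LMM. A single dense matrix (28.8 GB and about 1.8 TB) is roughly half of the >60 GB and >3.5 TB listed for existing methods, so it should be read as an order-of-magnitude illustration only.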
BOLT-LMM increases power in simulations (while controlling false positives)
  Increased $\chi^2$ stats at causal SNPs = increased effective sample size
  [Figure: mean standardized effect SNP $\chi^2$ for BOLT-LMM, standard LMM, and PCA]
  (a) $h^2_{causal} = 0.5$, N = 15633; x-axis: $M_{causal}$ = # of causal SNPs (2500, 5000, 10000)
  (b) $M_{causal} = 2500$, N = 15633; x-axis: $h^2_{causal}$ (0.3, 0.5, 0.7)
  (c) $M_{causal} = 2500$, $h^2_{causal} = 0.5$; x-axis: N = # of samples (5211, 15633)
  Data: Real genotypes from N=15,633 WTCCC2 samples
  Simulated phenotypes with varying numbers of causal SNPs + environmental stratification
BOLT-LMM increases association power in WGHS phenotypes
  [Figure: results per phenotype: ApoB, LDL, HDL, Cholesterol, Triglycerides, Height, BMI, Systolic BP, Diastolic BP]
  (a) Increase in $\chi^2$ statistics at known loci vs. PCA (%): standard LMM vs. PCA, BOLT-LMM vs. PCA
  (b) Prediction $R^2$ (%): PCA, standard LMM, BOLT-LMM
  Data: Women's Genome Health Study, N=23,294 European-ancestry samples
How does BOLT-LMM increase power? Under the hood, Part 1
Mixed models increase power by conditioning on polygenic effects
  Joint modeling reveals association!
  Example: Is phenotype y associated with $x_1$?
    Hard to tell
Mixed models increase power by conditioning on polygenic effects
  Joint modeling reveals association!
  Example: Is phenotype y associated with $x_1$?
    How about if we also have $x_2$?
Mixed models increase power by conditioning on polygenic effects
  Joint modeling reveals association!
  Example: Is phenotype y associated with $x_1$?
    Hard to tell
    Knowledge of $x_2$ reveals association of $x_1$ through joint model
  Note: In mixed model, $x_1$ (test SNP) is modeled as fixed effect; $x_2$ is modeled as random effect
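The effect of conditioning can be illustrated with a small simulation. This is a minimal sketch, assuming made-up effect sizes and treating both $x_1$ and $x_2$ as fixed effects in ordinary least squares. It is a simplification of the mixed model, where $x_2$ would be a random effect. The helper `chi2_first_column` is a hypothetical name.

```python
import numpy as np

def chi2_first_column(X, y):
    """Wald chi-square (squared t-statistic) for column 0 in an OLS fit of y on X."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)        # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)   # covariance of the coefficient estimates
    return beta[0] ** 2 / cov[0, 0]

rng = np.random.default_rng(0)
n = 2000
x1 = rng.standard_normal(n)                 # test variable with a weak effect
x2 = rng.standard_normal(n)                 # second variable with a strong effect
# Illustrative effect sizes (assumptions, not values from the talk).
y = 0.1 * x1 + 2.0 * x2 + 0.5 * rng.standard_normal(n)

ones = np.ones(n)                           # intercept column
chi2_marginal = chi2_first_column(np.column_stack([x1, ones]), y)
chi2_joint = chi2_first_column(np.column_stack([x1, x2, ones]), y)
print(f"x1 alone: chi2 = {chi2_marginal:.1f}; x1 given x2: chi2 = {chi2_joint:.1f}")
```

Including $x_2$ removes most of the residual variance, so the standard error for $x_1$ shrinks and its test statistic grows even though the effect of $x_1$ itself is unchanged.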
Better joint model of phenotype => better conditioning => more power
  Infinitesimal model
    Standard mixed model: all SNPs causal with normally distributed effect sizes: $\beta \sim N(0, h_g^2/M)$
    Normal
  Non-infinitesimal model
    Reality: Only a small fraction of SNPs causal, with larger effects
    BOLT-LMM: SNP effect sizes modeled with mixture of two Gaussians
    Non-normal: heavier tails
  See also: ASHG 2014 poster 1767S (Vilhjalmsson)
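To make "heavier tails" concrete, the sketch below draws effect sizes from a single normal and from a two-Gaussian mixture with the same total variance, then compares their excess kurtosis. The mixture parameters `p` and `share_large` are hypothetical values chosen for illustration, not settings used by BOLT-LMM.

```python
import numpy as np

def excess_kurtosis(x):
    """Sample excess kurtosis (approximately 0 for a normal distribution)."""
    z = x - x.mean()
    return np.mean(z ** 4) / np.mean(z ** 2) ** 2 - 3.0

rng = np.random.default_rng(1)
M = 100_000                  # number of SNPs (illustrative)
h2_g = 0.5                   # variance explained by all SNPs (illustrative)
per_snp_var = h2_g / M

# Infinitesimal model: every effect drawn from one normal, N(0, h2_g / M).
beta_inf = rng.normal(0.0, np.sqrt(per_snp_var), size=M)

# Mixture of two Gaussians with the same overall variance.
p = 0.05                     # hypothetical fraction of SNPs in the large-variance component
share_large = 0.5            # hypothetical share of total variance carried by that component
var_large = share_large * per_snp_var / p
var_small = (1.0 - share_large) * per_snp_var / (1.0 - p)
is_large = rng.random(M) < p
beta_mix = rng.standard_normal(M) * np.sqrt(np.where(is_large, var_large, var_small))

kurt_inf = excess_kurtosis(beta_inf)
kurt_mix = excess_kurtosis(beta_mix)
print(f"variance ratio (mixture / normal): {beta_mix.var() / beta_inf.var():.2f}")
print(f"excess kurtosis: normal {kurt_inf:.2f}, mixture {kurt_mix:.2f}")
```

Both draws explain about the same total variance, but the mixture concentrates much of it in a few large effects, which shows up as a clearly positive excess kurtosis.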
How does BOLT-LMM increase speed? (And what exactly does it compute?) Under the hood, Part 2
Idea 1: New retrospective statistic + algorithm increases speed (for infinitesimal LMM)
  Retrospective mixed model assoc statistic
    $\frac{(x_m^T V^{-1} y)^2}{(V^{-1} y)^T \Theta (V^{-1} y)}$
    Compute quasi-likelihood score test similar to MASTOR (Jakobsdottir & McPeek 2013 AJHG)
    Calibrate using approach similar to GRAMMAR-Gamma (Svishcheva et al. 2012 Nat Gen)
  Fast iterative algorithm
    Replace expensive eigendecomposition with linear system solving (Legarra & Misztal 2008, VanRaden 2008 J Dairy Sci)
    No need to compute genetic relationship matrix (GRM) => save time, RAM
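The idea of solving a linear system without forming the GRM can be sketched with conjugate gradient, a standard iterative solver used here as an illustration. The sketch assumes $V = \sigma_g^2 X X^T / M + \sigma_e^2 I$ with standardized genotypes $X$ and variance parameters that are already known. The function name `solve_V` and all parameter values are hypothetical. Only the numerator of the statistic is computed; calibration of the denominator is not shown.

```python
import numpy as np

def solve_V(X, y, sigma_g2, sigma_e2, tol=1e-10, max_iter=1000):
    """Solve V a = y by conjugate gradient, with V = sigma_g2 * X X^T / M + sigma_e2 * I.

    V is applied through two matrix-vector products with X, so the
    N x N matrix X X^T / M (the GRM) is never formed.
    """
    n, m = X.shape

    def apply_V(v):
        return sigma_g2 * (X @ (X.T @ v)) / m + sigma_e2 * v

    a = np.zeros(n)
    r = y.astype(float).copy()   # residual y - V a for the starting guess a = 0
    d = r.copy()                 # search direction
    rs = r @ r
    y_norm = np.sqrt(y @ y)
    for _ in range(max_iter):
        Vd = apply_V(d)
        step = rs / (d @ Vd)
        a += step * d
        r -= step * Vd
        rs_new = r @ r
        if np.sqrt(rs_new) <= tol * y_norm:
            break
        d = r + (rs_new / rs) * d
        rs = rs_new
    return a

rng = np.random.default_rng(0)
n, m = 200, 500                                   # small illustrative sizes
G = rng.binomial(2, 0.3, size=(n, m)).astype(float)
X = (G - G.mean(axis=0)) / G.std(axis=0)          # standardized genotypes
y = rng.standard_normal(n)
sigma_g2, sigma_e2 = 0.5, 0.5                     # assumed known for this sketch

Vinv_y = solve_V(X, y, sigma_g2, sigma_e2)
numerators = (X.T @ Vinv_y) ** 2                  # (x_m^T V^{-1} y)^2 for every SNP m

# Check against a dense solve, which is feasible only because n is small here.
V_dense = sigma_g2 * X @ X.T / m + sigma_e2 * np.eye(n)
max_abs_diff = np.max(np.abs(Vinv_y - np.linalg.solve(V_dense, y)))
print(f"max difference from dense solve: {max_abs_diff:.2e}")
```

Each iteration costs two passes over the genotype matrix, that is $O(MN)$ work, and the solver needs only a few length-N vectors beyond the genotypes themselves.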
Idea 2: Bayesian extension of linear mixed model (non-inf. model) increases power
  Retrospective mixed model assoc statistic
    $\frac{(x_m^T V^{-1} y)^2}{(V^{-1} y)^T \Theta (V^{-1} y)}$
  Retrospective mixed model assoc statistic using Bayesian prior
    $\frac{(x_m^T y_{\mathrm{resid}})^2}{y_{\mathrm{resid}}^T \Theta \, y_{\mathrm{resid}}}$
    Compute fast variational approximation of posteriors (Meuwissen et al. 2009, Logsdon et al. 2010, Carbonetto & Stephens 2012)
    Calibrate using LD score regression (Bulik-Sullivan et al. biorxiv; under revision, Nat Genet)
  See also: ASHG 2014 talk 351 (Finucane) and poster 1787T (Bulik-Sullivan)
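One way a calibration based on LD score regression could work is to rescale the new statistics so that their regression intercept matches that of statistics already known to be calibrated. The sketch below illustrates that idea on simulated data. It assumes an unweighted least-squares fit (LD score regression proper uses a weighted regression), and the simulated slopes, the inflation factor, and the function name `ld_score_intercept` are all hypothetical.

```python
import numpy as np

def ld_score_intercept(chi2, ld_scores):
    """Intercept of an unweighted least-squares fit chi2 ~ a + b * ld_score."""
    A = np.column_stack([np.ones_like(ld_scores), ld_scores])
    coef, *_ = np.linalg.lstsq(A, chi2, rcond=None)
    return coef[0]

rng = np.random.default_rng(2)
n_snps = 200_000
ld_scores = rng.gamma(shape=2.0, scale=50.0, size=n_snps)   # simulated LD scores

# Reference statistics, assumed already calibrated: E[chi2] = 1 + slope * LD score.
chi2_ref = (1.0 + 0.004 * ld_scores) * rng.chisquare(1, size=n_snps)

# New statistics: steeper slope (more signal) but scaled by an unknown constant.
true_factor = 1.3
chi2_raw = true_factor * (1.0 + 0.006 * ld_scores) * rng.chisquare(1, size=n_snps)

# Estimate the constant as the ratio of intercepts, then rescale.
factor_hat = ld_score_intercept(chi2_raw, ld_scores) / ld_score_intercept(chi2_ref, ld_scores)
chi2_calibrated = chi2_raw / factor_hat
print(f"estimated calibration factor: {factor_hat:.3f} (simulated truth {true_factor})")
```

The design choice is that the slope on LD score is allowed to differ between the two sets of statistics, since it reflects polygenic signal, while the intercept is used to pin down the overall scale.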
New mixed model method (BOLT-LMM) increases speed, further increases power
                              Linear regression   Existing mixed model methods   New method: BOLT-LMM
  Time                        $O(MN)$             $O(MN^2)$                      $O(MN^{1.5})$
  Corrects for confounding?   No                  Yes                            Yes
  Power                       Baseline            Increased                      Further increased
  BOLT-LMM models non-infinitesimal genetic architectures
Limitations and future directions
  Limitations
    Loss of power from case-control ascertainment (ASHG 2014 poster 1746S, Hayeck)
    Potential breakdown of approximations in family studies, rare variant tests, non-human data sets
  Future directions
    Fast heritability estimation
    Multiple variance components
    Multiple phenotypes
Acknowledgments
  Alkes Price, Nick Patterson, Bjarni Vilhjalmsson, Hilary Finucane, Sasha Gusev, George Tucker, Bonnie Berger, Daniel Chasman, Brendan Bulik-Sullivan, Benjamin Neale, Rany Salem
  BOLT-LMM software: http://hsph.harvard.edu/alkes-price/software/ (v1.1 handles imputed SNPs)
  Or Google "BOLT-LMM"
  Loh et al. biorxiv (under revision, Nat Genet)