GWAS with mixed models


1 GWAS with mixed models (the trip from $10^0$ to $10^8$). Yurii Aulchenko, yurii [dot] Aulchenko [at] gmail [dot] com, YuriiA consulting

2 Outline: Methodology development; Speeding things up; Simplifying the math; From math to software; Conclusions & remarks

3-14 Methodology development (diagram built up step by step). Stages: Method (math), Algorithm, Implementation, Data. Outcomes: Fine!, Wrong, Too slow.

15-16 Mixed Models for GWAS. Natural way to model correlated data. Model the distribution of phenotypes as $y_i = \mu + \beta g_i + G_i + \varepsilon_i$, where $\beta$ is the effect of a SNP and $G$ is distributed as multivariate normal with variance-covariance matrix proportional to the relationship matrix. Parameters: $\{\mu, \beta, h^2, \sigma^2\}$. ML way: apply a likelihood ratio (LR) test to assess the significance of $\beta$.

17 Mixed Models for GWAS. Problem (2007): estimating the model for a single SNP takes about 15 minutes, so a single GWAS takes a few years.
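As a rough illustration of the ML/LR approach, the sketch below fits the model with and without one SNP and returns the LR statistic for the SNP effect. Several choices here are assumptions and not taken from the slides: the parametrisation Var(y) = sigma^2 (h^2 K + (1 - h^2) I) with K the relationship matrix, the eigendecomposition of K used to diagonalise the covariance, the grid search over h^2, and all function names.

```python
import numpy as np

def profile_loglik(h2, y, X, evals, U):
    """ML log-likelihood at fixed h2, with fixed effects and sigma^2 profiled out.

    Assumes Var(y) = sigma^2 * (h2 * K + (1 - h2) * I), K = U diag(evals) U^T.
    """
    n = len(y)
    d = h2 * evals + (1.0 - h2)            # eigenvalues of the scaled covariance
    yt, Xt = U.T @ y, U.T @ X              # rotate: covariance becomes diagonal
    w = 1.0 / d
    XtW = Xt.T * w
    coef = np.linalg.solve(XtW @ Xt, XtW @ yt)   # GLS estimate of fixed effects
    r = yt - Xt @ coef
    sigma2 = np.sum(w * r * r) / n               # ML estimate of sigma^2
    return -0.5 * (n * np.log(2 * np.pi * sigma2) + np.sum(np.log(d)) + n)

def max_loglik(y, X, evals, U, grid=np.linspace(0.0, 0.99, 100)):
    """Maximise the profile log-likelihood over a grid of h2 values."""
    return max(profile_loglik(h2, y, X, evals, U) for h2 in grid)

def lr_test(y, g, K):
    """LR statistic for the SNP effect: 2 * (loglik with SNP - loglik without)."""
    evals, U = np.linalg.eigh(K)
    ones = np.ones_like(y)
    ll0 = max_loglik(y, ones[:, None], evals, U)
    ll1 = max_loglik(y, np.column_stack([ones, g]), evals, U)
    return 2.0 * (ll1 - ll0)
```

Calling `lr_test(y, g, K)` once per SNP re-estimates h^2 every time, which is exactly the cost that the later slides set out to avoid.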

18 Where can we improve? Implementation, Algorithm, Method (math)

19 Two-step / score test. The main problem is the estimation of $h^2$ each time we introduce a new SNP into the model. If we assume that a SNP has a small effect on the trait, then its inclusion into the model should not change the estimate of $h^2$ much. Therefore a two-step estimation approach can be used. First, estimate $h^2$ using the mixed model without the SNP: $y_i = \mu + G_i + \varepsilon_i$. Then use the same estimate $\hat{h}^2$ to correct the test of association for every SNP genome-wide.
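The first step can be sketched as follows: estimate h^2 once, from the model without any SNP, and reuse it afterwards. This is a minimal sketch under assumptions not stated on the slide: Var(y) = sigma^2 (h^2 K + (1 - h^2) I), ML estimation by grid search, and a hypothetical function name.

```python
import numpy as np

def estimate_h2_null(y, K, grid=np.linspace(0.0, 0.99, 100)):
    """Grid-search ML estimate of h2 in y = mu + G + e (no SNP in the model)."""
    n = len(y)
    evals, U = np.linalg.eigh(K)               # done once, K = U diag(evals) U^T
    yt, ot = U.T @ y, U.T @ np.ones(n)         # rotated phenotype and intercept
    best_ll, best_h2 = -np.inf, 0.0
    for h2 in grid:
        d = h2 * evals + (1.0 - h2)            # diagonal covariance after rotation
        w = 1.0 / d
        mu = np.sum(w * ot * yt) / np.sum(w * ot * ot)   # GLS estimate of the mean
        r = yt - mu * ot
        sigma2 = np.sum(w * r * r) / n                   # ML estimate of sigma^2
        ll = -0.5 * (n * np.log(2 * np.pi * sigma2) + np.sum(np.log(d)) + n)
        if ll > best_ll:
            best_ll, best_h2 = ll, h2
    return best_h2
```

The returned value plays the role of the estimate that is held fixed for every SNP in the second step.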

20 FASTA way (Chen & Abecasis, AJHG, 2007). The obtained estimates are used to construct the variance-covariance matrix for the data, $\hat{\Sigma}$. The score test is constructed accounting for $\hat{\Sigma}$: $T_i^2 = \frac{(\bar{g}_i^T \hat{\Sigma}^{-1} \bar{Y})^2}{\bar{g}_i^T \hat{\Sigma}^{-1} \bar{g}_i}$
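The score statistic can be computed for all SNPs at once given the estimated variance-covariance matrix. In the sketch below the barred vectors are taken to be centered with the GLS-weighted mean; that centering, the function names, and the use of a dense linear solve are assumptions made for illustration, not details from the slide.

```python
import numpy as np

def gls_center(x, Sinv_one, denom):
    """Subtract the GLS-weighted mean from a vector or from each column of a matrix."""
    return x - (Sinv_one @ x) / denom

def fasta_score_tests(y, G, Sigma_hat):
    """Score statistics T_i^2 for every SNP column of G (n x m), Sigma_hat held fixed."""
    one = np.ones(len(y))
    Sinv_one = np.linalg.solve(Sigma_hat, one)    # Sigma^{-1} 1
    denom = one @ Sinv_one
    yc = gls_center(y, Sinv_one, denom)
    Gc = gls_center(G, Sinv_one, denom)
    Sinv_Gc = np.linalg.solve(Sigma_hat, Gc)      # Sigma^{-1} g_i for all SNPs
    num = (Sinv_Gc.T @ yc) ** 2                   # (g_i^T Sigma^{-1} Y)^2
    den = np.sum(Gc * Sinv_Gc, axis=0)            # g_i^T Sigma^{-1} g_i
    return num / den
```

With this centering, each statistic equals the squared GLS Wald statistic for the SNP in a model with an intercept and the covariance fixed at `Sigma_hat`, which gives a convenient way to check the code.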

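21 GRAMMAR way (Aulchenko et al., 2007; Amin et al., 2007; Svishcheva et al., Nat Genet, 2012). Define $Y = \hat{\Sigma}^{-1} \bar{Y}$. Approximate $\bar{g}_i^T \hat{\Sigma}^{-1} \bar{g}_i$ with $\bar{g}_i^T \gamma(\hat{\Sigma}) \bar{g}_i$, where $\gamma(\hat{\Sigma})$ is a scalar: $T_i^2 = \frac{(\bar{g}_i^T \hat{\Sigma}^{-1} \bar{Y})^2}{\bar{g}_i^T \hat{\Sigma}^{-1} \bar{g}_i} \approx \frac{1}{\gamma(\hat{\Sigma})} \frac{(\bar{g}_i^T Y)^2}{\bar{g}_i^T \bar{g}_i}$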
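The approximation replaces a quadratic form per SNP with a single scalar, so each test needs only vector products. The sketch below is one way to write it, under assumptions that are not from the slide: the scalar is taken as the average ratio of the exact to the plain quadratic form over a calibration subset of SNPs, vectors are centered with the arithmetic mean, and the function and parameter names are hypothetical.

```python
import numpy as np

def grammar_gamma(y, G, Sigma_hat, n_calib=100):
    """Approximate score statistics using one scalar correction factor gamma."""
    yc = y - y.mean()
    Gc = G - G.mean(axis=0)
    Y = np.linalg.solve(Sigma_hat, yc)         # transformed trait, computed once
    calib = Gc[:, :n_calib]                    # calibration SNPs (assumed choice)
    exact_q = np.sum(calib * np.linalg.solve(Sigma_hat, calib), axis=0)
    plain_q = np.sum(calib * calib, axis=0)
    gamma = np.mean(exact_q / plain_q)         # scalar standing in for Sigma^{-1}
    t2 = (Gc.T @ Y) ** 2 / (gamma * np.sum(Gc * Gc, axis=0))
    return t2, gamma
```

After the one-off computation of `Y` and `gamma`, the work per SNP is linear in the sample size, whereas the exact quadratic form is quadratic in it.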

22 Speed comparison (figure)

23-24 Accuracy of Grammar-γ. $T_i^2 = \frac{(\bar{g}_i^T \hat{\Sigma}^{-1} \bar{Y})^2}{\bar{g}_i^T \hat{\Sigma}^{-1} \bar{g}_i} \approx \frac{1}{\gamma(\hat{\Sigma})} \frac{(\bar{g}_i^T Y)^2}{\bar{g}_i^T \bar{g}_i}$. Figure: GRAMMAR-Gamma and LR-GC test statistics plotted against FASTA. Panels: Height (n = 2,592), BMI (n = 2,591), HDL (n = 2,585); FRI (n = 164), avrpphb (n = 90), avrrpm1 (n = 84). Grey dots: FASTA vs LM. Black dots: Grammar-γ vs FASTA. Upper row: human data (almost the same); lower row: A. thaliana highly structured data (less accurate) (Svishcheva et al., Nat Genet, 2012).

25 Accuracy of Grammar-γ. Sub-optimal approximation when Var(...) is large; this is easily tested.

26-28 More general problem: GWAS for multiple traits. Let us step back to 2010 (Grammar-γ not there yet). 2007-2010: from 15 minutes for a single SNP to 15 minutes for a GWAS. What if we have 100,000 traits? Back to several years?! Treatment of the problem for an arbitrary number of traits, t. Using the FASTA approach: a sequence of GLS problems.
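To make the "sequence of GLS problems" concrete, the sketch below solves one GLS problem per (SNP, trait) pair while sharing the expensive work across all of them. It assumes, beyond what the slide states, that each trait j has covariance proportional to h2_j * Phi + (1 - h2_j) * I with a known h2_j, and that each design matrix holds an intercept and one SNP. The eigendecomposition trick and the function name are illustrative and are not claimed to be the algorithms discussed in the talk.

```python
import numpy as np

def multitrait_gls(Y, G, Phi, h2):
    """GLS SNP effects for every SNP (column of G) and every trait (column of Y).

    Returns an (m x t) array; entry [i, j] is the effect of SNP i on trait j.
    """
    lam, U = np.linalg.eigh(Phi)                 # shared by all SNPs and traits
    ot = U.T @ np.ones(Phi.shape[0])             # rotated intercept
    Gt, Yt = U.T @ G, U.T @ Y                    # rotate genotypes and traits once
    betas = np.empty((G.shape[1], Y.shape[1]))
    for j, h in enumerate(h2):
        w = 1.0 / (h * lam + (1.0 - h))          # inverse of the diagonal covariance
        a = np.sum(w * ot * ot)                  # 2 x 2 normal equations per SNP:
        b = (w * ot) @ Gt                        #   [[a, b], [b, c]] [mu, beta]^T
        c = w @ (Gt * Gt)                        #     = [p, q]^T
        p = np.sum(w * ot * Yt[:, j])
        q = (w * Yt[:, j]) @ Gt
        betas[:, j] = (a * q - b * p) / (a * c - b * b)
    return betas
```

After the single rotation, the cost per trait is a handful of matrix-vector products over all SNPs, which is what makes a large number of traits tractable in this formulation.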

29-32 Where can we improve? Implementation, Algorithm, Method (math). Work in collaboration with Prof. Bientinesi and Mr. Fabregat-Traver, RWTH Aachen.

33 Algorithms with CLAK. CLAK: a system for automatic generation of algorithms. Twenty algorithms generated. Two selected: one for few-trait (<=10) and one for multi-trait (>10) GWAS.

34 Implementation. Effective factorization. Grouping and use of multi-threaded BLAS-3 for large matrix-by-matrix operations. Custom thread-based parallelization. Double buffering and asynchronous data transfers.
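Two of these ideas, factorizing once and grouping SNPs so that the work is done as matrix-by-matrix operations, can be sketched in a few lines. This is an illustration only, not the implementation described in the talk: the function name and block size are hypothetical, NumPy hands the heavy lifting to whichever BLAS/LAPACK library it is linked against, and threading, double buffering and asynchronous transfers are not shown.

```python
import numpy as np

def whitened_crossproducts(Sigma_hat, y, G, block_size=256):
    """Return g_i^T Sigma^{-1} y and g_i^T Sigma^{-1} g_i for every SNP column of G."""
    L = np.linalg.cholesky(Sigma_hat)        # factorize once: Sigma = L L^T
    # np.linalg.solve is a general solver; a triangular solver would exploit L's shape.
    yw = np.linalg.solve(L, y)               # whitened phenotype L^{-1} y
    m = G.shape[1]
    gy, gg = np.empty(m), np.empty(m)
    for start in range(0, m, block_size):
        stop = min(start + block_size, m)
        Gw = np.linalg.solve(L, G[:, start:stop])   # one matrix-matrix solve per block
        gy[start:stop] = Gw.T @ yw                  # all cross-products in the block
        gg[start:stop] = np.einsum("ij,ij->j", Gw, Gw)
    return gy, gg
```

Processing SNPs in blocks keeps memory bounded while still letting each step run as a large matrix operation, instead of one small matrix-vector operation per SNP.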

35 Speed comparison (figure). Run time of EMMAX, GWFGLS, FaST-LMM, CLAK-Chol and CLAK-Eig as a function of sample size (n), number of SNPs (m) and number of traits (t). Relative run time labels: EMMAX 2789x, FaST-LMM 1352x, GWFGLS 1012x, CLAK-Eig 1x.

36 Large sample (same figure)

37 Many SNPs (same figure)

38-39 Multi-trait GWAS (same figure)

40-41 Multi-trait GWAS. Running on real data: metabolome GWAS with >100,000 traits. Finished in 8 hours.

42-44 Conclusions. Enormous progress: from 15 minutes for a single SNP test to 100(s) of GWAS in 15 minutes (x10,000,000). Problem knowledge: in Method-Algorithm-Implementation, every step counts. Practical way to produce practical methods: agile methodology (short feedback loop).

45 The GenABEL project. Implementing the agile methodology framework. Open source. Free (as in freedom). Collaborative, open, aiming to provide an agile environment.

47 Stay tuned!

48 Method/algorithmic similarities. Animal breeding literature/methods from 19/70s. FMM of W. Astle and D. Balding (08) (MixABEL, 10). FaST-LMM of Lippert et al. (12); GEMMA of Zhou & Stephens (12). FASTA-like: EMMAX of Kang et al. (10), P3D of Zhang et al. (12).

