Theoretical and computational aspects of association tests: application in case-control genome-wide association studies.


1 Theoretical and computational aspects of association tests: application in case-control genome-wide association studies Mathieu Emily November 18, 2014 Caen - Agrocampus Ouest - IRMAR, Rennes, France

2 Laboratoire de Mathématiques Appliquées de l'Agrocampus (LMA²)
http://math.agrocampus-ouest.
People: 6 faculty, 1 research assistant, 5 PhD students
Research: multivariate exploratory data analysis, biostatistics, high-dimensional data
Main topics: sensometrics, genomic data analysis
mathieu.emily@agrocampus-ouest.fr - Agrocampus Ouest - IRMAR - Rennes

3 Outline
  1 Genome-wide association studies
  2 Power in single-locus association
  3 Two-locus association
  4 Conclusion

4 Outline
  1 Genome-wide association studies
      Context and problem statement
  2 Power in single-locus association
  3 Two-locus association
  4 Conclusion

5 Genome-wide association studies (GWAS)
Case/control studies:
  Detection of differences in allelic frequencies between case and control individuals
  Genotyping of individuals from both populations
Challenges:
  technological: large increase in the number of markers on chips (100k, 300k, 500k and 1000k)
  computational
  statistical

6 Genome-wide association studies (GWAS)
Statistical and computational challenges

Individual   Phenotype   Marker 1   Marker 2   ...   Marker 500,000
             Y           X_1        X_2        ...   X_500,000
Id 1         healthy     AA         AC         ...   TG
Id 2         diseased    AC         AC         ...   GG
...
Id 1,000     diseased    AC         CC         ...   TG

Let $Y$ be a random variable with a Bernoulli distribution (the case where $Y$ is continuous is not treated here).
Let $X_i$, $i = 1, \dots, p$, be $p$ random variables with 3 states ($X_i = 0$: homozygote for the major allele, $X_i = 1$: heterozygote, $X_i = 2$: homozygote for the minor allele) corresponding to the genotype of Marker $i$.
How is $Y$ explained by $\{X_i\}_{i = 1, \dots, p}$?
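The 0/1/2 coding of a marker can be illustrated with a minimal Python sketch. The helper name `minor_allele_count` and the choice of `C` as the minor allele are assumptions made for the example only; they do not come from the slides.

```python
def minor_allele_count(genotype: str, minor_allele: str) -> int:
    """Return X_i in {0, 1, 2}: the number of copies of the minor allele."""
    return sum(1 for allele in genotype if allele == minor_allele)


# Hypothetical biallelic marker with alleles A (major) and C (minor).
genotypes = ["AA", "AC", "CC"]
codes = [minor_allele_count(g, "C") for g in genotypes]
print(codes)
```

With this coding, each individual contributes one row of integers, one per marker, next to the binary phenotype Y.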

7 A success story? ... yes
Since 2005, many variants have been found to be associated with susceptibility to various complex diseases: prostate cancer, Crohn's disease, etc.
Figure: Manhattan plot for type 1 diabetes in the WTCCC dataset

8 A success story? ... yes and no
GWAS typically identify common variants with small effect sizes (lower right part of the graph).
(Bush WS, Moore JH, PLoS Comput Biol, 2012)

9 A success story? ... no
GWAS have generated new challenges, such as the quest for missing heritability.


11 Discrepancy between biology and statistics
In biology, GWAS are limited by complex phenomena such as:
  Genome structure
  Complexity of diseases
  Potential for a large number of false-positive results
The future is to put prior knowledge into the analysis... and potentially make the problem more complex.
From a statistical point of view, GWAS are challenging because of:
  Correlation between SNPs
  Interaction between variables
  High-dimensional problem with categorical variables
The future is to investigate the behavior of basic statistical procedures in this specific context.

12 Outline
  1 Genome-wide association studies
  2 Power in single-locus association
      Direct single-locus association
      Application with the WTCCC dataset
  3 Two-locus association
  4 Conclusion

13 Single-locus association
GWAS are usually performed via a single-locus approach: each SNP is tested independently.
Question: what is the most powerful statistical test to detect a signal?
Figure: Manhattan plot for type 1 diabetes in the WTCCC dataset (Nature, 2007)

14 Theoretical context and notations
Let $X$ and $Y$ be two binary variables with values in $\{1, 2\}$.
  $X$ can be a biallelic biological marker.
  $Y$ can be the presence/absence of a disease.
Data are usually summarized in a 2x2 contingency table:

         X = 1      X = 2      Total
Y = 1    $n_{11}$   $n_{12}$   $n_{1.} = N(1 - \phi)$
Y = 2    $n_{21}$   $n_{22}$   $n_{2.} = N\phi$
Total    $n_{.1}$   $n_{.2}$   $N$

where $n_{ij}$ is the total number of observations with $Y = i$ and $X = j$.
The marginal counts for $Y$ are assumed to be fixed (one-margin fixed design). We introduce $\phi$ as the balance of the design.
Detecting association between $X$ and $Y$ is equivalent to comparing two binomial proportions, $\pi_1$ and $\pi_2$, where:
$$\pi_i = P[X = 2 \mid Y = i] \quad \text{for } i = 1, 2$$

15 Statistical hypothesis and tests (1)
Our objective is to test:
$$H_0: \pi_1 = \pi_2 \quad \text{vs} \quad H_1: \pi_1 \neq \pi_2 \qquad (1)$$
Exact tests:
  Fisher exact test
  The power function of an exact test is hardly tractable.
Asymptotic tests:
  Pearson's $\chi^2$ test
  Likelihood ratio test (LRT)
The statistical hypothesis in Equation (1) can be reformulated as:
$$H_0: \log\left(\frac{\pi_1 / (1 - \pi_1)}{\pi_2 / (1 - \pi_2)}\right) = \log(OR(\pi_1, \pi_2)) = 0 \quad \text{vs} \quad H_1: \log\left(\frac{\pi_1 / (1 - \pi_1)}{\pi_2 / (1 - \pi_2)}\right) \neq 0$$
where $OR(\pi_1, \pi_2)$ is the so-called odds ratio between $\pi_1$ and $\pi_2$. Statistical inference on the odds ratio can therefore be used.

16 Statistical hypothesis and tests (2)
Let us introduce the expected counts obtained under independence between $X$ and $Y$:
$$m_{ij} = \frac{n_{i.}\, n_{.j}}{N}$$
Pearson's $\chi^2$ statistic:
$$P = \sum_{i=1}^{2} \sum_{j=1}^{2} \frac{(n_{ij} - m_{ij})^2}{m_{ij}}$$
Likelihood ratio:
$$LR = 2 \sum_{i=1}^{2} \sum_{j=1}^{2} n_{ij} \log\left(\frac{n_{ij}}{m_{ij}}\right)$$
Odds-ratio inference:
$$z^2 = \left(\frac{t}{SE}\right)^2 \quad \text{with} \quad t = \log\left(\frac{n_{11}\, n_{22}}{n_{12}\, n_{21}}\right) \quad \text{and} \quad SE = \sqrt{\frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}}}$$
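As an illustration, the three statistics of this slide can be computed from the four cell counts with a short Python sketch. The function name `two_by_two_stats` and the example counts are assumptions made here, and the sketch assumes all four counts are strictly positive (otherwise the logarithms are undefined).

```python
import math


def two_by_two_stats(n11, n12, n21, n22):
    """Return (P, LR, z2) for a 2x2 table with strictly positive counts."""
    n = [[n11, n12], [n21, n22]]
    total = n11 + n12 + n21 + n22
    row = [n11 + n12, n21 + n22]          # n_i.
    col = [n11 + n21, n12 + n22]          # n_.j
    pearson = 0.0
    lr = 0.0
    for i in range(2):
        for j in range(2):
            m_ij = row[i] * col[j] / total            # expected count under independence
            pearson += (n[i][j] - m_ij) ** 2 / m_ij
            lr += 2 * n[i][j] * math.log(n[i][j] / m_ij)
    t = math.log(n11 * n22 / (n12 * n21))             # log odds ratio
    se = math.sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)
    return pearson, lr, (t / se) ** 2


# Hypothetical table: 100 observations per row.
p_stat, lr_stat, z2_stat = two_by_two_stats(90, 10, 80, 20)
print(p_stat, lr_stat, z2_stat)
```

The three values are close but not identical on the same table, which is what motivates the power comparison that follows.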

17 Statistical hypothesis and tests (3)
Under $H_0$, all three statistics follow a central $\chi^2$ distribution with 1 df:
$$P \sim_{H_0} \chi^2(1), \quad LR \sim_{H_0} \chi^2(1), \quad z^2 \sim_{H_0} \chi^2(1)$$
Under $H_1$, each of the three statistics follows a non-central $\chi^2$ distribution with 1 df:
$$P \sim_{H_1} \chi^2(\lambda_P, 1), \quad LR \sim_{H_1} \chi^2(\lambda_{LR}, 1), \quad z^2 \sim_{H_1} \chi^2(\lambda_{z^2}, 1)$$
Comparing the power of $P$, $LR$ and $z^2$ is equivalent to comparing the non-centrality parameters $\lambda_P$, $\lambda_{LR}$ and $\lambda_{z^2}$.
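To see why a larger non-centrality parameter means more power, here is a minimal Python sketch using only the standard library. It assumes the usual convention in which a non-central $\chi^2$ with 1 df and parameter $\lambda$ is the distribution of $(Z + \sqrt{\lambda})^2$ with $Z$ standard normal; the function name `power_chi2_1df` is hypothetical.

```python
from statistics import NormalDist


def power_chi2_1df(lam, alpha=0.05):
    """P(chi2(lam, 1) > c), where c is the level-alpha critical value of a central chi2(1)."""
    std = NormalDist()
    z_crit = std.inv_cdf(1 - alpha / 2)    # square root of the chi2(1) critical value
    shift = lam ** 0.5
    # (Z + shift)^2 > z_crit^2  <=>  Z > z_crit - shift  or  Z < -z_crit - shift
    return (1 - std.cdf(z_crit - shift)) + std.cdf(-z_crit - shift)


for lam in (0.0, 2.0, 7.85):
    print(lam, round(power_chi2_1df(lam), 3))
```

At $\lambda = 0$ the function returns the level $\alpha$, and it increases with $\lambda$, so ranking the tests by $\lambda$ ranks them by power.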

18 Power study framework
In the context of 2x2 table analysis, power studies have been used to estimate the sample size needed to reach a given level of power: the power study is performed before the experiment.
Here we propose a post-hoc power study, which can be carried out after the experiment.
To compare non-centrality parameters, we assume that $N$ is fixed and propose the following scheme:
  1 Definition of a general situation for $H_1$
  2 Estimation of the three non-centrality parameters ($\lambda_P$, $\lambda_{LR}$ and $\lambda_{z^2}$)
  3 Theoretical comparison of the non-centrality parameter estimates

19 Local alternatives for $H_1$
We consider the situation of local alternatives given by: $\pi_2 = \pi_1 + h / \sqrt{N}$.
Let us introduce the mean contingency table, NE, and the mean expected contingency table, ME, as follows:

NE:
         X = 1                                     X = 2                              Total
Y = 1    $ne_{11} = N(1 - \pi_1)(1 - \phi)$        $ne_{12} = N\pi_1(1 - \phi)$       $N(1 - \phi)$
Y = 2    $ne_{21} = N(1 - \pi_2)\phi$              $ne_{22} = N\pi_2\phi$             $N\phi$
Total    $n_{.1} = N(1 - \bar{\pi})$               $n_{.2} = N\bar{\pi}$              $N$

ME:
         X = 1                                     X = 2                              Total
Y = 1    $me_{11} = N(1 - \bar{\pi})(1 - \phi)$    $me_{12} = N\bar{\pi}(1 - \phi)$   $N(1 - \phi)$
Y = 2    $me_{21} = N(1 - \bar{\pi})\phi$          $me_{22} = N\bar{\pi}\phi$         $N\phi$
Total    $n_{.1} = N(1 - \bar{\pi})$               $n_{.2} = N\bar{\pi}$              $N$

where $\bar{\pi} = \pi_1(1 - \phi) + \pi_2\phi$.

20 Estimation of the non-centrality parameters
Under local alternatives, the non-centrality parameter $\lambda$ is asymptotically equal to the test statistic calculated on NE and ME.
Thus, estimates of the non-centrality parameters are given by:
$$\lambda_P = \sum_{i=1}^{2} \sum_{j=1}^{2} \frac{(ne_{ij} - me_{ij})^2}{me_{ij}}$$
$$\lambda_{LR} = 2 \sum_{i=1}^{2} \sum_{j=1}^{2} ne_{ij} \log\left(\frac{ne_{ij}}{me_{ij}}\right)$$
$$\lambda_{z^2} = \left(\frac{t_e}{SE_e}\right)^2 \quad \text{with} \quad t_e = \log\left(\frac{ne_{11}\, ne_{22}}{ne_{12}\, ne_{21}}\right) \quad \text{and} \quad SE_e = \sqrt{\frac{1}{ne_{11}} + \frac{1}{ne_{12}} + \frac{1}{ne_{21}} + \frac{1}{ne_{22}}}$$
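The estimates above can be evaluated numerically by building the NE and ME tables from $(N, \phi, \pi_1, h)$. The following Python sketch is one possible way to do so; the function name `noncentrality` and the parameter values in the usage lines are assumptions chosen for illustration, and the sketch requires $0 < \pi_2 < 1$.

```python
import math


def noncentrality(N, phi, pi1, h):
    """Return (lambda_P, lambda_LR, lambda_z2) computed on the NE and ME tables."""
    pi2 = pi1 + h / math.sqrt(N)               # local alternative
    pibar = pi1 * (1 - phi) + pi2 * phi        # pooled frequency of X = 2
    rows = [(1 - phi, pi1), (phi, pi2)]        # (row weight, P[X = 2 | Y = i])
    lam_p = lam_lr = inv_sum = 0.0
    for weight, pi in rows:
        for ne_prob, me_prob in ((1 - pi, 1 - pibar), (pi, pibar)):
            ne = N * weight * ne_prob          # cell of NE
            me = N * weight * me_prob          # cell of ME
            lam_p += (ne - me) ** 2 / me
            lam_lr += 2 * ne * math.log(ne / me)
            inv_sum += 1 / ne                  # accumulates SE_e squared
    # t_e = log(ne11 * ne22 / (ne12 * ne21)); N and phi cancel in the ratio
    t_e = math.log((1 - pi1) * pi2 / (pi1 * (1 - pi2)))
    return lam_p, lam_lr, t_e ** 2 / inv_sum


# Hypothetical setting: rare variant, positive h, unbalanced designs.
print(noncentrality(5000, 0.1, 0.05, 1.0))
print(noncentrality(5000, 0.9, 0.05, 1.0))
```

Comparing the returned triples for several values of $\phi$ and $\pi_1$ gives a quick numerical check of the orderings discussed in the following slides.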

21 Taylor approximations
When $h$ is small we have:
$$\lambda_P = \phi(1 - \phi)h^2 \sum_{k \ge 2} \left(-\frac{h}{\sqrt{N}}\right)^{k-2} g_k(\pi_1)\, \phi^{k-2}$$
$$\lambda_{LR} = \phi(1 - \phi)h^2 \sum_{k \ge 2} \left(-\frac{h}{\sqrt{N}}\right)^{k-2} g_k(\pi_1)\, \frac{2}{k(k-1)} \sum_{i=0}^{k-2} \phi^i$$
where
$$g_k(\pi_1) = \left(\frac{1}{\pi_1}\right)^{k-1} - \left(\frac{-1}{1 - \pi_1}\right)^{k-1} = \frac{(1 - \pi_1)^{k-1} - (-\pi_1)^{k-1}}{(\pi_1(1 - \pi_1))^{k-1}}$$

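22 Taylor approximations (2): 4th order
When $h$ is small we have, with $g_k$ as defined on the previous slide:
$$\lambda_P - \lambda_{LR} \approx -\frac{h^3}{\sqrt{N}}\, \phi(1 - \phi) \left[\frac{2\phi - 1}{3}\, g_3(\pi_1) - \frac{h}{\sqrt{N}}\, \frac{5\phi^2 - \phi - 1}{6}\, g_4(\pi_1)\right]$$
and:
$$\lambda_P - \lambda_{z^2} \approx \frac{h^4}{N}\, \phi(1 - \phi) \left(\frac{1}{12} - \phi(1 - \phi)\pi_1(1 - \pi_1)\right) \frac{g_4(\pi_1)}{3\pi_1^2 - 3\pi_1 + 1} > 0$$
Parameters of importance: $\phi$ and $\pi_1$. And $h$?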

23 $\chi^2$ - LRT
Figure: difference in power between $\chi^2$ and LRT.

24 $\chi^2$ - $z^2$
Figure: difference in power between $\chi^2$ and $z^2$.

25 Power comparison for $\phi = 0.1$
Figure panels: $\pi_1 = 0.05$, $\pi_1 = 0.1$, $\pi_1 = 0.4$
If $\pi_1$ is small, power differs between $\chi^2$ and LR.

26 Power comparison for $\phi = 0.5$
Figure panels: $\pi_1 = 0.05$, $\pi_1 = 0.1$, $\pi_1 = 0.4$
Similar power for each test.

27 Power comparison for $\phi = 0.9$
Figure panels: $\pi_1 = 0.05$, $\pi_1 = 0.1$, $\pi_1 = 0.4$
If $\pi_1$ is small, power differs between $\chi^2$ and LR.

28 Recommendations
$\chi^2$ always outperforms $z^2$.
If $h > 0$ (causal effect):
  $\pi_1$ small and $\phi$ small: $\chi^2$ > LRT
  $\pi_1$ small and $\phi$ high: $\chi^2$ < LRT
If $h < 0$ (protective effect):
  $\pi_1$ small and $\phi$ small: $\chi^2$ < LRT
  $\pi_1$ small and $\phi$ high: $\chi^2$ > LRT

29 Benchmark dataset: WTCCC (Nature, 2007)
500,000 Single Nucleotide Polymorphisms (SNPs) ($X_i$)
3,000 controls
7 diseases with 2,000 cases for each disease
Two possible strategies for studying Crohn's disease:
  1 2,000 cases vs 3,000 controls: $\phi = 0.4$
  2 2,000 cases vs 15,000 controls: $\phi = 0.11$
The following filters are used:
  Control of the number of missing data (< 50)
  Control of Hardy-Weinberg equilibrium (p-value > 0.05)
  Restriction to rare alleles: $f \le 0.05$

30 Chromosome 20
Ranking can change between tests.
Table: SNP ranking for $\chi^2$, LR and $z^2$

31 Outline
  1 Genome-wide association studies
  2 Power in single-locus association
  3 Two-locus association
      Odds-ratio and δ method for counts
      Statistical interaction
      Biological interaction
  4 Conclusion

32 Gene-gene interaction
Single-locus scans fail to explain biological complexity:
  Protein interaction networks
  Pathways
A natural extension of the single-locus approach is the two-locus approach: SNP-SNP interaction or gene-gene interaction.
Main challenges:
  The number of tests: about 125 billion tests ($\binom{500{,}000}{2} \approx 1.25 \times 10^{11}$)
  The large class of interaction models
One useful tool: approximation of odds-ratio inference using the δ method
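The order of magnitude of the number of tests can be checked directly, assuming that every unordered pair among the 500,000 SNPs of the chip is tested once (this pairing assumption is made here for the illustration):

```python
import math

n_snps = 500_000
n_pairs = math.comb(n_snps, 2)   # number of unordered SNP pairs
print(f"{n_pairs:,}")
```

The count is just under 1.25 x 10^11, which is the figure quoted on the slide.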

33 Inference on odds-ratio
The aim is to test the association between $Y$ and the $m$ categories of $X_k$, with:
$$\Phi = [OR(x_1), \dots, OR(x_m)]$$
The null hypothesis can be written as:
$$H_0: \Phi = [1, \dots, 1]$$
or equivalently:
$$H_0: \Psi = [\psi(x_1), \dots, \psi(x_m)] = [\log(OR(x_1)), \dots, \log(OR(x_m))] = [0, \dots, 0]$$

34 Classical test in genetic epidemiology
Test:
  Let $W = \Psi V^{-1} \Psi^t$, where $\Psi = [\psi(x_1), \dots, \psi(x_m)]$.
  Let $V$ be the variance-covariance matrix of $\Psi$.
  As $W$ is a Wald statistic, we have: $W \sim \chi^2(m)$.
In practice:
  $\Psi$ is estimated using Maximum Likelihood Estimation.
  Estimating $V^{-1}$ is more complex.

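35 Estimation of $\Psi$ using MLE
Contingency tables are given by:
$$Y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \quad X_l = \begin{pmatrix} x_{1l} \\ \vdots \\ x_{nl} \end{pmatrix} \;\Rightarrow\; \begin{pmatrix} n^0_0 & \cdots & n^0_m \\ n^1_0 & \cdots & n^1_m \end{pmatrix}$$
where $n^s_k$ is the number of individuals $i$ with $y_i = s$ and $x_{il} = k$.
Then:
$$OR(x_l) = \frac{P(Y = 1 \mid X = x_l)}{P(Y = 0 \mid X = x_l)} \left(\frac{P(Y = 1 \mid X = x_0)}{P(Y = 0 \mid X = x_0)}\right)^{-1}$$
can be estimated by:
$$\widehat{OR}(x_l) = \frac{n^1_{x_l}\, n^0_{x_0}}{n^0_{x_l}\, n^1_{x_0}}$$
$$\hat{\psi}(x_l) = \log(n^1_{x_l}) - \log(n^0_{x_l}) + \log(n^0_{x_0}) - \log(n^1_{x_0})$$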

36 Estimation of V (2): δ approximation
Counts are assumed to follow a multinomial distribution:
$$[N^1_{x_0}; \dots; N^1_{x_m}] \sim \mathrm{Mult}(n^1;\, p^1_{x_0}; \dots; p^1_{x_m})$$
We can write:
$$N^1_{x_l} \approx n^1 p^1_{x_l} \left(1 + \sqrt{\frac{1 - p^1_{x_l}}{n^1 p^1_{x_l}}}\; \delta^1_{x_l}\right)$$
$$\log(N^1_{x_l}) \approx \log(n^1 p^1_{x_l}) + \sqrt{\frac{1 - p^1_{x_l}}{n^1 p^1_{x_l}}}\; \delta^1_{x_l}$$
with:
$$\delta^1_{x_l} \sim \mathcal{N}(0, 1), \qquad \mathrm{Cov}(\delta^1_{x_l}; \delta^1_{x_n}) = -\sqrt{\frac{p^1_{x_l}\, p^1_{x_n}}{(1 - p^1_{x_l})(1 - p^1_{x_n})}} \quad \text{if } l \neq n$$

37 Estimation of V (2): example
$$\mathrm{Cov}(\psi(x_k), \psi(x_l)) = \mathrm{Cov}\Big(\log(N^1_{x_k}) - \log(N^0_{x_k}) + \log(N^0_{x_0}) - \log(N^1_{x_0}),\; \log(N^1_{x_l}) - \log(N^0_{x_l}) + \log(N^0_{x_0}) - \log(N^1_{x_0})\Big)$$
Approximated thanks to:
  $\log(N^1_{x_l}) \approx \log(n^1 p^1_{x_l}) + \sqrt{\dfrac{1 - p^1_{x_l}}{n^1 p^1_{x_l}}}\; \delta^1_{x_l}$
  the variance-covariance structure of the δ's
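A minimal Python sketch of the resulting Wald statistic $W = \Psi V^{-1} \Psi^t$ is given below. It relies on an assumption that is not stated on the slides: cases and controls are taken as two independent multinomial samples, in which case working through the δ approximation gives $\mathrm{Var}(\psi(x_k)) \approx 1/n^1_{x_k} + 1/n^0_{x_k} + 1/n^1_{x_0} + 1/n^0_{x_0}$ and $\mathrm{Cov}(\psi(x_k), \psi(x_l)) \approx 1/n^1_{x_0} + 1/n^0_{x_0}$ for $k \neq l$, with probabilities replaced by observed frequencies. All function names are hypothetical, and counts must be strictly positive.

```python
import math


def log_odds_ratios(cases, controls):
    """psi(x_k) for k = 1..m; index 0 is the reference category x_0."""
    return [math.log(cases[k] * controls[0] / (controls[k] * cases[0]))
            for k in range(1, len(cases))]


def delta_covariance(cases, controls):
    """Plug-in delta-method covariance of the log odds ratios (assumed closed form)."""
    base = 1 / cases[0] + 1 / controls[0]      # shared reference-category term
    m = len(cases) - 1
    return [[base + (1 / cases[k + 1] + 1 / controls[k + 1] if k == l else 0.0)
             for l in range(m)] for k in range(m)]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def wald_statistic(cases, controls):
    psi = log_odds_ratios(cases, controls)
    v_inv_psi = solve(delta_covariance(cases, controls), psi)
    return sum(p * q for p, q in zip(psi, v_inv_psi))


# Hypothetical counts for a marker with three categories (m = 2).
print(wald_statistic([50, 30, 20], [60, 30, 10]))
```

With a single non-reference category ($m = 1$) the statistic reduces to the $z^2$ statistic of the single-locus part, which gives a simple consistency check.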

38 Application to statistical interaction: deviation from linearity (1)
Let $X = (X_k, X_l)$ be a pair of SNPs with 9 categories:
$x_0$ = AABB, $x_1$ = AABb, $x_2$ = AAbb, $x_3$ = AaBB, $x_4$ = AaBb, $x_5$ = Aabb, $x_6$ = aaBB, $x_7$ = aaBb, $x_8$ = aabb
The saturated logistic model is given by:
$$\mathrm{logit}(P(Y = 1 \mid X)) = \beta + \sum_{i \in \{Aa; aa\}} \beta_i\, I_{X_k = i} + \sum_{j \in \{Bb; bb\}} \beta_j\, I_{X_l = j} + \sum_{i \in \{Aa; aa\}} \sum_{j \in \{Bb; bb\}} \beta_{ij}\, I_{X_k = i;\, X_l = j}$$
The test for interaction consists in testing:
$$[\beta_{AaBb}, \beta_{Aabb}, \beta_{aaBb}, \beta_{aabb}] = [0, 0, 0, 0]$$

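39 Application to statistical interaction: deviation from linearity (2)
$H_0$ can be formulated as: $\forall i \in \{Aa, aa\}$ and $\forall j \in \{Bb, bb\}$,
$$OR(X_k = i, X_l = j) = OR(X_k = AA, X_l = j)\; OR(X_k = i, X_l = BB)$$
$$\Leftrightarrow \quad \frac{n^1_{ij}\, n^1_{AABB}}{n^1_{iBB}\, n^1_{AAj}} = \frac{n^0_{ij}\, n^0_{AABB}}{n^0_{iBB}\, n^0_{AAj}}$$
$$\Leftrightarrow \quad \Psi = [\psi_{AaBb}; \psi_{Aabb}; \psi_{aaBb}; \psi_{aabb}] = [0; 0; 0; 0]$$
with ($a$: affected, $u$: unaffected; index 0 denotes the reference genotype AA or BB):
$$\psi_{ij} = \log\left(\frac{n^a_{ij}\, n^a_{00}}{n^a_{i0}\, n^a_{0j}} \Big/ \frac{n^u_{ij}\, n^u_{00}}{n^u_{i0}\, n^u_{0j}}\right)$$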
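The four interaction terms $\psi_{ij}$ can be computed from two 3x3 genotype tables. The sketch below is illustrative only: the function name and the counts are hypothetical, rows are indexed by AA, Aa, aa and columns by BB, Bb, bb, and all counts involved must be strictly positive.

```python
import math


def interaction_log_ratios(cases, controls):
    """Return psi_ij for i, j in {1, 2}; index 0 is the reference genotype (AA or BB)."""
    def ratio(n, i, j):
        return n[i][j] * n[0][0] / (n[i][0] * n[0][j])

    return {(i, j): math.log(ratio(cases, i, j) / ratio(controls, i, j))
            for i in (1, 2) for j in (1, 2)}


# Hypothetical control table, and a case table where only the AaBb cell is doubled.
controls = [[40, 30, 10], [30, 25, 8], [10, 8, 4]]
cases = [row[:] for row in controls]
cases[1][1] *= 2
print(interaction_log_ratios(cases, controls))
```

In this toy example only $\psi_{AaBb}$ departs from zero, because the AaBb count is the only one that differs between the two tables.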

40 Computational cost
Comparative analysis between a Wald test and a Likelihood Ratio Test (LRT)
Table columns: nsim, time (sec) for LRT, time (sec) for Wald
Execution time is divided by almost 2.

41 WTCCC analysis
After filtering using prior knowledge, 3.5 million tests have been performed.
Overall analysis of the 7 diseases from the WTCCC

42 Crohn's disease
Significant hit between two genes: APC and IQGAP1
p-value: ... and ... after multiple testing correction
Biological insights for the interaction
M. Emily et al., European Journal of Human Genetics
Figure: QQ-plot for Crohn's disease with (black) and without (blue) the APC-IQGAP1 interaction

43 Application to biological interaction: non-linearity effect, IndOR
IndOR: Independent Odds Ratio
IndOR is based on a definition of epistasis (Cordell, 2002).
The absence of epistasis means that two genes share the same amount of dependency between cases and controls.
For a pair of SNPs $(X_k, X_l)$, $H_0$ can be formulated as: $\forall i \in \{AA, Aa, aa\}$ and $\forall j \in \{BB, Bb, bb\}$,
$$\frac{P((X_k, X_l) = (i, j) \mid Y = 1)}{P(X_k = i \mid Y = 1)\, P(X_l = j \mid Y = 1)} = \frac{P((X_k, X_l) = (i, j) \mid Y = 0)}{P(X_k = i \mid Y = 0)\, P(X_l = j \mid Y = 0)}$$

44 IndOR: Independent Odds Ratio
Thanks to Bayes' formula, $H_0$:
$$\frac{P((X_k, X_l) = (i, j) \mid Y = 1)}{P(X_k = i \mid Y = 1)\, P(X_l = j \mid Y = 1)} = \frac{P((X_k, X_l) = (i, j) \mid Y = 0)}{P(X_k = i \mid Y = 0)\, P(X_l = j \mid Y = 0)}$$
is equivalent to:
$$\psi_{ij} = \log\left(\frac{OR(x_i, x_j)}{OR(x_i)\, OR(x_j)}\right) = 0$$
$IndOR = \Psi V^{-1} \Psi^t$, with $\Psi = [\psi_{AaBb}, \psi_{Aabb}, \psi_{aaBb}, \psi_{aabb}]$
$IndOR \sim \chi^2(4)$ under $H_0$
M. Emily, Statistics in Medicine

45 Historical epistatic disease models

DD: Jointly Dominant-Dominant
           X_2 = 0    X_2 = 1    X_2 = 2
X_1 = 0    γ          γ          γ
X_1 = 1    γ          γ(1 + θ)   γ(1 + θ)
X_1 = 2    γ          γ(1 + θ)   γ(1 + θ)

RR: Jointly Recessive-Recessive
           X_2 = 0    X_2 = 1    X_2 = 2
X_1 = 0    γ          γ          γ
X_1 = 1    γ          γ          γ
X_1 = 2    γ          γ          γ(1 + θ)

RD: Jointly Recessive-Dominant
           X_2 = 0    X_2 = 1    X_2 = 2
X_1 = 0    γ          γ          γ
X_1 = 1    γ          γ          γ
X_1 = 2    γ          γ(1 + θ)   γ(1 + θ)

46 Historical epistatic disease models
Figure panels: RR, DD, RD. Axis labels: Power, Ratio, r². Methods compared: PLINK, T_IH, BOOST, IndOR, Case Only.

47 Biological epistatic disease models

I: Interface
            X_2 = BB   X_2 = Bb   X_2 = bb
X_1 = AA    γ          γ          γ
X_1 = Aa    γ          γ(1 + θ)   γ
X_1 = aa    γ          γ          γ

Mod: Modifying-effect
            X_2 = BB   X_2 = Bb   X_2 = bb
X_1 = AA    γ          γ          γ
X_1 = Aa    γ          γ          γ(1 + θ)
X_1 = aa    γ(1 + θ)   γ(1 + θ)   γ(1 + θ)

48 Biological epistatic disease models
Figure panels: I, Mod. Axis labels: Power, Ratio, r². Methods compared: PLINK, T_IH, BOOST, IndOR, Case Only.

49 Crohn's disease hits
Table columns: Control set (Shared or Combined), Statistic (PLINK, T_IH, BOOST, IndOR, CaseOnly), SNP1, Chr1 (Position), SNP2, Chr2 (Position), p-value, corrected p-value
The table lists two blocks of ten rows, one row per control set and statistic.

50 Outline
  1 Genome-wide association studies
  2 Power in single-locus association
  3 Two-locus association
  4 Conclusion

51 Conclusion/Discussion
Single-locus statistical tests are not equivalent:
  The $\chi^2$ test always outperforms $z^2$.
  The comparison between $\chi^2$ and LRT depends jointly on the observed proportion of cases ($\phi$) and the frequency of the variant ($\pi_1$):

                    Causal effect           Protective effect
                    φ small    φ large      φ small    φ large
  Rare variant      χ²         LRT          LRT        χ²
  Common variant    LRT        χ²           χ²         LRT

  Future work:
    Effect of tagging: indirect association
    Test for linear trend (Cochran-Armitage test)
Two-locus interaction:
  δ approximation for counts
  Improvement of linear and non-linear tests
  Future work:
    Theoretical power study
    Investigation of the effect of tagging
Thank you for your attention!
