Theoretical and computational aspects of association tests: application in case-control genome-wide association studies.

Mathieu Emily
November 18, 2014, Caen
mathieu.emily@agrocampus-ouest.fr - Agrocampus Ouest - IRMAR, Rennes, France

Laboratoire de Mathématiques Appliquées de l'Agrocampus (LMA²)
http://math.agrocampus-ouest.fr/
People: 6 faculty members, 1 research assistant, 5 PhD students
Research: multivariate exploratory data analysis, biostatistics, high-dimensional data
Main topics: sensometrics, genomic data analysis

Outline
1 Genome-wide association studies
2 Power in single-locus association
3 Two-locus association
4 Conclusion

1 Genome-wide association studies
Context and problem statement

Genome-wide association studies (GWAS)
Case/control studies:
  Detection of differences in allele frequencies between case and control individuals
  Genotyping of individuals from both populations
Challenges:
  Technological: large increase in the number of markers on chips (100k, 300k, 500k and 1000k)
  Computational
  Statistical

Genome-wide association studies (GWAS): statistical and computational challenges
Individual   Phenotype   Marker 1   Marker 2   ...   Marker 500,000
             Y           X_1        X_2        ...   X_500,000
Id 1         healthy     AA         AC         ...   TG
Id 2         diseased    AC         AC         ...   GG
...
Id 1,000     diseased    AC         CC         ...   TG
Let $Y$ be a random variable with a Bernoulli distribution (the case where $Y$ is continuous is not treated here).
Let $X_i$, $i = 1, \dots, p$, be $p$ random variables with 3 states corresponding to the genotype at Marker $i$: $X_i = 0$ (homozygous for the major allele), $X_i = 1$ (heterozygous) and $X_i = 2$ (homozygous for the minor allele).
How is $Y$ explained by $\{X_i\}_{i = 1, \dots, p}$?
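The 0/1/2 coding of a marker can be sketched as follows. This is a minimal illustration, not code from the talk: the function name `encode_genotypes` is hypothetical, genotypes are assumed to be two-letter strings such as "AC", and the minor allele is assumed to be the less frequent allele in the sample.

```python
from collections import Counter

def encode_genotypes(genotypes):
    """Encode two-letter genotype strings as minor-allele counts (0, 1 or 2)."""
    # Count every allele observed at this marker across individuals.
    allele_counts = Counter(allele for g in genotypes for allele in g)
    if len(allele_counts) > 2:
        raise ValueError("marker is not biallelic")
    if len(allele_counts) == 1:
        # Monomorphic marker: nobody carries a minor allele.
        return [0] * len(genotypes)
    # Assumption: the minor allele is the rarer one (ties broken alphabetically).
    minor = min(allele_counts, key=lambda a: (allele_counts[a], a))
    return [g.count(minor) for g in genotypes]

marker = ["AA", "AC", "AC", "CC", "AA", "AA"]
codes = encode_genotypes(marker)
```

Here allele C is the minor allele, so each individual is coded by the number of C alleles carried.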

A success story? ...yes
Since 2005, many variants have been found to be associated with susceptibility to various complex diseases: prostate cancer, Crohn's disease, etc.
Figure: Manhattan plot for type 1 diabetes in the WTCCC dataset

A success story? ...yes and no
GWAS typically identify common variants with small effect sizes (lower right part of the graph).
Figure: from Bush WS, Moore JH, PLoS Comput Biol, 2012

A success story? ...no
GWAS have generated new challenges, such as the quest for the missing heritability.

Discrepancy between biology and statistics
In biology, GWAS are limited by complex phenomena such as:
  Genome structure
  Complexity of diseases
  Potential for a large number of false-positive results
The future is to put prior knowledge into the analysis... and potentially make the problem more complex.

Discrepancy between biology and statistics (continued)
From a statistical point of view, GWAS are challenging because of:
  Correlation between SNPs
  Interaction between variables
  High-dimensional problem with categorical variables
The future is to investigate the behavior of basic statistical procedures in this specific context.

2 Power in single-locus association
Direct single-locus association
Application with the WTCCC dataset

Single-locus association
GWAS are usually performed via a single-locus approach: each SNP is tested independently.
Question: what is the most powerful statistical test to detect a signal?
Figure: Manhattan plot for type 1 diabetes in the WTCCC dataset (Nature, 2007)

Theoretical context and notations
Let $X$ and $Y$ be two binary variables with values in $\{1, 2\}$. $X$ can be a biallelic biological marker; $Y$ can be the presence/absence of a disease.
Data are usually summarized in a 2x2 contingency table:
          X = 1    X = 2    Total
Y = 1     n_11     n_12     n_1. = N(1 - φ)
Y = 2     n_21     n_22     n_2. = Nφ
Total     n_.1     n_.2     N
where $n_{ij}$ is the number of observations with $Y = i$ and $X = j$.
The marginal counts for $Y$ are assumed to be fixed (one-margin fixed design). Let $\phi$ denote the balance of the design, i.e. the proportion of observations with $Y = 2$.
Detecting association between $X$ and $Y$ is equivalent to comparing two binomial proportions, $\pi_1$ and $\pi_2$, where $\pi_i = P[X = 2 \mid Y = i]$ for $i = 1, 2$.

Statistical hypothesis and tests (1)
Our objective is to test:
$$H_0: \pi_1 = \pi_2 \quad \text{vs} \quad H_1: \pi_1 \neq \pi_2 \qquad (1)$$
Exact tests: Fisher's exact test. The power function of an exact test is hardly tractable.
Asymptotic tests: Pearson's $\chi^2$ test and the likelihood ratio test (LRT).
The statistical hypothesis in Equation (1) can be reformulated as:
$$H_0: \log\left(\frac{\pi_1/(1-\pi_1)}{\pi_2/(1-\pi_2)}\right) = \log(OR(\pi_1, \pi_2)) = 0 \quad \text{vs} \quad H_1: \log\left(\frac{\pi_1/(1-\pi_1)}{\pi_2/(1-\pi_2)}\right) \neq 0$$
where $OR(\pi_1, \pi_2)$ is the odds ratio between $\pi_1$ and $\pi_2$. Statistical inference on the odds ratio can therefore be used.

Statistical hypothesis and tests (2)
Let us introduce the expected counts obtained under independence between $X$ and $Y$: $m_{ij} = \frac{n_{i.}\, n_{.j}}{N}$.
Pearson's $\chi^2$ statistic:
$$P = \sum_{i=1}^{2} \sum_{j=1}^{2} \frac{(n_{ij} - m_{ij})^2}{m_{ij}}$$
Likelihood ratio:
$$LR = 2 \sum_{i=1}^{2} \sum_{j=1}^{2} n_{ij} \log\left(\frac{n_{ij}}{m_{ij}}\right)$$
Odds-ratio inference:
$$z^2 = \left(\frac{t}{SE}\right)^2 \quad \text{with} \quad t = \log\left(\frac{n_{11}\, n_{22}}{n_{12}\, n_{21}}\right) \quad \text{and} \quad SE = \sqrt{\frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}}}$$
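The three statistics can be computed directly from the four cell counts. The sketch below is illustrative only: the function name `two_by_two_tests` and the example table are assumptions, and all cells are assumed to be strictly positive so that the logarithms and reciprocals are defined.

```python
import math

def two_by_two_tests(n11, n12, n21, n22):
    """Return (Pearson P, likelihood ratio LR, z^2) for a 2x2 table of counts."""
    n = [[n11, n12], [n21, n22]]
    total = n11 + n12 + n21 + n22
    row = [n11 + n12, n21 + n22]   # n_1., n_2.
    col = [n11 + n21, n12 + n22]   # n_.1, n_.2
    pearson = 0.0
    lr = 0.0
    for i in range(2):
        for j in range(2):
            m = row[i] * col[j] / total            # expected count m_ij
            pearson += (n[i][j] - m) ** 2 / m
            lr += 2 * n[i][j] * math.log(n[i][j] / m)
    t = math.log(n11 * n22 / (n12 * n21))          # log odds ratio
    se = math.sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)
    return pearson, lr, (t / se) ** 2

# Hypothetical table: rows are Y = 1, 2 and columns are X = 1, 2.
pearson, lr, z2 = two_by_two_tests(90, 10, 80, 20)
```

On this example the three values are close but not identical, which is the situation the power comparison in the following slides addresses.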

Statistical hypothesis and tests (3)
Under $H_0$, all three statistics asymptotically follow a central $\chi^2$ distribution with 1 df:
$$P \sim_{H_0} \chi^2(1), \quad LR \sim_{H_0} \chi^2(1), \quad z^2 \sim_{H_0} \chi^2(1)$$
Under $H_1$, each of the three statistics asymptotically follows a non-central $\chi^2$ distribution with 1 df:
$$P \sim_{H_1} \chi^2(\lambda_P, 1), \quad LR \sim_{H_1} \chi^2(\lambda_{LR}, 1), \quad z^2 \sim_{H_1} \chi^2(\lambda_{z^2}, 1)$$
Comparing the power of $P$, $LR$ and $z^2$ is therefore equivalent to comparing the non-centrality parameters $\lambda_P$, $\lambda_{LR}$ and $\lambda_{z^2}$.
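To see why a larger non-centrality parameter means more power, the power of a 1-df test can be computed from the standard normal distribution, since a non-central $\chi^2(\lambda, 1)$ variable is the square of a $N(\sqrt{\lambda}, 1)$ variable. The sketch below uses only the Python standard library; the function name `power_chi2_1df` and the default level 0.05 are assumptions made for illustration.

```python
import math
from statistics import NormalDist

def power_chi2_1df(lam, alpha=0.05):
    """Power of a 1-df chi-square test with non-centrality parameter lam."""
    std_normal = NormalDist()
    # Rejecting when the statistic exceeds the chi-square(1) critical value
    # is the same as rejecting when |Z| > z_crit with Z ~ N(sqrt(lam), 1).
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = math.sqrt(lam)
    upper_tail = 1 - std_normal.cdf(z_crit - shift)
    lower_tail = std_normal.cdf(-z_crit - shift)
    return upper_tail + lower_tail

power_null = power_chi2_1df(0.0)     # equals the level alpha
power_alt = power_chi2_1df(7.849)    # close to 80% power
```

Because this function is increasing in `lam`, ranking tests by power at a fixed level amounts to ranking their non-centrality parameters.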

Power study framework
In the context of 2x2 table analysis, power studies have been used to estimate the sample size needed to reach a given level of power: the power study is performed before the experiment.
Here we propose a post-hoc power study, which can be carried out after the experiment.
To compare the non-centrality parameters, we assume that $N$ is fixed and propose the following scheme:
1 Definition of a general situation for $H_1$
2 Estimation of the three non-centrality parameters ($\lambda_P$, $\lambda_{LR}$ and $\lambda_{z^2}$)
3 Theoretical comparison of the non-centrality parameter estimates

Local alternatives for $H_1$
We consider the situation of local alternatives given by $\pi_2 = \pi_1 + \frac{h}{\sqrt{N}}$.
Let us introduce the mean contingency table, NE, and the mean expected contingency table, ME, as follows:
NE:
          X = 1                        X = 2                    Total
Y = 1     ne_11 = N(1 - π_1)(1 - φ)    ne_12 = N π_1 (1 - φ)    N(1 - φ)
Y = 2     ne_21 = N(1 - π_2) φ         ne_22 = N π_2 φ          Nφ
Total     n_.1 = N(1 - π̄)              n_.2 = N π̄               N
ME:
          X = 1                        X = 2                    Total
Y = 1     me_11 = N(1 - π̄)(1 - φ)      me_12 = N π̄ (1 - φ)      N(1 - φ)
Y = 2     me_21 = N(1 - π̄) φ           me_22 = N π̄ φ            Nφ
Total     n_.1 = N(1 - π̄)              n_.2 = N π̄               N
where $\bar{\pi} = \pi_1(1-\phi) + \pi_2\phi$.

Estimation of the non-central parameters
Under local alternatives, the non-centrality parameter $\lambda$ is asymptotically equal to the test statistic calculated on NE and ME. Thus, estimates of the non-centrality parameters are given by:
$$\lambda_P = \sum_{i=1}^{2} \sum_{j=1}^{2} \frac{(ne_{ij} - me_{ij})^2}{me_{ij}}$$
$$\lambda_{LR} = 2 \sum_{i=1}^{2} \sum_{j=1}^{2} ne_{ij} \log\left(\frac{ne_{ij}}{me_{ij}}\right)$$
$$\lambda_{z^2} = \left(\frac{t_e}{SE_e}\right)^2 \quad \text{with} \quad t_e = \log\left(\frac{ne_{11}\, ne_{22}}{ne_{12}\, ne_{21}}\right) \quad \text{and} \quad SE_e = \sqrt{\frac{1}{ne_{11}} + \frac{1}{ne_{12}} + \frac{1}{ne_{21}} + \frac{1}{ne_{22}}}$$
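These estimates are straightforward to evaluate numerically by building NE and ME for given values of $N$, $\phi$, $\pi_1$ and $h$. The following is a minimal sketch under the local-alternative setup described earlier; the function name and the parameter values in the example are hypothetical.

```python
import math

def noncentrality_parameters(N, phi, pi1, h):
    """Evaluate the three test statistics on the mean tables NE and ME."""
    pi2 = pi1 + h / math.sqrt(N)             # local alternative
    pibar = pi1 * (1 - phi) + pi2 * phi      # pooled frequency of X = 2
    # Mean table NE (rows Y = 1, 2; columns X = 1, 2).
    ne = [[N * (1 - pi1) * (1 - phi), N * pi1 * (1 - phi)],
          [N * (1 - pi2) * phi, N * pi2 * phi]]
    # Mean expected table ME, obtained from the margins of NE.
    me = [[N * (1 - pibar) * (1 - phi), N * pibar * (1 - phi)],
          [N * (1 - pibar) * phi, N * pibar * phi]]
    cells = [(i, j) for i in range(2) for j in range(2)]
    lam_p = sum((ne[i][j] - me[i][j]) ** 2 / me[i][j] for i, j in cells)
    lam_lr = 2 * sum(ne[i][j] * math.log(ne[i][j] / me[i][j]) for i, j in cells)
    t_e = math.log(ne[0][0] * ne[1][1] / (ne[0][1] * ne[1][0]))
    se_e = math.sqrt(sum(1 / ne[i][j] for i, j in cells))
    return lam_p, lam_lr, (t_e / se_e) ** 2

# Hypothetical setting: rare variant, few cases, positive effect.
lam_p, lam_lr, lam_z2 = noncentrality_parameters(N=5000, phi=0.1, pi1=0.05, h=1.0)
```

In this particular setting the Pearson value is the largest of the three, in line with the comparison made in the slides that follow.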

Taylor approximations
When $h$ is small we have:
$$\lambda_P = \phi(1-\phi)h^2 \sum_{k \geq 2} \left(-\frac{h}{\sqrt{N}}\right)^{k-2} g_k(\pi_1)\, \phi^{k-2}$$
$$\lambda_{LR} = \phi(1-\phi)h^2 \sum_{k \geq 2} \left(-\frac{h}{\sqrt{N}}\right)^{k-2} g_k(\pi_1)\, \frac{2}{k(k-1)} \sum_{i=0}^{k-2} \phi^i$$
where
$$g_k(\pi_1) = \left(\frac{1}{\pi_1}\right)^{k-1} - \left(\frac{-1}{1-\pi_1}\right)^{k-1} = \frac{(1-\pi_1)^{k-1} - (-\pi_1)^{k-1}}{(\pi_1(1-\pi_1))^{k-1}}$$
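As a sketch of where the Pearson series comes from (a derivation from the local-alternative definitions $\pi_2 = \pi_1 + h/\sqrt{N}$ and $\bar{\pi} = \pi_1(1-\phi) + \pi_2\phi$, not reproduced from the talk), note that the statistic evaluated on NE and ME has a closed form that can be expanded as a geometric series in $x = \phi h/\sqrt{N}$:

```latex
\begin{align*}
\lambda_P &= \frac{\phi(1-\phi)\,h^2}{\bar{\pi}(1-\bar{\pi})},
  \qquad \bar{\pi} = \pi_1 + x, \quad x = \frac{\phi h}{\sqrt{N}}, \\
\frac{1}{\bar{\pi}(1-\bar{\pi})}
  &= \frac{1}{\pi_1 + x} + \frac{1}{1 - \pi_1 - x}
   = \sum_{j \geq 0} \left[ \frac{(-x)^j}{\pi_1^{\,j+1}} + \frac{x^j}{(1-\pi_1)^{j+1}} \right] \\
  &= \sum_{k \geq 2} (-x)^{k-2}
     \left[ \left(\frac{1}{\pi_1}\right)^{k-1} - \left(\frac{-1}{1-\pi_1}\right)^{k-1} \right]
   = \sum_{k \geq 2} \left(-\frac{h}{\sqrt{N}}\right)^{k-2} \phi^{k-2}\, g_k(\pi_1).
\end{align*}
```

The expansion is valid for $|x| < \min(\pi_1, 1-\pi_1)$, which is the sense in which $h$ must be small.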

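Taylor approximations (2): 4th order
When $h$ is small we have:
$$\lambda_P - \lambda_{LR} \approx \frac{h^3}{\sqrt{N}}\, \phi(1-\phi) \left[ -\frac{2\phi - 1}{3}\, g_3(\pi_1) + \frac{h}{\sqrt{N}}\, \frac{5\phi^2 - \phi - 1}{6}\, g_4(\pi_1) \right]$$
and:
$$\lambda_P - \lambda_{z^2} \approx \frac{h^4}{N}\, \phi(1-\phi) \left( \frac{1}{12} - \phi(1-\phi)\,\pi_1(1-\pi_1) \right) \frac{g_4(\pi_1)}{3\pi_1^2 - 3\pi_1 + 1} > 0$$
with $g_k(\pi_1) = \left(\frac{1}{\pi_1}\right)^{k-1} - \left(\frac{-1}{1-\pi_1}\right)^{k-1}$.
Parameters of importance: $\phi$ and $\pi_1$. What about $h$?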

$\chi^2$ - LRT
Figure: plot of the difference in power between $\chi^2$ and LRT

$\chi^2$ - $z^2$
Figure: plot of the difference in power between $\chi^2$ and $z^2$

Power comparison for $\phi = 0.1$
Figure panels: $\pi_1 = 0.05$, $\pi_1 = 0.1$, $\pi_1 = 0.4$
If $\pi_1$ is small, the power differs between $\chi^2$ and LR.

Power comparison for $\phi = 0.5$
Figure panels: $\pi_1 = 0.05$, $\pi_1 = 0.1$, $\pi_1 = 0.4$
Similar powers for each test.

Power comparison for $\phi = 0.9$
Figure panels: $\pi_1 = 0.05$, $\pi_1 = 0.1$, $\pi_1 = 0.4$
If $\pi_1$ is small, the power differs between $\chi^2$ and LR.

Recommendations
$\chi^2$ always outperforms $z^2$.
If $h > 0$ (causal effect):
  $\pi_1$ small and $\phi$ small: $\chi^2$ > LRT
  $\pi_1$ small and $\phi$ high: $\chi^2$ < LRT
If $h < 0$ (protective effect):
  $\pi_1$ small and $\phi$ small: $\chi^2$ < LRT
  $\pi_1$ small and $\phi$ high: $\chi^2$ > LRT
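The sign pattern of these recommendations can be checked with fourth-order expansions of the differences between non-centrality parameters. The formulas in the sketch below are an assumption: they were derived for this example from the local-alternative setup ($\pi_2 = \pi_1 + h/\sqrt{N}$) and may differ in notation from the talk; the function names and parameter values are hypothetical.

```python
import math

def g(k, pi1):
    """g_k(pi1) = (1/pi1)^(k-1) - (-1/(1-pi1))^(k-1)."""
    return (1 / pi1) ** (k - 1) - (-1 / (1 - pi1)) ** (k - 1)

def fourth_order_differences(N, phi, pi1, h):
    """Approximate (lambda_P - lambda_LR, lambda_P - lambda_z2) up to order h^4."""
    root_n = math.sqrt(N)
    common = phi * (1 - phi)
    d_lr = common * h ** 3 / root_n * (
        -(2 * phi - 1) / 3 * g(3, pi1)
        + h / root_n * (5 * phi ** 2 - phi - 1) / 6 * g(4, pi1))
    d_z2 = (common * h ** 4 / N
            * (1 / 12 - common * pi1 * (1 - pi1)) / (pi1 * (1 - pi1)) ** 3)
    return d_lr, d_z2

# Rare variant (pi1 = 0.05), causal (h > 0) or protective (h < 0) effect,
# few cases (phi = 0.1) or many cases (phi = 0.9).
signs = {}
for h in (1.0, -1.0):
    for phi in (0.1, 0.9):
        d_lr, d_z2 = fourth_order_differences(N=5000, phi=phi, pi1=0.05, h=h)
        signs[(h, phi)] = (d_lr > 0, d_z2 > 0)
```

A positive first entry means the Pearson test has the larger non-centrality parameter than the LRT; the second entry is positive in every scenario, matching the statement that $\chi^2$ outperforms $z^2$.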

Benchmark dataset: WTCCC (Nature, 2007)
500,000 single nucleotide polymorphisms (SNPs) ($X_i$)
3,000 controls
7 diseases with 2,000 cases for each disease
Two possible strategies for studying Crohn's disease:
1 2,000 cases vs 3,000 controls: $\phi = 0.4$
2 2,000 cases vs 15,000 controls: $\phi \approx 0.12$
The following filters are used:
  Control of the number of missing data (< 50)
  Control of Hardy-Weinberg equilibrium (p-value > 0.05)
  Restriction to rare alleles: $0.005 \leq f \leq 0.05$
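The design balance for the two strategies and the three filters can be expressed in a few lines. This is only a sketch: the dictionary keys describing a SNP (`n_missing`, `hwe_pvalue`, `minor_allele_freq`) are hypothetical names, not fields of the WTCCC data files.

```python
def design_balance(n_cases, n_controls):
    """Proportion of cases in the sample (phi in the notation of the slides)."""
    return n_cases / (n_cases + n_controls)

def passes_filters(snp):
    """Apply the three quality filters to one SNP summary (hypothetical keys)."""
    return (snp["n_missing"] < 50
            and snp["hwe_pvalue"] > 0.05
            and 0.005 <= snp["minor_allele_freq"] <= 0.05)

phi_strategy_1 = design_balance(2000, 3000)     # cases vs the 3,000 controls
phi_strategy_2 = design_balance(2000, 15000)    # cases vs the 15,000 controls
example_snp = {"n_missing": 12, "hwe_pvalue": 0.31, "minor_allele_freq": 0.02}
keep = passes_filters(example_snp)
```

The second strategy gives a much more unbalanced design, which is the regime where the tests are expected to differ most for rare variants.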

Chromosome 20
Ranking can change between tests.
SNP ranking   1    2    3    4    5    6    7    8    9    10   11   12   13   14   15
χ²            78   301  300  303  302  18   304  63   279  547  299  269  29   371  330
LR            301  78   300  303  302  304  18   547  299  63   279  269  29   330  371
z²            301  78   300  303  302  18   304  63   547  279  299  269  29   330  371

3 Two-locus association
Odds-ratio and δ method for counts
Statistical interaction
Biological interaction

Gene-gene interaction
Single-locus scans fail to explain biological complexity:
  Protein interaction networks
  Pathways
A natural extension of the single-locus approach is the two-locus approach: SNP-SNP interaction or gene-gene interaction.
Main challenges:
  The number of tests: 125 billion tests ($1.25 \times 10^{11}$)
  The large class of interaction models
One useful tool: approximation of odds-ratio inference using the δ method
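The order of magnitude quoted for the number of tests follows from counting unordered pairs among 500,000 SNPs, as this short check illustrates (assuming every pair of SNPs is tested exactly once):

```python
import math

n_snps = 500_000
# Number of unordered SNP pairs: n * (n - 1) / 2.
n_pairs = math.comb(n_snps, 2)
```

This gives 124,999,750,000 pairs, i.e. about $1.25 \times 10^{11}$ tests.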

Inference on odds-ratio
The aim is to test the association between $Y$ and the $m$ categories of $X_k$ with:
$$\Phi = [OR(x_1), \dots, OR(x_m)]$$
The null hypothesis can be written as:
$$H_0: \Phi = [1, \dots, 1]$$
or equivalently:
$$H_0: \Psi = [\psi(x_1), \dots, \psi(x_m)] = [\log(OR(x_1)), \dots, \log(OR(x_m))] = [0, \dots, 0]$$

Classical test in genetic epidemiology
Test: let $W = \Psi V^{-1} \Psi^t$, where $\Psi = [\psi(x_1), \dots, \psi(x_m)]$ and $V$ is the variance-covariance matrix of $\Psi$.
As $W$ is a Wald statistic, we have $W \sim \chi^2(m)$ under $H_0$.
In practice:
  $\Psi$ is estimated using maximum likelihood estimation
  Estimating $V^{-1}$ is more complex

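Estimation of $\Psi$ using MLE
Contingency tables are given by:
$$Y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \quad X_l = \begin{pmatrix} x_{1l} \\ \vdots \\ x_{nl} \end{pmatrix} \quad \Rightarrow \quad \begin{pmatrix} n_1^0 & n_1^1 \\ \vdots & \vdots \\ n_m^0 & n_m^1 \end{pmatrix}$$
where $n_k^s$ is the number of individuals $i$ with $y_i = s$ and $x_{il} = k$.
Then, with $x_0$ the reference category:
$$OR(x_l) = \frac{P(Y = 1 \mid X = x_l)}{P(Y = 0 \mid X = x_l)} \left( \frac{P(Y = 1 \mid X = x_0)}{P(Y = 0 \mid X = x_0)} \right)^{-1}$$
can be estimated by:
$$\widehat{OR}(x_l) = \frac{n_l^1\, n_{x_0}^0}{n_l^0\, n_{x_0}^1}$$
$$\hat{\psi}(x_l) = \log(n_l^1) - \log(n_l^0) + \log(n_{x_0}^0) - \log(n_{x_0}^1)$$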

Estimation of $V$ (1): δ approximation
Counts are assumed to follow a multinomial distribution:
$$[N_{x_0}^1; \dots; N_{x_m}^1] \sim \mathrm{Mult}(p_{x_0}^1; \dots; p_{x_m}^1)$$
With $n^1$ the total number of individuals with $Y = 1$, we can write:
$$N_{x_l}^1 \approx n^1 p_{x_l}^1 + \sqrt{n^1 p_{x_l}^1 (1 - p_{x_l}^1)}\; \delta_{x_l}^1$$
$$\log(N_{x_l}^1) \approx \log(n^1 p_{x_l}^1) + \sqrt{\frac{1 - p_{x_l}^1}{n^1 p_{x_l}^1}}\; \delta_{x_l}^1$$
with:
$$\delta_{x_l}^1 \sim N(0, 1), \qquad Cov(\delta_{x_l}^1; \delta_{x_n}^1) = -\sqrt{\frac{p_{x_l}^1\, p_{x_n}^1}{(1 - p_{x_l}^1)(1 - p_{x_n}^1)}} \quad \text{if } l \neq n$$

Estimation of $V$ (2): example
$$Cov(\psi(x_k), \psi(x_l)) = Cov\Big( \log(N_k^1) - \log(N_k^0) + \log(N_{x_0}^0) - \log(N_{x_0}^1),\; \log(N_l^1) - \log(N_l^0) + \log(N_{x_0}^0) - \log(N_{x_0}^1) \Big)$$
Approximated thanks to:
  $\log(N_{x_l}^1) \approx \log(n^1 p_{x_l}^1) + \sqrt{\frac{1 - p_{x_l}^1}{n^1 p_{x_l}^1}}\; \delta_{x_l}^1$
  The variance-covariance structure of the δ's
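A minimal sketch of the resulting Wald statistic follows. It rests on assumptions that go beyond the slides: cases and controls are treated as two independent multinomial samples, in which case the δ method gives $Var(\psi(x_l)) \approx 1/n_l^1 + 1/n_l^0 + 1/n_{x_0}^1 + 1/n_{x_0}^0$ and $Cov(\psi(x_k), \psi(x_l)) \approx 1/n_{x_0}^1 + 1/n_{x_0}^0$ for $k \neq l$. Since $V$ is then a diagonal matrix plus a constant, it is inverted here with the Sherman-Morrison formula rather than a general solver. The function name `wald_log_or` is hypothetical.

```python
import math

def wald_log_or(cases, controls):
    """Wald statistic W = Psi V^-1 Psi^t for log odds ratios vs category 0.

    cases[k] and controls[k] are the counts for category k; index 0 is the
    reference category x_0. All counts are assumed strictly positive.
    """
    ks = range(1, len(cases))
    psi = [math.log(cases[k]) - math.log(controls[k])
           + math.log(controls[0]) - math.log(cases[0]) for k in ks]
    # Delta-method approximation: V = diag(d) + c * (matrix of ones).
    d = [1 / cases[k] + 1 / controls[k] for k in ks]
    c = 1 / cases[0] + 1 / controls[0]
    # Sherman-Morrison: V^-1 = D^-1 - c D^-1 1 1^t D^-1 / (1 + c * sum(1/d)).
    s_quad = sum(p * p / dk for p, dk in zip(psi, d))
    s_lin = sum(p / dk for p, dk in zip(psi, d))
    s_inv = sum(1 / dk for dk in d)
    w = s_quad - c * s_lin ** 2 / (1 + c * s_inv)
    return psi, w

# Two categories only: W reduces to the z^2 statistic of a 2x2 table.
psi_2x2, w_2x2 = wald_log_or(cases=[10, 20], controls=[90, 80])
```

With $m$ non-reference categories, `w` would be compared with a $\chi^2(m)$ distribution.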

Application to statistical interaction: deviation from linearity (1)
Let $X = (X_k, X_l)$ be a pair of SNPs with 9 categories: $x_0$ = AABB, $x_1$ = AABb, $x_2$ = AAbb, $x_3$ = AaBB, $x_4$ = AaBb, $x_5$ = Aabb, $x_6$ = aaBB, $x_7$ = aaBb, $x_8$ = aabb.
The saturated logistic model is given by:
$$\mathrm{logit}(P(Y = 1 \mid X)) = \beta_0 + \sum_{i \in \{Aa; aa\}} \beta_i I_{X_k = i} + \sum_{j \in \{Bb; bb\}} \beta_j I_{X_l = j} + \sum_{i \in \{Aa; aa\}} \sum_{j \in \{Bb; bb\}} \beta_{ij} I_{X_k = i; X_l = j}$$
The test for interaction consists in testing:
$$[\beta_{AaBb}, \beta_{Aabb}, \beta_{aaBb}, \beta_{aabb}] = [0, 0, 0, 0]$$

Application to statistical interaction: deviation from linearity (2)
$H_0$ can be formulated as: $\forall i \in \{Aa, aa\}$ and $\forall j \in \{Bb, bb\}$,
$$OR(X_k = i, X_l = j) = OR(X_k = AA, X_l = j)\, OR(X_k = i, X_l = BB)$$
which, in terms of counts, reads:
$$\frac{n_{ij}^1\, n_{AABB}^1}{n_{iBB}^1\, n_{AAj}^1} = \frac{n_{ij}^0\, n_{AABB}^0}{n_{iBB}^0\, n_{AAj}^0}$$
that is:
$$\Psi = [\psi_{AaBb}; \psi_{Aabb}; \psi_{aaBb}; \psi_{aabb}] = [0; 0; 0; 0] \quad \text{with} \quad \psi_{ij} = \log\left( \frac{n_{ij}^a\, n_{00}^a \,/\, (n_{i0}^a\, n_{0j}^a)}{n_{ij}^u\, n_{00}^u \,/\, (n_{i0}^u\, n_{0j}^u)} \right)$$
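The four contrasts $\psi_{ij}$ can be computed from two 3x3 genotype tables, one for affected (cases) and one for unaffected (controls) individuals. The sketch below is illustrative: the function name, the table layout (rows for the $X_k$ genotype AA, Aa, aa and columns for the $X_l$ genotype BB, Bb, bb) and the counts are assumptions.

```python
import math

def interaction_log_ors(cases, controls):
    """Return psi_ij for i, j in {1, 2} from 3x3 count tables indexed [i][j]."""
    def contrast(n, i, j):
        # log of n_ij * n_00 / (n_i0 * n_0j) within one group
        return math.log(n[i][j] * n[0][0] / (n[i][0] * n[0][j]))
    return {(i, j): contrast(cases, i, j) - contrast(controls, i, j)
            for i in (1, 2) for j in (1, 2)}

# Hypothetical counts: controls are exactly multiplicative, and cases are
# multiplicative except for the AaBb cell, which is doubled.
controls = [[50, 25, 25], [30, 15, 15], [10, 5, 5]]
cases = [[40, 20, 10], [20, 20, 5], [8, 4, 2]]
psi = interaction_log_ors(cases, controls)
```

Only the AaBb contrast departs from zero here, with value $\log 2$; a test would then combine the four contrasts through their covariance matrix.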

Computational cost
Comparative analysis between a Wald test and a likelihood ratio test (LRT):
nsim    Time LRT (sec)    Time Wald (sec)
1000    65.42             35.84
2000    120.98            70.36
Execution time is divided by almost 2.

WTCCC analysis
After filtering using prior knowledge, 3.5 million tests have been performed.
Overall analysis of the 7 diseases from the WTCCC.

Crohn's disease
Significant hit between two genes: APC and IQGAP1
p-value: $1.13 \times 10^{-9}$, and $6.4 \times 10^{-4}$ after multiple testing correction
Biological insights for the interaction
M. Emily et al., European Journal of Human Genetics, 2009.
Figure: QQ-plot for Crohn's disease with (black) and without (blue) the APC-IQGAP1 interaction

Application to biological interaction: non-linearity effect (IndOR)
IndOR: Independent Odds Ratio
IndOR is based on a definition of epistasis (Cordell, 2002).
The absence of epistasis means that the two genes share the same amount of dependency in cases and in controls.
For a pair of SNPs $(X_k, X_l)$, $H_0$ can be formulated as: $\forall i \in \{AA, Aa, aa\}$ and $\forall j \in \{BB, Bb, bb\}$,
$$\frac{P((X_k, X_l) = (i, j) \mid Y = 1)}{P(X_k = i \mid Y = 1)\, P(X_l = j \mid Y = 1)} = \frac{P((X_k, X_l) = (i, j) \mid Y = 0)}{P(X_k = i \mid Y = 0)\, P(X_l = j \mid Y = 0)}$$

IndOR: Independent Odds Ratio
Thanks to Bayes' formula, the null hypothesis
$$\frac{P((X_k, X_l) = (i, j) \mid Y = 1)}{P(X_k = i \mid Y = 1)\, P(X_l = j \mid Y = 1)} = \frac{P((X_k, X_l) = (i, j) \mid Y = 0)}{P(X_k = i \mid Y = 0)\, P(X_l = j \mid Y = 0)}$$
can be written as:
$$\psi_{ij} = \log\left( \frac{OR(x_i, x_j)}{OR(x_i)\, OR(x_j)} \right) = 0$$
IndOR $= \Psi V^{-1} \Psi^t$, with $\Psi = [\psi_{AaBb}, \psi_{Aabb}, \psi_{aaBb}, \psi_{aabb}]$
IndOR $\sim \chi^2(4)$ under $H_0$
M. Emily, Statistics in Medicine, 2012.

Historical epistatic disease models
Each table gives the disease risk for genotype $X_1$ (rows) and $X_2$ (columns).
DD: Jointly Dominant-Dominant
X_1 \ X_2   0    1          2
0           γ    γ          γ
1           γ    γ(1 + θ)   γ(1 + θ)
2           γ    γ(1 + θ)   γ(1 + θ)
RR: Jointly Recessive-Recessive
X_1 \ X_2   0    1    2
0           γ    γ    γ
1           γ    γ    γ
2           γ    γ    γ(1 + θ)
RD: Jointly Recessive-Dominant
X_1 \ X_2   0    1          2
0           γ    γ          γ
1           γ    γ          γ
2           γ    γ(1 + θ)   γ(1 + θ)

Historical epistatic disease models: power
Model   Ratio   r²   PLINK   T_IH   BOOST   IndOR   Case Only
RR      1       0    0.36    0.63   0.53    0.55    0.97
RR      2       0    0.46    0.75   0.74    0.72    0.97
RR      5       0    0.55    0.85   0.85    0.89    0.97
RR      10      0    0.61    0.90   0.92    0.96    0.97
DD      1       0    0.49    0.78   0.45    0.62    0.80
DD      2       0    0.61    0.89   0.57    0.72    0.80
DD      5       0    0.71    0.95   0.67    0.76    0.80
DD      10      0    0.75    0.96   0.71    0.78    0.80
RD      1       0    0.52    0.81   0.64    0.70    0.97
RD      2       0    0.65    0.90   0.81    0.84    0.97
RD      5       0    0.74    0.96   0.90    0.93    0.97
RD      10      0    0.78    0.97   0.93    0.95    0.97

Biological epistatic disease models
I: Interface
X_1 \ X_2   BB   Bb         bb
AA          γ    γ          γ
Aa          γ    γ(1 + θ)   γ
aa          γ    γ          γ
Mod: Modifying-effect
X_1 \ X_2   BB         Bb         bb
AA          γ          γ          γ
Aa          γ          γ          γ(1 + θ)
aa          γ(1 + θ)   γ(1 + θ)   γ(1 + θ)
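The five disease models (DD, RR, RD, I and Mod) differ only in which genotype combinations carry the elevated risk $\gamma(1 + \theta)$. The sketch below encodes this for simulation purposes; it is an illustration under assumptions: genotypes are coded 0, 1, 2 with rows for $X_1$ and columns for $X_2$, and the function name and the values of $\gamma$ and $\theta$ are hypothetical.

```python
def penetrance_table(model, gamma, theta):
    """3x3 table of disease risks P(Y = 1 | X_1 = i, X_2 = j) for one model."""
    at_risk = {
        "DD": lambda i, j: i >= 1 and j >= 1,      # jointly dominant-dominant
        "RR": lambda i, j: i == 2 and j == 2,      # jointly recessive-recessive
        "RD": lambda i, j: i == 2 and j >= 1,      # jointly recessive-dominant
        "I": lambda i, j: i == 1 and j == 1,       # interface
        "Mod": lambda i, j: i == 2 or (i == 1 and j == 2),  # modifying effect
    }[model]
    return [[gamma * (1 + theta) if at_risk(i, j) else gamma
             for j in range(3)] for i in range(3)]

# Baseline risk 0.1, doubled for the at-risk genotype combinations.
tables = {m: penetrance_table(m, gamma=0.1, theta=1.0)
          for m in ("DD", "RR", "RD", "I", "Mod")}
n_elevated = {m: sum(cell > 0.1 for row in t for cell in row)
              for m, t in tables.items()}
```

Sampling case and control genotypes from such a table is one way to set up the kind of power comparison reported for these models.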

Biological epistatic disease models: power
Model   Ratio   r²   PLINK   T_IH   BOOST   IndOR   Case Only
I       1       0    0.06    0.08   0.36    0.40    0.66
I       2       0    0.07    0.10   0.47    0.50    0.66
I       5       0    0.08    0.11   0.58    0.59    0.66
I       10      0    0.08    0.11   0.63    0.64    0.66
Mod     1       0    0.05    0.06   0.53    0.55    0.83
Mod     2       0    0.06    0.06   0.68    0.72    0.83
Mod     5       0    0.06    0.06   0.80    0.85    0.83
Mod     10      0    0.06    0.07   0.85    0.90    0.83

Crohn's disease hits
SNP1 rs6496669, chromosome 15 (position 88696269); SNP2 rs434157, chromosome 5 (position 112219541):
Control set   Statistic   p-value          Corrected p-value
Shared        PLINK       1.16 × 10^-5     1
Combined      PLINK       3.17 × 10^-5     1
Shared        T_IH        1.19 × 10^-5     1
Combined      T_IH        2.65 × 10^-5     1
Shared        BOOST       7.03 × 10^-9     2.11 × 10^-3
Combined      BOOST       3.55 × 10^-9     1.06 × 10^-3
Shared        IndOR       4.44 × 10^-9     1.33 × 10^-3
Combined      IndOR       9.42 × 10^-14    2.83 × 10^-8
Shared        CaseOnly    3.70 × 10^-8     0.011
Combined      CaseOnly    3.70 × 10^-8     0.011
SNP1 rs9009, chromosome 8 (position 11739415); SNP2 rs2830075, chromosome 21 (position 26424313):
Control set   Statistic   p-value          Corrected p-value
Shared        PLINK       5.22 × 10^-4     1
Combined      PLINK       1.35 × 10^-2     1
Shared        T_IH        6.27 × 10^-4     1
Combined      T_IH        1.59 × 10^-2     1
Shared        BOOST       1.24 × 10^-4     1
Combined      BOOST       8.12 × 10^-4     1
Shared        IndOR       1.36 × 10^-6     0.40
Combined      IndOR       1.42 × 10^-7     0.042
Shared        CaseOnly    1.00 × 10^-5     1
Combined      CaseOnly    1.00 × 10^-5     1

4 Conclusion

Conclusion/Discussion
Single-locus statistical tests are not equivalent:
  The $\chi^2$ test always outperforms $z^2$.
  The comparison between $\chi^2$ and LRT depends jointly on the observed proportion of cases ($\phi$) and the frequency of the variant ($\pi_1$):
                   Causal effect               Protective effect
                   φ is small    φ is large    φ is small    φ is large
  Rare variant     χ²            LRT           LRT           χ²
  Common variant   LRT           χ²            χ²            LRT
  Future work:
    Effect of tagging: indirect association
    Test for linear trend (Cochran-Armitage test)
Two-locus interaction:
  δ approximation for counts
  Improvement of linear and non-linear tests
  Future work:
    Theoretical power study
    Investigation of the effect of tagging
Thank you for your attention!