High-dimensional regression and block-wise epistasis for association studies


High-dimensional regression and block-wise epistasis for association studies
V. Stanislas, C. Dalmasso, C. Ambroise
Laboratoire de Mathématiques et Modélisation d'Évry, "Statistique et Génome"

Summary
1. GWAS and blocks of linkage disequilibrium: Genome-Wide Association Studies; Blocks of linkage disequilibrium; Hierarchical Clustering with Adjacency Constraints; How to improve?; Some computation times
2. Epistasis: The Gene-Gene Eigen Epistasis; Modeling approach; Simulations
3. Application: Ankylosing Spondylitis; First results


Single-Nucleotide Polymorphism Data
SNPs account for 90% of human genetic variation.
In the human genome, SNPs with an allele frequency greater than 1% occur every 300 base pairs on average.
Two out of three SNPs substitute cytosine with thymine.
Figure: SNP (Wikipedia)

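SNP Data
Four possible C/T configurations, coded $X_{ij} \in \{0, 1, 2\}$ (individual $i$ at locus $j$):
  Father \ Mother   C   T
  C                 0   1
  T                 1   2
High dimension: $n$ individuals, $p$ markers
\[ X = \begin{pmatrix} X_{11} & \cdots & X_{1p} \\ \vdots & & \vdots \\ X_{n1} & \cdots & X_{np} \end{pmatrix} \]
Up to a few million SNPs can be genotyped: the number of variables far exceeds the number of individuals ($p \gg n$).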
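The 0/1/2 coding above can be sketched as follows. This is a minimal illustration that assumes genotypes arrive as two-letter strings and that T is the counted allele; the function name `encode_genotype` is hypothetical and not taken from the slides.

```python
def encode_genotype(allele_pair, counted_allele="T"):
    """Count the copies of `counted_allele` among the two inherited alleles."""
    return sum(1 for allele in allele_pair if allele == counted_allele)


def encode_matrix(raw_genotypes, counted_allele="T"):
    """Turn an n x p table of allele pairs into the X matrix coded in {0, 1, 2}."""
    return [[encode_genotype(g, counted_allele) for g in row] for row in raw_genotypes]


# Two individuals, three loci (toy data invented for the example).
X = encode_matrix([["CC", "CT", "TT"], ["TC", "CC", "CT"]])
```

Both heterozygous configurations (C from the father and T from the mother, or the reverse) map to 1, which is why four configurations give only three codes.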

Genome-Wide Association Studies
GWAS characteristics:
Objective: find associations between genetic markers ($\mathrm{SNP}_{i,j} \in \{0, 1, 2\}$) and a phenotypic trait ($Y_i \in \{0, 1\}$ or $Y_i \in \mathbb{R}$).
Figure: genetic markers (SNPs). Image credits: http://www.siriusgenomics.com/technology/ and © Pasieka, Science Photo Library.

Genome-Wide Association Studies
Generalized Linear Model
\[ g(E[Y_i \mid x_i]) = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}, \quad i = 1, \dots, n \]
$n$: number of individuals
$p$: number of covariates
$Y_i$: response for individual $i$
$x_{.j}$: observations for covariate $j$ (coded 0, 1 or 2)
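To illustrate the role of the link function $g$, here are the two textbook choices that match the two trait types considered (real-valued or binary). These are standard instances of a generalized linear model, not formulas taken from the slides.

```latex
% Continuous trait, identity link (linear regression)
E[Y_i \mid x_i] = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}

% Binary trait, logit link (logistic regression)
\log \frac{P(Y_i = 1 \mid x_i)}{1 - P(Y_i = 1 \mid x_i)} = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}
```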

Genome-Wide Association Studies
SNP analysis: differences between cases and controls at a specific SNP.
GWAS limits: reproducibility, heritability.
Data particularities: structure, high dimension ($p \gg n$), small effects.

The LD measures
Linkage disequilibrium: the non-random association of alleles at two or more loci. It depends on the difference between the observed allelic frequencies and those expected if alleles were distributed independently at random.
Computation: let $Z_j$ be the indicator of the presence of the minor allele for SNP $j$, with $Z_j \sim \mathcal{B}(p_j)$.
\[ D(j, k) = p_{jk} - p_j p_k = E[Z_j Z_k] - p_j p_k = \mathrm{cov}(Z_j, Z_k) \]
\[ r^2(j, k) = \mathrm{corr}(Z_j, Z_k)^2 \quad \text{or} \quad D'(j, k) = D(j, k) / D_{\max}(j, k) \]
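The three measures can be computed directly from two indicator vectors. The sketch below assumes both SNPs are polymorphic (otherwise the denominators vanish) and uses the usual convention for $D_{\max}$, reporting $D'$ in absolute value; the slides do not say which convention is used, and the function name `ld_measures` is hypothetical.

```python
import numpy as np


def ld_measures(z_j, z_k):
    """Return (D, r^2, |D'|) for two 0/1 minor-allele indicator vectors."""
    z_j = np.asarray(z_j, dtype=float)
    z_k = np.asarray(z_k, dtype=float)
    p_j, p_k = z_j.mean(), z_k.mean()
    p_jk = (z_j * z_k).mean()
    d = p_jk - p_j * p_k                      # D = E[Z_j Z_k] - p_j p_k
    r2 = d ** 2 / (p_j * (1 - p_j) * p_k * (1 - p_k))
    if d >= 0:
        d_max = min(p_j * (1 - p_k), (1 - p_j) * p_k)
    else:
        d_max = min(p_j * p_k, (1 - p_j) * (1 - p_k))
    return d, r2, abs(d) / d_max


# Perfectly associated indicators versus independent ones (toy data).
d_full, r2_full, dprime_full = ld_measures([1, 1, 0, 0], [1, 1, 0, 0])
d_none, r2_none, dprime_none = ld_measures([1, 1, 0, 0], [1, 0, 1, 0])
```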

How to estimate LD?
Genotype table for two SNPs (observed counts):
        vv   vV   VV
  uu    a    b    c
  uU    d    e    f
  UU    g    h    i
Haplotype table (unobserved counts):
        v    V
  u     α    β
  U     γ    δ
Only the genotype table is observed; α, β, γ, δ are estimated by solving a system of equations, e.g. α = 2a + b + d + pe, with p the "probability" of the haplotype pair (uv, UV) among double heterozygotes.
One estimates p, then (α, β, γ, δ), and finally $D = p_{UV} - p_U p_V$.
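One common way to solve such a system is an EM-style fixed-point iteration on p. The slides do not say which solver is used, so the sketch below is an assumption: it alternates between filling the haplotype counts with the current p and updating p from the resulting haplotype frequencies. The function name `estimate_ld_em` is hypothetical.

```python
def estimate_ld_em(counts, n_iter=100):
    """Estimate D from a 3x3 genotype table [[a, b, c], [d, e, f], [g, h, i]].

    Rows are uu, uU, UU and columns are vv, vV, VV.
    """
    (a, b, c), (d, e, f), (g, h, i) = counts
    total = 2.0 * (a + b + c + d + e + f + g + h + i)  # number of haplotypes
    p = 0.5  # initial guess for the (uv, UV) phase among double heterozygotes
    for _ in range(n_iter):
        # Haplotype counts given p: only cell e has an ambiguous phase.
        n_uv = 2 * a + b + d + p * e
        n_uV = 2 * c + b + f + (1 - p) * e
        n_Uv = 2 * g + d + h + (1 - p) * e
        n_UV = 2 * i + f + h + p * e
        f_uv, f_uV, f_Uv, f_UV = (x / total for x in (n_uv, n_uV, n_Uv, n_UV))
        # Updated probability that a double heterozygote is (uv, UV).
        denom = f_uv * f_UV + f_uV * f_Uv
        if denom > 0:
            p = f_uv * f_UV / denom
    p_U = f_Uv + f_UV
    p_V = f_uV + f_UV
    return f_UV - p_U * p_V


# Toy tables: no double heterozygotes, then some with perfect association.
d_simple = estimate_ld_em([[4, 0, 0], [0, 0, 0], [0, 0, 4]])
d_phased = estimate_ld_em([[4, 0, 0], [0, 4, 0], [0, 0, 4]])
```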

The LD block structure
The $r^2$ coefficients among the first 50 SNPs of chromosome 22 (Dalmasso et al. 2008): LD is structured in blocks.
[Figure: 50 x 50 heat map of $r^2$ values (scale 0.0 to 1.0), rows against columns.]

Hierarchical Clustering with Adjacency Constraints
[Figure: cluster dendrogram of 20 consecutive SNPs (vertical axis: height), shown with the LD correlation matrix.]

Block-Wise Approach using Linkage Disequilibrium (BALD)
1. Hierarchical clustering of the SNPs with adjacency constraint, using the LD similarity.
2. Estimation of the optimal number of groups using the Gap statistic (Tibshirani et al., 2001).

The h-band
All similarity coefficients outside a band of width $h$ are zero.
This leaves a $p \times h$ similarity matrix and a hierarchical clustering with adjacency constraint.

A pseudocode
Data: $X \in \{0, 1, 2\}^{n \times p}$, Sim
C ← {C_i = {X_.i}, i ∈ 1, ..., p}   /* clusters = singletons */
D ← {1 − Sim(X_.i, X_.(i+1)), i ∈ 1, ..., p − 1}
for step = 1 to p − 1 do
    i* ← argmin over i ∈ {1, ..., p − step} of D(C_i, C_{i+1})
    C ← C \ {C_{i*}, C_{i*+1}} ∪ {C_{i*} ∪ C_{i*+1}}
    d_1 ← D(C_{i*−1}, C_{i*} ∪ C_{i*+1})
    d_2 ← D(C_{i*} ∪ C_{i*+1}, C_{i*+2})
    D ← D \ {D(C_{i*−1}, C_{i*}), D(C_{i*}, C_{i*+1}), D(C_{i*+1}, C_{i*+2})} ∪ {d_1, d_2}
end

The Ward's distance
Ward Constrained Hierarchical Clustering
\[ d(A, B) = \frac{n_A n_B}{n_A + n_B} \left( \frac{1}{n_A^2} S_{A,A} + \frac{1}{n_B^2} S_{B,B} - \frac{2}{n_A n_B} S_{A,B} \right) \]
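A naive version of the adjacency-constrained clustering with this Ward distance can be sketched as below. It assumes a full $p \times p$ similarity matrix `S` and recomputes every block sum at each step, so it has none of the band, pencil, or heap optimizations; it only illustrates the merge rule. The names `ward_distance` and `constrained_clustering` are hypothetical.

```python
import numpy as np


def ward_distance(S, a, b):
    """Ward distance between clusters a and b, given as half-open index ranges."""
    (a0, a1), (b0, b1) = a, b
    n_a, n_b = a1 - a0, b1 - b0
    s_aa = S[a0:a1, a0:a1].sum()   # S_{A,A}: sum of similarities within A
    s_bb = S[b0:b1, b0:b1].sum()   # S_{B,B}: sum of similarities within B
    s_ab = S[a0:a1, b0:b1].sum()   # S_{A,B}: sum of similarities between A and B
    scale = n_a * n_b / (n_a + n_b)
    return scale * (s_aa / n_a**2 + s_bb / n_b**2 - 2 * s_ab / (n_a * n_b))


def constrained_clustering(S):
    """Repeatedly merge the closest pair of adjacent clusters; return the merges."""
    clusters = [(i, i + 1) for i in range(S.shape[0])]  # singletons
    merges = []
    while len(clusters) > 1:
        dists = [ward_distance(S, clusters[i], clusters[i + 1])
                 for i in range(len(clusters) - 1)]
        i = int(np.argmin(dists))
        merges.append((clusters[i], clusters[i + 1], float(dists[i])))
        clusters[i:i + 2] = [(clusters[i][0], clusters[i + 1][1])]
    return merges


# Toy similarity matrix with two perfect blocks of two SNPs each.
S = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]], dtype=float)
merges = constrained_clustering(S)
```

For two singletons with unit self-similarity the distance reduces to 1 − Sim, which matches the initialization of D in the pseudocode.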

The pencils trick: calculating $S_{AA}$ and $S_{AB}$

The pencils
Assessing $S_{AA}$, $S_{BB}$ and $S_{AB}$ requires the calculation of sums of LD measures within pencil-shaped areas defined by:
direction: right or left
depth: hloc
end point: lim
Two arrays of size $p \times h$ store the pencil sums.

The binary min-heap
Every node is less than or equal to each of its children.
The heap is uniquely represented by storing its level-order traversal in an array.
Given a position $i$: Parent(i) = ⌊i/2⌋, Left(i) = 2i, Right(i) = 2i + 1
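The index arithmetic can be checked with a few lines of code. This sketch uses 1-based positions, with an unused slot at index 0 of the Python list so that the formulas apply unchanged; the helper names are hypothetical.

```python
def parent(i):
    return i // 2


def left(i):
    return 2 * i


def right(i):
    return 2 * i + 1


def is_min_heap(a):
    """Check the heap property; a[0] is padding so that positions start at 1."""
    n = len(a) - 1
    return all(a[parent(i)] <= a[i] for i in range(2, n + 1))


ok = is_min_heap([None, 1, 3, 2, 7, 4])   # level-order traversal of a valid heap
bad = is_min_heap([None, 5, 3])           # the root is larger than its child
```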

DeleteMin
Time complexity: $O(\log(p))$

InsertHeap
Time complexity: $O(\log(p))$

BuildHeap
Data: an array A
Result: a min-heap H
for i = ⌊length(A)/2⌋ down to 1 do
    PercolDown(A, i)
end
Time complexity: $O(p\log(p))$ as a simple upper bound (a tighter analysis of this bottom-up construction gives $O(p)$).
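A runnable sketch of BuildHeap and DeleteMin is given below, again with 1-based positions and a padding slot at index 0. The function names are hypothetical, and `delete_min` assumes a non-empty heap; this is a generic binary heap, not the implementation used by the authors.

```python
def percolate_down(a, i, n):
    """Sink a[i] until both children (within positions 1..n) are not smaller."""
    while 2 * i <= n:
        child = 2 * i
        if child + 1 <= n and a[child + 1] < a[child]:
            child += 1                      # pick the smaller child
        if a[i] <= a[child]:
            break
        a[i], a[child] = a[child], a[i]
        i = child


def build_heap(values):
    """Bottom-up construction: percolate down every internal node."""
    a = [None] + list(values)
    n = len(values)
    for i in range(n // 2, 0, -1):
        percolate_down(a, i, n)
    return a


def delete_min(a):
    """Remove and return the root, then restore the heap property."""
    n = len(a) - 1
    smallest = a[1]
    a[1] = a[n]
    a.pop()
    percolate_down(a, 1, n - 1)
    return smallest


heap = build_heap([5, 3, 8, 1, 9, 2])
drained = [delete_min(heap) for _ in range(6)]
```

Draining the heap with repeated `delete_min` calls returns the values in increasing order, which is a convenient sanity check.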

Time complexity of some operations
                    findmin    insert       deletemin
  unordered array   O(p)       O(1)         O(p)
  binary heap       O(1)       O(log(p))    O(log(p))

cward in seconds...
[Figure: t (s) on the vertical axis against p on the horizontal axis, for p up to 5e+05.]
Figure: The mean computation time t versus the number of markers p for the cward algorithm applied to randomly sampled SNP matrices. N = 100, h = 30 and t is averaged across 50 simulation runs.

Compared to a former implementation
[Figure: t (s) against p, for p up to 30000, with two curves: "+pencils -heaps" and "+pencils +heaps".]
Figure: The mean computation time t versus the number of markers p for the cward algorithm and an implementation without heaps. t is averaged across 20 simulation runs.

Scalable Hierarchical Clustering with pencils and binary heap
To sum up:
1. An $O(p^2)$ algorithm does not scale for GWA studies.
2. The Ward distance is written in a simple way.
3. Space complexity of $O(ph)$ by using the pencils trick.
4. Time complexity of $O(ph) + O(p\log(p))$: $O(ph)$ for the two $p \times h$ arrays of pencil sums, and $O(p\log(p))$ for building the heap and for the insert/delete heap operations within the loop.
Ongoing work: currently implemented with a genotype matrix as input; can be generalized to any band similarity matrix.


Epistasis
Definition: interaction between the effects of alleles at different markers.
Existing methods: mainly SNP x SNP; some at the block (gene) scale.
Advantages of gene (or block) scale approaches: results are biologically interpretable; genetic effects may be easier to detect; the number of variables is reduced.


Epistasis - Gene scale methods
Existing gene scale methods:
Two or a few genes: PCA + logistic regression (He et al. 2011, Li et al. 2009, Zhang et al. 2008); PLS + logistic regression (Wang T et al. 2009).
A larger number of genes: PCA + LASSO (D'Angelo et al. 2009); PCA + pathway-guided penalized regression (Wang X et al. 2014).
Objectives: to develop a new gene scale method that considers a more accurate definition of interaction variables, is applicable to a large number of genes, and takes into account the group structure.


Group modeling approach
           gene 1                        gene G
           SNP_{1,1} .. SNP_{1,p_1} ..   SNP_{G,1} .. SNP_{G,p_G}   Pheno
  Ind 1    1            0                0            1             1
  Ind 2    0            0                2            1             0
  .        2            1                1            2             1
  .        0            1                0            0             0
  Ind i    0            2                1            0             0
We note $\mathrm{SNP}_{1,1} = X_{1,1}$, and $r$, $s$ denote two genes.


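Group modeling approach
Model:
\[ Y_i = \beta_0 + \underbrace{\sum_{g} \sum_{k} \beta_{g,k} X^{g}_{ik}}_{\text{Main effects}} + \underbrace{\sum_{r,s} \gamma_{r,s} Z^{r,s}_{i}}_{\text{Interaction effects}} + \epsilon_i \]
\[ \beta = (\underbrace{\beta_{1,1}, \beta_{1,2}, \dots, \beta_{1,p_1}}_{\text{gene } 1}, \dots, \underbrace{\beta_{G,1}, \dots, \beta_{G,p_G}}_{\text{gene } G})^T \]
\[ \gamma = (\gamma_{12}, \dots, \underbrace{\gamma_{1G}}_{\gamma_{1G,1}, \dots, \gamma_{1G,q}}, \dots, \gamma_{(G-1)G}) \]
$q$: number of interaction variables for a couple of genes.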

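Interaction variable: Gene-Gene Eigen Epistasis (G-GEE)
We consider $f_u(X^r_i, X^s_i)$ to represent the interaction between genes $r$ and $s$.
Eigen Epistasis:
\[ \hat{u} = \arg\max_{u, \|u\| = 1} \mathrm{cor}(y, f_u(X^r, X^s)) \]
with $f_u(X^r, X^s) = W^{rs} u$ and $W^{rs} = \{X^r_{ij} X^s_{ik}\}$, $i = 1, \dots, n$; $j = 1, \dots, p_r$; $k = 1, \dots, p_s$.
The problem $\max_{u, \|u\| = 1} \widehat{\mathrm{cor}}[W^{rs} u, y]^2$ is written as
\[ \max_{u, \|u\| = 1} u^T W^{rs\,T} y y^T W^{rs} u \]
$u$: the only eigenvector of $W^{rs\,T} y y^T W^{rs}$ associated with a nonzero eigenvalue (the matrix has rank one).
For each couple $(r, s)$: $Z^{rs} = W^{rs} u$.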
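Because $W^{rs\,T} y y^T W^{rs}$ has rank one, its leading eigenvector is proportional to $W^{rs\,T} y$, so no eigendecomposition is needed. The sketch below illustrates this with NumPy; centering $y$ and the columns of $W^{rs}$ is an assumption made here so that the quadratic form is a squared covariance, and the function name `eigen_epistasis` is hypothetical.

```python
import numpy as np


def eigen_epistasis(Xr, Xs, y):
    """Return (u, Z) with u the unit vector maximizing (y^T W u)^2 and Z = W u."""
    n, p_r = Xr.shape
    p_s = Xs.shape[1]
    # W[i, j * p_s + k] = Xr[i, j] * Xs[i, k]: all pairwise SNP products.
    W = (Xr[:, :, None] * Xs[:, None, :]).reshape(n, p_r * p_s)
    W = W - W.mean(axis=0)            # assumption: centered interaction columns
    yc = y - y.mean()                 # assumption: centered phenotype
    v = W.T @ yc                      # the rank-one matrix is v v^T
    u = v / np.linalg.norm(v)         # its only eigenvector with nonzero eigenvalue
    return u, W @ u


# Toy data invented for the example: 50 individuals, genes with 3 and 2 SNPs.
rng = np.random.default_rng(0)
Xr = rng.integers(0, 3, size=(50, 3)).astype(float)
Xs = rng.integers(0, 3, size=(50, 2)).astype(float)
y = rng.normal(size=50)
u, Z = eigen_epistasis(Xr, Xs, y)
```

The interaction variable `Z` is a single column per gene pair, which keeps the number of interaction terms at one per couple.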


Coefficients estimation
Group LASSO regression:
\[ (\hat{\beta}, \hat{\gamma}) = \arg\min_{\beta, \gamma} \sum_{i} (y_i - X_i \beta - Z_i \gamma)^2 + \lambda \left( \sum_{g} \sqrt{p_g} \, \|\beta^g\|_2 + \sum_{rs} \sqrt{p_r p_s} \, \|\gamma^{rs}\|_2 \right) \]
Limits of the group LASSO regression: it is difficult to compute a p-value or a confidence interval.
Adaptive-Ridge Cleaning (Becu JM, 2015):
Use of a specific penalty for group LASSO.
Permutation test based on a Fisher test approach for each group:
\[ P_k = \frac{1}{B} \#\{F^*_k \geq F_k\} \]
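The penalized criterion and the permutation p-value can be written out in a few lines. This sketch only evaluates the objective for given coefficients (it is not a solver), and it reads $F^*_k$ as the statistics obtained on $B$ permuted data sets, which is an assumption; all function names are hypothetical.

```python
import numpy as np


def group_penalty(coef, groups, weights):
    """Weighted sum of Euclidean norms, one term per group of coefficients."""
    return sum(w * np.linalg.norm(coef[idx]) for idx, w in zip(groups, weights))


def group_lasso_objective(y, X, Z, beta, gamma,
                          beta_groups, gamma_groups, gamma_weights, lam):
    """Residual sum of squares plus the group LASSO penalty.

    Main-effect groups are weighted by sqrt(p_g); the interaction weights
    sqrt(p_r * p_s) are passed in explicitly through `gamma_weights`.
    """
    resid = y - X @ beta - Z @ gamma
    beta_weights = [np.sqrt(len(idx)) for idx in beta_groups]
    penalty = (group_penalty(beta, beta_groups, beta_weights)
               + group_penalty(gamma, gamma_groups, gamma_weights))
    return float(resid @ resid + lam * penalty)


def permutation_p_value(f_obs, f_perm):
    """Share of permuted statistics at least as large as the observed one."""
    return float(np.mean(np.asarray(f_perm) >= f_obs))


# Toy numbers invented for the example: one gene with 2 SNPs, one interaction.
value = group_lasso_objective(
    y=np.array([1.0, 2.0]), X=np.eye(2), Z=np.ones((2, 1)),
    beta=np.array([3.0, 4.0]), gamma=np.array([2.0]),
    beta_groups=[[0, 1]], gamma_groups=[[0]],
    gamma_weights=[np.sqrt(2 * 3)], lam=1.0,
)
p_value = permutation_p_value(2.0, [1.0, 2.0, 3.0, 4.0])
```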

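Comparison of interaction variable modeling approaches
  Method | Criterion | $Z^{rs\,T}_{i} \gamma^{rs}$
  G-GEE  | $\mathrm{cor}(y, f_u(X^r, X^s))$ | $W^{rs} u \gamma^{rs}$
  PCA    | $\mathrm{var}(G^r v)$ and $\mathrm{var}(G^s v)$ | $\sum_{j=1}^{q} \sum_{k=1}^{q} \gamma^{rs}_{jk} C^r_j C^s_k$
  CCA    | $\mathrm{cor}(G^r a, G^s b)$ | $\sum_{j=1}^{q} \gamma^{rs}_{j} A^r_j B^s_j$
  PLS    | $\mathrm{cov}(y G^r c, G^s w)$ | $\sum_{j=1}^{q} \gamma^{rs}_{j} T^{rs}_j$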

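Simulation design
Genotype: $X_i \sim N_p(0, \Sigma)$ with $\Sigma$ a block diagonal correlation matrix ($\rho = 0.8$ for two SNPs in the same gene).
$\mathrm{MAF}_j \sim U[0.05, 0.5]$, with $\mathrm{MAF}_j = 0.2$ if $j$ is a causal SNP.
Continuous phenotype simulated under two different schemes:
From Wang X et al., 2014:
\[ Y_i = \beta_0 + \sum_{g} \beta_g \left( \sum_{k \in C} X^{g}_{ik} \right) + \sum_{rs} \gamma_{rs} \left( \sum_{(j,k) \in C^2} X^{r}_{ij} X^{s}_{ik} \right) + \epsilon_i \quad (1) \]
PCA model:
\[ Y_i = \beta_0 + \sum_{g} \beta_g \left( \sum_{k \in C} X^{g}_{ik} \right) + \sum_{rs} \gamma_{rs} C^{r}_{i1} C^{s}_{i1} + \epsilon_i \quad (2) \]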
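The genotype part of this design can be sketched as follows. The slides do not say how the Gaussian vector is turned into 0/1/2 genotypes, so the thresholding step below (cut points chosen so that genotype frequencies follow Hardy-Weinberg proportions for the drawn MAF) is an assumption, as are the function name `simulate_genotypes` and its parameters.

```python
import numpy as np
from statistics import NormalDist


def simulate_genotypes(n=600, n_genes=6, snps_per_gene=6, rho=0.8,
                       causal=(), seed=0):
    """Draw latent Gaussians with block correlation, then cut them into {0, 1, 2}."""
    rng = np.random.default_rng(seed)
    block = np.full((snps_per_gene, snps_per_gene), rho)
    np.fill_diagonal(block, 1.0)
    sigma = np.kron(np.eye(n_genes), block)          # block diagonal correlation
    p = n_genes * snps_per_gene
    # z @ L.T has covariance L @ L.T = sigma when z is standard normal.
    latent = rng.standard_normal((n, p)) @ np.linalg.cholesky(sigma).T
    maf = rng.uniform(0.05, 0.5, size=p)
    if causal:
        maf[list(causal)] = 0.2                      # fixed MAF for causal SNPs
    inv_cdf = NormalDist().inv_cdf
    # Assumed thresholds: P(0) = (1 - m)^2, P(1) = 2 m (1 - m), P(2) = m^2.
    t1 = np.array([inv_cdf((1 - m) ** 2) for m in maf])
    t2 = np.array([inv_cdf(1 - m ** 2) for m in maf])
    X = (latent > t1).astype(int) + (latent > t2).astype(int)
    return X, sigma, maf


X, sigma, maf = simulate_genotypes(causal=(0, 6))
```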

Simulations design
Scenarios: we consider 600 subjects and 6 SNPs per gene.
First scenario on 6 genes, two settings: same genes for main and interaction effects; different genes for main and interaction effects.
Second scenario on 25 genes, one setting: different genes for main and interaction effects.

Simulations results - First scenario on 6 genes
[Figure: gene pair power (0 to 1) against $r^2$ (0.1 to 0.7) for the methods G-GEE, PLS, CCA and PCA, under the Wang X et al. model and the PCA model. Top panels: main effects on gene 1 and gene 2, interaction effect gene 1 x gene 2. Bottom panels: main effects on gene 1 and gene 2, interaction effect gene 3 x gene 4.]

Simulations results - First scenario on 6 genes
[Figure: for each method (G-GEE, CCA, PCA, PLS), results for the six main effects (Genes1 to Genes6) and the fifteen gene pair interactions (X.Genes1.Genes2 to X.Genes5.Genes6), under the Wang X et al. model and the PCA model. One pair of panels has main effects on gene 1 and gene 2 with interaction gene 1 x gene 2, the other has main effects on gene 1 and gene 2 with interaction gene 3 x gene 4; panels are shown for $r^2 = 0.7$ and $r^2 = 0.6$.]

Simulations results - Second scenario on 25 genes
Main effects: gene 1, gene 2.
Interaction effects: gene 3 x gene 4, gene 5 x gene 6, gene 7 x gene 8, gene 9 x gene 10.
[Figure: results per method (G-GEE, CCA, PCA, PLS) for the main effects and gene pairs.]
Figure: Wang X et al. model, $r^2 = 0.7$


Ankylosing Spondylitis
Chronic inflammatory disease of the axial skeleton.
Epidemiology: age at first symptoms 20-30 years; sex: predominance in men (sex ratio 2M:1W); prevalence depends on the population (0.1% - 1.4%).
Exact etiology unknown: environmental factors? Genetic factors? Importance of the HLA complex.
HLA complex: located on chromosome 6; groups about 200 genes; encodes components of the immune system.
Antigen HLA-B27: associated with ankylosing spondylitis.
Effect from other genes in the HLA group? Outside the HLA group?

Known genes
Tsui et al., 2014: The genetic basis of ankylosing spondylitis: new insights into disease pathogenesis, The Application of Clinical Genetics 7:105-115.
29 susceptibility genes identified by GWAS.

Results
  Method | Significant results
  G-GEE  | HLA-B x SULT1A1; IL23R x ERAP2
  PLS    | HLA-B; EOMES x BACH2
  PCA    | HLA-B
  CCA    | -

Conclusions and perspectives
The G-GEE method: takes into account the gene structure of the data; can be applied to a large number of genes; uses a specific interaction modeling approach.
Ankylosing Spondylitis: identification of potential interactions to discuss with doctors; HLA-B effect.
Perspectives: explore new definitions of $f_u(X^r_i, X^s_i)$; additional simulations on larger data sets; new applications to other data sets.

Thank you for your attention!

Adaptive-Ridge Cleaning λ specific penalty for group LASSO : k(j) ˆθ m k(j) m 2