How to analyze many contingency tables simultaneously?

Thorsten Dickhaus, Humboldt-Universität zu Berlin
Beuth Hochschule für Technik Berlin, 31.10.2012

Outline
Motivation: Genetic association studies
Statistical setup
Refined statistical inference methods
Real data example

Reference: Dickhaus, T., Straßburger, K., Schunk, D., Morcillo, C., Illig, T., and Navarro, A. (2012): How to analyze many contingency tables simultaneously in genetic association studies. SAGMB 11, Article 12.

What is a SNP (single nucleotide polymorphism)?
Bi-allelic SNPs: Exactly two possible alleles

Locus      1  2  3  4  ...  i  ...  M
Tom (m)    A  A  G  T  ...  A  ...  G
Tom (p)    A  A  G  T  ...  A  ...  C
Andrew     A  A  G  C  ...  A  ...  C
           A  A  G  C  ...  G  ...  C
Rachel     A  A  G  C  ...  G  ...  G
           A  A  G  T  ...  G  ...  G

Contingency table layout in association studies
Assume a bi-allelic marker (SNP) at a particular locus and a binary phenotype of interest, e. g., a disease status.

Genotype          A_1 A_1   A_1 A_2   A_2 A_2   Σ
Phenotype 1       x_{11}    x_{12}    x_{13}    n_{1.}
Phenotype 0       x_{21}    x_{22}    x_{23}    n_{2.}
Absolute count    n_{.1}    n_{.2}    n_{.3}    N

In case of allelic tests:

Genotype          A_1       A_2       Σ
Phenotype 1       x_{11}    x_{12}    n_{1.}
Phenotype 0       x_{21}    x_{22}    n_{2.}
Absolute count    n_{.1}    n_{.2}    N

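Formalized association test problem
Multiple test problem with system of hypotheses $\mathcal{H} = (H_j : 1 \le j \le M)$, where $H_j$: Genotype $j$ $\perp$ Phenotype (the genotype at locus $j$ is independent of the phenotype), with two-sided alternatives $K_j$.

Abbreviated notation (one particular position):
$$n = (n_{1.}, n_{2.}, n_{.1}, n_{.2}, n_{.3}) \in \mathbb{N}^5 \quad \text{resp.} \quad n = (n_{1.}, n_{2.}, n_{.1}, n_{.2}) \in \mathbb{N}^4,$$
$$x = \begin{pmatrix} x_{11} & x_{12} & x_{13} \\ x_{21} & x_{22} & x_{23} \end{pmatrix} \in \mathbb{N}^{2 \times 3} \quad \text{resp.} \quad x = \begin{pmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \end{pmatrix} \in \mathbb{N}^{2 \times 2}.$$
In both cases, the probability of observing $x$ given $n$ is under the null given by
$$f(x \mid n) = \frac{\prod_{m \in n} m!}{N! \, \prod_{c \in x} c!},$$
where the product in the numerator runs over all marginal counts in $n$ and the product in the denominator over all cell counts in $x$.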
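The null probability of a table given its marginals can be evaluated directly from the factorial formula. The following is a minimal sketch; the function name `table_probability` and the example counts are illustrative assumptions and do not come from the talk.

```python
from math import factorial, prod

def table_probability(table):
    """Null probability f(x | n) of a contingency table given its marginals."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    # Numerator: product of the factorials of all marginal counts.
    numerator = prod(factorial(m) for m in row_sums + col_sums)
    # Denominator: N! times the product of the factorials of all cell counts.
    denominator = factorial(total) * prod(factorial(c) for row in table for c in row)
    return numerator / denominator

# Hypothetical allelic (2 x 2) table: rows are phenotypes, columns are alleles.
example = [[1, 2], [3, 4]]
print(table_probability(example))  # 0.5
```

Python integers have arbitrary precision, so the factorials do not overflow, but for sample sizes in the thousands a log-factorial formulation would likely be faster.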

Tests for association of marker and phenotype
(i) Chi-squared test
$$Q(x) = \sum_{r} \sum_{s} \frac{(x_{rs} - e_{rs})^2}{e_{rs}}, \quad \text{where } e_{rs} = n_{r.} \, n_{.s} / N.$$
Resulting exact (non-asymptotic) p-value:
$$p_Q(x) = \sum_{\tilde{x}} f(\tilde{x} \mid n),$$
with summation over all $\tilde{x}$ with marginals $n$ such that $Q(\tilde{x}) \ge Q(x)$.
(Local) level $\alpha$ test: $\varphi_Q(x) = \mathbf{1}\{p_Q(x) \le \alpha\}$

Tests for association of marker and phenotype
(ii) Tests of Fisher-type
$$p_{\text{Fisher}}(x) = \sum_{\tilde{x}} f(\tilde{x} \mid n),$$
with summation over all $\tilde{x}$ with marginals $n$ such that $f(\tilde{x} \mid n) \le f(x \mid n)$.
Corresponding level $\alpha$ test: $\varphi_{\text{Fisher}}(x) = \mathbf{1}\{p_{\text{Fisher}}(x) \le \alpha\}$

$\varphi_Q(x)$ and $\varphi_{\text{Fisher}}(x)$ keep the (local) significance level $\alpha$ conservatively for any sample size $N$. In other words: $p_Q(X) \ge_{\text{st}} U$ and $p_{\text{Fisher}}(X) \ge_{\text{st}} U$ under the null, where $U \sim \text{UNI}[0, 1]$, i.e. both p-values are stochastically not smaller than a uniform variate.
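Both exact p-values can be obtained by brute-force enumeration of all tables that share the observed marginals. The sketch below assumes a 2 x k table with strictly positive marginals and small counts; all function names and the floating-point tolerance `tol` are illustrative assumptions, not part of the talk.

```python
from itertools import product
from math import comb, prod

def tables_with_marginals(row_sums, col_sums):
    """Yield every 2 x k table of non-negative counts with the given marginals."""
    for first_row in product(*(range(c + 1) for c in col_sums)):
        if sum(first_row) == row_sums[0]:
            yield [list(first_row), [c - a for c, a in zip(col_sums, first_row)]]

def null_probability(table):
    """f(x | n) for a 2 x k table, written with binomial coefficients."""
    col_sums = [a + b for a, b in zip(*table)]
    ways = prod(comb(c, a) for c, a in zip(col_sums, table[0]))
    return ways / comb(sum(col_sums), sum(table[0]))

def chi_squared(table):
    """Q(x) = sum over cells of (x_rs - e_rs)^2 / e_rs with e_rs = n_r. n_.s / N."""
    row_sums = [sum(row) for row in table]
    col_sums = [a + b for a, b in zip(*table)]
    total = sum(row_sums)
    q = 0.0
    for r, row in enumerate(table):
        for s, x in enumerate(row):
            expected = row_sums[r] * col_sums[s] / total  # assumes positive marginals
            q += (x - expected) ** 2 / expected
    return q

def exact_p_values(table, tol=1e-9):
    """Return (p_Q, p_Fisher) by summing f over the respective rejection regions."""
    row_sums = [sum(row) for row in table]
    col_sums = [a + b for a, b in zip(*table)]
    q_obs, f_obs = chi_squared(table), null_probability(table)
    p_q = p_fisher = 0.0
    for other in tables_with_marginals(row_sums, col_sums):
        f = null_probability(other)
        if chi_squared(other) >= q_obs - tol:   # at least as extreme in Q
            p_q += f
        if f <= f_obs * (1.0 + tol):            # at most as probable as x
            p_fisher += f
    return p_q, p_fisher

print(exact_p_values([[3, 0], [0, 3]]))
```

The enumeration grows quickly with the column totals, so this form is only meant to make the definitions concrete; a production analysis would use a dedicated exact-test routine.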

Estimating the proportion of informative SNPs (References: Schweder and Spjøtvoll (1982), Storey et al., 2004)
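For orientation, an estimator of this type can be sketched in a few lines. The form below, with the "+1" in the numerator, follows Storey et al. (2004); the tuning parameter `lam = 0.5`, the truncation at 1, and the example p-values are assumptions made here, since the slides do not state which settings were used.

```python
def storey_pi0(p_values, lam=0.5):
    """Estimate the proportion of true nulls from p-values exceeding lam."""
    m = len(p_values)
    # Large p-values are assumed to stem mostly from true null hypotheses.
    exceed = sum(1 for p in p_values if p > lam)
    # The truncation at 1 is a convenience added in this sketch.
    return min(1.0, (exceed + 1) / (m * (1.0 - lam)))

# Hypothetical p-values: eight small ones and two above lam.
p_values = [0.001, 0.002, 0.01, 0.02, 0.03, 0.04, 0.1, 0.3, 0.7, 0.9]
print(storey_pi0(p_values))  # 0.6
```

The estimator relies on null p-values being uniformly distributed, which is why the p-values fed into it matter.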

Caveat: Storey's method does not work for the discrete p-values $p_Q(X)$ and $p_{\text{Fisher}}(X)$.

Discreteness: Realized randomized p-values
Definition:
Statistical model $(\Omega, \mathcal{A}, (P_\vartheta)_{\vartheta \in \Theta})$ given
Two-sided test problem $H: \{\vartheta = \vartheta_0\}$ versus $K: \{\vartheta \ne \vartheta_0\}$
Discrete test statistic: $X \sim P_\vartheta$ with values in $\Omega$
$U \sim \text{UNI}[0, 1]$, stochastically independent of $X$
A realized randomized p-value for testing $H$ versus $K$ is a measurable mapping $p^r: \Omega \times [0, 1] \to [0, 1]$ with
$$P_{\vartheta_0}(p^r(X, U) \le t) = t \quad \text{for all } t \in [0, 1].$$

Realized randomized p-values based on $p_Q(X)$ and $p_{\text{Fisher}}(X)$
Lemma: Based upon the chi-squared and Fisher-type testing strategies, corresponding realized randomized p-values can be calculated as
$$p^r_Q(x, u) = p_Q(x) - u \sum_{\tilde{x}: Q(\tilde{x}) = Q(x)} f(\tilde{x} \mid n),$$
$$p^r_{\text{Fisher}}(x, u) = p_{\text{Fisher}}(x) - u \, \gamma \, f(x \mid n),$$
where $u$ denotes the realization of $U \sim \text{UNI}[0, 1]$, stochastically independent of $X$, and $\gamma \equiv \gamma(x) = |\{\tilde{x} : f(\tilde{x} \mid n) = f(x \mid n)\}|$.

We propose realized randomized p-values for estimating $\pi_0$. For final decision making, their non-randomized counterparts should be used (Reproducibility!).
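The Fisher-type formula in the lemma is easy to trace for an allelic 2 x 2 table, where the tables with fixed marginals are indexed by the upper-left cell alone. This is a sketch under that restriction; the function name, its argument order, and the tolerance `tol` used to detect ties in floating point are assumptions made here.

```python
from math import comb

def randomized_fisher_p(x11, x12, x21, x22, u, tol=1e-9):
    """Realized randomized Fisher-type p-value for a 2 x 2 table and u in [0, 1]."""
    n1, c1 = x11 + x12, x11 + x21          # first row total, first column total
    total = x11 + x12 + x21 + x22

    def f(a):
        # Hypergeometric probability of the table whose upper-left cell equals a.
        return comb(c1, a) * comb(total - c1, n1 - a) / comb(total, n1)

    support = range(max(0, n1 + c1 - total), min(n1, c1) + 1)
    f_obs = f(x11)
    # Non-randomized p-value: total mass of tables at most as probable as x.
    p_fisher = sum(f(a) for a in support if f(a) <= f_obs * (1.0 + tol))
    # gamma: number of tables with exactly the same probability as x.
    gamma = sum(1 for a in support if abs(f(a) - f_obs) <= tol * f_obs)
    return p_fisher - u * gamma * f_obs

print(randomized_fisher_p(3, 0, 0, 3, u=0.0))  # non-randomized p-value for u = 0
```

In an analysis, `u` would be drawn once per marker, for example with `random.random()`; setting `u = 0` recovers the non-randomized p-value recommended for the final decision.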

Effective number of tests: A thought experiment
Assume markers indexed by $I = \{1, \ldots, M\}$ can be divided into disjoint groups with indices in subsets $I_g \subset I$, $g \in \{1, \ldots, G\}$.
Let $\varphi = (\varphi_i, i \in I)$ and assume that for each $g \in \{1, \ldots, G\}$ and for any pair of indices $i, j \in I_g$ the identity $\{\varphi_i = 1\} = \{\varphi_j = 1\}$ holds. Then, effectively only one single test is performed in each subgroup.
Denoting $i(g) = \min I_g$ for $g = 1, \ldots, G$, it holds
$$\mathrm{FWER}_\vartheta(\varphi) = P_\vartheta\Big(\bigcup_{g=1}^{G} \bigcup_{i \in I_0 \cap I_g} \{\varphi_i = 1\}\Big) \le P_\vartheta\Big(\bigcup_{g=1}^{G} \{\varphi_{i(g)} = 1\}\Big).$$
Consequently, multiplicity correction in this extreme scenario only has to be done with respect to $G \ll M$. Bonferroni-type adjustment $\alpha/G$ would be valid!
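The validity of the $\alpha/G$ adjustment follows from a union bound. As a sketch, assume that each $\varphi_i$ is a local level $\alpha/G$ test, read $I_0$ as the index set of true null hypotheses, and write $\mathcal{G}_0 = \{g : I_0 \cap I_g \ne \emptyset\}$ with an arbitrary representative $i_0(g) \in I_0 \cap I_g$ (both symbols are introduced here for illustration only). Because all rejection events within a group coincide,

$$\mathrm{FWER}_\vartheta(\varphi) = P_\vartheta\Big(\bigcup_{g \in \mathcal{G}_0} \{\varphi_{i_0(g)} = 1\}\Big) \le \sum_{g \in \mathcal{G}_0} P_\vartheta(\varphi_{i_0(g)} = 1) \le |\mathcal{G}_0| \cdot \frac{\alpha}{G} \le \alpha.$$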

Effective number of tests: Cheverud-Nyholt method and beyond
$$M_{\text{eff.}} = 1 + \frac{1}{M} \sum_{i=1}^{M} \sum_{j=1}^{M} (1 - r_{ij}^2).$$
The numbers $r_{ij}$ are measures of correlation among markers $i$ and $j$ and can typically be obtained from linkage disequilibrium (LD) matrices.
More sophisticated methods exist in the literature, e. g.:
simpleM by X. Gao et al. (2008)
$K_{\text{eff.}}$ by Moskvina and Schmidt (2008)
All rely on the correlation structure reflected by the $r_{ij}$'s.
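The Cheverud-Nyholt formula translates directly into code. The sketch below takes the correlation matrix as a plain list of lists; the function name and the two toy matrices are illustrative assumptions, whereas a real analysis would use an LD matrix of the markers.

```python
def cheverud_nyholt_m_eff(r):
    """M_eff = 1 + (1/M) * sum_i sum_j (1 - r_ij^2) for an M x M correlation matrix."""
    m = len(r)
    return 1.0 + sum(1.0 - r[i][j] ** 2 for i in range(m) for j in range(m)) / m

# Uncorrelated markers: every marker counts as a separate test.
independent = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Perfectly correlated markers: effectively a single test.
identical = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]

print(cheverud_nyholt_m_eff(independent))  # 3.0
print(cheverud_nyholt_m_eff(identical))    # 1.0
```

The two boundary cases show that the formula interpolates between $M$ (no correlation) and 1 (complete correlation).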

Our proposed data analysis workflow
1. Compute realized randomized p-values $p^r(x_j, u_j)$ and non-randomized versions $p(x_j)$, $j = 1, \ldots, M$.
2. Estimate the proportion $\pi_0$ of uninformative SNPs by $\hat{\pi}_0$.
3. Determine the effective number of tests $M_{\text{eff.}}$ by utilizing correlation values obtained from an appropriate LD matrix of the $M$ SNPs.
4. For a pre-defined FWER level $\alpha$, determine the list of associated markers by performing the multiple test $\varphi = (\varphi_j, j = 1, \ldots, M)$, where $\varphi_j(x_j) = \mathbf{1}\{p(x_j) \le t\}$ with $t = \alpha / (M_{\text{eff.}} \cdot \hat{\pi}_0)$.
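Step 4 of the workflow amounts to a single threshold comparison once the inputs from steps 1 to 3 are available. The following sketch assumes those inputs are already computed; the function name, the 0-based marker indices, and the example numbers are hypothetical.

```python
def associated_markers(p_values, alpha, m_eff, pi0_hat):
    """Return the threshold t = alpha / (m_eff * pi0_hat) and the rejected indices."""
    threshold = alpha / (m_eff * pi0_hat)
    # Decisions use the non-randomized p-values, so they are reproducible.
    rejected = [j for j, p in enumerate(p_values) if p <= threshold]
    return threshold, rejected

# Hypothetical inputs: four markers, M_eff = 10 and pi0_hat = 0.5.
threshold, rejected = associated_markers([0.001, 0.02, 0.009, 0.5], 0.05, 10.0, 0.5)
print(threshold, rejected)
```

Passing the estimates in explicitly keeps the decision rule separate from the choice of method for the effective number of tests and for the estimate of $\pi_0$.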

Real data example: Herder et al. (2008)
Replication study: Herder, C. et al. (2008). Variants of the PPARG, IGF2BP2, CDKAL1, HHEX, and TCF7L2 genes confer risk of type 2 diabetes independently of BMI in the German KORA studies. Horm. Metab. Res. 40, 722–726.
Data: $M = 44$ SNPs on ten different genes ($N \approx 1900$ study participants)
Results section: "... (conservative) Bonferroni correction for 10 genes ..."
Authors' claim: Threshold $t = 0.005$ for raw marginal p-values controls the FWER at $\alpha = 5\%$.

Herder et al. (2008): Data re-analysis
LD information: Taken from the HapMap project (population "CEU")
Estimated effective number of tests:
$M_{\text{eff.}} = 40.63$ (Cheverud-Nyholt method),
$K_{\text{eff.}} = 16.73$ (Moskvina-Schmidt method).
Estimated proportion of uninformative SNPs: $\hat{\pi}_0 = 0.4545$ (Storey et al., 2004)
Resulting threshold according to our method:
$$t = \alpha / (K_{\text{eff.}} \cdot \hat{\pi}_0) = \alpha / (16.73 \cdot 0.4545) = \alpha / 7.604 \approx 0.0066.$$
In conclusion: Our proposed method confirms the authors' heuristic argumentation and endorses their scientific claims.
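The arithmetic behind the reported threshold can be checked in a few lines. The values of `k_eff` and `pi0_hat` are the estimates quoted above and $\alpha = 5\%$ is the FWER level named for the original study; reading the authors' threshold of 0.005 as $\alpha/10$ (Bonferroni correction for 10 genes) is an interpretation made here.

```python
alpha = 0.05        # FWER level
k_eff = 16.73       # effective number of tests (Moskvina-Schmidt method)
pi0_hat = 0.4545    # estimated proportion of uninformative SNPs

effective_multiplicity = k_eff * pi0_hat      # about 7.604
threshold = alpha / effective_multiplicity    # about 0.0066

# Threshold used in the original study, read as Bonferroni over 10 genes.
authors_threshold = alpha / 10

print(round(effective_multiplicity, 3), round(threshold, 4))
print(authors_threshold <= threshold)  # True: the authors' cut-off is the stricter one
```

Since 0.005 lies below the derived threshold of roughly 0.0066, the cut-off used by the authors is the more conservative of the two.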

Future research goals
Effective number of tests for continuous response
Effective number of tests for FDR control
Adaptive estimation of effective numbers of tests
Statistical methodology for confirmatory functional studies (fMRI data)
Hierarchical multiple testing methods for (auto-) correlated data (time series)