COMBI - Combining high-dimensional classification and multiple hypotheses testing for the analysis of big data in genetics

Similar documents
How to analyze many contingency tables simultaneously?

Case-Control Association Testing. Case-Control Association Testing

Lecture 1: Case-Control Association Testing. Summer Institute in Statistical Genetics 2015

Weierstraß-Institut. für Angewandte Analysis und Stochastik. Leibniz-Institut im Forschungsverbund Berlin e. V. Preprint ISSN

The E-M Algorithm in Genetics. Biostatistics 666 Lecture 8

2. Map genetic distance between markers

Multiple testing: Intro & FWER 1

Theoretical and computational aspects of association tests: application in case-control genome-wide association studies.

Population Genetics. with implications for Linkage Disequilibrium. Chiara Sabatti, Human Genetics 6357a Gonda

Association studies and regression

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

CS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS

UNIT 8 BIOLOGY: Meiosis and Heredity Page 148

1. Understand the methods for analyzing population structure in genomes

High-Throughput Sequencing Course. Introduction. Introduction. Multiple Testing. Biostatistics and Bioinformatics. Summer 2018

Biology. Revisiting Booklet. 6. Inheritance, Variation and Evolution. Name:

Computational Systems Biology: Biology X

Some models of genomic selection

Looking at the Other Side of Bonferroni

Outline. P o purple % x white & white % x purple& F 1 all purple all purple. F purple, 224 white 781 purple, 263 white

Significant Pattern Mining

Population Genetics I. Bio

Probability of Detecting Disease-Associated SNPs in Case-Control Genome-Wide Association Studies

Introduction to population genetics & evolution

On coding genotypes for genetic markers with multiple alleles in genetic association study of quantitative traits

Breeding Values and Inbreeding. Breeding Values and Inbreeding

The Quantitative TDT

BTRY 4830/6830: Quantitative Genomics and Genetics Fall 2014

SPRINGFIELD TECHNICAL COMMUNITY COLLEGE ACADEMIC AFFAIRS

Marginal Screening and Post-Selection Inference

Exam 1 PBG430/

Genetic proof of chromatin diminution under mitotic agamospermy

Parts 2. Modeling chromosome segregation

p(d g A,g B )p(g B ), g B

A novel fuzzy set based multifactor dimensionality reduction method for detecting gene-gene interaction

Outline for today s lecture (Ch. 14, Part I)

Lecture 9. QTL Mapping 2: Outbred Populations

Question: If mating occurs at random in the population, what will the frequencies of A 1 and A 2 be in the next generation?

LECTURE # How does one test whether a population is in the HW equilibrium? (i) try the following example: Genotype Observed AA 50 Aa 0 aa 50

Department of Forensic Psychiatry, School of Medicine & Forensics, Xi'an Jiaotong University, Xi'an, China;

Parts 2. Modeling chromosome segregation

Multivariate analysis of genetic data an introduction

Methods for Cryptic Structure. Methods for Cryptic Structure

Tutorial Session 2. MCMC for the analysis of genetic data on pedigrees:

Problems for 3505 (2011)

Life Cycles, Meiosis and Genetic Variability24/02/2015 2:26 PM

Linkage and Linkage Disequilibrium

On the limiting distribution of the likelihood ratio test in nucleotide mapping of complex disease

Computational Approaches to Statistical Genetics

Private Computation with Genomic Data for Genome-Wide Association and Linkage Studies

Patterns of inheritance

Jeopardy. Evolution Q $100 Q $100 Q $100 Q $100 Q $100 Q $200 Q $200 Q $200 Q $200 Q $200 Q $300 Q $300 Q $300 Q $300 Q $300

BTRY 7210: Topics in Quantitative Genomics and Genetics

Complexity and Approximation of the Minimum Recombination Haplotype Configuration Problem

Introduction to Quantitative Genetics. Introduction to Quantitative Genetics

The phenotype of this worm is wild type. When both genes are mutant: The phenotype of this worm is double mutant Dpy and Unc phenotype.

Lesson 4: Understanding Genetics

1.5.1 ESTIMATION OF HAPLOTYPE FREQUENCIES:

Affected Sibling Pairs. Biostatistics 666

BS 50 Genetics and Genomics Week of Oct 3 Additional Practice Problems for Section. A/a ; B/B ; d/d X A/a ; b/b ; D/d

Neutral Theory of Molecular Evolution

InGen: Dino Genetics Lab Post-Lab Activity: DNA and Genetics

Lecture 6 April

Learning gene regulatory networks Statistical methods for haplotype inference Part I

Family resemblance can be striking!

Linear Regression (1/1/17)

Friday Harbor From Genetics to GWAS (Genome-wide Association Study) Sept David Fardo

I Have the Power in QTL linkage: single and multilocus analysis

Name: Period: EOC Review Part F Outline

Asymptotic distribution of the largest eigenvalue with application to genetic data

Family-wise Error Rate Control in QTL Mapping and Gene Ontology Graphs

Classical Selection, Balancing Selection, and Neutral Mutations

BTRY 4830/6830: Quantitative Genomics and Genetics

Lecture 7: Hypothesis Testing and ANOVA

(Genome-wide) association analysis

Unit 2 Lesson 4 - Heredity. 7 th Grade Cells and Heredity (Mod A) Unit 2 Lesson 4 - Heredity

Nature Genetics: doi: /ng Supplementary Figure 1. Number of cases and proxy cases required to detect association at designs.

Objective 3.01 (DNA, RNA and Protein Synthesis)

Association Testing with Quantitative Traits: Common and Rare Variants. Summer Institute in Statistical Genetics 2014 Module 10 Lecture 5

Meiosis and Tetrad Analysis Lab

Natural Selection results in increase in one (or more) genotypes relative to other genotypes.

Genetics Studies of Multivariate Traits

Efficient Algorithms for Detecting Genetic Interactions in Genome-Wide Association Study

Hybrid CPU/GPU Acceleration of Detection of 2-SNP Epistatic Interactions in GWAS

Lecture WS Evolutionary Genetics Part I 1

Learning Your Identity and Disease from Research Papers: Information Leaks in Genome-Wide Association Study

Genotype Imputation. Class Discussion for January 19, 2016

Processes of Evolution

(Write your name on every page. One point will be deducted for every page without your name!)

Summary and discussion of: Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing

Learning ancestral genetic processes using nonparametric Bayesian models

Using a CUDA-Accelerated PGAS Model on a GPU Cluster for Bioinformatics

Genotype Imputation. Biostatistics 666

Calculation of IBD probabilities

Heredity and Genetics WKSH

For 5% confidence χ 2 with 1 degree of freedom should exceed 3.841, so there is clear evidence for disequilibrium between S and M.

Non-specific filtering and control of false positives

Cover Page. The handle holds various files of this Leiden University dissertation

Homework Assignment, Evolutionary Systems Biology, Spring Homework Part I: Phylogenetics:

Genetic Association Studies in the Presence of Population Structure and Admixture

Transcription:

COMBI - Combining high-dimensional classification and multiple hypotheses testing for the analysis of big data in genetics Thorsten Dickhaus University of Bremen Institute for Statistics AG DANK Herbsttagung 2016 WIAS Berlin, 18.11.2016

Outline 1 Motivation: Genetic association studies 2 Statistical setup and challenges 3 Combined two-stage procedure Reference: Mieth, B., Kloft, M., Rodriguez, J.A., Sonnenburg, S., Vobruba, R., Morcillo-Suarez, C., Farre, X., Marigorta, U.M., Fehr, E., Dickhaus, T., Blanchard, G., Schunk, D., Navarro, A. and Müller, K.-R. (2016): Combining Multiple Hypothesis Testing with Machine Learning Increases the Statistical Power of Genome-wide Association Studies. Scientific Reports, in press.

Outline 1 Motivation: Genetic association studies 2 Statistical setup and challenges 3 Combined two-stage procedure Reference: Mieth, B., Kloft, M., Rodriguez, J.A., Sonnenburg, S., Vobruba, R., Morcillo-Suarez, C., Farre, X., Marigorta, U.M., Fehr, E., Dickhaus, T., Blanchard, G., Schunk, D., Navarro, A. and Müller, K.-R. (2016): Combining Multiple Hypothesis Testing with Machine Learning Increases the Statistical Power of Genome-wide Association Studies. Scientific Reports, in press.

Outline 1 Motivation: Genetic association studies 2 Statistical setup and challenges 3 Combined two-stage procedure Reference: Mieth, B., Kloft, M., Rodriguez, J.A., Sonnenburg, S., Vobruba, R., Morcillo-Suarez, C., Farre, X., Marigorta, U.M., Fehr, E., Dickhaus, T., Blanchard, G., Schunk, D., Navarro, A. and Müller, K.-R. (2016): Combining Multiple Hypothesis Testing with Machine Learning Increases the Statistical Power of Genome-wide Association Studies. Scientific Reports, in press.

Outline 1 Motivation: Genetic association studies 2 Statistical setup and challenges 3 Combined two-stage procedure

Deoxyribonucleic acid (DNA) Genetic information is coded in the base pairs at loci of the DNA Different possible realizations at one locus: alleles Body cells are diploid, i. e., consisting of two sets of chromosomes Individual with same alleles on both chromosomal doublehelices at a particular locus i: homozygous at i, otherwise heterozygous at i

What is a SNP (single nucleotide polymorphism)? Bi-allelic SNPs: Exactly two possible alleles Locus 1 2 3 4... i... M

What is a SNP (single nucleotide polymorphism)? Bi-allelic SNPs: Exactly two possible alleles Locus 1 2 3 4... i... M Tom A A G T... A... G

What is a SNP (single nucleotide polymorphism)? Bi-allelic SNPs: Exactly two possible alleles Locus 1 2 3 4... i... M Tom A A G T... A... G Andrew A A G C... A... C

What is a SNP (single nucleotide polymorphism)? Bi-allelic SNPs: Exactly two possible alleles Locus 1 2 3 4... i... M Tom A A G T... A... G Andrew A A G C... A... C Rachel A A G C... G... G

What is a SNP (single nucleotide polymorphism)? Bi-allelic SNPs: Exactly two possible alleles Locus 1 2 3 4... i... M Tom (m) A A G T... A... G Tom (p) A A G T... A... C Andrew A A G C... A... C Rachel A A G C... G... G

What is a SNP (single nucleotide polymorphism)? Bi-allelic SNPs: Exactly two possible alleles Locus 1 2 3 4... i... M Tom (m) A A G T... A... G Tom (p) A A G T... A... C Andrew A A G C... A... C A A G C... G... C Rachel A A G C... G... G A A G T... G... G

Outline 1 Motivation: Genetic association studies 2 Statistical setup and challenges 3 Combined two-stage procedure

Contingency table layout in association studies Assume a bi-allelic marker (SNP) at a particular locus and a binary phenotype of interest, e. g., a disease status. Genotype A 1 A 1 A 1 A 2 A 2 A 2 Σ Phenotype 1 x 1,1 x 1,2 x 1,3 n 1. Phenotype 0 x 2,1 x 2,2 x 2,3 n 2. Absolute count n.1 n.2 n.3 N In case of allelic tests: Genotype A 1 A 2 Σ Phenotype 1 x 1,1 x 1,2 n 1. Phenotype 0 x 2,1 x 2,2 n 2. Absolute count n.1 n.2 N

Formalized association test problem Multiple test problem with system of hypotheses H = (H j : 1 j M), where H j : Genotype j Phenotype with two-sided alternatives K j.

Formalized association test problem Multiple test problem with system of hypotheses H = (H j : 1 j M), where H j : Genotype j Phenotype with two-sided alternatives K j. Abbreviated notation (one particular position): n = (n 1., n 2., n.1, n.2, n.3 ) N 5 resp. n = (n 1., n 2., n.1, n.2 ) N 4, ( ) ( ) x11 x x = 12 x 13 N x 21 x 22 x 2 3 x11 x resp. x = 12 N 23 x 21 x 2 2. 22

Hypergeometric table probability In both cases, the probability of observing x given n is under the null given by n n f (x n) = n! N! x x x!.

Tests for association of marker and phenotype (i) Chi-squared test Q(x) = r (x rs e rs ) 2, where e rs = n r. n.s /N. s e rs Resulting exact (non-asymptotic) p-value: p Q (x) = x f ( x n), with summation over all x with marginals n such that Q( x) Q(x). (Local) level α test: ϕ Q (x) = 1 pq (x) α

Tests for association of marker and phenotype (ii) Tests of Fisher-type p Fisher (x) = x f ( x n), with summation over all x with marginals n such that f ( x n) f (x n). Corresponding level α test: ϕ Fisher (x) = 1 pfisher (x) α

Tests for association of marker and phenotype (ii) Tests of Fisher-type p Fisher (x) = x f ( x n), with summation over all x with marginals n such that f ( x n) f (x n). Corresponding level α test: ϕ Fisher (x) = 1 pfisher (x) α ϕ Q and ϕ Fisher keep the (local) significance level α conservatively for any sample size N.

Challenges for the statistical methodology Fachbereich 03 1 M >> 1 simultaneous association tests (high multiplicity, big data ) 2 Interactions between genes (genetic networks) A simple locus-by-locus analysis with a Bonferroni adjustment for multiplicity is inappropriate here! (We want to control the family-wise error rate (FWER), i. e., the probability of at least one type I error among the M individual tests.)

Interactions: First example Statistically non-significant! (α = 0.025) AC AG TC TG Y = 0 0 1 15 0 16 Y = 1 8 0 0 8 16

Interactions: Second example Statistically non-significant! (α = 0.05) AACC AACG AAGG ATCC ATCG ATGG TTCC TTCG TTGG Y = 0 0 120 0 125 0 124 0 129 0 498 Y = 1 60 0 60 0 249 0 64 0 64 497

Outline 1 Motivation: Genetic association studies 2 Statistical setup and challenges 3 Combined two-stage procedure

The combi method in a nutshell 1 Fix a number 1 k M of hypothesized informative positions. 2 Run a support vector machine-based classification. Record the k largest of the SVM weights in absolute value. 3 Compute the k corresponding p-values for testing association. 4 For a pre-defined FWER level α, decide that position j is informative if j is among the top k SVM positions and its p-value is below a threshold t t (k, α). The threshold t has to be chosen such that the FWER is controlled at level α for the entire procedure.

Computation of t We developed a fully resampling-based method for calibrating t on the basis of the ascertained data. Essentially, it employs an estimation of the correlation resp. affinity of SVM weights and p-values for association. The resampling method is a generalization of the min P procedure (Westfall and Young, 1993). Writing t in the form α/m eff., we conjecture that k < M eff. << M for most of the relevant applications. We call M eff. the effective number of tests in the second step of the combi method.

Application of the combi method (WTCCC 2007 data) Fachbereich 03