Quantitative Genomics and Genetics BTRY 4830/6830; PBSB


Quantitative Genomics and Genetics BTRY 4830/6830; PBSB.5201.01
Lecture 20: Epistasis and Alternative Tests in GWAS
Jason Mezey
jgm45@cornell.edu
April 16, 2016 (Th) 8:40-9:55

Announcements
None

Summary of lecture 20
We will discuss epistasis and testing for epistasis (potentially a good topic for your project!?).
We will briefly discuss alternative testing approaches in GWAS.

Introduction to epistasis I
So far, we have applied a GWAS analysis by considering statistical models between one genetic marker and the phenotype.
This is the standard approach applied in all GWAS analyses and the one that you should apply as a first step when analyzing GWAS data (always!).
However, we could start considering more than one marker in each of the statistical models we consider.
One reason we might want to do this is to test for statistical interactions among genetic markers (or more specifically, between the causal polymorphisms that they are tagging).

Introduction to epistasis II
If we wanted to consider two markers at a time, our current statistical framework extends easily (note that an index AFTER a comma indicates a different marker):
$$Y = \gamma^{-1}(\beta_\mu + X_{a,1}\beta_{a,1} + X_{d,1}\beta_{d,1} + X_{a,2}\beta_{a,2} + X_{d,2}\beta_{d,2}) + \epsilon$$
However, this equation has only four genetic regression parameters (five parameters counting $\beta_\mu$), and with two markers we have more genotype classes than that.
To make this explicit, recall that we define the genotypic value of the phenotype as the expected value of the phenotype Y given a genotype:
$$G_{A_kA_lB_kB_l} = E(Y \mid g = A_kA_lB_kB_l)$$
For the case of two markers, we therefore have nine classes of genotypes and therefore nine possible genotypic values, i.e. we need nine parameters to model this system (why are there nine?):
$$\begin{array}{c|ccc}
 & B_1B_1 & B_1B_2 & B_2B_2 \\ \hline
A_1A_1 & G_{A_1A_1B_1B_1} & G_{A_1A_1B_1B_2} & G_{A_1A_1B_2B_2} \\
A_1A_2 & G_{A_1A_2B_1B_1} & G_{A_1A_2B_1B_2} & G_{A_1A_2B_2B_2} \\
A_2A_2 & G_{A_2A_2B_1B_1} & G_{A_2A_2B_1B_2} & G_{A_2A_2B_2B_2}
\end{array}$$

Introduction to epistasis III
As an example, for a sample that we can appropriately model with a linear regression model, we can plot the phenotypes associated with each of the nine classes.
[Figure: phenotypes plotted for each of the nine two-locus genotype classes]
In this case, both marginal loci are additive.

Introduction to epistasis IV
With nine classes, we also get the possibility of conditional relationships we have not seen before.
[Figure: phenotypes plotted for each of the nine two-locus genotype classes]
This is an example of epistasis.

Notes about epistasis I
epistasis - a case where the effect of an allele substitution at one locus (A1 -> A2) alters the effect of substituting an allele at another locus (B1 -> B2).
This may be equivalently phrased as a change in the expected phenotype (genotypic value) for a genotype at one locus conditional on the state of another locus.
Note that there is a symmetry in epistasis: if the effect of at least one allelic substitution (from one genotype to another) at one locus depends on the genotype at the other locus, then at least one allelic substitution at the other locus will be dependent as well.
A consequence of this symmetry is that if there is an epistatic relationship between two loci, BOTH will be causal polymorphisms for the phenotype (!!!).
If there is an epistatic effect (= relationship) between loci, we would therefore like to know this information.
Note that such relationships are not limited to pairs of loci; they can also exist among three loci (three-way), four (four-way), etc.
The amount of epistasis among loci for any given phenotype is unknown (but without question it is ubiquitous!!).
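One way to make the definition concrete is to check whether a 3x3 table of genotypic values can be written as a row effect plus a column effect; if it cannot, the effect of a substitution at one locus depends on the genotype at the other. The sketch below is only an illustration: the function names and the two example tables are hypothetical, and it works on known genotypic values rather than estimates from data.

```python
def interaction_residuals(G):
    """Return G[i][j] - row_mean[i] - col_mean[j] + grand_mean for every cell.

    All residuals are zero exactly when G[i][j] = r_i + c_j, i.e. when the
    two loci combine additively (no epistasis in the statistical sense).
    """
    rows, cols = len(G), len(G[0])
    grand = sum(sum(row) for row in G) / (rows * cols)
    row_mean = [sum(row) / cols for row in G]
    col_mean = [sum(G[i][j] for i in range(rows)) / rows for j in range(cols)]
    return [[G[i][j] - row_mean[i] - col_mean[j] + grand
             for j in range(cols)] for i in range(rows)]


def has_epistasis(G, tol=1e-9):
    """True if any interaction residual is non-zero (up to a tolerance)."""
    return any(abs(v) > tol for row in interaction_residuals(G) for v in row)


# Hypothetical genotypic values; rows are A1A1, A1A2, A2A2 and
# columns are B1B1, B1B2, B2B2.
additive = [[0, 1, 2], [1, 2, 3], [2, 3, 4]]    # each A2 and B2 allele adds 1
epistatic = [[0, 0, 0], [0, 0, 0], [0, 0, 4]]   # effect only in A2A2 with B2B2

print(has_epistasis(additive), has_epistasis(epistatic))
```

The check treats all nine cells equally, which is enough to illustrate the definition; it is not a test statistic.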


Notes about epistasis II
Note that the definition of epistasis is entirely statistical (!!) and says nothing about mechanism (although people have misappropriated the term in this way).
The statistical definition used here goes back to Fisher (who introduced it as "epistacy" in 1918); the word "epistatic" itself was coined earlier by Bateson (1909) to describe one locus masking the effect of another.
Epistasis is sometimes called genotype by genotype, G by G, or G x G.
Geneticists often use the term modifiers to describe the dependence of genetic effects at a locus on the state of another locus - this is just epistasis (!!).
We can also consider the effects of a locus when considering the entire genetic background (i.e. the state of all the rest of the genome!) - this is also epistasis (!!).

Modeling epistasis I
To model epistasis, we are going to use our same GLM framework (!!).
The parameterization (using Xa and Xd) that we have considered so far perfectly models any case where there is no epistasis.
We will account for the possibility of epistasis by constructing additional dummy variables and adding additional parameters (so that we have 9 total in our GLM).

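Modeling epistasis II
Recall the dummy variables we have constructed so far:
$$X_{a,1} = \begin{cases} -1 & \text{for } A_1A_1 \\ 0 & \text{for } A_1A_2 \\ 1 & \text{for } A_2A_2 \end{cases} \qquad X_{d,1} = \begin{cases} -1 & \text{for } A_1A_1 \\ 1 & \text{for } A_1A_2 \\ -1 & \text{for } A_2A_2 \end{cases}$$
$$X_{a,2} = \begin{cases} -1 & \text{for } B_1B_1 \\ 0 & \text{for } B_1B_2 \\ 1 & \text{for } B_2B_2 \end{cases} \qquad X_{d,2} = \begin{cases} -1 & \text{for } B_1B_1 \\ 1 & \text{for } B_1B_2 \\ -1 & \text{for } B_2B_2 \end{cases}$$
We will use these dummy variables to construct additional dummy variables in our GLM (and add additional parameters) to account for epistasis:
$$Y = \gamma^{-1}(\beta_\mu + X_{a,1}\beta_{a,1} + X_{d,1}\beta_{d,1} + X_{a,2}\beta_{a,2} + X_{d,2}\beta_{d,2} + X_{a,1}X_{a,2}\beta_{a,a} + X_{a,1}X_{d,2}\beta_{a,d} + X_{d,1}X_{a,2}\beta_{d,a} + X_{d,1}X_{d,2}\beta_{d,d})$$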

Modeling epistasis III
To provide some intuition concerning what each of the terms in the epistasis model is capturing, consider the values that each of the genotypes would take for dummy variable $X_{a,1}$:
$$\begin{array}{c|rrr}
 & B_1B_1 & B_1B_2 & B_2B_2 \\ \hline
A_1A_1 & -1 & -1 & -1 \\
A_1A_2 & 0 & 0 & 0 \\
A_2A_2 & 1 & 1 & 1
\end{array}$$

Modeling epistasis IV
To provide some intuition concerning what each of the terms in the epistasis model is capturing, consider the values that each of the genotypes would take for dummy variable $X_{d,1}$:
$$\begin{array}{c|rrr}
 & B_1B_1 & B_1B_2 & B_2B_2 \\ \hline
A_1A_1 & -1 & -1 & -1 \\
A_1A_2 & 1 & 1 & 1 \\
A_2A_2 & -1 & -1 & -1
\end{array}$$

Modeling epistasis V
To provide some intuition concerning what each of the terms in the epistasis model is capturing, consider the values that each of the genotypes would take for the product $X_{a,1}X_{a,2}$ (each entry is the product of the $X_{a,1}$ and $X_{a,2}$ codings for that genotype):
$$\begin{array}{c|rrr}
 & B_1B_1 & B_1B_2 & B_2B_2 \\ \hline
A_1A_1 & 1 & 0 & -1 \\
A_1A_2 & 0 & 0 & 0 \\
A_2A_2 & -1 & 0 & 1
\end{array}$$

Modeling epistasis VI
To provide some intuition concerning what each of the terms in the epistasis model is capturing, consider the values that each of the genotypes would take for the product $X_{a,1}X_{d,2}$ (similarly for $X_{d,1}X_{a,2}$):
$$\begin{array}{c|rrr}
 & B_1B_1 & B_1B_2 & B_2B_2 \\ \hline
A_1A_1 & 1 & -1 & 1 \\
A_1A_2 & 0 & 0 & 0 \\
A_2A_2 & -1 & 1 & -1
\end{array}$$

Modeling epistasis VII
To provide some intuition concerning what each of the terms in the epistasis model is capturing, consider the values that each of the genotypes would take for the product $X_{d,1}X_{d,2}$:
$$\begin{array}{c|rrr}
 & B_1B_1 & B_1B_2 & B_2B_2 \\ \hline
A_1A_1 & 1 & -1 & 1 \\
A_1A_2 & -1 & 1 & -1 \\
A_2A_2 & 1 & -1 & 1
\end{array}$$
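The 3x3 tables for the product terms can be reproduced mechanically, and the same enumeration gives a quick check of why nine parameters are needed: over the nine genotype classes, the main-effect columns alone span only five dimensions, while adding the four product columns gives full rank nine. This is a small NumPy sketch; the integer genotype encoding (0, 1, 2) and the helper name `design_row` are assumptions for illustration.

```python
import numpy as np

XA = {0: -1, 1: 0, 2: 1}    # additive coding
XD = {0: -1, 1: 1, 2: -1}   # dominance coding


def design_row(g1, g2, epistasis=True):
    """Design-matrix row for one two-locus genotype (g1, g2 in {0, 1, 2})."""
    main = [1, XA[g1], XD[g1], XA[g2], XD[g2]]
    if not epistasis:
        return main
    products = [XA[g1] * XA[g2], XA[g1] * XD[g2],
                XD[g1] * XA[g2], XD[g1] * XD[g2]]
    return main + products


# All nine genotype classes, marker 1 varying slowest.
classes = [(g1, g2) for g1 in range(3) for g2 in range(3)]
X_main = np.array([design_row(g1, g2, epistasis=False) for g1, g2 in classes])
X_full = np.array([design_row(g1, g2) for g1, g2 in classes])

rank_main = np.linalg.matrix_rank(X_main)   # main effects only
rank_full = np.linalg.matrix_rank(X_full)   # main effects plus products

# Column 5 is Xa1*Xa2; reshape so rows are marker-1 genotypes, columns marker-2.
table_aa = X_full[:, 5].reshape(3, 3)
print(rank_main, rank_full)
print(table_aa)
```

Because the full matrix has rank nine, any set of nine genotypic values can be represented exactly by some choice of the nine parameters.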

Inference for epistasis I
To infer epistatic relationships we will use the exact same genetic framework and statistical framework that we have been considering.
For the genetic framework, we are still testing markers that we are assuming are in LD with causal polymorphisms that could have an epistatic relationship (so we are indirectly inferring that there is epistasis from the marker genotypes).
For inference, we are going to estimate epistatic parameters using the same approach as before (!!), i.e. for a linear model:
$$\mathbf{X} = [\mathbf{1}, X_{a,1}, X_{d,1}, X_{a,2}, X_{d,2}, X_{a,a}, X_{a,d}, X_{d,a}, X_{d,d}]$$
$$\beta = [\beta_\mu, \beta_{a,1}, \beta_{d,1}, \beta_{a,2}, \beta_{d,2}, \beta_{a,a}, \beta_{a,d}, \beta_{d,a}, \beta_{d,d}]^T$$
$$\hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
where $X_{a,a} = X_{a,1}X_{a,2}$, $X_{a,d} = X_{a,1}X_{d,2}$, $X_{d,a} = X_{d,1}X_{a,2}$ and $X_{d,d} = X_{d,1}X_{d,2}$.
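A minimal numerical sketch of this estimator for the linear case follows. Everything about the data is assumed for illustration: genotypes are simulated uniformly at random, the "true" parameter vector is made up, and the error is standard normal.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated genotypes at two markers, encoded 0, 1, 2 (an assumption of this sketch).
g1 = rng.integers(0, 3, size=n)
g2 = rng.integers(0, 3, size=n)

xa = np.array([-1, 0, 1])    # additive coding indexed by genotype
xd = np.array([-1, 1, -1])   # dominance coding indexed by genotype
xa1, xd1, xa2, xd2 = xa[g1], xd[g1], xa[g2], xd[g2]

# Nine-column design matrix in the order used on the slide.
X = np.column_stack([np.ones(n), xa1, xd1, xa2, xd2,
                     xa1 * xa2, xa1 * xd2, xd1 * xa2, xd1 * xd2])

# Hypothetical true parameters: two additive effects and one a-by-a interaction.
beta_true = np.array([1.0, 0.5, 0.0, 0.3, 0.0, 0.4, 0.0, 0.0, 0.0])
y = X @ beta_true + rng.normal(0.0, 1.0, size=n)

# beta_hat = (X^T X)^{-1} X^T y, computed by solving the normal equations
# rather than forming the inverse explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.round(beta_hat, 2))
```

Solving the normal equations requires X to have full column rank, which here means every one of the nine genotype classes must appear in the sample.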

Inference for epistasis II
For hypothesis testing, we will just use an LRT calculated the same way as before (!!).
For a linear regression we use an F-statistic; for a logistic regression we estimate the parameters under the null and alternative models and substitute these into the likelihood equations, which have the same form as before (with some additional dummy variables and parameters).
The only difference is the degrees of freedom: for a given test, d.f. = the number of parameters in the alternative model - the number of parameters in the null model.

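Inference for epistasis III
For example, we could use the entire model to test the same hypothesis that we have been considering for a single marker:
$$H_0: \beta_{a,1} = 0 \cap \beta_{d,1} = 0$$
$$H_A: \beta_{a,1} \neq 0 \cup \beta_{d,1} \neq 0$$
We could also test whether either marker has evidence of being a causal polymorphism:
$$H_0: \beta_{a,1} = 0 \cap \beta_{d,1} = 0 \cap \beta_{a,2} = 0 \cap \beta_{d,2} = 0$$
$$H_A: \beta_{a,1} \neq 0 \cup \beta_{d,1} \neq 0 \cup \beta_{a,2} \neq 0 \cup \beta_{d,2} \neq 0$$
We can also test just for epistasis (note this is equivalent to testing an interaction effect in an ANOVA!):
$$H_0: \beta_{a,a} = 0 \cap \beta_{a,d} = 0 \cap \beta_{d,a} = 0 \cap \beta_{d,d} = 0$$
$$H_A: \beta_{a,a} \neq 0 \cup \beta_{a,d} \neq 0 \cup \beta_{d,a} \neq 0 \cup \beta_{d,d} \neq 0$$
We can also test the entire model (what is the interpretation in this case!?):
$$H_0: \beta_{a,1} = 0 \cap \beta_{d,1} = 0 \cap \beta_{a,2} = 0 \cap \beta_{d,2} = 0 \cap \beta_{a,a} = 0 \cap \beta_{a,d} = 0 \cap \beta_{d,a} = 0 \cap \beta_{d,d} = 0$$
$$H_A: \beta_{a,1} \neq 0 \cup \beta_{d,1} \neq 0 \cup \beta_{a,2} \neq 0 \cup \beta_{d,2} \neq 0 \cup \beta_{a,a} \neq 0 \cup \beta_{a,d} \neq 0 \cup \beta_{d,a} \neq 0 \cup \beta_{d,d} \neq 0$$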
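For the linear regression case, the epistasis-only test can be sketched as a comparison of nested models: the null model has the five main-effect columns, the alternative adds the four product columns, and the numerator degrees of freedom is the difference in parameter counts (4). The simulated data, effect sizes and function names below are assumptions for illustration; a p-value would come from the F distribution with (df1, df2) degrees of freedom, which is not computed here.

```python
import numpy as np


def rss(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)


def nested_f_statistic(X_null, X_alt, y):
    """F-statistic comparing nested linear models (X_null columns are in X_alt).

    Assumes both matrices have full column rank.
    """
    df1 = X_alt.shape[1] - X_null.shape[1]   # extra parameters under H_A
    df2 = len(y) - X_alt.shape[1]            # residual d.f. of the alternative
    rss0, rss1 = rss(X_null, y), rss(X_alt, y)
    return ((rss0 - rss1) / df1) / (rss1 / df2), df1, df2


# Simulated example with a hypothetical additive-by-additive interaction.
rng = np.random.default_rng(1)
n = 400
xa = np.array([-1, 0, 1])
xd = np.array([-1, 1, -1])
g1, g2 = rng.integers(0, 3, n), rng.integers(0, 3, n)

X_main = np.column_stack([np.ones(n), xa[g1], xd[g1], xa[g2], xd[g2]])
X_inter = np.column_stack([xa[g1] * xa[g2], xa[g1] * xd[g2],
                           xd[g1] * xa[g2], xd[g1] * xd[g2]])
X_alt = np.hstack([X_main, X_inter])
y = 0.5 * xa[g1] + 1.0 * xa[g1] * xa[g2] + rng.normal(size=n)

F, df1, df2 = nested_f_statistic(X_main, X_alt, y)
print(F, df1, df2)
```

The other hypotheses on the slide fit the same function by changing which columns go into the null matrix, for example an intercept-only null for the test of the entire model.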

Final notes on testing for epistasis
Since testing for epistasis requires considering models with more parameters, these tests are generally less powerful than tests of one marker at a time.
In addition, testing for epistasis among all possible pairs of markers (or three or four!, etc.) produces many tests (how many?).
Also, identification of a causal polymorphism can be accomplished by testing just one marker at a time (!!).
For these reasons, epistasis is often a secondary analysis and we often consider a subset of markers (what might be good strategies?).
Note however that correctly inferring epistasis is of value for many reasons (for example?) so we would like to do this.
How to infer epistasis is an active area of research (!!).
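To get a feel for how quickly the number of tests grows, the count of unordered marker pairs is N choose 2, and of k-way combinations N choose k. The marker count below is a hypothetical example, not a figure from the lecture.

```python
from math import comb

n_markers = 500_000                  # hypothetical GWAS marker count

n_single = n_markers                 # one test per marker
n_pairs = comb(n_markers, 2)         # all unordered pairs of markers
n_triples = comb(n_markers, 3)       # all three-way combinations

print(n_single, n_pairs, n_triples)

# A Bonferroni-style per-test threshold for a 0.05 family-wise level,
# shown only to illustrate the cost of the extra tests.
print(0.05 / n_pairs)
```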

Review: GWAS analysis
So far, we have considered a regression (generalized linear modeling = GLM) approach for constructing statistical models of the association of genetic polymorphisms and phenotype.
With this approach, we have considered the following hypotheses:
$$H_0: \beta_a = 0 \cap \beta_d = 0$$
$$H_A: \beta_a \neq 0 \cup \beta_d \neq 0$$
Note that this X coding of genotypes tests the general null hypothesis (in fact, any coding X of the genotypes can be used to construct a test in a GWAS).
There are therefore many other ways in which we could construct a different hypothesis test, and any of these will be a reasonable (and acceptable) strategy for performing a GWAS analysis.

Alternative tests in GWAS I
Since our basic null / alternative hypothesis construction in GWAS covers a large number of possible relationships between genotypes and phenotypes, there are a large number of tests that we could apply in a GWAS,
e.g. t-tests, ANOVA, Wald's test, non-parametric permutation based tests, Kruskal-Wallis tests, other rank based tests, chi-square, Fisher's exact, Cochran-Armitage, etc. (see PLINK for a somewhat comprehensive list of tests used in GWAS).
When can we use different tests? The only restriction is that our data conform to the assumptions of the test (examples?).
We could therefore apply a diversity of tests for any given GWAS.

Alternative tests in GWAS II
Should we use different tests in a GWAS (and why)?
Yes we should - the reason is that different tests have different performance depending on the (unknown) conditions of the system and experiment, i.e. some may perform better than others.
In general, since we don't know the true conditions (and therefore which test will be best suited), we should run a number of tests and compare results.
How to compare results of different GWAS is a fuzzy case (= no unconditional rules), but a reasonable approach is to treat each test as a distinct GWAS analysis and compare the hits across analyses using the following rules:
If all methods identify the same hits (= genomic locations), then this is good evidence that there is a causal polymorphism.
If methods do not agree on the position (e.g. some are significant, some are not), we should attempt to determine the reason for the discrepancy (this requires that we understand the tests and have experience).

Alternative tests in GWAS III
We do not have time in this course to do a comprehensive review of possible tests (keep in mind, every time you learn a new test in a statistics class, there is a good chance you could apply it in a GWAS!).
Let's consider a few examples of alternative tests that could be applied.
Remember that to apply these alternative tests, you will perform N alternative tests, one for each marker-phenotype combination, where for each case we are testing the following hypotheses with different (implicit) codings of X (!!):
$$H_0: Cov(Y, X) = 0$$
$$H_A: Cov(Y, X) \neq 0$$
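As one concrete instance of testing H0: Cov(Y, X) = 0 for a single marker, the sketch below uses a permutation test (one of the non-parametric options listed earlier): the phenotypes are shuffled relative to the genotype codes, and the observed absolute covariance is compared with the shuffled ones. The function names, the number of permutations and the toy data are all assumptions for illustration.

```python
import random


def covariance(x, y):
    """Sample covariance of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)


def permutation_pvalue(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for H0: Cov(Y, X) = 0."""
    rng = random.Random(seed)
    observed = abs(covariance(x, y))
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)              # break any X-Y relationship
        if abs(covariance(x, y_perm)) >= observed:
            hits += 1
    # Add one to numerator and denominator so the p-value is never exactly 0.
    return (hits + 1) / (n_perm + 1)


# Toy data: additive genotype codes and a phenotype that increases with them.
x = [-1] * 10 + [0] * 10 + [1] * 10
y = [0.1 * i for i in range(30)]
p = permutation_pvalue(x, y)
print(p)
```

In a GWAS this would be repeated for each of the N markers, with the coding of X chosen to match the relationship being tested.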

Alternative test examples I
First, let's consider a case-control phenotype and a chi-square test (which has deep connections to our logistic regression test under certain assumptions, but slightly different properties!).
To construct the test statistic, we consider the counts of genotype-phenotype combinations (first table) and calculate the expected numbers in each cell (second table):
$$\begin{array}{c|cc|c}
 & \text{Case} & \text{Control} & \\ \hline
A_1A_1 & n_{11} & n_{12} & n_{1.} \\
A_1A_2 & n_{21} & n_{22} & n_{2.} \\
A_2A_2 & n_{31} & n_{32} & n_{3.} \\ \hline
 & n_{.1} & n_{.2} & n
\end{array}$$
$$\begin{array}{c|cc|c}
 & \text{Case} & \text{Control} & \\ \hline
A_1A_1 & (n_{.1}n_{1.})/n & (n_{.2}n_{1.})/n & n_{1.} \\
A_1A_2 & (n_{.1}n_{2.})/n & (n_{.2}n_{2.})/n & n_{2.} \\
A_2A_2 & (n_{.1}n_{3.})/n & (n_{.2}n_{3.})/n & n_{3.} \\ \hline
 & n_{.1} & n_{.2} & n
\end{array}$$
We then construct the following test statistic:
$$LRT = -2\ln\Lambda = 2\sum_{i=1}^{3}\sum_{j=1}^{2} n_{ij}\ln\left(\frac{n_{ij}}{n_{i.}n_{.j}/n}\right)$$
where the (asymptotic) distribution when the null hypothesis is true, i.e. as the sample size tends to infinity, is $\chi^2_{d.f.=2}$, with d.f. = (#columns-1)(#rows-1) = 2.
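The statistic can be computed directly from a 3x2 table of counts, as in the following sketch. The counts are hypothetical, the function names are made up for this example, and the p-value uses the fact that the chi-square survival function with 2 degrees of freedom is exp(-x/2), so no statistics library is needed.

```python
import math


def lrt_statistic(observed):
    """2 * sum of n_ij * ln(n_ij / expected_ij) over all cells of the table."""
    n = sum(sum(row) for row in observed)
    row_tot = [sum(row) for row in observed]
    col_tot = [sum(col) for col in zip(*observed)]
    stat = 0.0
    for i, row in enumerate(observed):
        for j, n_ij in enumerate(row):
            if n_ij > 0:                       # a zero cell contributes 0
                expected = row_tot[i] * col_tot[j] / n
                stat += 2.0 * n_ij * math.log(n_ij / expected)
    return stat


def chi2_sf_df2(x):
    """Upper-tail probability of a chi-square with exactly 2 d.f."""
    return math.exp(-x / 2.0)


# Hypothetical counts: rows A1A1, A1A2, A2A2; columns case, control.
counts = [[30, 10], [50, 50], [20, 40]]
stat = lrt_statistic(counts)
p_value = chi2_sf_df2(stat)
print(stat, p_value)
```

The closed-form tail probability holds only for 2 degrees of freedom, which is the case for the 3x2 genotype by case-control table.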

Alternative test examples II
Second, let's consider a Fisher's exact test.
Note that the LRT for the null hypothesis under the chi-square test was only asymptotically exact, i.e. it is exact as the sample size n approaches infinity but it is not exact for smaller sample sizes (although we hope it is close!).
Could we construct a test that is exact for smaller sample sizes?
Yes, we can calculate a Fisher's test statistic for our sample, where the distribution under the null hypothesis is exact for any sample size (I will let you look up how to calculate this statistic and the distribution under the null on your own):
$$\begin{array}{c|cc}
 & \text{Case} & \text{Control} \\ \hline
A_1A_1 & n_{11} & n_{12} \\
A_1A_2 & n_{21} & n_{22} \\
A_2A_2 & n_{31} & n_{32}
\end{array}$$
Given this test is exact, why would we ever use chi-square / what is a rule for when we should use one versus the other?

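Alternative test examples III
Third, let's consider ways of grouping the cells, where we could apply either a chi-square or a Fisher's exact test.
Taking A1 as the minor allele, we can apply a recessive test (first table) and a dominance test (second table):
$$\begin{array}{c|cc}
 & \text{Case} & \text{Control} \\ \hline
A_1A_1 & n_{11} & n_{12} \\
A_1A_2 \cup A_2A_2 & n_{21} & n_{22}
\end{array}
\qquad
\begin{array}{c|cc}
 & \text{Case} & \text{Control} \\ \hline
A_1A_1 \cup A_1A_2 & n_{11} & n_{12} \\
A_2A_2 & n_{21} & n_{22}
\end{array}$$
We could also apply an allele test (note these test names are from PLINK):
$$\begin{array}{c|cc}
 & \text{Case} & \text{Control} \\ \hline
A_1 & n_{11} & n_{12} \\
A_2 & n_{21} & n_{22}
\end{array}$$
When should we expect one of these tests to perform better than the others?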
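The three groupings can be sketched as functions that collapse a 3x2 genotype table into a 2x2 table, after which a Fisher's exact test for a 2x2 table can be applied. In the sketch below the counts are hypothetical, the function names are made up, and the two-sided p-value sums the probabilities of all tables (with the same margins) that are no more likely than the observed one, which is one common convention.

```python
from math import comb


def collapse(table, model):
    """Collapse rows (A1A1, A1A2, A2A2) x columns (case, control) to 2x2."""
    (a, b), (c, d), (e, f) = table
    if model == "recessive":   # A1A1 versus (A1A2 or A2A2)
        return [[a, b], [c + e, d + f]]
    if model == "dominant":    # (A1A1 or A1A2) versus A2A2
        return [[a + c, b + d], [e, f]]
    if model == "allelic":     # count alleles: two per individual
        return [[2 * a + c, 2 * b + d], [c + 2 * e, d + 2 * f]]
    raise ValueError(f"unknown model: {model}")


def fisher_exact_2x2(table):
    """Two-sided Fisher's exact p-value for a 2x2 table of counts."""
    (a, b), (c, d) = table
    r1, r2, c1 = a + b, c + d, a + c
    lo, hi = max(0, c1 - r2), min(r1, c1)
    # Unnormalised hypergeometric weights for every table with these margins.
    weight = {x: comb(r1, x) * comb(r2, c1 - x) for x in range(lo, hi + 1)}
    extreme = sum(w for w in weight.values() if w <= weight[a])
    return extreme / comb(r1 + r2, c1)


counts = [[30, 10], [50, 50], [20, 40]]   # hypothetical genotype counts
p_values = {m: fisher_exact_2x2(collapse(counts, m))
            for m in ("recessive", "dominant", "allelic")}
print(p_values)
```

Note that the allele table counts each individual twice, so it treats the two alleles of a person as independent observations.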

Basic GWAS wrap-up
You now have all the tools at your disposal to perform a GWAS analysis of real data (!!).
Recall that producing a good GWAS analysis requires iterative analysis of the data and considering why you might be getting the results that you observe.
Also recall that the more experience you have performing (careful / thoughtful) GWAS analyses, the better you will get at it!

That's it for today
Next lecture: we will begin our brief introduction to Bayesian statistics.