BTRY 4830/6830: Quantitative Genomics and Genetics Lecture 23: Alternative tests in GWAS / (Brief) Introduction to Bayesian Inference Jason Mezey jgm45@cornell.edu Nov. 13, 2014 (Th) 8:40-9:55
Announcements Homework #5 available (see your TA!) We will get you details for the final next week
Summary of lecture 23
We will review some basics of epistasis and testing for it (potentially a good topic for your project!?)
We will briefly discuss alternative testing approaches in GWAS
We will provide a (brief) introduction to Bayesian inference
Review: epistasis
epistasis - a case where the effect of an allele substitution at one locus (A1 -> A2) alters the effect of substituting an allele at another locus (B1 -> B2)
This may be equivalently phrased as a change in the expected phenotype (genotypic value) for a genotype at one locus conditional on the genotype at another locus
Note that there is a symmetry in epistasis: if the effect of at least one allelic substitution (from one genotype to another) at one locus depends on the genotype at the other locus, then at least one allelic substitution at the other locus will be dependent as well
A consequence of this symmetry is that if there is an epistatic relationship between two loci, BOTH will be causal polymorphisms for the phenotype (!!!)
If there is an epistatic effect (= relationship) between loci, we would therefore like to know this information
Note that such relationships are not limited to pairs of loci: they can exist among three (three-way), four (four-way), etc.
The amount of epistasis among loci for any given phenotype is unknown (but without question it is ubiquitous!!)
Review: modeling epistasis I To model epistasis, we are going to use our same GLM framework (!!) The parameterization (using Xa and Xd) that we have considered so far perfectly models any case where there is no epistasis We will account for the possibility of epistasis by constructing additional dummy variables and adding additional parameters (so that we have 9 total in our GLM)
Review: modeling epistasis II
Recall the dummy variables we have constructed so far, now defined for both loci:

X_{a,1} = -1 for A1A1, 0 for A1A2, 1 for A2A2
X_{d,1} = -1 for A1A1, 1 for A1A2, -1 for A2A2
X_{a,2} = -1 for B1B1, 0 for B1B2, 1 for B2B2
X_{d,2} = -1 for B1B1, 1 for B1B2, -1 for B2B2

We will use these dummy variables to construct additional dummy variables in our GLM (and add additional parameters) to account for epistasis:

Y = γ^{-1}(β_µ + X_{a,1}β_{a,1} + X_{d,1}β_{d,1} + X_{a,2}β_{a,2} + X_{d,2}β_{d,2} + X_{a,1}X_{a,2}β_{a,a} + X_{a,1}X_{d,2}β_{a,d} + X_{d,1}X_{a,2}β_{d,a} + X_{d,1}X_{d,2}β_{d,d})
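The codings above can be sketched in code. A minimal sketch (Python is my choice here, not the course's; the function name is hypothetical, and genotypes are assumed coded as minor-allele counts 0/1/2 at each locus):

```python
import numpy as np

def epistasis_dummies(g1, g2):
    """Build the four single-locus dummy variables and the four
    interaction (epistatic) dummy variables from genotypes coded
    as minor-allele counts (0, 1, 2) at two loci.

    Hypothetical helper for illustration, not course-provided code."""
    g1, g2 = np.asarray(g1), np.asarray(g2)
    # additive coding: -1, 0, 1; dominance coding: -1, 1, -1
    xa1, xd1 = g1 - 1, 1 - 2 * np.abs(g1 - 1)
    xa2, xd2 = g2 - 1, 1 - 2 * np.abs(g2 - 1)
    # the epistatic dummy variables are products of the marginal codings
    return np.column_stack([xa1, xd1, xa2, xd2,
                            xa1 * xa2, xa1 * xd2, xd1 * xa2, xd1 * xd2])
```

For example, an A1A1 / B1B1 individual gets (-1, -1, -1, -1, 1, 1, 1, 1), matching the genotype tables on the following slides.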
Review: modeling epistasis III
Y = γ^{-1}(β_µ + X_{a,1}β_{a,1} + X_{d,1}β_{d,1} + X_{a,2}β_{a,2} + X_{d,2}β_{d,2} + X_{a,1}X_{a,2}β_{a,a} + X_{a,1}X_{d,2}β_{a,d} + X_{d,1}X_{a,2}β_{d,a} + X_{d,1}X_{d,2}β_{d,d})
To provide some intuition concerning what each of these is capturing, consider the values that each of the genotypes would take for dummy variable X_{a,1}:

        B1B1  B1B2  B2B2
A1A1     -1    -1    -1
A1A2      0     0     0
A2A2      1     1     1
Review: modeling epistasis IV
Y = γ^{-1}(β_µ + X_{a,1}β_{a,1} + X_{d,1}β_{d,1} + X_{a,2}β_{a,2} + X_{d,2}β_{d,2} + X_{a,1}X_{a,2}β_{a,a} + X_{a,1}X_{d,2}β_{a,d} + X_{d,1}X_{a,2}β_{d,a} + X_{d,1}X_{d,2}β_{d,d})
To provide some intuition concerning what each of these is capturing, consider the values that each of the genotypes would take for dummy variable X_{d,1}:

        B1B1  B1B2  B2B2
A1A1     -1    -1    -1
A1A2      1     1     1
A2A2     -1    -1    -1
Review: modeling epistasis V
Y = γ^{-1}(β_µ + X_{a,1}β_{a,1} + X_{d,1}β_{d,1} + X_{a,2}β_{a,2} + X_{d,2}β_{d,2} + X_{a,1}X_{a,2}β_{a,a} + X_{a,1}X_{d,2}β_{a,d} + X_{d,1}X_{a,2}β_{d,a} + X_{d,1}X_{d,2}β_{d,d})
To provide some intuition concerning what each of these is capturing, consider the values that each of the genotypes would take for dummy variable X_{a,1}X_{a,2}:

        B1B1  B1B2  B2B2
A1A1      1     0    -1
A1A2      0     0     0
A2A2     -1     0     1
Review: modeling epistasis VI
Y = γ^{-1}(β_µ + X_{a,1}β_{a,1} + X_{d,1}β_{d,1} + X_{a,2}β_{a,2} + X_{d,2}β_{d,2} + X_{a,1}X_{a,2}β_{a,a} + X_{a,1}X_{d,2}β_{a,d} + X_{d,1}X_{a,2}β_{d,a} + X_{d,1}X_{d,2}β_{d,d})
To provide some intuition concerning what each of these is capturing, consider the values that each of the genotypes would take for dummy variable X_{a,1}X_{d,2} (similarly for X_{d,1}X_{a,2}):

        B1B1  B1B2  B2B2
A1A1      1    -1     1
A1A2      0     0     0
A2A2     -1     1    -1
Review: modeling epistasis VII
Y = γ^{-1}(β_µ + X_{a,1}β_{a,1} + X_{d,1}β_{d,1} + X_{a,2}β_{a,2} + X_{d,2}β_{d,2} + X_{a,1}X_{a,2}β_{a,a} + X_{a,1}X_{d,2}β_{a,d} + X_{d,1}X_{a,2}β_{d,a} + X_{d,1}X_{d,2}β_{d,d})
To provide some intuition concerning what each of these is capturing, consider the values that each of the genotypes would take for dummy variable X_{d,1}X_{d,2}:

        B1B1  B1B2  B2B2
A1A1      1    -1     1
A1A2     -1     1    -1
A2A2      1    -1     1
Review: inference for epistasis I
To infer epistatic relationships we will use the exact same genetic framework and statistical framework that we have been considering
For the genetic framework, we are still testing markers that we are assuming are in LD with causal polymorphisms that could have an epistatic relationship (so we are indirectly inferring that there is epistasis from the marker genotypes)
For inference, we are going to estimate epistatic parameters using the same approach as before (!!), i.e. for a linear model:

X = [1, X_{a,1}, X_{d,1}, X_{a,2}, X_{d,2}, X_{a,a}, X_{a,d}, X_{d,a}, X_{d,d}]
β = [β_µ, β_{a,1}, β_{d,1}, β_{a,2}, β_{d,2}, β_{a,a}, β_{a,d}, β_{d,a}, β_{d,d}]^T
β̂ = (X^T X)^{-1} X^T y
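The estimator above can be checked numerically. A minimal sketch (Python, with simulated genotypes and phenotypes; all parameter values here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# hypothetical genotypes at two loci, coded as minor-allele counts 0/1/2
g1, g2 = rng.integers(0, 3, n), rng.integers(0, 3, n)
xa1, xd1 = g1 - 1, 1 - 2 * np.abs(g1 - 1)   # additive / dominance codings
xa2, xd2 = g2 - 1, 1 - 2 * np.abs(g2 - 1)

# design matrix with intercept, four marginal and four epistatic columns
X = np.column_stack([np.ones(n), xa1, xd1, xa2, xd2,
                     xa1 * xa2, xa1 * xd2, xd1 * xa2, xd1 * xd2])
# hypothetical true parameter vector (beta_mu, ..., beta_dd)
beta = np.array([1.0, 0.5, 0.0, 0.3, 0.0, 0.4, 0.0, 0.0, 0.0])
y = X @ beta + rng.normal(0, 0.1, n)

# beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

With this sample size and noise level, `beta_hat` recovers the simulated parameters closely, including the epistatic term β_{a,a} = 0.4.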
Review: inference for epistasis II
For hypothesis testing, we will just use an LRT calculated the same way as before (!!)
For an F-statistic for a linear regression, and for a logistic regression, estimate the parameters under the null and alternative models and substitute these into the likelihood equations, which have the same form as before (with some additional dummy variables and parameters)
The only difference is the degrees of freedom: for a given test, d.f. = number of parameters in the alternative model - number of parameters in the null model
Review: inference for epistasis III
For example, we could use the entire model to test the same hypothesis that we have been considering for a single marker:
H_0: β_{a,1} = 0, β_{d,1} = 0   vs   H_A: β_{a,1} ≠ 0 and/or β_{d,1} ≠ 0
We could also test whether either marker has evidence of being a causal polymorphism:
H_0: β_{a,1} = β_{d,1} = β_{a,2} = β_{d,2} = 0   vs   H_A: at least one of these ≠ 0
We can also test just for epistasis (note this is equivalent to testing an interaction effect in an ANOVA!):
H_0: β_{a,a} = β_{a,d} = β_{d,a} = β_{d,d} = 0   vs   H_A: at least one of these ≠ 0
We can also test the entire model (what is the interpretation in this case!?):
H_0: β_{a,1} = β_{d,1} = β_{a,2} = β_{d,2} = β_{a,a} = β_{a,d} = β_{d,a} = β_{d,d} = 0   vs   H_A: at least one of these ≠ 0
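The epistasis-only test can be sketched as a nested-model F-test: fit the 5-parameter model without interaction terms as the null and the full 9-parameter model as the alternative, so df1 = 9 - 5 = 4. A minimal sketch (Python, simulated data with a made-up pure interaction effect; the function name is hypothetical):

```python
import numpy as np
from scipy import stats

def f_test(X_null, X_alt, y):
    """F-test comparing nested linear models fit by least squares."""
    def rss(X):
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        return np.sum((y - X @ b) ** 2)
    df1 = X_alt.shape[1] - X_null.shape[1]   # params(alt) - params(null)
    df2 = len(y) - X_alt.shape[1]
    f = ((rss(X_null) - rss(X_alt)) / df1) / (rss(X_alt) / df2)
    return f, stats.f.sf(f, df1, df2)

# hypothetical example: a phenotype driven only by an epistatic effect
rng = np.random.default_rng(1)
g1, g2 = rng.integers(0, 3, 400), rng.integers(0, 3, 400)
xa1, xd1 = g1 - 1, 1 - 2 * np.abs(g1 - 1)
xa2, xd2 = g2 - 1, 1 - 2 * np.abs(g2 - 1)
X_null = np.column_stack([np.ones(400), xa1, xd1, xa2, xd2])
X_alt = np.column_stack([X_null, xa1 * xa2, xa1 * xd2, xd1 * xa2, xd1 * xd2])
y = 1 + 0.8 * xa1 * xa2 + rng.normal(0, 1, 400)

f, p = f_test(X_null, X_alt, y)   # df1 = 4 as above
```

Note the single-marker tests would have little power here, since the marginal effects are zero by construction.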
Final notes on testing for epistasis
Since testing for epistasis requires considering models with more parameters, these tests are generally less powerful than tests of one marker at a time
In addition, testing for epistasis among all possible pairs of markers (or triples, quadruples, etc.!) produces many tests (how many?)
Also, identification of a causal polymorphism can be accomplished by testing just one marker at a time (!!)
For these reasons, epistasis is often a secondary analysis, and we often consider only a subset of markers (what might be good strategies?)
Note however that correctly inferring epistasis is of value for many reasons (for example?) so we would like to do this
How to infer epistasis is an active area of research (!!)
Review: GWAS analysis
So far, we have considered a regression (generalized linear modeling = GLM) approach for constructing statistical models of the association of genetic polymorphisms and phenotype
With this we considered the following hypotheses:
H_0: β_a = 0, β_d = 0   vs   H_A: β_a ≠ 0 and/or β_d ≠ 0
Note that this X coding of genotypes tests the general null hypothesis (in fact, any coding X of the genotypes can be used to construct a test in a GWAS)
There are therefore many other ways in which we could construct a different hypothesis test, and any of these will be a reasonable (and acceptable) strategy for performing a GWAS analysis
Alternative tests in GWAS I
Since our basic null / alternative hypothesis construction in GWAS covers a large number of possible relationships between genotypes and phenotypes, there are a large number of tests that we could apply in a GWAS, e.g. t-tests, ANOVA, Wald's test, non-parametric permutation-based tests, Kruskal-Wallis tests, other rank-based tests, chi-square, Fisher's exact, Cochran-Armitage, etc. (see PLINK for a somewhat comprehensive list of tests used in GWAS)
When can we use different tests? The only restriction is that our data conform to the assumptions of the test (examples?)
We could therefore apply a diversity of tests for any given GWAS
Alternative tests in GWAS II
Should we use different tests in a GWAS (and why)? Yes we should - the reason is that different tests have different performance depending on the (unknown) conditions of the system and experiment, i.e. some may perform better than others
In general, since we don't know the true conditions (and therefore which test will be best suited), we should run a number of tests and compare results
How to compare results of different GWAS tests is a fuzzy case (= there are no unconditional rules), but a reasonable approach is to treat each test as a distinct GWAS analysis and compare the hits across analyses using the following rules:
If all methods identify the same hits (= genomic locations), then this is good evidence that there is a causal polymorphism
If methods do not agree on a position (e.g. some are significant, some are not), we should attempt to determine the reason for the discrepancy (this requires that we understand the tests, and experience)
Alternative tests in GWAS III
We do not have time in this course to do a comprehensive review of possible tests (keep in mind, every time you learn a new test in a statistics class, there is a good chance you could apply it in a GWAS!)
Let's consider a few example alternative tests that could be applied
Remember that to apply these alternative tests, you will perform N tests, one for each marker-phenotype combination, where for each case, we are testing the following hypotheses with different (implicit) codings of X (!!):
H_0: Cov(Y, X) = 0   vs   H_A: Cov(Y, X) ≠ 0
Alternative test examples I
First, let's consider a case-control phenotype and a chi-square test (which has deep connections to our logistic regression test under certain assumptions, but it has slightly different properties!)
To construct the test statistic, we consider the counts of genotype-phenotype combinations (left) and calculate the expected counts in each cell under the null hypothesis of independence (right):

Observed:   Case   Control
A1A1        n_11   n_12     n_1.
A1A2        n_21   n_22     n_2.
A2A2        n_31   n_32     n_3.
            n_.1   n_.2     n

Expected:   Case            Control
A1A1        (n_1. n_.1)/n   (n_1. n_.2)/n
A1A2        (n_2. n_.1)/n   (n_2. n_.2)/n
A2A2        (n_3. n_.1)/n   (n_3. n_.2)/n

We then construct the following test statistic:

LRT = -2 ln Λ = 2 Σ_{i=1}^{3} Σ_{j=1}^{2} n_ij ln( n_ij n / (n_i. n_.j) )

where the (asymptotic) distribution when the null hypothesis is true (i.e. as the sample size tends to infinity) is chi-square with d.f. = (#columns - 1)(#rows - 1) = 2
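The statistic above can be computed directly from a 3x2 count table. A minimal sketch (Python; the function name and the example counts are hypothetical, and tables with zero cells would need special handling since ln(0) is undefined):

```python
import numpy as np
from scipy import stats

def genotype_lrt(table):
    """G-statistic (LRT) for independence in a genotype-by-status
    contingency table; asymptotically chi-square with
    d.f. = (rows - 1)(cols - 1). Assumes all observed cells > 0."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    # expected counts under independence: n_i. * n_.j / n
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    lrt = 2.0 * np.sum(table * np.log(table / expected))
    df = (table.shape[0] - 1) * (table.shape[1] - 1)
    return lrt, stats.chi2.sf(lrt, df)

# hypothetical genotype counts: rows A1A1 / A1A2 / A2A2, cols case / control
lrt, p = genotype_lrt([[40, 60], [50, 50], [30, 20]])
```

This matches `scipy.stats.chi2_contingency` with `lambda_="log-likelihood"`, which computes the same G-test.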
Alternative test examples II
Second, let's consider Fisher's exact test
Note that the LRT for the null hypothesis under the chi-square test was only asymptotically exact, i.e. it is exact as the sample size n approaches infinity, but it is not exact for smaller sample sizes (although we hope it is close!)
Could we construct a test that is exact for smaller sample sizes? Yes, we can calculate a Fisher's exact test statistic for our sample, where the distribution under the null hypothesis is exact for any sample size (I will let you look up how to calculate this statistic and the distribution under the null on your own):

        Case   Control
A1A1    n_11   n_12
A1A2    n_21   n_22
A2A2    n_31   n_32

Given this test is exact, why would we ever use the chi-square test / what is a rule for when we should use one versus the other?
Alternative test examples III
Third, let's consider other ways of grouping the cells, where we could apply either a chi-square or a Fisher's exact test
For MAF = A1, we can apply a recessive (left) and dominant test (right):

Recessive:      Case   Control       Dominant:      Case   Control
A1A1            n_11   n_12          A1A1 + A1A2    n_11   n_12
A1A2 + A2A2     n_21   n_22          A2A2           n_21   n_22

We could also apply an allele test (note these test names are from PLINK):

Allele:   Case   Control
A1        n_11   n_12
A2        n_21   n_22

When should we expect one of these tests to perform better than the others?
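The three groupings can be produced mechanically from the 3x2 genotype count table and fed to an exact test. A minimal sketch (Python, using `scipy.stats.fisher_exact` for the 2x2 tables; the function name and the counts are hypothetical):

```python
import numpy as np
from scipy import stats

def collapsed_tables(counts):
    """Collapse a 3x2 genotype-by-status count table (rows A1A1, A1A2,
    A2A2; columns case, control) into the 2x2 tables for the
    recessive, dominant and allele tests (PLINK-style names)."""
    c = np.asarray(counts)
    recessive = np.array([c[0], c[1] + c[2]])            # A1A1 vs rest
    dominant = np.array([c[0] + c[1], c[2]])             # A1A1+A1A2 vs A2A2
    allele = np.array([2 * c[0] + c[1], c[1] + 2 * c[2]])  # A1 vs A2 alleles
    return recessive, dominant, allele

# hypothetical counts
for name, t in zip(("recessive", "dominant", "allele"),
                   collapsed_tables([[30, 10], [40, 50], [30, 40]])):
    odds, p = stats.fisher_exact(t)
    print(name, p)
```

Each grouping has most power when the true genetic model matches it (e.g. the recessive grouping when A1 acts recessively), which is one answer to the question above.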
Basic GWAS wrap-up You now have all the tools at your disposal to perform a GWAS analysis of real data (!!) Recall that producing a good GWAS analysis requires iterative analysis of the data and considering why you might be getting the results that you observe Also recall that the more experience you have performing (careful / thoughtful) GWAS analyses, the better you will get at it!
Introduction to Bayesian analysis I
Up to this point, we have considered statistical analysis (and inference) using a Frequentist formalism
There is an alternative formalism called Bayesian that we will now (and in the final lectures) introduce in a very brief manner
Note that there is an important conceptual split between statisticians who consider themselves Frequentist or Bayesian, but for GWAS analysis (and for most applications where we are concerned with analyzing data) we do not have a preference, i.e. we only care about getting the right biological answer, so any (or both) frameworks that get us to this goal are useful
In GWAS (and mapping) analysis, you will see both frequentist (i.e. the framework we have built up to this point!) and Bayesian approaches applied
Introduction to Bayesian analysis II
In both frequentist and Bayesian analyses, we have the same probabilistic framework (sample spaces, random variables, probability models, etc.), and when assuming our probability model falls in a family of parameterized distributions, we assume that a single fixed parameter value(s) describes the true model that produced our sample
However, in a Bayesian framework, we now allow the parameter to have its own probability distribution (we DO NOT do this in a frequentist analysis), such that we treat it as a random variable
This may seem strange - how can we consider a parameter to have a probability distribution if it is fixed? However, we can if we have some prior assumptions about what values the parameter will take for our system compared to others, and we can make this prior assumption rigorous by assuming there is a probability distribution associated with the parameter
It turns out, this assumption produces major differences between the two analysis procedures (in how they consider probability, how they perform inference, etc.)
Introduction to Bayesian analysis III
To introduce Bayesian statistics, we need to begin by introducing Bayes' theorem
Consider a set of events (remember events!?) A_1, ..., A_k of a sample space S (where k may be infinite), which form a partition of the sample space, i.e. ∪_{i=1}^{k} A_i = S and A_i ∩ A_j = ∅ for all i ≠ j
For another event B ⊆ S (which may be S itself), define the law of total probability:

Pr(B) = Σ_{i=1}^{k} Pr(B ∩ A_i) = Σ_{i=1}^{k} Pr(B | A_i) Pr(A_i)

Now we can state Bayes' theorem:

Pr(A_i | B) = Pr(A_i ∩ B) / Pr(B) = Pr(B | A_i) Pr(A_i) / Pr(B) = Pr(B | A_i) Pr(A_i) / Σ_{j=1}^{k} Pr(B | A_j) Pr(A_j)
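The two identities above can be checked with a small numeric example. A minimal sketch (Python; the partition and all the probability values are made up for illustration):

```python
# A numeric sketch of the law of total probability and Bayes' theorem
# for a hypothetical three-event partition A_1, A_2, A_3 of S.
prior = [0.5, 0.3, 0.2]   # Pr(A_i); must sum to one
cond = [0.1, 0.4, 0.8]    # Pr(B | A_i)

# law of total probability: Pr(B) = sum_i Pr(B | A_i) Pr(A_i)
pB = sum(c * p for c, p in zip(cond, prior))

# Bayes' theorem: Pr(A_i | B) = Pr(B | A_i) Pr(A_i) / Pr(B)
posterior = [c * p / pB for c, p in zip(cond, prior)]
```

Note how observing B shifts probability toward A_3, the event under which B was most likely, even though A_3 had the smallest prior probability.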
Introduction to Bayesian analysis IV
Remember that in a Bayesian (not frequentist!) framework, our parameter(s) have a probability distribution associated with them that reflects our belief in the values that might be the true value of the parameter
Since we are treating the parameter as a random variable, we can consider the joint distribution of the parameter AND a sample Y produced under a probability model: Pr(θ, Y)
For inference, we are interested in the probability the parameter takes a certain value given a sample: Pr(θ | y)
Using Bayes' theorem, we can write:

Pr(θ | y) = Pr(y | θ) Pr(θ) / Pr(y)

Also note that since the sample is fixed (i.e. we are considering a single sample), Pr(y) = c, so we can rewrite this as follows:

Pr(θ | y) ∝ Pr(y | θ) Pr(θ)
Introduction to Bayesian analysis V
Let's consider the structure of our main equation in Bayesian statistics:

Pr(θ | y) ∝ Pr(y | θ) Pr(θ)

Note that the left hand side is called the posterior probability: Pr(θ | y)
The first term of the right hand side is something we have seen before, i.e. the likelihood (!!): Pr(y | θ) = L(θ | y)
The second term of the right hand side is new and is called the prior: Pr(θ)
Note that the prior is how we incorporate our assumptions concerning the values the true parameter value may take
In a Bayesian framework, we are making two assumptions (unlike a frequentist framework, where we make one): 1. the probability distribution that generated the sample, 2. the probability distribution of the parameter
Probability in a Bayesian framework
By allowing the parameter to have a prior probability distribution, we produce a change in how we consider probability in a Bayesian versus Frequentist perspective
For example, consider a coin flip, with Bern(p)
In a Frequentist framework, we consider a conception of probability that we use for inference to reflect the outcomes as if we flipped the coin an infinite number of times, i.e. if we flipped the coin 100 times and it was heads each time, we do not use this information to change how we consider a new experiment with this same coin if we flipped it again
In a Bayesian framework, we consider a conception of probability that can incorporate previous observations, i.e. if we flipped a coin 100 times and it was heads each time, we might want to incorporate this information into our inferences from a new experiment with this same coin if we flipped it again
Note that this philosophical distinction is very deep (= we have only scratched the surface with this one example)
Debating the Frequentist versus Bayesian frameworks
Frequentists often argue that because they do not take previous experience into account when performing their inference concerning the value of a parameter, they do not introduce biases into their inference framework
In response, Bayesians often argue:
Previous experience is used to specify the probability model in the first place
By not incorporating previous experience in the inference procedure, prior assumptions are still being used (which can introduce logical inconsistencies!)
The idea of considering an infinite number of observations is not particularly realistic (and can be a non-sensical abstraction for the real world)
The impact of prior assumptions in Bayesian inference disappears as the sample size goes to infinity
Again, note that we have only scratched the surface of this debate!
Types of priors in Bayesian analysis
Up to this point, we have discussed priors in an abstract manner
To start making this concept more clear, let's consider one of our original examples where we are interested in knowing the mean human height in the US (what are the components of the statistical framework for this example!? Note the basic components are the same in Frequentist / Bayesian!)
If we assume a normal probability model of human height (what parameter are we interested in inferring in this case and why?) in a Bayesian framework, we will at least need to define a prior: Pr(µ)
One possible approach is to make the probability of each possible value of the parameter the same (what distribution are we assuming and what is a problem with this approach?), which defines an improper prior: Pr(µ) = c
Another possible approach is to incorporate our previous observations that heights are seldom infinite, etc., where one choice for incorporating these observations is by defining a prior that has the same form of distribution as our probability model, which defines a conjugate prior (which is also a proper prior): Pr(µ) ~ N(κ, φ²)
Constructing the posterior probability
Let's put this all together for our heights in the US example
First, recall that our assumption is the probability model is normal (so what is the form of the likelihood?): Y ~ N(µ, σ²)
Second, assume a normal prior for the parameter we are interested in: Pr(µ) ~ N(κ, φ²)
From the Bayesian equation, we can now put this together as follows:

Pr(µ | y) ∝ Pr(y | µ) Pr(µ) = [ Π_{i=1}^{n} (1/√(2πσ²)) e^{-(y_i - µ)² / (2σ²)} ] (1/√(2πφ²)) e^{-(µ - κ)² / (2φ²)}

Note that with a little rearrangement, this can be written in the following form:

Pr(µ | y) = N( (κ/φ² + nȳ/σ²) / (1/φ² + n/σ²) , (1/φ² + n/σ²)^{-1} )
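The rearranged form can be computed directly. A minimal sketch (Python; σ is treated as known for simplicity, and the sample, prior mean κ and prior sd φ are all made-up values for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 10.0              # assumed known sd of the height probability model
kappa, phi = 170.0, 20.0  # hypothetical prior: Pr(mu) ~ N(kappa, phi^2)
y = rng.normal(175.0, sigma, size=50)   # hypothetical height sample
n, ybar = len(y), y.mean()

# conjugate posterior: normal, with precision-weighted mean
post_prec = 1.0 / phi**2 + n / sigma**2          # posterior precision
post_mean = (kappa / phi**2 + n * ybar / sigma**2) / post_prec
post_sd = post_prec ** -0.5
```

The posterior mean is a precision-weighted average of the prior mean κ and the sample mean ȳ, so it always lies between the two; the posterior sd is smaller than the frequentist standard error σ/√n because the prior contributes additional precision.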
Bayesian inference: estimation
Inference in a Bayesian framework differs from a frequentist framework in both estimation and hypothesis testing
For example, for estimation in a Bayesian framework, we always construct estimators using the posterior probability distribution, for example:

θ̂ = mean(θ | y) = ∫ θ Pr(θ | y) dθ    or    θ̂ = median(θ | y)

For example, in our heights in the US example our estimator is:

µ̂ = median(µ | y) = mean(µ | y) = (κ/φ² + nȳ/σ²) / (1/φ² + n/σ²)

Note 1: again notice that the impact of the prior disappears as the sample size goes to infinity (= same as the MLE under this condition):

(κ/φ² + nȳ/σ²) / (1/φ² + n/σ²) → (nȳ/σ²) / (n/σ²) = ȳ   as   n → ∞

Note 2: estimates in a Bayesian framework can be different than in a likelihood (Frequentist) framework since estimator construction is fundamentally different (!!)
That s it for today Next lecture: we will continue our brief introduction to Bayesian statistics