Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

Size: px

Start display at page:

Download "Quantitative Genomics and Genetics BTRY 4830/6830; PBSB"

Lynn Wilcox
5 years ago
Views:

1 Quantitative Genomics and Genetics BTRY 4830/6830; PBSB Lecture11: Quantitative Genomics II Jason Mezey March 7, 019 (Th) 10:10-11:5

2 Announcements Homework #5 will be posted by Sat (March 9) and you will have > 1 week to complete (Due: 11:59PM, Tues., March 19) Reminder: more posts coming (e.g., matrix basics!), computer labs today / tomorrow!

3 Summary of lecture 11 Last lecture, we began our introduction of linear regression Today we will introduce estimation and hypothesis testing for the (genetic) linear regression, which will provide a rigorous introduction to the statistical foundation of Genome-wide Association Study (GWAS) analysis Next lecture, we will begin our introduction to GWAS

4 Conceptual Overview Genetic System Does A1 -> A affect Y? Reject / DNR Measured individuals (genotype, phenotype) Regression model Sample or experimental pop Model params F-test Pr(Y X)

5 Review: genetic system Our goal in quantitative genetics / genomics is to identify loci (positions in the genome) that contain causal mutations / polymorphisms / alleles X causal mutation or polymorphism - a position in the genome where an experimental manipulation of the DNA produces an effect on the phenotype under specified X conditions Formally, we may represent this as follows: A 1! A ) Y Z Our experiment will be a statistical experiment (sample and inference!)

6 P r(t r(t (X ), P r(t (X H0 : ce S in this(x) case, isp therefore: ˆ X H0 : = c) or H0 : = c n P r(t (X), P r(t (X ), P r(t (X H : n = c) 0P (x Review: the statistical model S = {possible individuals} 1 = c) i µ) i=1 n L( x) = e n Pn (xi µ) 1 Pn 1 As with any statistical experiment, we need to begin by defining our sample space i=1 i=1 =c = (X ), P r(te (X H0 :, x) P r(t (x = c)s = {Sg, SP } L( x) = (3) e In the most general sense, our sample space is: could = { Possible Individuals ich we also write S = {Sg }\SnP P },n where the subset SP reflects tha (xi µ) 1 = cthe phenotype can = { Possible Individuals i=1 such }{ = P } or sick or ca discrete L( take x) =(e.g. e cases ues asg healthy =S{isg Pset } o Pmodel r(t (X ), P r(t (X H = c) and where the subset as continuous such 0as: height ) the specifically, each individual in our sample space can be quantified asg(4) More = { g P } g = {A1 A1, A1 A, A Aapair } samplefor outcomes so our sample space cantwo be written and notypes, of where a diploid individual allelesas:at a singleflocus n {g,p } we hav ssible Individuals } Pn (xi µ) 1 g = {A1 A1, A1 A, A A }= (5) ) g i=1 P r(f { } L( x) =S = {A epa A, A A } g {g,p } A, g space at a locus and Where is the genotype sample P is the phenotype g (6) P } P r{g, F{g,P } sample space ere we will use g to define the genotype of an individual (at a single locus), suc P r(y X) X) = P r(y ible Individuals } F{g,P (7) P r(f ) } = P r(y, P {g,p } genotype gi = Aj Ak. isevery the set individual of possible genotypes, where for a in the p ividual has that a genotype that could occur inote = {g P } diploid, with two alleles:for H r(y, X)in = the P r(yset )P S r 0 : P values refore has a Fsingle value their phenotype (that takes P r(f ) P r{g, P } {g,p } (8) {g,p } Pn (xi gle value for their genotype (which is an element of the set S ). g = {A A, A A, A A } 1 g i=1 n e r(x)p } P r{g, P r(f P r(y) X) = P r(y, X) = P r(y )P (9) () {g,p } LRT =sick= For the phenotype, this can be any type of measurement (e.g. or healthy,pn (xi g 1 H : P r(y, X) = P r(y )P r(x) 0 P r(y X) = P r(y, X) = P r(y )Pi=1 r(x) height, etc.) n e P r{g, P } (10) () Pn (xi H0 (µ)) P 4 1 X) = P r(y )P r(x) 0 : P r(y, n ehi=1 L( ˆ x) M

7 Review: the statistical model { P Next, we need to define the probability model on the sigma algebra of the sample space ( ): F { F } {g,p} Pr(F {g,p} F) Which defines the probability { { of } each possible genotype and phenotype pair: Pr{g, P} We will define two (types) or random variables (* = state does not matter): Y :(, P ) R X :( g, ) R Note that the probability model P induces a (joint) probability distribution on this random vector (these random variables): Pr(Y,X)

8 Review: looking ahead (the goal...) The goal of quantitative genomics and genetics is to identify cases of the following relationship where when performing the perfect genotype manipulation experiment { we \ have: } Remember that, regardless of the probability P distribution of our random vector, we can define the expectation: and the variance: Pr(Y \ X) =Pr(Y,X) 6= Pr(Y )Pr(X) Var[Y,X]= E[Y,X]=[EY,EX] apple Var(Y apple ) apple Cov(Y,X) Cov(Y,X) Var(X) The goal of quantitative genomics can be rephrased as assessing the following relationship in the perfect experimental framework (although in practice, we will do this by assessing the following relationship in an uncontrolled setting): Cov(Y,X) 6= 0

9 Linear regression We are going to consider a regression model a parameterized model to represent the probability model of X and Y (that is the true statistical model of genetics!!!): Y = 0 + X 1 + N(0, ) Note that in this model, we consider Y to be the dependent or response variable and X to be the independent variable (what are the parameters!?) Also note implicitly assumes: Pr(Y,X)=Pr(Y X) That is, that X is effectively fixed when considering an individual, although note that X still varies (has a probability)

10 Linear regression is a bivariate distribution We ve seen bivariate (multivariate) distributions before:

11 Linear regression I Let s review the structure of a linear regression (not necessarily a genetic model): Y = 0 + X 1 + N(0, ) Y X

12 Linear regression I

13 Linear regression II The linear regression model allows calculation of the (interval) probability of observations (!!) Y = 0 + X 1 + N(0, ) Y X

14 Linear regression III A multiple regression model has the same structure, with a single dependent variable Y and more than one independent variable Xi, Xj, e.g.,

15 The genetic probability model I The quantitative genetic model is a multiple regression model with the following independent ( dummy ) variables: X a (A 1 A 1 )= 1,X a (A 1 A )=0,X a (A A ) = 1 X d (A 1 A 1 )= 1,X d (A 1 A )=1,X d (A A )= 1 1 A 1 A 1 A 1 A 1 A A and the following multiple regression equation: Y = µ + X a a + X d d + N(0, )

16 The genetic probability model II The probability distribution of this model, is therefore: Which has four parameters: The three parameters are required to model the three separate genotypes (A1A1, A1A, AA) Pr(Y X) N( µ + X a a + X d d, µ, a, d, The can be thought of as a random variable that describes the probability an individual will have a specific value of Y, conditional on the genotype AiAj, where the probability is normally distributed around the value determined by the X s and s N(0, ) )

17 The genetic probability model III P Let s consider a specific example where we are interested modeling the relationship between a genotype and a phenotype (such as height) where the latter is well approximated by a normal distribution For this case, the (unknown) conditions of the experiment define the true values of the parameters (unknown to us!), which we will say are X the following (note these are the same for all individuals in the population since they are { parameters of the probability } distribution): Consider an individual i with gi = A1A such that we have: If this individual has a phenotype value yi =.1 then we have the epsilon value i =0.7where the probability of this particular value (i.e. the interval surrounding g i this = A value) j A k is defined by N(0, µ =0.3, a = 0., d =1.1, =1.1 X = X a (A1A) = 0,X d (A1A) = (0)( 0.) + (1)(1.1) )

18 6 6 The genetic probability model IV Note that, while somewhat arbitrary, the advantage of the Xa and Xd coding is the parameters a and d map directly on to relationships between the genotype and phenotype that are important in genetics: If a 6= 0, d =0 then this is a purely additive case If a =0, d 6= 0 then this is only over- or underdominance (homozygotes have equal effects on phenotype) If both are non-zero, there are both additive and dominance effects If both are zero, there is no effect of the genotype on the phenotype (the genotype is not causal!)

19 Genetic example 1 As an example, consider the following of a purely additive case (= no dominance): µ =, a =5, d =0, =1

20 Genetic example II An example of dominance (= not a pure additive case): µ =0, a =4, d = 1, =1

21 Genetic example III A case of NO genetic effect: µ =, a =0, d =0, =1

22 Quantitative genetic formalism For those of you who have been exposed to classic quantitative genetics, you have seen a different notation for this model: P = G + E P is the phenotypic value - the value of the aspect measured G is the genotypic value - the expected value of the phenotype conditional on the genotype E is the environmental value - the value of the phenotype that we cannot explain given the genotype These translate as follows for our one locus case (although note the formalism extends to any multiple locus case): Y = P G =EP =EY = µ + X a a + X d d = E

23 6 Genetic inference I For our model focusing on one locus: Y = µ + X a a + X d d + N(0, We have four possible parameters we could estimate: = µ, a, d, However, for our purposes, we are only interested in the genetic parameters and testing the following null 6 hypothesis: [ ) H 0 : Cov(X a,y)=0\ Cov(X d,y)=0 H A : Cov(X a,y) 6= 0[ Cov(X d,y) 6= 0 \ OR H 0 : a =0\ d =0 H A : a 6= 0[ d 6=0

24 Genetic inference II Recall that inference (whether estimation or hypothesis testing) starts by collecting a sample and defining a statistic on that sample In this case, we are going to collect a sample of n individuals where for each we will measure their phenotype and their genotype (i.e. at the locus we are focusing on) That is an individual i will have phenotype yi and genotype gi = AjAk (where we translate these into xa and xd) Using the phenotype and genotype we will construct both an estimator (a statistic!) and we will additionally construct a test statistic Remember that our regression probability model defines a sampling distribution on our sample and therefore on our estimator and test statistic (!!)

25 Genetic inference III For notation convenience, we are going to use vector / matrix notation to represent a sample: y i = µ + x i,a a + x i,d d + i y 1 y. y n y 1 y. y n = µ + x 1,a a + x 1,d d + 1 µ + x,a a + x,d d +. µ + x n,a a + x n,d d + n 1 x 1,a x 1,d = 1 x,a x,d..... µ a +. d 1 x n,a x n,d 1 n y = x +

26 Genetic estimation I We will define a MLE for our parameters: =[ µ, a, d] Recall that an MLE is simply a statistic (a function that takes a sample in and outputs a number that is our estimate) In this case, our statistic will be a vector valued function that takes T (y, x a, x d )= ˆ 4 5 =[ˆµ, in the vectors that represent our sample ˆa, ˆd] Note that we calculate an MLE for this case just as we would any case (we use the likelihood of the fixed sample where we identify the parameter values that maximize this function) In the linear regression case (just as with normal parameters) this has a closed form: 3 MLE( ˆ) =(x T x) 1 x T y

27 Genetic estimation II Let s look at the structure of this estimator: y 1 y. y n y = x 1 x 1,a x 1,d = 1 x,a x,d..... µ a +. d 1 x n,a x n,d MLE( ˆ) = 4 + ˆµ ˆa ˆd n MLE( ˆ) =(x T x) 1 x T y

28 That s it for today Next lecture: genetic linear regression hypothesis testing and introduction to GWAS!

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB.5201.01 Lecture16: Population structure and logistic regression I Jason Mezey jgm45@cornell.edu April 11, 2017 (T) 8:40-9:55 Announcements I April