Quantitative Genomics and Genetics BTRY 4830/6830; PBSB.5201.01 Lecture 16: Population structure and logistic regression I Jason Mezey jgm45@cornell.edu April 11, 2017 (T) 8:40-9:55
Announcements (schedule):
April 11: Genome-Wide Association Studies (GWAS) IV: logistic regression I (the model)
April 13 (Project Assigned): GWAS V: logistic regression II (IRLS algorithm and GLMs)
April 18: GWAS X: haplotype testing, alternative tests, and minimum GWAS analysis
April 20: Advanced topics I: mixed models
April 25: Advanced topics II: multiple regression (epistasis) and multivariate regression
April 27 (MAPPING LOCI: BAYESIAN ANALYSIS): Bayesian inference I: inference basics / linear models
May 2: Bayesian inference II: MCMC algorithms
May 4 (PEDIGREE / INBRED LINE ANALYSIS / CLASSIC QUANTITATIVE GENETICS): basics of linkage analysis / inbred line analysis
May 9 (Project Due): heritability and additive genetic variance
Announcements
Midterm will be available next week. No more homeworks (!!) - just a project and final (and computer labs). Your PROJECT will be assigned on Thurs.! I will have office hours today: in Ithaca, same location as always; in NY, go to the SMALL Genetic Med Conference Room.
Conceptual Overview
Does A1 -> A2 affect Y? Measure individuals (genotype, phenotype) from a sample or experimental population, fit a regression model for Pr(Y|X), estimate the model parameters, and apply an F-test to reject / do not reject the null hypothesis.
Review: modeling covariates I
If we have a factor that is correlated with our phenotype and we do not handle it in some manner in our analysis, we risk producing false positives AND/OR reducing the power of our tests! The good news is that, assuming we have measured the factor (i.e. it is part of our GWAS dataset), we can incorporate the factor in our model as a covariate(s):

Y = β_μ + X_a β_a + X_d β_d + X_z,1 β_z,1 + X_z,2 β_z,2 + ε

The effect of this is that we will estimate the covariate model parameters, and this will account for the correlation of the factor with phenotype (such that we can test for our marker correlation without false positives / lower power!)
Review: modeling covariates II
How do we perform inference with a covariate in our linear regression model? We perform MLE the same way (!!): our X matrix now simply includes extra columns, one for each of the additional covariates, where for the linear regression we have:

MLE(β̂) = (x^T x)^(-1) x^T y

We perform hypothesis testing the same way (!!) with a slight difference: our LRT includes the covariate in both the null hypothesis and the alternative, but we are testing the same null hypothesis:

H_0: β_a = 0 ∩ β_d = 0
H_A: β_a ≠ 0 ∪ β_d ≠ 0
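As an illustration of the point that a covariate only adds a column to the X matrix, here is a minimal sketch in Python / NumPy. All data, codings, and "true" parameter values below are simulated assumptions, not course data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Simulated (hypothetical) genotype codings and one binary covariate
xa = rng.choice([-1.0, 0.0, 1.0], size=n)   # additive coding X_a
xd = 1.0 - 2.0 * np.abs(xa)                 # dominance coding X_d
xz = rng.choice([0.0, 1.0], size=n)         # covariate X_z, e.g. male / female

# Assumed true parameters for the simulation: beta_mu, beta_a, beta_d, beta_z
y = 1.0 + 0.5 * xa + 0.2 * xd + 0.8 * xz + rng.normal(0.0, 1.0, n)

# The X matrix simply gains one extra column for the covariate
X = np.column_stack([np.ones(n), xa, xd, xz])

# MLE(beta-hat) = (x^T x)^{-1} x^T y -- identical formula with or without covariates
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

With n = 200 the four estimates land close to the simulated true values, which is the point of the slide: the MLE machinery is unchanged.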
Modeling covariates III
First, determine the predicted value of the phenotype of each individual under the null hypothesis (how do we set up x?):

ŷ_i,β̂_0 = β̂_μ + Σ_{j=1}^{k} x_i,z,j β̂_z,j

Second, determine the predicted value of the phenotype of each individual under the alternative hypothesis (how do we set up x?):

ŷ_i,β̂_1 = β̂_μ + x_i,a β̂_a + x_i,d β̂_d + Σ_{j=1}^{k} x_i,z,j β̂_z,j

Third, calculate the Error Sum of Squares for each:

SSE(β̂_0) = Σ_{i=1}^{n} (y_i − ŷ_i,β̂_0)^2
SSE(β̂_1) = Σ_{i=1}^{n} (y_i − ŷ_i,β̂_1)^2

Finally, we calculate the F-statistic with degrees of freedom [2, n-3] (why two degrees of freedom?):

F_[2,n-3](y, x) = [(SSE(β̂_0) − SSE(β̂_1)) / 2] / [SSE(β̂_1) / (n-3)]
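The four steps above can be sketched end to end. This is a hypothetical simulation (one covariate, k = 1, invented effect sizes), and it uses the slide's [2, n-3] degrees of freedom with the covariate included in both the null and alternative models:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Hypothetical simulated data where the genotype truly affects the phenotype
xa = rng.choice([-1.0, 0.0, 1.0], size=n)
xd = 1.0 - 2.0 * np.abs(xa)
xz = rng.choice([0.0, 1.0], size=n)          # one covariate (k = 1)
y = 1.0 + 0.6 * xa + 0.9 * xz + rng.normal(0.0, 1.0, n)

def sse(X, y):
    """Fit by least squares and return the Error Sum of Squares."""
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta_hat
    return resid @ resid

# Null model: intercept + covariate only; alternative adds X_a and X_d
X0 = np.column_stack([np.ones(n), xz])
X1 = np.column_stack([np.ones(n), xa, xd, xz])
sse0, sse1 = sse(X0, y), sse(X1, y)

# F-statistic with [2, n-3] degrees of freedom (2 genotype parameters tested)
F = ((sse0 - sse1) / 2.0) / (sse1 / (n - 3))
```

Because the covariate appears in both models, the F-statistic isolates the genotype effect; here the simulated effect is large, so F is far out in the tail.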
Modeling covariates IV
Say you have GWAS data (a phenotype and genotypes) and your GWAS data also includes information on a number of covariates, e.g. male / female, several different ancestral groups (different populations!!), other risk factors, etc. First, you need to figure out how to code the X_z in each case for each of these, which may be simple (male / female) but more complex for others (where how to code them involves fuzzy rules, i.e. it depends on your context!!). Second, you will need to figure out which to include in your analysis (again, fuzzy rules!), but a good rule is: if the parameter estimate associated with the covariate is large (= significant individual p-value), you should include it! There are many ways to figure out how to include covariates (again, a topic in itself!!)
Review: population structure Population structure or stratification is a case where a sample includes groups of people that fit into two or more different ancestry groups (fuzzy def!) Population structure is often a major issue in GWAS where it can cause lots of false positives if it is not accounted for in your model Intuitively, you can model population structure as a covariate if you know: How many populations are represented in your sample Which individual in your sample belongs to which population QQ plots are good for determining whether there may be population structure Clustering techniques are good for detecting population structure and determining which individual is in which population (=ancestry group)
Origin of population structure
People geographically separate through migration and then the set of alleles present in the population evolves (=changes) over time. [Figure credit: Sarver World Cultures]
Principal Component Analysis (PCA) of population structure [Figure credit: Nature Publishing]
Learning unmeasured population factors
To learn a population factor, analyze the genotype data:

Data = [ z_11 ... z_1k   y_11 ... y_1m   x_11 ... x_1N ]
       [  ...             ...             ...          ]
       [ z_n1 ... z_nk   y_n1 ... y_nm   x_n1 ... x_nN ]

Apply a Principal Component Analysis (PCA) where the axes (features) in this case are individuals and each point is a (scaled) genotype. What we are interested in are the projections (loadings) of the individual PCs on each of the individual axes (dotted arrows), where for each PC this will produce n values (i.e. one value for each sample) of a new independent (covariate) variable X_z:

Y = β_μ + X_a β_a + X_d β_d + X_z,1 β_z,1 + X_z,2 β_z,2 + ε
Applying a PCA population structure analysis (in practice)
Calculate the n x n (n = sample size) covariance matrix for the individuals in your sample across all genotypes. Apply a PCA to this covariance matrix; the output will be matrices containing eigenvalues and eigenvectors (= the Principal Components), where the size of the eigenvalue indicates the ordering of the Principal Component. Each Principal Component (PC) will be an n-element vector where each element is the loading of the PC on the individual axes, where these are the values of your independent variable coding (e.g., if you include the first PC as your first covariate, your coding will be X_z,1 = PC loadings). Note that you could also get the same answer by calculating an N x N (N = measured genotypes) covariance matrix, applying PCA, and taking the projections of each sample on the PCs (why might this be less optimal?)
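A minimal sketch of this procedure, assuming simulated genotypes from two hypothetical populations whose allele frequencies differ (the population labels and frequency ranges are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n, N = 100, 1000  # n = sample size, N = number of measured genotypes

# Two hypothetical populations whose allele frequencies differ per marker
pop = np.repeat([0, 1], n // 2)
freqs = np.where(pop[:, None] == 0,
                 rng.uniform(0.1, 0.5, N),   # pop 1 allele frequencies
                 rng.uniform(0.5, 0.9, N))   # pop 2 allele frequencies
G = rng.binomial(2, freqs)                   # genotypes coded 0 / 1 / 2

# Center each genotype, then form the n x n covariance across individuals
Gc = G - G.mean(axis=0)
C = Gc @ Gc.T / N

# Eigendecomposition: eigenvectors = the PCs, ordered by eigenvalue size
evals, evecs = np.linalg.eigh(C)
pc1 = evecs[:, np.argmax(evals)]   # n loadings -> covariate coding X_z,1
```

Because the between-population frequency differences dominate the covariance, the first PC's loadings cleanly separate the two simulated populations.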
Using the results of a PCA population structure analysis
Once you have detected the populations (e.g. by eye in a PCA = fuzzy!) in your GWAS sample, set your independent variables equal to the loadings for each individual, e.g., for two pop covariates, set X_z,1 = Z1, X_z,2 = Z2. You could also determine which individual is in which pop and define random variables for pop assignment, e.g. for two populations include a single covariate by setting X_z,1(pop1) = 1, X_z,1(pop2) = 0 (generally less optimal but can be used!). Use one of these approaches to model a covariate in your analysis, i.e. for every genotype marker that you test in your GWAS:

Y = β_μ + X_a β_a + X_d β_d + X_z,1 β_z,1 + X_z,2 β_z,2 + ε

The goal is to produce a good QQ plot (what if it does not?)
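A small sketch contrasting the two covariate codings described above (the loading values and the population-assignment rule here are made up for illustration):

```python
import numpy as np

# Hypothetical loadings for six individuals on the first two PCs
pc1 = np.array([-0.40, -0.50, -0.45, 0.40, 0.50, 0.45])
pc2 = np.array([ 0.10, -0.10,  0.05, 0.20, -0.20, 0.00])
xa  = np.array([-1.0, 0.0, 1.0, -1.0, 0.0, 1.0])
xd  = 1.0 - 2.0 * np.abs(xa)
n = len(xa)

# Coding 1: loadings as covariates, X_z,1 = Z1 and X_z,2 = Z2
X_loadings = np.column_stack([np.ones(n), xa, xd, pc1, pc2])

# Coding 2: population indicator, X_z,1(pop1) = 1, X_z,1(pop2) = 0
pop1 = (pc1 < 0).astype(float)   # e.g. assign pops by the sign of PC1
X_indicator = np.column_stack([np.ones(n), xa, xd, pop1])
```

The loadings coding keeps a continuous measure of ancestry per individual, while the indicator coding collapses it to hard assignments, which is why the slide calls the latter generally less optimal.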
Before (top) and after including a population covariate (bottom)
Review: linear regression
So far, we have considered that a linear regression is a reasonable model for the relationship between genotype and phenotype (where this implicitly assumes a normal error provides a reasonable approximation of the phenotype distribution given the genotype):

Y = β_μ + X_a β_a + X_d β_d + ε, where ε ~ N(0, σ_ε^2)
Case / Control Phenotypes I
While a linear regression may provide a reasonable model for many phenotypes, we are commonly interested in analyzing phenotypes where this is NOT a good model. As an example, we are often in situations where we are interested in identifying causal polymorphisms (loci) that contribute to the risk of developing a disease, e.g. heart disease, diabetes, etc. In this case, the phenotype we are measuring is often "has disease" or "does not have disease", or more precisely "case" or "control". Recall that such phenotypes are properties of measured individuals and therefore elements of a sample space, such that we can define a random variable such as Y(case) = 1 and Y(control) = 0.
Case / Control Phenotypes II
Let's contrast data we might model with a linear regression model versus case / control data:
Logistic regression I
Instead, we're going to consider a logistic regression model.
Logistic regression II
It may not be immediately obvious why we choose a regression line function of this shape. The reason is mathematical convenience, i.e. this function can be considered (along with linear regression) within a broader class of models called Generalized Linear Models (GLMs), which we will discuss next lecture. However, beyond a few differences (the error term and the regression function), we will see that the structure and our approach to inference is the same with this model.
Logistic regression III
To begin, let's consider the structure of a regression model. We code the X's the same (!!), although a major difference here is the logistic function, as yet undefined:

Y = logistic(β_μ + X_a β_a + X_d β_d) + ε

However, the expected value of Y has the same structure as we have seen before in a regression:

E(Y_i | X_i) = logistic(β_μ + X_i,a β_a + X_i,d β_d)

We can similarly write for a population using matrix notation (where the X matrix has the same form as we have been considering!):

E(Y | X) = logistic(xβ)

In fact the two major differences are in the form of the error and the logistic function.
Logistic regression: error term I
Recall that for a linear regression, the error term accounted for the difference between each point and the expected value (the linear regression line), which we assume follows a normal distribution. For a logistic regression, the error plays the same role, but its value now has to make up the difference between the regression line and either 0 or 1 (what distribution is this?).
Logistic regression: error term II
For the error on an individual i, we therefore have to construct an error that takes one of two values, depending on whether Y_i is 0 or 1, given the expected value for the genotype:

For Y_i = 0: ε_i = −E(Y_i | X_i) = −logistic(β_μ + X_i,a β_a + X_i,d β_d)
For Y_i = 1: ε_i = 1 − E(Y_i | X_i) = 1 − logistic(β_μ + X_i,a β_a + X_i,d β_d)

For a distribution that takes two such values, a reasonable distribution is therefore the Bernoulli distribution, giving the error the following form:

ε_i = Z − E(Y_i | X_i), where Pr(Z) ~ bern(p) and p = logistic(β_μ + X_a β_a + X_d β_d)
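This shifted-Bernoulli construction can be checked by simulation. A sketch assuming the parameter values used in the worked examples that follow (β_μ = 0.2, β_a = 2.2, β_d = 0.2):

```python
import numpy as np

def logistic(x):
    """logistic(x) = e^x / (1 + e^x)"""
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(3)

# Assumed parameter values (those used in the worked examples)
beta_mu, beta_a, beta_d = 0.2, 2.2, 0.2
xa, xd = -1.0, -1.0   # genotype A1A1

p = logistic(beta_mu + xa * beta_a + xd * beta_d)  # E(Y|X) = p
Z = rng.binomial(1, p, size=100_000)               # Z ~ bern(p)
eps = Z - p                                        # error = Z - E(Y|X)
```

The simulated errors take exactly the two values −E(Y|X) and 1 − E(Y|X), and average to zero, as a regression error term should.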
Logistic regression: error term III
This may look complicated at first glance but the intuition is relatively simple. If the logistic regression line is near zero, the probability distribution of the error term is set up to make the probability of Y being zero greater than the probability of Y being one (and vice versa for the regression line near one!):

ε_i = Z − E(Y_i | X_i), where Pr(Z) ~ bern(p) and p = logistic(β_μ + X_a β_a + X_d β_d)
Logistic regression: link function I
Next, we have to consider the function for the regression line of a logistic regression (remember, below we are plotting just versus X_a, but this really is a plot versus X_a AND X_d!!):

E(Y_i | X_i) = logistic(β_μ + X_i,a β_a + X_i,d β_d)
E(Y_i | X_i) = e^(β_μ + X_i,a β_a + X_i,d β_d) / (1 + e^(β_μ + X_i,a β_a + X_i,d β_d))
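A quick sketch of the link function evaluated at the three genotype codings, using the parameter values from the worked examples that follow (these are illustrative "true" values, not estimates):

```python
import numpy as np

def logistic(x):
    """The logistic link: e^x / (1 + e^x)."""
    return np.exp(x) / (1.0 + np.exp(x))

# Illustrative parameter values (the "true" values in the worked examples)
beta_mu, beta_a, beta_d = 0.2, 2.2, 0.2

# The regression line depends on both X_a and X_d
for geno, xa, xd in [("A1A1", -1, -1), ("A1A2", 0, 1), ("A2A2", 1, -1)]:
    e_y = logistic(beta_mu + xa * beta_a + xd * beta_d)
    print(geno, round(e_y, 2))   # prints 0.1, 0.6, 0.9 respectively
```

These three values (0.1, 0.6, 0.9) are exactly the expected values that appear in the worked examples below.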
Calculating the components of an individual II
For example, say we have an individual i that has genotype A1A1 and phenotype Y_i = 0. We know X_a = -1 and X_d = -1. Say we also know that for the population, the true parameters (which we will not know in practice! We need to infer them!) are: β_μ = 0.2, β_a = 2.2, β_d = 0.2. We can then calculate E(Y_i | X_i) and the error term for i:

Y_i = e^(β_μ + x_i,a β_a + x_i,d β_d) / (1 + e^(β_μ + x_i,a β_a + x_i,d β_d)) + ε_i
0 = e^(0.2 + (-1)2.2 + (-1)0.2) / (1 + e^(0.2 + (-1)2.2 + (-1)0.2)) + ε_i
0 = 0.1 − 0.1
Calculating the components of an individual III
For example, say we have an individual i that has genotype A1A1 and phenotype Y_i = 1. We know X_a = -1 and X_d = -1. Say we also know that for the population, the true parameters (which we will not know in practice! We need to infer them!) are: β_μ = 0.2, β_a = 2.2, β_d = 0.2. We can then calculate E(Y_i | X_i) and the error term for i:

Y_i = e^(β_μ + x_i,a β_a + x_i,d β_d) / (1 + e^(β_μ + x_i,a β_a + x_i,d β_d)) + ε_i
1 = e^(0.2 + (-1)2.2 + (-1)0.2) / (1 + e^(0.2 + (-1)2.2 + (-1)0.2)) + ε_i
1 = 0.1 + 0.9
Calculating the components of an individual IV
For example, say we have an individual i that has genotype A1A2 and phenotype Y_i = 0. We know X_a = 0 and X_d = 1. Say we also know that for the population, the true parameters (which we will not know in practice! We need to infer them!) are: β_μ = 0.2, β_a = 2.2, β_d = 0.2. We can then calculate E(Y_i | X_i) and the error term for i:

Y_i = e^(β_μ + x_i,a β_a + x_i,d β_d) / (1 + e^(β_μ + x_i,a β_a + x_i,d β_d)) + ε_i
0 = e^(0.2 + (0)2.2 + (1)0.2) / (1 + e^(0.2 + (0)2.2 + (1)0.2)) + ε_i
0 = 0.6 − 0.6
Calculating the components of an individual V
For example, say we have an individual i that has genotype A2A2 and phenotype Y_i = 0. We know X_a = 1 and X_d = -1. Say we also know that for the population, the true parameters (which we will not know in practice! We need to infer them!) are: β_μ = 0.2, β_a = 2.2, β_d = 0.2. We can then calculate E(Y_i | X_i) and the error term for i:

Y_i = e^(β_μ + x_i,a β_a + x_i,d β_d) / (1 + e^(β_μ + x_i,a β_a + x_i,d β_d)) + ε_i
0 = e^(0.2 + (1)2.2 + (-1)0.2) / (1 + e^(0.2 + (1)2.2 + (-1)0.2)) + ε_i
0 = 0.9 − 0.9
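The four worked examples above can be reproduced in a few lines; a sketch assuming the same true parameters:

```python
import math

def logistic(x):
    """logistic(x) = e^x / (1 + e^x)"""
    return math.exp(x) / (1.0 + math.exp(x))

# Assumed true parameters from the worked examples
beta_mu, beta_a, beta_d = 0.2, 2.2, 0.2

def error_term(y_i, xa, xd):
    """epsilon_i = Y_i - E(Y_i | X_i) for one individual."""
    return y_i - logistic(beta_mu + xa * beta_a + xd * beta_d)

# The four worked cases: (Y_i, X_a, X_d)
cases = [(0, -1, -1), (1, -1, -1), (0, 0, 1), (0, 1, -1)]
eps = [round(error_term(*c), 1) for c in cases]   # [-0.1, 0.9, -0.6, -0.9]
```

The rounded error terms match the slides: -0.1, 0.9, -0.6, and -0.9.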
For the entire probability distributions I
Recall that the error term is either the negative of E(Y_i | X_i) when Y_i is zero, or 1 − E(Y_i | X_i) when Y_i is one:

ε_i(Y_i = 0) = −E(Y_i | X_i)
ε_i(Y_i = 1) = 1 − E(Y_i | X_i)

For the entire distribution of the population, recall that:

Pr(ε_i) ~ bern(p) − E(Y | X), with p = E(Y | X)

For example: ε_i = −0.1 or ε_i = 0.9, with p = 0.1
For the entire probability distributions II
Recall that the error term is either the negative of E(Y_i | X_i) when Y_i is zero, or 1 − E(Y_i | X_i) when Y_i is one:

ε_i(Y_i = 0) = −E(Y_i | X_i)
ε_i(Y_i = 1) = 1 − E(Y_i | X_i)

For the entire distribution of the population, recall that:

Pr(ε_i) ~ bern(p) − E(Y | X), with p = E(Y | X)

For example: ε_i = −0.6 or ε_i = 0.4, with p = 0.6
For the entire probability distributions III
Recall that the error term is either the negative of E(Y_i | X_i) when Y_i is zero, or 1 − E(Y_i | X_i) when Y_i is one:

ε_i(Y_i = 0) = −E(Y_i | X_i)
ε_i(Y_i = 1) = 1 − E(Y_i | X_i)

For the entire distribution of the population, recall that:

Pr(ε_i) ~ bern(p) − E(Y | X), with p = E(Y | X)

For example: ε_i = −0.9 or ε_i = 0.1, with p = 0.9
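A quick arithmetic check that the shifted Bernoulli error has mean zero for each of the three example values of p:

```python
# For each example p, the error is -p with probability 1 - p (when Y = 0)
# and 1 - p with probability p (when Y = 1), so its expectation is zero.
for p in [0.1, 0.6, 0.9]:
    mean_eps = (1.0 - p) * (-p) + p * (1.0 - p)
    assert abs(mean_eps) < 1e-12
```

This is the logistic-regression analogue of the linear-regression assumption E(ε) = 0.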
See you on Thurs.! That's it for today.