Quantitative Genomics and Genetics BTRY 4830/6830; PBSB.5201.01. Lecture 16: Population structure and logistic regression I. Jason Mezey, jgm45@cornell.edu, April 11, 2017 (T) 8:40-9:55

Announcements I (remaining schedule):
April 11: Genome-Wide Association Studies (GWAS) IV: logistic regression I (the model)
April 13 (Project Assigned): GWAS V: logistic regression II (IRLS algorithm and GLMs)
April 18: GWAS VI: Haplotype testing, alternative tests, and minimum GWAS analysis
April 20: Advanced topics I: Mixed Models
April 25: Advanced topics II: Multiple regression (epistasis) and multivariate regression
April 27: MAPPING LOCI: BAYESIAN ANALYSIS. Bayesian inference I: inference basics / linear models
May 2: Bayesian inference II: MCMC algorithms
May 4: PEDIGREE / INBRED LINE ANALYSIS / CLASSIC QUANTITATIVE GENETICS. Basics of linkage analysis / Inbred line analysis
May 9 (Project Due): Heritability and additive genetic variance

Announcements. The midterm will be available next week. No more homeworks (!!) - just a project and final (and computer labs). Your PROJECT will be assigned on Thurs.! I will have office hours today: in Ithaca, same location as always; in NY, go to the SMALL Genetic Med Conference Room.

Conceptual Overview (diagram): Does A1 -> A2 affect Y? A genetic system is observed in measured individuals (genotype, phenotype) from a sample or experimental population; a regression model for Pr(Y|X) is fit (model params), and an F-test leads to a Reject / DNR decision.

Review: modeling covariates I. If we have a factor that is correlated with our phenotype and we do not handle it in some manner in our analysis, we risk producing false positives AND/OR reducing the power of our tests! The good news is that, assuming we have measured the factor (i.e. it is part of our GWAS dataset), we can incorporate the factor into our model as a covariate(s):

$Y = \beta_\mu + X_a\beta_a + X_d\beta_d + X_{z,1}\beta_{z,1} + X_{z,2}\beta_{z,2} + \epsilon$

The effect of this is that we will estimate the covariate model parameters, and these will account for the correlation of the factor with phenotype (such that we can test for our marker correlation without false positives / lower power!)

Review: modeling covariates II. How do we perform inference with a covariate in our linear regression model? We perform MLE the same way (!!): our x matrix now simply includes extra columns, one for each of the additional covariates, where for the linear regression we have:

$MLE(\hat{\beta}) = (\mathbf{x}^T\mathbf{x})^{-1}\mathbf{x}^T\mathbf{y}$

We perform hypothesis testing the same way (!!) with a slight difference: our LRT includes the covariate in both the null hypothesis and the alternative, but we are testing the same null hypothesis:

$H_0: \beta_a = 0 \cap \beta_d = 0$
$H_A: \beta_a \neq 0 \cup \beta_d \neq 0$
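As a concrete illustration, here is a minimal numpy sketch of this estimator with one extra covariate column. All names and parameter values are hypothetical simulated data, and the codings follow the slides' convention (Xa in {-1, 0, 1}; Xd = 1 for heterozygotes, -1 otherwise):

```python
import numpy as np

# Hypothetical simulated data: n individuals, genotype codings Xa, Xd,
# plus one measured covariate (e.g. sex coded 0/1).
rng = np.random.default_rng(1)
n = 200
xa = rng.choice([-1, 0, 1], size=n)      # additive coding
xd = 1 - 2 * np.abs(xa)                  # dominance coding: A1A2 -> 1, else -1
xz = rng.choice([0, 1], size=n)          # covariate, e.g. sex

# Design matrix: intercept, Xa, Xd, plus one column per covariate
X = np.column_stack([np.ones(n), xa, xd, xz])
y = 1.0 + 0.5 * xa + 0.2 * xd + 0.8 * xz + rng.normal(0.0, 1.0, n)

# MLE(beta-hat) = (x^T x)^{-1} x^T y -- same formula, just extra columns
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # estimates of beta_mu, beta_a, beta_d, beta_z
```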

Modeling covariates III. First, determine the predicted value of the phenotype of each individual under the null hypothesis (how do we set up x?):

$\hat{y}_{i,\hat{\theta}_0} = \hat{\beta}_\mu + \sum_j x_{i,z,j}\hat{\beta}_{z,j}$

Second, determine the predicted value of the phenotype of each individual under the alternative hypothesis (set up x?):

$\hat{y}_{i,\hat{\theta}_1} = \hat{\beta}_\mu + x_{i,a}\hat{\beta}_a + x_{i,d}\hat{\beta}_d + \sum_j x_{i,z,j}\hat{\beta}_{z,j}$

Third, calculate the Error Sum of Squares for each:

$SSE(\hat{\theta}_0) = \sum_{i=1}^{n}(y_i - \hat{y}_{i,\hat{\theta}_0})^2 \quad\quad SSE(\hat{\theta}_1) = \sum_{i=1}^{n}(y_i - \hat{y}_{i,\hat{\theta}_1})^2$

Finally, we calculate the F-statistic with degrees of freedom [2, n-3] (why two degrees of freedom?):

$F_{[2,n-3]}(\mathbf{y},\mathbf{x}) = \frac{(SSE(\hat{\theta}_0) - SSE(\hat{\theta}_1))/2}{SSE(\hat{\theta}_1)/(n-3)}$
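To make the recipe concrete, here is a hedged sketch of the full test as a function. Note the slide quotes degrees of freedom [2, n-3]; with k covariate columns in both models, the denominator degrees of freedom generalize to n-3-k, which is my assumption here and is flagged in a comment:

```python
import numpy as np
from scipy import stats

def covariate_f_test(y, xa, xd, Z):
    """F-test of beta_a = beta_d = 0 with covariates Z in BOTH models.

    A sketch of the slide's recipe: fit the null model (intercept +
    covariates) and the alternative (adds Xa, Xd), then compare SSEs.
    Z is an n x k matrix of covariate codings.
    """
    n = len(y)
    X0 = np.column_stack([np.ones(n), Z])          # null design
    X1 = np.column_stack([np.ones(n), xa, xd, Z])  # alternative design

    # Predicted phenotype of each individual under each hypothesis
    yhat0 = X0 @ np.linalg.solve(X0.T @ X0, X0.T @ y)
    yhat1 = X1 @ np.linalg.solve(X1.T @ X1, X1.T @ y)

    # Error sums of squares
    sse0 = np.sum((y - yhat0) ** 2)
    sse1 = np.sum((y - yhat1) ** 2)

    # 2 numerator df (the two genotype parameters tested); the slide's
    # n-3 denominator df becomes n-3-k with k covariates (assumption)
    df_denom = n - 3 - Z.shape[1]
    f = ((sse0 - sse1) / 2) / (sse1 / df_denom)
    return f, stats.f.sf(f, 2, df_denom)
```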

Modeling covariates IV. Say you have GWAS data (a phenotype and genotypes) and your GWAS data also includes information on a number of covariates, e.g. male / female, several different ancestral groups (different populations!!), other risk factors, etc. First, you need to figure out how to code $X_z$ in each case, which may be simple (male / female) but more complex for others (where how to code them involves fuzzy rules, i.e. it depends on your context!!). Second, you will need to figure out which covariates to include in your analysis (again, fuzzy rules!), but a good rule is: if the parameter estimate associated with the covariate is large (= significant individual p-value), you should include it! There are many ways to figure out how to include covariates (again, a topic in itself!!)

Review: population structure. Population structure or stratification is a case where a sample includes groups of people that fit into two or more different ancestry groups (fuzzy def!). Population structure is often a major issue in GWAS, where it can cause lots of false positives if it is not accounted for in your model. Intuitively, you can model population structure as a covariate if you know how many populations are represented in your sample and which individual in your sample belongs to which population. QQ plots are good for determining whether there may be population structure; clustering techniques are good for detecting population structure and determining which individual is in which population (= ancestry group).

Origin of population structure (figure credit: Sarver World Cultures). People geographically separate through migration, and then the set of alleles present in the population evolves (= changes) over time.

Principal Component Analysis (PCA) of population structure (figure credit: Nature Publishing).

Learning unmeasured population factors. To learn a population factor, analyze the genotype data:

$Data = \begin{bmatrix} z_{11} & \dots & z_{1k} & y_{11} & \dots & y_{1m} & x_{11} & \dots & x_{1N} \\ \vdots & & \vdots & \vdots & & \vdots & \vdots & & \vdots \\ z_{n1} & \dots & z_{nk} & y_{n1} & \dots & y_{nm} & x_{n1} & \dots & x_{nN} \end{bmatrix}$

Apply a Principal Component Analysis (PCA) where the axes (features) in this case are individuals and each point is a (scaled) genotype (figure: points plotted on axes $Z_{i,1}$ and $Z_{i,2}$, with PCs as dotted arrows). What we are interested in are the projections (loadings) of the individual PCs on each of the individual axes, where for each PC this will produce n values (one for each sample) of a new independent (covariate) variable $X_z$:

$Y = \beta_\mu + X_a\beta_a + X_d\beta_d + X_{z,1}\beta_{z,1} + X_{z,2}\beta_{z,2} + \epsilon$

Applying a PCA population structure analysis (in practice). Calculate the n x n (n = sample size) covariance matrix for the individuals in your sample across all genotypes. Apply a PCA to this covariance matrix; the output will be matrices containing eigenvalues and eigenvectors (= the Principal Components), where the size of the eigenvalue indicates the ordering of the Principal Components. Each Principal Component (PC) will be an n element vector where each element is the loading of the PC on an individual axis, and these are the values of your independent variable coding (e.g., if you include the first PC as your first covariate, your coding will be $X_{z,1}$ = PC loadings). Note that you could also get the same answer by calculating an N x N (N = measured genotypes) covariance matrix, applying PCA, and taking the projections of each sample on the PCs (why might this be less optimal?)
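A minimal numpy sketch of this recipe (illustrative only; real GWAS pipelines also standardize markers and handle missing genotypes):

```python
import numpy as np

def pca_loadings(X_geno, n_pcs=2):
    """PC loadings from the n x n individual covariance matrix.

    X_geno: n x N matrix of genotype codings (rows = individuals,
    columns = markers). Returns an n x n_pcs matrix whose columns are
    the top PCs; each column is one covariate coding (e.g. X_z1).
    """
    # Center each marker before computing the covariance
    Xc = X_geno - X_geno.mean(axis=0)

    # n x n covariance of individuals across all genotypes
    A = Xc @ Xc.T / (X_geno.shape[1] - 1)

    # Eigendecomposition: eigenvalue size orders the PCs
    evals, evecs = np.linalg.eigh(A)   # eigh returns ascending order
    order = np.argsort(evals)[::-1]    # largest eigenvalue first
    return evecs[:, order[:n_pcs]]
```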

Using the results of a PCA population structure analysis. Once you have detected the populations (e.g. by eye in a PCA = fuzzy!) in your GWAS sample, set your independent variables equal to the loadings for each individual, e.g., for two population covariates, set $X_{z,1} = Z_1$, $X_{z,2} = Z_2$. You could also determine which individual is in which population and define random variables for population assignment, e.g. for two populations include a single covariate by setting $X_{z,1}(pop1) = 1$, $X_{z,1}(pop2) = 0$ (generally less optimal but can be used!). Use one of these approaches to model a covariate in your analysis, i.e. for every genotype marker that you test in your GWAS (see the sketch below):

$Y = \beta_\mu + X_a\beta_a + X_d\beta_d + X_{z,1}\beta_{z,1} + X_{z,2}\beta_{z,2} + \epsilon$

The goal is to produce a good QQ plot (what if it does not?)
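Putting the two coding options side by side, a hedged sketch (function and variable names are illustrative, not a fixed API):

```python
import numpy as np

def design_with_pcs(xa, xd, pc1, pc2):
    """Option 1: PC loadings as covariate codings, X_z1 = Z1, X_z2 = Z2."""
    n = len(xa)
    return np.column_stack([np.ones(n), xa, xd, pc1, pc2])

def design_with_labels(xa, xd, in_pop1):
    """Option 2 (generally less optimal): hard population assignment,
    X_z1(pop1) = 1, X_z1(pop2) = 0, as a single indicator covariate."""
    n = len(xa)
    return np.column_stack([np.ones(n), xa, xd, in_pop1.astype(float)])
```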

Before (top) and after including a population covariate (bottom)

Review: linear regression. So far, we have considered that a linear regression is a reasonable model for the relationship between genotype and phenotype (where this implicitly assumes a normal error provides a reasonable approximation of the phenotype distribution given the genotype):

$Y = \beta_\mu + X_a\beta_a + X_d\beta_d + \epsilon, \quad \epsilon \sim N(0, \sigma_\epsilon^2)$

Case / Control Phenotypes I. While a linear regression may provide a reasonable model for many phenotypes, we are commonly interested in analyzing phenotypes for which this is NOT a good model. As an example, we are often in situations where we are interested in identifying causal polymorphisms (loci) that contribute to the risk of developing a disease, e.g. heart disease, diabetes, etc. In this case, the phenotype we are measuring is often "has disease" or "does not have disease", or more precisely "case" or "control". Recall that such phenotypes are properties of measured individuals and therefore elements of a sample space, such that we can define a random variable such as Y(case) = 1 and Y(control) = 0.

Case / Control Phenotypes II. Let's contrast the data we might model with a linear regression model versus case / control data:


Logistic regression I. Instead, we're going to consider a logistic regression model.

Logistic regression II. It may not be immediately obvious why we choose a regression line function of this shape. The reason is mathematical convenience, i.e. this function can be considered (along with linear regression) within a broader class of models called Generalized Linear Models (GLMs), which we will discuss next lecture. However, beyond a few differences (the error term and the regression function), we will see that the structure and our approach to inference are the same with this model.

Logistic regression III. To begin, let's consider the structure of this regression model. We code the X's the same (!!), although a major difference here is the logistic function, as yet undefined:

$Y = logistic(\beta_\mu + X_a\beta_a + X_d\beta_d) + \epsilon_l$

However, the expected value of Y has the same structure as we have seen before in a regression:

$E(Y_i|X_i) = logistic(\beta_\mu + X_{i,a}\beta_a + X_{i,d}\beta_d)$

We can similarly write this for a population using matrix notation (where the X matrix has the same form as we have been considering!):

$E(\mathbf{Y}|\mathbf{X}) = logistic(\mathbf{X}\beta)$

In fact, the two major differences are in the form of the error and the logistic function.

Logistic regression: error term I. Recall that for a linear regression, the error term accounted for the difference between each point and the expected value (the linear regression line), which we assume follows a normal distribution. For a logistic regression we have the same setup, but the error now has to make up the difference between the regression line and a value of either 0 or 1 (what distribution is this?). (Figure: phenotype Y plotted against $X_a$ for the linear and logistic cases.)

Logistic regression: error term II. For the error on an individual i, we therefore have to construct an error that takes one of two values, depending on the value of $Y_i$ and the expected value given the genotype:

For $Y_i = 0$: $\epsilon_i = -E(Y_i|X_i) = -E(Y|A_iA_j) = -logistic(\beta_\mu + X_{i,a}\beta_a + X_{i,d}\beta_d)$

For $Y_i = 1$: $\epsilon_i = 1 - E(Y_i|X_i) = 1 - E(Y|A_iA_j) = 1 - logistic(\beta_\mu + X_{i,a}\beta_a + X_{i,d}\beta_d)$

For a distribution that takes two such values, a reasonable choice is therefore the Bernoulli distribution, with the following parameter:

$\epsilon_i = Z - E(Y_i|X_i), \quad Pr(Z) \sim bern(p), \quad p = logistic(\beta_\mu + X_a\beta_a + X_d\beta_d)$
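A short simulation sketch of this error model (the parameter values match the worked examples on the following slides; everything else, including sample size and seed, is illustrative):

```python
import numpy as np

def logistic(v):
    """Logistic function, 1 / (1 + e^{-v})."""
    return 1.0 / (1.0 + np.exp(-v))

# Simulate case/control phenotypes under the slide's error model
rng = np.random.default_rng(2)
beta_mu, beta_a, beta_d = 0.2, 2.2, 0.2
xa = rng.choice([-1, 0, 1], size=1000)
xd = 1 - 2 * np.abs(xa)

p = logistic(beta_mu + xa * beta_a + xd * beta_d)  # E(Y|X)
z = rng.binomial(1, p)                             # Z ~ bern(p)
eps = z - p                                        # eps_i = Z - E(Y_i|X_i)
y = p + eps                                        # recovers the 0/1 phenotype (y == z)
```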

Logistic regression: error term III. This may look complicated at first glance, but the intuition is relatively simple: if the logistic regression line is near zero, the probability distribution of the error term is set up to make the probability of Y being zero greater than the probability of Y being one (and vice versa for the regression line near one!):

$\epsilon_i = Z - E(Y_i|X_i), \quad Pr(Z) \sim bern(p), \quad p = logistic(\beta_\mu + X_a\beta_a + X_d\beta_d)$

(Figure: logistic regression line, Y vs. $X_a$.)

Logistic regression: link function I. Next, we have to consider the function for the regression line of a logistic regression (remember, below we are plotting just versus $X_a$, but this really is a plot versus $X_a$ AND $X_d$!!):

$E(Y_i|X_i) = logistic(\beta_\mu + X_{i,a}\beta_a + X_{i,d}\beta_d)$

$E(Y_i|X_i) = \frac{e^{\beta_\mu + X_{i,a}\beta_a + X_{i,d}\beta_d}}{1 + e^{\beta_\mu + X_{i,a}\beta_a + X_{i,d}\beta_d}}$

(Figure: logistic regression line, Y vs. $X_a$.)
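Written directly from the slide's formula, a one-function sketch of the link (algebraically equivalent to 1 / (1 + e^{-eta})):

```python
import numpy as np

def logistic_link(beta_mu, beta_a, beta_d, xa, xd):
    """E(Y_i|X_i) = e^eta / (1 + e^eta), eta = beta_mu + xa*beta_a + xd*beta_d."""
    eta = beta_mu + np.asarray(xa) * beta_a + np.asarray(xd) * beta_d
    return np.exp(eta) / (1.0 + np.exp(eta))
```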

Calculating the components of an individual II. For example, say we have an individual i that has genotype A1A1 and phenotype $Y_i = 0$. We know $X_a = -1$ and $X_d = -1$. Say we also know that for the population, the true parameters (which we will not know in practice! We need to infer them!) are: $\beta_\mu = 0.2$, $\beta_a = 2.2$, $\beta_d = 0.2$. We can then calculate $E(Y_i|X_i)$ and the error term for i:

$Y_i = \frac{e^{\beta_\mu + x_{i,a}\beta_a + x_{i,d}\beta_d}}{1 + e^{\beta_\mu + x_{i,a}\beta_a + x_{i,d}\beta_d}} + \epsilon_i$

$0 = \frac{e^{0.2 + (-1)2.2 + (-1)0.2}}{1 + e^{0.2 + (-1)2.2 + (-1)0.2}} + \epsilon_i$

$0 = 0.1 - 0.1$

Calculating the components of an individual III. For example, say we have an individual i that has genotype A1A1 and phenotype $Y_i = 1$. We know $X_a = -1$ and $X_d = -1$. Say we also know that for the population, the true parameters (which we will not know in practice! We need to infer them!) are: $\beta_\mu = 0.2$, $\beta_a = 2.2$, $\beta_d = 0.2$. We can then calculate $E(Y_i|X_i)$ and the error term for i:

$Y_i = \frac{e^{\beta_\mu + x_{i,a}\beta_a + x_{i,d}\beta_d}}{1 + e^{\beta_\mu + x_{i,a}\beta_a + x_{i,d}\beta_d}} + \epsilon_i$

$1 = \frac{e^{0.2 + (-1)2.2 + (-1)0.2}}{1 + e^{0.2 + (-1)2.2 + (-1)0.2}} + \epsilon_i$

$1 = 0.1 + 0.9$

Calculating the components of an individual IV. For example, say we have an individual i that has genotype A1A2 and phenotype $Y_i = 0$. We know $X_a = 0$ and $X_d = 1$. Say we also know that for the population, the true parameters (which we will not know in practice! We need to infer them!) are: $\beta_\mu = 0.2$, $\beta_a = 2.2$, $\beta_d = 0.2$. We can then calculate $E(Y_i|X_i)$ and the error term for i:

$Y_i = \frac{e^{\beta_\mu + x_{i,a}\beta_a + x_{i,d}\beta_d}}{1 + e^{\beta_\mu + x_{i,a}\beta_a + x_{i,d}\beta_d}} + \epsilon_i$

$0 = \frac{e^{0.2 + (0)2.2 + (1)0.2}}{1 + e^{0.2 + (0)2.2 + (1)0.2}} + \epsilon_i$

$0 = 0.6 - 0.6$

Calculating the components of an individual V. For example, say we have an individual i that has genotype A2A2 and phenotype $Y_i = 0$. We know $X_a = 1$ and $X_d = -1$. Say we also know that for the population, the true parameters (which we will not know in practice! We need to infer them!) are: $\beta_\mu = 0.2$, $\beta_a = 2.2$, $\beta_d = 0.2$. We can then calculate $E(Y_i|X_i)$ and the error term for i:

$Y_i = \frac{e^{\beta_\mu + x_{i,a}\beta_a + x_{i,d}\beta_d}}{1 + e^{\beta_\mu + x_{i,a}\beta_a + x_{i,d}\beta_d}} + \epsilon_i$

$0 = \frac{e^{0.2 + (1)2.2 + (-1)0.2}}{1 + e^{0.2 + (1)2.2 + (-1)0.2}} + \epsilon_i$

$0 = 0.9 - 0.9$
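These worked examples can be checked numerically; a quick sketch reproducing the expected values 0.1, 0.6, and 0.9 with the slides' parameters:

```python
import numpy as np

def logistic(v):
    return 1.0 / (1.0 + np.exp(-v))

beta_mu, beta_a, beta_d = 0.2, 2.2, 0.2
for label, xa, xd in [("A1A1", -1, -1), ("A1A2", 0, 1), ("A2A2", 1, -1)]:
    e_y = logistic(beta_mu + xa * beta_a + xd * beta_d)
    print(f"{label}: E(Y|X) = {e_y:.1f}")
# A1A1: E(Y|X) = 0.1  -> eps = -0.1 if Y=0, +0.9 if Y=1
# A1A2: E(Y|X) = 0.6  -> eps = -0.6 if Y=0
# A2A2: E(Y|X) = 0.9  -> eps = -0.9 if Y=0
```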

For the entire probability distributions I. Recall that the error term is either the negative of $E(Y_i|X_i)$ when $Y_i$ is zero, or $1 - E(Y_i|X_i)$ when $Y_i$ is one:

$\epsilon_i(Y_i = 0) = -E(Y_i|X_i) \quad\quad \epsilon_i(Y_i = 1) = 1 - E(Y_i|X_i)$

For the entire distribution of the population, recall that:

$Pr(\epsilon_i) \sim bern(p|X) - E(Y|X), \quad p = E(Y|X)$

For example: $\epsilon_i = -0.1$ or $\epsilon_i = 0.9$, with $p = 0.1$.

For the entire probability distributions II. Recall that the error term is either the negative of $E(Y_i|X_i)$ when $Y_i$ is zero, or $1 - E(Y_i|X_i)$ when $Y_i$ is one:

$\epsilon_i(Y_i = 0) = -E(Y_i|X_i) \quad\quad \epsilon_i(Y_i = 1) = 1 - E(Y_i|X_i)$

For the entire distribution of the population, recall that:

$Pr(\epsilon_i) \sim bern(p|X) - E(Y|X), \quad p = E(Y|X)$

For example: $\epsilon_i = -0.6$ or $\epsilon_i = 0.4$, with $p = 0.6$.

For the entire probability distributions III. Recall that the error term is either the negative of $E(Y_i|X_i)$ when $Y_i$ is zero, or $1 - E(Y_i|X_i)$ when $Y_i$ is one:

$\epsilon_i(Y_i = 0) = -E(Y_i|X_i) \quad\quad \epsilon_i(Y_i = 1) = 1 - E(Y_i|X_i)$

For the entire distribution of the population, recall that:

$Pr(\epsilon_i) \sim bern(p|X) - E(Y|X), \quad p = E(Y|X)$

For example: $\epsilon_i = -0.9$ or $\epsilon_i = 0.1$, with $p = 0.9$.

That's it for today. See you on Thurs.!