Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB.5201.01 Lecture 18: Introduction to covariates, the QQ plot, and population structure II + minimal GWAS steps Jason Mezey jgm45@cornell.edu April 19, 2016 (T) 8:40-9:55

Announcements The project is due (11:59 PM) on the last day of class! Reminder: the final will be given DURING THE FIRST WEEK OF EXAMS: available Mon., May 16 and due Thurs., May 19

Summary of lecture 18 In this lecture, we will continue our discussion of (fixed) covariates in a GWAS analysis and the importance of Quantile-Quantile (QQ) plots We will use the example of population structure to illustrate these concepts We will also begin our discussion of minimal GWAS steps

Conceptual Overview [Diagram: genetic system - does A1 -> A2 affect Y? Measured individuals (genotype, phenotype) are drawn from a sample or experimental pop; a regression model gives Pr(Y|X); model params are estimated and an F-test leads to Reject / DNR]

Review: modeling covariates I Therefore, if we have a factor that is correlated with our phenotype and we do not handle it in some manner in our analysis, we risk producing false positives AND/OR reducing the power of our tests! The good news is that, assuming we have measured the factor (i.e. it is part of our GWAS dataset), we can incorporate the factor in our model as a covariate: Y = γ⁻¹(β_μ + X_a β_a + X_d β_d + X_z β_z) The effect of this is that we will estimate the covariate model parameter, and this will account for the correlation of the factor with the phenotype (such that we can test for our marker correlation without false positives / lower power!)

Review: modeling covariates II Say you have GWAS data (a phenotype and genotypes) and your GWAS data also includes information on a number of covariates, e.g. male / female, different ancestral groups, etc. First, you need to figure out how to code X_z in each case, which may be simple (male / female) but more complex for others (where how to code them involves fuzzy rules, i.e. it depends on your context!!) Second, you will need to figure out which to include in your analysis (again, fuzzy rules!) but a good rule is: if the parameter estimate associated with the covariate is large (= significant individual p-value) you should include it! There are many ways to figure out an appropriate covariate model overall (again a topic in itself, but a reasonable QQ plot provides one good metric for assessing appropriateness!!)

Review: Quantile-Quantile (QQ) plots I A Quantile-Quantile (QQ) plot is an essential tool for detecting the most problematic covariates (and can be used to diagnose many other problems!): While the definition of a QQ plot is complex, generating a QQ plot is easy! We will demonstrate the value of a QQ plot for detecting the often problematic variable: population structure In general, whenever you perform a GWAS, you should construct a QQ plot (!!) and always include a QQ plot in your publication

Review: Quantile-Quantile (QQ) plots II Consider a random variable with a continuous probability distribution quantiles - regular, equally spaced intervals of a random variable that divide the random variable into units of equal distribution A Quantile-Quantile (QQ) plot (in general) plots the observed quantiles of one distribution versus another, OR plots the observed quantiles of a distribution versus the quantiles of an idealized distribution We will use a QQ plot to plot the quantile distribution of observed p-values (on the y-axis) versus the quantile distribution of expected p-values (what distribution is this!?)

Review: Quantile-Quantile (QQ) plots III How to construct a QQ plot for a GWAS: If you performed N tests, take the -log (base 10) of each of the p-values and put them in rank order from smallest to largest Create a vector of N values evenly spaced from 1/N to 1 (how do we do this?), take the -log of each of these values and rank them from smallest to largest Take the pair of smallest values from each of these lists and plot a point on an x-y plot with the observed -log p-value on the y-axis and the evenly spaced -log value on the x-axis Repeat for the next smallest pair, etc., until you have plotted all N pairs in order
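The steps above can be sketched in a few lines of numpy (a minimal version; the function name qq_points is our own):

```python
import numpy as np

def qq_points(pvals):
    """Compute (expected, observed) -log10 p-value pairs for a GWAS QQ plot.

    Expected quantiles come from the Uniform(0,1) null distribution:
    N evenly spaced values 1/N, 2/N, ..., 1. Both vectors are sorted
    from smallest to largest, so the i-th entries form one plotted point.
    """
    pvals = np.asarray(pvals, dtype=float)
    n = pvals.size
    observed = np.sort(-np.log10(pvals))                     # ranked observed -log10 p-values
    expected = np.sort(-np.log10(np.arange(1, n + 1) / n))   # ranked uniform quantiles
    return expected, observed
```

Plotting observed against expected (e.g. with matplotlib) gives the QQ plot; under the null every point should fall near the diagonal y = x.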

Review: QQ outcomes I In an ideal GWAS case where there ARE NO causal polymorphisms, your QQ plot will be a line: [Figure: QQ plot where the null is true in every case; observed -log(p-values) on the y-axis fall along the diagonal against expected -log(p-values) on the x-axis] The reason is that we will observe a uniform distribution of p-values in such a case, and in our QQ plot we are plotting this observed distribution of p-values versus the expected distribution of p-values: a uniform distribution (where both have been -log transformed) Note that if your GWAS analysis is correct but you do not have enough power to detect positions of causal polymorphisms, this will also be your result (!!), i.e. it is a way to assess whether you can detect any hits in your GWAS (!!)

Review: QQ outcomes II In an ideal GWAS case where there ARE causal polymorphisms, your QQ plot will be a line with a tail (!!): [Figure: QQ plot where there are some true associations; most points fall along the diagonal, with a tail of high observed -log(p-values) at the upper right] This happens because most of the observed p-values follow a uniform distribution (i.e. they are not in LD with a causal polymorphism, so the null hypothesis is correct!) but the few that are in LD with a causal polymorphism will produce significant p-values (extremely low p-values = extremely high -log(p-values)), and these are in the tail This is ideally how you want your QQ plot to look - if it does, you are in good shape!

Review: QQ outcomes III In practice, you may find your QQ plot looks different than either the null GWAS case or the ideal GWAS case, for example: [Figure: QQ plot where something is wrong; observed -log(p-values) rise above the diagonal across the entire range] This indicates that something is wrong (!!!!) and if this is the case, you should not interpret any of your significant p-values as indicating locations of causal polymorphisms (!!!!) Note that this means you need to find an analysis strategy such that the result of your GWAS produces a QQ plot that does NOT look like this (note that this takes experience and many tools to do consistently!) Also note that unaccounted-for covariates can cause this issue, and the most frequent culprit is unaccounted-for population structure

Example of a covariate impacting a QQ plot: population structure Population structure or stratification is a case where a sample includes groups of people that fit into two or more different ancestry groups (fuzzy def!) Population structure is often a major issue in GWAS where it can cause lots of false positives if it is not accounted for in your model Intuitively, you can model population structure as a covariate if you know: How many populations are represented in your sample Which individual in your sample belongs to which population QQ plots are good for determining whether there may be population structure Clustering techniques are good for detecting population structure and determining which individual is in which population (=ancestry group)

Origin of population structure [Image credit: Sarver World Cultures] People geographically separate through migration, and then the set of alleles present in each population evolves (= changes) over time

Why might (unaccounted for) structure be a problem in a GWAS? [Table: counts of A1A1, A1A2, A2A2 genotypes among cases and controls, shown for Pop 1 and Pop 2 separately and for the pooled sample ("if we analyze together")]

What is the consequence for GWAS? Even if you had a case where there were NO causal polymorphisms for a phenotype, you can get false positives if: You have more than one population in your sample (that you do not model with a covariate) These populations differ in MAF at a subset of measured genotypes These populations differ in the mean value of the phenotype In such a case, every genotype where the MAF differs between the populations would be expected to produce a low p-value (= false positives!)
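A quick simulation makes this concrete (a sketch; the population sizes, allele frequencies, and phenotype means below are made up for illustration). The marker has no causal effect on the phenotype in either population, yet pooling the two populations induces a genotype-phenotype correlation:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500  # individuals per population

# Pop 1: MAF 0.1, phenotype mean 0; Pop 2: MAF 0.9, phenotype mean 1.
# Genotypes are counts of the A2 allele (0/1/2); no causal effect anywhere.
g = np.concatenate([rng.binomial(2, 0.1, n), rng.binomial(2, 0.9, n)]).astype(float)
y = np.concatenate([rng.normal(0.0, 1.0, n), rng.normal(1.0, 1.0, n)])

# Pooled genotype-phenotype correlation is far from zero - a spurious association:
r = np.corrcoef(g, y)[0, 1]
```

Within each population alone, corr(g, y) is near zero; only the pooled analysis shows the (false-positive) signal.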

QQ plots can be used to detect population structure, e.g.

Learning unmeasured population factors To learn a population factor, analyze the genotype data: Data = [ z_11 ... z_1k, y_11 ... y_1m, x_11 ... x_1N ; ... ; z_n1 ... z_nk, y_n1 ... y_nm, x_n1 ... x_nN ] Apply a Principal Component Analysis (PCA), where the axes (features) in this case are individuals and each point is a (scaled) genotype What we are interested in are the projections (loadings) of the individual PCs on each of the individual axes, where each PC produces n values (i.e. one value for each sample) of a new independent (covariate) variable X_Z: Y = γ⁻¹(β_μ + X_a β_a + X_d β_d + X_z,1 β_z,1 + X_z,2 β_z,2 + ...)

This strategy works (!!) [Figure: published example; credit: Nature Publishing Group]

Applying a PCA population structure analysis (in practice) Calculate the n x n (n = sample size) covariance matrix for the individuals in your sample across all genotypes Apply a PCA to this covariance matrix; the output will be matrices containing eigenvalues and eigenvectors (= the Principal Components), where the size of the eigenvalue indicates the ordering of the Principal Components Each Principal Component (PC) will be an n element vector where each element is the loading of the PC on an individual axis, and these are the values of your independent variable coding (e.g., if you include the first PC as your first covariate, your coding will be X_Z,1 = PC loadings) Note that you could also get the same answer by calculating an N x N (N = measured genotypes) covariance matrix, applying PCA, and taking the projections of each sample on the PCs (why might this be less optimal?)
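A minimal numpy sketch of this procedure (the function name and the particular centering/scaling choices are our own assumptions; dedicated tools such as EIGENSTRAT implement refined versions):

```python
import numpy as np

def top_pcs(genotypes, k=2):
    """Top-k principal components of an n x N genotype matrix (n samples, N markers).

    Genotypes are coded 0/1/2; each marker column is centered and scaled,
    then the n x n covariance matrix across markers is eigendecomposed.
    Each returned column is one PC: an n-vector of loadings that can be
    used directly as covariate values X_Z for the n individuals.
    """
    G = np.asarray(genotypes, dtype=float)
    G = G - G.mean(axis=0)                 # center each marker
    sd = G.std(axis=0)
    G = G[:, sd > 0] / sd[sd > 0]          # scale; drop monomorphic markers
    C = G @ G.T / G.shape[1]               # n x n covariance matrix
    evals, evecs = np.linalg.eigh(C)       # eigh returns ascending eigenvalues
    order = np.argsort(evals)[::-1]        # reorder largest-first
    return evecs[:, order[:k]]
```

With clear structure, the first PC typically separates the ancestry groups, so its loadings capture population membership.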

Using the results of a PCA population structure analysis Once you have detected the populations (e.g. by eye in a PCA = fuzzy!) in your GWAS sample, set your independent variables equal to the loadings for each individual, e.g., for two pop covariates, set X_Z,1 = Z_1, X_Z,2 = Z_2 You could also determine which individual is in which pop and define random variables for pop assignment, e.g. for two populations include a single covariate by setting X_Z,1(pop1) = 1, X_Z,1(pop2) = 0 (generally less optimal but can be used!) Use one of these approaches to model a covariate in your analysis, i.e. for every genotype marker that you test in your GWAS: Y = γ⁻¹(β_μ + X_a β_a + X_d β_d + X_z,1 β_z,1 + X_z,2 β_z,2) The goal is to produce a good QQ plot (what if it does not?)

Before (top) and after (bottom) including a population covariate

Modeling covariates I How do we perform inference with a covariate in our linear regression model? We perform MLE the same way (!!): our x matrix now simply includes extra columns, one for each of the additional covariates, where for the linear regression we have: MLE(β̂) = (xᵀx)⁻¹xᵀy We perform hypothesis testing the same way (!!) with a slight difference: our LRT includes the covariate in both the null hypothesis and the alternative, but we are testing the same null hypothesis: H_0: β_a = 0 ∩ β_d = 0 versus H_A: β_a ≠ 0 ∪ β_d ≠ 0
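In code, the MLE is the same normal-equations computation whether or not covariates are present; the design matrix just gains columns. A sketch (helper names are our own; the layout [1, X_a, X_d, X_z] shows a single covariate for illustration):

```python
import numpy as np

def design_matrix(xa, xd, xz):
    """Design matrix with columns [1, X_a, X_d, X_z] for one covariate;
    additional covariates would simply append more columns."""
    return np.column_stack([np.ones(len(xa)), xa, xd, xz])

def mle_beta(X, y):
    """MLE(beta-hat) = (x^T x)^{-1} x^T y for the linear regression model."""
    return np.linalg.solve(X.T @ X, X.T @ y)
```

For example, with X_a coded -1/0/1 and X_d = 1 - 2|X_a|, mle_beta recovers the generating parameters exactly when y is a noiseless linear combination of the columns.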

Modeling covariates II (linear) First, determine the predicted value of the phenotype of each individual under the null hypothesis (how do we set up x?): ŷ_i,β̂0 = β̂_μ + Σ_{j=1}^{k} x_i,z,j β̂_z,j Second, determine the predicted value of the phenotype of each individual under the alternative hypothesis (how do we set up x?): ŷ_i,β̂1 = β̂_μ + x_i,a β̂_a + x_i,d β̂_d + Σ_{j=1}^{k} x_i,z,j β̂_z,j Third, calculate the Error Sum of Squares for each: SSE(β̂_0) = Σ_{i=1}^{n} (y_i - ŷ_i,β̂0)², SSE(β̂_1) = Σ_{i=1}^{n} (y_i - ŷ_i,β̂1)² Finally, calculate the F-statistic with degrees of freedom [2, n-3] (why two degrees of freedom?): F_[2,n-3](y, x) = [ (SSE(β̂_0) - SSE(β̂_1)) / 2 ] / [ SSE(β̂_1) / (n-3) ]

Modeling covariates III (logistic) First, run the IRLS algorithm for the model under the null hypothesis: β̂_0 = {β̂_μ, β̂_a = 0, β̂_d = 0, β̂_z} Second, run the IRLS algorithm for the alternative: β̂_1 = {β̂_μ, β̂_a, β̂_d, β̂_z} Third, calculate the log-likelihood for each: l(β̂_0|y) = Σ_{i=1}^{n} [ y_i ln(γ⁻¹(β̂_μ + x_i,z β̂_z)) + (1 - y_i) ln(1 - γ⁻¹(β̂_μ + x_i,z β̂_z)) ] l(β̂_1|y) = Σ_{i=1}^{n} [ y_i ln(γ⁻¹(β̂_μ + x_i,a β̂_a + x_i,d β̂_d + x_i,z β̂_z)) + (1 - y_i) ln(1 - γ⁻¹(β̂_μ + x_i,a β̂_a + x_i,d β̂_d + x_i,z β̂_z)) ] Finally, calculate the LRT and compare to a Chi-Square distribution with 2 d.f.: LRT = -2 ln Λ = 2 l(β̂_1|y) - 2 l(β̂_0|y)
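The whole procedure can be sketched as follows (a minimal IRLS implementation for illustration; function names, the fixed iteration count, and the probability clipping are our own choices, and production code would add a convergence check):

```python
import numpy as np

def logistic_irls(X, y, iters=25):
    """Fit logistic regression by IRLS (Newton-Raphson).

    Returns the fitted beta vector and the log-likelihood at the MLE.
    gamma^{-1} here is the logistic function 1 / (1 + exp(-eta)).
    """
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))     # gamma^{-1}(X beta)
        W = p * (1 - p)                         # IRLS weights
        # Newton step: beta <- beta + (X' W X)^{-1} X' (y - p)
        beta = beta + np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (y - p))
    p = np.clip(1.0 / (1.0 + np.exp(-X @ beta)), 1e-12, 1 - 1e-12)
    ll = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return beta, ll

def lrt(X_null, X_alt, y):
    """LRT = 2 l(beta1|y) - 2 l(beta0|y); compare to chi-square with 2 d.f."""
    _, l0 = logistic_irls(X_null, y)
    _, l1 = logistic_irls(X_alt, y)
    return 2.0 * (l1 - l0)
```

The null design matrix simply omits the X_a and X_d columns (equivalent to fixing β_a = β_d = 0), so both fits keep the covariate columns, exactly as in the slide.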

That's it for today! We will discuss minimal GWAS steps and begin a discussion of mixed models!