BTRY 7210: Topics in Quantitative Genomics and Genetics

Similar documents
Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

Linear Regression (1/1/17)

BTRY 4830/6830: Quantitative Genomics and Genetics

Introduction to Linkage Disequilibrium

Department of Forensic Psychiatry, School of Medicine & Forensics, Xi'an Jiaotong University, Xi'an, China;

Lecture 2: Genetic Association Testing with Quantitative Traits. Summer Institute in Statistical Genetics 2017

1 Springer. Nan M. Laird Christoph Lange. The Fundamentals of Modern Statistical Genetics

Association studies and regression

BTRY 4830/6830: Quantitative Genomics and Genetics Fall 2014

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

Association Testing with Quantitative Traits: Common and Rare Variants. Summer Institute in Statistical Genetics 2014 Module 10 Lecture 5

BTRY 7210: Topics in Quantitative Genomics and Genetics

Lecture WS Evolutionary Genetics Part I 1

Probability of Detecting Disease-Associated SNPs in Case-Control Genome-Wide Association Studies

Module Contact: Dr Doug Yu, BIO Copyright of the University of East Anglia Version 1

Chapter 6 Linkage Disequilibrium & Gene Mapping (Recombination)

Theoretical and computational aspects of association tests: application in case-control genome-wide association studies.

High-Throughput Sequencing Course. Introduction. Introduction. Multiple Testing. Biostatistics and Bioinformatics. Summer 2018

Genotype Imputation. Biostatistics 666

Proportional Variance Explained by QLT and Statistical Power. Proportional Variance Explained by QTL and Statistical Power

Heredity and Genetics WKSH

Tools and topics for microarray analysis

NIH Public Access Author Manuscript Stat Sin. Author manuscript; available in PMC 2013 August 15.

p(d g A,g B )p(g B ), g B

(Genome-wide) association analysis

A novel fuzzy set based multifactor dimensionality reduction method for detecting gene-gene interaction

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

opulation genetics undamentals for SNP datasets

Case-Control Association Testing. Case-Control Association Testing

Bayesian Inference of Interactions and Associations

Latent Variable models for GWAs

1. Understand the methods for analyzing population structure in genomes

Lecture 1: Case-Control Association Testing. Summer Institute in Statistical Genetics 2015

Statistical testing. Samantha Kleinberg. October 20, 2009

GENETICS - CLUTCH CH.1 INTRODUCTION TO GENETICS.

The phenotype of this worm is wild type. When both genes are mutant: The phenotype of this worm is double mutant Dpy and Unc phenotype.

Lecture 22: Signatures of Selection and Introduction to Linkage Disequilibrium. November 12, 2012

Expression QTLs and Mapping of Complex Trait Loci. Paul Schliekelman Statistics Department University of Georgia

LECTURE # How does one test whether a population is in the HW equilibrium? (i) try the following example: Genotype Observed AA 50 Aa 0 aa 50

Quantile based Permutation Thresholds for QTL Hotspots. Brian S Yandell and Elias Chaibub Neto 17 March 2012

Coding sequence array Office hours Wednesday 3-4pm 304A Stanley Hall

Family Trees for all grades. Learning Objectives. Materials, Resources, and Preparation

Bioinformatics. Genotype -> Phenotype DNA. Jason H. Moore, Ph.D. GECCO 2007 Tutorial / Bioinformatics.

Lesson 11. Functional Genomics I: Microarray Analysis

Friday Harbor From Genetics to GWAS (Genome-wide Association Study) Sept David Fardo

Name Class Date. KEY CONCEPT Gametes have half the number of chromosomes that body cells have.

PCA vignette Principal components analysis with snpstats

Lecture 7: Hypothesis Testing and ANOVA

Lecture 9. Short-Term Selection Response: Breeder s equation. Bruce Walsh lecture notes Synbreed course version 3 July 2013

Calculation of IBD probabilities

Part 2- Biology Paper 2 Inheritance and Variation Knowledge Questions

Régression en grande dimension et épistasie par blocs pour les études d association

Family Trees for all grades. Learning Objectives. Materials, Resources, and Preparation

Lecture 3: Basic Statistical Tools. Bruce Walsh lecture notes Tucson Winter Institute 7-9 Jan 2013

Calculation of IBD probabilities

Statistical Power of Model Selection Strategies for Genome-Wide Association Studies

The Lander-Green Algorithm. Biostatistics 666 Lecture 22

Computational Biology: Basics & Interesting Problems

Unit 3 - Molecular Biology & Genetics - Review Packet

I. Short Answer Questions DO ALL QUESTIONS

Major Genes, Polygenes, and

Molecular Evolution & the Origin of Variation

Molecular Evolution & the Origin of Variation

Causal inference in biomedical sciences: causal models involving genotypes. Mendelian randomization genes as Instrumental Variables

CS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS

The E-M Algorithm in Genetics. Biostatistics 666 Lecture 8

Short-Term Selection Response: Breeder s equation. Bruce Walsh lecture notes Uppsala EQG course version 31 Jan 2012

allosteric cis-acting DNA element coding strand dominant constitutive mutation coordinate regulation of genes denatured

The Quantitative TDT

Lecture 28. Ingo Ruczinski. December 3, Department of Biostatistics Johns Hopkins Bloomberg School of Public Health Johns Hopkins University

Ch 11.Introduction to Genetics.Biology.Landis

Quantitative Biology Lecture 3

Name Date Class CHAPTER 10. Section 1: Meiosis

Nature Genetics: doi: /ng Supplementary Figure 1. Number of cases and proxy cases required to detect association at designs.

Objective 3.01 (DNA, RNA and Protein Synthesis)

Introduction to QTL mapping in model organisms

Discovering Correlation in Data. Vinh Nguyen Research Fellow in Data Science Computing and Information Systems DMD 7.

Hunting for significance with multiple testing

Supplementary Figures

Teachers Guide. Overview

contents: BreedeR: a R-package implementing statistical models specifically suited for forest genetic resources analysts

SNP Association Studies with Case-Parent Trios

HEREDITY: Objective: I can describe what heredity is because I can identify traits and characteristics

Homework Assignment, Evolutionary Systems Biology, Spring Homework Part I: Phylogenetics:

EM algorithm. Rather than jumping into the details of the particular EM algorithm, we ll look at a simpler example to get the idea of how it works

Looking at the Other Side of Bonferroni

Learning ancestral genetic processes using nonparametric Bayesian models

Unit 6 Reading Guide: PART I Biology Part I Due: Monday/Tuesday, February 5 th /6 th

Lecture 3: Mixture Models for Microbiome data. Lecture 3: Mixture Models for Microbiome data

InGen: Dino Genetics Lab Post-Lab Activity: DNA and Genetics

COMBI - Combining high-dimensional classification and multiple hypotheses testing for the analysis of big data in genetics

Network Biology-part II

Variance Component Models for Quantitative Traits. Biostatistics 666

Causal Graphical Models in Systems Genetics

Dr. Junchao Xia Center of Biophysics and Computational Biology. Fall /8/2016 1/38


Genetical theory of natural selection

Introduction to Genetics

Transcription:

BTRY 7210: Topics in Quantitative Genomics and Genetics Jason Mezey Biological Statistics and Computational Biology (BSCB) Department of Genetic Medicine jgm45@cornell.edu February 12, 2015

Lecture 3: Biological false positive control in eqtl analysis

Reminder: what is an eqtl? eqtl ERAP2 expression 3.5 4.0 4.5 5.0 5.5 6.0 A/A A/G G/G rs27290 genotype A 1 A 2 Y

Typical outcome of a genomewide eqtl analysis I eqtl (p < 10 30 ) ERAP2 expression 3.5 4.0 4.5 5.0 5.5 6.0 A/A A/G G/G rs27290 genotype no eqtl (n.s.) ERAP2 expression 3.5 4.0 4.5 5.0 5.5 6.0 T/T T/C C/C rs1908530 genotype

Typical outcome of a genomewide eqtl analysis II

Considering cis- vs trans- eqtl 1

Considering cis- vs trans- eqtl II

eqtl false positives In a genome-wide eqtl analysis we are more concerned about errors in identifying a position as having an eqtl when there is not one, as opposed to missing a location that has an eqtl (why?) The first of these errors is a biological false positive (the second is a biological false negative ) There are both statistical and experimental (and combined) reasons why we can get a biological false positive: Type I error (statistical) Genotyping / Expression measurement errors (experimental) Cases of disequilibrium when there is no linkage (experimental) Unaccounted for covariates (both)

Type I error (statistical) H 0 is true H 0 is false cannot reject H 0 1-α, (correct) β, type II error reject H 0 α, type I error 1 β, power (correct) Pr(T(x) H0) T(x) c α )=1.64 Note that to avoid Type 1 errors when performing the many tests involved in an eqtl analysis, we use multiple test corrections, generally Bonferroni or Benjamini-Hochberg What are the trade-offs of the more conservative (Bonferroni) vs less conservative (Benjamini-Hochberg) controls of Type 1 error?

Genotype / Expression measurement errors (experimental) Genotyping error can produce false positives, where genotypes with lower minor allele frequencies (MAF) are more susceptible (why?) Do these impact both cis- and trans- hits? How might we detect that a genotyping error has occurred from the patterns of linkage disequilibrium (LD)?

Genotype / Expression measurement errors (experimental) Expression measurement error can produce false positives, where both gene expression array and RNA-Seq measurements are susceptible for different reasons (why?)

Genotype / Expression measurement errors (experimental) A possible problem with arrays is probes that are polymorphic in the population (=do not perfectly match the DNA of an individual) A possible with RNA-Seq is psuedogenes that lead to incorrect mapping of sequencing reads (=incorrect quantification of mrna at a gene for an individual)

Disequilibrium but not LD (experimental) A rule of genetics is that genotypes that tend to be physically close to each other in the genome have a high correlations (=linkage disequilibrium or LD) and the further away genotypes are from one another, the lower their correlations (in general): LD equilibrium, linkage A B C Chr. 1 Chr. 2 D equilibrium, no linkage While generally true, this is not always the case (!!) Some issues to watch out for: (a) X and Y chromosomes, (b) low MAF markers in high marker coverage experiments, (c) inbreeding designs

Unaccounted for covariates (statistical and experimental) If the marker is not correlated with a causal polymorphism but the factor is correlated with BOTH the phenotype and the marker such that a test of the marker using our framework will produce a false positive (!!): Cov(Y,X z ) = 0. z Cov(X a,x z ) = 0, H 0 : β a =0 β d = 0 H A : β a =0 β d = 0 Y = β µ + X a β a + X d β d + This can also lead to reduced power...

Unaccounted for covariates (statistical and experimental) Therefore, if we have a factor that is correlated with our phenotype and we do not handle it in some manner in our analysis, we risk producing false positives AND/OR reduce the power of our tests! The good news is that, assuming we have measured the factor (i.e. it is part of our eqtl dataset) then we can incorporate the factor in our model as a covariate: Y = β µ + X a β a + X d β d + X z β z + N(0, σ 2 ) The effect of this is that we will estimate the covariate model parameter and this will account for the correlation of the factor with phenotype (such that we can test for our marker correlation without false positives / lower power!) In general, we should include covariates that might have an impact, how do we assess this? Do we include every possible covariate (why might this be a bad idea)? What if we haven t measured the covariate causing the problem(!)?

Visual diagnostics for assessing statistical model fit Remember that in our eqtl analysis, we are not after the right statistical model but rather a model that we can fit to genotypeexpression pairs that produces as few biological false positives as possible while still allowing us to discover eqtl Any alternative statistical modeling strategy (linear, mixed, nonparametric, etc.) and statistical model (which covariates are included, etc.) is potentially acceptable However, you should always check genome-wide diagnostics of model fit before you believe the eqtl you identify (you should also visually check the expression by genotype plots, the local LD of the eqtl hit, etc.) By far, the two most valuable we have found are Quantile-Quantile plots (QQ) and Heatmap plots (where the latter are not in general usage)

Quantile-Quantile (QQ) plots For an expression variable impacted by an eqtl (i.e. you have found markers at a position significantly associated with the expression variable, you need to plot the p-values for all genotypes in your study when tested versus this expression variable using the following approach: If you performed N tests, take the -log (base 10) of each of the p-values and put them in rank order from smallest to largest Create a vector of N values evenly spaces from 1 to 1 / N (how do we do this?), take the - log of each of these values and rank them from smallest to largest Take the pair of the smallest of values of each of these lists and plot a point on an x-y plot with the observed -log p-value on the y-axis and the spaced -log value on the x-axis Repeat for the next smallest pair, for the next, etc. until you have plotted all N pairs in order This is a QQ plot and YOU SHOULD ONLY CONSIDER YOUR eqtl association to be correct if they look like the following: QQ plots

Heatmap plots A heatmap plot is constructed by calculating p-values for every expression-genotype pair and plotting genotypes in the columns and expression variables on the rows, where colors indicate significance (while-yellow=not significant, red=significant) If you see horizontal stripes or vertical stripes THIS INDICATES A PROBLEM (!!) Population-type factor Hidden factor The horizontal stripes indicate an unaccounted for population structure covariate (i.e. your sample could be divided into multiple populations that differ in allele frequencies at many markers where these populations also differ in the mean value of an expressed gene = every allele that differs in frequency will be statistically associated with the gene!!) - how can we account for such population structure covariates? The vertical stripes indicate an unaccounted for hidden factor covariate (i.e. a covariate that you have not measured that impacts gene expression levels that is also correlated with markers) - how can we account for such hidden factor covariates? In general, YOU SHOULD NOT CONSIDER an eqtl in a horizontal or vertical stripe to be a true eqtl (!!) Why don t people use heatmaps? What is a practical alternative?

Are false positives a problem in eqtl analyses? All eqtl cis-eqtl only trans-eqtl only

That s it for today!