Quantile based Permutation Thresholds for QTL Hotspots. Brian S Yandell and Elias Chaibub Neto 17 March 2012


Fisher on inference

"We may at once admit that any inference from the particular to the general must be attended with some degree of uncertainty, but this is not the same as to admit that such inference cannot be absolutely rigorous, for the nature and degree of the uncertainty may itself be capable of rigorous expression."
Sir Ronald A. Fisher (1935), The Design of Experiments

Why study hotspots?
- How do genotypes affect phenotypes?
  - genotypes = DNA markers for an individual
  - phenotypes = traits measured on an individual (clinical traits, thousands of mRNA expression levels)
- QTL hotspots = genomic locations affecting many traits
  - common feature in genetical genomics studies
  - biologically interesting: may harbor critical regulators
- But are these hotspots real? Or are they spurious or random?
  - non-genetic correlation from other environmental factors


How large a hotspot is large?
- recently proposed empirical test: Breitling et al. (2008)
  - hotspot = count of traits above LOD threshold
  - LOD = rescaled likelihood ratio ~ F statistic
- assess null distribution with permutation test
  - extension of Churchill and Doerge (1994)
  - extension of Fisher's permutation t test
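
The phrase "LOD = rescaled likelihood ratio ~ F statistic" can be spelled out for a normal linear model. The identities below are standard results, not taken from the slides. The symbols are assumptions for illustration: n is the number of individuals, RSS_0 and RSS_1 are the residual sums of squares without and with the QTL genotype term, and df is the number of genotype parameters beyond the mean (for example 1 in a backcross, 2 in an intercross).

```latex
\begin{aligned}
\mathrm{LOD} &= \log_{10}\frac{L_1}{L_0}
             = \frac{2\ln(L_1/L_0)}{2\ln 10}
             = \frac{n}{2}\,\log_{10}\frac{\mathrm{RSS}_0}{\mathrm{RSS}_1},\\[4pt]
F &= \frac{(\mathrm{RSS}_0-\mathrm{RSS}_1)/\mathrm{df}}{\mathrm{RSS}_1/(n-\mathrm{df}-1)}
\quad\Longrightarrow\quad
\mathrm{LOD} = \frac{n}{2}\,\log_{10}\!\left(1+\frac{\mathrm{df}\,F}{n-\mathrm{df}-1}\right).
\end{aligned}
```

So for fixed n and df the LOD is a monotone function of F, and ranking loci by LOD is the same as ranking them by F.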

Single trait permutation threshold T
- Churchill and Doerge (1994)
- Null distribution of max LOD
  - Permute single trait separately from genotypes
  - Find max LOD over genome
  - Repeat 1000 times
- Find 95% permutation threshold T
- Identify interesting peaks above T in data
- Controls genome-wide error rate (GWER)
  - Chance of detecting at least one peak above T when no QTL is present

Single trait permutation schema
[Figure: genotypes and phenotype -> LOD over genome -> max LOD]
1. shuffle phenotypes to break QTL
2. repeat 1000 times and summarize
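
The single-trait permutation scheme can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes numeric genotype codes (e.g. 0/1 in a backcross), no missing data, and single-marker regression, for which LOD = -(n/2) log10(1 - r^2) with r the marker-trait correlation. The function names `lod_scan` and `single_trait_threshold` and the toy data are hypothetical.

```python
import numpy as np

def lod_scan(geno, y):
    """LOD at each marker for one trait, by single-marker regression.

    geno: (n, m) numeric genotype codes; y: (n,) phenotype values.
    """
    n = len(y)
    g = (geno - geno.mean(axis=0)) / geno.std(axis=0)
    z = (y - y.mean()) / y.std()
    r = g.T @ z / n                       # marker-trait correlations
    return -(n / 2) * np.log10(1 - r**2)  # (n/2) * log10(RSS0 / RSS1)

def single_trait_threshold(geno, y, n_perm=1000, level=0.95, seed=0):
    """Permutation threshold T: `level` quantile of the null max LOD."""
    rng = np.random.default_rng(seed)
    max_lod = np.empty(n_perm)
    for b in range(n_perm):
        y_perm = rng.permutation(y)       # shuffle phenotype to break any QTL
        max_lod[b] = lod_scan(geno, y_perm).max()
    return np.quantile(max_lod, level)

# Toy data (assumption): 100 individuals, 50 unlinked markers, no QTL.
rng = np.random.default_rng(1)
geno = rng.integers(0, 2, size=(100, 50)).astype(float)
y = rng.normal(size=100)
T = single_trait_threshold(geno, y, n_perm=200)
peaks = np.flatnonzero(lod_scan(geno, y) > T)  # markers exceeding T
```

Real genomes have linked markers, which lowers the effective number of tests; the permutation procedure accounts for that automatically because it reuses the observed genotypes.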

Hotspot count threshold N(T)
- Breitling et al. (2008)
- Null distribution of max count above T
  - Find single trait 95% LOD threshold T
  - Find max count of traits with LODs above T
  - Repeat 1000 times
- Find 95% count permutation threshold N
- Identify counts of LODs above T in data
  - Locus-specific counts identify hotspots
- Controls GWER in some way

Hotspot permutation schema
[Figure: genotypes and phenotypes -> LOD at each locus for each phenotype over genome -> count LODs at locus over threshold T -> max count N over genome]
1. shuffle phenotypes by row to break QTL, keep correlation
2. repeat 1000 times and summarize

Spurious hotspot permutation histogram for hotspot size above single trait threshold
[Figure: permutation histogram of hotspot sizes]
- 95% threshold at N > 82 using single trait threshold T = 3.41

Hotspot sizes based on count of LODs above single trait threshold
- 5 peaks above count threshold N = 82
- all traits counted are nominally significant
- but no adjustment for multiple testing across traits

Hotspot permutation test (Breitling et al. 2008, PLoS Genetics)
- for original dataset and each permuted set:
  - Set single trait LOD threshold T
    - Could use Churchill and Doerge (1994) permutations
  - Count number of traits (N) with LOD above T
    - Do this at every marker (or pseudomarker)
    - Probably want to smooth counts somewhat
- find count with at most 5% of permuted sets above (critical value) as count threshold
- conclude original counts above threshold are real
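
The count-based test above can be sketched as follows. This is an illustrative reading of the steps on the slide, not the published implementation: it uses single-marker regression LODs, does no smoothing of counts, and the helper names (`lod_matrix`, `max_hotspot_count`, `count_threshold`), the toy data, and the toy threshold T = 2.0 are all assumptions.

```python
import numpy as np

def lod_matrix(geno, pheno):
    """LOD for every marker x trait pair (single-marker regression)."""
    n = geno.shape[0]
    g = (geno - geno.mean(axis=0)) / geno.std(axis=0)
    p = (pheno - pheno.mean(axis=0)) / pheno.std(axis=0)
    r = g.T @ p / n
    return -(n / 2) * np.log10(1 - r**2)   # shape (markers, traits)

def max_hotspot_count(geno, pheno, T):
    """Largest number of traits with LOD > T at any single locus."""
    counts = (lod_matrix(geno, pheno) > T).sum(axis=1)
    return int(counts.max())

def count_threshold(geno, pheno, T, n_perm=1000, level=0.95, seed=0):
    """Permutation critical value for the hotspot count."""
    rng = np.random.default_rng(seed)
    n = geno.shape[0]
    null_max = np.empty(n_perm, dtype=int)
    for b in range(n_perm):
        rows = rng.permutation(n)          # one shuffle shared by all traits
        null_max[b] = max_hotspot_count(geno, pheno[rows], T)
    return np.quantile(null_max, level)

# Toy data (assumption): 80 individuals, 30 markers, 40 correlated traits.
rng = np.random.default_rng(2)
geno = rng.integers(0, 2, size=(80, 30)).astype(float)
shared = rng.normal(size=(80, 1))          # non-genetic shared factor
pheno = shared + rng.normal(size=(80, 40))
T = 2.0                                    # hypothetical single-trait threshold
N_crit = count_threshold(geno, pheno, T, n_perm=200)
observed = (lod_matrix(geno, pheno) > T).sum(axis=1)
hotspots = np.flatnonzero(observed > N_crit)
```

Because the traits share a non-genetic factor, the null counts can be large at a single locus even without any QTL, which is why the critical value has to come from permutations that keep the trait correlation intact.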

Permutation across traits (Breitling et al. 2008, PLoS Genetics)
[Figure: "right way" and "wrong way" panels; labels: strain, marker, gene expression]
- break correlation between markers and traits
- but preserve correlation among traits
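
The difference between the two permutation schemes is easy to check numerically. The sketch below uses simulated traits (an assumption, not data from the talk) and compares trait-trait correlations after permuting whole rows (individuals) with those after shuffling each trait separately.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
shared = rng.normal(size=(n, 1))
pheno = shared + 0.5 * rng.normal(size=(n, 3))  # 3 strongly correlated traits

# Right way: permute individuals, applying the same order to every trait.
rows = rng.permutation(n)
joint = pheno[rows]

# Wrong way: shuffle each trait independently of the others.
separate = np.column_stack(
    [rng.permutation(pheno[:, j]) for j in range(pheno.shape[1])]
)

corr_orig = np.corrcoef(pheno, rowvar=False)
corr_joint = np.corrcoef(joint, rowvar=False)   # identical to corr_orig
corr_sep = np.corrcoef(separate, rowvar=False)  # off-diagonals near zero
```

Both schemes break the link between markers and traits, but only the row permutation keeps the correlation among traits that can produce spurious hotspots.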

Quality vs. quantity in hotspots (Chaibub Neto et al., in review)
- detecting single trait with very large LOD
  - control FWER across genome
  - control FWER across all traits
- finding small hotspots with significant traits
  - all with large LODs
  - could indicate a strongly disrupted signal pathway
- sliding LOD threshold across hotspot sizes

Rethinking the approach
- Breitling et al. depends highly on T
  - Threshold T based on single trait
  - but interested in multiple correlated traits
- want to control hotspot GWER (hGWER_N)
  - chance of detecting at least one spurious hotspot of size N or larger
- N = 1
  - chance of detecting at least 1 peak above threshold across all traits and whole genome
  - Use permutation null distribution of maximum LOD scores across all transcripts and all genomic locations

Hotspot architecture using multiple trait GWER threshold (T_1 = 7.12)
- count of all traits with LOD above T_1 = 7.12
- all traits counted are significant
- conservative adjustment for multiple traits

[Figure: locus-specific LOD quantiles in data for 10 (black), 20 (blue), 50 (red) traits]

Locus-specific LOD quantiles
- Quantile: what is the LOD value for which at least 10 (or 20 or 50) traits are at or above it?
- Breitling hotspots (chr 2, 3, 12, 14, 15) have many traits with high LODs

Chromosome max LOD quantile by trait count

color  count  chr 3  chr 8  chr 12  chr 14
black  10     24     10     18      12
blue   20     11     8      15      11
red    50     6      4      9       9

Hotspot permutation revisited
[Figure: genotypes and phenotypes -> LOD at each locus over genome per phenotype -> quantile = N-th largest LOD at each locus -> max LOD quantile over genome]
1. shuffle phenotypes by row to break QTL, keep correlation
2. repeat 1000 times and summarize
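
The revised scheme replaces the count at each locus with the N-th largest LOD at that locus. A minimal sketch follows, under the same assumptions as a simple marker-regression scan (numeric genotypes, no missing data); the function names and toy data are hypothetical, and this is not the authors' implementation. For N = 1 the statistic reduces to the maximum LOD over all traits and loci.

```python
import numpy as np

def lod_matrix(geno, pheno):
    """LOD for every marker x trait pair (single-marker regression)."""
    n = geno.shape[0]
    g = (geno - geno.mean(axis=0)) / geno.std(axis=0)
    p = (pheno - pheno.mean(axis=0)) / pheno.std(axis=0)
    r = g.T @ p / n
    return -(n / 2) * np.log10(1 - r**2)      # shape (markers, traits)

def max_lod_quantiles(lod, n_max):
    """For N = 1..n_max: max over loci of the N-th largest LOD at a locus."""
    top = -np.sort(-lod, axis=1)[:, :n_max]   # each row sorted descending
    return top.max(axis=0)

def quantile_thresholds(geno, pheno, n_max, n_perm=1000, level=0.95, seed=0):
    """Size-specific LOD thresholds, one per hotspot size N = 1..n_max."""
    rng = np.random.default_rng(seed)
    n = geno.shape[0]
    null = np.empty((n_perm, n_max))
    for b in range(n_perm):
        rows = rng.permutation(n)             # shuffle rows, keep trait correlation
        null[b] = max_lod_quantiles(lod_matrix(geno, pheno[rows]), n_max)
    return np.quantile(null, level, axis=0)

# Toy data (assumption): 80 individuals, 30 markers, 40 correlated traits.
rng = np.random.default_rng(3)
geno = rng.integers(0, 2, size=(80, 30)).astype(float)
pheno = rng.normal(size=(80, 1)) + rng.normal(size=(80, 40))
thr = quantile_thresholds(geno, pheno, n_max=10, n_perm=100)
# thr[0] plays the role of the all-traits threshold; thr decreases with N.
```

In each permuted data set the N-th largest LOD cannot exceed the (N-1)-th largest, so the resulting thresholds are non-increasing in N by construction.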

Tail distribution of LOD quantiles and size-specific thresholds
- What is a locus-specific (spurious) hotspot?
  - all traits in hotspot have LOD above null threshold
- Small spurious hotspots have higher minimum LODs
  - min of the 10 largest values >= min of the 20 largest values
- Large spurious hotspots have many small LODs
  - most are below single trait threshold
- Null thresholds depend on hotspot size
  - decrease with spurious hotspot size (starting at N = 1)
  - are truncated at single trait threshold for large sizes
- Chen and Storey (2007) studied LOD quantiles
  - for multiple peaks on a single trait

Genome-wide LOD permutation threshold vs. spurious hotspot size
- smaller spurious hotspots have higher LOD thresholds
- larger spurious hotspots allow many traits with small LODs (below T = 3.41)

[Figure: hotspot architecture using multiple trait GWER threshold (T_1 = 7.12)]

[Figure: hotspot architectures using LOD thresholds for 10 (black), 20 (blue), 50 (red) traits]

Sliding threshold between multiple trait (T_1 = 7.12) and single trait (T_0 = 3.41) GWER
- T_1 = 7.12 controls GWER across all traits
- T_0 = 3.41 controls GWER for single trait

Hotspot size significance profile
- Construction
  - Fix significance level (say 5%)
  - At each locus, find largest hotspot that is significant using sliding threshold
  - Plot as profile across genome
- Interpretation
  - Large hotspots were already significant
  - Traits with LOD > 7.12 could be hubs
  - Smaller hotspots identified by fewer large LODs (chr 8)
  - Subjective choice on what to investigate (chr 13, 5?)
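
The construction step can be illustrated with a small worked example. This is a sketch of one plausible reading of "largest hotspot that is significant using sliding threshold": a hotspot of size N at a locus is called significant when the N-th largest LOD there exceeds the size-N threshold. The function name `hotspot_size_profile` and the toy LOD matrix are hypothetical. The end-point thresholds 7.12 and 3.41 are the values quoted in the slides; the intermediate value 5.0 is made up for illustration.

```python
import numpy as np

def hotspot_size_profile(lod, thresholds):
    """Largest significant hotspot size at each locus (0 if none).

    lod: (markers, traits) LOD scores.
    thresholds: thresholds[k] is the LOD threshold for hotspot size k + 1,
                assumed non-increasing in k.
    """
    thresholds = np.asarray(thresholds)
    n_max = len(thresholds)
    top = -np.sort(-lod, axis=1)[:, :n_max]  # k-th column: (k+1)-th largest LOD
    significant = top > thresholds           # size k+1 significant at this locus?
    sizes = np.arange(1, n_max + 1)
    return (significant * sizes).max(axis=1)

# Toy LOD matrix (assumption): 3 loci x 4 traits.
lod = np.array([
    [8.0, 7.5, 1.0, 0.5],   # two very large LODs
    [4.0, 4.0, 4.0, 0.1],   # three moderate LODs
    [1.0, 1.0, 1.0, 1.0],   # nothing of note
])
thresholds = [7.12, 5.0, 3.41]               # sizes 1, 2, 3
profile = hotspot_size_profile(lod, thresholds)
print(profile)  # -> [2 3 0]
```

The first locus is a small hotspot carried by a few large LODs, while the second is a larger hotspot of moderate LODs that only passes the lower, size-3 threshold.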

[Figure: hotspot size significance profile]

Yeast study
- 120 individuals
- 6000 traits
- 250 markers
- 1000 permutations
- 1.8 * 10^10 linear models

Mouse study
- 500 individuals
- 30,000 traits * 6 tissues
- 2000 markers
- 1000 permutations
- 1.8 * 10^13 linear models
  - 1000 x more than yeast study

Scaling up permutations
- tremendous computing resource needs
  - Multiple analyses, periodically redone
    - Algorithms improve
    - Gene annotation and sequence data evolve
  - Verification of properties of methods
    - Theory gives easy cutoff values (LOD > 3) that may not be relevant
    - Need to carefully develop re-sampling methods (permutations, etc.)
  - Storage of raw, processed and summary data (and metadata)
    - Terabyte(s) of backed-up storage (soon petabytes and more)
    - Web access tools
- high throughput computing platforms (Condor)
  - Reduce months or years to hours or days
  - Free up your mind to think about science rather than mechanics
  - Free up your desktop/laptop for more immediate tasks
- Need local (regional) infrastructure
  - Who maintains the machines, algorithms?
  - Who can talk to you in plain language?

CHTC use: one small project
Open Science Grid Glidein Usage (4 Feb 2012)

   group             hours    percent
1  BMRB              10710.3  73.49%
2  Biochem_Attie     3660.2   25.11%
3  Statistics_Wahba  178.5    1.22%

[Figure: Breitling et al. (2008) hotspot size thresholds from permutations]

[Figure: Breitling method]

[Figure: Chaibub Neto sliding LOD thresholds]

[Figure: sliding LOD method]

What's next?
- Further assess properties (power of test)
- Drill into identified hotspots
  - Find correlated subsets of traits
  - Look for local causal agents (cis traits)
  - Build causal networks (another talk)
- Validate findings for narrow hotspot
- Incorporate as tool in pipeline
  - Increase access for discipline researchers
  - Increase visibility of method

References
- Chaibub Neto E, Keller MP, Broman AF, Attie AD, Jansen RC, Broman KW, Yandell BS. Quantile-based permutation thresholds for QTL hotspots. Genetics (in review).
- Breitling R, Li Y, Tesson BM, Fu J, Wu C, Wiltshire T, Gerrits A, Bystrykh LV, de Haan G, Su AI, Jansen RC (2008) Genetical genomics: spotlight on QTL hotspots. PLoS Genetics 4: e1000232.
- Churchill GA, Doerge RW (1994) Empirical threshold values for quantitative trait mapping. Genetics 138: 963-971.