IDR: Irreproducible discovery rate


IDR: Irreproducible discovery rate
Sündüz Keleş
Department of Statistics
Department of Biostatistics and Medical Informatics
University of Wisconsin, Madison
April 18, 2017
Stat 877 (Spring 17) 04/11-04/18

IDR
Li, Brown, Huang, Bickel (2011). Measuring reproducibility of high-throughput experiments. Annals of Applied Statistics, Vol. 5, No. 3, 1752-1779.

IDR
Developed within the ENCODE project.
Most ChIP-seq experiments in the ENCODE project have two replicates.
Peaks are called on each replicate separately.
Peaks from the two replicates are compared to identify reproducible peaks.
Many peak callers produce very different numbers of peaks for the same dataset.
The next 5 slides are courtesy of Anshul Kundaje.

IDR: Expected rate of irreproducible discoveries.
The goal is to limit the expected proportion of peaks that are not reproducible across replicates.

Peak calling evaluation
Evaluated multiple peak callers (* = sequence aware):
  SPP (Anshul Kundaje)
  GEM* (David Gifford Lab)
  PeakSeq (Mark Gerstein Lab)
  MACSv2 (Tao Liu, Shirley Liu)
  MOSAICS*/dPeak (Sunduz Keles lab)
  Hotspot (Bob Thurman)
  BELT (Victor Jin, Peggy Farnham)
  Scripture (Noam Soresh)
Irreproducible Discovery Rate (IDR) model
Evaluation datasets:
  CTCF (narrow peaks, high SNR, high specificity motif, > 50K sites)
  POL2 (narrow and diffused peaks, high SNR, bind at/near TSSs, > 10K sites)
  GABP (narrow peaks, high SNR, homotypic closely spaced binding events, high specificity motif, > 5K sites)
  P300 (narrow peaks, high SNR, no sequence motif, enhancer binding)
  ZNF274 (diffused peaks, low SNR, at ZNF genes, < 1K peaks)


Instability of default peak calling thresholds
[Figure: panels for default thresholds and for an IDR threshold of 2%; axis labels: number of datasets, # peaks called by MACS, # peaks called by SPP.]

Peak calling evaluation
Evaluation criteria:
  No. of reproducible peak calls (IDR)
  Stability of ranking measures (peak rank scatter plots)
  Precision of binding events (distance from motifs)
  Resolution of closely spaced binding events
  Ability to detect sites for high and low SNR datasets
  Ease of use (speed, memory)
  Backward compatibility
[Figure: peak rank scatter plots (Rep1 peak ranks vs. Rep2 peak ranks) for a stable and an unstable ranking measure, marking peaks that pass IDR and peaks that do not pass IDR.]

Peak calling evaluation
Default peak caller thresholds are highly unstable and incomparable.
IDR greatly stabilizes peak calling thresholds.
[Figure: panels for CTCF, POL2, ZNF274.]

How does it work? A graphical method
$(X_{1,1}, X_{1,2}), \ldots, (X_{n,1}, X_{n,2})$: signal from $n$ regions on 2 replicates.
$$\Psi_n(t, v) = \frac{1}{n} \sum_{i=1}^{n} I\left(X_{i,1} > x_{(\lceil (1-t)n \rceil),1},\; X_{i,2} > x_{(\lceil (1-v)n \rceil),2}\right)$$
$\Psi_n(t, v)$: proportion of pairs that rank in both the upper $100t\%$ of $X_1$ and the upper $100v\%$ of $X_2$, i.e. the empirical bivariate survival function.
Since consistency is usually considered a symmetric notion, use $\Psi_n(t, t) \equiv \Psi_n(t)$.
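A minimal sketch of the empirical correspondence curve $\Psi_n(t)$ defined above, in plain Python. The function name `correspondence_curve` and the convention that the order statistic $x_{(0)}$ is $-\infty$ (so that $t = 1$ keeps every pair) are assumptions made for illustration, not part of the slides.

```python
import math

def correspondence_curve(x1, x2, ts):
    """Empirical Psi_n(t): fraction of pairs in the top t fraction of both replicates."""
    n = len(x1)
    s1, s2 = sorted(x1), sorted(x2)
    psi = []
    for t in ts:
        k = math.ceil((1 - t) * n)  # index of the order statistic used as cutoff
        # Assumption: x_(0) is treated as -infinity, so t = 1 keeps all pairs.
        c1 = s1[k - 1] if k >= 1 else -math.inf
        c2 = s2[k - 1] if k >= 1 else -math.inf
        # Count pairs that exceed the cutoff on both replicates.
        count = sum(1 for a, b in zip(x1, x2) if a > c1 and b > c2)
        psi.append(count / n)
    return psi

# Perfectly concordant toy scores: Psi_n(t) equals t.
print(correspondence_curve([1, 2, 3, 4], [1, 2, 3, 4], [0.25, 0.5, 1.0]))
```

With perfectly discordant scores, such as `[1, 2, 3, 4]` against `[4, 3, 2, 1]`, no pair is in the top half of both lists, so the curve is 0 at $t = 0.5$.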

How does it work?
Let $R(X_{i,1})$ and $R(X_{i,2})$ be the ranks of $X_{i,1}$ and $X_{i,2}$.
If $R(X_{i,1})$ and $R(X_{i,2})$ are perfectly correlated for $i = 1, \ldots, n$, then $\Psi(t) = t$ and $\Psi'(t) = 1$.
If $R(X_{i,1})$ and $R(X_{i,2})$ are independent for $i = 1, \ldots, n$, then $\Psi(t) = t^2$ and $\Psi'(t) = 2t$.

How does it work?
If $R(X_{i,1})$ and $R(X_{i,2})$ are perfectly correlated for the top $t_0 n$ observations, $0 < t_0 < 1$, and independent for the remaining $(1 - t_0)n$ observations, then the top $t_0 n$ points fall on a straight line of slope 1 on the curve of $\Psi(\cdot)$ and the remaining $(1 - t_0)n$ points fall onto the parabola
$$\Psi(t) = \frac{t^2 - 2 t t_0 + t_0}{1 - t_0}.$$
[Figure: correspondence curve.]
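The parabola can be recovered by a short counting argument. The following is a large-$n$ heuristic added for illustration, not a derivation taken from the slides: for $t > t_0$, all $t_0 n$ concordant pairs are counted, and each of the remaining $(1 - t_0)n$ pairs is in the top $t$ fraction of a replicate with probability $(t - t_0)/(1 - t_0)$, independently across the two replicates.

```latex
\begin{align*}
n\,\Psi(t) &\approx t_0 n + (1 - t_0)\,n \left(\frac{t - t_0}{1 - t_0}\right)^{2},
  \qquad t_0 < t \le 1, \\
\Psi(t) &= t_0 + \frac{(t - t_0)^2}{1 - t_0}
         = \frac{t_0 - t_0^2 + t^2 - 2 t t_0 + t_0^2}{1 - t_0}
         = \frac{t^2 - 2 t t_0 + t_0}{1 - t_0}.
\end{align*}
```

As a check, the expression gives $\Psi(t_0) = t_0$, so it joins the slope-1 line continuously, and $\Psi(1) = 1$.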

How to identify $t_0$? Inferring the reproducibility of the signals
A bivariate Copula model.
If $(X_1, X_2)$ is a pair of continuous random variables with distribution function $F(\cdot, \cdot)$ and marginals $F_1(\cdot)$ and $F_2(\cdot)$, respectively, then $U_1 = F_1(X_1) \sim U(0, 1)$ and $U_2 = F_2(X_2) \sim U(0, 1)$, and the distribution function of $(U_1, U_2)$ is a copula:
$$C(u_1, u_2) = \Pr(U_1 \le u_1, U_2 \le u_2) = \Pr(X_1 \le F_1^{-1}(u_1), X_2 \le F_2^{-1}(u_2)) = F(F_1^{-1}(u_1), F_2^{-1}(u_2)).$$
Equivalently, $F(x_1, x_2) = C(F_1(x_1), F_2(x_2))$.
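When the marginals are unknown, $U_j = F_j(X_j)$ is commonly approximated from ranks. The sketch below assumes no ties and rescales by $n/(n+1)$ to keep values strictly inside $(0, 1)$. Both the function name `pseudo_observations` and the rescaling are illustrative choices, not something stated on the slides.

```python
import math

def pseudo_observations(x):
    """Rank-based estimate of u_i = F(x_i), assuming no ties in x."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])  # indices from smallest to largest
    u = [0.0] * n
    for rank, i in enumerate(order, start=1):
        u[i] = rank / (n + 1)  # rescaled so that 0 < u < 1
    return u

scores = [3.2, 1.1, 5.0]
print(pseudo_observations(scores))
# Any increasing transformation of the scores leaves the ranks unchanged.
print(pseudo_observations([math.exp(s) for s in scores]))
```

Because only ranks enter, the estimated copula is unaffected by the scale of the scores, which is why the copula and the marginals can be treated separately.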

$(X_1, \ldots, X_K) \sim F_X$ with marginals $F_{X_1}, \ldots, F_{X_K}$ (here $K = 2$).
The marginals are linked to the joint distribution through a so-called Copula function.
Theorem (Sklar '59): There exists at least one Copula $C_X$ such that
$$F_X(x_1, \ldots, x_K) = C_X(F_{X_1}(x_1), \ldots, F_{X_K}(x_K)).$$
The Copula $C_X$ is the joint distribution function of $(F_{X_1}(X_1), \ldots, F_{X_K}(X_K))$.
Modeling of the marginals and the Copula can be done separately.

A Copula mixture model
$K_i \sim \mathrm{Bernoulli}(\pi_1)$ denotes whether the $i$-th peak is from the consistent set ($K_i = 1$) or the spurious set ($K_i = 0$).
Given the indicator $K_i$, the dependence between the replicates is induced by $z_1 = (z_{1,1}, z_{1,2})$ if $K_i = 1$ and $z_0 = (z_{0,1}, z_{0,2})$ if $K_i = 0$:
$$\begin{pmatrix} z_{i,1} \\ z_{i,2} \end{pmatrix} \Bigm| K_i = k \;\sim\; N\left( \begin{pmatrix} \mu_k \\ \mu_k \end{pmatrix}, \begin{pmatrix} \sigma_k^2 & \rho_k \sigma_k^2 \\ \rho_k \sigma_k^2 & \sigma_k^2 \end{pmatrix} \right),$$
where $\mu_0 = 0$, $\mu_1 > 0$, $\sigma_0^2 = 1$, $\rho_0 = 0$, $0 < \rho_1 < 1$.

A Copula mixture model
Let
$$u_{i,1} \equiv G(z_{i,1}) = \pi_1 \Phi\left(\frac{z_{i,1} - \mu_1}{\sigma_1}\right) + \pi_0 \Phi(z_{i,1}),$$
$$u_{i,2} \equiv G(z_{i,2}) = \pi_1 \Phi\left(\frac{z_{i,2} - \mu_1}{\sigma_1}\right) + \pi_0 \Phi(z_{i,2}),$$
where $\Phi$ is the standard Normal cumulative distribution function.
The actual observations are
$$x_{i,1} = F_1^{-1}(u_{i,1}), \qquad x_{i,2} = F_2^{-1}(u_{i,2}),$$
where $F_1$ and $F_2$ are the marginal distributions of the two coordinates, which are assumed to be continuous but unknown.
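The generative steps above (draw $K_i$, draw the latent pair $z_i$, map through $G$, then through $F_j^{-1}$) can be sketched as a small simulator. Since $F_1$ and $F_2$ are unknown in the model, the unit-exponential quantile function used here is purely a hypothetical stand-in, and the names `G` and `simulate_pairs` are illustrative.

```python
import math
import random
from statistics import NormalDist

PHI = NormalDist()  # standard Normal distribution

def G(z, mu1, sigma1, pi1):
    """Marginal CDF of the latent mixture: pi1*Phi((z-mu1)/sigma1) + pi0*Phi(z)."""
    return pi1 * PHI.cdf((z - mu1) / sigma1) + (1.0 - pi1) * PHI.cdf(z)

def simulate_pairs(n, mu1, sigma1, rho1, pi1, seed=0):
    """Return a list of (K_i, x_i1, x_i2) drawn from the copula mixture model."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        k = 1 if rng.random() < pi1 else 0
        # Spurious component is fixed at mu0 = 0, sigma0 = 1, rho0 = 0.
        mu, sigma, rho = (mu1, sigma1, rho1) if k == 1 else (0.0, 1.0, 0.0)
        e1, e2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        z1 = mu + sigma * e1
        z2 = mu + sigma * (rho * e1 + math.sqrt(1.0 - rho * rho) * e2)
        u1, u2 = G(z1, mu1, sigma1, pi1), G(z2, mu1, sigma1, pi1)
        # Hypothetical marginals: unit exponential, F^{-1}(u) = -log(1 - u).
        x1, x2 = -math.log1p(-u1), -math.log1p(-u2)
        pairs.append((k, x1, x2))
    return pairs

sim = simulate_pairs(2000, mu1=2.5, sigma1=1.0, rho1=0.8, pi1=0.6, seed=1)
print(sum(k for k, _, _ in sim) / len(sim))  # fraction from the consistent set
```

Swapping in a different increasing quantile function changes the observed scores but not their ranks, so the dependence structure the model describes stays the same.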

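A Copula mixture model
Then, for signal $i$:
$$\Pr(X_{i,1} \le x_1, X_{i,2} \le x_2) = \pi_0 h_0\left(G^{-1}(F_1(x_1)), G^{-1}(F_2(x_2))\right) + \pi_1 h_1\left(G^{-1}(F_1(x_1)), G^{-1}(F_2(x_2))\right),$$
where
$$h_0 \sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \right), \qquad h_1 \sim N\left( \begin{pmatrix} \mu_1 \\ \mu_1 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & \rho_1 \sigma_1^2 \\ \rho_1 \sigma_1^2 & \sigma_1^2 \end{pmatrix} \right).$$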
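Given parameter values, the two components can be compared directly on the latent scale. The sketch below computes the posterior probability that a latent pair $(z_1, z_2)$ comes from the spurious component, using the densities of the two bivariate Normals. It assumes the parameters are already known (in practice they are estimated), and the names `bvn_pdf` and `prob_spurious` are illustrative.

```python
import math

def bvn_pdf(z1, z2, mu, sigma, rho):
    """Density of a bivariate Normal with common mean mu, common sd sigma, correlation rho."""
    a, b = (z1 - mu) / sigma, (z2 - mu) / sigma
    q = (a * a - 2.0 * rho * a * b + b * b) / (1.0 - rho * rho)
    return math.exp(-q / 2.0) / (2.0 * math.pi * sigma ** 2 * math.sqrt(1.0 - rho * rho))

def prob_spurious(z1, z2, mu1, sigma1, rho1, pi0):
    """Posterior Pr(K = 0 | z1, z2) under the two-component mixture."""
    f0 = pi0 * bvn_pdf(z1, z2, 0.0, 1.0, 0.0)               # spurious component
    f1 = (1.0 - pi0) * bvn_pdf(z1, z2, mu1, sigma1, rho1)   # consistent component
    return f0 / (f0 + f1)

# A pair that is high on both replicates is unlikely to be spurious;
# a pair near the origin is likely to be spurious.
print(prob_spurious(3.0, 3.0, mu1=2.5, sigma1=1.0, rho1=0.8, pi0=0.5))
print(prob_spurious(0.0, 0.0, mu1=2.5, sigma1=1.0, rho1=0.8, pi0=0.5))
```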

A Copula mixture model
EM algorithm to estimate $\theta = (\mu_1, \rho_1, \sigma_1, \pi_0)$.
Inference is based on $\Pr(K_i = 1 \mid (x_{i,1}, x_{i,2}); \hat\theta)$.
Local irreproducible discovery rate:
$$\mathrm{idr}(x_{i,1}, x_{i,2}) = \Pr(K_i = 0 \mid (x_{i,1}, x_{i,2}); \hat\theta).$$
Then, to control IDR at level $\alpha$:
Rank $(x_{i,1}, x_{i,2})$ by their idr values.
Select all $(x_{(i),1}, x_{(i),2})$, $i = 1, \ldots, l$, where
$$l = \arg\max_i \left\{ \frac{1}{i} \sum_{j=1}^{i} \mathrm{idr}_j \le \alpha \right\}$$
and $\mathrm{idr}_j$ is the $j$-th smallest idr value.
Basically, IDR is False Discovery Rate (FDR) control in this specific Copula mixture model.
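The selection rule can be sketched in a few lines: sort the local idr values, track the running mean, and keep the longest prefix whose mean does not exceed $\alpha$. The function name `select_at_idr` and the toy idr values are made up for illustration.

```python
def select_at_idr(local_idr, alpha):
    """Indices of the pairs selected when controlling IDR at level alpha."""
    order = sorted(range(len(local_idr)), key=lambda i: local_idr[i])
    running_sum, l = 0.0, 0
    for rank, i in enumerate(order, start=1):
        running_sum += local_idr[i]
        # Mean of the 'rank' smallest idr values; keep the largest rank that passes.
        if running_sum / rank <= alpha:
            l = rank
    return order[:l]

toy_idr = [0.01, 0.9, 0.03, 0.2, 0.05]
print(select_at_idr(toy_idr, alpha=0.05))
```

On the toy values, the running means of the sorted idr values are 0.01, 0.02, 0.03, 0.0725 and 0.238, so the first three pairs in idr order (original indices 0, 2 and 4) are selected at $\alpha = 0.05$.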

Caveats
If a peak does not appear on both lists, it is discarded.
Requires starting with a large set of peaks (consisting of real and spurious peaks). How large should this set be?
Continuity assumption: there are typically many ties in the scores of peaks.
R package: http://cran.fhcrc.org/web/packages/idr/index.html. Use the latest version from https://github.com/nboley/idr.
How well does the semi-parametric Copula model fit the data? Model diagnostics.
The main strength is that the scores for peaks can be anything (p-values, log-likelihood, ChIP to input enrichment, etc.).