IDR: Irreproducible discovery rate Sündüz Keleş Department of Statistics Department of Biostatistics and Medical Informatics University of Wisconsin, Madison April 18, 2017 Stat 877 (Spring 17) 04/11-04/18 1 / 19
IDR
Li, Brown, Huang, Bickel (2011). Measuring reproducibility of high-throughput experiments, Annals of Applied Statistics, Vol 5, No 3, 1752-1779.
IDR
Developed within the ENCODE project.
Most ChIP-seq experiments in the ENCODE project have two replicates.
Peaks are called on each replicate separately.
Peaks from the two replicates are compared to identify reproducible peaks.
Different peak callers can yield very different numbers of peaks for the same dataset.
Next 5 slides are courtesy of Anshul Kundaje.
IDR: Expected rate of irreproducible discoveries. The goal is to limit the expected proportion of peaks that are not reproducible across replicates.
Peak calling evaluation
Evaluated multiple peak callers (* = sequence aware):
  SPP (Anshul Kundaje)
  GEM* (David Gifford Lab)
  PeakSeq (Mark Gerstein Lab)
  MACSv2 (Tao Liu, Shirley Liu)
  MOSAICS*/dPeak (Sunduz Keles lab)
  Hotspot (Bob Thurman)
  BELT (Victor Jin, Peggy Farnham)
  Scripture (Noam Soresh)
Irreproducible Discovery Rate (IDR) model
Evaluation datasets:
  CTCF (narrow peaks, high SNR, high specificity motif, > 50K sites)
  POL2 (narrow and diffused peaks, high SNR, bind at/near TSSs, > 10K sites)
  GABP (narrow peaks, high SNR, homotypic closely spaced binding events, high specificity motif, > 5K sites)
  P300 (narrow peaks, high SNR, no sequence motif, enhancer binding)
  ZNF274 (diffused peaks, low SNR, at ZNF genes, < 1K peaks)
Instability of default peak calling thresholds
[Figure. Axis labels: number of datasets, # peaks called by MACS, # peaks called by SPP. Panels: default thresholds; IDR threshold of 2%.]
Peak calling evaluation
Evaluation criteria:
  Number of reproducible peak calls (IDR)
  Stability of ranking measures (peak rank scatter plots)
  Precision of binding events (distance from motifs)
  Resolution of closely spaced binding events
  Ability to detect sites for high and low SNR datasets
  Ease of use (speed, memory)
  Backward compatibility
[Figure: peak rank scatter plots, Rep1 peak ranks vs. Rep2 peak ranks. Panels: stable ranking measure; unstable ranking measure. Legend: peaks that pass IDR; peaks that do not pass IDR.]
Peak calling evaluation
Default peak caller thresholds are highly unstable and incomparable.
IDR greatly stabilizes peak calling thresholds.
[Figure panels: CTCF, POL2, ZNF274.]
How does it work? A graphical method
$(X_{1,1}, X_{1,2}), \ldots, (X_{n,1}, X_{n,2})$: signal from $n$ regions on 2 replicates.
$$\Psi_n(t, v) = \frac{1}{n} \sum_{i=1}^{n} I\left(X_{i,1} > x_{(\lceil (1-t)n \rceil),1},\; X_{i,2} > x_{(\lceil (1-v)n \rceil),2}\right),$$
where $x_{(k),j}$ is the $k$-th smallest signal on replicate $j$.
$\Psi_n(t, v)$: proportion of pairs that are ranked both in the upper $100t\%$ of $X_1$ and in the upper $100v\%$ of $X_2$, i.e. the empirical bivariate survival function.
Since consistency is usually considered a symmetric notion, use $\Psi_n(t, t) \equiv \Psi_n(t)$.
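The empirical curve $\Psi_n(t)$ can be computed directly from the two score vectors. The following is a minimal sketch in plain Python, not code from the lecture: the function name `correspondence_curve` and the toy inputs are illustrative assumptions, and $t = 1$ is handled by treating the undefined order statistic $x_{(0)}$ as $-\infty$.

```python
import math

def correspondence_curve(x1, x2, t_grid):
    """Empirical Psi_n(t): fraction of pairs in the top t fraction of both replicates."""
    n = len(x1)
    s1, s2 = sorted(x1), sorted(x2)
    psi = []
    for t in t_grid:
        # Index of the order statistic x_(ceil((1 - t) n)); round() guards
        # against floating-point noise such as 7.000000000000001.
        k = math.ceil(round((1.0 - t) * n, 9))
        if k <= 0:
            # t = 1: every pair counts (assumed convention for x_(0)).
            c1 = c2 = -math.inf
        else:
            c1, c2 = s1[k - 1], s2[k - 1]
        hits = sum(1 for a, b in zip(x1, x2) if a > c1 and b > c2)
        psi.append(hits / n)
    return psi

# Toy data: ranks agree perfectly in the first call, are reversed in the second.
concordant = correspondence_curve(list(range(1, 9)),
                                  [2 * v for v in range(1, 9)],
                                  [0.25, 0.5, 1.0])
discordant = correspondence_curve(list(range(1, 9)),
                                  list(range(8, 0, -1)),
                                  [0.5])
```

With perfectly concordant ranks the curve follows the diagonal (`concordant` is `[0.25, 0.5, 1.0]`), while reversed ranks give `discordant == [0.0]` at $t = 0.5$.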
How does it work?
Let $R(X_{i,1})$ and $R(X_{i,2})$ be the ranks of $X_{i,1}$ and $X_{i,2}$.
If $R(X_{i,1})$ and $R(X_{i,2})$ are perfectly correlated for $i = 1, \ldots, n$, then $\Psi(t) = t$ and $\Psi'(t) = 1$.
If $R(X_{i,1})$ and $R(X_{i,2})$ are independent for $i = 1, \ldots, n$, then $\Psi(t) = t^2$ and $\Psi'(t) = 2t$.
How does it work?
If $R(X_{i,1})$ and $R(X_{i,2})$ are perfectly correlated for the top $t_0 n$ observations, $0 < t_0 < 1$, and independent for the remaining $(1 - t_0)n$ observations, then the top $t_0 n$ points fall on a straight line of slope 1 on the curve of $\Psi(\cdot)$ and the remaining $(1 - t_0)n$ points fall on the parabola
$$\Psi(t) = \frac{t^2 - 2 t t_0 + t_0}{1 - t_0}.$$
Correspondence curve: [figure]
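The parabola can be recovered in two lines. This is a sketch of the reasoning under the stated idealization (top $t_0$ fraction perfectly concordant, the rest independent), written for $t > t_0$; it is a reconstruction, not a derivation taken from the slides.

```latex
% For t > t_0 the top t_0 fraction is always counted. Of the remaining
% (1 - t_0) fraction, a pair is counted only if it lies in the upper
% (t - t_0)/(1 - t_0) share on both replicates, independently.
\begin{align*}
\Psi(t) &= t_0 + (1 - t_0)\left(\frac{t - t_0}{1 - t_0}\right)^{2}
         = t_0 + \frac{(t - t_0)^2}{1 - t_0}
         = \frac{t^2 - 2 t t_0 + t_0}{1 - t_0}, \\
\Psi'(t) &= \frac{2\,(t - t_0)}{1 - t_0}, \qquad t_0 < t \le 1 .
\end{align*}
% Checks: \Psi(t_0) = t_0 (continuous with the slope-1 segment) and \Psi(1) = 1.
```

The slope drops from 1 to 0 at $t_0$ and then rises linearly to 2 at $t = 1$, which is why the transition point is visible on the derivative of the curve.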
How to identify $t_0$? Inferring the reproducibility of the signals
A bivariate Copula model.
If $(X_1, X_2)$ is a pair of continuous random variables with distribution function $F(\cdot, \cdot)$ and marginals $F_1(\cdot)$ and $F_2(\cdot)$, respectively, then $U_1 = F_1(X_1) \sim U(0, 1)$ and $U_2 = F_2(X_2) \sim U(0, 1)$, and the distribution function of $(U_1, U_2)$ is a copula:
$$C(u_1, u_2) = \Pr(U_1 \le u_1, U_2 \le u_2) = \Pr(X_1 \le F_1^{-1}(u_1), X_2 \le F_2^{-1}(u_2)) = F(F_1^{-1}(u_1), F_2^{-1}(u_2)).$$
Equivalently, $F(x_1, x_2) = C(F_1(x_1), F_2(x_2))$.
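Because the marginals are unknown in practice, $U_j = F_j(X_j)$ is usually approximated from ranks. The sketch below uses the rank divided by $n + 1$ as a stand-in for $F_j$; this rescaling is a common convention assumed here, not something stated on the slide, and the helper name `pseudo_observations` is hypothetical.

```python
import math

def pseudo_observations(x):
    """Map scores into (0, 1) via rank / (n + 1), an empirical stand-in for F(x)."""
    n = len(x)
    # Ties are broken by input order here, which is arbitrary.
    order = sorted(range(n), key=lambda i: x[i])
    u = [0.0] * n
    for rank, i in enumerate(order, start=1):
        u[i] = rank / (n + 1)
    return u

scores = [3.2, 150.0, 0.7, 12.5]           # made-up peak scores
u = pseudo_observations(scores)
# A monotone transform of the scores leaves the pseudo-observations unchanged.
u_log = pseudo_observations([math.log(s) for s in scores])
```

Here `u` equals `[0.4, 0.8, 0.2, 0.6]` and is identical to `u_log`, which illustrates why a copula-based model depends on the scores only through their ranks.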
$(X_1, \ldots, X_K) \sim F_X$ with marginals $F_{X_1}, \ldots, F_{X_K}$ (here $K = 2$).
The marginals are linked to the joint distribution through a so-called Copula function.
Theorem (Sklar, 1959): There exists at least one Copula $C_X$ such that
$$F_X(x_1, \ldots, x_K) = C_X(F_{X_1}(x_1), F_{X_2}(x_2), \ldots, F_{X_K}(x_K)).$$
The Copula $C_X$ is the joint distribution function of the transformed variables $(F_{X_1}(X_1), \ldots, F_{X_K}(X_K))$.
Modeling of the marginals and the Copula can be done separately.
A Copula mixture model
$K_i \sim \text{Bernoulli}(\pi_1)$ denotes whether the $i$-th peak is from the consistent set ($K_i = 1$) or the spurious set ($K_i = 0$).
Given the indicator $K_i$, the dependence between the replicates is induced by $z_1 = (z_{1,1}, z_{1,2})$ if $K_i = 1$ and $z_0 = (z_{0,1}, z_{0,2})$ if $K_i = 0$:
$$\begin{pmatrix} z_{i,1} \\ z_{i,2} \end{pmatrix} \Bigm| K_i = k \;\sim\; N\left( \begin{pmatrix} \mu_k \\ \mu_k \end{pmatrix}, \begin{pmatrix} \sigma_k^2 & \rho_k \sigma_k^2 \\ \rho_k \sigma_k^2 & \sigma_k^2 \end{pmatrix} \right),$$
where $\mu_0 = 0$, $\mu_1 > 0$, $\sigma_0^2 = 1$, $\rho_0 = 0$, $0 < \rho_1 < 1$.
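To get a feel for the latent model, one can simulate from it. The following is a small sketch using only the standard library; the function name `simulate_mixture` and the parameter values are illustrative assumptions, not estimates from any dataset.

```python
import math
import random

def simulate_mixture(n, pi1, mu1, sigma1, rho1, seed=0):
    """Draw (K_i, z_i1, z_i2) from the two-component bivariate normal mixture."""
    rng = random.Random(seed)
    labels, z = [], []
    for _ in range(n):
        k = 1 if rng.random() < pi1 else 0
        # Spurious component is fixed at mu = 0, sigma = 1, rho = 0.
        mu, sigma, rho = (mu1, sigma1, rho1) if k == 1 else (0.0, 1.0, 0.0)
        e1, e2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        # Standard construction of a correlated pair: both coordinates have
        # variance sigma^2 and covariance rho * sigma^2.
        z1 = mu + sigma * e1
        z2 = mu + sigma * (rho * e1 + math.sqrt(1.0 - rho * rho) * e2)
        labels.append(k)
        z.append((z1, z2))
    return labels, z

# Hypothetical parameter values chosen only for illustration.
labels, z = simulate_mixture(n=2000, pi1=0.6, mu1=2.5, sigma1=1.0, rho1=0.8, seed=1)
```

Plotting `z` colored by `labels` would show a tight, positively correlated cloud away from the origin for the consistent set and a round cloud at the origin for the spurious set.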
A Copula mixture model
Let
$$u_{i,1} \equiv G(z_{i,1}) = \pi_1 \Phi\left(\frac{z_{i,1} - \mu_1}{\sigma_1}\right) + \pi_0 \Phi(z_{i,1}),$$
$$u_{i,2} \equiv G(z_{i,2}) = \pi_1 \Phi\left(\frac{z_{i,2} - \mu_1}{\sigma_1}\right) + \pi_0 \Phi(z_{i,2}),$$
where $\Phi$ is the standard Normal cumulative distribution function.
The actual observations are
$$x_{i,1} = F_1^{-1}(u_{i,1}), \qquad x_{i,2} = F_2^{-1}(u_{i,2}),$$
where $F_1$ and $F_2$ are the marginal distributions of the two coordinates, which are assumed to be continuous but unknown.
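The map $G$ and its inverse are what connect the latent $z$ scale to the uniform scale. Below is a minimal sketch, assuming $\pi_0 = 1 - \pi_1$ and inverting $G$ by bisection; the function names, the bracket $[-50, 50]$, and the parameter values are illustrative assumptions.

```python
import math

def phi_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def G(z, pi1, mu1, sigma1):
    """Marginal CDF of the latent mixture: pi1 * N(mu1, sigma1^2) + pi0 * N(0, 1)."""
    return pi1 * phi_cdf((z - mu1) / sigma1) + (1.0 - pi1) * phi_cdf(z)

def G_inverse(u, pi1, mu1, sigma1, lo=-50.0, hi=50.0, tol=1e-10):
    """Invert G by bisection; valid because G is strictly increasing.

    The bracket [lo, hi] is an assumption that suits moderate mu1 and sigma1.
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if G(mid, pi1, mu1, sigma1) < u:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Round trip with hypothetical parameters: z -> u -> z.
z0 = 1.3
u0 = G(z0, pi1=0.6, mu1=2.5, sigma1=1.0)
z_back = G_inverse(u0, pi1=0.6, mu1=2.5, sigma1=1.0)
```

In a fitting procedure, $u$ would come from rank-based estimates of $F_1$ and $F_2$, and `G_inverse` would turn those into pseudo-data on the $z$ scale for the current parameter values.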
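A Copula mixture model
Then, for signal $i$:
$$\Pr(X_{i,1} \le x_1, X_{i,2} \le x_2) = \pi_0\, h_0\big(G^{-1}(F_1(x_1)), G^{-1}(F_2(x_2))\big) + \pi_1\, h_1\big(G^{-1}(F_1(x_1)), G^{-1}(F_2(x_2))\big),$$
where
$$h_0 \sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \right), \qquad h_1 \sim N\left( \begin{pmatrix} \mu_1 \\ \mu_1 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & \rho_1 \sigma_1^2 \\ \rho_1 \sigma_1^2 & \sigma_1^2 \end{pmatrix} \right).$$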
A Copula mixture model
EM algorithm to estimate $\theta = (\mu_1, \rho_1, \sigma_1, \pi_0)$.
Inference is based on $\Pr(K_i = 1 \mid (x_{i,1}, x_{i,2}); \hat\theta)$.
Local irreproducible discovery rate:
$$\text{idr}(x_{i,1}, x_{i,2}) = \Pr(K_i = 0 \mid (x_{i,1}, x_{i,2}); \hat\theta).$$
Then, to control IDR at level $\alpha$:
Rank $(x_{i,1}, x_{i,2})$ by their idr values, smallest first.
Select all $(x_{(i),1}, x_{(i),2})$, $i = 1, \ldots, l$, where
$$l = \operatorname*{argmax}_i \left\{ \frac{1}{i} \sum_{j=1}^{i} \text{idr}_j \le \alpha \right\}$$
and $\text{idr}_j$ is the $j$-th smallest idr value.
Basically, IDR is False Discovery Rate (FDR) control in this specific Copula mixture model.
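The posterior and the selection rule can be sketched as follows. This is an illustration under stated assumptions, not the `idr` package's implementation: the EM fit is not shown, the parameter values passed in are hypothetical, the local idr is evaluated on the latent $z$ scale, and all function names are made up for the example.

```python
import math

def bvn_density(z1, z2, mu, sigma, rho):
    """Bivariate normal density with common mean mu, common sd sigma, correlation rho."""
    a, b = z1 - mu, z2 - mu
    s2 = sigma * sigma
    one_minus_r2 = 1.0 - rho * rho
    quad = (a * a - 2.0 * rho * a * b + b * b) / (2.0 * s2 * one_minus_r2)
    return math.exp(-quad) / (2.0 * math.pi * s2 * math.sqrt(one_minus_r2))

def local_idr(z1, z2, pi0, mu1, sigma1, rho1):
    """Posterior probability of the spurious component, Pr(K = 0 | z).

    No guard against both densities underflowing to zero for extreme z.
    """
    f0 = bvn_density(z1, z2, 0.0, 1.0, 0.0)
    f1 = bvn_density(z1, z2, mu1, sigma1, rho1)
    return pi0 * f0 / (pi0 * f0 + (1.0 - pi0) * f1)

def select_at_idr(idr_values, alpha):
    """Indices of the largest top-ranked set whose mean local idr is <= alpha."""
    order = sorted(range(len(idr_values)), key=lambda i: idr_values[i])
    running_sum, l = 0.0, 0
    for rank, i in enumerate(order, start=1):
        running_sum += idr_values[i]
        if running_sum / rank <= alpha:
            l = rank
    return order[:l]

# Made-up local idr values for five peaks.
selected = select_at_idr([0.001, 0.5, 0.01, 0.9, 0.02], alpha=0.05)
```

For these five values the running means are 0.001, 0.0055, about 0.0103, and then about 0.133, so `selected` is `[0, 2, 4]`. The running mean of sorted local idr values plays the same role as the estimated FDR of the selected set in a local-fdr analysis.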
Caveats
If a peak does not appear on both lists, it is discarded.
Requires starting with a large set of peaks (consisting of real and spurious peaks). How large should this set be?
Continuity assumption: there are typically many ties in the scores of peaks.
R package: http://cran.fhcrc.org/web/packages/idr/index.html. Use the latest version from https://github.com/nboley/idr.
How well does the semi-parametric Copula model fit the data? Model diagnostics.
The main strength is that the scores for peaks can be anything (p-values, log-likelihood, ChIP-to-input enrichment, etc.).