Improving the Performance of the FDR Procedure Using an Estimator for the Number of True Null Hypotheses

Size: px

Start display at page:

Download "Improving the Performance of the FDR Procedure Using an Estimator for the Number of True Null Hypotheses"

Joy Mosley
5 years ago
Views:

1 Improving the Performance of the FDR Procedure Using an Estimator for the Number of True Null Hypotheses Amit Zeisel, Or Zuk, Eytan Domany W.I.S. June 5, 29 Amit Zeisel, Or Zuk, Eytan Domany (W.I.S.)Improving the Performance of the FDR Procedure Using an Estimator June 5, for 29 the Number / 7 of T

2 Introduction/Motivation The vast use of high throughput technologies involves testing thousands of hypotheses simultaneously. The field of multiple testing deals with developing methods to determine the level of significance in such a scenario. Amit Zeisel, Or Zuk, Eytan Domany (W.I.S.)Improving the Performance of the FDR Procedure Using an Estimator June 5, for 29 the Number 2 / 7 of T

3 Multiple testing m null hypotheses H,i, i =,2,...m. Example: for m variables H,i : µ A i = µ B i. Calculate p-values p i, and set a threshold for significance. The set of random variables (U,V,T,S), and parameters (m,m,m ) describes this scenario: ground truth non-rejected rejected total hypotheses hypotheses null hypothesis is true U V m null hypothesis is false T S m total m R R m Fraction of false discoveries = V R Amit Zeisel, Or Zuk, Eytan Domany (W.I.S.)Improving the Performance of the FDR Procedure Using an Estimator June 5, for 29 the Number 3 / 7 of T

4 Control of the FDR=E( V R + ) In 995 Benjamini and Hochberg proposed a procedure (BH95) to control the FDR for a given set of p-values: Sort and re-label the p-values, p () p (2)... p (m). 2 Choose q the desired FDR level. 3 Define the set of constants α i = i mq, i =,2,,,,m. 4 Identify R = max{i : p (i) α i }. 5 If R reject all hypotheses (i) =,2,,,R, else no hypothesis is rejected. BH proved that FDR m q m q. The bound is not tight, there is room for improvement. Amit Zeisel, Or Zuk, Eytan Domany (W.I.S.)Improving the Performance of the FDR Procedure Using an Estimator June 5, for 29 the Number 4 / 7 of T

5 Control vs. estimation, and improved procedures Control (q R) - significance is preset at a desired level q. The procedure yields a set of rejected hypotheses with FDR q. Estimation (R q) - the threshold is preset at a level that yields a desired number of rejections R. The corresponding FDR is estimated. There were many attempts to produce tighter bounds on the FDR, using an estimator for m. Difficulty: the estimator ˆm is a fluctuating random variable. Amit Zeisel, Or Zuk, Eytan Domany (W.I.S.)Improving the Performance of the FDR Procedure Using an Estimator June 5, for 29 the Number 5 / 7 of T

6 Aim Produce an improved BH procedure using an estimator ˆm for m, the number of true null hypotheses. Amit Zeisel, Or Zuk, Eytan Domany (W.I.S.)Improving the Performance of the FDR Procedure Using an Estimator June 5, for 29 the Number 6 / 7 of T

7 Definitions: monotonic estimator An estimator for m is a family of functions ˆm ˆm (m) : [,] m R, ˆm ˆm (p,..,p m ). ˆm is a monotonic estimator if it satisfies: ˆm (m) (p,..,p i,..,p m ) ˆm (m) (p,..,p i,..,p m), if p i p i, i =,2,,,m, m 2 ˆm (m) (p,..,p i,..,p m ) ˆm (m ) (p,..,p i,p i+,..,p m ), i =,2,,,m, m 2 Amit Zeisel, Or Zuk, Eytan Domany (W.I.S.)Improving the Performance of the FDR Procedure Using an Estimator June 5, for 29 the Number 7 / 7 of T

8 Definitions: modified BH procedure (m ˆm ) Given m hypotheses of which m are null, let p,..,p m be the respective p-values. The modified BH procedure with estimator ˆm is: Compute ˆm ˆm (p,..,p m ). 2 Sort and relabel the p-values p ()... p (m). 3 Define the set of constants q k = qk ˆm k =,2...,m. 4 Let R = max{k : p (k) q k }. 5 If R reject p (),..,p (R) else don t reject any hypothesis. Amit Zeisel, Or Zuk, Eytan Domany (W.I.S.)Improving the Performance of the FDR Procedure Using an Estimator June 5, for 29 the Number 8 / 7 of T

9 Theorem for control (see Benjamini et al 26 (BKY)) Let ˆm ˆm (p,..,p m ) be a monotonic estimator for m. Let ˆm ( ) (p,..,p m ) ˆm (p 2,..,p m ) be the same estimator, but disregarding the first p-value p. Assume that the null p-values are i.i.d. U[,]. Then the modified BH procedure satisfies: FDR = E [ ] V R + m qe [ ] Note: if E ˆm ( ) m, then FDR q. [ ˆm ( ) ] () Amit Zeisel, Or Zuk, Eytan Domany (W.I.S.)Improving the Performance of the FDR Procedure Using an Estimator June 5, for 29 the Number 9 / 7 of T

10 Both procedures satisfy E(V /R + ) q Amit Zeisel, Or Zuk, Eytan Domany (W.I.S.)Improving the Performance of the FDR Procedure Using an Estimator June 5, for 29 the Number / 7 of T Two improved procedures The two procedures are based on the estimators: [ IBHsum: ˆm = C(m) min m,max ( s(m),2 m j= p j) ]. C(m),s(m) are universal correction factors..3 (a) C (b).3 s/m m IBHlog: m = 2 m i= log( p i).

11 .5.3 Performance: simulations (IBHsum) ρ is the correlation between test statistics: µ q=.5, ρ= µ q=.5, ρ= µ m /m q=.2, ρ= m /m µ m /m q=.2, ρ= m /m Amit Zeisel, Or Zuk, Eytan Domany (W.I.S.)Improving the Performance of the FDR Procedure Using an Estimator June 5, for 29 the Number / 7 of T

12 Compare to other methods Results from simulations, m = 5,µ = 3.5: ρ= E(V/R) ρ= Oracle BH (995) BKY (Benjamini et al 26) STS (Storey et al 24) IBHsum IBHlog E(S)/m m /m m /m Amit Zeisel, Or Zuk, Eytan Domany (W.I.S.)Improving the Performance of the FDR Procedure Using an Estimator June 5, for 29 the Number 2 / 7 of T

13 Applying to 33 gene expression datasets Large number of discoveries Small number of discoveries.8.8 p Two tailed p One tailed i/m i/m Amit Zeisel, Or Zuk, Eytan Domany (W.I.S.)Improving the Performance of the FDR Procedure Using an Estimator June 5, for 29 the Number 3 / 7 of T

14 q BKY STS IBHsum IBHlog a. Two tailed, large number of discoveries ( studies) (.43) (.38) (.) (.3) b. Two tailed, small number of discoveries ( studies) (.3) (.97) (.4) (.79) c. One tailed, large number of discoveries (8 studies) (.9) (.33) (.26) (.36) d. One tailed, small number of discoveries (5 studies) (.2) (.52) (.7) (.23) Amit Zeisel, Or Zuk, Eytan Domany (W.I.S.)Improving the Performance of the FDR Procedure Using an Estimator June 5, for 29 the Number 4 / 7 of T

15 Summary and conclusions We proved a theorem that provides a bound on the FDR for any improved procedure based on an monotonic estimator of m. We proposed two improved procedures based on the estimators 2 m j= p j and m i= log( p i). For the case of independent statistics all improved procedures provide similar results: saturation of the bound and more power than BH95. We showed by simulations that even in the case of dependent statistics our procedures provide a reliable bound and improved power. For real gene expression data, where dependencies are expected, our methods improve, in general, over existing ones. Amit Zeisel, Or Zuk, Eytan Domany (W.I.S.)Improving the Performance of the FDR Procedure Using an Estimator June 5, for 29 the Number 5 / 7 of T

16 Theorem for monotonicity Let p = (p..m ) be a set of independent p-values. Assume that f, the marginal probability density function of the alternatives, is monotonically non-increasing and differentiable. Let B (i) be two threshold FDR procedures rejecting R (i) ( p) hypotheses and each having FDR (i), i =,2. Assume that for any q, R () ( p) R (2) ( p), p. Then it also holds that FDR () FDR (2). Amit Zeisel, Or Zuk, Eytan Domany (W.I.S.)Improving the Performance of the FDR Procedure Using an Estimator June 5, for 29 the Number 6 / 7 of T

17 THANKS Amit Zeisel, Or Zuk, Eytan Domany (W.I.S.)Improving the Performance of the FDR Procedure Using an Estimator June 5, for 29 the Number 7 / 7 of T

Looking at the Other Side of Bonferroni

Looking at the Other Side of Bonferroni Department of Biostatistics University of Washington 24 May 2012 Multiple Testing: Control the Type I Error Rate When analyzing genetic data, one will commonly perform over 1 million (and growing) hypothesis