False Discovery Rate


1 False Discovery Rate

Peng Zhao
Department of Statistics, Florida State University
December 3, 2018

2 Outline

1. Multiple Comparison and FWER
2. False Discovery Rate
3. FDR Control through Benjamini-Hochberg
4. FDR Control with Knockoffs
5. FDR Control through SLOPE

3 Multiple Comparison and FWER: The Problem

Recall the definition of a p-value: the probability of observing the given results, or more extreme ones, under the null hypothesis. For high-dimensional data, the number $p$ of hypotheses that must be tested simultaneously is large. One can further show that if we use a chi-square statistic to test the overall effect, the power is almost zero when the true alternative is sparse (as, for example, in screening problems). We need a more appropriate criterion for multiple comparison; the goal is to make fewer mistakes while still constructing a powerful test.

4 Multiple Comparison and FWER: FWER Control

A straightforward approach is to control the Family-Wise Error Rate (FWER): the probability of rejecting any true null hypothesis.

Bonferroni correction: test each individual hypothesis at level $\alpha/m$.

Holm's procedure: let the ordered p-values be $p_{(1)} \le p_{(2)} \le \dots \le p_{(m)}$. Reject $H_{(i)}$ if
$$p_{(j)} \le \frac{\alpha}{m - j + 1} \quad \text{for } j = 1, 2, \dots, i.$$
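As a small illustration (a sketch, not part of the original slides; the function names are mine and NumPy is assumed), the two procedures can be written in a few lines:

```python
import numpy as np

def bonferroni(pvals, alpha=0.05):
    """Reject H_i whenever p_i <= alpha / m."""
    pvals = np.asarray(pvals)
    return pvals <= alpha / len(pvals)

def holm(pvals, alpha=0.05):
    """Holm's step-down procedure: with ordered p-values
    p_(1) <= ... <= p_(m), reject H_(i) if p_(j) <= alpha/(m - j + 1)
    for all j = 1, ..., i."""
    pvals = np.asarray(pvals)
    m = len(pvals)
    order = np.argsort(pvals)
    reject = np.zeros(m, dtype=bool)
    for i, idx in enumerate(order):        # i corresponds to j - 1
        if pvals[idx] <= alpha / (m - i):  # threshold alpha / (m - j + 1)
            reject[idx] = True
        else:
            break                          # step down: stop at first failure
    return reject
```

Holm rejects everything Bonferroni rejects, and possibly more, while still controlling the FWER.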

5 Multiple Comparison and FWER: Problems with FWER Control

Sometimes control of the FWER is not really needed. For example, when comparing a treatment group and a control group by testing various aspects of the effect, even if some of the null hypotheses are falsely rejected, the overall conclusion that the treatment is better need not be wrong. Another example is the screening problem (e.g., candidates for drug development), where we want to obtain as many discoveries as possible, but too large a fraction of false discoveries would burden the second-phase analysis. Thus it is hard to choose a suitable significance level.

6 False Discovery Rate: False Discovery Rates

Consider testing $m$ null hypotheses simultaneously, of which $m_0$ are true. The results can be summarized as:

Truth \ Decision    Not rejected    Rejected    Total
True null           U               V           m_0
False null          T               S           m - m_0
Total               m - R           R           m

Notice that FWER = $P(V \ge 1)$.

7 False Discovery Rate: False Discovery Rates

Let $R$ be the number of hypotheses rejected. Define the false discovery proportion (FDP) as $V/\max(R, 1)$, and then define the False Discovery Rate (FDR) as
$$\mathrm{FDR} = E(\mathrm{FDP}) = E\left[\frac{V}{\max(V + S, 1)}\right] = E\left[\frac{V}{\max(R, 1)}\right].$$
Notice that although we observe $R$, we do not observe $V$, so the FDP is an unobserved random variable; what we control is its expectation. The meaning of controlling the FDR: if we repeat the experiment many times, then on average we control the FDP, but the FDP in any single run may be very bad. This is different from the FWER, which is controlled for each single experiment.

8 False Discovery Rate: FDR and FWER

If all null hypotheses are true, then the FDR is equivalent to the FWER: all rejections are false rejections, so if $V = 0$ then $\mathrm{FDP} = 0$, and if $V \ge 1$ then $\mathrm{FDP} = 1$; thus $\mathrm{FDR} = E(\mathrm{FDP}) = P(V \ge 1) = \mathrm{FWER}$.

In general, $\mathrm{FDR} \le \mathrm{FWER}$: if $V = 0$ then $\mathrm{FDP} = 0 \le \mathbb{1}\{V \ge 1\}$; if $V \ge 1$ then $\mathrm{FDP} \le 1 = \mathbb{1}\{V \ge 1\}$. Taking expectations gives $\mathrm{FDR} \le P(V \ge 1) = \mathrm{FWER}$, so controlling the FWER also implies controlling the FDR. Consequently, procedures that control only the FDR can be more powerful.

9 FDR Control through Benjamini-Hochberg: Benjamini-Hochberg

Let the ordered p-values be $p_{(1)} \le p_{(2)} \le \dots \le p_{(m)}$. The Benjamini-Hochberg (BH) procedure rejects $H_{(i)}$ if
$$p_{(j)} \le \frac{jq}{m} \quad \text{for some } j \ge i;$$
equivalently, let $k = \max\{i : p_{(i)} \le iq/m\}$ and reject $H_{(1)}, \dots, H_{(k)}$. This means that if we know the number of rejections $R$, then the p-value threshold is $\max(R, 1)\,q/m$.
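A direct transcription of the BH step-up rule into Python (an illustrative sketch, not from the slides; NumPy assumed):

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """BH step-up: find the largest k with p_(k) <= k*q/m and
    reject the hypotheses with the k smallest p-values."""
    pvals = np.asarray(pvals)
    m = len(pvals)
    order = np.argsort(pvals)
    below = pvals[order] <= np.arange(1, m + 1) * q / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max() + 1   # number of rejections R
        reject[order[:k]] = True
    return reject
```

Note that raising $q$ can only enlarge the rejection set, mirroring the step-up thresholds $jq/m$.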

10 FDR Control through Benjamini-Hochberg: Benjamini-Hochberg Controls FDR

Theorem. For independent test statistics and for any configuration of false null hypotheses, the above procedure controls the FDR at level $q$.

Proof sketch: given any threshold value $t$, reject all hypotheses with p-value at most $t$; then $V(t)$ and $R(t)$ are stochastic processes, and the following facts can be verified:
- $\mathcal{F}_t = \sigma\{V(s), R(s), \; t \le s \le 1\}$ is a backward filtration;
- the BH threshold $\tau$, which satisfies $\tau = q\max(R(\tau), 1)/m$, is a stopping time with respect to the backward filtration $\mathcal{F}_t$.

11 FDR Control through Benjamini-Hochberg: Benjamini-Hochberg Controls FDR

$V(t)/t$ is a martingale running backward in time: for $t \le s$,
$$E\left[\frac{V(t)}{t} \,\Big|\, \mathcal{F}_s\right] = \frac{1}{t}E[V(t) \mid \mathcal{F}_s] = \frac{1}{t}\cdot\frac{t}{s}V(s) = \frac{V(s)}{s},$$
since under $\mathcal{F}_s$, $V(s) = \#\{p_i^0 : p_i^0 \le s\}$, where the null p-values $p_i^0$ are independent and uniformly distributed on $[0, s]$. By the Optional Stopping Theorem,
$$\mathrm{FDR} = E\left[\frac{V(\tau)}{\max(R(\tau), 1)}\right] = \frac{q}{m}E\left[\frac{V(\tau)}{\tau}\right] = \frac{q}{m}E[V(1)] = \frac{q m_0}{m}.$$

12 FDR Control through Benjamini-Hochberg: Simulation Settings

Consider the following simulation settings:
- We run 50 independent hypothesis tests, of which 35 have a true null hypothesis, so we generate their p-values from the uniform distribution on $[0, 1]$.
- Let the other 15 null hypotheses be false. We generate these 15 p-values as follows: first sample one random number from $N(2.5, 1)$, then compute the p-value of the two-sided z-test based on this number; finally, repeat the process 15 times independently.
- We choose the threshold value $\alpha$ or $q$ to be 0.05, 0.1, or 0.2; the decision results are summarized in the following table.
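The setting above can be reproduced with a short script (a sketch under the stated settings; the random seed and the use of SciPy for the normal tail are my choices):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
m, m0 = 50, 35
p_null = rng.uniform(size=m0)          # 35 true nulls: Uniform[0, 1] p-values
z = rng.normal(2.5, 1.0, size=m - m0)  # 15 false nulls: one draw from N(2.5, 1) each
p_alt = 2 * norm.sf(np.abs(z))         # two-sided z-test p-value
pvals = np.concatenate([p_null, p_alt])

q = 0.05
raw = pvals <= q                       # no correction
bonf = pvals <= q / m                  # Bonferroni
sp = np.sort(pvals)                    # BH step-up
below = sp <= np.arange(1, m + 1) * q / m
k = below.nonzero()[0].max() + 1 if below.any() else 0
bh = pvals <= (sp[k - 1] if k else -1.0)

for name, rej in [("no correction", raw), ("Bonferroni", bonf), ("BH", bh)]:
    V, R = rej[:m0].sum(), rej.sum()   # V = false discoveries among true nulls
    print(f"{name}: R = {R}, V = {V}")
```

Bonferroni is the most conservative, no correction the most liberal, and BH sits in between.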

13 FDR Control through Benjamini-Hochberg: Simulation Results

[Table: counts of decisions (NS = not significant, S = significant), split by truth ($H_0$ true vs. false), for BH, Bonferroni, Holm, and no correction at each of the three thresholds.]

14 FDR Control through Benjamini-Hochberg: BH Visualization

The black, red, and green dashed lines are the threshold curves for $q = 0.05$, $0.1$, and $0.2$, respectively.

15 FDR Control through Benjamini-Hochberg: Comparison Visualization

The black, red, green, and blue dashed lines are the threshold curves for Bonferroni, Holm, BH, and no correction, respectively, at $\alpha = 0.2$ or $q = 0.2$.

16 FDR Control with Knockoffs: Knockoffs

For the regression problem $y = X\beta + w$, BH cannot control the FDR in general (the positive regression dependency condition does not hold). Knockoffs $\tilde X$ act as a negative control group for the predictors $X$ and should have the following characteristics:
- Conditional independence: $\tilde X \perp y \mid X$; this holds when $\tilde X$ is constructed from $X$ alone.
- Exchangeability: for any subset $S \subseteq \{1, \dots, p\}$, $(X, \tilde X)_{\mathrm{swap}(S)} \overset{d}{=} (X, \tilde X)$, where $\mathrm{swap}(S)$ swaps the entries $X_j$ and $\tilde X_j$ for every $j \in S$. For example, when $p = 3$ and $S = \{2, 3\}$:
$$(X_1, \tilde X_2, \tilde X_3, \tilde X_1, X_2, X_3) \overset{d}{=} (X_1, X_2, X_3, \tilde X_1, \tilde X_2, \tilde X_3).$$
By construction, the null hypotheses for all knockoffs are true.

17 FDR Control with Knockoffs: Knockoffs, Examples

Suppose $X \sim N(0, \Sigma)$ with $\Sigma$ positive definite. Then we can set
$$(X, \tilde X) \sim N(0, G), \qquad G = \begin{bmatrix} \Sigma & \Sigma - \operatorname{diag}\{s\} \\ \Sigma - \operatorname{diag}\{s\} & \Sigma \end{bmatrix},$$
where $s$ is chosen to make sure $G$ is positive definite. This $G$ is invariant under the swap operation. The knockoffs can then be sampled from
$$\tilde X \mid X \sim N(\mu, V), \qquad \mu = X - X\Sigma^{-1}\operatorname{diag}\{s\}, \qquad V = 2\operatorname{diag}\{s\} - \operatorname{diag}\{s\}\Sigma^{-1}\operatorname{diag}\{s\}.$$
The model can then be fitted for $y$ versus $(X, \tilde X)$, e.g. with the lasso
$$\min_{b \in \mathbb{R}^{2p}} \; \tfrac12\|y - [X, \tilde X]b\|^2 + \lambda\|b\|_1.$$
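Gaussian knockoffs can be sampled directly from the conditional distribution above. The sketch below is my construction for an equicorrelated $\Sigma$; the particular choice of $s$ is one standard option and is an assumption, not from the slides:

```python
import numpy as np

def gaussian_knockoffs(X, Sigma, s, rng):
    """Sample knockoffs for rows X_i ~ N(0, Sigma) from
    Xtilde | X ~ N(mu, V), with mu = X - X Sigma^{-1} diag{s}
    and V = 2 diag{s} - diag{s} Sigma^{-1} diag{s}."""
    S = np.diag(s)
    Sigma_inv_S = np.linalg.solve(Sigma, S)  # Sigma^{-1} diag{s}
    mu = X - X @ Sigma_inv_S
    V = 2 * S - S @ Sigma_inv_S
    L = np.linalg.cholesky(V)                # requires V positive definite
    return mu + rng.standard_normal(X.shape) @ L.T

# Equicorrelated example: Sigma_jk = rho for j != k, 1 on the diagonal.
p, n, rho = 5, 200, 0.3
Sigma = (1 - rho) * np.eye(p) + rho * np.ones((p, p))
rng = np.random.default_rng(1)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
s = np.full(p, min(1.0, 2 * (1 - rho)))      # keeps G positive definite here
Xk = gaussian_knockoffs(X, Sigma, s, rng)
```

For this $\Sigma$, $2(1-\rho)$ is twice the smallest eigenvalue, so the capped value of $s$ keeps both $G$ and $V$ positive definite.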

18 FDR Control with Knockoffs: FDR Control through Knockoffs

Let the adjusted score be $W_j = |\hat b_j| - |\hat b_{j+p}|$. Then $W_j$ satisfies:
- under the null hypothesis, $W_j$ is symmetrically distributed;
- conditional on $|W_j|$, under the null hypothesis, the sign of $W_j$ is an i.i.d. coin flip;
- a large $W_j$ provides evidence against the null hypothesis.
Define $S^+(t) = \{j : W_j \ge t\}$ and $S^-(t) = \{j : W_j \le -t\}$; the threshold value is then
$$\tau = \min\left\{t : \frac{1 + |S^-(t)|}{\max(|S^+(t)|, 1)} \le q\right\}.$$
Since $S^+(t)$ is the selection set, $S^-(t)$ can be viewed as a mirror reflection of the false selection set.
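The threshold $\tau$ can be computed by scanning the candidate values $|W_j|$ (a sketch; the "+1" in the numerator matches the definition above):

```python
import numpy as np

def knockoff_threshold(W, q=0.1):
    """Smallest t with (1 + #{j : W_j <= -t}) / max(#{j : W_j >= t}, 1) <= q;
    returns inf when no such t exists (nothing is selected)."""
    W = np.asarray(W)
    for t in np.sort(np.abs(W[W != 0])):   # candidate thresholds
        if (1 + np.sum(W <= -t)) / max(np.sum(W >= t), 1) <= q:
            return t
    return np.inf

W = np.array([3.0, -0.5, 2.1, 1.7, -0.2, 4.2, 0.9, -1.1, 2.8, 3.5])
tau = knockoff_threshold(W, q=0.2)
selected = np.nonzero(W >= tau)[0]         # the selected set S_hat
```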

19 FDR Control with Knockoffs: FDR Control through Knockoffs

Theorem. Let $\hat S = \{j : W_j \ge \tau\}$. Then the FDR is controlled:
$$E\left[\frac{\#\{j \in \hat S \cap H_0\}}{\max(|\hat S|, 1)}\right] \le q.$$

Proof sketch: define $V^+(t) = \#\{j \in H_0 : j \in S^+(t)\}$ and $V^-(t) = \#\{j \in H_0 : j \in S^-(t)\}$. The following facts can be verified:
- $\mathcal{F}_t = \sigma\{V^+(s), V^-(s), \; 0 \le s \le t\}$ is a filtration;
- $\tau$ is a stopping time with respect to the filtration $\mathcal{F}_t$.

20 FDR Control with Knockoffs: FDR Control through Knockoffs

Using the definition of $\tau$,
$$\mathrm{FDP}(\tau) = \frac{V^+(\tau)}{\max(|\hat S|, 1)} \le \frac{1 + V^-(\tau)}{\max(|\hat S|, 1)}\cdot\frac{V^+(\tau)}{1 + V^-(\tau)} \le q\,\frac{V^+(\tau)}{1 + V^-(\tau)}.$$
The process $V^+(t)/(1 + V^-(t))$ is a super-martingale in forward time: for $s \le t$,
$$E\left[\frac{V^+(s)}{1 + V^-(s)} \,\Big|\, V^{\pm}(t)\right] \ge \frac{V^+(t)}{1 + V^-(t)},$$
since conditional on $V^+(s) + V^-(s)$, $V^+(s)$ is hypergeometric. By the Optional Stopping Theorem and $V^+(0) \sim \mathrm{Bin}(\#H_0, \tfrac12)$:
$$\mathrm{FDR} \le q\,E\left[\frac{V^+(\tau)}{1 + V^-(\tau)}\right] \le q\,E\left[\frac{V^+(0)}{1 + V^-(0)}\right] \le q.$$

21 FDR Control with Knockoffs: Problems with Knockoffs

- Fixed-design knockoffs work well when $n \ge p$; however, for $n < p$, generating $\tilde X$ is hard.
- For random-design knockoffs (model-X knockoffs), we need to assume full knowledge of the feature distribution $P_X$.
- We need a tractable sampling technique when a parametric assumption on the feature distribution $P_X$ cannot be satisfied.

22 FDR Control through SLOPE: Sorted $\ell_1$ Penalized Estimation (SLOPE)

Consider the following optimization problem:
$$\min_b \; \tfrac12\|y - Xb\|^2 + \lambda_1 |b|_{(1)} + \lambda_2 |b|_{(2)} + \dots + \lambda_p |b|_{(p)},$$
where $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0$ and $|b|_{(1)} \ge |b|_{(2)} \ge \dots \ge |b|_{(p)}$; the penalty $J_\lambda(b) = \lambda_1 |b|_{(1)} + \lambda_2 |b|_{(2)} + \dots + \lambda_p |b|_{(p)}$ is called the sorted-$\ell_1$ norm. To prove that $J_\lambda(b)$ is convex, reformulate it as
$$J_\lambda(b) = \sum_{i=1}^{p} (\lambda_i - \lambda_{i+1})\, f_i(b), \qquad f_i(b) = \sum_{j \le i} |b|_{(j)},$$
where $\lambda_{p+1} = 0$; it then suffices to show that each $f_i(b)$ is convex.
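Evaluating the sorted-$\ell_1$ norm is a one-liner (a sketch; the weight sequence must be nonincreasing for the norm interpretation):

```python
import numpy as np

def sorted_l1_norm(b, lam):
    """J_lambda(b) = sum_i lambda_i * |b|_(i), where |b|_(1) >= |b|_(2) >= ...
    and lambda_1 >= lambda_2 >= ... >= 0."""
    return float(np.sort(np.abs(b))[::-1] @ np.asarray(lam, dtype=float))
```

With all $\lambda_i$ equal, this reduces to the ordinary $\ell_1$ norm scaled by $\lambda_1$; with only $\lambda_1$ nonzero, to $\lambda_1 \|b\|_\infty$.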

23 FDR Control through SLOPE: Sorted $\ell_1$ Penalized Estimation (SLOPE)

Consider the diagonal matrix $B = \operatorname{Diag}(|b|)$. By the min-max theorem for eigenvalues of $B$,
$$f_i(b) = \sum_{j \le i} |b|_{(j)} = \sup_{\dim(V) = i} \operatorname{tr}(B P_V),$$
where $P_V$ is the projection onto $V$. Hence $f_i(b)$ is convex, and so is $J_\lambda(b)$.

For computation, we can use proximal gradient descent, so we need to evaluate the proximity operator
$$\operatorname{prox}_\lambda(y) = \arg\min_{x \in \mathbb{R}^n} \; \tfrac12\|y - x\|_{\ell_2}^2 + \sum_{i=1}^{n} \lambda_i |x|_{(i)},$$
which is a strongly convex problem.

24 FDR Control through SLOPE: SLOPE Computation

- The sign of the solution $x_i$ matches that of $y_i$.
- Applying any permutation $P$ to $y$ yields the solution $Px$.
So we may first assume $y_1 \ge y_2 \ge \dots \ge y_n \ge 0$, then restore the signs and the permutation in a post-processing step. The solution must then satisfy $x_1 \ge x_2 \ge \dots \ge x_n \ge 0$: otherwise, suppose $x_i < x_j$ for some $i < j$, and let $\tilde x$ be $x$ with entries $i$ and $j$ exchanged; then
$$f(x) - f(\tilde x) = x_j y_i - x_i y_i + x_i y_j - x_j y_j = (x_j - x_i)(y_i - y_j) > 0.$$
Thus the evaluation of the proximity operator reduces to the quadratic program
$$\text{minimize } \tfrac12\|y - x\|_{\ell_2}^2 + \sum_{i=1}^{n} \lambda_i x_i \quad \text{subject to } x_1 \ge x_2 \ge \dots \ge x_n \ge 0.$$
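This quadratic program can be solved exactly after sorting via pool-adjacent-violators: the solution is the nonincreasing isotonic fit of $|y|_{(i)} - \lambda_i$, clipped at zero, with signs and ordering restored afterward. A sketch of that recipe (my implementation, not code from the slides):

```python
import numpy as np

def prox_sorted_l1(y, lam):
    """Prox of the sorted-l1 norm: argmin_x 0.5*||y - x||^2 + sum_i lam_i |x|_(i).
    Steps: sort |y| decreasingly, fit |y|_(i) - lam_i by a nonincreasing
    isotonic regression (PAVA), clip at 0, restore signs and order."""
    y = np.asarray(y, dtype=float)
    order = np.argsort(-np.abs(y))
    v = np.abs(y)[order] - np.asarray(lam, dtype=float)

    blocks = []                                  # stack of (block mean, size)
    for vi in v:
        blocks.append((vi, 1))
        # merge while the fit would fail to be nonincreasing
        while len(blocks) > 1 and blocks[-2][0] <= blocks[-1][0]:
            (m1, c1), (m2, c2) = blocks[-2], blocks[-1]
            blocks[-2:] = [((m1 * c1 + m2 * c2) / (c1 + c2), c1 + c2)]
    x_sorted = np.concatenate([np.full(c, max(m, 0.0)) for m, c in blocks])

    x = np.empty_like(y)
    x[order] = np.sign(y)[order] * x_sorted      # undo the sort, restore signs
    return x
```

With all $\lambda_i = \lambda$, this reduces to ordinary soft-thresholding at level $\lambda$.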

25 FDR Control through SLOPE: SLOPE Controls FDR

Theorem. Assume an orthogonal design with i.i.d. $N(0, 1)$ errors, and set $\lambda_i = \lambda_{BH}(i) = \Phi^{-1}(1 - iq/(2n))$. Then the FDR of the SLOPE procedure obeys
$$\mathrm{FDR} = E\left[\frac{V}{\max(R, 1)}\right] \le \frac{q\,n_0}{n}.$$

Proof sketch:
$$\mathrm{FDR} = E\left[\frac{V}{\max(R, 1)}\right] = \sum_{r=1}^{n}\frac{1}{r}\,E\big[V\,\mathbb{1}\{R = r\}\big] = \sum_{r=1}^{n}\frac{1}{r}\sum_{i:\,H_i \text{ true}} P(H_i \text{ rejected and } R = r).$$
From the optimality of the solution, we can obtain the following characterization:

26 FDR Control through SLOPE: SLOPE Controls FDR

$$\{y : H_i \text{ rejected and } R = r\} = \{y : |y_i| > \lambda_r \text{ and } R = r\}.$$
Apply the SLOPE procedure to $\tilde y = (y_1, \dots, y_{i-1}, y_{i+1}, \dots, y_n)$ with $\tilde\lambda = (\lambda_2, \dots, \lambda_n)$, and let $\tilde R$ be its number of rejections; then
$$\{y : |y_i| > \lambda_r \text{ and } R = r\} \subseteq \{y : |y_i| > \lambda_r \text{ and } \tilde R = r - 1\}.$$
Thus, by the normality of $y$ and the independence of $y_i$ and $\tilde y$ under the orthogonal design,
$$P(H_i \text{ rejected and } R = r) \le P(|y_i| \ge \lambda_r)\,P(\tilde R = r - 1) = \frac{qr}{n}\,P(\tilde R = r - 1).$$
Then
$$\mathrm{FDR} \le \frac{q n_0}{n}\sum_{r \ge 1} P(\tilde R = r - 1) \le \frac{q n_0}{n}.$$
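The $\lambda_{BH}$ sequence in the theorem is easy to generate (a sketch; SciPy's `norm.ppf` plays the role of $\Phi^{-1}$):

```python
import numpy as np
from scipy.stats import norm

def lambda_bh(n, q):
    """SLOPE weights lambda_i = Phi^{-1}(1 - i*q/(2n)), i = 1, ..., n."""
    i = np.arange(1, n + 1)
    return norm.ppf(1 - i * q / (2 * n))

lam = lambda_bh(10, 0.1)
```

The sequence is strictly decreasing, and its first entry is the two-sided cutoff $\Phi^{-1}(1 - q/(2n))$, so later coefficients face progressively milder penalties, mirroring the BH step-up thresholds.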

27 FDR Control through SLOPE: Problems with SLOPE

- The main theorem only shows that SLOPE controls the FDR under an orthogonal design; for general designs, only simulation results are provided.
- When the variance $\sigma$ is unknown, its estimate may be inflated by undetected weak signals, which can cause a loss of power.
- It is important to have a data-driven procedure for choosing the regularizing sequence $\{\lambda_i\}$.

28 References I

Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, 57(1), pp. 289-300.

Bogdan, M., van den Berg, E., Sabatti, C., Su, W. and Candès, E.J. (2015). SLOPE: adaptive variable selection via convex optimization. The Annals of Applied Statistics, 9(3), pp. 1103-1140.

Barber, R.F. and Candès, E.J. (2015). Controlling the false discovery rate via knockoffs. The Annals of Statistics, 43(5), pp. 2055-2085.

Verhoeven, K.J., Simonsen, K.L. and McIntyre, L.M. (2005). Implementing false discovery rate control: increasing your power. Oikos, 108(3), pp. 643-647.

29 References II

Candès, E., Fan, Y., Janson, L. and Lv, J. (2016). Panning for gold: model-free knockoffs for high-dimensional controlled variable selection. arXiv preprint.

Candès, E. Stats 300C: Theory of Statistics, lecture notes. candes/stats300c/lectures.html

30 Thank you for your time!


More information

Doing Cosmology with Balls and Envelopes

Doing Cosmology with Balls and Envelopes Doing Cosmology with Balls and Envelopes Christopher R. Genovese Department of Statistics Carnegie Mellon University http://www.stat.cmu.edu/ ~ genovese/ Larry Wasserman Department of Statistics Carnegie

More information

Consistent high-dimensional Bayesian variable selection via penalized credible regions

Consistent high-dimensional Bayesian variable selection via penalized credible regions Consistent high-dimensional Bayesian variable selection via penalized credible regions Howard Bondell bondell@stat.ncsu.edu Joint work with Brian Reich Howard Bondell p. 1 Outline High-Dimensional Variable

More information

Selective Sequential Model Selection

Selective Sequential Model Selection Selective Sequential Model Selection William Fithian, Jonathan Taylor, Robert Tibshirani, and Ryan J. Tibshirani August 8, 2017 Abstract Many model selection algorithms produce a path of fits specifying

More information

Post-Selection Inference

Post-Selection Inference Classical Inference start end start Post-Selection Inference selected end model data inference data selection model data inference Post-Selection Inference Todd Kuffner Washington University in St. Louis

More information

arxiv: v3 [stat.me] 9 Nov 2015

arxiv: v3 [stat.me] 9 Nov 2015 Familywise Error Rate Control via Knockoffs Lucas Janson Weijie Su Department of Statistics, Stanford University, Stanford, CA 94305, USA arxiv:1505.06549v3 [stat.me] 9 Nov 2015 May 2015 Abstract We present

More information

A significance test for the lasso

A significance test for the lasso 1 First part: Joint work with Richard Lockhart (SFU), Jonathan Taylor (Stanford), and Ryan Tibshirani (Carnegie-Mellon Univ.) Second part: Joint work with Max Grazier G Sell, Stefan Wager and Alexandra

More information

The knockoff filter for FDR control in group-sparse and multitask regression

The knockoff filter for FDR control in group-sparse and multitask regression The knockoff filter for FDR control in group-sparse and multitask regression Ran Dai Department of Statistics, University of Chicago, Chicago IL 6637 USA Rina Foygel Barber Department of Statistics, University

More information

New Procedures for False Discovery Control

New Procedures for False Discovery Control New Procedures for False Discovery Control Christopher R. Genovese Department of Statistics Carnegie Mellon University http://www.stat.cmu.edu/ ~ genovese/ Elisha Merriam Department of Neuroscience University

More information

Panning for Gold: Model-X Knockoffs for High-dimensional Controlled Variable Selection

Panning for Gold: Model-X Knockoffs for High-dimensional Controlled Variable Selection Panning for Gold: Model-X Knockoffs for High-dimensional Controlled Variable Selection Emmanuel Candès 1, Yingying Fan 2, Lucas Janson 1, and Jinchi Lv 2 1 Department of Statistics, Stanford University

More information

STAT 263/363: Experimental Design Winter 2016/17. Lecture 1 January 9. Why perform Design of Experiments (DOE)? There are at least two reasons:

STAT 263/363: Experimental Design Winter 2016/17. Lecture 1 January 9. Why perform Design of Experiments (DOE)? There are at least two reasons: STAT 263/363: Experimental Design Winter 206/7 Lecture January 9 Lecturer: Minyong Lee Scribe: Zachary del Rosario. Design of Experiments Why perform Design of Experiments (DOE)? There are at least two

More information

Adaptive Filtering Multiple Testing Procedures for Partial Conjunction Hypotheses

Adaptive Filtering Multiple Testing Procedures for Partial Conjunction Hypotheses Adaptive Filtering Multiple Testing Procedures for Partial Conjunction Hypotheses arxiv:1610.03330v1 [stat.me] 11 Oct 2016 Jingshu Wang, Chiara Sabatti, Art B. Owen Department of Statistics, Stanford University

More information

Linear Methods for Regression. Lijun Zhang

Linear Methods for Regression. Lijun Zhang Linear Methods for Regression Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Linear Regression Models and Least Squares Subset Selection Shrinkage Methods Methods Using Derived

More information

PROCEDURES CONTROLLING THE k-fdr USING. BIVARIATE DISTRIBUTIONS OF THE NULL p-values. Sanat K. Sarkar and Wenge Guo

PROCEDURES CONTROLLING THE k-fdr USING. BIVARIATE DISTRIBUTIONS OF THE NULL p-values. Sanat K. Sarkar and Wenge Guo PROCEDURES CONTROLLING THE k-fdr USING BIVARIATE DISTRIBUTIONS OF THE NULL p-values Sanat K. Sarkar and Wenge Guo Temple University and National Institute of Environmental Health Sciences Abstract: Procedures

More information

22s:152 Applied Linear Regression. Chapter 8: 1-Way Analysis of Variance (ANOVA) 2-Way Analysis of Variance (ANOVA)

22s:152 Applied Linear Regression. Chapter 8: 1-Way Analysis of Variance (ANOVA) 2-Way Analysis of Variance (ANOVA) 22s:152 Applied Linear Regression Chapter 8: 1-Way Analysis of Variance (ANOVA) 2-Way Analysis of Variance (ANOVA) We now consider an analysis with only categorical predictors (i.e. all predictors are

More information

A General Framework for High-Dimensional Inference and Multiple Testing

A General Framework for High-Dimensional Inference and Multiple Testing A General Framework for High-Dimensional Inference and Multiple Testing Yang Ning Department of Statistical Science Joint work with Han Liu 1 Overview Goal: Control false scientific discoveries in high-dimensional

More information

Week 5 Video 1 Relationship Mining Correlation Mining

Week 5 Video 1 Relationship Mining Correlation Mining Week 5 Video 1 Relationship Mining Correlation Mining Relationship Mining Discover relationships between variables in a data set with many variables Many types of relationship mining Correlation Mining

More information

Multiple Dependent Hypothesis Tests in Geographically Weighted Regression

Multiple Dependent Hypothesis Tests in Geographically Weighted Regression Multiple Dependent Hypothesis Tests in Geographically Weighted Regression Graeme Byrne 1, Martin Charlton 2, and Stewart Fotheringham 3 1 La Trobe University, Bendigo, Victoria Austrlaia Telephone: +61

More information

Optimization methods

Optimization methods Optimization methods Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda /8/016 Introduction Aim: Overview of optimization methods that Tend to

More information

Stat 602 Exam 1 Spring 2017 (corrected version)

Stat 602 Exam 1 Spring 2017 (corrected version) Stat 602 Exam Spring 207 (corrected version) I have neither given nor received unauthorized assistance on this exam. Name Signed Date Name Printed This is a very long Exam. You surely won't be able to

More information

Let us first identify some classes of hypotheses. simple versus simple. H 0 : θ = θ 0 versus H 1 : θ = θ 1. (1) one-sided

Let us first identify some classes of hypotheses. simple versus simple. H 0 : θ = θ 0 versus H 1 : θ = θ 1. (1) one-sided Let us first identify some classes of hypotheses. simple versus simple H 0 : θ = θ 0 versus H 1 : θ = θ 1. (1) one-sided H 0 : θ θ 0 versus H 1 : θ > θ 0. (2) two-sided; null on extremes H 0 : θ θ 1 or

More information

Panning for gold: model-x knockoffs for high dimensional controlled variable selection

Panning for gold: model-x knockoffs for high dimensional controlled variable selection J. R. Statist. Soc. B (2018) 80, Part 3, pp. 551 577 Panning for gold: model-x knockoffs for high dimensional controlled variable selection Emmanuel Candès, Stanford University, USA Yingying Fan, University

More information

Two simple sufficient conditions for FDR control

Two simple sufficient conditions for FDR control Electronic Journal of Statistics Vol. 2 (2008) 963 992 ISSN: 1935-7524 DOI: 10.1214/08-EJS180 Two simple sufficient conditions for FDR control Gilles Blanchard, Fraunhofer-Institut FIRST Kekuléstrasse

More information

DISCUSSION OF A SIGNIFICANCE TEST FOR THE LASSO. By Peter Bühlmann, Lukas Meier and Sara van de Geer ETH Zürich

DISCUSSION OF A SIGNIFICANCE TEST FOR THE LASSO. By Peter Bühlmann, Lukas Meier and Sara van de Geer ETH Zürich Submitted to the Annals of Statistics DISCUSSION OF A SIGNIFICANCE TEST FOR THE LASSO By Peter Bühlmann, Lukas Meier and Sara van de Geer ETH Zürich We congratulate Richard Lockhart, Jonathan Taylor, Ryan

More information

RANK: Large-Scale Inference with Graphical Nonlinear Knockoffs

RANK: Large-Scale Inference with Graphical Nonlinear Knockoffs RANK: Large-Scale Inference with Graphical Nonlinear Knockoffs Yingying Fan 1, Emre Demirkaya 1, Gaorong Li 2 and Jinchi Lv 1 University of Southern California 1 and Beiing University of Technology 2 October

More information

STAT 200C: High-dimensional Statistics

STAT 200C: High-dimensional Statistics STAT 200C: High-dimensional Statistics Arash A. Amini May 30, 2018 1 / 57 Table of Contents 1 Sparse linear models Basis Pursuit and restricted null space property Sufficient conditions for RNS 2 / 57

More information

The Pennsylvania State University The Graduate School Eberly College of Science GENERALIZED STEPWISE PROCEDURES FOR

The Pennsylvania State University The Graduate School Eberly College of Science GENERALIZED STEPWISE PROCEDURES FOR The Pennsylvania State University The Graduate School Eberly College of Science GENERALIZED STEPWISE PROCEDURES FOR CONTROLLING THE FALSE DISCOVERY RATE A Dissertation in Statistics by Scott Roths c 2011

More information

Weighted Adaptive Multiple Decision Functions for False Discovery Rate Control

Weighted Adaptive Multiple Decision Functions for False Discovery Rate Control Weighted Adaptive Multiple Decision Functions for False Discovery Rate Control Joshua D. Habiger Oklahoma State University jhabige@okstate.edu Nov. 8, 2013 Outline 1 : Motivation and FDR Research Areas

More information

Technical Report 1004 Dept. of Biostatistics. Some Exact and Approximations for the Distribution of the Realized False Discovery Rate

Technical Report 1004 Dept. of Biostatistics. Some Exact and Approximations for the Distribution of the Realized False Discovery Rate Technical Report 14 Dept. of Biostatistics Some Exact and Approximations for the Distribution of the Realized False Discovery Rate David Gold ab, Jeffrey C. Miecznikowski ab1 a Department of Biostatistics,

More information

arxiv: v1 [math.st] 31 Mar 2009

arxiv: v1 [math.st] 31 Mar 2009 The Annals of Statistics 2009, Vol. 37, No. 2, 619 629 DOI: 10.1214/07-AOS586 c Institute of Mathematical Statistics, 2009 arxiv:0903.5373v1 [math.st] 31 Mar 2009 AN ADAPTIVE STEP-DOWN PROCEDURE WITH PROVEN

More information

Statistical Applications in Genetics and Molecular Biology

Statistical Applications in Genetics and Molecular Biology Statistical Applications in Genetics and Molecular Biology Volume 3, Issue 1 2004 Article 13 Multiple Testing. Part I. Single-Step Procedures for Control of General Type I Error Rates Sandrine Dudoit Mark

More information

Gradient descent. Barnabas Poczos & Ryan Tibshirani Convex Optimization /36-725

Gradient descent. Barnabas Poczos & Ryan Tibshirani Convex Optimization /36-725 Gradient descent Barnabas Poczos & Ryan Tibshirani Convex Optimization 10-725/36-725 1 Gradient descent First consider unconstrained minimization of f : R n R, convex and differentiable. We want to solve

More information

Department of Statistics University of Central Florida. Technical Report TR APR2007 Revised 25NOV2007

Department of Statistics University of Central Florida. Technical Report TR APR2007 Revised 25NOV2007 Department of Statistics University of Central Florida Technical Report TR-2007-01 25APR2007 Revised 25NOV2007 Controlling the Number of False Positives Using the Benjamini- Hochberg FDR Procedure Paul

More information

Homework 5. Convex Optimization /36-725

Homework 5. Convex Optimization /36-725 Homework 5 Convex Optimization 10-725/36-725 Due Tuesday November 22 at 5:30pm submitted to Christoph Dann in Gates 8013 (Remember to a submit separate writeup for each problem, with your name at the top)

More information

arxiv: v2 [stat.me] 14 Mar 2011

arxiv: v2 [stat.me] 14 Mar 2011 Submission Journal de la Société Française de Statistique arxiv: 1012.4078 arxiv:1012.4078v2 [stat.me] 14 Mar 2011 Type I error rate control for testing many hypotheses: a survey with proofs Titre: Une

More information

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A. 1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n

More information

Linear Models for Regression CS534

Linear Models for Regression CS534 Linear Models for Regression CS534 Example Regression Problems Predict housing price based on House size, lot size, Location, # of rooms Predict stock price based on Price history of the past month Predict

More information

Aliaksandr Hubin University of Oslo Aliaksandr Hubin (UIO) Bayesian FDR / 25

Aliaksandr Hubin University of Oslo Aliaksandr Hubin (UIO) Bayesian FDR / 25 Presentation of The Paper: The Positive False Discovery Rate: A Bayesian Interpretation and the q-value, J.D. Storey, The Annals of Statistics, Vol. 31 No.6 (Dec. 2003), pp 2013-2035 Aliaksandr Hubin University

More information

False Discovery Control in Spatial Multiple Testing

False Discovery Control in Spatial Multiple Testing False Discovery Control in Spatial Multiple Testing WSun 1,BReich 2,TCai 3, M Guindani 4, and A. Schwartzman 2 WNAR, June, 2012 1 University of Southern California 2 North Carolina State University 3 University

More information

Data Mining. CS57300 Purdue University. March 22, 2018

Data Mining. CS57300 Purdue University. March 22, 2018 Data Mining CS57300 Purdue University March 22, 2018 1 Hypothesis Testing Select 50% users to see headline A Unlimited Clean Energy: Cold Fusion has Arrived Select 50% users to see headline B Wedding War

More information

Linear Models for Regression CS534

Linear Models for Regression CS534 Linear Models for Regression CS534 Example Regression Problems Predict housing price based on House size, lot size, Location, # of rooms Predict stock price based on Price history of the past month Predict

More information

Symmetries in Experimental Design and Group Lasso Kentaro Tanaka and Masami Miyakawa

Symmetries in Experimental Design and Group Lasso Kentaro Tanaka and Masami Miyakawa Symmetries in Experimental Design and Group Lasso Kentaro Tanaka and Masami Miyakawa Workshop on computational and algebraic methods in statistics March 3-5, Sanjo Conference Hall, Hongo Campus, University

More information

Proximal Gradient Descent and Acceleration. Ryan Tibshirani Convex Optimization /36-725

Proximal Gradient Descent and Acceleration. Ryan Tibshirani Convex Optimization /36-725 Proximal Gradient Descent and Acceleration Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: subgradient method Consider the problem min f(x) with f convex, and dom(f) = R n. Subgradient method:

More information