Technical Report 1004 Dept. of Biostatistics. Some Exact and Approximations for the Distribution of the Realized False Discovery Rate

Size: px

Start display at page:

Download "Technical Report 1004 Dept. of Biostatistics. Some Exact and Approximations for the Distribution of the Realized False Discovery Rate"

Beverly Felicia Robinson
6 years ago
Views:

1 Technical Report 14 Dept. of Biostatistics Some Exact and Approximations for the Distribution of the Realized False Discovery Rate David Gold ab, Jeffrey C. Miecznikowski ab1 a Department of Biostatistics, University at Buffalo, Buffalo, NY , USA b Department of Biostatistics, Roswell Park Cancer Institute, New York Short title: Realized False Discovery Rate Proofs to be sent to: Jeffrey C. Miecznikowski Department of Biostatistics School of Public Health and Health Professions 249 Farber Hall University at Buffalo 3435 Main Street Buffalo NY , USA 1 Corresponding Author. Department of Biostatistics, School of Public Health and Health Professions, 249 Farber Hall, University at Buffalo, 3435 Main Street, Buffalo NY , USA. Tel:+1(716) jcm38@buffalo.edu 1

2 Some Exact and Approximations for the Distribution of the Realized False Discovery Rate by David Gold and Jeff Miecznikowski Here, we derive the distribution of the realized false discovery rate (rfdr), for Benjamini and Hochberg s (1995) procedure, given a general distribution of test statistics. The following notation is refered to 1. a the desired FDR 2. test statistics, X iid π F (X) + π 1 F 1 (X), X = (X 1,..., X m ) mixture of the CDF s F, the null distribution, and F 1 the alternative, with mixture weights π + π 1 = 1. We are mainly interested in the case of a t-test, where F is a t distribution with v degrees of freedom and F 1 is a (possibly) mixture of non-central t s, with v degrees of freedom and noncentrality parameter η. 3. two-sided p-values p i = 2(1 F ( X i )), i = 1,..., m with distribution P (p c) = π P (p c) + π 1 P 1 (p c) 4. ordered p-values p (1),..., p (m) 5. c j = aj/m for j = 1,..., m 1 Density of the Ordered p-value chosen Define the sets: A j = {p (j) aj/m, p (j+1) > a(j + 1)/m, p (j+2) > a(j + 2)/m,..., p (m) > a} B j = {p (j) > aj/m, p (j+1) > a(j + 1)/m,..., p (m) > a} C j = {p (j+1) > a(j + 1)/m, p (j+2) > a(j + 2)/m,..., p (m) > a} Then P (A j ) = P (C j ) P (B j ). Note that for sufficiently large m, large π 1, and entropy between P and P 1, the approximation P (A j ) P ({p (j) aj/m, p (j+1) > a(j + 1)/m}) is efficient. 2

3 2 Distribution of Order Statistics The joint distribution of two order statistics is derived in Casella and Berger. P (p (i1 ) u 1, p (i2 ) u 2 ) define U 1 = I(p i u 1 ) U 2 = I(u 1 < p i u 2 ) P (p (i1 ) u 1, p (i2 ) u2) = P (i 1 U 1 < i 2, i 2 U 1 + U 2 n) + P (U 1 i 2 ) = i 2 1 s1=i 1 m s 1 s2=i 2 s 1 P (U 1 = s 1, U 2 = s 2 ) + P (U 1 i 2 ) for the general case, where p 1,..., p m are not necessarily independent or identically distributed, P (U 1 = s 1, U 2 = s 2 ) = u1 u1 u2 u 1 q Q u2 u 1 u 2 u 2 P pq1,...,p qm (p q1,..., p qm )dp q1 dp qm for the set Q = {q : (q 1,..., q m ) are permutations of (1,..., m)}, and reducing in the iid case to, P (U 1 = s 1, U 2 = s 2 ) = m! s 1!s 2!(m s 1 s 2 )! [P (p u 1)] s 1 [P (p u 2 ) P (p u 1 )] s 2 [1 P (p u 2 )] m s 1 s 2 The joint CDF of k order statistics is derived in Glueck et. al (28) for the non-identically distributed case, in particular with two sub-populations. In order to calculate the probability of B j, or that k = m j 1 of the largest order statistics are greater than constants c 1, c 2,..., c k, following the logic in Glueck et al. and some of their notation, for p 1,..., p m iid F, the joint CDF 3

4 of the order statistics, (p (j1 ),..., p (jk )), P ( k s=1{p (js) c s }) = P ( k s=1{ at least j s of p i s c s }) = P ( k s=1{i s j s }) = P ( k s=1{i s = i s }) i I = i I P ( k+1 s=1{i s i s 1 of p i s (c s 1, c s ]}) = k+1 m! i I s=1 [P (p c s ) P (p c s 1 )] (is i s 1) (i s i s 1 )! given I s = m i=1 I(p i c s ), so that I 1 I k, leading to the index set I = {i : = i i 1 i k i k+1 = m, i s j s all s [1, k]} We, however, are interested in the joint probability P ( k s=1{p (js) > c s }) = P ( k s=1{at most j s 1 of p i s c j }) = P ( k s=1{i s < j s }) = i I P ( k s=1{i s = i s }) = i I P ( k+1 s=1{i s i s 1 of p i s (c s 1, c s ]}) = k+1 m! i I s=1 [P (p c s ) P (p c s 1 )] (is i s 1) (i s i s 1 )! I = {i : = i i 1 i k i k+1 = m, i s < j s all s [1, k]} note that for 2-sided p-values, the P (p c s ) = P (X X cs ) + P (X X cs ), where and P (X X cs ) = F ( X cs ), etc., and X cs = F 1 (.5c s ). The results in Glueck can be extended for a multivariate distribution P ( k+1 s=1{i s i s 1 of p i s (c s 1, c s ]}) = c1 c1 c2 c2 ck+1 ck+1 c c 1 c k P pq1,...,p qm (p q1,..., p qm )dp q1 dp qm q Q c c 1 c k 4

5 There are i 1 integrals from (, c 1 ), (i 2 i 1 ) from (c 1, c 2 ),..., and m i k integrals from (c k, ). In order to compute the respective n-dim integral over 2-sided p-values, for each permutation, there are 2 m possible ways to integrate over the distribution of test statistics, over positive and negative domains, respectively. Suppose for example that X f and X f 1 are independent, and that within each sub-population, the covariance matrix is block diagonal, with B and B 1 blocks, repsectively. This leads to considerable reductions in computation, i.e. the product of B - and B 1 -dim integrals, rather than one m-dim integral, assuming the order of integration is inter-changable. Permuations within-block are not necessary to compute, where variables are exchangable. 3 Density of the p-value threshold P (τ) = P () + m P (τ j)p (A j ) j=1 where P () is the probability that no tests are rejected, and P (τ j) = P (p (j) A j ) = 1 1 a(j+1)/m a P (p (j), p (j+1),..., p (m) A j )dp (j+1) dp (m) or, P (τ j) = τ P (p (j) τ A j ) = τ P (p (j) τ, p (j+1) > a(j + 1)/m,..., p (m) > a)/p (A j ) if τ aj/m, and otherwise. For the bivariate approximation, let Ã j = {p (j) aj/m, p (j+1) > a(j + 1)/m}. 5

6 P (τ j) τ P (p (j) τ Ãj) = τ P (p (j) τ, p (j+1) > a(j + 1)/m)/P (Ãj) = τ P (p (j) τ)/p (Ãj) τ P (p (j) τ, p (j+1) a(j + 1)/m)/P (Ãj) if τ < aj/m and otherwise. Further, for the iid case, ( ) m τ P (p (j) τ) = j P (p = τ)[p (p τ)] j 1 [1 P (p τ)] m j j and j+1 1 τ P (p (j) τ, p (j+1) a(j + 1)/m) = s1=j m s 1 s2=j+1 s 1 where, the partial of P (U 1 j + 1) w.r.t τ is found as above, as ( ) m (j + 1) P (p = τ)[p (p τ)] (j+1) 1 [1 P (p τ)] m (j+1) j + 1 τ P (U 1 = s 1, U 2 = s 2 ) + τ P (U 1 j + 1) and τ = P (U 1 = s 1, U 2 = s 2 ) = m! τ s 1!s 2!(m s 1 s 2 )! [P (p τ)]s 1 [P (p a(j + 1)/m) P (p τ)] s 2 [1 P (p a(j + 1)/m)] m s 1 s 2 m! s 1!s 2!(m s 1 s 2 )! [1 P (p a(j + 1)/m)]m s 1 s 2 [s 1 P (p τ) s1 1 P (p = τ)[p (p a(j + 1)/m) P (p τ)] s 2 P (p τ) s 1 (s 2 )[P (p a(j + 1)/m) P (p τ)] s 2 1 P (p = τ)] Again, note P (p c) is found above. 6

7 4 CDF of rfdr w if w + w 1 > rf DR = w + w 1 if w + w 1 = where w is the count of false, and w 1 true rejections, respectively. The CDF is defined as P (rf DR c j) = w,w 1 : rf DR c P (w, w 1 m, m 1, F, j)p (m, m 1 m, F ) Stating the obvious, m Binom(m, π ) m 1 = m m We can find the conditional joint distribution P (w, w 1 m, m 1, F, j) = P (w, w 1, j m, m 1, F )/P (j m, m 1, F ) as described at the end of Section 3, letting f be partitioned into two subpopulations of size m and m 1, requiring that w and w 1 of the integration limits be (, c 1 ) respectively by sub-population. Also, consider the results in Glueck for two populations. Then we have P (rf DR c) = j P (rf DR c j)p (A j ) mote that P (rf DR c τ) can be approximated well, for large m, treating w, w 1 as independent, e.g. P (w τ) w r = Binom(r, m, π P (p τ)). P (rf DR c) = 1 P (rf DR c τ)p (τ)dτ 7

8 4.1 Independent Case Under the integral, there is a double sum of independent terms, with each depending on τ. Integrating each term and aggregating yields the result. The sum is composed of the terms P (rf DR c τ)p (τ j) = w ( ) m (π τ) r (1 π τ) m r w,w 1 : rf DR c r =1 w 1 r 1 =1 ( m r 1 r ) (π 1 τ) r 1 (1 π 1 F 1 (τ)) m r1 [ τ P (p (j) τ)/p (Ãj) ] τ P (p (j) τ, p (j+1) a(j + 1)/m)/P (Ãj) Then for any r, r 1, in the above, [ [ = E 1 E 2 E 3 + j+1 1 s1=j m s 1 s2=j+1 s 1 E 4 (s 1, s 2 )(E 5 (s 1, s 2 ) E 6 (s 1, s 2 )) where ( ) ( ) m E 1 = (π τ) r (1 π τ) m r m (π 1 τ) r 1 (1 π 1 P 1 (p τ)) m r 1 r r ( ) 1 m E 2 = j P (p = τ)[p (p τ)] j 1 [1 P (p τ)] n j j ( ) m E 3 = (j + 1) P (p = τ)[p (p τ)] (j+1) 1 [1 P (p τ)] n (j+1) j + 1 E 4 = m! s 1!s 2!(m s 1 s 2 )! [1 P (p a(j + 1)/m)]m s 1 s 2 E 5 = s 1 P (p τ) s1 1 P (p = τ)[p (p a(j + 1)/m) P (p τ)] s 2 s 1 E 6 = P (p τ) s 1 (s 2 s 1 )[P (p a(j + 1)/m) P (p τ)] s 2 s 1 1 P (p τ) E 7 = P (Ãj) where f, F are the pdf and cdf of the p-values, respectively. the integrals that need to be performed are proportional to cj τ r +r 1 (1 π τ) m r (1 π 1 P 1 (p τ)) m r 1 P (p = τ)[p (p τ)] j 1 [1 P (p τ)] m j dτ ]] /E 6 8

9 cj τ r +r 1 (1 π τ) m r (1 π 1 P 1 (p τ)) m r 1 P (p = τ) s 1 1 P (p = τ)[p (p a(j + 1)/m) P (p = τ)] s 2 s 1 dτ cj τ r +r 1 (1 π τ) m r (1 π 1 P 1 (p τ)) m r 1 P (p τ) s 1 (s 2 s 1 ) [P (p a(j + 1)/m) P (p τ)] s 2 s 1 1 P (p = τ)dτ 4.2 Correlation Case Correlation introduces complexities, that are beyond our capacity, with current state of the art computing. It is unpractical and unrealistic to expect that we will generate results for the general correlated case. However, for restricted and special cases, results can be achieved quickly. One such case is the block diagonal correlation matrix, of blocks of size B, identically distributted in a block, and further assuming that variables following either component distribution f or f 1 are independent. For the approximation, relying on the set Ã, rather than the full set A, we need perform calculations for the joint density of two order statistics. There are four combinations of two cases that we must consider, for two variables that are indepedent or dependent, and belonging to components f or f 1, respectively, and weight results accordingly. For the independent variables, we take previous results. For dependent variables, we compute, for each component weighting accordingly, P (U 1 = s 1, U 2 = s 2 ) = B! u1 u1 u2 u 1 u2 u 1 u 2 u 2 P p1,...,p B (p 1,..., p B )dp q1 dp qm where B! is the number of ways to permute B variables. If the block sizes vary, then we may compute over each size, and weight accordingly. If allow variables from each component in a block, then we must weight accordingly, with the correct number of permutations, which must be mixed over the correct binomial distribution, given the population rates. All of these considerations are for the sake of computational speed. 9

10 5 SIMULATIONS To Be Determined 6 REFERENCES 1. Benjamini and Hochberg (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Statist. Soc. Ser. B. 57: Glueck, Karimpour-Fard, Mandel,Hunter, Muller (28) Fast Computation by Block Permanents of Cumulative Distribution Functions of Order Statistics from Several Populations. Commun Stat Theory Methods. 37(18): doi: 1.18/ Casella and Berger (21) Statistical Inference. Duxbury Press. 1

PROCEDURES CONTROLLING THE k-fdr USING. BIVARIATE DISTRIBUTIONS OF THE NULL p-values. Sanat K. Sarkar and Wenge Guo

PROCEDURES CONTROLLING THE k-fdr USING BIVARIATE DISTRIBUTIONS OF THE NULL p-values Sanat K. Sarkar and Wenge Guo Temple University and National Institute of Environmental Health Sciences Abstract: Procedures