Weighted Hypothesis Testing. April 7, 2006


Weighted Hypothesis Testing

Larry Wasserman and Kathryn Roeder (1)
Carnegie Mellon University

April 7, 2006

The power of multiple testing procedures can be increased by using weighted p-values (Genovese, Roeder and Wasserman 2006). We derive the optimal weights and we show that the power is remarkably robust to misspecification of these weights. We consider two methods for choosing weights in practice. The first, external weighting, is based on prior information. The second, estimated weighting, uses the data to choose weights.

1 Introduction

The power of multiple testing procedures can be increased by using weighted p-values (Genovese, Roeder and Wasserman 2006). Dividing each p-value P_j by a weight w_j increases the probability of rejecting some hypotheses. Provided the weights have mean one, familywise error control methods and false discovery control methods maintain their frequentist error control guarantees. The first such weighting scheme appears to be Holm (1979). Related ideas are in Benjamini and Hochberg (1997) and Chen et al. (2000).

There are, of course, other ways to improve power aside from weighting. Some notable recent approaches include Rubin, van der Laan and Dudoit (2005), Storey (2005), Donoho and Jin (2004) and Signoravitch (2006). Of these, our approach is closest to Rubin, van der Laan and Dudoit (2005), hereafter RVD. In fact, the optimal weights derived here, if re-expressed as cutoffs for test statistics, turn out to be identical to the cutoffs derived in RVD. Our main contributions beyond RVD are (i) a careful study of potential power losses due to departures from the optimal weights, (ii) robustness properties of weighted methods, and (iii) recovering power after using data splitting to estimate the weights. An important distinction between this paper and RVD versus Storey (2005) is that Storey uses a slightly different loss function and he requires a common cutoff for all test statistics. This allows him to make an elegant connection with the Neyman-Pearson lemma. In particular, his method automatically adapts from one-sided testing to two-sided testing depending on the configuration of means. Signoravitch (2006) uses invariance arguments to find powerful test statistics for multiple testing when the underlying tests are multivariate.

In this paper we show that the optimal weights form a one-parameter family. We also show that the power is very robust to misspecification of the weights. In particular, we show that (i) sparse weights (a few large weights and minimum weight close to 1) lead to huge power gains for well specified weights but minute power loss for poorly specified weights; and (ii) in the non-sparse case, under weak conditions, the worst case power under poorly specified weights is typically no worse than the power using equal weights.

(1) Research supported by National Institute of Mental Health grants MH057881, MH066278, MH06329 and NSF Grant AT. The authors thank Jamie Robins for helping us to clarify several issues.

Figure 1: Power gain/loss for weighting a single hypothesis. In this example, an unweighted hypothesis has power 1/2. The weights are w_0 < 1 < w_1 with w_1/w_0 = B. The top line shows the power when the alternative is given the correct weight w_1. The bottom line, which is nearly indistinguishable from 1/2, shows the power when the alternative is given the incorrect weight w_0. As B increases, the power gain increases sharply while the power loss remains nearly constant.

In fact, the power is degraded at most by a factor of about Φ̄(√(2 log((1 − a)/γ))), where a is the fraction of nonnulls and γ is the fraction of nulls that are mistaken for alternatives. Figure 1 shows the sparse case. The top line shows power from correct weighting while the bottom line shows power from incorrect weighting. We see that the power gains overwhelm the potential power loss. Figure 2 shows the non-sparse case. The plots on the left show the power as a function of the alternative mean. The dark solid line shows the lowest possible power assuming the weights were estimated as poorly as possible. The lighter solid line is the power of the unweighted Bonferroni method. The dotted line shows the power under theoretically optimal weights. The worst case weighted power is typically close to or larger than the Bonferroni power, except for large ξ, when they are both large.

We consider two methods for choosing the weights: (i) external weights, where prior information based on scientific knowledge or prior data singles out specific hypotheses, and (ii) estimated weights, where the data are used to construct weights. External weights are prone to bias while estimated weights are prone to variability. The two robustness properties reduce concerns about bias and variance.

An example of external weighting is the following. We have test statistics {T_j : j = 1,...,m} associated with spatial locations {s_j : j = 1,...,m} where s_j ∈ [0, L], say. These could be association tests for markers on a genome. The number of tests m is large, on the order of 100,000

Figure 2: Power as a function of the alternative mean ξ. In these plots, a = .01, m = 1000 and α = 0.05. There are m(1 − a) nulls and ma alternatives with mean ξ. The left plots show what happens when the weights are incorrectly computed assuming that a fraction γ of the nulls are actually alternatives with mean u. In the top plot, we restrict 0 < u < ξ. In the second and third plots, no restriction is placed on u. The top and middle plots have γ = .1 while the third plot has γ = 1 − a (all nulls misspecified as alternatives). The dark solid line shows the lowest possible power assuming the weights were estimated as poorly as possible. The lighter solid line is the power of the unweighted Bonferroni method. The dotted line is the power under the optimal weights. The vertical line in the top plot is at ξ*. The weighted method beats the unweighted method for all ξ < ξ*. The right plot shows the least favorable u as a function of ξ; that is, mistaking mγ nulls for alternatives with mean u leads to the worst power. Also shown is the line u = ξ.

for example. Each T_j is used to test the null hypothesis that θ_j = E(T_j) = 0. Prior data are in the form of a smooth stochastic process {Z(s) : s ∈ [0, L]}. This might be from a whole genome linkage scan. At alternatives, the mean µ(s) = E(Z(s)) is a large positive value; however, due to correlation, at nulls close to alternatives, µ(s) is also nonzero. Peaks in the process Z(s) provide approximate information about the location of alternatives. We want to use the process Z to generate reasonable weights for the test statistics.

When external weights are not available, the optimal weights can be estimated from the data. One approach is to use data splitting (RVD): a fraction of the data is used to estimate the weights and the remainder to test. For example, consider the two-stage genome-wide association study (e.g., Thomas et al. 2005) for which a sample of n subjects is split into two subsets. Using the first subset, we obtain test statistics {T_j : j = 1,...,m} associated with locations {s_j : j = 1,...,m}. Typically only the second subset of data is used in the final analysis. Building on the ideas of Skol et al. (2006), we take the two-stage study design further, exploring how the first set of data can be utilized to formulate weights while the full data set is used for testing.

2 Weighted Multiple Testing

We are given hypotheses H = (H_1,...,H_m) and standardized test statistics T = (T_1,...,T_m), where T_j ~ N(ξ_j, 1). The methods can be extended to nonnormal test statistics but we do not consider that case here. For a two-sided hypothesis, H_j = 1 if ξ_j ≠ 0 and H_j = 0 otherwise. For the sake of parsimony, unless otherwise noted, results will be stated for a one-sided test where H_j = 1 if ξ_j > 0, although the results extend easily to the two-sided case. Let θ = (ξ_1,...,ξ_m) denote the vector of means. The original data are often of the form

        X_{11}  X_{12}  ...  X_{1m}
    X = X_{21}  X_{22}  ...  X_{2m}                                    (1)
          ...
        X_{n1}  X_{n2}  ...  X_{nm}

where the jth test statistic T_j is based on the jth column of X. Usually, T_j is of the form T_j = √n X̄_j/σ_j, where X̄_j is approximately or exactly N(γ_j, σ_j²/n) and the noncentrality parameter is ξ_j = √n γ_j/σ_j. The p-values associated with the tests are P = (P_1,...,P_m), where P_j = Φ̄(T_j), Φ̄ = 1 − Φ and Φ denotes the standard Normal CDF. Let P_(1) ≤ ... ≤ P_(m) denote the sorted p-values and T_(1) ≥ ... ≥ T_(m) the sorted test statistics.
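As an illustration, here is a minimal sketch (ours, not part of the paper) of how the statistics and one-sided p-values just described can be computed; the sample size, number of tests, and planted effects are arbitrary assumptions.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    n, m = 100, 1000
    gamma = np.zeros(m)
    gamma[:50] = 0.3                  # hypothetical column means; xi_j = sqrt(n)*gamma_j/sigma_j
    X = rng.normal(loc=gamma, scale=1.0, size=(n, m))

    T = np.sqrt(n) * X.mean(axis=0) / X.std(axis=0, ddof=1)   # T_j from column j
    P = norm.sf(T)                    # P_j = Phibar(T_j), one-sided p-values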

A rejection set R is a subset of {1,...,m}. Say that R controls familywise error at level α if P(R ∩ H_0 ≠ ∅) ≤ α, where H_0 = {j : H_j = 0} is the set of true nulls. The Bonferroni rejection set is

    R = {j : P_j < α/m} = {j : T_j > z_{α/m}}                          (2)

where we use the notation z_β = Φ̄^{-1}(β).

The weighted Bonferroni procedure of Genovese, Roeder and Wasserman (2005) is as follows. Specify nonnegative weights w = (w_1,...,w_m) and reject hypothesis H_j if j belongs to

    R = {j : P_j ≤ α w_j / m}.                                         (3)

As long as m^{-1} Σ_j w_j = 1, the rejection set R controls familywise error at level α. For completeness, we provide the proof. All further proofs are in the appendix.

Lemma 2.1 If m^{-1} Σ_{j=1}^m w_j = 1, then the rejection set R controls familywise error at level α.

Proof. The familywise error is

    P(R ∩ H_0 ≠ ∅) = P(P_j ≤ α w_j/m for some j ∈ H_0)
                   ≤ Σ_{j∈H_0} P(P_j ≤ α w_j/m)
                   ≤ Σ_{j∈H_0} α w_j/m ≤ α m^{-1} Σ_j w_j = α.  □

Genovese, Roeder and Wasserman (2005) also showed that false discovery methods benefit from weighting. Recall that the false discovery proportion (FDP) is

    FDP = (number of false rejections)/(number of rejections) = |R ∩ H_0| / |R|     (4)

where the ratio is defined to be 0 if the denominator is 0. The false discovery rate (FDR) is FDR = E(FDP). Benjamini and Hochberg (1995) proved that FDR ≤ α if R = {j : P_j ≤ P_(ĵ)}, where ĵ = max{j : P_(j) ≤ jα/m}. Genovese, Roeder and Wasserman (2004) showed that FDR ≤ α still holds if the P_j's are replaced by Q_j = P_j/w_j, as long as m^{-1} Σ_j w_j = 1 as before. This paper will focus only on familywise error. Similar results hold for FDR and will appear in a followup paper.
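Returning to the weighted Bonferroni rule (3): a minimal sketch (ours; names are illustrative) that applies it and checks the mean-one condition of Lemma 2.1.

    import numpy as np

    def weighted_bonferroni(P, w, alpha=0.05):
        P, w = np.asarray(P), np.asarray(w)
        m = len(P)
        assert np.isclose(w.mean(), 1.0), "weights must average to one (Lemma 2.1)"
        return P <= alpha * w / m     # boolean rejection set R of (3)

With w = np.ones(m) this reduces to the ordinary Bonferroni rule (2), up to the non-strict inequality.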

3 Power and Optimality

3.1 Power

Before weighting, that is, using weight 1, the power of a single, one-sided alternative ξ is

    π(ξ, 1) = P(T > z_{α/m}) = Φ̄(z_{α/m} − ξ).                         (5)

The power (2) in the weighted case is

    π(ξ, w) = P(P < αw/m) = P(T > Φ̄^{-1}(αw/m)) = Φ̄(Φ̄^{-1}(αw/m) − ξ).   (6)

Weighting increases the power when w > 1 and decreases the power when w < 1. Given θ = (ξ_1,...,ξ_m) and w = (w_1,...,w_m), we define the average power

    (1/m) Σ_{j=1}^m π(ξ_j, w_j) I(ξ_j > 0).                             (7)

More generally, if ξ is drawn from a distribution Q and w = w(ξ) is a weight function, we define the average power

    ∫ π(ξ, w(ξ)) I(ξ > 0) dQ(ξ).                                        (8)

If we take Q to be the empirical distribution of ξ_1,...,ξ_m, then this reduces to the previous expression. In this case we require w ≥ 0 and ∫ w dQ = 1.

3.2 Optimality and Robustness

In the following theorem we see that the optimal weight functions form a one-parameter family indexed by a constant c.

Theorem 3.1 Given θ = (ξ_1,...,ξ_m), the optimal weight vector w = (w_1,...,w_m) that maximizes the average power subject to w_j ≥ 0 and m^{-1} Σ_{j=1}^m w_j = 1 is w = (ρ_c(ξ_1),...,ρ_c(ξ_m)), where

    ρ_c(ξ) = (m/α) Φ̄(ξ/2 + c/ξ) I(ξ > 0)                               (9)

and c ≡ c(θ) is defined by the condition

    m^{-1} Σ_{j=1}^m ρ_c(ξ_j) = 1.                                      (10)

(2) For a two-sided alternative the power is π(ξ, w) = Φ̄(Φ̄^{-1}(αw/(2m)) − ξ) + Φ̄(Φ̄^{-1}(αw/(2m)) + ξ).
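Condition (10) pins down c because the average weight is decreasing in c, so a bisection suffices. The following sketch (ours; the bracket for c is an arbitrary assumption) computes the optimal weights of Theorem 3.1 from a vector of means.

    import numpy as np
    from scipy.stats import norm

    def optimal_weights(xi, alpha=0.05):
        xi = np.asarray(xi, dtype=float)
        m, pos = len(xi), xi > 0

        def avg_weight(c):            # (1/m) sum_j rho_c(xi_j), decreasing in c
            w = np.zeros(m)
            w[pos] = (m / alpha) * norm.sf(xi[pos] / 2 + c / xi[pos])
            return w.mean(), w

        lo, hi = -50.0, 50.0          # assumed bracket for c; widen if needed
        for _ in range(100):          # bisection on condition (10)
            mid = (lo + hi) / 2
            avg, w = avg_weight(mid)
            lo, hi = (mid, hi) if avg > 1 else (lo, mid)
        return w, mid                 # w from the final midpoint

For example, with 50 alternatives at ξ = 3 among m = 1000 tests, the returned weights are 0 on the nulls and 1/a = 20 on the alternatives, in line with the two-point example below.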

Figure 3: Optimal weight function ρ_c for various values of c (including c = 1 and c = 10). In each case m = 1000 and α = 0.05. The functions are normalized to have maximum 1.

The proof is in the appendix. Some plots of the function ρ_c for various values of c are shown in Figure 3. In these plots, the function is normalized to have maximum 1 for easier visualization. The result generalizes to the case where the alternative means are random variables with distribution Q, in which case c is defined by

    ∫ ρ_c(ξ) dQ(ξ) = 1.                                                 (11)

Remark. Rejecting when P_j/w_j ≤ α/m is the same as rejecting when Z_j ≥ ξ_j/2 + c/ξ_j. This is identical to the result of Rubin, van der Laan and Dudoit (2005), obtained independently. The remainder of the paper, which shows some good properties of the weighted method, can thus also be considered as providing support for their method. In particular, they noted in their simulations that even poorly specified estimates of the cutoffs ξ_j/2 + c/ξ_j can still perform well. This paper provides insight into why that is true.

From (6) and (9) we have immediately:

Lemma 3.2 The power at an alternative with mean ξ under optimal weights is Φ̄(c/ξ − ξ/2). The average power under optimal weights, which we call the oracle power, is

    (1/m) Σ_{j=1}^m Φ̄(c/ξ_j − ξ_j/2) I(ξ_j > 0).                        (12)

The oracle power is not attainable since the optimal weights depend on θ = (ξ_1,...,ξ_m) or, equivalently, on Q. In practice, the weights will either be chosen by prior information or by estimating the ξ_j's. This raises the following question: how sensitive is the power to correct specification of the weights? We show below that the power is very robust to weight misspecification, even though the weights themselves can be very sensitive to changes in θ.

Consider the following example. Suppose that θ = (ξ_1,...,ξ_m) where each ξ_j is equal to either 0 or some fixed number ξ. The empirical distribution of the ξ_j's is thus Q = (1 − a)δ_0 + aδ_ξ, where δ_ξ denotes a point mass at ξ and a is the fraction of nonzero means. The optimal weights are 0 for ξ_j = 0 and 1/a for ξ_j = ξ. Let Q̃ = (1 − a − γ)δ_0 + γδ_u + aδ_ξ where u is a small positive number. Since we have only moved the mass at 0 to u, and u is small, we would hope that w will not change much. But this is not the case. Setting

    ξ = A + √(A² − 2c),   u = B − √(B² − 2c)                            (13)

where

    A = Φ̄^{-1}( α/(m(γK + a)) ),   B = Φ̄^{-1}( Kα/(m(γK + a)) ),       (14)

yields weights w_0 and w_1 on u and ξ such that w_0/w_1 = K. For example, take m = 1000, α = 0.05, a = .1, γ = .1, K = 1000, and c = .1. Then u = .03 and ξ = 9.8. The optimal weight on ξ under Q is 10, but under Q̃ it is 1/(γK + a) ≈ .01, and so is reduced by a factor of about 1000. More generally we have the following result, which shows that the weights are, in a certain sense, a discontinuous function of θ.

Lemma 3.3 Fix α and m. For any δ > 0 and ɛ > 0 there exist Q = (1 − a)δ_0 + aδ_ξ and Q̃ = (1 − a − γ)δ_0 + γδ_u + aδ_ξ such that

    d(Q, Q̃) < δ   and   ρ(ξ) − ρ̃(ξ) > 1/a − ɛ                           (15)

where a = α/(4m), d(Q, Q̃) = sup_ξ |Q(−∞, ξ] − Q̃(−∞, ξ]| is the Kolmogorov-Smirnov distance, ρ is the optimal weight function for Q and ρ̃ is the optimal weight function for Q̃.

Fortunately, this problem is not serious since it is possible to have high power even with poor weights. In fact, the power of the weighted method has the following two robustness properties:

Property I: Sparse weights (minimum weight close to 1) are highly robust. If most weights are less than 1 and the minimum weight is close to 1, then correct specification (large weights on alternatives) leads to large power gains but incorrect specification (large weights on nulls) leads to little power loss.
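Returning to the numerical example above, a quick check (our sketch, using the formulas (13)-(14) as reconstructed; treat it as illustrative rather than definitive):

    import numpy as np
    from scipy.stats import norm

    m, alpha, a, gamma, K, c = 1000, 0.05, 0.1, 0.1, 1000, 0.1
    A = norm.isf(alpha / (m * (gamma * K + a)))        # Phibar^{-1}
    B = norm.isf(K * alpha / (m * (gamma * K + a)))
    xi = A + np.sqrt(A**2 - 2 * c)                     # about 9.8
    u = B - np.sqrt(B**2 - 2 * c)                      # about 0.03
    w_xi = (m / alpha) * norm.sf(xi / 2 + c / xi)      # weight on xi under Q-tilde
    print(xi, u, w_xi)                                 # w_xi ~ 1/(gamma*K + a) ~ 0.01, versus 1/a = 10 under Q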

Property II: Worst case analysis. Weighted hypothesis testing, even with poorly chosen weights, typically does as well as or better than Bonferroni, except when the alternative means are large, in which case both methods have high power.

Let us now make these statements precise. See also Genovese, Roeder and Wasserman (2006) and Roeder, Bacanu, Wasserman and Devlin (2006) for other results on the effect of weight misspecification.

Property I. Consider first the case where the weights take two distinct values and the alternatives have a common mean ξ. Let ɛ denote the fraction of hypotheses given the larger of the two values of the weights. Then the weight vector w is proportional to (B,...,B, 1,...,1), with k = mɛ terms equal to B > 1 and m − k terms equal to 1, and hence the normalized weights are w = (w_1,...,w_1, w_0,...,w_0), with k terms equal to w_1 and m − k terms equal to w_0, where

    w_1 = B/(Bɛ + 1 − ɛ),   w_0 = 1/(Bɛ + 1 − ɛ).

We say that the weights are sparse if ɛ is small, that is, if most weights are near 1. Consider an alternative with mean ξ. The power gain from correct weighting is the power under weight w_1 minus the unweighted power, π(ξ, w_1) − π(ξ, 1). Similarly, the power loss from incorrect weighting is π(ξ, 1) − π(ξ, w_0). The gain minus the loss, which we call the robustness function, is

    R(ξ, ɛ) ≡ [π(ξ, w_1) − π(ξ, 1)] − [π(ξ, 1) − π(ξ, w_0)]             (16)
            = Φ̄(z_{αw_1/m} − ξ) + Φ̄(z_{αw_0/m} − ξ) − 2Φ̄(z_{α/m} − ξ).  (17)

The gain outweighs the loss if and only if R(ξ, ɛ) > 0. This is illustrated in Figures 1 and 4.

Theorem 3.4 Fix B > 1. Then lim_{ɛ→0} R(ξ, ɛ) > 0. Moreover, there exists ɛ* > 0 such that R(ξ, ɛ) > 0 for all ɛ < ɛ*.

We can generalize this beyond the two-valued case as follows. Let w be any weight vector such that m^{-1} Σ_j w_j = 1. Now define the worst case robustness function

    R ≡ min_{j: w_j > 1, H_j = 1} {π(ξ_j, w_j) − π(ξ_j, 1)} − max_{j: w_j < 1, H_j = 1} {π(ξ_j, 1) − π(ξ_j, w_j)}.   (18)
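The two-valued robustness function (17) is a direct transcription into code (our sketch; parameter values are arbitrary) and makes it easy to reproduce the qualitative behavior in Figures 1 and 4:

    import numpy as np
    from scipy.stats import norm

    def robustness(xi, eps, B, m=1000, alpha=0.05):
        w1 = B / (B * eps + 1 - eps)          # normalized large weight
        w0 = 1 / (B * eps + 1 - eps)          # normalized small weight
        power = lambda w: norm.sf(norm.isf(alpha * w / m) - xi)
        return power(w1) + power(w0) - 2 * power(1.0)

    xi = norm.isf(0.05 / 1000)                # marginal effect: unweighted power 1/2
    print(robustness(xi, eps=0.01, B=10))     # positive: gain outweighs loss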

Figure 4: Robustness function R as a function of B for several values of ɛ (including ɛ = 0.01). In this example, ξ = z_{α/m}, which has power 1/2 without weighting. The gain from correct weighting far outweighs the loss from incorrect weighting as long as the fraction of large weights ɛ is small.

We will see that R > 0 under weak conditions and that the maximal robustness is obtained for ξ near the Bonferroni cutoff z_{α/m}.

Theorem 3.5 A necessary and sufficient condition for R > 0 is

    R_{B,b}(ξ) ≡ Φ̄(z_{αB/m} − ξ) + Φ̄(z_{αb/m} − ξ) − 2Φ̄(z_{α/m} − ξ) ≥ 0    (19)

where B = min{w_j : w_j > 1} and b = min_j w_j. Moreover,

    R_{B,b}(ξ) = Δ(ξ) + O(1 − b)                                         (20)

where

    Δ(ξ) = Φ̄(z_{αB/m} − ξ) − Φ̄(z_{α/m} − ξ) > 0                         (21)

and, as b → 1, µ{ξ : R_{B,b}(ξ) < 0} → 0 and inf_ξ R_{B,b}(ξ) → 0.

The theorem is illustrated in Figure 5. We see that there is overwhelming robustness as long as the minimum weight is near 1. Even in the extreme case b = 0, there is still a safe zone, an interval of values of ξ over which R > 0.

Lemma 3.6 Suppose that B ≥ 2. Then there exists ξ_0 > 0 such that R_{B,b}(ξ) > 0 for all ξ ≤ ξ_0 and all b. An upper bound on ξ_0 is z_{α/m} − 1/(z_{α/m} − z_{αB/m}).

Figure 5: The robustness function R_{B,b}(ξ) for several values of b = min_j w_j (the panels include b = 0.67 and b = 0). In each case, m = 1000, α = 0.05, B = 10. Whenever R > 0, power gain outweighs power loss. When b is near 1, R > 0 for most ξ. Even when b = 0 there is a safe zone including ξ = 0, as long as B ≥ 2.

Property II. Even if the weights are not sparse, the power of the weighted test cannot be too bad, as we now show. To begin, assume that each mean is either equal to 0 or to ξ for some fixed ξ > 0. Thus, the empirical distribution is

    Q = (1 − a)δ_0 + aδ_ξ                                                (22)

where δ denotes a point mass and a is the fraction of nonzero ξ_j's. The optimal weights are 1/a for hypotheses whose mean is ξ, and 0 otherwise. To study the effect of misspecification error, consider the case where mγ nulls are mistaken for alternatives with mean u > 0. This corresponds to misspecifying Q as

    Q̃ = (1 − a − γ)δ_0 + γδ_u + aδ_ξ.                                   (23)

We will study the effect of varying u, so let π(u) denote the power at the true alternative ξ as a function of u. Also, let π_Bonf denote the power using equal weights (Bonferroni). Note that changing Q = (1 − a)δ_0 + aδ_ξ to Q̃ = (1 − a)δ_0 + aδ_u for u ≠ ξ does not change the weights. As the weights are a function of c, we first need to find c as a function of u. The normalization condition (10) reduces to

    γ Φ̄(u/2 + c/u) + a Φ̄(ξ/2 + c/ξ) = α/m

which implicitly defines the function c(u). First we consider what happens when u is restricted to be less than ξ.

Theorem 3.7 Assume that α/m ≤ γ + a ≤ 1. Let Q = (1 − a)δ_0 + aδ_ξ and Q̃ = (1 − a − γ)δ_0 + γδ_u + aδ_ξ with 0 ≤ u ≤ ξ. Let C = sup_{0≤u≤ξ} c(u) and define

    ξ_0 = z_{α/(m(γ+a))}.                                                (24)

For ξ ≤ ξ_0,

    C = ξ_0 ξ − ξ²/2.                                                    (25)

For ξ > ξ_0, C is the solution of

    γ Φ̄(√(2C)) + a Φ̄(ξ/2 + C/ξ) = α/m.                                  (26)

In this case, C = z²_{α/(mγ)}/2 + O(a). Let

    ξ* = z_{α/m} + √(z²_{α/m} − z²_q),   where q = α(1 − a)/(mγ).        (27)

For ξ < ξ*,

    Δ ≡ inf_{0<u<ξ} π(u) − π_Bonf ≥ 0.                                   (28)

For ξ ≥ ξ* we have

    inf_{0<u<ξ} π(u) ≥ Φ̄( z²_{α/(mγ)}/(2ξ) − ξ/2 ) − O(a)                (29)
                     ≥ 1 − Φ̄( ξ*/2 − z²_{α/(mγ)}/(2ξ*) ) − O(a)          (30)
                     ≥ 1 − Φ̄( √(2 log((1 − a)/γ)) ) − O(a).              (31)

The factor Φ̄(√(2 log((1 − a)/γ))) is the worst case power deficit due to misspecification.

Now we drop the assumption that u ≤ ξ.

Theorem 3.8 Let Q = (1 − a)δ_0 + aδ_ξ and let Q̃_u = (1 − a − γ)δ_0 + γδ_u + aδ_ξ. Let π_u(ξ) denote the power at ξ using the weights computed under Q̃_u.

1. The least favorable u is

    u* ≡ argmin_{u≥0} π_u(ξ) = √(2c*) = z_{α/(mγ)} + O(a)                (32)

where c* solves

    γ Φ̄(√(2c)) + a Φ̄(ξ/2 + c/ξ) = α/m                                   (33)

and c* = z²_{α/(mγ)}/2 + O(a).

2. The minimal power is

    inf_u π_u(ξ) = Φ̄(c*/ξ − ξ/2) = Φ̄( z²_{α/(mγ)}/(2ξ) − ξ/2 ) + O(a).   (34)

3. A sufficient condition for inf_u π_u(ξ) to be larger than the power of the Bonferroni method is

    ξ ≥ z_{α/m} + √(z²_{α/m} − z²_{α/(mγ)}) + O(a).                       (35)
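The quantities in Theorems 3.7 and 3.8 are easy to evaluate numerically. The sketch below (ours; the bracketing constants and the grid are assumptions) solves the normalization condition for c(u) by bisection, scans u for the least favorable value, and compares the worst-case power at ξ with Bonferroni.

    import numpy as np
    from scipy.stats import norm

    m, alpha, a, gamma, xi = 1000, 0.05, 0.01, 0.1, 4.0

    def c_of_u(u):      # solve gamma*Phibar(u/2+c/u) + a*Phibar(xi/2+c/xi) = alpha/m for c
        f = lambda c: gamma * norm.sf(u / 2 + c / u) + a * norm.sf(xi / 2 + c / xi) - alpha / m
        lo, hi = 0.0, 200.0               # f is decreasing in c on this assumed bracket
        for _ in range(100):
            mid = (lo + hi) / 2
            lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
        return mid

    powers = [norm.sf(c_of_u(u) / xi - xi / 2) for u in np.linspace(0.05, 8, 200)]
    bonf = norm.sf(norm.isf(alpha / m) - xi)
    print(min(powers), bonf)              # least favorable u sits near z_{alpha/(m*gamma)}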

4 Choosing External Weights

In choosing external weights, we will focus here on the two-valued case. Thus,

    w = (w_1,...,w_1, w_0,...,w_0)                                       (36)

with k = mɛ terms equal to w_1 = B/(Bɛ + 1 − ɛ) and m − k terms equal to w_0 = 1/(Bɛ + 1 − ɛ). In practice, we would typically have a fixed fraction ɛ of hypotheses that we want to give more weight to. The question is how to choose B. We will focus on choosing B to produce weights with good properties at interesting values of ξ. Large values of ξ already have high power; very small values of ξ have extremely low power and benefit little from weighting. This leads us to focus on constructing weights that are useful for a marginal effect, defined as the alternative that has power 1/2 when given weight 1. Thus, the marginal effect is ξ = z_{α/m}. In the rest of this section we assume that all nonzero ξ_j's are equal to z_{α/m}. Of course, the validity of the procedure does not depend on this assumption being true.

Figure 6: Top plot: B_0(ɛ) versus ɛ. Bottom plot: R(B, .1) versus B. The turnaround point B_0(.1) is shown with a vertical dotted line.

Fix 0 < ɛ < 1 and vary B. As we increase B, we will eventually reach a point B_0(ɛ) beyond which R(B, ɛ) < 0; we call this the turnaround point. Formally,

    B_0(ɛ) = sup{B : R(B, ɛ) > 0}.                                       (37)

The top panel in Figure 6 shows B_0(ɛ) versus ɛ; it shows that for small ɛ we can choose B large without loss of power. The bottom panel shows R(B, ɛ) for ɛ = 0.1. We suggest using B = B*(ɛ), the value of B that maximizes R(B, ɛ).

Theorem 4.1 Fix 0 < ɛ < 1. As a function of B, R(B, ɛ) is unimodal and satisfies R(1, ɛ) = 0, R′(1, ɛ) > 0 and lim_{B→∞} R(B, ɛ) < 0. Hence, B_0(ɛ) exists and is unique. Also, R(B, ɛ) has a unique maximum at some point B*(ɛ), and R(B*(ɛ), ɛ) > 0.
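Since R(B, ɛ) is unimodal in B (Theorem 4.1), both B*(ɛ) and the turnaround point B_0(ɛ) can be located by a simple grid scan; a sketch (ours, with an assumed grid):

    import numpy as np
    from scipy.stats import norm

    m, alpha, eps = 1000, 0.05, 0.1
    xi = norm.isf(alpha / m)                  # marginal effect

    def R(B):
        w1 = B / (B * eps + 1 - eps)
        w0 = 1 / (B * eps + 1 - eps)
        p = lambda w: norm.sf(norm.isf(alpha * w / m) - xi)
        return p(w1) + p(w0) - 2 * p(1.0)

    Bs = np.linspace(1.001, 2000, 20000)
    vals = np.array([R(B) for B in Bs])
    B_star = Bs[vals.argmax()]                # maximizer B*(eps)
    B_0 = Bs[np.where(vals > 0)[0][-1]]       # turnaround point, if inside the grid
    print(B_star, B_0)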

When ɛ is very small, we can essentially choose B as large as we like. For example, suppose we want to increase the chance of rejecting one particular hypothesis, so that ɛ = 1/m. Then

    w_1 = Bm/(B + m − 1),   w_0 = m/(B + m − 1),

and, for large B and m, π(ξ, w_1) ≈ 1 while π(ξ, w_0) ≈ 1/2. See Figure 1.

The next results show that binary weighting schemes are optimal in a certain sense. Suppose we want at least a fraction ɛ of the hypotheses to have high power (1 − β) and otherwise we want to maximize the minimum power.

Theorem 4.2 Consider the following optimization problem: Given 0 < ɛ < 1 and 0 < β < 1/2, find a vector w = (w_1,...,w_m) that maximizes

    min_j π(ξ, w_j)

subject to m^{-1} Σ_j w_j = 1 and #{j : π(ξ, w_j) ≥ 1 − β} ≥ mɛ. The solution is given by

    w = (w_1,...,w_1, w_0,...,w_0)                                       (38)

with k = mɛ terms equal to w_1 = B/(Bɛ + 1 − ɛ) and m − k terms equal to w_0 = 1/(Bɛ + 1 − ɛ), where B = c(1 − ɛ)/(α/m − ɛc) and c = Φ̄(z_{α/m} + z_{1−β}).

If our goal is to maximize the number of alternatives with high power while maintaining a minimum power, the solution is given as follows.

Theorem 4.3 Consider the following optimization problem: Given 0 < β < 1/2 and δ > 0, find a vector w = (w_1,...,w_m) that maximizes

    #{j : π(ξ, w_j) ≥ 1 − β}                                             (39)

subject to

    m^{-1} Σ_j w_j = 1   and   min_j π(ξ, w_j) ≥ δ.                      (40)

The solution is

    w = (w_1,...,w_1, w_0,...,w_0)                                       (41)

with k = mɛ terms equal to w_1, where

    w_1 = (m/α) Φ̄(z_{α/m} + z_{1−β}),   w_0 = (m/α) Φ̄(z_{α/m} + z_δ),   ɛ = (1 − w_0)/(w_1 − w_0).   (42)

A special case that falls under this theorem permits the minimum power to be 0. In this case w_0 = 0 and ɛ = 1/w_1.

5 Estimated Weights

In this section we explain how to use the data to estimate the weights. There are two issues: we must ensure that the error is still controlled, and we must avoid incurring large losses of power due to replacing θ with an estimator θ̂.

5.1 Validity With Estimated Weights

Data Splitting. The approach taken by RVD for ensuring that error control is preserved relies on data splitting. It is based on normalized test statistics T_j^(1), T_j^(2) computed from a partition of the data into subsets X^(1), X^(2), which include fractions b and 1 − b of X, respectively. Note that

    T_j = b^{1/2} T_j^(1) + (1 − b)^{1/2} T_j^(2),   j = 1,...,m.

The training data X^(1) are used to estimate the noncentrality parameter of the standardized statistic T_j^(1), where E[T_j^(1)] = b^{1/2} ξ_j. Testing is conducted using the remaining fraction of the data, X^(2). Consequently the estimate ξ̂_j^(1) must be rescaled by r_s = ((1 − b)/b)^{1/2} to estimate the noncentrality parameter of the standardized statistic T_j^(2), i.e., ξ̃_j = r_s ξ̂_j^(1). The estimated weights are ŵ_j(T^(1)) = ρ_ĉ(ξ̃_j). Because of the independence between the two portions of the data, familywise error is controlled at the nominal level.

Lemma 5.1 The procedure that rejects when the p-value P(T_j^(2)) = Φ̄(T_j^(2)) satisfies P(T_j^(2)) < ŵ_j(T^(1)) α/m controls the familywise error at level α.

Recovering Power. As noted by Skol et al. (2006), data splitting incurs a loss of power because the p-values are computed using only a fraction of the data. To recover this lost power, we need to use all the data to compute the p-values. When using this approach, ξ̂_j^(1) must be rescaled by r_f = b^{-1/2} to estimate the noncentrality parameter of the standardized statistic T_j, i.e., ξ̃_j = r_f ξ̂_j^(1). As in the data splitting procedure, the estimated weights ŵ_j(T^(1)) = ρ_ĉ(ξ̃_j) depend only on X^(1).
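A sketch of the data-splitting procedure of Lemma 5.1 (ours; it reuses the optimal_weights bisection sketched in Section 3, and the plug-in estimator here is simply the normalized statistic of Section 5.2):

    import numpy as np
    from scipy.stats import norm

    def split_weighted_test(X, b=0.5, alpha=0.05):
        n, m = X.shape
        n1 = int(b * n)
        X1, X2 = X[:n1], X[n1:]                       # independent subsets
        T1 = np.sqrt(n1) * X1.mean(0) / X1.std(0, ddof=1)
        T2 = np.sqrt(n - n1) * X2.mean(0) / X2.std(0, ddof=1)
        r_s = np.sqrt((1 - b) / b)                    # rescale to the scale of T^(2)
        xi_tilde = np.maximum(r_s * T1, 0.0)          # crude nonnegative plug-in estimate
        w, _ = optimal_weights(xi_tilde, alpha)       # rho_c weights with mean one
        return norm.sf(T2) <= alpha * w / m           # test on X^(2) only (Lemma 5.1)

Because ŵ depends on X^(1) only and the p-values on X^(2) only, Lemma 2.1 applies conditionally on X^(1); that is the content of Lemma 5.1.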

To preserve error control with full-data p-values, we proceed as follows.

Theorem 5.2 Assume T_j^(k) ~ N(0, 1) independently for k = 1, 2. Suppose that the weight ŵ_j(T^(1)) depends only on X^(1), but the p-value P_j(T_j) is allowed to depend on the full data X. Define ĉ(T^(1)) to solve

    Σ_{j=1}^m Φ̄( ( ξ̃_j/2 + ĉ(T^(1))/ξ̃_j − b^{1/2} T_j^(1) ) / (1 − b)^{1/2} ) = α.   (43)

Then the procedure that rejects when P_j(T_j) < ŵ_j(T^(1)) α/m, where

    ŵ_j(T^(1)) = (m/α) Φ̄( ξ̃_j/2 + ĉ(T^(1))/ξ̃_j ),                       (44)

controls the familywise error at level α.

5.2 Simulations

We simulate a study with m = 1000 tests, yielding data of the form given in (1). A test of the hypothesis H_0: ξ_j ≤ 0 is performed for each j using T_j, which we assume is approximately normally distributed (or, equivalently, T_j² is approximately χ²_1). In our simulations we generate 50 of the 1000 tests under the alternative hypothesis with shift parameter ξ = 2, 3, 4 or 5. We compare the power for various levels of a threshold parameter λ ∈ {0, .5, 1, 1.5, 2, 2.5}. We use a fraction b = 0.5 of the data to construct the weights and we compare four methods for estimating the noncentrality parameter:

1. The normalized statistic: ξ̂_j^(1) = T_j^(1).

2. Hard thresholding: ξ̂_j^(1) = T_j^(1) I(|T_j^(1)| > λ).                (45)

3. Soft thresholding: ξ̂_j^(1) = sign(T_j^(1)) (|T_j^(1)| − λ)_+.         (46)

4. The James-Stein estimator: ξ̂_j^(1) = (1 − (m − 2)/Σ_i (T_i^(1))²)_+ T_j^(1).   (47)

To compute the weights we rescale ξ̂_j^(1) by r_s or r_f, as appropriate to the followup testing strategy.
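The four estimators, and the rescaling step, in a single sketch (ours; names are illustrative):

    import numpy as np

    def estimate_xi(T1, lam, method="hard"):
        T1 = np.asarray(T1, dtype=float)
        if method == "raw":                            # normalized statistic
            return T1
        if method == "hard":                           # (45): hard thresholding
            return T1 * (np.abs(T1) > lam)
        if method == "soft":                           # (46): soft thresholding
            return np.sign(T1) * np.maximum(np.abs(T1) - lam, 0.0)
        if method == "js":                             # (47): positive-part James-Stein
            shrink = max(1.0 - (len(T1) - 2) / np.sum(T1**2), 0.0)
            return shrink * T1
        raise ValueError(method)

    # Rescale by r_s = ((1-b)/b)**0.5 for data splitting, or by r_f = b**-0.5 for
    # full-data testing, before plugging into the weight function rho_c.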

Power results are displayed in Fig. 7. We first consider the power of the RVD procedure, which uses the data splitting strategy and λ = 0 (Fig. 7, the curve labeled P, at the origin). Although RVD suggest using λ = 0, we also examine the power of this procedure for a range of values of λ; this extended RVD procedure applies hard thresholding to estimate θ. Next we consider the power of four testing strategies that use the full data T for testing rather than data splitting. The first approach (B) uses binary weights equal to m/M, where M = Σ_{i=1}^m I{|T_i^(1)| > λ}. In this setting, when λ = 0 the method reduces to the simple one-stage Bonferroni approach. For λ > 0 it is the method of Skol et al. (2006). The remaining three approaches rely on weights estimated using hard thresholding (H), soft thresholding (S), or James-Stein (J). For λ = 0, methods H and S reduce to the normed sample mean, which is the RVD approach adapted to incorporate the full data in the p-value. Clearly, this adaptation of the RVD method leads to a valuable increase in power. For λ > 0, method H imposes a hard threshold shrinkage effect on the parameter estimates. Notice that for any fixed value of λ, method H gives the best power. In particular, the difference in power between methods B and H illustrates the advantage of using variable weights estimated from a fraction of the data. Method H, and to a lesser extent method B, are nearly invariant to λ for moderate values of the threshold parameter. In contrast, method S, relying on soft thresholding, experiences a sharp decline in power as λ increases. Finally, the James-Stein approach clearly fails in this setting, presumably because most tests follow the null hypothesis and hence the true signals are shrunk toward 0, which diminishes the power of the procedure. For each condition investigated the tests had size less than 0.05, as expected from the theory. The James-Stein method was the most conservative.

From this experiment it appears that shrinkage only enhances power when the signal is very weak. A more careful analysis reveals that the effect of shrinkage for stronger signals is more subtle. As ξ → 0, ρ_c(ξ) → 0, and Figure 8 shows that ρ_c(ξ) is close to zero for a broad range of ξ values. Consequently, for λ ≤ 1.5, the weight function performs almost the same role as the threshold parameter. Using hard thresholding with λ < 1.5 is essentially equivalent to using no threshold, because a moderate level of shrinkage is automatically imposed by the weight function.

Figure 8 also illustrates how the optimal weights vary with the signal strength (the top panel has greater signal than the bottom panel). Both panels indicate that the largest weights are placed in the midrange of signal strength. Essentially no weight is wasted on tests with small signals (below about 1.5) because these tests are not likely to yield significant results. The bottom panel shows that large weights are also not wasted on signals so strong that the tests can easily be rejected even without up-weighting (above about 6). The top panel places its largest weights between 2.5 and 4. The bottom panel has fewer signals in this range and hence stronger weights can be applied to signals between 2 and 2.5. Both panels indicate that near-0 weights would be applied to tests with signals near 0.

Figure 7: Power of weighted tests, plotted against the threshold parameter λ. From top left clockwise: ξ = 2, 3, 4, 5. Methods compared use weights based on hard thresholding (H), soft thresholding (S), binary weights (B), and James-Stein (J); P denotes the data-splitting procedure.

Figure 8: Distribution of weights for two sets of data (weight plotted against θ̂; the top panel has greater signal than the bottom panel).

6 Discussion

An interesting connection can be made between weights based on threshold estimators and two-stage experimental designs that perform only a subset of the tests in stage two, based on the results obtained from stage one. The simplest example of this type of two-stage testing is the two-stage Bonferroni procedure, for which the training data X^(1) are used to determine the M elements of Λ = {j : |T_j^(1)| > λ}; X^(2) is only measured for these columns. A Bonferroni correction with α/(2M) controls FWER at level α for two-sided testing in this setting. In essence this approach is a weighted test with weights equal to m/M for the elements of Λ and zero elsewhere.

While the classic two-stage approach uses X^(1) for training and X^(2) for testing, an alternative is to use the training data to determine the weights and then use all of the data to conduct the tests. This strategy was recently investigated by Skol et al. (2006), using constant weights. These authors use the training data to determine Λ and then apply weights equal to m/M to the M tests determined in stage one. This full data approach proved to be considerably more powerful than the two-stage Bonferroni approach in simulations.

For hard and soft thresholding, ξ̂_j = 0 for any |T_j^(1)| < λ. From (9) it follows that the weights for any test with ξ̂_j = 0 are 0, and the rejection region is empty. Hence, a procedure using w_j = 0 for columns with ξ̂_j = 0 is equivalent to a truncation procedure that tests only those columns in Λ.
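For concreteness, a stylized sketch (ours, not Skol et al.'s exact calibration, which sets the joint-analysis threshold more carefully) of the constant-weight, full-data strategy described above: stage one selects Λ, and each selected test receives weight m/M.

    import numpy as np
    from scipy.stats import norm

    def binary_weight_test(T1, T, lam=1.0, alpha=0.05):
        T1, T = np.asarray(T1), np.asarray(T)     # stage-one and full-data statistics
        m = len(T)
        selected = np.abs(T1) > lam               # Lambda = {j : |T_j^(1)| > lam}
        M = max(int(selected.sum()), 1)
        w = np.where(selected, m / M, 0.0)        # weights average to one
        return (norm.sf(T) <= alpha * w / m) & selected   # i.e. P_j <= alpha/M on Lambda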

In practice, λ can be chosen to optimize power or to constrain the experimental budget. It is worth noting that in some experimental settings, such as those described by Skol et al., this experimental design can lead to considerable savings of effort and resources. Our results suggest that these savings can be gleaned without losing measurable power.

The same ideas used here can be applied to other testing methods to improve power. In particular, weights can be added to the FDR method, Holm's stepdown test, and the Donoho-Jin (2004) method. Weighting ideas can also be used for confidence intervals. We plan to present the details for the other methods in a followup paper.

Another item to be addressed in future work is the connection with Bayesian methods. As we noted, using weights is equivalent to using a separate rejection cutoff for each statistic. The methods of Storey (2005) and Signoravitch (2006) find optimal cutoffs when the cutoffs are constrained. There is undoubtedly a bias-variance tradeoff: these constrained methods can estimate optimal cutoffs well (low variance), but they will not achieve the oracle power obtained here, since they are by design biased away from these separate cutoffs. Future work should be directed at comparing these approaches and developing methods that lie in between these extremes.

7 Appendix

Proof of Theorem 3.1. Let A denote the set of hypotheses with ξ_j > 0. Power is optimized if w_j = 0 for j not in A. The average power is

    π = (1/m) Σ_{j∈A} Φ̄( Φ̄^{-1}(α w_j/m) − ξ_j )

with constraint Σ_{j∈A} w_j = m. Choose w to maximize

    (1/m) Σ_{j∈A} Φ̄( Φ̄^{-1}(α w_j/m) − ξ_j ) − λ Σ_{j∈A} w_j

by setting the derivatives to zero:

    ∂π/∂w_i = (α/m²) φ( Φ̄^{-1}(α w_i/m) − ξ_i ) / φ( Φ̄^{-1}(α w_i/m) ) = λ,   i ∈ A.

Writing t_i = Φ̄^{-1}(α w_i/m), the ratio is φ(t_i − ξ_i)/φ(t_i) = exp(ξ_i t_i − ξ_i²/2); the equations thus require ξ_i t_i − ξ_i²/2 to equal some constant c for all i, that is, t_i = ξ_i/2 + c/ξ_i. The w that solves these equations is given in (9). Finally, solve for c such that Σ_i w_i = m. □

Proof of Lemma 3.3. Choose γ < min(δ, 1 − a) and K > 1 so large that 1/(γK + a) < ɛ. Choose a small c > 0. Let ξ = A + √(A² − 2c) and u = B − √(B² − 2c), where

    A = Φ̄^{-1}( α/(m(γK + a)) ),   B = Φ̄^{-1}( Kα/(m(γK + a)) ).        (48)

Then ρ(ξ) = 1/a while ρ̃(ξ) = 1/(γK + a) < ɛ, so ρ(ξ) − ρ̃(ξ) > 1/a − ɛ. Finally, d(Q, Q̃) = γ < δ. □

Proof of Theorem 3.5. The first statement follows by noting that the worst case corresponds to choosing weight B in the first term of R and weight b in the second term. The rest follows by Taylor expanding R_{B,b}(ξ) around b = 1. □

Proof of Lemma 3.6. With b = 0, R_{B,b}(ξ) ≥ 0 when

    Φ̄(z_{αB/m} − ξ) ≥ 2 Φ̄(z_{α/m} − ξ).                                 (49)

With B ≥ 2, (49) holds at ξ = 0. The left hand side is increasing in ξ for ξ near 0, but (49) does not hold at ξ = z_{α/m}. So (49) must hold on an interval [0, ξ_0]. Rewrite (49) as

    Φ̄(z_{αB/m} − ξ) − Φ̄(z_{α/m} − ξ) ≥ Φ̄(z_{α/m} − ξ).

We lower bound the left hand side and upper bound the right hand side. The left hand side is

    Φ̄(z_{αB/m} − ξ) − Φ̄(z_{α/m} − ξ) = ∫_{z_{αB/m}−ξ}^{z_{α/m}−ξ} φ(u) du ≥ (z_{α/m} − z_{αB/m}) φ(z_{α/m} − ξ).

The right hand side can be bounded using Mills' ratio: Φ̄(z_{α/m} − ξ) ≤ φ(z_{α/m} − ξ)/(z_{α/m} − ξ). Set the lower bound greater than the upper bound to obtain the stated result. □

It is convenient to prove Theorem 3.8 before proving Theorem 3.7.

Proof of Theorem 3.8. Let c* solve

    γ Φ̄(√(2c)) + a Φ̄(ξ/2 + c/ξ) = α/m.                                  (50)

We claim first that for any c > c*, there is no u such that the weights average to 1. Fix c > c*. The weights average to 1 if and only if

    γ Φ̄(u/2 + c/u) + a Φ̄(ξ/2 + c/ξ) = α/m.                              (51)

Since c > c* and since the second term is decreasing in c, we must have

    Φ̄(u/2 + c/u) > Φ̄(√(2c*)).                                           (52)

The function r(u) = Φ̄(u/2 + c/u) is maximized at u = √(2c). So r(√(2c)) ≥ r(u). But r(√(2c)) = Φ̄(√(2c)). Hence Φ̄(√(2c*)) < r(u) ≤ Φ̄(√(2c)). This implies c < c*, which is a contradiction. This establishes that sup_u c(u) ≤ c*. On the other hand, taking c = c* and u = √(2c*) solves equation (51). Thus c* is indeed the largest c that solves the equation, which establishes the first claim. The second claim follows by noting that

    γ Φ̄(√(2c)) + a Φ̄(ξ/2 + c/ξ) = γ Φ̄(√(2c)) + O(a).                    (53)

Now set this expression equal to α/m and solve. □

Proof of Theorem 3.7. Define c* as in (50). If √(2c*) ≤ ξ, the proof proceeds as in the previous proof. So we first need to establish for which values of ξ this is true. Let r(c) = γ Φ̄(√(2c)) + a Φ̄(ξ/2 + c/ξ). We want to find out when the solution of r(c) = α/m satisfies √(2c) ≤ ξ or, equivalently, c ≤ ξ²/2. Now r is decreasing in c. Since γ + a ≥ α/m, r(0) ≥ α/m. Hence there is a solution with c ≤ ξ²/2 if and only if r(ξ²/2) ≤ α/m. But r(ξ²/2) = (γ + a) Φ̄(ξ), so we conclude that there is such a solution if and only if (γ + a) Φ̄(ξ) ≤ α/m, that is, ξ ≥ z_{α/(m(γ+a))} = ξ_0.

Now suppose that ξ < ξ_0. We need to find u ≤ ξ making c as large as possible in the equation

    v(u, c) ≡ γ Φ̄(u/2 + c/u) + a Φ̄(ξ/2 + c/ξ) = α/m.

Let ū = ξ and c̄ = z_{α/(m(γ+a))} ξ − ξ²/2. By direct substitution, v(ū, c̄) = α/m for this choice of ū and c̄, and clearly ū ≤ ξ as required. We claim that c̄ is the largest possible c. Suppose instead that v(u, c) = α/m for some u ≤ ξ and c > c̄. Then v(u, c) < v(u, c̄). For u ≤ ξ ≤ √(2c̄), v(u, c̄) is an increasing function of u. Hence, v(u, c) < v(u, c̄) ≤ v(ū, c̄) = α/m. This contradicts the fact that v(u, c) = α/m.

For the second claim, note that the power of the weighted test beats the power of Bonferroni if and only if the weight w = (m/α) Φ̄(ξ/2 + C/ξ) ≥ 1, which is equivalent to

    C ≤ z_{α/m} ξ − ξ²/2.                                                (54)

When ξ ≤ ξ_0, C = ξ_0 ξ − ξ²/2. By assumption, γ + a ≤ 1, so that z_{α/(m(γ+a))} ≤ z_{α/m} and (54) holds.

Now suppose that ξ_0 < ξ ≤ ξ*. Then C is the solution to r(C) = γ Φ̄(√(2C)) + a Φ̄(ξ/2 + C/ξ) = α/m. We claim that (54) still holds. Suppose not. Then, since r(c) is decreasing in c, r(z_{α/m} ξ − ξ²/2) > r(C) = α/m. But, by direct calculation, r(z_{α/m} ξ − ξ²/2) > α/m implies that ξ > ξ*, which is a contradiction. Thus (28) holds.

Finally, we turn to (29). In this case, C = z²_{α/(mγ)}/2 + O(a). The worst case power is

    Φ̄(C/ξ − ξ/2) = Φ̄( z²_{α/(mγ)}/(2ξ) − ξ/2 ) + O(a).

The latter is increasing in ξ and so, for ξ ≥ ξ*, is at least Φ̄( z²_{α/(mγ)}/(2ξ*) − ξ*/2 ) + O(a), as claimed. The next two inequalities follow from standard tail approximations for Gaussians. Specifically, a Gaussian quantile z_{β/m} can be written as z_{β/m} = √(2 log(mL/β)), where L = c'(log m)^{a'} for constants a' and c' (Donoho and Jin 2004). Inserting this into the previous expression yields the final expression. □

Proof of Theorem 4.2. Setting π(w, ξ) = Φ̄( Φ̄^{-1}(wα/m) − ξ ) equal to 1 − β implies w = (m/α) Φ̄(z_{α/m} + z_{1−β}), which is equal to w_1 as stated in the theorem. The stated form of w_0 implies that the weights average to 1. The stated solution thus satisfies the restriction that a fraction ɛ have power at least 1 − β. Increasing the weight of any hypothesis whose weight is w_0 necessitates reducing the weight of another hypothesis. This either reduces the minimum power or forces a hypothesis with power 1 − β to fall below 1 − β. Hence, the stated solution does in fact maximize the minimum power. □

The proof of Theorem 4.3 is similar to the previous proof and is omitted.

Proof of Theorem 5.2. The familywise error is

    P(R ∩ H_0 ≠ ∅) ≤ Σ_{j∈H_0} P( P_j(T_j^(1), T_j^(2)) ≤ ŵ_j(T^(1)) α/m )
        = Σ_{j∈H_0} E[ P( P_j(t^(1), T_j^(2)) ≤ ŵ_j(t^(1)) α/m | T^(1) = t^(1) ) ]
        = Σ_{j∈H_0} E[ P( Φ̄( b^{1/2} t_j^(1) + (1 − b)^{1/2} T_j^(2) ) ≤ ŵ_j(t^(1)) α/m | T^(1) = t^(1) ) ]
        = Σ_{j∈H_0} E[ P( T_j^(2) ≥ ( Φ̄^{-1}( ŵ_j(t^(1)) α/m ) − b^{1/2} t_j^(1) ) / (1 − b)^{1/2} | T^(1) = t^(1) ) ]
        = Σ_{j∈H_0} E[ Φ̄( ( Φ̄^{-1}( ŵ_j(T^(1)) α/m ) − b^{1/2} T_j^(1) ) / (1 − b)^{1/2} ) ]
        = Σ_{j∈H_0} E[ Φ̄( ( ξ̃_j/2 + ĉ(T^(1))/ξ̃_j − b^{1/2} T_j^(1) ) / (1 − b)^{1/2} ) ]
        ≤ Σ_{j=1}^m E[ Φ̄( ( ξ̃_j/2 + ĉ(T^(1))/ξ̃_j − b^{1/2} T_j^(1) ) / (1 − b)^{1/2} ) ]
        = α

where the second-to-last line uses (44) and the final equality follows from the definition (43) of ĉ. □

References

Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, 57.

Benjamini, Y. and Hochberg, Y. (1997). Multiple hypothesis testing with weights. Scandinavian Journal of Statistics, 24.

Chen, J. J., Lin, K. K., Huque, M. and Arani, R. (2000). Weighted p-value adjustments for animal carcinogenicity trend test. Biometrics, 56.

Donoho, D. and Jin, J. (2004). Higher criticism for detecting sparse heterogeneous mixtures. The Annals of Statistics, 32.

Genovese, C. R., Roeder, K. and Wasserman, L. (2006). False discovery control with p-value weighting. To appear, Biometrika.

Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6.

Roeder, K., Bacanu, S.-A., Wasserman, L. and Devlin, B. (2006). Using linkage genome scans to improve power of association in genome scans. The American Journal of Human Genetics, 78.

Rubin, D., van der Laan, M. and Dudoit, S. (2005). Multiple testing procedures which are optimal at a simple alternative. Technical report 171, Division of Biostatistics, School of Public Health, University of California, Berkeley.

Signoravitch, J. (2006). Optimal multiple testing under the general linear model. Technical report, Harvard Biostatistics.

Skol, A. D., Scott, L. J., Abecasis, G. R. and Boehnke, M. (2006). Joint analysis is more efficient than replication-based analysis for two-stage genome-wide association studies. Nature Genetics, 38.

Storey, J. D. (2005). The optimal discovery procedure: A new approach to simultaneous significance testing. UW Biostatistics Working Paper Series, Working Paper 259.

Thomas, D. C., Haile, R. W. and Duggan, D. (2005). Recent developments in genomewide association scans: A workshop summary and review. American Journal of Human Genetics, 77.


Upper bound on false alarm rate for landmine detection and classification using syntactic pattern recognition Upper bound on false alar rate for landine detection and classification using syntactic pattern recognition Ahed O. Nasif, Brian L. Mark, Kenneth J. Hintz, and Nathalia Peixoto Dept. of Electrical and

More information

Bipartite subgraphs and the smallest eigenvalue

Bipartite subgraphs and the smallest eigenvalue Bipartite subgraphs and the sallest eigenvalue Noga Alon Benny Sudaov Abstract Two results dealing with the relation between the sallest eigenvalue of a graph and its bipartite subgraphs are obtained.

More information

Introduction to Machine Learning. Recitation 11

Introduction to Machine Learning. Recitation 11 Introduction to Machine Learning Lecturer: Regev Schweiger Recitation Fall Seester Scribe: Regev Schweiger. Kernel Ridge Regression We now take on the task of kernel-izing ridge regression. Let x,...,

More information

Biostatistics Department Technical Report

Biostatistics Department Technical Report Biostatistics Departent Technical Report BST006-00 Estiation of Prevalence by Pool Screening With Equal Sized Pools and a egative Binoial Sapling Model Charles R. Katholi, Ph.D. Eeritus Professor Departent

More information

A Smoothed Boosting Algorithm Using Probabilistic Output Codes

A Smoothed Boosting Algorithm Using Probabilistic Output Codes A Soothed Boosting Algorith Using Probabilistic Output Codes Rong Jin rongjin@cse.su.edu Dept. of Coputer Science and Engineering, Michigan State University, MI 48824, USA Jian Zhang jian.zhang@cs.cu.edu

More information

Polygonal Designs: Existence and Construction

Polygonal Designs: Existence and Construction Polygonal Designs: Existence and Construction John Hegean Departent of Matheatics, Stanford University, Stanford, CA 9405 Jeff Langford Departent of Matheatics, Drake University, Des Moines, IA 5011 G

More information

Computational and Statistical Learning Theory

Computational and Statistical Learning Theory Coputational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 2: PAC Learning and VC Theory I Fro Adversarial Online to Statistical Three reasons to ove fro worst-case deterinistic

More information

Sequence Analysis, WS 14/15, D. Huson & R. Neher (this part by D. Huson) February 5,

Sequence Analysis, WS 14/15, D. Huson & R. Neher (this part by D. Huson) February 5, Sequence Analysis, WS 14/15, D. Huson & R. Neher (this part by D. Huson) February 5, 2015 31 11 Motif Finding Sources for this section: Rouchka, 1997, A Brief Overview of Gibbs Sapling. J. Buhler, M. Topa:

More information

arxiv: v2 [math.co] 3 Dec 2008

arxiv: v2 [math.co] 3 Dec 2008 arxiv:0805.2814v2 [ath.co] 3 Dec 2008 Connectivity of the Unifor Rando Intersection Graph Sion R. Blacburn and Stefanie Gere Departent of Matheatics Royal Holloway, University of London Egha, Surrey TW20

More information

Measures of average are called measures of central tendency and include the mean, median, mode, and midrange.

Measures of average are called measures of central tendency and include the mean, median, mode, and midrange. CHAPTER 3 Data Description Objectives Suarize data using easures of central tendency, such as the ean, edian, ode, and idrange. Describe data using the easures of variation, such as the range, variance,

More information

Tight Information-Theoretic Lower Bounds for Welfare Maximization in Combinatorial Auctions

Tight Information-Theoretic Lower Bounds for Welfare Maximization in Combinatorial Auctions Tight Inforation-Theoretic Lower Bounds for Welfare Maxiization in Cobinatorial Auctions Vahab Mirrokni Jan Vondrák Theory Group, Microsoft Dept of Matheatics Research Princeton University Redond, WA 9805

More information

e-companion ONLY AVAILABLE IN ELECTRONIC FORM

e-companion ONLY AVAILABLE IN ELECTRONIC FORM OPERATIONS RESEARCH doi 10.1287/opre.1070.0427ec pp. ec1 ec5 e-copanion ONLY AVAILABLE IN ELECTRONIC FORM infors 07 INFORMS Electronic Copanion A Learning Approach for Interactive Marketing to a Custoer

More information

Tight Bounds for Maximal Identifiability of Failure Nodes in Boolean Network Tomography

Tight Bounds for Maximal Identifiability of Failure Nodes in Boolean Network Tomography Tight Bounds for axial Identifiability of Failure Nodes in Boolean Network Toography Nicola Galesi Sapienza Università di Roa nicola.galesi@uniroa1.it Fariba Ranjbar Sapienza Università di Roa fariba.ranjbar@uniroa1.it

More information

A method to determine relative stroke detection efficiencies from multiplicity distributions

A method to determine relative stroke detection efficiencies from multiplicity distributions A ethod to deterine relative stroke detection eiciencies ro ultiplicity distributions Schulz W. and Cuins K. 2. Austrian Lightning Detection and Inoration Syste (ALDIS), Kahlenberger Str.2A, 90 Vienna,

More information

. The univariate situation. It is well-known for a long tie that denoinators of Pade approxiants can be considered as orthogonal polynoials with respe

. The univariate situation. It is well-known for a long tie that denoinators of Pade approxiants can be considered as orthogonal polynoials with respe PROPERTIES OF MULTIVARIATE HOMOGENEOUS ORTHOGONAL POLYNOMIALS Brahi Benouahane y Annie Cuyt? Keywords Abstract It is well-known that the denoinators of Pade approxiants can be considered as orthogonal

More information

Distributed Subgradient Methods for Multi-agent Optimization

Distributed Subgradient Methods for Multi-agent Optimization 1 Distributed Subgradient Methods for Multi-agent Optiization Angelia Nedić and Asuan Ozdaglar October 29, 2007 Abstract We study a distributed coputation odel for optiizing a su of convex objective functions

More information

New upper bound for the B-spline basis condition number II. K. Scherer. Institut fur Angewandte Mathematik, Universitat Bonn, Bonn, Germany.

New upper bound for the B-spline basis condition number II. K. Scherer. Institut fur Angewandte Mathematik, Universitat Bonn, Bonn, Germany. New upper bound for the B-spline basis condition nuber II. A proof of de Boor's 2 -conjecture K. Scherer Institut fur Angewandte Matheati, Universitat Bonn, 535 Bonn, Gerany and A. Yu. Shadrin Coputing

More information

Support Vector Machines. Maximizing the Margin

Support Vector Machines. Maximizing the Margin Support Vector Machines Support vector achines (SVMs) learn a hypothesis: h(x) = b + Σ i= y i α i k(x, x i ) (x, y ),..., (x, y ) are the training exs., y i {, } b is the bias weight. α,..., α are the

More information

A Low-Complexity Congestion Control and Scheduling Algorithm for Multihop Wireless Networks with Order-Optimal Per-Flow Delay

A Low-Complexity Congestion Control and Scheduling Algorithm for Multihop Wireless Networks with Order-Optimal Per-Flow Delay A Low-Coplexity Congestion Control and Scheduling Algorith for Multihop Wireless Networks with Order-Optial Per-Flow Delay Po-Kai Huang, Xiaojun Lin, and Chih-Chun Wang School of Electrical and Coputer

More information

Support Vector Machines MIT Course Notes Cynthia Rudin

Support Vector Machines MIT Course Notes Cynthia Rudin Support Vector Machines MIT 5.097 Course Notes Cynthia Rudin Credit: Ng, Hastie, Tibshirani, Friedan Thanks: Şeyda Ertekin Let s start with soe intuition about argins. The argin of an exaple x i = distance

More information

arxiv: v3 [cs.lg] 7 Jan 2016

arxiv: v3 [cs.lg] 7 Jan 2016 Efficient and Parsionious Agnostic Active Learning Tzu-Kuo Huang Alekh Agarwal Daniel J. Hsu tkhuang@icrosoft.co alekha@icrosoft.co djhsu@cs.colubia.edu John Langford Robert E. Schapire jcl@icrosoft.co

More information

Interactive Markov Models of Evolutionary Algorithms

Interactive Markov Models of Evolutionary Algorithms Cleveland State University EngagedScholarship@CSU Electrical Engineering & Coputer Science Faculty Publications Electrical Engineering & Coputer Science Departent 2015 Interactive Markov Models of Evolutionary

More information

Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence

Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence Best Ar Identification: A Unified Approach to Fixed Budget and Fixed Confidence Victor Gabillon Mohaad Ghavazadeh Alessandro Lazaric INRIA Lille - Nord Europe, Tea SequeL {victor.gabillon,ohaad.ghavazadeh,alessandro.lazaric}@inria.fr

More information

Lost-Sales Problems with Stochastic Lead Times: Convexity Results for Base-Stock Policies

Lost-Sales Problems with Stochastic Lead Times: Convexity Results for Base-Stock Policies OPERATIONS RESEARCH Vol. 52, No. 5, Septeber October 2004, pp. 795 803 issn 0030-364X eissn 1526-5463 04 5205 0795 infors doi 10.1287/opre.1040.0130 2004 INFORMS TECHNICAL NOTE Lost-Sales Probles with

More information

Chapter 6 1-D Continuous Groups

Chapter 6 1-D Continuous Groups Chapter 6 1-D Continuous Groups Continuous groups consist of group eleents labelled by one or ore continuous variables, say a 1, a 2,, a r, where each variable has a well- defined range. This chapter explores:

More information

Using EM To Estimate A Probablity Density With A Mixture Of Gaussians

Using EM To Estimate A Probablity Density With A Mixture Of Gaussians Using EM To Estiate A Probablity Density With A Mixture Of Gaussians Aaron A. D Souza adsouza@usc.edu Introduction The proble we are trying to address in this note is siple. Given a set of data points

More information

Topic 5a Introduction to Curve Fitting & Linear Regression

Topic 5a Introduction to Curve Fitting & Linear Regression /7/08 Course Instructor Dr. Rayond C. Rup Oice: A 337 Phone: (95) 747 6958 E ail: rcrup@utep.edu opic 5a Introduction to Curve Fitting & Linear Regression EE 4386/530 Coputational ethods in EE Outline

More information

Multiple Testing Issues & K-Means Clustering. Definitions related to the significance level (or type I error) of multiple tests

Multiple Testing Issues & K-Means Clustering. Definitions related to the significance level (or type I error) of multiple tests StatsM254 Statistical Methods in Coputational Biology Lecture 3-04/08/204 Multiple Testing Issues & K-Means Clustering Lecturer: Jingyi Jessica Li Scribe: Arturo Rairez Multiple Testing Issues When trying

More information

Extension of CSRSM for the Parametric Study of the Face Stability of Pressurized Tunnels

Extension of CSRSM for the Parametric Study of the Face Stability of Pressurized Tunnels Extension of CSRSM for the Paraetric Study of the Face Stability of Pressurized Tunnels Guilhe Mollon 1, Daniel Dias 2, and Abdul-Haid Soubra 3, M.ASCE 1 LGCIE, INSA Lyon, Université de Lyon, Doaine scientifique

More information

Some Proofs: This section provides proofs of some theoretical results in section 3.

Some Proofs: This section provides proofs of some theoretical results in section 3. Testing Jups via False Discovery Rate Control Yu-Min Yen. Institute of Econoics, Acadeia Sinica, Taipei, Taiwan. E-ail: YMYEN@econ.sinica.edu.tw. SUPPLEMENTARY MATERIALS Suppleentary Materials contain

More information

Robustness and Regularization of Support Vector Machines

Robustness and Regularization of Support Vector Machines Robustness and Regularization of Support Vector Machines Huan Xu ECE, McGill University Montreal, QC, Canada xuhuan@ci.cgill.ca Constantine Caraanis ECE, The University of Texas at Austin Austin, TX, USA

More information

lecture 36: Linear Multistep Mehods: Zero Stability

lecture 36: Linear Multistep Mehods: Zero Stability 95 lecture 36: Linear Multistep Mehods: Zero Stability 5.6 Linear ultistep ethods: zero stability Does consistency iply convergence for linear ultistep ethods? This is always the case for one-step ethods,

More information

Lecture October 23. Scribes: Ruixin Qiang and Alana Shine

Lecture October 23. Scribes: Ruixin Qiang and Alana Shine CSCI699: Topics in Learning and Gae Theory Lecture October 23 Lecturer: Ilias Scribes: Ruixin Qiang and Alana Shine Today s topic is auction with saples. 1 Introduction to auctions Definition 1. In a single

More information

A := A i : {A i } S. is an algebra. The same object is obtained when the union in required to be disjoint.

A := A i : {A i } S. is an algebra. The same object is obtained when the union in required to be disjoint. 59 6. ABSTRACT MEASURE THEORY Having developed the Lebesgue integral with respect to the general easures, we now have a general concept with few specific exaples to actually test it on. Indeed, so far

More information

Bootstrapping Dependent Data

Bootstrapping Dependent Data Bootstrapping Dependent Data One of the key issues confronting bootstrap resapling approxiations is how to deal with dependent data. Consider a sequence fx t g n t= of dependent rando variables. Clearly

More information

Curious Bounds for Floor Function Sums

Curious Bounds for Floor Function Sums 1 47 6 11 Journal of Integer Sequences, Vol. 1 (018), Article 18.1.8 Curious Bounds for Floor Function Sus Thotsaporn Thanatipanonda and Elaine Wong 1 Science Division Mahidol University International

More information