Multiple testing with the structure-adaptive Benjamini Hochberg algorithm


1 J. R. Statist. Soc. B (2019) 81, Part 1, pp Multiple testing with the structure-adaptive Benjamini Hochberg algorithm Ang Li and Rina Foygel Barber University of Chicago, USA [Received June Final revision September 2018] Summary. In multiple-testing problems, where a large number of hypotheses are tested simultaneously, false discovery rate (FDR) control can be achieved with the well-known Benjamini Hochberg procedure, which adapts to the amount of signal in the data, under certain distributional assumptions. Many modifications of this procedure have been proposed to improve power in scenarios where the hypotheses are organized into groups or into a hierarchy, as well as other structured settings. Here we introduce the structure-adaptive Benjamini Hochberg algorithm (SABHA) as a generalization of these adaptive testing methods. The SABHA method incorporates prior information about any predetermined type of structure in the pattern of locations of the signals and nulls within the list of hypotheses, to reweight the p-values in a data-adaptive way. This raises the power by making more discoveries in regions where signals appear to be more common. Our main theoretical result proves that the SABHA method controls the FDR at a level that is at most slightly higher than the target FDR level, as long as the adaptive weights are constrained sufficiently so as not to overfit too much to the data interestingly, the excess FDR can be related to the Rademacher complexity or Gaussian width of the class from which we choose our data-adaptive weights. We apply this general framework to various structured settings, including ordered, grouped and low total variation structures, and obtain the bounds on the FDR for each specific setting. We also examine the empirical performance of the SABHA method on functional magnetic resonance imaging activity data and on gene drug response data, as well as on simulated data. Keywords: Benjamini Hochberg procedure; False discovery rate; Multiple testing; Power; Structure-adaptive Benjamini Hochberg algorithm 1. Introduction In modern scientific fields with high dimensional data, the problem of multiple testing arises whenever we search over a large number of questions or hypotheses with the hope of discovering signals while maintaining error control. Treating the many questions separately may lead to a large list of discoveries which are mostly spurious; instead, error measures such as the familywise error rate and the false discovery rate (FDR) are popular tools for producing a more reliable and replicable list of discoveries based on the available data. To formalize the problem, consider a list of n hypotheses H 1, :::, H n, accompanied by p-values P 1, :::, P n [0, 1] which have been computed on the basis of observed data. Classical approaches to the multiple-testing problem generally treat these n questions exchangeably; our only information about hypothesis i comes from its p-value P i. In many settings, however, we may have prior information or beliefs about the pattern of signals and nulls among these n questions. For example, we may believe that certain hypotheses are more likely to contain signals Address for correspondence: Rina Foygel Barber, Department of Statistics, University of Chicago, Room 108, 5734 South University Avenue, Chicago, IL 60637, USA. rina@uchicago.edu 2018 Royal Statistical Society /19/81045

2 46 A. Li and R. F. Barber than others (e.g. because of data from prior experiments); or that the true signals are likely to be clustered (e.g. if each hypothesis corresponds to a test performed at some spatial location, and the true signals are likely to appear in spatial clusters); or the hypotheses may come with some natural grouping, with true signals tending to co-occur in the same group (e.g. if each hypothesis corresponds to a gene, then known gene pathways may form such a grouping). In this work, we give a general framework for incorporating this type of information when performing multiple testing. We introduce the structure-adaptive Benjamini Hochberg (BH) procedure: a procedure which places data-adaptive weights on the p-values to adapt to the apparent patterns of signals and nulls in the data. This procedure offers increased power to detect signals by lowering the threshold for making discoveries in regions where the data suggest that signals are highly likely to occur. When run with a target FDR level α, the method offers a finite sample guarantee of FDR control at a level that is only slightly higher than α as long as we restrict the extent to which the weights can adapt to the data, thus avoiding overfitting Outline In Section 2 we give background on the multiple-testing problem, including some existing methods to accommodate known patterns in the signals and nulls, and then present the structureadaptive Benjamini Hochberg algorithm (SABHA) method in Section 3. Section 4 contains our main theoretical results on FDR control for this method, for independent or dependent p-values. In Section 5 we give the details for applying our method in several different settings. Our main theoretical results are proved in Section 6. In Section 7, we then present empirical results on simulated data and on several real data sets, with all code to reproduce all experiments available on line ( rina/sabha.html). We conclude with a discussion of our findings and directions for future work in Section 8. Some proofs and calculations are deferred to the on-line supplementary materials. The programs that were used to analyse the dats can also be obtained from 2. Background: multiple testing and false discovery rate control Before introducing our method, we first give an overview of the multiple-testing problem and various structured methods for FDR control False discovery rates and the Benjamini Hochberg procedure Given a multiple-testing problem consisting of p-values P 1, :::, P n corresponding to n hypotheses, Benjamini and Hochberg (1995) proposed a procedure for setting an adaptive threshold for these p-values, which seeks to control the FDR rather than more conservative error measures. The FDR is defined as the expected false discovery proportion, FDR = E[FDP] FDP = i 1{P i rejected and i H 0 } 1 i 1{P : i rejected} Here H 0 [n]:= {1, :::, n} is the (unknown) set of null hypotheses, i.e. the p-values P i for i H 0 correspond to testing hypotheses where no signal is present (often, we have P i uniformly distributed for i H 0, but in some settings, e.g. using discrete data, this may not be exactly so).
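To make this definition concrete, the following is a minimal sketch of computing the false discovery proportion for a given rejection set (this is our own illustration, not the paper's released code; it assumes numpy and that the set of nulls is known, as it would be only in a simulation). The FDR is then the expectation of this quantity.

```python
import numpy as np

def false_discovery_proportion(rejected, is_null):
    """FDP = (# rejected nulls) / max(1, # rejections).

    rejected : boolean array of length n, True where hypothesis i is rejected
    is_null  : boolean array of length n, True where hypothesis i is null
               (known only in simulations; FDR = E[FDP])
    """
    rejected = np.asarray(rejected, dtype=bool)
    is_null = np.asarray(is_null, dtype=bool)
    return (rejected & is_null).sum() / max(1, rejected.sum())
```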

3 The BH procedure proceeds by calculating { ˆk = max k 1:P i α k } n for at least k many p-values P i, Multiple Testing 47 or setting ˆk = 0 if this set is empty, then rejecting any p-value P i which satisfies P i α ˆk=n, for a total of ˆk many rejections. We can think of this as setting an adaptive rejection threshold, α ˆk=n, which lies between the naive threshold α and the Bonferroni adjusted threshold α=n. In the setting where the null p-values are uniformly distributed, are mutually independent and are independent of the non-null p-values, Benjamini and Hochberg (1995) proved that the BH procedure satisfies FDR = α H 0 α: n In fact, under some types of dependence, the FDR is again bounded by α; for example, if the p-values are positively dependent (Benjamini and Yekutieli s (2001) positive regression dependence on a subset condition), or in an asymptotic regime where the empirical distribution of the nulls resembles the uniform distribution (Storey et al., 2004) Modifying the Benjamini Hochberg procedure by estimating the proportion of signals Looking at the above bound on the FDR, given by α H 0 =n, we see that, in situations where the number of signals is a substantial proportion of the total number of hypotheses, the BH procedure will be overly conservative. To adapt to the proportion of signals, Storey (2002) proposed a two-stage method where, first, the relative proportion of high and low p-values is used to estimate the proportion of nulls: { ˆπ 0 = min 1, i 1{P i > τ} n.1 τ/ } :.1/ (To understand this intuitively, if π 0 = H 0 =n is the true proportion of nulls, then we should have approximately π 0 n.1 τ/ many p-values which are greater than τ; here τ is a predefined threshold, e.g. τ = 0:5 orτ = 0:75, chosen so that most of the p-values above τ are likely to be nulls.) Second, this estimated proportion is used to adjust the threshold in the BH procedure: by running the procedure with α= ˆπ 0, we now expect FDR αˆπ H 0 α, 0 n where the first step holds by the FDR level of the BH procedure, whereas the second holds when ˆπ 0 is a good (over)estimate of π 0 = H 0 =n. Storey et al. (2004), theorem 3, proved that this procedure controls the FDR at the desired level α (with a slightly more conservative definition of ˆπ 0 and rejecting only those p-values that are no larger than τ). In some settings, we may believe that there is asymmetry in the true signals in terms of positive versus negative effects if the p-values are computed from real-valued statistics whose null distribution is symmetric (e.g. a z-score), and the non-nulls are more likely to have positive means than negative (or vice versa), then taking this into account in the adaptive procedure will increase power. Zhao and Fung (2016) allowed for this asymmetry by giving distinct weights to p-values that are associated with positive versus negative statistics, specifically, by estimating ˆπ + 0 and ˆπ 0, the proportion of nulls among hypotheses with positive or negative z-scores respectively. More generally, Sun and Cai (2007) studied oracle decision rules for discovering signals, and

4 48 A. Li and R. F. Barber found that using z-scores (i.e. a signed statistic) can outperform a p-value-based procedure by making use of the sign information False discovery rate control with structured signals and nulls Here we briefly describe several existing methods that adapt to particular types of structure in the signals and nulls; more detailed comparisons with the methods that are most relevant to our work will be given later. First, Hu et al. (2010) considered a setting where the hypotheses are partitioned into disjoint groups G 1, :::, G d [n]. The groups may differ widely in terms of the proportion of nulls within the group, π.k/ 0 = H 0 G k = G k. In their group BH procedure, the p-values are then reweighted to take these groupwise proportions into account before applying the BH procedure; this enables higher power to make discoveries within those groups where signals are prevalent. As the null proportions are typically unknown in practice, their adaptive group BH first computes estimates of these proportions, denoted as ˆπ.k/ 0, and proceeds as before. This can be viewed as an extension of the work of Storey (2002) which estimates the overall proportion of nulls π 0 without separating into groups. The main theoretical results of Hu et al. (2010) show, firstly, that, if the true π.k/ 0 s are known exactly, then the FDR is controlled at the desired level α and, second, that if the π.k/ 0 s can instead be estimated consistently, then asymptotic FDR control is achieved. It is important to note that, in this setting, the goal is not to treat each group as a single unit that is rejected or not instead, the rejections are carried out at the level of the individual hypotheses, but the predetermined group structure helps to increase power by identifying groups where there are high proportions of non-nulls. In comparison, clusterwise FDR methods search for data-determined clusters of signals within the data, often treating a cluster (i.e. a data-determined contiguous block of rejection) as a single discovery for the purpose of defining the FDR. This type of multiple testing with spatial correlations or clustering has been extensively studied and arises in many applications, for instance, genomics, geophysical sciences and astronomy. Chouldechova (2014), Schwartzman et al. (2011) and Cheng and Schwartzman (2015) have developed a clusterwise FDR, which treats discoveries at the cluster level rather than counting individual points; in Chouldechova (2014) this is done by modelling the detected cluster sizes for null versus non-null clusters to adjust the usual pointwise FDR bounds and to obtain clusterwise FDR guarantees, whereas Schwartzman et al. (2011) proceeded by testing only at the local maxima of the sequence (for a one-dimensional spatial setting), and Cheng and Schwartzman (2015) developed this idea further by testing extreme values of the derivative of the sequence, rather than the values of the sequence itself. The problem of detecting data-determined clusters of non-nulls was studied also by Siegmund et al. (2011) in the context of copy number variation in genetics. Next, in some settings we may have reason to reject the hypotheses in an ordered way, i.e. to search for a cut-off point in our list so that P 1, :::, P k are labelled as likely signals and P k+1, :::, P n as likely nulls (i.e. P 1, :::, P k are our discoveries). 
This may be desirable either when prior information suggests which hypotheses are most likely to contain signals (in which case we can place these at the beginning of our list before gathering data), or when the structure of the problem demands it; for instance, in a model selection algorithm where P k corresponds to the variable that is selected at time k, it might be counterintuitive to conclude that P k+1 contains a true signal whereas P k does not. Some proposed methods for the ordered testing problem include the ForwardStop method of Grazier G'Sell et al. (2015), Barber and Candès's (2015) SeqStep, and Li and Barber's (2017) accumulation tests, each with accompanying FDR control guarantees. Barber and Candès's (2015) selective SeqStep modifies the goal by rejecting only those p-values in the first stretch of the list that also lie below some threshold, and Lei and

5 Multiple Testing 49 Fithian s (2016) adaptive SeqStep method finds a substantial increase in power by adapting to the proportion of nulls. In other settings, the ordering of the tests might not be fixed in advance, but might be determined over time by interacting with information revealed in the data set; Lei and Fithian s (2018) AdaPT method and the STAR method of Lei et al. (2017) propose algorithms testing in an order that is determined interactively without loss of FDR control. Another structure that is commonly found in the literature is when the p-values are grouped into a hierarchy, with methods in this setting developed by Benjamini and Bogomolov (2014) and applied to functional magnetic resonance imaging (MRI) data by Schildknecht et al. (2016). An extension of this hierarchical approach is found in the p-filter method of Barber and Ramdas (2017), which controls the FDR simultaneously at the level of individual hypotheses and groups of hypotheses, across different layers of groupings (which may or may not form a hierarchy). Each of these settings can be viewed as a scenario where the multiple-testing procedure adapts to the locations of signals in the data: searching for groups, clusters or regions with a high proportion of signals, or for a long initial stretch of p-values containing many signals. Each of these methods can be described as searching for a specific type of structure and then reweighting the p-values in a data-adaptive way. Conversely, an existing method by Genovese et al. (2006) handles general structures by reweighting the p-values in a non-adaptive way. Each p-value P i is reweighted by using prior information (i.e. the reweighting does not depend on the p-values themselves); then the BH procedure is applied to the reweighted p-values {P i =w i }. In this setting, the w i s can be chosen to reflect any type of known structure, for instance, giving priority to certain hypotheses over others (like in the ordered testing setting) or to certain groups of hypotheses, but the weights cannot be chosen as functions of the p-values themselves; they are non-adaptive. Relatedly, Ramdas et al. (2017) extended the p-filter, mentioned above, to allow for weighting the p-values, again with non-adaptive weights. Our proposed method, which we introduce next, combines these two lines of work by using adaptively chosen weights (i.e. weights which may depend on the p-values themselves) that can be designed to reflect any type of structure believed to be present in the problem; our method can handle general structures with data-adaptive weights. 3. The structure-adaptive Benjamini Hochberg algorithm method We now define our new method: the SABHA method. Definition 1 (SABHA). Given a target FDR level α [0, 1] a threshold τ [0, 1] and values ˆq 1, :::,ˆq n [0, 1] (where ˆq i represents an estimated probability that the ith test corresponds to a null), define { ( ) } α k ˆk = max k 1:P i τ for at least k many p-values P i, ˆq i n with the convention that ˆk = 0 if this set is empty. Then the SABHA method rejects any p-value P i satisfying ( ) α ˆk P i τ,.2/ ˆq i n for a total of ˆk many rejections. If we set τ = 1 and ˆq i = 1 for all i, then this is exactly the original BH procedure; if instead we take τ.0, 1/ and set ˆq i = ˆπ 0 to be an estimate of the proportion of nulls (which is constant

6 50 A. Li and R. F. Barber across all i), then this can give us Storey s modification of the BH procedure. (The threshold τ in the SABHA method plays exactly the same role as in Storey s (2002) method; setting τ = 0:5 is a good default value.) Alternatively, if we choose ˆq to be a fixed vector (i.e. not dependent on P), then this is equivalent to the Genovese et al. (2006) p-value weighting method; their weights w i aregivenby1= ˆq i in our notation. To understand our proposed method intuitively, we sketch a coarse estimate for the false discovery proportion of this method; for simplicity, we shall treat the ˆq i s as fixed for the time being. First, from a Bayesian perspective, consider a random model where each p-value P i has a probability q i of being a null. Suppose that ˆq i q i is an accurate approximation. For some fixed k, we would like to know how many nulls satisfy We have P i ( α ˆq i k n ) τ: [ n { ( ) }] α k E 1 i is null and P i τ i=1 ˆq i n n i=1 {( ) } α k q i τ αk, q i n where the approximation holds since ˆq i q i, and if P i is a null p-value then it should be uniformly distributed. Therefore, when we reject ˆk many p-values, we expect that there are approximately α ˆk many nulls among them, leading to a false discovery proportion that is approximately α. If we instead take a frequentist view and assume that the locations of the nulls and non-nulls are fixed rather than random, then we have [ n { ( ) }] α k E 1 i is null and P i τ = ( ) α k τ αk 1 : i=1 ˆq i n ˆq i n n ˆq i If the weights are chosen to satisfy Σ i H0 1= ˆq i n, then we again have approximately α ˆk many false discoveries out of ˆk total discoveries. For further intuition for the SABHA method, we additionally give an empirical Bayes motivation for the method in the on-line supplementary materials Choosing the weights ˆq The rough calculations above, in both the Bayesian and the frequentist versions, treat the vector of weights ˆq assuming that (a) ˆq fits the true structure of the nulls, i.e. ˆq i q i (Bayesian) or Σ i H0 1= ˆq i n (frequentist), and (b) ˆq is fixed in advance, i.e. it is independent of the vector P of p-values. In practice, in general we may not have information on the locations (or probabilities) of the nulls or even the overall proportion of nulls until we see the p-values, so we would like to choose ˆq after looking at the p-values. However, we cannot choose ˆq arbitrarily; for instance, choosing ˆq =.ɛ, :::, ɛ/ for some arbitrarily small ɛ > 0 would mean that we can always reject all p-values. We shall need to strike a compromise, allowing ourselves to choose weights ˆq that fit the structure of the nulls reasonably well, and that are allowed to adapt to the data but not too much. Specifically, condition (a) above cannot be attained without knowing the true structure of the nulls and non-nulls, but we can approximate it without this knowledge by requiring n i=1 1 1{P i > τ} n: ˆq i 1 τ
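To make definition 1 and the constraint displayed above concrete, here is a minimal sketch of the SABHA rejection rule together with the feasibility check on the weights (assuming numpy; the function and variable names are ours, not taken from the authors' released code). Setting every weight equal to 1 with τ = 1 recovers the original BH procedure, and a constant weight equal to ˆπ 0 with τ ∈ (0, 1) recovers Storey's variant, as noted above.

```python
import numpy as np

def sabha_rejections(p, qhat, alpha, tau):
    """Rejection set of the SABHA procedure (definition 1).

    p, qhat : length-n arrays of p-values and data-adaptive weights
    Returns a boolean array marking which hypotheses are rejected.
    """
    p, qhat = np.asarray(p, float), np.asarray(qhat, float)
    n = len(p)
    khat = 0
    for k in range(n, 0, -1):   # largest k with at least k p-values below threshold
        if np.sum(p <= np.minimum(alpha * k / (qhat * n), tau)) >= k:
            khat = k
            break
    if khat == 0:
        return np.zeros(n, dtype=bool)
    return p <= np.minimum(alpha * khat / (qhat * n), tau)

def weights_feasible(p, qhat, tau):
    """Check the constraint sum_i 1{P_i > tau} / ((1 - tau) * qhat_i) <= n
    displayed above; if it cannot be met, the fallback is qhat = 1_n."""
    p, qhat = np.asarray(p, float), np.asarray(qhat, float)
    return np.sum((p > tau) / qhat) / (1.0 - tau) <= len(p)
```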

7 Multiple Testing 51 This bound approximates condition (a) since E[1{P i > τ}=.1 τ/] q i in the Bayesian case, and E[1{P i > τ}=.1 τ/] 1{i H 0 } in the frequentist case. As a slight technicality, it might be that Σ n i=1 1{P i > τ} >n.1 τ/ (if perhaps most hypotheses are null), and so to ensure that our condition is feasible for some ˆq [0, 1] n we relax the above constraint to the condition n 1{P i > τ} either i=1 ˆq i.1 τ/ n or ˆq = 1 n,.3/ where 1 n is the vector of all 1s. The idea of using the empirical quantities 1{P i > τ}=.1 τ/ to learn about the pattern of nulls and non-nulls is inspired by the estimation of the null proportion in Storey s (2002) adaptive BH procedure and, indeed, if we set ˆq i = ˆπ 0 for every i then condition (3) corresponds exactly to the definition (1) of the estimate ˆπ 0. Next, to relax condition (b), we need to allow the weights ˆq to depend on the p-values P,but only mildly: we shall not allow ourselves to see the exact value of any P i τ (i.e. any p-value that could potentially be a discovery under the SABHA method). We shall allow ˆq to depend on P only through observing, for each p-value P i, (a) whether P i lies above or below the threshold τ, i.e. 1{P i > τ}, and (b) if P i > τ (and therefore cannot be selected as a discovery by the SABHA method), then we also can observe the p-value P i itself. In other words, we can observe, for each i, the value P i 1{P i > τ} a zero indicates that P i lies below the threshold τ (but does not tell us its value) whereas if P i > τ then we observe its value. Often we shall use only the indicator 1{P i > τ} to help us to select weights ˆq, without making use of the actual values P i for those p-values above the threshold τ. (Note that the entire vector. ˆq 1, :::,ˆq n / is a function of all n values.p 1 1{P 1 > τ}, :::, P n 1{P n > τ}/; we do not require that each ˆq i depends only on P i and, indeed, the power of the method relies on sharing information across the set of p-values.) In addition, we shall also constrain ourselves to selecting ˆq from some predefined set Q [0, 1] n that is chosen in advance to reflect the type of structure that we expect to find in the data. (In light of condition (3), which may require us to set ˆq = 1 n, we shall need to choose Q to contain the vector 1 n.) Formally, we encode the conditions above as follows: ˆq depends on the p-values P only through.p i 1{P i > τ}/ i=1,:::,n, and satisfies ˆq Q:.4/ To be clear, conditions (3) and (4) do not place assumptions on the unknown properties of the data; rather, they are constraints that we place on our own construction of the weights vector ˆq and are therefore within our control when we run the SABHA method. To give some concrete examples for the set Q, later we shall study settings such as (a) ordered structure (see Section 5.1), for settings where we believe that any true signals are more likely to appear in the early part of the list (i.e. the list has been arranged to suit this belief, using domain knowledge or data from prior experiments), we may choose Q = {q.0, 1] n : q 1 q 2 ::: q n } (b) grouped structure (see Section 5.2), for settings where we have a predefined split of the hypotheses into groups, we may choose Q = {q.0, 1] n : q i = q j for any indices i, j in the same group}: Although conditions (3) and (4) place constraints on ˆq, we still have some flexibility for choosing these weights, even with Q fixed. One natural method is to use a likelihood-based approach,

8 52 A. Li and R. F. Barber where, if we think of ˆq i as estimating the probability that hypothesis i is null, then we should have Pr.P i > τ/ ˆq i.1 τ/. This motivates the constrained optimization problem { n ˆq = arg max 1{P i > τ} log{q i.1 τ/} + 1{P i τ}log{1 q i.1 τ/} : n } 1{P i > τ} q Q i=1 i=1 q i.1 τ/ n,.5/ which is a convex optimization problem whenever Q is convex. To handle the multiple terms in this objective function, we shall generally use the alternating direction method of multipliers (Boyd et al., 2011); details for specific examples are given in the on-line supplementary materials. (In the case that Σ n i=1 1{P i > τ} >n.1 τ/, then condition (3) requires that we instead set ˆq = 1 n.) Our theoretical results will prove non-asymptotic bounds on the FDR of this method, provided that ˆq is chosen to satisfy conditions (3) and (4). This means that we can control the FDR at finite sample size, even without any advance knowledge of the pattern of nulls and non-nulls or even of the overall proportion of nulls. The FDR bounds will depend directly on the size of the set Q if we choose this subset of [0, 1] n to be large, then the vector of weights ˆq has the potential to overfit to noise in the data and can result in a high FDR, whereas, if Q is highly constrained, then there is little potential for overfitting and the weights vector ˆq behaves almost like a fixed vector, resulting in an FDR bound that is barely higher than the target FDR level α. We formalize this intuition in our theoretical results below. 4. False discovery control results We give two different results controlling the false discoveries that might be selected by our method. We shall first consider a setting where the p-values are assumed to be independent and, second, a dependent setting where the p-values come from dependent z-scores or from a Gaussian copula model. In each case, we show that the FDR is bounded only slightly higher than the target level α, as long as the set Q is sufficiently restricted. Throughout, we shall consider the set of null hypotheses H 0 [n] as fixed and of course unknown. Although assumptions (3) and (4) govern the way in which we can choose weights ˆq from the set Q, how should we choose this set Q initially? As discussed above, to give a result controlling the FDR, the set Q needs to be sufficiently constrained so as to limit the extent to which ˆq can overfit to the data. To quantify this, we introduce an additional definition: for any set A R n, the Rademacher complexity of A is given by [ 1 Rad.A/ = E n sup x A ] x, ξ, where the expectation is taken over a vector ξ of independent Rademacher variables, ξ i IID Unif{±1}. (Note that some texts define the Rademacher complexity without the absolute value, or with a scale factor 2=n in place of 1=n.) It is known that the Rademacher complexity is closely related to Gaussian width and statistical dimension; see for example Bartlett and Mendelsohn (2003), lemma 4. We shall apply this notion of complexity to the set Q inv := {.q 1 1, :::, q 1 / : q Q}: Since 1 n Q always by assumption and therefore 1 n Q inv,wehave n

9 Multiple Testing 53 [ ] 1 Rad.Q inv / E n 1 n, ξ n 1=2 : In contrast, if x is bounded for all x Q inv (i.e. Q is bounded away from zero), then Rad.Q inv / can be bounded by a constant. We can therefore expect that O.n 1=2 / Rad.Q inv / O.1/ for any choice of Q that is bounded away from zero. With these assumptions and definitions in place for ˆq and Q, we turn to our results on FDR control, and we shall see how the complexity measure Rad. / controls the trade-off between flexibility and false discovery control False discovery rate control under independence We first give a result on FDR control under a strong independence assumption: the null p-values.p i / i H0 are mutually independent, and are independent from the non-null p-values.p i / i H0 :.6/ Since, in many applications, null p-values might not be exactly uniformly distributed (for instance the p-values may be discretized), we may wish to relax the uniformity assumption it is common instead to require that the null p-values are superuniform, i.e. for all i H 0, Pr.P i t/ t for all t [0, 1]:.7/ For technical reasons, to obtain our clean FDR guarantee under independence, we require an additional condition relating to the threshold τ: for all i H 0, Pr.P i τ/ τ 0,.8/ where τ 0 > 0. We now turn to our first main result, proving finite sample FDR control for the SABHA procedure under independence. Theorem 1. Fix a target FDR level α [0, 1], a threshold τ.0, 1/ and a set Q.0, 1] n with 1 n Q. Suppose that the vector of p-values P [0, 1] n satisfies independence of the nulls (6), superuniformity (7) and the threshold condition (8). Suppose that ˆq satisfies assumptions (3) and (4). Then the FDR of the SABHA procedure, run with parameters α, τ and ˆq over the p-values P, is bounded as [ FDR = E[FDP] α 1 + Rad.Q inv/ min{τ 0,1 τ} where the expectation is taken with respect to the distribution of the p-values. Since this result holds for any fixed set of nulls H 0 [n], it would also hold in expectation over any random model for the nulls and non-nulls. To interpret the results of theorem 1, we would typically choose a class Q which is sufficiently simple to ensure that Rad.Q inv / 1; in this type of setting, the FDR bound would be only slightly higher than α. Q should reflect our beliefs about the structure of the problem; for instance, we may believe that the signals appear in clusters among our n hypotheses. In contrast, if we do not sufficiently constrain Q, then the bound on FDR can become meaningless for instance, if we take Q = [α,1] n and τ = 0:5, then Rad.Q inv /.α 1 1/=2, and so the upper bound on the FDR is approximately 1. This is not merely an artefact of the theory. For instance, suppose that P i IID Unif[0, 1] for all i, and so there is no signal ],

10 54 A. Li and R. F. Barber in the data. If we choose ˆq i = 1 whenever P i > τ and ˆq i = α whenever P i τ, then, whenever Σ i 1{P i τ} nτ (which occurs roughly half the time), the SABHA procedure rejects all p-values P i τ. The source of the problem is that the vector ˆq is drastically overfitting to the binary vector.1{p i > τ}/ i, which is pure noise. By instead choosing Q with a low Rademacher complexity, we ensure that ˆq cannot overfit too much to these data False discovery proportion control with dependent p-values Although many results in the literature are proved under an assumption of independent p-values, much attention has also been given to the problem of handling dependent p-values, since in many settings the data that are gathered for hypotheses 1, 2, :::, n come from the same set of experiments and are therefore likely to be strongly dependent. Whereas the original BH procedure maintains FDR control as long as this dependence satisfies some positivity assumptions (namely the positive regression dependence on a subset condition of Benjamini and Yekutieli (2001)), this is no longer so for the null-proportion-adaptive BH procedure (1) that was proposed by Storey (2002). We now give a result for the SABHA method, in a more general setting where the p-values may be dependent. To do so, we consider a model where the p-values are derived from potentially dependent z-statistics. Specifically, suppose that P i = 1 Φ.Z i /, Z N.μ, Σ/,.9/ for some mean vector μ R n and some covariance matrix Σ R n n with diagonal entries Σ ii =1. Then P i is superuniformly distributed (i.e. is a null) whenever μ i 0 this means that these p- values are computing a one-sided, right-tailed test for each Z i. Of course, we could equivalently compute a one-sided left-tailed test by taking P i = Φ.Z i /. For simplicity, our results below will be proved for the one-sided setting only, but an analogous result holds for two-sided tests. More generally, we can consider a Gaussian copula model for the vector of p-values, P i = f i.z i /, Z N.0, Σ/ and f i is monotone non-increasing for each i,.10/ i.e. the vector P is obtained by applying monotone marginal transformations to a multivariate Gaussian vector. (To obtain the z-scores model (9) as a special case, we would define f i.z/ = 1 Φ.μ i +z/, since the mean vector has been set to zero in model (10).) (In the high dimensional setting, this model is also known as the non-paranormal model (Liu et al., 2009).) We can assume that the mean is 0, and that Σ ii = 1 for all i, since shifting and rescaling the Z i s can simply be absorbed into the functions f i. In this setting, we obtain the following bound on the FDR. Theorem 2. Fix a target FDR level α [0, 1], a threshold τ.0, 1/ and a set Q [ɛ,1] n with 1 n Q. Assume that the vector of p-values P [0, 1] n follows a Gaussian copula model (10), and the null p-values are superuniform (7) and satisfy the threshold condition (8). Suppose that ˆq lies in Q and satisfies assumption (3). Then the FDR of the SABHA procedure, run with parameters α, τ and ˆq over the p-values P, is bounded as [ FDR = E[FDP] α (1 + {Rad.Q inv /log.en 2 / 1=2 } 1=2 4 4 κ1=4 + {ɛ.1 τ/} 1=2.αc/ 1=2 { } log.n/ 1=2 κ 1=2 ) + n αc + Pr. ˆk < cn/, 2 where κ is the condition number of the covariance Σ in the Gaussian copula model (10). ]
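As an illustration of the model underlying theorem 2, the sketch below (our own example, assuming numpy and scipy) simulates one-sided p-values from the z-score model (9), a special case of the Gaussian copula model (10), using an AR(1)-type covariance: dependence is strong between nearby hypotheses but decays with distance, so the condition number κ stays bounded, which is the regime in which the bound is meaningful.

```python
import numpy as np
from scipy.stats import norm

def copula_pvalues(mu, Sigma, rng):
    """One-sided p-values P_i = 1 - Phi(Z_i) with Z ~ N(mu, Sigma), as in model (9).
    Coordinates with mu_i <= 0 are nulls; their p-values are superuniform."""
    L = np.linalg.cholesky(Sigma)
    z = mu + L @ rng.standard_normal(len(mu))
    return 1.0 - norm.cdf(z)

n, rho = 500, 0.5
Sigma = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))  # AR(1) correlation
kappa = np.linalg.cond(Sigma)     # stays bounded as n grows, for fixed rho < 1
mu = np.zeros(n)
mu[:100] = 2.5                    # the first 100 hypotheses carry signal
p = copula_pvalues(mu, Sigma, np.random.default_rng(0))
```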

11 Multiple Testing 55 Here we again see the critical role of the complexity of Q, as measured by Rad.Q inv /. (Note that our constraints on ˆq are weaker than for the tighter independence result, since result (4) is no longer required; ˆq Q may now depend arbitrarily on P.) Treating τ, ɛ, κ and c as fixed positive constants, we can interpret this result as proving that the FDR is bounded nearly at level α, as long as Pr. ˆk < cn/ is low and Rad.Q inv / 1 log.n/ 1=2 : (Compare with our FDR bound theorem 1 in the independent setting, where for a meaningful bound we require only Rad.Q inv / 1.) Of course, if the Rademacher complexity Rad.Q inv / is much smaller than this bound, then we can afford values of c and ɛ that are close to 0, i.e. we would allow for a very small number of discoveries and for small weights ˆq i. For instance, if Rad.Q inv / 1= n (as for the ordered testing setting, which we shall discuss in Section 5.1), we can choose any c, ɛ {log.n/=n} 1=2 in particular, this would require that the number of discoveries satisfies ˆk {n log.n/} 1=2 with large probability, rather than a constant proportion of discoveries Comments on assumptions Although theorem 2 is proved in the setting of z-statistics, we would expect that the FDR is (approximately) controlled across a far broader range of distributions, as long as the dependence structure is again well conditioned; the Gaussian assumption is simply an artefact of our proof technique, which uses a result from the literature regarding sub-gaussianity of the Gaussian sign vector, sgn.z/. However, as mentioned above, strong dependence across a large proportion of the p-values would no longer enable FDR control, for instance, in an equicorrelated setting where Z N.0, Σ/, where Σ ij = ρ > 0 for all i j. Indeed, in this type of setting, it is known that even Storey s modification of the BH procedure will no longer control the FDR, so the more flexible SABHA method will of course lose FDR control as well therefore, requiring a bound on condition number κ appears to be necessary for the FDR control result to hold. To give an example of a dependence structure where we do expect FDR control results to hold, we can consider a scenario where the hypotheses have an inherent spatial or temporal structure, with dependence that is strong locally but decays quickly with distance, allowing for a bounded κ Dependent p-values in Storey s method As mentioned earlier, the null-proportion-adaptive BH procedure (1) that was proposed by Storey (2002) is not guaranteed to control the FDR for dependent p-values in general, even if we assume positive dependence. Instead, to characterize the behaviour of this method under weaker forms of dependence, Storey et al. (2004), equation (7), considered an asymptotic setting where the empirical distribution of the null p-values indexed by H0 n [n] converges to a uniform (or superuniform) distribution: for all t [0, 1], i H n 0 1{P i t} nπ 0 G 0.t/ as n.11/ for some cumulative distribution function G 0 satisfying G 0.t/ t. They assumed also that the empirical distribution of the non-nulls converges to some alternative distribution G 1.t/. Additional assumptions of Storey et al. (2004), theorem 4, effectively require also that the adaptive

12 56 A. Li and R. F. Barber BH procedure is likely to make a non-zero number of discoveries: for some c>0, Pr. ˆk cn/ 1 asn :.12/ With these assumptions in place, Storey et al. (2004), theorem 4, proved that lim sup n FDR α. In the adaptive weights setting, it is no longer natural to consider an asymptotic regime where n, since the set of possible weights Q is a subset of.0, 1] n and therefore our problem is intrinsically tied to a particular value of n. For example, if Q is defined relative to a graph structure on n nodes representing the n hypotheses (for instance, testing gene expression levels for n genes, with edges representing known interactions between pairs of genes), it is not clear how to embed this into a sequence of problems with increasing n. Therefore, it is difficult to compare our finite sample results in a dependent setting for the SABHA method with the asymptotic regime that was considered by Storey et al. (2004). However, we note that the asymptotic assumption (12) in their work appears also in our result theorem 2 where we assume that Pr. ˆk < cn/ is low. 5. Applications to specific types of structure We now consider the application of our main result to various specific structured settings. We shall compare with existing work for each setting. For each case we also provide bounds on the Rademacher complexity of Q inv for the relevant sets Q. Bounding Rad.Q inv / enables us to prove an explicit result for FDR control by applying theorem 1 under the assumption of independent p-values, or theorem 2 for p-values obtained from dependent z-scores. All results in this section are proved in the on-line supplementary materials Ordered structure In an ordered setting, we might believe that the hypotheses early in the list are more likely to contain true signals than those later in the list. In this setting, one natural choice for ˆq would be to require that this vector is non-decreasing. To avoid degeneracy, we also impose a lower bound ɛ > 0 and define the set Q = Q ord = {q : ɛ q 1 ::: q n 1}:.13/ We could alternatively choose to enforce that ˆq is a step function of the form ˆq =.ɛ, :::, ɛ, 1, :::,1/, in which case again ˆq Q ord (in fact, Q ord is the convex hull of such step functions). Now we examine the FDR control of the SABHA method for this ordered setting. Lemma 1. For Q = Q ord as defined above, Rad.Q inv / 1 ɛ n : Therefore, under the assumptions of theorem 1 (in the setting where the null p-values are independent), { FDR α } 1, n ɛ.1 τ/ which is barely higher than the target level α (as long as ɛ 1= n). Similarly, theorem 2 gives a meaningful result on the FDR for the dependent z-scores setting as long as ɛ {log.n/=n} 1=2.
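The constructions of a monotone ˆq that we recommend are described in Section 5.1.2 below; as a simpler illustration of what a fitted vector in Q ord looks like, the following sketch (our own shortcut, assuming numpy and scikit-learn, and not the paper's constrained maximum likelihood approach (5)) runs isotonic least squares on the indicators 1{P i > τ}/(1 − τ), clips the fit to [ε, 1], and falls back to the all-ones vector if the constraint in condition (3) fails.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def monotone_qhat(p, tau=0.5, eps=0.1):
    """A non-decreasing weight vector in Q_ord, fitted to 1{P_i > tau}/(1 - tau).

    Illustrative only: isotonic least squares in place of the constrained MLE (5).
    """
    p = np.asarray(p, float)
    n = len(p)
    y = (p > tau) / (1.0 - tau)
    iso = IsotonicRegression(y_min=eps, y_max=1.0, increasing=True)
    qhat = iso.fit_transform(np.arange(n), y)
    if np.sum((p > tau) / qhat) / (1.0 - tau) > n:   # condition (3) fallback
        qhat = np.ones(n)
    return qhat
```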

13 Multiple Testing Related methods In this ordered scenario, the existing methods of Grazier G Sell et al. (2016), Barber and Candès (2015) and Li and Barber (2017), mentioned earlier in Section 2, propose algorithms for choosing an adaptive cut-off point k and labelling P 1, :::, P k as signals and P k+1, :::, P n as nulls. Alternately, the selective SeqStep method of Barber and Candès (2015) selects an adaptive cutoff k and then rejects P i for i k only if P i τ for some predetermined threshold τ. In contrast, the ordered adaptive method that is proposed here does not have a firm cut-off; instead, the p-values early in the list are given higher priority, but rejections may occur at any point in the list; for example if the last p-value P n is extremely low then it is likely to be rejected even if the overall number of rejections is small. The version of our method that is most comparable with these existing works is when we choose ˆq to be a step function, ˆq =.ɛ, :::, ɛ,1,:::,1/; in this case, when k is the data-adaptive number of ɛ-values in ˆq, the first k p-values are given high priority for discovery whereas the remaining n k p-values are handled as in the original BH procedure, allowing for strong p-values to be rejected even if they appear late in the list. In this sense, this version of our method can be viewed as a relaxation of the ordered testing methods of Grazier G Sell et al. (2015), Barber and Candès (2015) and Li and Barber (2017) Choosing ˆq Q To choose a monotone vector ˆq which satisfies ɛ ˆq 1 ::: ˆq n 1 and reflects the patterns observed in the data, one approach would be to consider ˆq i.1 τ/ as an estimate of Pr.P i > τ/, and then to maximize the likelihood of this model, i.e. to solve problem (5) with the choice Q = Q ord. In the on-line supplementary materials, we give an algorithm for solving this convex optimization problem, which combines the pool adjacent violators algorithm (Barlow et al., 1972) for isotonic regression with the alternating direction method of multipliers (Borokov, 1999). Alternatively, instead of a likelihood-based approach for choosing a monotone vector, we may wish to consider a simpler construction for ˆq, given by a step function of the form q k =.ɛ, :::, ɛ,1,:::,1/.14/ }{{}}{{} k times n k times as discussed above. The highest power (i.e. the largest number of rejections) will be obtained by taking k as large as possible while still ensuring that assumption (3) (and, if desired, (4)) holds. In this case, we can simply set { k K = max k = 1, :::, n : i=1 1{P i > τ} ɛ.1 τ/ + n i=k+1 1{P i > τ} 1 τ } n,.15/ or set K = 0 if this set is empty; we then set ˆq = q K as in equation (14), which satisfies condition (3) by our choice of K. Choosing ˆq to be a step function has some similarity to the recent adaptive SeqStep method of Lei and Fithian (2016), which rejects all P i ɛ for i = 1, :::, K, where ɛ is fixed whereas K is determined adaptively by estimating the false discovery proportion of this set of discoveries by using the indicators {P i > τ} to estimate the proportion of nulls present. (Our parameters ɛ, τ and K are equivalent to their notation s, λ and ˆk.) To compare these two methods, adaptive SeqStep finds an adaptive cut-off K and uses a fixed rejection threshold ɛ for all p-values that appear before the cut-off (i.e. P 1, :::, P K ), whereas the SABHA method with ˆq = q K also finds an adaptive cut-off K but uses an adaptive rejection threshold, and can also reject p-values after the cut-off (i.e. 
P i for i > K) if they are extremely low.
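A minimal sketch of this step-function choice (assuming numpy; variable names are ours): K is computed as in equation (15) by scanning the cut-offs, and ˆq = q K from equation (14) then satisfies condition (3) by construction.

```python
import numpy as np

def step_function_qhat(p, tau=0.5, eps=0.1):
    """Step-function weights q^K = (eps,...,eps, 1,...,1) with K chosen as in (15)."""
    p = np.asarray(p, float)
    n = len(p)
    big = (p > tau).astype(float)
    head = np.cumsum(big)                    # sum over i <= k of 1{P_i > tau}
    tail = big.sum() - head                  # sum over i > k  of 1{P_i > tau}
    feasible = head / (eps * (1 - tau)) + tail / (1 - tau) <= n
    K = int(np.where(feasible)[0].max()) + 1 if feasible.any() else 0
    qhat = np.ones(n)
    qhat[:K] = eps
    return qhat, K
```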

14 58 A. Li and R. F. Barber 5.2. Group structure Suppose that the set of hypotheses is partitioned into groups [n] = G 1 ::: G d, with group sizes n 1 + :::+ n d = n, with the signals likely to appear together in these groups according to some natural structure or prior information. Specifically, we might believe that each group has its own proportion of nulls and non-nulls, and we might then estimate ˆq to be constant within each group but allow it to vary across groups. In this setting, we assume that these groups are predetermined they are independent of the p-values. Here, our goal is to find individual signals within each group, with the group structure helping to increase our power in this search we are not aiming to reject entire groups. We define Q = Q group = { q : ɛ q i 1 for all i, and q i = q j whenever i and j are in the same group } :.16/ (We again impose a lower bound ɛ to avoid degeneracy.) For example, if ˆπ.k/ 0 ɛ estimates the proportion of nulls within group k, we may choose ˆq i = ˆπ.k/ 0 for each hypothesis i in group k to reweight each group according to its own proportion of nulls (in other words, a groupwise version of Storey s (2002) adaptive BH procedure). A related method, the adaptive group BH procedure that was proposed by Hu et al. (2010), is equivalent to setting ˆq i = ˆπ.k/ 0 1 ˆπ 0 1 ˆπ.k/ 0 for each hypothesis i in group k, where ˆπ 0 estimates the overall proportion of nulls. The theoretical results of Hu et al. (2010) offer exact finite sample FDR control in the oracle setting where each group s proportion of nulls, i.e. the π.k/ 0 s, are known rather than estimated, and asymptotic FDR control using estimates ˆπ.k/ 0 Ẇe shall now see that, if the partition of [n] into groups is not too fine, then our main result in theorem 1 offers a finite sample FDR guarantee even when we use data-dependent estimates ˆπ.k/ 0. Lemma 2. For Q = Q group as defined above, if the groups are of sizes n 1, :::, n d, then Rad.Q inv / 1 2ɛn d ni : Therefore, under the assumptions of theorem 1 (in particular, assuming that the null p-values are independent), { 1 di=1 } ni FDR α 1 + : 2ɛ.1 τ/ n This yields a meaningful bound on FDR as long ɛ Σ d i=1 ni =n. For example, if the d groups are each of size n i = n=d, then we require ɛ.d=n/ to bound the FDR near α. If we instead apply theorem 2 for the setting of dependent z-scores, we would instead obtain a meaningful FDR guarantee when ɛ {d log.n/=n} 1=2. In summary, for the grouped setting, both the SABHA and Hu et al. (2010) methods offer various groupwise schemes for reweighting the p-values; when the groups proportions of nulls i=1

15 Multiple Testing 59 are not known in advance, the theoretical results of Hu et al. (2010) for the adaptive setting prove asymptotic FDR control, whereas our theoretical results can give a strong finite sample guarantee for the data-adaptive procedure Choosing ˆq Q To choose a groupwise constant vector ˆq, we simply need to choose a value q k for each group k = 1, :::, d and then to set ˆq i = q k for each i G k. Here q k should represent our estimated proportion of nulls in group G k. A simple estimate of this proportion is given by ˆπ.k/ 0 = i G k 1{P i > τ}, G k.1 τ/ and, if each of these estimates lie in [ɛ, 1], then we can simply set ˆq i = ˆπ.k/ 0 for each hypothesis i G k as above. We can alternatively consider a likelihood-based approach, i.e. solving problem (5) with the choice Q = Q group.if ˆπ.k/ 0 [ɛ, 1] for each group k, then the constrained maximum likelihood estimator is obtained by simply setting ˆq i = ˆπ.k/ 0 for each hypothesis i G k as above. Otherwise, the optimization problem must be solved jointly over all groups; in the on-line supplementary materials, we give an algorithm for this problem Special case: adapting to the overall proportion of nulls As a special case, we can consider the setting where there is only a single group G 1 = [n]. In this setting, we would then choose the ˆq i s all to take a single value. Letting ˆπ 0 = ˆq i (which is constant across all i), the SABHA method is then equivalent to the BH procedure at threshold α= ˆπ 0 instead of α, where our method would calculate ˆπ 0 = i 1{P i > τ}, n.1 τ/ truncated to the interval [ɛ, 1] if necessary. This coincides exactly with Storey s (2002) modification of the BH procedure, and our result lemma 2 shows that we should expect the FDR to be no larger than α{1 + O.1= n/}, very close to the target level α. (Storey et al. (2004) proved exact FDR control for this adaptive procedure, but in their setting they took a slightly more conservative definition of ˆπ 0.) Special case: positive and negative effects As a second special case, we can consider incorporating sign information in our tests. Suppose that our p-values are computed from real-valued statistics X 1, :::, X n, whose null distribution is continuous and symmetric. Letting F 0 be the cumulative distribution function for the null, for a two-sided test we would set P i = 2{1 F 0. X i /}. Now we define two groups: G + = {i : X i > 0} and G = {i : X i < 0}. We then estimate the proportion of nulls separately for each of these two groups. The effect of separating into these two groups is that we gain power if the true signals are more likely to be positive than negative (or vice versa), compared with estimating a single ˆπ 0 for the entire set of n hypotheses. This version of our procedure is equivalent to Zhao and Fung s (2016) procedure discussed in Section 2. Of course, this appears to contradict our requirement that the groups are chosen ahead of time, i.e. independently of the p-values. In fact, in some settings, this does not pose a problem. The reason is that by assuming that the null X i s are independent, and each have a symmetric distribution (as is the case for z-scores), we ensure that the null X i s are independent from

16 60 A. Li and R. F. Barber sgn.x i / (and furthermore from the other signs, sgn.x j / for j i). Therefore, we can determine the groups G + and G without revealing any information about the null p-values P i, which are a function of X i only and do not depend on sgn.x i / when using a two-sided test Low total variation In some settings, the n hypotheses may exhibit some form of locally smooth or locally constant structure. For instance, if each hypothesis is associated with a spatial location, then it might be natural to assume that signals are spatially clustered, meaning that nearby hypotheses have equal or similar likelihoods of being null or non-null. We might also have this sort of local similarity in other settings, for instance, similarity of individuals within a data set. To generalize this setting, consider a connected undirected graph G =.V G, E G / on nodes V G = [n] with e G = E G many undirected edges. We shall search for a vector ˆq that is locally constant on this graph, meaning that q i q j = 0 for most edges.i, j/ E G. One choice for Q is given by { } Q = Q TV-sparse = q [ɛ,1] n : 1{q i q j } m,.17/.i,j/ E where ɛ > 0 bounds the values away from 0 to avoid degeneracy, and m is some predetermined bound on the number of non-constant edges in the graph. However, Q TV-sparse is not convex, and it may be difficult to choose a ˆq in this set. For this reason we may wish to consider a convex set, { Q = Q TV-l1 = q [ɛ,1] n :.i,j/ E } q i q j m,.18/ which is a strict relaxation of the set that was considered above i.e. Q TV-sparse Q TV-l1 since, for q [0, 1] n, q i q j 1{q i q j }:.i,j/ E.i,j/ E In either case, the parameter m is a tuning parameter that should be specified in advance, perhaps reflecting prior knowledge about the amount of total variation in the underlying structure of the data; our theory will treat m as fixed rather than data adaptive. We shall bound the relevant Rademacher complexities by making use of recent work by Hütter and Rigollet (2016). First, following Hütter and Rigollet (2016), define the incidence matrix of the graph G, D G { 1, 0, 1} e G n, which for each edge.i, j/ E G has a corresponding row with ith entry 1, jth entry 1 and 0s elsewhere; define also the quantity ρ G = max.d G + / k 2, k=1,:::,e G where D G + Rn e G is the pseudoinverse of D G and.d G + / k is its kth column. Our next result shows that ρ G controls the complexity of the set of adaptive weights Q. Lemma 3. For Q = Q TV-sparse as defined above, Rad.Q inv / 1 ɛ n + 2ρ Gm log.n/ : ɛn If we instead take Q = Q TV-l1, then

17 Multiple Testing 61 Rad.Q inv / 1 ɛ n + 2ρ Gm log.n/ ɛ 2 : n Therefore, if we assume independent p-values, then, applying theorem 1, we obtain { 1 FDR α 1 + ɛ.1 τ/ n + 2ρ } Gm log.n/.ɛ or ɛ 2 /.1 τ/n with power of ɛ in the second denominator given by ɛ for Q = Q TV-sparse or ɛ 2 for Q = Q TV-l1. To see an example of the resulting scaling, consider a two-dimensional grid on an array of n n nodes, for which it was shown in Hütter and Rigollet (2016), proposition 4, that ρg = O{ log.n/}. Therefore, we obtain non-trivial FDR guarantees for ɛ max{1= n, m log.n/=n} in the total variation sparsity setting (Q = Q TV-sparse ), and for ɛ {m log.n/=n} 1=2 in the total variation norm setting (Q = Q TV-l1 ). For the one-dimensional case, where G is a chain graph on n nodes, Hütter and Rigollet (2016), section 3.1, calculated ρ G = n, and so to obtain a non-trivial FDR guarantee via theorem 1 with Q = Q TV-sparse we need ɛ {m 2 log.n/=n} 1=2,or choosing Q = Q TV-l1 we instead need ɛ {m 2 log.n/=n} 1=4. For simplicity, we do not give the results that are implied by theorem 2 for the dependent z-scores setting here, but these are also straightforward to calculate Comparison with related methods To compare with methods such as those of Chouldechova (2014) and Siegmund et al. (2011), which search for signals that appear in clusters (in a spatial domain or related setting), their work aims to make discoveries that form clusters (determined by the data) and to measure error at the clusterwise level as well. In contrast, our work sets the weights ˆq i in a locally constant way so that the threshold for making a discovery is locally constant, but the p-values themselves may still lead to discoveries which do not form contiguous or well-defined clusters. In general, for our method, the pattern of discoveries that we might expect to see would show some regions with high proportions of discoveries, and other regions where discoveries are very scarce. This gap would typically be much wider than if we were to apply the BH procedure to the same data, since the SABHA method boosts our ability to make discoveries in high signal regions Choosing ˆq Q As before, we can consider a constrained maximum likelihood approach to choosing ˆq Q, i.e. solving problem (5) for either Q = Q TV-sparse or Q = Q TV-l1.IfweuseQ = Q TV-l1 then this is a convex optimization problem; we give an algorithm for this setting in the on-line supplementary materials. If instead we use Q = Q TV-sparse, the resulting optimization problem is highly nonconvex and may be very difficult to solve. 6. Proof of false discovery rate control In this section we prove our main results on FDR control: theorem 1 and theorem Complexity of a set: Rademacher complexity and cube complexity We first develop a result on set complexity, which we shall use in the proof of theorem 1. For any set A R n, recall that the Rademacher complexity of A is defined as [ 1 Rad.A/ = E n sup x A ] x, ξ ξ i IID Unif{±1}:
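As a numerical illustration of this definition (our own sketch, assuming numpy; it plays no role in the proof), the Rademacher complexity can be estimated by Monte Carlo whenever the supremum can be restricted to finitely many points. For the ordered class Q ord of Section 5.1, one can check that Q inv is the polytope of vectors with 1/ε ≥ v 1 ≥ ... ≥ v n ≥ 1, so the supremum of the absolute inner product is attained at one of its n + 1 "inverse step" vertices.

```python
import numpy as np

def rademacher_mc(points, n_draws=2000, seed=0):
    """Monte Carlo estimate of Rad(A) = E[(1/n) sup_{x in A} |<x, xi>|]
    when the supremum over A can be restricted to a finite list of points."""
    points = np.asarray(points, float)            # shape (num_points, n)
    n = points.shape[1]
    xi = np.random.default_rng(seed).choice([-1.0, 1.0], size=(n_draws, n))
    return np.abs(xi @ points.T).max(axis=1).mean() / n

# Vertices of Q_inv for Q = Q_ord: (1/eps,...,1/eps, 1,...,1), for j = 0,...,n.
n, eps = 400, 0.1
verts = np.ones((n + 1, n))
for j in range(n + 1):
    verts[j, :j] = 1.0 / eps
print(rademacher_mc(verts))   # compare with the 1/(eps * sqrt(n)) scaling in lemma 1
```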

18 62 A. Li and R. F. Barber We now define the cube complexity, which is similar to the Rademacher complexity but uses arbitrary mean 0 product distributions on the cube [ 1, 1] n. First define C to be the set of mean 0 product distributions on [ 1, 1] n, i.e. each D C is a distribution of the form D 1 ::: D n where each D i is a mean 0 distribution on [ 1, 1]. Then define [ 1 CubeComplexity.A/ = sup E Y D D C n sup x, Y ] : x A If we take each D i to be the distribution placing probability 0.5 on 1 and on 1, then the Y i s are independent Rademacher variables, i.e. Y has the same distribution as ξ above, and then we would have E Y D [.1=n/sup x A x, Y ] = Rad.A/; this proves that CubeComplexity.A/ Rad.A/. We now prove that the cube complexity is in fact equal to the Rademacher complexity i.e. the supremum over D C is attained by taking D i = Unif{±1} for each i. Lemma 4. For any set A R n, the cube complexity satisfies CubeComplexity.A/ = Rad.A/ Proof of lemma 4 We know that CubeComplexity.A/ Rad.A/ trivially by choosing D i = Unif{±1} for each i. Now we prove the reverse bound. Fix any D C and let Y D. Next, let U i IID Unif[0, 1] be drawn independently of Y, and define ξ {±1} n as { 1, Ui.1 + Y ξ i = i /=2, 1, U i >.1 + Y i /=2: Then we see that ( E[ξ i Y] = 1 Pr U i 1 + Y ) ( i 2 Y +. 1/ Pr U i > 1 + Y ) i 2 Y = 1 + Y i 2 and, marginally, the ξ i s are independent with E[ξ i ] = E[E[ξ i Y]] = E[Y i ] = 0: ( Y ) i = Y i, 2 In other words, after marginalizing over Y, the ξ i s are independent Rademacher variables. We then have [ ] [ ] [ [ ]] E n sup x, Y = E x A n sup x, E[ξ Y] E E x A n sup x, ξ Y by Jensen s inequality x A [ 1 = E n sup x, ξ ] by the tower law of expectations x A = Rad.A/ by definition of Rademacher complexity: Since this holds for any choice D C, this proves the desired bound, CubeComplexity.A/ Rad.A/ Proof of false discovery rate control (theorems 1 and 2) We now turn to proving our FDR control results for the SABHA method. We can express the false discovery proportion FDP as FDP = 1{P i is rejected} = 1{P i {α ˆk=. ˆq i n/} τ},.19/ 1 ˆk 1 ˆk

19 Multiple Testing 63 by definition of the SABHA method. Before we give a formal proof, we begin by giving an informal sketch to provide an intuition for why the FDR control result should hold. If ˆk and ˆq were fixed rather than data dependent, then P i α ˆk ˆq i n τ with probability less than or equal to α ˆk=. ˆq i n/ (since P i is assumed to be superuniform). By concentration, we would then expect FDP α ˆk=. ˆq i n/ ˆk In contrast, our requirement for ˆq ensures that = α 1 ˆq i n :.20/ 1{P i > τ} 1:.1 τ/ ˆq i n Since Pr.P i > τ/ 1 τ for the nulls, by concentration we then expect that 1 ˆq i n 1{P i > τ} 1:.1 τ/ ˆq i n Together, these approximations yield FDP α. We next formalize this intuition and prove our main results Proof of false discovery rate control under independence (theorem 1) Fix any i H 0, and define P i 0 =.P 1, :::, P i 1,0,P i+1, :::, P n /: Let ˆq.P i 0 / and ˆk.P i 0 / denote the weights vector ˆq and number of rejections ˆk that we would obtain, if we were to run the procedure with p-values P i 0 in place of P. For clarity, writing ˆq and ˆk will still denote ˆq.P/ and ˆk.P/, i.e. the values that are obtained by using the p-values P. Note that, by our assumption (4) on the choice of the function ˆq, ifp i τ then we have ˆq = ˆq.P i 0 /. As is often used in proofs for the BH procedure (see for example Ferreira and Zwinderman (2006) for this type of proof technique), we then observe that, if P i is rejected, then replacing it with a smaller p-value would not change the outcome of the procedure i.e., if P i is rejected, then ˆk = ˆk.P i 0 /. By definition of the SABHA method, we can rewrite these facts as P i α ˆk ˆq i n τ P i rejected ˆq = ˆq.P i 0 /, ˆk = ˆk.P i 0 /:.21/ Now, conditioning on P i 0 (i.e. on all the p-values except for P i ), we calculate [ ] 1{Pi {α ˆk=. ˆq E i n/} τ} α.1 ˆk/ P i 0 [ ] 1{Pi α ˆk.P i 0 /={. ˆq.P i 0 // i n} τ} = E α ˆk.P i 0 / P i 0 by expression (21) = Pr[P i α ˆk.P i 0 /={. ˆq.P i 0 // i n} τ P i 0 ] α ˆk.P i 0 /

Proof of false discovery rate control under independence (theorem 1)

Fix any $i \in \mathcal{H}_0$, and define
$$P^{i \to 0} = (P_1, \ldots, P_{i-1}, 0, P_{i+1}, \ldots, P_n).$$
Let $\hat{q}(P^{i \to 0})$ and $\hat{k}(P^{i \to 0})$ denote the weights vector $\hat{q}$ and number of rejections $\hat{k}$ that we would obtain, if we were to run the procedure with p-values $P^{i \to 0}$ in place of $P$. For clarity, writing $\hat{q}$ and $\hat{k}$ will still denote $\hat{q}(P)$ and $\hat{k}(P)$, i.e. the values that are obtained by using the p-values $P$. Note that, by our assumption (4) on the choice of the function $\hat{q}$, if $P_i \leq \tau$ then we have $\hat{q} = \hat{q}(P^{i \to 0})$. As is often used in proofs for the BH procedure (see for example Ferreira and Zwinderman (2006) for this type of proof technique), we then observe that, if $P_i$ is rejected, then replacing it with a smaller p-value would not change the outcome of the procedure, i.e., if $P_i$ is rejected, then $\hat{k} = \hat{k}(P^{i \to 0})$. By definition of the SABHA method, we can rewrite these facts as
$$P_i \leq \frac{\alpha \hat{k}}{\hat{q}_i n} \wedge \tau \;\Leftrightarrow\; P_i \text{ rejected} \;\Rightarrow\; \hat{q} = \hat{q}(P^{i \to 0}), \; \hat{k} = \hat{k}(P^{i \to 0}). \qquad (21)$$
Now, conditioning on $P^{i \to 0}$ (i.e. on all the p-values except for $P_i$), we calculate
$$\mathbb{E}\bigg[\frac{1\{P_i \leq \{\alpha \hat{k}/(\hat{q}_i n)\} \wedge \tau\}}{\alpha(1 \vee \hat{k})} \,\bigg|\, P^{i \to 0}\bigg] = \mathbb{E}\bigg[\frac{1\{P_i \leq \{\alpha \hat{k}(P^{i \to 0})/((\hat{q}(P^{i \to 0}))_i n)\} \wedge \tau\}}{\alpha \hat{k}(P^{i \to 0})} \,\bigg|\, P^{i \to 0}\bigg] \quad \text{by expression (21)}$$
$$= \frac{\Pr[P_i \leq \{\alpha \hat{k}(P^{i \to 0})/((\hat{q}(P^{i \to 0}))_i n)\} \wedge \tau \mid P^{i \to 0}]}{\alpha \hat{k}(P^{i \to 0})} \leq \frac{\{\alpha \hat{k}(P^{i \to 0})/((\hat{q}(P^{i \to 0}))_i n)\} \wedge \tau}{\alpha \hat{k}(P^{i \to 0})} \quad \text{by independence (6) and superuniformity (7)}$$
$$\leq \frac{1}{(\hat{q}(P^{i \to 0}))_i n}.$$
Plugging this back into equation (19), we obtain
$$\text{FDR} = \sum_{i \in \mathcal{H}_0} \mathbb{E}\bigg[\frac{1\{P_i \leq \{\alpha \hat{k}/(\hat{q}_i n)\} \wedge \tau\}}{1 \vee \hat{k}}\bigg] \leq \alpha \sum_{i \in \mathcal{H}_0} \mathbb{E}\bigg[\frac{1}{(\hat{q}(P^{i \to 0}))_i n}\bigg],$$
yielding a similar bound to that in the proof sketch (20). We then continue:
$$\frac{1}{(\hat{q}(P^{i \to 0}))_i n} = \frac{\Pr(P_i \leq \tau \mid P^{i \to 0})}{\Pr(P_i \leq \tau) \cdot (\hat{q}(P^{i \to 0}))_i n} \quad \text{by independence (6)}$$
$$= \mathbb{E}\bigg[\frac{1\{P_i \leq \tau\}}{\Pr(P_i \leq \tau) \cdot (\hat{q}(P^{i \to 0}))_i n} \,\bigg|\, P^{i \to 0}\bigg] = \mathbb{E}\bigg[\frac{1\{P_i \leq \tau\}}{\Pr(P_i \leq \tau) \cdot \hat{q}_i n} \,\bigg|\, P^{i \to 0}\bigg] \quad \text{by expression (21)},$$
so our bound on FDR is now written in terms of $\hat{q}$ rather than $\hat{q}(P^{i \to 0})$:
$$\text{FDR} \leq \alpha \sum_{i \in \mathcal{H}_0} \mathbb{E}\bigg[\frac{1}{(\hat{q}(P^{i \to 0}))_i n}\bigg] = \alpha \sum_{i \in \mathcal{H}_0} \mathbb{E}\bigg[\frac{1\{P_i \leq \tau\}}{\Pr(P_i \leq \tau) \cdot \hat{q}_i n}\bigg].$$
Next, we recall our assumption (3) on the choice of weights $\hat{q}$: either $\hat{q}$ satisfies
$$\sum_i \frac{1\{P_i > \tau\}}{(1-\tau)\, n \hat{q}_i} \leq 1,$$
or $\hat{q} = \mathbf{1}_n$. In the first case, since $\Pr(P_i > \tau) \geq 1 - \tau$, we can write
$$\sum_{i \in \mathcal{H}_0} \frac{1\{P_i > \tau\}}{\Pr(P_i > \tau)\, n \hat{q}_i} \leq 1,$$
and therefore
$$\sum_{i \in \mathcal{H}_0} \frac{1\{P_i \leq \tau\}}{\Pr(P_i \leq \tau)\, \hat{q}_i n} \leq 1 + \sum_{i \in \mathcal{H}_0} \bigg\{\frac{1\{P_i \leq \tau\}}{\Pr(P_i \leq \tau)} - \frac{1\{P_i > \tau\}}{\Pr(P_i > \tau)}\bigg\} \frac{1}{\hat{q}_i n} \leq 1 + \max\bigg\{0, \, \sup_{q \in \mathcal{Q}} \sum_{i \in \mathcal{H}_0} \frac{1\{P_i \leq \tau\}/\Pr(P_i \leq \tau) - 1\{P_i > \tau\}/\Pr(P_i > \tau)}{q_i n}\bigg\},$$
since $\hat{q} \in \mathcal{Q}$. In the second case, when $\hat{q} = \mathbf{1}_n$, we have
$$\sum_{i \in \mathcal{H}_0} \frac{1\{P_i \leq \tau\}}{\Pr(P_i \leq \tau)\, \hat{q}_i n} = \sum_{i \in \mathcal{H}_0} \frac{1\{P_i \leq \tau\}}{\Pr(P_i \leq \tau)\, n} \leq 1 + \sum_{i \in \mathcal{H}_0} \bigg[\frac{1\{P_i \leq \tau\}}{\Pr(P_i \leq \tau)\, n} - \frac{1}{n}\bigg] \quad \text{since } |\mathcal{H}_0| \leq n$$
$$= 1 + \sum_{i \in \mathcal{H}_0} \Pr(P_i > \tau) \cdot \frac{1\{P_i \leq \tau\}/\Pr(P_i \leq \tau) - 1\{P_i > \tau\}/\Pr(P_i > \tau)}{n} \leq 1 + \max\bigg\{0, \, \sup_{q \in \mathcal{Q}} \sum_{i \in \mathcal{H}_0} \frac{1\{P_i \leq \tau\}/\Pr(P_i \leq \tau) - 1\{P_i > \tau\}/\Pr(P_i > \tau)}{q_i n}\bigg\},$$
where the last step uses the assumption that $\mathbf{1}_n \in \mathcal{Q}$.

In either case, then, we obtain
$$\text{FDR} \leq \alpha \bigg(1 + \mathbb{E}\bigg[\max\bigg\{0, \, \sup_{q \in \mathcal{Q}} \sum_{i \in \mathcal{H}_0} \frac{1\{P_i \leq \tau\}/\Pr(P_i \leq \tau) - 1\{P_i > \tau\}/\Pr(P_i > \tau)}{q_i n}\bigg\}\bigg]\bigg) = \alpha \bigg(1 + \mathbb{E}\bigg[\max\bigg\{0, \, \sup_{q \in \mathcal{Q}} \sum_i \frac{Y_i}{q_i n}\bigg\}\bigg]\bigg),$$
where we define the vector $Y = (Y_1, \ldots, Y_n)$ as
$$Y_i = \begin{cases} \dfrac{1\{P_i \leq \tau\}}{\Pr(P_i \leq \tau)} - \dfrac{1\{P_i > \tau\}}{\Pr(P_i > \tau)}, & i \in \mathcal{H}_0,\\[2ex] 0, & i \notin \mathcal{H}_0. \end{cases}$$
We see that $Y$ has independent and mean-zero co-ordinates, and $\|Y\|_\infty \leq 1/\{\tau_0 (1-\tau)\}$ with probability 1 by the threshold condition (8). Therefore,
$$\mathbb{E}\bigg[\max\bigg\{0, \, \sup_{q \in \mathcal{Q}} \sum_i \frac{Y_i}{q_i n}\bigg\}\bigg] \leq \mathbb{E}\bigg[\frac{1}{n} \sup_{x \in \mathcal{Q}_{\text{inv}}} \langle x, Y\rangle\bigg] \leq \frac{\text{CubeComplexity}(\mathcal{Q}_{\text{inv}})}{\tau_0 (1-\tau)} = \frac{\text{Rad}(\mathcal{Q}_{\text{inv}})}{\tau_0 (1-\tau)},$$
where the last step holds by lemma 4. This proves that $\text{FDR} \leq \alpha[1 + \text{Rad}(\mathcal{Q}_{\text{inv}})/\{\tau_0 (1-\tau)\}]$, as desired.
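The excess-FDR term $\text{Rad}(\mathcal{Q}_{\text{inv}})/\{\tau_0(1-\tau)\}$ in this bound can be gauged by a simple Monte Carlo calculation whenever the supremum over $\mathcal{Q}_{\text{inv}}$ is easy to evaluate. The sketch below is our own toy example, not one of the classes studied in the paper: here $\mathcal{Q}$ consists of weight vectors that are constant within prespecified groups with entries in $[\epsilon, 1]$, so that each inverse weight ranges over $[1, 1/\epsilon]$ and the supremum decouples across groups.

```python
import numpy as np

def rad_grouped_inverse(n_groups, group_size, eps, reps=20_000, seed=0):
    """Monte Carlo estimate of Rad(Q_inv) for a toy class Q of weight vectors
    that are constant within each group, with entries in [eps, 1].  The inverse
    weights are then constant within groups with values in [1, 1/eps], so the
    supremum over Q_inv is attained group by group."""
    rng = np.random.default_rng(seed)
    n = n_groups * group_size
    xi = rng.choice([-1.0, 1.0], size=(reps, n_groups, group_size))
    S = xi.sum(axis=2)                               # per-group Rademacher sums
    sup = np.where(S > 0, S / eps, S).sum(axis=1)    # optimal inverse weight per group
    return sup.mean() / n

# Coarser structure (fewer, larger groups) gives lower complexity, and hence a
# smaller excess-FDR term Rad(Q_inv) / {tau_0 * (1 - tau)}.
print(rad_grouped_inverse(n_groups=10, group_size=100, eps=0.1))
print(rad_grouped_inverse(n_groups=100, group_size=10, eps=0.1))
```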

Proof of false discovery rate control for dependent z-scores (theorem 2)

In the dependent setting, we can no longer perform the same calculations, since $P_i$ is not superuniform after conditioning on $P^{i \to 0}$. Instead, we rely on concentration results. Returning to expression (19), we can bound the false discovery proportion as
$$\text{FDP} \leq \text{FDP} \cdot 1\{\hat{k} \geq cn\} + 1\{\hat{k} < cn\} = \frac{\sum_{i \in \mathcal{H}_0} 1\{P_i \leq \{\alpha \hat{k}/(\hat{q}_i n)\} \wedge \tau\}}{1 \vee \hat{k}} \cdot 1\{\hat{k} \geq cn\} + 1\{\hat{k} < cn\}$$
$$\leq \alpha \bigg[\sum_{i \in \mathcal{H}_0} \frac{1}{\hat{q}_i n} + \sup_{k \in \{cn, \ldots, n\}, \, q \in \mathcal{Q}} \sum_{i \in \mathcal{H}_0} \bigg\{\frac{1\{P_i \leq \alpha k/(q_i n)\}}{\alpha k} - \frac{1}{q_i n}\bigg\}\bigg] + 1\{\hat{k} < cn\},$$
where the first inequality holds since $\text{FDP} \leq 1$ always. Therefore, rearranging some terms, we can write
$$\text{FDR} \leq \Pr(\hat{k} < cn) + \alpha\bigg(1 + \mathbb{E}\bigg[\bigg(\sum_{i \in \mathcal{H}_0} \frac{1}{\hat{q}_i n} - 1\bigg)\bigg] + \mathbb{E}\bigg[\sup_{k \in \{cn, \ldots, n\}, \, q \in \mathcal{Q}} \sum_{i \in \mathcal{H}_0} \bigg\{\frac{1\{P_i \leq \alpha k/(q_i n)\}}{\alpha k} - \frac{1}{q_i n}\bigg\}\bigg]\bigg).$$
Next, we work further with the first expected value. Recall that, by assumption (3), either $\hat{q} = \mathbf{1}_n$ or $\hat{q}$ satisfies
$$\sum_i \frac{1\{P_i > \tau\}}{(1-\tau)\, n \hat{q}_i} \leq 1.$$
If $\hat{q} = \mathbf{1}_n$, then $\sum_{i \in \mathcal{H}_0} 1/(\hat{q}_i n) - 1 \leq 0$ trivially. Otherwise, we have
$$\sum_{i \in \mathcal{H}_0} \frac{1}{\hat{q}_i n} - 1 \leq \sum_{i \in \mathcal{H}_0} \frac{1}{\hat{q}_i n} - \sum_i \frac{1\{P_i > \tau\}}{(1-\tau)\, n \hat{q}_i} \leq \sum_{i \in \mathcal{H}_0} \frac{1 - 1\{P_i > \tau\}/(1-\tau)}{\hat{q}_i n} \leq \sup_{q \in \mathcal{Q}} \sum_{i \in \mathcal{H}_0} \frac{1 - 1\{P_i > \tau\}/(1-\tau)}{q_i n}.$$
Combining these two cases, we then have
$$\text{FDR} \leq \Pr(\hat{k} < cn) + \alpha\bigg(1 + \underbrace{\mathbb{E}\bigg[\max\bigg\{0, \, \sup_{q \in \mathcal{Q}} \sum_{i \in \mathcal{H}_0} \frac{1 - 1\{P_i > \tau\}/(1-\tau)}{q_i n}\bigg\}\bigg]}_{\text{term 1}} + \underbrace{\mathbb{E}\bigg[\sup_{k \in \{cn, \ldots, n\}, \, q \in \mathcal{Q}} \sum_{i \in \mathcal{H}_0} \bigg\{\frac{1\{P_i \leq \alpha k/(q_i n)\}}{\alpha k} - \frac{1}{q_i n}\bigg\}\bigg]}_{\text{term 2}}\bigg).$$
From here, we use concentration results, combined with the bounded complexity of $\mathcal{Q}_{\text{inv}}$, to bound term 1 and term 2. The key steps of the remaining part of the proof are as follows.

(a) First, we bound the covering number of $\mathcal{Q}_{\text{inv}}$ with respect to the $\ell_\infty$-norm, in terms of the Rademacher complexity $\text{Rad}(\mathcal{Q}_{\text{inv}})$, using bounds calculated in Srebro et al. (2010). This allows us to write term 1 and term 2 as a maximum over finitely many $q$s, rather than a supremum over all $q \in \mathcal{Q}$.
(b) Next, we apply a result from Barber and Kolar (2018), proving that, for a multivariate Gaussian vector $Z$, the centred sign vector $\text{sgn}(Z) - \mathbb{E}[\text{sgn}(Z)]$ is sub-Gaussian. Since term 1 and term 2 are both functions of indicator variables $1\{P_i \leq c_i\}$ (or, equivalently, $1\{P_i > c_i\}$) for various thresholds $c_i$, using the Gaussian copula model for $P$ allows us to transform term 1 and term 2 into functions of the sign vector $\text{sgn}(Z)$, so we can use the sub-Gaussianity result.

These techniques will yield the bounds
$$\text{term 1} \leq \frac{4\{\text{Rad}(\mathcal{Q}_{\text{inv}}) \log(e n^2)\}^{1/2}}{(1-\tau)\,\epsilon}$$
and
$$\text{term 2} \leq \frac{\kappa^{1/2}}{\alpha c}\bigg\{\frac{2\log(n)}{n}\bigg\}^{1/2} + \frac{4\kappa^{1/4}}{(\alpha c)^{1/2}}\{\text{Rad}(\mathcal{Q}_{\text{inv}}) \log(e n^2)\}^{1/2},$$
thus proving the theorem. We defer the details to the on-line supplementary materials.

7. Experiments

We now present empirical results on simulated and real data. For the simulated data, we shall examine the low total variation setting that was described in Section 5.3 and, in particular, we shall test the role of the total variation constraint on the power and FDR of the resulting method. Then, we shall demonstrate applications of the other two types of structure that were described earlier on real data: ordered structure (as described in Section 5.1) with gene drug response data, and grouped structure (as described in Section 5.2) with functional MRI data. (Code for all experiments is available from rina/sabha.html.)

7.1. Simulated data: low total variation
In our simulated data experiments, the signals are arranged in a two-dimensional grid, with low total variation in the underlying probabilities of each hypothesis being null. Our estimate of q̂ uses the convex total variation norm as in expression (18) over the two-dimensional grid. To generate the data, we create a list of n = 225 p-values arranged in a grid, and we assign a true underlying prior probability of each p-value being a null,

    q_i = 0.1 if i is in the high signal region R,   q_i = 0.9 if i ∉ R,

where the region R consists of two quarter-circles, in the top right-hand and bottom left-hand corners of the grid, for a total of |R| = 70 points (Fig. 1). The mean probability of a null is (1/n) Σ_i q_i = 0.6511. To generate the p-values themselves, we draw I_i ~ Bernoulli(1 − q_i), indicating whether hypothesis i is a true signal, and then draw Z_i ~ N(μ_i, 1) where μ_i = μ_sig · I_i, for some fixed μ_sig > 0 (with larger μ_sig indicating a stronger signal). Then we run two-sided z-tests, P_i = 2{1 − Φ(|Z_i|)}, where Φ is the cumulative distribution function of the standard normal distribution. We repeat our experiment for each value μ_sig ∈ {0.5, 1, 1.5, ..., 3.5}.
We implement the SABHA method on the two-dimensional grid graph, choosing τ = 0.5 and defining Q = Q_TV-ℓ1 as in expression (18), with the lower bound parameter set at ε = 0.1 and with the total variation constraint m ∈ {10, 15, 20}. (For the true q, we have ||q||_TV-ℓ1 = 22.2, i.e. q ∉ Q_TV-ℓ1 for any of these values of m, but nonetheless these values are sufficient to allow for substantial adaptivity.) We then compare the SABHA method with the BH method and also with Storey's modification of the BH method given in equation (1), implemented with parameter τ = 0.5. To fit q̂ for the SABHA method, i.e. to solve the optimization problem (5) with Q = Q_TV-ℓ1, details are given in the on-line supplementary materials. For comparison we also run an oracle version of the SABHA method where q̂ = q, the true vector of prior probabilities that each p-value is a null. For all methods we choose the target FDR level α = 0.1.
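The data-generating process just described is easy to reproduce. The sketch below is our own reconstruction: the article does not state the grid dimensions or the exact placement of R, so we take a 15 × 15 grid (giving n = 225) and quarter-circles of radius 6 in opposite corners, a choice that happens to give |R| = 70 and a mean null probability of 0.6511.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# 15 x 15 grid (n = 225); the radius-6 quarter-circles are our guess at R,
# chosen only so that |R| = 70 as reported in the text.
side = 15
ii, jj = np.meshgrid(np.arange(side), np.arange(side), indexing="ij")
in_R = (ii**2 + jj**2 <= 36) | ((side - 1 - ii)**2 + (side - 1 - jj)**2 <= 36)
q = np.where(in_R, 0.1, 0.9).ravel()          # prior probability of being null
print(in_R.sum(), q.mean())                   # 70 points in R, mean = 0.6511

# One draw of the p-values at a given signal strength.
mu_sig = 2.5
is_signal = rng.uniform(size=q.size) < 1 - q  # I_i ~ Bernoulli(1 - q_i)
Z = rng.normal(loc=mu_sig * is_signal)        # Z_i ~ N(mu_sig * I_i, 1)
P = 2 * (1 - norm.cdf(np.abs(Z)))             # two-sided z-test p-values
```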
We then compare the observed FDR and power for each of the methods considered. As we can see in Fig. 1(a), for all methods, the average power and average observed FDR both increase with larger μ_sig (stronger signal). The BH, Storey BH and oracle SABHA methods all control FDR at level α = 0.1; the BH method is in fact known to control the FDR at the level α|H_0|/n, and is therefore quite conservative (i.e. FDR < 0.1) as we observe in our results, whereas the other two methods have FDR ≈ 0.1. For the SABHA method, as the signal grows stronger, the observed FDR grows very close to 0.1 for small m = 10 and medium m = 15; for large m = 20, the observed FDR somewhat exceeds the target level α = 0.1, as q̂ is overfitting to the p-values. To compare the power, the BH method is the most conservative (lowest power), followed by the Storey BH method. The oracle SABHA method shows the highest power, whereas the SABHA method shows power increasing as the total variation constraint m increases. Here, the SABHA method is consistently more powerful than the BH and Storey BH methods over the range of μ_sig. In particular, the SABHA (with the small bound m = 10) and Storey BH methods achieve nearly identical FDR levels when the signal strength μ_sig is not too low, and so the SABHA method's improvement in power is meaningful in that it does not come at the cost of increasing the FDR.
In Figs 1(b)-1(g), we further compare the methods according to their estimated probabilities of a null, for one trial at signal strength μ_sig = 2.5. For the SABHA method with m = 10, 15, 20, the plots display the estimated q̂. We can compare against the true q, which is also the input to the oracle SABHA method. Finally, since the BH and Storey BH methods are equivalent to taking q̂ = 1_n and q̂ = π̂_0 1_n respectively and proceeding as with the SABHA method, we display these estimated q̂s as well. As expected, for the SABHA method, we see that the fitted weights q̂_i approximate the overall pattern of underlying probabilities q_i, and the larger values of m show more overfitting, in the sense that the weights q̂_i show distinctions between regions of the grid which are not actually different in the underlying model (leading to the increased FDR that was observed previously).

Fig. 1. (a) Power and observed FDR level (vertical axes) of all procedures against the signal strength μ_sig (horizontal axis), averaged over 50 trials (the target FDR level is α = 0.1): BH; Storey BH; SABHA (m = 10); SABHA (m = 15); SABHA (m = 20); oracle SABHA. (b)-(g) True q versus estimated q̂ for a single trial with μ_sig = 2.5 (black (0) contains all signals; white (1) contains all nulls) (see Section 7.1 for details): (b) true probability; (c) BH; (d) Storey BH; (e) SABHA (m = 10); (f) SABHA (m = 15); (g) SABHA (m = 20).

7.2. Gene drug response data: ordered structure
We next apply the SABHA method to an ordered testing problem: the gene dosage response application that was described in Li and Barber (2017). The data (Coser et al., 2003) consist of expression levels for n = 22283 genes at various drug dosages. (The data are available via the GEOquery package (Davis and Meltzer, 2007) in R.) For each gene i = 1, ..., 22283, we calculate a p-value P_i which evaluates evidence for a change in gene expression level between the control (no dose) setting and a low dose setting, using a permutation test. The genes are arranged in order, with the beginning of the list (low index i) corresponding to genes whose differential expression level at high dosage leads us to believe that the gene is more likely to exhibit a non-zero response at low dosage; for the low dosage data we then run a one-sided test; for instance, a gene that showed increased expression at the high dosage is tested for increased expression at the low dosage.
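As a rough illustration of how such a one-sided permutation p-value might be computed for a single gene, the sketch below permutes the control and low-dose labels and compares the observed mean difference with its permutation distribution. This is our own sketch; the exact test statistic and implementation used by Li and Barber (2017) may differ.

```python
import itertools
import numpy as np

def one_sided_permutation_pvalue(control, low_dose, direction=+1):
    """One-sided permutation test for a single gene (illustrative sketch).

    direction = +1 tests for increased expression at the low dose, -1 for
    decreased; the sign would be chosen from the high-dose data.  With 5
    samples per condition, all relabellings can be enumerated exactly."""
    x = np.concatenate([control, low_dose])
    n_c = len(control)
    observed = direction * (low_dose.mean() - control.mean())
    stats = []
    for idx in itertools.combinations(range(len(x)), n_c):
        mask = np.zeros(len(x), dtype=bool)
        mask[list(idx)] = True                       # relabelled "controls"
        stats.append(direction * (x[~mask].mean() - x[mask].mean()))
    stats = np.array(stats)
    return (1 + np.sum(stats >= observed)) / (1 + len(stats))

rng = np.random.default_rng(2)
print(one_sided_permutation_pvalue(rng.normal(0, 1, 5), rng.normal(1, 1, 5)))
```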

See Li and Barber (2017) for more details on the data set and on how the ordering and the p-values are calculated.
We run the SABHA method by using Q = Q_ord (13) and choose q̂ to be a step function as in expression (14), with lower bound parameter ε = 0.1 and threshold τ = 0.5. We compare the SABHA method against the BH method, and against the Storey BH method with threshold τ = 0.5. We also compare against four methods for ordered hypothesis testing: the ForwardStop method (Grazier G'Sell et al., 2015), the SeqStep method (Barber and Candès, 2015) (with parameter C = 2), the accumulation test method with the HingeExp function (Li and Barber, 2017) (parameter C = 2), and the adaptive SeqStep method (Lei and Fithian, 2016); see Section 5.1 for some more details on this family of methods. We run each method with target FDR level α = 0.01, 0.02, ..., 0.5.
In this data set, the sample size (the number of measurements for each gene) at each of the three settings, control, low dose and high dose, is 5. We repeat our experiment three times: first, using this full sample size; second, using a sample size of 2 for the high dose setting only, to reduce the quality of the ordering (i.e. we shall be less accurate in our assessment of which genes are most likely to show a change in expression level at the low dosage); third, using a random ordering (i.e. no information is contained in the ordering; true signals are equally likely to appear anywhere on the list).

Results
Fig. 2 shows the results from all the methods across the range of α-values, for all three settings for how the ordering is determined. The plotted outcome is the number of discoveries, i.e. the number of genes that are selected as showing a significant difference between the low dose and no dose settings. Neither the BH nor the Storey BH method, which do not use the information of the ordering, can make more than a few discoveries at any level α below approximately 0.3; since these methods do not use the ordering information, their outcomes are identical across the three settings. Turning to the ordered testing methods (the accumulation test or HingeExp, SeqStep, ForwardStop and adaptive SeqStep), we see that these methods can recover a substantial number of genes even for low α when the ordering is highly informative, with a wide range of performance across the methods, but make essentially no discoveries when the ordering is random (carries no information). Next, we see that the SABHA method can perform well across the range of scenarios; although it is not the optimal method in any single scenario, it is the only method whose performance is adaptive, whereas the existing methods are all specialized to one or the other extreme. For the first setting (sample size 5 used to obtain a highly informative ordering), we see that the SABHA method is less powerful than the best ordered testing method but nonetheless gives strong performance even at low α-values; for the second setting (sample size 2 used to obtain a moderately informative ordering), the SABHA method is now more comparable with the best ordered testing method across much of the range of α-values; and for the third setting (a random, i.e. completely uninformative, ordering), the SABHA method continues to perform well and is nearly as powerful as the Storey BH method, which makes the most discoveries in this setting, whereas the ordered testing methods now have effectively zero power except for adaptive SeqStep at the high values of α.
To summarize, the SABHA method can adapt to the amount of information that is carried in the ordering, achieving good performance relative to the best method in each setting. In practice, then, when we do not know ahead of time whether the ordered structure is informative or not, the SABHA method is a good choice, as it can adapt to a wide range of scenarios.

Fig. 2. Results for the differential gene expression experiment: for each method (SeqStep; accumulation test (HingeExp); ForwardStop; adaptive SeqStep; BH; Storey; SABHA), the plot shows the number of discoveries made (i.e. the number of genes selected as showing significant change in expression at the low drug dosage compared with the control) at a range of target FDR values α: (a) highly informative ordering; (b) moderately informative ordering; (c) random ordering.

7.3. Functional magnetic resonance imaging data: grouped structure
We now show an application of our method in an analysis of functional MRI data, in which the group structure of the problem can be exploited (see Section 5.2 for more details on the group-structured setting). The data, gathered by Keller et al. (2001), consist of functional MRI measurements taken for subjects who are shown two stimuli in a sequence, a picture and a sentence in either order, with the task of determining whether the two agree. We restrict our attention to data from a single individual (subject 04867), and to the 20 trials with the picture displayed first and the sentence second, where the procedure is of the form

    picture displayed (4 s); no stimulus (4 s) [picture phase, 8 s]; sentence displayed (4 s); no stimulus (4 s) [sentence phase, 8 s].

Measurements are taken every 0.5 s, for a total of 16 brain images each in the picture phase and in the sentence phase. Though only a fraction of the brain of each subject was imaged, these images reflect three-dimensional activity information, as each image includes eight two-dimensional slices.

Fig. 3. Results for the functional MRI data (the eight images in each column are the horizontal slices of the brain): (a) the 24 ROIs defined in the data set, each pictured in a different colour; (b) the average activity levels recorded in the experiment for the picture phase and for the sentence phase (white, highest activity level; black, zero activity level); (c) the p-values obtained at each voxel using the paired t-test (black (0), most significant; white (1), least significant); (d) the estimated vector q̂ for the SABHA method using the group structure defined by the ROIs (black (0) contains all signals; white (1) contains all nulls); (e) results for the BH, Storey BH and SABHA methods, with voxels labelled as discoveries marked.

The 4691 measured voxels are grouped into 24 regions of interest (ROIs), anatomically or functionally distinct regions in the brain. (Although 4698 voxels are measured and 25 ROIs are defined, seven voxels are not labelled with an ROI, and one ROI is not assigned to any voxels, so we remove these from the data.) We are interested in how the activity of different parts of the brain differs between the picture phase and the sentence phase, which can be formulated as a multiple-testing problem, with each voxel in the brain corresponding to a hypothesis. For each voxel we compute the average activity level across the 16 images in each of the two phases, for each of the 20 trials, which leads to a p-value computed from a paired t-test with sample size 20. In the first few columns of Fig. 3, we display the partition into ROIs (Fig. 3(a)), the original data averaged for the picture phase and the sentence phase (Fig. 3(b)) and the calculated p-values (Fig. 3(c)). Each column is a single three-dimensional image of the brain, split into eight horizontal slices.
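The per-voxel p-values described above can be obtained with a standard paired t-test. The sketch below is our own illustration, using synthetic numbers in place of the actual fMRI measurements and a two-sided test (the article does not state the sidedness).

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(3)

# Synthetic stand-in for the real data: for each of 4691 voxels and each of the
# 20 trials, the activity averaged over the 16 images of the picture phase and
# of the sentence phase respectively.
n_voxels, n_trials = 4691, 20
picture = rng.normal(size=(n_voxels, n_trials))
sentence = rng.normal(size=(n_voxels, n_trials))

# Paired t-test across the 20 trials, one (two-sided) p-value per voxel.
t_stat, p_values = ttest_rel(picture, sentence, axis=1)
print(p_values.shape)
```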
