Adaptive Filtering Procedures for Replicability Analysis of High-throughput Experiments


Jingshu Wang^1, Weijie Su^1, Chiara Sabatti^2, and Art B. Owen^2

^1 Department of Statistics, University of Pennsylvania
^2 Department of Statistics, Stanford University

January 2018

Abstract

Replicability is a fundamental quality of scientific discoveries. While meta-analysis provides a framework to evaluate the strength of signals across multiple studies, accounting for experimental variability, it does not investigate replicability: a single, possibly non-reproducible, study can be enough to reach significance. In contrast, the partial conjunction (PC) alternative hypothesis stipulates that, for a chosen number r (r > 1), at least r out of n related individual hypotheses are non-null, making it a useful measure of replicability. Motivated by genetics problems, we consider settings where a large number M of partial conjunction null hypotheses are tested using an n × M matrix of p-values, where n is the number of studies. Applying multiple testing adjustments directly to PC p-values can be very conservative. We here introduce AdaFilter, a new procedure that, mindful of the fact that the PC null is a composite hypothesis, increases power by filtering out unlikely candidate PC hypotheses using the whole p-value matrix. We prove that appropriate versions of AdaFilter control the familywise error rate and the per family error rate under independence. We show that these error rates and the false discovery rate can be controlled under independence and a within-study local dependence structure while achieving much higher power than existing methods. We illustrate the effectiveness of the AdaFilter procedures with three different case studies.

Keywords: meta-analysis, partial conjunction, multiple hypothesis testing, composite null

jingshuw@upenn.edu. Jingshu Wang is supported by NIH grant 2 R01HG and the Wharton Dean's Fund. Art Owen is supported by the US NSF grant DMS. Chiara Sabatti is supported by NIH grants R01MH and U01HG.

1 Introduction

Replication is the cornerstone of science (Moonesinghe et al., 2007). In the last decade, however, both the popular press (Lehrer, 2010) and the scientific press (Begley and Ellis, 2012; Baker, 2016) have given space to a number of publications highlighting the lack of reproducibility of research results. The National Academy of Sciences has identified this reproducibility crisis as a major focus area. While addressing this challenge requires efforts on multiple fronts, statistics has important contributions to make. On one hand, there is a need for data analysis methods that yield findings with a higher chance of being reproducible. On the other hand, it is crucial to have a framework with which to evaluate the reproducibility of scientific results in a manner that is objective and precise, while properly accounting for experimental and sampling variability. The field known as meta-analysis (Hedges and Olkin, 1985) has provided one such framework, but with an emphasis on aggregating information across n studies to increase power rather than on evaluating their agreement or lack thereof. In meta-analysis one tests the global null of no effect in any study. It is possible that this null is rejected largely on the basis of just one study with an extremely significant effect. Such a rejection may be undesirable, as it could arise from some irreproducible property, such as technical or modeling bias, of that one significant study. The accepted approach in meta-analysis to account for heterogeneity across studies is to rely on a random effects model. While this is robust to inter-study variability, rejection of the global null is still not equivalent to replication: it does not guarantee that the effect in question has been observed in multiple separate studies. Functional magnetic resonance imaging (fMRI) is one field in which the need for explicit guarantees of replication was felt early.
The solution put forward in Price and Friston (1997) is conjunction (logical "and") testing: a result is considered replicated when the corresponding null hypothesis is rejected in all n settings where it is tested. While conjunction tests guarantee replication, their power decreases quickly as the number of studies n increases, as the decision relies on the least significant p-value across studies. The partial conjunction (PC) hypothesis represents a compromise between these two extremes: it provides replication guarantees while maintaining reasonable power. Introduced by Friston et al. (2005), and studied in Benjamini and Heller (2008), the partial conjunction (PC) test asks whether at least r out of n null hypotheses are false. As long as r ≥ 2, rejecting the partial conjunction null explicitly guarantees that the finding is replicable in at least some of these studies. Testing a PC hypothesis can evaluate the consistency of genetic findings across multiple studies or platforms (Heller and Yekutieli, 2014). It can also be used to identify effects that are common across different conditions/subgroups (Sun and Wei, 2011; Heller et al., 2007), even when these are assessed in the context of a single study. For example, in multiple-tissue eQTL (expression quantitative trait loci) studies, PC can help to judge whether the detected gene regulation patterns are consistent across tissues or tissue-specific (Flutre et al., 2013). Or, in genome-wide association studies (GWAS) of multiple phenotypes (Shin et al., 2014), PC can identify the single nucleotide polymorphisms (SNPs) that are associated with more than one trait, and therefore are likely to probe an important biological pathway.

In high-throughput experiments like the ones just described, it is common to test tens of thousands of hypotheses in the same study. A typical application of the PC framework in this context, therefore, requires controlling for multiplicity. For example, when studying the influence of a condition on gene expression, one typically quantifies the expression levels of 20,000 genes and tests the corresponding 20,000 null hypotheses of no association between the condition of interest and the expression levels in one study. To identify which of the potential rejections are replicated r times across n studies using a PC approach, one would need to carry out the corresponding 20,000 PC tests. A straightforward approach to controlling for multiplicity is to apply standard adjustment procedures to the p-values for the many PC hypotheses. However, this strategy has been shown to have very low power (Heller and Yekutieli, 2014; Sun et al., 2015). The issue is particularly severe in genetics studies, where the large number of hypotheses tested is coupled with a sparse signal (only a small portion of the PC hypotheses is non-null). This power loss, which represents a major roadblock to the application of the PC framework in replicability studies, is linked to the composite nature of the PC null hypotheses for r ≥ 2. Both Heller and Yekutieli (2014) and Bogomolov and Heller (2015) suggest procedures that try to work around this difficulty and result in power increases. However, Bogomolov and Heller (2015) is designed for the case of n = r = 2, and the empirical Bayes approach in Heller and Yekutieli (2014) encounters computational difficulties as n increases substantially. We here address this limitation, introducing new adaptive filtering procedures (AdaFilter) to test multiple PC hypotheses that control global error rates without dramatically decreasing power for any number of studies n.
The key idea is that when we are testing whether at least r out of n hypotheses are false, we can treat the hypotheses that appear to be false in at least r − 1 studies differently from those that do not. Our AdaFilter procedures for PC hypotheses provably control several simultaneous error rates when all the test statistics are independent. Simulation studies suggest that these properties extend to the case of local and weak dependence between the hypotheses within each study. In addition, AdaFilter has a partial monotonicity property: reducing any of the n p-values for a given null hypothesis can only make it more likely to be rejected. It is, however, possible that lowering one of the p-values for one null hypothesis makes it less likely that another null hypothesis will be rejected. This lack of a corresponding complete monotonicity property is the source of the power gain of our proposals. The structure of the paper is as follows. Section 2 precisely defines the PC testing problem and illustrates the power limitation of out-of-the-box multiplicity adjustment strategies. Section 3 introduces our AdaFilter procedures. Section 4 proves their validity under independence, provides further understanding of the procedures through an important monotonicity property, discusses extensions to more general scenarios, and offers a comparison with previous methods. Section 5 explores the performance with simulations and, finally, Section 6 describes the application of our procedures to three case studies, one testing for replication in multiple experiments and the other two testing for signal consistency across different conditions/subgroups.

2 Multiple testing for partial conjunctions

2.1 Problem set up

We consider the problem where M null hypotheses are tested in n studies, $(H_{0ij})_{n \times M}$. In keeping with what is true in many applications, we assume that only summary statistics are available from these studies, and, for simplicity, we focus on the situation where the summary statistics are the p-values $(p_{ij})_{n \times M}$ of the hypotheses in each of the studies, which is the realization of a random matrix $(P_{ij})_{n \times M}$. To evaluate whether the findings have been replicated at least r times, we want to simultaneously test M partial conjunction null hypotheses at replicability level r:
$$H^{r/n}_{0j}: \text{fewer than } r \text{ out of } n \text{ hypotheses are non-null}$$
for $r \in \{1, \dots, n\}$ and $j = 1, 2, \dots, M$. The replicability level r is specified beforehand. In practice, one can set r = 2 for a minimal amount of reproducibility or take r proportional to n to describe a desired rate of reproducibility. We use the indicator variable $v_j$ to keep track of the truth status of $H^{r/n}_{0j}$ ($v_j = 0$ if the null is true and $v_j = 1$ otherwise). For each $j = 1, 2, \dots, M$, let $p_j = (p_{1j}, \dots, p_{nj})$ be the vector of p-values for the individual hypotheses $(H_{01j}, \dots, H_{0nj})$ involved in $H^{r/n}_{0j}$, and let $p_{(1)j} \le p_{(2)j} \le \dots \le p_{(n)j}$ be its sorted values. For a multiple testing procedure, the decision function is $\phi_j = 1$ if we reject $H^{r/n}_{0j}$ and $\phi_j = 0$ otherwise. Because of multiple testing adjustment, our rejection of the jth PC hypothesis ($\phi_j = 1$) will depend most strongly on $p_j$. However, the elements of $p_{j'}$ for $j' \ne j$ may also play a role, which makes AdaFilter different from both the usual meta-analysis (r = 1) and straightforward partial conjunction testing. We use 1:M as a concise notation for the index set $\{1, 2, \dots, M\}$. The total number of discoveries is $R = \sum_{j=1}^{M} \phi_j$ and the total number of false discoveries is $V = \sum_{j=1}^{M} \phi_j 1_{\{v_j = 0\}}$.
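As a concrete reading of this bookkeeping, the counts R and V can be computed from the decision and truth vectors; the helper below is a hypothetical illustration, not part of the paper's procedures.

```python
def discovery_counts(phi, v):
    """R = total rejections; V = rejections of true PC nulls.

    phi[j] = 1 if the jth PC null hypothesis is rejected (phi_j),
    v[j] = 0 if that null is true (v_j)."""
    R = sum(phi)
    V = sum(1 for phi_j, v_j in zip(phi, v) if phi_j == 1 and v_j == 0)
    return R, V
```

For example, with rejections `phi = [1, 1, 0, 1]` and truth statuses `v = [0, 1, 0, 1]`, there are R = 3 discoveries, of which V = 1 is false.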
There are many possible definitions of simultaneous error rates (Dudoit and Van Der Laan, 2007), with the familywise error rate (FWER) and the false discovery rate (FDR) being the most common ones. In addition, we consider the per-family error rate (PFER), as it provides a motivation for our procedures. With the notation introduced, $\mathrm{FWER} := P(V \ge 1)$, $\mathrm{PFER} := E(V)$ and $\mathrm{FDR} := E(V / \max(R, 1))$. We say that an error rate is controlled at level α when, for any configuration of true and non-true null hypotheses, the expectations defining the error rates are bounded by α. Throughout the paper, we use an upper case letter to denote a random variable and the corresponding lower case letter for the specific realization of the random variable.

2.2 Multiple comparison adjustment procedures on PC p-values

For a vector of individual p-values $(P_1, P_2, \dots, P_n)$ from multiple studies, let $P_{r/n}$ denote the p-value for a single PC null hypothesis $H^{r/n}_0$ obtained by combining the individual p-values into a single value. Benjamini and Heller (2008) discuss three approaches to obtain these p-values, which we report here, using the standard notation $P_{(1)} \le P_{(2)} \le \dots \le P_{(n)}$:

1. Simes' method:
$$P^S_{r/n} = \min_{i = r, \dots, n} \left\{ \frac{n - r + 1}{i - r + 1} P_{(i)} \right\},$$
which is valid for positively dependent p-values (Benjamini and Heller, 2008).

2. Fisher's method:
$$P^F_{r/n} = P\left(\chi^2_{2(n - r + 1)} \ge -2 \sum_{i = r}^{n} \log P_{(i)}\right),$$
which is valid for independent p-values.

3. Bonferroni's method:
$$P^B_{r/n} = (n - r + 1) P_{(r)},$$
which is valid under any dependence structure of the p-values.

Applying any of these rules, we can obtain a p-value $P_{r/n,j}$ for the jth PC null hypothesis, with $j = 1, \dots, M$. With these premises, a straightforward approach to the multiple PC testing problem consists of simply applying standard multiplicity adjustment procedures to the individual PC p-values $\{P_{r/n,j} : j = 1, 2, \dots, M\}$. For example, to control FWER at level α, we could use the Bonferroni rule, which rejects $H^{r/n}_{0j}$ if $P_{r/n,j} \le \alpha/M$. This would also control the PFER at level α (Tukey, 1953). To control FDR we could use the procedure introduced in Benjamini and Hochberg (1995) (BH). This can be described as rejecting $H^{r/n}_{0j}$ if $P_{r/n,j} \le \gamma_0$, where $\gamma_0$ is a data-dependent threshold defined as
$$\gamma_0 = \max\left\{\gamma \in \left\{0, \tfrac{\alpha}{M}, \tfrac{2\alpha}{M}, \dots, \alpha\right\} : M\gamma \le \alpha \sum_{j=1}^{M} 1_{\{P_{r/n,j} \le \gamma\}}\right\}.$$
Aside from the appeal of relying on standard procedures, these strategies are often too conservative in the context of PC testing, especially when the true non-null individual hypotheses are a small subset. To quantify this conservativeness, we start by considering the example of the Bonferroni correction when r = n. We will then discuss a more general case, before proposing our alternatives in Section 3.

2.2.1 The conjunction case r = n

Rejecting $H^{n/n}_{0j}$ corresponds to declaring that finding j is replicated in all n studies. The combined p-value for $H^{n/n}_{0j}$ is simply $P_{n/n,j} = P_{(n)j}$. To control PFER (or FWER) at level α, the Bonferroni correction rejects $H^{n/n}_{0j}$ if $p_{(n)j} \le \alpha/M$.
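The three combining rules above can be sketched in a few lines; this is an illustrative implementation (function name `pc_pvalue` is ours, not from the paper). Fisher's rule uses the fact that the chi-square survival function with an even number of degrees of freedom 2k has the closed form $e^{-x/2}\sum_{i<k}(x/2)^i/i!$, so no external statistics library is needed.

```python
import math

def pc_pvalue(pvals, r, method="bonferroni"):
    """Combine n individual p-values into a single partial-conjunction
    p-value for H0: fewer than r of the n hypotheses are non-null,
    following the three rules of Benjamini and Heller (2008)."""
    n = len(pvals)
    p = sorted(pvals)  # p[i-1] is the order statistic P_(i)
    if method == "bonferroni":
        # (n - r + 1) * P_(r); valid under any dependence
        return min(1.0, (n - r + 1) * p[r - 1])
    if method == "simes":
        # min over i = r..n of (n - r + 1)/(i - r + 1) * P_(i)
        return min(1.0, min((n - r + 1) / (i - r + 1) * p[i - 1]
                            for i in range(r, n + 1)))
    if method == "fisher":
        # -2 * sum_{i=r}^{n} log P_(i) is chi^2 with 2(n - r + 1) df
        # under the null; survival function in closed form for even df
        x = -2.0 * sum(math.log(p[i - 1]) for i in range(r, n + 1))
        k = n - r + 1
        return min(1.0, math.exp(-x / 2) *
                   sum((x / 2) ** i / math.factorial(i) for i in range(k)))
    raise ValueError(f"unknown method: {method}")
```

With `pvals = [0.01, 0.2, 0.5]` and r = 2, Bonferroni gives 2 × 0.2 = 0.4; Simes also gives 0.4 here; Fisher combines the two largest p-values and gives roughly 0.33.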
To understand the performance of this procedure, and how it interplays with the composite nature of the PC hypothesis, define sets $I_k \subseteq 1{:}M$ such that $j \in I_k$ means that precisely k of the alternative hypotheses $H_{1ij}$ hold for the given j. In terms of the partial conjunction hypotheses, for $k = 1, \dots, n-1$ we have
$$I_k = \{j \in 1{:}M \mid H^{(k+1)/n}_{0j} \text{ is true but } H^{k/n}_{0j} \text{ is false}\},$$
while $I_0 = \{j \in 1{:}M \mid H^{1/n}_{0j} \text{ is true}\}$ and $I_n = \{j \in 1{:}M \mid H^{n/n}_{0j} \text{ is false}\}$. The sets $\{I_k, k = 0, 1, \dots, n\}$ form a valid partition and $H^{n/n}_0 = \{j \mid H^{n/n}_{0j} \text{ is true}\} = \bigcup_{k=0}^{n-1} I_k$. Assuming that the individual p-values are all independent, we can use this decomposition of the set of null $H^{n/n}_{0j}$ to understand the value of the PFER and obtain an upper bound:
$$E(V) = \sum_{j \in H^{n/n}_0} E\left(1_{\{P_{(n)j} \le \alpha/M\}}\right) = \sum_{k=0}^{n-1} \sum_{j \in I_k} P(P_{(n)j} \le \alpha/M)$$
$$\le \sum_{k=0}^{n-1} \sum_{j \in I_k} P\left(P_{ij} \le \tfrac{\alpha}{M} \ \forall i \text{ s.t. } H_{0ij} \text{ is true}\right) \quad (1)$$
$$\le \sum_{k=0}^{n-1} \sum_{j \in I_k} \left(\frac{\alpha}{M}\right)^{n-k} = \sum_{k=0}^{n-1} |I_k| \left(\frac{\alpha}{M}\right)^{n-k}. \quad (2)$$
The first inequality, in (1), is close to an equality when all the tests of the non-null hypotheses $H_{1ij}$ have high power, so that their associated p-values are very small. The second inequality, in (2), reduces to an equality for many standard tests. Therefore, the bound is tight and there are definitely situations where it provides a good approximation of $E(V)$. We are interested in the setting where M is large. Letting $\delta_k = |I_k|/M$, the bound on the PFER becomes
$$E(V) \le \alpha\left(\delta_{n-1} + \delta_{n-2}\frac{\alpha}{M} + \delta_{n-3}\left(\frac{\alpha}{M}\right)^2 + \dots + \delta_0\left(\frac{\alpha}{M}\right)^{n-1}\right):$$
in the limit, the bound on $E(V)$ is dominated by $\delta_{n-1}\alpha$ so long as $\delta_{n-1}$ is bounded away from 0. If instead $\delta_{n-1} = 0$, then $E(V) = O(M^{-1})$. More generally, if $\delta_{n-1}$ is small then $E(V)$ will be much less than its nominal level α for large M. In genetics problems we might have M in the thousands (microarrays) or millions (SNPs) with $\delta_0 \approx 1$ and $\delta_{n-1} \approx 0$.
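The bound above is easy to evaluate numerically, which makes the conservativeness concrete; a small sketch (helper name ours):

```python
def pfer_bound(deltas, M, alpha):
    """Upper bound on E(V) for Bonferroni on conjunction (r = n) p-values:
    alpha * sum_k delta_{n-1-k} * (alpha/M)^k, with deltas = [delta_0, ..., delta_{n-1}]."""
    n = len(deltas)
    return alpha * sum(deltas[n - 1 - k] * (alpha / M) ** k for k in range(n))
```

With n = 2, M = 20000 and α = 0.05, the worst case δ₁ = 1 gives the bound α = 0.05, but the sparse genetics-like case δ₀ = 1, δ₁ = 0 gives α²/M ≈ 1.25 × 10⁻⁷: the realized PFER is orders of magnitude below the nominal level.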
Under these circumstances the procedure would be very conservative, in fact much more conservative than Bonferroni usually is.

2.2.2 The general case r ≤ n

A key lesson from the calculation in the previous section is that this straightforward application of multiple comparison correction strategies to the PC p-values does not account for the fact that PC hypotheses are composite, and therefore it controls the global error rate under the worst-case scenario ($\delta_{n-1} = 1$). However, this is not a requirement for a valid multiplicity adjustment procedure, which should simply control the target global error under the true (and unknown) configuration of null hypotheses. The problem is not limited to the case r = n. For general $2 \le r \le n$, the magnitude of $E(V)$ for the Bonferroni correction of individual PC p-values will depend mainly on $\delta_{r-1}$ in the large-M setting. It is clear that more efficient procedures would be available if the fractions $\delta_k$ were known. Unfortunately, this is typically not the case. However, if good estimates of the $\delta_k$ can be obtained from

the given p-value matrix, it is possible that these could be employed in more efficient procedures for simultaneous testing of PC null hypotheses. This is what motivates the Bayesian methods (Heller and Yekutieli, 2014; Flutre et al., 2013). In this paper we take a frequentist perspective and, rather than estimating the $\delta_k$, we work directly on an alternative estimate of V.

Figure 1: Illustration of the adaptive filtering procedures for n = r = 2. (a) Rejection (red) and filtering (blue) regions when α = 0.2; each triangle corresponds to a pair of p-values. (b) Selection of γ when α = 0.2 under the complete null scenario.

3 Adaptive filtering procedures

The observations made in the previous section can be recast as follows. The key reason for the conservativeness of the straightforward approaches is that, the partial conjunction null being composite, the inequality $P(P_{r/n} \le \gamma) \le \gamma$ can be very loose. When r = n, the value of $P(P_{(n)j} \le \gamma)$ can be as low as $\gamma^n$ under the complete null (all individual hypotheses are null) but as high as γ if only one individual hypothesis is null. The fact that $P(P_{r/n} \le \gamma)$ can be substantially lower than γ results in a power loss, as the standard multiple testing procedures are designed to control error when $P(P_{r/n} \le \gamma) = \gamma$. To overcome this, our AdaFilter procedures leverage regions $A_\gamma \subseteq [0, 1]^n$ such that the much tighter inequality $P(P_{r/n,j} \le \gamma \mid P_j \in A_\gamma) \le \gamma$ holds for any configuration in the partial conjunction null space. To motivate our construction, we start with the simple example where n = r = 2 and $P_{1j}$ and $P_{2j}$ are independent. There are three possible scenarios for $H^{2/2}_{0j}$ being true: the vector of non-null indicators of $(H_{01j}, H_{02j})$ lies in

$\{(0, 0), (0, 1), (1, 0)\}$. The rejection region has the form $R_\gamma = \{(p_1, p_2) \mid \max(p_1, p_2) \le \gamma\}$, with $\gamma \in [0, 1]$. Let us now define the filtering region $A_\gamma = \{(p_1, p_2) \mid \min(p_1, p_2) \le \gamma\}$, illustrated in Figure 1(a) as the blue L-shaped region. It is easy to check geometrically that $P(P_j \in R_\gamma \mid P_j \in A_\gamma \text{ and } H^{2/2}_{0j} \text{ is true}) \le \gamma$ under all three null scenarios, since at least one of $P_{1j}$ and $P_{2j}$ is stochastically greater than uniform. Recall that both the Bonferroni and BH procedures are based on an implicit estimate of the number of false rejections V associated with a threshold γ: $\hat{V}_\gamma = \gamma M$. However, in the case of partial conjunctions, using the total number of hypotheses M is too conservative: one needs to consider as candidates for rejection only those $H^{2/2}_{0j}$ such that $P_j \in A_\gamma$. The observation that $P(P_j \in R_\gamma \mid P_j \in A_\gamma \text{ and } H^{2/2}_{0j} \text{ is true}) \le \gamma$ leads us to another estimate,
$$\hat{V}_{A_\gamma} = \gamma \sum_{j=1}^{M} 1_{\{\min(p_{1j}, p_{2j}) \le \gamma\}}.$$
Obviously, $\hat{V}_{A_\gamma} \le \hat{V}_\gamma$, and the gap can be large if most of the hypotheses are complete nulls. We can use the estimate $\hat{V}_{A_\gamma}$ in multiple testing procedures, replicating the construction of Bonferroni or BH and determining the threshold γ adaptively. To control FWER and PFER at level α (that is, to guarantee $E(V) \le \alpha$), we choose γ as large as possible, subject to $\hat{V}_{A_\gamma} \le \alpha$. This choice is well defined because $\hat{V}_{A_\gamma}$ is a non-decreasing function of γ, ranging from 0 at γ = 0 to M at γ = 1. To control the FDR at level α, one can estimate the false discovery proportion V/R by $\hat{V}_{A_\gamma}/R$ and select the largest $\gamma \ge 0$ such that $\hat{V}_{A_\gamma} / \max(R, 1) \le \alpha$. These choices push γ towards 0 when all $H^{2/2}_{0j}$ are true, avoiding false discoveries (Figure 1(b)). In the next section we introduce multiple testing procedures that follow this scheme for $\{H^{r/n}_{0j}, j = 1, \dots, M\}$ with any value of r.
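For the n = r = 2 case just described, the adaptive choice of γ can be sketched as follows (this is a simplified illustration of the idea, searching γ over the grid {α/M, …, α} used later in Definition 1; the function name is ours):

```python
def adafilter_2of2_pfer(p1, p2, alpha):
    """n = r = 2 sketch: pick the largest gamma with
    hat{V}_{A_gamma} = gamma * #{j : min(p1j, p2j) <= gamma} <= alpha,
    then reject H_0j^{2/2} when max(p1j, p2j) <= gamma."""
    M = len(p1)
    best = 0.0
    for k in range(1, M + 1):       # candidate thresholds gamma = alpha / k
        g = alpha / k
        vhat = g * sum(1 for a, b in zip(p1, p2) if min(a, b) <= g)
        if vhat <= alpha:
            best = max(best, g)
    return [max(a, b) <= best for a, b in zip(p1, p2)]
```

For instance, with `p1 = [0.01, 0.4, 0.5]`, `p2 = [0.02, 0.6, 0.7]` and α = 0.05, only the first pair passes the filter at γ = 0.05, so $\hat V_{A_\gamma} = 0.05 \le \alpha$ and the first PC hypothesis is rejected, even though the plain Bonferroni threshold would be α/M ≈ 0.017.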
3.1 AdaFilter procedures for partial conjunctions

The AdaFilter procedures for $H^{r/n}_{0j}$ are designed keeping in mind that the truth status of $H^{(r-1)/n}_0$ is important in evaluating the probability of the rejection region for $H^{r/n}_0$. We use the Bonferroni method, $P^B_{r/n,j} = (n - r + 1) P_{(r)j}$, for a single PC hypothesis and compare it against the threshold γ, determined adaptively from the whole p-value matrix. The choice of γ depends on the filtering regions
$$A_\gamma := \left\{(p_1, p_2, \dots, p_n) \mid (n - r + 1) p_{(r-1)} \le \gamma\right\}.$$
We propose an AdaFilter Bonferroni test (two versions) and an AdaFilter Benjamini-Hochberg (BH) test. It is convenient to first introduce the notion of filtering and selection p-values.

Filtering p-values $F_j$:
$$F_j := (n - r + 1) P_{(r-1)j} \quad (3)$$

Selection p-values $S_j$:
$$S_j := P^B_{r/n,j} = (n - r + 1) P_{(r)j} \quad (4)$$

Definition 1 (AdaFilter Bonferroni). For a level α, and with $F_j$ and $S_j$ given by (3) and (4) respectively, reject $H^{r/n}_{0j}$ if $S_j \le \gamma_0^{\mathrm{Bon}}$, where
$$\gamma_0^{\mathrm{Bon}} = \max\left\{\gamma \in \left\{\tfrac{\alpha}{M}, \dots, \tfrac{\alpha}{2}, \alpha\right\} : \gamma \sum_{j=1}^{M} 1_{\{F_j \le \gamma\}} \le \alpha\right\}.$$

Definition 2 (AdaFilter BH). For a level α, and with $F_j$ and $S_j$ given by (3) and (4) respectively, reject $H^{r/n}_{0j}$ if $S_j \le \gamma_0^{\mathrm{BH}}$, where
$$\gamma_0^{\mathrm{BH}} = \max\left\{\gamma \in I_{\alpha,M} : \gamma \sum_{j=1}^{M} 1_{\{F_j \le \gamma\}} \le \alpha \sum_{j=1}^{M} 1_{\{S_j \le \gamma\}}\right\}, \quad I_{\alpha,M} = \left\{\frac{k\alpha}{m} : k \in 0{:}M,\ m \in 1{:}M,\ k \le m\right\}.$$

Our AdaFilter Bonferroni can be defined equivalently using a more programmable two-step approach, as illustrated in Figure 2.

Definition 3 (AdaFilter Bonferroni, alternative version). For a level α, and with $F_j$ and $S_j$ given by (3) and (4) respectively, proceed as follows. First sort $\{F_1, F_2, \dots, F_M\}$ into $F_{(1)} \le F_{(2)} \le \dots \le F_{(M)}$. Then let
$$m' = \min\left\{j \in 1{:}M : \frac{\alpha}{j} < F_{(j)}\right\} \quad (5)$$
and define
$$m = \begin{cases} m', & \text{if } F_{(m')} \le \frac{\alpha}{m' - 1} \\ m' - 1, & \text{otherwise.} \end{cases}$$
Finally, reject $H^{r/n}_{0j}$ if $S_j \le \alpha/m$. If the set in (5) is empty, then take $m = m' = M$.

Proposition 3.1. The two definitions of the adaptive filtering Bonferroni procedure in Definition 1 and Definition 3 are equivalent.

The alternative form provides another view of the AdaFilter procedures. We filter out unlikely hypotheses with large $F_j$ values and only adjust for the m remaining hypotheses. If $m \ll M$ then adaptive filtering can reject many more hypotheses than direct Bonferroni or BH on $S_j$. One will see from (6) in the next section that, because of the conditional validity of the selection p-values, there is no need to correct for selection bias when selecting based on $F_j$.
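The two-step form of Definition 3 translates directly into code. The sketch below is our own illustration, not the authors' released implementation; the m′ = 1 edge case is clamped to m = 1, which gives the same rejections since then every $S_j \ge F_j > \alpha$.

```python
def adafilter_bonferroni(P, r, alpha):
    """Two-step AdaFilter Bonferroni in the spirit of Definition 3.
    P: n x M matrix (list of rows) of p-values. Returns rejected indices."""
    n, M = len(P), len(P[0])
    F, S = [], []
    for j in range(M):
        col = sorted(P[i][j] for i in range(n))
        # filtering p-value F_j = (n-r+1) P_{(r-1)j} (F_j = 0 when r = 1)
        F.append((n - r + 1) * col[r - 2] if r >= 2 else 0.0)
        # selection p-value S_j = (n-r+1) P_{(r)j} (Bonferroni PC p-value)
        S.append((n - r + 1) * col[r - 1])
    Fs = sorted(F)
    # m' = smallest j with alpha/j < F_(j); if no such j, take m = m' = M
    viol = [j for j in range(1, M + 1) if alpha / j < Fs[j - 1]]
    if not viol:
        m = M
    else:
        m_prime = viol[0]
        if m_prime >= 2 and Fs[m_prime - 1] <= alpha / (m_prime - 1):
            m = m_prime
        else:
            m = max(m_prime - 1, 1)
    return [j for j in range(M) if S[j] <= alpha / m]
```

On the toy matrix `P = [[0.01, 0.3], [0.04, 0.5]]` with n = r = 2 and α = 0.05, the second hypothesis is filtered out (F = 0.3), so m = 1 and the first hypothesis is rejected at threshold α = 0.05, while direct Bonferroni on the PC p-values (threshold α/M = 0.025) would reject nothing.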

Figure 2: AdaFilter Bonferroni procedure. On the x-axis, indexes j of the hypotheses, ordered by increasing values of the filtering p-values: $F_{(1)} \le \dots \le F_{(M)}$. On the y-axis, filtering and selection p-values for each of the hypotheses: the $F_{(j)}$ are plotted with filled triangles, and $S_{j^*}$ with crosses, where $j^*$ is the true index of $F_{(j)}$. The solid green line is the α/j boundary; the color blue indicates hypotheses that pass the filtering step, and a red circle marks hypotheses that pass the selection step. The left and right panels represent the two possible definitions of m ($m = m' - 1$ and $m = m'$).

4 Theoretical properties of AdaFilter

When the p-values are independent across both the studies and the multiple individual hypotheses, we can prove that the AdaFilter procedures control the corresponding global errors under any configuration of the true individual hypotheses. In addition to independence we assume validity.

Definition 4 (Valid p-value). If $P(P_{ij} \le \alpha) \le \alpha$ under $H_{0ij}$ for any $\alpha \in [0, 1]$, then $P_{ij}$ is a valid p-value.

Theorem 4.1. Let $(P_{ij})_{n \times M}$ contain independent valid p-values. Then the AdaFilter Bonferroni procedure in Definition 1 controls FWER and PFER at level α for the null hypotheses $\{H^{r/n}_{0j} : j = 1, 2, \dots, M\}$.

The proof of Theorem 4.1 is given in the appendix. It depends on Lemma 4.2, which gives a conditional validity property of the selection statistics $S_j$ after they are filtered via $F_j$.

Lemma 4.2. Let $(P_{ij})_{n \times M}$ contain independent valid p-values and assume that $H^{r/n}_{0j}$ is true for some $j \in 1{:}M$. Then
$$P(S_j \le \beta \mid F_j \le \beta) \le \beta \quad (6)$$

holds whenever $\beta > 0$ and $P(F_j \le \beta) > 0$. Here $F_j$ and $S_j$ are given by (3) and (4) respectively. Equation (6) can be equivalently written as $P(S_j \le \beta) \le \beta P(F_j \le \beta)$; in this form it also holds when $P(F_j \le \beta) = 0$, since $S_j \ge F_j$ always.

We note that the independence assumption on $(P_{ij})_{n \times M}$ in Theorem 4.1 can be relaxed. In the proof, we only require that if $v_j = 0$ (that is, $H^{r/n}_{0j}$ is true), then the p-values $(P_{1j}, P_{2j}, \dots, P_{nj})$ are independent of each other and of the other entries of $(P_{ij})_{n \times M}$. Even in this more relaxed form, independence is not a reasonable assumption in many applications. We will investigate with simulations (Section 5) how robust our procedures are to violations of this assumption. In addition, while the fact that the AdaFilter BH procedure controls FDR remains a conjecture at this stage, we will use simulations to illustrate that FDR appears to be well controlled in practice.

4.1 Partial monotonicity

Wang and Owen (2017) discussed the importance of monotonicity for the combined p-value of a single PC test, where monotonicity is defined as the property that the combined p-value is a non-decreasing function of the individual p-values. Wang and Owen (2017) show that, due to the complexity of the null space of the PC hypothesis, there do exist non-monotone PC p-values that are uniformly most powerful. However, one needs to avoid proposing a non-monotone test, as such tests are completely unreasonable in practice (Perlman and Wu, 1999). In terms of multiple testing of many PC hypotheses, we will show that investigating monotonicity leads to a deeper understanding of our AdaFilter procedures. To clarify our analysis, we first define two monotonicity concepts: complete monotonicity and partial monotonicity.

Definition 5 (Complete monotonicity). A multiple testing procedure has complete monotonicity if each decision function $\phi_j(p_1, \dots, p_M)$ is a non-increasing function of all the elements of $(p_{ij})_{n \times M}$ for $j = 1, 2, \dots, M$.

Definition 6 (Partial monotonicity).
A multiple testing procedure has partial monotonicity if, for all $j \in 1{:}M$, its decision function $\phi_j(p_1, \dots, p_M)$ is non-increasing in all elements of $(p_{1j}, p_{2j}, \dots, p_{nj})$.

Complete monotonicity guarantees that if any p-value $p_{ij}$ decreases while all others are held fixed, we reject all PC null hypotheses that we originally rejected and maybe more. Partial monotonicity only requires the test of hypothesis j to be monotone in the p-values for that same hypothesis. It allows a reduction in $p_{ij'}$ for $j' \ne j$ to reverse a rejection of $H^{r/n}_{0j}$. Most multiple comparison procedures for a single study (for example, the BH procedure) satisfy complete monotonicity. However, it turns out that in our problem complete monotonicity is a stringent requirement that prevents us from finding efficient procedures for PC nulls. We can show that our procedures satisfy partial monotonicity but not complete monotonicity.

Corollary 4.3. Let $(P_{ij})$ be a matrix of valid p-values. Then both the AdaFilter Bonferroni and the AdaFilter BH procedures satisfy partial monotonicity for all null hypotheses $H^{r/n}_{0j}$, $j = 1, 2, \dots, M$.

Table 1: Toy example showing the power gain of AdaFilter through violating complete monotonicity, with n = r = 2 and M = 2; columns are j, the p-values in Study 1 and Study 2, and $F_j$ and $S_j$. AdaFilter Bonferroni finds the first hypothesis significant ($\gamma_0^{\mathrm{Bon}} = 0.05$) in both studies, while neither of its individual p-values can be rejected in each study separately using Bonferroni at the same FWER level α.

To see how our two AdaFilter procedures violate complete monotonicity, consider a hypothesis j with $F_j > \gamma_0$, where $\gamma_0$ is the original selection threshold. Now let $F_j$ increase while keeping $S_j$ and the individual p-values not involving j fixed. As $\gamma \sum_{j=1}^{M} 1_{\{F_j \le \gamma\}}$ keeps decreasing, the selection threshold increases, which can change the decision for $H^{r/n}_{0k}$ for some $k \ne j$ from acceptance to rejection. In other words, an increase of individual p-values can cause a larger number of unlikely PC hypotheses to be filtered out and, in turn, increase the number of rejections among the remaining non-null candidates. Violating complete monotonicity is necessary for a PC multiple testing procedure to have high power. An increase of individual p-values can make the non-null configuration look more similar across studies, which helps in getting more rejections and is incompatible with the complete monotonicity rule. As in the toy example illustrated in Table 1, when combining results of two repeated experiments and setting r = 2, we may even reject a conjunction hypothesis using AdaFilter while neither of the single hypotheses it joins is rejected separately. The gain in power here derives from having learned the similarity between the two experiments.

4.2 Extensions and relation to literature

Remark 1: variable r and n. In some applications, the M PC null hypotheses can have varying $r_j$ or $n_j$. Then the jth PC null hypothesis becomes
$$H^{r_j/n_j}_{0j}: \text{fewer than } r_j \text{ out of } n_j \text{ hypotheses are non-null.}$$
For example, in genetic analysis, if each $P_{ij}$ is the p-value testing the null hypothesis related to gene j in experiment i, then gene j might not be present in every single experiment, making the $n_j$ different. The AdaFilter procedures still work in this scenario. We replace formulas (3) and (4) by
$$F_j = (n_j - r_j + 1) P_{(r_j - 1)j} \quad \text{and} \quad S_j = (n_j - r_j + 1) P_{(r_j)j},$$
respectively. Both AdaFilter Bonferroni and AdaFilter BH still control the required error rates, as the conditional validity of Lemma 4.2 still holds.

Remark 2: other strategies to increase power. Two methods related to our AdaFilter procedures for PC hypotheses are the multiple testing procedure developed in Bogomolov and Heller (2015) for replicability analysis where n = r = 2, and the empirical Bayes approach in Heller

and Yekutieli (2014) for controlling the Bayes FDR for multiple PC hypotheses. Both methods were developed to improve the efficiency of the straightforward procedures. Our AdaFilter procedures generalize the procedures in Bogomolov and Heller (2015) to arbitrary n and r, and provide a frequentist approach comparable to the empirical Bayes method in Heller and Yekutieli (2014).

Bogomolov and Heller (2015) describe procedures to control either FWER or FDR when n = r = 2. These are also based on a filtering step: promising candidate hypotheses for each study are identified as those whose corresponding p-value in the other study is below a threshold. Then a promising candidate hypothesis is rejected if both of its individual p-values are small. To maximize efficiency, the authors suggest data-adaptive thresholds that are very similar to our AdaFilter procedures. For instance, to control FWER, they choose two thresholds $\gamma_1$ and $\gamma_2$ to satisfy
$$\gamma_1 \sum_{j=1}^{M} 1_{\{P_{2j} \le \gamma_2\}} \le \alpha/2, \quad \text{and} \quad \gamma_2 \sum_{j=1}^{M} 1_{\{P_{1j} \le \gamma_1\}} \le \alpha/2.$$
When $\gamma_1 \le \gamma_2$, then
$$\gamma_1 \sum_{j=1}^{M} 1_{\{\min(P_{1j}, P_{2j}) \le \gamma_1\}} \le \gamma_1 \sum_{j=1}^{M} \left(1_{\{P_{1j} \le \gamma_1\}} + 1_{\{P_{2j} \le \gamma_1\}}\right) \le \alpha.$$
Thus $\gamma_0^{\mathrm{Bon}} \ge \gamma_1 \wedge \gamma_2$, where $\gamma_0^{\mathrm{Bon}}$ is defined in our AdaFilter Bonferroni procedure for n = r = 2, making our procedure similar to theirs. The selection step in Bogomolov and Heller (2015) consists in rejecting $H^{2/2}_{0j}$ if both $p_{2j} \le \gamma_2$ and $p_{1j} \le \gamma_1$, again comparable to what we described. These similarities extend to the FDR controlling procedures. Our contribution is a general adaptive filtering idea that can be applied to any combination of r and n.

Heller and Yekutieli (2014) used an empirical Bayes approach. They first learn the proportion of each of the $2^n$ possible configurations of individual hypotheses being null or non-null, along with the distribution of the Z-values under each configuration.
While the empirical Bayes approach certainly deals with the loss of power due to the composite nature of the PC hypotheses, it requires adopting a Bayesian framework and has substantial computational costs. AdaFilter provides an alternative approach to increasing the efficiency of multiple comparisons of PC tests, with no need to estimate the fraction of each configuration or to abandon the frequentist framework. More importantly, the computational cost of our proposal is O(n), while the cost of the empirical Bayes method increases exponentially in n. Simulation results in Section 5 show that our procedure is more accurate in controlling the FDR of PC hypotheses when n is large.

Remark 3: testing for all possible values of r. The partial conjunction null H_0^{r/n} can be meaningfully defined for any 1 ≤ r ≤ n, and sometimes it is of interest to test for all possible values of r, adding another layer of multiplicity. Benjamini and Heller (2008) specifically consider this case, providing an FDR-controlling procedure. A desirable property in this setting is for the rejection set of the PC hypotheses to be monotonically decreasing as r increases. The filtering procedure we describe, applied to control FDR for each value of r separately, does not have this property: the filtering information based on the whole p-value matrix can be different for different

r. Of course, by controlling FDR for each value of r separately, the expected number of false discoveries accumulates when different values of r are considered at the same time; this is not a strategy we would suggest, even aside from the lack of monotonicity. The investigation of powerful strategies in this context is left for future research.

n    r
Table 2: Configurations of n and r in our simulation.

5 Simulations

We benchmark the performance of AdaFilter against (a) straightforward multiple-comparison correction procedures applied to the three forms of PC p-values in Section 2.2 and (b) the empirical Bayes method of Heller and Yekutieli (2014). To implement (b), we used the R package repfdr from the original authors. We consider two dependence structures across the individual-hypothesis p-values within each study: independence (which is assumed in our theoretical results) and local dependence, which is a reasonable assumption in many high-dimensional settings, such as GWAS. We set M = and control PFER at the nominal level α = 1, FDR at the nominal level α = 0.2, and the Bayes FDR at the same level α = 0.2 for repfdr. The Bayes FDR corresponds to the posterior probability of the hypothesis being null given the rejection region, which has been shown to be similar to, but slightly more conservative than, the frequentist FDR under independence (Efron, 2012).

For a given n, there are 2^n combinations of individual hypotheses being null or non-null. We use two parameters to control the probability of each combination: π_0 is the probability of the complete null combination (0, 0, ..., 0) and π_{r/n} is the total probability of the combinations not belonging to H_{0j}^{r/n}. We set π_{r/n} = 0.02 and consider two values for π_0: π_0 = 0.5 or 0.9. All PC null combinations except the complete null have equal probabilities and add up to 1 − π_0 − π_{r/n}. All non-null PC combinations also have equal probabilities. We consider six different configurations of n and r, as listed in Table 2.
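A minimal sketch of generating the configuration matrix under this design (the function name and sampling mechanics are ours; the probability assignment follows the description above):

```python
import numpy as np
from itertools import product

def sample_configs(M, n, r, pi0, pi_alt, seed=0):
    """Sample M true/false configurations of n individual hypotheses:
    probability pi0 on the complete null (0,...,0), total probability
    pi_alt split equally over configurations with >= r non-nulls (the
    PC non-nulls), and the remainder split equally over the other
    PC-null configurations. Returns an M x n 0/1 matrix (1 = non-null)."""
    rng = np.random.default_rng(seed)
    configs = np.array(list(product([0, 1], repeat=n)))
    k = configs.sum(axis=1)
    alt = k >= r                       # configurations violating H_0^{r/n}
    other_null = (~alt) & (k > 0)      # PC-null but not the complete null
    probs = np.zeros(len(configs))
    probs[k == 0] = pi0
    probs[alt] = pi_alt / alt.sum()
    probs[other_null] = (1 - pi0 - pi_alt) / other_null.sum()
    idx = rng.choice(len(configs), size=M, p=probs)
    return configs[idx]
```

For instance, with n = 4, r = 2, π_0 = 0.5 and π_{r/n} = 0.02, roughly half of the sampled rows are the complete null.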
We assume that p-values belonging to different studies are independent and that, within one study, the correlation matrix of the M Z-values is Σ_ρ ⊗ I_{b×b}, where ⊗ denotes the Kronecker product, the block size is b = 100, and Σ_ρ ∈ R^{m×m} has 1s on the diagonal and a common value ρ ≥ 0 off the diagonal. The Z-values then have a distribution with local block dependence, a reasonable assumption for many genetics data sets, such as GWA or eQTL studies. Based on the Z-values, we calculate the two-sided p-value for each individual hypothesis. To generate the signals, when an individual component hypothesis is non-null we sample the mean of its Z-value uniformly and independently from I = {±µ_1, ±µ_2, ±µ_3, ±µ_4}, where the four values {µ_1, µ_2, µ_3, µ_4} correspond to detection powers of 0.02, 0.2, 0.5 and 0.95, respectively (the detection power is calculated assuming that the individual null component hypothesis H_{0ij} is rejected when P_ij ≤ α/M). For each parameter configuration, we run B = 100 random experiments and calculate the average power, number of false discoveries and

false discovery proportions.

Studying PFER control, we compare our AdaFilter Bonferroni procedure with three straightforward procedures that apply the Bonferroni correction to the three forms of the M PC p-values in Section 2.2. Table 3 shows the average PFER and power of the four procedures over the six configurations in Table 2 (more detailed results for each configuration are in Figures B.1 and B.2). All methods successfully control PFER at the nominal level under both the independence and block-dependence cases. However, the straightforward procedures are much more conservative than our AdaFilter Bonferroni procedure, especially when both n and r are large. The gain in power is more pronounced when the complete null fraction is higher, as is expected in many genetics applications.

                           π_0 = 0.5                        π_0 = 0.8
                     ρ = 0          ρ = 0.8           ρ = 0          ρ = 0.8
Method               PFER Recall(%) PFER Recall(%)    PFER Recall(%) PFER Recall(%)
Bonferroni-P_{r/n}^B
Bonferroni-P_{r/n}^F
Bonferroni-P_{r/n}^S
AdaFilter Bonferroni

Table 3: Results for methods targeting a PFER of α = 1. The numbers in the table are averages over all 6 combinations of n and r values, with 2 π_0 values and 2 ρ values (24 combinations in total). Each combination has 100 repeated experiments.

We analyze five procedures that control FDR: our AdaFilter BH procedure, three straightforward procedures in which the BH correction is applied to the three forms of the M PC p-values, and repfdr. Table 4 shows the average FDR and power of the five procedures over the six configurations (more detailed results in Figures B.3 and B.4). Our AdaFilter BH procedure and the three straightforward procedures control FDR at the nominal level. However, as with PFER control, the straightforward approaches are too conservative; our AdaFilter BH procedure has much more power than these methods.
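The AdaFilter Bonferroni computation being compared here can be sketched as follows. The filtering and selection p-values follow formulas (3) and (4); the adaptive threshold search over the observed selection p-values is our plausible reading of the procedure, not the authors' exact implementation:

```python
import numpy as np

def filter_select_pvalues(P, r):
    """Filtering and selection p-values for each column of the n x M
    p-value matrix P: F_j = (n - r + 1) * P_{(r-1) j} (taken as 0 when
    r = 1) and S_j = (n - r + 1) * P_{(r) j}, both capped at 1."""
    P = np.sort(np.asarray(P, dtype=float), axis=0)
    n = P.shape[0]
    F = (n - r + 1) * (P[r - 2] if r >= 2 else np.zeros(P.shape[1]))
    S = (n - r + 1) * P[r - 1]
    return np.minimum(F, 1.0), np.minimum(S, 1.0)

def adafilter_bonferroni(F, S, alpha=1.0):
    """Take the largest threshold gamma (searched over the observed
    selection p-values) with gamma * #{j: F_j <= gamma} <= alpha, and
    reject H_0j^{r/n} whenever S_j <= gamma."""
    F, S = np.asarray(F), np.asarray(S)
    gamma0 = 0.0
    for g in np.sort(np.unique(S)):
        if g * np.sum(F <= g) <= alpha:
            gamma0 = g
    return np.flatnonzero(S <= gamma0)
```

With a 2 × 3 p-value matrix and r = 2, the procedure filters on the smallest p-value per column and selects on the largest.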
The repfdr method fails to consistently control the FDR at its nominal level, especially when n is large and the individual hypotheses are locally correlated within studies. One explanation is that a large n translates into a large number of parameters to be estimated by repfdr, a task that might not be performed satisfactorily. In the cases where repfdr does control FDR, its power is comparable with that of our proposal. However, it is consistently less powerful than AdaFilter BH when π_0 = 0.9 and ρ = 0.8, the scenario closest to a real GWA study, indicating that AdaFilter BH can be more powerful when applied to GWAS.
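The local block-dependence design used in these simulations (equicorrelated blocks of size b with common off-diagonal value ρ) can be simulated with a single shared factor per block. A minimal sketch, with function name and factor construction ours:

```python
import math
import numpy as np

def simulate_pvalues(M=10000, b=100, rho=0.8, mu=0.0, seed=0):
    """Simulate M z-values whose correlation is block-equicorrelated
    (blocks of size b, off-diagonal rho, unit variances) and return
    two-sided p-values. mu is a scalar or length-M vector of means
    (0 = null); M must be divisible by b."""
    rng = np.random.default_rng(seed)
    shared = np.repeat(rng.standard_normal(M // b), b)  # one factor per block
    z = np.sqrt(rho) * shared + np.sqrt(1 - rho) * rng.standard_normal(M) + mu
    # two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    return np.array([math.erfc(abs(v) / math.sqrt(2)) for v in z])
```

Under the complete null (mu = 0) the p-values are marginally uniform; a large mean shifts them toward 0.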

                           π_0 = 0.5                        π_0 = 0.8
                     ρ = 0          ρ = 0.8           ρ = 0          ρ = 0.8
Method               FDR Recall(%)  FDR Recall(%)     FDR Recall(%)  FDR Recall(%)
BH-P_{r/n}^B
BH-P_{r/n}^F
BH-P_{r/n}^S
repfdr
AdaFilter BH

Table 4: Results for methods targeting a nominal FDR of α = 0.2. The numbers in the table are averages over all 6 combinations of n and r values, with 2 π_0 values and 2 ρ values (24 combinations in total). Each combination has 100 repeated experiments.

6 Case studies

Here we use our methods to analyze three data sets: one tests for replication in multiple experiments, and the other two test for consistent signals across different conditions/subgroups of the data.

6.1 Duchenne Muscular Dystrophy microarray studies

We consider a replication analysis based on four publicly available microarray datasets on Duchenne muscular dystrophy (DMD). Affecting about 1 in 3500 newborn males, DMD is the most common form of muscular dystrophy and the most common sex-linked disease in males. Following Kotelnikova et al. (2012), we investigate four NCBI GEO DMD-related microarray datasets (GDS 214, GDS 563, GDS 1956 and GDS 3027; Table 5) that were conducted in different labs and on different platforms with no shared biological samples; each dataset was collected with the purpose of identifying genes that are differentially expressed in DMD cases. For each experiment, the data are preprocessed using RMA (Irizarry et al., 2003), and the individual p-values for each probe set are calculated using Limma (Smyth, 2004) and then rescaled to achieve an approximately uniform empirical null distribution of the p-values. If a gene is represented by two or more probe sets on a chip, we use the Bonferroni rule to combine the probe-specific p-values and obtain a single gene p-value. Study GDS 1956 is generally less powerful than the other studies (Figure B.5). As the four studies used different microarray platforms, not every gene is present in all studies.
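The Bonferroni rule used above to combine probe-level p-values into a gene-level p-value is simply k times the smallest probe p-value, capped at 1. A minimal sketch (function name ours):

```python
def gene_pvalue(probe_pvals):
    """Bonferroni combination of the probe-level p-values for one gene:
    k * min(p), capped at 1, where k is the number of probe sets."""
    k = len(probe_pvals)
    return min(1.0, k * min(probe_pvals))
```

For example, two probes with p-values 0.01 and 0.3 give a gene p-value of 0.02.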
However, applying our AdaFilter BH procedure at level α = 0.05 still leads to the discovery of many consistently differentially expressed genes at r = 2, 3, 4 (Table 5(b)). Table 5(c) shows how AdaFilter BH increases power, allowing rejection of the complete conjunction hypotheses (r = n = 4) even when one of the studies has a relatively high p-value. Specifically, while the third study (GDS 1956) has less power than the others, our method can compensate for this deficiency by leveraging the overall similarity between the results of this study and those of the other studies. The

GEO ID    Platform            Description          Source
GDS 214   custom Affymetrix   4 healthy, 26 DMD    Muscle
GDS 563   Affymetrix U95A     11 healthy, 12 DMD   Quadriceps Muscle
GDS 1956  Affymetrix U133A    18 healthy, 10 DMD   Muscle
GDS 3027  Affymetrix U133A    14 healthy, 23 DMD   Quadriceps Muscle
(a)

r    M    Rejected
(b)

Gene Symbol   GDS 214   GDS 563   GDS 1956   GDS 3027
HLA-DPB1      8.11e     e         e          e-08
MYL4          1.48e     e         e          e-08
PPP1R1A       6.93e     e         e          e-04
SRPX          1.64e     e         e          e-08
DMD                     e         e-10
(c)

Table 5: (a) Information on the GEO datasets used. (b) AdaFilter BH selection results. (c) Some of the selected genes. The first four are genes selected at r = 4 whose individual p-value in study GDS 1956 is above ; the last gene (DMD) is selected at r = 2. FDR is controlled at α = 0.05 for each r.

disease-causal gene DMD is, on the other hand, not significantly differentially expressed in every study, indicating that partial conjunction is preferable to complete conjunction for encouraging more true findings.

6.2 Multi-tissue AGEMAP data

The AGEMAP study (Zahn et al., 2007) investigated age-related gene expression in mice. The experiments comprised ten mice in each of four age groups; each of these 40 mice contributed samples from 16 different tissues, resulting in 618 microarray data sets (some mice are missing in specific tissues). From each microarray, 8932 probes were sampled. We use the robust-regression approach of Wang et al. (2017) to calculate confounder-adjusted p-values of the age effect for each gene and each tissue; this is a simplified version of the LEAPP algorithm, originally used on these data by Sun et al. (2012). Wang et al. (2017) showed that the p-values of the genes are asymptotically independent within each tissue after adjusting for confounders. Correlations of p-values across tissues are considered negligible. We here consider the problem of identifying genes whose expression changes significantly with age across multiple tissues.
Figure 3 gives the results, in terms of the number of significant genes, when PFER is controlled at 1 and r ranges from 2 to 7. For all r values, the adaptive filtering Bonferroni procedure is much more powerful than applying the Bonferroni adjustment to the PC p-values.

Figure 3: The number of genes whose H^{r/n} null hypotheses were rejected by each of the compared procedures (Bonferroni applied to the P_{r/n}^B, P_{r/n}^F and P_{r/n}^S PC p-values, and AdaFilter Bonferroni). Here n = 16 for the 16 tissues, r ranges from 2 to 7, and PFER is controlled at α = 1.

6.3 Metabolites super-pathways GWAS data

Finally, we examine the multi-trait GWAS data from Shin et al. (2014), a comprehensive study of the genetic loci influencing human metabolism. A total of 7824 adult individuals from 2 European populations were recruited for the study, and M = 2,182,555 SNPs were recorded, either directly genotyped or imputed from the HapMap 2 panel. For each sample, the study measured the levels of 333 metabolites, categorized into 8 non-overlapping super-pathways. Due to privacy concerns, only summary statistics are publicly available: these consist of the p-values and test statistics for the null hypotheses of no association between each marker and each of the m = 275 annotated metabolite measurements (only 275 of the 333 annotated metabolites can be downloaded from the original authors' website GWAS/gwas/index.php?task=download). We are interested in discovering SNPs that significantly influence metabolites in multiple super-pathways, as this information gives a more comprehensive view of genetic regulation and benefits downstream analyses, such as the assessment of side effects in drug development. To calculate individual p-values p_ij for each marker j and each super-pathway i, we start with the Z-values Z_sj for the test of association between each metabolite s and marker j, which are given as summary statistics in the original GWAS. For a super-pathway i, let {s_1, s_2, ..., s_{n_i}} be the index

set of metabolite measures that belong to it. We assume that (Z_{s_1 j}, Z_{s_2 j}, ..., Z_{s_{n_i} j}) ∼ N(0, Σ_i), where Σ_i can in principle be accurately estimated, since we have millions of markers and most of the individual hypotheses are null. We estimate Σ_i as the graphical-Lasso-estimated covariance matrix of 2000 randomly sampled SNPs (markers) that lie at least 1 Mbp away from each other, with the tuning parameters selected by cross-validation. The graphical Lasso approach guarantees an accurate sparse inverse covariance matrix estimate, which is needed for computing the p-values for each super-pathway. Let p_ij be the p-value for the association between super-pathway i and marker j, in other words, the p-value for the null H_{0ij}: no metabolite measure of super-pathway i is associated with marker j. Given (Z_{s_1 j}, ..., Z_{s_{n_i} j}) and the estimate Σ̂_i, we calculate the p-values p_ij from the chi-square test, treating the estimated Σ_i as known. These p_ij serve as the individual p-values used in the partial conjunction testing.

Figure 4a shows the estimated correlation across metabolites, assuming (Z_{1j}, Z_{2j}, ..., Z_{mj}) ∼ N(0, Σ) for j = 1, 2, ..., M. We estimate Σ by applying the Minimum Covariance Determinant (MCD) estimator (Rousseeuw and Driessen, 1999) to the 2000 randomly sampled SNPs; MCD is a highly robust method that reduces the influence of the sparse non-null hypotheses. Notice that we choose MCD instead of the graphical Lasso here because we do not need an estimate of the inverse of Σ. It is evident that most of the nonzero correlations are between traits within the same super-pathways: this allows us to apply the adaptive filtering procedures for PC hypotheses across super-pathways with confidence. We can now input these p-values to the multiple-comparison adjustment procedures. Figure 4b compares the number of significant SNPs when FDR is controlled at 0.05 and r ranges from 2 to 5.
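The chi-square computation of p_ij just described can be sketched as follows. The statistic is z' Σ̂_i^{-1} z, compared against a chi-square distribution with n_i degrees of freedom; for simplicity the sketch assumes an even number of metabolites in the super-pathway so that the chi-square tail has a closed form, and the function names are ours:

```python
import math
import numpy as np

def chi2_sf_even(x, k):
    """Survival function of a chi-square with even df k, via the closed
    form exp(-x/2) * sum_{i=0}^{k/2-1} (x/2)^i / i!."""
    assert k % 2 == 0 and k > 0
    term, total = 1.0, 1.0
    for i in range(1, k // 2):
        term *= (x / 2) / i
        total += term
    return math.exp(-x / 2) * total

def pathway_pvalue(z, Sigma):
    """P-value for H_0ij (no metabolite of the super-pathway is associated
    with the marker): z' Sigma^{-1} z ~ chi^2_k under the null, treating
    the estimated Sigma as known."""
    z = np.asarray(z, dtype=float)
    stat = float(z @ np.linalg.solve(np.asarray(Sigma, dtype=float), z))
    return chi2_sf_even(stat, z.size)
```

For instance, two independent standard-normal Z-values each equal to 1 give a statistic of 2 and a p-value of e^{-1} ≈ 0.368.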
Compared with the other four methods, our AdaFilter BH procedure is much more powerful in rejecting PC null hypotheses for any value of r. The repfdr method rejects fewer hypotheses than AdaFilter BH. This is consistent with our simulations, as in this GWAS the fraction of complete nulls π_0 is almost 1 and there are strong local correlations among nearby SNPs. Table 6 shows the selected SNPs for r = 4. Most of the individual p-values are not below the usual genome-wide threshold (10⁻⁸), showing the power gain of our method. The genes that correspond to these SNPs are GCKR, C2orf16, ZNF512, GPN1, SLC17A3, ABO and ABCC1. Looking up the traits that have been found to be affected by these genes using the Phenotype-Genotype Integrator (PheGenI) website, we find that most of these genes are associated with a diverse class of traits, consistent with our results. For example, gene GCKR is a glucokinase regulator and affects 62 different traits, including glucose, cholesterol, obesity, uric acid, calcium and insulin. Gene ABO is related to the ABO blood group system, the first blood group system discovered, and affects 52 traits, including blood coagulation factors, alkaline phosphatase, stroke and coronary artery disease. The protein encoded by gene ABCC1 is a member of the superfamily of ATP-binding cassette (ABC) transporters; this gene is found to affect 4 traits, including metabolism and stroke. Finally, gene SLC17A3 encodes a voltage-driven transporter that excretes intracellular urate and organic anions from the blood into renal tubule cells, and affects uric acid, blood pressure and homocysteine.

(a) Correlation of the metabolite measures. (b) Number of significant SNPs.

Figure 4: Figure 4a shows the correlation of the test statistics of the metabolite measures under the null. The darker the color, the higher the absolute value of the correlation. The metabolite measurements are reordered so that metabolites in the same pathway (Lipid, Cofactors and vitamins, Energy, Nucleotide, Carbohydrate, Xenobiotics, Amino acid, Peptide) are adjacent to each other; the red lines and blue text label the eight super-pathways. Figure 4b compares the results of adaptive-filtering BH with those of applying BH to the PC p-values calculated with the three rules, and with repfdr. r ranges from 2 to 5 and FDR is targeted at α = 0.05.

7 Conclusion

Testing partial conjunction hypotheses provides a framework to assess the replicability of scientific findings across multiple studies. However, simultaneous testing of thousands of partial conjunction hypotheses often results in very low power, limiting the utility of this framework in high-throughput experiments. We introduced here a new multiple testing approach based on adaptive filtering, AdaFilter, which can greatly increase the power of simultaneous testing of PC hypotheses while controlling global error rates. We prove theoretically that appropriate variants of our procedure control FWER and PFER under independence of all individual hypotheses. More importantly, we show through extensive simulations that the AdaFilter procedures control FWER, PFER and FDR both under independence and under weak dependence of the individual hypotheses within each study. One limitation of our approach, however, is that it does require independence of the results across different studies.
The key reason for the low power of previous methods in PC multiple testing is that a null PC hypothesis is a composite hypothesis. To increase power, AdaFilter filters out unlikely candidate PC null hypotheses by establishing conditional validity, the validity of a p-value conditional on sub-


More information

The miss rate for the analysis of gene expression data

The miss rate for the analysis of gene expression data Biostatistics (2005), 6, 1,pp. 111 117 doi: 10.1093/biostatistics/kxh021 The miss rate for the analysis of gene expression data JONATHAN TAYLOR Department of Statistics, Stanford University, Stanford,

More information

Statistical Applications in Genetics and Molecular Biology

Statistical Applications in Genetics and Molecular Biology Statistical Applications in Genetics and Molecular Biology Volume, Issue 006 Article 9 A Method to Increase the Power of Multiple Testing Procedures Through Sample Splitting Daniel Rubin Sandrine Dudoit

More information

arxiv: v1 [math.st] 31 Mar 2009

arxiv: v1 [math.st] 31 Mar 2009 The Annals of Statistics 2009, Vol. 37, No. 2, 619 629 DOI: 10.1214/07-AOS586 c Institute of Mathematical Statistics, 2009 arxiv:0903.5373v1 [math.st] 31 Mar 2009 AN ADAPTIVE STEP-DOWN PROCEDURE WITH PROVEN

More information

Improving the Performance of the FDR Procedure Using an Estimator for the Number of True Null Hypotheses

Improving the Performance of the FDR Procedure Using an Estimator for the Number of True Null Hypotheses Improving the Performance of the FDR Procedure Using an Estimator for the Number of True Null Hypotheses Amit Zeisel, Or Zuk, Eytan Domany W.I.S. June 5, 29 Amit Zeisel, Or Zuk, Eytan Domany (W.I.S.)Improving

More information

Empirical Bayes Moderation of Asymptotically Linear Parameters

Empirical Bayes Moderation of Asymptotically Linear Parameters Empirical Bayes Moderation of Asymptotically Linear Parameters Nima Hejazi Division of Biostatistics University of California, Berkeley stat.berkeley.edu/~nhejazi nimahejazi.org twitter/@nshejazi github/nhejazi

More information

Statistical testing. Samantha Kleinberg. October 20, 2009

Statistical testing. Samantha Kleinberg. October 20, 2009 October 20, 2009 Intro to significance testing Significance testing and bioinformatics Gene expression: Frequently have microarray data for some group of subjects with/without the disease. Want to find

More information

Family-wise Error Rate Control in QTL Mapping and Gene Ontology Graphs

Family-wise Error Rate Control in QTL Mapping and Gene Ontology Graphs Family-wise Error Rate Control in QTL Mapping and Gene Ontology Graphs with Remarks on Family Selection Dissertation Defense April 5, 204 Contents Dissertation Defense Introduction 2 FWER Control within

More information

Statistical Methods for Analysis of Genetic Data

Statistical Methods for Analysis of Genetic Data Statistical Methods for Analysis of Genetic Data Christopher R. Cabanski A dissertation submitted to the faculty of the University of North Carolina at Chapel Hill in partial fulfillment of the requirements

More information

Model-Free Knockoffs: High-Dimensional Variable Selection that Controls the False Discovery Rate

Model-Free Knockoffs: High-Dimensional Variable Selection that Controls the False Discovery Rate Model-Free Knockoffs: High-Dimensional Variable Selection that Controls the False Discovery Rate Lucas Janson, Stanford Department of Statistics WADAPT Workshop, NIPS, December 2016 Collaborators: Emmanuel

More information

Research Article Sample Size Calculation for Controlling False Discovery Proportion

Research Article Sample Size Calculation for Controlling False Discovery Proportion Probability and Statistics Volume 2012, Article ID 817948, 13 pages doi:10.1155/2012/817948 Research Article Sample Size Calculation for Controlling False Discovery Proportion Shulian Shang, 1 Qianhe Zhou,

More information

Lesson 11. Functional Genomics I: Microarray Analysis

Lesson 11. Functional Genomics I: Microarray Analysis Lesson 11 Functional Genomics I: Microarray Analysis Transcription of DNA and translation of RNA vary with biological conditions 3 kinds of microarray platforms Spotted Array - 2 color - Pat Brown (Stanford)

More information

Association studies and regression

Association studies and regression Association studies and regression CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar Association studies and regression 1 / 104 Administration

More information

Selecting explanatory variables with the modified version of Bayesian Information Criterion

Selecting explanatory variables with the modified version of Bayesian Information Criterion Selecting explanatory variables with the modified version of Bayesian Information Criterion Institute of Mathematics and Computer Science, Wrocław University of Technology, Poland in cooperation with J.K.Ghosh,

More information

25 : Graphical induced structured input/output models

25 : Graphical induced structured input/output models 10-708: Probabilistic Graphical Models 10-708, Spring 2016 25 : Graphical induced structured input/output models Lecturer: Eric P. Xing Scribes: Raied Aljadaany, Shi Zong, Chenchen Zhu Disclaimer: A large

More information

Lecture 21: October 19

Lecture 21: October 19 36-705: Intermediate Statistics Fall 2017 Lecturer: Siva Balakrishnan Lecture 21: October 19 21.1 Likelihood Ratio Test (LRT) To test composite versus composite hypotheses the general method is to use

More information

Procedures controlling generalized false discovery rate

Procedures controlling generalized false discovery rate rocedures controlling generalized false discovery rate By SANAT K. SARKAR Department of Statistics, Temple University, hiladelphia, A 922, U.S.A. sanat@temple.edu AND WENGE GUO Department of Environmental

More information

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 2004 Paper 147 Multiple Testing Methods For ChIP-Chip High Density Oligonucleotide Array Data Sunduz

More information

Sanat Sarkar Department of Statistics, Temple University Philadelphia, PA 19122, U.S.A. September 11, Abstract

Sanat Sarkar Department of Statistics, Temple University Philadelphia, PA 19122, U.S.A. September 11, Abstract Adaptive Controls of FWER and FDR Under Block Dependence arxiv:1611.03155v1 [stat.me] 10 Nov 2016 Wenge Guo Department of Mathematical Sciences New Jersey Institute of Technology Newark, NJ 07102, U.S.A.

More information

CS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS

CS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS CS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS * Some contents are adapted from Dr. Hung Huang and Dr. Chengkai Li at UT Arlington Mingon Kang, Ph.D. Computer Science, Kennesaw State University Problems

More information

Hunting for significance with multiple testing

Hunting for significance with multiple testing Hunting for significance with multiple testing Etienne Roquain 1 1 Laboratory LPMA, Université Pierre et Marie Curie (Paris 6), France Séminaire MODAL X, 19 mai 216 Etienne Roquain Hunting for significance

More information

Lecture 7: Interaction Analysis. Summer Institute in Statistical Genetics 2017

Lecture 7: Interaction Analysis. Summer Institute in Statistical Genetics 2017 Lecture 7: Interaction Analysis Timothy Thornton and Michael Wu Summer Institute in Statistical Genetics 2017 1 / 39 Lecture Outline Beyond main SNP effects Introduction to Concept of Statistical Interaction

More information

Linear Regression (1/1/17)

Linear Regression (1/1/17) STA613/CBB540: Statistical methods in computational biology Linear Regression (1/1/17) Lecturer: Barbara Engelhardt Scribe: Ethan Hada 1. Linear regression 1.1. Linear regression basics. Linear regression

More information

A Large-Sample Approach to Controlling the False Discovery Rate

A Large-Sample Approach to Controlling the False Discovery Rate A Large-Sample Approach to Controlling the False Discovery Rate Christopher R. Genovese Department of Statistics Carnegie Mellon University Larry Wasserman Department of Statistics Carnegie Mellon University

More information

Biostatistics Advanced Methods in Biostatistics IV

Biostatistics Advanced Methods in Biostatistics IV Biostatistics 140.754 Advanced Methods in Biostatistics IV Jeffrey Leek Assistant Professor Department of Biostatistics jleek@jhsph.edu Lecture 11 1 / 44 Tip + Paper Tip: Two today: (1) Graduate school

More information

Computational Systems Biology: Biology X

Computational Systems Biology: Biology X Bud Mishra Room 1002, 715 Broadway, Courant Institute, NYU, New York, USA L#7:(Mar-23-2010) Genome Wide Association Studies 1 The law of causality... is a relic of a bygone age, surviving, like the monarchy,

More information

Bayesian Regression (1/31/13)

Bayesian Regression (1/31/13) STA613/CBB540: Statistical methods in computational biology Bayesian Regression (1/31/13) Lecturer: Barbara Engelhardt Scribe: Amanda Lea 1 Bayesian Paradigm Bayesian methods ask: given that I have observed

More information

Familywise Error Rate Controlling Procedures for Discrete Data

Familywise Error Rate Controlling Procedures for Discrete Data Familywise Error Rate Controlling Procedures for Discrete Data arxiv:1711.08147v1 [stat.me] 22 Nov 2017 Yalin Zhu Center for Mathematical Sciences, Merck & Co., Inc., West Point, PA, U.S.A. Wenge Guo Department

More information

The International Journal of Biostatistics

The International Journal of Biostatistics The International Journal of Biostatistics Volume 7, Issue 1 2011 Article 12 Consonance and the Closure Method in Multiple Testing Joseph P. Romano, Stanford University Azeem Shaikh, University of Chicago

More information

ON STEPWISE CONTROL OF THE GENERALIZED FAMILYWISE ERROR RATE. By Wenge Guo and M. Bhaskara Rao

ON STEPWISE CONTROL OF THE GENERALIZED FAMILYWISE ERROR RATE. By Wenge Guo and M. Bhaskara Rao ON STEPWISE CONTROL OF THE GENERALIZED FAMILYWISE ERROR RATE By Wenge Guo and M. Bhaskara Rao National Institute of Environmental Health Sciences and University of Cincinnati A classical approach for dealing

More information

25 : Graphical induced structured input/output models

25 : Graphical induced structured input/output models 10-708: Probabilistic Graphical Models 10-708, Spring 2013 25 : Graphical induced structured input/output models Lecturer: Eric P. Xing Scribes: Meghana Kshirsagar (mkshirsa), Yiwen Chen (yiwenche) 1 Graph

More information

The Simplex Method: An Example

The Simplex Method: An Example The Simplex Method: An Example Our first step is to introduce one more new variable, which we denote by z. The variable z is define to be equal to 4x 1 +3x 2. Doing this will allow us to have a unified

More information

Estimation of a Two-component Mixture Model

Estimation of a Two-component Mixture Model Estimation of a Two-component Mixture Model Bodhisattva Sen 1,2 University of Cambridge, Cambridge, UK Columbia University, New York, USA Indian Statistical Institute, Kolkata, India 6 August, 2012 1 Joint

More information

STAT 263/363: Experimental Design Winter 2016/17. Lecture 1 January 9. Why perform Design of Experiments (DOE)? There are at least two reasons:

STAT 263/363: Experimental Design Winter 2016/17. Lecture 1 January 9. Why perform Design of Experiments (DOE)? There are at least two reasons: STAT 263/363: Experimental Design Winter 206/7 Lecture January 9 Lecturer: Minyong Lee Scribe: Zachary del Rosario. Design of Experiments Why perform Design of Experiments (DOE)? There are at least two

More information

Lecture 28. Ingo Ruczinski. December 3, Department of Biostatistics Johns Hopkins Bloomberg School of Public Health Johns Hopkins University

Lecture 28. Ingo Ruczinski. December 3, Department of Biostatistics Johns Hopkins Bloomberg School of Public Health Johns Hopkins University Lecture 28 Department of Biostatistics Johns Hopkins Bloomberg School of Public Health Johns Hopkins University December 3, 2015 1 2 3 4 5 1 Familywise error rates 2 procedure 3 Performance of with multiple

More information

Genotype Imputation. Biostatistics 666

Genotype Imputation. Biostatistics 666 Genotype Imputation Biostatistics 666 Previously Hidden Markov Models for Relative Pairs Linkage analysis using affected sibling pairs Estimation of pairwise relationships Identity-by-Descent Relatives

More information

arxiv: v2 [stat.me] 9 Aug 2018

arxiv: v2 [stat.me] 9 Aug 2018 Submitted to the Annals of Applied Statistics arxiv: arxiv:1706.09375 MULTILAYER KNOCKOFF FILTER: CONTROLLED VARIABLE SELECTION AT MULTIPLE RESOLUTIONS arxiv:1706.09375v2 stat.me 9 Aug 2018 By Eugene Katsevich,

More information

Resampling-Based Control of the FDR

Resampling-Based Control of the FDR Resampling-Based Control of the FDR Joseph P. Romano 1 Azeem S. Shaikh 2 and Michael Wolf 3 1 Departments of Economics and Statistics Stanford University 2 Department of Economics University of Chicago

More information

Semi-Penalized Inference with Direct FDR Control

Semi-Penalized Inference with Direct FDR Control Jian Huang University of Iowa April 4, 2016 The problem Consider the linear regression model y = p x jβ j + ε, (1) j=1 where y IR n, x j IR n, ε IR n, and β j is the jth regression coefficient, Here p

More information

Multiple Comparison Methods for Means

Multiple Comparison Methods for Means SIAM REVIEW Vol. 44, No. 2, pp. 259 278 c 2002 Society for Industrial and Applied Mathematics Multiple Comparison Methods for Means John A. Rafter Martha L. Abell James P. Braselton Abstract. Multiple

More information

Multiple Hypothesis Testing in Microarray Data Analysis

Multiple Hypothesis Testing in Microarray Data Analysis Multiple Hypothesis Testing in Microarray Data Analysis Sandrine Dudoit jointly with Mark van der Laan and Katie Pollard Division of Biostatistics, UC Berkeley www.stat.berkeley.edu/~sandrine Short Course:

More information

Statistical Applications in Genetics and Molecular Biology

Statistical Applications in Genetics and Molecular Biology Statistical Applications in Genetics and Molecular Biology Volume 3, Issue 1 2004 Article 13 Multiple Testing. Part I. Single-Step Procedures for Control of General Type I Error Rates Sandrine Dudoit Mark

More information

More powerful control of the false discovery rate under dependence

More powerful control of the false discovery rate under dependence Statistical Methods & Applications (2006) 15: 43 73 DOI 10.1007/s10260-006-0002-z ORIGINAL ARTICLE Alessio Farcomeni More powerful control of the false discovery rate under dependence Accepted: 10 November

More information

Estimation and Confidence Sets For Sparse Normal Mixtures

Estimation and Confidence Sets For Sparse Normal Mixtures Estimation and Confidence Sets For Sparse Normal Mixtures T. Tony Cai 1, Jiashun Jin 2 and Mark G. Low 1 Abstract For high dimensional statistical models, researchers have begun to focus on situations

More information

On Methods Controlling the False Discovery Rate 1

On Methods Controlling the False Discovery Rate 1 Sankhyā : The Indian Journal of Statistics 2008, Volume 70-A, Part 2, pp. 135-168 c 2008, Indian Statistical Institute On Methods Controlling the False Discovery Rate 1 Sanat K. Sarkar Temple University,

More information

Multiple Change-Point Detection and Analysis of Chromosome Copy Number Variations

Multiple Change-Point Detection and Analysis of Chromosome Copy Number Variations Multiple Change-Point Detection and Analysis of Chromosome Copy Number Variations Yale School of Public Health Joint work with Ning Hao, Yue S. Niu presented @Tsinghua University Outline 1 The Problem

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

Gene Expression an Overview of Problems & Solutions: 3&4. Utah State University Bioinformatics: Problems and Solutions Summer 2006

Gene Expression an Overview of Problems & Solutions: 3&4. Utah State University Bioinformatics: Problems and Solutions Summer 2006 Gene Expression an Overview of Problems & Solutions: 3&4 Utah State University Bioinformatics: Problems and Solutions Summer 006 Review Considering several problems & solutions with gene expression data

More information

Rejoinder on: Control of the false discovery rate under dependence using the bootstrap and subsampling

Rejoinder on: Control of the false discovery rate under dependence using the bootstrap and subsampling Test (2008) 17: 461 471 DOI 10.1007/s11749-008-0134-6 DISCUSSION Rejoinder on: Control of the false discovery rate under dependence using the bootstrap and subsampling Joseph P. Romano Azeem M. Shaikh

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 2004 Paper 164 Multiple Testing Procedures: R multtest Package and Applications to Genomics Katherine

More information

Association Testing with Quantitative Traits: Common and Rare Variants. Summer Institute in Statistical Genetics 2014 Module 10 Lecture 5

Association Testing with Quantitative Traits: Common and Rare Variants. Summer Institute in Statistical Genetics 2014 Module 10 Lecture 5 Association Testing with Quantitative Traits: Common and Rare Variants Timothy Thornton and Katie Kerr Summer Institute in Statistical Genetics 2014 Module 10 Lecture 5 1 / 41 Introduction to Quantitative

More information

McGill University. Faculty of Science MATH 204 PRINCIPLES OF STATISTICS II. Final Examination

McGill University. Faculty of Science MATH 204 PRINCIPLES OF STATISTICS II. Final Examination McGill University Faculty of Science MATH 204 PRINCIPLES OF STATISTICS II Final Examination Date: 20th April 2009 Time: 9am-2pm Examiner: Dr David A Stephens Associate Examiner: Dr Russell Steele Please

More information

BTRY 4830/6830: Quantitative Genomics and Genetics

BTRY 4830/6830: Quantitative Genomics and Genetics BTRY 4830/6830: Quantitative Genomics and Genetics Lecture 23: Alternative tests in GWAS / (Brief) Introduction to Bayesian Inference Jason Mezey jgm45@cornell.edu Nov. 13, 2014 (Th) 8:40-9:55 Announcements

More information

Selection-adjusted estimation of effect sizes

Selection-adjusted estimation of effect sizes Selection-adjusted estimation of effect sizes with an application in eqtl studies Snigdha Panigrahi 19 October, 2017 Stanford University Selective inference - introduction Selective inference Statistical

More information

Structure learning in human causal induction

Structure learning in human causal induction Structure learning in human causal induction Joshua B. Tenenbaum & Thomas L. Griffiths Department of Psychology Stanford University, Stanford, CA 94305 jbt,gruffydd @psych.stanford.edu Abstract We use

More information

A GENERAL DECISION THEORETIC FORMULATION OF PROCEDURES CONTROLLING FDR AND FNR FROM A BAYESIAN PERSPECTIVE

A GENERAL DECISION THEORETIC FORMULATION OF PROCEDURES CONTROLLING FDR AND FNR FROM A BAYESIAN PERSPECTIVE A GENERAL DECISION THEORETIC FORMULATION OF PROCEDURES CONTROLLING FDR AND FNR FROM A BAYESIAN PERSPECTIVE Sanat K. Sarkar 1, Tianhui Zhou and Debashis Ghosh Temple University, Wyeth Pharmaceuticals and

More information

Estimating False Discovery Proportion Under Arbitrary Covariance Dependence

Estimating False Discovery Proportion Under Arbitrary Covariance Dependence Estimating False Discovery Proportion Under Arbitrary Covariance Dependence arxiv:1010.6056v2 [stat.me] 15 Nov 2011 Jianqing Fan, Xu Han and Weijie Gu May 31, 2018 Abstract Multiple hypothesis testing

More information

A TUTORIAL ON THE INHERITANCE PROCEDURE FOR MULTIPLE TESTING OF TREE-STRUCTURED HYPOTHESES

A TUTORIAL ON THE INHERITANCE PROCEDURE FOR MULTIPLE TESTING OF TREE-STRUCTURED HYPOTHESES A TUTORIAL ON THE INHERITANCE PROCEDURE FOR MULTIPLE TESTING OF TREE-STRUCTURED HYPOTHESES by Dilinuer Kuerban B.Sc. (Statistics), Southwestern University of Finance & Economics, 2011 a Project submitted

More information