CARS: Covariate Assisted Ranking and Screening for Large-Scale Two-Sample Inference


T. Tony Cai, University of Pennsylvania, Philadelphia, USA
Wenguang Sun, University of Southern California, Los Angeles, USA
Weinan Wang, University of Southern California, Los Angeles, USA

Summary. Two-sample multiple testing has a wide range of applications. The conventional practice first reduces the original observations to a vector of p-values and then chooses a cutoff to adjust for multiplicity. However, this data reduction step could cause significant loss of information and thus lead to suboptimal testing procedures. In this paper, we introduce a new framework for two-sample multiple testing by incorporating a carefully constructed auxiliary variable in inference to improve the power. A data-driven multiple testing procedure is developed by employing a covariate-assisted ranking and screening (CARS) approach that optimally combines the information from both the primary and auxiliary variables. The proposed CARS procedure is shown to be asymptotically valid and optimal for false discovery rate (FDR) control. The procedure is implemented in the R-package CARS. Numerical results confirm the effectiveness of CARS in FDR control and show that it achieves substantial power gain over existing methods. CARS is also illustrated through an application to the analysis of a satellite imaging data set for supernova detection.

Keywords: Compound decision theory; False discovery rate; Logically correlated tests; Multiple testing with covariates; Uncorrelated screening.

1. Introduction

A common goal in modern scientific studies is to identify features that exhibit differential levels across two or more conditions. The task becomes difficult in large-scale comparative experiments, where differential features are sparse among thousands or even millions of features being investigated. The conventional practice is to first reduce the original samples to a vector of p-values and then choose a cutoff to adjust for multiplicity. However, the first step of data reduction could cause significant loss of information and thus lead to suboptimal testing procedures. This paper proposes new strategies to extract sparsity information in the sample using an auxiliary covariate sequence and develops optimal covariate-assisted inference procedures for large-scale two-sample multiple testing problems.

We focus on a setting where both mean vectors are individually sparse. Such a setting arises naturally in many modern scientific applications. For example, the detection of sequentially activated genes in time-course microarray experiments, considered in Section B.8 in the Supplementary Material, involves identifying varied effect sizes across different time points [Calvano et al. (2005); Sun and Wei (2011)]. Since only a small fraction of genes are differentially expressed from the baseline, the problem of identifying varied levels over

time essentially reduces to a multiple testing problem with several high-dimensional sparse vectors (after removing the baseline effects). The second example arises from the detection of supernova explosions considered in Section 5.4. The potential locations can be identified by testing sudden changes in brightness in satellite images taken over a period of time. After the measurements are converted into grey-scale images and vectorized, multiple tests are conducted to compare the intensity levels between two sparse vectors. Another case in point is the analysis of differential networks, where the goal is to detect discrepancies between two or more networks with possibly sparse edges.

We first describe the conventional framework for two-sample inference and then discuss its limitations. Let X and Y be two random vectors recording the measurement levels of the same m features under two experimental conditions, respectively. The population mean vectors are given by $\mu_x = E(X) = (\mu_{x1},\ldots,\mu_{xm})$ and $\mu_y = E(Y) = (\mu_{y1},\ldots,\mu_{ym})$. A classical formulation for identifying differential features is to carry out m two-sample tests:
$$H_{i,0}: \mu_{xi} = \mu_{yi} \quad \text{vs.} \quad H_{i,1}: \mu_{xi} \neq \mu_{yi}, \quad 1 \le i \le m. \quad (1.1)$$
Suppose we have collected two random samples $\{X_1,\ldots,X_{n_1}\}$ and $\{Y_1,\ldots,Y_{n_2}\}$ as independent copies of X and Y, respectively. The standard practice starts with a data reduction step: a two-sample t-statistic $T_i$ is computed to compare the two conditions for feature i, then $T_i$ is converted to a p-value or z-value. Finally a significance threshold is chosen to control the multiplicity. However, this conventional practice, which only utilizes a vector of p-values, may suffer from substantial information loss.

Our idea is to exploit the sparsity information in the mean vectors to improve the efficiency of tests. Let $I(\cdot)$ be an indicator function. Observe that $\theta_{1i} = I(\mu_{xi} \neq \mu_{yi})$ and $\theta_{2i} = I(\mu_{xi} \neq 0 \text{ or } \mu_{yi} \neq 0)$ obey the following logical relationship:
$$\theta_{1i} = 0 \quad \text{if} \quad \theta_{2i} = 0. \quad (1.2)$$
When both $\mu_x$ and $\mu_y$ are sparse, the union support $S = \{i: \theta_{2i} \neq 0\}$ is also sparse. However, only information about $\theta_{1i}$ is retained after the data reduction step in conventional practice. We propose to construct an auxiliary covariate sequence that extracts the sparsity information on the union support from the original data. The covariate, which provides supplementary evidence in inference via the logical relationship (1.2), can help eliminate a large number of null locations and thus improve the accuracy of inference.

A key step in our methodological development is to recast the problem in the framework of multiple testing with a covariate sequence. This requires a careful construction of the covariate that leads to a simple bivariate model and an easily implementable methodology. Section 2 discusses strategies for constructing the pair of primary and auxiliary variables. Then we develop oracle and data-driven multiple testing procedures for the consequent bivariate model in Section 3. The proposed method employs a covariate-assisted ranking and screening (CARS) approach that simultaneously incorporates the primary and auxiliary information in multiple testing. We show that the CARS procedure controls the false discovery rate at the nominal level, and outperforms existing methods in power by effectively utilizing the information discarded in conventional practice. We mention two related strategies in the literature: testing following screening and testing following grouping.
In the first strategy, the hypotheses are formed and tested hierarchically via a screen-and-clean method [Zehetmayer et al. (2005); Reiner-Benaim et al. (2007); Zehetmayer et al. (2008); Wasserman and Roeder (2009); Bourgon et al. (2010)]. Following that strategy, one can first inspect the sample to identify the union support of $\mu_x$

and $\mu_y$, and then conduct two-sample tests on the narrowed subset to further eliminate the null locations with no differential levels. The screen-and-clean approach requires sample splitting to ensure the independence between the screening and testing stages to avoid selection bias [Rubin et al. (2006); Bourgon et al. (2010)]. However, even though the screening stage can significantly narrow down the focus, sample splitting often leads to loss of power. For example, the empirical studies in Skol et al. (2006) concluded that a two-stage analysis is in general inferior to a naive joint analysis that combines the data from both stages.

The second strategy (Liu, 2014) can be described as testing following grouping, that is, the hypotheses are analyzed in groups via a divide-and-test method. Liu (2014) developed an uncorrelated screening (US) method, which first divides the hypotheses into two groups according to a screening statistic, and then applies multiple testing procedures to the groups separately to identify non-null cases. It was shown in Liu (2014) that US controls the error rate at the nominal level and outperforms competitive methods in power.

Our approach marks a clear departure from existing methods. Both the screen-and-clean and divide-and-test strategies involve dichotomizing a continuous variable, which fails to fully utilize the auxiliary information. By contrast, our proposed CARS procedure models the screening covariate as a continuous variable and employs a novel ranking and selection procedure that optimally integrates the information from both the primary and auxiliary variables. In Section 4, we develop further results on a general bivariate model; our study reveals the connections between existing methods and provides insights into the advantage of the proposed CARS procedure. Simulation results in Section 5 demonstrate that CARS controls the false discovery rate in finite samples and uniformly dominates all existing methods. The gain in power is substantial in many settings. We illustrate our method by analyzing a time-course satellite image data set in Section 5.4. The application shows improved sensitivity of the proposed method in identifying changes between images taken over time. Section 6 further discusses related issues and open problems. The proofs are provided in Section 7 and Appendix A. Additional numerical results are given in Appendix B.

2. A Bivariate Random Mixture Model for Two-Sample Multiple Testing

Suppose $\{X_{ij}: 1 \le j \le n_x\}$ and $\{Y_{ik}: 1 \le k \le n_y\}$, $i = 1,\ldots,m$, are repeated measurements of generic independent random variables $X_i$ and $Y_i$, respectively. Let $\beta_0 = (\beta_{0i}: 1 \le i \le m)$ be a latent vector which itself is sparse [including the special case where $\beta_{0i} = 0$ for all i]. Consider the following model
$$X_{ij} = \beta_{0i} + \mu_{xi} + \epsilon_{xij}, \quad Y_{ik} = \beta_{0i} + \mu_{yi} + \epsilon_{yik}, \quad (2.1)$$
with corresponding population means given by $\beta_{0i} + \mu_{xi}$ and $\beta_{0i} + \mu_{yi}$. We assume that $\mu_{xi}$ and $\mu_{yi}$ satisfy $\mu_{xi} \sim (1-\pi_x)\delta_0 + \pi_x g_{\mu_x}(\cdot)$ and $\mu_{yi} \sim (1-\pi_y)\delta_0 + \pi_y g_{\mu_y}(\cdot)$, where $\delta_0$ is the Dirac delta function, and $g_{\mu_x}$ and $g_{\mu_y}$ are densities of nonzero effects.

Remark 1. Model (2.1) can be applied to scenarios with non-sparse $\mu_x$ and $\mu_y$ as long as some baseline measurements are available. In Section A.8 in the supplement, we show that after baseline removal, $\beta_0$ becomes sparse, and the related model assumptions hold.

Let $n = n_x + n_y$. Denote $\gamma_x = n_x/n$ and $\gamma_y = n_y/n$.
The population means are estimated by $\bar X_i = n_x^{-1}\sum_{j=1}^{n_x} X_{ij}$ and $\bar Y_i = n_y^{-1}\sum_{k=1}^{n_y} Y_{ik}$, respectively. For ease of presentation, we first focus on the Gaussian model, where $\epsilon_{xij} \overset{iid}{\sim} N(0,\sigma_{xi}^2)$ and $\epsilon_{yik} \overset{iid}{\sim} N(0,\sigma_{yi}^2)$. More general settings will be discussed in Section 3.6.

Remark 2. The proposed methodology only requires $\bar X_i$ and $\bar Y_i$ to be normal. In practical situations where $n_x$ and $n_y$ are large, our method works well without the normality assumption. Numerical results with non-Gaussian errors are provided in Section 5.3.

The two-sample inference problem is concerned with the simultaneous testing of m hypotheses $H_{i,0}: \mu_{xi} = \mu_{yi}$ vs $H_{i,1}: \mu_{xi} \neq \mu_{yi}$, $i = 1,\ldots,m$. Let $T_{1i}$ and $T_{2i}$ be summary statistics that contain the information about $\theta_{1i} = I(\mu_{xi} \neq \mu_{yi})$ (support of the mean difference) and $\theta_{2i} = I(\mu_{xi} \neq 0 \text{ or } \mu_{yi} \neq 0)$ (union support), respectively. $T_{1i}$ is the primary statistic in inference and $T_{2i}$ is an auxiliary covariate. The term auxiliary indicates that we do not use $T_{2i}$ to make inference on $\theta_{1i}$ directly. Instead, we aim to incorporate $T_{2i}$ in inference to (indirectly) support the evidence provided by the primary variable $T_{1i}$. We first discuss how to construct the primary and auxiliary statistics from the original data, and then introduce a bivariate random mixture model to describe their joint distribution. Finally, we formulate a decision-theoretic framework for two-sample simultaneous inference with an auxiliary covariate.

2.1. Constructing the primary and auxiliary statistics

A key step in our formulation is to construct a pair of statistics $(T_{1i}, T_{2i})$ such that (i) the pair extracts information from the data effectively; (ii) the pair leads to a simple bivariate model via which the logical relationship (1.2) can be exploited. To focus on the main ideas, we first discuss the Gaussian case with known variances. Extensions to two-sample tests with non-Gaussian errors and unknown variances are discussed in Section 3.6.

The general strategies for constructing the pair $(T_{1i}, T_{2i})$ can be described as follows. First, $T_{1i}$ is used to capture the information on $\theta_{1i}$; hence $\bar X_i - \bar Y_i$ should be incorporated in its expression. Second, to capture the information on the union support $\theta_{2i}$, we propose to use the weighted sum $\bar X_i + \kappa_i \bar Y_i$, where $\kappa_i > 0$ is a weight to be specified later. Under the normality assumption, the covariance of $\bar X_i - \bar Y_i$ and $\bar X_i + \kappa_i \bar Y_i$ is given by $\sigma_{xi}^2/n_x - \kappa_i \sigma_{yi}^2/n_y$. This motivates us to choose the weight $\kappa_i = (\gamma_y \sigma_{xi}^2)/(\gamma_x \sigma_{yi}^2)$, which leads to zero correlation, a crucial property for simplifying the model and facilitating the methodological development. Finally, the difference and weighted sum are standardized to make the statistics comparable across tests. Combining the above considerations, we propose to use the following pair of statistics to summarize the information in the data:
$$(T_{1i}, T_{2i}) = \sqrt{\frac{n_x n_y}{n}}\left(\frac{\bar X_i - \bar Y_i}{\sigma_{pi}},\; \frac{\bar X_i + \kappa_i \bar Y_i}{\sqrt{\kappa_i}\,\sigma_{pi}}\right), \quad (2.2)$$
where $\sigma_{pi}^2 = \gamma_y\sigma_{xi}^2 + \gamma_x\sigma_{yi}^2$. Denote $T_1 = (T_{1i}: 1 \le i \le m)$ and $T_2 = (T_{2i}: 1 \le i \le m)$.

2.2. A bivariate random mixture model

We develop a bivariate model to describe the joint distribution of $T_{1i}$ and $T_{2i}$. Let $\theta_i = (\theta_{1i}, \theta_{2i})$. Assume that the $\theta_i$ are independent and identically distributed (i.i.d.) bivariate random vectors that take values in the Cartesian product space $\{0,1\}^2 = \{(0,0),(0,1),(1,0),(1,1)\}$. For each combination $\theta_i = (j,k)$, $(T_{1i}, T_{2i})$ are jointly distributed with conditional density $f(t_{1i}, t_{2i} \mid \theta_{1i} = j, \theta_{2i} = k)$. Denote $\pi_{jk} = P(\theta_{1i} = j, \theta_{2i} = k)$. In practice, we do not know $(\theta_{1i}, \theta_{2i})$ but only observe $(T_{1i}, T_{2i})$ from a mixture model
$$f(t_{1i}, t_{2i}) = \sum_{(j,k)\in\{0,1\}^2} \pi_{jk} f(t_{1i}, t_{2i} \mid \theta_{1i} = j, \theta_{2i} = k). \quad (2.3)$$
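As a concrete illustration of the construction in (2.2), the following minimal R sketch generates toy data from model (2.1) and forms the pair $(T_{1i}, T_{2i})$ under known variances. All function and variable names here are illustrative and are not part of the CARS package.

    # Construct the primary/auxiliary pair (2.2) from data matrices X (m x n_x)
    # and Y (m x n_y), assuming the variances sigma2_x, sigma2_y are known.
    make_pair <- function(X, Y, sigma2_x, sigma2_y) {
      n_x <- ncol(X); n_y <- ncol(Y); n <- n_x + n_y
      gamma_x <- n_x / n; gamma_y <- n_y / n
      xbar <- rowMeans(X); ybar <- rowMeans(Y)
      kappa   <- (gamma_y * sigma2_x) / (gamma_x * sigma2_y)  # zero-covariance weight
      sigma_p <- sqrt(gamma_y * sigma2_x + gamma_x * sigma2_y)
      T1 <- sqrt(n_x * n_y / n) * (xbar - ybar) / sigma_p
      T2 <- sqrt(n_x * n_y / n) * (xbar + kappa * ybar) / (sqrt(kappa) * sigma_p)
      list(T1 = T1, T2 = T2)
    }

    # Toy data from model (2.1): zero baseline, sparse mu_x and mu_y
    m <- 5000; n_x <- 50; n_y <- 60
    mu_x <- mu_y <- numeric(m); mu_x[1:100] <- 1; mu_y[101:200] <- 1
    X <- matrix(rnorm(m * n_x, mean = mu_x, sd = 1), m, n_x)
    Y <- matrix(rnorm(m * n_y, mean = mu_y, sd = sqrt(2)), m, n_y)
    pair <- make_pair(X, Y, sigma2_x = rep(1, m), sigma2_y = rep(2, m))

By construction, $T_{1i}$ and $T_{2i}$ computed this way are uncorrelated, and each has unit variance under the corresponding null.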

The goal is to determine the value of $\theta_{1i}$ based on the pairs $\{(T_{1i}, T_{2i}): 1 \le i \le m\}$. The mixture model (2.3) is difficult to analyze. However, if $T_{1i}$ and $T_{2i}$ are carefully constructed as done in Section 2.1, then several simplifications can be made. First, the logical relationship (1.2) implies that $\pi_{10} = 0$; thus we only have three terms in (2.3). Second, according to our construction (2.2), $T_{1i}$ and $T_{2i}$ are conditionally independent:
$$f(t_{1i}, t_{2i} \mid \mu_{xi}, \mu_{yi}) = f(t_{1i} \mid \mu_{xi}, \mu_{yi})\, f(t_{2i} \mid \mu_{xi}, \mu_{yi}). \quad (2.4)$$
The next proposition utilizes (2.4) to further simplify the model.

Proposition 1. The conditional independence (2.4) implies that $f(t_{1i}, t_{2i} \mid \theta_{1i} = 0, \theta_{2i} = 0) = f(t_{1i} \mid \theta_{1i} = 0)\,f(t_{2i} \mid \theta_{2i} = 0)$; $f(t_{1i}, t_{2i} \mid \theta_{1i} = 0, \theta_{2i} = 1) = f(t_{1i} \mid \theta_{1i} = 0)\,f(t_{2i} \mid \theta_{1i} = 0, \theta_{2i} = 1)$; and
$$f(t_{1i}, t_{2i} \mid \theta_{1i} = 0) = f(t_{1i} \mid \theta_{1i} = 0)\, f(t_{2i} \mid \theta_{1i} = 0). \quad (2.5)$$

The last equation shows that $T_{1i}$ and $T_{2i}$ are independent under the null hypothesis $H_{i,0}: \theta_{1i} = 0$. This is a critical result for our later methodological and theoretical developments. The joint density is given by
$$f(t_{1i}, t_{2i}) = \pi_{00} f(t_{1i} \mid \theta_{1i} = 0) f(t_{2i} \mid \theta_{2i} = 0) + \pi_{01} f(t_{1i} \mid \theta_{1i} = 0) f(t_{2i} \mid \theta_{1i} = 0, \theta_{2i} = 1) + \pi_{11} f(t_{1i}, t_{2i} \mid \theta_{1i} = 1, \theta_{2i} = 1). \quad (2.6)$$

2.3. Problem formulation

Our goal is to make inference on $\theta_{1i} = I(\mu_{xi} \neq \mu_{yi})$, $1 \le i \le m$, by simultaneously testing m hypotheses $H_{i,0}: \theta_{1i} = 0$ vs $H_{i,1}: \theta_{1i} = 1$. Compared to conventional approaches, we aim to develop methods utilizing the m pairs $\{(T_{1i}, T_{2i}): 1 \le i \le m\}$ instead of a single vector $\{T_{1i}: 1 \le i \le m\}$. This new problem can be recast and solved in the framework of multiple testing with a covariate: $T_{1i}$ is viewed as the primary statistic for assessing significance, and $T_{2i}$ is viewed as a covariate to assist inference by providing supporting information.

The concepts of error rate and power are similar to those in the conventional settings. A multiple testing procedure is represented by a thresholding rule of the form
$$\delta = \{\delta_i = I(S_i < t): i = 1,\ldots,m\} \in \{0,1\}^m, \quad (2.7)$$
where $\delta_i = 1$ if we reject hypothesis i and $\delta_i = 0$ otherwise. Here $S_i$ is a significance index that ranks the hypotheses from the most significant to the least significant, and t is a threshold. In large-scale testing problems, the false discovery rate (FDR, Benjamini and Hochberg, 1995) has been widely used to control the inflation of Type I errors. For a given decision rule $\delta = (\delta_i: 1 \le i \le m)$ of the form (2.7), the FDR is defined as
$$\mathrm{FDR}_\delta = E\left[\frac{\sum_{i=1}^m (1-\theta_{1i})\delta_i}{(\sum_{i=1}^m \delta_i) \vee 1}\right], \quad (2.8)$$
where $x \vee y = \max(x,y)$. A closely related concept is the marginal false discovery rate (mFDR), which is defined by
$$\mathrm{mFDR}_\delta = \frac{E\{\sum_{i=1}^m (1-\theta_{1i})\delta_i\}}{E(\sum_{i=1}^m \delta_i)}. \quad (2.9)$$
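In simulations where the true indicators $\theta_{1i}$ are known, the realized analogues of (2.8) and of the expected number of true positives can be computed directly; averaging them over replications approximates the FDR and the ETP defined below. A small illustrative R helper (hypothetical names, not part of the CARS package):

    # False discovery proportion and true-positive count of a 0/1 decision vector
    # `delta`, given the true non-null indicators `theta1` (available in simulations).
    fdp_tp <- function(delta, theta1) {
      R <- sum(delta)
      V <- sum(delta * (1 - theta1))   # number of false rejections
      c(FDP = V / max(R, 1), TP = sum(delta * theta1))
    }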

Genovese and Wasserman (2002) showed that $\mathrm{mFDR} = \mathrm{FDR} + O(m^{-1/2})$ when the BH procedure (Benjamini and Hochberg, 1995) is applied to m independent tests. We use the mFDR mainly for technical considerations to obtain the optimality result. Proposition 7 in Section 7.2 gives sufficient conditions under which the mFDR and FDR are asymptotically equivalent and shows that the conditions are fulfilled by our proposed method. Define the expected number of true positives $\mathrm{ETP}_\delta = E(\sum_{i=1}^m \theta_{1i}\delta_i)$. Other related power measures include the missed discovery rate (MDR, Taylor et al., 2005), average power (Efron, 2007) and false non-discovery/negative rate (FNR, Genovese and Wasserman, 2002; Sarkar, 2002). Our optimality result is developed based on the mFDR and ETP. We call a multiple testing procedure valid if it controls the mFDR at the nominal level and optimal if it has the largest ETP among all valid mFDR procedures.

3. Oracle and Data-driven Procedures

The basic framework of our methodological developments is explained as follows. We first consider an ideal situation where an oracle knows all parameters in model (2.6). Section 3.1 derives an oracle procedure. Sections 3.2 and 3.3 discuss an approximation strategy and related estimation methods. The data-driven procedure and extensions are presented in Sections 3.5 and 3.6.

3.1. Oracle mFDR procedure with pairs of observations

Let $\pi_j = P(\theta_{ji} = 1)$, $j = 1, 2$. Then the marginal density function for $T_{ji}$ is defined as $f_j = (1-\pi_j)f_{j0} + \pi_j f_{j1}$, where $f_{j0}$ and $f_{j1}$ are the null and non-null densities of $T_{ji}$, respectively. Conventional FDR procedures, which are developed based on a vector of p-values or z-values, are essentially univariate inference procedures that only utilize the information in $T_{1i}$. Define the local false discovery rate (Lfdr, Efron et al., 2001)
$$\mathrm{Lfdr}_1(t_1) = \frac{(1-\pi_1)f_{10}(t_1)}{f_1(t_1)}, \quad (3.1)$$
where the subscript 1 indicates a quantity associated with $T_{1i}$. It was shown in Sun and Cai (2007) that the optimal univariate mFDR procedure is a thresholding rule of the form
$$\delta(\mathrm{Lfdr}_1, c) = [I\{\mathrm{Lfdr}_1(t_{1i}) < c\}: 1 \le i \le m], \quad (3.2)$$
where $0 \le c \le 1$ is a cutoff. Denote $Q_{LF}(c)$ the mFDR level of $\delta(\mathrm{Lfdr}_1, c)$. Let $c^* = \sup\{c: Q_{LF}(c) \le \alpha\}$ be the largest cutoff under the mFDR constraint. Then $\delta(\mathrm{Lfdr}_1, c^*)$ is optimal among all univariate mFDR procedures in the sense that it has the largest ETP subject to $\mathrm{mFDR} \le \alpha$.

The next theorem derives an oracle procedure for mFDR control with pairs of observations. We shall see that the performance of the optimal univariate procedure can be greatly improved by exploiting the information in $T_{2i}$. The oracle procedure under the bivariate model (2.6) has two important components: an oracle statistic $T_i^*$ that optimally pools information from both $T_{1i}$ and $T_{2i}$, and an oracle threshold $\lambda^*$ that controls the mFDR with the largest ETP.

Theorem 1. Suppose $(T_{1i}, T_{2i})$ follow model (2.6). Let
$$q^*(t_2) = (1-\pi_1) f(t_2 \mid \theta_{1i} = 0). \quad (3.3)$$

Define the oracle statistic
$$T_i^*(t_1, t_2) = P(\theta_{1i} = 0 \mid T_{1i} = t_1, T_{2i} = t_2) = \frac{q^*(t_2)\, f_{10}(t_1)}{f(t_1, t_2)}, \quad (3.4)$$
where $f(t_1, t_2)$ is the joint density given by (2.6). Then

(a) For $0 < \lambda \le 1$, let $Q^*(\lambda)$ be the mFDR level of the testing rule $\{I(T_i^* < \lambda): 1 \le i \le m\}$. Then $Q^*(\lambda) < \lambda$ and $Q^*(\lambda)$ is non-decreasing in $\lambda$.

(b) Suppose we choose $\alpha \le \bar\alpha \equiv Q^*(1)$. Then the oracle threshold $\lambda^* = \sup\{\lambda: Q^*(\lambda) \le \alpha\}$ exists uniquely and $Q^*(\lambda^*) = \alpha$. Furthermore, define the oracle rule $\delta^* = (\delta_i^*: i = 1,\ldots,m)$, where
$$\delta_i^* = I(T_i^* < \lambda^*). \quad (3.5)$$
Then $\delta^*$ is optimal in the sense that $\mathrm{ETP}_\delta \le \mathrm{ETP}_{\delta^*}$ for any $\delta$ in $\mathcal{D}_\alpha$, where $\mathcal{D}_\alpha$ is the collection of all testing rules such that $\mathrm{mFDR}_\delta \le \alpha$.

Remark 3. The oracle statistic $T_i^*$ is the posterior probability that $H_{i,0}$ is true given the pair of primary and auxiliary statistics. It serves as a significance index providing evidence against the null. Section 3.2 gives a precise characterization of $q^*(t_2)$ and shows that it roughly describes how frequently a $T_{2i}$ from the null distribution would fall into the neighborhood of $t_2$. The estimation of $T_i^*$ and $q^*(t_2)$ is discussed in Section 3.3.

Remark 4. Theorem 1 indicates that pooling non-informative auxiliary data would not result in efficiency loss. Consider the worst scenario suggested by a reviewer where $T_{2i}$ is completely non-informative: $n_x = n_y$, $\sigma_{xi}^2 = \sigma_{yi}^2$ and $\mu_{xi} = -\mu_{yi}$. In Section A.9 of the Supplementary Material, we show that under the above conditions $T_i^*$ reduces to the Lfdr statistic (3.1), and the oracle (bivariate) procedure coincides with the optimal univariate rule (3.2). Contrary to the common intuition that incorporating $T_{2i}$ would negatively affect the performance, Theorem 1 says that our oracle procedure dominates the optimal univariate procedure (3.2). Hence the power will never be decreased by pooling the auxiliary information. Numerical evidence is provided in Section B.2 in the supplement.

The oracle rule (3.5) motivates us to consider a stepwise procedure that operates in two steps: ranking and thresholding. The ranking step orders all hypotheses from the most significant to the least significant according to $T^*$, and the thresholding step identifies the largest threshold along the ranking subject to the mFDR constraint. Specifically, denote $T^*_{(1)} \le \ldots \le T^*_{(m)}$ the ordered oracle statistics and $H_{(1)}, \ldots, H_{(m)}$ the corresponding hypotheses. The step-wise procedure operates as follows:
$$\text{Let } k = \max\Big\{j: \hat Q^*(j) = j^{-1}\sum_{i=1}^{j} T^*_{(i)} \le \alpha\Big\}. \quad \text{Reject } H_{(1)}, \ldots, H_{(k)}. \quad (3.6)$$
The quantity $\hat Q^*(j)$, which is the running average of the top j ordered statistics, gives a consistent estimate of the actual FDR level [cf. Sun and Cai (2007)]. Thus the stepwise algorithm (3.6) identifies the largest threshold subject to the FDR constraint. The hypotheses whose $T^*$ values are less than the threshold are claimed to be significant.
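The step-wise rule (3.6) simply sorts the statistics and finds the largest rejection set whose running average stays below $\alpha$. A minimal R sketch of this generic ranking-and-thresholding step (written for any significance statistic taking small values for significant cases; the function name is illustrative):

    # Step-wise procedure (3.6): rank hypotheses by `stat` (small = significant)
    # and reject the largest top set whose running average is at most alpha.
    rank_and_threshold <- function(stat, alpha = 0.05) {
      ord <- order(stat)
      running_avg <- cumsum(stat[ord]) / seq_along(stat)
      k <- max(c(0, which(running_avg <= alpha)))
      reject <- rep(FALSE, length(stat))
      if (k > 0) reject[ord[seq_len(k)]] <- TRUE
      reject
    }

Applied to the oracle statistics $T_i^*$ this reproduces (3.6); the same helper can be reused with the estimated statistics of Section 3.5.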

3.2. Approximating $T^*$ via screening

The oracle statistic $T_i^*$ is unknown and needs to be estimated in practice. However, standard methods do not work well for the bivariate model. For example, the popular EM algorithm usually requires the specification of a parametric form of the non-null distribution; this is often impractical in large-scale studies where little is known about the alternative. Moreover, existing estimators often suffer from low accuracy and convergence issues when signals are very sparse. To overcome the difficulties in estimation, we propose a new test statistic $T_{\tau,i}$ that only involves quantities that can be well estimated from data. The new statistic provides a good approximation to $T_i^*$ and guarantees the FDR control.

In (3.4), the null density $f_{10}$ is known by construction. The bivariate density $f(t_1, t_2)$ can be well estimated using a standard kernel method [Silverman (1986); Wand and Jones (1995)]. Hence we shall focus on the quantity $q^*(t_2)$. Suppose we are interested in counting how frequently a $T_{2i}$ from the null distribution (i.e. $\theta_{1i} = 0$) would fall into an interval in the neighborhood of $t_2$:
$$Q^*(t_2, h) = \#\{i: T_{2i} \in [t_2 - h/2,\, t_2 + h/2] \text{ and } \theta_{1i} = 0\}/m.$$
The quantity is relevant because $q^*(t_2) = \lim_{h\to 0} E\{Q^*(t_2, h)\}/h$. The counting task is difficult as we do not know the value of $\theta_{1i}$. Our idea is to first apply a screening method to select the nulls (i.e. $\theta_{1i} = 0$), and then construct an estimator based on the selected cases. Denote $P_i = 2\Phi(-|t_{1i}|)$ the two-sided p-value associated with $T_{1i} = t_{1i}$. When signals are sparse, most large p-values should come from the null distribution. Hence for a large $\tau$, say $\tau = 0.9$, we would reasonably predict that $\theta_{1i} = 0$ if $P_i > \tau$. It follows that we may do the counting based on those $T_{2i}$ with large p-values:
$$Q^\tau(t_2, h) = \frac{\#\{i: T_{2i} \in [t_2 - h/2,\, t_2 + h/2] \text{ and } P_i > \tau\}}{m(1-\tau)}. \quad (3.7)$$
The adjustment $(1-\tau)$ in the denominator comes from the fact that we have only utilized roughly $100(1-\tau)\%$ of the data when counting the frequency. Let $A_\tau = \{t_1: 2\Phi(-|t_1|) > \tau\}$. Using $Q^\tau$ to replace $Q^*$, a sensible approximation of $q^*(t_2)$ would be
$$q^\tau(t_2) = \lim_{h\to 0}\frac{E\{Q^\tau(t_2, h)\}}{h} = \frac{\int_{A_\tau} f(t_1, t_2)\,dt_1}{1-\tau}. \quad (3.8)$$
Intuitively, a large $\tau$ would yield a sample that is close to one generated from a pure null distribution and thus reduce the bias $q^\tau(t_2) - q^*(t_2)$. Our theory reveals that the bias is always positive (Proposition 2), and would decrease in $\tau$ (Proposition 4). However, a larger $\tau$ would increase the variability of our proposed estimator (as we have fewer samples to construct the estimator), affecting the testing procedure adversely. The bias-variance tradeoff is further discussed in Section 3.4. Substituting $q^\tau(t_2)$ in place of $q^*(t_2)$, we obtain the following approximation of $T_i^*$:
$$T_{\tau,i}(t_1, t_2) = \frac{q^\tau(t_2)\, f_{10}(t_1)}{f(t_1, t_2)}. \quad (3.9)$$
Some important properties of the approximation in (3.9) are summarized in the next proposition, which shows that $T_{\tau,i}$ always overestimates $T_i^*$. This implies that if we substitute $T_{\tau,i}$ in place of $T_i^*$ in (3.6), then $\hat Q^*(j)$ will be overestimated as well. Hence the approximation leads to fewer rejections and a conservative FDR level.

Proposition 2. (a) $T_i^*(t_1, t_2) \le T_{\tau,i}(t_1, t_2)$ for all $\tau$. (b) Let $\delta_\tau$ be the decision rule that substitutes $T_{\tau,i}$ in place of $T_i^*$ in (3.6). Then both the FDR and mFDR levels of $\delta_\tau$ are controlled below level $\alpha$.
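The screened count in (3.7) and its limiting approximation (3.8) can be computed directly once the pair $(T_{1i}, T_{2i})$ is available. A minimal R sketch (illustrative names; the window width h is a user choice):

    # Screening-based count (3.7): among tests whose two-sided p-value exceeds tau,
    # count the T2-values falling in a window around t2, with the 1/(1 - tau) adjustment.
    Q_tau <- function(T1, T2, t2, h, tau = 0.9) {
      pval <- 2 * pnorm(-abs(T1))
      keep <- pval > tau                      # predicted nulls (theta_1i = 0)
      sum(keep & abs(T2 - t2) <= h / 2) / (length(T1) * (1 - tau))
    }
    # For small h, Q_tau(T1, T2, t2, h) / h approximates q^tau(t2) in (3.8).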

3.3. Estimation of the test statistic

We now turn to the estimation of $T_{\tau,i}$. By our construction, the null density $f_{10}(t_1)$ is known. The bivariate density $f(t_1, t_2)$ can be estimated using a kernel method [Silverman (1986); Wand and Jones (1995)]:
$$\hat f(t_1, t_2) = m^{-1}\sum_{i=1}^{m} K_{h_1}(t_1 - t_{1i})\, K_{h_2}(t_2 - t_{2i}), \quad (3.10)$$
where $K(t)$ is a kernel function, $h_1$ and $h_2$ are the bandwidths, and $K_h(t) = h^{-1}K(t/h)$. To estimate $q^\tau(t_2)$, we first carry out a screening procedure to obtain the sample $\mathcal{T}(\tau) = \{i: P_i > \tau\}$, and then apply kernel smoothing to the selected observations:
$$\hat q^\tau(t_2) = \frac{\sum_{i\in\mathcal{T}(\tau)} K_{h_2}(t_2 - t_{2i})}{m(1-\tau)}. \quad (3.11)$$
The next proposition shows that $\hat q^\tau(\cdot)$ converges to $q^\tau(\cdot)$ in the $L_2$ norm.

Proposition 3. Consider $q^\tau$ and $\hat q^\tau$ respectively defined in (3.8) and (3.11). Assume that (i) $q^\tau(\cdot)$ is bounded and has continuous first and second derivatives; (ii) the kernel K is a positive, bounded and symmetric function satisfying $\int K(t)\,dt = 1$, $\int tK(t)\,dt = 0$ and $\int t^2K(t)\,dt < \infty$; and (iii) $f_2^{(2)}(t_2 \mid \tau) = \int_{t_1\in A_\tau} f_2^{(2)}(t_2 \mid t_1)\, f_1(t_1)\,dt_1$ is square integrable, where $f_2(t_2 \mid t_1)$ is the conditional density of $T_2$ given $T_1$ and the superscript (2) denotes the second derivative with respect to $t_2$. Then with the common choice of bandwidth $h \asymp m^{-1/6}$, we have
$$E\|\hat q^\tau - q^\tau\|^2 = E\int \{\hat q^\tau(x) - q^\tau(x)\}^2\,dx \to 0.$$
Combining the above results, we propose to estimate $T_\tau$ by the following statistic
$$\hat T_\tau(t_1, t_2) = \frac{\hat q^\tau(t_2)\, f_{10}(t_1)}{\hat f(t_1, t_2)} \wedge 1, \quad (3.12)$$
where $\hat q^\tau(t_2)$ and $\hat f(t_1, t_2)$ are respectively given in (3.11) and (3.10), and $x \wedge y = \min(x, y)$.

Remark 5. In our proposed estimator, the same bandwidth $h_2$ has been used for the kernels in both (3.10) and (3.11). Utilizing the same bandwidth across the numerator and denominator of (3.12) has no impact on the asymptotics, but is beneficial for increasing the stability of our estimator. Some practical guidelines for the implementation of CARS are provided in Section 5.1.

3.4. A refined estimator

This section develops a consistent estimator of $q^*(t_2)$. The proposed estimator is important for the optimality theory in Section 3.5. However, it is computationally intensive and has limited efficiency gain. In practice we still recommend the simple estimator (3.11). This section may be skipped by readers who are mainly interested in the methodology.

We state in the next proposition some theoretical properties of the approximation error $q^\tau(t_2) - q^*(t_2)$; these properties are helpful to motivate the new estimator and prove its consistency. Let the CDF of the p-value associated with $T_{1i}$ be $G(\tau) = (1-\pi_1)\tau + \pi_1 G_1(\tau)$, where $G_1$ is the alternative CDF. Denote g and $g_1$ the corresponding density functions.

Proposition 4. Consider $T_\tau$ defined in (3.9).
(a) Denote $B_q(\tau) = \int |q^\tau(t_2) - q^*(t_2)|\,dt_2$ the total approximation error. If $G_1(\cdot)$ is concave, then $B_q(\tau)$ decreases in $\tau$.
(b) If $\lim_{x\to 1} g_1(x) = 0$, then $\lim_{\tau\to 1} q^\tau(t_2) = q^*(t_2)$.

Remark 6. The assumption that $G_1(\cdot)$ is concave has been commonly used in the FDR literature (Storey, 2002; Genovese and Wasserman, 2002, 2004). A generalized version of the assumption is the monotone likelihood ratio condition (MLR, Sun and Cai, 2007). The assumption that $\lim_{x\to 1} g_1(x) = 0$ is also a typical condition (Genovese and Wasserman, 2004). It essentially says that the null cases are dominant (or the null distribution is pure) on the right of the p-value histogram.

It follows from Proposition 4 that a large $\tau$ is helpful to reduce the bias and that the bias converges to zero as $\tau \to 1$. However, a large $\tau$ would increase the variance of our estimator (3.11), which is constructed using the sample $\mathcal{T}(\tau) = \{i: P_i > \tau\}$. To address the bias-variance tradeoff, we propose to first obtain $\hat q^\tau$ for a range of $\tau$'s, say $\{\tau_1,\ldots,\tau_k\}$, and then use a smoothing method to obtain the limiting value of $\hat q^\tau$ as $\tau \to 1$. This approach aims to borrow strength from the entire sample to minimize the bias without blowing up the variance.

Specifically, let $\tau_0 < \tau_1 < \cdots < \tau_k$ be ordered and equally spaced points in the interval (0,1). Denote $\hat q^{\tau_j}(t_2)$ the estimates from (3.11), $j = 1,\ldots,k$. We propose to obtain the local linear kernel estimator $\hat q^*(t_2) \equiv \hat q\{\tau = 1; \hat q^{\tau_1}(t_2),\ldots,\hat q^{\tau_k}(t_2)\}$ as the height of the fit $\hat\beta_0$, where $(\hat\beta_0, \hat\beta_1)$ minimizes the weighted kernel least squares $\sum_{j=1}^{k}\{\hat q^{\tau_j} - \beta_0 - \beta_1\tau_j\}^2 K_{h_\tau}(\tau_j - \tau_k)$. For a given integer r, denote $\hat s_r = k^{-1}\sum_{j=1}^{k}(\tau_j - 1)^r K_{h_\tau}(\tau_j - \tau_k)$. It can be shown (e.g. Wand and Jones, 1995, pp. 119) that
$$\hat q^*(t_2) = \frac{k^{-1}\sum_{j=1}^{k}\{\hat s_2 - \hat s_1(\tau_j - \tau_k)\}\, K_{h_\tau}(\tau_j - \tau_k)\, \hat q^{\tau_j}(t_2)}{\hat s_2\hat s_0 - \hat s_1^2}. \quad (3.13)$$
The next proposition shows that $\hat q^*(t_2)$ is a consistent estimator of $q^*(t_2)$.

Proposition 5. Consider $\hat q^\tau$ and $\hat q^*$ that are respectively defined in (3.11) and (3.13). Denote $q^{\tau_k,(2)}(t_2) = (d/d\tau)^2 q^{\tau}(t_2)\big|_{\tau = \tau_k}$. Assume the following conditions hold: (i) $q^{\tau_k,(2)}(t_2)$ is square integrable; (ii) $K(\cdot)$ is symmetric about zero and is supported on $[-1,1]$; and (iii) the bandwidth $h_\tau$ is a sequence satisfying $h_\tau \to 0$ and $kh_\tau \to \infty$ as $k \to \infty$. Moreover, assume that Conditions (i) to (iii) in Proposition 3 and Condition (b) in Proposition 4 hold. We have
$$E\|\hat q^* - q^*\|^2 = E\int\{\hat q^*(x) - q^*(x)\}^2\,dx \to 0 \quad \text{as } (m,k)\to\infty. \quad (3.14)$$

3.5. The CARS procedure

The estimated statistics $\hat T_{\tau,i}$ will be used as a significance index to rank the relative importance of all hypotheses. Motivated by the stepwise algorithm (3.6), we propose the following covariate-assisted ranking and screening (CARS) procedure.

Procedure 1 (CARS procedure). Consider model (2.6) and the estimated statistics $\hat T_{\tau,i}$ in (3.12). Denote $\hat T_{\tau,(1)} \le \cdots \le \hat T_{\tau,(m)}$ the ordered statistics and $H_{(1)},\ldots,H_{(m)}$ the corresponding hypotheses. Let $k = \max\{j: j^{-1}\sum_{i=1}^{j}\hat T_{\tau,(i)} \le \alpha\}$. Then reject $H_{(1)},\ldots,H_{(k)}$.
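A compact R sketch of the whole data-driven procedure combines the kernel estimates (3.10)-(3.11), the plug-in statistic (3.12) and the step-wise rule of Procedure 1. It uses naive O(m^2) Gaussian product-kernel evaluations and simple normal-reference bandwidths, so it is illustrative only; the CARS package instead selects bandwidths via the np package and screens with the Lfdr rather than the p-values (see Section 5.1).

    # Minimal sketch of the data-driven CARS procedure (Procedure 1).
    cars_sketch <- function(T1, T2, alpha = 0.05, tau = 0.9) {
      m  <- length(T1)
      h1 <- 1.06 * sd(T1) * m^(-1/6)                 # simple bandwidth choices
      h2 <- 1.06 * sd(T2) * m^(-1/6)
      # (3.10): bivariate product-kernel density at the observed pairs (naive O(m^2))
      K1 <- outer(T1, T1, function(a, b) dnorm(a - b, sd = h1))
      K2 <- outer(T2, T2, function(a, b) dnorm(a - b, sd = h2))
      f_hat <- rowMeans(K1 * K2)
      # (3.11): kernel smoothing over the screened subset {P_i > tau}
      pval  <- 2 * pnorm(-abs(T1))
      keep  <- pval > tau
      q_hat <- rowSums(K2[, keep, drop = FALSE]) / (m * (1 - tau))
      # (3.12): CARS statistic, truncated at 1; the null density f_10 is N(0, 1)
      T_hat <- pmin(q_hat * dnorm(T1) / f_hat, 1)
      # Procedure 1: step-wise ranking and thresholding
      ord <- order(T_hat)
      k   <- max(c(0, which(cumsum(T_hat[ord]) / seq_len(m) <= alpha)))
      reject <- rep(FALSE, m)
      if (k > 0) reject[ord[seq_len(k)]] <- TRUE
      reject
    }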

To ensure good performance of the data-driven procedure, we require the following conditions on the estimated quantities:

(C1) $E\|\hat q^\tau - q^\tau\|^2 \to 0$.
(C1') $E\|\hat q^* - q^*\|^2 \to 0$.
(C2) $E\|\hat f - f\|^2 = E\left[\int\int\{\hat f(t_1,t_2) - f(t_1,t_2)\}^2\,dt_1\,dt_2\right] \to 0$.

Remark 7. Proposition 3 shows that (C1) is satisfied by the proposed estimator (3.11). Proposition 5 shows that (C1') is satisfied by the smoothing estimator (3.13) under stronger assumptions. Finally, (C2) is satisfied by the standard choice of bandwidths $h_{t_1} \asymp m^{-1/6}$, $h_{t_2} \asymp m^{-1/6}$; see, for example, page 111 in Wand and Jones (1995) for a proof.

The asymptotic properties of the CARS procedure are established by the next theorem.

Theorem 2 (Asymptotic validity and optimality of CARS).
(a) If Conditions (C1) and (C2) hold, then both the mFDR and FDR of the CARS procedure are controlled at level $\alpha + o(1)$.
(b) If Conditions (C1') and (C2) hold, and we substitute $\hat q^*$ in (3.13) in place of $\hat q^\tau$ in (3.12) to compute $\hat T_\tau$, then the FDR level of the CARS procedure is $\alpha + o(1)$. Moreover, denote $\mathrm{ETP}_{CARS}$ and $\mathrm{ETP}^*$ the ETP levels of CARS and the oracle procedure, respectively. Then we have $\mathrm{ETP}_{CARS}/\mathrm{ETP}^* = 1 + o(1)$.

3.6. The case with unknown variances and non-Gaussian errors

For two-sample tests with unknown and unequal variances, we can estimate $\sigma_{xi}^2$ and $\sigma_{yi}^2$ by $S_{xi}^2 = n_x^{-1}\sum_{j=1}^{n_x}(X_{ij} - \bar X_i)^2$ and $S_{yi}^2 = n_y^{-1}\sum_{j=1}^{n_y}(Y_{ij} - \bar Y_i)^2$, respectively. Let $\hat\kappa_i = (\gamma_y S_{xi}^2)/(\gamma_x S_{yi}^2)$ and $S_{pi}^2 = \gamma_y S_{xi}^2 + \gamma_x S_{yi}^2$. The following pair will be used to summarize the information in the sample:
$$(T_{1i}, T_{2i}) = \sqrt{\frac{n_x n_y}{n}}\left(\frac{\bar X_i - \bar Y_i}{S_{pi}},\; \frac{\bar X_i + \hat\kappa_i \bar Y_i}{\sqrt{\hat\kappa_i}\,S_{pi}}\right). \quad (3.15)$$
For the case with unknown but equal variances (e.g. $\sigma_{xi}^2 = \sigma_{yi}^2$), we modify (3.15) as follows. First, $\hat\kappa_i$ is replaced by $\kappa = \gamma_y/\gamma_x$. Second, $\sigma_{pi}^2$ is instead estimated by $S_{pi}^2 = \gamma_x S_{xi}^2 + \gamma_y S_{yi}^2$. Finally, $T_{1i}$ and $T_{2i}$ are plugged into (3.12) to compute the CARS statistic, which is further employed to implement Procedure 1.

$T_{1i}$ and $T_{2i}$ are not strictly independent when estimated variances are used. We conduct simulation studies in Section 5.3 to investigate the performance of this plug-in methodology and find that CARS controls the FDR at the nominal level with substantial power gain over competing methods. The empirical results are supported by the following proposition, which shows that $T_{1i}$ and $T_{2i}$ are asymptotically independent under the null.

Proposition 6. Consider model (2.1). Assume that the error terms (possibly non-Gaussian) of $X_{ij}$ and $Y_{ik}$ have symmetric distributions and finite fourth moments. Then $(T_{1i}, T_{2i})$ defined in (3.15) are asymptotically independent when $H_{i,0}: \mu_{xi} = \mu_{yi}$ is true.
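A short R sketch of the plug-in construction (3.15), mirroring the known-variance pair (2.2) but with sample variances substituted (the function name is illustrative; the equal-variance modification described above is not included):

    # Plug-in pair (3.15): estimate the variances and form the two statistics.
    make_pair_plugin <- function(X, Y) {
      n_x <- ncol(X); n_y <- ncol(Y); n <- n_x + n_y
      gamma_x <- n_x / n; gamma_y <- n_y / n
      xbar <- rowMeans(X); ybar <- rowMeans(Y)
      S2x <- rowMeans((X - xbar)^2)            # n_x^{-1} * sum_j (X_ij - Xbar_i)^2
      S2y <- rowMeans((Y - ybar)^2)
      kappa_hat <- (gamma_y * S2x) / (gamma_x * S2y)
      S_p <- sqrt(gamma_y * S2x + gamma_x * S2y)
      T1 <- sqrt(n_x * n_y / n) * (xbar - ybar) / S_p
      T2 <- sqrt(n_x * n_y / n) * (xbar + kappa_hat * ybar) / (sqrt(kappa_hat) * S_p)
      list(T1 = T1, T2 = T2)
    }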

The expression for the asymptotic variance-covariance matrix, which is given in Section A.6 of the Supplementary Material, reveals that the asymptotic independence holds for non-Gaussian errors as long as the error distributions are symmetric. Our simulation results in Section 5.3 confirm that unknown variances and non-Gaussian errors have almost no impact on the performance of CARS. Therefore the plug-in methods are recommended for practical applications. The case with skewed distributions requires further research, and a full theoretical justification of the CARS methodology in this setting is still an open question.

4. Extensions and Connections with Existing Work

This section considers the extension of our theory to a general bivariate model. The results in the general model provide a unified theoretical framework for understanding different testing strategies, which helps gain insights into the connections between existing methods. We substitute $(T_i, S_i)$ in place of $(T_{1i}, T_{2i})$ in this section. This change reflects a more flexible view of the auxiliary covariate: $S_i$ can be either continuous or discrete, from either internal or external data, and we do not explicitly estimate the joint density of $T_i$ and $S_i$ as done in previous sections. Sections 4.1 to 4.5 assume that $T_i$ follows a continuous distribution with a known density under the null; the case with an unknown null density is discussed in Section 4.6. We allow $S_i$ to be either continuous or categorical and hence eliminate the notation $\theta_{2i}$. (Previously $\theta_{2i}$ denoted the union support, which is only needed when the distribution of $T_{2i}$ has a point mass.) As a result, the subscript 1 in $\theta_{1i}$ is suppressed for notational convenience, where $\theta_i = 0/1$ stands for a null/non-null case.

4.1. A general bivariate model

Suppose $T_i$ and $S_i$ follow a joint distribution $F_i(t, s)$. The optimal (oracle) testing rule is given by the next theorem, which can be proven similarly to Theorem 1.

Theorem 3. Define the oracle statistic under the general model
$$T_G^*(t, s) = P(\theta_i = 0 \mid T_i = t, S_i = s). \quad (4.1)$$
Denote $Q_G^*(\lambda)$ the mFDR level of $\delta(T_G^*, \lambda)$, where $\delta(T_G^*, \lambda) = \{I(T_G^*(t_i, s_i) < \lambda): 1 \le i \le m\}$. Let $\lambda^* = \sup\{\lambda \in (0,1): Q_G^*(\lambda) \le \alpha\}$. Define the oracle mFDR procedure under the general model as $\delta_G^* = \delta(T_G^*, \lambda^*)$. Then $\delta_G^*$ is optimal in the sense that for any $\delta$ such that $\mathrm{mFDR}_\delta \le \alpha$, we always have $\mathrm{ETP}_\delta \le \mathrm{ETP}_{\delta_G^*}$.

The oracle procedure is only of theoretical interest because it is difficult to estimate $T_{G,i}^*$ in a general bivariate model. In some special cases, $T_G^*$ can be approximated well. For example, $T_\tau$ provides a good approximation to $T_G^*$ under the bivariate model (2.3) and the conditional independence assumption (2.5). When $S_i$ is categorical, the oracle procedure can also be approximated well. This important special case is discussed next.

4.2. Discrete case: multiple testing with groups

We now consider a special case where the auxiliary covariate $S_i$ is categorical. A concrete scenario is the multi-group random mixture model first introduced in Efron (2008); see also Cai and Sun (2009). The model is useful for handling large-scale testing problems where data are collected from heterogeneous sources. Correspondingly, the m hypotheses may

be divided into, say, K groups that exhibit different characteristics. Let $S_i$ denote the group membership. Assume that $S_i$ takes values in $\{1,\ldots,K\}$ with prior probabilities $\{\pi^1,\ldots,\pi^K\}$. Consider the following conditional distributions:
$$(T_i \mid S_i = k) \sim f_1^k = (1-\pi_1^k)f_{10}^k + \pi_1^k f_{11}^k, \quad (4.2)$$
for $k = 1,\ldots,K$, where $\pi_1^k$ is the proportion of non-null cases in group k, $f_{10}^k$ and $f_{11}^k$ are the null and non-null densities of $T_i$, and $f_1^k = (1-\pi_1^k)f_{10}^k + \pi_1^k f_{11}^k$ is the mixture density for all observations in group k. The model allows the conditional distributions in (4.2) to vary across groups; this is desirable in practice when groups are heterogeneous.

Remark 8. In Section A.10 in the supplement, we present a simple example to show that $T_i$ alone is not sufficient (as insightfully pointed out by a reviewer), whereas $(T_i, S_i)$ is sufficient. $S_i$ is ancillary in the sense that its value is determined by an external process independent from the main parameter. Contrary to the common intuition that $S_i$ is useless for inference, our analysis reveals that $S_i$ can be informative. The phenomenon is referred to as the ancillarity paradox because, to quote Lehmann (page 420, Lehmann and Casella, 2006), the distribution of the ancillary, which should not affect the estimation of the main parameter, has an enormous effect on the properties of the standard estimator. A related phenomenon in the estimation context was discussed by Foster and George (1996). See also the seminal work by Brown (1990) for a paradox in multiple regression.

The problem of multiple testing with groups and related problems have received substantial attention in the literature (Efron, 2008; Ferkingstad et al., 2008; Cai and Sun, 2009; Hu et al., 2010; Liu et al., 2016; Barber and Ramdas, 2017). It can be shown that under model (4.2), the oracle statistic (4.1) reduces to the conditional local false discovery rate (CLfdr, Cai and Sun 2009): $\mathrm{CLfdr}_i = (1-\pi_1^k)f_{10}^k(t_i)/f_1^k(t_i)$ for $S_i = k$, $k = 1,\ldots,K$. The CLfdr statistic can be accurately estimated from data when the number of tests is large in the separate groups. However, the CLfdr statistic cannot be well estimated when the number of groups is large, or when $S_i$ becomes a continuous variable. Important recent progress on exploiting grouping and hierarchical structures among hypotheses under more generic settings has been made in Liu et al. (2016), wherein an interesting decomposition of the oracle statistic (4.1) was derived. The decomposition provides substantial insights into how the auxiliary information (grouping effects $S_i$) and the primary statistics (individual effects $T_i$) interplay in simultaneous testing with side information.

The result on multiple testing with groups motivates an interesting strategy to approximate the oracle rule. For a continuous auxiliary covariate $S_i$, we can first discretize $S_i$, then divide the hypotheses into groups according to the discrete variable, and finally apply group-wise multiple testing procedures. This idea is closely related to the uncorrelated screening (US) method in Liu (2014); the connection is discussed next.

4.3. Discretization and uncorrelated screening

The idea in Liu (2014) involves discretizing a continuous $S_i$ as a binary variable. Define index sets $G_1 = \{1 \le i \le m: S_i > \lambda\}$ and $G_2 = \{1 \le i \le m: S_i \le \lambda\}$, where the tuning parameter $\lambda$ divides the $t_{1i}$'s into two groups: $T_1(G_1) = \{t_i: i \in G_1\}$ and $T_1(G_2) = \{t_i: i \in G_2\}$. The uncorrelated screening (US, Liu 2014) method operates in two steps.
First, the BH method (Benjamini and Hochberg, 1995) is applied at level α to the two groups separately, and then the rejected hypotheses from two groups are combined as the final

output. The tuning parameter $\lambda$ is chosen such that it yields the largest number of rejections (two groups combined). US is closely related to the separate analysis strategy proposed in Efron (2008). The key difference is that the groups are known a priori in Efron (2008), whereas the groups are chosen adaptively in Liu (2014). The main merit of US is that the screening statistic is constructed to be uncorrelated with the test statistic, which ensures that the selection bias issue can be avoided. Moreover, the divide-and-test strategy combines the results from both groups; this is different from conventional independent filtering approaches [Bourgon et al. (2010)], in which one group is completely filtered out.

We now compare different methods under a unified framework. Both US and CARS can be viewed as approximations of the oracle rule (4.1). The goal is to borrow information from the external covariate $S_i$ to improve the efficiency of simultaneous inference. US adopts the divide-and-test strategy and only models $S_i$ as a binary variable. It suffers from information loss in the discretization step. This sheds light on the superiority of CARS, for it fully utilizes the data by modeling $S_i$ as a continuous variable. The general framework suggests several directions in which US may be improved. First, the information in $S_i$ may be better exploited, e.g. by creating more groups. However, it remains unknown how to choose the optimal number of groups. Second, US tests the hypotheses at FDR level $\alpha$ for both groups. However, Cai and Sun (2009) showed that the choice of identical FDR levels across groups can be suboptimal. In order to maximize the overall power, different group-wise FDR levels should be chosen. However, no matter how smart a divide-and-test strategy may be, discretizing a continuous covariate would inevitably result in information loss and hence will be outperformed by CARS.

4.4. The pooling within strategy for information integration

In Philosophy and Principles of Data Analysis (Tukey, 1994), Tukey coined two witty terms for effective information integration strategies: borrowing strength and pooling within. The idea of borrowing strength, which was investigated extensively and systematically by researchers in both the simultaneous estimation and multiple testing fields, has led to a number of impactful theories and methodologies exemplified by the James-Stein estimator (James and Stein, 1961) and the local false discovery rate methodology (Efron et al., 2001). By contrast, the direction of pooling within has been less explored. Tukey described it as a two-step strategy that involves first gathering quantitative indications from within different parts of the data, and then pooling these indications into a single overall index (Tukey, 1994, pp. 278). Our work formalizes a theoretical framework for the pooling within idea in the context of two-sample multiple testing. We not only develop practical principles on the construction of multiple indications from within the data (i.e. independent and comparable pairs), but also derive an overall index (i.e. the oracle statistic) that optimally combines the evidence exhibited by both statistics. Our work differs in several ways from existing works on multiple testing with covariates (Ferkingstad et al., 2008; Zablocki et al., 2014; Scott et al., 2015).
First, the covariate in other works is collected externally from other data sources, whereas the auxiliary information in our work is gathered internally within the primary data set. Second, in other works it has been assumed that the null density would not be affected by the external covariate. However, the assumption should be scrutinized in practice as it may not always hold. Under our testing framework, the requirement of a fixed null density is formalized as the conditional independence between the primary and auxiliary statistics. The conditional independence is proposed as a principle for information extraction, and is automatically

fulfilled by our approach to constructing the auxiliary sequence. CARS makes several new methodological and theoretical contributions. First, under a decision-theoretic framework, the oracle CARS procedure is shown to be optimal for information pooling. Second, existing methodologies on testing with covariates are mostly developed under the Bayesian computational framework and lack theoretical justifications. By contrast, our data-driven CARS procedure is a non-parametric method and enjoys nice asymptotic properties. Such FDR theories, as far as we know, are new in the literature. Third, the screening approach employed by CARS reveals interesting connections between sparsity estimation and multiple testing with covariates, which is elaborated next.

4.5. Capturing sparsity information via screening

A celebrated finding in the FDR literature is that incorporating the estimated proportion of non-nulls [$\pi_1 = P(\theta_i = 1)$] can improve the power (Benjamini and Hochberg, 2000; Storey, 2002; Genovese and Wasserman, 2002). In light of $S_i$, the proportion becomes heterogeneous; hence it is desirable to utilize the conditional proportions $\pi_1(s) = P(\theta_i = 1 \mid S_i = s)$ to improve the power of existing methods (Zablocki et al., 2014; Scott et al., 2015). In a similar vein, earlier works on multiple testing with groups (or discrete $S_i$) reveal that varied sparsity levels across groups can be exploited to construct more powerful methods (Ferkingstad et al., 2008; Cai and Sun, 2009; Hu et al., 2010). Estimating $\pi_1(s)$ with a continuous covariate is a challenging problem. Most existing works (Zablocki et al., 2014; Scott et al., 2015) employ Bayesian computational algorithms that do not provide theoretical guarantees. Notable progress has been made by Boca and Leek (2017). However, their theory requires a correct specification of the underlying regression model, which cannot be checked in practice. Next we discuss how the screening idea in Sections 3.3 and 3.4 can be extended to derive a simple and elegant non-parametric estimator of $\pi_1(s)$.

Fig. 1 gives a graphical illustration of the proposed estimator. We generate $m = 10^5$ tests with $\bar X_i \sim N(0,1)$ and $\bar Y_i \sim 0.8N(0,1) + 0.2N(2,1)$; hence $T_i = (1/\sqrt{2})(\bar X_i - \bar Y_i)$ and $S_i = (1/\sqrt{2})(\bar X_i + \bar Y_i)$. Suppose we are interested in counting how many $S_i$ would fall into the interval $[t_2 - h, t_2 + h]$ with $t_2 = 2$ and $h = 0.3$. The counts are represented by vertical bars in Panel (a) for each p-value interval. As shown in Proposition 1, $T_i$ and $S_i$ are independent under the null [cf. Equation (2.5)]. Therefore we can see that the counts of $S_i$ are roughly uniformly distributed when the p-value of $T_i$ is large. Expanding the interval $[t_2 - h, t_2 + h]$ to the entire real line (which corresponds to discarding the information in $S_i$), we obtain the histogram of all p-values [Panel (b)].

We start with a description of a classical estimator [Schweder and Spjøtvoll (1982); Storey (2002)] for $\pi_1$; see Langaas et al. (2005) for an elaborated discussion of various extensions. Let $Q(\tau) = \#\{P_i > \tau\}$. Then by Fig. 1(b), the expected counts covered by the light grey bars to the right of the threshold $\tau$ can be approximated as $E\{Q(\tau)\} = m(1-\pi_1)(1-\tau)$. Setting the expected and actual counts equal, we obtain $\hat\pi_1^\tau = 1 - Q(\tau)/\{m(1-\tau)\}$. Next we consider the conditional proportion $\pi_1(s) = P(\theta_i = 1 \mid S_i = s)$. Assume that $\pi_1(s)$ and $f(s)$, the density of $S_i$, are constant in a small neighborhood $[s-h/2, s+h/2]$.
Then the expected counts of the p-values from the null distribution in the interval $[s-h/2, s+h/2]$ can be approximated by $Q^\tau(s,h) \approx \{1-\pi_1(s)\}f(s)h$. The other way of counting can be done using (3.7) in Section 3.2. In obtaining (3.7), we exploit the fact that the counts of $S_i$ are roughly uniformly distributed to the right of the threshold $\tau$. Taking the limit, we obtain $\pi_1(s) = 1 - \{f(s)\}^{-1}\lim_{h\to 0} Q^\tau(s,h)/h = 1 - q^\tau(s)/f(s)$.

Fig. 1. A graphical illustration of the smoothing estimator (4.3). (a) Counts of $T_{2i}$ in $[t_2-h, t_2+h]$, plotted against the p-value of $T_1$: the counts of $S_i$ are roughly uniformly distributed on the right; the bias decreases and the variability increases as $\tau$ increases. (b) Histogram of all p-values: similarly, the p-values are approximately uniformly distributed on the right.

Finally, utilizing the screening approach (3.11), we propose the following nonparametric smoothing estimator:
$$\hat\pi_1^\tau(s) = 1 - \frac{\sum_{i\in\mathcal{T}(\tau)} K_h(s - s_i)}{(1-\tau)\sum_{i=1}^{m} K_h(s - s_i)}. \quad (4.3)$$
Choosing the tuning parameter $\tau$ is an important issue but goes beyond the scope of the current work; see Storey (2002) and Langaas et al. (2005) for further discussions.

4.6. Heterogeneity, correlation and empirical null

Conventional FDR analyses treat all hypotheses exchangeably. However, the hypotheses become unequal in light of $S_i$, and it is desirable to incorporate, for example, the varied conditional proportions in a testing procedure to improve the efficiency. This section further discusses the case where the heterogeneity is reflected by disparate null densities. A key principle in our construction is that the primary and auxiliary statistics are conditionally independent under the null. However, in many applications where the auxiliary information is collected from external data, $S_i$ may be correlated with $T_i$. Then the FDR procedure may become invalid if $S_i$ is incorporated inappropriately. For example, if the grouping variable $S_i$ is correlated with the p-value, then applying the Benjamini-Hochberg (BH) procedure to the hypotheses in separate groups would be problematic because the null distributions of the p-values in some groups may no longer be uniform. A partial solution to resolve the issue is to estimate the empirical null distributions (Efron, 2004; Jin and Cai, 2007) for the different groups instead of using the theoretical null distribution directly. The theory and methodology in Efron (2008) and Cai and Sun (2009), which allow the use of varied empirical nulls across different groups, can be applied to control the FDR. However, as we previously mentioned, discretizing a continuous $S_i$ fails to fully utilize the auxiliary information. The estimation of the empirical null with a continuous $S_i$ is an interesting problem for future research. The nonparametric smoothing idea in (4.3) might be helpful but additional difficulties may arise. For example, Efron (2004) suggests using the central part of the z-value histogram to estimate the empirical null, wherein a key assumption is that the non-nulls are sparse. However, the sparsity assumption may not hold simultaneously for all $\pi_1(s)$, causing identifiability issues. The limitations of the current methodology and open questions will be discussed in Section 6.
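Returning to the estimator (4.3) derived in Section 4.5, a minimal R sketch of both the global Storey-type estimator and its covariate-conditional version is given below, with a Gaussian kernel and a normal-reference bandwidth as illustrative choices:

    # Global estimator: pi1_hat = 1 - #{P_i > tau} / {m (1 - tau)}
    pi1_global <- function(pval, tau = 0.9) {
      1 - sum(pval > tau) / (length(pval) * (1 - tau))
    }
    # Conditional proportion (4.3): kernel-weighted version at covariate value s0
    pi1_conditional <- function(s0, S, pval, tau = 0.9, h = bw.nrd(S)) {
      w_all  <- dnorm(s0 - S, sd = h)          # K_h(s0 - S_i), i = 1, ..., m
      w_null <- w_all[pval > tau]              # screened subset T(tau)
      1 - sum(w_null) / ((1 - tau) * sum(w_all))
    }

When the kernel weights are replaced by the constant 1, the conditional estimator reduces to the global one, which mirrors the relationship between (4.3) and the classical estimator.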

5. Numerical Results

This section investigates the numerical performance of CARS using both simulated and real data. We compare the oracle and data-driven CARS procedures (the latter denoted by DD) with existing methods, including the Benjamini-Hochberg procedure (BH, Benjamini and Hochberg, 1995), the adaptive z-value procedure (AZ, Sun and Cai 2007), and the uncorrelated screening procedure (US, Liu 2014). We first describe the implementation of CARS in Section 5.1. Sections 5.2 and 5.3 respectively consider (i) the case with known and unequal variances and (ii) the case with estimated variances and non-Gaussian errors. An application to supernova detection is discussed in Section 5.4. Additional numerical results, including the non-informative case (Section B.2), the completely informative case (Section B.3) and dependent tests (Section B.6), are provided in Section B of the Supplementary Material.

5.1. The implementation and R-package CARS

The R-package CARS has been developed to implement the proposed method. This section describes some practical guidelines and details for the implementation in our package. In estimating (3.11), the CARS package uses the Lfdr as the screening statistic (as opposed to the p-values), with the default cutoff of $\tau = 0.9$. A correction factor similar to $1-\tau$ is needed and can be easily computed from data. The related formulae and computational details are described in Section A.11 in the Supplementary Material. The p-values are used to describe the methodology because the p-value description is simpler and more intuitive, and can reveal important connections to classical ideas (Figure 1). There is no essential difference between the p-value and the Lfdr in the screening step. The main practical advantage of screening via the Lfdr is the stability of tuning for finite samples. The intuition is that the Lfdr statistic, which contains information about the sparsity and the non-null density, provides a testing rule that is more adaptive to the data. Section B.4 in the Supplementary Material shows that the choice of $\tau$ has relatively little effect on the performance of CARS when the Lfdr is used. In general a larger $\tau$ provides slightly better error control, which is consistent with our theory in Proposition 4. However, to avoid increased variability in the estimated quantities, we do not recommend very large $\tau$'s. As a rule of thumb, our practical recommendation is to take $\tau = 0.9$ to balance the interests of precise error control and stability of the methodology. This is the default choice in our package and has been used in all our simulations.

The R package np is used to choose the bandwidths $h_1$ and $h_2$ in equation (3.10). We have adopted two strategies to improve the performance. First, the bandwidths $h_1$ and $h_2$ are chosen based on the normal reference rule restricted to the samples with $\mathrm{Lfdr}_1(t_{1i}) < 0.5$. The restriction leads to more stable and informative bandwidths as this subset is a more relevant part of the sample for the multiple testing problem. Second, the same $h_2$ has been used in (3.11) to obtain the numerator of (3.12). This strategy is helpful for increasing the stability of the estimator [cf. Remark 5]. It is important to note that these strategies are just practical guidelines for finite samples. The asymptotic theories are not affected.

5.2. Simulation study I: known and unequal variances

Consider model (2.1). Denote $\mu_{i_1:i_2} = (\mu_{i_1},\ldots,\mu_{i_2})$ the vector of consecutive entries from $i_1$ to $i_2$ in a vector $\mu = (\mu_1,\ldots,\mu_m)$.
The two original samples under the two conditions are denoted by $\{x_1,\ldots,x_{n_x}\}$ and $\{y_1,\ldots,y_{n_y}\}$, respectively, with corresponding population

means $\mu_x$ and $\mu_y$. Let $\epsilon_{xij} \sim N(0,1)$ and $\epsilon_{yik} \sim N(0,2)$. We set $n_x = 50$ and $n_y = 60$. In all simulations, the number of tests is $m = 5000$ and the nominal FDR level is 0.05. The following settings are considered in our simulation study.

Setting 1: we set $\mu_{x,1:k} = 5/\sqrt{30}$, $\mu_{x,(k+1):(2k)} = 4/\sqrt{30}$, $\mu_{x,(2k+1):m} = 0$, $\mu_{y,1:k} = 3/\sqrt{30}$, $\mu_{y,(k+1):(2k)} = 4/\sqrt{30}$ and $\mu_{y,(2k+1):m} = 0$. Here k denotes the sparsity level: the proportion of locations with differential effects is $k/m$, and the proportion of nonzero locations is $(2k)/m$. We vary k from 100 to 1000 to investigate the effect of sparsity.

Setting 2: we use $k_1$ and $k_2$ to denote the number of nonzero locations and the number of locations with differential effects, respectively. Let $\mu_{x,1:k_2} = 5/\sqrt{30}$, $\mu_{x,(k_2+1):k_1} = 4/\sqrt{30}$, $\mu_{x,(k_1+1):m} = 0$, $\mu_{y,1:k_2} = 2/\sqrt{30}$, $\mu_{y,(k_2+1):k_1} = 4/\sqrt{30}$ and $\mu_{y,(k_1+1):m} = 0$. We fix $k_1 = 2000$ and vary $k_2$ from 100 to $k_1$. This setting investigates how the informativeness of the auxiliary covariate affects the performance of different methods. Note that as $k_2$ increases, the conditional probability $\pi_{1|1} = P(\theta_{1i} = 1 \mid \theta_{2i} = 1)$ also increases, and the auxiliary covariate becomes more informative.

Setting 3: we fix $k = 750$ and set $\mu_{x,1:k} = \mu_0/\sqrt{30}$, $\mu_{x,(k+1):(2k)} = 3/\sqrt{30}$, $\mu_{x,(2k+1):m} = 0$, $\mu_{y,1:k} = 2/\sqrt{30}$, $\mu_{y,(k+1):(2k)} = 3/\sqrt{30}$ and $\mu_{y,(2k+1):m} = 0$. To investigate the impact of the effect sizes, we vary $\mu_0$ from 3.5 to 5.

In Settings 1-3, we apply the different methods to simulated data and obtain the FDR and MDR levels by averaging over 500 replications. Finally, the actual FDR and ETP levels are plotted as functions of the varied parameter values and displayed in Figure 2. We can see that the CARS procedure is more powerful than conventional univariate methods such as BH and AZ, and is superior to US, which only partially utilizes the auxiliary information. A more detailed description of the simulation results is given below.

(a) All methods control the FDR at the nominal level 0.05 approximately. BH is slightly conservative and US is very conservative. CARS sometimes has an FDR level that is a bit high but acceptable. This is mainly due to the inaccuracy in estimating the bivariate density function.

(b) Univariate methods (BH and AZ) are improved by bivariate methods (US and CARS) in most settings. This shows that exploiting the auxiliary information is helpful. Note that in Setting 2, AZ and US have similar ETP levels. However, US has a much smaller FDR level. This shows the usefulness of the auxiliary covariate.

(c) US is uniformly dominated by CARS. This is because US only models $T_{2i}$ as a binary variable whereas CARS fully utilizes the information in $T_{2i}$.

(d) DD has performance similar to the oracle procedure in most settings. However, DD can be conservative in FDR control in some settings and hence has less power compared to the oracle procedure (cf. Setting 1, top row of Figure 2). This has been predicted by our theory (Proposition 2).

(e) From the results under Setting 1, we can see that the gain in efficiency (of bivariate methods over univariate methods) decreases as k increases. This shows that incorporating auxiliary information is more useful in the sparse case. However, our method still gains substantial power over conventional methods in the not-so-sparse case when $\pi_2$ is 40%. In general, the auxiliary variable is useful as long as there is a significant proportion of zeros, such as 50% or 60%.

[Figure 2 about here.]

Fig. 2. Two-sample tests with known variances. The FDR (left column) and average power (right column) are plotted against the non-null proportion (top row, Setting 1), the conditional proportion (middle row, Setting 2) and the effect size µ_0 (bottom row, Setting 3) for DD, BH, the oracle procedure, AZ and US.

(f). The results under Setting 2 show that the efficiency gain of CARS over the competing methods increases as k_2 increases. This makes sense because k_2 is proportional to the conditional probability π_{1|1}, which measures the usefulness/informativeness of the auxiliary covariate.

(g). The results in Setting 3 show that CARS dominates the other methods for all effect sizes. The efficiency gain is large when the signal strength is small to moderate. (Note that in Setting 3, a smaller µ_0 indicates a larger difference in effect sizes.)

5.3. Simulation study II: estimated variances and non-Gaussian errors

We consider simulation settings similar to those of the previous section, with two modifications. First, we substitute estimated variances in place of the known variances. Second, to investigate the performance of our method with non-Gaussian errors, we modify Setting 3 slightly by generating ǫ_{xij} and ǫ_{yik} from t-distributions with df = 4 and df = 5, respectively. The simulation results are summarized in Figure 3. The patterns of the curves are very similar to those observed in the previous simulation study, and our conclusions on the comparison of the different methods remain the same as before. In addition, we have the following observations:

(a). The results from the first two settings support our claim that the plug-in methodology works well in practice with estimated variances. The univariate methods also work well for FDR control.

(b). The results in Setting 3 show that the non-Gaussian errors have little impact on the performance of CARS.

5.4. Application in Supernova Detection

In this section, we apply the CARS procedure to the analysis of time-course satellite imaging data. Figure 4 shows the time-course g-band images of galaxy M101 collected by the Palomar Transient Factory (PTF) survey (Law et al., 2009). The images indicate the appearance of SN 2011fe, one of the brightest supernovae observed to date (Nugent et al., 2011). A major goal of our analysis is to detect discrepancies between images taken over time so that we can narrow down the potential locations of supernova explosions. More accurate measurements and further investigations can then be carried out on the narrowed subset of potential locations.

The satellite data are recorded and converted into grey-scale images (m = 428,796 pixels in total). Each pixel corresponds to a value ranging from 0 to 1 that indicates the intensity level of the influx from stars. We use image 1 as the baseline, whose grey-scale pixel values are subtracted from those in images 2 and 3. These differences are then vectorized as x = (x_1, ..., x_m) and y (representing "image 2 − image 1" and "image 3 − image 1", respectively). We plot the histograms and find that the null distributions of x and y are different. This can be explained by the time lapse between the images (the brightness of these g-band images changes gradually over time). The supernova data contain a significant amount of background noise in each image, and the average magnitude of the background noise varies considerably from image to image. To remove the image-specific background noise, we first estimate the empirical null distributions (Efron, 2004) based on the central part of the histograms.

[Figure 3 about here.]

Fig. 3. Two-sample tests with unknown variances and non-Gaussian errors. The FDR and average power are plotted against the non-null proportion (top row), the conditional proportion (middle row) and the effect size (bottom row) for DD, BH, the oracle procedure, AZ and US.

[Figure 4 about here.]

Fig. 4. From left to right, the images were taken on August 23, 24, and 25, 2011, respectively. The arrows indicate the explosion of SN 2011fe.

Table 1. Thresholds and total number of rejections (in parentheses) of the different testing procedures.

FDR level   BH procedure   Adaptive Z procedure   US (two thresholds)   CARS
0.01%       (4)            (5)                    3.24, 5.46 (22)       (35)
1%          (22)           (24)                   2.51, 4.87 (38)       (58)
5%          (64)           0.39 (69)              1.92, 4.42 (80)       0.26 (109)

The variances of the observations are assumed to be homoscedastic and are estimated using all pixels. For x and y, the estimated empirical null distributions are centred at 0.0028 and 0.0443, respectively. We then standardize the observations to obtain x^st and y^st, which are used in our subsequent analyses. We do not take the difference x − y directly, as this would create many false signals due to the varied magnitudes of the background noise. The standardized measurements x^st and y^st from the two images appear to be comparable, as most pairs (x_i^st, y_i^st) have similar values.

We formulate the problem as a multiple testing problem consisting of m = 428,796 two-sample tests with known variances: for the standardized observations x^st and y^st the variances are known to be 1. We then take t_{1i} = (x_i^st − y_i^st)/√2 and t_{2i} = (x_i^st + y_i^st)/√2, for 1 ≤ i ≤ m. We apply the BH, AZ, US and CARS procedures at FDR levels 0.01%, 1% and 5%. Figure 11 in the Supplementary Material shows the rejected pixels for each method under the different FDR levels. The estimated sparsity levels for t_1 and t_2 are 1.47% and 49.5%, respectively, and the estimated support at τ = 0.5 is correspondingly large. We report the thresholds of the different testing procedures in Table 1. We can see that more information is harvested from the data by using the auxiliary information: power gains are achieved by CARS at all FDR levels. In particular, at FDR level 0.01%, the supernova is missed by BH and AZ but detected by CARS. To further quantify the superiority of our procedure, we also report the total number of rejections in Table 1 (numbers in parentheses). CARS consistently detects more signals from the satellite images than the competing methods.
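The preprocessing described above can be summarized in a few lines of R. In the sketch below the image vectors x and y are replaced by synthetic placeholders and the null standard deviations are stand-in estimates, so the numbers do not reproduce the reported analysis; only the standardization and the construction of the primary/auxiliary pair (t_1, t_2) follow the text.

```r
# Sketch of the supernova preprocessing: standardize each image difference by its
# empirical null and form the primary (difference) and auxiliary (sum) statistics.
# x and y below are synthetic placeholders for the vectorized image differences.
set.seed(1)
m <- 428796
x <- rnorm(m, mean = 0.0028, sd = 0.05)   # placeholder for "image 2 - image 1"
y <- rnorm(m, mean = 0.0443, sd = 0.05)   # placeholder for "image 3 - image 1"

standardize <- function(v, mu0, sd0) (v - mu0) / sd0
x_st <- standardize(x, mu0 = 0.0028, sd0 = sd(x))   # sd(x) stands in for the null sd
y_st <- standardize(y, mu0 = 0.0443, sd0 = sd(y))

t1 <- (x_st - y_st) / sqrt(2)   # primary statistic: standardized difference
t2 <- (x_st + y_st) / sqrt(2)   # auxiliary statistic: standardized sum
```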

6. Discussion

Covariate-assisted multiple testing can be viewed as a special case of a much broader class of problems under the theoretical framework of integrative simultaneous inference, which covers a range of topics including weighted multiple testing (Benjamini and Hochberg, 1997; Basu et al., 2017), partial conjunction tests or set-wise inference (Benjamini and Heller, 2008; Sun and Wei, 2011; Du and Zhang, 2014), and replicability analysis (Heller et al., 2014; Heller and Yekutieli, 2014). A coherent theme in these problems is to combine information from multiple sources to make more informative decisions. Tukey's "pooling within" strategy provides a promising approach in such scenarios, where useful quantitative indications might be hidden in various parts of massive data sets. The current formulations and methodologies in integrative data analysis differ substantially, and a general theory and methodology for handling the various types of problems in a unified framework is yet to be developed. For instance, in weighted FDR analysis (Benjamini and Hochberg, 1997), external domain knowledge is incorporated as weights in modified FDR and power definitions to reflect the varied gains and losses in simultaneous decisions. By contrast, covariate-assisted multiple testing still utilizes unweighted FDR and power definitions. Moreover, in partial conjunction tests and replicability analysis, the summary statistics from different studies are of equal importance, which marks a key difference from covariate-assisted inference, where some statistics are primary while others are secondary. We conclude our discussion with a few more open issues.

Are there better ways to construct the auxiliary sequence? Our theory only shows that CARS is optimal when the pairs are given. How to construct an optimal pair from data is still an open question. For instance, in situations where the two means have opposite signs, the current weighted sum incurs information loss, because the sum of absolute values would better capture the sparsity information. However, the absolute-value approach would give rise to a correlated pair, which cannot be handled by the current testing framework. Moreover, for observations with non-Gaussian errors, the weighted sum needs further modification, as it would lead to an asymptotically correlated pair under the null when the error distribution is skewed (cf. the proof of Proposition 6 in Section A.6).

How to deal with multiple testing dependency? The CARS method cannot be applied to dependent tests, as it assumes that the T_i are independent. Our simulation studies show that CARS controls the FDR under weak dependence. However, this finding is based on very limited empirical studies and lacks theoretical support. An important direction is to develop new theory and methodology for the dependent case.

How to construct the auxiliary sequence in more general settings? This article focuses on two-sample tests. It would be of interest to extend the methodology to simultaneous ANOVA tests. Moreover, CARS provides a useful strategy for extracting the sparsity structure from data. There are other important structures in the data, such as heteroscedasticity, hierarchy and dependency, which may also be captured by an auxiliary sequence. How to extract and incorporate such structural information effectively to improve the power of a testing procedure remains an open question.

How to summarize the auxiliary information in high-dimensional settings?
The proposed CARS methodology requires the joint modeling of the primary and auxiliary

statistics, which cannot handle many covariate sequences, because the joint density estimator would suffer greatly from the high dimensionality. A fundamental issue is to develop new principles for information pooling, or optimal dimension reduction, in multiple testing with high-dimensional covariates.

How to make inference with multiple sequences? In partial conjunction tests and replicability analysis, an important feature is that the means (of the summary statistics) from the separate studies are individually sparse. We expect that similar strategies for extracting and sharing sparsity information among multiple sequences would improve the accuracy of simultaneous inference. However, as opposed to covariate-assisted inference, where there is a single sequence of primary statistics, in partial conjunction tests and replicability analysis all sequences are of equal importance, which poses new challenges for problem formulation and methodological development.

7. Proofs of Main Theorems

This section proves the main theorems. The proofs of the other propositions are provided in the Supplementary Material.

7.1. Proof of Theorem 1

Proof. We first show that the two expressions of T_i in (3.4) are equivalent. Recall q^*(t_2) = (1 − π_1) f(t_2 | θ_{1i} = 0). Applying Bayes' theorem and using the conditional independence between T_{1i} and T_{2i} under the null θ_{1i} = 0 (Proposition 1), we obtain

T_i(t_1, t_2) = P(θ_{1i} = 0) f(t_1, t_2 | θ_{1i} = 0) / f(t_1, t_2) = q^*(t_2) f_{10}(t_1) / f(t_1, t_2).

Part (a). Let Q^*(t) = α_t. We first show that α_t < t. According to the mfdr definition,

E_{(T_1,T_2)} { Σ_{i=1}^m (T_i − α_t) I(T_i < t) } = 0,   (7.1)

where the subscript (T_1, T_2) indicates that the expectation is taken over the joint distribution of (T_1, T_2). The above equation implies that α_t < t; otherwise all terms in the summation on its left-hand side would be either zero or negative.

Next we show that Q^*(t) is monotone in t. Let Q^*(t_j) = α_j for j = 1, 2. We only need to show that if t_1 < t_2, then α_1 ≤ α_2. We argue by contradiction. If α_1 > α_2, then

(T_i − α_2) I(T_i < t_2) = (T_i − α_2) I(T_i < t_1) + (T_i − α_2) I(t_1 ≤ T_i < t_2)
                          ≥ (T_i − α_1) I(T_i < t_1) + (α_1 − α_2) I(T_i < t_1) + (T_i − α_1) I(t_1 ≤ T_i < t_2).

Taking expectations on both sides and summing over all i, we claim that

E_{T_1,T_2} { Σ_{i=1}^m (T_i − α_2) I(T_i < t_2) } > 0.   (7.2)

The above inequality holds since (i) E_{T_1,T_2}{ Σ_{i=1}^m (T_i − α_1) I(T_i < t_1) } = 0, (ii) E_{T_1,T_2}{ Σ_{i=1}^m (α_1 − α_2) I(T_i < t_1) } > 0, and (iii) E_{T_1,T_2}{ Σ_{i=1}^m (T_i − α_1) I(t_1 ≤ T_i < t_2) } > 0, which follow respectively from (7.1), the assumption that α_1 > α_2, and the fact that α_1 < t_1. However, (7.2) contradicts the definition of α_2, which implies that E_{T_1,T_2}{ Σ_{i=1}^m (T_i − α_2) I(T_i < t_2) } = 0. Hence we must have α_1 ≤ α_2.

Part (b). The oracle threshold is defined as t^* = sup{ t ∈ (0,1) : Q^*(t) ≤ α }. We want to show that at t^* the mfdr level is attained precisely. Let ᾱ = Q^*(1). Part (a) shows that

25 ARS: ovariate-assisted Two-Sample Inference 25 the continuous function Q (t) is non-decreasing. Then we always have Q (t ) = α if α < ᾱ. Define δ = {I(T i < t ) : 1 i m}. Let δ = (δ 1,,δ m ) be an arbitrary decision rule such that mfdr(δ ) α. It follows that E T1,T 2 { m i=1 (Ti α)δ i } = 0 and ET1,T 2 { m i=1 (Ti α)δ i } 0. (7.3) ombining the two results in (7.3), we conclude that E T1,T 2 { m i=1 (δi δ i )(T i α) } 0. (7.4) Next, consider a monotonic transformation of the oracle decision rule δ i = I(T i < t ) via f(x) = (x α)/(1 x) (note that the ( derivative f(x) ) = (1 α)/(1 x) 2 > 0). The oracle decision rule is equivalent to δ i T i = I α < λ 1 T i, where λ = t α 1 t. A useful fact is that α < t < 1. Hence λ > 0. Note that (i) (T α) λ i (1 T) i < 0 if δ i > δ, i and (ii) (T α) λ i (1 T) i > 0 if δ i < δ. i ombining (i) and (ii), we conclude that the following inequality holds for all i: (δ i δ ) { i (T i α) λ (1 T) } i 0. Summing over i and taking expectations, we have [ m { E T1,T 2 (δ i δ ) i (T i α) λ (1 T)} ] i 0. (7.5) i=1 ombining (7.4) & (7.5), we have { m } { m } λ E T1,T 2 (δ i δ )(1 T i ) i E T1,T 2 (δ i δ )(T i i α) 0. i=1 Finally, note that λ > 0 and that the ETP of a decision rule δ = (δ 1,,δ m) is given by E T1,T 2 { m i=1 δi(1 Ti ) }, we conclude that ETP(δ ) ETP(δ ). i= Proof of Theorem 2 Proof. We first provide a summary of useful notations. Q τ (t) = m 1 m i=1 Tτ,i I(Tτ,i < t). This quantity provides a conservative FDR estimate of the decision rule { I(T τ,i < t) : 1 i m}. ˆQ τ (t) = m 1 m τ,i replaced by ˆT i=1. τ,i τ,i ˆT I(ˆT < t). This is the data-driven version of Qτ (t) with T τ,i being Q τ (t) = E{T τ I(T τ < t)} denotes the limiting value of Q τ (t), where T τ is a generic member from the sample {T i : 1 i m}. t τ = sup{t (0,1) : Q τ (t) α}. Part (a). In Proposition 5, we show that δ τ is conservative in mfdr control. To establish the desired property in mfdr control, we only need to show that mfdr(δ DD) = mfdr(δ τ )+o(1). Define a continuous version of Q τ (t) as follows. For T τ,(k) < t Tτ,(k+1), let whereq τ k = Q τ ( T τ,(k) Q τ (t) = t T τ,(k) T τ,(k+1) T τ,(k) Q τ k + Tτ,(k+1) t T τ,(k+1) T τ,(k) Q τ k+1, (7.6) ). ItiseasytoseethatQ τ (t)iscontinuousandmonotone. Hencetheinverse ofq τ (t), denotedq τ, 1, is well defined. Moreover, Q τ, 1 is continuous and monotone. We can similarly define a continuous version of ˆQ τ (t), denoted by ˆQ τ (t). ˆQτ (t) is continuous and monotone;

26 26 ai & Sun & Wang so does its inverse [ and δ τ DD = I {ˆTτ,i ˆQ τ, 1 ( ). By construction, we have δ τ = [ I { T } ] τ,i : 1 i m. We will show that τ, 1 ˆQ (α) (i) Q τ, 1 (α) p t τ and (ii) Qτ, 1 (α) } : 1 i m ] ˆQ τ, 1 (α) p t τ. (7.7) To show (i), note that the continuity of Q τ, 1 ( ) implies that for any ǫ > 0, we can find η > 0 such that Q τ, 1 (α) Q τ, 1 {Q τ (t τ )} < ǫ if Q τ (t τ ) α < η. Hence P( Q τ (t τ ) α > η) P [ Q τ, 1 (α) Q τ, 1 {Q τ (t τ )} > ǫ ]. Next, by the WLLN Q τ (t) p Q τ (t). Note that Q τ (t τ ) = α, we have P( Q τ (t τ ) α > η) 0. By Markov inequality, we conclude that Q τ, 1 (α) p Q τ, 1 {Q τ (t τ )} = t τ. Next we show (ii). By inspecting the proof of (i), we only need to show that ˆQ τ (t) p Q τ (t). Denote a variable without index i (e.g. ˆTτ and T) τ as a generic member from the sample. It follows from ondition (2) and the continuous mapping theorem that ˆT τ p T. τ Note that both T τ and ˆT τ are bounded above by 1. It follows that E(ˆT τ T) τ 2 0. Let U i = T τ,i I(Tτ,i τ,i τ,i < t) and Ûi = ˆT I(ˆT < t). We will show that E(Ûi Ui)2 = o(1). To see this, consider the following decomposition (Ûi Ui)2 = (ˆT τ T τ ) 2 I(ˆT τ t,t τ t)+(ˆt τ ) 2 I(ˆT τ t,t τ > t) +(T τ ) 2 I(ˆT τ > t,t τ t) = I +II +III. The first term I = o(1) because E(ˆT T τ ) τ 2 0. Let η > 0. Note that T τ is continuous and that ˆT τ p T, τ we have ( ) P(ˆT τ t,t τ > t) P{T τ (t,t+η)}+p ˆT τ T τ > η 0. Since ˆT τ is bounded, we conclude that the second term II = o(1). Similarly we can show that III = o(1). Therefore E(Ûi Ui)2 = o(1). Next we show that ˆQ τ (t) p Q τ (t). Note that Q τ (t) p Q τ (t), we only need to show that ˆQ τ (t) p Q τ (t). The dependence among Ûi in the expression ˆQ τ (t) = m 1 iûi creates some complications. The idea is to apply some standard techniques for the limit of triangular arrays that do not require independence between variables. onsider S n = m i=1 (Ûi Ui). Then E(Sn) = m{e(ûi) E(Ui)}. Applying standard inequalities such as auchy-schwartz, we have E(Ûi U i)(ûj Uj) = o(1). It follows that m 2 var(s n) m 1 {(Ûi } E(Ûi Ui)2 +(1+o(1))E Ui)(Ûj Uj) = o(1). Therefore E{S n E(S n)/n} 2 0. Applying hebyshev s inequality, we obtain m 1 {S n E(S n)} = ˆQ τ (t) Q τ (t) p 0. Therefore ˆQ τ (t) p Q τ (t). By definition, ˆQ τ (t) ˆQ τ (t) m 1. We claim that ˆQ τ (t) p Q τ (t) and (ii) follows. According to (i) and (ii) in (7.7), ˆQ τ, 1 (α) = Q τ, 1 (α)+o P (1). The mfdr levels of the testing procedures are: ( mfdr(δ τ ) = PH 0 T τ,i < Qτ, 1 (α) ) (ˆTτ,i ) τ, 1 P P ( T τ,i < Qτ, 1 (α) ), and mfdr(δ H0 < ˆQ (α) dd ) = (ˆTτ,i ). τ, 1 P < ˆQ (α) TheoperationofourtestingprocedureimpliesthatQ τ, 1 (α) α. Itfollows thatp ( T τ,i is bounded away from zero. We conclude that mfdr(δ dd ) = mfdr(δ τ )+o(1). < Qτ, 1 (α) )

27 ARS: ovariate-assisted Two-Sample Inference 27 The result on mfdr control can be extended to FDR control. The next proposition, which is proved in the Supplementary Material, first gives sufficient conditions under which the mfdr and FDR definitions are asymptotically equivalent, and then verifies that these conditions are fulfilled by the proposed ARS procedure δ τ dd. It follows from the proposition that ARS controls the FDR at level α+o(1). Proposition 7. (a) onsider a general decision rule δ. Let Y = m 1 m i=1δi. Then mfdr(δ) = FDR(δ)+o(1) if (i) E(Y) η for some η > 0, and (ii) Var(Y) = o(1). (b) onditions (i) and (ii) are fulfilled by the ARS procedure δ τ dd. Part (b). The ARS procedure utilizes ˆq, and the corresponding test statistic is. It follows from onditions (1 ) and (2), and the continuous mapping theorem that ˆT p T. Denote Q (t) the oracle mfdr function and t the oracle threshold. Then Q (t) = PH 0(T < t) P(T < t) ˆT,i ˆT,i = E{T I(T < t)}, t = sup{t (0,1) : Q (t) α}. Define ˆQ (t) = m 1 m,i i=1 I(ˆT < t). Similar to (7.6), we define a continuous version of ˆQ (t) and denote it by ˆQ (t). It can be shown that ˆQ (t) is [ continuous and monotone; so does its inverse ˆQ, 1 {ˆT,i } ] (t). The ARS procedure is given by δ, 1 DD = I ˆQ (α) : 1 i m. Following the steps in Part (a) we can show that ˆQ (t) p Q (t), ˆQ, 1 (t) p t. (7.8) The operation of ARS implies that Q, 1 (α) α (thus the denominator of the mfdr is bounded away from zero). Hence mfdr(δ DD) = Q (t )+o(1) = α+o(1). Next, we consider the ETP. It follows from ˆT p T and (7.8) that } ETP(δ, 1 P DD) ETP(δ = H1 {ˆT < ˆQ (α) = 1+o(1). ) P H1 (T < t ) Acknowledgments We thank the Editor, referees and RS for the thorough and useful comments which have greatly helped to improve the presentation of the paper. In particular, we are grateful to several excellent suggestions from the referees that have inspired our discussions on the conditional independence assumption, Tukey s procedures for combination, Brown s ancilarity paradox and the robustness of ARS when pooling non-informative auxiliary data. W. Sun would like to thank Ms. Pallavi Basu from Tel Aviv University for helpful suggestions on theory. Tony ai was supported in part by NSF Grants DMS , DMS and DMS , and NIH Grant R01 A W. Sun was supported in part by NSF Grants DMS-AREER and DMS References Barber, RF. and A. Ramdas (2017+). The p-filter: multi-layer fdr control for grouped hypotheses. Journal of the Royal Statistical Society: Series B (Statistical Methodology) to appear. Basu, P., T. T. ai, K. Das, and W. Sun (2017+). Weighted false discovery control in large-scale multiple testing. Journal of the American Statistical Association. To appear.

28 28 ai & Sun & Wang Benjamini, Y. and R. Heller (2008). Screening for partial conjunction hypotheses. Biometrics 64: Benjamini, Y. and Y. Hochberg (1995). ontrolling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Statist. Soc. B 57, Benjamini, Y. and Y. Hochberg (1997). Multiple hypotheses testing with weights. Scandinavian Journal of Statistics 24: Benjamini, Y. and Y. Hochberg (2000). On the adaptive control of the false discovery rate in multiple testing with independent statistics. Journal of Educational and Behavioral Statistics 25: Boca, S. M. and J. T. Leek (2017). A regression framework for the proportion of true null hypotheses. Preprint. biorxiv: Bourgon, R., R. Gentleman, and W. Huber (2010). Independent filtering increases detection power for high-throughput experiments. Proceedings of the National Academy of Sciences 107(21), Brown, L. D. (1990). An ancillarity paradox which appears in multiple linear regression. The Annals of Statistics 18(2), ai, T. T. and J. Jin (2010). Optimal rates of convergence for estimating the null density and proportion of non-null effects in large-scale multiple testing. Ann. Statist. 38, ai, T. T. and W. Sun (2009). Simultaneous testing of grouped hypotheses: Finding needles in multiple haystacks. J. Amer. Statist. Assoc. 104, alvano, S. E., W. Xiao, D. R. Richards, R. M. Felciano, H. V. Baker, R. J. ho, R. O. hen, B. H. Brownstein, J. P. obb, S. K. Tschoeke, et al. (2005). A network-based analysis of systemic inflammation in humans. Nature 437(7061), Du, L. and. Zhang (2014). Single-index modulated multiple testing. The Annals of Statistics 42(4), Efron, B. (2004). Large-scale simultaneous hypothesis testing: The choice of a null hypothesis. Journal of the American Statistical Association 99(465), Efron, B. (2007). Size, power and false discovery rates. Ann. Statist. 35, Efron, B. (2008). Simultaneous inference: When should hypothesis testing problems be combined? Ann. Appl. Stat. 2, Efron, B., R. Tibshirani, J. D. Storey, and V. Tusher (2001). Empirical Bayes analysis of a microarray experiment. J. Amer. Statist. Assoc. 96, Ferkingstad, E., A. Frigessi, H. Rue, G. Thorleifsson, and A. Kong (2008). Unsupervised empirical bayesian multiple testing with external covariates. The Annals of Applied Statistics 2, Foster, D. P. and E. I. George (1996). A simple ancillarity paradox. Scandinavian journal of statistics 23(2), Genovese,. and L. Wasserman (2002). Operating characteristics and extensions of the false discovery rate procedure. J. R. Stat. Soc. B 64, Genovese,. and L. Wasserman (2004). A stochastic process approach to false discovery control. Ann. Statist. 32:

29 ARS: ovariate-assisted Two-Sample Inference 29 Haupt, J., R. M. astro, and R. Nowak (2011). Distilled Sensing: Adaptive Sampling for Sparse Detection and Estimation. Information Theory, IEEE Transactions on 57(9), Heller, R., M. Bogomolov, and Y. Benjamini (2014). Deciding whether follow-up studies have replicated findings in a preliminary large-scale omics study. Proceedings of the National Academy of Sciences 111(46), Heller, R. and D. Yekutieli (2014). Replicability analysis for genome-wide association studies. The Annals of Applied Statistics 8(1), Hu, J. X., H. Zhao, and H. H. Zhou (2010). False discovery rate control with groups. Journal of the American Statistical Association 105(491), James, W. and. Stein (1961). Estimation with quadratic loss. In Proceedings of the fourth Berkeley symposium on mathematical statistics and probability, Volume 1, pp Jin, J. and T. T. ai (2007). Estimating the null and the proportional of nonnull effects in largescale multiple comparisons. J. Amer. Statist. Assoc. 102, Langaas, M., B. H. Lindqvist, and E. Ferkingstad (2005). Estimating the proportion of true null hypotheses, with application to dna microarray data. J. Roy. Statist. Soc. B 67, Law, N. M., S. R. Kulkarni, R. G. Dekany, E. O. Ofek, R. M. Quimby, P. E. Nugent, J. Surace,.. Grillmair, J. S. Bloom, M. M. Kasliwal, et al. (2009). The palomar transient factory: system overview, performance, and first results. Publications of the Astronomical Society of the Pacific 121(886), Lehmann, E. L. and G. asella (2006). Theory of point estimation. Springer Science & Business Media. Liu, W. (2014). Incorporation of sparsity information in large-scale multiple two-sample t tests. arxiv preprint arxiv: Liu, Y., S. K. Sarkar, and Z. Zhao (2016). A new approach to multiple testing of grouped hypotheses. Journal of Statistical Planning and Inference 179, Meinshausen, N. and J. Rice (2006). Estimating the proportion of false null hypotheses among a large number of independently tested hypotheses. Ann. Statist. 34, Nugent, P. E., M. Sullivan, S. B. enko, R.. Thomas, D. Kasen, D. A. Howell, D. Bersier, J. S. Bloom, S. R. Kulkarni, M. T. Kandrashoff, A. V. Filippenko, J. M. Silverman, G. W. Marcy, A. W. Howard, H. T. Isaacson, K. Maguire, N. Suzuki, J. E. Tarlton, Y.-. Pan, L. Bildsten, B. J. Fulton, J. T. Parrent, D. Sand, P. Podsiadlowski, F. B. Bianco, B. Dilday, M. L. Graham, J. Lyman, P. James, M. M. Kasliwal, N. M. Law, R. M. Quimby, I. M. Hook, E. S. Walker, P. Mazzali, E. Pian, E. O. Ofek, A. Gal-Yam, and D. Poznanski (2011, 12). Supernova sn 2011fe from an exploding carbon-oxygen white dwarf star. Nature 480(7377), Reiner-Benaim, A., D. Yekutieli, N. E. Letwin, G. I. Elmer, N. H. Lee, N. Kafkafi, and Y. Benjamini (2007). Associating quantitative behavioral traits with gene expression in the brain: searching for diamonds in the hay. Bioinformatics 23(17), Rubin, D., S. Dudoit, and M. Van der Laan (2006). A method to increase the power of multiple testing procedures through sample splitting. Statistical Applications in Genetics and Molecular Biology 5(1), Article 19. Sarkar, S. K. (2002). Some results on false discovery rate in stepwise multiple testing procedures. Ann. Statist. 30,

30 30 ai & Sun & Wang Schweder, T. and E. Spjøtvoll (1982). Plots of p-values to evaluate many tests simultaneously. Biometrika 69: Scott, J. G., R.. Kelly, M. A. Smith, P. Zhou, and R. E. Kass (2015). False discovery rate regression: an application to neural synchrony detection in primary visual cortex. Journal of the American Statistical Association 110(510), Silverman, B. W. (1986). Density estimation for statistics and data analysis, R press. Wand, M. and M. Jones (1995). Kernel smoothing, hapman & Hall, London. Skol, A. D., L. J. Scott, G. R. Abecasis, and M. Boehnke (2006). Joint analysis is more efficient than replication-based analysis for two-stage genome-wide association studies. Nat. Genet. 38, Storey, J. D. (2002). A direct approach to false discovery rates. J. R. Stat. Soc. B 64, Sun, W. and T. T. ai (2007). Oracle and adaptive compound decision rules for false discovery rate control. J. Amer. Statist. Assoc. 102, Sun, W. and Z. Wei (2011). Large-scale multiple testing for pattern identification, with applications to time-course microarray experiments. J. Amer. Statist. Assoc. 106, Taylor, J., R. Tibshirani, and B. Efron (2005). The miss rate for the analysis of gene expression data. Biostatistics 6(1), Tukey, J. W. (1994). The collected works of John W. Tukey, Volume 3. Taylor & Francis. Wasserman, L. and K. Roeder (2009). High-dimensional variable selection. Ann. Statist. 37, Zehetmayer, S., P. Bauer, and M. Posch (2005). Two-stage designs for experiments with a large number of hypotheses. Bioinformatics 21(19), Zehetmayer, S., P. Bauer, and M. Posch (2008). Optimized multi-stage designs controlling the false discovery or the family-wise error rate. Stat. Med. 27(21), Zablocki, R. W., A. J. Schork, R. A. Levine, O. A. Andreassen, A. M. Dale, and W. K. Thompson (2014). ovariate-modulated local false discovery rate for genome-wide association studies. Bioinformatics 30(15),

31 ARS: ovariate-assisted Two-Sample Inference 1 Supplementary Material for ARS: ovariate Assisted Ranking and Screening for Large-Scale Two-Sample Inference This supplement contains the proofs of other theoretical results (Section A) and additional numerical results (Section B). A. Proofs of Other Results A.1. Proof of Proposition 1 Proof. Let A x = {x : x 0}. According to the definition of θ ji and the conditional independence assumption (2.4), we have f(t 1i,t 2i θ 1i = 0,θ 2i = 0) = f(t 1i,t 2i µ xi = 0,µ yi = 0) = f(t 1i µ xi = 0,µ yi = 0)f(t 2i µ xi = 0,µ yi = 0) = f(t 1i θ 1i = 0,µ xi = µ yi = 0)f(t 2i θ 2i = 0,µ xi = µ yi = 0) = f(t 1i θ 1i = 0)f(t 2i θ 2i = 0). The last equality holds because given θ ji = 0, t ji are just linear combinations of random errors ǫ xi and ǫ yi, which are independent of µ xi and µ yi, j = 1,2. Denote G µx ( ) the distribution function of µ xi. Then f(t 1i,t 2i θ 1i = 0,θ 2i = 1) = f(t 1i,t 2i µ xi = µ yi = x)dg µx (x) A x = f(t 1i θ 1i = 0,µ xi = µ yi = x)f(t 2i µ xi = µ yi = x)dg µx (x) A x = f(t 1i θ 1i = 0) f(t 2i θ 1i = 0,µ xi = µ yi = x)dg µx (x) A x = f(t 1i θ 1i = 0)f(t 2i θ 1i = 0,θ 2i = 1). For the third equality, note that t 1i are independent of both µ xi and µ yi given θ 1i = 0, whereas t 2i are correlated with µ xi and µ yi given that θ 1i = 0. Finally, f(t 1i,t 2i θ 1i = 0) = f(t 1i,t 2i θ 1i = 0,θ 2i = 0)P(θ 2i = 0 θ 1i = 0) +f(t 1i,t 2i θ 1i = 0,θ 2i = 1)P(θ 2i = 1 θ 1i = 0) = f(t 1i θ 1i = 0){f(t 2i θ 1i = 0,θ 2i = 0)P(θ 2i = 0 θ 1i = 0) +f(t 2i θ 1i = 0,θ 2i = 1)P(θ 2i = 1 θ 1i = 0)} = f(t 1i θ 1i = 0)f(t 2i θ 1i = 0). Hence T 1i and T 2i are independent under the null hypothesis θ 1i = 0. A.2. Proof of Proposition 2 Proof. Part (a). We only need to show that q τ (t 2 ) q (t 2 ) for all τ. We first split the joint density into two parts and then use the independence between T 1i and T 2i under

32 2 ai & Sun & Wang the null: q τ (t 2 ) = (1 τ) 1 t 1 A τ {f(t 1,t 2 θ 1i = 0)P(θ 1i = 0)+f(t 1,t 2 θ 1i = 1)P(θ 1i = 1)}dt 1 (1 τ) 1 t 1 A τ q (t 2 )f 10 (t 1 )dt 1 = q (t 2 ). The last equality holds because t 1 A τ f 10 (t 1 )dt 1 = 1 τ, which cancels with (1 τ) 1. Part (b). Let R be the index set of hypotheses rejected by δ τ. The FDR of δτ is { FDR(δ τ ) = E i R (1 θ } ( ) 1i) 1 = E T1,T2 T i R 1 R 1 For all i R I, we have T i Tτ,i. It follows that ( FDR(δ τ 1 ) E T1,T 2 R 1 i R T τ,i ) i R α. The last inequality is due to the operation of δ τ, which guarantees that { R 1} 1 i R T τ,i α for all realizations of (T 1,T 2 ). The result on mfdr holds since { } ( ) E (1 θ 1i ) E T1,T 2 T τ,i αe T1,T 2 ( R 1). i R i R A.3. Proof of Proposition 3 Let Ñ = m G(τ) be the expected size of the screening set. Define f 2 (t τ 2 ) = (1/Ñ) K h2 (t 2 T 2i ). T 2i T 2(τ) Then ˆq τ (t 2 ) = G(τ) 1 τ f τ 2 (t 2 ). Define the conditional density of T 2i given that P i > τ f τ 2 (t 2 ) = f(t 2 P i > τ) = G(τ) 1 A τ f(t 1,t 2 )dt 1. Then q τ (t 2 ) = G(τ) 1 τ fτ 2 (t 2). It follows that E ˆq τ q τ 2 = { G(τ) 1 τ } 2 E f 2 τ f2 τ 2, For a fixed τ, we only need to show that E f 2 τ f2 τ 2 2dt2 = E { fτ 2 (t 2 ) f2 (t τ 2 )} 0.

33 ARS: ovariate-assisted Two-Sample Inference 3 Let E T2 T 1 denote the conditional expectation that is taken over T 2i s while holding T 1i s fixed. The conditional density of T 2 given T 1 is denoted f 2 (t 2 T 1i ). We will work on the bias and variance in turn. First, E T2 T 1 {K h2 (t 2 T 2i )} = f 2 (t 2 T 1i )+ h2 2 2 f(2) 2 (t 2 T 1i ) z 2 K(z)dz +o(h 2 2 ). Now consider E{ f 2 (t τ 2)} = E T1 [E T2 T 1 { f 2 (t τ 2)}]. It follows that [ E{ f 2 (t τ 2)} = (1/Ñ)E m { T 1 i=1 f 2 (t 2 T 1i )+ h2 2 2 f (2) 2 (t 2 T 1i ) z 2 K(z)dz +o(h 2 2 )} I(T 1i A τ ) ] = f 2 (t τ 2)+ h2 2 2 f(2) 2 (t 2 τ) z 2 K(z)dz +o(h 2 2 ), where f (2) 2 (t 2 τ) is defined in the proposition. The second equality holds because E T1 {f(t 2 T 1i )I(T 1i A τ )} = f 2 (t 2 t 1 )f(t 1 )dt 1 = G(τ)f 2 (t τ 2). A τ We have assumed the square integrability of f (2) 2 (t 2 τ), according to which define { } R = {f (2) 2 (t 2 τ)} 2 dt 2. f (2) 2 τ It follows that the integrated squared bias is [E{ f 2 (t τ 2)} f { } 2 (t τ 2)] 2 dt 2 = (h 4 2 /4)R f (2) 2 (τ) {µ 2 (K)} 2 (1+o(1)), where we use the notation µ 2 (K) = z 2 K(z)dz. Next, we compute the variance term. onsider the following decomposition Var{ f τ 2 (t 2)} = Var T1 [E T2 T 1 { f τ 2 (t 2)}]+E T1 [Var T2 T 1 { f τ 2 (t 2)}], where the first and second terms can be respectively computed as { } 1 [ Var T1 [E T2 T 1 { f 2 (t τ 2)}] = m G 2 (τ) {f 2 1 (t 2 t 1 )} 2 f 1 (t 1 )dt 1 {f 2 (t 2 )} ]{1+o(1)}, 2 E T1 [Var T2 T 1 { f 2 (t τ 1 2)}] = f 2 (t 2 t 1 )f 1 (t 1 )dt 1 R(K){1+o(1)} Ñh 2 A τ = (mh 2 ) 1 f 2 (t τ 2)R(K){1+o(1)}. It is easy to see that the second term is the leading term (note that we fixed τ k and let m ). Hence Var{ f τ 2 (t 2)}dt 2 = (mh 2 ) 1 R(K){1+o(1)}. The common choice of bandwidth h m 1/5 makes E f 2 τ f2 τ 2 0, and the desired result follows.
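A minimal numerical illustration of the screened kernel estimator analysed above is given below. The data-generating step and the use of uniform null p-values for screening are illustrative assumptions; base-R density() stands in for the kernel smoother K_{h_2}.

```r
# Sketch of the screened kernel estimator of q^tau(t2): keep pairs whose primary
# p-value exceeds tau, smooth their auxiliary statistics, and rescale by
# Ghat(tau)/(1 - tau). The simulated data are purely illustrative.
set.seed(42)
m <- 5000; tau <- 0.9
theta2 <- rbinom(m, 1, 0.2)            # union-support indicators (assumed model)
t2 <- rnorm(m, mean = 3 * theta2)      # auxiliary statistics
p1 <- runif(m)                         # primary p-values (uniform for illustration)
keep <- which(p1 > tau)                # screening set T(tau)
Ghat <- length(keep) / m               # empirical P(P_i > tau)
f2_tau <- density(t2[keep], n = 512)   # kernel estimate of the screened density
q_tau_hat <- Ghat / (1 - tau) * f2_tau$y
plot(f2_tau$x, q_tau_hat, type = "l", xlab = "t2", ylab = "estimated q_tau(t2)")
```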

34 4 ai & Sun & Wang A.4. Proof of Proposition 4 Part (a). The total approximation error can be computed as B q (τ) = q τ (t 2 ) q (t 2 ) dt 2 = 1 G(τ) 1 τ (1 π 1 ) = π 1{1 G 1 (τ)}. 1 τ Noting that G 1 (1) = 1, and applying the mean value theorem, we have d dτ B q(τ) = {1 G 1(τ)} g 1 (τ)(1 τ) (1 τ) 2 = g 1(ξ) g 1 (τ) 1 τ for some ξ (τ,1). If G 1 is concave, then g 1 (ξ) g 1 (τ) < 0 and the desired result follows. Part (b). Let η(x) = {1 G 1 (x)}/(1 x). If g 1 satisfies lim x 1 g 1 (x) = 0, then using L Hospital s Rule, we claim that lim x 1 η(x) exists and lim x 1 η(x) = lim x 1 g 1 (x) = 0. onsider the following decomposition of the joint density f(t 1,t 2 )dt 1 = {q (t 2 )f 10 (t 1 )+π 1 f(t 2 t 1,θ 1i = 1)f 1 (t 1 )}dt 1. A τ A τ Noting that both T 1i and T 2i are standardized, the conditional density f(t 2 t 1,θ 1 = 1) is bounded by some constant 0. Using the definition of η(x), we have q τ (t 2 ) = (1 τ) 1 A τ f(t 1,t 2 )dt 1 q (t 2 )+ 0 π 1 η(τ) Applying L Hospital s Rule again, we have lim τ 1 q τ (s) q (s). In Proposition 2.(a), we have shown that q τ (s) q (s) for all τ. ombining the two inequalities, we conclude that lim τ 1 q τ (s) = q (s). Then the desired result follows. A.5. Proof of Proposition 5 Proof. Proposition 4 defines η(x) = {1 G 1 (x)}/(1 x) and shows that lim τk 1q τ k (t 2 ) = q (t 2 ), q τ (t 2 ) q (t 2 ) 0 π 1 η(τ). Note that G 1 (0) = 0 and G 1 (1) = 1. It follows from the concavity of G 1 that η(τ) 1. Moreover, {q τ k (t 2 ) q (t 2 )}dt 2 = π 1 η(τ k ). Hence {q τ k (t 2 ) q (t 2 )} 2 dt 2 0 π 1 {q τ k (t 2 ) q (t 2 )}dt 2 = 0 π1η(τ 2 k ) 0 (A.1) when τ k 1, according to L Hospital s Rule and our assumption that lim x 1 g 1 (x) = 0. Therefore, we only need to show that for a given τ k, [ ] E ˆq q τ k 2 = E {ˆq (t 2 ) q τ k (t 2 )} 2 m 0 dt 2 0. onsider ˆq (t 2 ) defined in (3.13). Let a j = k 1 (ŝ 2 ŝ 0 ŝ 2 1) 1 {ŝ 2 ŝ 1 (τ j 1)}K hτ (τ j 1).

35 ARS: ovariate-assisted Two-Sample Inference 5 Then ˆq (t 2 ) = k j=1 a jˆq τj, k j=1 a j = 1 and k j=1 a j 1. Following similar steps in the proof of Proposition 3, E{ˆq (t 2 )} = E T1 [E T2 T 1 {ˆq (t 2 )}] [ k 1 G(τ j ) = a j f τj 2 1 τ (t 2)+ h2 2 j 2 u 2(K) j=1 = q τ k (t 2 )+ 1 2 h2 τ qτ k,(2) (t 2 ) (K)+o(h 2 τ ) + h2 2 2 u 2(K) k j=1 a j 1 G(τ j ) 1 τ j t 1 A τj f (2) 2 (t 2 t 1 )f 1 (t 1 )dt 1 +o(h 2 2) t 1 A τj f (2) 2 (t 2 t 1 )f 1 (t 1 )dt 1 +o(h 2 2), where (K) is a kernel dependent constant. The last equality follows from Wand and Jones (1995) (pp. 128) for kernel smoothing estimator at boundary points [applying the theory to k j=1 a jq τj (t 2 )]. Note that 1 G(τj) 1 τ j 1 and j a j = 1, we have k j=1 a j 1 G(τ j ) 1 τ j t 1 A τj f (2) 2 (t 2 t 1 )f 1 (t 1 )dt 1 f (2) 2 (t 2 t 1 )f 1 (t 1 )dt 1. According to the square integrability of q τk,(2) (t 2 ) and f (2) 2 (t 2 τ), we can define { } { } 2dt2 R q τ k,(2) = q τk,(2) (t 2 ), { } R f (2) t 2 τ = { f (2) 2 (t 2 τ)} 2dt2. Then the leading term of [E{q (t 2 )} q τ k (t 2 )] 2 dt 2 is bounded above by 1 { } 2 h4 τr q τ k,(2) { (K)} { } 2 h4 2{u 2 (K)} 2 R f (2) t 2 τ, which converges to 0 when (m,k). The argument below regarding the variance part does not pursue the study of the optimal rateof convergence. In fact, the analysisis intractable because q τj aredependent quantities. We have show in Proposition 3 that Var{ˆq τj (t 2 )}dt 2 (mh 2 ) 1 R(K){1+o(1)} for all τ j. It follows that Var{ˆq (t 2 )} (mh 2 ) 1 R(K)( k a 2 k){1+o(1)} j=1 (mh 2 ) 1 R(K){1+o(1)} 0 when m. We note that the actual convergence rate would be better because the variance will be greatly reduced by averaging. ombining the results on the bias and ]

36 6 ai & Sun & Wang variance terms, we have E {ˆq (t 2 ) q τ k (t 2 )} 2 dt 2 [ 1 2 h4 τr(q τ k,(2) ){ (K)} h4 2{u 2 (K)} 2 R(f (2) t 2 t 1 )+ R(K) mh 2 ] {1+o(1)}. (A.2) ombining (A.1) and (A.2), we conclude that E ˆq q 2 0 when (m,k) 0, completing the proof. A.6. Proof of Proposition 6 Proof. We first point out that the arguments in this proof should be understood as the conditional version (given that µ xi and µ yi are known). Define the following central sample moments that are used in later calculations: m xi = 1 n x m xxi = 1 n x n x j=1 n x j=1 (X ij µ xi ), m yi = 1 n y (Y ij µ yi ), n y j=1 (X ij µ xi ) 2, m yyi = 1 n y (Y ij µ yi ) 2. n y It follows that Sxi 2 = m xxi m 2 xi, S2 yi = m yyi m 2 yi. Next define central moments, According to the LT, we have nx n y n j=1 µ xi3 = E(X ij µ xi ) 3, µ yi3 = E(Y ij µ yi ) 3, µ xi4 = E(X ij µ xi ) 4,µ yi4 = E(Y ij µ yi ) 4. m xi m yi m xxi m yyi 0 0 σxi 2 σyi 2 L N(0,Σ i ), where γ y σxi 2 0 γ y µ xi3 0 Σ i = 0 γ x σyi 2 0 γ x µ yi3 γ y µ xi3 0 γ y (µ xi4 σxi 4 ) 0. 0 γ x µ yi3 0 γ x (µ yi4 σyi 4 ) (s t+µ xi µ yi )(γ x v +γ y u) 1 2 Let g i (s,t,u,v) = ( ){ } s+ γyu γ t+µ xv xi + γy γ x µ yi (γ x v +γ y u) γyu 1 2. It follows that γ xv ġ i (0,0,σxi,σ 2 yi) 2 = ( σ 1 pi σ 1 pi κ 1 2 i σ 1 pi σ 1 pi κ 1 2 i 1 2 γ y ( 1 2 γ y(µ xi ) µ yi )σ 3 pi γ µ xi +µ y yi γ x σ 3 pi (1+2κ i )κ 3 2 i 1 2 γ x ( 1 2 γ x(µ xi µ yi ) )σ 3 pi γ µ xi +µ y yi γ x σ 3 pi κ 1 2 i ).

37 ARS: ovariate-assisted Two-Sample Inference 7 The asymptotic variance-covariance matrix of γ x γ y n(t 1i,T 2i ) is ( Σ i σ 2 T 1,T 2 = i (t 1,t 1 ) σi 2(t ) 1,t 2 ) σi 2(t 1,t 2 ) σi 2(t, 2,t 2 ) where the entries can be computed by apply the Delta method: σi(t 2 1,t 1 ) = 1+σ 4 pi (µ xi µ yi ) ( γxµ 2 yi3 γyµ 2 ) xi3 + σ 6 pi σi 2 (t 1,t 2 ) = σ 4 pi σ 4 pi 2 + σ 6 pi 4 4 (µ xi µ yi ) 2{ γ 3 x 2 (µ xi µ yi ) ( γ y µ xi +µ yi γ 3 xκ 1 2 σ 2 i (t 2,t 2 ) = 1+σ 4 pi + σ 6 pi 4 i ( µyi4 σyi 4 ) ( +γ 3 y µxi4 σxi 4 )}, ( γx 2 µ yi3κ 1 2 i +γy 2 µ xi3κ 1 2 i γ x ( µ xi +µ yi γ y γ x ( µyi4 µ 4 ) } yi, ( µ xi +µ yi γ y ( µ xi +µ yi γ y γ x ) { } γx 2 µ yi3κ 1 2 i +γy 2 µ xi3(1+2κ i )κ 3 2 i, ) { (µ xi µ yi ) γy 3 (1+2κ ( i)κ 3 2 i µxi4 σxi) 4 γ x ) {κ ) 2{ γ 3 x κ i i γ2 x µ yi3 (1+2κ i )κ 2 i γy 2 µ } xi3 ) ( µyi4 σyi) 4 +γ 3 y (1+2κ i )2 κ 3 ( i µxi4 σxi 4 ) }. The off-diagonal terms degenerate to zero when (i) the null hypothesis µ xi = µ yi is true and (ii) the distributions are symmetric, i.e. µ y3i = µ x3i = 0, completing the proof. A.7. Proof of Proposition 7 The proposition can be considered as a special case of the theory in Section D of Basu et al. (2017) and can be proved similarly. As the proofs in Basu et al. (2017) are done in different settings with the weighted FDR definition and random weights, we provide the proof here for completeness. Proof. Part (a). Let X = m 1 m i=1 (1 θ i)δ i. Note that when Y = 0 we must have X = 0. The asymptotic equivalence follows if we can show the following { X mfdr(δ) FDR(δ) E Y X } EY I(Y > 0) = o(1). (A.3) Since X Y and both are non-negative expressions, using auchy-schwarz { X E Y X } { } X EY I(Y > 0) EY = E I(Y > 0) Y Y EY (E Y EY 2) 1/2 EY = Var(Y)1/2, EY

38 8 ai & Sun & Wang The desired result follows from onditions (i) and (ii). ( ) Part (b). According to the proof of Theorem 2, Part (a), P T τ,i < Qτ, 1 (α) is bounded ( ) away from zero, then there exists p α > 0, such that P T τ,i < Qτ, 1 (α) p α. Therefore, m 1 E { m i=1 (ˆTτ,i ) ( τ, 1 P < ˆQ (α) = P T τ,i < Qτ, 1 )+o(1), (α) ( I T τ,i Let Y = m 1 m i=1 I(Tτ,i tτ < Qτ, 1 Var(Y ) = m 1 Var (α) ) } = P ). Note that { I ( T τ,i t To show that Var(Y) = o(1), use the decomposition ( T τ,i < Qτ, 1 (α) ) p α > 0. )} m 1 = o(1). Var(Y) = Var(Y )+Var(Y Y )+2ov(Y Y,Y ). We only need to show that Var(Y Y ) = o(1). Then by auchy-schwarz inequality and using Var(Y ) = o(1), it follows that Recall that ˆT τ Tτ = o P(1) and [ m Var(Y Y ) =m 2 Var m 2 E [ m i=1 i=1 ( = 1 1 ) E m + 1 {I (ˆTτ m E < The last equality follows since E E [{ I k=i,j (ˆTτ,i ov(y Y,Y ) = o(1). τ, 1 ˆQ (α) tτ = o P(1). We have { (ˆTτ,i ( )} τ, 1 I < ˆQ ) I ] (α) T τ,i < tτ { (ˆTτ,i ( )} τ, 1 I < ˆQ ) I ] 2 (α) T τ,i < tτ { (ˆTτ,k I k=i,j;i j ˆQ τ, 1 { (ˆTτ,k ( )} τ, 1 I < ˆQ ) I (α) T τ,k < tτ ) } 2 I(T τ < tτ ) = o(1). ) ( )} τ, 1 < ˆQ (α) I T τ,k < tτ ) ( )}] τ, 1 < ˆQ (α) I T τ,i < tτ = o(1). A.8. Effects of baseline removal (Remark 1) This section explains that removing the baseline effects would not violate our assumption on conditional independence. To fix ideas, we consider the application context of microarray time-course experiments.

Let ξ denote the true baseline expression levels of the m genes. Consider two time points x and y. The (unknown) true effect sizes at these two time points are given by

µ_x = ξ + η_x,  µ_y = ξ + η_y.

Note that we do not require ξ to be sparse, but some baseline measurements should be available. By contrast, we assume that η_x = (η_{xi} : i = 1, ..., m) and η_y = (η_{yi} : i = 1, ..., m) are sparse vectors representing random perturbation effects from the baseline levels ξ. The goal is to identify nonzero effects η_{xi} − η_{yi} ≠ 0. We observe X_i = µ_{xi} + ǫ_{xi} and Y_i = µ_{yi} + ǫ_{yi}, with ǫ_{xi} ∼ N(0,1) and ǫ_{yi} ∼ N(0,1). Moreover, we do not observe the true latent variable ξ; the baseline effects are measured (at time 0) with error as Z_i = ξ_i + ǫ_{zi}. We assume that ǫ_{xi}, ǫ_{yi} and ǫ_{zi} are mutually independent. After removing the baseline effects, we obtain the new observations

X̃_i = ξ_i + η_{xi} + ǫ_{xi} − (ξ_i + ǫ_{zi});   Ỹ_i = ξ_i + η_{yi} + ǫ_{yi} − (ξ_i + ǫ_{zi}).

The pairs of differences and sums are given by

T_{1i} = X̃_i − Ỹ_i = (η_{xi} − η_{yi}) + (ǫ_{xi} − ǫ_{yi});
T_{2i} = X̃_i + Ỹ_i = (η_{xi} + η_{yi}) + (ǫ_{xi} + ǫ_{yi}) − 2(ξ_i + ǫ_{zi}).

Under the null hypothesis η_{xi} = η_{yi}, the error terms ǫ_{xi} − ǫ_{yi} and ǫ_{xi} + ǫ_{yi} are independent, and ǫ_{xi} − ǫ_{yi} is independent of ξ_i + ǫ_{zi}. Hence T_{1i} and T_{2i} satisfy the conditional independence assumption (2.5), which in turn yields Equation (2.6), as required by the CARS methodology.

A.9. Proof of the claim in Remark 4

Proof. We study the special case where (i) the observations have equal sample sizes and (known) equal variances σ_{xi}^2 = σ_{yi}^2 = σ_i^2, and (ii) the two nonzero means have the same magnitude with opposite signs, µ_{xi} = −µ_{yi}. Then

(T_{1i}, T_{2i}) = {√n/(√2 σ_i)} (X̄_i − Ȳ_i, X̄_i + Ȳ_i).

Next, note that X̄_i = µ_{xi} + ǭ_{xi} and Ȳ_i = −µ_{xi} + ǭ_{yi}, where ǭ_{xi} ∼ N(0, (2/n)σ_i^2) and ǭ_{yi} ∼ N(0, (2/n)σ_i^2). We have

f(t_1, t_2) = ∫ f(t_1, t_2 | µ_x = µ) dG_{µx}(µ) = ∫ f(t_1 | µ_x = µ) f_2(t_2) dG_{µx}(µ) = f_1(t_1) f_2(t_2);
f(t_2 | θ_{1i} = 0) = f(t_2 | µ_{xi} = µ_{yi} = 0) = f_2(t_2).

It follows that the oracle statistic

T^*(t_1, t_2) = (1 − π_1) f(t_2 | θ_{1i} = 0) f_{10}(t_1) / f(t_1, t_2) = (1 − π_1) f_{10}(t_1) / f_1(t_1),

which is the Lfdr defined in (3.1).
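The conditional-independence claim of Section A.8 is easy to check numerically. The following R sketch simulates the baseline-removal construction under the null η_x = η_y and verifies that the empirical correlation between T_1 and T_2 is close to zero; all distributional choices below are illustrative.

```r
# Numerical check of the conditional-independence claim in Section A.8: after
# subtracting a noisy baseline Z, the difference T1 and the sum T2 should be
# uncorrelated under the null (eta_x = eta_y).
set.seed(7)
m   <- 1e5
xi  <- rnorm(m, 0, 2)       # baseline levels (need not be sparse)
eta <- rnorm(m)             # common perturbation, so eta_x = eta_y (null case)
X   <- xi + eta + rnorm(m)  # observation under condition x
Y   <- xi + eta + rnorm(m)  # observation under condition y
Z   <- xi + rnorm(m)        # noisy baseline measurement
T1  <- (X - Z) - (Y - Z)    # equals eps_x - eps_y
T2  <- (X - Z) + (Y - Z)    # equals 2*eta + eps_x + eps_y - 2*eps_z
cor(T1, T2)                 # close to zero, as claimed
```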

A.10. Proof of insufficiency and further explanation of Remark 8

We prove the claim in Remark 8 that the primary statistic is not a sufficient statistic. The proof uses the notation of Section 4. Clearly we only need to provide a counterexample. Consider the following special case, where the grouping variable S_i takes the two values 1 and 2 with equal probability: P(S_i = 1) = P(S_i = 2) = 0.5. If S_i = j, j = 1, 2, then T_i follows the mixture distribution

T_i ∼ (1 − π_j) N(0, 1) + π_j N(1, 1),

where π_j is the conditional proportion of non-null cases in group j. An equivalent way to conceptualize the above model is to introduce an unknown location parameter µ_i: T_i | µ_i ∼ N(µ_i, 1), where µ_i follows a two-point distribution with masses at 0 and 1:

µ_i ∼ (1 − π_j) δ_0 + π_j δ_1.

Here δ_0 and δ_1 are the Dirac delta functions centred at 0 and 1, respectively.

Now we prove the insufficiency of T_i. By definition, if T_i were sufficient, then P(S_i = 1 | T_i = t, µ_i) would not depend on µ_i. However, some simple algebra gives

P(S_i = 1 | T_i = t, µ_i = 0) = (1 − π_1)/(2 − π_1 − π_2),   P(S_i = 1 | T_i = t, µ_i = 1) = π_1/(π_1 + π_2),

and the desired result follows.

Remark 9. The ancillarity paradox refers to the fact that S_i seems to be useless for inference about µ_i, since it is an external variable and the location information should have been captured by T_i. However, making inference using T_i alone would lead to a substantial loss of information. In this example, S_i is called an ancillary component because (i) T_i alone is insufficient, (ii) S_i is ancillary, and (iii) (T_i, S_i) is sufficient.

A.11. Screening via Lfdr: formulae and implementation

Consider a screening procedure based on a statistic ζ_i := ζ_i(t_{1i}) and a tuning parameter τ > 0, where the screening set consists of the indices with ζ_i > τ. Denote by G(τ) the CDF of ζ_i, let A_τ = {x : ζ(x) > τ}, and let S(τ) = P(A_τ) denote the survival function. Then

q^τ(t_2) = lim_{h → 0} E{Q^τ(t_2, h)}/h = S(τ)^{-1} ∫_{A_τ} f(t_1, t_2) dt_1,   (A.4)

which can be estimated by

q̂^τ(t_2) = {m Ŝ(τ)}^{-1} Σ_{i ∈ T(τ)} K_{h_2}(t_2 − t_{2i}).   (A.5)

In the p-value approach, we have an explicit correction factor of 1 − τ; this factor is now replaced by Ŝ(τ). If the Lfdr is used for screening, then a new correction factor is needed. We can simulate B data points (e.g. B = 10^5) from the null distribution N(0, 1) and calculate their Lfdrs using the method in Sun and Cai (2007), where the parameters are estimated from the primary statistics {t_{1i} : 1 ≤ i ≤ m}. The correction factor is then obtained as the proportion of these Lfdrs exceeding τ.
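As an illustration of the recipe above, the R sketch below computes the Lfdr-based correction factor from a vector of primary statistics. The null-proportion and mixture-density estimates used to form the Lfdr are crude stand-ins for the Sun and Cai (2007) estimators, and the function name lfdr_correction is ours rather than part of the package.

```r
# Sketch of the Lfdr-based correction factor of Section A.11: simulate B null
# z-values, evaluate their Lfdr under a two-group working model fitted to the
# primary statistics t1, and report the proportion exceeding tau.
lfdr_correction <- function(t1, tau = 0.9, B = 1e5) {
  pi0_hat <- min(1, 2 * mean(abs(t1) < qnorm(0.75)))        # rough null proportion
  dens    <- density(t1, n = 2048)                          # mixture density estimate
  f_fun   <- approxfun(dens$x, pmax(dens$y, 1e-10), rule = 2)
  lfdr    <- function(z) pmin(1, pi0_hat * dnorm(z) / f_fun(z))
  z_null  <- rnorm(B)                                       # draws from the null N(0, 1)
  mean(lfdr(z_null) > tau)                                  # the correction factor
}

set.seed(1)
t1 <- c(rnorm(4500), rnorm(500, mean = 3))                  # toy primary statistics
lfdr_correction(t1)
```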

B. Supplementary Numerical Results

B.1. The case with unknown and equal variances

Let ǫ_{xij} ∼ N(0,1) and ǫ_{yik} ∼ N(0,1). Set n_x = 50 and n_y = 60, so that κ = n_y/n_x = 1.2. We compute T_{1i} and T_{2i} using the estimated variances. The following settings are considered in our simulation study.

Setting 1: we set µ_{x,1:k} = 5/√30, µ_{x,(k+1):(2k)} = 4/√30, µ_{x,(2k+1):m} = 0, µ_{y,1:k} = 2/√30, µ_{y,(k+1):(2k)} = 4/√30 and µ_{y,(2k+1):m} = 0. We vary k from 100 to 1000.

Setting 2: we use k_1 and k_2 to denote the number of nonzero locations and the number of locations with differential effects, respectively. Let µ_{x,1:k_2} = 5/√30, µ_{x,(k_2+1):k_1} = 4/√30, µ_{x,(k_1+1):m} = 0, µ_{y,1:k_2} = 2/√30, µ_{y,(k_2+1):k_1} = 4/√30 and µ_{y,(k_1+1):m} = 0. We fix k_1 = 2000 and vary k_2 from 200 to k_1.

Setting 3: the sparsity level is fixed at k = 750. We set µ_{x,1:k} = µ_0/√30, µ_{x,(k+1):(2k)} = 3/√30, µ_{x,(2k+1):m} = 0, µ_{y,1:k} = 2/√30, µ_{y,(k+1):(2k)} = 3/√30 and µ_{y,(2k+1):m} = 0. We vary µ_0 from 3.5 to 5.

In Settings 1-3, we apply the different methods to the simulated data and obtain the FDR and average power by averaging over 500 replications. The FDR and power levels, plotted as functions of the varied parameter values, are displayed in Figure 5. We again see that both CARS and US are more powerful than conventional univariate methods such as BH and AZ, and that CARS is superior to US.

B.2. Simulation results when the means have opposite signs

Let ǫ_{xij} ∼ N(0,1) and ǫ_{yik} ∼ N(0,1) be i.i.d. noise. Set n_x = n_y = 50. In this simulation, the number of tests is m = 10,000. The two mean vectors are given below:

µ_{x,1:500} = 3/√50, µ_{x,501:1000} = 4/√50, µ_{x,1001:m} = 0;
µ_{y,1:500} = 3c/√50, µ_{y,501:1000} = 4/√50, µ_{y,1001:m} = 0.

We vary c from −1 to 0, where c = −1 corresponds to the least favourable situation. We apply the BH procedure (BH), the oracle CARS procedure and the data-driven CARS procedure (DD) to the simulated data sets. The FDR and power are obtained based on 200 replications. The simulation results are summarized in Figure 6. We can see that when c = −1, all methods have similar power. As c increases to zero, the power gain of CARS becomes more pronounced. This shows that the CARS procedure outperforms univariate methods even when the signals have opposite signs.

Remark 10. The fact that CARS improves efficiency when the signs are opposite may appear counter-intuitive. We therefore discuss the following extreme situation to explain the phenomenon; the discussion also gives further insight into the merit of CARS. Assume that c = 0 and, further, that the other elements of µ_y are zero, i.e. µ_{y,1:m} = 0. Let the mean vector µ_x be µ_{x,1:1000} = µ_0 and µ_{x,1001:m} = 0. Suppose we only observe one replicate with σ_{xi} = σ_{yi} = 1 for all i. If we use T_i = X_i − Y_i as the test statistic, then the signal-to-noise ratio is µ_0/√2. Imagine an oracle that knows µ_{y,1:m} = 0.

[Figure 5 about here.]

Fig. 5. The general case with unknown variances: the FDR and average power varied by the non-null proportion (top row), the conditional proportion (middle row) and the effect size (bottom row). The displayed procedures are DD, BH, the oracle procedure, AZ and US.

[Figure 6 about here.]

Fig. 6. Comparison of BH, DD and the oracle procedure when the nonzero means have opposite signs.

Hence µ_{xi} − µ_{yi} = 0 as long as µ_{xi} = 0. For this oracle, a more efficient test statistic would be T_i = X_i, whose signal-to-noise ratio (SNR) is µ_0. The information in µ_{yi} can be partially recovered by the pair (X_i − Y_i, X_i + Y_i); by contrast, this information is lost if we use X_i − Y_i alone. Hence the SNR can be enhanced by including the sum in the inference (even when µ_{yi} = 0 for all i). We therefore have two extreme cases: c = −1 and c = 0. In the former case CARS has the same power as the optimal univariate method, and in the latter case CARS may substantially outperform all univariate methods by exploiting the auxiliary information with enhanced SNR. This explains the pattern observed in Figure 6.

B.3. Comparison with sample splitting when θ_{1i} = θ_{2i}

As in the general setting, we set n_x = 50 and n_y = 60 and construct the primary and auxiliary statistics accordingly. We consider two simulation settings.

Setting 1: µ_{x,1:k} = 5/√30, µ_{x,(k+1):m} = 0, µ_{y,1:k} = 2/√30 and µ_{y,(k+1):m} = 0. We vary k from 50 to 1000 to investigate the effect of sparsity on the testing results.

Setting 2: the sparsity level is fixed at k = 500, with µ_{y,1:m} = 0, µ_{x,1:k} = µ_0/√30 and µ_{x,(k+1):m} = 0. We vary µ_0 from 1.5 to 3.5 to investigate the impact of the effect sizes.

In each setting, we apply the different methods to the simulated data and obtain the FDR and average power by averaging over 200 replications. The sample splitting (SS) method is added to the comparison. Specifically, SS splits both samples into two equal parts. The first half of the data is used to compute initial p-values, and locations with initial p-value below the cut-off 0.05 are selected for the second-stage analysis. Second-stage p-values are then computed for the selected locations using the second half of the data. Finally, the BH procedure is applied to the second-stage p-values and the decisions are recorded. The FDR and power levels are plotted as functions of k and µ_0, and the results are displayed in Figure 7. The following observations can be made based on the simulation.

(a). All methods control the FDR at the nominal level 0.05, with BH slightly conservative and US very conservative.

(b). The data-driven CARS procedure can be very conservative in many settings. This is because we have used a conservative approximation to the oracle procedure. Nevertheless, CARS still enjoys a substantial power gain over the competing methods.

(c). The univariate methods (BH and AZ) are greatly improved upon by US, which in turn is greatly improved upon by CARS, with DD quite comparable to the oracle procedure.

(d). The results in Setting 1 corroborate that the gain in efficiency decreases as k increases; this is consistent with our findings under the general model.

(e). Setting 2 shows how the efficiency gain of CARS over the other methods varies with the effect size µ_0.

(f). The sample splitting method is inefficient.

B.4. Stability of tuning using Lfdr

The R package CARS carries out the screening step using Lfdr statistics. Let T(τ) = {i : Lfdr_1(t_{1i}) > τ}. This section examines the robustness of the CARS procedure with respect to the choice of τ. For brevity, we demonstrate the effect of τ in the case where θ_{1i} and θ_{2i} are perfectly correlated; the general case exhibits the same kind of phenomenon. We explore different values of τ and compare the results with those obtained by the (benchmark) oracle procedure. We use the notation of the main text and consider the following simulation settings.

Setting 1 (the same as Setting 1 in Section B.3): k = 500. Vary τ from 0.6 to 0.9.

Setting 2: the same as Setting 1 above, except that µ_{x,1:500} = 4/√50. Vary τ from 0.6 to 0.9.

Setting 3: the same as Setting 1, with k = 1000. Vary τ from 0.6 to 0.9.

Setting 4: the same as Setting 3 above, except that µ_{x,1:1000} = 4/√50. Vary τ from 0.6 to 0.9.

We apply the oracle procedure and the data-driven CARS procedure (DD) to the simulated data and obtain the corresponding FDR and power levels. We also compute the average size of Ŝ = {i : Lfdr_1 > τ} and plot it as a function of τ. Note that for the oracle procedure we always have Card(S) = 4500 or 4000, depending on the setting. The results are summarized in Figure 8. We can see that the value of τ affects the size of Ŝ and in general provides better error control when it is large, which is consistent with our theory in Proposition 4. However, to avoid inflated variability in the estimated quantities, we do not recommend larger values of τ. As a rule of thumb, our practical recommendation is to take τ = 0.9 to ensure more precise error control and stability of the methodology.
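The dependence of the screening-set size on τ can be illustrated with a short R sketch; the simulated statistics and the density-based Lfdr below are illustrative and are not tied to Settings 1-4.

```r
# Sketch of how the screening-set size |S-hat| = #{i : Lfdr_1(t_1i) > tau}
# varies with tau, using the same crude density-based Lfdr as in Section A.11.
set.seed(2)
m <- 5000; k <- 500
t1 <- c(rnorm(k, mean = 3), rnorm(m - k))         # k non-nulls, m - k nulls
dens  <- density(t1, n = 2048)
f_fun <- approxfun(dens$x, pmax(dens$y, 1e-10), rule = 2)
pi0_hat <- 1 - k / m                              # true null proportion, plugged in
lfdr1 <- pmin(1, pi0_hat * dnorm(t1) / f_fun(t1))
sapply(c(0.6, 0.7, 0.8, 0.9), function(tau) sum(lfdr1 > tau))
```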

[Figure 7 about here.]

Fig. 7. The special case: FDR and power comparison varied by the non-null proportion (Setting 1) and the signal strength (Setting 2). The displayed procedures are DD, BH, the oracle procedure, AZ, US and SS.

[Figure 8 about here.]

Fig. 8. Effect of τ: the FDR, the average power and the cardinality of Ŝ plotted against τ under Settings 1-4, for DD and the oracle procedure.

B.5. Very sparse case and ranking

A fundamental aspect of the multiple comparison problem is ranking. For example, a biologist with a limited budget may ask us to provide a list of the top k genes (say, k = 10); then, on top of FDR control, she may care more about how many true findings are actually on the list we provide. We carry out a small simulation study to compare the different methods in the very sparse case k_2 = 20. The goal is to show that, by exploiting the auxiliary information, CARS yields a much better ranking, in the sense that for a pre-specified budget, namely a fixed number of rejections, CARS identifies more true positives than the competing methods.

Consider the case with unknown and equal variances. Let ǫ_{xij} ∼ N(0,1) and ǫ_{yik} ∼ N(0,1). Set n_x = 50 and n_y = 60, so that κ = n_y/n_x = 1.2. The number of repetitions is 100.

Setting: µ_{x,1:20} = 5/√50, µ_{x,21:40} = 4/√50, µ_{x,41:m} = 0, µ_{y,1:20} = 2/√50, µ_{y,21:40} = 4/√50, µ_{y,41:m} = 0.

We vary the number of rejections k from 5 to 20 and calculate, for every given k, how many true positives are among the top k hypotheses. The rankings are based on the p-values (BH), the Lfdr statistics (AZ; Sun and Cai, 2007), the oracle statistic T (the proposed method with known parameters), and the data-driven statistic T̂^τ (DD, the proposed method with estimated parameters), respectively. The expected number of true positives (ETP) is calculated by averaging over 100 simulated data sets. We can see from Figure 9 that the ranking produced by CARS is superior to those of the other methods.

[Figure 9 about here.]

Fig. 9. Comparison of rankings: ETP among the top k rejections, as a function of the number of rejections, for DD, AZ and BH.

B.6. Dependent tests

This section presents a small simulation study to investigate the performance of CARS under dependence.

B.6. Dependent tests

This section presents a small simulation study to investigate the performance of CARS under dependence. We demonstrate the performance of CARS in the setting with unknown and equal variances. Let Σ be the covariance matrix with Σ_{i,i} = 1 for i = 1,...,m, Σ_{i,i+1} = Σ_{i+1,i} = 0.5 for i = 1,...,m−1, and Σ_{i,i+2} = Σ_{i+2,i} = 0.4 for i = 1,...,m−2; all other entries are zero. Let ǫ_{xj} ~ N(0,Σ) and ǫ_{yk} ~ N(0,Σ). Set n_x = 50 and n_y = 60, so that κ = n_y/n_x = 1.2. The number of repetitions is 100 and the number of genes is m. Setting: we set µ_{x,1:k} = 5/√50, µ_{x,(k+1):2k} = 4/√50, µ_{x,(2k+1):m} = 0, µ_{y,1:k} = 2/√50, µ_{y,(k+1):2k} = 4/√50, µ_{y,(2k+1):m} = 0. We vary k from 100 to 500.

The results are summarized in Figure 10. Both OR and DD appear to control the FDR well. However, we emphasize that the simulation results here are very limited and we do not intend to draw general conclusions from them. Meanwhile, we conjecture that CARS would remain asymptotically valid under suitable weak dependence assumptions, but the theory is very challenging; much research is still needed and we leave it as a future direction.

Fig. 10. Comparison of methods under weak dependence: FDR level and average power against k for the DD, BH, AZ and US procedures.
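For concreteness, correlated error vectors with the banded covariance Σ described above can be generated as in the sketch below (base R, forming Σ explicitly and multiplying standard normal vectors by its Cholesky factor). The dimension m = 5000 is an illustrative choice and this is not the simulation code used for the paper.

    ## Sketch: rows of N(0, Sigma) errors with the banded covariance described above
    m <- 5000                                        # illustrative dimension
    Sigma <- diag(1, m)
    Sigma[cbind(1:(m - 1), 2:m)] <- 0.5; Sigma[cbind(2:m, 1:(m - 1))] <- 0.5
    Sigma[cbind(1:(m - 2), 3:m)] <- 0.4; Sigma[cbind(3:m, 1:(m - 2))] <- 0.4

    R_chol <- chol(Sigma)                            # Sigma = t(R_chol) %*% R_chol
    rmvnorm_banded <- function(n) matrix(rnorm(n * m), n, m) %*% R_chol

    eps_x <- rmvnorm_banded(50)                      # n_x = 50 error vectors for the X sample
    eps_y <- rmvnorm_banded(60)                      # n_y = 60 error vectors for the Y sample
    ## For very large m, a band or sparse Cholesky factorization avoids storing the dense Sigma.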

B.7. Supernova Example Images

Here we provide additional numerical results for the analysis in Section 5.4 of the main text. Figure 11 shows the rejected pixels in the image layout for each method at different nominal FDR levels.

Fig. 11. Rejected pixels at nominal FDR levels 0.01%, 1% and 5%. Left to right: BH procedure, Adaptive Z procedure, Uncorrelated Screening, CARS.

B.8. Analysis of the Microarray Time-Course (MTC) Data

In this section we apply the CARS procedure to the MTC dataset collected by Calvano et al. (2005) for studying systemic inflammation in humans. The study contains eight subjects who were randomly assigned to treatment and control groups and administered endotoxin and placebo, respectively. Affymetrix U133A chips were used to profile the expression levels of 22,283 genes in human leukocytes measured before infusion (hour 0) and at 2, 4, 6, 9 and 24 hours afterwards. One goal of this experiment is to identify, in the treatment group, early-to-middle response genes that are differentially expressed within 4 hours, thereby revealing a meaningful early activation gene sequence that could potentially govern the immune response. We first preprocess the data according to the steps described in Sun and Wei (2011). To further identify the genes that regulate this sequence, we take time point 0 as the baseline and time points 4 and 24 as the interval over which differential gene expression will be assessed.
