CARS: Covariate Assisted Ranking and Screening for Large-Scale Two-Sample Inference


T. Tony Cai, University of Pennsylvania, Philadelphia, USA
Wenguang Sun, University of Southern California, Los Angeles, USA
Weinan Wang, University of Southern California, Los Angeles, USA

Summary. Two-sample multiple testing has a wide range of applications. The conventional practice first reduces the original observations to a vector of p-values and then chooses a cutoff to adjust for multiplicity. However, this data reduction step could cause significant loss of information and thus lead to suboptimal testing procedures. In this paper, we introduce a new framework for two-sample multiple testing by incorporating a carefully constructed auxiliary variable in inference to improve the power. A data-driven multiple testing procedure is developed by employing a covariate-assisted ranking and screening (CARS) approach that optimally combines the information from both the primary and auxiliary variables. The proposed CARS procedure is shown to be asymptotically valid and optimal for false discovery rate (FDR) control. The procedure is implemented in the R-package CARS. Numerical results confirm the effectiveness of CARS in FDR control and show that it achieves substantial power gain over existing methods. CARS is also illustrated through an application to the analysis of a satellite imaging data set for supernova detection.

Keywords: Compound decision theory; False discovery rate; Logically correlated tests; Multiple testing with covariates; Uncorrelated screening.

1. Introduction

A common goal in modern scientific studies is to identify features that exhibit differential levels across two or more conditions. The task becomes difficult in large-scale comparative experiments, where differential features are sparse among thousands or even millions of features being investigated. The conventional practice is to first reduce the original samples to a vector of p-values and then choose a cutoff to adjust for multiplicity. However, the first step of data reduction could cause significant loss of information and thus lead to suboptimal testing procedures. This paper proposes new strategies to extract sparsity information in the sample using an auxiliary covariate sequence and develops optimal covariate-assisted inference procedures for large-scale two-sample multiple testing problems.

We focus on a setting where both mean vectors are individually sparse. Such a setting arises naturally in many modern scientific applications. For example, the detection of sequentially activated genes in time-course microarray experiments, considered in Section B.8 in the Supplementary Material, involves identifying varied effect sizes across different time points [Calvano et al. (2005); Sun and Wei (2011)]. Since only a small fraction of genes are differentially expressed from the baseline, the problem of identifying varied levels over

time essentially reduces to a multiple testing problem with several high-dimensional sparse vectors (after removing the baseline effects). The second example arises from the detection of supernova explosions considered in Section 5.4. The potential locations can be identified by testing sudden changes in brightness in satellite images taken over a period of time. After the measurements are converted into grey-scale images and vectorized, multiple tests are conducted to compare the intensity levels between two sparse vectors. Another case in point is the analysis of differential networks, where the goal is to detect discrepancies between two or more networks with possibly sparse edges.

We first describe the conventional framework for two-sample inference and then discuss its limitations. Let X and Y be two random vectors recording the measurement levels of the same m features under two experimental conditions, respectively. The population mean vectors are given by $\mu_x = E(X) = (\mu_{x1},\ldots,\mu_{xm})$ and $\mu_y = E(Y) = (\mu_{y1},\ldots,\mu_{ym})$. A classical formulation for identifying differential features is to carry out m two-sample tests:
$$H_{i,0}: \mu_{xi} = \mu_{yi} \quad \text{vs.} \quad H_{i,1}: \mu_{xi} \neq \mu_{yi}, \quad 1 \le i \le m. \quad (1.1)$$
Suppose we have collected two random samples $\{X_1,\ldots,X_{n_1}\}$ and $\{Y_1,\ldots,Y_{n_2}\}$ as independent copies of X and Y, respectively. The standard practice starts with a data reduction step: a two-sample t-statistic $T_i$ is computed to compare the two conditions for feature i, then $T_i$ is converted to a p-value or z-value. Finally a significance threshold is chosen to control the multiplicity. However, this conventional practice, which only utilizes a vector of p-values, may suffer from substantial information loss.

Our idea is to exploit the sparsity information in the mean vectors to improve the efficiency of tests. Let $I(\cdot)$ be an indicator function. Observe that $\theta_{1i} = I(\mu_{xi} \neq \mu_{yi})$ and $\theta_{2i} = I(\mu_{xi} \neq 0 \text{ or } \mu_{yi} \neq 0)$ obey the following logical relationship:
$$\theta_{1i} = 0 \quad \text{if} \quad \theta_{2i} = 0. \quad (1.2)$$
When both $\mu_x$ and $\mu_y$ are sparse, the union support $S = \{i: \theta_{2i} \neq 0\}$ is also sparse. However, only information about $\theta_{1i}$ is retained after the data reduction step in conventional practice. We propose to construct an auxiliary covariate sequence that extracts the sparsity information on the union support from the original data. The covariate, which provides supplementary evidence in inference via the logical relationship (1.2), can help eliminate a large number of null locations and thus improve the accuracy of inference.

A key step in our methodological development is to recast the problem in the framework of multiple testing with a covariate sequence. This requires a careful construction of the covariate that leads to a simple bivariate model and an easily implementable methodology. Section 2 discusses strategies for constructing the pair of primary and auxiliary variables. Then we develop oracle and data-driven multiple testing procedures for the consequent bivariate model in Section 3. The proposed method employs a covariate-assisted ranking and screening (CARS) approach that simultaneously incorporates the primary and auxiliary information in multiple testing. We show that the CARS procedure controls the false discovery rate at the nominal level, and outperforms existing methods in power by effectively utilizing the information discarded in conventional practice. We mention two related strategies in the literature: testing following screening and testing following grouping.
In the first strategy, the hypotheses are formed and tested hierarchically via a screen-and-clean method [Zehetmayer et al. (2005); Reiner-Benaim et al. (2007); Zehetmayer et al. (2008); Wasserman and Roeder (2009); Bourgon et al. (2010)]. Following that strategy, one can first inspect the sample to identify the union support of $\mu_x$

and $\mu_y$, and then conduct two-sample tests on the narrowed subset to further eliminate the null locations with no differential levels. The screen-and-clean approach requires sample splitting to ensure the independence between the screening and testing stages to avoid selection bias [Rubin et al. (2006); Bourgon et al. (2010)]. However, even though the screening stage can significantly narrow down the focus, sample splitting often leads to loss of power. For example, the empirical studies in Skol et al. (2006) concluded that a two-stage analysis is in general inferior to a naive joint analysis that combines the data from both stages.

The second strategy (Liu, 2014) can be described as testing following grouping, that is, the hypotheses are analyzed in groups via a divide-and-test method. Liu (2014) developed an uncorrelated screening (US) method, which first divides the hypotheses into two groups according to a screening statistic, and then applies multiple testing procedures to the groups separately to identify non-null cases. It was shown in Liu (2014) that US controls the error rate at the nominal level and outperforms competitive methods in power.

Our approach marks a clear departure from existing methods. Both the screen-and-clean and divide-and-test strategies involve dichotomizing a continuous variable, which fails to fully utilize the auxiliary information. By contrast, our proposed CARS procedure models the screening covariate as a continuous variable and employs a novel ranking and selection procedure that optimally integrates the information from both the primary and auxiliary variables. In Section 4, we develop further results on a general bivariate model; our study reveals the connections between existing methods and provides insights into the advantage of the proposed CARS procedure. Simulation results in Section 5 demonstrate that CARS controls the false discovery rate in finite samples and uniformly dominates all existing methods. The gain in power is substantial in many settings. We illustrate our method by analyzing a time-course satellite image data set in Section 5.4. The application shows improved sensitivity of the proposed method in identifying changes between images taken over time. Section 6 further discusses related issues and open problems. The proofs are provided in Section 7 and Appendix A. Additional numerical results are given in Appendix B.

2. A Bivariate Random Mixture Model for Two-Sample Multiple Testing

Suppose $\{X_{ij}: 1 \le j \le n_x\}$ and $\{Y_{ik}: 1 \le k \le n_y\}$, $i = 1,\ldots,m$, are repeated measurements of generic independent random variables $X_i$ and $Y_i$, respectively. Let $\beta_0 = (\beta_{0i}: 1 \le i \le m)$ be a latent vector which itself is sparse [including the special case where $\beta_{0i} = 0$ for all i]. Consider the following model
$$X_{ij} = \beta_{0i} + \mu_{xi} + \epsilon_{xij}, \quad Y_{ik} = \beta_{0i} + \mu_{yi} + \epsilon_{yik}, \quad (2.1)$$
with corresponding population means given by $\beta_{0i} + \mu_{xi}$ and $\beta_{0i} + \mu_{yi}$. We assume that $\mu_{xi}$ and $\mu_{yi}$ satisfy $\mu_{xi} \sim (1-\pi_x)\delta_0 + \pi_x g_{\mu_x}(\cdot)$ and $\mu_{yi} \sim (1-\pi_y)\delta_0 + \pi_y g_{\mu_y}(\cdot)$, where $\delta_0$ is the Dirac delta function, and $g_{\mu_x}$ and $g_{\mu_y}$ are densities of nonzero effects.

Remark 1. Model (2.1) can be applied to scenarios with non-sparse $\mu_x$ and $\mu_y$ as long as some baseline measurements are available. In Section A.8 in the supplement, we show that after baseline removal, $\beta_0$ becomes sparse, and the related model assumptions hold.

Let $n = n_x + n_y$. Denote $\gamma_x = n_x/n$ and $\gamma_y = n_y/n$.
The population means are estimated by $\bar X_i = n_x^{-1}\sum_{j=1}^{n_x} X_{ij}$ and $\bar Y_i = n_y^{-1}\sum_{k=1}^{n_y} Y_{ik}$, respectively. For ease of presentation, we first focus on the Gaussian model, where $\epsilon_{xij} \overset{iid}{\sim} N(0,\sigma_{xi}^2)$ and $\epsilon_{yik} \overset{iid}{\sim} N(0,\sigma_{yi}^2)$. More general settings will be discussed in Section 3.6.

Remark 2. The proposed methodology only requires $\bar X_i$ and $\bar Y_i$ to be normal. In practical situations where $n_x$ and $n_y$ are large, our method works well without the normality assumption. Numerical results with non-Gaussian errors are provided in Section 5.3.

The two-sample inference problem is concerned with the simultaneous testing of m hypotheses $H_{i,0}: \mu_{xi} = \mu_{yi}$ vs $H_{i,1}: \mu_{xi} \neq \mu_{yi}$, $i = 1,\ldots,m$. Let $T_{1i}$ and $T_{2i}$ be summary statistics that contain the information about $\theta_{1i} = I(\mu_{xi} \neq \mu_{yi})$ (support of the mean difference) and $\theta_{2i} = I(\mu_{xi} \neq 0 \text{ or } \mu_{yi} \neq 0)$ (union support), respectively. $T_{1i}$ is the primary statistic in inference and $T_{2i}$ is an auxiliary covariate. The term auxiliary indicates that we do not use $T_{2i}$ to make inference on $\theta_{1i}$ directly. Instead, we aim to incorporate $T_{2i}$ in inference to (indirectly) support the evidence provided by the primary variable $T_{1i}$. We first discuss how to construct the primary and auxiliary statistics from the original data, and then introduce a bivariate random mixture model to describe their joint distribution. Finally, we formulate a decision-theoretic framework for two-sample simultaneous inference with an auxiliary covariate.

2.1. Constructing the primary and auxiliary statistics

A key step in our formulation is to construct a pair of statistics $(T_{1i}, T_{2i})$ such that (i) the pair extracts information from the data effectively; (ii) the pair leads to a simple bivariate model via which the logical relationship (1.2) can be exploited. To focus on the main ideas, we first discuss the Gaussian case with known variances. Extensions to two-sample tests with non-Gaussian errors and unknown variances are discussed in Section 3.6.

The general strategies for constructing the pair $(T_{1i}, T_{2i})$ can be described as follows. First, $T_{1i}$ is used to capture the information on $\theta_{1i}$; hence $\bar X_i - \bar Y_i$ should be incorporated in its expression. Second, to capture the information on the union support $\theta_{2i}$, we propose to use the weighted sum $\bar X_i + \kappa_i \bar Y_i$, where $\kappa_i > 0$ is a weight to be specified later. Under the normality assumption, the covariance of $\bar X_i - \bar Y_i$ and $\bar X_i + \kappa_i \bar Y_i$ is given by $\sigma_{xi}^2/n_x - \kappa_i \sigma_{yi}^2/n_y$. This motivates us to choose the weight $\kappa_i = (\gamma_y \sigma_{xi}^2)/(\gamma_x \sigma_{yi}^2)$, which leads to zero correlation, a crucial property for simplifying the model and facilitating the methodological development. Finally, the difference and weighted sum are standardized to make the statistics comparable across tests. Combining the above considerations, we propose to use the following pair of statistics to summarize the information in the data:
$$(T_{1i}, T_{2i}) = \sqrt{\frac{n_x n_y}{n}}\left(\frac{\bar X_i - \bar Y_i}{\sigma_{pi}},\; \frac{\bar X_i + \kappa_i \bar Y_i}{\sqrt{\kappa_i}\,\sigma_{pi}}\right), \quad (2.2)$$
where $\sigma_{pi}^2 = \gamma_y\sigma_{xi}^2 + \gamma_x\sigma_{yi}^2$. Denote $T_1 = (T_{1i}: 1 \le i \le m)$ and $T_2 = (T_{2i}: 1 \le i \le m)$.

2.2. A bivariate random mixture model

We develop a bivariate model to describe the joint distribution of $T_{1i}$ and $T_{2i}$. Let $\theta_i = (\theta_{1i}, \theta_{2i})$. Assume that the $\theta_i$ are independent and identically distributed (i.i.d.) bivariate random vectors that take values in the Cartesian product space $\{0,1\}^2 = \{(0,0),(0,1),(1,0),(1,1)\}$. For each combination $\theta_i = (j,k)$, $(T_{1i}, T_{2i})$ are jointly distributed with conditional density $f(t_{1i}, t_{2i} \mid \theta_{1i} = j, \theta_{2i} = k)$. Denote $\pi_{jk} = P(\theta_{1i} = j, \theta_{2i} = k)$. In practice, we do not know $(\theta_{1i}, \theta_{2i})$ but only observe $(T_{1i}, T_{2i})$ from a mixture model
$$f(t_{1i}, t_{2i}) = \sum_{(j,k)\in\{0,1\}^2} \pi_{jk} f(t_{1i}, t_{2i} \mid \theta_{1i} = j, \theta_{2i} = k). \quad (2.3)$$
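As a concrete illustration of the construction in (2.2), the following minimal R sketch generates toy data from model (2.1) and forms the pair $(T_{1i}, T_{2i})$ under known variances. All function and variable names here are illustrative and are not part of the CARS package.

    # Construct the primary/auxiliary pair (2.2) from data matrices X (m x n_x)
    # and Y (m x n_y), assuming the variances sigma2_x, sigma2_y are known.
    make_pair <- function(X, Y, sigma2_x, sigma2_y) {
      n_x <- ncol(X); n_y <- ncol(Y); n <- n_x + n_y
      gamma_x <- n_x / n; gamma_y <- n_y / n
      xbar <- rowMeans(X); ybar <- rowMeans(Y)
      kappa   <- (gamma_y * sigma2_x) / (gamma_x * sigma2_y)  # zero-covariance weight
      sigma_p <- sqrt(gamma_y * sigma2_x + gamma_x * sigma2_y)
      T1 <- sqrt(n_x * n_y / n) * (xbar - ybar) / sigma_p
      T2 <- sqrt(n_x * n_y / n) * (xbar + kappa * ybar) / (sqrt(kappa) * sigma_p)
      list(T1 = T1, T2 = T2)
    }

    # Toy data from model (2.1): zero baseline, sparse mu_x and mu_y
    m <- 5000; n_x <- 50; n_y <- 60
    mu_x <- mu_y <- numeric(m); mu_x[1:100] <- 1; mu_y[101:200] <- 1
    X <- matrix(rnorm(m * n_x, mean = mu_x, sd = 1), m, n_x)
    Y <- matrix(rnorm(m * n_y, mean = mu_y, sd = sqrt(2)), m, n_y)
    pair <- make_pair(X, Y, sigma2_x = rep(1, m), sigma2_y = rep(2, m))

By construction, $T_{1i}$ and $T_{2i}$ computed this way are uncorrelated, and each has unit variance under the corresponding null.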

The goal is to determine the value of $\theta_{1i}$ based on the pairs $\{(T_{1i}, T_{2i}): 1 \le i \le m\}$. The mixture model (2.3) is difficult to analyze. However, if $T_{1i}$ and $T_{2i}$ are carefully constructed as done in Section 2.1, then several simplifications can be made. First, the logical relationship (1.2) implies that $\pi_{10} = 0$; thus we only have three terms in (2.3). Second, according to our construction (2.2), $T_{1i}$ and $T_{2i}$ are conditionally independent:
$$f(t_{1i}, t_{2i} \mid \mu_{xi}, \mu_{yi}) = f(t_{1i} \mid \mu_{xi}, \mu_{yi})\, f(t_{2i} \mid \mu_{xi}, \mu_{yi}). \quad (2.4)$$
The next proposition utilizes (2.4) to further simplify the model.

Proposition 1. The conditional independence (2.4) implies that $f(t_{1i}, t_{2i} \mid \theta_{1i} = 0, \theta_{2i} = 0) = f(t_{1i} \mid \theta_{1i} = 0)\,f(t_{2i} \mid \theta_{2i} = 0)$; $f(t_{1i}, t_{2i} \mid \theta_{1i} = 0, \theta_{2i} = 1) = f(t_{1i} \mid \theta_{1i} = 0)\,f(t_{2i} \mid \theta_{1i} = 0, \theta_{2i} = 1)$; and
$$f(t_{1i}, t_{2i} \mid \theta_{1i} = 0) = f(t_{1i} \mid \theta_{1i} = 0)\, f(t_{2i} \mid \theta_{1i} = 0). \quad (2.5)$$

The last equation shows that $T_{1i}$ and $T_{2i}$ are independent under the null hypothesis $H_{i,0}: \theta_{1i} = 0$. This is a critical result for our later methodological and theoretical developments. The joint density is given by
$$f(t_{1i}, t_{2i}) = \pi_{00} f(t_{1i} \mid \theta_{1i} = 0) f(t_{2i} \mid \theta_{2i} = 0) + \pi_{01} f(t_{1i} \mid \theta_{1i} = 0) f(t_{2i} \mid \theta_{1i} = 0, \theta_{2i} = 1) + \pi_{11} f(t_{1i}, t_{2i} \mid \theta_{1i} = 1, \theta_{2i} = 1). \quad (2.6)$$

2.3. Problem formulation

Our goal is to make inference on $\theta_{1i} = I(\mu_{xi} \neq \mu_{yi})$, $1 \le i \le m$, by simultaneously testing m hypotheses $H_{i,0}: \theta_{1i} = 0$ vs $H_{i,1}: \theta_{1i} = 1$. Compared to conventional approaches, we aim to develop methods utilizing the m pairs $\{(T_{1i}, T_{2i}): 1 \le i \le m\}$ instead of a single vector $\{T_{1i}: 1 \le i \le m\}$. This new problem can be recast and solved in the framework of multiple testing with a covariate: $T_{1i}$ is viewed as the primary statistic for assessing significance, and $T_{2i}$ is viewed as a covariate to assist inference by providing supporting information.

The concepts of error rate and power are similar to those in the conventional settings. A multiple testing procedure is represented by a thresholding rule of the form
$$\delta = \{\delta_i = I(S_i < t): i = 1,\ldots,m\} \in \{0,1\}^m, \quad (2.7)$$
where $\delta_i = 1$ if we reject hypothesis i and $\delta_i = 0$ otherwise. Here $S_i$ is a significance index that ranks the hypotheses from the most significant to the least significant, and t is a threshold. In large-scale testing problems, the false discovery rate (FDR, Benjamini and Hochberg, 1995) has been widely used to control the inflation of Type I errors. For a given decision rule $\delta = (\delta_i: 1 \le i \le m)$ of the form (2.7), the FDR is defined as
$$\mathrm{FDR}_\delta = E\left[\frac{\sum_{i=1}^m (1-\theta_{1i})\delta_i}{(\sum_{i=1}^m \delta_i) \vee 1}\right], \quad (2.8)$$
where $x \vee y = \max(x,y)$. A closely related concept is the marginal false discovery rate (mFDR), which is defined by
$$\mathrm{mFDR}_\delta = \frac{E\{\sum_{i=1}^m (1-\theta_{1i})\delta_i\}}{E(\sum_{i=1}^m \delta_i)}. \quad (2.9)$$
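In simulations where the true indicators $\theta_{1i}$ are known, the realized analogues of (2.8) and of the expected number of true positives can be computed directly; averaging them over replications approximates the FDR and the ETP defined below. A small illustrative R helper (hypothetical names, not part of the CARS package):

    # False discovery proportion and true-positive count of a 0/1 decision vector
    # `delta`, given the true non-null indicators `theta1` (available in simulations).
    fdp_tp <- function(delta, theta1) {
      R <- sum(delta)
      V <- sum(delta * (1 - theta1))   # number of false rejections
      c(FDP = V / max(R, 1), TP = sum(delta * theta1))
    }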

Genovese and Wasserman (2002) showed that $\mathrm{mFDR} = \mathrm{FDR} + O(m^{-1/2})$ when the BH procedure (Benjamini and Hochberg, 1995) is applied to m independent tests. We use the mFDR mainly for technical considerations to obtain the optimality result. Proposition 7 in Section 7.2 gives sufficient conditions under which the mFDR and FDR are asymptotically equivalent and shows that the conditions are fulfilled by our proposed method. Define the expected number of true positives $\mathrm{ETP}_\delta = E(\sum_{i=1}^m \theta_{1i}\delta_i)$. Other related power measures include the missed discovery rate (MDR, Taylor et al., 2005), average power (Efron, 2007) and false non-discovery/negative rate (FNR, Genovese and Wasserman, 2002; Sarkar, 2002). Our optimality result is developed based on the mFDR and ETP. We call a multiple testing procedure valid if it controls the mFDR at the nominal level and optimal if it has the largest ETP among all valid mFDR procedures.

3. Oracle and Data-driven Procedures

The basic framework of our methodological developments is explained as follows. We first consider an ideal situation where an oracle knows all parameters in model (2.6). Section 3.1 derives an oracle procedure. Sections 3.2 and 3.3 discuss an approximation strategy and related estimation methods. The data-driven procedure and extensions are presented in Sections 3.5 and 3.6.

3.1. Oracle mFDR procedure with pairs of observations

Let $\pi_j = P(\theta_{ji} = 1)$, $j = 1, 2$. Then the marginal density function for $T_{ji}$ is defined as $f_j = (1-\pi_j)f_{j0} + \pi_j f_{j1}$, where $f_{j0}$ and $f_{j1}$ are the null and non-null densities of $T_{ji}$, respectively. Conventional FDR procedures, which are developed based on a vector of p-values or z-values, are essentially univariate inference procedures that only utilize the information in $T_{1i}$. Define the local false discovery rate (Lfdr, Efron et al., 2001)
$$\mathrm{Lfdr}_1(t_1) = \frac{(1-\pi_1)f_{10}(t_1)}{f_1(t_1)}, \quad (3.1)$$
where the subscript 1 indicates a quantity associated with $T_{1i}$. It was shown in Sun and Cai (2007) that the optimal univariate mFDR procedure is a thresholding rule of the form
$$\delta(\mathrm{Lfdr}_1, c) = [I\{\mathrm{Lfdr}_1(t_{1i}) < c\}: 1 \le i \le m], \quad (3.2)$$
where $0 \le c \le 1$ is a cutoff. Denote $Q_{LF}(c)$ the mFDR level of $\delta(\mathrm{Lfdr}_1, c)$. Let $c^* = \sup\{c: Q_{LF}(c) \le \alpha\}$ be the largest cutoff under the mFDR constraint. Then $\delta(\mathrm{Lfdr}_1, c^*)$ is optimal among all univariate mFDR procedures in the sense that it has the largest ETP subject to $\mathrm{mFDR} \le \alpha$.

The next theorem derives an oracle procedure for mFDR control with pairs of observations. We shall see that the performance of the optimal univariate procedure can be greatly improved by exploiting the information in $T_{2i}$. The oracle procedure under the bivariate model (2.6) has two important components: an oracle statistic $T_i^*$ that optimally pools information from both $T_{1i}$ and $T_{2i}$, and an oracle threshold $\lambda^*$ that controls the mFDR with the largest ETP.

Theorem 1. Suppose $(T_{1i}, T_{2i})$ follow model (2.6). Let
$$q^*(t_2) = (1-\pi_1) f(t_2 \mid \theta_{1i} = 0). \quad (3.3)$$

Define the oracle statistic
$$T_i^*(t_1, t_2) = P(\theta_{1i} = 0 \mid T_{1i} = t_1, T_{2i} = t_2) = \frac{q^*(t_2)\, f_{10}(t_1)}{f(t_1, t_2)}, \quad (3.4)$$
where $f(t_1, t_2)$ is the joint density given by (2.6). Then

(a) For $0 < \lambda \le 1$, let $Q^*(\lambda)$ be the mFDR level of the testing rule $\{I(T_i^* < \lambda): 1 \le i \le m\}$. Then $Q^*(\lambda) < \lambda$ and $Q^*(\lambda)$ is non-decreasing in $\lambda$.

(b) Suppose we choose $\alpha \le \bar\alpha \equiv Q^*(1)$. Then the oracle threshold $\lambda^* = \sup\{\lambda: Q^*(\lambda) \le \alpha\}$ exists uniquely and $Q^*(\lambda^*) = \alpha$. Furthermore, define the oracle rule $\delta^* = (\delta_i^*: i = 1,\ldots,m)$, where
$$\delta_i^* = I(T_i^* < \lambda^*). \quad (3.5)$$
Then $\delta^*$ is optimal in the sense that $\mathrm{ETP}_\delta \le \mathrm{ETP}_{\delta^*}$ for any $\delta$ in $\mathcal{D}_\alpha$, where $\mathcal{D}_\alpha$ is the collection of all testing rules such that $\mathrm{mFDR}_\delta \le \alpha$.

Remark 3. The oracle statistic $T_i^*$ is the posterior probability that $H_{i,0}$ is true given the pair of primary and auxiliary statistics. It serves as a significance index providing evidence against the null. Section 3.2 gives a precise characterization of $q^*(t_2)$ and shows that it roughly describes how frequently a $T_{2i}$ from the null distribution would fall into the neighborhood of $t_2$. The estimation of $T_i^*$ and $q^*(t_2)$ is discussed in Section 3.3.

Remark 4. Theorem 1 indicates that pooling non-informative auxiliary data would not result in efficiency loss. Consider the worst scenario suggested by a reviewer where $T_{2i}$ is completely non-informative: $n_x = n_y$, $\sigma_{xi}^2 = \sigma_{yi}^2$ and $\mu_{xi} = -\mu_{yi}$. In Section A.9 of the Supplementary Material, we show that under the above conditions $T_i^*$ reduces to the Lfdr statistic (3.1), and the oracle (bivariate) procedure coincides with the optimal univariate rule (3.2). Contrary to the common intuition that incorporating $T_{2i}$ would negatively affect the performance, Theorem 1 says that our oracle procedure dominates the optimal univariate procedure (3.2). Hence the power will never be decreased by pooling the auxiliary information. Numerical evidence is provided in Section B.2 in the supplement.

The oracle rule (3.5) motivates us to consider a stepwise procedure that operates in two steps: ranking and thresholding. The ranking step orders all hypotheses from the most significant to the least significant according to $T^*$, and the thresholding step identifies the largest threshold along the ranking subject to the mFDR constraint. Specifically, denote $T^*_{(1)} \le \ldots \le T^*_{(m)}$ the ordered oracle statistics and $H_{(1)}, \ldots, H_{(m)}$ the corresponding hypotheses. The step-wise procedure operates as follows:
$$\text{Let } k = \max\Big\{j: \hat Q^*(j) = j^{-1}\sum_{i=1}^{j} T^*_{(i)} \le \alpha\Big\}. \quad \text{Reject } H_{(1)}, \ldots, H_{(k)}. \quad (3.6)$$
The quantity $\hat Q^*(j)$, which is the running average of the top j ordered statistics, gives a consistent estimate of the actual FDR level [cf. Sun and Cai (2007)]. Thus the stepwise algorithm (3.6) identifies the largest threshold subject to the FDR constraint. The hypotheses whose $T^*$ values are less than the threshold are claimed to be significant.
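The step-wise rule (3.6) simply sorts the statistics and finds the largest rejection set whose running average stays below $\alpha$. A minimal R sketch of this generic ranking-and-thresholding step (written for any significance statistic taking small values for significant cases; the function name is illustrative):

    # Step-wise procedure (3.6): rank hypotheses by `stat` (small = significant)
    # and reject the largest top set whose running average is at most alpha.
    rank_and_threshold <- function(stat, alpha = 0.05) {
      ord <- order(stat)
      running_avg <- cumsum(stat[ord]) / seq_along(stat)
      k <- max(c(0, which(running_avg <= alpha)))
      reject <- rep(FALSE, length(stat))
      if (k > 0) reject[ord[seq_len(k)]] <- TRUE
      reject
    }

Applied to the oracle statistics $T_i^*$ this reproduces (3.6); the same helper can be reused with the estimated statistics of Section 3.5.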

3.2. Approximating $T^*$ via screening

The oracle statistic $T_i^*$ is unknown and needs to be estimated in practice. However, standard methods do not work well for the bivariate model. For example, the popular EM algorithm usually requires the specification of a parametric form of the non-null distribution; this is often impractical in large-scale studies where little is known about the alternative. Moreover, existing estimators often suffer from low accuracy and convergence issues when signals are very sparse. To overcome the difficulties in estimation, we propose a new test statistic $T_{\tau,i}$ that only involves quantities that can be well estimated from data. The new statistic provides a good approximation to $T_i^*$ and guarantees the FDR control.

In (3.4), the null density $f_{10}$ is known by construction. The bivariate density $f(t_1, t_2)$ can be well estimated using a standard kernel method [Silverman (1986); Wand and Jones (1995)]. Hence we shall focus on the quantity $q^*(t_2)$. Suppose we are interested in counting how frequently a $T_{2i}$ from the null distribution (i.e. $\theta_{1i} = 0$) would fall into an interval in the neighborhood of $t_2$:
$$Q^*(t_2, h) = \#\{i: T_{2i} \in [t_2 - h/2,\, t_2 + h/2] \text{ and } \theta_{1i} = 0\}/m.$$
The quantity is relevant because $q^*(t_2) = \lim_{h\to 0} E\{Q^*(t_2, h)\}/h$. The counting task is difficult as we do not know the value of $\theta_{1i}$. Our idea is to first apply a screening method to select the nulls (i.e. $\theta_{1i} = 0$), and then construct an estimator based on the selected cases. Denote $P_i = 2\Phi(-|t_{1i}|)$ the two-sided p-value associated with $T_{1i} = t_{1i}$. When signals are sparse, most large p-values should come from the null distribution. Hence for a large $\tau$, say $\tau = 0.9$, we would reasonably predict that $\theta_{1i} = 0$ if $P_i > \tau$. It follows that we may do the counting based on those $T_{2i}$ with large p-values:
$$Q^\tau(t_2, h) = \frac{\#\{i: T_{2i} \in [t_2 - h/2,\, t_2 + h/2] \text{ and } P_i > \tau\}}{m(1-\tau)}. \quad (3.7)$$
The adjustment $(1-\tau)$ in the denominator comes from the fact that we have only utilized roughly $100(1-\tau)\%$ of the data when counting the frequency. Let $A_\tau = \{t_1: 2\Phi(-|t_1|) > \tau\}$. Using $Q^\tau$ to replace $Q^*$, a sensible approximation of $q^*(t_2)$ would be
$$q^\tau(t_2) = \lim_{h\to 0}\frac{E\{Q^\tau(t_2, h)\}}{h} = \frac{\int_{A_\tau} f(t_1, t_2)\,dt_1}{1-\tau}. \quad (3.8)$$
Intuitively, a large $\tau$ would yield a sample that is close to one generated from a pure null distribution and thus reduce the bias $q^\tau(t_2) - q^*(t_2)$. Our theory reveals that the bias is always positive (Proposition 2), and would decrease in $\tau$ (Proposition 4). However, a larger $\tau$ would increase the variability of our proposed estimator (as we have fewer samples to construct the estimator), affecting the testing procedure adversely. The bias-variance tradeoff is further discussed in Section 3.4. Substituting $q^\tau(t_2)$ in place of $q^*(t_2)$, we obtain the following approximation of $T_i^*$:
$$T_{\tau,i}(t_1, t_2) = \frac{q^\tau(t_2)\, f_{10}(t_1)}{f(t_1, t_2)}. \quad (3.9)$$
Some important properties of the approximation in (3.9) are summarized in the next proposition, which shows that $T_{\tau,i}$ always overestimates $T_i^*$. This implies that if we substitute $T_{\tau,i}$ in place of $T_i^*$ in (3.6), then $\hat Q^*(j)$ will be overestimated as well. Hence the approximation leads to fewer rejections and a conservative FDR level.

Proposition 2. (a) $T_i^*(t_1, t_2) \le T_{\tau,i}(t_1, t_2)$ for all $\tau$. (b) Let $\delta_\tau$ be the decision rule that substitutes $T_{\tau,i}$ in place of $T_i^*$ in (3.6). Then both the FDR and mFDR levels of $\delta_\tau$ are controlled below level $\alpha$.
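The screened count in (3.7) and its limiting approximation (3.8) can be computed directly once the pair $(T_{1i}, T_{2i})$ is available. A minimal R sketch (illustrative names; the window width h is a user choice):

    # Screening-based count (3.7): among tests whose two-sided p-value exceeds tau,
    # count the T2-values falling in a window around t2, with the 1/(1 - tau) adjustment.
    Q_tau <- function(T1, T2, t2, h, tau = 0.9) {
      pval <- 2 * pnorm(-abs(T1))
      keep <- pval > tau                      # predicted nulls (theta_1i = 0)
      sum(keep & abs(T2 - t2) <= h / 2) / (length(T1) * (1 - tau))
    }
    # For small h, Q_tau(T1, T2, t2, h) / h approximates q^tau(t2) in (3.8).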

3.3. Estimation of the test statistic

We now turn to the estimation of $T_{\tau,i}$. By our construction, the null density $f_{10}(t_1)$ is known. The bivariate density $f(t_1, t_2)$ can be estimated using a kernel method [Silverman (1986); Wand and Jones (1995)]:
$$\hat f(t_1, t_2) = m^{-1}\sum_{i=1}^{m} K_{h_1}(t_1 - t_{1i})\, K_{h_2}(t_2 - t_{2i}), \quad (3.10)$$
where $K(t)$ is a kernel function, $h_1$ and $h_2$ are the bandwidths, and $K_h(t) = h^{-1}K(t/h)$. To estimate $q^\tau(t_2)$, we first carry out a screening procedure to obtain the sample $\mathcal{T}(\tau) = \{i: P_i > \tau\}$, and then apply kernel smoothing to the selected observations:
$$\hat q^\tau(t_2) = \frac{\sum_{i\in\mathcal{T}(\tau)} K_{h_2}(t_2 - t_{2i})}{m(1-\tau)}. \quad (3.11)$$
The next proposition shows that $\hat q^\tau(\cdot)$ converges to $q^\tau(\cdot)$ in the $L_2$ norm.

Proposition 3. Consider $q^\tau$ and $\hat q^\tau$ respectively defined in (3.8) and (3.11). Assume that (i) $q^\tau(\cdot)$ is bounded and has continuous first and second derivatives; (ii) the kernel K is a positive, bounded and symmetric function satisfying $\int K(t)\,dt = 1$, $\int tK(t)\,dt = 0$ and $\int t^2K(t)\,dt < \infty$; and (iii) $f_2^{(2)}(t_2 \mid \tau) = \int_{t_1\in A_\tau} f_2^{(2)}(t_2 \mid t_1)\, f_1(t_1)\,dt_1$ is square integrable, where $f_2(t_2 \mid t_1)$ is the conditional density of $T_2$ given $T_1$ and the superscript (2) denotes the second derivative with respect to $t_2$. Then with the common choice of bandwidth $h \asymp m^{-1/6}$, we have
$$E\|\hat q^\tau - q^\tau\|^2 = E\int \{\hat q^\tau(x) - q^\tau(x)\}^2\,dx \to 0.$$
Combining the above results, we propose to estimate $T_\tau$ by the following statistic
$$\hat T_\tau(t_1, t_2) = \frac{\hat q^\tau(t_2)\, f_{10}(t_1)}{\hat f(t_1, t_2)} \wedge 1, \quad (3.12)$$
where $\hat q^\tau(t_2)$ and $\hat f(t_1, t_2)$ are respectively given in (3.11) and (3.10), and $x \wedge y = \min(x, y)$.

Remark 5. In our proposed estimator, the same bandwidth $h_2$ has been used for the kernels in both (3.10) and (3.11). Utilizing the same bandwidth across the numerator and denominator of (3.12) has no impact on the asymptotics, but is beneficial for increasing the stability of our estimator. Some practical guidelines for the implementation of CARS are provided in Section 5.1.

3.4. A refined estimator

This section develops a consistent estimator of $q^*(t_2)$. The proposed estimator is important for the optimality theory in Section 3.5. However, it is computationally intensive and has limited efficiency gain. In practice we still recommend the simple estimator (3.11). This section may be skipped by readers who are mainly interested in the methodology.

We state in the next proposition some theoretical properties of the approximation error $q^\tau(t_2) - q^*(t_2)$; these properties are helpful to motivate the new estimator and prove its consistency. Let the CDF of the p-value associated with $T_{1i}$ be $G(\tau) = (1-\pi_1)\tau + \pi_1 G_1(\tau)$, where $G_1$ is the alternative CDF. Denote g and $g_1$ the corresponding density functions.

Proposition 4. Consider $T_\tau$ defined in (3.9).
(a) Denote $B_q(\tau) = \int |q^\tau(t_2) - q^*(t_2)|\,dt_2$ the total approximation error. If $G_1(\cdot)$ is concave, then $B_q(\tau)$ decreases in $\tau$.
(b) If $\lim_{x\to 1} g_1(x) = 0$, then $\lim_{\tau\to 1} q^\tau(t_2) = q^*(t_2)$.

Remark 6. The assumption that $G_1(\cdot)$ is concave has been commonly used in the FDR literature (Storey, 2002; Genovese and Wasserman, 2002, 2004). A generalized version of the assumption is the monotone likelihood ratio condition (MLR, Sun and Cai, 2007). The assumption that $\lim_{x\to 1} g_1(x) = 0$ is also a typical condition (Genovese and Wasserman, 2004). It essentially says that the null cases are dominant (or the null distribution is pure) on the right of the p-value histogram.

It follows from Proposition 4 that a large $\tau$ is helpful to reduce the bias and that the bias converges to zero as $\tau \to 1$. However, a large $\tau$ would increase the variance of our estimator (3.11), which is constructed using the sample $\mathcal{T}(\tau) = \{i: P_i > \tau\}$. To address the bias-variance tradeoff, we propose to first obtain $\hat q^\tau$ for a range of $\tau$'s, say $\{\tau_1,\ldots,\tau_k\}$, and then use a smoothing method to obtain the limiting value of $\hat q^\tau$ as $\tau \to 1$. This approach aims to borrow strength from the entire sample to minimize the bias without blowing up the variance.

Specifically, let $\tau_0 < \tau_1 < \cdots < \tau_k$ be ordered and equally spaced points in the interval (0,1). Denote $\hat q^{\tau_j}(t_2)$ the estimates from (3.11), $j = 1,\ldots,k$. We propose to obtain the local linear kernel estimator $\hat q^*(t_2) \equiv \hat q\{\tau = 1; \hat q^{\tau_1}(t_2),\ldots,\hat q^{\tau_k}(t_2)\}$ as the height of the fit $\hat\beta_0$, where $(\hat\beta_0, \hat\beta_1)$ minimizes the weighted kernel least squares $\sum_{j=1}^{k}\{\hat q^{\tau_j} - \beta_0 - \beta_1\tau_j\}^2 K_{h_\tau}(\tau_j - \tau_k)$. For a given integer r, denote $\hat s_r = k^{-1}\sum_{j=1}^{k}(\tau_j - 1)^r K_{h_\tau}(\tau_j - \tau_k)$. It can be shown (e.g. Wand and Jones, 1995, pp. 119) that
$$\hat q^*(t_2) = \frac{k^{-1}\sum_{j=1}^{k}\{\hat s_2 - \hat s_1(\tau_j - \tau_k)\}\, K_{h_\tau}(\tau_j - \tau_k)\, \hat q^{\tau_j}(t_2)}{\hat s_2\hat s_0 - \hat s_1^2}. \quad (3.13)$$
The next proposition shows that $\hat q^*(t_2)$ is a consistent estimator of $q^*(t_2)$.

Proposition 5. Consider $\hat q^\tau$ and $\hat q^*$ that are respectively defined in (3.11) and (3.13). Denote $q^{\tau_k,(2)}(t_2) = (d/d\tau)^2 q^{\tau}(t_2)\big|_{\tau = \tau_k}$. Assume the following conditions hold: (i) $q^{\tau_k,(2)}(t_2)$ is square integrable; (ii) $K(\cdot)$ is symmetric about zero and is supported on $[-1,1]$; and (iii) the bandwidth $h_\tau$ is a sequence satisfying $h_\tau \to 0$ and $kh_\tau \to \infty$ as $k \to \infty$. Moreover, assume that Conditions (i) to (iii) in Proposition 3 and Condition (b) in Proposition 4 hold. We have
$$E\|\hat q^* - q^*\|^2 = E\int\{\hat q^*(x) - q^*(x)\}^2\,dx \to 0 \quad \text{as } (m,k)\to\infty. \quad (3.14)$$

3.5. The CARS procedure

The estimated statistics $\hat T_{\tau,i}$ will be used as a significance index to rank the relative importance of all hypotheses. Motivated by the stepwise algorithm (3.6), we propose the following covariate-assisted ranking and screening (CARS) procedure.

Procedure 1 (CARS procedure). Consider model (2.6) and the estimated statistics $\hat T_{\tau,i}$ in (3.12). Denote $\hat T_{\tau,(1)} \le \cdots \le \hat T_{\tau,(m)}$ the ordered statistics and $H_{(1)},\ldots,H_{(m)}$ the corresponding hypotheses. Let $k = \max\{j: j^{-1}\sum_{i=1}^{j}\hat T_{\tau,(i)} \le \alpha\}$. Then reject $H_{(1)},\ldots,H_{(k)}$.
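A compact R sketch of the whole data-driven procedure combines the kernel estimates (3.10)-(3.11), the plug-in statistic (3.12) and the step-wise rule of Procedure 1. It uses naive O(m^2) Gaussian product-kernel evaluations and simple normal-reference bandwidths, so it is illustrative only; the CARS package instead selects bandwidths via the np package and screens with the Lfdr rather than the p-values (see Section 5.1).

    # Minimal sketch of the data-driven CARS procedure (Procedure 1).
    cars_sketch <- function(T1, T2, alpha = 0.05, tau = 0.9) {
      m  <- length(T1)
      h1 <- 1.06 * sd(T1) * m^(-1/6)                 # simple bandwidth choices
      h2 <- 1.06 * sd(T2) * m^(-1/6)
      # (3.10): bivariate product-kernel density at the observed pairs (naive O(m^2))
      K1 <- outer(T1, T1, function(a, b) dnorm(a - b, sd = h1))
      K2 <- outer(T2, T2, function(a, b) dnorm(a - b, sd = h2))
      f_hat <- rowMeans(K1 * K2)
      # (3.11): kernel smoothing over the screened subset {P_i > tau}
      pval  <- 2 * pnorm(-abs(T1))
      keep  <- pval > tau
      q_hat <- rowSums(K2[, keep, drop = FALSE]) / (m * (1 - tau))
      # (3.12): CARS statistic, truncated at 1; the null density f_10 is N(0, 1)
      T_hat <- pmin(q_hat * dnorm(T1) / f_hat, 1)
      # Procedure 1: step-wise ranking and thresholding
      ord <- order(T_hat)
      k   <- max(c(0, which(cumsum(T_hat[ord]) / seq_len(m) <= alpha)))
      reject <- rep(FALSE, m)
      if (k > 0) reject[ord[seq_len(k)]] <- TRUE
      reject
    }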

To ensure good performance of the data-driven procedure, we require the following conditions on the estimated quantities:

(C1) $E\|\hat q^\tau - q^\tau\|^2 \to 0$.
(C1') $E\|\hat q^* - q^*\|^2 \to 0$.
(C2) $E\|\hat f - f\|^2 = E\left[\int\int\{\hat f(t_1,t_2) - f(t_1,t_2)\}^2\,dt_1\,dt_2\right] \to 0$.

Remark 7. Proposition 3 shows that (C1) is satisfied by the proposed estimator (3.11). Proposition 5 shows that (C1') is satisfied by the smoothing estimator (3.13) under stronger assumptions. Finally, (C2) is satisfied by the standard choice of bandwidths $h_{t_1} \asymp m^{-1/6}$, $h_{t_2} \asymp m^{-1/6}$; see, for example, page 111 in Wand and Jones (1995) for a proof.

The asymptotic properties of the CARS procedure are established by the next theorem.

Theorem 2 (Asymptotic validity and optimality of CARS).
(a) If Conditions (C1) and (C2) hold, then both the mFDR and FDR of the CARS procedure are controlled at level $\alpha + o(1)$.
(b) If Conditions (C1') and (C2) hold, and we substitute $\hat q^*$ in (3.13) in place of $\hat q^\tau$ in (3.12) to compute $\hat T_\tau$, then the FDR level of the CARS procedure is $\alpha + o(1)$. Moreover, denote $\mathrm{ETP}_{CARS}$ and $\mathrm{ETP}^*$ the ETP levels of CARS and the oracle procedure, respectively. Then we have $\mathrm{ETP}_{CARS}/\mathrm{ETP}^* = 1 + o(1)$.

3.6. The case with unknown variances and non-Gaussian errors

For two-sample tests with unknown and unequal variances, we can estimate $\sigma_{xi}^2$ and $\sigma_{yi}^2$ by $S_{xi}^2 = n_x^{-1}\sum_{j=1}^{n_x}(X_{ij} - \bar X_i)^2$ and $S_{yi}^2 = n_y^{-1}\sum_{j=1}^{n_y}(Y_{ij} - \bar Y_i)^2$, respectively. Let $\hat\kappa_i = (\gamma_y S_{xi}^2)/(\gamma_x S_{yi}^2)$ and $S_{pi}^2 = \gamma_y S_{xi}^2 + \gamma_x S_{yi}^2$. The following pair will be used to summarize the information in the sample:
$$(T_{1i}, T_{2i}) = \sqrt{\frac{n_x n_y}{n}}\left(\frac{\bar X_i - \bar Y_i}{S_{pi}},\; \frac{\bar X_i + \hat\kappa_i \bar Y_i}{\sqrt{\hat\kappa_i}\,S_{pi}}\right). \quad (3.15)$$
For the case with unknown but equal variances (e.g. $\sigma_{xi}^2 = \sigma_{yi}^2$), we modify (3.15) as follows. First, $\hat\kappa_i$ is replaced by $\kappa = \gamma_y/\gamma_x$. Second, $\sigma_{pi}^2$ is instead estimated by $S_{pi}^2 = \gamma_x S_{xi}^2 + \gamma_y S_{yi}^2$. Finally, $T_{1i}$ and $T_{2i}$ are plugged into (3.12) to compute the CARS statistic, which is further employed to implement Procedure 1.

$T_{1i}$ and $T_{2i}$ are not strictly independent when estimated variances are used. We conduct simulation studies in Section 5.3 to investigate the performance of this plug-in methodology and find that CARS controls the FDR at the nominal level with substantial power gain over competing methods. The empirical results are supported by the following proposition, which shows that $T_{1i}$ and $T_{2i}$ are asymptotically independent under the null.

Proposition 6. Consider model (2.1). Assume that the error terms (possibly non-Gaussian) of $X_{ij}$ and $Y_{ik}$ have symmetric distributions and finite fourth moments. Then $(T_{1i}, T_{2i})$ defined in (3.15) are asymptotically independent when $H_{i,0}: \mu_{xi} = \mu_{yi}$ is true.
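A short R sketch of the plug-in construction (3.15), mirroring the known-variance pair (2.2) but with sample variances substituted (the function name is illustrative; the equal-variance modification described above is not included):

    # Plug-in pair (3.15): estimate the variances and form the two statistics.
    make_pair_plugin <- function(X, Y) {
      n_x <- ncol(X); n_y <- ncol(Y); n <- n_x + n_y
      gamma_x <- n_x / n; gamma_y <- n_y / n
      xbar <- rowMeans(X); ybar <- rowMeans(Y)
      S2x <- rowMeans((X - xbar)^2)            # n_x^{-1} * sum_j (X_ij - Xbar_i)^2
      S2y <- rowMeans((Y - ybar)^2)
      kappa_hat <- (gamma_y * S2x) / (gamma_x * S2y)
      S_p <- sqrt(gamma_y * S2x + gamma_x * S2y)
      T1 <- sqrt(n_x * n_y / n) * (xbar - ybar) / S_p
      T2 <- sqrt(n_x * n_y / n) * (xbar + kappa_hat * ybar) / (sqrt(kappa_hat) * S_p)
      list(T1 = T1, T2 = T2)
    }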

The expression for the asymptotic variance-covariance matrix, which is given in Section A.6 of the Supplementary Material, reveals that the asymptotic independence holds for non-Gaussian errors as long as the error distributions are symmetric. Our simulation results in Section 5.3 confirm that unknown variances and non-Gaussian errors have almost no impact on the performance of CARS. Therefore the plug-in methods are recommended for practical applications. The case with skewed distributions requires further research, and a full theoretical justification of the CARS methodology in this setting is still an open question.

4. Extensions and Connections with Existing Work

This section considers the extension of our theory to a general bivariate model. The results in the general model provide a unified theoretical framework for understanding different testing strategies, which helps gain insights into the connections between existing methods. We substitute $(T_i, S_i)$ in place of $(T_{1i}, T_{2i})$ in this section. This change reflects a more flexible view of the auxiliary covariate: $S_i$ can be either continuous or discrete, from either internal or external data, and we do not explicitly estimate the joint density of $T_i$ and $S_i$ as done in previous sections. Sections 4.1 to 4.5 assume that $T_i$ follows a continuous distribution with a known density under the null; the case with an unknown null density is discussed in Section 4.6. We allow $S_i$ to be either continuous or categorical and hence eliminate the notation $\theta_{2i}$. (Previously $\theta_{2i}$ denoted the union support, which is only needed when the distribution of $T_{2i}$ has a point mass.) As a result, the subscript 1 in $\theta_{1i}$ is suppressed for notational convenience, where $\theta_i = 0/1$ stands for a null/non-null case.

4.1. A general bivariate model

Suppose $T_i$ and $S_i$ follow a joint distribution $F_i(t, s)$. The optimal (oracle) testing rule is given by the next theorem, which can be proven similarly to Theorem 1.

Theorem 3. Define the oracle statistic under the general model
$$T_G^*(t, s) = P(\theta_i = 0 \mid T_i = t, S_i = s). \quad (4.1)$$
Denote $Q_G^*(\lambda)$ the mFDR level of $\delta(T_G^*, \lambda)$, where $\delta(T_G^*, \lambda) = \{I(T_G^*(t_i, s_i) < \lambda): 1 \le i \le m\}$. Let $\lambda^* = \sup\{\lambda \in (0,1): Q_G^*(\lambda) \le \alpha\}$. Define the oracle mFDR procedure under the general model as $\delta_G^* = \delta(T_G^*, \lambda^*)$. Then $\delta_G^*$ is optimal in the sense that for any $\delta$ such that $\mathrm{mFDR}_\delta \le \alpha$, we always have $\mathrm{ETP}_\delta \le \mathrm{ETP}_{\delta_G^*}$.

The oracle procedure is only of theoretical interest because it is difficult to estimate $T_{G,i}^*$ in a general bivariate model. In some special cases, $T_G^*$ can be approximated well. For example, $T_\tau$ provides a good approximation to $T_G^*$ under the bivariate model (2.3) and the conditional independence assumption (2.5). When $S_i$ is categorical, the oracle procedure can also be approximated well. This important special case is discussed next.

4.2. Discrete case: multiple testing with groups

We now consider a special case where the auxiliary covariate $S_i$ is categorical. A concrete scenario is the multi-group random mixture model first introduced in Efron (2008); see also Cai and Sun (2009). The model is useful for handling large-scale testing problems where data are collected from heterogeneous sources. Correspondingly, the m hypotheses may

be divided into, say, K groups that exhibit different characteristics. Let $S_i$ denote the group membership. Assume that $S_i$ takes values in $\{1,\ldots,K\}$ with prior probabilities $\{\pi^1,\ldots,\pi^K\}$. Consider the following conditional distributions:
$$(T_i \mid S_i = k) \sim f_1^k = (1-\pi_1^k)f_{10}^k + \pi_1^k f_{11}^k, \quad (4.2)$$
for $k = 1,\ldots,K$, where $\pi_1^k$ is the proportion of non-null cases in group k, $f_{10}^k$ and $f_{11}^k$ are the null and non-null densities of $T_i$, and $f_1^k = (1-\pi_1^k)f_{10}^k + \pi_1^k f_{11}^k$ is the mixture density for all observations in group k. The model allows the conditional distributions in (4.2) to vary across groups; this is desirable in practice when groups are heterogeneous.

Remark 8. In Section A.10 in the supplement, we present a simple example to show that $T_i$ alone is not sufficient (as insightfully pointed out by a reviewer), whereas $(T_i, S_i)$ is sufficient. $S_i$ is ancillary in the sense that its value is determined by an external process independent from the main parameter. Contrary to the common intuition that $S_i$ is useless for inference, our analysis reveals that $S_i$ can be informative. The phenomenon is referred to as the ancillarity paradox because, to quote Lehmann (page 420, Lehmann and Casella, 2006), the distribution of the ancillary, which should not affect the estimation of the main parameter, has an enormous effect on the properties of the standard estimator. A related phenomenon in the estimation context was discussed by Foster and George (1996). See also the seminal work by Brown (1990) for a paradox in multiple regression.

The problem of multiple testing with groups and related problems have received substantial attention in the literature (Efron, 2008; Ferkingstad et al., 2008; Cai and Sun, 2009; Hu et al., 2010; Liu et al., 2016; Barber and Ramdas, 2017). It can be shown that under model (4.2), the oracle statistic (4.1) reduces to the conditional local false discovery rate (CLfdr, Cai and Sun 2009): $\mathrm{CLfdr}_i = (1-\pi_1^k)f_{10}^k(t_i)/f_1^k(t_i)$ for $S_i = k$, $k = 1,\ldots,K$. The CLfdr statistic can be accurately estimated from data when the number of tests is large in the separate groups. However, the CLfdr statistic cannot be well estimated when the number of groups is large, or when $S_i$ becomes a continuous variable. Important recent progress on exploiting grouping and hierarchical structures among hypotheses under more generic settings has been made in Liu et al. (2016), wherein an interesting decomposition of the oracle statistic (4.1) was derived. The decomposition provides substantial insights into how the auxiliary information (grouping effects $S_i$) and the primary statistics (individual effects $T_i$) interplay in simultaneous testing with side information.

The result on multiple testing with groups motivates an interesting strategy to approximate the oracle rule. For a continuous auxiliary covariate $S_i$, we can first discretize $S_i$, then divide the hypotheses into groups according to the discrete variable, and finally apply group-wise multiple testing procedures. This idea is closely related to the uncorrelated screening (US) method in Liu (2014); the connection is discussed next.

4.3. Discretization and uncorrelated screening

The idea in Liu (2014) involves discretizing a continuous $S_i$ as a binary variable. Define index sets $G_1 = \{1 \le i \le m: S_i > \lambda\}$ and $G_2 = \{1 \le i \le m: S_i \le \lambda\}$, where the tuning parameter $\lambda$ divides the $t_{1i}$'s into two groups: $T_1(G_1) = \{t_i: i \in G_1\}$ and $T_1(G_2) = \{t_i: i \in G_2\}$. The uncorrelated screening (US, Liu 2014) method operates in two steps.
First, the BH method (Benjamini and Hochberg, 1995) is applied at level α to the two groups separately, and then the rejected hypotheses from two groups are combined as the final

output. The tuning parameter $\lambda$ is chosen such that it yields the largest number of rejections (two groups combined). US is closely related to the separate analysis strategy proposed in Efron (2008). The key difference is that the groups are known a priori in Efron (2008), whereas the groups are chosen adaptively in Liu (2014). The main merit of US is that the screening statistic is constructed to be uncorrelated with the test statistic, which ensures that the selection bias issue can be avoided. Moreover, the divide-and-test strategy combines the results from both groups; this is different from conventional independent filtering approaches [Bourgon et al. (2010)], in which one group is completely filtered out.

We now compare different methods under a unified framework. Both US and CARS can be viewed as approximations of the oracle rule (4.1). The goal is to borrow information from the external covariate $S_i$ to improve the efficiency of simultaneous inference. US adopts the divide-and-test strategy and only models $S_i$ as a binary variable. It suffers from information loss in the discretization step. This sheds light on the superiority of CARS, for it fully utilizes the data by modeling $S_i$ as a continuous variable. The general framework suggests several directions in which US may be improved. First, the information in $S_i$ may be better exploited, e.g. by creating more groups. However, it remains unknown how to choose the optimal number of groups. Second, US tests the hypotheses at FDR level $\alpha$ for both groups. However, Cai and Sun (2009) showed that the choice of identical FDR levels across groups can be suboptimal. In order to maximize the overall power, different group-wise FDR levels should be chosen. However, no matter how smart a divide-and-test strategy may be, discretizing a continuous covariate would inevitably result in information loss and hence will be outperformed by CARS.

4.4. The pooling within strategy for information integration

In Philosophy and Principles of Data Analysis (Tukey, 1994), Tukey coined two witty terms for effective information integration strategies: borrowing strength and pooling within. The idea of borrowing strength, which was investigated extensively and systematically by researchers in both the simultaneous estimation and multiple testing fields, has led to a number of impactful theories and methodologies exemplified by the James-Stein estimator (James and Stein, 1961) and the local false discovery rate methodology (Efron et al., 2001). By contrast, the direction of pooling within has been less explored. Tukey described it as a two-step strategy that involves first gathering quantitative indications from within different parts of the data, and then pooling these indications into a single overall index (Tukey, 1994, pp. 278). Our work formalizes a theoretical framework for the pooling within idea in the context of two-sample multiple testing. We not only develop practical principles on the construction of multiple indications from within the data (i.e. independent and comparable pairs), but also derive an overall index (i.e. the oracle statistic) that optimally combines the evidence exhibited by both statistics. Our work differs in several ways from existing works on multiple testing with covariates (Ferkingstad et al., 2008; Zablocki et al., 2014; Scott et al., 2015).
First, the covariate in other works is collected externally from other data sources, whereas the auxiliary information in our work is gathered internally within the primary data set. Second, in other works it has been assumed that the null density would not be affected by the external covariate. However, the assumption should be scrutinized in practice as it may not always hold. Under our testing framework, the requirement of a fixed null density is formalized as the conditional independence between the primary and auxiliary statistics. The conditional independence is proposed as a principle for information extraction, and is automatically

fulfilled by our approach to constructing the auxiliary sequence. CARS makes several new methodological and theoretical contributions. First, under a decision-theoretic framework, the oracle CARS procedure is shown to be optimal for information pooling. Second, existing methodologies on testing with covariates are mostly developed under the Bayesian computational framework and lack theoretical justifications. By contrast, our data-driven CARS procedure is a non-parametric method and enjoys nice asymptotic properties. Such FDR theories, as far as we know, are new in the literature. Third, the screening approach employed by CARS reveals interesting connections between sparsity estimation and multiple testing with covariates, which is elaborated next.

4.5. Capturing sparsity information via screening

A celebrated finding in the FDR literature is that incorporating the estimated proportion of non-nulls [$\pi_1 = P(\theta_i = 1)$] can improve the power (Benjamini and Hochberg, 2000; Storey, 2002; Genovese and Wasserman, 2002). In light of $S_i$, the proportion becomes heterogeneous; hence it is desirable to utilize the conditional proportions $\pi_1(s) = P(\theta_i = 1 \mid S_i = s)$ to improve the power of existing methods (Zablocki et al., 2014; Scott et al., 2015). In a similar vein, earlier works on multiple testing with groups (or discrete $S_i$) reveal that varied sparsity levels across groups can be exploited to construct more powerful methods (Ferkingstad et al., 2008; Cai and Sun, 2009; Hu et al., 2010). Estimating $\pi_1(s)$ with a continuous covariate is a challenging problem. Most existing works (Zablocki et al., 2014; Scott et al., 2015) employ Bayesian computational algorithms that do not provide theoretical guarantees. Notable progress has been made by Boca and Leek (2017). However, their theory requires a correct specification of the underlying regression model, which cannot be checked in practice. Next we discuss how the screening idea in Sections 3.3 and 3.4 can be extended to derive a simple and elegant non-parametric estimator of $\pi_1(s)$.

Fig. 1 gives a graphical illustration of the proposed estimator. We generate $m = 10^5$ tests with $\bar X_i \sim N(0,1)$ and $\bar Y_i \sim 0.8N(0,1) + 0.2N(2,1)$; hence $T_i = (1/\sqrt{2})(\bar X_i - \bar Y_i)$ and $S_i = (1/\sqrt{2})(\bar X_i + \bar Y_i)$. Suppose we are interested in counting how many $S_i$ would fall into the interval $[t_2 - h, t_2 + h]$ with $t_2 = 2$ and $h = 0.3$. The counts are represented by vertical bars in Panel (a) for each p-value interval. As shown in Proposition 1, $T_i$ and $S_i$ are independent under the null [cf. Equation (2.5)]. Therefore we can see that the counts of $S_i$ are roughly uniformly distributed when the p-value of $T_i$ is large. Expanding the interval $[t_2 - h, t_2 + h]$ to the entire real line (which corresponds to discarding the information in $S_i$), we obtain the histogram of all p-values [Panel (b)].

We start with a description of a classical estimator [Schweder and Spjøtvoll (1982); Storey (2002)] for $\pi_1$; see Langaas et al. (2005) for an elaborated discussion of various extensions. Let $Q(\tau) = \#\{P_i > \tau\}$. Then by Fig. 1(b), the expected counts covered by the light grey bars to the right of the threshold $\tau$ can be approximated as $E\{Q(\tau)\} = m(1-\pi_1)(1-\tau)$. Setting the expected and actual counts equal, we obtain $\hat\pi_1^\tau = 1 - Q(\tau)/\{m(1-\tau)\}$. Next we consider the conditional proportion $\pi_1(s) = P(\theta_i = 1 \mid S_i = s)$. Assume that $\pi_1(s)$ and $f(s)$, the density of $S_i$, are constant in a small neighborhood $[s-h/2, s+h/2]$.
Then the expected counts of the p-values from the null distribution in the interval $[s-h/2, s+h/2]$ can be approximated by $Q^\tau(s,h) \approx \{1-\pi_1(s)\}f(s)h$. The other way of counting can be done using (3.7) in Section 3.2. In obtaining (3.7), we exploit the fact that the counts of $S_i$ are roughly uniformly distributed to the right of the threshold $\tau$. Taking the limit, we obtain $\pi_1(s) = 1 - \{f(s)\}^{-1}\lim_{h\to 0} Q^\tau(s,h)/h = 1 - q^\tau(s)/f(s)$.

Fig. 1. A graphical illustration of the smoothing estimator (4.3). (a) Counts of $T_{2i}$ in $[t_2-h, t_2+h]$, plotted against the p-value of $T_1$: the counts of $S_i$ are roughly uniformly distributed on the right; the bias decreases and the variability increases as $\tau$ increases. (b) Histogram of all p-values: similarly, the p-values are approximately uniformly distributed on the right.

Finally, utilizing the screening approach (3.11), we propose the following nonparametric smoothing estimator:
$$\hat\pi_1^\tau(s) = 1 - \frac{\sum_{i\in\mathcal{T}(\tau)} K_h(s - s_i)}{(1-\tau)\sum_{i=1}^{m} K_h(s - s_i)}. \quad (4.3)$$
Choosing the tuning parameter $\tau$ is an important issue but goes beyond the scope of the current work; see Storey (2002) and Langaas et al. (2005) for further discussions.

4.6. Heterogeneity, correlation and empirical null

Conventional FDR analyses treat all hypotheses exchangeably. However, the hypotheses become unequal in light of $S_i$, and it is desirable to incorporate, for example, the varied conditional proportions in a testing procedure to improve the efficiency. This section further discusses the case where the heterogeneity is reflected by disparate null densities. A key principle in our construction is that the primary and auxiliary statistics are conditionally independent under the null. However, in many applications where the auxiliary information is collected from external data, $S_i$ may be correlated with $T_i$. Then the FDR procedure may become invalid if $S_i$ is incorporated inappropriately. For example, if the grouping variable $S_i$ is correlated with the p-value, then applying the Benjamini-Hochberg (BH) procedure to the hypotheses in separate groups would be problematic because the null distributions of the p-values in some groups may no longer be uniform. A partial solution to resolve the issue is to estimate the empirical null distributions (Efron, 2004; Jin and Cai, 2007) for the different groups instead of using the theoretical null distribution directly. The theory and methodology in Efron (2008) and Cai and Sun (2009), which allow the use of varied empirical nulls across different groups, can be applied to control the FDR. However, as we previously mentioned, discretizing a continuous $S_i$ fails to fully utilize the auxiliary information. The estimation of the empirical null with a continuous $S_i$ is an interesting problem for future research. The nonparametric smoothing idea in (4.3) might be helpful but additional difficulties may arise. For example, Efron (2004) suggests using the central part of the z-value histogram to estimate the empirical null, wherein a key assumption is that the non-nulls are sparse. However, the sparsity assumption may not hold simultaneously for all $\pi_1(s)$, causing identifiability issues. The limitations of the current methodology and open questions will be discussed in Section 6.
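Returning to the estimator (4.3) derived in Section 4.5, a minimal R sketch of both the global Storey-type estimator and its covariate-conditional version is given below, with a Gaussian kernel and a normal-reference bandwidth as illustrative choices:

    # Global estimator: pi1_hat = 1 - #{P_i > tau} / {m (1 - tau)}
    pi1_global <- function(pval, tau = 0.9) {
      1 - sum(pval > tau) / (length(pval) * (1 - tau))
    }
    # Conditional proportion (4.3): kernel-weighted version at covariate value s0
    pi1_conditional <- function(s0, S, pval, tau = 0.9, h = bw.nrd(S)) {
      w_all  <- dnorm(s0 - S, sd = h)          # K_h(s0 - S_i), i = 1, ..., m
      w_null <- w_all[pval > tau]              # screened subset T(tau)
      1 - sum(w_null) / ((1 - tau) * sum(w_all))
    }

When the kernel weights are replaced by the constant 1, the conditional estimator reduces to the global one, which mirrors the relationship between (4.3) and the classical estimator.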

5. Numerical Results

This section investigates the numerical performance of CARS using both simulated and real data. We compare the oracle and data-driven CARS procedures (the latter denoted by DD) with existing methods, including the Benjamini-Hochberg procedure (BH, Benjamini and Hochberg, 1995), the adaptive z-value procedure (AZ, Sun and Cai 2007), and the uncorrelated screening procedure (US, Liu 2014). We first describe the implementation of CARS in Section 5.1. Sections 5.2 and 5.3 respectively consider (i) the case with known and unequal variances and (ii) the case with estimated variances and non-Gaussian errors. An application to supernova detection is discussed in Section 5.4. Additional numerical results, including the non-informative case (Section B.2), the completely informative case (Section B.3) and dependent tests (Section B.6), are provided in Section B of the Supplementary Material.

5.1. The implementation and R-package CARS

The R-package CARS has been developed to implement the proposed method. This section describes some practical guidelines and details for the implementation in our package. In estimating (3.11), the CARS package uses the Lfdr as the screening statistic (as opposed to the p-values), with the default cutoff of $\tau = 0.9$. A correction factor similar to $1-\tau$ is needed and can be easily computed from data. The related formulae and computational details are described in Section A.11 in the Supplementary Material. The p-values are used to describe the methodology because the p-value description is simpler and more intuitive, and can reveal important connections to classical ideas (Figure 1). There is no essential difference between the p-value and the Lfdr in the screening step. The main practical advantage of screening via the Lfdr is the stability of tuning for finite samples. The intuition is that the Lfdr statistic, which contains information about the sparsity and the non-null density, provides a testing rule that is more adaptive to the data. Section B.4 in the Supplementary Material shows that the choice of $\tau$ has relatively little effect on the performance of CARS when the Lfdr is used. In general a larger $\tau$ provides slightly better error control, which is consistent with our theory in Proposition 4. However, to avoid increased variability in the estimated quantities, we do not recommend very large $\tau$'s. As a rule of thumb, our practical recommendation is to take $\tau = 0.9$ to balance the interests of precise error control and stability of the methodology. This is the default choice in our package and has been used in all our simulations.

The R package np is used to choose the bandwidths $h_1$ and $h_2$ in equation (3.10). We have adopted two strategies to improve the performance. First, the bandwidths $h_1$ and $h_2$ are chosen based on the normal reference rule restricted to the samples with $\mathrm{Lfdr}_1(t_{1i}) < 0.5$. The restriction leads to more stable and informative bandwidths as this subset is a more relevant part of the sample for the multiple testing problem. Second, the same $h_2$ has been used in (3.11) to obtain the numerator of (3.12). This strategy is helpful for increasing the stability of the estimator [cf. Remark 5]. It is important to note that these strategies are just practical guidelines for finite samples. The asymptotic theories are not affected.

5.2. Simulation study I: known and unequal variances

Consider model (2.1). Denote $\mu_{i_1:i_2} = (\mu_{i_1},\ldots,\mu_{i_2})$ the vector of consecutive entries from $i_1$ to $i_2$ in a vector $\mu = (\mu_1,\ldots,\mu_m)$.
The two original samples under the two conditions are denoted by $\{x_1,\ldots,x_{n_x}\}$ and $\{y_1,\ldots,y_{n_y}\}$, respectively, with corresponding population

means $\mu_x$ and $\mu_y$. Let $\epsilon_{xij} \sim N(0,1)$ and $\epsilon_{yik} \sim N(0,2)$. We set $n_x = 50$ and $n_y = 60$. In all simulations, the number of tests is $m = 5000$ and the nominal FDR level is 0.05. The following settings are considered in our simulation study.

Setting 1: we set $\mu_{x,1:k} = 5/\sqrt{30}$, $\mu_{x,(k+1):(2k)} = 4/\sqrt{30}$, $\mu_{x,(2k+1):m} = 0$, $\mu_{y,1:k} = 3/\sqrt{30}$, $\mu_{y,(k+1):(2k)} = 4/\sqrt{30}$ and $\mu_{y,(2k+1):m} = 0$. Here k denotes the sparsity level: the proportion of locations with differential effects is $k/m$, and the proportion of nonzero locations is $(2k)/m$. We vary k from 100 to 1000 to investigate the effect of sparsity.

Setting 2: we use $k_1$ and $k_2$ to denote the number of nonzero locations and the number of locations with differential effects, respectively. Let $\mu_{x,1:k_2} = 5/\sqrt{30}$, $\mu_{x,(k_2+1):k_1} = 4/\sqrt{30}$, $\mu_{x,(k_1+1):m} = 0$, $\mu_{y,1:k_2} = 2/\sqrt{30}$, $\mu_{y,(k_2+1):k_1} = 4/\sqrt{30}$ and $\mu_{y,(k_1+1):m} = 0$. We fix $k_1 = 2000$ and vary $k_2$ from 100 to $k_1$. This setting investigates how the informativeness of the auxiliary covariate affects the performance of different methods. Note that as $k_2$ increases, the conditional probability $\pi_{1|1} = P(\theta_{1i} = 1 \mid \theta_{2i} = 1)$ also increases, and the auxiliary covariate becomes more informative.

Setting 3: we fix $k = 750$ and set $\mu_{x,1:k} = \mu_0/\sqrt{30}$, $\mu_{x,(k+1):(2k)} = 3/\sqrt{30}$, $\mu_{x,(2k+1):m} = 0$, $\mu_{y,1:k} = 2/\sqrt{30}$, $\mu_{y,(k+1):(2k)} = 3/\sqrt{30}$ and $\mu_{y,(2k+1):m} = 0$. To investigate the impact of the effect sizes, we vary $\mu_0$ from 3.5 to 5.

In Settings 1-3, we apply the different methods to simulated data and obtain the FDR and MDR levels by averaging over 500 replications. Finally, the actual FDR and ETP levels are plotted as functions of the varied parameter values and displayed in Figure 2. We can see that the CARS procedure is more powerful than conventional univariate methods such as BH and AZ, and is superior to US, which only partially utilizes the auxiliary information. A more detailed description of the simulation results is given below.

(a) All methods control the FDR at the nominal level 0.05 approximately. BH is slightly conservative and US is very conservative. CARS sometimes has an FDR level that is a bit high but acceptable. This is mainly due to the inaccuracy in estimating the bivariate density function.

(b) Univariate methods (BH and AZ) are improved by bivariate methods (US and CARS) in most settings. This shows that exploiting the auxiliary information is helpful. Note that in Setting 2, AZ and US have similar ETP levels. However, US has a much smaller FDR level. This shows the usefulness of the auxiliary covariate.

(c) US is uniformly dominated by CARS. This is because US only models $T_{2i}$ as a binary variable whereas CARS fully utilizes the information in $T_{2i}$.

(d) DD has performance similar to the oracle procedure in most settings. However, DD can be conservative in FDR control in some settings and hence has less power compared to the oracle procedure (cf. Setting 1, top row of Figure 2). This has been predicted by our theory (Proposition 2).

(e) From the results under Setting 1, we can see that the gain in efficiency (of bivariate methods over univariate methods) decreases as k increases. This shows that incorporating auxiliary information is more useful in the sparse case. However, our method still gains substantial power over conventional methods in the not-so-sparse case when $\pi_2$ is 40%. In general, the auxiliary variable is useful as long as there is a significant proportion of zeros, such as 50% or 60%.

[Figure 2 about here.]

Fig. 2. Two-sample tests with known variances. The FDR (left column) and average power (right column) are plotted against the non-null proportion (top row, Setting 1), the conditional proportion (middle row, Setting 2) and the effect size µ_0 (bottom row, Setting 3) for DD, BH, the oracle procedure, AZ and US.

(f). The results under Setting 2 show that the efficiency gain of CARS over the competing methods increases as k_2 increases. This makes sense because k_2 is proportional to the conditional probability π_{1|1}, which measures the usefulness/informativeness of the auxiliary covariate.

(g). The results in Setting 3 show that CARS dominates the other methods for all effect sizes. The efficiency gain is large when the signal strength is small to moderate. (Note that in Setting 3, a smaller µ_0 indicates a larger difference in effect sizes.)

5.3. Simulation study II: estimated variances and non-Gaussian errors

We consider simulation settings similar to those of the previous section, with two modifications. First, we substitute estimated variances in place of the known variances. Second, to investigate the performance of our method with non-Gaussian errors, we modify Setting 3 slightly by generating ǫ_{xij} and ǫ_{yik} from t-distributions with df = 4 and df = 5, respectively. The simulation results are summarized in Figure 3. The patterns of the curves are very similar to those observed in the previous simulation study, and our conclusions on the comparison of the different methods remain the same as before. In addition, we have the following observations:

(a). The results from the first two settings support our claim that the plug-in methodology works well in practice with estimated variances. The univariate methods also work well for FDR control.

(b). The results in Setting 3 show that the non-Gaussian errors have little impact on the performance of CARS.

5.4. Application in Supernova Detection

In this section, we apply the CARS procedure to the analysis of time-course satellite imaging data. Figure 4 shows the time-course g-band images of galaxy M101 collected by the Palomar Transient Factory (PTF) survey (Law et al., 2009). The images indicate the appearance of SN 2011fe, one of the brightest supernovae observed to date (Nugent et al., 2011). A major goal of our analysis is to detect discrepancies between images taken over time so that we can narrow down the potential locations of supernova explosions. More accurate measurements and further investigations can then be carried out on the narrowed subset of potential locations.

The satellite data are recorded and converted into grey-scale images (m = 428,796 pixels in total). Each pixel corresponds to a value ranging from 0 to 1 that indicates the intensity level of the influx from stars. We use image 1 as the baseline, whose grey-scale pixel values are subtracted from those in images 2 and 3. These differences are then vectorized as x = (x_1, ..., x_m) and y (representing "image 2 − image 1" and "image 3 − image 1", respectively). We plot the histograms and find that the null distributions of x and y are different. This can be explained by the time lapse between the images (the brightness of these g-band images changes gradually over time). The supernova data contain a significant amount of background noise in each image, and the average magnitude of the background noise varies considerably from image to image. To remove the image-specific background noise, we first estimate the empirical null distributions (Efron, 2004) based on the central part of the histograms.

[Figure 3 about here.]

Fig. 3. Two-sample tests with unknown variances and non-Gaussian errors. The FDR and average power are plotted against the non-null proportion (top row), the conditional proportion (middle row) and the effect size (bottom row) for DD, BH, the oracle procedure, AZ and US.

[Figure 4 about here.]

Fig. 4. From left to right, the images were taken on August 23, 24, and 25, 2011, respectively. The arrows indicate the explosion of SN 2011fe.

Table 1. Thresholds and total number of rejections (in parentheses) of the different testing procedures.

FDR level   BH procedure   Adaptive Z procedure   US (two thresholds)   CARS
0.01%       (4)            (5)                    3.24, 5.46 (22)       (35)
1%          (22)           (24)                   2.51, 4.87 (38)       (58)
5%          (64)           0.39 (69)              1.92, 4.42 (80)       0.26 (109)

The variances of the observations are assumed to be homoscedastic and are estimated using all pixels. For x and y, the estimated empirical null distributions are centred at 0.0028 and 0.0443, respectively. We then standardize the observations to obtain x^st and y^st, which are used in our subsequent analyses. We do not take the difference x − y directly, as this would create many false signals due to the varied magnitudes of the background noise. The standardized measurements x^st and y^st from the two images appear to be comparable, as most pairs (x_i^st, y_i^st) have similar values.

We formulate the problem as a multiple testing problem consisting of m = 428,796 two-sample tests with known variances: for the standardized observations x^st and y^st the variances are known to be 1. We then take t_{1i} = (x_i^st − y_i^st)/√2 and t_{2i} = (x_i^st + y_i^st)/√2, for 1 ≤ i ≤ m. We apply the BH, AZ, US and CARS procedures at FDR levels 0.01%, 1% and 5%. Figure 11 in the Supplementary Material shows the rejected pixels for each method under the different FDR levels. The estimated sparsity levels for t_1 and t_2 are 1.47% and 49.5%, respectively, and the estimated support at τ = 0.5 is correspondingly large. We report the thresholds of the different testing procedures in Table 1. We can see that more information is harvested from the data by using the auxiliary information: power gains are achieved by CARS at all FDR levels. In particular, at FDR level 0.01%, the supernova is missed by BH and AZ but detected by CARS. To further quantify the superiority of our procedure, we also report the total number of rejections in Table 1 (numbers in parentheses). CARS consistently detects more signals from the satellite images than the competing methods.
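The preprocessing described above can be summarized in a few lines of R. In the sketch below the image vectors x and y are replaced by synthetic placeholders and the null standard deviations are stand-in estimates, so the numbers do not reproduce the reported analysis; only the standardization and the construction of the primary/auxiliary pair (t_1, t_2) follow the text.

```r
# Sketch of the supernova preprocessing: standardize each image difference by its
# empirical null and form the primary (difference) and auxiliary (sum) statistics.
# x and y below are synthetic placeholders for the vectorized image differences.
set.seed(1)
m <- 428796
x <- rnorm(m, mean = 0.0028, sd = 0.05)   # placeholder for "image 2 - image 1"
y <- rnorm(m, mean = 0.0443, sd = 0.05)   # placeholder for "image 3 - image 1"

standardize <- function(v, mu0, sd0) (v - mu0) / sd0
x_st <- standardize(x, mu0 = 0.0028, sd0 = sd(x))   # sd(x) stands in for the null sd
y_st <- standardize(y, mu0 = 0.0443, sd0 = sd(y))

t1 <- (x_st - y_st) / sqrt(2)   # primary statistic: standardized difference
t2 <- (x_st + y_st) / sqrt(2)   # auxiliary statistic: standardized sum
```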

6. Discussion

Covariate-assisted multiple testing can be viewed as a special case of a much broader class of problems under the theoretical framework of integrative simultaneous inference, which covers a range of topics including weighted multiple testing (Benjamini and Hochberg, 1997; Basu et al., 2017), partial conjunction tests or set-wise inference (Benjamini and Heller, 2008; Sun and Wei, 2011; Du and Zhang, 2014), and replicability analysis (Heller et al., 2014; Heller and Yekutieli, 2014). A coherent theme in these problems is to combine information from multiple sources to make more informative decisions. Tukey's "pooling within" strategy provides a promising approach in such scenarios, where useful quantitative indications might be hidden in various parts of massive data sets. The current formulations and methodologies in integrative data analysis differ substantially, and a general theory and methodology for handling the various types of problems in a unified framework is yet to be developed. For instance, in weighted FDR analysis (Benjamini and Hochberg, 1997), external domain knowledge is incorporated as weights in modified FDR and power definitions to reflect the varied gains and losses in simultaneous decisions. By contrast, covariate-assisted multiple testing still utilizes unweighted FDR and power definitions. Moreover, in partial conjunction tests and replicability analysis, the summary statistics from different studies are of equal importance, which marks a key difference from covariate-assisted inference, where some statistics are primary while others are secondary. We conclude our discussion with a few more open issues.

Are there better ways to construct the auxiliary sequence? Our theory only shows that CARS is optimal when the pairs are given. How to construct an optimal pair from data is still an open question. For instance, in situations where the two means have opposite signs, the current weighted sum incurs information loss, because the sum of absolute values would better capture the sparsity information. However, the absolute-value approach would give rise to a correlated pair, which cannot be handled by the current testing framework. Moreover, for observations with non-Gaussian errors, the weighted sum needs further modification, as it would lead to an asymptotically correlated pair under the null when the error distribution is skewed (cf. the proof of Proposition 6 in Section A.6).

How to deal with multiple testing dependency? The CARS method cannot be applied to dependent tests, as it assumes that the T_i are independent. Our simulation studies show that CARS controls the FDR under weak dependence. However, this finding is based on very limited empirical studies and lacks theoretical support. An important direction is to develop new theory and methodology for the dependent case.

How to construct the auxiliary sequence in more general settings? This article focuses on two-sample tests. It would be of interest to extend the methodology to simultaneous ANOVA tests. Moreover, CARS provides a useful strategy for extracting the sparsity structure from data. There are other important structures in the data, such as heteroscedasticity, hierarchy and dependency, which may also be captured by an auxiliary sequence. How to extract and incorporate such structural information effectively to improve the power of a testing procedure remains an open question.

How to summarize the auxiliary information in high-dimensional settings?
The proposed CARS methodology requires the joint modeling of the primary and auxiliary

statistics, which cannot handle many covariate sequences, because the joint density estimator would suffer greatly from the high dimensionality. A fundamental issue is to develop new principles for information pooling, or optimal dimension reduction, in multiple testing with high-dimensional covariates.

How to make inference with multiple sequences? In partial conjunction tests and replicability analysis, an important feature is that the means (of the summary statistics) from the separate studies are individually sparse. We expect that similar strategies for extracting and sharing sparsity information among multiple sequences would improve the accuracy of simultaneous inference. However, as opposed to covariate-assisted inference, where there is a single sequence of primary statistics, in partial conjunction tests and replicability analysis all sequences are of equal importance, which poses new challenges for problem formulation and methodological development.

7. Proofs of Main Theorems

This section proves the main theorems. The proofs of the other propositions are provided in the Supplementary Material.

7.1. Proof of Theorem 1

Proof. We first show that the two expressions of T_i in (3.4) are equivalent. Recall q^*(t_2) = (1 − π_1) f(t_2 | θ_{1i} = 0). Applying Bayes' theorem and using the conditional independence between T_{1i} and T_{2i} under the null θ_{1i} = 0 (Proposition 1), we obtain

T_i(t_1, t_2) = P(θ_{1i} = 0) f(t_1, t_2 | θ_{1i} = 0) / f(t_1, t_2) = q^*(t_2) f_{10}(t_1) / f(t_1, t_2).

Part (a). Let Q^*(t) = α_t. We first show that α_t < t. According to the mfdr definition,

E_{(T_1,T_2)} { Σ_{i=1}^m (T_i − α_t) I(T_i < t) } = 0,   (7.1)

where the subscript (T_1, T_2) indicates that the expectation is taken over the joint distribution of (T_1, T_2). The above equation implies that α_t < t; otherwise all terms in the summation on its left-hand side would be either zero or negative.

Next we show that Q^*(t) is monotone in t. Let Q^*(t_j) = α_j for j = 1, 2. We only need to show that if t_1 < t_2, then α_1 ≤ α_2. We argue by contradiction. If α_1 > α_2, then

(T_i − α_2) I(T_i < t_2) = (T_i − α_2) I(T_i < t_1) + (T_i − α_2) I(t_1 ≤ T_i < t_2)
                          ≥ (T_i − α_1) I(T_i < t_1) + (α_1 − α_2) I(T_i < t_1) + (T_i − α_1) I(t_1 ≤ T_i < t_2).

Taking expectations on both sides and summing over all i, we claim that

E_{T_1,T_2} { Σ_{i=1}^m (T_i − α_2) I(T_i < t_2) } > 0.   (7.2)

The above inequality holds since (i) E_{T_1,T_2}{ Σ_{i=1}^m (T_i − α_1) I(T_i < t_1) } = 0, (ii) E_{T_1,T_2}{ Σ_{i=1}^m (α_1 − α_2) I(T_i < t_1) } > 0, and (iii) E_{T_1,T_2}{ Σ_{i=1}^m (T_i − α_1) I(t_1 ≤ T_i < t_2) } > 0, which follow respectively from (7.1), the assumption that α_1 > α_2, and the fact that α_1 < t_1. However, (7.2) contradicts the definition of α_2, which implies that E_{T_1,T_2}{ Σ_{i=1}^m (T_i − α_2) I(T_i < t_2) } = 0. Hence we must have α_1 ≤ α_2.

Part (b). The oracle threshold is defined as t^* = sup{ t ∈ (0,1) : Q^*(t) ≤ α }. We want to show that at t^* the mfdr level is attained precisely. Let ᾱ = Q^*(1). Part (a) shows that

25 ARS: ovariate-assisted Two-Sample Inference 25 the continuous function Q (t) is non-decreasing. Then we always have Q (t ) = α if α < ᾱ. Define δ = {I(T i < t ) : 1 i m}. Let δ = (δ 1,,δ m ) be an arbitrary decision rule such that mfdr(δ ) α. It follows that E T1,T 2 { m i=1 (Ti α)δ i } = 0 and ET1,T 2 { m i=1 (Ti α)δ i } 0. (7.3) ombining the two results in (7.3), we conclude that E T1,T 2 { m i=1 (δi δ i )(T i α) } 0. (7.4) Next, consider a monotonic transformation of the oracle decision rule δ i = I(T i < t ) via f(x) = (x α)/(1 x) (note that the ( derivative f(x) ) = (1 α)/(1 x) 2 > 0). The oracle decision rule is equivalent to δ i T i = I α < λ 1 T i, where λ = t α 1 t. A useful fact is that α < t < 1. Hence λ > 0. Note that (i) (T α) λ i (1 T) i < 0 if δ i > δ, i and (ii) (T α) λ i (1 T) i > 0 if δ i < δ. i ombining (i) and (ii), we conclude that the following inequality holds for all i: (δ i δ ) { i (T i α) λ (1 T) } i 0. Summing over i and taking expectations, we have [ m { E T1,T 2 (δ i δ ) i (T i α) λ (1 T)} ] i 0. (7.5) i=1 ombining (7.4) & (7.5), we have { m } { m } λ E T1,T 2 (δ i δ )(1 T i ) i E T1,T 2 (δ i δ )(T i i α) 0. i=1 Finally, note that λ > 0 and that the ETP of a decision rule δ = (δ 1,,δ m) is given by E T1,T 2 { m i=1 δi(1 Ti ) }, we conclude that ETP(δ ) ETP(δ ). i= Proof of Theorem 2 Proof. We first provide a summary of useful notations. Q τ (t) = m 1 m i=1 Tτ,i I(Tτ,i < t). This quantity provides a conservative FDR estimate of the decision rule { I(T τ,i < t) : 1 i m}. ˆQ τ (t) = m 1 m τ,i replaced by ˆT i=1. τ,i τ,i ˆT I(ˆT < t). This is the data-driven version of Qτ (t) with T τ,i being Q τ (t) = E{T τ I(T τ < t)} denotes the limiting value of Q τ (t), where T τ is a generic member from the sample {T i : 1 i m}. t τ = sup{t (0,1) : Q τ (t) α}. Part (a). In Proposition 5, we show that δ τ is conservative in mfdr control. To establish the desired property in mfdr control, we only need to show that mfdr(δ DD) = mfdr(δ τ )+o(1). Define a continuous version of Q τ (t) as follows. For T τ,(k) < t Tτ,(k+1), let whereq τ k = Q τ ( T τ,(k) Q τ (t) = t T τ,(k) T τ,(k+1) T τ,(k) Q τ k + Tτ,(k+1) t T τ,(k+1) T τ,(k) Q τ k+1, (7.6) ). ItiseasytoseethatQ τ (t)iscontinuousandmonotone. Hencetheinverse ofq τ (t), denotedq τ, 1, is well defined. Moreover, Q τ, 1 is continuous and monotone. We can similarly define a continuous version of ˆQ τ (t), denoted by ˆQ τ (t). ˆQτ (t) is continuous and monotone;

26 26 ai & Sun & Wang so does its inverse [ and δ τ DD = I {ˆTτ,i ˆQ τ, 1 ( ). By construction, we have δ τ = [ I { T } ] τ,i : 1 i m. We will show that τ, 1 ˆQ (α) (i) Q τ, 1 (α) p t τ and (ii) Qτ, 1 (α) } : 1 i m ] ˆQ τ, 1 (α) p t τ. (7.7) To show (i), note that the continuity of Q τ, 1 ( ) implies that for any ǫ > 0, we can find η > 0 such that Q τ, 1 (α) Q τ, 1 {Q τ (t τ )} < ǫ if Q τ (t τ ) α < η. Hence P( Q τ (t τ ) α > η) P [ Q τ, 1 (α) Q τ, 1 {Q τ (t τ )} > ǫ ]. Next, by the WLLN Q τ (t) p Q τ (t). Note that Q τ (t τ ) = α, we have P( Q τ (t τ ) α > η) 0. By Markov inequality, we conclude that Q τ, 1 (α) p Q τ, 1 {Q τ (t τ )} = t τ. Next we show (ii). By inspecting the proof of (i), we only need to show that ˆQ τ (t) p Q τ (t). Denote a variable without index i (e.g. ˆTτ and T) τ as a generic member from the sample. It follows from ondition (2) and the continuous mapping theorem that ˆT τ p T. τ Note that both T τ and ˆT τ are bounded above by 1. It follows that E(ˆT τ T) τ 2 0. Let U i = T τ,i I(Tτ,i τ,i τ,i < t) and Ûi = ˆT I(ˆT < t). We will show that E(Ûi Ui)2 = o(1). To see this, consider the following decomposition (Ûi Ui)2 = (ˆT τ T τ ) 2 I(ˆT τ t,t τ t)+(ˆt τ ) 2 I(ˆT τ t,t τ > t) +(T τ ) 2 I(ˆT τ > t,t τ t) = I +II +III. The first term I = o(1) because E(ˆT T τ ) τ 2 0. Let η > 0. Note that T τ is continuous and that ˆT τ p T, τ we have ( ) P(ˆT τ t,t τ > t) P{T τ (t,t+η)}+p ˆT τ T τ > η 0. Since ˆT τ is bounded, we conclude that the second term II = o(1). Similarly we can show that III = o(1). Therefore E(Ûi Ui)2 = o(1). Next we show that ˆQ τ (t) p Q τ (t). Note that Q τ (t) p Q τ (t), we only need to show that ˆQ τ (t) p Q τ (t). The dependence among Ûi in the expression ˆQ τ (t) = m 1 iûi creates some complications. The idea is to apply some standard techniques for the limit of triangular arrays that do not require independence between variables. onsider S n = m i=1 (Ûi Ui). Then E(Sn) = m{e(ûi) E(Ui)}. Applying standard inequalities such as auchy-schwartz, we have E(Ûi U i)(ûj Uj) = o(1). It follows that m 2 var(s n) m 1 {(Ûi } E(Ûi Ui)2 +(1+o(1))E Ui)(Ûj Uj) = o(1). Therefore E{S n E(S n)/n} 2 0. Applying hebyshev s inequality, we obtain m 1 {S n E(S n)} = ˆQ τ (t) Q τ (t) p 0. Therefore ˆQ τ (t) p Q τ (t). By definition, ˆQ τ (t) ˆQ τ (t) m 1. We claim that ˆQ τ (t) p Q τ (t) and (ii) follows. According to (i) and (ii) in (7.7), ˆQ τ, 1 (α) = Q τ, 1 (α)+o P (1). The mfdr levels of the testing procedures are: ( mfdr(δ τ ) = PH 0 T τ,i < Qτ, 1 (α) ) (ˆTτ,i ) τ, 1 P P ( T τ,i < Qτ, 1 (α) ), and mfdr(δ H0 < ˆQ (α) dd ) = (ˆTτ,i ). τ, 1 P < ˆQ (α) TheoperationofourtestingprocedureimpliesthatQ τ, 1 (α) α. Itfollows thatp ( T τ,i is bounded away from zero. We conclude that mfdr(δ dd ) = mfdr(δ τ )+o(1). < Qτ, 1 (α) )

27 ARS: ovariate-assisted Two-Sample Inference 27 The result on mfdr control can be extended to FDR control. The next proposition, which is proved in the Supplementary Material, first gives sufficient conditions under which the mfdr and FDR definitions are asymptotically equivalent, and then verifies that these conditions are fulfilled by the proposed ARS procedure δ τ dd. It follows from the proposition that ARS controls the FDR at level α+o(1). Proposition 7. (a) onsider a general decision rule δ. Let Y = m 1 m i=1δi. Then mfdr(δ) = FDR(δ)+o(1) if (i) E(Y) η for some η > 0, and (ii) Var(Y) = o(1). (b) onditions (i) and (ii) are fulfilled by the ARS procedure δ τ dd. Part (b). The ARS procedure utilizes ˆq, and the corresponding test statistic is. It follows from onditions (1 ) and (2), and the continuous mapping theorem that ˆT p T. Denote Q (t) the oracle mfdr function and t the oracle threshold. Then Q (t) = PH 0(T < t) P(T < t) ˆT,i ˆT,i = E{T I(T < t)}, t = sup{t (0,1) : Q (t) α}. Define ˆQ (t) = m 1 m,i i=1 I(ˆT < t). Similar to (7.6), we define a continuous version of ˆQ (t) and denote it by ˆQ (t). It can be shown that ˆQ (t) is [ continuous and monotone; so does its inverse ˆQ, 1 {ˆT,i } ] (t). The ARS procedure is given by δ, 1 DD = I ˆQ (α) : 1 i m. Following the steps in Part (a) we can show that ˆQ (t) p Q (t), ˆQ, 1 (t) p t. (7.8) The operation of ARS implies that Q, 1 (α) α (thus the denominator of the mfdr is bounded away from zero). Hence mfdr(δ DD) = Q (t )+o(1) = α+o(1). Next, we consider the ETP. It follows from ˆT p T and (7.8) that } ETP(δ, 1 P DD) ETP(δ = H1 {ˆT < ˆQ (α) = 1+o(1). ) P H1 (T < t ) Acknowledgments We thank the Editor, referees and RS for the thorough and useful comments which have greatly helped to improve the presentation of the paper. In particular, we are grateful to several excellent suggestions from the referees that have inspired our discussions on the conditional independence assumption, Tukey s procedures for combination, Brown s ancilarity paradox and the robustness of ARS when pooling non-informative auxiliary data. W. Sun would like to thank Ms. Pallavi Basu from Tel Aviv University for helpful suggestions on theory. Tony ai was supported in part by NSF Grants DMS , DMS and DMS , and NIH Grant R01 A W. Sun was supported in part by NSF Grants DMS-AREER and DMS References Barber, RF. and A. Ramdas (2017+). The p-filter: multi-layer fdr control for grouped hypotheses. Journal of the Royal Statistical Society: Series B (Statistical Methodology) to appear. Basu, P., T. T. ai, K. Das, and W. Sun (2017+). Weighted false discovery control in large-scale multiple testing. Journal of the American Statistical Association. To appear.

28 28 ai & Sun & Wang Benjamini, Y. and R. Heller (2008). Screening for partial conjunction hypotheses. Biometrics 64: Benjamini, Y. and Y. Hochberg (1995). ontrolling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Statist. Soc. B 57, Benjamini, Y. and Y. Hochberg (1997). Multiple hypotheses testing with weights. Scandinavian Journal of Statistics 24: Benjamini, Y. and Y. Hochberg (2000). On the adaptive control of the false discovery rate in multiple testing with independent statistics. Journal of Educational and Behavioral Statistics 25: Boca, S. M. and J. T. Leek (2017). A regression framework for the proportion of true null hypotheses. Preprint. biorxiv: Bourgon, R., R. Gentleman, and W. Huber (2010). Independent filtering increases detection power for high-throughput experiments. Proceedings of the National Academy of Sciences 107(21), Brown, L. D. (1990). An ancillarity paradox which appears in multiple linear regression. The Annals of Statistics 18(2), ai, T. T. and J. Jin (2010). Optimal rates of convergence for estimating the null density and proportion of non-null effects in large-scale multiple testing. Ann. Statist. 38, ai, T. T. and W. Sun (2009). Simultaneous testing of grouped hypotheses: Finding needles in multiple haystacks. J. Amer. Statist. Assoc. 104, alvano, S. E., W. Xiao, D. R. Richards, R. M. Felciano, H. V. Baker, R. J. ho, R. O. hen, B. H. Brownstein, J. P. obb, S. K. Tschoeke, et al. (2005). A network-based analysis of systemic inflammation in humans. Nature 437(7061), Du, L. and. Zhang (2014). Single-index modulated multiple testing. The Annals of Statistics 42(4), Efron, B. (2004). Large-scale simultaneous hypothesis testing: The choice of a null hypothesis. Journal of the American Statistical Association 99(465), Efron, B. (2007). Size, power and false discovery rates. Ann. Statist. 35, Efron, B. (2008). Simultaneous inference: When should hypothesis testing problems be combined? Ann. Appl. Stat. 2, Efron, B., R. Tibshirani, J. D. Storey, and V. Tusher (2001). Empirical Bayes analysis of a microarray experiment. J. Amer. Statist. Assoc. 96, Ferkingstad, E., A. Frigessi, H. Rue, G. Thorleifsson, and A. Kong (2008). Unsupervised empirical bayesian multiple testing with external covariates. The Annals of Applied Statistics 2, Foster, D. P. and E. I. George (1996). A simple ancillarity paradox. Scandinavian journal of statistics 23(2), Genovese,. and L. Wasserman (2002). Operating characteristics and extensions of the false discovery rate procedure. J. R. Stat. Soc. B 64, Genovese,. and L. Wasserman (2004). A stochastic process approach to false discovery control. Ann. Statist. 32:

29 ARS: ovariate-assisted Two-Sample Inference 29 Haupt, J., R. M. astro, and R. Nowak (2011). Distilled Sensing: Adaptive Sampling for Sparse Detection and Estimation. Information Theory, IEEE Transactions on 57(9), Heller, R., M. Bogomolov, and Y. Benjamini (2014). Deciding whether follow-up studies have replicated findings in a preliminary large-scale omics study. Proceedings of the National Academy of Sciences 111(46), Heller, R. and D. Yekutieli (2014). Replicability analysis for genome-wide association studies. The Annals of Applied Statistics 8(1), Hu, J. X., H. Zhao, and H. H. Zhou (2010). False discovery rate control with groups. Journal of the American Statistical Association 105(491), James, W. and. Stein (1961). Estimation with quadratic loss. In Proceedings of the fourth Berkeley symposium on mathematical statistics and probability, Volume 1, pp Jin, J. and T. T. ai (2007). Estimating the null and the proportional of nonnull effects in largescale multiple comparisons. J. Amer. Statist. Assoc. 102, Langaas, M., B. H. Lindqvist, and E. Ferkingstad (2005). Estimating the proportion of true null hypotheses, with application to dna microarray data. J. Roy. Statist. Soc. B 67, Law, N. M., S. R. Kulkarni, R. G. Dekany, E. O. Ofek, R. M. Quimby, P. E. Nugent, J. Surace,.. Grillmair, J. S. Bloom, M. M. Kasliwal, et al. (2009). The palomar transient factory: system overview, performance, and first results. Publications of the Astronomical Society of the Pacific 121(886), Lehmann, E. L. and G. asella (2006). Theory of point estimation. Springer Science & Business Media. Liu, W. (2014). Incorporation of sparsity information in large-scale multiple two-sample t tests. arxiv preprint arxiv: Liu, Y., S. K. Sarkar, and Z. Zhao (2016). A new approach to multiple testing of grouped hypotheses. Journal of Statistical Planning and Inference 179, Meinshausen, N. and J. Rice (2006). Estimating the proportion of false null hypotheses among a large number of independently tested hypotheses. Ann. Statist. 34, Nugent, P. E., M. Sullivan, S. B. enko, R.. Thomas, D. Kasen, D. A. Howell, D. Bersier, J. S. Bloom, S. R. Kulkarni, M. T. Kandrashoff, A. V. Filippenko, J. M. Silverman, G. W. Marcy, A. W. Howard, H. T. Isaacson, K. Maguire, N. Suzuki, J. E. Tarlton, Y.-. Pan, L. Bildsten, B. J. Fulton, J. T. Parrent, D. Sand, P. Podsiadlowski, F. B. Bianco, B. Dilday, M. L. Graham, J. Lyman, P. James, M. M. Kasliwal, N. M. Law, R. M. Quimby, I. M. Hook, E. S. Walker, P. Mazzali, E. Pian, E. O. Ofek, A. Gal-Yam, and D. Poznanski (2011, 12). Supernova sn 2011fe from an exploding carbon-oxygen white dwarf star. Nature 480(7377), Reiner-Benaim, A., D. Yekutieli, N. E. Letwin, G. I. Elmer, N. H. Lee, N. Kafkafi, and Y. Benjamini (2007). Associating quantitative behavioral traits with gene expression in the brain: searching for diamonds in the hay. Bioinformatics 23(17), Rubin, D., S. Dudoit, and M. Van der Laan (2006). A method to increase the power of multiple testing procedures through sample splitting. Statistical Applications in Genetics and Molecular Biology 5(1), Article 19. Sarkar, S. K. (2002). Some results on false discovery rate in stepwise multiple testing procedures. Ann. Statist. 30,

30 30 ai & Sun & Wang Schweder, T. and E. Spjøtvoll (1982). Plots of p-values to evaluate many tests simultaneously. Biometrika 69: Scott, J. G., R.. Kelly, M. A. Smith, P. Zhou, and R. E. Kass (2015). False discovery rate regression: an application to neural synchrony detection in primary visual cortex. Journal of the American Statistical Association 110(510), Silverman, B. W. (1986). Density estimation for statistics and data analysis, R press. Wand, M. and M. Jones (1995). Kernel smoothing, hapman & Hall, London. Skol, A. D., L. J. Scott, G. R. Abecasis, and M. Boehnke (2006). Joint analysis is more efficient than replication-based analysis for two-stage genome-wide association studies. Nat. Genet. 38, Storey, J. D. (2002). A direct approach to false discovery rates. J. R. Stat. Soc. B 64, Sun, W. and T. T. ai (2007). Oracle and adaptive compound decision rules for false discovery rate control. J. Amer. Statist. Assoc. 102, Sun, W. and Z. Wei (2011). Large-scale multiple testing for pattern identification, with applications to time-course microarray experiments. J. Amer. Statist. Assoc. 106, Taylor, J., R. Tibshirani, and B. Efron (2005). The miss rate for the analysis of gene expression data. Biostatistics 6(1), Tukey, J. W. (1994). The collected works of John W. Tukey, Volume 3. Taylor & Francis. Wasserman, L. and K. Roeder (2009). High-dimensional variable selection. Ann. Statist. 37, Zehetmayer, S., P. Bauer, and M. Posch (2005). Two-stage designs for experiments with a large number of hypotheses. Bioinformatics 21(19), Zehetmayer, S., P. Bauer, and M. Posch (2008). Optimized multi-stage designs controlling the false discovery or the family-wise error rate. Stat. Med. 27(21), Zablocki, R. W., A. J. Schork, R. A. Levine, O. A. Andreassen, A. M. Dale, and W. K. Thompson (2014). ovariate-modulated local false discovery rate for genome-wide association studies. Bioinformatics 30(15),

31 ARS: ovariate-assisted Two-Sample Inference 1 Supplementary Material for ARS: ovariate Assisted Ranking and Screening for Large-Scale Two-Sample Inference This supplement contains the proofs of other theoretical results (Section A) and additional numerical results (Section B). A. Proofs of Other Results A.1. Proof of Proposition 1 Proof. Let A x = {x : x 0}. According to the definition of θ ji and the conditional independence assumption (2.4), we have f(t 1i,t 2i θ 1i = 0,θ 2i = 0) = f(t 1i,t 2i µ xi = 0,µ yi = 0) = f(t 1i µ xi = 0,µ yi = 0)f(t 2i µ xi = 0,µ yi = 0) = f(t 1i θ 1i = 0,µ xi = µ yi = 0)f(t 2i θ 2i = 0,µ xi = µ yi = 0) = f(t 1i θ 1i = 0)f(t 2i θ 2i = 0). The last equality holds because given θ ji = 0, t ji are just linear combinations of random errors ǫ xi and ǫ yi, which are independent of µ xi and µ yi, j = 1,2. Denote G µx ( ) the distribution function of µ xi. Then f(t 1i,t 2i θ 1i = 0,θ 2i = 1) = f(t 1i,t 2i µ xi = µ yi = x)dg µx (x) A x = f(t 1i θ 1i = 0,µ xi = µ yi = x)f(t 2i µ xi = µ yi = x)dg µx (x) A x = f(t 1i θ 1i = 0) f(t 2i θ 1i = 0,µ xi = µ yi = x)dg µx (x) A x = f(t 1i θ 1i = 0)f(t 2i θ 1i = 0,θ 2i = 1). For the third equality, note that t 1i are independent of both µ xi and µ yi given θ 1i = 0, whereas t 2i are correlated with µ xi and µ yi given that θ 1i = 0. Finally, f(t 1i,t 2i θ 1i = 0) = f(t 1i,t 2i θ 1i = 0,θ 2i = 0)P(θ 2i = 0 θ 1i = 0) +f(t 1i,t 2i θ 1i = 0,θ 2i = 1)P(θ 2i = 1 θ 1i = 0) = f(t 1i θ 1i = 0){f(t 2i θ 1i = 0,θ 2i = 0)P(θ 2i = 0 θ 1i = 0) +f(t 2i θ 1i = 0,θ 2i = 1)P(θ 2i = 1 θ 1i = 0)} = f(t 1i θ 1i = 0)f(t 2i θ 1i = 0). Hence T 1i and T 2i are independent under the null hypothesis θ 1i = 0. A.2. Proof of Proposition 2 Proof. Part (a). We only need to show that q τ (t 2 ) q (t 2 ) for all τ. We first split the joint density into two parts and then use the independence between T 1i and T 2i under

32 2 ai & Sun & Wang the null: q τ (t 2 ) = (1 τ) 1 t 1 A τ {f(t 1,t 2 θ 1i = 0)P(θ 1i = 0)+f(t 1,t 2 θ 1i = 1)P(θ 1i = 1)}dt 1 (1 τ) 1 t 1 A τ q (t 2 )f 10 (t 1 )dt 1 = q (t 2 ). The last equality holds because t 1 A τ f 10 (t 1 )dt 1 = 1 τ, which cancels with (1 τ) 1. Part (b). Let R be the index set of hypotheses rejected by δ τ. The FDR of δτ is { FDR(δ τ ) = E i R (1 θ } ( ) 1i) 1 = E T1,T2 T i R 1 R 1 For all i R I, we have T i Tτ,i. It follows that ( FDR(δ τ 1 ) E T1,T 2 R 1 i R T τ,i ) i R α. The last inequality is due to the operation of δ τ, which guarantees that { R 1} 1 i R T τ,i α for all realizations of (T 1,T 2 ). The result on mfdr holds since { } ( ) E (1 θ 1i ) E T1,T 2 T τ,i αe T1,T 2 ( R 1). i R i R A.3. Proof of Proposition 3 Let Ñ = m G(τ) be the expected size of the screening set. Define f 2 (t τ 2 ) = (1/Ñ) K h2 (t 2 T 2i ). T 2i T 2(τ) Then ˆq τ (t 2 ) = G(τ) 1 τ f τ 2 (t 2 ). Define the conditional density of T 2i given that P i > τ f τ 2 (t 2 ) = f(t 2 P i > τ) = G(τ) 1 A τ f(t 1,t 2 )dt 1. Then q τ (t 2 ) = G(τ) 1 τ fτ 2 (t 2). It follows that E ˆq τ q τ 2 = { G(τ) 1 τ } 2 E f 2 τ f2 τ 2, For a fixed τ, we only need to show that E f 2 τ f2 τ 2 2dt2 = E { fτ 2 (t 2 ) f2 (t τ 2 )} 0.

33 ARS: ovariate-assisted Two-Sample Inference 3 Let E T2 T 1 denote the conditional expectation that is taken over T 2i s while holding T 1i s fixed. The conditional density of T 2 given T 1 is denoted f 2 (t 2 T 1i ). We will work on the bias and variance in turn. First, E T2 T 1 {K h2 (t 2 T 2i )} = f 2 (t 2 T 1i )+ h2 2 2 f(2) 2 (t 2 T 1i ) z 2 K(z)dz +o(h 2 2 ). Now consider E{ f 2 (t τ 2)} = E T1 [E T2 T 1 { f 2 (t τ 2)}]. It follows that [ E{ f 2 (t τ 2)} = (1/Ñ)E m { T 1 i=1 f 2 (t 2 T 1i )+ h2 2 2 f (2) 2 (t 2 T 1i ) z 2 K(z)dz +o(h 2 2 )} I(T 1i A τ ) ] = f 2 (t τ 2)+ h2 2 2 f(2) 2 (t 2 τ) z 2 K(z)dz +o(h 2 2 ), where f (2) 2 (t 2 τ) is defined in the proposition. The second equality holds because E T1 {f(t 2 T 1i )I(T 1i A τ )} = f 2 (t 2 t 1 )f(t 1 )dt 1 = G(τ)f 2 (t τ 2). A τ We have assumed the square integrability of f (2) 2 (t 2 τ), according to which define { } R = {f (2) 2 (t 2 τ)} 2 dt 2. f (2) 2 τ It follows that the integrated squared bias is [E{ f 2 (t τ 2)} f { } 2 (t τ 2)] 2 dt 2 = (h 4 2 /4)R f (2) 2 (τ) {µ 2 (K)} 2 (1+o(1)), where we use the notation µ 2 (K) = z 2 K(z)dz. Next, we compute the variance term. onsider the following decomposition Var{ f τ 2 (t 2)} = Var T1 [E T2 T 1 { f τ 2 (t 2)}]+E T1 [Var T2 T 1 { f τ 2 (t 2)}], where the first and second terms can be respectively computed as { } 1 [ Var T1 [E T2 T 1 { f 2 (t τ 2)}] = m G 2 (τ) {f 2 1 (t 2 t 1 )} 2 f 1 (t 1 )dt 1 {f 2 (t 2 )} ]{1+o(1)}, 2 E T1 [Var T2 T 1 { f 2 (t τ 1 2)}] = f 2 (t 2 t 1 )f 1 (t 1 )dt 1 R(K){1+o(1)} Ñh 2 A τ = (mh 2 ) 1 f 2 (t τ 2)R(K){1+o(1)}. It is easy to see that the second term is the leading term (note that we fixed τ k and let m ). Hence Var{ f τ 2 (t 2)}dt 2 = (mh 2 ) 1 R(K){1+o(1)}. The common choice of bandwidth h m 1/5 makes E f 2 τ f2 τ 2 0, and the desired result follows.
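A minimal numerical illustration of the screened kernel estimator analysed above is given below. The data-generating step and the use of uniform null p-values for screening are illustrative assumptions; base-R density() stands in for the kernel smoother K_{h_2}.

```r
# Sketch of the screened kernel estimator of q^tau(t2): keep pairs whose primary
# p-value exceeds tau, smooth their auxiliary statistics, and rescale by
# Ghat(tau)/(1 - tau). The simulated data are purely illustrative.
set.seed(42)
m <- 5000; tau <- 0.9
theta2 <- rbinom(m, 1, 0.2)            # union-support indicators (assumed model)
t2 <- rnorm(m, mean = 3 * theta2)      # auxiliary statistics
p1 <- runif(m)                         # primary p-values (uniform for illustration)
keep <- which(p1 > tau)                # screening set T(tau)
Ghat <- length(keep) / m               # empirical P(P_i > tau)
f2_tau <- density(t2[keep], n = 512)   # kernel estimate of the screened density
q_tau_hat <- Ghat / (1 - tau) * f2_tau$y
plot(f2_tau$x, q_tau_hat, type = "l", xlab = "t2", ylab = "estimated q_tau(t2)")
```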

34 4 ai & Sun & Wang A.4. Proof of Proposition 4 Part (a). The total approximation error can be computed as B q (τ) = q τ (t 2 ) q (t 2 ) dt 2 = 1 G(τ) 1 τ (1 π 1 ) = π 1{1 G 1 (τ)}. 1 τ Noting that G 1 (1) = 1, and applying the mean value theorem, we have d dτ B q(τ) = {1 G 1(τ)} g 1 (τ)(1 τ) (1 τ) 2 = g 1(ξ) g 1 (τ) 1 τ for some ξ (τ,1). If G 1 is concave, then g 1 (ξ) g 1 (τ) < 0 and the desired result follows. Part (b). Let η(x) = {1 G 1 (x)}/(1 x). If g 1 satisfies lim x 1 g 1 (x) = 0, then using L Hospital s Rule, we claim that lim x 1 η(x) exists and lim x 1 η(x) = lim x 1 g 1 (x) = 0. onsider the following decomposition of the joint density f(t 1,t 2 )dt 1 = {q (t 2 )f 10 (t 1 )+π 1 f(t 2 t 1,θ 1i = 1)f 1 (t 1 )}dt 1. A τ A τ Noting that both T 1i and T 2i are standardized, the conditional density f(t 2 t 1,θ 1 = 1) is bounded by some constant 0. Using the definition of η(x), we have q τ (t 2 ) = (1 τ) 1 A τ f(t 1,t 2 )dt 1 q (t 2 )+ 0 π 1 η(τ) Applying L Hospital s Rule again, we have lim τ 1 q τ (s) q (s). In Proposition 2.(a), we have shown that q τ (s) q (s) for all τ. ombining the two inequalities, we conclude that lim τ 1 q τ (s) = q (s). Then the desired result follows. A.5. Proof of Proposition 5 Proof. Proposition 4 defines η(x) = {1 G 1 (x)}/(1 x) and shows that lim τk 1q τ k (t 2 ) = q (t 2 ), q τ (t 2 ) q (t 2 ) 0 π 1 η(τ). Note that G 1 (0) = 0 and G 1 (1) = 1. It follows from the concavity of G 1 that η(τ) 1. Moreover, {q τ k (t 2 ) q (t 2 )}dt 2 = π 1 η(τ k ). Hence {q τ k (t 2 ) q (t 2 )} 2 dt 2 0 π 1 {q τ k (t 2 ) q (t 2 )}dt 2 = 0 π1η(τ 2 k ) 0 (A.1) when τ k 1, according to L Hospital s Rule and our assumption that lim x 1 g 1 (x) = 0. Therefore, we only need to show that for a given τ k, [ ] E ˆq q τ k 2 = E {ˆq (t 2 ) q τ k (t 2 )} 2 m 0 dt 2 0. onsider ˆq (t 2 ) defined in (3.13). Let a j = k 1 (ŝ 2 ŝ 0 ŝ 2 1) 1 {ŝ 2 ŝ 1 (τ j 1)}K hτ (τ j 1).

35 ARS: ovariate-assisted Two-Sample Inference 5 Then ˆq (t 2 ) = k j=1 a jˆq τj, k j=1 a j = 1 and k j=1 a j 1. Following similar steps in the proof of Proposition 3, E{ˆq (t 2 )} = E T1 [E T2 T 1 {ˆq (t 2 )}] [ k 1 G(τ j ) = a j f τj 2 1 τ (t 2)+ h2 2 j 2 u 2(K) j=1 = q τ k (t 2 )+ 1 2 h2 τ qτ k,(2) (t 2 ) (K)+o(h 2 τ ) + h2 2 2 u 2(K) k j=1 a j 1 G(τ j ) 1 τ j t 1 A τj f (2) 2 (t 2 t 1 )f 1 (t 1 )dt 1 +o(h 2 2) t 1 A τj f (2) 2 (t 2 t 1 )f 1 (t 1 )dt 1 +o(h 2 2), where (K) is a kernel dependent constant. The last equality follows from Wand and Jones (1995) (pp. 128) for kernel smoothing estimator at boundary points [applying the theory to k j=1 a jq τj (t 2 )]. Note that 1 G(τj) 1 τ j 1 and j a j = 1, we have k j=1 a j 1 G(τ j ) 1 τ j t 1 A τj f (2) 2 (t 2 t 1 )f 1 (t 1 )dt 1 f (2) 2 (t 2 t 1 )f 1 (t 1 )dt 1. According to the square integrability of q τk,(2) (t 2 ) and f (2) 2 (t 2 τ), we can define { } { } 2dt2 R q τ k,(2) = q τk,(2) (t 2 ), { } R f (2) t 2 τ = { f (2) 2 (t 2 τ)} 2dt2. Then the leading term of [E{q (t 2 )} q τ k (t 2 )] 2 dt 2 is bounded above by 1 { } 2 h4 τr q τ k,(2) { (K)} { } 2 h4 2{u 2 (K)} 2 R f (2) t 2 τ, which converges to 0 when (m,k). The argument below regarding the variance part does not pursue the study of the optimal rateof convergence. In fact, the analysisis intractable because q τj aredependent quantities. We have show in Proposition 3 that Var{ˆq τj (t 2 )}dt 2 (mh 2 ) 1 R(K){1+o(1)} for all τ j. It follows that Var{ˆq (t 2 )} (mh 2 ) 1 R(K)( k a 2 k){1+o(1)} j=1 (mh 2 ) 1 R(K){1+o(1)} 0 when m. We note that the actual convergence rate would be better because the variance will be greatly reduced by averaging. ombining the results on the bias and ]

36 6 ai & Sun & Wang variance terms, we have E {ˆq (t 2 ) q τ k (t 2 )} 2 dt 2 [ 1 2 h4 τr(q τ k,(2) ){ (K)} h4 2{u 2 (K)} 2 R(f (2) t 2 t 1 )+ R(K) mh 2 ] {1+o(1)}. (A.2) ombining (A.1) and (A.2), we conclude that E ˆq q 2 0 when (m,k) 0, completing the proof. A.6. Proof of Proposition 6 Proof. We first point out that the arguments in this proof should be understood as the conditional version (given that µ xi and µ yi are known). Define the following central sample moments that are used in later calculations: m xi = 1 n x m xxi = 1 n x n x j=1 n x j=1 (X ij µ xi ), m yi = 1 n y (Y ij µ yi ), n y j=1 (X ij µ xi ) 2, m yyi = 1 n y (Y ij µ yi ) 2. n y It follows that Sxi 2 = m xxi m 2 xi, S2 yi = m yyi m 2 yi. Next define central moments, According to the LT, we have nx n y n j=1 µ xi3 = E(X ij µ xi ) 3, µ yi3 = E(Y ij µ yi ) 3, µ xi4 = E(X ij µ xi ) 4,µ yi4 = E(Y ij µ yi ) 4. m xi m yi m xxi m yyi 0 0 σxi 2 σyi 2 L N(0,Σ i ), where γ y σxi 2 0 γ y µ xi3 0 Σ i = 0 γ x σyi 2 0 γ x µ yi3 γ y µ xi3 0 γ y (µ xi4 σxi 4 ) 0. 0 γ x µ yi3 0 γ x (µ yi4 σyi 4 ) (s t+µ xi µ yi )(γ x v +γ y u) 1 2 Let g i (s,t,u,v) = ( ){ } s+ γyu γ t+µ xv xi + γy γ x µ yi (γ x v +γ y u) γyu 1 2. It follows that γ xv ġ i (0,0,σxi,σ 2 yi) 2 = ( σ 1 pi σ 1 pi κ 1 2 i σ 1 pi σ 1 pi κ 1 2 i 1 2 γ y ( 1 2 γ y(µ xi ) µ yi )σ 3 pi γ µ xi +µ y yi γ x σ 3 pi (1+2κ i )κ 3 2 i 1 2 γ x ( 1 2 γ x(µ xi µ yi ) )σ 3 pi γ µ xi +µ y yi γ x σ 3 pi κ 1 2 i ).

37 ARS: ovariate-assisted Two-Sample Inference 7 The asymptotic variance-covariance matrix of γ x γ y n(t 1i,T 2i ) is ( Σ i σ 2 T 1,T 2 = i (t 1,t 1 ) σi 2(t ) 1,t 2 ) σi 2(t 1,t 2 ) σi 2(t, 2,t 2 ) where the entries can be computed by apply the Delta method: σi(t 2 1,t 1 ) = 1+σ 4 pi (µ xi µ yi ) ( γxµ 2 yi3 γyµ 2 ) xi3 + σ 6 pi σi 2 (t 1,t 2 ) = σ 4 pi σ 4 pi 2 + σ 6 pi 4 4 (µ xi µ yi ) 2{ γ 3 x 2 (µ xi µ yi ) ( γ y µ xi +µ yi γ 3 xκ 1 2 σ 2 i (t 2,t 2 ) = 1+σ 4 pi + σ 6 pi 4 i ( µyi4 σyi 4 ) ( +γ 3 y µxi4 σxi 4 )}, ( γx 2 µ yi3κ 1 2 i +γy 2 µ xi3κ 1 2 i γ x ( µ xi +µ yi γ y γ x ( µyi4 µ 4 ) } yi, ( µ xi +µ yi γ y ( µ xi +µ yi γ y γ x ) { } γx 2 µ yi3κ 1 2 i +γy 2 µ xi3(1+2κ i )κ 3 2 i, ) { (µ xi µ yi ) γy 3 (1+2κ ( i)κ 3 2 i µxi4 σxi) 4 γ x ) {κ ) 2{ γ 3 x κ i i γ2 x µ yi3 (1+2κ i )κ 2 i γy 2 µ } xi3 ) ( µyi4 σyi) 4 +γ 3 y (1+2κ i )2 κ 3 ( i µxi4 σxi 4 ) }. The off-diagonal terms degenerate to zero when (i) the null hypothesis µ xi = µ yi is true and (ii) the distributions are symmetric, i.e. µ y3i = µ x3i = 0, completing the proof. A.7. Proof of Proposition 7 The proposition can be considered as a special case of the theory in Section D of Basu et al. (2017) and can be proved similarly. As the proofs in Basu et al. (2017) are done in different settings with the weighted FDR definition and random weights, we provide the proof here for completeness. Proof. Part (a). Let X = m 1 m i=1 (1 θ i)δ i. Note that when Y = 0 we must have X = 0. The asymptotic equivalence follows if we can show the following { X mfdr(δ) FDR(δ) E Y X } EY I(Y > 0) = o(1). (A.3) Since X Y and both are non-negative expressions, using auchy-schwarz { X E Y X } { } X EY I(Y > 0) EY = E I(Y > 0) Y Y EY (E Y EY 2) 1/2 EY = Var(Y)1/2, EY

38 8 ai & Sun & Wang The desired result follows from onditions (i) and (ii). ( ) Part (b). According to the proof of Theorem 2, Part (a), P T τ,i < Qτ, 1 (α) is bounded ( ) away from zero, then there exists p α > 0, such that P T τ,i < Qτ, 1 (α) p α. Therefore, m 1 E { m i=1 (ˆTτ,i ) ( τ, 1 P < ˆQ (α) = P T τ,i < Qτ, 1 )+o(1), (α) ( I T τ,i Let Y = m 1 m i=1 I(Tτ,i tτ < Qτ, 1 Var(Y ) = m 1 Var (α) ) } = P ). Note that { I ( T τ,i t To show that Var(Y) = o(1), use the decomposition ( T τ,i < Qτ, 1 (α) ) p α > 0. )} m 1 = o(1). Var(Y) = Var(Y )+Var(Y Y )+2ov(Y Y,Y ). We only need to show that Var(Y Y ) = o(1). Then by auchy-schwarz inequality and using Var(Y ) = o(1), it follows that Recall that ˆT τ Tτ = o P(1) and [ m Var(Y Y ) =m 2 Var m 2 E [ m i=1 i=1 ( = 1 1 ) E m + 1 {I (ˆTτ m E < The last equality follows since E E [{ I k=i,j (ˆTτ,i ov(y Y,Y ) = o(1). τ, 1 ˆQ (α) tτ = o P(1). We have { (ˆTτ,i ( )} τ, 1 I < ˆQ ) I ] (α) T τ,i < tτ { (ˆTτ,i ( )} τ, 1 I < ˆQ ) I ] 2 (α) T τ,i < tτ { (ˆTτ,k I k=i,j;i j ˆQ τ, 1 { (ˆTτ,k ( )} τ, 1 I < ˆQ ) I (α) T τ,k < tτ ) } 2 I(T τ < tτ ) = o(1). ) ( )} τ, 1 < ˆQ (α) I T τ,k < tτ ) ( )}] τ, 1 < ˆQ (α) I T τ,i < tτ = o(1). A.8. Effects of baseline removal (Remark 1) This section explains that removing the baseline effects would not violate our assumption on conditional independence. To fix ideas, we consider the application context of microarray time-course experiments.

Let ξ denote the true baseline expression levels of the m genes. Consider two time points x and y. The (unknown) true effect sizes at these two time points are given by

µ_x = ξ + η_x,  µ_y = ξ + η_y.

Note that we do not require ξ to be sparse, but some baseline measurements should be available. By contrast, we assume that η_x = (η_{xi} : i = 1, ..., m) and η_y = (η_{yi} : i = 1, ..., m) are sparse vectors representing random perturbation effects from the baseline levels ξ. The goal is to identify nonzero effects η_{xi} − η_{yi} ≠ 0. We observe X_i = µ_{xi} + ǫ_{xi} and Y_i = µ_{yi} + ǫ_{yi}, with ǫ_{xi} ∼ N(0,1) and ǫ_{yi} ∼ N(0,1). Moreover, we do not observe the true latent variable ξ; the baseline effects are measured (at time 0) with error as Z_i = ξ_i + ǫ_{zi}. We assume that ǫ_{xi}, ǫ_{yi} and ǫ_{zi} are mutually independent. After removing the baseline effects, we obtain the new observations

X̃_i = ξ_i + η_{xi} + ǫ_{xi} − (ξ_i + ǫ_{zi});   Ỹ_i = ξ_i + η_{yi} + ǫ_{yi} − (ξ_i + ǫ_{zi}).

The pairs of differences and sums are given by

T_{1i} = X̃_i − Ỹ_i = (η_{xi} − η_{yi}) + (ǫ_{xi} − ǫ_{yi});
T_{2i} = X̃_i + Ỹ_i = (η_{xi} + η_{yi}) + (ǫ_{xi} + ǫ_{yi}) − 2(ξ_i + ǫ_{zi}).

Under the null hypothesis η_{xi} = η_{yi}, the error terms ǫ_{xi} − ǫ_{yi} and ǫ_{xi} + ǫ_{yi} are independent, and ǫ_{xi} − ǫ_{yi} is independent of ξ_i + ǫ_{zi}. Hence T_{1i} and T_{2i} satisfy the conditional independence assumption (2.5), which in turn yields Equation (2.6), as required by the CARS methodology.

A.9. Proof of the claim in Remark 4

Proof. We study the special case where (i) the observations have equal sample sizes and (known) equal variances σ_{xi}^2 = σ_{yi}^2 = σ_i^2, and (ii) the two nonzero means have the same magnitude with opposite signs, µ_{xi} = −µ_{yi}. Then

(T_{1i}, T_{2i}) = {√n/(√2 σ_i)} (X̄_i − Ȳ_i, X̄_i + Ȳ_i).

Next, note that X̄_i = µ_{xi} + ǭ_{xi} and Ȳ_i = −µ_{xi} + ǭ_{yi}, where ǭ_{xi} ∼ N(0, (2/n)σ_i^2) and ǭ_{yi} ∼ N(0, (2/n)σ_i^2). We have

f(t_1, t_2) = ∫ f(t_1, t_2 | µ_x = µ) dG_{µx}(µ) = ∫ f(t_1 | µ_x = µ) f_2(t_2) dG_{µx}(µ) = f_1(t_1) f_2(t_2);
f(t_2 | θ_{1i} = 0) = f(t_2 | µ_{xi} = µ_{yi} = 0) = f_2(t_2).

It follows that the oracle statistic

T^*(t_1, t_2) = (1 − π_1) f(t_2 | θ_{1i} = 0) f_{10}(t_1) / f(t_1, t_2) = (1 − π_1) f_{10}(t_1) / f_1(t_1),

which is the Lfdr defined in (3.1).
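The conditional-independence claim of Section A.8 is easy to check numerically. The following R sketch simulates the baseline-removal construction under the null η_x = η_y and verifies that the empirical correlation between T_1 and T_2 is close to zero; all distributional choices below are illustrative.

```r
# Numerical check of the conditional-independence claim in Section A.8: after
# subtracting a noisy baseline Z, the difference T1 and the sum T2 should be
# uncorrelated under the null (eta_x = eta_y).
set.seed(7)
m   <- 1e5
xi  <- rnorm(m, 0, 2)       # baseline levels (need not be sparse)
eta <- rnorm(m)             # common perturbation, so eta_x = eta_y (null case)
X   <- xi + eta + rnorm(m)  # observation under condition x
Y   <- xi + eta + rnorm(m)  # observation under condition y
Z   <- xi + rnorm(m)        # noisy baseline measurement
T1  <- (X - Z) - (Y - Z)    # equals eps_x - eps_y
T2  <- (X - Z) + (Y - Z)    # equals 2*eta + eps_x + eps_y - 2*eps_z
cor(T1, T2)                 # close to zero, as claimed
```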

A.10. Proof of insufficiency and further explanation of Remark 8

We prove the claim in Remark 8 that the primary statistic is not a sufficient statistic. The proof uses the notation of Section 4. Clearly we only need to provide a counterexample. Consider the following special case, where the grouping variable S_i takes the two values 1 and 2 with equal probability: P(S_i = 1) = P(S_i = 2) = 0.5. If S_i = j, j = 1, 2, then T_i follows the mixture distribution

T_i ∼ (1 − π_j) N(0, 1) + π_j N(1, 1),

where π_j is the conditional proportion of non-null cases in group j. An equivalent way to conceptualize the above model is to introduce an unknown location parameter µ_i: T_i | µ_i ∼ N(µ_i, 1), where µ_i follows a two-point distribution with masses at 0 and 1:

µ_i ∼ (1 − π_j) δ_0 + π_j δ_1.

Here δ_0 and δ_1 are the Dirac delta functions centred at 0 and 1, respectively.

Now we prove the insufficiency of T_i. By definition, if T_i were sufficient, then P(S_i = 1 | T_i = t, µ_i) would not depend on µ_i. However, some simple algebra gives

P(S_i = 1 | T_i = t, µ_i = 0) = (1 − π_1)/(2 − π_1 − π_2),   P(S_i = 1 | T_i = t, µ_i = 1) = π_1/(π_1 + π_2),

and the desired result follows.

Remark 9. The ancillarity paradox refers to the fact that S_i seems to be useless for inference about µ_i, since it is an external variable and the location information should have been captured by T_i. However, making inference using T_i alone would lead to a substantial loss of information. In this example, S_i is called an ancillary component because (i) T_i alone is insufficient, (ii) S_i is ancillary, and (iii) (T_i, S_i) is sufficient.

A.11. Screening via Lfdr: formulae and implementation

Consider a screening procedure based on a statistic ζ_i := ζ_i(t_{1i}) and a tuning parameter τ > 0, where the screening set consists of the indices with ζ_i > τ. Denote by G(τ) the CDF of ζ_i, let A_τ = {x : ζ(x) > τ}, and let S(τ) = P(A_τ) denote the survival function. Then

q^τ(t_2) = lim_{h → 0} E{Q^τ(t_2, h)}/h = S(τ)^{-1} ∫_{A_τ} f(t_1, t_2) dt_1,   (A.4)

which can be estimated by

q̂^τ(t_2) = {m Ŝ(τ)}^{-1} Σ_{i ∈ T(τ)} K_{h_2}(t_2 − t_{2i}).   (A.5)

In the p-value approach, we have an explicit correction factor of 1 − τ; this factor is now replaced by Ŝ(τ). If the Lfdr is used for screening, then a new correction factor is needed. We can simulate B data points (e.g. B = 10^5) from the null distribution N(0, 1) and calculate their Lfdrs using the method in Sun and Cai (2007), where the parameters are estimated from the primary statistics {t_{1i} : 1 ≤ i ≤ m}. The correction factor is then obtained as the proportion of these Lfdrs exceeding τ.
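As an illustration of the recipe above, the R sketch below computes the Lfdr-based correction factor from a vector of primary statistics. The null-proportion and mixture-density estimates used to form the Lfdr are crude stand-ins for the Sun and Cai (2007) estimators, and the function name lfdr_correction is ours rather than part of the package.

```r
# Sketch of the Lfdr-based correction factor of Section A.11: simulate B null
# z-values, evaluate their Lfdr under a two-group working model fitted to the
# primary statistics t1, and report the proportion exceeding tau.
lfdr_correction <- function(t1, tau = 0.9, B = 1e5) {
  pi0_hat <- min(1, 2 * mean(abs(t1) < qnorm(0.75)))        # rough null proportion
  dens    <- density(t1, n = 2048)                          # mixture density estimate
  f_fun   <- approxfun(dens$x, pmax(dens$y, 1e-10), rule = 2)
  lfdr    <- function(z) pmin(1, pi0_hat * dnorm(z) / f_fun(z))
  z_null  <- rnorm(B)                                       # draws from the null N(0, 1)
  mean(lfdr(z_null) > tau)                                  # the correction factor
}

set.seed(1)
t1 <- c(rnorm(4500), rnorm(500, mean = 3))                  # toy primary statistics
lfdr_correction(t1)
```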

B. Supplementary Numerical Results

B.1. The case with unknown and equal variances

Let ǫ_{xij} ∼ N(0,1) and ǫ_{yik} ∼ N(0,1). Set n_x = 50 and n_y = 60, so that κ = n_y/n_x = 1.2. We compute T_{1i} and T_{2i} using the estimated variances. The following settings are considered in our simulation study.

Setting 1: we set µ_{x,1:k} = 5/√30, µ_{x,(k+1):(2k)} = 4/√30, µ_{x,(2k+1):m} = 0, µ_{y,1:k} = 2/√30, µ_{y,(k+1):(2k)} = 4/√30 and µ_{y,(2k+1):m} = 0. We vary k from 100 to 1000.

Setting 2: we use k_1 and k_2 to denote the number of nonzero locations and the number of locations with differential effects, respectively. Let µ_{x,1:k_2} = 5/√30, µ_{x,(k_2+1):k_1} = 4/√30, µ_{x,(k_1+1):m} = 0, µ_{y,1:k_2} = 2/√30, µ_{y,(k_2+1):k_1} = 4/√30 and µ_{y,(k_1+1):m} = 0. We fix k_1 = 2000 and vary k_2 from 200 to k_1.

Setting 3: the sparsity level is fixed at k = 750. We set µ_{x,1:k} = µ_0/√30, µ_{x,(k+1):(2k)} = 3/√30, µ_{x,(2k+1):m} = 0, µ_{y,1:k} = 2/√30, µ_{y,(k+1):(2k)} = 3/√30 and µ_{y,(2k+1):m} = 0. We vary µ_0 from 3.5 to 5.

In Settings 1-3, we apply the different methods to the simulated data and obtain the FDR and average power by averaging over 500 replications. The FDR and power levels, plotted as functions of the varied parameter values, are displayed in Figure 5. We again see that both CARS and US are more powerful than conventional univariate methods such as BH and AZ, and that CARS is superior to US.

B.2. Simulation results when the means have opposite signs

Let ǫ_{xij} ∼ N(0,1) and ǫ_{yik} ∼ N(0,1) be i.i.d. noise. Set n_x = n_y = 50. In this simulation, the number of tests is m = 10,000. The two mean vectors are given below:

µ_{x,1:500} = 3/√50, µ_{x,501:1000} = 4/√50, µ_{x,1001:m} = 0;
µ_{y,1:500} = 3c/√50, µ_{y,501:1000} = 4/√50, µ_{y,1001:m} = 0.

We vary c from −1 to 0, where c = −1 corresponds to the least favourable situation. We apply the BH procedure (BH), the oracle CARS procedure and the data-driven CARS procedure (DD) to the simulated data sets. The FDR and power are obtained based on 200 replications. The simulation results are summarized in Figure 6. We can see that when c = −1, all methods have similar power. As c increases to zero, the power gain of CARS becomes more pronounced. This shows that the CARS procedure outperforms univariate methods even when the signals have opposite signs.

Remark 10. The fact that CARS improves efficiency when the signs are opposite may appear counter-intuitive. We therefore discuss the following extreme situation to explain the phenomenon; the discussion also gives further insight into the merit of CARS. Assume that c = 0 and, further, that the other elements of µ_y are zero, i.e. µ_{y,1:m} = 0. Let the mean vector µ_x be µ_{x,1:1000} = µ_0 and µ_{x,1001:m} = 0. Suppose we only observe one replicate with σ_{xi} = σ_{yi} = 1 for all i. If we use T_i = X_i − Y_i as the test statistic, then the signal-to-noise ratio is µ_0/√2. Imagine an oracle that knows µ_{y,1:m} = 0.

[Figure 5 about here.]

Fig. 5. The general case with unknown variances: the FDR and average power varied by the non-null proportion (top row), the conditional proportion (middle row) and the effect size (bottom row). The displayed procedures are DD, BH, the oracle procedure, AZ and US.

[Figure 6 about here.]

Fig. 6. Comparison of BH, DD and the oracle procedure when the nonzero means have opposite signs.

Hence µ_{xi} − µ_{yi} = 0 as long as µ_{xi} = 0. For this oracle, a more efficient test statistic would be T_i = X_i, whose signal-to-noise ratio (SNR) is µ_0. The information in µ_{yi} can be partially recovered by the pair (X_i − Y_i, X_i + Y_i); by contrast, this information is lost if we use X_i − Y_i alone. Hence the SNR can be enhanced by including the sum in the inference (even when µ_{yi} = 0 for all i). We therefore have two extreme cases: c = −1 and c = 0. In the former case CARS has the same power as the optimal univariate method, and in the latter case CARS may substantially outperform all univariate methods by exploiting the auxiliary information with enhanced SNR. This explains the pattern observed in Figure 6.

B.3. Comparison with sample splitting when θ_{1i} = θ_{2i}

As in the general setting, we set n_x = 50 and n_y = 60 and construct the primary and auxiliary statistics accordingly. We consider two simulation settings.

Setting 1: µ_{x,1:k} = 5/√30, µ_{x,(k+1):m} = 0, µ_{y,1:k} = 2/√30 and µ_{y,(k+1):m} = 0. We vary k from 50 to 1000 to investigate the effect of sparsity on the testing results.

Setting 2: the sparsity level is fixed at k = 500, with µ_{y,1:m} = 0, µ_{x,1:k} = µ_0/√30 and µ_{x,(k+1):m} = 0. We vary µ_0 from 1.5 to 3.5 to investigate the impact of the effect sizes.

In each setting, we apply the different methods to the simulated data and obtain the FDR and average power by averaging over 200 replications. The sample splitting (SS) method is added to the comparison. Specifically, SS splits both samples into two equal parts. The first half of the data is used to compute initial p-values, and locations with initial p-value below the cut-off 0.05 are selected for the second-stage analysis. Second-stage p-values are then computed for the selected locations using the second half of the data. Finally, the BH procedure is applied to the second-stage p-values and the decisions are recorded. The FDR and power levels are plotted as functions of k and µ_0, and the results are displayed in Figure 7. The following observations can be made based on the simulation.

(a). All methods control the FDR at the nominal level 0.05, with BH slightly conservative and US very conservative.

(b). The data-driven CARS procedure can be very conservative in many settings. This is because we have used a conservative approximation to the oracle procedure. Nevertheless, CARS still enjoys a substantial power gain over the competing methods.

(c). The univariate methods (BH and AZ) are greatly improved upon by US, which in turn is greatly improved upon by CARS, with DD quite comparable to the oracle procedure.

(d). The results in Setting 1 corroborate that the gain in efficiency decreases as k increases; this is consistent with our findings under the general model.

(e). Setting 2 shows how the efficiency gain of CARS over the other methods varies with the effect size µ_0.

(f). The sample splitting method is inefficient.

B.4. Stability of tuning using Lfdr

The R package CARS carries out the screening step using Lfdr statistics. Let T(τ) = {i : Lfdr_1(t_{1i}) > τ}. This section examines the robustness of the CARS procedure with respect to the choice of τ. For brevity, we demonstrate the effect of τ in the case where θ_{1i} and θ_{2i} are perfectly correlated; the general case exhibits the same kind of phenomenon. We explore different values of τ and compare the results with those obtained by the (benchmark) oracle procedure. We use the notation of the main text and consider the following simulation settings.

Setting 1 (the same as Setting 1 in Section B.3): k = 500. Vary τ from 0.6 to 0.9.

Setting 2: the same as Setting 1 above, except that µ_{x,1:500} = 4/√50. Vary τ from 0.6 to 0.9.

Setting 3: the same as Setting 1, with k = 1000. Vary τ from 0.6 to 0.9.

Setting 4: the same as Setting 3 above, except that µ_{x,1:1000} = 4/√50. Vary τ from 0.6 to 0.9.

We apply the oracle procedure and the data-driven CARS procedure (DD) to the simulated data and obtain the corresponding FDR and power levels. We also compute the average size of Ŝ = {i : Lfdr_1 > τ} and plot it as a function of τ. Note that for the oracle procedure we always have Card(S) = 4500 or 4000, depending on the setting. The results are summarized in Figure 8. We can see that the value of τ affects the size of Ŝ and in general provides better error control when it is large, which is consistent with our theory in Proposition 4. However, to avoid inflated variability in the estimated quantities, we do not recommend larger values of τ. As a rule of thumb, our practical recommendation is to take τ = 0.9 to ensure more precise error control and stability of the methodology.
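The dependence of the screening-set size on τ can be illustrated with a short R sketch; the simulated statistics and the density-based Lfdr below are illustrative and are not tied to Settings 1-4.

```r
# Sketch of how the screening-set size |S-hat| = #{i : Lfdr_1(t_1i) > tau}
# varies with tau, using the same crude density-based Lfdr as in Section A.11.
set.seed(2)
m <- 5000; k <- 500
t1 <- c(rnorm(k, mean = 3), rnorm(m - k))         # k non-nulls, m - k nulls
dens  <- density(t1, n = 2048)
f_fun <- approxfun(dens$x, pmax(dens$y, 1e-10), rule = 2)
pi0_hat <- 1 - k / m                              # true null proportion, plugged in
lfdr1 <- pmin(1, pi0_hat * dnorm(t1) / f_fun(t1))
sapply(c(0.6, 0.7, 0.8, 0.9), function(tau) sum(lfdr1 > tau))
```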

[Figure 7 about here.]

Fig. 7. The special case: FDR and power comparison varied by the non-null proportion (Setting 1) and the signal strength (Setting 2). The displayed procedures are DD, BH, the oracle procedure, AZ, US and SS.

[Figure 8 about here.]

Fig. 8. Effect of τ: the FDR, the average power and the cardinality of Ŝ plotted against τ under Settings 1-4, for DD and the oracle procedure.

B.5. Very sparse case and ranking

A fundamental aspect of the multiple comparison problem is ranking. For example, a biologist with a limited budget may ask us to provide a list of the top k genes (say, k = 10); then, on top of FDR control, she may care more about how many true findings are actually on the list we provide. We carry out a small simulation study to compare the different methods in the very sparse case k_2 = 20. The goal is to show that, by exploiting the auxiliary information, CARS yields a much better ranking, in the sense that for a pre-specified budget, namely a fixed number of rejections, CARS identifies more true positives than the competing methods.

Consider the case with unknown and equal variances. Let ǫ_{xij} ∼ N(0,1) and ǫ_{yik} ∼ N(0,1). Set n_x = 50 and n_y = 60, so that κ = n_y/n_x = 1.2. The number of repetitions is 100.

Setting: µ_{x,1:20} = 5/√50, µ_{x,21:40} = 4/√50, µ_{x,41:m} = 0, µ_{y,1:20} = 2/√50, µ_{y,21:40} = 4/√50, µ_{y,41:m} = 0.

We vary the number of rejections k from 5 to 20 and calculate, for every given k, how many true positives are among the top k hypotheses. The rankings are based on the p-values (BH), the Lfdr statistics (AZ; Sun and Cai, 2007), the oracle statistic T (the proposed method with known parameters), and the data-driven statistic T̂^τ (DD, the proposed method with estimated parameters), respectively. The expected number of true positives (ETP) is calculated by averaging over 100 simulated data sets. We can see from Figure 9 that the ranking produced by CARS is superior to those of the other methods.

[Figure 9 about here.]

Fig. 9. Comparison of rankings: ETP among the top k rejections, as a function of the number of rejections, for DD, AZ and BH.

B.6. Dependent tests

This section presents a small simulation study to investigate the performance of CARS under dependence.

B.6. Dependent tests

This section presents a small simulation study to investigate the performance of CARS under dependence. We demonstrate the performance of CARS in the setting with unknown and equal variances. Let Σ be the covariance matrix with Σ_{i,i} = 1 for i = 1,...,m, Σ_{i,i+1} = Σ_{i+1,i} = 0.5 for i = 1,...,m−1, and Σ_{i,i+2} = Σ_{i+2,i} = 0.4 for i = 1,...,m−2; all other entries are zero. Let ǫ_{xj} ~ N(0,Σ) and ǫ_{yk} ~ N(0,Σ). Set n_x = 50 and n_y = 60, so that κ = n_y/n_x = 1.2. The number of repetitions is 100 and the number of genes is m. Setting: we set µ_{x,1:k} = 5/√50, µ_{x,(k+1):2k} = 4/√50, µ_{x,(2k+1):m} = 0, µ_{y,1:k} = 2/√50, µ_{y,(k+1):2k} = 4/√50, µ_{y,(2k+1):m} = 0. We vary k from 100 to 500.

The results are summarized in Figure 10. Both OR and DD appear to control the FDR well. However, we emphasize that the simulation results here are very limited and we do not intend to draw general conclusions from them. Meanwhile, we conjecture that CARS would remain asymptotically valid under suitable weak dependence assumptions, but the theory is very challenging; much research is still needed and we leave it as a future direction.

Fig. 10. Comparison of methods under weak dependence: FDR level and average power against k for the DD, BH, AZ and US procedures.
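For concreteness, correlated error vectors with the banded covariance Σ described above can be generated as in the sketch below (base R, forming Σ explicitly and multiplying standard normal vectors by its Cholesky factor). The dimension m = 5000 is an illustrative choice and this is not the simulation code used for the paper.

    ## Sketch: rows of N(0, Sigma) errors with the banded covariance described above
    m <- 5000                                        # illustrative dimension
    Sigma <- diag(1, m)
    Sigma[cbind(1:(m - 1), 2:m)] <- 0.5; Sigma[cbind(2:m, 1:(m - 1))] <- 0.5
    Sigma[cbind(1:(m - 2), 3:m)] <- 0.4; Sigma[cbind(3:m, 1:(m - 2))] <- 0.4

    R_chol <- chol(Sigma)                            # Sigma = t(R_chol) %*% R_chol
    rmvnorm_banded <- function(n) matrix(rnorm(n * m), n, m) %*% R_chol

    eps_x <- rmvnorm_banded(50)                      # n_x = 50 error vectors for the X sample
    eps_y <- rmvnorm_banded(60)                      # n_y = 60 error vectors for the Y sample
    ## For very large m, a band or sparse Cholesky factorization avoids storing the dense Sigma.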

B.7. Supernova Example Images

Here we provide additional numerical results for the analysis in Section 5.4 of the main text. Figure 11 shows the rejected pixels in the image layout for each method at different nominal FDR levels.

Fig. 11. Rejected pixels at nominal FDR levels 0.01%, 1% and 5%. Left to right: BH procedure, Adaptive Z procedure, Uncorrelated Screening, CARS.

B.8. Analysis of the Microarray Time-Course (MTC) Data

In this section we apply the CARS procedure to the MTC dataset collected by Calvano et al. (2005) for studying systemic inflammation in humans. The study contains eight subjects who were randomly assigned to treatment and control groups and administered endotoxin and placebo, respectively. Affymetrix U133A chips were used to profile the expression levels of 22,283 genes in human leukocytes measured before infusion (hour 0) and at 2, 4, 6, 9 and 24 hours afterwards. One goal of this experiment is to identify, in the treatment group, early-to-middle response genes that are differentially expressed within 4 hours, thereby revealing a meaningful early activation gene sequence that could potentially govern the immune response. We first preprocess the data according to the steps described in Sun and Wei (2011). To further identify the genes that regulate this sequence, we take time point 0 as the baseline and time points 4 and 24 as the interval over which differential gene expression will be assessed.
