arxiv: v4 [math.st] 26 Aug 2017


A GENERAL ASYMPTOTIC FRAMEWORK FOR DISTRIBUTION-FREE GRAPH-BASED TWO-SAMPLE TESTS

BHASWAR B. BHATTACHARYA

Abstract. Testing equality of two multivariate distributions is a classical problem for which many non-parametric tests have been proposed over the years. Most of the popular two-sample tests, which are asymptotically distribution-free, are based either on geometric graphs constructed using inter-point distances between the observations (multivariate generalizations of the Wald-Wolfowitz runs test) or on multivariate data-depth (generalizations of the Mann-Whitney rank test). This paper introduces a general notion of distribution-free graph-based two-sample tests, which includes the tests described above, and provides a unified framework for analyzing and comparing their asymptotic properties. The asymptotic efficiency of a general graph-based test is derived, which relates combinatorial properties of the underlying graph to the performance of the associated test. As a consequence, it is shown that tests based on geometric graphs such as the Friedman-Rafsky test [7], the test based on the K-nearest neighbor graph, and the cross-match test of Rosenbaum [38] have zero asymptotic (Pitman) efficiency against $O(N^{-1/2})$ alternatives. Asymptotic efficiencies of the tests based on multivariate depth functions (the Liu-Singh rank sum statistic [7]) are also derived from the proposed general theory. Applications of these results are illustrated on both synthetic and real datasets.

1. Introduction

Let $F$ and $G$ be two continuous distribution functions in $\mathbb{R}^d$ with densities $f$ and $g$ with respect to Lebesgue measure, respectively. Given independent and identically distributed samples

$$\mathcal{X}_{N_1} = \{X_1, X_2, \ldots, X_{N_1}\} \quad \text{and} \quad \mathcal{Y}_{N_2} = \{Y_1, Y_2, \ldots, Y_{N_2}\} \tag{1.1}$$

from two unknown densities $f$ and $g$, respectively, the two-sample problem is to distinguish the hypotheses

$$H_0 : f = g \quad \text{versus} \quad H_1 : f \neq g. \tag{1.2}$$

More precisely, $H_0$ is the collection of all distributions of mutually independent i.i.d.
observations with sample sizes $N_1$ and $N_2$ from a distribution in $\mathcal{P}_d$, the collection of all probability distributions in $\mathbb{R}^d$ which have a density with respect to Lebesgue measure. Similarly, the alternative $H_1$ is the collection of all distributions of mutually independent i.i.d. observations with sample size $N_1$ from some distribution $F \in \mathcal{P}_d$, and i.i.d. observations with sample size $N_2$ from some other distribution $G \in \mathcal{P}_d$, such that their respective densities $f$ and $g$ differ on a set of positive Lebesgue measure.

2010 Mathematics Subject Classification. 62G10, 62G30, 60D05, 60F05, 60C05.
Key words and phrases. Asymptotic efficiency, Distribution-free tests, Minimum spanning tree, Nearest-neighbor graphs, Two-sample problem.

There are many multivariate two-sample testing procedures, ranging from tests for parametric hypotheses, such as Hotelling's $T^2$-test and the likelihood ratio test, to more general non-parametric procedures [4, 5, 8, 9, 4, 7, 9, 0,, 37, 38, 39, 44]. In this paper, we consider multivariate two-sample tests which are asymptotically distribution-free, that is, tests for which the asymptotic null distribution does not depend on the underlying (unknown) distribution of the data. As a result, these tests can be directly implemented as asymptotically level $\alpha$ tests, making them practically convenient.

For univariate data, there are several celebrated non-parametric distribution-free tests, such as the Wald-Wolfowitz runs test [43] and the Mann-Whitney rank test [9] (see the textbook [8] for more on these tests). Many multivariate generalizations of these tests, which are asymptotically distribution-free, have been proposed. Most of these tests can be broadly classified into two categories:

(1) Tests based on geometric graphs: These tests are constructed using the inter-point distances between the observations. This includes the test based on the Euclidean minimal spanning tree by Friedman and Rafsky [7] (a generalization of the Wald-Wolfowitz runs test [43] to higher dimensions) and tests based on nearest neighbor graphs [, 39]. This also includes Rosenbaum's [38] test based on minimum non-bipartite matching and the test based on the Hamiltonian path by Biswas et al. [9] (both of which are distribution-free in finite samples), and the recent tests of Chen and Friedman []. Refer to Maa et al. [8] for theoretical motivations for using tests based on inter-point distances.

(2) Tests based on depth functions: The Liu-Singh rank sum tests [7] are a class of multivariate two-sample tests that generalize the Mann-Whitney rank test using the notion of data-depth. These include tests based on halfspace depth [4] and simplicial depth [6], among others.
For other generalizations of the Mann-Whitney test, refer to the survey by Oja [3] and the references therein.

In this paper, we provide a general framework of graph-based two-sample tests, which includes all the tests discussed above. We begin with a few definitions: A subset $S \subset \mathbb{R}^d$ is locally finite if $S \cap C$ is finite for all compact subsets $C \subset \mathbb{R}^d$. A locally finite set $S \subset \mathbb{R}^d$ is nice if all inter-point distances among the elements of $S$ are distinct. If $S$ is a set of $N$ i.i.d. points $W_1, W_2, \ldots, W_N$ from some continuous distribution $F$, then the distribution of $\|W_1 - W_2\|$ does not have any point mass, and $S$ is nice almost surely.

A graph functional $\mathscr{G}$ in $\mathbb{R}^d$ defines a graph for all finite subsets of $\mathbb{R}^d$, that is, given $S \subset \mathbb{R}^d$ finite, $\mathscr{G}(S)$ is a graph with vertex set $S$. A graph functional is said to be undirected/directed if the graph $\mathscr{G}(S)$ is an undirected/directed graph with vertex set $S$. We assume that $\mathscr{G}(S)$ has no self-loops or multiple edges, that is, no edge is repeated more than once in the undirected case, and no edge in the same direction is repeated more than once in the directed case. The set of edges in the graph $\mathscr{G}(S)$ will be denoted by $E(\mathscr{G}(S))$.

Definition 1.1. Let $\mathcal{X}_{N_1}$ and $\mathcal{Y}_{N_2}$ be i.i.d. samples of size $N_1$ and $N_2$ from densities $f$ and $g$, respectively, as in (1.1). The 2-sample test statistic based on the graph functional $\mathscr{G}$ is defined as

$$T(\mathscr{G}(\mathcal{X}_{N_1} \cup \mathcal{Y}_{N_2})) = \frac{\sum_{i=1}^{N_1} \sum_{j=1}^{N_2} \mathbf{1}\{(X_i, Y_j) \in E(\mathscr{G}(\mathcal{X}_{N_1} \cup \mathcal{Y}_{N_2}))\}}{|E(\mathscr{G}(\mathcal{X}_{N_1} \cup \mathcal{Y}_{N_2}))|}. \tag{1.3}$$
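The statistic in Definition 1.1 depends on the data only through an edge list and the sample labels. The following is a minimal sketch of that computation, not code from the paper: the function name `graph_statistic`, the integer vertex indexing, and the label convention (1 for the first sample, 2 for the second) are all assumptions made for illustration.

```python
def graph_statistic(edges, labels, directed=False):
    """Proportion of edges that join sample 1 to sample 2.

    edges    : iterable of (i, j) index pairs; for a directed graph the
               edge points from i to j
    labels   : labels[i] is 1 if pooled point i is from the first sample
               and 2 if it is from the second sample
    directed : if True, count only edges going from a label-1 vertex to a
               label-2 vertex; if False, ignore edge orientation
    """
    edges = list(edges)
    if not edges:
        raise ValueError("the graph must have at least one edge")
    count = 0
    for i, j in edges:
        if labels[i] == 1 and labels[j] == 2:
            count += 1
        elif not directed and labels[i] == 2 and labels[j] == 1:
            count += 1
    return count / len(edges)


# A path on four pooled points, with two different labelings.
path = [(0, 1), (1, 2), (2, 3)]
t_undirected = graph_statistic(path, [2, 1, 1, 2])
t_directed = graph_statistic(path, [2, 1, 1, 2], directed=True)
```

In this toy example two of the three edges join different samples, but only one of them points from a label-1 vertex to a label-2 vertex, so the undirected and directed readings of the same edge list differ.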

If $\mathscr{G}$ is an undirected graph functional, then the statistic $T$ counts the proportion of edges in the graph $\mathscr{G}(\mathcal{X}_{N_1} \cup \mathcal{Y}_{N_2})$ with one end point in $\mathcal{X}_{N_1}$ and the other end point in $\mathcal{Y}_{N_2}$. If $\mathscr{G}$ is a directed graph functional, then $T$ is the proportion of directed edges with the outward end in $\mathcal{X}_{N_1}$ and the inward end in $\mathcal{Y}_{N_2}$. By conditioning on the graph $\mathscr{G}(\mathcal{X}_{N_1} \cup \mathcal{Y}_{N_2})$, it is easy to see that under the null $H_0$, $E(T(\mathscr{G}(\mathcal{X}_{N_1} \cup \mathcal{Y}_{N_2})))$ is $N_1 N_2 / \binom{N}{2}$ or $N_1 N_2 / (2\binom{N}{2})$, depending on whether the graph functional is undirected or directed, respectively, where $N = N_1 + N_2$.

In this paper, the graph functionals are computed based on the Euclidean distance in $\mathbb{R}^d$, and the rejection region of the statistic (1.3) will be based on its asymptotic null distribution in the usual limiting regime $N \to \infty$ with

$$\frac{N_1}{N_1 + N_2} \to p \in (0, 1), \qquad \frac{N_2}{N_1 + N_2} \to q := 1 - p, \tag{1.4}$$

and $r := 2pq$. The test statistics considered in this paper will have $N^{-1/2}$-fluctuations under $H_0$. Thus, depending on the type of alternative, the test based on (1.3) will reject $H_0$ for large and/or small values of the standardized statistic

$$\mathcal{R}(\mathscr{G}(\mathcal{Z}_N)) = \sqrt{N} \left\{ T(\mathscr{G}(\mathcal{Z}_N)) - E(T(\mathscr{G}(\mathcal{Z}_N))) \right\}, \tag{1.5}$$

where $\mathcal{Z}_N = \mathcal{X}_{N_1} \cup \mathcal{Y}_{N_2}$ is the pooled sample. Let $\mathcal{Z}_N = \{Z_1, Z_2, \ldots, Z_N\}$ be the elements of the pooled sample, with $Z_i$ labelled

$$c_i = \begin{cases} 1 & \text{if } Z_i \in \mathcal{X}_{N_1}, \\ 2 & \text{if } Z_i \in \mathcal{Y}_{N_2}. \end{cases} \tag{1.6}$$

Then (1.3) can be re-written as

$$T(\mathscr{G}(\mathcal{Z}_N)) := \frac{\sum_{1 \leq i \neq j \leq N} \psi(c_i, c_j) \mathbf{1}\{(Z_i, Z_j) \in E(\mathscr{G}(\mathcal{Z}_N))\}}{|E(\mathscr{G}(\mathcal{Z}_N))|}, \tag{1.7}$$

where $\psi(c_i, c_j) = \mathbf{1}\{c_i = 1, c_j = 2\}$.

1.1. Two-Sample Tests Based on Geometric Graphs. Many popular multivariate two-sample test statistics are of the form (1.3), where the graph functional $\mathscr{G}$ is constructed using the inter-point distances of the pooled sample.

1.1.1. Wald-Wolfowitz (WW) Runs Test. This is one of the earliest known non-parametric tests for the equality of two univariate distributions [43]: Let $\mathcal{X}_{N_1}$ and $\mathcal{Y}_{N_2}$ be i.i.d. samples of size $N_1$ and $N_2$ as in (1.1). A run in the pooled sample $\mathcal{Z}_N = \mathcal{X}_{N_1} \cup \mathcal{Y}_{N_2}$ is a maximal non-empty segment of adjacent elements with the same label when the elements in $\mathcal{Z}_N$ are arranged in increasing order.
(Ties occur with zero probability by the assumption that both the distribution functions $F$ and $G$ are continuous.) If the two distributions are different, the elements with labels 1 and 2 would be clumped together, and the total number of runs $C(\mathcal{Z}_N)$ in $\mathcal{Z}_N$ will be small. On the other hand, for distributions which are equal/close, the different labels are jumbled up and $C(\mathcal{Z}_N)$ will be large. Thus, the WW-test rejects $H_0$ for small values of $C(\mathcal{Z}_N)$. It is easy to see that the WW-test is a graph-based test (1.3): since the number of runs is one more than the number of adjacent pairs with different labels,

$$C(\mathcal{X}_{N_1} \cup \mathcal{Y}_{N_2}) = 1 + (N - 1) \, T(\mathcal{P}(\mathcal{X}_{N_1} \cup \mathcal{Y}_{N_2})),$$

where $\mathcal{P}(S)$ is the path with $|S| - 1$ edges through the elements of $S$ arranged in increasing order, for all finite $S \subset \mathbb{R}$. Let $\mathcal{R}(\mathcal{P}(\mathcal{Z}_N))$ be the standardized version of $T(\mathcal{P}(\mathcal{Z}_N))$ as in (1.5). Wald and Wolfowitz [43] proved that $\mathcal{R}(\mathcal{P}(\mathcal{Z}_N))$ is distribution-free in finite samples, is asymptotically normal under $H_0$, and is consistent under all fixed alternatives. The WW-test often has low power in practice and has zero asymptotic efficiency, that is, it is powerless against $O(N^{-1/2})$ alternatives [30].

1.1.2. Friedman-Rafsky (FR) Test. Friedman and Rafsky [7] generalized the Wald-Wolfowitz runs test to higher dimensions by using the Euclidean minimal spanning tree of the pooled sample.

Definition 1.2. Given a nice finite set $S \subset \mathbb{R}^d$, a spanning tree of $S$ is a connected graph $\mathcal{T}$ with vertex-set $S$ and no cycles. The length $w(\mathcal{T})$ of $\mathcal{T}$ is the sum of the Euclidean lengths of the edges of $\mathcal{T}$. A minimum spanning tree (MST) of $S$, denoted by $\mathcal{T}(S)$, is a spanning tree with the smallest length, that is, $w(\mathcal{T}(S)) \leq w(\mathcal{T})$ for all spanning trees $\mathcal{T}$ of $S$.

Thus, $\mathcal{T}$ defines a graph functional in $\mathbb{R}^d$, and given $\mathcal{X}_{N_1}$ and $\mathcal{Y}_{N_2}$ as in (1.1), the FR-test rejects $H_0$ for small values of

$$T(\mathcal{T}(\mathcal{X}_{N_1} \cup \mathcal{Y}_{N_2})) = \frac{1}{N - 1} \sum_{i=1}^{N_1} \sum_{j=1}^{N_2} \mathbf{1}\{(X_i, Y_j) \in E(\mathcal{T}(\mathcal{X}_{N_1} \cup \mathcal{Y}_{N_2}))\}. \tag{1.8}$$

This is precisely the WW-test in $d = 1$, and is motivated by the same intuition: when the two distributions are different, the number of edges across labels 1 and 2 is small. Friedman and Rafsky [7] calibrated (1.8) as a permutation test, and showed that it has good power in practice for multivariate data. Later, Henze and Penrose [] proved that $\mathcal{R}(\mathcal{T}(\mathcal{Z}_N))$ is asymptotically normal under $H_0$ and is consistent under all fixed alternatives. Recently, Chen and Zhang [0] used the FR and related graph-based tests in change-point detection problems, and suggested new modifications of the FR-test for high-dimensional and object data [, ].

1.1.3. Test Based on K-Nearest Neighbor (K-NN) Graphs. As in (1.8), a multivariate two-sample test can be constructed using the K-nearest neighbor graph of $\mathcal{Z}_N$.
This was originally suggested by Friedman and Rafsky [7] and later studied by Schilling [39] and Henze [].

Definition 1.3. Given a nice finite set $S \subset \mathbb{R}^d$, the (undirected) K-nearest neighbor graph (K-NN) is a graph with vertex set $S$ with an edge $(a, b)$, for $a, b \in S$, if the Euclidean distance between $a$ and $b$ is among the $K$ smallest distances from $a$ to any other point in $S$ and/or among the $K$ smallest distances from $b$ to any other point in $S$. Denote the undirected K-NN of $S$ by $\mathcal{N}_K(S)$.

Given $\mathcal{X}_{N_1}$ and $\mathcal{Y}_{N_2}$ as in (1.1), the K-NN statistic is

$$T(\mathcal{N}_K(\mathcal{Z}_N)) = \frac{\sum_{i=1}^{N_1} \sum_{j=1}^{N_2} \mathbf{1}\{(X_i, Y_j) \in E(\mathcal{N}_K(\mathcal{Z}_N))\}}{|E(\mathcal{N}_K(\mathcal{Z}_N))|}, \tag{1.9}$$

where $\mathcal{Z}_N = \mathcal{X}_{N_1} \cup \mathcal{Y}_{N_2}$. As before, when the two distributions are different, the number of edges across the two samples is small (see Figure 1), so the K-NN test rejects $H_0$ for small values of (1.9). Schilling [39] showed that the test based on K nearest neighbors is asymptotically normal under $H_0$ and consistent under fixed alternatives. (The statistic (1.9) is slightly different from the test used by Schilling [39, Section ], which can also be re-written as a graph-based test (1.3) by allowing multiple edges in the K-NN graph.)

Figure 1. The 3-NN graph on a pooled sample in $\mathbb{R}^2$, with i.i.d. points from $N(0, I)$ colored blue and i.i.d. points from a second normal distribution with covariance $I$ colored red. Panels (a) and (b) correspond to two values of the parameter of the second distribution (0.05 in panel (b)) and give the number of edges with endpoints in different samples in each case.

1.1.4. Cross-Match (CM) Test. Rosenbaum [38] proposed a distribution-free multivariate two-sample test based on minimum non-bipartite matching. For simplicity, assume that the total number of samples $N$ is even; otherwise, add or delete a sample point to make it even.

Definition 1.4. Given a finite $S \subset \mathbb{R}^d$ and a symmetric distance matrix $D := ((d(a, b)))_{a \neq b \in S}$, a non-bipartite matching of $S$ is a pairing of the elements of $S$ into $N/2$ non-overlapping pairs, that is, a partition $S = \bigcup_{i=1}^{N/2} S_i$, where $|S_i| = 2$ and $S_i \cap S_j = \emptyset$ for $i \neq j$. The weight of a matching is the sum of the distances between the $N/2$ matched pairs. A minimum non-bipartite matching (NBM) of $S$ is a matching which has the minimum weight over all matchings of $S$.

The NBM defines a graph functional $\mathcal{W}$ as follows: for every $S \subset \mathbb{R}^d$ finite, $\mathcal{W}(S)$ is the graph with vertex set $S$ and an edge $(a, b)$ whenever there exists $i \in [N/2]$ such that $S_i = \{a, b\}$. Note that $\mathcal{W}(S)$ is a graph with $N/2$ pairwise disjoint edges. Given $\mathcal{X}_{N_1}$ and $\mathcal{Y}_{N_2}$ as in (1.1), the CM-test rejects $H_0$ for small values of

$$T(\mathcal{W}(\mathcal{X}_{N_1} \cup \mathcal{Y}_{N_2})) = \frac{1}{N/2} \sum_{i=1}^{N_1} \sum_{j=1}^{N_2} \mathbf{1}\{(X_i, Y_j) \in E(\mathcal{W}(\mathcal{X}_{N_1} \cup \mathcal{Y}_{N_2}))\}. \tag{1.10}$$
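The geometric-graph statistics above all reduce to building a graph on the pooled sample and counting the edges that join the two samples. The following sketch illustrates this for the MST and the undirected K-NN graph. It is an illustration under stated assumptions, not the implementation used in the paper: Prim's algorithm is chosen here only for brevity, the function names are hypothetical, and points are assumed to be tuples of floats with distinct inter-point distances.

```python
import math


def mst_edges(points):
    """Euclidean minimum spanning tree via Prim's algorithm (dense version).

    Returns N - 1 undirected edges as (parent, child) index pairs.
    """
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n      # cheapest known distance from the tree to each vertex
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the vertex outside the tree that is closest to the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges


def knn_edges(points, k):
    """Undirected K-NN graph: {a, b} is an edge if b is among the k nearest
    neighbours of a and/or a is among the k nearest neighbours of b."""
    n = len(points)
    edges = set()
    for a in range(n):
        others = sorted((b for b in range(n) if b != a),
                        key=lambda b: math.dist(points[a], points[b]))
        for b in others[:k]:
            edges.add((min(a, b), max(a, b)))
    return sorted(edges)


def cross_proportion(edges, labels):
    """Proportion of undirected edges whose endpoints carry different labels."""
    return sum(1 for i, j in edges if labels[i] != labels[j]) / len(edges)


# Univariate toy data: the MST is the path through the sorted points, so the
# number of cross edges is the number of runs minus one.
pts = [(0.1,), (0.4,), (0.9,), (1.5,), (2.2,), (3.1,)]
labels = [1, 1, 2, 1, 2, 2]
fr_statistic = cross_proportion(mst_edges(pts), labels)
knn_statistic = cross_proportion(knn_edges(pts, 1), labels)
```

In the toy data the labels form four runs and the MST has three edges joining different labels out of five, which matches the description of the FR-test as the WW-test in dimension one. The cross-match graph is omitted because a minimum non-bipartite matching needs a dedicated matching algorithm.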

Like the WW-test, but unlike the FR and the K-NN tests, the CM-test is distribution-free in finite samples under $H_0$. Rosenbaum [38] implemented this as a permutation test and derived the asymptotic normal distribution under the permutation null.

1.2. Two-Sample Tests Based on Data-Depth. Many non-parametric two-sample tests are based on depth functions, which are multivariate generalizations of ranks [6, 3]. Given a distribution function $F$ in $\mathbb{R}^d$, a depth function $D(\cdot, F) : \mathbb{R}^d \to [0, 1]$ is a function that provides a ranking of points in $\mathbb{R}^d$. High depth corresponds to centrality, while low depth corresponds to outlyingness. The center consists of the points that globally maximize the depth, and is often considered as a multivariate median of $F$. For $X \sim F$ and $Y \sim G$ two independent random variables in $\mathbb{R}^d$,

$$R_D(y, F) = P(X : D(X, F) \leq D(y, F)) \tag{1.11}$$

is a measure of the relative outlyingness of a point $y \in \mathbb{R}^d$ with respect to $F$. In other words, $R_D(y, F)$ is the fraction of the $F$ population with depth at most that of the point $y$. The dependence on $D$ in the notation $R_D(\cdot, F)$ will be dropped when it is clear from the context. The quality index

$$Q(F, G) := \int R(y, F) \, \mathrm{d}G(y) = P(D(X, F) \leq D(Y, F) \mid X \sim F, Y \sim G) \tag{1.12}$$

is the fraction of $F$ with depth at most that of the point $y$, averaged over $y$ distributed as $G$. When $F = G$ and $D(X, F)$ has a continuous distribution, then $R(Y, F) \sim \mathrm{Unif}[0, 1]$ and $Q(F, G) = 1/2$ [7, Proposition 3.].

Definition 1.5. Let $\mathcal{X}_{N_1}$ and $\mathcal{Y}_{N_2}$ be as in (1.1) and $F_{N_1}$ and $G_{N_2}$ be the empirical distribution functions, respectively. The Liu-Singh rank sum statistic [7] is

$$Q(F_{N_1}, G_{N_2}) := \int R(y, F_{N_1}) \, \mathrm{d}G_{N_2}(y) = \frac{1}{N_2} \sum_{j=1}^{N_2} R(Y_j, F_{N_1}), \tag{1.13}$$

the sample estimator of $Q(F, G)$.

The test rejects $H_0$ for large values of $|Q(F_{N_1}, G_{N_2}) - \tfrac{1}{2}|$ and can be re-written as a graph-based test (1.3). To see this, note that

$$Q(F_{N_1}, G_{N_2}) = \frac{1}{N_1 N_2} \sum_{j=1}^{N_2} \left| \{ i \in [N_1] : D(X_i, F_{N_1}) \leq D(Y_j, F_{N_1}) \} \right| = \frac{1}{N_1 N_2} \sum_{i=1}^{N_1} \sum_{j=1}^{N_2} \mathbf{1}\{D(X_i, F_{N_1}) \leq D(Y_j, F_{N_1})\}. \tag{1.14}$$

Let $\mathcal{Z}_N$ be the pooled sample with the labeling as in (1.6).
Construct a graph $\mathscr{G}_D(\mathcal{Z}_N)$ with vertices $\mathcal{Z}_N$ and a directed edge from $Z_i$ to $Z_j$ whenever $D(Z_i, F_{N_1}) \leq D(Z_j, F_{N_1})$. Note that

$\mathscr{G}_D(\mathcal{Z}_N)$ is a complete graph with directions on the edges depending on the relative order of the depth of the two endpoints. Then from (1.14),

$$Q(F_{N_1}, G_{N_2}) = \frac{1}{N_1 N_2} \sum_{1 \leq i \neq j \leq N} \psi(c_i, c_j) \mathbf{1}\{D(Z_i, F_{N_1}) \leq D(Z_j, F_{N_1})\} = \frac{1}{N_1 N_2} \sum_{1 \leq i \neq j \leq N} \psi(c_i, c_j) \mathbf{1}\{(Z_i, Z_j) \in E(\mathscr{G}_D(\mathcal{Z}_N))\},$$

with $\psi(\cdot, \cdot)$ as in (1.7). Since $|E(\mathscr{G}_D(\mathcal{Z}_N))| = \binom{N}{2}$, this implies

$$Q(F_{N_1}, G_{N_2}) = \frac{\binom{N}{2}}{N_1 N_2} \, T(\mathscr{G}_D(\mathcal{Z}_N)), \tag{1.15}$$

where $T$ is defined in (1.7). Thus, the Liu-Singh rank sum statistic based on a depth function $D$ is a graph-based test (1.3).

Common depth functions include the Mahalanobis depth, the halfspace depth, the simplicial depth, and the projection depth, among others. In the following, we recall the definitions for a few of these (refer to [6] for more on depth functions).

1. Mann-Whitney Test: This is one of the oldest non-parametric two-sample tests for univariate data [8, 4]. Depending on the alternative, the Mann-Whitney rank test rejects $H_0$ for small or large values of $\sum_{i=1}^{N_1} \sum_{j=1}^{N_2} \mathbf{1}\{X_i < Y_j\}$. This is a distribution-free test, which corresponds to taking $D(x, F) = F(x)$ in (1.14).

2. Halfspace Depth: Tukey [4] suggested the following depth function

$$HD(x, F) = \inf_{H_x} \int_{H_x} \mathrm{d}F, \tag{1.16}$$

where the infimum is taken over all closed halfspaces $H_x$ with $x$ on their boundary. The two-sample test based on the halfspace depth is obtained by using (1.16) in the statistic (1.14).

3. Mahalanobis Depth: Given a distribution function $F$, the Mahalanobis depth of a point $x \in \mathbb{R}^d$ is

$$MD(x, F) = \frac{1}{1 + (x - \mu(F))^\top \Sigma^{-1}(F) (x - \mu(F))}, \tag{1.17}$$

where $\mu(F) = \int x \, \mathrm{d}F(x)$ and $\Sigma(F) = \int (x - \mu(F))(x - \mu(F))^\top \mathrm{d}F(x)$ are the mean and covariance of the distribution $F$.

The Liu-Singh rank sum statistic is consistent against alternatives for which the quality index $Q(F, G) \neq \tfrac{1}{2}$ (recall (1.12)), under mild conditions [7]. A special version of the Liu-Singh rank sum statistic, with a reference sample, inherits the distribution-free property of the Mann-Whitney test in finite samples [7, Section 4].
However, the Liu-Singh rank sum statistic defined above (1.13) is only asymptotically distribution-free under the null hypothesis [7, 45]. Zuo and He [45, Theorem ] proved the asymptotic normality of the Liu-Singh rank sum statistic under general alternatives for depth functions satisfying certain regularity conditions.
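The reduction of the Liu-Singh statistic to a double sum of depth comparisons can be sketched in a few lines. The example below is a hedged illustration rather than the paper's code: the function names are hypothetical, `depth` is any user-supplied function of a point and a reference sample, and the univariate choice of depth as the empirical distribution function of the first sample is used only to check the Mann-Whitney correspondence on data without ties.

```python
def liu_singh_statistic(xs, ys, depth):
    """Sample quality index: the fraction of pairs (X_i, Y_j) for which
    the depth of X_i is at most the depth of Y_j, with both depths taken
    with respect to the first sample xs.

    depth(point, reference_sample) must return a real number.
    """
    depth_x = [depth(x, xs) for x in xs]
    depth_y = [depth(y, xs) for y in ys]
    count = sum(1 for a in depth_x for b in depth_y if a <= b)
    return count / (len(xs) * len(ys))


def ecdf_depth(point, sample):
    """Univariate choice D(x, F) = F(x), with F the empirical CDF of `sample`."""
    return sum(1 for s in sample if s <= point) / len(sample)


def mann_whitney_count(xs, ys):
    """Number of pairs (i, j) with X_i < Y_j."""
    return sum(1 for x in xs for y in ys if x < y)


xs = [0.3, 1.2, 2.5]
ys = [0.7, 1.9, 3.4, 0.1]
q_hat = liu_singh_statistic(xs, ys, ecdf_depth)
mw = mann_whitney_count(xs, ys)
```

On this toy data `q_hat` equals `mw / (len(xs) * len(ys))`, in line with item 1 above. Any multivariate depth (halfspace, Mahalanobis) could be passed as `depth` in the same way, at the computational cost discussed for those depths.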

1.3. Properties of Graph-Based Tests. A test function $\phi_N$ for the testing problem (1.2) is said to be asymptotically exact level $\alpha$ if $\lim_{N \to \infty} E_{H_0}(\phi_N) = \alpha$. An exact level $\alpha$ test function $\phi_N$ is said to be consistent against the alternative $H_1$ if $\lim_{N \to \infty} E_{H_1}(\phi_N) = 1$, that is, the power of the test $\phi_N$ converges to 1 in the usual asymptotic regime (1.4). All the tests described in Sections 1.1 and 1.2 have the following common properties:

Asymptotically Distribution-Free: The standardized test statistic (1.5) converges to $N(0, \sigma_1^2)$ under $H_0$, where the limiting variance $\sigma_1^2$ depends on the graph functional $\mathscr{G}$, but not on the unknown distributions. Therefore, under the null, the asymptotic distribution of the test statistic (1.5) is distribution-free. This was proved for the FR-test by Henze and Penrose [], for the test based on the K-NN graph by Henze [], and by Liu and Singh [7] for depth-based tests. Finally, recall that the CM test is exactly distribution-free in finite samples [38].

Consistent Against Fixed Alternatives: For a geometric graph functional $\mathscr{G}$ in $\mathbb{R}^d$, the test function with rejection region

$$\{\mathcal{R}(\mathscr{G}(\mathcal{Z}_N)) < \sigma_1 z_\alpha\}, \tag{1.18}$$

where $z_\alpha$ is the $\alpha$-th quantile of the standard normal, is asymptotically exact level $\alpha$ and consistent against all fixed alternatives, for the testing problem (1.2). This was proved for the MST by Henze and Penrose [34] and for the K-NN graph by Schilling [39] and later by Henze []. Recently, Arias-Castro and Pelletier [3] showed that the CM test is consistent against general alternatives. Similarly, the Liu-Singh rank sum statistic is asymptotically exact level $\alpha$ and consistent against alternatives for which the quality index $Q(F, G) \neq \tfrac{1}{2}$. (For depth-based tests, the rejection region can be of the form $\{|\mathcal{R}(\mathscr{G}(\mathcal{Z}_N))| \geq \sigma_1 z_{1 - \alpha/2}\}$, depending on the alternative.)

Even though the basic asymptotic properties of the tests based on geometric graphs and those based on data-depth are quite similar, their algorithmic complexities are very different.
Tests based on geometric graphs such as the MST, the K-NN graph, and the CM-test can be computed in polynomial time with respect to both the number of data points $N$ and the dimension $d$. For instance, the MST and the K-NN graph of a set of $N$ points in $\mathbb{R}^d$ can be computed easily in $O(dN^2)$ time, and the NBM matching can be computed in $O(dN^3)$ time [3]. On the other hand, depth-based tests are generally difficult to compute when the dimension is large, because the computation time is polynomial in $N$ but exponential in $d$ (with the exception of the Mahalanobis depth (1.17), which is easy to compute but gives rise to non-robust procedures because it involves the sample mean and covariance matrix). In fact, computing the Tukey depth of a set of $N$ points in $\mathbb{R}^d$ is NP-hard if both $N$ and $d$ are parts of the input [4], and it is even hard to approximate []. For more details on algorithms to compute various depth functions, refer to the recent survey of Rousseeuw and Hubert [36] and the references therein.

1.4. Summary of Results. The general notion of graph-based two-sample tests (1.3) introduced above provides a unified framework for analyzing and statistically comparing their asymptotic properties. We derive the limiting power of these tests against local alternatives,

that is, alternatives which shrink towards the null as the sample size grows to infinity. If the statistic (1.5) is asymptotically normal under the null, and the test based on (1.18) is consistent, it will generally have non-trivial power (greater than the size of the test) against $O(N^{-1/2})$ local alternatives. This can be formalized using the notion of Pitman efficiency of a test (see Definition 2.2 below), and can be used to compare the asymptotic performances of different tests. The following is a summary of the results obtained:

1. In Section 3 the asymptotic (Pitman) efficiency of a general graph-based test is derived. This result can be used as a black-box to derive the efficiency of any such two-sample test (Theorem 3.1 and Corollary 3.). The results obtained show how combinatorial properties of the underlying graph affect the performance of the associated test, which can be useful in the future for constructing and analyzing new tests. The results are illustrated through simulations and compared with other parametric tests in Section 5.

2. As a consequence of the general result, the asymptotic efficiency of the tests described in Sections 1.1 and 1.2 can be derived: It is shown that the Friedman-Rafsky test has zero asymptotic efficiency, that is, it is powerless against any $O(N^{-1/2})$ local alternatives (Theorem 4.). In fact, this phenomenon extends to a large class of random geometric graphs that exhibit local spatial dependence. This can be formalized using the notion of stabilization of geometric graphs [34], which includes the MST and the K-NN graph, among others (Theorem 4.). On the other hand, the tests based on data-depth (1.13) (see also Section 4.4) have non-trivial power, and hence non-zero asymptotic efficiencies, for many $O(N^{-1/2})$ alternatives (Theorem 4.).
The results above, together with the discussion at the end of Section 1.3, show that there is a trade-off between the asymptotic efficiency and the computation time of the tests based on geometric graphs and those based on data-depth: Tests based on geometric graphs can be computed efficiently, but they are asymptotically inefficient. On the other hand, tests based on data-depth are difficult to compute, but generally have non-trivial efficiencies. This raises the following intriguing question: Is it possible to construct an asymptotically distribution-free test which has non-trivial asymptotic efficiency and is also computationally tractable when the dimension is large? In Section 4.5, we address the fundamental difficulties and discuss practical alternatives.

3. Finally, the performance of the Friedman-Rafsky test and the test based on the halfspace depth are compared on the sensorless drive diagnosis data set [7] (Section 6).

Preliminaries about asymptotic efficiency and Le Cam's theory are discussed in Section 2. Proofs of the results are given in Appendix A-F.

2. Preliminaries

Begin by recalling some definitions and notation: For two sequences $\{a_n\}_{n \geq 1}$ and $\{b_n\}_{n \geq 1}$, $a_n \lesssim b_n$ means $a_n = O(b_n)$; $a_n \sim b_n$ means $a_n = (1 + o(1)) b_n$. For two random sequences $\{a_n\}_{n \geq 1}$ and $\{b_n\}_{n \geq 1}$ defined on the same probability space, $a_n = O_P(b_n)$ means

$\lim_{M \to \infty} \lim_{n \to \infty} P(|a_n / b_n| > M) = 0$, and $a_n = o_P(b_n)$ means $a_n / b_n$ converges in probability to zero. Finally, for $x \in \mathbb{R}$, $x_+ = \max\{x, 0\}$, and for a vector $z \in \mathbb{R}^s$, $\|z\| = (\sum_{i=1}^{s} z_i^2)^{1/2}$ is the Euclidean norm of $z$.

For $S \subseteq \mathbb{R}^d$ measurable, denote by $\mathcal{B}(S)$ the Borel sigma-algebra generated by the open subsets of $S$. Let $(S_n, \mathcal{B}(S_n))$ be a sequence of measurable spaces with two probability distributions $P_n$ and $Q_n$. The sequence $\{Q_n\}_{n \geq 1}$ is said to be contiguous with respect to $\{P_n\}_{n \geq 1}$ if $\lim_{n \to \infty} P_n(E_n) = 0$ implies $\lim_{n \to \infty} Q_n(E_n) = 0$ for all sequences $\{E_n\}$ of measurable sets. Contiguity is relevant for analyzing test functions under local alternatives, that is, alternatives shrinking towards the null as the sample size goes to infinity.

To quantify the notion of local alternatives, let $\Theta \subseteq \mathbb{R}^p$ and $\{P_\theta\}_{\theta \in \Theta}$ be a parametric family of distributions in $\mathbb{R}^d$ with density $f(\cdot \mid \theta)$, with respect to Lebesgue measure, indexed by a $p$-dimensional parameter $\theta \in \Theta$. Denote the score function by $\eta(\cdot, \theta) = \frac{\nabla_\theta f(\cdot \mid \theta)}{f(\cdot \mid \theta)}$ and the Fisher-information matrix by $I(\theta)$. To compute the asymptotic efficiency of tests, certain smoothness conditions are required on $f(\cdot \mid \theta)$. The standard technical condition is to assume differentiability in norm of the square root of the density:

Definition 2.1. [5, Definition..] The family $\{P_\theta\}_{\theta \in \Theta}$ is quadratic mean differentiable (QMD) at $\theta = \theta_0 \in \Theta$ if there exists a vector of real-valued functions $\nu(\cdot, \theta_0) = (\nu_1(\cdot, \theta_0), \nu_2(\cdot, \theta_0), \ldots, \nu_p(\cdot, \theta_0))^\top$ such that

$$\int_{\mathbb{R}^d} \left[ \sqrt{f(x \mid \theta_0 + h)} - \sqrt{f(x \mid \theta_0)} - \langle h, \nu(x, \theta_0) \rangle \right]^2 \mathrm{d}x = o(\|h\|^2), \quad \text{as } \|h\| \to 0.$$

The above condition holds for most standard families of distributions, including exponential families in natural form. Hereafter, we assume that $\{P_\theta\}_{\theta \in \Theta}$ is a QMD family such that the score function has bounded fourth moment.

Assumption 2.1. A parametric family of distributions $\{P_\theta\}_{\theta \in \Theta}$, with $\Theta \subseteq \mathbb{R}^p$, is said to satisfy Assumption 2.1 at $\theta = \theta_1$ if $P_\theta$ is QMD at $\theta = \theta_1$, and for $V \sim P_{\theta_1}$ and all $h \in \mathbb{R}^p$, $E(|\langle h, \eta(V, \theta_1) \rangle|^4) < \infty$.
Let $\mathcal{X}_{N_1}$ and $\mathcal{Y}_{N_2}$ be samples from $P_{\theta_1}$ and $P_{\theta_2}$ as in (1.1), respectively. For $h \in \mathbb{R}^p$, consider the testing problem

$$H_0 : \theta_2 - \theta_1 = 0 \quad \text{versus} \quad H_1 : \theta_2 - \theta_1 = \frac{h}{\sqrt{N}}. \tag{2.1}$$

Note that the tests are still carried out in the non-parametric setup, assuming no knowledge of the distributions of the two samples. However, the efficiency is computed assuming a parametric form for the unknown distributions. (Another choice of alternatives for computing the efficiency of a non-parametric test is to consider $H_0 : g = f$ versus $H_1 : g = (1 - \frac{\delta}{\sqrt{N}}) f + \frac{\delta}{\sqrt{N}} \tilde{g}$, for some density $\tilde{g}$ in $\mathbb{R}^d$ and $\delta > 0$. Then, under mild integrability assumptions, the densities associated with the alternative are contiguous, and the efficiency results derived in this paper easily extend to this case as well. We chose the formulation in (2.1) because it yields slightly cleaner asymptotic expansions and formulas.)
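As a concrete instance of the setup above, the following worked example spells out the score function, the Fisher information, and the local alternative for one specific family. The choice of the normal location family with identity covariance (so that $p = d$) is an assumption made here purely for illustration and is not singled out at this point in the paper; $\mathbf{I}_d$ denotes the $d \times d$ identity matrix.

```latex
\begin{align*}
f(x \mid \theta) &= (2\pi)^{-d/2} \exp\!\left(-\tfrac{1}{2}\|x - \theta\|^2\right),
  \qquad \theta \in \mathbb{R}^d, \\
\eta(x, \theta) &= \frac{\nabla_\theta f(x \mid \theta)}{f(x \mid \theta)} = x - \theta, \\
I(\theta) &= \mathbb{E}_\theta\!\left[\eta(V, \theta)\,\eta(V, \theta)^\top\right] = \mathbf{I}_d,
  \qquad \langle h, I(\theta_1) h \rangle = \|h\|^2, \\
\mathbb{E}\,\langle h, \eta(V, \theta_1) \rangle^4 &= 3\|h\|^4 < \infty
  \qquad (V \sim P_{\theta_1}), \\
H_0 &: Y_j \sim N(\theta_1, \mathbf{I}_d)
  \quad \text{versus} \quad
  H_1 : Y_j \sim N\!\left(\theta_1 + h/\sqrt{N},\, \mathbf{I}_d\right).
\end{align*}
```

The fourth-moment line uses the fact that $\langle h, V - \theta_1 \rangle \sim N(0, \|h\|^2)$ when $V \sim N(\theta_1, \mathbf{I}_d)$. Since this is an exponential family in natural form, the QMD requirement is covered by the remark following the QMD definition.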

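Let $\mathcal{Z}_N = \mathcal{X}_{N_1} \cup \mathcal{Y}_{N_2}$ be the pooled sample. The log-likelihood ratio for the testing problem (2.1) is

$$L_N := \sum_{j=1}^{N_2} \log \frac{f\left(Y_j \mid \theta_1 + \frac{h}{\sqrt{N}}\right)}{f(Y_j \mid \theta_1)} = \sum_{j=1}^{N} \log \frac{f\left(Z_j \mid \theta_1 + \frac{h}{\sqrt{N}}\right)}{f(Z_j \mid \theta_1)} \mathbf{1}\{c_j = 2\}, \tag{2.2}$$

where $c_j$ is the label of $Z_j$ as in (1.6). Since $P_\theta$ is QMD, by Lehmann and Romano [5, Theorem..3], in the usual asymptotic regime (1.4),

$$L_N \xrightarrow{D} N\left( -\frac{q}{2} \langle h, I(\theta_1) h \rangle, \; q \langle h, I(\theta_1) h \rangle \right).$$

This implies that the joint distributions of $\mathcal{X}_{N_1}$ and $\mathcal{Y}_{N_2}$ under $H_0$ and $H_1$ (as in (2.1)) are mutually contiguous [5, Corollary.3.]. Then by Le Cam's Third Lemma [5, Corollary.3.], if under the null $H_0$,

$$\begin{pmatrix} \mathcal{R}(\mathscr{G}(\mathcal{Z}_N)) \\ L_N \end{pmatrix} \xrightarrow{D} N\left( \begin{pmatrix} 0 \\ -\frac{q}{2} \langle h, I(\theta_1) h \rangle \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & q \langle h, I(\theta_1) h \rangle \end{pmatrix} \right), \tag{2.3}$$

then under the alternative $H_1$, $\mathcal{R}(\mathscr{G}(\mathcal{Z}_N)) \xrightarrow{D} N(\sigma_{12}, \sigma_1^2)$. Then the limiting power of the two-sample test based on $\mathscr{G}$ with rejection region (1.18) is given by $\Phi\left(z_\alpha - \frac{\sigma_{12}}{\sigma_1}\right)$, where $\Phi(\cdot)$ is the standard normal distribution function.

Definition 2.2. The quantity $\frac{(-\sigma_{12})_+}{\sigma_1}$ will be referred to as the asymptotic (Pitman) efficiency of the test statistic $\mathcal{R}(\mathscr{G}(\mathcal{Z}_N))$ for the testing problem (2.1), and will be denoted by $AE(\mathscr{G})$.

The ratio of the asymptotic efficiencies of two test statistics is the relative (Pitman) efficiency, which is the ratio of the number of samples required to achieve the same limiting power between two level $\alpha$ tests. Refer to the textbooks [8, 5, 4] for more on contiguity, local power and asymptotic efficiencies of tests.

3. Asymptotic Efficiency of Graph-Based Two-Sample Tests

This section describes the main results about the asymptotic efficiencies of general graph-based two-sample tests. There are two cases, depending on whether the graph functional is directed or undirected.

To begin, let $\mathscr{G}$ be a directed graph functional in $\mathbb{R}^d$. Denote by $E(\mathscr{G}(S))$ the set of edges in $\mathscr{G}(S)$, and by $E^+(\mathscr{G}(S))$ the set of pairs of vertices with edges in both directions (that is, the set of ordered pairs of vertices $(x, y)$ such that both $(x, y) \in E(\mathscr{G}(S))$ and $(y, x) \in E(\mathscr{G}(S))$).
For $x \in \mathbb{R}^d$, let $d^{\downarrow}(x, \mathscr{G}(S))$ be the in-degree of the vertex $x$ in the graph $\mathscr{G}(S \cup \{x\})$, $d^{\uparrow}(x, \mathscr{G}(S))$ be the out-degree of the vertex $x$ in the graph $\mathscr{G}(S \cup \{x\})$, and $d(x, \mathscr{G}(S)) = d^{\downarrow}(x, \mathscr{G}(S)) + d^{\uparrow}(x, \mathscr{G}(S))$ be the total degree of the vertex $x$ in the graph $\mathscr{G}(S \cup \{x\})$. Define the scaled in-degree and the scaled out-degree of a vertex as follows:

$$\lambda^{\downarrow}(x, \mathscr{G}(S)) = \frac{N d^{\downarrow}(x, \mathscr{G}(S^x))}{|E(\mathscr{G}(S^x))|}, \qquad \lambda^{\uparrow}(x, \mathscr{G}(S)) = \frac{N d^{\uparrow}(x, \mathscr{G}(S^x))}{|E(\mathscr{G}(S^x))|}, \tag{3.1}$$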

where $S^x = S \cup \{x\}$. Also, let

$$T_2^{\uparrow}(\mathscr{G}(S)) = \sum_{x \in S} \binom{d^{\uparrow}(x, \mathscr{G}(S))}{2}, \qquad T_2^{\downarrow}(\mathscr{G}(S)) = \sum_{x \in S} \binom{d^{\downarrow}(x, \mathscr{G}(S))}{2} \tag{3.2}$$

be the number of outward 2-stars and inward 2-stars in $\mathscr{G}(S)$, respectively. Finally, let $T_2^{+}(\mathscr{G}(S))$ be the number of 2-stars in $\mathscr{G}(S)$ with different directions on the two edges.

For an undirected graph functional $\mathscr{G}$, denote by $d(x, \mathscr{G}(S))$ the degree of the vertex $x$ in the graph $\mathscr{G}(S \cup \{x\})$. As in (3.1) and (3.2), let

$$\lambda(x, \mathscr{G}(S)) = \frac{N d(x, \mathscr{G}(S^x))}{|E(\mathscr{G}(S^x))|}, \qquad T_2(\mathscr{G}(S)) = \sum_{x \in S} \binom{d(x, \mathscr{G}(S))}{2}, \tag{3.3}$$

where $S^x = S \cup \{x\}$.

Intuitively, the functions $\lambda^{\uparrow}(x, \mathscr{G}(S))$, $\lambda^{\downarrow}(x, \mathscr{G}(S))$, and $\lambda(x, \mathscr{G}(S))$ measure the relative position of the point $x$ in the set $S \cup \{x\}$. For example, in the FR-test or the test based on the K-NN graph, large values of $\lambda$ correspond to points near the center of the data-cloud, whereas small values correspond to outliers. Similarly, for depth-based tests, small/large values of $\lambda^{\uparrow}$ and $\lambda^{\downarrow}$ generally correspond to points which are outliers. In fact, for such tests, this is directly related to the relative outlyingness of the point $x$, as defined in (1.11) (see Observation F.).

3.1. Asymptotic Efficiency for Directed Graph Functionals. Let $\mathscr{G}$ be a directed graph functional in $\mathbb{R}^d$ and $\{P_\theta\}_{\theta \in \Theta}$ a parametric family of distributions satisfying Assumption 2.1 at $\theta = \theta_1 \in \Theta \subseteq \mathbb{R}^p$. To derive the asymptotic efficiency of a general graph-based test using (2.3), various assumptions are required on the graph functional $\mathscr{G}$. The variance condition stated below requires that the proportion of edges and the (scaled) number of 2-stars converge, which ensures that the variance of $\mathcal{R}(\mathscr{G}(\mathcal{Z}_N))$ converges (see Lemma A. in Appendix A.).

Assumption 3.1. (Variance Condition) The pair $(\mathscr{G}, P_{\theta_1})$ is said to satisfy the variance condition with parameters $(\beta_0, \beta_0^+, \beta^{\uparrow}, \beta^{\downarrow}, \beta^{+})$ if, for $\mathcal{V}_N := \{V_1, V_2, \ldots, V_N\}$ i.i.d. from $P_{\theta_1}$, there exist finite non-negative constants $\beta_0$, $\beta_0^+$, $\beta^{\uparrow}$, $\beta^{\downarrow}$, and $\beta^{+}$ (which do not depend on $P_{\theta_1}$) such that

(a) $\dfrac{N}{|E(\mathscr{G}(\mathcal{V}_N))|} \xrightarrow{P} \beta_0$, and $\dfrac{N |E^+(\mathscr{G}(\mathcal{V}_N))|}{|E(\mathscr{G}(\mathcal{V}_N))|^2} \xrightarrow{P} \beta_0^+$,

(b) $\dfrac{N T_2^{\uparrow}(\mathscr{G}(\mathcal{V}_N))}{|E(\mathscr{G}(\mathcal{V}_N))|^2} \xrightarrow{P} \beta^{\uparrow}$, and $\dfrac{N T_2^{\downarrow}(\mathscr{G}(\mathcal{V}_N))}{|E(\mathscr{G}(\mathcal{V}_N))|^2} \xrightarrow{P} \beta^{\downarrow}$,

(c) $\dfrac{N T_2^{+}(\mathscr{G}(\mathcal{V}_N))}{|E(\mathscr{G}(\mathcal{V}_N))|^2} \xrightarrow{P} \beta^{+}$.

The variance condition is a very mild requirement which is satisfied by all the graph functionals described in Section 1 (see Section 4 for details).

Next, to show (2.3), the limiting covariance of the statistic $\mathcal{R}(\mathscr{G}(\mathcal{Z}_N))$ and the log-likelihood ratio $L_N$ has to be computed. To ensure the existence of this limit, the following condition is required:

Assumption 3.2. (Covariance Condition) The pair $(\mathscr{G}, P_{\theta_1})$ is said to satisfy the covariance condition if for $\mathcal{V}_N := \{V_1, V_2, \ldots, V_N\}$ i.i.d. from $P_{\theta_1}$ the following holds:

(a) There exist functions $\lambda^{\uparrow}, \lambda^{\downarrow} : \mathbb{R}^d \to \mathbb{R}$ such that for almost all $z \in \mathbb{R}^d$,

$$\lambda^{\uparrow}(z) = \lim_{N \to \infty} E(\lambda^{\uparrow}(z, \mathscr{G}(\mathcal{V}_N))), \quad \text{and} \quad \lambda^{\downarrow}(z) = \lim_{N \to \infty} E(\lambda^{\downarrow}(z, \mathscr{G}(\mathcal{V}_N))), \tag{3.4}$$

and zero otherwise.

(b) For $h \in \mathbb{R}^p$ as in (2.1),

$$\frac{1}{N} \sum_{i=1}^{N} \langle h, \eta(V_i, \theta_1) \rangle \lambda^{\uparrow}(V_i, \mathscr{G}(\mathcal{V}_N)) \xrightarrow{P} \int \langle h, \nabla f(z \mid \theta_1) \rangle \lambda^{\uparrow}(z) \, \mathrm{d}z, \tag{3.5}$$

and the same holds for $\lambda^{\downarrow}$.

Finally, to show (2.3), the joint normality of $\mathcal{R}(\mathscr{G}(\mathcal{Z}_N))$ and $L_N$ must be proved. To this end, define $\Delta(\mathscr{G}(S)) := \max_{i \in [N]} d(V_i, \mathscr{G}(S))$, the total maximum degree of the graph $\mathscr{G}(S)$, where $[N] := \{1, 2, \ldots, N\}$.

Assumption 3.3. (Normality Condition) The pair $(\mathscr{G}, P_{\theta_1})$ is said to satisfy the normality condition if for $\mathcal{V}_N := \{V_1, V_2, \ldots, V_N\}$ i.i.d. from $P_{\theta_1}$ either one of the following holds:

(a) (Condition N1) The maximum degree

$$\Delta(\mathscr{G}(\mathcal{V}_N)) = O_P(1). \tag{3.6}$$

(b) (Condition N2) The maximum degree $\Delta(\mathscr{G}(\mathcal{V}_N)) \xrightarrow{P} \infty$ and

$$\frac{N \Delta(\mathscr{G}(\mathcal{V}_N))}{|E(\mathscr{G}(\mathcal{V}_N))|} = O_P(1). \tag{3.7}$$

Remark 3.1. Assumption 3.3 ensures that either (a) the graph has bounded degree (for a slightly weaker condition refer to Proposition B. in Appendix B), or (b) the maximum degree $\Delta(\mathscr{G}(\mathcal{V}_N))$ is of the same order as the average degree $|E(\mathscr{G}(\mathcal{V}_N))| / N$. The examples in Section 1 satisfy either one of these conditions, and in Propositions B. and C. the joint normality of the likelihood ratio and the statistic (1.5) is proved under conditions N1 and N2, respectively. However, determining a necessary and sufficient condition for the asymptotic normality of the statistic (1.5) is a non-trivial problem. This can be reformulated as proving the CLT of the number of monochromatic edges in a randomly 2-colored graph, which has been extensively studied, especially in the case of equal sample sizes ($p = q$) (refer to [6] and the references therein).

The following result gives the asymptotic efficiency of a graph-based test, if the graph functional satisfies the above conditions. The proof of the theorem is given in Appendix A.

Theorem 3.1. Let $\mathscr{G}$ be a directed graph functional and $\{P_\theta\}_{\theta \in \Theta}$ a parametric family of distributions in $\mathbb{R}^d$ satisfying Assumption 2.1 at $\theta = \theta_1 \in \Theta \subseteq \mathbb{R}^p$.
Suppose the pair (G, P θ ) satisfies the variance condition 3. with parameters (β 0, β 0 +, β, β, β + ), the covariance condition 3., and the normality condition 3.3. Then the asymptotic efficiency of the test statistic (.5) for the testing problem (.) is AE(G ) := r r ( p h, f(z θ ) λ (z)dz q h, f(z θ ) λ (z)dz ) + { ( )}, β 0 + qβ + pβ r β 0 + β+ 0 + β + β + β +

whenever the denominator above is strictly positive.

The graph functionals defined in Section satisfy the variance, covariance, and normality conditions, and the above theorem can be used to compute their asymptotic efficiencies (see Section 4 for details). More generally, the efficiency formula shows how combinatorial properties of the underlying graph affect the performance of the associated test.

Remark 3.2. The efficiency of a test is governed by the functions $\lambda^{\uparrow}, \lambda^{\downarrow}$. Recall that $\lambda^{\uparrow}(z), \lambda^{\downarrow}(z)$ are the scaled out/in-degrees of the vertex $z$, in the graph constructed with $z$ and the other data points. Note that for points close to the center of the data-cloud, $\lambda^{\uparrow}(z)$ and/or $\lambda^{\downarrow}(z)$ will be large, that is, these functions behave like depth measures of the data set. In fact, it will be shown in Section 4.4 that for tests based on depth functions, they are directly related to the relative outlyingness of the associated depth function. On the other hand, in Section 4.2 it will be shown that for tests based on stabilizing geometric graphs, $\lambda^{\uparrow}, \lambda^{\downarrow}$ are constant functions. Intuitively, this means that for these graphs, the degree of a new point is blind to the geometry of the data distribution, and, as a consequence, the associated test has zero efficiency.

The proof of Theorem 3.1 also gives the limiting null distribution of the test statistic (.5).

Remark 3.3. Let $G$ be a directed graph functional and $f$ and $g$ be two densities in $R^d$. Note that the variance condition 3.1 and the normality condition 3.3 naturally extend to the pair $(G, P_f)$, where $P_f$ is the distribution function corresponding to $f$. The proof of Theorem 3.1 shows that if the pair $(G, P_f)$ satisfies the variance condition 3.1 and the normality condition 3.3, then under the null hypothesis ($f = g$)

$R(G(Z_N)) \xrightarrow{D} N(0, \sigma^2)$,

where $\sigma$ is the denominator of the formula in Theorem 3.1. This gives the asymptotic null distribution of a general graph-based two-sample test.
In particular, this gives a unified proof for the asymptotic null distribution of the FR-test [, Theorem ], the K-NN test [39, Theorem 3.], and the Liu-Singh rank sum statistic [7, Theorem 6.].

Remark 3.4. The condition that the denominator in the formula in Theorem 3.1 is positive ensures that the limiting variance of the statistic (.5) is positive (see Lemma A. in Appendix A.). If the limiting variance happens to be zero, then the statistic does not have a non-degenerate limiting distribution at the scale considered here, and the asymptotic efficiency against alternatives (.) is undefined.

3.. Asymptotic Efficiency for Undirected Graph Functionals. Every undirected graph functional $G$ can be modified to a directed graph functional $G^+$ in a natural way: for $S \subset R^d$ finite, $G^+(S)$ is obtained by replacing every edge in $G(S)$ with two edges, one in each direction. The asymptotic efficiency of the test based on $G$ can then be derived by applying Theorem 3.1 to the directed graph functional $G^+$. The following is the analogue of these assumptions for undirected graph functionals.

Assumption 3.4. Let $G$ be an undirected graph functional and let $V_N := \{V_1, V_2, \ldots, V_N\}$ be i.i.d. from $P_{\theta_1}$. The pair $(G, P_{\theta_1})$ is said to satisfy Assumption 3.4 if the following hold:

15 DISTRIBUTIO-FREE GRAPH-BASED TWO-SAMPLE TESTS 5 ((γ 0, γ )-Undirected Variance Condition) There exists finite non-negative constants γ 0, γ such that E(G (V )) P γ 0, and T (G (V )) P γ E(G (V )). (3.8) (Undirected Covariance Condition) There exists a function λ : R d R, such that for almost all z R d, lim E(λ(z, G (V ))) λ(z), and zero otherwise. Moreover, for h R p as in (.) ˆ h, η(v i, θ ) λ(v i, G (V )) P h, f(z θ ) λ(z)dz. (3.9) (ormality Condition) The pair (G +, P θ ) satisfies the normality condition as in Assumption 3.3. If G satisfies the above condition, then G + satisfies Assumptions (see Appendix A.3.), and applying Theorem 3. for G + we get the following: Corollary 3.. Let G be an undirected graph functional and {P θ } θ Θ a parametric family of distributions in R p satisfying Assumption. at θ = θ Θ. If the pair (G, P θ ) satisfies Assumption 3.4, then the asymptotic efficiency of the test statistic (.5) is ( r AE(G ) = (p q) h, f(z θ ) λ(z)dz ) + r {γ0 ( r) + (γ )( r)}, (3.0) whenever the denominator above is strictly positive. Remark 3.5. ote that the numerator in (3.0) is zero when p = q = /, that is, when the two sample sizes and are asymptotically equal. This is because, conditional on the graph, the variables {ψ(c i, c j ) : (Z i, Z j ) E(G (Z ))} are pairwise independent when p = q, and, as a result, the test statistic (.5) does not correlate with the likelihood ratio. Therefore, depending on the value of the denominator in (3.0) the following two cases arise: If the graph functional is non-sparse, that is, E(G (V )) /, then γ 0 = 0 in (3.8) and the denominator in (3.0) is zero, since r = pq = /. Therefore, non-sparse undirected graph functionals do not have a non-degenerate distribution at the scale when p = q, and Corollary 3. does not apply. If the graph is sparse, that is, γ 0 > 0, then the denominator of (3.0) is non-zero when p = q. 
This shows that tests based on sparse graph functionals cannot have non-zero efficiencies when the two sample sizes are asymptotically equal.

4. Applications

In this section we compute the asymptotic efficiencies of the tests described in Section using Theorem 3.1 and Corollary 3.2. Extensions and generalizations which can be used to construct locally efficient tests are also discussed.
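Remark 3.5 attributes the vanishing numerator at $p = q$ to the pairwise independence of the edge indicators. The following is a minimal sketch of that fact, under assumptions that are not taken from the paper: the labels are assigned independently with P(label = 1) = p, and the indicator attached to an undirected edge is 1 when its endpoints carry different labels. The function name `adjacent_edge_covariance` is hypothetical. For two edges sharing a vertex the covariance works out to $pq(1 - 4pq)$, which for $0 < p < 1$ vanishes only when $p = q = 1/2$; since the indicators are binary, zero covariance is the same as independence.

```python
from itertools import product

def adjacent_edge_covariance(p):
    """Exact covariance of 1{c1 != c2} and 1{c2 != c3} for i.i.d. labels.

    Labels take the value 1 with probability p and 2 with probability 1 - p
    (an assumed labeling model, used only for this illustration).
    """
    prob = {1: p, 2: 1.0 - p}
    mean_first = mean_second = mean_both = 0.0
    for c1, c2, c3 in product((1, 2), repeat=3):
        weight = prob[c1] * prob[c2] * prob[c3]
        first = c1 != c2      # edge (1, 2) joins the two samples
        second = c2 != c3     # edge (2, 3) joins the two samples
        mean_first += weight * first
        mean_second += weight * second
        mean_both += weight * (first and second)
    return mean_both - mean_first * mean_second

for p in (0.5, 0.3):
    q = 1.0 - p
    # Closed form for comparison: p * q * (1 - 4 * p * q).
    print(p, adjacent_edge_covariance(p), p * q * (1 - 4 * p * q))
```

The enumeration covers only a path on three vertices; it illustrates the local mechanism and is not a substitute for the argument in the paper.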

4.1. The Friedman-Rafsky (FR) Test. Let $T$ be the MST functional as in Definition . and consider the two-sample test based on $T$ (.8). Examples suggested that this test has zero asymptotic efficiency, which is formalized in the following theorem:

Theorem 4.1. Let $\{P_\theta\}_{\theta \in \Theta}$ be a parametric family of distributions in $R^d$ satisfying Assumption . at $\theta = \theta_1 \in \Theta$. Then the asymptotic efficiency of the Friedman-Rafsky test (.8), for the testing problem (.), is zero, that is, $AE(T) = 0$.

The FR-test has zero asymptotic efficiency because the function $\lambda(z)$ does not depend on $z \in R^d$, and hence the numerator in (3.10) is zero. This is a consequence of the following well-known result: for $V_N = \{V_1, V_2, \ldots, V_N\}$ i.i.d. $P_{\theta_1}$, the limit $\lim_{N \to \infty} E(\lambda(z, T(V_N)))$ is a constant that does not depend on $z$ (refer to [, Proposition ]). The Efron-Stein inequality [5] can then be used to show that $(T, P_{\theta_1})$ satisfies the covariance condition with a constant function $\lambda(z)$, for all $z \in R^d$. The details of the proof are given in Appendix D.

4.2. Tests Based on Stabilizing Graphs. The proof of Theorem 4.1 uses the fact that the MST graph functional has local dependence, that is, addition/deletion of a point only affects the edges incident on the neighborhood of that point. This phenomenon holds for many other random geometric graphs and was formalized by Penrose and Yukich [34] using the notion of stabilization.

Let $G$ be a graph functional defined for all locally finite subsets of $R^d$. (The K-NN graph can be naturally extended to locally finite infinite point sets. Aldous and Steele [] extended the MST graph functional to locally finite infinite point sets using Prim's algorithm.) For $S \subset R^d$ locally finite and $x \in R^d$, let $E(x, G(S))$ be the set of edges incident on $x$ in $G(S \cup \{x\})$. Note that $|E(x, G(S))| = d(x, G(S))$, the (total) degree of the vertex $x$ in $G(S \cup \{x\})$.

Definition 4.1. Given $S \subset R^d$, $y \in R^d$, and $a \in R$, denote $y + S := \{y + z : z \in S\}$ and $aS := \{az : z \in S\}$.
A graph functional $G$ is said to be translation invariant if the graphs $G(x + S)$ and $G(S)$ are isomorphic for all points $x \in R^d$ and all locally finite $S \subset R^d$. A graph functional $G$ is scale invariant if $G(aS)$ and $G(S)$ are isomorphic for all $a \in R$ and all locally finite $S \subset R^d$.

Let $P_\lambda$ be the Poisson process of intensity $\lambda \geq 0$ in $R^d$, and $P_{\lambda,x} := P_\lambda \cup \{x\}$, for $x \in R^d$. Penrose and Yukich [34] defined stabilization of graph functionals over homogeneous Poisson processes as follows:

Definition 4.2 (Penrose and Yukich [34]). A translation and scale invariant graph functional $G$ stabilizes $P_\lambda$ if there exists a random but almost surely finite variable $R$ such that

$E(0, G(P_{\lambda,0})) = E(0, G((P_{\lambda,0} \cap B(0, R)) \cup A))$, (4.1)

for all finite $A \subset R^d \setminus B(0, R)$, where $B(0, R)$ is the (Euclidean) ball of radius $R$ with center at the point $0 \in R^d$.

Informally, stabilization ensures that the insertion of a point (or finitely many points) far away from the origin 0 does not affect the degree of 0 in the graph $G(P_{\lambda,0})$, that is, such an insertion has only a local effect. Many geometric graphs, like the MST, the K-NN graph, and the Delaunay graph, are stabilizing [34].
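The locality behind stabilization can be checked on a finite toy example. The sketch below is only an illustration under stated assumptions, not the Poisson-process definition: it uses an undirected K-NN graph on a fixed-seed random cloud, treats the origin as the inserted point, and adds a hypothetical tight group of $K + 1$ points far from the cloud. The helper names `knn_neighbors` and `edges_at_origin` are invented for this example.

```python
import math
import random

def knn_neighbors(pts, i, k):
    """Indices of the k nearest neighbors of pts[i] among the other points."""
    order = sorted((j for j in range(len(pts)) if j != i),
                   key=lambda j: math.dist(pts[i], pts[j]))
    return order[:k]

def edges_at_origin(points, k):
    """Edges incident on the origin in the undirected K-NN graph on points + origin.

    Returned as the set of neighbor coordinates, so that results computed
    from different point sets can be compared directly.
    """
    pts = [(0.0, 0.0)] + list(points)        # index 0 is the origin
    incident = {pts[j] for j in knn_neighbors(pts, 0, k)}
    for i in range(1, len(pts)):
        if 0 in knn_neighbors(pts, i, k):    # pts[i] chooses the origin
            incident.add(pts[i])
    return incident

rng = random.Random(0)
k = 3
cloud = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(200)]
# Hypothetical far-away insertions: a tight group of k + 1 points near (100, 100).
far = [(100.0 + 0.1 * j, 100.0 - 0.1 * j) for j in range(k + 1)]

before = edges_at_origin(cloud, k)
after = edges_at_origin(cloud + far, k)
print(before == after)
```

Here the far points are closer to each other than to the cloud, so none of them selects the origin, and no cloud point selects a far point; the edge set at the origin is therefore unchanged by the insertion.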

4.2.1. Asymptotic Efficiency of Stabilizing Graphs. The following theorem shows that tests based on stabilizing graph functionals have zero asymptotic efficiency.

Theorem 4.2. Let $\{P_\theta\}_{\theta \in \Theta}$ be a parametric family of distributions in $R^d$ satisfying Assumption . at $\theta = \theta_1 \in \Theta$, and $G$ be a translation and scale invariant graph functional which stabilizes $P_1$. If the pair $(G, P_{\theta_1})$ satisfies Assumption 3.3 and

$\sup_{N} \sup_{z \in R^d} E(d(z, G(Z_N))^s) < \infty$, (4.2)

for some $s > 4$, then the asymptotic efficiency of the two-sample test based on $G$ (.5), for the testing problem (.), is zero, that is, $AE(G) = 0$.

The outline of the proof is as follows (refer to Appendix D for details): for undirected stabilizing graph functionals, both the scaled number of edges $|E(G(V_N))|/N$ and the expected degree $E d(z, G(V_N))$ of a fixed point $z$ converge to limits determined by $E d(0, G(P_{1,0}))$, where $V_N = \{V_1, V_2, \ldots, V_N\}$ are i.i.d. $P_{\theta_1}$. Therefore, $\lambda(z, V_N)$ converges to a constant and, as in the FR-test, it can be shown that $(G, P_{\theta_1})$ satisfies the covariance condition with a constant function $\lambda(z)$, for all $z \in R^d$. Thus, the numerator in (3.10) is zero, and the result follows. The proof holds almost verbatim for directed stabilizing graph functionals as well.

4.2.2. Applications of Theorem 4.2. Graph functionals such as the MST, the K-NN, and others not considered in this paper, like the Delaunay graph and the Gabriel graph, are stabilizing [34]. Theorem 4.2 can be used to show that tests based on such graph functionals have zero asymptotic efficiency.

Example 4.1 (Minimum Spanning Tree (MST)). By [34, Lemma .] the MST graph functional $T$ stabilizes $P_1$. Moreover, the degree of a vertex in the MST of a set of points in $R^d$ is bounded by a constant $B_d$, depending only on the dimension $d$ [, Lemma 4]. Therefore, the normality condition in Assumption 3.3 and the moment condition (4.2) are trivially satisfied. Theorem 4.2 then implies $AE(T) = 0$, thus re-deriving Theorem 4.1.

Example 4.2 (K-Nearest Neighbor (NN)). By [33, Lemma 6.], the K-NN graph functional $N_K$ stabilizes $P_1$. Condition (a) in Assumption 3.3 and the moment condition (4.2)
are trivially satisfied, since $N_K$ is a bounded degree graph functional. This implies $AE(N_K) = 0$, showing that tests based on K-NN graphs have no asymptotic local power.

4.3. Cross-Match (CM) Test. Let $W$ be the minimum non-bipartite matching (NBM) graph functional as in Definition .4. It is unknown whether $W$ is stabilizing [34], and so Theorem 4.2 cannot be applied to compute the asymptotic efficiency of the cross-match test. However, in this case, Corollary 3.2 can be used directly to derive the following:

Corollary 4.3. Let $\{P_\theta\}_{\theta \in \Theta}$ be a parametric family of distributions in $R^d$ satisfying Assumption . at $\theta = \theta_1 \in \Theta$. Then the asymptotic efficiency of the cross-match (CM) test (.0), for the testing problem (.), is zero, that is, $AE(W) = 0$.

Proof. Let $N$ be even and $V_N = \{V_1, V_2, \ldots, V_N\}$ i.i.d. $P_{\theta_1}$. In this case, $|E(W(V_N))| = N/2$ and $T_2(W(V_N)) = 0$, since a perfect matching has no two edges sharing a vertex. Therefore, $(W, P_{\theta_1})$ satisfies the $(2, 0)$-undirected variance condition (3.8). The normality condition (3.6) holds, since $d(V_i, W(V_N)) = 1$ for all $V_i \in V_N$. This also implies that the undirected covariance condition holds with a constant function $\lambda(z)$, for all $z \in R^d$. The result then follows by Corollary 3.2.
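The combinatorial facts used in this proof (the matching has $N/2$ edges and every vertex has degree 1) can be checked on a small example. The sketch below is an illustration only: it finds the minimum-distance perfect matching by brute force, which is feasible only for small even $N$ and is not how the cross-match test is computed in practice. The function name, the sample labels, and the point set are all hypothetical.

```python
import math
import random

def min_distance_perfect_matching(points):
    """Brute-force minimum total-distance perfect matching (small even N only)."""
    best_cost, best_pairs = math.inf, []

    def extend(unmatched, pairs, cost):
        nonlocal best_cost, best_pairs
        if cost >= best_cost:
            return                      # prune: cannot beat the best matching so far
        if not unmatched:
            best_cost, best_pairs = cost, list(pairs)
            return
        i, rest = unmatched[0], unmatched[1:]
        for j in rest:
            remaining = [t for t in rest if t != j]
            pairs.append((i, j))
            extend(remaining, pairs, cost + math.dist(points[i], points[j]))
            pairs.pop()

    extend(list(range(len(points))), [], 0.0)
    return best_pairs

rng = random.Random(1)
n = 10
points = [(rng.random(), rng.random()) for _ in range(n)]
labels = [1] * (n // 2) + [2] * (n // 2)     # hypothetical sample labels

matching = min_distance_perfect_matching(points)
degree = [0] * n
for i, j in matching:
    degree[i] += 1
    degree[j] += 1
# Number of matched pairs containing one point from each sample.
cross_matches = sum(labels[i] != labels[j] for i, j in matching)

print(len(matching) == n // 2, set(degree) == {1}, cross_matches)
```

Because every vertex has degree exactly 1 regardless of where the points lie, the degree of a point carries no information about the underlying density, which is the intuition behind the constant $\lambda(z)$ in the proof.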

18 8 B. B. BHATTACHARYA 4.4. Depth-Based Tests. Let Z = X Y be the pooled sample, and F the empirical distribution of X. The two sample test based on a depth function D (4.4) rejects for large values of T (G D (Z )), where the graph G D (Z ) has vertex set Z with a directed edge (Z i, Z j ) whenever D(Z i, F ) D(Z j, F ). If the depth function D(X, F ), where X F, has a continuous distribution then G D (Z ) is a complete graph ( E(G D (Z )) = ( )/) with directions on the edges depending on the relative ordering of the depth of the two endpoints. Definition 4.3. Let F be a distribution function in R d with empirical distribution function F. A depth function D is said to be good with respect to F if (A) For X F, the distribution of D(X, F ) is continuous. (A) P(y D(Y, F ) y ) C y y, for some constant C and any y, y [0, ]. (A3) sup x R d D(x, F ) D(x, F ) = o() almost surely and in expectation. The standard depth functions such as the halfspace depth, the Mahalanobis depth, the projection depth among others, satisfy the above conditions, for any continuous distribution function F. The following result gives the asymptotic efficiencies of tests based on such depth functions. To this end, let {P θ } θ Θ be a parametric family of distributions in R d satisfying Assumption. at θ = θ Θ R p. Moreover, let F θ be the distribution function of P θ. Theorem 4.4. The asymptotic efficiency of the two-sample test (4.4) based on a good depth function D (with respect to F θ ), for the testing problem (.), is AE(G D ) = ˆ 6r h, f(x θ ) R(x, F θ )dx, (4.3) where R(x, F θ ) is as defined in (.). The proof of the theorem is given in Appendix F. It entails verifying the different conditions in Theorem 3. for the directed graph functional G D, for a good depth function D. Example 4.3. (Mann-Whitney U-Statistic) This is one of the most popular univariate twosample tests, which corresponds to (4.4) with D(x, F ) = F (x). 
This implies, R(x, F ) = F (x) and, for θ = θ Θ, by Theorem 4.4, the asymptotic efficiency of the Mann- Whitney test is 6r ˆ h, f(x θ ) R(x, F θ )dx = ˆ 6r h, f(x θ ) F θ (x)dx. (4.4) Tests based on depth functions generally have non-trivial local power against O( ) alternatives, especially for scale problems. However, they have no power in location problems, as discussed below: Example 4.4. (Location Family) As in Example 5., consider parametric family P θ (θ, I). Then R MD (y, F θ ) = R D (y, F θ ), where F θ is the distribution function of P θ, for any depth function D which is affine-invariant and the satisfies strict monotonicity property (refer to [7, Theorem 5.] for details). ow, R MD (y, F θ ) = P Fθ (X : MD(X, F θ ) MD(y, F θ )) =

19 DISTRIBUTIO-FREE GRAPH-BASED TWO-SAMPLE TESTS 9 P(χ d > (y θ)t (y θ)). Then ˆ h, f(x θ) R(x, F θ )dx = ˆ h, ze z t z R (π) d d/ P(χ d > z t z)dz = 0. R d This implies that the asymptotic efficiency of any depth-based test is zero for the normal location problem, as seen in Example 5. for the halfspace depth (HD). 5 In fact, the same argument shows that depth-based tests have no power against location alternatives for any elliptical distribution [6] Discussion. The examples above raise the following question: Can one construct computationally efficient and distribution-free two-sample tests, which are consistent against a large class of alternatives and have non-zero Pitman efficiency? Theorem 3. shows that to obtain two-sample tests with non-zero efficiency, one has to construct graph functionals for which the degree-functions λ /λ (defined in (3.4)) are nonconstant. One natural approach is to consider inhomogeneous (weighted) nearest-neighbor graphs: Given a weight function w : R d {0,,..., K}, where K is a fixed positive integer, the directed w-weighted nearest neighbor graph functional, to be denoted by w, is defined as follows: For a finite set S R d, w (S) is a graph with vertex set S with directed edges (a, b), for a, b S, if the Euclidean distance between a and b is among the w(a)-th smallest distances from a to any other point in S. (ote that the unweighted directed K- graph corresponds to the constant weight function K.) Similar to the unweighted K-, the weighted nearest-neighbor graph can be computed efficiently, in any dimension. Moreover, our general result implies that the asymptotic efficiency of the two-sample test based on w is, 6 ˆ r(p q) AE( w ) = h, f(z θ ) w(z)dz, (4.5) σ w where σ w is the asymptotic variance of the test statistic (.5) (which can be calculated using Lemma A. in Appendix A.). This shows that if p q, one can, in principle choose a weight function w such the corresponding test has non-zero efficiency. 
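The weighted nearest-neighbor construction described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: neighbors are found by sorting all distances (quadratic work per query set, chosen for clarity), ties are broken by index order, and the statistic is assumed to count directed edges from a sample-1 point to a sample-2 point. The names `weighted_nn_graph` and `cross_edge_count`, the weight function, and the data are hypothetical.

```python
import math

def weighted_nn_graph(points, w):
    """Directed edges (i, j): points[j] is among the w(points[i]) nearest neighbors of points[i]."""
    edges = []
    for i, a in enumerate(points):
        order = sorted((j for j in range(len(points)) if j != i),
                       key=lambda j: math.dist(a, points[j]))
        edges.extend((i, j) for j in order[:w(a)])
    return edges

def cross_edge_count(edges, labels):
    """Number of directed edges from a sample-1 vertex to a sample-2 vertex."""
    return sum(1 for i, j in edges if labels[i] == 1 and labels[j] == 2)

# Hypothetical one-dimensional data and a weight function with values in {0, ..., K}, K = 2.
points = [(-2.0,), (-1.0,), (0.5,), (1.5,), (3.0,)]
labels = [1, 1, 2, 1, 2]

def w(z):
    return 2 if z[0] > 0 else 1

edges = weighted_nn_graph(points, w)
print(len(edges), cross_edge_count(edges, labels))
```

By construction the out-degree of each vertex equals the weight of its location, so a non-constant weight function makes the degree depend on where a point falls, which is what the discussion above requires for non-zero efficiency.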
The optimal weight function, which maximizes the asymptotic efficiency, is

$w(z) = K \cdot 1\{(p - q) \langle h, \nabla f(z|\theta_1) \rangle > 0\}$. (4.6)

However, this is practically useless because we do not know $f$. It is possible to use (4.6) by estimating $f$ from the data; however, in doing so, the test will no longer be asymptotically distribution-free.

The results above illustrate a fundamental difficulty in constructing distribution-free two-sample tests, which are generally based on the number of edges across the two samples, that are both computationally and statistically efficient. A natural practical way to reduce the computational cost and maintain non-zero Pitman efficiency is by projection: For example,

5 The univariate depth function $D(x, F) = F(x)$ does not satisfy the strict monotonicity property. In fact, (4.4) implies that the asymptotic efficiency of the Mann-Whitney test for the normal location problem is proportional to $\int f^2(x|\theta_1) dx > 0$, and hence non-zero, unlike the strictly monotonic depth functions.

6 For well-behaved weight functions, that is, for each $a \in \{0, 1, \ldots, K\}$, the set $B_a := \{x \in R^d : w(x) = a\}$ is either empty or has a non-empty interior.

7 The test has no power when $p = q$, irrespective of the choice of $w$, for reasons discussed in Remark 3.5.
