Relative density of the random r-factor proximity catch digraph for testing spatial patterns of segregation and association


Computational Statistics & Data Analysis 50 (2006) 1925-1964


Elvan Ceyhan, Carey E. Priebe, John C. Wierman
Applied Mathematics and Statistics, Johns Hopkins University, Whitehead Hall, Baltimore, MD 21218-2682, USA
Received 18 January 2004; accepted 21 March 2005; available online 21 April 2005

Abstract

Statistical pattern classification methods based on data-random graphs were introduced recently. In this approach, a random directed graph is constructed from the data using the relative positions of the data points from various classes. Different random graphs result from different definitions of the proximity region associated with each data point, and different graph statistics can be employed for data reduction. The approach used in this article is based on a parameterized family of proximity maps determining an associated family of data-random digraphs. The relative arc density of the digraph is used as the summary statistic, providing an alternative to the domination number employed previously. An important advantage of the relative arc density is that, properly re-scaled, it is a U-statistic, facilitating analytic study of its asymptotic distribution using standard U-statistic central limit theory. The approach is illustrated with an application to the testing of spatial patterns of segregation and association. Knowledge of the asymptotic distribution allows evaluation of the Pitman and Hodges-Lehmann asymptotic efficacies, and selection of the proximity map parameter to optimize efficiency. Furthermore, the approach presented here also has the advantage of validity for data in any dimension.

© 2005 Elsevier B.V. All rights reserved.

Keywords: Random proximity graphs; Delaunay triangulation; Relative density; Segregation; Association

Corresponding author: C.E. Priebe. E-mail address: cep@jhu.edu.
doi:10.1016/j.csda.2005.03.002

1. Introduction

Classification and clustering have received considerable attention in the statistical literature. In recent years, a new classification approach has been developed which is based on the relative positions of the data points from various classes. Priebe et al. (2001) introduced the class cover catch digraphs (CCCDs) in $\mathbb{R}$ and gave the exact and the asymptotic distribution of the domination number of the CCCD. DeVinney et al. (2002), Marchette and Priebe (2003), and Priebe et al. (2003a,b) applied the concept in higher dimensions and demonstrated relatively good performance of CCCDs in classification. The methods employed involve data reduction (condensing) by using approximate minimum dominating sets as prototype sets, since finding the exact minimum dominating set is an NP-hard problem (in particular for CCCDs). Furthermore, the exact and the asymptotic distribution of the domination number of the CCCD are not analytically tractable in multiple dimensions.

Ceyhan and Priebe introduced the central similarity proximity map and the r-factor proximity maps, together with the associated random digraphs, in Ceyhan and Priebe (2003, 2005), respectively. In both cases, the space is partitioned by the Delaunay tessellation, which is the Delaunay triangulation in $\mathbb{R}^2$. In each triangle, a family of data-random proximity catch digraphs is constructed based on the proximity of the points to each other. The advantages of the r-factor proximity catch digraphs are that an exact minimum dominating set can be found in polynomial time and that the asymptotic distribution of the domination number is analytically tractable. The latter is then used to test segregation and association of points of different classes in Ceyhan and Priebe (2005). Segregation and association are two patterns that describe the spatial relation between two or more classes. See Section 2.5 for more detail.
In this article, we employ a different statistic, namely the relative (arc) density, that is, the proportion of all possible arcs (directed edges) which are present in the data-random digraph. This test statistic has the advantage that, properly rescaled, it is a U-statistic. Two simple classes of alternative hypotheses, for segregation and association, are defined in Section 2.5. The asymptotic distributions under both the null and the alternative hypotheses are determined in Section 3 by using standard U-statistic central limit theory. Pitman and Hodges-Lehmann asymptotic efficacies are analyzed in Sections 4.3 and 4.4, respectively. This test is related to the available tests of segregation and association in the ecology literature, such as Pielou's test and Ripley's test. See the discussion in Section 6 for more detail. Our approach is valid for data in any dimension, but for simplicity of expression and visualization, it will be described for two-dimensional data.

2. Preliminaries

2.1. Proximity maps

Let $(\Omega, \mathcal{M})$ be a measurable space and consider a function $N : \Omega \times 2^{\Omega} \to 2^{\Omega}$, where $2^{\Omega}$ represents the power set of $\Omega$. Then given $Y \subseteq \Omega$, the proximity map $N_Y(\cdot) = N(\cdot, Y) : \Omega \to 2^{\Omega}$ associates with each point $x \in \Omega$ a proximity region $N_Y(x) \subseteq \Omega$. Typically,


$N$ is chosen to satisfy $x \in N_Y(x)$ for all $x \in \Omega$. The use of the adjective proximity comes from thinking of the region $N_Y(x)$ as representing a neighborhood of points close to $x$ (Toussaint, 1980; Jaromczyk and Toussaint, 1992).

Fig. 1. Construction of the r-factor proximity region, $N_Y^r(x)$ (shaded region).

2.2. r-factor proximity maps

We now briefly define r-factor proximity maps (see Ceyhan and Priebe, 2005, for more details). Let $\Omega = \mathbb{R}^2$ and let $Y = \{y_1, y_2, y_3\} \subset \mathbb{R}^2$ be three non-collinear points. Denote by $T(Y)$ the triangle (including the interior) formed by the three points, i.e. $T(Y)$ is the convex hull of $Y$. For $r \in [1, \infty]$, define $N_Y^r$ to be the r-factor proximity map as follows; see also Fig. 1.

Using line segments from the center of mass (centroid) of $T(Y)$ to the midpoints of its edges, we partition $T(Y)$ into vertex regions $R(y_1)$, $R(y_2)$, and $R(y_3)$. For $x \in T(Y) \setminus Y$, let $v(x) \in Y$ be the vertex in whose region $x$ falls, so $x \in R(v(x))$. If $x$ falls on the boundary of two vertex regions, we assign $v(x)$ arbitrarily to one of the adjacent regions. Let $e(x)$ be the edge of $T(Y)$ opposite $v(x)$. Let $\ell(x)$ be the line parallel to $e(x)$ through $x$. Let $d(v(x), \ell(x))$ be the Euclidean (perpendicular) distance from $v(x)$ to $\ell(x)$. For $r \in [1, \infty)$, let $\ell_r(x)$ be the line parallel to $e(x)$ such that $d(v(x), \ell_r(x)) = r\, d(v(x), \ell(x))$ and $d(\ell(x), \ell_r(x)) < d(v(x), \ell_r(x))$. Let $T_r(x)$ be the triangle similar to and with the same orientation as $T(Y)$ having $v(x)$ as a vertex and $\ell_r(x)$ as the opposite edge. Then the r-factor proximity region $N_Y^r(x)$ is defined to be $T_r(x) \cap T(Y)$.

Notice that $r \ge 1$ implies $x \in N_Y^r(x)$. Note also that $\lim_{r \to \infty} N_Y^r(x) = T(Y)$ for all $x \in T(Y) \setminus Y$, so we define $N_Y^{\infty}(x) = T(Y)$ for all such $x$. For $x \in Y$, we define $N_Y^r(x) = \{x\}$ for all $r \in [1, \infty]$.
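The construction above can be sketched in a few lines of Python. The sketch rests on one observation that is not spelled out in the text: in barycentric coordinates $(\lambda_1, \lambda_2, \lambda_3)$ with respect to $Y$, the centroid-based vertex region of $y_i$ is the set where $\lambda_i$ is largest, and lines parallel to the edge opposite $y_i$ are level sets of $\lambda_i$, so that $z \in N_Y^r(x)$ exactly when $z \in T(Y)$ and $\lambda_v(z) \ge 1 - r\,(1 - \lambda_v(x))$ with $v = v(x)$. The function names `barycentric` and `in_proximity_region` are illustrative only, and ties on vertex-region boundaries are broken by `argmax`.

```python
import math
import numpy as np

def barycentric(p, tri):
    """Barycentric coordinates of point p with respect to triangle tri (three vertices)."""
    a, b, c = np.asarray(tri, dtype=float)
    m = np.column_stack((b - a, c - a))                  # 2x2 matrix of edge vectors
    s, t = np.linalg.solve(m, np.asarray(p, dtype=float) - a)
    return np.array([1.0 - s - t, s, t])

def in_proximity_region(z, x, tri, r, tol=1e-12):
    """True if z lies in the r-factor proximity region of x; x is assumed to lie in the triangle."""
    lam_x = barycentric(x, tri)
    lam_z = barycentric(z, tri)
    if np.any(lam_z < -tol):                             # z is outside T(Y)
        return False
    v = int(np.argmax(lam_x))                            # index of the vertex region containing x
    gap = 1.0 - lam_x[v]                                 # scaled distance from v(x) to l(x)
    if gap <= tol:                                       # x is a vertex: the region is {x}
        return bool(lam_z[v] >= 1.0 - tol)
    if math.isinf(r):                                    # r = infinity: the whole triangle
        return True
    return bool(lam_z[v] >= 1.0 - r * gap - tol)
```

For example, with the standard equilateral triangle and $x = (0.3, 0.1)$, the point $(0.6, 0.1)$ is outside the region for $r = 1$ but inside it for $r = 2$.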

2.3. Data-random proximity catch digraphs

If $\mathcal{X}_n := \{X_1, X_2, \ldots, X_n\}$ is a set of $\Omega$-valued random variables, then the $N_Y(X_i)$, $i = 1, \ldots, n$, are random sets. If the $X_i$ are independent and identically distributed, then so are the random sets $N_Y(X_i)$. In the case of an r-factor proximity map, notice that if $X_i \overset{iid}{\sim} F$ and $F$ has a non-degenerate two-dimensional probability density function $f$ with $\mathrm{support}(f) \subseteq T(Y)$, then the special case in the construction of $N_Y^r$ ($X$ falls on the boundary of two vertex regions) occurs with probability zero.

The proximities of the data points to each other are used to construct a digraph. A digraph is a directed graph, i.e. a graph with directed edges from one vertex to another based on a binary relation. Define the data-random proximity catch digraph $D$ with vertex set $\mathcal{V} = \{X_1, \ldots, X_n\}$ and arc set $\mathcal{A}$ by $(X_i, X_j) \in \mathcal{A} \iff X_j \in N_Y(X_i)$. Since this relationship is not symmetric, a digraph is needed rather than a graph. The random digraph $D$ depends on the (joint) distribution of the $X_i$ and on the map $N_Y$.

2.4. Relative density

The relative arc density of a digraph $D = (\mathcal{V}, \mathcal{A})$ of order $|\mathcal{V}| = n$, denoted $\rho(D)$, is defined as

\[ \rho(D) = \frac{|\mathcal{A}|}{n(n-1)}, \]

where $|\cdot|$ denotes the set cardinality functional (Janson et al., 2000). Thus $\rho(D)$ represents the ratio of the number of arcs in the digraph $D$ to the number of arcs in the complete symmetric digraph of order $n$, which is $n(n-1)$. For brevity of notation we use relative density rather than relative arc density henceforth.

If $X_1, \ldots, X_n \overset{iid}{\sim} F$, the relative density of the associated data-random proximity catch digraph $D$, denoted $\rho(\mathcal{X}_n; h, N_Y)$, is a U-statistic,

\[ \rho(\mathcal{X}_n; h, N_Y) = \frac{1}{n(n-1)} \sum\sum_{i<j} h(X_i, X_j; N_Y), \tag{1} \]

where

\[ h(X_i, X_j; N_Y) = I\{(X_i, X_j) \in \mathcal{A}\} + I\{(X_j, X_i) \in \mathcal{A}\} = I\{X_j \in N_Y(X_i)\} + I\{X_i \in N_Y(X_j)\}, \tag{2} \]

and $I(\cdot)$ is the indicator function. We denote $h(X_i, X_j; N_Y)$ as $h_{ij}$ for brevity of notation.
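As an illustration of the definition, the following sketch computes the relative density of the r-factor proximity catch digraph for points inside a single triangle and a finite $r$. It assumes the barycentric characterization of the proximity region (a point $X_j$ is caught by $X_i$ when $\lambda_v(X_j) \ge 1 - r(1 - \lambda_v(X_i))$, where $v$ indexes the vertex region of $X_i$); the function names are hypothetical and the code is not taken from the paper.

```python
import numpy as np

def barycentric_all(points, tri):
    """Barycentric coordinates (n x 3) of n points with respect to triangle tri."""
    a, b, c = np.asarray(tri, dtype=float)
    m = np.column_stack((b - a, c - a))
    st = np.linalg.solve(m, (np.asarray(points, dtype=float) - a).T).T
    return np.column_stack((1.0 - st.sum(axis=1), st))

def relative_density(points, tri, r, tol=1e-12):
    """Number of arcs divided by n(n-1); all points are assumed to lie inside the triangle."""
    lam = barycentric_all(points, tri)
    n = len(lam)
    v = lam.argmax(axis=1)                               # vertex region of each point
    thr = 1.0 - r * (1.0 - lam[np.arange(n), v])         # threshold on the v-th coordinate
    # catches[i, j] is True when X_j lies in the proximity region of X_i
    catches = lam[:, v].T >= thr[:, None] - tol
    np.fill_diagonal(catches, False)                     # no loops: arcs join distinct vertices
    return catches.sum() / (n * (n - 1))
```

Summing the arc indicator over ordered pairs $(i, j)$ with $i \ne j$ is the same as summing the symmetric kernel $h_{ij}$ over $i < j$, which is why the function returns $|\mathcal{A}|/(n(n-1))$ directly.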
Although the digraph is not symmetric, $h_{ij}$ is defined as the number of arcs in $D$ between vertices $X_i$ and $X_j$, in order to produce a symmetric kernel with finite variance (Lehmann, 1988). The random variable $\rho_n := \rho(\mathcal{X}_n; h, N_Y)$ depends on $n$ and $N_Y$ explicitly and on $F$ implicitly. The expectation $E[\rho_n]$, however, is independent of $n$ and depends on only $F$

and $N_Y$:

\[ E[\rho_n] = \tfrac{1}{2} E[h_{12}] \quad \text{for all } n \ge 2. \tag{3} \]

The variance $\mathrm{Var}[\rho_n]$ simplifies to

\[ \mathrm{Var}[\rho_n] = \frac{1}{2n(n-1)} \mathrm{Var}[h_{12}] + \frac{n-2}{n(n-1)} \mathrm{Cov}[h_{12}, h_{13}]. \tag{4} \]

A central limit theorem for U-statistics (Lehmann, 1988) yields

\[ \sqrt{n}\,\bigl(\rho_n - E[\rho_n]\bigr) \xrightarrow{\mathcal{L}} N\bigl(0, \mathrm{Cov}[h_{12}, h_{13}]\bigr), \tag{5} \]

provided $\mathrm{Cov}[h_{12}, h_{13}] > 0$. The asymptotic variance of $\rho_n$, $\mathrm{Cov}[h_{12}, h_{13}]$, depends on only $F$ and $N_Y$. Thus, we need determine only $E[h_{12}]$ and $\mathrm{Cov}[h_{12}, h_{13}]$ in order to obtain the normal approximation

\[ \rho_n \overset{approx}{\sim} N\bigl(E[\rho_n], \mathrm{Var}[\rho_n]\bigr) = N\!\left(\frac{E[h_{12}]}{2}, \frac{\mathrm{Cov}[h_{12}, h_{13}]}{n}\right) \quad \text{for large } n. \tag{6} \]

2.5. Null and alternative hypotheses

In a two-class setting, the phenomenon known as segregation occurs when members of one class have a tendency to repel members of the other class. For instance, it may be the case that one type of plant does not grow well in the vicinity of another type of plant, and vice versa. This implies, in our notation, that the $X_i$ are unlikely to be located near any elements of $Y$. Alternatively, association occurs when members of one class have a tendency to attract members of the other class, as in symbiotic species, so that the $X_i$ will tend to cluster around the elements of $Y$, for example. See, for instance, Dixon (1994) and Coomes et al. (1999).

The null hypothesis for spatial patterns has been a controversial topic in ecology from the early days. Gotelli and Graves (1996) have collected a voluminous literature to present a comprehensive analysis of the use and misuse of null models in the ecology community. They also define and attempt to clarify the null model concept as "a pattern-generating model that is based on randomization of ecological data or random sampling from a known or imagined distribution. ... The randomization is designed to produce a pattern that would be expected in the absence of a particular ecological mechanism."
In other words, the hypothesized null models can be viewed as "thought experiments," which are conventionally used in the physical sciences, and these models provide a statistical baseline for the analysis of the patterns. For statistical testing for segregation and association, the null hypothesis we consider is a type of complete spatial randomness; that is, $H_0 : X_i \overset{iid}{\sim} U(T(Y))$, where $U(T(Y))$ is the uniform distribution on $T(Y)$. If it is desired to have the sample size be a random variable, we may consider a spatial Poisson point process on $T(Y)$ as our null hypothesis.
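For simulation purposes, data under this null hypothesis can be generated as follows. This is a minimal sketch, not the authors' code: it uses the standard trick of drawing a point uniformly in the parallelogram spanned by two edges and reflecting the half that falls outside the triangle. The name `runif_triangle` and its arguments are assumptions made for illustration.

```python
import numpy as np

def runif_triangle(n, tri, rng=None):
    """Draw n points uniformly from the triangle with vertices tri (three 2-d points)."""
    rng = np.random.default_rng() if rng is None else rng
    a, b, c = np.asarray(tri, dtype=float)
    u = rng.random((n, 2))                    # uniform on the unit square
    flip = u.sum(axis=1) > 1.0                # points in the far half of the parallelogram
    u[flip] = 1.0 - u[flip]                   # reflect them back into the triangle
    return a + u[:, :1] * (b - a) + u[:, 1:] * (c - a)
```

If a random sample size is wanted, as in the Poisson process variant mentioned above, one could first draw `n` from a Poisson distribution and then call the same function.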

We define two classes of alternatives, $H^S_{\varepsilon}$ and $H^A_{\varepsilon}$ with $\varepsilon \in (0, \sqrt{3}/3)$, for segregation and association, respectively. For $y \in Y$, let $e(y)$ denote the edge of $T(Y)$ opposite vertex $y$, and for $x \in T(Y)$ let $\ell_y(x)$ denote the line parallel to $e(y)$ through $x$. Then define $T(y, \varepsilon) = \{x \in T(Y) : d(y, \ell_y(x)) \le \varepsilon\}$. Let $H^S_{\varepsilon}$ be the model under which $X_i \overset{iid}{\sim} U\bigl(T(Y) \setminus \cup_{y \in Y} T(y, \varepsilon)\bigr)$ and $H^A_{\varepsilon}$ be the model under which $X_i \overset{iid}{\sim} U\bigl(\cup_{y \in Y} T(y, \sqrt{3}/3 - \varepsilon)\bigr)$. Thus the segregation model excludes the possibility of any $X_i$ occurring near a $y_j$, and the association model requires that all $X_i$ occur near a $y_j$. The $\sqrt{3}/3 - \varepsilon$ in the definition of the association alternative is so that $\varepsilon = 0$ yields $H_0$ under both classes of alternatives.

Remark. These definitions of the alternatives are given for the standard equilateral triangle. The geometry invariance result of Theorem 1 from Section 3 still holds under the alternatives, in the following sense. If, in an arbitrary triangle, a small percentage $\delta \times 100\%$, where $\delta \in (0, 4/9)$, of the area is carved away as forbidden from each vertex using line segments parallel to the opposite edge, then under the transformation to the standard equilateral triangle this will result in the alternative $H^S_{\sqrt{3\delta}/2}$. This argument is for segregation with $\delta < 1/4$; a similar construction is available for the other cases.

3. Asymptotic normality under the null and alternative hypotheses

First we present a geometry invariance result which allows us to assume $T(Y)$ is the standard equilateral triangle, $T((0,0), (1,0), (1/2, \sqrt{3}/2))$, thereby simplifying our subsequent analysis.

Theorem 1. Let $Y = \{y_1, y_2, y_3\} \subset \mathbb{R}^2$ be three non-collinear points. For $i = 1, \ldots, n$ let $X_i \overset{iid}{\sim} F = U(T(Y))$, the uniform distribution on the triangle $T(Y)$. Then for any $r \in [1, \infty]$ the distribution of $\rho(\mathcal{X}_n; h, N_Y^r)$ is independent of $Y$, hence the geometry of $T(Y)$.

Proof.
A composition of translation, rotation, reflections, and scaling will transform any given triangle $T_o = T(y_1, y_2, y_3)$ into the "basic" triangle $T_b = T((0,0), (1,0), (c_1, c_2))$ with $0 < c_1 \le 1/2$, $c_2 > 0$ and $(1 - c_1)^2 + c_2^2 \le 1$, preserving uniformity. The transformation $e : \mathbb{R}^2 \to \mathbb{R}^2$ given by

\[ e(u, v) = \left( u + \frac{1 - 2c_1}{2c_2}\, v,\ \frac{\sqrt{3}}{2c_2}\, v \right) \]

takes $T_b$ to the equilateral triangle $T_e = T((0,0), (1,0), (1/2, \sqrt{3}/2))$: it fixes $(0,0)$ and $(1,0)$ and sends $(c_1, c_2)$ to $(1/2, \sqrt{3}/2)$. Investigation of the Jacobian shows that $e$ also preserves uniformity. Furthermore, the composition of $e$ with the rigid motion transformations maps the boundary of the original triangle $T_o$ to the boundary of the equilateral triangle $T_e$, the median lines of $T_o$ to the median lines of $T_e$, and lines parallel to the edges of $T_o$ to lines parallel to the edges of $T_e$. Since the joint distribution of any collection of the $h_{ij}$ involves only probability content of unions and intersections of regions bounded by precisely such lines, and the probability content of such regions is preserved since uniformity is preserved, the desired result follows.
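The linear map used in the proof is easy to check numerically. In the sketch below the shear coefficient is derived from the requirement that the map fixes $(0,0)$ and $(1,0)$ and sends the apex $(c_1, c_2)$ to $(1/2, \sqrt{3}/2)$; the function name and the apex values are hypothetical choices for illustration. The determinant is constant, which is the Jacobian fact behind the preservation of uniformity.

```python
import numpy as np

def basic_to_equilateral(c1, c2):
    """Matrix M of the linear map e(u, v) = M @ (u, v) taking T_b to T_e."""
    return np.array([[1.0, (1.0 - 2.0 * c1) / (2.0 * c2)],
                     [0.0, np.sqrt(3.0) / (2.0 * c2)]])

c1, c2 = 0.3, 0.6                              # hypothetical apex with (1 - c1)^2 + c2^2 <= 1
M = basic_to_equilateral(c1, c2)
basic = np.array([[0.0, 0.0], [1.0, 0.0], [c1, c2]])
image = basic @ M.T                            # rows are the images of the three vertices
jacobian = np.linalg.det(M)                    # constant, so uniform densities stay uniform
```

Because the map is affine, it also sends midpoints to midpoints and parallel lines to parallel lines, which is what the proof uses for the median lines and for the lines parallel to the edges.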

Based on Theorem 1 and our uniform null hypothesis, we may assume that $T(Y)$ is the standard equilateral triangle with $Y = \{(0,0), (1,0), (1/2, \sqrt{3}/2)\}$ henceforth. For our r-factor proximity map and uniform null hypothesis, the asymptotic null distribution of $\rho_n(r) = \rho(\mathcal{X}_n; h, N_Y^r)$ can be derived as a function of $r$. Let $\mu(r) := E[\rho_n(r)]$ and $\nu(r) := \mathrm{Cov}[h_{12}, h_{13}]$. Notice that $\mu(r) = E[h_{12}]/2 = P(X_2 \in N_Y^r(X_1))$ is the probability of an arc occurring between any pair of vertices.

3.1. Asymptotic normality under the null hypothesis

By detailed geometric probability calculations, provided in Appendix A, the mean and the asymptotic variance of the relative density of the r-factor proximity catch digraph can explicitly be computed. The central limit theorem for U-statistics then establishes the asymptotic normality under the uniform null hypothesis. These results are summarized in the following theorem.

Theorem 2. For $r \in [1, \infty)$,

\[ \frac{\sqrt{n}\,\bigl(\rho_n(r) - \mu(r)\bigr)}{\sqrt{\nu(r)}} \xrightarrow{\mathcal{L}} N(0, 1), \tag{7} \]

where

\[ \mu(r) = \begin{cases} \frac{37}{216}\, r^2 & \text{for } r \in [1, 3/2), \\ -\frac{1}{8}\, r^2 + 4 - 8 r^{-1} + \frac{9}{2}\, r^{-2} & \text{for } r \in [3/2, 2), \\ 1 - \frac{3}{2}\, r^{-2} & \text{for } r \in [2, \infty), \end{cases} \tag{8} \]

and

\[ \nu(r) = \nu_1(r)\, I(r \in [1, 4/3)) + \nu_2(r)\, I(r \in [4/3, 3/2)) + \nu_3(r)\, I(r \in [3/2, 2)) + \nu_4(r)\, I(r \in [2, \infty)) \tag{9} \]

with ν 1 r = 7 r1 184 r r r r r r r 888 r r 4, ν r = 5467 r1 78 r r r r r r 1555 r r 4, ν r = 7 r 1 7 r r 1 5 r r r r r r + 1 r ]/ 7648 r r 6], ν 4 r = 15 r4 11 r 48 r r 6.

For $r = \infty$, $\rho_n(r)$ is degenerate. See Appendix A for the proof.

Consider the form of the mean and variance functions, which are depicted in Fig. 2. Note that $\mu(r)$ is monotonically increasing in $r$, since the proximity region of any data point

increases with $r$. In addition, $\mu(r) \to 1$ as $r \to \infty$, since the digraph becomes complete asymptotically, which explains why $\rho_n(r)$ is degenerate, i.e. $\nu(r) = 0$, when $r = \infty$. Note also that $\mu(r)$ is continuous, with the value at $r = 1$ being $\mu(1) = 37/216$.

Fig. 2. Asymptotic null mean $\mu(r)$ (left) and variance $\nu(r)$ (right), from Eqs. (8) and (9) in Theorem 2, respectively. The vertical lines indicate the endpoints of the intervals in the piecewise definition of the functions. Notice that the vertical axes are differently scaled.

Regarding the asymptotic variance, note that $\nu(r)$ is continuous in $r$ with $\lim_{r \to \infty} \nu(r) = 0$ and $\nu(1) = 34/58320 \approx .000583$, and observe that $\sup_{r \ge 1} \nu(r) \approx .1305$ at $\operatorname{argsup}_{r \ge 1} \nu(r) \approx 2.045$.

To illustrate the limiting distribution, $r = 2$ yields

\[ \frac{\sqrt{n}\,\bigl(\rho_n(2) - \mu(2)\bigr)}{\sqrt{\nu(2)}} = \sqrt{\frac{192\, n}{25}} \left( \rho_n(2) - \frac{5}{8} \right) \xrightarrow{\mathcal{L}} N(0, 1), \]

or equivalently

\[ \rho_n(2) \overset{approx}{\sim} N\!\left( \frac{5}{8}, \frac{25}{192\, n} \right). \]

Fig. 3 indicates that, for $r = 2$, the normal approximation is accurate even for small $n$ (although kurtosis may be indicated for $n = 10$). Fig. 4 demonstrates, however, that severe skewness obtains for small values of $n$ and extreme values of $r$. The finite sample variance in Eq. (4) and the skewness may be derived analytically in much the same way as was $\mathrm{Cov}[h_{12}, h_{13}]$ for the asymptotic variance. In fact, the exact distribution of $\rho_n(r)$ is, in principle, available by successively conditioning on the values of the $X_i$. Alas, while the joint distribution of $h_{12}, h_{13}$ is available, the joint distribution of $\{h_{ij}\}_{1 \le i < j \le n}$, and hence the calculation for the exact distribution of $\rho_n(r)$, is extraordinarily tedious and lengthy for even small values of $n$.

Fig. 3. Depicted are the distributions of $\rho_n(2) \overset{approx}{\sim} N\bigl(\tfrac{5}{8}, \tfrac{25}{192 n}\bigr)$ for $n = 10, 20, 100$ (left to right). Histograms are based on 1000 Monte Carlo replicates. Solid curves represent the approximating normal densities given by Theorem 2. Again, note that the vertical axes are differently scaled.

Fig. 4. Depicted are the histograms for 10,000 Monte Carlo replicates of $\rho_{10}(1)$ (left) and $\rho_{10}(5)$ (right) indicating severe small sample skewness for extreme values of $r$.

Letting $H_n(r) = \sum_{i=1}^{n} h(X_i, X_{n+1})$, the exact distribution of $\rho_n(r)$ can be evaluated based on the recurrence

\[ (n+1)\, n\, \rho_{n+1}(r) \overset{d}{=} n (n-1)\, \rho_n(r) + H_n(r) \]

by noting that the conditional random variable $H_n(r) \mid X_{n+1}$ is the sum of $n$ independent and identically distributed random variables. Alas, this calculation is also tedious for large $n$.

3.2. Asymptotic normality under the alternatives

Asymptotic normality of the relative density of the proximity catch digraphs under the alternative hypotheses of segregation and association can be established by the same method as under the null hypothesis. Let $E^S_{\varepsilon}[\cdot]$ ($E^A_{\varepsilon}[\cdot]$) be the expectation with respect to the uniform distribution under the segregation (association) alternatives with $\varepsilon \in (0, \sqrt{3}/3)$.
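Data under either class of alternatives can be simulated by rejection sampling from the uniform distribution on the standard equilateral triangle. The sketch below is an illustration rather than the authors' procedure. It uses the fact that, in this triangle, the distance from vertex $y_i$ to the line through $x$ parallel to the opposite edge equals $(1 - \lambda_i(x))\sqrt{3}/2$, where $\lambda_i$ is the $i$-th barycentric coordinate. The name `sample_alternative` and its arguments are assumptions; acceptance becomes slow as $\varepsilon$ approaches $\sqrt{3}/3$.

```python
import numpy as np

SQRT3 = np.sqrt(3.0)
T_E = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, SQRT3 / 2.0]])   # standard equilateral triangle

def sample_alternative(n, eps, kind, rng=None):
    """Draw n points under the segregation or association alternative with parameter eps."""
    if not 0.0 < eps < SQRT3 / 3.0:
        raise ValueError("eps must lie in (0, sqrt(3)/3)")
    if kind not in ("segregation", "association"):
        raise ValueError("kind must be 'segregation' or 'association'")
    rng = np.random.default_rng() if rng is None else rng
    kept, total = [], 0
    while total < n:
        u = rng.random((max(2 * n, 16), 2))
        flip = u.sum(axis=1) > 1.0
        u[flip] = 1.0 - u[flip]                              # uniform on the triangle
        lam = np.column_stack((1.0 - u.sum(axis=1), u))      # barycentric coordinates
        dist = (1.0 - lam) * SQRT3 / 2.0                     # d(y_i, l_{y_i}(x)) for i = 1, 2, 3
        if kind == "segregation":
            ok = (dist > eps).all(axis=1)                    # outside every T(y, eps)
        else:
            ok = (dist <= SQRT3 / 3.0 - eps).any(axis=1)     # inside some T(y, sqrt(3)/3 - eps)
        kept.append(lam[ok] @ T_E)
        total += int(ok.sum())
    return np.concatenate(kept)[:n]
```

Combined with a routine for the relative density, such samples give Monte Carlo estimates of the means and variances under the alternatives.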

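Theorem 3. Let $\mu_S(r, \varepsilon)$ (and $\mu_A(r, \varepsilon)$) be the mean and $\nu_S(r, \varepsilon)$ (and $\nu_A(r, \varepsilon)$) be the covariance, $\mathrm{Cov}[h_{12}, h_{13}]$, for $r \in [1, \infty)$ and $\varepsilon \in (0, \sqrt{3}/3)$ under segregation (and association). Then under $H^S_{\varepsilon}$, $\sqrt{n}\,\bigl(\rho_n(r) - \mu_S(r, \varepsilon)\bigr) \xrightarrow{\mathcal{L}} N(0, \nu_S(r, \varepsilon))$ for the values of the pair $(r, \varepsilon)$ for which $\nu_S(r, \varepsilon) > 0$. Likewise, under $H^A_{\varepsilon}$, $\sqrt{n}\,\bigl(\rho_n(r) - \mu_A(r, \varepsilon)\bigr) \xrightarrow{\mathcal{L}} N(0, \nu_A(r, \varepsilon))$ for the values of the pair $(r, \varepsilon)$ for which $\nu_A(r, \varepsilon) > 0$.

Proof (Sketch). Under the alternatives, i.e. $\varepsilon > 0$, $\rho_n(r)$ is a U-statistic with the same symmetric kernel $h_{ij}$ as in the null case. The mean $\mu_S(r, \varepsilon) = E_{\varepsilon}[\rho_n(r)] = E_{\varepsilon}[h_{12}]/2$ (and $\mu_A(r, \varepsilon)$), now a function of both $r$ and $\varepsilon$, is again in $[0, 1]$. The asymptotic variance $\nu_S(r, \varepsilon) = \mathrm{Cov}_{\varepsilon}[h_{12}, h_{13}]$ (and $\nu_A(r, \varepsilon)$), also a function of both $r$ and $\varepsilon$, is bounded above by 1/4, as before. The explicit forms of $\mu_S(r, \varepsilon)$ and $\mu_A(r, \varepsilon)$ are given, defined piecewise, in Appendix B. Sample values of $\mu_S(r, \varepsilon)$, $\nu_S(r, \varepsilon)$ and $\mu_A(r, \varepsilon)$, $\nu_A(r, \varepsilon)$ are given in Appendix C for segregation with $\varepsilon = \sqrt{3}/4$ and for association with $\varepsilon = \sqrt{3}/12$. Thus asymptotic normality obtains provided $\nu_S(r, \varepsilon) > 0$ ($\nu_A(r, \varepsilon) > 0$); otherwise $\rho_n(r)$ is degenerate. Note that under $H^S_{\varepsilon}$, $\nu_S(r, \varepsilon) > 0$ for $(r, \varepsilon) \in [1, \sqrt{3}/(2\varepsilon)) \times (0, \sqrt{3}/4] \,\cup\, [1, \sqrt{3}/\varepsilon - 2) \times (\sqrt{3}/4, \sqrt{3}/3)$, and under $H^A_{\varepsilon}$, $\nu_A(r, \varepsilon) > 0$ for $(r, \varepsilon) \in (1, \infty) \times (0, \sqrt{3}/3) \,\cup\, \{1\} \times (0, \sqrt{3}/12)$.

Notice that for the association class of alternatives any $r \in (1, \infty)$ yields asymptotic normality for all $\varepsilon \in (0, \sqrt{3}/3)$, while for the segregation class of alternatives only $r = 1$ yields this universal asymptotic normality.

4. The test and analysis

The relative density of the proximity catch digraph is a test statistic for the segregation/association alternative; rejecting for extreme values of $\rho_n(r)$ is appropriate since under segregation we expect $\rho_n(r)$ to be large, while under association we expect $\rho_n(r)$ to be small. Using the test statistic

\[ R = \frac{\sqrt{n}\,\bigl(\rho_n(r) - \mu(r)\bigr)}{\sqrt{\nu(r)}}, \tag{10} \]

the asymptotic critical value for the one-sided level $\alpha$ test against segregation is given by

\[ z_{1-\alpha} = \Phi^{-1}(1 - \alpha), \tag{11} \]

where $\Phi(\cdot)$ is the standard normal distribution function.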
Against segregation, the test rejects for $R > z_{1-\alpha}$ and against association, the test rejects for $R < z_{\alpha}$, where $z_{\alpha} = \Phi^{-1}(\alpha) = -z_{1-\alpha}$.
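The test can be sketched as follows. The function `mu_null` encodes the piecewise null mean of Theorem 2 as read here (it is continuous at $r = 3/2$ and $r = 2$ and gives $\mu(1) = 37/216$, $\mu(2) = 5/8$); the asymptotic variance $\nu(r)$ is passed in as an argument rather than computed, and the example call uses the value $\nu(2) = 25/192$ quoted for $r = 2$. All function names and the observed density 0.7 are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def mu_null(r):
    """Asymptotic null mean of the relative density (piecewise form, as read from Theorem 2)."""
    if r < 1.0:
        raise ValueError("r must be at least 1")
    if r < 1.5:
        return 37.0 / 216.0 * r ** 2
    if r < 2.0:
        return -r ** 2 / 8.0 + 4.0 - 8.0 / r + 4.5 / r ** 2
    return 1.0 - 1.5 / r ** 2

def standardized_statistic(rho, n, r, nu_r):
    """R = sqrt(n) (rho_n(r) - mu(r)) / sqrt(nu(r)); nu_r must be supplied by the caller."""
    return sqrt(n) * (rho - mu_null(r)) / sqrt(nu_r)

def one_sided_decisions(R, alpha=0.05):
    """Reject against segregation for large R and against association for small R."""
    z = NormalDist().inv_cdf(1.0 - alpha)
    return {"segregation": R > z, "association": R < -z}

# Hypothetical example: observed relative density 0.7 with n = 100 points and r = 2.
R = standardized_statistic(0.7, 100, 2.0, 25.0 / 192.0)
decisions = one_sided_decisions(R)
```

Keeping $\nu(r)$ as an input avoids hard-coding variance formulas beyond the single value used in the example.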

Fig. 5. Two Monte Carlo experiments against the segregation alternative $H^S_{\sqrt{3}/8}$. Depicted are kernel density estimates for $\rho_n(11/10)$ for $n = 10$ (left) and $n = 100$ (right) under the null (solid) and alternative (dashed).

4.1. Consistency

Theorem 4. The test against $H^S_{\varepsilon}$ which rejects for $R > z_{1-\alpha}$ and the test against $H^A_{\varepsilon}$ which rejects for $R < z_{\alpha}$ are consistent for $r \in [1, \infty)$ and $\varepsilon \in (0, \sqrt{3}/3)$.

Proof. Since the variance of the asymptotically normal test statistic, under both the null and the alternatives, converges to 0 as $n \to \infty$ (or is degenerate), it remains to show that the mean under the null, $\mu(r) = E[\rho_n(r)]$, is less than (greater than) the mean under the alternative, $\mu_S(r, \varepsilon) = E_{\varepsilon}[\rho_n(r)]$ ($\mu_A(r, \varepsilon)$), against segregation (association) for $\varepsilon > 0$. Whence it will follow that power converges to 1 as $n \to \infty$. Detailed analysis of $\mu_S(r, \varepsilon)$ and $\mu_A(r, \varepsilon)$ in Appendix B indicates that under segregation $\mu_S(r, \varepsilon) > \mu(r)$ for all $\varepsilon > 0$ and $r \in [1, \infty)$. Likewise, detailed analysis of $\mu_A(r, \varepsilon)$ in Appendix C indicates that under association $\mu_A(r, \varepsilon) < \mu(r)$ for all $\varepsilon > 0$ and $r \in [1, \infty)$. Hence the desired result follows for both alternatives.

In fact, the analysis of $\mu(r, \varepsilon)$ under the alternatives reveals more than what is required for consistency. Under segregation, the analysis indicates that $\mu_S(r, \varepsilon_1) < \mu_S(r, \varepsilon_2)$ for $\varepsilon_1 < \varepsilon_2$. Likewise, under association, the analysis indicates that $\mu_A(r, \varepsilon_1) > \mu_A(r, \varepsilon_2)$ for $\varepsilon_1 < \varepsilon_2$.

4.2. Monte Carlo power analysis

In Fig. 5, we present a Monte Carlo investigation against the segregation alternative $H^S_{\sqrt{3}/8}$ for $r = 11/10$ and $n = 10, 100$. With $n = 10$, the null and alternative probability density functions for $\rho_{10}(11/10)$ are very similar, implying small power (10,000 Monte Carlo replicates yield $\hat{\beta}^S_{mc} = .0787$, which is based on the empirical critical value). With $n = 100$, there is more separation between the null and alternative probability density functions; for this case, 1000 Monte Carlo replicates yield $\hat{\beta}^S_{mc} = .77$. Notice also that the probability density functions are more skewed for $n = 10$, while approximate normality holds for $n = 100$.

Fig. 6. Monte Carlo power using the asymptotic critical value against segregation alternatives $H^S_{\sqrt{3}/8}$ (left) and $H^S_{\sqrt{3}/4}$ (right) as a function of $r$, for $n = 10$. The circles represent the empirical significance levels while triangles represent the empirical power values. The $r$ values plotted are 1, 11/10, 12/10, 4/3, $\sqrt{2}$, 2, 3, 5, 10.

For a given alternative and sample size, we may consider analyzing the power of the test, using the asymptotic critical value, as a function of the proximity factor $r$. In Fig. 6, we present a Monte Carlo investigation of power against $H^S_{\sqrt{3}/8}$ and $H^S_{\sqrt{3}/4}$ as a function of $r$ for $n = 10$. The empirical significance level is about .05 for $r = 2, 3$, which have the empirical power $\hat{\beta}^S_{10}(r, \sqrt{3}/8) \approx .5$, and $\hat{\beta}^S_{10}(r, \sqrt{3}/4) = 1$. So, for small sample sizes, moderate values of $r$ are more appropriate for normal approximation, as they yield the desired significance level, and the more severe the segregation, the higher the power estimate.

In Fig. 7, we present a Monte Carlo investigation against the association alternative $H^A_{\sqrt{3}/12}$ for $r = 11/10$ and $n = 10$ and 100. The analysis is the same as in the analysis of Fig. 5. In Fig. 8, we present a Monte Carlo investigation of power against $H^A_{\sqrt{3}/12}$ and $H^A_{5\sqrt{3}/24}$ as a function of $r$ for $n = 10$. The empirical significance level is about .05 for $r = 3/2, 2, 3, 5$, which have the empirical power $\hat{\beta}^A_{10}(r, \sqrt{3}/12) \approx .5$ with maximum power at $r =$ , and $\hat{\beta}^A_{10}(r, 5\sqrt{3}/24) = 1$ at $r =$ . So, for small sample sizes, moderate values of $r$ are more appropriate for normal approximation, as they yield the desired significance level, and the more severe the association, the higher the power estimate.

4.3. Pitman asymptotic efficiency

Pitman asymptotic efficiency (PAE) provides for an investigation of "local asymptotic power", that is, local around $H_0$.
This involves the limit as $n \to \infty$ as well as the limit as $\varepsilon \to 0$. A detailed discussion of PAE can be found in Kendall and Stuart (1979) and Eeden (1963). For segregation or association alternatives the PAE is given by

\[ \mathrm{PAE}(\rho_n(r)) = \frac{\bigl(\mu^{(k)}(r, \varepsilon = 0)\bigr)^2}{\nu(r)}, \]

where $k$ is the minimum order of the derivative with respect to $\varepsilon$ for which $\mu^{(k)}(r, \varepsilon = 0) \ne 0$. That is, $\mu^{(k)}(r, \varepsilon = 0) \ne 0$ but $\mu^{(l)}(r, \varepsilon = 0) = 0$ for $l = 1, 2, \ldots, k - 1$.

Fig. 7. Two Monte Carlo experiments against the association alternative $H^A_{\sqrt{3}/12}$. Depicted are kernel density estimates for $\rho_n(11/10)$ for $n = 10$ (left) and 100 (right) under the null (solid) and alternative (dashed).

Fig. 8. Monte Carlo power using the asymptotic critical value against association alternatives $H^A_{\sqrt{3}/12}$ (left) and $H^A_{5\sqrt{3}/24}$ (right) as a function of $r$, for $n = 10$. The $r$ values plotted are 1, 11/10, 12/10, 4/3, $\sqrt{2}$, 2, 3, 5, 10.

Then under segregation alternative $H^S_{\varepsilon}$ and association alternative $H^A_{\varepsilon}$, the PAE of $\rho_n(r)$ is

given by

\[ \mathrm{PAE}^S(r) = \frac{\bigl(\mu_S''(r, \varepsilon = 0)\bigr)^2}{\nu(r)} \quad \text{and} \quad \mathrm{PAE}^A(r) = \frac{\bigl(\mu_A''(r, \varepsilon = 0)\bigr)^2}{\nu(r)}, \]

respectively, since $\mu_S'(r, \varepsilon = 0) = \mu_A'(r, \varepsilon = 0) = 0$. Eq. (9) provides the denominator; the numerator requires $\mu(r, \varepsilon)$, which is provided in Appendix B under both segregation and association alternatives, where we only use the intervals of $r$ that do not vanish as $\varepsilon \to 0$.

Fig. 9. Pitman asymptotic efficiency against segregation (left) and association (right) as a function of $r$. Notice that the vertical axes are differently scaled.

In Fig. 9, we present the PAE as a function of $r$ for both segregation and association. Notice that $\mathrm{PAE}^S(r = 1) = 160/7 \approx 22.8571$, $\lim_{r \to \infty} \mathrm{PAE}^S(r) = \infty$, PAE A r = 1 = 1744/ , lim r PAE A r =, argsup r 1, PAE A r 1.6 with sup r 1, PAE A r PAE A r has also a local supremum at r l with PAE A r l Based on the asymptotic efficiency analysis, we suggest, for large $n$ and small $\varepsilon$, choosing $r$ large for testing against segregation and choosing $r$ small for testing against association.

4.4. Hodges-Lehmann asymptotic efficiency

Hodges-Lehmann asymptotic efficiency (HLAE) of $\rho_n(r)$ (see, e.g., Hodges and Lehmann, 1956) under $H^S_{\varepsilon}$ is given by

\[ \mathrm{HLAE}^S(r, \varepsilon) := \frac{\bigl(\mu_S(r, \varepsilon) - \mu(r)\bigr)^2}{\nu_S(r, \varepsilon)}. \]

HLAE for association is defined similarly. Unlike PAE, HLAE does not involve the limit as $\varepsilon \to 0$. Since this requires the mean and, especially, the asymptotic variance of $\rho_n(r)$ under an alternative, we investigate HLAE for specific values of $\varepsilon$. Fig. 10 contains a graph

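of HLAE against segregation as a function of $r$ for $\varepsilon = \sqrt{3}/8, \sqrt{3}/4, 2\sqrt{3}/7$. See Appendix C for explicit forms of $\mu_S(r, \varepsilon)$ and $\nu_S(r, \varepsilon)$ for $\varepsilon = \sqrt{3}/4$.

Fig. 10. Hodges-Lehmann asymptotic efficiency against segregation alternative $H^S_{\varepsilon}$ as a function of $r$ for $\varepsilon = \sqrt{3}/8, \sqrt{3}/4, 2\sqrt{3}/7$ (left to right).

Fig. 11. Hodges-Lehmann asymptotic efficiency against association alternative $H^A_{\varepsilon}$ as a function of $r$ for $\varepsilon = \sqrt{3}/21, \sqrt{3}/12, 5\sqrt{3}/24$ (left to right).

From Fig. 10, we see that, against $H^S_{\varepsilon}$, $\mathrm{HLAE}^S(r, \varepsilon)$ appears to be an increasing function, dependent on $\varepsilon$, of $r$. Let $r_d(\varepsilon)$ be the minimum $r$ such that $\rho_n(r)$ becomes degenerate under the alternative $H^S_{\varepsilon}$. Then $r_d(\sqrt{3}/8) = 4$, $r_d(\sqrt{3}/4) = 2$, and $r_d(2\sqrt{3}/7) = 3/2$. In fact, for $\varepsilon \in (0, \sqrt{3}/4]$, $r_d(\varepsilon) = \sqrt{3}/(2\varepsilon)$ and for $\varepsilon \in (\sqrt{3}/4, \sqrt{3}/3)$, $r_d(\varepsilon) = \sqrt{3}/\varepsilon - 2$. Notice that $\lim_{r \to r_d(\varepsilon)} \mathrm{HLAE}^S(r, \varepsilon) = \infty$, which is in agreement with $\mathrm{PAE}^S$ as $\varepsilon \to 0$, since as $\varepsilon \to 0$, HLAE becomes PAE and $r_d(\varepsilon) \to \infty$, and under $H_0$, $\rho_n(r)$ is degenerate for $r = \infty$. So HLAE suggests choosing $r$ large against segregation, but in fact choosing $r$ too large will reduce power, since $r \ge r_d(\varepsilon)$ guarantees the complete digraph under the alternative and, as $r$ increases therefrom, provides an ever greater probability of seeing the complete digraph under the null.

Fig. 11 contains a graph of HLAE against association as a function of $r$ for $\varepsilon = 5\sqrt{3}/24, \sqrt{3}/12, \sqrt{3}/21$. See Appendix C for explicit forms of $\mu_A(r, \varepsilon)$ and $\nu_A(r, \varepsilon)$ for $\varepsilon = \sqrt{3}/12$. Notice that since $\nu_A(r = 1, \varepsilon) = 0$ for $\varepsilon \ge \sqrt{3}/12$, $\mathrm{HLAE}^A(r = 1, \varepsilon) = \infty$ for $\varepsilon \ge \sqrt{3}/12$, and $\lim_{r \to \infty} \mathrm{HLAE}^A(r, \varepsilon) = 0$.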

In Fig. 11 we see that, against $H^A_{\varepsilon}$, $\mathrm{HLAE}^A(r, \varepsilon)$ has a local supremum for $r$ sufficiently larger than 1. Let $\tilde{r}$ be the value at which this local supremum is attained. Then r 5 /1 /1 /4., r , and r 1.5. Note that, as $\varepsilon$ gets smaller, $\tilde{r}$ gets smaller. Furthermore, HLAE A r = 1, /1 < and as ε, r becomes the global supremum, and PAE A r =1= and argsup r 1 PAE A r = So, when testing against association, HLAE suggests choosing moderate $r$, whereas PAE suggests choosing small $r$.

4.5. Asymptotic power function analysis

The asymptotic power function (see, e.g., Kendall and Stuart, 1979) can also be investigated as a function of $r$, $n$, and $\varepsilon$ using the asymptotic critical value and an appeal to normality. Under a specific segregation alternative $H^S_{\varepsilon}$, the asymptotic power function is given by

\[ \Pi_S(r, n, \varepsilon) = 1 - \Phi\!\left( \frac{z_{1-\alpha} \sqrt{\nu(r)} + \sqrt{n}\,\bigl(\mu(r) - \mu_S(r, \varepsilon)\bigr)}{\sqrt{\nu_S(r, \varepsilon)}} \right), \]

where $z_{1-\alpha} = \Phi^{-1}(1 - \alpha)$. Under $H^A_{\varepsilon}$, we have

\[ \Pi_A(r, n, \varepsilon) = \Phi\!\left( \frac{z_{\alpha} \sqrt{\nu(r)} + \sqrt{n}\,\bigl(\mu(r) - \mu_A(r, \varepsilon)\bigr)}{\sqrt{\nu_A(r, \varepsilon)}} \right). \]

Fig. 12. Asymptotic power function against segregation alternative $H^S_{\sqrt{3}/8}$ as a function of $r$ for $n = 10$ (first from left) and $n = 100$ (second), and against association alternative $H^A_{\sqrt{3}/12}$ as a function of $r$ for $n = 10$ (third) and 100 (fourth).

Analysis of Fig. 12 shows that, against $H^S_{\sqrt{3}/8}$, a large choice of $r$ is warranted for $n = 100$ but, for smaller sample size, a more moderate $r$ is recommended. Against $H^A_{\sqrt{3}/12}$, a moderate choice of $r$ is recommended for both $n = 10$ and 100. This is in agreement with the Monte Carlo investigations.
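The two power functions translate directly into code once the null and alternative moments are available. The sketch below takes those moments as inputs instead of computing them, since the explicit forms under the alternatives live in the appendices; the function name and argument names are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist

def asymptotic_power(n, mu0, nu0, mu1, nu1, alternative, alpha=0.05):
    """Asymptotic power of the one-sided level-alpha test.

    mu0, nu0: mean and asymptotic variance under the null.
    mu1, nu1: mean and asymptotic variance under the chosen alternative.
    """
    std = NormalDist()
    if alternative == "segregation":          # reject for R > z_{1 - alpha}
        z = std.inv_cdf(1.0 - alpha)
        return 1.0 - std.cdf((z * sqrt(nu0) + sqrt(n) * (mu0 - mu1)) / sqrt(nu1))
    if alternative == "association":          # reject for R < z_alpha
        z = std.inv_cdf(alpha)
        return std.cdf((z * sqrt(nu0) + sqrt(n) * (mu0 - mu1)) / sqrt(nu1))
    raise ValueError("alternative must be 'segregation' or 'association'")
```

A quick sanity check on the formulas: when the alternative moments equal the null moments, both expressions reduce to the level $\alpha$.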

Multiple triangle case

Suppose $\mathcal{Y}$ is a finite collection of points in $\mathbb{R}^2$ with $|\mathcal{Y}| \ge 3$. Consider the Delaunay triangulation (assumed to exist) of $\mathcal{Y}$, where $T_j$ denotes the $j$th Delaunay triangle, $J$ denotes the number of triangles, and $C_H(\mathcal{Y})$ denotes the convex hull of $\mathcal{Y}$. We wish to test $H_0: X_i \overset{iid}{\sim} \mathcal{U}(C_H(\mathcal{Y}))$ against segregation and association alternatives.

The digraph $D$ is constructed using $N^r_{\mathcal{Y}_j}(\cdot)$ as described in Section ., where for $X_i \in T_j$ the three points in $\mathcal{Y}$ defining the Delaunay triangle $T_j$ are used as $\mathcal{Y}_j$. Let $\rho_n(r,J)$ be the relative density of the digraph based on $\mathcal{X}_n$ and $\mathcal{Y}$ which yields $J$ Delaunay triangles, and let $w_j := A(T_j)/A(C_H(\mathcal{Y}))$ for $j = 1,\dots,J$, where $A(C_H(\mathcal{Y})) = \sum_{j=1}^{J} A(T_j)$ with $A(\cdot)$ being the area functional. Then we obtain the following as a corollary to Theorem .

Corollary 1. The asymptotic null distribution for $\rho_n(r,J)$ conditional on $\mathcal{W} = \{w_1,\dots,w_J\}$ for $r \in [1,\infty]$ is given by $N(\mu(r,J), \nu(r,J)/n)$ provided that $\nu(r,J) > 0$, with

$$\mu(r,J) := \mu(r)\sum_{j=1}^{J} w_j^2 \quad\text{and}\quad \nu(r,J) := \nu(r)\sum_{j=1}^{J} w_j^3 + 4\mu(r)^2\left[\sum_{j=1}^{J} w_j^3 - \Bigl(\sum_{j=1}^{J} w_j^2\Bigr)^2\right],$$

where $\mu(r)$ and $\nu(r)$ are given by Eqs. 8 and 9, respectively.

Proof. See Appendix D.

By an appropriate application of Jensen's inequality, we see that $\sum_{j=1}^{J} w_j^3 \ge \bigl(\sum_{j=1}^{J} w_j^2\bigr)^2$. Therefore, $\nu(r,J) = 0$ iff $\nu(r) = 0$ and $\sum_{j=1}^{J} w_j^3 = \bigl(\sum_{j=1}^{J} w_j^2\bigr)^2$, so asymptotic normality may hold even when $\nu(r) = 0$.

Similarly, for the segregation (association) alternatives in which $(4\varepsilon^2/3) \times 100\%$ of the area of each triangle, around its vertices, is forbidden (allowed), we obtain the above asymptotic distribution of $\rho_n(r)$ with $\mu(r)$ being replaced by $\mu_S(r,\varepsilon)$, $\nu(r)$ by $\nu_S(r,\varepsilon)$, $\mu(r,J)$ by $\mu_S(r,J,\varepsilon)$, and $\nu(r,J)$ by $\nu_S(r,J,\varepsilon)$. Likewise for association.

Thus in the case of $J>1$, we have a conditional test of $H_0: X_i \overset{iid}{\sim} \mathcal{U}(C_H(\mathcal{Y}))$ which once again rejects against segregation for large values of $\rho_n(r,J)$ and rejects against association for small values of $\rho_n(r,J)$.

Depicted in Fig. 13 are the segregation (with $\delta = 1/16$, i.e. $\varepsilon = \sqrt{3}/8$), null, and association (with $\delta = 1/4$, i.e. $\varepsilon = \sqrt{3}/12$) realizations (from left to right) with n = 1, $|\mathcal{Y}|$ =1, and J = 1. For the null realization, the p-value is greater than .1 for all $r$ values and both alternatives. For the segregation realization, we obtain p<.1 for 1 <r 5 and p>.4 for r = 1 and r 1. For the association realization, we obtain p<.15 for 1 <r, p =.14 for r = 1, and p>.5 for r 5. Note that this is only for one realization of $\mathcal{X}_n$.
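The moments in Corollary 1 depend on the triangulation only through the relative areas $w_j$. A minimal sketch of that computation follows, assuming $\mu(r)$ and $\nu(r)$ are already available as numbers; the function name and the example inputs are hypothetical.

```python
def multi_triangle_moments(mu, nu, areas):
    """Return (mu(r, J), nu(r, J)) from one-triangle moments and triangle areas.

    mu, nu : one-triangle null mean and asymptotic variance, mu(r) and nu(r)
    areas  : areas of the J Delaunay triangles (any positive scale)
    """
    total = sum(areas)
    weights = [a / total for a in areas]   # w_j = A(T_j) / A(C_H(Y))
    s2 = sum(w ** 2 for w in weights)      # sum of w_j^2
    s3 = sum(w ** 3 for w in weights)      # sum of w_j^3
    mu_J = mu * s2
    nu_J = nu * s3 + 4 * mu ** 2 * (s3 - s2 ** 2)
    return mu_J, nu_J


if __name__ == "__main__":
    # Hypothetical inputs: one-triangle moments and three triangle areas.
    print(multi_triangle_moments(0.25, 0.03, [1.0, 2.0, 3.0]))
```

With equal areas the bracketed term $\sum w_j^3 - (\sum w_j^2)^2$ vanishes, so the variance reduces to $\nu(r)\sum w_j^3$; with unequal areas the second term keeps $\nu(r,J)$ positive even if $\nu(r) = 0$, as noted after the corollary.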

Fig. 13. Realizations of segregation (left), $H_0$ (middle), and association (right) for $|\mathcal{Y}|$ =1, J = 1, and n = 1.

We implement the above described Monte Carlo experiment 1 times with n = 1, n =, and n = 5 and find the empirical significance levels $\alpha_S(n,J)$ and $\alpha_A(n,J)$ and the empirical powers $\beta^S_n(r,\sqrt{3}/8,J)$ and $\beta^A_n(r,\sqrt{3}/12,J)$. These empirical estimates are presented in Table 1 and plotted in Figs. 14 and 15. Notice that the empirical significance levels are all larger than .05 for both alternatives, so this test is liberal in rejecting $H_0$ against both alternatives for the given realization of $\mathcal{Y}$ and $n$ values. The smallest empirical significance levels and highest empirical power estimates occur at moderate $r$ values (r = /,,) against segregation and at smaller $r$ values (r =, /) against association. Based on this analysis, for the given realization of $\mathcal{Y}$, we suggest the use of moderate $r$ values for segregation and slightly smaller for association. Notice also that as $n$ increases, the empirical power estimates get larger for both alternatives.

Table 1. The empirical significance level and empirical power values under $H^S_{\sqrt{3}/8}$ and $H^A_{\sqrt{3}/12}$, N = 1, n = 1, and J = 1, at $\alpha = .05$ for the realization of $\mathcal{Y}$ in Fig. 13. Rows: $\alpha_S(n,J)$, $\beta^S_n(r,\sqrt{3}/8,J)$, $\alpha_A(n,J)$, and $\beta^A_n(r,\sqrt{3}/12,J)$ for each of the three sample sizes; columns: values of $r$.

Fig. 14. Monte Carlo power using the asymptotic critical value against $H^S_{\sqrt{3}/8}$, as a function of $r$, for n = 1 (left), n = (middle), and n = 5 (right) conditional on the realization of $\mathcal{Y}$ in Fig. 13. The circles represent the empirical significance levels while triangles represent the empirical power values.

Fig. 15. Monte Carlo power using the asymptotic critical value against $H^A_{\sqrt{3}/12}$ as a function of $r$, for n = 1 (left), n = (middle), and n = 5 (right) conditional on the realization of $\mathcal{Y}$ in Fig. 13. The circles represent the empirical significance levels while triangles represent the empirical power values.

The conditional test presented here is appropriate when the $\mathcal{W}$ are fixed, not random. An unconditional version requires the joint distribution of the number and relative size of Delaunay triangles when $\mathcal{Y}$ is, for instance, a Poisson point pattern. Alas, this joint distribution is not available (Okabe et al.).

Related test statistics in multiple triangle case

For $J>1$, we have derived the asymptotic distribution of $\rho_n(r,J) = |A|/(n(n-1))$. Let $A_j$ be the set of arcs, $n_j := |\mathcal{X}_n \cap T_j|$, and $\rho_{n_j}(r)$ be the arc density for triangle $T_j$ for $j = 1,\dots,J$. So

$$\sum_{j=1}^{J} \frac{n_j(n_j-1)}{n(n-1)}\,\rho_{n_j}(r) = \sum_{j=1}^{J} \frac{|A_j|}{n(n-1)} = \frac{|A|}{n(n-1)} = \rho_n(r,J).$$

Let $\hat U_n := \sum_{j=1}^{J} w_j^2\,\rho_{n_j}(r)$ where $w_j = A(T_j)/A(C_H(\mathcal{Y}))$. Since the $\rho_{n_j}(r)$ are asymptotically independent, $\sqrt{n}\,(\hat U_n - \mu(r,J))$ and $\sqrt{n}\,(\rho_n(r,J) - \mu(r,J))$ both converge in distribution to $N(0, \nu(r,J))$.

In the denominator of $\rho_n(r,J)$, we use $n(n-1)$ as the maximum number of arcs possible. However, by definition, we can at most have a digraph with $J$ complete symmetric components of order $n_j$, for $j = 1,\dots,J$. Then the maximum number possible is $n_t := \sum_{j=1}^{J} n_j(n_j-1)$, and the adjusted arc density is $\rho^{adj}_{n,J}(r) := |A|/n_t$. Then

$$\rho^{adj}_{n,J}(r) = \frac{\sum_{j=1}^{J} |A_j|}{n_t} = \sum_{j=1}^{J} \frac{n_j(n_j-1)}{n_t}\,\rho_{n_j}(r).$$

Since $n_j(n_j-1)/n_t \ge 0$ for each $j$, and $\sum_{j=1}^{J} n_j(n_j-1)/n_t = 1$, $\rho^{adj}_{n,J}(r)$ is a mixture of the $\rho_{n_j}(r)$'s. Then $\rho^{adj}_{n,J}(r)$ is asymptotically normal with mean $E[\rho^{adj}_{n,J}(r)] = \mu(r,J)$ and the variance of $\rho^{adj}_{n,J}(r)$ is

$$\frac{1}{n}\left[\nu(r)\,\frac{\sum_{j=1}^{J} w_j^3}{\bigl(\sum_{j=1}^{J} w_j^2\bigr)^2} + 4\mu(r)^2\left(\frac{\sum_{j=1}^{J} w_j^3}{\bigl(\sum_{j=1}^{J} w_j^2\bigr)^2} - 1\right)\right].$$

5.. Asymptotic efficiency analysis for J>1

The PAE, HLAE, and asymptotic power function analysis are given for $J = 1$ in the preceding sections. For $J>1$, the analysis will depend on both the number of triangles and the sizes of the triangles. So the optimal $r$ values with respect to these efficiency criteria for $J = 1$ do not necessarily hold for $J>1$, hence the analyses need to be updated, given the values of $J$ and $\mathcal{W}$.
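Returning briefly to the arc density and the adjusted arc density of the multiple triangle case: both can be computed from per-triangle point and arc counts alone. The sketch below assumes those counts are already tabulated; the function name and the counts in the example are hypothetical.

```python
def arc_densities(n_j, arcs_j):
    """Compute rho_n(r, J), the adjusted arc density, and its mixture form.

    n_j    : number of X points falling in each Delaunay triangle
    arcs_j : number of arcs among the points of each triangle
    """
    n = sum(n_j)
    total_arcs = sum(arcs_j)
    rho = total_arcs / (n * (n - 1))        # denominator n(n-1)
    n_t = sum(k * (k - 1) for k in n_j)     # max arcs with J components
    rho_adj = total_arcs / n_t

    # Mixture form: sum_j [n_j (n_j - 1) / n_t] * rho_{n_j}(r).
    mixture = 0.0
    for k, a in zip(n_j, arcs_j):
        pairs = k * (k - 1)
        if pairs > 0:                       # triangles with < 2 points add nothing
            mixture += (pairs / n_t) * (a / pairs)
    return rho, rho_adj, mixture


if __name__ == "__main__":
    print(arc_densities([4, 3, 5], [6, 2, 10]))
```

The third returned value reproduces the second, which is the mixture identity stated for $\rho^{adj}_{n,J}(r)$; the adjusted density is never smaller than $\rho_n(r,J)$ because $n_t \le n(n-1)$.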

Fig. 16. Pitman asymptotic efficiency against segregation (left) and association (right) as a function of $r$ with J = 1. Notice that vertical axes are differently scaled.

Under segregation alternative $H^S_\varepsilon$, the PAE of $\rho_n(r,J)$ is given by

$$\mathrm{PAE}^S_J(r) = \frac{\bigl(\mu_S''(r,J,\varepsilon=0)\bigr)^2}{\nu(r,J)} = \frac{\Bigl(\mu_S''(r,\varepsilon=0)\sum_{j=1}^{J} w_j^2\Bigr)^2}{\nu(r)\sum_{j=1}^{J} w_j^3 + 4\mu_S(r,\varepsilon=0)^2\Bigl[\sum_{j=1}^{J} w_j^3 - \bigl(\sum_{j=1}^{J} w_j^2\bigr)^2\Bigr]}.$$

Under association alternative $H^A_\varepsilon$ the PAE of $\rho_n(r,J)$ is similar.

In Fig. 16, we present the PAE as a function of $r$ for both segregation and association conditional on the realization of $\mathcal{Y}$ in Fig. 13. Notice that, unlike the $J = 1$ case, $\mathrm{PAE}^S_J(r)$ is bounded. Some values of note are PAE S J ρn 1 =.884, lim r PAE S J r=8 J Jj=1 Jj=1 j=1 wj /56 wj wj 19.4, argsup r 1,] PAE S J r As for association, PAEA J r = 1 = , lim r PAE A J r=, argsup r 1 PAEA J r=1.5 with PAEA J r = Based on the asymptotic efficiency analysis, we suggest, for large $n$ and small $\varepsilon$, choosing moderate $r$ for testing against segregation and association.

Under segregation, the HLAE of $\rho_n(r,J)$ is given by

$$\mathrm{HLAE}^S_J(r,\varepsilon) := \frac{\bigl(\mu_S(r,J,\varepsilon) - \mu(r,J)\bigr)^2}{\nu_S(r,J,\varepsilon)} = \frac{\Bigl(\mu_S(r,\varepsilon)\sum_{j=1}^{J} w_j^2 - \mu(r)\sum_{j=1}^{J} w_j^2\Bigr)^2}{\nu_S(r,\varepsilon)\sum_{j=1}^{J} w_j^3 + 4\mu_S(r,\varepsilon)^2\Bigl[\sum_{j=1}^{J} w_j^3 - \bigl(\sum_{j=1}^{J} w_j^2\bigr)^2\Bigr]}.$$

Fig. 17. Hodges-Lehmann asymptotic efficiency against segregation alternative $H^S_\varepsilon$ as a function of $r$ for $\varepsilon = \sqrt{3}/8, \sqrt{3}/4, 2\sqrt{3}/7$ (left to right) and J = 1.

Notice that $\mathrm{HLAE}^S_J(r,\varepsilon=0) = 0$ and $\lim_{r\to\infty}\mathrm{HLAE}^S_J(r,\varepsilon) = 0$, and HLAE is bounded provided that $\nu(r,J) > 0$.

We calculate HLAE of $\rho_n(r,J)$ under $H^S_\varepsilon$ for $\varepsilon = \sqrt{3}/8$, $\varepsilon = \sqrt{3}/4$, and $\varepsilon = 2\sqrt{3}/7$. In Fig. 17 we present $\mathrm{HLAE}^S_J(r,\varepsilon)$ for these $\varepsilon$ values conditional on the realization of $\mathcal{Y}$ in Fig. 13. Note that with $\varepsilon = \sqrt{3}/8$, HLAE S J r =1, /8.4 and argsup r 1, ] HLAE S J r, / with the supremum.544. With $\varepsilon = \sqrt{3}/4$, HLAE S J r=1, /4.45 and argsup r 1, ] HLAE S J r, / with the supremum With $\varepsilon = 2\sqrt{3}/7$, HLAE S J r = 1, /7.45 and argsup r 1, ] HLAE S J r, /7 1.88 with the supremum Furthermore, we observe that $\mathrm{HLAE}^S_J(r, 2\sqrt{3}/7) > \mathrm{HLAE}^S_J(r, \sqrt{3}/4) > \mathrm{HLAE}^S_J(r, \sqrt{3}/8)$. Based on the HLAE analysis for the given $\mathcal{Y}$ we suggest moderate $r$ values for moderate segregation and small $r$ values for severe segregation.

The explicit form of $\mathrm{HLAE}^A_J(r,\varepsilon)$ is similar to that of $\mathrm{HLAE}^S_J(r,\varepsilon)$, which implies $\mathrm{HLAE}^A_J(r,\varepsilon=0) = 0$ and $\lim_{r\to\infty}\mathrm{HLAE}^A_J(r,\varepsilon) = 0$.

We calculate HLAE of $\rho_n(r,J)$ under $H^A_\varepsilon$ for $\varepsilon = \sqrt{3}/21$, $\varepsilon = \sqrt{3}/12$, and $\varepsilon = 5\sqrt{3}/24$. In Fig. 18 we present $\mathrm{HLAE}^A_J(r,\varepsilon)$ for these $\varepsilon$ values conditional on the realization of $\mathcal{Y}$ in Fig. 13. Note that with $\varepsilon = \sqrt{3}/21$, HLAE A J r = 1, /1.9 and argsup r 1, ] HLAE A J r, / with the supremum.157. With $\varepsilon = \sqrt{3}/12$, HLAE A J r = 1, /1.168 and argsup r 1, ] HLAE A J r, / with the supremum With $\varepsilon = 5\sqrt{3}/24$, HLAE A J r = 1, 5 /4.17 and argsup r 1, ] HLAE A J r, 5 /4.96 with the supremum Furthermore, we observe that $\mathrm{HLAE}^A_J(r, 5\sqrt{3}/24) > \mathrm{HLAE}^A_J(r, \sqrt{3}/12) > \mathrm{HLAE}^A_J(r, \sqrt{3}/21)$. Based on the HLAE analysis for the given $\mathcal{Y}$ we suggest moderate $r$ values for moderate association and large $r$ values for severe association.
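The Hodges-Lehmann efficiency for $J>1$ is a ratio of the squared mean shift to the alternative variance, both rescaled through the weights $w_j$. The following minimal sketch evaluates that ratio, assuming the one-triangle moments under the null and under the alternative are supplied as numbers; the function name and all inputs are hypothetical.

```python
def hlae_multi(mu0, mu_alt, nu_alt, weights):
    """Hodges-Lehmann asymptotic efficiency for J triangles.

    mu0     : one-triangle null mean mu(r)
    mu_alt  : one-triangle mean under the alternative, e.g. mu_S(r, eps)
    nu_alt  : one-triangle variance under the alternative, e.g. nu_S(r, eps)
    weights : relative triangle areas w_j, assumed to sum to 1
    """
    s2 = sum(w ** 2 for w in weights)
    s3 = sum(w ** 3 for w in weights)
    numerator = ((mu_alt - mu0) * s2) ** 2
    denominator = nu_alt * s3 + 4 * mu_alt ** 2 * (s3 - s2 ** 2)
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical moments and two unequal triangles.
    print(hlae_multi(0.25, 0.30, 0.02, [0.25, 0.75]))
    # Still finite when the one-triangle variance is zero.
    print(hlae_multi(0.25, 0.30, 0.0, [0.25, 0.75]))
```

With a single triangle the expression reduces to $(\mu_S - \mu)^2/\nu_S$, while for unequal weights the extra term in the denominator keeps the ratio finite even when $\nu_S = 0$, which illustrates why the efficiency is bounded for $J>1$.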

Fig. 18. Hodges-Lehmann asymptotic efficiency against association alternative $H^A_\varepsilon$ as a function of $r$ for $\varepsilon = \sqrt{3}/21, \sqrt{3}/12, 5\sqrt{3}/24$ (left to right) and J = 1.

6. Discussion and conclusions

In this article we investigate the mathematical properties of a random digraph method for the analysis of spatial point patterns.

The first proximity map similar to the r-factor proximity map $N^r_{\mathcal{Y}}$ in the literature is the spherical proximity map $N_S(x) := B(x, r(x))$; see the references for CCCD in the Introduction. A slight variation of $N_S$ is the arc-slice proximity map $N_{AS}(x) := B(x, r(x)) \cap T(x)$, where $T(x)$ is the Delaunay cell that contains $x$ (see Ceyhan and Priebe). Furthermore, Ceyhan and Priebe introduced the central similarity proximity map $N_{CS}$ in Ceyhan and Priebe and $N^r_{\mathcal{Y}}$ in Ceyhan and Priebe (2005). The r-factor proximity map, when compared to the others, has the advantages that the asymptotic distribution of the domination number $\gamma_n(N^r_{\mathcal{Y}})$ is tractable (see Ceyhan and Priebe, 2005) and that an exact minimum dominating set can be found in polynomial time. Moreover, $N^r_{\mathcal{Y}}$ and $N_{CS}$ are geometry invariant for uniform data over triangles. Additionally, the mean and variance of the relative density $\rho_n$ are not analytically tractable for $N_S$ and $N_{AS}$. While $N^r_{\mathcal{Y}}(x)$, $N_{CS}(x)$, and $N_{AS}(x)$ are well defined only for $x \in C_H(\mathcal{Y})$, the convex hull of $\mathcal{Y}$, $N_S(x)$ is well defined for all $x \in \mathbb{R}^d$. The proximity maps $N_S$ and $N_{AS}$ require no effort to extend to higher dimensions.

The map $N_S$ (the proximity map associated with CCCD) is used in classification in the literature, but not for testing spatial patterns between two or more classes. We develop a technique to test the patterns of segregation or association. There are many tests available for segregation and association in the ecology literature. See Dixon (1994) for a survey on these tests and relevant references. Two of the most commonly used tests are Pielou's $\chi^2$ test of independence and Ripley's test based on the $K(t)$ and $L(t)$ functions. However, the test we introduce here is not comparable to either of them. Our test is a conditional test, conditional on a realization of $J$ (the number of Delaunay triangles) and $\mathcal{W}$ (the set of relative areas of the Delaunay triangles), and we require that the number of triangles $J$ be fixed and relatively small compared to $n = |\mathcal{X}_n|$. Furthermore, our method deals with a slightly different type of data than most methods for examining spatial patterns. The sample size for one type of point (type $\mathcal{X}$ points) is much larger than that of the other (type $\mathcal{Y}$ points). This implies that in practice, $\mathcal{Y}$ could be stationary or have a much longer life span than members of $\mathcal{X}$. For example, a special type of fungi might constitute the $\mathcal{X}$ points, while the tree species around which the fungi grow might be viewed as the $\mathcal{Y}$ points.

There are two major types of asymptotic structures for spatial data (Lahiri). In the first, any two observations are required to be at least a fixed distance apart; hence, as the number of observations increases, the region on which the process is observed eventually becomes unbounded. This type of sampling structure is called increasing domain asymptotics. In the second type, the region of interest is a fixed bounded region and more and more points are observed in this region. Hence the minimum distance between data points tends to zero as the sample size tends to infinity. This type of structure is called infill asymptotics, due to Cressie. The sampling structure for our asymptotic analysis is infill, as only the size of the type $\mathcal{X}$ process tends to infinity, while the support, the convex hull $C_H(\mathcal{Y})$ of a given set of points from the type $\mathcal{Y}$ process, is a fixed bounded region.

Moreover, our statistic can be written as a U-statistic based on the locations of type $\mathcal{X}$ points with respect to type $\mathcal{Y}$ points. This is one advantage of the proposed method: most statistics for spatial patterns cannot be written as U-statistics. The U-statistic form yields asymptotic normality once the mean and variance are obtained by detailed geometric calculations.

The null hypothesis we consider is considerably more restrictive than current approaches, which can be used much more generally. The null hypothesis for testing segregation or association can be described in two slightly different forms (Dixon, 1994):

(i) complete spatial randomness, that is, each class is distributed randomly throughout the area of interest. It describes both the arrangement of the locations and the association between classes.

(ii) random labeling of locations, which is less restrictive than spatial randomness, in the sense that the arrangement of the locations can either be random or non-random.

Our conditional test is closer to the former in this regard. Pielou's test provides insight only on the interaction between classes, hence there is no assumption on the allocation of the observations, which makes it more appropriate for testing the null hypothesis of random labeling. Ripley's test can be used for both types of null hypotheses; in particular, it can be used to test a type of spatial randomness against another type of spatial randomness.

The test based on the mean domination number in Ceyhan and Priebe (2005) is not a conditional test, but requires both $n$ and the number of Delaunay triangles $J$ to be large. The comparison for a large but fixed $J$ is possible. Furthermore, under segregation alternatives, the Pitman asymptotic efficiency is not applicable to the mean domination number case; however, for large $n$ and $J$ we suggest the use of it over arc density since, for each $\varepsilon > 0$, the Hodges-Lehmann asymptotic efficiency is unbounded for the mean domination number case, while it is bounded for the arc density case with $J>1$. As for the association alternative, HLAE suggests moderate $r$ values, which have finite Hodges-Lehmann asymptotic efficiency. So again, for large $J$ and $n$ the mean domination number is preferable. The basic advantage of $\rho_n(r)$ is that it does not require $J$ to be large, so for small $J$ it is preferable.

Although the statistical analysis and the mathematical properties related to the r-factor proximity catch digraph are done in $\mathbb{R}^2$, the extension to $\mathbb{R}^d$ with $d > 2$ is straightforward.

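See Ceyhan and Priebe (2005) for more detail on the construction of the associated proximity region in higher dimensions. Moreover, the geometry invariance, asymptotic normality of the U-statistic and consistency of the tests hold for $d > 2$.

Acknowledgements

This research was supported by the Defense Advanced Research Projects Agency as administered by the Air Force Office of Scientific Research under contract DOD F and by Office of Naval Research Grant N . The authors thank anonymous referees for valuable comments and suggestions.

Appendix A. Derivation of $\mu(r)$ and $\nu(r)$

In the standard equilateral triangle, let $y_1 = (0,0)$, $y_2 = (1,0)$, $y_3 = (1/2, \sqrt{3}/2)$, let $M_C$ be the center of mass, and let $M_j$ be the midpoints of the edges $e_j$ for $j = 1,2,3$. Then $M_C = (1/2, \sqrt{3}/6)$, $M_1 = (3/4, \sqrt{3}/4)$, $M_2 = (1/4, \sqrt{3}/4)$, $M_3 = (1/2, 0)$.

Recall that

$$E[\rho_n(r)] = \frac{2}{n(n-1)} \sum\sum_{i<j} \frac{1}{2}E[h_{ij}] = \frac{1}{2}E[h_{12}] = \mu(r) = P\bigl(X_j \in N^r_{\mathcal{Y}}(X_i)\bigr).$$

Let $\mathcal{X}_n$ be a random sample of size $n$ from $\mathcal{U}(T(\mathcal{Y}))$. For $x_1 = (u,v)$, $l_r(x_1) = rv + r\sqrt{3}\,u - \sqrt{3}\,x$. Next, let $N_1 := l_r(x_1) \cap e_3$ and $N_2 := l_r(x_1) \cap e_2$. Then for $x_1 \in T_s := T(y_1, M_3, M_C)$, $N^r_{\mathcal{Y}}(x_1) = T(y_1, N_1, N_2)$ provided that $l_r(x_1)$ is not outside of $T(\mathcal{Y})$, where $N_1 = \bigl(r(v + \sqrt{3}u)/\sqrt{3},\, 0\bigr)$ and $N_2 = \bigl(\sqrt{3}\,r(v + \sqrt{3}u)/6,\, r(v + \sqrt{3}u)/2\bigr)$.

Now we find $\mu(r)$ for $r \in [1,\infty)$. First, observe that, by symmetry,

$$\mu(r) = P\bigl(X_2 \in N^r_{\mathcal{Y}}(X_1)\bigr) = 6\,P\bigl(X_2 \in N^r_{\mathcal{Y}}(X_1),\, X_1 \in T_s\bigr).$$

Let $l_s(r,x)$ be the line such that $r\,d(y_1, l_s(r,x)) = d(y_1, e_1)$ and $l_s(r,x) \cap T(\mathcal{Y}) \neq \emptyset$, so $l_s(r,x) = \sqrt{3}\,(1/r - x)$. Then if $x_1 \in T_s$ is above $l_s(r,x)$ then $N^r_{\mathcal{Y}}(x_1) = T(\mathcal{Y})$; otherwise, $N^r_{\mathcal{Y}}(x_1) = T_r(x_1) \subsetneq T(\mathcal{Y})$.

For $r \in [1, 3/2)$, $l_s(r,x) \cap T_s = \emptyset$, so $N^r_{\mathcal{Y}}(x_1) = T_r(x_1) \subsetneq T(\mathcal{Y})$ for all $x_1 \in T_s$. Then

$$P\bigl(X_2 \in N^r_{\mathcal{Y}}(X_1),\, X_1 \in T_s\bigr) = \int_0^{1/2}\!\int_0^{x/\sqrt{3}} \frac{A\bigl(N^r_{\mathcal{Y}}(x_1)\bigr)}{A(T(\mathcal{Y}))^2}\,dy\,dx = \frac{37}{1296}\,r^2,$$

where $A\bigl(N^r_{\mathcal{Y}}(x_1)\bigr) = \frac{\sqrt{3}}{12}\,r^2\,(y + \sqrt{3}\,x)^2$ with $x_1 = (x,y)$ and $A(T(\mathcal{Y})) = \sqrt{3}/4$. Hence for $r \in [1, 3/2)$, $\mu(r) = \frac{37}{216}\,r^2$.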
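The first case of the derivation, $r \in [1, 3/2)$, can be checked numerically. The sketch below is a plain midpoint-rule approximation of the double integral over $T_s$, using the integrand $A(N^r_{\mathcal{Y}}(x_1))/A(T(\mathcal{Y}))^2$ with $A(N^r_{\mathcal{Y}}(x_1)) = \frac{\sqrt{3}}{12} r^2 (y + \sqrt{3}x)^2$ and $A(T(\mathcal{Y})) = \sqrt{3}/4$, compared against the closed form $\frac{37}{1296} r^2$. The function name and the grid size are illustrative choices, not part of the article.

```python
import math

SQRT3 = math.sqrt(3.0)
AREA_T = SQRT3 / 4.0   # area of the standard equilateral triangle


def prob_first_case(r, m=200):
    """Midpoint-rule value of P(X2 in N(X1), X1 in T_s) for r in [1, 3/2).

    Integrates over T_s = {0 <= x <= 1/2, 0 <= y <= x / sqrt(3)}.
    """
    hx = 0.5 / m
    total = 0.0
    for i in range(m):
        x = (i + 0.5) * hx
        y_max = x / SQRT3
        hy = y_max / m
        for k in range(m):
            y = (k + 0.5) * hy
            area_n = SQRT3 / 12.0 * r ** 2 * (y + SQRT3 * x) ** 2
            total += area_n / AREA_T ** 2 * hx * hy
    return total


if __name__ == "__main__":
    r = 1.25
    numeric = prob_first_case(r)
    closed = 37.0 / 1296.0 * r ** 2
    print(numeric, closed, 6 * numeric, 37.0 / 216.0 * r ** 2)
```

Multiplying the numerical value by 6, as the symmetry argument prescribes, recovers $\mu(r) = \frac{37}{216} r^2$ up to the quadrature error.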

Fig. 19. The cases for relative position of $l_s(r,x)$ with various $r$ values.

For $r \in [3/2, 2)$, $l_s(r,x)$ crosses through $M_3M_C$. Let the $x$ coordinate of $l_s(r,x) \cap y_1M_C$ be $s_1$; then $s_1 = 3/(4r)$. See Fig. 19 for the relative position of $l_s(r,x)$ and $T_s$. Then

$$P\bigl(X_2 \in N^r_{\mathcal{Y}}(X_1),\, X_1 \in T_s\bigr) = \int_0^{s_1}\!\int_0^{x/\sqrt{3}} \frac{A\bigl(N^r_{\mathcal{Y}}(x_1)\bigr)}{A(T(\mathcal{Y}))^2}\,dy\,dx + \int_{s_1}^{1/2}\!\int_0^{l_s(r,x)} \frac{A\bigl(N^r_{\mathcal{Y}}(x_1)\bigr)}{A(T(\mathcal{Y}))^2}\,dy\,dx + \int_{s_1}^{1/2}\!\int_{l_s(r,x)}^{x/\sqrt{3}} \frac{1}{A(T(\mathcal{Y}))}\,dy\,dx$$

$$= -\frac{-36 + r^4 + 64r - 32r^2}{48\,r^2}.$$

Hence for $r \in [3/2, 2)$, $\mu(r) = -\frac{1}{8}r^2 - 8r^{-1} + \frac{9}{2}r^{-2} + 4$.

For $r \in [2, \infty)$, $l_s(r,x)$ crosses through $y_1M_3$. Let the $x$ coordinate of $l_s(r,x) \cap y_1M_3$ be $s_2$; then $s_2 = 1/r$. See Fig. 19.
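The two closed forms obtained so far, $\mu(r) = \frac{37}{216} r^2$ on $[1, 3/2)$ and $\mu(r) = -\frac{1}{8}r^2 - 8r^{-1} + \frac{9}{2}r^{-2} + 4$ on $[3/2, 2)$, can be sanity-checked for continuity at $r = 3/2$ and for agreement with six times the integral value $-(-36 + r^4 + 64r - 32r^2)/(48r^2)$. The sketch below covers only these two ranges; the function names are hypothetical.

```python
def mu(r):
    """Mean relative density mu(r) for r in [1, 2), from the two cases above."""
    if 1.0 <= r < 1.5:
        return 37.0 * r ** 2 / 216.0
    if 1.5 <= r < 2.0:
        return -r ** 2 / 8.0 - 8.0 / r + 9.0 / (2.0 * r ** 2) + 4.0
    raise ValueError("this sketch only covers 1 <= r < 2")


def prob_second_case(r):
    """P(X2 in N(X1), X1 in T_s) for r in [3/2, 2), before the factor of 6."""
    return -(-36.0 + r ** 4 + 64.0 * r - 32.0 * r ** 2) / (48.0 * r ** 2)


if __name__ == "__main__":
    left_limit = 37.0 * 1.5 ** 2 / 216.0      # first formula evaluated at r = 3/2
    print(left_limit, mu(1.5))                 # the two pieces agree at r = 3/2
    print(6.0 * prob_second_case(1.75), mu(1.75))
```

Agreement of the two pieces at $r = 3/2$ is a useful check on the reconstructed coefficients, since the derivation changes integration regions exactly there.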
