A Lower Bound On Proximity Preservation by Space Filling Curves

Size: px

Start display at page:

Download "A Lower Bound On Proximity Preservation by Space Filling Curves"

Brooke Ross
5 years ago
Views:

1 A Lower Boun On Proximity Preservation by Space Filling Curves Pan Xu Inustrial an Manufacturing Systems Engg. Iowa State University Ames, IA, USA Srikanta Tirthapura Electrical an Computer Engg. Iowa State University Ames, IA, USA Abstract A space filling curve (SFC) is a proximity preserving mapping from a high imensional space to a single imensional space. SFCs have been use extensively in ealing with multi-imensional ata in parallel computing, scientific computing, an atabases. The general goal of an SFC is that points that are close to each other in high-imensional space are also close to each other in the single imensional space. While SFCs have been use wiely, the extent to which proximity can be preserve by an SFC is not precisely unerstoo yet. We consier natural metrics, incluing the nearestneighbor stretch of an SFC, which measure the extent to which an SFC preserves proximity. We first show a powerful negative result, that there is an inherent lower boun on the stretch of any SFC. We then show that the stretch of the commonly use Z curve is within a factor of.5 from the optimal, irrespective of the number of imensions. Further we show that a very simple SFC also achieves the same stretch as the Z curve. Our results apply to SFCs in any imension such that is a constant. Keywors-space filling curve, proximity, stretch, lower boun I. INTRODUCTION Space filling curves are a wiely use tool for ealing with multi-imensional ata. The basic iea in a space filling curve is that ata in two or more imensions is mappe to a single imension with the expectation that proximity is preserve. In other wors, points that are close to each other in the high imensional space are also (hopefully) close to each other in the space filling curve. Space filling curves have been use in numerous applications such as ata partitioning in scientific computing [6, 3], parallel omain ecomposition [3, ], cryptography [6], seconary memory ata structures [9], geographical information systems [], to name a few. Some popularly use space filling curves are the Z-curve [, 9] (also known as the Morton orering), the Hilbert curve [3], an the Gray coe curve [9, 0]. While all the above applications of SFCs rely on the proximity preserving properties of SFCs, there has been surprisingly little formal analysis of the extent to which proximity is preserve by an The work of Tirthapura was supporte in part by the National Science Founation through awars , SFC. There are two natural questions that one may ask in this context. ) Are there any inherent its on the extent to which proximity can be preserve by an SFC? If so, what are they? ) How close are specific SFCs to the optimal SFC with respect to proximity preservation? This work is an attempt to answer both these questions. In orer to o so, we efine precise metrics for evaluating the proximity preserving quality of an SFC. Nearest Neighbor Stretch: We first consier proximity preservation for those pairs of points that are nearest neighbors in the multi-imensional space. In many applications of SFCs, such as N-boy simulations [6], the ominant interactions are the ones between nearest neighbors, an proximity preservation between such pairs of points is critical for the efficiency of the ata structure. To evaluate proximity preservation among nearest neighbors, we introuce a class of metrics calle the average nearest-neighbor stretch of an SFC, henceforth referre to as the average NN-stretch. Informally, the average NNstretch of a -imensional SFC is the average multiplicative increase of the istance between nearest neighbor pairs in high-imensions when the points are mappe into one imension. We consier two variants of this efinition, base on whether we consier the average istance to a nearest neighbor for each cell, followe by average across all cells (average-average NN-stretch), or the maximum istance to a nearest neighbor for each cell, followe by the average across all cells (average-maximum NN-stretch). In the following, we use average NN-stretch to mean the average-average NN stretch. Precise efinitions are provie in Section III. We show the following results. There is an inherent lower boun on the average NNstretch of any SFC. In particular, the average ( ) NNstretch of any SFC must be at least 3 n for large n, where n is the number of cells in the universe, an is the number of imensions (Theorem ) The average NN-stretch of the Z space filling curve is within a factor of.5 of the above boun, irrespective

2 of the number of imensions (Theorem ). Further, the average NN-stretch of a very simple space filling curve, which we call the simple curve, also has the same average NN-stretch as the Z curve (Theorem 3). All Pairs Stretch: We next consier proximity preservation for all possible pairs of points, not just those pairs that are nearest neighbors in high imensions. For this, we introuce a metric calle the average all pairs stretch. Informally, the average all pairs stretch of a imensional SFC is the average multiplicative increase in the istance between a pair of points when the points are mappe from high imensional space to one imension. We consier two metrics for cells in high imensions, the Manhattan metric an the Eucliean metric. Our results for the all pairs stretch inclue lower bouns for any SFC, as well as upper bouns for specific SFCs. For any SFC π, the average all pairs stretch using the n+ Manhattan metric must be at least 3 n n 3, where n is the size of the universe, an is the number of imensions. The average all pairs stretch of the simple curve is no more than n In practice, proximity preservation is more important for pairs of points that are close to each other in highimensions than it is for points that are further apart in highimensional space. For example, if the application was to simulate N-boy interactions between particles, the forces between particles get much weaker with istance, so that the interactions between particles that are far away are nonexistent or negligible. Thus, we believe the average NNstretch is usually a more significant metric than the average all pairs stretch. Our results lea to the following observations: ) For proximity preservation accoring to the average NN-stretch, the Z curve is close to optimal. To our knowlege, this is the first instance in the literature on space filling curves that we are able to make a precise statement about the optimality of an SFC with respect to proximity preservation. Previous works ha investigate the properties of specific SFCs, but i not present a lower boun on the class of all SFCs. ) Rather surprisingly, the simple curve has the same performance as the Z curve an is also near-optimal. 3) Further, a ifferent space filling curve can yiel only a constant factor improvement over the Z curve or the simple curve. An SFC is generally thought of as a curve that oes not intersect with itself. In this work, we efine an SFC as a bijection from the high imensional universe with n cells to the set {0,,,..., n }. Since this can inclue curves that can self-intersect (see curve π in Figure, for example), our efinition of SFCs is a more general class than is usually consiere. This also implies that each of our lower bouns is a lower boun for the class of non-intersecting SFCs also. Roamap: The rest of this paper is organize as follows. We iscuss relate work in Section II, followe by a iscussion of the moel an a efinition of our metrics in Section III. We present our analysis of the NN-stretch in Section IV. This inclues lower bouns for any SFC, as well as exact analysis of the Z curve an the simple curve. Relate problems, incluing all pairs stretch, Eucliean metric, an variants are consiere in Section V. II. RELATED WORK There is a vast literature on SFCs. Here we attempt to cite closely relate work. None of these works consier lower bouns for proximity preservation by an SFC, like we o. Moon et al.[8] present a comprehensive analysis of the Hilbert SFC with respect to the clustering metric, efine as follows: given a rectangular region, into how many consecutive segments of an SFC can this be ivie into, on average? The clustering metric is ifferent from the metric that we consier. Our metric, the stretch, measures the extent to which istances are preserve. A relate application of SFCs for managing multi-imensional ata in seconary memory is iscusse in [9, 4]. Nearest-neighbor fining using SFCs has been stuie in [5]. This work compares the Z-curve, the Gray-coe curve, an the Hilbert-curve accoring to their performance for nearest-neighbor queries. They o not consier the stretch of an SFC, as we o here. Neiermeier, Reinhart, an Saners [0] stuy the Hilbert SFC in two an three imensions accoring to the ratio between istance between points on the two imensional gri to the istance between them on the SFC. This work proves a result of the form: If two points are inexe i an j on the Hilbert SFC, then their Manhattan istance on the two imensional gri is boune by 3 i j. Thus, they prove that mapping from one imension to two imensions (usually) results in a contraction in istance, for the Hilbert curve. Note this result oes not imply that mapping from two imensions to one imension results in a small expansion of istance. Inee, as we show, there is an inherent lower boun on the extent to which this can be achieve. Thus, this work an our work are concerne with ifferent metrics. Dai an Su [7, 8] present upper bouns on the stretch of specific SFCs incluing the Hilbert curve, but o not consier lower bouns on the stretch, like we o here. Mitchison an Durbin [7] consier a metric close to the average NN-stretch as efine above, an present an analysis of the optimal numbering for the two imensional case. When compare with this, our work presents an analysis of the problem for any (constant) number of imensions. An even earlier work ue to Harper [] consiers a metric that is relate to the nearest neighbor stretch in high imensions,

3 but focuses on the n-cube, where the sie length along each imension is. In contrast, we consier a high-imensional cube of sie length k. Gotsman an Linenbaum [] also consier questions similar to [0]: to what extent can two points that are close to each other along the SFC be far apart in the multiimensional metric (Eucliean or Manhattan)? Similar to the above, our work consiers a ifferent metric, which goes the opposite irection: what is the extent to which istances are preserve when points are mappe from the multiimensional universe to the one imensional universe. Tirthapura, Seal an Aluru [5] consiere the analysis of SFCs in the context of nearest neighbor queries an spherical region queries. They assume a probabilistic moel of input an present an average case analysis of the performance of a class of SFCs that inclues the Hilbert an Z curves. They o not consier the stretch as we efine here, an restrict their analysis to two an three imensions. There is work on the efficient generation of the orer of points accoring to the space filling curve. For example, SFCGen [5] proposes a table-riven approach to the generation of points along an SFC. Note that there have been several empirical comparisons of ifferent space-filling curves, such as []. III. MODEL The universe is the imensional gri of imensions n n. For simplicity, we assume that n = k where k is a non-negative integer. Let U enote the set of all n cells. Each point in U is a -tuple (x, x,..., x ) where for each i =..., 0 x i < n. For cells α, β U, let (α, β) enote the Manhattan istance between α an β on the imensional gri. If α = (α, α,..., α ) an β = (β, β,..., β ), then (α, β) = α i β i. A space filling curve (SFC) π is a bijection π : U {0,,..., n }. A space filling curve provies a total orer among all cells in U. For an SFC π, for cells α, β U, let enote π(α) π(β), i.e., the istance between α an β on SFC π. Nearest Neighbor Stretch: Our first metric captures the stretch between those pairs of cells that are nearest neighbors in U accoring to metric. For cell α U, let N(α) enote the set of cells β U such that (α, β) =. Note that cells that are nearest neighbors accoring to the Manhattan metric are also nearest neighbors accoring to the Eucliean metric in the -imensional space. The set N(α) is referre to as the neighbors of α in U. It is clear that for each α U, N(α). In the rest of the paper, we use the phrase nearest neighbors to mean the nearest neighbors accoring to the metric, unless otherwise specifie. Definition. For an SFC π an cell α, the average nearest D A B C Figure. Two space filling curves on a gri. The curves on the left an the right are referre to as π an π respectively. π orers the cells as C, A, B, D an π orers the cells as A, B, C, D. neighbor stretch for α is efine as: δ avg π (α) = D A β N(α) π(α, β) N(α) Note that for any β N(α), it must be true that (α, β) =, so the above expression can be interprete as the average ilation of a path between nearest neighbors when the cells are organize using the SFC π. The intuition is that if the above metric is small, then the SFC preserves the istance between nearest neighbors well, an if the metric is large, then cells that are neighbors in U are far apart when organize using π. Definition. For a space filling curve π, the averageaverage nearest neighbor stretch for π is efine as: D avg (π) = δπ avg (α) n α U We woul like to have D avg (π) be as small as possible. For example, it woul be best if D avg (π) was, so that cells that are nearest neighbors in U are assigne consecutive inices by π, but this is easily seen to be impossible. In subsequent sections, we present strong lower bouns for D avg (π), for a general SFC π. Definition 3. For an SFC π an cell α, the maximum nearest neighbor stretch for α is efine as: δ max π (α) = max β N(α) π(α, β) Definition 4. The average-maximum nearest neighbor stretch for SFC π is efine as: D max (π) = δπ max (α) n α U B C

4 For the example shown in the Figure, we have the following. The values of δπ avg (A), δπ avg (B), δπ avg (C), an δπ avg (D) are all equal to.5, an hence D avg (π ) =.5. Similarly, it can be checke that D avg (π ) =, D max (π ) =, an D max (π ) =.5 We first prove some basic results about the istances inuce by π (, ). Lemma. For any space filling curve π on an universe U, π obeys the generalize triangle inequality. In other wors, for any α, α, α 3,..., α k U, k π (α, α k ) π (α i, α i+ ) Proof: We prove using mathematical inuction. Consier the case when k =. Easily we can check that the generalize triangle inequality hols. Now assume the generalize triangle inequality hols for each k l, then we have: π (α, α l ) π (α, α l ) + π (α l, α l ) n π (α i, α i+ ) + π (α l, α l ) = l π (α i, α i+ ) So we have that the generalize triangle inequality hols for k = n. IV. ANALYSIS OF AVERAGE NEAREST NEIGHBOR STRETCH We first present a lower boun on D avg (π) for any SFC π. Then, we present an exact analysis of D avg (Z) where Z is the Z-curve on imensions. We follow this by the exact analysis of a space filling curve S with a simple structure, an show that D avg (S) matches that of the Z curve. A. Lower Boun In this section, we present a lower boun for the averageaverage NN-stretch for any SFC. Let A be the set of all possible unorere pairs in U while A be the set of all possible orere pairs in U. Let NN enote the set of unorere pairs that are nearest neighbors in the imensional universe, i.e. NN = {(α, β) α U, β U, (α, β) = } Think of elements of NN as the eges of length between cells in the metric efine by. Secon Dim Figure. p(β, α). α = (,) β = (3,5) First Dim. The ashe path enotes p(α, β), an the soli path enotes Theorem. For any SFC π whose omain is a - imensional universe U, it must be true that: D avg (π) 3 (n n ) To prove the above theorem, we introuce the notion of a nearest neighbor ecomposition of a pair of cells (α, β) A, enote by p(α, β), efine as follows. Let the coorinates of the cells be α = (x,, x ), β = (y,, y ). Intuitively, p(α, β) efines a specific path from α to β, an can thus be consiere as a set of eges (unorere pairs of vertices), i.e. a subset of NN. Suppose the coorinates of α an β iffere only along one imension, say i, i. Then when x i < y i, we have: p(α, β) = y i l=x i {((x,..., l,..., x ), (x,..., l+,..., x ))} an when x i > y i, we have: p(α, β) = x i l=y i {((x,..., l,..., x ), (x,..., l+,..., x ))} For example, p((6, 4, 5), (3, 4, 5)) is the set {((3, 4, 5), (4, 4, 5)), ((4, 4, 5), (5, 4, 5)), ((5, 4, 5), (6, 4, 5))}. Note that if α an β iffere only in one coorinate, p(α, β) equals p(β, α). If the coorinates of α an β iffer along more than one imension, then we efine the following sequence of vertices that form a path from α to β. In this path, we correct the coorinates of α one imension at a time, starting from imension till imension, until we reach β. [α 0 = α = (x,, x )], [α = (y, x,, x )], [α = (y, y, x 3,, x )], [α =

5 (y,, y, x )], [α = β = (y,, y, y )]. Note that α i an α i+ iffer only along a single imension. Then we have: p(α, β) = i=0 p(α i, α i+ ) Note that for a pair of cells (α, β) A, p(α, β) can be ifferent from p(β, α). In Figure, p(α, β) is the set {((, ), (, )), ((, ), (3, )), ((3, ), (3, )), ((3, ), (3, 3)), ((3, 3), (3, 4)), ((3, 4), (3, 5))} while p(β, α) is the set {((, 5), (, 5)), ((, 5), (3, 5)), ((, ), (, )), ((, ), (, 3)), ((, 3), (, 4)), ((, 4), (, 5))} Our strategy for a proof of Theorem is as follows: We compute π(α, β) in two ifferent ways. In one manner, we compute it irectly an exactly, leaing to a boun of Θ(n 3 ). In another way, we fin an upper boun in terms of D avg (π), using the nearest neighbor ecomposition of each (α, β) A, an then the triangle inequality. This eventually leas to a lower boun on D avg (π). Let S A (π) be efine as follows. S A (π) = Lemma. For any SFC π, S A (π) = (n )n(n + ) () 3 Proof: We partition A into n subgroups by the istance on π. Let A i A, i n enote the group of pairs such that = i, (α, β) A i. Then easily we get that A i = (n i). So we have: n S A (π) = i(n i) = (n )n(n + ) 3 Lemma 3. For any imensional SFC π, we have: D avg (π) n (α,β) NN D avg (π) n (α,β) NN Proof: From the efinition D avg (π), we have: D avg (π) = n N(α) = n α U (α,β) NN β N(α) ( N(α) + N(β) ) Note that for each cell α U, N(α). So we have N(α) + N(β). Combining these two inequalities with the equality above yiels our conclusion. Now we are reay to prove Theorem. Proof of Theorem : For every pair (α, β) A, note that the vertex pairs in p(α, β) together form a path from α to β. Thus, we have the following from Lemma (generalize triangle inequality): π (α, β ) () It follows that: (α,β ) p(α,β) (α,β ) p(α,β) π (α, β ) (3) We show in Lemma 4 that for each neighboring pair (α, β ) NN, it appears in the right sie of inequality 3 at most n + times. So we have the following inequality: n + (ζ,η) NN π (ζ, η) (4) Recall that π(α, β) = 3 (n3 n) from Lemma an D avg (π) n (α,β) NN π(α, β) from Lemma 3. So after combining Lemma, Lemma 3 an inequality 4 we get D avg (π) 3 (n n ). Lemma 4. For each neighboring pair (ζ, η) NN, there exist at most n + pairs (α, β) in A such that (ζ, η) p(α, β). Proof: Assume ζ = (ζ,, ζ i,, ζ ), η = (ζ,, ζ i +,, ζ ), i, i.e. (ζ, η) iffers in the i coorinate. Let α = (x,, x ), β = (y,, y ). Accoring to our nearest neighbor ecomposition, we have: p(α, β) = j=0 p(α j, α j+ ) Note that p(α j, α j+ ), 0 j consists of all neighboring pairs which iffers in the (j + ) coorinate. So easily we have that: (ζ, η) p(α, β) (ζ, η) p(α i, α i ) Note that when x i < y i, we have p(α i, α i ) = y i l=x i {((y,..., y i, l, x i+,..., x ), (y,..., y i, l +, x i+,..., x ))}

6 while when x i > y i, we have p(α i, α i ) = x i l=y i {((y,..., y i, l, x i+,..., x ), (y,..., y i, l +, x i+,..., x ))} It follows that (ζ, η) p(α i, α i ) if an only if () the first i coorinates of β must share with the first i coorinates of ζ; () the last i coorinates of α must share with the last i coorinates of ζ; (3) the interval between ζ i an ζ i + must be containe in the interval between x i an y i. The exact mathematical escription is as follows: y j = ζ j, j i (ζ, η) p(α i, α i ) x j = ζ j, i + j (x i ζ i < y i ) (y i ζ i < x i ) So the total number of possible pairs (α, β) such that (ζ, η) p(α, β) shoul be ( n) ζ i ( n ζ i ). Easily we can get the total number is upper boune by n +. B. Performance of Z Curve In this section we will consier the imensional Z curve (see Figure 3). We show that the average-average NN-stretch of the Z curve is within a factor of.5 of the lower boun, an hence within a factor of.5 of the optimal. Recall that n = k. A cell x in the universe U can be represente by coorinates (x, x,..., x ), where for i, 0 x i < k ; thus x i can be represente using a binary string of length k. For j =..., let x j i enote the j-th most significant bit in x i. We recall the efinition of a imensional Z curve ([, 9]). The Z curve is efine by assigning to each cell x U, a key Z(x), which is an integer that enotes the position of a cell in the space filling curve orer. Z(x) is equal to the binary number represente by the string x, x,, x, x, x,, x,, xk, x k,, x k. In other wors, the coorinates in ifferent imensions are interleave together to form the key of the cell. For example, if = 3, k = 3, then Z(,, ) =. Note that ifferent Z curves are possible by taking the imensions in a ifferent orer uring interleaving, but these are all equivalent to the above efinition, at least for the metrics that we consier. f(n) Notation: We write f(n) g(n) if g(n) =. Theorem. If Z is the imensional Z-curve, then D avg (Z) n Our strategy for a proof of the above theorem is as follows. We ivie NN into groups {G i }, i as follows. G i is the set of neighboring pairs (α, β) NN such that α an β iffer in the ith coorinate. Clearly the ifferent G i are all isjoint. Let Λ i (Z) enotes the total sum of istances on Z curve for all pairs in G i, i.e. Λ i (Z) = Z (α, β) (α,β) G i Lemma 5. Λ i (Z) = i n, i Proof: Consier a neighboring pair (α, β) G i. These two iffer only in imension i, an suppose that the coorinates of α an β in imension i are κ an κ+ respectively. First consier the case when the least significant bit of κ is 0. In this case the coorinates of α an β in imension i iffer only in the least significant bit, an their coorinates are equal in all other imensions. By the efinition of the Z curve, Z (α, β) = Z(α) Z(β) = i. The total number of pairs in G i where the least significant bit of κ is 0 is k n. This is because the ith coorinate of α can be chosen in k ways an the other coorinates of α can be chosen in n ways, an these choices are inepenent of each other. Once α is chosen, β is also fixe. For j =... k, let G i,j G i enote the subset of pairs (α, β) G i such that the j least significant bits of κ are, an the jth least significant of κ is 0. i.e. κ has the form (,,..., 0,,, (j ) times, ). In this case, κ+ has the form (,,...,, 0, 0, (j ) times, 0). By the efinition of the Z curve, we get for all (α, β) G i,j : j Z (α, β) = j i l= l i The total number of neighboring pairs in G i,j is k j n. Λ i (Z) = = = = ( ) k j G i,j j i l i j= l= ( ) k j k j n j i l i j= l= k k j n ( j i j i i ) j= k k j n j= ( ) j i + i.

7 Dimension Dimension Figure 3. Two imensional Z curve on an 8 8 gri. On the left is the assignment of keys to cells; within each cell the higher orer bits are on the top an the lower orer bits are at the bottom. On the right is a pictorial representation of the orer. Let Q = Q = k k j n j= j= ( ) j i k ( ) k j n i. Recall that n = k. Thus for Q, we have: Q n For Q, we have: Q n Finally, we get: = n = i i k = n = 0 k j= Λ i (Z) = i n Now we are reay to prove Theorem. Proof of Theorem : ( k i+j( ) ) We partition the set NN into two subsets. Recall that D avg (Z) = n (α,β) NN ( N(α) + ) Z (α, β) N(β) Let H enote the set of pairs {(α, β) NN N(α) = N(β) = } Let H enote the set of pairs {(α, β) NN (N(α) < ) (N(β) < )} We have D avg (Z) = n (h + h ) where an h = (α,β) H h = Z (α, β) (5) (α,β) NN ( N(α) + N(β) ) Z (α, β) (6) We first evaluate h. By the efinition of Λ i, we have: h = Using Lemma 5, we get: h n = Λ i (Z) i = We now consier h. We know for any α U, N(α) (7)

8 . Using this in Equation 6 leas to: h Z (α, β) (α,β) H Note that in every pair (α, β) H, it must be true that for some i, i, either ()the ith coorinate of α is 0 or k, or ()the ith coorinate of β is 0 or k. Suppose for each neighboring pair (α, β) NN, we use (α, β) i to enote the pair of the ith coorinate of α an β. Let K = {(α, β) NN i : (α, β) i = (0, ) (α, β) i = ( k, k )}, K = {(α, β) NN i : (α, β) i = (0, 0) (α, β) i = ( k, k )}. Then we have H = K K while K an K are isjoint. Let Λ(K ), Λ(K ) an Λ(H ) enote the total sum of istances on Z curve for all pairs in Λ(K ), Λ(K ) an Λ(H ) respectively. We first analyze Λ(K ) as follows. Let K,i = {(α, β) NN (α, β) i = (0, ) (α, β) i = ( k, k )}. So we have K = K,i. Note that the least significant bits of 0 an k are both 0. From the efinition of G i, we have K,i G i,, i. So we have Z (α, β) = i, (α, β) K,i, i. Easily we can get that the total number of pairs in K,i shoul be n. So we have: Λ(K ) = Z (α, β) (α,β) K,i = n i = n ( ) Now we turn to Λ(K ). Let K,i = {(α, β) NN (α, β) i = (0, 0) (α, β) i = ( k, k )}. We can analyze K,i in the same way as we analyze NN. What is ifferent here is one of coorinates in the pair is fixe at 0 or k. So we have: It follows that: Λ(K ) n Λ(K,i ) = = 0 Easily we can get that: l=,l i Λ l(z) n Λ(K,i) n n n l= Λ l(z) Λ(K ) = 0 n So we have: Λ(H ) Λ(K ) + Λ(K ) = = 0 n n Dimension Dimension Figure 4. Now we have: A simple curve S on an 8 8 gri. h n h So we have of h an h yiels: n Λ(H ) n = 0 = 0. Combining the itation h + h = n Note that D avg (Z) = n (h + h ). Thus we get D avg (Z) n. C. Performance of a Simple Space-Filling Curve In this section we will show that even a simple curve, the one shown in Figure 4 can have the same performance as the Z-curve. First we iscuss how to construct imensional simple curve filling the gri of size n n. Refer to each cell as a tuple of coorinate (x,..., x ), where for i, 0 x i < n. For each cell α = (x,..., x ), the linear orer assigne by simple curve S(α) can be efine as follows: S(α) = x i ( n) i (8) In the rest sections, once we mention imensional simple curve S, we mean it refers to the specific simple curve efine in Equation 8. Theorem 3. If S is the imensional simple curve, then D avg (S) n

9 Proof: Recall that U enote the set of all n cells. Set U = {(x,..., x ) : x i n, i }, U = {(x,..., x ) : i, x i = 0 or n }. Here U refers to the subset of cells forming a imensional sub gri while U refers to the subset of cells forming the ( )-imensional faces. Obviously, U = U U. For each cell (x,..., x ) in U, it efinitely has nearest neighbors since x i n, i. The total number of cells in U is ( n ). For each cell α U, the average nearest neighbor stretch δ avg S (α) = l=0 ( n) l = n n. Denote the total sum of average nearest neighbor stretch for all cells in U as Λ(U ), i.e, Λ(U ) = α U δ avg S (α). Then we have: Λ(U ) n = = n ( n ) n n The total number of cells in U shoul be n ( n ). For each cell α in U, we have the average nearest neighbor stretch δ avg S (α) n n. Denote the total sum of average nearest neighbor stretch for all cells in U as Λ(U ), i.e, Λ(U ) = α U δ avg S (α). Then we have: Λ(U ) n n ( n ( n ) ) n n = 0 Λ(U So we get that ) = 0. Combining the two n together we have: n (Λ(U ) + Λ(U )) = Note that the average-average nearest neighbor stretch for all cells in U shoul be D avg (S) = n (Λ(U )+Λ(U )). Thus we have D avg (S) n. V. EXTENSIONS In this section, we present results for other variants of the efinitions of the stretch. First, we consier the averagemaximum NN stretch D max. Then we consier the average stretch between all pairs of cells, not just those pairs that are nearest neighbors. We consier two variants of this all pairs stretch, one where the metric in high-imensions is the Manhattan metric, an the other where the metric in higher imensions is the Eucliean metric. For each of these problems, we present upper an lower bouns, to the extent possible. A. Maximum Nearest Neighbor Stretch Recall that the average-maximum nearest neighbor stretch of a cell α is efine as: δ max π (α) = max β N(α) π(α, β) The average-maximum NN-stretch of SFC π is: D max (π) = n α U δmax π (α). We have the following lower boun. Proposition. For any SFC π whose omain is a - imensional universe U, it must be true that: D max (π) 3 (n n ) Proof: The inequality can be easily obtaine from the following fact that for each α U, δ max π (α) δ avg (α). It follows that D max (π) D avg (π). Combining the result in Theorem yiels the inequality above. In the following, we show that the lower boun obtaine above is nontrivial by showing the performance of the simple curve efine in 8 can be n, i.e, the simple curve is optimal up to a factor equal to the number of imensions. Proposition. If S is the imensional simple curve, then π D max (S) = n Proof: Recall that for each cell α = (x,..., x ), the linear orer assigne by simple curve S(α) is efine as follows: S(α) = x i ( n) i (9) For each cell α = (x,..., x ), at least one of the two neighbors, say β = (x,..., x +) an β = (x,..., x ), shoul exist. Note that S (α, β ) = S (α, β ) = n. So we have for each cell α U: δs max (α) = n It follows that D max (S) = n. The above result shoul be compare with the averageaverage NN-stretch of the simple curve (Theorem 3). This shows that the average-maximum stretch is worse that the average-average stretch by a factor. The intuitive explanation for this is that for a vast majority of cells that o not lie on the borer, the istance (along the SFC) to two of the nearest neighbors is large, while the other nearest neighbors are much closer along the SFC. B. All Pairs Stretch In the case of all pairs stretch, the istance between two cells α an β in high imensions can be ifferent epening on whether we use the Eucliean or the Manhattan metric. The average all pairs stretch for π, using the Manhattan metric is: str avg,m (π) = n(n ) (α, β)

10 Let E enote the Eucliean metric on universe U. The average all pairs stretch for π, using the Eucliean metric is: str avg,e (π) = n(n ) E (α, β) In the following, we present a simple, while nontrivial lower boun for str avg,m (π) an str avg,e (π). Lemma 6. For each pair (α, β) A, we have (α, β) ( n ), E (α, β) ( n ) Proof: Obviously we can get that both of (α, β) an E (α, β) can achieve the maximum value at the pair (α, β ) as follows: α = (0,, 0), β = ( n,, n ) So we have for each pair (α, β) A : (α, β) (α, β ) = ( n ) E (α, β) E (α, β ) = ( n ) Proposition 3. For any SFC π whose omain is a - imensional universe U, it must be true that: an str avg,m (π) 3 str avg,e (π) 3 n + n n + n Proof: We first prove for str avg,m (π). From Lemma 6 we have: n(n ) n(n )( n ) From Lemma, we alreay have: (α, β) S A (π) = (n )n(n + ) 3 Note that A is the set of all possible orere pairs in U while A is the set of all possible unorere pairs in U. It follows that: = S A (π) = (n )n(n + ) 6 Combining with the inequality above we have: str avg,m (π) = n(n ) (α, β) n + 3 n Following the same steps we can prove str avg,e (π) 3 n+ n as well. In the following, we show that an upper boun on the performance of the simple curve S efine in 8. Proposition 4. If S is the imensional simple curve, then str avg,m (S) n, str avg,e (S) n Lemma 7. Suppose S is the imensional simple curve. Then we have for each pair (α, β) A S (α, β) (α, β) n, S (α, β) E (α, β) n Proof: Arbitrarily choose a pair of cells, say α = (x,, x ), β = (y,, y ). Then we have: S (α, β) = (x i y i )( n) i, (α, β) = It follows that: S (α, β) (α, β) x i y i (x i y i ) ( n) i x i y i ( n) (x i y i ) x i y i = ( n) Note that: E (α, β) = x i y i It follows that: S (α, β) E (α, β) S(α, β) (α, β) ( n) x i y i = (α, β) Now we are reay to prove Proposition 4. Proof: Since str avg,m (S) an str avg,e (S) is the average all pairs stretch by using Manhattan metric an Eucliean metric respectively, we can get the inequalities in Proposition 4 irectly from the results in Lemma 7. VI. CONCLUSIONS We presente an analysis of the proximity preserving properties of an SFC. We showe that there is an inherent lower boun on the average-average NN-stretch of any SFC. Further, some specific SFCs such as the Z curve come close to matching this boun. This stuy raises a number of interesting open questions. An obvious question is to close the gap between the lower boun an upper boun for the average-average NN-stretch, perhaps via an analysis of a ifferent SFC, or through a better lower boun, or both. A relate

11 question is an analysis of the average NN-stretch of the Hilbert SFC. Next, there is a larger gap between the lower boun an the upper boun for the average-maximum NN-stretch. It woul be interesting to narrow this gap. It woul also be interesting to narrow the gap between upper an lower bouns for the all pairs stretch. Another irection is the analysis of proximity preservation using a more general probabilistic moel of input. REFERENCES [] D. Abel an D. Mark. A comparative analysis of some two-imensional orerings. Int. J. Geogr. Inf. Syst., 4(): 3, 990. [] J. Alber an R. Nieermeier. On multi-imensional hilbert inexings. In Proc. 4th International Conference on Computing an Combinatorics (COCOON), pages , 998. [3] S. Aluru an F. Sevilgen. Parallel omain ecomposition an loa balancing using space-filling curves. In High-Performance Computing, 997. Proceeings. Fourth International Conference on, pages 30 35, 997. [4] W. G. Aref an I. Kamel. On multi-imensional sorting orers. In Database an Expert Systems Applications (DEXA), pages ,. [5] H.-L. Chen an Y.-I. Chang. Neighbor-fining base on space-filling curves. Inf. Syst., 30:05 6, May 005. [6] A. Collins, J. Czyzowicz, L. Gasieniec, an A. Labourel. Tell me where I am so I can meet you sooner. In Proc. ICALP, pages 50 54,. [7] H.-K. Dai an H.-C. Su. On the locality properties of space-filling curves. In Proc. International Symposium on Algorithms an Computation (ISAAC), pages , 003. [8] H.-K. Dai an H.-C. Su. On p-norm base locality measures of space-filling curves. In Proc. International Symposium on Algorithms an Computation (ISAAC), pages , 004. [9] C. Faloutsos. Multiattribute hashing using gray coes. SIGMOD Recor, 5:7 38, June 986. [0] C. Faloutsos. Gray coes for partial match an range queries. IEEE Trans. Software Engg., 4:38 393, 988. [] C. Gotsman an M. Linenbaum. On the metric properties of iscrete space-filling curves. IEEE Transactions on Image Processing, 5(5): , 996. [] L. H. Harper. Optimal assignments of numbers to vertices. Journal of the Society for Inustrial an Applie Mathematics, ():3 35, 964. [3] D. Hilbert. Uber ie stegie abbilung einer linie auf flachenstuck. Math. Ann., 38: , 89. [4] H. V. Jagaish. Analysis of the hilbert curve for representing two-imensional space. Information Processing Letters, 6:7, 997. [5] G. Jin an J. M. Mellor-Crummey. Sfcgen: A framework for efficient generation of multi-imensional space-filling curves by recursion. ACM Transactions Mathematical Software, 3():0 48, 005. [6] Y. Matias an A. Shamir. A vieo scrambling technique base on space filling curves. In CRYPTO, pages , 987. [7] G. Mitchison an R. Durbin. Optimal numberings of an n x n array. SIAM J. Algebraic Discrete Methos, 7:57 58, October 986. [8] B. Moon, H. V. Jagaish, C. Faloutsos, an J. H. Saltz. Analysis of the clustering properties of the hilbert space-filling curve. IEEE Trans. Knowlege an Data Engineering, 3():4 4,. [9] G. Morton. A computer oriente geoetic ata base; an a new technique in file sequencing. Technical report, IBM, 966. [0] R. Nieermeier, K. Reinhart, an P. Saners. Towars optimal locality in mesh-inexings. Discrete Applie Mathematics, 7(-3): 37, 00. [] J. A. Orenstein an T. H. Merrett. A class of ata structures for associative searching. In Proceeings of the 3r ACM SIGACT-SIGMOD symposium on Principles of atabase systems, PODS 84, pages 8 90, 984. [] M. Parashar an J. C. Browne. On partitioning ynamic aaptive gri hierarchies. In Proceeings of the 9th Annual Hawaii International Conference on System Sciences, pages , 996. [3] J. R. Pilkington an S. B. Baen. Dynamic partitioning of non-uniform structure workloas with space filling curves. IEEE Trans. on Parallel an Distribute Systems, 7(3):88 300, 996. [4] H. Samet. The Design an Analysis of Spatial Data Structures. Aison-Wesley, 990. [5] S. Tirthapura, S. Seal, an S. Aluru. A formal analysis of space filling curves for parallel omain ecomposition. In Proc. International Conference on Parallel Processing (ICPP), pages 505 5, 006. [6] M. Warren an J. Salmon. A parallel hashe-octtree N-boy algorithm. In Proceeings of Supercomputing 93, Portlan, OR, Nov. 993.

Lower bounds on Locality Sensitive Hashing

Lower bounds on Locality Sensitive Hashing Lower bouns on Locality Sensitive Hashing Rajeev Motwani Assaf Naor Rina Panigrahy Abstract Given a metric space (X, X ), c 1, r > 0, an p, q [0, 1], a istribution over mappings H : X N is calle a (r,