On the Necessary and Sufficient Conditions of a Meaningful Distance Function for High Dimensional Data Space


On the Necessary and Sufficient Conditions of a Meaningful Distance Function for High Dimensional Data Space

Chih-Ming Hsu (Electrical Engineering Department, National Taiwan University, Taipei, Taiwan. Email: ming@arbor.ee.ntu.edu.tw)
Ming-Syan Chen (Electrical Engineering Department, National Taiwan University, Taipei, Taiwan. Email: mschen@cc.ee.ntu.edu.tw)

Abstract

The use of effective distance functions has been explored for many data mining problems including clustering, nearest neighbor search, and indexing. Recent research results show that if the Pearson variation of the distance distribution converges to zero with increasing dimensionality, the distance function will become unstable (or meaningless) in high dimensional space, even with the commonly used L_p metric on the Euclidean space. This result has spawned many subsequent studies. We first comment that although the prior work provided the sufficient condition for the unstability of a distance function, the corresponding proof has some defects. Also, the necessary condition for unstability (i.e., the negation of the sufficient condition for the stability) of a distance function, which is required for function design, remains unknown. Consequently, we first provide in this paper a general proof for the sufficient condition of unstability. More importantly, we go further to prove that the rapid degradation of the Pearson variation of a distance distribution is in fact a necessary condition of the resulting unstability. With this result, we will then have the necessary and the sufficient conditions for unstability, which in turn imply the sufficient and necessary conditions for stability. This theoretical result leads to a powerful means to design a meaningful distance function. Explicitly, in light of our results, we design in this paper a meaningful distance function, called Shrinkage-Divergence Proximity (abbreviated as SDP), based on a given distance function. It is empirically shown that the SDP significantly outperforms prior measures for its being stable in high dimensional data space and robust to noise, and is thus deemed more suitable for distance-based clustering applications than the priorly used L_p metric.

1 Introduction

The curse of dimensionality has recently been studied extensively on several data mining problems such as clustering, nearest neighbor search, and indexing. The curse of high dimensionality is critical not only with regard to the performance issue but also to the quality issue. Specifically, on the quality issue, the design of effective distance functions has been deemed a very important and challenging issue. Recent research results showed that in high dimensional space, the concept of distance (or proximity) may not even be qualitatively meaningful [1][2][3][5][6][11]. Explicitly, the theorem in [6] showed that under a broad set of conditions, in high dimensional space, the distance to the nearest data point approaches the distance to the farthest data point of a given query point with increasing dimensionality. For example, under the independent and identically distributed dimensions assumption, the commonly used L_p metrics will encounter problems in high dimensionality. This theorem has spawned many subsequent studies along the same line [1][2][3][11][13]. The scenario is shown in Figure 1, where ε denotes a very small number. From the query point, the ratio of the distance to the nearest neighbor to that to the farthest neighbor is almost 1.
This phenomenon is called the unstable phenomenon [6] because there is poor discrimination between the nearest and farthest neighbors for a proximity query. As such, the nearest neighbor problem becomes poorly defined. Moreover, most indexing structures will have a rapid degradation with increasing dimensionality, which leads to an access to the entire database for any query [3]. Similar issues are encountered by distance-based clustering algorithms and classification algorithms that model the proximity for grouping data points into meaningful subclasses. In this paper, a distance function which will result in this unstable phenomenon is referred to as a meaningless function in high dimensional space, and is called meaningful otherwise. The result in [6] suggested that the design of a meaningful distance function in high dimensional space is a very important problem and has a significant impact on a wide variety of data mining applications. In [6], it is proved that the sufficient condition of unstability is that the Pearson variation of the corresponding distance distribution degrades to 0 with increasing dimensionality. For example, if, under the independent and identically distributed dimensions assumption and with the use of an L_p metric, the Pearson variation of the distance distribution rapidly degrades to 0 with increasing dimensionality in high dimensional space, then the unstable phenomenon occurs. (This distance function is hence called meaningless.)

Figure 1: An example of the unstable phenomenon: from the query point, DMAX ≤ (1 + ε) DMIN.

Figure 2: Example minimum and maximum of functions which are not continuous.

Note that in light of the equivalence between p → q and ¬q → ¬p, the negation of the above sufficient condition for unstability is equivalent to the necessary condition of stability (where we have a meaningful distance function). However, the sufficient condition for stability remains unknown. In fact, the important issue is how to design a meaningful distance (or proximity) function for high dimensional data space. The authors in [1] provided some practical desiderata for constructing a meaningful distance function, including (1) contrasting, (2) statistically sensitive, (3) skew magnification, and (4) compactness. These properties are in essence design guidelines for a meaningful distance function. However, we have no guarantee that a distance function which satisfies those needs will avoid the unstable phenomenon (since these properties are not sufficient conditions for stability). Consequently, neither the result in [6] nor that in [1] provides the necessary condition for unstability which is required for us to design a meaningful distance function in high dimensional space. The design of a meaningful distance (or proximity) function hence remains an open problem. This is the issue we shall solve in this paper.

We first comment that although the work in [6] provided the sufficient condition for unstability, the corresponding proof has some defects: [6] used the property that the minimum and maximum are continuous functions to deduce the sufficient condition for unstability, which, however, does not always hold. For example, consider the scenario of two functions f and g, as shown in Figure 2, where both the minimum and the maximum of f and g are discontinuous. In general, the minimum and maximum functions of a random sequence are not continuous, especially for discontinuous random variables [8]. To remedy this, we will first provide in this paper a general proof for the sufficient condition of unstability. More importantly, we go further to prove that the rapid degradation of the Pearson variation of a distance distribution is a necessary condition of the resulting unstability. With this result, we will then have the necessary and sufficient conditions for unstability, whose negations in turn imply the sufficient and necessary conditions for stability. Note that with the sufficient condition for stability, which was unsolved in prior works and is first derived in this paper, one will then be able to design a meaningful distance function for high dimensional space. Explicitly, this new result means that a distance function whose Pearson variation does not approach zero rapidly will be guaranteed to be meaningful (i.e., stable) in high dimensional space. The estimation of this variation is thus a guideline for testing whether a distance function is unstable or not. As such, this theoretical analysis leads to a powerful means to design a meaningful distance function. Explicitly, in light of our results, we design in this paper a meaningful distance function, called Shrinkage-Divergence Proximity (abbreviated as SDP), based on a given distance function. Specifically, the SDP defines an adaptive proximity function of two data points on individual attributes separately.
The proximity of two points is the aggregation of the per-attribute proximities. SDP magnifies the variation of the distance to detect and avoid the unstable phenomenon attribute by attribute. For each attribute of two data points, we shrink the proximity on this attribute to zero if the projected attribute values of the two data points fall into a small interval; if they are more similar to each other than to others on an attribute, we are not able to significantly discern among them statistically. On the other hand, if some projected attributes of two data points are far apart from one another over a long original distance, then the points are viewed as dissimilar to each other. Therefore, we will be able to spread them out to increase the degree of discrimination.

Note that since we define the proximity between two data points separately on individual attributes, the noise effects of some attributes will be mitigated by other attributes. This accounts for the reason that SDP is robust to noise in our experiments.

The contributions of this paper are twofold. First, as a theoretical foundation, we provide and prove the necessary and sufficient conditions of the unstable phenomenon in high dimensional space. Note that the negation of the necessary condition of unstability is in essence the sufficient condition of stability, which provides an innovative and effective guideline for the design of a meaningful (i.e., dimensionality resistant) distance function in high dimensional space. Second, in light of the theoretical results derived, we develop a new dimensionality resistant proximity function, SDP. It is empirically shown that the SDP significantly outperforms prior measures for its being stable in high dimensional data space and robust to noise, and is thus deemed more suitable for distance-based clustering applications than the commonly used L_p metric.

The rest of the paper is organized as follows. Section 2 describes related works. Section 3 provides the theoretical results of our work, where the necessary and sufficient conditions for the unstable phenomenon are derived. In Section 4, we devise a meaningful distance function, SDP. Experimental results are presented in Section 5. This paper concludes with Section 6.

2 Related Works

The use of effective distance functions has been explored in several data mining problems, including nearest neighbor search, indexing, and so on. As described earlier, the work in [6] showed that under a broad set of conditions the neighbor queries become unstable in high dimensional spaces. That is, from a given query point, the distance to the nearest data point will approach that to the farthest data point in high dimensional space. For example, under the commonly used assumption that each dimension is independent, the L_p metric will be unstable for many high dimensional data spaces. For a constant p (p ≥ 1), the L_p metric for two m-dimensional data points x_m = (x_1, x_2, ..., x_m) and y_m = (y_1, y_2, ..., y_m) is defined as L_p(x_m, y_m) = (Σ_{i=1}^m |x_i − y_i|^p)^{1/p}. The result in [6] has spawned many subsequent studies along this direction. [11] and [2] specifically examined the behavior of the L_p metric and showed that the problem of stability (i.e., meaningfulness) in high dimensionality is sensitive to the value of p. A property of L_p presented in [11] is that the value of the extremal difference DMAX_m − DMIN_m grows as m^{1/p − 1/2} with increasing dimensionality, where DMAX_m and DMIN_m are the distances to the farthest point and to the nearest point from the origin, respectively. As a result, the L_1 metric is the only metric of the L_p family for which the absolute difference between the nearest and farthest neighbors increases with the dimensionality. For the L_2 metric, DMAX_m − DMIN_m converges to a constant, and for the distance metrics L_k with k ≥ 3, DMAX_m − DMIN_m converges to zero with dimensionality. This means that the L_1 metric is preferable to the L_2 metric for high dimensional data mining applications. In [2], the authors also extended the notion of an L_p metric to a fractional distance function, where a fractional distance function dist_l for l ∈ (0, 1) is defined as dist_l(x_m, y_m) = (Σ_{i=1}^m |x_i − y_i|^l)^{1/l}. In [3], the authors proposed the IGrid-index, which is a method for similarity indexing. The IGrid-index uses a grid-based approach to redesign the similarity function from L_p.
In order to perform the proximity thresholding, the IGrid-index method discretizes the data space into k_d equi-depth ranges. Specifically, R[i, j] denotes the jth range for dimension i. For dimension i, if both x_i and y_i belong to the same range R[i, j], then the two data points are said to be in proximity on dimension i. Let S[x_m, y_m, k_d] be the proximity set for two data points x_m and y_m with a given level of discretization k_d; then the similarity between x_m and y_m is given by

IDist(x_m, y_m, k_d) = [Σ_{i ∈ S[x_m, y_m, k_d]} (1 − |x_i − y_i| / (m_i − n_i))^p]^{1/p},

where m_i and n_i are the upper and lower bounds of the corresponding range in the dimension i in which the data points x_m and y_m are in proximity to one another. Note that these results, while being valuable from various perspectives, do not provide the sufficient condition for a meaningful distance function that can be used for distance function design in high dimensional data space.
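As a concrete reference for the distance functions discussed above, the following minimal Python sketch (ours, not part of the original paper; names and parameter values are illustrative) implements the L_p metric and the fractional distance function dist_l of [2].

```python
import numpy as np

def lp_distance(x, y, p):
    """L_p metric for p >= 1: (sum_i |x_i - y_i|^p)^(1/p)."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def fractional_distance(x, y, l):
    """Fractional distance function of [2] for l in (0, 1)."""
    return np.sum(np.abs(x - y) ** l) ** (1.0 / l)

# A quick comparison on 1000-dimensional random points.
rng = np.random.default_rng(0)
x, y = rng.random(1000), rng.random(1000)
print(lp_distance(x, y, 1.0), lp_distance(x, y, 2.0), fractional_distance(x, y, 0.5))
```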

3 On a Meaningful Distance Function

In this section, we shall derive theoretical properties for a meaningful distance function. Preliminaries are given in Section 3.1. In Section 3.2, we first provide a new proof for the sufficient condition of unstability in Theorem 3.1 (since, as pointed out earlier, the proof in [6] has some defects). Then, we derive the necessary condition for unstability (i.e., the negation of the sufficient condition for stability) in Theorem 3.2. We state the complete results (the necessary and sufficient conditions for unstability) in Theorem 3.3. For better readability, we put the proofs of all theorems in Section 3.3. Remarks on the valid indices for a distance function to be meaningful are made in Section 3.4.

3.1 Definitions

We now introduce several terms to facilitate our presentation. Assume that d_m is a real-valued distance function (see Footnote 1) defined on a certain m-dimensional space, and that d_m is well defined as m increases. For example, the L_p metric defined on the m-dimensional Euclidean space is well defined as m increases for any p in (0, ∞). Let P_{m,i} (i = 1, 2, ..., N) be N independent data points sampled from some m-variate distribution F_m, where F_m is also well defined on some sample space as m increases. Note that we do not assume that the attributes of the data space are independent. (Essentially, many of the attributes are correlated with one another [1].) Let Q_m be an arbitrary (m-dimensional) query point chosen independently from all P_{m,i}. Let DMAX_m = max{d_m(P_{m,i}, Q_m) | 1 ≤ i ≤ N} and DMIN_m = min{d_m(P_{m,i}, Q_m) | 1 ≤ i ≤ N}. Hence DMAX_m and DMIN_m are random variables for any m.

Footnote 1: In this paper, the distance function need not be a metric. A nonnegative function d: X × X → R is a metric for a data space X if it satisfies the following properties: (1) d(x, y) ≥ 0 for all x, y ∈ X; (2) d(x, y) = 0 if and only if x = y; (3) d(x, y) = d(y, x) for all x, y ∈ X; and (4) d(x, z) + d(z, y) ≥ d(x, y) for all x, y, z ∈ X.

Definition 3.1. A family of well defined distance functions {d_m | m = 1, 2, ...} is δ-unstable (or δ-meaningless) if δ = sup{δ' | lim_{m→∞} P{DMAX_m ≤ (1 + ε) DMIN_m} ≥ δ' for any ε > 0}.

For ease of exposition, we also refer to 1-unstable as unstable (or with a meaningless distance function). As δ approaches 1, a small relative change in any query point in a direction away from the nearest neighbor could change that point into the farthest neighbor. For the purpose of this study, we shall explore the extremal case, the unstable phenomenon (i.e., δ = 1). A distance function is called stable if it is not unstable (i.e., δ < 1). A list of symbols used is given in Table 1.

Table 1: List of Symbols
m — Dimensionality of the data space
N — Number of data points
P{e} — Probability of an event e
X, Y, X_m, Y_m — Random variables defined on some probability space
E[X] — Expectation of a random variable X
var(X) — Variance of a random variable X
iid — Independent and identically distributed
d_m — A distance function of an m-dimensional data space (m = 1, 2, ...)
x ← F — x is a random sample point from the distribution F
X_n →_P X — A sequence of random variables X_1, X_2, ... converges in probability to a random variable X if for every ε > 0, lim_{n→∞} P{|X_n − X| ≤ ε} = 1

3.2 Theoretical Properties of Unstability

From probability theory, the unstable phenomenon is equivalent to the case that DMAX_m/DMIN_m converges in probability to one as m increases. The work in [6] proved that under the condition that the Pearson variation var(d_m(P_{m,1}, Q_m)/E[d_m(P_{m,1}, Q_m)]) of the distance distribution converges to 0 with increasing dimensionality, the extremal ratio DMAX_m/DMIN_m will also converge to one with increasing dimensionality. Formally, we have the following theorem. Recall that this theorem was rendered in [6]; we mainly provide a correct and more general proof in this paper.

Theorem 3.1. (Sufficient condition of unstability [6]) Let p be a constant (0 < p < ∞). If lim_{m→∞} var(d_m(P_{m,1}, Q_m)^p / E[d_m(P_{m,1}, Q_m)^p]) = 0, then for every ε > 0, lim_{m→∞} P{DMAX_m ≤ (1 + ε) DMIN_m} = 1.

From Theorem 3.1, a distance function is unstable in high dimensional space if the Pearson variation of its distance distribution approaches 0 with increasing dimensionality. This phenomenon leads to poor discrimination between the nearest and farthest neighbors for a proximity query in high dimensional space.
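The unstable phenomenon of Theorem 3.1 can be reproduced numerically. The sketch below (our illustration with arbitrary sample sizes, not an experiment from the paper) draws N iid uniform points and an independent query point, and prints the extremal ratio DMAX_m/DMIN_m under the L_2 metric; the ratio drifts toward 1 as m grows.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1000  # number of data points

for m in (2, 10, 100, 1000, 10000):
    pts = rng.random((N, m))             # N iid points with U(0,1) coordinates
    q = rng.random(m)                    # independent query point
    d = np.linalg.norm(pts - q, axis=1)  # L_2 distances to the query
    print(m, d.max() / d.min())          # extremal ratio; approaches 1
```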
Note that, as mentioned in [6], the condition of Theorem 3.1 is applicable to a variety of data mining applications.

Example 1. Suppose that we have data whose distribution and query are iid from some distribution with finite fourth moments in all dimensions [6]. Let P_{m,ij} and Q_{m,j} be, respectively, the jth attribute of the ith data point and the jth attribute of the query point. Hence, the (P_{m,ij} − Q_{m,j})^2, j = 1, 2, ..., are iid with some expectation µ and some nonnegative variance σ^2 that are the same regardless of the values of i and m. If we use the L_2 metric for the proximity query, then d_m(P_{m,i}, Q_m) = (Σ_{j=1}^m (P_{m,ij} − Q_{m,j})^2)^{1/2}. Under the iid assumptions, we have E[d_m(P_{m,i}, Q_m)^2] = mµ and var(d_m(P_{m,i}, Q_m)^2) = mσ^2. Therefore, the Pearson variation var(d_m(P_{m,1}, Q_m)^2 / E[d_m(P_{m,1}, Q_m)^2]) = mσ^2/(mµ)^2 = σ^2/(mµ^2) converges to 0 with increasing dimensionality.

5 variation var d,1,q)2 E[d,1,Q ) 2 ] ) = σ2 /µ 2 converges to with increasing diensionality. Hence, the corresponding is eaningless under such a data space. In general, if the data distribution and query are iid fro soe distribution with finite 2pth oents in all diensions, the L p etric is eaningless for all 1 p < [6]. We next prove in Theore 3.2 that the condition for the earson variation of the distance distribution to any given target to converge to with increasing diensionality is not only the sufficient condition as stated in Theore 3.1) but also the necessary condition of unstability of a distance function in high diensional space. Theore 3.2. Necessary condition of unstability) If li { 1 + ɛ)dmin } = 1 for every ɛ >, then li var d,1, Q ) p E[d,1, Q ) p ) = < p < ). ] With Theore 3.2, we will then have the necessary and the sufficient conditions for unstability, whose negatives in turn iply the sufficient and necessary conditions for stability. The stateent that the earson variation of the distance distribution to any given target should converge to with increasing diensionality is equivalent to the unstable phenoenon. Following Theores 3.1 and 3.2, we reach Theore 3.3 below which provides a theoretical guideline for designing diensionality resistant distance functions. Theore 3.3. Main Theore) Let p be a constant < p < ). For every ɛ >, li { 1 + ɛ)dmin } = 1 if and only if li var d,1, Q ) p E[d,1, Q ) p ] ) =. Exaple 2. Theore 3.3 shows that we have to increase the variation of distance distribution for redesigning a diensionality resistant distance function. Assue that the data points and query are iid fro soe distribution in all diension, and E[d,i, Q ) p ] = µ, var D,1, Q ) p ) = σ 2 for soe constant µ and σ> ), as in Exaple 1. Since the exponential function e x for x ) is strictly convex, hence the Jensen s inequality [4] [15] suggests that the variation of distance distribution can be agnified by applying the exponential function. Let fx) = e x for x. We obtain soe interesting results by applying f to soe well-known distance distributions, as shown in Table 2. The above results show that the transforations of distance functions by the exponential function can reedy the eaningless behaviors on soe high diensional data spaces, especially for those cases that distance distributions have long tails. However, in such case, the stable property of distance functions ay not be useful for application needs. In addition, the large variation akes data sparsely, and then the concept of proxiity will also be difficult to visualize. Fro our experiental results, it is shown that the earson variations, whose distance functions are translated fro L 1 etric or etric by exponential function f, will start to diverge even when encountering only 1 diensions on any data spaces. We ake soe rearks on the valid indices for a distance function to be eaningful in Section roof of Main Theore In order to prove our ain theore, we need to drive soe properties of unstable phenoenon. For interest of space, those properties and soe iportant theores of probability are presented in Appendix A. roof for Theore 3.1: d Let V,i =,i,q ) p E[d,i,Q ) p ] and X = V,1, V,2,, V,N ). Hence, V,i i = 1, 2,, N are non-negative iid rando variables for all and V,i 1 as. 
3.3 Proof of Main Theorem

In order to prove our main theorem, we need to derive some properties of the unstable phenomenon. In the interest of space, those properties and some important theorems of probability are presented in Appendix A.

Proof of Theorem 3.1: Let V_{m,i} = d_m(P_{m,i}, Q_m)^p / E[d_m(P_{m,i}, Q_m)^p] and X_m = (V_{m,1}, V_{m,2}, ..., V_{m,N}). Hence, the V_{m,i} (i = 1, 2, ..., N) are non-negative iid random variables for all m, and V_{m,i} →_P 1 as m → ∞ (since E[V_{m,i}] = 1 and, by the hypothesis, var(V_{m,i}) → 0). Since for any ε > 0,

lim_{m→∞} P{|max(X_m) − 1| ≥ ε}
= lim_{m→∞} P{max(X_m) ≥ 1 + ε} + lim_{m→∞} P{max(X_m) ≤ 1 − ε}
= lim_{m→∞} (1 − P{V_{m,j} < 1 + ε, j = 1, 2, ..., N}) + lim_{m→∞} P{V_{m,j} ≤ 1 − ε, j = 1, 2, ..., N}  (by Theorem A.4)
= lim_{m→∞} (1 − Π_{j=1}^N P{V_{m,j} < 1 + ε}) + lim_{m→∞} Π_{j=1}^N P{V_{m,j} ≤ 1 − ε}  (since the V_{m,j}, j = 1, 2, ..., N, are iid, and by Theorem A.4)
= 0  (by V_{m,j} →_P 1 as m → ∞),

and

lim_{m→∞} P{|min(X_m) − 1| ≥ ε}
= lim_{m→∞} P{min(X_m) ≥ 1 + ε} + lim_{m→∞} P{min(X_m) ≤ 1 − ε}
= lim_{m→∞} P{V_{m,j} ≥ 1 + ε, j = 1, 2, ..., N} + lim_{m→∞} (1 − P{V_{m,j} > 1 − ε, j = 1, 2, ..., N})  (by Theorem A.4)
= lim_{m→∞} Π_{j=1}^N P{V_{m,j} ≥ 1 + ε} + lim_{m→∞} (1 − Π_{j=1}^N P{V_{m,j} > 1 − ε})  (since the V_{m,j}, j = 1, 2, ..., N, are iid, and by Theorem A.4)
= 0  (by V_{m,j} →_P 1 as m → ∞),

then max(X_m) and min(X_m) converge in probability to 1 as m → ∞. Further, by Slutsky's theorem, proposition 2 of Theorem A.1, and

DMAX_m / DMIN_m = (E[d_m(P_{m,i}, Q_m)^p] max(X_m))^{1/p} / (E[d_m(P_{m,i}, Q_m)^p] min(X_m))^{1/p} = (max(X_m) / min(X_m))^{1/p},

we conclude that DMAX_m/DMIN_m →_P 1 as m → ∞. Q.E.D.

Proof of Theorem 3.2: Let W_{m,i} = d_m(P_{m,i}, Q_m)^p. By the third property of Lemma A.1 and the second property of Theorem A.1, we have W_{m,i}/W_{m,j} →_P 1 for any i, j. Hence, by an application of Theorem A.3 and the fourth property of Lemma A.1, we have

var(W_{m,i}/W_{m,j}) = E[var(W_{m,i}/W_{m,j} | W_{m,j})] + var(E[W_{m,i}/W_{m,j} | W_{m,j}])

and lim_{m→∞} var(W_{m,i}/W_{m,j}) = 0. Since E[var(W_{m,i}/W_{m,j} | W_{m,j})] and var(E[W_{m,i}/W_{m,j} | W_{m,j}]) are nonnegative for all m, we have, respectively,

(3.1) lim_{m→∞} E[var(W_{m,i}/W_{m,j} | W_{m,j})] = 0, and
(3.2) lim_{m→∞} var(E[W_{m,i}/W_{m,j} | W_{m,j}]) = 0.

Also, var(W_{m,i}/W_{m,j} | W_{m,j} = x) ≥ 0 for all x and for all m; therefore Equation (3.1) implies that the probability of the set {x | var(W_{m,i}/W_{m,j} | W_{m,j} = x) > 0} must be 0, or var(W_{m,i}/W_{m,j} | W_{m,j} = x) → 0 for all x, as m → ∞. Set x = E[d_m(P_{m,i}, Q_m)^p]; hence we have lim_{m→∞} var(d_m(P_{m,i}, Q_m)^p / E[d_m(P_{m,i}, Q_m)^p]) = 0 for any i. Then lim_{m→∞} var(d_m(P_{m,1}, Q_m)^p / E[d_m(P_{m,1}, Q_m)^p]) = 0. (Note that the W_{m,i}, i = 1, 2, ..., are iid for all m.) Q.E.D.

The main theorem, i.e., Theorem 3.3, thus follows from Theorems 3.1 and 3.2.

3.4 Remarks on Indices to Test Meaningful Functions

Many researchers used the extremal ratio DMAX_m/DMIN_m [6] or the relative contrast (DMAX_m − DMIN_m)/DMIN_m [2] to test the stable phenomenon of distance functions. However, as shown by our experimental results, these indices are sensitive to outliers and are found to be inconsistent in many cases. Here, we will comment on the reason that the Pearson variation is a valid index to evaluate the stable phenomenon of distance functions. In light of Theorem 3.3, we have the relationships shown in Figure 3, which lead to the following four conceivable indices to test the stable (or unstable) phenomenon:

Index 1. (Extremal ratio) DMAX_m/DMIN_m;
Index 2. (Pearson variation) var(d_m(P_{m,1}, Q_m) / E[d_m(P_{m,1}, Q_m)]);
Index 3. (Mean of extremal ratio) E[DMAX_m/DMIN_m];
Index 4. (Variance of extremal ratio) var(DMAX_m/DMIN_m).

A robust meaningful distance function needs to satisfy lim_{m→∞} var(d_m(P_{m,1}, Q_m)/E[d_m(P_{m,1}, Q_m)]) > 0 independently of the distribution of the data set. Suppose that we use those indices to evaluate the meaningless behavior, for some given distance functions, over all possible data sets. If lim_{m→∞} DMAX_m/DMIN_m = 1 (see Footnote 2) or lim_{m→∞} var(d_m(P_{m,1}, Q_m)/E[d_m(P_{m,1}, Q_m)]) = 0, we can conclude that this distance function is meaningless. On the other hand, we can say that it is stable if lim_{m→∞} var(d_m(P_{m,1}, Q_m)/E[d_m(P_{m,1}, Q_m)]) > 0, lim_{m→∞} E[DMAX_m/DMIN_m] ≠ 1, or lim_{m→∞} var(DMAX_m/DMIN_m) > 0.

Footnote 2: Note that DMAX_m and DMIN_m are random variables. Therefore, this convergence is much stronger than convergence in probability [4][7][14][15].
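Under the above definitions, the four indices can be estimated from a data set by resampling query points. The following sketch is ours; the hold-one-out scheme, names, and trial count are illustrative assumptions rather than the paper's procedure.

```python
import numpy as np

def stability_indices(data, dist, trials=100, seed=0):
    """Estimate Indices 1-4 by repeatedly holding out a random query point."""
    rng = np.random.default_rng(seed)
    n = len(data)
    ratios, variations = [], []
    for _ in range(trials):
        qi = rng.integers(n)
        others = data[np.arange(n) != qi]
        d = np.array([dist(p, data[qi]) for p in others])
        ratios.append(d.max() / d.min())            # extremal ratio for this query
        variations.append(d.var() / d.mean() ** 2)  # Pearson variation for this query
    return {
        "index1_extremal_ratio": ratios[0],                      # a single realization
        "index2_pearson_variation": float(np.mean(variations)),
        "index3_mean_extremal_ratio": float(np.mean(ratios)),
        "index4_var_extremal_ratio": float(np.var(ratios)),
    }

# Example: indices for the L_2 metric on 100-dimensional uniform data.
rng = np.random.default_rng(3)
print(stability_indices(rng.random((500, 100)), lambda a, b: np.linalg.norm(a - b)))
```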

Figure 3: The convergence relationships of the extremal ratio: the unstable phenomenon (lim_{m→∞} P{DMAX_m ≤ (1 + ε) DMIN_m} = 1) is equivalent to lim_{m→∞} var(d_m(P_{m,1}, Q_m)^p / E[d_m(P_{m,1}, Q_m)^p]) = 0, and implies both lim_{m→∞} E[DMAX_m/DMIN_m] = 1 and lim_{m→∞} var(DMAX_m/DMIN_m) = 0.

If we decide to apply some distance-based data mining algorithm to explore a given high dimensional data set, we first need to select a meaningful distance function depending on our application. Therefore, we may compute those indices for each candidate distance function, using the whole data set or a sampled subset, to evaluate its meaningful behavior. However, both the mean of the extremal ratio (Index 3) and the variance of the extremal ratio (Index 4) are invalid to estimate in this case. Also, though one can deduce that the distance function is meaningless if its DMAX_m/DMIN_m value is very close to one, it cannot be decided whether the function is meaningful or not if its value of the extremal ratio (Index 1) is apart from one. In addition, the extremal ratio (Index 1) is sensitive to outliers. On the other hand, if we apply some resampling technique, such as the bootstrap [10], to test the stable property of distance functions for some given high dimensional data set, the extremal ratio (Index 1) could be inconsistent over many sampled subsets. Consequently, in light of the theoretical results derived and also as will be validated, the Pearson variation (Index 2) emerges as the index to use to evaluate the meaningfulness of a distance function.

4 Shrinkage-Divergence Proximity

As deduced before, the unstable phenomenon is rooted in the variation of the distance distribution. In light of Theorem 3.3, we will next design a new proximity function, called SDP (Shrinkage-Divergence Proximity), based on some well defined family of distance functions {d_m | m = 1, 2, ...}.

4.1 Definition of SDP

Let f be a non-negative real function defined on the set of non-negative real numbers such that

f_{a,b}(x) = 0 if x < a;  f_{a,b}(x) = x if a ≤ x < b;  f_{a,b}(x) = e^x otherwise.

For any m-dimensional data points x_m = (x_1, x_2, ..., x_m) and y_m = (y_1, y_2, ..., y_m), we define the SDP function as

SDP_G(x_m, y_m) = Σ_{i=1}^m w_i f_{s_{i1},s_{i2}}(d_1(x_i, y_i)).

The general form of SDP, denoted by SDP_G, defines a distance function f_{s_{i1},s_{i2}} between x_m and y_m on each individual attribute. The parameters w_i (i = 1, 2, ..., m) are determined by domain knowledge subject to the importance of attribute i for application needs. Also, the parameters s_{i1} and s_{i2} for attribute i depend on the distribution of the data points projected on the ith dimension. In many situations, we have no prior knowledge on the weights of importance among attributes, and do not know the distribution of each attribute either. In such a case, the general SDP function, i.e., SDP_G, degenerates to

SDP_{s_1,s_2}(x_m, y_m) = Σ_{i=1}^m f_{s_1,s_2}(d_1(x_i, y_i)).

In this paper, we only discuss the properties and applications of this SDP. The parameters s_1 and s_2 are called, respectively, the shrinkage threshold and the divergence threshold. For illustrative purposes, we present in Figure 4 some 2-dimensional balls of center (0, 0) with different radii for SDP, the fractional function, the L_1 metric, and the L_2 metric.
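The piecewise map f_{s_1,s_2} and the degenerate form SDP_{s_1,s_2} translate directly into code. The following minimal sketch is ours, not the authors' implementation; it takes d_1 to be the absolute difference on each attribute.

```python
import numpy as np

def f_shrink_diverge(x, s1, s2):
    """Piecewise map f_{s1,s2}: 0 below the shrinkage threshold s1, the identity
    on [s1, s2), and exponential divergence e^x at or above s2."""
    x = np.asarray(x, dtype=float)
    return np.where(x < s1, 0.0, np.where(x < s2, x, np.exp(x)))

def sdp(x, y, s1=0.5, s2=0.6):
    """Degenerate SDP_{s1,s2}: the sum of f_{s1,s2}(d_1(x_i, y_i)) over attributes,
    with d_1 taken as the absolute difference on each attribute."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(f_shrink_diverge(np.abs(x - y), s1, s2).sum())

# Attributes closer than s1 contribute nothing; attributes farther than s2 diverge.
print(sdp([0.10, 0.55, 3.0], [0.15, 0.0, 0.0]))  # 0 + 0.55 + e^3, about 20.64
```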

4.2 Properties of SDP

We next discuss the properties of SDP for similarity search and data clustering problems.

Figure 4: The 2-dimensional balls of center (0, 0) for different radii: (a) SDP (s_1 = 0.5, s_2 = 0.6); (b) the fractional function; (c) the L_1 metric; (d) the L_2 metric.

Proposition of SDP.
1. If d_1 is the L_1 metric defined on the Euclidean space, then SDP_{s_1,s_2} is equivalent to L_1 as s_1 → 0 and s_2 → ∞.
2. SDP_{s_1,s_2}(x_m, y_m) = 0 if and only if d_1(x_i, y_i) < s_1 for all i.
3. SDP_{s_1,s_2}(x_m, y_m) ≥ m e^{s_2} if d_1(x_i, y_i) ≥ s_2 for all i.

The first property shows that SDP is a generalized form of the L_1 metric. The second property means that the SDP is similar to grid approaches [3] in that all data points within a small rectangle are more similar to each other than to others; thus, we cannot significantly discern among them statistically, and it is therefore reasonable to shrink their proximity to zero. In order to construct a noise insensitive proximity function, and to avoid over-magnifying the distance variation, the SDP defines an adaptive proximity of two data points on individual attributes. For two data points, if the values of an attribute are projected into the same small interval, we shrink the proximity of this attribute to zero. On the other hand, if all projected attributes of two data points are far apart from one another over a long original distance, then the two points are dissimilar to each other, and we are able to spread them out to increase discrimination. The SDP can remedy the edge effects problem of the grid approach [3], which is caused by two adjacent grids that may contain data points very close to one another. It is worth mentioning that, the same as the fractional function [2] and the IDist function of the IGrid-index [3], the SDP function is in essence not a metric with the triangle inequality. It can be verified that SDP_{s_1,s_2}(x_m, y_m) = 0 does not imply x_m = y_m for s_1 > 0, and the triangle inequality does not hold in general. However, the influence of the triangle inequality is usually insignificant in many clustering applications [9][12], in particular for high dimensional space.

Statistical View. Here, we examine the perspectives of SDP for clustering applications statistically. Assume that the data distributions are independent in all dimensions of attributes. Given a small nonnegative number ε, let s_{i1} be the maximum value (or supremum) such that P{d_1(x_i, y_i) ≤ s_{i1}} ≤ ε if x_m and y_m belong to distinct clusters. Similarly, let s_{i2} be the minimum value (or infimum) such that P{d_1(x_i, y_i) ≥ s_{i2}} ≤ ε if x_m and y_m belong to the same cluster. Set s_1 = min{s_{i1} | i = 1, 2, ..., m} and s_2 = max{s_{i2} | i = 1, 2, ..., m}. Then, both

P{SDP_{s_1,s_2}(x_m, y_m) = 0 | x_m and y_m belong to distinct clusters} (≤ ε^m) and
P{SDP_{s_1,s_2}(x_m, y_m) ≥ m e^{s_2} | x_m and y_m belong to the same cluster} (≤ ε^m)

will approach 0 as m is large. Furthermore, SDP is insensitive to noise in high dimensional space, because SDP treats individual attributes separately; the noise effects of some attributes will be mitigated by the other attributes. Also, the SDP is able to avoid spreading the data points too sparsely for discrimination. Overall, SDP has better discrimination power than the original distance function, and is hence more proper for distance-based clustering algorithms.

5 Experimental Evaluation

To assess the stableness and the performance of the SDP function, we have conducted a series of experiments. We compare in our study the stable behaviors and the performances for distance-based clustering of SDP with several well-known distance functions, including the L_1 metric, the L_2 metric, and the fractional distance function dist_l.

Meaningful Behaviors. First we compared the stable behavior of SDP, which is based on the L_1 metric, with several widely used distance functions.
We use the Pearson variation (Index 2): var(d_m(P_{m,1}, Q_m)/E[d_m(P_{m,1}, Q_m)]), the mean of the extremal ratio (Index 3): E[DMAX_m/DMIN_m], and the variance of the extremal ratio (Index 4): var(DMAX_m/DMIN_m) to evaluate the stability of those distance functions.

Recall that comments on these indices and their use are given in Section 3.4.

Figure 5: The average value of the Pearson variation: (a) SDP; (b) the L_1, L_2, and fractional functions.

The synthetic m-dimensional sample data sets and query points of our experiments were generated as follows. Each data set includes one thousand independent data points. For each data set, the jth entry x_ij of the ith data point x_i = (x_i1, x_i2, ..., x_im) was randomly sampled from the uniform distribution U(a, a + b), the exponential distribution Exp(λ), or the normal distribution N(µ, σ). In each generation of x_ij, the parameters a, b, λ, µ, and σ are randomly sampled from uniform distributions with ranges (0, 1), (0, 1), (0.1, 2), (0, 1), and (0.1, 1), respectively. Formally, for each data set, for any i = 1, 2, ..., 1000 and any j = 1, 2, ..., m,

x_ij ~ U(a, a + b) with probability 1/3, Exp(λ) with probability 1/3, N(µ, σ) with probability 1/3,

where a, b, µ ~ U(0, 1), λ ~ U(0.1, 2), and σ ~ U(0.1, 1). The query points were also generated in the same manner; a sketch of this generation procedure in code is given below. We repeated this process to generate 100 data sets for each dimensionality. The dimensionality varied from one to 100. The shrinkage threshold and divergence threshold of SDP are s_1 = 0.5 and s_2 = 0.6, respectively. The parameter l for the fractional distance function dist_l was set to 0.3. The estimations of E[DMAX_m/DMIN_m] and var(DMAX_m/DMIN_m) and the average value of var(d_m(P_{m,1}, Q_m)/E[d_m(P_{m,1}, Q_m)]) were computed in natural logarithm scale to measure the meaningful behavior.

Figure 6: The estimations of the mean of the extremal ratio (in logarithmic scale): (a) SDP; (b) the L_1, L_2, and fractional functions.
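For reference, here is a compact generator following the description above. This is our sketch; since several constants in the transcription are uncertain, the sizes are left as parameters.

```python
import numpy as np

def synthetic_dataset(n_points, m, rng):
    """Each entry is drawn from U(a, a+b), Exp(lambda), or N(mu, sigma), with the
    distribution and its parameters chosen independently for every entry."""
    shape = (n_points, m)
    choice = rng.integers(0, 3, size=shape)
    a, b = rng.random(shape), rng.random(shape)       # a, b ~ U(0, 1)
    lam = rng.uniform(0.1, 2.0, size=shape)           # lambda ~ U(0.1, 2)
    mu, sigma = rng.random(shape), rng.uniform(0.1, 1.0, size=shape)
    return np.select(
        [choice == 0, choice == 1, choice == 2],
        [rng.uniform(a, a + b),           # uniform component
         rng.exponential(1.0 / lam),      # Exp(lambda) has scale 1/lambda
         rng.normal(mu, sigma)])          # normal component

rng = np.random.default_rng(4)
data = synthetic_dataset(1000, 50, rng)   # 1000 points in 50 dimensions
query = synthetic_dataset(1, 50, rng)[0]  # a query point, generated the same way
```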

Those results are shown in Figure 5 through Figure 7. (In order to perceive the differences among these performances easily, we only present the outputs of the first 40 (or 20) dimensions.) As shown in these figures, the SDP is an effective dimensionality resistant distance function. It is noted that the L_1 and L_2 metrics become unstable with as few as 20 dimensions. The fractional distance function is more effective at preserving the meaningfulness of proximity than L_1 and L_2, but starts to suffer from unstability after the dimensionality exceeds 80. In contrast, the SDP remains stable even if the dimensionality is greater than 100, showing the prominent advantage of using SDP.

Figure 7: The estimations of the variance of the extremal ratio (in logarithmic scale): (a) SDP; (b) the L_1, L_2, and fractional functions.

Clustering Applications. In order to compare the qualitative performances, we applied the SDP function and the L_2 metric for clustering on multivariate mixture normal models. We synthesized many mixtures of two m-variate Gaussian data sets with diagonal covariance matrices. All dimensions of Cluster 1 and Cluster 2 are sampled iid from the normal distributions N(µ_1, σ_1^2) and N(µ_2, σ_2^2), respectively. We applied the matrix powering algorithm [16] on those generated data sets to compare the performances of SDP with the L_2 metric. The outline of the matrix powering algorithm is given in Table 3 [16].

Table 3: The matrix powering algorithm.
Algorithm: Matrix Powering Algorithm
Inputs: A (pairwise distance matrix), ε
Output: A partition of the data points into clusters
1. Compute A^2 = A × A.
2. For each pair of yet unclassified points (i, j):
   a. If (A^2_i − A^2_j)(A^2_i − A^2_j)^T < ε, then i and j are in the same cluster.
   b. If (A^2_i − A^2_j)(A^2_i − A^2_j)^T ≥ ε, then i and j are in different clusters.

For any matrix M, we use M_j, M_ij, and M^T to denote, respectively, the jth row of M, the ij-th entry of M, and the transpose of M. Let the precision ratio of the algorithm be the percentage of the N^2 pairwise relationships (classified as same or different cluster) that it partitions correctly. Suppose that the expectation of A_{i,j} is p (respectively, q) for data points i and j in the same cluster (respectively, in different clusters), with q > p. Then, the optimal threshold given in [16] is ε = (q − p)^2 N^{3/2} / 2. However, the knowledge of q − p is usually unavailable. In addition, the scales of SDP and the L_2 metric vary. We therefore modify the threshold as ε(k) = kN(DMAX − DMIN)/2 for a variant k, and search the r-feasible range, which is defined as the maximal interval of k such that the precision ratio is at least r with threshold ε(k). We empirically investigated the behaviors of L_2 and SDP by using the above matrix powering algorithm; a sketch of this procedure in code is given below. The shrinkage and divergence thresholds of SDP were set to 0.5 and 6(σ_1 + σ_2), respectively. First, we used 100-dimensional synthetic data sets drawn from the mixture of two normal distributions with varying mean difference µ_1 − µ_2 between the clusters. Each data set has 200 data points, and each cluster contains 100 data points. The results are shown in Figure 8(a). We also considered 100-dimensional multivariate mixture models with several variance ratios σ_1/σ_2, and show the results in Figure 8(b). Finally, we tested both SDP and the L_2 metric for searching feasible ranges with increasing dimensionality; the empirical results are shown in Figure 9. Note that to use the matrix powering algorithm to solve our data clustering problems, we first need to choose an optimal threshold. A wider feasible range offers a more adequate solution space to this problem.
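A direct transcription of Table 3 into code follows (our sketch; the greedy labeling and the epsilon handling are simplifications of the procedure in [16]).

```python
import numpy as np

def matrix_powering_clusters(A, eps):
    """Matrix powering algorithm of [16]: square the pairwise distance matrix and
    put i, j in the same cluster when the rows of A^2 differ by less than eps."""
    A2 = A @ A
    n = len(A)
    labels = np.full(n, -1)
    next_label = 0
    for i in range(n):
        if labels[i] == -1:          # open a new cluster for point i
            labels[i] = next_label
            next_label += 1
        for j in range(i + 1, n):
            diff = A2[i] - A2[j]
            if labels[j] == -1 and diff @ diff < eps:  # same-cluster test of Table 3
                labels[j] = labels[i]
    return labels
```

Scanning the threshold ε(k) = kN(DMAX − DMIN)/2 over k and recording where the precision ratio stays above r then yields the r-feasible range.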

Figure 8: (a) The 98%-feasible ranges for SDP and L_2 with a varied mean difference µ_1 − µ_2 of the two clusters (σ_1 = 1, µ_2 = 3, σ_2 = 4); (b) the 98%-feasible ranges for SDP and L_2 with varied variance ratios σ_1/σ_2 of the two clusters (µ_1 = µ_2 = 3).

As shown in these figures, the feasible ranges of the L_2 metric are much narrower, and the boundary points of its feasible ranges are very close to 0. On the other hand, the feasible ranges of SDP are much wider than those of L_2, even in high dimensional space. Further, the SDP responds appropriately to the varying characteristics of the clusters. A larger mean difference (or variance ratio) of two clusters implies a better discrimination between them. As shown in Figure 8, the feasible range of SDP becomes wider as the mean difference or the variance ratio of the two clusters is magnified. Also, the width of the feasible ranges for SDP increases with increasing dimensionality, as in Figure 9. On the other hand, due to the unstable phenomenon, the width of the feasible ranges for L_2 rapidly degrades to 0 with increasing dimensionality. From Figure 8 and Figure 9, it is shown that SDP significantly outperforms the priorly used L_2 metric.

Figure 9: The 98%-feasible ranges for SDP and L_2 with varied dimensionality ((µ_1 = 8, σ_1 = 1) vs. (µ_1 = 3, σ_1 = 4)).

6 Conclusions

In this paper, we derived the necessary and the sufficient conditions for the stability of a distance function in high dimensional space. Explicitly, we proved that the rapidly degraded Pearson variation of the distance distribution with increasing dimensionality is equivalent to (i.e., is the necessary and sufficient condition of) the unstable phenomenon. This theoretical result on the sufficient condition of a meaningful distance function design leads to a powerful means to test the stability of a distance function in high dimensional data space. Explicitly, in light of our results, we have designed a meaningful distance function, SDP, based on a certain given distance function. It was empirically shown that the SDP significantly outperforms prior measures for its being stable in high dimensional data space and also robust to noise, and is thus deemed more suitable for distance-based clustering applications than the priorly used L_p metric.

References

[1] C. C. Aggarwal. Re-designing Distance Functions and Distance-based Applications for High Dimensional Data. ACM SIGMOD Record, Vol. 30, pp. 13-18, 2001.
[2] C. C. Aggarwal, A. Hinneburg, and D. Keim. On the Surprising Behavior of Distance Metrics in High Dimensional Space. ICDT Conference, 2001.
[3] C. C. Aggarwal and P. S. Yu. The IGrid Index: Reversing the Dimensionality Curse for Similarity Indexing in High Dimensional Space. ACM SIGKDD Conference, 2000.
[4] P. J. Bickel and K. A. Doksum. Mathematical Statistics: Basic Ideas and Selected Topics, Vol. 1, 2nd edition. Prentice Hall, 2001.
[5] K. P. Bennett, U. Fayyad, and D. Geiger. Density-Based Indexing for Approximate Nearest Neighbor Queries. ACM SIGKDD Conference, 1999.
[6] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When Is "Nearest Neighbor" Meaningful? ICDT Conference Proceedings, 1999.
[7] K. L. Chung. A Course in Probability Theory. 3rd edition. Academic Press, 2001.
[8] H. A. David and H. N. Nagaraja. Order Statistics. 3rd edition. Wiley & Sons, 2003.
[9] L. Ertöz, M. Steinbach, and V. Kumar. Finding Clusters of Different Sizes, Shapes, and Densities in Noisy, High Dimensional Data. SDM Conference, 2003.
[10] B. Efron and R. Tibshirani. An Introduction to the Bootstrap. Chapman and Hall, 1993.
[11] A. Hinneburg, C. C. Aggarwal, and D. Keim. What Is the Nearest Neighbor in High Dimensional Spaces? VLDB Conference, 2000.
[12] A. K. Jain, M. N. Murty, and P. J. Flynn. Data Clustering: A Review. ACM Computing Surveys, 1999.
[13] N. Katayama and S. Satoh. Distinctiveness-Sensitive Nearest-Neighbor Search for Efficient Similarity Retrieval of Multimedia Information. ICDE Conference, 2001.
[14] V. V. Petrov. Limit Theorems of Probability Theory. Oxford University Press, 1995.
[15] V. K. Rohatgi and A. K. Md. Ehsanes Saleh. An Introduction to Probability and Statistics. 2nd edition. Wiley, 2001.
[16] H. Zhou and D. Woodruff. Clustering via Matrix Powering. PODS Conference, 2004.

Appendix A: Related Theorems on Probability

In order to prove our main theorem, we present some important theorems from probability theory [15][7][8]. The proofs are omitted in the interest of space.

Theorem A.1. If X_m →_P X, Y_m →_P Y, and g is a continuous function defined on the real numbers, then we have the following properties:
1. X_m − X →_P 0.
2. g(X_m) →_P g(X).
3. aX_m ± Y_m →_P aX ± Y for any constant a.
4. X_m Y_m →_P XY.
5. X_m / Y_m →_P X/a, provided Y_m →_P a (≠ 0) (Slutsky's theorem).

Theorem A.2. If X_m →_P 1 then X_m^{-1} →_P 1.

Theorem A.3. If E[X^2] < ∞ then var(X) = var(E[X | Y]) + E[var(X | Y)].

Theorem A.4. If X_j (j = 1, 2, ..., m) are independent, then P{max(X_1, ..., X_m) ≤ ε} = P{X_1 ≤ ε, X_2 ≤ ε, ..., X_m ≤ ε} = Π_{j=1}^m P{X_j ≤ ε}.

In order to prove the necessary condition, we also need to derive the following lemma.

Lemma A.1. For every ε > 0, if lim_{m→∞} P{DMAX_m ≤ (1 + ε) DMIN_m} = 1, then we have the following properties:
1. DMAX_m/DMIN_m →_P 1.
2. lim_{m→∞} E[DMAX_m/DMIN_m] = 1 and lim_{m→∞} var(DMAX_m/DMIN_m) = 0.
3. For any i, j, lim_{m→∞} P{d_m(P_{m,i}, Q_m) ≤ (1 + ε) d_m(P_{m,j}, Q_m)} = 1.
4. For any i, j, lim_{m→∞} E[d_m(P_{m,i}, Q_m)/d_m(P_{m,j}, Q_m)] = 1 and lim_{m→∞} var(d_m(P_{m,i}, Q_m)/d_m(P_{m,j}, Q_m)) = 0.

Proof.
1. The first proposition follows from Theorem A.2.
2. From probability theory [15], the following properties are all equivalent: (a) lim_{m→∞} P{DMAX_m ≤ (1 + ε) DMIN_m} = 1; (b) DMAX_m/DMIN_m →_P 1; (c) DMAX_m/DMIN_m − 1 converges in distribution to the degenerate distribution D(x), where D(x) = 1 if x > 0 and D(x) = 0 if x ≤ 0. Hence, we have lim_{m→∞} E[DMAX_m/DMIN_m] = 1 and lim_{m→∞} var(DMAX_m/DMIN_m) = 0.
3. Since DMIN_m/DMAX_m ≤ d_m(P_{m,i}, Q_m)/d_m(P_{m,j}, Q_m) ≤ DMAX_m/DMIN_m for any i, j, we have d_m(P_{m,i}, Q_m)/d_m(P_{m,j}, Q_m) →_P 1.
4. The third proposition also leads to the fourth one.


Distributed Subgradient Methods for Multi-agent Optimization 1 Distributed Subgradient Methods for Multi-agent Optiization Angelia Nedić and Asuan Ozdaglar October 29, 2007 Abstract We study a distributed coputation odel for optiizing a su of convex objective functions

More information

PAC-Bayes Analysis Of Maximum Entropy Learning

PAC-Bayes Analysis Of Maximum Entropy Learning PAC-Bayes Analysis Of Maxiu Entropy Learning John Shawe-Taylor and David R. Hardoon Centre for Coputational Statistics and Machine Learning Departent of Coputer Science University College London, UK, WC1E

More information

Supplementary Material for Fast and Provable Algorithms for Spectrally Sparse Signal Reconstruction via Low-Rank Hankel Matrix Completion

Supplementary Material for Fast and Provable Algorithms for Spectrally Sparse Signal Reconstruction via Low-Rank Hankel Matrix Completion Suppleentary Material for Fast and Provable Algoriths for Spectrally Sparse Signal Reconstruction via Low-Ran Hanel Matrix Copletion Jian-Feng Cai Tianing Wang Ke Wei March 1, 017 Abstract We establish

More information

Proc. of the IEEE/OES Seventh Working Conference on Current Measurement Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES

Proc. of the IEEE/OES Seventh Working Conference on Current Measurement Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES Proc. of the IEEE/OES Seventh Working Conference on Current Measureent Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES Belinda Lipa Codar Ocean Sensors 15 La Sandra Way, Portola Valley, CA 98 blipa@pogo.co

More information

Randomized Recovery for Boolean Compressed Sensing

Randomized Recovery for Boolean Compressed Sensing Randoized Recovery for Boolean Copressed Sensing Mitra Fatei and Martin Vetterli Laboratory of Audiovisual Counication École Polytechnique Fédéral de Lausanne (EPFL) Eail: {itra.fatei, artin.vetterli}@epfl.ch

More information

Computational and Statistical Learning Theory

Computational and Statistical Learning Theory Coputational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 2: PAC Learning and VC Theory I Fro Adversarial Online to Statistical Three reasons to ove fro worst-case deterinistic

More information

Research Article On the Isolated Vertices and Connectivity in Random Intersection Graphs

Research Article On the Isolated Vertices and Connectivity in Random Intersection Graphs International Cobinatorics Volue 2011, Article ID 872703, 9 pages doi:10.1155/2011/872703 Research Article On the Isolated Vertices and Connectivity in Rando Intersection Graphs Yilun Shang Institute for

More information

Weighted- 1 minimization with multiple weighting sets

Weighted- 1 minimization with multiple weighting sets Weighted- 1 iniization with ultiple weighting sets Hassan Mansour a,b and Özgür Yılaza a Matheatics Departent, University of British Colubia, Vancouver - BC, Canada; b Coputer Science Departent, University

More information

Lecture October 23. Scribes: Ruixin Qiang and Alana Shine

Lecture October 23. Scribes: Ruixin Qiang and Alana Shine CSCI699: Topics in Learning and Gae Theory Lecture October 23 Lecturer: Ilias Scribes: Ruixin Qiang and Alana Shine Today s topic is auction with saples. 1 Introduction to auctions Definition 1. In a single

More information

A Note on Scheduling Tall/Small Multiprocessor Tasks with Unit Processing Time to Minimize Maximum Tardiness

A Note on Scheduling Tall/Small Multiprocessor Tasks with Unit Processing Time to Minimize Maximum Tardiness A Note on Scheduling Tall/Sall Multiprocessor Tasks with Unit Processing Tie to Miniize Maxiu Tardiness Philippe Baptiste and Baruch Schieber IBM T.J. Watson Research Center P.O. Box 218, Yorktown Heights,

More information

Least Squares Fitting of Data

Least Squares Fitting of Data Least Squares Fitting of Data David Eberly, Geoetric Tools, Redond WA 98052 https://www.geoetrictools.co/ This work is licensed under the Creative Coons Attribution 4.0 International License. To view a

More information

Testing equality of variances for multiple univariate normal populations

Testing equality of variances for multiple univariate normal populations University of Wollongong Research Online Centre for Statistical & Survey Methodology Working Paper Series Faculty of Engineering and Inforation Sciences 0 esting equality of variances for ultiple univariate

More information

1 Proof of learning bounds

1 Proof of learning bounds COS 511: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #4 Scribe: Akshay Mittal February 13, 2013 1 Proof of learning bounds For intuition of the following theore, suppose there exists a

More information

Quantum algorithms (CO 781, Winter 2008) Prof. Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search

Quantum algorithms (CO 781, Winter 2008) Prof. Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search Quantu algoriths (CO 781, Winter 2008) Prof Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search ow we begin to discuss applications of quantu walks to search algoriths

More information

Computable Shell Decomposition Bounds

Computable Shell Decomposition Bounds Journal of Machine Learning Research 5 (2004) 529-547 Subitted 1/03; Revised 8/03; Published 5/04 Coputable Shell Decoposition Bounds John Langford David McAllester Toyota Technology Institute at Chicago

More information

DEPARTMENT OF ECONOMETRICS AND BUSINESS STATISTICS

DEPARTMENT OF ECONOMETRICS AND BUSINESS STATISTICS ISSN 1440-771X AUSTRALIA DEPARTMENT OF ECONOMETRICS AND BUSINESS STATISTICS An Iproved Method for Bandwidth Selection When Estiating ROC Curves Peter G Hall and Rob J Hyndan Working Paper 11/00 An iproved

More information

Bayes Decision Rule and Naïve Bayes Classifier

Bayes Decision Rule and Naïve Bayes Classifier Bayes Decision Rule and Naïve Bayes Classifier Le Song Machine Learning I CSE 6740, Fall 2013 Gaussian Mixture odel A density odel p(x) ay be ulti-odal: odel it as a ixture of uni-odal distributions (e.g.

More information

A := A i : {A i } S. is an algebra. The same object is obtained when the union in required to be disjoint.

A := A i : {A i } S. is an algebra. The same object is obtained when the union in required to be disjoint. 59 6. ABSTRACT MEASURE THEORY Having developed the Lebesgue integral with respect to the general easures, we now have a general concept with few specific exaples to actually test it on. Indeed, so far

More information

Chapter 6 1-D Continuous Groups

Chapter 6 1-D Continuous Groups Chapter 6 1-D Continuous Groups Continuous groups consist of group eleents labelled by one or ore continuous variables, say a 1, a 2,, a r, where each variable has a well- defined range. This chapter explores:

More information

A general forulation of the cross-nested logit odel Michel Bierlaire, Dpt of Matheatics, EPFL, Lausanne Phone: Fax:

A general forulation of the cross-nested logit odel Michel Bierlaire, Dpt of Matheatics, EPFL, Lausanne Phone: Fax: A general forulation of the cross-nested logit odel Michel Bierlaire, EPFL Conference paper STRC 2001 Session: Choices A general forulation of the cross-nested logit odel Michel Bierlaire, Dpt of Matheatics,

More information

Robustness and Regularization of Support Vector Machines

Robustness and Regularization of Support Vector Machines Robustness and Regularization of Support Vector Machines Huan Xu ECE, McGill University Montreal, QC, Canada xuhuan@ci.cgill.ca Constantine Caraanis ECE, The University of Texas at Austin Austin, TX, USA

More information

Computable Shell Decomposition Bounds

Computable Shell Decomposition Bounds Coputable Shell Decoposition Bounds John Langford TTI-Chicago jcl@cs.cu.edu David McAllester TTI-Chicago dac@autoreason.co Editor: Leslie Pack Kaelbling and David Cohn Abstract Haussler, Kearns, Seung

More information

Stochastic Subgradient Methods

Stochastic Subgradient Methods Stochastic Subgradient Methods Lingjie Weng Yutian Chen Bren School of Inforation and Coputer Science University of California, Irvine {wengl, yutianc}@ics.uci.edu Abstract Stochastic subgradient ethods

More information

Estimating Parameters for a Gaussian pdf

Estimating Parameters for a Gaussian pdf Pattern Recognition and achine Learning Jaes L. Crowley ENSIAG 3 IS First Seester 00/0 Lesson 5 7 Noveber 00 Contents Estiating Paraeters for a Gaussian pdf Notation... The Pattern Recognition Proble...3

More information

SPECTRUM sensing is a core concept of cognitive radio

SPECTRUM sensing is a core concept of cognitive radio World Acadey of Science, Engineering and Technology International Journal of Electronics and Counication Engineering Vol:6, o:2, 202 Efficient Detection Using Sequential Probability Ratio Test in Mobile

More information

FAST DYNAMO ON THE REAL LINE

FAST DYNAMO ON THE REAL LINE FAST DYAMO O THE REAL LIE O. KOZLOVSKI & P. VYTOVA Abstract. In this paper we show that a piecewise expanding ap on the interval, extended to the real line by a non-expanding ap satisfying soe ild hypthesis

More information

Lost-Sales Problems with Stochastic Lead Times: Convexity Results for Base-Stock Policies

Lost-Sales Problems with Stochastic Lead Times: Convexity Results for Base-Stock Policies OPERATIONS RESEARCH Vol. 52, No. 5, Septeber October 2004, pp. 795 803 issn 0030-364X eissn 1526-5463 04 5205 0795 infors doi 10.1287/opre.1040.0130 2004 INFORMS TECHNICAL NOTE Lost-Sales Probles with

More information

Analyzing Simulation Results

Analyzing Simulation Results Analyzing Siulation Results Dr. John Mellor-Cruey Departent of Coputer Science Rice University johnc@cs.rice.edu COMP 528 Lecture 20 31 March 2005 Topics for Today Model verification Model validation Transient

More information

Cosine similarity and the Borda rule

Cosine similarity and the Borda rule Cosine siilarity and the Borda rule Yoko Kawada Abstract Cosine siilarity is a coonly used siilarity easure in coputer science. We propose a voting rule based on cosine siilarity, naely, the cosine siilarity

More information

When Short Runs Beat Long Runs

When Short Runs Beat Long Runs When Short Runs Beat Long Runs Sean Luke George Mason University http://www.cs.gu.edu/ sean/ Abstract What will yield the best results: doing one run n generations long or doing runs n/ generations long

More information

Prediction by random-walk perturbation

Prediction by random-walk perturbation Prediction by rando-walk perturbation Luc Devroye School of Coputer Science McGill University Gábor Lugosi ICREA and Departent of Econoics Universitat Popeu Fabra lucdevroye@gail.co gabor.lugosi@gail.co

More information

A Low-Complexity Congestion Control and Scheduling Algorithm for Multihop Wireless Networks with Order-Optimal Per-Flow Delay

A Low-Complexity Congestion Control and Scheduling Algorithm for Multihop Wireless Networks with Order-Optimal Per-Flow Delay A Low-Coplexity Congestion Control and Scheduling Algorith for Multihop Wireless Networks with Order-Optial Per-Flow Delay Po-Kai Huang, Xiaojun Lin, and Chih-Chun Wang School of Electrical and Coputer

More information

Hybrid System Identification: An SDP Approach

Hybrid System Identification: An SDP Approach 49th IEEE Conference on Decision and Control Deceber 15-17, 2010 Hilton Atlanta Hotel, Atlanta, GA, USA Hybrid Syste Identification: An SDP Approach C Feng, C M Lagoa, N Ozay and M Sznaier Abstract The

More information

Compressive Distilled Sensing: Sparse Recovery Using Adaptivity in Compressive Measurements

Compressive Distilled Sensing: Sparse Recovery Using Adaptivity in Compressive Measurements 1 Copressive Distilled Sensing: Sparse Recovery Using Adaptivity in Copressive Measureents Jarvis D. Haupt 1 Richard G. Baraniuk 1 Rui M. Castro 2 and Robert D. Nowak 3 1 Dept. of Electrical and Coputer

More information

Prerequisites. We recall: Theorem 2 A subset of a countably innite set is countable.

Prerequisites. We recall: Theorem 2 A subset of a countably innite set is countable. Prerequisites 1 Set Theory We recall the basic facts about countable and uncountable sets, union and intersection of sets and iages and preiages of functions. 1.1 Countable and uncountable sets We can

More information

Extension of CSRSM for the Parametric Study of the Face Stability of Pressurized Tunnels

Extension of CSRSM for the Parametric Study of the Face Stability of Pressurized Tunnels Extension of CSRSM for the Paraetric Study of the Face Stability of Pressurized Tunnels Guilhe Mollon 1, Daniel Dias 2, and Abdul-Haid Soubra 3, M.ASCE 1 LGCIE, INSA Lyon, Université de Lyon, Doaine scientifique

More information

Optimal Jamming Over Additive Noise: Vector Source-Channel Case

Optimal Jamming Over Additive Noise: Vector Source-Channel Case Fifty-first Annual Allerton Conference Allerton House, UIUC, Illinois, USA October 2-3, 2013 Optial Jaing Over Additive Noise: Vector Source-Channel Case Erah Akyol and Kenneth Rose Abstract This paper

More information

Spine Fin Efficiency A Three Sided Pyramidal Fin of Equilateral Triangular Cross-Sectional Area

Spine Fin Efficiency A Three Sided Pyramidal Fin of Equilateral Triangular Cross-Sectional Area Proceedings of the 006 WSEAS/IASME International Conference on Heat and Mass Transfer, Miai, Florida, USA, January 18-0, 006 (pp13-18) Spine Fin Efficiency A Three Sided Pyraidal Fin of Equilateral Triangular

More information

On the Use of A Priori Information for Sparse Signal Approximations

On the Use of A Priori Information for Sparse Signal Approximations ITS TECHNICAL REPORT NO. 3/4 On the Use of A Priori Inforation for Sparse Signal Approxiations Oscar Divorra Escoda, Lorenzo Granai and Pierre Vandergheynst Signal Processing Institute ITS) Ecole Polytechnique

More information

Sequence Analysis, WS 14/15, D. Huson & R. Neher (this part by D. Huson) February 5,

Sequence Analysis, WS 14/15, D. Huson & R. Neher (this part by D. Huson) February 5, Sequence Analysis, WS 14/15, D. Huson & R. Neher (this part by D. Huson) February 5, 2015 31 11 Motif Finding Sources for this section: Rouchka, 1997, A Brief Overview of Gibbs Sapling. J. Buhler, M. Topa:

More information

On the Communication Complexity of Lipschitzian Optimization for the Coordinated Model of Computation

On the Communication Complexity of Lipschitzian Optimization for the Coordinated Model of Computation journal of coplexity 6, 459473 (2000) doi:0.006jco.2000.0544, available online at http:www.idealibrary.co on On the Counication Coplexity of Lipschitzian Optiization for the Coordinated Model of Coputation

More information

are equal to zero, where, q = p 1. For each gene j, the pairwise null and alternative hypotheses are,

are equal to zero, where, q = p 1. For each gene j, the pairwise null and alternative hypotheses are, Page of 8 Suppleentary Materials: A ultiple testing procedure for ulti-diensional pairwise coparisons with application to gene expression studies Anjana Grandhi, Wenge Guo, Shyaal D. Peddada S Notations

More information

Bipartite subgraphs and the smallest eigenvalue

Bipartite subgraphs and the smallest eigenvalue Bipartite subgraphs and the sallest eigenvalue Noga Alon Benny Sudaov Abstract Two results dealing with the relation between the sallest eigenvalue of a graph and its bipartite subgraphs are obtained.

More information

Ensemble Based on Data Envelopment Analysis

Ensemble Based on Data Envelopment Analysis Enseble Based on Data Envelopent Analysis So Young Sohn & Hong Choi Departent of Coputer Science & Industrial Systes Engineering, Yonsei University, Seoul, Korea Tel) 82-2-223-404, Fax) 82-2- 364-7807

More information

Lecture 21. Interior Point Methods Setup and Algorithm

Lecture 21. Interior Point Methods Setup and Algorithm Lecture 21 Interior Point Methods In 1984, Kararkar introduced a new weakly polynoial tie algorith for solving LPs [Kar84a], [Kar84b]. His algorith was theoretically faster than the ellipsoid ethod and

More information

Necessity of low effective dimension

Necessity of low effective dimension Necessity of low effective diension Art B. Owen Stanford University October 2002, Orig: July 2002 Abstract Practitioners have long noticed that quasi-monte Carlo ethods work very well on functions that

More information

RAFIA(MBA) TUTOR S UPLOADED FILE Course STA301: Statistics and Probability Lecture No 1 to 5

RAFIA(MBA) TUTOR S UPLOADED FILE Course STA301: Statistics and Probability Lecture No 1 to 5 Course STA0: Statistics and Probability Lecture No to 5 Multiple Choice Questions:. Statistics deals with: a) Observations b) Aggregates of facts*** c) Individuals d) Isolated ites. A nuber of students

More information