THE search for nearest neighbors (NNs) is a crucial task in

Size: px

Start display at page:

Download "THE search for nearest neighbors (NNs) is a crucial task in"

Nelson Campbell
6 years ago
Views:

1 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 9, NO. 7, JULY The Concentration of Fractional Distances Damien François, Vincent Wertz, Member, IEEE, and Michel Verleysen, Senior Member, IEEE Abstract Nearest neighbor search and many other numerical data analysis tools most often rely on the use of the euclidean distance. When data are high dimensional, however, the euclidean distances seem to concentrate; all distances between airs of data elements seem to be very similar. Therefore, the relevance of the euclidean distance has been questioned in the ast, and fractional norms (Minkowski-like norms with an exonent less than one) were introduced to fight the concentration henomenon. This aer justifies the use of alternative distances to fight concentration by showing that the concentration is indeed an intrinsic roerty of the distances and not an artifact from a finite samle. Furthermore, an estimation of the concentration as a function of the exonent of the distance and of the distribution of the data is given. It leads to the conclusion that, contrary to what is generally admitted, fractional norms are not always less concentrated than the euclidean norm; a counterexamle is given to rove this claim. Theoretical arguments are resented, which show that the concentration henomenon can aear for real data that do not match the hyotheses of the theorems, in articular, the assumtion of indeendent and identically distributed variables. Finally, some insights about how to choose an otimal metric are given. Index Terms Nearest neighbor search, high-dimensional data, distance concentration, fractional distances. Ç INTRODUCTION THE search for nearest neighbors (NNs) is a crucial task in data management and data analysis. Content-based data retrieval systems, for examle, use the distance between a query from the user and each element in a database to retrieve the most similar data []. However, the distances between elements can also be used to automatically classify data [] or to identify clusters (natural grous of similar data) in the data set [3]. To define a measure of distance among data, the latter are often described in a euclidean sace through a feature vector, and the euclidean distance is used. The euclidean distance is the euclidean norms of the difference between two vectors and is suosed to reflect the notion of similarity between them. Nowadays, most data are getting more and more comlex in the sense that a large number of features is needed to describe them; they are said to be high dimensional. For examle, ictures taken by a standard camera consist of two to five million ixels, digital books contain thousands of words, DNA sequences are comosed of tens of thousand bases, and so forth. High-dimensional data must obviously be described in a high-dimensional sace. In those saces, however, the norm used to define the distance has the strange roerty to concentrate [4], [5]. As a consequence, all airwise distances in a high-dimensional data set seem to be equal or at least. D. François and V. Wertz are with the Deartment of Mathematical Engineering, Université catholique de Louvain, Georges Lemaître, 4, B-348 Louvain-la-Neuve, Belgium. {francois, wertz}@inma.ucl.ac.be.. M. Verleysen is with the Microelectronic Laboratory, Place du Levant 3, B-348 Louvain-la-Neuve, Belgium. verleysen@dice.ucl.ac.be. Manuscrit received Aug. 005; revised 5 Set. 006; acceted Jan. 007; ublished online 5 Mar For information on obtaining rerints of this article, lease send to tkde@comuter.org, and reference IEEECS Log Number TKDE Digital Object Identifier no. 0.09/TKDE very similar. This may lead to roblems when searching for NNs [6], [7], [8], [9], [0], [], [], [3], [4], [5], [6], [7], [8], [9], [0], [], []. Therefore, the relevance of the norm, secifically of the euclidean norm, is questioned when measuring high-dimensional data similarity. Some literature suggests using fractional norms, an extension of the euclidean norm, to counter concentration effects [3], [4]. Fractional norms have been studied and used [5], [6], [7], [8], [9], but many fundamental questions remain oen about the concentration henomenon. Does the concentration henomenon occur when the data do not match the hyotheses of the theorems about concentration? Is the concentration henomenon really intrinsic to the norm/distance, or can it be caused by some other counterintuitive effect of the high dimensionality? Are fractional norms always less concentrated than higher order norms? The aim of this aer is to rovide answers to these questions. In articular, this aer will show that the norm of normalized data will concentrate even if the latter do not match the indeendent and identically distributed (i.i.d.) hyothesis. It will also show that the concentration henomenon occurs even when an infinite number of data oints are considered. Finally, this aer shows that, in contrast to what is generally acknowledged in the literature, fractional norms are not always less concentrated than higher order norms. This aer is organized as follows Section introduces the concentration henomenon and reviews the state of the art. Section 3 evokes the link between the concentration of the norm and the curse of dimensionality in database indexing. Section 4 discusses the itations of current results and extends them. For the sake of clarity, this section will only resent the new results; roofs and illustrations are gathered in Section 5. Finally, Section 6 rooses some ways to choose an otimal metric in articular situations /07/$5.00 ß 007 IEEE Published by the IEEE Comuter Society

2 874 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 9, NO. 7, JULY 007 Fig.. From bottom to to minimum observed value, average minus standard deviation, average value, average lus standard deviation, maximum observed value, and maximum ossible value of the Euclidean norm of a random vector. The exectation grows, but the variance remains constant. A small subinterval of the domain of the norm is reached in ractice. STATE OF THE ART The concentration of distances in high-dimensional saces is a rather counterintuitive henomenon. It can be roughly stated as follows In a high-dimensional sace, all airwise distances between oints seem identical. This aer will study the concentration of the distances through the concentration of the norm of random vectors, as done in [4], [4], [30], and [3]. Given a data set suosedly drawn from a random variable Z, we consider Z 0 that is distributed as Z and define X ¼ Z Z 0. The robability density function of X is the convolution of the resective robability density functions of Z and Z 0. Studying the distribution of distances in the data set roduced by Z, kz Z 0 k, is equivalent to studying the distribution of kxk the norm of X. Similarly, studying the airwise distances in a data set with n elements can be done by first building another data set with nðn þ Þ= elements corresonding to all airs, whose attributes are the differences between two elements and then studying the norm of this new data. This section illustrates the henomenon and gives an overview of the main known results about the concentration of the norm/distance.. An Intuitive View of the Concentration Phenomenon The following exeriment will hel in introducing the concet of concentration for the norm in high-dimensional saces. The aim is to observe that the norms of highdimensional vectors tend to be very similar to one another. Let X ¼ðX ; ;X d Þ be a d-dimensional random vector taking values in the unit cube ½0; Š d. Each comonent X i will be referred to as variable X i. We will denote by X F a random vector X distributed according to the multivariate robability density function F. Let ¼fx ðjþ g n j¼ IRd be a finite samle drawn from X, that is, a set of indeendent realizations of X. We consider the set fkx ðjþ kg n j¼ of all the norms of the xðjþ. Obviously, the values of kx ðjþ k are bounded kx ðjþ k½0;mš, where M ¼kð;...; Þk. In low-dimensional saces ðd < 0Þ, ifn is not too small, min j fkx ðjþ kg n j¼ will be close to zero, and max jfkx ðjþ kg n j¼ will be close to M. However, in higher dimensional saces ðd >0Þ, this is not verified anymore. Fig. shows the average value, emirical standard deviation, and maximum and minimum observed values of the norm of a uniformly randomly drawn samle of size n ¼ 0 5 in saces of growing dimension. The euclidean norm is considered, so M ¼ ffiffiffi d. We can observe that the average value of the norm increases with the dimension, whereas the standard deviation seems rather constant. When the dimension is low (Fig. a), we can see that the minimum and maximum observed values are close to the bounds of the domain of the norm, resectively, 0 and ffiffiffi d. When the dimension is large, say, from dimension 0 onward, the maximum and minimum observed values tend to move away from the bounds. Indeed, even with a large number of oints ð0 5 Þ, all the observed norms seem to concentrate in a small ortion of their domain. In addition, this ortion gets smaller and smaller as the dimension grows, when comared to the size of the total domain. This henomenon is referred to as the concentration of the norm. Section.3 will review some results from the literature about the concentration of the norm in highdimensional saces; Minkowski norms will be introduced in Section... The Euclidean Norm and the Minkowski Family The Minkowski norms form a family of norms arametrized by their exonent ¼ ; ;...

3 FRANÇOIS ET AL. THE CONCENTRATION OF FRACTIONAL DISTANCES 875 Fig.. Two-dimensional unit balls for several values of the arameter of the -norm. kxk ¼ X i jx i j! ðþ When ¼, the Minkowski norm corresonds to the euclidean one, which induces the euclidean distance. For ¼, it is sometimes called sum-norm and induces the Manhattan or city-block metric. The it for!is the max-norm, which induces the Chebychev metric. In the sequel, we will consider extensions of Minkowski norms to the case where is a ositive real number. For, those extensions are indeed norms, but for 0 <<, the triangle inequality does not hold and, hence, they do not deserve the name; they are sometimes called renorms. Actually, the inequality is reversed. A consequence is that the straight line is no longer the smallest ath between two oints, which may seem counterintuitive. In the remainder, we will denote -norm a norm or renorm of the form () with IR þ. We will call a fractional norm a -norm with <. Fig. deicts the D unit balls (that is the set of x ðjþ for which kx ðjþ k¼) for values of equal to,,, and infinity. We can see that, for, the balls are convex; for 0 <<, however, they are not. In Sections.3,.4, and.5, results from the literature about the concentration of Minkowski norms and fractional -norms will be resented..3 Concentration of the Euclidean Norm In the exeriment described at the beginning of this section, we observed that the exectation of the norm of a random vector increases with the dimension, whereas its standard deviation (and, hence, its variance) remains rather constant. Demartines [4] has theoretically confirmed this fact. Theorem Demartines. Let X IR d be a random vector with i.i.d. comonents X i F. Then, EðkXk Þ¼ ffiffiffiffiffiffiffiffiffiffiffiffiffi ad b þ Oð=dÞ and ðþ VarðkXk Þ¼bþOð= ffiffiffi d Þ; ð3þ where a and b are constants that do not deend on the dimension. The theorem is valid whatever the distribution F of the X i might be. Different distributions will lead to different values for a and b, but the asymtotic results remain. The theorem roves that the exectation of the euclidean norm of random vectors increases as the square root of the dimension, whereas its variance is constant and indeendent of the dimension. Therefore, when the dimension is large, the variance of the norm is very small comared with its exected value. Demartines concludes that when the dimension is large, vectors seem normalized The relative error made while considering EðkXk Þ instead of the real value of kxk becomes negligible. As a consequence, high-dimensional vectors aear to be distributed on a shere of radius EðkXk Þ. Demartines also notes that, since the euclidean distance is the norm of the difference between two random vectors, its exectation and variance follow laws () and (3); airwise distances between oints in high-dimensional saces seem to be all identical. Demartines mentions that if X i are not indeendent, the results are still valid rovided that we relace d with the actual number of degrees of freedom. The result from the work of Demartines is interesting in that it confirms the exerimental results, but it is restricted to the euclidean norm and makes the rather strong hyothesis of indeendence and identical distributions..4 Concentration of Arbitrary Norms Indeendent of the results of Demartines work, Beyer et al. exlored the effect of dimensionality on the NN roblem [5]. Whereas Demartines defines a data set as consisting of n indeendent draws x ðjþ from a single random vector X, Beyer et al. consider n random vectors P ðjþ ; a data set is then made of one realization of each random vector. The main result of Beyer et al. s work is the following theorem. The original theorem is stated for an arbitrary distance measure; it is rewritten here with norms for illustration uroses. Theorem Beyer et al., adated. Let P ðjþ j n be n d-dimensional i.i.d. random vectors. If! kp ðjþk EðkP ðjþ kþ ¼ 0 ð4þ then, for any >0, " # j kp ðjþ kmin j kp ðjþ k min j kp ðjþ k ¼ The theorem is interreted as follows Suose a set of n data oints, randomly distributed in the d-dimensional sace. Some query oint is suosed to be located at the origin, without loss of generality. Then, if hyothesis (4) is satisfied, indeendent of the distribution of the comonents of the P ðjþ, the difference between the largest and smallest distances to the query oint becomes smaller and smaller when comared with the smallest distance when the dimension increases. The ratio is called the relative contrast. max j kp ðjþ kmin j kp ðjþ k min j kp ðjþ k

4 876 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 9, NO. 7, JULY 007 Beyer et al. conclude that all oints seem to be located at aroximately the same distance from the query oint; the concet of NN in a high-dimensional sace is then less intuitive than in a lower dimensional one. Beyer et al. exlore some scenarios that satisfy (4) and some that do not. A roof that Minkowski norms and fractional norms satisfy the hyothesis will be given in Section Concentration of Minkowski Norms Hinneburg et al. focused on the search for NN as well [3]. They roduced the following theorem relative to Minkowski norms. Theorem 3 Hinneburg et al. Let P ðjþ j n be n d-dimensional i.i.d. random vectors and kk be the Minkowski norm with exonent. If the P ðjþ are distributed over ½0; Š d, then there exists a constant C indeendent of the distribution of the P ðjþ such that C E max! j kp ðjþ k min j kp ðjþ k ðnþc d ð5þ Hinneburg et al. note that a consequence of (5) is the surrising fact that, on average, the contrast max kp ðjþ k min kp ðjþ k j j ð6þ grows as d ==. They further conclude that, as a result, the contrast converges to a constant when the dimension increases and when the euclidean ffiffiffi distance is used. For the L norm, it increases as d, for the euclidean norm ð ¼ Þ, it remains constant, and for norms with 3, it tends toward zero. Hinneburg et al. conclude that, for L metrics with 3, the NN search in a high-dimensional sace tends to be meaningless for Minkowski norms with exonent larger than or equal to 3, since the maximum observed distance tends toward the minimum one. The distance has lost its discriminative ower between the notions of close and far. The conclusion we can draw from this theorem is the following On average, the ratio between the contrast and d == is bounded. However, the bounds themselves deend on the value of. Furthermore, if the number of oints n is large, the uer bound may be very large too. In ractice, though, it may aear that the value of the ratio is much closer to the lower bound than to the uer one. A result indeendent of the number of oints is resented in Section 4.. Yianilos [3] mentions that the standard deviation of the - norm,, of a uniformly distributed random vector is ðd == Þ and sketches a roof. In contrast to the aroaches by Beyer et al. and Hinneburg et al., Yianilos considers random vector distributions rather than realizations of random vectors. He also asserts that the i.i.d. hyothesis can be weakened. He then makes the same conclusions as Hinneburg et al. and investigates the consequences for excluded middle vantage oint forests methods for similarity search in metric saces. We will see a similar result in Section 4 without the restriction of the uniform distribution and on the values of..6 Concentration of Fractional Norms Aggarwal et al. extend Hinneburg s result to fractional -norms [4]. In a sense, if ¼ is better than ¼ 3 and ¼ is better than ¼, why not to look at ¼ to see if it is better than ¼? Aggarwal et al. roduced the following theorem. Theorem 4 Aggarwal et al. Let P ðjþ j n be n d-dimensional indeendent random vectors, uniformly distributed over ½0; Š d. There exists a constant C indeendent of and d such that sffiffiffiffiffiffiffiffiffiffiffiffiffi C E max! j kp ðjþ k min j kp ðjþ k þ min j kp ðjþ k s ðn ÞC ffiffiffi d ffiffiffiffiffiffiffiffiffiffiffiffiffi þ ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi Aggarwal et al. note that the constant =ð þ Þ may lay a valuable role in affecting the relative contrast and confirmed it exerimentally with synthetic data sets. They then conclude that, on average, fractional norms rovide better contrast than Minkowski norms in terms of relative distances. The next section will rovide a similar result indeendent of the number of oints. That result will show that the conclusions of Theorem 4 cannot be extended to the case where the data are not uniformly distributed. Indeendent of Aggarwal et al., Skala [30] has shown that the ratio ðdþ ¼ EðkXk Þ VarðkXk Þ ; increases linearly with dimension d. Here, X is a random vector whose comonents are i.i.d. distributed. The theorem does not require uniform densities; however, it is exact only in the ¼ case and gives an aroximate result for other values of. 3 CONCENTRATION AND SIMILARITY SEARCH IN DATABASES In order to better understand the imact of concentration on NN search and indexing, this section reviews some examle studies available in the literature. It further describes the context of this research and justifies the imortance of the study of concentration and alternative metrics. The major consequence of concentration on NN search is that indexing methods, which have an exected logarithmic cost, actually sometimes erform no better than simle linear scanning. This has been described among others in [3], [33], [34], and [35] and was acknowledged by many others as the curse of dimensionality. Consequently, new indexing structures, secially designed for high-dimensional data were suggested, like the X-tree [34], TV-tree [35], and so forth, and an aroximate search was roosed as a solution [36]. A survey of these methods can be found in [37]. It seems that Brin [38] was one of the ioneers to actually relate the curse of dimensionality to the articular shae of distance distributions in high-dimensional saces. He ð7þ

5 FRANÇOIS ET AL. THE CONCENTRATION OF FRACTIONAL DISTANCES 877 roosed the Geometric Near-neighbor Access Tree (GNAT) indexing structure that was built taking into account the minimum and maximum distances in the data set. Whereas Berchtold et al. [39] suggested that the decrease of erformances is due to the boundary effects, coming from the fact that all oints seem located in the corners of the sace, Weber et al. [40] rooses a review of counterintuitive roerties of high-dimensional geometrical saces and relates them to the losses of erformances. In view of their theorem (Theorem ), Beyer et al. exlain the decreases of erformances of the indexing methods by the instability of the search for NNs. They roose to use less concentrated metrics and further question the intrinsic relevance of NN search, indeendent of erformance issues, for concentrated metrics. Whereas many of the resonses to the curse of dimensionality are to introduce a new indexing method, Hinneburg et al. [3] and Aggarwal et al. [4] roose to use alternative metrics that are less concentrated. Their main urose, however, is to use a more meaningful metric in high-dimensional saces and not to accelerate the search. In both cases, using less-concentrated norms either to erform better with indexing structures or to get a more meaningful NN search thus assumes that the concentration henomenon is intrinsic to the notion of the norm. The aim of this aer is to show that the concentration is indeed intrinsic to the norm. To this end, we will consider results that are indeendent of the number of oints. Considering a high-dimensional bounded sace with some oints scattered over it, it seems reasonable to think that the less oints you have, the more concentrated the distances are akin to be since the maximum distance will decrease while the minimum distance will increase. The number of oints arameter either has to be taken into account or has to be ruled out of the equation. In the following, we roose to do this by considering data distributions instead of data sets. The conclusion of our study is that, indeed, the concentration henomenon is intrinsic to the norm and, thus, justifies Aggarwal et al. s work on fractional metrics to reduce concentration. From Aggarwal et al. s results, it is often extraolated that using fractional distances with small values of the arameter will decrease the concentration henomenon. The latter fact has subsequently been used as an argument to use fractional norms in all sorts of situations without first checking for alicability. However, the restriction is that the data must be uniformly distributed. As we are dealing here with high-dimensional saces, it aears quite evident that real data can only sarsely oulate high-dimensional saces because of their finite number and because they often lie on a submanifold. Therefore, data are far from being uniformly sread over the sace. This study confirms that the distribution of data has to be taken into account to estimate the concentration. Deending on the distribution, the evolution of the concentration with resect to the value of arameter may be increasing as gets smaller; in other cases, it also may have a local maximum for some value of, as will be shown later. 4 FURTHER THEORETICAL RESULTS Section reviewed major results from the literature. This section will describe new results in order to better understand the henomenon of norm concentration in high-dimensional saces. All roofs and examles are ostoned to Section 5 for ease of readability. 4. The Finite and Bounded Samle A finite number of oints will most robably be sarsely distributed in a high-dimensional sace. Most oints will be far away from each other, and the density will be very low over the whole sace. This is referred to as the Emty Sace Phenomenon [4]. Furthermore, if the oints live in a closed (bounded) region of the sace, for instance, the ½0; Š d hyercube, then the maximum distance is bounded too. In such a situation, it may haen that the relative contrast is very low. The questions are then Is the concentration henomenon a side effect of the Emty Sace Phenomenon, just because we consider a finite number of oints in a bounded ortion of a high-dimensional sace? Would the conclusions still hold if an infinite number of oints (in other words a distribution) sanning the whole sace was considered? Unfortunately, the results by Beyer et al., Hinneburg et al., and Aggarwal et al. cannot be extended to the case where the number of data oints is arbitrarily large. Indeed, the bounds on the relative contrast deend on the number of oints. Furthermore, if the values the data oints P ðjþ can take are unbounded, the notion of relative contrast may not be relevant anymore since it relies on maximum and minimum values. In contrast to Beyer et al. s, Hinneburg et al. s, and Aggarwal et al. s results, Demartines and Yianilos ones do not refer to a finite number of oints but rather to a distribution. Unfortunately, they consider Minkowski norms only. An interesting result would be to extend their results to fractional -norms. For that urose, it is roosed to use the ratio qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi VarðkXk Þ RV F; ¼ ð8þ EðkXk Þ as a measure of the concentration; RV F; will be called the relative variance of the norm. It is nonnegative; a small value indicates that the distribution of the norm is concentrated, whereas a larger one indicates a wide effective range of norm values. Intuitively, we can see that RV F; measures the concentration by relating a measure of sread (standard deviation) to a measure of location (exectation). In that sense, it is similar to the relative contrast that also relates a measure of sread (range) to a measure of location (minimum). The main result of this section is that, regarding the relative variance, all -norms concentrate as the dimensionality of the sace increases. This is stated more recisely in Theorem 5, deduced from two lemmas that, resectively, characterize the individual behaviors of the variance and of the exectation of the norm. Those lemmas are resented here while their resective roofs can be found in Section 5.. Lemma. Let X ¼ðX ; ;X d Þ be a random vector with i.i.d. comonents X i F. Then, EðkXk Þ ¼ c ð9þ d =

6 878 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 9, NO. 7, JULY 007 with c being a constant indeendent of the dimension but related to and to the distribution F. Therefore, it aears that the exectation of the -norm of a random vector grows as the th root of the dimension. Note that this result is consistent with () in the euclidean case ð ¼ Þ. The second lemma concerns the variance of the norm. Lemma. Let X ¼ðX ; ;X d Þ be a random vector with i.i.d. comonents X i F. Then, VarðkXk Þ ¼ c 0 ð0þ d with c 0 being a constant indeendent of the dimension but related to and to the distribution F. The deendency on d indicates that the variance remains constant with the dimension for the euclidean distance. We can also see that the variance decreases when the dimension increases for -norms with > and increases for -norms with <. This is consistent with Demartines and Hinneburg et al. s results. From these Lemmas, it can be shown that all -norms concentrate as the dimension increases. Theorem 5. Let X ¼ðX ; ;X d Þ be a random vector with i.i.d. comonents X i F. Then, qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi VarðkXk Þ EðkXk Þ ¼ 0; that is, the relative variance decreases toward zero when the dimension grows. The roof can be found in Section 5.. From Theorem 5, it can be concluded that the concentration of the norms in high-dimensional saces is really an intrinsic concentration roerty of the norms and not a side effect of the finite samle size or from the Emty Sace Phenomenon. This result extends Demartines one to all -norms and does not deend on the samle size. Moreover, although all -norms concentrate as the dimension increases, they do not concentrate in the same way. Characterizing these differences with resect to is the toic of Section Imact of the Value of on the Concentration In the revious section, it has been shown that all -norms concentrate, whatever is the value of ð >0Þ. In this section, the relationshi between the relative variance and the value of for a given dimension will be made exlicit. Theorem 6. Let X ¼ðX ; ;X d Þ be a random vector with i.i.d. comonents X i F. Then, ffiffiffi d q ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi VarðkXk Þ EðkXk Þ ¼ F; ; F; where F; ¼ EðjX i j Þ, and F; ¼ VarðjX ij Þ. If d is large, we can thus aroximate the relative variance by RV F ; ¼ q ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi VarðkXk Þ EðkXk Þ ffiffiffi d F; F; This means that, for a fixed (large) dimension, the relative variance evolves with as K F; ¼ F; ðþ F ; The evolution of K F; with determines the value of the relative variance for a fixed dimension d. Aggarwal et al. has shown that, if the oints are uniformly distributed, the relative contrast at fixed dimension increases as decreases. We will show below and rove in Section 5.3 that, under the uniform distribution hyothesis, the relative variance increases as decreases too. However, in general, the function described in () is not always strictly decreasing with as stated in the following roosition Proosition. The relative variance () is ) a strictly decreasing function of when the variables are distributed according to a uniform distribution F over the interval [0, ], but ) there exists distributions F for which it is not the case. Fractional norms are not always less concentrated than higher order norms. The roof is given in Section 5.. There are thus data for which the -norm is more concentrated than the -norm, for instance; in general, a higher order norm can be less concentrated than a fractional norm by having a higher relative contrast or relative variance. In conclusion, using fractional norms does not always bring less concentrated distances. 4.3 The i.i.d. Hyothesis All theorems resented in Section 4 rely on the fact that data are suosed to be i.i.d. What haens if it is not the case in ractice (actually it will hardly ever be)? Are these assumtions really needed, or do they merely make the roofs easier or even feasible? This section addresses these issues. First, the identically distributed assumtion is considered. In the revious sections, we have suosed that all X i are distributed according to the same distribution X i F. If variables X i are not identically distributed, it means that each X i is distributed according to some distribution noted F i 8i X i F i. Proosition. If the data are not identically distributed, then the conclusion of Theorem 5 still holds rovided that the data are normalized. The roof is given in Section 5.3. Normalizing means here to subtract the mean from the variables and divide them by their standard deviation so that 8i EðX i Þ¼0 and VarðX i Þ¼. Normalizing data is often considered as imortant when using norms and distances, because it ensures that all variables X i will have equal influence on the comutation of the norm. If this it is not the case, the variables with the largest variances will have the largest influence on the distance value, whereas the variables with low variances will have little or no influence on the comutation of the norm.

7 FRANÇOIS ET AL. THE CONCENTRATION OF FRACTIONAL DISTANCES 879 Actually, norms will concentrate for nonidentically distributed data if they are normalized. However, if the data are not normalized, some variables may have too little effect on the distance value. The indeendent assumtion may not be encountered in ractice. If X has comonents X i that are not indeendent, it means that the joint distribution F of X ¼ ðx ; ;X d Þ differs from the roduct of the marginal distributions F i. The existence of relations between the elements of X means that X lies on a (ossibly nonlinear) submanifold of IR d. If the submanifold is nonlinear, the euclidean norm, for instance, measures the distances between data oints using short cuts, that is, through a straight line in the sace; this can be very different from a geodesic distance measured along the manifold [4], [43]. However, the manifold can be reresented in a vector sace IR dint whose dimension is the number of degrees of freedom of the manifold. Dimension d int is called the intrinsic dimension and d the embedding dimension. Each realization of X in the embedding sace may then be maed to a unique realization in a d int -dimensional rojection sace. One may thus exect that the asymtotic roerties of the norms behave similarly in both saces. Actually, the measure of intrinsic dimensionality roosed by Chavez et al. [] is recisely the inverse of the square of the relative variance. Korn et al. [] relate the concentration to the fractal dimension of the data, which is another way to measure its intrinsic dimensionality. The fact that the concentration deends on the intrinsic dimension of the data is often admitted [], even if there is no consensus about the actual definition of the intrinsic dimension. As a consequence, high-dimensional data that resent a lot of correlation or deendencies between variables will be much less concentrated than if all variables are indeendent. The conclusion we can draw is that the concentration henomenon deends on the intrinsic dimension of the data more than on the dimension of the embedding sace. 5 PROOFS AND ILLUSTRATIONS In this section, roofs are given for Lemmas and, for Theorems 5 and 6, and for Proositions and. 5. Proof of Theorem 5 The roof of Theorem 5 is based on Lemmas and. Therefore, the roofs of the lemmas are given first. tu 5.. Proof of Lemma The roof requires two stes. First, it will be shown that, under the assumtions of Lemma, we have kxk P ¼ c ¼ ; ðþ d = where c is a constant indeendent of the dimension d of X (Ste ). Then, this result will be extended to the exectation of the norm (Ste ). Ste. Let S i ¼jX i j for i ¼ ;...;d. The S i are i.i.d. as well; let F; be their exectation. The Strong Law of Large Numbers (SLLN) allows us to write Then, " # P d Xd S i ¼ F; ¼ P4 or, by definition of S i, P4 d = d Xd X d! 3 = = S i ¼ F ; 5 ¼! 3 = jx i j = ¼ F; 5 ¼ This is nothing else than kxk P d ¼ = F; = ¼ ; which concludes the roof with c ¼ = F;. Ste. By (), for any realization of X excet some, a subset of IR d with measure 0, we have Thus, Z IR d n F ðþ kk kk d ¼ = F; = d d ¼ = Z IR d n ¼ F; = F ðþ F ; = d ð3þ ð4þ with the last equality coming from the fact that = F; is bounded. In addition, we have Z Z kk F ðþ d IR d n d ¼ kk = IR dfðþ d d ð5þ = Z ¼ IR dfðþkk d d ð6þ = since is of measure 0. Combining (4) and (6) gives Z kk d ¼ IR d d= EðkXk Þ d = ¼ F; = ; ð7þ which concludes the roof since the right-hand side of (7) is indeendent of the dimension d. tu 5.. Proof of Lemma Let F; be the exectation of variable jx i j and F; its variance. For random vectors X of dimension d and a given -norm, we have

8 880 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 9, NO. 7, JULY 007 Since we have Therefore, VarkXk ¼ d = kxk EkXk ¼ kxk EkXk d == VarkXk d = E kxk EkXk d = " ¼ E kxk EkXk # d == kxk EkXk r ; EkXk P r¼0 kxkr = kxk EkXk ffiffiffi d ¼ r P r¼0 P r¼0 ðkxk d = kxk d = r EkXk d = 0 Þ= ðkxk EkXk ffiffiffi 3 6B d C 7 ¼ E4@ A 5 r EkXk d = Using the theorem by Lebesgue and Fatou about the convergence of integrable functions, we can swa the it and the exectation oerators VarkXk d = Þ= EkXk ffiffiffi d ðkxk ¼ E r 5 P r¼0 ðkxk d = r EkXk d = 3 Þ= ðkxk EkXk ffiffiffi d ¼ E6 4 7 r 5 P r¼0 ðkxk d = r EkXk d = The transition from the former to the latter is allowed since the denominator does not tend toward zero. Because of Lemma, we know that EkXk tends toward a d = constant as d increases. Furthermore, we have seen that kxk almost surely, that is, with robability, also tends d = toward a constant. Therefore, the denominator almost surely tends toward a constant as the dimension grows " X kxk r EkXk # r P ¼ ðþ= d r¼0 = d = F ; 3 r ð8þ is, which means that, with robability, Þ= E VarkXk ðkxk EkXk ffiffiffi d ¼ d = ðþ= F; If we focus on the numerator now, we notice that kxk EkXk X d ¼ ¼ Xd jx i j EkXk 0 EkXk jx i d Using the result from Lemma, we write kxk EkXk The numerator can now be written as Since we have E4 E Xd E4 X d " # jx i j F; X d However, Var A X d ¼ jx i j F; jx i j! 3 F ; =d5 ¼ Xd ¼ Xd ¼ Xd ¼ 0; jx i j! 3 F; 5 ¼ Var X d jx i j! F; ¼ Xd leading to the conclusion that E4 X d E jx i j F; E½jX i j Š F ; F; F ; ¼ Xd X d ¼ d F ; jx i j! F; Var jx i j F ; VarðjX i j Þ jx i j! 3 F; =d5 ¼ F; ð9þ Using exression (9) for the numerator and (8) for the denominator, the result is that " # VarkXk F; P ¼ ¼ ð0þ d = ð ðþ= F; Þ The same arguments as in Ste from the roof of Lemma can be used to get

9 FRANÇOIS ET AL. THE CONCENTRATION OF FRACTIONAL DISTANCES 88 VarkXk F; ¼ d = ð ðþ= F; Þ ; ðþ which roves the lemma since the right-hand side of () is indeendent of the dimension d. tu 5..3 Proof of Theorem 5 Lemmas and are used to rove Theorem 5. Writing qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi VarkXk EkXk ¼ ffiffiffiffiffiffiffiffiffiffiffiffiffi VarkXk d == d = EkXk d = and taking the it qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi VarkXk EkXk ¼ ffiffiffiffiffiffiffiffiffiffiffiffiffi VarkXk d == we have, by Lemma and Lemma qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi VarkXk EkXk ¼ ffiffiffi c c 0 EkXk d = d = ; d= ¼ 0 5. Proof of Theorem 6 and Proosition Similar to Theorem 5, the roof of Theorem 6 is based on Lemmas and. Theorem 6 is roven as follows From Lemmas and, ffiffiffi d q ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi VarkXk EkXk ¼ ffiffiffiffiffiffiffiffiffiffiffiffiffi VarkXk d == EkXk d = ffiffiffi c ¼ 0 c ut ðþ Using the values of c and c 0, resectively, from (7) and (), we have ffiffiffi c c 0 ¼ sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi F; F; F; ¼ F; F; ð3þ Proosition contains two claims; the roofs are, resectively, given in Part a and Part b. Part a. If F is the uniform distribution over the interval [0, ], F; is given by F; ¼ þ Since F; ¼ F; F;, we have F; ¼ þ sffiffiffiffiffiffiffiffiffiffiffiffiffi þ Fig. 3. Relative variance for data distributed as F, as a function of. We can see that a maximum is obtained for a rather large value of, far from. Therefore, F; F ; ¼ sffiffiffiffiffiffiffiffiffiffiffiffiffi ð4þ þ We can conclude that, under the uniform distribution and large dimension d hyotheses, the relative variance decreases with. The concentration of the norm thus increases as grows. Part b. A counterexamle is rovided to rove assertion b of the roosition. Let us consider a situation where data are disatched into two Gaussian clusters with variance ¼ and, resectively, centered on and. The marginal distribution of each X i is then F ðþ ¼ cst ð e Þ þ e ð Þ þ ð5þ In this examle, the relative variance is higher for higher order norms than for fractional norms, as illustrated in Fig. 3; consequently, fractional norms are more concentrated than higher order norms with values of ½8; 30Š. The next examles are taken from real data sets used by Aggarwal et al. in [4] the segmentation data set and the Wisconsin Diagnostic Breast Cancer (WDBC) data set from the University of California, Irvine (UCI) Machine Learning Reository [44]. Fig. 4 shows a lot of the relative contrast as a function of. In Fig. 4a (the segmentation data set), between ¼ and ¼, the relative contrast is actually increasing. This is a real examle where the euclidean norm is actually less concentrated than the -norm. Fig. 4b is even more interesting. Here, the relative contrast is consistently better for higher order norms than for fractional norms. tu 5.3 Proofs for Proosition In Section 5., the roof of Theorem 5 was built using the Law of Large Numbers. Although this law is often stated with the i.i.d. hyothesis, the identically distributed one is sufficient but not necessary. Actually, if the data are normalized, the assumtion of identical distributions is not necessary. It has been shown that a less restrictive sufficient condition for the Law of Large Numbers to hold is that

10 88 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 9, NO. 7, JULY 007 Fig. 4. (a) Relative contrast for the segmentation and (b) the WDBC data sets. The relative contrast is higher for higher order norms than that for the fractional norms. ZZ F i ð i Þ¼ F ð ;... i ;...; d Þ d... d i d iþ... d d the set of X i must be uniformly integrable [45]. This means that, for any >0, there exists some K< such that for every X i Z F i ðþjjd < ð6þ jx ijk Note that (6) is actually the exectation of jx i j restricted to the values jx i jk. It is equivalent to [45] 9k > su EjX i j k < ð7þ i If the data are normalized, EðX i Þ¼0, and VarðX i Þ¼ for all i. We then have ¼ Var X i ¼ E X i E X i ¼ E X i ¼ EjX ij Taking k ¼ in (7) leads to the conclusion that if the data are centered and reduced, as often done in ractice, then they are uniformly integrable. In this case, the Law of Large Numbers still holds without the assumtion of equal distributions. Therefore, all subsequent results are valid rovided that the data are normalized; this roves Proosition. To show that the indeendent art of the i.i.d. hyothesis is not necessary either, we will comare the concentration of norms in real data sets with the same embedding dimensions but different intrinsic dimensions. Suose we have some data described in a d-dimensional sace for which we know that the intrinsic dimension d int is lower. Such a situation occurs, for examle, when the variables are significantly correlated. We will see that, for such data, the value of the relative variance is much more similar to the one of a d int -dimensional data set with indeendent variables than of a d-dimensional data set with indeendent variables. For that urose, we will ensure that marginal distributions, that is, the distribution of each variable taken searately, are identical in both data sets. Suose we have ¼fx ðjþ g n j¼, a samle drawn from X ¼ðX ; ;X d ÞFð ; ;...; d Þ, where F is the multivariate robability density function of X. The marginal distributions of X i are To roduce a data set 0 that is marginally identically distributed as, we roose the following If we consider a matrix where each row corresonds to a data element x ðjþ and each column to a variable X i, the values in each column are randomly ermuted. As a consequence, the marginal distributions of each variable will not change. By contrast, all relationshis between variables are destroyed in the rocess. Therefore, we obtain a samle 0 that is marginally distributed as, but where the comonents are now indeendent; the intrinsic dimension of 0 is thus equal to its embedding dimension. Let us denote by RV d ; the estimation of RV F; with the euclidean norm given the data set with high embedding dimension and low intrinsic dimension. We will comare this value to the value of the relative variance of 0 ðrv d 0 ;Þ that has a high intrinsic dimension. Moreover, we will comare those values to the value of the relative variance for a data set made of a small number (tyically 0 times lower than the original number of variables) of lowcorrelated variables from ðrv d s; Þ (small embedding dimension). The relative variance of a uniformly distributed synthetic data set of dimensionality d ðrvd rand; Þ (high intrinsic dimension) is also comuted. It is exected that the relative variances for data sets with low intrinsic dimension will be similar while being much lower than the relative variances of the data sets with high intrinsic dimension. The relative variance for a data set ¼fx ðjþ g n j¼ is estimated as follows rffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi P n j¼ RV d kxðjþ k P n j¼ kxðjþ k X; ¼ P n j¼ kxðjþ k The data sets mentioned in Table are high-dimensional data coming from various domains. The first two data sets are the near-infrared sectra of ale and meat,

11 FRANÇOIS ET AL. THE CONCENTRATION OF FRACTIONAL DISTANCES 883 TABLE Relative Variances for Several Real Data Sets Comared with the Relative Variances of Artificial i.i.d. Data Sets The numbers between arentheses are the number of variables used to build X s. resectively, the next two data sets are recorded sounds of yes and no, and of boat and goat [46]. The ionoshere and the musk data sets come from the UCI machine learning reository. 3 The dimensionality goes from 34 to more than 8,000. As exected, it is seen in Table that the relative variance of the original data set (column ) is very similar to the relative variance of a subset of its variables (column 3). In contrast, the relative variance for the crafted data set 0 (column ) is very similar to the relative variance of a random uniformly distributed data set of the same dimension (column 4). tu 6 ABOUT THE OPTIMAL VALUE OF Throughout the receding sections, we saw that all fractional distances concentrate in high-dimensional saces and that concentration deends nontrivially on the value of. This concentration has negative consequences on indexing methods and leads to the questioning of the meaningfulness of the NN search when the distances seem to be all identical. This motivates the use of alternative metrics to the widely used euclidean one. 6. Fractional Norms and Non-Gaussian Noise One more argument can motivate the use of fractional norms. It can easily be shown that the concet of euclidean distance and the concet of Gaussian white noise are intimately tied. A Gaussian white noise is a noise that has a normal distribution with zero mean and equal variance for all variables. Looking for the most similar data element to a query oint by minimizing the euclidean distance is equivalent to choosing the NN to be the oint most likely to be the query oint under the Gaussian white noise scheme. In low-dimensional saces, most often, a Gaussian white noise is an accetable assumtion. However, when the dimension increases, other noise schemes might be more aroriate. The white Gaussian noise is a model that describes small alterations gently distributed over the coordinates. Obviously, in low-dimensional saces, this is the only kind of noise we can coe with. Imagine now a noise scheme that models large alterations of only some of the coordinates instead of small alterations of all coordinates.. Available at htt// Available at htt//nathalie.vialaneix.free.fr/maths/article.h3?id _article=0. 3. Available at htt// This kind of noise will generate the so-called outliers in low-dimensional saces, but in higher dimensions, where more information is available for each data element, it can just be handled as a noise scheme. Examles of such noise schemes are the so-called Salt and Peer noise on images and Burst noise on time series; encoding errors may also be viewed as noise that sometimes dramatically affects a small number of the coordinates. In [47], a colored noise scheme is studied, which concentrates its effects on some of the coordinates while leaving the other ones nearly unchanged. The exeriments on high-dimensional data show that, for such a noise, fractional norms are better at identifying the real NNs, that is, the original oint when the noise is removed. For Gaussian white noise, however, the euclidean norm gives better results. Similarly, in [4], the exeriments show that fractional norms are better suited for classification when masking noise is alied. This noise scheme randomly changes some values of the coordinates; it is very different from Gaussian noise. All these results clearly illustrate the fact that the otimal value of is highly alication deendent. 6. Choosing the Norm for Regression/ Classification In a rediction or classification roblem, the value of could be chosen so as to get the best model erformances, according to the exected error in redicting the resonse value or the class label. It would thus be considered as an additional arameter to the model The norm that is chosen is the norm that minimizes the differences between the true resonse values or class labels and the redicted ones. However, this would necessitate multilying the comutation times by as many different values of are tested. An elegant alternative is to choose the value rior to the building of the model. In this case, the norm is chosen before any rediction model is constructed. It is chosen according to a statistical measure of relevance for each -norm that is considered. This statistical measure gives each -norm a score based on how well similar data elements according to the -norm relate to similar resonse values or class labels. If the data elements that are close in the data sace, that is, similar according to the norm, are also close in terms of associated resonse value, then the norm is considered relevant as a measure of similarity for those data. Nonarametric noise estimators such as the Gamma test [48] or the Differogram [49] are suitable for this, as well as the erformances of a -NN model. This latter strategy (-NN model) is illustrated in the following exeriments. We consider real data, namely, the Housing 4 data set, and the Wine, Tecator, and Ale 5 data sets. The objective in the Housing data set is to redict the values of houses (in thousand dollars) described by 3 attributes reresenting the demograhic statistics of the area around each house. The data set contains 506 instances slit into 338 learning examles and 69 for testing. The 4. Available at the UCI reository. 5. All of them are available at htt//

12 884 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 9, NO. 7, JULY 007 TABLE Performances of the RBFN on Several Data Sets with the Euclidean Norm and with Fractional Norms The data are altered with burst noise. The norm in the RBFN is chosen according to the leave-one-out erformances of a -NN. other three data sets are sectral data for the Wine, the alcohol content must be redicted, for the Tecator, it is the fat content, and for the Ale data set, it is the sugar content. A burst noise was added to the data Some coordinates were altered with a multilicative noise of high amlitude. With such a noise scheme, strongly non- Gaussian fractional norms are better suited to measure the similarity. The leave-one-out error of a -NN is used to measure the relevance of each norm and to choose the most relevant one. Then, a Radial Basis Function Network (RBFN) [50] is built with the chosen norm. The arameters of the RBFN are chosen by fourfold cross validation. Table reorts the Root Mean Square Error (RMSE) of rediction on an indeendent test set for all the data sets. For each of these data sets, the fractional norm, chosen rior to the building of the model, gives better results than the euclidean norm. 6.3 Choosing the Norm in Content-Based Similarity Search In many multimedia database systems, the user feedback can be used to estimate the relevance of the result of the search for similar elements. This can be translated into the relevance of the metric used to get similar elements. For instance, in image and text retrieval, the relevance measure is used to weigh the coordinates in the comutation of the distance to better reflect the erceived (subjective) similarity between objects [5], [5]. A relevance feedback algorithm is then given as follows Suose that all elements are described by a feature vector; given a query Q, the system retrieves the nearest or the -NNs according to several -norms (with ¼ 0; 05; ; ; 4, for instance). The user then identifies the most relevant results, and the score of the -norms corresonding to those results is increased. After several iterations, the -norm with the highest score is chosen. To illustrate this rocedure, the XMVTS database is used. This database is comrised of 600 hotograhs, altered with Salt and Peer noise (some ixels are set to black or white randomly), of 00 individuals (that is, three ictures er individual.) At each iteration, a icture from the data set is considered as a query oint. Its NN according to several -norms are retrieved, and the score of the norms for which the retrieved image corresonds to the same individual is increased by one. Fig. 5 shows the evolution of the score for each -norm at iterations to 0 and iteration 580 to 600. We can see that the /-norm is the one with the highest score after 0 iterations and still the one after all 600 iterations. The /4 and /8-norms also have high scores in contrast to higher order norms that erform oorly. The same exeriment was reeated 00 times with only 0 iterations. In 86 ercent of these exeriments, the /-norm was identified as the most relevant. The /8-norm was chosen as the most relevant three times, the /4-norms only once, the -norm 0 times, and the other (higher order) norms were never chosen. In conclusion, fractional norms should be used when they are a more relevant measure of similarity and, hence, increase the erformances, rather than because of concentration considerations. 7 CONCLUSIONS A comarison between data elements is often erformed using the euclidean norm and distance. In high-dimensional saces, however, norms concentrate. This means that all airwise distances in a data set are very similar and can Fig. 5. The evolution of the score of each norm as a function of the number of iterations for the face image database. The /-norm aears as the most relevant from iteration 0 onward.

13 FRANÇOIS ET AL. THE CONCENTRATION OF FRACTIONAL DISTANCES 885 lead to questioning the relevance of the euclidean distance for measuring similarities in high-dimensional saces. This aer considers the roblem of the concentration of the distances indeendently from the number of data and shows that the concentration is indeed an intrinsic roerty of the distances and not due to finite samle size. This aer furthermore shows that. all -norms concentrate, even when an infinite number of data (that is, a distribution) is considered, including when the distribution is unbounded;. the value of and the shae of the distribution of high-dimensional data influence the value of the relative variance, which is derived for any distribution and used as a measure of concentration;. consequently, the exonent of the norm can be adjusted to fight the concentration henomenon;. there exist distributions for which the relative contrast does not increase when the exonent of the norm decreases;. the identically distributed hyothesis in the concentration of the norm theorems is not necessary as soon as the variables are normalized;. the concentration henomenon is more related to the intrinsic dimension of data than to their embedding dimension, which makes its consequences in ractical situations less severe than mathematically exected; and. the otimal metric is highly alication-deendent, and some sort of suervision is needed to otimally choose the metric. Fractional norms are not always less concentrated than other norms. They seem, however, to be more relevant as a measure of similarity when the noise affecting the data is strongly non-gaussian. ACKNOWLEDGMENTS The work of Damien Francois is funded by a grant from the Belgian FRIA. Part of his work and of the work of Vincent Wertz is suorted by the Interuniversity Attraction Pole (IAP), initiated by the Belgian Federal State, Ministry of Sciences, Technologies and Culture. The scientific resonsibility rests with the authors. REFERENCES [] E. Chavez, G. Navarro, R.A. Baeza-Yates, and J.L. Marroquin, Searching in Metric Saces, ACM Comuting Surveys, vol. 33, no. 3,. 73-3, Set. 00. [] K. Fukunaga, Introduction to Statistical Pattern Recognition, second ed. Academic Press Professional, 990. [3] A.K. Jain, M.N. Murty, and P.J. Flynn, Data Clustering A Review, ACM Comuting Surveys, vol. 3, no. 3, , Set [4] P. Demartines, Analyse de Données ar Réseaux de Neurones Auto-Organisés, PhD dissertation, Institut Nat l Polytechnique de Grenoble, Grenoble, France, 994 (in French). [5] K.S. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft, When Is Nearest Neighbor Meaningful, Proc. Seventh Int l Conf. Database Theory,. 7-35, Jan [6] Y. Tao, J. Sun, and D. Paadias, Analysis of Predictive Satio- Temoral Queries, ACM Trans. Database Systems, vol. 8, no. 4, , Dec [7] N. Katayama and S. Satoh, Distinctiveness-Sensitive Nearest- Neighbor Search for Efficient Similarity Retrieval of Multimedia Information, Proc. 7th Int l Conf. Data Eng. (ICDE 0), , Ar. 00. [8] N. Katayama and S. Satoh, Similarity Image Retrieval with Significance-Sensitive Nearest-Neighbor Search, Proc. Sixth Int l Worksho Multimedia Information Systems (MIS 00), , Oct [9] J. Yang, A. Patro, S. Huang, N. Mehta, M.O. Ward, and E.A. Rundensteiner, Value and Relation Dislay for Interactive Exloration of High Dimensional Datasets, Proc. IEEE Sym. Information Visualization (InfoVis 04), W.A. Burks, ed., , Oct [0] E. Tuncel, H. Ferhatosmanoglu, and K. Rose, Vq-Index An Index Structure for Similarity Searching in Multimedia Databases, Proc. 0th ACM Int l Conf. Multimedia, , Dec. 00. [] N. Mamoulis, D.W. Cheung, and W. Lian, Similarity Search in Sets and Categorical Data Using the Signature Tree, Proc. 9th Int l Conf. Data Eng. (ICDE 03), , Mar [] P. Ciaccia and M. Patella, Pac Nearest Neighbor Queries Aroximate and Controlled Search in High-Dimensional and Metric Saces, Proc. 6th Int l Conf. Data Eng. (ICDE 0), W.A. Burks, ed., , Mar [3] M. Demirbas and H. Ferhatosmanoglu, Peer-to-Peer Satial Queries in Sensor Networks, Proc. Third IEEE Int l Conf. Peer-to- Peer Comuting (PP 03),. 3-39, Set [4] D.S. Johnson, S. Krishnan, J. Chhugania, S. Kumar, and S. Venkatasubramanian, Comressing Large Boolean Matrices Using Reordering Techniques, Proc. 30th Int l Conf. Very Large Data Bases (VLDB 04),. 3-3, Set [5] K. Chakrabarti and S. Mehrotra, The Hybrid Tree An Index Structure for High Dimensional Feature Saces, Proc. 5th Int l Conf. Data Eng. (ICDE 99), , Feb [6] K. Chakrabarti and S. Mehrotra, Local Dimensionality Reduction A New Aroach to Indexing High Dimensional Saces, Proc. 6th Int l Conf. Very Large Data Bases (VLDB 00), , Set [7] T. Liu, A. Moore, A. Gray, and K. Yang, An Investigation of Practical Aroximate Nearest Neighbor Algorithms, Proc. 8th Ann. Conf. Neural Information Processing Systems (NIPS 04), , Dec [8] K.P. Bennett and U. Fayyad Dan Geiger, Density-Based Indexing for Aroximate Nearest-Neighbor Queries, Proc. Fifth ACM SIGKDD Int l Conf. Knowledge Discovery and Data Mining, , Aug [9] S. Berchtold, C. Böhm, H.V. Jagadish, H.-P. Kriegel, and J. Sander, Indeendent Quantization An Index Comression Technique for High-Dimensional Data Saces, Proc. 6th Int l Conf. Data Eng. (ICDE 00), , Mar [0] K.Y. Yi, D.W. Cheung, and M.K. Ng, A Highly-Usable Projected Clustering Algorithm for Gene Exression Profiles, Proc. Worksho Data Mining in Bioinformatics (BIOKDD 03),. 4-48, Aug [] M. Vlachos, D. Gunoulos, and G. Kollios, Robust Similarity Measures for Mobile Object Trajectories, Proc. 3th Int l Worksho Database and Exert Systems Alications (DEXA 0),. 7-78, Set. 00. [] F. Korn, B.-U. Pagel, and C. Faloutsos, On the Dimensionality Curse and the Self-Similarity Blessing, IEEE Trans. Knowledge and Data Eng., vol. 3, no.,. 96-, Jan./Feb. 00. [3] A. Hinneburg, C.C. Aggarwal, and D.A. Keim, What Is the Nearest Neighbor in High Dimensional Saces, Proc. 6th Int l Conf. Very Large Data Bases (VLDB 00), A. El Abbadi, M.L. Brodie, S. Chakravarthy, U. Dayal, N. Kamel, G. Schlageter, and K.-Y. Whang, eds., , Set [4] C.C. Aggarwal, A. Hinneburg, and D.A. Keim, On the Surrising Behavior of Distance Metrics in High Dimensional Saces, Proc. Eighth Int l Conf. Database Theory, J. Van den Bussche and V. Vianu, eds., , Jan. 00 [5] C.C. Aggarwal, Re-Designing Distance Functions and Distance- Based Alications for High Dimensional Data, Proc. ACM Int l Conf. Management of Data (SIGMOD 0), vol. 30, no.,. 3-8, Mar. 00. [6] K. Doherty, R. Adams, and N. Davey, Non-Euclidean Norms and Data Normalisation, Proc. th Euroean Sym. Artificial Neural Networks, M. Verleysen, ed., Ar. 004.

886 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 9, NO. 7, JULY 007 [7] P. Howarth and S.M. Rüger, Fractional Distance Measures for Content-Based Image Retrieval, Proc. 7th Euroean Conf.

Zhou, An Adative and Efficient Dimensionality Reduction Algorithm for High-Dimensional Indexing, Proc. 9th Int l Conf. Data Eng. (ICDE 03),. 87-98, Mar. 003. [9] C.

Skala, Measuring the Difficulty of Distance-Based Indexing, Proc. th Int l Conf. String Processing and Information Retrieval (SPIRE 05), M.P. Consens and G. Navarro, eds.,. 03-4, Nov. 005. [3] P.

14 886 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 9, NO. 7, JULY 007 [7] P. Howarth and S.M. Rüger, Fractional Distance Measures for Content-Based Image Retrieval, Proc. 7th Euroean Conf. Information Retrieval Research (ECIR 05), J.M. Fernandez-Luna and D.E. Losada, eds., , Mar. 005 [8] H. Jin, B.C. Ooi, H. Shen, C. Yu, and A. Zhou, An Adative and Efficient Dimensionality Reduction Algorithm for High-Dimensional Indexing, Proc. 9th Int l Conf. Data Eng. (ICDE 03), , Mar [9] C. Elkan, Using the Triangle Inequality to Accelerate K-Means, Proc. 0th Int l Conf. Machine Learning (ICML 03), , Aug [30] M. Skala, Measuring the Difficulty of Distance-Based Indexing, Proc. th Int l Conf. String Processing and Information Retrieval (SPIRE 05), M.P. Consens and G. Navarro, eds.,. 03-4, Nov [3] P. Yianilos, Excluded Middle Vantage Point Forests for Nearest Neighbor Search, technical reort, NEC Research Inst., Jan. 999, resented at Proc. Sixth Center for Discrete Mathematics and Theoretical Comuter Science (DIMACS) Imlementation Challenge Near Neighbor Searches Worksho. [3] S. Berchtold, C. Boehm, B. Braunmueller, D.A. Keim, and H.-P. Kriegel, Fast Similarity Search in Multimedia Databases, Proc. ACM Int l Conf. Management of Data (SIGMOD 97), J. Peckham, ed., May 997. [33] N. Katayama and S. Satoh, The Sr-Tree An Index Structure for High-Dimensional Nearest Neighbor Queries, Proc. ACM Int l Conf. Management of Data (SIGMOD 97), J. Peckham, ed., , May 997. [34] S. Berchtold, D.A. Keim, and H.-P. Kriegel, The X-Tree An Index Structure for High-Dimensional Data, Proc. nd Int l Conf. Very Large Data Bases (VLDB 96), T.M. Vijayaraman, A.P. Buchmann, C. Mohan, and N.L. Sarda, eds.,. 8-39, Set [35] K.-I. Lin, H.V. Jagadish, and C. Faloutsos, The TV-Tree An Index Structure for High-Dimensional Data, VLDB J., vol. 3, no. 4, , 994. [36] P. Indyk and R. Motwani, Aroximate Nearest Neighbors Towards Removing the Curse of Dimensionality, Proc. 30th Ann. ACM Sym. Theory of Comuting, , May 998. [37] C. Böhm, S. Berchtold, and D.A. Keim, Searching in High- Dimensional Saces Index Structures for Imroving the Performance of Multimedia Databases, ACM Comuting Surveys, vol. 33, no. 3, , Set. 00. [38] S. Brin, Near Neighbor Search in Large Metric Saces, Proc. st Int l Conf. Very Large Data Bases (VLDB 95), U. Dayal, P.M.D. Gray, and S. Nishio, eds., , Set [39] S. Berchtold, C. Bohm, D.A. Keim, and H.-P. Kriegel, A Cost Model for Nearest Neighbor Search in High-Dimensional Data Sace, Proc. 6th ACM SIGACT-SIGMOD-SIGART Sym. Princiles of Database Systems (PODS 97), , May 997. [40] R. Weber, H.-J. Schek, and S. Blott, A Quantitative Analysis and Performance Study for Similarity-Search Methods in High- Dimensional Saces, Proc. 4th Int l Conf. Very Large Data Bases (VLDB 98), A. Guta, O. Shmueli, and J. Widom, eds., , Aug [4] R. Bellmann, Adative Control Processes A Guided Tour. Princeton Univ. Press, 96. [4] J.B. Tenenbaum, V. de Silva, and J.C. Langford, A Global Geometric Framework for Nonlinear Dimensionality Reduction, Science, vol. 90, no. 5500, , Dec [43] J.A. Lee, A. Lendasse, and M. Verleysen, Nonlinear Projection with Curvilinear Distances Isoma versus Curvilinear Distance Analysis, Neurocomuting, vol. 57, , 003. [44] C.L Blake, D.J. Newman, S. Hettich, and C.J. Merz, UCI Reository of Machine Learning Databases, htt// edu/~mlearn/mlreository.html, 998. [45] D. Landers and L. Rogge, Laws of Large Numbers for Pairwise Indeendent Uniformly Integrable Random Variables, Math. Nachrichten, vol. 30,. 89-9, 987. [46] G. Biau, F. Bunea, and M.H. Wegkam, Functional Classification in Hilbert Saces, IEEE Trans. Information Theory, vol. 5, no. 6,. 63-7, 005. [47] D. Francois, V. Wertz, and M. Verleysen, Non-Euclidean Metrics for Similarity Search in Noisy Datasets, Proc. Euroean Sym. Artificial Neural Networks (ESANN 05), , 005. [48] A.J. Jone, A. Stefansson, and N. Koncar, A Note on the Gamma Test, Neural Comuting and Alications, vol. 5, no. 3,. 3-33, Set [49] K. Pelckmans, J. De Brabanter, J.A.K. Suykens, and B. De Moor, The Differogram Nonarametric Noise Variance Estimation and Its Use for Model Selection, Neurocomuting, vol. 69, no.,. 00-, Dec [50] M. Powell, Radial Basis Functions for Multivariable Interolation A Review, Algorithms for Aroximation, M.G. Cox and J.C. Mason, eds., , Clarendon Press, 987 [5] Y. Rui, T.S. Huang, M. Ortega, and S. Mehrotra, Relevance Feedback A Power Tool in Interactive Content-Based Image Retrieval, IEEE Trans. Circuits and Systems for Video Technology, secial issue on segmentation, descrition, and retrieval of video content, vol. 8, no. 5, , Set [5] G. Salton and C. Buckley, Mroving Retrieval Performance by Relevance Feedback, J. Am. Soc. for Information Science, vol. 4, , June 990. Damien François received the MSc degree in comuter science and the PhD degree in alied mathematics from the Université catholique de Louvain, Belgium, in 00 and 007, resectively. He is now a research assistant at the Center for System Engineering and Alied Mechanics (CESAME) and a member of the Machine Learning Grou at the Université catholique de Louvain. His main interests include high-dimensional data analysis, distance-based rediction models, and feature/model selection. Vincent Wertz received the engineering degree in alied mathematics in 978 and the PhD degree in control engineering in 98 from the Université catholique de Louvain, Louvain-la- Neuve, Belgium. He is now a rofessor at the Université catholique de Louvain after having held a ermanent osition with the Belgian National Fund for the Scientific Research (NFSR). His main research interests are in the fields of identification, redictive control, fuzzy control, and industrial alications. Lately, he has also been involved in a major edagogical reform of the first and second year teaching rogram in the School of Engineering. He is a member of the IEEE. Michel Verleysen received the MS and PhD degrees in electrical engineering from the Université catholique de Louvain, Belgium, in 987 and 99, resectively. He was an invited rofessor at the Swiss Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland, in 99, at the Université d Evry Val d Essonne, France, in 00, and at the Université Paris I- Panthéon-Sorbonne in 00, 003, and 004, resectively. He is now a research director at the Belgian Funds National de la Recherche Scientique (FNRS) and lecturer at the Université catholique de Louvain. He is the editor-in-chief of Neural Processing Letters, the chairman of the Annual Euroean Symosium on Artificial Neural Networks (ESANN), an associate editor of the IEEE Transactions on Neural Networks, and a member of the editorial board and rogram committee of several journals and conferences on neural networks and learning. He is an author or coauthor of about 00 scientific aers in international journals and books or communications to conferences with reviewing committees. He is the coauthor of the scientific oularization book on artificial neural networks in the series Que Sais-Je? in French. His research interests include machine learning, artificial neural networks, self-organization, time-series forecasting, nonlinear statistics, adative signal rocessing, and high-dimensional data analysis. He is a senior member of the IEEE.. For more information on this or any other comuting toic, lease visit our Digital Library at

4. Score normalization technical details We now discuss the technical details of the score normalization method.

4. Score normalization technical details We now discuss the technical details of the score normalization method. SMT SCORING SYSTEM This document describes the scoring system for the Stanford Math Tournament We begin by giving an overview of the changes to scoring and a non-technical descrition of the scoring rules