On the Behavior of Intrinsically High-Dimensional Spaces: Distances, Direct and Reverse Nearest Neighbors, and Hubness


Journal of Machine Learning Research — Submitted 3/17; Revised 2/18; Published 4/18

Fabrizio Angiulli
DIMES — Dept. of Computer, Modeling, Electronics, and Systems Engineering
University of Calabria, Rende (CS), Italy

Editor: Sanjiv Kumar

Abstract

Over the years, different characterizations of the curse of dimensionality have been provided, usually stating the conditions under which, in the limit of infinite dimensionality, distances become indistinguishable. However, these characterizations almost never address the form of the associated distributions in the finite, although high-dimensional, case. This work aims to contribute in this respect by investigating the distribution of distances, and of direct and reverse nearest neighbors, in intrinsically high-dimensional spaces. Indeed, we derive a closed form for the distribution of distances from a given point, for the expected distance from a given point to its kth nearest neighbor, and for the expected size of the approximate set of neighbors of a given point in finite high-dimensional spaces. Additionally, the hubness problem is considered, which is related to the form of the function N_k representing the number of points that have a given point as one of their k nearest neighbors, also called the number of k-occurrences. Despite the extensive use of this function, the precise characterization of its form is a longstanding problem. We derive a closed form for the number of k-occurrences associated with a given point in finite high-dimensional spaces, together with the associated limiting probability distribution. By investigating the relationships with the hubness phenomenon emerging in network science, we find that the distribution of node in-degrees of some real-life, large-scale networks has connections with the distribution of k-occurrences described herein.

Keywords: high-dimensional data, distance concentration, distribution of distances, nearest neighbors, reverse nearest neighbors, hubness

1. Introduction

Although the size and the dimensionality of collected data are steadily growing, traditional techniques usually slow down exponentially with the number of attributes to be considered and are often overcome by linear scans of the whole data. In particular, the term curse of dimensionality (Bellman, 1961) is used to refer to difficulties arising whenever high-dimensional data has to be taken into account. One of the main aspects of this curse is known as distance concentration (Demartines, 1994), which is the tendency for distances to become almost indiscernible in high-dimensional spaces. This phenomenon may greatly affect the quality and performance of machine learning, data mining, and information-retrieval techniques.

This effect results because almost all these techniques rely on the concept of distance, or dissimilarity, among data items in order to retrieve or analyze information. However, whereas low-dimensional spaces show good agreement between geometric proximity and the notion of similarity, as dimensionality increases, different counterintuitive phenomena arise that may be harmful to traditional techniques.

Over time, different characterizations of the curse of dimensionality and related phenomena have been provided (Demartines, 1994; Beyer et al., 1999; Aggarwal et al., 2001; Hinneburg et al., 2000; François et al., 2007). These characterizations usually state conditions under which, in the limit of infinite dimensionality, distances become indistinguishable. However, almost never do these conditions address the form of the associated distributions in finite, albeit high-dimensional, cases.

This work aims to contribute in this area by investigating the distribution of distances and of some related measures in intrinsically high-dimensional data. In particular, the analysis is conducted by applying the central limit theorem to the Euclidean distance random variable in order to approximate the distance probability distribution between pairs of random vectors, and between a random vector and realizations of a random vector, and to obtain the expected distance from a given point to its kth nearest neighbor. It is then shown that an understanding of these distributions can be exploited to gain knowledge of the behavior of high-dimensional spaces, specifically of the number of approximate nearest neighbors and the number of reverse nearest neighbors, which are also investigated herein.

Nearest neighbors are transversal to many disciplines (Preparata and Shamos, 1985; Dasarathy, 1990; Beyer et al., 1999; Duda et al., 2000; Chávez et al., 2001; Shakhnarovich et al., 2006). In order to try to overcome the difficulty of answering nearest neighbor queries in high-dimensional spaces (Weber et al., 1998; Beyer et al., 1999; Pestov, 2000; Giannella, 2009; Kabán, 2012), the concept of the ε-approximate nearest neighbor (Indyk and Motwani, 1998; Arya et al., 1998) has been introduced. The ε-neighborhood of a query point is the set of points located at a distance not greater than 1 + ε times the distance separating the query from its true nearest neighbor. Related to the notion of the ε-approximate nearest neighbor is the notion of neighborhood or query instability (Beyer et al., 1999): a query is said to be unstable if the ε-neighborhood of the query point consists of most of the data points. Although asymptotic results, such as that reported by Beyer et al. (1999), tell what happens when dimensionality is taken to infinity, nothing is said about the dimensionality at which nearest neighbors become unstable. Pursuant to this scenario, this paper derives a closed form for the expected size of the ε-neighborhood in finite high-dimensional spaces, an expression that is then exploited to determine the critical dimensionality. Also, to quantify the difficulty of approximate nearest neighbor search, He et al. (2012) introduced the concept of relative contrast, a measure of separability of the nearest neighbor of the query point from the rest of the data, and provided an estimate which is applicable for finite dimensions. By leveraging the results concerning distance distributions, this paper derives a more accurate estimate for the relative contrast measure.

The number N_k of reverse nearest neighbors, also called the number of k-occurrences or the reverse nearest neighbor count, is the number of data points for which a given point is among their k nearest neighbors.
Reverse nearest neighbors are of interest in the database, information retrieval, and computational geometry literatures (Korn and Muthukrishnan, 2000; Singh et al., 2003; Tao et al., 2007; Cheong et al., 2011; Yang et al., 2015), with uses having been proposed in the data mining and machine learning fields (Williams et al., 2002; Hautamäki et al., 2004; Radovanovic et al., 2009, 2010; Tomasev et al., 2014; Radovanovic et al., 2015; Tomasev and Buza, 2015), beyond being the object of study in applied probability and mathematical psychology (Newman et al., 1983; Maloney, 1983; Tversky et al., 1983; Newman and Rinott, 1985; Yao and Simons, 1996).
Despite the usefulness and the extensive use of this construct, the precise characterization of the form of the function N_k, both in the finite and in the infinite dimensional case, is a longstanding problem. What is already known is that in the limit of infinite size and dimension, N_k must converge in distribution to zero; however, this result and its interpretations seem to be insufficient to characterize its observed behavior in finite samples and dimensions. Consequently, this paper derives a closed form for the number of k-occurrences associated with a given point in finite high-dimensional spaces, together with a generalized closed form of the associated limiting probability distribution that encompasses previous results and provides interpretability of its behavior and of the related hubness phenomenon. The results, which are first illustrated for independent and identically distributed data, are subsequently extended to independent non-identically distributed data satisfying certain conditions, and then connections with non-independent real data are depicted. Finally, it is discussed how to potentially extend the approach to Minkowski metrics and, more generally, to distances satisfying certain conditions of spatial centrality.

Because hubness is a phenomenon of primary importance in network science, we also investigate whether the findings relative to the distribution of reverse nearest neighbors and the emergence of hubs in intrinsically high-dimensional contexts are connected to an analogous phenomenon occurring in the context of networks. The investigation reveals that for some real-life large-scale networks, the distribution of the incoming node degrees is connected to the herein-derived distribution of the infinite-dimensional k-occurrences function, which models the number of reverse k-nearest neighbors in an arbitrarily large feature space of independent dimensions. Hence, the provided distribution also appears to be suitable for modelling node-degree distributions in complex real networks.

The current study can be leveraged in several ways and in different contexts, such as in direct and reverse nearest neighbor search, density estimation, anomaly and novelty detection, density-based clustering, and network analysis, among others. With regard to its possible applications, we can highlight approximations of measures related to distance distributions, worst-case scenarios for data analysis and retrieval techniques, design strategies that try to mitigate the curse of dimensionality, and models of complex networks. We refer to the concluding section for a more extensive discussion.

The rest of the work is organized as follows. Section 2 discusses related work concerning the concentration of distances, intrinsic dimensionality, and the number of k-occurrences and the associated hubness phenomenon. Section 3 presents the notation used to provide the results. Section 4 introduces the main results of the paper.
Section 5 discusses relationships between the study of the hubness phenomenon occurring in high-dimensional spaces and the analogous phenomenon observed in real-life, large-scale, complex networks. Section 6 concludes the work. Finally, the Appendix contains the proofs that are not reported within the main text.

2. Related Work

As already noted, the term curse of dimensionality is used to refer to difficulties arising when high-dimensional data must be taken into account, and one of the main aspects of this curse is distance concentration. In this regard, Demartines (1994) has shown that the expectation of the Euclidean norm of independent and identically distributed (i.i.d.) random vectors increases as the square root of the dimension, whereas its variance tends toward a constant and, hence, does not depend on the dimensionality. Specifically:

Theorem 1 (Demartines, 1994, cf. Theorem 2.1) Let $X_d$ be an i.i.d. $d$-dimensional random vector with common cdf $F_X$. Then
$E[\|X_d\|_2] = \sqrt{ad - b} + O(1/d)$ and $\sigma^2_{\|X_d\|_2} = b + O(1/\sqrt{d})$,
where $a$ and $b$ are constants depending on the central moments of $F_X$ up to the fourth order, but do not depend on the dimensionality.

Demartines noticed that, because the Euclidean distance corresponds to the norm of the difference of two vectors, the distance between i.i.d. random vectors must also exhibit the same behavior. This insightful result explains why high-dimensional vectors appear to be distributed around the surface of a sphere of radius $E[\|X_d\|]$ and why, because they seem to be normalized, the distances between pairs of high-dimensional random vectors tend to be similar.

The distance concentration phenomenon is usually characterized in the literature by means of a ratio between some measure related to the spread and some measure related to the magnitude of the norm, sometimes presented as the distance from a point located at the origin of the space. In particular, the conclusion is that there is a concentration of distances when the above ratio converges to 0 as the dimensionality tends to infinity.

Some authors have studied the concentration phenomenon by representing a data set as a set of $n$ $d$-dimensional i.i.d. random vectors $X_d^j$ ($1 \le j \le n$) with not-necessarily common pdfs $f_{X_d^j}$. Specifically, the contrast is defined as the difference between the largest and the smallest observed norm, or rather the distance from a query point located at the origin, whereas the relative contrast
$RC_d = \frac{\max_j \|X_d^j\|_p - \min_j \|X_d^j\|_p}{\min_j \|X_d^j\|_p}$,
where $\|x_d\|_p = (\sum_{i=1}^d |x_i|^p)^{1/p}$ denotes the $p$-norm, is the contrast normalized with respect to the smallest norm/distance.

Theorem 2 (Adapted from Beyer et al., 1999, cf. Theorem 1) Let $X_d^j$ ($1 \le j \le n$) be $n$ $d$-dimensional random vectors with common cdfs. If
$\lim_{d\to\infty} \sigma^2\!\left( \frac{\|X_d^j\|_p}{E[\|X_d^j\|_p]} \right) = 0$,
then, for any $\varepsilon > 0$, $\lim_{d\to\infty} \Pr[RC_d \le \varepsilon] = 1$.

If the hypothesis is verified, that is, if the variance of the ratio between the norm of the vectors and their expected value vanishes as the dimensionality goes to infinity, then the relative contrast also becomes smaller and smaller, and all the vectors seem to be located at approximately the same distance from the reference vector. That is, given a query point $Q_d$, the separation between the nearest and the farthest neighbor becomes negligible:
$\lim_{d\to\infty} \Pr\!\left[ \max_j \|Q_d - X_d^j\|_p \le (1+\varepsilon) \min_j \|Q_d - X_d^j\|_p \right] = 1.$
In (Beyer et al., 1999), it is shown that i.i.d. random vectors satisfy the above condition. Other authors have provided characterizations of the concentration phenomenon by deriving upper and lower bounds on the relative contrast in the cases of Minkowski and fractional norms (Hinneburg et al., 2000; Aggarwal et al., 2001).

Subsequently, François et al. (2007) posed the following problem: is the concentration phenomenon a side effect of the Empty Space Phenomenon (Bellman, 1961), just because we consider a finite number of points in a bounded portion of a high-dimensional space? To explore this problem, they studied the concentration phenomenon by taking the same perspective as Demartines, i.e., referring to a distribution rather than to a finite set of points. The relative variance
$RV_d = \frac{\sigma_{\|X_d\|_p}}{E[\|X_d\|_p]}$
is a measure of concentration for distributions, corresponding to the ratio between the standard deviation and the expected value of the norm.

Theorem 3 (Adapted from François et al., 2007, cf. Theorem 5) Let $X_d$ be an i.i.d. $d$-dimensional random vector. Then $\lim_{d\to\infty} RV_d = 0$.

From the above result, they conclude that the concentration of the norms in high-dimensional spaces is an intrinsic property of the norms and not a side effect of the finite sample size or of the Empty Space Phenomenon. Because it does not depend on the sample size, this can be regarded as an extension of Demartines' results to all $p$-norms. As a consequence of distance concentration, the separation between the nearest neighbor and the farthest neighbor of a given point tends to become increasingly indistinct as the dimensionality increases.

Related to the analysis of i.i.d. data is the concept of intrinsic dimensionality. Although the variables used to identify each datum could not be statistically independent, ultimately the intrinsic dimensionality of the data is identified as the minimum number of variables needed to represent the data itself (van der Maaten et al., 2009). This corresponds in linear spaces to the number of linearly independent vectors needed to describe each point. As a matter of fact, an extensively used notion of intrinsic dimensionality, the correlation dimension (Grassberger and Procaccia, 1983), is based on identifying the dimensionality $D$ at which the Euclidean space is homeomorphic to the manifold containing the support of the data:
$D = \lim_{\delta\to 0} \frac{\ln F_{st}(\delta)}{\ln \delta},$
where $F_{st}$ denotes the cumulative distribution function of pairwise distances, which formalizes the notion that, in the limit of small length-scales $\delta \to 0$ on the manifold upon which the data lie, the manifold is homeomorphic to the Euclidean space of dimension $D$.
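As a complement to the concentration results above, the following minimal sketch (not part of the original paper; the uniform distribution and sample size are illustrative choices) empirically shows the relative contrast of Theorem 2 and the relative variance of Theorem 3 shrinking with the dimensionality:

```python
# Minimal sketch: distance concentration for i.i.d. uniform data.
# Illustrates Theorems 2 and 3; all parameter choices are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # number of points

for d in (2, 10, 100, 1000, 10000):
    X = rng.uniform(-0.5, 0.5, size=(n, d))        # i.i.d. components
    norms = np.linalg.norm(X, axis=1)              # distances from the origin query
    rc = (norms.max() - norms.min()) / norms.min() # relative contrast RC_d
    rv = norms.std() / norms.mean()                # relative variance RV_d
    print(f"d={d:6d}  RC_d={rc:8.4f}  RV_d={rv:8.4f}")
# Both RC_d and RV_d shrink as d grows, i.e. norms (and distances) concentrate.
```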

Indeed, Demartines (1994) mentions that if the random vector components are not independent, the concentration phenomenon is still present provided that the actual number $D$ of degrees of freedom is sufficiently large. Thus, results derived for i.i.d. data continue to be valid provided that the dimensionality $d$ is replaced with $D$. Moreover, Beyer et al. (1999) provide different examples of data presenting concentration, all of which share with the i.i.d. case a sparse correlation structure. Durrant and Kabán (2009) noted that it is difficult to identify meaningful workloads that do not exhibit concentration, and showed that for the family of linear latent variable models, a class of data distributions having non-i.i.d. dimensions, the Euclidean distance will not become concentrated as long as the number of relevant dimensions grows no more slowly than the overall data dimensions do. This also confirms that weakly dependent data lead to concentration; however, they also note that the condition to avoid concentration is not often met in practice.

Another aspect of the curse of dimensionality problem, closely related to distance concentration and the nearest neighbor relationship, is the so-called hubness phenomenon. This phenomenon has been previously observed in several applications (Doddington et al., 1998; Singh et al., 2003; Aucouturier and Pachet, 2007), has recently undergone direct investigation (Radovanovic et al., 2009, 2010; Low et al., 2013), and several different techniques for overcoming it have been proposed (Radovanovic et al., 2015; Tomasev, 2015). Specifically, consider the number $N_k(x)$ of observed points that have $x$ among their $k$ nearest neighbors. $N_k$ is also called the number of $k$-occurrences or the reverse $k$-nearest neighbor count. It is known that in low-dimensional spaces the distribution of $N_k$ complies with the binomial one and, in particular, for uniformly distributed data in low dimensions, it can be modeled through the node in-degrees of the $k$-nearest neighbor graph, which follows the Erdős–Rényi random graph model, with a binomial degree distribution (Erdős and Rényi, 1959). However, it has been observed that as dimensionality increases, the distribution of $N_k$ becomes skewed to the right, resulting in the emergence of hubs, which are points whose reverse $k$-nearest neighbor counts tend to be meaningfully larger than those associated with any other point.

The distribution of $N_k$ has been explicitly studied in the applied probability and mathematical psychology communities (Newman et al., 1983; Maloney, 1983; Newman and Rinott, 1985; Tversky and Hutchinson, 1986; Yao and Simons, 1996). Almost all the results provided concern a Poisson process that spreads the vectors uniformly over $\mathbb{R}^d$, leading to the conclusion that the limiting distribution of $N_k$ converges to the Poisson distribution with mean $k$. The case of continuous distributions with i.i.d. components has been considered in (Newman et al., 1983; Newman and Rinott, 1985), where the expression for the infinite-dimensional distribution of $N_1$ is characterized as follows.

Theorem 4 (Newman et al., 1983, cf. Theorem 7) Let $\{X_d^0, X_d^1, \ldots, X_d^n\}$ be i.i.d. random vectors with a common continuous cdf having a finite fourth moment. Let $N^d_{n,1}$ denote the number of elements from $\{X_d^1, \ldots, X_d^n\}$ whose nearest neighbor with respect to the Euclidean distance is $X_d^0$. Then
$\lim_{d\to\infty} \lim_{n\to\infty} N^d_{n,1} \xrightarrow{D} 0$ and $\lim_{d\to\infty} \lim_{n\to\infty} \sigma^2\big(N^d_{n,1}\big) = \infty.$

The interpretation of the above result due to Tversky et al. (1983) is that, if the number of dimensions is large relative to the number of points, a large portion of points will have a reverse nearest neighbor count equal to zero, whereas a small fraction (i.e., the hubs) will score large counts.

In order to provide an explanation for hubness, Radovanovic et al. (2010) noticed that points that are closer to the mean of the data distribution are expected to be closer, on average, to all other points. However, empirical evidence indicates that this tendency is amplified by high dimensionality, making points that reside in the proximity of the data mean become closer to all other points than their low-dimensional analogues are. This tendency causes high-dimensional points that are closer to the mean to have an increased probability of being selected as $k$-nearest neighbors by other points, even for small values of $k$. In order to formalize the above evidence in finite-dimensional spaces, the authors considered the simplified setting of normally distributed i.i.d. $d$-dimensional random vectors, for which the distribution of Euclidean distances, calculated as the square root of the sum of squares of i.i.d. normal variables, corresponds to a chi distribution with $d$ degrees of freedom (Johnson et al., 1994), and the random variable $\mathrm{dist}(x_d, Y_d)$, representing the distance from a fixed point $x_d$ to the rest of the data, follows the noncentral chi distribution with $d$ degrees of freedom and noncentrality parameter $\lambda = \|x_d\|$ (Oberto and Pennecchi).

Theorem 5 (Radovanovic et al., 2010, cf. Theorem 1) Let $\lambda_{d,1} = \mu_{\chi_d} + c_1 \sigma_{\chi_d}$ and $\lambda_{d,2} = \mu_{\chi_d} + c_2 \sigma_{\chi_d}$, where $d \in \mathbb{N}^+$, $c_1 < c_2 \le 0$, and $\mu_{\chi_d}$ and $\sigma_{\chi_d}$ are the mean and standard deviation of the chi distribution with $d$ degrees of freedom, respectively. Define $\Delta\mu_d(\lambda_{d,1}, \lambda_{d,2}) = \mu_{\chi_d,\lambda_{d,2}} - \mu_{\chi_d,\lambda_{d,1}}$, where $\mu_{\chi_d,\lambda}$ is the mean of the noncentral chi distribution with $d$ degrees of freedom and noncentrality parameter $\lambda$. Then, there exists $d_0 \in \mathbb{N}$ such that for every $d > d_0$, $\Delta\mu_d(\lambda_{d,1}, \lambda_{d,2}) > 0$, and $\Delta\mu_{d+2}(\lambda_{d+2,1}, \lambda_{d+2,2}) > \Delta\mu_d(\lambda_{d,1}, \lambda_{d,2})$.

Intuitively, $\lambda_{d,1}$ and $\lambda_{d,2}$ represent two $d$-dimensional points whose norms are located at $c_1$ and $c_2$ standard deviations, respectively, from the expected value of the norm in dimensionality $d$, and $\Delta\mu_d(\lambda_{d,1}, \lambda_{d,2})$ is the distance between the expected values of the associated distributions of distances. As stated by the authors, the implication of the above theorem is that hubness is an inherent property of data distributions in high-dimensional space, rather than an artifact of other factors, such as finite sample size. However, Theorem 5 only formalizes the tendency of the difference between the means of the two distance distributions to increase with the dimensionality, and the proof is specific to Gaussian data. No model to predict the number of $k$-occurrences or to infer the form of the underlying distribution is provided, and the characterization of the probability distribution of $N_k$ remains an open problem.
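The growing gap described by Theorem 5 can be checked numerically. The following sketch is not from the paper: it replaces the exact noncentral-chi means with Monte Carlo estimates (the sample sizes and the values of c1, c2 are illustrative assumptions):

```python
# Minimal sketch: the gap between the expected distances of two points placed at
# c1 and c2 norm standard deviations from the mean grows with d (cf. Theorem 5).
import numpy as np
from scipy.stats import chi

rng = np.random.default_rng(0)
c1, c2, m = -2.0, 0.0, 200_000   # m Monte Carlo samples per estimate

def mean_noncentral_chi(d, lam):
    """E[||Z + v||] with Z ~ N(0, I_d) and ||v|| = lam (noncentral chi mean)."""
    v = np.zeros(d); v[0] = lam
    Z = rng.standard_normal((m, d))
    return np.linalg.norm(Z + v, axis=1).mean()

for d in (3, 10, 30, 100, 300):
    mu, sd = chi(d).mean(), chi(d).std()
    lam1, lam2 = mu + c1 * sd, mu + c2 * sd
    gap = mean_noncentral_chi(d, lam2) - mean_noncentral_chi(d, lam1)
    print(f"d={d:4d}  gap={gap:.4f}")   # the gap increases with d
```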

3. Notation

In the rest of this work, upper case letters, such as $X, Y, Z, \ldots$, denote random variables (r.v.) taking values in $\mathbb{R}$. $f_X$ ($F_X$, resp.) denotes the probability density function, pdf (probability distribution function, cdf, resp.), associated with $X$.

Boldface uppercase letters with $d$ as a subscript, such as $X_d, Y_d, Z_d, \ldots$, denote $d$-dimensional random vectors taking values in $\mathbb{R}^d$. The components $X_i$ ($1 \le i \le d$) of a random vector $X_d = (X_1, X_2, \ldots, X_d)$ are random variables having pdfs $f_{X_i} = f_i$ (cdfs $F_{X_i} = F_i$). A $d$-dimensional random vector is said to be independent and identically distributed (i.i.d. for short) if its random variables are independent and have common pdf $f_X = f_{X_i}$ (cdf $F_X = F_{X_i}$).

Boldface lowercase letters with $d$ as a subscript, such as $x_d, y_d, z_d, \ldots$, denote a specific $d$-dimensional vector taking values in $\mathbb{R}^d$. The components of a vector $x_d = (x_1, x_2, \ldots, x_d)$, denoted as $x_i$ ($1 \le i \le d$), are real scalar values.

Given a random variable $X$, w.l.o.g. and for simplicity of treatment, it is sometimes assumed that the expected value $\mu$ (or $\mu_X$) of $f_X$ is $\mu = 0$. If that is not the case, to satisfy the assumption it suffices to replace, during the analysis, the original random variable $X$ with the novel random variable $\hat X = X - \mu_X$. Thus, $\hat X_d$ denotes the random vector $X_d - \mu_X$.

$\sigma_X$ (or $\sigma$ alone, whenever $X$ is clear from the context) is the standard deviation of the random variable $X$. By $\mu_k$ ($\hat\mu_k$, resp.), or $\mu_{X,k}$ ($\hat\mu_{X,k}$, resp.) whenever $X$ is not clear from the context, it is denoted the $k$-th moment ($k$-th central moment, resp.), $k > 0$, $\mu_k = E[X^k]$ ($\hat\mu_k = E[(X - \mu_X)^k]$, resp.) of the random variable $X$, where $E[X]$ is the expectation of $X$. Clearly, when $\mu = \mu_1 = 0$, $\mu_k$ coincides with $\hat\mu_k$ and $\mu_2 = \sigma^2$. Moments of a pdf $f$ (cdf $F$, resp.) are those associated with a random variable $X$ having pdf $f_X = f$ (cdf $F_X = F$, resp.). The moments of an i.i.d. random vector $X_d$ are those associated with its cdf $F_X$. It is said that a distribution function has finite central moment $\hat\mu_k$ if there exists $0 \le \mu_{top} < \infty$ such that $|\hat\mu_k| \le \mu_{top}$. Whenever moments are employed during the proofs, we always assume that they exist and are finite. Moreover, if the random variable associated with a moment employed in a proof is not explicitly stated, we assume that the moment is relative to the common cdf of the random vectors occurring in the distribution reported in the statement of the theorem. Moreover, it is sometimes considered the case that $\mu_3 = 0$, a condition that is referred to as null skewness. It is known that odd central moments, provided they exist, are null if the pdf of $X$ is symmetric with respect to the mean, examples of distributions having null $\mu_3$ being the Uniform and the Normal distributions.

The notation $N(\mu, \sigma^2)$ represents the Normal distribution function with mean $\mu$ and variance $\sigma^2$. By $\Phi$ ($\phi$, resp.) one denotes the cdf (pdf, resp.) of the standard normal distribution, whereas by $\Phi_X$ ($\phi_X$, resp.) one denotes the cdf (pdf, resp.) of the normal distribution with mean $\mu_X$ and variance $\sigma^2_X$.

Let $X_d$ represent a univariate random variable that is defined in terms of a real-valued function of one or more $d$-dimensional random vectors. For example, $X_d$ could be defined as $\|X_d\|^2$. The notation $X_d \rightsquigarrow N(\mu_{X_d}, \sigma^2_{X_d})$ is shorthand to denote the fact that, as $d \to \infty$, the distribution $F_{X_d}$ of the standard score $(X_d - \mu_{X_d})/\sigma_{X_d}$ of $X_d$ converges to the normal distribution $N(0, 1)$.

Thus, for large values of $d$, $N(\mu_{X_d}, \sigma^2_{X_d})$ approximates the probability distribution $F_{X_d}$ of $X_d$, and $\Pr[X_d \le \delta] \approx \Phi\big((\delta - \mu_{X_d})/\sigma_{X_d}\big)$.

In the following, $\|\cdot\|$ denotes the $L_2$ norm, i.e., $\|x_d\| = \sqrt{\sum_{i=1}^d x_i^2}$. Moreover, $\mathrm{dist}(P_d, Q_d)$ denotes the Euclidean distance $\|P_d - Q_d\|$ between the random vector $P_d$ and the random vector $Q_d$.

Let $x \in \mathbb{R}$, and let $X$ be a random variable. Then $z_{x,X} = (x - \mu_X)/\sigma_X$ represents the value $x$ standardized with respect to the mean and the standard deviation of $X$. For a $d$-dimensional vector $x_d$, which is the realization of a $d$-dimensional i.i.d. random vector $Y_d$, the notation $z_{x_d}$ is used as shorthand for $z_{\|x_d\|^2, \|Y_d\|^2}$, i.e.,
$z_{x_d} = z_{\|x_d\|^2, \|Y_d\|^2} = \frac{\|x_d\|^2 - \mu_{\|Y_d\|^2}}{\sigma_{\|Y_d\|^2}}.$

Results in the following are derived by considering distributions. However, these results can be applied to a finite set of points by taking into account large samples. In order to deal with a finite set of points, $\{Y_d\}^n$ denotes a set of $n$ random vectors $\{Y_d^1, \ldots, Y_d^n\}$, each one distributed as $Y_d$.

Now we recall the Lyapunov Central Limit Theorem (CLT) condition. Consider the sequence $W_1, W_2, W_3, \ldots$ of independent, but not identically distributed, random variables, and let $V_d = \sum_{i=1}^d W_i$. By the Lyapunov CLT condition (Ash and Doléans-Dade, 1999), if for some $\delta > 0$ it holds that
$\lim_{d\to\infty} \frac{1}{s_d^{2+\delta}} \sum_{i=1}^d E\big[|W_i - E[W_i]|^{2+\delta}\big] = 0, \quad \text{where } s_d^2 = \sum_{i=1}^d \sigma^2_{W_i},$   (1)
then, as $d$ goes to infinity,
$\frac{V_d - E[V_d]}{\sigma_{V_d}} = \frac{\sum_{i=1}^d (W_i - E[W_i])}{\sqrt{\sum_{i=1}^d \sigma^2_{W_i}}} \;\rightsquigarrow\; N(0, 1),$
i.e., the standard score $(V_d - E[V_d])/\sigma_{V_d}$ converges in distribution to a standard normal random variable.

In the following, when a statement involves a $d$-dimensional vector $x_d$, we will usually assume that $x_d$ is the realization of a specific $d$-dimensional random vector $X_d$. Moreover, we will say that a result involving the realization $x_d$ of a random vector $X_d$ holds with high probability if the statement is true for all the realizations of $X_d$ except for a subset which becomes increasingly less probable as the dimensionality $d$ increases. Technically, the assumption that $x_d$ is a realization of a random vector $X_d$ is leveraged to attain a proof of convergence in probability. This also means that when we simultaneously account for all the realizations of a random vector $X_d$ by integrating over all vectors $x_d$ such that $f_{X_d}(x_d) > 0$, the existence of such a negligible set does not affect the final result.

4. Results

This section presents the results of the work concerning the distribution of distances, nearest neighbors, and reverse nearest neighbors. Specifically, Section 4.1, concerning the distribution of distances between intrinsically high-dimensional data, derives the expressions for the distance distribution between pairs of random vectors and between a realization of a random vector and a random vector, and analyzes the error associated with these expressions. Section 4.2 takes into account the distribution of distances from nearest neighbors, derives the expected size of the ε-neighborhood in high-dimensional spaces, and leverages it to characterize neighborhood instability. The section also derives a novel estimate of the relative contrast measure. Section 4.3 addresses the problem of determining the number of k-occurrences and determines the closed form of its limiting distribution, showing that it encompasses previous results and provides interpretability of the associated hubness phenomenon. Section 4.4 generalizes the results derived for the i.i.d. case to independent non-identically distributed data, depicting connections with the behavior in real data. Section 4.5 discusses the relationship to other distances, including Minkowski metrics and, in general, distances satisfying certain conditions of spatial centrality.

The first three sections deal with i.i.d. random vectors. In these sections, the synthetic data sets considered consist of data generated from a uniform distribution in $[-0.5, +0.5]$, a standard normal distribution, and an exponential distribution with mean 1. For the proofs that are not reported within the main text, the reader is referred to the Appendix.

4.1 On the Distribution of Distances for i.i.d. Data

First of all, the probability that two $d$-dimensional i.i.d. random vectors lie at a distance not greater than $\delta$ from one another is considered. The expression of Theorem 6 results from the fact that the distribution of the random variable $\|X_d - Y_d\|^2$ converges towards the normal distribution for large dimensionalities.

Theorem 6 Let $X_d$ and $Y_d$ be two $d$-dimensional i.i.d. random vectors with common cdf $F$. Then, for large values of $d$,
$\Pr[\mathrm{dist}(X_d, Y_d) \le \delta] \approx \Phi\!\left( \frac{\delta^2 - 2d(\mu_2 - \mu^2)}{\sqrt{2d(\mu_4 + \mu_2^2 - 2\mu^4 + 4\mu^2\mu_2 - 4\mu\mu_3)}} \right).$

Proof of Theorem 6. The statement follows from the property shown in Lemma 7.

Lemma 7 $\|X_d - Y_d\|^2 \rightsquigarrow N\big(2d(\mu_2 - \mu^2),\; 2d(\mu_4 + \mu_2^2 - 2\mu^4 + 4\mu^2\mu_2 - 4\mu\mu_3)\big)$.

Proof of Lemma 7. The square norm $\|X_d - Y_d\|^2$ can be written as
$\|X_d - Y_d\|^2 = \|X_d\|^2 + \|Y_d\|^2 - 2\langle X_d, Y_d\rangle,$
where $\|X_d\|^2$, $\|Y_d\|^2$, and $\langle X_d, Y_d\rangle$ are the random variables $\|Y_d\|^2 = \sum_{i=1}^d Y_i^2$ and $\langle X_d, Y_d\rangle = \sum_{i=1}^d X_i Y_i$.

The proof proceeds by showing that, as $d \to \infty$, $\|X_d\|^2$, $\|Y_d\|^2$, and $\langle X_d, Y_d\rangle$ are both normally distributed and jointly normally distributed, and by determining their covariance, which is accounted for in Propositions 8, 9, 10, and 11, as reported in the following.

Proposition 8 $\|Y_d\|^2 \rightsquigarrow N\big(d\mu_2,\; d(\mu_4 - \mu_2^2)\big)$.

Proof of Proposition 8. Consider the random variable $\|Y_d\|^2 = \sum_{i=1}^d Y_i^2 = \sum_{i=1}^d W_i$, where $W_i = Y_i^2$ is a novel random variable. Then, $\mu_W = E[W_i] = E[Y_i^2] = \mu_2$, and $\sigma_W^2 = E[W_i^2] - E[W_i]^2 = E[Y_i^4] - \mu_2^2 = \mu_4 - \mu_2^2$. Consider the sequence $W_1, W_2, W_3, \ldots$ of i.i.d. random variables. By the Central Limit Theorem (CLT for short; Ash and Doléans-Dade, 1999), the standard score of $\sum_i W_i$ is such that, as $d \to \infty$,
$\frac{\sum_{i=1}^d W_i - d\mu_W}{\sqrt{d}\,\sigma_W} = \frac{\sum_{i=1}^d Y_i^2 - d\mu_2}{\sqrt{d(\mu_4 - \mu_2^2)}} \;\rightsquigarrow\; N(0, 1),$
from which the result follows.

Proposition 9 $\langle X_d, Y_d\rangle \rightsquigarrow N\big(d\mu^2,\; d(\mu_2^2 - \mu^4)\big)$.

Proof of Proposition 9. Because $\langle X_d, Y_d\rangle = \sum_{i=1}^d X_i Y_i = \sum_{i=1}^d W_i$ is the sum of a sequence $W_1, W_2, W_3, \ldots$ of i.i.d. random variables with mean $E[W_i] = E[X_i Y_i] = E[X_i]E[Y_i] = \mu^2$ and variance $\sigma^2[W_i] = E[W_i^2] - E[W_i]^2 = E[X_i^2 Y_i^2] - \mu^4 = E[X_i^2]E[Y_i^2] - \mu^4 = \mu_2^2 - \mu^4$, the result follows from the CLT.

Proposition 10 As $d \to \infty$, $\|X_d\|^2$, $\|Y_d\|^2$, and $\langle X_d, Y_d\rangle$ are jointly normally distributed.

Proof of Proposition 10. The statement holds provided that all linear combinations $W = a\|X_d\|^2 + b\|Y_d\|^2 + c\langle X_d, Y_d\rangle$ are normal. Notice that
$W = a\sum_{i=1}^d X_i^2 + b\sum_{i=1}^d Y_i^2 + c\sum_{i=1}^d X_i Y_i = \sum_{i=1}^d \big(aX_i^2 + bY_i^2 + cX_i Y_i\big) = \sum_{i=1}^d W_i,$
where $W_i = aX_i^2 + bY_i^2 + cX_i Y_i$ is a novel random variable. Because $W_1, W_2, W_3, \ldots$ is a sequence of i.i.d. random variables, the result follows from the CLT.

Proposition 11 $\mathrm{cov}(\|Y_d\|^2, \langle X_d, Y_d\rangle) = d(\mu\mu_3 - \mu_2\mu^2)$ and, by symmetry, $\mathrm{cov}(\|X_d\|^2, \langle X_d, Y_d\rangle) = d(\mu\mu_3 - \mu_2\mu^2)$.

Proof of Proposition 11. See the Appendix.

Proof of Lemma 7 (continued). Because the random variables $\|X_d\|^2$, $\|Y_d\|^2$, and $\langle X_d, Y_d\rangle$ are jointly normally distributed (see Proposition 10), their linear combination $\|X_d - Y_d\|^2 = \|X_d\|^2 + \|Y_d\|^2 - 2\langle X_d, Y_d\rangle$ is normally distributed with mean
$\mu_{\|X_d - Y_d\|^2} = \mu_{\|X_d\|^2} + \mu_{\|Y_d\|^2} - 2\mu_{\langle X_d, Y_d\rangle} = 2d(\mu_2 - \mu^2),$
and variance
$\sigma^2_{\|X_d - Y_d\|^2} = 2\sigma^2_{\|Y_d\|^2} + 4\sigma^2_{\langle X_d, Y_d\rangle} - 4 \cdot 2\,\mathrm{cov}(\|Y_d\|^2, \langle X_d, Y_d\rangle) = 2d(\mu_4 - \mu_2^2) + 4d(\mu_2^2 - \mu^4) - 8d(\mu\mu_3 - \mu_2\mu^2) = 2d(\mu_4 + \mu_2^2 - 2\mu^4 + 4\mu^2\mu_2 - 4\mu\mu_3).$

Proof of Theorem 6 (continued). To conclude the proof:
$\Pr[\mathrm{dist}(X_d, Y_d) \le \delta] = \Pr[\mathrm{dist}(X_d, Y_d)^2 \le \delta^2] = \Pr[\|X_d - Y_d\|^2 \le \delta^2] \approx \Phi_{\|X_d - Y_d\|^2}(\delta^2).$

Note that, if $X_d$ and $Y_d$ have a common pdf with null mean $\mu = 0$, then $\|Y_d\|^2$ (equivalently $\|X_d\|^2$) and $\langle X_d, Y_d\rangle$ are uncorrelated and, being jointly normally distributed, they are also independent. In such a case, the parameters of the distribution can be expressed in the following simplified form.

Corollary 12 Let $X_d$ and $Y_d$ be two $d$-dimensional i.i.d. random vectors with common cdf $F_X$ having mean $\mu$. Then
$\|\hat X_d - \hat Y_d\|^2 \rightsquigarrow N\big(2d\hat\mu_2,\; 2d(\hat\mu_4 + \hat\mu_2^2)\big),$
where $\hat X_d = X_d - \mu$ ($\hat Y_d = Y_d - \mu$, resp.) and $\hat\mu_k = E[(X - \mu)^k]$ ($k > 0$) are the central moments of $f_X$ (the moments of $f_{\hat X}$, resp.).

Proof of Corollary 12. Immediate from Lemma 7.

The notability of the above expression also stems from the following fact.

Proposition 13 $\Pr[\mathrm{dist}(X_d, Y_d) \le \delta] = \Pr[\mathrm{dist}(\hat X_d, \hat Y_d) \le \delta]$.

Proof of Proposition 13. Distances are not affected by translation.

Until now, it has been assumed that $X_d$ and $Y_d$ have a common cdf. The following expression takes into account the case of different cdfs.
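Before moving on, the following minimal sketch (not part of the original paper; the exponential distribution is an illustrative choice whose non-zero mean and skewness exercise the full formula) compares the parameters of Lemma 7 with Monte Carlo estimates:

```python
# Minimal sketch: checking the normal approximation of Lemma 7 / Theorem 6
# against Monte Carlo for exponential(1) components.
import numpy as np

rng = np.random.default_rng(0)
d, n = 1000, 50_000

# raw moments of the exponential distribution with mean 1
mu, mu2, mu3, mu4 = 1.0, 2.0, 6.0, 24.0

# Lemma 7 parameters for ||X_d - Y_d||^2
mean_theory = 2 * d * (mu2 - mu**2)
var_theory = 2 * d * (mu4 + mu2**2 - 2 * mu**4 + 4 * mu**2 * mu2 - 4 * mu * mu3)

X = rng.exponential(1.0, size=(n, d))
Y = rng.exponential(1.0, size=(n, d))
sq = np.sum((X - Y) ** 2, axis=1)           # observed squared distances

print("mean:", mean_theory, "vs", sq.mean())
print("std :", var_theory ** 0.5, "vs", sq.std())
# Then Pr[dist <= delta] ~ Phi((delta^2 - mean_theory) / sqrt(var_theory)).
```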

Corollary 14 Let $X_d$ and $Y_d$ be two $d$-dimensional i.i.d. random vectors with cdfs $F_X$ and $F_Y$, respectively. Then $\|X_d - Y_d\|^2 \rightsquigarrow N(d\mu_{X,Y},\; d\sigma^2_{X,Y})$, where
$\mu_{X,Y} = \mu_{X,2} + \mu_{Y,2} - 2\mu_X\mu_Y$, and
$\sigma^2_{X,Y} = \mu_{X,4} - \mu_{X,2}^2 + \mu_{Y,4} - \mu_{Y,2}^2 + 4\mu_{X,2}\mu_{Y,2} + 4\mu_X\mu_Y\big(\mu_{X,2} + \mu_{Y,2} - \mu_X\mu_Y\big) - 4\mu_X\mu_{Y,3} - 4\mu_Y\mu_{X,3}.$

Proof of Corollary 14. The expression can be obtained by following the same line of reasoning as Lemma 7.

To characterize distance distributions more precisely, it is of interest to consider the case in which one of the two vectors is held fixed. With this aim, the following Theorem 15 concerns the probability that a given $d$-dimensional vector $x_d$ and the realization of a $d$-dimensional i.i.d. random vector $Y_d$ lie at a distance not greater than $\delta$ from one another. The result holds under the condition that $x_d$ itself is the realization of a $d$-dimensional i.i.d. random vector $X_d$, with the cdf $F_X$ of $X_d$ not necessarily being identical to the cdf $F_Y$ of $Y_d$. Formally, Theorem 15 holds with high probability because it relies on a proof of convergence in probability exploited in Proposition 17. Although not all the realizations may comply with the condition of Proposition 17 (e.g., consider the case $x_i = c^i$ with $c > 1$), it holds anyway for almost all the realizations, except for a set of vanishing measure.

Theorem 15 Let $x_d$ denote a realization of a $d$-dimensional i.i.d. random vector $X_d$, and let $Y_d$ be a $d$-dimensional i.i.d. random vector. Then, for large values of $d$, with high probability
$\Pr[\mathrm{dist}(x_d, Y_d) \le \delta] \approx \Phi\!\left( \frac{\delta^2 - \big(\|x_d\|^2 + d\mu_2 - 2\mu\sum_i x_i\big)}{\sqrt{d(\mu_4 - \mu_2^2) + 4(\mu_2 - \mu^2)\|x_d\|^2 - 4(\mu_3 - \mu\mu_2)\sum_i x_i}} \right),$
where the moments are relative to the random vector $Y_d$.

Proof of Theorem 15. The proof relies on the result of Lemma 16 concerning the distribution of $\|x_d - Y_d\|^2$.

Lemma 16 With high probability,
$\|x_d - Y_d\|^2 \rightsquigarrow N\Big(\|x_d\|^2 + d\mu_2 - 2\mu\textstyle\sum_i x_i,\;\; d(\mu_4 - \mu_2^2) + 4(\mu_2 - \mu^2)\|x_d\|^2 - 4(\mu_3 - \mu\mu_2)\textstyle\sum_i x_i\Big).$

Proof of Lemma 16. Consider the equality $\|x_d - Y_d\|^2 = \|x_d\|^2 + \|Y_d\|^2 - 2\langle x_d, Y_d\rangle$. The proof proceeds by studying the distribution of $\langle x_d, Y_d\rangle$ (see Proposition 18), by showing that $\|Y_d\|^2$ and $\langle x_d, Y_d\rangle$ are jointly normally distributed (see Proposition 19), and by determining their covariance (see Proposition 20). However, a technical result that is leveraged in the sequel of the proof is first needed; this is presented in Proposition 17.

Proposition 17 Let $X_d$ be a $d$-dimensional i.i.d. random vector having cdf $F_X$. Moreover, let $p$ and $q$ be positive integers, and let $\beta_0, \beta_1, \ldots, \beta_p$, $\alpha_0, \alpha_1, \ldots, \alpha_q$ be real coefficients such that $\beta_p \neq 0$ and $\alpha_q \neq 0$. Then, for any $\varepsilon > 0$,
$\lim_{d\to\infty} \Pr\!\left[ \left| \frac{\sum_{i=1}^d \sum_{j=0}^p \beta_j X_i^j}{\big(\sum_{i=1}^d \sum_{j=0}^q \alpha_j X_i^j\big)^2} \right| \ge \varepsilon \right] = 0.$

Proof of Proposition 17. See the Appendix.

Proposition 18 With high probability, $\langle x_d, Y_d\rangle \rightsquigarrow N\big(\mu\sum_i x_i,\; (\mu_2 - \mu^2)\|x_d\|^2\big)$.

Proof of Proposition 18. Consider the random variable $\langle x_d, Y_d\rangle = \sum_{i=1}^d x_i Y_i = \sum_{i=1}^d W_i$, where $W_i = x_i Y_i$ is a novel random variable. Then, $\mu_{W_i} = E[W_i] = E[x_i Y_i] = x_i E[Y_i] = x_i\mu$, and $\sigma^2_{W_i} = E[W_i^2] - E[W_i]^2 = x_i^2\mu_2 - x_i^2\mu^2 = (\mu_2 - \mu^2)x_i^2$. Consider the sequence $W_1, W_2, W_3, \ldots$ of independent, but not identically distributed, random variables. If the Lyapunov CLT condition reported in Equation (1) holds, the standard score $(\langle x_d, Y_d\rangle - \mu_{\langle x_d, Y_d\rangle})/\sigma_{\langle x_d, Y_d\rangle}$ converges in distribution to a standard normal random variable as $d$ goes to infinity, i.e.,
$\frac{\langle x_d, Y_d\rangle - \mu_{\langle x_d, Y_d\rangle}}{\sigma_{\langle x_d, Y_d\rangle}} = \frac{\sum_i (x_i Y_i - x_i\mu)}{\sqrt{(\mu_2 - \mu^2)\sum_i x_i^2}} \;\rightsquigarrow\; N(0, 1).$
Thus, consider the Lyapunov condition for $\delta = 2$:
$\lim_{d\to\infty} \left.\frac{\sum_{i=1}^d E\big[|W_i - E[W_i]|^{2+\delta}\big]}{s_d^{2+\delta}}\right|_{\delta=2} = \lim_{d\to\infty} \frac{\sum_{i=1}^d x_i^4\, E\big[|Y_i - \mu|^4\big]}{(\mu_2 - \mu^2)^2 \big(\sum_{i=1}^d x_i^2\big)^2} = \frac{\mu_4 + 6\mu^2\mu_2 - 4\mu\mu_3 - 3\mu^4}{(\mu_2 - \mu^2)^2} \cdot \lim_{d\to\infty} \frac{\sum_{i=1}^d x_i^4}{\big(\sum_{i=1}^d x_i^2\big)^2} = 0.$
The above limit converges in probability for the r.v. $X_d$ by Proposition 17.

Proposition 19 As $d \to \infty$, with high probability $\|Y_d\|^2$ and $\langle x_d, Y_d\rangle$ are jointly normally distributed.

Proof of Proposition 19. See the Appendix.

Proposition 20 $\mathrm{cov}(\|Y_d\|^2, \langle x_d, Y_d\rangle) = (\mu_3 - \mu\mu_2)\sum_{i=1}^d x_i$.

Proof of Proposition 20. See the Appendix.

Proof of Lemma 16 (continued). To conclude the proof of Lemma 16, because the random variables $\|Y_d\|^2$ and $\langle x_d, Y_d\rangle$ are jointly normally distributed, the random variable $\|x_d - Y_d\|^2$ is normally distributed with mean and variance
$\mu_{\|x_d - Y_d\|^2} = \|x_d\|^2 + \mu_{\|Y_d\|^2} - 2\mu_{\langle x_d, Y_d\rangle} = \|x_d\|^2 + d\mu_2 - 2\mu\textstyle\sum_i x_i,$
$\sigma^2_{\|x_d - Y_d\|^2} = \sigma^2_{\|Y_d\|^2} + 4\sigma^2_{\langle x_d, Y_d\rangle} - 4\,\mathrm{cov}(\|Y_d\|^2, \langle x_d, Y_d\rangle) = d(\mu_4 - \mu_2^2) + 4(\mu_2 - \mu^2)\|x_d\|^2 - 4(\mu_3 - \mu\mu_2)\textstyle\sum_i x_i.$

Proof of Theorem 15 (continued). To conclude the proof:
$\Pr[\mathrm{dist}(x_d, Y_d) \le \delta] = \Pr[\mathrm{dist}(x_d, Y_d)^2 \le \delta^2] = \Pr[\|x_d - Y_d\|^2 \le \delta^2] \approx \Phi_{\|x_d - Y_d\|^2}(\delta^2).$

For distributions having null means, the above expressions can be simplified.

Corollary 21 Let $x_d$ denote a realization of a $d$-dimensional i.i.d. random vector $X_d$, and let $Y_d$ be a $d$-dimensional i.i.d. random vector with cdf $F_Y$ having null mean $\mu_Y = 0$. Then, with high probability,
$\|x_d - Y_d\|^2 \rightsquigarrow N\Big(\|x_d\|^2 + d\mu_2,\;\; d(\mu_4 - \mu_2^2) + 4\mu_2\|x_d\|^2 - 4\mu_3\textstyle\sum_i x_i\Big),$
where the moments are relative to the random vector $Y_d$.

Proof of Corollary 21. The result follows by substituting $\mu = \mu_Y = 0$ in the right-hand side of the statement of Lemma 16.

In order to quantify the error associated with the approximations of Theorem 6 and Theorem 15, the Kolmogorov–Smirnov statistic $D_n$ is employed here as an error measure. This statistic is usually used for comparing a theoretical cumulative distribution function $F$ to a given empirical distribution function $G_n$ for $n$ observations, and it is defined as
$D_n(G_n, F) = \sup_{\delta \in \mathbb{R}} |G_n(\delta) - F(\delta)|.$

In our case, given an empirical distribution function $G_{d,n}$ for $n$ observations and a theoretical distribution function $F_d$, both related to the dimensionality parameter $d$, we define the error $err_d(G_{d,n}, F_d)$ as $D_n(G_{d,n}, F_d)$.

As for the approximation of Theorem 6, $F_{\|X_d - Y_d\|^2}(\delta) = \Phi\big((\delta - E[\|X_d - Y_d\|^2])/\sigma_{\|X_d - Y_d\|^2}\big)$ is employed as the theoretical cdf $F_d$, whereas $F^{emp}_{\|X_d - Y_d\|^2,n}(\delta)$, denoting the empirical distribution of the square distances, is employed as the empirical cdf $G_{d,n}$, and the error measure is $e_d = err_d\big(F^{emp}_{\|X_d - Y_d\|^2,n}, F_{\|X_d - Y_d\|^2}\big)$. As for the approximation of Theorem 15, given the realization $x_d$ of $X_d$, $F_{\|x_d - Y_d\|^2}(\delta) = \Phi\big((\delta - E[\|x_d - Y_d\|^2])/\sigma_{\|x_d - Y_d\|^2}\big)$ is employed as the theoretical cdf, whereas $F^{emp}_{\|x_d - Y_d\|^2,n}(\delta)$ denotes the empirical cdf. Specifically, we considered three points $p^i_d$ ($1 \le i \le 3$) as instances of $x_d$. Each point $p^i_d$ lies $k_i$ (with $k_1 = 0$, $k_2 = 1$, and $k_3 = 5$) standard deviations $\sigma_{\|X_d\|^2}$ away from the mean $\mu_{\|X_d\|^2}$ of the square norm of $X_d$, i.e., each point $p^i_d$ is such that $z_{p^i_d} = k_i$; in particular, the generic coordinate of $p^i_d$ has value $\big((\mu_{\|X_d\|^2} + k_i \sigma_{\|X_d\|^2})/d\big)^{1/2}$. The error measure for each point is $e^i_d = err_d\big(F^{emp}_{\|p^i_d - Y_d\|^2,n}, F_{\|p^i_d - Y_d\|^2}\big)$.

The empirical cdf $F^{emp}_{\|X_d - Y_d\|^2,n}$ has been obtained by generating $n$ pairs $(x^j_d, y^j_d)$ ($1 \le j \le n$) of realizations of the random vectors $X_d$ and $Y_d$, respectively, and then by computing $F^{emp}_{\|X_d - Y_d\|^2,n}(\delta) = \frac{1}{n}\sum_{j=1}^n I_{[0,\delta]}\big(\mathrm{dist}(x^j_d, y^j_d)\big)$, where $I_S$ denotes the indicator function, with $S$ representing a generic set, such that $I_S(v) = 1$ if $v \in S$, and $I_S(v) = 0$ otherwise. The empirical cdf $F^{emp}_{\|x_d - Y_d\|^2,n}$ is obtained by generating $n$ realizations $y^j_d$ ($1 \le j \le n$) of the random vector $Y_d$, and then by computing $F^{emp}_{\|x_d - Y_d\|^2,n}(\delta) = \frac{1}{n}\sum_{j=1}^n I_{[0,\delta]}\big(\mathrm{dist}(x_d, y^j_d)\big)$.

We note that, for any distance threshold $\delta \ge 0$, the value $err_d$ represents an upper bound to the error committed when the theoretical cdf of Theorem 6 (Theorem 15, resp.) is used to estimate the probability $\Pr[\|X_d - Y_d\| \le \delta]$ ($\Pr[\|x_d - Y_d\| \le \delta]$, resp.).

Figure 1 shows the above defined errors $e_d$, $e^1_d$, $e^2_d$, and $e^3_d$ (red curves), for distributions $F_X = F_Y$ uniform in $[-0.5, +0.5]$ (Fig. 1a), standard normal (Fig. 1b), and exponential with $\lambda = 1$ (Fig. 1c), respectively. Before commenting on the results, it must be pointed out that the error $err_d$ depends on the size $n$ of the sample employed to build the empirical distribution. Thus, we first discuss the behavior for unbounded sample sizes $n$, and then take into account the effect of finite sample sizes. In order to simulate an unbounded sample size, the curves in the figures have been obtained for a very large sample size $n$.

From Figures 1a, 1b, and 1c it can be seen that the error $err_d$ decreases with the dimensionality. The trend of the error curves is more regular for the uniform and normal distributions than for the exponential distribution, probably due to the skewness of the exponential distribution. The error $e_d$ associated with the cdf $F_{\|X_d - Y_d\|^2}$ is greater than the errors $e^i_d$ associated with the cdf $F_{\|x_d - Y_d\|^2}$. Intuitively, this can be explained since the degree of uncertainty is reduced if one of the two random vectors is replaced by a fixed point. In general, it holds that $e^1_d > e^2_d > e^3_d$, thus indicating that uncertainty increases towards the most largely populated regions of the space.

[Figure 1: three panels, (a) Uniform, (b) Normal, (c) Exponential — error err_d versus dimensionality d.]

Figure 1: [Best viewed in color.] Empirical evaluation of the approximation errors of Th. 6 and Th. 15, for dimensionalities d ∈ [10^0, 10^4] and sample sizes n ∈ [10^2, 10^7]. Error e_d (red solid line) is associated with the expression of Th. 6, whereas errors e^1_d (magenta dashed line), e^2_d (cyan dash-dotted line), and e^3_d (green dotted line) are associated with the expression of Th. 15, for three different points whose square norm standard scores are 0, 1, and 5, respectively. Horizontal blue lines take into account the sample size n: the dotted line is the expected error for different n values under the hypothesis that the distance distribution is indeed normal; the dashed line is the value under which the hypothesis that the sample is generated according to the theoretical distribution can be accepted at the 95% confidence level.
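As a complement to Figure 1, the following minimal sketch (not part of the original paper; the uniform distribution and sample sizes are illustrative choices) shows how an error of the kind of $e_d$ can be estimated with a Kolmogorov–Smirnov statistic:

```python
# Minimal sketch: Kolmogorov-Smirnov error between empirical squared distances
# and the limiting normal distribution of Corollary 12, for U(-0.5, 0.5) data.
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(0)
n = 100_000
mu2, mu4 = 1/12, 1/80            # raw moments of U(-0.5, 0.5); mu = mu3 = 0

for d in (10, 100, 1000):
    X = rng.uniform(-0.5, 0.5, size=(n, d))
    Y = rng.uniform(-0.5, 0.5, size=(n, d))
    sq = np.sum((X - Y) ** 2, axis=1)
    mean = 2 * d * mu2                       # Corollary 12 (null mean)
    std = np.sqrt(2 * d * (mu4 + mu2 ** 2))
    D_n = kstest(sq, 'norm', args=(mean, std)).statistic   # upper bound on err_d
    print(f"d={d:5d}  err_d={D_n:.4f}")      # decreases with d (cf. Figure 1)
```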

Moreover, the larger the dimensionality, the closer the errors $e^j_d$ are to $e^1_d$.

As anticipated above, the error $err_d$ depends on the size $n$ of the sample employed to build the empirical distribution. Specifically, differently from the case of unbounded $n$ values, for which the error decreases with the dimensionality, for any fixed sample size $n$ there exists a dimensionality such that the error converges around a value $e_n$. Such a value $e_n$ corresponds to the error $D_n(\Phi^{emp}_n, \Phi)$ between the empirical cdf $\Phi^{emp}_n$ associated with a random sample of $n$ elements of a standard normal distribution and the theoretical cdf $\Phi$ of a standard normal distribution.

Let $K$ be a random variable having a Kolmogorov distribution. According to the Kolmogorov–Smirnov test, the null hypothesis that the sample of $n$ observations having empirical distribution $G_n$ comes from the hypothesized distribution $F$ is rejected at level $\alpha \in (0, 1)$ if the statistic $\sqrt{n}\,D_n(G_n, F)$ is greater than the value $K_\alpha$, where $K_\alpha$ is such that $\Pr[K \le K_\alpha] = 1 - \alpha$. It follows from the above that if, for a certain $\alpha$ and sample size $n$, $err_d$ is smaller than $e^\alpha_n = K_\alpha n^{-1/2}$, then the hypothesis that the sample complies with the theoretical distribution can be accepted at the $(1 - \alpha)$ confidence level; e.g., for $\alpha = 0.05$, the value $K_{0.05}$ is 1.358. Moreover, the expected value $e_n$ of $D_n(\Phi^{emp}_n, \Phi)$ approximately corresponds to $K_\alpha n^{-1/2}$ with $\alpha = 0.44$, i.e., to $e_n \approx 0.87\, n^{-1/2}$.

Horizontal blue lines in Figures 1a, 1b, and 1c take into account the effect of the sample size $n$. Each pair of dashed and dotted lines is associated with a different value of $n \in \{10^2, 10^3, \ldots, 10^7\}$. Dashed lines are associated with the errors $e^{0.05}_n$, whereas dotted lines are associated with the errors $e_n$. Let $n$ be the actual sample size, and let $d^*$ be the dimensionality such that the value $e_{d^*}$ of the particular curve $e_d$ is equal to $e_n$ (dotted horizontal curve). Then, for $d \ge d^*$, the expected value of $e_d$ tends to $e_n$. Thus, for $d < d^*$, the curve of $e_d$ is similar to the one reported in the figure, whereas for $d > d^*$, the curve of $e_d$ tends to be horizontal, with a value close to $e_n$. Moreover, if $e_d \le e^{0.05}_n$ (dashed horizontal curve), then the hypothesis that the sample complies with the hypothesized distribution can be accepted at the 95% confidence level. Informally speaking, this means that in the latter case, the distribution hypothesized in Theorems 6 and 15 is indiscernible from the underlying distribution generating the observed inter-point distances.

In summary, as previously pointed out, because $err_d$ depends on the worst-case threshold value $\delta$, it is an upper bound to the error committed when estimating probabilities by leveraging the results previously presented. The analysis with unbounded sample size highlights that the worst-case error always decreases with the dimensionality. Moreover, let the effective error be defined as the difference between the observed error and the error expected when the data are generated according to the hypothesized distribution. The analysis of finite sample sizes highlights that, in practice, the effective error can become null.

For distributions $F_Y$ having both a null mean and null skewness ($\mu_3 = 0$), it follows from Propositions 19 and 20 that the random variables $\|Y_d\|^2$ and $\langle x_d, Y_d\rangle$ are independent. Moreover, the distributions defined in Corollary 21, in Theorem 15, and in Lemma 16 depend only on the square norm $\|x_d\|^2$, whereas the actual value of $x_d$ does not matter.
However, it can be shown that the same property also holds for skewed distributions, since the term $\sum_i x_i$ is related to $\|x_d\|^2$, as accounted for in the subsequent result.

[Figure 2: three panels, (a) Uniform, (b) Normal, (c) Exponential — ratio $\sum_i x_i / \|x_d\|^2$ versus dimensionality d.]

Figure 2: Empirical validation of Proposition 22 on different distributions: (a) uniform ($\mu_1/\mu_2 = 1.5$), (b) normal ($\mu_1/\mu_2 = 0.5$), and (c) exponential ($\mu_1/\mu_2 = 0.5$). The red solid curve represents the expected value $\mu_{W_d}$ of the ratio $W_d = \sum_i X_i / \|X_d\|^2$, whereas the red dashed curves represent the values $\mu_{W_d} + \sigma_{W_d}$ and $\mu_{W_d} - \sigma_{W_d}$, measured for n = 20,000 points and d ∈ [10^1, 10^4].

Proposition 22 Let $x_d$ denote a realization of a $d$-dimensional i.i.d. random vector $X_d$ with cdf $F_X$. Then, for large values of $d$, with high probability
$\frac{\sum_{i=1}^d x_i}{\|x_d\|^2} \approx \frac{\mu_X}{\mu_{X,2}}.$

Proof of Proposition 22. See the Appendix.

Thus, the term $\sum_i x_i$ can be approximated by $(\mu_X/\mu_{X,2})\,\|x_d\|^2$. Notice that the above result also states that for random vectors $X_d$ having null mean, the term $\sum_i x_i$ becomes negligible with respect to $\|x_d\|^2$ and, hence, that it can be ignored in the expression reported in Corollary 21, thus removing the dependence on the skewness of the distribution $F_Y$.

To empirically validate Proposition 22, the mean and the standard deviation of the ratio $W_d = \sum_i X_i / \|X_d\|^2$ have been measured on distributions having non-null mean ($\mu \neq 0$). Figure 2 reports the results of the experiment for $d \in [10, 10^4]$ and n = 20,000. Specifically, a uniform distribution with mean $\mu_1 = 0.5$ ($\mu_2 = 0.333$, $\mu_3 = 0.25$, and $\mu_4 = 0.2$) and ratio $\mu_1/\mu_2 = 1.5$ (Fig. 2a), a normal distribution with mean $\mu_1 = 1$ ($\mu_2 = 2$, $\mu_3 = 4$, and $\mu_4 = 10$) and ratio $\mu_1/\mu_2 = 0.5$ (Fig. 2b), and an exponential distribution with mean $\mu_1 = 1$ ($\mu_2 = 2$, $\mu_3 = 6$, and $\mu_4 = 24$) and ratio $\mu_1/\mu_2 = 0.5$ (Fig. 2c) were considered. It can be seen that the expected value $E[W_d]$ of the ratio $W_d$ rapidly converges to the limiting value $\mu_1/\mu_2$ and also that the standard deviation $\sigma_{W_d}$ of the ratio $W_d$ decreases with the dimensionality. Moreover, in all cases, the trend agrees with the prediction of Proposition 22, according to which it holds that $E[W_d] - \mu_1/\mu_2 = O(1/d)$ and $\sigma_{W_d} = O(d^{-1/2})$.
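The experiment of Figure 2 is easy to reproduce; the following minimal sketch (not part of the original paper; the exponential distribution is one of the illustrative choices above) checks the convergence claimed by Proposition 22:

```python
# Minimal sketch: the ratio sum_i x_i / ||x_d||^2 approaches mu_1/mu_2 = 0.5
# for exponential(1) data, as predicted by Proposition 22.
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
for d in (10, 100, 1000, 10_000):
    X = rng.exponential(1.0, size=(n, d))
    W = X.sum(axis=1) / np.sum(X ** 2, axis=1)
    print(f"d={d:6d}  E[W_d]={W.mean():.4f}  sigma_W={W.std():.4f}")
# E[W_d] -> mu_1/mu_2 = 0.5 and sigma_W = O(d^{-1/2}).
```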

4.2 On the Distribution of Nearest Neighbors for i.i.d. Data

Given a real number $\varrho \in [0, 1]$, a $d$-dimensional vector $x_d$, and a $d$-dimensional random vector $Y_d$, $\mathrm{distnn}_\varrho(x_d, Y_d)$ denotes the radius of the smallest neighborhood centered in $x_d$ containing at least the $\varrho$ fraction of the realizations of $Y_d$. Moreover, $\mathrm{nn}_\varrho(x_d, Y_d)$, or $\mathrm{nn}_\varrho(x_d)$ whenever $Y_d$ is clear from the context, also called the $\varrho$-th nearest neighbor of $x_d$ w.r.t. $Y_d$, denotes an element of the set $\{y_d \in \mathbb{R}^d : f_{Y_d}(y_d) > 0 \text{ and } \mathrm{dist}(x_d, y_d) = \mathrm{distnn}_\varrho(x_d, Y_d)\}$.¹ $NN_\varrho(x_d, Y_d)$, or $NN_\varrho(x_d)$ whenever $Y_d$ is clear from the context, denotes the set of points $\{y_d \in \mathbb{R}^d : f_{Y_d}(y_d) > 0 \text{ and } \mathrm{dist}(x_d, y_d) \le \mathrm{dist}(x_d, \mathrm{nn}_\varrho(x_d, Y_d))\}$.

In order to deal with finite sets of $n$ points $\{Y_d\}^n$, the integer parameter $k = \varrho n$ ($k \in \{1, \ldots, n\}$) has to be employed in place of $\varrho$. Thus, given a positive integer $k$, $\mathrm{distnn}_k(x_d, \{Y_d\}^n)$ represents the radius of the smallest neighborhood centered in $x_d$ containing at least $k$ points of $\{Y_d\}^n$. Moreover, $\mathrm{nn}_k(x_d, \{Y_d\}^n)$, or $\mathrm{nn}_k(x_d)$, also called the $k$-th nearest neighbor of $x_d$ in $\{Y_d\}^n$, denotes an element of the set $\{y_d \in \{Y_d\}^n : \mathrm{dist}(x_d, y_d) = \mathrm{distnn}_k(x_d, \{Y_d\}^n)\}$.² $NN_k(x_d, \{Y_d\}^n)$, or $NN_k(x_d)$, denotes the set of points $\{y_d \in \{Y_d\}^n : \mathrm{dist}(x_d, y_d) \le \mathrm{dist}(x_d, \mathrm{nn}_k(x_d, \{Y_d\}^n))\}$.

In the rest of the work, given a $d$-dimensional i.i.d. random vector $X_d$ with cdf $F_X$, representing the distribution of the query points, and a $d$-dimensional i.i.d. random vector $Y_d$ with cdf $F_Y$, representing the distribution of the data points, we assume w.l.o.g. that $F_Y$ has null mean $\mu_Y = 0$. Indeed, if this is not the case, it is sufficient to replace them with the random vectors $X'_d = X_d - \mu_Y$ and $Y'_d = \hat Y_d = Y_d - \mu_Y$, such that $\mu_{Y'} = 0$. Moreover, a realization $x_d$ of $X_d$ can be replaced with $x'_d = x_d - \mu_Y$.

The following result considers the distance separating a vector from its $\varrho$-th nearest neighbor w.r.t. a $d$-dimensional i.i.d. random vector.

Lemma 23 Let $x_d$ denote a realization of a $d$-dimensional i.i.d. random vector $X_d$ having cdf $F_X$. Consider the $\varrho$-nearest neighbor $\mathrm{nn}_\varrho(x_d, Y_d)$ of $x_d$ w.r.t. a $d$-dimensional i.i.d. random vector $Y_d$ with cdf $F_Y$. Assume, w.l.o.g., that $F_Y$ has null mean $\mu_Y = 0$. Then, for large values of $d$, with high probability
$\mathrm{dist}(x_d, \mathrm{nn}_\varrho(x_d, Y_d)) \approx \sqrt{ \|x_d\|^2 + d\mu_2 + \Phi^{-1}(\varrho)\sqrt{d(\mu_4 - \mu_2^2) + 4\mu_2\|x_d\|^2 - 4\mu_3\textstyle\sum_i x_i} }.$

Proof of Lemma 23. By definition, $\mathrm{nn}_\varrho(x_d, Y_d)$ is such that
$\Pr[\mathrm{dist}(x_d, Y_d) \le \mathrm{dist}(x_d, \mathrm{nn}_\varrho(x_d, Y_d))] = \varrho.$
By Corollary 21,
$\Pr[\mathrm{dist}(x_d, Y_d) \le \mathrm{dist}(x_d, \mathrm{nn}_\varrho(x_d, Y_d))] \approx \Phi\!\left( \frac{\mathrm{dist}(x_d, \mathrm{nn}_\varrho(x_d, Y_d))^2 - E[\|x_d - Y_d\|^2]}{\sigma_{\|x_d - Y_d\|^2}} \right).$
Hence,
$\mathrm{dist}(x_d, \mathrm{nn}_\varrho(x_d, Y_d))^2 \approx E[\|x_d - Y_d\|^2] + \Phi^{-1}(\varrho)\,\sigma_{\|x_d - Y_d\|^2}.$

It has already been pointed out that if $F_Y$ has null skewness ($\mu_3 = 0$), if $F_X = F_Y$, or if $F_X$ has null mean ($\mu_X = 0$), the term $4\mu_3\sum_i x_i$ can be disregarded.

1. Because our interest is only in the fact that $\mathrm{nn}_\varrho(x_d, Y_d)$ satisfies the property $\mathrm{dist}(x_d, \mathrm{nn}_\varrho(x_d, Y_d)) = \mathrm{distnn}_\varrho(x_d, Y_d)$, it can be assumed that $\mathrm{nn}_\varrho(x_d)$ is randomly selected from the above set.
2. Because our interest is only in the fact that $\mathrm{nn}_k(x_d, \{Y_d\}^n)$ satisfies the property $\mathrm{dist}(x_d, \mathrm{nn}_k(x_d, \{Y_d\}^n)) = \mathrm{distnn}_k(x_d, \{Y_d\}^n)$, it can be assumed that $\mathrm{nn}_k(x_d)$ is randomly selected from the above set.
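The following minimal sketch (not part of the original paper; zero-mean uniform data and the specific values of d, n, k are illustrative assumptions) compares the prediction of Lemma 23, with $\varrho = k/n$, against the empirical $k$-th nearest neighbor distance:

```python
# Minimal sketch: predicted vs empirical k-th nearest neighbor distance for
# zero-mean uniform data (so the mu_3 term of Lemma 23 vanishes).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
d, n, k = 1000, 10_000, 10
mu2, mu4 = 1/12, 1/80                      # raw moments of U(-0.5, 0.5)

Y = rng.uniform(-0.5, 0.5, size=(n, d))    # data points
x = rng.uniform(-0.5, 0.5, size=d)         # query point (a realization of X_d)

sq_norm = np.dot(x, x)
rho = k / n
pred = np.sqrt(sq_norm + d * mu2
               + norm.ppf(rho) * np.sqrt(d * (mu4 - mu2 ** 2) + 4 * mu2 * sq_norm))
emp = np.sort(np.linalg.norm(Y - x, axis=1))[k - 1]
print(f"predicted kth-NN distance: {pred:.3f}   empirical: {emp:.3f}")
```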

Due to the difficulty of answering nearest neighbor queries in high-dimensional spaces, different authors have proposed to consider approximate nearest neighbor queries (Indyk and Motwani, 1998; Arya et al., 1998), returning an ε-approximate nearest neighbor instead of the exact nearest neighbor: given a point $x_d$ and $\varepsilon \ge 0$, a point $y_d$ is an ε-approximate nearest neighbor of $x_d$ if it holds that $\mathrm{dist}(x_d, y_d) \le (1 + \varepsilon)\,\mathrm{dist}(x_d, \mathrm{nn}_1(x_d))$.

Beyer et al. (1999) called a nearest neighbor query unstable for a given $\varepsilon \ge 0$ if the distance from the query point to most data points is less than $(1 + \varepsilon)$ times the distance from the query point to its nearest neighbor. Moreover, Beyer et al. (1999) have shown that in many situations, for any fixed $\varepsilon > 0$, as dimensionality rises, the probability that a query is unstable converges to 1 (see Theorem 2). Instability is undesirable because the points that fall in the enlarged query region, also called the ε-neighborhood, are valid answers to the approximate nearest neighbor problem. Thus, the larger the expected number of data points falling within the ε-neighborhoods of the query points, the smaller the meaningfulness of the approximate query scenario.

Definition 24 Let $NN^\varepsilon_\varrho(x_d, Y_d)$ denote the set of the ε-approximate $\varrho$-nearest neighbors of $x_d$, also called the ε-neighborhood, that are the realizations $y_d$ of $Y_d$ whose distance from $x_d$ is within $(1 + \varepsilon)$ times the distance separating $x_d$ from its $\varrho$-th nearest neighbor $\mathrm{nn}_\varrho(x_d, Y_d)$, i.e.,
$NN^\varepsilon_\varrho(x_d, Y_d) = \{ y_d \in \mathbb{R}^d : f_{Y_d}(y_d) > 0 \text{ and } \mathrm{dist}(x_d, y_d) \le (1+\varepsilon)\,\mathrm{dist}(x_d, \mathrm{nn}_\varrho(x_d)) \}.$

In order to quantify the meaningfulness of ε-approximate queries, it is sensible to compute the expected size of the ε-neighborhoods associated with query points with respect to the data population, which is the task pursued in the following.

Theorem 25 Let $\varepsilon \ge 0$, let $X_d$ be a $d$-dimensional i.i.d. random vector with cdf $F_X$, representing the distribution of the query points, and let $Y_d$ be a $d$-dimensional i.i.d. random vector with cdf $F_Y$, not necessarily identical to $F_X$, representing the distribution of the data points. Assume, w.l.o.g., that $F_Y$ has null mean $\mu_Y = 0$. Then, for large values of $d$,
$E\big[\,|NN^\varepsilon_k(X_d, \{Y_d\}^n)|\,\big] \approx n\,\Phi\!\left( \frac{(\varepsilon^2 + 2\varepsilon)\, d(\mu_{X,2} + \mu_{Y,2}) + (1+\varepsilon)^2\, \Phi^{-1}\!\big(\tfrac{k}{n}\big) \sqrt{d\big(\mu_{Y,4} - \mu_{Y,2}^2 + 4\mu_{Y,2}\mu_{X,2} - 4\mu_{Y,3}\mu_X\big)} }{ \sqrt{ d\big(\mu_{Y,4} - \mu_{Y,2}^2 + 4\mu_{Y,2}\mu_{X,2} - 4\mu_{Y,3}\mu_X\big) + (\varepsilon^2 + 2\varepsilon)^2\, d\big(\mu_{X,4} - \mu_{X,2}^2\big) } } \right).$

Proof of Theorem 25. Consider the probability (exploiting Corollary 21, Proposition 22, and Lemmas 16 and 23)
$\Pr[\mathrm{dist}(x_d, Y_d) \le (1+\varepsilon)\,\mathrm{dist}(x_d, \mathrm{nn}_\varrho(x_d, Y_d))] = \Pr[\mathrm{dist}(x_d, Y_d)^2 \le (1+\varepsilon)^2\,\mathrm{dist}(x_d, \mathrm{nn}_\varrho(x_d, Y_d))^2]$
$\approx \Phi\!\left( \frac{(1+\varepsilon)^2\big(E[\|x_d - Y_d\|^2] + \Phi^{-1}(\varrho)\,\sigma_{\|x_d - Y_d\|^2}\big) - E[\|x_d - Y_d\|^2]}{\sigma_{\|x_d - Y_d\|^2}} \right) = \Phi\!\left( \frac{(\varepsilon^2 + 2\varepsilon)\,E[\|x_d - Y_d\|^2]}{\sigma_{\|x_d - Y_d\|^2}} + (1+\varepsilon)^2\,\Phi^{-1}(\varrho) \right)$
$= \Phi\!\left( \frac{(\varepsilon^2 + 2\varepsilon)\big(\|x_d\|^2 + d\mu_{Y,2}\big) + (1+\varepsilon)^2\,\Phi^{-1}(\varrho)\sqrt{d\big(\mu_{Y,4} - \mu_{Y,2}^2\big) + 4\mu_{Y,2}\|x_d\|^2 - 4\mu_{Y,3}\tfrac{\mu_X}{\mu_{X,2}}\|x_d\|^2}}{\sqrt{d\big(\mu_{Y,4} - \mu_{Y,2}^2\big) + 4\mu_{Y,2}\|x_d\|^2 - 4\mu_{Y,3}\tfrac{\mu_X}{\mu_{X,2}}\|x_d\|^2}} \right).$
By taking into account the standard score of $x_d$,
$\|x_d\|^2 = z_{x_d}\,\sigma_{\|X_d\|^2} + \mu_{\|X_d\|^2} = z_{x_d}\sqrt{d(\mu_{X,4} - \mu_{X,2}^2)} + d\mu_{X,2},$
and by considering that for $\alpha_d$, $\beta_d$, and $z$ finite (note that $\phi(z)$ is practically negligible for $|z| \ge 5$) and $d$ growing, $\sqrt{\alpha_d + z\sqrt{\beta_d}} \approx \sqrt{\alpha_d}$, the above probability can be approximated with $\Phi\big(a^{X,Y}_{d,\varepsilon,\varrho} + b^{X,Y}_{d,\varepsilon}\, z_{x_d}\big)$, where
$a^{X,Y}_{d,\varepsilon,\varrho} = \frac{(\varepsilon^2 + 2\varepsilon)\, d(\mu_{X,2} + \mu_{Y,2}) + (1+\varepsilon)^2\, \Phi^{-1}(\varrho)\sqrt{d\big(\mu_{Y,4} - \mu_{Y,2}^2 + 4\mu_{Y,2}\mu_{X,2} - 4\mu_{Y,3}\mu_X\big)}}{\sqrt{d\big(\mu_{Y,4} - \mu_{Y,2}^2 + 4\mu_{Y,2}\mu_{X,2} - 4\mu_{Y,3}\mu_X\big)}},$
$b^{X,Y}_{d,\varepsilon} = \frac{(\varepsilon^2 + 2\varepsilon)\sqrt{d\big(\mu_{X,4} - \mu_{X,2}^2\big)}}{\sqrt{d\big(\mu_{Y,4} - \mu_{Y,2}^2 + 4\mu_{Y,2}\mu_{X,2} - 4\mu_{Y,3}\mu_X\big)}}.$
Consider now the expected value
$E[\,|NN^\varepsilon_\varrho(X_d)|\,] = \int_{\mathbb{R}^d} \Pr[\mathrm{dist}(x_d, Y_d) \le (1+\varepsilon)\,\mathrm{dist}(x_d, \mathrm{nn}_\varrho(x_d, Y_d))]\, \Pr[X_d = x_d]\, dx_d \approx \int_{z_{d,\min}}^{z_{d,\max}} \Phi\big(a^{X,Y}_{d,\varepsilon,\varrho} + b^{X,Y}_{d,\varepsilon}\, z_{x_d}\big)\, \phi(z_{x_d})\, dz_{x_d}.$
The statement then follows by leveraging the equation (Owen, 1980)
$\int_{-\infty}^{+\infty} \Phi(a + bz)\,\phi(z)\, dz = \Phi\!\left(\frac{a}{\sqrt{1 + b^2}}\right),$
after taking the limits of integration to infinity. Note that the extra domain of integration considered is associated with a negligible probability because $z_{d,\min} = (\inf\|X_d\|^2 - \mu_{\|X_d\|^2})/\sigma_{\|X_d\|^2}$ and $z_{d,\max} = (\sup\|X_d\|^2 - \mu_{\|X_d\|^2})/\sigma_{\|X_d\|^2}$ are such that both $\phi(z_{d,\min})$ and $\phi(z_{d,\max})$ rapidly approach zero.
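The closed form of Theorem 25 is easy to evaluate numerically. The following minimal sketch is not from the paper: the helper name and the choice F_X = F_Y with uniform data are assumptions made for illustration:

```python
# Minimal sketch: evaluating the closed form of Theorem 25 for the expected size
# of the epsilon-neighborhood (F_Y assumed zero-mean; here F_X = F_Y).
import numpy as np
from scipy.stats import norm

def expected_eps_neighborhood(eps, k, n, d, mX, m2X, m4X, m2Y, m3Y, m4Y):
    """E[|NN^eps_k|] as predicted by Theorem 25."""
    s2 = d * (m4Y - m2Y**2 + 4 * m2Y * m2X - 4 * m3Y * mX)     # sigma^2 term
    a = ((eps**2 + 2*eps) * d * (m2X + m2Y)
         + (1 + eps)**2 * norm.ppf(k / n) * np.sqrt(s2))
    b2 = (eps**2 + 2*eps)**2 * d * (m4X - m2X**2)
    return n * norm.cdf(a / np.sqrt(s2 + b2))

# uniform [-0.5, 0.5] data: mu = 0, mu2 = 1/12, mu3 = 0, mu4 = 1/80
n, k = 10_000, 1
for d in (1_000, 10_000):
    for eps in (0.01, 0.1, 0.3):
        e = expected_eps_neighborhood(eps, k, n, d, 0.0, 1/12, 1/80, 1/12, 0.0, 1/80)
        print(f"d={d:6d} eps={eps:4.2f}  E[|NN^eps_k|] ~ {e:8.1f}")
```

As a sanity check, for eps -> 0 the expression collapses to n Phi(Phi^{-1}(k/n)) = k, i.e. the neighborhood reduces to the k nearest neighbors.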

In order to validate the above result, the expected value $E[|NN^\varepsilon_k(X_d, \{Y_d\}^n)|]$ is empirically estimated for different values of $k$, $d$, and $\varepsilon \in [0, 0.5]$, by exploiting sets of n = 10,000 realizations of the random vector $Y_d$. Results are averaged by considering ten different sets. In the experiment, it is assumed that $F_X = F_Y$ and that each point of the set is used in turn as a query point; thus, the size of the ε-neighborhood may vary between $k$ and $n - 1$.

[Figure 3: three panels, uniform, normal, and exponential data (n = 10,000) — expected ε-neighborhood size versus the approximation factor ε.]

Figure 3: [Best viewed in color.] Comparison between the empirically estimated (x-marked curves) and the predicted by means of Th. 25 (o-marked curves) expected sizes of the ε-neighborhood, for n = 10,000, and d = 1,000, k = 1 (magenta dash-dotted line); d = 1,000, k = 10 (green dotted line); d = 10,000, k = 1 (red solid line); and d = 10,000, k = 10 (blue dashed line).

Figure 3 reports the results of this experiment for uniform, normal, and exponential i.i.d. data. The value $E[|NN^\varepsilon_k(X_d, \{Y_d\}^n)|]$ empirically estimated as described above is compared with the value predicted by means of Theorem 25. The curves for the number of neighbors $k \in \{1, 10\}$ and the dimensionalities $d \in \{1{,}000,\ 10{,}000\}$ are reported. The curves confirm that the prediction follows the trend of the empirical evidence, with the error vanishing as the dimensionality increases.

As already stated by Beyer et al. (1999), Theorem 2 only tells us what happens when we take the dimensionality to infinity, but nothing is said about the dimensionality at which we anticipate nearest neighbors to become unstable, and the issue must be addressed through empirical studies. The above dimensionality, called the critical dimensionality, can, however, be obtained as follows. Let $\theta \in [0, 1]$ represent a fraction of the data elements; the critical dimensionality $d_{\varrho,\varepsilon,\theta}$ for the parameters $\varrho$ and $\varepsilon$ at the threshold level $\theta$, also called selectivity, is such that
$d_{\varrho,\varepsilon,\theta} = \min\{ d \in \mathbb{N}^+ : E[\,|NN^\varepsilon_\varrho(X_d, Y_d)|\,] \ge \theta \},$
i.e., the dimensionality at which the expected size of the ε-neighborhood contains at least the $\theta$ fraction of the data points.

Figure 4 reports the critical dimensionality for $\varepsilon$ varying in $[0.001, 1]$, thresholds $\theta \in \{0.01, 0.1, 0.5\}$, n = 10,000, and $k = 1$ (i.e., $\varrho = k/n = 10^{-4}$), obtained by exploiting the expression reported in Theorem 25. For example, for $\theta = 0.1$, the plot says that for dimensionalities below the bottom curve, ε-neighborhoods contain on average 10% of the points (one hundred points for n = 10,000). Note that analogous predictions can be obtained in a very similar way for any other combination of the parameters $\varrho$, $\theta$, and $\varepsilon$, and distribution function $F$. Figure 4 also reports the values of the critical dimensionality estimated empirically (dashed lines).
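The following minimal sketch (not part of the original paper) computes a critical dimensionality under the simplifying assumptions F_X = F_Y with zero mean and zero skewness (e.g. centered uniform data), for which the formula of Theorem 25 reduces to a few terms; the scanning strategy and parameter values are illustrative:

```python
# Minimal sketch: critical dimensionality d_{rho,eps,theta} under F_X = F_Y,
# zero mean and zero skewness, by scanning d until the expected fraction of
# data falling in the eps-neighborhood reaches the selectivity theta.
import numpy as np
from scipy.stats import norm

def eps_fraction(eps, rho, d, mu2, mu4):
    s2 = d * (mu4 + 3 * mu2**2)                     # simplified sigma^2 term
    a = (eps**2 + 2*eps) * 2 * d * mu2 + (1 + eps)**2 * norm.ppf(rho) * np.sqrt(s2)
    b2 = (eps**2 + 2*eps)**2 * d * (mu4 - mu2**2)
    return norm.cdf(a / np.sqrt(s2 + b2))           # expected fraction of data

def critical_dimensionality(eps, rho, theta, mu2, mu4, d_max=10**7):
    d = 1
    while d <= d_max:
        if eps_fraction(eps, rho, d, mu2, mu4) >= theta:
            return d
        d += max(1, d // 100)                       # coarse geometric scan
    return None

mu2, mu4 = 1/12, 1/80                               # U(-0.5, 0.5) moments
for eps in (0.01, 0.1):
    for theta in (0.01, 0.1, 0.5):
        d_c = critical_dimensionality(eps, rho=1e-4, theta=theta, mu2=mu2, mu4=mu4)
        print(f"eps={eps:5.2f}  theta={theta:4.2f}  critical d ~ {d_c}")
```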

Figure 4: [Best viewed in color.] Critical dimensionality for uniform, normal, and exponential data, for ɛ ∈ [10^{-3}, 10^0], n = 10,000, k = 1, and θ = 0.01 (red solid line), θ = 0.1 (blue solid line), and θ = 0.5 (magenta solid line), predicted by exploiting Th. 25. The dashed curves represent the values of the critical dimensionality estimated empirically.

The plots highlight that the predicted critical dimensionality tends to the empirical one for decreasing ɛ and that the rate of convergence is directly proportional to θ. Interestingly, it can be seen that in different cases the reported critical dimensionality is quite high (e.g., consider small values of ɛ). Because approximate nearest neighbors must be associated with small values of θ (e.g., consider θ = 0.01) to be considered meaningful, it can be concluded that the notion of approximate nearest neighbor can be considered meaningful even in high-dimensional spaces, provided that the approximation factor ɛ is sufficiently small. Unfortunately, this does not imply that algorithms perform efficiently in these cases.

To illustrate, researchers have proposed different algorithms for approximate nearest neighbor search problems. Most of these algorithms are randomized; that is, they are associated with a failure probability δ. Specifically, the approximate nearest neighbor search problem with failure probability δ is defined as the problem of constructing a data structure over a set of points S ⊆ R^d such that, given any query point x ∈ R^d, with probability 1 − δ it reports:

P1. some y ∈ S with dist(x, y) ≤ (1 + ɛ)r (ɛ-approximate r-near neighbor);
P2. some y ∈ S with dist(x, y) ≤ (1 + ɛ) dist(x, nn(x, S)) (ɛ-approximate nearest neighbor);
P3. each point y ∈ S with dist(x, y) ≤ r (r-near neighbor reporting).

The proposed algorithms offer trade-offs between the approximation factor, the space, and the query time (Andoni, 2009). From the practical perspective, the space used by an algorithm should be as close to linear as possible. In this case, the best existing solutions are based on locality-sensitive hashing (LSH) (Indyk and Motwani, 1998; Har-Peled et al., 2012). The idea of the LSH approach is to hash the points in such a way that the probability of collision is much higher for points that are close (within distance r) to each other than for those that are far apart (at distance at least (1 + ɛ)r). Under different assumptions involving the parameters employed (Har-Peled et al., 2012), the LSH algorithm solves the ɛ-approximate r-near neighbor problem using O(n^{1+ρ_ɛ}) extra space, O(n^{ρ_ɛ}) query time, and failure probability δ = 1/e + 1/3 (footnote 3).

Figure 5: Temporal complexity of the E²LSH algorithm on uniform data for different selectivity values, namely θ = 0.1 (red solid line), θ = 0.01 (blue dashed line), and θ = 0.001 (magenta dash-dotted line), and dimensions d ∈ [10, 10³], estimated by using n = 10,000 data points and m = n query points. The plot on the left (a) concerns the cost of reporting all the neighbors. The plot on the right (b) concerns the cost of reporting just one neighbor. In the latter plot, dotted curves represent the complexity of sampling until a neighbor is retrieved.

As for the value of the exponent ρ_ɛ, for the Euclidean distance it is possible to achieve ρ_ɛ = 1/(1+ɛ)² + o_ɛ(1) (Andoni and Indyk, 2006), and it is known that this bound is tight. For example, consider that if ɛ = 0.01, then ρ_ɛ ≈ 0.98. Because meaningfulness in intrinsically high-dimensional spaces requires smaller and smaller ɛ values, this means that, if we wish to maintain a pre-defined level of selectivity θ, we expect that the efficiency of LSH-based schemes will diminish with the intrinsic dimensionality of the space.

To empirically illustrate the relationship among the selectivity θ, the intrinsic dimensionality d, and the temporal complexity γ of the search algorithm (footnote 4), we analyze the performance of the E²LSH method as a function of the expected size θ of the r-neighborhood. The E²LSH package solves the randomized r-near neighbor reporting problem exploiting the basic LSH scheme (footnote 5). After preprocessing the data set, E²LSH answers queries, typically in sub-linear time, with each near neighbor being reported with a certain probability (1 − δ = 0.9 by default).

3. The failure probability δ can be made arbitrarily small, say δ < 1/n, by running O(log(n + m)) copies of the basic LSH algorithm for P1, where n and m denote an upper bound on the number of points in the data structure and on the number of queries performed at any time. Moreover, P2 can be solved by using as building blocks O(log n) copies of an algorithm for P1, achieving failure probability O(δ log n) (Har-Peled et al., 2012). A similar strategy allows solving the nearest neighbor reporting problem P3 by building on different data structures for P1 associated with increasing values of r (Andoni and Indyk, 2006).
4. The temporal complexity is defined as the exponent γ ≥ 0 such that the total number of distances D computed by the algorithm in order to report its answer is such that D = n^γ.
5. The E²LSH package is publicly available for download.
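To make the dependence on ɛ concrete, the leading-order exponent ρ_ɛ = 1/(1+ɛ)² quoted above can be tabulated directly; a small sketch (the o_ɛ(1) term is ignored, so the figures are only indicative):

def rho_eps(eps):
    # Leading-order LSH exponent for the Euclidean distance,
    # rho_eps ~ 1/(1 + eps)^2 (Andoni and Indyk, 2006); the o(1) term is dropped.
    return 1.0 / (1.0 + eps) ** 2

n = 10_000
for eps in (1.0, 0.5, 0.1, 0.01):
    print(f"eps = {eps:5.2f}  rho ~ {rho_eps(eps):.3f}  n^rho ~ {n ** rho_eps(eps):,.0f}")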

As for the values of the other parameters employed, we used the values determined automatically by the algorithm. Figure 5 reports the results of the experiments on a family of uniformly distributed data sets composed of n = 10,000 points with d ∈ [10, 10³]. We used m = n different query points generated from the same distribution. We also varied the selectivity θ in {0.001, 0.01, 0.1} by determining the radius r such that the expected fraction of r-near neighbors of the query points is θ.

In Figure 5(a), it can be seen that the complexity of the procedure increases with θ, and this can be explained by noting that the total number of points to be reported is directly proportional to θ. However, even if θ is held fixed, in all cases the complexity of the algorithm for large d values tends to a linear scan of the data or to the cost γ_s of a random sampling procedure (footnote 6). In Figure 5(b), the algorithm has been enforced to report at most one near neighbor; hence, it stops the search as soon as it retrieves a near neighbor. It can now be seen that the complexity of the procedure decreases with θ, and this can be explained by noting that the probability of retrieving a neighbor is directly proportional to θ. The dotted curves represent the complexity γ_s of the procedure consisting in randomly selecting points until an r-near neighbor is retrieved. Additionally, in this case, it can be observed that the complexity degrades towards that of the random sampling procedure irrespectively of the selectivity value θ (footnote 7).

The above analysis provides a picture of how much better an approximate search algorithm can perform than pure random search, as a function of the selectivity and of the intrinsic dimensionality. Although the target neighborhood can be guaranteed to contain not too many points even in very large dimensional spaces, the best search algorithms may fail to perform better than random sampling. This can be explained by the poor separation of distances with the objects that are outside the approximate neighborhood.

In this regard, although the critical dimensionality is a construct with which to attempt to quantify the meaninglessness of a certain query, the relative contrast C_r (He et al., 2012) is a way to attempt to quantify its difficulty. Given a query point x, the relative contrast is a measure of separability of the nearest neighbor of x from the rest of the data set points.

Definition 26 (Adapted from He et al., 2012) Let DS be a data set consisting of n realizations of a random vector Y. The relative contrast of the data set DS for a query x, being the realization of a random vector X, is defined as C_r^k(x) = E[dist(x, DS)] / E[dist(x, nn_k(x, DS))]. Taking expectations with respect to queries, the relative contrast for the data set DS is C_r^k = E[dist(X, DS)] / E[dist(X, nn_k(X, DS))].

He et al. (2012) provide an estimate of the relative contrast for a data set valid for independent dimensions and, moreover, provide bounds on the cost of LSH-based nearest neighbor search algorithms taking into account the relative contrast. They also note that the analyses of Beyer et al. (1999) and François et al. (2007) agree with the asymptotic behavior of the relative contrast.

6. Indeed, the expected number n^{γ_s} of points to be randomly picked in order to retrieve the (1 − δ) fraction of the nθ data points that are r-near neighbors of the query point is n^{γ_s} = n(1 − δ), and γ_s = 1 + log(1 − δ)/log(n). E.g., for 1 − δ = 0.9 and n = 10,000, γ_s ≈ 0.989.
7. Note that, for a query having selectivity θ, the expected number of points to be randomly picked in order to retrieve exactly one r-near neighbor is n^{γ_s} = 1/θ and, hence, γ_s = −log(θ)/log(n).
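The two random-sampling exponents of footnotes 6 and 7 are straightforward to compute; a small sketch:

import math

def gamma_s_report_all(n, delta):
    # Random-sampling exponent for retrieving a (1 - delta) fraction of the
    # r-near neighbors (footnote 6): n^gamma = n(1 - delta).
    return 1 + math.log(1 - delta) / math.log(n)

def gamma_s_report_one(n, theta):
    # Random-sampling exponent for retrieving a single r-near neighbor of a
    # query with selectivity theta (footnote 7): n^gamma = 1/theta.
    return -math.log(theta) / math.log(n)

print(gamma_s_report_all(10_000, delta=0.1))    # ~0.989
print(gamma_s_report_one(10_000, theta=0.01))   # 0.5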

We refer to He et al. (2012) for the details. Here, we show that by exploiting the previous results we can derive an approximation for the relative contrast C_r^k of a data set that turns out to be more accurate than the estimate provided by He et al. (2012). In addition, we can derive a closed form for the relative contrast C_r^k(x) of an individual query point.

Theorem 27 Let X be a d-dimensional i.i.d. random vector with cdf F_X and let Y be a d-dimensional i.i.d. random vector with cdf F_Y. Assume, w.l.o.g., that F_Y has null mean µ_Y = 0. Then, for large values of d,

C_r^k ≈ √( µ_{Y,2} + µ_{X,2} ) / √( µ_{Y,2} + µ_{X,2} + Φ^{-1}(k/n) √( ( µ_{Y,4} − µ_{Y,2}² + 4µ_{Y,2}µ_{X,2} − 4µ_{Y,3}µ_X ) / d ) ).

Proof of Theorem 27. Consider the expected value of the squared distance separating a query point X from nn_ϱ(X, Y), leveraging Proposition 8 and Lemma 23:

E[ ||X − nn_ϱ(X, Y)||² ] = E[ dist(X, nn_ϱ(X, Y))² ] = ∫_{R^d} Pr[X = x] dist(x, nn_ϱ(x, Y))² dx
≈ ∫_0^{+∞} φ_{||X||²}(R) [ R + d µ_{Y,2} + Φ^{-1}(ϱ) √( d(µ_{Y,4} − µ_{Y,2}²) + 4µ_{Y,2}R − 4µ_{Y,3}(µ_X/µ_{X,2})R ) ] dR.

After approximating the R appearing under the square root with the expected value d µ_{X,2} of ||X||²:

E[ dist(X, nn_ϱ(X, Y))² ] ≈ ∫_0^{+∞} R φ_{||X||²}(R) dR + [ d µ_{Y,2} + Φ^{-1}(ϱ) √( d(µ_{Y,4} − µ_{Y,2}² + 4µ_{Y,2}µ_{X,2} − 4µ_{Y,3}µ_X) ) ] ∫_0^{+∞} φ_{||X||²}(R) dR
= d µ_{X,2} + d µ_{Y,2} + Φ^{-1}(ϱ) √( d(µ_{Y,4} − µ_{Y,2}² + 4µ_{X,2}µ_{Y,2} − 4µ_{Y,3}µ_X) ).

Indeed, the left-hand integral above corresponds to the expected value d µ_{X,2} of the random variable ||X||², whereas the right-hand integral evaluates to one.

According to the Jensen inequality (Johnson et al., 1994), if g is a concave function, then E[g(X)] ≤ g(E[X]); moreover, the smaller the relative variance σ_X/µ_X of X, the closer the two above values, i.e., E[g(X)] ≈ g(E[X]). Specifically, E[||X||] = E[ √(Σ_i X_i²) ] ≤ √( E[Σ_i X_i²] ) = √( E[||X||²] ) and, because σ_{||X||²}/µ_{||X||²} = O(d^{-1/2}), for large values of d, E[dist(X, Y)] ≈ √( E[dist(X, Y)²] ) and E[dist(X, nn_ϱ(X, Y))] ≈ √( E[dist(X, nn_ϱ(X, Y))²] ). The statement then follows by taking the ratio of the two square roots, since E[dist(X, Y)²] = d(µ_{X,2} + µ_{Y,2}).

We can also provide the relative contrast C_r^k(x) of an individual query point x.
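A minimal numerical sketch of the above result follows; it specializes the Theorem 27 approximation to the case F_X = F_Y with null mean (an assumption of this example; the general form also involves µ_{Y,3} and the mean of X), and compares it against a direct empirical estimate in the spirit of Definition 26. The function names are illustrative:

import numpy as np
from scipy.stats import norm

def relative_contrast_estimate(d, n, k, mu2, mu4):
    # Theorem 27 approximation specialized to F_X = F_Y with null mean;
    # mu2 and mu4 are the second and fourth moments of a single component.
    s = norm.ppf(k / n) * np.sqrt((mu4 + 3 * mu2**2) / d)
    return np.sqrt(2 * mu2) / np.sqrt(2 * mu2 + s)

def relative_contrast_empirical(data, queries, k):
    # Empirical C_r^k: mean distance to the data over mean k-NN distance,
    # with both expectations taken over the query points (Definition 26).
    num, den = [], []
    for q in queries:
        dist = np.linalg.norm(data - q, axis=1)
        num.append(dist.mean())
        den.append(np.partition(dist, k - 1)[k - 1])
    return np.mean(num) / np.mean(den)

rng = np.random.default_rng(0)
d, n, k = 500, 2000, 1
Y = rng.standard_normal((n, d))           # mu2 = 1, mu4 = 3 for N(0, 1)
X = rng.standard_normal((200, d))         # a sample of query points
print(relative_contrast_estimate(d, n, k, mu2=1.0, mu4=3.0),
      relative_contrast_empirical(Y, X, k))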

Figure 6: [Best viewed in color.] Comparison between the estimate of the relative contrast C_r provided in Theorem 27 (blue dashed line) and the estimate provided by He et al. (2012) (magenta dash-dotted line). The red solid line represents the value of the relative contrast estimated empirically.

Corollary 28 Let x denote a realization of a d-dimensional i.i.d. random vector, and let Y be a d-dimensional i.i.d. random vector with cdf F_Y. Then, for large values of d, with high probability

C_r^k(x) ≈ √( ||x||² + d µ_2 − 2µ Σ_i x_i ) / √( ||x||² + d µ_2 − 2µ Σ_i x_i + Φ^{-1}(k/n) √( d(µ_4 − µ_2²) + 4(µ_2 − µ²)||x||² − 4(µ_3 − µ µ_2) Σ_i x_i ) ).

Proof of Corollary 28. Following the same line of reasoning of Theorem 27, C_r^k(x) ≈ √( E[dist(x, Y)²] ) / √( E[dist(x, nn_ϱ(x, Y))²] ), and the statement follows by leveraging Theorem 15 and Lemma 23.

Figure 6 compares the approximation of the relative contrast provided in Theorem 27 to the approximation provided by He et al. (2012). In all the cases, the approximation of Theorem 27 is the more accurate. This can be understood by noting that He et al. (2012) estimate the relative contrast by considering the differences X_i − Y_i between the components of a query point X and of a data point Y as novel random variables, and then by determining their expectations and standard deviations. This corresponds to ignoring the form of the distribution of distances between each individual query point and all the data points, a relationship that is conversely taken into account in Theorem 27, due to the leveraging of Theorem 16, Proposition 8, and Lemma 23.

4.3 On the Distribution of Reverse Nearest Neighbors for i.i.d. Data

Given a real number ϱ ∈ [0, 1], a d-dimensional random vector Y, and a realization x of Y, it is said that a realization y of Y is a ϱ reverse nearest neighbor of x w.r.t. Y if x ∈ NN_ϱ(y, Y). The size N_ϱ(x, Y), or N_ϱ(x) whenever Y is clear from the context, of the ϱ reverse nearest neighborhood of x w.r.t. Y, also called reverse ϱ nearest neighbor count or ϱ-occurrences, is the fraction of realizations y of Y such that x ∈ NN_ϱ(y, Y). As in the previous sections, in order to deal with finite sets of n points {Y}_n, the integer parameter k = ϱn (k ∈ {1, ..., n}) must be employed in place of ϱ. In such a case, we speak of k reverse nearest neighborhood, reverse k nearest neighbor count, or k-occurrences.

Before going into the main results, the following expression provides the probability that a realization, having square norm value R, of a d-dimensional i.i.d. random vector Y lies at distance not greater than δ from a given d-dimensional vector x.

Lemma 29 Let x denote a realization of a d-dimensional i.i.d. random vector X with cdf F_X, and let Y be a d-dimensional i.i.d. random vector with cdf F_Y. Assume, w.l.o.g., that F_Y has null mean µ_Y = 0. Then, for large values of d, with high probability

Pr[ dist(x, Y) ≤ δ | ||Y||² = R ] ≈ Φ( ( δ² − R − ||x||² ) / ( 2||x|| √µ_2 ) ),

where moments are relative to the constrained random vector Y.

Proof of Lemma 29. See the appendix.

The noteworthiness of the above expression lies in the fact that, by combining it with Proposition 8, it is possible in some cases to replace multi-dimensional integrations involving the full event space R^d with one-dimensional integrations over the domain R^+_0 of the square-norm values. Specifically, it is essential to the proof of the following result.

Theorem 30 Let x denote a realization of a d-dimensional i.i.d. random vector Y, with cdf F_Y having, w.l.o.g., null mean µ = 0. Consider the reverse k nearest neighbor count N_ϱ(x) of x w.r.t. Y. Then, for large values of d, with high probability

N_ϱ(x) ≈ Φ( ( Φ^{-1}(ϱ) √( µ_4 + 3µ_2² ) − z_x √( µ_4 − µ_2² ) ) / ( 2µ_2 ) ).

Proof of Theorem 30. First of all, note that N_ϱ(x) = Pr[ x ∈ NN_ϱ(Y, Y) ]. Consider the probability

Pr[ x ∈ NN_ϱ(Y, Y) ] = ∫_{R^d} Pr[ x ∈ NN_ϱ(y, Y) ] Pr[Y = y] dy
= ∫_{R^d} Pr[ dist(x, y) ≤ dist(y, nn_ϱ(y, Y)) ] Pr[Y = y] dy
= ∫_0^{+∞} Pr[ dist(x, Y) ≤ dist(Y, nn_ϱ(Y, Y)) | ||Y||² = R ] Pr[ ||Y||² = R ] dR.

By Lemma 23 and Proposition 22, for ||Y||² = R,

dist(Y, nn_ϱ(Y, Y))² ≈ R + d µ_2 + Φ^{-1}(ϱ) √( d(µ_4 − µ_2²) + 4µ_2 R ),

while, by Lemma 29,

Pr[ dist(x, Y) ≤ dist(Y, nn_ϱ(Y, Y)) | ||Y||² = R ] ≈ Φ( ( dist(Y, nn_ϱ(Y, Y))² − R − ||x||² ) / ( 2||x|| √µ_2 ) )
= Φ( ( Φ^{-1}(ϱ) √( d(µ_4 − µ_2²) + 4µ_2 R ) + d µ_2 − ||x||² ) / ( 2||x|| √µ_2 ) ),

from which it follows that

Pr[ x ∈ NN_ϱ(Y, Y) ] ≈ ∫_0^{+∞} Φ( ( Φ^{-1}(ϱ) √( d(µ_4 − µ_2²) + 4µ_2 R ) + d µ_2 − ||x||² ) / ( 2||x|| √µ_2 ) ) φ_{||Y||²}(R) dR,

where moments are relative to the constrained random vector Y. The proof proceeds by expressing ||x||² and R in terms of their standard scores with respect to the random variable ||Y||², i.e., ||x||² = µ_{||Y||²} + z_x σ_{||Y||²} and R = µ_{||Y||²} + z_{R,||Y||²} σ_{||Y||²}. By substituting in the argument of Φ above, and by considering that for α and β finite and d growing, √( α d + β √d ) ≈ √( α d ), the argument becomes

Φ( ( Φ^{-1}(ϱ) √( d(µ_4 − µ_2²) + 4d µ_2² + z_{R,||Y||²} 4µ_2 √( d(µ_4 − µ_2²) ) ) − z_x √( d(µ_4 − µ_2²) ) ) / ( 2√µ_2 √( d µ_2 + z_x √( d(µ_4 − µ_2²) ) ) ) )
≈ Φ( ( Φ^{-1}(ϱ) √( µ_4 + 3µ_2² ) − z_x √( µ_4 − µ_2² ) ) / ( 2µ_2 ) ) = C(z_x, ϱ).

Since, for large values of d, moments tend to their unconstrained values (see the proof of Lemma 29), the last expression depends on z_x and on ϱ, but not on R. Thus

Pr[ x ∈ NN_ϱ(Y, Y) ] ≈ C(z_x, ϱ) ∫_0^{+∞} φ_{||Y||²}(R) dR = C(z_x, ϱ).

As for the expression reported in the statement of Theorem 30, it does not explicitly depend on the dimensionality d or on the exact position of the point x, but only on the relative position of the square norm of the point with respect to its expected value. Thus, the following definition naturally arises.

Definition 31 Let z denote the standard score of the square norm. Then, the infinite-dimensional k-occurrences function N_ϱ : R → [0, 1], defined as

N_ϱ(z) = Φ( ( Φ^{-1}(ϱ) √( µ_4 + 3µ_2² ) − z √( µ_4 − µ_2² ) ) / ( 2µ_2 ) ),   (3)

represents the fraction of points having a point with square norm standard score z among their ϱ nearest neighbors.

An alternative expression can be provided by leveraging the kurtosis κ = µ_4/µ_2², a well-known measure of tailedness of a probability distribution, that is

N_ϱ(z) = Φ( ( Φ^{-1}(ϱ) √( κ + 3 ) − z √( κ − 1 ) ) / 2 ).   (4)

In particular, it holds from the development of Theorem 30 that

lim_{d→∞} Pr[ N_ϱ(X_d) = N_ϱ(z) ] = φ(z),   (5)

from which it can be seen that the point with z → −∞ is such that N_ϱ(z) → 1 and φ(z) → 0. This point precisely corresponds with the expected value of X (the origin of the space for variables with null mean) and is the point most likely to be selected as nearest neighbor by any other point. At the same time, it is the point least likely to be observed as a realization of X among those having norms smaller than the expected value. As for the point with z → +∞, it is such that N_ϱ(z) → 0 and φ(z) → 0. Hence, it is the least likely to be observed as a realization of X, but it is also the least likely to be selected as a nearest neighbor. This point is the furthest from the mean, and it is located on the boundary of a bounded region or ad infinitum.

Figure 7 shows the curves of N_ϱ (red lines) for i.i.d. data coming from different distributions, together with the empirical N_k/n values (black dots), with k = ϱn, in a random sample of n = 10,000 points. Theoretical N_ϱ curves represent the picture of what happens ad infinitum, because they provide the values to which the k-occurrences counts converge for large dimensionalities d. We observe that in most cases the convergence arises very soon; often a few tens of dimensions suffice. Indeed, empirical values tend to distribute along the associated curves. While the first two distributions have null skewness, the same behavior is exhibited by the third one, which instead has non-null skewness, even if, in this case, convergence appears to be slower. In any case, it appears that the value of the k-occurrences predicted by means of the function N_ϱ usually is in good agreement with the empirical evidence, even for the smallest dimensionality considered in the figure. For similar plots on real data sets, we refer to Figures 10, 11, and 12, reported in the following.

It is now of interest to obtain the cdf and pdf of N_ϱ(X_d) for large d values, together with the associated variance and expected value.
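The comparison underlying Figure 7 can be sketched as follows (a minimal Python sketch with illustrative function names; it assumes uniform i.i.d. data, which is centered before computing square-norm standard scores because the theory assumes null-mean components, and it uses smaller n and d than the figure to keep the running time low):

import numpy as np
from scipy.stats import norm

def n_occurrences_curve(z, rho, kurt):
    # Infinite-dimensional k-occurrences function of Definition 31,
    # written in the kurtosis form of Equation (4).
    return norm.cdf((norm.ppf(rho) * np.sqrt(kurt + 3) - z * np.sqrt(kurt - 1)) / 2)

def empirical_k_occurrences(data, k):
    # N_k(x)/n for every point x: the fraction of points that have x among
    # their k nearest neighbors (a point is not a neighbor of itself).
    n = data.shape[0]
    counts = np.zeros(n)
    for j in range(n):
        dist = np.linalg.norm(data - data[j], axis=1)
        dist[j] = np.inf
        knn = np.argpartition(dist, k)[:k]     # indices of the k NNs of point j
        counts[knn] += 1
    return counts / n

rng = np.random.default_rng(0)
n, d, k = 1000, 500, 10                        # rho = k/n = 0.01
X = rng.uniform(size=(n, d))
Xc = X - X.mean(axis=0)                        # center the components
sq = np.sum(Xc**2, axis=1)
z = (sq - sq.mean()) / sq.std()                # square-norm standard scores
pred = n_occurrences_curve(z, rho=k / n, kurt=9 / 5)   # kurtosis of U(0,1) is 9/5
emp = empirical_k_occurrences(Xc, k)
print(np.corrcoef(pred, emp)[0, 1])            # empirical counts should follow the curve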

Figure 7: [Best viewed in color.] Comparison between the empirical values of the relative number of k-occurrences N_k/n, estimated in a random sample of n = 10,000 points with d ∈ {100, 1,000}, and the values predicted by means of the function N_ϱ (red curves), for ϱ = 0.01 (magenta dots and solid red line), ϱ = 0.05 (green dots and dashed red line), ϱ = 0.10 (blue dots and dash-dotted red line), and ϱ = 0.25 (cyan dots and dotted red line).

Theorem 32
(i) lim_{d→∞} Pr[ N_ϱ(X_d) ≤ θ ] = Φ( ( 2µ_2 Φ^{-1}(θ) − Φ^{-1}(ϱ) √( µ_4 + 3µ_2² ) ) / √( µ_4 − µ_2² ) );
(ii) lim_{d→∞} Pr[ N_ϱ(X_d) = θ ] = ( 2µ_2 / √( µ_4 − µ_2² ) ) φ( ( 2µ_2 Φ^{-1}(θ) − Φ^{-1}(ϱ) √( µ_4 + 3µ_2² ) ) / √( µ_4 − µ_2² ) ) / φ( Φ^{-1}(θ) );
(iii) lim_{d→∞} σ²( N_ϱ(X_d) ) = ϱ(1 − ϱ) − 2 T( Φ^{-1}(ϱ), √( 2µ_2² / ( µ_4 + µ_2² ) ) );
(iv) lim_{d→∞} E[ N_ϱ(X_d) ] = ϱ;
where T(h, a) = φ(h) ∫_0^a φ(h x) / (1 + x²) dx is the Owen's T function.

Proof of Theorem 32. See the appendix.

Figure 8 shows the cdfs and the pdfs of the limiting distributions of the k-occurrences for the uniform, normal, and exponential distributions. As expected, the probability of observing large values of N_ϱ increases with ϱ. Moreover, it can be observed from the pdf functions that, for the exponential distribution, the probability of observing N_ϱ ≈ 1 is not negligible even for moderately large values of ϱ.
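The limiting quantities of Theorem 32, in the form reconstructed above, can be evaluated with standard library routines and cross-checked by Monte Carlo; a minimal sketch for a single choice of distribution and ϱ (the moments used are those of the standard normal, an assumption of this example):

import numpy as np
from scipy.stats import norm
from scipy.special import owens_t

mu2, mu4 = 1.0, 3.0          # second and fourth moments of a N(0, 1) component
rho = 0.05

def limiting_cdf(theta):
    # Limiting cdf of N_rho(X_d) as d grows (Theorem 32(i), as reconstructed above).
    return norm.cdf((2 * mu2 * norm.ppf(theta) - norm.ppf(rho) * np.sqrt(mu4 + 3 * mu2**2))
                    / np.sqrt(mu4 - mu2**2))

# Limiting variance via Owen's T function (Theorem 32(iii), as reconstructed above)
var_limit = rho * (1 - rho) - 2 * owens_t(norm.ppf(rho),
                                          np.sqrt(2 * mu2**2 / (mu4 + mu2**2)))

# Monte Carlo cross-check: N_rho(Z) with Z standard normal, kurtosis = mu4/mu2^2
z = np.random.default_rng(0).standard_normal(200_000)
kappa = mu4 / mu2**2
n_rho = norm.cdf((norm.ppf(rho) * np.sqrt(kappa + 3) - z * np.sqrt(kappa - 1)) / 2)
print(var_limit, n_rho.var())                        # the two variances should agree closely
print(limiting_cdf(0.10), np.mean(n_rho <= 0.10))    # and so should the cdf values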

Figure 8: [Best viewed in color.] Cumulative distribution function (left column) and probability density function (right column) of the limiting distribution of the number of k-occurrences for i.i.d. random vectors (see Th. 32) according to different families of distributions, for ϱ = 0.01 (red solid line), ϱ = 0.05 (blue dashed line), ϱ = 0.10 (magenta dash-dotted line), and ϱ = 0.25 (green dotted line).

This behavior can be better understood by looking at Figure 7, where the theoretical curves of the exponential distribution approach 1 earlier.

Corollary 33 Let X and Y be two d-dimensional i.i.d. random vectors with common cdf F_Y having, w.l.o.g., null mean µ_Y = 0. Then, for large values of d,

Pr[ N_k(X, {Y}_n) ≤ h ] ≈ Φ( ( 2µ_2 Φ^{-1}(h/n) − Φ^{-1}(k/n) √( µ_4 + 3µ_2² ) ) / √( µ_4 − µ_2² ) ).

Proof of Corollary 33. The statement follows immediately from Theorem 32.

In order to compare the solution derived here to the large-dimensional limits of the function N(n, d, k) provided by Newman et al. (1983), the same limits are derived next.

Corollary 34 Let k be a fixed positive integer. Then (i) lim_{n→∞} lim_{d→∞} N(n, d, k) →_D 0, (ii) lim_{n→∞} lim_{d→∞} σ²( N(n, d, k) ) = ∞, and (iii) lim_{n→∞} lim_{d→∞} E[ N(n, d, k) ] = k.

Proof of Corollary 34. See the appendix.

Results provided by Newman et al. (1983) correspond to points (i) and (ii) above for the case k = 1. Point (iii) is reported only as a check, because k is expected by definition of the k-occurrences.

The interpretation of the above result provided by Tversky et al. (1983), which is typically how it is reported in the related literature, is that if the number of dimensions is large relative to the number of points, one may expect to have a large proportion of points with reverse nearest neighbor counts equaling 0, and a small proportion of points with high count values. However, according to the previous findings, the convergence in distribution to zero does not have to be motivated by the excess of the dimensions with respect to the sample size, but rather by the use of a fixed-size neighborhood parameter k in the presence of large samples. As a matter of fact, the curves reported in Figure 8 tend to the zero distribution only for ϱ = k/n → 0. Moreover, large counts can also be achieved in the case of small samples and large dimensionalities, as shown in Figures 7 and 8. E.g., from Equation (5), the expected fraction of points such that z ≤ −1 (z ≤ −2, resp.) is about 15.9% (2.3%, resp.) for any sample size n. All of this suggests that hubness is definitely not an artifact of a finite sample.

4.4 The Distribution of Independent Non-Identically Distributed Data

Previous results can be extended to independent non-identically distributed random vectors. With this aim, we consider the following proposition.

Given a sequence W_1, W_2, ..., W_d of independent non-identically distributed random variables having non-null variances (footnote 8) and finite central moments ˆµ_{i,k} up to the eighth moment, we say that the sequence has comparable central moments if there exist positive constants ˆµ_max ≥ max_{i,k} { ˆµ_{i,k} } and ˆµ_min ≤ min_i { ˆµ_{i,k} : ˆµ_{i,k} ≠ 0 }. Intuitively, this guarantees that the ratio between the greatest and the smallest non-null moment remains limited (footnote 9).

Proposition 35 Let U_d = Σ_{i=1}^d W_i be a random variable defined as the summation of a sequence of independent, but not identically distributed, random variables W_i having comparable central moments. Then

U_d → N( Σ_{i=1}^d µ_{W_i}, Σ_{i=1}^d σ²_{W_i} ) = N( d µ̄_W, d σ̄²_W ),

where µ̄_W = (1/d) Σ_{i=1}^d µ_{W_i} and σ̄²_W = (1/d) Σ_{i=1}^d σ²_{W_i}.

Proof of Proposition 35. See the appendix.

François et al. (2007, cf. Proposition 2) affirmed that Theorem 3 holds for independent non-identically distributed variables provided that they are normalized. The authors justify this result by noting that norms will concentrate because normalization prevents variables from having too little effect on the distance values. According to this interpretation, normalization is essential for having comparable variances. Recall that the variance is the second-order central moment.

Definition 36 Let Y = (Y_1, Y_2, ..., Y_d) be an independent non-identically distributed d-dimensional random vector with cdfs F_Y = (F_{Y_1}, F_{Y_2}, ..., F_{Y_d}) having k-th moments µ_k = (µ_{Y_1,k}, µ_{Y_2,k}, ..., µ_{Y_d,k}) = (µ_{1,k}, µ_{2,k}, ..., µ_{d,k}). Moreover, given a positive integer h and k ≥ 1, let µ̄^h_k denote the average h-th degree of the k-th central moments of Y, also referred to as average central moment for simplicity, defined as

µ̄^h_k = (1/d) Σ_{i=1}^d ˆµ^h_{i,k} = (1/d) Σ_{i=1}^d ( E[ (Y_i − µ_{Y_i})^k ] )^h,

where ˆµ_{i,k} denotes the k-th central moment of Y_i.

Because considering random variables having null means simplifies expressions, for the sake of simplicity we next consider the case of independent non-identically distributed random vectors having common cdfs, but we note that a similar result also holds in the more general case F_X ≠ F_Y.

Theorem 37 Let X and Y be two independent non-identically distributed d-dimensional random vectors with common cdfs F having means µ = (µ_1, ..., µ_d), non-null variances, and comparable central moments, and let x denote a realization of X. The results of Sections 4.1, 4.2 and 4.3 can be applied to X, Y, and x by taking into account the average central moments of X and Y and the realization x − µ.

Proof of Theorem 37. See the appendix for details.

8. Clearly, variables having constant domain, hence null variance, can be disregarded because they do not alter distances.
9. This definition fits the Lyapunov condition. In general, given a sequence W_1, W_2, ..., W_d of independent non-identically distributed random variables having non-null finite variances, their standardized sum converges in distribution to a standard normal random variable if and only if the Feller-Lindeberg condition holds (Feller). According to this condition, the variance σ²_{W_i} of any individual term never dominates their sum s²_d; hence, lim_{d→∞} max_i σ²_{W_i}/s²_d = 0. Because this condition is both necessary and sufficient for the CLT to hold, the Feller-Lindeberg condition is implied by the Lyapunov condition.
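In practice, applying Theorem 37 requires nothing more than the average central moments of Definition 36 (with degree h = 1), which can be computed directly from a data sample; a minimal sketch with an illustrative function name:

import numpy as np

def average_central_moments(data, orders=(2, 3, 4)):
    # Average central moments of Definition 36 (degree h = 1): for each order k,
    # the mean over the d attributes of the empirical E[(Y_i - mu_i)^k].
    centered = data - data.mean(axis=0, keepdims=True)
    return {k: np.mean(np.mean(centered**k, axis=0)) for k in orders}

# Example with independent but non-identically distributed columns; the resulting
# averages can be plugged into the i.i.d. formulas of the previous sections.
rng = np.random.default_rng(0)
data = rng.exponential(scale=np.linspace(0.5, 2.0, 50), size=(5000, 50))
print(average_central_moments(data))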

To illustrate the above results, real data sets having dimensionalities varying by some orders of magnitude are considered. The data sets are briefly described next. The Statlog (Landsat Satellite) data set (footnote 10) consists of multi-spectral values of pixels in 3×3 neighborhoods in a satellite image (d = 36, n = 6,435). The SIFT data set (footnote 11) consists of the base vectors of the ANN_SIFT10K evaluation set used to evaluate the quality of approximate nearest neighbor search algorithms and consists of SIFT image descriptors (d = 128, n = 10,000). The MNIST data set (footnote 12) consists of handwritten digits which have been size-normalized and centered in a 28×28 image; the test examples have been employed (d = 784, n = 10,000). The Sports data set (footnote 13) consists of time series representing sensor measurements associated with activities performed by eight subjects for 5 minutes (d = 5,625, n = 9,120). The NIPS textual data set (footnote 14) consists of counts associated with words appearing in 5,812 NIPS conference papers published between 1987 and 2015 (d = 11,463, n = 5,812). The RNA-Seq data set (footnote 15) consists of gene expression levels, measured by an Illumina HiSeq sequencing platform, of patients having different types of tumors (d = 20,531, n = 801).

In the following, we also consider the shuffled version of each original data set. Specifically, the shuffled version of a given data set is obtained by randomly permuting the elements within every attribute. As already noted by François et al. (2007), the shuffled data set is marginally distributed as the original one, but, because all the relationships between variables are destroyed, its components are independent and its intrinsic dimension is equal to its embedding dimension.

Figure 9 reports the empirical cdf of the squared distance (solid line) associated with each data set, together with the theoretical cdf (dashed line) associated with independent but not identically distributed data having the same average central moments of the original data, as reported in Theorem 37. The latter curve has been obtained by using the average central moments of the data according to Theorem 37. The empirical cdf of the shuffled data is also reported (dotted line). From the linearity of the expected value, for any pair of d-dimensional random vectors it follows that E[||X − Y||²] = d(µ̄_{X,2} + µ̄_{Y,2}) and E[||x − Y||²] = ||x − µ_Y||² + d µ̄_{Y,2}, where the equality holds also for dependent and non-identically distributed random vectors. Hence, the expected value of the pairwise distances between data set points is identical to the expected value of the theoretical distribution derived under the i.i.d. hypothesis, and also to that of the shuffled data.

10.–15. Data publicly available for download.
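The shuffled version used throughout the experiments is simple to reproduce; a minimal sketch (illustrative function name) that permutes each attribute independently, thereby preserving marginals while destroying inter-attribute dependencies:

import numpy as np

def shuffled_version(data, seed=0):
    # Permute the entries of every column independently.
    rng = np.random.default_rng(seed)
    out = data.copy()
    n = out.shape[0]
    for i in range(out.shape[1]):
        out[:, i] = out[rng.permutation(n), i]
    return out

# Quick check: marginals (e.g., column means) are preserved, correlations are destroyed.
X = np.random.default_rng(1).multivariate_normal([0, 0], [[1, 0.9], [0.9, 1]], size=10_000)
Xs = shuffled_version(X)
print(np.corrcoef(X.T)[0, 1], np.corrcoef(Xs.T)[0, 1])   # strong vs. near-zero correlation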

Figure 9: [Best viewed in color.] Pairwise distance distributions for the real data sets, including the original data (red solid line), the shuffled data (blue dashed line), and the equivalent independent data (magenta dotted line). The last curve corresponds to the theoretical cdf associated with independent but not identically distributed data having the same average central moments of the original data, as reported in Theorem 37.

The difference between the curves of the original and of the independent data suggests that the intrinsic dimensionality of the data at hand is smaller than that of the embedding space, because dependencies evidently exist between the attributes. Moreover, it can be seen that the empirical cdf of the shuffled data is very similar to the theoretical cdf. These results confirm the correctness of Theorem 37, whose prediction agrees with the empirical observation on real independent data. Moreover, its approximation is accurate even in moderately large spaces, because the correspondence is good even for the smallest data set considered (d = 36). Finally, these experiments testify to the meaningfulness of the analysis accomplished here as a worst-case scenario, corresponding to the case in which relationships between variables are absent.

In order to verify whether the data sets satisfy the requirements for the CLT to be applied, the value of the finite Lyapunov CLT condition (see Equation 1, for δ = 2) has been determined on the data at hand, with the variable W_i taking values over all the terms (x_{j,i} − x_{k,i})² that can be formed with distinct pairs of data set points x_j and x_k, 1 ≤ j < k ≤ n. Table 1 reports the value of the above condition (indicated as LC), together with the Relative Variance (RV) of the norm of the data set points, for both the original data set (note that the shuffled data presents the same LC value as the original one) and its normalized version, obtained by substituting each attribute X_i with (X_i − µ_i)/σ_i:

Data set     | Original LC | Original RV | Normalized LC | Normalized RV
Satellite    |             |             |               |
SIFT         |             |             |               |
MNIST        |             |             |               |
Sports       |             |             |               |
NIPS         |             |             |               |
RNA-Seq      |             |             |               |

Table 1: Values of the Lyapunov CLT Condition (LC) and of the Relative Variance (RV) of the real data sets, for both the original and the normalized data.

A small LC value (say, less than 1) indicates that the normal approximation is correct. This condition is met for all the configurations except for the normalized MNIST data set. Indeed, normalizing this data set is not meaningful, because attributes corresponding to pixels are already homogeneous (their domain consists of 256 gray levels encoded as byte values) and because the normalization has only the negative effect of exaggerating the range of variation of pixels whose domain is formed almost entirely of zeros. As a result, a few attributes dominate the distance, and this deteriorates convergence to normality. The relative variance has been reported for comparison, because it is a measure of the concentration of the data. Note that the relative variance of the shuffled data is not coincident with that of the original one.

Figure 10 reports the relative number of k-occurrences associated with data set points, represented in terms of their square norm standard score z. Different values of the parameter ϱ have been employed, specifically ϱ ∈ {0.01, 0.05, 0.10, 0.25} (the color of the points in the figure is magenta for ϱ = 0.01, green for ϱ = 0.05, blue for ϱ = 0.10, and cyan for ϱ = 0.25). The theoretical curves of N_ϱ(z), based on the average central moments of the data, are also reported for comparison. Interestingly, the real distributions of k-occurrences follow the trends of the theoretical curves; however, in contrast to the independent case, counts have much larger variability. The interpretation is that this variability is associated with a lower intrinsic dimensionality and with dependencies among variables, because independent data are much more widely distributed along the theoretical trend, as can be seen in Figure 7. Indeed, such behavior is also observed when considering the shuffled data sets (see Figure 11). Moreover, in this case, the empirical evidence matches the behavior predicted by Theorem 37 for the independent case. In some cases, the trend appears to be different, albeit generally analogous. It appears that the degree of agreement between the empirical evidence and the theoretical prediction is directly proportional to the value of the LC condition reported in Table 1.
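The LC values of Table 1 can be approximated on a data sample as follows. This is a sketch only: it assumes that Equation (1) is the standard finite-d Lyapunov ratio with δ = 2, namely Σ_i E[(W_i − E W_i)^4] / (Σ_i Var(W_i))², with W_i ranging over the per-attribute terms (x_{j,i} − x_{k,i})² of distinct pairs of points, and it subsamples pairs for tractability:

import numpy as np

def lyapunov_condition(data, n_pairs=20_000, seed=0):
    # Estimate the per-attribute moments of W_i = (x_{j,i} - x_{k,i})^2 over a
    # random subsample of distinct pairs, then form the Lyapunov ratio (delta = 2).
    rng = np.random.default_rng(seed)
    n, d = data.shape
    j = rng.integers(0, n, size=n_pairs)
    k = rng.integers(0, n, size=n_pairs)
    keep = j != k
    w = (data[j[keep]] - data[k[keep]]) ** 2          # shape: (pairs, d)
    var = w.var(axis=0)                                # Var(W_i) per attribute
    m4 = np.mean((w - w.mean(axis=0)) ** 4, axis=0)    # 4th central moment per attribute
    return m4.sum() / var.sum() ** 2

# A value well below 1 suggests that the normal approximation of distances is adequate.
X = np.random.default_rng(2).uniform(size=(2000, 100))
print(lyapunov_condition(X))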

Figure 10: [Best viewed in color.] Relative number of k-occurrences associated with data set points, represented in terms of their square norm standard score z (different colors), and theoretical prediction according to the infinite-dimensional k-occurrences function reported in Equation (3) (red lines), for the following values of ϱ = k/n: ϱ_1 = 0.01 (magenta-colored points and solid red line), ϱ_2 = 0.05 (green-colored points and dashed red line), ϱ_3 = 0.10 (blue-colored points and dash-dotted red line), and ϱ_4 = 0.25 (cyan-colored points and dotted red line).

4.5 Extension to Other Distances

In general, one may attempt to extend some of the properties discussed above to distances having the general form

dist(x, y) = h( Σ_{i=1}^d g(x_i, y_i) ),   (6)

with g : R² → R being commutative and not identically constant, and h : R → R strictly monotonic and, hence, invertible. Indeed, let X and Y be d-dimensional i.i.d. random vectors, and consider the random variable

h^{-1}( dist(X, Y) ) = Σ_{i=1}^d g(X_i, Y_i) = Σ_{i=1}^d W_i.

Because W_1, W_2, W_3, ... is a sequence of i.i.d. random variables, by the CLT it can be said that

h^{-1}( dist(X, Y) ) → N( d E[ g(X_i, Y_i) ], d σ²( g(X_i, Y_i) ) ),

and, for large values of d,

Pr[ dist(X, Y) ≤ δ ] ≈ Φ( ( h^{-1}(δ) − d E[ g(X_i, Y_i) ] ) / ( √d σ( g(X_i, Y_i) ) ) ).

As an example, consider the Minkowski norm L_p, ||x||_p = ( Σ_{i=1}^d |x_i|^p )^{1/p}, with p a positive integer. Let, for the sake of simplicity, p be even; then (with C(p, j) denoting the binomial coefficient)

E[ ||X − Y||_p^p ] = d Σ_{j=0}^p C(p, j) (−1)^j µ_{X,p−j} µ_{Y,j},

and

σ²( ||X − Y||_p^p ) = d [ Σ_{j=0}^p C(p, j)² σ²( X_i^{p−j} Y_i^j ) + Σ_{j=0}^p Σ_{k=0, k≠j}^p (−1)^{j+k} C(p, j) C(p, k) cov( X_i^{p−j} Y_i^j, X_i^{p−k} Y_i^k ) ].

Newman and Rinott (1985) reported a generalized version of Theorem 4, in which distances of the form of Equation (6) are considered.

Theorem 38 (Adapted from Newman and Rinott, 1985, cf. Theorem 3) Consider the generalized distance function reported in Equation (6). Let β_g = corr( g(X, Y), g(X, Z) ) be the correlation between g(X, Y) and g(X, Z), where X, Y, Z are i.i.d. random variables

Figure 11: [Best viewed in color.] Relative number of k-occurrences associated with data set points (here the shuffled version of each data set is considered, obtained by randomly permuting the elements within every attribute), represented in terms of their square norm standard score z (different colors), and theoretical prediction according to the infinite-dimensional k-occurrences function reported in Equation (3), for the following values of ϱ = k/n: ϱ_1 = 0.01 (magenta-colored points and solid red line), ϱ_2 = 0.05 (green-colored points and dashed red line), ϱ_3 = 0.10 (blue-colored points and dash-dotted red line), and ϱ_4 = 0.25 (cyan-colored points and dotted red line).
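Returning to the generalized distances of Section 4.5, the CLT approximation of the distance cdf can be checked numerically for the Minkowski case; a minimal sketch (illustrative function names; it assumes uniform i.i.d. components, an even p, and estimates the single-component moments of g(x, y) = (x − y)^p by Monte Carlo):

import numpy as np
from scipy.stats import norm

def lp_distance_cdf_normal(delta, d, p, comp_sampler, n_mc=200_000, seed=0):
    # With g(x, y) = (x - y)^p and h(u) = u^(1/p), Section 4.5 gives
    # Pr[dist <= delta] ~ Phi((delta^p - d*E[g]) / (sqrt(d)*sigma(g))).
    rng = np.random.default_rng(seed)
    g = (comp_sampler(rng, n_mc) - comp_sampler(rng, n_mc)) ** p
    return norm.cdf((delta**p - d * g.mean()) / (np.sqrt(d) * g.std()))

# Comparison with the empirical cdf for uniform components, p = 4, d = 500
rng = np.random.default_rng(1)
d, p, n = 500, 4, 2000
sampler = lambda r, m: r.uniform(size=m)
X, Y = rng.uniform(size=(n, d)), rng.uniform(size=(n, d))
dist = np.sum((X - Y) ** p, axis=1) ** (1 / p)
delta = np.quantile(dist, 0.25)
print(lp_distance_cdf_normal(delta, d, p, sampler), np.mean(dist <= delta))  # both ~0.25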


Tractability results for weighted Banach spaces of smooth functions Tractability results for weighte Banach spaces of smooth functions Markus Weimar Mathematisches Institut, Universität Jena Ernst-Abbe-Platz 2, 07740 Jena, Germany email: markus.weimar@uni-jena.e March

More information

Chapter 6: Energy-Momentum Tensors

Chapter 6: Energy-Momentum Tensors 49 Chapter 6: Energy-Momentum Tensors This chapter outlines the general theory of energy an momentum conservation in terms of energy-momentum tensors, then applies these ieas to the case of Bohm's moel.

More information

arxiv: v2 [math.st] 29 Oct 2015

arxiv: v2 [math.st] 29 Oct 2015 EXPONENTIAL RANDOM SIMPLICIAL COMPLEXES KONSTANTIN ZUEV, OR EISENBERG, AND DMITRI KRIOUKOV arxiv:1502.05032v2 [math.st] 29 Oct 2015 Abstract. Exponential ranom graph moels have attracte significant research

More information

Flexible High-Dimensional Classification Machines and Their Asymptotic Properties

Flexible High-Dimensional Classification Machines and Their Asymptotic Properties Journal of Machine Learning Research 16 (2015) 1547-1572 Submitte 1/14; Revise 9/14; Publishe 8/15 Flexible High-Dimensional Classification Machines an Their Asymptotic Properties Xingye Qiao Department

More information

New Statistical Test for Quality Control in High Dimension Data Set

New Statistical Test for Quality Control in High Dimension Data Set International Journal of Applie Engineering Research ISSN 973-456 Volume, Number 6 (7) pp. 64-649 New Statistical Test for Quality Control in High Dimension Data Set Shamshuritawati Sharif, Suzilah Ismail

More information

Physics 2212 K Quiz #2 Solutions Summer 2016

Physics 2212 K Quiz #2 Solutions Summer 2016 Physics 1 K Quiz # Solutions Summer 016 I. (18 points) A positron has the same mass as an electron, but has opposite charge. Consier a positron an an electron at rest, separate by a istance = 1.0 nm. What

More information

12.11 Laplace s Equation in Cylindrical and

12.11 Laplace s Equation in Cylindrical and SEC. 2. Laplace s Equation in Cylinrical an Spherical Coorinates. Potential 593 2. Laplace s Equation in Cylinrical an Spherical Coorinates. Potential One of the most important PDEs in physics an engineering

More information

Sparse Reconstruction of Systems of Ordinary Differential Equations

Sparse Reconstruction of Systems of Ordinary Differential Equations Sparse Reconstruction of Systems of Orinary Differential Equations Manuel Mai a, Mark D. Shattuck b,c, Corey S. O Hern c,a,,e, a Department of Physics, Yale University, New Haven, Connecticut 06520, USA

More information

Linear First-Order Equations

Linear First-Order Equations 5 Linear First-Orer Equations Linear first-orer ifferential equations make up another important class of ifferential equations that commonly arise in applications an are relatively easy to solve (in theory)

More information

CHAPTER 1 : DIFFERENTIABLE MANIFOLDS. 1.1 The definition of a differentiable manifold

CHAPTER 1 : DIFFERENTIABLE MANIFOLDS. 1.1 The definition of a differentiable manifold CHAPTER 1 : DIFFERENTIABLE MANIFOLDS 1.1 The efinition of a ifferentiable manifol Let M be a topological space. This means that we have a family Ω of open sets efine on M. These satisfy (1), M Ω (2) the

More information

Quantum Mechanics in Three Dimensions

Quantum Mechanics in Three Dimensions Physics 342 Lecture 20 Quantum Mechanics in Three Dimensions Lecture 20 Physics 342 Quantum Mechanics I Monay, March 24th, 2008 We begin our spherical solutions with the simplest possible case zero potential.

More information

Node Density and Delay in Large-Scale Wireless Networks with Unreliable Links

Node Density and Delay in Large-Scale Wireless Networks with Unreliable Links Noe Density an Delay in Large-Scale Wireless Networks with Unreliable Links Shizhen Zhao, Xinbing Wang Department of Electronic Engineering Shanghai Jiao Tong University, China Email: {shizhenzhao,xwang}@sjtu.eu.cn

More information

Make graph of g by adding c to the y-values. on the graph of f by c. multiplying the y-values. even-degree polynomial. graph goes up on both sides

Make graph of g by adding c to the y-values. on the graph of f by c. multiplying the y-values. even-degree polynomial. graph goes up on both sides Reference 1: Transformations of Graphs an En Behavior of Polynomial Graphs Transformations of graphs aitive constant constant on the outsie g(x) = + c Make graph of g by aing c to the y-values on the graph

More information

G4003 Advanced Mechanics 1. We already saw that if q is a cyclic variable, the associated conjugate momentum is conserved, L = const.

G4003 Advanced Mechanics 1. We already saw that if q is a cyclic variable, the associated conjugate momentum is conserved, L = const. G4003 Avance Mechanics 1 The Noether theorem We alreay saw that if q is a cyclic variable, the associate conjugate momentum is conserve, q = 0 p q = const. (1) This is the simplest incarnation of Noether

More information

Perfect Matchings in Õ(n1.5 ) Time in Regular Bipartite Graphs

Perfect Matchings in Õ(n1.5 ) Time in Regular Bipartite Graphs Perfect Matchings in Õ(n1.5 ) Time in Regular Bipartite Graphs Ashish Goel Michael Kapralov Sanjeev Khanna Abstract We consier the well-stuie problem of fining a perfect matching in -regular bipartite

More information

Expected Value of Partial Perfect Information

Expected Value of Partial Perfect Information Expecte Value of Partial Perfect Information Mike Giles 1, Takashi Goa 2, Howar Thom 3 Wei Fang 1, Zhenru Wang 1 1 Mathematical Institute, University of Oxfor 2 School of Engineering, University of Tokyo

More information

Chapter 4. Electrostatics of Macroscopic Media

Chapter 4. Electrostatics of Macroscopic Media Chapter 4. Electrostatics of Macroscopic Meia 4.1 Multipole Expansion Approximate potentials at large istances 3 x' x' (x') x x' x x Fig 4.1 We consier the potential in the far-fiel region (see Fig. 4.1

More information

Jointly continuous distributions and the multivariate Normal

Jointly continuous distributions and the multivariate Normal Jointly continuous istributions an the multivariate Normal Márton alázs an álint Tóth October 3, 04 This little write-up is part of important founations of probability that were left out of the unit Probability

More information

Lecture 6: Calculus. In Song Kim. September 7, 2011

Lecture 6: Calculus. In Song Kim. September 7, 2011 Lecture 6: Calculus In Song Kim September 7, 20 Introuction to Differential Calculus In our previous lecture we came up with several ways to analyze functions. We saw previously that the slope of a linear

More information

A Weak First Digit Law for a Class of Sequences

A Weak First Digit Law for a Class of Sequences International Mathematical Forum, Vol. 11, 2016, no. 15, 67-702 HIKARI Lt, www.m-hikari.com http://x.oi.org/10.1288/imf.2016.6562 A Weak First Digit Law for a Class of Sequences M. A. Nyblom School of

More information

The chromatic number of graph powers

The chromatic number of graph powers Combinatorics, Probability an Computing (19XX) 00, 000 000. c 19XX Cambrige University Press Printe in the Unite Kingom The chromatic number of graph powers N O G A A L O N 1 an B O J A N M O H A R 1 Department

More information

Database-friendly Random Projections

Database-friendly Random Projections Database-frienly Ranom Projections Dimitris Achlioptas Microsoft ABSTRACT A classic result of Johnson an Linenstrauss asserts that any set of n points in -imensional Eucliean space can be embee into k-imensional

More information

A variance decomposition and a Central Limit Theorem for empirical losses associated with resampling designs

A variance decomposition and a Central Limit Theorem for empirical losses associated with resampling designs Mathias Fuchs, Norbert Krautenbacher A variance ecomposition an a Central Limit Theorem for empirical losses associate with resampling esigns Technical Report Number 173, 2014 Department of Statistics

More information

Influence of weight initialization on multilayer perceptron performance

Influence of weight initialization on multilayer perceptron performance Influence of weight initialization on multilayer perceptron performance M. Karouia (1,2) T. Denœux (1) R. Lengellé (1) (1) Université e Compiègne U.R.A. CNRS 817 Heuiasyc BP 649 - F-66 Compiègne ceex -

More information

The effect of dissipation on solutions of the complex KdV equation

The effect of dissipation on solutions of the complex KdV equation Mathematics an Computers in Simulation 69 (25) 589 599 The effect of issipation on solutions of the complex KV equation Jiahong Wu a,, Juan-Ming Yuan a,b a Department of Mathematics, Oklahoma State University,

More information

LECTURE NOTES ON DVORETZKY S THEOREM

LECTURE NOTES ON DVORETZKY S THEOREM LECTURE NOTES ON DVORETZKY S THEOREM STEVEN HEILMAN Abstract. We present the first half of the paper [S]. In particular, the results below, unless otherwise state, shoul be attribute to G. Schechtman.

More information

Situation awareness of power system based on static voltage security region

Situation awareness of power system based on static voltage security region The 6th International Conference on Renewable Power Generation (RPG) 19 20 October 2017 Situation awareness of power system base on static voltage security region Fei Xiao, Zi-Qing Jiang, Qian Ai, Ran

More information

Prep 1. Oregon State University PH 213 Spring Term Suggested finish date: Monday, April 9

Prep 1. Oregon State University PH 213 Spring Term Suggested finish date: Monday, April 9 Oregon State University PH 213 Spring Term 2018 Prep 1 Suggeste finish ate: Monay, April 9 The formats (type, length, scope) of these Prep problems have been purposely create to closely parallel those

More information