On the Surprising Behavior of Distance Metrics in High Dimensional Space


Charu C. Aggarwal (1), Alexander Hinneburg (2), and Daniel A. Keim (2)

(1) IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, USA. charu@watson.ibm.com
(2) Institute of Computer Science, University of Halle, Kurt-Mothes-Str. 1, 06120 Halle (Saale), Germany. {hinneburg, keim}@informatik.uni-halle.de

Abstract. In recent years, the effect of the curse of high dimensionality has been studied in great detail on several problems such as clustering, nearest neighbor search, and indexing. In high dimensional space the data become sparse, and traditional indexing and algorithmic techniques fail from an efficiency and/or effectiveness perspective. Recent research results show that in high dimensional space, the concept of proximity, distance, or nearest neighbor may not even be qualitatively meaningful. In this paper, we view the dimensionality curse from the point of view of the distance metrics which are used to measure the similarity between objects. We specifically examine the behavior of the commonly used L_k norm and show that the problem of meaningfulness in high dimensionality is sensitive to the value of k. For example, this means that the Manhattan distance metric (L_1 norm) is consistently more preferable than the Euclidean distance metric (L_2 norm) for high dimensional data mining applications. Using the intuition derived from our analysis, we introduce and examine a natural extension of the L_k norm to fractional distance metrics. We show that fractional distance metrics provide more meaningful results both from the theoretical and the empirical perspective. The results show that fractional distance metrics can significantly improve the effectiveness of standard clustering algorithms such as the k-means algorithm.

1 Introduction

In recent years, high dimensional search and retrieval have become very well studied problems because of the increased importance of data mining applications [1], [2], [3], [4], [5], [8], [10], [11].
Typically, most real applications which require the use of such techniques comprise very high dimensional data. For such applications, the curse of high dimensionality tends to be a major obstacle in the development of data mining techniques in several ways. For example, the performance of similarity indexing structures in high dimensions degrades rapidly, so that each query requires the access of almost all the data [11].

J. Van den Bussche and V. Vianu (Eds.): ICDT 2001, LNCS 1973, pp. 420-434, 2001. (c) Springer-Verlag Berlin Heidelberg 2001

It has been argued in [6] that, under certain reasonable assumptions on the data distribution, the ratio of the distances of the nearest and farthest neighbors to a given target in high dimensional space is almost 1 for a wide variety of data distributions and distance functions. In such a case, the nearest neighbor problem becomes ill defined, since the contrast between the distances to different data points does not exist. In such cases, even the concept of proximity may not be meaningful from a qualitative perspective: a problem which is even more fundamental than the performance degradation of high dimensional algorithms.

In most high dimensional applications the choice of the distance metric is not obvious, and the notion used for the calculation of similarity is very heuristical. Given the non-contrasting nature of the distribution of distances to a given query point, different measures may provide very different orders of proximity of points to a given query point. There is very little literature on providing guidance for choosing the correct distance measure which results in the most meaningful notion of proximity between two records. Many high dimensional indexing structures and algorithms use the Euclidean distance metric as a natural extension of its traditional use in two- or three-dimensional spatial applications. In this paper, we discuss the general behavior of the commonly used L_k norm (for x, y in R^d and integer k, L_k(x, y) = (sum_{i=1}^{d} |x_i - y_i|^k)^{1/k}) in high dimensional space. The L_k norm distance function is also susceptible to the dimensionality curse for many classes of data distributions [6]. Our recent results [9] seem to suggest that the L_k norm may be more relevant for k = 1 or k = 2 than for values of k >= 3. In this paper, we provide some surprising theoretical and experimental results analyzing the dependency of the L_k norm on the value of k. More specifically, we show that the relative contrasts of the distances to a query point depend heavily on the L_k metric used.
This provides considerable evidence that the meaningfulness of the L_k norm worsens faster with increasing dimensionality for higher values of k. Thus, for a given problem with a fixed (high) value of the dimensionality d, it may be preferable to use lower values of k. This means that the L_1 distance metric (the Manhattan distance metric) is the most preferable for high dimensional applications, followed by the Euclidean metric (L_2), then the L_3 metric, and so on. Encouraged by this trend, we examine the behavior of fractional distance metrics, in which k is allowed to be a fraction smaller than 1. We show that this metric is even more effective at preserving the meaningfulness of proximity measures. We back up our theoretical results with empirical tests on real and synthetic data, showing that the results provided by fractional distance metrics are indeed practically useful. Thus, the results of this paper have strong implications for the choice of distance metrics for high dimensional data mining problems. We specifically show the improvements which can be obtained by applying fractional distance metrics to the standard k-means algorithm.

This paper is organized as follows. In the next section, we provide a theoretical analysis of the behavior of the L_k norm in very high dimensionality. In Section 3, we discuss fractional distance metrics and provide a theoretical analysis of their behavior. In Section 4, we provide the empirical results, and Section 5 provides a summary and conclusions.
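The dependence of the relative contrast on the norm parameter is easy to probe numerically. The following is a minimal sketch (our own illustration, not the authors' code; the function names and sample sizes are arbitrary choices) that estimates the relative contrast (Dmax - Dmin)/Dmin of uniform data under several L_k metrics:

```python
import random

def lk_dist(x, y, k):
    """Minkowski (L_k) distance; k may be any positive real."""
    return sum(abs(a - b) ** k for a, b in zip(x, y)) ** (1.0 / k)

def mean_relative_contrast(d, n, k, trials=10, seed=0):
    """Average (Dmax - Dmin)/Dmin of the L_k distances to the origin, uniform data."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        pts = [[rng.random() for _ in range(d)] for _ in range(n)]
        dists = [lk_dist(p, [0.0] * d, k) for p in pts]
        total += (max(dists) - min(dists)) / min(dists)
    return total / trials

# Because the seed is fixed, all k values are compared on the same point sets.
contrasts = {k: mean_relative_contrast(d=100, n=50, k=k) for k in (1, 2, 4)}
# Higher norm parameters yield poorer contrast on uniform high-dimensional data.
assert contrasts[1] > contrasts[2] > contrasts[4]
```

Comparing the metrics on the same point sets keeps the estimates correlated, so the qualitative ordering emerges even with modest sample sizes.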

2 Behavior of the L_k Norm in High Dimensionality

In order to present our convergence results, we first establish some notations and definitions in Table 1.

Table 1. Notations and Basic Definitions

Notation                  | Definition
d                         | Dimensionality of the data space
N                         | Number of data points
F                         | 1-dimensional data distribution in (0, 1)
X_d                       | Data point from F^d, with each coordinate drawn from F
dist_d^k(x, y)            | Distance between (x_1, ..., x_d) and (y_1, ..., y_d) using the L_k metric: (sum_{i=1}^{d} |x_i - y_i|^k)^{1/k}
||.||_k                   | Distance of a vector to the origin (0, ..., 0) using the function dist_d^k(., .)
Dmax_d^k = max ||X_d||_k  | Farthest distance of the N points to the origin using the distance metric L_k
Dmin_d^k = min ||X_d||_k  | Nearest distance of the N points to the origin using the distance metric L_k
E[X], var[X]              | Expected value and variance of a random variable X
Y_d ->_p c                | A vector sequence Y_1, ..., Y_d converges in probability to a constant vector c if: for all epsilon > 0, lim_{d->inf} P[dist_d(Y_d, c) <= epsilon] = 1

Theorem 1 (Beyer et al., adapted for the L_k metric). If lim_{d->inf} var(||X_d||_k / E[||X_d||_k]) = 0, then (Dmax_d^k - Dmin_d^k) / Dmin_d^k ->_p 0.

Proof. See [6] for the proof of a more general version of this result.

The result of the theorem [6] shows that the difference between the maximum and minimum distances to a given query point does not increase as fast as the nearest distance to any point in high dimensional space. This makes a proximity query meaningless and unstable, because there is poor discrimination between the nearest and furthest neighbor. Henceforth, we will refer to the ratio (Dmax_d^k - Dmin_d^k) / Dmin_d^k as the relative contrast. The results in [6] use the value of Dmax_d^k / Dmin_d^k as an interesting criterion for meaningfulness. In order to provide more insight, in the following we analyze the behavior for different distance metrics in high-dimensional space. We first assume a uniform distribution of data points and show our results for N = 2 points. Then, we generalize the results to an arbitrary number of points and arbitrary distributions. In this paper, we consistently use the origin as the query point.
This choice does not affect the generality of our results, though it simplifies our algebra considerably.
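Theorem 1's qualitative content, that the relative contrast vanishes as the dimensionality grows, can be observed directly. A small sketch (our own code, with arbitrarily chosen sample sizes), using the L_2 metric:

```python
import random

def l2_norm(x):
    return sum(v * v for v in x) ** 0.5

def mean_contrast(d, n, trials=20, seed=1):
    """Average relative contrast (Dmax - Dmin)/Dmin to the origin, uniform data."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        dists = [l2_norm([rng.random() for _ in range(d)]) for _ in range(n)]
        total += (max(dists) - min(dists)) / min(dists)
    return total / trials

# Under L_2 the relative contrast of uniform data shrinks as d grows
# (roughly like 1/sqrt(d), per the analysis that follows).
c10, c100, c1000 = (mean_contrast(d, n=10) for d in (10, 100, 1000))
assert c10 > c100 > c1000
```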

Lemma 1. Let F be the uniform distribution of N = 2 points. For an L_k metric,
lim_{d->inf} E[(Dmax_d^k - Dmin_d^k) / d^{1/k - 1/2}] = C * (1/(k+1))^{1/k} * sqrt(1/(2k+1)), where C is some constant.

Proof. Let A and B be the two points in a d-dimensional data distribution such that each coordinate is independently drawn from a 1-dimensional data distribution F with finite mean and standard deviation. Specifically, A = (P_1, ..., P_d) and B = (Q_1, ..., Q_d), with P_i and Q_i drawn from F. Let PA_d = (sum_{i=1}^{d} (P_i)^k)^{1/k} be the distance of A to the origin using the L_k metric, and PB_d = (sum_{i=1}^{d} (Q_i)^k)^{1/k} the distance of B. The difference of the distances is PA_d - PB_d = (sum_{i=1}^{d} (P_i)^k)^{1/k} - (sum_{i=1}^{d} (Q_i)^k)^{1/k}.

It can be shown(2) that the random variable (P_i)^k has mean 1/(k+1) and standard deviation sqrt(1/(2k+1) - 1/(k+1)^2). This means that (PA_d)^k / d ->_p 1/(k+1) and (PB_d)^k / d ->_p 1/(k+1), and therefore

PA_d / d^{1/k} ->_p (1/(k+1))^{1/k},   PB_d / d^{1/k} ->_p (1/(k+1))^{1/k}.   (1)

We intend to show that E[(PA_d - PB_d) / d^{1/k - 1/2}] -> C * (1/(k+1))^{1/k} * sqrt(1/(2k+1)). We can express PA_d - PB_d in the following numerator/denominator form, which we will use in order to examine the convergence behavior of the numerator and denominator individually:

PA_d - PB_d = ((PA_d)^k - (PB_d)^k) / (sum_{r=0}^{k-1} (PA_d)^r (PB_d)^{k-1-r}).   (2)

Dividing both sides by d^{1/k - 1/2} and regrouping the right-hand side we get:

(PA_d - PB_d) / d^{1/k - 1/2} = (((PA_d)^k - (PB_d)^k) / sqrt(d)) / (sum_{r=0}^{k-1} (PA_d / d^{1/k})^r (PB_d / d^{1/k})^{k-1-r}).   (3)

Consequently, using Slutsky's theorem(3) and the results of Equation 1, we obtain

sum_{r=0}^{k-1} (PA_d / d^{1/k})^r (PB_d / d^{1/k})^{k-1-r} ->_p k * (1/(k+1))^{(k-1)/k}.   (4)

Having characterized the convergence behavior of the denominator of the right-hand side of Equation 3, let us now examine the behavior of the numerator: ((PA_d)^k - (PB_d)^k) / sqrt(d) = (sum_{i=1}^{d} ((P_i)^k - (Q_i)^k)) / sqrt(d) = (sum_{i=1}^{d} R_i) / sqrt(d). Here R_i is the new random variable defined by (P_i)^k - (Q_i)^k for i in {1, ..., d}. This random variable has zero mean and standard deviation sqrt(2) * sigma, where

(2) This is because E[(P_i)^k] = 1/(k+1) and E[(P_i)^{2k}] = 1/(2k+1).
(3) Slutsky's Theorem: Let Y_1, ..., Y_d, ... be a sequence of random vectors and h(.) be a continuous function. If Y_d ->_p c then h(Y_d) ->_p h(c).

sigma is the standard deviation of (P_i)^k. The sum of the different values of R_i over the d dimensions will converge to a normal distribution with mean 0 and standard deviation sqrt(2) * sigma * sqrt(d) because of the central limit theorem. Consequently, the mean average deviation of this distribution will be C * sigma * sqrt(d) for some constant C. Therefore, we have:

lim_{d->inf} E[((PA_d)^k - (PB_d)^k) / sqrt(d)] = C * sigma = C * sqrt(1/(2k+1) - 1/(k+1)^2).   (5)

Since the denominator of Equation 3 shows probabilistic convergence, we can combine the results of Equations 4 and 5 to obtain, for some constant C,

lim_{d->inf} E[(PA_d - PB_d) / d^{1/k - 1/2}] = C * (1/(k+1))^{1/k} * sqrt(1/(2k+1)).   (6)

We can easily generalize the result to a database of N uniformly distributed points. The following corollary provides the result.

Corollary 1. Let F be the uniform distribution of N = n points. Then,
C * (1/(k+1))^{1/k} * sqrt(1/(2k+1)) <= lim_{d->inf} E[(Dmax_d^k - Dmin_d^k) / d^{1/k - 1/2}] <= C * (n-1) * (1/(k+1))^{1/k} * sqrt(1/(2k+1)).

Proof. This is because if L is the expected difference between the maximum and minimum of two randomly drawn points, then the same value for n points drawn from the same distribution must be in the range (L, (n-1) * L).

The results can be modified for arbitrary distributions of N points in a database by introducing the constant factor C_k. In that case, the general dependency of Dmax_d^k - Dmin_d^k on d^{1/k - 1/2} remains unchanged. A detailed proof is provided in the Appendix; a short outline of the reasoning behind the result is available in [9].

Lemma 2 [9]. Let F be an arbitrary distribution of N = 2 points. Then, lim_{d->inf} E[(Dmax_d^k - Dmin_d^k) / d^{1/k - 1/2}] = C_k, where C_k is some constant dependent on k.

Corollary 2. Let F be an arbitrary distribution of N = n points. Then,
C_k <= lim_{d->inf} E[(Dmax_d^k - Dmin_d^k) / d^{1/k - 1/2}] <= (n-1) * C_k.

Thus, this result shows that in high dimensional space Dmax_d^k - Dmin_d^k increases at the rate of d^{1/k - 1/2}, independent of the data distribution. This means that for the Manhattan distance metric (k = 1) the value of this expression diverges to infinity; for the Euclidean distance metric (k = 2) the expression is bounded by constants; whereas for all other distance metrics (k >= 3) it converges to 0 (see Figure 1). Furthermore, the convergence is faster when the value of k of the L_k metric increases.
This provides the insight that higher norm parameters provide poorer contrast between the furthest and nearest neighbor. Even more insight may be obtained by examining the exact behavior of the relative contrast, as opposed to the absolute difference between the distances to the furthest and nearest point.
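The d^{1/k - 1/2} rate is easy to check by simulation. Below is a small sketch (our own illustration, with arbitrarily chosen sample sizes) contrasting the divergent k = 1 case with a convergent k > 2 case; k = 4 is used rather than k = 3 only because its d^{-1/4} decay is visible at smaller d:

```python
import random

def lk_norm(x, k):
    return sum(abs(v) ** k for v in x) ** (1.0 / k)

def mean_abs_spread(d, n, k, trials=20, seed=2):
    """Average Dmax - Dmin (absolute spread) of L_k distances to the origin."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        dists = [lk_norm([rng.random() for _ in range(d)], k) for _ in range(n)]
        total += max(dists) - min(dists)
    return total / trials

dims = (16, 256, 2048)
s1 = [mean_abs_spread(d, 10, 1) for d in dims]  # grows roughly like d^(1/2)
s4 = [mean_abs_spread(d, 10, 4) for d in dims]  # shrinks roughly like d^(-1/4)
assert s1[0] < s1[1] < s1[2]   # diverges for the Manhattan metric
assert s4[0] > s4[1] > s4[2]   # converges to 0 for k = 4
```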

Fig. 1. |Dmax_d^k - Dmin_d^k| depending on d for different metrics (uniform data): (a) k = 3, (b) k = 2, (c) k = 1, (d) k = 2/3, (e) k = 2/5.

Table 2. Effect of dimensionality on the relative (L_1 and L_2) behavior of the relative contrast

Dimensionality | P[U < T]
1              | Both metrics are the same
2              | 85.0%
3              | 88.7%
4              | 91.3%
10             | 95.6%
15             | 96.1%
20             | 97.1%
100            | 98.2%

Theorem 2. Let F be the uniform distribution of N = 2 points. Then,
lim_{d->inf} E[((Dmax_d^k - Dmin_d^k) / Dmin_d^k) * sqrt(d)] = C * sqrt(1/(2k+1)).

Proof. Let A, B, P_1, ..., P_d, Q_1, ..., Q_d, PA_d, PB_d be defined as in the proof of Lemma 1. We have shown in the proof of the previous result that PA_d / d^{1/k} ->_p (1/(k+1))^{1/k}. Using Slutsky's theorem we can derive that:

min{PA_d, PB_d} / d^{1/k} ->_p (1/(k+1))^{1/k}.   (7)

We have also shown in the previous result that:

lim_{d->inf} E[(PA_d - PB_d) / d^{1/k - 1/2}] = C * (1/(k+1))^{1/k} * sqrt(1/(2k+1)).   (8)

We can combine the results in Equations 7 and 8 to obtain:

lim_{d->inf} E[((PA_d - PB_d) / min{PA_d, PB_d}) * sqrt(d)] = C * sqrt(1/(2k+1)).   (9)

Note that the above results confirm the results in [6], because they show that the relative contrast degrades as 1/sqrt(d) for the different distance norms. Note

that for values of d in the reasonable range of data mining applications, the norm-dependent factor of sqrt(1/(2k+1)) may play a valuable role in affecting the relative contrast. For such cases, even the relative rate of degradation of the different distance metrics for a given data set with the same value of the dimensionality may be important. In Figure 2 we have illustrated the relative contrast created by an artificially generated data set drawn from a uniform distribution in d = 20 dimensions. Clearly, the relative contrast decreases with increasing value of k, and it also follows the same trend as sqrt(1/(2k+1)).

Fig. 2. Relative contrast variation with the norm parameter for the uniform distribution (N = 100, N = 1,000, and N = 10,000).

Fig. 3. Unit spheres for different fractional metrics in 2D (f = 1, 0.75, 0.5, 0.25).

Another interesting aspect which can be explored to improve nearest neighbor and clustering algorithms in high-dimensional space is the effect of d on the relative contrast. Even though the expected relative contrast always decreases with increasing dimensionality, this may not necessarily be true for a given data set and different k. To show this, we performed the following experiment with the Manhattan (L_1) and Euclidean (L_2) distance metrics: let U = (Dmax_d^2 - Dmin_d^2) / Dmin_d^2 and T = (Dmax_d^1 - Dmin_d^1) / Dmin_d^1. We performed some empirical tests to calculate the value of P[U < T] for the case of the Manhattan (L_1) and Euclidean (L_2) distance metrics for N = 10 points drawn from a uniform distribution. In each trial, U and T were calculated from the same set of N = 10 points, and P[U < T] was calculated by finding the fraction of times U was less than T over 1,000 trials. The results of the experiment are given in Table 2.
It is clear that with increasing dimensionality, the value of P[U < T] continues to increase. Thus, for higher dimensionality, the relative contrast provided by a norm with a smaller parameter is more likely to dominate that of a norm with a larger parameter. For dimensionalities of 20 or higher, it is clear that the Manhattan distance metric provides a significantly higher relative contrast than the Euclidean distance metric with very high probability. Thus, among the distance metrics with integral norms, the Manhattan distance metric is the method of choice for providing the best contrast between the different points. This result of our analysis can be directly used in a number of different applications.
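The P[U < T] experiment is straightforward to reproduce. A minimal sketch (our own code; the trial count is reduced from the paper's 1,000 for speed):

```python
import random

def relative_contrast(points, k):
    """(Dmax - Dmin)/Dmin of L_k distances to the origin (coordinates in (0,1))."""
    dists = [sum(v ** k for v in p) ** (1.0 / k) for p in points]
    return (max(dists) - min(dists)) / min(dists)

def p_u_less_t(d, n=10, trials=500, seed=7):
    """Estimate P[U < T], where U and T are the L_2 and L_1 relative
    contrasts computed on the same set of n uniform points in d dimensions."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        pts = [[rng.random() for _ in range(d)] for _ in range(n)]
        if relative_contrast(pts, 2) < relative_contrast(pts, 1):
            wins += 1
    return wins / trials

# Table 2 trend: the L_1 contrast dominates more often as d grows.
assert p_u_less_t(2) < p_u_less_t(20)
assert p_u_less_t(20) > 0.9
```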

3 Fractional Distance Metrics

The result of the previous section, that the Manhattan metric (k = 1) provides the best discrimination in high-dimensional data spaces, is the motivation for looking into distance metrics with k < 1. We call these metrics fractional distance metrics. A fractional distance metric dist_d^f (L_f norm) for f in (0, 1) is defined as:

dist_d^f(x, y) = (sum_{i=1}^{d} |x_i - y_i|^f)^{1/f}.

To give an intuition of the behavior of the fractional distance metric, we have plotted in Figure 3 the unit spheres for different fractional metrics in R^2.

We will prove most of our results in this section assuming that f is of the form 1/l, where l is some integer. The reason that we show the results for this special case is that we are able to use nice algebraic tricks for the proofs. The natural conjecture from the smooth continuous variation of dist_d^f with f is that the results are also true for arbitrary values of f.(4) Our results provide considerable insight into the behavior of the fractional distance metric and its relationship with the L_k norm for integral values of k.

Lemma 3. Let F be the uniform distribution of N = 2 points and f = 1/l for some integer l. Then,
lim_{d->inf} E[(Dmax_d^f - Dmin_d^f) / d^{1/f - 1/2}] = C * (1/(f+1))^{1/f} * sqrt(1/(2f+1)).

Proof. Let A, B, P_1, ..., P_d, Q_1, ..., Q_d, PA_d, PB_d be defined using the L_f metric as they were defined in Lemma 1 for the L_k metric. Let further QA_d = (PA_d)^f = sum_{i=1}^{d} (P_i)^f and QB_d = (PB_d)^f = sum_{i=1}^{d} (Q_i)^f. Analogous to Lemma 1, QA_d / d ->_p 1/(f+1) and QB_d / d ->_p 1/(f+1).

We intend to show that E[(PA_d - PB_d) / d^{1/f - 1/2}] -> C * (1/(f+1))^{1/f} * sqrt(1/(2f+1)). The difference of the distances is PA_d - PB_d = (sum_{i=1}^{d} (P_i)^f)^{1/f} - (sum_{i=1}^{d} (Q_i)^f)^{1/f} = (QA_d)^l - (QB_d)^l. Note that this expression is of the form a^l - b^l = (a - b) * (sum_{r=0}^{l-1} a^r b^{l-1-r}). Therefore, PA_d - PB_d can be written as (sum_{i=1}^{d} ((P_i)^f - (Q_i)^f)) * (sum_{r=0}^{l-1} (QA_d)^r (QB_d)^{l-1-r}).
Dividing both sides by d^{1/f - 1/2} = d^{l - 1/2} and regrouping the right-hand side, we get:

(PA_d - PB_d) / d^{1/f - 1/2} = ((sum_{i=1}^{d} ((P_i)^f - (Q_i)^f)) / sqrt(d)) * (sum_{r=0}^{l-1} (QA_d / d)^r (QB_d / d)^{l-1-r}).   (10)

By using the convergence results above in Equation 10, we can derive that:

(PA_d - PB_d) / d^{1/f - 1/2} ->_p ((sum_{i=1}^{d} ((P_i)^f - (Q_i)^f)) / sqrt(d)) * l * (1/(1+f))^{l-1}.   (11)

(4) Empirical simulations of the relative contrast show this is indeed the case.

The random variable (P_i)^f - (Q_i)^f has zero mean and standard deviation sqrt(2) * sigma, where sigma is the standard deviation of (P_i)^f. The sum of the different values of (P_i)^f - (Q_i)^f over the d dimensions will converge to a normal distribution with mean 0 and standard deviation sqrt(2) * sigma * sqrt(d) because of the central limit theorem. Consequently, the expected mean average deviation of this normal distribution is C * sigma * sqrt(d) for some constant C. Therefore, we have:

lim_{d->inf} E[((PA_d)^f - (PB_d)^f) / sqrt(d)] = C * sigma = C * sqrt(1/(2f+1) - 1/(f+1)^2).   (12)

Combining the results of Equations 11 and 12, we get:

lim_{d->inf} E[(PA_d - PB_d) / d^{1/f - 1/2}] = C * (1/(f+1))^{1/f} * sqrt(1/(2f+1)).   (13)

A direct consequence of the above result is the following generalization to N = n points.

Corollary 3. Let F be the uniform distribution of N = n points and f = 1/l for some integer l. Then, for some constant C we have:
C * (1/(f+1))^{1/f} * sqrt(1/(2f+1)) <= lim_{d->inf} E[(Dmax_d^f - Dmin_d^f) / d^{1/f - 1/2}] <= C * (n-1) * (1/(f+1))^{1/f} * sqrt(1/(2f+1)).

Proof. Similar to Corollary 1.

The above result shows that the absolute difference between the maximum and minimum for the fractional distance metric increases at the rate of d^{1/f - 1/2}. Thus, the smaller the fraction f, the greater the rate of absolute divergence between the maximum and minimum value. Now, we will examine the relative contrast of the fractional distance metric.

Theorem 3. Let F be the uniform distribution of N = 2 points and f = 1/l for some integer l. Then,
lim_{d->inf} E[((Dmax_d^f - Dmin_d^f) / Dmin_d^f) * sqrt(d)] = C * sqrt(1/(2f+1)) for some constant C.

Proof. Analogous to the proof of Theorem 2.

The following is the direct generalization to N = n points.

Corollary 4. Let F be the uniform distribution of N = n points, and f = 1/l for some integer l. Then, for some constant C,
C * sqrt(1/(2f+1)) <= lim_{d->inf} E[((Dmax_d^f - Dmin_d^f) / Dmin_d^f) * sqrt(d)] <= C * (n-1) * sqrt(1/(2f+1)).

Proof. Analogous to the proof of Corollary 1.
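The claim that smaller f gives sharper contrast can be checked directly. A rough sketch (our own illustration; sample sizes are arbitrary) comparing fractional and integral norms on the same uniform point sets:

```python
import random

def lf_dist_to_origin(p, f):
    """L_f dissimilarity to the origin; works for fractional f as well."""
    return sum(abs(v) ** f for v in p) ** (1.0 / f)

def mean_contrast(f, d=100, n=20, trials=40, seed=5):
    """Average relative contrast (Dmax - Dmin)/Dmin under the L_f metric."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        dists = [lf_dist_to_origin([rng.random() for _ in range(d)], f)
                 for _ in range(n)]
        total += (max(dists) - min(dists)) / min(dists)
    return total / trials

# Relative contrast grows as the norm parameter shrinks below 1
# (theoretical trend: proportional to sqrt(1/(2f+1))).
assert mean_contrast(0.3) > mean_contrast(1.0) > mean_contrast(2.0)
```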

This result is true for the case of arbitrary values of f (not just f = 1/l) and N, but the use of these specific values of f helps considerably in simplifying the proof of the result. The empirical simulation in Figure 2 shows the behavior for arbitrary values of f and N. The curve for each value of N is different, but all curves fit the general trend of reduced contrast with increased value of f. Note that the value of the relative contrast is the same for the integral distance metric L_k and the fractional distance metric L_f in the boundary case f = k = 1.

The above results show that fractional distance metrics provide better contrast than integral distance metrics, both in terms of the absolute distributions of points to a given query point and in terms of relative distances. This is a surprising result in light of the fact that the Euclidean distance metric is traditionally used in a large variety of indexing structures and data mining applications. The widespread use of the Euclidean distance metric stems from the natural extension of applicability to spatial database systems (many multidimensional indexing structures were initially proposed in the context of spatial systems). However, from the perspective of high dimensional data mining applications, this natural interpretability in 2- or 3-dimensional spatial systems is completely irrelevant. Whether the theoretical behavior of the relative contrast also translates into practically useful implications for high dimensional data mining applications is an issue which we will examine in greater detail in the next section.

4 Empirical Results

In this section, we show that our surprising findings can be directly applied to improve existing mining techniques for high-dimensional data. For the experiments, we used synthetic and real data. The synthetic data consists of a number of clusters (the data inside the clusters follow a normal distribution, and the cluster centers are uniformly distributed).
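The paper does not spell out exactly how the fractional metric is plugged into k-means. One plausible minimal variant (our own sketch, not the authors' implementation) swaps the L_f distance into the assignment step while keeping the coordinate-wise mean as the centroid update (the mean is exact only for L_2, so this is an approximation), and recovers well-separated synthetic clusters of the kind described above:

```python
import random

def lf_dist(x, y, f):
    return sum(abs(a - b) ** f for a, b in zip(x, y)) ** (1.0 / f)

def kmeans_lf(data, centers, f, iters=8):
    """Lloyd iterations: assignment by L_f distance, update by plain mean."""
    assign = [0] * len(data)
    for _ in range(iters):
        for i, p in enumerate(data):
            assign[i] = min(range(len(centers)),
                            key=lambda c: lf_dist(p, centers[c], f))
        for c in range(len(centers)):
            members = [data[i] for i in range(len(data)) if assign[i] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

# Synthetic setup in the spirit of the text: uniform cluster centers with
# Gaussian points around them (sizes scaled down from the paper's for speed).
rng = random.Random(3)
d, n_clusters, per_cluster = 20, 3, 100
true_centers = [[10 * rng.random() for _ in range(d)] for _ in range(n_clusters)]
data, labels = [], []
for j, tc in enumerate(true_centers):
    for _ in range(per_cluster):
        data.append([rng.gauss(m, 0.5) for m in tc])
        labels.append(j)

init = [list(data[j * per_cluster]) for j in range(n_clusters)]  # one seed point per cluster
assign = kmeans_lf(data, init, f=0.3)

# Classification rate: majority-label agreement within each found cluster.
rate = sum(max(sum(1 for i in range(len(data)) if assign[i] == c and labels[i] == t)
               for t in range(n_clusters))
           for c in range(n_clusters)) / len(data)
assert rate > 0.95
```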
The advantage of the synthetic data sets is that the clusters are clearly separated, and any clustering algorithm should be able to identify them correctly. For our experiments we used one of the most widely used standard clustering algorithms, the k-means algorithm. The data set used in the experiments consists of 6 clusters with 10,000 data points each and no noise. The dimensionality was chosen to be 20. The results of our experiments show that the fractional distance metrics provide a much higher classification rate, which is about 99% for the fractional distance metric with f = 0.3, versus 89% for the Euclidean metric (see Figure 4). The detailed results, including the confusion matrices obtained, are provided in the Appendix.

For the experiments with real data sets, we used some of the classification problems from the UCI machine learning repository.(5) All of these problems are classification problems which have a large number of feature variables, and a special variable which is designated as the class label. We used the following simple experiment: for each of the cases that we tested, we stripped off the

(5) http://www.ics.uci.edu/~mlearn

Fig. 4. Effectiveness of k-means: classification rate depending on the distance parameter.

class variable from the data set and considered the feature variables only. The query points were picked from the original database, and the closest l neighbors were found for each target point using different distance metrics. The technique was tested using the following two measures:

1. Class Variable Accuracy: This was the primary measure that we used in order to test the quality of the different distance metrics. Since the class variable is known to depend in some way on the feature variables, the proximity of objects belonging to the same class in feature space is evidence of the meaningfulness of a given distance metric. The specific measure that we used was the total number of the l nearest neighbors that belonged to the same class as the target object, summed over all the different target objects. Needless to say, we do not intend to propose this rudimentary unsupervised technique as an alternative to classification models, but use the classification performance only as evidence of the meaningfulness (or lack of meaningfulness) of a given distance metric. The class labels may not necessarily always correspond to locality in feature space; therefore the meaningfulness results presented are evidential in nature. However, a consistent effect on the class variable accuracy with increasing norm parameter does tend to be a powerful way of demonstrating qualitative trends.

2. Noise Stability: How does the quality of the distance metric vary with more or less noisy data? We used noise masking in order to evaluate this aspect. In noise masking, each entry in the database was replaced by a random entry with masking probability p_c. The random entry was chosen from a uniform distribution centered at the mean of that attribute. Thus, when p_c is 1, the data is completely noisy. We studied how each of the two problems was affected by noise masking.
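The noise masking procedure of measure 2 can be sketched as follows (our own rendering; the width of the uniform replacement distribution is an assumption, since the text only says it is centered at the attribute mean):

```python
import random

def noise_mask(table, p_mask, width=1.0, seed=11):
    """Replace each entry, with probability p_mask, by a uniform random value
    centered at that attribute's mean (`width` is an assumed spread)."""
    rng = random.Random(seed)
    n, d = len(table), len(table[0])
    means = [sum(row[j] for row in table) / n for j in range(d)]
    return [[means[j] + width * (rng.random() - 0.5) if rng.random() < p_mask
             else row[j] for j in range(d)] for row in table]

data = [[float(i + j) for j in range(4)] for i in range(5)]
assert noise_mask(data, 0.0) == data   # p_c = 0: data untouched
assert noise_mask(data, 1.0) != data   # p_c = 1: completely noisy
```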
In Table 3, we have illustrated some examples of the variation in performance for different distance metrics. With a few exceptions, the major trend in this table is that the accuracy performance decreases with increasing value of the norm parameter. We have shown the table in the range L_0.1 to L_10 because it was easiest to calculate the distance values in this range without exceeding the numerical ranges of the computer representation. We have also illustrated the accuracy performance when the L_infinity metric is used. One interesting observation is that the accuracy with the L_infinity distance metric is often worse than the accuracy value obtained by picking a record from the database at random and reporting the corresponding target

Table 3. Number of correct class label matches between nearest neighbor and target

Data Set              L0.1   L0.5   L1     L2     L4     L10    L∞     Random
Machine               522    474    449    402    364    353    34     53
Musk                  998    893    683    405    30     272    63     40
Breast Cancer (wbc)   5299   5268   596    5052   466    472    4032   302
Segmentation          423    47     377    20     03     03     300    323
Ionosphere            2954   3002   2839   2430   2062   836    769    884

Fig. 5. Accuracy depending on the norm parameter (accuracy ratio to random matching vs. parameter of distance norm used)

Fig. 6. Accuracy depending on noise masking (accuracy ratio to random matching vs. noise masking probability)

value. This trend is observed because the L∞ metric only looks at the dimension in which the target and neighbor are furthest apart. In high dimensional space, this is likely to be a very poor representation of the nearest neighbor. A similar argument holds for L_k distance metrics (for high values of k), which give undue importance to the distant (sparse/noisy) dimensions. It is precisely this aspect which is reflected in our theoretical analysis of the relative contrast, which shows that distance metrics with high norm parameters discriminate poorly between the furthest and nearest neighbor. In Figure 5, we have shown the variation in the accuracy of the class variable matching with k, when the L_k norm is used. The accuracy on the Y-axis is reported as the ratio of the accuracy to that of a completely random matching scheme. The graph is averaged over all the data sets of Table 3. It is easy to see that there is a clear trend of the accuracy worsening with increasing values of the parameter k. We also studied the robustness of the scheme to the use of noise masking.
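The loss of discrimination for large norm parameters is easy to reproduce numerically. The following sketch (our own illustration, not code from the paper) estimates the relative contrast (Dmax − Dmin)/Dmin of L_k distances from the origin for uniformly distributed points; the contrast shrinks as k grows:

```python
import numpy as np

def relative_contrast(d, k, n=1000, rng=None):
    """(Dmax - Dmin) / Dmin over L_k distances from the origin to n points
    drawn uniformly from the unit cube in d dimensions."""
    rng = rng if rng is not None else np.random.default_rng(0)
    X = rng.random((n, d))
    dist = np.sum(X ** k, axis=1) ** (1.0 / k)
    return (dist.max() - dist.min()) / dist.min()

rng = np.random.default_rng(0)
for k in (0.5, 1.0, 2.0, 4.0):
    print(f"k = {k}: contrast = {relative_contrast(100, k, rng=rng):.3f}")
```

In our runs with d = 100, fractional norms such as L0.5 yield a markedly larger contrast than L4, mirroring the accuracy trends of Table 3.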
For this purpose, we have illustrated the performance of three distance metrics in Figure 6: L0.1, L1, and L10, for various values of the masking probability on the machine data set. On the X-axis, we have denoted the value of the masking probability, whereas on the Y-axis we have the accuracy ratio to that of a completely random matching scheme. Note that when the masking probability is 1, any scheme would degrade to a random method. However, it is interesting to see from Figure 6 that the L10 distance metric degrades much faster to the

random performance (at a masking probability of 0.4), whereas the L1 metric degrades to random at 0.6. The L0.1 distance metric is the most robust to the presence of noise in the data set and degrades to random performance at the slowest rate. These results are closely connected to our theoretical analysis, which shows the rapid loss of discrimination between the nearest and furthest distances for high values of the norm parameter because of the undue weight given to the noisy dimensions, which contribute the most to the distance.

5 Conclusions and Summary

In this paper, we showed some surprising results on the qualitative behavior of different distance metrics for measuring proximity in high dimensionality. We demonstrated our results in both a theoretical and an empirical setting. In the past, not much attention has been paid to the choice of distance metric used in high dimensional applications. The results of this paper are likely to have a powerful impact on the particular choice of distance metric for problems such as clustering, categorization, and similarity search, all of which depend upon some notion of proximity.

Appendix

Here we provide a detailed proof of Lemma 2, which proves our modified convergence results for arbitrary distributions of points. This Lemma shows that the asymptotic rate of convergence of the absolute difference of distances between the nearest and furthest points is dependent on the distance norm used. To recap, we restate Lemma 2.

Lemma 2: Let $F$ be an arbitrary distribution of $N = 2$ points. Then,
$$\lim_{d \to \infty} E\left[ \frac{Dmax_d^k - Dmin_d^k}{d^{1/k - 1/2}} \right] = C_k,$$
where $C_k$ is some constant dependent on $k$.

Proof. Let $A$ and $B$ be the two points in a $d$-dimensional data distribution such that each coordinate is independently drawn from the data distribution $F$. Specifically, $A = (P_1, \dots, P_d)$ and $B = (Q_1, \dots, Q_d)$, with $P_i$ and $Q_i$ being drawn from $F$. Let $PA_d = \{\sum_{i=1}^{d} (P_i)^k\}^{1/k}$ be the distance of $A$ to the origin using the $L_k$ metric, and $PB_d = \{\sum_{i=1}^{d} (Q_i)^k\}^{1/k}$ the distance of $B$. We assume that the $k$-th power of a random variable drawn from the distribution $F$ has mean $\mu_{F,k}$ and standard deviation $\sigma_{F,k}$. This means that $PA_d^k / d \to_p \mu_{F,k}$, $PB_d^k / d \to_p \mu_{F,k}$, and therefore:
$$PA_d / d^{1/k} \to_p (\mu_{F,k})^{1/k}, \qquad PB_d / d^{1/k} \to_p (\mu_{F,k})^{1/k}. \qquad (4)$$

We intend to show that $|PA_d - PB_d| / d^{1/k - 1/2}$ converges to $C_k$ for some constant $C_k$ depending on $k$. We express $PA_d - PB_d$ in the following numerator/denominator form, which we will use in order to examine the convergence behavior of the numerator and denominator individually:
$$PA_d - PB_d = \frac{(PA_d)^k - (PB_d)^k}{\sum_{r=0}^{k-1} (PA_d)^r (PB_d)^{k-1-r}}. \qquad (5)$$

Dividing both sides by $d^{1/k - 1/2}$ and regrouping on the right-hand side, we get:
$$\frac{PA_d - PB_d}{d^{1/k - 1/2}} = \frac{\left( (PA_d)^k - (PB_d)^k \right) / d^{1/2}}{\sum_{r=0}^{k-1} \left( PA_d / d^{1/k} \right)^r \left( PB_d / d^{1/k} \right)^{k-1-r}}. \qquad (6)$$

Consequently, using Slutsky's theorem and the results of Equation 4, we have:
$$\sum_{r=0}^{k-1} \left( PA_d / d^{1/k} \right)^r \left( PB_d / d^{1/k} \right)^{k-1-r} \to_p k \cdot (\mu_{F,k})^{(k-1)/k}. \qquad (7)$$

Having characterized the convergence behavior of the denominator of the right-hand side of Equation 6, let us now examine the behavior of the numerator:
$$\frac{(PA_d)^k - (PB_d)^k}{d^{1/2}} = \frac{\sum_{i=1}^{d} \left( (P_i)^k - (Q_i)^k \right)}{d^{1/2}} = \frac{\sum_{i=1}^{d} R_i}{d^{1/2}}.$$
Here $R_i$ is the new random variable defined by $R_i = (P_i)^k - (Q_i)^k$ for all $i \in \{1, \dots, d\}$. This random variable has zero mean and standard deviation $\sqrt{2} \cdot \sigma_{F,k}$, where $\sigma_{F,k}$ is the standard deviation of $(P_i)^k$. Then, the sum of the different values

of $R_i$ over the $d$ dimensions will converge to a normal distribution with mean 0 and standard deviation $\sqrt{2} \cdot \sigma_{F,k} \cdot d^{1/2}$ because of the central limit theorem. Consequently, the mean average deviation of $\left( (PA_d)^k - (PB_d)^k \right) / d^{1/2}$ will be $C \cdot \sigma_{F,k}$ for some constant $C$. Therefore, we have:
$$\lim_{d \to \infty} E\left[ \frac{\left| (PA_d)^k - (PB_d)^k \right|}{d^{1/2}} \right] = C \cdot \sigma_{F,k}. \qquad (8)$$

Since the denominator of Equation 6 shows probabilistic convergence, we can combine the results of Equations 7 and 8 to obtain:
$$\lim_{d \to \infty} E\left[ \frac{|PA_d - PB_d|}{d^{1/k - 1/2}} \right] = \frac{C \cdot \sigma_{F,k}}{k \cdot (\mu_{F,k})^{(k-1)/k}}. \qquad (9)$$

The result follows.

Confusion Matrices. We have illustrated the confusion matrices for two different values of $p$ below. As illustrated, the confusion matrix obtained using the value $p = 0.3$ is significantly better than the one obtained using $p = 2$.

Table 4. Confusion matrix for $p = 2$ (rows for prototypes, columns for clusters)

Table 5. Confusion matrix for $p = 0.3$ (rows for prototypes, columns for clusters)
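The convergence asserted by Lemma 2 can be checked by simulation. The sketch below (our own illustration, with `scaled_gap` as our own name) estimates $E|PA_d - PB_d|$ for pairs of uniformly distributed points and scales it by $d^{1/2 - 1/k}$; the scaled value stabilizes to a constant $C_k$ as $d$ grows:

```python
import numpy as np

def scaled_gap(d, k, trials=2000, rng=None):
    """Monte Carlo estimate of E|PA_d - PB_d| / d^(1/k - 1/2) for two points
    whose coordinates are drawn i.i.d. from the uniform distribution on [0, 1]."""
    rng = rng if rng is not None else np.random.default_rng(0)
    A = rng.random((trials, d))
    B = rng.random((trials, d))
    pa = np.sum(A ** k, axis=1) ** (1.0 / k)   # L_k distance of A to the origin
    pb = np.sum(B ** k, axis=1) ** (1.0 / k)   # L_k distance of B to the origin
    return np.mean(np.abs(pa - pb)) / d ** (1.0 / k - 0.5)

# The scaled expectation should be nearly the same for different d:
for d in (64, 256, 1024):
    print(d, scaled_gap(d, k=2.0))
```

For $k = 2$, the printed values agree closely across $d$, consistent with the $d^{1/k - 1/2}$ rate derived in Equation 9.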