Data Mining and Analysis: Fundamental Concepts and Algorithms

dataminingbook.info

Mohammed J. Zaki, Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA
Wagner Meira Jr., Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil

Chapter 6: High-dimensional Data

Zaki & Meira Jr. (RPI and UFMG), Data Mining and Analysis, Chapter 6: High-dimensional Data

High-dimensional Space

Let $\mathbf{D}$ be an $n \times d$ data matrix. In data mining the data is typically very high dimensional. Understanding the nature of high-dimensional space, or hyperspace, is important because it does not behave like the more familiar geometry in two or three dimensions.

Hyper-rectangle: The data space is a $d$-dimensional hyper-rectangle
\[ R_d = \prod_{j=1}^{d} \bigl[ \min(X_j), \max(X_j) \bigr] \]
where $\min(X_j)$ and $\max(X_j)$ specify the range of attribute $X_j$.

Hypercube: Assume the data is centered, and let $m$ denote the maximum absolute attribute value
\[ m = \max_{j=1}^{d} \max_{i=1}^{n} \bigl\{ |x_{ij}| \bigr\} \]
The data hyperspace can be represented as a hypercube, centered at $\mathbf{0}$, with all sides of length $l = 2m$, given as
\[ H_d(l) = \bigl\{ \mathbf{x} = (x_1, x_2, \ldots, x_d)^T \;\big|\; \forall i,\ x_i \in [-l/2, l/2] \bigr\} \]
The unit hypercube has all sides of length $l = 1$, and is denoted as $H_d(1)$.

Hypersphere

Assume that the data has been centered, so that $\boldsymbol{\mu} = \mathbf{0}$. Let $r$ denote the largest magnitude among all points:
\[ r = \max_{i} \bigl\{ \|\mathbf{x}_i\| \bigr\} \]
The data hyperspace can be represented as a $d$-dimensional hyperball centered at $\mathbf{0}$ with radius $r$, defined as
\[ B_d(r) = \bigl\{ \mathbf{x} \;\big|\; \|\mathbf{x}\| \le r \bigr\} \quad\text{or}\quad B_d(r) = \Bigl\{ \mathbf{x} = (x_1, x_2, \ldots, x_d) \;\Big|\; \sum_{j=1}^{d} x_j^2 \le r^2 \Bigr\} \]
The surface of the hyperball is called a hypersphere, and it consists of all the points exactly at distance $r$ from the center of the hyperball:
\[ S_d(r) = \bigl\{ \mathbf{x} \;\big|\; \|\mathbf{x}\| = r \bigr\} \quad\text{or}\quad S_d(r) = \Bigl\{ \mathbf{x} = (x_1, x_2, \ldots, x_d) \;\Big|\; \sum_{j=1}^{d} x_j^2 = r^2 \Bigr\} \]
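The definitions of $l = 2m$ and $r$ can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names and the toy data matrix are assumptions made for the example and do not come from the slides.

```python
import math

def center(points):
    """Subtract the column means so that the mean of the data is the origin."""
    n, d = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(d)]
    return [[p[j] - means[j] for j in range(d)] for p in points]

def hypercube_side(centered):
    """l = 2m, where m is the largest absolute attribute value."""
    m = max(abs(v) for p in centered for v in p)
    return 2 * m

def hypersphere_radius(centered):
    """r = the largest Euclidean norm among the centered points."""
    return max(math.sqrt(sum(v * v for v in p)) for p in centered)

# Toy 3 x 2 data matrix (hypothetical values, not the Iris data).
D = [[1.0, 2.0], [3.0, 0.0], [5.0, 4.0]]
Z = center(D)
print("l =", hypercube_side(Z), " r =", hypersphere_radius(Z))
```

Because every coordinate of a point is bounded by its norm, $r \ge m = l/2$ always holds, so the data hypersphere is never smaller than the sphere inscribed in the data hypercube.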

Iris Data Hyperspace: Hypercube and Hypersphere

$l = 4.12$ and $r = 2.19$

[Figure: centered Iris data plotted with $X_1$: sepal length on the horizontal axis and $X_2$: sepal width on the vertical axis, showing the enclosing hypercube of side $l$ and the hypersphere of radius $r$.]

High-dimensional Volumes

Hypercube: The volume of a hypercube with edge length $l$ is given as
\[ \mathrm{vol}(H_d(l)) = l^d \]

Hypersphere: The volume of a hyperball and its corresponding hypersphere is identical. The volume of a hypersphere is given as

In 1 dimension: $\mathrm{vol}(S_1(r)) = 2r$
In 2 dimensions: $\mathrm{vol}(S_2(r)) = \pi r^2$
In 3 dimensions: $\mathrm{vol}(S_3(r)) = \frac{4}{3} \pi r^3$
In $d$ dimensions:
\[ \mathrm{vol}(S_d(r)) = K_d\, r^d = \left( \frac{\pi^{d/2}}{\Gamma\left(\frac{d}{2} + 1\right)} \right) r^d \]
where
\[ \Gamma\left(\frac{d}{2} + 1\right) = \begin{cases} \left(\frac{d}{2}\right)! & \text{if } d \text{ is even} \\[4pt] \sqrt{\pi} \left( \dfrac{d!!}{2^{(d+1)/2}} \right) & \text{if } d \text{ is odd} \end{cases} \]
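As a quick check of the general formula, the sketch below evaluates $K_d r^d$ with Python's `math.gamma` and compares the even/odd case formula for $\Gamma(d/2 + 1)$ against it. The function names are illustrative assumptions, not part of the slides.

```python
import math

def hypersphere_volume(d, r=1.0):
    """vol(S_d(r)) = pi^(d/2) / Gamma(d/2 + 1) * r^d."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * r ** d

def gamma_half_d_plus_one(d):
    """Gamma(d/2 + 1) via the case formula: (d/2)! for even d,
    sqrt(pi) * d!! / 2^((d+1)/2) for odd d."""
    if d % 2 == 0:
        return float(math.factorial(d // 2))
    double_factorial = math.prod(range(d, 0, -2))  # d!! = d (d-2) ... 3 * 1
    return math.sqrt(math.pi) * double_factorial / 2 ** ((d + 1) / 2)

for d in (1, 2, 3):
    print(d, hypersphere_volume(d, r=1.0))
```

For $d = 1, 2, 3$ the general formula reduces to $2r$, $\pi r^2$, and $\frac{4}{3}\pi r^3$, matching the special cases listed above.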

Volume of Unit Hypersphere

With increasing dimensionality the hypersphere volume first increases up to a point, then starts to decrease, and ultimately vanishes. In particular, for the unit hypersphere with $r = 1$,
\[ \lim_{d \to \infty} \mathrm{vol}(S_d(1)) = \lim_{d \to \infty} \frac{\pi^{d/2}}{\Gamma\left(\frac{d}{2} + 1\right)} \to 0 \]

[Figure: $\mathrm{vol}(S_d(1))$ plotted against the dimensionality $d$, for $d$ from 0 to 50.]
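The rise and fall of the unit hypersphere volume can be tabulated directly. The following sketch, using only the standard library (the names are illustrative assumptions), locates the integer dimension with the largest volume and shows how small the volume is at $d = 50$.

```python
import math

def unit_hypersphere_volume(d):
    """vol(S_d(1)) = pi^(d/2) / Gamma(d/2 + 1)."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)

# Volumes for d = 1, ..., 50, the range shown in the plot.
volumes = {d: unit_hypersphere_volume(d) for d in range(1, 51)}
peak_dimension = max(volumes, key=volumes.get)

print("largest volume at d =", peak_dimension, "with volume", volumes[peak_dimension])
print("volume at d = 50:", volumes[50])
```

Among integer dimensions the maximum occurs at $d = 5$, where the volume is $8\pi^2/15$; after that the gamma function in the denominator grows faster than $\pi^{d/2}$ and the volume decays toward zero.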

Hypersphere Inscribed within Hypercube

Consider the space enclosed within the largest hypersphere that can be accommodated within a hypercube (which represents the data space). The ratio of the volume of the hypersphere of radius $r$ to the hypercube with side length $l = 2r$ is given as

In 2 dimensions:
\[ \frac{\mathrm{vol}(S_2(r))}{\mathrm{vol}(H_2(2r))} = \frac{\pi r^2}{4 r^2} = \frac{\pi}{4} = 78.5\% \]
In 3 dimensions:
\[ \frac{\mathrm{vol}(S_3(r))}{\mathrm{vol}(H_3(2r))} = \frac{\frac{4}{3} \pi r^3}{8 r^3} = \frac{\pi}{6} = 52.4\% \]
In $d$ dimensions:
\[ \lim_{d \to \infty} \frac{\mathrm{vol}(S_d(r))}{\mathrm{vol}(H_d(2r))} = \lim_{d \to \infty} \frac{\pi^{d/2}}{2^d\, \Gamma\left(\frac{d}{2} + 1\right)} \to 0 \]

As the dimensionality increases, most of the volume of the hypercube is in the corners, whereas the center is essentially empty.
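The ratio depends only on $d$, since $r^d$ cancels. A minimal sketch of the computation follows; the function name is an assumption made for illustration.

```python
import math

def inscribed_ratio(d):
    """Ratio vol(S_d(r)) / vol(H_d(2r)) = pi^(d/2) / (2^d * Gamma(d/2 + 1)).

    The radius cancels, so the ratio is a function of d alone.
    """
    return math.pi ** (d / 2) / (2 ** d * math.gamma(d / 2 + 1))

for d in (2, 3, 5, 10, 20):
    print(f"d = {d:2d}: ratio = {inscribed_ratio(d):.6g}")
```

The printed values fall quickly with $d$: the inscribed hypersphere already occupies well under one percent of the hypercube at $d = 10$.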

Hypersphere Inscribed inside a Hypercube

[Figure: a hypersphere of radius $r$ inscribed inside a hypercube.]

Conceptual View of High-dimensional Space

Two, three, four, and higher dimensions

As the dimensionality grows, nearly all the volume of the hyperspace is in the corners, with the center being essentially empty. High-dimensional space looks like a rolled-up porcupine!

[Figure: conceptual view of the hyperspace; panels (a) 2D, (b) 3D, (c) 4D, (d) $d$D.]

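Volume of a Thin Shell

The volume of a thin hypershell of width $\epsilon$ is given as
\[ \mathrm{vol}(S_d(r, \epsilon)) = \mathrm{vol}(S_d(r)) - \mathrm{vol}(S_d(r - \epsilon)) = K_d\, r^d - K_d\, (r - \epsilon)^d \]

The ratio of the volume of the thin shell to the volume of the outer sphere is
\[ \frac{\mathrm{vol}(S_d(r, \epsilon))}{\mathrm{vol}(S_d(r))} = \frac{K_d\, r^d - K_d\, (r - \epsilon)^d}{K_d\, r^d} = 1 - \left( 1 - \frac{\epsilon}{r} \right)^d \]

As $d$ increases, we have
\[ \lim_{d \to \infty} \frac{\mathrm{vol}(S_d(r, \epsilon))}{\mathrm{vol}(S_d(r))} = \lim_{d \to \infty} 1 - \left( 1 - \frac{\epsilon}{r} \right)^d \to 1 \]

[Figure: a hypersphere of radius $r$ with an inner hypersphere of radius $r - \epsilon$; the shell of width $\epsilon$ lies between them.]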
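To get a feel for how quickly the shell dominates, the ratio $1 - (1 - \epsilon/r)^d$ can be evaluated for a fixed thin shell and growing $d$. The sketch below is illustrative only; the function name and the choice $\epsilon/r = 0.01$ are assumptions for the example.

```python
def shell_fraction(d, r=1.0, eps=0.01):
    """Fraction of the volume of S_d(r) lying within eps of its surface:
    1 - (1 - eps / r)^d, valid for 0 <= eps <= r."""
    return 1.0 - (1.0 - eps / r) ** d

for d in (1, 2, 10, 100, 500, 1000):
    print(f"d = {d:4d}: shell fraction = {shell_fraction(d):.4f}")
```

With a shell whose width is only 1% of the radius, the fraction is 1% in one dimension but exceeds 99% by $d = 500$, so almost all of the hypersphere's volume sits near its surface.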

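Diagonals in Hyperspace

Consider a $d$-dimensional hypercube, with origin $\mathbf{0}_d = (0_1, 0_2, \ldots, 0_d)$, and bounded in each dimension in the range $[-1, 1]$. Each corner of the hyperspace is a $d$-dimensional vector of the form $(\pm 1_1, \pm 1_2, \ldots, \pm 1_d)^T$.

Let $\mathbf{e}_i = (0_1, \ldots, 1_i, \ldots, 0_d)^T$ denote the $d$-dimensional canonical unit vector in dimension $i$, and let $\mathbf{1}$ denote the $d$-dimensional diagonal vector $(1_1, 1_2, \ldots, 1_d)^T$.

Consider the angle $\theta_d$ between the diagonal vector $\mathbf{1}$ and the first axis $\mathbf{e}_1$, in $d$ dimensions:
\[ \cos\theta_d = \frac{\mathbf{e}_1^T \mathbf{1}}{\|\mathbf{e}_1\|\, \|\mathbf{1}\|} = \frac{\mathbf{e}_1^T \mathbf{1}}{\sqrt{\mathbf{e}_1^T \mathbf{e}_1}\, \sqrt{\mathbf{1}^T \mathbf{1}}} = \frac{1}{\sqrt{1}\, \sqrt{d}} = \frac{1}{\sqrt{d}} \]

As $d$ increases, we have
\[ \lim_{d \to \infty} \cos\theta_d = \lim_{d \to \infty} \frac{1}{\sqrt{d}} \to 0 \]
which implies that
\[ \lim_{d \to \infty} \theta_d \to \frac{\pi}{2} = 90^\circ \]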
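The angle can also be computed from the vectors themselves rather than from the closed form $1/\sqrt{d}$. The following sketch builds $\mathbf{e}_1$ and $\mathbf{1}$ explicitly and applies the cosine formula; the function name is an assumption made for illustration.

```python
import math

def diagonal_angle_degrees(d):
    """Angle between the diagonal vector (1, ..., 1) and the first axis e_1."""
    e1 = [1.0] + [0.0] * (d - 1)
    ones = [1.0] * d
    dot = sum(a * b for a, b in zip(e1, ones))
    norm_e1 = math.sqrt(sum(a * a for a in e1))
    norm_ones = math.sqrt(sum(b * b for b in ones))
    cos_theta = dot / (norm_e1 * norm_ones)
    # Clamp against tiny floating-point overshoot before calling acos.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))

for d in (2, 3, 4, 10, 100, 10000):
    print(f"d = {d:5d}: theta = {diagonal_angle_degrees(d):.2f} degrees")
```

The angle is 45 degrees in 2D and 60 degrees in 4D, and it creeps toward 90 degrees only slowly, because $\cos\theta_d$ shrinks like $1/\sqrt{d}$.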

Angle between Diagonal Vector $\mathbf{1}$ and $\mathbf{e}_1$

[Figure: the diagonal vector $\mathbf{1}$, the axis $\mathbf{e}_1$, and the angle $\theta$ between them; panels (a) in 2D, (b) in 3D.]

In high dimensions all of the diagonal vectors become essentially perpendicular (or orthogonal) to all of the coordinate axes! Each of the $2^{d-1}$ new axes connecting pairs of the $2^d$ corners is essentially orthogonal to all of the $d$ principal coordinate axes! Thus, in effect, high-dimensional space has an exponential number of orthogonal "axes".

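Density of the Multivariate Normal

Consider the standard multivariate normal distribution with $\boldsymbol{\mu} = \mathbf{0}$ and $\boldsymbol{\Sigma} = \mathbf{I}$:
\[ f(\mathbf{x}) = \frac{1}{(\sqrt{2\pi})^d} \exp\left\{ -\frac{\mathbf{x}^T \mathbf{x}}{2} \right\} \]

The peak of the density is at the mean. Consider the set of points $\mathbf{x}$ with density at least an $\alpha$ fraction of the density at the mean:
\[ \frac{f(\mathbf{x})}{f(\mathbf{0})} \ge \alpha \]
\[ \exp\left\{ -\frac{\mathbf{x}^T \mathbf{x}}{2} \right\} \ge \alpha \]
\[ \mathbf{x}^T \mathbf{x} \le -2 \ln(\alpha) \]
\[ \sum_{i=1}^{d} (x_i)^2 \le -2 \ln(\alpha) \]

The sum of squares of $d$ IID standard normal random variables follows a chi-squared distribution $\chi^2_d$ with $d$ degrees of freedom. Thus,
\[ P\left( \frac{f(\mathbf{x})}{f(\mathbf{0})} \ge \alpha \right) = F_{\chi^2_d}(-2 \ln(\alpha)) \]
where $F_{\chi^2_q}$ is the CDF of the chi-squared distribution with $q$ degrees of freedom.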
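The probability $F_{\chi^2_d}(-2\ln\alpha)$ can be evaluated without a statistics library. The sketch below is one possible standard-library approach, not the authors' method. It starts from the closed forms for one and two degrees of freedom and then applies the recurrence $P(a+1, y) = P(a, y) - y^a e^{-y} / \Gamma(a+1)$ for the regularized incomplete gamma function. The function names are assumptions, and the repeated subtraction loses precision for large $d$, so a library routine is preferable in practice.

```python
import math

def chi2_cdf(x, d):
    """CDF of the chi-squared distribution with d degrees of freedom at x.

    Equals the regularized lower incomplete gamma function P(d/2, x/2).
    """
    if x <= 0:
        return 0.0
    y = x / 2.0
    if d % 2 == 1:
        a, p = 0.5, math.erf(math.sqrt(y))   # P(1/2, y), i.e. d = 1
    else:
        a, p = 1.0, 1.0 - math.exp(-y)       # P(1, y), i.e. d = 2
    while a < d / 2.0:
        p -= y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return p

def density_fraction(alpha, d):
    """P(f(x) / f(0) >= alpha) for the standard d-dimensional normal."""
    return chi2_cdf(-2.0 * math.log(alpha), d)

for d in (1, 2, 3, 10):
    print(f"d = {d:2d}: P = {density_fraction(0.5, d):.6f}")
```

For $\alpha = 0.5$ the threshold is $-2\ln(0.5) \approx 1.386$ in every dimension; only the number of degrees of freedom changes, which is what drives the probability down as $d$ grows.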

Density Contour for $\alpha$ Fraction of the Density at the Mean: One Dimension

Let $\alpha = 0.5$; then $-2\ln(0.5) = 1.386$ and $F_{\chi^2_1}(1.386) = 0.76$. Thus, 24% of the density is in the tail regions.

[Figure: the standard normal density in one dimension, with the level $\alpha = 0.5$ of the peak density marked and the tail regions outside it.]

Density Contour for $\alpha$ Fraction of the Density at the Mean: Two Dimensions

Let $\alpha = 0.5$; then $-2\ln(0.5) = 1.386$ and $F_{\chi^2_2}(1.386) = 0.5$. Thus, 50% of the density is in the tail regions.

[Figure: the bivariate standard normal density $f(\mathbf{x})$ over $X_1$ and $X_2$, with the contour at $\alpha = 0.5$ of the peak density.]

Chi-Squared Distribution: $P(f(\mathbf{x})/f(\mathbf{0}) \ge \alpha)$

This probability decreases rapidly with dimensionality. For 2D, it is 0.5. For 3D it is 0.29, i.e., 71% of the density is in the tails. By $d = 10$, it decreases to 0.075%, that is, 99.925% of the points lie in the extreme or tail regions.

[Figure: two panels of the chi-squared density $f(x)$ against $x$, with the area up to $-2\ln(\alpha)$ annotated as $F = 0.5$ in one panel and $F = 0.29$ in the other.]
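The same probabilities can be checked empirically by sampling. The sketch below is a simple Monte Carlo estimate under stated assumptions: the function name, the sample size, and the fixed seed are choices made for this example. It draws standard normal points and counts how many satisfy $\mathbf{x}^T\mathbf{x} \le -2\ln\alpha$.

```python
import math
import random

def fraction_near_mean(d, alpha=0.5, n=20000, seed=0):
    """Estimate P(f(x)/f(0) >= alpha) for the standard d-dimensional normal
    by sampling n points and testing x^T x <= -2 ln(alpha)."""
    rng = random.Random(seed)
    threshold = -2.0 * math.log(alpha)
    hits = 0
    for _ in range(n):
        squared_norm = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d))
        if squared_norm <= threshold:
            hits += 1
    return hits / n

for d in (1, 2, 3, 10):
    print(f"d = {d:2d}: estimated fraction = {fraction_near_mean(d):.4f}")
```

The estimates should land close to the chi-squared values quoted above, up to sampling noise of roughly half a percentage point at this sample size.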

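Hypersphere Volume: Polar Coordinates in 2D

[Figure: the point $(x_1, x_2)$ at distance $r$ from the origin, at angle $\theta_1$ from the $X_1$ axis.]

The point $\mathbf{x} = (x_1, x_2)$ in polar coordinates:
\[ x_1 = r \cos\theta_1 = r c_1 \]
\[ x_2 = r \sin\theta_1 = r s_1 \]
where $r = \|\mathbf{x}\|$, and $\cos\theta_1 = c_1$ and $\sin\theta_1 = s_1$.

The Jacobian matrix for this transformation is given as
\[ J(\theta_1) = \begin{pmatrix} \frac{\partial x_1}{\partial r} & \frac{\partial x_1}{\partial \theta_1} \\[4pt] \frac{\partial x_2}{\partial r} & \frac{\partial x_2}{\partial \theta_1} \end{pmatrix} = \begin{pmatrix} c_1 & -r s_1 \\ s_1 & r c_1 \end{pmatrix} \]

Hypersphere volume is obtained by integration over $r$ and $\theta_1$ (with $r > 0$, and $0 \le \theta_1 \le 2\pi$):
\[ \mathrm{vol}(S_2(r)) = \int_{r} \int_{\theta_1} \bigl| \det(J(\theta_1)) \bigr| \, dr \, d\theta_1 = \int_0^r \int_0^{2\pi} r \, dr \, d\theta_1 = \int_0^r r \, dr \int_0^{2\pi} d\theta_1 = \pi r^2 \]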
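The double integral can be approximated numerically to confirm the result. The sketch below applies a midpoint rule on a grid over $(r, \theta_1)$ and evaluates the Jacobian determinant from its four entries at each grid cell. The function name and grid sizes are assumptions for illustration.

```python
import math

def disk_area_polar(radius, n_r=200, n_theta=200):
    """Approximate vol(S_2(radius)) by integrating |det J| over r and theta_1
    with a midpoint rule on an n_r x n_theta grid."""
    dr = radius / n_r
    dtheta = 2.0 * math.pi / n_theta
    total = 0.0
    for i in range(n_r):
        r = (i + 0.5) * dr
        for j in range(n_theta):
            theta = (j + 0.5) * dtheta
            c, s = math.cos(theta), math.sin(theta)
            # Jacobian entries: [[c, -r s], [s, r c]]; determinant = r (c^2 + s^2) = r.
            det = c * (r * c) - (-r * s) * s
            total += abs(det) * dr * dtheta
    return total

print(disk_area_polar(2.0), math.pi * 2.0 ** 2)
```

Because the integrand $|\det J| = r$ is linear in $r$ and constant in $\theta_1$, the midpoint rule reproduces $\pi r^2$ up to floating-point rounding.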

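Hypersphere Volume: Polar Coordinates in 3D

[Figure: the point $(x_1, x_2, x_3)$ at distance $r$ from the origin, with angles $\theta_1$ and $\theta_2$, in the axes $X_1$, $X_2$, $X_3$.]

The point $\mathbf{x} = (x_1, x_2, x_3)$ in polar coordinates:
\[ x_1 = r \cos\theta_1 \cos\theta_2 = r c_1 c_2 \]
\[ x_2 = r \cos\theta_1 \sin\theta_2 = r c_1 s_2 \]
\[ x_3 = r \sin\theta_1 = r s_1 \]

The Jacobian matrix is given as
\[ J(\theta_1, \theta_2) = \begin{pmatrix} c_1 c_2 & -r s_1 c_2 & -r c_1 s_2 \\ c_1 s_2 & -r s_1 s_2 & r c_1 c_2 \\ s_1 & r c_1 & 0 \end{pmatrix} \]

The volume of the hypersphere for $d = 3$ is obtained via a triple integral with $r > 0$, $-\pi/2 \le \theta_1 \le \pi/2$, and $0 \le \theta_2 \le 2\pi$. Since $\det(J(\theta_1, \theta_2)) = -r^2 c_1$,
\[ \mathrm{vol}(S_3(r)) = \int_{r} \int_{\theta_1} \int_{\theta_2} \bigl| \det(J(\theta_1, \theta_2)) \bigr| \, dr \, d\theta_1 \, d\theta_2 = \int_0^r r^2 \, dr \int_{-\pi/2}^{\pi/2} \cos\theta_1 \, d\theta_1 \int_0^{2\pi} d\theta_2 = \frac{r^3}{3} \cdot 2 \cdot 2\pi = \frac{4}{3} \pi r^3 \]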
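The determinant of the 3D Jacobian works out to $-r^2 \cos\theta_1$, which can be confirmed numerically. The sketch below builds the matrix from the partial derivatives of the coordinate transformation and expands its determinant by cofactors; the function names are illustrative assumptions.

```python
import math

def jacobian_3d(r, theta1, theta2):
    """Jacobian of (x1, x2, x3) = (r c1 c2, r c1 s2, r s1) w.r.t. (r, theta1, theta2)."""
    c1, s1 = math.cos(theta1), math.sin(theta1)
    c2, s2 = math.cos(theta2), math.sin(theta2)
    return [
        [c1 * c2, -r * s1 * c2, -r * c1 * s2],
        [c1 * s2, -r * s1 * s2, r * c1 * c2],
        [s1, r * c1, 0.0],
    ]

def det3(m):
    """Determinant of a 3 x 3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

r, t1, t2 = 1.5, 0.3, 2.0
print(det3(jacobian_3d(r, t1, t2)), -r * r * math.cos(t1))
```

Since $\cos\theta_1 \ge 0$ on $[-\pi/2, \pi/2]$, the absolute value of the determinant is $r^2 \cos\theta_1$, and integrating it over the three ranges gives $\frac{r^3}{3} \cdot 2 \cdot 2\pi = \frac{4}{3}\pi r^3$.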

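Hypersphere Volume in d Dimensions

The determinant of the $d$-dimensional Jacobian matrix is
\[ \det(J(\theta_1, \theta_2, \ldots, \theta_{d-1})) = (-1)^d \, r^{d-1} c_1^{d-2} c_2^{d-3} \cdots c_{d-2} \]
where $c_i = \cos\theta_i$.

The volume of the hypersphere is given by the $d$-dimensional integral with $r > 0$, $-\pi/2 \le \theta_i \le \pi/2$ for all $i = 1, \ldots, d-2$, and $0 \le \theta_{d-1} \le 2\pi$:
\[ \mathrm{vol}(S_d(r)) = \int_{r} \int_{\theta_1} \int_{\theta_2} \cdots \int_{\theta_{d-1}} \bigl| \det(J(\theta_1, \theta_2, \ldots, \theta_{d-1})) \bigr| \, dr \, d\theta_1 \, d\theta_2 \cdots d\theta_{d-1} \]
\[ = \int_0^r r^{d-1} \, dr \int_{-\pi/2}^{\pi/2} c_1^{d-2} \, d\theta_1 \cdots \int_{-\pi/2}^{\pi/2} c_{d-2} \, d\theta_{d-2} \int_0^{2\pi} d\theta_{d-1} \]
\[ = \frac{r^d}{d} \cdot \frac{\Gamma\left(\frac{d-1}{2}\right) \Gamma\left(\frac{1}{2}\right)}{\Gamma\left(\frac{d}{2}\right)} \cdot \frac{\Gamma\left(\frac{d-2}{2}\right) \Gamma\left(\frac{1}{2}\right)}{\Gamma\left(\frac{d-1}{2}\right)} \cdots \frac{\Gamma(1)\, \Gamma\left(\frac{1}{2}\right)}{\Gamma\left(\frac{3}{2}\right)} \cdot 2\pi \]
\[ = \frac{\pi \, \Gamma\left(\frac{1}{2}\right)^{d-2} r^d}{\frac{d}{2} \, \Gamma\left(\frac{d}{2}\right)} = \left( \frac{\pi^{d/2}}{\Gamma\left(\frac{d}{2} + 1\right)} \right) r^d \]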
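Each angular factor in the derivation uses the identity $\int_{-\pi/2}^{\pi/2} \cos^k\theta \, d\theta = \Gamma\left(\frac{k+1}{2}\right)\Gamma\left(\frac{1}{2}\right) / \Gamma\left(\frac{k}{2} + 1\right)$. The sketch below checks that identity against a midpoint-rule integral and then multiplies the factors together to recover the closed-form volume. The function names and the grid size are assumptions for illustration.

```python
import math

def cos_power_integral(k):
    """Closed form of the integral of cos^k(theta) over [-pi/2, pi/2]."""
    return math.gamma((k + 1) / 2) * math.gamma(0.5) / math.gamma(k / 2 + 1)

def cos_power_integral_numeric(k, n=20000):
    """Midpoint-rule approximation of the same integral."""
    h = math.pi / n
    return sum(math.cos(-math.pi / 2 + (i + 0.5) * h) ** k for i in range(n)) * h

def volume_from_angular_integrals(d, r=1.0):
    """Assemble vol(S_d(r)) for d >= 2 as (r^d / d) * product of the
    angular integrals for k = 1, ..., d - 2, times 2 pi."""
    volume = r ** d / d * 2.0 * math.pi
    for k in range(1, d - 1):
        volume *= cos_power_integral(k)
    return volume

def volume_closed_form(d, r=1.0):
    """pi^(d/2) / Gamma(d/2 + 1) * r^d."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * r ** d

for d in (2, 3, 4, 5, 10):
    print(d, volume_from_angular_integrals(d), volume_closed_form(d))
```

The product telescopes: every $\Gamma$ in a numerator cancels against the denominator of the next factor, leaving $\Gamma\left(\frac{1}{2}\right)^{d-2} / \Gamma\left(\frac{d}{2}\right)$, which is how the final simplification arises.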