Data Mining and Analysis: Fundamental Concepts and Algorithms


Data Mining and Analysis: Fundamental Concepts and Algorithms
dataminingbook.info

Mohammed J. Zaki, Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA
Wagner Meira Jr., Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil

Chapter 6: High-dimensional Data

Zaki & Meira Jr. (RPI and UFMG), Data Mining and Analysis, Chapter 6: High-dimensional Data

High-dimensional Space

Let $D$ be an $n \times d$ data matrix. In data mining, the data is typically very high dimensional. Understanding the nature of high-dimensional space, or hyperspace, is very important, especially because it does not behave like the more familiar geometry in two or three dimensions.

Hyper-rectangle: The data space is a $d$-dimensional hyper-rectangle

$$R_d = \prod_{j=1}^{d} \big[ \min(X_j), \max(X_j) \big]$$

where $\min(X_j)$ and $\max(X_j)$ specify the range of $X_j$.

Hypercube: Assume the data is centered, and let $m$ denote the maximum attribute value

$$m = \max_{j=1}^{d} \max_{i=1}^{n} \big\{ |x_{ij}| \big\}$$

The data hyperspace can be represented as a hypercube, centered at $\mathbf{0}$, with all sides of length $l = 2m$, given as

$$H_d(l) = \big\{ \mathbf{x} = (x_1, x_2, \ldots, x_d)^T \;\big|\; \forall i,\; x_i \in [-l/2, l/2] \big\}$$

The unit hypercube has all sides of length $l = 1$, and is denoted as $H_d(1)$.

Hypersphere

Assume that the data has been centered, so that $\boldsymbol{\mu} = \mathbf{0}$. Let $r$ denote the largest magnitude among all points:

$$r = \max_i \big\{ \|\mathbf{x}_i\| \big\}$$

The data hyperspace can be represented as a $d$-dimensional hyperball centered at $\mathbf{0}$ with radius $r$, defined as

$$B_d(r) = \big\{ \mathbf{x} \;\big|\; \|\mathbf{x}\| \le r \big\} \quad \text{or} \quad B_d(r) = \Big\{ \mathbf{x} = (x_1, x_2, \ldots, x_d) \;\Big|\; \sum_{j=1}^{d} x_j^2 \le r^2 \Big\}$$

The surface of the hyperball is called a hypersphere, and it consists of all the points exactly at distance $r$ from the center of the hyperball:

$$S_d(r) = \big\{ \mathbf{x} \;\big|\; \|\mathbf{x}\| = r \big\} \quad \text{or} \quad S_d(r) = \Big\{ \mathbf{x} = (x_1, x_2, \ldots, x_d) \;\Big|\; \sum_{j=1}^{d} x_j^2 = r^2 \Big\}$$
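The quantities $m$, $l$, and $r$ defined on the last two slides can be computed directly from a data matrix. The following is a minimal sketch, not the authors' code: the function name `data_hyperspace` and the three-point toy dataset are assumptions made only for illustration.

```python
import math

def data_hyperspace(rows):
    """Center the rows, then return (m, l, r) as defined on the slides.

    rows is an n x d data matrix given as a list of equal-length lists.
    """
    n, d = len(rows), len(rows[0])
    # Center each attribute so that the mean is the origin.
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # m: largest absolute attribute value; the hypercube side is l = 2m.
    m = max(abs(v) for row in centered for v in row)
    # r: largest Euclidean norm among the centered points.
    r = max(math.sqrt(sum(v * v for v in row)) for row in centered)
    return m, 2 * m, r

# Hypothetical toy data, not taken from the slides.
toy_rows = [[1.0, 2.0], [3.0, 6.0], [5.0, 4.0]]
m, l, r = data_hyperspace(toy_rows)
print(m, l, r)
```

Note that $r \ge m$ always holds, since the norm of a point is at least the absolute value of any one of its coordinates.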

Iris Data Hyperspace: Hypercube and Hypersphere

[Figure: centered Iris data with axes $X_1$: sepal length and $X_2$: sepal width, showing the bounding hypercube with $l = 4.12$ and the hypersphere with $r = 2.19$.]

High-dimensional Volumes

Hypercube: The volume of a hypercube with edge length $l$ is given as

$$\text{vol}(H_d(l)) = l^d$$

Hypersphere: The volume of a hyperball and its corresponding hypersphere is identical. The volume of a hypersphere is given as

In 1 dimension: $\text{vol}(S_1(r)) = 2r$
In 2 dimensions: $\text{vol}(S_2(r)) = \pi r^2$
In 3 dimensions: $\text{vol}(S_3(r)) = \frac{4}{3} \pi r^3$
In $d$ dimensions:

$$\text{vol}(S_d(r)) = K_d \, r^d = \left( \frac{\pi^{d/2}}{\Gamma\!\left(\frac{d}{2} + 1\right)} \right) r^d$$

where

$$\Gamma\!\left(\frac{d}{2} + 1\right) = \begin{cases} \left(\frac{d}{2}\right)! & \text{if } d \text{ is even} \\[4pt] \sqrt{\pi} \left( \dfrac{d!!}{2^{(d+1)/2}} \right) & \text{if } d \text{ is odd} \end{cases}$$
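As a quick check of the general formula against the low-dimensional cases, here is a short sketch using only Python's standard library. The function names `hypersphere_volume` and `gamma_half_closed_form` are illustrative assumptions, not names from the slides.

```python
import math

def hypersphere_volume(d, r=1.0):
    """vol(S_d(r)) = pi^(d/2) / Gamma(d/2 + 1) * r^d."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * r ** d

def gamma_half_closed_form(d):
    """Gamma(d/2 + 1) via the even/odd closed forms given on the slide."""
    if d % 2 == 0:
        return float(math.factorial(d // 2))
    # Double factorial d!! = d * (d - 2) * ... * 1 for odd d.
    double_factorial = math.prod(range(d, 0, -2))
    return math.sqrt(math.pi) * double_factorial / 2 ** ((d + 1) / 2)

r = 2.0
print(hypersphere_volume(1, r), 2 * r)                 # length of an interval
print(hypersphere_volume(2, r), math.pi * r ** 2)      # area of a disk
print(hypersphere_volume(3, r), 4 / 3 * math.pi * r ** 3)  # volume of a ball
```

Each printed pair should agree up to floating-point rounding, and `gamma_half_closed_form(d)` should match `math.gamma(d / 2 + 1)` for every positive integer `d`.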

Volume of Unit Hypersphere

With increasing dimensionality the hypersphere volume first increases up to a point, and then starts to decrease, and ultimately vanishes. In particular, for the unit hypersphere with $r = 1$,

$$\lim_{d \to \infty} \text{vol}(S_d(1)) = \lim_{d \to \infty} \frac{\pi^{d/2}}{\Gamma\!\left(\frac{d}{2} + 1\right)} \to 0$$

[Figure: $\text{vol}(S_d(1))$ plotted against the dimensionality $d$.]
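The rise and fall of the unit hypersphere volume is easy to tabulate. The sketch below assumes the range $d = 1, \ldots, 50$ purely for illustration and locates the dimension at which the volume peaks.

```python
import math

def unit_hypersphere_volume(d):
    """vol(S_d(1)) = pi^(d/2) / Gamma(d/2 + 1)."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)

# Tabulate the volume for d = 1..50 and find where it is largest.
volumes = {d: unit_hypersphere_volume(d) for d in range(1, 51)}
peak_d = max(volumes, key=volumes.get)

for d in (1, 2, 3, 5, 10, 20, 50):
    print(f"d={d:2d}  vol={volumes[d]:.6g}")
print("largest volume at d =", peak_d)
```

Among integer dimensions the volume is largest at $d = 5$; after that the Gamma function in the denominator grows faster than $\pi^{d/2}$ and the volume shrinks toward zero.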

Hypersphere Inscribed within Hypercube

Consider the space enclosed within the largest hypersphere that can be accommodated within a hypercube (which represents the dataspace). The ratio of the volume of the hypersphere of radius $r$ to the hypercube with side length $l = 2r$ is given as

In 2 dimensions:

$$\frac{\text{vol}(S_2(r))}{\text{vol}(H_2(2r))} = \frac{\pi r^2}{4 r^2} = \frac{\pi}{4} = 78.5\%$$

In 3 dimensions:

$$\frac{\text{vol}(S_3(r))}{\text{vol}(H_3(2r))} = \frac{\frac{4}{3} \pi r^3}{8 r^3} = \frac{\pi}{6} = 52.4\%$$

In $d$ dimensions:

$$\lim_{d \to \infty} \frac{\text{vol}(S_d(r))}{\text{vol}(H_d(2r))} = \lim_{d \to \infty} \frac{\pi^{d/2}}{2^d \, \Gamma\!\left(\frac{d}{2} + 1\right)} \to 0$$

As the dimensionality increases, most of the volume of the hypercube is in the corners, whereas the center is essentially empty.
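The ratio does not depend on $r$, because $r^d$ cancels between numerator and denominator. A brief sketch (the function name `inscribed_ratio` is an assumption for illustration) shows how quickly it collapses:

```python
import math

def inscribed_ratio(d):
    """Volume of the inscribed hypersphere divided by the hypercube volume.

    With side length l = 2r, the factor r^d cancels, leaving
    pi^(d/2) / (2^d * Gamma(d/2 + 1)).
    """
    return math.pi ** (d / 2) / (2 ** d * math.gamma(d / 2 + 1))

for d in (2, 3, 5, 10, 20):
    print(f"d={d:2d}  ratio={inscribed_ratio(d):.3e}")
```

For $d = 2$ and $d = 3$ this reproduces $\pi/4$ and $\pi/6$; by $d = 10$ the inscribed hypersphere already occupies less than 1% of the hypercube.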

Hypersphere Inscribed inside a Hypercube

[Figure: a hypersphere of radius $r$ inscribed inside a hypercube.]

Conceptual View of High-dimensional Space

Two, three, four, and higher dimensions

All the volume of the hyperspace is in the corners, with the center being essentially empty. High-dimensional space looks like a rolled-up porcupine!

[Figure panels: (a) 2D, (b) 3D, (c) 4D, (d) $d$D.]

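Volume of a Thin Shell

The volume of a thin hypershell of width $\epsilon$ is given as

$$\text{vol}(S_d(r, \epsilon)) = \text{vol}(S_d(r)) - \text{vol}(S_d(r - \epsilon)) = K_d r^d - K_d (r - \epsilon)^d$$

The ratio of the volume of the thin shell to the volume of the outer sphere is

$$\frac{\text{vol}(S_d(r, \epsilon))}{\text{vol}(S_d(r))} = \frac{K_d r^d - K_d (r - \epsilon)^d}{K_d r^d} = 1 - \left( 1 - \frac{\epsilon}{r} \right)^d$$

As $d$ increases, we have

$$\lim_{d \to \infty} \frac{\text{vol}(S_d(r, \epsilon))}{\text{vol}(S_d(r))} = \lim_{d \to \infty} 1 - \left( 1 - \frac{\epsilon}{r} \right)^d \to 1$$

[Figure: a hypersphere of radius $r$ with a thin outer shell of width $\epsilon$.]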
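To see how fast the shell takes over, the ratio $1 - (1 - \epsilon/r)^d$ can be evaluated for a fixed thin shell. In this sketch the shell width of 1% of the radius is an arbitrary choice for illustration, not a value from the slides.

```python
def shell_fraction(d, eps, r=1.0):
    """Fraction of the hypersphere volume lying in the outer shell of width eps."""
    return 1.0 - (1.0 - eps / r) ** d

# Assumed example: a shell whose width is 1% of the radius.
for d in (2, 10, 100, 1000):
    print(f"d={d:4d}  shell fraction={shell_fraction(d, eps=0.01):.5f}")
```

In two dimensions the shell holds about 2% of the volume, while in 1000 dimensions it holds nearly all of it, since $0.99^{1000}$ is vanishingly small.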

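Diagonals in Hyperspace

Consider a $d$-dimensional hypercube, with origin $\mathbf{0}_d = (0_1, 0_2, \ldots, 0_d)$, and bounded in each dimension in the range $[-1, 1]$. Each corner of the hyperspace is a $d$-dimensional vector of the form $(\pm 1_1, \pm 1_2, \ldots, \pm 1_d)^T$.

Let $\mathbf{e}_i = (0_1, \ldots, 1_i, \ldots, 0_d)^T$ denote the $d$-dimensional canonical unit vector in dimension $i$, and let $\mathbf{1}$ denote the $d$-dimensional diagonal vector $(1_1, 1_2, \ldots, 1_d)^T$.

Consider the angle $\theta_d$ between the diagonal vector $\mathbf{1}$ and the first axis $\mathbf{e}_1$, in $d$ dimensions:

$$\cos\theta_d = \frac{\mathbf{e}_1^T \mathbf{1}}{\|\mathbf{e}_1\| \, \|\mathbf{1}\|} = \frac{\mathbf{e}_1^T \mathbf{1}}{\sqrt{\mathbf{e}_1^T \mathbf{e}_1} \, \sqrt{\mathbf{1}^T \mathbf{1}}} = \frac{1}{\sqrt{1} \sqrt{d}} = \frac{1}{\sqrt{d}}$$

As $d$ increases, we have

$$\lim_{d \to \infty} \cos\theta_d = \lim_{d \to \infty} \frac{1}{\sqrt{d}} \to 0$$

which implies that

$$\lim_{d \to \infty} \theta_d \to \frac{\pi}{2} = 90^\circ$$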
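The angle can also be computed from explicit vectors rather than from the closed form $1/\sqrt{d}$. The following is a small sketch with an illustrative function name, `diagonal_angle_degrees`:

```python
import math

def diagonal_angle_degrees(d):
    """Angle between the diagonal vector 1 and the first axis e_1 in d dimensions."""
    e1 = [1.0] + [0.0] * (d - 1)
    ones = [1.0] * d
    dot = sum(a * b for a, b in zip(e1, ones))
    norm_e1 = math.sqrt(sum(a * a for a in e1))
    norm_ones = math.sqrt(sum(b * b for b in ones))
    return math.degrees(math.acos(dot / (norm_e1 * norm_ones)))

for d in (2, 3, 10, 100, 10000):
    print(f"d={d:5d}  theta={diagonal_angle_degrees(d):.2f} degrees")
```

The angle starts at 45 degrees in two dimensions and approaches, but never reaches, 90 degrees as $d$ grows.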

Angle between Diagonal Vector $\mathbf{1}$ and $\mathbf{e}_1$

[Figure panels: (a) In 2D, (b) In 3D, each showing the angle $\theta$ between $\mathbf{1}$ and $\mathbf{e}_1$.]

In high dimensions all of the diagonal vectors are perpendicular (or orthogonal) to all the coordinate axes! Each of the $2^{d-1}$ new axes connecting pairs of the $2^d$ corners is essentially orthogonal to all of the $d$ principal coordinate axes! Thus, in effect, high-dimensional space has an exponential number of orthogonal axes.

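Density of the Multivariate Normal

Consider the standard multivariate normal distribution with $\boldsymbol{\mu} = \mathbf{0}$ and $\boldsymbol{\Sigma} = \mathbf{I}$:

$$f(\mathbf{x}) = \frac{1}{(\sqrt{2\pi})^d} \exp\left\{ -\frac{\mathbf{x}^T \mathbf{x}}{2} \right\}$$

The peak of the density is at the mean. Consider the set of points $\mathbf{x}$ with density at least an $\alpha$ fraction of the density at the mean:

$$\frac{f(\mathbf{x})}{f(\mathbf{0})} \ge \alpha$$

$$\exp\left\{ -\frac{\mathbf{x}^T \mathbf{x}}{2} \right\} \ge \alpha$$

$$\mathbf{x}^T \mathbf{x} \le -2 \ln(\alpha)$$

$$\sum_{i=1}^{d} (x_i)^2 \le -2 \ln(\alpha)$$

The sum of squared IID random variables follows a chi-squared distribution $\chi^2_d$. Thus,

$$P\left( \frac{f(\mathbf{x})}{f(\mathbf{0})} \ge \alpha \right) = F_{\chi^2_d}\big( -2 \ln(\alpha) \big)$$

where $F_{\chi^2_q}$ is the CDF of the chi-squared distribution with $q$ degrees of freedom.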

Density Contour for α Fraction of the Density at the Mean: One Dimension

Let $\alpha = 0.5$, then $-2\ln(0.5) = 1.386$ and $F_{\chi^2_1}(1.386) = 0.76$. Thus, 24% of the density is in the tail regions.

[Figure: one-dimensional standard normal density with the level $\alpha = 0.5$ of the peak density marked.]

Density Contour for α Fraction of the Density at the Mean: Two Dimensions

Let $\alpha = 0.5$, then $-2\ln(0.5) = 1.386$ and $F_{\chi^2_2}(1.386) = 0.5$. Thus, 50% of the density is in the tail regions.

[Figure: two-dimensional standard normal density $f(\mathbf{x})$ over $X_1$ and $X_2$, with the contour at $\alpha = 0.5$ of the peak density marked.]

Chi-Squared Distribution: $P(f(\mathbf{x})/f(\mathbf{0}) \ge \alpha)$

This probability decreases rapidly with dimensionality. For 2D, it is 0.5. For 3D it is 0.29, i.e., 71% of the density is in the tails. By $d = 10$, it decreases to 0.075%, that is, 99.925% of the points lie in the extreme or tail regions.

[Figure panels: chi-squared density $f(x)$ against $x$ for two and three degrees of freedom, with the shaded areas $F = 0.5$ and $F = 0.29$.]
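The probabilities on this slide can be reproduced by evaluating the chi-squared CDF at $-2\ln(\alpha)$. The sketch below avoids third-party libraries by using the standard power series for the regularized lower incomplete gamma function; the helper names `chi2_cdf` and `fraction_near_mean` are assumptions made for illustration.

```python
import math

def chi2_cdf(x, q):
    """CDF of the chi-squared distribution with q degrees of freedom.

    Uses P(a, z) = z^a e^(-z) / Gamma(a) * sum_{k>=0} z^k / (a (a+1) ... (a+k))
    with a = q / 2 and z = x / 2.
    """
    if x <= 0:
        return 0.0
    a, z = q / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    k = 1
    # Terms eventually shrink geometrically, so the loop terminates.
    while term > 1e-16 * total:
        term *= z / (a + k)
        total += term
        k += 1
    return total * math.exp(-z + a * math.log(z) - math.lgamma(a))

def fraction_near_mean(alpha, d):
    """P(f(x) / f(0) >= alpha) for the d-dimensional standard normal."""
    return chi2_cdf(-2.0 * math.log(alpha), d)

for d in (1, 2, 3, 10):
    print(f"d={d:2d}  P={fraction_near_mean(0.5, d):.5f}")
```

With $\alpha = 0.5$ the values come out close to 0.76, 0.5, 0.29, and 0.00075 for $d = 1, 2, 3, 10$, in line with the figures quoted on the slides.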

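Hypersphere Volume: Polar Coordinates in 2D

[Figure: the point $(x_1, x_2)$ at radius $r$ and angle $\theta_1$ in the $X_1$-$X_2$ plane.]

The point $\mathbf{x} = (x_1, x_2)$ in polar coordinates:

$$x_1 = r \cos\theta_1 = r c_1$$
$$x_2 = r \sin\theta_1 = r s_1$$

where $r = \|\mathbf{x}\|$, and $\cos\theta_1 = c_1$ and $\sin\theta_1 = s_1$.

The Jacobian matrix for this transformation is given as

$$J(\theta_1) = \begin{pmatrix} \frac{\partial x_1}{\partial r} & \frac{\partial x_1}{\partial \theta_1} \\[4pt] \frac{\partial x_2}{\partial r} & \frac{\partial x_2}{\partial \theta_1} \end{pmatrix} = \begin{pmatrix} c_1 & -r s_1 \\ s_1 & r c_1 \end{pmatrix}$$

Hypersphere volume is obtained by integration over $r$ and $\theta_1$ (with $r > 0$, and $0 \le \theta_1 \le 2\pi$):

$$\text{vol}(S_2(r)) = \int_r \int_{\theta_1} \big| \det(J(\theta_1)) \big| \, dr \, d\theta_1 = \int_0^r \int_0^{2\pi} r \, dr \, d\theta_1 = \int_0^r r \, dr \int_0^{2\pi} d\theta_1 = \frac{r^2}{2} \cdot 2\pi = \pi r^2$$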

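Hypersphere Volume: Polar Coordinates in 3D

[Figure: the point $(x_1, x_2, x_3)$ at radius $r$ with angles $\theta_1$ and $\theta_2$ in the $X_1$-$X_2$-$X_3$ space.]

The point $\mathbf{x} = (x_1, x_2, x_3)$ in polar coordinates:

$$x_1 = r \cos\theta_1 \cos\theta_2 = r c_1 c_2$$
$$x_2 = r \cos\theta_1 \sin\theta_2 = r c_1 s_2$$
$$x_3 = r \sin\theta_1 = r s_1$$

The Jacobian matrix is given as

$$J(\theta_1, \theta_2) = \begin{pmatrix} c_1 c_2 & -r s_1 c_2 & -r c_1 s_2 \\ c_1 s_2 & -r s_1 s_2 & r c_1 c_2 \\ s_1 & r c_1 & 0 \end{pmatrix}$$

The volume of the hypersphere for $d = 3$ is obtained via a triple integral with $r > 0$, $-\pi/2 \le \theta_1 \le \pi/2$, and $0 \le \theta_2 \le 2\pi$:

$$\text{vol}(S_3(r)) = \int_r \int_{\theta_1} \int_{\theta_2} \big| \det(J(\theta_1, \theta_2)) \big| \, dr \, d\theta_1 \, d\theta_2 = \frac{4}{3} \pi r^3$$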

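Hypersphere Volume in d Dimensions

The determinant of the $d$-dimensional Jacobian matrix is

$$\det(J(\theta_1, \theta_2, \ldots, \theta_{d-1})) = (-1)^d \, r^{d-1} c_1^{d-2} c_2^{d-3} \cdots c_{d-2}$$

The volume of the hypersphere is given by the $d$-dimensional integral with $r > 0$, $-\pi/2 \le \theta_i \le \pi/2$ for all $i = 1, \ldots, d-2$, and $0 \le \theta_{d-1} \le 2\pi$:

$$\text{vol}(S_d(r)) = \int_r \int_{\theta_1} \int_{\theta_2} \cdots \int_{\theta_{d-1}} \big| \det(J(\theta_1, \theta_2, \ldots, \theta_{d-1})) \big| \, dr \, d\theta_1 \, d\theta_2 \cdots d\theta_{d-1}$$

$$= \int_0^r r^{d-1} \, dr \int_{-\pi/2}^{\pi/2} c_1^{d-2} \, d\theta_1 \cdots \int_{-\pi/2}^{\pi/2} c_{d-2} \, d\theta_{d-2} \int_0^{2\pi} d\theta_{d-1}$$

$$= \frac{r^d}{d} \cdot \frac{\Gamma\!\left(\frac{d-1}{2}\right) \Gamma\!\left(\frac{1}{2}\right)}{\Gamma\!\left(\frac{d}{2}\right)} \cdot \frac{\Gamma\!\left(\frac{d-2}{2}\right) \Gamma\!\left(\frac{1}{2}\right)}{\Gamma\!\left(\frac{d-1}{2}\right)} \cdots \frac{\Gamma(1) \, \Gamma\!\left(\frac{1}{2}\right)}{\Gamma\!\left(\frac{3}{2}\right)} \cdot 2\pi$$

$$= \frac{\pi \, \Gamma\!\left(\frac{1}{2}\right)^{d-2} r^d}{\frac{d}{2} \, \Gamma\!\left(\frac{d}{2}\right)} = \left( \frac{\pi^{d/2}}{\Gamma\!\left(\frac{d}{2} + 1\right)} \right) r^d$$

The product of Gamma ratios telescopes to $\Gamma(1) / \Gamma(d/2)$, and the last step uses $\Gamma(1/2) = \sqrt{\pi}$ and $\frac{d}{2} \Gamma(d/2) = \Gamma(d/2 + 1)$.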
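The derivation rests on the identity $\int_{-\pi/2}^{\pi/2} \cos^k\theta \, d\theta = \Gamma\!\left(\frac{k+1}{2}\right)\Gamma\!\left(\frac{1}{2}\right) / \Gamma\!\left(\frac{k}{2} + 1\right)$. The sketch below checks that identity numerically with a midpoint rule and then confirms that the product of radial and angular integrals matches the closed-form volume. All function names and the step count are assumptions chosen for illustration.

```python
import math

def cos_power_integral(k):
    """Closed form of the integral of cos^k(theta) over [-pi/2, pi/2]."""
    return math.gamma((k + 1) / 2) * math.gamma(0.5) / math.gamma(k / 2 + 1)

def cos_power_integral_numeric(k, steps=20000):
    """Midpoint-rule approximation of the same integral."""
    h = math.pi / steps
    return h * sum(math.cos(-math.pi / 2 + (i + 0.5) * h) ** k
                   for i in range(steps))

def volume_from_integrals(d, r=1.0):
    """Radial integral times the d - 2 cosine integrals times 2*pi (d >= 2)."""
    radial = r ** d / d
    angular = math.prod(cos_power_integral(k) for k in range(1, d - 1))
    return radial * angular * 2 * math.pi

def volume_closed_form(d, r=1.0):
    """pi^(d/2) / Gamma(d/2 + 1) * r^d."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * r ** d

for d in (2, 3, 4, 7):
    print(d, volume_from_integrals(d, 1.5), volume_closed_form(d, 1.5))
```

For $d = 2$ the product of cosine integrals is empty and the result reduces to $\pi r^2$; for $d = 3$ the single cosine integral equals 2, giving $\frac{4}{3}\pi r^3$.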