Clustering High Dimensional Data

1 Clustering High Dimensional Data
Roland Winkler, Frank Klawonn, Rudolf Kruse
July 11, 2011
German Aerospace Center Braunschweig, Institute of Flight Guidance

2 Outline
1. S.O.D.A.
2. Introduction to high dimensional spaces
3. Distance Concentration and its Implications on Clustering
4. Distance Function modifications

3 Current Section: 1. S.O.D.A.

4 S.O.D.A.: Frankfurt Airport

5 S.O.D.A.: 3 Aircraft transitions

6 S.O.D.A.: Aircraft transitions

7 S.O.D.A.: Characteristic Positions

8 Current Section: 2. Introduction to high dimensional spaces

9 Challenges with high-dimensional data sets in clustering
- Huge space that is very thinly populated (for comparison: the $m$-dimensional hypercube has $2^m$ corners)
- The intrinsic dimensionality might be lower and form a complex geometry
- Dimension reduction is not necessarily helpful
- Occurrence of hubs (data objects that are among the k nearest neighbours of a large portion of the other data objects)
- Distance concentration
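
The hub effect named in the list above can be made concrete with a small count of k-occurrences, that is, how often each data object appears in the k-nearest-neighbour lists of the others. The following is only a sketch under assumptions that are not from the slides: standard normal data, Euclidean distance, 200 objects, k = 5, and the helper names `sample_normal` and `k_occurrences` are hypothetical.

```python
import math
import random


def sample_normal(n, m, rng):
    """Draw n points from a standard normal distribution in m dimensions (assumed test data)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(m)] for _ in range(n)]


def k_occurrences(points, k):
    """For each point, count how many other points have it among their k nearest neighbours."""
    n = len(points)
    counts = [0] * n
    for i in range(n):
        # All other indices, ordered by Euclidean distance to point i.
        neighbours = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
        for j in neighbours[:k]:
            counts[j] += 1
    return counts


if __name__ == "__main__":
    rng = random.Random(0)
    for m in (2, 50):
        counts = k_occurrences(sample_normal(200, m, rng), k=5)
        print(f"m={m:2d}  largest k-occurrence: {max(counts)}  "
              f"never a neighbour: {counts.count(0)}")
```

A strongly skewed count distribution (a few objects with very large counts, many with zero) is what the term "hub" refers to. Whether and how strongly it shows up depends on the data and on k.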

10 Distance concentration
- Data set: $X = \{x_1, \dots, x_n\} \subset \mathbb{R}^m$ with $n$ data objects
- Query point: $Q \in \mathbb{R}^m$
- Metric: $\|\cdot\| : \mathbb{R}^m \to \mathbb{R}$
- Set of distances: $D_Q(X) = \{\|x_i - Q\| : x_i \in X\}$
- Sample mean $\hat{E}(D_Q(X))$ and sample variance $V(D_Q(X))$
- Relative variance of distances: $RV(D_Q(X)) = \dfrac{V(D_Q(X))}{\hat{E}(D_Q(X))^2}$

$$\lim_{m \to \infty} RV(D_Q(X)) = 0 \;\Rightarrow\; \lim_{m \to \infty} P\big((1 + \varepsilon)\,\min(D_Q(X)) > \max(D_Q(X))\big) = 1 \qquad (1)$$
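
The quantities on this slide can be estimated numerically. The sketch below computes the relative variance $RV(D_Q(X))$ and the ratio max/min of the distances for standard normal data in the dimensions shown on the following figure slides. The sample size of 500, the random query point drawn from the same distribution, and the function names are assumptions for illustration only.

```python
import math
import random
import statistics


def relative_variance(dists):
    """RV = sample variance of the distances divided by the squared sample mean."""
    mean = statistics.fmean(dists)
    return statistics.variance(dists, xbar=mean) / mean ** 2


def concentration_stats(n, m, rng):
    """Return (RV, max/min) of the Euclidean distances from a random query
    point to n standard normal points in R^m."""
    points = [[rng.gauss(0.0, 1.0) for _ in range(m)] for _ in range(n)]
    query = [rng.gauss(0.0, 1.0) for _ in range(m)]
    dists = [math.dist(x, query) for x in points]
    return relative_variance(dists), max(dists) / min(dists)


if __name__ == "__main__":
    rng = random.Random(0)
    for m in (2, 5, 20, 50, 100, 200):
        rv, ratio = concentration_stats(500, m, rng)
        print(f"m={m:3d}  RV={rv:.4f}  max/min={ratio:.2f}")
```

For this i.i.d. setting the relative variance shrinks as m grows and the ratio max/min approaches 1, which is the behaviour described by (1).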

11 Normal distribution, 2 dimensions (figure)

12 Normal distribution, 5 dimensions (figure)

13 Normal distribution, 20 dimensions (figure)

14 Normal distribution, 50 dimensions (figure)

15 Normal distribution, 100 dimensions (figure)

16 Normal distribution, 200 dimensions (figure)

17 S.O.D.A.: Characteristic vector data set

18 Current Section: 3. Distance Concentration and its Implications on Clustering

19 Objective function for HCM, FCM and PFCM
- Data set: $X = \{x_1, \dots, x_n\} \subset \mathbb{R}^m$ with $n$ data objects
- Prototype set: $Y = \{y_1, \dots, y_c\} \subset \mathbb{R}^m$
- Distance function: $d_{ij} = d(y_i, x_j) = \|y_i - x_j\|$
- Membership values: $u_{ij} \in [0, 1]$
- Fuzzifier function: $f_{fuzzy} : [0, 1] \to [0, 1]$
- Objective function:

$$\sum_{i=1}^{c} \sum_{j=1}^{n} f_{fuzzy}(u_{ij})\, d_{ij}^2$$
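
As a minimal sketch of how this objective function is evaluated, the code below sums $f_{fuzzy}(u_{ij})\, d_{ij}^2$ over all prototypes and data objects, with the fuzzifier passed in as a function. The two fuzzifiers shown are the ones named on the HCM and FCM slides. The function names, the list-of-lists layout `memberships[i][j]` and the toy data are assumptions for illustration.

```python
import math


def f_hcm(u):
    """HCM fuzzifier: f(u) = u."""
    return u


def f_fcm(u, omega=2.0):
    """FCM fuzzifier: f(u) = u^omega (omega = 2 as on the FCM slide)."""
    return u ** omega


def objective(prototypes, data, memberships, f_fuzzy):
    """Sum over prototypes i and data objects j of f_fuzzy(u_ij) * d_ij^2,
    with d_ij the Euclidean distance between prototype i and object j."""
    total = 0.0
    for i, y in enumerate(prototypes):
        for j, x in enumerate(data):
            d_ij = math.dist(y, x)
            total += f_fuzzy(memberships[i][j]) * d_ij ** 2
    return total


if __name__ == "__main__":
    # Toy example (not from the slides): two well separated pairs of points.
    data = [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]]
    prototypes = [[0.5, 0.0], [10.5, 0.0]]
    crisp = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
    print(objective(prototypes, data, crisp, f_hcm))
```

Keeping the fuzzifier as a parameter mirrors the slide: HCM, FCM and PFCM differ only in $f_{fuzzy}$.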

20 Test data set (figure panels: m = 2, m = 50)

21 HCM: $f_{fuzzy}(u) = u$; found clusters: 47 (figure panels: m = 2, m = 50)

22 FCM: $f_{fuzzy}(u) = u^{\omega}$ (here: $\omega = 2$); found clusters: 0 (figure panels: m = 2, m = 50)

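23 FCM with distance concentration
- Suppose $d(y_i, x_j) = d_{ij} \approx d$
- All membership values tend to become equal
- All prototypes tend to become equal

$$u_{ij} \approx \frac{\left(\frac{1}{d}\right)^{\frac{2}{\omega - 1}}}{\sum_{k=1}^{c} \left(\frac{1}{d}\right)^{\frac{2}{\omega - 1}}} = \frac{\left(\frac{1}{d}\right)^{\frac{2}{\omega - 1}}}{c \left(\frac{1}{d}\right)^{\frac{2}{\omega - 1}}} = \frac{1}{c}$$

$$y_i \approx \frac{\sum_{j=1}^{n} \left(\frac{1}{c}\right)^{\omega} x_j}{\sum_{j=1}^{n} \left(\frac{1}{c}\right)^{\omega}} = \frac{\left(\frac{1}{c}\right)^{\omega} \sum_{j=1}^{n} x_j}{n \left(\frac{1}{c}\right)^{\omega}} = \frac{1}{n} \sum_{j=1}^{n} x_j$$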
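
The limiting case on this slide can be checked with the usual FCM update equations. The sketch below implements the membership update $u_{ij} = (1/d_{ij})^{2/(\omega-1)} / \sum_k (1/d_{kj})^{2/(\omega-1)}$ and the prototype update $y_i = \sum_j u_{ij}^{\omega} x_j / \sum_j u_{ij}^{\omega}$, then applies them to a constructed example in which every data object has the same distance to every prototype. The function names and the toy configuration are assumptions, not taken from the slides.

```python
import math


def fcm_memberships(prototypes, data, omega=2.0):
    """FCM membership update for Euclidean distances."""
    exponent = 2.0 / (omega - 1.0)
    u = [[0.0] * len(data) for _ in prototypes]
    for j, x in enumerate(data):
        # Assumes no prototype coincides with a data object (d_ij > 0).
        inverse = [(1.0 / math.dist(y, x)) ** exponent for y in prototypes]
        total = sum(inverse)
        for i, value in enumerate(inverse):
            u[i][j] = value / total
    return u


def fcm_prototypes(memberships, data, omega=2.0):
    """FCM prototype update: mean of the data weighted by u_ij^omega."""
    dim = len(data[0])
    prototypes = []
    for row in memberships:
        weights = [u_ij ** omega for u_ij in row]
        total = sum(weights)
        prototypes.append(
            [sum(w * x[k] for w, x in zip(weights, data)) / total for k in range(dim)]
        )
    return prototypes


if __name__ == "__main__":
    # Constructed example: every data object is at distance sqrt(2)
    # from both prototypes, so d_ij = d for all i, j.
    data = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [-1.0, 0.0, 0.0]]
    prototypes = [[0.0, 0.0, 1.0], [0.0, 0.0, -1.0]]
    u = fcm_memberships(prototypes, data)
    print(u)                        # all memberships equal 1/c
    print(fcm_prototypes(u, data))  # both prototypes move to the data mean
```

With exactly equal distances a single update already places all prototypes on the center of gravity of the data. With only approximately equal distances the same pull acts more gradually.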

24 PFCM: $f_{fuzzy}(u) = \alpha u + (1 - \alpha) u^2$; found clusters: 92 (figure panels: m = 2, m = 50)

25 EM Gaussian Mixture Model: covariance matrix $= \delta e_m$; found clusters: 68 (figure panels: m = 2, m = 50)

26 Current Section: 4. Distance Function modifications

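27 Alternative Distance Function
- $D_Q(X) = \{\|x - Q\| : x \in X\}$, $a > 0$
- $\delta = \max\left(\hat{E}(D_Q(X)) - a \sqrt{V(D_Q(X))},\; 0\right)$
- $d_{ij} = d_{\delta}(y_i, x_j) = \max\left(\|y_i - x_j\| - \delta,\; 0\right)$ with $Q = y_i$
- $a = 3$ implies: for less than 10% of data objects, $d_{\delta}(y_i, x_j) = 0$ (one-sided Chebyshev bound: $1/(1 + a^2) = 1/10$)
- The objective function $\sum_{i=1}^{c} \sum_{j=1}^{n} f_{fuzzy}(u_{ij})\, d_{\delta}^2(y_i, x_j)$ does not work well with $d_{\delta}$
- Use $d_{\delta}$ only for calculating membership values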
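
A minimal sketch of the shifted distance on this slide follows. It assumes that the spread term is the standard deviation of the distances, that is, the square root of the sample variance, which is what makes the "less than 10% for a = 3" statement follow from the one-sided Chebyshev inequality. The function names and the standard normal test data are hypothetical.

```python
import math
import random
import statistics


def delta_for_query(data, query, a=3.0):
    """delta = max(mean - a * std, 0) over the distances from query to all data objects.
    Assumption: the spread is measured by the sample standard deviation."""
    dists = [math.dist(x, query) for x in data]
    mean = statistics.fmean(dists)
    std = statistics.stdev(dists, xbar=mean)
    return max(mean - a * std, 0.0)


def d_delta(prototype, x, delta):
    """Shifted distance: max(||prototype - x|| - delta, 0)."""
    return max(math.dist(prototype, x) - delta, 0.0)


if __name__ == "__main__":
    rng = random.Random(0)
    data = [[rng.gauss(0.0, 1.0) for _ in range(50)] for _ in range(200)]
    prototype = [rng.gauss(0.0, 1.0) for _ in range(50)]  # plays the role of Q = y_i
    delta = delta_for_query(data, prototype, a=3.0)
    zeros = sum(1 for x in data if d_delta(prototype, x, delta) == 0.0)
    print(f"delta={delta:.3f}  objects at distance zero: {zeros} of {len(data)}")
```

Following the last two bullets of the slide, such a `d_delta` would replace the distance only in the membership update, while the prototype update and the objective function keep the original distance.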

28 FCM with Alternative Distance Function
Example: $f_{fuzzy}(u) = u^{\omega}$, $d = d_{\delta}$; found clusters: 99 (figure panels: m = 2, m = 50)

29 Thank you for your attention


31 S.O.D.A.
- S.O.D.A. (Surveillance Data Analysis System) is developed by Fraport AG, the organization operating the international airport in Frankfurt
- It is based on the open-source database system PostgreSQL; the visualizations are done using PostGIS
- DLR (German Aerospace Center) Braunschweig and Fraport are working as partners in the S.O.D.A. project to increase data usage and quality

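32 Distance concentration
- Let $F^{(m)}$, $m \in \mathbb{N}$, be a sequence of random variables
- $X^{(m)} = \{x_1^{(m)}, \dots, x_n^{(m)}\} \sim F^{(m)}$ be a sample of $n$ data objects
- $Q^{(m)} \in \mathrm{dom}(F^{(m)})$ a query point
- $\|\cdot\| : \mathrm{dom}(F^{(m)}) \to \mathbb{R}$ be a distance function and $p > 0$
- $D^{(m)} = D_{Q^{(m)}}(X^{(m)}) = \{\|x_i^{(m)} - Q^{(m)}\|^p : x_i^{(m)} \in X^{(m)}\}$
- $\hat{E}(D^{(m)})$ and $V(D^{(m)})$ be finite, $\hat{E}(D^{(m)}) > 0$
- $RV(D^{(m)}) = \dfrac{V(D^{(m)})}{\hat{E}(D^{(m)})^2}$

$$\lim_{m \to \infty} RV(D^{(m)}) = 0 \;\Rightarrow\; \lim_{m \to \infty} P\big((1 + \varepsilon)\,\min(D^{(m)}) > \max(D^{(m)})\big) = 1 \qquad (2)$$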

33 Distance concentration
Let $\|\cdot\|$ be one of the $L^p$ norms and in all cases $Q^{(m)} \sim F^{(m)}$.
When are the conditions in (1) true?
- $F^{(m)}$ is i.i.d. in all dimensions (see the following example)
- $F^{(m)}$ is correlated in all dimensions: $U_1, \dots, U_m$ so that $U_k \sim U[0, k]$, define $F_1^{(m)} = U_1$, $F_k^{(m)} = U_k\ [\dots]\ F_{k-1}^{(m)}$
- $F^{(m)}$ is the uniform distribution on the surface of an $m$-dimensional hypercube: $F^{(m)} = U([0, 1]^m \setminus (0, 1)^m)$
- $F^{(m)}$ with $\sigma_m \to 0$: define $F_k^{(m)} \sim N(0, \frac{1}{k})$, $k = 1, \dots, m$
Open question: What conditions must be satisfied for distance concentration to occur?
