Less is More: Non-Redundant Subspace Clustering

Size: px

Start display at page:

Download "Less is More: Non-Redundant Subspace Clustering"

Virgil Neal
5 years ago
Views:

Less is More: Non-Redundant Subspace Clustering Ira Assent Emmanuel Müller Stephan Günnemann Ralph Krieger Thomas Seidl Aalborg University, Denmark

1 Less is More: Non-Redundant Subspace Clustering Ira Assent Emmanuel Müller Stephan Günnemann Ralph Krieger Thomas Seidl Aalborg University, Denmark DATA MANAGEMENT AND DATA EXPLORATION GROUP RWTH Prof. Dr. Aachen rer. nat. University, Thomas Germany Seidl MultiClust Workshop at SIGKDD 2010 July 25, 2010

2 Detection of Non-Redundant Subspace Clusters I C 3 # boats in Miami C 1 internet usage C 5 C 2 C 4 income C 6 sportive activities Hidden clusters are described by different attribute sets Each object might be grouped in multiple clusters Novel challenges for subspace clustering Less is More: Non-Redundant Subspace Clustering 1 / 11

3 Detection of Non-Redundant Subspace Clusters II C 3 # boats in Miami C 1 Subspace Cluster: (rich; boat owner; car fan; globetrotter; horse fan) # horses freq. flyer miles # cars Exp. many projections (rich) (boat owner) (rich; globetrotter)... C 4 income Huge amount of redundant clusters Typically number of clusters number of objects Detection of all and only non-redundant subspace clusters Less is More: Non-Redundant Subspace Clustering 2 / 11

4 Overview Main question How can you use/extend non-redundant clustering... In this talk, we present A survey of our contributions so far The generality of our techniques Our open source initiatives for the community Research questions arise in the areas of: 1 Effective Models 2 Efficient Computation 3 Evaluation and Exploration of Results Less is More: Non-Redundant Subspace Clustering 3 / 11

5 Notions and Related Work Abstract subspace clustering definition Definition of object set O clustered in subspace S C = (O, S) with O DB, S DIM Selection of result set M a subset of all valid subspace clusters ALL 1,2,3,4 1,2,3 1,2,4 1,3,4 2,3,4 1,2 1,3 1,4 2,3 2,4 3, M = {(O 1, S 1 )... (O n, S n )} ALL Related work Subspace clustering: focus on definition of (O, S) Output all valid subspace cluster M = ALL ( too many) Projected clustering: focus on definition of disjoint clusters in M Unable to detect objects in multiple clusters ( too few) Less is More: Non-Redundant Subspace Clustering 4 / 11

6 Non-Redundant Subspace Clustering Models Select M ALL: Exclude redundant subspace clusters... C 2 C 2 C 3 C 1 C 1 Local (pairwise) redundancy elimination [1][2] (O, S) is non-redundant iff (O, S ) with O O S S O R O Excludes large number of redundant subspace clusters [1] Assent, Krieger, Müller and Seidl: DUSC: Dimensionality Unbiased Subspace Clustering, in ICDM [2] Assent, Krieger, Müller and Seidl: INSCY: Indexing Subspace Clusters with In-Process-Removal of Redundancy, in ICDM Less is More: Non-Redundant Subspace Clustering 5 / 11

7 Generalization of Redundancy Elimination Relevant subspace clustering model [3] Include the most interesting subspace clusters Exclude redundant subspace clusters Provide most relevant subspace clusters in result set Extract novel knowledge with each cluster all possible clusters ALL interestingness of clusters relevance model redundancy of clusters relevant clustering M ALL Given any definition of subspace clusters C = (O, S) Choose optimal subset M = {C 1,..., C n } ALL Proof: Such an optimization is an NP-hard problem [3] Müller, Assent, Günnemann, Krieger and Seidl: Relevant Subspace Clustering: Mining the Most Interesting Non-Redundant Concepts in High Dimensional Data, in ICDM Less is More: Non-Redundant Subspace Clustering 6 / 11

8 Redundancy Pruning by Depth-First Processing Pruning Applicable for local redundancy (simple pairwise model) Enables in-process pruning of redundant clusters. 1,2,3,4 1,2,3 1,2,4 1,3,4 2,3,4 1,2 1,3 1,4 2,3 2,4 3, breadth-first depth-first 1,2,3,4 1,2,3 1,2,4 1,3,4 2,3,4 1,2 1,3 1,4 2,3 2,4 3, Step-by-step processing (k-d (k + 1)-D subspace!) Scalability to high dimensional data? Less is More: Non-Redundant Subspace Clustering 7 / 11

9 Scalable Subspace Processing direct jump 1,2,3,4 1,2,3 1,2,4 1,3,4 2,3,4 1,2 1,3 1,4 2,3 2,4 3, Dimension interval 1 interval 2 Dimension Dimension 4 Key idea: density estimation + steered jumps Subspace clusters are represented by many low dimensional projections Use 2-D projections to estimate density in higher subspace regions [4] Use k-d projections to jump directly to (k + x)-d subspaces [x 1] Best-first search: Intelligent steering to promising subspace regions [4] Müller, Assent, Krieger, Günnemann and Seidl: DensEst: Density Estimation for Data Mining in High Dimensional Spaces, in SDM Less is More: Non-Redundant Subspace Clustering 8 / 11

10 Challenges in Evaluation and Exploration General challenge for clustering No ground truth available for clustering Subjective evaluation by exploration requires visualization techniques and interactive exploration tools Objective evaluations are incomparable using different implementations, databases and quality measures We provide broad evaluation study & interactive exploration framework Evaluation Study [5] Characterization of major paradigms Providing comparable baseline implementations Evaluation based on broad set of data sets, quality measures and parameter settings [5] Müller, Günnemann, Assent and Seidl: Evaluating Clustering in Subspace Projections of High Dimensional Data, in VLDB Less is More: Non-Redundant Subspace Clustering 9 / 11

Open Source Framework OpenSubspace framework Framework for research, education and application [6][7][8][9] Baselines for algorithm and evaluation measure development OpenSubspace algo. A algo.

de/opensubspace/ [6] Müller, Assent, Krieger, Jansen and Seidl: Morpheus: Interactive Exploration of Subspace Clustering, in KDD 2008.

11 Open Source Framework OpenSubspace framework Framework for research, education and application [6][7][8][9] Baselines for algorithm and evaluation measure development OpenSubspace algo. A algo. B algo. C unified algorithm repository algo. D rare case: common implementation eval. 1 eval. 2 eval. 3 re-implementation eval. 1 eval. 2 eval. 3 unified evaluation repository [6] Müller, Assent, Krieger, Jansen and Seidl: Morpheus: Interactive Exploration of Subspace Clustering, in KDD [7] Assent, Müller, Krieger, Jansen and Seidl: Pleiades: Subspace Clustering and Evaluation, in PKDD [8] Günnemann, Färber, Kremer, Seidl: CoDA: Interactive Cluster Based Concept Discovery, in VLDB 2010 [9] Müller, Schiffer, Gerwert, Hannen, Jansen, Seidl: SOREX: Subspace Outlier Ranking Exploration Toolkit, in PKDD Less is More: Non-Redundant Subspace Clustering 10 / 11

12 Conclusion and Future Work Subspace clustering is still an emerging research field... Is the basis for a lot of further research Alternative subspace clustering Evaluation measures for subspace clustering Benchmark databases for subspace clustering... Less is More: Non-Redundant Subspace Clustering 11 / 11

13 Conclusion and Future Work Subspace clustering is still an emerging research field... Is the basis for a lot of further research Alternative subspace clustering Evaluation measures for subspace clustering Benchmark databases for subspace clustering... Thank you for your attention. Questions? Less is More: Non-Redundant Subspace Clustering 11 / 11

P leiades: Subspace Clustering and Evaluation

P leiades: Subspace Clustering and Evaluation Ira Assent, Emmanuel Müller, Ralph Krieger, Timm Jansen, and Thomas Seidl Data management and exploration group, RWTH Aachen University, Germany {assent,mueller,krieger,jansen,seidl}@cs.rwth-aachen.de