Pleiades: Subspace Clustering and Evaluation


Ira Assent, Emmanuel Müller, Ralph Krieger, Timm Jansen, and Thomas Seidl

Data Management and Exploration Group, RWTH Aachen University, Germany
{assent,mueller,krieger,jansen,seidl}@cs.rwth-aachen.de
http://dme.rwth-aachen.de

Abstract. Subspace clustering mines the clusters present in locally relevant subsets of the attributes. In the literature, several approaches have been suggested, along with different measures for quality assessment. Pleiades provides the means for easy comparison and evaluation of different subspace clustering approaches, along with several quality measures specific to subspace clustering, and it is extensible to further application areas and algorithms. It extends the popular WEKA mining tools, allowing results to be contrasted with existing algorithms and data sets.

W. Daelemans et al. (Eds.): ECML PKDD 2008, Part II, LNAI 5212, pp. 666–671, 2008. © Springer-Verlag Berlin Heidelberg 2008

1 Pleiades

In high-dimensional data, clustering is hindered by the many irrelevant dimensions (cf. the curse of dimensionality [1]). Subspace clustering identifies locally relevant subspace projections for individual clusters [2]. As these are recent proposals, subspace clustering algorithms and the respective quality measures are not available in existing data mining tools. To allow researchers and students alike to explore the strengths and weaknesses of the different approaches, our Pleiades system provides the means for their comparison and analysis as well as easy extensibility.
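To make the notion of a subspace cluster concrete, the following is a minimal sketch (illustrative only, not the Pleiades implementation; all names are ours): a subspace cluster pairs a set of relevant dimensions with the set of member objects.

```java
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch: a subspace cluster is a set of relevant dimension
// indices (the subspace projection) together with its member objects.
// Unlike in traditional partitioning clustering, one object may appear
// in several subspace clusters under different projections.
public class SubspaceCluster {
    private final Set<Integer> dimensions; // indices of relevant attributes
    private final Set<Integer> objects;    // indices of member data objects

    public SubspaceCluster(Set<Integer> dimensions, Set<Integer> objects) {
        this.dimensions = new TreeSet<>(dimensions);
        this.objects = new TreeSet<>(objects);
    }

    public Set<Integer> getDimensions() { return dimensions; }
    public Set<Integer> getObjects()    { return objects; }

    @Override
    public String toString() {
        return "subspace " + dimensions + " -> " + objects.size() + " objects";
    }
}
```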

Figure 1 gives an overview of Pleiades: high-dimensional data is imported using the WEKA input format, possibly using the pre-processing tools already available in WEKA and those we have specifically added for subspace clustering [3]. The subspace clustering algorithms described in the next section are all part of the Pleiades system, and new algorithms can be added using our new subspace clustering interface. The result is presented to the user and can be subjected to post-processing (e.g. filtering). Pleiades offers several evaluation techniques, described in the next section, to measure the quality of a subspace clustering. Further measures can be plugged into our new evaluation interface.

[Fig. 1. Data processing in Pleiades: WEKA input and preprocessing of high-dimensional data; subspace clustering algorithms (CLIQUE, SCHISM, PreDeCon, SUBCLU, DUSC, PROCLUS, P3C) behind an open interface; postprocessing of subspace clusters in an abstract format; evaluation techniques (Entropy, F1 measure, Accuracy, Coverage) behind an open interface; output.]

2 Pleiades Features

Our Pleiades system is a tool that integrates subspace clustering algorithms along with measures designed for the assessment of subspace clustering quality.

2.1 Subspace Clustering Algorithms

While subspace clustering is a rather young area that has been researched for only about a decade, several distinct paradigms can be observed in the literature. Our Pleiades system includes representatives of these paradigms to provide an overview of the available techniques. We provide implementations of the most recent approaches from the different paradigms:

Grid-based subspace clustering discretizes the data space for efficient detection of dense grid cells in a bottom-up fashion. It was introduced in the CLIQUE approach, which exploits a monotonicity property of grid-cell density for pruning [4]. SCHISM [5] extends CLIQUE with a variable threshold adapted to the dimensionality of the subspace, as well as efficient heuristics for pruning.

Density-based subspace clustering defines clusters as dense areas separated by sparsely populated areas. In SUBCLU, a density monotonicity property is used to prune subspaces in a bottom-up fashion [6]. PreDeCon extends this paradigm by introducing subspace preference weights to determine axis-parallel projections [7]. In DUSC, dimensionality bias is removed by normalizing the density with respect to the dimensionality of the subspace [8].

Projected clustering methods identify disjoint clusters in subspace projections. PROCLUS extends the k-medoid algorithm by iteratively refining a full-space k-medoid clustering in a top-down manner [9]. P3C combines one-dimensional cluster cores into higher-dimensional clusters bottom-up [10].
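The bottom-up search shared by the grid-based and density-based paradigms can be illustrated with a short sketch. This is a simplified Apriori-style skeleton under our own naming (BottomUpSubspaceSearch, containsCluster), not the Pleiades code: the test for whether a subspace contains a cluster is left abstract, and monotonicity justifies pruning every candidate subspace that has a non-dense lower-dimensional projection.

```java
import java.util.*;
import java.util.function.Predicate;

// Illustrative Apriori-style bottom-up subspace search, as used in refined
// forms by CLIQUE [4] and SUBCLU [6]. Monotonicity guarantees that a
// subspace can only hold a cluster if all of its lower-dimensional
// projections do, which justifies the pruning step below.
public class BottomUpSubspaceSearch {

    public static List<Set<Integer>> search(int numDims,
                                            Predicate<Set<Integer>> containsCluster) {
        List<Set<Integer>> result = new ArrayList<>();
        // level 1: all single dimensions that pass the density test
        List<Set<Integer>> current = new ArrayList<>();
        for (int d = 0; d < numDims; d++) {
            Set<Integer> s = new TreeSet<>(List.of(d));
            if (containsCluster.test(s)) { current.add(s); result.add(s); }
        }
        while (!current.isEmpty()) {
            // join step: merge subspaces of the current level that differ
            // in exactly one dimension to form (k+1)-dimensional candidates
            Set<Set<Integer>> candidates = new LinkedHashSet<>();
            for (Set<Integer> a : current)
                for (Set<Integer> b : current) {
                    Set<Integer> merged = new TreeSet<>(a);
                    merged.addAll(b);
                    if (merged.size() == a.size() + 1) candidates.add(merged);
                }
            List<Set<Integer>> next = new ArrayList<>();
            for (Set<Integer> cand : candidates) {
                // prune step (monotonicity): every k-dimensional projection
                // of the candidate must itself have passed the density test
                boolean allSubspacesDense = true;
                for (int d : cand) {
                    Set<Integer> sub = new TreeSet<>(cand);
                    sub.remove(d);
                    if (!current.contains(sub)) { allSubspacesDense = false; break; }
                }
                if (allSubspacesDense && containsCluster.test(cand)) next.add(cand);
            }
            result.addAll(next);
            current = next;
        }
        return result;
    }
}
```

A caller would supply the containsCluster predicate: for CLIQUE this would be the dense-grid-cell test [4], for SUBCLU the density-connected set test [6].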

2.2 Evaluation Techniques

The quality of a clustering or classification is usually measured in terms of accuracy, i.e. the ratio of correctly classified or clustered objects. For clustering, the ground truth, i.e. the true clustering structure of the data, is usually not known; in fact, detecting this structure is the very goal of clustering. As a consequence, clustering algorithms are often evaluated manually, ideally with the help of domain experts. However, domain experts are not always available, and they might not agree on the quality of the result. Their assessment is necessarily based on the result alone, since it cannot be compared to an optimum that is not known. Moreover, manual evaluation does not scale to large datasets or large numbers of clustering results.

For a more realistic evaluation of clustering algorithms, large-scale analysis is therefore typically based on pre-labelled data, e.g. from classification applications [10,11]. The underlying assumption is that the clustering structure typically reflects the class label assignment. At least for relative comparisons of clustering algorithms, this provides measures of the quality of the clustering result. Our Pleiades system provides the measures proposed in recent subspace clustering publications. In Figure 2 we present the evaluation output with various measures for comparing subspace clustering results.

[Fig. 2. Evaluation screen]

Quality can be determined as entropy and coverage. Corresponding roughly to precision and recall, entropy accounts for the purity of the clustering (e.g. in [5]), while coverage measures the size of the clustering, i.e. the percentage of objects in any subspace cluster. Pleiades provides both coverage and entropy (for readability, inverse entropy as a percentage) [8]. The F1-value is commonly used in the evaluation of classifiers and recently also for subspace and projected clustering [10]. It is computed as the harmonic mean of recall ("are all clusters detected?") and precision ("are the clusters accurately detected?"). The F1-value of the whole clustering is simply the average of the per-cluster F1-values.
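The following sketch shows how such label-based measures can be computed from a clustering result and class labels. It follows common definitions from the cited literature rather than the exact Pleiades implementation; in particular, the convention in f1 of scoring each cluster against its best-matching class is an assumption, and the normalization Pleiades uses to report inverse entropy as a percentage is not specified here.

```java
import java.util.*;

// Illustrative sketch of label-based clustering quality measures.
// Conventions assumed here (not necessarily those of Pleiades): clusters
// are lists of object indices, labels[i] is the class label of object i.
public class ClusteringMeasures {

    // coverage: fraction of all objects in at least one subspace cluster
    public static double coverage(List<List<Integer>> clusters, int numObjects) {
        Set<Integer> covered = new HashSet<>();
        for (List<Integer> c : clusters) covered.addAll(c);
        return covered.size() / (double) numObjects;
    }

    // entropy of the class distribution inside each cluster, weighted by
    // cluster size; 0 means perfectly pure clusters
    public static double entropy(List<List<Integer>> clusters, int[] labels) {
        double weighted = 0.0;
        int total = 0;
        for (List<Integer> c : clusters) {
            Map<Integer, Integer> counts = new HashMap<>();
            for (int obj : c) counts.merge(labels[obj], 1, Integer::sum);
            double h = 0.0;
            for (int count : counts.values()) {
                double p = count / (double) c.size();
                h -= p * (Math.log(p) / Math.log(2));
            }
            weighted += c.size() * h;
            total += c.size();
        }
        return total == 0 ? 0.0 : weighted / total;
    }

    // F1 of each cluster against its best-matching class, averaged over
    // clusters (one common matching variant, assumed here)
    public static double f1(List<List<Integer>> clusters, int[] labels) {
        Map<Integer, Integer> classSize = new HashMap<>();
        for (int l : labels) classSize.merge(l, 1, Integer::sum);
        double sum = 0.0;
        for (List<Integer> c : clusters) {
            Map<Integer, Integer> counts = new HashMap<>();
            for (int obj : c) counts.merge(labels[obj], 1, Integer::sum);
            double best = 0.0;
            for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
                double precision = e.getValue() / (double) c.size();
                double recall = e.getValue() / (double) classSize.get(e.getKey());
                best = Math.max(best, 2 * precision * recall / (precision + recall));
            }
            sum += best;
        }
        return clusters.isEmpty() ? 0.0 : sum / clusters.size();
    }
}
```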

Another quality measure compares the accuracy of a classifier (e.g. a C4.5 decision tree) built on the detected patterns with the accuracy of the same classifier on the original data [11]. It indicates to what extent the subspace clustering successfully generalizes the underlying data distribution.

2.3 Applicability and Extensibility

Our Pleiades system provides the means for researchers and students of data mining to use and compare different subspace clustering techniques, all in one system. Different evaluation measures allow in-depth analysis of the algorithm properties. The interconnection with the WEKA data mining system allows further comparison with existing full-space clustering techniques, as well as pre- and post-processing tools [3]. We provide additional pre- and post-processing tools for subspace clustering. Figure 3 gives an example of our novel visual assistance for parameter setting in Pleiades.

[Fig. 3. Parametrization screen]

Pleiades incorporates two open interfaces, which enable extensibility to further subspace clustering algorithms and new evaluation measures. In Figure 4 we show the main classes of the Pleiades system, which extends the WEKA framework by a new subspace clustering panel. Subspace clustering differs in major ways from traditional clustering; e.g. an object can be part of several subspace clusters in different projections. We therefore do not extend the clustering panel, but provide a separate subspace clustering panel. The recent subspace clustering algorithms described in Section 2.1 are implemented on top of our Pleiades system. The abstraction of subspace clustering properties in Pleiades makes it easy to add new algorithms through our new subspace clustering interface. Furthermore, Pleiades offers several evaluation techniques (cf. Section 2.2) to measure the quality of a subspace clustering. Using these evaluation measures, one can easily compare different subspace clustering techniques. Further measures can be added via our new evaluation interface, which allows new quality criteria for subspace clustering to be defined.
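To illustrate the extension mechanism, the sketch below shows how a new algorithm or measure could plug into the two open interfaces named in Figure 4. The paper does not specify the interface signatures, so the method names and parameter types are hypothetical placeholders (reusing the SubspaceCluster sketch from Section 1); only the interface names SubspaceClusterer and SubspaceClusterEvaluation come from the system itself.

```java
import java.util.List;

// Hypothetical sketch of the two extension points named in Figure 4.
// The actual Pleiades interfaces are not specified in the paper; the
// method signatures below are illustrative assumptions only.
interface SubspaceClusterer {
    // run the algorithm on a data matrix (rows = objects, cols = attributes)
    List<SubspaceCluster> buildSubspaceClusterer(double[][] data) throws Exception;
}

interface SubspaceClusterEvaluation {
    // score a clustering result, e.g. using class labels as ground truth
    double evaluate(List<SubspaceCluster> result, int[] classLabels);
}

// A new algorithm would plug in by implementing the clusterer interface:
class MySubspaceAlgorithm implements SubspaceClusterer {
    @Override
    public List<SubspaceCluster> buildSubspaceClusterer(double[][] data) {
        // ... detect clusters in locally relevant subspace projections ...
        return List.of();
    }
}
```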

[Fig. 4. UML class diagram of the Pleiades system: the WEKA Explorer framework (PreprocessPanel, ArffLoader, Filters, ClassifierPanel, ClusterPanel, AssociationsPanel) extended by a SubspaceClusterPanel; Pleiades classes ParametrizationAssistent and SubspaceClusterResult; the subspace clustering interface «interface» SubspaceClusterer (implemented e.g. by P3C, Schism, ...) and the evaluation interface «interface» SubspaceClusterEvaluation (implemented e.g. by F1 Measure, Entropy, ...).]

3 Pleiades Demonstrator

The demo will illustrate the above subspace clustering and evaluation techniques on several data sets. It will allow conference attendees to explore the diverse paradigms and measures implemented in the Pleiades system, thus raising research interest in the area. The open interfaces will facilitate extension with further subspace clustering algorithms and evaluation measures by other researchers.

References

1. Beyer, K., Goldstein, J., Ramakrishnan, R., Shaft, U.: When is "nearest neighbor" meaningful? In: Proceedings of the International Conference on Database Theory (ICDT), pp. 217–235 (1999)
2. Parsons, L., Haque, E., Liu, H.: Subspace clustering for high dimensional data: a review. ACM SIGKDD Explorations Newsletter 6(1), 90–105 (2004)
3. Witten, I., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Publishers Inc., San Francisco (2005)
4. Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic subspace clustering of high dimensional data for data mining applications. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 94–105 (1998)
5. Sequeira, K., Zaki, M.: SCHISM: A new approach for interesting subspace mining. In: Proceedings of the IEEE International Conference on Data Mining (ICDM), pp. 186–193 (2004)
6. Kailing, K., Kriegel, H.P., Kröger, P.: Density-connected subspace clustering for high-dimensional data. In: Proceedings of the IEEE International Conference on Data Mining (ICDM), pp. 246–257 (2004)
7. Böhm, C., Kailing, K., Kriegel, H.P., Kröger, P.: Density connected clustering with local subspace preferences. In: Proceedings of the IEEE International Conference on Data Mining (ICDM), pp. 27–34 (2004)
8. Assent, I., Krieger, R., Müller, E., Seidl, T.: DUSC: Dimensionality unbiased subspace clustering. In: Proceedings of the IEEE International Conference on Data Mining (ICDM), pp. 409–414 (2007)
9. Aggarwal, C., Wolf, J., Yu, P., Procopiuc, C., Park, J.: Fast algorithms for projected clustering. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 61–72 (1999)
10. Moise, G., Sander, J., Ester, M.: P3C: A robust projected clustering algorithm. In: Proceedings of the IEEE International Conference on Data Mining (ICDM), pp. 414–425. IEEE Computer Society Press, Los Alamitos (2006)
11. Bringmann, B., Zimmermann, A.: The chosen few: On identifying valuable patterns. In: Proceedings of the IEEE International Conference on Data Mining (ICDM), pp. 63–72 (2007)