Constraint-based Subspace Clustering
1 Constraint-based Subspace Clustering
Elisa Fromont¹, Adriana Prado² and Céline Robardet¹
¹Université de Lyon, France; ²Universiteit Antwerpen, Belgium
Thursday, April 30
2 Traditional Clustering
- Partitions the data into groups (clusters) of similar objects
- Similarity: based on distances or density
- Traditional methods use all features (dimensions) to identify clusters in the data
Elisa Fromont, Adriana Prado and Céline Robardet, Constraint-based Subspace Clustering, 2 / 32
3 Clustering examples
[Figure: synthetic data (left) and its K-means clustering (right)]
4 Problems
When dealing with high-dimensional data:
- "Curse of dimensionality" [Beyer et al., 1999]:
  - Distance-based: the distance to the nearest neighbor is nearly equal to the distance to the farthest neighbor
  - Density-based: it is difficult to determine dense regions in high-dimensional data
- Data may have many irrelevant dimensions
Subspace Clustering for High Dimensional Data: A Review, Parsons et al., SIGKDD Explorations, 2004
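The distance-concentration effect of [Beyer et al., 1999] is easy to observe empirically. The sketch below (illustrative only; the uniform data, sample sizes and dimensions are assumptions, not from the talk) measures how close the nearest-neighbor distance gets to the farthest-neighbor distance as the dimensionality grows:

```python
# As d grows, the min/max distance ratio from a random query point to a
# random point cloud approaches 1: nearest and farthest neighbors become
# nearly indistinguishable (the "curse of dimensionality").
import math
import random

def contrast(d, n=1000, seed=0):
    """Return (min, max) Euclidean distance from a random query
    point to n uniformly random points in [0, 1]^d."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(d)] for _ in range(n)]
    query = [rng.random() for _ in range(d)]
    dists = [math.dist(query, p) for p in points]
    return min(dists), max(dists)

for d in (2, 10, 100, 1000):
    dmin, dmax = contrast(d)
    print(f"d={d:4d}  min/max distance ratio = {dmin / dmax:.2f}")
```

Running it shows the ratio climbing toward 1 with increasing d, which is why purely distance-based clustering degrades in high-dimensional spaces.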
5 Solutions?
- Dimensionality reduction? (e.g. PCA)
  - Aims at discarding irrelevant dimensions
  - BUT: dimensions are often not "globally" irrelevant
Detecting Clusters in Moderate-to-High Dimensional Data, A. Zimek, Tutorial on Subspace Clustering at KDD
6 Gene Expression Data Analysis
- Columns: genes
- Rows: experiment conditions or samples
- Values: relative abundance of the mRNA of a gene under a specific condition
- Task: cluster the samples w.r.t. their similarity on gene expression values
- Samples may be clustered differently depending on the considered subsets of genes
7 Gene Expression Data Analysis
Add instance-level constraints on pairs of samples:
- Some are known to result from similar experiment conditions, and must belong to the same subspace cluster.
- Others result from different experiment conditions and cannot be linked by a subspace cluster.
9 Solution
Constraint-based subspace clustering! Techniques that automatically detect clusters in subspaces of the data while ensuring that the instance-level constraints are satisfied.
How can it be done efficiently?
- Naïve solution: check whether each possible subspace of a d-dimensional dataset is a subspace cluster satisfying the instance-level constraints. Runtime complexity: O(2^d). Infeasible!
- Proposed approach: integrate the instance-level constraints into the subspace clustering mining process.
10 Outline of the talk
1 Subspace clustering
2 Constraint-based subspace clustering
3 Experimental results
4 Conclusion
11 Subspace Clustering Strategies
Top-down:
- Start with an initial approximation of the clusters in full feature space (e.g. k-medoids)
- Iteratively refine the current clustering by projecting clusters to a lower-dimensional space
- Problem: does not guarantee the best clustering
Bottom-up:
- First consider clusters in 1-dimensional spaces
- Iteratively join subspaces to form higher-dimensional ones
- Problem: complexity of the enumeration process. Try to prune the enumeration as much as possible! Use a clustering criterion that implements the downward closure property.
14 CLIQUE [Agrawal et al., 1998]
- Pioneering approach (several extensions already exist)
- Grid- and density-based approach
- Each dimension is partitioned into equal-sized intervals: 1-dimensional units
- A k-dimensional unit is the intersection of k units of different dimensions
- A k-dimensional unit is dense iff it contains at least σ objects
Anti-monotonic property: if a k-dimensional unit is dense, then all the (k-1)-dimensional units it contains are also dense.
- A subspace cluster is a maximal set of connected dense k-dimensional units
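CLIQUE's first step, finding the 1-dimensional dense units, can be sketched as follows. This is a simplified illustration (the equal-width binning with `xi` intervals per dimension and the `(dim, interval)` unit encoding are assumptions about the representation, not taken from the slides):

```python
# CLIQUE step 1 (sketch): partition each dimension into xi equal-sized
# intervals and keep the 1-dimensional units with at least sigma objects.
from collections import defaultdict

def dense_units_1d(data, xi, sigma):
    """Return {(dim, interval_index): set_of_object_ids} for the
    1-dimensional units that contain at least sigma objects."""
    d = len(data[0])
    units = defaultdict(set)
    for dim in range(d):
        lo = min(row[dim] for row in data)
        hi = max(row[dim] for row in data)
        width = (hi - lo) / xi or 1.0      # guard against a constant dimension
        for oid, row in enumerate(data):
            # clamp the maximum value into the last interval
            idx = min(int((row[dim] - lo) / width), xi - 1)
            units[(dim, idx)].add(oid)
    return {u: objs for u, objs in units.items() if len(objs) >= sigma}
```

With `xi = 2` and `sigma = 3`, a 1-dimensional dataset like `[[0.1], [0.2], [0.3], [0.9]]` yields a single dense unit covering the three low values, mirroring the grid construction described on the slide.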
15 CLIQUE: an example
- Number of units = 2 per dimension; dense units: units with at least 4 objects (σ = 4)
[Figure: raw dataset with objects o1..o10 over dimensions d1, d2, and the corresponding grid of units u11, u12 (on d1) and u21 (on d2)]
- 1-dimensional dense unit: ({o1, o2, o3, o4}, {u11})
- 1-dimensional dense unit: ({o5, o6, o7, o8, o9, o10}, {u12})
- 1-dimensional dense unit: ({o1, o2, o3, o5, o6, o7, o8}, {u21})
- 2-dimensional dense unit: ({o5, o6, o7, o8}, {u12, u21})
16 CLIQUE: mining k-dimensional units
- Find 1-dimensional dense units
- At iteration k > 1: generate k-dimensional dense units
  - Merge pairs of (k-1)-dimensional dense units differing in only one dimension
  - Prune k-dimensional units having a (k-1)-dimensional projection that is not dense
- Output: subspace clusters (O, D), where O is a set of objects and D a k-dimensional unit
- Post-processing: connected k-dimensional units are merged to generate maximal subspace clusters
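One merge-and-prune iteration of this bottom-up generation can be sketched as below. It is a minimal illustration under the same assumed representation as before (units as `(dim, interval)` pairs, a subspace as a frozenset of such pairs); it is not the authors' implementation:

```python
# One CLIQUE iteration (sketch): merge (k-1)-dimensional dense units that
# differ in exactly one dimension, intersect their object sets, and prune
# candidates with any non-dense (k-1)-dimensional projection.
from itertools import combinations

def join_and_prune(prev_dense, sigma):
    """prev_dense: {frozenset_of_(dim, interval): set_of_objects}.
    Returns the k-dimensional dense units built from it."""
    candidates = {}
    for u1, u2 in combinations(prev_dense, 2):
        merged = u1 | u2
        dims = {dim for dim, _ in merged}
        # differ in one dimension, and at most one interval per dimension
        if len(merged) == len(u1) + 1 and len(dims) == len(merged):
            objs = prev_dense[u1] & prev_dense[u2]
            if len(objs) >= sigma:               # density check
                candidates[frozenset(merged)] = objs
    # anti-monotonic pruning: every (k-1)-projection must itself be dense
    return {c: objs for c, objs in candidates.items()
            if all(frozenset(c - {e}) in prev_dense for e in c)}
```

For example, joining a dense unit on dimension 0 holding objects {0, 1, 2, 3} with one on dimension 1 holding {1, 2, 3, 4} yields the 2-dimensional unit with objects {1, 2, 3}, dense for σ = 3.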
17 Outline of the talk
1 Subspace clustering
2 Constraint-based subspace clustering
3 Experimental results
4 Conclusion
18 Motivation and Goal
Subspace clustering relies on the monotonicity of constraints to improve efficiency. We propose to:
- Integrate background knowledge into the subspace clustering process in the form of instance-level (IL) constraints: must-link and cannot-link
- Investigate whether these new constraints can make the process not only more accurate but also more efficient
19 Definitions of Instance-Level Constraints
Cannot-link constraint CL(o_i, o_j): a cannot-link constraint between two objects o_i and o_j is satisfied by a subspace cluster (O, D) iff {o_i, o_j} ⊄ O.
Must-link constraint ML(o_i, o_j): a must-link constraint between two objects o_i and o_j is satisfied by a subspace cluster (O, D) iff {o_i, o_j} ⊆ O or {o_i, o_j} ∩ O = ∅.
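These two definitions translate directly into membership predicates. A minimal sketch (function names are mine, not from the talk), with O taken as the object set of a subspace cluster:

```python
# Instance-level constraint satisfaction for a cluster's object set O.

def satisfies_cl(O, oi, oj):
    """Cannot-link CL(oi, oj): the cluster may not contain both objects."""
    return not (oi in O and oj in O)

def satisfies_ml(O, oi, oj):
    """Must-link ML(oi, oj): the cluster contains both objects or neither."""
    return (oi in O) == (oj in O)
```

Note the symmetry: must-link is violated exactly when the cluster separates the pair, cannot-link exactly when it joins it.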
20 Monotonicity Properties of Instance-Level Constraints
Cannot-link is anti-monotonic:
∀ P ⊆ O : {o_i, o_j} ⊄ O ⇒ {o_i, o_j} ⊄ P
Must-link is a disjunction of a monotonic and an anti-monotonic constraint:
∀ P ⊆ O : {o_i, o_j} ⊆ P ⇒ {o_i, o_j} ⊆ O (monotonic part)
∀ P ⊆ O : {o_i, o_j} ∩ O = ∅ ⇒ {o_i, o_j} ∩ P = ∅ (anti-monotonic part)
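The practical payoff of anti-monotonicity is pruning: once a growing candidate object set violates a cannot-link constraint, every superset violates it too, so the whole enumeration branch can be cut. A small illustration (mine, not from the slides):

```python
# Anti-monotonicity of cannot-link, seen from the pruning side: a
# violation by O (both objects present) persists in every superset of O,
# so a depth-first search that only adds objects can stop at O.

def violates_cl(O, oi, oj):
    """True iff O contains both cannot-linked objects."""
    return oi in O and oj in O

O = {1, 2}
print(violates_cl(O, 1, 2))                              # violated by O itself
print(all(violates_cl(O | {x}, 1, 2) for x in (3, 4)))   # and by its supersets
```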
21 SC-MINER: Main Characteristics
We need an algorithm that can handle both monotonic and anti-monotonic constraints.
SC-MINER (Subspace Clustering Miner):
- Considers that the dimensions are divided into units beforehand
- Enumerates the candidate subspace clusters in a depth-first way
- Can handle monotonic and anti-monotonic constraints
- Mines closed subspace clusters directly
22 Candidate Generation
A candidate ⟨X, Y⟩ consists of two couples of sets:
- X = (O, D): the set of objects O and the set of units D contained in all the subspace clusters under construction
- Y = (O', D'): the set of objects O' and the set of units D' that still need to be enumerated
At each iteration, SC-MINER picks an element z from Y (from O' or D') and makes two recursive calls:
- once for the candidate ⟨X ∪ {z}, Y \ {z}⟩
- once for the candidate ⟨X, Y \ {z}⟩
Recursion stops when a candidate and all its descendants can be pruned, or when Y = ∅; in this case, we have found a valid subspace cluster X = (O, D).
For the first call, the candidate is ⟨(∅, ∅), (O, D)⟩.
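The enumeration scheme above can be sketched in a few lines. This is a deliberately stripped-down version (my simplification, not the authors' code): it omits the propagation, density and closedness checks, so it simply enumerates every include/exclude combination of the elements of Y:

```python
# Depth-first candidate generation in the spirit of SC-MINER.
# X = (O, D): elements already included; Y = (O', D'): still to enumerate.

def enumerate_candidates(X, Y, report):
    O, D = X
    Oy, Dy =Y
    if not Oy and not Dy:                 # Y = ∅: candidate is complete
        report((frozenset(O), frozenset(D)))
        return
    # pick an element z of Y (objects first, then units)
    if Oy:
        z = min(Oy)
        include = ((O | {z}, D), (Oy - {z}, Dy))
        exclude = ((O, D), (Oy - {z}, Dy))
    else:
        z = min(Dy)
        include = ((O, D | {z}), (Oy, Dy - {z}))
        exclude = ((O, D), (Oy, Dy - {z}))
    enumerate_candidates(*include, report)    # first call:  ⟨X ∪ {z}, Y \ {z}⟩
    enumerate_candidates(*exclude, report)    # second call: ⟨X, Y \ {z}⟩

found = []
enumerate_candidates((set(), set()), ({1, 2}, {"d2"}), found.append)
```

Without pruning, two objects and one unit produce 2^3 = 8 candidates, which is exactly the O(2^d)-style blow-up that the propagation and pruning steps of the next slide are there to cut down.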
23 Subspace Cluster Constraint Evaluation
Subspace clusters (O, D) are made of objects and units that are in relation:
- Each object in O must belong to all units of D
- Each unit in D must contain all objects of O
Instead of enumerating candidates and checking whether they satisfy this property, SC-MINER maintains it dynamically (propagation of constraints): when an element z is moved from Y to X (first recursive call), all elements of Y not in relation with z are removed.
Evaluation of the density constraint: if |O ∪ O'| < σ, the recursion is stopped, since none of the descendants of the current subspace cluster candidate can be dense.
24 Candidate Generation: Example (σ = 3)
[Figure: toy dataset with objects o1..o4 over unit-dimensions d1, d2, d3, and the enumeration tree below]
- Root: ⟨(∅, ∅), ({o1, o2, o3, o4}, {d1, d2, d3})⟩
- Enumerating d1: ⟨(∅, {d1}), ({o2, o3}, {d2, d3})⟩ (pruned: |O ∪ O'| < σ) and ⟨(∅, ∅), ({o1, o2, o3, o4}, {d2, d3})⟩
- Enumerating d3: ⟨(∅, {d3}), ({o1, o3}, {d2})⟩ (pruned: |O ∪ O'| < σ) and ⟨(∅, ∅), ({o1, o2, o3, o4}, {d2})⟩
- Enumerating d2: ⟨(∅, {d2}), ({o2, o3, o4}, ∅)⟩ and ⟨(∅, ∅), ({o1, o2, o3, o4}, ∅)⟩
- Found subspace cluster: ⟨({o2, o3, o4}, {d2}), (∅, ∅)⟩
25 Propagation of Instance-Level Constraints
Cannot-link constraint CL(o_i, o_j) or CL(o_j, o_i):
- When the candidate ⟨X ∪ {o_i}, Y \ {o_i}⟩ is generated, o_j is removed from Y.
Must-link constraint ML(o_i, o_j) or ML(o_j, o_i):
- When the candidate ⟨X ∪ {o_i}, Y \ {o_i}⟩ is generated, o_j is moved from Y into X and the elements of Y not in relation with o_j are removed.
- When the candidate ⟨X, Y \ {o_i}⟩ is generated, o_j is also removed from Y.
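The object-side of this propagation can be sketched as a single step applied when an object z is moved into X. This is a hypothetical helper of my own (the slides do not give code): `ml` and `cl` map an object to the set of objects it is must-/cannot-linked with, and the removal of elements "not in relation" with the pulled-in partners is omitted for brevity:

```python
# One propagation step when object z moves from Y into X (sketch).

def propagate(z, X, Y, ml, cl):
    """Return updated (X, Y) after including z, or None if the branch
    is inconsistent. Only the object components are updated here."""
    O, D = X
    Oy, Dy = Y
    O = O | {z}
    Oy = Oy - {z}
    # cannot-link: partners of z can never join this cluster
    Oy = Oy - cl.get(z, set())
    if O & cl.get(z, set()):              # a partner is already in: prune
        return None
    # must-link: partners of z must join the cluster now
    for p in ml.get(z, set()):
        if p in Oy:
            O = O | {p}
            Oy = Oy - {p}
    return (O, D), (Oy, Dy)
```

For instance, with ML(1, 2) and CL(1, 3), including object 1 pulls object 2 into X and drops object 3 from Y in the same step.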
26 Closedness Constraint
- Used to avoid redundant clusters
- It is neither monotonic nor anti-monotonic
- We check whether any element z that was previously enumerated (but excluded from X) is in relation with all the elements of (O ∪ O') or (D ∪ D')
- If so, the current candidate is not closed and can be safely pruned
- This can be checked efficiently by keeping track of all previously enumerated elements during the recursion
27 Outline of the talk
1 Subspace clustering
2 Constraint-based subspace clustering
3 Experimental results
4 Conclusion
28 Subspace Clustering examples
[Figure: synthetic data (left) and its subspace clustering (right)]
29 Experimental Results
Datasets:
- Four benchmark datasets with numerical attributes
- Real high-dimensional gene expression data: Plasmodium [Bozdech, 2003]
Constraints:
- IL constraints generated randomly from examples according to the class attribute (cf. [Struyf et al., 2007])
- Results averaged over 60 different generations of constraints
30 Efficiency
Number of candidate subspace clusters of SC-MINER, for different numbers of IL constraints: the number of candidates decreases in inverse proportion to the number of IL constraints!
31 Accuracy Evaluation
- Coverage: percentage of objects present in at least one of the subspace clusters
- Quality [Assent et al., 2007]: purity of the final clustering w.r.t. the class values
The quality increases! However, the coverage decreases. Why? With too many constraints to satisfy, fewer objects can be covered; the robustness of the results still needs to be validated!
32 Gene Expression Data
- Each column: expression profile of a given gene of Plasmodium falciparum (a parasite), evaluated during its developmental cycle (DC). Total: 476 genes (476 dimensions)
- Each line corresponds to a specific hour of the developmental cycle of Plasmodium falciparum. Total: 48 hours (48 objects), divided into 3 different stages: Ring, Trophozoite or Schizont (class attribute)
33 Meaningful Clusters?
- Parameters: bins = 4, σ = 26%, 50 constraints, clusters containing at least 35 dimensions (genes)
- 26 subspace clusters were obtained, with 77.18% quality and 91.3% coverage
- We compared our results with the biological results in [Bozdech, 2003]
- We observed that the clusters were formed by genes whose corresponding functions are known to be active during the corresponding samples (objects):

Functional Group          | Ring   | Trophozoite | Schizont | Schizont+beginning
cytoplasmic translation   | 15.000 | 10.500      | 9.375    | 13.045
transcription machinery   |  4.143 |  3.500      | 1.875    |  2.331
proteasome                |  2.286 |  3.500      | 2.000    |  2.981
ribonucleotide synthesis  |  1.143 |  1.500      | 0.625    |  1.513
deoxynucleotide synthesis |  0.000 |  0.000      | 1.250    |  0.000
DNA replication           |  2.143 |  2.000      | 5.000    |  4.558
plastid genome            |  1.286 |  1.000      | 1.750    |  0.481
34 Outline of the talk
1 Subspace clustering
2 Constraint-based subspace clustering
3 Experimental results
4 Conclusion
35 Conclusion and Future Work
Conclusion:
- We proposed to extend the common framework of bottom-up subspace clustering to also consider IL constraints
- IL constraints can increase not only the efficiency of the techniques but also the quality of the resulting clustering
- The approach can be integrated into an inductive database framework
Future work:
- On clustering: integration of soft constraints (to take noisy data into account); integration in a real inductive database
- On constraint-based data mining: continue to investigate how constraints can help both users and data mining algorithms
CS570 Introduction to Data Mining Department of Mathematics and Computer Science Li Xiong Data Exploration and Data Preprocessing Data and Attributes Data exploration Data pre-processing Data cleaning
More informationUn nouvel algorithme de génération des itemsets fermés fréquents
Un nouvel algorithme de génération des itemsets fermés fréquents Huaiguo Fu CRIL-CNRS FRE2499, Université d Artois - IUT de Lens Rue de l université SP 16, 62307 Lens cedex. France. E-mail: fu@cril.univ-artois.fr
More informationSubspace Clustering and Visualization of Data Streams
Ibrahim Louhi 1,2, Lydia Boudjeloud-Assala 1 and Thomas Tamisier 2 1 Laboratoire d Informatique Théorique et Appliquée, LITA-EA 3097, Université de Lorraine, Ile du Saucly, Metz, France 2 e-science Unit,
More informationRemoving trivial associations in association rule discovery
Removing trivial associations in association rule discovery Geoffrey I. Webb and Songmao Zhang School of Computing and Mathematics, Deakin University Geelong, Victoria 3217, Australia Abstract Association
More informationOutlier Detection in High-Dimensional Data
Tutorial Arthur Zimek 1,2, Erich Schubert 2, Hans-Peter Kriegel 2 1 University of Alberta Edmonton, AB, Canada 2 Ludwig-Maximilians-Universität München Munich, Germany PAKDD 2013, Gold Coast, Australia
More informationLocal search algorithms. Chapter 4, Sections 3 4 1
Local search algorithms Chapter 4, Sections 3 4 Chapter 4, Sections 3 4 1 Outline Hill-climbing Simulated annealing Genetic algorithms (briefly) Local search in continuous spaces (very briefly) Chapter
More informationNotes on Machine Learning for and
Notes on Machine Learning for 16.410 and 16.413 (Notes adapted from Tom Mitchell and Andrew Moore.) Learning = improving with experience Improve over task T (e.g, Classification, control tasks) with respect
More informationEditorial Manager(tm) for Data Mining and Knowledge Discovery Manuscript Draft
Editorial Manager(tm) for Data Mining and Knowledge Discovery Manuscript Draft Manuscript Number: Title: Summarizing transactional databases with overlapped hyperrectangles, theories and algorithms Article
More informationGenome 541! Unit 4, lecture 2! Transcription factor binding using functional genomics
Genome 541 Unit 4, lecture 2 Transcription factor binding using functional genomics Slides vs chalk talk: I m not sure why you chose a chalk talk over ppt. I prefer the latter no issues with readability
More informationCS 584 Data Mining. Association Rule Mining 2
CS 584 Data Mining Association Rule Mining 2 Recall from last time: Frequent Itemset Generation Strategies Reduce the number of candidates (M) Complete search: M=2 d Use pruning techniques to reduce M
More informationScalable Algorithms for Distribution Search
Scalable Algorithms for Distribution Search Yasuko Matsubara (Kyoto University) Yasushi Sakurai (NTT Communication Science Labs) Masatoshi Yoshikawa (Kyoto University) 1 Introduction Main intuition and
More informationOn Improving the k-means Algorithm to Classify Unclassified Patterns
On Improving the k-means Algorithm to Classify Unclassified Patterns Mohamed M. Rizk 1, Safar Mohamed Safar Alghamdi 2 1 Mathematics & Statistics Department, Faculty of Science, Taif University, Taif,
More informationRanking Interesting Subspaces for Clustering High Dimensional Data
Ranking Interesting Subspaces for Clustering High Dimensional Data Karin Kailing, Hans-Peter Kriegel, Peer Kröger, and Stefanie Wanka Institute for Computer Science University of Munich Oettingenstr. 67,
More informationApplications of the Lopsided Lovász Local Lemma Regarding Hypergraphs
Regarding Hypergraphs Ph.D. Dissertation Defense April 15, 2013 Overview The Local Lemmata 2-Coloring Hypergraphs with the Original Local Lemma Counting Derangements with the Lopsided Local Lemma Lopsided
More informationGenerating p-extremal graphs
Generating p-extremal graphs Derrick Stolee Department of Mathematics Department of Computer Science University of Nebraska Lincoln s-dstolee1@math.unl.edu August 2, 2011 Abstract Let f(n, p be the maximum
More informationOn the Mining of Numerical Data with Formal Concept Analysis
On the Mining of Numerical Data with Formal Concept Analysis Thèse de doctorat en informatique Mehdi Kaytoue 22 April 2011 Amedeo Napoli Sébastien Duplessis Somewhere... in a temperate forest... N 2 /
More informationPreprocessing & dimensionality reduction
Introduction to Data Mining Preprocessing & dimensionality reduction CPSC/AMTH 445a/545a Guy Wolf guy.wolf@yale.edu Yale University Fall 2016 CPSC 445 (Guy Wolf) Dimensionality reduction Yale - Fall 2016
More informationSparse representation classification and positive L1 minimization
Sparse representation classification and positive L1 minimization Cencheng Shen Joint Work with Li Chen, Carey E. Priebe Applied Mathematics and Statistics Johns Hopkins University, August 5, 2014 Cencheng
More informationDecision Trees Entropy, Information Gain, Gain Ratio
Changelog: 14 Oct, 30 Oct Decision Trees Entropy, Information Gain, Gain Ratio Lecture 3: Part 2 Outline Entropy Information gain Gain ratio Marina Santini Acknowledgements Slides borrowed and adapted
More informationLecture 23 Branch-and-Bound Algorithm. November 3, 2009
Branch-and-Bound Algorithm November 3, 2009 Outline Lecture 23 Modeling aspect: Either-Or requirement Special ILPs: Totally unimodular matrices Branch-and-Bound Algorithm Underlying idea Terminology Formal
More informationOptimization of Submodular Functions Tutorial - lecture I
Optimization of Submodular Functions Tutorial - lecture I Jan Vondrák 1 1 IBM Almaden Research Center San Jose, CA Jan Vondrák (IBM Almaden) Submodular Optimization Tutorial 1 / 1 Lecture I: outline 1
More informationDistributed Mining of Frequent Closed Itemsets: Some Preliminary Results
Distributed Mining of Frequent Closed Itemsets: Some Preliminary Results Claudio Lucchese Ca Foscari University of Venice clucches@dsi.unive.it Raffaele Perego ISTI-CNR of Pisa perego@isti.cnr.it Salvatore
More information17 Non-collinear alignment Motivation A B C A B C A B C A B C D A C. This exposition is based on:
17 Non-collinear alignment This exposition is based on: 1. Darling, A.E., Mau, B., Perna, N.T. (2010) progressivemauve: multiple genome alignment with gene gain, loss and rearrangement. PLoS One 5(6):e11147.
More informationStructural Learning and Integrative Decomposition of Multi-View Data
Structural Learning and Integrative Decomposition of Multi-View Data, Department of Statistics, Texas A&M University JSM 2018, Vancouver, Canada July 31st, 2018 Dr. Gen Li, Columbia University, Mailman
More informationData Mining: Concepts and Techniques. (3 rd ed.) Chapter 6
Data Mining: Concepts and Techniques (3 rd ed.) Chapter 6 Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University 2013 Han, Kamber & Pei. All rights
More informationThe Lovász Local Lemma: constructive aspects, stronger variants and the hard core model
The Lovász Local Lemma: constructive aspects, stronger variants and the hard core model Jan Vondrák 1 1 Dept. of Mathematics Stanford University joint work with Nick Harvey (UBC) The Lovász Local Lemma
More informationOrbitopes. Marc Pfetsch. joint work with Volker Kaibel. Zuse Institute Berlin
Orbitopes Marc Pfetsch joint work with Volker Kaibel Zuse Institute Berlin What this talk is about We introduce orbitopes. A polyhedral way to break symmetries in integer programs. Introduction 2 Orbitopes
More informationFinding Non-Redundant, Statistically Signicant Regions in High Dimensional Data: a Novel Approach to Projected and Subspace Clustering
Finding Non-Redundant, Statistically Signicant Regions in High Dimensional Data: a Novel Approach to Projected and Subspace Clustering ABSTRACT Gabriela Moise Dept. of Computing Science University of Alberta
More informationDifferential Modeling for Cancer Microarray Data
Differential Modeling for Cancer Microarray Data Omar Odibat Department of Computer Science Feb, 01, 2011 1 Outline Introduction Cancer Microarray data Problem Definition Differential analysis Existing
More informationEncyclopedia of Machine Learning Chapter Number Book CopyRight - Year 2010 Frequent Pattern. Given Name Hannu Family Name Toivonen
Book Title Encyclopedia of Machine Learning Chapter Number 00403 Book CopyRight - Year 2010 Title Frequent Pattern Author Particle Given Name Hannu Family Name Toivonen Suffix Email hannu.toivonen@cs.helsinki.fi
More informationLearning Decision Trees
Learning Decision Trees CS194-10 Fall 2011 Lecture 8 CS194-10 Fall 2011 Lecture 8 1 Outline Decision tree models Tree construction Tree pruning Continuous input features CS194-10 Fall 2011 Lecture 8 2
More informationSummarizing Transactional Databases with Overlapped Hyperrectangles
Noname manuscript No. (will be inserted by the editor) Summarizing Transactional Databases with Overlapped Hyperrectangles Yang Xiang Ruoming Jin David Fuhry Feodor F. Dragan Abstract Transactional data
More informationCARE: Finding Local Linear Correlations in High Dimensional Data
CARE: Finding Local Linear Correlations in High Dimensional Data Xiang Zhang, Feng Pan, and Wei Wang Department of Computer Science University of North Carolina at Chapel Hill Chapel Hill, NC 27599, USA
More informationData Exploration and Unsupervised Learning with Clustering
Data Exploration and Unsupervised Learning with Clustering Paul F Rodriguez,PhD San Diego Supercomputer Center Predictive Analytic Center of Excellence Clustering Idea Given a set of data can we find a
More informationDirected and Undirected Graphical Models
Directed and Undirected Davide Bacciu Dipartimento di Informatica Università di Pisa bacciu@di.unipi.it Machine Learning: Neural Networks and Advanced Models (AA2) Last Lecture Refresher Lecture Plan Directed
More information