Constraint-based Subspace Clustering
1 Constraint-based Subspace Clustering
Elisa Fromont¹, Adriana Prado² and Céline Robardet¹
¹Université de Lyon, France; ²Universiteit Antwerpen, Belgium
Thursday, April 30
2 Traditional Clustering
- Partitions the data into groups (clusters) of similar objects
- Similarity: based on distances or density
- Traditional methods use all features (dimensions) to identify clusters in the data
Elisa Fromont, Adriana Prado and Céline Robardet, Constraint-based Subspace Clustering, 2 / 32
3 Clustering examples
[Figure: synthetic data (left) and its K-means clustering (right)]
4 Problems
When dealing with high-dimensional data:
- "Curse of dimensionality" [Beyer et al., 1999]:
  - Distance-based: the distance to the nearest neighbor is nearly equal to the distance to the farthest neighbor
  - Density-based: it is difficult to determine dense regions in high-dimensional data
- Data may have many irrelevant dimensions
Subspace Clustering for High Dimensional Data: A Review, Parsons et al., SIGKDD Explorations, 2004
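The distance-concentration effect of [Beyer et al., 1999] is easy to observe empirically. The sketch below (illustrative only; the uniform data, sample sizes and dimensions are assumptions, not from the talk) measures how close the nearest-neighbor distance gets to the farthest-neighbor distance as the dimensionality grows:

```python
# As d grows, the min/max distance ratio from a random query point to a
# random point cloud approaches 1: nearest and farthest neighbors become
# nearly indistinguishable (the "curse of dimensionality").
import math
import random

def contrast(d, n=1000, seed=0):
    """Return (min, max) Euclidean distance from a random query
    point to n uniformly random points in [0, 1]^d."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(d)] for _ in range(n)]
    query = [rng.random() for _ in range(d)]
    dists = [math.dist(query, p) for p in points]
    return min(dists), max(dists)

for d in (2, 10, 100, 1000):
    dmin, dmax = contrast(d)
    print(f"d={d:4d}  min/max distance ratio = {dmin / dmax:.2f}")
```

Running it shows the ratio climbing toward 1 with increasing d, which is why purely distance-based clustering degrades in high-dimensional spaces.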
5 Solutions?
- Dimensionality reduction? (e.g. PCA)
  - Aims at discarding irrelevant dimensions
  - BUT: dimensions are often not "globally" irrelevant
Detecting Clusters in Moderate-to-High Dimensional Data, A. Zimek, Tutorial on Subspace Clustering at KDD
6 Gene Expression Data Analysis
- Columns: genes
- Rows: experiment conditions or samples
- Values: relative abundance of the mRNA of a gene under a specific condition
- Task: cluster the samples w.r.t. their similarity on gene expression values
- Samples may be clustered differently depending on the considered subsets of genes
7 Gene Expression Data Analysis
Add instance-level constraints on pairs of samples:
- Some are known to result from similar experiment conditions, and must belong to the same subspace cluster.
- Others result from different experiment conditions and cannot be linked by a subspace cluster.
9 Solution
Constraint-based subspace clustering! Techniques that automatically detect clusters in subspaces of the data while ensuring that the instance-level constraints are satisfied.
How can it be done efficiently?
- Naïve solution: check whether each possible subspace of a d-dimensional dataset is a subspace cluster satisfying the instance-level constraints. Runtime complexity: O(2^d). Infeasible!
- Proposed approach: integrate the instance-level constraints into the subspace clustering mining process.
10 Outline of the talk
1 Subspace clustering
2 Constraint-based subspace clustering
3 Experimental results
4 Conclusion
11 Subspace Clustering Strategies
Top-down:
- Start with an initial approximation of the clusters in full feature space (e.g. k-medoids)
- Iteratively refine the current clustering by projecting clusters to a lower-dimensional space
- Problem: does not guarantee the best clustering
Bottom-up:
- First consider clusters in 1-dimensional spaces
- Iteratively join subspaces to form higher-dimensional ones
- Problem: complexity of the enumeration process. Try to prune the enumeration as much as possible! Use a clustering criterion that implements the downward closure property.
14 CLIQUE [Agrawal et al., 1998]
- Pioneering approach (several extensions already exist)
- Grid- and density-based approach
- Each dimension is partitioned into equal-sized intervals: 1-dimensional units
- A k-dimensional unit is the intersection of k units of different dimensions
- A k-dimensional unit is dense iff it contains at least σ objects
Anti-monotonic property: if a k-dimensional unit is dense, then all the (k-1)-dimensional units it contains are also dense.
- A subspace cluster is a maximal set of connected dense k-dimensional units
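CLIQUE's first step, finding the 1-dimensional dense units, can be sketched as follows. This is a simplified illustration (the equal-width binning with `xi` intervals per dimension and the `(dim, interval)` unit encoding are assumptions about the representation, not taken from the slides):

```python
# CLIQUE step 1 (sketch): partition each dimension into xi equal-sized
# intervals and keep the 1-dimensional units with at least sigma objects.
from collections import defaultdict

def dense_units_1d(data, xi, sigma):
    """Return {(dim, interval_index): set_of_object_ids} for the
    1-dimensional units that contain at least sigma objects."""
    d = len(data[0])
    units = defaultdict(set)
    for dim in range(d):
        lo = min(row[dim] for row in data)
        hi = max(row[dim] for row in data)
        width = (hi - lo) / xi or 1.0      # guard against a constant dimension
        for oid, row in enumerate(data):
            # clamp the maximum value into the last interval
            idx = min(int((row[dim] - lo) / width), xi - 1)
            units[(dim, idx)].add(oid)
    return {u: objs for u, objs in units.items() if len(objs) >= sigma}
```

With `xi = 2` and `sigma = 3`, a 1-dimensional dataset like `[[0.1], [0.2], [0.3], [0.9]]` yields a single dense unit covering the three low values, mirroring the grid construction described on the slide.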
15 CLIQUE: an example
- Number of units = 2 per dimension; dense units: units with at least 4 objects (σ = 4)
[Figure: raw dataset with objects o1..o10 over dimensions d1, d2, and the corresponding grid of units u11, u12 (on d1) and u21 (on d2)]
- 1-dimensional dense unit: ({o1, o2, o3, o4}, {u11})
- 1-dimensional dense unit: ({o5, o6, o7, o8, o9, o10}, {u12})
- 1-dimensional dense unit: ({o1, o2, o3, o5, o6, o7, o8}, {u21})
- 2-dimensional dense unit: ({o5, o6, o7, o8}, {u12, u21})
16 CLIQUE: mining k-dimensional units
- Find 1-dimensional dense units
- At iteration k > 1: generate k-dimensional dense units
  - Merge pairs of (k-1)-dimensional dense units differing in only one dimension
  - Prune k-dimensional units having a (k-1)-dimensional projection that is not dense
- Output: subspace clusters (O, D), where O is a set of objects and D a k-dimensional unit
- Post-processing: connected k-dimensional units are merged to generate maximal subspace clusters
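One merge-and-prune iteration of this bottom-up generation can be sketched as below. It is a minimal illustration under the same assumed representation as before (units as `(dim, interval)` pairs, a subspace as a frozenset of such pairs); it is not the authors' implementation:

```python
# One CLIQUE iteration (sketch): merge (k-1)-dimensional dense units that
# differ in exactly one dimension, intersect their object sets, and prune
# candidates with any non-dense (k-1)-dimensional projection.
from itertools import combinations

def join_and_prune(prev_dense, sigma):
    """prev_dense: {frozenset_of_(dim, interval): set_of_objects}.
    Returns the k-dimensional dense units built from it."""
    candidates = {}
    for u1, u2 in combinations(prev_dense, 2):
        merged = u1 | u2
        dims = {dim for dim, _ in merged}
        # differ in one dimension, and at most one interval per dimension
        if len(merged) == len(u1) + 1 and len(dims) == len(merged):
            objs = prev_dense[u1] & prev_dense[u2]
            if len(objs) >= sigma:               # density check
                candidates[frozenset(merged)] = objs
    # anti-monotonic pruning: every (k-1)-projection must itself be dense
    return {c: objs for c, objs in candidates.items()
            if all(frozenset(c - {e}) in prev_dense for e in c)}
```

For example, joining a dense unit on dimension 0 holding objects {0, 1, 2, 3} with one on dimension 1 holding {1, 2, 3, 4} yields the 2-dimensional unit with objects {1, 2, 3}, dense for σ = 3.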
17 Outline of the talk
1 Subspace clustering
2 Constraint-based subspace clustering
3 Experimental results
4 Conclusion
18 Motivation and Goal
Subspace clustering relies on the monotonicity of constraints to improve efficiency. We propose to:
- Integrate background knowledge into the subspace clustering process in the form of instance-level (IL) constraints: must-link and cannot-link
- Investigate whether these new constraints can make the process not only more accurate but also more efficient
19 Definitions of Instance-Level Constraints
Cannot-link constraint CL(o_i, o_j): a cannot-link constraint between two objects o_i and o_j is satisfied by a subspace cluster (O, D) iff {o_i, o_j} ⊄ O.
Must-link constraint ML(o_i, o_j): a must-link constraint between two objects o_i and o_j is satisfied by a subspace cluster (O, D) iff {o_i, o_j} ⊆ O or {o_i, o_j} ∩ O = ∅.
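These two definitions translate directly into membership predicates. A minimal sketch (function names are mine, not from the talk), with O taken as the object set of a subspace cluster:

```python
# Instance-level constraint satisfaction for a cluster's object set O.

def satisfies_cl(O, oi, oj):
    """Cannot-link CL(oi, oj): the cluster may not contain both objects."""
    return not (oi in O and oj in O)

def satisfies_ml(O, oi, oj):
    """Must-link ML(oi, oj): the cluster contains both objects or neither."""
    return (oi in O) == (oj in O)
```

Note the symmetry: must-link is violated exactly when the cluster separates the pair, cannot-link exactly when it joins it.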
20 Monotonicity Properties of Instance-Level Constraints
Cannot-link is anti-monotonic:
∀ P ⊆ O : {o_i, o_j} ⊄ O ⇒ {o_i, o_j} ⊄ P
Must-link is a disjunction of a monotonic and an anti-monotonic constraint:
∀ P ⊆ O : {o_i, o_j} ⊆ P ⇒ {o_i, o_j} ⊆ O (monotonic part)
∀ P ⊆ O : {o_i, o_j} ∩ O = ∅ ⇒ {o_i, o_j} ∩ P = ∅ (anti-monotonic part)
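The practical payoff of anti-monotonicity is pruning: once a growing candidate object set violates a cannot-link constraint, every superset violates it too, so the whole enumeration branch can be cut. A small illustration (mine, not from the slides):

```python
# Anti-monotonicity of cannot-link, seen from the pruning side: a
# violation by O (both objects present) persists in every superset of O,
# so a depth-first search that only adds objects can stop at O.

def violates_cl(O, oi, oj):
    """True iff O contains both cannot-linked objects."""
    return oi in O and oj in O

O = {1, 2}
print(violates_cl(O, 1, 2))                              # violated by O itself
print(all(violates_cl(O | {x}, 1, 2) for x in (3, 4)))   # and by its supersets
```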
21 SC-MINER: Main Characteristics
We need an algorithm that can handle both monotonic and anti-monotonic constraints.
SC-MINER (Subspace Clustering Miner):
- Considers that the dimensions are divided into units beforehand
- Enumerates the candidate subspace clusters in a depth-first way
- Can handle monotonic and anti-monotonic constraints
- Mines closed subspace clusters directly
22 Candidate Generation
A candidate ⟨X, Y⟩ consists of two couples of sets:
- X = (O, D): the set of objects O and the set of units D contained in all the subspace clusters under construction
- Y = (O', D'): the set of objects O' and the set of units D' that still need to be enumerated
At each iteration, SC-MINER picks an element z from Y (from O' or D') and makes two recursive calls:
- once for the candidate ⟨X ∪ {z}, Y \ {z}⟩
- once for the candidate ⟨X, Y \ {z}⟩
Recursion stops when a candidate and all its descendants can be pruned, or when Y = ∅; in this case, we have found a valid subspace cluster X = (O, D).
For the first call, the candidate is ⟨(∅, ∅), (O, D)⟩.
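The enumeration scheme above can be sketched in a few lines. This is a deliberately stripped-down version (my simplification, not the authors' code): it omits the propagation, density and closedness checks, so it simply enumerates every include/exclude combination of the elements of Y:

```python
# Depth-first candidate generation in the spirit of SC-MINER.
# X = (O, D): elements already included; Y = (O', D'): still to enumerate.

def enumerate_candidates(X, Y, report):
    O, D = X
    Oy, Dy =Y
    if not Oy and not Dy:                 # Y = ∅: candidate is complete
        report((frozenset(O), frozenset(D)))
        return
    # pick an element z of Y (objects first, then units)
    if Oy:
        z = min(Oy)
        include = ((O | {z}, D), (Oy - {z}, Dy))
        exclude = ((O, D), (Oy - {z}, Dy))
    else:
        z = min(Dy)
        include = ((O, D | {z}), (Oy, Dy - {z}))
        exclude = ((O, D), (Oy, Dy - {z}))
    enumerate_candidates(*include, report)    # first call:  ⟨X ∪ {z}, Y \ {z}⟩
    enumerate_candidates(*exclude, report)    # second call: ⟨X, Y \ {z}⟩

found = []
enumerate_candidates((set(), set()), ({1, 2}, {"d2"}), found.append)
```

Without pruning, two objects and one unit produce 2^3 = 8 candidates, which is exactly the O(2^d)-style blow-up that the propagation and pruning steps of the next slide are there to cut down.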
23 Subspace Cluster Constraint Evaluation
Subspace clusters (O, D) are made of objects and units that are in relation:
- Each object in O must belong to all units of D
- Each unit in D must contain all objects of O
Instead of enumerating candidates and checking whether they satisfy this property, SC-MINER maintains it dynamically (propagation of constraints): when an element z is moved from Y to X (first recursive call), all elements of Y not in relation with z are removed.
Evaluation of the density constraint: if |O ∪ O'| < σ, the recursion is stopped, since none of the descendants of the current subspace cluster candidate can be dense.
24 Candidate Generation: Example (σ = 3)
[Figure: toy dataset with objects o1..o4 over unit-dimensions d1, d2, d3, and the enumeration tree below]
- Root: ⟨(∅, ∅), ({o1, o2, o3, o4}, {d1, d2, d3})⟩
- Enumerating d1: ⟨(∅, {d1}), ({o2, o3}, {d2, d3})⟩ (pruned: |O ∪ O'| < σ) and ⟨(∅, ∅), ({o1, o2, o3, o4}, {d2, d3})⟩
- Enumerating d3: ⟨(∅, {d3}), ({o1, o3}, {d2})⟩ (pruned: |O ∪ O'| < σ) and ⟨(∅, ∅), ({o1, o2, o3, o4}, {d2})⟩
- Enumerating d2: ⟨(∅, {d2}), ({o2, o3, o4}, ∅)⟩ and ⟨(∅, ∅), ({o1, o2, o3, o4}, ∅)⟩
- Found subspace cluster: ⟨({o2, o3, o4}, {d2}), (∅, ∅)⟩
25 Propagation of Instance-Level Constraints
Cannot-link constraint CL(o_i, o_j) or CL(o_j, o_i):
- When the candidate ⟨X ∪ {o_i}, Y \ {o_i}⟩ is generated, o_j is removed from Y.
Must-link constraint ML(o_i, o_j) or ML(o_j, o_i):
- When the candidate ⟨X ∪ {o_i}, Y \ {o_i}⟩ is generated, o_j is moved from Y into X and the elements of Y not in relation with o_j are removed.
- When the candidate ⟨X, Y \ {o_i}⟩ is generated, o_j is also removed from Y.
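The object-side of this propagation can be sketched as a single step applied when an object z is moved into X. This is a hypothetical helper of my own (the slides do not give code): `ml` and `cl` map an object to the set of objects it is must-/cannot-linked with, and the removal of elements "not in relation" with the pulled-in partners is omitted for brevity:

```python
# One propagation step when object z moves from Y into X (sketch).

def propagate(z, X, Y, ml, cl):
    """Return updated (X, Y) after including z, or None if the branch
    is inconsistent. Only the object components are updated here."""
    O, D = X
    Oy, Dy = Y
    O = O | {z}
    Oy = Oy - {z}
    # cannot-link: partners of z can never join this cluster
    Oy = Oy - cl.get(z, set())
    if O & cl.get(z, set()):              # a partner is already in: prune
        return None
    # must-link: partners of z must join the cluster now
    for p in ml.get(z, set()):
        if p in Oy:
            O = O | {p}
            Oy = Oy - {p}
    return (O, D), (Oy, Dy)
```

For instance, with ML(1, 2) and CL(1, 3), including object 1 pulls object 2 into X and drops object 3 from Y in the same step.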
26 Closedness Constraint
- Used to avoid redundant clusters
- It is neither monotonic nor anti-monotonic
- We check whether any element z that was previously enumerated (but excluded from X) is in relation with all the elements of (O ∪ O') or (D ∪ D')
- If so, the current candidate is not closed and can be safely pruned
- This can be checked efficiently by keeping track of all previously enumerated elements during the recursion
27 Outline of the talk
1 Subspace clustering
2 Constraint-based subspace clustering
3 Experimental results
4 Conclusion
28 Subspace Clustering examples
[Figure: synthetic data (left) and its subspace clustering (right)]
29 Experimental Results
Datasets:
- Four benchmark datasets with numerical attributes
- Real high-dimensional gene expression data: Plasmodium [Bozdech, 2003]
Constraints:
- IL constraints generated randomly from examples according to the class attribute (cf. [Struyf et al., 2007])
- Results averaged over 60 different generations of constraints
30 Efficiency
Number of candidate subspace clusters of SC-MINER, for different numbers of IL constraints: the number of candidates decreases in inverse proportion to the number of IL constraints!
31 Accuracy Evaluation
- Coverage: percentage of objects present in at least one of the subspace clusters
- Quality [Assent et al., 2007]: purity of the final clustering w.r.t. the class values
The quality increases! However, the coverage decreases. Why? With too many constraints to satisfy, fewer objects can be covered; the robustness of the results still needs to be validated!
32 Gene Expression Data
- Each column: expression profile of a given gene of Plasmodium falciparum (a parasite), evaluated during its developmental cycle (DC). Total: 476 genes (476 dimensions)
- Each line corresponds to a specific hour of the developmental cycle of Plasmodium falciparum. Total: 48 hours (48 objects), divided into 3 different stages: Ring, Trophozoite or Schizont (class attribute)
33 Meaningful Clusters?
- Parameters: bins = 4, σ = 26%, 50 constraints, clusters containing at least 35 dimensions (genes)
- 26 subspace clusters were obtained, with 77.18% quality and 91.3% coverage
- We compared our results with the biological results in [Bozdech, 2003]
- We observed that the clusters were formed by genes whose corresponding functions are known to be active during the corresponding samples (objects):

Functional Group          | Ring   | Trophozoite | Schizont | Schizont+beginning
cytoplasmic translation   | 15.000 | 10.500      | 9.375    | 13.045
transcription machinery   |  4.143 |  3.500      | 1.875    |  2.331
proteasome                |  2.286 |  3.500      | 2.000    |  2.981
ribonucleotide synthesis  |  1.143 |  1.500      | 0.625    |  1.513
deoxynucleotide synthesis |  0.000 |  0.000      | 1.250    |  0.000
DNA replication           |  2.143 |  2.000      | 5.000    |  4.558
plastid genome            |  1.286 |  1.000      | 1.750    |  0.481
34 Outline of the talk
1 Subspace clustering
2 Constraint-based subspace clustering
3 Experimental results
4 Conclusion
35 Conclusion and Future Work
Conclusion:
- We proposed to extend the common framework of bottom-up subspace clustering to also consider IL constraints
- IL constraints can increase not only the efficiency of the techniques but also the quality of the resulting clustering
- The approach can be integrated into an inductive database framework
Future work:
- On clustering: integration of soft constraints (to take noisy data into account); integration in a real inductive database
- On constraint-based data mining: continue to investigate how constraints can help both users and data mining algorithms
CS570 Introduction to Data Mining Department of Mathematics and Computer Science Li Xiong Data Exploration and Data Preprocessing Data and Attributes Data exploration Data pre-processing Data cleaning
More informationUn nouvel algorithme de génération des itemsets fermés fréquents
Un nouvel algorithme de génération des itemsets fermés fréquents Huaiguo Fu CRIL-CNRS FRE2499, Université d Artois - IUT de Lens Rue de l université SP 16, 62307 Lens cedex. France. E-mail: fu@cril.univ-artois.fr
More informationSubspace Clustering and Visualization of Data Streams
Ibrahim Louhi 1,2, Lydia Boudjeloud-Assala 1 and Thomas Tamisier 2 1 Laboratoire d Informatique Théorique et Appliquée, LITA-EA 3097, Université de Lorraine, Ile du Saucly, Metz, France 2 e-science Unit,
More informationRemoving trivial associations in association rule discovery
Removing trivial associations in association rule discovery Geoffrey I. Webb and Songmao Zhang School of Computing and Mathematics, Deakin University Geelong, Victoria 3217, Australia Abstract Association
More informationOutlier Detection in High-Dimensional Data
Tutorial Arthur Zimek 1,2, Erich Schubert 2, Hans-Peter Kriegel 2 1 University of Alberta Edmonton, AB, Canada 2 Ludwig-Maximilians-Universität München Munich, Germany PAKDD 2013, Gold Coast, Australia
More informationLocal search algorithms. Chapter 4, Sections 3 4 1
Local search algorithms Chapter 4, Sections 3 4 Chapter 4, Sections 3 4 1 Outline Hill-climbing Simulated annealing Genetic algorithms (briefly) Local search in continuous spaces (very briefly) Chapter
More informationNotes on Machine Learning for and
Notes on Machine Learning for 16.410 and 16.413 (Notes adapted from Tom Mitchell and Andrew Moore.) Learning = improving with experience Improve over task T (e.g, Classification, control tasks) with respect
More informationEditorial Manager(tm) for Data Mining and Knowledge Discovery Manuscript Draft
Editorial Manager(tm) for Data Mining and Knowledge Discovery Manuscript Draft Manuscript Number: Title: Summarizing transactional databases with overlapped hyperrectangles, theories and algorithms Article
More informationGenome 541! Unit 4, lecture 2! Transcription factor binding using functional genomics
Genome 541 Unit 4, lecture 2 Transcription factor binding using functional genomics Slides vs chalk talk: I m not sure why you chose a chalk talk over ppt. I prefer the latter no issues with readability
More informationCS 584 Data Mining. Association Rule Mining 2
CS 584 Data Mining Association Rule Mining 2 Recall from last time: Frequent Itemset Generation Strategies Reduce the number of candidates (M) Complete search: M=2 d Use pruning techniques to reduce M
More informationScalable Algorithms for Distribution Search
Scalable Algorithms for Distribution Search Yasuko Matsubara (Kyoto University) Yasushi Sakurai (NTT Communication Science Labs) Masatoshi Yoshikawa (Kyoto University) 1 Introduction Main intuition and
More informationOn Improving the k-means Algorithm to Classify Unclassified Patterns
On Improving the k-means Algorithm to Classify Unclassified Patterns Mohamed M. Rizk 1, Safar Mohamed Safar Alghamdi 2 1 Mathematics & Statistics Department, Faculty of Science, Taif University, Taif,
More informationRanking Interesting Subspaces for Clustering High Dimensional Data
Ranking Interesting Subspaces for Clustering High Dimensional Data Karin Kailing, Hans-Peter Kriegel, Peer Kröger, and Stefanie Wanka Institute for Computer Science University of Munich Oettingenstr. 67,
More informationApplications of the Lopsided Lovász Local Lemma Regarding Hypergraphs
Regarding Hypergraphs Ph.D. Dissertation Defense April 15, 2013 Overview The Local Lemmata 2-Coloring Hypergraphs with the Original Local Lemma Counting Derangements with the Lopsided Local Lemma Lopsided
More informationGenerating p-extremal graphs
Generating p-extremal graphs Derrick Stolee Department of Mathematics Department of Computer Science University of Nebraska Lincoln s-dstolee1@math.unl.edu August 2, 2011 Abstract Let f(n, p be the maximum
More informationOn the Mining of Numerical Data with Formal Concept Analysis
On the Mining of Numerical Data with Formal Concept Analysis Thèse de doctorat en informatique Mehdi Kaytoue 22 April 2011 Amedeo Napoli Sébastien Duplessis Somewhere... in a temperate forest... N 2 /
More informationPreprocessing & dimensionality reduction
Introduction to Data Mining Preprocessing & dimensionality reduction CPSC/AMTH 445a/545a Guy Wolf guy.wolf@yale.edu Yale University Fall 2016 CPSC 445 (Guy Wolf) Dimensionality reduction Yale - Fall 2016
More informationSparse representation classification and positive L1 minimization
Sparse representation classification and positive L1 minimization Cencheng Shen Joint Work with Li Chen, Carey E. Priebe Applied Mathematics and Statistics Johns Hopkins University, August 5, 2014 Cencheng
More informationDecision Trees Entropy, Information Gain, Gain Ratio
Changelog: 14 Oct, 30 Oct Decision Trees Entropy, Information Gain, Gain Ratio Lecture 3: Part 2 Outline Entropy Information gain Gain ratio Marina Santini Acknowledgements Slides borrowed and adapted
More informationLecture 23 Branch-and-Bound Algorithm. November 3, 2009
Branch-and-Bound Algorithm November 3, 2009 Outline Lecture 23 Modeling aspect: Either-Or requirement Special ILPs: Totally unimodular matrices Branch-and-Bound Algorithm Underlying idea Terminology Formal
More informationOptimization of Submodular Functions Tutorial - lecture I
Optimization of Submodular Functions Tutorial - lecture I Jan Vondrák 1 1 IBM Almaden Research Center San Jose, CA Jan Vondrák (IBM Almaden) Submodular Optimization Tutorial 1 / 1 Lecture I: outline 1
More informationDistributed Mining of Frequent Closed Itemsets: Some Preliminary Results
Distributed Mining of Frequent Closed Itemsets: Some Preliminary Results Claudio Lucchese Ca Foscari University of Venice clucches@dsi.unive.it Raffaele Perego ISTI-CNR of Pisa perego@isti.cnr.it Salvatore
More information17 Non-collinear alignment Motivation A B C A B C A B C A B C D A C. This exposition is based on:
17 Non-collinear alignment This exposition is based on: 1. Darling, A.E., Mau, B., Perna, N.T. (2010) progressivemauve: multiple genome alignment with gene gain, loss and rearrangement. PLoS One 5(6):e11147.
More informationStructural Learning and Integrative Decomposition of Multi-View Data
Structural Learning and Integrative Decomposition of Multi-View Data, Department of Statistics, Texas A&M University JSM 2018, Vancouver, Canada July 31st, 2018 Dr. Gen Li, Columbia University, Mailman
More informationData Mining: Concepts and Techniques. (3 rd ed.) Chapter 6
Data Mining: Concepts and Techniques (3 rd ed.) Chapter 6 Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University 2013 Han, Kamber & Pei. All rights
More informationThe Lovász Local Lemma: constructive aspects, stronger variants and the hard core model
The Lovász Local Lemma: constructive aspects, stronger variants and the hard core model Jan Vondrák 1 1 Dept. of Mathematics Stanford University joint work with Nick Harvey (UBC) The Lovász Local Lemma
More informationOrbitopes. Marc Pfetsch. joint work with Volker Kaibel. Zuse Institute Berlin
Orbitopes Marc Pfetsch joint work with Volker Kaibel Zuse Institute Berlin What this talk is about We introduce orbitopes. A polyhedral way to break symmetries in integer programs. Introduction 2 Orbitopes
More informationFinding Non-Redundant, Statistically Signicant Regions in High Dimensional Data: a Novel Approach to Projected and Subspace Clustering
Finding Non-Redundant, Statistically Signicant Regions in High Dimensional Data: a Novel Approach to Projected and Subspace Clustering ABSTRACT Gabriela Moise Dept. of Computing Science University of Alberta
More informationDifferential Modeling for Cancer Microarray Data
Differential Modeling for Cancer Microarray Data Omar Odibat Department of Computer Science Feb, 01, 2011 1 Outline Introduction Cancer Microarray data Problem Definition Differential analysis Existing
More informationEncyclopedia of Machine Learning Chapter Number Book CopyRight - Year 2010 Frequent Pattern. Given Name Hannu Family Name Toivonen
Book Title Encyclopedia of Machine Learning Chapter Number 00403 Book CopyRight - Year 2010 Title Frequent Pattern Author Particle Given Name Hannu Family Name Toivonen Suffix Email hannu.toivonen@cs.helsinki.fi
More informationLearning Decision Trees
Learning Decision Trees CS194-10 Fall 2011 Lecture 8 CS194-10 Fall 2011 Lecture 8 1 Outline Decision tree models Tree construction Tree pruning Continuous input features CS194-10 Fall 2011 Lecture 8 2
More informationSummarizing Transactional Databases with Overlapped Hyperrectangles
Noname manuscript No. (will be inserted by the editor) Summarizing Transactional Databases with Overlapped Hyperrectangles Yang Xiang Ruoming Jin David Fuhry Feodor F. Dragan Abstract Transactional data
More informationCARE: Finding Local Linear Correlations in High Dimensional Data
CARE: Finding Local Linear Correlations in High Dimensional Data Xiang Zhang, Feng Pan, and Wei Wang Department of Computer Science University of North Carolina at Chapel Hill Chapel Hill, NC 27599, USA
More informationData Exploration and Unsupervised Learning with Clustering
Data Exploration and Unsupervised Learning with Clustering Paul F Rodriguez,PhD San Diego Supercomputer Center Predictive Analytic Center of Excellence Clustering Idea Given a set of data can we find a
More informationDirected and Undirected Graphical Models
Directed and Undirected Davide Bacciu Dipartimento di Informatica Università di Pisa bacciu@di.unipi.it Machine Learning: Neural Networks and Advanced Models (AA2) Last Lecture Refresher Lecture Plan Directed
More information