Density-Based Clustering

Density-Based Clustering
idea: clusters are dense regions in feature space F.
density: number of objects per volume; here the volume is the ε-neighborhood of an object o w.r.t. a distance measure dist(x,y).
dense region: the ε-neighborhood contains at least MinPts objects => o is called a core point.
Connected core points form clusters; objects that belong to no cluster are considered noise.

Density-Based Clustering: intuition
The parameters ε ∈ ℝ and MinPts ∈ ℕ specify the density threshold (figures: example with MinPts = 4).
The figures illustrate, step by step: core points, direct density-reachability, density-reachability, and density-connectivity.

Density-Based Clustering, formal [Ester, Kriegel, Sander & Xu 1996]:
An object p ∈ DB is a core object if |RQ(p,ε)| ≥ MinPts, where RQ(p,ε) = {o ∈ DB | dist(p,o) ≤ ε}.
An object p ∈ DB is directly density-reachable from q ∈ DB w.r.t. ε and MinPts if p ∈ RQ(q,ε) and q is a core object in DB.
An object p is density-reachable from an object q if there is a chain of directly density-reachable objects from q to p.
Two objects p and q are density-connected if both p and q are density-reachable from a third object o.
(figures: examples with MinPts = 4)
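These definitions map directly onto code. Below is a minimal Python sketch, assuming Euclidean distance and points given as coordinate tuples; the function names are illustrative, not from the slides.

  from math import dist   # Euclidean distance, Python >= 3.8

  def range_query(db, p, eps):
      # RQ(p, eps): all objects of db within distance eps of p (p itself included)
      return [o for o in db if dist(p, o) <= eps]

  def is_core(db, p, eps, min_pts):
      # p is a core object iff its eps-neighborhood contains at least min_pts objects
      return len(range_query(db, p, eps)) >= min_pts

  def directly_density_reachable(db, p, q, eps, min_pts):
      # p is directly density-reachable from q iff p lies in RQ(q, eps) and q is a core object
      return dist(p, q) <= eps and is_core(db, q, eps, min_pts)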

Density-Based Clustering, formal:
A density-based cluster C w.r.t. ε and MinPts is a non-empty subset of DB with the following properties:
Maximality: for all p, q ∈ DB: if p ∈ C and q is density-reachable from p, then q ∈ C.
Connectivity: for all p, q ∈ C: p and q are density-connected.

Density-Based Clustering, formal:
Clustering: a density-based clustering CL of DB w.r.t. ε and MinPts is the complete set of all density-based clusters w.r.t. ε and MinPts.
Noise: the set Noise_CL is defined as the subset of objects in DB that are not contained in any cluster.
Idea behind the DBSCAN algorithm: let C be a density-based cluster and let p ∈ C be a core object; then C = {o ∈ DB | o is density-reachable from p w.r.t. ε and MinPts}.

Density-Based Clustering: the DBSCAN algorithm

DBSCAN(dataset DB, Real ε, Integer MinPts)
  // initially all objects are unlabeled: o.ClId = UNLABELED for all o ∈ DB
  ClusterId := nextId(NOISE);
  for i from 1 to |DB| do
    Object := DB.get(i);
    if Object.ClId = UNLABELED then
      if ExpandCluster(DB, Object, ClusterId, ε, MinPts) then
        ClusterId := nextId(ClusterId);

Density-Based Clustering: ExpandCluster

ExpandCluster(DB, startObject, ClusterId, ε, MinPts): Boolean
  seeds := RQ(startObject, ε);
  if |seeds| < MinPts then            // startObject is not a core object
    startObject.ClId := NOISE;
    return false;
  // else: startObject is a core object
  forall o ∈ seeds do o.ClId := ClusterId;
  remove startObject from seeds;
  while seeds ≠ ∅ do
    select an object o from seeds;
    neighborhood := RQ(o, ε);
    if |neighborhood| ≥ MinPts then   // o is a core object
      for i from 1 to |neighborhood| do
        p := neighborhood.get(i);
        if p.ClId ∈ {UNLABELED, NOISE} then
          if p.ClId = UNLABELED then add p to seeds;
          p.ClId := ClusterId;
    remove o from seeds;
  return true;
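For reference, here is a compact Python version that follows the two procedures above fairly literally. It is a sketch rather than an optimized implementation; encoding UNLABELED as None and NOISE as -1, and using Euclidean distance, are assumptions of this sketch.

  from math import dist

  UNLABELED, NOISE = None, -1

  def region_query(db, i, eps):
      # indices of all objects within distance eps of object i (i itself included)
      return [j for j, o in enumerate(db) if dist(db[i], o) <= eps]

  def dbscan(db, eps, min_pts):
      # returns one label per object: a cluster id (0, 1, ...) or NOISE
      labels = [UNLABELED] * len(db)
      cluster_id = 0
      for i in range(len(db)):
          if labels[i] is not UNLABELED:
              continue
          seeds = region_query(db, i, eps)
          if len(seeds) < min_pts:              # i is not a core object
              labels[i] = NOISE                 # may still become a border point later
              continue
          for j in seeds:                       # i is a core object: start a new cluster
              labels[j] = cluster_id
          seeds = [j for j in seeds if j != i]
          while seeds:                          # expand the cluster
              o = seeds.pop()
              neighborhood = region_query(db, o, eps)
              if len(neighborhood) >= min_pts:  # o is a core object
                  for p in neighborhood:
                      if labels[p] in (UNLABELED, NOISE):
                          if labels[p] is UNLABELED:
                              seeds.append(p)
                          labels[p] = cluster_id
          cluster_id += 1
      return labels

For example, dbscan([(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (8.0, 8.0)], eps=0.5, min_pts=3) returns [0, 0, 0, -1]: one cluster and one noise point.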

Discussion of Density-Based Clustering
- the number of clusters is determined by the algorithm
- the parameters ε and MinPts are generally less problematic to choose
- time complexity is O(n²) for general data objects
- density-based methods only require a distance measure
- border points make the result of DBSCAN dependent on the processing order
- no cluster model and no parameter optimization
- new points are assigned with nearest-neighbor classification (see the sketch below)
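As a small illustration of the last point, a new object can be assigned to the cluster of its nearest non-noise neighbor. This is a sketch; the plain 1-NN rule without a distance cutoff is an assumption, the slide does not prescribe details.

  from math import dist

  def assign_new_point(new_point, db, labels, noise_label=-1):
      # assign new_point the cluster label of its nearest neighbor that is not noise
      candidates = [(dist(new_point, o), label)
                    for o, label in zip(db, labels) if label != noise_label]
      return min(candidates)[1] if candidates else noise_label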

Outlier Detection
Hawkins' definition [Hawkins 1980]: an outlier is an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism.
What does mechanism mean? Intuition from Bayesian statistics: outliers have a small likelihood of being generated by the assumed generative model.
Connection to clustering: a clustering describes the distribution of the data, and outliers describe errors/noise, e.g. points with a large distance to all cluster centers (partitioning clustering) or the noise in density-based clustering.

Example: distance-based outliers
Definition ((pct, dmin)-outlier) [Knorr, Ng 97]: an object p in a data set DB is called a (pct, dmin)-outlier if at least pct percent of the objects in DB have a larger distance to p than dmin. The choice of pct and dmin is left to the user.
Example (figure): p1 ∈ DB, pct = 0.95, dmin = 8. p1 is a (0.95, 8)-outlier => at least 95% of the objects in DB have a distance greater than 8 to p1.
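The definition can be checked directly, as in the following sketch; Euclidean distance and excluding p itself from the count are assumptions of this sketch.

  from math import dist

  def is_outlier(p, db, pct, dmin):
      # p is a (pct, dmin)-outlier if at least pct of the other objects are farther than dmin from p
      others = [o for o in db if o != p]
      far = sum(1 for o in others if dist(p, o) > dmin)
      return far >= pct * len(others)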

Frequent Pattern Mining
until now: patterns describe subsets of DB
idea: patterns might be parts of the object descriptions
here: patterns are identical parts of the object descriptions in large subsets of DB
example: group composition in an MMORPG:
  Grp1: priest, mage, druid, rogue, warrior
  Grp2: mage, mage, hunter, priest, warrior
  Grp3: priest, priest, paladin, druid, warrior
  Grp4: warlock, paladin, shaman, warrior, priest
  Grp5: priest, warrior, hunter, rogue, warrior
  new group: warrior, priest, ?, ?, ?

Directions in Frequent Pattern Mining
Frequent Itemset Mining (=> KDD I and in the following): data objects are sets of items => find frequent subsets.
Frequent Substring Mining (=> next chapter): data objects are strings/sequences over a discrete alphabet => find frequent subsequences.
Frequent Subgraph Mining (=> KDD II): data objects are graphs with discrete labels for nodes and/or edges => find frequent groups of isomorphic subgraphs.
Rule Mining: find rules based on frequent patterns: if pattern A is present in a data object, then its extension A+B is present with likelihood p.

Basics on Frequent Itemsets (1)
items: I = {i_1, ..., i_m} is a set of literals, e.g., all articles in a shop.
itemset X: a subset of the items, X ⊆ I, e.g., a set of items purchased together.
transaction T = (tid, X_T), where tid is the transaction ID and X_T ⊆ I is the complete itemset of the transaction, e.g., all items purchased together plus the corresponding invoice number.
database DB: a set of transactions T, e.g., the set of all purchases in a certain time interval.
The items in an itemset can be sorted lexicographically: itemset X = (x_1, x_2, ..., x_k) with x_1 ≤ x_2 ≤ ... ≤ x_k.
length of an itemset: |X|, the number of items in X; a k-itemset is an itemset of length k.
{butter, bread, milk, sugar} is a 4-itemset; {flour, sausage} is a 2-itemset.

Basics on Frequent Itemsets (2)
cover of an itemset X: the set of transactions containing X: cover(X) = {tid | (tid, X_T) ∈ DB, X ⊆ X_T}.
support of an itemset X in DB: the number of transactions containing X: support(X) = |cover(X)|. Remark: support(∅) = |DB|.
frequency of an itemset X in DB: the relative frequency of X being contained in a transaction T ∈ DB: frequency(X) = P(X) = support(X) / |DB|.
frequent itemset X in DB: support(X) ≥ s, where s is an absolute threshold; alternatively frequency(X) ≥ s_rel, where s = s_rel · |DB| (so 0 ≤ s ≤ |DB|).
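These definitions translate almost literally into code. Below is a minimal sketch using Python sets; the toy database anticipates the purchase example on the next slide.

  # transactions as (tid, itemset) pairs
  db = [(2000, {"A", "B", "C"}), (1000, {"A", "C"}),
        (4000, {"A", "D"}), (5000, {"B", "E", "F"})]

  def cover(itemset, db):
      # cover(X): the tids of all transactions whose itemset contains X
      return {tid for tid, x_t in db if set(itemset) <= x_t}

  def support(itemset, db):
      # support(X) = |cover(X)|
      return len(cover(itemset, db))

  def frequency(itemset, db):
      # frequency(X) = support(X) / |DB|
      return support(itemset, db) / len(db)

  print(support({"A", "C"}, db))   # 2
  print(frequency({"A"}, db))      # 0.75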

Problem setting: Frequent Itemset Mining
given: a set of items I, a transaction database DB over I, and an absolute frequency threshold s.
task: find all frequent itemsets in DB, i.e. {X ⊆ I | support(X) ≥ s}.
Example database:
  TransactionID | Items
  2000          | A, B, C
  1000          | A, C
  4000          | A, D
  5000          | B, E, F
support of 1-itemsets: (A): 75%; (B), (C): 50%; (D), (E), (F): 25%
support of 2-itemsets: (A, C): 50%; (A, B), (A, D), (B, C), (B, E), (B, F), (E, F): 25%

Itemset Mining: the Apriori algorithm [Agrawal & Srikant 1994]
Start by computing the frequent 1-itemsets, then the 2-itemsets, and so on (breadth-first / level-wise search).
To find all frequent (k+1)-itemsets: consider only those (k+1)-itemset candidates for which all k-subsets are frequent, and determine their support by scanning all transactions in DB (one scan per level).

Apriori algorithm
C_k: set of candidate itemsets of length k (potentially frequent)
L_k: set of all frequent itemsets of length k

Apriori(I, DB, minsup)
  L_1 := {frequent 1-itemsets from I};
  k := 2;
  while L_{k-1} ≠ ∅ do
    C_k := AprioriCandidateGeneration(L_{k-1});
    for each transaction T ∈ DB do
      C_T := Subset(C_k, T);          // all candidates from C_k contained in T
      for each candidate c ∈ C_T do c.count++;
    L_k := {c ∈ C_k | c.count ≥ minsup};
    k++;
  return ∪_k L_k;

Candidate generation (1)
Requirements on the candidate set: C_k ⊇ L_k, and C_k should be much smaller than the set of all possible k-itemsets.
Step 1 (join): if p and q are frequent (k-1)-itemsets that share a common (k-2)-prefix, join p and q.
Example: p = (beer, chips, pizza) ∈ L_{k-1} and q = (beer, chips, wine) ∈ L_{k-1} are joined to (beer, chips, pizza, wine) ∈ C_k.

Candidate generation (2)
Step 2 (pruning): remove all candidate k-itemsets that contain a non-frequent (k-1)-itemset.
Example: L_3 = {(1 2 3), (1 2 4), (1 3 4), (1 3 5), (2 3 4)}.
After the join step: candidates = {(1 2 3 4), (1 3 4 5)}.
Pruning step: delete (1 3 4 5) since (1 4 5) is not frequent => C_4 = {(1 2 3 4)}.
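Both steps together, as a short Python sketch; representing itemsets as sorted tuples is an assumption of this sketch.

  from itertools import combinations

  def generate_candidates(prev_frequent, k):
      # join: frequent (k-1)-itemsets sharing a (k-2)-prefix;
      # prune: drop candidates with an infrequent (k-1)-subset
      prev = set(prev_frequent)                       # sorted tuples of length k-1
      joined = {p + (q[-1],) for p in prev for q in prev
                if p[:-1] == q[:-1] and p[-1] < q[-1]}
      return {c for c in joined
              if all(s in prev for s in combinations(c, k - 1))}

  L3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
  print(generate_candidates(L3, 4))   # {(1, 2, 3, 4)}; (1, 3, 4, 5) is pruned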

Example run of the Apriori algorithm (minsup = 2)

Database DB:
  TID | Items
  100 | 1 3 4
  200 | 2 3 5
  300 | 1 2 3 5
  400 | 2 5

Scan DB => C_1 with counts: {1}: 2, {2}: 3, {3}: 3, {4}: 1, {5}: 3
L_1: {1}: 2, {2}: 3, {3}: 3, {5}: 3
C_2 (after join/prune): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan DB => C_2 with counts: {1 2}: 1, {1 3}: 2, {1 5}: 1, {2 3}: 2, {2 5}: 3, {3 5}: 2
L_2: {1 3}: 2, {2 3}: 2, {2 5}: 3, {3 5}: 2
C_3: {2 3 5}
Scan DB => L_3: {2 3 5}: 2
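Putting the pieces together, here is a self-contained Apriori sketch that reproduces the run above. Itemsets are again represented as sorted tuples; this is a sketch under those assumptions, not the reference implementation from the slides.

  from itertools import combinations
  from collections import Counter

  def apriori(db, minsup):
      # db: list of transactions (sets of items); returns {frequent itemset (sorted tuple): support}
      counts = Counter(item for t in db for item in t)
      frequent = {(item,): c for item, c in counts.items() if c >= minsup}   # L_1
      result, k = dict(frequent), 2
      while frequent:
          prev = set(frequent)
          # candidate generation: join on common (k-2)-prefix, then prune
          candidates = {p + (q[-1],) for p in prev for q in prev
                        if p[:-1] == q[:-1] and p[-1] < q[-1]}
          candidates = {c for c in candidates
                        if all(s in prev for s in combinations(c, k - 1))}
          # one scan over DB to count candidate supports
          counts = Counter(c for t in db for c in candidates if set(c) <= t)
          frequent = {c: n for c, n in counts.items() if n >= minsup}        # L_k
          result.update(frequent)
          k += 1
      return result

  db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
  print(apriori(db, minsup=2))
  # {(1,): 2, (2,): 3, (3,): 3, (5,): 3, (1, 3): 2, (2, 3): 2, (2, 5): 3,
  #  (3, 5): 2, (2, 3, 5): 2}   (key order may vary)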

What you should know by now
What is data mining and KDD? The steps of the KDD process.
What is supervised learning? Linear regression, k-NN classification, Bayesian learning, evaluation metrics for supervised learning.
What is unsupervised learning? Clustering (k-means, DBSCAN), outlier detection (distance-based outliers), frequent pattern mining (frequent itemset mining, the Apriori algorithm).

Literature
Lecture notes KDD I: http://www.dbs.ifi.lmu.de/cms/knowledge_discovery_in_databases_i_(kdd_i)
Ester M., Sander J.: Knowledge Discovery in Databases: Techniken und Anwendungen. Springer, September 2000.
Han J., Kamber M., Pei J.: Data Mining: Concepts and Techniques. 3rd edition, Morgan Kaufmann Publishers, March 2011.
H.-P. Kriegel, M. Schubert, A. Zimek: Angle-Based Outlier Detection in High-dimensional Data. In Proceedings of the 14th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Las Vegas, NV: 444-452, 2008.
M. Ankerst, M. M. Breunig, H.-P. Kriegel, J. Sander: OPTICS: Ordering Points To Identify the Clustering Structure. In Proceedings of the ACM International Conference on Management of Data (SIGMOD), Philadelphia, PA: 49-60, 1999.