Machine Learning: Pattern Mining

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany


Machine Learning: Pattern Mining. Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim. Wintersemester 2007/2008

Pattern Mining
Overview
Itemsets: Task, Naive Algorithm, Apriori Algorithm, Data Structure, Eclat Algorithm
Association Rules: Task, Algorithm
Summary

Overview Pattern Mining discovers regularities in data. Example: in the transaction database of a supermarket, someone who buys chips also buys beer. Frequent patterns are found by counting their occurrences in the database. Types of patterns: itemsets, association rules, sequences, ...


Example Shopping Carts
Beer, Chips, Chocolate, Cookies
Coke, Beer, Pizza, Chips
Salad, Noodles, Tomatoes, Water
Lasagne, Coke, Beer, Chips
Oranges, Apple Juice, Rice, Cabbage, Sausage
Diapers, Beer, Charcoal, Sausage
Beer, Cabbage, Sausage, Chips
...
Observations: Many customers buy beer. Beer and chips are often bought together. Customers who buy cabbage also buy sausage. Customers who buy something to eat also buy something to drink.

Outline
Classification predicts class labels based on training data.
Clustering groups data based on similarity.
Pattern Mining discovers regularities in data.

Itemsets Which itemsets frequently occur in the same transaction? Example: chips and beer are frequently bought together.
given: items I = {i_1, ..., i_m}; data D ⊆ P(I) (a multiset of transactions); frequency threshold θ_s
to find: frequent sets L = {X ∈ P(I) | support_D(X) ≥ θ_s}

Definitions and Terms
support_D(X) = |{d ∈ D | X ⊆ d}| / |D|
X is frequent / large iff support_D(X) ≥ θ_s

Naive Algorithm
function Naive(D, θ_s)
    L ← ∅
    for all X ∈ P(I) do
        if support_D(X) ≥ θ_s then
            L ← L ∪ {X}
        end if
    end for
    return L
end function
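As an illustration, the naive enumeration can be sketched in Python (a minimal sketch with illustrative names; the database is assumed to be a list of transaction sets):

```python
from itertools import chain, combinations

def support(D, X):
    """Fraction of transactions d in D that contain all items of X."""
    return sum(1 for d in D if X <= d) / len(D)

def naive(D, theta_s):
    """Enumerate every non-empty subset of I and keep the frequent ones."""
    I = sorted(set().union(*D))
    # all non-empty subsets of I -- exponentially many, hence infeasible for large I
    powerset = chain.from_iterable(combinations(I, k) for k in range(1, len(I) + 1))
    return {frozenset(X) for X in powerset if support(D, frozenset(X)) >= theta_s}
```

For example, on a toy database of five carts, naive([{'a','b'}, {'a','b','c'}, {'a','c'}, {'b','c'}, {'a'}], 0.4) returns the six frequent itemsets {a}, {b}, {c}, {a,b}, {a,c}, {b,c}.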

Example
Task: for a transaction database D over the items I = {a, b, c, d, e}, find the itemsets with θ_s = 0.3, i.e., X is frequent iff #_D(X) > 2.

The naive algorithm counts the support of every X ∈ P(I), e.g.:
#{a} = 7, #{b} = 7, #{c} = 6, #{d} = 2, #{e} = 5,
#{a,b} = 5, #{a,c} = 4, #{a,d} = 1, #{a,e} = 5, #{b,c} = 4, #{b,d} = 2, #{b,e} = 4, #{c,d} = 1, #{c,e} = 2, #{d,e} = 1,
#{a,b,c} = 2, #{a,b,d} = 1, #{a,b,e} = 4, #{c,d,e} = 0, ...

This yields L = {{a}, {b}, {c}, {e}, {a, b}, {a, c}, {a, e}, {b, c}, {b, e}, {a, b, e}}.

Properties of Naive Algorithm: it returns the correct result and always terminates. But counting the support for each itemset X ∈ P(I) is not feasible, as |P(I)| = 2^|I| is exponential in |I|.

Observations
support_D(X) ≥ support_D(X ∪ Y)
support_D(X) ≥ θ_s ⟹ ∀Y ⊆ X: support_D(Y) ≥ θ_s (all subsets of a frequent set are frequent)
support_D(X) < θ_s ⟹ ∀Y ⊇ X: support_D(Y) < θ_s (all supersets of an infrequent set X are not frequent)
example: support_D({a, b}) ≥ support_D({a, b, c, d})

Apriori Algorithm Breadth-first / levelwise search: 1. find frequent itemsets of length 1; 2. find frequent itemsets of length 2; 3. ... It only explores itemsets all of whose subsets are known to be frequent.

Apriori Algorithm
function Apriori(D, θ_s)
    k ← 1
    L_k ← {{i} | i ∈ I, support_D({i}) ≥ θ_s}
    while L_k ≠ ∅ do
        C_{k+1} ← generateCandidates(L_k, k + 1)
        L_{k+1} ← {X ∈ C_{k+1} | support_D(X) ≥ θ_s}
        k ← k + 1
    end while
    return ⋃_k L_k
end function

Candidate Generation
generates candidates of length k from frequent itemsets L of length k − 1
function generateCandidates(L, k)
    C ← {X ∪ Y | X, Y ∈ L, |X ∪ Y| = k}
    C ← {X ∈ C | ∀Y ⊂ X with |Y| = k − 1: Y ∈ L}
    return C
end function
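Apriori and the candidate generation can be sketched together in Python (a sketch under the same list-of-sets convention as before; the support helper is repeated so the block is self-contained):

```python
from itertools import combinations

def support(D, X):
    """Fraction of transactions in D containing all items of X."""
    return sum(1 for d in D if X <= d) / len(D)

def generate_candidates(L, k):
    """Join frequent (k-1)-itemsets, then prune candidates that have
    an infrequent (k-1)-subset (the Apriori pruning step)."""
    C = {X | Y for X in L for Y in L if len(X | Y) == k}
    return {X for X in C
            if all(frozenset(S) in L for S in combinations(X, k - 1))}

def apriori(D, theta_s):
    """Levelwise search: frequent 1-itemsets, then 2-itemsets, ..."""
    I = set().union(*D)
    L_k = {frozenset({i}) for i in I if support(D, frozenset({i})) >= theta_s}
    L, k = set(), 1
    while L_k:
        L |= L_k
        k += 1
        C_k = generate_candidates(L_k, k)
        L_k = {X for X in C_k if support(D, X) >= theta_s}
    return L
```

On the toy database from the naive sketch, apriori(D, 0.4) returns the same six frequent itemsets while only ever counting the support of candidates whose subsets are all frequent.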

Example
Run Apriori on the example database D: find the itemsets with θ_s = 0.3, i.e., X is frequent iff #_D(X) > 2.

Example
C_1 = {{a}, {b}, {c}, {d}, {e}} with counts #{a} = 7, #{b} = 7, #{c} = 6, #{d} = 2, #{e} = 5.
L_1 = {{a}, {b}, {c}, {e}} ({d} is infrequent).

Example
C_2 = {{a,b}, {a,c}, {a,e}, {b,c}, {b,e}, {c,e}} with counts #{a,b} = 5, #{a,c} = 4, #{a,e} = 5, #{b,c} = 4, #{b,e} = 4, #{c,e} = 2.
L_2 = {{a,b}, {a,c}, {a,e}, {b,c}, {b,e}} ({c,e} is infrequent).

Example
The join step produces {a,b,c}, {a,b,e}, {a,c,e}, and {b,c,e}; pruning removes {a,c,e} and {b,c,e} because {c,e} ∉ L_2, so C_3 = {{a,b,c}, {a,b,e}} with counts #{a,b,c} = 2, #{a,b,e} = 4.
L_3 = {{a,b,e}} ({a,b,c} is infrequent).

Example
C_4 = ∅ (no candidates can be generated from L_3), so the search stops.

Example
L_1 = {{a}, {b}, {c}, {e}}, L_2 = {{a,b}, {a,c}, {a,e}, {b,c}, {b,e}}, L_3 = {{a,b,e}}, hence
L = {{a}, {b}, {c}, {e}, {a, b}, {a, c}, {a, e}, {b, c}, {b, e}, {a, b, e}}.

Trie / Prefix Tree For candidate generation and frequency counting, a trie can be used: a trie is a tree; each node contains an item and a frequency counter; each path from the root to a node corresponds to an itemset; the k-th level represents itemsets of length k; the items along each path in a trie are ordered.

Example Prefix Tree
For the example database D, the trie over the frequent itemsets looks as follows: the root ∅ has children a:7, b:7, c:6, e:5; node a has children b:5, c:4, e:5; node b has children c:4, e:4; and node a,b has the child e:4. Each path from the root encodes an itemset together with its counter, e.g. the path b gives #{b} = 7, the path b, e gives #{b, e} = 4, and the path a, b, e gives #{a, b, e} = 4.

Trie: Frequency Counting To count frequencies with a trie, each transaction d ∈ D is handled the following way:
1. sort d
2. start at the root
3. for each item i ∈ d: if the current node has a child labeled i, follow it, increase its counter by one, and recursively repeat this for d ∖ {i}

Trie: Candidate Generation Candidates of length k can be generated from a trie of depth k − 1:
1. for each node at level k − 1, append its (right) siblings as its children
2. prune those children that have an infrequent subset
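The counting scheme can be sketched with a small trie in Python (an illustrative sketch, not the lecture's exact data structure: candidates are inserted with counter 0, then each sorted transaction is pushed through the trie as described above):

```python
class TrieNode:
    def __init__(self):
        self.count = 0
        self.children = {}  # item -> TrieNode; paths follow the sorted item order

def insert(root, itemset):
    """Insert a candidate itemset as a path of nodes, counters starting at 0."""
    node = root
    for item in sorted(itemset):
        node = node.children.setdefault(item, TrieNode())

def count_transaction(node, items):
    """For each item of the sorted transaction that matches a child,
    increment that child's counter and recurse on the remaining suffix."""
    for idx, item in enumerate(items):
        child = node.children.get(item)
        if child is not None:
            child.count += 1
            count_transaction(child, items[idx + 1:])

def lookup(root, itemset):
    """Read off the frequency of an itemset from the node at the end of its path."""
    node = root
    for item in sorted(itemset):
        node = node.children[item]
    return node.count
```

Inserting the candidates {a}, {b}, {a, b} and pushing the transactions {a, b} and {a} through the trie gives lookup counts 2, 1, and 1, respectively.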

Trie: Example
Find the itemsets of the example database D with θ_s = 0.3 (X frequent iff #_D(X) > 2) using a trie. ... see blackboard ...

Eclat Algorithm An algorithm for itemset mining: depth-first search with a vertical database layout. For each pattern, store its cover, i.e., the set of all transactions that contain this pattern, e.g. (a, {d_1, d_3, d_4, d_5, d_7, d_8, d_9}). Frequencies are counted by intersecting covers.

Eclat Algorithm
function Eclat(D, θ_s)
    C ← {(i, {d ∈ D | i ∈ d}) | i ∈ I}
    L ← {(i, D_i) ∈ C | |D_i| / |D| ≥ θ_s}
    return EclatRecursion(L, ∅, θ_s)
end function

Eclat Algorithm
function EclatRecursion(L, p, θ_s)
    F ← ∅
    for all (i, D_i) ∈ L do
        q ← p ∪ {i}
        F ← F ∪ {q}
        C_q ← {(j, D_i ∩ D_j) | (j, D_j) ∈ L, j > i}
        L_q ← {(k, D_k) ∈ C_q | |D_k| / |D| ≥ θ_s}
        if L_q ≠ ∅ then
            F ← F ∪ EclatRecursion(L_q, q, θ_s)
        end if
    end for
    return F
end function
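Eclat can be sketched in Python as follows (a sketch under the same conventions as before; covers are sets of transaction indices, and the condition j > i is realized by only extending with later list entries):

```python
def eclat(D, theta_s):
    """Frequent itemset mining with a vertical layout and depth-first search."""
    n = len(D)
    items = sorted(set().union(*D))
    # vertical database layout: item -> cover (indices of transactions containing it)
    covers = [(i, {t for t, d in enumerate(D) if i in d}) for i in items]
    L = [(i, c) for i, c in covers if len(c) / n >= theta_s]
    return _eclat_rec(L, frozenset(), theta_s, n)

def _eclat_rec(L, prefix, theta_s, n):
    F = set()
    for idx, (i, cover_i) in enumerate(L):
        q = prefix | {i}
        F.add(q)
        # the frequency of q ∪ {j} is the size of the intersected covers
        L_q = [(j, cover_i & cover_j) for j, cover_j in L[idx + 1:]
               if len(cover_i & cover_j) / n >= theta_s]
        if L_q:
            F |= _eclat_rec(L_q, q, theta_s, n)
    return F
```

On the toy database used above, eclat(D, 0.4) returns the same six frequent itemsets as Apriori, but touches the database only once to build the covers.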

Example
Run Eclat on the example database D with θ_s = 0.3 (X frequent iff #_D(X) > 2). ... see blackboard ...

Association Rules Which itemsets Y occur often if another itemset X appears? X → Y. Example: a customer buying diapers also buys beer: {diapers} → {beer}.
given: items I = {i_1, ..., i_m}; data D ⊆ P(I) (multiset); frequency threshold θ_s; confidence threshold θ_c
to find: rules R = {X → Y | support_D(X → Y) ≥ θ_s ∧ confidence_D(X → Y) ≥ θ_c}

Definitions and Terms
support measures how often the rule X → Y applies:
support_D(X → Y) = support_D(X ∪ Y)
X → Y is frequent / large iff support_D(X → Y) ≥ θ_s.
confidence measures how likely it is that Y appears if X is present:
confidence_D(X → Y) = support_D(X ∪ Y) / support_D(X)
For a rule X → Y, Y is called the head and X is called the body.
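In code, both rule quality measures reduce to two support computations (a minimal sketch with the same list-of-sets convention as the earlier snippets):

```python
def support(D, X):
    """support_D(X): fraction of transactions containing all items of X."""
    return sum(1 for d in D if X <= d) / len(D)

def confidence(D, X, Y):
    """confidence_D(X -> Y) = support_D(X ∪ Y) / support_D(X)."""
    return support(D, X | Y) / support(D, X)
```

For instance, on the toy database, the rule {a} → {b} has support 2/5 and confidence (2/5)/(4/5) = 0.5, while {b} → {a} has the same support but confidence 2/3.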

Example
For the example database D, find the rules with θ_s = 0.3 (i.e., X → Y frequent iff #_D(X ∪ Y) > 2) and θ_c = 0.8.

Example
With confidence_D(X → Y) = support_D(X ∪ Y) / support_D(X), support_D(X → Y) = support_D(X ∪ Y), and
L = {{a}, {b}, {c}, {e}, {a, b}, {a, c}, {a, e}, {b, c}, {b, e}, {a, b, e}}
... see blackboard ...

Example
R = {{e} → {a}, {e} → {b}, {e} → {a, b}, {a, b} → {e}, {a, e} → {b}, {b, e} → {a}} ∪ {X → ∅ | X ∈ L}

Observations Expanding the head of a rule by items of the body results in a rule with less or equal confidence: for Z ⊆ X,
confidence_D(X ∖ Z → Y ∪ Z) ≤ confidence_D(X → Y)
Proof:
confidence_D(X ∖ Z → Y ∪ Z) = support_D((X ∖ Z) ∪ (Y ∪ Z)) / support_D(X ∖ Z) = support_D(X ∪ Y) / support_D(X ∖ Z)
≤ support_D(X ∪ Y) / support_D(X) = confidence_D(X → Y),
since support_D(X ∖ Z) ≥ support_D(X).
example: confidence_D({a, b} → {c, d}) ≤ confidence_D({a, b, c} → {d})

Algorithm Association rule mining is done in two steps: 1. find frequent itemsets (see itemset mining) 2. extract rules from the frequent itemsets

AssociationRules Algorithm
function AssociationRules(D, θ_s, θ_c)
    L ← Apriori(D, θ_s)
    R ← ∅
    for all I ∈ L do
        k ← 1
        C_k ← {{i} | i ∈ I}
        while C_k ≠ ∅ do
            H_k ← {X ∈ C_k | confidence_D(I ∖ X → X) ≥ θ_c}
            C_{k+1} ← generateCandidateHeads(H_k, k + 1)
            k ← k + 1
        end while
        R ← R ∪ ⋃_k {I ∖ X → X | X ∈ H_k} ∪ {I → ∅}
    end for
    return R
end function

Candidate Generation for Heads of Rules
generates candidate heads of length k from heads H of length k − 1
function generateCandidateHeads(H, k)
    C ← {X ∪ Y | X, Y ∈ H, |X ∪ Y| = k}
    C ← {X ∈ C | ∀Y ⊂ X with |Y| = k − 1: Y ∈ H}
    return C
end function
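The two-step procedure can be sketched end-to-end in Python (a simplified sketch: for brevity it enumerates all candidate heads instead of generating them level-wise, which changes the amount of work but not the result; trivial rules X → ∅ are omitted):

```python
from itertools import combinations

def support(D, X):
    """Fraction of transactions in D containing all items of X."""
    return sum(1 for d in D if X <= d) / len(D)

def association_rules(D, theta_s, theta_c):
    """Step 1: find frequent itemsets; step 2: extract rules body -> head."""
    items = sorted(set().union(*D))
    # step 1: frequent itemsets (naive enumeration keeps the sketch short)
    L = [frozenset(c)
         for k in range(1, len(items) + 1)
         for c in combinations(items, k)
         if support(D, frozenset(c)) >= theta_s]
    # step 2: split every frequent itemset I into body I \ X and head X
    R = []
    for I in L:
        for k in range(1, len(I)):
            for head in combinations(sorted(I), k):
                X = frozenset(head)
                if support(D, I) / support(D, I - X) >= theta_c:
                    R.append((I - X, X))
    return R
```

On the toy database with θ_s = 0.4 and θ_c = 0.6, this yields the four rules {b} → {a}, {c} → {a}, {b} → {c}, and {c} → {b}.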

Remarks Calculating the confidence can be reduced to calculating supports:
confidence_D(I ∖ X → X) = support_D((I ∖ X) ∪ X) / support_D(I ∖ X) = support_D(I) / support_D(I ∖ X) ≥ θ_c
⟺ support_D(I ∖ X) ≤ (1 / θ_c) · support_D(I)
Since I ∖ X is a subset of the frequent itemset I, it is frequent itself, so the values for support_D can be looked up in the trie and no database pass is necessary.

Example Trace of the inner loop of the algorithm AssociationRules for I = {a, b, e}. ... see blackboard ...

Outlook Extensions to Apriori and Eclat. Further patterns: sequences, trees, ... Background knowledge: e.g. taxonomies.

Sequence mining Takes into account the time at which an action is performed. E.g., a database of the courses attended by a student per term, i.e., sequences of sets:
Student 1: ({linear algebra, c++, algorithm theory}, {machine learning, numerics, economics}, {Bayesian networks})
Student 2: ({linear algebra, java}, {software engineering}, {numerics})
Student 3: ({linear algebra, java, algorithm theory}, {economics}, {machine learning, numerics}, {Bayesian networks})
...
A frequent sequence might be ({linear algebra, algorithm theory}, {machine learning, numerics}, {Bayesian networks}).

Use taxonomies Background knowledge in terms of taxonomies can be used for mining patterns. E.g., the following taxonomy is given over subjects:
linear algebra isa mathematics
mathematics isa science
computer science isa science
In the student database one could then mine association rules using the taxonomy, e.g.: if someone has attended machine learning, then (s)he has also attended some mathematics lecture: {machine learning} → {mathematics}.

Conclusion Task: finding frequent patterns in a database. Efficient algorithms explore only promising candidates by pruning. Mining association rules can be reduced to mining itemsets with an additional post-processing step.

Literature
R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules. In J. B. Bocca, M. Jarke, and C. Zaniolo, editors, Proc. 20th Int. Conf. on Very Large Data Bases (VLDB), pages 487–499. Morgan Kaufmann, 1994.
R. Agrawal and R. Srikant. Mining Sequential Patterns. In P. S. Yu and A. S. P. Chen, editors, Proc. 11th Int. Conf. on Data Engineering, pages 3–14, Taipei, Taiwan, 1995. IEEE Computer Society Press.
L. Schmidt-Thieme. Algorithmic Features of Eclat. 2004.