Frequent Itemset Mining


1 Frequent Itemset Mining. Nadjib LAZAAR, LIRMM-UM, IMAGINA 15/16

2 Frequent Itemset Mining: Motivations. Frequent Itemset Mining is a method for market basket analysis. It aims at finding regularities in the shopping behavior of customers of supermarkets, mail-order companies, on-line shops, etc. More specifically: find sets of products that are frequently bought together. Possible applications of found frequent itemsets: improve the arrangement of products on shelves, on a catalog's pages, etc.; support cross-selling (suggestion of other products) and product bundling; fraud detection, technical dependence analysis, fault localization, etc. Often the found patterns are expressed as association rules, for example: if a customer buys bread and wine, then she/he will probably also buy cheese.

3 Frequent Itemset Mining: Basic notions. Items: I = {i_1, ..., i_n}. Itemset: P ⊆ I. Transactions: T = {t_1, ..., t_m}. Language of itemsets: L_I = 2^I. Transactional dataset: D ⊆ I × T. Cover of an itemset: S(P) = {t_i ∈ T | P ⊆ t_i}. (Absolute) frequency: freq(P) = |S(P)|.

4 Absolute/relative frequency. Absolute frequency: freq(P) = |S(P)|. Relative frequency: freq(P) = (1/m) |S(P)|.
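The cover and the two frequency measures above can be sketched in a few lines of Python. The transaction database below is a hypothetical toy example, not the lecture's 10-transaction dataset:

```python
# Cover and frequency of an itemset, following the slide's definitions.
# The transaction database is a small hypothetical example.

def cover(itemset, transactions):
    """S(P): ids of the transactions that contain every item of P."""
    return {tid for tid, t in transactions.items() if itemset <= t}

def abs_freq(itemset, transactions):
    """Absolute frequency: freq(P) = |S(P)|."""
    return len(cover(itemset, transactions))

def rel_freq(itemset, transactions):
    """Relative frequency: freq(P) = |S(P)| / m."""
    return abs_freq(itemset, transactions) / len(transactions)

transactions = {
    1: {"a", "b", "c"},
    2: {"b", "c", "d"},
    3: {"a", "c", "e"},
    4: {"b", "c", "e"},
}

print(cover({"b", "c"}, transactions))     # {1, 2, 4}
print(abs_freq({"b", "c"}, transactions))  # 3
print(rel_freq({"b", "c"}, transactions))  # 0.75
```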

5 Frequent Itemset Mining: Definition. Given: a set of items I = {i_1, ..., i_n}, a set of transactions T = {t_1, ..., t_m} over the items, and a minimum support θ. The need: the set of itemsets P s.t. freq(P) ≥ θ.

6 Example. I = {a, b, c, d, e}, T = {1 ... 10}. S(bc) = {2, 7, 9}, freq(bc) = 3. Q: is it an absolute or a relative frequency?

7 Example. [Figure: the itemset lattice over I = {a, b, c, d, e}, rooted at ∅.]

8 Example. [Figure: the same lattice, annotated with the support of each itemset.]

9 Example. Frequent itemsets? [Figure: the lattice with supports; which itemsets are frequent?]

10 Example. Frequent itemsets with minimum support θ = 3? [Figure: the lattice restricted to itemsets with support ≥ 3.]

11 Searching for Frequent Itemsets. A naïve search that consists of enumerating and testing the frequency of itemset candidates in a given dataset is usually infeasible. Why?

Number of items (n) | Search space (2^n)
10                  | ~10^3
20                  | ~10^6
30                  | ~10^9
100                 | ~10^30
128                 | ~10^68 (atoms in the universe)
1000                | ~10^301
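The naïve search can be sketched directly: enumerate every subset of I and count its support. This works on the hypothetical 5-item toy database below precisely because n is tiny; the table above shows why it cannot scale:

```python
# Naive mining: enumerate all 2^n candidate itemsets and test each one.
# Feasible only for tiny n -- the search space doubles with every item.
from itertools import combinations

def naive_frequent(items, transactions, minsup):
    """Return every (itemset, support) pair with absolute support >= minsup."""
    result = []
    for k in range(1, len(items) + 1):
        for cand in combinations(sorted(items), k):
            support = sum(1 for t in transactions if set(cand) <= t)
            if support >= minsup:
                result.append((frozenset(cand), support))
    return result

transactions = [{"a", "b", "c"}, {"b", "c", "d"}, {"a", "c", "e"}, {"b", "c", "e"}]
items = {"a", "b", "c", "d", "e"}
for itemset, sup in naive_frequent(items, transactions, minsup=3):
    print(sorted(itemset), sup)
```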

12 Anti-monotonicity property. Given a transaction database D over items I and two itemsets X, Y: X ⊆ Y ⇒ S(Y) ⊆ S(X). That is, X ⊆ Y ⇒ freq(Y) ≤ freq(X).

13 Example. S(ade) = {1, 4, 8, 10}, freq(ade) = 4. S(acde) = {4, 8}, freq(acde) = 2. [Figure: the lattice with ade and acde highlighted.]

14 Apriori property. Given a transaction database D over items I, a minsup θ and two itemsets X, Y: X ⊆ Y ⇒ freq(Y) ≤ freq(X). It follows: X ⊆ Y ⇒ (freq(Y) ≥ θ ⇒ freq(X) ≥ θ). All subsets of a frequent itemset are frequent! Contraposition: X ⊆ Y ⇒ (freq(X) < θ ⇒ freq(Y) < θ). All supersets of an infrequent itemset are infrequent!
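Both directions of the property can be checked mechanically on a toy database (hypothetical, not the lecture's dataset):

```python
# Anti-monotonicity on a toy database: for X subset of Y, every transaction
# containing Y also contains X, so freq(Y) <= freq(X).
transactions = [{"a", "b", "c"}, {"b", "c", "d"}, {"a", "c", "e"}, {"b", "c", "e"}]

def freq(itemset):
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"c"}, {"b", "c"}
assert X <= Y
assert freq(Y) <= freq(X)            # 3 <= 4

# Contraposition, used for pruning: if X is infrequent, so is every superset.
minsup = 2
assert freq({"d"}) < minsup          # d is infrequent...
assert freq({"b", "c", "d"}) < minsup  # ...so bcd need never be counted
print("apriori property holds on this database")
```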

15 Example. All subsets of a frequent itemset are frequent! θ = 2. [Figure: the lattice highlighting the subsets of acde: a, c, d, e; ac, ad, ae, cd, ce, de; acd, ace, ade, cde.]

16 Example. All supersets of an infrequent itemset are infrequent! θ = 2. [Figure: the lattice highlighting the supersets of be: abe, bce, bde; abce, abde, bcde; abcde.]

17 Partially ordered sets. A partial order is a binary relation R over a set S such that ∀x, y, z ∈ S: x R x (reflexivity); x R y ∧ y R x ⇒ x = y (anti-symmetry); x R y ∧ y R z ⇒ x R z (transitivity). S = ?, R = ?

18 Poset (2^I, ⊆). Comparable itemsets: x ⊆ y ∨ y ⊆ x. Incomparable itemsets: x ⊄ y ∧ y ⊄ x.

19 Apriori Algorithm [Agrawal and Srikant 1994]. Determine the support of the one-element itemsets (i.e. singletons) and discard the infrequent items/itemsets. Form candidate itemsets with two items (both items must be frequent), determine their support, and discard the infrequent itemsets. Form candidate itemsets with three items (all contained pairs must be frequent), determine their support, and discard the infrequent itemsets. And so on! Based on candidate generation and pruning.

20 Apriori Algorithm [Agrawal and Srikant 1994]. [Figure: pseudocode of the Apriori algorithm.]

21 Apriori candidates generation. apriori-gen(L_k): E ← ∅; for all P_i, P_j ∈ L_k s.t. P_i = {i_1, ..., i_{k-1}, i_k}, P_j = {i_1, ..., i_{k-1}, i'_k} and i_k < i'_k: P ← P_i ∪ P_j // {i_1, ..., i_{k-1}, i_k, i'_k}; if ∀i ∈ P: P \ {i} ∈ L_k then E ← E ∪ {P}; return E.
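The join and prune steps of apriori-gen translate almost line for line into Python. This is a sketch, with itemsets held as frozensets and joined via their sorted tuple form:

```python
# A sketch of apriori-gen: join pairs of frequent k-itemsets that share their
# first k-1 items, then prune candidates that have an infrequent k-subset.
def apriori_gen(Lk):
    """Lk: set of frozensets, all of size k. Returns the (k+1)-candidates."""
    E = set()
    level = [tuple(sorted(p)) for p in Lk]
    for Pi in level:
        for Pj in level:
            # join step: same (k-1)-prefix, last items ordered i_k < i'_k
            if Pi[:-1] == Pj[:-1] and Pi[-1] < Pj[-1]:
                P = frozenset(Pi) | frozenset(Pj)
                # prune step: every k-subset of P must itself be frequent
                if all(P - {i} in Lk for i in P):
                    E.add(P)
    return E

L2 = {frozenset(s) for s in [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d")]}
print(sorted(sorted(c) for c in apriori_gen(L2)))  # [['a', 'b', 'c']]
```

Note that bd joins with bc into bcd, but bcd is pruned because its subset cd is not in L2.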

22 Improving candidates generation. Using the apriori-gen function, an itemset of size k+1 can be generated in j possible ways: j = k(k+1)/2. Need: generate each itemset candidate at most once. How: assign to each itemset a unique parent itemset, from which this itemset is to be generated.

23 Improving candidates generation. Assigning unique parents turns the poset lattice into a tree. [Figure: the lattice redrawn as a prefix tree.]

24 Canonical form for itemsets. An itemset can be represented as a word over the alphabet I. Q: how many words of 3 items can we have? Of 4 items? Of k items? k! An arbitrary order (e.g., lexicographic order) on the items gives a canonical form, a unique representation of itemsets, by breaking symmetries. Lex on items: abc < acb < bac < bca ...

25 Recursive processing with canonical forms. For each P of a given level, generate all possible extensions of P by one item such that: child(P) = {P' : (i ∉ P) ∧ (P' = P ∪ {i}) ∧ (c(P).last < i) ∧ (P' is frequent)}. For each P, process it recursively.
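The recursion above can be sketched as a depth-first enumeration on a hypothetical toy database: each itemset is extended only by items greater than its last one, so every itemset is generated exactly once, and Apriori pruning cuts infrequent branches:

```python
# Depth-first enumeration with canonical extensions: each itemset is extended
# only by items larger than its tail, so every set is generated exactly once.
transactions = [{"a", "b", "c"}, {"b", "c", "d"}, {"a", "c", "e"}, {"b", "c", "e"}]
items, minsup = sorted({"a", "b", "c", "d", "e"}), 2

def freq(p):
    return sum(1 for t in transactions if p <= t)

def dfs(P, last, out):
    for i in items:
        if i > last:                     # canonical: extend past the tail only
            child = P | {i}
            if freq(child) >= minsup:    # Apriori pruning
                out.append(frozenset(child))
                dfs(child, i, out)
    return out

frequent = dfs(set(), "", [])
print(sorted(sorted(p) for p in frequent))
```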

26 Example (4). Q: what are the children of the itemsets in the figure? [Figure: the lattice with supports.] child(P) = {P' : (i ∉ P) ∧ (P' = P ∪ {i}) ∧ (c(P).last < i) ∧ (P' is frequent)}

27 Items Ordering. Any order can be used, that is, the order is arbitrary. The search space differs considerably depending on the order. Thus, the efficiency of Frequent Itemset Mining algorithms can differ considerably depending on the item order. Advanced methods even adapt the order of the items during the search: use different, but compatible, orders in different branches.

28 Items Ordering (heuristics). Frequent itemsets consist of frequent items: sort the items w.r.t. their frequency (decreasing/increasing). The sum of the sizes of the transactions containing a given item implicitly captures the frequency of pairs, triplets, etc.: sort the items w.r.t. the sum of the sizes of the transactions that cover them.

29 Condensed representation of Frequent Itemsets: Closed and Maximal Itemsets

30 Maximal Itemsets. The set of Maximal (frequent) Itemsets: M_θ = {P ⊆ I | freq(P) ≥ θ ∧ ∀P' ⊃ P : freq(P') < θ}. That is: an itemset is maximal if it is frequent, but none of its proper supersets is frequent. ∀θ, ∀P ∈ F_θ : (P ∈ M_θ) ∨ (∃P' ⊃ P : freq(P') ≥ θ).

31 Maximal Itemsets. Every frequent itemset has a maximal superset: ∀θ, ∀P ∈ F_θ : ∃P' ∈ M_θ : P ⊆ P'. The maximal itemsets are a condensed representation of the frequent itemsets where: ∀θ : F_θ = ∪_{P ∈ M_θ} 2^P.

32 Example (5). Here are the frequent itemsets with minsup θ = 3. Q: what are the maximal itemsets for minsup θ = 3? [Figure: the lattice of frequent itemsets, including bc, acd, ace, ade.] M_θ = {P ⊆ I | freq(P) ≥ θ ∧ ∀P' ⊃ P : freq(P') < θ}

33 Maximal Itemsets (limitation). The set of maximal itemsets captures the set of all frequent itemsets, BUT it does not preserve the knowledge of all support values. THE NEED: can we have a condensed representation of the set of frequent itemsets which preserves knowledge of all support values?

34 Closed Itemsets. The set of Closed (frequent) Itemsets: C_θ = {P ⊆ I | freq(P) ≥ θ ∧ ∀P' ⊃ P : freq(P') < freq(P)}. An itemset is closed if it is frequent, but none of its proper supersets has the same support. That is: ∀θ, ∀P ∈ F_θ : (P ∈ C_θ) ∨ (∃P' ⊃ P : freq(P') = freq(P)).

35 Closed Itemsets. Every frequent itemset has a closed superset: ∀θ, ∀P ∈ F_θ : ∃P' ∈ C_θ : P ⊆ P'. The closed itemsets are a condensed representation of the frequent itemsets where: ∀θ : F_θ = ∪_{P ∈ C_θ} 2^P.

36 Closed Itemsets. Every frequent itemset has a closed superset with the same support. The set of all closed itemsets preserves knowledge of all support values: ∀θ, ∀P ∈ F_θ : |S(P)| = max_{P' ∈ C_θ, P ⊆ P'} |S(P')|. Which is not the case with the maximal itemsets: ∀θ, ∀P ∈ F_θ : |S(P)| ≥ max_{P' ∈ M_θ, P ⊆ P'} |S(P')|.
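The definitions of maximal and closed itemsets, and the support-recovery property of the closed ones, can be checked directly on a hypothetical toy database:

```python
# Maximal and closed itemsets computed straight from the definitions,
# on a small hypothetical database.
from itertools import combinations

transactions = [{"a", "b", "c"}, {"b", "c", "d"}, {"a", "c", "e"}, {"b", "c", "e"}]
items, minsup = {"a", "b", "c", "d", "e"}, 2

def freq(p):
    return sum(1 for t in transactions if p <= t)

# all frequent itemsets with their supports
F = {frozenset(c): freq(set(c))
     for k in range(1, len(items) + 1)
     for c in combinations(items, k)
     if freq(set(c)) >= minsup}

maximal = {p for p in F if not any(p < q for q in F)}
closed  = {p for p in F if not any(p < q and F[q] == F[p] for q in F)}

assert maximal <= closed   # every maximal itemset is also closed
# closed itemsets preserve supports: freq(P) = max over closed supersets of P
for p in F:
    assert F[p] == max(F[q] for q in closed if p <= q)
print(sorted(sorted(p) for p in closed))
```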

37 Example (6). Here are the frequent itemsets with minsup θ = 3. Q: are b and de closed itemsets? [Figure: the lattice of frequent itemsets.] C_θ = {P ⊆ I | freq(P) ≥ θ ∧ ∀P' ⊃ P : freq(P') < freq(P)}

38 Maximal/Closed itemsets. maximal ⊆ closed. Q: can the closed itemsets be expressed using the maximal ones? C_θ = ∪_{i ∈ {θ, ..., n}} M_i

39 Frequent/Closed/Maximal.

Dataset   | #Frequent    | #Closed   | #Maximal
Zoo-1     | 151 807      | 3 292     | 230
Mushroom  | 155 734      | 3 287     | 453
Lymph     | 9 967 402    | 46 802    | 5 191
Hepatitis | > 2.7 × 10^7 | 1 827 264 | 189 205

https://dtai.cs.kuleuven.be/cp4im/datasets/

40 LCM Algorithm. Linear time Closed itemset Miner. [Uno et al., 03] (version 1); [Uno et al., 04, 05] (versions 2 & 3).

41 LCM: basic ideas. The itemset candidates are checked in lexicographic order (depth-first traversal of the prefix tree). Step-by-step elimination of items from the transaction database; recursive processing of the conditional transaction databases. Maintains both a horizontal and a vertical representation of the transaction database in parallel. Uses the vertical representation to filter the transactions with the chosen split item. Uses the horizontal representation to fill the vertical representation for the next recursion step (no intersection as in the Eclat algorithm).

42 LCM: basic ideas. The itemset candidates are checked in lexicographic order (depth-first traversal of the prefix tree). Let X = {x_1, ..., x_n} be an itemset with x_1 < ... < x_n: tail(X) = x_n; X(i) = {x_1, ..., x_i}; X[j] = X ∪ {j}; X is a prefix of Y ⟺ X = Y(i) ∧ i = tail(X).

43 LCM: basic ideas. The parent-child relation P is defined as: X = P(Y) ⟺ (Y = X ∪ {x_i}) ∧ (x_i > tail(X)). Or equivalently: X = P(Y) ⟺ X = Y \ tail(Y). P is an acyclic relation.

44 Example (7). Are bcd and cda candidates? tail(abde) = ?, tail(a) = ?, tail(bd) = ? Let X = abde: X(1) = ?, X(2) = ?, X(3) = ?, X(4) = ?, X(5) = X(i : i > 3) = ? P(bc) = ?, P(ade) = ?, P(c) = ?

45 Example (7). bcd is a candidate, cda is not. tail(abde) = e, tail(a) = a, tail(bd) = d. Let X = abde: X(1) = a, X(2) = ab, X(3) = abd, X(4) = abde, X(5) = X(i : i > 3) = X(4). P(bc) = b, P(ade) = ad, P(c) = ∅.

46 Closure Extension [Pasquier et al., 99]. Closure extension is a rule for constructing a closed itemset from another one: add an item and take its closure! Closed itemset + item → closure.
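A minimal sketch of the closure operator on a hypothetical toy database: the closure of P is the intersection of all transactions that contain P, so extending a closed itemset and closing the result yields a new closed itemset (the sketch assumes P has a non-empty cover):

```python
# Closure extension sketch: add an item to a closed itemset, then close the
# result by intersecting the transactions of its cover (hypothetical database).
transactions = [{"a", "b", "c"}, {"b", "c", "d"}, {"a", "c", "e"}, {"b", "c", "e"}]

def closure(p):
    """Intersection of all transactions containing p (assumes cover is non-empty)."""
    covering = [t for t in transactions if p <= t]
    out = set(covering[0])
    for t in covering[1:]:
        out &= t
    return frozenset(out)

# {"c"} is closed here; extending it with "a" and closing yields {"a", "c"}
print(closure({"c"} | {"a"}))   # frozenset({'a', 'c'})
# {"b"} is not closed: its closure already contains "c"
print(closure({"b"}))           # frozenset({'b', 'c'})
```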

47 LCM: Lemma 1. Let X and X' be two itemsets. X' is a child of X ⟺ freq(X') > 0 (condition 1), X is a prefix of X' (condition 2), and X' = ∩_{t_i ∈ S(X')} t_i (condition 3).

48 Example (8). X' is a child of X ⟺ freq(X') > 0, X is a prefix of X', and X' = ∩_{t_i ∈ S(X')} t_i. [Figure: the lattice with supports.]

49 LCM: Algorithm [Uno et al., 03]. [Figure: pseudocode of LCM, built on tail(X) and the intersection of the transactions t_j for j ∈ S(X[i]).]

50 Diffsets [Zaki and Hsiao, 02]. Given an itemset X and its extension X[i]: S(X[i]) = S(X ∪ {i}); diffset(X, i) = S(X) \ S(X[i]); freq(X[i]) = |S(X)| − |diffset(X, i)|. Example (9): let S(X) = {1, 4, 6, 7, 9} and S(X[i]) = {1, 6, 9}. Then diffset(X, i) = {4, 7} and freq(X[i]) = 5 − 2 = 3.
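The slide's Example (9) runs as-is in Python: store only the difference of covers and recover the support of the extension by subtraction:

```python
# Diffsets: store S(X) \ S(X[i]) instead of the full cover of X[i];
# the support of the extension follows by subtraction.
def diffset(cover_X, cover_Xi):
    return cover_X - cover_Xi

S_X  = {1, 4, 6, 7, 9}       # cover of X (the slide's example)
S_Xi = {1, 6, 9}             # cover of X[i]

d = diffset(S_X, S_Xi)
print(d)                     # {4, 7}
print(len(S_X) - len(d))     # freq(X[i]) = 5 - 2 = 3
```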

51 Some results (1/2). [Figure: experimental results.]

52 Some results (2/2). [Figure: experimental results.]