Association Rule Mining


1 Association Rule Mining
Debapriyo Majumdar
Data Mining, Fall 2014
Indian Statistical Institute Kolkata
August 4 and 7


2 Market Basket Analysis
Scenario: customers shopping at a supermarket

Transaction id   Items
1                Bread, Ham, Juice, Cheese, Salami, Lettuce
2                Rice, Dal, Coconut, Curry leaves, Coffee, Milk, Pickle
3                Milk, Biscuit, Bread, Salami, Fruit jam, Egg
4                Tea, Bread, Salami, Bacon, Ham, Sausage, Tomato
5                Rice, Egg, Pickle, Curry leaves, Coconut, Red chilly

What can we infer from the above data?
An association rule: {Bread, Salami} → {Ham}, with confidence ~ 2/3

3 Applications
Information-driven marketing
Catalog design
Store layout
Customer segmentation based on buying patterns

Several papers by Rakesh Agrawal and others in the 1990s
    Rakesh Agrawal and Ramakrishnan Srikant, Fast Algorithms for Mining Association Rules, VLDB 1994

4 The Market-Basket Model
A (large) set of binary attributes, called items: $I = \{i_1, \ldots, i_n\}$
    e.g. milk, bread, the items sold at the market
A transaction $T$ consists of a (small) subset of $I$
    e.g. the list of items (bill) bought by one customer at once
The database $D$ is a (large) set of transactions: $D = \{T_1, \ldots, T_N\}$

5 The Market-Basket Model
Goal: mining associations between the items
    The transactions or customers may also have associations, but here we are interested only in associations between the items
Approach: finding subsets of items that are frequently present together in transactions
An itemset: any subset $X$ of $I$

6 Support of an Itemset
Let $X$ be an itemset
Support count $\sigma(X)$ = number of transactions containing all items of $X$
$\text{support}(X) = \sigma(X) / |D|$ = fraction of transactions containing all items of $X$

T-ID   Items
1      Bread, Ham, Juice, Cheese, Salami, Lettuce
2      Rice, Dal, Coconut, Curry leaves, Coffee, Milk, Pickle
3      Milk, Biscuit, Bread, Salami, Fruit jam, Egg
4      Tea, Bread, Salami, Bacon, Ham, Sausage, Tomato
5      Rice, Egg, Pickle, Curry leaves, Coconut, Red chilly

support({Bread, Salami}) = 0.6
support({Rice, Pickle, Coconut}) = 0.4

Support makes sense (is statistically significant) only when the support count is at least a few hundred in a database of several thousand transactions
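The two support values above can be checked with a few lines of Python. This is a minimal sketch: the names `TRANSACTIONS`, `support_count` and `support` are illustrative choices, not part of the slides.

```python
# Transactions from the supermarket table (T-ID -> set of items).
TRANSACTIONS = {
    1: {"Bread", "Ham", "Juice", "Cheese", "Salami", "Lettuce"},
    2: {"Rice", "Dal", "Coconut", "Curry leaves", "Coffee", "Milk", "Pickle"},
    3: {"Milk", "Biscuit", "Bread", "Salami", "Fruit jam", "Egg"},
    4: {"Tea", "Bread", "Salami", "Bacon", "Ham", "Sausage", "Tomato"},
    5: {"Rice", "Egg", "Pickle", "Curry leaves", "Coconut", "Red chilly"},
}


def support_count(itemset, transactions=TRANSACTIONS):
    """sigma(X): number of transactions containing every item of X."""
    items = set(itemset)
    return sum(1 for t in transactions.values() if items <= t)


def support(itemset, transactions=TRANSACTIONS):
    """Fraction of transactions containing every item of X."""
    return support_count(itemset, transactions) / len(transactions)


print(support({"Bread", "Salami"}))
print(support({"Rice", "Pickle", "Coconut"}))
```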

7 Association Rule
Association rule: an implication of the form $X \rightarrow Y$, where $X, Y \subseteq I$ and $X \cap Y = \emptyset$

$\text{support}(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{|D|}$ (the fraction of transactions containing all items of both $X$ and $Y$)
$\text{confidence}(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{\sigma(X)}$

T-ID   Items
1      Bread, Ham, Juice, Cheese, Salami, Lettuce
2      Rice, Dal, Coconut, Curry leaves, Coffee, Milk, Pickle
3      Milk, Biscuit, Bread, Salami, Fruit jam, Egg
4      Tea, Bread, Salami, Bacon, Ham, Sausage, Tomato
5      Rice, Egg, Pickle, Curry leaves, Coconut, Red chilly

R: {Bread, Salami} → {Ham}
support(R) = 2/5
confidence(R) = 2/3
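As a sanity check of the numbers for R, the two definitions translate directly into code. This is a sketch only; the function names `sigma`, `rule_support` and `rule_confidence` are assumptions made for illustration.

```python
# The five supermarket transactions from the table above.
TRANSACTIONS = [
    {"Bread", "Ham", "Juice", "Cheese", "Salami", "Lettuce"},
    {"Rice", "Dal", "Coconut", "Curry leaves", "Coffee", "Milk", "Pickle"},
    {"Milk", "Biscuit", "Bread", "Salami", "Fruit jam", "Egg"},
    {"Tea", "Bread", "Salami", "Bacon", "Ham", "Sausage", "Tomato"},
    {"Rice", "Egg", "Pickle", "Curry leaves", "Coconut", "Red chilly"},
]


def sigma(itemset):
    """Support count: transactions that contain the whole itemset."""
    return sum(1 for t in TRANSACTIONS if set(itemset) <= t)


def rule_support(x, y):
    """support(X -> Y) = sigma(X u Y) / |D|."""
    return sigma(set(x) | set(y)) / len(TRANSACTIONS)


def rule_confidence(x, y):
    """confidence(X -> Y) = sigma(X u Y) / sigma(X)."""
    return sigma(set(x) | set(y)) / sigma(x)


X, Y = {"Bread", "Salami"}, {"Ham"}
print(rule_support(X, Y), rule_confidence(X, Y))
```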

8 Association Rule Mining Task
Given a set of items $I$, a set of transactions $D$, a minimum support threshold minsup and a minimum confidence threshold minconf
Find all rules $R$ such that
    support(R) ≥ minsup
    confidence(R) ≥ minconf

9 One Approach
Observe: $\text{support}(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{|D|} = \frac{\sigma(Z)}{|D|} = \text{support}(Z)$, where $Z = X \cup Y$
If also $Z = W \cup V$, then support(X → Y) = support(W → V)
    Each binary partition of Z represents an association rule
    All of these rules have the same support
    However, their confidences may be different
Approach: frequent itemset generation
    1. Find all itemsets Z with support(Z) ≥ minsup. Call such itemsets frequent itemsets.
    2. From each Z, generate the rules with confidence ≥ minconf

10 Finding Frequent Itemsets
If $|I| = n$, then the number of possible itemsets is $2^n$
For each itemset, compute the support by scanning the list of items of each transaction: $O(Nw)$, where $w$ is the average length of transactions
Overall complexity: $O(2^n N w)$
Computationally very expensive!
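To make the cost concrete, here is a sketch of the brute-force procedure: it enumerates every non-empty itemset and scans all transactions for each one. The toy data `TOY` and the function name are hypothetical and only serve the illustration.

```python
from itertools import combinations


def brute_force_frequent(transactions, minsup):
    """Enumerate all 2^n - 1 non-empty itemsets and count each by a full scan."""
    items = sorted(set().union(*transactions))
    n_transactions = len(transactions)
    frequent = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            c = frozenset(candidate)
            count = sum(1 for t in transactions if c <= t)  # one scan of D
            if count / n_transactions >= minsup:
                frequent[c] = count
    return frequent


# Hypothetical toy database over the items 1..4.
TOY = [{1, 2, 3}, {1, 2}, {2, 3}, {1, 2, 3, 4}]
FREQ = brute_force_frequent(TOY, minsup=0.5)
print(len(FREQ), "frequent itemsets out of", 2 ** 4 - 1, "candidates")
```

With only 4 items this checks 15 candidates; every additional item doubles that number.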

11 Anti-monotone Property of Support
If an itemset is frequent, all its subsets are also frequent
Because if $X \subseteq Y$, then support(X) ≥ support(Y)
    For all transactions $T$ such that $Y \subseteq T$, we have $X \subseteq T$
Example: support({Bread, Salami}) ≥ support({Bread, Ham, Salami})

T-ID   Items
1      Bread, Ham, Juice, Cheese, Salami, Lettuce
2      Rice, Dal, Coconut, Curry leaves, Coffee, Milk, Pickle
3      Milk, Biscuit, Bread, Salami, Fruit jam, Egg
4      Tea, Bread, Salami, Bacon, Ham, Sausage, Tomato
5      Rice, Egg, Pickle, Curry leaves, Coconut, Red chilly

12 The A-Priori Algorithm
Notation:
    $L_k$ = the set of frequent (large) itemsets of size $k$
    $C_k$ = the candidate set of frequent (large) itemsets of size $k$
Algorithm:
    L_1 = {frequent 1-itemsets};
    for ( k = 2; L_{k-1} ≠ ∅; k++ ) do begin
        C_k = apriori_gen(L_{k-1});     /* Generating new candidates */
        for all transactions T in D do begin
            C_T = subset(C_k, T);       /* Keeping only the valid candidates */
            for all candidates c in C_T do
                c.count++;
        end
        L_k = {c in C_k | c.count ≥ minsup}
    end
    Output = union of all L_k for k = 1, 2, ..., n
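The pseudocode above can be sketched in Python as follows. Two simplifications are assumptions of this sketch, not of the slides: candidates are generated by taking unions of pairs of frequent (k-1)-itemsets and pruning, rather than by the prefix join of `apriori_gen`, and the `subset` step is a plain containment test instead of a hash tree. The threshold is given as a support count, matching `c.count ≥ minsup`.

```python
from itertools import combinations

TRANSACTIONS = [
    {"Bread", "Ham", "Juice", "Cheese", "Salami", "Lettuce"},
    {"Rice", "Dal", "Coconut", "Curry leaves", "Coffee", "Milk", "Pickle"},
    {"Milk", "Biscuit", "Bread", "Salami", "Fruit jam", "Egg"},
    {"Tea", "Bread", "Salami", "Bacon", "Ham", "Sausage", "Tomato"},
    {"Rice", "Egg", "Pickle", "Curry leaves", "Coconut", "Red chilly"},
]


def apriori(transactions, minsup_count):
    """Return {frequent itemset: support count}, computed level by level."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:                      # L_1: count single items
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {c: n for c, n in counts.items() if n >= minsup_count}
    frequent = dict(level)
    k = 2
    while level:                                # loop while L_{k-1} is non-empty
        prev = set(level)
        candidates = set()
        for a in prev:
            for b in prev:
                union = a | b
                # Keep size-k unions whose (k-1)-subsets are all frequent.
                if len(union) == k and all(
                    frozenset(s) in prev for s in combinations(union, k - 1)
                ):
                    candidates.add(union)
        counts = {c: 0 for c in candidates}
        for t in transactions:                  # one pass over D per level
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        level = {c: n for c, n in counts.items() if n >= minsup_count}
        frequent.update(level)
        k += 1
    return frequent


FREQUENT = apriori(TRANSACTIONS, minsup_count=2)
for itemset, count in sorted(FREQUENT.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```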

13 Generating candidate itemsets $C_k$
A join of $L_{k-1}$ with itself:
    insert into C_k
    select p.item_1, p.item_2, ..., p.item_{k-1}, q.item_{k-1}
    from L_{k-1} p, L_{k-1} q
    where p.item_1 = q.item_1, ..., p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1}

What does it do? Joining L_3 with itself, where
    L_3 = { {1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4} }
gives
    C_4 = { {1, 2, 3, 4}, {1, 3, 4, 5} }

A prune step: {1, 3, 4, 5} will be pruned because {1, 4, 5} ∉ L_3
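The join and prune steps can be reproduced on the L_3 example with a short sketch. Itemsets are represented as sorted tuples, which is an assumption of this illustration; the function returns both the joined set and the pruned set so the two steps stay visible.

```python
from itertools import combinations


def apriori_gen(prev_level):
    """Join L_{k-1} with itself on the first k-2 items, then prune.

    prev_level: iterable of sorted tuples, all of length k-1.
    Returns (joined, pruned) candidate lists of length-k tuples.
    """
    prev = sorted(prev_level)
    prev_set = set(prev)
    joined = []
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            # Same first k-2 items, and p's last item smaller than q's.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                joined.append(p + (q[-1],))
    # Prune: every (k-1)-subset of a candidate must itself be in L_{k-1}.
    pruned = [
        c for c in joined
        if all(s in prev_set for s in combinations(c, len(c) - 1))
    ]
    return joined, pruned


L3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
JOINED, C4 = apriori_gen(L3)
print("after join: ", JOINED)
print("after prune:", C4)
```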

14 Checking Support for candidates
One approach:
    for each candidate itemset c ∈ C_k
        for each transaction T ∈ D do begin
            check if c ⊆ T
        end
    end
Complexity?

15 Using a Hash Tree
Let us have 12 candidate itemsets of size 3:
{1 2 5}, {1 2 7}, {1 3 9}, {2 4 5}, {2 8 9}, {3 5 7}, {4 5 9}, {4 7 8}, {5 6 7}, {5 7 9}, {6 7 8}, {6 7 9}
Hash function: three branches, one for the items 1, 4, 7, one for 2, 5, 8 and one for 3, 6, 9


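16 The Hash Tree
Candidates: {1 2 5}, {1 2 7}, {1 3 9}, {2 4 5}, {2 8 9}, {3 5 7}, {4 5 9}, {4 7 8}, {5 6 7}, {5 7 9}, {6 7 8}, {6 7 9}
Hash function: branches 1, 4, 7 / 2, 5, 8 / 3, 6, 9
[Figure: hash tree. At the root, each candidate is hashed on its first item:
    1, 4, 7:  {1 2 5} {1 2 7} {1 3 9} {4 5 9} {4 7 8}
    2, 5, 8:  {2 4 5} {2 8 9} {5 6 7} {5 7 9}
    3, 6, 9:  {3 5 7} {6 7 8} {6 7 9}
 Deeper levels hash on the next item with the same function (1,4,7+ / 2,5,8+ / 3,6,9+).]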

17 Subsets of the transaction
All subsets of size 3 for a transaction {1 2 6 7 8}, ordered by the item id

Subsets starting with 1
    Subsets starting with 1 2:  {1 2 6} {1 2 7} {1 2 8}
    Subsets starting with 1 6:  {1 6 7} {1 6 8}
    Subsets starting with 1 7:  {1 7 8}
Subsets starting with 2:  {2 6 7} {2 6 8} {2 7 8}
Subsets starting with 6:  {6 7 8}

Hashing in the same style
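The enumeration of the 3-subsets of {1 2 6 7 8} in item-id order is what `itertools.combinations` produces for a sorted input. A small sketch, with the grouping by first item added only for illustration:

```python
from itertools import combinations

TRANSACTION = (1, 2, 6, 7, 8)  # already ordered by item id

# combinations() emits tuples in lexicographic order of positions, so a
# sorted transaction yields its subsets ordered by item id.
SUBSETS = list(combinations(TRANSACTION, 3))

# Group the subsets by their first item, mirroring "subsets starting with 1".
BY_FIRST = {}
for subset in SUBSETS:
    BY_FIRST.setdefault(subset[0], []).append(subset)

for first, group in BY_FIRST.items():
    print(first, group)
```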

18 The Subset Operation using Hash Tree
Transaction: items ordered by item id
Hash function: branches 1, 4, 7 / 2, 5, 8 / 3, 6, 9
[Figure: the items of the transaction are hashed level by level from the root of the hash tree of slide 16, so that the transaction is compared only with the candidates in the leaves it reaches]
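A hash tree with this subset operation can be sketched as below. The leaf capacity `MAX_LEAF_SIZE`, the class and function names, and the use of `item % 3` to realise the 1,4,7 / 2,5,8 / 3,6,9 branches are assumptions of this sketch, so the resulting tree need not have exactly the shape drawn on the slides. The transaction {1 2 6 7 8} is the one used for the subset enumeration.

```python
MAX_LEAF_SIZE = 3  # assumed capacity of a leaf before it is split


def bucket(item):
    """Items 1,4,7 share a branch, as do 2,5,8 and 3,6,9."""
    return item % 3


class Node:
    def __init__(self):
        self.children = {}   # bucket -> Node, used once the node is split
        self.itemsets = []   # candidates stored while the node is a leaf
        self.is_leaf = True


def insert(node, itemset, depth=0):
    """Insert a sorted candidate tuple, hashing on itemset[depth] per level."""
    if not node.is_leaf:
        child = node.children.setdefault(bucket(itemset[depth]), Node())
        insert(child, itemset, depth + 1)
        return
    node.itemsets.append(itemset)
    if len(node.itemsets) > MAX_LEAF_SIZE and depth < len(itemset):
        # Overflow: turn the leaf into an interior node and redistribute.
        node.is_leaf = False
        pending, node.itemsets = node.itemsets, []
        for stored in pending:
            insert(node, stored, depth)


def candidates_in(node, transaction, start=0, found=None):
    """Collect the candidates contained in the sorted transaction."""
    if found is None:
        found = set()
    if node.is_leaf:
        items = set(transaction)
        found.update(c for c in node.itemsets if items.issuperset(c))
        return found
    # Hash every remaining item; only reachable branches are visited.
    for i in range(start, len(transaction)):
        child = node.children.get(bucket(transaction[i]))
        if child is not None:
            candidates_in(child, transaction, i + 1, found)
    return found


CANDIDATES = [
    (1, 2, 5), (1, 2, 7), (1, 3, 9), (2, 4, 5), (2, 8, 9), (3, 5, 7),
    (4, 5, 9), (4, 7, 8), (5, 6, 7), (5, 7, 9), (6, 7, 8), (6, 7, 9),
]
ROOT = Node()
for candidate in CANDIDATES:
    insert(ROOT, candidate)

MATCHES = candidates_in(ROOT, (1, 2, 6, 7, 8))
print(sorted(MATCHES))
```

Only the counters of the matched candidates would be incremented for this transaction; leaves that the transaction cannot reach are never compared.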


19 Where are we now?
Computed frequent itemsets, i.e. the itemsets with the required support minsup
Each frequent k-itemset X gives rise to several association rules
    How many? Ignoring X → ∅ and ∅ → X, $2^k - 2$ rules
Rules generated from different itemsets are also different
The rules need to be checked for minimum confidence
    All these rules already satisfy the support condition

20 Rules Generated from the Same Itemset
Let $X \subset Y$, for non-empty itemsets $X$ and $Y$
Then $X \rightarrow Y - X$ is an association rule
Theorem: If $X' \subset X$, then $c(X \rightarrow Y - X) \geq c(X' \rightarrow Y - X')$
Example: c({1 2 3} → {4 5}) ≥ c({1 2} → {3 4 5})
Proof. Observe:
    $c(X \rightarrow Y - X) = \sigma(Y) / \sigma(X)$
    $c(X' \rightarrow Y - X') = \sigma(Y) / \sigma(X')$
    Since $X' \subset X$, $\sigma(X') \geq \sigma(X)$, so $c(X \rightarrow Y - X) \geq c(X' \rightarrow Y - X')$ ∎
Corollary: If $X \rightarrow Y - X$ is not a high-confidence association rule, then $X' \rightarrow Y - X'$ is also not a high-confidence rule.

21 Level-wise Approach for Rule Generation
Frequent itemset: {1 2 3 4}

{1 2 3 4} → {}
{1 2 3} → {4}    {1 2 4} → {3} ✗    {1 3 4} → {2}    {2 3 4} → {1}
{1 2} → {3 4} ✗    {1 3} → {2 4}    {1 4} → {2 3} ✗    {2 3} → {1 4}    {2 4} → {1 3} ✗    {3 4} → {1 2}
{1} → {2 3 4} ✗    {2} → {1 3 4} ✗    {3} → {1 2 4}    {4} → {1 2 3} ✗

Suppose {1 2 4} → {3} fails the confidence bar
The whole tree under {1 2 4} → {3} (marked ✗) can be discarded
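The level-wise pruning can be sketched as follows: antecedents are tried from largest to smallest, and an antecedent is evaluated only if every antecedent one item larger passed the confidence bar. The function names and the choice of the frequent itemset {Bread, Ham, Salami} from the supermarket data are illustrative assumptions.

```python
from itertools import combinations

TRANSACTIONS = [
    {"Bread", "Ham", "Juice", "Cheese", "Salami", "Lettuce"},
    {"Rice", "Dal", "Coconut", "Curry leaves", "Coffee", "Milk", "Pickle"},
    {"Milk", "Biscuit", "Bread", "Salami", "Fruit jam", "Egg"},
    {"Tea", "Bread", "Salami", "Bacon", "Ham", "Sausage", "Tomato"},
    {"Rice", "Egg", "Pickle", "Curry leaves", "Coconut", "Red chilly"},
]


def sigma(itemset):
    """Support count of an itemset in TRANSACTIONS."""
    return sum(1 for t in TRANSACTIONS if itemset <= t)


def generate_rules(frequent_itemset, minconf):
    """Return (antecedent, consequent, confidence) for rules X -> Z - X."""
    z = frozenset(frequent_itemset)
    rules = []
    passed = None  # antecedents that met minconf on the previous level
    for size in range(len(z) - 1, 0, -1):
        level = []
        for combo in combinations(sorted(z), size):
            x = frozenset(combo)
            # By the theorem, X can only pass if all antecedents that
            # extend X by one item passed; otherwise skip it unevaluated.
            if passed is None or all((x | {i}) in passed for i in z - x):
                level.append(x)
        passed = set()
        for x in level:
            confidence = sigma(z) / sigma(x)
            if confidence >= minconf:
                passed.add(x)
                rules.append((x, z - x, confidence))
        if not passed:
            break  # nothing below this level can pass
    return rules


RULES = generate_rules({"Bread", "Ham", "Salami"}, minconf=0.9)
for antecedent, consequent, confidence in RULES:
    print(sorted(antecedent), "->", sorted(consequent), round(confidence, 2))
```

Here {Bread, Salami} → {Ham} has confidence 2/3 and fails, so {Bread} → {Ham, Salami} and {Salami} → {Bread, Ham} are never evaluated.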

22 Maximal Frequent itemsets
Maximal frequent itemset: a frequent itemset for which none of its immediate supersets is frequent
[Figure: lattice of all itemsets over the items 1 to 4:
    {}
    {1} {2} {3} {4}
    {1 2} {1 3} {1 4} {2 3} {2 4} {3 4}
    {1 2 3} {1 2 4} {1 3 4} {2 3 4}
    {1 2 3 4}]

23 Maximal Frequent itemsets
[Figure: the itemset lattice of slide 22, with the itemsets that are not frequent marked "Not frequent"]

24 Maximal Frequent itemsets
[Figure: the itemset lattice of slide 22, with itemsets marked "Not frequent" and "Maximal frequent"]

25 Maximal Frequent itemsets
All frequent itemsets are subsets of one of the maximal frequent itemsets.
[Figure: the itemset lattice over the items 1 to 4, with itemsets marked "Not frequent" and "Maximal frequent"]
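Given the collection of frequent itemsets, the maximal ones can be filtered out directly. In this sketch the example collection `FREQUENT` is hypothetical. Because every subset of a frequent itemset is frequent, testing against all proper supersets in the collection is equivalent to testing only the immediate supersets.

```python
def maximal_frequent(frequent_itemsets):
    """Keep the frequent itemsets that have no frequent proper superset."""
    frequent = {frozenset(s) for s in frequent_itemsets}
    return {s for s in frequent if not any(s < other for other in frequent)}


# Hypothetical frequent itemsets over the items 1..4.
FREQUENT = [{1}, {2}, {3}, {4}, {1, 2}, {1, 3}, {2, 3}, {3, 4}, {1, 2, 3}]
MAXIMAL = maximal_frequent(FREQUENT)
print(sorted(sorted(s) for s in MAXIMAL))

# Every frequent itemset is a subset of some maximal frequent itemset.
print(all(any(frozenset(s) <= m for m in MAXIMAL) for s in FREQUENT))
```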

26 Maximal Frequent Itemsets
Valuable compact representation of the frequent itemsets
But:
    They do not contain the support information of the subsets
    A maximal frequent itemset says that all its supersets have lower support, but does not say whether any subset has the same support


27 Closed Frequent Itemsets
Closed itemset: an itemset X for which none of its immediate supersets has exactly the same support count as X
    If X is not closed, at least one of its immediate supersets has the same support as X
Closed frequent itemset: an itemset which is closed and frequent (support ≥ minsup)
Support for non-closed frequent itemsets can be determined from the support information of the closed frequent itemsets
[Figure: nested sets, Maximal frequent itemsets ⊆ Closed frequent itemsets ⊆ Frequent itemsets]
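A sketch of both statements: finding the closed frequent itemsets, and recovering the support of a non-closed frequent itemset as the largest support among its closed supersets. The toy database `TOY` and all function names are assumptions for illustration. Support counts are obtained by enumerating every subset of every transaction, which is only viable for tiny examples.

```python
from itertools import combinations


def all_support_counts(transactions):
    """Support count of every itemset that occurs in some transaction."""
    counts = {}
    for t in transactions:
        items = sorted(t)
        for k in range(1, len(items) + 1):
            for combo in combinations(items, k):
                key = frozenset(combo)
                counts[key] = counts.get(key, 0) + 1
    return counts


def closed_frequent(counts, minsup_count):
    """Frequent itemsets with no proper superset of equal support count."""
    return {
        x: n for x, n in counts.items()
        if n >= minsup_count
        and not any(x < y and m == n for y, m in counts.items())
    }


def support_from_closed(itemset, closed):
    """Support count of a frequent itemset, read off its closed supersets."""
    return max(n for c, n in closed.items() if itemset <= c)


# Hypothetical toy database over the items 1..4.
TOY = [{1, 2, 3}, {1, 2}, {2, 3}, {1, 2, 3, 4}]
COUNTS = all_support_counts(TOY)
CLOSED = closed_frequent(COUNTS, minsup_count=2)
print({tuple(sorted(c)): n for c, n in CLOSED.items()})
print(support_from_closed(frozenset({1, 3}), CLOSED), COUNTS[frozenset({1, 3})])
```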

28 Evaluation of Association Rules
Even from a small dataset, a very large number of rules can be generated
    For example, as the support and confidence conditions are relaxed, the number of rules explodes
An interestingness measure for patterns / rules is required
Objective interestingness measure: a measure that uses statistics derived from the data
    Support, confidence, correlation, ...
    Domain independent
    Requires minimal human involvement

29 Subjective Measure of Interestingness
The rule {Salami} → {Bread} is not so interesting because it is obvious!
Rules such as {Salami} → {Dish washer detergent}, {Salami} → {Diaper}, etc. are less obvious
    Subjectively more interesting for marketing experts
    Non-trivial cross-sell
Methods for subjective measurement
    Visualization aided: human in the loop
    Template-based: constraints are provided for rules
    Filter obvious and non-actionable rules

30 Contingency Table
[Table: 2 × 2 contingency table for Tea (Tea, no Tea) versus Coffee (Coffee, no Coffee)]

                $B$         $\bar{B}$
$A$             $f_{11}$    $f_{10}$    $f_{1+}$
$\bar{A}$       $f_{01}$    $f_{00}$    $f_{0+}$
                $f_{+1}$    $f_{+0}$

Frequency tabulated for a pair of binary variables
Used as a useful evaluation and illustration tool
Generally:
    $\bar{A}$ (or $\bar{B}$) denotes the transactions in which A (or B) is absent
    $f_{1+}$ = support count of A
    $f_{+1}$ = support count of B

31 Limitations of Support & Confidence
Tuning the support threshold is tricky
    Low threshold: too many rules generated!
    High threshold: potentially interesting patterns may fall below the support threshold

32 Limitation of Confidence
[Table: contingency table for Tea and Coffee]
Consider the rule: {Tea} → {Coffee}
    Support = 15%
    Confidence = 75%
But: overall 80% of people have coffee, i.e., the rule {} → {Coffee} has confidence 80%.
Among tea takers, the percentage actually drops to 75%!
Where does it go wrong? The confidence measure ignores the support of Y for a rule X → Y

33 Interest factor
Lift: $\text{Lift}(X \rightarrow Y) = \frac{c(X \rightarrow Y)}{s(Y)}$, where $c$ is the confidence and $s$ the support
For binary variables, lift is equivalent to the interest factor
Interest factor: $I(X, Y) = \frac{s(X \cup Y)}{s(X)\, s(Y)} = \frac{N f_{11}}{f_{1+} f_{+1}}$
Similar to a baseline frequency comparison under the statistical independence assumption
    If X and Y are statistically independent, their baseline frequency (the expected frequency of X and Y both occurring) is $f_{11} = \frac{f_{1+} f_{+1}}{N}$

34 Interest factor
Intuitively, I(X, Y)
    = 1, if X and Y are independent
    > 1, if X and Y have a positive correlation
    < 1, if X and Y have a negative correlation
Verify for the tea-coffee example: I(Tea, Coffee) = 0.15 / (0.2 × 0.8) = 0.94
$I = \frac{N f_{11}}{f_{1+} f_{+1}}$
[Table: contingency table for Tea and Coffee]
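The tea-coffee value can be recomputed from the three percentages quoted for the rule {Tea} → {Coffee} (support 15%, confidence 75%, overall coffee share 80%). The variable names are illustrative; the support of Tea is derived as support divided by confidence.

```python
# Figures quoted for {Tea} -> {Coffee}.
s_tea_and_coffee = 0.15   # support of the rule
confidence = 0.75         # c(Tea -> Coffee)
s_coffee = 0.80           # overall share of coffee drinkers

# confidence = s(Tea, Coffee) / s(Tea), hence s(Tea) = 0.15 / 0.75 = 0.2.
s_tea = s_tea_and_coffee / confidence

lift = confidence / s_coffee                        # c(X -> Y) / s(Y)
interest = s_tea_and_coffee / (s_tea * s_coffee)    # s(X u Y) / (s(X) s(Y))

print(round(lift, 2), round(interest, 2))
```

Both expressions give the same value below 1, which is the negative association that confidence alone hides.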

35 Limitation of Interest Factor
[Tables: contingency tables for the word pairs (Text, Analysis) and (Graph, Mining)]
Observe: I(Text, Analysis) = 1.02, I(Graph, Mining) = 4.08
    Yet Text and Analysis are more related than Graph and Mining
Confidence measure:
    c(Text → Analysis) = 94.6%
    c(Graph → Mining) = 28.6%
What goes wrong here?

36 More Measures
Correlation coefficient for binary variables
IS measure: the I and S measures combined
    Mathematically equivalent to the cosine measure for binary variables
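For reference, a sketch of both measures using the definitions in Tan, Steinbach and Kumar, stated here as an assumption about what the slide shows: the correlation coefficient is $\phi = \frac{f_{11} f_{00} - f_{01} f_{10}}{\sqrt{f_{1+} f_{+1} f_{0+} f_{+0}}}$ and $IS = \sqrt{I \cdot s} = \frac{f_{11}}{\sqrt{f_{1+} f_{+1}}}$. The counts below are hypothetical, chosen to match the tea-coffee percentages with N = 100.

```python
import math


def phi_coefficient(f11, f10, f01, f00):
    """Correlation coefficient of two binary variables."""
    f1p, f0p = f11 + f10, f01 + f00      # row totals
    fp1, fp0 = f11 + f01, f10 + f00      # column totals
    return (f11 * f00 - f01 * f10) / math.sqrt(f1p * fp1 * f0p * fp0)


def is_measure(f11, f10, f01, f00):
    """IS measure, i.e. the cosine of the two binary vectors."""
    return f11 / math.sqrt((f11 + f10) * (f11 + f01))


def interest_factor(f11, f10, f01, f00):
    n = f11 + f10 + f01 + f00
    return n * f11 / ((f11 + f10) * (f11 + f01))


# Hypothetical counts: 100 people, 20 tea drinkers, 80 coffee drinkers,
# 15 of whom drink both.
TABLE = (15, 5, 65, 15)
PHI = phi_coefficient(*TABLE)
IS = is_measure(*TABLE)
print(PHI, IS)

# IS combines the interest factor I and the support s = f11 / N.
print(math.sqrt(interest_factor(*TABLE) * TABLE[0] / sum(TABLE)))
```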

37 Properties of Objective Measures

                $B$         $\bar{B}$
$A$             $f_{11}$    $f_{10}$    $f_{1+}$
$\bar{A}$       $f_{01}$    $f_{00}$    $f_{0+}$
                $f_{+1}$    $f_{+0}$

Inversion property: invariant under the inversion operation
    Exchange $f_{11}$ with $f_{00}$ and $f_{01}$ with $f_{10}$
    The value of the measure remains the same
Null addition property: invariant under addition of counts for other variables, i.e. the value of the measure remains the same if $f_{00}$ is increased
Which measures have which properties?
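One way to explore the closing question is a numerical check on a single contingency table. A changed value refutes invariance; an unchanged value only suggests it and is not a proof. The table `(60, 10, 10, 20)`, the increment of 1000 and the measure definitions (the standard ones for the correlation coefficient, interest factor and IS) are assumptions of this sketch.

```python
import math


def phi(t):
    f11, f10, f01, f00 = t
    return (f11 * f00 - f01 * f10) / math.sqrt(
        (f11 + f10) * (f11 + f01) * (f01 + f00) * (f10 + f00)
    )


def interest(t):
    f11, f10, f01, f00 = t
    return sum(t) * f11 / ((f11 + f10) * (f11 + f01))


def is_measure(t):
    f11, f10, f01, f00 = t
    return f11 / math.sqrt((f11 + f10) * (f11 + f01))


def invert(t):
    """Inversion: exchange f11 with f00 and f01 with f10."""
    f11, f10, f01, f00 = t
    return (f00, f01, f10, f11)


def add_nulls(t, extra=1000):
    """Null addition: increase f00 only."""
    f11, f10, f01, f00 = t
    return (f11, f10, f01, f00 + extra)


def check(measure, table):
    base = measure(table)
    return {
        "inversion": math.isclose(base, measure(invert(table))),
        "null_addition": math.isclose(base, measure(add_nulls(table))),
    }


TABLE = (60, 10, 10, 20)  # hypothetical (f11, f10, f01, f00)
RESULTS = {
    "phi": check(phi, TABLE),
    "interest": check(interest, TABLE),
    "IS": check(is_measure, TABLE),
}
for name, outcome in RESULTS.items():
    print(name, outcome)
```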

38 References
Rakesh Agrawal and Ramakrishnan Srikant, Fast Algorithms for Mining Association Rules, VLDB 1994
Tan, Steinbach, Kumar, Introduction to Data Mining
    The webpage:
    Chapter 6 is available online:
