Association Analysis 1


1 Association Analysis 1

2 Association Rules 1 3

3 Learning Outcomes Motivation Association Rule Mining Definition Terminology Apriori Algorithm Reducing Output: Closed and Maximal Itemsets 4

4 Overview One of the most popular problems in computer science! [Agrawal, Imielinski, Swami, SIGMOD 1993] 13th most cited across all computer science [Agrawal, Srikant, VLDB 1994] 15th most cited across all computer science 5

5 Overview Pattern Mining Unsupervised learning Useful for large datasets exploration: what is this data like? building global models Less suitable for well-studied and understood problem domains 6

6 Are humans inherently good at pattern mining? Is there a pattern? 7

7 Pattern or Illusion? 8

8 Pattern Mining Humans have always been doing pattern mining Observing & predicting nature The heavens (climate, navigation) Humans generally good at pattern recognition Illusion vs. pattern? We see what we want to see! (bias) Restricted to the natural dimensionality: 3D 9

9 Pattern vs. Chance Dog is not the pattern; the black patches are! But is that an interesting pattern? 10

10 What is a pattern? Repetitiveness Basically depends on counting Interestingness Avoid trivial patterns Chance occurrences Use statistical tests to weed these out 11

11 Motivation Frequent Item Set Mining is a method for market basket analysis. It aims at finding regularities in the shopping behavior of customers of supermarkets, mail-order companies, on-line shops etc. More specifically : Find sets of products that are frequently bought together. Possible applications of found frequent item sets: Improve arrangement of products in shelves, on a catalog's pages etc. Support cross-selling (suggestion of other products), product bundling. Fraud detection, technical dependence analysis etc. Often found patterns are expressed as association rules, for example: If a customer buys bread and wine, then she/he will probably also buy cheese. 12

8 Beer and Nappies - A Data Mining Urban Legend There is a story that a large supermarket chain, usually Wal-Mart, did an analysis of customers' buying habits and found a statistically significant correlation between purchases of beer and purchases of nappies (diapers in the US). It was theorized that fathers were stopping off at Wal-Mart to buy nappies for their babies and, since they could no longer go down to the pub as often, would buy beer as well. As a result of this finding, the supermarket chain is alleged to have placed the nappies next to the beer, resulting in increased sales. Here's what really happened: the analysis did discover that between 5:00 and 7:00 p.m. consumers bought beer and diapers, but Osco managers did NOT exploit the beer-and-diapers relationship by moving the products closer together on the shelves to increase sales of both. 13

13 Association Rule Mining Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction Market-Basket Transactions TID Items 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke Example of Association Rules {Diaper} → {Beer}, {Milk, Bread} → {Eggs, Coke}, {Beer, Bread} → {Milk} Implication means co-occurrence, not causality! 14

14 Definition: Frequent Itemset Itemset A collection of one or more items Example: {Milk, Bread, Diaper} k-itemset An itemset that contains k items Support count (σ) Frequency of occurrence of an itemset E.g. σ({Milk, Bread, Diaper}) = 2 Support Fraction of transactions that contain an itemset E.g. s({Milk, Bread, Diaper}) = 2/5 Frequent Itemset An itemset whose support is greater than or equal to a minsup threshold TID Items 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke 15

15 Definition: Association Rule Association Rule An implication expression of the form X → Y, where X and Y are itemsets Example: {Milk, Diaper} → {Beer} Rule Evaluation Metrics Support (s) Fraction of transactions that contain both X and Y Confidence (c) Measures how often items in Y appear in transactions that contain X TID Items 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke Example: {Milk, Diaper} → {Beer}: s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4; c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 = 0.67 16
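As a small illustration of these definitions (an illustrative sketch, not part of the original slides), the following Python snippet computes the support count σ(X), the support s(X), and the support/confidence of a rule over the toy transactions above; the function and variable names are just illustrative choices.

```python
# Illustrative sketch: support count, support, and rule metrics over the
# toy market-basket transactions shown in the slide above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset):
    """sigma(X): number of transactions containing every item of X."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset):
    """s(X): fraction of transactions containing X."""
    return support_count(itemset) / len(transactions)

def rule_metrics(X, Y):
    """Support and confidence of the rule X -> Y."""
    s = support(X | Y)                              # sigma(X u Y) / |T|
    c = support_count(X | Y) / support_count(X)     # sigma(X u Y) / sigma(X)
    return s, c

print(support_count({"Milk", "Bread", "Diaper"}))   # 2
print(rule_metrics({"Milk", "Diaper"}, {"Beer"}))   # (0.4, 0.666...)
```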

16 Association Rule Mining Task Given a set of transactions T, the goal of association rule mining is to find all rules having support ≥ minsup threshold and confidence ≥ minconf threshold Brute-force approach: List all possible association rules Compute the support and confidence for each rule Prune rules that fail the minsup and minconf thresholds Computationally prohibitive! 17

17 Mining Association Rules TID Items 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke Example of Rules: {Milk, Diaper} → {Beer} (s=0.4, c=0.67) {Milk, Beer} → {Diaper} (s=0.4, c=1.0) {Diaper, Beer} → {Milk} (s=0.4, c=0.67) {Beer} → {Milk, Diaper} (s=0.4, c=0.67) {Diaper} → {Milk, Beer} (s=0.4, c=0.5) {Milk} → {Diaper, Beer} (s=0.4, c=0.5) Observations: All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer} Rules originating from the same itemset have identical support but can have different confidence Thus, we may decouple the support and confidence requirements 18

18 Mining Association Rules Two-step approach: 1. Frequent Itemset Generation Generate all itemsets whose support ≥ minsup 2. Rule Generation Generate high confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset Frequent itemset generation is still computationally expensive 19

19 Frequent Itemset Generation null A B C D E AB AC AD AE BC BD BE CD CE DE ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE ABCD ABCE ABDE ACDE BCDE ABCDE Given d items, there are 2^d possible candidate itemsets 20

20 Frequent Itemset Generation Each itemset in the lattice is a candidate frequent itemset Count the support of each candidate by scanning the database of N transactions TID Items 001 Bread, Milk, Apples, Juice, Chips, Soda 002 Soda, Chips 003 Bread, Milk, Butter 004 Apples, Soda, Chips, Milk, Bread 005 Cheese, Chips, Soda 006 Bread, Milk, Chips 007 Apples, Chips, Soda 008 Crackers, Apples, Butter Match each transaction (of average width W) against every one of the M candidate itemsets Complexity: O(NMW), expensive since M = 2^d − 1 21

21 Given d unique items: Total number of itemsets = 2^d Total number of possible association rules: R = Σ_{k=1}^{d−1} [ C(d, k) × Σ_{j=1}^{d−k} C(d−k, j) ] = 3^d − 2^(d+1) + 1 If d = 6, R = 602 rules If d = 10, R = 57,002 rules If d = 15, R = 14,283,372 rules
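A quick way to sanity-check this count (an illustrative sketch, not from the slides): the double sum simplifies to the closed form R = 3^d − 2^(d+1) + 1, which the snippet below verifies for the values of d quoted above.

```python
# Sketch: exhaustive double sum vs the closed form for the number of rules.
from math import comb

def num_rules(d):
    # sum over antecedent-plus-consequent sizes: choose the k LHS items,
    # then any non-empty subset of the remaining d-k items as the RHS
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

for d in (6, 10, 15):
    closed_form = 3**d - 2**(d + 1) + 1
    assert num_rules(d) == closed_form
    print(d, closed_form)   # 6 -> 602, 10 -> 57002, 15 -> 14283372
```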

22 Needle in a haystack!! 23

23 Frequent Itemset Generation Strategies Reduce the number of candidates (M) Complete search: M=2 d Use pruning techniques to reduce M Reduce the number of transactions (N) Reduce size of N as the size of itemset increases Used by DHP and vertical-based mining algorithms Reduce the number of comparisons (NM) Use efficient data structures to store the candidates or transactions No need to match every candidate against every transaction 24

24 The Apriori Algorithm [Agrawal and Srikant 1994] 25

25 Reducing Number of Candidates Apriori principle: If an itemset is frequent, then all of its subsets must also be frequent Apriori principle holds due to the following property of the support measure: ∀ X, Y : (X ⊆ Y) ⟹ s(X) ≥ s(Y) Support of an itemset never exceeds the support of its subsets This is known as the anti-monotone property of support 26

26 Illustrating Apriori Principle null A B C D E AB AC AD AE BC BD BE CD CE DE Found to be infrequent ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE ABCD ABCE ABDE ACDE BCDE Pruned supersets ABCDE 27

27 Illustrating Apriori Principle Initial database D (minsup = 0.5): TID 100: {A, C, D}; TID 200: {B, C, E}; TID 300: {A, B, C, E}; TID 400: {B, E}. Scan D and count C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3; prune {D} to get L1 = {A}, {B}, {C}, {E}. Join, scan D and count C2: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2; prune to get L2 = {A,C}, {B,C}, {B,E}, {C,E}. Join, scan D and count C3: {B,C,E}:2, which meets minsup, so L3 = {B,C,E}.

28 Apriori Algorithm Method: Let k=1 Generate frequent itemsets of length 1 Repeat until no new frequent itemsets are identified Generate length (k+1) candidate itemsets from length k frequent itemsets Prune candidate itemsets containing subsets of length k that are infrequent Count the support of each candidate by scanning the DB Eliminate candidates that are infrequent, leaving only those that are frequent 29
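The method above can be condensed into a short Python sketch. This is a minimal illustration of the level-wise generate-and-test idea, not the original pseudocode; function and variable names are made up for the example.

```python
# Minimal Apriori sketch: level-wise candidate generation, subset-based
# pruning, and support counting by scanning the transaction list.
from itertools import combinations

def apriori(transactions, minsup):
    n = len(transactions)
    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # L1: frequent 1-itemsets
    items = {i for t in transactions for i in t}
    frequent = [{frozenset([i]) for i in items if support(frozenset([i])) >= minsup}]
    k = 1
    while frequent[-1]:
        prev = frequent[-1]
        # join step: unions of frequent k-itemsets that differ in one item
        candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # prune step: every k-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k))}
        # count step: keep only candidates that meet minsup
        frequent.append({c for c in candidates if support(c) >= minsup})
        k += 1
    return [itemset for level in frequent for itemset in level]
```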

29 Factors Affecting Complexity Choice of minimum support threshold Lowering support threshold results in more frequent itemsets This may increase number of candidates and max length of frequent itemsets Dimensionality (number of items) of the data set More space is needed to store support count of each item If number of frequent items also increases, both computation and I/O costs may also increase Size of database Since Apriori makes multiple passes, run time of algorithm may increase with number of transactions Average transaction width Transaction width increases with denser data sets This may increase max length of frequent itemsets and traversals of hash tree (number of subsets in a transaction increases with its width) 30

30 Reducing the Output: Closed and Maximal Item Sets 32

31 Maximal Frequent Itemset An itemset is maximal frequent if none of its immediate supersets is frequent null Maximal Itemsets A B C D E AB AC AD AE BC BD BE CD CE DE ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE ABCD ABCE ABDE ACDE BCDE Infrequent Itemsets ABCD E Border 33

32 Closed Itemset An itemset is closed if none of its immediate supersets has the same support as the itemset TID Items 1 {A,B} 2 {B,C,D} 3 {A,B,C,D} 4 {A,B,D} 5 {A,B,C,D} Itemset Support {A} 4 {B} 5 {C} 3 {D} 4 {A,B} 4 {A,C} 2 {A,D} 3 {B,C} 3 {B,D} 4 {C,D} 3 Itemset Support {A,B,C} 2 {A,B,D} 3 {A,C,D} 2 {B,C,D} 3 {A,B,C,D} 2 34
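A possible way to check these definitions programmatically (an illustrative sketch; the `supports` dictionary is assumed to map frozensets to support counts for every itemset that occurs in the data, as in the table above):

```python
# Sketch: flag closed and maximal frequent itemsets from a support table.
def closed_and_maximal(supports, minsup):
    frequent = {X for X, s in supports.items() if s >= minsup}
    closed, maximal = set(), set()
    for X in frequent:
        # immediate supersets of X that occur in the data (absent ones have support 0)
        supersets = [Y for Y in supports if X < Y and len(Y) == len(X) + 1]
        if all(supports[Y] < supports[X] for Y in supersets):
            closed.add(X)          # no immediate superset has the same support
        if not any(Y in frequent for Y in supersets):
            maximal.add(X)         # no immediate superset is frequent
    return closed, maximal
```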

33 TID Items 1 ABC 2 ABCD 3 BCE 4 ACDE 5 DE Maximal vs Closed Itemsets null Transaction Ids A B C D E AB AC AD AE BC BD BE CD CE DE ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE 2 4 ABCD ABCE ABDE ACDE BCDE Not supported by any transactions ABCDE 35

34 Minimum support = 2 Maximal vs Closed Frequent Itemsets null Closed but not maximal A B C D E Closed and maximal AB AC AD AE BC BD BE CD CE DE ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE 2 4 ABCD ABCE ABDE ACDE BCDE # Closed = 9 # Maximal = 4 ABCDE 36

35 Maximal vs Closed Itemsets Frequent Itemsets Closed Frequent Itemsets Maximal Frequent Itemsets 37

36 Summary Association Rule Mining Apriori Algorithm Closed and Maximal Itemsets Next Class Alternative methods FP- Tree ECLAT 38

37 Association Rules 2 39

38 Learning Outcomes Alternative methods to Apriori (Last Lecture) FP- Tree ECLAT Rule Generation 40

39 Alternative Methods for Frequent Itemset Generation Traversal of Itemset Lattice General-to-specific vs Specific-to-general Frequent itemset border null null Frequent itemset border null {a 1,a 2,...,a n } {a 1,a 2,...,a n } Frequent itemset border {a 1,a 2,...,a n } (a) General-to-specific (b) Specific-to-general (c) Bidirectional 41

40 Alternative Methods for Frequent Itemset Generation Traversal of Itemset Lattice - Equivalent Classes null null A B C D A B C D AB AC AD BC BD CD AB AC BC AD BD CD ABC ABD ACD BCD ABC ABD ACD BCD ABCD ABCD (a) Prefix tree (b) Suffix tree 42

41 Alternative Methods for Frequent Itemset Generation Traversal of Itemset Lattice Breadth-first vs Depth-first (a) Breadth first (b) Depth first 43

42 Alternative Methods for Frequent Itemset Generation Representation of Database horizontal vs vertical data layout Horizontal Data Layout TID Items 1 A,B,E 2 B,C,D 3 C,E 4 A,C,D 5 A,B,C,D 6 A,E 7 A,B 8 A,B,C 9 A,C,D 10 B Vertical Data Layout A B C D E

43 Rule Generation 45

44 Rule Generation Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L − f satisfies the minimum confidence requirement If {A,B,C,D} is a frequent itemset, candidate rules: ABC → D, ABD → C, ACD → B, BCD → A, A → BCD, B → ACD, C → ABD, D → ABC, AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L) 46
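For illustration, the 2^k − 2 candidate rules of a frequent itemset can be enumerated as follows (a sketch; names are illustrative):

```python
# Sketch: enumerate all candidate rules f -> (L - f) from a frequent itemset L,
# skipping the empty antecedent and the empty consequent.
from itertools import combinations

def candidate_rules(L):
    L = frozenset(L)
    for r in range(1, len(L)):                 # antecedent sizes 1 .. k-1
        for antecedent in combinations(L, r):
            antecedent = frozenset(antecedent)
            yield antecedent, L - antecedent

rules = list(candidate_rules({"A", "B", "C", "D"}))
print(len(rules))   # 2**4 - 2 = 14
```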

45 Rule Generation How to efficiently generate rules from frequent itemsets? In general, confidence does not have an anti-monotone property: c(ABC → D) can be larger or smaller than c(AB → D) But confidence of rules generated from the same itemset has an anti-monotone property e.g., L = {A,B,C,D}: c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD) Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule 47

46 Lattice of rules Low Confidence Rule Rule Generation for Apriori Algorithm ABCD=>{ } BCD=>A ACD=>B ABD=>C ABC=>D CD=>AB BD=>AC BC=>AD AD=>BC AC=>BD AB=>CD Pruned Rules D=>ABC C=>ABD B=>ACD A=>BCD 48

47 Rule Generation for Apriori Algorithm Candidate rule is generated by merging two rules that share the same prefix in the rule consequent join(cd=>ab,bd=>ac) would produce the candidate rule D => ABC CD=>AB BD=>AC Prune rule D=>ABC if its subset AD=>BC does not have high confidence D=>ABC 49

48 Apriori Apriori: uses a generate-and-test approach generates candidate itemsets and tests if they are frequent Generation of candidate itemsets is expensive (in both space and time) Support counting is expensive Subset checking (computationally expensive) Multiple Database scans (I/O) 50

49 The FP-Growth Algorithm Frequent Pattern Growth Algorithm [Han, Pei, and Yin 2000] 51

50 FP-Growth Algorithm FP-Growth: allows frequent itemset discovery without candidate itemset generation. Two step approach: Step 1: Build a compact data structure called the FP-tree Built using 2 passes over the data-set. Step 2: Extracts frequent itemsets directly from the FP-tree Traversal through FP-Tree 52
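A compact sketch of Step 1, FP-tree construction, is shown below. It is an illustration of the two-pass idea, not the authors' original code; the class and function names are hypothetical, and the mining step (Step 2) is omitted.

```python
# Sketch: two-pass FP-tree construction. Pass 1 counts item supports;
# pass 2 inserts each transaction with its frequent items reordered by
# decreasing support, so transactions sharing items share tree prefixes.
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 0
        self.children = {}

def build_fp_tree(transactions, minsup_count):
    # pass 1: item support counts
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    frequent = {i: c for i, c in counts.items() if c >= minsup_count}
    order = sorted(frequent, key=lambda i: -frequent[i])

    root = FPNode(None, None)
    header = defaultdict(list)   # item -> list of nodes (the node-link pointers)
    # pass 2: insert each transaction, keeping only frequent items, sorted
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in frequent), key=order.index):
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header
```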

51 Finding Frequent Itemsets Input: The FP-tree Output: All frequent itemsets and their support Method: Divide and Conquer: Consider all itemsets that end in: E, D, C, B, A For each possible ending item, consider the itemsets whose second-to-last item is one of the items preceding it in the ordering E.g., for E, consider all itemsets with last item D, C, B, A. This way we get all the itemsets ending in DE, CE, BE, AE Proceed recursively this way. Do this for all items.

52 Frequent itemsets All Itemsets Ε D C B A DE CE BE AE CD BD AD BC AC AB CDE BDE ADE BCE ACE ABE BCD ACD ABD AB C ACDE BCDE ABDE ABCE ABCD ABCDE

53 Frequent Itemsets All Itemsets Ε D C B A Frequent?; DE CE BE AE CD BD AD BC AC AB Frequent?; CDE BDE ADE BCE ACE ABE BCD ACD ABD AB C Frequent? ACDE BCDE ABDE ABCE ABCD ABCDE Frequent?

54 Frequent Itemsets All Itemsets Frequent? Ε D C B A DE CE BE AE CD BD AD BC AC AB Frequent? Frequent? CDE BDE ADE BCE ACE ABE BCD ACD ABD AB C ACDE BCDE ABDE ABCE ABCD ABCDE Frequent? Frequent?

55 Frequent Itemsets All Itemsets Frequent? Ε D C B A DE CE BE AE CD BD AD BC AC AB Frequent? CDE BDE ADE BCE ACE ABE BCD ACD ABD AB C Frequent? ACDE BCDE ABDE ABCE ABCD ABCDE We can generate all itemsets this way We expect the FP-tree to contain a lot less

56 Using the FP-tree to find frequent itemsets Transaction Database: TID Items 1 {A,B} 2 {B,C,D} 3 {A,C,D,E} 4 {A,D,E} 5 {A,B,C} 6 {A,B,C,D} 7 {B,C} 8 {A,B,C} 9 {A,B,D} 10 {B,C,E} (Figure: the FP-tree built from this database, rooted at null, with a header table of node-link pointers for items A, B, C, D, E; visible node counts include A:7, B:5, B:3, C:3, C:3, C:1 and three E:1 leaves.) Bottom-up traversal of the tree: first itemsets ending in E, then D, etc., each time a suffix-based class

57 Finding Frequent Itemsets Subproblem: find frequent itemsets ending in E null A:7 B:3 B:5 C:1 C:3 Header table Item A B C D E Pointer C:3 E:1 E:1 E:1 We will then see how to compute the support for the possible itemsets

58 Finding Frequent Itemsets null Ending in D A:7 B:3 B:5 C:1 C:3 Header table Item A B C D E Pointer C:3 E:1 E:1 E:1

59 Finding Frequent Itemsets null Ending in C A:7 B:3 B:5 C:1 C:3 Header table Item A B C D E Pointer C:3 E:1 E:1 E:1

60 Finding Frequent Itemsets null Ending in B A:7 B:3 B:5 C:1 C:3 Header table Item A B C D E Pointer C:3 E:1 E:1 E:1

61 Finding Frequent Itemsets Ending in Α null A:7 B:3 B:5 C:1 C:3 Header table Item A B C D E Pointer C:3 E:1 E:1 E:1

62 Algorithm For each suffix X Phase 1 Construct the prefix tree for X as shown before, and compute the support using the header table and the pointers Phase 2 If X is frequent, construct the conditional FP-tree for X in the following steps 1. Recompute support 2. Prune infrequent items 3. Prune leaves and recurse

63 Example Phase 1 construct prefix tree null Find all prefix paths that contain E A:7 B:3 B:5 C:1 C:3 Header table Item A B C D E Pointer C:3 E:1 E:1 E:1 Suffix Paths for Ε: {A,C,D,E}, {A,D,Ε}, {B,C,E}

64 Example Phase 1 construct prefix tree Find all prefix paths that contain E null A:7 B:3 C:1 C:3 E:1 E:1 E:1 Prefix Paths for Ε: {A,C,D,E}, {A,D,Ε}, {B,C,E}

65 Example Compute Support for E (minsup = 2) How? Follow the pointers while summing up the counts: 1 + 1 + 1 = 3 > 2, so E is frequent null A:7 B:3 C:1 C:3 E:1 E:1 E:1 {E} is frequent so we can now consider suffixes DE, CE, BE, AE

66 Example E is frequent so we proceed with Phase 2 Phase 2 Convert the prefix tree of E into a conditional FP-tree Two changes (1) Recompute support (2) Prune infrequent null A:7 B:3 C:3 C:1 E:1 E:1 E:1

67 Example Recompute Support null A:7 B:3 The support counts for some of the nodes include transactions that do not end in E For example in null->b->c->e we count {B, C} C:1 C:3 The support of any node is equal to the sum of the support of leaves with label E in its subtree E:1 E:1 E:1

68 Example null A:7 B:3 C:1 C:3 E:1 E:1 E:1

69 Example null A:7 B:3 C:1 C:1 E:1 E:1 E:1

70 Example null A:7 B:1 C:1 C:1 E:1 E:1 E:1

71 Example null A:7 B:1 C:1 C:1 E:1 E:1 E:1

72 Example null A:7 B:1 C:1 C:1 E:1 E:1 E:1

73 Example null A:2 B:1 C:1 C:1 E:1 E:1 E:1

74 Example null A:2 B:1 C:1 C:1 E:1 E:1 E:1

75 Example null Truncate A:2 B:1 Delete the nodes of Ε C:1 C:1 E:1 E:1 E:1

76 Example null Truncate A:2 B:1 Delete the nodes of Ε C:1 C:1 E:1 E:1 E:1

77 Example null Truncate A:2 B:1 Delete the nodes of Ε C:1 C:1

78 Example null Prune infrequent A:2 B:1 In the conditional FPtree some nodes may have support less than minsup C:1 C:1 e.g., B needs to be pruned This means that B appears with E less than minsup times

79 Example null A:2 B:1 C:1 C:1

80 Example null A:2 C:1 C:1

81 Example null A:2 C:1 C:1 The conditional FP-tree for E Repeat the algorithm for {D, E}, {C, E}, {A, E}

82 Example null A:2 C:1 C:1 Phase 1 Find all prefix paths that contain D (DE) in the conditional FP-tree

83 Example null A:2 C:1 Phase 1 Find all prefix paths that contain D (DE) in the conditional FP-tree

84 Example null A:2 C:1 Compute the support of {D,E} by following the pointers in the tree 1+1 = 2 2 = minsup {D,E} is frequent

85 Example null A:2 C:1 Phase 2 Construct the conditional FP-tree 1. Recompute Support 2. Prune nodes

86 Example null A:2 Recompute support C:1

87 Example null A:2 Prune nodes C:1

88 Example null A:2 Prune nodes C:1

89 Example null A:2 Prune nodes C:1 Small support

90 Example null A:2 Final conditional FP-tree for {D,E} The support of A is ≥ minsup, so {A,D,E} is frequent Since the tree has a single node we return to the next subproblem.

91 Example null A:2 C:1 C:1 The conditional FP-tree for E We repeat the algorithm for {D,E}, {C,E}, {A,E}

92 Example null A:2 C:1 C:1 Phase 1 Find all prefix paths that contain C (CE) in the conditional FP-tree

93 Example null A:2 C:1 C:1 Phase 1 Find all prefix paths that contain C (CE) in the conditional FP-tree

94 Example null A:2 C:1 C:1 Compute the support of {C,E} by following the pointers in the tree 1+1 = 2 2 = minsup {C,E} is frequent

95 Example null A:2 C:1 C:1 Phase 2 Construct the conditional FP-tree 1. Recompute Support 2. Prune nodes

96 Example null A:1 C:1 Recompute support C:1

97 Example null A:1 C:1 Prune nodes C:1

98 Example null A:1 Prune nodes

99 Example null A:1 Prune nodes

100 Example null Prune nodes Return to the previous subproblem

101 Example null A:2 C:1 C:1 The conditional FP-tree for E We repeat the algorithm for {D,E}, {C,E}, {A,E}

102 Example null A:2 C:1 C:1 Phase 1 Find all prefix paths that contain A (AE) in the conditional FP-tree

103 Example null A:2 Phase 1 Find all prefix paths that contain A (AE) in the conditional FP-tree

104 Example null A:2 Compute the support of {A,E} by following the pointers in the tree 2 minsup {A,E} is frequent There is no conditional FP-tree for {A,E}

105 Example So for E we have the following frequent itemsets {E}, {D,E}, {C,E}, {A,E} We proceed with D

106 Example Ending in D A:7 null B:3 B:5 C:1 C:3 Header table Item A B C D E Pointer C:3 E:1 E:1 E:1

107 Example null Phase 1 construct prefix tree Find all prefix paths that contain D A:7 B:3 Support 5 > minsup, D is frequent Phase 2 Convert prefix tree into conditional FP-tree C:3 B:5 C:1 C:3

108 Example null A:7 B:3 B:5 C:1 C:3 C:1 Recompute support

109 Example null A:7 B:3 B:2 C:1 C:3 C:1 Recompute support

110 Example null A:3 B:3 B:2 C:1 C:3 C:1 Recompute support

111 Example null A:3 B:3 B:2 C:1 C:1 C:1 Recompute support

112 Example null A:3 B:1 B:2 C:1 C:1 C:1 Recompute support

113 Example null A:3 B:1 B:2 C:1 C:1 C:1 Prune nodes

114 Example null A:3 B:1 B:2 C:1 C:1 C:1 Prune nodes

115 Example null A:3 B:1 B:2 C:1 C:1 C:1 Construct conditional FP-trees for {C,D}, {B,D}, {A,D} And so on.

116 Observations At each recursive step we solve a subproblem Construct the prefix tree Compute the new support Prune nodes Subproblems are disjoint so we never consider the same itemset twice Support computation is efficient happens together with the computation of the frequent itemsets.

117 Observations The efficiency of the algorithm depends on the compaction factor of the dataset If the tree is bushy then the algorithm does not work well, because it greatly increases the number of subproblems that need to be solved.

118 FP-Tree size The FP-Tree usually has a smaller size than the uncompressed data typically many transactions share items (and hence prefixes). Best case scenario: all transactions contain the same set of items. 1 path in the FP-tree Worst case scenario: every transaction has a unique set of items (no items in common) Size of the FP-tree is at least as large as the original data. Storage requirements for the FP-tree are higher need to store the pointers between the nodes and the counters. The size of the FP-tree depends on how the items are ordered Ordering by decreasing support is typically used but it does not always lead to the smallest tree (it's a heuristic). 120

119 Discussion Advantages of FP-Growth only 2 passes over dataset compresses dataset no candidate generation much faster than Apriori Disadvantages of FP-Growth FP-Tree may not fit in memory!! FP-Tree is expensive to build Trade-off: takes time to build, but once it is built, frequent itemsets are read off easily. Time is wasted (especially if support threshold is high), as the only pruning that can be done is on single items. support can only be calculated once the entire dataset is added to the FP-Tree. 121

120 Should the FP-Growth be used in favor of Apriori? Does FP-Growth have a better complexity than Apriori? Does FP-Growth always have a better running time than Apriori? More relevant questions What kind of data is it? What kind of results do I want? How will I analyze the resulting rules? 122

121 FP-Growth vs Apriori (Figure: run time in seconds of FP-growth and Apriori on data set T25I20D10K as the support threshold (%) varies.) 123

122 The Eclat Algorithm [Zaki, Parthasarathy, Ogihara, and Li 1997] 124

123 ECLAT For each item, store a list of transaction ids (tids) Horizontal Data Layout: TID Items 1 A,B,E 2 B,C,D 3 C,E 4 A,C,D 5 A,B,C,D 6 A,E 7 A,B 8 A,B,C 9 A,C,D 10 B Vertical Data Layout (TID-lists): A: 1,4,5,6,7,8,9 B: 1,2,5,7,8,10 C: 2,3,4,5,8,9 D: 2,4,5,9 E: 1,3,6 128

124 ECLAT Determine the support of any k-itemset by intersecting the tid-lists of two of its (k-1)-subsets, e.g. tid-list(A) ∩ tid-list(B) = tid-list(AB). Traversal approaches: top-down, bottom-up and hybrid Advantage: very fast support counting Disadvantage: intermediate tid-lists may become too large for memory
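As an illustration of tid-list intersection (a sketch using the transactions from the previous slide; names are illustrative):

```python
# Sketch: Eclat-style support counting with a vertical layout; the support
# of {A,B} is the size of the intersection of the tid-lists of A and B.
def vertical_layout(transactions):
    tidlists = {}
    for tid, items in transactions.items():
        for item in items:
            tidlists.setdefault(item, set()).add(tid)
    return tidlists

transactions = {1: {"A", "B", "E"}, 2: {"B", "C", "D"}, 3: {"C", "E"},
                4: {"A", "C", "D"}, 5: {"A", "B", "C", "D"}, 6: {"A", "E"},
                7: {"A", "B"}, 8: {"A", "B", "C"}, 9: {"A", "C", "D"}, 10: {"B"}}
tidlists = vertical_layout(transactions)
print(tidlists["A"] & tidlists["B"])    # {1, 5, 7, 8} -> support of {A,B} is 4
```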

125 Partitioning The set of transactions may be divided into a number of disjoint subsets. Then each partition is searched for frequent itemsets. These frequent itemsets are called local frequent itemsets. How can information about local itemsets be used in finding frequent itemsets of the global set of transactions? Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB. 130

126 Partitioning Phase 1 Divide n transactions into m partitions Find the frequent itemsets in each partition Combine all local frequent itemsets to form candidate itemsets Phase 2 Find global frequent itemsets December 2008 GKGupta 131

127 Sampling A random sample (usually large enough to fit in the main memory) may be obtained from the overall set of transactions and the sample is searched for frequent itemsets. These frequent itemsets are called sample frequent itemsets. How can information about sample itemsets be used in finding frequent itemsets of the global set of transactions? 132

128 Sampling Not guaranteed to be accurate but we sacrifice accuracy for efficiency. A lower support threshold may be used for the sample to reduce the chance of missing any frequent itemsets. The actual frequencies of the sample frequent itemsets are then obtained. More than one sample could be used to improve accuracy. 133

129 Summary We looked at 2 alternative methods FP- Tree ECLAT Rule Generation Phase 134

130 Association Rule 3 135

131 Learning Outcomes Pattern Evaluation Advance Pattern Mining Techniques 136

132 Pattern Evaluation 137

133 The Knowledge Discovery Process Knowledge Discovery in Databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data (Fayyad et al. 1996) Focus on the quality of discovered patterns independent of the data mining algorithm This definition is often quoted, but not very seriously taken into account

134 Interestingness of Discovered Patterns (Figure: the criteria valid (accurate), comprehensible, novel/surprising and useful, positioned by the amount of research devoted to them versus the difficulty of measuring them.) 139

135 Selecting the Most Interesting Rules in a Post-Processing Step Goal: select a subset of interesting rules out of all discovered rules, in order to show to the user just the most interesting rules User-driven, subjective approach Uses the domain knowledge, beliefs or preferences of the user Data-driven, objective approach Based on statistical properties of discovered patterns More generic, independent of the application domain and user

136 Data-driven vs user-driven rule interestingness measures (Diagram: a data-driven measure filters all discovered rules into a set of selected rules shown to the user; a user-driven measure additionally takes the user's beliefs and domain knowledge as input. Note: only the user-driven measure directly measures real user interest.)

137 Data-driven Approach Objective Measures 142

138 Data-driven approaches Two broad groups of approaches: Approach based on the statistical strength of patterns entirely data-driven, no user-driven aspect Based on the assumption that a certain kind of pattern is surprising to any user, in general But still independent from the application domain and from believes / domain knowledge of the target user So, mainly data-driven

139 Pattern Evaluation Association rule algorithms tend to produce too many rules many of them are uninteresting or redundant Redundant if {A,B,C} {D} and {A,B} {D} have same support & confidence Interestingness measures can be used to prune/rank the derived patterns In the original formulation of association rules, support & confidence are the only measures used 144

140 Application of Interestingness Measure Interestingness Measures 145

141 Computing Interestingness Measure Given a rule X → Y, the information needed to compute rule interestingness can be obtained from a contingency table Contingency table for X → Y: rows X and ¬X, columns Y and ¬Y, with cell counts f11, f10 (row X), f01, f00 (row ¬X), row totals f1+ and f0+, column totals f+1 and f+0, and grand total |T| f11: support of X and Y f10: support of X and ¬Y f01: support of ¬X and Y f00: support of ¬X and ¬Y Used to define various measures support, confidence, lift, Gini, J-measure, etc. 146

142 Drawback of Confidence Contingency table for Tea and Coffee (100 transactions): 20 contain Tea, of which 15 also contain Coffee; 80 do not contain Tea, of which 75 contain Coffee Association Rule: Tea → Coffee Confidence = P(Coffee | Tea) = 0.75, but P(Coffee) = 0.9 Although confidence is high, the rule is misleading: P(Coffee | ¬Tea) = 0.9375

143 Statistical Independence Population of 1000 students 600 students know how to swim (S) 700 students know how to bike (B) 420 students know how to swim and bike (S,B) P(S,B) = 420/1000 = 0.42 P(S) × P(B) = 0.6 × 0.7 = 0.42 P(S,B) = P(S) × P(B) => Statistical independence P(S,B) > P(S) × P(B) => Positively correlated P(S,B) < P(S) × P(B) => Negatively correlated 148

144 Measures that take into account statistical dependence Statistical-based Measures: Lift = P(Y|X) / P(Y); Interest = P(X,Y) / (P(X) P(Y)); PS = P(X,Y) − P(X) P(Y); φ-coefficient = ( P(X,Y) − P(X) P(Y) ) / sqrt( P(X) [1 − P(X)] P(Y) [1 − P(Y)] ) 149
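These measures are easy to compute from the contingency counts f11, f10, f01, f00. The sketch below is illustrative; the Tea → Coffee counts used in the last line are the ones assumed in the confidence example above.

```python
# Sketch: lift, interest, PS and the phi-coefficient from the four
# contingency-table counts of a rule X -> Y.
from math import sqrt

def measures(f11, f10, f01, f00):
    n = f11 + f10 + f01 + f00
    p_x, p_y, p_xy = (f11 + f10) / n, (f11 + f01) / n, f11 / n
    lift = (p_xy / p_x) / p_y              # P(Y|X) / P(Y)
    interest = p_xy / (p_x * p_y)          # numerically equal to lift
    ps = p_xy - p_x * p_y                  # Piatetsky-Shapiro
    phi = ps / sqrt(p_x * (1 - p_x) * p_y * (1 - p_y))
    return lift, interest, ps, phi

# Tea -> Coffee example with assumed counts f11=15, f10=5, f01=75, f00=5
print(measures(15, 5, 75, 5))   # lift ~= 0.83 < 1: negative association
```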

145 Example: Lift/Interest Association Rule: Tea → Coffee Confidence = P(Coffee | Tea) = 0.75, but P(Coffee) = 0.9 Lift = 0.75 / 0.9 = 0.8333 (< 1, therefore Tea and Coffee are negatively associated) 150

146 Drawback of Lift & Interest Consider two contingency tables: in the first, X and Y each occur in 10% of the transactions and always together, so Lift = 0.1 / ((0.1)(0.1)) = 10; in the second, X and Y each occur in 90% of the transactions and co-occur in 90%, so Lift = 0.9 / ((0.9)(0.9)) = 1.11, even though the second association is intuitively much stronger. Statistical independence: If P(X,Y) = P(X)P(Y) => Lift = 1 151

147 There are lots of measures proposed in the literature Some measures are good for certain applications, but not for others What criteria should we use to determine whether a measure is good or bad? What about Apriori-style support-based pruning? How does it affect these measures? 152

148 Properties of A Good Measure Piatetsky-Shapiro: 3 properties a good measure M must satisfy: M(A,B) = 0 if A and B are statistically independent M(A,B) increase monotonically with P(A,B) when P(A) and P(B) remain unchanged M(A,B) decreases monotonically with P(A) [or P(B)] when P(A,B) and P(B) [or P(A)] remain unchanged 153

149 Comparing Different Measures 10 examples of contingency tables (E1-E10), each defined by its counts f11, f10, f01, f00, are ranked using the various measures. (Table of counts and rankings omitted in the transcription.)

150 Property under Variable Permutation Swapping A and B transposes the contingency table: cells (p, q; r, s) become (p, r; q, s). Does M(A,B) = M(B,A)? Symmetric measures: support, lift, collective strength, cosine, Jaccard, etc. Asymmetric measures: confidence, conviction, Laplace, J-measure, etc. 155

151 Property under Row/Column Scaling Grade-Gender Example (Mosteller, 1968): the same grade (high/low) by gender (male/female) table is shown twice, the second time with the gender columns scaled by different factors (e.g., 10x). Mosteller: the underlying association should be independent of the relative number of male and female students in the samples 156

152 Example: φ-coefficient The φ-coefficient is analogous to the correlation coefficient for continuous variables. For the two contingency tables shown on the slide the φ-coefficient works out to the same value, even though the tables describe quite different situations. (Table counts omitted in the transcription.) 157

153 Property under Null Addition Adding k extra transactions that contain neither A nor B changes only the (¬A, ¬B) cell: (p, q; r, s) becomes (p, q; r, s + k). Invariant measures: support, cosine, Jaccard, etc. Non-invariant measures: correlation, Gini, mutual information, odds ratio, etc. 158

154 Different Measures have Different Properties Symbol Measure Range P1 P2 P3 O1 O2 O3 O3' O4 Φ Correlation Yes Yes Yes Yes No Yes Yes No λ Lambda 0 1 Yes No No Yes No No* Yes No α Odds ratio 0 1 Yes* Yes Yes Yes Yes Yes* Yes No Q Yule's Q Yes Yes Yes Yes Yes Yes Yes No Y Yule's Y Yes Yes Yes Yes Yes Yes Yes No κ Cohen's Yes Yes Yes Yes No No Yes No M Mutual Information 0 1 Yes Yes Yes Yes No No* Yes No J J-Measure 0 1 Yes No No No No No No No G Gini Index 0 1 Yes No No No No No* Yes No s Support 0 1 No Yes No Yes No No No No c Confidence 0 1 No Yes No Yes No No No Yes L Laplace 0 1 No Yes No Yes No No No No V Conviction No Yes No Yes** No No Yes No I Interest 0 1 Yes* Yes Yes Yes No No No No IS IS (cosine) No Yes Yes Yes No No No Yes PS Piatetsky-Shapiro's Yes Yes Yes Yes No Yes Yes No F Certainty factor Yes Yes Yes No No No Yes No AV Added value Yes Yes Yes No No No No No S Collective strength 0 1 No Yes Yes Yes No Yes* Yes No ζ Jaccard No Yes Yes Yes No No No Yes K Klosgen's Yes Yes Yes No No No No No

155 Support-based Pruning Most of the association rule mining algorithms use support measure to prune rules and itemsets Study effect of support pruning on correlation of itemsets Generate random contingency tables Compute support and pairwise correlation for each table Apply support-based pruning and examine the tables that are removed 160

156 Effect of Support-based Pruning (Figure: histogram of the correlation values of all item pairs in the randomly generated contingency tables.)

157 Effect of Support-based Pruning (Figure: correlation histograms for the item pairs remaining under increasingly strict support thresholds, e.g. support < 0.01.) Support-based pruning eliminates mostly negatively correlated itemsets 162

158 Effect of Support-based Pruning Investigate how support-based pruning affects other measures Steps: Generate contingency tables Rank each table according to the different measures Compute the pair-wise correlation between the measures 163

159 Effect of Support-based Pruning Without support pruning (all pairs): 40.14% of measure pairs have correlation > 0.85 (Figure: matrix of pairwise correlations between the measures Conviction, Odds ratio, Collective Strength, Correlation, Interest, PS, CF, Yule Y, Reliability, Kappa, Klosgen, Yule Q, Confidence, Laplace, IS, Support, Jaccard, Lambda, Gini, J-measure, Mutual Info, where red cells indicate correlation > 0.85, plus a scatter plot of Correlation vs the Jaccard measure.) 164

160 Effect of Support-based Pruning With 0.5% <= support <= 50%: 61.45% of measure pairs have correlation > 0.85 (Figure: pairwise-correlation matrix of the same measures and a scatter plot of Correlation vs the Jaccard measure.) 165

161 Effect of Support-based Pruning With 0.5% <= support <= 30%: 76.42% of measure pairs have correlation > 0.85 (Figure: pairwise-correlation matrix of the same measures and a scatter plot of Correlation vs the Jaccard measure.) 166

162 Rule interestingness measures based on the statistical strength of the patterns More than 50 measures have been called rule interestingness measures in the literature The vast majority of these measures are based only on statistical strength or mathematical properties (Hilderman & Hamilton 2001), (Tan et al. 2002) Are they correlated with real human interest??

163 Evaluating the correlation between data-driven measures and real human interest 3-step Methodology: Rank discovered rules according to each of a number of datadriven rule interestingness measures Show (a subset of) discovered rules to the user, who evaluates how interesting each rule is Measure the linear correlation between the ranking of each datadriven measure and real human interest Rules are evaluated by an expert user, who also provided the dataset for the data miner

164 The Study (Carvalho et al. 2005) 8 datasets, one user for each dataset 11 data-driven rule interestingness measures Each data-driven measure was evaluated over 72 <user, rule> pairs (8 users × 9 rules/dataset) 88 correlation values (8 users × 11 measures) 31 correlation values (35%) were ≥ 0.6 The correlations associated with each measure varied considerably across the 8 datasets/users

165 Data-driven approaches based on finding specific kinds of surprising patterns Principle: a few special kinds of patterns tend to be surprising to most users, regardless of the meaning of the attributes in the pattern and the application domain Does not require domain knowledge or beliefs specified by the user We discuss two approaches: Discovery of exception rules contradicting general rules Detection of Simpson's paradox

166 Selecting Surprising Rules Which Are Exceptions of a General Rule Data-driven approach: consider the following 2 rules R 1 : IF Cond 1 THEN Class 1 (general rule) R 2 : IF Cond 1 AND Cond 2 THEN Class 2 (exception) Cond 1, Cond 2 are conjunctions of conditions Rule R 2 is a specialization of rule R 1 Rules R 1 and R 2 predict different classes An exception rule is interesting if both the rule and its general rule have a high predictive accuracy (Suzuki & Kodratoff 1998), (Suzuki 2004)

167 Selecting Surprising Exception Rules an Example Mining data about car accidents (Suzuki 2004) IF (used_seat_belt = yes) THEN (injury = no) (general rule) IF (used_seat_belt = yes) AND (passenger = child) THEN (injury = yes) (exception rule) If both rules are accurate, then the exception rule would be considered interesting (it is both accurate and surprising)

168 Simpson Paradox What's the confidence of the following rules: (rule 1) {HDTV=Yes} → {Exercise machine=Yes} (rule 2) {HDTV=No} → {Exercise machine=Yes}? Confidence of rule 1 = 99/180 = 55% Confidence of rule 2 = 54/120 = 45% So, customers who buy high-definition televisions are more likely to buy exercise machines than those who don't buy high-definition televisions. Right? Well, maybe not

169 Stratification: Simpson paradox Consider the same data broken down by customer group: What's the confidence of the rules for each stratum: (rule 1) {HDTV=Yes} → {Exercise machine=Yes} (rule 2) {HDTV=No} → {Exercise machine=Yes}? College students: Confidence of rule 1 = 1/10 = 10% Confidence of rule 2 = 4/34 = 11.8% Working Adults: Confidence of rule 1 = 98/170 = 57.7% Confidence of rule 2 = 50/86 = 58.1% The rules suggest that, for each group, customers who don't buy HDTV are more likely to buy exercise machines, which contradicts the previous conclusion drawn when data from the two customer groups are pooled together.
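The pooled versus stratified confidences above can be reproduced with a few lines (an illustrative sketch using the counts quoted on this and the previous slide):

```python
# Sketch: Simpson's paradox, pooled vs stratified. Each entry is
# (buys exercise machine, total) for the given HDTV condition.
strata = {
    "college students": {"HDTV=Yes": (1, 10),   "HDTV=No": (4, 34)},
    "working adults":   {"HDTV=Yes": (98, 170), "HDTV=No": (50, 86)},
}

for group, rules in strata.items():
    for cond, (num, den) in rules.items():
        print(group, cond, "->", round(num / den, 3))   # HDTV=No wins in each group

# pooled confidences reverse the direction of the effect
pooled_yes = (1 + 98) / (10 + 170)    # 99/180 = 0.55
pooled_no = (4 + 50) / (34 + 86)      # 54/120 = 0.45
print(pooled_yes, pooled_no)
```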

170 Importance of Stratification The lesson here is that proper stratification is needed to avoid generating spurious patterns resulting from Simpson's paradox. For example Market basket data from a major supermarket chain should be stratified according to store locations, while Medical records from various patients should be stratified according to confounding factors such as age and gender.

171 Effect of Support Distribution Many real data sets have skewed support distribution where most of the items have relatively low to moderate frequencies, but a small number of them have very high frequencies.

172 Skewed distribution Tricky to choose the right support threshold for mining such data sets. If we set the threshold too high (e.g., 20%), then we may miss many interesting patterns involving the low support items from G1. Such low support items may correspond to expensive products (such as jewelry) that are seldom bought by customers, but whose patterns are still interesting to retailers. Conversely, when the threshold is set too low, there is the risk of generating spurious patterns that relate a high frequency item such as milk to a low frequency item such as caviar.

173 Cross-support patterns Cross-support patterns are those that relate a high frequency item such as milk to a low frequency item such as caviar. Likely to be spurious because their correlations tend to be weak. E.g. the confidence of {caviar} → {milk} is likely to be high, but the pattern is still spurious, since there is probably no real correlation between caviar and milk. However, we don't want to use Lift during the computation of frequent itemsets because it doesn't have the anti-monotone property; Lift is rather used as a post-processing step. So, we want to detect cross-support patterns by looking at some anti-monotone property.

174 Cross-support patterns Definition A cross-support pattern is an itemset X = {i1, i2, ..., ik} whose support ratio r(X) = min{ s(i1), s(i2), ..., s(ik) } / max{ s(i1), s(i2), ..., s(ik) } is less than a user-specified threshold hc. Example Suppose the support for milk is 70%, while the support for sugar is 10% and caviar is 0.04% Given hc = 0.01, the frequent itemset {milk, sugar, caviar} is a cross-support pattern because its support ratio is r = min{0.7, 0.1, 0.0004} / max{0.7, 0.1, 0.0004} = 0.0004 / 0.7 = 0.00058 < 0.01
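A minimal sketch of the support-ratio test (illustrative; the item supports are the milk/sugar/caviar values from the example above):

```python
# Sketch: support ratio r(X) and the cross-support test against a threshold h_c.
def support_ratio(item_supports):
    return min(item_supports) / max(item_supports)

def is_cross_support(item_supports, h_c):
    return support_ratio(item_supports) < h_c

print(support_ratio([0.7, 0.1, 0.0004]))            # ~0.00057
print(is_cross_support([0.7, 0.1, 0.0004], 0.01))   # True -> cross-support pattern
```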

175 Detecting cross-support patterns E.g. assuming that hc = 0.3, the itemsets {p,q}, {p,r}, and {p,q,r} are cross-support patterns, because their support ratios, equal to 0.2, are less than the threshold hc. We can apply a high support threshold, say 20%, to eliminate the cross-support patterns, but this may come at the expense of discarding other interesting patterns such as the strongly correlated itemset {q,r}, which has support equal to 16.7%.

176 Detecting cross-support patterns Confidence pruning also doesn't help. Confidence for {q} → {p} is 80% even though {p, q} is a cross-support pattern. Meanwhile, rule {q} → {r} also has high confidence even though {q, r} is not a cross-support pattern. This demonstrates the difficulty of using the confidence measure to distinguish between rules extracted from cross-support and non-cross-support patterns. 181

177 Lowest confidence rule Notice that the rule {p} {q} has very low confidence because most of the transactions that contain p do not contain q. This observation suggests that: Cross-support patterns can be detected by examining the lowest confidence rule that can be extracted from a given itemset.

178 Finding lowest confidence Recall the anti-monotone property of confidence: conf( {i1, i2} → {i3, i4, ..., ik} ) ≤ conf( {i1, i2, i3} → {i4, ..., ik} ) This property suggests that confidence never increases as we shift more items from the left to the right-hand side of an association rule. Hence, the lowest confidence rule that can be extracted from a frequent itemset contains only one item on its left-hand side.

179 Finding lowest confidence Given a frequent itemset {i1, i2, ..., ik}, the rule {ij} → {i1, i2, ..., ij-1, ij+1, ..., ik} has the lowest confidence if s(ij) = max{ s(i1), s(i2), ..., s(ik) } This follows directly from the definition of confidence as the ratio between the rule's support and the support of the rule antecedent.

180 Finding lowest confidence Summarizing, the lowest confidence attainable from a frequent itemset {i1, i2, ..., ik} is s({i1, i2, ..., ik}) / max{ s(i1), s(i2), ..., s(ik) } This is also known as the h-confidence measure or all-confidence measure.

181 h-confidence Clearly, cross-support patterns can be eliminated by ensuring that the h-confidence values for the patterns exceed some threshold hc. Observe that the measure is also anti-monotone, i.e., h-confidence({i1, i2, ..., ik}) ≥ h-confidence({i1, i2, ..., ik+1}), and thus it can be incorporated directly into the mining algorithm.

182 User Driven Approach Subjective Approach 187

183 Subjective Interestingness Measure Objective measure: Rank patterns based on statistics computed from data Subjective measure: Rank patterns according to user s interpretation A pattern is subjectively interesting if it contradicts the expectation of a user (Silberschatz & Tuzhilin) A pattern is subjectively interesting if it is actionable (Silberschatz & Tuzhilin) 188

184 Interestingness via Unexpectedness Need to model the expectation of users (domain knowledge): a pattern may be expected to be frequent (+) or expected to be infrequent (-), and found in the data to be frequent or infrequent; when expectation and evidence agree the pattern is expected, when they disagree it is unexpected Need to combine the expectation of users with evidence from the data (i.e., extracted patterns) 189

185 Selecting Interesting Association Rules based on User-Defined Templates User-driven approach: the user specifies templates for interesting and non-interesting rules Inclusive templates specify interesting rules Restrictive templates specify non-interesting rules A rule is considered interesting if it matches at least one inclusive template and it matches no restrictive template (Klemettinen et al. 1994)

186 Selecting Interesting Association Rules Hierarchies of items and their categories food drink clothes rice fish hot dog coke wine water trousers socks Inclusive template: IF (food) THEN (drink) Restrictive template: IF (hot dog) THEN (coke) Rule IF (fish) THEN (wine) is interesting

187 Selecting Surprising Rules based on General Impressions User-driven approach: the user specifies her/his general impressions about the application domain (Liu et al. 1997, 2000); (Romao et al. 2004) Example: IF (salary = high) THEN (credit = good) (different from reasonably precise knowledge, e.g. IF salary > 35,500 ) Discovered rules are matched with the user s general impressions in a post-processing step, in order to determine the most surprising (unexpected) rules

188 Selecting Surprising Rules based on General Impressions Case I rule with unexpected consequent A rule and a general impression have similar antecedents but different consequents Case II rule with unexpected antecedent A rule and a general impression have similar consequents but the rule antecedent is different from the antecedent of all general impressions In both cases the rule is considered unexpected

189 Selecting Surprising Rules based on General Impressions GENERAL IMPRESSION: IF (salary = high) AND (education_level = high) THEN (credit = good) UNEXPECTED RULES: IF (salary > 30,000) AND (education BSc) AND (mortage = yes) THEN (credit = bad) /* Case I - unexpected consequent */ IF (mortgage = no) AND (C/A_balance > 10,000) THEN (credit = good) /* Case II unexpected antecedent */ /* assuming no other general impression has a similar antecedent */

190 Summary Association Rule Induction is a Two Step Process Find the frequent item sets (minimum support). Form the relevant association rules (minimum confidence). Generating the Association Rules Form all possible association rules from the frequent item sets. Filter interesting" association rules 195

191 Readings Introduction to Data Mining Pang-Ning Tan, Michael Steinbach, Vipin Kumar - Chapter 6 R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT 96. R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94. H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering association rules. KDD'94. J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD 00. M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM'94. S. Brin, R. Motwani, and C. Silverstein. Beyond market basket: Generalizing association rules to correlations. SIGMOD'97. E. Omiecinski. Alternative Interest Measures for Mining Associations. TKDE 03. Charu C. Aggarwal and Jiawei Han Frequent Pattern Mining. Springer Publishing Company, Incorporated. 196

COMP 5331: Knowledge Discovery and Data Mining

COMP 5331: Knowledge Discovery and Data Mining COMP 5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified by Dr. Lei Chen based on the slides provided by Jiawei Han, Micheline Kamber, and Jian Pei And slides provide by Raymond

More information


More information

Associa'on Rule Mining

Associa'on Rule Mining Associa'on Rule Mining Debapriyo Majumdar Data Mining Fall 2014 Indian Statistical Institute Kolkata August 4 and 7, 2014 1 Market Basket Analysis Scenario: customers shopping at a supermarket Transaction

More information

Frequent Itemset Mining

Frequent Itemset Mining ì 1 Frequent Itemset Mining Nadjib LAZAAR LIRMM- UM COCONUT Team (PART I) IMAGINA 17/18 Webpage: http://www.lirmm.fr/~lazaar/teaching.html Email: lazaar@lirmm.fr 2 Data Mining ì Data Mining (DM) or Knowledge

More information

CSE4334/5334 Data Mining Association Rule Mining. Chengkai Li University of Texas at Arlington Fall 2017

CSE4334/5334 Data Mining Association Rule Mining. Chengkai Li University of Texas at Arlington Fall 2017 CSE4334/5334 Data Mining Assciatin Rule Mining Chengkai Li University f Texas at Arlingtn Fall 27 Assciatin Rule Mining Given a set f transactins, find rules that will predict the ccurrence f an item based

More information

Meelis Kull Autumn Meelis Kull - Autumn MTAT Data Mining - Lecture 05

Meelis Kull Autumn Meelis Kull - Autumn MTAT Data Mining - Lecture 05 Meelis Kull meelis.kull@ut.ee Autumn 2017 1 Sample vs population Example task with red and black cards Statistical terminology Permutation test and hypergeometric test Histogram on a sample vs population

More information

Frequent Itemsets and Association Rule Mining. Vinay Setty Slides credit:

Frequent Itemsets and Association Rule Mining. Vinay Setty Slides credit: Frequent Itemsets and Association Rule Mining Vinay Setty vinay.j.setty@uis.no Slides credit: http://www.mmds.org/ Association Rule Discovery Supermarket shelf management Market-basket model: Goal: Identify

More information

Machine Learning: Pattern Mining

Machine Learning: Pattern Mining Machine Learning: Pattern Mining Information Systems and Machine Learning Lab (ISMLL) University of Hildesheim Wintersemester 2007 / 2008 Pattern Mining Overview Itemsets Task Naive Algorithm Apriori Algorithm

More information

Association Rule Mining on Web

Association Rule Mining on Web Association Rule Mining on Web What Is Association Rule Mining? Association rule mining: Finding interesting relationships among items (or objects, events) in a given data set. Example: Basket data analysis

More information

COMP 5331: Knowledge Discovery and Data Mining

COMP 5331: Knowledge Discovery and Data Mining COMP 5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified by Dr. Lei Chen based on the slides provided by Tan, Steinbach, Kumar And Jiawei Han, Micheline Kamber, and Jian Pei 1 10

More information

Temporal Data Mining

Temporal Data Mining Temporal Data Mining Christian Moewes cmoewes@ovgu.de Otto-von-Guericke University of Magdeburg Faculty of Computer Science Department of Knowledge Processing and Language Engineering Zittau Fuzzy Colloquium

More information

Mining Infrequent Patter ns

Mining Infrequent Patter ns Mining Infrequent Patter ns JOHAN BJARNLE (JOHBJ551) PETER ZHU (PETZH912) LINKÖPING UNIVERSITY, 2009 TNM033 DATA MINING Contents 1 Introduction... 2 2 Techniques... 3 2.1 Negative Patterns... 3 2.2 Negative

More information

Association Analysis. Part 1

Association Analysis. Part 1 Association Analysis Part 1 1 Market-basket analysis DATA: A large set of items: e.g., products sold in a supermarket A large set of baskets: e.g., each basket represents what a customer bought in one

More information

Introduction to Data Mining

Introduction to Data Mining Introduction to Data Mining Lecture #12: Frequent Itemsets Seoul National University 1 In This Lecture Motivation of association rule mining Important concepts of association rules Naïve approaches for

More information

Frequent Itemset Mining

Frequent Itemset Mining 1 Frequent Itemset Mining Nadjib LAZAAR LIRMM- UM IMAGINA 15/16 2 Frequent Itemset Mining: Motivations Frequent Itemset Mining is a method for market basket analysis. It aims at finding regulariges in

More information

Course Content. Association Rules Outline. Chapter 6 Objectives. Chapter 6: Mining Association Rules. Dr. Osmar R. Zaïane. University of Alberta 4

Course Content. Association Rules Outline. Chapter 6 Objectives. Chapter 6: Mining Association Rules. Dr. Osmar R. Zaïane. University of Alberta 4 Principles of Knowledge Discovery in Data Fall 2004 Chapter 6: Mining Association Rules Dr. Osmar R. Zaïane University of Alberta Course Content Introduction to Data Mining Data warehousing and OLAP Data

More information

Frequent Itemset Mining

Frequent Itemset Mining ì 1 Frequent Itemset Mining Nadjib LAZAAR LIRMM- UM COCONUT Team IMAGINA 16/17 Webpage: h;p://www.lirmm.fr/~lazaar/teaching.html Email: lazaar@lirmm.fr 2 Data Mining ì Data Mining (DM) or Knowledge Discovery

More information

Knowledge Discovery and Data Mining I

Knowledge Discovery and Data Mining I Ludwig-Maximilians-Universität München Lehrstuhl für Datenbanksysteme und Data Mining Prof. Dr. Thomas Seidl Knowledge Discovery and Data Mining I Winter Semester 2018/19 Agenda 1. Introduction 2. Basics

More information

Data Analytics Beyond OLAP. Prof. Yanlei Diao

Data Analytics Beyond OLAP. Prof. Yanlei Diao Data Analytics Beyond OLAP Prof. Yanlei Diao OPERATIONAL DBs DB 1 DB 2 DB 3 EXTRACT TRANSFORM LOAD (ETL) METADATA STORE DATA WAREHOUSE SUPPORTS OLAP DATA MINING INTERACTIVE DATA EXPLORATION Overview of

More information

Unit II Association Rules

Unit II Association Rules Unit II Association Rules Basic Concepts Frequent Pattern Analysis Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set Frequent Itemset

More information

Basic Data Structures and Algorithms for Data Profiling Felix Naumann

Basic Data Structures and Algorithms for Data Profiling Felix Naumann Basic Data Structures and Algorithms for 8.5.2017 Overview 1. The lattice 2. Apriori lattice traversal 3. Position List Indices 4. Bloom filters Slides with Thorsten Papenbrock 2 Definitions Lattice Partially

More information

FP-growth and PrefixSpan

FP-growth and PrefixSpan FP-growth and PrefixSpan n Challenges of Frequent Pattern Mining n Improving Apriori n Fp-growth n Fp-tree n Mining frequent patterns with FP-tree n PrefixSpan Challenges of Frequent Pattern Mining n Challenges

More information

Outline. Fast Algorithms for Mining Association Rules. Applications of Data Mining. Data Mining. Association Rule. Discussion

Outline. Fast Algorithms for Mining Association Rules. Applications of Data Mining. Data Mining. Association Rule. Discussion Outline Fast Algorithms for Mining Association Rules Rakesh Agrawal Ramakrishnan Srikant Introduction Algorithm Apriori Algorithm AprioriTid Comparison of Algorithms Conclusion Presenter: Dan Li Discussion:

More information

A New Concise and Lossless Representation of Frequent Itemsets Using Generators and A Positive Border

A New Concise and Lossless Representation of Frequent Itemsets Using Generators and A Positive Border A New Concise and Lossless Representation of Frequent Itemsets Using Generators and A Positive Border Guimei Liu a,b Jinyan Li a Limsoon Wong b a Institute for Infocomm Research, Singapore b School of

More information

Frequent Itemset Mining

Frequent Itemset Mining Frequent Itemset Mining prof. dr Arno Siebes Algorithmic Data Analysis Group Department of Information and Computing Sciences Universiteit Utrecht Battling Size The previous time we saw that Big Data has

More information

Selecting the Right Interestingness Measure for Association Patterns

Selecting the Right Interestingness Measure for Association Patterns Selecting the Right ingness Measure for Association Patterns Pang-Ning Tan Department of Computer Science and Engineering University of Minnesota 2 Union Street SE Minneapolis, MN 55455 ptan@csumnedu Vipin

More information

Reductionist View: A Priori Algorithm and Vector-Space Text Retrieval. Sargur Srihari University at Buffalo The State University of New York

Reductionist View: A Priori Algorithm and Vector-Space Text Retrieval. Sargur Srihari University at Buffalo The State University of New York Reductionist View: A Priori Algorithm and Vector-Space Text Retrieval Sargur Srihari University at Buffalo The State University of New York 1 A Priori Algorithm for Association Rule Learning Association

More information

Apriori algorithm. Seminar of Popular Algorithms in Data Mining and Machine Learning, TKK. Presentation Lauri Lahti

Apriori algorithm. Seminar of Popular Algorithms in Data Mining and Machine Learning, TKK. Presentation Lauri Lahti Apriori algorithm Seminar of Popular Algorithms in Data Mining and Machine Learning, TKK Presentation 12.3.2008 Lauri Lahti Association rules Techniques for data mining and knowledge discovery in databases

More information

The Market-Basket Model. Association Rules. Example. Support. Applications --- (1) Applications --- (2)

The Market-Basket Model. Association Rules. Example. Support. Applications --- (1) Applications --- (2) The Market-Basket Model Association Rules Market Baskets Frequent sets A-priori Algorithm A large set of items, e.g., things sold in a supermarket. A large set of baskets, each of which is a small set

More information

Positive Borders or Negative Borders: How to Make Lossless Generator Based Representations Concise

Positive Borders or Negative Borders: How to Make Lossless Generator Based Representations Concise Positive Borders or Negative Borders: How to Make Lossless Generator Based Representations Concise Guimei Liu 1,2 Jinyan Li 1 Limsoon Wong 2 Wynne Hsu 2 1 Institute for Infocomm Research, Singapore 2 School

More information

Mining Molecular Fragments: Finding Relevant Substructures of Molecules

Mining Molecular Fragments: Finding Relevant Substructures of Molecules Mining Molecular Fragments: Finding Relevant Substructures of Molecules Christian Borgelt, Michael R. Berthold Proc. IEEE International Conference on Data Mining, 2002. ICDM 2002. Lecturers: Carlo Cagli

More information

Mining Positive and Negative Fuzzy Association Rules

Mining Positive and Negative Fuzzy Association Rules Mining Positive and Negative Fuzzy Association Rules Peng Yan 1, Guoqing Chen 1, Chris Cornelis 2, Martine De Cock 2, and Etienne Kerre 2 1 School of Economics and Management, Tsinghua University, Beijing

More information

Exercise 1. min_sup = 0.3. Items support Type a 0.5 C b 0.7 C c 0.5 C d 0.9 C e 0.6 F

Exercise 1. min_sup = 0.3. Items support Type a 0.5 C b 0.7 C c 0.5 C d 0.9 C e 0.6 F Exercise 1 min_sup = 0.3 Items support Type a 0.5 C b 0.7 C c 0.5 C d 0.9 C e 0.6 F Items support Type ab 0.3 M ac 0.2 I ad 0.4 F ae 0.4 F bc 0.3 M bd 0.6 C be 0.4 F cd 0.4 C ce 0.2 I de 0.6 C Items support

More information

Mining Class-Dependent Rules Using the Concept of Generalization/Specialization Hierarchies

Mining Class-Dependent Rules Using the Concept of Generalization/Specialization Hierarchies Mining Class-Dependent Rules Using the Concept of Generalization/Specialization Hierarchies Juliano Brito da Justa Neves 1 Marina Teresa Pires Vieira {juliano,marina}@dc.ufscar.br Computer Science Department

More information

OPPA European Social Fund Prague & EU: We invest in your future.

OPPA European Social Fund Prague & EU: We invest in your future. OPPA European Social Fund Prague & EU: We invest in your future. Frequent itemsets, association rules Jiří Kléma Department of Cybernetics, Czech Technical University in Prague http://ida.felk.cvut.cz

More information

Modified Entropy Measure for Detection of Association Rules Under Simpson's Paradox Context.

Modified Entropy Measure for Detection of Association Rules Under Simpson's Paradox Context. Modified Entropy Measure for Detection of Association Rules Under Simpson's Paradox Context. Murphy Choy Cally Claire Ong Michelle Cheong Abstract The rapid explosion in retail data calls for more effective

More information

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany Syllabus Fri. 21.10. (1) 0. Introduction A. Supervised Learning: Linear Models & Fundamentals Fri. 27.10. (2) A.1 Linear Regression Fri. 3.11. (3) A.2 Linear Classification Fri. 10.11. (4) A.3 Regularization

More information

Mining Rank Data. Sascha Henzgen and Eyke Hüllermeier. Department of Computer Science University of Paderborn, Germany

Mining Rank Data. Sascha Henzgen and Eyke Hüllermeier. Department of Computer Science University of Paderborn, Germany Mining Rank Data Sascha Henzgen and Eyke Hüllermeier Department of Computer Science University of Paderborn, Germany {sascha.henzgen,eyke}@upb.de Abstract. This paper addresses the problem of mining rank

More information

Frequent Pattern Mining: Exercises

Frequent Pattern Mining: Exercises Frequent Pattern Mining: Exercises Christian Borgelt School of Computer Science tto-von-guericke-university of Magdeburg Universitätsplatz 2, 39106 Magdeburg, Germany christian@borgelt.net http://www.borgelt.net/

More information

Density-Based Clustering

Density-Based Clustering Density-Based Clustering idea: Clusters are dense regions in feature space F. density: objects volume ε here: volume: ε-neighborhood for object o w.r.t. distance measure dist(x,y) dense region: ε-neighborhood

More information

Handling a Concept Hierarchy

Handling a Concept Hierarchy Food Electronics Handling a Concept Hierarchy Bread Milk Computers Home Wheat White Skim 2% Desktop Laptop Accessory TV DVD Foremost Kemps Printer Scanner Data Mining: Association Rules 5 Why should we

More information

sor exam What Is Association Rule Mining?

sor exam What Is Association Rule Mining? Mining Association Rules Mining Association Rules What is Association rule mining Apriori Algorithm Measures of rule interestingness Advanced Techniques 1 2 What Is Association Rule Mining? i Association

More information

Sequential Pattern Mining

Sequential Pattern Mining Sequential Pattern Mining Lecture Notes for Chapter 7 Introduction to Data Mining Tan, Steinbach, Kumar From itemsets to sequences Frequent itemsets and association rules focus on transactions and the

More information

Association Analysis. Part 2

Association Analysis. Part 2 Association Analysis Part 2 1 Limitations of the Support/Confidence framework 1 Redundancy: many of the returned patterns may refer to the same piece of information 2 Difficult control of output size:

More information

Constraint-Based Rule Mining in Large, Dense Databases

Constraint-Based Rule Mining in Large, Dense Databases Appears in Proc of the 5th Int l Conf on Data Engineering, 88-97, 999 Constraint-Based Rule Mining in Large, Dense Databases Roberto J Bayardo Jr IBM Almaden Research Center bayardo@alummitedu Rakesh Agrawal

More information

Mining Strong Positive and Negative Sequential Patterns

Mining Strong Positive and Negative Sequential Patterns Mining Strong Positive and Negative Sequential Patter NANCY P. LIN, HUNG-JEN CHEN, WEI-HUA HAO, HAO-EN CHUEH, CHUNG-I CHANG Department of Computer Science and Information Engineering Tamang University,

More information

Interestingness Measures for Data Mining: A Survey

Interestingness Measures for Data Mining: A Survey Interestingness Measures for Data Mining: A Survey LIQIANG GENG AND HOWARD J. HAMILTON University of Regina Interestingness measures play an important role in data mining, regardless of the kind of patterns

More information

Distributed Mining of Frequent Closed Itemsets: Some Preliminary Results

Distributed Mining of Frequent Closed Itemsets: Some Preliminary Results Distributed Mining of Frequent Closed Itemsets: Some Preliminary Results Claudio Lucchese Ca Foscari University of Venice clucches@dsi.unive.it Raffaele Perego ISTI-CNR of Pisa perego@isti.cnr.it Salvatore

More information

On Minimal Infrequent Itemset Mining

On Minimal Infrequent Itemset Mining On Minimal Infrequent Itemset Mining David J. Haglin and Anna M. Manning Abstract A new algorithm for minimal infrequent itemset mining is presented. Potential applications of finding infrequent itemsets

More information

10/19/2017 MIST.6060 Business Intelligence and Data Mining 1. Association Rules

10/19/2017 MIST.6060 Business Intelligence and Data Mining 1. Association Rules 10/19/2017 MIST6060 Business Intelligence and Data Mining 1 Examples of Association Rules Association Rules Sixty percent of customers who buy sheets and pillowcases order a comforter next, followed by

More information

Approximate counting: count-min data structure. Problem definition

Approximate counting: count-min data structure. Problem definition Approximate counting: count-min data structure G. Cormode and S. Muthukrishhan: An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms 55 (2005) 58-75. Problem

More information

On Condensed Representations of Constrained Frequent Patterns

On Condensed Representations of Constrained Frequent Patterns Under consideration for publication in Knowledge and Information Systems On Condensed Representations of Constrained Frequent Patterns Francesco Bonchi 1 and Claudio Lucchese 2 1 KDD Laboratory, ISTI Area

More information

CPT+: A Compact Model for Accurate Sequence Prediction

CPT+: A Compact Model for Accurate Sequence Prediction CPT+: A Compact Model for Accurate Sequence Prediction Ted Gueniche 1, Philippe Fournier-Viger 1, Rajeev Raman 2, Vincent S. Tseng 3 1 University of Moncton, Canada 2 University of Leicester, UK 3 National

More information

Discovering Non-Redundant Association Rules using MinMax Approximation Rules

Discovering Non-Redundant Association Rules using MinMax Approximation Rules Discovering Non-Redundant Association Rules using MinMax Approximation Rules R. Vijaya Prakash Department Of Informatics Kakatiya University, Warangal, India vijprak@hotmail.com Dr.A. Govardhan Department.

More information

Interesting Patterns. Jilles Vreeken. 15 May 2015

Interesting Patterns. Jilles Vreeken. 15 May 2015 Interesting Patterns Jilles Vreeken 15 May 2015 Questions of the Day What is interestingness? what is a pattern? and how can we mine interesting patterns? What is a pattern? Data Pattern y = x - 1 What

More information

ST3232: Design and Analysis of Experiments

ST3232: Design and Analysis of Experiments Department of Statistics & Applied Probability 2:00-4:00 pm, Monday, April 8, 2013 Lecture 21: Fractional 2 p factorial designs The general principles A full 2 p factorial experiment might not be efficient

More information

Data Mining Concepts & Techniques

Data Mining Concepts & Techniques Data Mining Concepts & Techniques Lecture No. 05 Sequential Pattern Mining Naeem Ahmed Email: naeemmahoto@gmail.com Department of Software Engineering Mehran Univeristy of Engineering and Technology Jamshoro

More information

Data Mining and Analysis: Fundamental Concepts and Algorithms

Data Mining and Analysis: Fundamental Concepts and Algorithms Data Mining and Analysis: Fundamental Concepts and Algorithms dataminingbook.info Mohammed J. Zaki 1 Wagner Meira Jr. 2 1 Department of Computer Science Rensselaer Polytechnic Institute, Troy, NY, USA

More information

DATA MINING - 1DL105, 1DL111

DATA MINING - 1DL105, 1DL111 1 DATA MINING - 1DL105, 1DL111 Fall 2007 An introductory class in data mining http://user.it.uu.se/~udbl/dut-ht2007/ alt. http://www.it.uu.se/edu/course/homepage/infoutv/ht07 Kjell Orsborn Uppsala Database

More information

CS738 Class Notes. Steve Revilak

CS738 Class Notes. Steve Revilak CS738 Class Notes Steve Revilak January 2008 May 2008 Copyright c 2008 Steve Revilak. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation

More information

Detecting Anomalous and Exceptional Behaviour on Credit Data by means of Association Rules. M. Delgado, M.D. Ruiz, M.J. Martin-Bautista, D.

Detecting Anomalous and Exceptional Behaviour on Credit Data by means of Association Rules. M. Delgado, M.D. Ruiz, M.J. Martin-Bautista, D. Detecting Anomalous and Exceptional Behaviour on Credit Data by means of Association Rules M. Delgado, M.D. Ruiz, M.J. Martin-Bautista, D. Sánchez 18th September 2013 Detecting Anom and Exc Behaviour on

More information

Theory of Dependence Values

Theory of Dependence Values Theory of Dependence Values ROSA MEO Università degli Studi di Torino A new model to evaluate dependencies in data mining problems is presented and discussed. The well-known concept of the association

More information